Neural Nets Glossary

Altaf Khan

© Altaf Khan 2005-2006

This is a glossary of terms that can help you understand the literature on neural nets. Some of the mathematical symbols may not display properly on this page. For clarification on them, take a look at the glossary section in Feedforward Neural Networks With Constrained Weights PDF

Activation function
is the transform applied to the weighted sum of inputs plus offset for computing the output of a neuron. Also known as the squashing function.
Affine group invariance.
The property of a group due to which it stays unchanged after the application of an affine transform.
Affine transform
is a transform from the set of rotations, shifts, scalings, or any combinations thereof.
Algebra.
A set of functions $A$ is an algebra if $f, g \in A$ and $\alpha \in \mathbb{R}$ imply $f + g \in A$, $f \cdot g \in A$, and $\alpha f \in A$.
Alopex
is a stochastic learning procedure which uses local correlations between changes in individual weights and changes in the cost function to update weights. This procedure predates the current resurgence in neural network learning by more than a decade and was originally proposed for mapping visual receptive fields.
Approximation property, Universal,
is the ability of a set of functions to approximate a specific class of functions to any desired accuracy.
Approximation property, Best,
is the property of an approximation scheme on a set of functions to select a function that is at a minimum `distance' from the function to be approximated.
ART-EMAP
is ARTMAP with added spatial and temporal evidence accumulation processes.
ARTMAP
is a supervised learning procedure explicitly based on neurobiology.
ARTMAP-IC
is ARTMAP with an instance counting procedure and a match tracking algorithm.
Attribute
is an element of the input vector. Also known as a feature.
Autoassociator.
A system for which the desired output response is the same as the input.
Backpropagation, Error
is a procedure in which the difference between the actual and desired responses of the neurons in the output layer is minimised using the steepest-descent heuristic.
Balanced data set
is a set in which all classes are equally represented.
Bayesian classifier
assigns a class to an object in such a way that the expectation value of misclassification is minimised. Also known as the minimum risk classifier, and belief network.
Bayesian statistics
differs from the conventional `frequentist' approach to statistics in that it allows the probability of an event to be expressed as `degree of belief' in a particular outcome instead of basing it solely on a set of observations.
Bayes's theorem
allows prior estimates of the probability of an event to be revised in accordance with new observations. It states that the probability of an event $A$ given another event $B$, $P(A|B)$, is equal to $P(B|A)P(A)/P(B)$.
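As a small worked illustration of the theorem, the sketch below computes a posterior from a prior and two conditional probabilities. The function name and the numbers are hypothetical; P(B) is expanded with the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes's theorem."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: 1% prior, 90% detection rate, 5% false-alarm rate.
print(round(posterior(0.01, 0.90, 0.05), 4))  # 0.1538
```

Even with a 90% detection rate, the low prior keeps the posterior at about 15%, which is the kind of revision the theorem formalises.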
Black-hole mechanism
is a rounding mechanism for `nearly discrete' weights.
Black-hole radius.
If the value of a weight gets within this radius of a discrete value, it becomes that discrete value.
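The rounding rule can be sketched as follows. This is only an illustration under stated assumptions: the function name, the choice of discrete values, and the strict "less than" comparison are not taken from the glossary.

```python
def black_hole_round(weight, discrete_values, radius):
    """Snap weight to the nearest discrete value if it lies within radius of it."""
    nearest = min(discrete_values, key=lambda v: abs(weight - v))
    if abs(weight - nearest) < radius:
        return nearest  # the weight has fallen into the "black hole"
    return weight       # otherwise the weight stays continuous


# Hypothetical ternary weight set {-1, 0, 1} with a radius of 0.05.
print(black_hole_round(0.97, [-1, 0, 1], 0.05))  # 1
print(black_hole_round(0.60, [-1, 0, 1], 0.05))  # 0.6
```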
Bootstrap.
A random sample is selected by sampling with replacement from the data set and is used to train the network. The trained network is then tested on the remaining data. This procedure is repeated a large number of times. The average of all such test errors is an estimate of the generalisation performance metric.
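The resampling procedure described above might be sketched like this. `train_and_test` is a hypothetical caller-supplied function that trains a network on the given training indices and returns its error on the test indices; the sample size (equal to the data set size) and the default number of rounds are assumptions.

```python
import random


def bootstrap_split(n_examples, rng):
    """Return (train_indices, test_indices) for one resampling round."""
    # Draw n_examples indices with replacement; repeats are allowed.
    train = [rng.randrange(n_examples) for _ in range(n_examples)]
    chosen = set(train)
    # The examples that were never drawn form the test set.
    test = [i for i in range(n_examples) if i not in chosen]
    return train, test


def bootstrap_estimate(n_examples, train_and_test, rounds=200, seed=0):
    """Average the test error over many resampling rounds."""
    rng = random.Random(seed)
    errors = []
    for _ in range(rounds):
        train, test = bootstrap_split(n_examples, rng)
        if test:  # skip the rare round in which every example was drawn
            errors.append(train_and_test(train, test))
    return sum(errors) / len(errors)
```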
Borel measurable functions.
Just about all functions that one may encounter are Borel measurable. Functions that are not Borel measurable do exist but are known to mathematicians only as mathematical peculiarities.
learning method starts with a network without any hidden neurons and systematically increases their number during training until the required performance is achieved.
Cauchy sequence
is a sequence $\{a_n\}$ of real numbers such that for every $\epsilon > 0$ there exists a positive integer $n_0$ such that $|a_m - a_n| < \epsilon$ whenever $m > n \geq n_0$.
Classification
is a task in which the desired responses are restricted to a finite set of values.
Closure.
$\mathrm{Cl}(A) = A \cup \{\text{limit points of } A\}$.
Compact set.
A closed and bounded subset of $\mathbb{R}^d$. Also known as a compact.
Compact in $\mathbb{R}^n$.
$A$ is a compact in $\mathbb{R}^n$ if it is a subset of $\mathbb{R}^n$, is a closed set, and is a bounded set.
Convergence, Pointwise.
If $\{a_n\}$ is a sequence of non-random real variables then $a_n$ converges to $a$, i.e., $a_n \to a$ as $n \to \infty$, if there exists a real number $a$ such that for any $\epsilon > 0$ there exists an integer $N_\epsilon$ sufficiently large that $|a_n - a| < \epsilon$ for all $n \geq N_\epsilon$. Also known as deterministic convergence.
Convergence in distribution.
If $\{\hat{a}_n\}$ is a sequence of random variables having distribution functions $\{F_n\}$, where $F_n(a) \equiv P[\hat{a}_n \leq a]$, then $\hat{a}_n$ converges to $F$ in distribution, i.e. $\hat{a}_n \xrightarrow{d} F$, iff $|F_n(a) - F(a)| \to 0$ for every continuity point $a$ of $F$.
Convergence in probability.
If $\{\hat{a}_n\}$ is a sequence of random variables then $\hat{a}_n$ converges to $a$ in probability, i.e. $\hat{a}_n \xrightarrow{P} a$, if there exists a real number $a$ such that for any $\epsilon > 0$, $P[|\hat{a}_n - a| < \epsilon] \to 1$ as $n \to \infty$. Also known as weak convergence.
Convergence in the mean.
If $\{\hat{a}_n\}$ is a sequence of random variables then $\hat{a}_n$ converges to $a$ in the mean if $\lim_{n \to \infty} E\{|\hat{a}_n - a|\} = 0$, where $E\{a\}$ represents the expected value of $a$.
Convergence in the mean squared sense.
If $\{\hat{a}_n\}$ is a sequence of random variables then $\hat{a}_n$ converges to $a$ in the mean squared sense if $\lim_{n \to \infty} E\{|\hat{a}_n - a|^2\} = 0$, where $E\{a\}$ represents the expected value of $a$.
Convergence with probability 1.
If $\{\hat{a}_n\}$ is a sequence of random variables then $\hat{a}_n$ converges with probability 1 to $a$, i.e. $\hat{a}_n \to a$ as $n \to \infty$ or $\hat{a}_n \xrightarrow{P=1} a$, if $P[\lim_{n \to \infty} \hat{a}_n = a] = 1$ for some real number $a$. Also known as almost sure convergence, convergence almost everywhere, and strong convergence.
Convergence, Uniform
The property that all of a family of functions or series on a given set converge at the same rate throughout the set; that is, for every $\epsilon > 0$ there is a single $N$ such that for all points $x$ in the set, $|f_m(x) - f_n(x)| < \epsilon$ for all $m, n > N$; and similarly for uniform convergence as $x$ tends to a value $a$.
Cost function
is the quantity that is to be minimised in an optimisation experiment. In the case of feedforward networks this quantity is usually the RMS error in the output of the network. Also known as error measure.
Cover
of a set $A$ is a collection of sets $\{T_i\}$ whose union contains $A$.
Cover, Open.
$\{T_i\}$ is an open cover of $A$ if each $T_i$ is open.
Cover, Sub-
of a given cover is a subcollection whose union also contains A.
Cross-validation, n-fold.
The data set is divided equally into n randomly selected, mutually exclusive subsets called folds. n networks are trained, each on a different combination of n-1 folds, and each trained network is tested on the one remaining fold. The average of the n test errors is an estimate of the generalisation performance metric.
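A minimal sketch of the fold bookkeeping follows, in the usual form of the procedure in which one network is trained per held-out fold. `train_and_test` is a hypothetical caller-supplied function that trains on the training indices and returns the error on the test indices; all other names are illustrative too.

```python
import random


def n_fold_indices(n_examples, n_folds, seed=0):
    """Randomly partition example indices into n_folds mutually exclusive folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(n_examples, n_folds, train_and_test, seed=0):
    """Return the average test error over all n_folds train/test splits."""
    folds = n_fold_indices(n_examples, n_folds, seed)
    errors = []
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        errors.append(train_and_test(train, test))
    return sum(errors) / n_folds
```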
Decision sensitivity
is the likelihood that an event will be detected if it occurs. It is the ratio of true positives to the sum of true positives and false negatives. This metric is especially important when it is critical that an event be detected. Also known as True Positive Ratio.
Decision specificity
is the likelihood that the absence of an event is detected given that it is absent. It is the ratio of true negatives to the sum of true negatives and false positives. Also known as True Negative Ratio.
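Both decision metrics can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions (true positive ratio and true negative ratio); the function names and the counts are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """True positive ratio: fraction of actual events that were detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative ratio: fraction of actual non-events that were rejected."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: 8 of 10 events detected, 90 of 100 non-events rejected.
print(sensitivity(8, 2))    # 0.8
print(specificity(90, 10))  # 0.9
```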
Decision surface
is the plot of the response of an output neuron with respect to the inputs.
Dense.
A set $A$ is dense in a set $S$ if $A \subseteq S$ and $\mathrm{Cl}(A) = S$.
Denseness, Uniform.
A set of functions $A$ is uniformly dense in $C(\mathbb{R}^d)$ on the compact set $K \subset \mathbb{R}^d$ if for all $f \in C(\mathbb{R}^d)$ and every $\epsilon > 0$ there exists $\hat{f} \in A$ such that $\sup\{|f(x) - \hat{f}(x)| : x \in K\} < \epsilon$.
Denseness, Uniform on compacta.
A set of functions $A$ is uniformly dense on compacta in $C(\mathbb{R}^d)$ if it is uniformly dense on every compact subset of $\mathbb{R}^d$.
Disjunctive normal form (DNF).
The form of a logical expression consisting of a single disjunction (+) of a set of conjunctions (·). All logical expressions are expressible in this form.
Effective sample size
(for classification learning tasks) is the number of examples representing the smallest classification group.
EM algorithm.
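The Expectation Maximisation algorithm is an iterative procedure for finding maximum-likelihood estimates of parameters when some of the data are missing or unobserved. It alternates between computing the expected likelihood given the current parameter estimates and choosing the parameters that maximise that expectation.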
Epoch
is the cycle in which all examples in the training set are presented to the network.
Ergodic process.
A random process is ergodic if its ensemble and temporal averages are the same.
Error surface
is the plot of the cost-function with respect to all of the weights in a network.
Feedforward network
consists of a layer of inputs, zero or more layers of hidden neurons, and an output layer of neurons. Generally all neurons in adjacent layers are fully connected to each other with feedforward synapses only. There are no intra-layer synapses. Also known as the multilayer perceptron.
Fit, Over-
An over-fit is due to the trained network having a higher complexity than the concept embedded in the training data. Also known as memorisation and over-specialisation.
Fit, Under-
An under-fit is caused by the trained network having a complexity lower than that of the concept embedded in the training data.
Forward pass.
The process by which a network computes the output vector in response to an input vector. Also known as recall.
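The forward pass of a small fully connected feedforward network might look like the sketch below. The tanh activation, the nested-list weight layout, and the function names are assumptions made for illustration.

```python
import math


def layer_forward(inputs, weights, offsets):
    """Compute one layer's outputs; weights[j][i] connects input i to neuron j."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + offset)
        for row, offset in zip(weights, offsets)
    ]


def forward_pass(input_vector, layers):
    """Propagate input_vector through a list of (weights, offsets) layers."""
    activations = input_vector
    for weights, offsets in layers:
        activations = layer_forward(activations, weights, offsets)
    return activations


# Hypothetical 2-2-1 network.
network = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
print(forward_pass([1.0, 1.0], network))
```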
Function, Analytic.
$f \in C(\mathbb{R})$ is analytic at $a \in \mathbb{R}$ with a radius of convergence $r > 0$ if there is an infinite sequence of real numbers $\{c_n\}$, $n \geq 0$, such that for $|x - a| < r$, $\sum_{n=0}^{\infty} c_n (x - a)^n$ converges and $f(x) = \sum_{n=0}^{\infty} c_n (x - a)^n$.
Function, Superanalytic.
$f \in C(\mathbb{R})$ is superanalytic at $a \in \mathbb{R}$ with a radius of convergence $r > 0$ if there is an infinite sequence of real numbers $\{c_n\}$, $n \geq 0$, with $c_n \neq 0$ for every $n \geq 1$, such that for $|x - a| < r$, $\sum_{n=0}^{\infty} c_n (x - a)^n$ converges and $f(x) = \sum_{n=0}^{\infty} c_n (x - a)^n$.
Function approximation
is a task in which the desired output values are continuous. Also known as regression.
Functional
is a scalar-valued continuous linear function defined on a normed linear space.
Functional, Linear
on a linear space $E$ over $\mathbb{R}$ is a linear transformation of $E$ into $\mathbb{R}$.
Functional, linear, Bounded.
A bounded linear transformation of a normed linear space $E$ over $\mathbb{R}$ into the normed linear space $\mathbb{R}$ is called a bounded linear functional on $E$.
Generalisation performance
is the accuracy of decision of a trained network on a set of data which is similar to but not the same as the training data set.
Hahn-Banach theorem.
Let $M$ be a linear subspace of a normed linear space $N$, and let $f$ be a functional defined on $M$. Then $f$ can be extended to a functional $f_0$ defined on the whole space $N$ such that $\|f_0\| = \|f\|$.
If $M_c$ is a closed linear subspace of $N$ and $x_0$ is a vector not in $M_c$, then there exists a functional $f_0$ in the conjugate space $N^*$ such that $f_0(M_c) = 0$ and $f_0(x_0) \neq 0$.
Hebbian learning.
The main idea behind Hebbian learning is that the synapse between two neurons should be strengthened if they fire simultaneously.
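One common way to write this idea down is the product rule sketched below, in which the weight change is proportional to the product of the two neurons' activities. The glossary does not specify a formula, so this particular rule, the learning rate, and the names are assumptions.

```python
def hebbian_update(weight, pre_activity, post_activity, learning_rate=0.1):
    """Strengthen the synapse in proportion to the joint activity of both neurons."""
    return weight + learning_rate * pre_activity * post_activity


# Both neurons fire: the synapse is strengthened.
print(hebbian_update(0.5, 1.0, 1.0))
# Only one neuron fires: the synapse is unchanged.
print(hebbian_update(0.5, 1.0, 0.0))  # 0.5
```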
Hidden layer
is the layer of neurons which is not directly connected to the network inputs or outputs.
Homogeneity property.
A set of functions $A$ fulfils the homogeneity property if $f \in A$ and $\alpha \in \mathbb{R}$ imply $\alpha f \in A$.
Inequality, triangle.
$|a + b| \leq |a| + |b|$
k-nearest neighbours
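is a classification method that assigns to an input the majority class among the k training examples closest to it. It should not be confused with k-means, the clustering algorithm that minimises the sum of squares of distances between the training data and k points.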
$L_p$ error measure
is a popular form of the cost function for feedforward networks:
$E^o(W) = \sum_j |t_j - o_j|^p, \quad p = 1, 2, \ldots,$
where $t_j$ is the target or desired value of the $j$th output, and $o_j$ is its value computed by the network.
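For a single example, the error measure above can be computed as in the sketch below. It assumes a plain sum over the output neurons with no normalising factor; the function name is illustrative.

```python
def lp_error(targets, outputs, p=2):
    """Sum of |t_j - o_j|**p over all output neurons j."""
    return sum(abs(t - o) ** p for t, o in zip(targets, outputs))


# Hypothetical two-output network.
print(lp_error([1.0, 0.0], [0.5, 0.5], p=1))  # 1.0
print(lp_error([1.0, 0.0], [0.5, 0.5], p=2))  # 0.5
```

With p = 2 this is the familiar sum-of-squares error; p = 1 gives the sum of absolute errors.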
Learning
is the process in which a feedforward network is forced to adjust its weights such that the network's response to a given input vector becomes closer to the desired response.
Learning, Batch.
The type of learning during which weights are updated at the end of every epoch. Also known as off-line learning.
Learning, In-situ
differs from on-line learning in that the former is the property of a network requiring the deployed network to have adaptive weights, whereas the latter is a property of the learning procedure, requiring the weights to be updated on the presentation of every example.
Learning, On-line.
The type of learning during which weights are updated after the presentation of every training example. Also known as pattern and incremental learning.
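The difference between batch and on-line updates can be illustrated with a single linear neuron trained by gradient descent on half the squared error. The one-weight model and all names below are assumptions chosen to keep the sketch short.

```python
def online_epoch(weight, examples, learning_rate):
    """On-line learning: update the weight after every training example."""
    for x, target in examples:
        gradient = (weight * x - target) * x
        weight -= learning_rate * gradient
    return weight


def batch_epoch(weight, examples, learning_rate):
    """Batch learning: accumulate the gradient and update once per epoch."""
    total_gradient = sum((weight * x - target) * x for x, target in examples)
    return weight - learning_rate * total_gradient


examples = [(1.0, 2.0), (2.0, 4.0)]  # hypothetical data with target = 2 * x
print(batch_epoch(0.0, examples, 0.1))
print(online_epoch(0.0, examples, 0.1))
```

Starting from the same weight, the two procedures generally end the epoch at different values because the on-line version evaluates each gradient at an already updated weight.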
Learning, Supervised
The learning process in which a system's internal parameters are modified in order to minimise the error in its output with respect to a desired value.
Learning, Unsupervised
The learning process in which a system's internal parameters are modified so that similar input patterns result in similar outputs.
Learning rate
determines the size of the weight modification at each training step.
Likelihood
is the probability density of observations calculated from parameters and not observations.
Limit Point.
A point $p$ is a limit point of $A$ if every neighbourhood of $p$ contains a point $q \neq p$ such that $q \in A$.
Linear separability.
The property of a classification task by which the members of one class can be separated from the ones from all other classes by a single hyperplane.
Loading problem, The.
The problem of finding the optimal weight values for a given network such that the network performs the required mapping.
Logistic discriminant analysis
chooses classification hyperplanes with respect to maximising a conditional likelihood cost-function and not optimising a quadratic cost-function which is the case for linear discriminant analysis.
Margin, en.
Error in the output of a neuron is not backpropagated if it is within this small margin.
Minima, Global.
The points of minimum error on an error surface.
Minima, Local.
The points on an error surface at which the error is lower than at all neighbouring points but which are not global minima. The gradient is zero at these points.
Mixture representation
of data uses a linear combination of Gaussian distributions to represent arbitrary distributions.
Momentum
is a training parameter used in a very common variation on the standard error backpropagation learning procedure. It controls the effect of the last weight modification on the current weight update.
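The parameter described above, commonly called momentum, enters the weight update as in the sketch below. The update form (gradient step plus a fraction of the previous step) is the usual textbook one; the names and default values are assumptions.

```python
def momentum_update(weight, gradient, previous_delta, learning_rate=0.1, momentum=0.9):
    """Return (new_weight, delta), where delta also carries part of the last step."""
    delta = -learning_rate * gradient + momentum * previous_delta
    return weight + delta, delta


# Two successive steps; the second has zero gradient but still moves the weight.
w, d = momentum_update(1.0, 2.0, 0.0, learning_rate=0.5, momentum=0.5)
print(w, d)  # 0.0 -1.0
w, d = momentum_update(w, 0.0, d, learning_rate=0.5, momentum=0.5)
print(w, d)  # -0.5 -0.5
```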
n-layer network
is a feedforward network with n - 1 hidden layers.
NP-complete problems.
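(NP stands for nondeterministic polynomial time.) No polynomial-time algorithm is known for this class of problems; the time required by all known algorithms to find the optimal solution grows exponentially with the size of the problem in the worst case. Often loosely called intractable problems.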
Neural network, Artificial
is a set of interconnected artificial neurons.
Neuron, Artificial
is the fundamental processing element in an artificial neural network. It performs a weighted sum of its inputs, adds the offset value to that sum, and then outputs a certain transform of that sum. Also known as node and processing element (PE).
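The computation performed by a single neuron can be sketched as below. The logistic function is used as the transform purely as an example; it and the function names are assumptions.

```python
import math


def logistic(x):
    """A common sigmoidal activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, offset, activation=logistic):
    """Weighted sum of the inputs, plus the offset, passed through the activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum + offset)


print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5
```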
Ockham's Razor
is the conjecture that if, for a given problem, two solutions with similar performances are available then the one with the lower computational complexity should be preferred.
Offset
is the value added to the weighted sum before the transform is applied to compute a neuron's output. Also known as threshold and bias.
networks have a complexity higher than what is required to learn the concept embedded in training data. They act as look-up tables for the training data and are poor generalisers.
Perceptron
is a feedforward network with no hidden neurons.
Probability, Prior
is the probability assigned to an event in advance of any empirical evidence. Also known as `a priori' probability.
Probability, Posterior
is the probability assigned to an event based on observations. Also known as `a posteriori' probability.
Projection pursuit regression
is a generalisation of the feedforward network in that it allows more than one type of activation function in the hidden layer. These non-homogeneous activation functions are data-dependent and constructed during learning.
Regularisation.
A class of methods designed to avoid overfitting to the training data by enforcing smoothness of the fit.
Ridge regression.
The precision of least-squares estimates gets worse with an increase in dependence between the input variables. Ridge regression estimators are more precise in those situations and are obtained as the estimators whose distance to an ellipsoid (the `ridge') centred at a least-squares estimate from the origin of a parameter space is a minimum.
Ridging, Constrained.
Optimisation procedure in which some norm of the weights is constrained to a specific value.
Ridging, Penalised.
Optimisation procedure in which the cost function is augmented by a penalty term.
Ridging, Smoothed.
Optimisation procedure in which noise is introduced in the inputs.
Riesz representation theorem.
Let $x^*$ be a bounded linear functional on the Banach space $C_{\mathbb{R}}([a,b])$. Then there is a real-valued function $\alpha$ of bounded variation on $[a,b]$ such that $x^*(f) = \int_a^b f \, d\alpha$ for all $f \in C_{\mathbb{R}}([a,b])$. Further, if $x^*$ is a positive linear functional, then $\alpha$ is increasing on $[a,b]$.
RMS error, Eorms,
is computed by summing the output layer errors for all examples in a training or test set, dividing the sum by the total number of examples and the number of the output layer neurons, and taking the square root of the resultant. The output layer error is computed by summing the squares of the individual neuron errors with respect to the desired output. An individual output-layer neuron's error is set to zero if it is less than the margin.
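The steps in this definition can be followed literally, as in the sketch below. The nested-list data layout, the function name, and the default margin of zero are assumptions.

```python
import math


def rms_error(targets, outputs, margin=0.0):
    """RMS error over a set of examples, ignoring neuron errors below the margin."""
    total = 0.0
    for target_vector, output_vector in zip(targets, outputs):
        for t, o in zip(target_vector, output_vector):
            error = abs(t - o)
            if error < margin:
                error = 0.0      # errors inside the margin are not counted
            total += error ** 2  # accumulate the squared neuron errors
    n_examples = len(targets)
    n_outputs = len(targets[0])
    return math.sqrt(total / (n_examples * n_outputs))


# Hypothetical set of two examples with two output neurons each.
print(rms_error([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 1.0]]))  # 0.5
```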
Sampling with replacement
may result in successive samples being not mutually exclusive, some of the examples may never appear in any of the samples, and there may be repetitions within an individual sample.
Separates points.
A family of functions $A$ separates points on a set $S$ if for every $x, y \in S$, $x \neq y$, there exists $f \in A$ such that $f(x) \neq f(y)$.
Set, Closed.
A subset M of metric space N is a closed set if it contains each of its limit points.
Set, Finite.
$A$ is finite if all of its elements can be displayed as $\{a_1, a_2, \ldots, a_n\}$ for some integer $n$.
Set, Open
is the subset G of the metric space X if each point of G is the centre of some open sphere contained in G.
Shattering.
If a set of functions $F$ includes all possible dichotomies on a set $S$ of points, then $S$ is said to be shattered by $F$.
Shrinkage.
The difference between the training set accuracy of a network and its accuracy on a test set.
Sigmoidal functions.
Definitions vary but are generally taken to be bounded, monotone, and continuous, e.g. logistic and tanh(·) functions.
Simulated annealing
is a stochastic optimisation technique inspired by the physical process of annealing.
Skip-layer synapses.
Synapses connecting neurons in two non-adjacent layers. Also known as short-cut synapses. Known as main effects in the statistical literature.
Smoothing spline modelling
is piecewise approximation by polynomials of degree n with the requirement that the derivatives of the polynomials are continuous up to degree n-1 at the junctions.
Softmax.
The purpose of the softmax activation function is to make the sum of the output neuron responses equal to one, so that the outputs are interpretable as posterior probabilities. Also known as the multiple-logistic function.
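A minimal sketch of the softmax function is given below, using the standard exponential form. Subtracting the maximum before exponentiating is a common numerical-stability measure and does not change the result; the function name is illustrative.

```python
import math


def softmax(activations):
    """Map a list of real activations to positive values that sum to one."""
    shift = max(activations)  # guards against overflow in exp()
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([0.0, 0.0]))  # [0.5, 0.5]
print(softmax([1.0, 2.0, 3.0]))
```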
Space, Banach
is a complete normed linear space.
Space, Compact
is a topological space in which every open cover has a finite subcover.
Space, Complete metric
is a metric space in which every Cauchy sequence is convergent.
Space, Conjugate.
$N^*$ is the set of all continuous linear transforms of the normed linear space $N$ into $\mathbb{R}$.
Space, Euclidean
is the metric space $(\mathbb{R}^n, d)$ such that $d(x,y) = \left(\sum_{r=1}^{n} (x_r - y_r)^2\right)^{1/2}$.
Space, Hausdorff
is a topological space, T, in which any two given distinct points x,  y are such that there exist disjoint open subsets U,  V containing x,  y respectively.
Space, Lp
consists of all measurable functions $f$ defined on a measure space $M$ with measure $m$ which are such that $|f(x)|^p$ is integrable, with the norm taken as $\|f\|_p = \left(\int |f(x)|^p \, dm(x)\right)^{1/p}$.
Space, Normed linear
over $\mathbb{R}$ is a pair $\{E, \|\cdot\|\}$, where $E$ is a linear space over $\mathbb{R}$ and $\|\cdot\|$ is a norm on $E$. A normed linear space is a metric space with the metric being $\|x - y\|$.
Space, Topological
is a pair $(X, T)$, where $X$ is a non-empty set and $T$ is a collection of subsets of $X$ (the open sets) that contains $X$ and the empty set and is closed under arbitrary unions and finite intersections.
Span of a set of functions.
For any function $f \in C(\mathbb{R})$ and $r > 0$, $f|_{(-r,r)}$ denotes the restriction of $f$ to the interval $(-r,r)$, and for any class of functions $F$, $F|_{(-r,r)}$ denotes $\{f|_{(-r,r)} : f \in F\}$. For any $F$ defined on a set $O$, the span of $F$, $\mathrm{sp}(F)$, denotes the closure of the set of finite linear combinations of elements of $F$ in the topology of uniform convergence on compact subsets of $O$.
Sphere, Open.
$S_r(x_0)$ with centre $x_0$ and radius $r$ is the subset of the metric space $X$ with metric $d$ defined by $S_r(x_0) = \{x : d(x, x_0) < r\}$.
Stationary, strongly.
If a random variable X is strongly stationary then the distribution of X(t) is independent of the time t.
Subset, Proper.
$A$ is a proper subset of $B$ if $A \subseteq B$ and $B \neq A$.
Subspace, Linear
is a non-empty subset $M$ of a linear space such that $(x + y) \in M$ whenever $x \in M$ and $y \in M$, and $ax \in M$ whenever $x \in M$, where $a$ is a scalar.
Supremum
is the least upper bound for a set.
Synapse
is a measure of the effect that a neuron's output has on the output of another neuron at the other end of the synapse. Also known as connection, edge, and weight.
Testing
is the process of verifying the function of a trained network against a set of examples which is different from the training examples set.
A random sample containing one half of the total number of examples is selected. This subset is used to train the network while the remaining examples are used to test the network once it has been trained. The performance of the trained network on the test set is an estimate of the generalisation performance metric.
Training.
See Learning.
Training example
is a pair: an input vector, and the desired response to that input vector.
Vanishes at no point.
A family of functions $A$ vanishes at no point of the set $S$ if for each $x \in S$ there exists $f \in A$ such that $f(x) \neq 0$.
Weight
is the value of a synapse or an offset.
Weight decay
is a common regularisation technique used in feedforward network training in which the cost-function is augmented with a term which penalises large weight values.
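As an illustration, the augmented cost function might be written as below. The quadratic penalty (sum of squared weights scaled by a decay coefficient) is one common choice and is an assumption here, as are the names.

```python
def cost_with_weight_decay(data_error, weights, decay=0.01):
    """Add a penalty proportional to the sum of squared weights to the error."""
    penalty = decay * sum(w * w for w in weights)
    return data_error + penalty


# Hypothetical values: large weights raise the cost even if the data error is equal.
print(cost_with_weight_decay(1.0, [0.1, -0.2], decay=0.1))
print(cost_with_weight_decay(1.0, [1.0, -2.0], decay=0.1))
```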
Weight depth
is the number of binary bits in a weight.
Weight elimination
is a regularisation technique used in feedforward network training in which the cost-function is augmented with a term which penalises the number of non-zero weights.
Weight perturbation
is a hardware-friendly alternative to BP learning. In this method, all of the weights are perturbed in turn and the associated change in the output of the network is used to approximate local gradients.
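The idea can be sketched with a forward-difference estimate: each weight is nudged in turn and the resulting change in a scalar response is divided by the size of the nudge. `response` is a hypothetical caller-supplied function (for example, the network's cost for a fixed training set); the step size and names are assumptions.

```python
def perturbation_gradient(response, weights, step=1e-4):
    """Approximate d(response)/d(weight_i) by perturbing one weight at a time."""
    base = response(weights)
    gradient = []
    for i in range(len(weights)):
        perturbed = list(weights)
        perturbed[i] += step  # perturb only weight i
        gradient.append((response(perturbed) - base) / step)
    return gradient


# Hypothetical response with known gradient (2 * w0, 3).
print(perturbation_gradient(lambda w: w[0] ** 2 + 3.0 * w[1], [1.0, 2.0]))
```

Only forward evaluations of the network are needed, which is why no backward pass through the network has to be implemented.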
Weight sharing
is a regularisation technique used in feedforward network training in which the cost-function is augmented with a term which penalises the number of independent weights.