 
Neural Nets Glossary    Summary
This is a glossary of terms that can help you understand the literature on neural nets. Some of the mathematical symbols may not display properly on this page. For clarification on them, take a look at the glossary section in Feedforward Neural Networks With Constrained Weights (PDF).
 Activation function
 is the transform applied to the weighted sum of inputs plus offset for computing the output of a neuron. Also known as the squashing function.
 Affine group invariance.
 The property of a group due to which it stays unchanged after the application of an affine transform.
 Affine transform
 is a transform from the set of rotations, shifts, scalings, or any combinations thereof.
 Algebra.
 A set of functions A is an algebra if f, g ∈ A, J ∈ \mathbb R ⇒ f + g ∈ A, f · g ∈ A, and J f ∈ A.
 Alopex
 is a stochastic learning procedure which uses local correlations between changes in individual weights and changes in the cost function to update weights. This procedure predates the current resurgence in neural network learning by more than a decade and was originally proposed for mapping visual receptive fields.
 Approximation property, Universal,
 is the ability of a set of functions to approximate a specific class of functions to any desired accuracy.
 Approximation property, Best,
 is the property of an approximation scheme on a set of functions to select a function that is at a minimum `distance' from the function to be approximated.
 ARTEMAP
 is ARTMAP with added spatial and temporal evidence accumulation processes.
 ARTMAP
 is a supervised learning procedure explicitly based on neurobiology.
 ARTMAPIC
 is ARTMAP with an instance counting procedure and a match tracking algorithm.
 Attribute
 is an element of the input vector. Also known as a feature.
 Autoassociator
 A system for which the desired output response is the same as the input.
   Backpropagation, Error
 is a procedure in which the difference between the actual and desired responses of the neurons in the output layer is minimised using the steepest-descent heuristic.
 Balanced data set
 is a set in which all classes are equally represented.
 Bayesian classifier
 assigns a class to an object in such a way that the expectation value of misclassification is minimised. Also known as the minimum risk classifier, and belief network.
 Bayesian statistics
 differs from the conventional `frequentist' approach to statistics in that it allows the probability of an event to be expressed as `degree of belief' in a particular outcome instead of basing it solely on a set of observations.
 Bayes's theorem
 allows prior estimates of the probability of an event to be revised in accordance with new observations. It states that the probability of an event A given another event B, P(A|B), is equal to P(B|A)P(A)/P(B).
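 As a worked illustration of the theorem, the sketch below revises a prior using one new observation. The function name and the numbers are hypothetical, and P(B) is expanded over the two cases A and not-A, which is an assumption about how P(B) is obtained.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) P(A) / P(B)."""
    # P(B) by the law of total probability over A and not-A.
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: 1% prior, 90% detection rate, 5% false-alarm rate.
revised = posterior(0.01, 0.90, 0.05)
print(round(revised, 3))
```

 With these numbers the revised probability stays well below the detection rate, because the prior is small.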
 Blackhole mechanism
 is a rounding mechanism for `nearly discrete' weights.
 Blackhole radius.
 If the value of a weight gets within this radius of a discrete value, it becomes that discrete value.
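 A minimal sketch of the rounding rule, assuming the allowed discrete values are -1, 0 and 1. The function name, the default levels and the strict inequality are illustrative assumptions rather than part of the definition.

```python
def blackhole_round(weight, radius, levels=(-1.0, 0.0, 1.0)):
    """Snap a weight to the nearest discrete level if it lies within radius."""
    nearest = min(levels, key=lambda v: abs(weight - v))
    # Inside the blackhole radius the weight becomes the discrete value.
    return nearest if abs(weight - nearest) < radius else weight


print(blackhole_round(0.97, 0.05))  # inside the radius of 1.0
print(blackhole_round(0.60, 0.05))  # outside every radius, left unchanged
```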
 Bootstrap.
 A random sample is selected by sampling with replacement from the data set and is used to train the network. The trained network is then tested on the remaining data. This procedure is repeated a large number of times. The average of all such test errors is an estimate of the generalisation performance metric.
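 The procedure can be sketched as follows. The names bootstrap_estimate and mean_predictor_error are hypothetical, and the "network" is replaced by a trivial mean predictor purely to keep the example self-contained.

```python
import random


def bootstrap_estimate(data, train_and_test, repeats=200, seed=0):
    """Average test error over repeated samples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    errors = []
    for _ in range(repeats):
        picked = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        chosen = set(picked)
        train = [data[i] for i in picked]
        test = [data[i] for i in range(n) if i not in chosen]  # remaining data
        if not test:  # every example was drawn; nothing left to test on
            continue
        errors.append(train_and_test(train, test))
    return sum(errors) / len(errors)


def mean_predictor_error(train, test):
    """Stand-in for training a network: predict the training mean."""
    mean = sum(train) / len(train)
    return sum(abs(x - mean) for x in test) / len(test)


estimate = bootstrap_estimate([float(i) for i in range(10)], mean_predictor_error)
```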
 Borel measurable functions.
 Just about all functions that one may encounter are Borel measurable. Functions that are not Borel measurable do exist but are known to mathematicians only as mathematical peculiarities.
 Cascade-correlation
 learning method starts with a network without any hidden neurons and systematically increases their number during training until the required performance is achieved.
 Cauchy sequence
 is a sequence {a_{n}} of real numbers for which, for every ε > 0, there exists a positive integer n_{0} such that |a_{m} - a_{n}| < ε whenever m > n ≥ n_{0}.
 Classification
 is a task in which the desired responses are restricted to a finite set of values.
 Closure.
 Cl(A) = A ∪ {limit points of A}.
 Compact set.
 A closed and bounded subset of \mathbb R^{d}. Also known as a compact.
 Compact in \mathbb R^{n}.
 A is a compact in \mathbb R^{n} if it is a subset of \mathbb R^{n}, is a closed set, and is a bounded set.
 Convergence, Pointwise.
 If {a_{n}} is a sequence of nonrandom real variables then a_{n} converges to a, i.e., a_{n} → a as n → ∞, if there exists a real number a such that for any ε > 0, there exists an integer N_{ε} sufficiently large that |a_{n} - a| < ε for all n ≥ N_{ε}. Also known as deterministic convergence.
 Convergence in distribution.
 If {\hat{a}_{n}} is a sequence of random variables with distribution functions F_{n}(a) ≡ P[\hat{a}_{n} ≤ a], then \hat{a}_{n} converges to F in distribution, i.e. \hat{a}_{n} →^{d} F, iff |F_{n}(a) - F(a)| → 0 for every continuity point a of F.
 Convergence in probability.
 If {\hat{a}_{n}} is a sequence of random variables then \hat{a}_{n} converges to a in probability, i.e. \hat{a}_{n} →^{P} a, if there exists a real number a such that for any ε > 0, P[|\hat{a}_{n} - a| < ε] → 1 as n → ∞. Also known as weak convergence.
 Convergence in the mean.
 If {\hat{a}_{n}} is a sequence of random variables then \hat{a}_{n} converges to a in the mean if lim_{n → ∞} E{|\hat{a}_{n} - a|} = 0, where E{·} denotes the expected value.
 Convergence in the mean squared sense.
 If {\hat{a}_{n}} is a sequence of random variables then \hat{a}_{n} converges to a in the mean squared sense if lim_{n → ∞} E{|\hat{a}_{n} - a|^{2}} = 0, where E{·} denotes the expected value.
 Convergence with probability 1.
 If {\hat{a}_{n}} is a sequence of random variables then \hat{a}_{n} converges with probability 1 to a, i.e. \hat{a}_{n} → a as n → ∞ or \hat{a}_{n} →^{P=1} a, if P[lim_{n → ∞} \hat{a}_{n} = a] = 1 for some real number a. Also known as almost sure convergence, convergence almost everywhere, and strong convergence.
 Convergence, Uniform
 The property that all of a family of functions or series on a given set converge at the same rate throughout the set; that is, for every ε > 0 there is a single N such that for all points x in the set, |f_{m}(x) - f_{n}(x)| < ε for all m, n > N. Uniform convergence as x tends to a value a is defined similarly.
 Cost function
 is the quantity that is to be minimised in an optimisation experiment. In the case of feedforward networks this quantity is usually the RMS error in the output of the network. Also known as error measure.
 Cover
 of a set A is a collection of sets {T_{i}} whose union contains A.
 Cover, Open.
 A cover {T_{i}} is an open cover if each T_{i} is open.
 Cover, Sub
 of a given cover is a subcollection whose union also contains A.
 Cross-validation, k-fold.
 The data set is divided equally into k randomly selected, mutually exclusive subsets called folds. k networks are trained sequentially, each on a different combination of k-1 folds, while the performance of each trained network is tested on the one remaining fold. The average of the k resulting errors is an estimate of the generalisation performance metric.
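 A minimal sketch of the fold construction and averaging, assuming each network is trained on all folds but one and tested on the fold left out. The function names are hypothetical and train_and_test stands in for whatever training procedure is used.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Split example indices into k random, mutually exclusive folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(data, k, train_and_test, seed=0):
    """Average the test error over the k held-out folds."""
    errors = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        errors.append(train_and_test(train, test))
    return sum(errors) / k


folds = k_fold_indices(10, 5)
```

 When the number of examples is not a multiple of k, the slicing used here makes some folds one example larger than others.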
 Decision sensitivity
 is the likelihood that an event will be detected if it occurs. It is the ratio of true positives to the sum of true positives and false negatives. This metric is especially important when it is critical that an event be detected. Also known as True Positive Ratio.
 Decision specificity
 is the likelihood that the absence of an event is detected given that it is absent. It is the ratio of true negatives to the sum of true negatives and false positives. Also known as True Negative Ratio.
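 Both ratios can be computed directly from the four outcome counts of a two-class decision. The function names and the counts below are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """True Positive Ratio: detected events over all events that occurred."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True Negative Ratio: detected absences over all actual absences."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: 90 hits, 10 misses, 80 correct rejections, 20 false alarms.
print(sensitivity(90, 10), specificity(80, 20))
```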
 Decision surface
 is the plot of the response of an output neuron with respect to the inputs.
 Denseness.
 A set A is dense in a set S if A ⊂ S and Cl(A) = S.
 Denseness, Uniform.
 A set of functions A is uniformly dense in C(\mathbb R^{d}) on the compact set K ⊂ \mathbb R^{d} if for all f ∈ C(\mathbb R^{d}) and every ε > 0 there exists \hat{f} ∈ A such that sup{|f(x) - \hat{f}(x)| : x ∈ K} < ε.
 Denseness, Uniform on compacta.
 A set of functions A is uniformly dense on compacta in C(\mathbb R^{d}) if it is uniformly dense on every compact subset of \mathbb R^{d}.
 Disjunctive normal form (DNF).
 The form of a logical expression consisting of a single disjunction (+) of a set of conjunctions (·). All logical expressions are expressible in this form.
 Effective sample size
 (for classification learning tasks) is the number of examples representing the smallest classification group.
 EM algorithm.
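 The Expectation Maximisation algorithm is an iterative procedure for finding maximum-likelihood estimates of model parameters when some of the data are missing or hidden. Each iteration alternates an expectation step, which computes the expected log-likelihood under the current parameter estimates, with a maximisation step, which chooses the parameters that maximise it.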
 Epoch
 is the cycle in which all examples in the training set are presented to the network.
 Ergodic process.
 A random process is ergodic if its ensemble and temporal averages are the same.
 Error surface
 is the plot of the cost function with respect to all of the weights in a network.
 Feedforward network
 consists of a layer of inputs, zero or more layers of hidden neurons, and an output layer of neurons. Generally all neurons in adjacent layers are fully connected to each other with feedforward synapses only. There are no intralayer synapses. Also known as the multilayer perceptron.
 Fit, Over
 An overfit is due to the trained network having a higher complexity than the concept embedded in the training data. Also known as memorisation and overspecialisation.
 Fit, Under
 An underfit is caused by the trained network having a complexity lower than that of the concept embedded in the training data.
 Forward pass.
 The process by which a network computes the output vector in response to an input vector. Also known as recall.
 Function, Analytic.
 f ∈ C(\mathbb R) is analytic at a ∈ \mathbb R with a radius of convergence r > 0 if there is an infinite sequence of real numbers, {c_{n}}, n ≥ 0, such that for |x - a| < r, Σ_{n=0}^{∞} c_{n}(x - a)^{n} converges and f(x) = Σ_{n=0}^{∞} c_{n}(x - a)^{n}.
 Function, Superanalytic.
 f ∈ C(\mathbb R) is superanalytic at a ∈ \mathbb R with a radius of convergence r > 0 if there is an infinite sequence of real numbers, {c_{n}}, n ≥ 0, with c_{n} ≠ 0 for every n ≥ 1, such that for |x - a| < r, Σ_{n=0}^{∞} c_{n}(x - a)^{n} converges and f(x) = Σ_{n=0}^{∞} c_{n}(x - a)^{n}.
 Function approximation
 is a task in which the desired output values are continuous. Also known as regression.
 Functional
 is a scalar-valued continuous linear function defined on a normed linear space.
 Functional, Linear
 on a linear space E over \mathbb R is a linear transformation of E into \mathbb R.
 Functional, linear, Bounded.
 A bounded linear transformation of a normed linear space E over \mathbb R into the normed linear space \mathbb R is called a bounded linear functional on E.
 Generalisation performance
 is the accuracy of decision of a trained network on a set of data which is similar to but not the same as the training data set.
 Hahn-Banach theorem.
 Let M be a linear subspace of a normed linear space N, and let f be a functional defined on M. Then f can be extended to a functional f_{0} defined on the whole space N such that ||f_{0}|| = ||f||.
 If M_{c} is a closed linear subspace of N and x_{0} is a vector not in M_{c}, then there exists a functional f_{0} in the conjugate space N^{*} such that f_{0}(M_{c}) = 0 and f_{0}(x_{0}) ≠ 0.
 Hebbian learning.
 The main idea behind Hebbian learning is that the synapse between two neurons should be strengthened if they fire simultaneously.
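 One common way to express this idea is a weight change proportional to the product of the two neurons' activities. The sketch below assumes that product form; the function name and the learning-rate parameter are illustrative assumptions.

```python
def hebbian_update(weight, pre_activity, post_activity, learning_rate=0.1):
    """Strengthen the synapse in proportion to simultaneous activity."""
    return weight + learning_rate * pre_activity * post_activity


print(hebbian_update(0.5, 1.0, 1.0))  # both neurons fire: weight grows
print(hebbian_update(0.5, 1.0, 0.0))  # only one fires: weight unchanged
```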
 Hidden layer
 is the layer of neurons which is not directly connected to the network inputs or outputs.
 Homogeneity property.
 A set of functions A fulfils the homogeneity property if f ∈ A and J ∈ \mathbb R ⇒ J f ∈ A.
 Inequality, triangle.
 |a + b| ≤ |a| + |b|
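 k-nearest neighbours
 is a classification procedure that assigns an input to the class most common among its k nearest training examples. It should not be confused with k-means clustering, which minimises the sum of squares of distances between the training data and k points.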
 L_{p}-norm
 is a popular form of the cost function for feedforward networks:
 E_{o}(W) = ( Σ_{j=1}^{J} |t_{j} - o_{j}|^{p} )^{1/p}, p = 1, 2, ..., ∞,
 where t_{j} is the target or desired value of the jth output, and o_{j} is its value computed by the network.
 Learning
 is the process in which a feedforward network is forced to adjust its weights such that the network's response to a given input vector becomes closer to the desired response.
 Learning, Batch.
 The type of learning during which weights are updated at the end of every epoch. Also known as offline learning.
 Learning, In-situ
 differs from online learning in that the former is a property of a network, requiring the deployed network to have adaptive weights, whereas the latter is a property of the learning procedure, requiring the weights to be updated on the presentation of every example.
 Learning, Online.
 The type of learning during which weights are updated after the presentation of every training example. Also known as pattern and incremental learning.
 Learning, Supervised
 The learning process in which a system's internal parameters are modified in order to minimise the error in its output with respect to a desired value.
 Learning, Unsupervised
 The learning process in which a system's internal parameters are modified so that similar input patterns result in similar outputs.
 Learning rate
 determines the size of the weight modification at each training step.
 Likelihood
 is the probability density of observations calculated from parameters and not observations.
 Limit Point.
 A point p is a limit point of A if every neighbourhood of p contains a point q ≠ p such that q ∈ A.
 Linear separability.
 The property of a classification task by which the members of one class can be separated from the ones from all other classes by a single hyperplane.
 Loading problem, The.
 The problem of finding the optimal weight values for a given network such that the network performs the required mapping.
 Logistic discriminant analysis
 chooses classification hyperplanes by maximising a conditional likelihood cost function rather than by optimising a quadratic cost function, as is the case for linear discriminant analysis.
 Margin, e_{n}.
 Error in the output of a neuron is not backpropagated if it is within this small margin.
 Minima, Global.
 The points of minimum error on an error surface.
 Minima, Local.
 The points on an error surface at which the error is lower than at all neighbouring points but which are not global minima.
 Mixture representation
 of data uses a linear combination of Gaussian distributions to represent arbitrary distributions.
 Momentum
 is a training parameter used in a very common variation on the standard error backpropagation learning procedure. It controls the effect of the last weight modification on the current weight update.
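 A minimal sketch of one such update for a single weight, assuming the usual form in which the new modification is the steepest-descent step plus a fraction of the previous modification. The function name and default values are hypothetical.

```python
def momentum_step(weight, gradient, previous_delta, learning_rate=0.1, momentum=0.9):
    """Return the updated weight and the modification that was applied."""
    # The momentum term carries over part of the last weight modification.
    delta = -learning_rate * gradient + momentum * previous_delta
    return weight + delta, delta


w, d = momentum_step(1.0, 2.0, 0.0)   # first step: no previous modification
w2, d2 = momentum_step(w, 2.0, d)     # second step: larger move, same gradient
```

 With a constant gradient the steps grow from one update to the next, which is the effect the momentum parameter is meant to control.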
 n-layer network
 is a feedforward network with n - 1 hidden layers.
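 NP-complete problems.
 (nondeterministic polynomial-time complete problems) No algorithm is known that finds the optimal solution for this class of problems in time polynomial in the size of the problem; the time required by known exact algorithms grows exponentially in the worst case. Commonly regarded as intractable problems.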
 Neural network, Artificial
 is a set of interconnected artificial neurons.
 Neuron, Artificial
 is the fundamental processing element in an artificial neural network. It performs a weighted sum of its inputs, adds the offset value to that sum, and then outputs a certain transform of that sum. Also known as node and processing element (PE).
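 The computation described above can be sketched as follows, assuming a logistic activation function. The function names and the example values are hypothetical.

```python
import math


def logistic(x):
    """A sigmoidal activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, offset, activation=logistic):
    """Weighted sum of the inputs, plus the offset, passed through the transform."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + offset
    return activation(weighted_sum)


# 0.5 * 1 + (-0.25) * 2 + 0 = 0, and the logistic function maps 0 to 0.5.
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))
```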
 Ockham's Razor
 is the conjecture that if, for a given problem, two solutions with similar performances are available then the one with the lower computational complexity should be preferred.
 Offset
 is the value added to the weighted sum before the transform is applied to compute a neuron's output. Also known as threshold and bias.
 Overtrained
 networks have a complexity higher than what is required to learn the concept embedded in training data. They act as lookup tables for the training data and are poor generalisers.
 Perceptron
 is a feedforward network with no hidden neurons.
 Probability, Prior
 is the probability assigned to an event in advance of any empirical evidence. Also known as `a priori' probability.
 Probability, Posterior
 is the probability assigned to an event based on observations. Also known as `a posteriori' probability.
 Projection pursuit regression
 is a generalisation of the feedforward network in that it allows more than one type of activation function in the hidden layer. These nonhomogeneous activation functions are data-dependent and constructed during learning.
 Regularisation
 A class of methods designed to avoid overfitting to the training data by enforcing smoothness of the fit.
 Ridge regression.
 The precision of least-squares estimates gets worse with an increase in dependence between the input variables. Ridge regression estimators are more precise in those situations and are obtained as the estimators whose distance to an ellipsoid (the `ridge') centred at a least-squares estimate from the origin of a parameter space is a minimum.
 Ridging, Constrained.
 Optimisation procedure in which some norm of the weights is constrained to a specific value.
 Ridging, Penalised.
 Optimisation procedure in which the cost function is augmented by a penalty term.
 Ridging, Smoothed.
 Optimisation procedure in which noise is introduced in the inputs.
 Riesz representation theorem.
 Let x^{*} be a bounded linear functional on the Banach space C_{\mathbb R}([a,b]). Then there is a real-valued function α of bounded variation on [a,b] such that x^{*}(f) = ∫^{b}_{a} f dα for all f ∈ C_{\mathbb R}([a,b]). Further, if x^{*} is a positive linear functional, then α is increasing on [a,b].
 RMS error, E_{orms},
 is computed by summing the output layer errors for all examples in a training or test set, dividing the sum by the total number of examples and the number of the output layer neurons, and taking the square root of the resultant. The output layer error is computed by summing the squares of the individual neuron errors with respect to the desired output. An individual output-layer neuron's error is set to zero if it is less than the margin.
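 The steps above can be sketched as follows. The function name is hypothetical, and comparing the absolute value of each neuron's error against the margin is an assumption about how "less than the margin" is meant.

```python
import math


def rms_error(targets, outputs, margin=0.0):
    """RMS error over all examples and all output-layer neurons."""
    total = 0.0
    for target_vec, output_vec in zip(targets, outputs):
        for t, o in zip(target_vec, output_vec):
            error = t - o
            if abs(error) < margin:  # errors inside the margin are ignored
                error = 0.0
            total += error * error
    n_examples = len(targets)
    n_outputs = len(targets[0])
    return math.sqrt(total / (n_examples * n_outputs))


# Two examples, one output neuron each; one of them is wrong by 1.
print(rms_error([[1.0], [0.0]], [[0.0], [0.0]]))
```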
 Sampling with replacement
 may result in successive samples being not mutually exclusive, some of the examples may never appear in any of the samples, and there may be repetitions within an individual sample.
 Separates points.
 A family of functions A separates points on a set S if for every x, y ∈ S, x ≠ y, there exists f ∈ A such that f(x) ≠ f(y).
 Set, Closed.
 A subset M of metric space N is a closed set if it contains each of its limit points.
 Set, Finite.
 A is finite if all of its elements can be displayed as {a_{1}, a_{2}, . . . ,a_{n}} for some integer n.
 Set, Open
 is a subset G of the metric space X such that each point of G is the centre of some open sphere contained in G.
 Shattered.
 If a set of functions F includes all possible dichotomies on a set S of points, then S is said to be shattered by F.
 Shrinkage.
 The difference between the training set accuracy of a network and its accuracy on a test set.
 Sigmoidal functions.
 Definitions vary but are generally taken to be bounded, monotone, and continuous, e.g. logistic and tanh(·) functions.
 Simulated annealing
 is a stochastic optimisation technique inspired by the physical process of annealing.
 Skip-layer synapses.
 Synapses connecting neurons in two nonadjacent layers. Also known as shortcut synapses. Known as main effects in the statistical literature.
 Smoothing spline modelling
 is piecewise approximation by polynomials of degree n with the requirement that the derivatives of the polynomials are continuous up to order n-1 at the junctions.
 Softmax.
 The purpose of the softmax activation function is to make the sum of the output neuron responses equal to one, so that the outputs are interpretable as posterior probabilities. Also known as the multiple-logistic function.
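 A minimal sketch using the usual exponential form of the function. Subtracting the largest net input before exponentiating is a numerical-stability choice made here and does not change the result.

```python
import math


def softmax(net_inputs):
    """Map output-layer net inputs to positive responses that sum to one."""
    largest = max(net_inputs)
    exps = [math.exp(v - largest) for v in net_inputs]
    total = sum(exps)
    return [e / total for e in exps]


responses = softmax([1.0, 2.0, 3.0])
print(sum(responses))
```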
 Space, Banach
 is a complete normed linear space.
 Space, Compact
 is a topological space in which every open cover has a finite subcover.
 Space, Complete metric
 is a metric space in which every Cauchy sequence is convergent.
 Space, Conjugate.
 N^{*} is the set of all continuous linear transforms of the normed linear space N into \mathbb R.
 Space, Euclidean
 is the metric space (\mathbb R^{n}, d) such that d(x,y) = (Σ_{r=1}^{n}(x_{r} - y_{r})^{2})^{1/2}.
 Space, Hausdorff
 is a topological space, T, in which any two given distinct points x, y are such that there exist disjoint open subsets U, V containing x, y respectively.
 Space, L_{p}
 consists of all measurable functions f defined on a measure space M with measure μ which are such that |f(x)|^{p} is integrable, with the norm taken as ||f||_{p} = (∫|f(x)|^{p} dμ(x))^{1/p}.
 Space, Normed linear
 over \mathbb R is a pair (E, ||·||), where E is a linear space over \mathbb R and ||·|| is a norm on E. A normed linear space is a metric space with the metric being ||x - y||.
 Space, Topological
 is a pair (X,T), where X is a nonempty set and T is a collection of subsets of X that contains the empty set and X itself and is closed under arbitrary unions and finite intersections.
 Span of a set of functions.
 For any function f ∈ C(\mathbb R) and r > 0, f|_{(-r,r)} denotes the restriction of f to the interval (-r, r), and for any class of functions F, F|_{(-r,r)} denotes {f|_{(-r,r)} : f ∈ F}. For any F defined on a set O, the span of F, sp(F), denotes the closure of the set of finite linear combinations of elements of F in the topology of uniform convergence on compact subsets of O.
 Sphere, Open.
 S_{r}(x_{0}) with centre x_{0} and radius r is the subset of the metric space X with metric d defined by S_{r}(x_{0}) = {x : d(x,x_{0}) < r}.
 Stationary, strongly.
 If a random process X is strongly stationary then the distribution of X(t) is independent of the time t.
 Subset, Proper.
 A is a proper subset of B if A ⊂ B and B ⊄ A.
 Subspace, Linear
 is a nonempty subset M of a linear space such that (x+y) ∈ M whenever x ∈ M and y ∈ M, and αx ∈ M whenever x ∈ M, where α is a scalar.
 Supremum
 is the least upper bound for a set.
 Synapse
 is a measure of the effect that a neuron's output has on the output of another neuron at the other end of the synapse. Also known as connection, edge, and weight.
 Testing
 is the process of verifying the function of a trained network against a set of examples which is different from the training examples set.
 Train-and-test.
 A random sample containing one half of the total number of examples is selected. This subset is used to train the network while the remaining examples are used to test the network once it has been trained. The performance of the trained network on the test set is an estimate of the generalisation performance metric.
 Training
 See Learning.
 Training example
 is a pair: an input vector, and the desired response to that input vector.
 Vanishes at no point.
 A family of functions A vanishes at no point of the set S if for each x ∈ S there exists f ∈ A such that f(x) ≠ 0.
 Weight
 is the value of a synapse or an offset.
 Weight decay
 is a common regularisation technique used in feedforward network training in which the cost function is augmented with a term which penalises large weight values.
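 A minimal sketch of an augmented cost, assuming the penalty is the sum of squared weights scaled by a decay coefficient. That quadratic form is a common choice but is an assumption here, as are the function and parameter names.

```python
def decayed_cost(error, weights, decay=0.001):
    """Cost function augmented with a penalty on large weight values."""
    penalty = decay * sum(w * w for w in weights)
    return error + penalty


# The same output error costs more when the weights are larger.
print(decayed_cost(1.0, [0.1, 0.2], decay=0.1))
print(decayed_cost(1.0, [1.0, 2.0], decay=0.1))
```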
 Weight depth
 is the number of binary bits in a weight.
 Weight elimination
 is a regularisation technique used in feedforward network training in which the cost function is augmented with a term which penalises the number of nonzero weights.
 Weight perturbation
 is a hardware-friendly alternative to BP learning. In this method, all of the weights are perturbed in turn and the associated change in the output of the network is used to approximate local gradients.
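 The idea can be sketched as a forward-difference approximation. This example measures the change in a scalar cost rather than in the raw network output, which is a simplifying assumption; the function name and step size are hypothetical.

```python
def perturbation_gradient(cost, weights, step=1e-4):
    """Approximate the local gradient by perturbing each weight in turn."""
    base = cost(weights)
    gradient = []
    for i in range(len(weights)):
        perturbed = list(weights)
        perturbed[i] += step  # perturb one weight, leave the others fixed
        gradient.append((cost(perturbed) - base) / step)
    return gradient


# Hypothetical cost with known gradient (2 * w0, 3) for checking the estimate.
estimate = perturbation_gradient(lambda w: w[0] ** 2 + 3.0 * w[1], [1.0, 2.0])
```

 Only evaluations of the cost are needed, with no backward pass, which is what makes the approach attractive for hardware.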
 Weight sharing
 is a regularisation technique used in feedforward network training in which the cost function is augmented with a term which penalises the number of independent weights.

  