Neural Nets Glossary

Summary
This is a glossary of terms that can help you understand the literature on neural nets. Some of the mathematical symbols may not display properly on this page. For clarification on them, take a look at the glossary section in Feedforward Neural Networks With Constrained Weights PDF.
- Activation function
- is the transform applied to the weighted sum of inputs plus offset for computing the output of a neuron. Also known as the squashing function.
- Affine group invariance.
- The property of a group due to which it stays unchanged after the application of an affine transform.
- Affine transform
- is a transform from the set of rotations, shifts, scalings, or any combinations thereof.
- Algebra.
- A set of functions $A$ is an algebra if $f, g \in A$ and $\vartheta \in \mathbb{R}$ imply $f + g \in A$, $f \cdot g \in A$, and $\vartheta f \in A$.
- Alopex
- is a stochastic learning procedure which uses local correlations between changes in individual weights and changes in the cost function to update weights. This procedure predates the current resurgence in neural network learning by more than a decade and was originally proposed for mapping visual receptive fields.
- Approximation property, Universal,
- is the ability of a set of functions to approximate a specific class of functions to any desired accuracy.
- Approximation property, Best,
- is the property of an approximation scheme on a set of functions to select a function that is at a minimum `distance' from the function to be approximated.
- ART-EMAP
- is ARTMAP with added spatial and temporal evidence accumulation processes.
- ARTMAP
- is a supervised learning procedure explicitly based on neurobiology.
- ARTMAP-IC
- is ARTMAP with an instance counting procedure and a match tracking algorithm.
- Attribute
- is an element of the input vector. Also known as a feature.
- Autoassociator
- A system for which the desired output response is the same as the input.
- Backpropagation, Error
- is a procedure in which the difference between the actual and desired responses of the neurons in the output layer is minimised using the steepest-descent heuristic.
- Balanced data set
- is a set in which all classes are equally represented.
- Bayesian classifier
- assigns a class to an object in such a way that the expectation value of misclassification is minimised. Also known as the minimum risk classifier, and belief network.
- Bayesian statistics
- differs from the conventional `frequentist' approach to statistics in that it allows the probability of an event to be expressed as `degree of belief' in a particular outcome instead of basing it solely on a set of observations.
- Bayes's theorem
- allows prior estimates of the probability of an event to be revised in accordance with new observations. It states that the probability of an event A given another event B, P(A|B), is equal to P(B|A)P(A)/P(B).
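As a rough numerical illustration of the theorem (not part of the original glossary), the sketch below revises a made-up prior probability of a condition after a positive test result; all probability values are assumptions chosen for the example.

```python
# Hypothetical worked example of Bayes's theorem: P(A|B) = P(B|A) P(A) / P(B).
p_disease = 0.01            # prior P(A): prevalence of the condition
p_pos_given_disease = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # P(B|not A): false-positive rate

# Total probability of a positive test, P(B), by marginalising over A and not-A.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1.0 - p_disease)

# Posterior P(A|B): revised probability of the condition given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # about 0.161
```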
- Black-hole mechanism
- is a rounding mechanism for `nearly discrete' weights.
- Black-hole radius.
- If the value of a weight gets within this radius of a discrete value, it becomes that discrete value.
- Bootstrap.
- A random sample is selected by sampling with replacement from the data set and is used to train the network. The trained network is then tested on the remaining data. This procedure is repeated a large number of times. The average of all such test errors is an estimate of the generalisation performance metric.
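A minimal sketch of the bootstrap estimate described in the entry above, assuming the data set is a NumPy array of examples and that `train` and `test_error` are user-supplied (hypothetical) routines; only the resampling bookkeeping is shown.

```python
import numpy as np

def bootstrap_estimate(data, train, test_error, n_rounds=100, rng=None):
    """Estimate generalisation error by repeated sampling with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    errors = []
    for _ in range(n_rounds):
        # Sample n indices with replacement to form the training sample.
        idx = rng.integers(0, n, size=n)
        out_of_bag = np.setdiff1d(np.arange(n), idx)  # examples never drawn
        if len(out_of_bag) == 0:
            continue
        model = train(data[idx])
        errors.append(test_error(model, data[out_of_bag]))
    # The average test error over all rounds estimates generalisation performance.
    return float(np.mean(errors))
```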
- Borel measurable functions.
- Just about all functions that one may encounter are Borel measurable. Functions that are not Borel measurable do exist but are known to mathematicians only as mathematical peculiarities.
- Cascade-correlation
- learning method starts with a network without any hidden neurons and systematically increases their number during training until the required performance is achieved.
- Cauchy sequence
- is a sequence $\{a_n\}$ of real numbers for which, for every $\varepsilon > 0$, there exists a positive integer $n_0$ such that $|a_m - a_n| < \varepsilon$ whenever $m > n \ge n_0$.
- Classification
- is a task in which the desired responses are restricted to a finite set of values.
- Closure.
- $\mathrm{Cl}(A) = A \cup \{\text{limit points of } A\}$.
- Compact set.
- A closed and bounded subset of $\mathbb{R}^d$. Also known as a compact.
- Compact in $\mathbb{R}^n$.
- A is a compact in $\mathbb{R}^n$ if it is a subset of $\mathbb{R}^n$, is a closed set, and is a bounded set.
- Convergence, Pointwise.
- If $\{a_n\}$ is a sequence of non-random real variables, then $a_n$ converges to $a$, i.e. $a_n \to a$ as $n \to \infty$, if there exists a real number $a$ such that for any $\varepsilon > 0$ there exists an integer $N_\varepsilon$ sufficiently large that $|a_n - a| < \varepsilon$ for all $n \ge N_\varepsilon$. Also known as deterministic convergence.
- Convergence in distribution.
- If $\{\hat a_n\}$ is a sequence of random variables with distribution functions $F_n(a) \equiv P[\hat a_n \le a]$, then $\hat a_n$ converges to $F$ in distribution, i.e. $\hat a_n \xrightarrow{d} F$, iff $|F_n(a) - F(a)| \to 0$ for every continuity point $a$ of $F$.
- Convergence in probability.
- If $\{\hat a_n\}$ is a sequence of random variables, then $\hat a_n$ converges to $a$ in probability, i.e. $\hat a_n \xrightarrow{P} a$, if there exists a real number $a$ such that for any $\varepsilon > 0$, $P[|\hat a_n - a| < \varepsilon] \to 1$ as $n \to \infty$. Also known as weak convergence.
- Convergence in the mean.
- If $\{\hat a_n\}$ is a sequence of random variables, then $\hat a_n$ converges to $a$ in the mean if $\lim_{n \to \infty} E\{|\hat a_n - a|\} = 0$, where $E\{\cdot\}$ denotes the expected value.
- Convergence in the mean squared sense.
- If $\{\hat a_n\}$ is a sequence of random variables, then $\hat a_n$ converges to $a$ in the mean squared sense if $\lim_{n \to \infty} E\{|\hat a_n - a|^2\} = 0$, where $E\{\cdot\}$ denotes the expected value.
- Convergence with probability 1.
- If $\{\hat a_n\}$ is a sequence of random variables, then $\hat a_n$ converges with probability 1 to $a$, i.e. $\hat a_n \to a$ as $n \to \infty$, if $P\left[\lim_{n \to \infty} \hat a_n = a\right] = 1$ for some real number $a$. Also known as almost sure convergence, convergence almost everywhere, and strong convergence.
- Convergence, Uniform
- The property that all members of a family of functions or series on a given set converge at the same rate throughout the set; that is, for every $\varepsilon > 0$ there is a single $N$ such that $|f_m(x) - f_n(x)| < \varepsilon$ for all points $x$ in the set and all $m, n > N$, and similarly for uniform convergence as $x$ tends to a value $a$.
- Cost function
- is the quantity that is to be minimised in an optimisation experiment. In the case of feedforward networks this quantity is usually the RMS error in the output of the network. Also known as error measure.
- Cover
- of a set A is a collection of sets $\{T_i\}$ whose union contains A.
- Cover, Open.
- A cover $\{T_i\}$ is an open cover if each $T_i$ is open.
- Cover, Sub-
- of a given cover is a subcollection whose union also contains A.
- Cross-validation, n-fold.
- The data set is divided equally into k randomly selected, mutually exclusive subsets called folds. k networks are trained, each on a different combination of k-1 folds, and each trained network is tested on the one remaining fold. The average of the k test errors is an estimate of the generalisation performance metric.
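A minimal sketch of the fold bookkeeping described in the entry above, again assuming a NumPy array of examples and hypothetical `train` and `test_error` routines.

```python
import numpy as np

def k_fold_cv(data, train, test_error, k=10, rng=None):
    """k-fold cross-validation: train on k-1 folds, test on the held-out fold."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(data))      # random, mutually exclusive folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        model = train(data[train_idx])
        errors.append(test_error(model, data[test_idx]))
    return float(np.mean(errors))         # average test error over the k folds
```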
- Decision sensitivity
- is the likelihood that an event will be detected if it occurs. It is the ratio of true positives to the sum of true positives and false negatives. This metric is especially important when it is critical that an event be detected. Also known as True Positive Ratio.
- Decision specificity
- is the likelihood that the absence of an event is detected given that the event is indeed absent. It is the ratio of the true negatives to all negatives. Also known as True Negative Ratio.
- Decision surface
- is the plot of the response of an output neuron with respect to the inputs.
- Denseness.
- A set A is dense in a set S if $A \subset S$ and $\mathrm{Cl}(A) = S$.
- Denseness, Uniform.
- A set of functions $A$ is uniformly dense in $C(\mathbb{R}^d)$ on the compact set $K \subset \mathbb{R}^d$ if for all $f \in C(\mathbb{R}^d)$ and every $\varepsilon > 0$ there exists $\hat f \in A$ such that $\sup\{|f(x) - \hat f(x)| : x \in K\} < \varepsilon$.
- Denseness, Uniform on compacta.
- A set of functions $A$ is uniformly dense on compacta in $C(\mathbb{R}^d)$ if it is uniformly dense on every compact subset of $\mathbb{R}^d$.
- Disjunctive normal form (DNF).
- The form of a logical expression consisting of a single disjunction (+) of a set of conjunctions (·). All logical expressions are expressible in this form.
- Effective sample size
- (for classification learning tasks) is the number of examples representing the smallest classification group.
- EM algorithm.
- The Expectation-Maximisation algorithm iteratively estimates the parameters that maximise the likelihood, i.e. the probability density of the observations calculated from the parameters.
- Epoch
- is the cycle in which all examples in the training set are presented to the network.
- Ergodic process.
- A random process is ergodic if its ensemble and temporal averages are the same.
- Error surface
- is the plot of the cost-function with respect to all of the weights in a network.
- Feedforward network
- consists of a layer of inputs, zero or more layers of hidden neurons, and an output layer of neurons. Generally all neurons in adjacent layers are fully connected to each other with feedforward synapses only. There are no intra-layer synapses. Also known as the multilayer perceptron.
- Fit, Over-
- An over-fit is due to the trained network having a higher complexity than the concept embedded in the training data. Also known as memorisation and over-specialisation.
- Fit, Under-
- An under-fit is caused by the trained network having a complexity lower than that of the concept embedded in the training data.
- Forward pass.
- The process by which a network computes the output vector in response to an input vector. Also known as recall.
- Function, Analytic.
- $f \in C(\mathbb{R})$ is analytic at $a \in \mathbb{R}$ with a radius of convergence $r > 0$ if there is an infinite sequence of real numbers $\{c_n\}$, $n \ge 0$, such that for $|x - a| < r$, $\sum_{n=0}^{\infty} c_n (x - a)^n$ converges and $f(x) = \sum_{n=0}^{\infty} c_n (x - a)^n$.
- Function, Superanalytic.
- $f \in C(\mathbb{R})$ is superanalytic at $a \in \mathbb{R}$ with a radius of convergence $r > 0$ if there is an infinite sequence of real numbers $\{c_n\}$, $n \ge 0$, with $c_n \ne 0$ for every $n \ge 1$, such that for $|x - a| < r$, $\sum_{n=0}^{\infty} c_n (x - a)^n$ converges and $f(x) = \sum_{n=0}^{\infty} c_n (x - a)^n$.
- Function approximation
- is a task in which the desired output values are continuous. Also known as regression.
- Functional
- is a scalar-valued continuous linear function defined on a normed linear space.
- Functional, Linear
- on a linear space E over \mathbb R is a linear transformation of E into \mathbb R.
- Functional, linear, Bounded.
- A bounded linear transformation of a normed linear space E over \mathbb R into the normed linear space \mathbb R is called a bounded linear functional on E.
- Generalisation performance
- is the accuracy of decision of a trained network on a set of data which is similar to but not the same as the training data set.
- Hahn-Banach theorem.
- Let $M$ be a linear subspace of a normed linear space $N$, and let $f$ be a functional defined on $M$. Then $f$ can be extended to a functional $f_0$ defined on the whole space $N$ such that $\|f_0\| = \|f\|$.
If $M_c$ is a closed linear subspace of $N$ and $x_0$ is a vector not in $M_c$, then there exists a functional $f_0$ in the conjugate space $N^*$ such that $f_0(M_c) = 0$ and $f_0(x_0) \ne 0$.
- Hebbian learning.
- The main idea behind Hebbian learning is that the synapse between two neurons should be strengthened if they fire simultaneously.
- Hidden layer
- is the layer of neurons which is not directly connected to the network inputs or outputs.
- Homogeneity property.
- A set of functions $A$ fulfils the homogeneity property if $f \in A$ and $\vartheta \in \mathbb{R}$ imply $\vartheta f \in A$.
- Inequality, triangle.
- $|a + b| \le |a| + |b|$
- k-nearest neighbours
- is a clustering algorithm that minimises the sum of squares of distances between the training data and k points.
- Lp-norm
- is a popular form of the cost function for feedforward networks:
$$E_o(W) = \left( \sum_{j=1}^{J} |t_j - o_j|^p \right)^{1/p}, \qquad p = 1, 2, \ldots, \infty,$$
where $t_j$ is the target or desired value of the jth output, and $o_j$ is its value computed by the network.
- Learning
- is the process in which a feedforward network is forced to adjust its weights such that the network's response to a given input vector becomes closer to the desired response.
- Learning, Batch.
- The type of learning during which weights are updated at the end of every epoch. Also known as off-line learning.
- Learning, In-situ
- differs from on-line learning in that the former is a property of a network, requiring the deployed network to have adaptive weights, whereas the latter is a property of the learning procedure, requiring the weights to be updated on the presentation of every example.
- Learning, On-line.
- The type of learning during which weights are updated after the presentation of every training example. Also known as pattern and incremental learning.
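The contrast between batch and on-line learning in the two entries above can be illustrated with a small gradient-descent sketch; `grad` stands for any hypothetical routine returning the cost gradient for a single example.

```python
import numpy as np

def batch_epoch(w, examples, grad, lr=0.1):
    """Batch (off-line) learning: one weight update at the end of the epoch."""
    total = np.zeros_like(w, dtype=float)
    for x, t in examples:
        total += grad(w, x, t)        # accumulate the gradient over the epoch
    return w - lr * total

def online_epoch(w, examples, grad, lr=0.1):
    """On-line (pattern/incremental) learning: update after every example."""
    for x, t in examples:
        w = w - lr * grad(w, x, t)
    return w
```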
- Learning, Supervised
- The learning process in which a system's internal parameters are modified in order to minimise the error in its output with respect to a desired value.
- Learning, Unsupervised
- The learning process in which a system's internal parameters are modified so that similar input patterns result in similar outputs.
- Learning rate
- determines the size of the weight modification at each training step.
- Likelihood
- is the probability density of observations calculated from parameters and not observations.
- Limit Point.
- A point p is a limit point of A if every neighbourhood of p contains a point $q \ne p$ such that $q \in A$.
- Linear separability.
- The property of a classification task by which the members of one class can be separated from the ones from all other classes by a single hyperplane.
- Loading problem, The.
- The problem of finding the optimal weight values for a given network such that the network performs the required mapping.
- Logistic discriminant analysis
- chooses classification hyperplanes by maximising a conditional-likelihood cost function rather than by optimising a quadratic cost function, as is the case for linear discriminant analysis.
- Margin, en.
- Error in the output of a neuron is not backpropagated if it is within this small margin.
- Minima, Global.
- The points of minimum error on an error surface.
- Minima, Local.
- The points of zero gradient on an error surface which are not global minima.
- Mixture representation
- of data uses a linear combination of Gaussian distributions to represent arbitrary distributions.
- Momentum
- is a training parameter used in a very common variation on standard error backpropagation learning procedure. It controls the effect of the last weight modification on the current weight update.
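A sketch of the momentum update described above; the parameter names and default values are illustrative, not prescribed by the entry.

```python
def momentum_step(w, gradient, prev_delta, lr=0.1, momentum=0.9):
    """One weight update with momentum; returns new weights and the applied change."""
    delta = -lr * gradient + momentum * prev_delta  # reuse part of the last change
    return w + delta, delta
```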
- n-layer network
- is a feedforward network with n - 1 hidden layers.
- NP-complete problems.
- (nondeterministic polynomial time) The time required to find the optimal solution for this class of problems is believed to grow exponentially with the size of the problem. Also known as intractable problems.
- Neural network, Artificial
- is a set of interconnected artificial neurons.
- Neuron, Artificial
- is the fundamental processing element in an artificial neural network. It performs a weighted sum of its inputs, adds the offset value to that sum, and then outputs a certain transform of that sum. Also known as node and processing element (PE).
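A minimal sketch of the computation described above, with the logistic function standing in as the transform; any other activation function could be substituted.

```python
import numpy as np

def neuron_output(inputs, weights, offset):
    """Weighted sum of inputs plus offset, passed through a squashing function."""
    s = np.dot(weights, inputs) + offset   # weighted sum plus offset (bias)
    return 1.0 / (1.0 + np.exp(-s))        # logistic activation function
```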
- Ockham's Razor
- is the conjecture that if, for a given problem, two solutions with similar performances are available then the one with the lower computational complexity should be preferred.
- Offset
- is the value added to the weighted sum before the transform is applied to compute a neuron's output. Also known as threshold and bias.
- Over-trained
- networks have a complexity higher than what is required to learn the concept embedded in training data. They act as look-up tables for the training data and are poor generalisers.
- Perceptron
- is a feedforward network with no hidden neurons.
- Probability, Prior
- is the probability assigned to an event in advance of any empirical evidence. Also known as `a priori' probability.
- Probability, Posterior
- is the probability assigned to an event based on observations. Also known as `a posteriori' probability.
- Projection pursuit regression
- is a generalisation of the feedforward network in that it allows more than one type of activation function in the hidden layer. These non-homogeneous activation functions are data-dependent and constructed during learning.
- Regularisation
- A class of methods designed to avoid overfitting to the training data by enforcing smoothness of the fit.
- Ridge regression.
- The precision of least-squares estimates gets worse with an increase in dependence between the input variables. Ridge regression estimators are more precise in those situations and are obtained as the estimators whose distance to an ellipsoid (the `ridge') centred at a least-squares estimate from the origin of a parameter space is a minimum.
- Ridging, Constrained.
- Optimisation procedure in which some norm of the weights is constrained to a specific value.
- Ridging, Penalised.
- Optimisation procedure in which the cost function is augmented by a penalty term.
- Ridging, Smoothed.
- Optimisation procedure in which noise is introduced in the inputs.
- Riesz representation theorem.
- Let $x^*$ be a bounded linear functional on the Banach space $C_{\mathbb{R}}([a,b])$. Then there is a real-valued function $\alpha$ of bounded variation on $[a,b]$ such that $x^*(f) = \int_a^b f \, d\alpha$ for all $f \in C_{\mathbb{R}}([a,b])$. Further, if $x^*$ is a positive linear functional, then $\alpha$ is increasing on $[a,b]$.
- RMS error, Eorms,
- is computed by summing the output layer errors for all examples in a training or test set, dividing the sum by the total number of examples and the number of the output layer neurons, and taking the square root of the resultant. The output layer error is computed by summing the squares of the individual neuron errors with respect to the desired output. An individual output-layer neuron's error is set to zero if it is less than the margin.
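A sketch of the computation described above, assuming `targets` and `outputs` are NumPy arrays of shape (number of examples, number of output neurons).

```python
import numpy as np

def rms_error(targets, outputs, margin=0.0):
    """RMS error over all examples and output neurons, with a dead-band margin."""
    err = targets - outputs
    err[np.abs(err) < margin] = 0.0          # errors within the margin are ignored
    n_examples, n_outputs = targets.shape
    # Sum of squared neuron errors over all examples, normalised and square-rooted.
    return np.sqrt(np.sum(err ** 2) / (n_examples * n_outputs))
```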
- Sampling with replacement
- may result in successive samples that are not mutually exclusive; some of the examples may never appear in any of the samples, and there may be repetitions within an individual sample.
- Separates points.
- A family of functions A separates points on a set S if for every $x, y \in S$, $x \ne y$, there exists $f \in A$ such that $f(x) \ne f(y)$.
- Set, Closed.
- A subset M of metric space N is a closed set if it contains each of its limit points.
- Set, Finite.
- A is finite if all of its elements can be displayed as $\{a_1, a_2, \ldots, a_n\}$ for some integer n.
- Set, Open
- A subset G of the metric space X is open if each point of G is the centre of some open sphere contained in G.
- Shattered.
- If a set of functions F includes all possible dichotomies on a set S of points, then S is said to be shattered by F.
- Shrinkage.
- The difference between the training set accuracy of a network and its accuracy on a test set.
- Sigmoidal functions.
- Definitions vary but are generally taken to be bounded, monotone, and continuous, e.g. logistic and tanh(·) functions.
- Simulated annealing
- is a stochastic optimisation technique inspired by the physical process of annealing.
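A compact sketch of simulated annealing for a generic cost function; the geometric cooling schedule and the user-supplied `neighbour` move are illustrative choices, not prescriptions from the entry.

```python
import math
import random

def simulated_annealing(cost, x0, neighbour, t0=1.0, cooling=0.99, steps=10000):
    """Minimise cost() starting from x0, proposing candidate moves with neighbour()."""
    x, current = x0, cost(x0)
    best, best_cost = x, current
    t = t0
    for _ in range(steps):
        candidate = neighbour(x)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current
        # Always accept improvements; accept worse moves with probability exp(-delta / t).
        if delta < 0 or random.random() < math.exp(-delta / t):
            x, current = candidate, candidate_cost
            if current < best_cost:
                best, best_cost = x, current
        t *= cooling  # gradually lower the temperature
    return best, best_cost
```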
- Skip-layer synapses.
- Synapses connecting neurons in two non-adjacent layers. Also known as short-cut synapses. Known as main effects in the statistical literature.
- Smoothing spline modelling
- is piecewise approximation by polynomials of degree n with the requirement that the derivatives of the polynomials are continuous up to degree n-1 at the junctions.
- Softmax.
- The purpose of the softmax activation function is to make the sum of the output neuron responses equal to one, so that the outputs are interpretable as posterior probabilities. Also known as the multiple-logistic function.
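A sketch of the softmax transform described above; subtracting the maximum activation before exponentiation is a common numerical-stability precaution and does not change the result.

```python
import numpy as np

def softmax(activations):
    """Rescale output-neuron activations so they are positive and sum to one."""
    shifted = activations - np.max(activations)  # improves numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)
```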
- Space, Banach
- is a complete normed linear space.
- Space, Compact
- is a topological space in which every open cover has a finite subcover.
- Space, Complete metric
- is a metric space in which every Cauchy sequence is convergent.
- Space, Conjugate.
- N* is the set of all continuous linear transforms of the normed linear space N into \mathbb R.
- Space, Euclidean
- is the metric space $(\mathbb{R}^n, d)$ such that $d(x,y) = \left(\sum_{r=1}^{n}(x_r - y_r)^2\right)^{1/2}$.
- Space, Hausdorff
- is a topological space, T, in which any two given distinct points x, y are such that there exist disjoint open subsets U, V containing x, y respectively.
- Space, Lp
- consists of all measurable functions f defined on a measure space M with measure $\mu$ which are such that $|f(x)|^p$ is integrable, with the norm taken as $\|f\|_p = \left(\int |f(x)|^p \, d\mu(x)\right)^{1/p}$.
- Space, Normed linear
- over $\mathbb{R}$ is a pair $\{E, \|\cdot\|\}$, where $E$ is a linear space over $\mathbb{R}$ and $\|\cdot\|$ is a norm on $E$. A normed linear space is a metric space with the metric $\|x - y\|$.
- Space, Topological
- is a pair (X, T), where X is a non-empty set and T is a collection of subsets of X that is closed under union and intersection operations.
- Span of a set of functions.
- For any function $f \in C(\mathbb{R})$ and $r > 0$, $f|_{(-r,r)}$ denotes the restriction of $f$ to the interval $(-r,r)$, and for any class of functions $F$, $F|_{(-r,r)}$ denotes $\{f|_{(-r,r)} : f \in F\}$. For any $F$ defined on a set $O$, the span of $F$, $\mathrm{sp}(F)$, denotes the closure of the set of finite linear combinations of elements of $F$ in the topology of uniform convergence on compact subsets of $O$.
- Sphere, Open.
- $S_r(x_0)$ with centre $x_0$ and radius $r$ is the subset of the metric space X with metric d defined by $S_r(x_0) = \{x : d(x, x_0) < r\}$.
- Stationary, strongly.
- If a random variable X is strongly stationary then the distribution of X(t) is independent of the time t.
- Subset, Proper.
- A is a proper subset of B if $A \subset B$ and $B \not\subset A$.
- Subspace, Linear
- A non-empty subset $M$ of a linear space is a linear subspace if $(x + y) \in M$ whenever $x \in M$ and $y \in M$, and if $\alpha x \in M$ whenever $x \in M$, where $\alpha$ is a scalar.
- Supremum
- is the least upper bound for a set.
- Synapse
- is a measure of the effect that a neuron's output has on the output of another neuron at the other end of the synapse. Also known as connection, edge, and weight.
- Testing
- is the process of verifying the function of a trained network against a set of examples which is different from the training examples set.
- Train-and-test.
- A random sample containing one half of the total number of examples is selected. This subset is used to train the network while the remaining examples are used to test the network once it has been trained. The performance of the trained network on the test set is an estimate of the generalisation performance metric.
- Training
- See Learning.
- Training example
- is a pair: an input vector, and the desired response to that input vector.
- Vanishes at no point.
- A family of functions A vanishes at no point of the set S if for each $x \in S$ there exists $f \in A$ such that $f(x) \ne 0$.
- Weight
- is the value of a synapse or an offset.
- Weight decay
- is a common regularisation technique used in feedforward network training in which the cost-function is augmented with a term which penalises large weight values.
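A sketch of the augmented cost and the corresponding extra gradient term; the penalty coefficient `lam` is an illustrative choice.

```python
import numpy as np

def decayed_cost(base_cost, weights, lam=1e-4):
    """Original cost plus a penalty proportional to the sum of squared weights."""
    return base_cost + lam * np.sum(weights ** 2)

def decayed_gradient(base_gradient, weights, lam=1e-4):
    """Gradient of the augmented cost: each weight is pulled towards zero."""
    return base_gradient + 2.0 * lam * weights
```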
- Weight depth
- is the number of binary bits in a weight.
- Weight elimination
- is a regularisation technique used in feedforward network training in which the cost-function is augmented with a term which penalises the number of non-zero weights.
- Weight perturbation
- is a hardware-friendly alternative to BP learning. In this method, all of the weights are perturbed in turn and the associated change in the output of the network is used to approximate local gradients.
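A sketch of gradient estimation by weight perturbation, assuming a hypothetical `cost(weights)` routine that runs the network forward and returns the cost for the current weights.

```python
import numpy as np

def perturbation_gradient(weights, cost, eps=1e-4):
    """Approximate dE/dw for every weight by perturbing it and re-evaluating the cost."""
    grad = np.zeros_like(weights, dtype=float)
    base = cost(weights)
    for i in range(weights.size):
        perturbed = np.array(weights, dtype=float)
        perturbed.flat[i] += eps                         # perturb one weight in turn
        grad.flat[i] = (cost(perturbed) - base) / eps    # finite-difference estimate
    return grad
```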
- Weight sharing
- is a regularisation technique used in feedforward network training in which the cost-function is augmented with a term which penalises the number of independent weights.