Intro to Neural Nets

Altaf Khan

© Altaf Khan 2005-2006

This article focuses on the most common type of neural net: the multilayer feedforward net. Its structure, functionality, and applications, and its connections with biological and statistical models, are discussed. Answers to questions like `what is it capable of?' and `how difficult is it to train?' are also provided. An overview of methods for network training is then presented. The article concludes with a short note on the major milestones in the development of neural nets.

Artificial neural nets come in many shapes and forms, but here the discussion will be restricted to the most common type: the multilayer feedforward neural net. From here on, it will be referred to simply as the "neural net". Neural nets are general-purpose modeling devices which can extract the functionality of an underlying process from examples generated by that process. Each of these examples consists of an input vector and the response of the process to that input vector. Neural nets are non-parametric in their modeling ability, in the sense that they do not demand any structural information about the process that they are modeling, merely its characteristics in the form of a set of input/output examples. They "let the data speak for itself".

The fundamental processing element used in forming a neural net is conventionally known as the artificial neuron because its function is somewhat analogous to the biological neuron. From now onwards, the artificial neuron will be referred to as "neuron" only. Each neuron calculates the dot product of the incoming signal vector î with its synaptic strength vector û, adds the offset Ø to the resultant, and outputs a value which is calculated by applying a nonlinear activation function to that sum. Two of the most common such activation functions are logistic and hyperbolic tangent.
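
The computation performed by a single neuron can be sketched in a few lines of Python. This is only an illustration of the description above; the function names and the numeric values are assumptions chosen for the example, not part of any particular library.

```python
import math


def logistic(s):
    """Logistic activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def neuron_output(inputs, weights, offset, activation=math.tanh):
    """Dot product of inputs and weights, plus the offset, passed through the activation."""
    s = sum(i * w for i, w in zip(inputs, weights)) + offset
    return activation(s)


# Hypothetical values, chosen only for illustration.
x = [0.5, -1.0, 2.0]   # incoming signal vector
w = [0.1, 0.4, -0.2]   # synaptic strength vector
theta = 0.3            # offset

y_tanh = neuron_output(x, w, theta)                           # output in (-1, 1)
y_logistic = neuron_output(x, w, theta, activation=logistic)  # output in (0, 1)
```

The two activation functions differ mainly in their output range, which matters later when the encoding of the desired outputs is chosen.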

A number of neurons can be connected to form a multilayer neural net. Consider a network with d inputs, q neurons in the middle layer and a single neuron in the output layer. All inputs are connected to all neurons in the middle layer, and the outputs of all middle-layer neurons form the input of the output-layer neuron. This d:q:1 neural net has only a single output, but in general it can have any number of outputs. For analytical studies, however, one needs to study only this single-output prototype, as any k-output neural net can be represented as an ensemble of k separate, single-output neural nets.
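
As a rough sketch, the forward pass of such a d:q:1 net might look as follows in Python. The hyperbolic tangent activation in both layers and the weights of the 2:3:1 example are assumptions made purely for illustration.

```python
import math


def forward(x, hidden_weights, hidden_offsets, output_weights, output_offset):
    """Forward pass of a d:q:1 net with tanh hidden neurons and a tanh output neuron."""
    # Every input feeds every hidden neuron: one weight row per hidden neuron.
    hidden = [
        math.tanh(sum(xi * wi for xi, wi in zip(x, row)) + b)
        for row, b in zip(hidden_weights, hidden_offsets)
    ]
    # The hidden outputs form the input vector of the single output neuron.
    s = sum(h * v for h, v in zip(hidden, output_weights)) + output_offset
    return math.tanh(s), hidden


# Hypothetical 2:3:1 net (d = 2 inputs, q = 3 hidden neurons).
W = [[0.2, -0.4], [0.7, 0.1], [-0.3, 0.5]]
b = [0.0, 0.1, -0.1]
v = [0.6, -0.2, 0.9]
c = 0.05

y, hidden = forward([1.0, -1.0], W, b, v, c)
```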

This particular neural net, having two layers of neurons, is termed a 2-layer neural net. The input layer is not counted as a layer as it does not perform any processing and consists of fan-out units only. The middle layer is termed the hidden layer, due to its lack of direct connections with the outside environment. The edges connecting the nodes to each other are called synapses. The term weight is used to refer both to the synapses and offsets. The neurons in the hidden layer must have nonlinear activation functions for the neural net to be able to perform nonlinear mappings. If the hidden neuron activation functions are linear, the 2-layer neural net can be collapsed into a 1-layer d:1 neural net, commonly known as a perceptron.
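
The collapse of a net with linear hidden neurons into a single neuron can be checked numerically. The sketch below, with made-up weights, compares the sum reaching the output activation in the 2-layer net with that of the equivalent d:1 net; the function names are assumptions for this illustration.

```python
def two_layer_linear(x, W, b, v, c):
    """Sum reaching the output activation when the hidden neurons are linear (identity)."""
    hidden = [sum(xi * wi for xi, wi in zip(x, row)) + bk for row, bk in zip(W, b)]
    return sum(h * vk for h, vk in zip(hidden, v)) + c


def collapse(W, b, v, c):
    """Weights and offset of the single neuron equivalent to the linear 2-layer net."""
    d = len(W[0])
    w_eff = [sum(v[k] * W[k][j] for k in range(len(W))) for j in range(d)]
    b_eff = sum(vk * bk for vk, bk in zip(v, b)) + c
    return w_eff, b_eff


# Hypothetical 2:3:1 net with linear hidden neurons.
W = [[0.5, -1.0], [2.0, 0.25], [0.0, 1.0]]
b = [0.1, -0.2, 0.3]
v = [1.0, -0.5, 2.0]
c = 0.05
x = [1.0, 2.0]

w_eff, b_eff = collapse(W, b, v, c)
full = two_layer_linear(x, W, b, v, c)
single = sum(xi * wi for xi, wi in zip(x, w_eff)) + b_eff
```

Both `full` and `single` hold the same value (up to rounding) for any input, which is why the hidden layer adds nothing unless its activation functions are nonlinear.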

This article deals exclusively with the 2-layer neural net, which from now on is referred to simply as the neural net. Neural nets with many hidden layers may have the advantage of more powerful mapping capabilities - they can implement a mapping with fewer weights than a 2-layer network of similar performance. They are not, however, as well understood as 2-layer networks and are harder to train.

Application Examples

It has been 20 years since the pioneering work of Rumelhart et al., in which learning in multilayer neural nets was first introduced. A large number of papers on the application of neural nets have since been published, in areas ranging from medicine, finance, the process industry, high energy physics, the automotive sector, telecommunications, and robotics to aerospace. These applications are generally divided into two groups: function approximation and classification. This division is based on the type of output required to accomplish the task. If the output values are continuous, the neural net is performing function approximation, whereas if the outputs are restricted to a finite set of values, it is doing classification. The examples presented below highlight applications in high energy physics, the automotive sector, the meat industry, and nuclear material processing.

A fully parallel hardware implementation of a neural net is being used for the closed-loop real-time control of the shape of the magnetic confinement field of a high-temperature plasma in a Tokamak fusion reactor. This neural net, having digitally stored weights but analogue signal paths, and a bandwidth of 20 kHz, was trained using analytically generated data to simultaneously control the currents in all of the coils generating the confinement magnetic field.

An example of an application in automotive safety is Delco Electronics' supplemental inflatable restraint (air bag) controller, which uses a neural net to distinguish between deployment and nondeployment events. Time series data collected from instrumented vehicle crash tests, along with the requirements for whether and when the air bag must be inflated in different types of situations, were used to train the neural net.

The Danish Meat Research Institute's neural net-based classifier determines the value of cow carcasses with the help of weight and visual information. It employs three neural nets to extract the class (cow, bull, or heifer) and shape of each carcass, which are then used in a linear model to determine the payment to be made to the farmer.

Urenco (Capenhurst) Ltd. employs a pair of neural nets for the closed-loop control of copper lasers, which are used in isotope separation for uranium enrichment. These neural nets have been trained on the actions of experienced human operators to avoid certain load discharge conditions that reduce the lifetime of expensive laser modulator components, thereby improving efficiency through lower running costs and less down time.

Biological -vs- Artificial Neural Networks

Although the original inspiration for work in the artificial neural network area was the human brain, the emphasis has since shifted from biological plausibility to usefulness as a computational tool. Artificial multilayer neural nets do, however, share some features with the biological brain: both are layered structures formed by a number of homogeneous and simple processing elements - neurons. Both types of neurons have many inputs and produce an output signal which is a nonlinear function of the dot product of inputs and weights. The key factor that distinguishes neural networks, both biological and artificial, from traditional computing paradigms is that processing is asynchronous and local to the individual neurons. The major emergent properties that these two classes of networks have in common are associative memory and a degree of fault tolerance.

The main barrier to the wide acceptance of artificial neural nets as plausible analogs of the real ones is their learning procedures. It is well known that synapses in a human brain change with experience, but the exact mechanisms are not well understood because of the complex and distributed nature of the system. There is some agreement, however, that these mechanisms are nothing like the popular procedures used for training neural nets. This is because those procedures involve some global computation steps, which violates the strictly local theories of learning in biological networks.

The Statistical Connection

Many neural network paradigms have their analogs in the statistical arena. The two fields, however, cannot be considered identical twins - more like step-brothers. The main differences between them are perhaps more cultural than technical: statistics has its roots in mathematics, neural networks in engineering, biology, and computer science. Statisticians are usually conservative and pessimistic, neural folk enthusiastic. Statisticians are often concerned with small samples, neural workers with large data sets. Statisticians mostly deal with static data, while neural workers are also interested in in-situ learning. The majority of statisticians do mathematical modeling, while neural workers run computer solutions. Researchers in the artificial neural network field also differ in having a long-term goal of designing artificially intelligent systems. Both communities can, however, benefit from each other's expertise: for example, only recently has the neural network community started to benefit from the rigorous frameworks, such as Bayesian techniques, available in statistics.

The multilayer feedforward neural net, the only neural paradigm discussed here, has a direct analogue in statistics: projection pursuit regression. Projection pursuit is a generalization of the neural net in that it allows more than one type of activation function in the hidden layer. These non-homogeneous activation functions are data-dependent and constructed during learning. Projection pursuit learning differs from conventional neural net learning in that it is performed one hidden neuron at a time: the output weight of a hidden neuron is optimized first, followed by the shape of its activation function and then its input weights. It is a mathematically proven fact that the neural net can approximate almost all functions to any desired accuracy. Projection pursuit regression, being a generalization of the neural net, must also be able to approximate almost all functions, and therefore does not hold any advantage in that area. The advantage may, however, lie in its ability to construct more compact representations of arbitrary training data sets due to its freedom in having data-dependent activation functions.

Approximation Properties

2-layer neural nets are universal approximators in the space of Borel measurable functions. In other words, a 2-layer neural net can, given enough training data and enough hidden neurons, approximate virtually any function of interest to any desired degree of accuracy. This is a very powerful statement and provides great comfort to experimentalists in reinforcing their beliefs about the capabilities of neural nets. This result, however, guarantees only the existence of an approximating network and does not give any clues about how to construct one.

Correspondingly, loading a neural net, that is, finding the set of optimal weights for a neural net for a given set of training data, is an intractable problem. This problem has been shown to belong to a class of very tough problems, called NP-complete problems, which are generally accepted to have no polynomial-time solution. With the best known methods, the time needed to solve problems belonging to this class grows exponentially with the size of the problem in the worst case. Intractability of loading, however, is the case for a fixed-size neural net only - if hidden neurons can be added and eliminated during training then loading is tractable. Moreover, fixed-size neural nets can be trained to achieve `good enough' solutions instead of optimal ones, which, in practice, is not as time consuming as achieving optimal solutions, and is certainly better than having no solution at all.

Neural Net Training: Prerequisites

The cardinal choice in the training of a neural net is the number of neurons in the hidden layer. This number is a function of the complexity of the concept to be learned from the training data. In most cases, the complexity of that concept is not known prior to training. In those cases, one of the following three schemes can be employed: training several networks, using ontogenic techniques, or reducing the complexity of a large network. The most frequently used technique is to start by training the simplest network and then try several networks with an increasing number of hidden neurons until the required performance is achieved.
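
The first scheme, trying networks of increasing size, amounts to a simple loop. In the Python sketch below, `train_and_score` is a hypothetical callable (not defined in this article) that trains a net with the given number of hidden neurons and returns its error; the stand-in used here only returns made-up numbers.

```python
def smallest_adequate_net(train_and_score, max_hidden, target_error):
    """Try nets with 1, 2, ... hidden neurons; stop at the first that is good enough."""
    for q in range(1, max_hidden + 1):
        error = train_and_score(q)
        if error <= target_error:
            return q, error
    return None  # no net with up to max_hidden neurons reached the target


# Hypothetical stand-in for a real training run: the error simply falls as q grows.
def fake_train_and_score(q):
    return 1.0 / q


result = smallest_adequate_net(fake_train_and_score, max_hidden=10, target_error=0.25)
# result is (4, 0.25): four hidden neurons are the fewest that meet the target
```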

Ontogenic schemes are self-constructing techniques, of which the Cascade-correlation method is the best-known example. This method starts with a network without any hidden neurons and systematically increases their number during training until the required performance is achieved. Finally, one can also start with a large network and try to decrease its complexity during training to match the complexity of the concept being learned. The thinking behind this technique can be explained by the analogy of fitting an arbitrarily shaped object into a container which is only slightly larger than the object in all dimensions. In general, fitting that object into the container will be quite difficult, if possible at all. It is simpler to start with a large container, place the object in it, and then somehow shrink the container to achieve a very snug fit.

Pre-training data processing can be used to reduce training time and to improve the quality of learning. For example, standardizing the inputs to zero mean and small magnitude makes all the input vectors cluster around the origin. Initializing the neural net at the start of training with small, randomly distributed weights makes the start-up decision boundaries bunch up around the origin as well. Together, these two steps give the neural net the best chance of fast learning. Moreover, Le Cun et al. have shown analytically that standardizing the inputs to zero mean improves the convergence properties of some learning procedures.
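
A minimal sketch of input standardization in Python follows. Scaling each input variable to unit standard deviation is an assumption made here as one common way of obtaining `small magnitudes'; the function name and the sample data are likewise illustrative.

```python
import math


def standardize(rows):
    """Shift each input variable to zero mean and scale it to unit standard deviation."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # constant variable: avoid dividing by zero
    scaled = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    return scaled, means, stds


# Made-up inputs on very different scales.
raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled, means, stds = standardize(raw)
```

The returned `means` and `stds` would need to be kept so that inputs presented after training can be transformed in the same way.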

For discrete inputs or outputs, categorical variables should not be treated as continuous variables. For instance, consider an input variable which represents shapes and can have one of three values: round, square, or octagon. This variable should not be represented as a single tri-state input, as that imposes an ordering in which square is somehow greater than round and less than octagon. It should instead be represented as a triplet with three possible states, (1,0,0), (0,1,0), and (0,0,1), for round, square, and octagon respectively.
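
The triplet encoding of the shape variable can be written as a short Python helper. The function name `one_hot` is simply a convenient label for this sketch.

```python
SHAPES = ["round", "square", "octagon"]


def one_hot(value, categories=SHAPES):
    """Return a tuple with a 1 in the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return tuple(1 if c == value else 0 for c in categories)


encoded = [one_hot(s) for s in SHAPES]
# encoded is [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
```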

In classification problems, outputs are generally encoded as {0,1} or {1,-1}. Because the weight modifications computed by some learning procedures are proportional to the output value of a neuron, the first encoding has the disadvantage that the weight modifications calculated by the learning procedure for class 0 are smaller than the ones for class 1. The bipolar-binary encoding does not have this problem but suffers from the drawback that no learning takes place for the most `confused' neurons, i.e. the ones with output values around zero. On balance, the bipolar-binary encoding should be preferred, as in that case the learning procedure treats both output states in the same fashion during training.

Neural Net Training: Procedures

A training procedure is a collection of heuristics that finds appropriate weights for a particular neural net, given a set of training examples, such that an appropriate cost function, also known as an error measure, is minimized. The least-squares measure, which emphasizes the larger errors, is the most common one. Neural net training procedures can be divided into two major categories: those based on steepest descent methods and those influenced by stochastic optimization techniques.
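
The way the least-squares measure emphasizes larger errors can be seen in a small Python example. The factor of one half is a common convention assumed here, not something the measure requires.

```python
def least_squares_cost(targets, outputs):
    """Half the sum of squared differences between desired and actual outputs."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))


one_large = least_squares_cost([0.0, 0.0], [2.0, 0.0])  # a single error of 2 -> 2.0
two_small = least_squares_cost([0.0, 0.0], [1.0, 1.0])  # two errors of 1 each -> 1.0
```

Although the total absolute error is the same in both cases, the single large error costs twice as much as the two small ones.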

Steepest descent methods attempt to minimize the cost function by taking small steps on the error surface in the direction in which the error falls most steeply, that is, opposite to the gradient. This process is analogous to a near-sighted skier trying to find the quickest way to the base of a mountain by always skiing down the steepest slope in sight. A computationally efficient way of implementing this process in a neural net, called error backpropagation (BP), was popularized by Rumelhart et al.

In the batch version of BP, the weights are updated once after calculating the net change in weights based on all training examples, whereas in the on-line version weights are updated after the presentation of every training example. Both versions are guaranteed to converge to a solution in appropriate circumstances. The latter version is, however, faster for large training sets having some degree of information redundancy among examples. Two of the most common parameters related to BP are the learning rate and the momentum. The learning rate determines the size of the weight modification at each training step, and the momentum controls the effect of the weight modification of the previous step on the weight modification of the current step. If the learning rate is small and the momentum is close to 1, on-line BP approximates batch BP. As training examples are usually presented to a network in a random order, on-line BP does its search in the weight space in a stochastic manner, and is therefore less prone (compared with batch BP) to getting stuck in local minima of the error surface.
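
The following Python sketch shows on-line BP with a learning rate and momentum for a d:q:1 net. It is an illustration under stated assumptions rather than a reference implementation: tanh activations in both layers, a least-squares cost of 0.5*(t - y)**2 per example, bipolar targets, and arbitrarily chosen parameter values and names.

```python
import math
import random


def forward(x, W, b, v, c):
    """Outputs of the net (y) and of its hidden neurons (h) for input x."""
    h = [math.tanh(sum(xj * wj for xj, wj in zip(x, row)) + bk) for row, bk in zip(W, b)]
    y = math.tanh(sum(hk * vk for hk, vk in zip(h, v)) + c)
    return y, h


def gradients(x, t, W, b, v, c):
    """Backpropagate the error 0.5*(t - y)**2 of one example to every weight."""
    y, h = forward(x, W, b, v, c)
    delta_out = -(t - y) * (1.0 - y * y)          # error signal at the output neuron
    delta_h = [delta_out * vk * (1.0 - hk * hk) for vk, hk in zip(v, h)]
    gW = [[dk * xj for xj in x] for dk in delta_h]
    return gW, list(delta_h), [delta_out * hk for hk in h], delta_out


def train_online(examples, q, epochs=300, lr=0.1, momentum=0.5, seed=0):
    """On-line BP: weights change after every example, presented in random order."""
    rng = random.Random(seed)
    d = len(examples[0][0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(q)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(q)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(q)]
    c = rng.uniform(-0.5, 0.5)
    dW = [[0.0] * d for _ in range(q)]            # previous weight modifications
    db, dv, dc = [0.0] * q, [0.0] * q, 0.0
    order = list(examples)
    for _ in range(epochs):
        rng.shuffle(order)
        for x, t in order:
            gW, gb, gv, gc = gradients(x, t, W, b, v, c)
            for k in range(q):
                for j in range(d):
                    dW[k][j] = -lr * gW[k][j] + momentum * dW[k][j]
                    W[k][j] += dW[k][j]
                db[k] = -lr * gb[k] + momentum * db[k]
                b[k] += db[k]
                dv[k] = -lr * gv[k] + momentum * dv[k]
                v[k] += dv[k]
            dc = -lr * gc + momentum * dc
            c += dc
    return W, b, v, c


# Made-up training set: logical OR with bipolar-binary targets.
OR_DATA = [([0.0, 0.0], -1.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
trained = train_online(OR_DATA, q=2)
```

A batch version would instead sum the gradients over all examples and apply a single update per epoch.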

Weight perturbation is an alternative to BP learning. In this method, all of the weights are perturbed in turn and the associated change in the output of the network is used to approximate local gradients. This technique requires only feedforward calculations for its operation, which simplifies its implementation in hardware. It lacks the mathematical efficiency of BP, however, and therefore requires a large number of epochs to reach acceptable solutions.
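
A minimal Python sketch of the idea is given below. It assumes that the effect of each perturbation is measured on the cost computed from the network output, and it uses a single tanh neuron with made-up values as the network; the function names and the step size are illustrative choices.

```python
import math


def perturbation_gradient(cost, weights, step=1e-4):
    """Approximate d(cost)/d(weight) for each weight using feedforward evaluations only."""
    base = cost(weights)
    grad = []
    for i in range(len(weights)):
        perturbed = list(weights)
        perturbed[i] += step                      # perturb one weight at a time
        grad.append((cost(perturbed) - base) / step)
    return grad


# Illustrative cost: squared error of one tanh neuron on one made-up example.
def example_cost(w):
    x, target = [1.0, -2.0], 0.5
    y = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])  # w[2] plays the role of the offset
    return 0.5 * (target - y) ** 2


approx_grad = perturbation_gradient(example_cost, [0.1, 0.2, 0.0])
```

Each gradient estimate needs one extra feedforward pass per weight, which is where the loss of efficiency relative to BP comes from.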

Stochastic methods are the less common alternative to steepest descent. A popular representative of these methods is simulated annealing, which is analogous to the physical process of annealing. It differs from methods based upon steepest descent in that the near-sighted skier introduced earlier is not always going downhill, but going downhill only `most of the time'. This way the skier is less likely to get trapped in local minima. In this method, weight modifications are made permanent if the new value of the cost function is lower than or equal to the old one. If the new value is higher, then the weight change is accepted with a probability which diminishes with the number of epochs. Besides the global minimum search advantage, stochastic techniques have the added advantage that they do not require gradient computations, which is attractive from the hardware implementation point of view. The main drawback of this class of techniques is speed: they generally require a large number of epochs for convergence, and each epoch generally requires the recalculation of the cost function for every training example and every weight change.
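
The acceptance rule can be sketched in Python as follows. The exponential acceptance probability, the temperature schedule t0 / (1 + epoch), the size of the random weight changes, and the bookkeeping of the best weights seen so far are all assumptions made for this illustration; the text above only requires that the probability of accepting a worse solution diminishes with the number of epochs.

```python
import math
import random


def anneal(cost, weights, epochs=200, step=0.5, t0=1.0, seed=0):
    """Simulated annealing over a list of weights; returns the best weights found."""
    rng = random.Random(seed)
    current = list(weights)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for epoch in range(epochs):
        temperature = t0 / (1.0 + epoch)          # falls as the epochs go by
        for i in range(len(current)):
            candidate = list(current)
            candidate[i] += rng.uniform(-step, step)
            candidate_cost = cost(candidate)
            worse_by = candidate_cost - current_cost
            # Always accept an equal or lower cost; accept a higher one only sometimes.
            if worse_by <= 0 or rng.random() < math.exp(-worse_by / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = list(current), current_cost
    return best, best_cost


# Made-up cost function with its minimum at (1, -2).
def bowl(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


best_w, best_c = anneal(bowl, [0.0, 0.0])
```

Note that no gradient is computed anywhere: only the cost function itself is evaluated, once for every candidate weight change.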

Steepest descent methods are fast but can get trapped in local minima, whereas stochastic techniques are able to find global minima but are slow. Methods that combine the speed of steepest descent with the global optimization character of stochastic techniques are an attractive alternative. An example of this approach is a modified version of on-line BP with a diminishing learning rate, the modification being the addition of decreasing amounts of random noise to the weights at each BP weight modification step. This procedure is guaranteed to converge to a global minimum.

Generalization Performance

Generalization performance - the accuracy of a trained network on a set of data which is similar to but not the same as the training data set - is the key metric which determines a learning paradigm's usefulness. This metric can be maximized by selecting a data set which completely represents the concept to be learned, and using that set, along with a global-minimum finding procedure, to train a network having a complexity that matches that of the concept to be learned.

A trained network will be a good generalizer if it has learned the concept embedded in the training data. Training a network to be a good generalizer is not a trivial task because most real-life applications demand training with noisy data. A good generalizer usually has a smooth input to output mapping, which generally means that it will not have many large weights. More complex networks give a better fit to training data but are not good generalizers. The best generalizers are neither too complex nor too simple, but match the complexity of the problem exactly. Training the best generalizer is quite difficult, just like fitting an object into a container which exactly matches the dimensions of the object.

Poor generalization results from the network over- or under-fitting the training data. An over-fit is due to the network having a higher complexity than the concept embedded in the training data. This causes the network to essentially become a `look-up table' for the training data: the network behaves very well for the training data, but gives erroneous responses to inputs which are nearby but not actual training data. An under-fit is caused by the network having a complexity lower than that of the concept embedded in the training data. In general, it is harder to detect an under-fit during training than an over-fit. The techniques for avoiding an over-fit, and consequently improving generalization performance, are called regularization techniques.
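
Weight decay, which the notes at the end of this article name as one technique for shrinking a network, discourages large weights and can serve as a simple illustration. The Python sketch below assumes the common form in which a penalty of (decay / 2) times the sum of squared weights is added to the cost; the function names and parameter values are illustrative.

```python
def penalized_cost(cost, weights, decay=0.01):
    """Data-fit cost plus the weight decay penalty on large weights."""
    return cost + 0.5 * decay * sum(w * w for w in weights)


def weight_decay_step(weights, gradients, lr=0.1, decay=0.01):
    """One steepest descent step on the penalized cost."""
    return [w - lr * (g + decay * w) for w, g in zip(weights, gradients)]


# With a zero data gradient, every weight shrinks by the factor (1 - lr * decay).
shrunk = weight_decay_step([2.0, -4.0], [0.0, 0.0], lr=0.1, decay=0.5)
```

Weights that the data do not actively support therefore drift toward zero, which favours the smooth mappings associated with good generalization.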

Hardware Implementation

Although most neural nets are implemented in software on serial computers, a very attractive feature of neural nets, i.e. fast speed of execution, can only be achieved through their realization in parallel hardware: general purpose or customized, optical or electronic, analogue or digital, or any combination thereof. Custom hardware has the advantage in speed and the disadvantage in cost. Electronic hardware is generally compact and cost-effective. Optical hardware has the advantage of free-space connectivity. Analogue electronic systems are usually very fast, but suffer from susceptibility to noise, manufacturing difficulties, and lack of non-volatile on-chip adjustable weights. Digital electronic systems have the drawbacks of limited resolution computation and slow speed, but they are less susceptible to noise and have non-volatile adjustable weights. Their main advantage, however, lies in the ease of implementation due to the availability of a wide variety of mature VLSI tools and manufacturing facilities. This is the reason for their popularity, as is clear from the large number of reported systems.

Historical Note

Artificial multilayer neural nets draw their inspiration from studies on the structure of biological nervous systems. Such artificial layered structures were first investigated in the late 50's. Their usefulness as iterative learning machines did not manifest itself in that era due to the lack of a suitable learning algorithm. The breakthrough came in 1986 with the publication of two volumes by the PDP group titled Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Although ideas similar to the BP learning rule had been presented earlier, the rule became popular only after the publication of these volumes. The next big step in the continued development of the neural net was the development, in the late 80's, of proofs of the universal approximation capability of the 3-layer neural net and later of the 2-layer neural net. The last major achievement in this field has been the integration of neural nets in a Bayesian framework. This approach to the selection of architecture and learning parameters is based on Bayesian statistics and holds the promise of delivering optimal generalization performance.


  1. An offset for the output neuron is not required for the `universal approximation' property of the neural net to hold, but has customarily been used by practitioners.
  2. Only an identity activation function is required in the output neuron for the `universal approximation' property of the neural net to hold. It is however customary to use a logistic or hyperbolic tangent function for `classification tasks'. Having a logistic function as the activation function in output neurons also helps in the probabilistic interpretation of their responses.
  3. In-situ learning is performed by a deployed network which has already been trained. The deployed network constantly adapts its weights with respect to changes in the input/output behavior of its target task.
  4. Just about all functions that one may encounter are Borel measurable. Functions that are not Borel measurable do exist but are known to mathematicians only as mathematical peculiarities.
  5. Examples of such `shrinking' techniques in regard to neural nets are weight decay and weight elimination.
  6. The error surface is the plot of the cost function with respect to all of the weights in a network.
  7. On-line learning differs from the in-situ learning mentioned earlier in that the latter is a property of a network, requiring the deployed network to have adaptive weights, whereas the former is a property of the learning procedure, requiring the weights to be updated on the presentation of every example.
  8. Local minima are points of zero gradient on an error surface which are not global minima. Global minima are the points of minimum error on an error surface.
  9. The idea behind Hebbian learning is that the synapse between two neurons should be strengthened if they fire simultaneously.
  10. The Bayesian approach differs from the conventional `frequentist' approach to statistics in that it allows the probability of an event to be expressed as a `degree of belief' in a particular outcome instead of basing it solely on a set of observations.