Newsgroups: comp.ai.neural-nets,comp.answers,news.answers
Path: Eunet.yu!EU.net!newsfeed.internetmci.com!news.msfc.nasa.gov!news.sgi.com!news.sprintlink.net!news-stk-200.sprintlink.net!news.sprintlink.net!news-chi-13.sprintlink.net!interpath!news.interpath.net!sas!newshost.unx.sas.com!hotellng.unx.sas.com!saswss
From: saswss@unx.sas.com (Warren Sarle)
Subject: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <nn2.posting_841287617@hotellng.unx.sas.com>
Supersedes: <nn2.posting_838609217@hotellng.unx.sas.com>
Approved: news-answers-request@MIT.EDU
Date: Thu, 29 Aug 1996 03:00:19 GMT
Expires: Thu, 3 Oct 1996 03:00:17 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
Reply-To: saswss@unx.sas.com (Warren Sarle)
Organization: SAS Institute Inc., Cary, NC, USA
Keywords: frequently asked questions, answers
Followup-To: comp.ai.neural-nets
Lines: 1998
Archive-name: ai-faq/neural-nets/part2
Last-modified: 1996-08-27
URL: ftp://ftp.sas.com/pub/neural/FAQ2.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)
This is part 2 (of 7) of a monthly posting to the Usenet newsgroup
comp.ai.neural-nets. See part 1 of this posting for full information on
what this FAQ is about.
========== Questions ==========
********************************
Part 1: Introduction
Part 2: Learning
How many learning methods for NNs exist? Which?
What is backprop?
What are conjugate gradients, Levenberg-Marquardt, etc.?
How should categories be coded?
Why use a bias input?
Why use activation functions?
What is a softmax activation function?
What is the curse of dimensionality?
How do MLPs compare with RBFs?
What are OLS and subset regression?
Should I normalize/standardize/rescale the data?
Should I nonlinearly transform the data?
What is ART?
What is PNN?
What is GRNN?
What does unsupervised learning learn?
What about Genetic Algorithms and Evolutionary Computation?
What about Fuzzy Logic?
Part 3: Generalization
Part 4: Books, data, etc.
Part 5: Free software
Part 6: Commercial software
Part 7: Hardware
------------------------------------------------------------------------
Subject: How many learning methods for NNs exist? Which?
=========================================================
There are many, many learning methods for NNs by now. Nobody knows exactly
how many. New ones (or at least variations of existing ones) are invented
every week. Below is a collection of some of the best-known methods; it
makes no claim to completeness.
The main categorization of these methods is the distinction between
supervised and unsupervised learning:
o In supervised learning, there is a "teacher" who in the learning phase
"tells" the net how well it performs ("reinforcement learning") or what
the correct behavior would have been ("fully supervised learning").
o In unsupervised learning the net is autonomous: it just looks at the data
it is presented with, finds out about some of the properties of the data
set and learns to reflect these properties in its output. Exactly which
properties the network can learn to recognise depends on the particular
network model and learning method. Usually, the net learns
some compressed representation of the data.
Many of these learning methods are closely connected with a certain (class
of) network topology.
Now here is the list, just giving some names:
1. UNSUPERVISED LEARNING (i.e. without a "teacher"):
1). Feedback Nets:
a). Additive Grossberg (AG)
b). Shunting Grossberg (SG)
c). Binary Adaptive Resonance Theory (ART1)
d). Analog Adaptive Resonance Theory (ART2, ART2a)
e). Discrete Hopfield (DH)
f). Continuous Hopfield (CH)
g). Discrete Bidirectional Associative Memory (BAM)
h). Temporal Associative Memory (TAM)
i). Adaptive Bidirectional Associative Memory (ABAM)
j). Kohonen Self-organizing Map/Topology-preserving map (SOM/TPM)
k). Competitive learning
2). Feedforward-only Nets:
a). Learning Matrix (LM)
b). Driver-Reinforcement Learning (DR)
c). Linear Associative Memory (LAM)
d). Optimal Linear Associative Memory (OLAM)
e). Sparse Distributed Associative Memory (SDM)
f). Fuzzy Associative Memory (FAM)
g). Counterpropagation (CPN)
2. SUPERVISED LEARNING (i.e. with a "teacher"):
1). Feedback Nets:
a). Brain-State-in-a-Box (BSB)
b). Fuzzy Cognitive Map (FCM)
c). Boltzmann Machine (BM)
d). Mean Field Annealing (MFT)
e). Recurrent Cascade Correlation (RCC)
f). Backpropagation through time (BPTT)
g). Real-time recurrent learning (RTRL)
h). Recurrent Extended Kalman Filter (EKF)
2). Feedforward-only Nets:
a). Perceptron
b). Adaline, Madaline
c). Backpropagation (BP)
d). Cauchy Machine (CM)
e). Adaptive Heuristic Critic (AHC)
f). Time Delay Neural Network (TDNN)
g). Associative Reward Penalty (ARP)
h). Avalanche Matched Filter (AMF)
i). Backpercolation (Perc)
j). Artmap
k). Adaptive Logic Network (ALN)
l). Cascade Correlation (CasCor)
m). Extended Kalman Filter(EKF)
n). Learning Vector Quantization (LVQ)
o). Probabilistic Neural Network (PNN)
p). General Regression Neural Network (GRNN)
------------------------------------------------------------------------
Subject: What is backprop?
===========================
Backprop is short for backpropagation of error. The term backpropagation
causes much confusion. Strictly speaking, backpropagation refers to the
method for computing the error gradient for a feedforward network, a
straightforward but elegant application of the chain rule of elementary
calculus (Werbos 1994). By extension, backpropagation or backprop refers
to a training method that uses backpropagation to compute the gradient. By
further extension, a backprop network is a feedforward network trained by
backpropagation.
Standard backprop is a euphemism for the generalized delta rule, the
training algorithm that was popularized by Rumelhart, Hinton, and Williams
in chapter 8 of Rumelhart and McClelland (1986), which remains the most
widely used supervised training method for neural nets. The generalized
delta rule (including momentum) is called the heavy ball method in the
numerical analysis literature (Poljak 1964; Bertsekas 1995, 78-79).
Standard backprop can be used for on-line training (in which the weights are
updated after processing each case) but it does not converge. To obtain
convergence, the learning rate must be slowly reduced. This methodology is
called stochastic approximation.
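For concreteness, here is a minimal sketch (in Python with NumPy) of on-line
standard backprop for a one-hidden-layer net: the generalized delta rule with
momentum and a slowly decreasing learning rate, as described above. The data,
network size, and constants are illustrative, not taken from any particular
implementation.

   import numpy as np

   rng = np.random.default_rng(0)
   X = rng.uniform(-1, 1, size=(200, 2))        # toy inputs
   y = np.sin(X[:, 0]) + 0.5 * X[:, 1]          # toy targets

   n_hidden = 5
   W1 = rng.normal(0, 0.5, size=(2, n_hidden))  # input-to-hidden weights
   b1 = np.zeros(n_hidden)                      # hidden biases
   W2 = rng.normal(0, 0.5, size=n_hidden)       # hidden-to-output weights
   b2 = 0.0
   mom = 0.9                                    # momentum ("heavy ball") term
   dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
   dW2 = np.zeros_like(W2); db2 = 0.0

   for epoch in range(100):
       lr = 0.1 / (1 + epoch)                   # slowly reduced learning rate
       for i in rng.permutation(len(X)):        # update after each case
           h = np.tanh(b1 + X[i] @ W1)          # forward pass
           out = b2 + h @ W2
           err = out - y[i]
           g2 = err * h                         # backward pass (chain rule)
           gh = err * W2 * (1 - h**2)           # tanh'(u) = 1 - tanh(u)^2
           g1 = np.outer(X[i], gh)
           # generalized delta rule: step = -lr*gradient + momentum*old step
           dW2 = -lr * g2 + mom * dW2; W2 += dW2
           db2 = -lr * err + mom * db2; b2 += db2
           dW1 = -lr * g1 + mom * dW1; W1 += dW1
           db1 = -lr * gh + mom * db1; b1 += db1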
For batch processing, there is no reason to suffer through the slow
convergence and the tedious tuning of learning rates and momenta of standard
backprop. Much of the NN research literature is devoted to attempts to speed
up backprop. Most of these methods are inconsequential; two that are
effective are Quickprop (Fahlman 1989) and RPROP (Riedmiller and Braun
1993). But conventional methods for nonlinear optimization are usually
faster and more reliable than any of the "props". See "What are conjugate
gradients, Levenberg-Marquardt, etc.?".
More on-line info on backprop:
Donald Tveter's Backpropagator's Review at
http://www.mcs.com/~drt/bprefs.html.
References on backprop:
Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena
Scientific, ISBN 1-886529-14-0.
Poljak, B.T. (1964), "Some methods of speeding up the convergence of
iteration methods," Z. Vycisl. Mat. i Mat. Fiz., 4, 1-17.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986), "Learning
internal representations by error propagation", in Rumelhart, D.E. and
McClelland, J. L., eds. (1986), Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, Volume 1, 318-362,
Cambridge, MA: The MIT Press.
Werbos, P.J. (1994), The Roots of Backpropagation, NY: John Wiley &
Sons.
References on stochastic approximation:
Robbins, H. & Monro, S. (1951), "A Stochastic Approximation Method",
Annals of Mathematical Statistics, 22, 400-407.
Kiefer, J. & Wolfowitz, J. (1952), "Stochastic Estimation of the Maximum
of a Regression Function," Annals of Mathematical Statistics, 23,
462-466.
Kushner, H. & Clark, D. (1978), Stochastic Approximation Methods for
Constrained and Unconstrained Systems, Springer-Verlag.
White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden
Layer Feedforward Network Models", J. of the American Statistical Assoc.,
84, 1008-1013.
References on better props:
Fahlman, S.E. (1989), "Faster-Learning Variations on Back-Propagation: An
Empirical Study", in Touretzky, D., Hinton, G, and Sejnowski, T., eds.,
Proceedings of the 1988 Connectionist Models Summer School, Morgan
Kaufmann, 38-51.
Riedmiller, M. and Braun, H. (1993), "A Direct Adaptive Method for Faster
Backpropagation Learning: The RPROP Algorithm", Proceedings of the IEEE
International Conference on Neural Networks 1993, San Francisco: IEEE.
------------------------------------------------------------------------
Subject: What are conjugate gradients, Levenberg-Marquardt, etc.?
==================================================================
Training a neural network is, in most cases, an exercise in numerical
optimization of a usually nonlinear function. Methods of nonlinear
optimization have been studied for hundreds of years, and there is a huge
literature on the subject in fields such as numerical analysis, operations
research, and statistical computing, e.g., Bertsekas (1995), Gill, Murray,
and Wright (1981). Masters (1995) has a good elementary discussion of
conjugate gradient and Levenberg-Marquardt algorithms in the context of NNs.
There is no single best method for nonlinear optimization. You need to
choose a method based on the characteristics of the problem to be solved.
For functions with continuous second derivatives (which would include
feedforward nets with the most popular differentiable activation functions
and error functions), three general types of algorithms have been found to
be effective for most practical purposes:
o For a small number of weights, stabilized Newton and Gauss-Newton
algorithms, including various Levenberg-Marquardt and trust-region
algorithms, are efficient.
o For a moderate number of weights, various quasi-Newton algorithms are
efficient.
o For a large number of weights, various conjugate-gradient algorithms are
efficient.
For functions that are not continuously differentiable, the Nelder-Mead
simplex algorithm is useful.
All of the above methods find local optima. For global optimization, there
are also a variety of approaches. You can simply run any of the local
optimization methods from numerous random starting points. Or you can try
more complicated methods designed for global optimization such as simulated
annealing or genetic algorithms (see Reeves 1993 and "What about Genetic
Algorithms and Evolutionary Computation?").
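To illustrate the multistart approach, the following sketch (assuming SciPy
is available) runs a conjugate-gradient local optimizer from several random
starting points and keeps the best result. The two-weight objective function
is just a stand-in for a network's error surface.

   import numpy as np
   from scipy.optimize import minimize

   def error(w):                                # stand-in error function
       return (1 - w[0])**2 + 100 * (w[1] - w[0]**2)**2

   rng = np.random.default_rng(0)
   best = None
   for _ in range(10):
       w0 = rng.uniform(-2, 2, size=2)          # random starting weights
       res = minimize(error, w0, method='CG')   # local optimization
       if best is None or res.fun < best.fun:
           best = res
   print(best.x, best.fun)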
For a survey of optimization software, see Moré and Wright (1993). For
more on-line information on numerical optimization see:
o The kangaroos, a nontechnical description of various optimization
methods, at ftp://ftp.sas.com/pub/neural/kangaroos.
o The Netlib repository, http://www.netlib.org/, containing freely
available software, documents, and databases of interest to the numerical
and scientific computing community.
o John Gregory's nonlinear programming FAQ at
http://www.skypoint.com/subscribers/ashbury/nonlinear-programming-faq.html.
o Arnold Neumaier's page on global optimization at
http://solon.cma.univie.ac.at/~neum/glopt.html.
o Simon Streltsov's page on global optimization at http://cad.bu.edu/go.
References:
Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena
Scientific, ISBN 1-886529-14-0.
Gill, P.E., Murray, W. and Wright, M.H. (1981) Practical Optimization,
Academic Press: London.
Levenberg, K. (1944) "A method for the solution of certain problems in
least squares," Quart. Appl. Math., 2, 164-168.
Marquardt, D. (1963) "An algorithm for least-squares estimation of
nonlinear parameters," SIAM J. Appl. Math., 11, 431-441.
Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0
Moré, J.J. (1977) "The Levenberg-Marquardt algorithm: implementation
and theory," in Watson, G.A., ed., Numerical Analysis, Lecture Notes in
Mathematics 630, Springer-Verlag, Heidelberg, 105-116.
Moré, J.J. and Wright, S.J. (1993), Optimization Software Guide,
Philadelphia: SIAM, ISBN 0-89871-322-6.
Reeves, C.R., ed. (1993) Modern Heuristic Techniques for Combinatorial
Problems, NY: Wiley.
Rinnooy Kan, A.H.G., and Timmer, G.T., (1989) Global Optimization: A
Survey, International Series of Numerical Mathematics, vol. 87, Basel:
Birkhauser Verlag.
------------------------------------------------------------------------
Subject: How should categories be coded?
=========================================
First, consider unordered categories. If you want to classify cases into one
of C categories (i.e. you have a categorical target variable), use 1-of-C
coding. That means that you code C binary (0/1) target variables
corresponding to the C categories. Statisticians call these "dummy"
variables. Each dummy variable is given the value zero except for the one
corresponding to the correct category, which is given the value one. Then
use a softmax output activation function (see "What is a softmax activation
function?") so that the net, if properly trained, will produce valid
posterior probability estimates. If the categories are Red, Green, and Blue,
then the data would look like this:
Category Dummy variables
-------- ---------------
Red 1 0 0
Green 0 1 0
Blue 0 0 1
When there are only two categories, it is simpler to use just one dummy
variable with a logistic output activation function; this is equivalent to
using softmax with two dummy variables.
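A minimal sketch of 1-of-C coding in Python with NumPy, using the example
categories above (the rows of an identity matrix are exactly the 1-of-C
codes):

   import numpy as np

   categories = ['Red', 'Green', 'Blue']
   index = {c: i for i, c in enumerate(categories)}
   targets = ['Green', 'Red', 'Blue', 'Green']
   dummies = np.eye(len(categories))[[index[t] for t in targets]]
   print(dummies)       # rows: [0 1 0], [1 0 0], [0 0 1], [0 1 0]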
The common practice of using target values of .1 and .9 instead of 0 and 1
prevents the outputs of the network from being directly interpretable as
posterior probabilities.
Another common practice is to use a logistic activation function for each
output. Thus, the outputs are not constrained to sum to one, so they are not
valid posterior probability estimates. The usual justification advanced for
this procedure is that if a test case is not similar to any of the training
cases, all of the outputs will be small, indicating that the case cannot be
classified reliably. This claim is incorrect, since a test case that is not
similar to any of the training cases will require the net to extrapolate,
and extrapolation is thoroughly unreliable; such a test case may produce all
small outputs, all large outputs, or any combination of large and small
outputs. If you want a classification method that detects novel cases for
which the classification may not be reliable, you need a method based on
probability density estimation. For example, see "What is PNN?".
It is very important not to use a single variable for an unordered
categorical target. Suppose you used a single variable with values 1, 2, and
3 for red, green, and blue, and the training data with two inputs looked
like this:
| 1 1
| 1 1
| 1 1
| 1 1
|
| X
|
| 3 3 2 2
| 3 3 2
| 3 3 2 2
| 3 3 2 2
|
+----------------------------
Consider a test point located at the X. The correct output would be that X
has about a 50-50 chance of being a 1 or a 3. But if you train with a single
target variable with values of 1, 2, and 3, the output for X will be the
average of 1 and 3, so the net will say that X is definitely a 2!
For an input with categorical values, you can use 1-of-(C-1) coding if the
network has a bias unit. This is just like 1-of-C coding, except that you
omit one of the dummy variables (doesn't much matter which one). Using all C
of the dummy variables creates a linear dependency on the bias unit, which
is not advisable unless you are using weight decay or Bayesian estimation or
some such thing that requires all C weights to be treated on an equal basis.
1-of-(C-1) coding looks like this:
Category Dummy variables
-------- ---------------
Red 1 0
Green 0 1
Blue 0 0
Another possible coding is called "effects" coding or "deviations from
means" coding in statistics. It is like 1-of-(C-1) coding, except that when
a case belongs to the category for the omitted dummy variable, all of the
dummy variables are set to -1, like this:
Category Dummy variables
-------- ---------------
Red 1 0
Green 0 1
Blue -1 -1
As long as a bias unit is used, any network with effects coding can be
transformed into an equivalent network with 1-of-(C-1) coding by a linear
transformation of the weights. So the only advantage of effects coding is
that the dummy variables require no standardizing (see "Should I
normalize/standardize/rescale the data?").
If you are using weight decay, you want to make sure that shrinking the
weights toward zero biases ('bias' in the statistical sense) the net in a
sensible, usually smooth, way. If you use 1-of-(C-1) coding for an input,
weight decay biases the output for the C-1 categories towards the output for
the 1 omitted category, which is probably not what you want, although there
might be special cases where it would make sense. If you use 1-of-C coding
for an input, weight decay biases the output for all C categories roughly
towards the mean output for all the categories, which is smoother and
usually a reasonable thing to do.
Now consider ordered categories. For inputs, some people recommend a
"thermometer code" like this:
Category Dummy variables
-------- ---------------
Red 1 1 1
Green 0 1 1
Blue 0 0 1
However, thermometer coding is equivalent to 1-of-C coding, in that for any
network using 1-of-C coding, there exists a network with thermometer coding
that produces identical outputs; the weights in the thermometer-coded
network are just the differences of successive weights in the 1-of-C-coded
network. To get a genuinely ordinal representation, you must constrain the
weights connecting the dummy variables to the hidden units to be nonnegative
(except for the first dummy variable).
It is often effective to represent an ordinal input as a single variable
like this:
Category Input
-------- -----
Red 1
Green 2
Blue 3
Although this representation involves only a single quantitative input,
given enough hidden units, the net is capable of computing nonlinear
transformations of that input that will produce results equivalent to any of
the dummy coding schemes. But using a single quantitative input makes it
easier for the net to use the order of the categories to generalize when
that is appropriate.
B-splines provide a way of coding ordinal inputs into fewer than C variables
while retaining information about the order of the categories. See Brown and
Harris (1994) or Gifi (1990, 365-370).
Target variables with ordered categories require thermometer coding. The
outputs are thus cumulative probabilities, so to obtain the posterior
probability of any category except the first, you must take the difference
between successive outputs. It is often useful to use a proportional-odds
model, which ensures that these differences are positive. For more details
on ordered categorical targets, see McCullagh and Nelder (1989, chapter 5).
References:
Brown, M., and Harris, C. (1994), Neurofuzzy Adaptive Modelling and
Control, NY: Prentice Hall.
Gifi, A. (1990), Nonlinear Multivariate Analysis, NY: John Wiley & Sons,
ISBN 0-471-92620-5.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
ed., London: Chapman & Hall.
------------------------------------------------------------------------
Subject: Why use a bias input?
===============================
Consider a multilayer perceptron with any of the usual sigmoid activation
functions. Choose any hidden unit or output unit. Let's say there are N
inputs to that unit, which define an N-dimensional space. The given unit
draws a hyperplane through that space, producing an "on" output on one side
and an "off" output on the other. (With sigmoid units the plane will not be
sharp -- there will be some gray area of intermediate values near the
separating plane -- but ignore this for now.)
The weights determine where this hyperplane lies in the input space. Without
a bias input, this separating hyperplane is constrained to pass through the
origin of the space defined by the inputs. For some problems that's OK, but
in many problems the hyperplane would be much more useful somewhere else. If
you have many units in a layer, they share the same input space and without
bias would ALL be constrained to pass through the origin.
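The following tiny sketch illustrates the point: without a bias, a logistic
unit's net input at the origin is zero no matter what the weights are, so
its output there is pinned at 0.5, and the separating hyperplane must pass
through the origin. The weights and bias below are arbitrary.

   import numpy as np

   def logistic(u):
       return 1 / (1 + np.exp(-u))

   w = np.array([2.0, -3.0])          # arbitrary weights
   x = np.zeros(2)                    # the origin of the input space
   print(logistic(w @ x))             # always 0.5, whatever w is
   print(logistic(w @ x + 1.5))       # a bias shifts the hyperplane away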
The "universal approximation" property of multilayer perceptrons with most
commonly-used hidden-layer activation functions does not hold if you omit
the bias units. But Hornik (1993) shows that a sufficient condition for the
universal approximation property without biases is that no derivative of the
activation function vanishes at the origin, which implies that with the
usual sigmoid activation functions, a fixed nonzero bias can be used.
Reference:
Hornik, K. (1993), "Some new results on neural network approximation,"
Neural Networks, 6, 1069-1072.
------------------------------------------------------------------------
Subject: Why use activation functions?
=======================================
Activation functions for the hidden units are needed to introduce
nonlinearity into the network. Without nonlinearity, hidden units would not
make nets more powerful than just plain perceptrons (which do not have any
hidden units, just input and output units). The reason is that a composition
of linear functions is again a linear function. However, it is the
nonlinearity (i.e., the capability to represent nonlinear functions) that
makes multilayer networks so powerful. Almost any nonlinear function does
the job, although for backpropagation learning it must be differentiable and
it helps if the function is bounded; the sigmoidal functions such as
logistic and tanh and the Gaussian function are the most common choices.
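The claim that a composition of linear functions is again linear is easy to
check numerically. In this sketch, a "two-layer" net with identity
activations collapses to a single linear map, so the hidden layer adds no
power:

   import numpy as np

   rng = np.random.default_rng(0)
   W1 = rng.normal(size=(3, 4))       # input-to-hidden weights
   W2 = rng.normal(size=(4, 2))       # hidden-to-output weights
   x = rng.normal(size=3)

   two_layer = (x @ W1) @ W2          # linear "hidden layer", linear output
   one_layer = x @ (W1 @ W2)          # equivalent single linear layer
   print(np.allclose(two_layer, one_layer))   # True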
For the output units, you should choose an activation function suited to the
distribution of the target values. Bounded activation functions such as the
logistic are particularly useful when the target values have a bounded
range. But if the target values have no known bounded range, it is better to
use an unbounded activation function, most often the identity function
(which amounts to no activation function). There are certain natural
associations between output activation functions and various noise
distributions which have been studied by statisticians in the context of
generalized linear models. The output activation function is the inverse of
what statisticians call the "link function". See:
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
ed., London: Chapman & Hall.
Jordan, M.I. (1995), "Why the logistic function? A tutorial discussion on
probabilities and neural networks",
ftp://psyche.mit.edu/pub/jordan/uai.ps.Z.
For more information on activation functions, see Donald Tveter's
Backpropagator's Review.
------------------------------------------------------------------------
Subject: What is a softmax activation function?
================================================
The purpose of the softmax activation function is to make the sum of the
outputs equal to one, so that the outputs are interpretable as posterior
probabilities. Let the net input to each output unit be q_i, i=1,...,c where
c is the number of categories. Then the softmax output p_i is:
p_i = exp(q_i) / sum_{j=1,...,c} exp(q_j)
Unless you are using weight decay or Bayesian estimation or some such thing
that requires the weights to be treated on an equal basis, you can choose
any one of the output units and leave it completely unconnected--just set
the net input to 0. Connecting all of the output units will just give you
redundant weights and will slow down training. To see this, add an arbitrary
constant z to each net input and you get:
p_i = exp(q_i+z) / sum_j exp(q_j+z)
    = [exp(q_i)*exp(z)] / [exp(z) * sum_j exp(q_j)]
    = exp(q_i) / sum_j exp(q_j)
so nothing changes. Hence you can always pick one of the output units, and
add an appropriate constant to each net input to produce any desired net
input for the selected output unit, which you can choose to be zero or
whatever is convenient. You can use the same trick to make sure that none of
the exponentials overflows.
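Here is a minimal sketch of that overflow trick in Python: subtracting the
largest net input from all of them leaves the outputs unchanged (it is just
z = -max(q)) but keeps every exponential at or below one.

   import numpy as np

   def softmax(q):
       q = np.asarray(q, dtype=float)
       e = np.exp(q - q.max())        # shift by z = -max(q); outputs unchanged
       return e / e.sum()

   print(softmax([1000.0, 1001.0, 1002.0]))   # no overflow; sums to one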
Statisticians usually call softmax a "multiple logistic" function. It
reduces to the simple logistic function when there are only two categories.
Suppose you choose to set q_2 to 0. Then
p_1 = exp(q_1) / [exp(q_1) + exp(0)] = 1 / (1 + exp(-q_1))
and p_2, of course, is 1-p_1.
The softmax function derives naturally from log-linear models and leads to
convenient interpretations of the weights in terms of odds ratios. You
could, however, use a variety of other nonnegative functions on the real
line in place of the exp function. Or you could constrain the net inputs to
the output units to be nonnegative, and just divide by the sum--that's
called the Bradley-Terry-Luce model.
References:
Bridle, J.S. (1990a). Probabilistic Interpretation of Feedforward
Classification Network Outputs, with Relationships to Statistical Pattern
Recognition. In: F.Fogleman Soulie and J.Herault (eds.), Neurocomputing:
Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp.
227-236.
Bridle, J.S. (1990b). Training Stochastic Model Recognition Algorithms as
Networks can lead to Maximum Mutual Information Estimation of Parameters.
In: D.S.Touretzky (ed.), Advances in Neural Information Processing
Systems 2, San Mateo: Morgan Kaufmann, pp. 211-217.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
ed., London: Chapman & Hall. See Chapter 5.
------------------------------------------------------------------------
Subject: What is the curse of dimensionality?
==============================================
This answer was provided by Janne Sinkkonen:
Curse of dimensionality (Bellman 1961) refers to the exponential
growth of hypervolume as a function of dimensionality. What has this
to do with NNs?
Well, NNs are mappings from an input space to an output space. Thus,
loosely speaking, an NN needs to "monitor" or cover or represent
every part of its input space in order to know how the space should
be mapped. Covering the input space takes resources, and the amount
of resources needed is proportional to the hypervolume of the input
space. This notion seems to hold generally, but formalizing
"resources" and "every part of the input space" would take us so deep
that we could eventually surface to a different world on the other
side of the deepness. :)
Here is an example. Think of an unsupervised competitive one-layer
network that models data scattered uniformly over a unit hypercube.
The network tries to share its units (resources) more or less equally
over the hypercube (input space). One could argue that the average
distance from a random point of the space to the nearest network unit
measures the goodness of the representation: the shorter the
distance, the better the representation of the data in the cube. By
simulation, or just by thinking about it, it can be shown that the total
number of units required to keep the average distance constant increases
exponentially with the dimensionality of the cube.
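The exponential growth described above can be checked with a short
simulation. This sketch (with illustrative constants) estimates the average
distance from random points in the unit hypercube to the nearest of a fixed
number of randomly placed units; the average grows quickly with the
dimensionality:

   import numpy as np

   rng = np.random.default_rng(0)
   n_units, n_probes = 100, 1000
   for d in (1, 2, 5, 10, 20):
       units = rng.uniform(size=(n_units, d))
       probes = rng.uniform(size=(n_probes, d))
       dists = np.linalg.norm(probes[:, None, :] - units[None, :, :], axis=2)
       print(d, dists.min(axis=1).mean())     # average nearest-unit distance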
The curse of dimensionality causes networks with lots of irrelevant
inputs to behave relatively badly. The dimension of the input
space is high, and the network uses almost all its resources to
represent irrelevant portions of the space.
References:
Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton
University Press.
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Oxford University Press, section 1.4.
Scott, D.W. (1992), Multivariate Density Estimation, NY: Wiley.
------------------------------------------------------------------------
Subject: How do MLPs compare with RBFs?
========================================
Notation:
a_j is the altitude of the jth hidden unit
b_j is the bias of the jth hidden unit
f is the fan-in of the jth hidden unit
h_j is the activation of the jth hidden unit
s is a common width shared by all hidden units in the layer
s_j is the width of the jth hidden unit
w_ij is the weight connecting the ith input to
the jth hidden unit
w_i is the common weight for the ith input shared by
all hidden units in the layer
x_i is the ith input
The inputs to each hidden or output unit must be combined with the weights
to yield a single value called the net input. There does not seem to be a
standard term for the function that combines the inputs and weights; I will
use the term "combination function".
A multilayer perceptron (MLP) has one or more hidden layers for which the
combination function is the inner product of the inputs and weights, plus a
bias. The activation function is usually a logistic or tanh function.
Hence the formula for the activation is typically:
h_j = tanh( b_j + sum_i[w_ij*x_i] )
The MLP architecture is the most popular one in practical applications. Each
layer uses a linear combination function. The inputs are fully connected to
the first hidden layer, each hidden layer is fully connected to the next,
and the last hidden layer is fully connected to the outputs. You can also
have "skip-layer" connections; direct connections from inputs to outputs are
especially useful.
Consider the multidimensional space of inputs to a given hidden unit. Since
an MLP uses linear combination functions, the set of all points in the space
having a given value of the activation function is a hyperplane. The
hyperplanes corresponding to different activation levels are parallel to
each other (the hyperplanes for different units are not parallel in
general). These parallel hyperplanes are the isoactivation contours of the
hidden unit.
Radial basis function (RBF) networks usually have only one hidden layer for
which the combination function is based on the Euclidean distance between
the input vector and the weight vector. RBF networks do not have anything
that's exactly the same as the bias term in an MLP. But some types of RBFs
have a "width" associated with each hidden unit or with the the entire
hidden layer; instead of adding it in the combination function like a bias,
you divide the Euclidean distance by the width.
To see the similarity between RBF networks and MLPs, it is convenient to
treat the combination function as the square of distance/width. Then the
familiar exp or softmax activation functions produce members of the
popular class of Gaussian RBF networks. It can also be useful to add another
term to the combination function that determines what I will call the
"altitude" of the unit. I have not seen altitudes used in the NN literature;
if you know of a reference, please tell me (saswss@unx.sas.com).
The output activation function in RBF networks is usually the identity. The
identity output activation function is a computational convenience in
training (see Hybrid training and the curse of dimensionality) but it is
possible and often desirable to use other output activation functions just
as you would in an MLP.
There are many types of radial basis functions. Gaussian RBFs seem to be the
most popular by far in the NN literature. In the statistical literature,
thin plate splines are also used (Green and Silverman 1994). This FAQ will
concentrate on Gaussian RBFs.
There are two distinct types of Gaussian RBF architectures. The first type
uses the exp activation function, so the activation of the unit is a
Gaussian "bump" as a function of the inputs. There seems to be no specific
term for this type of Gaussian RBF network; I will use the term "ordinary
RBF", or ORBF, network.
The second type of Gaussian RBF architecture uses the softmax activation
function, so the activations of all the hidden units are normalized to sum
to one. This type of network is often called a "normalized RBF", or NRBF,
network. In an NRBF network, the output units should not have a bias, since
the constant bias term would be linearly dependent on the constant sum of
the hidden units.
While the distinction between these two types of Gaussian RBF architectures
is sometimes mentioned in the NN literature, its importance has rarely been
appreciated except by Tao (1993) and Werntges (1993).
There are several subtypes of both ORBF and NRBF architectures. Descriptions
and formulas are as follows (a small numerical sketch follows the list):

ORBFUN
   Ordinary radial basis function (RBF) network with unequal widths
   h_j = exp( f*log(a_j) - s_j^-2 * sum_i[(w_ij-x_i)^2] )

ORBFEQ
   Ordinary radial basis function (RBF) network with equal widths
   h_j = exp( - s^-2 * sum_i[(w_ij-x_i)^2] )

NRBFUN
   Normalized RBF network with unequal widths and heights
   h_j = softmax( f*log(a_j) - s_j^-2 * sum_i[(w_ij-x_i)^2] )

NRBFEV
   Normalized RBF network with equal volumes
   h_j = softmax( f*log(b_j) - s_j^-2 * sum_i[(w_ij-x_i)^2] )

NRBFEH
   Normalized RBF network with equal heights (and unequal widths)
   h_j = softmax( - s_j^-2 * sum_i[(w_ij-x_i)^2] )

NRBFEW
   Normalized RBF network with equal widths (and unequal heights)
   h_j = softmax( f*log(a_j) - s^-2 * sum_i[(w_ij-x_i)^2] )

NRBFEQ
   Normalized RBF network with equal widths and heights
   h_j = softmax( - s^-2 * sum_i[(w_ij-x_i)^2] )
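To make the distinction concrete, here is a small numerical sketch (with
illustrative sizes and a common width) contrasting an MLP hidden layer,
which uses inner products plus biases with tanh, with ORBFEQ and NRBFEQ
hidden layers, which use squared Euclidean distances scaled by the width:

   import numpy as np

   def softmax(q):
       e = np.exp(q - q.max())
       return e / e.sum()

   rng = np.random.default_rng(0)
   x = rng.uniform(-1, 1, size=2)             # one input vector
   W = rng.uniform(-1, 1, size=(3, 2))        # weights/centers, 3 hidden units
   b = rng.uniform(-1, 1, size=3)             # MLP biases
   s = 0.5                                    # common RBF width

   h_mlp = np.tanh(b + W @ x)                 # MLP hidden layer
   d2 = ((W - x)**2).sum(axis=1)              # squared distances to centers
   h_orbf = np.exp(-d2 / s**2)                # ORBFEQ: Gaussian bumps
   h_nrbf = softmax(-d2 / s**2)               # NRBFEQ: activations sum to one
   print(h_mlp, h_orbf, h_nrbf, h_nrbf.sum())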
To illustrate various architectures, an example with two inputs and one
output will be used so that the results can be shown graphically. The
function being learned resembles a landscape with a Gaussian hill and a
logistic plateau as shown in ftp://ftp.sas.com/pub/neural/hillplat.gif.
There are 441 training cases on a regular 21-by-21 grid. The table below
shows the root mean square error (RMSE) for a test data set. The test set
has 1681 cases on a regular 41-by-41 grid over the same domain as the
training set. If you are reading the HTML version of this document via a web
browser, click on any number in the table to see a surface plot of the
corresponding network output (each plot is a gif file, approximately 9K).
The MLP networks in the table have one hidden layer with a tanh activation
function. All of the networks use an identity activation function for the
outputs.
Hill and Plateau Data: RMSE for the Test Set

HUs   MLP     ORBFEQ  ORBFUN  NRBFEQ  NRBFEW  NRBFEV  NRBFEH  NRBFUN
 2    0.218   0.247   0.247   0.230   0.230   0.230   0.230   0.230
 3    0.192   0.244   0.143   0.218   0.218   0.036   0.012   0.001
 4    0.174   0.216   0.096   0.193   0.193   0.036   0.007
 5    0.160   0.188   0.083   0.086   0.051   0.003
 6    0.123   0.142   0.058   0.053   0.030
 7    0.107   0.123   0.051   0.025   0.019
 8    0.093   0.105   0.043   0.020   0.008
 9    0.084   0.085   0.038   0.017
10    0.077   0.082   0.033   0.016
12    0.059   0.074   0.024   0.005
15    0.042   0.060   0.019
20    0.023   0.046   0.010
30    0.019   0.024
40    0.016   0.022
50    0.010   0.014
The ORBF architectures use radial combination functions and the exp
activation function. Only two of the radial combination functions are useful
with ORBF architectures. For radial combination functions including an
altitude, the altitude would be redundant with the hidden-to-output weights.
Radial combination functions are based on the Euclidean distance between the
vector of inputs to the unit and the vector of corresponding weights. Thus,
the isoactivation contours for ORBF networks are concentric hyperspheres. A
variety of activation functions can be used with the radial combination
function, but the exp activation function, yielding a Gaussian surface, is
the most useful. Radial networks typically have only one hidden layer, but
it can be useful to include a linear layer for dimensionality reduction or
oblique rotation before the RBF layer.
The output of an ORBF network consists of a number of superimposed bumps,
hence the output is quite bumpy unless many hidden units are used. Thus an
ORBF network with only a few hidden units is incapable of fitting a wide
variety of simple, smooth functions, and should rarely be used.
The NRBF architectures also use radial combination functions but the
activation function is softmax, which forces the sum of the activations for
the hidden layer to equal one. Thus, each output unit computes a weighted
average of the hidden-to-output weights, and the output values must lie
within the range of the hidden-to-output weights. Hence, if the
hidden-to-output weights are within a reasonable range (such as the range
of the target values), you can be sure that the outputs will be within that
same range for all possible inputs, even when the net is extrapolating.
Radial combination functions incorporating altitudes are useful with NRBF
architectures. The NRBF architectures combine some of the virtues of both
the RBF and MLP architectures, as explained below. However, the
isoactivation contours are considerably more complicated than for ORBF
architectures.
Consider the case of an NRBF network with only two hidden units. If the
hidden units have equal widths, the isoactivation contours are parallel
hyperplanes; in fact, this network is equivalent to an MLP with one logistic
hidden unit. If the hidden units have unequal widths, the isoactivation
contours are concentric hyperspheres; such a network is almost equivalent to
an ORBF network with one Gaussian hidden unit.
If there are more than two hidden units in an NRBF network, the
isoactivation contours have no such simple characterization. If the RBF
widths are very small, the isoactivation contours are approximately
piecewise linear for RBF units with equal widths, and approximately
piecewise spherical for RBF units with unequal widths. The larger the
widths, the smoother the isoactivation contours where the pieces join.
The NRBFEQ architecture is a smoothed variant of the learning vector
quantization (Kohonen 1988, Ripley 1996) and counterpropagation
(Hecht-Nielsen 1990) architectures. In LVQ and counterprop, the hidden
units are often called codebook vectors. LVQ amounts to nearest-neighbor
classification on the codebook vectors, while counterprop is
nearest-neighbor regression on the codebook vectors. The NRBFEQ architecture
uses not just the single nearest neighbor, but a weighted average of near
neighbors. As the width of the NRBFEQ functions approaches zero, the weights
approach one for the nearest neighbor and zero for all other codebook
vectors. LVQ and counterprop use ad hoc algorithms of uncertain reliability,
but standard numerical optimization algorithms (not to mention backprop) can
be applied with the NRBFEQ architecture.
In an NRBFEQ architecture, if each observation is taken as an RBF center, and
if the weights are taken to be the target values, the outputs are simply
weighted averages of the target values, and the network is identical to the
well-known Nadaraya-Watson kernel regression estimator, which has been
reinvented at least twice in the neural net literature (see "What is
GRNN?"). A similar NRBFEQ network used for classification is equivalent to
kernel discriminant analysis (see "What is PNN?").
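A minimal sketch of that correspondence: with every training case as an RBF
center and the target values as hidden-to-output weights, the NRBFEQ output
is a kernel-weighted average of the targets, i.e. the Nadaraya-Watson
estimator. The data and width below are illustrative.

   import numpy as np

   def nadaraya_watson(x, X_train, y_train, width):
       d2 = ((X_train - x)**2).sum(axis=1)
       k = np.exp(-d2 / width**2)             # Gaussian kernel (NRBFEQ numerator)
       return (k * y_train).sum() / k.sum()   # softmax-weighted average of targets

   rng = np.random.default_rng(0)
   X_train = rng.uniform(-3, 3, size=(50, 1))
   y_train = np.sin(X_train[:, 0])
   print(nadaraya_watson(np.array([1.0]), X_train, y_train, width=0.5))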
Kernels with variable widths are also used for regression in the statistical
literature. Such kernel estimators correspond to the NRBFEV
architecture, in which the kernel functions have equal volumes but different
altitudes. In the neural net literature, variable-width kernels appear
always to be of the NRBFEH variety, with equal altitudes but unequal
volumes. The analogy with kernel regression would make the NRBFEV
architecture the obvious choice, but which of the two architectures works
better in practice is an open question.
Hybrid training and the curse of dimensionality
+++++++++++++++++++++++++++++++++++++++++++++++
A comparison of the various architectures must separate training issues from
architectural issues to avoid common sources of confusion. RBF networks are
often trained by hybrid methods, in which the hidden weights (centers) are
first obtained by unsupervised learning, after which the output weights are
obtained by supervised learning. Unsupervised methods for choosing the
centers include:
1. Distribute the centers in a regular grid over the input space.
2. Choose a random subset of the training cases to serve as centers.
3. Cluster the training cases based on the input variables, and use the mean
of each cluster as a center.
Various heuristic methods are also available for choosing the RBF widths.
Once the centers and widths are fixed, the output weights can be learned
very efficiently, since the computation reduces to a linear or generalized
linear model. The hybrid training approach can thus be much faster than the
nonlinear optimization that would be required for supervised training of all
of the weights in the network.
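Here is a minimal sketch of hybrid training for a Gaussian RBF network,
using method (2) above (a random subset of training cases as centers) and
linear least squares for the output weights; the width and sizes are
illustrative heuristics.

   import numpy as np

   rng = np.random.default_rng(0)
   X = rng.uniform(-3, 3, size=(200, 1))
   y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

   centers = X[rng.choice(len(X), size=10, replace=False)]   # method (2)
   width = 1.0                                               # heuristic width
   d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(axis=2)
   H = np.exp(-d2 / width**2)                 # hidden-layer design matrix
   H = np.column_stack([H, np.ones(len(X))])  # bias column for the output unit
   w, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights: a linear model
   print(np.sqrt(np.mean((H @ w - y)**2)))    # training RMSE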
Hybrid training is not often applied to MLPs because no effective methods
are known for unsupervised training of the hidden units (except when there
is only one input).
Hybrid training will usually require more hidden units than supervised
training. Since supervised training optimizes the locations of the centers,
while hybrid training does not, supervised training will provide a better
approximation to the function to be learned for a given number of hidden
units. Thus, the better fit provided by supervised training will often let
you use fewer hidden units for a given accuracy of approximation than you
would need with hybrid training. And if the hidden-to-output weights are
learned by linear least-squares, the fact that hybrid training requires more
hidden units implies that hybrid training will also require more training
cases for the same accuracy of generalization (Tarassenko and Roberts 1994).
The number of hidden units required by hybrid methods becomes an
increasingly serious problem as the number of inputs increases. In fact, the
required number of hidden units tends to increase exponentially with the
number of inputs. This drawback of hybrid methods is discussed by Minsky and
Papert (1969). For example, with method (1) for RBF networks, you would need
at least five elements in the grid along each dimension to detect a moderate
degree of nonlinearity; so if you have Nx inputs, you would need at least
5^Nx hidden units. For methods (2) and (3), the number of hidden units
increases exponentially with the effective dimensionality of the input
distribution. If the inputs are linearly related, the effective
dimensionality is the number of nonnegligible (a deliberately vague term)
eigenvalues of the covariance matrix, so the inputs must be highly
correlated if the effective dimensionality is to be much less than the
number of inputs.
The exponential increase in the number of hidden units required for hybrid
learning is one aspect of the curse of dimensionality. The number of
training cases required also increases exponentially in general. No neural
network architecture--in fact no method of learning or statistical
estimation--can escape the curse of dimensionality in general, hence there
is no practical method of learning general functions in more than a few
dimensions.
Fortunately, in many practical applications of neural networks with a large
number of inputs, most of those inputs are additive, redundant, or
irrelevant, and some architectures can take advantage of these properties to
yield useful results. But escape from the curse of dimensionality requires
fully supervised training as well as special types of data. Supervised
training for RBF networks can be done by "backprop" (see "What is
backprop?") or other optimization methods (see "What are conjugate
gradients, Levenberg-Marquardt, etc.?"), or by subset regression (see "What
are OLS and subset regression?").
Additive inputs
+++++++++++++++
An additive model is one in which the output is a sum of linear or nonlinear
transformations of the inputs. If an additive model is appropriate, the
number of weights increases linearly with the number of inputs, so high
dimensionality is not a curse. Various methods of training additive models
are available in the statistical literature (e.g. Hastie and Tibshirani
1990). You can also create a feedforward neural network, called a
generalized additive network (GAN), to fit additive models (Sarle 1994).
Additive models have been proposed in the neural net literature under the
name "topologically distributed encoding" (Geiger 1990).
Projection pursuit regression (PPR) provides both universal approximation
and the ability to avoid the curse of dimensionality for certain common
types of target functions (Friedman and Stuetzle 1981). Like MLPs, PPR
computes the output as a sum of nonlinear transformations of linear
combinations of the inputs. Each term in the sum is analogous to a hidden
unit in an MLP. But unlike MLPs, PPR allows general, smooth nonlinear
transformations rather than a specific nonlinear activation function, and
allows a different transformation for each term. The nonlinear
transformations in PPR are usually estimated by nonparametric regression,
but you can set up a projection pursuit network (PPN), in which each
nonlinear transformation is performed by a subnetwork. If a PPN provides an
adequate fit with few terms, then the curse of dimensionality can be
avoided, and the results may even be interpretable.
If the target function can be accurately approximated by projection pursuit,
then it can also be accurately approximated by an MLP with a single hidden
layer. The disadvantage of the MLP is that there is little hope of
interpretability. An MLP with two or more hidden layers can provide a
parsimonious fit to a wider variety of target functions than can projection
pursuit, but no simple characterization of these functions is known.
Redundant inputs
++++++++++++++++
With proper training, all of the RBF architectures listed above, as well as
MLPs, can process redundant inputs effectively. When there are redundant
inputs, the training cases lie close to some (possibly nonlinear) subspace.
If the same degree of redundancy applies to the test cases, the network need
produce accurate outputs only near the subspace occupied by the data. Adding
redundant inputs has little effect on the effective dimensionality of the
data; hence the curse of dimensionality does not apply, and even hybrid
methods (2) and (3) can be used. However, if the test cases do not follow
the same pattern of redundancy as the training cases, generalization will
require extrapolation and will rarely work well.
Irrelevant inputs
+++++++++++++++++
MLP architectures are good at ignoring irrelevant inputs. MLPs can also
select linear subspaces of reduced dimensionality. Since the first hidden
layer forms linear combinations of the inputs, it confines the network's
attention to the linear subspace spanned by the weight vectors. Hence,
adding irrelevant inputs to the training data does not increase the number
of hidden units required, although it increases the amount of training data
required.
ORBF architectures are not good at ignoring irrelevant inputs. The number of
hidden units required grows exponentially with the number of inputs,
regardless of how many inputs are relevant. This exponential growth is
related to the fact that RBFs and ERBFs have local receptive fields, meaning
that changing the hidden-to-output weights of a given unit will affect the
output of the network only in a neighborhood of the center of the hidden
unit, where the size of the neighborhood is determined by the width of the
hidden unit. (Of course, if the width of the unit is learned, the receptive
field could grow to cover the entire training set.)
Local receptive fields are often an advantage compared to the distributed
architecture of MLPs, since local units can adapt to local patterns in the
data without having unwanted side effects in other regions. In a distributed
architecture such as an MLP, adapting the network to fit a local pattern in
the data can cause spurious side effects in other parts of the input space.
However, ORBF architectures often must be used with relatively small
neighborhoods, so that several hidden units are required to cover the range
of an input. When there are many nonredundant inputs, the hidden units must
cover the entire input space, and the number of units required is
essentially the same as in the hybrid case (1) where the centers are in a
regular grid; hence the exponential growth in the number of hidden units
with the number of inputs, regardless of whether the inputs are relevant.
You can enable an ORBF architecture to ignore irrelevant inputs by using an
extra, linear hidden layer before the radial hidden layer. This type of
network is sometimes called an elliptical basis function network. If the
number of units in the linear hidden layer equals the number of inputs, the
linear hidden layer performs an oblique rotation of the input space that can
suppress irrelevant directions and differentially weight relevant directions
according to their importance. If you think that the presence of irrelevant
inputs is highly likely, you can force a reduction of dimensionality by
using fewer units in the linear hidden layer than the number of inputs.
Note that the linear and radial hidden layers must be connected in series,
not in parallel, to ignore irrelevant inputs. In some applications it is
useful to have linear and radial hidden layers connected in parallel, but in
such cases the radial hidden layer will be sensitive to all inputs.
For even greater flexibility (at the cost of more weights to be learned),
you can have a separate linear hidden layer for each RBF unit, allowing a
different oblique rotation for each RBF unit.
NRBF architectures with equal widths (NRBFEW and NRBFEQ) combine the
advantage of local receptive fields with the ability to ignore irrelevant
inputs. The receptive field of one hidden unit extends from the center in
all directions until it encounters the receptive field of another hidden
unit. It is convenient to think of a "boundary" between the two receptive
fields, defined as the hyperplane where the two units have equal
activations, even though the effect of each unit will extend somewhat beyond
the boundary. The location of the boundary depends on the heights of the
hidden units. If the two units have equal heights, the boundary lies midway
between the two centers. If the units have unequal heights, the boundary is
farther from the higher unit.
If a hidden unit is surrounded by other hidden units, its receptive field is
indeed local, curtailed by the field boundaries with other units. But if a
hidden unit is not completely surrounded, its receptive field can extend
infinitely in certain directions. If there are irrelevant inputs, or more
generally, irrelevant directions that are linear combinations of the inputs,
the centers need only be distributed in a subspace orthogonal to the
irrelevant directions. In this case, the hidden units can have local
receptive fields in relevant directions but infinite receptive fields in
irrelevant directions.
For NRBF architectures allowing unequal widths (NRBFUN, NRBFEV, and NRBFEH),
the boundaries between receptive fields are generally hyperspheres rather
than hyperplanes. In order to ignore irrelevant inputs, such networks must
be trained to have equal widths. Hence, if you think there is a strong
possibility that some of the inputs are irrelevant, it is usually better to
use an architecture with equal widths.
References:
There are few good references on RBF networks. Bishop (1995) gives one of
the better surveys, but also see Tao (1993) for the importance of
normalization.
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Oxford University Press.
Friedman, J.H. and Stuetzle, W. (1981), "Projection pursuit regression,"
J. of the American Statistical Association, 76, 817-823.
Geiger, H. (1990), "Storing and Processing Information in Connectionist
Systems," in Eckmiller, R., ed., Advanced Neural Computers, 271-277,
Amsterdam: North-Holland.
Green, P.J. and Silverman, B.W. (1994), Nonparametric Regression and
Generalized Linear Models: A Roughness Penalty Approach, London:
Chapman & Hall.
Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models,
London: Chapman & Hall.
Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley.
Kohonen, T. (1988), "Learning Vector Quantization," Neural Networks, 1
(suppl 1), 303.
Minsky, M.L. and Papert, S.A. (1969), Perceptrons, Cambridge, MA: MIT
Press.
Ripley, B.D. (1996), Pattern Recognition and Neural Networks,
Cambridge: Cambridge University Press.
Sarle, W.S. (1994), "Neural Networks and Statistical Models," in SAS
Institute Inc., Proceedings of the Nineteenth Annual SAS Users Group
International Conference, Cary, NC: SAS Institute Inc., pp 1538-1550,
ftp://ftp.sas.com/pub/neural/neural1.ps.
Tao, K.M. (1993), "A closer look at the radial basis function (RBF)
networks," Conference Record of The Twenty-Seventh Asilomar
Conference on Signals, Systems and Computers (Singh, A., ed.), vol 1,
401-405, Los Alamitos, CA: IEEE Comput. Soc. Press.
Tarassenko, L. and Roberts, S. (1994), "Supervised and unsupervised
learning in radial basis function classifiers," IEE Proceedings-- Vis.
Image Signal Processing, 141, 210-216.
Werntges, H.W. (1993), "Partitions of unity improve neural function
approximation," Proceedings of the IEEE International Conference on
Neural Networks, San Francisco, CA, vol 2, 914-918.
------------------------------------------------------------------------
Subject: What are OLS and subset regression?
=============================================
If you are a statistician, "OLS" means "ordinary least squares" (as opposed to
weighted or generalized least squares), which is what the NN literature
often calls "LMS" (least mean squares).
If you are a neural networker, "OLS" means "orthogonal least squares", which
is an algorithm for forward stepwise regression proposed by Chen et al.
(1991) for training RBF networks.
OLS is a variety of supervised training. But whereas backprop and other
commonly-used supervised methods are forms of continuous optimization, OLS
is a form of combinatorial optimization. Rather than treating the RBF
centers as continuous values to be adjusted to reduce the training error,
OLS starts with a large set of candidate centers and selects a subset that
usually provides good training error. For small training sets, the
candidates can include all of the training cases. For large training sets,
it is more efficient to use a random subset of the training cases or to do a
cluster analysis and use the cluster means as candidates.
Each center corresponds to a predictor variable in a linear regression
model. The values of these predictor variables are computed from the RBF
applied to each center. There are numerous methods for selecting a subset of
predictor variables in regression (Myers 1986; Miller 1990). The ones most
often used are:
o Forward selection begins with no centers in the network. At each step the
center is added that most decreases the error function (see the sketch
after this list).
o Backward elimination begins with all candidate centers in the network. At
each step the center is removed that least increases the error function.
o Stepwise selection begins like forward selection with no centers in the
network. At each step, a center is added or removed. If there are any
centers in the network, the one that contributes least to reducing the
error criterion is subjected to a statistical test (usually based on the
F statistic) to see if it is worth retaining in the network; if the
center fails the test, it is removed. If no centers are removed, then the
centers that are not currently in the network are examined; the one that
would contribute most to reducing the error criterion is subjected to a
statistical test to see if it is worth adding to the network; if the
center passes the test, it is added. When all centers in the network pass
the test for staying in the network, and all other centers fail the test
for being added to the network, the stepwise method terminates.
o Leaps and bounds (Furnival and Wilson 1974) is an algorithm for
determining the subset of centers that minimizes the error function; this
optimal subset can be found without examining all possible subsets, but
the algorithm is practical only up to 30 to 50 candidate centers.
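To make the forward-selection idea concrete, here is a minimal sketch over
candidate RBF centers. For clarity it refits by plain linear least squares
at each step; OLS proper does the same selection much more efficiently via
modified Gram-Schmidt orthogonalization. The data and width are illustrative.

   import numpy as np

   def sse(H, y):                             # training error for a subset
       w, *_ = np.linalg.lstsq(H, y, rcond=None)
       return ((H @ w - y)**2).sum()

   rng = np.random.default_rng(0)
   X = rng.uniform(-3, 3, size=(100, 1))
   y = np.sin(X[:, 0])
   d2 = ((X[:, None, :] - X[None, :, :])**2).sum(axis=2)
   Phi = np.exp(-d2 / 1.0**2)         # all training cases as candidate centers

   selected = []
   for _ in range(5):                 # add 5 centers greedily
       rest = [j for j in range(len(X)) if j not in selected]
       best = min(rest, key=lambda j: sse(Phi[:, selected + [j]], y))
       selected.append(best)
   print(selected, sse(Phi[:, selected], y))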
OLS is a particular algorithm for forward selection using modified
Gram-Schmidt (MGS) orthogonalization. While MGS is not a bad algorithm, it
is not the best algorithm for linear least-squares (Lawson and Hanson 1974).
For ill-conditioned data, Householder and Givens methods are generally
preferred, while for large, well-conditioned data sets, methods based on the
normal equations require about one-third as many floating point operations
and much less disk I/O than OLS. Normal equation methods based on sweeping
(Goodnight 1979) or Gaussian elimination (Furnival and Wilson 1974) are
especially simple to program.
While the theory of linear models is the most thoroughly developed area of
statistical inference, subset selection invalidates most of the standard
theory (Miller 1990; Roecker 1991; Derksen and Keselman 1992; Freedman, Pee,
and Midthune 1992).
Subset selection methods usually do not generalize as well as regularization
methods in linear models (Frank and Friedman 1993). Orr (1995) has proposed
combining regularization with subset selection for RBF training (see also
Orr 1996).
References:
Chen, S., Cowan, C.F.N., and Grant, P.M. (1991), "Orthogonal least
squares learning for radial basis function networks," IEEE Transactions
on Neural Networks, 2, 302-309.
Derksen, S. and Keselman, H. J. (1992) "Backward, forward and stepwise
automated subset selection algorithms: Frequency of obtaining authentic
and noise variables," British Journal of Mathematical and Statistical
Psychology, 45, 265-282.
Frank, I.E. and Friedman, J.H. (1993) "A statistical view of some
chemometrics regression tools," Technometrics, 35, 109-148.
Freedman, L.S., Pee, D. and Midthune, D.N. (1992) "The problem of
underestimating the residual error variance in forward stepwise
regression", The Statistician, 41, 405-412.
Furnival, G.M. and Wilson, R.W. (1974), "Regression by Leaps and Bounds,"
Technometrics, 16, 499-511.
Goodnight, J.H. (1979), "A Tutorial on the SWEEP Operator," The American
Statistician, 33, 149-158.
Lawson, C. L. and Hanson, R. J. (1974), Solving Least Squares Problems,
Englewood Cliffs, NJ: Prentice-Hall, Inc. (2nd edition: 1995,
Philadelphia: SIAM)
Miller, A.J. (1990), Subset Selection in Regression, Chapman & Hall.
Myers, R.H. (1986), Classical and Modern Regression with Applications,
Boston: Duxbury Press.
Orr, M.J.L. (1995), "Regularisation in the selection of radial basis
function centres," Neural Computation, 7, 606-623.
Orr, M.J.L. (1996), "Introduction to radial basis function networks,"
http://www.cns.ed.ac.uk/people/mark/intro.ps or
http://www.cns.ed.ac.uk/people/mark/intro/intro.html .
Roecker, E.B. (1991) "Prediction error and its estimation for
subset-selected models," Technometrics, 33, 459-468.
------------------------------------------------------------------------
Subject: Should I normalize/standardize/rescale the
===================================================
data?
======
First, some definitions. "Rescaling" a vector means to add or subtract a
constant and then multiply or divide by a constant, as you would do to
change the units of measurement of the data, for example, to convert a
temperature from Celsius to Fahrenheit.
"Normalizing" a vector most often means dividing by a norm of the vector,
for example, to make the Euclidean length of the vector equal to one. In the
NN literature, "normalizing" also often refers to rescaling by the minimum
and range of the vector, to make all the elements lie between 0 and 1.
"Standardizing" a vector most often means subtracting a measure of location
and dividing by a measure of scale. For example, if the vector contains
random values with a Gaussian distribution, you might subtract the mean and
divide by the standard deviation, thereby obtaining a "standard normal"
random variable with mean 0 and standard deviation 1.
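In code, the three operations look like this (a minimal NumPy sketch; the
example values are arbitrary):

    import numpy as np

    x = np.array([20.0, 25.0, 30.0])     # e.g. temperatures in Celsius

    rescaled = x * 9.0 / 5.0 + 32.0      # change of units: Celsius to Fahrenheit
    unit_length = x / np.linalg.norm(x)  # "normalized": Euclidean length 1
    range01 = (x - x.min()) / (x.max() - x.min())   # "normalized" to [0,1]
    standardized = (x - x.mean()) / x.std()         # mean 0, std. dev. 1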
However, all of the above terms are used more or less interchangeably
depending on the customs within various fields. Since the FAQ maintainer is
a statistician, he is going to use the term "standardize" because that is
what he is accustomed to.
Now the question is, should you do any of these things to your data? The
answer is, it depends.
There is a common misconception that the inputs to a multilayer perceptron
must be in the interval [0,1]. There is in fact no such requirement,
although there often are benefits to standardizing the inputs as discussed
below.
If your output activation function has a range of [0,1], then obviously you
must ensure that the target values lie within that range. But it is
generally better to choose an output activation function suited to the
distribution of the targets than to force your data to conform to the output
activation function. See "Why use activation functions?"
When using an output activation with a range of [0,1], some people prefer to
rescale the targets to a range of [.1,.9]. I suspect that the popularity of
this gimmick is due to the slowness of standard backprop. But using a target
range of [.1,.9] for a classification task gives you incorrect posterior
probability estimates, and it is quite unnecessary if you use an efficient
training algorithm (see "What are conjugate gradients, Levenberg-Marquardt,
etc.?")
Now for some of the gory details: note that the training data form a matrix.
Let's set up this matrix so that each case forms a row, and the inputs and
target variables form columns. You could conceivably standardize the rows or
the columns or both or various other things, and these different ways of
choosing vectors to standardize will have quite different effects on
training.
Standardizing either input or target variables tends to make the training
process better behaved by improving the numerical condition of the
optimization problem and ensuring that various default values involved in
initialization and termination are appropriate. Standardizing targets can
also affect the objective function.
Standardization of cases should be approached with caution because it
discards information. If that information is irrelevant, then standardizing
cases can be quite helpful. If that information is important, then
standardizing cases can be disastrous.
Subquestion: Should I standardize the input variables (column
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
vectors)?
+++++++++
That depends primarily on how the network combines input variables to
compute the net input to the next (hidden or output) layer. If the input
variables are combined via a distance function (such as Euclidean distance)
in an RBF network, standardizing inputs can be crucial. The contribution of
an input will depend heavily on its variability relative to other inputs. If
one input has a range of 0 to 1, while another input has a range of 0 to
1,000,000, then the contribution of the first input to the distance will be
swamped by the second input. So it is essential to rescale the inputs so
that their variability reflects their importance, or at least is not in
inverse relation to their importance. For lack of better prior information,
it is common to standardize each input to the same range or the same
standard deviation. If you know that some inputs are more important than
others, it may help to scale the inputs such that the more important ones
have larger variances and/or ranges.
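A tiny numeric illustration of the swamping effect (hypothetical values):

    import numpy as np

    a = np.array([0.2, 100000.0])   # input 1 ranges over [0,1],
    b = np.array([0.9, 900000.0])   # input 2 over [0,1000000]
    print(np.linalg.norm(a - b))    # ~800000: input 1's difference of 0.7
                                    # contributes nothing visible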
If the input variables are combined linearly, as in an MLP, then it is
rarely strictly necessary to standardize the inputs, at least in theory. The
reason is that any rescaling of an input vector can be effectively undone by
changing the corresponding weights and biases, leaving you with the exact
same outputs as you had before. However, there are a variety of practical
reasons why standardizing the inputs can make training faster and reduce the
chances of getting stuck in local optima.
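The algebra behind this claim is simple: if an input x is replaced by
z = (x-m)/s, then using the weight s*w and the bias b+m*w reproduces the
original net input, since (s*w)*z + (b+m*w) = w*x + b. A one-line check
with hypothetical numbers:

    w, b, m, s = 0.7, -0.3, 50.0, 10.0     # weight, bias, input mean and std
    x = 57.0
    z = (x - m) / s                        # standardized input
    assert abs((w*x + b) - ((s*w)*z + (b + m*w))) < 1e-9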
The main emphasis in the NN literature on initial values has been on the
avoidance of saturation, hence the desire to use small random values. How
small these random values should be depends on the scale of the inputs as
well as the number of inputs and their correlations. Standardizing inputs
removes the problem of scale dependence of the initial weights.
But standardizing input variables can have far more important effects on
initialization of the weights than simply avoiding saturation. Assume we
have an MLP with one hidden layer applied to a classification problem and
are therefore interested in the hyperplanes defined by each hidden unit.
Each hyperplane is the locus of points where the net-input to the hidden
unit is zero and is thus the classification boundary generated by that
hidden unit considered in isolation. The connection weights from the inputs
to a hidden unit determine the orientation of the hyperplane. The bias
determines the distance of the hyperplane from the origin. If the bias terms
are all small random numbers, then all the hyperplanes will pass close to
the origin. Hence, if the data are not centered at the origin, the
hyperplane may fail to pass through the data cloud. If all the inputs have a
small coefficient of variation, it is quite possible that all the initial
hyperplanes will miss the data entirely. With such a poor initialization,
local minima are very likely to occur. It is therefore important to center
the inputs to get good random initializations.
Standardizing input variables also has different effects on different
training algorithms for MLPs. For example:
o Steepest descent is very sensitive to scaling. The more ill-conditioned
the Hessian is, the slower the convergence. Hence, scaling is an
important consideration for gradient descent methods such as standard
backprop.
o Quasi-Newton and conjugate gradient methods begin with a steepest descent
step and therefore are scale sensitive. However, they accumulate
second-order information as training proceeds and hence are less scale
sensitive than pure gradient descent.
o Newton-Raphson and Gauss-Newton, if implemented correctly, are
theoretically invariant under scale changes as long as none of the
scaling is so extreme as to produce underflow or overflow.
o Levenberg-Marquardt is scale invariant as long as no ridging is required.
There are several different ways to implement ridging; some are scale
invariant and some are not. Performance under bad scaling will depend on
details of the implementation.
Subquestion: Should I standardize the target variables (column
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
vectors)?
+++++++++
Standardizing target variables is typically more a convenience for getting
good initial weights than a necessity. However, if you have two or more
target variables and your error function is scale-sensitive like the usual
least (mean) squares error function, then the variability of each target
relative to the others can affect how well the net learns that target. If
one target has a range of 0 to 1, while another target has a range of 0 to
1,000,000, the net will expend most of its effort learning the second target
to the possible exclusion of the first. So it is essential to rescale the
targets so that their variability reflects their importance, or at least is
not in inverse relation to their importance. If the targets are of equal
importance, they should typically be standardized to the same range or the
same standard deviation.
The scaling of the targets does not affect their importance in training if
you use maximum likelihood estimation and estimate a separate scale
parameter (such as a standard deviation) for each target variable. In this
case, the importance of each target is inversely related to its estimated
scale parameter. In other words, noisier targets will be given less
importance.
Subquestion: Should I standardize the input cases (row vectors)?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Whereas standardizing variables is usually beneficial, the effect of
standardizing cases (row vectors) depends on the particular data. Cases are
typically standardized only across the input variables, since including the
target variable(s) in the standardization would make prediction impossible.
There are some kinds of networks, such as simple Kohonen nets, where it is
necessary to standardize the input cases to a common Euclidean length; this
is a side effect of the use of the inner product as a similarity measure. If
the network is modified to operate on Euclidean distances instead of inner
products, it is no longer necessary to standardize the input cases.
As emphasized above, standardization of cases discards information: if that
information is irrelevant, standardizing cases can be quite helpful, but if
it is important, standardizing cases can be disastrous. Issues regarding the
standardization of cases must be carefully evaluated in every application.
There are no rules of thumb that apply to all applications.
You may want to standardize each case if there is extraneous variability
between cases. Consider the common situation in which each input variable
represents a pixel in an image. If the images vary in exposure, and exposure
is irrelevant to the target values, then it would usually help to subtract
the mean of each case to equate the exposures of different cases. If the
images vary in contrast, and contrast is irrelevant to the target values,
then it would usually help to divide each case by its standard deviation to
equate the contrasts of different cases. Given sufficient data, a NN could
learn to ignore exposure and contrast. However, training will be easier and
generalization better if you can remove the extraneous exposure and contrast
information before training the network.
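A sketch of this kind of case standardization, with images stored as rows of
a matrix (one reasonable choice among several, not the only one):

    import numpy as np

    def standardize_cases(images):
        """Remove per-case exposure (row mean) and contrast (row std)."""
        centered = images - images.mean(axis=1, keepdims=True)
        # Assumes no image is constant; add a small epsilon otherwise.
        return centered / centered.std(axis=1, keepdims=True)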
As another example, suppose you want to classify plant specimens according
to species but the specimens are at different stages of growth. You have
measurements such as stem length, leaf length, and leaf width. However, the
over-all size of the specimen is determined by age or growing conditions,
not by species. Given sufficient data, a NN could learn to ignore the size
of the specimens and classify them by shape instead. However, training will
be easier and generalization better if you can remove the extraneous size
information before training the network. Size in the plant example
corresponds to exposure in the image example.
If the input data are measured on an interval scale, you can control for
size by subtracting a measure of the over-all size of each case from each
datum. For example, if no other direct measure of size is available, you
could subtract the mean of each row of the input matrix, producing a
row-centered input matrix.
If the data are measured on a ratio scale, you can control for size by
dividing each datum by a measure of over-all size. It is common to divide by
the sum or by the arithmetic mean. For positive ratio data, however, the
geometric mean is often a more natural measure of size than the arithmetic
mean. It may also be more meaningful to analyze the logarithms of positive
ratio-scaled data, in which case you can subtract the arithmetic mean after
taking logarithms. You must also consider the dimensions of measurement. For
example, if you have measures of both length and weight, you may need to
cube the measures of length or take the cube root of the weights.
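For positive data, dividing each case by its geometric mean is the same as
subtracting the row mean after taking logarithms, as this sketch shows:

    import numpy as np

    X = np.array([[ 2.0,  8.0,  4.0],
                  [20.0, 80.0, 40.0]])   # second specimen is 10 times larger

    gm = np.exp(np.log(X).mean(axis=1, keepdims=True))  # row geometric means
    print(X / gm)                        # both rows become [0.5, 2.0, 1.0]
    # Equivalently: np.log(X) - np.log(X).mean(axis=1, keepdims=True)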
In NN applications with ratio-level data, it is common to divide by the
Euclidean length of each row. If the data are positive, dividing by the
Euclidean length has properties similar to dividing by the sum or arithmetic
mean, since the former projects the data points onto the surface of a
hypersphere while the latter projects the points onto a hyperplane. If the
dimensionality is not too high, the resulting configurations of points on
the hypersphere and hyperplane are usually quite similar. If the data
contain negative values, then the hypersphere and hyperplane can diverge
widely.
------------------------------------------------------------------------
Subject: Should I nonlinearly transform the data?
==================================================
Most importantly, when the data are noisy, nonlinear transformations of the
targets change the behavior of the error function. Many commonly used
error functions depend solely on the difference abs(target-output).
Nonlinear transformations (unlike linear transformations) change the
relative sizes of these differences. With most error functions, the net will
expend more effort, so to speak, trying to learn target values for which
abs(target-output) is large.
For example, suppose you are trying to predict the price of a stock. If the
price of the stock is 10 (in whatever currency unit) and the output of the
net is 5 or 15, yielding a difference of 5, that is a huge error. If the
price of the stock is 1000 and the output of the net is 995 or 1005,
yielding the same difference of 5, that is a tiny error. You don't want the
net to treat those two differences as equally important. By taking
logarithms, you are effectively measuring errors in terms of ratios rather
than differences, since a difference between two logs corresponds to the
ratio of the original values. This has approximately the same effect as
looking at percentage differences, abs(target-output)/target or
abs(target-output)/output, rather than simple differences.
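The stock-price example in numbers:

    import numpy as np

    print(abs(10 - 5), abs(1000 - 995))    # raw differences: both 5
    print(abs(np.log(10) - np.log(5)))     # ~0.693: off by a factor of 2
    print(abs(np.log(1000) - np.log(995))) # ~0.005: off by about 0.5%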
Less importantly, smooth functions are usually easier to learn than rough
functions. Generalization is also usually better for smooth functions. So
nonlinear transformations (of either inputs or targets) that make the
input-output function smoother are usually beneficial. For classification
problems, you want the class boundaries to be smooth. When there are only a
few inputs, it is often possible to transform the data to a linear
relationship, in which case you can use a linear model instead of a more
complex neural net, and many things (such as estimating generalization error
and error bars) will become much simpler. A variety of NN architectures (RBF
networks, B-spline networks, etc.) amount to using many nonlinear
transformations, possibly involving multiple variables simultaneously, to
try to make the input-output function approximately linear (Ripley 1996,
chapter 4). There are particular applications, such as signal and image
processing, in which very elaborate transformations are useful (Masters
1994).
It is usually advisable to choose an error function appropriate for the
distribution of noise in your target variables (McCullagh and Nelder 1989).
But if your software does not provide a sufficient variety of error
functions, then you may need to transform the target so that the noise
distribution conforms to whatever error function you are using. For example,
if you have to use least-(mean-)squares training, you will get the best
results if the noise distribution is approximately Gaussian with constant
variance, since least-(mean-)squares is maximum likelihood in that case.
Heavy-tailed distributions (those in which extreme values occur more often
than in a Gaussian distribution, often as indicated by high kurtosis) are
especially of concern, due to the loss of statistical efficiency of
least-(mean-)square estimates (Huber 1981). Note that what is important is
the distribution of the noise, not the distribution of the target values.
The distribution of inputs may suggest transformations, but this is by far
the least important consideration among those listed here. If an input is
strongly skewed, a logarithmic, square root, or other power (between -1 and
1) transformation may be worth trying. If an input has high kurtosis but low
skewness, a tanh transform can reduce the influence of extreme values:
input - mean
tanh( c ------------ )
stand. dev.
where c is a constant that controls how far the extreme values are brought
in towards the mean.
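In code, the transformation is just (a sketch; choosing c is up to you):

    import numpy as np

    def tanh_squash(x, c=1.0):
        """Compress extreme values toward the mean; c controls how hard."""
        return np.tanh(c * (x - x.mean()) / x.std())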
References:
Atkinson, A.C. (1985) Plots, Transformations and Regression, Oxford:
Clarendon Press.
Carroll, R.J. and Ruppert, D. (1988) Transformation and Weighting in
Regression, London: Chapman and Hall.
Huber, P.J. (1981), Robust Statistics, NY: Wiley.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
ed., London: Chapman and Hall.
Masters, T. (1994), Signal and Image Processing with Neural Networks: A
C++ Sourcebook, NY: Wiley.
Ripley, B.D. (1996), Pattern Recognition and Neural Networks,
Cambridge: Cambridge University Press.
------------------------------------------------------------------------
Subject: What is ART?
=====================
ART stands for "Adaptive Resonance Theory", invented by Stephen Grossberg in
1976. ART encompasses a wide variety of neural networks based explicitly on
neurophysiology. ART networks are defined algorithmically in terms of
detailed differential equations intended as plausible models of biological
neurons. In practice, ART networks are implemented using analytical
solutions or approximations to these differential equations.
ART comes in several flavors, both supervised and unsupervised. As discussed
by Moore (1988), the unsupervised ARTs are basically similar to many
iterative clustering algorithms in which each case is processed by:
1. finding the "nearest" cluster seed (AKA prototype or template) to that
case
2. updating that cluster seed to be "closer" to the case
where "nearest" and "closer" can be defined in hundreds of different ways.
In ART, the framework is modified slightly by introducing the concept of
"resonance" so that each case is processed by:
1. finding the "nearest" cluster seed that "resonates" with the case
2. updating that cluster seed to be "closer" to the case
"Resonance" is just a matter of being within a certain threshold of a second
similarity measure. A crucial feature of ART is that if no seed resonates
with the case, a new cluster is created as in Hartigan's (1975) leader
algorithm. This feature is said to solve the stability/plasticity dilemma.
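In pseudocode terms, the scheme looks like the following Python sketch. The
similarity measure and the resonance test are deliberately left as
caller-supplied functions, since real ART networks define them via specific
differential equations; this is only the generic skeleton:

    def art_like_clustering(cases, vigilance, rate, similarity, resonates):
        """Leader-style clustering with a resonance test (generic sketch)."""
        seeds = []
        for x in cases:
            # 1. Find the "nearest" seed that resonates with the case.
            ranked = sorted(seeds, key=lambda s: similarity(s, x),
                            reverse=True)
            winner = next((s for s in ranked if resonates(s, x, vigilance)),
                          None)
            if winner is None:
                seeds.append(list(x))    # no resonance: create a new cluster
            else:
                # 2. Move the winning seed "closer" to the case.
                for i in range(len(winner)):
                    winner[i] += rate * (x[i] - winner[i])
        return seeds

Note that the result depends on the order in which the cases are presented,
a point taken up below.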
ART has its own jargon. For example, data are called an arbitrary sequence
of input patterns. The current training case is stored in short term memory
and cluster seeds are long term memory. A cluster is a maximally compressed
pattern recognition code. The two stages of finding the nearest seed to the
input are performed by an Attentional Subsystem and an Orienting Subsystem,
the latter of which performs hypothesis testing, which simply refers to the
comparison with the vigilance threshold, not to hypothesis testing in the
statistical sense. Stable learning means that the algorithm converges. So
the oft-repeated claim that ART algorithms are "capable of rapid stable
learning of recognition codes in response to arbitrary sequences of input
patterns" merely means that ART algorithms are clustering algorithms that
converge; it does not mean, as one might naively assume, that the clusters
are insensitive to the sequence in which the training patterns are
presented--quite the opposite is true.
There are various supervised ART algorithms that are named with the suffix
"MAP", as in Fuzzy ARTMAP. These algorithms cluster both the inputs and
targets and associate the two sets of clusters. The effect is somewhat
similar to counterpropagation. The main disadvantage of the ARTMAP
algorithms is that they have no mechanism to avoid overfitting and hence
should not be used with noisy data.
For more information, see the ART FAQ at http://www.wi.leidenuniv.nl/art/
and the "ART Headquarters" at Boston University, http://cns-web.bu.edu/. For
a different view of ART, see Sarle, W.S. (1995), "Why Statisticians Should
Not FART," ftp://ftp.sas.com/pub/neural/fart.doc.
References:
Carpenter, G.A., Grossberg, S. (1996), "Learning, Categorization, Rule
Formation, and Prediction by Fuzzy Neural Networks," in Chen, C.H., ed.
(1996) Fuzzy Logic and Neural Network Handbook, NY: McGraw-Hill, pp.
1.3-1.45.
Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley.
Kasuba, T. (1993), "Simplified Fuzzy ARTMAP," AI Expert, 8, 18-25.
Moore, B. (1988), "ART 1 and Pattern Clustering," in Touretzky, D.,
Hinton, G. and Sejnowski, T., eds., Proceedings of the 1988
Connectionist Models Summer School, 174-185, San Mateo, CA: Morgan
Kaufmann.
------------------------------------------------------------------------
Subject: What is PNN?
=====================
PNN or "Probabilistic Neural Network" is Donald Specht's term for kernel
discriminant analysis. You can think of it as a normalized RBF network in
which there is a hidden unit centered at every training case. These RBF
units are called "kernels" and are usually probability density functions
such as the Gaussian. The hidden-to-output weights are usually 1 or 0; for
each hidden unit, a weight of 1 is used for the connection going to the
output that the case belongs to, while all other connections are given
weights of 0. Alternatively, you can adjust these weights for the prior
probabilities of each class. So the only weights that need to be learned are
the widths of the RBF units. These widths (often a single width is used) are
called "smoothing parameters" or "bandwidths" and are usually chosen by
cross-validation or by more esoteric methods that are not well-known in the
neural net literature; gradient descent is not used.
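A minimal sketch of the PNN idea with Gaussian kernels and a single
bandwidth (function names are hypothetical; class priors and bandwidth
selection are omitted):

    import numpy as np

    def pnn_classify(x, X_train, y_train, bandwidth):
        """Assign x to the class with the largest summed kernel density."""
        d2 = ((X_train - x) ** 2).sum(axis=1)
        k = np.exp(-d2 / (2.0 * bandwidth ** 2))  # kernel at each case
        classes = np.unique(y_train)
        scores = np.array([k[y_train == c].sum() for c in classes])
        return classes[scores.argmax()], scores.max()

The second return value supports the extrapolation check discussed below: if
even the winning class's summed kernel is tiny, the test case is far from
all of the training data.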
Specht's claim that a PNN trains 100,000 times faster than backprop is at
best misleading. While kernel methods are not iterative in the same sense as
backprop,
kernel methods require that you estimate the kernel bandwidth, and this
requires accessing the data many times. Furthermore, computing a single
output value with kernel methods requires either accessing the entire
training data or clever programming, and either way is much slower than
computing an output with a feedforward net. And there are a variety of
methods for training feedforward nets that are much faster than standard
backprop. So depending on what you are doing and how you do it, PNN may be
either faster or slower than a feedforward net.
PNN is a universal approximator for smooth class-conditional densities, so
it should be able to solve any smooth classification problem given enough
data. The main drawback of PNN is that, like kernel methods in general, it
suffers badly from the curse of dimensionality. PNN cannot ignore irrelevant
inputs without major modifications to the basic algorithm. So PNN is not
likely to be the top choice if you have more than 5 or 6 nonredundant
inputs.
But if all your inputs are relevant, PNN has the very useful ability to tell
you whether a test case is similar (i.e. has a high density) to any of the
training data; if not, you are extrapolating and should view the output
classification with skepticism. This ability is of limited use when you have
irrelevant inputs, since the similarity is measured with respect to all of
the inputs, not just the relevant ones.
References:
Hand, D.J. (1982) Kernel Discriminant Analysis, Research Studies Press.
McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern
Recognition, Wiley.
Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0
Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994) Machine
Learning, Neural and Statistical Classification, Ellis Horwood.
Scott, D.W. (1992) Multivariate Density Estimation, Wiley.
Specht, D.F. (1990) "Probabilistic neural networks," Neural Networks, 3,
110-118.
------------------------------------------------------------------------
Subject: What is GRNN?
======================
GRNN or "General Regression Neural Network" is Donald Specht's term for
Nadaraya-Watson kernel regression, also reinvented in the NN literature by
Schi\oler and Hartmann. You can think of it as a normalized RBF network in
which there is a hidden unit centered at every training case. These RBF
units are called "kernels" and are usually probability density functions
such as the Gaussian. The hidden-to-output weights are just the target
values, so the output is simply a weighted average of the target values of
training cases close to the given input case. The only weights that need to
be learned are the widths of the RBF units. These widths (often a single
width is used) are called "smoothing parameters" or "bandwidths" and are
usually chosen by cross-validation or by more esoteric methods that are not
well-known in the neural net literature; gradient descent is not used.
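The whole method fits in a few lines (a sketch with a Gaussian kernel and a
single bandwidth):

    import numpy as np

    def grnn_predict(x, X_train, y_train, bandwidth):
        """Nadaraya-Watson: kernel-weighted average of training targets."""
        d2 = ((X_train - x) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))
        return (w * y_train).sum() / w.sum()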
GRNN is a universal approximator for smooth functions, so it should be able
to solve any smooth function-approximation problem given enough data. The
main drawback of GRNN is that, like kernel methods in general, it suffers
badly from the curse of dimensionality. GRNN cannot ignore irrelevant inputs
without major modifications to the basic algorithm. So GRNN is not likely to
be the top choice if you have more than 5 or 6 nonredundant inputs.
References:
Caudill, M. (1993), "GRNN and Bear It," AI Expert, Vol. 8, No. 5 (May),
28-33.
Haerdle, W. (1990), Applied Nonparametric Regression, Cambridge Univ.
Press.
Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0
Nadaraya, E.A. (1964) "On estimating regression", Theory Probab. Applic.
10, 186-90.
Schi\oler, H. and Hartmann, U. (1992) "Mapping Neural Network Derived
from the Parzen Window Estimator", Neural Networks, 5, 903-909.
Specht, D.F. (1968) "A practical technique for estimating general
regression surfaces," Lockheed report LMSC 6-79-68-6, Defense Technical
Information Center AD-672505.
Specht, D.F. (1991) "A Generalized Regression Neural Network", IEEE
Transactions on Neural Networks, 2, Nov. 1991, 568-576.
Watson, G.S. (1964) "Smooth regression analysis", Sankhy{\=a}, Series A,
26, 359-72.
------------------------------------------------------------------------
Subject: What does unsupervised learning learn?
================================================
Unsupervised learning allegedly involves no target values. In fact, for most
varieties of unsupervised learning, the targets are the same as the inputs
(Sarle 1994). In other words, unsupervised learning usually performs the
same task as an auto-associative network, compressing the information from
the inputs (Deco and Obradovic 1996). Unsupervised learning is very useful
for data visualization (Ripley 1996), although the NN literature generally
ignores this application.
Unsupervised competitive learning is used in a wide variety of fields under
a wide variety of names, the most common of which is "cluster analysis". The
main form of competitive learning in the NN literature is adaptive vector
quantization (AVQ, also called a Kohonen network), of which Kosko (1992)
provides a good description. In AVQ, each of the competitive hidden units
corresponds to a cluster center, and the error function is the sum of
squared distances between each training case and the nearest center. Often,
each training case is normalized to a Euclidean length of one, which allows
distances to be simplified to inner products. The more general error
function based on distances is the same error function used in k-means
clustering, one of the most common classes of cluster analysis (MacQueen
1967; Anderberg 1973; Balakrishnan, Cooper, Jacob, and Lewis 1994). The
k-means model is a special case of the normal mixture model (McLachlan and
Basford 1988), which has also been found to be very useful in neural
networking (e.g., Bishop 1995).
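The error function shared by AVQ and k-means is easy to state in code (a
sketch):

    import numpy as np

    def avq_error(X, centers):
        """Sum of squared distances from each case to its nearest center."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).sum()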
Hebbian learning is the other most common variety of unsupervised
learning (Hertz, Krogh, and Palmer 1991). Hebbian learning minimizes the
same error function as an auto-associative network with a linear hidden
layer, trained by least squares, and is therefore a form of dimensionality
reduction. This error function is equivalent to the sum of squared distances
between each training case and a linear subspace of the input space (with
distances measured perpendicularly), and is minimized by the leading
principal components (Pearson 1901; Hotelling 1933; Rao 1964; Jolliffe 1986;
Jackson 1991). There are variations of Hebbian learning that explicitly
produce the principal components (Hertz, Krogh, and Palmer 1991; Karhunen
1994; Deco and Obradovic 1996).
Perhaps the most novel form of unsupervised learning in the NN literature is
Kohonen's self-organizing (feature) map (SOM, Kohonen 1995). SOMs combine
competitive learning with dimensionality reduction by smoothing the clusters
with respect to an a priori grid (Kohonen 1995; Mulier and Cherkassky 1995).
But the original SOM algorithm does not optimize an "energy" function
(Kohonen 1995, pp. 126, 237) and so is not simply an information-compression
method like most other unsupervised learning networks. Convergence of
Kohonen's SOM algorithm is allegedly demonstrated by Yin and Allinson
(1995), but their "proof" assumes the neighborhood size becomes zero, in
which case the algorithm reduces to AVQ and no longer has topological
ordering properties (Kohonen 1995, p. 111).
References:
Anderberg, M.R. (1973), Cluster Analysis for Applications, New York:
Academic Press, Inc.
Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A
study of the classification capabilities of neural networks using
unsupervised learning: A comparison with k-means clustering",
Psychometrika, 59, 509-525.
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Oxford University Press.
Deco, G. and Obradovic, D. (1996), An Information-Theoretic Approach to
Neural Computing, NY: Springer-Verlag.
Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of
Neural Computation. Addison-Wesley: Redwood City, California.
Hotelling, H. (1933), "Analysis of a Complex of Statistical Variables
into Principal Components," Journal of Educational Psychology, 24,
417-441, 498-520.
Jackson, J.E. (1991), A User's Guide to Principal Components, NY: Wiley.
Jolliffe, I.T. (1986), Principal Component Analysis, Springer-Verlag.
Karhunen, J. (1994), "Stability of Oja's PCA subspace rule," Neural
Computation, 6, 739-747.
Kohonen, T. (1995), Self-Organizing Maps, Berlin: Springer-Verlag.
Kosko, B. (1992), Neural Networks and Fuzzy Systems, Englewood Cliffs,
N.J.: Prentice-Hall.
McLachlan, G.J. and Basford, K.E. (1988), Mixture Models, NY: Marcel
Dekker, Inc.
MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of
Multivariate Observations," Proceedings of the Fifth Berkeley Symposium on
Mathematical Statistics and Probability, 1, 281-297.
Mulier, F. and Cherkassky, V. (1995), "Self-Organization as an Iterative
Kernel Smoothing Process," Neural Computation, 7, 1165-1177.
Pearson, K. (1901) "On Lines and Planes of Closest Fit to Systems of
Points in Space," Phil. Mag., 2(6), 559-572.
Rao, C.R. (1964), "The Use and Interpretation of Principal Component
Analysis in Applied Research," Sankhya A, 26, 329-358.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
Cambridge University Press.
Sarle, W.S. (1994), "Neural Networks and Statistical Models," in SAS
Institute Inc., Proceedings of the Nineteenth Annual SAS Users Group
International Conference, Cary, NC: SAS Institute Inc., pp 1538-1550,
ftp://ftp.sas.com/pub/neural/neural1.ps.
Yin, H. and Allinson, N.M. (1995), "On the Distribution and Convergence
of Feature Space in Self-Organizing Maps," Neural Computation, 7,
1178-1187.
------------------------------------------------------------------------
Subject: What about Genetic Algorithms?
=======================================
There are a number of definitions of GA (Genetic Algorithm). A possible one
is
A GA is an optimization program
that starts with
a population of encoded procedures, (Creation of Life :-> )
mutates them stochastically, (Get cancer or so :-> )
and uses a selection process (Darwinism)
to prefer the mutants with high fitness
and perhaps a recombination process (Make babies :-> )
to combine properties of (preferably) the successful mutants.
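A toy Python sketch of the loop just described, evolving bit strings toward
a caller-supplied fitness function (everything here is hypothetical
illustration, not any particular published GA):

    import random

    def toy_ga(fitness, n_bits=16, pop_size=20, generations=50, p_mut=0.02):
        """Minimal GA: selection, recombination, mutation."""
        pop = [[random.randint(0, 1) for _ in range(n_bits)]
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[:pop_size // 2]          # selection: keep the fit
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n_bits)
                child = a[:cut] + b[cut:]          # recombination
                child = [1 - g if random.random() < p_mut else g
                         for g in child]           # mutation
                children.append(child)
            pop = parents + children
        return max(pop, key=fitness)

    # Example: toy_ga(sum) evolves toward the all-ones string.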
Genetic Algorithms are just a special case of the more general idea of
``evolutionary computation''. There is a newsgroup that is dedicated to the
field of evolutionary computation called comp.ai.genetic. It has a detailed
FAQ posting which, for instance, explains the terms "Genetic Algorithm",
"Evolutionary Programming", "Evolution Strategy", "Classifier System", and
"Genetic Programming". That FAQ also contains lots of pointers to relevant
literature, software, other sources of information, et cetera et cetera.
Please see the comp.ai.genetic FAQ for further information.
------------------------------------------------------------------------
Subject: What about Fuzzy Logic?
================================
Fuzzy Logic is an area of research based on the work of L.A. Zadeh. It is a
departure from classical two-valued sets and logic that uses "soft"
linguistic (e.g. large, hot, tall) system variables and a continuous range
of truth values in the interval [0,1], rather than strict binary (True or
False) decisions and assignments.
Fuzzy logic is used where a system is difficult to model exactly (but an
inexact model is available), is controlled by a human operator or expert, or
where ambiguity or vagueness is common. A typical fuzzy system consists of a
rule base, membership functions, and an inference procedure.
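For instance, a membership function maps a crisp measurement to a degree of
truth in [0,1] (the shape below is a hypothetical example):

    def tall(height_cm):
        """Fuzzy membership in the linguistic category 'tall'."""
        return min(1.0, max(0.0, (height_cm - 160.0) / 30.0))

    # tall(160) == 0.0, tall(175) == 0.5, tall(190) == 1.0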
Most Fuzzy Logic discussion takes place in the newsgroup comp.ai.fuzzy
(where there is a fuzzy logic FAQ) but there is also some work (and
discussion) about combining fuzzy logic with neural network approaches in
comp.ai.neural-nets.
Early work combining neural nets and fuzzy methods used competitive networks
to generate rules for fuzzy systems (Kosko 1992). This approach is
essentially the same thing as bidirectional counterpropagation
(Hecht-Nielsen 1990) and suffers from the same deficiencies. More recent
work (Brown and Harris 1994) has been based on the realization that a fuzzy
system is a nonlinear mapping from an input space to an output space that
can be parameterized in various ways and therefore can be adapted to data
using the usual neural training methods (see "What is backprop?") or
conventional numerical optimization algorithms (see "What are conjugate
gradients, Levenberg-Marquardt, etc.?").
A neural net can incorporate fuzziness in various ways:
o The inputs can be fuzzy. Any garden-variety backprop net is fuzzy in this
sense, and it seems rather silly to call a net "fuzzy" solely on this
basis, although Fuzzy ART (Carpenter and Grossberg 1996) has no other
fuzzy characteristics.
o The outputs can be fuzzy. Again, any garden-variety backprop net is fuzzy
in this sense. But competitive learning nets ordinarily produce crisp
outputs, so for competitive learning methods, having fuzzy output is a
meaningful distinction. For example, fuzzy c-means clustering (Bezdek
1981) is meaningfully different from (crisp) k-means. Fuzzy ART does not
have fuzzy outputs.
o The net can be interpretable as an adaptive fuzzy system. For example,
Gaussian RBF nets and B-spline regression models (Dierckx 1995) are fuzzy
systems with adaptive weights (Brown and Harris 1994) and can
legitimately be called neurofuzzy systems.
o The net can be a conventional NN architecture that operates on fuzzy
numbers instead of real numbers (Lippe, Feuring and Mischke 1995).
o Fuzzy constraints can provide external knowledge (Lampinen and Selonen
1996).
More information on neurofuzzy systems is available online:
o The Fuzzy Logic and Neurofuzzy Resources page of the Image, Speech and
Intelligent Systems (ISIS) research group at the University of
Southampton, Southampton, Hampshire, UK:
http://www-isis.ecs.soton.ac.uk/research/nfinfo/fuzzy.html.
o The Neuro-Fuzzy Systems Research Group's web page at Tampere University
of Technology, Tampere, Finland: http://www.cs.tut.fi/~tpo/group.html and
http://dmiwww.cs.tut.fi/nfs/Welcome_uk.html
o Marcello Chiaberge's Neuro-Fuzzy page at
http://polimage.polito.it/~marcello.
References:
Bezdek, J.C. (1981), Pattern Recognition with Fuzzy Objective Function
Algorithms, New York: Plenum Press.
Bezdek, J.C. & Pal, S.K., eds. (1992), Fuzzy Models for Pattern
Recognition, New York: IEEE Press.
Brown, M., and Harris, C. (1994), Neurofuzzy Adaptive Modelling and
Control, NY: Prentice Hall.
Carpenter, G.A. and Grossberg, S. (1996), "Learning, Categorization, Rule
Formation, and Prediction by Fuzzy Neural Networks," in Chen, C.H.
(1996), pp. 1.3-1.45.
Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY:
McGraw-Hill, ISBN 0-07-011189-8.
Dierckx, P. (1995), Curve and Surface Fitting with Splines, Oxford:
Clarendon Press.
Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley.
Klir, G.J. and Folger, T.A. (1988), Fuzzy Sets, Uncertainty, and
Information, Englewood Cliffs, N.J.: Prentice-Hall.
Kosko, B. (1992), Neural Networks and Fuzzy Systems, Englewood Cliffs,
N.J.: Prentice-Hall.
Lampinen, J. and Selonen, A. (1996), "Using Background Knowledge for
Regularization of Multilayer Perceptron Learning", Submitted to
International Conference on Artificial Neural Networks, ICANN'96, Bochum,
Germany.
Lippe, W.-M., Feuring, Th. and Mischke, L. (1995), "Supervised learning
in fuzzy neural networks," Institutsbericht Angewandte Mathematik und
Informatik, WWU Muenster, I-12,
http://wwwmath.uni-muenster.de/~feuring/WWW_literatur/bericht12_95.ps.gz
------------------------------------------------------------------------
Next part is part 3 (of 7). Previous part is part 1.
--
Warren S. Sarle SAS Institute Inc. The opinions expressed here
saswss@unx.sas.com SAS Campus Drive are mine and not necessarily
(919) 677-8000 Cary, NC 27513, USA those of SAS Institute.