Comments for the neural net expert on
the atree release 2.0 adaptive logic network simulator.
You will probably have a lot of experience with backpropagation-type
networks, so the thrust of the following remarks is to encourage you
to see adaptive logic networks (ALNs) as a modification of that
approach that could offer improved performance in some of your
problems. Certainly ALNs can help in problems that involve demanding
real-time requirements or in situations where the available computing
power is not adequate for a backpropagation network.
First of all, you may not think of an adaptive logic network as a
neural network at all, but it is! In fact, an ALN is a special case
of the familiar multilayer perceptron (MLP) feedforward network.
There are just a few changes, all of them simplifications. Briefly,
the nodes (or elements) have (during training, at least) two input
leads, the input signals x1, x2 are boolean (0 or 1), each connection
weight is determined by a single bit of information, and the
"squashing function" is a threshold. Specifically, if b1 and b2 are
boolean, then the element outputs a boolean value which is 1 if and
only if
(b1 + 1) * x1 + (b2 + 1) * x2 >= 2.
The four combinations of b1 and b2 (00, 11, 10, 01) generate four
boolean functions of two variables: AND, OR, LEFT, and RIGHT, where
LEFT(x1, x2) = x1 and RIGHT(x1, x2) = x2. For example, 1 * x1 + 1 * x2
>= 2 if and only if both x1 AND x2 are 1.
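To make this concrete, here is a short C sketch (illustrative only,
not the atree source) that evaluates the threshold rule above and
prints the truth table obtained for each of the four settings of
(b1, b2):

#include <stdio.h>

/* the two-input ALN element: b1, b2 are the stored bits,
   x1, x2 are the boolean input signals */
static int node(int b1, int b2, int x1, int x2)
{
    /* output is 1 if and only if (b1+1)*x1 + (b2+1)*x2 >= 2 */
    return (b1 + 1) * x1 + (b2 + 1) * x2 >= 2;
}

int main(void)
{
    /* the four (b1,b2) settings give AND, OR, LEFT and RIGHT */
    const char *name[4] = { "AND (00)", "OR (11)", "LEFT (10)", "RIGHT (01)" };
    int b[4][2] = { {0,0}, {1,1}, {1,0}, {0,1} };
    int i, x1, x2;

    for (i = 0; i < 4; i++) {
        printf("%-10s:", name[i]);
        for (x1 = 0; x1 <= 1; x1++)
            for (x2 = 0; x2 <= 1; x2++)
                printf(" f(%d,%d)=%d", x1, x2,
                       node(b[i][0], b[i][1], x1, x2));
        printf("\n");
    }
    return 0;
}
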
A tree of such elements is connected to some boolean input variables
and their complements. Inputs enter the tree at the leaves and the
output is at the root node. The input variables are components of a
vector representing a pattern, such as a handwritten character.
Training on a set of input vectors, furnished with the desired boolean
outputs, results in an assignment of functions to nodes that allows
the tree to approximate the training data. It then has built-in
capacity to generalize, i.e. to give appropriate responses to other
possible input vectors not presented during training, as will be
explained below.
You might ask: "Why bother with a special case when we can use the
whole power of the MLP and the backpropagation training algorithm?"
Here, I think, are some convincing answers:
1) Speed of the hardware implementation of ALNs
There are many ways to achieve pattern recognition capability, ranging
from classical approaches like the K-nearest-neighbors method, to
newer methods using neural networks. These all have their place, and
certainly no one wants to try to argue with proofs showing near
optimality of classical methods in some cases, but ALNs can often
compute results far faster than all other approaches based on digital
logic; and sometimes speed is the critical factor, for example in
robotics or control of electronic circuits.
Training an ALN results in a combinational logic circuit. In fact,
after adaptation, one groups the subtrees of two-input ANDs together
to get ANDs of higher fanin, and similarly for the ORs. The result is
logically equivalent to a tree of NANDs, which is directly
translatable into easily built hardware. Pattern recognition
functions can then be computed in a few tens of nanoseconds in
general, which is several orders of magnitude faster than other
approaches allow. Dedicated ALNs can be realized in field
programmable gate arrays quite cheaply. (This step is not yet done in
the atree release 2.0 software. There are various types of target
chips one could use. Discussions are underway for one kind of FPGA.)
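As a rough illustration of the grouping step, the C sketch below (the
tree representation is assumed for the example; it is not the atree
code) merges chains of identical two-input gates into single gates of
higher fanin:

#include <stdio.h>

enum op { LEAF, AND, OR };

struct node {
    enum op op;
    int var;                    /* variable number, used only for LEAF */
    struct node *left, *right;  /* children, used only for AND and OR  */
};

/* Print a gate, splicing any child that uses the same operator as its
   parent into the parent's fanin list. */
static void emit(const struct node *n, enum op parent)
{
    if (n->op == LEAF) {
        printf(" x%d", n->var);
    } else if (n->op == parent) {
        emit(n->left, parent);      /* same op as parent: merge upward */
        emit(n->right, parent);
    } else {
        printf(" (%s", n->op == AND ? "AND" : "OR");
        emit(n->left, n->op);
        emit(n->right, n->op);
        printf(")");
    }
}

int main(void)
{
    /* made-up binary tree: AND(AND(x1,x2), AND(x3, OR(x4,x5))) */
    struct node x1 = { LEAF, 1, 0, 0 }, x2 = { LEAF, 2, 0, 0 };
    struct node x3 = { LEAF, 3, 0, 0 }, x4 = { LEAF, 4, 0, 0 };
    struct node x5 = { LEAF, 5, 0, 0 };
    struct node o1 = { OR,  0, &x4, &x5 };
    struct node a1 = { AND, 0, &x1, &x2 };
    struct node a2 = { AND, 0, &x3, &o1 };
    struct node root = { AND, 0, &a1, &a2 };

    emit(&root, LEAF);          /* LEAF stands for "no parent operator" */
    printf("\n");               /* prints: (AND x1 x2 x3 (OR x4 x5))    */
    return 0;
}
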
The adaptation algorithms are also designed to be fast: what
corresponds to backprop, a credit assignment computation, is also
combinational, so it is possible to build systems which will adapt a
tree at a rate of, say, 25 million patterns per second. Adaptation to
a pattern would take the same time that one special processor for
backprop neural networks requires for just one multiply-add step of
the many it has to do! (Unfortunately, at that speed, you wouldn't
have time to go for coffee while the system converged; you probably
wouldn't even make it to your office door! So you would have to try a
lot harder problems.)
If there is any hedging above, it is in insisting upon digital logic.
Analog circuits could possibly be faster. Whether they would give
reproducible results is another issue. As long as we are in the realm
of digital logic, we have the theorem that any boolean function can be
realized in two levels of logic (if the complements of the inputs are
available) -- conjunctive normal form does that. For the problems we
have tried, ALNs produce trees which are about five or ten times
deeper than that, and hence which are five or ten times slower
than the absolute limit of speed for ANY digital system!
Of course, any ALN circuit could be turned into a two-level one, but
usually only at astronomical cost. Hence it is plausible that ALNs
are not only a good choice for very high speed pattern recognition --
they are the only way to come close to the ultimate in speed at
reasonable cost and with reproducible results.
2) Speed of software simulation
If a simulation computes a 0 input to an AND gate, then no other
inputs need to be evaluated to know the output of the AND. If the
other input is not computed, we have a "lazy" evaluation of the AND.
Laziness, also called parsimony, speeds up the simulation very
significantly -- 8.9 times for a 10-layer binary tree and 157 times
for a 20-layer tree.
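The C sketch below (again with an assumed tree representation, not the
atree simulator itself) shows the idea: the eager evaluator visits
every leaf, while the lazy one stops as soon as an AND sees a 0 or an
OR sees a 1, and both count how many leaf values were actually read:

#include <stdio.h>

enum op { LEAF, AND, OR };

struct node {
    enum op op;
    int value;                  /* boolean input, used only for LEAF */
    struct node *left, *right;
};

static long leaves_read;

/* eager evaluation: every leaf in the tree is visited */
static int eval_eager(const struct node *n)
{
    if (n->op == LEAF) { leaves_read++; return n->value; }
    {
        int a = eval_eager(n->left);
        int b = eval_eager(n->right);
        return n->op == AND ? (a & b) : (a | b);
    }
}

/* lazy evaluation: a 0 into an AND (or a 1 into an OR) cuts the walk off */
static int eval_lazy(const struct node *n)
{
    if (n->op == LEAF) { leaves_read++; return n->value; }
    {
        int a = eval_lazy(n->left);
        if (n->op == AND && a == 0) return 0;   /* cut-off */
        if (n->op == OR  && a == 1) return 1;   /* cut-off */
        return eval_lazy(n->right);
    }
}

int main(void)
{
    /* tiny example: AND( OR(0,1), AND(0,1) ) */
    struct node l0 = { LEAF, 0, 0, 0 }, l1 = { LEAF, 1, 0, 0 };
    struct node l2 = { LEAF, 0, 0, 0 }, l3 = { LEAF, 1, 0, 0 };
    struct node o  = { OR,  0, &l0, &l1 };
    struct node a  = { AND, 0, &l2, &l3 };
    struct node root = { AND, 0, &o, &a };
    int v;

    leaves_read = 0;
    v = eval_eager(&root);
    printf("eager: value %d, %ld leaves read\n", v, leaves_read);

    leaves_read = 0;
    v = eval_lazy(&root);
    printf("lazy:  value %d, %ld leaves read\n", v, leaves_read);
    return 0;
}
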
On the other hand, backprop can do little to avoid computing
everything, since the sigmoid function, which is increasing in its
entire domain from minus infinity to plus infinity, never cuts off
small values, at least in theory.
The speedup due to parsimony makes the atree software very reasonable
to run even on personal computers. Since atree does not need floating
point operations, no math co-processor is required.
When special hardware is available, but is not large enough to
evaluate a large logical expression in one go, parsimony is still
important. Say there are parts of a computation which can be done
completely in parallel in the hardware, but these parts have to be
computed iteratively. Lazy evaluation will in general allow one to
omit those iterations whose outputs are not needed, based on those
already computed.
Laziness or parsimony is an advantage of ALNs which assures them an
increasingly important role in neurocomputing -- even if the majority
of neural net research today is based mainly on other paradigms which
are non-parsimonious.
3) Representational power
A sufficiently large ALN tree with appropriate node functions and
connections can realize any boolean function. This means that one
can, in principle, generate any mapping from any set representable in
a digital computer to any other such set. The complements of the
input variables have to be included as inputs to the binary tree
described above, since otherwise the ALN could only produce increasing
functions. (A boolean function is increasing if, for any input
vector, changing a 0 to a 1 in it will never change the function value
from 1 to 0.)
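The role of the complemented inputs can be checked mechanically. The C
sketch below (illustrative only) enumerates every function realizable
by a small two-level tree whose leaves are the uncomplemented
variables x1 and x2, and confirms that all of them are increasing:

#include <stdio.h>

/* a boolean function of (x1,x2) stored as a 4-entry truth table,
   indexed by i = 2*x1 + x2 */
typedef struct { int t[4]; } fn;

static fn combine(int op, fn a, fn b)  /* 0 AND, 1 OR, 2 LEFT, 3 RIGHT */
{
    fn r;
    int i;
    for (i = 0; i < 4; i++)
        r.t[i] = op == 0 ? (a.t[i] & b.t[i])
               : op == 1 ? (a.t[i] | b.t[i])
               : op == 2 ? a.t[i] : b.t[i];
    return r;
}

static int increasing(fn f)
{
    int i, j;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            if ((i & j) == i && f.t[i] > f.t[j])
                return 0;   /* changing a 0 to a 1 dropped the output */
    return 1;
}

int main(void)
{
    fn varx1 = { {0, 0, 1, 1} };   /* truth table of x1 */
    fn varx2 = { {0, 1, 0, 1} };   /* truth table of x2 */
    fn leaf[2];
    int l0, l1, l2, l3, op0, op1, op2, total = 0, bad = 0;

    leaf[0] = varx1;
    leaf[1] = varx2;

    /* all two-level trees: 4 leaves from {x1,x2}, 3 node functions */
    for (l0 = 0; l0 < 2; l0++)
    for (l1 = 0; l1 < 2; l1++)
    for (l2 = 0; l2 < 2; l2++)
    for (l3 = 0; l3 < 2; l3++)
    for (op0 = 0; op0 < 4; op0++)
    for (op1 = 0; op1 < 4; op1++)
    for (op2 = 0; op2 < 4; op2++) {
        fn f = combine(op2, combine(op0, leaf[l0], leaf[l1]),
                            combine(op1, leaf[l2], leaf[l3]));
        total++;
        if (!increasing(f))
            bad++;
    }
    printf("%d trees checked, %d non-increasing functions\n", total, bad);
    return 0;
}
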
There are theorems that deal with approximation of arbitrary
continuous functions using the (infinitely often differentiable)
functions produced, in theory, by neural networks using sigmoids --
but what if the function you need has a discontinuity? Can the
discontinuity be generated without impairing extrapolation? One may
be better off if one can generate discontinuities, and discontinuities
in derivatives too.
4) Generalization or extrapolation
This is the real strong point of ALNs. Pattern classification
problems generally require an output which is the same for patterns
which are in some way similar. Similarity could be based on all sorts
of criteria, but generally it can be defined by saying two individuals
have many features in common. Of course, two neighboring rows of a
checkerboard have a lot in common even if they have no corresponding
squares of the same color! But if we can agree that having features
in common is a good criterion for being in the same class in
pattern recognition, then for good generalization one doesn't want the
output to change when one or a few features, among many, are altered.
If a feature is represented by a boolean variable, then the above
says that similarity can be measured by Hamming distance between boolean
vectors. In order to use ALNs, we generally have to code other
representations of individuals into boolean vectors in such a way that
similarity for pattern classification purposes is reflected in Hamming
distance. For example, the atree package incorporates a way of
encoding real values into boolean vectors that, at least locally,
somewhat preserves the metric on an interval of the real line. Unary
numbers preserve the metric on positive integers perfectly, but this
is quite an inefficient use of bits, so we have implemented a random
walk technique.
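The sketch below gives the flavor of such an encoding; it is only an
illustration, and the details of the scheme actually used in atree may
differ. The code word for quantization level i+1 is obtained from the
code word for level i by flipping a few randomly chosen bits, so
nearby levels end up close in Hamming distance:

#include <stdio.h>
#include <stdlib.h>

#define BITS     32   /* bits per code word                   */
#define LEVELS   16   /* number of quantization levels        */
#define STEPSIZE 3    /* bits flipped between adjacent levels */

static int code[LEVELS][BITS];

static int hamming(const int *a, const int *b)
{
    int i, d = 0;
    for (i = 0; i < BITS; i++)
        d += a[i] != b[i];
    return d;
}

int main(void)
{
    int i, j;

    srand(1);
    for (j = 0; j < BITS; j++)            /* random code for level 0 */
        code[0][j] = rand() % 2;

    for (i = 1; i < LEVELS; i++) {        /* random walk in Hamming space */
        for (j = 0; j < BITS; j++)
            code[i][j] = code[i - 1][j];
        for (j = 0; j < STEPSIZE; j++)
            code[i][rand() % BITS] ^= 1;  /* flip a randomly chosen bit */
    }

    for (i = 0; i < LEVELS; i++)
        printf("level %2d: Hamming distance to level 0 = %d\n",
               i, hamming(code[i], code[0]));
    return 0;
}
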
Binary trees with node functions AND, OR, LEFT, and RIGHT have been
shown to have the property that the functions they realize tend to
have the same output value when the inputs are perturbed. This
depends on averaging over all functions on the tree, weighted by the
number of ways each can be produced by setting the two bits at each
node which determine the node's function. It is also necessary to
average over all vectors and all perturbations of a given number of
bits.
The plausible reasoning behind this is that when an input signal to a
gate changes, it may or may not change the output. A change to an
input to an AND will not change its output if the other input to the
AND is 0. Overall, the probability of a change is 0.5 if one input
changes. So for a 10-layer tree, the probability of a single bit
change at a leaf causing a change of the tree output is 1/1024.
In a tree with multiple connections to the bits of an input vector and
their complements, the arguments are a bit more elaborate. Consider a
balanced binary tree. Suppose some bits are perturbed at random in the
input vector, with p being the probability of an individual bit being
changed. This could correspond to "salt and pepper" noise in a
scanned character. We ask: what is the probability of change in the
next layer? It turns out to be p - p^2/4. The probability of
perturbation decreases as the number of layers increases. More
importantly, the limit as the number of layers goes to infinity is
zero. For 14 layers, the probability of change of the output when the
input is perturbed is about 0.2 no matter how much the input is
perturbed! The sequence of p values starting at 1.0 is: 1.0, 0.75,
.61, .52, .45, .40, .36, .33, .30, .28, .26, .24, .23, .21, .20. (If
you think the probability should never go below 0.5, you haven't
allowed for the case of a function which is almost a constant. As you
train an ALN, it may start as almost a constant function. You force
it to become sensitive in certain ways, but it retains its general
tendency towards insensitivity.)
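The sequence is easy to reproduce; the following few lines of C
iterate p -> p - p^2/4 through 14 layers:

#include <stdio.h>

int main(void)
{
    double p = 1.0;   /* probability that an input bit is changed */
    int layer;

    printf("layer  0: p = %.2f\n", p);
    for (layer = 1; layer <= 14; layer++) {
        /* probability that the perturbation survives one more layer */
        p = p - p * p / 4.0;
        printf("layer %2d: p = %.2f\n", layer, p);
    }
    /* for comparison, a single changed leaf bit reaches the root of a
       10-layer tree with probability 0.5^10 = 1/1024 */
    return 0;
}
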
The above implies that binary tree functions based on the given node
functions, AND, OR, LEFT, and RIGHT, are very insensitive to changes
in their inputs. If we find one that satisfies the training data, even
by random search through the space of all possibilities, then it
will tend to do a good job of extrapolation. So it is not the
training procedure per se that is important here. Insensitivity
replaces the concept of continuity in backpropagation networks. It is
what we really need in a system intended to solve pattern recognition
problems.
I hope the above arguments will be enough to convince you to try the
atree release 2.0 software. The lf language should make it easy. I
suggest you take files for your training and test sets in some
backpropagation problem and substitute them for the sets in the
multiply.lf program to get yourprogram.lf. Pick a tree of a few
thousand nodes, adjust the maximum and minimum value allowed for each
variable, the numbers of quantization levels for the different
variables, the number of bits used to represent each quantity, and the
stepsize between representations of successive quantization levels in
the random walk. Then type "lf yourprogram.lf". I hope you will be
pleasantly surprised, and that you will then make ALNs a permanent
part of your neural net toolkit!
Good luck.
Bill Armstrong