.ce
Fast Back-Propagation
.ce
Copyright (c) 1991 by Donald R. Tveter
.ce
1. Introduction
The programs described below were produced for my own use in studying
back-propagation and for doing some examples that are found in my
introductory Artificial Intelligence textbook, \fIThe Basis of
Artificial Intelligence\fR, to be published by Computer Science Press.
(One would hope some time before the sun burns out or before the Cubs
win the World Series or before Congress balances the budget or ... .)
I have copyrighted these files but I hereby give permission to
anyone to use them for experimentation, for educational purposes or to
redistribute them on a not-for-profit basis. Anyone who wants to
use these programs for business or commercial purposes will be charged
a small fee. You should contact me by mail at:
.na
.nf
Dr. Donald R. Tveter
5228 N. Nashville Ave.
Chicago, Illinois 60656
USENET: drt@chinet.chi.il.us
.ad
.fi
Also, I would be interested in hearing from anyone with suggestions,
bug reports and major successes or failures.
Note: this is use-at-your-own-risk software; there is no guarantee
that it is bug-free. Use of this software constitutes acceptance for
use in an as-is condition. There are no warranties with regard to this
software. In no event shall the author be liable for any damages
whatsoever arising out of or in connection with the use or performance
of this software.
There is also a good chance that I will soon produce and sell DOS and
extended DOS binary versions with more capabilities than there are in
this UNIX (tm of AT&T Bell Labs) source code version. They will
probably include Quickprop, SuperSAB and Cascade Correlation at the very
least and perhaps also DSM, LVQ1 and counter-propagation as well.
There are four simulators that can be constructed from the included
files. The program rbp does back-propagation using real weights and
real arithmetic. The program bp does back-propagation using 16-bit integer
weights, 16- and 32-bit integer arithmetic and some floating point
arithmetic. The program sbp uses 16-bit integer symmetric weights but
only allows two-layer networks. The program srbp does the same using
real weights. The purpose of sbp and srbp is to produce networks that
can be used with the Boltzmann machine relaxation algorithm (not
included).
In most cases, the 16-bit integer programs are the most useful,
because they are the fastest. Unfortunately, sometimes 16-bit
integer weights don't have enough range or precision and then using the
floating point versions may be necessary. Many other speed-up
techniques are included in these programs.
.ce
Updated Version
This version contains a number of improvements and changes to command
names vs. the September 1990 version. For a summary of changes and
additions see the file changes.
.ce
2. Making the Simulators
This code has been written to use either 32-bit floating point
(float) or 64-bit floating point (double) arithmetic. On System V
machines the standard seems to be that all floating point arithmetic is
done with double precision, so double arithmetic is faster than float
and is therefore the default. Other versions of C (e.g. ANSI C)
will do single precision real arithmetic. To get 32-bit floating point
set the compiler flag FLOAT in the makefile. The function, exp, defined
in real.c is double since System V specifies it as double. If your C
uses float, change this definition as well.
One option exists for bp and sbp. If your compiler isn't smart
enough to divide by 1024 by shifting, remove the SMART flag in the
makefile.
To make a particular executable file, use the makefile given
with the data files and make any or all of them like so:
.ce
make bp
.ce
make sbp
.ce
make rbp
.ce
make srbp
To make a record of all the input and output from the programs,
the following small UNIX command file I call "record" can be used:
.na
.nf
# record: run a command and log everything typed and printed to <arg1>.record
trap "" 2                                      # ignore interrupts so the log survives a DEL
outfile="${1}.record"
if test -f "$outfile"
then
    rm "$outfile"                              # start with a fresh log file
fi
echo "$outfile"
(tee -a "$outfile" | $*) | tee -a "$outfile"   # copy both input and output into the log
process=`ps | grep tee | cut -c1-6`            # find the tee still reading the keyboard
kill $process
.ad
.fi
For example, to make a record of all the input and output from the
program bp using the data file, xor, use:
.ce
record bp xor
.ce
3. A Simple Example
Each version would normally be called with the name of a file to read
commands from, as in:
.ce
bp xor
When no file name is specified, bp will take commands from the keyboard
(UNIX stdin file). After the data file is read, commands are then taken
from the keyboard.
The commands are one-letter commands and most of them have optional
parameters. The `A', `B', `d' and `f' commands allow a number of
sub-commands on a line. The maximum length of any line is 256
characters. An `*' is a comment and it can be used to make the
remainder of the line a comment. Here is an example of a data file to
do the xor problem:
.na
.nf
* input file for the xor problem
m 2 1 1 * make a 2-1-1 network
c 1 1 3 1 * add this extra connection
c 1 2 3 1 * add this extra connection
s 7 * seed the random number function
k 0 1 * give the network random weights
n 4 * read four new patterns into memory
1 0 1
0 0 0
0 1 1
1 1 0
e 0.5 * set eta to 0.5 (and eta2 to 0.5)
a 0.9 * set alpha to 0.9
.ad
.fi
First, in this example, the m command will make a network with 2 units
in the input layer, 1 unit in the second layer and 1 unit in the third
layer. The following c commands create extra connections from layer 1
unit 1 to layer 3 unit 1 and from layer 1 unit 2 to layer 3 unit 1.
The `s' command sets the seed for the random number function. The `k'
command then gives the network random weights. The `k' command has
another use as well. It can be used to try to kick a network out of a
local minimum. Here, the meaning of "k 0 1" is to examine all the
weights in the network and for every weight equal to 0 (and they all
start out at 0), add in a random number between -1 and +1. The `n'
command specifies four new patterns to be read into memory. With
the `n' command, any old patterns that may have been present are
removed. There is also an `x' command that behaves like the `n'
command, except that the `x' command \fIadds\fR the extra patterns to the
current training set. The input pattern comes first, followed by the
output pattern. The statement, e 0.5, sets eta, the learning rate for
the upper layer to 0.5 and eta2 for the lower layers to 0.5 as well.
The last line sets alpha, the momentum parameter, to 0.9.
After these commands are executed, the following messages and prompt
appear:
.na
.nf
.ne3
Fast Back-Propagation Copyright (c) 1991 by Donald R. Tveter
taking commands from stdin now
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]?
.ad
.fi
The characters within the square brackets are a list of the possible
commands. To run 100 iterations of back-propagation and print out the
status of the learning every 20 iterations type "r 100 20" at the
prompt:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? r 100 20
This gives:
.na
.nf
running . . .
20 iterations 0.00% right (0 right 4 wrong) 0.49927 error/unit
40 iterations 0.00% right (0 right 4 wrong) 0.43188 error/unit
60 iterations 75.00% right (3 right 1 wrong) 0.09033 error/unit
62 iterations 100.00% right (4 right 0 wrong) 0.07129 error/unit
patterns learned to within 0.10 at iteration^G 62
.ad
.fi
The program immediately prints out the "running . . ." message. After
each 20 iterations, a summary of the learning process is printed, giving
the percentage of patterns that are right, the number right and wrong
and the average value of the absolute values of the errors of the output
units. The program stops when each output for each pattern has been
learned to within the required tolerance, in this case the default value
of 0.1. A ctrl-G is normally printed out as well to sound the bell. If
the second number defining how often to print out a summary is omitted,
the summaries will not be printed. Sometimes the integer versions will
do a few extra iterations before declaring the problem done because of
truncation errors in the arithmetic done to check for convergence. The
status figures for iteration i are computed when making the forward pass
of the iteration and before the weights are updated so these values are
one iteration out of date. This saves on CPU time; however, if you
really need up-to-date statistics, use the u+ option described in the
section on the format command.
.ce
Listing Patterns
To get a listing of the status of each pattern, use the `P' command
to give:
.na
.nf
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? P
1 0.90 (0.098) ok
2 0.05 (0.052) ok
3 0.94 (0.062) ok
4 0.07 (0.074) ok
62 iterations 100.00% right (4 right 0 wrong) 0.07129 error/unit
.ad
.fi
The numbers in parentheses give the sum of the absolute values of the
output errors for each pattern. An `ok' is given to every pattern that
has been learned to within the required tolerance. To get the status of
one pattern, say, the fourth pattern, type "P 4" to give:
0.07 (0.074) ok
To get a summary without the complete listing, use "P 0". To get the
output targets for a given pattern, say pattern 3, use "O 3".
A particular test pattern can be input to the network with the `p'
command, as in:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? p 1 0
0.90
.ce
Examining Weights
It is often interesting to see the values of some particular weights
in the network. To see a listing of all the weights in a network, use
the save weights command described below and then cat the file, weights.
To see the weights leading into a particular node, say unit 1 in
layer 2, use the `w' command as in:
.na
.nf
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? w 2 1
layer   unit   unit value    weight      input from unit
  1      1      1.00000       9.53516       9.53516
  1      2      0.00000      -8.40332       0.00000
  2      t      1.00000       4.13086       4.13086
                                   sum =   13.66602
.ad
.fi
This listing also gives data on how the current activation value of
the node is computed using the weights and the activation values of
the nodes feeding into unit 1 of layer 2. The `t' unit is the
threshold unit.
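As a rough illustration (this is not the code from the simulators), the
computation behind such a listing can be sketched in C: each line of the
listing is the product of a sending unit's activation and the weight on
the connection, the `t' unit always has activation 1.0, and the unit's
value is the sum passed through the activation function:
.na
.nf
/* Illustrative sketch only: forming one unit's activation from its
 * incoming weights, as in the `w' listing above.  The last entry in
 * each array is the threshold unit, whose activation is fixed at 1.0. */
#include <math.h>
#include <stdio.h>

static double squash(double x)
{
    return 1.0 / (1.0 + exp(-x));
}

static double unit_value(const double *acts, const double *wts, int n)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum += acts[i] * wts[i];     /* the "input from unit" column */
    return squash(sum);
}

int main(void)
{
    double acts[] = { 1.0, 0.0, 1.0 };                 /* layer 1 units plus threshold */
    double wts[]  = { 9.53516, -8.40332, 4.13086 };    /* weights from the listing */

    printf("activation = %f\n", unit_value(acts, wts, 3));
    return 0;
}
.ad
.fi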
.ce
The Help Command
To get a short description of any command, type `h' followed by
the letter of the command. Here, we type h h for help with help:
.na
.nf
.ne3
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? h h
h <letter> gives help for command <letter>.
.ad
.fi
To list the status of all the parameters in the program, use `?'.
.ce
To Quit the Program
Finally, to end the program, the `q' (for quit) command is entered:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? q
.ce
4. The Format Command
There are several ways to input and output patterns, numbers and
other values and there is one format command, `f', that is used to set
these options. In the format command, a number of options can be
specified on a single line, as for example in:
.ce
f b+ ir oc s- wB
.ce
Input Patterns
The programs are able to read pattern values in two different
formats. The default input format is the compressed format. In it,
each value is one character and it is not necessary to have blanks
between the characters. For example, in compressed format, the patterns
for xor could be written out in either of the following ways:
.ce
101 10 1
.ce
000 00 0
.ce
011 01 1
.ce
110 11 0
The second example is preferable because it makes it easier to see the
input and the output patterns. Compressed format can also be used to
input patterns with the `p' command. In addition to using 1 and 0 as
input, the character, `?' can be used. This character is initially
defined to be 0.5, but it can be redefined using the Q command like so:
.ce
Q -1
This sets the value of ? to -1. Other valid input characters are the
letters, `h', `i', `j' and `k'. The `h' stands for `hidden'. Its
meaning in an input string is that the value at this point in the string
should be taken from the next unit in the second layer of the network.
This notation is useful for specifying simple recurrent
networks. Naturally, `i', `j' and `k' stand for taking input
values from the third, fourth and fifth layers (if they exist). A
simple example of a recurrent network is given later.
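As a small illustration (an assumption about the bookkeeping, not the
program's actual parser), a compressed input character might be turned
into a unit value along these lines, with the `h', `i', `j' and `k'
codes left to be filled in later from the activations of the higher
layers:
.na
.nf
/* Illustrative sketch only: mapping one compressed input character to a
 * value.  question_val is the value assigned to `?' (0.5 by default,
 * changeable with the Q command).  HIGHER_LAYER is just a marker used
 * here for characters whose values are copied from a higher layer when
 * the pattern is loaded. */
#define HIGHER_LAYER -999.0

static double compressed_value(int ch, double question_val)
{
    switch (ch) {
    case '1': return 1.0;
    case '0': return 0.0;
    case '?': return question_val;
    case 'h': case 'i': case 'j': case 'k':
        return HIGHER_LAYER;          /* resolved from layer 2, 3, 4 or 5 */
    default:  return 0.0;             /* blanks and anything else ignored here */
    }
}
.ad
.fi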
The other input format for numbers is real. The number portion must
start with a digit (.35 is not allowed, but 0.35 is). Exponential
notation is not allowed. Real numbers have to be separated by a space.
The `h', `i', `j', `k' and `?' characters are also allowed with real
input patterns. To take input in the real format, it is necessary
to set the input format to be real using the `f' (format) command as in:
.ce
f ir
To change back to the compressed format, use:
.ce
f ic
.ce
Output of Patterns
Output format is controlled with the `f' command as in:
.ce
f or
.ce
f oc
.ce
f oa
The first sets the output to real numbers. The second sets the
output to be compressed mode where the value printed will be a `1' when
the unit value is greater than 1.0 - tolerance, a `^' when the value
is above 0.5 but less than 1.0 - tolerance, a `v' when the value is
less than 0.5 but greater than the tolerance. Below the tolerance
value, a `0' is printed. The tolerance can be changed using the `t'
command (not a part of the format command). For example, to make all
values greater than 0.8 print as `1' and all values less than 0.2 print
as `0', use:
.ce
t 0.2
Of course, this same tolerance value is also used to check to see if all
the patterns have converged. The third output format is meant to
give "analog compressed" output. In this format, a `c' is printed when
a value is close enough to its target value. Otherwise, if the answer
is close to 1, a `1' is printed, if the answer is close to 0, a `0' is
printed, if the answer is above the target but not close to 1, a `^' is
printed and if the answer is below the target but not close to 0, a `v'
is printed. This output format is designed for problems where the
output is a real number, as for instance, when the problem is to make a
network learn sin(x).
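As an illustration only (the simulators' own code may differ in detail),
the character printed for a single output value under the compressed and
analog compressed formats follows rules like these, where tol is the
tolerance set with the `t' command:
.na
.nf
#include <math.h>
#include <stdio.h>

/* Illustrative sketch of the output character rules described above. */
static char compressed_char(double v, double tol)
{
    if (v > 1.0 - tol) return '1';   /* close to 1 */
    if (v > 0.5)       return '^';   /* above 0.5 but not close to 1 */
    if (v > tol)       return 'v';   /* below 0.5 but above the tolerance */
    return '0';                      /* close to 0 */
}

static char analog_char(double v, double target, double tol)
{
    if (fabs(v - target) < tol) return 'c';   /* close enough to its target */
    if (v > 1.0 - tol)          return '1';   /* close to 1 */
    if (v < tol)                return '0';   /* close to 0 */
    return (v > target) ? '^' : 'v';          /* above or below the target */
}

int main(void)
{
    printf("%c %c\n", compressed_char(0.94, 0.1), analog_char(0.62, 0.5, 0.1));
    return 0;
}
.ad
.fi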
For the sake of convenience, the output format (and only the
output format) can be set without using the `f', so that:
.ce
or
will also make the output format real.
.ce
Breaking up the Output Values
In the compressed formats, the default is to print a blank after
every 10 values. This can be altered using the `b' (for inserting
breaks) command. The use for this command is to separate output values
into logical groups to make the output more readable. For instance, you
may have 24 output units where it makes sense to insert blanks after the
4th, 7th and 19th positions. To do this, specify:
.ce
b 4 7 19
Then for example, the output will look like:
.na
.nf
1 10^0 10^ ^000v00000v0 01000 (0.17577)
2 1010 01v 0^0000v00000 ^1000 (0.16341)
3 0101 10^ 00^00v00000v 00001 (0.16887)
4 0100 0^0 000^00000v00 00^00 (0.19880)
.ad
.fi
The `b' command allows up to 20 break positions to be specified. The
default output format is the real format with 10 numbers per line. For
the output of real values, the `b' command specifies when to print a
carriage return rather than when to print a blank. (Note: the `b'
command is not part of the `f' command.)
.ce
Pattern Formats
There are two different types of problems that back-propagation can
handle, the general type of problem where every output unit can take on
an arbitrary value and the classification type of problem where the goal
is to turn on output unit i and turn off all the other output units when
the pattern is of class i. The xor problem is an example of the general
type of problem. For an example of a classification problem, suppose
you have a number of data points scattered about through two-dimensional
space and you have to classify the points as either class 1, class 2 or
class 3. For a pattern of class 1 you can always set up the
output: "1 0 0", for class 2: "0 1 0" and for class 3: "0 0 1";
however, doing the translation to bit patterns can be annoying, so
another notation can be used. Instead of specifying the bit patterns
you can set the pattern format option to classification (as opposed to
the default value of general) like so:
.ce
f pc
and then the program will read data in the form:
1.33 3.61 1 * shorthand for 1 0 0
0.42 -2.30 2 * shorthand for 0 1 0
-0.31 4.30 3 * shorthand for 0 0 1
and translate it to the bit string form when the pattern is loaded
onto the output units. To switch to the general form, use "f pg".
In addition to controlling the input of data, the p command within
the format command is used to control the output of patterns from a set
of test patterns kept on a file. If the format is either c or g then
when the test set is run thru the network you will only get a summary
of how many patterns are correct. If the format is either C or G you
will get a listing of the output values for each pattern as well as
the summary. When reading patterns, C works the same as c and G works
the same as g.
.ce
Controlling Summaries
When the program is learning patterns you can have it print out
the status of the learning process at regular intervals. The default
is to print out only a summary of how learning is going, however you
can also print out the status of every pattern at regular intervals.
To get the whole set of patterns, use "f s-" to turn off the summary
format and "f s+" to go back to summarizing.
.ce
Ringing the Bell
To ring the bell when the learning has been completed use "f b+"
and to turn off the bell, use "f b-".
.ce
Echoing Input
When you are reading commands from a file, it is often worthwhile
to see those commands echoed on the screen. To do this, use "f e+"
and to turn off the echoing, use "f e-".
.ce
Up-To-Date Statistics
During the ith pass thru the network the program will collect
statistics on how many patterns are correct and how much error there
is overall. These numbers are gathered before the weights are updated
and so the results listed for iteration i really show the situation
after the weight update for iteration i-1. To complicate matters more,
the weight updates can be done continuously instead of after all the
patterns have been presented, so the statistics you get here are skewed
even more. If you want to have up-to-date statistics with either
method, use "f u+" and to go back to statistics that are out of date,
use "f u-". The penalty with "f u+" is that the program needs to do
another forward pass. When using the continuous update method it is
highly advisable to use "f u+", at least when you get close to
complete convergence because the default method of checking may claim
the learning is finished when it isn't or it may continue training
after the tolerances have been met.
.ce
5. Taking Training and Testing Patterns from Files
In the xor example given above, the four patterns were part of the
data file and were read in with the following lines:
.na
.nf
n 4
1 0 1
0 0 0
0 1 1
1 1 0
.ad
.fi
However, it is also convenient to take patterns from a file that
contains nothing but a list of patterns (and possibly comments).
To read a new set of patterns from some file, patterns, use:
.ce
n f patterns
To add an extra bunch of patterns to the current set you can use:
.ce
x f patterns
In addition to keeping a set of training patterns you can take
testing patterns from a file as well. To specify the file you can
invoke the program with a second file name, as in:
.ce
bp xor xtest
In addition, if you do the following:
.ce
t f xtest
the program will set xtest as the test file and immediately do the
testing. Once the file has been defined you can test the patterns
on the test file by "t f" or just "t". (This leaves the t command
doing double duty since "t <real>" will set the tolerance to <real>.)
In addition, the test file can be set without being tested by using
.ce
B t f xtest
as explained in the benchmarking section.
.ce
6. Saving and Restoring Weights and Related Values
Sometimes the amount of time and effort needed to produce a set of
weights to solve a problem is so great that it is more convenient to
save the weights rather than constantly recalculate them. Weights can
be saved as real values in an ASCII format (the default) or as binary,
to save space. To save the weights enter the command, `S'. The weights
are written on a file called "weights". The following file comes from
the xor problem:
.na
.nf
62r file = ../xor3
9.5351562500
-8.4033203125
4.1308593750
5.5800781250
-4.9755859375
-11.3095703125
8.0527343750
.ad
.fi
To write the weights, the program starts with the second layer, writes
out the weights leading into these units in order with the threshold
weight last. Then it moves on to the third layer, and so on. To
restore these weights, type an `R' for restore. At this time, the
program reads the header line and sets the total number of iterations
the program has gone through to be the first number it finds on the
header line. It then reads the character immediately after the number.
The `r' indicates that the weights will be real numbers represented as
character strings. If the weights were binary, the character would be
a `b' rather than an `r'. Also, if the character is `b', the next
character is read. This next character indicates how many bytes are
used per value. The integer versions, bp and sbp write files with 2
bytes per weight, while the real versions, rbp and srbp write files with
8 bytes per weight for double precision reals and 4 bytes per weight
for single precision reals. With this notation, weight files written by
one program can be read by the other. A binary weight format is
specified within the `f' command by using "f wb". A real format is
specified by using "f wr". If your program specifies that weights
should be written in one format, but the weight file you read from is
different, a notice will be given. There is no check made to
see if the number of weights on the file equals the number of weights in
the network.
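As a rough sketch of the header handling just described (an illustration
based on the description above, not the program's own code), a reader of
the weights file would pick up the iteration count, the `r' or `b'
character and, for binary files, the bytes-per-weight character:
.na
.nf
#include <stdio.h>

/* Illustrative sketch only: reading the header of a weights file as
 * described above.  The leading number is the iteration count, the next
 * character is `r' (real, ASCII) or `b' (binary), and for binary files
 * the character after that gives the bytes used per weight (assumed
 * here to be written as a digit). */
int main(void)
{
    FILE *fp = fopen("weights", "r");
    long iterations;
    int fmt, bytes_per_weight = 0;

    if (fp == NULL)
        return 1;
    fscanf(fp, "%ld", &iterations);            /* total iterations so far */
    fmt = fgetc(fp);                           /* 'r' or 'b' */
    if (fmt == 'b')
        bytes_per_weight = fgetc(fp) - '0';    /* e.g. 2, 4 or 8 */
    printf("%ld iterations, format %c, %d bytes/weight\n",
           iterations, fmt, bytes_per_weight);
    fclose(fp);
    return 0;
}
.ad
.fi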
The above formats specify that only weights are written out and this
is all you need once the patterns have converged. However, if you're
still training the network and want to break off training and pick up
the training from exactly the same point later on, you need to save
the old weight changes when using momentum, and the parameters for the
delta-bar-delta method if you are using this technique. To save these
extra parameters on the weights file, use "f wR" to write the extra
values as real and "f wB" to write the extra values as binary.
In the above example, the command S, was used to save the weights
immediately. Another alternative is to save weights at regular
intervals. The command, S 100, will automatically save the weights
every 100 iterations. The default rate at which to save
weights is set at 100,000, which generally means that no weights will
ever be saved.
.ce
7. Initializing Weights and Giving the Network a `Kick'
All the weights in the network initially start out at 0. In
symmetric networks, then, no learning may result because error signals
cancel themselves out. Even in non-symmetric networks, the training
process will usually converge faster if the weights start out at small
random values. To do this, the `k' command will take the network and
alter the weights in the following ways. Suppose the command given is:
.ce
k 0 0.5
Now, if a weight is exactly 0, then the weight will be changed to a
random value between +0.5 and -0.5. The above command can therefore be
used to initialize the weights in the network. A more complex use of
the `k' command is to decrease the magnitude of large weights in the
network by a certain random amount. For instance, in the following
command:
.ce
k 2 8
all the weights in the network that are greater than or equal to 2 will
be decreased by a random number between 0 and 8. Weights less than or
equal to -2 will be increased by a random number between 0 and 8. The
seed to the random number generator can be changed using the `s' command
as in "s 7". The integer parameter in the `s' command is of type,
unsigned.
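A sketch of this logic in C (an illustration of the description above,
not the program's own code) might look like the following, where v1 and
v2 are the two arguments of the `k' command:
.na
.nf
#include <stdlib.h>

/* Illustrative sketch of the `k v1 v2' command described above. */
static double frand(double range)           /* uniform value in [0, range] */
{
    return range * ((double) rand() / (double) RAND_MAX);
}

static double kick_weight(double w, double v1, double v2)
{
    if (v1 == 0.0) {
        if (w == 0.0)
            return frand(2.0 * v2) - v2;    /* random value in [-v2, +v2] */
    } else if (w >= v1) {
        return w - frand(v2);               /* shrink a large positive weight */
    } else if (w <= -v1) {
        return w + frand(v2);               /* shrink a large negative weight */
    }
    return w;                               /* otherwise leave the weight alone */
}
.ad
.fi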
Another method of giving a network a kick is to add hidden layer
units. The command:
.ce
H 2 0.5
adds one unit to layer 2 of the network and all the weights that are
created are initialized to between -0.5 and +0.5.
The subject of kicking a back-propagation network out of local minima
has barely been studied and there is no guarantee that the above methods
are very useful in general.
.ce
8. Setting the Algorithm to Use
A number of different variations on the original back-propagation
algorithm have been proposed in order to speed up convergence and some
of these have been built into these simulators. These options are set
using the `A' command and a number of options can go on the one line.
.ce
Activation Functions
To set the activation function, use:
A al * the linear activation function
A ap * the piece-wise activation function
A as * the smooth activation function
A at * the piecewise near-tanh function that runs from -1 to +1
A aT * the continuous near-tanh function that runs from -1 to +1
When using the linear activation function, it is only appropriate to
use the differential step-size derivative and a two-layer network.
The smooth activation function is:
.ce
f = 1.0 / (1.0 + exp(-x))
where x is the input to a unit. The piece-wise function is an
approximation to the function, f, and it will normally save some CPU
time even though it may increase the number of iterations you need to
solve the problem. The continuous near-tanh function is 2f - 1 and the
piece-wise version approximates this function with a series of straight
lines.
.ce
Sharpness (or Gain)
The sharpness (or gain) is the parameter, D, in the function:
.ce
1.0 / (1.0 + exp(-D*x)).
A sharper sigmoid shaped activation function (larger D) will produce
faster convergence (see "Speeding Up Back Propagation" by Yoshio Izui
and Alex Pentland in the Proceedings of \fIIJCNN-90-WASH-DC\fR, Lawrence
Erlbaum Associates, 1990). To set this parameter to, say, 8,
use "A D 8". The default value is 1. Unfortunately, too large a value
for D will hurt convergence so this is not a perfect solution to
speeding up learning. Sometimes the best value for D may be less than
1.0. A larger D is also useful in the integer version of
back-propagation where the weights are limited to between -32 and
+31.999. A larger D value in effect magnifies the weights and makes it
possible for the weights to stay smaller. Values of D less than one may
be useful in extracting a network from a local minimum (see "Handwritten
Numeral Recognition by Multi-layered Neural Network with Improved
Learning Algorithm" by Yamada, Kami, Temma and Tsukumo in Proceedings of
the 1989 IJCNN, IEEE Press). A smaller value of D will also force the
weights and the weight changes to be larger and this may be of value
when the weight changes become less than the weight resolution of 0.001
in the integer version.
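For reference, the smooth activation function with the sharpness
parameter D and the continuous near-tanh variant (2f - 1) can be written
directly in C; the piece-wise versions approximate these with
straight-line segments and are not shown here:
.na
.nf
#include <math.h>

/* The smooth activation function with sharpness (gain) D, and the
 * continuous near-tanh function 2f - 1, as described above. */
static double smooth(double x, double D)
{
    return 1.0 / (1.0 + exp(-D * x));    /* runs from 0 to 1 */
}

static double near_tanh(double x, double D)
{
    return 2.0 * smooth(x, D) - 1.0;     /* runs from -1 to +1 */
}
.ad
.fi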
.ce
The Derivatives
The correct derivative for the standard activation function is s(1-s)
where s is the activation value of a unit; however, when s is near 0 or 1
this term will give only very small weight changes during the learning
process. To counter this problem, Fahlman proposed the following one
for the output layer:
.ce
0.1 + s(1-s)
(For the original description of this method, see "Faster Learning
Variations of Back-Propagation: An Empirical Study", by Scott E.
Fahlman, in \fIProceedings of the 1988 Connectionist Models Summer
School\fR, Morgan Kaufmann, 1989.) Besides Fahlman's derivative and the
original one, the differential step size method (see "Stepsize Variation
Methods for Accelerating the Back-Propagation Algorithm", by Chen and
Mars, in \fIIJCNN-90-WASH-DC\fR, Lawrence Erlbaum, 1990) takes the
derivative to be 1 in the layer going into the output units and uses the
correct derivative for all other layers. The learning rate for the
inner layers is normally set to some smaller value. To set a value for
eta2, give two values in the `e' command as in:
.ce
e 0.1 0.01
To set the derivative, use the `A' command as in:
.ne4
A dd * use the differential step size derivative (default)
A dF * use Fahlman's derivative in only the output layer
A df * use Fahlman's derivative in all layers
A do * use the original, correct derivative
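In C terms, the three derivative choices reduce to the following, where
s is the activation value of a unit (an illustration, not the
simulators' code):
.na
.nf
/* The derivative choices listed above, for an activation value s.  The
 * differential step size option simply uses 1 for the layer feeding the
 * output units and the correct derivative s(1-s) everywhere else. */
static double deriv_original(double s)  { return s * (1.0 - s); }
static double deriv_fahlman(double s)   { return 0.1 + s * (1.0 - s); }
static double deriv_diff_step(double s) { (void) s; return 1.0; }
.ad
.fi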
.ce
Update Methods
The choices are the periodic (batch) method, the continuous (online)
method and Jacobs' delta-bar-delta method. The following commands
set the update methods:
A uc * for the continuous update method
A ud * for the delta-bar-delta method
A up * for the original periodic update method (default)
The delta-bar-delta method uses a number of special parameters and these
are set using the `d' command. Delta-bar-delta can be used with any of
the derivatives and the algorithm will find its own value of eta for
each weight.
.ce
Other Algorithm Options
The `b' command controls whether or not to backpropagate error for
units that have learned their response to within a given tolerance. The
default is to always backpropagate error. The advantage to not
backpropagating error is that this can save computer time. This
parameter can be set like so:
A b+ * always backpropagate error
A b- * don't backpropagate error when close
The `s' sub-command will set the number of times to skip a pattern
when the pattern has been learned to within the desired tolerance. To
skip 3 iterations, use "A s 3", to reset to not skip any patterns
use "A s 0".
The `t' sub-command will take the given pattern (only one at a
time) out of the training set so that you can then train the other
patterns and test the network's response to this one pattern that was
removed. To test pattern 3, use "A t 3" and to reset to use all the
patterns use "A t 0".
.ce
9. The Delta-Bar-Delta Method
The delta-bar-delta method attempts to find a learning rate
eta, for each individual weight. The parameters are the initial
value for the etas, the amount by which to increase an eta that seems
to be too small, the rate at which to decrease an eta that is
too large, a maximum value for each eta and a parameter used in keeping
a running average of the slopes. Here are examples of setting these
parameters:
.na
.nf
d d 0.5 * sets the decay rate to 0.5
d e 0.1 * sets the initial etas to 0.1
d k 0.25 * sets the amount to increase etas by (kappa) to 0.25
d m 10 * sets the maximum eta to 10
d n 0.005 * an experimental noise parameter
d t 0.7 * sets the history parameter, theta, to 0.7
.ad
.fi
These settings can all be placed on one line:
.ce
d d 0.5 e 0.1 k 0.25 m 10 t 0.7
The version implemented here does not use momentum. The symmetric
versions, sbp and srbp, do not implement delta-bar-delta.
The idea behind the delta-bar-delta method is to let the program find
its own learning rate for each weight. The `e' sub-command sets the
initial value for each of these learning rates. When the program sees
that the slope of the error surface averages out to be in the same
direction for several iterations for a particular weight, the program
increases the eta value by an amount, kappa, given by the `k' parameter.
The network will then move down this slope faster. When the program
finds the slope changes signs, the assumption is that the program has
stepped over to the other side of the minimum and so it cuts down the
learning rate, by the decay factor, given by the `d' parameter. For
instance, a d value of 0.5 cuts the learning rate for the weight in
half. The `m' parameter specifies the maximum allowable value for an
eta. The `t' parameter (theta) is used to compute a running average of
the slope of the weight and must be in the range 0 <= t < 1. The
running average at iteration i, a\di\u, is defined as:
.ce
a\di\u = (1 - t) slope\di\u + ta\di-1\u,
so small values for t make the most recent slope more important than
the previous average of the slope. Determining the learning rate for
back-propagation automatically is, of course, very desirable and this
method often speeds up convergence by quite a lot. Unfortunately, bad
choices for the delta-bar-delta parameters give bad results and a lot of
experimentation may be necessary. If you have n patterns in the
training set try starting e and k around 1/n. The n parameter is an
experimental noise term that is only used in the integer version. It
changes a weight in the wrong direction by the amount indicated when the
previous weight change was 0 and the new weight change would be 0 and
the slope is non-zero. (I found this to be effective in an integer
version of quickprop so I tossed it into delta-bar-delta as well. If
you find this helps, please let me know.) For more on
delta-bar-delta see "Increased Rates of Convergence" by Robert A.
Jacobs, in \fINeural Networks\fR, Volume 1, Number 4, 1988.
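As a sketch of the method just described (an illustration, not the
program's own code), the per-weight bookkeeping can be written along
these lines, using the parameter names of the `d' command:
.na
.nf
/* Illustrative sketch of the delta-bar-delta eta update for one weight.
 * slope is the current error derivative for the weight and avg is the
 * running average of past slopes (the "bar"). */
struct dbd_weight {
    double eta;      /* learning rate for this weight */
    double avg;      /* running average of the slope */
};

static void dbd_update_eta(struct dbd_weight *w, double slope,
                           double kappa, double decay,
                           double max_eta, double theta)
{
    if (slope * w->avg > 0.0)
        w->eta += kappa;        /* slope keeps its sign: raise eta by kappa */
    else if (slope * w->avg < 0.0)
        w->eta *= decay;        /* slope changed sign: cut eta by the decay rate */
    if (w->eta > max_eta)
        w->eta = max_eta;       /* never let eta exceed the maximum */
    /* running average: a(i) = (1 - theta) * slope(i) + theta * a(i-1) */
    w->avg = (1.0 - theta) * slope + theta * w->avg;
}
.ad
.fi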
.ce
10. Recurrent Networks
Recurrent back-propagation networks take values from higher level
units and use them as activation values for lower level units. This
gives a network a simple kind of short-term memory, possibly a little
like human short-term memory. For instance, suppose you want a network
to memorize the two short sequences, "acb" and "bcd". In the middle of
both of these sequences is the letter, "c". In the first case you
want a network to take in "a" and output "c", then take in "c" and
output "b". In the second case you want a network to take in "b" and
output "c", then take in "c" and output "d". To do this, a network
needs a simple memory of what came before the "c".
Let the network be a 7-3-4 network where input units 1-4 and output
units 1-4 stand for the letters a-d. Furthermore, let there be 3 hidden
layer units. The hidden units will feed their values back down to the
input units 5-7, where they become input for the next step. To see why
this works, suppose the patterns have been learned by the network.
Inputting the "a" from the first string produces some random pattern of
activation, p1, on the hidden layer units and "c" on the output units.
The pattern p1 is copied down to units 5-7 of the input layer. Second,
the letter, "c" is presented to the network together with p1 now on
units 5-7. This will give "b" on the output units. However, if the "b"
from the second string is presented first, there will be a different
random pattern, p2, on the hidden layer units. These values are copied
down to input units 5-7. These values combine with the "c" to produce
the output, "d".
The training patterns for the network can be:
1000 000 0010 * "a" prompts the output, "c"
0010 hhh 0100 * inputting "c" should produce "b"
0100 000 0010 * "b" prompts the output, "c"
0010 hhh 0001 * inputting "c" should produce "d"
where the first four values on each line are the normal input, the
middle three either start out all zeros or take their values from the
previous values of the hidden units. The code for taking these values
from the hidden layer units is "h". The last set of values represents
the output that should be produced. To take values from the third layer
of a network, the code is "i". For the fourth and fifth layers (if they
exist) the codes are "j" and "k". Training recurrent networks can take
much longer than training standard networks and the average error can
jump up and down quite a lot.
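The copy-down step itself is simple; as an illustration (not the
simulators' code), for the 7-3-4 network above it amounts to copying the
three hidden-unit activations onto input units 5-7 before the next
pattern in the sequence is presented:
.na
.nf
#include <string.h>

/* Illustrative sketch of the `h' mechanism for a 7-3-4 network: the
 * hidden activations from the previous step become the values of input
 * units 5-7 (0-based indices 4-6 here). */
#define N_INPUT   7
#define N_HIDDEN  3

static void copy_hidden_to_input(double input[N_INPUT],
                                 const double hidden[N_HIDDEN])
{
    memcpy(&input[N_INPUT - N_HIDDEN], hidden, N_HIDDEN * sizeof(double));
}
.ad
.fi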
.ce
11. The Benchmarking Command
The main purpose of the benchmarking command is to make it possible
to run a number of tests of a problem with different initial weights and
average the number of iterations and CPU time for networks that
converged. A second purpose is to run a training set thru the network a
number of times, and for each try, a test pattern or a test set can be
checked at regular intervals.
A typical command to simply test the current parameters on a number
of networks is:
.ce
B g 5 m 15 k 1 r 1000 200
The "g 5" specifies that you'd like to set the goal of getting 5
networks to converge but the "m 15" sets a maximum of 15 tries to
reach this goal. The k specifies that each initial network will get
a kick by setting each weight to a random number between -1 and 1.
The "r 1000 200" portion specifies that you should run up to 1000
iterations on a network and print the status of learning every 200
iterations. This follows the normal run command and the second
parameter defining how often to print the statistics can be omitted.
For example, here is some output from benchmarking with the xor
problem:
.na
.nf
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? B g 5 m 5 k 1 r 200
seed = 7; running . . .
patterns learned to within 0.10 at iteration 62
seed = 7; running . . .
seed = 7; running . . .
patterns learned to within 0.10 at iteration 54
seed = 7; running . . .
patterns learned to within 0.10 at iteration 39
seed = 7; running . . .
patterns learned to within 0.10 at iteration 44
1 failures; 4 successes; average = 49.750000 0.333320 sec/network
.ad
.fi
The time/network includes however much time is used to print messages,
so to time the process effectively, all printing should be minimized.
The timing is done using the UNIX clock(3C) function and, on a UNIX PC
at least, the time returned by clock will overflow after 2147 seconds, or
about 36 minutes. If your system has the same limitation, take care that
ALL of the benchmarking you do in a single call of the program adds up
to less than 36 minutes.
In the above example, the seed that was used to set the random values
for the weights was set to 7 (outside the benchmarking command), however
if you set a number of seeds, as in:
.ce
s 3 5 7 18484 99
the seeds will be taken in order for each network. When there are more
networks to try than there are seeds, the random values keep coming from
the last seed value, so actually you can get by using a single seed.
The idea behind allowing multiple seeds is so that if one network does
something interesting you can use that seed to run a network with the
same initial weights outside of the benchmarking command.
Once the benchmarking parameters have been set, it is only necessary
to include the run portion in order to start the benchmarking process
again, thus, "B r 200" will run benchmarking again using the current set
of parameters. Also, the parameters can be set without starting the
benchmarking process by just not including the `r' parameters in the B
command as in:
.ce
B g 5 m 5 k 1
In addition to getting data on convergence, you can have the program
run test patterns thru the network at the print statistics rate given
in the `r' sub-command. To specify the test file, test100, use:
.ce
B t f test100
To run the training data thru for up to 1000 iterations and test every
200 iterations use:
.ce
B r 1000 200
If the pattern format specification p is set to either c or g you will
get a summary of the patterns on the test file. If p is either C or G
you will get the results for each pattern listed as well as the summary.
To stop testing the data on the data file, use "B t 0".
Sometimes you may have so little data available that it is difficult
to separate the patterns into a training set and a test set. One
solution is to remove each pattern from the training set, train the
network on the remaining patterns and then test the network on the
pattern that was removed. To remove a pattern, say pattern 1 from the
training set use:
.ce
B t 1
To systematically remove each pattern from the training set, use a
data file with the following commands:
.na
.nf
B t 1 r 200 50
B t 2 r 200 50
B t 3 r 200 50
... etc.
.ad
.fi
and the pattern will be tested every 50 iterations. If, in the course
of training the network, all the patterns converge, the program will
print out a line starting with a capital S followed by a test of the
test pattern. If the program hits the point where statistics on the
learning process have to be printed and the network has not converged,
then a capital F will print out followed by a test of the test pattern.
To stop this testing, use "B t 0".
It would be nice to have the program average up and tabulate all the
data that comes out of the benchmarking command, but I thought I'd leave
that to users for now. You can use the record command to save the
output from the entire session and then run it thru some other program,
such as an awk program in order to sort everything out.
.ce
12. Miscellaneous Commands
Below is a list of some miscellaneous commands, a short example of
each and a short description of the command.
.IP " ! !cmd " 15
Anything after `!' will be passed on to UNIX as a command to execute.
.IP " C C " 15
The C command will clear the network of values, reset the number of
iterations to 0 and reset other values so that another run can be made.
The seed value is reset so you can run the same network over with the
same initial weights but with different learning parameters. Even
though the seed is reset, the weights are not re-initialized, so you
must do this step yourself (with the `k' command) after clearing the network.
.IP " i i f " 15
Entering "i f" will read commands from the file, f. When there are
no more commands on the file, the program resumes reading from the
previous file being used. You can also have an `i' command within the
file f, however the depth to which you can nest the number of active
files is 4 and stdin itself counts as the first one. Once an input
file has been specified, you can simply type "i" to read from the file
again.
.IP " l l 2 " 15
Entering "l 2" will print the values of the units on layer 2,
or whatever layer is specified.
.IP " T T -3 " 15
In sbp and srbp only, "T -3" sets all the threshold weights
to -3 or whatever value is specified and freezes them at this value.
.IP " W W 0.9 " 15
Entering "W 0.9" will remove (whittle away) all the weights with
absolute values less than 0.9.
.in-15
In addition, when a user-generated interrupt occurs (by typing DEL)
the program will drop its current task and take the next command from
the keyboard.
.ce
13. Limitations
Weights in the bp and sbp programs are 16-bit integer weights, where
the real value of the weight has been multiplied by 1024. The integer
versions cannot handle weights less than -32 or greater than 31.999.
The weight changes are all checked for overflow but there are other
places in these programs where calculations can possibly overflow as
well and none of these places are checked. Input values for the integer
versions can run from -31.994 to 31.999. Due to the method used to
implement recurrent connections, input values in the real version are
limited to -31994.0 and above.
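For reference, the fixed-point representation works out as follows (an
illustration of the arithmetic, not the simulators' code):
.na
.nf
#include <stdio.h>

/* The 16-bit fixed-point representation described above: a weight is
 * stored as a short holding the real value multiplied by 1024, giving a
 * resolution of about 0.001 and a range of roughly -32 to +31.999. */
static short  to_fixed(double w)    { return (short) (w * 1024.0); }
static double from_fixed(short iw)  { return (double) iw / 1024.0; }

int main(void)
{
    short iw = to_fixed(-8.40332);
    printf("stored as %d, read back as %f\n", iw, from_fixed(iw));
    return 0;
}
.ad
.fi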