.ce
Fast Back-Propagation
.ce
Copyright (c) 1991 by Donald R. Tveter
.ce
1. Introduction
The programs described below were produced for my own use in studying
back-propagation and for doing some examples that are found in my
introductory Artificial Intelligence textbook, \fIThe Basis of
Artificial Intelligence\fR, to be published by Computer Science Press.
(One would hope some time before the sun burns out or before the Cubs
win the World Series or before Congress balances the budget or ... .)
I have copyrighted these files but I hereby give permission to
anyone to use them for experimentation, for educational purposes or to
redistribute them on a not-for-profit basis. Anyone who wants to
use these programs for business or commercial purposes will be charged
a small fee. You should contact me by mail at:
.na
.nf
Dr. Donald R. Tveter
5228 N. Nashville Ave.
Chicago, Illinois 60656
USENET: drt@chinet.chi.il.us
.ad
.fi
Also, I would be interested in hearing from anyone with suggestions,
bug reports and major successes or failures.
Note: this is use-at-your-own-risk software; there is no guarantee
that it is bug-free. Use of this software constitutes acceptance for
use in an as-is condition. There are no warranties with regard to this
software. In no event shall the author be liable for any damages
whatsoever arising out of or in connection with the use or performance
of this software.
There is also a good chance that I will soon produce and sell DOS and
extended DOS binary versions with more capabilities than there are in
this UNIX (tm of AT&T Bell Labs) source code version. They will
probably include Quickprop, SuperSAB and Cascade Correlation at the very
least and perhaps also DSM, LVQ1 and counter-propagation as well.
There are four simulators that can be constructed from the included
files. The program rbp does back-propagation using real weights and
real arithmetic. The program bp does back-propagation using 16-bit integer
weights, 16- and 32-bit integer arithmetic and some floating point
arithmetic. The program sbp uses 16-bit integer symmetric weights but
only allows two-layer networks. The program srbp does the same using
real weights. The purpose of sbp and srbp is to produce networks that
can be used with the Boltzmann machine relaxation algorithm (not
included).
In most cases, the 16-bit integer programs are the most useful,
because they are the fastest. Unfortunately, sometimes 16-bit
integer weights don't have enough range or precision and then using the
floating point versions may be necessary. Many other speed-up
techniques are included in these programs.
.ce
Updated Version
This version contains a number of improvements and changes to command
names vs. the September 1990 version. For a summary of changes and
additions see the file changes.
.ce
2. Making the Simulators
This code has been written to use either 32-bit floating point
(float) or 64-bit floating point (double) arithmetic. On System V
machines the standard seems to be that all floating point arithmetic is
done with double precision, so double arithmetic is faster than float
and is therefore the default. Other versions of C (e.g. ANSI C)
will do single precision real arithmetic. To get 32-bit floating point
set the compiler flag FLOAT in the makefile. The function, exp, defined
in real.c is double since System V specifies it as double. If your C
uses float, change this definition as well.
One option exists for bp and sbp. If your compiler isn't smart
enough to divide by 1024 by shifting, remove the SMART flag in the
makefile.
To make a particular executable file, use the makefile given
with the data files and make any or all of them like so:
.ce
make bp
.ce
make sbp
.ce
make rbp
.ce
make srbp
To make a record of all the input and output from the programs,
the following small UNIX command file I call "record" can be used:
.na
.nf
# record: run a command and log everything typed and printed to <arg1>.record
trap "" 2                                      # ignore interrupts so the log survives a DEL
outfile="${1}.record"
if test -f "$outfile"
then
    rm "$outfile"                              # start with a fresh log file
fi
echo "$outfile"
(tee -a "$outfile" | $*) | tee -a "$outfile"   # copy both input and output into the log
process=`ps | grep tee | cut -c1-6`            # find the tee still reading the keyboard
kill $process
.ad
.fi
For example, to make a record of all the input and output from the
program bp using the data file, xor, use:
.ce
record bp xor
.ce
3. A Simple Example
Each version would normally be called with the name of a file to read
commands from, as in:
.ce
bp xor
When no file name is specified, bp will take commands from the keyboard
(UNIX stdin file). After the data file is read, commands are then taken
from the keyboard.
The commands are one-letter commands and most of them have optional
parameters. The `A', `B', `d' and `f' commands allow a number of
sub-commands on a line. The maximum length of any line is 256
characters. An `*' is a comment and it can be used to make the
remainder of the line a comment. Here is an example of a data file to
do the xor problem:
.na
.nf
* input file for the xor problem
m 2 1 1 * make a 2-1-1 network
c 1 1 3 1 * add this extra connection
c 1 2 3 1 * add this extra connection
s 7 * seed the random number function
k 0 1 * give the network random weights
n 4 * read four new patterns into memory
1 0 1
0 0 0
0 1 1
1 1 0
e 0.5 * set eta to 0.5 (and eta2 to 0.5)
a 0.9 * set alpha to 0.9
.ad
.fi
First, in this example, the m command will make a network with 2 units
in the input layer, 1 unit in the second layer and 1 unit in the third
layer. The following c commands create extra connections from layer 1
unit 1 to layer 3 unit 1 and from layer 1 unit 2 to layer 3 unit 1.
The `s' command sets the seed for the random number function. The `k'
command then gives the network random weights. The `k' command has
another use as well. It can be used to try to kick a network out of a
local minimum. Here, the meaning of "k 0 1" is to examine all the
weights in the network and for every weight equal to 0 (and they all
start out at 0), add in a random number between -1 and +1. The `n'
command specifies four new patterns to be read into memory. With
the `n' command, any old patterns that may have been present are
removed. There is also an `x' command that behaves like the `n'
command, except that the `x' command \fIadds\fR the extra patterns to the
current training set. The input pattern comes first, followed by the
output pattern. The statement, e 0.5, sets eta, the learning rate for
the upper layer to 0.5 and eta2 for the lower layers to 0.5 as well.
The last line sets alpha, the momentum parameter, to 0.9.
After these commands are executed, the following messages and prompt
appear:
.na
.nf
.ne3
Fast Back-Propagation Copyright (c) 1991 by Donald R. Tveter
taking commands from stdin now
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]?
.ad
.fi
The characters within the square brackets are a list of the possible
commands. To run 100 iterations of back-propagation and print out the
status of the learning every 20 iterations type "r 100 20" at the
prompt:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? r 100 20
This gives:
.na
.nf
running . . .
20 iterations 0.00% right (0 right 4 wrong) 0.49927 error/unit
40 iterations 0.00% right (0 right 4 wrong) 0.43188 error/unit
60 iterations 75.00% right (3 right 1 wrong) 0.09033 error/unit
62 iterations 100.00% right (4 right 0 wrong) 0.07129 error/unit
patterns learned to within 0.10 at iteration^G 62
.ad
.fi
The program immediately prints out the "running . . ." message. After
each 20 iterations, a summary of the learning process is printed, giving
the percentage of patterns that are right, the number right and wrong
and the average value of the absolute values of the errors of the output
units. The program stops when each output for each pattern has been
learned to within the required tolerance, in this case the default value
of 0.1. A ctrl-G is normally printed out as well to sound the bell. If
the second number defining how often to print out a summary is omitted,
the summaries will not be printed. Sometimes the integer versions will
do a few extra iterations before declaring the problem done because of
truncation errors in the arithmetic done to check for convergence. The
status figures for iteration i are computed when making the forward pass
of the iteration and before the weights are updated so these values are
one iteration out of date. This saves on CPU time; however, if you
really need up-to-date statistics, use the u+ option described in the
section on the format command.
.ce
Listing Patterns
To get a listing of the status of each pattern, use the `P' command
to give:
.na
.nf
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? P
1 0.90 (0.098) ok
2 0.05 (0.052) ok
3 0.94 (0.062) ok
4 0.07 (0.074) ok
62 iterations 100.00% right (4 right 0 wrong) 0.07129 error/unit
.ad
.fi
The numbers in parentheses give the sum of the absolute values of the
output errors for each pattern. An `ok' is given to every pattern that
has been learned to within the required tolerance. To get the status of
one pattern, say, the fourth pattern, type "P 4" to give:
0.07 (0.074) ok
To get a summary without the complete listing, use "P 0". To get the
output targets for a given pattern, say pattern 3, use "O 3".
A particular test pattern can be input to the network with the `p'
command, as in:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? p 1 0
0.90
.ce
Examining Weights
It is often interesting to see the values of some particular weights
in the network. To see a listing of all the weights in a network, use
the save weights command described below and then cat the file, weights.
To see the weights leading into a particular node, say unit 1 in
layer 2, use the `w' command as in:
.na
.nf
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? w 2 1
layer   unit   unit value    weight      input from unit
  1      1      1.00000       9.53516       9.53516
  1      2      0.00000      -8.40332       0.00000
  2      t      1.00000       4.13086       4.13086
                                   sum =   13.66602
.ad
.fi
This listing also gives data on how the current activation value of
the node is computed using the weights and the activation values of
the nodes feeding into unit 1 of layer 2. The `t' unit is the
threshold unit.
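As a rough illustration (this is not the code from the simulators), the
computation behind such a listing can be sketched in C: each line of the
listing is the product of a sending unit's activation and the weight on
the connection, the `t' unit always has activation 1.0, and the unit's
value is the sum passed through the activation function:
.na
.nf
/* Illustrative sketch only: forming one unit's activation from its
 * incoming weights, as in the `w' listing above.  The last entry in
 * each array is the threshold unit, whose activation is fixed at 1.0. */
#include <math.h>
#include <stdio.h>

static double squash(double x)
{
    return 1.0 / (1.0 + exp(-x));
}

static double unit_value(const double *acts, const double *wts, int n)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum += acts[i] * wts[i];     /* the "input from unit" column */
    return squash(sum);
}

int main(void)
{
    double acts[] = { 1.0, 0.0, 1.0 };                 /* layer 1 units plus threshold */
    double wts[]  = { 9.53516, -8.40332, 4.13086 };    /* weights from the listing */

    printf("activation = %f\n", unit_value(acts, wts, 3));
    return 0;
}
.ad
.fi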
.ce
The Help Command
To get a short description of any command, type `h' followed by
the letter of the command. Here, we type h h for help with help:
.na
.nf
.ne3
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? h h
h <letter> gives help for command <letter>.
.ad
.fi
To list the status of all the parameters in the program, use `?'.
.ce
To Quit the Program
Finally, to end the program, the `q' (for quit) command is entered:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? q
.ce
4. The Format Command
There are several ways to input and output patterns, numbers and
other values and there is one format command, `f', that is used to set
these options. In the format command, a number of options can be
specified on a single line, as for example in:
.ce
f b+ ir oc s- wB
.ce
Input Patterns
The programs are able to read pattern values in two different
formats. The default input format is the compressed format. In it,
each value is one character and it is not necessary to have blanks
between the characters. For example, in compressed format, the patterns
for xor could be written out in either of the following ways:
.ce
101 10 1
.ce
000 00 0
.ce
011 01 1
.ce
110 11 0
The second example is preferable because it makes it easier to see the
input and the output patterns. Compressed format can also be used to
input patterns with the `p' command. In addition to using 1 and 0 as
input, the character, `?' can be used. This character is initially
defined to be 0.5, but it can be redefined using the Q command like so:
.ce
Q -1
This sets the value of ? to -1. Other valid input characters are the
letters, `h', `i', `j' and `k'. The `h' stands for `hidden'. Its
meaning in an input string is that the value at this point in the string
should be taken from the next unit in the second layer of the network.
This notation is useful for specifying simple recurrent
networks. Naturally, `i', `j' and `k' stand for taking input
values from the third, fourth and fifth layers (if they exist). A
simple example of a recurrent network is given later.
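As a small illustration (an assumption about the bookkeeping, not the
program's actual parser), a compressed input character might be turned
into a unit value along these lines, with the `h', `i', `j' and `k'
codes left to be filled in later from the activations of the higher
layers:
.na
.nf
/* Illustrative sketch only: mapping one compressed input character to a
 * value.  question_val is the value assigned to `?' (0.5 by default,
 * changeable with the Q command).  HIGHER_LAYER is just a marker used
 * here for characters whose values are copied from a higher layer when
 * the pattern is loaded. */
#define HIGHER_LAYER -999.0

static double compressed_value(int ch, double question_val)
{
    switch (ch) {
    case '1': return 1.0;
    case '0': return 0.0;
    case '?': return question_val;
    case 'h': case 'i': case 'j': case 'k':
        return HIGHER_LAYER;          /* resolved from layer 2, 3, 4 or 5 */
    default:  return 0.0;             /* blanks and anything else ignored here */
    }
}
.ad
.fi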
The other input format for numbers is real. The number portion must
start with a digit (.35 is not allowed, but 0.35 is). Exponential
notation is not allowed. Real numbers have to be separated by a space.
The `h', `i', `j', `k' and `?' characters are also allowed with real
input patterns. To take input in the real format, it is necessary
to set the input format to be real using the `f' (format) command as in:
.ce
f ir
To change back to the compressed format, use:
.ce
f ic
.ce
Output of Patterns
Output format is controlled with the `f' command as in:
.ce
f or
.ce
f oc
.ce
f oa
The first sets the output to real numbers. The second sets the
output to be compressed mode where the value printed will be a `1' when
the unit value is greater than 1.0 - tolerance, a `^' when the value
is above 0.5 but less than 1.0 - tolerance, a `v' when the value is
less than 0.5 but greater than the tolerance. Below the tolerance
value, a `0' is printed. The tolerance can be changed using the `t'
command (not a part of the format command). For example, to make all
values greater than 0.8 print as `1' and all values less than 0.2 print
as `0', use:
.ce
t 0.2
Of course, this same tolerance value is also used to check to see if all
the patterns have converged. The third output format is meant to
give "analog compressed" output. In this format, a `c' is printed when
a value is close enough to its target value. Otherwise, if the answer
is close to 1, a `1' is printed, if the answer is close to 0, a `0' is
printed, if the answer is above the target but not close to 1, a `^' is
printed and if the answer is below the target but not close to 0, a `v'
is printed. This output format is designed for problems where the
output is a real number, as for instance, when the problem is to make a
network learn sin(x).
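As an illustration only (the simulators' own code may differ in detail),
the character printed for a single output value under the compressed and
analog compressed formats follows rules like these, where tol is the
tolerance set with the `t' command:
.na
.nf
#include <math.h>
#include <stdio.h>

/* Illustrative sketch of the output character rules described above. */
static char compressed_char(double v, double tol)
{
    if (v > 1.0 - tol) return '1';   /* close to 1 */
    if (v > 0.5)       return '^';   /* above 0.5 but not close to 1 */
    if (v > tol)       return 'v';   /* below 0.5 but above the tolerance */
    return '0';                      /* close to 0 */
}

static char analog_char(double v, double target, double tol)
{
    if (fabs(v - target) < tol) return 'c';   /* close enough to its target */
    if (v > 1.0 - tol)          return '1';   /* close to 1 */
    if (v < tol)                return '0';   /* close to 0 */
    return (v > target) ? '^' : 'v';          /* above or below the target */
}

int main(void)
{
    printf("%c %c\n", compressed_char(0.94, 0.1), analog_char(0.62, 0.5, 0.1));
    return 0;
}
.ad
.fi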
For the sake of convenience, the output format (and only the
output format) can be set without using the `f', so that:
.ce
or
will also make the output format real.
.ce
Breaking up the Output Values
In the compressed formats, the default is to print a blank after
every 10 values. This can be altered using the `b' (for inserting
breaks) command. The use for this command is to separate output values
into logical groups to make the output more readable. For instance, you
may have 24 output units where it makes sense to insert blanks after the
4th, 7th and 19th positions. To do this, specify:
.ce
b 4 7 19
Then for example, the output will look like:
.na
.nf
1 10^0 10^ ^000v00000v0 01000 (0.17577)
2 1010 01v 0^0000v00000 ^1000 (0.16341)
3 0101 10^ 00^00v00000v 00001 (0.16887)
4 0100 0^0 000^00000v00 00^00 (0.19880)
.ad
.fi
The `b' command allows up to 20 break positions to be specified. The
default output format is the real format with 10 numbers per line. For
the output of real values, the `b' command specifies when to print a
carriage return rather than when to print a blank. (Note: the `b'
command is not part of the `f' command.)
.ce
Pattern Formats
There are two different types of problems that back-propagation can
handle, the general type of problem where every output unit can take on
an arbitrary value and the classification type of problem where the goal
is to turn on output unit i and turn off all the other output units when
the pattern is of class i. The xor problem is an example of the general
type of problem. For an example of a classification problem, suppose
you have a number of data points scattered about through two-dimensional
space and you have to classify the points as either class 1, class 2 or
class 3. For a pattern of class 1 you can always set up the
output: "1 0 0", for class 2: "0 1 0" and for class 3: "0 0 1";
however, doing the translation to bit patterns can be annoying, so
another notation can be used. Instead of specifying the bit patterns
you can set the pattern format option to classification (as opposed to
the default value of general) like so:
.ce
f pc
and then the program will read data in the form:
1.33 3.61 1 * shorthand for 1 0 0
0.42 -2.30 2 * shorthand for 0 1 0
-0.31 4.30 3 * shorthand for 0 0 1
and translate it to the bit string form when the pattern is loaded
onto the output units. To switch to the general form, use "f pg".
In addition to controlling the input of data, the p command within
the format command is used to control the output of patterns from a set
of test patterns kept on a file. If the format is either c or g then
when the test set is run thru the network you will only get a summary
of how many patterns are correct. If the format is either C or G you
will get a listing of the output values for each pattern as well as
the summary. When reading patterns, C works the same as c and G works
the same as g.
.ce
Controlling Summaries
When the program is learning patterns you can have it print out
the status of the learning process at regular intervals. The default
is to print out only a summary of how learning is going, however you
can also print out the status of every pattern at regular intervals.
To get the whole set of patterns, use "f s-" to turn off the summary
format and "f s+" to go back to summarizing.
.ce
Ringing the Bell
To ring the bell when the learning has been completed use "f b+"
and to turn off the bell, use "f b-".
.ce
Echoing Input
When you are reading commands from a file, it is often worthwhile
to see those commands echoed on the screen. To do this, use "f e+"
and to turn off the echoing, use "f e-".
.ce
Up-To-Date Statistics
During the ith pass thru the network the program will collect
statistics on how many patterns are correct and how much error there
is overall. These numbers are gathered before the weights are updated
and so the results listed for iteration i really show the situation
after the weight update for iteration i-1. To complicate matters more,
the weight updates can be done continuously instead of after all the
patterns have been presented, so the statistics you get here are skewed
even more. If you want to have up-to-date statistics with either
method, use "f u+" and to go back to statistics that are out of date,
use "f u-". The penalty with "f u+" is that the program needs to do
another forward pass. When using the continuous update method it is
highly advisable to use "f u+", at least when you get close to
complete convergence because the default method of checking may claim
the learning is finished when it isn't or it may continue training
after the tolerances have been met.
.ce
5. Taking Training and Testing Patterns from Files
In the xor example given above, the four patterns were part of the
data file and were read in with the following lines:
.na
.nf
n 4
1 0 1
0 0 0
0 1 1
1 1 0
.ad
.fi
However, it is also convenient to take patterns from a file that
contains nothing but a list of patterns (and possibly comments).
To read a new set of patterns from some file, patterns, use:
.ce
n f patterns
To add an extra bunch of patterns to the current set you can use:
.ce
x f patterns
In addition to keeping a set of training patterns you can take
testing patterns from a file as well. To specify the file you can
invoke the program with a second file name, as in:
.ce
bp xor xtest
In addition, if you do the following:
.ce
t f xtest
the program will set xtest as the test file and immediately do the
testing. Once the file has been defined you can test the patterns
on the test file by "t f" or just "t". (This leaves the t command
doing double duty since "t <real>" will set the tolerance to <real>.)
In addition, the test file can be set without being tested by using
.ce
B t f xtest
as explained in the benchmarking section.
.ce
6. Saving and Restoring Weights and Related Values
Sometimes the amount of time and effort needed to produce a set of
weights to solve a problem is so great that it is more convenient to
save the weights rather than constantly recalculate them. Weights can
be saved as real values in an ASCII format (the default) or as binary,
to save space. To save the weights enter the command, `S'. The weights
are written on a file called "weights". The following file comes from
the xor problem:
.na
.nf
62r file = ../xor3
9.5351562500
-8.4033203125
4.1308593750
5.5800781250
-4.9755859375
-11.3095703125
8.0527343750
.ad
.fi
To write the weights, the program starts with the second layer, writes
out the weights leading into these units in order with the threshold
weight last. Then it moves on to the third layer, and so on. To
restore these weights, type an `R' for restore. At this time, the
program reads the header line and sets the total number of iterations
the program has gone through to be the first number it finds on the
header line. It then reads the character immediately after the number.
The `r' indicates that the weights will be real numbers represented as
character strings. If the weights were binary, the character would be
a `b' rather than an `r'. Also, if the character is `b', the next
character is read. This next character indicates how many bytes are
used per value. The integer versions, bp and sbp write files with 2
bytes per weight, while the real versions, rbp and srbp write files with
8 bytes per weight for double precision reals and 4 bytes per weight
for single precision reals. With this notation, weight files written by
one program can be read by the other. A binary weight format is
specified within the `f' command by using "f wb". A real format is
specified by using "f wr". If your program specifies that weights
should be written in one format, but the weight file you read from is
different, a notice will be given. There is no check made to
see if the number of weights on the file equals the number of weights in
the network.
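As a rough sketch of the header handling just described (an illustration
based on the description above, not the program's own code), a reader of
the weights file would pick up the iteration count, the `r' or `b'
character and, for binary files, the bytes-per-weight character:
.na
.nf
#include <stdio.h>

/* Illustrative sketch only: reading the header of a weights file as
 * described above.  The leading number is the iteration count, the next
 * character is `r' (real, ASCII) or `b' (binary), and for binary files
 * the character after that gives the bytes used per weight (assumed
 * here to be written as a digit). */
int main(void)
{
    FILE *fp = fopen("weights", "r");
    long iterations;
    int fmt, bytes_per_weight = 0;

    if (fp == NULL)
        return 1;
    fscanf(fp, "%ld", &iterations);            /* total iterations so far */
    fmt = fgetc(fp);                           /* 'r' or 'b' */
    if (fmt == 'b')
        bytes_per_weight = fgetc(fp) - '0';    /* e.g. 2, 4 or 8 */
    printf("%ld iterations, format %c, %d bytes/weight\n",
           iterations, fmt, bytes_per_weight);
    fclose(fp);
    return 0;
}
.ad
.fi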
The above formats specify that only weights are written out and this
is all you need once the patterns have converged. However, if you're
still training the network and want to break off training and pick up
the training from exactly the same point later on, you need to save
the old weight changes when using momentum, and the parameters for the
delta-bar-delta method if you are using this technique. To save these
extra parameters on the weights file, use "f wR" to write the extra
values as real and "f wB" to write the extra values as binary.
In the above example, the command S, was used to save the weights
immediately. Another alternative is to save weights at regular
intervals. The command, S 100, will automatically save the weights
every 100 iterations. The default rate at which to save
weights is set at 100,000, which generally means that no weights will
ever be saved.
.ce
7. Initializing Weights and Giving the Network a `Kick'
All the weights in the network initially start out at 0. In
symmetric networks, then, no learning may result because error signals
cancel themselves out. Even in non-symmetric networks, the training
process will usually converge faster if the weights start out at small
random values. To do this, the `k' command will take the network and
alter the weights in the following ways. Suppose the command given is:
.ce
k 0 0.5
Now, if a weight is exactly 0, then the weight will be changed to a
random value between +0.5 and -0.5. The above command can therefore be
used to initialize the weights in the network. A more complex use of
the `k' command is to decrease the magnitude of large weights in the
network by a certain random amount. For instance, in the following
command:
.ce
k 2 8
all the weights in the network that are greater than or equal to 2 will
be decreased by a random number between 0 and 8. Weights less than or
equal to -2 will be increased by a random number between 0 and 8. The
seed to the random number generator can be changed using the `s' command
as in "s 7". The integer parameter in the `s' command is of type,
unsigned.
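A sketch of this logic in C (an illustration of the description above,
not the program's own code) might look like the following, where v1 and
v2 are the two arguments of the `k' command:
.na
.nf
#include <stdlib.h>

/* Illustrative sketch of the `k v1 v2' command described above. */
static double frand(double range)           /* uniform value in [0, range] */
{
    return range * ((double) rand() / (double) RAND_MAX);
}

static double kick_weight(double w, double v1, double v2)
{
    if (v1 == 0.0) {
        if (w == 0.0)
            return frand(2.0 * v2) - v2;    /* random value in [-v2, +v2] */
    } else if (w >= v1) {
        return w - frand(v2);               /* shrink a large positive weight */
    } else if (w <= -v1) {
        return w + frand(v2);               /* shrink a large negative weight */
    }
    return w;                               /* otherwise leave the weight alone */
}
.ad
.fi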
Another method of giving a network a kick is to add hidden layer
units. The command:
.ce
H 2 0.5
adds one unit to layer 2 of the network and all the weights that are
created are initialized to between -0.5 and +0.5.
The subject of kicking a back-propagation network out of local minima
has barely been studied and there is no guarantee that the above methods
are very useful in general.
.ce
8. Setting the Algorithm to Use
A number of different variations on the original back-propagation
algorithm have been proposed in order to speed up convergence and some
of these have been built into these simulators. These options are set
using the `A' command and a number of options can go on the one line.
.ce
Activation Functions
To set the activation function, use:
A al * the linear activation function
A ap * the piece-wise activation function
A as * the smooth activation function
A at * the piecewise near-tanh function that runs from -1 to +1
A aT * the continuous near-tanh function that runs from -1 to +1
When using the linear activation function, it is only appropriate to
use the differential step-size derivative and a two-layer network.
The smooth activation function is:
.ce
f = 1.0 / (1.0 + exp(-x))
where x is the input to a unit. The piece-wise function is an
approximation to the function, f, and it will normally save some CPU
time even though it may increase the number of iterations you need to
solve the problem. The continuous near-tanh function is 2f - 1 and the
piece-wise version approximates this function with a series of straight
lines.
.ce
Sharpness (or Gain)
The sharpness (or gain) is the parameter, D, in the function:
.ce
1.0 / (1.0 + exp(-D*x)).
A sharper sigmoid shaped activation function (larger D) will produce
faster convergence (see "Speeding Up Back Propagation" by Yoshio Izui
and Alex Pentland in the Proceedings of \fIIJCNN-90-WASH-DC\fR, Lawrence
Erlbaum Associates, 1990). To set this parameter to, say, 8,
use "A D 8". The default value is 1. Unfortunately, too large a value
for D will hurt convergence so this is not a perfect solution to
speeding up learning. Sometimes the best value for D may be less than
1.0. A larger D is also useful in the integer version of
back-propagation where the weights are limited to between -32 and
+31.999. A larger D value in effect magnifies the weights and makes it
possible for the weights to stay smaller. Values of D less than one may
be useful in extracting a network from a local minimum (see "Handwritten
Numeral Recognition by Multi-layered Neural Network with Improved
Learning Algorithm" by Yamada, Kami, Temma and Tsukumo in Proceedings of
the 1989 IJCNN, IEEE Press). A smaller value of D will also force the
weights and the weight changes to be larger and this may be of value
when the weight changes become less than the weight resolution of 0.001
in the integer version.
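For reference, the smooth activation function with the sharpness
parameter D and the continuous near-tanh variant (2f - 1) can be written
directly in C; the piece-wise versions approximate these with
straight-line segments and are not shown here:
.na
.nf
#include <math.h>

/* The smooth activation function with sharpness (gain) D, and the
 * continuous near-tanh function 2f - 1, as described above. */
static double smooth(double x, double D)
{
    return 1.0 / (1.0 + exp(-D * x));    /* runs from 0 to 1 */
}

static double near_tanh(double x, double D)
{
    return 2.0 * smooth(x, D) - 1.0;     /* runs from -1 to +1 */
}
.ad
.fi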
.ce
The Derivatives
The correct derivative for the standard activation function is s(1-s)
where s is the activation value of a unit; however, when s is near 0 or 1
this term will give only very small weight changes during the learning
process. To counter this problem, Fahlman proposed the following one
for the output layer:
.ce
0.1 + s(1-s)
(For the original description of this method, see "Faster Learning
Variations of Back-Propagation: An Empirical Study", by Scott E.
Fahlman, in \fIProceedings of the 1988 Connectionist Models Summer
School\fR, Morgan Kaufmann, 1989.) Besides Fahlman's derivative and the
original one, the differential step size method (see "Stepsize Variation
Methods for Accelerating the Back-Propagation Algorithm", by Chen and
Mars, in \fIIJCNN-90-WASH-DC\fR, Lawrence Erlbaum, 1990) takes the
derivative to be 1 in the layer going into the output units and uses the
correct derivative for all other layers. The learning rate for the
inner layers is normally set to some smaller value. To set a value for
eta2, give two values in the `e' command as in:
.ce
e 0.1 0.01
To set the derivative, use the `A' command as in:
.ne4
A dd * use the differential step size derivative (default)
A dF * use Fahlman's derivative in only the output layer
A df * use Fahlman's derivative in all layers
A do * use the original, correct derivative
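In C terms, the three derivative choices reduce to the following, where
s is the activation value of a unit (an illustration, not the
simulators' code):
.na
.nf
/* The derivative choices listed above, for an activation value s.  The
 * differential step size option simply uses 1 for the layer feeding the
 * output units and the correct derivative s(1-s) everywhere else. */
static double deriv_original(double s)  { return s * (1.0 - s); }
static double deriv_fahlman(double s)   { return 0.1 + s * (1.0 - s); }
static double deriv_diff_step(double s) { (void) s; return 1.0; }
.ad
.fi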
.ce
Update Methods
The choices are the periodic (batch) method, the continuous (online)
method and Jacobs' delta-bar-delta method. The following commands
set the update methods:
A uc * for the continuous update method
A ud * for the delta-bar-delta method
A up * for the original periodic update method (default)
The delta-bar-delta method uses a number of special parameters and these
are set using the `d' command. Delta-bar-delta can be used with any of
the derivatives and the algorithm will find its own value of eta for
each weight.
.ce
Other Algorithm Options
The `b' command controls whether or not to backpropagate error for
units that have learned their response to within a given tolerance. The
default is to always backpropagate error. The advantage to not
backpropagating error is that this can save computer time. This
parameter can be set like so:
A b+ * always backpropagate error
A b- * don't backpropagate error when close
The `s' sub-command will set the number of times to skip a pattern
when the pattern has been learned to within the desired tolerance. To
skip 3 iterations, use "A s 3", to reset to not skip any patterns
use "A s 0".
The `t' sub-command will take the given pattern (only one at a
time) out of the training set so that you can then train the other
patterns and test the network's response to this one pattern that was
removed. To test pattern 3, use "A t 3" and to reset to use all the
patterns use "A t 0".
.ce
9. The Delta-Bar-Delta Method
The delta-bar-delta method attempts to find a learning rate
eta, for each individual weight. The parameters are the initial
value for the etas, the amount by which to increase an eta that seems
to be too small, the rate at which to decrease an eta that is
too large, a maximum value for each eta and a parameter used in keeping
a running average of the slopes. Here are examples of setting these
parameters:
.na
.nf
d d 0.5 * sets the decay rate to 0.5
d e 0.1 * sets the initial etas to 0.1
d k 0.25 * sets the amount to increase etas by (kappa) to 0.25
d m 10 * sets the maximum eta to 10
d n 0.005 * an experimental noise parameter
d t 0.7 * sets the history parameter, theta, to 0.7
.ad
.fi
These settings can all be placed on one line:
.ce
d d 0.5 e 0.1 k 0.25 m 10 t 0.7
The version implemented here does not use momentum. The symmetric
versions, sbp and srbp, do not implement delta-bar-delta.
The idea behind the delta-bar-delta method is to let the program find
its own learning rate for each weight. The `e' sub-command sets the
initial value for each of these learning rates. When the program sees
that the slope of the error surface averages out to be in the same
direction for several iterations for a particular weight, the program
increases the eta value by an amount, kappa, given by the `k' parameter.
The network will then move down this slope faster. When the program
finds the slope changes signs, the assumption is that the program has
stepped over to the other side of the minimum and so it cuts down the
learning rate, by the decay factor, given by the `d' parameter. For
instance, a d value of 0.5 cuts the learning rate for the weight in
half. The `m' parameter specifies the maximum allowable value for an
eta. The `t' parameter (theta) is used to compute a running average of
the slope of the weight and must be in the range 0 <= t < 1. The
running average at iteration i, a\di\u, is defined as:
.ce
a\di\u = (1 - t) slope\di\u + ta\di-1\u,
so small values for t make the most recent slope more important than
the previous average of the slope. Determining the learning rate for
back-propagation automatically is, of course, very desirable and this
method often speeds up convergence by quite a lot. Unfortunately, bad
choices for the delta-bar-delta parameters give bad results and a lot of
experimentation may be necessary. If you have n patterns in the
training set try starting e and k around 1/n. The n parameter is an
experimental noise term that is only used in the integer version. It
changes a weight in the wrong direction by the amount indicated when the
previous weight change was 0 and the new weight change would be 0 and
the slope is non-zero. (I found this to be effective in an integer
version of quickprop so I tossed it into delta-bar-delta as well. If
you find this helps, please let me know.) For more on
delta-bar-delta see "Increased Rates of Convergence" by Robert A.
Jacobs, in \fINeural Networks\fR, Volume 1, Number 4, 1988.
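As a sketch of the method just described (an illustration, not the
program's own code), the per-weight bookkeeping can be written along
these lines, using the parameter names of the `d' command:
.na
.nf
/* Illustrative sketch of the delta-bar-delta eta update for one weight.
 * slope is the current error derivative for the weight and avg is the
 * running average of past slopes (the "bar"). */
struct dbd_weight {
    double eta;      /* learning rate for this weight */
    double avg;      /* running average of the slope */
};

static void dbd_update_eta(struct dbd_weight *w, double slope,
                           double kappa, double decay,
                           double max_eta, double theta)
{
    if (slope * w->avg > 0.0)
        w->eta += kappa;        /* slope keeps its sign: raise eta by kappa */
    else if (slope * w->avg < 0.0)
        w->eta *= decay;        /* slope changed sign: cut eta by the decay rate */
    if (w->eta > max_eta)
        w->eta = max_eta;       /* never let eta exceed the maximum */
    /* running average: a(i) = (1 - theta) * slope(i) + theta * a(i-1) */
    w->avg = (1.0 - theta) * slope + theta * w->avg;
}
.ad
.fi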
.ce
10. Recurrent Networks
Recurrent back-propagation networks take values from higher level
units and use them as activation values for lower level units. This
gives a network a simple kind of short-term memory, possibly a little
like human short-term memory. For instance, suppose you want a network
to memorize the two short sequences, "acb" and "bcd". In the middle of
both of these sequences is the letter, "c". In the first case you
want a network to take in "a" and output "c", then take in "c" and
output "b". In the second case you want a network to take in "b" and
output "c", then take in "c" and output "d". To do this, a network
needs a simple memory of what came before the "c".
Let the network be a 7-3-4 network where input units 1-4 and output
units 1-4 stand for the letters a-d. Furthermore, let there be 3 hidden
layer units. The hidden units will feed their values back down to the
input units 5-7, where they become input for the next step. To see why
this works, suppose the patterns have been learned by the network.
Inputting the "a" from the first string produces some random pattern of
activation, p1, on the hidden layer units and "c" on the output units.
The pattern p1 is copied down to units 5-7 of the input layer. Second,
the letter, "c" is presented to the network together with p1 now on
units 5-7. This will give "b" on the output units. However, if the "b"
from the second string is presented first, there will be a different
random pattern, p2, on the hidden layer units. These values are copied
down to input units 5-7. These values combine with the "c" to produce
the output, "d".
The training patterns for the network can be:
1000 000 0010 * "a" prompts the output, "c"
0010 hhh 0100 * inputting "c" should produce "b"
0100 000 0010 * "b" prompts the output, "c"
0010 hhh 0001 * inputting "c" should produce "d"
where the first four values on each line are the normal input, the
middle three either start out all zeros or take their values from the
previous values of the hidden units. The code for taking these values
from the hidden layer units is "h". The last set of values represents
the output that should be produced. To take values from the third layer
of a network, the code is "i". For the fourth and fifth layers (if they
exist) the codes are "j" and "k". Training recurrent networks can take
much longer than training standard networks and the average error can
jump up and down quite a lot.
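The copy-down step itself is simple; as an illustration (not the
simulators' code), for the 7-3-4 network above it amounts to copying the
three hidden-unit activations onto input units 5-7 before the next
pattern in the sequence is presented:
.na
.nf
#include <string.h>

/* Illustrative sketch of the `h' mechanism for a 7-3-4 network: the
 * hidden activations from the previous step become the values of input
 * units 5-7 (0-based indices 4-6 here). */
#define N_INPUT   7
#define N_HIDDEN  3

static void copy_hidden_to_input(double input[N_INPUT],
                                 const double hidden[N_HIDDEN])
{
    memcpy(&input[N_INPUT - N_HIDDEN], hidden, N_HIDDEN * sizeof(double));
}
.ad
.fi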
.ce
11. The Benchmarking Command
The main purpose of the benchmarking command is to make it possible
to run a number of tests of a problem with different initial weights and
average the number of iterations and CPU time for networks that
converged. A second purpose is to run a training set thru the network a
number of times, and for each try, a test pattern or a test set can be
checked at regular intervals.
A typical command to simply test the current parameters on a number
of networks is:
.ce
B g 5 m 15 k 1 r 1000 200
The "g 5" specifies that you'd like to set the goal of getting 5
networks to converge but the "m 15" sets a maximum of 15 tries to
reach this goal. The k specifies that each initial network will get
a kick by setting each weight to a random number between -1 and 1.
The "r 1000 200" portion specifies that you should run up to 1000
iterations on a network and print the status of learning every 200
iterations. This follows the normal run command and the second
parameter defining how often to print the statistics can be omitted.
For example, here is some output from benchmarking with the xor
problem:
.na
.nf
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? B g 5 m 5 k 1 r 200
seed = 7; running . . .
patterns learned to within 0.10 at iteration 62
seed = 7; running . . .
seed = 7; running . . .
patterns learned to within 0.10 at iteration 54
seed = 7; running . . .
patterns learned to within 0.10 at iteration 39
seed = 7; running . . .
patterns learned to within 0.10 at iteration 44
1 failures; 4 successes; average = 49.750000 0.333320 sec/network
.ad
.fi
The time/network includes however much time is used to print messages,
so to time the process effectively, all printing should be minimized.
The timing is done using the UNIX clock(3C) function and, on a UNIX PC
at least, the time returned by clock will overflow after 2147 seconds, or
about 36 minutes. If your system has the same limitation, take care that
ALL of the benchmarking you do in a single call of the program adds up
to less than 36 minutes.
In the above example, the seed that was used to set the random values
for the weights was set to 7 (outside the benchmarking command), however
if you set a number of seeds, as in:
.ce
s 3 5 7 18484 99
the seeds will be taken in order for each network. When there are more
networks to try than there are seeds, the random values keep coming from
the last seed value, so actually you can get by using a single seed.
The idea behind allowing multiple seeds is so that if one network does
something interesting you can use that seed to run a network with the
same initial weights outside of the benchmarking command.
Once the benchmarking parameters have been set, it is only necessary
to include the run portion in order to start the benchmarking process
again, thus, "B r 200" will run benchmarking again using the current set
of parameters. Also, the parameters can be set without starting the
benchmarking process by just not including the `r' parameters in the B
command as in:
.ce
B g 5 m 5 k 1
In addition to getting data on convergence, you can have the program
run test patterns thru the network at the print statistics rate given
in the `r' sub-command. To specify the test file, test100, use:
.ce
B t f test100
To run the training data thru for up to 1000 iterations and test every
200 iterations use:
.ce
B r 1000 200
If the pattern format specification p is set to either c or g you will
get a summary of the patterns on the test file. If p is either C or G
you will get the results for each pattern listed as well as the summary.
To stop testing the data on the data file, use "B t 0".
Sometimes you may have so little data available that it is difficult
to separate the patterns into a training set and a test set. One
solution is to remove each pattern from the training set, train the
network on the remaining patterns and then test the network on the
pattern that was removed. To remove a pattern, say pattern 1 from the
training set use:
.ce
B t 1
To systematically remove each pattern from the training set, use a
data file with the following commands:
.na
.nf
B t 1 r 200 50
B t 2 r 200 50
B t 3 r 200 50
... etc.
.ad
.fi
and the pattern will be tested every 50 iterations. If, in the course
of training the network, all the patterns converge, the program will
print out a line starting with a capital S followed by a test of the
test pattern. If the program hits the point where statistics on the
learning process have to be printed and the network has not converged,
then a capital F will print out followed by a test of the test pattern.
To stop this testing, use "B t 0".
It would be nice to have the program average up and tabulate all the
data that comes out of the benchmarking command, but I thought I'd leave
that to users for now. You can use the record command to save the
output from the entire session and then run it thru some other program,
such as an awk program in order to sort everything out.
.ce
12. Miscellaneous Commands
Below is a list of some miscellaneous commands, a short example of
each and a short description of the command.
.IP " ! !cmd " 15
Anything after `!' will be passed on to UNIX as a command to execute.
.IP " C C " 15
The C command will clear the network of values, reset the number of
iterations to 0 and reset other values so that another run can be made.
The seed value is reset so you can run the same network over with the
same initial weights but with different learning parameters. Even
though the seed is reset, the weights are not re-initialized, so you
must do this step yourself (with the `k' command) after clearing the network.
.IP " i i f " 15
Entering "i f" will read commands from the file, f. When there are
no more commands on the file, the program resumes reading from the
previous file being used. You can also have an `i' command within the
file f, however the depth to which you can nest the number of active
files is 4 and stdin itself counts as the first one. Once an input
file has been specified, you can simply type "i" to read from the file
again.
.IP " l l 2 " 15
Entering "l 2" will print the values of the units on layer 2,
or whatever layer is specified.
.IP " T T -3 " 15
In sbp and srbp only, "T -3" sets all the threshold weights
to -3 or whatever value is specified and freezes them at this value.
.IP " W W 0.9 " 15
Entering "W 0.9" will remove (whittle away) all the weights with
absolute values less than 0.9.
.in-15
In addition, when a user-generated interrupt occurs (by typing DEL)
the program will drop its current task and take the next command from
the keyboard.
.ce
13. Limitations
Weights in the bp and sbp programs are 16-bit integer weights, where
the real value of the weight has been multiplied by 1024. The integer
versions cannot handle weights less than -32 or greater than 31.999.
The weight changes are all checked for overflow but there are other
places in these programs where calculations can possibly overflow as
well and none of these places are checked. Input values for the integer
versions can run from -31.994 to 31.999. Due to the method used to
implement recurrent connections, input values in the real version are
limited to -31994.0 and above.
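For reference, the fixed-point representation works out as follows (an
illustration of the arithmetic, not the simulators' code):
.na
.nf
#include <stdio.h>

/* The 16-bit fixed-point representation described above: a weight is
 * stored as a short holding the real value multiplied by 1024, giving a
 * resolution of about 0.001 and a range of roughly -32 to +31.999. */
static short  to_fixed(double w)    { return (short) (w * 1024.0); }
static double from_fixed(short iw)  { return (double) iw / 1024.0; }

int main(void)
{
    short iw = to_fixed(-8.40332);
    printf("stored as %d, read back as %f\n", iw, from_fixed(iw));
    return 0;
}
.ad
.fi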