LAYER-WISE TRAINING OF FEEDFORWARD NEURAL NETWORKS BASED ON LINEARIZATION AND SELECTIVE DATA PROCESSING

By

Shawn David Hunt

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1992

ABSTRACT

LAYER-WISE TRAINING OF FEEDFORWARD NEURAL NETWORKS BASED ON LINEARIZATION AND SELECTIVE DATA PROCESSING

By Shawn David Hunt

A class of algorithms is presented for training nonlinear feedforward neural networks using purely "linear" techniques. The algorithms are based upon linearizations of the network using error surface analysis, followed by a contemporary recursive least squares identification procedure which can be implemented using parallel processing. Specific algorithms are presented to estimate weights node-wise, layer-wise, and for estimating the entire set of network weights simultaneously. A procedure for modifying the algorithms to selectively use the training data and increase speed is also presented. A computationally inexpensive measure is developed with which to assess the effect of a particular training pattern on the weight estimates prior to its inclusion in any iteration. Data which do not significantly change the weights are not used in that iteration, obviating the computational expense of updating. Several experimental studies are presented showing the advantages of this class of algorithms. Specifically, the layer-wise algorithm is shown to be vastly superior to back-propagation in terms of the number of convergences and convergence rate. Additionally, this algorithm is shown to be insensitive to the choice of initial weights and forgetting factor, eliminating two of the greatest problems in the implementation of existing training algorithms.

To my parents, David and Suzanne Hunt

Acknowledgments

I would like to thank John R. Deller, Jr., for the immense help in completing this degree. The biggest help was probably learning how (even if I do not necessarily follow it) to write a technical paper. I have heard that a Ph.D. degree means you can do research. Of course, if you cannot transmit this to others in a clear, concise manner, it is of little use. I also must give thanks for his defense (during my defense, especially the proposal), and his willingness to see me and listen to any problem. One of the greatest benefits of being at Michigan State was being in a lab where the students not only got along, but where there was a spirit of cooperation and friendliness. For this I must thank both Dr. Deller and Joan (for keeping El Estimado happy). I would also like to thank the members of my committee, Dr. Majid Nayeri, Dr. Norman Birge, and Dr. Fathi Salam. It was an unexpected surprise to find out that the committee members had not only read the complete dissertation, but in great detail. I could not have asked for more. Money! For this I must thank the University of Puerto Rico for granting me a fellowship. This work was also supported by the National Science Foundation under Grant No. MIP-9016734, and the Office of Naval Research under Contract No. N00014-91-J-1329. I also wish to thank the members of the East Lansing Baha'i community for the support which has made these years truly enjoyable.
No, not only enjoyable, great! First the Nazerians, Keyvan, Farzaneh, Mike, Alan and Lily for all the great food, the thanksgivings, the borrowed cars, and of course the tennis. Njeru Murage, Roya Mavaddat, Bruce Johnson and Irene Shim (a token Baha'i), for the most enlightening discussions on every imaginable topic. (Mike, Alan and I also had interesting discussions, but these were enlightening on a different level.) Also the Smiths, David, Melanie, Amber, Amelia, and Zack (God is One), for the incredible spirit at the Sunday night firesides. Joe and Judy DeFlorio! (and of course Sophia) Could we have done it without you? Dominic, for being the only five-year-old to be interested in hearing me talk about black holes. Also thanks to Franco and Shahnaz Damazio for being a little Persian and worrying. Also to the members who enlightened the club here on campus over the years, Gail Etzenhauser (let's go!), Hami Missagieh (Haaaaami), Ohmead Bashirelahi (Sasha who?), Farnoush (why sleep) and Kianoush (let's go to Reza's) Sinai, Tahirih Steen, Rick and Eve Martin, Mark Knox (let's teach), Gill Munro (caderas), Greg (el tigre) and Valerie (hi babe) Heikes, and John Homan (thanks for the music). I know I have left out many, please forgive me. Anne and Alicia, thank you both for just being there and thank you for the prayers.

Contents

List of Tables
List of Figures
1 Introduction and Background
1.1 Introduction
1.1.1 The Promise of Neural Networks
1.1.2 Notation and Objectives
1.2 The Feedforward Neural Network (FNN)
1.3 Training Algorithms for FNNs
1.3.1 Performance Measures
1.3.2 Training Efficiency
1.3.3 The Back-Propagation Algorithm
1.3.4 Techniques for Improving Convergence
1.3.5 Complexity Measures
2 New Training Algorithms
2.1 Introduction
2.2 Linearized Training Algorithms
2.2.1 Node-wise Weight Updating
2.2.1.1 Two Layer Network
2.2.1.2 L Layer Network
2.2.2 Layer-wise Weight Updating
2.2.2.1 Two Layer Network
2.2.2.2 L Layer Network
2.2.3 Network-wise Weight Updating
2.2.3.1 Two Layer Network
2.2.3.2 L Layer Network
2.3 Solving the Linearized Equations
2.3.1 Solution by MIL-WRLS
2.3.2 Solution by QR-WRLS
3 Data Reduction Algorithm
3.1 Introduction
3.2 Data Reduction Algorithm
4 Simulation Results
4.1 Introduction
4.2 Implementation Studies
4.3 Layer Updating and Network Updating
4.3.1 Initialization
4.3.2 Forgetting Factor
4.3.3 Small Training Sets
4.3.4 Large Training Set
4.4 Simulations of the Data Reduction Algorithm
5 Conclusions and Future Research
5.1 Algorithmic Developments
5.1.1 Training Speed
5.1.2 Layer-wise Training Algorithm
5.1.3 QR-WRLS
5.1.4 Data Reduction Algorithm
5.2 Future Research
Appendix

List of Tables

1.1 Complexities per iteration and normalized complexities of FNN training algorithms. M is the number of nodes per layer. L is the number of layers.
4.2 Number of convergences per 100 sets of initial weights.
4.3 Convergence results for various initial weights, 2-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.4 Convergence results for various initial weights, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.5 Convergence results for various forgetting factors, 2-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.6 Convergence results for various forgetting factors, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.7 Convergence results for various forgetting factors, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.8 Iterations to convergence of the BP algorithm for various training sets and varying number of hidden layer nodes.
4.9 Iterations to convergence of the node-wise algorithm for various training sets and varying number of hidden layer nodes.
4.10 Iterations to convergence of the layer-wise algorithm for various training sets and varying number of hidden layer nodes.
4.11 Iterations to convergence of the network-wise algorithm for various training sets and varying number of hidden layer nodes.
4.12 Number of correctly classified training patterns for the large training set for varying number of inner layer nodes.

List of Figures

1.1 A single node.
1.2 Example of a sigmoid function.
1.3 A two-layer network.
2.4 Weight estimation using recursive QR-WRLS.
4.5 Network architectures used in the simulation studies. 1. the 2-bit parity checker, 2. the 4-bit parity checker, and 3. the 4-bit bit counter.
4.6 Average error in dB for the X-OR implementations vs. iteration number. 1. back-propagation; 2. A-S algorithm; 3. QR-WRLS.
4.7 Average error in dB for the BP implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.8 Average error in dB for the node-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.9 Average error in dB for the layer-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.10 Average error in dB for the network-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.11 Average error in dB for the BP implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.12 Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.13 Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.14 Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.15 Average error in dB for the QR-WRLS X-OR implementation vs. iteration number, using different forgetting factors and weight change constraints. 1. ν = 0.98, γ = 0.2; 2. ν = 0.98, γ = 1.0; 3. ν = 0.1, γ = 1.0; 4. ν = 0.1, γ = 0.2; where ν is the forgetting factor and γ is the weight constraint.
4.16 Average error in dB for the node-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.
4.17 Average error in dB for the layer-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.
4.18 Average error in dB for the network-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.
4.19 Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.20 Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.21 Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.22 Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.23 Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.24 Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.25 Training set A.
4.26 Training set C.
4.27 Training set E.
4.28 Training set G.
4.29 Training set I.
4.30 Training set K.
4.31 Training set L.
4.32 Training set M.
4.33 Training set N.
4.34 Large training set.
4.35 Average error vs. iteration number for the BP implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.
4.36 Average error vs. iteration number for the node-wise implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.
4.37 Average error vs. iteration number for the layer-wise implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.
4.38 Average error vs. iteration number for the data reduction algorithm using 6 hidden layer nodes, training with the large training set.
4.39 Correctly classified training patterns vs. iteration number for the data reduction algorithm using 6 hidden layer nodes and the large training set.
4.40 Average error vs. iteration number for the data reduction algorithm using 7 hidden layer nodes, training with the large training set.
4.41 Correctly classified training patterns vs. iteration number for the data reduction algorithm using 7 hidden layer nodes and the large training set.
4.42 Average error vs. iteration number for the data reduction algorithm using 8 hidden layer nodes, training with the large training set.
4.43 Correctly classified training patterns vs. iteration number for the data reduction algorithm using 8 hidden layer nodes and the large training set.
4.44 Number of training patterns used vs. iteration number for the data reduction algorithm and a dividing factor of 30. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.
4.45 Number of training patterns used vs. iteration number for the data reduction algorithm and a dividing factor of 100. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.

Chapter 1

Introduction and Background

1.1 Introduction

1.1.1 The Promise of Neural Networks

This research is concerned with supervised training of Feedforward Artificial Neural Networks (FNNs). FNNs are a special case of a class of non-linear networks called Artificial Neural Networks (ANNs). ANNs have gained popularity in recent years due to their promise of solving previously intractable problems. ANNs have been applied to many problems in the areas of pattern recognition, pattern classification, and associative memories. Some of these problems are from the areas of speech recognition, computer vision, medical diagnosis and non-linear control. Included are such specific problems as teaching a machine to read out loud or to play games. The usefulness of a computer that understands speech or can "see" has been known for decades. For this reason, research to solve problems in the areas mentioned above has been vigorous. However, despite all the research, solutions for many of these problems have not been found using conventional linear techniques such as linear regression, or non-linear techniques such as artificial intelligence implemented on a standard sequential computer. For this reason, it has been suggested that the difficulty in finding solutions does not lie in the fact that the correct algorithm for implementation on a sequential computer has not been found, but rather that sequential computers are inherently unsuited to the problem. An architecture that does seem well-suited to the problem is that of organic neural networks, the brains of animals. Consequently, some properties of these organic neural networks have been emulated in the development of ANNs. Some of these properties are high interconnectivity, massive parallelism, and individual units that are non-linear. A major area of research is now ascertaining whether training algorithms for ANNs, algorithms designed to make the ANN solve the problems mentioned above, actually exist. A number of recently published simulation results provide evidence that the FNN can solve useful problems, even though an explicit mathematical proof of how the solution was reached may not be possible at this time.
One of the most widely referenced experiments is that of NETTALK [1]. In this experiment, a FNN was trained to read out loud. The input to the FNN was standard text, and the output was a phonetic alphabet. This phonetic alphabet was fed into a separate computer which allowed the output to be heard aloud. As impressive as it was to have a network read out loud, even more exciting is the description by the experimenters of how the learning was accomplished. It is reported that the network started off babbling, slowly progressing, making mistakes reminiscent of a child learning how to read. This gave great encouragement that the properties borrowed from nature in the making of FNNs were indeed useful in approximating a real neural system. Another experiment, possibly even more impressive from the engineering point of view, was that of Nguyen and Widrow [2]. The experiment succeeded in teaching a FNN how to back up a semi-trailer truck simulated on a computer. This simulation result is significant because at present we cannot design a controller for this nonlinear control problem. This experiment clearly shows the great advantage of FNNs over other methods. In a typical system at present, the designer must specify all the variables. This is not the case with FNNs. For example, in any pattern recognition problem, the features to be detected and used for selection must be determined by the designer. With a multilayer FNN, however, it is the system itself which determines what features to use. This can be of great advantage, as shown in the previous experiment. We would, of course, like to be able to analyze mathematically all the properties of the FNN. Until then, however, it will still be of use.

1.1.2 Notation and Objectives

As this dissertation relies heavily on the solution of a set of linear equations, a short discussion on notation is appropriate. Suppose we have a typical overdetermined system of linear equations. There are many methods for solving this system for the weights that give the least mean square error. Many popular methods are recursive, and are typically called recursive least squares (RLS). A weighting factor can be easily included in the recursions to weigh each of the linear equations differently in finding the final solution. This we will call weighted recursive least squares (WRLS). One method frequently used [5] is based on the matrix inversion lemma [5] and will be called MIL-WRLS here. A more contemporary recursive approach based on QR decomposition [3], [4] will be used in this dissertation and will be called QR-WRLS.

The research described in this dissertation is concerned with the method and speed of training FNNs. The main contributions have been:

1. To show the advantages of implementing linearized FNN training algorithms with QR-WRLS instead of the conventional MIL-WRLS algorithm. This is done by modifying one such algorithm, the Azimi-Sadjadi et al. (A-S) [6] algorithm, to employ QR-WRLS and comparing a variety of simulation results to the A-S algorithm implemented with MIL-WRLS.

2. To develop a "linearized" training procedure that is faster than the ones reported in the literature. This involves using conventional least mean square equations with new "linearized" inputs and outputs determined from the FNN, and searching for a way to determine inner layer target values faster than the back-propagation technique now widely used.

3. To apply matrix perturbation theory to the linearized neural networks in an attempt to increase convergence speed.
This includes derivation of equations that are well-suited for application to the FNN and are also simple enough so that a computational saving is possible.

The first two objectives are addressed in Chapter 2. A number of training algorithms currently proposed for training FNNs use linearizations. These linearizations transform the training problem to one of solving a system of linear equations. This allows the implementation of these training algorithms by any algorithm for solving the least squared error problem in linear systems. Existing training algorithms use the conventional MIL-WRLS algorithm for implementation. This algorithm takes the form of two recursions. The recursive equations are numerically unstable [7] and require initial conditions which are generally unknown. Also, in training the FNN, a forgetting factor [7] must be included. Implementation with MIL-WRLS limits the types of forgetting factors that can be used. All of these problems are dealt with effectively using QR-WRLS. First, QR-WRLS has been shown to be more efficient than the MIL-WRLS algorithm [4]. It is also numerically stable. QR-WRLS also allows different forgetting methods to be used. One of the proposed training algorithms that uses MIL-WRLS is implemented with QR-WRLS. Simulation results showed a marked increase in the algorithm performance. This result should be very beneficial in stimulating the use of QR-WRLS for other training algorithms implemented using MIL-WRLS. A new linearization training algorithm is also introduced in Chapter 2.

In order to further increase the rate of convergence, matrix perturbation theory has been used. Inputs to the network along with the desired response (together called training patterns) are used to train the FNN. In general, the network does not usually "learn" to correctly classify the training patterns the first time they are used. Neither is there any guarantee that the network will ever learn to classify all the training patterns correctly. However, in the training process the network usually correctly classifies more and more of the training patterns as they are used repeatedly. For this reason the training patterns are used many times. If there were a way to avoid using all the training patterns without changing the solution, training could be made more efficient. One method for doing this data reduction is presented in Chapter 3.

Simulation studies of the performance of QR-WRLS, of the new training algorithm, and of the data reduction algorithm are presented in Chapter 4. Chapter 5 contains the conclusions and suggests areas of future research.

1.2 The Feedforward Neural Network (FNN)

The basic units of the FNN are "neurons" or nodes. The model of the node to be used in this research involves a weighted sum of the inputs, and an output that is a non-linear function of this weighted input. Thus,

$$u = \sum_{j=1}^{N} w_j x_j \qquad (1.1)$$

where $u$ is the weighted input, $w_j$ is the weight for the $j$th input, and $x_j$ is the $j$th input. The output of the node is

$$y = S(u) \qquad (1.2)$$

where $y$ is the output of the node, and $S(\cdot)$ is a nonlinear mapping. A representation is shown in Figure 1.1.

Figure 1.1: A single node.

The non-linearity that will be used in this research is the sigmoid function

$$S(u) = \frac{1}{1 + e^{-u}}. \qquad (1.3)$$

$S(\cdot)$ has the following properties: it is continuous, has continuous derivatives of all orders, is strictly increasing, and has finite limits as the argument goes to $+\infty$ or $-\infty$. A graph of the function is shown in Figure 1.2.

Figure 1.2: Example of a sigmoid function.
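To make the node model concrete, the following is a minimal sketch of (1.1)-(1.3) and of a forward pass through the two-layer structure formalized in the next subsection. It is illustrative only: the names (`sigmoid`, `node_output`, `forward`, `W1`, `W2`) are not from the dissertation, and, consistent with the model above, no bias terms are used.

```python
import numpy as np

def sigmoid(u):
    """Sigmoid nonlinearity S(u) = 1 / (1 + e^-u) of (1.3)."""
    return 1.0 / (1.0 + np.exp(-u))

def node_output(w, x):
    """Output of a single node, (1.1)-(1.2): y = S(w^T x)."""
    return sigmoid(np.dot(w, x))

def forward(W1, W2, x):
    """Forward pass through a two-layer network (cf. Figure 1.3).

    W1 is N1 x N0 (hidden-layer weights), W2 is N2 x N1 (output-layer
    weights); x is the N0-vector of external inputs to layer zero.
    Returns the layer-one and layer-two outputs.
    """
    y_hidden = sigmoid(W1 @ x)      # outputs y'_j(n) of layer one
    y_out = sigmoid(W2 @ y_hidden)  # outputs y_k(n) of layer two
    return y_hidden, y_out
```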
The nodes are grouped into layers. For clarity, in this dissertation a two-layer network¹ is considered in the initial theoretical discussions, an example of which is illustrated in Figure 1.3. The generalization of the methods to an arbitrary number of layers follows the initial developments with two layers. Each node above the input layer in the FNN passes the sum of its weighted inputs through a non-linearity as in (1.2). The inputs to layer zero are external. The outputs of the last layer are the outputs of the network.

¹Some authors might choose to call this a three layer network. The bottom layer of "nodes" shall be designated as "layer zero" and not counted in the total number of layers. Layer zero is a set of linear nodes which simply pass the inputs unaltered. For this reason, circular nodes are not shown in the diagram.

Let us now formalize the network and define notation. The number of nodes in layer $i$ is denoted $N_i$, with $N_0$ indicating the number of input nodes at the bottom (input) of the network. The weights connecting to node $k$ ($k'$) of layer two (one) are held in the $N_1$-vector ($N_0$-vector) $w_k = [w_{k,1}, \ldots, w_{k,N_1}]^T$ ($w'_{k'} = [w'_{k',1}, \ldots, w'_{k',N_0}]^T$). The inputs to the nodes in all layers except the first are the outputs of the layer below. We denote by $N$ the number of training patterns

$$\left\{ \left( x(n) = [x_1(n), x_2(n), \ldots, x_{N_0}(n)]^T ;\; t(n) = [t_1(n), t_2(n), \ldots, t_{N_2}(n)]^T \right),\; n = 1, 2, \ldots, N \right\}, \qquad (1.4)$$

in which each $x_l(n)$ is the input to the $l$th node in layer zero, and $t_k(n)$ is the target output for node $k$ in the output layer (the output desired in response to the corresponding input). The computed outputs of layer two (one) in response to $x(n) = [x_1(n), \ldots, x_{N_0}(n)]^T$ are denoted $y(n) = [y_1(n), \ldots, y_{N_2}(n)]^T$ ($y'(n) = [y'_1(n), \ldots, y'_{N_1}(n)]^T$).

Figure 1.3: A two-layer network.

Vectors will be used when convenient, and scalars will be used when they simplify the presentation. Similarly to (1.1) we also define

$$u_k(n) \stackrel{\text{def}}{=} \sum_{j=1}^{N_1} w_{k,j}\, y'_j(n) = w_k^T y'(n), \qquad (1.5)$$

where $u_k(n)$ is the input to node $k$ in the output layer in response to pattern $n$; $u'_j(n)$ is similarly defined as the input to node $j$ in layer one.

1.3 Training Algorithms for FNNs

1.3.1 Performance Measures

There are many methods proposed for the training of FNNs. Training, in the most general sense, means to adjust the variable parameters of the network, the weights, so that the network can perform the desired task, be it pattern recognition or classification, a control problem, etc. The first problem encountered in training is quantifying how well the network is performing the desired task. This quantification is done with a performance measure. A performance measure is a function of the weights of the network. There are many in use. The most popular, and the one to be used in this research, is the sum of the squared errors between the desired outputs and the actual outputs. The training algorithm has the goal of adjusting the weights of the network so that the network performs well with respect to the performance measure used. Training algorithms can be divided into two overlapping groups. One group uses heuristics to search the weight space in order to optimize the performance measure, while the other group uses the performance measure in order to change the weights. Some training algorithms use both heuristics and the performance measure.
One example of a training algorithm that uses a heuristic search over the parameter space is presented by Sun and Grosky [8]. This algorithm uses random optimization and dynamic annealing to find a global minimum of the performance function. It does not compute the gradient of the performance function. As noted by Widrow [9], most algorithms that use the performance measure to change the weights can be further divided into two classes. The first class consists of algorithms which change the weights of the network to reduce the error between the desired response and the actual response. Algorithms in this class are called error-correction rules. Algorithms from the second class are called gradient rules. These change the network weights along the gradient of the performance measure to reduce the mean-squared error. An example of an error-correction rule is Rosenblatt's perceptron learning rule [10]. Training algorithms that use the gradient rule have become very common in recent research. There are many examples of these rules in the literature [11]. Because there are many similar algorithms, and many variations of the same algorithm, it would be very tedious to describe each. The following is a summary of the main gradient rule training algorithms, and those which are related to the present research.

The training problem for the two layer FNN is stated as follows: Given a set of $N$ training patterns as in (1.4), find the weights which minimize the sum of squared errors, say $E$, between the target outputs and the actual outputs,

$$E = \sum_{n=1}^{N} \sum_{k=1}^{N_2} \left( t_k(n) - y_k(n) \right)^2. \qquad (1.6)$$

Minimizing $E$ can also be viewed as the problem of finding a mapping from the input space, the space spanned by the $x(n)$'s, to the output space, the space spanned by the $y(n)$'s, so that $E$ will be minimum. It is the nonlinearity at the output of each node that gives the FNN the promise of solving previously intractable problems. However, it is also the nonlinearity that produces the problems in training. In order to appreciate some of the problems encountered in training, consider the differences of the network with and without the nonlinearities. Removing the non-linearities from the outputs of the nodes will result in a linear network with the same structure as the FNN. If the network is linear, a single layer can achieve the same mapping as a multi-layer network; thus there is no advantage in using more than one layer, e.g. [12, Ch. 2]. In the linear case, finding the weights that minimize $E$ is a well-known problem. Taking the partial derivative of $E$ with respect to the weights and setting this equal to zero results in a set of linear equations which can be solved for the desired weights. For non-trivial cases in which there are at least as many independent training patterns as weights, the weights found in this manner are unique. This implies that there is only one minimum of the function $E$ with respect to the weights.

Replacing the non-linearity at the output of each node changes many of the properties of the network. One difference is that one layer cannot map the inputs to all the outputs that a multilayer network can. As shown in [13], a one layer network, with the sigmoid replaced by the sgn function, can segment the input space into two regions by a hyper-plane. With multiple layers, more complex segmentation can be achieved. Multiple layers are thus useful, but pose more of a challenge to train than single layer networks.
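As an aside, the linear single-layer case can indeed be solved in one block computation. The following sketch, with randomly generated placeholder data and hypothetical dimensions, finds the unique least squares weights both by forming the normal equations explicitly and by an orthogonalization-based solver; both routes give the same answer.

```python
import numpy as np

# Placeholder data: N = 50 patterns, 4 inputs per pattern.
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 4))   # rows are the input vectors y'(n)
t = rng.standard_normal(50)        # target outputs t_k(n)

# Normal equations: (Y^T Y) w = Y^T t.
w_normal = np.linalg.solve(Y.T @ Y, Y.T @ t)

# Equivalent, numerically preferable route that never forms Y^T Y:
w_lstsq, *_ = np.linalg.lstsq(Y, t, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```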
With one layer, once a set of training patterns is given, the inputs and outputs of every node are known. With multiple layers, the hidden layer "inputs" and "outputs" are not known and must be computed by the training algorithm. Another difference between the linear network and the FNN is that there are generally multiple minima of the function $E$ with respect to the weights. The occurrence of multiple minima corresponds to the fact that the derivative of $E$ is nonlinear in the weights. As before, we desire the weights that are the zeros of the equations resulting from taking the derivative of $E$ with respect to the weights. Many methods for doing this appear in the literature. None of the methods, however, solves directly for the weights. Some of the most popular methods for solving these non-linear equations are the gradient rule algorithms.

All the gradient rule algorithms proceed in a similar manner. First, the partial derivative of $E$ with respect to the weights at one training pattern, $\frac{\partial E}{\partial w}$, is evaluated at the present weights. Some method, depending on the algorithm, is then used to determine how to change the weights in order to reduce $E$. This is repeated for each training pattern. The weights can be changed with each training pattern, or after all the training patterns have been presented, depending on the algorithm. This constitutes one iteration through the training patterns. Many iterations are usually needed to find the desired weights. Thus, $\frac{\partial E}{\partial w}$ is evaluated at the new weights and the next training pattern, and the weights are again changed to reduce $E$. This is repeated until the partial derivative of $E$ with respect to the weights is close to zero. The weights at which the algorithm stops are called the final weights.² Because there are multiple minima, the solution found may, or may not, be the global minimum. In practice many initial weights are tried and the weights giving the smallest $E$ are chosen.

Because the non-linear nature of the problem requires an iterative approach, finding a solution will generally be much slower than in the linear case, where the weights can be found in one iteration or with one block computation. Finding a solution can be too slow in some cases, such as when the FNN is applied to speech processing [14, Ch. 14]. New algorithms which produce a solution more rapidly than the techniques currently used would be of great value. This is the goal of the proposed research. Thus, some method to compare the "speed" of each algorithm is necessary.

²They are called final weights and not solution weights because these weights may not solve the proposed problem.

1.3.2 Training Efficiency

The speed at which an algorithm converges to a solution depends on two factors: the number of iterations and the complexity of each iteration. The complexity can be measured in terms of the number of operations (multiplications, additions, etc.) required. The number of iterations depends not only on the algorithm itself but on the initial weights of the system and the training data. The comparisons done in simulations to follow will eliminate these variables by using the same initial weights and the same training data for each algorithm. Different algorithms do not always converge to the same set of final weights due to the different ways in which a true gradient system is approximated. However, even though the final weights may be different, it is useful to know the rate at which the algorithm converges to them. One way of doing this involves an error function [15].
This will be a function of the difference between the final weights, $w^*$, and the weights at iteration $n$, $w(n)$:

$$e(n) = \| w(n) - w^* \|, \qquad (1.7)$$

where $\| \cdot \|$ is the Euclidean norm. The rate at which $e(n)$ goes to zero will indicate how fast the algorithm is converging to the final weights. If $e(n)$ can be written as a constant term multiplied by $e(n-1)$, $e(n) = K e(n-1)$, then the algorithm is said to converge linearly. If $e(n) = K e^2(n-1)$, then it is said to converge quadratically. In most cases a quadratically converging algorithm converges faster than a linearly converging one [15]. Of course not all algorithms converge linearly or quadratically, but it is useful to compare new algorithms with those known to converge linearly or quadratically.

1.3.3 The Back-Propagation Algorithm

The most prominent method for training FNNs is the back-propagation (BP) algorithm [17], [16]. Let $w(n)$ be any network weight at iteration $n$. Given a training pattern, it is possible to evaluate $\frac{\partial E}{\partial w(n)}$ for each weight in the network. A positive value of $\frac{\partial E}{\partial w(n)}$ indicates that reducing $w(n)$ will reduce $E$; conversely, a negative value of $\frac{\partial E}{\partial w(n)}$ indicates that an increase in $w(n)$ will reduce $E$. The back-propagation algorithm updates each weight at iteration $n$ using the equation

$$w(n) = w(n-1) - \varepsilon \frac{\partial E}{\partial w(n-1)}, \qquad (1.8)$$

where $\varepsilon$ controls the rate of convergence, $0 < \varepsilon < 1$. $\frac{\partial E}{\partial w}$ for weights in layer $L-1$ depends on the weights in layer $L$. This is true for all layers: $\frac{\partial E}{\partial w}$ for any weight in the network depends on the weights in the layers above. The algorithm operates as follows: $\frac{\partial E}{\partial w}$ is first evaluated and used to update each weight in the last layer. $\frac{\partial E}{\partial w}$ is then evaluated for the weights in layer $L-1$ using the updated weight estimates of layer $L$. In similar fashion, the weights in the remaining layers are updated. The process is then repeated with a new training pattern. The algorithm stops when the change to the weights is below a set threshold.

Assume the two layer network described above. The partial derivative with respect to a weight in the last layer, $w_{k,j}(n)$, for training pattern $n$ is, from (1.5) and (1.6),

$$\frac{\partial E}{\partial w_{k,j}(n)} = -2\left(t_k(n) - y_k(n)\right) \frac{\partial y_k(n)}{\partial u_k(n)}\, y'_j(n). \qquad (1.9)$$

If $w'_{j,p}(n)$ is a weight associated with a node in the first layer, then for training pattern $n$

$$\frac{\partial E}{\partial w'_{j,p}(n)} = -2 \sum_{k=1}^{N_2} \left(t_k(n) - y_k(n)\right) \frac{\partial y_k(n)}{\partial u_k(n)}\, w_{k,j}\, \frac{\partial y'_j(n)}{\partial u'_j(n)}\, x_p(n). \qquad (1.10)$$

As seen in (1.9), the partial derivative with respect to a weight in the last layer is a product of the error of node $k$, $t_k(n) - y_k(n)$, the derivative of the output of the node with respect to its input, and the node input. The partial derivative with respect to a weight in the first layer is similar, as seen in (1.10), with the error term replaced by the term $\sum_{k=1}^{N_2} \left(t_k(n) - y_k(n)\right) \frac{\partial y_k(n)}{\partial u_k(n)} w_{k,j}$. This term can be interpreted as the "error" of node $j$ in the first layer, "back-propagated" through the upper layer.

It is clear from this analysis that one disadvantage of the BP algorithm is that by increasing the number of layers, the computational complexity is increased. For example, going from one layer to two more than doubles the complexity. For a network with $M$ nodes in each layer, computing $\frac{\partial E}{\partial w}$ for a weight in the last layer takes three multiplications and two subtractions. Calculating $\frac{\partial E}{\partial w}$ for a weight in the next to last layer takes $3M+3$ multiplications and $2M+1$ subtractions. For a network with $L$ layers and $M$ nodes per layer, this algorithm requires $O(3ML)$ flops (a flop is one multiplication and one addition) per iteration.
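For concreteness, the following is a minimal sketch of one pattern-by-pattern BP update for the two-layer network, implementing (1.8)-(1.10). The function names and the use of NumPy are illustrative, and $S'(u) = S(u)(1 - S(u))$ is used for the sigmoid of (1.3).

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def bp_step(W1, W2, x, t, eps):
    """One pattern-by-pattern BP update per (1.8)-(1.10).

    W1: N1 x N0, W2: N2 x N1, x: N0 input, t: N2 target, eps: step size.
    Returns updated copies of (W1, W2).
    """
    y1 = sigmoid(W1 @ x)           # hidden outputs y'_j(n)
    y2 = sigmoid(W2 @ y1)          # network outputs y_k(n)
    err = t - y2                   # t_k(n) - y_k(n)
    # Output layer: dE/dw_kj = -2 (t_k - y_k) S'(u_k) y'_j, per (1.9).
    delta2 = -2.0 * err * y2 * (1.0 - y2)        # N2-vector
    grad2 = np.outer(delta2, y1)                 # N2 x N1
    # Hidden layer: back-propagated error through W2, per (1.10).
    delta1 = (W2.T @ delta2) * y1 * (1.0 - y1)   # N1-vector
    grad1 = np.outer(delta1, x)                  # N1 x N0
    return W1 - eps * grad1, W2 - eps * grad2
```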
Although BP as described by Rumelhart et al. [12], Parker [16], and Werbos [17] is not guaranteed to converge to a local minimum, changing the algorithm to use integration [18] does guarantee this convergence. This is a gradient system, and given any initial weights, will converge to a local minimum. Another disadvantage of BP is that it converges linearly to a minimum. Linear convergence, along with the high complexity, prohibits real time training in many applications.

1.3.4 Techniques for Improving Convergence

Many new techniques have been presented to improve the speed of convergence [6], [19], [20], [21]. One method has been to implement algorithms that use second order derivatives. Parker, who was among the first to derive the BP equations, later turned to second order algorithms to speed up convergence [22]. Many such techniques include some form of linearization [6], [19], [23]. If the weights are only changed a small amount during each iteration, linearizing the equation for $E$ with respect to the weights around the present weights is justified for each iteration. Thus the linearization must be performed every iteration. One technique using linearization, the Azimi-Sadjadi et al. [6] algorithm (A-S algorithm), uses linearized input, $\tilde{y}_j(n)$, and linearized output, $\bar{t}_k(n)$, target values. The equation to minimize is thus changed from (1.6) to

$$\bar{E} = \sum_{n=1}^{N} \sum_{k=1}^{N_2} \left( \bar{t}_k(n) - \sum_{j=1}^{N_1} w_{k,j}\, \tilde{y}_j(n) \right)^2. \qquad (1.11)$$

In order to use equations of this form, the inputs and target outputs of every node must be known. Only the inputs to the first layer and targets for the last layer are given as training patterns; the other inputs and outputs must be calculated. The algorithm proceeds as follows: the input to the first layer, given by one training pattern $n$, is used to calculate the output of each layer at the present weights. The linearized input and output values are then calculated for the last layer. These values are used to update the second layer weights using the MIL-WRLS algorithm. The targets for the first layer, say $t'_j(n)$, are calculated using a "back-propagated" error, similar to that used in the BP algorithm. These target outputs are used along with the training pattern inputs to calculate the linearized input and output target values for the first layer. As in the second layer, the weights of the first layer are updated using these linearized values and MIL-WRLS. The procedure is repeated for successive training pairs until the weight change is less than a preset threshold.
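For reference, the following is a minimal sketch of the exponentially weighted RLS recursion built on the matrix inversion lemma, the kind of MIL-WRLS update the A-S algorithm relies on. The class name and the initialization constant `delta` are illustrative; the need to choose $P(0)$, which is generally unknown, is one of the initialization problems discussed in the sequel.

```python
import numpy as np

class MILWRLS:
    """Exponentially weighted RLS via the matrix inversion lemma.

    Minimizes sum_n lam^(N-n) (t(n) - w^T x(n))^2, where lam is the
    forgetting factor.  P(0) = I/delta is an assumed initial inverse
    correlation matrix.
    """
    def __init__(self, dim, lam=0.98, delta=1e-3):
        self.w = np.zeros(dim)
        self.P = np.eye(dim) / delta
        self.lam = lam

    def update(self, x, t):
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)             # gain vector
        self.w = self.w + k * (t - x @ self.w)   # correct by a priori error
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.w
```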
Additionally, all the weights connected to a single node can be updated simultaneously instead of sequentially as in BP. In their research simulations using a two layer network, the number of iterations for this algorithm was an order of magnitude less than BP. Another disadvantage of this algorithm is that it uses recursive equations. The values to initialize these equations are not known practically. Also, previous equations used to update the weights decay exponentially, thus old linearized training patterns that are not useful for, and in fact may hinder, the updating of weights are still included in the equations. Kollias and Anastassiou [23] also describe a recursive linearized algorithm. Their method is based on the Marquardt-Levenberg least squares optimization method. This method is designed to solve an approximation to the linear system H Aw = —f (1.14) where H is the hessian matrix containing the second derivatives of the error function 17 E with respect to the weights, and f is a vector consisting of the first derivatives of E with respect to the weights. This is basically Newton’s method for determining the zeros of a non-linear equation in matrix form. Newton’s method is known to converge quadratically [15] for initial weights close to a minimum. However, Newton’s method can have convergence problems for poor initial weights. The Marquardt-Levenberg technique uses an approximation to H. This allows the algorithm to converge for relatively poor initial weights and also retains the quadratic convergence when close to a minimum. This algorithm also uses back-propagation to determine the targets for the inner layers. According to Kollias and Anastassiou the advantage of their algorithm is that it converges quadratically when close to a minimum. One disadvantage of the algorithm is that, like A-S, back-propagation is used for the inner layer training patterns. Also like A-S, it uses recursive equations which suffer from problems previously mentioned. Thus, this equation also has 0(3M2) flops per node per iteration. For a network with L layers and M nodes per layer, this algorithm requires 0(3M3L) per iteration in addition to 0(3ML) for BP. Steck, et.al [24] implement this technique and show that with parallel implementation, the additional computational requirements can be effectively reduced. Another linearization algorithm, very similar to the Kollias and Anastassiou algo- rithm is presented by Ghiselli-Crippa and El-Jaroudi [25]. This algorithm also uses an approach based on an approximation of the Hessian matrix. The weight updating equations are: w(i + 1) = w(i) — aH‘1(i)F(i) (1.15) where i is the iteration index, a is the step size, to is the vector of all the network weights, F is the gradient of the error E with respect to the network weights, and H is the approximation to the Hessian matrix. To simplify the Hessian matrix, the weights connected to each node are assumed to be independent. Simulations were 18 Table 1.1: Complexities per iteration and normalized complexities of FN N training algorithms. M is the number of nodes per layer. L is the number of layers. BP A-S (QR-WRLS) Layer-wise N etwork-wise complexity 0(3ML) 0(2.5M3L + 3M1) 0(2.5M2L + 3ML) C)(2.5M2LL2 + 3ME) normalized complexity 00 389.98 107.6 781.79 performed, training a FN N to classify speech as voiced, unvoiced, or silence. The error using this F NN for the classification was lower than that of a statistical decision classifier. 
1.3.5 Complexity Measures All of these linearization algorithms are more complex computationally than BP. Still, they offer faster convergence in the simulation examples presented by the re- spective authors. This implies that computational complexity is not a good method of judging the speed of an algorithm. To compare the speed of different algorithms we will use a “normalized” complexity, the complexity of the algorithm multiplied by a factor determined from the simulation studies in Chapter 4. The lower the normal- ized complexity, the faster the algorithm. Simulation studies were done comparing the algorithm developed in this dissertation with BP and the A-S algorithm. The simulations from the training of a 4-bit parity checker were used to determine the normalized complexities. The normalized complexity of the BP algorithm is shown to be 00 since it failed to find the solution even once. Table 1.1 shows the relative complexities and the normalized complexities for these algorithms. The new layer- wise algorithm developed here has a clear advantage in speed over both BP and the A-S algorithm. 19 Chapter 2 New Training Algorithms 2.1 Introduction This chapter is divided into two main sections, the first of which introduces a new class of linearized training algorithms for the FN N. Because of the nonlinearities present in the FN N , all gradient descent learning al- gorithms perform some form of linearization around the present weights. The learning algorithms presented here are purely “linear” in the sense that the matrix vector equa- tion to be solved has been transformed into the equation of a linear system. Accord- ingly, unlike other popular training algorithms that are not in this form, this linear algorithm and its potential variants will benefit from the well-understood theoretical properties of RLS and VLSI architectures for its implementation. The second section describes implementation of these algorithms with QR-WRLS, a contemporary version of the conventional recursive least squares algorithm. 2.2 Linearized Training Algorithms As described in the previous chapter, all training algorithms solve for the desired weights using several iterations through the training data. The gradient descent algorithms form a linear approximation to the error surface E at the present weights, 20 and change the weights in the direction which reduces the total error. The manner in which this approximation is made, and how it is implementated, account for the differences among the training algorithms. As can easily be seen from (1.6), the error is a function of all the weights. Typically the approximation to the error surface at a given iteration is done by taking the derivative of E with respect to a subset of all the weights and based on this gradient changing this subset to reduce E. This process is repeated for another subset until all the weights have been changed to reduce E. The most popular method of F N N training - BP -— uses only one weight at a time for this approximation. In order to improve the rate of convergence and the final solution, newer techniques use more of the network weights, typically all the weights connected to one node, simultaneously to form an approximation to E at each iteration. Since E is a function of all the weights, it would seem better to simultaneously use as many of the weights as possible to reduce E. 
The class of algorithms presented here represents an improvement in the sense that the weights for an entire layer can be used to form an approximation to E and are updated simultaneously. Also in the special case that there is only one output, all the weights of the network can be updated simultaneously. The theoretical development of the algorithm is first given for the case where all weights connected to a given node are updated simultaneously. This will be expanded to the case in which all the weights in the same layer are updated simultaneously. Finally the case in which all the weights in the network are updated simultaneously is presented. To simplify the presentation of each case, the development is done first for a two layer network, then for a general multilayer F NN. 2.2.1 Node-wise Weight Updating 2.2.1.1 Two Layer Network FirSt We wish to concentrate on the training of the weights in the final layer. Before 21 continuing, we note a simple fact which will reduce the number of details in our discussion. It is easy to show from (1.6) that if only the weights connected to output node k are allowed to change, with the other weights in the network held constant, then E is minimized by minimizing only the errors associated with this node, say Ek. Let us write E). in a form which explicitly features the weights allowed to change, N E. = 214(n1- 5(wfv’(n))]’- (2.1) n=l Algorithms for finding the optimal solution, say mg, to this problem are well-known if the modeled output depends only upon a linear combination of the inputs us- ing pattern-invariant (constant) weights. In the linear case yk(n) = S(wky’(n)) = flwky’ (n), for some constant 6 (which can be taken as unity without loss of general- ity), and the error expression takes the form N El: = Elna) - wa’(n)]2. (2-2) n=1 The solution in the linear case is the solution to the classical linear least squares “normal equations” [7]. The solution of the normal equations can proceed in a variety of ways. It is also possible to arrive at the solution without explicitly forming the normal equations. This is the case, for example, when using the least mean square (LMS) (e.g. see [26] or [27]) algorithm, a recursive solution which amounts to “back- propagation” for a linear network. A second popular method is the conventional MIL- WRLS algorithm. A contemporary version of the latter will serve as a computational basis for the algorithm to be described in this dissertation, and MIL-WRLS is also the basis for the A-S algorithm to which we wish to relate the method of this dissertation. Appropriate description and formalism will be introduced as needed. It is well-known that least squares estimation problems may be discussed in terms of their error surfaces, in this case the graph of E1. as a function of 10].. Whatever 22 the form of the least square estimation algorithm, the ideal goal is to find the weight vector, say 111;, corresponding to the global minimum of Ek(wk). It is important to fu- ture developments to note that E), depends not only on wk but also upon the training patterns {(z(n),tk(n)), n 6 [1,N]} (see (2.1)). (In fact, since we have “frozen” the weights in the first layer, it is more appropriate in this case to view E), as a function of 10;, and the pairs {(y’ (n), tk(n)), n 6 [1, N ]}) Once the training patterns are fixed, the error function may be described as a surface over the Nl-dimensional hyperplane corresponding to the weights. 
Theoretically, the pairs {(y’(n), tk(n)), n 6 [1, N ]} rep- resent partial realizations of a two-dimensional stochastic process which generates them. In this sense Ek(wka{(yl(n)atk(n))1n 6 [1,N]}) (2-3) is only a sample error surface. In a pure sense, we would like to find weights corre- sponding to the global minimum of 8 {Ek(wk)} where 8 denotes the expected value. We must be content, however, to work with the sample surface provided by the train- ing data. The point of the discussion above is to note that different algorithms construct and use different sample error surfaces from the data. With LMS (or back-propagation), error surfaces are sequentially constructed from individual training patterns, i.e., error surfaces of the form Ek(wk, [y’(n), tk(n)]), n = l, 2, . . . , N (2.4) are created, and for each n, the weights are moved in the direction of the negative gradient on that surface. The convergence properties are well-understood. MIL- WRLS3, on the other hand, creates sequentially more refined error surfaces of the 3Of course, here we are speaking of a linear model identification. 23 form Ek(wki{(y’(j)itk(j))aj E [linlll (25) as n is incremented. At each step, if a weight update is computed, the solution corresponds to the unique minimum of the newly refined surface. We can appreciate, therefore, that even if we neglect nonlinearities, the estimation processes behave quite differently with respect to their error surface analysis. The linearization technique adopted in this work can be explained in terms of the error surface analysis. The error surface over which we would like to find the (global) minimum by choice of weights is given by (2.1). Suppose we wish to construct a “linearized” error surface, say Ek, which is “similar” in some sense to E, in a neighborhood of the present weights. Recalling that E]. is a function not only of the weights, but also of the training patterns, the fundamental question is: Can the pairs {(y’(n),t(n)),n E [1,N]} be modified in some sense, say (y’(n),t(n)) —+ (371(17): f1:01)), so that N Ek(wk,{(§ll(n),t1(n)),n E [1, N]}) = 2:151:01) - Wish”)? (2-6) N % Ek(w,., {(y’(n),tk(n)),n E [1,N]}) = 2:161:01) - ~‘>’(wfv’(n))l2 in some neighborhood of the present weights? The answer to this question is the key theoretical development described in the following paragraphs. In the ensuing discussion, the notation w; will be used to designate a local mini- mum of Eh. Ideally, w; will be the global minimum, but we have no way of assuring this. The objective is to find, by means of a “linear” algorithm, a close approximation to 10;. The algorithm to be described proceeds in iterations, indexed by i = 1, 2, . . .. Each iteration represents one complete training cycle through the N training patterns. 24 Suppose that a weight vector estimate w),(i — 1) results from iteration i — 1. In iteration i, by manipulation of the data, we work with a “linearized” error surface, Ek, which is similar to the nonlinear surface in the neighborhood of wk(i — 1). The similarity follows from two criteria: 1- Ek(wk(i-1)1{(ylc(n)1t-k(n))!n e [1,N]}) = Ek(wk(i-1)1{(ylc(n)1tk(n))1n E [1,N]}); 2 33.] = 31.1] ' aw). wk=wk(§_1) 310:: wk=wk(i-1)' The first task is to manipulate the pairs {(yj‘(n),t—k(n)),n E [1,N]} so that these criteria hold. This is accomplished as follows. 
It follows from Criterion 1 that

$$\sum_{n=1}^{N} \left( t_k(n) - y_k(n) \right)^2 = \sum_{n=1}^{N} \left( \bar{t}_k(n) - w_k^T(i-1)\, \tilde{y}'(n) \right)^2. \qquad (2.7)$$

By letting, for each $n$,

$$\bar{t}_k(n) - w_k^T(i-1)\, \tilde{y}'(n) = t_k(n) - y_k(n), \qquad (2.8)$$

that is,

$$\bar{t}_k(n) = \left( t_k(n) - y_k(n) \right) + w_k^T(i-1)\, \tilde{y}'(n), \qquad (2.9)$$

Criterion 1 is met. Now we take the partial derivatives required in Criterion 2. For the "nonlinear" error,

$$\left. \frac{\partial E_k}{\partial w_k} \right|_{w_k = w_k(i-1)} = -2 \sum_{n=1}^{N} \left( t_k(n) - y_k(n) \right) S'(u_k(n))\, y'(n), \qquad (2.10)$$

where $S'(u_k(n))$ denotes the derivative of $S(\cdot)$ evaluated at $u_k(n) = w_k^T(i-1)\, y'(n)$. For the linearized error, using (2.8),

$$\left. \frac{\partial \bar{E}_k}{\partial w_k} \right|_{w_k = w_k(i-1)} = -2 \sum_{n=1}^{N} \left( \bar{t}_k(n) - w_k^T(i-1)\, \tilde{y}'(n) \right) \tilde{y}'(n) \qquad (2.11)$$

$$= -2 \sum_{n=1}^{N} \left( t_k(n) - y_k(n) \right) \tilde{y}'(n). \qquad (2.12)$$

Comparing (2.10) and (2.12), Criterion 2 is satisfied if, for each $n$,

$$\tilde{y}'(n) = S'(u_k(n))\, y'(n). \qquad (2.13)$$

Now consider the weights associated with the $j$th node in the first layer of the two layer network, $w'_j$. Taking the derivative of $E$ with respect to an element $w'_{j,l}$ of $w'_j$ gives

$$\frac{\partial E}{\partial w'_{j,l}} = -2 \sum_{n=1}^{N} \sum_{k=1}^{N_2} \left( t_k(n) - y_k(n) \right) S'(u_k(n))\, w_{k,j}(i)\, S'(u'_j(n))\, x_l(n), \qquad (2.14)$$

where $w_{k,j}(i)$ denotes the $j$th element in weight vector $w_k(i)$ (the weight on the connection from node $j$ in layer one to node $k$ in layer two). This expression can be written

$$\frac{\partial E}{\partial w'_j} = -2 \sum_{n=1}^{N} \left( t'_j(n) - y'_j(n) \right) S'\!\left( [w'_j]^T x(n) \right) x(n), \qquad (2.15)$$

where $t'_j(n)$ is called the target value for inner node $j$ and is defined such that

$$\left( t'_j(n) - y'_j(n) \right) = \sum_{k=1}^{N_2} \left( t_k(n) - y_k(n) \right) S'(u_k(n))\, w_{k,j}(i). \qquad (2.16)$$

The quantity on the right side of (2.16) is commonly called the back-propagated error for node $j$. The solution sought, say $w_j'^*$, is one for which

$$\left. \frac{\partial E}{\partial w'_j} \right|_{w'_j = w_j'^*} = 0. \qquad (2.17)$$

In the top layer, for node $k$ we sought $w_k^*$ such that

$$\left. \frac{\partial E_k}{\partial w_k} \right|_{w_k = w_k^*} = 0. \qquad (2.18)$$

With reference to (2.10), it is clear that the present optimization problem is equivalent to the ones encountered at the upper nodes. In particular, the same linearization considerations can be applied to obtain modified input and target values, say

$$\left( \tilde{x}_j(n),\ \bar{t}'_j(n) \right), \qquad (2.19)$$

and the set of layer one weights $w'_j(i)$ computed accordingly for each $j$.

2.2.1.2 L Layer Network

This algorithm can be extended in a straightforward, though notationally cumbersome, manner to any number of layers. Here we will use parenthetical superscripts instead of primes to indicate the layer. In order to simplify notation, the dependence on the iteration number will be dropped. Thus, $y_j^{(l)}$ will be used to denote the output of node $j$ in layer $l$. Similarly, $w_{j,p}^{(l)}$ will be used to denote the weight connecting node $p$ in layer $l-1$ to node $j$ in layer $l$. Assuming a network with $L$ layers, let us update the weights connected to node $i$ in layer $l$, $w_i^{(l)}$, $l \in [1, 2, \ldots, L]$. We have already presented the cases of $l = L$ and $l = L-1$. Taking the derivative of $E$ with respect to $w_i^{(l)}$, the error is back-propagated through layers $L, L-1, \ldots, l+1$:

$$\frac{\partial E}{\partial w_i^{(l)}} = -2 \sum_{n=1}^{N} \sum_{k=1}^{N_L} \left( t_k(n) - y_k^{(L)}(n) \right) S'(u_k^{(L)}(n)) \sum_{j=1}^{N_{L-1}} w_{k,j}^{(L)}\, S'(u_j^{(L-1)}(n)) \cdots \sum_{m=1}^{N_{l+1}} w_{m,i}^{(l+1)}\, S'(u_i^{(l)}(n))\, y^{(l-1)}(n). \qquad (2.20)$$

Similarly to (2.15), this expression can be written

$$\frac{\partial E}{\partial w_i^{(l)}} = -2 \sum_{n=1}^{N} \left( t_i^{(l)}(n) - y_i^{(l)}(n) \right) S'\!\left( [w_i^{(l)}]^T y^{(l-1)}(n) \right) y^{(l-1)}(n), \qquad (2.21)$$

where $t_i^{(l)}$ is the target value for node $i$ in layer $l$, and is defined by

$$\left( t_i^{(l)}(n) - y_i^{(l)}(n) \right) = \sum_{k=1}^{N_L} \left( t_k(n) - y_k^{(L)}(n) \right) S'(u_k^{(L)}(n)) \sum_{j=1}^{N_{L-1}} w_{k,j}^{(L)}\, S'(u_j^{(L-1)}(n)) \cdots \sum_{m=1}^{N_{l+1}} w_{m,i}^{(l+1)}\, S'(u_i^{(l)}(n)). \qquad (2.22)$$

Again the optimization problem is equivalent to the ones encountered at the upper nodes.

Before continuing, let us note the relationship to the A-S algorithm mentioned above. In fact, to this point in the discussion, the methods are nearly equivalent, though derived from different starting points. The A-S algorithm proceeds by replacing the nonlinearity $S(\cdot)$ for the node to be updated by a linear approximation, say $\tilde{S}(\cdot)$, consisting of the first two terms of a Taylor series around the "present" value of the node's input. For example, suppose the $k$th output node is to be linearized with respect to the $n$th training pattern. Let $\hat{w}_{k,j}$ denote the present value of weight $w_{k,j}$. Then,

$$S(u) \approx \tilde{S}(u) = S'\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) \left( u - \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) + S\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right)$$
$$= S'\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) u + \left[ S\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) - S'\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right]$$
$$\stackrel{\text{def}}{=} K_k(n)\, u + b_k(n). \qquad (2.23)$$

Azimi-Sadjadi et al. [6] recognized that by using this approximation in (2.10), the optimization problem became equivalent to a set of linear least squares error normal equations if the data were modified according to (2.9) and (2.13). Therefore, by quite different means, the theoretical developments arrive at the same set of linear equations to be solved.
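As an illustration of the node-wise procedure, the following sketch forms the linearized pairs of (2.9) and (2.13) for one output node over all $N$ patterns. The names are hypothetical, and the batch least squares call in the trailing comment stands in for the MIL-WRLS or QR-WRLS recursions actually used in the dissertation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def linearize_output_node(w_k, Yp, t_k):
    """Linearized training pairs for output node k, per (2.9) and (2.13).

    w_k : N1-vector, present weights w_k(i-1).
    Yp  : N x N1 matrix whose rows are the hidden outputs y'(n).
    t_k : N-vector of targets t_k(n).
    Returns (Y_tilde, t_bar) such that min_w || t_bar - Y_tilde w ||^2
    matches E_k and its gradient at w_k(i-1).
    """
    u = Yp @ w_k                    # node inputs u_k(n)
    y = sigmoid(u)                  # node outputs y_k(n)
    Sp = y * (1.0 - y)              # S'(u_k(n)) for the sigmoid
    Y_tilde = Sp[:, None] * Yp      # y~'(n) = S'(u_k(n)) y'(n), (2.13)
    t_bar = (t_k - y) + Y_tilde @ w_k   # t-bar_k(n), (2.9)
    return Y_tilde, t_bar

# One node-wise iteration, with batch least squares standing in for
# the recursive solver:
#   Yt, tb = linearize_output_node(w_k, Yp, t_k)
#   w_k, *_ = np.linalg.lstsq(Yt, tb, rcond=None)
```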
Azimi-Sadjadi et al. [6] recognized that by using this approximation in (2.10), the optimization problem became equivalent to a set of linear least square error normal equations if the data were modified according to (2.9) and (2.13). Therefore, by quite different means, the theoretical developments arrive at the same set of linear equations to be solved.

2.2.2 Layer-wise Weight Updating

2.2.2.1 Two Layer Network

The method used above, finding new linearized training patterns to update the network weights, can be extended to update all the weights of one layer simultaneously. In the previous section, $E$ and $\bar{E}$ were used directly to calculate the linearized training patterns. This is useful from a theoretical point of view, since it shows exactly why the linearized training patterns can be used to calculate the gradient of the system, but it is cumbersome notationally. As noted in the previous section, the linearized error surface $\bar{E}$ can be constructed by replacing the nonlinearity $S(\cdot)$ by a linear approximation. Because using this approximation simplifies notation and is equivalent to using $E$ and $\bar{E}$ directly, in this section we will use this latter approach for calculating the linearized training patterns.

Consider again the general $L$ layer network. Suppose that the weights connected to one or more nodes in layer $l$ are to be updated simultaneously⁴. This may include as few as one, and as many as all, nodes in layer $l$. Denote the set of such selected nodes in layer $l$ by $\mathcal{N}$. Denote by $\mathcal{M}$ the set of all nodes above layer $l$ to which any node in $\mathcal{N}$ is connected, directly or indirectly. Let all weights not connected to nodes in $\mathcal{N}$ and $\mathcal{M}$ be fixed at present values⁵. As shown above, the linearized error surface is constructed by replacing the nonlinearity $S(\cdot)$ for each node in $\mathcal{N}$ and $\mathcal{M}$ by the linear approximation $\bar{S}(\cdot)$. In fact, any node not in $\mathcal{N}$ or $\mathcal{M}$ may also be linearized with no effect on the solution. Therefore, we may assume without loss of generality that the entire network is "linearized," even if only a portion of the weights is to be updated.

⁴If any weight connected to a node is to be updated, then every weight connected to that node must be updated. This "constraint" is ordinarily beneficial, since it implies the ability to simultaneously update more than one weight.

⁵In certain cases it is possible to update weights in different layers simultaneously. We discuss one case at the end of this section.

Let us now return to the two layer network. Suppose we wish to update all weights in the output layer simultaneously. We must linearize all output nodes (and may arbitrarily linearize any other nodes). For node $k$ in the output layer, the output in response to input $n$ is computed as in (1.2), which we repeat here for convenience:

$$y_k(n) = S(u_k) = S\!\left(\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\right). \qquad (2.24)$$

As before, we let $\bar{y}_k(n)$ represent the output of node $k$ after $S(u_k(n))$ has been replaced by $\bar{S}(u_k(n)) = K_k(n)\, u_k(n) + b_k(n)$. Accordingly,

$$\bar{y}_k(n) = K_k(n)\left[\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\right] + b_k(n) \quad \text{or} \quad z_k(n) = K_k(n)\left[\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\right], \qquad (2.25)$$

with

$$z_k(n) \stackrel{\Delta}{=} \bar{y}_k(n) - b_k(n). \qquad (2.26)$$

We speak of the rightmost form in (2.25) as descriptive of a linearized node because the output is a purely linear combination of the inputs to the node. The network with all appropriate nodes linearized will be called the linearized network.
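The Taylor coefficients of (2.23) are easy to compute and to verify. The sketch below, again assuming a logistic sigmoid and using our own helper names, computes $K_k(n)$ and $b_k(n)$ and checks that the linearized output agrees with the original output at the present weights, which is the property exploited in the next paragraphs.

```python
import numpy as np

def S(u):
    return 1.0 / (1.0 + np.exp(-u))

def S_prime(u):
    s = S(u)
    return s * (1.0 - s)

def taylor_coeffs(wk, y_prime_n):
    """K_k(n) and b_k(n) of (2.23): first-order Taylor expansion of S
    around the present node input u_k(n) = wk^T y'(n)."""
    u = wk @ y_prime_n
    K = S_prime(u)
    b = S(u) - K * u
    return K, b

# At the present weights the linearized output equals the original one.
wk = np.array([0.3, -0.7])
yp = np.array([0.5, 1.0])
K, b = taylor_coeffs(wk, yp)
assert np.isclose(K * (wk @ yp) + b, S(wk @ yp))
```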
Since $\bar{y}_k(n) = y_k(n)$ at the present weights, the error at the $k$th node will be the same for the linearized and original networks if the target value for $z_k(n)$, say $\bar{t}_k(n)$, is taken to be

$$\bar{t}_k(n) \stackrel{\Delta}{=} t_k(n) - b_k(n) \qquad (2.27)$$

and the "linearized" inputs to node $k$ at pattern $n$ are

$$\bar{x}_{k,j}(n) \stackrel{\Delta}{=} K_k(n)\, y'_j(n), \qquad j = 1, 2, \ldots, N_1. \qquad (2.28)$$

Note that the linearized inputs are dependent upon $k$, so that we have effectively increased the number of training pairs by a factor of $N_2$.

The problem has effectively been reduced to one of estimating weights for a single-layer linear network. In order to simultaneously update all the weights in the output layer, the system of $N \times N_2$ equations

$$\bar{t}_k(n) = \sum_{j=1}^{N_1} \bar{x}_{k,j}(n)\, w_{k,j}, \qquad k = 1, 2, \ldots, N_2, \quad n = 1, 2, \ldots, N \qquad (2.29)$$

must be solved for the least square estimate of the $N_1 \times N_2$ weights $w_{k,j}$, $k \in [1,N_2]$, $j \in [1,N_1]$. However, since all weights in the hidden layer are fixed, the outputs $y'_j(n)$ are independent of $k$. This means that the equations indexed by different values of $k$ are independent of one another, and the sets of weights connected to different outputs may be updated independently. In the output layer, therefore, there is no theoretical difference between layer-wise and node-wise updating. To prove this we need to show that the $\bar{t}_k(n)$ of (2.9) and $\bar{y}'_k(n)$ of (2.13) are the same as the $\bar{t}_k(n)$ and $\bar{x}_{k,j}(n)$ of (2.29). Substituting for $K_k(n)$ in (2.28) gives

$$\bar{x}_{k,j}(n) = S'(u_k(n))\, y'_j(n) = S'\bigl(w_k^T\, y'(n)\bigr)\, y'_j(n), \qquad j = 1, 2, \ldots, N_1. \qquad (2.30)$$

In vector form this equation is

$$\bar{x}_k(n) = S'\bigl(w_k^T\, y'(n)\bigr)\, y'(n) = \bar{y}'_k(n). \qquad (2.31)$$

Substituting for $b_k$ in (2.27) we get

$$\bar{t}_k(n) = t_k(n) - \left(y_k(n) - K_k(n)\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\right) \qquad (2.32)$$
$$= \bigl(t_k(n) - y_k(n)\bigr) + \sum_{j=1}^{N_1} w_{k,j}\, K_k(n)\, y'_j(n)$$
$$= \bigl(t_k(n) - y_k(n)\bigr) + w_k^T\, K_k(n)\, y'(n)$$
$$= \bigl(t_k(n) - y_k(n)\bigr) + w_k^T\, \bar{y}'_k(n),$$

which is equivalent to (2.9). Thus the linearized inputs and outputs of the layer-wise and node-wise weight updating are the same for the last layer. This is not true at lower layers, however, as we now show for the hidden layer of the present network.

In order to update all weights in the hidden layer simultaneously, the weights in the output layer are fixed and all nodes in the network must be linearized. The outputs of the hidden layer with $S(\cdot)$ replaced by $\bar{S}(\cdot)$ are given by

$$\bar{y}'_j(n) = K'_j(n)\left[\sum_{p=1}^{N_0} w'_{j,p}\, x_p(n)\right] + b'_j(n), \qquad j = 1, 2, \ldots, N_1. \qquad (2.33)$$

Substituting (2.33) into the leftmost expression in (2.25) results in

$$\bar{y}_k(n) - \left[\sum_{j=1}^{N_1} K_k(n)\, w_{k,j}\, b'_j(n) + b_k(n)\right] = \sum_{j=1}^{N_1} \sum_{p=1}^{N_0} \bigl[K_k(n)\, w_{k,j}\, K'_j(n)\, x_p(n)\bigr]\, w'_{j,p}. \qquad (2.34)$$

As above, we can now view the problem as one of training a single-layer linear mapping with target outputs

$$\bar{t}_k(n) = t_k(n) - \left[\sum_{j=1}^{N_1} K_k(n)\, w_{k,j}\, b'_j(n) + b_k(n)\right] \qquad (2.35)$$

and inputs

$$\bar{x}_{k,j,p}(n) := K_k(n)\, w_{k,j}\, K'_j(n)\, x_p(n). \qquad (2.36)$$

The weight estimates for $w'_{j,p}$, $j \in [1,N_1]$, $p \in [1,N_0]$, comprise the least square error solution to the system of equations

$$\bar{t}_k(n) = \sum_{j=1}^{N_1} \sum_{p=1}^{N_0} \bar{x}_{k,j,p}(n)\, w'_{j,p}, \qquad k = 1, 2, \ldots, N_2, \quad n = 1, 2, \ldots, N. \qquad (2.37)$$

Unlike the output layer, we see that the problem cannot be decomposed into separate solutions for sets of weights connected to individual nodes in the hidden layer. This is a reflection of the fact that all weights in the hidden layer are coupled through their "mixing" in the output layer. This means that the simultaneous solution for all weights in the hidden layer should be beneficial with respect to a node-wise solution. Indeed we will find this to be the case in the experiments.

Note that, for a fixed $k$, the inputs to the linearized network, $\bar{x}_k(n)$, $n \in [1,N]$, are most conveniently viewed as two-dimensional (indexed by couples $(j,p)$). There are $N$ such "grid" inputs for each $k$, paired with the $N$ values of $\bar{t}_k(n)$.
If there were two hidden layers in the network, we would find that there would be three effective inputs. Although the number of subscripts continues to increase with increasing layers, only the subscript of the output node and the subscripts of the weight with which the linearized input is associated are used (see (2.44)). The other subscripts are "summed" out of the equation.

Further, it is noted that the role of $k$ in (2.37) is somewhat superfluous. In principle, the index is used to keep track of which of $N_2$ outputs in the linearized network is being considered. However, the training pairs $(\bar{t}_k(n);\ \bar{x}_{k,1,1}(n), \ldots, \bar{x}_{k,N_1,N_0}(n))$, $k \in [1,N_2]$, $n \in [1,N]$, can be reindexed by mapping pairs $(k,n) \rightarrow i$ so that the training pairs may be written $(\bar{t}(i);\ \bar{x}_{1,1}(i), \ldots, \bar{x}_{N_1,N_0}(i))$, $i \in [1, N \times N_2]$. Of course, an identical system of equations to (2.37) results, but the linearized network may be viewed as a single output linear layer with $N \times N_2$ training pairs.

Updating of some subset of the weights in the hidden layer (in particular, "node-wise" as in the A-S algorithm) is tantamount to solving the subsystem of (2.37) corresponding to the desired weights, introducing the updated values into the system, solving for the next desired subset, etc. Clearly this will result in a different solution than the simultaneous solution. In terms of the error surfaces, this process consists of continually updating the error surface as "partial" information becomes available, then moving in the direction of the gradient with respect to a new subset of weights in the updated surfaces. Intuitively, movement "at once" with respect to the "complete" gradient would seem to be a preferable procedure. Indeed, the latter operation corresponds to the simultaneous updating.

The linearization allows us to approximate the error surface of the nonlinear system for only a small neighborhood around the present weights. Because of the criteria used to construct $\bar{E}$, the weights will be changed in the direction of the true gradient in the nonlinear space, but will move to the minimum of $\bar{E}$, which may be quite far from the neighborhood over which $\bar{E} \approx E$. As in the node-wise weight updating, the weights must be allowed to change only a small amount using the training patterns of the linearized system. If the linearized procedure results in a large change of weights, measures must be taken to decrease the alteration. The updating procedure is repeated until changing the weights does not result in a decrease in error. The algorithm proceeds as follows: linearize the system around the present weights, change the weights by a small amount to decrease error, then repeat the procedure. This is done until changing the weights does not decrease the error or a maximum on the number of iterations is reached.

2.2.2.2 L Layer Network

The general case of an $L$ layer network is a straightforward extension of the two layer case. As in the previous section, let a parenthetical superscript indicate the layer. After linearizing all the nodes, the output equation for node $k$ in layer $L$ in terms of the outputs of layer $L-1$ is

$$\bar{y}_k^{(L)}(n) = K_k^{(L)}(n)\left(\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\, \bar{y}_{j_1}^{(L-1)}(n)\right) + b_k^{(L)}(n). \qquad (2.38)$$

This is equivalent to (2.25) and can be used to solve for layer $L$ weights. Extending this down one more layer, and writing $\bar{y}_k^{(L)}(n)$ in terms of the outputs of layer $L-2$,
$$\bar{y}_k^{(L)}(n) = K_k^{(L)}(n)\left(\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\left[K_{j_1}^{(L-1)}(n)\left(\sum_{j_2=1}^{N_{L-2}} w_{j_1,j_2}^{(L-1)}\, \bar{y}_{j_2}^{(L-2)}(n)\right) + b_{j_1}^{(L-1)}(n)\right]\right) + b_k^{(L)}(n) \qquad (2.39)$$

$$= \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} K_k^{(L)}(n)\, w_{k,j_1}^{(L)}\, K_{j_1}^{(L-1)}(n)\, w_{j_1,j_2}^{(L-1)}\, \bar{y}_{j_2}^{(L-2)}(n) + \left[b_k^{(L)}(n) + K_k^{(L)}(n)\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\, b_{j_1}^{(L-1)}(n)\right].$$

Continuing the process for another layer gives

$$\bar{y}_k^{(L)}(n) = \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \sum_{j_3=1}^{N_{L-3}} \Bigl(K_k^{(L)}(n)\, w_{k,j_1}^{(L)}\, K_{j_1}^{(L-1)}(n)\, w_{j_1,j_2}^{(L-1)}\, K_{j_2}^{(L-2)}(n)\, w_{j_2,j_3}^{(L-2)}\, \bar{y}_{j_3}^{(L-3)}(n)\Bigr) \qquad (2.40)$$

$$+ \left[b_k^{(L)}(n) + K_k^{(L)}(n)\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\left(b_{j_1}^{(L-1)}(n) + K_{j_1}^{(L-1)}(n)\sum_{j_2=1}^{N_{L-2}} w_{j_1,j_2}^{(L-1)}\, b_{j_2}^{(L-2)}(n)\right)\right].$$

It is easy to extend this to any number of layers. Let $l \in [1,L]$. Then the equations to update the weights in layer $l$ are

$$\bar{y}_k^{(L)}(n) - \left[b_k^{(L)}(n) + K_k^{(L)}(n)\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\left(b_{j_1}^{(L-1)}(n) + K_{j_1}^{(L-1)}(n)\sum_{j_2=1}^{N_{L-2}} w_{j_1,j_2}^{(L-1)}\Bigl(b_{j_2}^{(L-2)}(n) + \cdots + K_{j_{L-l-1}}^{(l+1)}(n)\sum_{j_{L-l}=1}^{N_l} w_{j_{L-l-1},j_{L-l}}^{(l+1)}\, b_{j_{L-l}}^{(l)}(n)\Bigr)\right)\right] \qquad (2.41)$$

$$= \sum_{j_{L-l}=1}^{N_l} \sum_{j_{L-l+1}=1}^{N_{l-1}} \left(\left[\sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \cdots \sum_{j_{L-l-1}=1}^{N_{l+1}} K_k^{(L)}(n)\, w_{k,j_1}^{(L)}\, K_{j_1}^{(L-1)}(n)\, w_{j_1,j_2}^{(L-1)} \cdots w_{j_{L-l-1},j_{L-l}}^{(l+1)}\, K_{j_{L-l}}^{(l)}(n)\, y_{j_{L-l+1}}^{(l-1)}(n)\right] w_{j_{L-l},j_{L-l+1}}^{(l)}\right). \qquad (2.42)$$

Similarly to above, this can be viewed as a single layer linear mapping with target outputs

$$\bar{t}_k^{(l)}(n) = \bar{y}_k^{(L)}(n) - \left[b_k^{(L)}(n) + K_k^{(L)}(n)\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\left(b_{j_1}^{(L-1)}(n) + K_{j_1}^{(L-1)}(n)\sum_{j_2=1}^{N_{L-2}} w_{j_1,j_2}^{(L-1)}\Bigl(b_{j_2}^{(L-2)}(n) + \cdots + K_{j_{L-l-1}}^{(l+1)}(n)\sum_{j_{L-l}=1}^{N_l} w_{j_{L-l-1},j_{L-l}}^{(l+1)}\, b_{j_{L-l}}^{(l)}(n)\Bigr)\right)\right] \qquad (2.43)$$

and inputs

$$\bar{x}_{k,j_{L-l},j_{L-l+1}}^{(l)}(n) = \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \cdots \sum_{j_{L-l-1}=1}^{N_{l+1}} K_k^{(L)}(n)\, w_{k,j_1}^{(L)}\, K_{j_1}^{(L-1)}(n)\, w_{j_1,j_2}^{(L-1)} \cdots w_{j_{L-l-1},j_{L-l}}^{(l+1)}\, K_{j_{L-l}}^{(l)}(n)\, y_{j_{L-l+1}}^{(l-1)}(n). \qquad (2.44)$$

In the same manner as described above, these linearized inputs and outputs can be used to update the weights in layer $l$.

2.2.3 Network-wise Weight Updating

2.2.3.1 Two Layer Network

For the same reason that simultaneous layer-wise estimation of weights is beneficial, we should expect even more benefit from complete network updating if such were possible. It follows from the developments above that entire network updating is possible for at least one case. If there is a single node in the output layer of the network, let $k = 1$ and define

$$\bar{w}_{j,l} = w_{k,j}\, w'_{j,l} = w_{1,j}\, w'_{j,l}. \qquad (2.45)$$

From (2.34) it follows that

$$\bigl(\bar{y}_1(n) - b_1(n)\bigr) = \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} \bigl[K_1(n)\, K'_j(n)\, x_l(n)\bigr]\, \bar{w}_{j,l} + \sum_{j=1}^{N_1} \bigl[K_1(n)\, b'_j(n)\bigr]\, w_{1,j}. \qquad (2.46)$$

This can be interpreted as an attempt to train a single linear layer with one output and $(N_0 \times N_1) + N_1$ inputs. In this case, there will be only $N$ linearized training patterns. The system can be solved for $w_{1,j}$ and $\bar{w}_{j,l}$, and (2.45) can be used to solve for $w'_{j,l}$.

2.2.3.2 L Layer Network

As in the two layer case, the equations for network-wise weight updating in the $L$ layer network come directly from the layer-wise weight updating equations. Rearranging (2.41) and replacing $k$ with 1 we get,
$$\bar{y}_1^{(L)}(n) - b_1^{(L)}(n) = \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \cdots \sum_{j_{L-1}=1}^{N_1} \sum_{j_L=1}^{N_0} \Bigl(K_1^{(L)}(n)\, K_{j_1}^{(L-1)}(n) \cdots K_{j_{L-1}}^{(1)}(n)\, y_{j_L}^{(0)}(n)\Bigr)\, w_{1,j_1}^{(L)}\, w_{j_1,j_2}^{(L-1)} \cdots w_{j_{L-1},j_L}^{(1)} \qquad (2.47)$$

$$+ \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \cdots \sum_{j_{L-1}=1}^{N_1} \Bigl(K_1^{(L)}(n)\, K_{j_1}^{(L-1)}(n) \cdots b_{j_{L-1}}^{(1)}(n)\Bigr)\, w_{1,j_1}^{(L)}\, w_{j_1,j_2}^{(L-1)} \cdots w_{j_{L-2},j_{L-1}}^{(2)}$$

$$+ \cdots + \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \Bigl(K_1^{(L)}(n)\, K_{j_1}^{(L-1)}(n)\, b_{j_2}^{(L-2)}(n)\Bigr)\, w_{1,j_1}^{(L)}\, w_{j_1,j_2}^{(L-1)} + \sum_{j_1=1}^{N_{L-1}} \Bigl(K_1^{(L)}(n)\, b_{j_1}^{(L-1)}(n)\Bigr)\, w_{1,j_1}^{(L)}.$$

This can be redefined, collecting each product of weights along a path into a single composite weight, as

$$\bar{y}_1^{(L)}(n) - b_1^{(L)}(n) = \sum_{j_1,\ldots,j_L} \Bigl(K_1^{(L)}(n) \cdots K_{j_{L-1}}^{(1)}(n)\, y_{j_L}^{(0)}(n)\Bigr)\, \bar{w}_{j_1,j_2,\ldots,j_L} + \sum_{j_1,\ldots,j_{L-1}} \Bigl(K_1^{(L)}(n) \cdots b_{j_{L-1}}^{(1)}(n)\Bigr)\, \bar{w}_{j_1,\ldots,j_{L-1}} + \cdots + \sum_{j_1} \Bigl(K_1^{(L)}(n)\, b_{j_1}^{(L-1)}(n)\Bigr)\, \bar{w}_{j_1}. \qquad (2.48)$$

This can be interpreted as a one output, $[N_L + (N_L \times N_{L-1}) + (N_L \times N_{L-1} \times N_{L-2}) + \cdots + (N_L \times N_{L-1} \times \cdots \times N_1 \times N_0)]$ input linear system. The composite weights $\bar{w}_{j_1,\ldots,j_L}$, $\bar{w}_{j_1,\ldots,j_{L-1}}$, $\ldots$, $\bar{w}_{j_1}$ can be solved for and used to determine the system weights. In the previous section we saw that as the number of layers increased, the dimension of the inputs increased. Here, as the number of layers increases, the dimension of the weights increases.

2.3 Solving the Linearized Equations

2.3.1 Solution by MIL-WRLS

In principle, once the linearization is achieved at iteration $i$ and pattern $n$, any least mean square type algorithm can be employed to update the weight estimates. The A-S method uses the conventional MIL-WRLS algorithm. In this case, neglecting any error weighting, MIL-WRLS takes the form of the two recursions (written for node $k$ in the top layer) [5, Ch. 5],

$$P(i,n) = P(i,n-1) - \frac{P(i,n-1)\, \bar{y}'_k(n)\, [\bar{y}'_k(n)]^T\, P(i,n-1)}{1 + [\bar{y}'_k(n)]^T\, P(i,n-1)\, \bar{y}'_k(n)} \qquad (2.49)$$

$$w_k(i,n) = w_k(i,n-1) + P(i,n)\, \bar{y}'_k(n)\, \bigl[\bar{t}_k(n) - [\bar{y}'_k(n)]^T\, w_k(i,n-1)\bigr]. \qquad (2.50)$$

$w_k(i,n)$ is the estimate of the weights $w_k$ following pattern $n$ in the $i$th iteration through the training data, and $P^{-1}(i,n)$ is the covariance matrix at the same "time" in the process,

$$P^{-1}(i,n) \stackrel{\Delta}{=} \sum_{j=1}^{n} \bar{y}'_k(j)\, [\bar{y}'_k(j)]^T. \qquad (2.51)$$

Note that $w_k(i,0) \stackrel{\Delta}{=} w_k(i-1,N)$ and similarly for the covariance matrix. This presents the question of how $w_k(0,0)$ and $P(0,0)$ should be initialized. The inverse covariance matrix contains theoretically infinite values at the outset, and a proper initialization for the weights is practically not known (this means that the initial linearizations of the training data are based on potentially very bad weight estimates). This issue will be addressed further below. Also, it is clear that this solution, as written, will continue to "accumulate" past linearized sets of data which might, in fact, be linearized around very poor weight estimates. Therefore, the A-S algorithm includes a "forgetting factor" [5] in the MIL-WRLS recursions. This is equivalent to using a weighted error criterion with time-varying (exponentially decaying) weights. This can make convergence slow if the forgetting factor is large. If the forgetting factor is small, then past values are forgotten more quickly, but the algorithm may have convergence problems. We will also comment further on this issue below.

We have found that the choice of conventional MIL-WRLS as a solution method seriously impairs the ability of this linearization method to converge on a proper set of network weights. As an alternative, therefore, we suggest the method presented in the following section.
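For reference, one MIL-WRLS step of (2.49) and (2.50) can be sketched as follows. The snippet includes a forgetting factor `lam` as described in the text (with `lam = 1` it reduces to the unweighted form written above); the function name and the arrangement of arguments are ours.

```python
import numpy as np

def mil_wrls_update(P, w, ybar, tbar, lam=1.0):
    """One MIL-WRLS recursion, after (2.49)-(2.50), with exponential
    forgetting factor lam.  P is the (scaled) inverse-covariance matrix,
    w the weight estimate, ybar the linearized input vector ybar'_k(n),
    and tbar the linearized target tbar_k(n)."""
    denom = lam + ybar @ P @ ybar
    P_new = (P - np.outer(P @ ybar, ybar @ P) / denom) / lam
    innovation = tbar - w @ ybar          # prediction error on this pattern
    w_new = w + P_new @ ybar * innovation
    return P_new, w_new
```

Note the initialization problem described above: `P` must be seeded with some large (theoretically infinite) values before the first pattern, which is precisely what the QR-based alternative avoids.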
2.3.2 Solution by QR-WRLS

In order to improve convergence, the algorithm developed above can be implemented using QR-WRLS [4, 7, 28]. This algorithm has distinct advantages over conventional MIL-WRLS. First, the QR-WRLS algorithm does not suffer from the initialization problems noted above for MIL-WRLS. It is also robust numerically, as no matrix inversions are done. This should prove to be very beneficial for different algorithms currently implemented with MIL-WRLS. Allred [29] made a study of various FNN training algorithms for the U.S. Navy and rejected all which used MIL-WRLS because, in his words, "Classical techniques such as Newton's technique or regression analysis (which invert large N x N matrices) can be ill-conditioned and fail. . . . Although these approaches may be interesting, we reject them in favor of more robust techniques. The last thing one wants is to have a software product which fails in a production environment." In addition to improved numerical properties, QR-WRLS also permits the inclusion of several very flexible "forgetting" strategies. To illustrate the operation of the algorithm, it is sufficient to consider the estimation of weights $w_k$ in the output layer of the two layer network. All notation is consistent with that used above.

In effect, the linearization technique described above reduces the problem at the $i$th iteration through the training patterns to one of finding the least square error solution of the overdetermined system of equations

$$\begin{bmatrix} [\bar{y}'_k(1)]^T \\ [\bar{y}'_k(2)]^T \\ \vdots \\ [\bar{y}'_k(N)]^T \end{bmatrix} w_k(i) = \begin{bmatrix} \bar{t}_k(1) \\ \bar{t}_k(2) \\ \vdots \\ \bar{t}_k(N) \end{bmatrix}. \qquad (2.52)$$

The QR-WRLS method is based upon transforming this system into an upper triangular system by applying a series of orthonormal operators (Givens rotations). The resulting system is

$$\begin{bmatrix} T(i,N) \\ 0_{(N-N_1) \times N_1} \end{bmatrix} w_k(i) = \begin{bmatrix} d_1(i,N) \\ d_2(i,N) \end{bmatrix}, \qquad (2.53)$$

where the matrix $T(i,N)$ is $N_1 \times N_1$ upper triangular and $0_{j \times k}$ denotes the $j \times k$ zero matrix. The solution for $w_k(i)$ is easily obtained by back-substitution. A recursive version of the solution is also possible. The recursive algorithm is shown in Figure 2.4. For details the reader is referred to [4, 28].

For discussion of further benefits of the decomposition algorithm, it is useful to view the $A$ matrix, defined in Figure 2.4, as four partitions. Following the rotation of the $n$th equation in Step 2, for example,

$$A = \begin{bmatrix} T(i,n) & d_1(i,n) \\ 0_{1 \times N_1} & d_2(i,n) \end{bmatrix}. \qquad (2.55)$$

Figure 2.4: Weight estimation using recursive QR-WRLS

WEIGHT ESTIMATION USING RECURSIVE QR-WRLS

Initialization: Initialize an $(N_1+1) \times (N_1+1)$ working matrix, say $A$, to a null matrix.

Recursion: For $i = 1, 2, \ldots$ (iteration); and, for $n = 1, 2, \ldots, N$ (pattern),

1. Enter the next equation into the bottom row of $A$:
$$\bigl[\ [\bar{y}'_k(n)]^T \mid \bar{t}_k(n)\ \bigr]. \qquad (2.54)$$

2. "Rotate" the new equation into the system using
$$A'_{m,k} = A_{m,k}\,\sigma + A_{N_1+1,k}\,\tau S$$
$$A'_{N_1+1,k} = -A_{m,k}\,\tau S + A_{N_1+1,k}\,\sigma S$$
for $k = m, m+1, \ldots, N_1+1$ and $m = 1, \ldots, N_1$; where $\sigma = A_{m,m}/\rho$, $\tau = A_{N_1+1,m}/\rho$, $\rho = (A_{m,m}^2 + A_{N_1+1,m}^2)^{1/2}$, $S$ is unity (useful later), and $A_{m,k}$ ($A'_{m,k}$) is the $m,k$ element of $A$ pre- (post-)rotation. No other elements of $A$ are affected.

3. Solve for the least square estimate of the weights $w_k$ if desired. (Solution after the $n$th pattern will produce what has been called $w_k(i,n)$ in the text, and $w_k(i,N) = w_k(i)$.)

4. If $n < N$, increment $n$. Otherwise check the convergence criterion and increment $i$ and reset $n$ if it is not met.

Termination: Stop when some convergence criterion is met.
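The recursion of Figure 2.4 is short enough to render directly in code. The sketch below is a minimal NumPy illustration, not the dissertation's implementation: the function names are ours, and the $S = -1$ case (used below to "rotate out" equations) is implemented as a hyperbolic rotation, which is our reading of that device.

```python
import numpy as np

def qr_wrls_rotate(A, x, t, S=1.0):
    """One recursion of Figure 2.4: place the pattern equation x^T w = t in
    the bottom row of the (N1+1) x (N1+1) working matrix A and rotate it
    into the triangular system.  S = +1 adds the equation; S = -1 removes
    a previously included one (hyperbolic rotation, our interpretation)."""
    M = A.shape[0] - 1
    A[M, :M], A[M, M] = x, t
    for m in range(M):
        p2 = A[m, m] ** 2 + S * A[M, m] ** 2
        if p2 <= 0.0:                 # degenerate plane; nothing to rotate
            continue
        p = np.sqrt(p2)
        sigma, tau = A[m, m] / p, A[M, m] / p
        upper = sigma * A[m, m:] + S * tau * A[M, m:]
        lower = -tau * A[m, m:] + sigma * A[M, m:]
        A[m, m:], A[M, m:] = upper, lower   # bottom-row pivot is now zeroed
    return A

def solve_weights(A):
    """Back-substitution on T(i,n) w = d1(i,n) (Step 3); valid once enough
    equations have been rotated in for T to be nonsingular."""
    M = A.shape[0] - 1
    return np.linalg.solve(A[:M, :M], A[:M, M])
```

No inversion of a covariance matrix appears anywhere, which is the source of the numerical robustness discussed above.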
As is the case with the A-S method, a forgetting factor must be employed to gradually reduce the effects of earlier linearizations. This is very easily accomplished in the QR-WRLS algorithm by simply multiplying the top $N_1$ rows of the matrix $A$ (matrix $T(i,n)$ and vector $d_1(i,n)$) by a factor $\beta < 1$ prior to the rotation of the $(n+1)$st pattern equation. In this context, both the forgetting factor and the frequency of weight updates can be varied. In addition to exponential forgetting factors, equations can be "rotated out" of the matrix. This is done by changing $S$ in Figure 2.4 to $-1$ and rotating in the equation to be eliminated. Thus, for example, only the last $Q > N_1$ equations can be used to calculate the weight updates by sequentially removing equation $n - Q + 1$ prior to inclusion of equation $n$. This procedure effects a sliding window over which the estimates are computed.

Another forgetting method useful for FNNs is possible because no initialization of the updating equations is necessary. Because there are no initialization problems, the system can be re-initialized at any step, thus completely "forgetting" the past linearized values. One method that has worked well in our simulations is to re-initialize the $A$ matrix after every iteration. These and a number of other flexible forgetting strategies made possible by this algorithm may prove very useful in the training of FNNs [28]. In addition to new forgetting strategies, the QR-WRLS implementation also allows the frequency of updating of the weights to vary. As with conventional MIL-WRLS, the weights can be updated every time a new linearization is used⁶.

⁶This can be as often as every pattern, or at the end of each iteration through the patterns, as has been our convention.

The theoretical results above, along with those in Section 2, can be combined to form a learning algorithm for FNNs. For the node-wise weight updating this is done as follows. First, the weights of the network are initialized. This is done randomly, each weight being selected from a uniform distribution over the set $[-1,1]$. Once the initial weights are chosen, the weight updating can begin. First, a training pattern is input to the system. Because the weights are not updated until all the training patterns have been used, convergence does not depend on the order in which the training patterns are used. Given a training pattern, the algorithm calculates linearized training patterns for the last layer nodes and these are rotated into the corresponding $A$ matrix. Each node has a separate $A$ matrix. The target outputs of the layer below are calculated next using back-propagation; see (2.16). The $A$ matrices for the first layer are then updated. A new training pattern is then used to calculate a new set of linearized inputs and outputs. This is repeated until all the training patterns have been used. The $A$ matrices are then used to calculate updated weights. This continues until the network converges to a solution or a specified maximum number of iterations has been reached. By definition, the solution is said to have converged when the change of the norm of the vector of all the weights is below a threshold.

The algorithm for the layer-wise weight updating is similar. Weight updating is exactly the same for the output layer nodes. The layers below have one $A$ matrix each. The linearized training patterns are calculated for each layer and rotated into the corresponding matrix. After all the training patterns have been rotated in, the new weights are calculated and the process is repeated.
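The forgetting devices and the convergence test just described amount to a few lines of array manipulation. The following is a minimal sketch under the conventions of Figure 2.4; the helper names and the tolerance value are ours, not the dissertation's.

```python
import numpy as np

def exponential_forget(A, beta):
    """Exponential forgetting: scale the top N1 rows of A, i.e. T(i,n) and
    d1(i,n), by beta < 1 before the next pattern equation is rotated in."""
    A[:-1, :] *= beta
    return A

def reinitialize(A):
    """Complete forgetting: re-zero the working matrix.  Because QR-WRLS
    needs no initialization, this can be done at any step; re-initializing
    after every iteration worked well in the simulations."""
    A[:] = 0.0
    return A

def converged(w_old, w_new, tol=1e-4):
    """Convergence test described in the text: the change of the norm of
    the vector of all weights falls below a threshold (tol is assumed)."""
    return abs(np.linalg.norm(w_new) - np.linalg.norm(w_old)) < tol
```

The sliding-window variant is obtained by pairing each new rotation with a rotate-out ($S = -1$) of the oldest retained equation, as described above.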
As before, the algorithm stops when a solution is reached or a specified maximum number of iterations is reached. As with other training algorithms for FNNs, this algorithm may not converge to the weights corresponding to the global minimum of $E$. Also, although the algorithm approximates a gradient system, because it is not a gradient system, there is no guarantee that the algorithm will converge to any solution.

Chapter 3

Data Reduction Algorithm

3.1 Introduction

This chapter describes a method of reducing the number of training patterns used at each iteration. Each linearized training pattern that is rotated into $T(i,N)$ and $d_1(i,N)$ will affect both the magnitude and direction of the vector solution $w(i)$. Some training patterns will have a relatively large effect on the solution and others a relatively small effect. The goal is to determine which training patterns will have a very small effect on the resulting weights and to avoid the rotation of these into the system of equations, thereby reducing the number of computations at this iteration and making training more efficient.

3.2 Data Reduction Algorithm

Simulation studies have revealed a very important phenomenon which occurs when the training algorithm of the previous chapter is employed. As the training proceeds, a few of the linearized training patterns become dominant, and the weight change is largely dependent on these training patterns. Although the original training patterns do not change, the linearized training patterns are dependent on the present weights, and so the linearized training patterns generally change at each iteration. For this reason, at one iteration a linearized training pattern may have a large effect on the weight change, while at another iteration, after the weights have changed, the linearized training pattern may have a negligible effect. All linearized weight updating algorithms presented in Chapter 2 in effect reduce the network to a linear network with one output. Hence in this section we will consider one node with $M$ inputs and one output, using the notation introduced for QR-WRLS.

The fact that a few of the linearized training patterns become dominant during the training is due to the non-linearity in the output. Simply put, it is usual for the linearized training patterns to decrease in magnitude as the training progresses, and so have a decreasing effect on the resulting vector $w(i)$. Let us now consider this effect in greater detail. The non-linearity used in the FNN is a sigmoid. As a consequence, as the input to the nonlinearity of the node approaches $\pm\infty$, the output approaches a constant. The derivative of the output with respect to the input, which is always positive, is largest when the input is zero, and goes to zero as the absolute value of the input becomes large. This derivative is important in determining the effect of a linearized training pattern on the result because the magnitude of the linearized training pattern is proportional to this derivative. The derivative, in turn, is dependent on the input $u$. Finally, $u$ is dependent on the weights and on the non-linear training pattern. As the weights change during the training process, we can view the changing effect that a specific linearized training pattern has on the solution relative to itself, or relative to the other linearized training patterns.

First, let us view the changing effect of a linearized training pattern relative to itself. As training progresses, it is usual for the weights to increase.
This increase in weights usually causes the input $u$ to increase. An increase in $u$ causes the derivative at successive iterations to decrease, and the magnitude of the linearized training pattern usually decreases. This can be easily seen from (2.9) and (2.13). The linearized training pattern at the next iterations will have a smaller magnitude and thus have a smaller effect on the solution weights.

Now consider a specific linearized training pattern relative to other linearized training patterns. It is usual for all the linearized training patterns to decrease in magnitude and thus have a decreasing effect on the solution weights. However, because the change of weights does not affect the input $u$ corresponding to each of the training patterns equally, the derivative of the output with respect to the input will not change equally for all the training patterns. As the derivative for some training pattern becomes very small, the effect of this linearized training pattern becomes negligible relative to the others.

Another way of explaining this process involves the use of error surfaces. Each training pattern has an associated error surface that is a function of the system weights. As shown in (1.6), the total error surface is the sum of the error surfaces associated with each of the individual training patterns. A training pattern with an associated error surface that is almost flat at the present weights does not contribute significantly to the derivative of the total error surface, and consequently does not contribute significantly to the determination of the gradient direction. The effect of rotating a training pattern with a small corresponding derivative (one whose error surface is almost flat at the present weights) into the $T(i,N)$ matrix is smaller than the effect of another training pattern with a large derivative, or an error surface that is not as flat.

To determine whether some of the training patterns can be ignored at the present iteration without affecting the solution significantly, two issues must be addressed: how to determine if a training pattern has an effect small enough to be excluded at the present iteration, and how to do so in a computationally inexpensive way. The "effect" of each training pattern will be measured by the magnitude of the change of the solution weights due to this training pattern.

Assume that iteration $i$ through the training patterns has been completed. Thus, $w(i)$ has been computed, and the system weights have been changed in the direction of $w(i)$. New linearized training patterns are then calculated using these new weights. We now need to decide which training patterns to use in the calculation of $w(i+1)$. The following matrix equation is the result of the $i$th iteration:

$$T(i,N)\, w(i) = d_1(i,N). \qquad (3.1)$$

Suppose we use QR-WRLS to rotate the first new linearized training pattern into (3.1). This results in a new matrix equation, say,

$$T'(i,N)\, w'(i) = d'_1(i,N). \qquad (3.2)$$

Let

$$\delta w(i) = w'(i) - w(i)$$
$$\delta T(i,N) = T'(i,N) - T(i,N) \qquad (3.3)$$
$$\delta d_1(i,N) = d'_1(i,N) - d_1(i,N).$$

Thus $\|\delta w(i)\|$ is the magnitude of the weight change due to this training pattern at this iteration, where, for the moment, $\|\cdot\|$ indicates any valid norm. To simplify notation,

$$\Delta(w(i)) \stackrel{\Delta}{=} \frac{\|\delta w(i)\|}{\|w(i)\|} \qquad (3.4)$$

will be used in comparing the magnitudes of the weight changes due to the training patterns. The same procedure of rotating one training pattern into (3.1) is repeated in order to calculate $\Delta(w(i))$ separately for each of the linearized training patterns.
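The exact per-pattern effect of (3.4) can be computed by a single sequence of Givens rotations and a re-solve, as the sketch below illustrates (function name and structure are ours). Doing this for every candidate pattern costs as much as a full update, which is precisely the problem taken up next.

```python
import numpy as np

def exact_effect(T, d1, x, t):
    """Exact normalized weight change Delta(w(i)) of (3.4) for one
    candidate pattern (x, t): rotate the equation x^T w = t into the
    triangular system T w = d1 and re-solve (infinity norm used here)."""
    M = T.shape[0]
    Tp, dp = T.copy(), d1.copy()
    row, rhs = x.astype(float).copy(), float(t)
    for m in range(M):                       # annihilate row against T
        p = np.hypot(Tp[m, m], row[m])
        if p == 0.0:
            continue
        c, s = Tp[m, m] / p, row[m] / p
        Tm, dm = Tp[m].copy(), dp[m]
        Tp[m], dp[m] = c * Tm + s * row, c * dm + s * rhs
        row, rhs = -s * Tm + c * row, -s * dm + c * rhs
    w = np.linalg.solve(T, d1)               # solution of (3.1)
    w_new = np.linalg.solve(Tp, dp)          # solution of (3.2)
    return np.linalg.norm(w_new - w, np.inf) / np.linalg.norm(w, np.inf)
```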
We now have a number for comparing the effect of each training pattern. However, it is immediately apparent that in order to calculate $\Delta(w(i))$ for each training pattern using any norm, each training pattern must be rotated separately into the system (3.1). This is the same number of computations needed to calculate new weights using all the training patterns. Consequently, direct computation of $\Delta(w(i))$ as a means of checking for the usefulness of a given training pattern offers no computational advantage. We therefore seek an approximation to $\Delta(w(i))$, say $\tilde{\Delta}(w(i))$, which is

1. much less computationally expensive to compute than $\Delta(w(i))$.

Additionally, it will be desirable for $\tilde{\Delta}(w(i))$ to have two further properties:

2. Order preservation. If $\tilde{\Delta}^{(n)}(w(i))$ indicates the normalized change in $w(i)$ due to training pattern $n$, and $\Delta^{(n)}(w(i)) < \Delta^{(m)}(w(i))$, then it should be true that $\tilde{\Delta}^{(n)}(w(i)) < \tilde{\Delta}^{(m)}(w(i))$.

3. $\tilde{\Delta}^{(n)}(w(i)) \rightarrow 0$ as $\Delta^{(n)}(w(i)) \rightarrow 0$.

Properties 2 and 3 assure that the approximation will "track" the true change at small values (where the information is most important) while at the same time preserving the order of relative changes among the patterns.

Now, from (3.2) and (3.3),

$$[T(i,N) + \delta T(i,N)]\,[w(i) + \delta w(i)] = d_1(i,N) + \delta d_1(i,N). \qquad (3.5)$$

Using (3.1), (3.5) becomes

$$\delta T(i,N)\, w(i) + T(i,N)\, \delta w(i) + \delta T(i,N)\, \delta w(i) = \delta d_1(i,N). \qquad (3.6)$$

Factoring $\delta w(i)$,

$$[T(i,N) + \delta T(i,N)]\, \delta w(i) = \delta d_1(i,N) - \delta T(i,N)\, w(i). \qquad (3.7)$$

After some manipulation, we obtain

$$\delta w(i) = T^{-1}(i,N)\,\Bigl[\bigl[\delta d_1(i,N) - \delta T(i,N)\, w(i)\bigr] - \delta T(i,N)\, \delta w(i)\Bigr]. \qquad (3.8)$$

Taking any valid norm gives the upper bound

$$\|\delta w(i)\| \le \|T^{-1}(i,N)\|\,\Bigl[\|\delta d_1(i,N) - \delta T(i,N)\, w(i)\| + \|\delta T(i,N)\|\,\|\delta w(i)\|\Bigr], \qquad (3.9)$$

which, after some algebra, yields

$$\|\delta w(i)\| \le \frac{\|T^{-1}(i,N)\| \cdot \|\delta d_1(i,N) - \delta T(i,N)\, w(i)\|}{1 - \|T^{-1}(i,N)\| \cdot \|\delta T(i,N)\|}. \qquad (3.10)$$

It is our objective to produce an approximation to $\|\delta w(i)\|$ or $\Delta(w(i))$ rather than a bound. A pivotal consideration in doing so is the conditioning of the matrix $T(i,N)$. We digress momentarily to consider this issue. Proper conditioning of $T(i,N)$ and $T(i,N) + \delta T(i,N)$ means that

$$\|T(i,N)\| \cdot \|T^{-1}(i,N)\| \approx 1 \qquad (3.11)$$
$$\|T(i,N) + \delta T(i,N)\| \cdot \|[T(i,N) + \delta T(i,N)]^{-1}\| \approx 1.$$

Geometrically this implies that the norm of neither of the products

$$T(i,N)\, w(i) \qquad (3.12)$$
$$[T(i,N) + \delta T(i,N)]\, w(i) \qquad (3.13)$$

is dependent on the direction of $w(i)$. The ability to discount the direction of $w(i)$ will be central to the development of a useful approximation. A sufficient condition for these matrices to be well conditioned is for the features to be independent and identically distributed (i.i.d.). This is because $T(i,N)$ is the Cholesky factor of the "temporal" covariance matrix, whose expectation is diagonal under proper ergodicity assumptions. The i.i.d. feature model is not unrealistic in many problems.

Returning to the bound in (3.10), we invoke the fact that if the $T^{-1}(i,N)$ matrix, and equivalently the $T(i,N)$ matrix, is well-conditioned, then we can replace the inequality by an approximation,

$$\|\delta w(i)\| \approx \frac{\|T^{-1}(i,N)\| \cdot \|\delta d_1(i,N) - \delta T(i,N)\, w(i)\|}{1 - \|T^{-1}(i,N)\| \cdot \|\delta T(i,N)\|}. \qquad (3.14)$$

This is possible because the Cauchy-Schwartz inequality can be approximated by an equality under the assumed conditions. While this approximation can be shown to have the second and third features mentioned above, it is too expensive computationally to be useful in speeding up the algorithm. It would require more computations to calculate this approximation than to use all the training patterns for the weight updating.
In order to reduce the number of computations, substitutions for some of the factors in this approximation will be made. The first substitution introduced will be for $\|T^{-1}(i,N)\|$. Since $d_1(i,N) = T(i,N)\, w(i)$,

$$\frac{\|w(i)\|}{\|d_1(i,N)\|} \le \|T^{-1}(i,N)\|. \qquad (3.15)$$

It is easily shown that under the same well-conditioning assumption, the inequality is also a good approximation, so

$$\frac{\|w(i)\|}{\|d_1(i,N)\|} \approx \|T^{-1}(i,N)\|. \qquad (3.16)$$

Making this substitution for $\|T^{-1}(i,N)\|$ on the right side of (3.10), we obtain

$$\|\delta w(i)\| \approx \frac{\dfrac{\|w(i)\|}{\|d_1(i,N)\|}\,\|\delta d_1(i,N) - \delta T(i,N)\, w(i)\|}{1 - \dfrac{\|w(i)\|}{\|d_1(i,N)\|}\,\|\delta T(i,N)\|}. \qquad (3.17)$$

Thus

$$\tilde{\Delta}(w(i)) \stackrel{\Delta}{=} \frac{\|\delta d_1(i,N) - \delta T(i,N)\, w(i)\|}{\|d_1(i,N)\| - \|w(i)\|\,\|\delta T(i,N)\|} \approx \frac{\|\delta w(i)\|}{\|w(i)\|} = \Delta(w(i)). \qquad (3.18)$$

Let us now show that under fairly general conditions, (3.18) is a good approximation. First we demonstrate order preservation. Let $\|\delta^{(n)}w(i)\|$ and $\|\delta^{(m)}w(i)\|$ be the weight changes due to two different training patterns, $n$ and $m$. Let us suppose that

$$\|\delta^{(n)}w(i)\| < \|\delta^{(m)}w(i)\|. \qquad (3.19)$$

Therefore

$$\|T(i,N)\| \cdot \|\delta^{(n)}w(i)\| < \|T(i,N)\| \cdot \|\delta^{(m)}w(i)\|. \qquad (3.20)$$

Let us assume that, for all training patterns, $\delta T(i,N)$ is small compared with $T(i,N)$, so that

$$\|T(i,N)\| \approx \|T(i,N) + \delta T(i,N)\| \qquad (3.21)$$

and

$$\frac{\|d_1(i,N)\|}{\|w(i)\|} \gg \|\delta T(i,N)\|. \qquad (3.22)$$

With assumption (3.21), we can write (3.20) as

$$\|T(i,N) + \delta^{(n)}T(i,N)\| \cdot \|\delta^{(n)}w(i)\| < \|T(i,N) + \delta^{(m)}T(i,N)\| \cdot \|\delta^{(m)}w(i)\|, \qquad (3.23)$$

where $\delta^{(n)}T(i,N)$ is the change in $T(i,N)$ due to pattern $n$. Because $T(i,N) + \delta^{(n)}T(i,N)$ and $T(i,N) + \delta^{(m)}T(i,N)$ are assumed well conditioned, (3.23) can be written

$$\|[T(i,N) + \delta^{(n)}T(i,N)]\,\delta^{(n)}w(i)\| < \|[T(i,N) + \delta^{(m)}T(i,N)]\,\delta^{(m)}w(i)\|. \qquad (3.24)$$

Now using condition (3.22),

$$\frac{\|[T(i,N) + \delta^{(n)}T(i,N)]\,\delta^{(n)}w(i)\|}{\dfrac{\|d_1(i,N)\|}{\|w(i)\|} - \|\delta^{(n)}T(i,N)\|} < \frac{\|[T(i,N) + \delta^{(m)}T(i,N)]\,\delta^{(m)}w(i)\|}{\dfrac{\|d_1(i,N)\|}{\|w(i)\|} - \|\delta^{(m)}T(i,N)\|}. \qquad (3.25)$$

From (3.7), (3.25) is equivalent to the inequality

$$\frac{\|\delta^{(n)}d_1(i,N) - \delta^{(n)}T(i,N)\,w(i)\|}{\dfrac{\|d_1(i,N)\|}{\|w(i)\|} - \|\delta^{(n)}T(i,N)\|} < \frac{\|\delta^{(m)}d_1(i,N) - \delta^{(m)}T(i,N)\,w(i)\|}{\dfrac{\|d_1(i,N)\|}{\|w(i)\|} - \|\delta^{(m)}T(i,N)\|} \qquad (3.26)$$

or

$$\frac{\|\delta^{(n)}d_1(i,N) - \delta^{(n)}T(i,N)\,w(i)\|}{\|d_1(i,N)\| - \|w(i)\| \cdot \|\delta^{(n)}T(i,N)\|} < \frac{\|\delta^{(m)}d_1(i,N) - \delta^{(m)}T(i,N)\,w(i)\|}{\|d_1(i,N)\| - \|w(i)\| \cdot \|\delta^{(m)}T(i,N)\|}. \qquad (3.27)$$

Hence we have shown that the ordering of weight effects (3.19) results in

$$\tilde{\Delta}^{(n)}(w(i)) < \tilde{\Delta}^{(m)}(w(i)). \qquad (3.28)$$

Now consider the "tracking" property 3. Suppose that

$$\delta w(i) = 0. \qquad (3.29)$$

In this case it is easily shown using (3.2) and (3.3) that

$$\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\| = 0. \qquad (3.30)$$

It follows immediately that $\tilde{\Delta}(w(i)) = 0$. Consequently, the "tracking to zero" property is established.

Thus it can be seen that if the $T(i,N)$ matrices are well conditioned, then the approximation is good in the sense of properties 2 and 3. If the $T(i,N)$ matrix is not well-conditioned, then two errors can occur. First, a training pattern with a small effect may be included in the weight updates. This "false alarm" is not a serious problem because including additional training patterns is not detrimental to the weight estimation. Further, the numerator in the approximation is greatest when the matrices are well conditioned; thus,

$$\|[T(i,N) + \delta T(i,N)]\,\delta w(i)\| \le \|T(i,N) + \delta T(i,N)\| \cdot \|\delta w(i)\|. \qquad (3.31)$$

If the weight changes are small, then the approximation will be smaller than if the matrices were well-conditioned, and the chance of a false alarm is small. The second error that can occur is the improper omission of a training pattern. A "miss" means that the left side of (3.31) is much smaller than the right, implying the matrix $T(i,N) + \delta T(i,N)$ is not well-conditioned.
This ill conditioning occurs when there is a feature that is always "zero," so that a "zero" row appears in $T(i,N)$, or if there is poor scaling of the features, so that the diagonal elements of $T(i,N)$ are not of comparable magnitudes. Either of these conditions implies a poorly designed experiment, and could be corrected by eliminating unuseful features or by scaling the features correctly.

Finally, we must explore the computational complexity of $\tilde{\Delta}(w(i))$. Although this approximation is less expensive than $\Delta(w(i))$, computing it for each training pattern will again be more expensive than simply using all the training patterns in the updating equations. In order to achieve the desired reduced computational load, two more substitutions, one for $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|$ and one for $\|\delta T(i,N)\|$, will be introduced.

To reduce the number of computations needed to calculate $w(i+1)$, the total number of computations needed to calculate the approximation $\tilde{\Delta}(w(i))$ for each training pattern and to use the chosen training patterns in the calculation of new weights must be less than that of using all the training patterns to calculate new weights. Toward this end, let all the norms be the infinity norm. For a vector, this norm is the maximum of the absolute values of the elements, and the induced matrix norm is the maximum of the sums of the absolute values of the row components.

To move to iteration $i+1$, $\|d_1(i,N)\|_\infty$ and $\|w(i)\|_\infty$ are determined from the results of the previous iteration. Because these quantities are independent of the linearized training patterns calculated at iteration $i+1$, they need to be calculated only once. This means the approximation introduced in (3.16) is the same for all the training patterns. This leaves $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$ and $\|\delta T(i,N)\|_\infty$ to be calculated for each linearized training pattern.

If it were known which row of $\delta T(i,N)$ would be the largest, it would only be necessary to compute one row of $T'(i,N)$ for each training pattern to get $\|\delta T(i,N)\|_\infty$, and similarly for $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$. In order to have fewer computations, it would be better, if possible, to use the same row to calculate $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$ and $\|\delta T(i,N)\|_\infty$. In this case, only one row of $T'(i,N)$ would be calculated for each training pattern, and a savings in computations would result. For this reason, a judiciously selected single row will be used to approximate $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$ and $\|\delta T(i,N)\|_\infty$.

Since $T(i,N)$ and $T'(i,N)$, and thus $\delta T(i,N)$, are upper triangular, only the first row of $\delta T(i,N)$ can have all non-zero elements. Consequently, a natural approximation for $\|\delta T(i,N)\|_\infty$ is the sum of the absolute values of the first-row elements. The norm of the first row is also a good approximation for $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$. This is not only because it has more non-zero elements, but also because more of the weights are included in the computation of the row of $\delta T(i,N)\,w(i)$ actually used in the computations. This means a greater directional component is included in the final computation. For example, if the last row of $\delta T(i,N)$ were used, the contribution of $\delta T(i,N)\,w(i)$ would depend on only one weight. By using the first row, the employed row of $\delta T(i,N)\,w(i)$, and therefore $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$, depends on all the weights. These substitutions are necessary in order to have a reduction in the number of computations.
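Under the substitutions just described, the screening test reduces to a single Givens rotation against the first row. The sketch below, with our own helper name, computes the cheap estimate of (3.18); `d1_norm` and `w_norm` are the once-per-iteration constants $\|d_1(i,N)\|_\infty$ and $\|w(i)\|_\infty$, and the working matrix $A$ is partitioned as in (2.55).

```python
import numpy as np

def pattern_effect(A, x, t, w, d1_norm, w_norm):
    """Cheap approximation (3.18) to the normalized weight change a
    candidate pattern (x, t) would cause, using only the first row of the
    rotated system (the only row of the upper-triangular delta-T that can
    be fully non-zero)."""
    M = A.shape[0] - 1
    row = A[0, :].copy()                 # [first row of T(i,N) | d1(i,N)[0]]
    eq = np.append(x, t)                 # the candidate equation [x | t]
    p = np.hypot(A[0, 0], x[0])
    if p == 0.0:
        return 0.0
    new_row = (A[0, 0] * row + x[0] * eq) / p   # first row after one rotation
    delta = new_row - row                       # first rows of dT and dd1
    dT_row, dd1 = delta[:M], delta[M]
    num = abs(dd1 - dT_row @ w)                 # ||dd1 - dT w|| via first row
    dT_norm = np.sum(np.abs(dT_row))            # ||dT||_inf via first row
    denom = d1_norm - w_norm * dT_norm
    return num / denom if denom > 0.0 else np.inf
```

Patterns whose estimate falls below a chosen threshold are simply not rotated in at this iteration, which is the computational saving sought.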
Some consequences of using the substitutions can be explicitly observed. First, by approximating the infinity norm by one row, the approximation is smaller than or equal to the true norm. If the true norm is the sum of the absolute values of the first-row components, then the approximation is exact. If not, the approximation will be smaller than the true norm. This is true for both $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$ and $\|\delta T(i,N)\|_\infty$. Both of these approximations have the effect of making the weight change approximation smaller than the upper bound given in (3.10). The substitution for $\|T^{-1}(i,N)\|_\infty$ used for (3.18) has the same effect. This means that the final approximation for $\|\delta w(i)\|_\infty / \|w(i)\|_\infty$ is bounded above by the right side of (3.10). Consequently, if the upper bound of the weight change is going to zero, the approximation will go to zero also. However, it is possible for the approximation to go to zero without the true weight change going to zero. Although this can occur theoretically, it has not proved to be a problem in implementation, as it does not occur in any of the simulations reported in Chapter 4.

The practical test of the approximation to the weight change is its effectiveness as an indicator of the effect of the training pattern on the solution weights. Simulations were run in which the norm of the true weight change was compared with the approximation. The correlation coefficient between the norm of the true weight change and the approximation was computed for each iteration. It was typically in the range of 0.8 to 0.95. These results, along with the simulations run using the data reduction algorithm, indicate that the approximation is indeed good at estimating the weight change of the linearized training pattern.

Chapter 4

Simulation Results

4.1 Introduction

This chapter is divided into three main sections, each corresponding to one of the three main contributions of this dissertation enumerated at the beginning of Chapter 1. The first section presents the simulation results comparing BP, the A-S algorithm, and the QR-WRLS-based node-wise weight updating algorithm developed in this research. The principal purpose of this first group of simulations is to show the advantages of implementing linearized algorithms with QR-WRLS instead of MIL-WRLS. The simulation results for BP have been included as a reference.

Simulation results for BP and the node-wise, layer-wise and network-wise weight updating algorithms are presented in the next section. The node-wise and layer-wise algorithms are used for all the simulations; the network-wise algorithm is used for the simulations of a one-output network. The last section presents simulation studies for the data reduction scheme implemented for the layer-wise algorithm. BP results for the same studies are again presented as a reference.

4.2 Implementation Studies

QR-WRLS, presented in Chapter 2, has been shown to be slightly more computationally efficient than conventional MIL-WRLS in identification of linear systems [4]. The simulations presented here are designed to show the benefits of QR-WRLS for the non-linear FNN. The theoretical advantages are described in Chapter 2; these were confirmed in these simulations. One of these advantages is that the QR-WRLS equations need no initialization. With the MIL-WRLS implementation, the equations are initialized and the linearized training patterns are used to update the weight estimates.
As mentioned in Section 2.3.1, even though a forgetting factor is included, past training patterns are still incorporated in the computation of all weights. However, as the objective is to determine the gradient of the system at the present weights, including past training patterns will necessarily lead to an incorrect gradient estimate. The QR-WRLS equations, on the other hand, need no initialization. This allows the use of a forgetting factor if desired, or the linearized training patterns of the present iteration alone can be used to determine the gradient at the present weights. Even with simple networks and few training patterns, the difference between implementation using QR-WRLS and MIL-WRLS with a forgetting factor was apparent. With more complicated networks and many training patterns the error surface is more complicated and the difference in implementation is even greater. In this section the simulations were run with the QR-WRLS implementation using the same forgetting factor as the MIL-WRLS equations. This was done to simplify the comparisons. It will be shown in the next section that a smaller forgetting factor, or using only the training patterns of the present iteration, is better. This means that the results for the QR-WRLS implementation can be even better than those presented in this section.

Another significant advantage of QR-WRLS is its robust numerical properties. This was especially apparent in the ease of implementation. For a linear network, the training patterns are used once. With the FNN, the training patterns are presented many times. As described in Chapter 3, the linearized training patterns tend to decrease in magnitude as the training progresses. This leads to the covariance matrix becoming very small, and its inverse, the P matrix of the MIL-WRLS equations, becoming very large. In the simulations this matrix tended to become numerically unstable. The QR-WRLS algorithm has no inversions, so this is not a problem.

The simulations reported in this section compare three training strategies for an FNN. These are:

1. Conventional back-propagation, BP (no linearization).
2. The A-S algorithm (conventional MIL-WRLS with a forgetting factor).
3. The node-wise weight updating algorithm implemented with QR-WRLS with an exponential forgetting factor.

The linearizations performed in the A-S algorithm and the node-wise updating algorithm are equivalent, so the only difference between the last two algorithms is the implementation. Each of the three strategies above was used to train each of the following networks:

1. a 2-bit parity checker,
2. a 4-bit parity checker, and
3. a 4-bit bit counter.

The architectures for these three networks are illustrated in Figure 4.5. The 2-bit parity checker (XOR) network has two inputs, two hidden layer nodes and one output node. An additional input is added at each layer whose value is always unity, to serve as a bias for each node. The output function $S(\cdot)$ is the sigmoid defined in (1.3). The initial weights were chosen as follows. Each weight in the network was selected randomly from a uniform distribution over the set $[-1,1]$.

Figure 4.5: Network architectures used in the simulation studies. 1. the 2-bit parity checker, 2. the 4-bit parity checker, and 3. the 4-bit bit counter.

For the BP algorithm, a factor of 0.04 was used in the weight updating equation.
This weight selection procedure was repeated 100 times to select 100 sets of initial weights. The same 100 sets of weights were used for all three implementations. The A-S algorithm was implemented [6] using no weight change constraints. The forgetting factor for this and for the QR-WRLS implementation was 0.98. The QR-WRLS implementation used a weight constraint of 0.2. Thus the weight vector associated with each node was allowed to change by at most a magnitude of 0.2 in the Euclidean metric during each iteration.

The 4-bit parity checker network has four inputs, four hidden layer nodes and one output node. A bias input is also added to each layer. Two output functions were used: the same sigmoid function as above, and the logic activation function. The logic activation function is a three-piece piecewise linear function,

$$f(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } 0 \le x \le 1 \\ 0 & \text{if } x < 0. \end{cases} \qquad (4.1)-(4.2)$$

The derivative of $S(\cdot)$ is trivially determined everywhere except at zero and one, where it does not exist. This does not pose a problem in implementation if we let $S'(a) = 1$ if $a \in [0,1]$ and zero elsewhere. Two sets of 100 random initial weights were used for the three implementations. The first set of weights was random as in the 2-bit parity checker, and the second set of weights was as described by Azimi-Sadjadi et al. in their paper. The A-S method selects the weights so that the outputs of the network will be between zero and one. This is done so that the derivative will not be zero and weight updating can take place.

The 4-bit bit counter had four inputs, four hidden layer nodes and two outputs. An extra input was added to each layer. The logic activation function was used as the output function. Two sets of initial weights, random and those described by Azimi-Sadjadi, were used. The results are shown in Table 4.2.

Table 4.2: Number of convergences per 100 sets of initial weights

                    2in-1out  |     4in-1out      |     4in-1out      |     4in-2out
                    sigmoid   |     sigmoid       |  logic activation |  logic activation
  Implementation    random    |  random    A-S    |  random    A-S    |  random    A-S
                              |          weights  |          weights  |          weights
  Q-R                 78      |    5        5     |    51       57    |    1        16
  Back-Prop            1      |    0        0     |     1       53    |    0         0
  A-S                  8      |    0        0     |     1       37    |    0         9

The table shows the number of times each implementation found weights that solved the problem for the 100 initial weight sets.

Simulations were also run comparing the output error of each algorithm. In the resulting figures, the relative error in dB means the following: Let $\bar{e}(i)$ be the sum of the squared errors incurred in iteration $i$ through the training patterns, averaged over the 100 initial weight sets. Then, plotted in the figures is $10\log(\bar{e}(i)/\rho)$ (dB), where $\rho$ is the maximum possible error at each iteration. The maximum possible error at one iteration is the sum of the squares of the Euclidean distances between each training pattern and the corresponding output of the network.

Figure 4.6 shows the errors of the three X-OR implementations. These results indicate a clear advantage for the QR-WRLS strategy. Algorithmic differences among the three implementations account for the performance differences. One difference is the initialization needed for the MIL-WRLS equations. With the MIL-WRLS strategy, both the covariance matrix recursion and the weight vector recursion must be initialized using theoretically incorrect values. Because of these initializations, the MIL-WRLS algorithm is not guaranteed to move the estimate in the direction of greatest decrease of $E$, or even of decreasing $E$, for the first few iterations. Of the two MIL-WRLS recursions, the weight recursion seems to be the most sensitive to the initialization problem. This is because P is initialized with large values,
so that $P^{-1}$ is small and the effect of this initialization is relatively small. The weight recursion is sensitive to initialization because (2.50) depends explicitly upon $w_k(i,n-1)$. The QR-WRLS algorithm has only an implicit dependence on the weights, as do all linearization algorithms, because the linearizations depend on the weights.

Figure 4.6: Average error in dB for the X-OR implementations vs. iteration number. 1. back-propagation; 2. A-S algorithm; 3. QR-WRLS.

There is also a difference in the performance of the network using different functions for $S(\cdot)$. The logic activation function proved superior to the sigmoid in these experiments. This is probably because the error will always be positive using the sigmoid, but can be zero for the threshold logic activation function. No matter how the weights are adjusted, the output of the sigmoid will always be bounded by one, so that the training pattern outputs can never be matched exactly. With the threshold logic activation function, once the weights are adjusted so that the output is off the ramp (the linear region), the output can be zero or one, in which case the difference between the training output and the actual output can be zero.

4.3 Layer Updating and Network Updating

This section compares BP, the node-wise updating algorithm, the layer-wise updating algorithm and the network-wise updating algorithm. It is divided into four subsections. The first shows the results of the four algorithms implementing a 2-bit parity checker and a 4-bit parity checker with different methods of selecting the initial weights. The second subsection shows the results of the three linearized algorithms with different forgetting factors. In the third subsection, results of simulations with many different, relatively small training sets are given for the four algorithms. The final subsection presents the results of the first three algorithms training with a large training set.

4.3.1 Initialization

One of the greatest problems in FNN training is the selection of the initial weights. As shown below, BP is very sensitive to the initial weights. Also shown in the previous section is the fact that other algorithms, such as A-S, suffer from the same problem. It is the purpose of this subsection to demonstrate that the layer-wise and network-wise weight updating algorithms are superior to BP in robustness to problems caused by "poor" initial weights. As will be seen in the simulations, both the layer-wise and network-wise algorithms can be initialized to very small values and perform very well. In fact the performance improves with very small initial values. This means that the weights can be initialized to small values, typically less than 0.01, without concern about the initial weights.

All four algorithms were used to train the 2-bit and 4-bit parity checker networks. The network architectures are as in the previous section. For the 2-bit parity checker, six different methods of choosing the initial weights were used. The first method was as in the previous section: 100 initial weight sets were chosen from the uniform distribution over $[-1,1]$. Next, 100 initial weight sets were again chosen from the uniform distribution, but now over $[-.5,.5]$. This process was repeated, choosing 100 initial weight sets from uniform distributions over $[-.25,.25]$, $[-.1,.1]$, $[-.01,.01]$ and $[-.001,.001]$ (a sketch of this selection procedure follows). Simulations were run for the four algorithms, and the results are shown in Table 4.3.
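The selection of initial weight sets described above is straightforward to reproduce. The following is a minimal sketch; the shapes shown assume the 2-bit parity network (two hidden nodes and one output node, each with a bias input), and the function name and seed are ours.

```python
import numpy as np

def initial_weight_sets(shapes, half_width, n_sets=100, seed=0):
    """Draw n_sets random initial weight sets, each weight uniform over
    [-half_width, +half_width], one array per weight matrix shape."""
    rng = np.random.default_rng(seed)
    return [[rng.uniform(-half_width, half_width, size=s) for s in shapes]
            for _ in range(n_sets)]

# The six families of Table 4.3: [-1,1] down to [-.001,.001].
families = [1.0, 0.5, 0.25, 0.1, 0.01, 0.001]
weight_sets = {a: initial_weight_sets([(2, 3), (1, 3)], a) for a in families}
```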
The BP algorithm had a factor of 0.05 for the weight updating equations. The node-wise algorithm had a forgetting factor of 0.1 and a weight change constraint of 0.25. The layer-wise algorithm had a forgetting factor of 0.1 and a weight change constraint of 1.0. The network-wise algorithm had a forgetting factor of 0.3 and a weight change constraint of 1.0. The number of convergences out of a possible 100 is given. Also shown is the average number of iterations to convergence for the trials where the algorithm converged to a solution.

Figures 4.7 through 4.10 show the average error of each of the algorithms for the different weight sets. Each of the six different weight sets is numbered: 1 is the weight set from the uniform distribution over $[-1,1]$, 2 is the weight set over $[-.5,.5]$, 3 is the weight set over $[-.25,.25]$, 4 is the weight set over $[-.1,.1]$, 5 is the weight set over $[-.01,.01]$, and 6 is the weight set over $[-.001,.001]$. Each of the Figures 4.8 through 4.14 is numbered similarly.

The simulations for the 4-bit parity checker are similar. 100 initial weights over the same distributions were chosen. The convergence results are shown in Table 4.4. The weight change constraints and forgetting factors were the same as in the 2-bit case. Figures 4.11 through 4.14 show the average error of each of the algorithms for the varying weight sets.

The results show that the performance of BP and the node-wise algorithm suffers as the initial weights become small. The layer-wise and network-wise algorithms, on the other hand, show very good performance even with all initial weights smaller than 0.001 in magnitude. Because of the nature of the problem, gradient-following algorithms are sensitive to the initial weights. Some initial weights will lead to a solution, and others will not. What we would like is one set of weights that would lead to a solution in all cases. This is, of course, impossible, but the large number of convergences of the layer-wise and network-wise algorithms for very small initial weights indicates that very small weights can be used in practice to approximate the "ideal" initial weights.

Table 4.3: Convergence results for various initial weights, 2-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

  Weight           BP        Node-wise      Layer-wise     Network-wise
  variation       A   B      A      B       A      B       A       B
  [-1,1]          1   -     70    23.59    97    14.84    86     96.78
  [-.5,.5]        0   -     93    33.85    99    18.74    93    122.18
  [-.25,.25]      0   -     99    72.95    99    11.55    94    150.37
  [-.1,.1]        0   -     80   156.85    99    11.22    94    166.61
  [-.01,.01]      0   -      7   393.57    97    12.08    98     81.97
  [-.001,.001]    0   -      5   376.6     96    17.45    93     32.51

Table 4.4: Convergence results for various initial weights, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

  Weight           BP        Node-wise      Layer-wise     Network-wise
  variation       A   B      A      B       A      B       A       B
  [-1,1]          0   -     38   207.90    59    65.20    35    238.17
  [-.5,.5]        0   -      9   301.67    76    65.16    38    197.03
  [-.25,.25]      0   -      0     -       76    52.01    42    148.95
  [-.1,.1]        0   -      0     -       82    47.43    45    138.62
  [-.01,.01]      0   -      0     -       89    26.01    50    100.52
  [-.001,.001]    0   -      0     -       87    29.92    66    121.89

Figure 4.7: Average error in dB for the BP implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
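The quantity plotted in these figures is straightforward to compute from the definition given in Section 4.2. A brief sketch follows; the worst-case value of $\rho$ used in the example is our reading of that definition for the 2-bit parity network (four patterns, one output bounded in $[0,1]$), not a figure quoted from the dissertation.

```python
import numpy as np

def relative_error_db(avg_sq_errors, rho):
    """10*log10(ebar(i)/rho): ebar(i) is the squared error in iteration i
    averaged over the 100 initial weight sets; rho is the maximum possible
    error at one iteration."""
    return 10.0 * np.log10(np.asarray(avg_sq_errors, dtype=float) / rho)

# Example: for the 2-bit parity checker, four patterns each contributing at
# most a unit squared error give rho = 4.0 (an assumed worst case).
print(relative_error_db([4.0, 1.0, 0.1], rho=4.0))  # [0., -6.02, -16.02]
```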
[Figure 4.8: Average error in dB for the node-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.9: Average error in dB for the layer-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

4.3.2 Forgetting Factor

This subsection considers the performance of the node-wise, layer-wise and network-wise algorithms with different forgetting factors. When training an FNN with QR-WRLS, some method of "forgetting" past linearized training patterns must be used in the equations. This forgetting is necessary because old linearized training patterns are not useful in determining the gradient of the error surface at the present weights. As mentioned previously, implementation with QR-WRLS allows a variety of forgetting methods. The one that has proved to be very good is a forgetting factor of zero; in other words, all previous training patterns are completely "forgotten" (removed from the estimation process). The forgetting factor is often problem dependent, and one that works with one training set may not work well with another. Using a very small or a zero forgetting factor means that one variable has been eliminated, simplifying training.

[Figure 4.10: Average error in dB for the network-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.11: Average error in dB for the BP implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.12: Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.13: Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.]
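The role of the forgetting factor can be made explicit with the standard exponentially weighted least-squares cost. The following is a sketch in generic notation; the symbol ν follows Figure 4.15 below, and the dissertation's exact formulation appears in Chapter 2:

\[ J_i(\mathbf{w}) \;=\; \sum_{k=1}^{i} \nu^{\,i-k}\,\bigl(d(k) - \mathbf{x}^T(k)\,\mathbf{w}\bigr)^2, \qquad 0 \le \nu \le 1 . \]

With ν = 1 all past linearized patterns are weighted equally; with ν = 0 only the current iteration's patterns survive, which corresponds to the complete "forgetting" found to work well above.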
The forgetting factor and the weight change constraint that give good performance are related: if the forgetting factor is optimized for a certain training set and weight change constraint, then changing the weight change constraint means that the optimum forgetting factor will probably also change. In the simulations it is noted that, for good performance, a small weight change constraint must be used with a small forgetting factor. This relation is shown in Figure 4.15, where the errors of the node-wise algorithm implementing a 2-bit parity checker are plotted for different forgetting factors and different weight change constraints. The number of convergences out of a possible 100 for each of the settings was 78 for simulation 1, 60 for simulation 2, 56 for simulation 3 and 64 for simulation 4. It is apparent that the parameters which yield the most convergences do not necessarily lead to the lowest average error. This trade-off must be considered when implementing the algorithms.

[Figure 4.14: Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.15: Average error in dB for the QR-WRLS X-OR implementation vs. iteration number, using different forgetting factors and weight change constraints. 1. ν = 0.98, γ = 0.2; 2. ν = 0.98, γ = 1.0; 3. ν = 0.1, γ = 1.0; 4. ν = 0.1, γ = 0.2; where ν is the forgetting factor and γ is the weight constraint.]

Even though there is a relation between the forgetting factor and the weight constraint, for simplicity in these simulations the weight constraint was held constant as the forgetting factor was varied. As in the previous subsection, a 2-bit parity checker and a 4-bit parity checker are used in the simulations, with 100 sets of initial weights. The distribution for the weight selection for the 2-bit network was [-.25,.25] for the node-wise algorithm, [-.1,.1] for the layer-wise algorithm, and [-.01,.01] for the network-wise algorithm. These initial weight distributions were selected using the results of the previous section, to obtain the best results for each algorithm. As in the previous section, the weight change constraint was 0.25 for the node-wise algorithm, and 1.0 for the layer-wise and network-wise algorithms. For the 4-bit network the distribution for the weight selection was [-1.0,1.0] for the node-wise algorithm, and [-.01,.01] for the layer-wise and network-wise algorithms. Again, these initial weight distributions were selected to optimize the results for each algorithm. The weight change constraint was the same as for the 2-bit network.

The results for different forgetting factors are shown in Table 4.5 for the 2-bit network and Table 4.6 for the 4-bit network. Figures 4.16 through 4.18 show the average error of each algorithm for the different forgetting factors for the 2-bit network. Figures 4.19 through 4.21 show the average error of each algorithm for the different forgetting factors for the 4-bit network. In each of these figures the different forgetting factors are numbered: trial runs with forgetting factor 0.75 are numbered 1, those with forgetting factor 0.5 are numbered 2, those with forgetting factor 0.25 are numbered 3, and those with forgetting factor 0.1 are numbered 4.
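Since the weight change constraint γ appears throughout these trials, the rule is worth stating explicitly. The Appendix listings apply it inside weightch(); the following is a minimal stand-alone sketch (the function name is mine): the proposed change is scaled so that its Euclidean norm never exceeds γ.

#include <math.h>

/* Scale the proposed weight change dw[0..n-1] so that its Euclidean
   norm does not exceed the weight change constraint gamma. */
void constrain_change(double dw[], int n, double gamma)
{
  double den = 0.0, fac;
  int i;

  for (i = 0; i < n; ++i)
    den += dw[i] * dw[i];
  den = sqrt(den);
  if (den > gamma) {
    fac = gamma / den;
    for (i = 0; i < n; ++i)
      dw[i] *= fac;
  }
}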
To show the interdependence of the weight change constraint and the forgetting factor, the simulations using the 4-bit parity checker were run again using different weight constraints. The new weight constraints were 0.1 for the node-wise algorithm, and 0.4 for the layer-wise and network-wise algorithms. Results are shown in Table 4.7. Figures 4.22 through 4.24 again show the average errors for different forgetting factors, with the same numbering as above.

For the simple 2-bit network, the performance of the training algorithms does not exhibit much change across different forgetting factors. For the 4-bit network, there is more variance: Table 4.6 shows that a large forgetting factor reduces the number of iterations to convergence. When the weight change constraint is decreased, the performance of the layer-wise and network-wise algorithms improves, and smaller forgetting factors improve performance. The performance of the node-wise algorithm with the smaller weight change constraint is surprising: there are no convergences for any of the forgetting factors tried. Although the results indicate that the layer-wise and network-wise algorithms have very good performance using small forgetting factors and weight change constraints, the most important result may be the fact that the performance of these two algorithms, especially the layer-wise algorithm, is quite impervious to changes in forgetting factors and weight change constraints. This, along with the small initial weights, greatly simplifies implementation.

The simulations presented in the last two sections deal with the problems of weight initialization and the choice of a forgetting factor. 2-bit and 4-bit parity checkers were used for training. The initial weights and the forgetting factor were varied to show that the layer-wise algorithm has good performance with small initial weights and small forgetting factors. The BP and node-wise algorithms were shown to be very sensitive to these variables, and their performance varied greatly as the variables changed.

Table 4.5: Convergence results for various forgetting factors, 2-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

    Forgetting   Node-wise     Layer-wise    Network-wise
    factor       A    B        A    B        A    B
    0.75         97   110.87   96   7.5      84   37.89
    0.5          97   75.98    100  12.28    86   44.51
    0.25         96   69.33    100  9.59     91   52.55
    0.1          96   67.04    98   9.75     84   86.63

[Figure 4.16: Average error in dB for the node-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.]
[Figure 4.17: Average error in dB for the layer-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.]

[Figure 4.18: Average error in dB for the network-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.]

Table 4.6: Convergence results for various forgetting factors, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

    Forgetting   Node-wise     Layer-wise    Network-wise
    factor       A    B        A    B        A    B
    0.75         32   202.94   95   18.6     69   100.75
    0.5          40   224.95   91   28.03    73   119.32
    0.25         40   234.73   93   20.17    50   118.72
    0.1          42   263.98   90   27.26    50   151.1

[Figure 4.19: Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

[Figure 4.20: Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

[Figure 4.21: Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

Table 4.7: Convergence results for various forgetting factors, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

    Forgetting   Node-wise     Layer-wise    Network-wise
    factor       A    B        A    B        A    B
    0.75         0    -        91   28.89    67   106.55
    0.5          0    -        95   24.42    63   93.06
    0.25         0    -        97   31.52    57   132.02
    0.1          0    -        88   26.51    97   160.04

[Figure 4.22: Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]
[Figure 4.23: Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

[Figure 4.24: Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

4.3.3 Small Training Sets

In this section many small training sets are used to compare the BP, node-wise, layer-wise, and network-wise training algorithms. These small training sets had two inputs and from twenty to thirty training patterns each. For each training set, the number of iterations required to correctly classify all the training patterns was determined. Simulations were run using two layer networks for all four algorithms. The number of hidden nodes was varied from two to four for the simple training sets; up to seven hidden nodes were used for the more complicated training sets. Since the network-wise algorithm can have only one output, only simulations for training sets with two classes could be used for it. A combination of training pattern set and number of hidden layer nodes that was not run is left blank; a dash means that the algorithm failed to find weights that correctly classified all the training patterns. The results are shown in Tables 4.8 through 4.11.

Training set A, shown in Figure 4.25, is a simple linearly separable set using training patterns from two classes; the two classes are separated by one of the axes. Training set B is the same as A, only the classes to which the training patterns belong have been interchanged. This was done for many of the training sets: training sets C and D; E and F; G and H; and I and J are likewise similar, differing only in the interchange of the classes of the training patterns. This was done to determine whether the training algorithms favored positive or negative weights. If an algorithm found weights so that the network correctly classifies training set A, then changing the sign on the weights means that the network can now correctly classify training set B (a brief justification is sketched at the end of this subsection). Similarly for training sets C and D; E and F; G and H; and I and J. Training set C is linearly separable, but not by one of the axes as in training set A. Training set E is linearly separable; however, the separating line cannot run through the origin, as it can for training sets A through D. Training set G is the first training set that is not linearly separable. Training set L has training pairs divided into four classes; the training pairs are again linearly separable, and BP and the node-wise algorithm again perform very well. Training set N has ten classes of training pairs, taken from the large training set used in the next section.

The first result to note is the poor performance of the network-wise training algorithm. It seems that the algorithm is interesting from a theoretical point of view, but is not practical for use in training without further work. From the tables, it can be seen that the node-wise and BP algorithms performed very well for the linearly separable training sets.

[Figure 4.25: Training set A (input 1 vs. input 2).]

When the training sets were not linearly separable, however, the advantages of the layer-wise algorithm are apparent. Most interesting is the fact that the numbers of iterations to convergence for the layer-wise algorithm for training sets A through K vary only slightly. Also, as shown in the tables, the layer-wise algorithm is the only one to find weights that correctly classify all training patterns for training set N. This provides evidence for what has been mentioned previously: changing all the weights of the same layer together has a great advantage over changing the weights independently.
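A brief justification of the class-interchange device, for the sigmoid case (my own note): since

\[ S(-a) \;=\; \frac{1}{1+e^{a}} \;=\; 1 - S(a), \]

negating the weights and bias feeding an output node maps its output y to 1 - y. A network that classifies training set A correctly therefore classifies the label-swapped set B correctly after such a sign change, so each A/B pair probes whether an algorithm is biased toward weights of one sign.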
4.3.4 Large Training Set

In this section a large training set is used to compare the BP, node-wise and layer-wise training algorithms. The training set has 519 training pairs. As can be seen in Figure 4.34, the training pairs overlap, and so the correct classification of all training pairs is not possible. The layer-wise algorithm performed the best: it has the lowest error and the most correctly classified training patterns. Table 4.12 shows the number of correctly classified training patterns for different numbers of hidden layer nodes. Figure 4.35 shows the relative errors of the BP algorithm using six, seven, and eight hidden layer nodes. Figures 4.36 and 4.37 show the errors for the node-wise and layer-wise algorithms.

[Figure 4.26: Training set C (input 1 vs. input 2).]
[Figure 4.27: Training set E (input 1 vs. input 2).]
[Figure 4.28: Training set G (input 1 vs. input 2).]
[Figure 4.29: Training set I (input 1 vs. input 2).]
[Figure 4.30: Training set K (input 1 vs. input 2).]
[Figure 4.31: Training set L (input 1 vs. input 2).]
[Figure 4.32: Training set M (input 1 vs. input 2).]
[Figure 4.33: Training set N (input 1 vs. input 2).]

[Table 4.8: Iterations to convergence of the BP algorithm for various training sets and varying number of hidden layer nodes.]

[Table 4.9: Iterations to convergence of the node-wise algorithm for various training sets and varying number of hidden layer nodes.]

[Figure 4.34: Large training set (input 1 vs. input 2).]

[Figure 4.35: Average error vs. iteration number for the BP implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]

After performing these comparative simulations, it is possible to enumerate some of the advantages of the layer-wise algorithm over BP and the node-wise algorithm. First, two problems that arise in training, the selection of the initial weights and of the forgetting factor, have been made easier: the layer-wise algorithm is relatively impervious to changes in these two variables, and they can be chosen small with very good results. Next, the layer-wise algorithm has very good convergence results with relatively complicated training sets. When selecting a training algorithm, many things need to be considered. Most important are speed and convergence performance.
If a training algorithm is very fast but rarely converges, it is not good; likewise, if the algorithm almost always converges but is very slow, it is also not good. The true "speed" of a training algorithm must be determined by how fast it performs each iteration, how many iterations it takes on the average to converge, and how often it converges.

[Figure 4.36: Average error vs. iteration number for the node-wise implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]

[Table 4.10: Iterations to convergence of the layer-wise algorithm for various training sets and varying number of hidden layer nodes.]

4.4 Simulations of the Data Reduction Algorithm

This section contains the results of simulations run using the data reduction algorithm. The data reduction algorithm was implemented along with the layer-wise training algorithm, and the results compare the layer-wise algorithm with and without data reduction. The large data set introduced in the previous section is used for the comparisons. Both algorithms are run using six, seven, and eight hidden layer nodes.

[Table 4.11: Iterations to convergence of the network-wise algorithm for various training sets and varying number of hidden layer nodes.]

Table 4.12: Number of correctly classified training patterns for the large training set for varying number of hidden layer nodes.

    Hidden layer    BP    Node-wise   Layer-wise
    nodes
    6               341   363         427
    7               373   398         428
    8               357   370         360

[Figure 4.37: Average error vs. iteration number for the layer-wise implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]

The data reduction algorithm works by comparing the relative effects of the training patterns on the magnitude of the weight change, and it does not use the training patterns with very small effects. Thus, for implementation, the user must decide the level for accepting and rejecting training patterns at each iteration. The algorithm is implemented by computing the approximation to the upper bound on the weight change for each training pattern and selecting the largest of the approximations. This approximation is divided by a user-defined factor to get a comparison number. All training patterns with approximations larger than this comparison number are used in the weight updating for the present iteration. This is then repeated for the next iteration (a sketch of this selection rule is given below). Figure 4.38 shows the relative error of three implementations of the layer-wise algorithm with six hidden layer nodes: the first has no data reduction; the second has data reduction and a dividing factor of 30; the third has data reduction and a dividing factor of 100. Figure 4.39 shows the number of correctly classified training patterns vs. iteration number for the same implementations. Figures 4.40 and 4.41 are similar, showing the results for seven hidden layer nodes, and Figures 4.42 and 4.43 show the results for eight hidden layer nodes.

[Figure 4.38: Average error vs. iteration number for the data reduction algorithm using 6 hidden layer nodes, training with the large training set. 1. none; 2. 30; 3. 100.]
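The per-iteration selection rule just described can be sketched as follows. The bound itself is the dbdA() estimate listed in the Appendix and is assumed given here; the function and array names are mine.

/* Mark which of the P training patterns to use this iteration.
   bound[p] holds the estimated upper bound on the weight change due
   to pattern p; factor is the user-defined dividing factor (30 and
   100 in the simulations above).  Returns the number selected. */
int select_patterns(const float bound[], int P, float factor, int use[])
{
  float largest = 0.0f, threshold;
  int p, count = 0;

  for (p = 0; p < P; ++p)
    if (bound[p] > largest)
      largest = bound[p];
  threshold = largest / factor;
  for (p = 0; p < P; ++p) {
    use[p] = (bound[p] >= threshold);
    count += use[p];
  }
  return count;
}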
Figure 4.44 shows the number of training patterns used in the hidden layer vs. the iteration number for a dividing factor of 30 and varying hidden layer nodes; Figure 4.45 is similar, but with a dividing factor of 100.

The results show the data reduction algorithm performing very well at the beginning, with the algorithm implemented without data reduction occasionally catching up at later iterations. The results can be misleading, because the data reduction algorithm uses fewer training patterns, and thus performs each iteration faster than the algorithm with no data reduction; if the figures were to compare error vs. time, the data reduction algorithm would be superior. The results indicate that using data reduction is very good at speeding up the algorithm in the first iterations.

[Figure 4.39: Correctly classified training patterns vs. iteration number for the data reduction algorithm using 6 hidden layer nodes and the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.40: Average error vs. iteration number for the data reduction algorithm using 7 hidden layer nodes, training with the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.41: Correctly classified training patterns vs. iteration number for the data reduction algorithm using 7 hidden layer nodes and the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.42: Average error vs. iteration number for the data reduction algorithm using 8 hidden layer nodes, training with the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.43: Correctly classified training patterns vs. iteration number for the data reduction algorithm using 8 hidden layer nodes and the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.44: Number of training patterns used vs. iteration number for the data reduction algorithm and a dividing factor of 30. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]

[Figure 4.45: Number of training patterns used vs. iteration number for the data reduction algorithm and a dividing factor of 100. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]
The error of the data reduction algorithm can actually increase if the number of training patterns used becomes too small. In practice this number can easily be monitored, and if the error increases, more training patterns can be used at the next iteration.

The data reduction algorithm increases the efficiency considerably at the beginning of training because of the method of selecting the training patterns to be used in the updating. The training patterns that have the largest effect on the weight change are typically those that are near the center of the cluster of each class: if we divide the training pairs into classes, those that are near the center of a cluster, away from training pairs of different classes, typically have the largest effect. These training pairs are then used to quickly find weights that classify the centers of the clusters correctly. If the class clusters are well separated, the algorithm will do a good job of classifying all the training patterns; if there is overlap among the clusters, then more training patterns must be used as training proceeds to refine the weights and classify the outliers correctly. The large training set used here has ten different classes with a great deal of overlap, so it is the most difficult challenge for the data reduction algorithm.

The number of training patterns used in the weight updating decreases as training proceeds. This indicates that all the training patterns have a similar effect on the magnitude of the weight change at the beginning of training, with some having less and less effect as training progresses. This confirms the observation that began the investigation into a data reduction algorithm.

The weight updating matrix used for QR-WRLS is largest for the hidden layer. This size is important, because adding even one weight greatly increases the number of rotations in the QR-WRLS algorithm. For example, with eight hidden layer nodes and three inputs (as in the simulation here) there are 24 hidden layer weights. This means a 25 by 25 matrix must be used for updating, and there are 314 rotations for each training pattern. With only one more weight in the hidden layer, the number of rotations for each training pattern increases to 340. This means that for even modest sized FNNs, those having six or more hidden layer nodes, the number of training patterns that must be removed for the data reduction algorithm to be faster than the non-data-reduction algorithm is small. For this example, if fewer than 92 percent of the training patterns are used, the data reduction algorithm is faster.
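The rotation counts above come from the triangularization step: each new linearized pattern is an appended row that must be annihilated against every row of the triangular W matrix, so the work per pattern grows roughly quadratically with the number of weights. The following stand-alone sketch (my own illustration, not the dissertation's rotin() listing) makes the count explicit; for m = 25 it performs 350 elementary rotation operations, the same order as the 314 quoted above, the exact figure depending on the bookkeeping.

#include <math.h>
#define MAXM 26

/* Annihilate the appended row v[0..m-1] (right-hand side rhs) against
   the m-by-m upper-triangular array R (right-hand sides in d[0..m-1])
   using Givens rotations, counting the elementary rotations. */
int givens_append(double R[MAXM][MAXM], double d[], double v[],
                  double rhs, int m)
{
  int j, k, rotations = 0;
  double rho, c, s, tmp;

  for (j = 0; j < m; ++j) {
    rho = sqrt(R[j][j] * R[j][j] + v[j] * v[j]);
    if (rho == 0.0)
      continue;                  /* nothing to annihilate */
    c = R[j][j] / rho;
    s = v[j] / rho;
    for (k = j; k < m; ++k) {    /* rotate the remaining columns */
      tmp = c * R[j][k] + s * v[k];
      v[k] = -s * R[j][k] + c * v[k];
      R[j][k] = tmp;
      ++rotations;
    }
    tmp = c * d[j] + s * rhs;    /* and the right-hand side pair */
    rhs = -s * d[j] + c * rhs;
    d[j] = tmp;
    ++rotations;
  }
  return rotations;
}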
Chapter 5

Conclusions and Future Research

5.1 Algorithmic Developments

5.1.1 Training Speed

New algorithms to train FNNs are developed in this dissertation. The main purpose in developing new algorithms is to make training more efficient. Let us define the speed of an algorithm as the time it takes to converge to a solution. In measuring the speed of an algorithm, we need to know how fast it completes each iteration, how many iterations it takes, on the average, to converge, and how often it converges. Determining the speed of any of the algorithms used to train FNNs is quite difficult, if not impossible. This is due to several factors. First, the algorithms are dependent on initial conditions: for different initial conditions, the same algorithm may converge very fast, slowly, or not at all. Next, the speed of an algorithm depends on the training set being used. This makes it difficult to compare training algorithms. Many initial conditions for the same training set, and many different training sets, must be used to ensure the comparisons are fair. The simulations done in Chapter 4 show that the layer-wise algorithm described in this dissertation is indeed an improvement over previous algorithms. As can be seen from the simulations, if the training set is very simple and linearly separable, then any of the training algorithms can be used to train the FNN. If the training set is more complicated, however, then using the layer-wise algorithm greatly enhances performance.

One way to speed up training is to decrease the average number of iterations to a solution and to increase the number of convergences. Another way is to decrease the time for each iteration. The layer-wise algorithm uses the former method; the data reduction algorithm uses the latter.

5.1.2 Layer-wise Training Algorithm

The layer-wise training algorithm can be viewed as a logical progression from previous training methods. The BP algorithm takes the derivative of the error surface with respect to one weight, and changes this weight to reduce the total output error; thus each weight is changed individually to decrease the error. Next, many algorithms, such as the A-S algorithm and the algorithm developed by Kollias and Anastassiou, take the derivative with respect to the weights connected to each node, and change these weights to reduce the error; here, all the weights connected to one node are changed together. In the layer-wise algorithm the derivative with respect to all the weights in the same layer is taken, and these weights are changed together to reduce the error. The layer-wise algorithm is a linear algorithm in that the nonlinearities of the FNN are linearized around the present weights at each iteration, and this linearized approximation is used to update the weights. This makes possible the use of QR-WRLS for solving the linear system at each iteration.

5.1.3 QR-WRLS

The use of derivatives of order greater than one in the training of FNNs has become popular. The use of higher order derivatives means that the system can be linearized so that the resulting equations are linear, and can be solved for many of the weights simultaneously using some RLS algorithm, typically MIL-WRLS. It was one of the purposes of this dissertation to introduce the QR-WRLS algorithm for use in FNN training algorithms. Future algorithms introduced for the training of FNNs will benefit from implementation with QR-WRLS rather than MIL-WRLS.
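In symbols, the linearization these algorithms share is the first-order expansion of the network output about the current weight estimate. The following is a sketch in generic notation; the dissertation's own symbols are developed in Chapter 2:

\[ y(\mathbf{w}) \;\approx\; y(\mathbf{w}_{i-1}) + \nabla y(\mathbf{w}_{i-1})^T (\mathbf{w}-\mathbf{w}_{i-1}), \]

so that requiring \( y(\mathbf{w}) \approx d \) for a training pattern gives an equation linear in \(\mathbf{w}\),

\[ \nabla y(\mathbf{w}_{i-1})^T\,\mathbf{w} \;\approx\; d - y(\mathbf{w}_{i-1}) + \nabla y(\mathbf{w}_{i-1})^T\,\mathbf{w}_{i-1}, \]

which corresponds to the linearized training pair (z, tp) that the Appendix listings rotate into the W matrices at each iteration.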
5.1.4 Data Reduction Algorithm

In training the FNN it was found that not all the training patterns are useful at each iteration in updating the system weights. A computationally inexpensive way of determining whether a training pattern is useful for updating the weights at the present iteration was developed. This allows fewer training patterns to be used at each iteration, speeding up training.

5.2 Future Research

The most exciting future research lies in the extension of the layer-wise algorithm. As mentioned above, training algorithms have developed from updating one weight at a time, to all the weights connected to one node at a time, to all the weights of the same layer at a time. The next step will be updating all the weights of the network simultaneously. The network-wise algorithm presented in Chapter 2 is a first step in this direction. It is limited, however, in that only one output is allowed. Also, as noted in Chapter 4, its performance is not as good as that of the layer-wise algorithm. To update all the weights of the same layer, only a first order linear approximation to the FNN is needed; in updating all the weights of the network, a higher order approximation may be used.

The set membership algorithm [31],[32] was the first algorithm considered in an attempt to reduce the number of training patterns used at each iteration. This algorithm can be used in linear systems when there is a bound on the output error. The bounded output of the FNN made it seem a natural problem for the application of set membership. It was quickly noted, however, that although the bounded output of the FNN does imply a bound on the error, it does not imply a bound on the solution weights; this occurs because of the nonlinearities in the system. It was because of this that perturbation theory was used. However, set membership theory in another form may still be applicable to the FNN training problem, and this application is a topic of future research.
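To make the set membership idea concrete: for a linear-in-parameters model with error bound γ, the membership set after n patterns is (sketched here in generic notation; see [31],[32] for the precise development)

\[ \Omega_n \;=\; \bigl\{\, \mathbf{w} : \lvert d(k) - \mathbf{x}^T(k)\,\mathbf{w} \rvert \le \gamma,\;\; 1 \le k \le n \,\bigr\}, \]

and a pattern whose constraint cannot shrink \(\Omega_n\) may be discarded without affecting the solution set. It is this pattern-discarding property that first suggested the approach for data reduction; the difficulty noted above is that the FNN's bounded output does not translate into such a bound on the weights.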
Bibliography

[1] T. Sejnowski and C. Rosenberg, "NETtalk: a parallel network that learns to read aloud," Technical Report JHU/EECS-86/01, Johns Hopkins University, 1986.

[2] D. Nguyen and B. Widrow, "The truck backer-upper: an example of self-learning in neural networks," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, pp. 1381-1384, 1990.

[3] A.S. Householder, "The approximate solution of matrix problems," J. Assoc. Comput. Mach., vol. 5, pp. 204-243, 1958.

[4] J.R. Deller, Jr., and D. Hsu, "An alternative adaptive sequential regression algorithm and its application to the recognition of cerebral palsy speech," IEEE Transactions on Circuits and Systems, vol. CAS-34, no. 7, pp. 782-786, July 1987.

[5] D. Graupe, Time Series Analysis, Identification and Adaptive Filtering, Malabar, Florida: Krieger Publishing Company, 1989.

[6] M. Azimi-Sadjadi, S. Citrin, and S. Sheedvash, "Supervised learning process of multi-layer perceptron neural networks using fast least squares," Proceedings of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1381-1384, 1990.

[7] G. Golub and C. Van Loan, Matrix Computations, Baltimore, Maryland: Johns Hopkins University Press, 1983.

[8] J. Sun, W. Grosky, and M. Hassoun, "A fast algorithm for finding global minima of error functions in layered neural networks," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 715-720, 1990.

[9] B. Widrow and M. Lehr, "30 years of adaptive neural networks: perceptron, madaline and backpropagation," Proceedings of the IEEE, vol. 78, no. 9, September 1990.

[10] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Washington, D.C.: Spartan Books, 1962.

[11] R. Watrous, "Learning algorithms for connectionist networks: applied gradient methods of nonlinear optimization," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 2, pp. 619-627, 1987.

[12] D. Rumelhart, G. Hinton, and R. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, vol. 1 (D. Rumelhart and J. McClelland, Eds.), Cambridge, MA: MIT Press, 1986.

[13] R. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Magazine, vol. 4, pp. 4-22, April 1987.

[14] J.R. Deller, Jr., J. Proakis, and J. Hansen, Discrete-Time Processing of Speech Signals, 1992.

[15] L. Johnson and R. Riess, Numerical Analysis, 2nd ed., Reading, MA: Addison-Wesley Publishing Company, 1982.

[16] D. Parker, "Learning logic," Technical Report TR-40, Center for Computational Research in Economics and Management Science, MIT, April 1985.

[17] P. Werbos, "Beyond regression: new tools for prediction and analysis in the behavioral sciences," Ph.D. dissertation, Harvard University, Cambridge, MA, 1974.

[18] F.M.A. Salam, "Neural nets and engineering implementations," keynote address, Proceedings of the 31st Midwest Symposium on Circuits and Systems, St. Louis, MO, Aug. 1988.

[19] M. Azimi-Sadjadi and S. Citrin, "Fast learning process of multi-layer neural nets using recursive least squares technique," IEEE Int. Conf. on Neural Networks, Washington, D.C., June 1989.

[20] E. Dahl, "Accelerated learning using the generalized delta rule," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 2, pp. 523-530, 1987.

[21] R. White, "The learning rate in back-propagation systems: an application of Newton's method," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 679-684, 1990.

[22] D. Parker, "Second order backpropagation: implementing an optimal O(n) approximation to Newton's method as an artificial neural network," IEEE Conf. on Neural Information Processing Systems, Denver, CO, Nov. 1987.

[23] S. Kollias and D. Anastassiou, "An adaptive least squares algorithm for the efficient training of artificial neural networks," IEEE Transactions on Circuits and Systems, vol. CAS-36, pp. 1092-1101, August 1989.

[24] J. Steck, B. McMillin, K. Krishnamurthy, M. Ashouri, and G. Leininger, "Parallel implementation of a recursive least squares neural network training method," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 631-636, 1990.

[25] T. Ghiselli-Crippa and A. El-Jaroudi, "A fast neural net training algorithm and its application to voiced-unvoiced-silence classification of speech," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 441-444, 1991.

[26] B. Kosko, Neural Networks and Fuzzy Systems, Englewood Cliffs, New Jersey: Prentice-Hall, 1992.

[27] B. Widrow and S.D. Stearns, Adaptive Signal Processing, Englewood Cliffs, New Jersey: Prentice-Hall, 1985.

[28] J.R. Deller, Jr., and G. Picache, "Advantages of a Givens rotation approach to temporally recursive linear prediction analysis of speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 429-431, March 1989.

[29] L. Allred, "Supervised learning techniques for backpropagation networks," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 721-728, 1990.

[30] D. Watkins, Fundamentals of Matrix Computations, New York, NY: John Wiley and Sons, 1991.

[31] J.R. Deller, Jr., and T.C. Luk, "Linear prediction analysis of speech based on set-membership theory," Computer Speech and Language, vol. 3, pp. 301-327, 1989.

[32] J.R. Deller, Jr., "Set membership identification in digital signal processing," IEEE ASSP Magazine, pp. 4-20, October 1989.

Appendix

Listings of Training Programs

This Appendix lists the programs developed in this dissertation. The first program listed is the node-wise algorithm; next the layer-wise algorithm is listed; the network-wise algorithm follows; and the last program listed is the layer-wise algorithm with data reduction. All programs implement learning for a two layer network, and all were written in a similar manner. The main program inputs the training set and user-defined variables such as the number of hidden layer nodes and the number of iterations.
The number of input and output nodes is determined by the training set. Here also the weights are initialized, along with the other algorithm variables such as the forgetting factor and weight constraint. AS2(), the main subroutine in each program, keeps track of the iteration number, calculates the outputs of the network, then uses these values to calculate the linearized training values. The first main subroutine is the rotin() subroutine. This routine rotates the linearized training pairs into the correct W matrix. There is one W matrix for each node in the node-wise algorithm; in the layer-wise algorithm there is one W matrix for each node in the output layer, and one W matrix for the hidden layer; the network-wise algorithm uses only one W matrix. The next main subroutine is weightch(), which uses the W matrix to calculate the updated weights. The other two smaller subroutines are ASw() and reinitw(): ASw() multiplies the W matrices by the forgetting factor, and reinitw() initializes the W matrices to zero. The only difference among the node-wise, layer-wise, and network-wise algorithms is in how the AS2() subroutine calculates the linearized training patterns.

The layer-wise algorithm with data reduction is the same as the layer-wise algorithm, with the data reduction included. The first change is in the AS2() subroutine: here, for simplicity of programming, the outputs of the network are calculated twice, first when using the data reduction algorithm to decide which training patterns to use, then when using these training patterns to update the W matrices. There are three additional subroutines. The first is the reinitwpp() subroutine, which initializes the wpp matrix, used in determining which training patterns to use in the updating, to zero. The next subroutine, norms(), calculates the norms for the weights, the w(i) vectors, and the d1(i, N) vectors, here called the b vectors. The final subroutine, dbdA(), calculates the estimate of the weight change vector norm.

The input to all the programs is a training set with the following format:

    number of training patterns,number of inputs,number of outputs
    training pattern 1
    training pattern 2
    ...
    training pattern N

At each iteration the program writes to the screen the iteration number, the total number of training patterns, the number of correctly classified training patterns, and the sum of the squared errors between the training pattern outputs and the actual outputs.
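As a concrete illustration of this format, a hypothetical X-OR (2-bit parity) training set with four patterns, two inputs, and one output, with values comma-separated as the listings' fscanf() calls expect, would be:

    4,2,1
    0,0,0
    0,1,1
    1,0,1
    1,1,0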
Node-Wise Algorithm

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define Nlevels 2

/* This program simulates an N0-input, N2-output, 2-layer feedforward
   neural net using the Azimi-Sadjadi algorithm with G-R (Givens
   rotations) and includes biasing.  Node-wise weight updating. */

int datapts, xdata, ydata, numweight, zed, use, Nloops, N[3];
float x[4][519], t[4][519];
float w[3][10][10], wa[30];
float K[3][10], b[3][10], delta1, delta2;
float W[13][10][10];
float y[3][10], wchange, err1;
float z[30], tp;
float ASweight;
double err;
char outfile[40];
FILE *ifp1, *ofp1, *ofp2, *ofp3;

main()
{
  int k1, k2, k3;
  void AS2();

  printf("Enter name for input file: ");
  gets(outfile);
  if ((ifp1 = fopen(outfile, "r")) == NULL) {
    printf("fopen failed\n");
    exit(0);
  }
  printf("Enter name for output file1: ");
  gets(outfile);
  if ((ofp1 = fopen(outfile, "w")) == NULL) {
    printf("fopen failed\n");
    exit(0);
  }
  printf("Enter number of hidden nodes: ");
  scanf("%d", &N[1]);

  /* Here the input training set is read. */
  fscanf(ifp1, "%d,%d,%d\n", &datapts, &xdata, &ydata);
  N[0] = xdata;
  N[2] = ydata;
  printf("datapts=%d xdata=%d ydata=%d\n", datapts, xdata, ydata);
  for (k1 = 0; k1 < datapts; ++k1) {
    for (k2 = 0; k2 < xdata; ++k2)
      fscanf(ifp1, "%f,", &x[k2][k1]);
    for (k2 = 0; k2 < ydata; ++k2)
      fscanf(ifp1, "%f,", &t[k2][k1]);
    fscanf(ifp1, "\n");
  }
  printf("Enter maximum number of loops: ");
  scanf("%d", &Nloops);
  ASweight = 0.1;

  /* The bias node outputs are fixed at one; all weights start small. */
  y[1][N[1]] = 1;
  y[0][N[0]] = 1;
  for (k1 = 1; k1 < Nlevels + 1; ++k1)
    for (k2 = 0; k2 < N[k1] + 1; ++k2)
      for (k3 = 0; k3 < N[k1-1] + 1; ++k3)
        w[k1][k2][k3] = 0.001;
  reinitw();
  wchange = 0.10;
  AS2();
  fprintf(ofp1, "%d,%d,%d\n", N[0], N[1], N[2]);
  /* for (k1 = 1; k1 < Nlevels + 1; ++k1)
       for (k2 = 0; k2 < N[k1]; ++k2)
         for (k3 = 0; k3 < N[k1-1] + 1; ++k3) {
           fprintf(ofp1, "%f\n", w[k1][k2][k3]);
           printf("w[%d][%d][%d]=%f\n", k1, k2, k3, w[k1][k2][k3]);
         } */
  fclose(ifp1);
  fclose(ofp1);
}

void AS2()
{
  int n1, n2, n3, count, loop, data, test;

  for (loop = 0; loop < Nloops; ++loop) {
    zed = 0;
    err = 0;
    use = 0;
    ASw();
    for (data = 0; data < datapts; ++data) {
      for (n1 = 0; n1 < N[0]; ++n1)
        y[0][n1] = x[n1][data];

      /* Below the outputs of the nodes and the derivatives are
         calculated. */
      for (n1 = 1; n1 < Nlevels + 1; ++n1) {
        for (n2 = 0; n2 < N[n1]; ++n2) {
          y[n1][n2] = 0;
          for (n3 = 0; n3 < N[n1-1] + 1; ++n3)
            y[n1][n2] += y[n1-1][n3] * w[n1][n2][n3];
          if (y[n1][n2] < -20) y[n1][n2] = -20;
          b[n1][n2] = y[n1][n2];
          y[n1][n2] = 1 / (1 + exp(-y[n1][n2]));
          K[n1][n2] = y[n1][n2] * (1 - y[n1][n2]);
          b[n1][n2] = y[n1][n2] - K[n1][n2] * b[n1][n2];
        }
      }

      /* Here the linearized inputs and outputs of the last layer are
         calculated. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        if (t[n1][data] == 0)      tp = 0.05 - b[Nlevels][n1];
        else if (t[n1][data] == 1) tp = 0.95 - b[Nlevels][n1];
        else                       tp = t[n1][data] - b[Nlevels][n1];
        for (n2 = 0; n2 < N[Nlevels-1] + 1; ++n2)
          z[n2] = K[Nlevels][n1] * y[Nlevels-1][n2];
        rotin(N[Nlevels-1] + 1, n1 + 1);
      }

      /* Below the linearized inputs and outputs of the hidden layer
         are calculated. */
      for (n2 = 0; n2 < N[Nlevels-1]; ++n2) {
        tp = 0;
        for (n1 = 0; n1 < N[Nlevels]; ++n1)
          tp += (t[n1][data] - y[Nlevels][n1]) * K[Nlevels][n1]
                * w[Nlevels][n1][n2];
        tp += y[Nlevels-1][n2];
        tp = tp - b[Nlevels-1][n2];
        for (n3 = 0; n3 < N[Nlevels-2] + 1; ++n3)
          z[n3] = K[Nlevels-1][n2] * y[Nlevels-2][n3];
        rotin(N[Nlevels-2] + 1, n2 + N[Nlevels] + 1);
      }

      /* The sum of squared errors is calculated below. */
      for (n1 = 0, err1 = 0, test = 0; n1 < ydata; ++n1) {
        err1 += (t[n1][data] - y[2][n1]) * (t[n1][data] - y[2][n1]);
        if (fabs(t[n1][data] - y[2][n1]) >= 0.5) test = 1;
      }
      if (test == 0) zed++;
      err += err1;

      /* Below the new weights for the last layer are calculated. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        weightch(Nlevels, N[Nlevels-1] + 1, n1 + 1);
        for (n2 = 0; n2 < N[Nlevels-1] + 1; ++n2)
          w[Nlevels][n1][n2] = wa[n2];
      }

      /* Below the new weights for the hidden layer are calculated. */
      for (n2 = 0; n2 < N[Nlevels-1]; ++n2) {
        weightch(Nlevels - 1, N[Nlevels-2] + 1, n2 + N[Nlevels] + 1);
        for (n3 = 0; n3 < N[Nlevels-2] + 1; ++n3)
          w[Nlevels-1][n2][n3] = wa[n3];
      }
      printf("loop=%d data=%d zed=%d err[data]=%f\n",
             loop, data, zed, err);
    }
    fprintf(ofp1, "loop=%d zed=%d err=%f\n", loop, zed, err);
  }
}

/* The following subroutine rotates the linearized training patterns
   into the W matrices using QR decomposition. */
rotin(MATSIZE, mat)
int MATSIZE, mat;
{
  int k1, k2;
  float rho, sigma, tau, Wp[30][30];

  /* Append the linearized pattern (z, tp) as a new row. */
  for (k2 = 0; k2 < MATSIZE; ++k2)
    W[mat][MATSIZE][k2] = z[k2];
  W[mat][MATSIZE][MATSIZE] = tp;
  for (k2 = 0; k2 < MATSIZE + 1; ++k2)
    for (k1 = 0; k1 < MATSIZE + 1; ++k1)
      Wp[k2][k1] = 0;

  /* Annihilate the appended row with Givens rotations. */
  for (k2 = 0; k2 < MATSIZE; ++k2) {
    rho = sqrt(W[mat][k2][k2] * W[mat][k2][k2]
             + W[mat][MATSIZE][k2] * W[mat][MATSIZE][k2]);
    if (rho == 0) goto below4;
    sigma = W[mat][k2][k2] / rho;
    tau = W[mat][MATSIZE][k2] / rho;
    for (k1 = k2; k1 < MATSIZE + 1; ++k1) {
      Wp[k2][k1] = W[mat][k2][k1] * sigma + W[mat][MATSIZE][k1] * tau;
      Wp[MATSIZE][k1] = -W[mat][k2][k1] * tau
                        + W[mat][MATSIZE][k1] * sigma;
    }
    Wp[MATSIZE][k2] = 0;
    for (k1 = 0; k1 < MATSIZE + 1; ++k1) {
      W[mat][k2][k1] = Wp[k2][k1];
      W[mat][MATSIZE][k1] = Wp[MATSIZE][k1];
    }
below4: ;
  }
}

/* The following subroutine re-initializes the W matrices. */
reinitw()
{
  int n1, n2, n3;
  for (n1 = 0; n1 < 13; ++n1)
    for (n2 = 0; n2 < 10; ++n2)
      for (n3 = 0; n3 < 10; ++n3)
        W[n1][n2][n3] = 0;
}

/* The following subroutine uses the W matrices to calculate the
   updated weights. */
weightch(level, MATSIZE, mat)
int level, MATSIZE, mat;
{
  double wchangep[30], a[30], wp[30], den, fac;
  int n1, n2, zt;

  zt = 0;
  /* Copy the present weights of the node being updated. */
  if (level == 2)
    for (n1 = 0; n1 < MATSIZE; ++n1)
      wp[n1] = w[level][mat-1][n1];
  if (level == 1)
    for (n1 = 0; n1 < MATSIZE; ++n1)
      wp[n1] = w[level][mat - N[Nlevels] - 1][n1];

  /* Solve the triangular system by back-substitution. */
  if (fabs(W[mat][MATSIZE-1][MATSIZE-1]) > 1e-28)
    a[MATSIZE-1] = W[mat][MATSIZE-1][MATSIZE]
                   / W[mat][MATSIZE-1][MATSIZE-1];
  else {
    printf("no weight update level %d %d\n", level, MATSIZE);
    a[MATSIZE-1] = wp[MATSIZE-1];
  }
  for (n1 = MATSIZE - 2; n1 >= 0; --n1) {
    if (fabs(W[mat][n1][n1]) > 1e-28) {
      a[n1] = W[mat][n1][MATSIZE];
      for (n2 = MATSIZE - 1; n2 > n1; --n2)
        a[n1] -= W[mat][n1][n2] * a[n2];
      a[n1] /= W[mat][n1][n1];
    }
    else {
      printf("no weight update level %d %d\n", level, MATSIZE);
      a[n1] = wp[n1];
    }
  }

  /* Constrain the norm of the weight change to wchange. */
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wchangep[n1] = a[n1] - wp[n1];
  den = 0;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    den += wchangep[n1] * wchangep[n1];
  den = sqrt(den);
  if (den > wchange) fac = wchange / den;
  else fac = 1;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wchangep[n1] *= fac;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wa[n1] = wp[n1] + wchangep[n1];

  /* Bound the weights. */
  for (n1 = 0; n1 < MATSIZE; ++n1) {
    if (wa[n1] > 700)  { wa[n1] = 700;  zt = 1; }
    if (wa[n1] < -700) { wa[n1] = -700; zt = 1; }
  }
  if (zt == 1) printf("weight at 700\n");
}

/* The following subroutine implements the forgetting factor by
   multiplying the W matrices by the constant ASweight. */
ASw()
{
  int n1, n2, n3;
  for (n1 = 0; n1 < 13; ++n1)
    for (n2 = 0; n2 < 10; ++n2)
      for (n3 = 0; n3 < 10; ++n3)
        W[n1][n2][n3] *= ASweight;
}

Layer-Wise Algorithm

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define Nlevels 2

/* This program simulates an N0-input, N2-output, 2-layer feedforward
   neural net using the Azimi-Sadjadi algorithm with G-R (Givens
   rotations) and includes biasing.  Layer-wise weight updating. */

int datapts, xdata, ydata, numweight, zed, use, Nloops, N[3];
float x[4][519], t[4][519];
float w[3][10][10], wa[30];
float K[3][10], b[3][10], delta1, delta2;
float W[7][30][30];
float y[3][10], wchange, err1;
float z[30], tp;
float ASweight;
double err;
char outfile[40];
FILE *ifp1, *ofp1, *ofp2, *ofp3;

main()
{
  int k1, k2, k3;
  void AS2();

  printf("Enter name for input file: ");
  gets(outfile);
  if ((ifp1 = fopen(outfile, "r")) == NULL) {
    printf("fopen failed\n");
    exit(0);
  }
  /* printf("Enter name for output file1: ");
     gets(outfile);
     if ((ofp1 = fopen(outfile, "w")) == NULL) {
       printf("fopen failed\n");
       exit(0);
     } */
  printf("Enter number of hidden nodes: ");
  scanf("%d", &N[1]);

  /* Here the input training set is read. */
  fscanf(ifp1, "%d,%d,%d\n", &datapts, &xdata, &ydata);
  N[0] = xdata;
  N[2] = ydata;
  printf("datapts=%d xdata=%d ydata=%d\n", datapts, xdata, ydata);
  for (k1 = 0; k1 < datapts; ++k1) {
    for (k2 = 0; k2 < xdata; ++k2)
      fscanf(ifp1, "%f,", &x[k2][k1]);
    for (k2 = 0; k2 < ydata; ++k2)
      fscanf(ifp1, "%f,", &t[k2][k1]);
    fscanf(ifp1, "\n");
  }
  printf("Enter maximum number of loops: ");
  scanf("%d", &Nloops);
  ASweight = 0.1;

  /* Below the weights are initialized. */
  y[1][N[1]] = 1;
  y[0][N[0]] = 1;
  for (k1 = 1; k1 < Nlevels + 1; ++k1)
    for (k2 = 0; k2 < N[k1] + 1; ++k2)
      for (k3 = 0; k3 < N[k1-1] + 1; ++k3)
        w[k1][k2][k3] = 0.001;
  reinitw();
  wchange = 0.4;
  AS2();
  /* fprintf(ofp1, "%d,%d,%d\n", N[0], N[1], N[2]);
     for (k1 = 1; k1 < Nlevels + 1; ++k1)
       for (k2 = 0; k2 < N[k1]; ++k2)
         for (k3 = 0; k3 < N[k1-1] + 1; ++k3) {
           fprintf(ofp1, "%f\n", w[k1][k2][k3]);
           printf("w[%d][%d][%d]=%f\n", k1, k2, k3, w[k1][k2][k3]);
         } */
  fclose(ifp1);
}

void AS2()
{
  int n1, n2, n3, count, loop, data, test;

  for (loop = 0; loop < Nloops; ++loop) {
    zed = 0;
    err = 0;
    use = 0;
    ASw();
    for (data = 0; data < datapts; ++data) {
      for (n1 = 0; n1 < N[0]; ++n1)
        y[0][n1] = x[n1][data];

      /* Below the outputs of the nodes and the derivatives are
         calculated. */
      for (n1 = 1; n1 < Nlevels + 1; ++n1) {
        for (n2 = 0; n2 < N[n1]; ++n2) {
          y[n1][n2] = 0;
          for (n3 = 0; n3 < N[n1-1] + 1; ++n3)
            y[n1][n2] += y[n1-1][n3] * w[n1][n2][n3];
          if (y[n1][n2] < -20) y[n1][n2] = -20;
          b[n1][n2] = y[n1][n2];
          y[n1][n2] = 1 / (1 + exp(-y[n1][n2]));
          K[n1][n2] = y[n1][n2] * (1 - y[n1][n2]);
          b[n1][n2] = y[n1][n2] - K[n1][n2] * b[n1][n2];
        }
      }

      /* Here the linearized inputs and outputs of the last layer are
         calculated. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        tp = t[n1][data] - b[Nlevels][n1];
        for (n2 = 0; n2 < N[Nlevels-1] + 1; ++n2)
          z[n2] = K[Nlevels][n1] * y[Nlevels-1][n2];
        rotin(N[Nlevels-1] + 1, n1 + 1);
      }

      /* Below the linearized inputs and outputs of the hidden layer
         are calculated; all hidden weights share one W matrix. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        count = 0;
        tp = t[n1][data] - b[Nlevels][n1];
        for (n2 = 0; n2 < N[Nlevels-1]; ++n2) {
          tp -= K[Nlevels][n1] * w[Nlevels][n1][n2] * b[Nlevels-1][n2];
          for (n3 = 0; n3 < N[Nlevels-2] + 1; ++n3) {
            z[count] = K[Nlevels][n1] * K[Nlevels-1][n2]
                       * w[Nlevels][n1][n2] * y[Nlevels-2][n3];
            ++count;
          }
        }
        /* On loop exit n2 indexes the hidden layer bias node. */
        tp -= K[Nlevels][n1] * w[Nlevels][n1][n2] * y[Nlevels-1][n2];
        rotin(N[Nlevels-1] * (N[Nlevels-2] + 1), 0);
      }

      /* The sum of squared errors is calculated below. */
      for (n1 = 0, err1 = 0, test = 0; n1 < ydata; ++n1) {
        err1 += (t[n1][data] - y[2][n1]) * (t[n1][data] - y[2][n1]);
        if (fabs(t[n1][data] - y[2][n1]) >= 0.5) test = 1;
      }
      if (test == 0) zed++;
      err += err1;

      /* Below the new weights for the last layer are calculated. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        weightch(Nlevels, N[Nlevels-1] + 1, n1 + 1);
        for (n2 = 0; n2 < N[Nlevels-1] + 1; ++n2)
          w[Nlevels][n1][n2] = wa[n2];
      }

      /* Below the new weights for the hidden layer are calculated,
         all at once. */
      weightch(Nlevels - 1, N[Nlevels-1] * (N[Nlevels-2] + 1), 0);
      count = 0;
      for (n1 = 0; n1 < N[Nlevels-1]; ++n1) {
        for (n2 = 0; n2 < N[Nlevels-2] + 1; ++n2) {
          w[Nlevels-1][n1][n2] = wa[count];
          ++count;
        }
      }
      printf("loop=%d data=%d zed=%d err[data]=%f\n",
             loop, data, zed, err);
    }
  }
}

/* The following subroutine rotates the linearized training patterns
   into the W matrices using QR decomposition. */
rotin(MATSIZE, mat)
int MATSIZE, mat;
{
  int k1, k2;
  float rho, sigma, tau, Wp[30][30];

  /* Append the linearized pattern (z, tp) as a new row. */
  for (k2 = 0; k2 < MATSIZE; ++k2)
    W[mat][MATSIZE][k2] = z[k2];
  W[mat][MATSIZE][MATSIZE] = tp;
  for (k2 = 0; k2 < MATSIZE + 1; ++k2)
    for (k1 = 0; k1 < MATSIZE + 1; ++k1)
      Wp[k2][k1] = 0;

  /* Annihilate the appended row with Givens rotations. */
  for (k2 = 0; k2 < MATSIZE; ++k2) {
    rho = sqrt(W[mat][k2][k2] * W[mat][k2][k2]
             + W[mat][MATSIZE][k2] * W[mat][MATSIZE][k2]);
    if (rho == 0) goto below4;
    sigma = W[mat][k2][k2] / rho;
    tau = W[mat][MATSIZE][k2] / rho;
    for (k1 = k2; k1 < MATSIZE + 1; ++k1) {
      Wp[k2][k1] = W[mat][k2][k1] * sigma + W[mat][MATSIZE][k1] * tau;
      Wp[MATSIZE][k1] = -W[mat][k2][k1] * tau
                        + W[mat][MATSIZE][k1] * sigma;
    }
    Wp[MATSIZE][k2] = 0;
    for (k1 = 0; k1 < MATSIZE + 1; ++k1) {
      W[mat][k2][k1] = Wp[k2][k1];
      W[mat][MATSIZE][k1] = Wp[MATSIZE][k1];
    }
below4: ;
  }
}

/* The following subroutine re-initializes the W matrices. */
reinitw()
{
  int n1, n2, n3;
  for (n1 = 0; n1 < 7; ++n1)
    for (n2 = 0; n2 < 30; ++n2)
      for (n3 = 0; n3 < 30; ++n3)
        W[n1][n2][n3] = 0;
}

/* The following subroutine uses the W matrices to calculate the
   updated weights. */
weightch(level, MATSIZE, mat)
int level, MATSIZE, mat;
{
  double wchangep[30], a[30], wp[30], den, fac;
  int n1, n2, zt, zer, counts;

  zt = 0;
  zer = 0;
  /* Copy the present weights being updated: one output node for
     level 2, the whole hidden layer for level 1. */
  if (level == 2)
    for (n1 = 0; n1 < MATSIZE; ++n1)
      wp[n1] = w[level][mat-1][n1];
  if (level == 1) {
    counts = 0;
    for (n1 = 0; n1 < N[Nlevels-1]; ++n1)
      for (n2 = 0; n2 < N[Nlevels-2] + 1; ++n2) {
        wp[counts] = w[level][n1][n2];
        ++counts;
      }
  }

  /* Check the diagonal, then solve by back-substitution. */
  for (n1 = 0; n1 < MATSIZE; ++n1)
    if (fabs(W[mat][n1][n1]) < 1e-28) zer = 1;
  if (zer == 0) {
    a[MATSIZE-1] = W[mat][MATSIZE-1][MATSIZE]
                   / W[mat][MATSIZE-1][MATSIZE-1];
    for (n1 = MATSIZE - 2; n1 >= 0; --n1) {
      a[n1] = W[mat][n1][MATSIZE];
      for (n2 = MATSIZE - 1; n2 > n1; --n2)
        a[n1] -= W[mat][n1][n2] * a[n2];
      a[n1] /= W[mat][n1][n1];
    }
  }
  else {
    printf("no weight update level %d\n", level);
    for (n1 = 0; n1 < MATSIZE; ++n1)
      a[n1] = wp[n1];
  }

  /* Constrain the norm of the weight change to wchange. */
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wchangep[n1] = a[n1] - wp[n1];
  den = 0;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    den += wchangep[n1] * wchangep[n1];
  den = sqrt(den);
  if (den > wchange) fac = wchange / den;
  else fac = 1;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wchangep[n1] *= fac;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wa[n1] = wp[n1] + wchangep[n1];

  /* Bound the weights. */
  for (n1 = 0; n1 < MATSIZE; ++n1) {
    if (wa[n1] > 700)  { wa[n1] = 700;  zt = 1; }
    if (wa[n1] < -700) { wa[n1] = -700; zt = 1; }
  }
  if (zt == 1) printf("weight at 700\n");
}

/* The following subroutine implements the forgetting factor by
   multiplying the W matrices by the constant ASweight. */
ASw()
{
  int n1, n2, n3;
  for (n1 = 0; n1 < 7; ++n1)
    for (n2 = 0; n2 < 30; ++n2)
      for (n3 = 0; n3 < 30; ++n3)
        W[n1][n2][n3] *= ASweight;
}

Network-Wise Algorithm

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define Nlevels 2

/* This program simulates an N0-input, N2-output, 2-layer feedforward
   neural net using the Azimi-Sadjadi algorithm with G-R (Givens
   rotations) and includes biasing.  Network-wise weight updating. */
Network-Wise Algorithm

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#define Nlevels 2
/*
This program simulates a N0-input, N2-output 2-layer, feedforward
neural net using the Azimi-Sadjadi algorithm with QR decomposition and
includes biasing.  Network-wise updating.
*/
int datapts,xdata,ydata,numweight,zed,use,Nloops,N[3];
float x[4][519],t[4][519];
float w[3][10][10],wa[30],wpp[10][10];
float K[3][10],b[3][10],delta1,delta2;
float W[1][50][50];
float y[3][10],wchange,err1;
float z[50],tp;
float ASweight;
double err/*,E[1000]*/;
char outfile[40];
FILE *ifp1,*ofp1,*ofp2,*ofp3;

main()
{
int k1,k2,k3;
void A32();
printf("Enter name for input file: ");
gets(outfile);
if((ifp1=fopen(outfile,"r"))==NULL){
  printf("fopen failed\n");
  exit(0);
}
/*printf("Enter name for output file1: ");
gets(outfile);
if((ofp1=fopen(outfile,"w"))==NULL){
  printf("fopen failed\n");
  exit(0);
}*/
printf("Enter number of hidden nodes: ");
scanf("%d",&N[1]);

Here the input training set is read.

fscanf(ifp1," %d,%d,%d\n",&datapts,&xdata,&ydata);
N[0]=xdata;
N[2]=ydata;
printf("datapts=%d xdata=%d ydata=%d\n", datapts,xdata,ydata);
for(k1=0;k1<datapts;++k1){
  for(k2=0;k2<xdata;++k2){
    fscanf(ifp1," %f,",&x[k2][k1]);
  }
  for(k2=0;k2<ydata;++k2){
    fscanf(ifp1," %f,",&t[k2][k1]);
  }
  fscanf(ifp1,"\n");
}
printf("Enter maximum number of loops: ");
scanf("%d",&Nloops);
ASweight=0.1;
y[1][N[1]]=1;
y[0][N[0]]=1;
for(k1=1;k1<Nlevels+1;++k1){
  for(k2=0;k2<N[k1]+1;++k2){
    for(k3=0;k3<N[k1-1]+1;++k3){
      w[k1][k2][k3]=0.001;
    }
  }
}
reinitw();
wchange=.4;
A32();
if(ofp1!=NULL)fprintf(ofp1," %d,%d,%d\n",N[0],N[1],N[2]);
/*for(k1=1;k1<Nlevels+1;++k1){
  for(k2=0;k2<N[k1];++k2){
    for(k3=0;k3<N[k1-1]+1;++k3){
      fprintf(ofp1," %f\n",w[k1][k2][k3]);
      printf("w[%d][%d][%d]=%f\n", k1,k2,k3,w[k1][k2][k3]);
    }
  }
}*/
fclose(ifp1);
if(ofp1!=NULL)fclose(ofp1);
}

void A32()
{
int n1,n2,n3,count,loop,data,test;
for(loop=0;loop<Nloops;++loop){
  zed=0;
  err=0;
  use=0;
  ASw();
  for(data=0;data<datapts;++data){
    for(n1=0;n1<N[0];++n1){
      y[0][n1]=x[n1][data];
    }

Below the outputs of the nodes and the derivatives are calculated.

    for(n1=1;n1<Nlevels+1;++n1){
      for(n2=0;n2<N[n1];++n2){
        y[n1][n2]=0;
        for(n3=0;n3<N[n1-1]+1;++n3){
          y[n1][n2]+=y[n1-1][n3]*w[n1][n2][n3];
        }
        if(y[n1][n2]<-20){y[n1][n2]=-20;}
        b[n1][n2]=y[n1][n2];
        y[n1][n2]=1/(1+exp(-y[n1][n2]));
        K[n1][n2]=y[n1][n2]*(1-y[n1][n2]);
        b[n1][n2]=y[n1][n2]-K[n1][n2]*b[n1][n2];
      }
    }
    count=0;

Here the linearized inputs and outputs are calculated.

    tp=t[0][data]-b[Nlevels][0];
    for(n2=0;n2<N[Nlevels-1];++n2){
      z[count]=K[Nlevels][0]*b[Nlevels-1][n2];
      ++count;
    }
    z[count]=K[Nlevels][0]*y[Nlevels-1][N[Nlevels-1]];
    ++count;
    for(n2=0;n2<N[Nlevels-1];++n2){
      for(n3=0;n3<N[Nlevels-2]+1;++n3){
        z[count]=K[Nlevels][0]*K[Nlevels-1][n2]*y[Nlevels-2][n3];
        ++count;
      }
    }
    rotin(N[Nlevels-1]+1+(N[Nlevels-1]*(N[Nlevels-2]+1)),0);

The sum of square errors is calculated below.

    for(n1=0,err1=0,test=0;n1<ydata;++n1){
      err1+=(t[n1][data]-y[2][n1])*(t[n1][data]-y[2][n1]);
      if(fabs(t[n1][data]-y[2][n1]) >= 0.5){test=1;}
    }
    if(test==0){zed++;}
    err+=err1;

Below the new weights are calculated.

    weightch(N[Nlevels-1]+1+(N[Nlevels-1]*(N[Nlevels-2]+1)),0);
    count=0;
    for(n2=0;n2<N[Nlevels-1]+1;++n2){
      w[Nlevels][0][n2]=wa[count];
      ++count;
    }
    for(n1=0;n1<N[Nlevels-1];++n1){
      for(n2=0;n2<N[Nlevels-2]+1;++n2){
        if(fabs(w[Nlevels][0][n1])>1e-10){
          wpp[n1][n2]=wa[count];
          w[Nlevels-1][n1][n2]=wpp[n1][n2]/w[Nlevels][0][n1];
        }
        ++count;
      }
    }
    printf("loop=%d data=%d zed=%d err[data]=%f\n", loop,data,zed,err);
  }
}
}
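A point worth making explicit about the network-wise listing: unlike the layer-wise version, the hidden-layer regressor entries K2*K1*y0 built above are not multiplied by the output weight, so the parameters returned for those positions are products of output and hidden weights. In our notation (the symbols are ours, reconstructed from the code, with the single output node indexed 0 suppressed), the linearized model is

\[
t - b^{(2)} \;\approx\; \sum_{j} w^{(2)}_{j}\bigl(K^{(2)} b^{(1)}_{j}\bigr)
\;+\; w^{(2)}_{N_1}\bigl(K^{(2)}\cdot 1\bigr)
\;+\; \sum_{j,k} \theta_{jk}\bigl(K^{(2)} K^{(1)}_{j}\, y^{(0)}_{k}\bigr),
\qquad \theta_{jk} \;=\; w^{(2)}_{j}\, w^{(1)}_{jk},
\]

and the hidden weights are recovered by the division

\[
\hat{w}^{(1)}_{jk} \;=\; \hat{\theta}_{jk}\big/\hat{w}^{(2)}_{j},
\qquad \bigl|\hat{w}^{(2)}_{j}\bigr| > 10^{-10},
\]

which is exactly what the loop at the end of A32() implements, guarded by the 1e-10 test.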
The following subroutine rotates the linearized training patterns into the W matrices using QR decomposition.

rotin(MATSIZE,mat)
int MATSIZE,mat;
{
int k1,k2;
float rho,sigma,tau,Wp[50][50];
for(k2=0;k2<MATSIZE;++k2){
  W[mat][MATSIZE][k2]=z[k2];
}
W[mat][MATSIZE][MATSIZE]=tp;
for(k2=0;k2<MATSIZE+1;++k2){
  for(k1=0;k1<MATSIZE+1;++k1){
    Wp[k2][k1]=0;
  }
}
for(k2=0;k2<MATSIZE;++k2){
  rho=sqrt(W[mat][k2][k2]*W[mat][k2][k2]+W[mat][MATSIZE][k2]*W[mat][MATSIZE][k2]);
  if(rho == 0){/*if(mat==0){printf("below\n");}*/goto below4;}
  sigma=W[mat][k2][k2]/rho;
  tau=W[mat][MATSIZE][k2]/rho;
  for(k1=k2;k1<MATSIZE+1;++k1){
    Wp[k2][k1]=W[mat][k2][k1]*sigma+W[mat][MATSIZE][k1]*tau;
    Wp[MATSIZE][k1]=-W[mat][k2][k1]*tau+W[mat][MATSIZE][k1]*sigma;
  }
  Wp[MATSIZE][k2]=0;
  for(k1=0;k1<MATSIZE+1;++k1){
    W[mat][k2][k1]=Wp[k2][k1];
    W[mat][MATSIZE][k1]=Wp[MATSIZE][k1];
  }
  below4:;
}
}

The following subroutine re-initializes the W matrices.

reinitw()
{
int n1,n2,n3;
for(n1=0;n1<1;++n1){
  for(n2=0;n2<50;++n2){
    for(n3=0;n3<50;++n3){
      W[n1][n2][n3]=0;
    }
  }
}
}

The following subroutine uses the W matrices to calculate the updated weights.

weightch(MATSIZE,mat)
int MATSIZE,mat;
{
double wchangep[50],a[50],wp[50],den,fac;
int zt,n1,n2,counts;
zt=0;
counts=0;
for(n2=0;n2<N[Nlevels-1]+1;++n2){
  wp[counts]=w[Nlevels][0][n2];
  ++counts;
}
for(n1=0;n1<N[Nlevels-1];++n1){
  for(n2=0;n2<N[Nlevels-2]+1;++n2){
    wp[counts]=wpp[n1][n2];
    ++counts;
  }
}
if(fabs(W[mat][MATSIZE-1][MATSIZE-1])>1e-28){
  a[MATSIZE-1]=W[mat][MATSIZE-1][MATSIZE]/W[mat][MATSIZE-1][MATSIZE-1];
}
else{
  printf("no weight update %d\n",MATSIZE);
  a[MATSIZE-1]=wp[MATSIZE-1];
}
for(n1=MATSIZE-2;n1>=0;--n1){
  if(fabs(W[mat][n1][n1])>1e-28){
    a[n1]=W[mat][n1][MATSIZE];
    for(n2=MATSIZE-1;n2>n1;--n2){
      a[n1]-=W[mat][n1][n2]*a[n2];
    }
    a[n1]/=W[mat][n1][n1];
  }
  else{
    printf("no weight update %d\n",MATSIZE);
    a[n1]=wp[n1];
  }
}
for(n1=0;n1<MATSIZE;++n1){
  wchangep[n1]=a[n1]-wp[n1];
}
den=0;
for(n1=0;n1<MATSIZE;++n1){
  den+=wchangep[n1]*wchangep[n1];
}
den=sqrt(den);
if(den > wchange){fac=wchange/den;}
else{fac=1;}
for(n1=0;n1<MATSIZE;++n1){
  wchangep[n1]*=fac;
}
for(n1=0;n1<MATSIZE;++n1){
  wa[n1]=wp[n1]+wchangep[n1];
}
for(n1=0,zt=0;n1<MATSIZE;++n1){
  if(wa[n1]>700){wa[n1]=700;zt=1;}
  if(wa[n1]<-700){wa[n1]=-700;zt=1;}
}
if(zt==1){printf("weight at 700\n");}
}

The following subroutine implements the forgetting factor by multiplying the W matrices by the constant ASweight.

ASw()
{
int n1,n2,n3;
for(n1=0;n1<1;++n1){
  for(n2=0;n2<50;++n2){
    for(n3=0;n3<50;++n3){
      W[n1][n2][n3]*=ASweight;
    }
  }
}
}
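Both weightch() routines solve their upper-triangular systems by back substitution, and the logic is easy to lose among the singularity guards. Here is the bare solve as a minimal sketch of our own (back_substitute and NCOLS are hypothetical names, not from the listing):

#define NCOLS 50   /* matches the network-wise W arrays; 30 for the others */

/* Sketch (ours): the back substitution buried in weightch().  R holds
   the n x n upper triangle built by rotin(); column n of the same
   array holds the rotated right-hand side.  Solves R a = rhs,
   assuming every diagonal entry passes the 1e-28 singularity test.   */
void back_substitute(float R[][NCOLS], int n, double a[])
{
    int i, j;
    a[n-1] = R[n-1][n] / R[n-1][n-1];
    for (i = n - 2; i >= 0; --i) {
        a[i] = R[i][n];
        for (j = n - 1; j > i; --j)
            a[i] -= R[i][j] * a[j];
        a[i] /= R[i][i];
    }
}

The 1e-28 diagonal tests in the listings simply decide whether this solve is attempted or the previous weights are kept.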
Layer-Wise Algorithm with Data Reduction Incorporated

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
/*
This program simulates a N0-input, N2-output 2-layer, feedforward
neural net using the Azimi-Sadjadi algorithm with QR decomposition and
includes biasing.  Includes data reduction.
*/
int datapts,xdata,ydata,numweight,zed,use[6],Nloops,N[3],Nlevels;
float x[2][519],t[4][519],dwest[5][519],delta1,delta2;
float w[3][10][10],wa[30],b[3][10],K[3][10],W[5][30][30],wpp[30];
float y[3][10],wchange,err1,z[30],tp,ASweight;
float wnorm[5],bnorm[5],wb[5][30],olderr;
double err,wow[6];
char outfile[40];
FILE *ifp1,*ofp1;

main()
{
int k1,k2,k3;
void A32();
printf("Enter name for input file: ");
gets(outfile);
if((ifp1=fopen(outfile,"r"))==NULL){
  printf("fopen failed\n");
  exit(0);
}
/*printf("Enter name for output file1: ");
gets(outfile);
if((ofp1=fopen(outfile,"w"))==NULL){
  printf("fopen failed\n");
  exit(0);
}*/
printf("Enter number of hidden nodes: ");
scanf("%d",&N[1]);
fscanf(ifp1," %d,%d,%d\n",&datapts,&xdata,&ydata);
N[0]=xdata;
N[2]=ydata;
printf("datapts=%d xdata=%d ydata=%d\n", datapts,xdata,ydata);
for(k1=0;k1<datapts;++k1){
  for(k2=0;k2<xdata;++k2){
    fscanf(ifp1," %f,",&x[k2][k1]);
  }
  for(k2=0;k2<ydata;++k2){
    fscanf(ifp1," %f,",&t[k2][k1]);
  }
  fscanf(ifp1,"\n");
}
printf("Enter maximum number of loops: ");
scanf("%d",&Nloops);
Nlevels=2;
ASweight=0.1;
olderr=0;
y[1][N[1]]=1;
y[0][N[0]]=1;
for(k1=1;k1<Nlevels+1;++k1){
  for(k2=0;k2<N[k1]+1;++k2){
    for(k3=0;k3<N[k1-1]+1;++k3){
      w[k1][k2][k3]=0.001;
    }
  }
}
for(k1=0;k1<6;++k1){wow[k1]=0;}
reinitw();
wchange=.4;
A32();
for(k1=0;k1<6;++k1){
  wow[k1]/=(Nloops-10);
  printf("wow[%d]=%f ",k1,wow[k1]);
  /*fprintf(ofp1,"wow[%d]=%f ",k1,wow[k1]);*/
}
printf("\n");
/*fprintf(ofp1,"\n");*/
/*fprintf(ofp1," %d,%d,%d\n",N[0],N[1],N[2]);
for(k1=1;k1<Nlevels+1;++k1){
  for(k2=0;k2<N[k1];++k2){
    for(k3=0;k3<N[k1-1]+1;++k3){
      fprintf(ofp1," %f\n",w[k1][k2][k3]);
      printf("w[%d][%d][%d]=%f\n", k1,k2,k3,w[k1][k2][k3]);
    }
  }
}*/
fclose(ifp1);
if(ofp1!=NULL)fclose(ofp1);
}
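All three programs read the same training-set format: a header line giving datapts, xdata, and ydata, followed by one line per pattern holding the xdata inputs and then the ydata targets, each value followed by a comma. A plausible input file for a two-input, one-output XOR problem (our illustrative data, not taken from the dissertation) would be:

4,2,1
0.0,0.0,0.0,
0.0,1.0,1.0,
1.0,0.0,1.0,
1.0,1.0,0.0,

The array bounds hard-coded above (519 patterns, and in this listing two inputs) limit what such a file may contain.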
void A32()
{
int n1,n2,n3,count,loop,data,test;
float mean;
for(n2=0;n2<5;++n2){
  for(n1=0;n1<datapts;++n1){dwest[n2][n1]=1;}
}
for(loop=0;loop<Nloops;++loop){
  zed=0;
  err=0;
  for(n1=0;n1<5;++n1){use[n1]=0;}

The data reduction starts after the 10th iteration.

  if(loop>9){
    norms();
    for(data=0;data<datapts;++data){
      for(n1=0;n1<N[0];++n1){
        y[0][n1]=x[n1][data];
      }
      for(n1=1;n1<Nlevels+1;++n1){
        for(n2=0;n2<N[n1];++n2){
          y[n1][n2]=0;
          for(n3=0;n3<N[n1-1]+1;++n3){
            y[n1][n2]+=y[n1-1][n3]*w[n1][n2][n3];
          }
          if(y[n1][n2]<-20){y[n1][n2]=-20;}
          b[n1][n2]=y[n1][n2];
          y[n1][n2]=1/(1+exp(-y[n1][n2]));
          K[n1][n2]=y[n1][n2]*(1-y[n1][n2]);
          b[n1][n2]=y[n1][n2]-K[n1][n2]*b[n1][n2];
        }
      }
      for(n1=0;n1<N[Nlevels];++n1){
        tp=t[n1][data]-b[Nlevels][n1];
        for(n2=0;n2<N[Nlevels-1]+1;++n2){
          z[n2]=K[Nlevels][n1]*y[Nlevels-1][n2];
        }
        reinitwpp();
        rotin1(N[Nlevels-1]+1,n1+1);
        dbdA(N[Nlevels-1]+1,n1+1,data);
      }
      reinitwpp();
      for(n1=0;n1<N[Nlevels];++n1){
        count=0;
        tp=t[n1][data]-b[Nlevels][n1];
        for(n2=0;n2<N[Nlevels-1];++n2){
          tp-=K[Nlevels][n1]*w[Nlevels][n1][n2]*b[Nlevels-1][n2];
          for(n3=0;n3<N[Nlevels-2]+1;++n3){
            z[count]=K[Nlevels][n1]*K[Nlevels-1][n2]*w[Nlevels][n1][n2]*y[Nlevels-2][n3];
            ++count;
          }
        }
        tp-=K[Nlevels][n1]*w[Nlevels][n1][n2]*y[Nlevels-1][n2];
        rotin1(N[Nlevels-1]*(N[Nlevels-2]+1),0);
        dbdA(N[Nlevels-1]*(N[Nlevels-2]+1),0,data);
      }
    }
    for(n1=0;n1<N[Nlevels]+1;++n1){
      mean=0;
      for(n2=0;n2<datapts;++n2){
        if(mean<dwest[n1][n2]){mean=dwest[n1][n2];}
      }
      mean/=2;
      for(n2=0;n2<datapts;++n2){
        if(dwest[n1][n2]>mean || dwest[n1][n2]<0){dwest[n1][n2]=1;}
        else{dwest[n1][n2]=0;}
      }
    }
  }
  ASw();
  for(data=0;data<datapts;++data){
    for(n1=0;n1<N[0];++n1){
      y[0][n1]=x[n1][data];
    }
    for(n1=1;n1<Nlevels+1;++n1){
      for(n2=0;n2<N[n1];++n2){
        y[n1][n2]=0;
        for(n3=0;n3<N[n1-1]+1;++n3){
          y[n1][n2]+=y[n1-1][n3]*w[n1][n2][n3];
        }
        if(y[n1][n2]<-20){y[n1][n2]=-20;}
        b[n1][n2]=y[n1][n2];
        y[n1][n2]=1/(1+exp(-y[n1][n2]));
        K[n1][n2]=y[n1][n2]*(1-y[n1][n2]);
        b[n1][n2]=y[n1][n2]-K[n1][n2]*b[n1][n2];
      }
    }
    for(n1=0;n1<N[Nlevels];++n1){
      if(dwest[n1+1][data]>0){
        tp=t[n1][data]-b[Nlevels][n1];
        for(n2=0;n2<N[Nlevels-1]+1;++n2){
          z[n2]=K[Nlevels][n1]*y[Nlevels-1][n2];
        }
        rotin(N[Nlevels-1]+1,n1+1);
        ++use[n1+1];
      }
    }
    if(dwest[0][data]>0){
      for(n1=0;n1<N[Nlevels];++n1){
        count=0;
        tp=t[n1][data]-b[Nlevels][n1];
        for(n2=0;n2<N[Nlevels-1];++n2){
          tp-=K[Nlevels][n1]*w[Nlevels][n1][n2]*b[Nlevels-1][n2];
          for(n3=0;n3<N[Nlevels-2]+1;++n3){
            z[count]=K[Nlevels][n1]*K[Nlevels-1][n2]*w[Nlevels][n1][n2]*y[Nlevels-2][n3];
            ++count;
          }
        }
        tp-=K[Nlevels][n1]*w[Nlevels][n1][n2]*y[Nlevels-1][n2];
        rotin(N[Nlevels-1]*(N[Nlevels-2]+1),0);
        ++use[0];
      }
    }
    for(n1=0,err1=0,test=0;n1<ydata;++n1){
      err1+=(t[n1][data]-y[2][n1])*(t[n1][data]-y[2][n1]);
      if(fabs(t[n1][data]-y[2][n1]) >= 0.5){test=1;}
    }
    if(test==0){zed++;}
    err+=err1;
    for(n1=0;n1<N[Nlevels];++n1){
      weightch(Nlevels,N[Nlevels-1]+1,n1+1);
      for(n2=0;n2<N[Nlevels-1]+1;++n2){
        w[Nlevels][n1][n2]=wa[n2];
      }
    }
    weightch(Nlevels-1,N[Nlevels-1]*(N[Nlevels-2]+1),0);
    count=0;
    for(n1=0;n1<N[Nlevels-1];++n1){
      for(n2=0;n2<N[Nlevels-2]+1;++n2){
        w[Nlevels-1][n1][n2]=wa[count];
        ++count;
      }
    }
    printf("loop=%d data=%d zed=%d err[data]=%f\n", loop,data,zed,err);
    if(ofp1!=NULL)fprintf(ofp1," %d,%d,%f",loop,zed,err);
  }
  for(n1=0;n1<5;++n1){
    printf("use[%d]=%d ",n1,use[n1]);
    if(loop>9){
      wow[n1]+=use[n1];
    }
  }
  printf("\n");
}
}
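The thresholding loop above implements the selection step. In our notation, with \(\widehat{\delta w}_{m}(p)\) the estimated effect of pattern \(p\) on subproblem \(m\) (computed by dbdA() below), our reading of the comparison is

\[
\text{use pattern } p \text{ for subproblem } m
\quad\Longleftrightarrow\quad
\widehat{\delta w}_{m}(p) \;>\; \tfrac{1}{2}\max_{q}\,\widehat{\delta w}_{m}(q)
\quad\text{or}\quad
\widehat{\delta w}_{m}(p) < 0 .
\]

A negative estimate occurs when the denominator of the measure is not positive, that is, when the bound is uninformative, and such patterns are retained to be safe.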
rotin(MATSIZE,mat)
int MATSIZE,mat;
{
int k1,k2;
float rho,sigma,tau,Wp[30][30];
for(k2=0;k2<MATSIZE;++k2){
  W[mat][MATSIZE][k2]=z[k2];
}
W[mat][MATSIZE][MATSIZE]=tp;
for(k2=0;k2<MATSIZE+1;++k2){
  for(k1=0;k1<MATSIZE+1;++k1){
    Wp[k2][k1]=0;
  }
}
for(k2=0;k2<MATSIZE;++k2){
  rho=sqrt(W[mat][k2][k2]*W[mat][k2][k2]+W[mat][MATSIZE][k2]*W[mat][MATSIZE][k2]);
  if(rho == 0){/*if(mat==0){printf("below\n");}*/goto below4;}
  sigma=W[mat][k2][k2]/rho;
  tau=W[mat][MATSIZE][k2]/rho;
  for(k1=k2;k1<MATSIZE+1;++k1){
    Wp[k2][k1]=W[mat][k2][k1]*sigma+W[mat][MATSIZE][k1]*tau;
    Wp[MATSIZE][k1]=-W[mat][k2][k1]*tau+W[mat][MATSIZE][k1]*sigma;
  }
  Wp[MATSIZE][k2]=0;
  for(k1=0;k1<MATSIZE+1;++k1){
    W[mat][k2][k1]=Wp[k2][k1];
    W[mat][MATSIZE][k1]=Wp[MATSIZE][k1];
  }
  below4:;
}
}

The subroutine below applies only the first Givens rotation of a hypothetical update, leaving the W matrices unchanged; the rotated first row is stored in the wpp vector for use by dbdA().

rotin1(MATSIZE,mat)
int MATSIZE,mat;
{
int k1,k2;
float rho,sigma,tau;
for(k2=0;k2<MATSIZE;++k2){
  W[mat][MATSIZE][k2]=z[k2];
}
W[mat][MATSIZE][MATSIZE]=tp;
for(k1=0;k1<MATSIZE+1;++k1){
  wpp[k1]=0;
}
rho=sqrt(W[mat][0][0]*W[mat][0][0]+W[mat][MATSIZE][0]*W[mat][MATSIZE][0]);
if(rho == 0){goto below4;}
sigma=W[mat][0][0]/rho;
tau=W[mat][MATSIZE][0]/rho;
for(k1=0;k1<MATSIZE+1;++k1){
  wpp[k1]=W[mat][0][k1]*sigma+W[mat][MATSIZE][k1]*tau;
}
below4:;
}

This subroutine calculates an approximation to the change that a training pattern would produce in the weight estimates.

dbdA(MATSIZE,mat,datas)
int MATSIZE,mat,datas;
{
int k1;
float dbdAnorm,dAnorm;
dbdAnorm=0;
dAnorm=0;
for(k1=0;k1<MATSIZE;++k1){
  dbdAnorm+=(W[mat][0][k1]-wpp[k1])*wb[mat][k1];  /* wb holds the current weights */
  dAnorm+=fabs(W[mat][0][k1]-wpp[k1]);
}
dbdAnorm=fabs((W[mat][0][MATSIZE]-wpp[MATSIZE])-dbdAnorm);
dwest[mat][datas]=dbdAnorm/(bnorm[mat]-wnorm[mat]*dAnorm);
}

weightch(level,MATSIZE,mat)
int level,MATSIZE,mat;
{
double wchangep[30],a[30],wp[30],den,fac;
int zer,zt,n1,n2,counts;
zt=0;
zer=0;
if(level==2){
  for(n1=0;n1<MATSIZE;++n1){
    wp[n1]=w[level][mat-1][n1];
  }
}
if(level==1){
  counts=0;
  for(n1=0;n1<N[Nlevels-1];++n1){
    for(n2=0;n2<N[Nlevels-2]+1;++n2){
      wp[counts]=w[level][n1][n2];
      ++counts;
    }
  }
}
for(n1=0;n1<MATSIZE;++n1){
  if(fabs(W[mat][n1][n1])<1e-28){zer=1;}
}
if(zer==0){
  a[MATSIZE-1]=W[mat][MATSIZE-1][MATSIZE]/W[mat][MATSIZE-1][MATSIZE-1];
  for(n1=MATSIZE-2;n1>=0;--n1){
    a[n1]=W[mat][n1][MATSIZE];
    for(n2=MATSIZE-1;n2>n1;--n2){
      a[n1]-=W[mat][n1][n2]*a[n2];
    }
    a[n1]/=W[mat][n1][n1];
  }
}
else{
  if(level==2){
    printf("no weight update level 2\n");
    for(n1=0;n1<MATSIZE;++n1){
      a[n1]=wp[n1];
    }
  }
  if(level==1){
    printf("no weight update level 1\n");
    for(n1=0;n1<MATSIZE;++n1){
      a[n1]=wp[n1];
    }
  }
}
for(n1=0;n1<MATSIZE;++n1){
  wchangep[n1]=a[n1]-wp[n1];
}
den=0;
for(n1=0;n1<MATSIZE;++n1){
  den+=wchangep[n1]*wchangep[n1];
}
den=sqrt(den);
if(den > wchange){fac=wchange/den;}
else{fac=1;}
for(n1=0;n1<MATSIZE;++n1){
  wchangep[n1]*=fac;
}
for(n1=0;n1<MATSIZE;++n1){
  wa[n1]=wp[n1]+wchangep[n1];
}
for(n1=0,zt=0;n1<MATSIZE;++n1){
  if(wa[n1]>700){wa[n1]=700;zt=1;}
  if(wa[n1]<-700){wa[n1]=-700;zt=1;}
}
if(zt==1){printf("weight at 700\n");}
}

reinitw()
{
int n1,n2,n3;
for(n1=0;n1<5;++n1){
  for(n2=0;n2<30;++n2){
    for(n3=0;n3<30;++n3){
      W[n1][n2][n3]=0;
    }
  }
}
}

The subroutine below initializes the wpp vector to zero.

reinitwpp()
{
int n1;
for(n1=0;n1<30;++n1){
  wpp[n1]=0;
}
}
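In our notation (the symbols are ours, reconstructed from the code), the measure that dbdA() assigns to pattern \(p\) for subproblem \(m\) is the first-order bound

\[
\widehat{\delta w}_{m}(p) \;=\;
\frac{\bigl|\,\Delta c_{1} \;-\; \sum_{k}\Delta r_{k}\,w_{k}\,\bigr|}
     {\,\|c\|_{\infty} \;-\; \|w\|_{\infty}\sum_{k}\lvert\Delta r_{k}\rvert\,},
\]

where \(\Delta r\) is the difference between the current first row of the triangular factor and the row that one Givens rotation of the new pattern would produce (rotin1() places the latter in wpp), \(\Delta c_{1}\) is the corresponding difference in the right-hand-side entry, \(w\) is the current weight vector held in wb, and the infinity norms \(\|c\|_{\infty}\) and \(\|w\|_{\infty}\) are the bnorm and wnorm quantities computed by norms() below.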
This subroutine calculates the norms for the weight vectors and the b vector.

norms()
{
int k1,k2;
for(k1=0;k1<N[2]+1;++k1){
  wnorm[k1]=0;
  bnorm[k1]=0;
}
for(k1=1;k1<N[2]+1;++k1){
  for(k2=0;k2<N[1]+1;++k2){
    if(wnorm[k1]<fabs(wb[k1][k2])){wnorm[k1]=fabs(wb[k1][k2]);}
  }
}
for(k2=0;k2<N[0]+1;++k2){
  if(wnorm[0]<fabs(wb[0][k2])){wnorm[0]=fabs(wb[0][k2]);}
}
for(k1=1;k1<N[2]+1;++k1){
  for(k2=0;k2<N[1]+1;++k2){
    if(bnorm[k1]<fabs(W[k1][k2][N[1]+1])){bnorm[k1]=fabs(W[k1][k2][N[1]+1]);}
  }
}
for(k2=0;k2<N[0]+1;++k2){
  if(bnorm[0]<fabs(W[0][k2][N[1]*(N[0]+1)])){bnorm[0]=fabs(W[0][k2][N[1]*(N[0]+1)]);}
}
}

The following subroutine implements the forgetting factor by multiplying the W matrices by the constant ASweight.

ASw()
{
int k1,k2,k3;
for(k1=0;k1<N[2]+1;++k1){
  for(k2=0;k2<30;++k2){
    for(k3=0;k3<30;++k3){
      W[k1][k2][k3]*=ASweight;
    }
  }
}
}
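Each of the three listings is a self-contained program. Under a typical Unix C compiler of the period it would be built and run along the lines of (the file name is ours, purely illustrative):

cc -o layerwise layerwise.c -lm
layerwise

after which the program prompts for the input file name, the number of hidden nodes, and the maximum number of training loops.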