LAYER-WISE TRAINING OF FEEDFORWARD NEURAL NETWORKS BASED ON LINEARIZATION AND SELECTIVE DATA PROCESSING

By

Shawn David Hunt

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1992

ABSTRACT

LAYER-WISE TRAINING OF FEEDFORWARD NEURAL NETWORKS BASED ON LINEARIZATION AND SELECTIVE DATA PROCESSING

By Shawn David Hunt

A class of algorithms is presented for training nonlinear feedforward neural networks using purely "linear" techniques. The algorithms are based upon linearizations of the network using error surface analysis, followed by a contemporary recursive least squares identification procedure which can be implemented using parallel processing. Specific algorithms are presented to estimate weights node-wise, layer-wise, and for estimating the entire set of network weights simultaneously. A procedure for modifying the algorithms to selectively use the training data and increase speed is also presented. A computationally inexpensive measure is developed with which to assess the effect of a particular training pattern on the weight estimates prior to its inclusion in any iteration. Data which do not significantly change the weights are not used in that iteration, obviating the computational expense of updating. Several experimental studies are presented showing the advantages of this class of algorithms. Specifically, the layer-wise algorithm is shown to be vastly superior to back-propagation in terms of the number of convergences and convergence rate. Additionally, this algorithm is shown to be insensitive to the choice of initial weights and forgetting factor, eliminating two of the greatest problems in the implementation of existing training algorithms.

To my parents, David and Suzanne Hunt

Acknowledgments

I would like to thank John R. Deller, Jr., for the immense help in completing this degree. The biggest help was probably learning how (even if I do not necessarily follow it) to write a technical paper. I have heard that a Ph.D. degree means you can do research. Of course, if you cannot transmit this to others in a clear, concise manner, it is of little use. I also must give thanks for his defense (during my defense, especially the proposal), and his willingness to see me and listen to any problem. One of the greatest benefits of being at Michigan State was being in a lab where the students not only got along, but where there was a spirit of cooperation and friendliness. For this I must thank both Dr. Deller and Joan (for keeping El Estimado happy). I would also like to thank the members of my committee, Dr. Majid Nayeri, Dr. Norman Birge, and Dr. Fathi Salam. It was an unexpected surprise to find out that the committee members had not only read the complete dissertation, but in great detail. I could not have asked for more. Money! For this I must thank the University of Puerto Rico for granting me a fellowship. This work was also supported by the National Science Foundation under Grant No. MIP-9016734, and the Office of Naval Research under Contract No. N00014-91-J-1329. I also wish to thank the members of the East Lansing Baha'i community for the support which has made these years truly enjoyable.
No, not only enjoyable, great! First the Nazerians, Keyvan, Farzaneh, Mike, Alan and Lily for all the great food, the thanksgivings, the borrowed cars, and of course the tennis. Njeru Murage, Roya Mavaddat, Bruce Johnson and Irene Shim (a token Baha'i), for the most enlightening discussions on every imaginable topic. (Mike, Alan and I also had interesting discussions, but these were enlightening on a different level.) Also the Smiths, David, Melanie, Amber, Amelia, and Zack (God is One), for the incredible spirit at the Sunday night firesides. Joe and Judy DeFlorio! (and of course Sophia) Could we have done it without you? Dominic, for being the only five-year-old to be interested in hearing me talk about black holes. Also thanks to Franco and Shahnaz Damazio for being a little Persian and worrying. Also to the members who enlightened the club here on campus over the years, Gail Etzenhauser (let's go!), Hami Missagieh (Haaaaami), Ohmead Bashirelahi (Sasha who?), Farnoush (why sleep) and Kianoush (let's go to Reza's) Sinai, Tahirih Steen, Rick and Eve Martin, Mark Knox (let's teach), Gill Munro (caderas), Greg (el tigre) and Valerie (hi babe) Heikes, and John Homan (thanks for the music). I know I have left out many, please forgive me. Anne and Alicia, thank you both for just being there and thank you for the prayers.

Contents

List of Tables
List of Figures
1 Introduction and Background
1.1 Introduction
1.1.1 The Promise of Neural Networks
1.1.2 Notation and Objectives
1.2 The Feedforward Neural Network (FNN)
1.3 Training Algorithms for FNNs
1.3.1 Performance Measures
1.3.2 Training Efficiency
1.3.3 The Back-Propagation Algorithm
1.3.4 Techniques for Improving Convergence
1.3.5 Complexity Measures
2 New Training Algorithms
2.1 Introduction
2.2 Linearized Training Algorithms
2.2.1 Node-wise Weight Updating
2.2.1.1 Two Layer Network
2.2.1.2 L Layer Network
2.2.2 Layer-wise Weight Updating
2.2.2.1 Two Layer Network
2.2.2.2 L Layer Network
2.2.3 Network-wise Weight Updating
2.2.3.1 Two Layer Network
2.2.3.2 L Layer Network
2.3 Solving the Linearized Equations
2.3.1 Solution by MIL-WRLS
2.3.2 Solution by QR-WRLS
3 Data Reduction Algorithm
3.1 Introduction
3.2 Data Reduction Algorithm
4 Simulation Results
4.1 Introduction
4.2 Implementation Studies
4.3 Layer Updating and Network Updating
4.3.1 Initialization
4.3.2 Forgetting Factor
4.3.3 Small Training Sets
4.3.4 Large Training Set
4.4 Simulations of the Data Reduction Algorithm
5 Conclusions and Future Research
5.1 Algorithmic Developments
5.1.1 Training Speed
5.1.2 Layer-wise Training Algorithm
5.1.3 QR-WRLS
5.1.4 Data Reduction Algorithm
5.2 Future Research
Appendix

List of Tables

1.1 Complexities per iteration and normalized complexities of FNN training algorithms. M is the number of nodes per layer. L is the number of layers.
4.2 Number of convergences per 100 sets of initial weights.
4.3 Convergence results for various initial weights, 2-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.4 Convergence results for various initial weights, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.5 Convergence results for various forgetting factors, 2-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.6 Convergence results for various forgetting factors, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.7 Convergence results for various forgetting factors, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.
4.8 Iterations to convergence of the BP algorithm for various training sets and varying number of hidden layer nodes.
4.9 Iterations to convergence of the node-wise algorithm for various training sets and varying number of hidden layer nodes.
4.10 Iterations to convergence of the layer-wise algorithm for various training sets and varying number of hidden layer nodes.
4.11 Iterations to convergence of the network-wise algorithm for various training sets and varying number of hidden layer nodes.
4.12 Number of correctly classified training patterns for the large training set for varying number of inner layer nodes.

List of Figures

1.1 A single node.
1.2 Example of a sigmoid function.
1.3 A two-layer network.
2.4 Weight estimation using recursive QR-WRLS.
4.5 Network architectures used in the simulation studies. 1. the 2-bit parity checker, 2. the 4-bit parity checker, and 3. the 4-bit bit counter.
4.6 Average error in dB for the X-OR implementations vs. iteration number. 1. back-propagation; 2. A-S algorithm; 3. QR-WRLS.
4.7 Average error in dB for the BP implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.8 Average error in dB for the node-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.9 Average error in dB for the layer-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.10 Average error in dB for the network-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.11 Average error in dB for the BP implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.12 Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.13 Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.14 Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.
4.15 Average error in dB for the QR-WRLS X-OR implementation vs. iteration number, using different forgetting factors and weight change constraints. 1. ν = 0.98, γ = 0.2; 2. ν = 0.98, γ = 1.0; 3. ν = 0.1, γ = 1.0; 4. ν = 0.1, γ = 0.2; where ν is the forgetting factor and γ is the weight constraint.
4.16 Average error in dB for the node-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.
4.17 Average error in dB for the layer-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.
4.18 Average error in dB for the network-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.
4.19 Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.20 Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.21 Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.22 Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.23 Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.24 Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.
4.25 Training set A.
4.26 Training set C.
4.27 Training set E.
4.28 Training set G.
4.29 Training set I.
4.30 Training set K.
4.31 Training set L.
4.32 Training set M.
4.33 Training set N.
4.34 Large training set.
4.35 Average error vs. iteration number for the BP implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.
4.36 Average error vs. iteration number for the node-wise implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.
4.37 Average error vs. iteration number for the layer-wise implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.
4.38 Average error vs. iteration number for the data reduction algorithm using 6 hidden layer nodes, training with the large training set.
4.39 Correctly classified training patterns vs. iteration number for the data reduction algorithm using 6 hidden layer nodes and the large training set.
4.40 Average error vs. iteration number for the data reduction algorithm using 7 hidden layer nodes, training with the large training set.
4.41 Correctly classified training patterns vs. iteration number for the data reduction algorithm using 7 hidden layer nodes and the large training set.
4.42 Average error vs. iteration number for the data reduction algorithm using 8 hidden layer nodes, training with the large training set.
4.43 Correctly classified training patterns vs. iteration number for the data reduction algorithm using 8 hidden layer nodes and the large training set.
4.44 Number of training patterns used vs. iteration number for the data reduction algorithm and a dividing factor of 30. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.
4.45 Number of training patterns used vs. iteration number for the data reduction algorithm and a dividing factor of 100. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.

Chapter 1

Introduction and Background

1.1 Introduction

1.1.1 The Promise of Neural Networks

This research is concerned with supervised training of Feedforward Artificial Neural Networks (FNNs). FNNs are a special case of a class of non-linear networks called Artificial Neural Networks (ANNs). ANNs have gained popularity in recent years due to their promise of solving previously intractable problems. ANNs have been applied to many problems in the areas of pattern recognition, pattern classification, and associative memories. Some of these problems are from the areas of speech recognition, computer vision, medical diagnosis and non-linear control. Included are such specific problems as teaching a machine to read out loud or to play games. The usefulness of a computer that understands speech or can "see" has been known for decades. For this reason, research to solve problems in the areas mentioned above has been vigorous. However, despite all the research, solutions for many of these problems have not been found using conventional linear techniques such as linear regression, or non-linear techniques such as artificial intelligence implemented on a standard sequential computer. For this reason, it has been suggested that the difficulty in finding solutions does not lie in the fact that the correct algorithm for implementation on a sequential computer has not been found, but rather that sequential computers are inherently unsuited to the problem. An architecture that does seem well-suited to the problem is that of organic neural networks, the brains of animals. Consequently, some properties of these organic neural networks have been emulated in the development of ANNs. Some of these properties are high interconnectivity, massive parallelism, and individual units that are non-linear. A major area of research is now ascertaining whether training algorithms for ANNs, algorithms designed to make the ANN solve the problems mentioned above, actually exist. A number of recently published simulation results provide evidence that the FNN can solve useful problems, even though an explicit mathematical proof of how the solution was reached may not be possible at this time.
One of the most widely referenced experiments is that of NETTALK [1]. In this experiment, a FNN was trained to read out loud. The input to the FNN was standard text, and the output was a phonetic alphabet. This phonetic alphabet was fed into a separate computer which allowed the output to be heard aloud. As impressive as it was to have a network read out loud, even more exciting is the description by the experimenters of how the learning was accomplished. It is reported that the network started off babbling, slowly progressing, making mistakes reminiscent of a child learning how to read. This gave great encouragement that the properties borrowed from nature in the making of FNNs were indeed useful in approximating a real neural system. Another experiment, possibly even more impressive from the engineering point of view, was that of Nguyen and Widrow [2]. The experiment succeeded in teaching a FNN how to back up a semi-trailer truck simulated on a computer. This simulation result is significant because at present we cannot design a controller for this nonlinear control problem. This experiment clearly shows the great advantage of FNNs over other methods. In a typical system at present, the designer must specify all the variables. This is not the case with FNNs. For example, in any pattern recognition problem, the features to be detected and used for selection must be determined by the designer. With a multilayer FNN, however, it is the system itself which determines what features to use. This can be of great advantage, as shown in the previous experiment. We would, of course, like to be able to analyze mathematically all the properties of the FNN. Until then, however, it will still be of use.

1.1.2 Notation and Objectives

As this dissertation relies heavily on the solution of a set of linear equations, a short discussion on notation is appropriate. Suppose we have a typical overdetermined system of linear equations. There are many methods for solving this system for the weights that give the least mean square error. Many popular methods are recursive, and are typically called recursive least squares (RLS). A weighting factor can be easily included in the recursions to weigh each of the linear equations differently in finding the final solution. This we will call weighted recursive least squares (WRLS). One method frequently used [5] is based on the matrix inversion lemma [5] and will be called MIL-WRLS here. A more contemporary recursive approach based on QR decomposition [3], [4] will be used in this dissertation and will be called QR-WRLS.

The research described in this dissertation is concerned with the method and speed of training FNNs. The main contributions have been:

1. To show the advantages of implementing linearized FNN training algorithms with QR-WRLS instead of the conventional MIL-WRLS algorithm. This is done by modifying one such algorithm, the Azimi-Sadjadi et al. (A-S) [6] algorithm, to employ QR-WRLS and comparing a variety of simulation results to the A-S algorithm implemented with MIL-WRLS.

2. To develop a "linearized" training procedure that is faster than the ones reported in the literature. This involves using conventional least mean square equations with new "linearized" inputs and outputs determined from the FNN, and searching for a way to determine inner layer target values faster than the back-propagation technique now widely used.

3. To apply matrix perturbation theory to the linearized neural networks in an attempt to increase convergence speed.
This includes derivation of equations that are well-suited for application to the FNN and are also simple enough so that a computational saving is possible.

The first two objectives are addressed in Chapter 2. A number of training algorithms currently proposed for training FNNs use linearizations. These linearizations transform the training problem to one of solving a system of linear equations. This allows the implementation of these training algorithms by any algorithm for solving the least squared error problem in linear systems. Existing training algorithms use the conventional MIL-WRLS algorithm for implementation. This algorithm takes the form of two recursions. The recursive equations are numerically unstable [7] and require initial conditions which are generally unknown. Also, in training the FNN, a forgetting factor [7] must be included. Implementation with MIL-WRLS limits the types of forgetting factors that can be used. All of these problems are dealt with effectively using QR-WRLS. First, QR-WRLS has been shown to be more efficient than the MIL-WRLS algorithm [4]. It is also numerically stable. QR-WRLS also allows different forgetting methods to be used. One of the proposed training algorithms that uses MIL-WRLS is implemented with QR-WRLS. Simulation results showed a marked increase in the algorithm performance. This result should be very beneficial in stimulating the use of QR-WRLS for other training algorithms implemented using MIL-WRLS. A new linearization training algorithm is also introduced in Chapter 2.

In order to further increase the rate of convergence, matrix perturbation theory has been used. Inputs to the network along with the desired response (together called training patterns) are used to train the FNN. In general, the network does not usually "learn" to correctly classify the training patterns the first time they are used. Neither is there any guarantee that the network will ever learn to classify all the training patterns correctly. However, in the training process the network usually correctly classifies more and more of the training patterns as they are used repeatedly. For this reason the training patterns are used many times. If there were a way to avoid using all the training patterns without changing the solution, training could be made more efficient. One method for doing this data reduction is presented in Chapter 3.

Simulation studies of the performance of QR-WRLS, of the new training algorithm, and of the data reduction algorithm are presented in Chapter 4. Chapter 5 contains the conclusions and suggests areas of future research.

1.2 The Feedforward Neural Network (FNN)

The basic units of the FNN are "neurons" or nodes. The model of the node to be used in this research involves a weighted sum of the inputs, and an output that is a non-linear function of this weighted input. Thus,

$$u = \sum_{j=1}^{N} w_j x_j \qquad (1.1)$$

where $u$ is the weighted input, $w_j$ is the weight for the $j$th input, and $x_j$ is the $j$th input. The output of the node is

$$y = S(u) \qquad (1.2)$$

where $y$ is the output of the node, and $S(\cdot)$ is a nonlinear mapping. A representation is shown in Figure 1.1.

Figure 1.1: A single node.

The non-linearity that will be used in this research is the sigmoid function

$$S(u) = \frac{1}{1 + e^{-u}}. \qquad (1.3)$$

$S(\cdot)$ has the following properties: it is continuous, has continuous derivatives of all orders, is strictly increasing, and has finite limits as the argument goes to $+\infty$ or $-\infty$. A graph of the function is shown in Figure 1.2.

Figure 1.2: Example of a sigmoid function.
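To make the node model concrete, the following is a minimal sketch of (1.1)-(1.3) and of a forward pass through the two-layer structure formalized in the next subsection. It is illustrative only: the names (`sigmoid`, `node_output`, `forward`, `W1`, `W2`) are not from the dissertation, and, consistent with the model above, no bias terms are used.

```python
import numpy as np

def sigmoid(u):
    """Sigmoid nonlinearity S(u) = 1 / (1 + e^-u) of (1.3)."""
    return 1.0 / (1.0 + np.exp(-u))

def node_output(w, x):
    """Output of a single node, (1.1)-(1.2): y = S(w^T x)."""
    return sigmoid(np.dot(w, x))

def forward(W1, W2, x):
    """Forward pass through a two-layer network (cf. Figure 1.3).

    W1 is N1 x N0 (hidden-layer weights), W2 is N2 x N1 (output-layer
    weights); x is the N0-vector of external inputs to layer zero.
    Returns the layer-one and layer-two outputs.
    """
    y_hidden = sigmoid(W1 @ x)      # outputs y'_j(n) of layer one
    y_out = sigmoid(W2 @ y_hidden)  # outputs y_k(n) of layer two
    return y_hidden, y_out
```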
The nodes are grouped into layers. For clarity, in this dissertation a two-layer network¹ is considered in the initial theoretical discussions, an example of which is illustrated in Figure 1.3. The generalization of the methods to an arbitrary number of layers follows the initial developments with two layers. Each node above the input layer in the FNN passes the sum of its weighted inputs through a non-linearity as in (1.2). The inputs to layer zero are external. The outputs of the last layer are the outputs of the network.

¹Some authors might choose to call this a three layer network. The bottom layer of "nodes" shall be designated as "layer zero" and not counted in the total number of layers. Layer zero is a set of linear nodes which simply pass the inputs unaltered. For this reason, circular nodes are not shown in the diagram.

Let us now formalize the network and define notation. The number of nodes in layer $i$ is denoted $N_i$, with $N_0$ indicating the number of input nodes at the bottom (input) of the network. The weights connecting to node $k$ ($k'$) of layer two (one) are held in the $N_1$-vector ($N_0$-vector) $w_k = [w_{k,1}, \ldots, w_{k,N_1}]^T$ ($w'_{k'} = [w'_{k',1}, \ldots, w'_{k',N_0}]^T$). The inputs to the nodes in all layers except the first are the outputs of the layer below. We denote by $N$ the number of training patterns

$$\left\{ \left( x(n) = [x_1(n), x_2(n), \ldots, x_{N_0}(n)]^T ;\; t(n) = [t_1(n), t_2(n), \ldots, t_{N_2}(n)]^T \right),\; n = 1, 2, \ldots, N \right\}, \qquad (1.4)$$

in which each $x_l(n)$ is the input to the $l$th node in layer zero, and $t_k(n)$ is the target output for node $k$ in the output layer (the output desired in response to the corresponding input). The computed outputs of layer two (one) in response to $x(n) = [x_1(n), \ldots, x_{N_0}(n)]^T$ are denoted $y(n) = [y_1(n), \ldots, y_{N_2}(n)]^T$ ($y'(n) = [y'_1(n), \ldots, y'_{N_1}(n)]^T$).

Figure 1.3: A two-layer network.

Vectors will be used when convenient, and scalars will be used when they simplify the presentation. Similarly to (1.1) we also define

$$u_k(n) \stackrel{\text{def}}{=} \sum_{j=1}^{N_1} w_{k,j}\, y'_j(n) = w_k^T y'(n), \qquad (1.5)$$

where $u_k(n)$ is the input to node $k$ in the output layer in response to pattern $n$; $u'_j(n)$ is similarly defined as the input to node $j$ in layer one.

1.3 Training Algorithms for FNNs

1.3.1 Performance Measures

There are many methods proposed for the training of FNNs. Training, in the most general sense, means to adjust the variable parameters of the network, the weights, so that the network can perform the desired task, be it pattern recognition or classification, a control problem, etc. The first problem encountered in training is quantifying how well the network is performing the desired task. This quantification is done with a performance measure. A performance measure is a function of the weights of the network. There are many in use. The most popular, and the one to be used in this research, is the sum of the squared errors between the desired outputs and the actual outputs. The training algorithm has the goal of adjusting the weights of the network so that the network performs well with respect to the performance measure used. Training algorithms can be divided into two overlapping groups. One group uses heuristics to search the weight space in order to optimize the performance measure, while the other group uses the performance measure in order to change the weights. Some training algorithms use both heuristics and the performance measure.
One example of a training algorithm that uses a heuristic search over the parameter space is presented by Sun and Grosky [8]. This algorithm uses random optimization and dynamic annealing to find a global minimum of the performance function. It does not compute the gradient of the performance function. As noted by Widrow [9], most algorithms that use the performance measure to change the weights can be further divided into two classes. The first class consists of algorithms which change the weights of the network to reduce the error between the desired response and the actual response. Algorithms in this class are called error-correction rules. Algorithms from the second class are called gradient rules. These change the network weights along the gradient of the performance measure to reduce the mean-squared error. An example of an error-correction rule is Rosenblatt's perceptron learning rule [10]. Training algorithms that use the gradient rule have become very common in recent research. There are many examples of these rules in the literature [11]. Because there are many similar algorithms, and many variations of the same algorithm, it would be very tedious to describe each. The following is a summary of the main gradient rule training algorithms, and those which are related to the present research.

The training problem for the two layer FNN is stated as follows: Given a set of $N$ training patterns as in (1.4), find the weights which minimize the sum of squared errors, say $E$, between the target outputs and the actual outputs,

$$E = \sum_{n=1}^{N} \sum_{k=1}^{N_2} \left( t_k(n) - y_k(n) \right)^2. \qquad (1.6)$$

Minimizing $E$ can also be viewed as the problem of finding a mapping from the input space, the space spanned by the $x(n)$'s, to the output space, the space spanned by the $y(n)$'s, so that $E$ will be minimum. It is the nonlinearity at the output of each node that gives the FNN the promise of solving previously intractable problems. However, it is also the nonlinearity that produces the problems in training. In order to appreciate some of the problems encountered in training, consider the differences of the network with and without the nonlinearities. Removing the non-linearities from the outputs of the nodes will result in a linear network with the same structure as the FNN. If the network is linear, a single layer can achieve the same mapping as a multi-layer network; thus there is no advantage in using more than one layer, e.g. [12, Ch. 2]. In the linear case, finding the weights that minimize $E$ is a well-known problem. Taking the partial derivative of $E$ with respect to the weights and setting this equal to zero results in a set of linear equations which can be solved for the desired weights. For non-trivial cases in which there are at least as many independent training patterns as weights, the weights found in this manner are unique. This implies that there is only one minimum of the function $E$ with respect to the weights.

Replacing the non-linearity at the output of each node changes many of the properties of the network. One difference is that one layer cannot map the inputs to all the outputs that a multilayer network can. As shown in [13], a one layer network, with the sigmoid replaced by the sgn function, can segment the input space into two regions by a hyper-plane. With multiple layers, more complex segmentation can be achieved. Multiple layers are thus useful, but pose more of a challenge to train than single layer networks.
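As an aside, the linear single-layer case can indeed be solved in one block computation. The following sketch, with randomly generated placeholder data and hypothetical dimensions, finds the unique least squares weights both by forming the normal equations explicitly and by an orthogonalization-based solver; both routes give the same answer.

```python
import numpy as np

# Placeholder data: N = 50 patterns, 4 inputs per pattern.
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 4))   # rows are the input vectors y'(n)
t = rng.standard_normal(50)        # target outputs t_k(n)

# Normal equations: (Y^T Y) w = Y^T t.
w_normal = np.linalg.solve(Y.T @ Y, Y.T @ t)

# Equivalent, numerically preferable route that never forms Y^T Y:
w_lstsq, *_ = np.linalg.lstsq(Y, t, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```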
With one layer, once a set of training patterns is given, the inputs and outputs of every node are known. With multiple layers, the hidden layer "inputs" and "outputs" are not known and must be computed by the training algorithm. Another difference between the linear network and the FNN is that there are generally multiple minima of the function $E$ with respect to the weights. The occurrence of multiple minima corresponds to the fact that the derivative of $E$ is nonlinear in the weights. As before, we desire the weights that are the zeros of the equations resulting from taking the derivative of $E$ with respect to the weights. Many methods for doing this appear in the literature. None of the methods, however, solves directly for the weights. Some of the most popular methods for solving these non-linear equations are the gradient rule algorithms.

All the gradient rule algorithms proceed in a similar manner. First, the partial derivative of $E$ with respect to the weights at one training pattern, $\frac{\partial E}{\partial w}$, is evaluated at the present weights. Some method, depending on the algorithm, is then used to determine how to change the weights in order to reduce $E$. This is repeated for each training pattern. The weights can be changed with each training pattern, or after all the training patterns have been presented, depending on the algorithm. This constitutes one iteration through the training patterns. Many iterations are usually needed to find the desired weights. Thus, $\frac{\partial E}{\partial w}$ is evaluated at the new weights and the next training pattern, and the weights are again changed to reduce $E$. This is repeated until the partial derivative of $E$ with respect to the weights is close to zero. The weights at which the algorithm stops are called the final weights.² Because there are multiple minima, the solution found may, or may not, be the global minimum. In practice many initial weights are tried and the weights giving the smallest $E$ are chosen.

Because the non-linear nature of the problem requires an iterative approach, finding a solution will generally be much slower than in the linear case, where the weights can be found in one iteration or with one block computation. Finding a solution can be too slow in some cases, such as when the FNN is applied to speech processing [14, Ch. 14]. New algorithms which produce a solution more rapidly than the techniques currently used would be of great value. This is the goal of the proposed research. Thus, some method to compare the "speed" of each algorithm is necessary.

²They are called final weights and not solution weights because these weights may not solve the proposed problem.

1.3.2 Training Efficiency

The speed at which an algorithm converges to a solution depends on two factors: the number of iterations and the complexity of each iteration. The complexity can be measured in terms of the number of operations (multiplications, additions, etc.) required. The number of iterations depends not only on the algorithm itself but on the initial weights of the system and the training data. The comparisons done in simulations to follow will eliminate these variables by using the same initial weights and the same training data for each algorithm. Different algorithms do not always converge to the same set of final weights due to the different ways in which a true gradient system is approximated. However, even though the final weights may be different, it is useful to know the rate at which the algorithm converges to them. One way of doing this involves an error function [15].
This will be a function of the difference between the final weights, $w^*$, and the weights at iteration $n$, $w(n)$:

$$e(n) = \| w(n) - w^* \|, \qquad (1.7)$$

where $\| \cdot \|$ is the Euclidean norm. The rate at which $e(n)$ goes to zero will indicate how fast the algorithm is converging to the final weights. If $e(n)$ can be written as a constant term multiplied by $e(n-1)$, $e(n) = K e(n-1)$, then the algorithm is said to converge linearly. If $e(n) = K e^2(n-1)$, then it is said to converge quadratically. In most cases a quadratically converging algorithm converges faster than a linearly converging one [15]. Of course not all algorithms converge linearly or quadratically, but it is useful to compare new algorithms with those known to converge linearly or quadratically.

1.3.3 The Back-Propagation Algorithm

The most prominent method for training FNNs is the back-propagation (BP) algorithm [17], [16]. Let $w(n)$ be any network weight at iteration $n$. Given a training pattern, it is possible to evaluate $\frac{\partial E}{\partial w(n)}$ for each weight in the network. A positive value of $\frac{\partial E}{\partial w(n)}$ indicates that reducing $w(n)$ will reduce $E$; conversely, a negative value of $\frac{\partial E}{\partial w(n)}$ indicates that an increase in $w(n)$ will reduce $E$. The back-propagation algorithm updates each weight at iteration $n$ using the equation

$$w(n) = w(n-1) - \varepsilon \frac{\partial E}{\partial w(n-1)}, \qquad (1.8)$$

where $\varepsilon$ controls the rate of convergence, $0 < \varepsilon < 1$. $\frac{\partial E}{\partial w}$ for weights in layer $L-1$ depends on the weights in layer $L$. This is true for all layers: $\frac{\partial E}{\partial w}$ for any weight in the network depends on the weights in the layers above. The algorithm operates as follows: $\frac{\partial E}{\partial w}$ is first evaluated and used to update each weight in the last layer. $\frac{\partial E}{\partial w}$ is then evaluated for the weights in layer $L-1$ using the updated weight estimates of layer $L$. In similar fashion, the weights in the remaining layers are updated. The process is then repeated with a new training pattern. The algorithm stops when the change to the weights is below a set threshold.

Assume the two layer network described above. The partial derivative with respect to a weight in the last layer, $w_{k,j}(n)$, for training pattern $n$ is, from (1.5) and (1.6),

$$\frac{\partial E}{\partial w_{k,j}(n)} = -2\left(t_k(n) - y_k(n)\right) \frac{\partial y_k(n)}{\partial u_k(n)}\, y'_j(n). \qquad (1.9)$$

If $w'_{j,p}(n)$ is a weight associated with a node in the first layer, then for training pattern $n$

$$\frac{\partial E}{\partial w'_{j,p}(n)} = -2 \sum_{k=1}^{N_2} \left(t_k(n) - y_k(n)\right) \frac{\partial y_k(n)}{\partial u_k(n)}\, w_{k,j}\, \frac{\partial y'_j(n)}{\partial u'_j(n)}\, x_p(n). \qquad (1.10)$$

As seen in (1.9), the partial derivative with respect to a weight in the last layer is a product of the error of node $k$, $t_k(n) - y_k(n)$, the derivative of the output of the node with respect to its input, and the node input. The partial derivative with respect to a weight in the first layer is similar, as seen in (1.10), with the error term replaced by the term $\sum_{k=1}^{N_2} \left(t_k(n) - y_k(n)\right) \frac{\partial y_k(n)}{\partial u_k(n)} w_{k,j}$. This term can be interpreted as the "error" of node $j$ in the first layer, "back-propagated" through the upper layer.

It is clear from this analysis that one disadvantage of the BP algorithm is that by increasing the number of layers, the computational complexity is increased. For example, going from one layer to two more than doubles the complexity. For a network with $M$ nodes in each layer, computing $\frac{\partial E}{\partial w}$ for a weight in the last layer takes three multiplications and two subtractions. Calculating $\frac{\partial E}{\partial w}$ for a weight in the next to last layer takes $3M+3$ multiplications and $2M+1$ subtractions. For a network with $L$ layers and $M$ nodes per layer, this algorithm requires $O(3ML)$ flops (a flop is one multiplication and one addition) per iteration.
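For concreteness, the following is a minimal sketch of one pattern-by-pattern BP update for the two-layer network, implementing (1.8)-(1.10). The function names and the use of NumPy are illustrative, and $S'(u) = S(u)(1 - S(u))$ is used for the sigmoid of (1.3).

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def bp_step(W1, W2, x, t, eps):
    """One pattern-by-pattern BP update per (1.8)-(1.10).

    W1: N1 x N0, W2: N2 x N1, x: N0 input, t: N2 target, eps: step size.
    Returns updated copies of (W1, W2).
    """
    y1 = sigmoid(W1 @ x)           # hidden outputs y'_j(n)
    y2 = sigmoid(W2 @ y1)          # network outputs y_k(n)
    err = t - y2                   # t_k(n) - y_k(n)
    # Output layer: dE/dw_kj = -2 (t_k - y_k) S'(u_k) y'_j, per (1.9).
    delta2 = -2.0 * err * y2 * (1.0 - y2)        # N2-vector
    grad2 = np.outer(delta2, y1)                 # N2 x N1
    # Hidden layer: back-propagated error through W2, per (1.10).
    delta1 = (W2.T @ delta2) * y1 * (1.0 - y1)   # N1-vector
    grad1 = np.outer(delta1, x)                  # N1 x N0
    return W1 - eps * grad1, W2 - eps * grad2
```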
Although BP as described by Rumelhart et al. [12], Parker [16], and Werbos [17] is not guaranteed to converge to a local minimum, changing the algorithm to use integration [18] does guarantee this convergence. This is a gradient system, and given any initial weights, will converge to a local minimum. Another disadvantage of BP is that it converges linearly to a minimum. Linear convergence, along with the high complexity, prohibits real time training in many applications.

1.3.4 Techniques for Improving Convergence

Many new techniques have been presented to improve the speed of convergence [6], [19], [20], [21]. One method has been to implement algorithms that use second order derivatives. Parker, who was among the first to derive the BP equations, later turned to second order algorithms to speed up convergence [22]. Many such techniques include some form of linearization [6], [19], [23]. If the weights are only changed a small amount during each iteration, linearizing the equation for $E$ with respect to the weights around the present weights is justified for each iteration. Thus the linearization must be performed every iteration. One technique using linearization, the Azimi-Sadjadi et al. [6] algorithm (A-S algorithm), uses linearized input, $\tilde{y}_j(n)$, and linearized output, $\bar{t}_k(n)$, target values. The equation to minimize is thus changed from (1.6) to

$$\bar{E} = \sum_{n=1}^{N} \sum_{k=1}^{N_2} \left( \bar{t}_k(n) - \sum_{j=1}^{N_1} w_{k,j}\, \tilde{y}_j(n) \right)^2. \qquad (1.11)$$

In order to use equations of this form, the inputs and target outputs of every node must be known. Only the inputs to the first layer and targets for the last layer are given as training patterns; the other inputs and outputs must be calculated. The algorithm proceeds as follows: the input to the first layer, given by one training pattern $n$, is used to calculate the output of each layer at the present weights. The linearized input and output values are then calculated for the last layer. These values are used to update the second layer weights using the MIL-WRLS algorithm. The targets for the first layer, say $t'_j(n)$, are calculated using a "back-propagated" error, similar to that used in the BP algorithm. These target outputs are used along with the training pattern inputs to calculate the linearized input and output target values for the first layer. As in the second layer, the weights of the first layer are updated using these linearized values and MIL-WRLS. The procedure is repeated for successive training pairs until the weight change is less than a preset threshold.
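For reference, the following is a minimal sketch of the exponentially weighted RLS recursion built on the matrix inversion lemma, the kind of MIL-WRLS update the A-S algorithm relies on. The class name and the initialization constant `delta` are illustrative; the need to choose $P(0)$, which is generally unknown, is one of the initialization problems discussed in the sequel.

```python
import numpy as np

class MILWRLS:
    """Exponentially weighted RLS via the matrix inversion lemma.

    Minimizes sum_n lam^(N-n) (t(n) - w^T x(n))^2, where lam is the
    forgetting factor.  P(0) = I/delta is an assumed initial inverse
    correlation matrix.
    """
    def __init__(self, dim, lam=0.98, delta=1e-3):
        self.w = np.zeros(dim)
        self.P = np.eye(dim) / delta
        self.lam = lam

    def update(self, x, t):
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)             # gain vector
        self.w = self.w + k * (t - x @ self.w)   # correct by a priori error
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.w
```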
Additionally, all the weights connected to a single node can be updated simultaneously instead of sequentially as in BP. In their research simulations using a two layer network, the number of iterations for this algorithm was an order of magnitude less than BP. Another disadvantage of this algorithm is that it uses recursive equations. The values to initialize these equations are not known practically. Also, previous equations used to update the weights decay exponentially, thus old linearized training patterns that are not useful for, and in fact may hinder, the updating of weights are still included in the equations. Kollias and Anastassiou [23] also describe a recursive linearized algorithm. Their method is based on the Marquardt-Levenberg least squares optimization method. This method is designed to solve an approximation to the linear system H Aw = —f (1.14) where H is the hessian matrix containing the second derivatives of the error function 17 E with respect to the weights, and f is a vector consisting of the first derivatives of E with respect to the weights. This is basically Newton’s method for determining the zeros of a non-linear equation in matrix form. Newton’s method is known to converge quadratically [15] for initial weights close to a minimum. However, Newton’s method can have convergence problems for poor initial weights. The Marquardt-Levenberg technique uses an approximation to H. This allows the algorithm to converge for relatively poor initial weights and also retains the quadratic convergence when close to a minimum. This algorithm also uses back-propagation to determine the targets for the inner layers. According to Kollias and Anastassiou the advantage of their algorithm is that it converges quadratically when close to a minimum. One disadvantage of the algorithm is that, like A-S, back-propagation is used for the inner layer training patterns. Also like A-S, it uses recursive equations which suffer from problems previously mentioned. Thus, this equation also has 0(3M2) flops per node per iteration. For a network with L layers and M nodes per layer, this algorithm requires 0(3M3L) per iteration in addition to 0(3ML) for BP. Steck, et.al [24] implement this technique and show that with parallel implementation, the additional computational requirements can be effectively reduced. Another linearization algorithm, very similar to the Kollias and Anastassiou algo- rithm is presented by Ghiselli-Crippa and El-Jaroudi [25]. This algorithm also uses an approach based on an approximation of the Hessian matrix. The weight updating equations are: w(i + 1) = w(i) — aH‘1(i)F(i) (1.15) where i is the iteration index, a is the step size, to is the vector of all the network weights, F is the gradient of the error E with respect to the network weights, and H is the approximation to the Hessian matrix. To simplify the Hessian matrix, the weights connected to each node are assumed to be independent. Simulations were 18 Table 1.1: Complexities per iteration and normalized complexities of FN N training algorithms. M is the number of nodes per layer. L is the number of layers. BP A-S (QR-WRLS) Layer-wise N etwork-wise complexity 0(3ML) 0(2.5M3L + 3M1) 0(2.5M2L + 3ML) C)(2.5M2LL2 + 3ME) normalized complexity 00 389.98 107.6 781.79 performed, training a FN N to classify speech as voiced, unvoiced, or silence. The error using this F NN for the classification was lower than that of a statistical decision classifier. 
1.3.5 Complexity Measures All of these linearization algorithms are more complex computationally than BP. Still, they offer faster convergence in the simulation examples presented by the re- spective authors. This implies that computational complexity is not a good method of judging the speed of an algorithm. To compare the speed of different algorithms we will use a “normalized” complexity, the complexity of the algorithm multiplied by a factor determined from the simulation studies in Chapter 4. The lower the normal- ized complexity, the faster the algorithm. Simulation studies were done comparing the algorithm developed in this dissertation with BP and the A-S algorithm. The simulations from the training of a 4-bit parity checker were used to determine the normalized complexities. The normalized complexity of the BP algorithm is shown to be 00 since it failed to find the solution even once. Table 1.1 shows the relative complexities and the normalized complexities for these algorithms. The new layer- wise algorithm developed here has a clear advantage in speed over both BP and the A-S algorithm. 19 Chapter 2 New Training Algorithms 2.1 Introduction This chapter is divided into two main sections, the first of which introduces a new class of linearized training algorithms for the FN N. Because of the nonlinearities present in the FN N , all gradient descent learning al- gorithms perform some form of linearization around the present weights. The learning algorithms presented here are purely “linear” in the sense that the matrix vector equa- tion to be solved has been transformed into the equation of a linear system. Accord- ingly, unlike other popular training algorithms that are not in this form, this linear algorithm and its potential variants will benefit from the well-understood theoretical properties of RLS and VLSI architectures for its implementation. The second section describes implementation of these algorithms with QR-WRLS, a contemporary version of the conventional recursive least squares algorithm. 2.2 Linearized Training Algorithms As described in the previous chapter, all training algorithms solve for the desired weights using several iterations through the training data. The gradient descent algorithms form a linear approximation to the error surface E at the present weights, 20 and change the weights in the direction which reduces the total error. The manner in which this approximation is made, and how it is implementated, account for the differences among the training algorithms. As can easily be seen from (1.6), the error is a function of all the weights. Typically the approximation to the error surface at a given iteration is done by taking the derivative of E with respect to a subset of all the weights and based on this gradient changing this subset to reduce E. This process is repeated for another subset until all the weights have been changed to reduce E. The most popular method of F N N training - BP -— uses only one weight at a time for this approximation. In order to improve the rate of convergence and the final solution, newer techniques use more of the network weights, typically all the weights connected to one node, simultaneously to form an approximation to E at each iteration. Since E is a function of all the weights, it would seem better to simultaneously use as many of the weights as possible to reduce E. 
The class of algorithms presented here represents an improvement in the sense that the weights for an entire layer can be used to form an approximation to E and are updated simultaneously. Also in the special case that there is only one output, all the weights of the network can be updated simultaneously. The theoretical development of the algorithm is first given for the case where all weights connected to a given node are updated simultaneously. This will be expanded to the case in which all the weights in the same layer are updated simultaneously. Finally the case in which all the weights in the network are updated simultaneously is presented. To simplify the presentation of each case, the development is done first for a two layer network, then for a general multilayer F NN. 2.2.1 Node-wise Weight Updating 2.2.1.1 Two Layer Network FirSt We wish to concentrate on the training of the weights in the final layer. Before 21 continuing, we note a simple fact which will reduce the number of details in our discussion. It is easy to show from (1.6) that if only the weights connected to output node k are allowed to change, with the other weights in the network held constant, then E is minimized by minimizing only the errors associated with this node, say Ek. Let us write E). in a form which explicitly features the weights allowed to change, N E. = 214(n1- 5(wfv’(n))]’- (2.1) n=l Algorithms for finding the optimal solution, say mg, to this problem are well-known if the modeled output depends only upon a linear combination of the inputs us- ing pattern-invariant (constant) weights. In the linear case yk(n) = S(wky’(n)) = flwky’ (n), for some constant 6 (which can be taken as unity without loss of general- ity), and the error expression takes the form N El: = Elna) - wa’(n)]2. (2-2) n=1 The solution in the linear case is the solution to the classical linear least squares “normal equations” [7]. The solution of the normal equations can proceed in a variety of ways. It is also possible to arrive at the solution without explicitly forming the normal equations. This is the case, for example, when using the least mean square (LMS) (e.g. see [26] or [27]) algorithm, a recursive solution which amounts to “back- propagation” for a linear network. A second popular method is the conventional MIL- WRLS algorithm. A contemporary version of the latter will serve as a computational basis for the algorithm to be described in this dissertation, and MIL-WRLS is also the basis for the A-S algorithm to which we wish to relate the method of this dissertation. Appropriate description and formalism will be introduced as needed. It is well-known that least squares estimation problems may be discussed in terms of their error surfaces, in this case the graph of E1. as a function of 10].. Whatever 22 the form of the least square estimation algorithm, the ideal goal is to find the weight vector, say 111;, corresponding to the global minimum of Ek(wk). It is important to fu- ture developments to note that E), depends not only on wk but also upon the training patterns {(z(n),tk(n)), n 6 [1,N]} (see (2.1)). (In fact, since we have “frozen” the weights in the first layer, it is more appropriate in this case to view E), as a function of 10;, and the pairs {(y’ (n), tk(n)), n 6 [1, N ]}) Once the training patterns are fixed, the error function may be described as a surface over the Nl-dimensional hyperplane corresponding to the weights. 
Theoretically, the pairs {(y’(n), tk(n)), n 6 [1, N ]} rep- resent partial realizations of a two-dimensional stochastic process which generates them. In this sense Ek(wka{(yl(n)atk(n))1n 6 [1,N]}) (2-3) is only a sample error surface. In a pure sense, we would like to find weights corre- sponding to the global minimum of 8 {Ek(wk)} where 8 denotes the expected value. We must be content, however, to work with the sample surface provided by the train- ing data. The point of the discussion above is to note that different algorithms construct and use different sample error surfaces from the data. With LMS (or back-propagation), error surfaces are sequentially constructed from individual training patterns, i.e., error surfaces of the form Ek(wk, [y’(n), tk(n)]), n = l, 2, . . . , N (2.4) are created, and for each n, the weights are moved in the direction of the negative gradient on that surface. The convergence properties are well-understood. MIL- WRLS3, on the other hand, creates sequentially more refined error surfaces of the 3Of course, here we are speaking of a linear model identification. 23 form Ek(wki{(y’(j)itk(j))aj E [linlll (25) as n is incremented. At each step, if a weight update is computed, the solution corresponds to the unique minimum of the newly refined surface. We can appreciate, therefore, that even if we neglect nonlinearities, the estimation processes behave quite differently with respect to their error surface analysis. The linearization technique adopted in this work can be explained in terms of the error surface analysis. The error surface over which we would like to find the (global) minimum by choice of weights is given by (2.1). Suppose we wish to construct a “linearized” error surface, say Ek, which is “similar” in some sense to E, in a neighborhood of the present weights. Recalling that E]. is a function not only of the weights, but also of the training patterns, the fundamental question is: Can the pairs {(y’(n),t(n)),n E [1,N]} be modified in some sense, say (y’(n),t(n)) —+ (371(17): f1:01)), so that N Ek(wk,{(§ll(n),t1(n)),n E [1, N]}) = 2:151:01) - Wish”)? (2-6) N % Ek(w,., {(y’(n),tk(n)),n E [1,N]}) = 2:161:01) - ~‘>’(wfv’(n))l2 in some neighborhood of the present weights? The answer to this question is the key theoretical development described in the following paragraphs. In the ensuing discussion, the notation w; will be used to designate a local mini- mum of Eh. Ideally, w; will be the global minimum, but we have no way of assuring this. The objective is to find, by means of a “linear” algorithm, a close approximation to 10;. The algorithm to be described proceeds in iterations, indexed by i = 1, 2, . . .. Each iteration represents one complete training cycle through the N training patterns. 24 Suppose that a weight vector estimate w),(i — 1) results from iteration i — 1. In iteration i, by manipulation of the data, we work with a “linearized” error surface, Ek, which is similar to the nonlinear surface in the neighborhood of wk(i — 1). The similarity follows from two criteria: 1- Ek(wk(i-1)1{(ylc(n)1t-k(n))!n e [1,N]}) = Ek(wk(i-1)1{(ylc(n)1tk(n))1n E [1,N]}); 2 33.] = 31.1] ' aw). wk=wk(§_1) 310:: wk=wk(i-1)' The first task is to manipulate the pairs {(yj‘(n),t—k(n)),n E [1,N]} so that these criteria hold. This is accomplished as follows. 
It follows from Criterion 1 that

$$\sum_{n=1}^{N} \left( t_k(n) - y_k(n) \right)^2 = \sum_{n=1}^{N} \left( \bar{t}_k(n) - w_k^T(i-1)\, \tilde{y}'(n) \right)^2. \qquad (2.7)$$

By letting, for each $n$,

$$\bar{t}_k(n) - w_k^T(i-1)\, \tilde{y}'(n) = t_k(n) - y_k(n), \qquad (2.8)$$

that is,

$$\bar{t}_k(n) = \left( t_k(n) - y_k(n) \right) + w_k^T(i-1)\, \tilde{y}'(n), \qquad (2.9)$$

Criterion 1 is met. Now we take the partial derivatives required in Criterion 2. For the "nonlinear" error,

$$\left. \frac{\partial E_k}{\partial w_k} \right|_{w_k = w_k(i-1)} = -2 \sum_{n=1}^{N} \left( t_k(n) - y_k(n) \right) S'(u_k(n))\, y'(n), \qquad (2.10)$$

where $S'(u_k(n))$ denotes the derivative of $S(\cdot)$ evaluated at $u_k(n) = w_k^T(i-1)\, y'(n)$. For the linearized error, using (2.8),

$$\left. \frac{\partial \bar{E}_k}{\partial w_k} \right|_{w_k = w_k(i-1)} = -2 \sum_{n=1}^{N} \left( \bar{t}_k(n) - w_k^T(i-1)\, \tilde{y}'(n) \right) \tilde{y}'(n) \qquad (2.11)$$

$$= -2 \sum_{n=1}^{N} \left( t_k(n) - y_k(n) \right) \tilde{y}'(n). \qquad (2.12)$$

Comparing (2.10) and (2.12), Criterion 2 is satisfied if, for each $n$,

$$\tilde{y}'(n) = S'(u_k(n))\, y'(n). \qquad (2.13)$$

Now consider the weights associated with the $j$th node in the first layer of the two layer network, $w'_j$. Taking the derivative of $E$ with respect to an element $w'_{j,l}$ of $w'_j$ gives

$$\frac{\partial E}{\partial w'_{j,l}} = -2 \sum_{n=1}^{N} \sum_{k=1}^{N_2} \left( t_k(n) - y_k(n) \right) S'(u_k(n))\, w_{k,j}(i)\, S'(u'_j(n))\, x_l(n), \qquad (2.14)$$

where $w_{k,j}(i)$ denotes the $j$th element in weight vector $w_k(i)$ (the weight on the connection from node $j$ in layer one to node $k$ in layer two). This expression can be written

$$\frac{\partial E}{\partial w'_j} = -2 \sum_{n=1}^{N} \left( t'_j(n) - y'_j(n) \right) S'\!\left( [w'_j]^T x(n) \right) x(n), \qquad (2.15)$$

where $t'_j(n)$ is called the target value for inner node $j$ and is defined such that

$$\left( t'_j(n) - y'_j(n) \right) = \sum_{k=1}^{N_2} \left( t_k(n) - y_k(n) \right) S'(u_k(n))\, w_{k,j}(i). \qquad (2.16)$$

The quantity on the right side of (2.16) is commonly called the back-propagated error for node $j$. The solution sought, say $w_j'^*$, is one for which

$$\left. \frac{\partial E}{\partial w'_j} \right|_{w'_j = w_j'^*} = 0. \qquad (2.17)$$

In the top layer, for node $k$ we sought $w_k^*$ such that

$$\left. \frac{\partial E_k}{\partial w_k} \right|_{w_k = w_k^*} = 0. \qquad (2.18)$$

With reference to (2.10), it is clear that the present optimization problem is equivalent to the ones encountered at the upper nodes. In particular, the same linearization considerations can be applied to obtain modified input and target values, say

$$\left( \tilde{x}_j(n),\ \bar{t}'_j(n) \right), \qquad (2.19)$$

and the set of layer one weights $w'_j(i)$ computed accordingly for each $j$.

2.2.1.2 L Layer Network

This algorithm can be extended in a straightforward, though notationally cumbersome, manner to any number of layers. Here we will use parenthetical superscripts instead of primes to indicate the layer. In order to simplify notation, the dependence on the iteration number will be dropped. Thus, $y_j^{(l)}$ will be used to denote the output of node $j$ in layer $l$. Similarly, $w_{j,p}^{(l)}$ will be used to denote the weight connecting node $p$ in layer $l-1$ to node $j$ in layer $l$. Assuming a network with $L$ layers, let us update the weights connected to node $i$ in layer $l$, $w_i^{(l)}$, $l \in [1, 2, \ldots, L]$. We have already presented the cases of $l = L$ and $l = L-1$. Taking the derivative of $E$ with respect to $w_i^{(l)}$, the error is back-propagated through layers $L, L-1, \ldots, l+1$:

$$\frac{\partial E}{\partial w_i^{(l)}} = -2 \sum_{n=1}^{N} \sum_{k=1}^{N_L} \left( t_k(n) - y_k^{(L)}(n) \right) S'(u_k^{(L)}(n)) \sum_{j=1}^{N_{L-1}} w_{k,j}^{(L)}\, S'(u_j^{(L-1)}(n)) \cdots \sum_{m=1}^{N_{l+1}} w_{m,i}^{(l+1)}\, S'(u_i^{(l)}(n))\, y^{(l-1)}(n). \qquad (2.20)$$

Similarly to (2.15), this expression can be written

$$\frac{\partial E}{\partial w_i^{(l)}} = -2 \sum_{n=1}^{N} \left( t_i^{(l)}(n) - y_i^{(l)}(n) \right) S'\!\left( [w_i^{(l)}]^T y^{(l-1)}(n) \right) y^{(l-1)}(n), \qquad (2.21)$$

where $t_i^{(l)}$ is the target value for node $i$ in layer $l$, and is defined by

$$\left( t_i^{(l)}(n) - y_i^{(l)}(n) \right) = \sum_{k=1}^{N_L} \left( t_k(n) - y_k^{(L)}(n) \right) S'(u_k^{(L)}(n)) \sum_{j=1}^{N_{L-1}} w_{k,j}^{(L)}\, S'(u_j^{(L-1)}(n)) \cdots \sum_{m=1}^{N_{l+1}} w_{m,i}^{(l+1)}\, S'(u_i^{(l)}(n)). \qquad (2.22)$$

Again the optimization problem is equivalent to the ones encountered at the upper nodes.

Before continuing, let us note the relationship to the A-S algorithm mentioned above. In fact, to this point in the discussion, the methods are nearly equivalent, though derived from different starting points. The A-S algorithm proceeds by replacing the nonlinearity $S(\cdot)$ for the node to be updated by a linear approximation, say $\tilde{S}(\cdot)$, consisting of the first two terms of a Taylor series around the "present" value of the node's input. For example, suppose the $k$th output node is to be linearized with respect to the $n$th training pattern. Let $\hat{w}_{k,j}$ denote the present value of weight $w_{k,j}$. Then,

$$S(u) \approx \tilde{S}(u) = S'\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) \left( u - \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) + S\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right)$$
$$= S'\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) u + \left[ S\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) - S'\!\left( \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right) \sum_{j=1}^{N_1} \hat{w}_{k,j}\, y'_j(n) \right]$$
$$\stackrel{\text{def}}{=} K_k(n)\, u + b_k(n). \qquad (2.23)$$

Azimi-Sadjadi et al. [6] recognized that by using this approximation in (2.10), the optimization problem became equivalent to a set of linear least squares error normal equations if the data were modified according to (2.9) and (2.13). Therefore, by quite different means, the theoretical developments arrive at the same set of linear equations to be solved.
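As an illustration of the node-wise procedure, the following sketch forms the linearized pairs of (2.9) and (2.13) for one output node over all $N$ patterns. The names are hypothetical, and the batch least squares call in the trailing comment stands in for the MIL-WRLS or QR-WRLS recursions actually used in the dissertation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def linearize_output_node(w_k, Yp, t_k):
    """Linearized training pairs for output node k, per (2.9) and (2.13).

    w_k : N1-vector, present weights w_k(i-1).
    Yp  : N x N1 matrix whose rows are the hidden outputs y'(n).
    t_k : N-vector of targets t_k(n).
    Returns (Y_tilde, t_bar) such that min_w || t_bar - Y_tilde w ||^2
    matches E_k and its gradient at w_k(i-1).
    """
    u = Yp @ w_k                    # node inputs u_k(n)
    y = sigmoid(u)                  # node outputs y_k(n)
    Sp = y * (1.0 - y)              # S'(u_k(n)) for the sigmoid
    Y_tilde = Sp[:, None] * Yp      # y~'(n) = S'(u_k(n)) y'(n), (2.13)
    t_bar = (t_k - y) + Y_tilde @ w_k   # t-bar_k(n), (2.9)
    return Y_tilde, t_bar

# One node-wise iteration, with batch least squares standing in for
# the recursive solver:
#   Yt, tb = linearize_output_node(w_k, Yp, t_k)
#   w_k, *_ = np.linalg.lstsq(Yt, tb, rcond=None)
```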
Azimi-Sadjadi et al. [6] recognized that by using this approximation in (2.10), the optimization problem became equivalent to a set of linear least square error normal equations if the data were modified according to (2.9) and (2.13). Therefore, by quite different means, the theoretical developments arrive at the same set of linear equations to be solved.

2.2.2 Layer-wise Weight Updating

2.2.2.1 Two Layer Network

The method used above, finding new linearized training patterns to update the network weights, can be extended to update all the weights of one layer simultaneously. In the previous section, $E$ and $\bar{E}$ were used directly to calculate the linearized training patterns. This is useful from a theoretical point of view, since it shows exactly why the linearized training patterns can be used to calculate the gradient of the system, but it is cumbersome notationally. As noted in the previous section, the linearized error surface $\bar{E}$ can be constructed by replacing the nonlinearity $S(\cdot)$ by a linear approximation. Because using this approximation simplifies notation and is equivalent to using $E$ and $\bar{E}$ directly, in this section we will use this latter approach for calculating the linearized training patterns.

Consider again the general $L$ layer network. Suppose that the weights connected to one or more nodes in layer $l$ are to be updated simultaneously⁴. This may include as few as one, and as many as all, nodes in layer $l$. Denote the set of such selected nodes in layer $l$ by $\mathcal{N}$. Denote by $\mathcal{M}$ the set of all nodes above layer $l$ to which any node in $\mathcal{N}$ is connected, directly or indirectly. Let all weights not connected to nodes in $\mathcal{N}$ and $\mathcal{M}$ be fixed at present values⁵. As shown above, the linearized error surface is constructed by replacing the nonlinearity $S(\cdot)$ for each node in $\mathcal{N}$ and $\mathcal{M}$ by the linear approximation $\bar{S}(\cdot)$. In fact, any node not in $\mathcal{N}$ or $\mathcal{M}$ may also be linearized with no effect on the solution. Therefore, we may assume without loss of generality that the entire network is "linearized," even if only a portion of the weights is to be updated.

⁴If any weight connected to a node is to be updated, then every weight connected to that node must be updated. This "constraint" is ordinarily beneficial, since it implies the ability to simultaneously update more than one weight.

⁵In certain cases it is possible to update weights in different layers simultaneously. We discuss one case at the end of this section.

Let us now return to the two layer network. Suppose we wish to update all weights in the output layer simultaneously. We must linearize all output nodes (and may arbitrarily linearize any other nodes). For node $k$ in the output layer, the output in response to input $n$ is computed as in (1.2), which we repeat here for convenience:

$$y_k(n) = S(u_k) = S\!\left(\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\right). \qquad (2.24)$$

As before, we let $\bar{y}_k(n)$ represent the output of node $k$ after $S(u_k(n))$ has been replaced by $\bar{S}(u_k(n)) = K_k(n)\, u_k(n) + b_k(n)$. Accordingly,

$$\bar{y}_k(n) = K_k(n)\left[\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\right] + b_k(n) \quad \text{or} \quad z_k(n) = K_k(n)\left[\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\right], \qquad (2.25)$$

with

$$z_k(n) \stackrel{\Delta}{=} \bar{y}_k(n) - b_k(n). \qquad (2.26)$$

We speak of the rightmost form in (2.25) as descriptive of a linearized node because the output is a purely linear combination of the inputs to the node. The network with all appropriate nodes linearized will be called the linearized network.
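The Taylor coefficients of (2.23) are easy to compute and to verify. The sketch below, again assuming a logistic sigmoid and using our own helper names, computes $K_k(n)$ and $b_k(n)$ and checks that the linearized output agrees with the original output at the present weights, which is the property exploited in the next paragraphs.

```python
import numpy as np

def S(u):
    return 1.0 / (1.0 + np.exp(-u))

def S_prime(u):
    s = S(u)
    return s * (1.0 - s)

def taylor_coeffs(wk, y_prime_n):
    """K_k(n) and b_k(n) of (2.23): first-order Taylor expansion of S
    around the present node input u_k(n) = wk^T y'(n)."""
    u = wk @ y_prime_n
    K = S_prime(u)
    b = S(u) - K * u
    return K, b

# At the present weights the linearized output equals the original one.
wk = np.array([0.3, -0.7])
yp = np.array([0.5, 1.0])
K, b = taylor_coeffs(wk, yp)
assert np.isclose(K * (wk @ yp) + b, S(wk @ yp))
```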
Since $\bar{y}_k(n) = y_k(n)$ at the present weights, the error at the $k$th node will be the same for the linearized and original networks if the target value for $z_k(n)$, say $\bar{t}_k(n)$, is taken to be

$$\bar{t}_k(n) \stackrel{\Delta}{=} t_k(n) - b_k(n) \qquad (2.27)$$

and the "linearized" inputs to node $k$ at pattern $n$ are

$$\bar{x}_{k,j}(n) \stackrel{\Delta}{=} K_k(n)\, y'_j(n), \qquad j = 1, 2, \ldots, N_1. \qquad (2.28)$$

Note that the linearized inputs are dependent upon $k$, so that we have effectively increased the number of training pairs by a factor of $N_2$.

The problem has effectively been reduced to one of estimating weights for a single-layer linear network. In order to simultaneously update all the weights in the output layer, the system of $N \times N_2$ equations

$$\bar{t}_k(n) = \sum_{j=1}^{N_1} \bar{x}_{k,j}(n)\, w_{k,j}, \qquad k = 1, 2, \ldots, N_2, \quad n = 1, 2, \ldots, N \qquad (2.29)$$

must be solved for the least square estimate of the $N_1 \times N_2$ weights $w_{k,j}$, $k \in [1,N_2]$, $j \in [1,N_1]$. However, since all weights in the hidden layer are fixed, the outputs $y'_j(n)$ are independent of $k$. This means that the equations indexed by different values of $k$ are independent of one another, and the sets of weights connected to different outputs may be updated independently. In the output layer, therefore, there is no theoretical difference between layer-wise and node-wise updating. To prove this we need to show that the $\bar{t}_k(n)$ of (2.9) and $\bar{y}'_k(n)$ of (2.13) are the same as the $\bar{t}_k(n)$ and $\bar{x}_{k,j}(n)$ of (2.29). Substituting for $K_k(n)$ in (2.28) gives

$$\bar{x}_{k,j}(n) = S'(u_k(n))\, y'_j(n) = S'\bigl(w_k^T\, y'(n)\bigr)\, y'_j(n), \qquad j = 1, 2, \ldots, N_1. \qquad (2.30)$$

In vector form this equation is

$$\bar{x}_k(n) = S'\bigl(w_k^T\, y'(n)\bigr)\, y'(n) = \bar{y}'_k(n). \qquad (2.31)$$

Substituting for $b_k$ in (2.27) we get

$$\bar{t}_k(n) = t_k(n) - \left(y_k(n) - K_k(n)\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\right) \qquad (2.32)$$
$$= \bigl(t_k(n) - y_k(n)\bigr) + \sum_{j=1}^{N_1} w_{k,j}\, K_k(n)\, y'_j(n)$$
$$= \bigl(t_k(n) - y_k(n)\bigr) + w_k^T\, K_k(n)\, y'(n)$$
$$= \bigl(t_k(n) - y_k(n)\bigr) + w_k^T\, \bar{y}'_k(n),$$

which is equivalent to (2.9). Thus the linearized inputs and outputs of the layer-wise and node-wise weight updating are the same for the last layer. This is not true at lower layers, however, as we now show for the hidden layer of the present network.

In order to update all weights in the hidden layer simultaneously, the weights in the output layer are fixed and all nodes in the network must be linearized. The outputs of the hidden layer with $S(\cdot)$ replaced by $\bar{S}(\cdot)$ are given by

$$\bar{y}'_j(n) = K'_j(n)\left[\sum_{p=1}^{N_0} w'_{j,p}\, x_p(n)\right] + b'_j(n), \qquad j = 1, 2, \ldots, N_1. \qquad (2.33)$$

Substituting (2.33) into the leftmost expression in (2.25) results in

$$\bar{y}_k(n) - \left[\sum_{j=1}^{N_1} K_k(n)\, w_{k,j}\, b'_j(n) + b_k(n)\right] = \sum_{j=1}^{N_1} \sum_{p=1}^{N_0} \bigl[K_k(n)\, w_{k,j}\, K'_j(n)\, x_p(n)\bigr]\, w'_{j,p}. \qquad (2.34)$$

As above, we can now view the problem as one of training a single-layer linear mapping with target outputs

$$\bar{t}_k(n) = t_k(n) - \left[\sum_{j=1}^{N_1} K_k(n)\, w_{k,j}\, b'_j(n) + b_k(n)\right] \qquad (2.35)$$

and inputs

$$\bar{x}_{k,j,p}(n) := K_k(n)\, w_{k,j}\, K'_j(n)\, x_p(n). \qquad (2.36)$$

The weight estimates for $w'_{j,p}$, $j \in [1,N_1]$, $p \in [1,N_0]$, comprise the least square error solution to the system of equations

$$\bar{t}_k(n) = \sum_{j=1}^{N_1} \sum_{p=1}^{N_0} \bar{x}_{k,j,p}(n)\, w'_{j,p}, \qquad k = 1, 2, \ldots, N_2, \quad n = 1, 2, \ldots, N. \qquad (2.37)$$

Unlike the output layer, we see that the problem cannot be decomposed into separate solutions for sets of weights connected to individual nodes in the hidden layer. This is a reflection of the fact that all weights in the hidden layer are coupled through their "mixing" in the output layer. This means that the simultaneous solution for all weights in the hidden layer should be beneficial with respect to a node-wise solution. Indeed we will find this to be the case in the experiments.

Note that, for a fixed $k$, the inputs to the linearized network, $\bar{x}_k(n)$, $n \in [1,N]$, are most conveniently viewed as two-dimensional (indexed by couples $(j,p)$). There are $N$ such "grid" inputs for each $k$, paired with the $N$ values of $\bar{t}_k(n)$.
If there were two hidden layers in the network, we would find that there would be three effective inputs. Although the number of subscripts continues to increase with increasing layers, only the subscript of the output node and the subscripts of the weight with which the linearized input is associated are used (see (2.44)). The other subscripts are "summed" out of the equation.

Further, it is noted that the role of $k$ in (2.37) is somewhat superfluous. In principle, the index is used to keep track of which of $N_2$ outputs in the linearized network is being considered. However, the training pairs $(\bar{t}_k(n);\ \bar{x}_{k,1,1}(n), \ldots, \bar{x}_{k,N_1,N_0}(n))$, $k \in [1,N_2]$, $n \in [1,N]$, can be reindexed by mapping pairs $(k,n) \rightarrow i$ so that the training pairs may be written $(\bar{t}(i);\ \bar{x}_{1,1}(i), \ldots, \bar{x}_{N_1,N_0}(i))$, $i \in [1, N \times N_2]$. Of course, an identical system of equations to (2.37) results, but the linearized network may be viewed as a single output linear layer with $N \times N_2$ training pairs.

Updating of some subset of the weights in the hidden layer (in particular, "node-wise" as in the A-S algorithm) is tantamount to solving the subsystem of (2.37) corresponding to the desired weights, introducing the updated values into the system, solving for the next desired subset, etc. Clearly this will result in a different solution than the simultaneous solution. In terms of the error surfaces, this process consists of continually updating the error surface as "partial" information becomes available, then moving in the direction of the gradient with respect to a new subset of weights in the updated surfaces. Intuitively, movement "at once" with respect to the "complete" gradient would seem to be a preferable procedure. Indeed, the latter operation corresponds to the simultaneous updating.

The linearization allows us to approximate the error surface of the nonlinear system for only a small neighborhood around the present weights. Because of the criteria used to construct $\bar{E}$, the weights will be changed in the direction of the true gradient in the nonlinear space, but will move to the minimum of $\bar{E}$, which may be quite far from the neighborhood over which $\bar{E} \approx E$. As in the node-wise weight updating, the weights must be allowed to change only a small amount using the training patterns of the linearized system. If the linearized procedure results in a large change of weights, measures must be taken to decrease the alteration. The updating procedure is repeated until changing the weights does not result in a decrease in error. The algorithm proceeds as follows: linearize the system around the present weights, change the weights by a small amount to decrease error, then repeat the procedure. This is done until changing the weights does not decrease the error or a maximum on the number of iterations is reached.

2.2.2.2 L Layer Network

The general case of an $L$ layer network is a straightforward extension of the two layer case. As in the previous section, let a parenthetical superscript indicate the layer. After linearizing all the nodes, the output equation for node $k$ in layer $L$ in terms of the outputs of layer $L-1$ is

$$\bar{y}_k^{(L)}(n) = K_k^{(L)}(n)\left(\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\, \bar{y}_{j_1}^{(L-1)}(n)\right) + b_k^{(L)}(n). \qquad (2.38)$$

This is equivalent to (2.25) and can be used to solve for layer $L$ weights. Extending this down one more layer, and writing $\bar{y}_k^{(L)}(n)$ in terms of the outputs of layer $L-2$,
$$\bar{y}_k^{(L)}(n) = K_k^{(L)}(n)\left(\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\left[K_{j_1}^{(L-1)}(n)\left(\sum_{j_2=1}^{N_{L-2}} w_{j_1,j_2}^{(L-1)}\, \bar{y}_{j_2}^{(L-2)}(n)\right) + b_{j_1}^{(L-1)}(n)\right]\right) + b_k^{(L)}(n) \qquad (2.39)$$

$$= \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} K_k^{(L)}(n)\, w_{k,j_1}^{(L)}\, K_{j_1}^{(L-1)}(n)\, w_{j_1,j_2}^{(L-1)}\, \bar{y}_{j_2}^{(L-2)}(n) + \left[b_k^{(L)}(n) + K_k^{(L)}(n)\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\, b_{j_1}^{(L-1)}(n)\right].$$

Continuing the process for another layer gives

$$\bar{y}_k^{(L)}(n) = \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \sum_{j_3=1}^{N_{L-3}} \Bigl(K_k^{(L)}(n)\, w_{k,j_1}^{(L)}\, K_{j_1}^{(L-1)}(n)\, w_{j_1,j_2}^{(L-1)}\, K_{j_2}^{(L-2)}(n)\, w_{j_2,j_3}^{(L-2)}\, \bar{y}_{j_3}^{(L-3)}(n)\Bigr) \qquad (2.40)$$

$$+ \left[b_k^{(L)}(n) + K_k^{(L)}(n)\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\left(b_{j_1}^{(L-1)}(n) + K_{j_1}^{(L-1)}(n)\sum_{j_2=1}^{N_{L-2}} w_{j_1,j_2}^{(L-1)}\, b_{j_2}^{(L-2)}(n)\right)\right].$$

It is easy to extend this to any number of layers. Let $l \in [1,L]$. Then the equations to update the weights in layer $l$ are

$$\bar{y}_k^{(L)}(n) - \left[b_k^{(L)}(n) + K_k^{(L)}(n)\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\left(b_{j_1}^{(L-1)}(n) + K_{j_1}^{(L-1)}(n)\sum_{j_2=1}^{N_{L-2}} w_{j_1,j_2}^{(L-1)}\Bigl(b_{j_2}^{(L-2)}(n) + \cdots + K_{j_{L-l-1}}^{(l+1)}(n)\sum_{j_{L-l}=1}^{N_l} w_{j_{L-l-1},j_{L-l}}^{(l+1)}\, b_{j_{L-l}}^{(l)}(n)\Bigr)\right)\right] \qquad (2.41)$$

$$= \sum_{j_{L-l}=1}^{N_l} \sum_{j_{L-l+1}=1}^{N_{l-1}} \left(\left[\sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \cdots \sum_{j_{L-l-1}=1}^{N_{l+1}} K_k^{(L)}(n)\, w_{k,j_1}^{(L)}\, K_{j_1}^{(L-1)}(n)\, w_{j_1,j_2}^{(L-1)} \cdots w_{j_{L-l-1},j_{L-l}}^{(l+1)}\, K_{j_{L-l}}^{(l)}(n)\, y_{j_{L-l+1}}^{(l-1)}(n)\right] w_{j_{L-l},j_{L-l+1}}^{(l)}\right). \qquad (2.42)$$

Similarly to above, this can be viewed as a single layer linear mapping with target outputs

$$\bar{t}_k^{(l)}(n) = \bar{y}_k^{(L)}(n) - \left[b_k^{(L)}(n) + K_k^{(L)}(n)\sum_{j_1=1}^{N_{L-1}} w_{k,j_1}^{(L)}\left(b_{j_1}^{(L-1)}(n) + K_{j_1}^{(L-1)}(n)\sum_{j_2=1}^{N_{L-2}} w_{j_1,j_2}^{(L-1)}\Bigl(b_{j_2}^{(L-2)}(n) + \cdots + K_{j_{L-l-1}}^{(l+1)}(n)\sum_{j_{L-l}=1}^{N_l} w_{j_{L-l-1},j_{L-l}}^{(l+1)}\, b_{j_{L-l}}^{(l)}(n)\Bigr)\right)\right] \qquad (2.43)$$

and inputs

$$\bar{x}_{k,j_{L-l},j_{L-l+1}}^{(l)}(n) = \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \cdots \sum_{j_{L-l-1}=1}^{N_{l+1}} K_k^{(L)}(n)\, w_{k,j_1}^{(L)}\, K_{j_1}^{(L-1)}(n)\, w_{j_1,j_2}^{(L-1)} \cdots w_{j_{L-l-1},j_{L-l}}^{(l+1)}\, K_{j_{L-l}}^{(l)}(n)\, y_{j_{L-l+1}}^{(l-1)}(n). \qquad (2.44)$$

In the same manner as described above, these linearized inputs and outputs can be used to update the weights in layer $l$.

2.2.3 Network-wise Weight Updating

2.2.3.1 Two Layer Network

For the same reason that simultaneous layer-wise estimation of weights is beneficial, we should expect even more benefit from complete network updating if such were possible. It follows from the developments above that entire network updating is possible for at least one case. If there is a single node in the output layer of the network, let $k = 1$ and define

$$\bar{w}_{j,l} = w_{k,j}\, w'_{j,l} = w_{1,j}\, w'_{j,l}. \qquad (2.45)$$

From (2.34) it follows that

$$\bigl(\bar{y}_1(n) - b_1(n)\bigr) = \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} \bigl[K_1(n)\, K'_j(n)\, x_l(n)\bigr]\, \bar{w}_{j,l} + \sum_{j=1}^{N_1} \bigl[K_1(n)\, b'_j(n)\bigr]\, w_{1,j}. \qquad (2.46)$$

This can be interpreted as an attempt to train a single linear layer with one output and $(N_0 \times N_1) + N_1$ inputs. In this case, there will be only $N$ linearized training patterns. The system can be solved for $w_{1,j}$ and $\bar{w}_{j,l}$, and (2.45) can be used to solve for $w'_{j,l}$.

2.2.3.2 L Layer Network

As in the two layer case, the equations for network-wise weight updating in the $L$ layer network come directly from the layer-wise weight updating equations. Rearranging (2.41) and replacing $k$ with 1 we get,
$$\bar{y}_1^{(L)}(n) - b_1^{(L)}(n) = \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \cdots \sum_{j_{L-1}=1}^{N_1} \sum_{j_L=1}^{N_0} \Bigl(K_1^{(L)}(n)\, K_{j_1}^{(L-1)}(n) \cdots K_{j_{L-1}}^{(1)}(n)\, y_{j_L}^{(0)}(n)\Bigr)\, w_{1,j_1}^{(L)}\, w_{j_1,j_2}^{(L-1)} \cdots w_{j_{L-1},j_L}^{(1)} \qquad (2.47)$$

$$+ \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \cdots \sum_{j_{L-1}=1}^{N_1} \Bigl(K_1^{(L)}(n)\, K_{j_1}^{(L-1)}(n) \cdots b_{j_{L-1}}^{(1)}(n)\Bigr)\, w_{1,j_1}^{(L)}\, w_{j_1,j_2}^{(L-1)} \cdots w_{j_{L-2},j_{L-1}}^{(2)}$$

$$+ \cdots + \sum_{j_1=1}^{N_{L-1}} \sum_{j_2=1}^{N_{L-2}} \Bigl(K_1^{(L)}(n)\, K_{j_1}^{(L-1)}(n)\, b_{j_2}^{(L-2)}(n)\Bigr)\, w_{1,j_1}^{(L)}\, w_{j_1,j_2}^{(L-1)} + \sum_{j_1=1}^{N_{L-1}} \Bigl(K_1^{(L)}(n)\, b_{j_1}^{(L-1)}(n)\Bigr)\, w_{1,j_1}^{(L)}.$$

This can be redefined, collecting each product of weights along a path into a single composite weight, as

$$\bar{y}_1^{(L)}(n) - b_1^{(L)}(n) = \sum_{j_1,\ldots,j_L} \Bigl(K_1^{(L)}(n) \cdots K_{j_{L-1}}^{(1)}(n)\, y_{j_L}^{(0)}(n)\Bigr)\, \bar{w}_{j_1,j_2,\ldots,j_L} + \sum_{j_1,\ldots,j_{L-1}} \Bigl(K_1^{(L)}(n) \cdots b_{j_{L-1}}^{(1)}(n)\Bigr)\, \bar{w}_{j_1,\ldots,j_{L-1}} + \cdots + \sum_{j_1} \Bigl(K_1^{(L)}(n)\, b_{j_1}^{(L-1)}(n)\Bigr)\, \bar{w}_{j_1}. \qquad (2.48)$$

This can be interpreted as a one output, $[N_L + (N_L \times N_{L-1}) + (N_L \times N_{L-1} \times N_{L-2}) + \cdots + (N_L \times N_{L-1} \times \cdots \times N_1 \times N_0)]$ input linear system. The composite weights $\bar{w}_{j_1,\ldots,j_L}$, $\bar{w}_{j_1,\ldots,j_{L-1}}$, $\ldots$, $\bar{w}_{j_1}$ can be solved for and used to determine the system weights. In the previous section we saw that as the number of layers increased, the dimension of the inputs increased. Here, as the number of layers increases, the dimension of the weights increases.

2.3 Solving the Linearized Equations

2.3.1 Solution by MIL-WRLS

In principle, once the linearization is achieved at iteration $i$ and pattern $n$, any least mean square type algorithm can be employed to update the weight estimates. The A-S method uses the conventional MIL-WRLS algorithm. In this case, neglecting any error weighting, MIL-WRLS takes the form of the two recursions (written for node $k$ in the top layer) [5, Ch. 5],

$$P(i,n) = P(i,n-1) - \frac{P(i,n-1)\, \bar{y}'_k(n)\, [\bar{y}'_k(n)]^T\, P(i,n-1)}{1 + [\bar{y}'_k(n)]^T\, P(i,n-1)\, \bar{y}'_k(n)} \qquad (2.49)$$

$$w_k(i,n) = w_k(i,n-1) + P(i,n)\, \bar{y}'_k(n)\, \bigl[\bar{t}_k(n) - [\bar{y}'_k(n)]^T\, w_k(i,n-1)\bigr]. \qquad (2.50)$$

$w_k(i,n)$ is the estimate of the weights $w_k$ following pattern $n$ in the $i$th iteration through the training data, and $P^{-1}(i,n)$ is the covariance matrix at the same "time" in the process,

$$P^{-1}(i,n) \stackrel{\Delta}{=} \sum_{j=1}^{n} \bar{y}'_k(j)\, [\bar{y}'_k(j)]^T. \qquad (2.51)$$

Note that $w_k(i,0) \stackrel{\Delta}{=} w_k(i-1,N)$ and similarly for the covariance matrix. This presents the question of how $w_k(0,0)$ and $P(0,0)$ should be initialized. The inverse covariance matrix contains theoretically infinite values at the outset, and a proper initialization for the weights is practically not known (this means that the initial linearizations of the training data are based on potentially very bad weight estimates). This issue will be addressed further below. Also, it is clear that this solution, as written, will continue to "accumulate" past linearized sets of data which might, in fact, be linearized around very poor weight estimates. Therefore, the A-S algorithm includes a "forgetting factor" [5] in the MIL-WRLS recursions. This is equivalent to using a weighted error criterion with time-varying (exponentially decaying) weights. This can make convergence slow if the forgetting factor is large. If the forgetting factor is small, then past values are forgotten more quickly, but the algorithm may have convergence problems. We will also comment further on this issue below.

We have found that the choice of conventional MIL-WRLS as a solution method seriously impairs the ability of this linearization method to converge on a proper set of network weights. As an alternative, therefore, we suggest the method presented in the following section.
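For reference, one MIL-WRLS step of (2.49) and (2.50) can be sketched as follows. The snippet includes a forgetting factor `lam` as described in the text (with `lam = 1` it reduces to the unweighted form written above); the function name and the arrangement of arguments are ours.

```python
import numpy as np

def mil_wrls_update(P, w, ybar, tbar, lam=1.0):
    """One MIL-WRLS recursion, after (2.49)-(2.50), with exponential
    forgetting factor lam.  P is the (scaled) inverse-covariance matrix,
    w the weight estimate, ybar the linearized input vector ybar'_k(n),
    and tbar the linearized target tbar_k(n)."""
    denom = lam + ybar @ P @ ybar
    P_new = (P - np.outer(P @ ybar, ybar @ P) / denom) / lam
    innovation = tbar - w @ ybar          # prediction error on this pattern
    w_new = w + P_new @ ybar * innovation
    return P_new, w_new
```

Note the initialization problem described above: `P` must be seeded with some large (theoretically infinite) values before the first pattern, which is precisely what the QR-based alternative avoids.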
2.3.2 Solution by QR-WRLS

In order to improve convergence, the algorithm developed above can be implemented using QR-WRLS [4, 7, 28]. This algorithm has distinct advantages over conventional MIL-WRLS. First, the QR-WRLS algorithm does not suffer from the initialization problems noted above for MIL-WRLS. It is also robust numerically, as no matrix inversions are done. This should prove to be very beneficial for different algorithms currently implemented with MIL-WRLS. Allred [29] made a study of various FNN training algorithms for the U.S. Navy and rejected all which used MIL-WRLS because, in his words, "Classical techniques such as Newton's technique or regression analysis (which invert large N x N matrices) can be ill-conditioned and fail. . . . Although these approaches may be interesting, we reject them in favor of more robust techniques. The last thing one wants is to have a software product which fails in a production environment." In addition to improved numerical properties, QR-WRLS also permits the inclusion of several very flexible "forgetting" strategies. To illustrate the operation of the algorithm, it is sufficient to consider the estimation of weights $w_k$ in the output layer of the two layer network. All notation is consistent with that used above.

In effect, the linearization technique described above reduces the problem at the $i$th iteration through the training patterns to one of finding the least square error solution of the overdetermined system of equations

$$\begin{bmatrix} [\bar{y}'_k(1)]^T \\ [\bar{y}'_k(2)]^T \\ \vdots \\ [\bar{y}'_k(N)]^T \end{bmatrix} w_k(i) = \begin{bmatrix} \bar{t}_k(1) \\ \bar{t}_k(2) \\ \vdots \\ \bar{t}_k(N) \end{bmatrix}. \qquad (2.52)$$

The QR-WRLS method is based upon transforming this system into an upper triangular system by applying a series of orthonormal operators (Givens rotations). The resulting system is

$$\begin{bmatrix} T(i,N) \\ 0_{(N-N_1) \times N_1} \end{bmatrix} w_k(i) = \begin{bmatrix} d_1(i,N) \\ d_2(i,N) \end{bmatrix}, \qquad (2.53)$$

where the matrix $T(i,N)$ is $N_1 \times N_1$ upper triangular and $0_{j \times k}$ denotes the $j \times k$ zero matrix. The solution for $w_k(i)$ is easily obtained by back-substitution. A recursive version of the solution is also possible. The recursive algorithm is shown in Figure 2.4. For details the reader is referred to [4, 28].

For discussion of further benefits of the decomposition algorithm, it is useful to view the $A$ matrix, defined in Figure 2.4, as four partitions. Following the rotation of the $n$th equation in Step 2, for example,

$$A = \begin{bmatrix} T(i,n) & d_1(i,n) \\ 0_{1 \times N_1} & d_2(i,n) \end{bmatrix}. \qquad (2.55)$$

Figure 2.4: Weight estimation using recursive QR-WRLS

WEIGHT ESTIMATION USING RECURSIVE QR-WRLS

Initialization: Initialize an $(N_1+1) \times (N_1+1)$ working matrix, say $A$, to a null matrix.

Recursion: For $i = 1, 2, \ldots$ (iteration); and, for $n = 1, 2, \ldots, N$ (pattern),

1. Enter the next equation into the bottom row of $A$:
$$\bigl[\ [\bar{y}'_k(n)]^T \mid \bar{t}_k(n)\ \bigr]. \qquad (2.54)$$

2. "Rotate" the new equation into the system using
$$A'_{m,k} = A_{m,k}\,\sigma + A_{N_1+1,k}\,\tau S$$
$$A'_{N_1+1,k} = -A_{m,k}\,\tau S + A_{N_1+1,k}\,\sigma S$$
for $k = m, m+1, \ldots, N_1+1$ and $m = 1, \ldots, N_1$; where $\sigma = A_{m,m}/\rho$, $\tau = A_{N_1+1,m}/\rho$, $\rho = (A_{m,m}^2 + A_{N_1+1,m}^2)^{1/2}$, $S$ is unity (useful later), and $A_{m,k}$ ($A'_{m,k}$) is the $m,k$ element of $A$ pre- (post-)rotation. No other elements of $A$ are affected.

3. Solve for the least square estimate of the weights $w_k$ if desired. (Solution after the $n$th pattern will produce what has been called $w_k(i,n)$ in the text, and $w_k(i,N) = w_k(i)$.)

4. If $n < N$, increment $n$. Otherwise check the convergence criterion and increment $i$ and reset $n$ if it is not met.

Termination: Stop when some convergence criterion is met.
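The recursion of Figure 2.4 is short enough to render directly in code. The sketch below is a minimal NumPy illustration, not the dissertation's implementation: the function names are ours, and the $S = -1$ case (used below to "rotate out" equations) is implemented as a hyperbolic rotation, which is our reading of that device.

```python
import numpy as np

def qr_wrls_rotate(A, x, t, S=1.0):
    """One recursion of Figure 2.4: place the pattern equation x^T w = t in
    the bottom row of the (N1+1) x (N1+1) working matrix A and rotate it
    into the triangular system.  S = +1 adds the equation; S = -1 removes
    a previously included one (hyperbolic rotation, our interpretation)."""
    M = A.shape[0] - 1
    A[M, :M], A[M, M] = x, t
    for m in range(M):
        p2 = A[m, m] ** 2 + S * A[M, m] ** 2
        if p2 <= 0.0:                 # degenerate plane; nothing to rotate
            continue
        p = np.sqrt(p2)
        sigma, tau = A[m, m] / p, A[M, m] / p
        upper = sigma * A[m, m:] + S * tau * A[M, m:]
        lower = -tau * A[m, m:] + sigma * A[M, m:]
        A[m, m:], A[M, m:] = upper, lower   # bottom-row pivot is now zeroed
    return A

def solve_weights(A):
    """Back-substitution on T(i,n) w = d1(i,n) (Step 3); valid once enough
    equations have been rotated in for T to be nonsingular."""
    M = A.shape[0] - 1
    return np.linalg.solve(A[:M, :M], A[:M, M])
```

No inversion of a covariance matrix appears anywhere, which is the source of the numerical robustness discussed above.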
As is the case with the A-S method, a forgetting factor must be employed to gradually reduce the effects of earlier linearizations. This is very easily accomplished in the QR-WRLS algorithm by simply multiplying the top $N_1$ rows of the matrix $A$ (matrix $T(i,n)$ and vector $d_1(i,n)$) by a factor $\beta < 1$ prior to the rotation of the $(n+1)$st pattern equation. In this context, both the forgetting factor and the frequency of weight updates can be varied. In addition to exponential forgetting factors, equations can be "rotated out" of the matrix. This is done by changing $S$ in Figure 2.4 to $-1$ and rotating in the equation to be eliminated. Thus, for example, only the last $Q > N_1$ equations can be used to calculate the weight updates by sequentially removing equation $n - Q + 1$ prior to inclusion of equation $n$. This procedure effects a sliding window over which the estimates are computed.

Another forgetting method useful for FNNs is possible because no initialization of the updating equations is necessary. Because there are no initialization problems, the system can be re-initialized at any step, thus completely "forgetting" the past linearized values. One method that has worked well in our simulations is to re-initialize the $A$ matrix after every iteration. These and a number of other flexible forgetting strategies made possible by this algorithm may prove very useful in the training of FNNs [28]. In addition to new forgetting strategies, the QR-WRLS implementation also allows the frequency of updating of the weights to vary. As with conventional MIL-WRLS, the weights can be updated every time a new linearization is used⁶.

⁶This can be as often as every pattern, or at the end of each iteration through the patterns, as has been our convention.

The theoretical results above, along with those in Section 2, can be combined to form a learning algorithm for FNNs. For the node-wise weight updating this is done as follows. First, the weights of the network are initialized. This is done randomly, each weight being selected from a uniform distribution over the set $[-1,1]$. Once the initial weights are chosen, the weight updating can begin. First, a training pattern is input to the system. Because the weights are not updated until all the training patterns have been used, convergence does not depend on the order in which the training patterns are used. Given a training pattern, the algorithm calculates linearized training patterns for the last layer nodes and these are rotated into the corresponding $A$ matrix. Each node has a separate $A$ matrix. The target outputs of the layer below are calculated next using back-propagation; see (2.16). The $A$ matrices for the first layer are then updated. A new training pattern is then used to calculate a new set of linearized inputs and outputs. This is repeated until all the training patterns have been used. The $A$ matrices are then used to calculate updated weights. This continues until the network converges to a solution or a specified maximum number of iterations has been reached. By definition, the solution is said to have converged when the change of the norm of the vector of all the weights is below a threshold.

The algorithm for the layer-wise weight updating is similar. Weight updating is exactly the same for the output layer nodes. The layers below have one $A$ matrix each. The linearized training patterns are calculated for each layer and rotated into the corresponding matrix. After all the training patterns have been rotated in, the new weights are calculated and the process is repeated.
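The forgetting devices and the convergence test just described amount to a few lines of array manipulation. The following is a minimal sketch under the conventions of Figure 2.4; the helper names and the tolerance value are ours, not the dissertation's.

```python
import numpy as np

def exponential_forget(A, beta):
    """Exponential forgetting: scale the top N1 rows of A, i.e. T(i,n) and
    d1(i,n), by beta < 1 before the next pattern equation is rotated in."""
    A[:-1, :] *= beta
    return A

def reinitialize(A):
    """Complete forgetting: re-zero the working matrix.  Because QR-WRLS
    needs no initialization, this can be done at any step; re-initializing
    after every iteration worked well in the simulations."""
    A[:] = 0.0
    return A

def converged(w_old, w_new, tol=1e-4):
    """Convergence test described in the text: the change of the norm of
    the vector of all weights falls below a threshold (tol is assumed)."""
    return abs(np.linalg.norm(w_new) - np.linalg.norm(w_old)) < tol
```

The sliding-window variant is obtained by pairing each new rotation with a rotate-out ($S = -1$) of the oldest retained equation, as described above.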
As before, the algorithm stops when a solution is reached or a specified maximum number of iterations is reached. As with other training algorithms for FNNs, this algorithm may not converge to the weights corresponding to the global minimum of $E$. Also, although the algorithm approximates a gradient system, because it is not a gradient system, there is no guarantee that the algorithm will converge to any solution.

Chapter 3

Data Reduction Algorithm

3.1 Introduction

This chapter describes a method of reducing the number of training patterns used at each iteration. Each linearized training pattern that is rotated into $T(i,N)$ and $d_1(i,N)$ will affect both the magnitude and direction of the vector solution $w(i)$. Some training patterns will have a relatively large effect on the solution and others a relatively small effect. The goal is to determine which training patterns will have a very small effect on the resulting weights and to avoid the rotation of these into the system of equations, thereby reducing the number of computations at this iteration and making training more efficient.

3.2 Data Reduction Algorithm

Simulation studies have revealed a very important phenomenon which occurs when the training algorithm of the previous chapter is employed. As the training proceeds, a few of the linearized training patterns become dominant, and the weight change is largely dependent on these training patterns. Although the original training patterns do not change, the linearized training patterns are dependent on the present weights, and so the linearized training patterns generally change at each iteration. For this reason, at one iteration a linearized training pattern may have a large effect on the weight change, while at another iteration, after the weights have changed, the linearized training pattern may have a negligible effect. All linearized weight updating algorithms presented in Chapter 2 in effect reduce the network to a linear network with one output. Hence in this section we will consider one node with $M$ inputs and one output, using the notation introduced for QR-WRLS.

The fact that a few of the linearized training patterns become dominant during the training is due to the non-linearity in the output. Simply put, it is usual for the linearized training patterns to decrease in magnitude as the training progresses, and so have a decreasing effect on the resulting vector $w(i)$. Let us now consider this effect in greater detail. The non-linearity used in the FNN is a sigmoid. As a consequence, as the input to the nonlinearity of the node approaches $\pm\infty$, the output approaches a constant. The derivative of the output with respect to the input, which is always positive, is largest when the input is zero, and goes to zero as the absolute value of the input becomes large. This derivative is important in determining the effect of a linearized training pattern on the result because the magnitude of the linearized training pattern is proportional to this derivative. The derivative, in turn, is dependent on the input $u$. Finally, $u$ is dependent on the weights and on the non-linear training pattern. As the weights change during the training process, we can view the changing effect that a specific linearized training pattern has on the solution relative to itself, or relative to the other linearized training patterns.

First, let us view the changing effect of a linearized training pattern relative to itself. As training progresses, it is usual for the weights to increase.
This increase in weights usually causes the input $u$ to increase. An increase in $u$ causes the derivative at successive iterations to decrease, and the magnitude of the linearized training pattern usually decreases. This can be easily seen from (2.9) and (2.13). The linearized training pattern at the next iterations will have a smaller magnitude and thus have a smaller effect on the solution weights.

Now consider a specific linearized training pattern relative to other linearized training patterns. It is usual for all the linearized training patterns to decrease in magnitude and thus have a decreasing effect on the solution weights. However, because the change of weights does not affect the input $u$ corresponding to each of the training patterns equally, the derivative of the output with respect to the input will not change equally for all the training patterns. As the derivative for some training pattern becomes very small, the effect of this linearized training pattern becomes negligible relative to the others.

Another way of explaining this process involves the use of error surfaces. Each training pattern has an associated error surface that is a function of the system weights. As shown in (1.6), the total error surface is the sum of the error surfaces associated with each of the individual training patterns. A training pattern with an associated error surface that is almost flat at the present weights does not contribute significantly to the derivative of the total error surface, and consequently does not contribute significantly to the determination of the gradient direction. The effect of rotating a training pattern with a small corresponding derivative (one whose error surface is almost flat at the present weights) into the $T(i,N)$ matrix is smaller than the effect of another training pattern with a large derivative, or an error surface that is not as flat.

To determine whether some of the training patterns can be ignored at the present iteration without affecting the solution significantly, two issues must be addressed: how to determine if a training pattern has an effect small enough to be excluded at the present iteration, and how to do so in a computationally inexpensive way. The "effect" of each training pattern will be measured by the magnitude of the change of the solution weights due to this training pattern.

Assume that iteration $i$ through the training patterns has been completed. Thus, $w(i)$ has been computed, and the system weights have been changed in the direction of $w(i)$. New linearized training patterns are then calculated using these new weights. We now need to decide which training patterns to use in the calculation of $w(i+1)$. The following matrix equation is the result of the $i$th iteration:

$$T(i,N)\, w(i) = d_1(i,N). \qquad (3.1)$$

Suppose we use QR-WRLS to rotate the first new linearized training pattern into (3.1). This results in a new matrix equation, say,

$$T'(i,N)\, w'(i) = d'_1(i,N). \qquad (3.2)$$

Let

$$\delta w(i) = w'(i) - w(i)$$
$$\delta T(i,N) = T'(i,N) - T(i,N) \qquad (3.3)$$
$$\delta d_1(i,N) = d'_1(i,N) - d_1(i,N).$$

Thus $\|\delta w(i)\|$ is the magnitude of the weight change due to this training pattern at this iteration, where, for the moment, $\|\cdot\|$ indicates any valid norm. To simplify notation,

$$\Delta(w(i)) \stackrel{\Delta}{=} \frac{\|\delta w(i)\|}{\|w(i)\|} \qquad (3.4)$$

will be used in comparing the magnitudes of the weight changes due to the training patterns. The same procedure of rotating one training pattern into (3.1) is repeated in order to calculate $\Delta(w(i))$ separately for each of the linearized training patterns.
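The exact per-pattern effect of (3.4) can be computed by a single sequence of Givens rotations and a re-solve, as the sketch below illustrates (function name and structure are ours). Doing this for every candidate pattern costs as much as a full update, which is precisely the problem taken up next.

```python
import numpy as np

def exact_effect(T, d1, x, t):
    """Exact normalized weight change Delta(w(i)) of (3.4) for one
    candidate pattern (x, t): rotate the equation x^T w = t into the
    triangular system T w = d1 and re-solve (infinity norm used here)."""
    M = T.shape[0]
    Tp, dp = T.copy(), d1.copy()
    row, rhs = x.astype(float).copy(), float(t)
    for m in range(M):                       # annihilate row against T
        p = np.hypot(Tp[m, m], row[m])
        if p == 0.0:
            continue
        c, s = Tp[m, m] / p, row[m] / p
        Tm, dm = Tp[m].copy(), dp[m]
        Tp[m], dp[m] = c * Tm + s * row, c * dm + s * rhs
        row, rhs = -s * Tm + c * row, -s * dm + c * rhs
    w = np.linalg.solve(T, d1)               # solution of (3.1)
    w_new = np.linalg.solve(Tp, dp)          # solution of (3.2)
    return np.linalg.norm(w_new - w, np.inf) / np.linalg.norm(w, np.inf)
```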
We now have a number for comparing the effect of each training pattern. However, it is immediately apparent that in order to calculate $\Delta(w(i))$ for each training pattern using any norm, each training pattern must be rotated separately into the system (3.1). This is the same number of computations needed to calculate new weights using all the training patterns. Consequently, direct computation of $\Delta(w(i))$ as a means of checking for the usefulness of a given training pattern offers no computational advantage. We therefore seek an approximation to $\Delta(w(i))$, say $\tilde{\Delta}(w(i))$, which is

1. much less computationally expensive to compute than $\Delta(w(i))$.

Additionally, it will be desirable for $\tilde{\Delta}(w(i))$ to have two further properties:

2. Order preservation. If $\tilde{\Delta}^{(n)}(w(i))$ indicates the normalized change in $w(i)$ due to training pattern $n$, and $\Delta^{(n)}(w(i)) < \Delta^{(m)}(w(i))$, then it should be true that $\tilde{\Delta}^{(n)}(w(i)) < \tilde{\Delta}^{(m)}(w(i))$.

3. $\tilde{\Delta}^{(n)}(w(i)) \rightarrow 0$ as $\Delta^{(n)}(w(i)) \rightarrow 0$.

Properties 2 and 3 assure that the approximation will "track" the true change at small values (where the information is most important) while at the same time preserving the order of relative changes among the patterns.

Now, from (3.2) and (3.3),

$$[T(i,N) + \delta T(i,N)]\,[w(i) + \delta w(i)] = d_1(i,N) + \delta d_1(i,N). \qquad (3.5)$$

Using (3.1), (3.5) becomes

$$\delta T(i,N)\, w(i) + T(i,N)\, \delta w(i) + \delta T(i,N)\, \delta w(i) = \delta d_1(i,N). \qquad (3.6)$$

Factoring $\delta w(i)$,

$$[T(i,N) + \delta T(i,N)]\, \delta w(i) = \delta d_1(i,N) - \delta T(i,N)\, w(i). \qquad (3.7)$$

After some manipulation, we obtain

$$\delta w(i) = T^{-1}(i,N)\,\Bigl[\bigl[\delta d_1(i,N) - \delta T(i,N)\, w(i)\bigr] - \delta T(i,N)\, \delta w(i)\Bigr]. \qquad (3.8)$$

Taking any valid norm gives the upper bound

$$\|\delta w(i)\| \le \|T^{-1}(i,N)\|\,\Bigl[\|\delta d_1(i,N) - \delta T(i,N)\, w(i)\| + \|\delta T(i,N)\|\,\|\delta w(i)\|\Bigr], \qquad (3.9)$$

which, after some algebra, yields

$$\|\delta w(i)\| \le \frac{\|T^{-1}(i,N)\| \cdot \|\delta d_1(i,N) - \delta T(i,N)\, w(i)\|}{1 - \|T^{-1}(i,N)\| \cdot \|\delta T(i,N)\|}. \qquad (3.10)$$

It is our objective to produce an approximation to $\|\delta w(i)\|$ or $\Delta(w(i))$ rather than a bound. A pivotal consideration in doing so is the conditioning of the matrix $T(i,N)$. We digress momentarily to consider this issue. Proper conditioning of $T(i,N)$ and $T(i,N) + \delta T(i,N)$ means that

$$\|T(i,N)\| \cdot \|T^{-1}(i,N)\| \approx 1 \qquad (3.11)$$
$$\|T(i,N) + \delta T(i,N)\| \cdot \|[T(i,N) + \delta T(i,N)]^{-1}\| \approx 1.$$

Geometrically this implies that the norm of neither of the products

$$T(i,N)\, w(i) \qquad (3.12)$$
$$[T(i,N) + \delta T(i,N)]\, w(i) \qquad (3.13)$$

is dependent on the direction of $w(i)$. The ability to discount the direction of $w(i)$ will be central to the development of a useful approximation. A sufficient condition for these matrices to be well conditioned is for the features to be independent and identically distributed (i.i.d.). This is because $T(i,N)$ is the Cholesky factor of the "temporal" covariance matrix, whose expectation is diagonal under proper ergodicity assumptions. The i.i.d. feature model is not unrealistic in many problems.

Returning to the bound in (3.10), we invoke the fact that if the $T^{-1}(i,N)$ matrix, and equivalently the $T(i,N)$ matrix, is well-conditioned, then we can replace the inequality by an approximation,

$$\|\delta w(i)\| \approx \frac{\|T^{-1}(i,N)\| \cdot \|\delta d_1(i,N) - \delta T(i,N)\, w(i)\|}{1 - \|T^{-1}(i,N)\| \cdot \|\delta T(i,N)\|}. \qquad (3.14)$$

This is possible because the Cauchy-Schwartz inequality can be approximated by an equality under the assumed conditions. While this approximation can be shown to have the second and third features mentioned above, it is too expensive computationally to be useful in speeding up the algorithm. It would require more computations to calculate this approximation than to use all the training patterns for the weight updating.
In order to reduce the number of computations, substitutions for some of the factors in this approximation will be made. The first substitution introduced will be for $\|T^{-1}(i,N)\|$. Since $d_1(i,N) = T(i,N)\, w(i)$,

$$\frac{\|w(i)\|}{\|d_1(i,N)\|} \le \|T^{-1}(i,N)\|. \qquad (3.15)$$

It is easily shown that under the same well-conditioning assumption, the inequality is also a good approximation, so

$$\frac{\|w(i)\|}{\|d_1(i,N)\|} \approx \|T^{-1}(i,N)\|. \qquad (3.16)$$

Making this substitution for $\|T^{-1}(i,N)\|$ on the right side of (3.10), we obtain

$$\|\delta w(i)\| \approx \frac{\dfrac{\|w(i)\|}{\|d_1(i,N)\|}\,\|\delta d_1(i,N) - \delta T(i,N)\, w(i)\|}{1 - \dfrac{\|w(i)\|}{\|d_1(i,N)\|}\,\|\delta T(i,N)\|}. \qquad (3.17)$$

Thus

$$\tilde{\Delta}(w(i)) \stackrel{\Delta}{=} \frac{\|\delta d_1(i,N) - \delta T(i,N)\, w(i)\|}{\|d_1(i,N)\| - \|w(i)\|\,\|\delta T(i,N)\|} \approx \frac{\|\delta w(i)\|}{\|w(i)\|} = \Delta(w(i)). \qquad (3.18)$$

Let us now show that under fairly general conditions, (3.18) is a good approximation. First we demonstrate order preservation. Let $\|\delta^{(n)}w(i)\|$ and $\|\delta^{(m)}w(i)\|$ be the weight changes due to two different training patterns, $n$ and $m$. Let us suppose that

$$\|\delta^{(n)}w(i)\| < \|\delta^{(m)}w(i)\|. \qquad (3.19)$$

Therefore

$$\|T(i,N)\| \cdot \|\delta^{(n)}w(i)\| < \|T(i,N)\| \cdot \|\delta^{(m)}w(i)\|. \qquad (3.20)$$

Let us assume that, for all training patterns, $\delta T(i,N)$ is small compared with $T(i,N)$, so that

$$\|T(i,N)\| \approx \|T(i,N) + \delta T(i,N)\| \qquad (3.21)$$

and

$$\frac{\|d_1(i,N)\|}{\|w(i)\|} \gg \|\delta T(i,N)\|. \qquad (3.22)$$

With assumption (3.21), we can write (3.20) as

$$\|T(i,N) + \delta^{(n)}T(i,N)\| \cdot \|\delta^{(n)}w(i)\| < \|T(i,N) + \delta^{(m)}T(i,N)\| \cdot \|\delta^{(m)}w(i)\|, \qquad (3.23)$$

where $\delta^{(n)}T(i,N)$ is the change in $T(i,N)$ due to pattern $n$. Because $T(i,N) + \delta^{(n)}T(i,N)$ and $T(i,N) + \delta^{(m)}T(i,N)$ are assumed well conditioned, (3.23) can be written

$$\|[T(i,N) + \delta^{(n)}T(i,N)]\,\delta^{(n)}w(i)\| < \|[T(i,N) + \delta^{(m)}T(i,N)]\,\delta^{(m)}w(i)\|. \qquad (3.24)$$

Now using condition (3.22),

$$\frac{\|[T(i,N) + \delta^{(n)}T(i,N)]\,\delta^{(n)}w(i)\|}{\dfrac{\|d_1(i,N)\|}{\|w(i)\|} - \|\delta^{(n)}T(i,N)\|} < \frac{\|[T(i,N) + \delta^{(m)}T(i,N)]\,\delta^{(m)}w(i)\|}{\dfrac{\|d_1(i,N)\|}{\|w(i)\|} - \|\delta^{(m)}T(i,N)\|}. \qquad (3.25)$$

From (3.7), (3.25) is equivalent to the inequality

$$\frac{\|\delta^{(n)}d_1(i,N) - \delta^{(n)}T(i,N)\,w(i)\|}{\dfrac{\|d_1(i,N)\|}{\|w(i)\|} - \|\delta^{(n)}T(i,N)\|} < \frac{\|\delta^{(m)}d_1(i,N) - \delta^{(m)}T(i,N)\,w(i)\|}{\dfrac{\|d_1(i,N)\|}{\|w(i)\|} - \|\delta^{(m)}T(i,N)\|} \qquad (3.26)$$

or

$$\frac{\|\delta^{(n)}d_1(i,N) - \delta^{(n)}T(i,N)\,w(i)\|}{\|d_1(i,N)\| - \|w(i)\| \cdot \|\delta^{(n)}T(i,N)\|} < \frac{\|\delta^{(m)}d_1(i,N) - \delta^{(m)}T(i,N)\,w(i)\|}{\|d_1(i,N)\| - \|w(i)\| \cdot \|\delta^{(m)}T(i,N)\|}. \qquad (3.27)$$

Hence we have shown that the ordering of weight effects (3.19) results in

$$\tilde{\Delta}^{(n)}(w(i)) < \tilde{\Delta}^{(m)}(w(i)). \qquad (3.28)$$

Now consider the "tracking" property 3. Suppose that

$$\delta w(i) = 0. \qquad (3.29)$$

In this case it is easily shown using (3.2) and (3.3) that

$$\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\| = 0. \qquad (3.30)$$

It follows immediately that $\tilde{\Delta}(w(i)) = 0$. Consequently, the "tracking to zero" property is established.

Thus it can be seen that if the $T(i,N)$ matrices are well conditioned, then the approximation is good in the sense of properties 2 and 3. If the $T(i,N)$ matrix is not well-conditioned, then two errors can occur. First, a training pattern with a small effect may be included in the weight updates. This "false alarm" is not a serious problem because including additional training patterns is not detrimental to the weight estimation. Further, the numerator in the approximation is greatest when the matrices are well conditioned; thus,

$$\|[T(i,N) + \delta T(i,N)]\,\delta w(i)\| \le \|T(i,N) + \delta T(i,N)\| \cdot \|\delta w(i)\|. \qquad (3.31)$$

If the weight changes are small, then the approximation will be smaller than if the matrices were well-conditioned, and the chance of a false alarm is small. The second error that can occur is the improper omission of a training pattern. A "miss" means that the left side of (3.31) is much smaller than the right, implying the matrix $T(i,N) + \delta T(i,N)$ is not well-conditioned.
This ill conditioning occurs when there is a feature that is always "zero," so that a "zero" row appears in $T(i,N)$, or if there is poor scaling of the features, so that the diagonal elements of $T(i,N)$ are not of comparable magnitudes. Either of these conditions implies a poorly designed experiment, and could be corrected by eliminating unuseful features or by scaling the features correctly.

Finally, we must explore the computational complexity of $\tilde{\Delta}(w(i))$. Although this approximation is less expensive than $\Delta(w(i))$, computing it for each training pattern will again be more expensive than simply using all the training patterns in the updating equations. In order to achieve the desired reduced computational load, two more substitutions, one for $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|$ and one for $\|\delta T(i,N)\|$, will be introduced.

To reduce the number of computations needed to calculate $w(i+1)$, the total number of computations needed to calculate the approximation $\tilde{\Delta}(w(i))$ for each training pattern and to use the chosen training patterns in the calculation of new weights must be less than that of using all the training patterns to calculate new weights. Toward this end, let all the norms be the infinity norm. For a vector, this norm is the maximum of the absolute values of the elements, and the induced matrix norm is the maximum of the sums of the absolute values of the row components.

To move to iteration $i+1$, $\|d_1(i,N)\|_\infty$ and $\|w(i)\|_\infty$ are determined from the results of the previous iteration. Because these quantities are independent of the linearized training patterns calculated at iteration $i+1$, they need to be calculated only once. This means the approximation introduced in (3.16) is the same for all the training patterns. This leaves $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$ and $\|\delta T(i,N)\|_\infty$ to be calculated for each linearized training pattern.

If it were known which row of $\delta T(i,N)$ would be the largest, it would only be necessary to compute one row of $T'(i,N)$ for each training pattern to get $\|\delta T(i,N)\|_\infty$, and similarly for $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$. In order to have fewer computations, it would be better, if possible, to use the same row to calculate $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$ and $\|\delta T(i,N)\|_\infty$. In this case, only one row of $T'(i,N)$ would be calculated for each training pattern, and a savings in computations would result. For this reason, a judiciously selected single row will be used to approximate $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$ and $\|\delta T(i,N)\|_\infty$.

Since $T(i,N)$ and $T'(i,N)$, and thus $\delta T(i,N)$, are upper triangular, only the first row of $\delta T(i,N)$ can have all non-zero elements. Consequently, a natural approximation for $\|\delta T(i,N)\|_\infty$ is the sum of the absolute values of the first-row elements. The norm of the first row is also a good approximation for $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$. This is not only because it has more non-zero elements, but also because more of the weights are included in the computation of the row of $\delta T(i,N)\,w(i)$ actually used in the computations. This means a greater directional component is included in the final computation. For example, if the last row of $\delta T(i,N)$ were used, the contribution of $\delta T(i,N)\,w(i)$ would depend on only one weight. By using the first row, the employed row of $\delta T(i,N)\,w(i)$, and therefore $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$, depends on all the weights. These substitutions are necessary in order to have a reduction in the number of computations.
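Under the substitutions just described, the screening test reduces to a single Givens rotation against the first row. The sketch below, with our own helper name, computes the cheap estimate of (3.18); `d1_norm` and `w_norm` are the once-per-iteration constants $\|d_1(i,N)\|_\infty$ and $\|w(i)\|_\infty$, and the working matrix $A$ is partitioned as in (2.55).

```python
import numpy as np

def pattern_effect(A, x, t, w, d1_norm, w_norm):
    """Cheap approximation (3.18) to the normalized weight change a
    candidate pattern (x, t) would cause, using only the first row of the
    rotated system (the only row of the upper-triangular delta-T that can
    be fully non-zero)."""
    M = A.shape[0] - 1
    row = A[0, :].copy()                 # [first row of T(i,N) | d1(i,N)[0]]
    eq = np.append(x, t)                 # the candidate equation [x | t]
    p = np.hypot(A[0, 0], x[0])
    if p == 0.0:
        return 0.0
    new_row = (A[0, 0] * row + x[0] * eq) / p   # first row after one rotation
    delta = new_row - row                       # first rows of dT and dd1
    dT_row, dd1 = delta[:M], delta[M]
    num = abs(dd1 - dT_row @ w)                 # ||dd1 - dT w|| via first row
    dT_norm = np.sum(np.abs(dT_row))            # ||dT||_inf via first row
    denom = d1_norm - w_norm * dT_norm
    return num / denom if denom > 0.0 else np.inf
```

Patterns whose estimate falls below a chosen threshold are simply not rotated in at this iteration, which is the computational saving sought.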
Some consequences of using the substitutions can be explicitly observed. First, by approximating the infinity norm by one row, the approximation is smaller than or equal to the true norm. If the true norm is the sum of the absolute values of the first-row components, then the approximation is exact. If not, the approximation will be smaller than the true norm. This is true for both $\|\delta d_1(i,N) - \delta T(i,N)\,w(i)\|_\infty$ and $\|\delta T(i,N)\|_\infty$. Both of these approximations have the effect of making the weight change approximation smaller than the upper bound given in (3.10). The substitution for $\|T^{-1}(i,N)\|_\infty$ used for (3.18) has the same effect. This means that the final approximation for $\|\delta w(i)\|_\infty / \|w(i)\|_\infty$ is bounded above by the right side of (3.10). Consequently, if the upper bound of the weight change is going to zero, the approximation will go to zero also. However, it is possible for the approximation to go to zero without the true weight change going to zero. Although this can occur theoretically, it has not proved to be a problem in implementation, as it does not occur in any of the simulations reported in Chapter 4.

The practical test of the approximation to the weight change is its effectiveness as an indicator of the effect of the training pattern on the solution weights. Simulations were run in which the norm of the true weight change was compared with the approximation. The correlation coefficient between the norm of the true weight change and the approximation was computed for each iteration. It was typically in the range of 0.8 to 0.95. These results, along with the simulations run using the data reduction algorithm, indicate that the approximation is indeed good at estimating the weight change of the linearized training pattern.

Chapter 4

Simulation Results

4.1 Introduction

This chapter is divided into three main sections, each corresponding to one of the three main contributions of this dissertation enumerated at the beginning of Chapter 1. The first section presents the simulation results comparing BP, the A-S algorithm, and the QR-WRLS-based node-wise weight updating algorithm developed in this research. The principal purpose of this first group of simulations is to show the advantages of implementing linearized algorithms with QR-WRLS instead of MIL-WRLS. The simulation results for BP have been included as a reference.

Simulation results for BP and the node-wise, layer-wise and network-wise weight updating algorithms are presented in the next section. The node-wise and layer-wise algorithms are used for all the simulations; the network-wise algorithm is used for the simulations of a one-output network. The last section presents simulation studies for the data reduction scheme implemented for the layer-wise algorithm. BP results for the same studies are again presented as a reference.

4.2 Implementation Studies

QR-WRLS, presented in Chapter 2, has been shown to be slightly more computationally efficient than conventional MIL-WRLS in identification of linear systems [4]. The simulations presented here are designed to show the benefits of QR-WRLS for the non-linear FNN. The theoretical advantages are described in Chapter 2; these were confirmed in these simulations. One of these advantages is that the QR-WRLS equations need no initialization. With the MIL-WRLS implementation, the equations are initialized and the linearized training patterns are used to update the weight estimates.
As mentioned in Section 2.3.1, even though a forgetting factor is included, past training patterns are still incorporated in the computation of all weights. However, as the objective is to determine the gradient of the system at the present weights, including past training patterns will necessarily lead to an incorrect gradient estimate. The QR-WRLS equations, on the other hand, need no initialization. This allows the use of a forgetting factor if desired, or the linearized training patterns of the present iteration alone can be used to determine the gradient at the present weights. Even with simple networks and few training patterns, the difference between implementation using QR-WRLS and MIL-WRLS with a forgetting factor was apparent. With more complicated networks and many training patterns the error surface is more complicated and the difference in implementation is even greater. In this section the simulations were run with the QR-WRLS implementation using the same forgetting factor as the MIL-WRLS equations. This was done to simplify the comparisons. It will be shown in the next section that a smaller forgetting factor, or using only the training patterns of the present iteration, is better. This means that the results for the QR-WRLS implementation can be even better than those presented in this section.

Another significant advantage of QR-WRLS is its robust numerical properties. This was especially apparent in the ease of implementation. For a linear network, the training patterns are used once. With the FNN, the training patterns are presented many times. As described in Chapter 3, the linearized training patterns tend to decrease in magnitude as the training progresses. This leads to the covariance matrix becoming very small, and its inverse, the P matrix of the MIL-WRLS equations, becoming very large. In the simulations this matrix tended to become numerically unstable. The QR-WRLS algorithm has no inversions, so this is not a problem.

The simulations reported in this section compare three training strategies for an FNN. These are:

1. Conventional back-propagation, BP (no linearization).
2. The A-S algorithm (conventional MIL-WRLS with a forgetting factor).
3. The node-wise weight updating algorithm implemented with QR-WRLS with an exponential forgetting factor.

The linearizations performed in the A-S algorithm and the node-wise updating algorithm are equivalent, so the only difference between the last two algorithms is the implementation. Each of the three strategies above was used to train each of the following networks:

1. a 2-bit parity checker,
2. a 4-bit parity checker, and
3. a 4-bit bit counter.

The architectures for these three networks are illustrated in Figure 4.5. The 2-bit parity checker (XOR) network has two inputs, two hidden layer nodes and one output node. An additional input is added at each layer whose value is always unity, to serve as a bias for each node. The output function $S(\cdot)$ is the sigmoid defined in (1.3). The initial weights were chosen as follows. Each weight in the network was selected randomly from a uniform distribution over the set $[-1,1]$.

Figure 4.5: Network architectures used in the simulation studies. 1. the 2-bit parity checker, 2. the 4-bit parity checker, and 3. the 4-bit bit counter.

For the BP algorithm, a factor of 0.04 was used in the weight updating equation.
This weight selection procedure was repeated 100 times to select 100 sets of initial weights. The same 100 sets of weights were used for all three implementations. The A-S algorithm was implemented [6] using no weight change constraints. The forgetting factor for this and for the QR-WRLS implementation was 0.98. The QR-WRLS implementation used a weight constraint of 0.2. Thus the weight vector associated with each node was allowed to change by at most a magnitude of 0.2 in the Euclidean metric during each iteration.

The 4-bit parity checker network has four inputs, four hidden layer nodes and one output node. A bias input is also added to each layer. Two output functions were used: the same sigmoid function as above, and the logic activation function. The logic activation function is a three-piece piecewise linear function,

$$f(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } 0 \le x \le 1 \\ 0 & \text{if } x < 0. \end{cases} \qquad (4.1)-(4.2)$$

The derivative of $S(\cdot)$ is trivially determined everywhere except at zero and one, where it does not exist. This does not pose a problem in implementation if we let $S'(a) = 1$ if $a \in [0,1]$ and zero elsewhere. Two sets of 100 random initial weights were used for the three implementations. The first set of weights was random as in the 2-bit parity checker, and the second set of weights was as described by Azimi-Sadjadi et al. in their paper. The A-S method selects the weights so that the outputs of the network will be between zero and one. This is done so that the derivative will not be zero and weight updating can take place.

The 4-bit bit counter had four inputs, four hidden layer nodes and two outputs. An extra input was added to each layer. The logic activation function was used as the output function. Two sets of initial weights, random and those described by Azimi-Sadjadi, were used. The results are shown in Table 4.2.

Table 4.2: Number of convergences per 100 sets of initial weights

                    2in-1out  |     4in-1out      |     4in-1out      |     4in-2out
                    sigmoid   |     sigmoid       |  logic activation |  logic activation
  Implementation    random    |  random    A-S    |  random    A-S    |  random    A-S
                              |          weights  |          weights  |          weights
  Q-R                 78      |    5        5     |    51       57    |    1        16
  Back-Prop            1      |    0        0     |     1       53    |    0         0
  A-S                  8      |    0        0     |     1       37    |    0         9

The table shows the number of times each implementation found weights that solved the problem for the 100 initial weight sets.

Simulations were also run comparing the output error of each algorithm. In the resulting figures, the relative error in dB means the following: Let $\bar{e}(i)$ be the sum of the squared errors incurred in iteration $i$ through the training patterns, averaged over the 100 initial weight sets. Then, plotted in the figures is $10\log(\bar{e}(i)/\rho)$ (dB), where $\rho$ is the maximum possible error at each iteration. The maximum possible error at one iteration is the sum of the squares of the Euclidean distances between each training pattern and the corresponding output of the network.

Figure 4.6 shows the errors of the three X-OR implementations. These results indicate a clear advantage for the QR-WRLS strategy. Algorithmic differences among the three implementations account for the performance differences. One difference is the initialization needed for the MIL-WRLS equations. With the MIL-WRLS strategy, both the covariance matrix recursion and the weight vector recursion must be initialized using theoretically incorrect values. Because of these initializations, the MIL-WRLS algorithm is not guaranteed to move the estimate in the direction of greatest decrease of $E$, or even of decreasing $E$, for the first few iterations. Of the two MIL-WRLS recursions, the weight recursion seems to be the most sensitive to the initialization problem. This is because P is initialized with large values,
so that $P^{-1}$ is small and the effect of this initialization is relatively small. The weight recursion is sensitive to initialization because (2.50) depends explicitly upon $w_k(i,n-1)$. The QR-WRLS algorithm has only an implicit dependence on the weights, as do all linearization algorithms, because the linearizations depend on the weights.

Figure 4.6: Average error in dB for the X-OR implementations vs. iteration number. 1. back-propagation; 2. A-S algorithm; 3. QR-WRLS.

There is also a difference in the performance of the network using different functions for $S(\cdot)$. The logic activation function proved superior to the sigmoid in these experiments. This is probably because the error will always be positive using the sigmoid, but can be zero for the threshold logic activation function. No matter how the weights are adjusted, the output of the sigmoid will always be bounded by one, so that the training pattern outputs can never be matched exactly. With the threshold logic activation function, once the weights are adjusted so that the output is off the ramp (the linear region), the output can be zero or one, in which case the difference between the training output and the actual output can be zero.

4.3 Layer Updating and Network Updating

This section compares BP, the node-wise updating algorithm, the layer-wise updating algorithm and the network-wise updating algorithm. It is divided into four subsections. The first shows the results of the four algorithms implementing a 2-bit parity checker and a 4-bit parity checker with different methods of selecting the initial weights. The second subsection shows the results of the three linearized algorithms with different forgetting factors. In the third subsection, results of simulations with many different, relatively small training sets are given for the four algorithms. The final subsection presents the results of the first three algorithms training with a large training set.

4.3.1 Initialization

One of the greatest problems in FNN training is the selection of the initial weights. As shown below, BP is very sensitive to the initial weights. Also shown in the previous section is the fact that other algorithms, such as A-S, suffer from the same problem. It is the purpose of this subsection to demonstrate that the layer-wise and network-wise weight updating algorithms are superior to BP in robustness to problems caused by "poor" initial weights. As will be seen in the simulations, both the layer-wise and network-wise algorithms can be initialized to very small values and perform very well. In fact the performance improves with very small initial values. This means that the weights can be initialized to small values, typically less than 0.01, without concern about the initial weights.

All four algorithms were used to train the 2-bit and 4-bit parity checker networks. The network architectures are as in the previous section. For the 2-bit parity checker, six different methods of choosing the initial weights were used. The first method was as in the previous section: 100 initial weight sets were chosen from the uniform distribution over $[-1,1]$. Next, 100 initial weight sets were again chosen from the uniform distribution, but now over $[-.5,.5]$. This process was repeated, choosing 100 initial weight sets from uniform distributions over $[-.25,.25]$, $[-.1,.1]$, $[-.01,.01]$ and $[-.001,.001]$ (a sketch of this selection procedure follows). Simulations were run for the four algorithms, and the results are shown in Table 4.3.
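The selection of initial weight sets described above is straightforward to reproduce. The following is a minimal sketch; the shapes shown assume the 2-bit parity network (two hidden nodes and one output node, each with a bias input), and the function name and seed are ours.

```python
import numpy as np

def initial_weight_sets(shapes, half_width, n_sets=100, seed=0):
    """Draw n_sets random initial weight sets, each weight uniform over
    [-half_width, +half_width], one array per weight matrix shape."""
    rng = np.random.default_rng(seed)
    return [[rng.uniform(-half_width, half_width, size=s) for s in shapes]
            for _ in range(n_sets)]

# The six families of Table 4.3: [-1,1] down to [-.001,.001].
families = [1.0, 0.5, 0.25, 0.1, 0.01, 0.001]
weight_sets = {a: initial_weight_sets([(2, 3), (1, 3)], a) for a in families}
```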
The BP algorithm had a factor of 0.05 for the weight updating equations. The node-wise algorithm had a forgetting factor of 0.1 and a weight change constraint of 0.25. The layer-wise algorithm had a forgetting factor of 0.1 and a weight change constraint of 1.0. The network-wise algorithm had a forgetting factor of 0.3 and a weight change constraint of 1.0. The number of convergences out of a possible 100 is given. Also shown is the average number of iterations to convergence for the trials where the algorithm converged to a solution.

Figures 4.7 through 4.10 show the average error of each of the algorithms for the different weight sets. Each of the six different weight sets is numbered: 1 is the weight set from the uniform distribution over $[-1,1]$, 2 is the weight set over $[-.5,.5]$, 3 is the weight set over $[-.25,.25]$, 4 is the weight set over $[-.1,.1]$, 5 is the weight set over $[-.01,.01]$, and 6 is the weight set over $[-.001,.001]$. Each of the Figures 4.8 through 4.14 is numbered similarly.

The simulations for the 4-bit parity checker are similar. 100 initial weights over the same distributions were chosen. The convergence results are shown in Table 4.4. The weight change constraints and forgetting factors were the same as in the 2-bit case. Figures 4.11 through 4.14 show the average error of each of the algorithms for the varying weight sets.

The results show that the performance of BP and the node-wise algorithm suffers as the initial weights become small. The layer-wise and network-wise algorithms, on the other hand, show very good performance even with all initial weights smaller than 0.001 in magnitude. Because of the nature of the problem, gradient-following algorithms are sensitive to the initial weights. Some initial weights will lead to a solution, and others will not. What we would like is one set of weights that would lead to a solution in all cases. This is, of course, impossible, but the large number of convergences of the layer-wise and network-wise algorithms for very small initial weights indicates that very small weights can be used in practice to approximate the "ideal" initial weights.

Table 4.3: Convergence results for various initial weights, 2-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

  Weight           BP        Node-wise      Layer-wise     Network-wise
  variation       A   B      A      B       A      B       A       B
  [-1,1]          1   -     70    23.59    97    14.84    86     96.78
  [-.5,.5]        0   -     93    33.85    99    18.74    93    122.18
  [-.25,.25]      0   -     99    72.95    99    11.55    94    150.37
  [-.1,.1]        0   -     80   156.85    99    11.22    94    166.61
  [-.01,.01]      0   -      7   393.57    97    12.08    98     81.97
  [-.001,.001]    0   -      5   376.6     96    17.45    93     32.51

Table 4.4: Convergence results for various initial weights, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

  Weight           BP        Node-wise      Layer-wise     Network-wise
  variation       A   B      A      B       A      B       A       B
  [-1,1]          0   -     38   207.90    59    65.20    35    238.17
  [-.5,.5]        0   -      9   301.67    76    65.16    38    197.03
  [-.25,.25]      0   -      0     -       76    52.01    42    148.95
  [-.1,.1]        0   -      0     -       82    47.43    45    138.62
  [-.01,.01]      0   -      0     -       89    26.01    50    100.52
  [-.001,.001]    0   -      0     -       87    29.92    66    121.89

Figure 4.7: Average error in dB for the BP implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.
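The quantity plotted in these figures is straightforward to compute from the definition given in Section 4.2. A brief sketch follows; the worst-case value of $\rho$ used in the example is our reading of that definition for the 2-bit parity network (four patterns, one output bounded in $[0,1]$), not a figure quoted from the dissertation.

```python
import numpy as np

def relative_error_db(avg_sq_errors, rho):
    """10*log10(ebar(i)/rho): ebar(i) is the squared error in iteration i
    averaged over the 100 initial weight sets; rho is the maximum possible
    error at one iteration."""
    return 10.0 * np.log10(np.asarray(avg_sq_errors, dtype=float) / rho)

# Example: for the 2-bit parity checker, four patterns each contributing at
# most a unit squared error give rho = 4.0 (an assumed worst case).
print(relative_error_db([4.0, 1.0, 0.1], rho=4.0))  # [0., -6.02, -16.02]
```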
[Figure 4.8: Average error in dB for the node-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.9: Average error in dB for the layer-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

4.3.2 Forgetting Factor

This subsection considers the performance of the node-wise, layer-wise and network-wise algorithms with different forgetting factors. When training an FNN with QR-WRLS, some method of "forgetting" past linearized training patterns must be used in the equations. This forgetting is necessary because old linearized training patterns are not useful in determining the gradient of the error surface at the present weights. As mentioned previously, implementation with QR-WRLS allows a variety of forgetting methods. The one that has proved to be very good is a forgetting factor of zero; in other words, all previous training patterns are completely "forgotten" (removed from the estimation process). The forgetting factor is often problem dependent, and one that works with one training set may not work well with another. Using a very small or a zero forgetting factor means that one variable has been eliminated, simplifying training.

[Figure 4.10: Average error in dB for the network-wise implementation of the 2-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.11: Average error in dB for the BP implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.12: Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.13: Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.]
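The role of the forgetting factor can be made explicit with the standard exponentially weighted least-squares cost. The following is a sketch in generic notation; the symbol ν follows Figure 4.15 below, and the dissertation's exact formulation appears in Chapter 2:

\[ J_i(\mathbf{w}) \;=\; \sum_{k=1}^{i} \nu^{\,i-k}\,\bigl(d(k) - \mathbf{x}^T(k)\,\mathbf{w}\bigr)^2, \qquad 0 \le \nu \le 1 . \]

With ν = 1 all past linearized patterns are weighted equally; with ν = 0 only the current iteration's patterns survive, which corresponds to the complete "forgetting" found to work well above.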
The forgetting factor and the weight change constraint that give good performance are related: if the forgetting factor is optimized for a certain training set and weight change constraint, then changing the weight change constraint means that the optimum forgetting factor will probably also change. In the simulations it is noted that, for good performance, a small weight change constraint must be used with a small forgetting factor. This relation is shown in Figure 4.15, where the errors of the node-wise algorithm implementing a 2-bit parity checker are plotted for different forgetting factors and different weight change constraints. The number of convergences out of a possible 100 for each of the settings was 78 for simulation 1, 60 for simulation 2, 56 for simulation 3 and 64 for simulation 4. It is apparent that the parameters which yield the most convergences do not necessarily lead to the lowest average error. This trade-off must be considered when implementing the algorithms.

[Figure 4.14: Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different methods of selecting initial weights.]

[Figure 4.15: Average error in dB for the QR-WRLS X-OR implementation vs. iteration number, using different forgetting factors and weight change constraints. 1. ν = 0.98, γ = 0.2; 2. ν = 0.98, γ = 1.0; 3. ν = 0.1, γ = 1.0; 4. ν = 0.1, γ = 0.2; where ν is the forgetting factor and γ is the weight constraint.]

Even though there is a relation between the forgetting factor and the weight constraint, for simplicity in these simulations the weight constraint was held constant as the forgetting factor was varied. As in the previous subsection, a 2-bit parity checker and a 4-bit parity checker are used in the simulations, with 100 sets of initial weights. The distribution for the weight selection for the 2-bit network was [-.25,.25] for the node-wise algorithm, [-.1,.1] for the layer-wise algorithm, and [-.01,.01] for the network-wise algorithm. These initial weight distributions were selected using the results of the previous section, to obtain the best results for each algorithm. As in the previous section, the weight change constraint was 0.25 for the node-wise algorithm, and 1.0 for the layer-wise and network-wise algorithms. For the 4-bit network the distribution for the weight selection was [-1.0,1.0] for the node-wise algorithm, and [-.01,.01] for the layer-wise and network-wise algorithms. Again, these initial weight distributions were selected to optimize the results for each algorithm. The weight change constraint was the same as for the 2-bit network.

The results for different forgetting factors are shown in Table 4.5 for the 2-bit network and Table 4.6 for the 4-bit network. Figures 4.16 through 4.18 show the average error of each algorithm for the different forgetting factors for the 2-bit network. Figures 4.19 through 4.21 show the average error of each algorithm for the different forgetting factors for the 4-bit network. In each of these figures the different forgetting factors are numbered: trial runs with forgetting factor 0.75 are numbered 1, those with forgetting factor 0.5 are numbered 2, those with forgetting factor 0.25 are numbered 3, and those with forgetting factor 0.1 are numbered 4.
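Since the weight change constraint γ appears throughout these trials, the rule is worth stating explicitly. The Appendix listings apply it inside weightch(); the following is a minimal stand-alone sketch (the function name is mine): the proposed change is scaled so that its Euclidean norm never exceeds γ.

#include <math.h>

/* Scale the proposed weight change dw[0..n-1] so that its Euclidean
   norm does not exceed the weight change constraint gamma. */
void constrain_change(double dw[], int n, double gamma)
{
  double den = 0.0, fac;
  int i;

  for (i = 0; i < n; ++i)
    den += dw[i] * dw[i];
  den = sqrt(den);
  if (den > gamma) {
    fac = gamma / den;
    for (i = 0; i < n; ++i)
      dw[i] *= fac;
  }
}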
To show the interdependence of the weight change constraint and the forgetting factor, the simulations using the 4-bit parity checker were run again using different weight constraints. The new weight constraints were 0.1 for the node-wise algorithm, and 0.4 for the layer-wise and network-wise algorithms. Results are shown in Table 4.7. Figures 4.22 through 4.24 again show the average errors for different forgetting factors, with the same numbering as above.

For the simple 2-bit network, the performance of the training algorithms does not exhibit much change across different forgetting factors. For the 4-bit network, there is more variance: Table 4.6 shows that a large forgetting factor reduces the number of iterations to convergence. When the weight change constraint is decreased, the performance of the layer-wise and network-wise algorithms improves, and smaller forgetting factors improve performance. The performance of the node-wise algorithm with the smaller weight change constraint is surprising: there are no convergences for any of the forgetting factors tried. Although the results indicate that the layer-wise and network-wise algorithms have very good performance using small forgetting factors and weight change constraints, the most important result may be the fact that the performance of these two algorithms, especially the layer-wise algorithm, is quite impervious to changes in forgetting factors and weight change constraints. This, along with the small initial weights, greatly simplifies implementation.

The simulations presented in the last two sections deal with the problems of weight initialization and the choice of a forgetting factor. 2-bit and 4-bit parity checkers were used for training. The initial weights and the forgetting factor were varied to show that the layer-wise algorithm has good performance with small initial weights and small forgetting factors. The BP and node-wise algorithms were shown to be very sensitive to these variables, and their performance varied greatly as the variables changed.

Table 4.5: Convergence results for various forgetting factors, 2-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

    Forgetting   Node-wise     Layer-wise    Network-wise
    factor       A    B        A    B        A    B
    0.75         97   110.87   96   7.5      84   37.89
    0.5          97   75.98    100  12.28    86   44.51
    0.25         96   69.33    100  9.59     91   52.55
    0.1          96   67.04    98   9.75     84   86.63

[Figure 4.16: Average error in dB for the node-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.]
[Figure 4.17: Average error in dB for the layer-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.]

[Figure 4.18: Average error in dB for the network-wise implementation of the 2-bit parity checker vs. iteration number, using different forgetting factors.]

Table 4.6: Convergence results for various forgetting factors, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

    Forgetting   Node-wise     Layer-wise    Network-wise
    factor       A    B        A    B        A    B
    0.75         32   202.94   95   18.6     69   100.75
    0.5          40   224.95   91   28.03    73   119.32
    0.25         40   234.73   93   20.17    50   118.72
    0.1          42   263.98   90   27.26    50   151.1

[Figure 4.19: Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

[Figure 4.20: Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

[Figure 4.21: Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

Table 4.7: Convergence results for various forgetting factors, 4-bit parity checker. A is the number of convergences out of 100, and B is the average number of iterations to convergence.

    Forgetting   Node-wise     Layer-wise    Network-wise
    factor       A    B        A    B        A    B
    0.75         0    -        91   28.89    67   106.55
    0.5          0    -        95   24.42    63   93.06
    0.25         0    -        97   31.52    57   132.02
    0.1          0    -        88   26.51    97   160.04

[Figure 4.22: Average error in dB for the node-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]
[Figure 4.23: Average error in dB for the layer-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

[Figure 4.24: Average error in dB for the network-wise implementation of the 4-bit parity checker vs. iteration number, using different forgetting factors.]

4.3.3 Small Training Sets

In this section many small training sets are used to compare the BP, node-wise, layer-wise, and network-wise training algorithms. These small training sets had two inputs and from twenty to thirty training patterns each. For each training set, the number of iterations required to correctly classify all the training patterns was determined. Simulations were run using two layer networks for all four algorithms. The number of hidden nodes was varied from two to four for the simple training sets; up to seven hidden nodes were used for the more complicated training sets. Since the network-wise algorithm can have only one output, only simulations for training sets with two classes could be used for it. A combination of training pattern set and number of hidden layer nodes that was not run is left blank; a dash means that the algorithm failed to find weights that correctly classified all the training patterns. The results are shown in Tables 4.8 through 4.11.

Training set A, shown in Figure 4.25, is a simple linearly separable set using training patterns from two classes; the two classes are separated by one of the axes. Training set B is the same as A, only the classes to which the training patterns belong have been interchanged. This was done for many of the training sets: training sets C and D; E and F; G and H; and I and J are likewise similar, differing only in the interchange of the classes of the training patterns. This was done to determine whether the training algorithms favored positive or negative weights. If an algorithm found weights so that the network correctly classifies training set A, then changing the sign on the weights means that the network can now correctly classify training set B (a brief justification is sketched at the end of this subsection). Similarly for training sets C and D; E and F; G and H; and I and J. Training set C is linearly separable, but not by one of the axes as in training set A. Training set E is linearly separable; however, the separating line cannot run through the origin, as it can for training sets A through D. Training set G is the first training set that is not linearly separable. Training set L has training pairs divided into four classes; the training pairs are again linearly separable, and BP and the node-wise algorithm again perform very well. Training set N has ten classes of training pairs, taken from the large training set used in the next section.

The first result to note is the poor performance of the network-wise training algorithm. It seems that the algorithm is interesting from a theoretical point of view, but is not practical for use in training without further work. From the tables, it can be seen that the node-wise and BP algorithms performed very well for the linearly separable training sets.

[Figure 4.25: Training set A (input 1 vs. input 2).]

When the training sets were not linearly separable, however, the advantages of the layer-wise algorithm are apparent. Most interesting is the fact that the numbers of iterations to convergence for the layer-wise algorithm for training sets A through K vary only slightly. Also, as shown in the tables, the layer-wise algorithm is the only one to find weights that correctly classify all training patterns for training set N. This provides evidence for what has been mentioned previously: changing all the weights of the same layer together has a great advantage over changing the weights independently.
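A brief justification of the class-interchange device, for the sigmoid case (my own note): since

\[ S(-a) \;=\; \frac{1}{1+e^{a}} \;=\; 1 - S(a), \]

negating the weights and bias feeding an output node maps its output y to 1 - y. A network that classifies training set A correctly therefore classifies the label-swapped set B correctly after such a sign change, so each A/B pair probes whether an algorithm is biased toward weights of one sign.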
4.3.4 Large Training Set

In this section a large training set is used to compare the BP, node-wise and layer-wise training algorithms. The training set has 519 training pairs. As can be seen in Figure 4.34, the training pairs overlap, and so the correct classification of all training pairs is not possible. The layer-wise algorithm performed the best: it has the lowest error and the most correctly classified training patterns. Table 4.12 shows the number of correctly classified training patterns for different numbers of hidden layer nodes. Figure 4.35 shows the relative errors of the BP algorithm using six, seven, and eight hidden layer nodes. Figures 4.36 and 4.37 show the errors for the node-wise and layer-wise algorithms.

[Figure 4.26: Training set C (input 1 vs. input 2).]
[Figure 4.27: Training set E (input 1 vs. input 2).]
[Figure 4.28: Training set G (input 1 vs. input 2).]
[Figure 4.29: Training set I (input 1 vs. input 2).]
[Figure 4.30: Training set K (input 1 vs. input 2).]
[Figure 4.31: Training set L (input 1 vs. input 2).]
[Figure 4.32: Training set M (input 1 vs. input 2).]
[Figure 4.33: Training set N (input 1 vs. input 2).]

[Table 4.8: Iterations to convergence of the BP algorithm for various training sets and varying number of hidden layer nodes.]

[Table 4.9: Iterations to convergence of the node-wise algorithm for various training sets and varying number of hidden layer nodes.]

[Figure 4.34: Large training set (input 1 vs. input 2).]

[Figure 4.35: Average error vs. iteration number for the BP implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]

After performing these comparative simulations, it is possible to enumerate some of the advantages of the layer-wise algorithm over BP and the node-wise algorithm. First, two problems that arise in training, the selection of the initial weights and of the forgetting factor, have been made easier: the layer-wise algorithm is relatively impervious to changes in these two variables, and they can be chosen small with very good results. Next, the layer-wise algorithm has very good convergence results with relatively complicated training sets. When selecting a training algorithm, many things need to be considered. Most important are speed and convergence performance.
If a training algorithm is very fast but rarely converges, it is not good; likewise, if the algorithm almost always converges but is very slow, it is also not good. The true "speed" of a training algorithm must be determined by how fast it performs each iteration, how many iterations it takes on the average to converge, and how often it converges.

[Figure 4.36: Average error vs. iteration number for the node-wise implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]

[Table 4.10: Iterations to convergence of the layer-wise algorithm for various training sets and varying number of hidden layer nodes.]

4.4 Simulations of the Data Reduction Algorithm

This section contains the results of simulations run using the data reduction algorithm. The data reduction algorithm was implemented along with the layer-wise training algorithm, and the results compare the layer-wise algorithm with and without data reduction. The large data set introduced in the previous section is used for the comparisons. Both algorithms are run using six, seven, and eight hidden layer nodes.

[Table 4.11: Iterations to convergence of the network-wise algorithm for various training sets and varying number of hidden layer nodes.]

Table 4.12: Number of correctly classified training patterns for the large training set for varying number of hidden layer nodes.

    Hidden layer    BP    Node-wise   Layer-wise
    nodes
    6               341   363         427
    7               373   398         428
    8               357   370         360

[Figure 4.37: Average error vs. iteration number for the layer-wise implementation, training with the large training set. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]

The data reduction algorithm works by comparing the relative effects of the training patterns on the magnitude of the weight change, and it does not use the training patterns with very small effects. Thus, for implementation, the user must decide the level for accepting and rejecting training patterns at each iteration. The algorithm is implemented by computing the approximation to the upper bound on the weight change for each training pattern and selecting the largest of the approximations. This approximation is divided by a user-defined factor to get a comparison number. All training patterns with approximations larger than this comparison number are used in the weight updating for the present iteration. This is then repeated for the next iteration (a sketch of this selection rule is given below). Figure 4.38 shows the relative error of three implementations of the layer-wise algorithm with six hidden layer nodes: the first has no data reduction; the second has data reduction and a dividing factor of 30; the third has data reduction and a dividing factor of 100. Figure 4.39 shows the number of correctly classified training patterns vs. iteration number for the same implementations. Figures 4.40 and 4.41 are similar, showing the results for seven hidden layer nodes, and Figures 4.42 and 4.43 show the results for eight hidden layer nodes.

[Figure 4.38: Average error vs. iteration number for the data reduction algorithm using 6 hidden layer nodes, training with the large training set. 1. none; 2. 30; 3. 100.]
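The per-iteration selection rule just described can be sketched as follows. The bound itself is the dbdA() estimate listed in the Appendix and is assumed given here; the function and array names are mine.

/* Mark which of the P training patterns to use this iteration.
   bound[p] holds the estimated upper bound on the weight change due
   to pattern p; factor is the user-defined dividing factor (30 and
   100 in the simulations above).  Returns the number selected. */
int select_patterns(const float bound[], int P, float factor, int use[])
{
  float largest = 0.0f, threshold;
  int p, count = 0;

  for (p = 0; p < P; ++p)
    if (bound[p] > largest)
      largest = bound[p];
  threshold = largest / factor;
  for (p = 0; p < P; ++p) {
    use[p] = (bound[p] >= threshold);
    count += use[p];
  }
  return count;
}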
Figure 4.44 shows the number of training patterns used in the hidden layer vs. the iteration number for a dividing factor of 30 and varying hidden layer nodes; Figure 4.45 is similar, but with a dividing factor of 100.

The results show the data reduction algorithm performing very well at the beginning, with the algorithm implemented without data reduction occasionally catching up at later iterations. The results can be misleading, because the data reduction algorithm uses fewer training patterns, and thus performs each iteration faster than the algorithm with no data reduction; if the figures were to compare error vs. time, the data reduction algorithm would be superior. The results indicate that using data reduction is very good at speeding up the algorithm in the first iterations.

[Figure 4.39: Correctly classified training patterns vs. iteration number for the data reduction algorithm using 6 hidden layer nodes and the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.40: Average error vs. iteration number for the data reduction algorithm using 7 hidden layer nodes, training with the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.41: Correctly classified training patterns vs. iteration number for the data reduction algorithm using 7 hidden layer nodes and the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.42: Average error vs. iteration number for the data reduction algorithm using 8 hidden layer nodes, training with the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.43: Correctly classified training patterns vs. iteration number for the data reduction algorithm using 8 hidden layer nodes and the large training set. 1. none; 2. 30; 3. 100.]

[Figure 4.44: Number of training patterns used vs. iteration number for the data reduction algorithm and a dividing factor of 30. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]

[Figure 4.45: Number of training patterns used vs. iteration number for the data reduction algorithm and a dividing factor of 100. 1. 6 nodes; 2. 7 nodes; 3. 8 nodes.]
The error of the data reduction algorithm can actually increase if the number of training patterns used becomes too small. In practice this number can easily be monitored, and if the error increases, more training patterns can be used at the next iteration.

The data reduction algorithm increases the efficiency considerably at the beginning of training because of the method of selecting the training patterns to be used in the updating. The training patterns that have the largest effect on the weight change are typically those that are near the center of the cluster of each class: if we divide the training pairs into classes, those that are near the center of a cluster, away from training pairs of different classes, typically have the largest effect. These training pairs are then used to quickly find weights that classify the centers of the clusters correctly. If the class clusters are well separated, the algorithm will do a good job of classifying all the training patterns; if there is overlap among the clusters, then more training patterns must be used as training proceeds to refine the weights and classify the outliers correctly. The large training set used here has ten different classes with a great deal of overlap, so it is the most difficult challenge for the data reduction algorithm.

The number of training patterns used in the weight updating decreases as training proceeds. This indicates that all the training patterns have a similar effect on the magnitude of the weight change at the beginning of training, with some having less and less effect as training progresses. This confirms the observation that began the investigation into a data reduction algorithm.

The weight updating matrix used for QR-WRLS is largest for the hidden layer. This size is important, because adding even one weight greatly increases the number of rotations in the QR-WRLS algorithm. For example, with eight hidden layer nodes and three inputs (as in the simulation here) there are 24 hidden layer weights. This means a 25 by 25 matrix must be used for updating, and there are 314 rotations for each training pattern. With only one more weight in the hidden layer, the number of rotations for each training pattern increases to 340. This means that for even modest sized FNNs, those having six or more hidden layer nodes, the number of training patterns that must be removed for the data reduction algorithm to be faster than the non-data-reduction algorithm is small. For this example, if fewer than 92 percent of the training patterns are used, the data reduction algorithm is faster.
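The rotation counts above come from the triangularization step: each new linearized pattern is an appended row that must be annihilated against every row of the triangular W matrix, so the work per pattern grows roughly quadratically with the number of weights. The following stand-alone sketch (my own illustration, not the dissertation's rotin() listing) makes the count explicit; for m = 25 it performs 350 elementary rotation operations, the same order as the 314 quoted above, the exact figure depending on the bookkeeping.

#include <math.h>
#define MAXM 26

/* Annihilate the appended row v[0..m-1] (right-hand side rhs) against
   the m-by-m upper-triangular array R (right-hand sides in d[0..m-1])
   using Givens rotations, counting the elementary rotations. */
int givens_append(double R[MAXM][MAXM], double d[], double v[],
                  double rhs, int m)
{
  int j, k, rotations = 0;
  double rho, c, s, tmp;

  for (j = 0; j < m; ++j) {
    rho = sqrt(R[j][j] * R[j][j] + v[j] * v[j]);
    if (rho == 0.0)
      continue;                  /* nothing to annihilate */
    c = R[j][j] / rho;
    s = v[j] / rho;
    for (k = j; k < m; ++k) {    /* rotate the remaining columns */
      tmp = c * R[j][k] + s * v[k];
      v[k] = -s * R[j][k] + c * v[k];
      R[j][k] = tmp;
      ++rotations;
    }
    tmp = c * d[j] + s * rhs;    /* and the right-hand side pair */
    rhs = -s * d[j] + c * rhs;
    d[j] = tmp;
    ++rotations;
  }
  return rotations;
}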
Chapter 5

Conclusions and Future Research

5.1 Algorithmic Developments

5.1.1 Training Speed

New algorithms to train FNNs are developed in this dissertation. The main purpose in developing new algorithms is to make training more efficient. Let us define the speed of an algorithm as the time it takes to converge to a solution. In measuring the speed of an algorithm, we need to know how fast it completes each iteration, how many iterations it takes, on the average, to converge, and how often it converges. Determining the speed of any of the algorithms used to train FNNs is quite difficult, if not impossible. This is due to several factors. First, the algorithms are dependent on initial conditions: for different initial conditions, the same algorithm may converge very fast, slowly, or not at all. Next, the speed of an algorithm depends on the training set being used. This makes it difficult to compare training algorithms. Many initial conditions for the same training set, and many different training sets, must be used to ensure the comparisons are fair. The simulations done in Chapter 4 show that the layer-wise algorithm described in this dissertation is indeed an improvement over previous algorithms. As can be seen from the simulations, if the training set is very simple and linearly separable, then any of the training algorithms can be used to train the FNN. If the training set is more complicated, however, then using the layer-wise algorithm greatly enhances performance.

One way to speed up training is to decrease the average number of iterations to a solution and to increase the number of convergences. Another way is to decrease the time for each iteration. The layer-wise algorithm uses the former method; the data reduction algorithm uses the latter.

5.1.2 Layer-wise Training Algorithm

The layer-wise training algorithm can be viewed as a logical progression from previous training methods. The BP algorithm takes the derivative of the error surface with respect to one weight, and changes this weight to reduce the total output error; thus each weight is changed individually to decrease the error. Next, many algorithms, such as the A-S algorithm and the algorithm developed by Kollias and Anastassiou, take the derivative with respect to the weights connected to each node, and change these weights to reduce the error; here, all the weights connected to one node are changed together. In the layer-wise algorithm the derivative with respect to all the weights in the same layer is taken, and these weights are changed together to reduce the error. The layer-wise algorithm is a linear algorithm in that the nonlinearities of the FNN are linearized around the present weights at each iteration, and this linearized approximation is used to update the weights. This makes possible the use of QR-WRLS for solving the linear system at each iteration.

5.1.3 QR-WRLS

The use of derivatives of order greater than one in the training of FNNs has become popular. The use of higher order derivatives means that the system can be linearized so that the resulting equations are linear, and can be solved for many of the weights simultaneously using some RLS algorithm, typically MIL-WRLS. It was one of the purposes of this dissertation to introduce the QR-WRLS algorithm for use in FNN training algorithms. Future algorithms introduced for the training of FNNs will benefit from implementation with QR-WRLS rather than MIL-WRLS.
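In symbols, the linearization these algorithms share is the first-order expansion of the network output about the current weight estimate. The following is a sketch in generic notation; the dissertation's own symbols are developed in Chapter 2:

\[ y(\mathbf{w}) \;\approx\; y(\mathbf{w}_{i-1}) + \nabla y(\mathbf{w}_{i-1})^T (\mathbf{w}-\mathbf{w}_{i-1}), \]

so that requiring \( y(\mathbf{w}) \approx d \) for a training pattern gives an equation linear in \(\mathbf{w}\),

\[ \nabla y(\mathbf{w}_{i-1})^T\,\mathbf{w} \;\approx\; d - y(\mathbf{w}_{i-1}) + \nabla y(\mathbf{w}_{i-1})^T\,\mathbf{w}_{i-1}, \]

which corresponds to the linearized training pair (z, tp) that the Appendix listings rotate into the W matrices at each iteration.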
5.1.4 Data Reduction Algorithm

In training the FNN it was found that not all the training patterns are useful at each iteration in updating the system weights. A computationally inexpensive way of determining whether a training pattern is useful for updating the weights at the present iteration was developed. This allows fewer training patterns to be used at each iteration, speeding up training.

5.2 Future Research

The most exciting future research lies in the extension of the layer-wise algorithm. As mentioned above, training algorithms have developed from updating one weight at a time, to all the weights connected to one node at a time, to all the weights of the same layer at a time. The next step will be updating all the weights of the network simultaneously. The network-wise algorithm presented in Chapter 2 is a first step in this direction. It is limited, however, in that only one output is allowed. Also, as noted in Chapter 4, its performance is not as good as that of the layer-wise algorithm. To update all the weights of the same layer, only a first order linear approximation to the FNN is needed; in updating all the weights of the network, a higher order approximation may be used.

The set membership algorithm [31],[32] was the first algorithm considered in an attempt to reduce the number of training patterns used at each iteration. This algorithm can be used in linear systems when there is a bound on the output error. The bounded output of the FNN made it seem a natural problem for the application of set membership. It was quickly noted, however, that although the bounded output of the FNN does imply a bound on the error, it does not imply a bound on the solution weights; this occurs because of the nonlinearities in the system. It was because of this that perturbation theory was used. However, set membership theory in another form may still be applicable to the FNN training problem, and this application is a topic of future research.
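To make the set membership idea concrete: for a linear-in-parameters model with error bound γ, the membership set after n patterns is (sketched here in generic notation; see [31],[32] for the precise development)

\[ \Omega_n \;=\; \bigl\{\, \mathbf{w} : \lvert d(k) - \mathbf{x}^T(k)\,\mathbf{w} \rvert \le \gamma,\;\; 1 \le k \le n \,\bigr\}, \]

and a pattern whose constraint cannot shrink \(\Omega_n\) may be discarded without affecting the solution set. It is this pattern-discarding property that first suggested the approach for data reduction; the difficulty noted above is that the FNN's bounded output does not translate into such a bound on the weights.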
Bibliography

[1] T. Sejnowski and C. Rosenberg, "NETtalk: a parallel network that learns to read aloud," Technical Report JHU/EECS-86/01, Johns Hopkins University, 1986.

[2] D. Nguyen and B. Widrow, "The truck backer-upper: an example of self-learning in neural networks," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, pp. 1381-1384, 1990.

[3] A.S. Householder, "The approximate solution of matrix problems," J. Assoc. Comput. Mach., vol. 5, pp. 204-243, 1958.

[4] J.R. Deller, Jr., and D. Hsu, "An alternative adaptive sequential regression algorithm and its application to the recognition of cerebral palsy speech," IEEE Transactions on Circuits and Systems, vol. CAS-34, no. 7, pp. 782-786, July 1987.

[5] D. Graupe, Time Series Analysis, Identification and Adaptive Filtering, Malabar, Florida: Krieger Publishing Company, 1989.

[6] M. Azimi-Sadjadi, S. Citrin, and S. Sheedvash, "Supervised learning process of multi-layer perceptron neural networks using fast least squares," Proceedings of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1381-1384, 1990.

[7] G. Golub and C. Van Loan, Matrix Computations, Baltimore, Maryland: Johns Hopkins University Press, 1983.

[8] J. Sun, W. Grosky, and M. Hassoun, "A fast algorithm for finding global minima of error functions in layered neural networks," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 715-720, 1990.

[9] B. Widrow and M. Lehr, "30 years of adaptive neural networks: perceptron, madaline and backpropagation," Proceedings of the IEEE, vol. 78, no. 9, September 1990.

[10] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Washington, D.C.: Spartan Books, 1962.

[11] R. Watrous, "Learning algorithms for connectionist networks: applied gradient methods of nonlinear optimization," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 2, pp. 619-627, 1987.

[12] D. Rumelhart, G. Hinton, and R. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, vol. 1 (D. Rumelhart and J. McClelland, Eds.), Cambridge, MA: MIT Press, 1986.

[13] R. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Magazine, vol. 4, pp. 4-22, April 1987.

[14] J.R. Deller, Jr., J. Proakis, and J. Hansen, Discrete-Time Processing of Speech Signals, 1992.

[15] L. Johnson and R. Riess, Numerical Analysis, 2nd ed., Reading, MA: Addison-Wesley Publishing Company, 1982.

[16] D. Parker, "Learning logic," Technical Report TR-40, Center for Computational Research in Economics and Management Science, MIT, April 1985.

[17] P. Werbos, "Beyond regression: new tools for prediction and analysis in the behavioral sciences," Ph.D. dissertation, Harvard University, Cambridge, MA, 1974.

[18] F.M.A. Salam, "Neural nets and engineering implementations," keynote address, Proceedings of the 31st Midwest Symposium on Circuits and Systems, St. Louis, MO, Aug. 1988.

[19] M. Azimi-Sadjadi and S. Citrin, "Fast learning process of multi-layer neural nets using recursive least squares technique," IEEE Int. Conf. on Neural Networks, Washington, D.C., June 1989.

[20] E. Dahl, "Accelerated learning using the generalized delta rule," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 2, pp. 523-530, 1987.

[21] R. White, "The learning rate in back-propagation systems: an application of Newton's method," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 679-684, 1990.

[22] D. Parker, "Second order backpropagation: implementing an optimal O(n) approximation to Newton's method as an artificial neural network," IEEE Conf. on Neural Information Processing Systems, Denver, CO, Nov. 1987.

[23] S. Kollias and D. Anastassiou, "An adaptive least squares algorithm for the efficient training of artificial neural networks," IEEE Transactions on Circuits and Systems, vol. CAS-36, pp. 1092-1101, August 1989.

[24] J. Steck, B. McMillin, K. Krishnamurthy, M. Ashouri, and G. Leininger, "Parallel implementation of a recursive least squares neural network training method," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 631-636, 1990.

[25] T. Ghiselli-Crippa and A. El-Jaroudi, "A fast neural net training algorithm and its application to voiced-unvoiced-silence classification of speech," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 441-444, 1991.

[26] B. Kosko, Neural Networks and Fuzzy Systems, Englewood Cliffs, New Jersey: Prentice-Hall, 1992.

[27] B. Widrow and S.D. Stearns, Adaptive Signal Processing, Englewood Cliffs, New Jersey: Prentice-Hall, 1985.

[28] J.R. Deller, Jr., and G. Picache, "Advantages of a Givens rotation approach to temporally recursive linear prediction analysis of speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 429-431, March 1989.

[29] L. Allred, "Supervised learning techniques for backpropagation networks," Proceedings of the IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 721-728, 1990.

[30] D. Watkins, Fundamentals of Matrix Computations, New York, NY: John Wiley and Sons, 1991.

[31] J.R. Deller, Jr., and T.C. Luk, "Linear prediction analysis of speech based on set-membership theory," Computer Speech and Language, vol. 3, pp. 301-327, 1989.

[32] J.R. Deller, Jr., "Set membership identification in digital signal processing," IEEE ASSP Magazine, pp. 4-20, October 1989.

Appendix

Listings of Training Programs

This Appendix lists the programs developed in this dissertation. The first program listed is the node-wise algorithm; next the layer-wise algorithm is listed; the network-wise algorithm follows; and the last program listed is the layer-wise algorithm with data reduction. All programs implement learning for a two layer network, and all were written in a similar manner. The main program inputs the training set and user-defined variables such as the number of hidden layer nodes and the number of iterations.
The number of input and output nodes is determined by the training set. Here also the weights are initialized, along with the other algorithm variables such as the forgetting factor and weight constraint. AS2(), the main subroutine in each program, keeps track of the iteration number, calculates the outputs of the network, then uses these values to calculate the linearized training values. The first main subroutine is the rotin() subroutine. This routine rotates the linearized training pairs into the correct W matrix. There is one W matrix for each node in the node-wise algorithm; in the layer-wise algorithm there is one W matrix for each node in the output layer, and one W matrix for the hidden layer; the network-wise algorithm uses only one W matrix. The next main subroutine is weightch(), which uses the W matrix to calculate the updated weights. The other two smaller subroutines are ASw() and reinitw(): ASw() multiplies the W matrices by the forgetting factor, and reinitw() initializes the W matrices to zero. The only difference among the node-wise, layer-wise, and network-wise algorithms is in how the AS2() subroutine calculates the linearized training patterns.

The layer-wise algorithm with data reduction is the same as the layer-wise algorithm, with the data reduction included. The first change is in the AS2() subroutine: here, for simplicity of programming, the outputs of the network are calculated twice, first when using the data reduction algorithm to decide which training patterns to use, then when using these training patterns to update the W matrices. There are three additional subroutines. The first is the reinitwpp() subroutine, which initializes the wpp matrix, used in determining which training patterns to use in the updating, to zero. The next subroutine, norms(), calculates the norms for the weights, the w(i) vectors, and the d1(i, N) vectors, here called the b vectors. The final subroutine, dbdA(), calculates the estimate of the weight change vector norm.

The input to all the programs is a training set with the following format:

    number of training patterns,number of inputs,number of outputs
    training pattern 1
    training pattern 2
    ...
    training pattern N

At each iteration the program writes to the screen the iteration number, the total number of training patterns, the number of correctly classified training patterns, and the sum of the squared errors between the training pattern outputs and the actual outputs.
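As a concrete illustration of this format, a hypothetical X-OR (2-bit parity) training set with four patterns, two inputs, and one output, with values comma-separated as the listings' fscanf() calls expect, would be:

    4,2,1
    0,0,0
    0,1,1
    1,0,1
    1,1,0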
Node-Wise Algorithm

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define Nlevels 2

/* This program simulates an N0-input, N2-output, 2-layer feedforward
   neural net using the Azimi-Sadjadi algorithm with G-R (Givens
   rotations) and includes biasing.  Node-wise weight updating. */

int datapts, xdata, ydata, numweight, zed, use, Nloops, N[3];
float x[4][519], t[4][519];
float w[3][10][10], wa[30];
float K[3][10], b[3][10], delta1, delta2;
float W[13][10][10];
float y[3][10], wchange, err1;
float z[30], tp;
float ASweight;
double err;
char outfile[40];
FILE *ifp1, *ofp1, *ofp2, *ofp3;

main()
{
  int k1, k2, k3;
  void AS2();

  printf("Enter name for input file: ");
  gets(outfile);
  if ((ifp1 = fopen(outfile, "r")) == NULL) {
    printf("fopen failed\n");
    exit(0);
  }
  printf("Enter name for output file1: ");
  gets(outfile);
  if ((ofp1 = fopen(outfile, "w")) == NULL) {
    printf("fopen failed\n");
    exit(0);
  }
  printf("Enter number of hidden nodes: ");
  scanf("%d", &N[1]);

  /* Here the input training set is read. */
  fscanf(ifp1, "%d,%d,%d\n", &datapts, &xdata, &ydata);
  N[0] = xdata;
  N[2] = ydata;
  printf("datapts=%d xdata=%d ydata=%d\n", datapts, xdata, ydata);
  for (k1 = 0; k1 < datapts; ++k1) {
    for (k2 = 0; k2 < xdata; ++k2)
      fscanf(ifp1, "%f,", &x[k2][k1]);
    for (k2 = 0; k2 < ydata; ++k2)
      fscanf(ifp1, "%f,", &t[k2][k1]);
    fscanf(ifp1, "\n");
  }
  printf("Enter maximum number of loops: ");
  scanf("%d", &Nloops);
  ASweight = 0.1;

  /* The bias node outputs are fixed at one; all weights start small. */
  y[1][N[1]] = 1;
  y[0][N[0]] = 1;
  for (k1 = 1; k1 < Nlevels + 1; ++k1)
    for (k2 = 0; k2 < N[k1] + 1; ++k2)
      for (k3 = 0; k3 < N[k1-1] + 1; ++k3)
        w[k1][k2][k3] = 0.001;
  reinitw();
  wchange = 0.10;
  AS2();
  fprintf(ofp1, "%d,%d,%d\n", N[0], N[1], N[2]);
  /* for (k1 = 1; k1 < Nlevels + 1; ++k1)
       for (k2 = 0; k2 < N[k1]; ++k2)
         for (k3 = 0; k3 < N[k1-1] + 1; ++k3) {
           fprintf(ofp1, "%f\n", w[k1][k2][k3]);
           printf("w[%d][%d][%d]=%f\n", k1, k2, k3, w[k1][k2][k3]);
         } */
  fclose(ifp1);
  fclose(ofp1);
}

void AS2()
{
  int n1, n2, n3, count, loop, data, test;

  for (loop = 0; loop < Nloops; ++loop) {
    zed = 0;
    err = 0;
    use = 0;
    ASw();
    for (data = 0; data < datapts; ++data) {
      for (n1 = 0; n1 < N[0]; ++n1)
        y[0][n1] = x[n1][data];

      /* Below the outputs of the nodes and the derivatives are
         calculated. */
      for (n1 = 1; n1 < Nlevels + 1; ++n1) {
        for (n2 = 0; n2 < N[n1]; ++n2) {
          y[n1][n2] = 0;
          for (n3 = 0; n3 < N[n1-1] + 1; ++n3)
            y[n1][n2] += y[n1-1][n3] * w[n1][n2][n3];
          if (y[n1][n2] < -20) y[n1][n2] = -20;
          b[n1][n2] = y[n1][n2];
          y[n1][n2] = 1 / (1 + exp(-y[n1][n2]));
          K[n1][n2] = y[n1][n2] * (1 - y[n1][n2]);
          b[n1][n2] = y[n1][n2] - K[n1][n2] * b[n1][n2];
        }
      }

      /* Here the linearized inputs and outputs of the last layer are
         calculated. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        if (t[n1][data] == 0)      tp = 0.05 - b[Nlevels][n1];
        else if (t[n1][data] == 1) tp = 0.95 - b[Nlevels][n1];
        else                       tp = t[n1][data] - b[Nlevels][n1];
        for (n2 = 0; n2 < N[Nlevels-1] + 1; ++n2)
          z[n2] = K[Nlevels][n1] * y[Nlevels-1][n2];
        rotin(N[Nlevels-1] + 1, n1 + 1);
      }

      /* Below the linearized inputs and outputs of the hidden layer
         are calculated. */
      for (n2 = 0; n2 < N[Nlevels-1]; ++n2) {
        tp = 0;
        for (n1 = 0; n1 < N[Nlevels]; ++n1)
          tp += (t[n1][data] - y[Nlevels][n1]) * K[Nlevels][n1]
                * w[Nlevels][n1][n2];
        tp += y[Nlevels-1][n2];
        tp = tp - b[Nlevels-1][n2];
        for (n3 = 0; n3 < N[Nlevels-2] + 1; ++n3)
          z[n3] = K[Nlevels-1][n2] * y[Nlevels-2][n3];
        rotin(N[Nlevels-2] + 1, n2 + N[Nlevels] + 1);
      }

      /* The sum of squared errors is calculated below. */
      for (n1 = 0, err1 = 0, test = 0; n1 < ydata; ++n1) {
        err1 += (t[n1][data] - y[2][n1]) * (t[n1][data] - y[2][n1]);
        if (fabs(t[n1][data] - y[2][n1]) >= 0.5) test = 1;
      }
      if (test == 0) zed++;
      err += err1;

      /* Below the new weights for the last layer are calculated. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        weightch(Nlevels, N[Nlevels-1] + 1, n1 + 1);
        for (n2 = 0; n2 < N[Nlevels-1] + 1; ++n2)
          w[Nlevels][n1][n2] = wa[n2];
      }

      /* Below the new weights for the hidden layer are calculated. */
      for (n2 = 0; n2 < N[Nlevels-1]; ++n2) {
        weightch(Nlevels - 1, N[Nlevels-2] + 1, n2 + N[Nlevels] + 1);
        for (n3 = 0; n3 < N[Nlevels-2] + 1; ++n3)
          w[Nlevels-1][n2][n3] = wa[n3];
      }
      printf("loop=%d data=%d zed=%d err[data]=%f\n",
             loop, data, zed, err);
    }
    fprintf(ofp1, "loop=%d zed=%d err=%f\n", loop, zed, err);
  }
}

/* The following subroutine rotates the linearized training patterns
   into the W matrices using QR decomposition. */
rotin(MATSIZE, mat)
int MATSIZE, mat;
{
  int k1, k2;
  float rho, sigma, tau, Wp[30][30];

  /* Append the linearized pattern (z, tp) as a new row. */
  for (k2 = 0; k2 < MATSIZE; ++k2)
    W[mat][MATSIZE][k2] = z[k2];
  W[mat][MATSIZE][MATSIZE] = tp;
  for (k2 = 0; k2 < MATSIZE + 1; ++k2)
    for (k1 = 0; k1 < MATSIZE + 1; ++k1)
      Wp[k2][k1] = 0;

  /* Annihilate the appended row with Givens rotations. */
  for (k2 = 0; k2 < MATSIZE; ++k2) {
    rho = sqrt(W[mat][k2][k2] * W[mat][k2][k2]
             + W[mat][MATSIZE][k2] * W[mat][MATSIZE][k2]);
    if (rho == 0) goto below4;
    sigma = W[mat][k2][k2] / rho;
    tau = W[mat][MATSIZE][k2] / rho;
    for (k1 = k2; k1 < MATSIZE + 1; ++k1) {
      Wp[k2][k1] = W[mat][k2][k1] * sigma + W[mat][MATSIZE][k1] * tau;
      Wp[MATSIZE][k1] = -W[mat][k2][k1] * tau
                        + W[mat][MATSIZE][k1] * sigma;
    }
    Wp[MATSIZE][k2] = 0;
    for (k1 = 0; k1 < MATSIZE + 1; ++k1) {
      W[mat][k2][k1] = Wp[k2][k1];
      W[mat][MATSIZE][k1] = Wp[MATSIZE][k1];
    }
below4: ;
  }
}

/* The following subroutine re-initializes the W matrices. */
reinitw()
{
  int n1, n2, n3;
  for (n1 = 0; n1 < 13; ++n1)
    for (n2 = 0; n2 < 10; ++n2)
      for (n3 = 0; n3 < 10; ++n3)
        W[n1][n2][n3] = 0;
}

/* The following subroutine uses the W matrices to calculate the
   updated weights. */
weightch(level, MATSIZE, mat)
int level, MATSIZE, mat;
{
  double wchangep[30], a[30], wp[30], den, fac;
  int n1, n2, zt;

  zt = 0;
  /* Copy the present weights of the node being updated. */
  if (level == 2)
    for (n1 = 0; n1 < MATSIZE; ++n1)
      wp[n1] = w[level][mat-1][n1];
  if (level == 1)
    for (n1 = 0; n1 < MATSIZE; ++n1)
      wp[n1] = w[level][mat - N[Nlevels] - 1][n1];

  /* Solve the triangular system by back-substitution. */
  if (fabs(W[mat][MATSIZE-1][MATSIZE-1]) > 1e-28)
    a[MATSIZE-1] = W[mat][MATSIZE-1][MATSIZE]
                   / W[mat][MATSIZE-1][MATSIZE-1];
  else {
    printf("no weight update level %d %d\n", level, MATSIZE);
    a[MATSIZE-1] = wp[MATSIZE-1];
  }
  for (n1 = MATSIZE - 2; n1 >= 0; --n1) {
    if (fabs(W[mat][n1][n1]) > 1e-28) {
      a[n1] = W[mat][n1][MATSIZE];
      for (n2 = MATSIZE - 1; n2 > n1; --n2)
        a[n1] -= W[mat][n1][n2] * a[n2];
      a[n1] /= W[mat][n1][n1];
    }
    else {
      printf("no weight update level %d %d\n", level, MATSIZE);
      a[n1] = wp[n1];
    }
  }

  /* Constrain the norm of the weight change to wchange. */
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wchangep[n1] = a[n1] - wp[n1];
  den = 0;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    den += wchangep[n1] * wchangep[n1];
  den = sqrt(den);
  if (den > wchange) fac = wchange / den;
  else fac = 1;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wchangep[n1] *= fac;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wa[n1] = wp[n1] + wchangep[n1];

  /* Bound the weights. */
  for (n1 = 0; n1 < MATSIZE; ++n1) {
    if (wa[n1] > 700)  { wa[n1] = 700;  zt = 1; }
    if (wa[n1] < -700) { wa[n1] = -700; zt = 1; }
  }
  if (zt == 1) printf("weight at 700\n");
}

/* The following subroutine implements the forgetting factor by
   multiplying the W matrices by the constant ASweight. */
ASw()
{
  int n1, n2, n3;
  for (n1 = 0; n1 < 13; ++n1)
    for (n2 = 0; n2 < 10; ++n2)
      for (n3 = 0; n3 < 10; ++n3)
        W[n1][n2][n3] *= ASweight;
}

Layer-Wise Algorithm

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define Nlevels 2

/* This program simulates an N0-input, N2-output, 2-layer feedforward
   neural net using the Azimi-Sadjadi algorithm with G-R (Givens
   rotations) and includes biasing.  Layer-wise weight updating. */

int datapts, xdata, ydata, numweight, zed, use, Nloops, N[3];
float x[4][519], t[4][519];
float w[3][10][10], wa[30];
float K[3][10], b[3][10], delta1, delta2;
float W[7][30][30];
float y[3][10], wchange, err1;
float z[30], tp;
float ASweight;
double err;
char outfile[40];
FILE *ifp1, *ofp1, *ofp2, *ofp3;

main()
{
  int k1, k2, k3;
  void AS2();

  printf("Enter name for input file: ");
  gets(outfile);
  if ((ifp1 = fopen(outfile, "r")) == NULL) {
    printf("fopen failed\n");
    exit(0);
  }
  /* printf("Enter name for output file1: ");
     gets(outfile);
     if ((ofp1 = fopen(outfile, "w")) == NULL) {
       printf("fopen failed\n");
       exit(0);
     } */
  printf("Enter number of hidden nodes: ");
  scanf("%d", &N[1]);

  /* Here the input training set is read. */
  fscanf(ifp1, "%d,%d,%d\n", &datapts, &xdata, &ydata);
  N[0] = xdata;
  N[2] = ydata;
  printf("datapts=%d xdata=%d ydata=%d\n", datapts, xdata, ydata);
  for (k1 = 0; k1 < datapts; ++k1) {
    for (k2 = 0; k2 < xdata; ++k2)
      fscanf(ifp1, "%f,", &x[k2][k1]);
    for (k2 = 0; k2 < ydata; ++k2)
      fscanf(ifp1, "%f,", &t[k2][k1]);
    fscanf(ifp1, "\n");
  }
  printf("Enter maximum number of loops: ");
  scanf("%d", &Nloops);
  ASweight = 0.1;

  /* Below the weights are initialized. */
  y[1][N[1]] = 1;
  y[0][N[0]] = 1;
  for (k1 = 1; k1 < Nlevels + 1; ++k1)
    for (k2 = 0; k2 < N[k1] + 1; ++k2)
      for (k3 = 0; k3 < N[k1-1] + 1; ++k3)
        w[k1][k2][k3] = 0.001;
  reinitw();
  wchange = 0.4;
  AS2();
  /* fprintf(ofp1, "%d,%d,%d\n", N[0], N[1], N[2]);
     for (k1 = 1; k1 < Nlevels + 1; ++k1)
       for (k2 = 0; k2 < N[k1]; ++k2)
         for (k3 = 0; k3 < N[k1-1] + 1; ++k3) {
           fprintf(ofp1, "%f\n", w[k1][k2][k3]);
           printf("w[%d][%d][%d]=%f\n", k1, k2, k3, w[k1][k2][k3]);
         } */
  fclose(ifp1);
}

void AS2()
{
  int n1, n2, n3, count, loop, data, test;

  for (loop = 0; loop < Nloops; ++loop) {
    zed = 0;
    err = 0;
    use = 0;
    ASw();
    for (data = 0; data < datapts; ++data) {
      for (n1 = 0; n1 < N[0]; ++n1)
        y[0][n1] = x[n1][data];

      /* Below the outputs of the nodes and the derivatives are
         calculated. */
      for (n1 = 1; n1 < Nlevels + 1; ++n1) {
        for (n2 = 0; n2 < N[n1]; ++n2) {
          y[n1][n2] = 0;
          for (n3 = 0; n3 < N[n1-1] + 1; ++n3)
            y[n1][n2] += y[n1-1][n3] * w[n1][n2][n3];
          if (y[n1][n2] < -20) y[n1][n2] = -20;
          b[n1][n2] = y[n1][n2];
          y[n1][n2] = 1 / (1 + exp(-y[n1][n2]));
          K[n1][n2] = y[n1][n2] * (1 - y[n1][n2]);
          b[n1][n2] = y[n1][n2] - K[n1][n2] * b[n1][n2];
        }
      }

      /* Here the linearized inputs and outputs of the last layer are
         calculated. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        tp = t[n1][data] - b[Nlevels][n1];
        for (n2 = 0; n2 < N[Nlevels-1] + 1; ++n2)
          z[n2] = K[Nlevels][n1] * y[Nlevels-1][n2];
        rotin(N[Nlevels-1] + 1, n1 + 1);
      }

      /* Below the linearized inputs and outputs of the hidden layer
         are calculated; all hidden weights share one W matrix. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        count = 0;
        tp = t[n1][data] - b[Nlevels][n1];
        for (n2 = 0; n2 < N[Nlevels-1]; ++n2) {
          tp -= K[Nlevels][n1] * w[Nlevels][n1][n2] * b[Nlevels-1][n2];
          for (n3 = 0; n3 < N[Nlevels-2] + 1; ++n3) {
            z[count] = K[Nlevels][n1] * K[Nlevels-1][n2]
                       * w[Nlevels][n1][n2] * y[Nlevels-2][n3];
            ++count;
          }
        }
        /* On loop exit n2 indexes the hidden layer bias node. */
        tp -= K[Nlevels][n1] * w[Nlevels][n1][n2] * y[Nlevels-1][n2];
        rotin(N[Nlevels-1] * (N[Nlevels-2] + 1), 0);
      }

      /* The sum of squared errors is calculated below. */
      for (n1 = 0, err1 = 0, test = 0; n1 < ydata; ++n1) {
        err1 += (t[n1][data] - y[2][n1]) * (t[n1][data] - y[2][n1]);
        if (fabs(t[n1][data] - y[2][n1]) >= 0.5) test = 1;
      }
      if (test == 0) zed++;
      err += err1;

      /* Below the new weights for the last layer are calculated. */
      for (n1 = 0; n1 < N[Nlevels]; ++n1) {
        weightch(Nlevels, N[Nlevels-1] + 1, n1 + 1);
        for (n2 = 0; n2 < N[Nlevels-1] + 1; ++n2)
          w[Nlevels][n1][n2] = wa[n2];
      }

      /* Below the new weights for the hidden layer are calculated,
         all at once. */
      weightch(Nlevels - 1, N[Nlevels-1] * (N[Nlevels-2] + 1), 0);
      count = 0;
      for (n1 = 0; n1 < N[Nlevels-1]; ++n1) {
        for (n2 = 0; n2 < N[Nlevels-2] + 1; ++n2) {
          w[Nlevels-1][n1][n2] = wa[count];
          ++count;
        }
      }
      printf("loop=%d data=%d zed=%d err[data]=%f\n",
             loop, data, zed, err);
    }
  }
}

/* The following subroutine rotates the linearized training patterns
   into the W matrices using QR decomposition. */
rotin(MATSIZE, mat)
int MATSIZE, mat;
{
  int k1, k2;
  float rho, sigma, tau, Wp[30][30];

  /* Append the linearized pattern (z, tp) as a new row. */
  for (k2 = 0; k2 < MATSIZE; ++k2)
    W[mat][MATSIZE][k2] = z[k2];
  W[mat][MATSIZE][MATSIZE] = tp;
  for (k2 = 0; k2 < MATSIZE + 1; ++k2)
    for (k1 = 0; k1 < MATSIZE + 1; ++k1)
      Wp[k2][k1] = 0;

  /* Annihilate the appended row with Givens rotations. */
  for (k2 = 0; k2 < MATSIZE; ++k2) {
    rho = sqrt(W[mat][k2][k2] * W[mat][k2][k2]
             + W[mat][MATSIZE][k2] * W[mat][MATSIZE][k2]);
    if (rho == 0) goto below4;
    sigma = W[mat][k2][k2] / rho;
    tau = W[mat][MATSIZE][k2] / rho;
    for (k1 = k2; k1 < MATSIZE + 1; ++k1) {
      Wp[k2][k1] = W[mat][k2][k1] * sigma + W[mat][MATSIZE][k1] * tau;
      Wp[MATSIZE][k1] = -W[mat][k2][k1] * tau
                        + W[mat][MATSIZE][k1] * sigma;
    }
    Wp[MATSIZE][k2] = 0;
    for (k1 = 0; k1 < MATSIZE + 1; ++k1) {
      W[mat][k2][k1] = Wp[k2][k1];
      W[mat][MATSIZE][k1] = Wp[MATSIZE][k1];
    }
below4: ;
  }
}

/* The following subroutine re-initializes the W matrices. */
reinitw()
{
  int n1, n2, n3;
  for (n1 = 0; n1 < 7; ++n1)
    for (n2 = 0; n2 < 30; ++n2)
      for (n3 = 0; n3 < 30; ++n3)
        W[n1][n2][n3] = 0;
}

/* The following subroutine uses the W matrices to calculate the
   updated weights. */
weightch(level, MATSIZE, mat)
int level, MATSIZE, mat;
{
  double wchangep[30], a[30], wp[30], den, fac;
  int n1, n2, zt, zer, counts;

  zt = 0;
  zer = 0;
  /* Copy the present weights being updated: one output node for
     level 2, the whole hidden layer for level 1. */
  if (level == 2)
    for (n1 = 0; n1 < MATSIZE; ++n1)
      wp[n1] = w[level][mat-1][n1];
  if (level == 1) {
    counts = 0;
    for (n1 = 0; n1 < N[Nlevels-1]; ++n1)
      for (n2 = 0; n2 < N[Nlevels-2] + 1; ++n2) {
        wp[counts] = w[level][n1][n2];
        ++counts;
      }
  }

  /* Check the diagonal, then solve by back-substitution. */
  for (n1 = 0; n1 < MATSIZE; ++n1)
    if (fabs(W[mat][n1][n1]) < 1e-28) zer = 1;
  if (zer == 0) {
    a[MATSIZE-1] = W[mat][MATSIZE-1][MATSIZE]
                   / W[mat][MATSIZE-1][MATSIZE-1];
    for (n1 = MATSIZE - 2; n1 >= 0; --n1) {
      a[n1] = W[mat][n1][MATSIZE];
      for (n2 = MATSIZE - 1; n2 > n1; --n2)
        a[n1] -= W[mat][n1][n2] * a[n2];
      a[n1] /= W[mat][n1][n1];
    }
  }
  else {
    printf("no weight update level %d\n", level);
    for (n1 = 0; n1 < MATSIZE; ++n1)
      a[n1] = wp[n1];
  }

  /* Constrain the norm of the weight change to wchange. */
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wchangep[n1] = a[n1] - wp[n1];
  den = 0;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    den += wchangep[n1] * wchangep[n1];
  den = sqrt(den);
  if (den > wchange) fac = wchange / den;
  else fac = 1;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wchangep[n1] *= fac;
  for (n1 = 0; n1 < MATSIZE; ++n1)
    wa[n1] = wp[n1] + wchangep[n1];

  /* Bound the weights. */
  for (n1 = 0; n1 < MATSIZE; ++n1) {
    if (wa[n1] > 700)  { wa[n1] = 700;  zt = 1; }
    if (wa[n1] < -700) { wa[n1] = -700; zt = 1; }
  }
  if (zt == 1) printf("weight at 700\n");
}

/* The following subroutine implements the forgetting factor by
   multiplying the W matrices by the constant ASweight. */
ASw()
{
  int n1, n2, n3;
  for (n1 = 0; n1 < 7; ++n1)
    for (n2 = 0; n2 < 30; ++n2)
      for (n3 = 0; n3 < 30; ++n3)
        W[n1][n2][n3] *= ASweight;
}

Network-Wise Algorithm

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define Nlevels 2

/* This program simulates an N0-input, N2-output, 2-layer feedforward
   neural net using the Azimi-Sadjadi algorithm with G-R (Givens
   rotations) and includes biasing.  Network-wise weight updating. */
Network-Wise Algorithm

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#define Nlevels 2
/*
This program simulates a N0-input, N2-output 2-layer, feedforward
neural net using the Azimi-Sadjadi algorithm with QR decomposition and
includes biasing.  Network-wise updating.
*/
int datapts,xdata,ydata,numweight,zed,use,Nloops,N[3];
float x[4][519],t[4][519];
float w[3][10][10],wa[30],wpp[10][10];
float K[3][10],b[3][10],delta1,delta2;
float W[1][50][50];
float y[3][10],wchange,err1;
float z[50],tp;
float ASweight;
double err/*,E[1000]*/;
char outfile[40];
FILE *ifp1,*ofp1,*ofp2,*ofp3;

main()
{
int k1,k2,k3;
void A32();
printf("Enter name for input file: ");
gets(outfile);
if((ifp1=fopen(outfile,"r"))==NULL){
  printf("fopen failed\n");
  exit(0);
}
/*printf("Enter name for output file1: ");
gets(outfile);
if((ofp1=fopen(outfile,"w"))==NULL){
  printf("fopen failed\n");
  exit(0);
}*/
printf("Enter number of hidden nodes: ");
scanf("%d",&N[1]);

Here the input training set is read.

fscanf(ifp1," %d,%d,%d\n",&datapts,&xdata,&ydata);
N[0]=xdata;
N[2]=ydata;
printf("datapts=%d xdata=%d ydata=%d\n", datapts,xdata,ydata);
for(k1=0;k1<datapts;++k1){
  for(k2=0;k2<xdata;++k2){
    fscanf(ifp1," %f,",&x[k2][k1]);
  }
  for(k2=0;k2<ydata;++k2){
    fscanf(ifp1," %f,",&t[k2][k1]);
  }
  fscanf(ifp1,"\n");
}
printf("Enter maximum number of loops: ");
scanf("%d",&Nloops);
ASweight=0.1;
y[1][N[1]]=1;
y[0][N[0]]=1;
for(k1=1;k1<Nlevels+1;++k1){
  for(k2=0;k2<N[k1]+1;++k2){
    for(k3=0;k3<N[k1-1]+1;++k3){
      w[k1][k2][k3]=0.001;
    }
  }
}
reinitw();
wchange=.4;
A32();
if(ofp1!=NULL)fprintf(ofp1," %d,%d,%d\n",N[0],N[1],N[2]);
/*for(k1=1;k1<Nlevels+1;++k1){
  for(k2=0;k2<N[k1];++k2){
    for(k3=0;k3<N[k1-1]+1;++k3){
      fprintf(ofp1," %f\n",w[k1][k2][k3]);
      printf("w[%d][%d][%d]=%f\n", k1,k2,k3,w[k1][k2][k3]);
    }
  }
}*/
fclose(ifp1);
if(ofp1!=NULL)fclose(ofp1);
}

void A32()
{
int n1,n2,n3,count,loop,data,test;
for(loop=0;loop<Nloops;++loop){
  zed=0;
  err=0;
  use=0;
  ASw();
  for(data=0;data<datapts;++data){
    for(n1=0;n1<N[0];++n1){
      y[0][n1]=x[n1][data];
    }

Below the outputs of the nodes and the derivatives are calculated.

    for(n1=1;n1<Nlevels+1;++n1){
      for(n2=0;n2<N[n1];++n2){
        y[n1][n2]=0;
        for(n3=0;n3<N[n1-1]+1;++n3){
          y[n1][n2]+=y[n1-1][n3]*w[n1][n2][n3];
        }
        if(y[n1][n2]<-20){y[n1][n2]=-20;}
        b[n1][n2]=y[n1][n2];
        y[n1][n2]=1/(1+exp(-y[n1][n2]));
        K[n1][n2]=y[n1][n2]*(1-y[n1][n2]);
        b[n1][n2]=y[n1][n2]-K[n1][n2]*b[n1][n2];
      }
    }
    count=0;

Here the linearized inputs and outputs are calculated.

    tp=t[0][data]-b[Nlevels][0];
    for(n2=0;n2<N[Nlevels-1];++n2){
      z[count]=K[Nlevels][0]*b[Nlevels-1][n2];
      ++count;
    }
    z[count]=K[Nlevels][0]*y[Nlevels-1][N[Nlevels-1]];
    ++count;
    for(n2=0;n2<N[Nlevels-1];++n2){
      for(n3=0;n3<N[Nlevels-2]+1;++n3){
        z[count]=K[Nlevels][0]*K[Nlevels-1][n2]*y[Nlevels-2][n3];
        ++count;
      }
    }
    rotin(N[Nlevels-1]+1+(N[Nlevels-1]*(N[Nlevels-2]+1)),0);

The sum of square errors is calculated below.

    for(n1=0,err1=0,test=0;n1<ydata;++n1){
      err1+=(t[n1][data]-y[2][n1])*(t[n1][data]-y[2][n1]);
      if(fabs(t[n1][data]-y[2][n1]) >= 0.5){test=1;}
    }
    if(test==0){zed++;}
    err+=err1;

Below the new weights are calculated.

    weightch(N[Nlevels-1]+1+(N[Nlevels-1]*(N[Nlevels-2]+1)),0);
    count=0;
    for(n2=0;n2<N[Nlevels-1]+1;++n2){
      w[Nlevels][0][n2]=wa[count];
      ++count;
    }
    for(n1=0;n1<N[Nlevels-1];++n1){
      for(n2=0;n2<N[Nlevels-2]+1;++n2){
        if(fabs(w[Nlevels][0][n1])>1e-10){
          wpp[n1][n2]=wa[count];
          w[Nlevels-1][n1][n2]=wpp[n1][n2]/w[Nlevels][0][n1];
        }
        ++count;
      }
    }
    printf("loop=%d data=%d zed=%d err[data]=%f\n", loop,data,zed,err);
  }
}
}
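A point worth making explicit about the network-wise listing: unlike the layer-wise version, the hidden-layer regressor entries K2*K1*y0 built above are not multiplied by the output weight, so the parameters returned for those positions are products of output and hidden weights. In our notation (the symbols are ours, reconstructed from the code, with the single output node indexed 0 suppressed), the linearized model is

\[
t - b^{(2)} \;\approx\; \sum_{j} w^{(2)}_{j}\bigl(K^{(2)} b^{(1)}_{j}\bigr)
\;+\; w^{(2)}_{N_1}\bigl(K^{(2)}\cdot 1\bigr)
\;+\; \sum_{j,k} \theta_{jk}\bigl(K^{(2)} K^{(1)}_{j}\, y^{(0)}_{k}\bigr),
\qquad \theta_{jk} \;=\; w^{(2)}_{j}\, w^{(1)}_{jk},
\]

and the hidden weights are recovered by the division

\[
\hat{w}^{(1)}_{jk} \;=\; \hat{\theta}_{jk}\big/\hat{w}^{(2)}_{j},
\qquad \bigl|\hat{w}^{(2)}_{j}\bigr| > 10^{-10},
\]

which is exactly what the loop at the end of A32() implements, guarded by the 1e-10 test.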
The following subroutine rotates the linearized training patterns into the W matrices using QR decomposition.

rotin(MATSIZE,mat)
int MATSIZE,mat;
{
int k1,k2;
float rho,sigma,tau,Wp[50][50];
for(k2=0;k2<MATSIZE;++k2){
  W[mat][MATSIZE][k2]=z[k2];
}
W[mat][MATSIZE][MATSIZE]=tp;
for(k2=0;k2<MATSIZE+1;++k2){
  for(k1=0;k1<MATSIZE+1;++k1){
    Wp[k2][k1]=0;
  }
}
for(k2=0;k2<MATSIZE;++k2){
  rho=sqrt(W[mat][k2][k2]*W[mat][k2][k2]+W[mat][MATSIZE][k2]*W[mat][MATSIZE][k2]);
  if(rho == 0){/*if(mat==0){printf("below\n");}*/goto below4;}
  sigma=W[mat][k2][k2]/rho;
  tau=W[mat][MATSIZE][k2]/rho;
  for(k1=k2;k1<MATSIZE+1;++k1){
    Wp[k2][k1]=W[mat][k2][k1]*sigma+W[mat][MATSIZE][k1]*tau;
    Wp[MATSIZE][k1]=-W[mat][k2][k1]*tau+W[mat][MATSIZE][k1]*sigma;
  }
  Wp[MATSIZE][k2]=0;
  for(k1=0;k1<MATSIZE+1;++k1){
    W[mat][k2][k1]=Wp[k2][k1];
    W[mat][MATSIZE][k1]=Wp[MATSIZE][k1];
  }
  below4:;
}
}

The following subroutine re-initializes the W matrices.

reinitw()
{
int n1,n2,n3;
for(n1=0;n1<1;++n1){
  for(n2=0;n2<50;++n2){
    for(n3=0;n3<50;++n3){
      W[n1][n2][n3]=0;
    }
  }
}
}

The following subroutine uses the W matrices to calculate the updated weights.

weightch(MATSIZE,mat)
int MATSIZE,mat;
{
double wchangep[50],a[50],wp[50],den,fac;
int zt,n1,n2,counts;
zt=0;
counts=0;
for(n2=0;n2<N[Nlevels-1]+1;++n2){
  wp[counts]=w[Nlevels][0][n2];
  ++counts;
}
for(n1=0;n1<N[Nlevels-1];++n1){
  for(n2=0;n2<N[Nlevels-2]+1;++n2){
    wp[counts]=wpp[n1][n2];
    ++counts;
  }
}
if(fabs(W[mat][MATSIZE-1][MATSIZE-1])>1e-28){
  a[MATSIZE-1]=W[mat][MATSIZE-1][MATSIZE]/W[mat][MATSIZE-1][MATSIZE-1];
}
else{
  printf("no weight update %d\n",MATSIZE);
  a[MATSIZE-1]=wp[MATSIZE-1];
}
for(n1=MATSIZE-2;n1>=0;--n1){
  if(fabs(W[mat][n1][n1])>1e-28){
    a[n1]=W[mat][n1][MATSIZE];
    for(n2=MATSIZE-1;n2>n1;--n2){
      a[n1]-=W[mat][n1][n2]*a[n2];
    }
    a[n1]/=W[mat][n1][n1];
  }
  else{
    printf("no weight update %d\n",MATSIZE);
    a[n1]=wp[n1];
  }
}
for(n1=0;n1<MATSIZE;++n1){
  wchangep[n1]=a[n1]-wp[n1];
}
den=0;
for(n1=0;n1<MATSIZE;++n1){
  den+=wchangep[n1]*wchangep[n1];
}
den=sqrt(den);
if(den > wchange){fac=wchange/den;}
else{fac=1;}
for(n1=0;n1<MATSIZE;++n1){
  wchangep[n1]*=fac;
}
for(n1=0;n1<MATSIZE;++n1){
  wa[n1]=wp[n1]+wchangep[n1];
}
for(n1=0,zt=0;n1<MATSIZE;++n1){
  if(wa[n1]>700){wa[n1]=700;zt=1;}
  if(wa[n1]<-700){wa[n1]=-700;zt=1;}
}
if(zt==1){printf("weight at 700\n");}
}

The following subroutine implements the forgetting factor by multiplying the W matrices by the constant ASweight.

ASw()
{
int n1,n2,n3;
for(n1=0;n1<1;++n1){
  for(n2=0;n2<50;++n2){
    for(n3=0;n3<50;++n3){
      W[n1][n2][n3]*=ASweight;
    }
  }
}
}
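Both weightch() routines solve their upper-triangular systems by back substitution, and the logic is easy to lose among the singularity guards. Here is the bare solve as a minimal sketch of our own (back_substitute and NCOLS are hypothetical names, not from the listing):

#define NCOLS 50   /* matches the network-wise W arrays; 30 for the others */

/* Sketch (ours): the back substitution buried in weightch().  R holds
   the n x n upper triangle built by rotin(); column n of the same
   array holds the rotated right-hand side.  Solves R a = rhs,
   assuming every diagonal entry passes the 1e-28 singularity test.   */
void back_substitute(float R[][NCOLS], int n, double a[])
{
    int i, j;
    a[n-1] = R[n-1][n] / R[n-1][n-1];
    for (i = n - 2; i >= 0; --i) {
        a[i] = R[i][n];
        for (j = n - 1; j > i; --j)
            a[i] -= R[i][j] * a[j];
        a[i] /= R[i][i];
    }
}

The 1e-28 diagonal tests in the listings simply decide whether this solve is attempted or the previous weights are kept.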
Layer-Wise Algorithm with Data Reduction Incorporated

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
/*
This program simulates a N0-input, N2-output 2-layer, feedforward
neural net using the Azimi-Sadjadi algorithm with QR decomposition and
includes biasing.  Includes data reduction.
*/
int datapts,xdata,ydata,numweight,zed,use[6],Nloops,N[3],Nlevels;
float x[2][519],t[4][519],dwest[5][519],delta1,delta2;
float w[3][10][10],wa[30],b[3][10],K[3][10],W[5][30][30],wpp[30];
float y[3][10],wchange,err1,z[30],tp,ASweight;
float wnorm[5],bnorm[5],wb[5][30],olderr;
double err,wow[6];
char outfile[40];
FILE *ifp1,*ofp1;

main()
{
int k1,k2,k3;
void A32();
printf("Enter name for input file: ");
gets(outfile);
if((ifp1=fopen(outfile,"r"))==NULL){
  printf("fopen failed\n");
  exit(0);
}
/*printf("Enter name for output file1: ");
gets(outfile);
if((ofp1=fopen(outfile,"w"))==NULL){
  printf("fopen failed\n");
  exit(0);
}*/
printf("Enter number of hidden nodes: ");
scanf("%d",&N[1]);
fscanf(ifp1," %d,%d,%d\n",&datapts,&xdata,&ydata);
N[0]=xdata;
N[2]=ydata;
printf("datapts=%d xdata=%d ydata=%d\n", datapts,xdata,ydata);
for(k1=0;k1<datapts;++k1){
  for(k2=0;k2<xdata;++k2){
    fscanf(ifp1," %f,",&x[k2][k1]);
  }
  for(k2=0;k2<ydata;++k2){
    fscanf(ifp1," %f,",&t[k2][k1]);
  }
  fscanf(ifp1,"\n");
}
printf("Enter maximum number of loops: ");
scanf("%d",&Nloops);
Nlevels=2;
ASweight=0.1;
olderr=0;
y[1][N[1]]=1;
y[0][N[0]]=1;
for(k1=1;k1<Nlevels+1;++k1){
  for(k2=0;k2<N[k1]+1;++k2){
    for(k3=0;k3<N[k1-1]+1;++k3){
      w[k1][k2][k3]=0.001;
    }
  }
}
for(k1=0;k1<6;++k1){wow[k1]=0;}
reinitw();
wchange=.4;
A32();
for(k1=0;k1<6;++k1){
  wow[k1]/=(Nloops-10);
  printf("wow[%d]=%f ",k1,wow[k1]);
  /*fprintf(ofp1,"wow[%d]=%f ",k1,wow[k1]);*/
}
printf("\n");
/*fprintf(ofp1,"\n");*/
/*fprintf(ofp1," %d,%d,%d\n",N[0],N[1],N[2]);
for(k1=1;k1<Nlevels+1;++k1){
  for(k2=0;k2<N[k1];++k2){
    for(k3=0;k3<N[k1-1]+1;++k3){
      fprintf(ofp1," %f\n",w[k1][k2][k3]);
      printf("w[%d][%d][%d]=%f\n", k1,k2,k3,w[k1][k2][k3]);
    }
  }
}*/
fclose(ifp1);
if(ofp1!=NULL)fclose(ofp1);
}
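All three programs read the same training-set format: a header line giving datapts, xdata, and ydata, followed by one line per pattern holding the xdata inputs and then the ydata targets, each value followed by a comma. A plausible input file for a two-input, one-output XOR problem (our illustrative data, not taken from the dissertation) would be:

4,2,1
0.0,0.0,0.0,
0.0,1.0,1.0,
1.0,0.0,1.0,
1.0,1.0,0.0,

The array bounds hard-coded above (519 patterns, and in this listing two inputs) limit what such a file may contain.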
void A32()
{
int n1,n2,n3,count,loop,data,test;
float mean;
for(n2=0;n2<5;++n2){
  for(n1=0;n1<datapts;++n1){dwest[n2][n1]=1;}
}
for(loop=0;loop<Nloops;++loop){
  zed=0;
  err=0;
  for(n1=0;n1<5;++n1){use[n1]=0;}

The data reduction starts after the 10th iteration.

  if(loop>9){
    norms();
    for(data=0;data<datapts;++data){
      for(n1=0;n1<N[0];++n1){
        y[0][n1]=x[n1][data];
      }
      for(n1=1;n1<Nlevels+1;++n1){
        for(n2=0;n2<N[n1];++n2){
          y[n1][n2]=0;
          for(n3=0;n3<N[n1-1]+1;++n3){
            y[n1][n2]+=y[n1-1][n3]*w[n1][n2][n3];
          }
          if(y[n1][n2]<-20){y[n1][n2]=-20;}
          b[n1][n2]=y[n1][n2];
          y[n1][n2]=1/(1+exp(-y[n1][n2]));
          K[n1][n2]=y[n1][n2]*(1-y[n1][n2]);
          b[n1][n2]=y[n1][n2]-K[n1][n2]*b[n1][n2];
        }
      }
      for(n1=0;n1<N[Nlevels];++n1){
        tp=t[n1][data]-b[Nlevels][n1];
        for(n2=0;n2<N[Nlevels-1]+1;++n2){
          z[n2]=K[Nlevels][n1]*y[Nlevels-1][n2];
        }
        reinitwpp();
        rotin1(N[Nlevels-1]+1,n1+1);
        dbdA(N[Nlevels-1]+1,n1+1,data);
      }
      reinitwpp();
      for(n1=0;n1<N[Nlevels];++n1){
        count=0;
        tp=t[n1][data]-b[Nlevels][n1];
        for(n2=0;n2<N[Nlevels-1];++n2){
          tp-=K[Nlevels][n1]*w[Nlevels][n1][n2]*b[Nlevels-1][n2];
          for(n3=0;n3<N[Nlevels-2]+1;++n3){
            z[count]=K[Nlevels][n1]*K[Nlevels-1][n2]*w[Nlevels][n1][n2]*y[Nlevels-2][n3];
            ++count;
          }
        }
        tp-=K[Nlevels][n1]*w[Nlevels][n1][n2]*y[Nlevels-1][n2];
        rotin1(N[Nlevels-1]*(N[Nlevels-2]+1),0);
        dbdA(N[Nlevels-1]*(N[Nlevels-2]+1),0,data);
      }
    }
    for(n1=0;n1<N[Nlevels]+1;++n1){
      mean=0;
      for(n2=0;n2<datapts;++n2){
        if(mean<dwest[n1][n2]){mean=dwest[n1][n2];}
      }
      mean/=2;
      for(n2=0;n2<datapts;++n2){
        if(dwest[n1][n2]>mean || dwest[n1][n2]<0){dwest[n1][n2]=1;}
        else{dwest[n1][n2]=0;}
      }
    }
  }
  ASw();
  for(data=0;data<datapts;++data){
    for(n1=0;n1<N[0];++n1){
      y[0][n1]=x[n1][data];
    }
    for(n1=1;n1<Nlevels+1;++n1){
      for(n2=0;n2<N[n1];++n2){
        y[n1][n2]=0;
        for(n3=0;n3<N[n1-1]+1;++n3){
          y[n1][n2]+=y[n1-1][n3]*w[n1][n2][n3];
        }
        if(y[n1][n2]<-20){y[n1][n2]=-20;}
        b[n1][n2]=y[n1][n2];
        y[n1][n2]=1/(1+exp(-y[n1][n2]));
        K[n1][n2]=y[n1][n2]*(1-y[n1][n2]);
        b[n1][n2]=y[n1][n2]-K[n1][n2]*b[n1][n2];
      }
    }
    for(n1=0;n1<N[Nlevels];++n1){
      if(dwest[n1+1][data]>0){
        tp=t[n1][data]-b[Nlevels][n1];
        for(n2=0;n2<N[Nlevels-1]+1;++n2){
          z[n2]=K[Nlevels][n1]*y[Nlevels-1][n2];
        }
        rotin(N[Nlevels-1]+1,n1+1);
        ++use[n1+1];
      }
    }
    if(dwest[0][data]>0){
      for(n1=0;n1<N[Nlevels];++n1){
        count=0;
        tp=t[n1][data]-b[Nlevels][n1];
        for(n2=0;n2<N[Nlevels-1];++n2){
          tp-=K[Nlevels][n1]*w[Nlevels][n1][n2]*b[Nlevels-1][n2];
          for(n3=0;n3<N[Nlevels-2]+1;++n3){
            z[count]=K[Nlevels][n1]*K[Nlevels-1][n2]*w[Nlevels][n1][n2]*y[Nlevels-2][n3];
            ++count;
          }
        }
        tp-=K[Nlevels][n1]*w[Nlevels][n1][n2]*y[Nlevels-1][n2];
        rotin(N[Nlevels-1]*(N[Nlevels-2]+1),0);
        ++use[0];
      }
    }
    for(n1=0,err1=0,test=0;n1<ydata;++n1){
      err1+=(t[n1][data]-y[2][n1])*(t[n1][data]-y[2][n1]);
      if(fabs(t[n1][data]-y[2][n1]) >= 0.5){test=1;}
    }
    if(test==0){zed++;}
    err+=err1;
    for(n1=0;n1<N[Nlevels];++n1){
      weightch(Nlevels,N[Nlevels-1]+1,n1+1);
      for(n2=0;n2<N[Nlevels-1]+1;++n2){
        w[Nlevels][n1][n2]=wa[n2];
      }
    }
    weightch(Nlevels-1,N[Nlevels-1]*(N[Nlevels-2]+1),0);
    count=0;
    for(n1=0;n1<N[Nlevels-1];++n1){
      for(n2=0;n2<N[Nlevels-2]+1;++n2){
        w[Nlevels-1][n1][n2]=wa[count];
        ++count;
      }
    }
    printf("loop=%d data=%d zed=%d err[data]=%f\n", loop,data,zed,err);
    if(ofp1!=NULL)fprintf(ofp1," %d,%d,%f",loop,zed,err);
  }
  for(n1=0;n1<5;++n1){
    printf("use[%d]=%d ",n1,use[n1]);
    if(loop>9){
      wow[n1]+=use[n1];
    }
  }
  printf("\n");
}
}
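The thresholding loop above implements the selection step. In our notation, with \(\widehat{\delta w}_{m}(p)\) the estimated effect of pattern \(p\) on subproblem \(m\) (computed by dbdA() below), our reading of the comparison is

\[
\text{use pattern } p \text{ for subproblem } m
\quad\Longleftrightarrow\quad
\widehat{\delta w}_{m}(p) \;>\; \tfrac{1}{2}\max_{q}\,\widehat{\delta w}_{m}(q)
\quad\text{or}\quad
\widehat{\delta w}_{m}(p) < 0 .
\]

A negative estimate occurs when the denominator of the measure is not positive, that is, when the bound is uninformative, and such patterns are retained to be safe.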
rotin(MATSIZE,mat)
int MATSIZE,mat;
{
int k1,k2;
float rho,sigma,tau,Wp[30][30];
for(k2=0;k2<MATSIZE;++k2){
  W[mat][MATSIZE][k2]=z[k2];
}
W[mat][MATSIZE][MATSIZE]=tp;
for(k2=0;k2<MATSIZE+1;++k2){
  for(k1=0;k1<MATSIZE+1;++k1){
    Wp[k2][k1]=0;
  }
}
for(k2=0;k2<MATSIZE;++k2){
  rho=sqrt(W[mat][k2][k2]*W[mat][k2][k2]+W[mat][MATSIZE][k2]*W[mat][MATSIZE][k2]);
  if(rho == 0){/*if(mat==0){printf("below\n");}*/goto below4;}
  sigma=W[mat][k2][k2]/rho;
  tau=W[mat][MATSIZE][k2]/rho;
  for(k1=k2;k1<MATSIZE+1;++k1){
    Wp[k2][k1]=W[mat][k2][k1]*sigma+W[mat][MATSIZE][k1]*tau;
    Wp[MATSIZE][k1]=-W[mat][k2][k1]*tau+W[mat][MATSIZE][k1]*sigma;
  }
  Wp[MATSIZE][k2]=0;
  for(k1=0;k1<MATSIZE+1;++k1){
    W[mat][k2][k1]=Wp[k2][k1];
    W[mat][MATSIZE][k1]=Wp[MATSIZE][k1];
  }
  below4:;
}
}

The subroutine below applies only the first Givens rotation of a hypothetical update, leaving the W matrices unchanged; the rotated first row is stored in the wpp vector for use by dbdA().

rotin1(MATSIZE,mat)
int MATSIZE,mat;
{
int k1,k2;
float rho,sigma,tau;
for(k2=0;k2<MATSIZE;++k2){
  W[mat][MATSIZE][k2]=z[k2];
}
W[mat][MATSIZE][MATSIZE]=tp;
for(k1=0;k1<MATSIZE+1;++k1){
  wpp[k1]=0;
}
rho=sqrt(W[mat][0][0]*W[mat][0][0]+W[mat][MATSIZE][0]*W[mat][MATSIZE][0]);
if(rho == 0){goto below4;}
sigma=W[mat][0][0]/rho;
tau=W[mat][MATSIZE][0]/rho;
for(k1=0;k1<MATSIZE+1;++k1){
  wpp[k1]=W[mat][0][k1]*sigma+W[mat][MATSIZE][k1]*tau;
}
below4:;
}

This subroutine calculates an approximation to the change that a training pattern would produce in the weight estimates.

dbdA(MATSIZE,mat,datas)
int MATSIZE,mat,datas;
{
int k1;
float dbdAnorm,dAnorm;
dbdAnorm=0;
dAnorm=0;
for(k1=0;k1<MATSIZE;++k1){
  dbdAnorm+=(W[mat][0][k1]-wpp[k1])*wb[mat][k1];  /* wb holds the current weights */
  dAnorm+=fabs(W[mat][0][k1]-wpp[k1]);
}
dbdAnorm=fabs((W[mat][0][MATSIZE]-wpp[MATSIZE])-dbdAnorm);
dwest[mat][datas]=dbdAnorm/(bnorm[mat]-wnorm[mat]*dAnorm);
}

weightch(level,MATSIZE,mat)
int level,MATSIZE,mat;
{
double wchangep[30],a[30],wp[30],den,fac;
int zer,zt,n1,n2,counts;
zt=0;
zer=0;
if(level==2){
  for(n1=0;n1<MATSIZE;++n1){
    wp[n1]=w[level][mat-1][n1];
  }
}
if(level==1){
  counts=0;
  for(n1=0;n1<N[Nlevels-1];++n1){
    for(n2=0;n2<N[Nlevels-2]+1;++n2){
      wp[counts]=w[level][n1][n2];
      ++counts;
    }
  }
}
for(n1=0;n1<MATSIZE;++n1){
  if(fabs(W[mat][n1][n1])<1e-28){zer=1;}
}
if(zer==0){
  a[MATSIZE-1]=W[mat][MATSIZE-1][MATSIZE]/W[mat][MATSIZE-1][MATSIZE-1];
  for(n1=MATSIZE-2;n1>=0;--n1){
    a[n1]=W[mat][n1][MATSIZE];
    for(n2=MATSIZE-1;n2>n1;--n2){
      a[n1]-=W[mat][n1][n2]*a[n2];
    }
    a[n1]/=W[mat][n1][n1];
  }
}
else{
  if(level==2){
    printf("no weight update level 2\n");
    for(n1=0;n1<MATSIZE;++n1){
      a[n1]=wp[n1];
    }
  }
  if(level==1){
    printf("no weight update level 1\n");
    for(n1=0;n1<MATSIZE;++n1){
      a[n1]=wp[n1];
    }
  }
}
for(n1=0;n1<MATSIZE;++n1){
  wchangep[n1]=a[n1]-wp[n1];
}
den=0;
for(n1=0;n1<MATSIZE;++n1){
  den+=wchangep[n1]*wchangep[n1];
}
den=sqrt(den);
if(den > wchange){fac=wchange/den;}
else{fac=1;}
for(n1=0;n1<MATSIZE;++n1){
  wchangep[n1]*=fac;
}
for(n1=0;n1<MATSIZE;++n1){
  wa[n1]=wp[n1]+wchangep[n1];
}
for(n1=0,zt=0;n1<MATSIZE;++n1){
  if(wa[n1]>700){wa[n1]=700;zt=1;}
  if(wa[n1]<-700){wa[n1]=-700;zt=1;}
}
if(zt==1){printf("weight at 700\n");}
}

reinitw()
{
int n1,n2,n3;
for(n1=0;n1<5;++n1){
  for(n2=0;n2<30;++n2){
    for(n3=0;n3<30;++n3){
      W[n1][n2][n3]=0;
    }
  }
}
}

The subroutine below initializes the wpp vector to zero.

reinitwpp()
{
int n1;
for(n1=0;n1<30;++n1){
  wpp[n1]=0;
}
}
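In our notation (the symbols are ours, reconstructed from the code), the measure that dbdA() assigns to pattern \(p\) for subproblem \(m\) is the first-order bound

\[
\widehat{\delta w}_{m}(p) \;=\;
\frac{\bigl|\,\Delta c_{1} \;-\; \sum_{k}\Delta r_{k}\,w_{k}\,\bigr|}
     {\,\|c\|_{\infty} \;-\; \|w\|_{\infty}\sum_{k}\lvert\Delta r_{k}\rvert\,},
\]

where \(\Delta r\) is the difference between the current first row of the triangular factor and the row that one Givens rotation of the new pattern would produce (rotin1() places the latter in wpp), \(\Delta c_{1}\) is the corresponding difference in the right-hand-side entry, \(w\) is the current weight vector held in wb, and the infinity norms \(\|c\|_{\infty}\) and \(\|w\|_{\infty}\) are the bnorm and wnorm quantities computed by norms() below.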
This subroutine calculates the norms for the weight vectors and the b vector.

norms()
{
int k1,k2;
for(k1=0;k1<N[2]+1;++k1){
  wnorm[k1]=0;
  bnorm[k1]=0;
}
for(k1=1;k1<N[2]+1;++k1){
  for(k2=0;k2<N[1]+1;++k2){
    if(wnorm[k1]<fabs(wb[k1][k2])){wnorm[k1]=fabs(wb[k1][k2]);}
  }
}
for(k2=0;k2<N[0]+1;++k2){
  if(wnorm[0]<fabs(wb[0][k2])){wnorm[0]=fabs(wb[0][k2]);}
}
for(k1=1;k1<N[2]+1;++k1){
  for(k2=0;k2<N[1]+1;++k2){
    if(bnorm[k1]<fabs(W[k1][k2][N[1]+1])){bnorm[k1]=fabs(W[k1][k2][N[1]+1]);}
  }
}
for(k2=0;k2<N[0]+1;++k2){
  if(bnorm[0]<fabs(W[0][k2][N[1]*(N[0]+1)])){bnorm[0]=fabs(W[0][k2][N[1]*(N[0]+1)]);}
}
}

The following subroutine implements the forgetting factor by multiplying the W matrices by the constant ASweight.

ASw()
{
int k1,k2,k3;
for(k1=0;k1<N[2]+1;++k1){
  for(k2=0;k2<30;++k2){
    for(k3=0;k3<30;++k3){
      W[k1][k2][k3]*=ASweight;
    }
  }
}
}
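Each of the three listings is a self-contained program. Under a typical Unix C compiler of the period it would be built and run along the lines of (the file name is ours, purely illustrative):

cc -o layerwise layerwise.c -lm
layerwise

after which the program prompts for the input file name, the number of hidden nodes, and the maximum number of training loops.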