This is to certify that the dissertation entitled

EFFICIENT EXTENDED KALMAN FILTER LEARNING FOR FEEDFORWARD LAYERED NEURAL NETWORKS

presented by Saida Benromdhane has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

Major professor

Efficient Extended Kalman Filter Learning for Feedforward Layered Neural Networks

By

Saida Benromdhane

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1996

ABSTRACT

Efficient Extended Kalman Filter Learning for Feedforward Layered Neural Networks

By

Saida Benromdhane

The thesis focuses on the computationally efficient convergence to satisfactory local minima of the Extended Kalman Filter (EKF) algorithm when it is used in the supervised learning of Artificial Feedforward Neural Networks. There are two stages to our research work. In the first stage, the effect of different choices of the energy parameter, or weighting factor λ, on the convergence of the EKF algorithm is investigated. We limit our attention to problems related to the supervised learning of Feedforward Artificial Neural Networks. Through the simulation of two region-classification problems and the analysis of the results, we demonstrate that when λ is chosen slightly smaller than 1, the algorithm experiences explosive divergence: the Least Square Error (LSE) grows indefinitely. However, for a choice of λ slightly greater than 1, the algorithm is stable but often converges to local minima that are unsatisfactory from the point of view of performance and computation time. In the second stage of our work we propose several modifications of the algorithm. These modifications are aimed at improving the efficiency of the algorithm both in terms of performance and speed of convergence. One modification is the development of an update mechanism for the exponential weighting factor λ which self-adjusts to the LSE. The second modification is an augmentation of the recursion formulae of the algorithm.
Both of these modifications result in a significant improvement in performance as well as a marked decrease in convergence time when compared with the original EKF algorithm.

To my father, who always prayed for my success but could not live to see this day.

ACKNOWLEDGEMENTS

In The Name Of Allah, Most Gracious, Most Merciful. All praises be to Allah Subhanahu-wa-Taala, the creator and sustainer of this universe, for providing me the strength and the patience to complete this work. I am grateful for the financial support from the Government of Tunisia and the Department of Electrical Engineering at Michigan State University. Without this support, this research work would not have been accomplished. I would like to express my sincere thanks to my major advisor, Dr. F. M. A. Salam, for his continued support and guidance throughout the years of my graduate studies. I also would like to thank the members of my committee, Dr. Hassan Khalil, Dr. Joel Shapiro, and Dr. Majid Nayeri, for their helpful criticism and suggestions. I wish to express my greatest thanks and appreciation to my parents for their continued prayers, support and encouragement. I deeply owe my heartiest feelings of gratitude to my husband, Izzat, for his patience and support, especially during the last years of my research. The warmest feelings of gratitude are owed to my sisters Souad and Sonia, and my brothers Lassaad, Sofiane and Souheil. They have always prayed for my success and supported my pursuit of a Doctoral degree.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Inherent Properties of Neural Networks
  1.2 Why a Different Training Approach?
  1.3 The Contribution of the Thesis
2 Background
  2.1 Basic Components of Neural Network Models
    2.1.1 Processing units
    2.1.2 Topology
    2.1.3 Propagation rule
    2.1.4 Learning rule
  2.2 Learning Types
    2.2.1 Supervised learning
    2.2.2 Unsupervised learning
  2.3 Computational Features of Neural Networks
    2.3.1 Function approximation
    2.3.2 Data compression
    2.3.3 Optimization
  2.4 Examples of Models of Feedforward Neural Networks
    2.4.1 Perceptron and adaline
    2.4.2 Multilayer perceptron
  2.5 The Optimal Filtering Technique
  2.6 Thesis Problem Description
  2.7 The Objectives and the Outline of the Thesis
    2.7.1 Outline of the thesis
3 The Extended Kalman Filter Algorithm
  3.1 Motivation
  3.2 The Learning Dynamics
    3.2.1 The global extended Kalman filter algorithm
    3.2.2 Computation of the gradient matrix
    3.2.3 The local approaches and the (NDEKF)
  3.3 Simulation Examples
  3.4 Discussion and Analysis of Results
    3.4.1 Conjecture 1
4 Convergence to Satisfactory Local Minima
  4.1 Motivation
  4.2 Divergence Phenomena
  4.3 Proposed Solutions
    4.3.1 A modified update procedure
    4.3.2 The forgetting factor as a function of error
  4.4 Discussion and Categorization of Results
    4.4.1 Conjecture 2
5 A Different Approach to the Computation of the Gain Matrix
  5.1 Motivation
  5.2 Recomputing the Gain Matrix
  5.3 Simulation Results
6 Variable Slopes
  6.1 Motivation
  6.2 Preliminary Results
  6.3 The Augmented Recursion Formulas
  6.4 Simulation Results
7 Summary and Conclusion
  7.1 Summary
  7.2 Conclusion and Future Research
A Derivation of the Extended Kalman Algorithm Equations
BIBLIOGRAPHY

LIST OF TABLES

3.1 The 2-region classification problem
3.2 The 4-region classification problem
4.1 The 2-region classification problem
4.2 The 4-region classification problem
5.1 The 4-region classification problem with a modified computation of the gain matrix
6.1 The 4-region classification problem with fixed slopes
6.2 The 4-region classification problem with variable slopes

LIST OF FIGURES

3.1 The 2-region classification pattern
3.2 The 4-region classification pattern
3.3 The TAE with λ⁻¹ = 1
3.4 The TAE with λ⁻¹ = 1.01
3.5 The TAE with λ⁻¹ = 0.99
3.6 The IAE with λ⁻¹ = 1
3.7 The IAE with λ⁻¹ = 1.01
3.8 The IAE with λ⁻¹ = 0.99
3.9 The TAE with λ⁻¹ = 1
3.10 The TAE with λ⁻¹ = 1.01
3.11 The TAE with λ⁻¹ = 0.99
3.12 The IAE with λ⁻¹ = 1
3.13 The IAE with λ⁻¹ = 1.01
3.14 The IAE with λ⁻¹ = 0.99
3.15 The TAE with λ⁻¹ = 1
3.16 The TAE with λ⁻¹ = 1.01
3.17 The TAE with λ⁻¹ = 0.99
3.18 The IAE with λ⁻¹ = 1
3.19 The IAE with λ⁻¹ = 1.01
3.20 The IAE with λ⁻¹ = 0.99
3.21 The TAE with λ⁻¹ = 1
3.22 The TAE with λ⁻¹ = 1.01
3.23 The TAE with λ⁻¹ = 0.99
3.24 The IAE with λ⁻¹ = 1
3.25 The IAE with λ⁻¹ = 1.01
3.26 The IAE with λ⁻¹ = 0.99
4.1 λ = tanh(a·δ) with a = 1000, where δ is the IAE difference
4.2 The TAE for λ⁻¹ = 1
4.3 The TAE for update choice 1
4.4 The TAE for update choice 2
4.5 The TAE for update choice 3
4.6 The TAE for update choice 4
4.7 The IAE for λ⁻¹ = 1
4.8 The IAE for update choice 1
4.9 The IAE for update choice 2
4.10 The IAE for update choice 3
4.11 The IAE for update choice 4
4.12 The TAE for λ⁻¹ = 1
4.13 The TAE for update choice 1
4.14 The TAE for update choice 2
4.15 The TAE for update choice 3
4.16 The TAE for update choice 4
4.17 The IAE for λ⁻¹ = 1
4.18 The IAE for update choice 1
4.19 The IAE for update choice 2
4.20 The IAE for update choice 3
4.21 The IAE for update choice 4
4.22 The TAE for λ⁻¹ = 1
4.23 The TAE for update choice 1
4.24 The TAE for update choice 2
4.25 The TAE for update choice 3
4.26 The TAE for update choice 4
4.27 The IAE for λ⁻¹ = 1
4.28 The IAE for update choice 1
4.29 The IAE for update choice 2
4.30 The IAE for update choice 3
4.31 The IAE for update choice 4
4.32 The 4-region classification pattern
4.33 Classification Performance for λ⁻¹ = 1
4.34 Classification Performance for update choice 1
4.35 Classification Performance for update choice 2
4.36 Classification Performance for update choice 3
4.37 Classification Performance for update choice 4
4.38 The TAE for λ⁻¹ = 1
4.39 The TAE for update choice 1
4.40 The TAE for update choice 2
4.41 The TAE for update choice 3
4.42 The TAE for update choice 4
4.43 The IAE for λ⁻¹ = 1
4.44 The IAE for update choice 1
4.45 The IAE for update choice 2
4.46 The IAE for update choice 3
4.47 The IAE for update choice 4
4.48 The 4-region classification pattern
4.49 Classification Performance for λ⁻¹ = 1
4.50 Classification Performance for update choice 1
4.51 Classification Performance for update choice 2
4.52 Classification Performance for update choice 3
4.53 Classification Performance for update choice 4
6.1 The sigmoid function with slope s = 0.5
6.2 The sigmoid function with slope s = 1
6.3 The sigmoid function with slope s = 1.5
6.4 The sigmoid function with slope s = 2
6.5 The derivative of the sigmoid function with slope s = 0.5
6.6 The derivative of the sigmoid function with slope s = 1
6.7 The derivative of the sigmoid function with slope s = 1.5
6.8 The derivative of the sigmoid function with slope s = 2
6.9 The TAE for a fixed slope s = 1
6.10 The TAE for a fixed slope s = 0.5
6.11 The TAE for a fixed slope s = 1.5
6.12 The TAE for a fixed slope s = 2
6.13 The IAE for a fixed slope s = 1
6.14 The IAE for a fixed slope s = 0.5
6.15 The IAE for a fixed slope s = 1.5
6.16 The IAE for a fixed slope s = 2
6.17 The 4-region classification pattern
6.18 Classification Performance for a fixed slope s = 1
6.19 Classification Performance for a fixed slope s = 0.5
6.20 Classification Performance for a fixed slope s = 1.5
6.21 Classification Performance for a fixed slope s = 2
6.22 The TAE for varying the slope with s₀ = 1
6.23 The TAE for varying the slope with s₀ = 0.5
6.24 The TAE for varying the slope with s₀ = 1.5
6.25 The TAE for varying the slope with s₀ = 2
6.26 The IAE for varying the slope with s₀ = 1
6.27 The IAE for varying the slope with s₀ = 0.5
6.28 The IAE for varying the slope with s₀ = 1.5
6.29 The IAE for varying the slope with s₀ = 2
6.30 The 4-region classification pattern
6.31 Classification Performance for varying the slope with s₀ = 1
6.32 Classification Performance for varying the slope with s₀ = 0.5
6.33 Classification Performance for varying the slope with s₀ = 1.5
6.34 Classification Performance for varying the slope with s₀ = 2

CHAPTER 1

Introduction

Even though Artificial Neural Networks may be limited and imperfect, they are still one of the few paradigms, if not the only one, that seek to mimic natural intelligence at an architectural level. They have performed impressively in some limited applications. The successes and failures of Neural Networks suggest that research will ultimately give rise to artificial systems that can perform the tasks that only humans are currently able to perform. That is what constitutes the major drive behind the exponential growth of Neural Networks research. An outcome of the renewed interest in Neural Networks is their use in numerous applications ranging from pattern classification and completion to automatic control. The network paradigms used in these applications can be very distinct, each carrying its own name. They are made of a large number of processing elements operating in parallel. The topology that connects these processing elements is what determines the distinction between two or more network paradigms. Given that the main goal for Artificial Neural Networks is to possess human intelligence, the inspiration for research had to come more specifically from neurobiology. Neurobiology offers a cellular formulation of the principles of intelligent behavior, which inspires researchers to construct better intelligent systems known as "Artificial Neural Networks".

1.1 Inherent Properties of Neural Networks

This new hope finds its origins in all the inherent properties of Neural Networks, but more so in four particularly important ones: (1) generalization, (2) graceful degradation, (3) adaptivity and learning, and (4) parallelism [1, 2]. A focus on these properties will be sufficient to illustrate the importance of Neural Networks for related fields. Given that Neural Networks emulate continuous activation units, they provide a continuous representation and inference system with similarity metrics, such that when similar real-world data inputs are presented, outputs or inferences are also similar. A smooth generalization from stored cases to new ones is then expected.
The generalization performance of logic-based systems deteriorates considerably when the available data is inaccurate or incomplete. In contrast, the generalization performance of Neural Networks deteriorates in proportion to the degree of inaccuracy or incompleteness of the data, which is usually small. However, what really makes Neural Networks distinct from other expert systems are the last two characteristics mentioned above. The first manifests itself in the way Neural Networks learn and adapt: Neural Networks are taught by the presentation of examples, and they adapt very easily to changing conditions. This issue is still an unsolved problem with other expert systems. The second characteristic is present in the architecture of Neural Networks: adaptation and learning are conducted in a fully parallel structure. Other parallel reasoning models have been developed and implemented, but in most cases they have to be supported by the user. By contrast, in Neural Networks, units in the same layer can be updated simultaneously. In this way, computationally expensive tasks can be performed efficiently.

1.2 Why a Different Training Approach?

For several years the backpropagation (BP) algorithm, which is based on the gradient descent algorithm, has been widely used to train Multi-Layered Neural Networks (MNNs) to perform a desired task. The common use of the backpropagation algorithm is attributed to its strengths, which include the ability to generalize from a modest number of training patterns and to acquire arbitrarily complex nonlinear mappings, all within a parallel structure of computation. However, like any other algorithm, the backpropagation algorithm has limitations that are easily recognized during its use. The first noticeable one is the slow convergence time. The second limitation is an off-line encoding requirement which makes it unsuitable for certain applications, especially when dealing with temporal signals. In order to improve the speed of convergence of the (BP) algorithm, numerous schemes have been suggested, for example [3, 4, 5]. One of these schemes [5] used different energy functions to achieve faster convergence of the (BP) algorithm. All of these enhancements are computationally inexpensive, but they often require tunable parameters, thereby adding cumbersome guesswork. Given that the enhanced backpropagation algorithm, like the standard one, may not adequately handle temporal signals, it would not be suitable for some specific applications, for instance, the identification of temporal signals. In the face of this limitation, the problem that poses itself is how to formulate a Feedforward Artificial Neural Network algorithm that possesses this temporal capability and is also efficient in terms of convergence time, guesswork and computation. In the last few years much interest has focused on training algorithms [6, 7] that are faster than the backpropagation algorithm or are capable of solving more complex problems. A class of alternative algorithms that has been used in different applications is based on the Extended Kalman Filter algorithm. The Kalman algorithm, in contrast to gradient techniques, computes the optimum value of the network parameters each time a new data point is presented; therefore it is well suited for temporal signal processing.

1.3 The Contribution of the Thesis

The main topic of this thesis is the Extended Kalman Filter approach for training Feedforward Artificial Neural Networks (FFANN). This is a supervised learning approach which is different from backpropagation. Motivated by previous and current analysis [8], a framework supported by computer simulations is developed to improve the supervised learning of FFANN using the Kalman Filter approach. The framework encompasses modifications of the Extended Kalman Filter algorithm, including the development of an update mechanism for the exponential weighting factor λ [8]. This mechanism self-adjusts according to the least square error (LSE). The weighting factor is also implemented as a function of the error. The update is aimed to speed up and enable convergence to minima that are satisfactory from the point of view of performance. A different computation of the gain matrix is also pursued. A final and important feature added to the algorithm is the augmentation of its recursion formulas. A recursion formula for the slope of the sigmoid nonlinearity is developed, where the update is based on the gradient descent rule. The gradient descent rule is known to decrease the MSE without adding oscillations.
This is a supervised learning which is different from the backpropagation. Motivated by the help of previous and current analysis [8], a framework supported by computer simulations is developed to improve the supervised learning of F FANN using the Kalman Filter approach. The framework encompasses modifications of the Extended Kalman Filter algorithm including the development of an update mecha- nism for the exponential weighting factor A [8]. This mechanism self-adj usts according to the mean square error LSE. The weighting factor is also implemented as a function of the error. The update is aimed to speed up and enable convergence to minima that are satisfactory from the point-of—view of performance. A different computation of the gain matrix is also pursued. A final and important added feature to the algorithm is the augmentation of its recursion formulas. A recursion formula for the slope of the sigmoid nonlinearity is developed where the update is based on the gradient descent rule. The gradient descent rule is known to decrease the MSE without adding any oscillations. CHAPTER 2 Background The inspiration for Neural Networks structures is mainly based on our present under- standing of the nervous biological system. To completely specify a Neural Network model, its physical parts have to be defined along with the required algorithms. Sim- ilar to any other network a Neural Network is comprised of nodes we call neurons as its processing units, and edges connecting the neurons therefore defining the topol- ogy of the network. Once a network’s topology is defined, this Neural Network is classified in one of several Neural Networks models. These models include the feed- forward neuron-model, the feedback neuron-model and a completely interconnected neuron-model. Our primary interest in this work is in the feedforward neuron-model and the topology that defines the connection between its neurons. In the following two sections the feedforward neuron model and its topology will be discussed in de- tail. The resulting network will be referred to as the Feedforward Artificial Neural Network(FFANN) In the design of a Neural Network model, defining the topology between the neurons is only the first step. Devising the learning rule for this network is the most challenging part in the design. The learning rule is the core constituent of the algorithm necessary for the operation of the neurons. It consists of changing the state of individual neurons using inputs from neighboring neurons. This rule can take various particular forms, therefore we will only emphasize the forms that have been used in (FFAN N). We should also mention the one feature that is common to many models: most learning rules can be viewed as modifications of the stochastic gradient optimization of a certain objective function or performance criterion. An algorithm where this feature is very well known is the error backpropagation algorithm. Our primary interest in this thesis is another model that also has this feature and is a (FFANN) where the learning is based on the Kalman optimal filtering technique. This model will be introduced in the fourth section of this chapter and discussed more in detail in the remaining chapters. The first four sections of this chapter are aimed at describing Neural Networks in a broader context, they prepare for a complete understanding of the sections that follow. 
2.1 Basic Components of Neural Network Models

A Neural Network is specified completely by the definition of its physical parts and the algorithms that are applied to it. Like any other network or graph, a Neural Network consists of processing units, physically called nodes, that serve as the artificial counterpart of neurons in the brain. It also consists of edges defining its topology or connectivity. In the succeeding material, the information-processing algorithm within a Neural Network is completely specified by defining the following components: the functionality of the processing units, the topology, and the propagation and learning rules, with the learning rule being the key to the processing algorithm. Another algorithm, which may require some programming, is also required for any processing system. In Neural Networks, this task is summarized into a simple learning rule.

2.1.1 Processing units

The processing units in a Neural Network mimic the function of neurons in the brain. Therefore, the information processed in them is determined by how the input of each node relates to its own output and also by how each processing unit connects to others or to the environment. This is described completely in the following terms [1]:

• Input into the unit from other units or the environment.
• Output sent to other units or the environment.
• The unit's internal state, also referred to as the unit's activation.
• The rule for computing the next state of the unit, also called the activation rule or activation function.
• The rule for computing the output from the current state. This is defined by the output function.

Biological neurons are complicated physical systems. Their dynamics, if realistically modeled, would involve a large number of differential equations. This feature of biological neurons is the origin of a strong motivation toward simplifying the neuronal model. A common means of simplification is to have two separate models, each describing a functional mode of the neuron. The two functional modes are 1) the processing mode and 2) the learning mode. The dynamics of the processing state are described by a single differential equation. Widespread models such as perceptrons have processing units with no genuine state. Their output is a simple function of the input. This function is based primarily on the simplified neuron model of Pitts [9]. The simplest version is a step function of the weighted sum of the individual inputs. This function is described by the following equation:

$$f(x,\theta) = \begin{cases} 1 & \text{if } x > \theta \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$

An attempt to make the activation function differentiable resulted in the following so-called logistic function:

$$f(x,\theta) = \frac{1}{1 + e^{-(x-\theta)}} \qquad (2.2)$$

The processing state is described as combining the inputs into a weighted sum and then passing it through a so-called activation function such as the step or sigmoid function. As described above, the processing state is the equivalent of the state of activation of a simplified artificial neuron. The learning state, however, is characterized by a vector of parameters of the activation function. These parameters are to be changed according to a desired performance criterion of the Neural Network. The parameters of an activation function consist of

• the weights of the individual inputs, with the help of which the weighted sum is computed;
• a threshold θ of the step or sigmoid function.
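For concreteness, the following minimal Python sketch (our own illustration, not code from the thesis; the function names and the explicit threshold argument are assumptions) applies the step rule of equation (2.1) and the logistic rule of equation (2.2) to the weighted sum of a unit's inputs.

    import math

    def step_unit(x, w, theta):
        # Equation (2.1): output 1 if the weighted sum exceeds the threshold.
        s = sum(wi * xi for wi, xi in zip(w, x))
        return 1 if s > theta else 0

    def logistic_unit(x, w, theta):
        # Equation (2.2): a differentiable version of the step rule.
        s = sum(wi * xi for wi, xi in zip(w, x))
        return 1.0 / (1.0 + math.exp(-(s - theta)))

For large positive or negative weighted sums the logistic unit approaches the step unit's outputs of 1 and 0, which is why it serves as a smooth replacement.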
2.1.2 Topology

If a system made of the processing units described in the previous section is capable of solving complex tasks in a way reminiscent of human cognitive capabilities, its sophistication is attributed to the processing units themselves and, more importantly, to the way they interact. The topology of the network is what determines the interactions and is thus of crucial importance for the performance of the network. There are various cases of interconnected networks. The simplest one is a completely interconnected network. This topology allows for arbitrarily complex feedback structures. Although feedback extends the class of possible behaviors immensely, in some cases it may be desirable to exclude the possibility of feedback at the topology level. The way to achieve feedback exclusion is simply to make it impossible, following any path along directed connections, to re-enter a unit once left. Such a network is called a feedforward network. Since a feedforward network has the properties of a directed graph, each unit in the network can be assigned a rank giving the number of units on the longest path between it and some input unit. It is then easily understood that, for any pair of interconnected units, the rank of the unit the connection leaves is lower than the rank of the unit the connection enters. The scheme of iterative evaluation according to the rank of individual units suggests collecting units into subsets of equal rank. Adding the restriction that the rank difference between interconnected units is 1, we obtain a feedforward layered network. The evaluation of this network is done in a number of iterations that equals the number of interconnection layers. The kth layer consists of the connections between the units with ranks k and k−1. This evaluation can be viewed as a sequence of vector operators. A simple case of this evaluation is that of a single-layer feedforward network, which is the topology used for one of the first Neural Network models, the perceptron of Rosenblatt [10]. We should also mention the other networks that are layered. These are the feedback layered networks, with examples such as the Adaptive Resonance Theory (ART) of Grossberg [11] and the Bidirectional Associative Memory of Kosko [12].

2.1.3 Propagation rule

After a thorough description of the Neural Network in terms of its processing units and topology, what remains is to describe the propagation rule for information processing. This rule specifies the time at which, and the order in which, the unit activations are to be updated. The approach that is closest to the biological system is a simultaneous and asynchronous update of all units. Since the asynchronous approach does not seem to have any clear advantages, most existing propagation rules are synchronous and vary with the Neural Network model type. The variation of the propagation rule from one network type to another is manifested in the two algorithms that we describe next. The first one is for feedforward layered Neural Networks and is a modification of the algorithm used for (FFANN). The algorithm is described in these basic steps:

• The input units are set equal to the input pattern.
• In the kth step, the activations of the units of the kth layer are computed using the activations of the (k−1)th layer.
• After a number of steps equal to the number of layers, the state of the output units corresponds to the output pattern.

The processing in this FFANN is viewed as the evaluation of a nested non-recursive function. The situation is different for feedback networks. A detailed discussion of the propagation rule, including the steps of the algorithm to be followed in the processing of information in a FFANN, can be found in [1].
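A minimal sketch of this layer-by-layer evaluation is given below (our own illustration, not taken from [1]); it assumes the connections of each interconnection layer are stored as a matrix, and uses the hyperbolic-tangent activation that appears later in Chapter 3.

    import numpy as np

    def propagate(x, weights, biases):
        # weights[k] and biases[k] hold the parameters of the kth
        # interconnection layer; the input units are set to the pattern x.
        a = np.asarray(x, dtype=float)
        for W, b in zip(weights, biases):
            # Step k: activations of layer k from those of layer k-1.
            a = np.tanh(W @ a + b)
        # After as many steps as there are layers, a is the output pattern.
        return a

Because each layer depends only on the previous one, all units within a layer can be updated simultaneously, which is the parallelism noted in Chapter 1.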
The processing in this FFANN is viewed as the evaluation of a nested non—recursive function. The situation is different for feedback networks. A detailed discussion of the propagation rule including the steps of the algorithm to be followed in the processing of information in a FFANN are found in [1]. 12 2.1 .4 Learning rule The learning rule is the most difficult constituent of a Neural Network. Formally, it resembles the propagation rule since it also operates on a network of interconnected units. The learning rule consists of changing the state of individual units (the state of activation function parameters) using inputs from the neighboring units. Given the existence of variations of the particular form of this rule, we will limit our discussion of this rule to a feature that is common to many models. This feature is observed in the fact that most'learning rules are modifications of stochastic gradient optimization of a certain objective function or performance criterion. Models that have this feature include the following: e The error backpropagation algorithm of Rumelhart, Hinton, and Williams [13] e The perceptron of Rosenblatt [10], minimizing a differentiable modification of the Bayesian misclassification rate. e The unsupervised learning rule of Oja[14], minimizing the difference between the original pattern and the pattern reconstructed from its feature representation. Other aspects of learning rules with a classification according to learning types are discussed in the next section. 2.2 Learning Types The current interest in Artificial Neural Networks methods is mainly attributed to their ability to learn from experience. This ability to learn offers a powerful alternative to programming. Learning types or methods can be classified as supervised and unsupervised, with a great many paradigms implementing each type. 13 2.2.1 Supervised learning Examples of supervised learning paradigms are the original perceptron, and more recently backpropagation. Supervised learning involves the training of the network on a training set consisting of vector pairs. One vector is applied to the input of the network; the other is used as a “target” which represents the desired output. Training is performed by an adjustment of the network weights so as to minimize the difference between the desired and actual output. This process can either be achieved in an iterative procedure, or the weights can be computed by closed form equations. The latter form of training may fail to qualify the system as a Neural Network since its method is so far from the biological method. However, such methods are useful, and provide a broader definition of Artificial Neural Networks. We can describe iterative training in the following statements: e Apply an input vector to the network. e Compare the output produced to the target vector. e Use the error signal obtained from the above comparison to modify the network weights. The modification of the weights which we can describe as the correction of the weights can be general, equally applied as a reinforcement to all parts or it may be specific, with each weight receiving an appropriate adjustment. The intention behind adjusting the weights at each step is to reduce the difference between the output and target vectors. Vectors from the training set are applied to the network repeatedly until the error is at an acceptably low value. The training process is successful if the network is capable of performing the desired mapping. 
2.2.2 Unsupervised learning

This type of learning, in contrast with supervised learning, requires no target information and needs only the input vectors to train the network. Unsupervised learning is also called self-organization [15]. In the course of training, the network weights are adjusted so that similar inputs produce similar outputs. The training algorithm accomplishes this by extracting statistical regularities from the training set and representing them as the values of the network weights. As Wasserman phrased it in [2]: "self-organization is reminiscent of the manner in which, in some cases, the human brain modifies its structure under the influence of its experiences without a 'teacher'". Although applications of unsupervised learning are not as frequent as those of supervised learning, it has been used in combination with other paradigms and has produced useful results. An example of these applications is the counterpropagation method [16].

2.3 Computational Features of Neural Networks

Neural Network research is directed toward finding models that can accomplish useful tasks such as association, pattern recognition, etc. The most important classes of application tasks that Neural Networks are able to perform are identified in this section.

2.3.1 Function approximation

A deterministic feedforward network represents a mapping between the input layer units and the output layer units. The evaluation of the mapping is achieved by propagating the activations from the input layer to the output layer. The network topology and activation functions are what determine the class of functions that can be represented by a network. For instance, a single-layer perceptron output results from the input in the following way:

$$y = \sum_i w_i x_i \qquad (2.3)$$

It is then clear that the class of linear mappings can be represented by the linear perceptron. However, a broader range of function classes, usually nonlinear mappings, can be represented by multilayer perceptrons. In addition to single-layer and multilayer perceptrons, layered feedforward Neural Networks with radial basis activation functions seem to perform especially well for general functional approximation tasks. One of the functional approximation tasks that is of great application importance is classification. Even though any network model could be used for classification, specific properties of this application require particular activation functions and learning rules. The classification task is characterized in the following way:

• A set of objects is supplied, each characterized by a fixed-length vector of numeric or Boolean features. A set of classes is also given, such that each object is assigned to exactly one class.
• A subset, called the training set, is then selected, for which the class assignment is explicitly known.
• The Neural Network can be trained to estimate the class of objects that did not belong to the training set.
• The performance of a neural classifier is measured by the proportion of objects for which the class assignment estimate is correct (the recognition rate).

The feature vector describing the objects is frequently referred to as a pattern, and the entire task as pattern recognition.
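The recognition rate of the last item admits a one-line computation. In this sketch of ours, classifier stands for any trained network that returns a class estimate for a pattern; the names are illustrative.

    def recognition_rate(classifier, patterns, classes):
        # Proportion of objects whose estimated class matches the known one.
        correct = sum(1 for x, c in zip(patterns, classes) if classifier(x) == c)
        return correct / len(classes)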
However, for nonlinear data compression, multilayer perceptrons in an autoassociative mode can be used: The desired output of the perceptron is set equal to the input. If an autoassociative perceptron with a hidden layer that is narrower than the input is used (and output) layer is trained to produce outputs that are very close to the inputs, the hidden layer constitutes a nonlinear compression of input patterns. The inverse mapping of compressed patterns to the original ones is given by the output layer. 2.3.3 Optimization The class of optimization tasks that can be solved by Neural Networks are of the type that can be solved by relaxation of annealing and can be characterized by the following properties: e A neighborhood system must be defined on the variables of the optimization task. e The objective function to be maximized has to be additive in terms of cliques, groups of variables within which each variable is a neighbor of every other variable. As in the case of functional approximation, optimization consists of a number of subclasses which can also be viewed as task classes. One of the subclasses that is frequently tackled by Neural Networks approaches is associative memory. 17 2.4 Examples of Models of Feedforward Neural Networks Given that the proposed research in this document will be restricted to FFANN in a supervised learning framework, only the models that belong to the same setting will be reviewed in this section. The discussion of additional models can be found in [1]. 2.4.1 Perceptron and adaline The perceptron model proposed by Rosenblatt [10] is one of the first models ever pro- posed and is still important to current research. The classification of visual patterns was one of its primary goals. Widrow [17, 18] also has pursed a similar approach. The perceptron is the simplest version of a (FFANN). It consists of a single layer for input units and a single output unit. The one layer inputs are weighed by vector of connection strengths then fed into a step function. The activation function of the output unit is described as follows: 2 = 6(2 war,- — a) (2.4) 1 for a: > 0 6(ar) = (2.5) 0 otherwise. with 9:.- being the ith element of input patterns, 2 the output, to, the connection weight of the ith input, and a the output unit’s threshold. The output unit activation can assume two values, zero and 1. Each value corresponds to one of two pattern classes that have to be separated. The learning rule of the perceptron is supervised . It consists of a very simple strategy of changing the weights only if the pattern x is misclassified, that is, if the activation of the output unit for this pattern is not equal to the correct class of the pattern. The amount of change is 18 x,- if class =1 Aw,- = (2.6) —a:.- if class = 0. This simple learning rule has an important property. If both classes are linearly separable, the weights will converge to the values that materialize the separation. 2.4.2 Multilayer perceptron The fact that single-layer perceptrons can only separate linearly separable classes was pointed out by Minsky and Papert [110] at the end of the 1960’s, and this led to a substantial decrease in interest in Neural Networks research. Rumelhart along with Hinton and Williams [13] formulated the backpropagation learning rule. The backpropagation model is based on two principles: e 1. To overcome the limitations of the single-layer perceptron, all that is neces- sary is to insert one or more additional layers between input and output. 
These layers consist of processing units with nonlinear activation functions, typically sigmoid functions (3). Arbitrary convex classes can be separated by a network with one such hidden layer (a two-layer network), and arbitrary non convex classes by a network with two hidden layers (a three-layer network); e 2. The delta rule (8) can be used to learn output layer weights. Remaining weights can be learned by recursive application of the chain rule for computing derivatives. For a two-layer network with cc.- representing input unit activations, yj hidden-unit activations, 2;, output unit activations, ij output layer weights, and vi.- hidden-layer weights, the rule is the following: —0E :23 2.£?_2I_E:_3yi__‘ L ”1““ avji k Jazk8y16 ‘vfix, 8ng = Z]. 2,-(21. - down-yin — yj):z:.- (2.7) 19 The related recursive formulas are given in [13] A large number of applications are based on the model constituted by the simple principles described above. 2.5 The Optimal Filtering Technique The computation of the weights of a FFANN in order to achieve an input/ output mapping is a problem that could be classified as a parameter estimation problem or a high dimensional nonlinear system identification problem. Nonlinear optimization techniques are well suited to solve the problem. The only disadvantage of these techniques is their computational cost especially if the problem is attacked globally instead of being divided into a set of manageable subproblems. The problem of estimating the weights in a (FFANN) is nonlinear due to the sigmoidal activation function, but it is smooth. Therefore methods of linear estimation theory could be applied to such a nonlinear problem by linear approximation of the effects of small perturbations in the state of the nonlinear system from a “nominal” value. Since the state variables are not known beforehand the nominal trajectory has to be defined “on the fly”[19] as the current best estimate of the actual trajectory. The approach used here is called extended kalman filtering. The advantage in this approach is that the perturbations include only the state estimation errors which are usually smaller than perturbations from any predefined nominal trajectory and therefore better conditioned for linear approximation. 2.6 Thesis Problem Description During the investigation of the work done so far in using the kalman filter algorithm in the supervised training of the (FFANN), the major problem that attracted our 20 attention is that several trials and simulations had to be done before the performance was satisfactory. The need for several trials, as we understand from the work previ- ously reported, is that for certain initial conditions the algorithm was trapped in local minima that were not acceptable in terms of performance. The only way to escape the problem that has been adopted so far is to run the algorithm with different initial conditions. No major work has been done to make the algorithm escape the local minima once it is trapped. Even though the algorithm has the advantage of being stable under certain conditions, it does not always converge to a satisfactory solution. Our main goal in this thesis is to develop solutions for the problem outlined above. 2.7 The Objectives and the Outline of the Thesis Having highlighted the problems that are of interest to us in the previous section, we now, address these problems by stating the goals that we plan to achieve. A research plan is then developed that includes the tasks to be accomplished. 
As mentioned earlier in this document, guesswork due to tunable parameters, etc. can hurt the efficiency of training a Neural Network to perform a desired task. This is one of the factors that has motivated researchers to focus their efforts on seeking algorithms that are more suitable, easily comprehended and also theoretically supported. It is believed that better understanding of an algorithm and the main theory behind it, is the basis for efficient training of (FFANN). The goal of two planned research tasks in this work is to improve the efficiency of a. newly introduced feedforward training algorithm without a degradation of its performance. 21 2.7.1 Outline of the thesis Since the research goals have already been determined, the research plan will be organized in a manner as to achieve these goals and arrive at a stage of subsequent developmental research. The tasks to be accomplished in the proposed work include the following main ones: Task No 1: Characterize the essential features of the learning algorithms based on Kalman filters. Objective: Through an investigation of the theory behind the algorithm and com- puter simulations to observe the behavior, point out the problems that this algorithm faces in terms of how the convergence is affected with different initial conditions. It is also beneficial to be aware of the assumptions made, if any, while designing this algorithm or any post-improvement of it. Significance: The simulation of algorithms that are theoretically comprehended and convergent will be meaningful in two ways. On one hand, it will help identify any problems with the algorithm that were or were not theoretically anticipated. On the other hand, it will provide insight or how to modify the algorithm to work better in terms of computation and convergence time efficiency. Approach: We start with a careful review of the work that was done on the sub- ject until present. This is accomplished by gathering most of the literature published. The theoretical understanding of the work that was previously pub- lished is the only way to discover any discrepancies or neglected issues in the problem. The next step is to check the accuracy of the results by a careful sim- ulation. Also, a verification of any claims that are not theoretically supported will be conducted during the computer simulation. This way we can discover 22 all the weaknesses of the method or the algorithm, and by further investigation, a modification will be developed for future improvements. Task No 2: Develop a modification of the learning algorithm aimed at its improve- ment. Objective: To overcome the identified problems of the previously proposed algo- rithm(see appendix); the major ones being convergence to none satisfactory solutions and the high load of computation. Significance: Once the modification is successful, this algorithm will be very com- petitive with the backpropagation algorithm, especially in terms of the compu- tational efficiency. Approach: Having identified the problems with the previous approach to the prob- lem at hand, a modification is suggested based on the following two criteria: e The modification should avoid any approximations or assumptions that were not theoretically justified. e The modification is aimed toward enabling the algorithm to converge to satisfactory solutions regardless of the choice of initial conditions and in an adequate amount of time. 
After the development of the modification that is based on theoretical under- standing and aims toward an improvement of the algorithm , simulations on different examples are conducted. The results are then compared with the pre- viously reported ones. The simulation results are expected to support what was anticipated from the modification theoretically. CHAPTER 3 The Extended Kalman Filter Algorithm 3.1 Motivation The classical (discrete-time) backpropagation algorithm, although widely used, has numerous disadvantages. One of them is an extreme sensitivity to the initial choice of the set of weights: There are many choices of weights for which the algorithm will not converge and finally when it does converge for some set of weights, the error diminishes only after a great number of iterations. There is also another problem with the backpropagation algorithm exhibited as an inconsistency in training. This means, the mean squared error could remain the unchanged for many iterations then suddenly decrease to a lower value. The Extended Kalman Filter algorithm, presented here, does not suffer from such problems, it is not extremely sensitive to the initial choice of weights and it converges comparatively faster. 23 24 3.2 The Learning Dynamics The task of finding weights, which would achieve a specific input / output mapping, is viewed as a parameter estimation problem. The network we use is a fully connected Feedforward Artificial Neural Network or (FFANN) which is multi-dimensional and nonlinear. The parameters to be estimated are the weights [20, 21, 8, 22, 23, 24]. An arbitrary FFAN N which depends on the application is considered. It consists of a large T dimensional weight vector 0 made up of all the synaptic connections with sigmoidal nonlinearities. The objective in training the (FFANN) is to determine the weight values producing the L dimensional output vectors g(n) which are the closest possible to the desired output values d(n). The input sequence i(n) is made up of N dimensional vectors. Here, we consider searching for the weights on-line with patterns presented in sequence at the input of the network and with updates made recursively. The synaptic weight vector 0 is viewed as the state of a static nonlinear dynamical system described by the equations an =aj—n =a d(i) = h(90,i(j))+e(j), where j is the time index, h(t9o, i(j) is the time varying function describing the net- work and e( j ) is the multidimensinal sequence of modeling errors[21]. To obtain the dynamic estimates d(n) of do, a traditional approach is adopted based on a deter- ministic formulation, in which (1( j), d( J)) is a sequence of non-random input / output pairs and the cost to be minimized at time n is ewon=$§nan—hvommnntvv. e1) i=1 25 Here e( j ) is the deterministic sequence of the modelling errors, and An'j (0 < A < 1) is a forgetting function which exponentially discounts the examples presented in the past. This is the so—called weighted recursive least squares approach(WRLS). In the case of h being linear, the solution could be obtained using the classical RLS algorithm [8]- 3.2.1 The global extended Kalman filter algorithm The extended Kalman filter(EKF) equations are derived by expanding the nonlinear function h(0,i(n)) around the current estimate parameter vector d(n - 1)(estimated from all the data up to time (n - 1)). 
Namely, the state model is rewritten as d(n) = h(é(n—1),i(n))+HT(n)(vo—o‘(n—1))+p(n)+e(n), (3.2) where H(n) is the T matrix given by 8h(0, i(n)) H(n) = —5,—— 1.3-(M, (3.3) and p(n) is the residual in the Taylor expansion of h. The state model becomes H(n) = 0(n — 1) 2 do d(n) = HT(n)0o +£(n) + e(n), (3-4) with {(n) = h(0(n —1)), i(n)) - HT(n)é(n — 1) + p(n). 26 Both, in the deterministic and in the stochastic formulations the estimate d(n) is obtained as the optimal regression of do in (3.2). We concentrate on the deterministic formulation, for which the estimates are obtained from the minimization of 4n) = >_: ll e(j) ”2 W. (3.5) The solution is derived from the normal equation V6060?) = 2Z?=1H(j)(d(j) — HT(j)00 — {OD/V” = 0. (3-6) which gives i(n) = o-1(n)r(n), (3.7) with (n) = Z) H(j)HT(j)A""' , (3-8) r‘1(n) is a T matrix. To initialize the algorithm, the covariance matrix is chosen as P(O) = 6'11, with 6 27 greater than zero, a small arbitrary value and weights 8(0) to some nonzero random values. 3.2.2 Computation of the gradient matrix In this section we take a closer look to the gradient matrix H T(n) Given the archi- tecture of the (FFAN N), the elements of the weight vector 0 are arranged so that the rows of H (n) contain the derivatives of the global outputs with respect to all the T synaptic weights for 0 = d(n — 1). The neurons are numbered from 1 to 7),, so that 9 = (w1, w2, ..., wm). The equation describing the desired output can then be rewritten (d1) ((58%? (53%)T (fill—Fl (ml) (61) 3 T 32 T 3 T 102 62 d2 = 15%) 155’?) . Iii) >< . + . (n) (01.) MW (av <§a>T1 (...,) (.L) Where w; is the vector containing the synaptic weights of neuron i, and e(n) = C(n)+e(n). The computation of the derivatives in the matrix H(n) is achieved via a back propagation of the output values g(n) through the network. In order to illustrate how these derivatives are obtained, we first need to describe the structure of the (FFAN N) used. The description included in this section is iden- tical to the one given in [25] since their description of the structure was clear and complete. Let M be the total number of layers in the (F FAN N) with the input and output layers included. The i — th neuron in the s — th layer is denoted by neuron (s,i), and 1), denotes the total number of neurons in the s — th layer. 2:, is the input of the neuron (1,i), x: is the output of the neuron (s,i), wfik is the linkweight coefficient 28 from the neuron (s, k) to neuron (3 +1, i), and 213;? is the threshold of the neuron (s, i). Usually, the thresholds are treated as weights that connect an input unit, which 03 always on, i.e. its value is always one, to the neuron, then 1 1 u}: if 2 < s < M 97:44.1 = 1, 2333M and wild“ = 0 ifs = M Further, let T 3 ,3 8-1 s—1 21 $1 wi,1 wl ) 23: 11": awf-l— aws—l: T 8 8 3‘1 8-1 Zn. $773+1 wink—1+1 wn. ) where wf’l is a (773.1 + 1) x 1 weight vector of the neur0n(s,i), u)”-1 is a 17, x (1),-1 + 1) weight matrix of the s — th layer, and x" is a n, x 1 output vector of the s — th layer. Then, the operation equation of the network can be expressed in the following vector form a: ifs=1 2’ ifszlorM 2‘: anda"= w"_1:r"1 if 2 $33M h(w3‘1x“1) if 2533M -l where the function h(.) may be chosen as the hyperbolic tangent function h(x) = tanh(.r) and h((wi“)Tx“‘) h(ws—lxs—l) = The weight training algorithm requires the computation of the partial deriva- tives of the output mM of the (FFANN) with respect to the weight vectors. These derivatives may be derived from the operation equation stated above. 
A computation procedure for partial derivatives from layer (M — 1) to layer (M — 2),- - -, until the layer 2 of the (FFANN) is considered. These derivatives are given as follows 29 ( o i 0 M “a?” = (xM-IT awfi "1 0 1 0 l where £337“: are (nM_1+1)ng matrices. For 2 S i 3 M—1, [3 = 1,2, - - - JIM-:41; B one can obtain an!" _ BxM axM‘l 0xM"+2 BxM‘H’l __ ‘4 BrM’j 62:114-,“ awgl—i _ 81,114-] axM-2 --- axM—H-l 61034-1 _ ( OOTM-j-l) 81024—3. J: Also for 2 S i S M — 1, fl = 1, 2, - - - ,17M_,-+1, the following relations are obtained from the operation equation axM—j hl((w]W-J'-1)TxM—j—1)w[W—j-l WIT—1 2 if #0 wag-3:1rad-1110,9123 axM—i-i-l M—i _ aUJfi where h’(:r) = 1/cosh2(:r). 30 31 3.2.3 The local appraches and the (NDEKF) Due to the high computational load of the EKF algorithm, researches thought to simplify the global approach by partitioning the global problem into a set of man- ageable subproblems. In the case of (FFANN), the partition of the problem may be down to a group of neurons, to the layer level, to the level of the single neuron, or, ultimately down to the level of a single synaptic weight. In this work, we are interested in partitioning the problem down to the neuron level and solve the global problem(minimizing the cost) by independently updating the weights each neuron at each step. The effect of the neuron i in the global output of the network is locally described by the gradient matrix H,T(n) [26]which is the i — th column of the HT(n). ((3% (air) H-T(n)= (321T = (43? ((23:11? ($3.1? ) Since the gradient matrix is now decomposed into a set of smaller gradient matrices, we can also think abot decomposing the weight vector into few subvectors each one associated with the corresponding gradient matrix. The decomposed parameter estimation problem is described by the following equa- tion: g(k) = f(waxlkll : f(wi, " 1wylna' ' ' iwiw-li' ° ' wM-l $06)) 7 "M 3 where y(k) is a 17M x 1 vector of output of the network at time k, 103, 1 S )6 S 1104.1, 1 S a S M — 1 are 170, x 1 weight vectors to be estimated, :r(k) is a 171 x 1 vector of the input of the network at time k, f () is a 17M X 1 smooth vector function. We now introduce the new parameter vectors 0;, 1 S i S n, n = X1127” that 32 represent the weight vectors wg, let 0,- : 203, then the scripts i and 01, 3 are related by 015:1, fl(i) = i, ifi S 712 and a(i)=1+max{j:(i— {_2 m) > 0}, fl(i) = i—Ef’le m, if i > 1);. Consequently the 0.- are (71a(i)+1) x 1) vectors, and the nonlinear euation is rewritten as y(k) = f(01,- - - ,0mr(k)). Given that p(k) is the input pattern, 314(k) is a 11M x 1 vector of desired output pattern of the (FFANN), and 0,- at time k. The weight training oof the (FFAN N ) is aimed toward estimating the weight vectors 0,- such that the output y(k) of the FFANN tracks the desired output yd(k) with an error that the algorithm can make converge to zero as k —+ inf. In order to decouple the global problem acuretly, we can only use the (M544 x Mg“) diagonal blocks of the error covariance matrix. Neglecting the error covariance matrices p;,-(k) of the estimations 9,-(k) in the recursive estimation procedure is based on the assumption that the error covariance matrix P(n) normally contains most of its energy around its diagonal blocks. [27]. 
This neuron-level decoupling also turns the standard extended Kalman filter (EKF) formulation into the approximate neuron-decoupled extended Kalman filter, or NDEKF, formulation [25], as follows:

A(n) = \Big[ I + \lambda^{-1} \sum_{i=1}^{n} H_i^T(n)\, p_i(n-1)\, H_i(n) \Big]^{-1},   (3.13)

G_i(n) = \lambda^{-1}\, p_i(n-1)\, H_i(n)\, A(n),   (3.14)

p_i(n) = \lambda^{-1} \big( I - G_i(n) H_i^T(n) \big)\, p_i(n-1),   (3.15)

\theta_i(n) = \theta_i(n-1) + G_i(n)\, e(n),   (3.16)

H_i(n) = \frac{\partial f\big( \theta_1(n-1), \ldots, \theta_n(n-1), x(n) \big)}{\partial \theta_i},   (3.17)

e(n) = y_d(n) - f\big( \theta_1(n-1), \ldots, \theta_n(n-1), x(n) \big),   (3.18)

where A(k) is an n_M \times n_M matrix, the G_i(k) are the (n_{\alpha(i)}+1) \times n_M filtering-gain matrices, and the p_i(k) = p_i^T(k) are the (n_{\alpha(i)}+1) \times (n_{\alpha(i)}+1) blocks of the error covariance matrix of the estimate \theta_i(k). The procedure of recursive training described in the above equations can be explained as follows. The NDEKF uses the old and the new outputs to recursively train the weights. In every iteration, the NDEKF predicts the network output y(k) based on the previously estimated weight vectors and the new input. By comparing the predicted output y(k) with the new desired output y_d(k), the prediction error e(k) is obtained. Based on this prediction error and on the information about the entire history of the input and desired-output pairs, which is stored in the covariance matrices p_i(k-1), a set of modification coefficients, the gain matrices G_i(k) corresponding to the weight vectors \theta_i of neuron (\alpha(i)+1, \beta(i)), can be calculated. The new set of trained weight vectors \theta_i(k) is then determined as the sum of the last trained weight vectors \theta_i(k-1) and the innovation term, which is the product of the gain matrix G_i(k) and the prediction error e(k). Finally, the information from the new input and desired output is stored in the error covariance matrices p_i(k), to be used in the next iteration.

3.3 Simulation Examples

Simulations were conducted to compare the performance of the NDEKF to the proposed modifications. In this chapter, we report the performance of the original NDEKF on two continuous 2-dimensional patterns. The XOR problem is described as a 2-region classification problem and is performed on one of the 2-dimensional patterns. A 4-region classification problem is performed on the second 2-dimensional pattern. The network architecture consists of two inputs followed by two hidden layers. The size of the output layer is 2 or 4, depending on the problem. The size of each hidden layer is 10. The network size is thus (2-10-10-2) for the 2-region classification problem and (2-10-10-4) for the 4-region classification problem. The corresponding weight vector, including the bias weights, has 162 (respectively 184) elements. The input patterns were uniformly distributed random points in the region [0, 1]^2. The desired output values for the network are set to 1 and 0 inside and outside the regions depicted in the figures below: if a point falls into the region of a certain color, shade, or class, it is assigned the target output vector for that class. The target output vectors are the two columns of the 2 x 2 identity matrix for the 2-region classification problem and the four columns of the 4 x 4 identity matrix for the 4-region classification problem. The inputs were not presented in sweeps but were obtained continuously from a random number generator. The initial weights were randomly selected, uniformly distributed between -0.5 and 0.5. The p_i matrices were initialized to 1000 times the identity matrix.
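With this setup, one NDEKF iteration implementing (3.13) through (3.16) can be sketched in Python as follows. This is a minimal sketch of ours, not the thesis code: the names are illustrative, and the gradient blocks H_i(n) and the error e(n) are assumed to be computed beforehand by the backpropagation of Section 3.2.2.

```python
import numpy as np

def ndekf_step(thetas, Ps, Hs, err, lam=1.0):
    """One NDEKF iteration, eqs. (3.13)-(3.16); a minimal sketch.

    thetas : per-neuron weight vectors theta_i, each of shape (m_i,)
    Ps     : per-neuron covariance blocks p_i, each (m_i, m_i)
    Hs     : per-neuron gradient blocks H_i, each (m_i, n_M)
    err    : prediction error e(n) = y_d(n) - y(n), shape (n_M,)
    """
    n_out = err.shape[0]
    S = np.eye(n_out)                       # builds A^{-1}(n) of (3.13)
    for P, H in zip(Ps, Hs):
        S += (H.T @ P @ H) / lam
    A = np.linalg.inv(S)                    # shared by all neurons
    for i, (P, H) in enumerate(zip(Ps, Hs)):
        G = (P @ H @ A) / lam               # filtering gain, eq. (3.14)
        thetas[i] = thetas[i] + G @ err     # innovation update, eq. (3.16)
        Ps[i] = (P - G @ H.T @ P) / lam     # covariance update, eq. (3.15)
    return thetas, Ps
```

Note that only the small n_M x n_M matrix A(n) is inverted; all per-neuron updates then reuse it, which is the computational point of the decoupling.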
The adaptation of the weights is done as follows:

1. Pick a random point and propagate it forward to the output layer.
2. Backpropagate the error values through the whole network (the error is computed as the difference between the target output value and the actual computed output).
3. Simultaneously update all the network weights and matrices.

The mean squared error is averaged every 250 iterations (we call this average an interval average error). Simulations for the original (NDEKF) were performed on six different initial conditions. The training is completed when a stoppage criterion is achieved. The stoppage criterion is either a threshold on the interval average error or a maximum number of iterations set by the user. In our simulations we chose the threshold errors 0.05 and 0.3 for the 2-region and the 4-region problems, respectively. The maximum number of iterations was set to 50,000 for the 4-region classification problem and 25,000 for the 2-region classification problem. The performance results are described in the following tables and figures. These include results for both the XOR 2-region classification problem and the 4-region classification problem. We define the Total Average Error (TAE) as

\epsilon(n) = \Big( \sum_{j=1}^{n} \| e(j) \|^2 \Big) / n,   (3.19)

and the Interval Average Error (IAE) as

\epsilon_I(k) = \Big( \sum_{j=k}^{k+L} \| e(j) \|^2 \Big) / L,   (3.20)

where k = j/L when j is a multiple of L.

Figure 3.1. The 2-region classification pattern
Figure 3.2. The 4-region classification pattern

Table 3.1. The 2-region classification problem

                          IC1      IC2      IC3      IC4      IC5      IC6
  λ = 1     TAE           0.1095   0.1255   0.1102   0.1326   0.1136   0.1348
            IAE           0.0823   0.1069   0.1022   0.1061   0.1006   0.1176
            class %       97.2     96.2     97.0     96.2     97.0     95.8
            #iterations   25000    25000    25000    25000    25000    25000
  λ = 0.99  TAE           0.7845   1605     0.6530   0.9377   0.7760   0.7759
            IAE           0.9467   17.6     0.5663   1.0554   0.5952   0.9878
            class %       37.3     25.4     80.5     20.2     54       36.9
            #iterations   25000    25000    25000    25000    25000    25000
  λ = 0.98  TAE           NaN      NaN      NaN      NaN      NaN      NaN
            IAE           NaN      NaN      NaN      NaN      NaN      NaN
            class %       0        0        0        0        0        0
            #iterations   5000     5500     3250     3000     3750     2500
  λ = 1.01  TAE           0.2384   0.2597   0.3620   0.3551   0.2706   0.3480
            IAE           0.2452   0.2721   0.3595   0.3614   0.2812   0.3458
            class %       85.7     86.1     77.4     80.0     85.1     79.9
            #iterations   25000    25000    25000    25000    25000    25000
  λ = 1.02  TAE           0.3801   0.5089   0.4659   0.4324   0.5248   0.3724
            IAE           0.3824   0.5565   0.4950   0.4150   0.5255   0.3950
            class %       76.6     69.2     74.1     73.8     65.8     78.1
            #iterations   25000    25000    25000    25000    25000    25000

Table 3.2. The 4-region classification problem

                          IC1      IC2        IC3       IC4      IC5      IC6
  λ = 1     TAE           0.5757   0.5885     0.5718    0.5357   0.5596   0.5624
            IAE           0.5736   0.5416     0.5295    0.4916   0.5145   0.5210
            class %       76.9     75.1       77.7      77.1     78.2     78.4
            #iterations   50000    50000      50000     50000    50000    50000
  λ = 0.99  TAE           0.7604   4.7649E6   0.6638E3  NaN      NaN      NaN
            IAE           0.7936   0.0179E6   1.5E3     NaN      NaN      NaN
            class %       44.7     22.5       23.3      0        0        0
            #iterations   50000    50000      50000     46750
  λ = 0.98  TAE           NaN      NaN        NaN       NaN      NaN      NaN
            IAE           NaN      NaN        NaN       NaN      NaN      NaN
            class %       0        0          0         0        0        0
            #iterations   36260    36500      22750     11000    17500    24000
  λ = 1.01  TAE           0.7167   0.7223     0.7002    0.6387   0.7160   0.6636
            IAE           0.7199   0.7030     0.6786    0.6165   0.6958   0.6416
            class %       51.4     51.12      52.5      66.8     51.0     54.0
            #iterations   50000    50000      50000     50000    50000    50000
  λ = 1.02  TAE           0.7375   0.7206     0.7331    0.7046   0.7493   0.6933
            IAE           0.7392   0.6993     0.7331    0.7046   0.7337   0.6687
            class %       49.5     54.2       51.3      54.5     50.4     54.8
            #iterations   50000    50000      50000     50000    50000    50000
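For reference, the interval averaging and the stoppage rule that produced these tables can be written as a schematic loop. The sketch below is ours; `step_fn` stands for one full NDEKF iteration (forward pass, backpropagation, and an update such as `ndekf_step` above) that returns the error vector e(n).

```python
import numpy as np

def train(step_fn, err_threshold, max_iter, L=250):
    """Schematic training loop with the interval-average stoppage rule."""
    window, iae = [], float("inf")
    for n in range(1, max_iter + 1):
        x = np.random.rand(2)           # random point in [0, 1]^2
        e = step_fn(x)
        window.append(float(e @ e))     # ||e(n)||^2
        if n % L == 0:
            iae = sum(window) / L       # one IAE value, eq. (3.20)
            window = []
            if iae < err_threshold:     # stoppage criterion met
                break
    return n, iae
```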
Figure 3.3. The TAE with λ^{-1} = 1 (2-region original problem, IC2)
Figure 3.4. The TAE with λ^{-1} = 1.01 (2-region problem, IC2)
Figure 3.5. The TAE with λ^{-1} = 0.99 (2-region problem, IC2)
Figure 3.6. The IAE with λ^{-1} = 1 (2-region original problem, IC2)
Figure 3.7. The IAE with λ^{-1} = 1.01 (2-region problem, IC2)
Figure 3.8. The IAE with λ^{-1} = 0.99 (2-region problem, IC2)
Figure 3.9. The TAE with λ^{-1} = 1 (2-region original problem, IC6)
Figure 3.10. The TAE with λ^{-1} = 1.01 (2-region problem, IC6)
Figure 3.11. The TAE with λ^{-1} = 0.99 (2-region problem, IC6)
Figure 3.12. The IAE with λ^{-1} = 1 (2-region original problem, IC6)
Figure 3.13. The IAE with λ^{-1} = 1.01 (2-region problem, IC6)
Figure 3.14. The IAE with λ^{-1} = 0.99 (2-region problem, IC6)
Figure 3.15. The TAE with λ^{-1} = 1 (4-region original problem, IC2)
Figure 3.16. The TAE with λ^{-1} = 1.01 (4-region problem, IC2)
Figure 3.17. The TAE with λ^{-1} = 0.99 (4-region problem, IC2)
Figure 3.18. The IAE with λ^{-1} = 1 (4-region original problem, IC2)
Figure 3.19. The IAE with λ^{-1} = 1.01 (4-region problem, IC2)
Figure 3.20. The IAE with λ^{-1} = 0.99 (4-region problem, IC2)
Figure 3.21. The TAE with λ^{-1} = 1 (4-region original problem, IC6)
Figure 3.22. The TAE with λ^{-1} = 1.01 (4-region problem, IC6)
Figure 3.23. The TAE with λ^{-1} = 0.99 (4-region problem, IC6)
Figure 3.24. The IAE with λ^{-1} = 1 (4-region original problem, IC6)
Figure 3.25. The IAE with λ^{-1} = 1.01 (4-region problem, IC6)
Figure 3.26. The IAE with λ^{-1} = 0.99 (4-region problem, IC6)

3.4 Discussion and Analysis of Results

The results of the simulations summarized in the above tables and figures suggest that the convergence and the stability of the Kalman filter algorithm are not very sensitive to the choice of the initial conditions. However, for different exponential weighting factors λ, the algorithm exhibits different behaviors. When λ = 1, the algorithm shows stable behavior but does not converge to an acceptable local minimum; in fact, the IAE as defined in the previous section stalls for a large number of iterations. If we then choose λ slightly greater than 1, which is not in accordance with the forgetting factor definition, the algorithm is still stable but the same stalling phenomenon occurs. Note that λ greater than 1 guarantees the stability of the discrete recursion equations that include the update of the covariance matrix described in equation (3.12). A λ slightly less than 1, however, is a value that fits the definition of the forgetting factor, and it has previously been used in other applications of the Kalman filter approach, where it has performed satisfactorily. Yet in our application of the Kalman filter algorithm to supervised learning, the behavior under λ slightly less than 1 shows a divergence phenomenon that will be given the name "explosive divergence" in the next chapter.

This behavior can be expected if we carefully observe the recursion equations of the algorithm. Equations (3.10) through (3.12) show that when λ < 1, the factor λ^{-1} > 1 is expected to drive both the covariance matrix elements and the weight parameters to higher and higher values as the number of iterations increases. The reason that this divergence problem occurs in this application (supervised learning) of the algorithm and not in so many others may be due to the characteristics of the nonlinear model. The sigmoidal nonlinearity is the only nonlinearity present in the (FFANN) model, and its shape and the shape of its derivative could be at the origin of the poor performance of the algorithm. In any case, as described earlier in the section, the divergence problems can be predicted from the recursion formulas in equations (3.10) through (3.12). Based on the above conclusions, the following conjecture is formulated:

3.4.1 Conjecture 1

Let \epsilon(\hat{\theta}(n)) be the Least Square Error defined as

\epsilon(\hat{\theta}(n)) = \sum_{j=1}^{n} \big\| d(j) - h(\hat{\theta}(n), i(j)) \big\|^2 \, \lambda^{n-j}.   (3.21)

We conjecture that the following statements are true:

• If λ < 1, then the EKF algorithm exhibits an unstable behavior described by an explosive divergence.
• If λ ≥ 1, then the EKF algorithm is stable but often converges to unsatisfactory local minima.
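A scalar caricature of the mechanism behind both statements may help. The illustration below is ours, not from the simulations, and rests on one stated assumption: in directions that the data does not excite, the covariance recursion (3.12) degenerates to p <- p / λ, so the choice of λ alone decides the trend.

```python
# Scalar caricature of Conjecture 1 (illustrative assumption: the
# covariance update reduces to p <- p / lambda in unexcited directions).
for lam in (1.0, 1.01, 0.99):
    p = 1000.0                  # P(0) = 1000 * I, as in the simulations
    for _ in range(1000):
        p /= lam
    print(lam, p)
# lam = 1.00 -> p stays at 1000 (gains eventually stall: lock-up)
# lam = 1.01 -> p decays toward 0 (stable but conservative)
# lam = 0.99 -> p grows past 2e7 (explosive divergence)
```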
Now that the simulations performed in the previous section have identified some weaknesses of the Extended Kalman filter approach to supervised learning, we propose to search for solutions to the following problems:

• How can one make it possible for the algorithm to converge to satisfactory local minima regardless of the complexity of the problem and its initial conditions?
• How can one improve the computational efficiency of the algorithm in cases where the performance is satisfactory but the convergence time is not adequate?

CHAPTER 4

Convergence to Satisfactory Local Minima

4.1 Motivation

The literature review done on the subject of Kalman filters as an approach to supervised learning suggests that convergence to satisfactory local minima is not guaranteed and is very sensitive to the choice of initial conditions. This conclusion is drawn from the results reported in several papers [26, 25, 21]. The authors of these papers had to conduct several trials with different initial conditions and then average the results; seldom did they report all the trials with their individual results. In a practical situation, however, averaging the results does not give a solution to the problem. The solution is usually chosen after several trials are performed to give the best performance. The need to try different initial conditions presents an inconvenience because of the extra time and resources used to arrive at a satisfactory solution. This drawback of the algorithm has motivated the research in this chapter. The analysis of the problem begins by investigating the reasons behind the divergence of the algorithm and categorizing them. The investigation is expected to outline the steps toward a solution of the problem and is done in the second section of this chapter. The last two sections of the chapter are devoted to the proposed solutions, followed by simulations on two region classification problems.

4.2 Divergence Phenomena

The Kalman filter algorithm, as with the special case of the recursive least squares (RLS) algorithm, may exhibit two forms of divergence:

• explosive divergence
• lock-up divergence

The first form of divergence, described as explosive, is due to the covariance matrix P(n) losing its positive definiteness, which means that it becomes singular. However, this kind of divergence can also occur due to roundoff errors, and it is usually observed when the forgetting factor λ is less than 1. Even when λ is equal to 1, if the algorithm is left updating for a large number of iterations, the roundoff errors can become large and cause the algorithm to diverge.

The second form of divergence, called lock-up divergence, usually occurs when λ is equal to 1. This divergence shows up when the algorithm stalls. The stalling phenomenon occurs when the algorithm stops updating, which means that the filter weights stop changing in value. It was mentioned in [28] that this divergence may be temporary or permanent; in our own experience it was permanent. When the gain matrix that is used to update the weights becomes very small, the filter weights stop changing. The gain matrix is computed from the covariance matrix, and from that computation we conclude that if the covariance matrix P(n) becomes very small, the gain matrix G(n) also becomes small. We can then assert that the stalling phenomenon is due to the covariance matrix becoming small.
If the elements of the covariance matrix are near zero, the lock-up divergence is permanent.

4.3 Proposed Solutions

Having defined the two forms of divergence that are frequently present in the Extended Kalman filter algorithm, we now proceed to summarize previous attempts by several researchers to solve such divergence problems. We will then propose and implement our own solutions to the same problem.

The problem of divergence, especially the lock-up form, described as the algorithm getting trapped in local minima that are often undesirable, is not only a characteristic of the extended Kalman filter algorithm. In fact, the backpropagation algorithm in its original version, although guaranteed to converge, often converges to unsatisfactory local minima. Several approaches have been tried to escape these local minima. One approach was to add a momentum term to the update of the weights. Another approach, which proved to be successful, is the use of different energy functions or cost functions. In the Kalman filter algorithm, one technique has been used to enable escaping local minima: the addition of an artificial process noise in the update equation of the covariance matrix. This technique was mainly used to render the covariance matrix positive definite [27], and it was the only technique we could find in the literature that was claimed to help with the lock-up divergence. The lack of extensive research aimed at a solution to the two forms of divergence described in the previous section is what motivated our research in this chapter. We now begin outlining our own strategy for finding a solution to this problem.

The Kalman filter algorithm as applied to supervised learning has been derived by considering a specific energy function for which the goal is to find a set of optimal parameters that minimize it. The energy function is usually defined as a mean square error between a desired output, or target, and the actual output given by the network. This (LSE) can also be weighted by a suitable function or coefficient [29]. Such a weighting factor has proven to be beneficial in terms of improving convergence. In our work we investigate the influence of the weighting factor λ on the convergence of the algorithm. Even though λ was always present in the recursion formulas of the algorithm in earlier work [26], it was kept fixed at λ equal to 1. Based on the theoretical analysis of the stability and convergence problems the algorithm can exhibit [8], it would be safe to keep λ fixed at 1. However, this is a very conservative choice, and consequently the algorithm can present a stalling phenomenon that is usually observed around an unsatisfactory local minimum. We concentrate on the deterministic formulation, for which the estimates are obtained from the minimization of

\epsilon(n) = \sum_{j=1}^{n} \| e(j) \|^2 \, \lambda^{n-j}.   (4.1)

We propose an adaptation scheme for λ which depends on how fast the error is decreasing and on whether a stalling phenomenon is present or not. We keep λ smaller than or equal to 1, as necessary. Preliminary results on two classical classification problems are very promising: varying λ has enabled the network to automatically escape from a local minimum and decrease the error to a better minimum. A comparison of the algorithm with and without fixing λ, using the same set of initial conditions, will be given.

4.3.1 A Modified update procedure

The original (NDEKF) used a forgetting factor that was fixed at 1.
This value of λ guarantees stability without always guaranteeing good performance. To solve this problem, we propose the following modified updates based on varying λ. After the mean squared error is averaged every 250 iterations (we call this average an interval average error), this error is compared to the previous interval average error. If the difference is smaller than a certain threshold, it is decided that the network is stalling. At that moment λ is changed to a value that is smaller than, but still very close to, 1. If this new value is kept unchanged for a large number of iterations, the network shows unstable behavior; this phenomenon is justifiable if we examine the update formulas. In order to guarantee stability, a mechanism that switches λ back to 1 had to be incorporated into the algorithm. Three different ways were considered:

1. One choice is to change λ for only a few iterations.
2. A second choice is to change λ until the total average error shows some increase over the previous average error.
3. A third choice is to change λ until the instantaneous error increases as compared to the one computed 1000 iterations before.

4.3.2 The forgetting factor as a function of error

Since the mechanism of switching λ back and forth between 1 and a value less than 1 depends on how the so-called IAE behaves (whether it is decreasing or stalling), it is convenient to model λ as a function of this IAE. The inconvenience of the previous choices is that the time to switch λ back to 1 was found by trial and error, which can take a few trials. From the few experimental trials, however, we can conclude that as long as λ is switched for fewer than 50 iterations, stability is preserved. A suitable function modeling the behavior of λ as a function of the error has to satisfy at least these two conditions in order to guarantee stability:

• λ is slightly less than 1 when the difference between two consecutive interval average errors is less than a certain threshold.
• λ is 1 when the difference defined above exceeds the same threshold.

Several functions were tried, but the one that came closest to satisfying the conditions is the following:

\lambda = \begin{cases} \tanh(a\,\delta), & \delta > \gamma \\ 0.98, & \text{otherwise} \end{cases}   (4.2)

where \delta is defined as

\delta = \Big( \sum_{j=k+1}^{k+1+L} \| e(j) \|^2 \Big)/L - \Big( \sum_{j=k}^{k+L} \| e(j) \|^2 \Big)/L.   (4.3)

Here \gamma is a difference-of-error threshold, chosen as 0.0025 in our simulations; it can be changed according to the problem. k = i/L, where i is the number of iterations and is a multiple of L. In our simulations L was chosen as 250 for comparison purposes. The parameter a is chosen according to the desired behavior of λ; in our simulations a = 1000 best fits the desired conditions for the chosen function.

Figure 4.1. The forgetting function λ = tanh(a δ) with a = 1000, where δ is the interval average error difference
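Equations (4.2) and (4.3) translate directly into code. The sketch below is ours; `sq_errors` is assumed to hold the running squared error norms ||e(j)||^2, and `k` marks the start of the current interval.

```python
import numpy as np

def forgetting_factor(sq_errors, k, L=250, gamma=0.0025, a=1000.0):
    """Forgetting factor from eqs. (4.2)-(4.3); a direct sketch."""
    iae_next = sum(sq_errors[k + 1 : k + 1 + L]) / L
    iae_curr = sum(sq_errors[k : k + L]) / L
    delta = iae_next - iae_curr                            # eq. (4.3)
    return np.tanh(a * delta) if delta > gamma else 0.98   # eq. (4.2)
```

With a = 1000 and γ = 0.0025, a decreasing error (δ > γ) gives tanh(a δ) ≈ 1, i.e. the stable choice, while a stalled error drops λ to 0.98, which is the escape mechanism described above.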
Table 4.1. The 2-region classification problem

                          IC1      IC2      IC3      IC4      IC5      IC6
  λ = 1     TAE           0.1095   0.1255   0.1102   0.1326   0.1136   0.1348
            IAE           0.0823   0.1069   0.1022   0.1061   0.1006   0.1176
            class %       97.2     96.2     97.0     96.2     97.0     95.8
            #iterations   25000    25000    25000    25000    25000    25000
  choice1   TAE           0.1550   0.1452   0.1275   0.1240   0.1136   0.1447
            IAE           0.0409   0.0413   0.0468   0.0493   0.1006   0.0500
            class %       96.6     98.2     98       98       97       96.9
            #iterations   6500     9750     10250    19500    25000    19000
  choice2   TAE           0.2052   0.2419   0.1864   0.2157   0.2047   0.2502
            IAE           0.0427   0.1236   0.0329   0.0987   0.0756   0.0839
            class %       97.6     95.6     97.5     96.0     96.4     96.2
            #iterations   4750     3500     4750     6500     4750     3500
  choice3   TAE           0.1165   0.12     0.1128   0.1291   0.1319   0.1399
            IAE           0.0374   0.0687   0.0459   0.0780   0.0693   0.0818
            class %       97.0     98.0     97.2     97.1     96.4     96.4
            #iterations   19750    25000    22000    25000    14500    21000
  choice4   TAE           0.1250   0.2667   0.1179   0.1503   0.1998   0.1602
            IAE           0.0450   0.1388   0.1084   0.0661   0.0660   0.0573
            class %       98.6     95.6     97.5     96.7     96.8     97.2
            #iterations   12000    25000    25000    13750    4750     11750

Table 4.2. The 4-region classification problem

                          IC1      IC2      IC3      IC4      IC5      IC6
  λ = 1     TAE           0.5757   0.5885   0.5718   0.5357   0.5596   0.5624
            IAE           0.5736   0.5416   0.5295   0.4916   0.5145   0.5210
            class %       76.9     75.1     77.7     77.1     78.2     78.4
            #iterations   50000    50000    50000    50000    50000    50000
  choice1   TAE           0.4355   0.4741   0.4356   0.4537   0.4917   0.4983
            IAE           0.2892   0.29     0.2811   0.27     0.2822   0.2851
            class %       93.1     93.6     91.7     95.3     95.5     94.8
            #iterations   22250    50000    23000    41750    29750    30500
  choice2   TAE           0.5229   0.5197   0.3027   0.4869   0.4413   0.4851
            IAE           0.2966   0.2973   0.2673   0.2899   0.2893   0.2976
            class %       91.9     92.6     93.9     93.1     95.6     92.7
            #iterations   16250    20750    26750    25250    26750    24000
  choice3   TAE           0.4370   0.5146   0.4632   0.4555   0.4700   0.5137
            IAE           0.3483   0.4260   0.3648   0.3566   0.3868   0.4282
            class %       93.5     87.1     87.5     90.1     90.0     89.9
            #iterations   50000    50000    50000    50000    50000    50000
  choice4   TAE           0.5288   0.5132   0.4696   0.4862   0.4676   0.4534
            IAE           0.2891   0.3430   0.3859   0.2639   0.2764   0.2794
            class %       93.9     92.1     87.3     92.1     93.06    94.0
            #iterations   23750    22500    50000    26750    50000    37250

Figure 4.2. The TAE for λ = 1 (2-region original problem, IC2)
Figure 4.3. The TAE for update choice1 (2-region problem 60, IC2)
Figure 4.4. The TAE for update choice2 (2-region problem 24, IC2)
Figure 4.5. The TAE for update choice3 (2-region problem 200, IC2)
Figure 4.6. The TAE for update choice4 (2-region problem, variable lambda, IC2)
Figure 4.7. The IAE for λ^{-1} = 1 (2-region original problem, IC2)
Figure 4.8. The IAE for update choice1 (2-region problem 60, IC2)
Figure 4.9. The IAE for update choice2 (2-region problem 24, IC2)
Figure 4.10. The IAE for update choice3 (2-region problem 200, IC2)
Figure 4.11. The IAE for update choice4 (2-region problem, variable lambda, IC2)
Figure 4.12. The TAE for λ^{-1} = 1 (2-region original problem, IC6)
Figure 4.13. The TAE for update choice1 (2-region problem 60, IC6)
Figure 4.14. The TAE for update choice2 (2-region problem 24, IC6)
Figure 4.15. The TAE for update choice3 (2-region problem 200, IC6)
Figure 4.16. The TAE for update choice4 (2-region problem, variable lambda, IC6)
Figure 4.17. The IAE for λ^{-1} = 1 (2-region original problem, IC6)
Figure 4.18. The IAE for update choice1 (2-region problem 60, IC6)
Figure 4.19. The IAE for update choice2 (2-region problem 24, IC6)
Figure 4.20. The IAE for update choice3 (2-region problem 200, IC6)
Figure 4.21. The IAE for update choice4 (2-region problem, variable lambda, IC6)
Figure 4.22. The TAE for λ = 1 (4-region original problem, IC2)
Figure 4.23. The TAE for update choice1 (4-region modified problem 60, IC2)
Figure 4.24. The TAE for update choice2 (4-region modified problem 24, IC2)
Figure 4.25. The TAE for update choice3 (4-region modified problem 200, IC2)
Figure 4.26. The TAE for update choice4 (4-region modified problem, variable lambda, IC2)
Figure 4.27. The IAE for λ^{-1} = 1 (4-region original problem, IC2)
Figure 4.28. The IAE for update choice1 (4-region modified problem 60, IC2)
Figure 4.29. The IAE for update choice2 (4-region modified problem 24, IC2)
Figure 4.30. The IAE for update choice3 (4-region modified problem 200, IC2)
Figure 4.31. The IAE for update choice4 (4-region modified problem, variable lambda, IC2)
Figure 4.33. Classification Performance for λ^{-1} = 1
Figure 4.34. Classification Performance for update choice1
Figure 4.35. Classification Performance for update choice2
Figure 4.36. Classification Performance for update choice3
Figure 4.37. Classification Performance for update choice4
Figure 4.38. The TAE for λ = 1 (4-region original problem, IC6)
Figure 4.39. The TAE for update choice1 (4-region modified problem 60, IC6)
Figure 4.40. The TAE for update choice2 (4-region modified problem 24, IC6)
Figure 4.41. The TAE for update choice3 (4-region modified problem 200, IC6)
Figure 4.42. The TAE for update choice4 (4-region modified problem, variable lambda, IC6)
Figure 4.43. The IAE for λ^{-1} = 1 (4-region original problem, IC6)
Figure 4.44. The IAE for update choice1 (4-region modified problem 60, IC6)
Figure 4.45. The IAE for update choice2 (4-region modified problem 24, IC6)
Figure 4.46. The IAE for update choice3 (4-region modified problem 200, IC6)
Figure 4.47. The IAE for update choice4 (4-region modified problem, variable lambda, IC6)
Figure 4.49. Classification Performance for λ^{-1} = 1
Figure 4.50. Classification Performance for update choice1
Figure 4.51. Classification Performance for update choice2
Figure 4.52. Classification Performance for update choice3
Figure 4.53. Classification Performance for update choice4

4.4 Discussion and Categorization of Results

The results of the simulations summarized in the above tables and figures suggest that using a fixed forgetting factor λ is the cause of both types of divergence. The case of λ = 1 shows a lock-up divergence problem, while the case of λ < 1 exhibits an explosive divergence phenomenon. Let us now analyze the results of the different update choices closely.
In the case of the 2-region classification problem, most of the suggested update choices decreased the number of iterations needed to reach an acceptable local minimum without degrading the performance. In this case, what is gained is the computational efficiency of the algorithm. Choice4 of the updates did not work as well as the others here, but it can be improved if the parameters of the suggested forgetting function are adjusted according to the initial conditions. In most methods suggested for improving any type of algorithm, there are always tunable parameters that need to be dealt with.

The more complex case of the 4-region classification problem is where the proposed update choices prove to considerably improve the algorithm. In fact, most of the update choices escape from unsatisfactory local minima to search for more satisfactory ones. When a stalling phenomenon is observed over an interval of some number of iterations, the algorithm reacts automatically and decides to update λ accordingly. Convergence to satisfactory local minima is then achieved at a faster rate, which improves not only the performance of the algorithm but also its computational efficiency. Choice4 of the updates gives the algorithm more flexibility by allowing the user to decide what parameters of the forgetting function fit best with his or her problem; one is able to make that decision after only a single simulation trial.

In summary, these different choices of updates show satisfactory performance, even though some are better than others. The advantage of having a variety of updates to pick from is an increased chance of success in achieving satisfactory local minima for any given initial condition. Even if choice1 through choice3 do not achieve the goal, choice4 can be made to converge by a simple adjustment of the parameters of the forgetting function (i.e. a, L, and γ). Based on the above conclusions, the following conjecture is formulated:

4.4.1 Conjecture 2

Let \epsilon(\hat{\theta}(n)) be the Least Square Error (LSE) defined as

\epsilon(\hat{\theta}(n)) = \sum_{j=1}^{n} \big\| d(j) - h(\hat{\theta}(n), i(j)) \big\|^2 \, \lambda^{n-j}.   (4.4)

We conjecture that the following statements are true:

• Switching λ to a value less than 1 results in escaping local minima but often induces explosive divergence.
• Switching λ for only a few iterations escapes local minima without inducing explosive divergence. The number of these iterations can be decided from one experimental trial of the specific application at hand.
• When λ is modeled as a function of the error which satisfies the two conditions:
  1. λ is slightly less than 1 when the difference between two consecutive interval average errors is less than a certain threshold;
  2. λ is 1 when the difference defined above exceeds the same threshold;
  an example function being

  \lambda = \begin{cases} \tanh(a\,\delta), & \delta > \gamma \\ 0.98, & \text{otherwise} \end{cases}   (4.5)

  then this function guarantees convergence to satisfactory local minima, provided the parameters (i.e. a, L, and γ) are selected according to the problem at hand after one experimental trial.

CHAPTER 5

A Different Approach to the Computation of the Gain Matrix

5.1 Motivation

Although the Extended Kalman Filter algorithm for supervised learning is a good alternative to the backpropagation algorithm, some of its drawbacks still make it unattractive to some researchers in the neural network community. One major drawback is a high computational load, which we tried to address by decreasing the number of iterations needed for convergence.
Also, if we examine the recursion equations needed to update the weights, we note that the algorithm computes the inverse of a matrix whose dimension depends on the dimension of the output. If the dimension of the output is high, the computation of the inverse becomes cumbersome. In this chapter, we propose a modification that does not require the computation of the inverse. This modification is also aimed toward an analog implementation.

5.2 Recomputing the Gain Matrix

The computation of the gain matrix for the global network is equivalent to solving the following equation:

-G(n)\,A^{-1}(n) + \lambda^{-1} P(n-1)\,H(n) = 0,   (5.1)

where

A^{-1}(n) = I + \lambda^{-1} H^T(n)\, P(n-1)\, H(n).   (5.2)

Instead of conducting a crude computation of the root of this equation, which would require computing the matrix A(n), we transform the equation into a continuous-time ordinary differential equation (o.d.e.). After convergence, the o.d.e. gives a stable solution for the root, which can be used as the new value of the gain matrix of each neuron. Consider the associated differential equation, for each fixed n:

\dot{x}(t) = -x(t)\,A^{-1}(n) + \lambda^{-1} P(n-1)\,H(n).   (5.3)

Since for each fixed n the matrix A^{-1}(n) is expected to be positive definite, and hence invertible, the differential equation is a stable linear system. Thus x(t) will converge to a constant matrix, which we take to be G(n). This way, we avoid the cumbersome computation of the inverse altogether. We observe also that such an o.d.e. is suitable for analog circuit implementation, which can be incorporated as a co-processor to a digital computer. In this chapter, our interest is mainly in the (NDEKF), since it was shown to be more efficient than the global (EKF). However, in the NDEKF we have as many o.d.e. equations to simulate as the total number of neurons in the network, excluding the input layer, and this at every update of the weight parameters, i.e. at every iteration.

5.3 Simulation Results

Simulations were conducted to compare the performance of the neuron-decoupled algorithm to the proposed modification. In this chapter, we report the performance on the 4-region classification problem already used in chapters three and four. The pattern used for this region classification problem was described in the earlier chapters. We initialize all parameters, including the weights, to the same values chosen in the earlier simulations for comparison purposes. The adaptation of the weights is done as follows:

1. Pick a random point and propagate it forward to the output layer.
2. Backpropagate the error values through the whole network (the error is computed as the difference between the target output value and the actual computed output).
3. Now do one simultaneous update for all the network weights and matrices. Our modification of the algorithm occurs at this step of the iteration. We do not do a static computation of the gain matrices; every computation of a gain matrix, which would involve a matrix inverse, is now a simulation of an o.d.e. that converges to the gain matrix in a few steps. The initial values for the o.d.e. were chosen to be the values of the gain matrices from the previous iteration (see the sketch following this list).
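A minimal sketch of the o.d.e.-based gain computation for one neuron follows. This is our illustration: it uses a simple forward-Euler discretization of (5.3), warm-started from the previous gain as described above, and the step-size rule is our own choice, made to keep the discrete iteration stable.

```python
import numpy as np

def gain_by_ode(G_prev, P, H, lam=1.0, steps=50):
    """Gain via the o.d.e. (5.3) instead of a matrix inverse (a sketch).

    Integrates dx/dt = -x A^{-1}(n) + lam^{-1} P(n-1) H(n) by forward
    Euler; the equilibrium of this stable linear system is the gain G(n).
    G_prev : warm start, the gain from the previous iteration (m x n_M)
    P      : covariance P(n-1), (m x m);  H : gradient block, (m x n_M)
    """
    A_inv = np.eye(H.shape[1]) + (H.T @ P @ H) / lam   # eq. (5.2)
    B = (P @ H) / lam
    dt = 1.0 / np.linalg.norm(A_inv, 2)   # step size below 2/lambda_max
    x = G_prev.copy()
    for _ in range(steps):
        x += dt * (B - x @ A_inv)         # moves x toward the root of (5.1)
    return x
```

Warm-starting from the previous gain is what keeps the number of Euler steps per training iteration small, since G(n) typically changes little between consecutive patterns.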
The results of simulations using the same set of initial conditions are given in the following table, and they are very similar to the ones obtained previously with the computation of the inverse. However, these simulations were left training for only 10,000 iterations, and we still achieved the same performance results.

Table 5.1. The 4-region classification problem with a modified computation of the gain matrix

                          IC1      IC2      IC3
  TAE                     0.5872   0.5891   0.5767
  IAE                     0.5463   0.5420   0.5216
  class %                 74.6     75.4     76.1
  #iterations             10000    10000    10000

The reason behind limiting the number of training iterations is the constraint of time: the o.d.e.s themselves take a few steps to converge inside each training iteration. In conclusion, the computation of the gain matrix using this approach is equivalent to the earlier method that includes the computation of the inverse.

CHAPTER 6

Variable Slopes

6.1 Motivation

Given the considerable success of linear estimation methods in solving linear problems, researchers have extended these methods to slightly nonlinear problems and sometimes completely nonlinear problems. The "standard" Kalman filter is linear, so in order to apply the Kalman filtering approach to problems of nonlinear type, the Extended Kalman filter was derived. The essential idea of the Extended Kalman filter was proposed by Stanley F. Schmidt; it was then named after him as the "Kalman-Schmidt" filter. The Kalman filter is extended mainly by expanding the nonlinear function h(θ, i(n)) describing the model around the current estimate θ̂(n-1) estimated from all data up to time (n-1). The nonlinearity present in the network is described by the model function h(θ, i(n)), which is the sigmoidal function chosen as tanh(x) in the (FFANN). Due to the chain rule, the derivative of the sigmoidal function is present in almost every element of the gradient matrix H(n). The slope of the sigmoidal function was always fixed at 1.

Even though this function plays a big role in the Extended Kalman filter equations, the influence of the steepness of its slope on the behavior of the algorithm was never investigated in previous related research. In the backpropagation algorithm, however, the effect of these steepness parameters, especially on convergence, was well investigated [30]. It was argued in their discussion that a high value of s can make the weights update slowly, thus needing a large number of iterations to converge. A too low value of s, according to Branko, can have these two negative effects:

1. The derivative decreases even in unsaturated regions, inducing slow convergence.
2. The output of the linear combiner will always fall in the linear region of the sigmoidal function, thus losing any nonlinear effect of the network and dissolving its multilayer structure.

These observations suggest that the value of the slope s cannot be decided a priori. The solution suggested before [30] is to determine an optimum value of s via adaptive means. In the backpropagation algorithm, the slopes of the activation function were updated according to the delta rule [30]. In the case of the Kalman filter algorithm, we propose to investigate the effect of different slopes of the sigmoidal function on the convergence to satisfactory local minima. We will then develop an update of the slopes similar to the one used in the backpropagation algorithm.
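For reference, the activation with an explicit steepness parameter and its derivative, as plotted in Figures 6.1 through 6.8 below, are the following (a small sketch of ours):

```python
import numpy as np

def h(x, s=1.0):
    """Activation with steepness s, as plotted in Figures 6.1-6.4."""
    return np.tanh(s * x)

def h_prime(x, s=1.0):
    """Its derivative s / cosh^2(s x), Figures 6.5-6.8. The peak height
    at x = 0 equals s, which is how the slope enters every element of
    the gradient matrix H(n) through the chain rule."""
    return s / np.cosh(s * x) ** 2
```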
Figure 6.1. The sigmoid function with slope s = 0.5
Figure 6.2. The sigmoid function with slope s = 1
Figure 6.3. The sigmoid function with slope s = 1.5
Figure 6.4. The sigmoid function with slope s = 2
Figure 6.5. The derivative of the sigmoid function with slope s = 0.5
Figure 6.6. The derivative of the sigmoid function with slope s = 1
Figure 6.7. The derivative of the sigmoid function with slope s = 1.5
Figure 6.8. The derivative of the sigmoid function with slope s = 2

6.2 Preliminary Results

In this section, we illustrate via the 4-region classification problem how using different slopes can affect the convergence of the algorithm to satisfactory local minima. The results for each case are tabulated for three different initial conditions.

Table 6.1. The 4-region classification problem with fixed slopes

                          IC1      IC2      IC3
  s = 1     TAE           0.5757   0.5885   0.5718
            IAE           0.5736   0.5416   0.5295
            class %       76.9     75.1     77.7
            #iterations   50000    50000    50000
  s = 0.5   TAE           0.6097   0.6282   0.5991
            IAE           0.6054   0.5663   0.5490
            class %       73.5     71.3     76.6
            #iterations   50000    50000    50000
  s = 1.5   TAE           0.5174   0.5636   0.3875
            IAE           0.5079   0.5178   0.2961
            class %       83.0     79.1     88.9
            #iterations   50000    50000    50000
  s = 2.0   TAE           0.4306   0.5227   0.3681
            IAE           0.4074   0.4667   0.2946
            class %       86.7     84.0     89.5
            #iterations   50000    50000    50000

Figure 6.9. The TAE for a fixed slope s = 1 (4-region original problem, IC1)
Figure 6.10. The TAE for a fixed slope s = 0.5 (4-region original problem, IC1)
Figure 6.11. The TAE for a fixed slope s = 1.5 (4-region original problem, IC1)
Figure 6.12. The TAE for a fixed slope s = 2 (4-region original problem, IC1)
Figure 6.13. The IAE for a fixed slope s = 1 (4-region original problem, IC1)
Figure 6.14. The IAE for a fixed slope s = 0.5 (4-region original problem, IC1)
Figure 6.15. The IAE for a fixed slope s = 1.5 (4-region original problem, IC1)
Figure 6.16. The IAE for a fixed slope s = 2 (4-region original problem, IC1)
Figure 6.17. The 4-region classification pattern
Figure 6.18. Classification Performance for a fixed slope s = 1
Figure 6.19. Classification Performance for a fixed slope s = 0.5
Figure 6.20. Classification Performance for a fixed slope s = 1.5
Figure 6.21. Classification Performance for a fixed slope s = 2

6.3 The Augmented Recursion Formulas

The (EKF) recursion equations start with a computation of the gain matrix G(n), which is then used to update the weights. The update of the weight parameters is followed by the update of the covariance matrix P(n). To these two update equations we add an update equation for the slopes of the sigmoidal function. These slopes are updated at every node of all hidden layers in the network, and the update is based on the gradient descent rule. The slope of the nonlinearity s_j^l, where l denotes the hidden layer and j one of the nodes in that layer, is determined so as to minimize a certain energy function. We define the energy function as

E(n) = \tfrac{1}{2} \big\| d(n) - h(\theta(n), i(n)) \big\|^2,

where n is the time index, d(n) is the vector of desired outputs at time n, and h(\theta(n), i(n)) is the nonlinear model function, which is a composition of sigmoidal functions, so that

\frac{\partial E}{\partial S} = -\big( d(n) - h(\theta(n), i(n)) \big)^T \frac{\partial h(\theta(n), i(n))}{\partial S}.

In this section, without loss of generality, we develop the update equations of the slopes based on the same network architecture used in the 4-region classification problem simulation example. The network is composed of a total of 4 layers, including two hidden layers. Let S^2 denote the vector of slopes for layer 2 and S^3 the vector of slopes for layer 3. The following equations describe the gradients of the energy function with respect to these vectors of slopes:

\frac{\partial E}{\partial S^2} = (x^4 - x_d)^T \, \frac{\partial x^4}{\partial x^3} \, \frac{\partial x^3}{\partial x^2} \, \frac{\partial x^2}{\partial S^2},
\qquad
\frac{\partial E}{\partial S^3} = (x^4 - x_d)^T \, \frac{\partial x^4}{\partial x^3} \, \frac{\partial x^3}{\partial S^3},

where, with z^s = w^{s-1} x^{s-1} denoting the linear-combiner outputs and the slopes entering the activations as x_j^s = h(S_j^s z_j^s),

\frac{\partial x^3}{\partial x^2} = \mathrm{diag}\big( S_1^3\, h'(S_1^3 z_1^3), \ldots, S_{10}^3\, h'(S_{10}^3 z_{10}^3) \big)\, w^2,

\frac{\partial x^2}{\partial S^2} = \mathrm{diag}\big( h'(S_1^2 z_1^2)\, z_1^2, \ldots, h'(S_{10}^2 z_{10}^2)\, z_{10}^2 \big),

and similarly \partial x^3 / \partial S^3 = \mathrm{diag}\big( h'(S_1^3 z_1^3)\, z_1^3, \ldots, h'(S_{10}^3 z_{10}^3)\, z_{10}^3 \big). The slopes are then adjusted along the negative gradient at every iteration, in parallel with the weight and covariance updates.

APPENDIX A

Derivation of the RLS Algorithm

The RLS algorithm is derived as follows. Given the error function

\epsilon(n) = \sum_{j=1}^{n} \lambda^{n-j} \big\| d(j) - y(j) \big\|^2,   (A.1)

where y(j) = H^T(j)\,\hat{\theta}(j-1), the minimization leads to the recursions

\Phi(n) = \lambda\, \Phi(n-1) + H(n) H^T(n),   (A.2)

r(n) = \lambda\, r(n-1) + H(n)\big( d(n) - \xi(n) \big).   (A.3)

Using the matrix inversion lemma: if

A = B^{-1} + C C^T,   (A.4)

then

A^{-1} = B - B C \big( I + C^T B C \big)^{-1} C^T B,   (A.5)

with B = \lambda^{-1} \Phi^{-1}(n-1) and C = H(n), we get

\Phi^{-1}(n) = \lambda^{-1} \Phi^{-1}(n-1) - \lambda^{-2} \Phi^{-1}(n-1) H(n) \big[ I + \lambda^{-1} H^T(n)\, \Phi^{-1}(n-1)\, H(n) \big]^{-1} H^T(n)\, \Phi^{-1}(n-1).   (A.6)

Defining P(n) = \Phi^{-1}(n) and

G(n) = \lambda^{-1} P(n-1) H(n) \big[ I + \lambda^{-1} H^T(n) P(n-1) H(n) \big]^{-1},   (A.7)

we can rewrite (A.6) as

P(n) = \lambda^{-1} P(n-1) - \lambda^{-1} G(n) H^T(n) P(n-1).   (A.8)

Equation (A.7) can be rewritten as

G(n) + \lambda^{-1} G(n) H^T(n) P(n-1) H(n) = \lambda^{-1} P(n-1) H(n),   (A.9)

or

G(n) = \big[ \lambda^{-1} P(n-1) - \lambda^{-1} G(n) H^T(n) P(n-1) \big] H(n).   (A.10)

Therefore the Kalman gain can be expressed as

G(n) = P(n) H(n).   (A.11)

The parameter estimate for the filter weights \hat{\theta}(n) can then be expressed as

\hat{\theta}(n) = P(n)\, r(n) = P(n)\, \lambda\, r(n-1) + P(n) H(n) \big( d(n) - \xi(n) \big).   (A.12)

Substituting (A.8) in the first term of (A.12), and using \hat{\theta}(n-1) = P(n-1)\, r(n-1), we get

\hat{\theta}(n) = \hat{\theta}(n-1) - G(n) H^T(n)\, \hat{\theta}(n-1) + P(n) H(n) \big( d(n) - \xi(n) \big).

From (A.11) we get

\hat{\theta}(n) = \hat{\theta}(n-1) + G(n) \big( d(n) - \xi(n) - H^T(n)\, \hat{\theta}(n-1) \big).
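As a closing check, the matrix inversion lemma (A.4)-(A.5), on which the whole derivation rests, can be verified numerically. The sketch below is our verification on random matrices, not part of the thesis.

```python
import numpy as np

# Numerical sanity check of the matrix inversion lemma (A.4)-(A.5).
rng = np.random.default_rng(0)
m, r = 5, 2
B = rng.standard_normal((m, m)) * 0.1 + np.eye(m)
B = B @ B.T                                  # symmetric positive definite B
C = rng.standard_normal((m, r))

A = np.linalg.inv(B) + C @ C.T               # (A.4)
lhs = np.linalg.inv(A)
rhs = B - B @ C @ np.linalg.inv(np.eye(r) + C.T @ B @ C) @ C.T @ B  # (A.5)
print(np.allclose(lhs, rhs))                 # True: the two sides agree
```

The computational point is visible in the shapes: the left-hand side inverts an m x m matrix, while the right-hand side only inverts an r x r one, exactly the saving exploited by the recursions (A.6) through (A.8).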