This is to certify that the dissertation entitled

Supervised Learning of Feedforward Artificial Neural Networks Using Different Energy Functions

presented by Maqbool Ahmad has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

Major professor

Date: 11/15/91

SUPERVISED LEARNING OF FEEDFORWARD ARTIFICIAL NEURAL NETWORKS USING DIFFERENT ENERGY FUNCTIONS

By

Maqbool Ahmad

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1991

ABSTRACT

SUPERVISED LEARNING OF FEEDFORWARD ARTIFICIAL NEURAL NETWORKS USING DIFFERENT ENERGY FUNCTIONS

By

Maqbool Ahmad

A framework, supported by analysis and computer simulations, is developed to improve the supervised learning of feedforward artificial neural networks with a view towards their software implementation. The framework encompasses various learning rules derived from carefully selected energy functions in order to speed up learning and enable convergence to acceptable weights. We describe the Cauchy, the Polynomial and the Exponential energy functions and characterize their properties and the performance of their associated learning rules. We use continuous-time gradient update laws as learning rules because of particular advantages over their discrete-time counterparts. We perform comparative investigations among these learning rules vis-a-vis the usual Gaussian (sum-of-the-squared) energy function.

The choice of an energy function specifies the back-propagated error in an update law. It also determines the learning performance in the sense of the speed of convergence. Carefully selected energy functions can in principle enable the neural system to "skip" undesirable minima, thus permitting convergence to acceptable minima.
Consequently, a carefully selected energy function can render a learning process that enables the neural network to execute good classification/association and exhibit improved generalization capability.

The Gaussian energy function has been used frequently. In contrast, we use the Cauchy energy function, derived from the Cauchy distribution, in the gradient descent weight update law. We provide a statistical rationale for the selection of the Cauchy energy. Analytical methods and computer simulations show that it can speed up the learning dynamics and improve performance in a certain sense. We also show that all the stable equilibria of the learning dynamics based on the Cauchy energy are contained in the learning dynamics based on the Gaussian energy. Therefore, the network still converges to one of the equilibrium points of the Gaussian-based system.

We define a Polynomial energy function of order r as a linear combination of the L^p norms (1 <= p <= r). We show analytically and using computer simulations that employing the Polynomial energy function improves the learning performance in a certain sense. The Polynomial energy function of order 2 does not introduce additional equilibria to the Gaussian system. As a consequence, the network converges to one of the minima of the Gaussian system.

In order to achieve good classification and generalization results, the network ought to converge to a "good" or desirable minimum of the neural system. An Exponential energy function is introduced to enable convergence to one of the desirable minima. A bound is established on the parameters of the Exponential energy function to achieve faster convergence to a desirable minimum. We also show that the learning dynamics based on the Exponential energy function possess the same minima as the learning dynamics based on the original Gaussian energy function.

The performance of each of the energy functions addressed in this work is evaluated using the XOR and the English character recognition problems. A highly user-oriented character recognition network is developed for the evaluation, with two separate modules: a feature extraction network to be trained on a large data base and a classification network to be subsequently trained by the user according to his/her requirements. An error correction capability is incorporated to enhance the character recognition performance.

To my mother, who always prayed for my success but could not live to see this day.

ACKNOWLEDGMENTS

Praise be to Allah Subhanahu-wa-Taala, who created this universe and all that is contained in it, mankind being the most superior due to its thinking faculties. The human brain inspires from its own structure numerous ideas and opens up many areas of pursuit and excellence in research to benefit the whole of humanity. The more I pursue this research, the more He reveals His greatness to me. Whatever I am today, I wholly owe it to Him.

I am grateful for the financial support from the Government of Pakistan, the National Science Foundation and the State of Michigan Research Excellence Fund (REF). Without this support, this research effort would not have been possible.

I would like to express my sincere gratitude to my major advisor, Prof. F. M. A. Salam, for his inspiration, guidance and support throughout the years of my graduate studies. Thanks to Dean F. C. Hoppensteadt, Prof. H. K. Khalil and Prof. J. R. Deller, Jr. for their valuable comments and suggestions.
I wish to express my sincere thanks and appreciation to my parents for their patience, continued prayers, support and encouragement. I deeply owe my heartiest feelings of gratitude to my wife, Ghazala, and my children, Muhammad, Hiraa and Saad, for their love, understanding and support.

TABLE OF CONTENTS

List of Tables
List of Figures
List of Symbols

Chapter 1 Introduction
1.1 Neurobiology and artificial neural networks
1.1.1 A biological inspiration
1.1.2 Topologies of artificial neural networks
1.2 Why new energy functions?
1.3 Contribution of the thesis

Chapter 2 Background
2.1 The feedforward neuron model
2.2 The feedforward neural network connection topology
2.3 The back-propagation architecture
2.4 Supervised training of the back-propagation network
2.5 The weight update laws and the energy functions
2.6 The Gaussian energy function
2.6.1 The learning dynamics due to the Gaussian energy function
2.6.2 Software implementation
2.7 The problem description
2.8 The objectives and the outline of the thesis

Chapter 3 The Cauchy energy function
3.1 Motivation
3.2 The learning dynamics
3.3 Analytical comparison with the Gaussian energy function
3.3.1 A criterion for the learning performance
3.3.2 The learning dynamics for a single pattern
3.3.3 The learning dynamics for multiple patterns
3.4 Software implementation of the learning dynamics
3.5 Simulation example: the XOR problem
3.6 Discussion

Chapter 4 The Polynomial energy function
4.1 Motivation
4.2 The learning dynamics
4.3 Analytical comparison with the Gaussian energy function
4.3.1 The learning dynamics for a single pattern
4.3.2 The learning dynamics for multiple patterns
4.4 Software implementation of the learning dynamics
4.5 Simulation example: the XOR problem
4.6 Discussion

Chapter 5 The Exponential energy function
5.1 Motivation
5.2 The learning dynamics
5.3 Analytical comparison with the simple energy function
5.3.1 Comparison of the learning dynamics
5.4 Software implementation of the learning dynamics
5.5 Simulation example: the XOR problem
5.6 Discussion

Chapter 6 The pattern/character recognition problem
6.1 Problem description
6.2 Data base
6.3 The network design
6.3.1 The feature extraction or preprocessing network
6.3.2 The classification network
6.4 Learning evaluation with different energy functions
6.4.1 Training of the feature extraction network
6.4.2 Training of the classification network
6.5 Discussion

Chapter 7 Summary and Conclusion
7.1 Summary
7.2 Conclusion

Appendix A Entropy of the Gaussian and the Cauchy distributions
Appendix B Proof of Lemma 3.1
Appendix C Proof of Theorem 3.1
Appendix D Proof of Lemma 3.2
Appendix E Proof of Theorem 3.2
Appendix F Proof of Proposition 3.1
Appendix G Proof of Theorem 4.1
Appendix H Proof of Theorem 4.2
Appendix I Proof of Proposition 4.1
Appendix J Proof of Proposition 5.1
Appendix K List of the fonts used for the pattern recognition problem

Bibliography

LIST OF TABLES

2.1 The learning algorithm
3.1 Training results comparing the Gaussian and the Cauchy learning
4.1 Training results comparing the Gaussian, the Cauchy and the Polynomial learning
5.1 Training results comparing the Exponential (kappa = 1) of the Gaussian, the Cauchy and the Polynomial learning
5.2 Training results comparing the Exponential (kappa = 1.8) of the Gaussian, the Cauchy and the Polynomial learning
6.1 Error(s) and the number of steps at convergence of the learning algorithms used for the training of the feature extraction network
6.2 Character recognition using the feature extraction network trained with the Gaussian learning
6.3 Character recognition using the feature extraction network trained with the Polynomial learning
6.4 Character recognition using the feature extraction network trained with the Exponential of the Gaussian learning
6.5 Error(s) and the number of computer steps at convergence of the learning dynamics for the training of the classification network
6.6 Character recognition when the feature extraction and the classification networks are trained with set 1
6.7 Character recognition when the feature extraction and the classification networks are trained with set 1 and set 2, respectively

LIST OF FIGURES

1.1 Simplified structure of a real neuron
2.1 Neuron activation functions
2.2 The sigmoidal activation function with threshold and shape modification
2.3 The feedforward neuron model
2.4 A structure of a multilayer FFANN
2.5 The back-propagation architecture of a FFANN
3.1 Random error distribution
3.2 The network used to solve the XOR problem
3.3 Comparison of the Gaussian and the Cauchy learning
4.1 Comparison of the Gaussian, the Cauchy and the Polynomial learning
4.2 Trajectory showing the increase in the L1 norm while the L2 norm is decreasing
4.3 Trajectory showing the increase in the L2 norm while the L1 norm is decreasing
4.4 Trajectory showing both the L1 and the L2 norms decreasing
4.5 Energy level diagram of the Polynomial energy function
4.6 Training results of the XOR problem, trained with the Modified Polynomial energy function
5.1 Graph representing the relationship (5.7)
5.2 Comparison of the Exponential (kappa = 1, 1.8) and the Gaussian learning
5.3 Comparison of the Exponential (kappa = 1, 1.8) and the Cauchy learning
5.4 Comparison of the Exponential (kappa = 1, 1.8) and the Polynomial learning
6.1 Typical examples of the fonts
6.2 Block diagram of the FFANN used for the character recognition problem
6.3 Structure of the FFANN used for feature extraction
6.4 Targets used in the feature extraction network

LIST OF SYMBOLS

u : neuron's input
v : neuron's output
$\theta$ : neuron's threshold
net : neuron's input excluding threshold
f(.) : neuron's input/output function
x : input to a neural network
y : output of a neural network
t : desired output of a neural network
v : neural network's actual mapping
g : neural network's desired mapping
E : energy function
_ : function of an energy function
$\varepsilon$ : output error of a neural network
$\xi$ : error criterion
$\delta$ : error signal to be propagated back
w : weights
$\Delta w$ : weight change
$\omega$ : stopping criterion
p : index for patterns
$\rho$ : dummy index for patterns
IN : index of input layer
$H_j, H_k, H_l, H_m$ : indices of hidden layers
O : index of output layer
i : index of input layer neurons
j, k, l, m : indices of hidden layer neurons
n : index of output layer neurons
$\eta$ : dummy index of output layer neurons
$\gamma$ : learning rate
$\kappa$ : skipping factor
d(.) : density function
L(.) : likelihood function
H(.) : entropy function
$\sigma$ : standard deviation of the Gaussian distribution
$\lambda$ : parameter of the Cauchy distribution
a, b : parameters of the Polynomial energy function
$\kappa, \mu$ : parameters of the Exponential energy function
$\Delta$ : change
$\nabla$ : gradient
$\partial$ : partial derivative
$\alpha, \beta$ : general constants

CHAPTER 1

INTRODUCTION

Von Neumann machines have proven to be very useful in digital computation and have contributed significantly to the development of other spheres of life. However, there are various tasks where conventional digital computers cannot compete with the performance of humans.
Vision, classification, association, speech and pattern recognition are only a few examples. It is also recognized that no matter how fast digital sequential computers become, they are unlikely to outperform humans in these fields. Observations in neurobiology and psychology have inspired many researchers to model and design systems which could emulate these human capabilities. Motivated by the highly interconnected networks of nerve cells referred to as neural networks, which form the basis of the human nervous system and the brain, many researchers have proposed various architectures that model the information processing aspect of the brain and the nervous system. These architectures are often referred to as artificial neural networks.

Artificial neural networks are systems made of a large number of simple processing elements operating in parallel whose function is determined primarily by the topology that connects them. These systems are capable of high-level functions such as adaptation and learning, and/or lower-level functions such as data processing. These networks are inspired both by neurobiology and by various mathematical theories of learning and information processing.

Neurobiological research in recent years has suggested that many types of map-like data representation are used in the brain [1],[2]. Some of these are now being used in neural network models [1],[3],[4]. Other work has shown that complex types of temporal processing and non-linear operations appear to be performed by neurons. Complex neuron models are being used in some neural network models [5]-[10].

Neuron models for the purpose of studying the functioning of the brain are quite involved and complex. These models are used as paradigms to test hypotheses and theories on the functioning of the brain. They fall into the category of reverse engineering and are of primary interest to neurobiologists, psychologists and physiologists. On the other hand, neuron models for use in engineering technology fall under the category of forward engineering. These models are used to implement the principles of operation to produce artificial neural networks capable of emulating some aspects of the human brain, and thus aim at improving the artificial processing of data or information.

1.1 Neurobiology and Artificial Neural Networks

There are two basic types of neural network models: (1) the neurobiological models that are intended as computational models of the biological nervous system, and (2) the biologically inspired models that are intended as computational devices, which are referred to simply as neural network models.

The main objective of artificial neural network research is to design new architectures and algorithms that can solve problems which are difficult to solve by conventional methods but are easy for intelligent biological organisms. For the inspirational development of artificial neural networks, we must understand the essence of the neurobiological architecture of real neurons and their connectivity. A simplified description and a modest analogy of the real neuron are presented in the following subsections.

1.1.1 A Biological Inspiration

The human cerebral cortex is comprised of approximately 100 billion (10^11) neurons, each having roughly 1,000 dendrites that form some 100,000 billion (10^14) synapses. Given that this system operates at about 100 Hz, it functions at some 10,000,000 billion (10^16) interconnections per second.
The cortex weighs approximately three pounds, covers about 0.15 square meters, and is about two millimeters thick. This is clearly beyond anything which can be reconstructed or modeled; but it is, perhaps, possible to understand how the brain performs information processing, and we hope that this understanding can be modeled and ultimately implemented in hardware.

The basic anatomical unit of the nervous system is the nerve cell or neuron [11]-[13]. Every neuron is a tiny information processing system with thousands of connections through which signals are sent and received. No two of these cells are exactly alike, but most share similar features. Each neuron has an inside and an outside separated by a plasma membrane. The inside of the cell and the fluid surrounding the cell have different concentrations of charged ions, which create a potential difference. The major parts of a typical neuron are shown in Figure 1.1.

Figure 1.1: Simplified structure of a biological neuron. (The figure labels the dendrites, the myelin sheath, the nodes of Ranvier, the axon, the nucleus within the cell body, and the synapse.)

The major parts of a biological neuron are:

Cell Body: It is composed of cytoplasm and contains the cell nucleus, nucleolus, mitochondria and ribosomes. The nucleus synthesizes the information which controls the activity of the neuron.

Dendrites: These are the outgrowths of the cell body through which it picks up signals from other cells.

Axon Hillock: It acts as a nonlinear threshold device that produces a rapid voltage increase or decrease, called an action potential.

Axon: It is the longest process of a cell body, and carries messages away from it. It is very resistive and carries impulses to and from the cell body. Axons are enclosed in a thick sheath made of the fatty substance myelin, which is a special low-dielectric-constant material. The myelinated sheath has regular indentations along its length called the nodes of Ranvier, which restore the pulses periodically along the axon.

Synapses: These are contacts on a neuron which are the termination points for axons from other neurons. The synapse forms a gap between the slightly enlarged ending, or synaptic knob, of an axon terminal and the membrane of the next cell. At the synapse, the electrical impulses that travel through the axon convert into chemical signals, then back to electrical impulses. When a nerve impulse reaches the synaptic knob, it causes the release of a chemical called a neurotransmitter from minute vesicles in which the substance is formed or stored. The neurotransmitters diffuse across the gap and may cause depolarization of the receptor cell membrane, thus starting an excitatory signal. If the membrane is charged more negatively than its resting voltage, it is said to be hyperpolarized, in which case the signal is inhibitory.

1.1.2 Topologies of Artificial Neural Networks

Artificial neurons are analogous to their biological inspirers. In electronic circuit form, the cell body is represented by a nonlinear element such as an operational amplifier, the axons and the dendrites by electrical wires, and the synapses by variable resistors. The input voltages to a neuron are weighted by the synapses, which are represented by the variable resistors. The resistors are connected to an operational amplifier on which a threshold has been set.
When the sum of these input voltages reaches a preset threshold, the nonlinear element fires, imitating a biological neuron.

The processing nonlinear elements can interact in many ways, depending on the manner in which they are interconnected. The interconnections may be feedforward or may contain feedback loops. The design of a neural network's feedback loops has implications for the nature of its adaptivity, while the design of the network's interconnections has implications for its parallelism.

The modeling and the connection strategy of these artificial neurons are discussed in some detail in Chapter 2. When many of these processing elements are connected together, they form an artificial neural network which mimics the neurobiological system in a very restricted way.

1.2 Why New Energy Functions?

Artificial neural networks, unlike standard computers, do not execute programs. Their behavior is determined by parameters referred to as synaptic weights. These parameters are computed by a learning procedure, during which a set of examples (input-output pairs) is presented to the network; the weights are updated so that the output corresponding to each input from the training set is as close as possible to the desired value. When the learning phase is completed, the network can be presented with some unknown data and is anticipated to generalize from the data of the training set and produce a correct response.

The learning capability of neural networks plays a key role when we want to use them for certain practical tasks. The automatic learning rules give artificial neural networks unique capabilities. Presently, many neural networks learn by using gradient descent methods associated with an energy surface. The energy is usually defined as an error function in terms of the adjustable weights. The most commonly applied energy function gives rise to the so-called back-propagation update rule. This rule is discussed in detail in Chapter 2.

Pictorially, an energy surface can be thought of as consisting of hills and valleys. The surface represents a function of the output error of the network, which in turn depends on the network weights. The gradient descent algorithm aims at finding the nearby lowest point, called a local minimum. When the discrete-time gradient descent update law is implemented, the steps are discrete and we may observe oscillations and other undesirable behavior, which is more pronounced if the step size is large.

We use the continuous-time gradient descent update law due to specific advantages over its discrete-time counterpart. These laws ensure convergence to the nearest local minimum and do not exhibit oscillations. However, the attained local minimum may not be useful for solving the problem for which the network is trained.

The goal of this research is to make the learning faster and to reach an acceptable minimum quickly. We view the immediate domain of implementation to be software on digital computers. In that context, the learning can be sped up by increasing the learning rate; but that would also increase the possibility of local oscillations and delayed convergence. In this work, energy surfaces are defined so that we do not add more minima to the already existing minima of the conventional energy surface. Moreover, starting from any initial point, the aim is to speed up the convergence to an acceptable minimum.
This is achieved by using different energy functions in the continuous-time update law, in contrast with the usual sum-of-the-squared energy function.

1.3 Contribution of the Thesis

Properly designed and trained neural networks aim at achieving human-like performance in solving association and recognition problems. Neural networks are not so much programmed as they are trained with data; thus many believe that the use of neural networks can relieve today's computer programmer of a significant portion of their programming load [14]-[16]. Moreover, neural networks improve with experience: the more data they are fed, the more accurate or complete their response becomes.

A framework, supported by analysis and computer simulations, is developed to improve the supervised learning of feedforward artificial neural networks. The framework encompasses various learning rules derived from carefully selected energy functions in order to speed up learning and enable convergence to acceptable weights. We use continuous-time gradient update laws as learning rules due to particular advantages over their discrete-time counterparts.

We describe the Cauchy and the Polynomial energy functions and characterize their properties and the performance of their associated learning rules. We show analytically and with computer simulations that using the proposed energy functions in the continuous-time weight update law can speed up convergence to a local minimum.

The local minimum may not be useful for solving the desired problem. The Exponential energy function is proposed to overcome this problem. We show analytically and with computer simulations that the continuous-time weight update law using the Exponential energy function can "skip" such minima and ensures convergence to acceptable minima.

We perform comparative investigations among these learning rules vis-a-vis the update law using the usual Gaussian energy function. We also demonstrate an application to the character recognition problem in which various fonts of printed characters are used.

CHAPTER 2

BACKGROUND

Neural net structures are based on our present understanding of the biological nervous systems. Neural net models are specified by the node characteristics, the net topology, and the training or learning rules. By virtue of the type of input/output connections, the models can be categorized as the feedforward neuron model or the feedback neuron model. The feedforward neuron model is discussed in section 2.1, whereas the feedforward neural network connection topology using these neurons is of primary interest in this work and is described in section 2.2. The resulting network, which is reminiscent of the multilayer perceptron [16],[17], will be referred to as a feedforward artificial neural network (FFANN).

Learning in these networks is inspired by adaptive classification networks, beginning with the perceptron convergence procedure and the least mean square (LMS) algorithm and extending to the more recent back-propagation, feature map and reduced coulomb energy (RCE) algorithms [15]. Back-propagation is currently the most widely applied neural network architecture used for learning. This popularity primarily revolves around the ability of the back-propagation network to learn complicated multidimensional mappings [18].
The back-propagation architecture, discussed in section 2.3, is used for all the learning strategies considered in this work.

Back-propagation neural networks can be trained using supervised, unsupervised or self-supervised training. Supervised training requires labeled training data and a teacher who provides error information through a feedback signal after each trial. Unsupervised training forms internal clusters automatically with unlabeled data. Self-supervised training uses internal monitoring and error correction to improve performance without an external teacher. This work mainly concentrates on supervised training of the FFANN, using (continuous-time) gradient descent methods.

Back-propagation learning makes use of gradient descent update laws. The performance of these update laws mainly depends upon the associated energy function. Section 2.5 describes the weight update laws and the influence of the associated energy functions. The Gaussian energy function is described in section 2.6. Section 2.7 is devoted to the current state of the art in supervised training of FFANNs and highlights the contributions of this work.

2.1 The Feedforward Neuron Model

Modeling of this class of artificial neural networks has its roots in the work of McCulloch and Pitts [19]. The feedforward model of neurons was clearly mentioned in [20] by Widrow and Hoff, where they called it an adaptive neuron. The neurons are treated as threshold logic units whose output $v_k$ is given by

$$v_k = f(u_k) = f\Big(\sum_j w_{kj}\, x_j + \theta_k\Big) \qquad (2.1)$$

where $x_j$ is the external input, $w_{kj}$ the weight or the strength of the connection from the $j$th input to the $k$th neuron, and $\theta_k$ is the threshold or the bias for the neuron. Here $f(u_k)$ is the sgn function, which is $+1$ if $u_k > 0$, $0$ if $u_k = 0$ and $-1$ if $u_k < 0$, and is diagrammatically shown in Figure 2.1a.

It was in 1972 that T. Kohonen [21] and J. A. Anderson [22] considered the fact that neurons are not binary but analog. They assumed that the activation of a neuron was sufficiently linear, of the type shown in Figure 2.1b, so that simple linear algebra would approximate the output of the system. It was realized later on [7],[23]-[26] that the sigmoidal function of the type shown in Figure 2.1c is the most pervasive, because it provides a graded and nonlinear response which is closest to that of a real neuron.

Figure 2.1: Neuron activation functions. (a) Sgn function; (b) nonlinear ramp function; (c) logistic function.

The range of the sigmoidal function is sometimes changed from [-1,1] to [0,1] depending upon the application. The logistic function is a convenient activation relationship that represents a sigmoidal function and is given as

$$f(net_k) = \frac{1}{1 + e^{-(net_k + \theta_k)/\theta_0}}. \qquad (2.2)$$

Here $\theta_k$ controls the firing activity of the neuron for some given inputs. If $\theta_k$ is negative, the sigmoidal function shifts horizontally to the right and firing is delayed until a sufficient input arrives. $\theta_0$ is used to modify the shape of the sigmoid according to the gain requirements of the neuron. This is illustrated in Figure 2.2.

Figure 2.2: The sigmoidal activation function with threshold and shape modification.

$net_k$ represents the summation of the weighted inputs to the neuron and is given by

$$net_k = \sum_j w_{kj}\, x_j. \qquad (2.3)$$

The complete feedforward model is shown in Figure 2.3.

Figure 2.3: The feedforward neuron model.
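As a concrete illustration of (2.1)-(2.3), the following minimal sketch evaluates one feedforward neuron with the logistic activation (2.2). It is written in Python purely for exposition (the software in this work was written in C), and the parameter names are illustrative choices, not part of the dissertation:

```python
import math

def logistic(net, theta_k=0.0, theta_0=1.0):
    """Logistic activation (2.2): theta_k shifts the sigmoid
    (delaying firing when negative); theta_0 scales its shape/gain."""
    return 1.0 / (1.0 + math.exp(-(net + theta_k) / theta_0))

def neuron_output(x, w, theta_k=0.0, theta_0=1.0):
    """Feedforward neuron (2.1): the weighted sum of the inputs (2.3)
    is passed through the sigmoidal activation."""
    net = sum(w_j * x_j for w_j, x_j in zip(w, x))   # net_k of (2.3)
    return logistic(net, theta_k, theta_0)

# Example: a two-input neuron with a negative threshold (delayed firing).
print(neuron_output([1.0, 0.0], [0.5, -0.3], theta_k=-0.2))
```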
2.2 The Feedforward Neural Network Connection Topology

The neuron interconnection topology found in the human brain is very complicated but follows some pattern. Neurons in the retina and cortex are organized into layers with intra- and interlayer connections. Connections within a layer are referred to as short-term memory (STM), whereas connections between layers are referred to as long-term memory (LTM). STM connections are usually considered to be unidirectional, whereas LTM connections may propagate signals in the feedforward and/or feedback direction.

We will use the feedforward connection topology throughout this work. This type of network topology makes use of feedforward neurons in different layers. In general, the network consists of an input layer, $m$ hidden layers and an output layer. These layers are indexed, input through output, by $IN, H_1, \ldots, H_j, H_k, \ldots, H_m, O$ consecutively. The input layer is assumed to have $I$ passive nodes with indices $1,\ldots,i,\ldots,I$. The input passive nodes simply accept the individual components $x_i$ of the input vector $x$ and distribute them, without modification, to all of the neurons of the first hidden layer. The $m$ hidden layers may have a sufficient number of neurons to perform the assigned task satisfactorily [27]-[29]. It was discovered early in neural network research [17],[30],[31] that a sufficient number of hidden layer(s) are required to solve a problem of normal complexity. Recently, it has been stated that one hidden layer suffices for a large class of applications [32].

Each neuron in each layer receives the output signal of each of the neurons of the layer preceding it. This continues through all of the layers of the network until the final output layer. The output layer of the network consists of $N$ neurons with indices $1,\ldots,n,\ldots,N$. The $N$-dimensional output $y$ of the network is available at the output of the $O$th layer. It is assumed that the neurons in any particular layer, as depicted in Figure 2.4, have no lateral or feedback connections.

A FFANN formally carries out a non-linear bounded mapping $v : X \subset R^I \to R^N$, from a compact subset $X$ of $I$-dimensional space to the bounded subset $v(X) = Y$ of $N$-dimensional space. When an input vector $x$ is fed into the network's input layer, the passive nodes of this layer transmit all of the components of $x$ to all of the neurons of the first hidden layer. The outputs of all of the neurons of the first hidden layer are then transmitted to all of the neurons of the second hidden layer, and so on, until finally the $N$ output units emit the components of the output vector $y$.

Figure 2.4: A structure of a multi-layer feedforward artificial neural network.

2.3 The Back-propagation Architecture

The back-propagation architecture as a technique was popularized by Rumelhart et al. in 1986 [7,8]. This architecture was earlier discovered by Bryson and Ho in 1969 [33] and independently rediscovered by Werbos in 1974 [34]. Today it is the most commonly used architecture in FFANN learning strategies [18].

Besides the feedforward connections discussed earlier in section 2.2, each unit of a hidden layer receives an error signal from each of the units in the subsequent layer, as depicted in Figure 2.5. The output units directly receive the error signal whenever the network is presented with some input.

Figure 2.5: The back-propagation architecture of the FFANN. (The figure shows the error signals $\delta[p][1][O], \delta[p][2][O], \ldots, \delta[p][n][O], \ldots, \delta[p][N][O]$ fed back from the output units.)

The network functions in two stages: a forward pass and a backward pass.
Consider the network's input consisting of $P$ vectors, corresponding to $P$ input patterns, indexed $1,\ldots,p,\ldots,P$, where each vector consists of $I$ elements corresponding to the $I$ input passive nodes, indexed $1,\ldots,i,\ldots,I$. The forward pass starts by inserting the $I$-dimensional vectors $[x_p]_1^P$ into the network's input, and the network emits the corresponding $N$-dimensional output vectors $[y_p = v(x_p)]_1^P$, which are the network's estimates of the desired or target outputs $[t_p = g(x_p)]_1^P$. After the estimates $[y_p]_1^P$ are emitted, each of the output units is supplied with its correct or target output vector $[t_p]_1^P$. The elements of the error vectors are computed at the output of the network as

$$[\varepsilon_{pn}]_1^P = \big[\,|t_{pn} - y_{pn}|\,\big]_1^P, \qquad n = 1,\ldots,N. \qquad (2.4)$$

The error elements $\varepsilon_{pn}$ and the error signal $\delta[p][n][O]$ (to be discussed in section 2.6) are calculated and are available at each unit $n$ of the output layer $O$.

In the backward pass, the output units directly receive the error signals from their outputs. The hidden units in the layer below the output layer receive their error signals from the output units to which they are connected. The error signal is thus passed back through the same links which were used for the forward pass, but in the backward direction. The error signal $\delta[p][j][H_j]$ of a unit $j$ in a hidden layer $H_j$ is thus a function of the error signals of the units in the $H_k$th layer and their interlinking weights. The error is propagated back until each of the units in the first hidden layer gets its corresponding error signal. The output layer becomes the input layer in the backward pass.

To summarize, in each training cycle the input is fed at the input layer of the network. The output of the network is determined in the forward pass. The error signal for each neuron in the network is calculated and propagated back in the backward pass. Depending upon the error signal, the weight or synaptic strength of each link in the network is adjusted in a manner so as to reduce the output error.

2.4 Supervised Training of the Back-propagation Network

Having designed the network with the back-propagation architecture, using feedforward neurons in the feedforward connection topology as discussed in sections 2.1, 2.2 and 2.3, the most critical task for the network is to learn, i.e., to achieve weight values that realize the mapping. Learning is one of the key issues in artificial neural network research.

With respect to learning, neural networks fall into a number of classes. Some networks are completely fixed, in which the design of the network determines the weights. Other networks have synaptic weights that depend on the problem being addressed by the network, but the weights are pre-programmed into the network. Other networks can be updated continuously by an outside data source; networks of this type can perform adaptive learning in response to a changing environment. The final class is in situ learning, where the learning algorithm is incorporated directly into the network.

Most neural network research has focussed on direct learning to solve specific tasks instead of building an internal model. Direct learning involves learning a specific action to solve each task. Learning internal world models is opening up an important area of learning where internal models of a limited environment are built through interaction and do not have to be programmed into the network. One of the approaches is to use a FFANN to mimic the input/output behavior of a network interacting with the environment.
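The forward pass and the output-error computation (2.4) can be sketched as follows. This is an illustrative Python fragment under our own conventions (a layer is represented by a hypothetical (W, theta) pair), not the dissertation's C implementation:

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward_pass(x, layers):
    """Forward pass through a FFANN. `layers` is a list of (W, theta)
    pairs, one per layer; W[k][j] weights the signal from node j of
    the previous layer into neuron k. Returns all layer activations."""
    activations = [x]
    for W, theta in layers:
        x = [logistic(sum(w_j * x_j for w_j, x_j in zip(row, x)) + th)
             for row, th in zip(W, theta)]
        activations.append(x)
    return activations

def output_errors(t, y):
    """Error elements of (2.4): e_pn = |t_pn - y_pn|."""
    return [abs(t_n - y_n) for t_n, y_n in zip(t, y)]

# Example: one hidden layer of two neurons, one output neuron.
layers = [([[0.5, -0.4], [0.3, 0.8]], [0.0, 0.0]),   # hidden layer H1
          ([[1.2, -0.7]], [0.1])]                    # output layer O
acts = forward_pass([1.0, 0.0], layers)
print(output_errors([1.0], acts[-1]))
```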
This work focusses on the supervised training of FFANNs with the back-propagation architecture. It is a means of training adaptive neural networks which requires labeled training data and an external teacher. The teacher knows the desired correct response and provides an error signal when an error is made by the network. This is sometimes called reinforcement learning, or learning with a critic, when the teacher only indicates whether a response was correct or incorrect but does not provide detailed error information.

During supervised training, a training set consisting of examples $(x_p, t_p)$ of the mapping between the input $x_p$ and the corresponding desired output $t_p = g(x_p)$, for the input-output patterns $p = 1,\ldots,P$, is provided. It is assumed that such examples of the mapping $g$ are representative of the whole class of maps from the input space to the desired space. When the input data set $[x_p]_1^P$ is presented to the network, it produces the estimate $[y_p = v(x_p)]_1^P$ of $[t_p = g(x_p)]_1^P$ and computes the output error vectors $\varepsilon_p$, $p = 1,\ldots,P$, as given by (2.4). The error signal is generated from the error vectors according to a certain weight update law, which is discussed in section 2.5. The error signal, when propagated back, adjusts the weights of the network in a direction so as to minimize the output error. With this training, the network thus learns to associate an input pattern with its corresponding output pattern. If the training examples were true representatives of the class of the mapping $g$, the network hopefully will associate all patterns (including those for which it has not been trained) correctly.

2.5 The Weight Update Laws and the Energy Functions

The weight update law is very critical in determining the error signal which ultimately makes the network learn. The Generalized Delta Rule (GDR) is a gradient descent method of updating the weights. It has been proposed by Rumelhart et al. [7,8] and is given as

$$\Delta w_{kj} = -\gamma \, \nabla_{w_{kj}} E. \qquad (2.5)$$

Here $w_{kj}$ is the weight connecting the output of the $j$th node in the $H_j$th layer to the input of the $k$th node in the $H_k$th layer, $\gamma$ is the learning rate, $\Delta$ represents the change, and $\nabla_{w_{kj}}$ represents the partial derivative with respect to $w_{kj}$. $E$ is the energy function and is given as

$$E = \sum_p E_p, \qquad E_p = \frac{1}{2} \sum_n \varepsilon_{pn}^2. \qquad (2.6)$$

The weight update law (2.5) achieved a breakthrough in training FFANNs. It can be conveniently used in most cases; however, its convergence cannot be guaranteed. Describing the update law as a system of differential equations [35]-[39] in the form

$$\dot{w}_{kj} = -\gamma \, \nabla_{w_{kj}} E \qquad (2.7)$$

has its advantages over the system of difference equations (2.5). In particular, being a continuous-time gradient descent system, it (a) ensures convergence to local minima, i.e., stable equilibria, (b) prevents the existence of oscillations and complicated behaviors, and (c) lends itself to natural implementation via analog VLSI/LSI silicon circuits [35]-[39].

The choice of energy function specifies the error signal to be propagated back. It also determines the learning performance in the sense of the speed of convergence and the size of the domains of attraction of the stable equilibrium points in the weight space. We remark that the sum-of-the-squared energy function is the L2 (Euclidean) norm [40],[41]. The sketch below contrasts the discrete update (2.5) with a numerical integration of the continuous-time law (2.7).
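The following fragment contrasts one discrete GDR step (2.5) with a forward-Euler integration of the continuous-time law (2.7). The quadratic toy energy and all numerical values are our own illustrative choices:

```python
def gdr_step(w, grad_E, gamma):
    """One Generalized Delta Rule step (2.5): w <- w - gamma * grad E(w)."""
    return [w_i - gamma * g_i for w_i, g_i in zip(w, grad_E(w))]

def continuous_descent(w, grad_E, gamma, dt, steps):
    """Euler integration of the continuous-time law (2.7),
    dw/dt = -gamma * grad E(w); faithful only when gamma*dt is small."""
    for _ in range(steps):
        w = [w_i - gamma * dt * g_i for w_i, g_i in zip(w, grad_E(w))]
    return w

# Toy energy E(w) = w1^2 + w2^2 with gradient (2*w1, 2*w2).
grad = lambda w: [2 * w[0], 2 * w[1]]
print(gdr_step([0.9, -0.5], grad, gamma=0.1))
print(continuous_descent([0.9, -0.5], grad, gamma=1.0, dt=0.01, steps=500))
```

With $\gamma \Delta t$ small, the Euler iterates track the continuous trajectory and decay monotonically to the minimum at the origin; a large discrete step can overshoot it.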
2.6 The Gaussian Energy Function

A learning rule which achieves the desired values of the weights (preferably the global minimum) is statistically equivalent to a maximum likelihood estimator (MLE). In other words, obtaining an MLE is equivalent to minimizing the errors given by some energy function. In the process of maximizing the likelihood estimate, the energy function is derived given the probability distribution of the errors from the desired values in the output of the FFANN.

A well known method of approximating a distribution on the basis of partial knowledge is to choose the density function $d(x)$ which maximizes the entropy subject to the constraint of moments [42]-[46]. It is well established that the maximum entropy distribution subject to the second moment constraint on $(-\infty,+\infty)$ is the Gaussian distribution [46]-[48].

Now suppose that we would like the errors in the output of a FFANN to be Gaussian random variables with zero mean. In order to make the network an MLE, the criterion is to minimize the sum-of-the-squared-error energy function [48]. This is specifically shown below. The Gaussian distribution of the errors is given by

$$d(\varepsilon_{pn}) = \frac{1}{\sqrt{2\pi}\,\sigma_{pn}} \exp\left[\frac{-\varepsilon_{pn}^2}{2\sigma_{pn}^2}\right]. \qquad (2.11)$$

Assuming the output errors to be independently distributed for all the output nodes, the joint Gaussian distribution is then given by

$$d_j(\varepsilon_p) = (2\pi)^{-N/2}\, |\Sigma_p|^{-1} \exp\left[-\frac{1}{2}\,(\varepsilon_p)^T (\Sigma_p)^{-2} (\varepsilon_p)\right]. \qquad (2.12)$$

Here $N$ denotes the total number of neurons in the output layer of the FFANN and

$$\varepsilon_p = [\varepsilon_{p1}\; \varepsilon_{p2}\; \cdots\; \varepsilon_{pN}]^T, \qquad \Sigma_p = \mathrm{diag}\,[\sigma_{p1}\; \sigma_{p2}\; \cdots\; \sigma_{pN}],$$

where $\sigma_{pn}$ is the standard deviation of the error distribution of the $n$th output node when the $p$th pattern is used. Since the errors $\varepsilon_p$, $p = 1,2,\ldots,P$, are independent, the likelihood function $L$ is

$$L(\varepsilon_p) = \prod_p d_j(\varepsilon_p) = (2\pi)^{-NP/2} \prod_p |\Sigma_p|^{-1} \exp\left[-\frac{1}{2}\sum_p (\varepsilon_p)^T (\Sigma_p)^{-2} (\varepsilon_p)\right]. \qquad (2.13)$$

Choose $\sigma_{p1} = \sigma_{p2} = \cdots = \sigma_{pN} = 1$ for all the patterns $p = 1,2,\ldots,P$. Maximizing the likelihood function then corresponds to minimizing the energy function in equation (2.6), which is referred to as the Gaussian energy function. Of course, in general, $\Sigma_p$ in (2.13) can be any symmetric nonsingular matrix.

The continuous-time learning dynamics of the update law (2.7), using the Gaussian energy function, have been explored in [35]-[39]. In subsection 2.6.1 we describe these learning dynamics, whereas the software implementation is outlined in subsection 2.6.2.
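To make the link between (2.13) and (2.6) explicit, take the negative logarithm of the likelihood with unit standard deviations (this one-step derivation is ours, added for clarity):

$$-\ln L(\varepsilon_p) = \frac{NP}{2}\ln(2\pi) + \frac{1}{2}\sum_p \sum_n \varepsilon_{pn}^2 = \mathrm{const} + \sum_p E_p = \mathrm{const} + E,$$

so maximizing $L$ over the weights is precisely minimizing the Gaussian energy $E$ of (2.6).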
Now the weight update law (2.15 ) canbewrittenas warn =7; 8pit - sgn (spa) - ypn ~(1-an)- 2pm =7; (tpn-ypn) - an - (l-ypn) - 2pm =7); 6[p1[n][0] .ylpltmlle]. (2.18) where y [p ][m][H,,,] is the output an of the mth node in the H", th layer. 5[p ][n][0] is the error signal which is completely determined by the desired and the actual outputs of node n in the output layer and is given by 5[P][n][0] = (tpn-ypn) -)’pn - (1 -ypn)- (2.19) Observe that the weight w,.,,, , is updated depending upon the error signal of the nth node and output of mth node. The weight update law for the weights wk,- , connecting the jth node of a hidden/input layer H j to the kth node in the subsequent hidden layer Hg, is derived similarly and is given by Wu =7; 5[P1[k][Hh]-)’[P]U][Hj], (220) where 5[P][k][Hk] 3= 21 MINIMUM - Wlk .th - (1 - ypk)- (2-21) In (2.21), I is the index of the nodes in the H1 th layer to which node k feeds its output 22 ylk and w”; is the weight of the corresponding connection. y [p][j][H,-] is the output of the node j in the H ,- th layer which is fed to node k in the I-Ik th layer. 2.6.2 Software implementation For the software implementation of the update law (2.7), using the Gaussian energy function, the learning algorithm is formalized in Table 2.1. Table 2.1 The Learning Algorithm 1 Present the input vectors xp , p = 1,...,P , at the input of the FFANN. Determine the network’s output yp = v(xp.), p = 1,...,P. Calculate the error signal for the output nodes. Propagate the error signal back to all the nodes in the network. Update the weights. O’tlllkwN If the weight change of the network is greater than the stopping criterion go to step 1, otherwise stop. The stopping criterion to can be defined as to := 2; |Aw,,,- I. (2.22) W” where wk,- are all the weights in the network. Theoretically in order to reach a local minimum the weight change criterion should be zero, which may never be achieved. In practice we can choose a sufficiently small value of (n, which when achieved indicate that the trajectories of the system have reached a local minimum. The weight change Awkj of the system can be determined as 23 Awkj := Wk} (t +At )-ij (t ). (2.23) The term At is the time step of the integration routine. Using the fourth order Runge- Kutta integration routine the learning algorithm outlined in Table 2.1 was implemented into computer software using the C—language [36]. The software is designed to initialize the weights of the network randomly, by a ran- dom number generator. The seed, which is an integer used to start the random number generator, is used to represent the set of initial weights. The value of the initial weights w is restricted to be -o.9 s w s 0.9. An error criterion g is defined as g := g3; 8,," 2, (2.24) is used to characterize the training: the lower the error, the better is the training. The error criterion can also prove to be useful in some cases, to terminate the training of the network. 2.7 The problem description We consider the implementation of FFANN into software on digital computers. For learning of such FFANNs, the continuous-time weight update law (2.7) which is a set of differential equations is implemented on digital computers. For faithful implementation of the gradient descent update law (2.7), the learning rate 7 and the time step At (2.23) must be very small. As a consequence of the small value of 7.At, (l) the learning is slow (2) the algorithm converges to the nearest local minimum which in most cases may not be satisfactorily useful. 
One may consider increasing value of the learning rate yas an immediate solution to the above problems. This may not work in all the cases. With a large value of 7, while we increase the tendency of skipping the nearby local minima, we may also make it prone to 24 skip desirable minima and delay or prevent the convergence. The so-called back-propagation update law in its existing form uses an almost Euler approximation to the continuous-time update law. In addition, most reported simulations use a large step size or a learning rate. Such large values prevent tractable analysis and have not been proven to ensure convergence. Thus the implementation of FFANNs with the two conflicting properties remains a challenge: (a) the update in speed is fast. (b) Gauranteed convergence to an acceptable minimum and the occurence of no oscilla- tions. In this work, we aim at achieving this challenge. 2.8 The objectives and the outline of the thesis (1) In this work we deal with this problem in two steps. We present a method to make the learning faster by using different energy functions in the update law (2.7). For the supervised learning of FFANN we use the continuous-time weight update law (2.7) which is derived from an energy function. If we choose an appropriate energy function the network may learn faster. The energy function can be viewed as (i) related to the distribution of the random errors in the output of the network, or (ii) a mathematical norm. We can choose a maximum entropy distribution of random errors in the output of the network, subject to the constraint of moments. Then, proper selection of the energy function for training the FFANN can make it an MLE in the statistical sense. In chapter 3, we show that in order to make the learning faster, we can choose a dis- tribution other than the Gaussian distribution. We choose the Cauchy distribution and motivate a rational for its selection and show that learning in this case is faster 25 or at least does not deteriorate. This conclusion is also supported by computer simu- lations. When the energy function is viewed as a mathematical norm, we have the option of selecting any LP norm. We combine various norms to form what we call the Polynomial energy function. In chapter 4, we show the usefulness of choosing the Polynomial energy function analytically and with simulation support. (2) We present a method which make it possible to "skip" unacceptable local minima and at the same time ensures faster convergence to one of the acceptable minima. The problem of skipping the unacceptable local minima is tackled in chapter 5. We propose an exponential energy function, which skips the unacceptable local minima and ensures faster convergence to one of the acceptable minima. The Exponential energy function is used in conjunction with the Gaussian, the Cauchy or the Polynomial energy functions. The performance of the learning algorithm using weight update law (2.7) with dif- ferent energy functions (Table 2.1) is evaluated in the application to the pattern recogni- tion problem in chapter 6. After discussing the network design, various computer simula- tions are presented. Finally in chapter 7, summary of this work is collected along with concluding remarks. CHAPTER 3 THE CAUCHY ENERGY FUNCTION 3.1 Motivation This chapter addresses the issue of increasing the learning speed of the FFANNs. In what follows, we describe the intuitive rational for selecting the Cauchy energy function from the point of view of statistical signal processing. 
On the other hand, the real rationale for this work may stem from the convergence speed of the proposed weight update learning rule as compared to the usual update rule (2.6)-(2.7).

The derivation of a supervised training algorithm for a FFANN implies the selection of a norm criterion which gives a suitable measure of a particular distribution of errors in the output of the network. There exists a correspondence between the L_p norms and the error distributions [48]. Many researchers [44]-[48] consider the maximum entropy of the output errors, subject to certain moment constraints, to be a suitable measure to estimate the error distribution. The larger the number of moment constraints, the smaller the choice of error distributions. It is well known that, subject to the first and second moment constraints, the maximum entropy distribution is the Gaussian distribution [44]-[48]. Corresponding to the Gaussian distribution, the criterion is to select the L_2 norm for the energy function in the weight update law (2.7) [49].

Restricting our attention to the first moment constraint, there are many choices of an error distribution. The Cauchy distribution [50], given as

d(\epsilon_{pn}) = \frac{1}{\pi}\,\frac{\lambda_{pn}}{\lambda_{pn}^2 + \epsilon_{pn}^2},   (3.1)

is chosen due to the following reasons.
(1) The entropy of the Cauchy distribution (3.1) is about 1.78 times higher than the entropy of the Gaussian distribution (2.11), when the parameters λ_pn and σ_pn for these distributions are chosen to be unity. This is analytically shown in Appendix A.
(2) The speed of convergence of the weight update law (2.7) improves when the energy function corresponding to the Cauchy distribution is used instead of the Gaussian energy function (2.6). This is discussed in detail in this chapter.

In order to derive the energy function corresponding to the Cauchy distribution of errors, let us consider the errors ε_pn, 1 ≤ p ≤ P and 1 ≤ n ≤ N, to be independently distributed. The joint Cauchy distribution for pattern p will then be

d_j(\epsilon_p) = \left(\frac{\lambda}{\pi}\right)^{N} \prod_{n=1}^{N} \frac{1}{\lambda^2 + \epsilon_{pn}^2}.   (3.2)

Since the errors for all the desired patterns p = 1,2,...,P are assumed to be independent, the likelihood function is

L(\epsilon) = \prod_{p=1}^{P} d_j(\epsilon_p) = \left(\frac{\lambda}{\pi}\right)^{PN} \prod_p \prod_n \frac{1}{\lambda^2 + \epsilon_{pn}^2}.   (3.3)

We choose λ_p1 = λ_p2 = ... = λ_pN = 1 for all the patterns p = 1,2,...,P. Then maximizing the likelihood estimator (3.3) corresponds to minimizing the quantity

E^c = \frac{1}{2} \prod_p \prod_n (1 + \epsilon_{pn}^2).   (3.4)

(The constant factor 1/2 does not affect the minimization and is kept for notational convenience in the derivations that follow.) We refer to the quantity in (3.4) as the Cauchy energy function [52].

For a qualitative comparison, we set σ_pn in (2.11) and λ_pn in (3.1) to 1 and obtain the two distributions depicted in Figure 3.1. From the graphs in Figure 3.1, observe that the Cauchy distribution of the error is more spread out than the Gaussian distribution. This "spread out" feature signifies higher entropy and more tolerance in the errors of the Cauchy distribution.

Figure 3.1: Random error distribution. The horizontal axis shows the random error and the vertical axis shows the density with which the errors are distributed. The solid line represents the Gaussian error distribution as given by (2.11) with σ_pn = 1 and the dotted line represents the Cauchy error distribution as given by (3.1) with λ_pn = 1.

In this work, we desire to train the FFANN using a learning update law derived from the Cauchy energy function (3.4). When the update law converges to a weight value that minimizes the Cauchy energy function, the FFANN would then emulate a classifier with a Cauchy distribution of errors from the target values.
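For a quick side-by-side view (a sketch, not part of the dissertation's simulator; eps[p][n] is assumed to hold the output errors t_pn - y_pn, and the names are hypothetical), the two energies can be computed as:

/* Sketch: Gaussian (2.6) and Cauchy (3.4) energies from an error matrix.
   eps[p][n] = t_pn - y_pn; P patterns, N output nodes. */
double gaussian_energy(int P, int N, double eps[P][N])
{
    double E = 0.0;
    for (int p = 0; p < P; p++)
        for (int n = 0; n < N; n++)
            E += eps[p][n] * eps[p][n];
    return 0.5 * E;                 /* E = (1/2) sum_p sum_n eps^2 */
}

double cauchy_energy(int P, int N, double eps[P][N])
{
    double Ec = 1.0;
    for (int p = 0; p < P; p++)
        for (int n = 0; n < N; n++)
            Ec *= 1.0 + eps[p][n] * eps[p][n];
    return 0.5 * Ec;                /* E^c = (1/2) prod_p prod_n (1 + eps^2) */
}

Note the multiplicative structure of the Cauchy energy: each additional pattern or output node multiplies the energy rather than adding to it, which is what produces the amplifying factors α in the learning dynamics analyzed in section 3.3.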
The chapter is organized as follows. In section 3.2, we analyze the learning dynamics of the weight update law (2.7) using the Cauchy energy function. In section 3.3, we analytically compare the learning dynamics due to the Gaussian (2.6) and the Cauchy (3.4) energy functions. Software implementations of the learning dynamics are presented in section 3.4, which are then used to train a FFANN. As an illustrative example, we specialize the software to solve the XOR problem. Simulation results are reported in section 3.5. We discuss the advantages and shortcomings of using the Cauchy energy function in section 3.6.

3.2 The Learning Dynamics

In this section, the continuous-time learning dynamics of the update law (2.7), using the Cauchy energy function, are explicitly derived. As described earlier in section 2.2, the output layer is layer O, with nodes indexed by n. The O-1th layer, which feeds its output to the output layer, is actually the last hidden layer H_m, whose nodes are indexed by m. The learning dynamics of the weights are obtained by using the Cauchy energy function (3.4) in the update law (2.7); specifically, for the output layer

\dot{w}^c_{nm} = -\gamma \nabla_{w_{nm}} E^c,   (3.5i)

and for any hidden layer

\dot{w}^c_{kj} = -\gamma \nabla_{w_{kj}} E^c.   (3.5ii)

Using the chain rule, the dynamics for the output layer are

\dot{w}^c_{nm} = -\gamma \cdot \frac{\partial E^c}{\partial \epsilon_{pn}} \cdot \frac{\partial \epsilon_{pn}}{\partial y_{pn}} \cdot \frac{\partial y_{pn}}{\partial u_{pn}} \cdot \frac{\partial u_{pn}}{\partial w_{nm}},   (3.6)

where u_pn is the net input to the nth node of the output layer and is given by

u_{pn} = \sum_m w_{nm} z_{pm} + \theta_n.   (3.7)

Here z_pm denotes the output of the mth node in the H_m th hidden layer, and θ_n is the threshold of the nth node in the output layer. The partial derivatives in equation (3.6) can be explicitly calculated as

\frac{\partial E^c}{\partial \epsilon_{pn}} = \frac{1}{2}\left[\prod_{\rho \neq p} \prod_{\eta} (1+\epsilon_{\rho\eta}^2)\right]\left[\prod_{\eta \neq n} (1+\epsilon_{p\eta}^2)\right] \cdot 2\,\epsilon_{pn},   (3.8i)

\frac{\partial \epsilon_{pn}}{\partial y_{pn}} = -\mathrm{sgn}(t_{pn} - y_{pn}),   (3.8ii)

\frac{\partial y_{pn}}{\partial u_{pn}} = y_{pn}(1-y_{pn}),   (3.8iii)

and

\frac{\partial u_{pn}}{\partial w_{nm}} = z_{pm}.   (3.8iv)

Here ρ is a dummy variable representing patterns 1 through P and η is a dummy index for nodes 1 through N of the output layer. We have selected the so-called logistic function [7,8]

y_{pn} = \frac{1}{1+e^{-u_{pn}}}

to describe the input-output relationship of a neuron. Now the weight update law can be written as

\dot{w}^c_{nm} = \gamma \sum_p \delta[p][n][O] \cdot y[p][m][H_m],   (3.9)

where

\delta[p][n][O] := \left[\prod_{\rho \neq p} \prod_{\eta} (1+\epsilon_{\rho\eta}^2)\right]\left[\prod_{\eta \neq n} (1+\epsilon_{p\eta}^2)\right] (t_{pn}-y_{pn})\, y_{pn}(1-y_{pn})   (3.10)

and y[p][m][H_m] is the output z_pm of the mth node in the H_m th layer.

The weight update law for the weights w_kj, connecting the jth node of the hidden/input layer H_j to the kth node in the subsequent hidden layer H_k, is derived similarly and is given by

\dot{w}^c_{kj} = \gamma \sum_p \delta[p][k][H_k] \cdot y[p][j][H_j],   (3.11)

where

\delta[p][k][H_k] := \sum_l \delta[p][l][H_l] \cdot w_{lk} \cdot y_{pk}(1-y_{pk}).   (3.12)

In (3.12), l is the index of the nodes in the H_l th layer to which node k feeds its output y_pk, and w_lk is the weight of the corresponding connection. y[p][j][H_j] is the output of node j in the H_j th layer which is fed to node k in the H_k th layer.
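As a sketch of how the error signal (3.10) might be evaluated in software (hypothetical names; eps[p][n] stores the signed error t_pn - y_pn and y[p][n] the actual output):

/* Sketch: the Cauchy error signal of eq. (3.10) for output node n, pattern p. */
double cauchy_delta(int P, int N, double eps[P][N], double y[P][N],
                    int p, int n)
{
    double amp = 1.0;
    /* product over all patterns rho != p, all output nodes eta */
    for (int rho = 0; rho < P; rho++) {
        if (rho == p) continue;
        for (int eta = 0; eta < N; eta++)
            amp *= 1.0 + eps[rho][eta] * eps[rho][eta];
    }
    /* product over output nodes eta != n within pattern p */
    for (int eta = 0; eta < N; eta++) {
        if (eta == n) continue;
        amp *= 1.0 + eps[p][eta] * eps[p][eta];
    }
    /* the Cauchy delta is the Gaussian delta (2.19) amplified by the
       accumulated error of all other (pattern, node) pairs */
    return amp * eps[p][n] * y[p][n] * (1.0 - y[p][n]);
}

Since every factor 1 + ε² ≥ 1, the amplification amp is at least 1; this is the source of the speed comparison made next in section 3.3.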
3.3 Analytical Comparison with the Gaussian Energy Function

3.3.1 A criterion for the learning performance

To compare the dynamics of the (Gaussian) learning (2.14) and the (Cauchy) learning (3.5), we define the performance criterion delineated in the following procedure. First we consider V := E as a candidate Liapunov function [53]-[54]. Then, we calculate the derivative of V along the trajectories of the (Gaussian) learning (2.14) -- denote this derivative by \dot{V}. Similarly, we calculate the derivative of V along the trajectories of the (Cauchy) learning (3.5) -- denote this derivative by \dot{V}^c.

Now if \dot{V}^c - \dot{V} ≤ 0, then we say that the learning performance* of the second system is "faster" than or at least the same as that of the first system. This is motivated by the fact that \dot{V}^c - \dot{V} gives the projection of the rate of change of the difference in the weight vectors, say (W^c - W), along the divergence vector \nabla_W V. This we take as a measure of the speed of updating the weight-vector difference (W^c - W), which in turn is equivalent to the comparative speed in updating the weight vectors W^c and W.

* Improvement in the learning performance of system (3.5) here means that the derivative of the sum-of-the-squared error at any point on the error surface along the trajectories of system (2.14) is less in magnitude than that of system (3.5).

Formally, we observe that

\dot{V}^c - \dot{V} = \nabla_W V \cdot (\dot{W}^c - \dot{W}),   (3.13)

where W^c and W denote vectors containing all the corresponding weights. Equations (3.5) give the components of \dot{W}^c and equations (2.14) give the components of \dot{W}.

3.3.2 The learning dynamics for a single pattern

Consider the case when only one input-output pattern is applied to the network. For ease of notation we will drop the subscript p for patterns and define the Gaussian and the Cauchy energy functions as

E := \frac{1}{2}\sum_n \epsilon_n^2   (3.14i)

and

E^c := \frac{1}{2}\prod_n (1+\epsilon_n^2),   (3.14ii)

respectively. By straightforward calculations, for the output layer equation (3.5i) can be rewritten as

\dot{w}^c_{nm} = -\gamma \left[\nabla_{w_{nm}} E + \alpha_n \nabla_{w_{nm}} E\right].   (3.15i)

For the hidden layer H_k the learning algorithm in (3.5ii) can be similarly rewritten as

\dot{w}^c_{kj} = -\gamma \left[\nabla_{w_{kj}} E + \frac{1}{2}\sum_n \alpha_n \nabla_{w_{kj}} \epsilon_n^2\right].   (3.15ii)

Here α_n ≥ 0 and is given explicitly as

\alpha_n = E^n_1 + E^n_2 + \dots + E^n_{(N-1)},

where E^n_1 := \sum_{\eta \neq n} \epsilon_\eta^2 is the sum of the squared errors not including ε_n², E^n_2 is the summation of all \binom{N-1}{2} pairwise products ε_i²ε_j² not including ε_n², and finally

E^n_{(N-1)} := \epsilon_1^2 \epsilon_2^2 \cdots \epsilon_N^2 not including \epsilon_n^2.

Before we formally characterize the learning performance due to the Gaussian and the Cauchy energy functions in the form of a theorem, Lemma 3.1 is in order, which uses Assumption 3.1 given below.

Assumption 3.1: All the eigenvalues of the matrix with entries a_ij = α_i + α_j are non-negative.

Remark: In our simulations, we always initialize the weights in the learning dynamics so that the weight components have small values (typically within [-0.9, 0.9]). In doing so, the outputs of the individual nodes are likely to be close to 0.5. In that case, α_1 ≈ α_2 ≈ ... ≈ α_n ≈ ... ≈ α_N. If the equality holds, then Assumption 3.1 is automatically satisfied. See Appendix B.

Lemma 3.1: Given α_i, x_i ∈ R, α_i ≥ 0, and assuming that Assumption 3.1 holds. Then,

\left[\sum_i x_i\right]\left[\sum_i \alpha_i x_i\right] \ge 0.

The proof of Lemma 3.1 is provided in Appendix B.

Theorem 3.1: Assume that Assumption 3.1 holds. Consider V := E given by (3.14i) to be a candidate Liapunov function. Let \dot{V} be the derivative of V along the trajectories of system (2.14) and \dot{V}^c be the derivative of V along the trajectories of system (3.15). Then the learning performance of system (3.15) is "faster" than or at least is the same as that of system (2.14).

The proof of this theorem, presented in Appendix C, analytically shows that the learning performance improves if the Cauchy energy function is used instead of the Gaussian energy function.
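To make the structure of (3.15) concrete, consider a worked illustration (not from the dissertation) with a single pattern and N = 2 output errors:

\begin{aligned}
E^c &= \tfrac{1}{2}(1+\epsilon_1^2)(1+\epsilon_2^2),\\
\nabla E^c &= \epsilon_1(1+\epsilon_2^2)\,\nabla\epsilon_1 + \epsilon_2(1+\epsilon_1^2)\,\nabla\epsilon_2\\
&= \big[\epsilon_1\nabla\epsilon_1 + \epsilon_2\nabla\epsilon_2\big] + \epsilon_2^2\,\epsilon_1\nabla\epsilon_1 + \epsilon_1^2\,\epsilon_2\nabla\epsilon_2,
\end{aligned}

where the bracketed term is exactly \nabla E for the Gaussian energy (3.14i). Hence α_1 = ε_2² and α_2 = ε_1²: each output's Gaussian gradient term is amplified by the squared error still present at the other output node, and the amplification vanishes as the errors shrink.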
3.3.3 The learning dynamics for multiple patterns

Now, we consider the learning dynamics of the weights for P patterns. The (Gaussian) weight update law is given by equation (2.14), where E is given by (2.6). The (Cauchy) learning dynamics of the weights take the form of (3.5), where E^c now is given by (3.4). By straightforward calculations, \dot{w}^c_{nm} can be written as

\dot{w}^c_{nm} = -\gamma \left[\nabla_{w_{nm}} E + \frac{1}{2}\sum_p \alpha_{pn} \nabla_{w_{nm}} \epsilon_{pn}^2\right].   (3.16i)

Similarly, for the hidden layers the learning algorithm of the Cauchy energy function (3.4) can be written as

\dot{w}^c_{kj} = -\gamma \left[\nabla_{w_{kj}} E + \frac{1}{2}\sum_p \sum_n \alpha_{pn} \nabla_{w_{kj}} \epsilon_{pn}^2\right].   (3.16ii)

Here α_pn ≥ 0 and is given by

\alpha_{pn} = E^{pn}_1 + E^{pn}_2 + \dots + E^{pn}_{(PN-1)},

where E^{pn}_1 := \sum_{(\rho,\eta)\neq(p,n)} \epsilon_{\rho\eta}^2 is the sum of the squared errors not including ε_pn², E^{pn}_2 is the summation of all \binom{PN-1}{2} pairwise products not including ε_pn², and finally

E^{pn}_{(PN-1)} := \epsilon_{11}^2 \epsilon_{12}^2 \cdots \epsilon_{PN}^2 not including \epsilon_{pn}^2.

Before we compare the learning performance due to the weight update laws (2.14) and (3.16), we state Lemma 3.2, which uses the following assumption.

Assumption 3.2: All the eigenvalues of the matrix with entries a_ij = α_i + α_j, the index running over all (pattern, node) pairs, are non-negative.

Remark: Assumption 3.2 is equivalent to Assumption 3.1.

Lemma 3.2: Given α_ij, x_ij ∈ R, α_ij ≥ 0, and assuming that Assumption 3.2 holds. Then,

\left[\sum_i \sum_j x_{ij}\right]\left[\sum_i \sum_j \alpha_{ij} x_{ij}\right] \ge 0.

The proof of Lemma 3.2 is provided in Appendix D.

Theorem 3.2: Assume that Assumption 3.2 holds. Consider a candidate Liapunov function V := E given by equation (2.6). Let \dot{V} be the derivative of V along the trajectories of (2.14) and \dot{V}^c be the derivative of V along the trajectories of the system (3.16). Then the learning performance of system (3.16) is "faster" than or at least is the same as that of system (2.14).

Again the proof of this theorem is included in Appendix E. The analysis clearly indicates that the learning performance in "speed" can improve (or at least does not deteriorate) when the Cauchy energy function is used.

Proposition 3.1: The equilibria of system (3.16) are also the equilibria of system (2.14).

The proof of Proposition 3.1 is provided in Appendix F.

Remark: Proposition 3.1 implies that when we use the Cauchy energy function (3.4) in the weight update law (2.7), we still converge to one of the equilibria of system (2.14). However, this does not mean that, starting from the same initial condition, the two systems would follow the same trajectory or that their trajectories would converge to the same equilibrium point.

3.4 Software implementation of the learning dynamics

For the software implementation of the learning dynamics of the weights connected to the output layer (3.9) and the weights connected to the hidden layers (3.11), the learning algorithm formalized in Table 2.1 is used. The weight update laws (3.9) and (3.11) are implemented in computer software using the C language on Sun SPARC stations. A fourth-order Runge-Kutta integration routine is used for the integration of these update laws.

A stopping criterion ω was set to determine if the system has converged to a solution. The parameter ω was defined as

\omega := \sum_{w_{kj}} |\Delta w_{kj}|,   (3.17)

where the w_kj are all the weights in the network and

\Delta w_{kj} := w_{kj}(t+\Delta t) - w_{kj}(t).   (3.18)

The term Δt is the time step chosen for the fourth-order Runge-Kutta integration of the update laws. An error criterion ξ was used to characterize the training: the lower the value of the error, the better the training. The (output) error parameter is defined as

\xi := \sum_p \sum_n \epsilon_{pn}^2.   (3.19)

Note that this error parameter is essentially the Gaussian energy function. The weights of the network can be initialized by giving their values or selected randomly by a random number generator of the UNIX system.

The training simulator, using the Runge-Kutta integration routine, reads the input data from an input file, which provides the values of the time step Δt, the learning rate γ, the error criterion ξ, the stopping criterion ω, the network structure, the training patterns and the corresponding target values.
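For illustration only, such an input file maps naturally onto a parameter record of the following form; the actual file format of the dissertation's software is not specified, and every field name here is hypothetical:

/* Hypothetical parameter record for the training simulator. */
typedef struct {
    double  dt;          /* integration time step (Delta t)           */
    double  gamma;       /* learning rate                             */
    double  xi_stop;     /* error criterion (3.19) for early stopping */
    double  omega_stop;  /* stopping criterion (3.17) on weight change*/
    int     n_layers;    /* network structure: number of layers ...   */
    int    *layer_sz;    /* ... and nodes per layer                   */
    int     n_patterns;  /* number of training patterns P             */
    double **inputs;     /* training patterns                         */
    double **targets;    /* corresponding target values               */
} sim_params;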
The training is terminated when either the stopping criterion ω or the error criterion ξ is achieved. The simulator writes the intermediate and/or the final values of the network weights and/or other parameters into an output file.

3.5 Simulation example: the XOR problem

The training performance of the two software simulators, using the Cauchy and the Gaussian energy functions, is described and a comparison is made in this section. A prototype FFANN used to solve the XOR problem had two input passive nodes (a passive node does not include a sigmoidal transfer function), two hidden nodes and one output node, as shown in Figure 3.2.

Figure 3.2: The network used to solve the XOR problem. The input layer has two passive nodes, whereas the hidden and the output layers have two and one artificial feedforward neurons, respectively.

The network was fully connected and the weights were initialized randomly by a random number generator. The seed, which is a number used to start the random number generator, is used to represent the set of initial weights. The value of any initial weight w is restricted to -0.9 ≤ w ≤ 0.9.

Figure 3.3: Comparison of the Gaussian (dotted) and the Cauchy (solid) learning for the six sets of initial weights (error versus iterations). In all of the above simulations, the learning algorithm using the Cauchy energy function achieves the same error with fewer iterations.

Table 3.1: Training results comparing the Gaussian and the Cauchy learning for the six sets of initial conditions: the final error, the number of iterations and the computer time, both when the weight-change criterion ω ≤ 0.0001 is achieved and when the error criterion ξ is achieved.

We assume that the network has converged when the stopping criterion ω ≤ 0.0001 is satisfied. In all the simulation cases, the network converged to a local minimum. A comparison of 6 different sets of initial conditions is depicted in Figure 3.3. In all of the six simulations, γ and Δt are chosen to be 1 and 0.1 seconds, respectively. For the cases when the weights were initialized by sets 1, 3, 5 and 6, the training did achieve excellent results.
In the other cases, when the weights were initialized by sets 2 and 4, the network was stuck in an undesirable local minimum with an error value equal to 0.5. The training results using these six different sets of initial conditions are summarized in Table 3.1.

In all six simulations the (Cauchy) learning algorithm achieves the same error performance as the Gaussian learning, in many fewer iterations. Finally, we also observe that the update law (3.5) outperforms the update law (2.14) in achieving any fixed error value in terms of the number of iterations required to realize that error value. It may also be observed in Figure 3.3 that the bulk of the number of iterations is executed very near the local minimum. It may therefore be useful to set an error criterion ξ to a desirable value along with the stability criterion ω, for terminating the training.

3.6 Discussion

In this chapter, we derived a gradient descent weight update law using a Cauchy energy function for supervised error back-propagation learning for FFANNs. We presented a statistical rationale for selecting the Cauchy energy function to be used in the weight update law. We also showed analytically that the learning performance may improve if the Cauchy energy function is used instead of the Gaussian energy function. This improvement is characterized in terms of the speed of the continuous-time update law for the weights, which results in fewer iterations in achieving a specified error (energy) value.

We observe that each component of the vector field of (3.16) is always greater than or equal in amplitude to the corresponding component of the vector field of (2.14). This means that at every point in the weight state-space the dynamics of (3.16) are "faster" than the dynamics of (2.14). However, this notion of "speed" is misleading: one cannot rely on the vector fields alone; one must consider the trajectories crossing into well-defined common, constant-energy levels of the same function [40],[53]-[54]. This amounts to the trajectories of both dynamics crossing into regions defined by the same error function E.

Even though the proposed dynamics require more computations at every step, the total number of steps to converge is usually reduced. Because the computations grow with the number of output nodes and the number of input patterns, a software implementation of the algorithm may not always be suitable. Networks implemented in analog hardware can still derive full benefit from the learning speed of the proposed algorithm.

Numerous simulations focusing on the XOR problem, which support the above analysis, have been conducted. A sample of these simulations is included. The analysis indicates that the Cauchy learning dynamics are faster at each point in the weight space. On the other hand, the simulations focusing on the XOR problem seem to support a stronger conclusion: the Cauchy learning dynamics are faster along each trajectory in the weight space as compared to the Gaussian learning dynamics.

CHAPTER 4
THE POLYNOMIAL ENERGY FUNCTION

4.1 Motivation

The energy function in the weight update law (2.7) specifies the error of the output (from the desired values) to be propagated back for the adjustment of the weights. This error, or the energy, can be viewed as a mathematical norm. The usual sum-of-the-squared (Gaussian) energy function is the L_2 norm. Consequently we may choose other L_p norms (1 ≤ p ≤ ∞) [54].

For every error vector ε_p ∈ R^N, p = 1,2,...,P, we know that all the norms are equivalent in the sense that

K_1 |\epsilon_p|_\alpha \le |\epsilon_p|_\beta \le K_2 |\epsilon_p|_\alpha.   (4.1)

Here K_1 and K_2 are two positive constants and α and β denote two norms. This equivalence does not imply that if, for example, the L_2 norm of an error vector ε_p is greater than the L_2 norm of another error vector ε_p', then the L_1 norm of ε_p is also greater than the L_1 norm of ε_p'. This non-implication is a motivation to specify an energy function such that, when it is used in the weight update law (2.7), the error decreases in all the specified norms.

While error reduction in all the specified norms provides an intuitive rationale for selecting the Polynomial energy function, the real rationale for this work may stem from the convergence speed of the proposed weight update law as compared to the usual weight update law (2.6)-(2.7).

A learning rule which achieves the desired values of the weights (preferably the global minimum) by reducing the error in the first r norms may use a Polynomial energy function [57] of order r, given as

E^p = a \sum_p \sum_n (\epsilon_{pn} + b)^r,   (4.2)

where a and b are any suitable real constants. It should be noted that p used as a superscript stands for "polynomial" and when used as a subscript stands for the pattern p = 1,2,...,P. In this chapter we use the Polynomial energy function of order 2 with a = b = 1/2 in the weight update law (2.7) to train a FFANN. The Polynomial energy function then reduces to

E^p = \frac{1}{2} \sum_p \sum_n \left(\epsilon_{pn} + \frac{1}{2}\right)^2.   (4.3)

It should be noted that the Polynomial energy function of order 2 (4.3) is basically a combination of the L_1 and L_2 norms.
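As a sketch of how (4.3) translates into software (hypothetical names; eps is assumed to store the non-negative error magnitudes |t_pn - y_pn|, consistent with the appearance of sgn(t_pn - y_pn) in the learning dynamics derived in section 4.2):

/* Sketch: order-2 Polynomial energy (4.3) and its output error signal. */
#include <math.h>

double poly_energy(int P, int N, double eps[P][N])
{
    double Ep = 0.0;
    for (int p = 0; p < P; p++)
        for (int n = 0; n < N; n++) {
            double e = eps[p][n] + 0.5;   /* (eps + 1/2)^2 mixes L1 and L2 */
            Ep += e * e;
        }
    return 0.5 * Ep;
}

/* Output error signal for one node (cf. eq. (4.9) in section 4.2):
   sgn(t - y) * (|t - y| + 1/2) * y * (1 - y). */
double poly_delta(double t, double y)
{
    double e = fabs(t - y);
    double s = (t > y) - (t < y);         /* sgn(t - y) */
    return s * (e + 0.5) * y * (1.0 - y);
}

Compared with the Gaussian error signal (2.19), the additive 1/2 in poly_delta keeps the back-propagated error from vanishing as the error magnitude approaches zero.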
For every error vector 5,, e R",p =1,2,...,P, we know that all the norms are equivalent in the sense that K1 lep Ins Iep Isus lepla ‘ (4.1) Here K 1 and K 2 are any two constants and or. and B are two norms. This equivalence does not imply that if, for example, the L2 norm of an error vector 6,, is greater than the L2 norm of another error vector 6;, then the L1 norm ofe, is also greater than the L1 norm of e; . This implication is a motivation to specify an energy function which when used in the weight update law (2.7), the error decreases in all the specified norms. While error reduction in all the specified norms provide an intuitive rational for selecting the Polynomial energy function, the real rational for this work may stem from the convergence speed of the would-be-proposed weight update law as compared to the usual weight update law (2.6)-(2.7). A learning rule which achieves the desired values of the weights (preferably the glo- bal minimum) by reducing the error in the first r norms, may use a Polynomial energy function [57] of order r given as 42 43 EP =a 22¢," +b)’, (4.2) p‘ n where a and b are any suitable real constants. It should be noted that p used as super- script stands for polynomial and when used as a subscript stands for the pattern p =1,2,...,P. In this chapter we use the Polynomial energy function of order 2 and a =b = 71- in the weight update law (2.7), to train a FFANN. The Polynomial energy function would then reduce to _ l 1 2 EP_.2.;2n:(epn+-2-). . (4.3) It should be noted that the Polynomial energy function of order 2 (4.3) is basically a combination of L1 and L2 norms. This chapter is organized as follows. In section 4.2, we analyze the learning dynam- ics of the weight update law (2.7), using the Polynomial energy function (4.3). In section 4.3, we analytically compare the learning dynamics due to the Gaussian (2.6) and the Polynomial (4.3) energy function. Software implementations of the learning dynamics are presented in section 4.4, which are then used to train a FFANN. As an illustrative example, we specialize the software to solve the XOR problem. Simulation results are reported in section 4.5. In section 4.6 we conclude with the discussion on various advan- tages and the disadvantages of using the Polynomial energy function in the weight update law (2.7). 4.2 The learning dynamics In this section, the continuous-time learning dynamics of the update law (2.7), using the Polynomial energy function, are explicitly derived. As described earlier in section 2.2, the output layer is layer 0 , with nodes indexed by n. The 0-lth layer which feeds its output to the output layer is actually the last hidden layer Hm, whose nodes are indexed by m. The learning dynamics of the weights using the Polynomial energy 44 function (4.3) in the update law (2.7); specifically for the output layer WK," = - 7 VW_ EP , (4.4i) and for any hidden layer W5- : — y VWEP . (4.4ii) Using the chain rule, the dynamics for the output layer are ° _ BEP 38p" 3y n a“ n W5” ——7. 36,," . 3y," . a; . 8w; . (45) where up" and ypn are respectively the net input and the output to the nth node of the output layer and are given by up" = 2 WM, 2p", + 9,, (4.6i) m 'and 1 an = m- (4.6ii) Here 2pm is the output of the mth node in the Hmth layer. 0,, is the threshold of the nth node in the output layer. The partial derivatives in equation (4.5) can be explicitly calcu- lated as BE P l . = n + 4.7 36,». E3 [2,, ’2'] ( 1) 8y n _ _ 33% - an (1 an ). (4.7111) and Bu 379% 1' 2pm . 
Now the weight update law can be written as

\dot{w}^p_{nm} = \gamma \sum_p \delta[p][n][O] \cdot y[p][m][H_m],   (4.8)

where

\delta[p][n][O] := \mathrm{sgn}(t_{pn} - y_{pn}) \cdot \left(\epsilon_{pn} + \frac{1}{2}\right) \cdot y_{pn}(1-y_{pn})   (4.9)

and y[p][m][H_m] is the output z_pm of the mth node of the H_m th layer.

The weight update law for the weights w_kj, connecting the jth node of a hidden/input layer H_j to the kth node in the subsequent hidden layer H_k, is derived similarly and is given by

\dot{w}^p_{kj} = \gamma \sum_p \delta[p][k][H_k] \cdot y[p][j][H_j],   (4.10)

where

\delta[p][k][H_k] := \sum_l \delta[p][l][H_l] \cdot w_{lk} \cdot y_{pk}(1-y_{pk}).   (4.11)

In (4.11), l is the index of the nodes in the H_l th layer to which node k feeds its output y_pk, and w_lk is the weight of the corresponding connection. y[p][j][H_j] is the output of node j in the H_j th layer which is fed to node k in the H_k th layer.

4.3 Analytical Comparison with the Gaussian Energy Function

Error back-propagation learning with the update law (2.7) using the Gaussian energy function (2.6) is very widely accepted. In this section we compare the (Gaussian) learning (2.14) and the (Polynomial) learning (4.4). In order to do so, we use a criterion for the learning performance similar to that discussed in subsection 3.3.1; the comparison of the learning dynamics using the Polynomial and the Gaussian energy functions is made in subsections 4.3.1 and 4.3.2.

4.3.1 The learning dynamics for a single pattern

In this subsection we consider the case when we only use one pattern. For ease of notation we will drop the subscript p and define the Gaussian (sum-of-the-squared) and the Polynomial energy functions as

E := \frac{1}{2}\sum_n \epsilon_n^2   (4.12i)

and

E^p := \frac{1}{2}\sum_n \left(\epsilon_n + \frac{1}{2}\right)^2,   (4.12ii)

respectively. By straightforward calculations, for the output layer equation (4.4i) can be rewritten as

\dot{w}^p_{nm} = -\gamma \left[\nabla_{w_{nm}} E + \nabla_{w_{nm}} E_1\right],   (4.13i)

where

E_1 := \frac{1}{2}\sum_n \epsilon_n.   (4.13ii)

For the hidden layer H_k the learning algorithm in (4.4ii) can be similarly rewritten as

\dot{w}^p_{kj} = -\gamma \left[\nabla_{w_{kj}} E + \nabla_{w_{kj}} E_1\right].   (4.13iii)

In order to formally compare the learning performance due to the Gaussian and the Polynomial energy functions, we state Theorem 4.1.

Theorem 4.1: Assume that Assumption 3.1 holds. Consider V := E given by (4.12i) to be a candidate Liapunov function. Let \dot{V} be the derivative of V along the trajectories of system (2.14) and \dot{V}^p be the derivative of V along the trajectories of system (4.13). Then the learning performance of system (4.13) is "faster" than or at least is the same as that of system (2.14).

The proof of this theorem, presented in Appendix G, analytically shows that the learning performance improves if the Polynomial energy function is used in comparison with the usual sum-of-the-squared energy function.

4.3.2 The learning dynamics for multiple patterns

Now, we consider the learning dynamics of the weights for P patterns. The weight update law is given by equation (2.14), where E is given by (2.6). The learning dynamics of the weights due to the Polynomial energy function are of the form (4.4), where E^p now is given by (4.3). By straightforward calculations, \dot{w}^p_{nm} can be written as

\dot{w}^p_{nm} = -\gamma \left[\nabla_{w_{nm}} E + \nabla_{w_{nm}} E_1\right],   (4.14i)

where

E_1 := \frac{1}{2}\sum_p \sum_n \epsilon_{pn}.   (4.14ii)

Similarly, for the hidden layers the learning algorithm of the Polynomial energy function (4.3) can be written as

\dot{w}^p_{kj} = -\gamma \left[\nabla_{w_{kj}} E + \nabla_{w_{kj}} E_1\right].   (4.14iii)

When the network is trained with multiple patterns, we formally compare the learning performance due to the Gaussian and the Polynomial energy functions in Theorem 4.2.

Theorem 4.2: Assume that Assumption 3.2 holds. Consider a candidate Liapunov function V := E given by equation (2.6).
Let \dot{V} be the derivative of V along the trajectories of (2.14) and \dot{V}^p be the derivative of V along the trajectories of the system (4.14). Then the learning performance of system (4.14) is "faster" than or at least is the same as that of system (2.14).

Again the proof of this theorem is included in Appendix H. The analysis clearly indicates that the learning performance in "speed" can improve (or at least does not deteriorate) when the Polynomial energy function is used.

Proposition 4.1: The equilibria of system (4.14) are also the equilibria of system (2.14).

The proof of Proposition 4.1 is provided in Appendix I.

Remark: Proposition 4.1 implies that when we use the Polynomial energy function (4.3) in the weight update law (2.7), we still converge to one of the equilibria of system (2.14). However, this does not mean that, starting from any initial condition, the trajectories followed by both systems would be the same or that the trajectories would converge to the same equilibrium point.

4.4 Software implementation of the learning dynamics

For the software implementation of the learning dynamics of the weights connected to the output layer (4.8) and the weights connected to the hidden layers (4.10), the learning algorithm formalized in Table 2.1 is used. The weight update laws (4.8) and (4.10) are implemented in computer software using the C language on Sun SPARC stations. A fourth-order Runge-Kutta integration routine is used for the integration of these update laws.

A stopping criterion ω was set to determine if the system has converged to a solution. The parameter ω was defined as

\omega := \sum_{w_{kj}} |\Delta w_{kj}|,   (4.15)

where the w_kj are all the weights in the network and

\Delta w_{kj} := w_{kj}(t+\Delta t) - w_{kj}(t).   (4.16)

The term Δt is the time step chosen for the fourth-order Runge-Kutta integration of the update laws. An error criterion ξ was used to characterize the training: the lower the value of the error, the better the training. The (output) error parameter is defined as

\xi := \sum_p \sum_n \epsilon_{pn}^2.   (4.17)

Note that this error parameter is essentially the Gaussian energy function. The weights of the network can be initialized by giving their values or selected randomly by a random number generator of the UNIX system.

The training simulator, using the Runge-Kutta integration routine, reads the input data from an input file, which provides the values of the time step Δt, the learning rate γ, the error criterion ξ, the stopping criterion ω, the network structure, the training patterns and the corresponding target values. The training is terminated when either the stopping criterion ω or the error criterion ξ is achieved. The simulator writes the intermediate and/or the final values of the network weights and/or other parameters into an output file.

4.5 Simulation example: the XOR problem

The training performance of the three software simulators, using the Polynomial, the Cauchy and the Gaussian energy functions, is described and a comparison is made in this section. A prototype FFANN used to solve the XOR problem had two input passive nodes (a passive node does not include a sigmoidal transfer function), two hidden nodes and one output node, as shown in Figure 3.2.

The network was fully connected and the weights were initialized randomly by a random number generator. The seed, which is a number used to start the random number generator, is used to represent the set of initial weights. The value of any initial weight w is restricted to -0.9 ≤ w ≤ 0.9.
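As an illustration (a sketch; the dissertation's software relies on a UNIX random number generator whose exact calls are not listed), seed-based initialization into [-0.9, 0.9] can be written as:

/* Sketch: seed-based random initialization of the weights in [-0.9, 0.9]. */
#include <stdlib.h>

void init_weights(double *w, int n_weights, unsigned seed)
{
    srand(seed);  /* the seed identifies the set of initial weights */
    for (int i = 0; i < n_weights; i++)
        w[i] = 1.8 * ((double)rand() / RAND_MAX) - 0.9;
}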
We assume that the network has converged when the stopping criterion ω ≤ 0.0001 is satisfied. In all the simulation cases, the network converged to a local minimum. A comparison of 6 different sets of initial conditions is depicted in Figure 4.1. In all six simulations, γ and Δt are chosen to be 1 and 0.1 seconds, respectively. For the cases when the weights were initialized by sets 1, 3, 5 and 6, the training did achieve excellent results. In the other cases, when the weights were initialized by sets 2 and 4, the network was stuck in an undesirable local minimum with an error value equal to 0.5. The training results using these six different sets of initial conditions are summarized in Table 4.1.

Figure 4.1: Comparison of the Gaussian (dotted), the Cauchy (dashed) and the Polynomial (solid) learning for the six sets of initial weights (error versus iterations). In all of the above simulations, the Polynomial learning algorithm achieves the same error with fewer iterations.

Table 4.1: Training results comparing the Gaussian, the Cauchy and the Polynomial learning for the six sets of initial conditions: the final error, the number of iterations and the computer time, both when the weight-change criterion ω ≤ 0.0001 is achieved and when the error criterion ξ is achieved.

It should be noted that in all six simulations the (Polynomial) learning algorithm achieves the same error performance as the Gaussian or the Cauchy learning, in many fewer iterations and less computer time. Finally, we also observe that the update law (4.4) outperforms the update laws (2.14) and (3.5) in achieving any fixed error value in terms of the number of iterations required to realize that error value.

As the bulk of the number of iterations is executed very near the local minimum, it is useful to set an error criterion ξ to a desirable value along with the stability criterion ω, for terminating the training. For the cases when the weights were initialized by sets 1, 3, 5 and 6, ξ was set to a value of 0.005. In the other cases, when the weights were initialized by sets 2 and 4, ξ was set to 0.51.

4.6 Discussion

Training of a FFANN with the update law (2.7) using the sum-of-the-squared energy function (2.6) ensures that the L_2 norm of the error vector ε, for all the output nodes n = 1,2,...,N and all the patterns p = 1,2,...,P, decreases as the training proceeds and finally converges to a local minimum value. While the L_2 norm of the error vector decreases, other L_p norms (1 ≤ p ≤ ∞) may increase or decrease. Restricting our attention to the present case, where we are using the Polynomial energy function of order 2, we only consider the L_1 and the L_2 norms. To illustrate the above fact, we consider a simple case of a trajectory in a two-dimensional error space in
9...... ....... .8... 3.83... 9.... .... ....... 9...... .3... .8... 89...... ...e. .3... . mean... 98. .8... 59... 98... .8... ..... 98... .9... ...... 9...... 98. a... :9... 98... 2.... .8... 98... .9... ..9.. o 3.83... 98. .3. .9... 98... ..... .9... 98.8. ....... ....... . can... 98. a... 8... 98... .3... .3... 98... 2.... .2... 98.... 98. a... 8... 98... 2...... ...... 98... 3.... ...... 3.353... 98. 8... .9... 98... ..... .9... 988... ....... 5... .55....» 3...... 8:62...» :6 6.8%.... :8 9.8.... E... :8 $23.03.... .8:..=m. n...»— I... .afiagus 9.8 8:83am...» 8 5.98 .2. A»... .5...» =8 0.....- «.9... So 9......» 9.... 50 3.553.... 32.5. 2.828. 3:88.... 98 8...... 50 «:9 2.8193 3.. 50 £05... 9.5m... 3.83.3 8 .m 8.5.6... 53 Figure 4.2. If the training starts from point e,- and proceed along the trajectory and settles down at a local minimum ef , we observe that while the L2 norm of the error vector decreases the L1 norm increases throughout this learning. We can quantify the possibility of increase in the L1 norm while the L2 norm decreases at a point, in terms of the shaded area in Figure 4.2; the larger the area, the higher is the possibility. Similarly if we train the network, by decreasing the L1 norm of the error vector 8, we will notice as depicted in Figure 4.3, that the L2 norm of the errors may increase. We can again quantify the possibility of increasing the L2 norm while the L1 norm decreases at a point, by measuring the shaded area in Figure 4.3; the larger the area, the higher is the possibility. If e; is a point, where the L1 norm is tangent to the L2 norm (i.e) the shaded area measures zero, then possibility of a norm increasing while the other norm is decreasing reduces to zero at that point. This is depicted in Figure 4.4. This criterion can be easily extended to an n-dimensional error space, where we have an n-dimensional sphere representing the L2 norm and n-1 dimensional surface representing the L1 norm. The enclosed volume between the two norms in the direction of the trajectory, measures the possibility of increasing the output error in one norm while the error decreases in the other norm. This quantification is another aspect of this research which can be pursued further independently. When a network is trained with the Polynomial energy function, we decrease the error in the L1 as well as the L2 norm. We observe from Figure 4.5 that we reduce the possibility of any norm increasing while the other is decreasing with the measure given above. Giving more weight to any norm in the energy function, will enhance the tendency of decreasing that specific norm more than the other. In Other words we enhance the pos- sibility of other norm to increase while the first is decreased. This is illustrated by the fol- lowing example, in which we give more weight to the L1 norm in the energy function. We use the modified Polynomial energy function E "'P , which is /Z\ \/ Figure 4.2: Trajectory showing the increase in Llnorm while 14 norm is decreasing. 8i and 8f represent the initial and the final points of a trajectory in the error space. The size of the sha- ded area represents the possibility of L1 norm increasing while 14 norm is decreasing. 81 L1 1,2 Figure 4.3: Trajectory showing the increase in L; norm while L1 norm is increasing. 6i and 6f represent the initial and the final points of a trajectory in the error space. The size of the sha- ded area represent the possibility of 14 norm increasing while L1 norm is decreasing. 
E^{mp} = \frac{1}{2} \sum_p \sum_n (\epsilon_{pn} + 1)^2.   (4.18)

Figure 4.4: Trajectory showing both the L_1 and the L_2 norms decreasing. ε_i and ε_f represent the initial and the final points of a trajectory in the error space. The L_1 norm is tangent to the L_2 norm at the initial point of the trajectory, indicating that both norms of the error vector decrease as the training proceeds along the trajectory.

Figure 4.5: Energy level diagram of the Polynomial energy function. The network trained with the Polynomial energy function reduces the possibility of the error increasing in one norm (L_1 or L_2) while it is decreasing in the other.

When the network is trained with the weight update law (2.7) using the modified Polynomial energy function and the initial weights are given by set 5, we notice in Figure 4.6 that at some stage the error, which is the L_2 norm, starts to increase. Also notice that the L_1 norm and the modified energy of the errors keep decreasing throughout this training until the local minimum is achieved.

Figure 4.6: Training results for the XOR problem, trained with the modified Polynomial energy function. The L_1 and L_2 norm values are represented by dotted and solid lines, respectively. Note that the L_2 norm of the error increases while the L_1 norm decreases after 18000 iterations.

Besides the main advantage of increased speed of convergence, in terms of the number of iterations and less time, we can pick an energy function that lowers the value of the errors in the desired norms, to suit our requirements. Using the Polynomial energy function to train a FFANN, in comparison with the usual sum-of-the-squared energy function, is thus of threefold advantage: (1) speed of convergence in terms of computer training cycles, (2) time, and (3) the choice to decrease the output error in any particular norm or combination of norms.

CHAPTER 5
THE EXPONENTIAL ENERGY FUNCTION

5.1 Motivation

In chapters 3 and 4 respectively, we used the Cauchy (3.4) and the Polynomial (4.3) energy functions in the weight update law (2.7) to speed up convergence to a local minimum. Using the Cauchy energy function ensures convergence to a local minimum in fewer steps in comparison with the Gaussian energy function; due to more computations in each step, however, the Cauchy learning may take more computer time than the Gaussian learning. When we used the Polynomial energy function (4.3) in the update law (2.7), the algorithm outperformed the Gaussian and the Cauchy learning not only in the number of steps but also in computer time.

We speed up the convergence to a local minimum by using the Cauchy or the Polynomial energy function in the continuous-time update law. The local minimum that we converge to, however, may not be good enough to solve the problem for which the network is trained. For instance, in the case of the XOR problem, when we initialize the weights with sets 2 and 4, we do not converge to good minima. In this chapter, we formalize skipping the minima that are at a higher energy value and converging to an acceptable minimum at a lower energy value. Many researchers have proposed various modifications of the discrete-time update law (2.5), based on heuristic arguments, to speed up the convergence to acceptable weights [7],[8],[58],[59]. We propose an exponential energy function [60] of the form

F = e^{\kappa E},   (5.1)

to be used in the weight update law (2.7) to overcome the above-mentioned problems. Here κ is any positive real number whose parameterization is discussed later in this chapter.
When we use the Exponential energy function (5.1) in the continuous-time update law (2.7), the system is still a gradient descent system with all of its nice properties. In section 5.2 we include the learning dynamics of the update law (2.7) using the Exponential energy function F. We take discrete steps when we implement the update law (2.7) via software on a digital computer; this feature adds the capability of skipping the undesirable local minima. We present explicit derivations of the learning algorithm based on the Exponential energy function in that section. In section 5.3, we compare the learning dynamics of the weight update law (2.7) when it uses the Exponential energy function F and when it uses a simple energy function E, where E can be the Gaussian, the Cauchy, the Polynomial or any other suitable energy function. Software implementations of these derivations are then used to train a FFANN. As an illustrative example, we specialize the software to solve the XOR problem in section 5.4. Simulation results are presented in section 5.5. We conclude by discussing the advantages and the limitations in section 5.6.

5.2 The learning dynamics

Consider the learning dynamics of the weights using the simple energy function, which may be the Gaussian (2.6), the Cauchy (3.4) or the Polynomial (4.3) energy function; specifically,

\dot{w} = -\gamma \nabla_w E,   (5.2)

where w represents all the weights in the network. Now consider the learning dynamics of the weights using the Exponential energy function (5.1); specifically,

\dot{w}^e = -\gamma \nabla_w F   (5.3i)
          = -\gamma \kappa e^{\kappa E} \nabla_w E   (5.3ii)
          = -\gamma \chi \nabla_w E   (5.3iii)
          = \chi \dot{w},   (5.3iv)

where χ is the skipping factor and is defined as

\chi := \kappa e^{\kappa E}.   (5.4)

Observe that χ is constant at any particular step of the integration and varies exponentially with the output error (energy) from one step to another. As the learning proceeds and the output error (energy) E decreases, the skipping factor χ also decreases exponentially. Therefore the likelihood of skipping the local minima which are at higher energy levels is greater than that of skipping the local minima at lower energy levels. Here \dot{w} gives the learning dynamics of the update law (5.2); these learning dynamics have already been explored in chapters 2, 3 and 4, corresponding to the Gaussian, the Cauchy and the Polynomial energy functions.
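A sketch of a single discrete step under (5.3) (illustrative; E and its gradient are assumed to have been produced by the back-propagation pass of Table 2.1, and the names are hypothetical):

/* Sketch: one discrete step of the Exponential learning (5.3).
   E is the current simple energy (Gaussian, Cauchy or Polynomial) and
   grad[i] its gradient with respect to weight i. */
#include <math.h>

void exponential_step(double *w, const double *grad, int n_weights,
                      double E, double kappa, double gamma, double dt)
{
    double chi = kappa * exp(kappa * E);      /* skipping factor, eq. (5.4) */
    for (int i = 0; i < n_weights; i++)
        w[i] += -gamma * chi * grad[i] * dt;  /* Euler form of (5.3iii) */
}

The product γχ acts as an energy-dependent learning rate: it is large while E is large, helping the discrete steps jump over poor minima, and it shrinks as E decreases.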
Let V be the derivative of V along the the trajectories of (5.2) and V‘ be the derivative of V along the trajectories of the system (5.3), then the learning performance of system (5.3) is better than or at least the same as system (5.2), provided the skipping factor x given by (5 .4) is greater than or equal to unity. Proof : In order to analyze the learning performance, let’s evaluate V‘ - V as given below. V‘-V=VwV.(We-W) Py- q r- “W =[..VWE..].—'y1 waE - V,',E >, (5.5i) = - y§[x - 1] [Vw E] 2. (5.5a) In order for the update law (2.7) using the Exponential energy function (5.1) to improve (or at least does not deteriorate) the learning performance , V‘ - I} should be non- positive. This implies x 2 1. (5.6) To improve the learning performance, the relationship (5.6) imposes the following res- triction on the energy function. _ lnx E 2 —K_' (5.7) 62 The above relationship is depicted in Figure 5 .1. We observe that for non-positive K the system does not behave as a gradient descent system. For positive K the learning performance only improves if for a certain value of K, E has a value that satisfies the relationship (5.7). We also observe that if K 21, the learning with the Exponential energy function always improve. If for some reason we want faster learning for some higher energy value greater than E o and slower learning for E < Eo, then we can chose appropriate value of K such that 0 < K < l. 20- 15— E10— o_L_----- Figure 5.1: Graph representing the relationship (5.7). It is an essential condition to improve the learning performance of system (5.3). The learning performance only improves if the parameters are chosen above the graph line representing (5.7). 63 Proposition 5.1: The equilibria of system (5.2) and system (5.3) are same. The proof of Proposition 5.1 is provided in Appendix J. Remark: Proposition 5.1 implies that when we use Exponential learning (5.3) we still converge to one of the equilibria of system (5 .2). However, this does not mean that start- ing from any initial condition, the trajectories followed by the systems would be same or the trajectories would converge to the same equilibrium point. 5.4 Software implementation of the learning dynamics For the software implementation of the weight update law (5.3), the learning algo- rithm formalized in Table 2.1 is used. The weight update law (5 .3) is implemented into computer software using the C-language on Sun SPARC stations. Fourth order Runge- Kutta integration routine is used for the integration of these update laws. A stopping criterion to, was set to determine if the system has converged to a solu- tion. The parameter to was defined as a) :2 MW I. (5.8) where w are all the weights in the network and Aw := w(t+At) - w(t). (5.9) The term At is the time step chosen for the fourth order Runge-Kutta integration of the update laws. An error criterion 5 was used to characterize the training: the lower the value of the error, the better is the training. The (output) error parameter is defined as § := g g spa 2. (5.10) Note that this error parameter is exactly the Gaussian energy function. The weights of the network can be initialized by giving their values or selecting randomly by a 64 random number generator of the UNIX system. 
The training simulator, using the Runge-Kutta integration routine, reads the input data from an input file, which provides the values of the time step Δt, the learning rate γ, the error criterion ξ, the stopping criterion ω, the energy function parameter κ, the network structure, the training patterns and the corresponding target values. The training is terminated when either the stopping criterion ω or the error criterion ξ is achieved. The simulator writes the intermediate and/or the final values of the network weights and/or other parameters into an output file.

5.5 Simulation example: the XOR problem

The training performance of the software simulators, using the Exponential and the simple energy functions, is described and a comparison is made in this section. For the simple energy functions, we used the Gaussian (2.6), the Cauchy (3.4) and the Polynomial (4.3) energy functions. A prototype FFANN used to solve the XOR problem had two input passive nodes (a passive node does not include a sigmoidal transfer function), two hidden nodes and one output node, as shown in Figure 3.2.

The network was fully connected and the weights were initialized randomly by a random number generator. The seed, which is a number used to start the random number generator, is used to represent the set of initial weights. The value of any initial weight w is restricted to -0.9 ≤ w ≤ 0.9.

We assume that the network has converged when the stopping criterion ω ≤ 0.0001 is satisfied. In all the simulation cases, the network converged to a local minimum when we chose the energy function parameter κ = 1. A comparison of 6 different sets of initial conditions is depicted in Figures 5.2, 5.3 and 5.4. In all six simulations, γ and Δt are chosen to be 1 and 0.1 seconds, respectively. For the cases when the weights were initialized by sets 1, 3, 5 and 6, the training did achieve excellent results. In the other cases, when the weights were initialized by sets 2 and 4, the network was stuck in an undesirable local minimum with an error value equal to 0.5. The training results using these six different sets of initial conditions are summarized in Tables 4.1 and 5.1.

We increase the tendency of skipping the local minima, in order to avoid the cases that we observe when the weights are initialized by sets 2 and 4, by choosing a larger skipping factor. This is achieved by selecting κ = 1.8, which subsequently increases the skipping factor χ. Comparisons for the six different sets of initial conditions are made in Figures 5.2, 5.3 and 5.4. The training results of these simulations are summarized in Table 5.2.

We observe in Figures 5.2, 5.3 and 5.4 that when κ was chosen to be 1.8, the update law (2.7) using the Exponential of the Polynomial energy function skips the local minima when the weights were initialized by sets 2 and 4. For the same two sets of initial weights, the update law using the Exponential of the Gaussian and of the Cauchy energy functions still could not skip the local minima. It can be shown easily that by increasing the value of κ we can skip these undesirable minima also.

In all of the simulations, observe that the Exponential (5.3) learning achieves the same error criterion as the simple (5.2) learning in many fewer iterations and less computer time. Depending upon the value of κ, or of the skipping factor χ, the Exponential learning algorithm can be made to skip the undesirable local minima, thus allowing it to converge to one of the desirable minima.

As the bulk of the number of iterations is executed very near the local minimum, it is useful to set an error criterion ξ to a desirable value along with the stability criterion ω, for terminating the training. For the XOR problem, the error criterion ξ = 0.005 provides reasonably good results and saves a lot of computer steps and time.

Figure 5.2: Comparison of the Exponential (κ = 1 and κ = 1.8) and the Gaussian learning for the six sets of initial weights (error versus iterations). The Gaussian is represented by dotted lines, whereas the Exponential with κ = 1 and κ = 1.8 is represented by dashed and solid lines, respectively.
As the bulk of the number of iterations is executed very near the local minimum, it is therefore useful to set an error criterion g to a desirable value along with the stability criterion 0) , for terminating the training. For the XOR problem the error criterion §= 0.005 provides reasonably good results and it will save us lot of computer steps and time. 66 1+ error 0.5 ._ error -0 _ I I I I T I I -0 5000 10000 15000 -0 5000 10000 set # 1 - iterations set # 2 - iterations 1 error 0.5 — error -0 _ ................................ I I I I I I -0 5000 10000 15000 -0 5000 10000 set # 3 - iterations set # 4 - iterations I I I I I I I I -0 5000 100001500020000 -0 5000 10000 15000 set # 5 - iterations set # 6 - iterations Figure 5.2: Comparison of the Exponential (KL-1, 1.8) and the Gaussian learning. The Gaussian is represented by (...) whereas the Exponential (K‘=l, 1.8) is represented by (---) and (_) respectively. 67 1 error 0.5 _ error -0 _ ”‘M WW I I T I I I I I m -0 5000 10000 15000 -0 2000 4000 6000 8000 set # 1 - iterations set # 2 - iterations 1 - 1.2 — .2 i. It 1 ._ error 0.5 — error 0.8 q . 0.6 — _0 _ , .... k; ....................... I I I I n I I I I -0 5000 10000 15000 -0 2000 4000 6000 8000 set # 3 - iterations set # 4 - iterations 1 1 ° . . . 1. “0’05— -0 _ gm n. bxm I I I I I I I -0 5000 10000 15000 -0 5000 10000 set # 5 - iterations set # 6 - iterations Figure 5.3: Comparison of the Exponential (ml, 1.8) and the Cauchy learning. The Cau- chy is represented by (...) whereas the Exponential (Ic=l, 1.8) is represented by (—--) and (__) respectively. 68 1 w\ ............. ‘1... t 1 .... \° ........... \ ....... 1 \‘ tl \ . error 05 " E error 0 5 4 \_ ......................... _0_ LL ......... .0_ L I I I I I I I I I -0 1000 2000 3000 4000 -0 1000 2000 3000 4000 5000 set # 1 - iterations set # 2 - iterations 1 _ P ............ .2 \ 1 —l “‘1'. ...... \ ..... I. ‘ ‘I error 0.5 -— 1 error t 0 5 _ ¥ oooooooooooooooooooooooo i -0 _ L g ' ......... .0 _. L T I T I I I I T I I -0 1000 2000 3000 4000 -0 1000 2000 3000 4000 set # 3 - iterations set # 4 - iterations 1 __ 152‘ ............. 1 _ \;'.. ....... \ ‘\ I ‘ '. error 0.5 — ‘. cm" 0 5 —- 1 '. I i t . _0 _ L k °' ..... -0 — L \_ .................. I I I I I I I -0 2000 4000 6000 -0 1000 2000 set # 5 - iterations set # 6 - iterations Figure 5.4: Comparison of the Exponential (14:1, 1.8) and the Polynomial learning. The Polynomial is represented by (...) whereas the Exponential (x=l, 1.8) is represented by (---) and (__) respectively. ace—o u.— 69 fi _ 555% as _ 35.93 «2 a em _ . N 2.2.3 9580: _ . Ea». tam—.8 _ _ 39 u an ”Range @383 23—. t on menu; 33%.88 _ «:9 t on ”.233“ gauge _ omens 8.8“ .8; 8.8 8.8.8 38¢ ~38 8.8.3 .388 88.8 983. 8.8“ :8 8.8 c.8§ ~88 6%.. 8.88.3 :88 2 .3 88558 8.8“ :8 :8 8.8.3 58 3.8 8.888 488 8.8 u 9.33 as 83 :8 8.85 38 8:5 8.8.8 .88 8.8 98.: as 88 8.8 328 am: :95 888 .88 8...: 8.5.8.8 is 83 8% 8.8.8 3% 3.8 88; 8.5 88.8 a means 8.8“ ES .838 8.8—8 :5 “and 8.8.8 as; 898 98.: c.8u 88 8.: 98.8 8.8 .88 8.888 :88 8 .8 3.3538 98“ :8 5.8 8.8—8 .2: 3.8 8.888. 8.: 8.98 a eczema. 88 8: 3% 32% 883 88.8 888 .88 Sea l 9.53. 35 88 88 8.8.2 38.. 5.8 888 $88 83.8 8.3538 58 83 5.8 8.8.2 33 8.8 38E use 2%.. a mean. c.8u an: and» 8.8.8 .88 2 _.8 8.8.3 38¢ 2 .8 98.: 8.8“ 8: 3.8 8.8—8 33 8.3 9888 .88 84.8 3.485.». 8.8“ :8 5.8 88—8 83 8.8 8.888— 3: «3.8 8 cases. 2.8 :8. 5.: 8.85 3:. 88.8 8.85 35. 88.8 983. e.§ as 8.3 8.8;. ~88 88:3 888 at: 85.. 8.483.». c.8u 8a 8.8 8.8.8 as 3.3 8.888 88.. 898 diam—i .833 8333a Go mag—«BE can 3 on 509:3»... 
:8 Ex Ex. :6 3.3538. gwfi. 1:8. 3.— 55525-8 98 83m 90 mix—.883. Ann—ven§0w§.50n§3.na=6ve_§§§ n§§.§8981§=8n§3§3n§n5o§82§03§emu 8328. 70 as...» 9. _ 53.5% a.» a.» .2» «a. a 5.. . 0:88. 3.83.. I......». aim...» — 23» * om .885...» @388. _ 9.3» a 5.. .8333 a 5.. .839...» . 02.8.»: 98m 83 49% 98.3 . aged. 98w Ba... 5...» 98.3 3.553.». 98m 8.. 93 98.3 n 02.8.»: 9m... ”6... awn... 9.8.3 988.... 0283. 9m... cum 9% 98.8 3... “a... 9.8.3 3.8 .393 3.553.». 9m... 8 9... 98.3 8 9... 9°88: 380 5......» H. 09.8.»: 98m .35 8.... 98.8 New”... v.98 989.8 33» 3...... 028.... 98m 88 up...» 985m 3..» 3.8 988.8 3%.. 89... I 3.353.». 98. 3.. 93 98.». as M...» 99589 3.8 393 A 088.»... 9m... 38 3.8 98.9 3...... .35.. 9.8a... Sue... 398 0253. 9m... one 9% 93.3 3.8 .945 9.8.5.. .53» 393 $55.53.». 9.... ma 9% 98.9 3 9% 9958.. .33.. 3...... M 02.8.»: 98m 33 88 98.3 58 398 9838 to». 3...: 9.5.6. 98m ”5.6 9.5. 98.3 .8.» 3.8 9888 5.80 um...» 3.355.». 98m .3 Pu. 98.3 5... 9w. 9:88.. «.83 3.... o 69.8.»... 98m .38 8b.. 98.». ~83 $98 988.... .38 uqmu. 955 98. 88 ~98 98.». .82 3.8 988.... 5.3.. 898 123.». 98. u... 9.... 98.». um... Maw 9958.. 3.8 .393 .....»......m 3a....» 83.513 So mace—.89.». c. u .9 5.. a.» 65.8.»... ...o 9.83 E... a.» 3.553.». .83.»? 3:». a... 5835......» a»... 5.5 5» 9.53:2». 3.85.. an n .8 5.. 5» 02.8.»... a.» 9.53. E... a.» 3.353.». 2.95. 3.5.9.». 8986...... a...» £30.. a.» 35» 2.82.9. n 2... 5» £05... n..»..mo 2.815.. 8 .m 85.96... 71 5.6 Discussion Weight update law (2.7) using the Exponential energy function (5.1) corresponding to the Gaussian (2.6), the Cauchy (3.4) or the Polynomial (4.3) energy function provides a variable skipping factor x, which changes with each computer step. The learning dynamics of this system is given by (5.3iii) as W‘ = — y x Vw E. If we consider 7); as the cumulative learning rate, then we can say that the learning rule (5.3iii) provides a variable learning rate, which varies with each computer step. The changes in the cumulative learning rate are such that it is large at higher energy levels and is small at lower energy levels. In digital software based networks, like the one we used in this work, the concept of variable learning rate is vary useful. The rea- son is that, we want the undesirable minima at higher energy levels to be skipped whereas it should converge to desirable minima at lower energy levels. As the calculations in section 5.3 reveal, the weight update law (2.7) using the Exponential energy function contributes to overall better learning performance as com- pared to when the Gaussian, the Cauchy or the Polynomial energy function is used. This is observed practically in all of our simulations with the XOR problem. Some of these simulations are reflected in Figures 5.2, 5.3 and 5.4. Apparently, we can make the learning as fast as desired, by choosing appropriate values of K. This is misleading because if we increase the values of the skipping factor beyond certain limit (which varies from problem to problem), the weights change so abruptly that the outputs of the network clamp to saturation (O or 1) values. As a result the partial derivatives of the outputs with respect to the corresponding inputs of the neu- rons are very very small. Consequently, the error signal (2.19), (3.10) and (4.9) which is propagated back is negligible, resulting in a smaller weight change. This makes the algorithm converge to a false minimum. This can be avoided to a certain extent by using 72 lower gain neurons as described in section 2.1, or using appropriate cost factor. 
Alternatively, we restrict the value of κ to a certain limit. When the weights are initialized randomly within the range [-0.9, 0.9] (as is the case in all of the simulations), restricting the value of κ to [1, 2.5] normally provides good results. The exact parameterization of κ is not addressed here and can be pursued independently, depending upon the task.

CHAPTER 6

THE PATTERN/CHARACTER RECOGNITION PROBLEM

6.1 Problem description

Pattern recognition is one of the major application areas of FFANNs. In this chapter we employ FFANNs for character recognition; the characters chosen for this problem are the Arabic numerals in different fonts. Character recognition using a FFANN has attracted the attention of many researchers, for it finds many applications in office automation and computer-aided design. Merging the classical techniques of pattern recognition with the classification capabilities of a FFANN, the process of character recognition as viewed by many researchers [61]-[63] consists of feature extraction by classical methods and classification by a FFANN. Many others, motivated by the association and classification capabilities of the human brain, expect both tasks to be performed by a FFANN [64]-[66]. Networks reminiscent of the Neocognitron [64], dedicated to solving the character recognition problem, use a large number of neurons and trainable weights. Many researchers exploit the capabilities of multilayer perceptrons [17] to solve the character recognition problem by using a sufficient number of neurons in the hidden layers [27]-[29].

We decompose the character recognition problem into two parts: (1) feature extraction or preprocessing and (2) classification. We use a two-layer FFANN for feature extraction and an additional one-layer network for classification. The two networks are concatenated in a feedforward manner, the feature extraction network followed by the classification network.

In section 6.2, we discuss the data-base used for training and testing the network. The network design is discussed in section 6.3. The training and the learning algorithm used to train the network are discussed in section 6.4; in that section we also carry out the learning evaluation of update law (2.7) using different energy functions. In section 6.5, the chapter is concluded with a discussion of the network design and the various learning algorithms used.

6.2 The data-base

We have collected 100 different fonts from printed material [67]-[70]. Each of these printed fonts was scanned using a Toshiba scanner and stored as a "Tiff" file. The files were subsequently converted to the "Hips" format, which easily handles binary data. The pixel images of the various digits in the different fonts were then obtained from the formatted Hips files.

The images of these fonts were stretched appropriately to fit within a pixel window of 17x15 dimension while leaving a small margin on all sides. With this preprocessing, we also achieve scaling invariance of the characters to some extent. The grey levels of these pixels were set to 0 or 1. Pixel images of some of the fonts formatted in this way are depicted in Figure 6.1. The collected fonts include roman, bold and italic versions of different classes. The network was trained with half of these fonts (which included roman, bold and italic versions of some of the classes). The second half of the data-base was used to test the character recognition and generalization capabilities of the network.
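As an illustration of this preprocessing, the fragment below sketches how a scanned grey-scale glyph might be reduced to the 17x15 binary input array. It is a hypothetical sketch rather than the routine actually used: the nearest-neighbour resampling, the 0.5 threshold, and the omission of the margin handling are all assumptions.

    /* Sketch: stretch a grey-scale glyph of arbitrary size (h x w) onto
       the 17x15 input window by nearest-neighbour resampling, then set
       each grey level to 0 or 1.  The threshold and resampling scheme
       are illustrative assumptions. */
    #define ROWS 17
    #define COLS 15

    void format_glyph(const double *grey, int h, int w, int out[ROWS][COLS])
    {
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++) {
                int sr = r * h / ROWS;             /* nearest source row    */
                int sc = c * w / COLS;             /* nearest source column */
                out[r][c] = (grey[sr * w + sc] > 0.5) ? 1 : 0;
            }
        }
    }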
6.3 The network design

The character recognition network [71] that we use consists of the feature extraction and the classification networks, as depicted in Figure 6.2. The feature extraction network is designed to extract features locally, with local connections from the previous layer, while the classification is done with global (i.e., full) connections.

Figure 6.1: Pixel examples of some of the fonts used as input to the network (the legible panel labels include Badoni bold and Breite).

Training of the network used for character recognition takes a long time. We split the network into two, so that one part, the feature extraction network, which takes the major portion of the training time, can be provided as a commercially available chip trained off-line on a very large data-base. The second part is available to the user to train on-line as new data are added. We discuss the design of the feature extraction and the classification networks in subsections 6.3.1 and 6.3.2, respectively.

Figure 6.2: Block diagram of the FFANN used for the character recognition problem (Input -> Feature Extraction Network -> Classification Network -> Class).

6.3.1 The feature extraction network

A partially connected FFANN with one hidden layer is chosen to extract features of the Arabic numerals of various fonts used for common printing. The input layer of this feature extraction network is a 17x15 array of passive nodes corresponding to the input patterns. These patterns are the characters 0 through 9 in different fonts. The input passive nodes simply accept the individual components of the input image and distribute them to the specified neurons of the hidden layer. The output of each neuron is in the range (0,1), as the activation output is determined by the so-called logistic function [7]-[8]

    output = \frac{1}{1 + e^{-net}} ,    (6.1)

where net denotes the weighted sum of the neuron's inputs.

A sliding window of a 7x7 array of input nodes is connected to a neuron in the hidden layer. The window slides all over the input array, moving one column or one row of 7 input nodes at a time. As a result, the dimension of the hidden layer reduces to 11x9. A similar sliding window of a 5x5 array of hidden-layer neurons is connected to a neuron in the output layer; the array size of the output layer consequently reduces to 7x5. The structure of the network is shown in Figure 6.3.

Figure 6.3: Structure of the FFANN used for feature extraction (input layer, hidden layer, output layer).

Ten 7x5 array targets are constructed, one corresponding to each input digit. The targets are constructed so as to retain the general shape of the characters and, in addition, to be pseudo-orthogonal to each other. Figure 6.4 shows all the targets used in the training of this network. The targets would be considered orthogonal if the Hamming distance between them were equal to the dimension of the targets, namely 7x5 = 35. In the present case, it was not possible to construct orthogonal targets while retaining the geometric shapes of the digits. The minimum and the maximum Hamming distances between the targets given in Figure 6.4 are 17 and 22, respectively. As a consequence, this coding has an error correction capability of up to 8 errors in the output pixels.
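The arithmetic of the sliding-window layer can be summarized in code. The sketch below computes the hidden layer of the feature extraction network from the 17x15 input; the second layer (a 5x5 window onto the 11x9 hidden array, yielding the 7x5 outputs) follows the same pattern. The per-neuron weight patches and the bias term are assumptions about the layout, not details taken from this work; the activation is the logistic function (6.1).

    /* Sketch of the first (partially connected) layer of the feature
       extraction network: each hidden neuron sees a 7x7 window of the
       17x15 input array, giving an 11x9 hidden layer.  Independent
       weights per hidden neuron and a bias term are assumptions. */
    #include <math.h>

    #define IN_R 17
    #define IN_C 15
    #define WIN  7
    #define HID_R (IN_R - WIN + 1)   /* 11 */
    #define HID_C (IN_C - WIN + 1)   /*  9 */

    static double logistic(double net) { return 1.0 / (1.0 + exp(-net)); }

    void hidden_layer(const double in[IN_R][IN_C],
                      const double w[HID_R][HID_C][WIN][WIN],
                      const double bias[HID_R][HID_C],
                      double out[HID_R][HID_C])
    {
        for (int r = 0; r < HID_R; r++)
            for (int c = 0; c < HID_C; c++) {
                double net = bias[r][c];
                for (int i = 0; i < WIN; i++)      /* 7x7 window at (r,c) */
                    for (int j = 0; j < WIN; j++)
                        net += w[r][c][i][j] * in[r + i][c + j];
                out[r][c] = logistic(net);         /* activation (6.1)    */
            }
    }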
If pixels with similar and dissimilar activation states (specifically, a high or a low state) correlate to +1 and -1 respectively, the correlation matrix C for all the targets becomes

    C = \begin{pmatrix}
    35 & -3 & -1 & -5 & -1 & -1 & +1 & -1 & -3 & -3 \\
    -3 & 35 & -7 & -3 & -3 & +1 & -1 & -3 & -1 & -1 \\
    -1 & -7 & 35 & -1 & -1 & -1 & -7 & -5 & +1 & +1 \\
    -5 & -3 & -1 & 35 & -1 & -1 & +1 & -1 & -7 & -3 \\
    -1 & -3 & -1 & -1 & 35 & -1 & -3 & -1 & +1 & +1 \\
    -1 & +1 & -1 & -1 & -1 & 35 & +1 & -1 & -3 & +1 \\
    +1 & -1 & -7 & +1 & -3 & +1 & 35 & -3 & -1 & -9 \\
    -1 & -3 & -5 & -1 & -1 & -1 & -3 & 35 & -3 & +1 \\
    -3 & -1 & +1 & -7 & +1 & -3 & -1 & -3 & 35 & -1 \\
    -3 & -1 & +1 & -3 & +1 & +1 & -9 & +1 & -1 & 35
    \end{pmatrix} .    (6.2)

The performance of the feature extraction network can be measured by noting the correlation of the output digits with the targets. If the output is within a Hamming distance of 8 of a target, it is classified to that target; otherwise we consider the network unable to classify the particular input.

6.3.2 The classification network

The classification network is a fully connected FFANN. There is only one layer, consisting of 10 neurons corresponding to the different classes of characters, namely 0, 1, ..., 9.

Figure 6.4: Targets used in the feature extraction network for the various digits.

The output of the feature extraction network from its 35 output nodes is fed directly to all of the classification neurons. We chose orthogonal targets for all 10 classes. The targets are chosen so that, for any particular class, the output of the corresponding node is high and all others are low. If the outputs of more than one node go high, the network is considered unable to classify.

6.4 Learning evaluation with different energy functions

We used the gradient descent method with the error back-propagation learning technique for the training of the network. The feature extraction and the classification networks can be trained together or separately. We trained the two networks separately with a view to achieving the following advantages: (1) the feature extraction and the classification networks can be viewed as two separate modules; (2) the feature extraction network is fully trained on a large data-base off-line, and the classification network can be quickly trained on-line when new fonts are added. This enhances the character recognition capability of the network, as will be demonstrated in our work.

We used a training simulator developed in [35]-[39],[57],[60],[71], coding the weight update law (2.7) in the C language on a Sun SPARC station. For the integration of the update laws, the simulator uses a fourth-order Runge-Kutta integration routine. A stopping criterion ω was set to determine whether the system has converged to a solution. (The solution is actually the local minimum in the vicinity of the initial conditions in the weight space.) The weights were initialized by a random number generator; the value of any initial weight w is such that -0.9 ≤ w ≤ 0.9. We assume that the network has converged when the specified stopping criterion ω is achieved. The parameter ω is defined as

    \omega := \sum_w |\Delta w| ,    (6.3)

where w ranges over all the weights in the network and

    \Delta w := w(t + \Delta t) - w(t) .    (6.4)

The term Δt is the time step of the fourth-order Runge-Kutta integration which is used to integrate the update laws (2.7). We also used an error criterion to characterize the training: the lower the value of the error, the better the training. The error parameter is defined as

    \xi := \sum_p \sum_n e_{pn}^2 .    (6.5)

Half of the data-base was used as set 1 and the other half as set 2. Each of these sets included 50 different fonts, listed in Appendix K.
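The two termination tests can be stated compactly in code. The following sketch, with hypothetical names and a flat array layout rather than the simulator's actual data structures, evaluates (6.3)-(6.5) after an integration step; training stops when either quantity falls below its threshold (for example, ω ≤ 0.15 or ξ ≤ 400 in section 6.4.1 below).

    /* Sketch: evaluate the stopping criterion (6.3)-(6.4) and the error
       criterion (6.5) after an integration step. */
    #include <math.h>

    int converged(const double *w, const double *w_prev, int n_weights,
                  const double *e, int n_errors,   /* e_pn over all p, n */
                  double omega_stop, double xi_stop)
    {
        double omega = 0.0, xi = 0.0;
        for (int i = 0; i < n_weights; i++)
            omega += fabs(w[i] - w_prev[i]);       /* (6.3): sum of |dw|   */
        for (int k = 0; k < n_errors; k++)
            xi += e[k] * e[k];                     /* (6.5): sum of e_pn^2 */
        return (omega <= omega_stop) || (xi <= xi_stop);
    }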
We trained the network in different stages to study the response of the different modules under different conditions. Training of the feature extraction and the classification networks is discussed in subsections 6.4.1 and 6.4.2, respectively.

6.4.1 Training of the feature extraction network

We trained the feature extraction network with the update law (2.7) using the Gaussian (2.6), the Polynomial (4.3) and the Exponential (5.1) energy functions. The Cauchy energy function (3.4) is not used in the update law (2.7), as its computations grow exponentially with the number of patterns and output nodes. With 50 different training fonts the input patterns number 500, and the network has 35 output nodes; it is therefore not feasible to use the Cauchy energy function for the training of this pattern recognition problem.

Training simulators using the learning algorithm outlined in Table 2.1 were developed corresponding to update law (2.7) with the Gaussian (2.6), the Polynomial (4.3) and the Exponential (5.1) energy functions. In all the training simulations the values of Δt and γ were chosen to be 0.0005 and 1, respectively. The κ parameter of the Exponential energy function (5.1) was chosen to be 0.004. We assume that the network has converged when the stopping criterion ω ≤ 0.15 is satisfied. The error criterion ξ = 400 was also chosen to terminate the training of the network. For the Gaussian and the Polynomial energies, learning was terminated when the stopping criterion ω was achieved; for the Exponential energy, learning was stopped when the error criterion was achieved. The error ξ and the number of steps at convergence are tabulated in Table 6.1.

Table 6.1

            Gaussian   Polynomial   Exponential Gaussian
    ξ        526.25      638.96          399.96
    steps     16205       18442           10598

Table 6.1 shows the error ξ and the number of steps at convergence for the Gaussian, the Polynomial and the Exponential Gaussian learning used in the training of the feature extraction network.

After the network was trained with set 1, it was tested with set 2. The character recognition performance is reported in Tables 6.2, 6.3 and 6.4, corresponding to the Gaussian, the Polynomial and the Exponential Gaussian energy functions. In all of the results the network incorporates the error correction capability that is built into the targets (i.e., if the binarized output is within a Hamming distance of 8 of a target, the corresponding input is classified to that target).

Table 6.2

    patterns    correct   incorrect   reject
    set 1        100%        0%         0%
    set 2        92.6%       3%         4.4%

Table 6.2 shows the character recognition using the feature extraction network trained with the Gaussian learning.

Table 6.3

    patterns    correct   incorrect   reject
    set 2         93%       1.8%       5.2%

Table 6.3 shows the character recognition using the feature extraction network trained with the Polynomial learning.

Table 6.4

    patterns    correct   incorrect   reject
    set 1        100%        0%         0%
    set 2        92.4%      2.6%        5%

Table 6.4 shows the character recognition using the feature extraction network trained with the Exponential of the Gaussian learning.

On convergence we observe that the Polynomial learning achieved a better recognition performance, even though its error criterion value is higher than those of the Gaussian and the Exponential Gaussian learning. The reason is obvious if we look at the L1 norm: when the training terminated, the L1 norm of the errors for the Polynomial learning was 882.95, in contrast with 1694.85 for the Gaussian and 1398.56 for the Exponential Gaussian learning.
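The error-correcting decision rule used throughout Tables 6.2-6.4 is easy to state in code. The sketch below binarizes the 35 network outputs and classifies by Hamming distance, rejecting when no target is within distance 8; the data layout and the 0.5 binarization threshold are assumptions.

    /* Sketch of the error-correcting decision rule: binarize the 7x5
       output, then accept the unique target within Hamming distance 8.
       Returns the digit 0-9, or -1 when the input must be rejected. */
    #define N_OUT 35   /* 7x5 output nodes */
    #define N_CLS 10   /* digits 0..9      */

    int classify(const double out[N_OUT], const int targets[N_CLS][N_OUT])
    {
        for (int t = 0; t < N_CLS; t++) {
            int dist = 0;
            for (int k = 0; k < N_OUT; k++) {
                int bit = out[k] > 0.5;            /* binarize (assumed)   */
                dist += (bit != targets[t][k]);
            }
            if (dist <= 8)
                return t;  /* targets are >= 17 apart, so match is unique */
        }
        return -1;         /* no target close enough: reject              */
    }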
We also notice that the Exponential Gaussian learning achieved almost the same recognition performance as the simple Gaussian learning, but in a much smaller number of computer steps. In conclusion, an appropriate energy function can be used in the update law (2.7) to improve the learning speed and/or performance.

6.4.2 Training of the classification network

We used the Gaussian learning (2.6)-(2.7) to train the classification network. The values of Δt, ω and γ were chosen to be 0.01, 0.1 and 1, respectively, for the training of the classification network. The training was terminated when the stopping criterion ω was achieved. The error ξ and the number of computer steps at convergence are tabulated in Table 6.5.

Table 6.5

            set 1   set 2
    ξ        2.46    8.32
    steps     296     618

Table 6.5 shows the error ξ and the number of computer steps at convergence of the learning dynamics for the training of the classification network.

The classification network was trained with two different sets. First we used the same set (set 1) as was used for the training of the feature extraction network. With this set as input to the trained feature extraction network, the output of this network was fed to the classification network, which was trained on it. This is just like training the feature extraction and the classification networks together with set 1. The whole network (the feature extraction and the classification) was then tested with set 2, and the results are reported in Table 6.6.

In the second case, set 2 was input to the trained feature extraction network and the output of this network was used to train the classification network. This is done assuming that the additional data available to the users to train the classification network would be different from the data on which the feature extraction network was trained. As such, set 2 was used for training and set 1 was used for testing. The classification results of the network are reported in Table 6.7.

Table 6.6

    patterns    correct   incorrect   reject
    set 1        100%        0%         0%
    set 2         94%        2%         4%

Table 6.6 shows the character recognition when the feature extraction and the classification networks are both trained with set 1, using the Gaussian learning.

Table 6.7

    patterns    correct   incorrect   reject
    set 1        99.8%       0%        0.2%
    set 2        100%        0%         0%

Table 6.7 shows the character recognition when the feature extraction and the classification networks are trained with set 1 and set 2, respectively, using the Gaussian learning.

For both of the cases reported above, we trained the network with update law (2.7) using the Gaussian energy function. For the training of the classification network we used only one energy function (the Gaussian), as the emphasis here is on studying the classification capability and not the learning speed of the network.

Observe that when the feature extraction network is trained with set 1 and the classification network is trained with set 2 (that is, with the output of the feature extraction network when the input is set 2), the character recognition improves considerably. This feature highlights the importance of using the modular approach. Besides improving the recognition performance, it provides training convenience to users, as they have to train only a smaller network (the classification network) on the new data to be added.
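Before closing the evaluation, the decision rule of the classification layer described in section 6.3.2 can also be stated in a few lines; the "high" threshold of 0.5 is an assumption, as is the array layout.

    /* Sketch of the classification-layer decision: one neuron per digit,
       accept only when exactly one output is high.  Returns the class
       index, or -1 when zero or several nodes fire (unable to classify). */
    int decide(const double y[10])
    {
        int winner = -1;
        for (int i = 0; i < 10; i++) {
            if (y[i] > 0.5) {                /* "high" threshold (assumed) */
                if (winner >= 0) return -1;  /* more than one node high    */
                winner = i;
            }
        }
        return winner;                       /* -1 if no node went high    */
    }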
6.5 Discussion

We decomposed the problem of pattern recognition into two tasks, namely, the feature extraction and the classification of the input characters. Both tasks are performed by FFANNs. The network performing the feature extraction is partially connected, whereas the classification is done with a fully connected network. The two networks are concatenated, the feature extraction followed by the classification.

The two networks performing the feature extraction and the classification are treated as two separate modules. The feature extraction network is trained on a large data-base off-line, whereas the user can train the classification network on-line according to his or her requirements, as and when new data are added. In this work we trained the feature extraction network on a certain data-base and tested it with another set of data. The training is supervised, in that we provide the targets, which incorporate an error correction capability due to their structure.

We illustrated that the training can be improved by using an appropriate energy function in the update law (2.7). When the Polynomial energy function is used, we lower the output error in the L2 as well as the L1 norm. Observe that even when the L2 norm is higher in value, we achieve good recognition results due to the fact that the L1 norm of the output error is also reduced.

Observe also that when the classification network is trained (according to the requirements of the user) with different fonts than the training set (with which the feature extraction network was trained), the overall pattern recognition capability is enhanced. Besides improving the recognition performance, this modular approach provides training convenience to users.

The training simulators based on the update law (2.7), incorporating different energy functions, can also be used to train FFANNs for various other applications. The simulators have been used effectively to train multilayer feedforward artificial neural networks for handwritten numeric character recognition [72].

CHAPTER 7

SUMMARY AND CONCLUSION

Artificial neural networks are mathematical models originally designed to mimic some of the functions of the nervous systems of higher animals. There are two main approaches to realizing these artificial neural networks, namely, implementation via digital software and manufacture via analog hardware. This research follows the former approach as applicable to the feedforward structure of these networks. Supervised learning is the key to making these networks useful for automated information processing of data. In this work we addressed some of the issues of supervised learning as applicable to FFANNs.

7.1 Summary

In this research, theory supported by computer simulations is described to improve the supervised learning of FFANNs. Error back-propagation learning, being widely accepted, is used to realize the gradient descent method of updating the weights of the network. Realizing the advantages of the continuous-time update law, we preferred it over the conventional (Euler) discrete-time update law and used it throughout this work.

The choice of an energy function specifies the back-propagated error in the update law. It also determines the learning performance in the sense of the speed of convergence and the size of the domains of attraction of the stable equilibrium points in the weight space. The Gaussian (sum-of-the-squared) energy function is commonly used due to its simplicity. Another motivating factor for its usage is that it derives from the maximum-entropy distribution subject to the second-moment constraint (the Gaussian distribution). The Gaussian energy function is basically the L2 (Euclidean) norm.
The choice among maximum-entropy distributions widens if the constraint is reduced to the first moment only. We chose the Cauchy distribution from the point of view that the corresponding energy function (namely, the Cauchy energy function) improves the learning performance in a certain sense*. Though the Cauchy energy function may not be a good choice for solving bigger problems, due to the exponential growth of its computations, it remains a candidate for possible hardware-implemented neural networks.

* Improvement in the learning performance of the Cauchy system here means that the derivative of the sum-of-the-squared error at any point on the error surface along the trajectories of the Gaussian system is less in magnitude than that of the Cauchy system.

Reduction of the output errors of a network in the L2 norm may not always achieve the desired results; we may have to reduce the output errors in other norms or in a combination of norms. The Polynomial energy function not only takes care of this problem but also improves the learning performance. A simple example of the Polynomial energy function is discussed in which a combination of the L1 and L2 norms is used. Besides reducing the error in both norms, it improves the learning speed in terms of computer steps and time.

A very common problem inherent to gradient descent systems is convergence to the nearest local minimum, which may not be good for solving the desired problem. The requirement of skipping the undesirable local minima (at higher energy levels) and converging to a desirable minimum (at lower energy levels) is met by using the Exponential energy function. Moreover, the convergence to the desirable minimum is much faster in terms of the number of computer steps and time when the Exponential energy function is used.

Numerous simulations on the XOR problem have been conducted to evaluate the learning performance of the gradient descent update law using different energy functions. All the simulations support the analytical derivations, indicating that a suitable choice of the energy function may improve the learning performance in terms of the learning steps and/or time.

Finally, we evaluated the use of the energy functions described in this work on the practical problem of character recognition. A highly user-oriented character recognition network is developed with two separate modules: a feature extraction network to be trained on a large data-base and a classification network to be trained by the user according to his/her requirements. It is shown that different energy functions can be used to update the weights in order to improve the learning performance. In addition, the proposed structure is easy to use, since the user can train the network on any additional input patterns in order to improve recognition of newly acquired characters.

7.2 Conclusion

In this research, we have developed a method of exploiting the energy functions in the gradient descent weight update law to improve the learning performance of a FFANN. The improvement of learning performance aims at improving the learning speed and/or the generalization. This methodology has been specialized to two typical problems: the prototype XOR problem and the practical character recognition problem. We have shown that investigating various energy functions, with a view towards eventual implementation in software, has produced valuable results.
Analytically, it is tractable to investigate the learning rules in the continuous-time setting. Yet in the software implementation we had to anticipate the step size and the learning rate, and select them appropriately, so as to achieve "faster" convergence and, in addition, guarantee the absence of oscillations.

APPENDICES

Appendix A

Entropy comparison of the Gaussian and the Cauchy distributions

Consider the distribution of a random variable x to be Gaussian with zero mean, given as

    f_g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2} .    (A.1)

Choosing \sigma = 1, the Gaussian distribution becomes

    f_g(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} .    (A.2)

The entropy H_g(f) [43] of the Gaussian distribution f_g(x) is calculated as

    H_g(f) = -\int_{-\infty}^{+\infty} f_g(x)\, \log f_g(x)\, dx    (A.3)
           = -\log e \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,
             \ln\!\Bigl( \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \Bigr) dx
           = \log e \Bigl[ \ln\sqrt{2\pi} \int_{-\infty}^{+\infty} f_g(x)\, dx
             + \int_{-\infty}^{+\infty} \frac{x^2}{2}\, f_g(x)\, dx \Bigr] .

The first integral equals one. Solving the second integral by parts, and evaluating x e^{-x^2/2} from -\infty to +\infty using l'Hôpital's rule [51], we get

    H_g(f) = \log e \Bigl[ \ln\sqrt{2\pi} + \frac{1}{2} \Bigr] .    (A.4)

Now consider the distribution of the random variable x to be Cauchy with zero mean, given as

    f_c(x) = \frac{1}{\pi} \cdot \frac{\lambda}{\lambda^2 + x^2} .    (A.5)

Choosing \lambda = 1, the Cauchy distribution of x becomes

    f_c(x) = \frac{1}{\pi} \cdot \frac{1}{1 + x^2} .    (A.6)

The entropy H_c(f) of the Cauchy distribution f_c(x) is calculated as

    H_c(f) = -\int_{-\infty}^{+\infty} f_c(x)\, \log f_c(x)\, dx    (A.7)
           = -\log e \int_{-\infty}^{+\infty} \frac{1}{\pi(1+x^2)}\,
             \ln\!\Bigl( \frac{1}{\pi(1+x^2)} \Bigr) dx
           = \log e \Bigl[ \ln\pi + \frac{1}{\pi} \int_{-\infty}^{+\infty}
             \frac{\ln(1+x^2)}{1+x^2}\, dx \Bigr] .

Now, substituting x = \tan\theta and using

    \int_0^{\pi/2} \ln\sec\theta\, d\theta = \frac{\pi}{2} \ln 2 ,

we get

    H_c(f) = \log e \cdot \ln(4\pi) .    (A.8)

Comparing the Gaussian and the Cauchy entropy values from (A.4) and (A.8), we observe that the Cauchy entropy is about 1.78 times the Gaussian entropy.
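As a cross-check of (A.4) and (A.8), the short program below estimates both entropies numerically. It is our verification sketch, not part of the thesis software; the integration ranges and step sizes are ad hoc choices, and the tangent substitution for the Cauchy integral mirrors the substitution used in the derivation above.

    /* Numerical check of Appendix A: the Cauchy entropy should come out
       to ln(4*pi) = 2.531 nats, about 1.78 times the Gaussian entropy
       ln(sqrt(2*pi)) + 1/2 = 1.419 nats. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double pi = acos(-1.0);
        double hg = 0.0, hc = 0.0;

        /* Gaussian: integrate -f ln f over [-10, 10]; tails negligible. */
        for (double x = -10.0; x <= 10.0; x += 1.0e-4) {
            double f = exp(-x * x / 2.0) / sqrt(2.0 * pi);
            hg -= f * log(f) * 1.0e-4;
        }

        /* Cauchy: substitute x = tan(t), so -f(x) ln f(x) dx becomes
           -(1/pi) ln(cos(t)^2 / pi) dt on (-pi/2, pi/2). */
        double dt = 1.0e-6;
        for (double t = -pi / 2.0 + dt; t < pi / 2.0 - dt; t += dt) {
            double c2 = cos(t) * cos(t);
            hc -= (1.0 / pi) * log(c2 / pi) * dt;
        }

        printf("H_gauss  = %.4f  (analytic %.4f)\n", hg, log(sqrt(2.0 * pi)) + 0.5);
        printf("H_cauchy = %.4f  (analytic %.4f)\n", hc, log(4.0 * pi));
        printf("ratio    = %.3f\n", hc / hg);   /* approximately 1.78 */
        return 0;
    }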
Appendix B

Proof of Lemma 3.1

Lemma 3.1: Given α_i, x_i ∈ R with α_i ≥ 0, and assuming that Assumption 3.1 holds, then

    \Bigl[ \sum_i x_i \Bigr] \Bigl[ \sum_i \alpha_i x_i \Bigr] \ge 0 .

Proof: Stacking the x_i into a vector x and the α_i into a vector α,

    \Bigl[ \sum_i x_i \Bigr] \Bigl[ \sum_i \alpha_i x_i \Bigr]
      = x^T \mathbf{1}\, \alpha^T x    (B.1)
      = x^T A\, x    (B.2)
      = x^T \bigl[ \tfrac{1}{2}(A + A^T) \bigr] x    (B.3)
      = x^T \bar{A}\, x    (B.4)
      \ge 0 .    (B.5)

In step (B.2), A = 1αᵀ has non-negative entries. In step (B.3), A in the symmetric form xᵀAx is replaced by its symmetric part ½(A + Aᵀ) =: Ā [41],[55]. By Assumption 3.1, Ā is positive semidefinite.

Remark: Note that if all the α_i are equal, then in step (B.2) the matrix A is symmetric positive semidefinite. This is obvious, since A = α_i 11ᵀ and xᵀα_i11ᵀx = α_i ‖1ᵀx‖².

Appendix C

Proof of Theorem 3.1

Theorem 3.1: Assume that Assumption 3.1 holds. Let V := E, given by (3.14i), be a candidate Liapunov function. Let V̇ be the derivative of V along the trajectories of system (2.14) and V̇ᶜ be the derivative of V along the trajectories of system (3.15). Then the learning performance of system (3.15) is "faster" than, or at least the same as, that of system (2.14).

Proof:

Case 1 (output nodes). In order to analyze the learning performance, we evaluate V̇ᶜ − V̇ as

    \dot{V}^c - \dot{V} = \nabla_W V \cdot (\dot{W}^c - \dot{W}) ,    (C.1)

where W is the weight vector containing all the weights. Equation (3.5i) gives the components of Ẇᶜ and equation (2.14i) gives the components of Ẇ. Then, for the output layer, we get

    \dot{V}^c - \dot{V} = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E^c - \nabla_{w_{mn}}E \bigr)    (C.2i)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( (1+\alpha_n)\nabla_{w_{mn}}E - \nabla_{w_{mn}}E \bigr)    (C.2ii)
      = -\gamma \sum_{w_{mn}} \alpha_n \bigl| \nabla_{w_{mn}}E \bigr|^2    (C.2iii)
      \le 0 , \quad \text{since } \alpha_n \ge 0 \text{ and } \gamma > 0 .    (C.2iv)

Case 2 (hidden nodes). Now consider the dynamics of the weights connecting the neurons in a hidden layer. The components of Ẇᶜ and Ẇ are given by equations (3.5ii) and (2.14ii), respectively. Plugging the values of Ẇᶜ and Ẇ into equation (C.1), we get

    \dot{V}^c - \dot{V} = -\gamma \sum_{w_{kl}} \nabla_{w_{kl}}E\,
        \Bigl( \nabla_{w_{kl}}E + \tfrac{1}{2} \sum_n \alpha_n \nabla_{w_{kl}} e_n^2
        - \nabla_{w_{kl}}E \Bigr) ,    (C.3)

where ∇_{w_{kl}}E = ½ Σ_n ∇_{w_{kl}} e_n². Therefore we get

    \dot{V}^c - \dot{V} = -\frac{\gamma}{4} \sum_{w_{kl}}
        \Bigl[ \sum_n \nabla_{w_{kl}} e_n^2 \Bigr]
        \Bigl[ \sum_n \alpha_n \nabla_{w_{kl}} e_n^2 \Bigr] .    (C.4)

For each entry of the summation over w_{kl} we use Lemma 3.1 to show

    \Bigl[ \sum_n \nabla_{w_{kl}} e_n^2 \Bigr]
    \Bigl[ \sum_n \alpha_n \nabla_{w_{kl}} e_n^2 \Bigr] \ge 0 ,    (C.5)

therefore V̇ᶜ − V̇ ≤ 0.

Appendix D

Proof of Lemma 3.2

Lemma 3.2: Given α_{ij}, x_{ij} ∈ R with α_{ij} ≥ 0, and assuming that Assumption 3.2 holds, then

    \Bigl[ \sum_i \sum_j x_{ij} \Bigr] \Bigl[ \sum_i \sum_j \alpha_{ij} x_{ij} \Bigr] \ge 0 .

Proof: Stacking the x_{ij} into a vector x and the α_{ij} into a vector α,

    \Bigl[ \sum_i \sum_j x_{ij} \Bigr] \Bigl[ \sum_i \sum_j \alpha_{ij} x_{ij} \Bigr]
      = x^T \mathbf{1}\, \alpha^T x    (D.1)
      = x^T A\, x    (D.2)
      = x^T \bigl[ \tfrac{1}{2}(A + A^T) \bigr] x    (D.3)
      = x^T \bar{A}\, x    (D.4)
      \ge 0 .    (D.5)

In step (D.2), A = 1αᵀ has non-negative entries. In step (D.3), A in the symmetric form xᵀAx is replaced by its symmetric part ½(A + Aᵀ) =: Ā [41],[56]. By Assumption 3.2, Ā is positive semidefinite.

Remark: Note that if all the α_{ij} are equal, then in step (D.2) the matrix A is symmetric positive semidefinite. This is obvious, since A = α_{ij} 11ᵀ and xᵀα_{ij}11ᵀx = α_{ij} ‖1ᵀx‖².

Appendix E

Proof of Theorem 3.2

Theorem 3.2: Assume that Assumption 3.2 holds. Consider a candidate Liapunov function V := E given by equation (2.6). Let V̇ be the derivative of V along the trajectories of (2.14) and V̇ᶜ be the derivative of V along the trajectories of system (3.16). Then the learning performance of system (3.16) is "faster" than, or at least the same as, that of system (2.14).

Proof:

Case 1 (output nodes). In order to analyze the learning performance, we calculate V̇ᶜ − V̇ as

    \dot{V}^c - \dot{V} = \nabla_W V \cdot (\dot{W}^c - \dot{W}) ,    (E.1)

where W is the weight vector containing all the weights. Equation (3.5) gives the components of Ẇᶜ and equation (2.14) gives the components of Ẇ. Then, for the output layer, using equations (2.14i) and (3.5i), we get

    \dot{V}^c - \dot{V} = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E^c - \nabla_{w_{mn}}E \bigr)    (E.2i)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \Bigl( \nabla_{w_{mn}}E + \tfrac{1}{2} \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2
        - \nabla_{w_{mn}}E \Bigr) ,    (E.2ii)

where ∇_{w_{mn}}E = ½ Σ_p ∇_{w_{mn}} e_{pn}². Therefore we get

    \dot{V}^c - \dot{V} = -\frac{\gamma}{4} \sum_{w_{mn}}
        \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn}^2 \Bigr]
        \Bigl[ \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 \Bigr] .    (E.3)

Using Lemma 3.1, each entry of the summation over w_{mn} on the right-hand side of equation (E.3) is ≥ 0. Subsequently V̇ᶜ − V̇ ≤ 0.

Case 2 (hidden nodes). Now consider the dynamics of the weights connecting the neurons in any hidden layer. The components of Ẇᶜ and Ẇ are given by equations (3.5ii) and (2.14ii), respectively. Again we calculate the difference V̇ᶜ − V̇ as follows:

    \dot{V}^c - \dot{V} = -\gamma \sum_{w_{kl}} \nabla_{w_{kl}}E\,
        \Bigl( \nabla_{w_{kl}}E + \tfrac{1}{2} \sum_p \sum_n \alpha_{pn} \nabla_{w_{kl}} e_{pn}^2
        - \nabla_{w_{kl}}E \Bigr) ,    (E.4)

where ∇_{w_{kl}}E = ½ Σ_p Σ_n ∇_{w_{kl}} e_{pn}². With this substitution, we get

    \dot{V}^c - \dot{V} = -\frac{\gamma}{4} \sum_{w_{kl}}
        \Bigl[ \sum_p \sum_n \nabla_{w_{kl}} e_{pn}^2 \Bigr]
        \Bigl[ \sum_p \sum_n \alpha_{pn} \nabla_{w_{kl}} e_{pn}^2 \Bigr] .    (E.5)

Considering the summation over w_{kl} on the right-hand side of (E.5), we use Lemma 3.2 on each of its entries to show

    \Bigl[ \sum_p \sum_n \nabla_{w_{kl}} e_{pn}^2 \Bigr]
    \Bigl[ \sum_p \sum_n \alpha_{pn} \nabla_{w_{kl}} e_{pn}^2 \Bigr] \ge 0 .

As each entry of the summation over w_{kl} on the right-hand side of equation (E.5) is ≥ 0, therefore V̇ᶜ − V̇ ≤ 0.
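As a concrete illustration (ours, not from the original text) of why the assumptions matter in Lemmas 3.1 and 3.2: in the special case of the remark, with all weights equal, α_i = α ≥ 0, the claim is immediate, since

    \Bigl[ \sum_i x_i \Bigr] \Bigl[ \sum_i \alpha\, x_i \Bigr]
      = \alpha \Bigl( \sum_i x_i \Bigr)^{2} \ge 0 .

With unequal α_i the assumption cannot be dropped: for l = 2, α = (1, 0) and x = (1, -2),

    \Bigl[ \sum_i x_i \Bigr] \Bigl[ \sum_i \alpha_i x_i \Bigr]
      = (1 - 2)\,(1 \cdot 1 + 0 \cdot (-2)) = -1 < 0 ,

and indeed \bar{A} = \tfrac{1}{2}(A + A^{T}) = \tfrac{1}{2} \begin{pmatrix} 2 & 1 \\ 1 & 0 \end{pmatrix} has a negative eigenvalue, so Assumption 3.1 is violated in this case.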
Appendix F

Proof of Proposition 3.1

Proposition 3.1: The equilibria of system (3.16) are also equilibria of system (2.14).

Proof: First we consider the case of the output nodes of a network. The equilibria of the weights connecting the output neurons of system (2.14) are given by

    0 = \dot{w}_{mn} = \nabla_{w_{mn}}E
      = \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}^2 ,    (F.1)

and the equilibria of the weights connecting the output neurons of system (3.5) are given by

    0 = \dot{w}_{mn}^c = \nabla_{w_{mn}}E
        + \frac{1}{2} \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2
      = \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}^2
        + \frac{1}{2} \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 .    (F.2)

This implies

    \sum_p \nabla_{w_{mn}} e_{pn}^2 = -\sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 .    (F.3)

Multiplying both sides by Σ_p α_{pn} ∇_{w_{mn}} e_{pn}², we get

    \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn}^2 \Bigr]
    \Bigl[ \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 \Bigr]
      = -\Bigl[ \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 \Bigr]^2 .    (F.4)

According to Lemma 3.1 the left-hand side is non-negative, whereas the right-hand side is non-positive. This implies

    \sum_p \nabla_{w_{mn}} e_{pn}^2 = \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 = 0 .    (F.5)

This proves the claim for the weights connecting the output nodes. Following similar derivations and using Lemma 3.2, the claim for the weights connecting the hidden nodes is easily proved.

Appendix G

Proof of Theorem 4.1

Theorem 4.1: Assume that Assumption 3.1 holds. Let V := E, given by (4.12i), be a candidate Liapunov function. Let V̇ be the derivative of V along the trajectories of system (2.14) and V̇ᵖ be the derivative of V along the trajectories of system (4.13). Then the learning performance of system (4.13) is "faster" than, or at least the same as, that of system (2.14).

Proof:

Case 1 (output nodes). In order to analyze the learning performance, we evaluate V̇ᵖ − V̇ as

    \dot{V}^p - \dot{V} = \nabla_W V \cdot (\dot{W}^p - \dot{W})    (G.1)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E^p - \nabla_{w_{mn}}E \bigr)    (G.2)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E + \nabla_{w_{mn}}E_1 - \nabla_{w_{mn}}E \bigr) ,    (G.3)

where E₁ = ½ Σ_n e_n. Simplifying the expression in (G.3), we get

    \dot{V}^p - \dot{V} = -\gamma \sum_{w_{mn}}
        (\nabla_{w_{mn}}E)(\nabla_{w_{mn}}E_1)    (G.4)
      = -2\gamma \sum_{w_{mn}}
        (e_n \nabla_{w_{mn}}E_1)(\nabla_{w_{mn}}E_1)    (G.5)
      = -2\gamma \sum_{w_{mn}} e_n (\nabla_{w_{mn}}E_1)^2 ,    (G.6)

therefore V̇ᵖ − V̇ ≤ 0.    (G.7)

Case 2 (hidden nodes). For the hidden nodes we can similarly derive the expression

    \dot{V}^p - \dot{V} = -\gamma \sum_{w_{kl}}
        (\nabla_{w_{kl}}E)(\nabla_{w_{kl}}E_1) .    (G.8)

As the energy functions E and E₁ are not direct functions of the weights w_{kl}, we can write

    \dot{V}^p - \dot{V} = -\frac{\gamma}{2} \sum_{w_{kl}}
        \Bigl[ \sum_n e_n \nabla_{w_{kl}} e_n \Bigr]
        \Bigl[ \sum_n \nabla_{w_{kl}} e_n \Bigr] .    (G.9)

For each entry of the summation over w_{kl} we use Lemma 3.1 to show

    \Bigl[ \sum_n e_n \nabla_{w_{kl}} e_n \Bigr]
    \Bigl[ \sum_n \nabla_{w_{kl}} e_n \Bigr] \ge 0 ,    (G.10)

therefore V̇ᵖ − V̇ ≤ 0.    (G.11)

Appendix H

Proof of Theorem 4.2

Theorem 4.2: Assume that Assumption 3.2 holds. Consider a candidate Liapunov function V := E given by equation (2.6). Let V̇ be the derivative of V along the trajectories of (2.14) and V̇ᵖ be the derivative of V along the trajectories of system (4.14). Then the learning performance of system (4.14) is "faster" than, or at least the same as, that of system (2.14).

Proof:

Case 1 (output nodes). In order to analyze the learning performance, we evaluate V̇ᵖ − V̇ as

    \dot{V}^p - \dot{V} = \nabla_W V \cdot (\dot{W}^p - \dot{W})    (H.1)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E^p - \nabla_{w_{mn}}E \bigr)    (H.2)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E + \nabla_{w_{mn}}E_1 - \nabla_{w_{mn}}E \bigr) ,    (H.3)

where E₁ = ½ Σ_p Σ_n e_{pn}. Simplifying the expression in (H.3), we get

    \dot{V}^p - \dot{V} = -\gamma \sum_{w_{mn}}
        (\nabla_{w_{mn}}E)(\nabla_{w_{mn}}E_1)    (H.4)
      = -\frac{\gamma}{2} \sum_{w_{mn}}
        \Bigl[ \sum_p e_{pn} \nabla_{w_{mn}} e_{pn} \Bigr]
        \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn} \Bigr] .    (H.5)

For each entry of the summation over w_{mn} we use Lemma 3.1 to show

    \Bigl[ \sum_p e_{pn} \nabla_{w_{mn}} e_{pn} \Bigr]
    \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn} \Bigr] \ge 0 ,    (H.6)

therefore V̇ᵖ − V̇ ≤ 0.    (H.7)

Case 2 (hidden nodes). For the hidden nodes we can similarly derive the expression

    \dot{V}^p - \dot{V} = -\gamma \sum_{w_{kl}}
        (\nabla_{w_{kl}}E)(\nabla_{w_{kl}}E_1) .    (H.8)

As the energy functions E and E₁ are not direct functions of the weights w_{kl}, we can write

    \dot{V}^p - \dot{V} = -\frac{\gamma}{2} \sum_{w_{kl}}
        \Bigl[ \sum_p \sum_n e_{pn} \nabla_{w_{kl}} e_{pn} \Bigr]
        \Bigl[ \sum_p \sum_n \nabla_{w_{kl}} e_{pn} \Bigr] .    (H.9)

For each entry of the summation over w_{kl} we use Lemma 3.2 to show

    \Bigl[ \sum_p \sum_n e_{pn} \nabla_{w_{kl}} e_{pn} \Bigr]
    \Bigl[ \sum_p \sum_n \nabla_{w_{kl}} e_{pn} \Bigr] \ge 0 ,    (H.10)

therefore V̇ᵖ − V̇ ≤ 0.    (H.11)
Appendix I

Proof of Proposition 4.1

Proposition 4.1: The equilibria of system (4.14) are also equilibria of system (2.14).

Proof: First we consider the case of the output nodes of a network. The equilibria of the weights connecting the output neurons of system (2.14) are given by

    0 = \dot{w}_{mn} = \nabla_{w_{mn}}E
      = \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}^2 ,    (I.1)

and the equilibria of the weights connecting the output neurons of system (4.4) are given by

    0 = \dot{w}_{mn}^p = \nabla_{w_{mn}}E + \nabla_{w_{mn}}E_1
      = \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}^2
        + \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}
      = \sum_p e_{pn} \nabla_{w_{mn}} e_{pn}
        + \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn} .    (I.2)

This implies

    \sum_p e_{pn} \nabla_{w_{mn}} e_{pn}
      = -\frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn} .    (I.3)

Multiplying both sides by Σ_p ∇_{w_{mn}} e_{pn}, we get

    \frac{1}{2} \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn} \Bigr]^2
      = -\Bigl[ \sum_p \nabla_{w_{mn}} e_{pn} \Bigr]
         \Bigl[ \sum_p e_{pn} \nabla_{w_{mn}} e_{pn} \Bigr] .    (I.4)

According to Lemma 3.1 the right-hand side is non-positive, whereas the left-hand side is non-negative. This implies

    \sum_p \nabla_{w_{mn}} e_{pn} = \sum_p e_{pn} \nabla_{w_{mn}} e_{pn} = 0 .    (I.5)

This proves the claim for the weights connecting the output nodes. Following similar derivations and using Lemma 3.2, the claim for the weights connecting the hidden nodes is easily proved.

Appendix J

Proof of Proposition 5.1

Proposition 5.1: The equilibria of system (5.2) and system (5.3) are the same.

Proof: We consider the Exponential (5.3) and the simple (5.2) systems, rewritten as

    \dot{W} = -\gamma \kappa \nabla_W E ,    (J.1)

and

    \dot{W} = -\gamma \nabla_W E ,    (J.2)

where E can be the Gaussian (2.6), the Cauchy (3.4), the Polynomial (4.3) or any other suitable energy function. Let w* be an equilibrium of system (J.1); then

    0 = -\gamma \kappa \nabla_W E \big|_{w^*} .    (J.3)

This implies

    \nabla_W E \big|_{w^*} = 0 .    (J.4)

Consequently, w* belongs to the equilibria of system (J.2).    (J.5)

Now let w** be an equilibrium of system (J.2); then

    0 = -\gamma \nabla_W E \big|_{w^{**}} ,    (J.6)

which implies

    \nabla_W E \big|_{w^{**}} = 0 .    (J.7)

As a consequence, w** belongs to the equilibria of system (J.1).    (J.8)

(J.5) and (J.8) prove the claim.

Appendix K

List of fonts used for the pattern recognition problem

Set 1: Ancient, Athenaeum italic, Augustea, Banco, Breite, Century, Choc, Corvinus, Craw Clarendon bold, Craw Modern, Diotima, Diotima italic, Egyptienne italic, Elizabeth italic, Falstaff, Flash, Fluidum, Fluidum italic, Folkwang, Francesca, Garamond, Grotesque, Horizon, Hyperion, Italian, Kaufmann, Letter 9, London, Lydian, Matura, Michel Angelo, Microgramma bold, Mistral, Montan, Mosaik, New Clarendon, Nicholas, Normandy, Paganini italic, Peignot, Petra, Playbill, Saltino, Salut, Slim Black, Stradivarius, Stencil, Tea Chest, Walbaum italic, Weiss italic.

Set 2: Albertus, Amanda, Badoni bold, Baskerville, Bembo, Cartoon bold, Caslon italic, Consort bold, Craw Modern bold, Eckmann, Editor, Egizio, Erbar italic, Folio, Fortune, Ganton, Gill Sans, Gothic bold italic, Goudy bold, Hercules, Jacno, Keyboard, Klang, Letter 4, Letter 6, Letter 7, Melior, Mercurius, Modern 20, New Gothic, Noname, Paganini, Perpetua italic, Placard, Plantin, Psitt, Quirinus bold, Radiant bold, Ritmo, Rockwell, Standard, Tiemann, Times, Times bold, Topic bold, Trump, Universe, Vendome, Walbaum, Weiss.

BIBLIOGRAPHY

[1] D. H. Ballard, Interpolation coding: A representation of numbers in neural models, Biological Cybernetics, vol. 57, pp. 389-402, 1987.
[2] E. I. Knudsen, S. du Lac and S. D. Esterley, Computational maps in the brain, Annual Review of Neuroscience, vol. 10, pp. 41-65, 1987.
[3] M. Kuperstein, An adaptive neural model for mapping invariant target position, Behavioral Neuroscience, pp. 148-162, 1988.
[4] D. Walters, Selection of image primitives for general purpose visual processing, Computer Vision, Graphics and Image Processing, vol. 37, pp. 261-298, 1987.
[5] S. Dehaene, J. Changeux and J. Nadal, Neural networks that learn temporal sequences by selection, Proceedings of the National Academy of Sciences, USA, Biophysics, vol. 84, pp. 2713-2727, 1987.
[6] J. A. Feldman and D. H. Ballard, Connectionist models and their properties, Cognitive Science, vol. 6, pp. 205-254, 1982.
[7] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Cambridge, MA: MIT Press, vol. 1, pp. 318-362, 1986.
[8] D. E. Rumelhart and J. L. McClelland, Explorations in Parallel Distributed Processing: A Handbook of Models, Cambridge, MA: MIT Press, pp. 121-159, 1988.
[9] D. Tank and J. J. Hopfield, Concentrating information in time: analog neural networks with application to speech recognition problems, 1st International Conference on Neural Networks, IEEE, June 1987.
[10] F. C. Hoppensteadt, An Introduction to the Mathematics of Neurons, Cambridge University Press, 1986.
[11] D. Purves, Body and Brain: A Trophic Theory of Neural Connections, Harvard University Press, Cambridge, Mass., 1988.
[12] R. J. MacGregor, Neural and Brain Modeling, Academic Press, San Diego, Calif., 1987.
[13] M. Berry, The Nerve Cell, MacKeith Press, London, 1986.
[14] B. Widrow, study director, DARPA Neural Network Study, AFCEA International Press, February 1988.
[15] J. A. Anderson and E. Rosenfeld, Neurocomputing: Foundations of Research, Cambridge, MA: MIT Press, 1988.
[16] F. Rosenblatt, Perceptrons and the Theory of Brain Mechanisms, Spartan Books, New York, 1962.
[17] R. P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Magazine, vol. 4, pp. 4-22, April 1987.
[18] R. H. Nielsen, Theory of the back-propagation neural network, International Joint Conference on Neural Networks, vol. 1, pp. 593-605, June 1989.
[19] W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[20] B. Widrow and M. E. Hoff, Adaptive switching circuits, IRE WESCON Convention Record, New York: IRE, pp. 96-104, 1960.
[21] T. Kohonen, Correlation matrix memories, IEEE Transactions on Computers, C-21, pp. 353-359, 1972.
[22] J. A. Anderson, A simple neural network generating an interactive memory, Mathematical Biosciences, vol. 14, pp. 197-220, 1972.
[23] S. Grossberg, Studies of Mind and Brain, Reidel, Boston, 1982.
[24] R. J. Williams, Unit activation rules for cognitive network models, ICS Report 8303, Institute of Cognitive Sciences, University of California at San Diego, November 1983.
[25] G. E. Hinton, T. J. Sejnowski and D. H. Ackley, Boltzmann machines: constraint satisfaction networks that learn, Report CMU-CS-84-157, Carnegie-Mellon University, 1984.
[26] J. J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proc. Natl. Acad. Sci., vol. 81, pp. 3088-3092, May 1984.
[27] M. Arai, Mapping abilities of three-layer neural networks, International Joint Conference on Neural Networks, vol. 1, pp. 419-423, 1989.
[28] G. Mirchandani and W. Cao, On hidden nodes of neural nets, IEEE Transactions on Circuits and Systems, vol. 36, no. 5, pp. 661-664, May 1989.
[29] S. C. Huang and Y. H. Huang, Bounds on the number of hidden neurons in multilayer perceptrons, IEEE Transactions on Neural Networks, vol. 2, no. 1, Jan. 1991.
[30] M. Minsky and S. Papert, Perceptrons, Cambridge, MA: MIT Press, 1969.
[31] G. Cybenko, Continuous valued neural networks with two hidden layers are sufficient, Tech. Rep., Department of Computer Science, Tufts University, March 1988.
[32] K. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks, vol. 2, pp. 183-192, 1989.
[33] A. E. Bryson and Y. C. Ho, Applied Optimal Control, Blaisdell, New York, 1969.
[34] P. J. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Applied Mathematics, Harvard University, November 1974.
[35] F. M. A. Salam, Neural nets and engineering implementations, keynote address at the 31st Midwest Symposium on Circuits and Systems, St. Louis, Missouri, August 10-12, 1988.
[36] F. M. A. Salam, Artificial Neural Nets: Basic Theory and Engineering Implementations, Department of Electrical Engineering, Michigan State University, East Lansing, MI 48824, October 1989.
[37] F. M. A. Salam and M. R. Choi, An all-MOS analog feedforward neural circuit with learning, 1990 IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, Louisiana, May 1-3, 1990.
[38] F. M. A. Salam, Learning algorithms for artificial neural nets for analog circuit implementation, Interface '90 Proceedings, East Lansing, MI, May 15-16, 1990.
[39] I. K. Sethi and A. K. Jain, Artificial Neural Networks and Statistical Pattern Recognition, North Holland, 1991.
[40] M. W. Hirsch and S. Smale, Differential Equations, Dynamical Systems and Linear Algebra, Academic Press, 1974.
[41] C. T. Chen, Linear System Theory and Design, Holt, Rinehart and Winston, New York, 1984.
[42] E. T. Jaynes, Information theory and statistical mechanics, Phys. Rev., vol. 106, pp. 620-630, 1957.
[43] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, 1949.
[44] A. Wragg and D. C. Dowson, Fitting continuous probability density functions over [0,∞) using information theory ideas, IEEE Trans. Info. Theory, vol. IT-16, pp. 226-230, March 1970.
[45] D. C. Dowson and A. Wragg, Maximum-entropy distributions having prescribed first and second moments, IEEE Trans. Info. Theory, vol. IT-19, pp. 689-693, Sept. 1973.
[46] J. M. Einbu, On the existence of a class of maximum-entropy probability density functions, IEEE Trans. Info. Theory, vol. IT-23, pp. 772-775, Nov. 1977.
[47] A. B. Carlson, Communication Systems: An Introduction to Signals and Noise in Electrical Communication, McGraw-Hill, 1986.
[48] P. Burrascano, A norm selection criterion for the generalized delta rule, IEEE Transactions on Neural Networks, vol. 2, no. 1, Jan. 1991.
[49] J. Wang and B. Malakooti, On training of artificial neural networks, International Joint Conference on Neural Networks, Washington D.C., vol. 2, pp. 387-393, June 1989.
[50] A. Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill, New York, 1965.
[51] C. H. Edwards, Jr. and D. E. Penney, Calculus and Analytic Geometry, Prentice-Hall, Inc., 1982.
[52] M. Ahmad and F. M. A. Salam, Error back-propagation learning using the Cauchy energy function, under preparation.
[53] J. K. Hale, Ordinary Differential Equations, Wiley-Interscience, 1969.
[54] H. K. Khalil, Nonlinear Systems, Macmillan, 1991.
[55] M. Fiedler, Special Matrices and their Applications in Numerical Mathematics, Martinus Nijhoff Publishers, Dordrecht, The Netherlands, 1986.
[56] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1983.
[57] M. Ahmad and F. M. A. Salam, Error back-propagation learning using the Polynomial energy function, under preparation.
[58] S. M. Ebeid, J. A. Sirat and J. R. Viala, A rationalized error back-propagation learning algorithm, International Joint Conference on Neural Networks, vol. II, pp. 373-380, June 18-22, 1989.
[59] T. Tollenaere, SuperSAB: Fast adaptive back propagation with good scaling properties, Neural Networks, vol. 3, pp. 561-573, 1990.
[60] M. Ahmad and F. M. A. Salam, Error back-propagation learning using the Exponential energy function, under preparation.
[61] W. E. Weideman, M. T. Manry and H. C. Yau, A comparison of a nearest neighbor classifier and a neural network for numeric handprint character recognition, International Joint Conference on Neural Networks, vol. 1, pp. 117-120, 1989.
[62] P. Smagt, A comparative study of neural network algorithms applied to optical character recognition, Association for Computing Machinery, pp. 1037-1044.
[63] J. S. Denker, W. R. Gardner, H. P. Graf, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, H. S. Baird and I. Guyon, Neural network recognizer for hand-written zip code digits, Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., M. Kaufmann, San Mateo, CA, 1989.
[64] K. Fukushima, S. Miyake and T. Ito, Neocognitron: a neural network model for a mechanism of visual pattern recognition, IEEE Transactions on Systems, Man and Cybernetics, SMC-13, pp. 826-834, 1983.
[65] K. Fukushima and N. Wake, Handwritten alphanumeric character recognition by the Neocognitron, IEEE Transactions on Neural Networks, vol. 2, no. 3, May 1991.
[66] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, Handwritten digit recognition with a back-propagation network, Advances in Neural Information Processing Systems 2, pp. 396-404, 1990.
[67] A. Bastien, Alphabet in Type, West Drayton, Middlesex, England, 1958.
[68] F. Lamberty, Letter Forms, Hastings House Publishing, New York, 1964.
[69] D. Freres, Cent Alphabets Monotype, Union Bibliophile de France, Paris, 1963.
[70] W. P. Jaspert, W. T. Berry and A. F. Johnson, The Encyclopedia of Type Faces, Blandford Press Ltd., London, 1970.
[71] M. Ahmad and F. M. A. Salam, Feedforward artificial neural network structure for character recognition, to appear in the 29th Annual Allerton Conference on Communication, Control and Computing, October 1991.
[72] S. Bebromdhane, M. Ahmad and F. M. A. Salam, Handwritten numeric character recognition using a multilayer neural network, to appear in the 29th Annual Allerton Conference on Communication, Control and Computing, October 1991.