PLACE ll RETURN Boxwmnmmbchockomm your record. TO AVOID FINES Mom on or baton duo duo. DATE DUE DATE DUE DATE DUE MSU IIN'I Nflmnflvo WOMEN Oppoflunlly Inflation Was-u ANALYTICAL FORMULATION FOR STATIC AND TEMPORAL PATTERN ASSOCIATION: NEURAL ARCHITECTURES AND ALGORITHMS By Jz'ansheng H on A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Electrical Engineering 1997 ST; Curre‘ structure 115ng ofi: (matrix). This d3 used for $1 $61 35 Stab determine matrix is 6 [Or allgmefi The Pro adapted for in won an ABSTRACT ANALYTICAL FORMULATION FOR STATIC AND TEMPORAL PATTERN ASSOCIATION: NEURAL ARCHITECTURES AND ALGORITHMS By Jiansheng Ho'u Currently, most Artificial Neural Network (ANN) models have standard layered structure of interconnected neurons where the connections (weights) are computed using off-line or on-line methods. The process of finding an appropriate weight (matrix), if it exists, is usually computationally expensive. This dissertation focuses on new design and training of artificial neural networks used for storing static or temporal patterns. A product-of-norm model of artificial neural networks is presented. It is shown that this model can store an arbitrary data set as stable system equilibria — without the need for performing training process to determine the weights. Consequently, the process of iteratively finding the weight matrix is eliminated. The need of defining the weights is in fact replaced by defining (or augmenting) the architecture. The properties of this new model are investigated, and various modifications are adapted for practical applications. In particular, the use of the log form of the energy function and the subgrouping method are introduced to reduce the computational efibrt penahj Thi recurre proposr back-pr effort. Moreover, methods for recognizing shifted images are proposed by adding penalty terms to the energy function. This investigation also includes ways to improve the current algorithms for training recurrent networks to store temporal patterns. Specifically, two algorithms are proposed to avoid the backward integration mode in the time-dependent recurrent back-propagation method. Copyright by J iansheng Hou 1997 1 WO his cont Michiga guidanc Newhou for theh I am of Elect I am and 9110 been ab wannest ACKNOWLEDGMENTS I would like to express my sincere thanks to my advisor, Dr. Fathi M. A. Salam, for his continued support and guidance throughout the years of my graduate studies at Michigan State University. I also want to thank the other members of my Ph.D guidance committee members, Dr. Hassan Khalil, Dr. John Deller, Dr. Sheldon Newhouse, Dr. Robert Nowak, Dr. Frank Hoppensteadt and Dr. R. L. Tummala, for their valuable comments and suggestions. I am grateful for the financial support from the Case Center and the Department of Electrical Engineering of Michigan State University. I am especially grateful to my wife, J ian Yan, for her love, understanding, support and encouragement, particularly in the last years of this research. I would not have been able to finish this work without her support. Finally, I want to express my warmest gratitude to my parents for their continued support and encouragement. LIST ( LIST ( 1.1 1.2 1.3 1.4 1.5 2 AR] SOC torogoro #aéAtVH NET 3.1 ' 3.9 ‘ ‘- 3.4 3.6 TABLE OF CONTENTS LIST OF TABLES ix LIST OF FIGURES x 1 INTRODUCTION 1 1.1 Overview .................................. 1 1.2 Problem Statement ............................ 3 1.3 Research Goals of the Thesis ....................... 4 1.4 Major Contributions ........................... 7 1.5 Organization of the Thesis ........................ 8 2 ARTIFICIAL NEURAL NETWORKS FOR INFORMATION AS- SOCIATION 10 2.1 Background ................................ 10 2.2 The Hopfield Model ............................ 11 2.3 Bidirection Associative Memory ..................... 16 2.4 Existing Models for Storing Temporal Patterns ............. 18 3 THE PRODUCT-OF-NORM MODEL OF ARTIFICIAL NEURAL NETWORKS 21 3.1 Motivation ................................. 21 3.2 Proposed Mathematical Model ...................... 23 3.3 Stability Analysis ............................. 24 3.4 The Product-of—norm Model Using L2 Norm .............. 26 3.4.1 Oneneuron/m-pattern Case ................... 28 3.4.2 The Number of Equilibria for Product-of-norm Networks . . I. 34 3.5 The Case of Leo Norm .......................... 38 3.6 Simulation Results ............................ 40 vi 4.3 5 THE CUI 5.1 5") '8 5.3 _ 6 ART PAT' 6.1 6.9 Q AAAA 4 GENERALIZATION AND VARIATIONS OF THE “ENERGY” FUNCTION FOR PRODUCT-OF-NORM MODEL 43 4.1 The Log Form of the “Energy” Function ................ 44 4.2 Localization of the Product-of—norm Model ............... 56 4.2.1 Motivation ............................. 56 4.2.2 Subgrouping Method ....................... 58 4.2.3 Simulation Results ........................ 62 4.3 Modified Product-of—norm Models for Recognizing Shifted/ Translated Input Patterns .............................. 66 4.3.1 Motivation ............................. 66 4.3.2 Modified “Energy” Functions .................. 66 4.3.3 Simulation Results ........................ 72 4.3.4 Discussion and Remarks ..................... 73 5 THE BUILDING BLOCKS FOR MICRO-ELECTRONIC CIR- CUIT REALIZATION OF THE PRODUCT-OF-NORM MODEL 77 5.1 Motivations ................................ 77 5.2 Background and Basic Circuits ..................... 78 5.2.1 MOS Transistors in Subthreshold Region ............ 78 5.2.2 Current Mirrors .......................... 79 5.2.3 Differential Pair .......................... 80 5.2.4 Current-voltage Multiplier .................... 85 5.2.5 Four-quadrant Multiplier ..................... 88 5.3 An Example Circuit of One-neuron/two-pattern Product-of-norm Model 91 6 ARTIFICIAL NEURAL NETWORKS FOR TEMPORAL- PATTERN ASSOCIATION 94 6.1 Introduction ................................ 94 6.2 Methodology Review ........................... 95 6.2.1 Tapped Delay Lines ........................ 96 6.2.2 Context Units (Partially Recurrent Networks) ......... 98 6.2.3 Back Propagation Through Time ................ 98 6.2.4 Real-Time Recurrent Learning .................. 100 6.2.5 Time-dependent Recurrent Back-propagation . . . . ....... 103 6.2.6 The Green’s Function Method .................. 105 6.3 Motivation ................................. 107 6.4 The Initial Value Problem ........................ 109 6.5 The Choice of the Final Value of the Co-state ............. 111 vii 7 SC) to Iv—- A THE NEIL BIBLIO 6.6 Avoiding Backward Integration in Time-dependent Recurrent Back- propagation: Method I .......................... 113 6.6.1 Simulation Procedure ....................... 119 6.7 Avoiding Backward Integration: Method II ............... 123 6.7.1 Updating the Time—constant T ................. 129 6.7.2 Computation Procedures ..................... 131 6.7.3 Simulation Examples ....................... 134 6.7.4 Comparison and Discussion ................... 137 7 SUMMARY AND CONCLUSION 139 7.1 Summary ................................. 139 7.2 Future Research Directions ........................ 142 A THE PSPICE CODE AND SIMULATIONS FOR A ONE- NEURON/TWO-PATTERN CIRCUIT 143 BIBLIOGRAPHY 155 viii 4.1 6.1 LIST OF TABLES 4.1 Simulation results ............................. 73 6.1 Comparison of the algorithms ...................... 138 2.1 2.2 2.3 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 4.11 LIST OF FIGURES A neuron of an analog Hopfield model ................. 14 A network of Hopfield model ....................... 15 Structure of Bidirection Associative Memory .............. 17 The plot of “energy” function E = 0.5(0.25 — :c)2(0.75 — (1:) ...... 32 The plot of function H (v) with v'"1 = 0.25 and v'"2 = 0.75 ....... 32 The plot of “energy” function E = 0.5(0.25 — x)2(0.5 - x)2(0.75 — 3:)2 33 The plot of function H (v) with v“ = 0.25 v"‘2 = 0.5 and v‘3 = 0.75 . 33 An H (22) function with stable equilibria 0.2, 0.4, 0.7 and unstable ones at 0.3 and 0.6 ............................... 35 An H (2)) function with stable equilibria 0.2, 0.4, 0.7 and unstable equilibria at 0.3, 0.5 ............................ 35 An example vector field for Loo norm .................. 39 Initial and final outputs of the network with six pre—specified patterns 41 Stored two facial images ......................... 42 Image retrieving process ......................... 42 A modularized structure of Product-of-norm model .......... 57 An example structure of dividing a big network into isolated smaller subnetworks ................................ 61 An example structure of dividing a big network into overlapped smaller subnetworks ................................ 61 Three stored patterns ........................... 63 Test patterns ............................... 63 No over-lap grouping ........................... 64 Over-lap grouping ............. - ................ 64 The retrieving process of the network using non-overlap method . . . 65 The retrieving process of the network using overlap method ...... 65 Shifted input pattern ........................... 67 An example way of dividing the network into small regions ...... 71 4.12 Stored patterns .............................. 74 4.13 Test patterns ............................... 74 5.1 I”. vs IQ, ................................. 80 5.2 Two diode-connected transistors ..................... 81 5.3 Two types of current mirrors ....................... 82 5.4 11, 12 and 11 — 12 vs V1 — V2 ....................... 83 5.5 Differential pair .............................. 83 5.6 Transconductance amplifier ....................... 84 5.7 Current-voltage multiplier ........................ 86 5.8 Schematic symbol for current-voltage multiplier ............ 87 5.9 The plot of a current-voltage multiplier ................. 87 5.10 Gilbert multiplier ............................. 88 5.11 Spice simulation of a Gilbert-multiplier ................. 90 5.12 Schematic symbol of a voltage square circuit .............. 90 5.13 One-neuron/two-pattern circuit ..................... 92 6.1 A time-delay neural network ....................... 97 6.2 Architectures of networks using context units. Shaded arrows represent fully connected connections ....................... 99 6.3 A general recurrent network ....................... 101 6.4 A unfolded recurrent network for 4 time steps ............. 101 6.5 Network topology of XOR problem ................... 110 6.6 XOR network with input = [0, 0] .................... 121 6.7 XOR network with input = [0,1] .................... 121 6.8 XOR network with input = [1,1] .................... 122 6.9 XOR network with input = [1, 0] .................... 122 6.10 XOR network with input = [0,0] .................... 135 6.11 XOR network with input = [0,1] .................... 135 6.12 XOR network with input = [1,1] .................... 136 6.13 XOR network with input = [1,0] .................... 136 A.1 The response of the one-neuron/two-pattern circuits to different initial voltages. v“ = 0.4V and v‘2 = 0.7V. .................. 154 A2 The response of the One-neuron/two—pattern circuits to different initial voltages. v'"1 = 0.3V and v"2 = 0.75V .................. 154 xi CH 1.1 Since t human have b indUSt; though many 1 39 Gas} by dig] Year-01 Compu ID 1111”] 'R t1. CHAPTER 1 INTRODUCTION 1 . 1 Overview Since the birth of the first electronic digital computer — the Von Neumann machine, human civilization has entered a new era of information processing. Today, computers have been used in almost every area in our life, from home to office, from science to industry. It is hard to imagine what life will look like without computers today. Even though the digital computers are so advanced and so widely used, they still have many limitations when compared to human brains. Many things, which can be done so easily by human brains, can be very difficult, if not impossible, to be accomplished by digital computers. A good example is the processing of visual information: an one- year-old baby is much faster at recognizing human faces and objects than the best computers today [HKP90]. The following distinguished features have been attributed to human brain (neural networks): 0 Robust and fault tolerant. Nerve cells die every moment and this does not affect the performance of neural networks significantly. 0 Flexible and adaptive. It can easily adjust to a new environment by “learning”. It does not have to be programmed like a digital computer. 0 High degree of parallelism and distributed storage. 0 The ability to process information that is fuzzy, probabilistic or inconsistent. o Compactness. These features have inspired generations of researchers to study the human brain — neural networks, to investigate the principles of how human brain works and to build models to mimic the behavior of the human brain. This lead to the birth of “artificial neural network” (ANN) —— “ the next generation of computers”. The study of artificial neural network has gone through several circles of up and down. It began in the 1940’s with McCullough and Pitt. Second peak happened in the 1960’s with Rosenblatt’s perceptron convergence theorem. It began to explode again in the 1980’s inspired by John Hopfield’s energy approach and the modern era of multi-layered neural networks and back-propagation algorithm. Researchers have been investigating the possible applications of artificial neural networks in all areas. Information association using artificial neural networks is a major field which researchers have studied extensively. The research results can be used in a wide range of real world applications, such as image association, pattern recognition, associative memory and etc.. Associative memory is one of the early themes of the research works on artificial neural networks. It is such a kind of network that presents the same output when sim- ilar inputs are applied. Many artificial neural network architectures and algorithms have been proposed and studied by researchers. Some of the prOposed networks use feet IlQlWOI :~ 1.2 Some m l. 111:; 2. the usii While it 51‘stems PrOhibitj HEIWm-k COHSE‘qt order of COmPUt.‘ the in“ the data In t} 3830mm 10 OVerc On redu 100‘ use feedforward structures, while the majority use feedback structures (the recurrent networks). The focus of this dissertation research is on the recurrent networks. 1.2 Problem Statement Some major issues in designing functional neural networks are [8891, SWC91]: 1. the ability to store (or learn) any number of pre-specified set of data successfully; 2. the determination of a set of parameters (or weights) which would archive 1 using efficient off-line or on-line computations. While it is possible to employ analogous approaches to the algorithms employed in systems theory, control and adaptive control, the amount of computation may be prohibitive by present implementation media. In particular, for n—neuron neural network update methods, one must update or determine in the order of n2 weights. Consequently, most learning or weight updating methods can successfully store in the order of 7: patterns while updating about 112 weights. In addition, the weight updating computation is largely time consuming. Another limitation of current algorithms is the insufficient ability to store arbitrary set of data. Many of the algorithms require the data to be stored meet certain prerequisite conditions. In this thesis, current ANN architectures and algorithms used in static information association will be analyzed. New ANN architectures and algorithms will be presented to overcome the limitations identified in the analysis. Specifically, this thesis will focus on reducing the computation complexity of the networks and increasing the capacity of ANN networks. Implementation of new models via hardware will be investigated too. A .V .‘ great p1 extensiu Ways to be inx'es] 1.3 Goal 1: Objectix SignlfiCa] Approach ANN networks used for temporal pattern association is another area which has great potential applications. Most algorithms proposed so far are, to certain degree, extensions of the back-propagation method. The computation is very expensive. Ways to improve the performance of ANN used in temporal pattern association will be investigated in this thesis too. 1.3 Research Goals of the Thesis Goal 1: Objective: Significance: Approach: Improve the performance of ANN used for static pattern association Analyze Current ANN network architectures, models and algo- rithms, especially recurrent networks will be analyzed, so that their limitations can be identified. New architectures and algorithms will be presented to overcome the limitations identified. Analysis of current approaches is a very important step. Through this step, advantages and limitations of current ANN networks used for association of static patterns will be identified. Thus new archi- tectures and algorithms can be generated. A new architecture with better performance will have a wide range of applications. It is a step further in mimicking the behavior of neural networks. Current models and methods of ANN used for association of static patterns, especially feedback networks, will be investigated first. The capacity, speed, convergence and computation complexity of various nets will be studied. A new ANN architecture will be proposed Goal 2; Objecti‘ ‘ | SlgnlfiCa AppI‘OaCl Goal 2: Objective: Significance: Approach: with a new “energy” function. The resulting network will be a gradient-like system, thus it will have nice stable behavior. The proposed system will be able to store arbitrary analog data set. The determination of network parameters (weights) will be simplified using this model. Static patterns will be stored as stable equilibria of the system so that given enough hint (input) the network will be able to retrieve the correct pattern. Simulations will be performed to verify the theoretic results and demonstrate the potential usage of the new model. Generalization and extension of the proposed model. The proposed model will be generalized. Several ways of modifying the “energy” function will be investigated so that the resulting network can be tailored for various application needs. Generalization of the “energy” function can gain insights of the property of the new model. Investigation of ways to modify the “energy” function will lay a foundation for the new model to be used in real world applications. Different forms of the “energy” function will be explored to simplify the model and improve the performance. Ways to modify the model so that it can be easily modularized and parallelized will be inves- tigated. Methods of modifying the model so that it can be used in image association and recognition will be studied. In particular, variations of the “energy” function to recognize shifted images will Goal 3: Objecti Significa APlJroa Goa] 4: OblECti SignifiC Goal 3: Objective: Significance: Approach: Goal 4: Objective: Significance: be investigated. Software simulations will be performed to verify the correctness of theoretical result and computation complexity. Micro-electronic hardware implementation Study ways to implement the proposed network in hardware; par- ticularly, VLSI technology. Basic circuits building blocks will be presented. Implementing the new ANN model in hardware makes it possible to employ this model in real world applications. The implementation process can also provide feedback about directions to improve the theoretical model. Analog MOS transistors biased in subthreshold region will be used as the basic elements of circuits for the new ANN model. Circuit building blocks will be designed and verified using “Pspice”. Small networks will be built. Extensions to for temporal pattern association Analyze current models and algorithms used for temporal patterns association. Improve the performance of ANN by proposing new models or algorithms. ANNs used for temporal pattern association have great potential usage in real world applications. Appro 1.4 The maj ' Var as 1 lug Approach: Current models and algorithms for temporal pattern association will be surveyed first, then the limitations and advantages of each approach will be analyzed. New algorithms will be proposed to improve the performance and overcome the limitations identified. 1.4 Major Contributions The major contributions of this work can be summarized as follows: 0 A new ANN architecture — the product-of—norm model — is proposed. This model eliminates the dynamic learning of the weight connections. Weights are fixed instead. This network can store arbitrary set of data as stable system equilibria. Thus the data are guaranteed to be recallable. This model can be used in a wide range of real world applications. 0 Various methods of simplifying the product-of-norm model are presented, such as using the log form of the “energy” function, the infinity norm and subgroup- ing method. The simplified network can be easily parallelized and modularized. 0 Several ways of tailoring the “energy” function of the product-of—norm model are pr0posed so that the model can be used in specific applications. In partic- ular, tailoring the “energy” function for recognizing slightly shifted images is demonstrated. 0 Basic building blocks for implementing the product-of-norm model in hardware using VLSI technology are proposed. in de‘ val M? O‘VCI‘E for gener beCome s. 1.5 q The rest . Chapt 1101] of St; Chapt {101-m m0< Nations is inl’estig mall’Zed‘ . Chaple kinetic)“ ; u [UHCIIOI]. 0 Numerical algorithms are proposed which extend the time-dependent recurrent back-propagation method used in storing temporal patterns using recurrent networks. They eliminate the need of integration backward in time as needed in the original time-dependent recurrent back-propagation method. Rules are developed for setting the final boundary conditions of the co—state and the initial values of the neurons of the time-dependent recurrent back-propagation method are derived using optimal control theory. Overall, this work improves the current methods for artificial neural networks used for general association. Arbitrary data set can be stored and retrieved. “Learning” become simpler. 1.5 Organization of the Thesis The rest of the dissertation is organized as follows: Chapter 2 gives a basic introduction to artificial neural networks used in associa- tion of static patterns and identifies limitations of current approaches. Chapter 3 presents a new model of artificial neural network — the product-of- norm model of ANN. A new “energy” function is proposed and the system dynamic equations are derived from the “energy” function. The stability of this new system is investigated. The properties of the new model using different kinds of norms are analyzed, especially the L2 norm. Chapter 4 further investigates the ways to modify and generalize the “energy” function proposed for the new model, specifically the log form of the “energy” function. The subgrouping method is proposed to localize the connections. It transistc of the net I Chap' lem using each met limitation Finall} direction. also demonstrates the ways of tailoring the “energy” function for image recognition purpose. Simulations are proved. Chapter 5 presents some circuit building blocks for the new model using analog transistors biased in the subthreshold region. “Pspice” simulations are given for some of the networks. A one-neuron-/two-pattern network of the new model is also showed. Chapter 6 analyzes the current approaches for temporal pattern association prob- lem using artificial neural networks. It identifies the limitations and advantages of each method. Finally it presents some new algorithms to overcome some of the limitations. Finally, chapter 7 summarizes the whole thesis, points out the future research direction. AR’] CHAPTER 2 ARTIFICIAL NEURAL NETWORKS FOR INFORMATION ASSOCIATION 2.1 Background Some of the major concerns or criteria in designing associative neural networks are: 0 How many patterns can a network store? Can it store arbitrary patterns? How big is the network capacity (we define the capacity of an associative network as the ratio of the number of patterns it can store over the number of neurons in the network) ? 0 Will there be any spurious patterns stored? In other words, does the network store patterns which are not desired? 10 0 H'. be D HQ pa’ 0 Cam These 111 genera teIIlporal 1 Many , and Studiet memory 5 majority u in the later for Store st. 11 o How are the patterns stored? What is the training method? Can the network be trained on-line? o How easy is it for the network to store additional patterns without altering the patterns which have already been stored? 0 Speed of the network. How fast can it be trained? How fast can a pattern be retrieved? 0 Can the network be easily implemented in hardware/ software? These criteria measure the merit of a designed associative artificial neural network. In general, the above criteria apply to associative networks for both static and temporal patterns. Many artificial neural network architectures and algorithms have been proposed and studied by researchers. Most of them are focused on the applications of associative memory. Some of the networks proposed use the feedforward structure. While the majority use the feedback structure (the recurrent structure). Our main interest is in the later. In the following sections, we will study some of the popular ANNs used for store static or temporal patterns. 2.2 The Hopfield Model The analog Hopfield model [Hop84] is a recurrent network. One of the major contri- butions of this model is that it can be easily built with analog circuit and it is suitable for VLSI implementation. Figure 2.1 shows a circuit model of an element (neuron) of a Hopfield model. Figure 2.2 shows a Hopfield network. In Hopfield model, each procesir a resisto transfer finite co: are expre Where I, j Leta]: 12 processing unit (neuron) i, i = 1,2, . . . ,n, is an amplifier with a capacitor C: and a resistor p; at the input node. The processing node uses a sigmoid function S to transfer input u,- to output vi. Node (neuron) i is connected to node (neuron) j via a finite conductance Tij~ The connection matrix T = [ng] is symmetric and T.-,- = 0 for i 6 {1,2, . . . ,n}. The system dynamic equations can be derived by using KCL, and are expressed as: dug 7' U: 7t— Zfij(vj-u.)—;+I. j=l = ZTaJ-vj- (ZT..+— —)u.—+I i=1,2,...,n, (2.1) :1 Ci where I, is an input bias current to neuron i, and vj=S(u,-) j=1,2,...,n. (2.2) Let Ry: T and _1_ _ l. n _1_ Bi Pi “=1n 0. (2.7) Since C; > 0 and (jg-EV Z 0, if g 0 for t 2 o. (2-8) When “a? = 0 3E E): 0 2 — 1,2, ,n du, => Cg? — 0 So u.- is a equilibrium point of the system. Therefore the system will converge to a stable point (equilibrium point) starting from any point. This system is a gradient-like system. [SWC91] has more detailed analysis of this system. Hopfield network functions as associative memory by storing desired patterns as 14 —’VVV‘—' , [> v, Figure 2.1. A neuron of an analog Hopfield model stable equilibrium points of the system. Thus, when the system is presented with an input “similar” to one of the stored patterns ( within the region of attraction of one of the stable equilibrium point ), the system will converge to the closest stored pattern. However, it is difficult to determine the symmetric connection (weight) matrix T. The network cannot store arbitrary patterns either. There are many papers published on ways of finding the weight matrix for a set of given patterns, such as the one proposed by Barbosa [Bar91], and the one by Tseng et al [THL93]. There is a discrete version of Hopfield model too [Hop82]. The system dynamics can be described by: V: = sign (2 Tij) (2-9) 1 We omit the detailed analysis here. The Hopfield—type model has the following limitations: 0 The storage capacity is relatively small, i.e. for a fixed number of neurons the 15 Il I2 1.. T. T... T. T2 T2 Th Tn T,z Th 9. p. p. Figure 2.2. A network of Hopfield model I111: o Gi‘~ im; whe 0 011- 2.3 I Bart Kosk HGIWOfk :} Pattern) in (Unnected thS: the 0] 16 number of “storable” patterns is small. 0 Given an arbitrary set of patterns to be stored, it is difficult and maybe impossible to find a weight matrix. That is, the network cannot store arbitrary sets of patterns. For example, assume I, = 0, then the system equilibria of the Hopfield model must satisfy the condition: F T(Vi)Vj = F T(VJ')V'i, for i 75 j; where F(V) = R’IS‘1(V) [SWC91]. 0 On-line learning is difficult and computationally expensive. 2.3 Bidirection Associative Memory Bart Kosko extended the original Hopfield type associative memory to a two-layer network [Kos88], as shown in Figure 2.3. Each layer can represent a vector (or pattern) and the two layers can have different number of neurons. The two layers are connected through the weight matrix (connection matrix) M and its transpose M T. When the system is given an initial input to layer one, say A’, the system works like this: the output of the layer two is calculated by B’ = S (M A’), S() is a threshold function; the output of layer two is then fed to layer one again through [connection MT; the output of layer one then become A” = S'(MTB)’ until B = S(MA) and A = S (M ,T B ) A and B are the stored patterns of the network ( may not be a desired one ). Given pattern A as the input, the network can retrieve pattern B; similarly, given B as a input the network can retrieve pattern A for layer one. Kosko [K0388] proved that any real matrix M is bidirectionally stable (i.e. use matrix M as the connection matrix, the resulting network is stable). Kosko proposed the following way to find the connection matrix: for a given set of patterns (A1, Bl), (A2, 32), . . ., 1.4.... when dEHfiI Ho guard“ 17 (Am, Bm), matrix M can be constructed as follows: M = ZXFIG, (2.10) i=1 where X,- and Y.- are the bipolar form of patterns A,- and B,- respectively and are defined as: a;,- If (1.5 > 0 1‘35 = , (2.11) —1 If (1,} S 0 bgj if (25,“ > 0 31.3 = . (2.12) —1 if 12,-,- g 0 Figure 2.3. Structure of Bidirection Associative Memory However, as pointed by Wang et al. [WCM89], this proposed matrix cannot . guarantee the recall of the desired patterns, i.e. some of the desired patterns may not be e Training: single pa proposeti idea is t. ln [WCJ connectic associatix also prop IOI a g] \‘e The a limitation It 15 also ( 18 not be stable equilibrium points of the system. Wang et al. proposed “multiple Training Method” to build the matrix M, which can guarantee a successful recall of a single pair of patterns [WCM90]. To guarantee the recall of all training pairs, Wang proposed a method called “Dummy Augmentation Encoding” [WCM90]. The basic idea is to increase the dimension of the stored patterns to make them “noise-free”. In [WCJ91], Wang proposed a necessary and sufficient condition for a generalized connection matrix (also called correlation matrix in Wang’s paper) of a bidirectional associative memory (BAM) which guarantees the recall of all the training pairs. He also proposed to use linear programming techniques to find such matrix. However, for a given set of patterns this condition may not be satisfied. The advantage of BAM is its ability to store patterns in pairs. The major limitation with BAM is that it cannot guarantee to store an arbitrary set of patterns. It is also difficult to find the connection matrix if it exists at all. 2.4 Existing Models for Storing Temporal Pat- terns Many algorithms have been prOposed for training ANNs so that they can store/generate temporal patterns. To be able to generate temporal patterns, A recurrent network structure is usually required. Our main interest here is to extend current algorithms and to improve the performance of current methods. In this section, a brief introduction will be given to some popular algorithms. Detailed analysis will be given in later chapters. Our focus is on the algorithms of training a recurrent network to store/ generate temporal patterns. Back-P Bach [th 'Sr‘. so that ‘ correspo: connecth the origi: feedforwa The limit The lengr requires t Real-ti A met of the "en “Gt 19 Back-Propagation Through Time Back-propagation through time was proposed by Rumelhart, Hinton and Williams [RHW86]. This method “unfold” a fully connected recurrent network through time, so that the network can be represented by a feedforward network with each layer corresponding to a state of the recurrent network at certain time constant. The connections between each layer are the same, and they are the weight matrix of the original recurrent network. Therefore, the popular back-propagation method of feedforward network can be slightly modified to be used to train this kind of network. The limitation is that this algorithm cannot be used to train arbitrary length patterns. The length of the training patterns need to be fixed and known. This method also requires too much computer resources and is difficult to implement in hardware. Real-time Recurrent Learning A method proposed by Williams and Zipser [WZ89b, WZ89a] can be used to train general recurrent networks. It does not require the patterns to have fixed length. The training algorithms use a gradient descent method to update weights. The derivative of the “energy” function with respect to weights is calculated by integrating forward the derivative of the outputs of the neurons with respect to weights. Thus this method is a forward propagation method. The drawbacks with this method are: it is computationally very expensive, each time step requires 0(n4) of operation. It also needs large memory storage. Time-dependent Recurrent Back-propagation This method was proposed by Pearlmutter [Pea89] to train general recurrent neural networks. Other researchers developed similar method, such as the “adjoint function” method proposed by Toomarian and Barhen [TB91]. The time-dependent IfCUIIEI. method ‘energy‘ (”adjoint it needs and hard The Grc Sun. ( function 1 it used a (omputar equation SimUlatio In the HE‘IWQI-ks the limit; the ”93's 20 recurrent back-propagation method is in essence an extension to the backepropagation method used in feed forward networks. This method updates the weight / connections using a gradient-descent algorithm too. However, this time, the derivative of the “energy” function with respect to weights is calculated using a co—state variable (adjoint system). The computation per time step is in the order of 0(n2). However, it needs to integrate the co—state backward in time. Thus difficult to be used on-line and hard to be implemented in hardware. The Green’s Function Method Sun, Chen and Lee proposed another way of finding the derivative of the “energy” function with respect to weights [SCL91]. Instead od using back-propagation method, it used a “Green” function. This method can be integrated forward in time. The computation steps are in the order of 0(n3). This method needs to solve a linear equation at each time step, which may not have a unique solution in numerical simulations. In the rest of this dissertation, we will first propose a new model of artificial neural networks for association of static patterns. This new model will overcome some of the limitations we have pointed out in current approaches. Then we will investigate the ways to improve current methods and models used for association of temporal patterns. C] T] M 3.1 tum S[Dre CHAPTER 3 THE PRODUCT-OF-NORM MODEL OF ARTIFICIAL NEURAL NETWORKS 3.1 Motivation As can be seen from previous chapters, two of the major issues in the design of functional artificial neural networks (ANN) are [$391, SWC91]: (1) the ability to store (or learn) any number of pre—specified set of data (patterns) successfully; (2) the determination of a set of parameters (or weights, connections) which would achieve (1) by using efficient off—line or on-line computations. Dynamic learning has been a more attractive way in (adaptively) computing the parameters of an ANN. For (static) feedforward artificial neural network mod- els, the error-back-propagation is one of such dynamic methods [RHW86]. Pineda [Pin87, Pin88, Pin89] and some other researchers have ported the back-propagation 21 techr non Inetl The net“ Capa 22 techniques to recurrent networks too. Other learning methods such as the one proposed by Perlmutter [Pea89] in essence are error back-propagation method too. However, most of the proposed algorithms proposed so far are computationally expensive and the number of sets of data (patterns) which can be reliable learned is limited. Moreover, additional data may be inadvertently learned making what is known as spurious states. The usual models of ANN fix the architecture and the number of units (neurons) first, then seek to “compute” parameter values, either through off-line or on-line methods, which render a given set of data stable equilibria of input/output pair. The structure is not dependent on the number of elements of the set of data. These networks spend lots of time in learning stage and are well known to have limited capacity in storing arbitrary set of data. In this proposed model, we make trade-offs: we eliminate the dynamic learning of parameters but make the number of units or neurons dependent on the number of data to be stored. The resulting network does not “waste” any time on adapting its parameters and the given patterns are guaranteed to be the stable equilibria of the system (i.e. they are guaranteed to be recalled). However, the network’s “size” will depend on the number of desired data to be stored. The network becomes more complex: the connectivity or interaction among neurons increases. 3.2 We st with: W [18? 23 3.2 Proposed Mathematical Model We start the network by proposing a new general “energy” function of the network with: E =1ftm)Hne'H;, (3.1) p l=1 where e1 9—- [(vi'l - v1), (viil — “02),. - . 7(1):] - vnlli ||e1||p means the p-th norm of vector e’ with 1 < p < 00, (p = 1 and p = 00 need special treatment) ; v,- : the output of the j-th neuron; n : the number of neurons of the system; m : the number of patterns (data set) to be stored; v : the l-th specified pattern; v- : the j-th component of vector V”; f : a scaling function and f (m) > 0; u,- : the input to the j-th neuron, and v,- = S(u,~); S : a sigmoid function (a bounded, monotone increasing diffeomorphism); In particular we have used 5 = tanh() in our simulations. We define the system dynamic equations by 0E 12' : __ ' 0v,- i=1,2,...,n. (3.2) This is a gradient-like system. It preserves all the characteristics of a gradient system, such as no oscillations and global convergence. Since E Z O, the “energy” is bounded from below. Thus, the system is guaranteed to converge to its limit set which (L) 1,1771 Proo liter: and fen 24 which is composed of equilibria only [H874]. 3.3 Stability Analysis For the new artificial neural network architecture, the first issue we want to address is: if the desired patterns are the stable equilibria of the system, i.e. can they be recalled? Theorem 3.1 answers this question. Theorem 3.1 For the system defined by equation (3.1) and equation (3.2), suppose V" = [vf', 125’, . . . ,v’,“,’]T, l = 1,2, . . . ,m, are the desired/pre-specified patterns (note: in this paper we use bold face to represent vectors), Then V", I = 1,2,.. . ,m, are asymptotically stable equilibria of the system, i.e. they are guaranteed recallable. Proof: Since el(v) = [(1)1 _ v1): (v2 _ ’02), - ° ~ 1(2):;1— vnllTa then e’(v") = [0, 0, . . . ,0]T and Ile’llp = 0 => E(V") = o. If VaiéV‘l Vl=l,2,...,m, then new)”, >0 v1=1,2,...,m. - 1 e - Srnc p Therefor' assume \ pattern l that E13 V #V“. System C Where it 25 Since % > 0 and f(m) > 0, thus E(V)>0 Wé {V'1,V"2,...,V"m}. Therefore, V", l = 1,2,... ,m, are isolated minima of the “energy” function (we assume V‘“, l = 1,2, . . . ,m, are distinct). Consider a small nerghberhood Q around pattern k such that V"" is the only system equilibrium point inside 0, It is obvious that E(V) is continuously differentiable over 9. E(V’k) = 0 and E(V) > 0 for V 94 V“, V 6 Q. The derivative of the “energy” function along the trajectory of the system can be expressed as: dE(V) ” 8E 62),- , i=1 n BE 2 av.- '-*Xdaflznr an i=1 where 32—3: = 513%). By our choice of the function 5', we have git? > 0. Thus dflV) a <0 mrv¢v* mm Therefore, according to Lyapunov’s stability theorem [Kha91], V'k is an asymptoti- cally stable equilibrium point of the system. The second question about the new system is if the regions of attraction for each pre-specified patterns can be estimated, i.e. how stable are the desired patterns? From the proof above we know that the “energy” function E can serve as a Lyapunov function for the equilibrium point V‘k over the region Q(V"‘). Thus we can cab l::l.2 In H _ l wdlana 3.4 llhen u unbev The 53's“ 26 can calculate the Lyapunov surface to estimate the regions of attract for each V‘“, k=1,2,...,m. In this section, we investigated the overall stability of the new architecture. We will analyze this new model for specific norms in the following sections. 3.4 The Product-of-norm Model Using L2 Norm When using L2 norm (p = 2) in the product-of—norm model, the “energy” function can be written as: E = gnmrfiuetvmi l=1 = gummnvrhvu: = if (m)1'[[(v; ’+—v1)2 (valgv2)2+...+(v;I—vn)2]. (3.6) The system dynamic equations now become in = f(m)Z( (vfl-v.) f1 IIV'k- VH3 l=l kil _—_ f(m )(f: v”’—v,-)fi [fluff—22,) 2] i=1,2,...,n. (3.7) l=1 k=1j-l k¢l From the above system dynamic equations, it is obvious that at any desired pattern, say V", U = 0. It should also be observed that for a vector V to be a sr'stem its comj Otherwi Thus a! range of the IOllo TheOre “Tl, 27 system equilibrium point, i.e. m) in? — v.) 11 [for — ..,)2] , (=1 k=1 i=1 k¢l its components must satisfy the following conditions: min{v‘-"I :l=1,,.2 ..,m}$v,- i=1,2,...,n (3.8) max{v,-‘“ :l=1,,.2 ..,m}2v,~ i=1,2,...,n. Otherwise, there will exist an i 6 {1,2, . . . ,n} such that (vII—vg)fizl(v; - I‘-vJ-) 2>0 Vl€{1,2,...,m} (3.9) 52;} “=1 or . “—v,) )H —vJ-) <0 VlE{1,2,...,m}. (3.10) k=1j=1(v k 1 Thus it,- # 0, and V is not a system equilibrium point. Inequality (3.8) specifies the range of the locations of all the system equilibria. To summarize the above, we have the following theorem: Theorem 3.2 For the dynamic system defined by equation (3.7) o The desired patterns V", l = 1, 2, . . . , m, are the asymptotically stable equilibria. o For a vector V to be a system equilibrium point, it must satisfy the inequality (3.8). Note that this theorem can be applied to general pth norm also. 3AL1 For our vestrge and the Theore lies: 0T. f0 be Proof: Let 28 3.4.1 One-neuron/m—pattern Case For one-neuron/m-pattern case, the system’s characteristic can even be further in- vestigated. When n = 1, the “energy” function becomes E = —f(m) H(v" — v)2 (3.11) and the system dynamic equation becomes: m u = f(m) (E: (1)“1 — v) fiw'k — v)2 . (3.12) =1 iii Theorem 3.3 The system represented by equation (3.12) has the following proper- ties: o The system has a total of 2m — 1 equilibrium points. 0 The pre-specified patterns v". (l = 1,2, . . . ,m) are asymptotic stable equilibria. The other m — 1 equilibria are unstable. 0 Suppose v", (I = l,2,...,m), are distinct and v"1 < v"2 < < 22"”, then for every i E {l,2,...,m — 1}, there is an unstable equilibrium point v'li'f‘l'” between v‘“ and v41“). Proof: Let us define a new function: m H(v) = f(m): (v"l — v) fi(v'k - v)2 . (3.13) l=l k=l k¢l Then ti Since ii That m Re“ Where w II [S 0bV ThuS r I! b . e gll’en ’31 a 29 Then the roots of H (v) are the equilibria of the system defined by equation (3.12). Since H (v) is a polynomial of order 2m — 1, it generally has a total of 2m — 1 roots. That means system (3.12) will have at most 2m - 1 equilibria. Rewrite H(v) as H(v) = f(m)[ImI(v"-v)] filer-v) l=l l=1 til = Ho(v)H1(v), (3.14) where we define Hotv) = f(m) (fun—w], (3.15) H1(v) = in: fi (v'l -- v) . (3.16) 52:: It is obvious that at the specified patterns 12'“ (l = 1, 2, . . . , m), 110(1)“) = 0 and H(v") .-= 0. (3.17) Thus 1)", l = 1, 2, . . . , m, are equilibria of the system. The other m — 1 equilibria will be given by the roots of function H1(v). At a specific pattern v”, H100“) = ifi(vtk_v¢i) Since 2- i.e. sigr. Because interval HUmber these to, The been 5110 1. Can b( The f EVa lu 30 = liljwk — v")] [ f} (M — m] . (3.13) k=i+l Since v'k — v" < 0 for every k 6 (1,2,. .. ,i — 1], {-1 Sign (H(v‘k - v”)) = (-1)“‘, k=1 i.e. sign(H1(v‘f)) = (—1)l'1. Therefore H1(v")H1(v'(‘+1))< 0 i = 1,2, . . . ,m — 1. Because H1(v) is a continuous function of v, there must be a root v‘llt‘H) in each interval (v‘i,v’(‘+1)), i = l,2,...,m — 1, such that H1(v'li’f+1)) = 0. The total number of roots we have now is 2m — 1. Since H (v) is a polynomial of order 2m — 1, these roots must be the only equilibria of the system. The stability of the system at the pre—specified patterns v", i = 1, 2, . . . , m, has been shown at previous section. The instability of patterns v‘li'l“), i = 1, 2, . . . , m — 1, can be proved as follows: The function H1 (v) may now be expressed as H1(v) = eniil(v‘(q’q+ll — v), (3.19) q=l where c > 0, and c is a constant. Evaluating the derivative of equation (3.14) with respect to v at v‘lr'r’”), we obtain d d __ ’.(r,r+l) = :(r,r+l) _ .(r'r+]) dell ) Ho(v )dv H1(v ). (3.20) where Thereft Since thus Therefi So the) Fig- HeUron are glo equatic Points ‘ toms b CIOSS I,[ 31 where d 1 +1) "H ( +1) ( +1) _ :1 r,r : _ t q,q _ n1 r,r de1(v ) c 3;]; (v v ) q¢r m—1 = (—1)'"‘c H Iv‘m’“) — NW”). (3.21) :1 3.1+ Therefore - d 111(1'1'4-1) T-I Sign EH1“) ’ ) = (—1) . (3.22) Since sign (Ho(v'("'+1))) = sign (f(m) LH(v'l — v'("'+1))]) :1 thus _d_H(v1-1(r,r+1)) = 1H (vt(r,r+1))Ho(vn-(r,r+1)) < 0 (3 24) do dv ‘ ‘ ' Therefore, Vim“), r = 1,2, . . . ,m — 1, are the maxima of the “energy” function. So they are unstable equilibria. QED Figure 3.1 and 3.3 show two example plots of the “energy” function for two one- neuron/m-pattern cases. As we can see from the plots, all the desired patterns are global minima of the “energy” function. The corresponding system differential equations are plotted in Figure 3.2 and 3.4 where we can see that all the desired points are the roots at which the plots cross the :r-axis with a negative slope. All the roots between the desired ones are between two consecutive stable ones and the plots cross the :r-axis at those points with a positive slope. 32 0.002 0.0013 0.0016 0.0014 L740.0012 g5 0.001 £00003 0.0006 0.0004 0.0002 Figure 3.2. The plot of function H(v) with v"‘1 = 0.25 and v'2 = 0.75 Figure 33 3.5e-005 1 1 1 1 1 3e-005 2.5e-005 E26005 Elie-005 le—005 5e-006 0 0.2 Figure 3.3. The plot of “energy” function E = 0.5(0.25 — 2:)2(0.5 -— :r)2(0.75 -— 0:)2 0.002 1 1 1 1 1 0.0015 0.001 0.0005 -0.0005 -0.001 -0.0015 -0.002 J_ 1 J l l 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Figure 3.4. The plot of function H(v) with v‘1 = 0.25 v“2 = 0.5 and v"'3 = 0.75 Le: neuron each s; for ex: Withor m-lr and thel The [OCT Stable e. lOCatiOn. C3583 for 34 Let us look back again at the proof of theorem (3.3). We find that for the one- neuron/m-pattern case, we can even further control the regions of attractions for each specified pattern by specifying the unstable equilibria between the stable ones. 1 2 For example, suppose we want to store 22" , v“ , - - - , and 22"" as the stable equilibria. Without loosing generality, we assume v"1 < v"2 < < u”, then we can select m — 1 points: 111,122, - . - , 11““, such that ‘1 < 0""; (3.25) v‘l ’01 = (3.34) The system has a total of five equilibria. 3.5 The Case of Loo Norm For the case of p = 00 ( Loo norm ), the derivative of the norm does not always exist. The “energy” function is not continuous. This case need special treatment in software or hardware implementations. Here we define the “energy” function as: E = f(m)fi|lV"-Vlloo l=1 f(m) fi max Iv;fl — vJ-I. (3.35) l=l '7 Since the partial derivative aimllV"I - Vlloo does not always exist, to make the system gradient-like, we define 0 if (1);.“ — 12,-! 71 111311111123“I - v.) a .1 def . 1 .1 EZIIV - Vlloo — —1 1f v1- — 12,-: —mJax|vJ — ml (336) ° .1 _ cl ' 1 if v,- -— v, — mjax [v]- — vJI "z“ a“. (3-37) l The syster Since the 1 simulation 39 The system differential equations are defined as u,- = f(m) E 6;..- H mJax lug" — vJ- . (3.38) i l=1 111 Since the a. is not continuous, we need to choose f (m) to be very small in software simulations. I \ 1' / \ 1. ll 11 l I Figure 3.7. An example vector field for Loo norm Figure 3.7 shows how the vector field of this system will look like around a stable equilibrium point in two dimension case. As it can be seen from the figure that the system has limit number of directions to go at any point. If we fix the step size fo the system, the system will become a grid system. The system will move from one node of the grid to another. This may make it easier to simulate the system using digital computers. 40 3.6 Simulation Results To verify the correctness of the theoretic results and demonstrate how this model can be used in practical applications, we present several simulation results of this model used in image recognition and association. In the following examples, we use the outputs of neurons to represent pixels of images. Each pixel is shown as a square in the figures below. The neuron output value of 1 represents a full back square in the figures, the value of 0 represents a full white square and values between 0 and 1 represent black squares proportional to the values. We use L2 norm in the simulations. In example one, we considered a 304-neuron network. For each desired image, we stored the image and an assigned code in the network. This way we can read the output of the network by reading the outputs of the neurons representing the codes. To compare its performance with BAM [WCM89, WCM90, WCJ91], we have used the same image patterns as those used in [WCM89]. Figure 3.8 shows an example where six image-code pairs are stored. When the network was given the initial input as shown in the figure, the network retrieved the corresponding images correctly. In example two, we used a network with 37011 (169x219) neurons to store two facial images as shown in figure 3.9. Figure 3.10 shows that given a part of an image, the network was able to retrieve the who image correctly. 41 Figure 3.8. Initial and final outputs of the network with six pre-specified patterns 42 Figure 3.9. Stored two facial images Figure 3.10. Image retrieving process CHAPTER 4 GENERALIZATION AND VARIATIONS OF THE “ENERGY” FUNCTION FOR PRODUCT-OF-NORM MODEL In general, the proposed product-of—norm model is very complex: neurons are fully connected to one another. For the L2 norm case, the order of the system increases as the number of data sets increases. It is also difficult to modularize or parallelize the system in software simulations or hardware implementations. In this chapter, we will investigate various ways to simplify the system. We will also present some example ways of tailoring the “energy” function for specific applications. Our focus here is the L2 norm form of the product-of-norm model. However, the techniques discussed here can be applied to other forms of norm too. 43 44 4.1 The Log Form of the “Energy” Function If we take log() on both sides of the original “energy” function (3.1). We get a new “energy” function as: 1‘2 élogE=f.(m)+}:logne'H:, (4.1) l=1 where fl is a function of m. It is a scaling function like f (m) Now the system differential equations can be calculated as: 111' ll 1 IV] —-7||e‘||” i=1,2,...,n. (4.2) These dynamic equations describe a gradient-like system. Different from the original product-of-norm model (3.1), this system has been changed to a sum-ofnorm form. In the system differential equations, the terms corresponding to each pattern are now computed separately and then added together. However, the “energy” function is no longer bounded below. Fortunately, the points at which the “energy” function goes to negative infinity are the pre-specified patterns. In simulations, the system can stop at certain point according to the specified accuracy. Thus in simulation, even though the “energy” function is not bounded below, starting from any initial point, the system will still “converge” to one of its stored patterns. To avoid the problem that the “energy” function is not bounded from below, we 45 can modify the “energy” function as follows: E 2 S": [log(ne'n; + on] . (4.3) l=1 where c) > 0, l = 1,2, . . . ,m, are pre-chosen constants. The new system differential equations become 3 . m 1 , p . i=- —— =1,2,...,. 4.4 " Elle‘llhczavd'e“? ‘ " ( ) This system “energy” function now is bounded from below. In addition, it preserves the features of gradient systems. The terms corresponding to each pre-specified pattern in the system differential equations are added together. However, the pre- specified patterns will generally no longer be system equilibria. We will address the relation between the equilibria of the new log-form system and the original product- of—norm system next. Lemma 4.1 For a system with “energy” function E = fitlle’llt + or). (4.5) where c; > 0 (l = 1,2, . . . ,m), and c; are constants. The system diflerential equations are defined by: . m a r . u,- = 1}; 8—mue'ngg [||e'°||;+c,.] z = l,2,...,n. (4.6) - 11321 If V' is a (isolated) local energy minima of the system (4.5), then V' is a (isolated) 46 local minima of the “energy” function of (4.3). Proof: To distinguish the two “energy” functions, we refer to the “energy” defined by equation (4.3) as E in this proof. Since V“ is a local minima of the “energy” function of (4.5), there exists a small neighborhood I) of V", V" 6 R, such that for any point ofV#V*andV€Q,wehave E(V) — E(V‘) 2 0. For the system (4.5), since E > 0, Thus Since Thus E(V)—E(V') = logE(V)—logE(V") _ E(V) E( )= log E(V) and E(V") = log E(V‘). (4.7) (4.3) (4.9) (4.10) (4.11) (4.12) 47 and 2 0. (4.13) Therefore, V' is a local minima of the new “energy” function (4.3) too. It can been seen from the proof above that if we change the inequalities to strict form we can prove the case of isolated “energy” minima. We omit the proof here. QED. Lemma 4.2 A pattern V" is an asymptotically stable equilibrium of the system defined by (4.4)/(4.6) iff it is an isolated local minima of the “energy”function defined by (4.3)/( 4.5). Proof: Firstly, we want to show that if V" is an asymptotically stable equilibrium of the system (4.4)/(4.6) it must be a local “energy” minima of (4.3)/(4.5). We prove it by contradiction method. Case 1. Suppose V‘ is an asymptotically stable equilibrium of (4.4)/(4.6), but it is not a minima of the “energy” function (4.3)/(4.5). Define a small neighborhood D around V‘, such that V“ is the only system equilibrium point inside D. Consider a new function E 2- E (V‘) — E (V) It is obvious that E is continuously differentiable over D. Since V‘ is not a local minima of the energy function E, there exits some V0 with arbitrarily small llVo — V‘H and E(Vo) < E(V“) => E(vo) > E(vr) = 0. (4.14) Define a set 5 as n = {v e B. : E(V) > 0} (4.15) 48 where the ball B, 3 {V E R" : ||V — V*|| g r} and B, C D. Along the trajectory of the system E=—E 5E 8v,- . - £37317; 3v,- 2 = 2:51:12; > 0 for VEB. (4.16) According to Chetavev’s theorem [Kha91], V" is unstable. Contradiction! thus, V" must be a local energy minima. Case 2. Suppose V“ is a local energy minima of the system (4.3)/(4.5) but not an isolated one. Then for any small neighborhood of V" defined by Q = {V : [lV—V'II < e}, e > 0, there exits a local energy minima V0 6 I? such that 3% = 0 —-> Ulv0 = 0. Thus V" is not asymptotically stable. contradiction! Therefore, V‘ has to be an isolated local energy minima. Now we will show that if V“ is an isolated “energy” minima, then it is an asymptotically stable equilibrium point. Since V“ is an isolated “energy” minima, 0E WIV. = 0. (4.17) Thus V" is an equilibrium of the system (4.4)/(4.6). Consider an small neighborhood 0 around V‘ such that V“ is the only system equilibrium point inside (I. Then consider the continuously differentiable function E(V) over It, E(V') = 0 and 49 E(V) > 0 for V 75 V‘, V E 9. Thus E can serve as a Lyapunov function candidate over 9. Along the trajectory of the system, dE(V) _ dE(V) dt _ dt < 0 for v¢vzven (4.13) Thus V" is an asymptotically stable equilibrium of the system (4.4)/(4.6). QED Lemma 4.3 Suppose V", l = 1,2, . . . ,m, are the pre-specified patterns of the sys- tem defined by equations (4.5) and (4.6). Given any small positive constant r, there always exist a set of constants c; > 0, l = 1,2, . . . ,m, such that, for l = 1,2, . . . ,m, the system (4.5) has a local “energy” minima V" and “V"1 — ‘7‘!“ < r; V" is an equilibrium point of system (4.6). Proof: For simplicity, let us start by choosing 0 T1) 0} (4.19) such that v:1 go, l=2,3,...,m. (4.20) 50 Define another neighborhood 91 of V"1 as: 01 = {(V'"1 + AV) : r1 2 “AV“ 2 r2, r1 > r2 > 0}. The Open neighborhood Q; is defined as $12 = Q — (21. Consider the system “energy” function (4.5) at pattern V'I: E(V“) [Ile‘(V“)II£ + c1] E -ltve) = CrEn—1(Vd)a here [Ie1(V'1)||p = 0 and we define 13,._1 = II [Hang + a] . l=2 At a point V = V'1+AV, V E 91, E(V) = [A8 + Cl]En_1(V), where Ac = ||e1(V)||§. Therefore, E(V) — E(V’l) = [Ae +, c1]E _1(V'I+AV) — clEn_1(V'l) = AeEn-1(V’1+AV) + C1 En-1(V'1+AV) '- (4.21) (4.22) (4.23) (4.24) (4.25) 51 ClEn-1(VT1) > AeE.._1(V‘1+AV) + c1173.-. (v:1+ AV) — 61En_1(VT1). (4.26) where En_1(V) = {I}; [lle’llg + 1] and En_1(V) = 117;, Mel”; . For V‘1+AV 6 91, Ae > 0, En_1(V*1+AV) > 0 and c1 > 0, Since Ae and En-1 (V*l+ AV) are continuous with respect to AV over the compact set Q1, there exists a > 0, such that AeEn-1(V"l+AV) > a for every point (V‘1+AV) 6 I'll. Similarly, there exists b > 0 such that En_.1(V"1) < b for V‘1+ AV E 91. Therefore, for V’l+ AV E 521, we can always find a constant c1 > 0 such that a Cl < h and c1 < 1. (4.27) Thus, for any (V‘1+AV) E (11 AeE.._1(V"'1+AV) + C; [E -1(V"1+AV) — E,_,(v*1)] > a — clb > 0. (4.28) Since E (V) is continuous over the compact set 9, E (V) must have a minima V" 6 0 such that E(V“) S E (V) for every V 6 9. From inequality (4.28), we know “V”l — V'lll < r2 < r, i.e. V“ 6 02. (4.29) 52 That is V" is a local “energy” minima. Since this is a gradient like system, V“ is a equilibrium point of the system. Similarly we can prove the existence of c), l = 2,3, . . . ,m. we omit the detail here. QED From lemma 4.1 and lemma 4.3 we have the following theorem: Theorem 4.1 Suppose V", l = l,2,...,m, are the pre-specified patterns of the system defined by equations (4.3). Given any small positive constant r, there always exist a set of constants C) > 0, l = l,2,...,m, such that, for l = l,2,...,m, the system (4.3) has a equilibrium point V" and [[V"'1 — V‘IH < r. For the case of L2 norm, we can further study the stability of the equilibria of the new system. Lemma 4.4 For case of L2 norm, suppose V", l = 1,2, . . . , m, are the pre-specified distinct patterns of the system defined by equations (4.5) and (4.6). Given any small positive constant 6, there always exist a set of constants c) > 0, l = 1,2, . . . ,m, such that, forl = l,2,...,m, the system (4.6) has a corresponding asymptotically stable equilibrium point V" and ”V" — V"[| < 6. Proof: For L2 norm, equation (4.6) becomes 4:4: (v:’—v.-)II [Ile"||§+61] 1' (=1 l,2,...,n. (4.30) k=l k¢l 53 It can be written in general vector form as U = f(U)1 (4°31) where f is a C1 vector field. Similarly, by taking f(m) = 2, we can write equation (3.7) as U = f"(U). (4.32) Note that when c1: 0, l=1,2, . . . ,m, we have f = f‘. We have already showed V", l = 1, 2, . . . , m, are asymptotically stable equilibria of the system (3.7). The Jacobian matrix of system (3.7) evaluated at a specified pattern V" is 2\1 Df (U) ,,.. = (4.33) ’\n J where m n .. .12 av.- A1=-HLZ(”jk‘vjl[[a—m k=l i=1 k¢l ] < 0. (4.34) Therefore V", l = 1, 2, . . . , m, are hyperbolic equilibria. It is obvious that D f (U)|v.1 is invertible. We define the Cl—norm [IhIIC1 of a vector field h E v(R") (v(R") means all the vector field over R”) to be the least upper bound of all the numbers of ||h(U)|| and ||Dh(U)||, for U E R". According to the perturbation theory [H874], for any 6 > 0 there always exists a neighborhood a = {g : ||g — f‘HCx < r, r > 0} C v(R") 54 of f“ such that for any 9 E a there is a unique equilibrium V"1 of U = g(U) and “V“1 — V*1[[< 6. Since [If—f‘llcx is a continuous function of C), l = 1,2, . . . ,m, and [If-f‘llcx = 0 when c1 = 0, l = 1,2, . . . ,m, we can always find a c1 = c; > 0 such that the vector field f“) E a, (4.35) where M) = f()I..=..=...=c-=o. (436) Therefore, there is a unique equilibrium point V“ of U = f(1l(U) and “V"1 — V‘IH < 6. Further more V“ is asymptotically stable and hyperbolic. Now consider a vector field f (2) defined by f(2)() = f()lcl=c;,C3=C4=-~=cm=0 (4.37) Since V'"1 is a hyperbolic equilibrium of U = f(1)(U), the eigenvalues of the J acobian matrix D f (1) v.1 will have nonzero real parts. Thus, D f(l) 9.1 is invertible. According to the perturbation theorems, there exists a neighborhood 01 of f (1) such that for any g, E 01 there is a unique equilibrium point V“ of U = g1(U) and ”V” — V'lH < 61 < 6. Since ”V" — V“|| < e, we can always make cl small enough so that “Val — V*1|| < 6. Further more, V1.1 is hyperbolic and asymptotically stable. Since Hflzlllcx is a continuous function of c2, and fl?) 2 fm when c2 = 0, there exists a open set (0, b1) with b1 > 0 such that for very 62 E (0, b1), fl“) 6 01. Thus, there exits 55 a unique asymptotically stable equilibrium point VT] of U = f (2)(U) and —._-* (IV 1 — V“|| < e. (4.33) Since V‘“ is a hyperbolic asymptotically stable equilibrium point of f (1) too (this can be verified like those of f ‘, we omit the detail here), we can similarly show that there exists a set (0,b2) with b2 > 0 such that for every c2 6 (0,b2), there exits a unique asymptotically stable equilibrium point V‘“ of U = f(2)(U) and [IV‘2 — V‘zll < e. (4.39) We can choose c; E (0,min(b1, b2)) to have both (4.38) and (4.39) hold. Similarly, we can show that there exist C3 > 0, c4 > 0, . . ., cm > 0 such that there exits a unique asymptotically stable equilibrium V", l = 3,4,... ,m and ”V" — V"|| < e. QED From lemma (4.1), (4.2) and (4.4) we have the following theorem: Theorem 4.2 For case of L2 norm, suppose V", l = l,2,...,m, are the pre- specified distinct patterns of the system defined by equations (4.3) and (4.4). Given any small positive constant 6, there always exist a set of constants c) > 0, l = l,2,...,m, such that, forl = l,2,...,m, the system (4.4) has a corresponding asymptotically stable equilibrium point V" and ||V"" —- V‘lll < e. The proof is straight forward. We omit it here. From theorem 4.2, we know that we can use the new log form system to store patterns as close to (not equal to) the desired ones as we want. 56 For the new system, the right hand side of the equations (4.4) is in the form of summation, and the effects of the desired patterns are additive. This would make it easier to implement the model in standard software/ hardware modules which can be easily parallelized. This parallelized architecture of modules is depicted in Figure 4.1. As can be seen from the picture, each processing module works on one pattern alone. Therefore, the process can be parallelized. Some simulations have shown that, compare to those networks which use the original “energy” function, networks using this log-form “energy” function converge (retrieve) faster. It can also be observed that it is easier for this system to “learn” new patterns. To “learn” (store) a new pattern, the original structure of the system does not have to be changed, only a new module is added to the original network. The stored patterns of the old system are still stored (to certain degree) in the new system. 4.2 Localization of the Product-of—norm Model 4.2.1 Motivation The original product-of—norm system model described in equation (3.7) is fairly complex: the feedbacks are nonlinear; the structure is fully connected. As the number of units (neurons) increases, the number of connections will increase in the order of n“. This will limit the size of the network in implementations. In software, the limitation will appear in the time to simulate a reasonable-sized network. In hardware, the limitation will manifest itself in the full connectivity. Thus it appears to be difficult to directly implement this model in hardware using present technology. Therefore it is necessary to simplify the original networks. Processing Module 1 Processing Module M Processing Module 2 ' Neurons Figure 4.1. A modularized structure of Product-of-norm model 58 In last section we investigated the way to use log-formed “energy” function to modularize the original system. In this section we will explore other approaches to simplify the network. Our first goal here is to localizethe connectivity. As oppose to haphazard structure, we seek to select a local structure that in addition to preserving some of the delineated features of the original product-of—norm model, would only add a limited number of spurious equilibria. The specific approach rests on retaining the gradient systems property while rendering the pre-specified patterns as isolated minima of the “energy” function. Hence it can be guaranteed that the pre-specified patterns will always be asymptotic stable equilibria of the system. 4.2.2 Subgrouping Method A naive way to simplify the network is to divide the whole network into smaller subnetworks which are isolated from one another. One builds each subnetwork separately and assembles the small networks together to form the large network. Figure 4.2 depicts an example of a network grouped this way. The “energy” function which describes this model is: E = i ITZ('J.‘-'-vj)2+l'nl2(1).l"-v1)2 l=IjEN1 [=1 jEN2 +... +11 2 (v;l—vj)2 , (4.40) l=1j€Np Where Nk, for k = 1,2, . . . , p, are the index of neuron in subnet k (p is the total number of subnetworks). N,- r) NJ- : 05, for i 9193' and i,j 6 {1,2, . . . ,p}. The system 59 dynamic equations are: Di=-—. i=1,2,...,n (4.41) i.e. 4:701) lion-10ft : («at—1).)? . (4.42) k=1 JEN}. 1‘9“ 16M. Since at the pre-specified pattern V", the “energy” E(V'“) = 0 and E(V) > 0 for V 94 V", l = 1,2, . . . ,m, V" are the isolated “energy” minima of the system. Therefore they are asymptotic stable equilibria. However, all the combinations of the minima of the separated p subnets are stable equilibria of the system also. This leads to the generation of more so-called spurious equilibria. Thus the region of attractions for each equilibrium point will become smaller than that in the original product-of—norm model. In this model, each neuron is only directly connected to neurons in its own subset. Compared to the original model, the connectivity has dropped dramatically. A similar approach is to divide the original network into smaller ones too. However, this time each small subnetworks is coupled with one or more other subnetworks. That is, the subnets overlap with one another. Figure 4.3 shows one example structure of such network whose original network is divided into four subnetworks. The neurons which are marked in dark color belong to two subnets, while the rest belong to one subnet only. The motivation of building subnetworks with overlap is to overcome the generation of spurious equilibria of the previous model. In the previous model, each subnetwork works almost independently. Thus any combination of the “energy” minima of each subnet composes one system stable equilibrium point. While in this new model, 60 the “energy” minima of one subnet will affect the minima of other subnets. Thus intuitively, it should have fewer spurious equilibria than the previous model does. Compared to the previous model, this model will have more connections. The network will be a little more complex. However, compared to the original product- of-norm model, the connectivity is much less and the structure is much simpler. The system is still a gradient-like system and retains the features of a gradient system. The choice of the size of each fully connected subnetworks and the degree of overlapping with their “neighbor” subnetworks should be chosen judiciously for each application, i.e. they should be application dependent. We observe that there will always be a trade-off between the connectivity of these subnetworks and the application-related performance. The “energy” function for this new approach can be formalized as: l m kl m :l E = a H 211.- -v.-)2+II 21v.- w)? (=1J'5N; (=1 jeNz’ + . .. + fi 2 (1);! _ 101-)2 , (4.43) l=l jENT', where p is the total number of subnetworks. N = {1,2,. . . ,n}, N: C N for every i€{1,2,...,p}.Ni’f'i{N,’,N§,...,N,-’_,,N,-’+1,...,N1’,}9E 45 for i = l,2,...,p, and ULM=N. If we define I‘(i) to be the set of subnets which contains neuron i, that is mmmm,gnwg MM) 61 Group 3 Figure 4.2. An example structure of dividing a big network into isolated smaller subnetworks Figure 4.3. An example structure of dividing a big network into overlapped smaller subnetworks and i e N; 4:) N,’c c Ni), (4.45) then, the system dynamic equation for the model can be expressed as 11,-: f(m) Z Z (1):“ — v.) H 2 (vs-”f — vJ-)2 , i=1,2,...,n. (4.46) N66I‘(i) l=l 52:} JEN; As we will show in the simulation examples later, if the portion of neurons correspond- ing to those coupling neurons (neurons belonging to more than one group, shown in dark color in Figure 4.3) are the same for all the pre-specified patterns, the coupling neurons may lose their coupling effects, i.e. they may fail to pass information between the two overlapping subnetworks, therefore, they cannot prevent the generation of spurious equilibria. 4.2.3 Simulation Results To investigate the performance of the two new models, we built a network with 288 neurons using different models presented in previous section to store three patterns. Figure 4.4 shows the patterns/ images to be stored by the network. Figure 4.5 shows the two test patterns which we used as input to the network. Network one is built using the “energy” function described by equation (4.40). Figure 4.6 shows the grouping used in this network, which does not use overlapping method. Network two is built using the “energy” function (4.43). Figure 4.7 shows the grouping used in this network. In the simulation, both networks successfully retrieved the correct patterns when applied with the test patterns shown in figure 4.5. 63 Figure 4.4. Three stored patterns Figure 4.5. Test patterns 64 Figure 4.8 and figure 4.9 show the retrieve process for these two network when a given input is composed of half of the image “1” and half of the image of“3”. The network which was built using overlap method successfully retrieved a stored image “1”, while the one with no overlap failed to retrieve a stored image. Group 1 / g Figure 4.6. No over-lap grouping . -----"‘U l Figure 4.7. Over-lap grouping From the simulation results presented above, we can make the following remarks: 0 The grouping of the neurons affects the performance of the network. The 65 Figure 4.8. The retrieving process of the network using non-overlap method Figure 4.9. The retrieving process of the network using overlap method 66 grouping is dependent on the patterns to be stored. o The model of overlapping groups has better overall performance for correct retrieval of the stored patterns. This demonstrated that the overlapping among subnetworks is essential for prohibition of the generation of spurious equilibria and the enlargement of the basin of attraction for the desired patterns. 4.3 Modified Product-of—norm Models for Recog- nizing Shifted / Translated Input Patterns 4.3.1 Motivation The original product-of—norm model described in equations (3.7) cannot retrieve the correct patterns effectively when the input pattern is a shifted/translated stored pattern. An example network which stored patterns shown in figure 4.4 failed to retrieve the correct pattern when given a shifted version of the stored pattern as shown in figure 4.10. In this section we will explore various approaches to modify the original model, so that it can tolerate (to certain degree) shifted/ translated images while retaining the features of being a gradient-like system. This section also serves as an example of how the “energy” function can be tailored to suit specific application needs. 4.3.2 Modified “Energy” Functions We seek to modify the “energy” function to take the shift/translation effects into account. Let us assume the images to be stored are in rectangular form and the 67 Figure 4.10. Shifted input pattern neurons are indexed in such a way that the images are represented row by row, that is, the first row of the image corresponds to neuron 1 through c (c is the number of columns of the image), and the second row corresponds to neuron c + 1 through 2c, - - -. We propose a modified “energy” function as follows: 1 m n E — 5701011 Ztv.’ — v.)” + [:1 j=1 . 2 $14 f]: L 2 (v;‘ — an] (4.47) 1=11=1 =(i—l)c+l where K : is a constant and K > 0. Used here as a scaling factor; c : the number of columns of the digital image/ pattern; 12 : total number of neurons of the network, which equals to the total number of pixels of the stored image; r : the number of rows of the digital image/pattern; m : number of images/ patterns to be stored. 68 In comparison with the original “energy” function (3.7), this modified “energy” function has an additional term, which actually measures the error (i.e. the difference between the current output of the neurons and the stored images) of each row as a whole. This term serves as a penalty function. It is zero when the input pattern is a horizontally shifted version of one of the stored patterns, while is non-negative otherwise. It is obvious that at a stored pattern V", l = 1,2, . . . ,m, E = 0. Thus V"1 is an isolated minima of the “energy” function. Therefore, V", l = 1,2, . . . ,m, are stable equilibria of the system. When the current state of the system is close to a shifted version of a stored pattern, the added term will be relatively small and make E small also compared to the case of a random image as input. This, in a sense, builds a “energy” valley to attract t hose shifted images to the correct patterns stored. This can also be viewed as a simple feature mapping, the goal is to choose such maps that will separate input patterns in feature space while being shift-invariant. The “energy” function introduced this way may introduce more local “energy” minima (spurious equilibria) and in general will not be able to recognize all the shifted versions of the pre-specified patterns. The value of K becomes crucial when it is used for different applications. The new dynamic equations now become: 111' = f(m)ZW‘-01)(Hz Uri-”02+ l=l k=l j=1 krfil m J(3)+C m 1- pc 2 K: T Z (”id—‘04)] II XL 2 (vii—v11] i=1 1'_J(.') k=1 p=l ’=(:'-1)c+1 k¢l for i = 1,2,. . . ,n, (4.48) where R(i) is the index number of the row which contains neuron i, and J (z) is the 69 index of the first neuron of H(i). Another way of modifying the “energy” function to achieve the same goal is to add the penalty function as follows: Kill 2 (v;’—v,-)] (4.49) i=1 =(i-1)c+l The “energy” function of this form can be used in conjunction with the log modifi- cation method described earlier to achieve faster speed. It will also be easier to be parallelized or be modularize in software simulation or hardware implementation. The effects of this term can be controlled via the coefficient K. Later simulations showed that varying this constant is a very effective way of changing the regions of attractions for each stable equilibrium point. While, technically speaking, the values of K should be computed from a nonlinear version of the Lagrange multiplier approach, we opted for simplicity and practical implementation in software in our treatment. It should be clear from the discussion above that we can similarly construct an “energy” function such that the resulting neural network would be able to recognize vertically shifted patterns as well. Here we omit the detailed derivation. To modify the network so that it is shift-invariant, i.e. it is able to recognize both horizontally and vertically shifted images, we can obviously combine the two penalty functions used for horizontal shift and vertical shift together. Here we want to present a different approach. We propose to divide the network, more precisely 70 the neurons whose outputs represent an image, into small regions as we did in the subgroup method (See Figure 4.11 for example). Each region consists of a number of neurons which form a rectangular area of the image. For a region, I, obtained this way, we define the error for region I corresponding to the stored pattern V" as e,= 2(1) - -v.-)], (4.50) i6! where V is the current output vector of the neuron and v.- is its ith component. We use the summation of the errors of all the regions as the penalty term and add it to the original “energy” function. The new “energy” function can be expressed as: m n E m)H 2(1)}! — vj)2+ 1:1 j=l zéngfiZ[Z(v: -—v.-)]2, (4.51) l- 1166' iEI where K > 0 and G is the set of all the regions. The penalty term can also be added as in equation (4.49) to take the advantage of the log forms later. The new system differential equations are (assuming regions not overlapping with one another): "1);“ fl — vi) H 2(1), " "1)2 (=1 k=l j=1 k¢l m 2 +KZ 2(1); - ’,—v )H 2 k[:(v; - “-vj)] , (4.52) (=1 :61. Igé} leG jEI where I,- is the region which contains neuron 2°. The motivation for choosing such an error function can be explained as follows: 71 Figure 4.11. An example way of dividing the network into small regions 72 Referring to the grouping of neurons shown in Figure 4.11, if the input image is shifted one box (pixel) in the direction shown by the arrow, the changes in the error for each region will be in the order of (595-)? In general, if the region is w pixel wide and I pixel high (long) and the shift is hl pixels horizontally and fig pixels vertically, the change of errors for each region will be in the order of [1 — Egg—"Mr. When In and h; is much smaller than w and I respectively, the change of errors for each region will be very small. When h1 = h; = 0, the change equals zero. Thus it is not sensitive to shifts. This will help the network recognize shifted images as we explained in the horizontal-shift case. There is a trade-off between the tolerance to shift and the complexity of the computation. The choice of the size of the region is important. If w or I is too small, then the error defined for each region: wl e z [1— (I _ “1““ " M] (4.53) will be too large for a small shift of the image. If the region is too large, the term [2w — 4)]? (4.54) in the penalty function may become too small to detect the shift of the image. 4.3.3 Simulation Results To test the performance of the proposed modifications, we tried to build three 288- neuron networks using different “energy” functions. We used the images shown in Figure 4.12 as images to be stored and used images shown in Figure 4.13 as test 73 input patterns. Using these patterns, we tested three networks built using different models. Network one was built using the “energy” function (3.6). Network two was built using the “energy” function (4.47). Network three was built using the “energy” function described in (4.51). Table 4.1 compares the retrieve results for each case. As can be seen from the table, the proposed new network was able to recognize the shifted images in addition to those input images which are blurred by noises. However, we noticed that the modified network converges slower than the original model does in simulation. This is due to the complexity of the added connections. Table 4.1. Simulation results Test Patterns (60 (b) (C) ((1) model 1 x x \/ \/ model 2 \/ \/ x \/ model 3 \/ \/ \/ \/ Note: \/ means successful and x means failed. A (D V KKK KKKE 4.3.4 Discussion and Remarks So far we have surveyed current models and algorithms for neural networks used for information association, such as associative memory, especially in the application of character recognition and image retrieval. Our focus was on recurrent networks. We 74 Figure 4.12. Stored patterns (d) (e) (D Figure 4.13. Test patterns 75 proposed a product-of—norm model which has the following features: 0 The system eliminates the dynamic “learning” process. It can also be viewed as that the learning process becomes “static”. o The stored patterns are built into the network by design. In theory, such network can store any pattern as the system equilibria. There is no limit as to how many patterns can be stored. o The system is a gradient system which has no oscillation. The system is globally convergent . Some of the limitations of this network are: the original network is fully connected and is very complex, making it difficult to be implemented in hardware. We also investigated the product-of-norm model using different norms. Based on the original production-of—norm model, we proposed several modified network models. All of them retain the property of being a gradient-like system while still being able to store the desired patterns as stable equilibria. Especially, we investigated ways of tailoring the “energy” function for the application of image retrieval. We proposed ways of modifying the “energy” function of the original network to control the regions of attractions for each pattern, so that the network can recognize slightly shifted images. We proposed ways to simplify the network, specifically, we proposed to use “log” form of the “energy” function. Such networks can be easily parallelized in software simulation and modularized in hardware implementation. To store additional pat- terns, the network does not have to be built from scratch, it only needs to add some new modules to the original one. 76 We analyzed various ways to divide the original network into subnetworks so that the connectivity of the network can be reduced while we can still store the desired patterns. We demonstrated that the grouping with overlapping has better performance. We want to remark here: the simulation / application examples shown here are for demonstration purpose only. They illustrated how this model may be used in real life applications. The examples here are not intended for real image associations. Some of the methods are heuristic. As demonstrated by the examples, the new proposed models have great potential of being used in a wide range of real life applications: such as associative memories, categorization, character recognition and image association, etc.. CHAPTER 5 THE BUILDING BLOCKS FOR MICRO-ELECTRONIC CIRCUIT REALIZATION OF THE PRODUCT-OF-NORM MODEL 5.1 Motivations . One of the goals of the research on artificial neural networks is to understand how human brain works so that we can use hardware (circuits) to mimic the human brain to solve problems in real life applications. With today’s technology of very large scale integrated (VLSI) circuits, it is possible to fabricate tens of millions of devices (transistors) interconnected on a single wafer. This makes it possible to build complex ANN circuits for real world applications where digital computers have limited use. In this chapter, we will investigate ways of implementation of the product-of-norm 77 78 model using analog transistors biased in the subthreshold region. There are several advantages with circuits operating in the subthreshold region [Mea89]: 0 Power dissipation is extremely low - from 10‘12 to 10'6 watt for a typical circuit; 0 In the subthreshold region, a MOS transistor’s drain current saturates in a few KT/ q (thermal voltage), allowing the transistor to operate as a current source over most of the voltage ranging from near ground to Vdd; o The exponential nonlinearity is an ideal computation primitive for many appli- cations. First, we will propose some basic circuits which can be used as building blocks for the implementation of product-of-norm model of AN Ns. We will then present some examples of ways of implementation of small product-of-norm model. Please note that the example circuits presented here are for demonstration purpose only. They are not targeted to specific real world applications. 5.2 Background and Basic Circuits 5.2.1 MOS Transistors in Subthreshold Region The drain current of an N-channel metal-oxide-semiconductor (MOS) transistor bi- ased in the subthreshold region can be expressed as [Mea89]: I = Ioe;#”(e95? — 6%,), (5.1) I 79 where V9 : gate voltage. V, : souce voltage. Vd : drain voltage. 10 : a constant for a given MOS transistor. Isl . q . thermal voltage. When all voltages are normalized to quantity of %, the above equation can be written as: I = Ioe"V9(e’V‘—e"v") .—. Ioer-V-(r — 67V“), (5.2) where Vd, is the voltage between the drain and source nodes. When V4, is large enough, we have I z I... = Ioe"V9‘V' (5.3) where I... is the so-called saturation current of the transistor. Figure 5.1 shows the relationship of the saturation current vs gate voltage. 5.2.2 Current Mirrors Figure 5.2 shows two diode-connected transistors, their gates are connected to their drains respectively. For a typical process, Io is so small that even the smallest drain current used (10'12 amp) requires IQ, approximately equal to 0.4 volt [Mea89], thus makes the transistor well intro the saturation region of the drain current. Higher drain current requires higher V2,, (V4,). Thus for any useful V9,, the drain current of 80 16-005 3 I I I l I I T 1e-006 : _: a 3 ~73 4 1e-007 : A, i i , 1 16-008 1 1 1 1 1 1 1 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 V.. Figure 5.1. I,“ vs V9, the transistor has the same exponential dependence on the gate-source voltage. The current can be controlled by V9,. In circuit design, it is often necessary to change the sign (direction) of a current. Figure 5.3 shows two types of current mirrors using diode connected transistors. The input current I,“ biases a diode-connected transistor Q1. The resulting Vg, is just sufficient to bias the second transistor Q2 to a saturation current 10“. equal to 1m. As long as Q; stays in saturation region, ID“, is nearly independent of the drain voltage Of Q2. 5.2.3 Differential Pair Figure 5.5 shows a circuit which is usually called “differential pair”. The circuit uses the difference of the voltages as the input. Since we will use this kind of circuits quite 81 Figure 5.2. Two diode-connected transistors often, I will give a brief analysis here. Assume all transistors are working in the saturated region, the saturation current can be calculated as: I,M _—. IOeKVv-V'. (5.4) Thus we have I1 = Ioe‘VI'V, (5.5) 12 = Ioean-V. (5.6) Since 15 = 11 + 12 = Ioe’v(e"v1 + e"V’), (5.7) we have, 6'“ - I” l (5.8) 10 city] + 85V; ° 82 Figure 5.3. Two types of current mirrors Substitute e‘V back to equation (5.5) and (5.6), we obtain the following: CKVI I] = 15W (5.9) 88V: 12 = bm, (5.10) and CKV] _ eKVg I] — 12 = “65V; + 63V; — V2 = 15 tanh flY—l. (5.11) When V1 — V2 is small enough, K 11- 12 z 1550/1 - V2). (5-12) Figure 5.4 shows the relationship of II, 12, 11 — 12 vs V1 — V2. Using current mirror, 83 Current 10'7A o I -3 " [1‘12“ 1 1 1 m -3 -2 -1 o 1 V1 — V2 (V) [\D Figure 5.4. II, I; and 11 — 12 vs V1 — V2 Figure 5.5. Differential pair 84 Q. IV 4| Q. Figure 5.6. Transconductance amplifier 85 we can easily get 11 — 12 as output. As shown in Figure 5.6, where V — V [out = II — 12 = 1;, tanh K—(-—l—2—2—). (5.13) The circuit shown in Figure 5.6 is called transconductance amplifier. 5.2.4 Current-voltage Multiplier Figure 5.7 shows a prototype of current-voltage multiplier circuits. "Compared to the transconductance amplifier shown in the last section, the circuit output 10m now is drawn through two current mirrors instead of directly from the differential pair. Therefore, the output does not disturb the operation of the differential pair. It is obvious that the output current is: ”(V1 - V2) 10‘“: [1 —12=Ibtanh 2 (5.14) When V1 - V2 is small enough, 1 Iout z 516(% — ‘6) (5°15) Compared with the differential pair, 15 now is controlled by the input current In via a current mirror. In normal operation, the bias transistor will be in saturation region. Thus Ib = Iin and ”(V1 — V2) [out = [in tanh (5.16) I? glinU/l - V2) (5'17) 86 II ’7‘ I) Q; Q. s Q. I 1 F I “fits Q.}W I... 1J1 Q. ”Bi—h” E” Figure 5.7. Current-voltage multiplier 87 So [out is the product of the input current and the difference between the input voltages. Figure 5.9 shows a pspice simulation for this circuit. We use the schematic symbol shown in Figure 5.8 to denote this circuit. Since Ib cannot be negative, this circuit is a two-quadrant multiplier. Figure 5.8. Schematic symbol for current-voltage multiplier 4e-08 36-08 2e-08 1e-08 0 -1e-08 —2e-08 -3e-08 -4e—08 ' 1 J ' i l ' 1 ‘ 1 -0.8 -0.6 -0.4 ~0.2 0 V().2 0.4 0.6 0.8 1 l-Vz Iout(A) Figure 5.9. The plot of a current-voltage multiplier 88 V V H .I-Il J , I, v,-—| . Figure 5.10. Gilbert multiplier 5.2.5 Four-quadrant Multiplier Figure 5.10 shows a four-quadrant multiplier which is also called the “Gilbert Multi- plier”. From the analysis of the differential pair, we know eKVI ”85V; + eKV2 N u—l I 89 _ [b H(Vi-Vz) — 2(1+tanh 2 ), I _ 12 =§b(1—tanh M ). Similarly, we can obtain the following currents: 4(V3 - V.) —————2 ), 114 = %(1 — tanh iii/3512), "(V8 — WI) 2 3 NM - V4) 2 ) I 113 = é—(l+tanh 123 = 123(1-l-tanh I 124 = -23(1—tanh Thus 11+12 + I] —12tanh K(%—%) 2 2 2 1+: and = I1+Iz _ Il-Iztanh’cUé—V”). 2 2 2 Therefore, [out : 1+ "I- = (II — 12)tanh “(fig—“Q ———”(V1 _ V”) tanh = lb tanh 2 2 When V1 — V3 and V3 — V, are small enough, we have 2 I... z 1,,1‘4—(14 — V2)(V3 — v4). “(V3 - V4) (5.18) (5.19) (5.20) (5.21) (5.22) (5.23) (5.24) (5.25) (5.26) (5.27) 90 Figure 5.11 shows a pspice simulation of this circuit. 1e-007 . 5e—008 — (10-3) " ’ ’ f ” v ’ I ‘- ’ -’ a p N3513.008 - I -1e-007 l 1 l 1 1 0.3 -1.5e-007 -0.3 -0.2 -0.1 0 Vl-V2 0.1 0.2 Figure 5.11. Spice simulation of a Gilbert-multiplier In the following section, we will need to calculate (V1 —V2)2 for the product-of-norm model. We can use this four-quadrant multiplier to achieve that by setting V3 = V1 and V4 = V2. Figure 5.12 shows the schematic symbol for this square voltage-square circuit. ._. SQ ..L Figure 5.12. Schematic symbol of a voltage square circuit 91 5.3 An Example Circuit of One-neuron/two- pattern Product-of—norm Model As we discussed in section 3.4.1 for one-neuron/m-pattn case, the product-of-norm model’s dynamic equations can be written as: it = f(m) 2 (v‘1 — v) H (v'“ - v)2 . (5.28) 1:1 when m = 2, the equation becomes a = f(m)[(v" — v)(v'2 — v)” + (v‘2 -— v)(v" - vfl. (5.29) If we use voltage to represent the output (state of the neurons), the terms (1)"1 — v)2 can be easily calculated with the voltage-square circuit. Figure 5.13 shows an example circuit of the one-neuron/two—pattern case. As can be seen, the circuit structure is symmetric. The state of the neuron is represented by the voltage 1) over the capacitor C. For the voltage-square circuit, 12" and v are given as input. The output current [1 is then given by (v‘1 — v) Ir.(v"'l — v) 11 = 1;, tanh K tanh (5.30) This current is then fed to a current-voltage multiplier X2 through a current mirror. Thus the output current 12 of the current-voltage multiplier can be written as: t2 2 12 = Iltanh ”(v —v) 92 V' 1. SQ > Figure 5.13. One-neuron/two—pattern circuit :1 _ 2 12 _ = 1,, tanh U tanh M. (5.31) 2 2 When I)"1 — v and v'"2 — v are small, 12 can be written as 12 = 01(1)"2 — '1))(v"'1 — v)”, (5.32) where a is a constant which incorporates the effects of I), and 1:. Similarly, we can show that the current 1., can be expressed as '2 __ .1 __ I4 = Ib[tanh ”CE—{£212 tanh 5(2) 2 ’0) 0(1)"1 — 1))(1)"2 — v)”. (5.33) 22 93 Currents I; and 14 are then summed up and fed to the capacitor C. Thus the voltage over the capacitor has the following dynamics: Ci) = 12 + 14. (5.34) When 1)”1 — v and v‘"2 -— v are small, we have C1) = a(v"'l — v)(1)"2 — v)2 + (1)"2 - 1))(1)"l — 1))“. (5.35) The formula above is a differential equation for a one-neuron/two—pattern product- of-norm model. The PSpice code and simulation examples are presented in Appendix A. CHAPTER 6 ARTIFICIAL NEURAL NETWORKS FOR TEMPORAL-PATTERN ASSOCIATION 6.1 Introduction In previous chapters, we studied networks that could store static patterns and were used in applications like image recognition, pattern classification and associative memory. In this chapter, we will investigate neural networks that can store/learn temporal patterns. This type of networks can be used in many applications, such as speech recognition, temporal pattern generation and forecast. Learning or storing temporal patterns can be viewed as an extension to the general association of static patterns. Temporal patterns can be discrete or continuous. The 94 95 problem of learning temporal patterns can be classified into three different tasks [HKP90]: 1. Sequence Recognition: Here one wants to produce a particular output pattern when (or perhaps just after) a specific input pattern is seen). There is no need to reproduce the input pattern. This is appropriate for speech recognition problems where the output might indicate the word just spoken. It can also be used for pattern categorization. 2. Sequence Reproduction: In this case, the network must be able to generate the rest of the sequence itself when it receives part of it. This is the generaliza- tion of auto-association or pattern completion of dynamic patterns. It would be appropriate for learning a set of songs or for predicating the future course of a time series from examples. 3. Temporal Association: In this case, a particular pattern must be produced in response to a specific input pattern. The input and output patterns might be quite different, so this is the generalization of hetero-association to dynamic patterns. It includes , as special case, pure pattern generation and the previous two cases. 6.2 Methodology Review There are many methods proposed so far to address some of the tasks mentioned above. The first task —— sequence recognition — does not necessarily require a recurrent network. For the other two tasks, recurrence is usually needed. In this 96 section, we will review some of the models currently available. 6.2.1 Tapped Delay Lines Tapped delay line networks turn a temporal sequence into a spatial pattern on the input layer of a feedforward network. Therefore, the conventional back-propagation method can be used to learn and recognize sequences. Figure 6.1 shows a typical structure of such network. In this type of network, values of :1:(t), :1:(t — A), ..., a:(t — (m — 1)A) from an input signal a:(t) were presented simultaneously to the input layer of the network. The values are usually obtained by feeding the signal into a delay line that is tapped at various intervals. In a synchronous network, a shift register can be used. The tapped delay line network has been widely used in speech recognition problem [Wai89, WHH+89]. There are several drawbacks of this approach used for sequence recognition [M0289]. The length of the delay line or shift register must be chosen in advance to accommodate the longest possible sequences. It cannot handle arbitrary-length patterns. The input signal must be properly registered in time with the clock controlling the delay line and must arrive at exactly the correct rate. Tank and Hopfield [TH87] suggested a way of compensating the last limitations mentioned above. The idea is to replace the fixed delays by filters that broaden the signal in time as well as to delay it. 97 D D D D x(t) k Figure 6.1. A time-delay neural networ 98 6.2.2 Context Units (Partially Recurrent Networks) There are several approaches falling into this category. The idea of the approach here is to introduce a carefully chosen set of feedback connections in a feedforward network. The recurrence lets the network remember cues from the recent past but does not complicate the training. In most cases, the feedback connections are fixed, not trainable, so back-propagation algorithm may easily be used for training. Figure 6.2 shows several popular architectures of this type of network which can recognize sequences on the basis of its state at the end of the sequences. In some cases it can generates sequences too. For each net, it is not known what kind of patterns the network will be able to generate in advance. Bert DeVries proposed a Gamma model which introduces subnetworks as short-term memories to be added to the network [VP92]. 6.2.3 Back Propagation Through Time This type of networks is usually fully connected. Figure 6.3 shows a simple example of such network. If the patterns we are interested are of a maximum length, we can turn such network into a feedforward network by unfolding it through time. The idea is to represent the state of the network at different time by a layer in a feedforward network. Thus if the maximum length of the patterns concerned is T, then the network will have T layers with each layer connected by the same weight (connection) matrix. Figure 6.4 shows the unfolded version of the network shown in Figure 6.3 with four time steps. The back-propagation method can be modified to be used for training this new feedforward network. The desired patterns of the 99 Output Output 1 m 1 m Context Input a Context Input (a) (b) Output 03‘?“ . Hidden 2 Hidden Context ” a . ai Gontext l I I Input (C) (d) Figure 6.2. Architectures of networks using context units. Shaded arrows represent fully connected connections 100 original net at different time become the desired values of different layer of the new feedforward net. The main problem with this approach is the need for large computer resources. It is probably very difficult to be implemented in hardware too. For a long sequence or for a sequence of unknown length, this approach becomes impractical. It cannot be used for continuous patterns either. 6.2.4 Real-Time Recurrent Learning Williams and Zipser [WZ89b, WZ89a] proposed a way to construct a learning rule for general recurrent networks. This method can handle patterns of fixed or indefinite length. It can be run on-line too. Here is a brief description of this method. Assume the system difference equation is V,(t) = g(hi(t—1))= g(ngVJ-(t — 1) + C1(t - 1)), (6-1) where (g(t) is the desired (target) output of unit 2' at time t. By differentiating this equation with respect to w,,-, we have 0V1“) awpq = g’(h.-(t — 1)) amt — 1) + wa . (6.2) awry The desired values Ck(t) may be defined for some k’s and t’s only. The error on neuron k at time t is defined by Ck“) — Vk(t) if Ck is defined at t Eklt) = . (6.3) 0 otherwise 101 Wll W12 w22 10-02 W21 Figure 6.3. A general recurrent network Figure 6.4. A unfolded recurrent network for 4 time steps 102 The system “energy” function (the overall network error) is where W) = @5512. (6.5) Thus the gradient-decent weight updating rule can be written as 0E(t) AW (t) = -r] P9 aqu 3Vk(t) 7] Ek“) 1 (66) P aw... where r] is a positive constant. We have Aqu = Z Aqu(t). (6.7) t Therefore, if we know “..,—$3, we can solve this equation recursively. Because we assume that the initial state of the network has no functional dependence on the weight, we have 014(0) awpq = o. (6.8) Williams and Zipser [WZ89b] found that updating the weight after each time step instead of waiting until the sequence is ended ( at t = T ) worked well if )7 was sufficiently small. They called it real-time recurrent learning. This system requires large memory storage, and is computationally expensive. The number of operations for each time step is 0(n‘). 103 6.2.5 Time-dependent Recurrent Back-propagation Pearlmutter proposed a training algorithm for general continuous-time recurrent networks [Pea89]. Some other researchers derived the same result independently using different approaches [SCL91]. The general system differential equations are described by dy, Tia =—y,~+0‘(x,')+1,' i=1,2,...,n, (6.9) where x,- = Z,- wjiyj is the total input to unit 2, y.- is the state of unit 2, T.- is the time constant of unit 2', a is a sigmoid function, wij, i,j = 1,2, . . . ,n, are the weights, y.-(to) and the driving functions I,-(t) are the inputs to the system. The “energy” function is defined by T E = f, 25.0) — f1(t))2dt, (6.10) =0keo where O is the index of the output neurons, fk(t) is the desired output of neuron k at time t, T is the final time of concern. The functional derivative 544) = 555)) = 141a) — 5.5)]. (6.11) The derivative of E with respect to weight w,,- is defined by (See [Pea89] for detail) BE 1 T r aw“ = T/O yia ($j)ZJ'dt, (6.12) 1} 1 104 where Z is the Lagrange multiplier (or co-state variable): (VB ”(1“ = 0y.(t)’ (6.13) where 0+ denotes an ordered derivative. The dynamic equations of the co-state is dz,- 12 7t- = i2 —Z:T :Wfia' (:rJ)zJ (6.14) where = g; = y.-(t) - W) (6.15) Equation (6.14) can be derived using finite difference approximation [Pea89] or calculus of variation and Lagrange multipliers [SCL91]. We omit the derivation detail here. Similarly, we can find the derivative of the “energy” E with respect to the time constant T,: 6B 1 T dy, $-71?!) 2.5111. (6.16) Therefore, we can use gradient-descent method to update w.-J and T5. In the training process, we first integrate equation (6.9) forward from time to to t1 = T. Then set the boundary condition Z.-(t1) = 0, and integrate the system Z backward from t1 to to. In the backward process, o"(:cJ-)y,- and “ii are also integrated. Thus at the end, we can computeaaw — Jandg— a(:1:,-)’ and y,- are stored in the forward pass and replayed in the backward pass. This algorithm can be used for training general continuous-time recurrent net- works. It is related to other methods mentioned above. It can derive the real- 105 time recurrent learning method. It may also be viewed as an extension of the back- propagation through time, or extension to recurrent back-propagation. The drawback of this approach is that it needs to integrate backward in time. Thus it needs to store many intermediate values. Therefore, it needs large storage and it is difficult to be implemented in hardware. 6.2.6 The Green’s Function Method Sun, Chen and Lee [SCL92] proposed a method that uses a Green function to construct the derivative of the “energy” function with respect to weight Wij. By doing so, they avoided a lot of redundant computations. The resulting algorithm does not need backward integration while the operations per time step is in the order of 0(n3), which is one order faster than that of the real-time recurrent learning. For a general system described by at) = F(r(t), W. 1(0). (6.17) where W is a matrix representing the set of weights and all other adjustable param- eters. The nonlinear function F may look like F(:r(t),W,I(t))= -a:(t)+g(W,:L‘)+I(t). (6.18) The “energy” function can be given as E(x,:r') = ft! e(:r(t),:r"(t))dt, (6.19) to 106 where :r'(t) is the target value at time t. Using gradient descent method, the weight is modified according to 5E _ ‘1 Be 02: The on-line form is as 81:), (6.21) AW“) = —7l (5; ' 5W- where 1] is a positive constant. By using Green’s function (we omit the derivation here, see [SCL92]) the weight updating becomes AW = —n(v(t)n(t)), (6.22) where 11 is a third order tensor with the following dynamics: dII 6F 41 ( 0W . (6.23) H(to) = 0 v is a vector and can be obtained by solving the following linear equation: 6e v(t)U(t) — a—x. (6.24) U(t) is a matrix and is defined by dU(t) 3F —— + U(t)— = 0 dt Bx . (6.25) U(to) = 1 107 From systems science point of view, this method is basically the transition matrix method applied to the sensitivity function with the initial condition set to zero (V(t) in [SCL92] is the transition matrix). The advantage of this method is that it can be used on-line and the operations required per time step is only 0(n3). Overall, each algorithm has its own advantages and disadvantages. Current approaches still have the following limitations: 0 Require large storage. 0 Computationally expensive 0 Difficult to implement For training general recurrent networks, the time-dependent recurrent back propa- gation approach is probably the best unless on-line learning is needed [HKP90]. It can be used for both continuous-time and discrete-time systems. When on-line learning is needed, the Green’s function method is a better choice. In the rest of this chapter, we will focus on algorithms used for general recurrent networks. We will several propose ways to improve the original time-dependent recurrent back-propagation algorithm. 6.3 Motivation Among the training algorithms proposed for recurrent networks so far, the time- dependent recurrent back-propagation algorithm proposed by Pearlmutter and some other researchers has the following advantages: 108 The system is very general and can be applied to a wide range of applications. It can derive many other learning algorithms. The adjoint variable Z has clear physical meaning. 0 The operations required per time step is only 0(n2). It can be easily extended using variational approach to solve other problems beside trajectory learning. However, the original algorithms did not fully address the following problem: 0 The effects of the initial state: The training algorithm does not specify how the initial state of those neurons without desired values should be chosen (if the initial values of those neurons can be controlled). If the initial values are chosen arbitrarily during the training process, the retrieval process can have totally different dynamics. 0 The final values of the system: How to control the final values of the system? The final value of a pattern may be more important than the patterns of the intermediate states. Pearlmutter’s original algorithm assumed a free ending problem and set 2;, = 0. o Avoiding the backward integration: How to avoid the backward integration without increasing the computational complexity too much? In the following sections we will try to further investigate the areas mentioned above. 109 6.4 The Initial Value Problem The original algorithm of time-dependent recurrent back-propagation does not specify how to set the initial values of all the neurons. It assumes that the initial values are given and fixed. While in real life applications, the initial values of some of the neurons mayinot be given, and are changeable. If the initial values used in the retrieval process are different from those used in the training process, the system will behave differently. Therefore, it is natural to ask how we should set those initial values. Let us take the XOR problem described in [Pea89] as an example, Figure 6.5 shows the structure of such network. The network has a total of four neurons, with two neurons in input layer, one in output layer and the other one in hidden layer. The input neurons have external connections. The goal is to select a proper set of weights so that the relationship between the external inputs and the output of the output neuron is XOR. Here the initial values of the neurons can be given to arbitrary values. Obviously, some choices of the initial values will make the training easier while others will not. The co—state Z,-(t) used in [Pea89] is defined by _ (TE 3.71)“) 21(1) (6.26) That is, Z;(t) is the ordered derivative of E with respect to the system state y,(t). Thus 6+E 2““) = air-(0) (6.27) 110 Input Hidden Output Figure 6.5. Network topology of XOR problem 111 Since aayfg) represents the change in E over the time period T caused by infinitesimal change of y,-(0), we can use gradient-descent method to update y;(0) in the training process to find the optimal initial values by using — = ”721(0), (6-28) where 17 is a positive constant. 6.5 The Choice of the Final Value of the Co-state For a general system defined by: Y = f(X, 1,1), (6.29) where X = WTY. If we define the performance index function as J(t) = (Y(tf),tf) + [01, L(X,I,t)dt, (6.30) and the final state has constraint is w(Y(tf),tf) = 0. (6.31) The Hamiltonian is defined as H(Y, 1,1) = L(X.I,T) — ZTf(X,I,t). (6.32) 112 The boundary condition can be derived using optical control theory [L895]: Y(to) is given , (6.33) (¢x 'f' 11’3” — Z)TltJd3/(tf)+ (451+ 1b,?!) + H)|¢Jdtf = 0 where v is also a Lagrange Multiplier. Since the final t f is fixed, we have (I’x'l'dgv—Z) = 0 Z I), = (45. + 14;” v)|.,. (6.34) Equation (6.34) specifies how to find the final value of the co—state Z. The boundary condition used in original time-dependent recurrent back-propagation is a special case of equation (6.34) which can be derived by setting <15: 2 0 and II): = 0: 2|., = 0. (6.35) In real applications, we may have constrains on the final state and ¢(y(t;), t1) may not be a constant either. A good example is to train a network to learn a circular trajectory. It is desired to let the final state equal to the initial state. This can be emphasized by given a constraint function (I). 113 6.6 Avoiding Backward Integration in Time- dependent Recurrent Back-propagation: Method I One major limitation of the original time-dependent recurrent back-propagation method is that it needs to integrate the co-state Z backward. In the forward process, the state variables and some other variables need to be stored so that they can be played back during the backward integration. Therefore, this algorithm requires large storage and is difficult to be implemented in hardware. There are other methods available from optimal control theory [BcH75]. However, each method has its own limitations, for example, the method of guessing the initial Z (to) and simulate Z forward, the difficulty lies in guessing the Z (to) to be close enough to the correct one. In this section, we will present a numerical algorithm that can avoid the backward integration. The system equations concerned can be written in matrix form: Y./T -_- C—Y+0(X) (636) X=WTY 9 Here we define the operator “./” as Anx1./Tnx1 [a,-/T,-]nx1. C is the input vector. The co-state (the Lagrange multiplier) Z has the following dynamics: 2' = Z./T — E — {Wdiag[a’(:r)./T]}Z. (6.37) 114 Here diag[V] is defined as a diagonal matrix with the diagonal elements equal to the corresponding elements of vector V. E is defined as E=Y"—Y, (6.38) where Y“ is the desired value. The derivative of the “energy” E with respect to weight W can be written as: 3E ‘1 , , T 6—W_ — to Y[d1ag[a ($)./T]Z] dt. Or in other form: BE :1 - I T W _ /. d1ag[0' (:6)./T]ZY dt. In the following discussion, we define 1b = diag[a'(:z:)./T]. Thus 6E BWT t = f ’ zpszdt to and the derivative of the “energy” with respect to T is 6E 11 _ . 551-; = —/to d1ag[Z]Ydt. (6.39) (6.40) (6.41) (6.42) (6.43) Using the first-order approximation, the dynamic equation of the co-state becomes 115 (See [Pea89]) Zn (1 — At. /:r)z,.+1 + 61E, + At(Wdiag[a’(X,,). /T]Z,,+1) = {I — At./T + At(Wdiag[a’(:r,,)./T])}Zn“ + AtEn. (6.44) Where I is an identity matrix. Define p, = I — At(1./T — Wdiag[o’(:rn)./T]) (6.45) B, = AtE, Then, Z, = annH + B". (6.46) Write equation (6.42) in difference form: 0E N T —6WT — At]; IlikaYk - (6°47) Here we assume the final time T = N At. At time t1, we have 21 = [1122 + B] (6.48) m = I — At [I./T - Wdiag[a(X1)./T]] (6.49) Bl = AtEl. (6.50) If we denote 8E ) k T — = 4,2,1! , (6.51) (awn :4 , 116 we have BE ) = 1121213” (6.52) (aw/T1 1 = 1#1011 22 + Bl)Y1T (6.53) = 2(11}41Z2Y1T + $131K]: (6.54) At t2 we have Zg = [1223 + Bz (6.55) 8E T 6E (aw). - W + (w). = $2223? + $1M ZzYlT + I/JlBlYlT = «12201223 + 82)}? + ¢1F1(#2za + 82))? + wall/1T = $211223)”; + ¢il11fl2Z3Y1T + ¢2le/2T + $1.411 Bail/1T + $130,112 (6-56) Similarly, at t3: BE 3er 3 3E T wszsig + (aw)? = 1123;13Z4Y3T + ¢2#2_#SZ4Y2T + ¢1l11fl211324Y1T + 2p333}? + WMBBYzT + wlfllflzBal/F + ¢2B2Y2T + $111132le + $1 Bl Y1T° (6.57) If we consider jab: = [ BE ] 8W 8w.) an Then at time t1 we have BWk At t2: 8E ('JlW,c We define 6f 6; 3Wk 117 3E ( ) = T/Jlfllylkzz + $113134:- ‘1 (—) = [Mikel/2k + ¢IPIF2Y1kl Z3 + *2 [6219* + «ma/1*] 82 +¢IBIY1E 1111 Ylk [1122162]c + wlfllylk] ¢2Y2k + Bic/11 - (6.58) (6.59) (6.60) (6.61) (6.62) 118 Then at time t3 8E ( ) = {$3}? + Mini/2k + walflzyikl [#324 ‘1‘ B3] + 3:32 + £81 t3 8Wk = [$3Y3k + 5:112] [#3Z4 + B3] + 5:32 + flicBl- (6-63) Thus by defining Bf. = (6.1/n" + 61;l ,1..-“ n = 2,3, . . ., (6.64) we have 6E k t k — = Z + B (8‘40): fl: Ft t+1 £31 I = ,BtkfltZtH + Qf, (6.65) where B; = At[€g(t)]nx1 (6.66) [1: = I — At(I + Wdiag[a'(a:)./T]) (6.67) t Q? = Zfll‘Bz. (6.68) (=1 At the final step t = N and BE (5791:)“, = fllIiII‘NZT + in- (6-69) 119 If we use the same boundary condition as used by Pearlmutter in [Pea89]: Z7 2 0, (6.70) we could further simplify equation (6.69) as: 0E N k —— = B k=1,2,..., . 6.71 (awk)1v l 151 I n ( ) Thus 6E BE (5675) ~ (567.),- ‘6'”) 6.6.1 Simulation Procedure In simulation, from initial time to to the final time T, we need to calculate the following at each time step (interval At): First we need to simulate the system difference equation (6.36) forward, which will calculate the values of K, E, and X). Then (I): can be computed using equation (6.41). Variables ,n, and B, can be found using equation (6.45). We need to calculate Bf using equation (6.64) for each R: = 1, 2, . . . ,n. Thus we can obtain Qf using equation (6.68) for each 16 = 1,2, . . . ,n. At the final step T, we can compute the derivative of the “energy” with respect to weight W using 6E (m) = fitkfltZHl'l'Qfa (6-73) 120 and 8E 6W Therefore, we can update the weight using gradient-descent method: AW = ——17 (31%) (6.75) Wnew = WOld+AW’ (6.76) here 1) is a positive constant. Thus this algorithm needs only the forward integration. If we select ZT = 0, this algorithm can be further simplified and used on-line too. The computation complexity can be estimated as follows: For each time step, calculating Y, E and X needs 0(2n2) operations. Computing 2,!) needs 0(n) operations. Updating p and B requires 0(4n2) operations. The total operations for calculating ,B", k = 1, 2, . . . , n, is 0(n4 + n3). Calculating Q also needs 0(n3) operations per time step. Therefore, the total operations per time step will be 0(n4), which is the same order as the real-time recurrent method used by Williams and Zipser [WZ89a]. The advantages of this method are that it can use the ZT for different boundary conditions and it avoids backward integration. However, the computation complexity is still high. Figure 6.6 to figure 6.9 show the trajectories of an example XOR circuit, as shown in figure 6.5, trained using the above method. 121 0.75- moon 4 Figure 6.6. XOR network with input = [0, 0] 0.751. Figure 6.7. XOR network with input = [0, 1] 075' nabv'oq ( 025> on .- n 5 (J b 0 Figure 6.8. XOR network with input = [1,1] 0.75} .4 025? nuumom 1 Figure 6.9. XOR network with input = [1,0] 123 6.7 Avoiding Backward Integration: Method II In last section, we presented one way of avoiding the backward integration used in time-dependent recurrent back-propagation method. In this section, we will present another numerical algorithm, which avoids backward integration too while not increasing computational complexity too much. From last section, we know that the co-state dynamical equations can be written in vector form as: z' = Z./T - E — (Wdiag[a’(X)./T])Z, (6.77) where Z. / T g [2,,- / t j] and diag[X] is a diagonal matrix with diagonal elements equal to the element of vector X. In the following discussion, we denote 2p 2 diag[a’ (X ) / T]. The system defined by equation (6.77) is a linear time-varying system. It can be written in general form as z‘ = A(t)Z + 3(7). (6.78) The transition matrix of this system (t, to) can be calculated using —_— A(t) with 600,70) = 1. (6.79) and we have t z = <1>(t,to)z,,+ /. (t,r)B(r)d7 = <1>(7,to)z,o + (t,to) [it ‘l(7,to)B(‘r)d7-. (6.80) 124 Thus 6E BWT = f” zszTdt -_: ft! —t07[2[‘I>(t,to)Zto + (t,to) [it -1(7',t0)B(T)dT]YTdt t t t = 2,, ’¢¢(t,to)dt+/ ’¢¢(t,to) (1)-1(T,to)B(7)drdt. (6.81) to to to \ ’ L J term 1 term 2 Both term 1 and term 2 can be calculated forward. If we define ‘11 = <1)“, then it can be simulated with \iI = —A(t)\I‘ with \Il(t,to) = 1 (6.82) Zto can be calculated at final time t I using Zto = 'l(tf,to)[Zt, — ft” (tf,7')B(7')dT]. (6.83) 0 We now show how to implement the idea above numerically. We use the following approximation with sufficiently small time step At: Zn+l — Zn __ L At — Zn, (6.84) where 2,, = 2,3,“. Therefore, 271+] = -n + Ath = 2.. + At(Zn./T — E, — (Wdiag[a'(X,,)./T]) 2.). (6.85) 125 If we define 11,, = I + AtI./T — At(Wdiag[a’(Xn)./T]) , (6.86) C, = —AtE,, we have Z,+1 = 14.2,. + C... (6.87) From equation (6.87), we have: Zn = Vn-IZ —1 + C -1 = Vn—lan—2Zn—2 + Cit-2] "l" C -l = Vn—an—ZZn-2 + Vn—ICn-2 + 011-] = Vn-an—z ' ' ' V1Z1 + Vn-an-2 ' ° ' V201 + Vn—IVn-2 ' ° ' V302 + '° ° + Cn—l- Thus Zk = pk-l 21 + qk_1 k = 2, 3, . . . , N, (6.88) where we define: (II: = quk—1 + Ck (6.89) Pk = VkPk—i, (6.90) with ql = Cl (6.91) 126 p1 = V1. (6.92) The partial derivative of total “energy” with respect to weight can be approxi- mated in the following way: 8E At i (p Z YT (6 93) —— = k I: ~ - aWT k=1 k Here we assume T = AtN. For simplicity, we omit the At in the following derivations. If we define 6E k — (BWT) = 226,257, (6'94) 1: [:1 then, at each time step we have 613) = 912qu?” (6.95) 1 Q) Q.) 6%. = $121KT + ¢2IV121 + Cl]}/2T (6.96) Q: E "l dual/1T + 11126421 + CID/2T + ¢3IV2V121 + V201 + C21Y3T (6-97) Q) (“:1 Q: E "I Q: EQD ~30: VVV II I: k = Z 97127-12le + Z «12qusz (6.98) [:1 (:2 With 170 = I. Consider the ith column of (5%)“ which is $121}? + $211121)? + . . . + liming-121K: ‘Hbz‘hY; + wsqu; + - -- + wkq._1Y,:. (6.99) 127 By defining: Ric = Ic—l + I(kph-1Y7: (6.100) with R0 = 0 (6.101) and k Qk = 2: lbl‘IIAYIT, (6.102) (:2 we have 6E i i ' 2' W k = RkZ1 + Qk. (6.103) Therefore, if Z; is known, we can integrate the system forward. However, the boundary condition we know is ZN, not 21. Now, let us consider simulating the co-state system backward using difference equations. For sufficiently small At, we have the following approximations: ~ ~ Zn — 2,, '~ _—At +1 = -2“, (6.104) from equation (6.104), we have: Zn = Zn-H—Aténi-l = 2,.... — At(Z,,+1./T — 137,.+1 — [Wdiag[or'(X,,+1)./T]]Zn+1). We assume the simulation time period is [0, t1], and t1 = N At, N is an integer. 128 Define pn+1 = I — AtI./T + At(Wdiag[a'(Xn+1)./T]) (6.105) and Bn+1 = AtEn+1. (6.106) We have 2.. = MHZ... + Bu“. (6.107) Thus, use equation (6.107) recursively: Z1 = #222 + B2 = #201323 + 33) + 32 = 112/1323 + #233 + 32 = #2113 ° ' ' #1ka + #2fl3 ° ' ° #k—in + #2113 ' ' ' #k—sz—i + - - - + 32- The equation above can be written in general format as 21 =Uka+Dk k=2,3,...,N, (6.108) where we define: U]: = Uk—lllk k = 3,4,...,N (6.109) D}; = Dk-l +Uk_1Bk k =3,4,...,N (6.110) D2 = 82 (6.111) 129 U2 = [12. (6.112) When the time step At is sufficiently small, the two systems 2 and 2 should be so close that Z, x Z. Therefore, we have ZN _—. ZN (6.113) Thus equation (6.103) can be written as (aWT), = 17,(UNZN + DN) + Q. (6115) At k = N, it becomes LE i — R‘ U 2 D " 8W7 N -- N( N N + N) ‘1’ QN' (6116) Since R, U, D and Q all can be calculated forward in time, we can simulate this system forward in time and at time T calculate the quantity (gap-5:7). This way we can avoid the backward integration. 6.7 .1 Updating the Time-constant T The partial derivative of total “energy” with respect to T can be approximated using: 6E N . - — —- = —At 2 diag[Yk]Zk. (6.117) 8T k=l 130 As with the updating of weight, we omit At in the following derivation. Therefore, at time step 1 we have 6E . - - ('8'?)1 = —d1ag[Y1]Z1. At step 2, 8E . . _ (fi)2 — —d1ag[Y2]Zg =.- —diag[f€](u121+C1). At step 3 6E _ . _ (5T), — —d1ag[Y3]Z3 = ’diale31(P2ZI+Q2)° Define S). = Sk_1+diag[Yk]Pk-1 k=1,2,...,N (6.118) J). = J,._1+dia1g[1'/,.]q,._1 k=2,3,...,N, (6.119) with initial conditions 50 = 0, (6.120) J1 = Jo=0. (6.121) 131 Then (9E) _ — = —SN21 — JN. (6.122) (3T total Therefore, we can calculate S and .I forward. At the final step T, we calculate Q9 elm (a )total' 6.7.2 Computation Procedures From the derivation above, we can summarize the computation flow as follows: Start with k = 0, at time step k: Step 1. Calculate Y and E (for k = 0,1, . . . ,N — 1) using: Yk+1 = Y1: + At[0(Xk) + C1: + Ykl-/T E]; = Y]; — Y]: X;‘ = WTYk. Step 2. Compute 70: 11):: = diag [0’(Xk)./T]. Step 3. Calculate u and C: 11,. = 1 + AtI./T — Athpk Ck = —AtEk 132 Step 4. Calculate p and B: Mk = I—AtI./T+Ath)k B). = A15). Step 5. U and D (for k = 2,3, . . . , N) can be calculated with Uk Uk—iflk Dk = Dk—1+Uk-lBk Step 6. Compute q and p (for k = 1,2, . . . ,N — 1) using 9!: = VkCIk—l'l'Ck Pk = VkPk—l- or using Pk = (III-+11 ‘11: = U131 Dk+1- (6.123) (6.124) (6.125) (6.126) (6.127) (6.128) Step 6a. If we want to update T (for k = 1,2, . . . , N), we also need to calculate S]: = Sk-1+diag[f/k]pk_1 Jk Jk-l + diag[Yquk—i- 133 Step 7. Calculate Q: Q1: = Qk—1+¢ka—1YkT Q0 = Q1 = 0. Step 8. For each column I, I = 1,2, . . . ,n, calculate R; = 1224+ wkpk_1Y,: (6.129) Step 9. Loop back to step 1 until k = N. Then, go to step 10. Step 10. Calculate Z1: 21: UNZN + DN Step 11. For each column I, I = 1,2, . . . ,n, calculate 6E )’ , , — = RNZI + QN- (6W7 total Step 11a. If we want to update T, we also need to calculate 6E) — = —SNZ —JN. (6.131) (8T total 1 134 Step 12. Calculate new weight W with gradient-decent method. 3E W = W — a (—) . pre 0W total Calculate T with 3E T = Tpre — 18 (—) - 6T total Where a and 6 are updating rates for weight and time factor respectively. Loop back to step 1 until the total “energy” is smaller than the desired value. 6.7 .3 Simulation Examples For comparison purpose, we try to solve the same XOR problem as used in [Pea89]. The network topology is shown in figure 6.5. The input layer neurons have external input 11 and 12. The goal is to choose a proper set of weights so that the output of the network is the XOR of the two input signals between time t = 2 and t = 3. We define the “energy” to be E = Z). f23(yc(1k) - d("))2dt, where k ranges over the four cases, d“) is the desired output for case In, yo is the state of the output neuron. The input signals 11 and 12 were set to be [1,1], [1,0], [0,1] and [0,0] respectively for each case. The sigmoid function is chosen as 0(3) = (1 + 6”) — 1. The initial values Y of neurons are set to be I + 0(0). Initial weights W are chosen randomly from (—1,1). As with Pearlmutter’s case, when weight W and time factor T were updated simultaneously, T tended to go to 0 too quickly. In our simulations, we usually choose T’s update factor to be very small. Figure 6.10 to figure 6.13 show the network 135 075+ nanpooq . g O25] Figure 6.10. XOR network with input = [0, 0] ...] g 6.5 025* bu-pvoq , cf' - u» u a a Figure 6.11. XOR network with input = [0,1] 136 0.75 * Inn-(1 1001 , 025* Figure 6.12. XOR network with input = [1,1] 025* mvooq , Figure 6.13. XOR network with input = [1,0] 137 dynamics after it has been trained. 6.7 .4 Comparison and Discussion The number of computations in each time step can be estimated as follows: For each time step 1:, we need to calculate step 1 to 8. For step 1, the number of arithmetic computations is 0(2n2). For step 2, the number of computation steps is 0(n). Step 3 and step 4 each needs 0(4n2) computations. Step 5 needs 0(2n3). Step 6 needs 0(2n3). Step 6a needs 0(2n2) number of computations. Step 7 needs 0(2n2). Step 8 needs 0(2n3). Thus the total number of computations is in the order of 0(4n3). In the simulation process, we need to store variables: U, D, q, p, S, J, Q, and R. The added storage is 0(n3). Compare the proposed algorithm with the original time-dependent recurrent back— propagation method, the proposed one avoids the backward integration, thus the memory storage request is reduced. The price is the increase of computations per time step. Compared with other methods, This algorithm retains some of the advantages of the original time-dependent recurrent back-propagation method. The system used is very general and can be applied to many applications. The co-state Z has clear physical meaning. The boundary condition ZT can be chosen according to the system constraint. Toomarian and Barhen proposed a way to avoid the backward integration of the adjoint-function method [TB91]. They proposed similar way of finding Zto using ZT. However, the computation may be numerically unstable. Table 6.1 shows the comparison of each method. 138 Table 6.1. Comparison of the algorithms TDRBP RTRL Green’s proposed 1 proposed 2 function Computations 0(n2) 0(n4) 0(723) 0(n4) 0(n3) Backward yes no no no no integration On-line no yes yes yes yes Remarks co—state co—state co-state can be set can be set can be set according according according to system to system to system constraint constraint constraint CHAPTER 7 SUMMARY AND CONCLUSION 7 .1 Summary Artificial neural networks have developed very rapidly in the past decade. They have great potential of solving many problems which current digital computers cannot solve efficiently or cannot solve at all. One of the fundamental problem of designing artificial neural networks is to determine a set of weights such that the network can store or learn patterns or signals. When an input is applied to the network, the network can generate “desired” output. This behavior is referred to as “association”. A simple association is to store static patterns and to retrieve the stored patterns when some clues are given to the network as input. Associative memory is such a kind of application. A more complex case is to store and to generate temporal patterns. Numerous architectures and algorithms have been proposed so far for these kinds of tasks. Most of the proposed models have a pre-fixed structure. The weights are computed through on-line/off-line algorithms, which are usually computationally 139 140 expensive. In some cases, the weight matrix may not exist at all for a particular set of patterns. In this thesis, the product-of—norm model of artificial neural network architecture is proposed. The new structure is based on gradient system theory using a special “energy” function. Specifically, the product of the norm of the errors is used as the “energy” function. The system dynamical equations are defined using gradient- descent method. The resulting system is a gradient system and is globally stable. By using the product of the norm of the errors as the “energy” function, the desired patterns are guaranteed to be the global energy minima of the system. Thus they can be retrieved from the network. A distinguished feature of this network is that the weights are fixed by design. The weights learning is eliminated. The trade-off is the increasing of the network complexity. The stability of the product-of-norm model was analyzed using Lyapunov theory. Both theoretic proof and simulation results were given in chapter 3 to show the ability of the network to store and retrieve arbitrary set of patterns. For the one neuron case, the locations of the unstable system equilibria were given too. Based on the analysis, another way of design one neuron networks with pre-set stable and unstable equilibria was proposed. The possibility of using the infinity norm in the product-of-norm model was also discussed. In chapter 4, various techniques of generalizing of the “energy” function of the product-of-norm model were proposed. Ways to simplify the network structure were investigated too. It was shown that the system could be easily parallelized and modularized using the log form of the “energy” function. Using the log form “energy” function, the system can learn new patterns by simply adding modules to the existing network. The subgrouping method was proposed to localize the connections of the 141 network, so that the network could be simplified and the computations required could be reduced. Several ways of subgrouping were discussed, in general we found that the method of dividing the network into subgroups with overlapping neurons had better performance. chapter 4 showed that the product-of-norm system could be used to store and recognize digital image patterns by adding a penalty function to the original “energy” function. In particular, a method of dividing the images into small regions is discussed. The penalty function was calculated corresponding to each region. The resulting network was able to recognize both vertically and horizontally slightly shifted images. Chapter 5 investigated possible ways to implement the product-of-norm model in hardware. Analog transistors biased in subthreshold region were proposed as basic building elements. Several building blocks were discussed and an example one- neuron / twqpattern network was shown. Artificial neural networks used for association of temporal patterns were studied in chapter 6. Among training algorithms available for recurrent networks so far, time- dependent recurrent back-propagation is the best if on-line learning is not concerned. One of the limitations of this method is that it needs to integrate the co-state backward in time. Two algorithms were proposed to avoid the backward pass. The penalty is the increase of computation complexity. However, they are still among the simplest methods. Ways to set the initial values for the neurons and the final conditions for the co-state Z were analyzed also using optimal control theory. 142 7 .2 Future Research Directions The product-of-norm model is relatively complex. It requires high order feedbacks. Future work should be focused on further simplifying the network. Current hardware implementation example for product-of-norm model is not scalable for large networks. Hardware implementation of the product-or-norm model is another area which needs more investigation. APPENDIX APPENDIX A THE PSPICE CODE AND SIMULATIONS FOR‘A ONE-NEURON/TWO-PATTERN CIRCUIT In chapter 5, we presented some basic circuits structures to be used in implementing the product-of-norm model. In this chapter, we will list the PSpice code used for the simulations of the circuits. The following PSpice library file contains the definitions of some basic circuits: n-mirror, p-mirror, current—voltage multiplier and voltage-square circuits. ********************************************************************** * PSPice ciruit library: neuron.lib * * This library contains sub-circuits which may be used to * build product-of-norm model neural networks a: 143 144 * Jiansheng hou * Sun 02-23-1997 1:49:28 pm a: ********************************************************************** ********************************************************************** * * MOSFET used in this library * ***************************t************************#***************** .MODEL nfet NMOS LEVEL=2.00000 +LD=0.2500U TOX=401.000E-10 +NSUB=6.257042E+15 VT080.771327 KP=5.60600E-05 +GAMMA=O.53 PHI=O.60 U0=650.996 UEXP=O.153507 +UCRIT=41931.5 DELTA=O.472546 VMAX=67660.8 XJ=0.2SOOOOOU +LAMBDA=3.089401E-02 NFS=4.834708E+12 NEFF=1 +NSS=1.000000E+10 TPG=1.00000 RSH=29.600 CGSU=3.229249E-10 +CGDO=3.229249£-10 CGBO=6.72642QE-10 CJ=1.13650E-4 +MJ=O.6862O CJSH=5.3187OE-1O MJSH=O.26510 PB=O.80000 .MODEL pfet PMOS LEVEL=2.00000 +LD=O.209610U TOX=401.000E-10 +NSUB=6.722092E+15 VTO=-O.78821 KP=2.23000E-05 +GAMMA=O.5486 PHISO.60 U0=259 UEXP=O.182346 +UCRIT=12594.5 DELTA=O.844423 VMAX=41222.6 XJ=O.2SOOU +LAMBDA=4.981694E-02 NFS=1.000E+11 NEFF=1.00I +NSS=1.000000E+1O TPG=-1.00000 RSH=90.7OO CGSO=2.707532E-10 +CGDO=2.707532E-10 CGBO=5.839625E-10 CJ=2.513600E-4 "380.547100 +CJSH=2.9447OOE-10 MJSW80.318800 PB=O.80 *************#******************************************************* * * Sub-circuits * **************#****************************************************** ***********#*******#***#*******************#*** Pmos current mirror 4: a: * + ----- Iin * l +--- Iout * I l +- Vdd . Ill .. III 1 2 3 .subckt pmirror 145 m1 1 1 3 3 pfet L=12u H=4u m2 2 1 3 3 pfet L=12u H=4u .ends .subckt pmirror2 1 2 3 m1 1 1 3 3 pfet L=7u H=4u m2 2 1 3 3 pfet L=7u H=4u .ends .subckt pmirror3 1 2 3 m1 1 1 3 3 pfet L=24u H=2u m2 2 1 3 3 pfet L=24u H=2u .ends *********************************************** a: * N MOS mirror :1: * 1 ---Iin 2 -- Iout, 3 --- GND .subckt nmirror 1 2 3 m1 1 1 0 O nfet L=12u H=2u m2 2 1 0 0 nfet L=12u H=2u .ends * 1 ---Iin, 2 --- Iout, 3 --- GND .subckt nmirror2 1 2 3 m1 1 1 O O nfet L=12u H=8u m2 2 1 O O nfet L=12u H=8u .ends * 1 ---Iin 2 --- Iout, 3 --- GND .subckt nmirror3 1 2 3 m1 1 1 O O nfet L=24u H=1u m2 2 1 O 0 nfet L=24u H=1u .ends * 1 ---Iin 2 --- Iout, 3 --- GND .subckt nmirror4 1 2 3 m1 1 1 O O nfet L=14u H=3u m2 2 1 0 0 nfet L=14u H=3u .ends 146 *1 ---Iin 2---Iout, 3---GND .subckt n_cascode_mirror 1 2 3 m1 1 1 3 3 nfet L=10u H=14u m2 2 1 4 3 nfet L=10u H=14u m3 5 5 3 3 nfet L=6u W=14u m4 4 5 3 3 nfet L=6u H=14u .ends * 1 ---Iin 2 --- Iout, 3 --- GND .subckt n_nmirror 1 2 3 m1 1 1 3 3 nfet L=1200u H=1u m2 2 1 3 3 nfet L=1200u W=1u .ends * 1 ---Iin 2 --- Iout, 3 --- GND .subckt w-nmirror 1 2 3 m1 1 1 O 0 nfet L=2000u H=16u m2 2 1 O O nfet L=2000u H=16u .ends ********************#******************#******* * Wide range IV multiplier a: * 1 --- Iout, 2 ~-- Iin, * 3 --- V1, 4.--- V2 * 5 ---Vdd, 6---gnd .subckt iv_mul-wid 1 2 3 4 5 6 m1 8 3 7 O nfet m2 9 4 7 0 nfet *x0 2 7 O nmirror x0 2 7 O n_nmirror xi 8 1 5 pmirror x2 9 10 5 pmirror x3 10 1 O nmirror .ends *********************************#*****#******* * IV multiplier 4: * 1 --- Iout, 2 --- Iin, * 3 --- V1, 4 --- V2, * 5 --- Vdd, 6 ---gnd * * 6 is not really used 147 * .subckt iv_mul 1 2 3 4 5 6 m1 8 3 7 O nfet L=4u H=2u m2 9 4 7 0 nfet L=4u H=2u x0 2 7 O nmirror3 x1 8 1 5 pmirror2 x2 9 10 5 pmirror2 x3 10 1 0 nmirror .ends *******t********************#*********#******** 1 --- Iout, 2 --- Iin, ' 3 -" V1, 4 '-'V2, 5 ---Vdd, 6---gnd 6 is not really used ****** .subckt new_iv_mul 1 2 3 4 5 6 m1 8 3 7 O nfet L=4u H=2u m2 9 4 7 O nfet L=4u H=2u x0 2 7 O nmirror2 x1 8 1 5 pmirror2 x2 9 10 5 pmirror2 13 10 1 0 nmirror .ends ******************************#******#********* * 1 --- Iout, 2 --- Iin, * 3 --- V1, 4 ---V2, * 5 ---Vdd, 6---gnd * * 6 is not really used * .subckt new,iv_mul2 1 2 3 4 5 6 m1 8 3 7 0 nfet L=4u H=2u m2 9 4 7 0 nfet L=4u H=2u x0 2 7 O nmirror2 x1 8 1 5 pmirror2 x2 9 10 5 pmirror2 x3 10 1 O n_cascode_mirror .ends **t******************************************** 148 Negative IV multiplier 10 --- Iout, 2 --- Iin, 3 ---V1, 4 ---V2, 5 ---Vdd, 6 ---gnd 6 is not really used ******** .subckt neg_iv-mu1 10 2 3 4 5 6 m1 8 3 7 O nfet L=4u H=2u m2 9 4 7 O nfet L=4u H=2u x0 2 7 O nmirror2 xi 8 1 5 pmirror2 x2 9 10 5 pmirror2 x3 1 10 O n_cascode-mirror .ends tt************************#*********#********** Negative IV multiplier * * * 1O --- Iout, 2 --- Iin, * 3 ---V1, 4 ---V2, * 5 ---Vdd, 6 ---gnd 1.: a: :1: 6 is not really used .subckt neg_iv_mul2 10 2 3 4 5 6 m1 8 3 7 O nfet L=4u H=2u m2 9 4 7 O nfet L=4u H=2u x0 2 7 O nmirror2 11 8 1 5 pmirror2 x2 9 1O 5 pmirror2 13 1 1O 0 nmirror3 .ends *t*****#***********************#**************** Test IV multiplier I'l- * 1 --- Iout, 2 --- Iin, 3 --- V1, 4 ---V2, 5 ---Vdd, 6 ---gnd .subckt iv_mul-test 1 2 3 4 5 6 m1 8 3 7 0 nfet m2 1 4 7 0 nfet xO 2 7 O nmirror ** 149 x1 8 1 5 pmirror *x2 9 10 5 pmirror *x3 10 1 0 nmirror .ends ************************************************ * N MOS pair a: * 1---V1, 2---v2, 3---Iout, 4---Iin1, 5---Iin2 .subckt n_pair 1 2 3 4 5 6 m1 4 1 3 0 nfet L=4u H=3u m2 5 2 3 0 nfet L=4u H=3u .ends ******************************#************** * gilbert multiplier * 1-v1, 2-v2, 3-v3, 4-v4, S-Ib, 6-Iout, 7-Vdd, 8-gnd * .subckt gilbert_mul 1 2 3 4 5 6 7 8 x0 3 4 10 12 6 8 n_pair x1 3 4 11 6 12 8 n_pair x2 1 2 9 10 11 8 n_pair x3 12 6 7 pmirror m1 9 S 8 8 nfet .ends ***************************#*********************** * Voltage square circuit a: *1---v1, 2---v2, 3---Iout, 4---Ib, 8---Vdd, 9---gnd .subckt v-square 1 2 3 4 8 9 10 1 2 10 5 6 9 n_pair x1 1 2 11 6 5 9 n-pair x2 1 2 12 10 11 9 n_pair x3 5 3 8 pmirror3 x4 6 7 8 pmirror x5 7 3 9 nmirror2 ml 12 4 9 9 nfet L=4u H=3u .ends **************************************************************** * another voltage square circuit * *1---v1, 2---v2, 3---Iout1, 7--Iout2, 4---Ib, 8---Vdd, 9---gnd 150 *9 is not really used :1: .subckt new_v_square 1 2 3 7 4 8 9 x0 1 2 10 5 6 0 n_pair x1 1 2 11 6 5 0 n_pair x2 1 2 12 10 11 0 n-pair x3 5 3 8 pmirror3 x4 6 7 8 pmirror3 *x5 7 3 O nmirror2 ml 12 4 O O nfet L=4u H=3u .ends **ttt**************t****#**********************************#*t**# * v_sqaure circuit with two current outputs *1---v1, 2---v2, 3---Iout1, 7---Iout2, 4---Ib, 8---Vdd, 9---gnd .subckt v_square-2out 1 2 3 7 4 8 9 x0 1 2 10 5 6 O n_pair x1 1 2 11 6 5 O n_pair x2 1 2 12 10 11 O n_pair x3 5 3 8 pmirror x4 6 7 8 pmirror *x5 7 3 O nmirror ml 12 4 0 O nfet L=4u H=3u *m1 is the bias mosfet .ends ¢********************¢¢******¢***¢************ * test voltage square circuit :1: *1---v1, 2---v2, 4---Ib, 8---Vdd, 9---gnd .subckt v-squaretest 1 2 3 4 8 9 x0 1 2 10 5 6 O n_pair x1 1 2 11 6 5 0 n-pair x2 1 2 12 10 11 0 n_pair x3 5 3 8 pmirror x4 6 7 8 pmirror x5 7 3 O nmirror ml 12 4 O 0 nfet .ends t*****tt***********t#*********#****t***#***************** * Another test voltage square circuit * *1---v1, 2---v2, 3---Iout, 4---Ib, 8---Vdd, 9---gnd .subckt v_squaretest2 1 2 3 4 8 9 x0 1 2 10 5 6 0 n_pair x1 1 2 11 6 5 0 n_pair x2 1 2 12 10 11 0 n-pair x3 6 3 8 pmirror x4 5 7 8 pmirror x5 7 3 0 w_nmirror mtesti 7 3 0 0 nfet L=12u H=4u mtest2 7 3 0 0 nfet L=12u H=4u ml 12 4 0 0 nfet .ends 151 The following is the PSPice file for a one-neuron/two—pattern circuit. 1n2p circuit at * This circuit is a one-neuron/two-pattern circuit * of the product-of-norm neural network model * ********************************************** * The structure ******ttt*********************¥*************** x0 6 4 80 81 9 1 0 new_v_square x1 61 82 5 6 1 0 new-iv_mul x2 62 83 5 6 1 0 neg_iv_mu12 r1 80 82 0.00001 r2 81 83 0.00001 r5 61 6 0.00001 r6 62 6 0.00001 x3 6 5 70 71 9 1 0 new_v_square x4 63 72 4 6 1 0 new_iv_mul x5 64 73 4 6 1 0 neg_iv_mul2 r3 70 72 0.00001 r4 71 73 0.00001 r7 63 6 0.00001 r8 64 6 0.00001 **************************¥**#*****#********** * The settings *******#************************************** 152 vd 1 O 5v vc 6 0 iv vb 9 0 0.5V v_star1 4 0 0.6v v_star2 5 0 0.4V *#11:******************************************* * The subcircuits ******************tt****tttt**********¥******* ********t************************* * N MOS current mirror * 1 ---Iin 2 --- Iout, 3 --- GND .subckt nmirror8 1 2 3 m1 1 1 0 0 nfet L=20u H=2u m2 2 1 0 0 nfet L=20u H=2u .ends ********************************* * N MOS current mirror * 1 ---Iin 2 --- Iout, 3 --- GND .subckt nmirror9 1 2 3 m1 1 1 0 0 nfet L=7u H=5u m2 2 1 0 0 nfet L=7u H=5u .ends t*********#********************** * Negative IV multiplier a: .subckt neg_iv_mul 10 2 3 4 5 6 m1 8 3 7 0 nfet L=4u N=2u m2 9 4 7 0 nfet L=4u W82u x0 2 7 0 nmirror2 x1 8 1 5 pmirror2 x2 9 10 5 pmirror2 x3 1 10 0 n_cascode_mirror .ends *********************************************** Another negative IV multiplier 3 --- V1, 4 ---V2, 1.: 1|: * 10 --- Iout, 2 --- Iin, * . 5 ---Vdd, 6 ---gnd 1|: 153 * 6 is not really used :1: .subckt neg_iv_mul2 10 2 3 4 5 6 m1 8 3 7 0 nfet L=4u H=2u m2 9 4 7 0 nfet L=4u N=2u x0 2 7 0 nmirror2 x1 8 1 5 pmirror2 x2 9 10 5 pmirror2 x3 1 10 0 nmirror9 .ends *****************************¢***************** IV multiplier * :1: * 1 --- Iout, 2 --- Iin, * 3 --- V1, 4 ---V2, * 5 ---Vdd, 6 ---gnd :1: :1: :1: 6 is not really used .subckt new_iv_mul 1 2 3 4 5 6 m1 8 3 7 0 nfet L=4u W=2u m2 9 4 7 0 nfet L=4u H=2u x0 2 7 0 nmirror2 x1 8 1 5 pmirror2 x2 9 10 5 pmirror2 x3 10 1 0 nmirror8 .ends .lib c:/pspice/hou/neuron.lib .TEMP 25 .dc vc 0 1.0 0.01 .probe .end Setting 0‘1 = 0.4V and v"2 = 0.7V, we plot the response of the circuit to different initial voltages in Figure A.1. Figure A.2 shows the simulations of the same circuit with v" = 0.3V and v‘2 = 0.75V. As can be seen from the figures, the circuit stored two stable points, which are very close to the desired points. Because the circuit function we derived can only work in certain range, the resulting circuit cannot store arbitrary data. 154 0.9 I I T I I I I 0.8 *- _ 0.7 ~ - 0.6 1 > - go 0.5 -\\ ““““““““ _ d , ____ .4" :3 L” .. >0 0'41 0.3 a 0.2 fl 0.1 - _ l I l L l l l 0 0 0.0005 0.001 0.0015 0.002 0.0025 0.003 0.0035 0.004 Time t Figure A.1. The response of the one-neuron/two-pattern circuits to different initial voltages. v"1 = 0.4V and v‘2 = 0.7V. 0.9 1 1 I ..L . 0.7 — /’# I / 0.6 — // _ A > // , 61) 0.5 rem—'7'". - jg hhhh ‘ —>5 04 "‘~~~-::::::: _______ d 0.3 - ‘ 0.2 d 0.1 d 0 1 1 l 0 0.0005 0.001 0.0015 0.002 Time I Figure A2 The response of the One-neuron/two—pattern circuits to different initial voltages. v"1 = 0.3V and v"2 = 0.75V BIBLIOGRAPHY [Bar91] [BcH 75] [HKP90] [Hop82] [Hop84] [H874] [Kha91] [K0588] [LS95] [Mea89] BIBLIOGRAPHY Valmir C. Barbosa. Learning in analog hopfield networks. In Proceedings of IJCNN 1991, volume ii, pages 183—186, Seattle, July 1991. Arthur E. Bryson, Jr. and Yu chi Ho. Applied Optical Control Opti- mization, estimation, and control. Ohemisphere Publishing Corporation, Washington, DC, 1975. John Hertz, Anders Krogh, and Richard G. Palmer. Introduction To The Theory of Neural Computation. Addison-Wesley Publishing Company, 1990. John Hopfield. Neural networks and physical systems with emergent collective computational abilities. In Proceedings of the National Academy of Sciences, volume usa 79, pages 2554-2558, 1982. John J. Hopfield. Neurons with graded response have collective compu- tational properties like those of two-state neurons. In Proceedings of the National Academy of Sciences, volume usa 81, pages 3088-3092, 1984. Morris W. Hirsch and Stephen Smale. Differential Equations, Dynamical Systems, and Linear Algebra. Academic Press, Inc, 1974. Hassan K. Khalil. Nonlinear Systems. Macmillan Publishing Company, 1991. Bart Kosko. Bidirectional associative memories. IEEE Trans. 0n Sys- tems, Man, and Cybernetics, 18(1):49—60, Jan/Feb 1988. Frank L. Lewis and Vassilis L. Syrmos. Optical Control. John Wiley & Sons, Inc., 1995. Carver Mead. Analog VLSI and Neural Systems. Addision-wesley Pub- lishing Company, 1989. 155 [M0289] [Pea89] [Pin87] [Pin88] [Pin89] [RHW86] [SB91] [SCLQH [SCL92] [SWC91] [TB91] 156 M. C. Mozer. A focused back-propagation algorithm for temporal pattern recognition. Complex Systems, 3:349-381, 1989. B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. In International Joint Conference on Neural Networks, Wash- ington 1989, volume II, pages 365—372, Seattle, Washington, 1989. Fernando J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229—2232, 1987. Fernando J. Pineda. Dynamics and architecture for neural computations. Journal of Complexity, 4:216—245, 1988. Fernando J. Pineda. Recurrent back-propagation and the dynamical approach to adaptive neural computation. Neural Computation, 1:161- 172, 1989. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error prOpagation. In Parallel Distributed Processing: Explorations in the microstructures of cognition, pages 318-362. MIT Press, Cambridge, 1986. Fathi M. A. Salam and Shi Bai. A new feedback neural network with supervised learning. IEEE Trans. 0n Neural Networks, 1(1):170—173, Jan 1991. Guo-Zheng Sun, Hsing—Hen Chen, and Yee-Chun Lee. A fast on-line learning algorithm for recurrent neural networks. In Proceedings of 1.] CNN 1991, volume ii, pages 13-18, Seattle, July 1991. Guo—Zheng Sun, Hsing-Hen Chen, and Yee-Chun Lee. Greens’ function method for fast on-line learning algorithms of recurrent neural networks. In Advance in Neural Information Processing Systems, volume 4, pages 333-340. Morgan Kaufmann Publishing Inc., 1992. Fathi M. A. Salam, Yiwen Wang, and Myung-Ryul Choi. On the analysis of dynamic feedback neural nets. IEEE Trans. 0n Circuits and Systems, 38(2), Feb 1991. N. Toomarian and J. Barhen. Adjoint-functions and temporal learning algorithms in neural networks. In Advance in Neural Information Pro- cessing Systems, volume 3, pages 113—120. Morgan Kaufmann Publishing Inc., 1991. [TH87] [THL93] [VP92] [Wa189] [WCJ91] [WCM89] [WCM90] [WHH+89] [WZ89a] [WZ89b] 157 D. W. Tank and J. J. Hopfield. Neural computation by time compression. In Proceedings of the National Academy of Sciences USA, volume 84, pages 1896-1900, 1987. H. Chris Tseng, Victor H. Hwang, and Ling Lu. A fast supervised learn- ing scheme for recurrent neural networks with application to associative memory design. In Proceedings of 1993 IEEE International Conference on Neural Netw orks, volume ii, pages 789—793. IEEE Neural Network Concel, April 1993. Bert De Vries and Jose C. Principe. The gamma model — a new neural model for temporal processing. Neural Networks, 5:565—576, 1992. A. Waibel. Modular construction of time-delay neural networks for speech recognition. Neural Computation, 1:39—46, 1989. Yeou-Fang Wang, Jose B. Cruz Jr., and J. H. Mulligan, Jr. Guaranteed recall of all training pairs for bidirectional associative memory. IEEE Trans. 0n Neural Networks, 2(6):559—567, Nov 1991. Yeou-Fang Wang, Jose B. Cruz, Jr., and J. H. Mulligan, Jr. An en- hanced bidirectional associative memory. In Proceedings of IJ CNN 1989, volume i, pages 105-110, June 1989. Yeou-Fang Wang, Jose B. Cruz Jr., and J. H. Mulligan Jr. Two coding strategies for bidirectional associative memory. IEEE Trans. 0n Neural Networks, l(1):81—92, Mar 1990. A. Waibel, T. Hanazawa, G. Hinton, K Shikano, and K Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37:328-339, 1989. R. J. Williams and D. Zipser. Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1:87-111, 1989. R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270-280, 1989.