This is to certify that the dissertation entitled "Architecture and Statistical Model of a Pulse-Mode Digital Multilayer Neural Network" presented by Young-Chul Kim has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

Major professor

ARCHITECTURE AND STATISTICAL MODEL OF A PULSE-MODE DIGITAL MULTILAYER NEURAL NETWORK

By

Young-Chul Kim

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1993

ABSTRACT

ARCHITECTURE AND STATISTICAL MODEL OF A PULSE-MODE DIGITAL MULTILAYER NEURAL NETWORK

By

Young-Chul Kim

A new architecture for a pulse-mode digital neural network is presented. Algebraic neural operations are replaced by stochastic processes using pseudo-random pulse sequences. Synaptic weights and neuron states are represented as probabilities and estimated as average rates of pulse occurrences in corresponding pulse sequences. A statistical model of error (or noise) is developed to estimate the relative accuracy associated with stochastic computing in terms of a mean and a variance.

The stochastic computing model translates into simple logic gates as basic computing elements, leading to a high neuron density on a chip. Furthermore, the use of simple logic gates for neural operations, the pulse-mode signal representation, and the modular design techniques lead to a massively parallel yet compact and flexible network architecture well suited for VLSI implementation. Any size feed-forward network can be configured using the modules. Processing speed is independent of the network size.

Multilayer feed-forward networks are modeled and applied to pattern classification problems such as encoding and character recognition. The architecture and all digital sub-components in the proposed neural network are modeled and simulated in VHDL. Computational accuracy is analyzed and the network performance is evaluated in terms of a correct classification rate. The simulation experiments in these applications show the network performance is competitive with that of deterministic DMNN simulations and ordinary back-propagation networks while retaining the desirable properties of high speed and high density on a chip.

Copyright by
Young-Chul Kim
1993

To my parents and my wife

ACKNOWLEDGEMENTS

I would like to thank my major advisor, Dr. Michael A. Shanblatt, for his guidance and encouragement throughout the years of this research. I also want to thank all the members of my Ph.D. guidance committee, Dr. P. David Fisher, Dr.
Chin-Long Wey, Dr. Moon-Jung Chung, and Dr. Jacob Plotkin, for their valuable comments and suggestions.

Finally, I wish to dedicate this dissertation to my parents, for their love, understanding, and support; my wife, Gyea-Sook Kim, for her love, patience, and encouragement; and my lovely children, Jong-Seok and So-Youn.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1. Introduction
1.1 Overview
1.2 Problem Statement
1.3 Research Tasks
1.4 Organization of the Dissertation

2. Background
2.1 Artificial Neural Networks
2.1.1 Biological/Artificial Neurons
2.1.2 Feedback Model
2.1.3 Feedforward Model
2.1.4 Recurrent Model
2.2 Artificial Neural Network Implementations
2.2.1 Analog and Hybrid Implementations
2.2.2 Digital Implementations
2.3 Pattern Recognition and Neural Networks
2.3.1 Statistical Approach
2.3.2 Structural Approach
2.3.3 Neural Network Approach
2.4 Behavioral Modeling with VHDL
2.4.1 Behavioral Modeling
2.4.2 VHDL Characteristics

3. Stochastic Computing in Neural Networks
3.1 Introduction
3.2 Generating Probability
3.2.1 Pseudo-Random Pulse Sequences
3.2.2 Generating Probability
3.3 Distribution of Estimated Generating Probability
3.3.1 Factorial Moment Generating Function
3.3.2 Binomial Distribution Model
3.3.3 New Distribution Model
3.4 Stochastic Computing in ANNs
3.4.1 Basic Stochastic Computations
3.4.2 Stochastic Computing in the DMNN
3.5 Back-Propagation in the DMNN

4. Pulse-mode Digital Multilayer Neural Networks
4.1 Basic Computing Elements
4.1.1 Random Pulse Generator
4.1.2 Synaptic Element
4.1.3 Input Neuron Body Element
4.1.4 Regular Neuron Body Element
4.2 Modular Architecture
4.3 DMNN Coprocessor
4.4 Behavioral Model of a DMNN Coprocessor
4.4.1 Introduction
4.4.2 Design Methodology
4.4.3 Coprocessor Control in VHDL
4.4.4 DMNN Model in VHDL
4.5 Hardware Complexity

5. Analysis of the DMNN
5.1 Statistical Models
5.1.1 Synaptic Multiplication
5.1.2 Two-input Logical OR
5.2 Effects of Random Noise in Hidden Layers
5.2.1 First Hidden Layer
5.2.2 Kth Hidden Layer
5.2.3 Neural Activation
5.3 Network Performance Model
5.4 Simulations

6. DMNN Application: Pattern Classification
6.1 Introduction
6.2 Methodology
6.2.1 Training and Classification
6.3 Benchmark Problems
6.3.1 DMNN XOR Problem Solver
6.3.2 DMNN Encoder
6.4 DMNN Character Recognizer
6.4.1 Data Set
6.4.2 Experimental Results and Network Performance
6.5 Summary

7. Conclusion
7.1 Summary
7.2 Contributions
7.3 Future Research

APPENDICES
A. Derivation of the rth Factorial Moment of a Hypergeometric Random Variable X
B. Program for DMNN Back-Propagation Training
C. VHDL Code and Corresponding Schematics
D. Input Data for DMNN Binary Classifiers

BIBLIOGRAPHY

LIST OF TABLES

3.1 Number of distinct PN sequences with the maximal length period.
4.1 Chip area required by basic digital components.
4.2 Chip area required by DMNN elements.
4.3 Chip area required by two example networks.
5.1 Standard deviations of $\hat{v}_i$('on') and $\hat{v}_i$('off') when $n_0 = 25$, $n_1 = 5$, $n_2 = 5$, and (a) N = 127; (b) N = 255; (c) N = 511.
5.2 Standard deviations of $\hat{v}_i$('on') and $\hat{v}_i$('off') when $n_0 = 25$, $n_1 = 10$, $n_2 = 5$, $n_3 = 5$, and (a) N = 127; (b) N = 255; (c) N = 511.
5.3 $P_m$ obtained from equation 5.13 and correct classification rates from VHDL simulations.
6.1 Representation of the XOR problem in a DMNN.
6.2 Actual outputs of an n-bit two-layer DMNN for solving the XOR problem.
6.3 Representation of the 8-to-3 encoding problem in a DMNN.
6.4 Average number of iterations required for training in experiment 1.
6.5 Performance of the DMNN 5-digit recognizer.
6.6 Performance of the DMNN 10-digit recognizer.

LIST OF FIGURES

2.1 A biological neuron.
2.2 An electrical synapse.
2.3 A simplified artificial neuron.
2.4 A sigmoid threshold function.
2.5 The Hopfield network model.
2.6 The Kennedy-Chua network model.
2.7 A simple perceptron.
2.8 A multilayer perceptron.
2.9 A Boltzmann machine consisting of visible and hidden units.
2.10 The structure of a processing element in [64].
2.11 A typical character recognition system.
2.12 Entity declaration in VHDL.
2.13 The flow of design data in the VHDL design process.
3.1 (a) The block diagram of an LFSR; examples of LFSRs with a maximal length period where (b) $c_1, c_2, \ldots, c_7$ = 1000001 and (c) $c_1, c_2, \ldots, c_7$ = 0101011.
3.2 A random pulse generator for fractional number x.
3.3 Duality between Boolean operations and numerical operations, where a sampling clock period of 20 is assumed and the number of '1' pulses generated during the period in $x_{(p)}$ is 20x for an operand x.
3.4 Stochastic computations in the DMNN: (a) synaptic multiplication; (b) logical OR; (c) neural activation.
4.1 (a) A maximum-length 8th-order LFSR where $f(x) = x_2 \oplus x_3 \oplus x_4 \oplus x_8$; (b) a random pulse generator for $v_i$.
4.2 (a) A synaptic element (SYN); (b) a block diagram of a SYN.
4.3 (a) An input neuron body (INB); (b) a block diagram of an INB.
4.4 (a) A regular neuron body element (RNB); (b) a block diagram of an RNB.
4.5 An input layer module (ILM).
4.6 A synaptic array module (SAM).
4.7 A regular neuron array module (RNAM).
4.8 The general architecture of the DMNN.
4.9 A DMNN coprocessor.
4.10 The design hierarchy of a DMNN coprocessor.
4.11 VHDL code implementing the DMNN coprocessor.
4.12 VHDL code implementing the DMNN.
5.1 Two-input logical OR, where the dotted lines illustrate the deterministic nature of the output sequence $net_{i(p)}$.
5.2 $\sigma_{net_i}$ obtained from equation 5.5 when (a) N = 127; (d) N = 255, and $\sigma_{net_i}$ obtained from actual simulations when (b, c) N = 127; (e, f) N = 255.
5.3 (a) $\hat{K}$ with various network configurations when $net_i = 0.55$, $n_0 = 36$, and $k > 1$; (b) standard deviation of $\hat{net}_i$ with respect to $net_i$.
5.4 The distribution of $\hat{v}_i$ in the hidden layer (o) and the output layer (+) for 8825 tests compared to a binomial distribution when $v_i =$ (a) 0.45 or (b) 0.54, $k = 2$, $n_0 = 25$, $n_1 = 5$, and $n_2 = 5$.
6.1 An example DMNN for solving the XOR problem.
6.2 An example DMNN for solving the 8-to-3 encoding problem.
6.3 For 5-digit classification in experiment 1: (a) pixel images of a typical data set; (b) Hamming distances between two digits.
6.4 For 10-digit classification in experiment 2: (a) pixel images of a typical data set; (b) Hamming distances between two digits.
C.1 An n-bit register with parallel load.
C.2 An n-bit magnitude comparator.
C.3 An n-bit up-counter.

CHAPTER 1

Introduction

Artificial neural networks (ANNs) present a practical approach to solving computationally intensive and (or) ill-defined problems such as pattern recognition, optimization, adaptive control, associative memory, and some complex information processing tasks. Dedicated VLSI implementation is crucial to building fast ANNs fully utilizing the parallelism embedded in ANN computations. This dissertation presents a new architecture for a digital feedforward neural network using stochastic computing techniques. Random noise effects in this architecture are also presented. The applicability of the network is demonstrated using pattern classification examples. This work includes the network architecture, analysis, modeling, simulation, and applications. This chapter begins with a brief overview of ANN implementation models. The problem to be solved is then defined, followed by the research tasks. Finally, the organization of this dissertation is outlined.

1.1 Overview

An artificial neural network is a highly interconnected array of simple computing elements inspired by the computational strengths of biological neural systems. The structure of individual nerve cells, called neurons, in biological neural systems is well understood.
The neuron is specialized to conduct electrochemical impulses from or to sensory organs and other neurons. Its function is accomplished by means of hairlike nerve fibers. However, it is not yet well known how this neural network with its mas- sive parallel interconnections functions as memory and manipulates complex human behaviors. Over the last decade many researchers from the fields of physics, mathe- matics, computer science, and engineering have provided useful theoretical analyses for various models of ANNs [1—11]. Neural network topologies and some design pro- cedures have been proposed and many of these ANN models have been proven to be superior to conventional digital computers in areas such as pattern recognition, combinatorial optimization, associative memory, and human information processing tasks. It is now widely believed that the massive parallelism and computational power of the human brain results from the global and complex interconnections among a large number of neurons rather than from the complexity of individual neurons. One of the major goals in the field of ANN implementation is to produce dedicated hardware that mimics those dense interconnections among a large number of neural elements. Most of the current ANN models, however, rely on computer simulations. With the help of the current advancements in integrated electronics, optical, and electro—optical technologies, dedicated hardware implementation of ANNs is now progressing [12-18]. To date, many analog and hybrid ANNs have been built using CMOS [12,15-17] or CCD technology [14]. Most of these are analog implementations of simple feedback or feedforward neural networks. Analog implementation offers high-speed with low hardware cost. The primary disadvantages of analog processing are the inaccuracy of analog computations and the low design flexibility due to the physical constraints of analog electronic devices. Digital ANN implementation can take advantage of some of the benefits of current VLSI technology such as well-understood and advanced design techniques and tools. Several digital neural networks based on custom VLSI design have been developed where a neuron is a processing element consisting of computing units, registers, and a loop—up table (or memory) [19—23]. This approach has an increased area requirement and the level of parallelism decreases significantly due to the communication over- head. Recently, however, a new digital approach has been introduced to reduce the hardware requirement and to increase the level of parallelism. In this new approach, a synaptic multiplication and (or) a neuron activation function is implemented with simple logic gates using stochastic computing techniques [28-32]. In this dissertation, a set of fundamental research tasks are described which are aimed toward developing an efficient architecture and statistical model of a pulse- mode Digital Multilayer Neural Network (DMNN) based on stochastic computing. A statistical model is developed by which the accuracy of stochastic computing in the DMNN is analyzed. The operational characteristics and performance of the DMNN are quantified. The applicability of the developed network is demonstrated using benchmark comparisons and example character recognition problems. The results of this research contribute to the establishment of a pulse-mode DMNN which has a compact, flexible, and expandable structure. 1 .2 Problem Statement Many current ANN models rely on software simulations using serial or parallel digital computers. 
The speed of all software simulators, even those run on parallel machines, is far from equaling that of specialized VLSI ANNs. This is due mainly to the sequential nature of control flow and the communication overhead in digital computers. Some VLSI analog or hybrid ANN implementations have been built us- ing matrices of fixed or variable resistors and nonlinear amplifiers [12,15—17,24,25,56]. Analog implementations of ANNs have the potential for high density; however, with current VLSI technology, it is very difficult to build large (or multichip) analog ANNs. This is mainly due to the inaccuracy of analog elements, the unavailability of reliable permanent analog storage devices, and design parameter variations such as noise, temperature, and high parasitic capacitances on external I/O pins. Difficulties in VLSI analog implementation of AN Ns limit their density on a chip and constrained their applications, in turn, leading to a limitation in solving real engineering prob- lems. A digital approach is a viable alternative alleviating some of the above drawbacks to analog implementation. Digital implementation can take advantage of some of the benefits of current VLSI technology such as well-understood and advanced de— sign techniques. Nevertheless, dedicated VLSI digital implementation has been less developed because a conventional digital approach to ANN implementation has an increased area requirement and complex connectivity. In order to build a large digital neural network, a space-efficient network architecture must be developed. Some digital ANN architectures using stochastic computing techniques show the possibility of the low-cost and high-speed digital ANN implementation [28-32]. In these architectures, algebraic operations are replaced by random processes using ran- dom pulse sequences. Simple logic gates combined with some other simplistic com- ponents perform multiplications and nonlinear transformation of signals. In this approach, the network performs pseudo-analog computations with operands ranging from 0.0 to 1.0. An operand :c in the pulse-mode representation is the probability of pulse occurrence in the corresponding binary pseudo-random pulse sequence 3(a) generated at each clock. However, the overall feedforward network architecture which is programmable and expandable to any size has not yet been established. The math- ematical model of the pulse-mode digital neural network also must be developed to estimate the relative accuracy of stochastic computations and to anticipate the net- work performance. Furthermore, the applicability of the developed neural network architecture must be verified using real—world application examples. 1 .3 Research Tasks The tasks of this research are to (1) develop a pulse-mode digital neuron ar- chitecture and the corresponding statistical model; (2) develop an efficient DMN N architecture and the statistical model of error (or noise) in the DMNN and analyze the accuracy of stochastic computations utilized in the DMNN; (3) Formulate the framework of VHDL modeling techniques for the DMNN and simulate the DMNN in VHDL; and (4) apply pattern classification problems to the DMNN and evaluate the network performance and compare the performance of the DMN N classifier with the results from other deterministic feedforward neural networks. To develop a pulse-mode digital neuron model, the first step is to investigate var- ious stochastic computing techniques using similarities between boolean algebra and probability algebra. 
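The similarity between Boolean and probability algebra referred to here is the one the DMNN exploits: if two independent pulse sequences carry operands as pulse probabilities, a single AND gate produces a sequence whose pulse rate estimates their product. The following C fragment is a minimal illustration of that duality only; it uses the standard library's rand() in place of the LFSR-based pulse generators developed in later chapters, and the operand values, sequence length, and names are assumptions made for the example.

    /* Boolean/probability duality behind stochastic computing: AND-ing two
     * independent pulse streams whose '1' rates are a and b gives a stream
     * whose '1' rate estimates a*b.  rand() stands in for the LFSR-based
     * pulse generators used in the actual hardware.                        */
    #include <stdio.h>
    #include <stdlib.h>

    /* Emit one pulse per clock: '1' with probability p. */
    static int pulse(double p) { return ((double)rand() / RAND_MAX) < p; }

    int main(void)
    {
        const double a = 0.6, b = 0.7;   /* operands restricted to [0, 1] */
        const int N = 255;               /* sampling clock period         */
        int count = 0;

        for (int k = 0; k < N; k++)
            count += pulse(a) & pulse(b);    /* one AND gate per clock */

        /* The estimate of a*b is the average pulse rate over N clocks. */
        printf("exact a*b = %.4f  estimate = %.4f\n", a * b, (double)count / N);
        return 0;
    }

With N = 255 the estimate typically lands within a few hundredths of the exact product; how this error behaves as a function of the sampling period is what the statistical model called for in task (2) quantifies.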
The study is concentrated on developing a digital neuron model in which a non-linear transfer (sigmoid) function is embedded, which is essential to ANN models. Also various digital neuron architectures are studied, including those published recently. Simultaneously, all necessary components are developed in such a way that each of them can contribute to a simple and regular neuron architecture. A statistical model of the neuron is developed. The computing accuracy of a synaptic multiplication and a neuron activation is estimated in terms of means and variances. A regular neuron architecture is sought in such a way that it leads to an expandable network architecture. The second task is the development of a DMN N architecture and an analysis of the network. Existing feedforward network models with advanced architectures are explored. A flexible and modular architecture for the DMNN is sought such that the network can be programmed for different network configurations by simply connecting basic modules. The effective network structure to minimize the correlation between multiple pseudo—random pulse sequences is sought. The number of clock cycles per sampling period for the pulse-code representation of signals is determined in such a way that the required accuracy for a particular application problem is satisfied. A variable register length is one of the design issues for the DMNN. This decision may be made using knowledge gleaned from simulation results of actual problems. Simul- taneously, network analysis is performed based on the statistical models to estimate the differences between the results obtained by the DMNN and those obtained by the deterministic calculation. At this stage, some assumptions are made for the analy- sis on distributions of synaptic weights and neuron activations because they depend highly on network architectures and application problems. The third task involves VHDL modeling and simulation which demonstrate effi~ cient behavioral modeling techniques for the DMNN. First of all, the clever use of VHDL semantics is necessary to get a precise model. Detailed investigations are un- dertaken on process statements, functions, and delay characteristics. A logic block can be modeled using process statements and accompanying wait statements for the flow control in a VHDL description. In VHDL, a function subprogram defines an algorithm for computing values or representing the behavior of a hardware model. One of the useful functions in modeling digital neural networks is the bus resolution function which defines the resolution of output values for a common output signal. The delay model in VHDL must provide an accurate view of the timing associated with the logic gate. In addition, an effective naming convention is considered in order to develop VHDL models conveniently and to document them properly. For the last task, some testbench problems and character recognition problems are applied to the developed DMNN. This demonstrates the applicability of the DMNN. Traditional pattern recognition systems rely on programmable algorithms based on statistical or syntactical approaches. They perform a mapping from the observation space to the interpretation space by extracting features from observed data and clas- sifying the collected features into certain categories. The developed DMNN should self-organize the complex mapping required to solve the problem and provide a fast classification rate. 
The back-propagation algorithm for the DMNN is programmed in C and the DMNN will be modeled in VHDL. The network is trained on a host computer. After the training, the network configuration is determined and the clas- sification of test patterns is performed by the DMNN. Testbench problems are tested on the DMNN at first. These problems include “exclusive OR” and “encoding” prob- lems. The experimental results show the strength as well as the limitations of the DMNN. The performance measures include the number of classifications per second and the correct classification rate. Next, character classification problems are applied to demonstrate its applicability to real-world problems. The experimental results are compared with those of other approaches. For a particular problem, the proper rep- resentation for input and output patterns, and the best choice of a register length in the DMNN is determined. As a result, a design procedure for a DMNN binary classifier is proposed. 1.4 Organization of the Dissertation The remainder of the dissertation is organized as follows. Chapter 2 contains the background discussion of related topics. It begins with a discussion of various artificial neural network models, and it briefly describes the existing hardware im- plementations of ANNs. This is followed by a discussion of traditional and ANN approaches for solving pattern classification problems. It ends with the brief discus- sion of VHDL characteristics and behavioral modeling techniques. Chapter 3 presents the fundamentals of stochastic computing techniques. The techniques for generating random pulse sequences using Linear Feedback Shift Reg- isters (LFSRs) and the randomness properties of the pulse sequences are presented. Synaptic weights and neuron activations are represented as generating probabilities with the pulse sequences. The statistical model of the generating probability is de— veloped in terms of mean and variance. Then stochastic computing techniques to perform a synaptic multiplication and a signal intergration are discussed. Chapter 4 proposes an architecture for the DMNN. The DMNN consists of synap— tic elements, neuron body elements, and necessary connections. To develop an overall network architecture, modular design techniques are used. The DMNN is trained with the Back-Propagation (BP) learning rule suitable for the pulse-mode feed-forward neural network. A generic architecture of the DMN N coprocessor which can be at— tached to a host computer is proposed. The DMNN coprocessor is composed of the DMN N, a control unit, a memory unit, and some digital components. The network configuration for solving a particular problem is determined during the training ses- sion. Once the training is completed, the determined synaptic weights and network configuration are loaded into the memory in the DMNN coprocessor from a host com- puter. Then, the programmed or hardwired control unit can be used to control the operations of the coprocessor during the classification session. Random noises (errors) are involved in the stochastic computations of network operations. In Chapter 5, the statistical models of the network operations performed using stochastic computing techniques are presented. The relationship between the computing accuracy and the register length (or the sampling period), and the relation- ship between the computing accuracy and the network architecture will be discovered. The overall random noise effects on hidden and output layers are analyzed. 
The validity of the developed models and analysis results is justified by simulations.

In Chapter 6, the DMNN coprocessor is modeled and simulated in VHDL. Some testbench problems and character classification problems are applied to the coprocessor. A design procedure for solving binary classification problems with the DMNN coprocessor is proposed. Testbench problems are tested to see the applicability of the DMNN to binary classification problems. Network performance of DMNN character classifiers is evaluated in terms of successful classification rates. The network performance is compared with that of deterministic DMNN simulations or other ordinary back-propagation networks.

Finally, Chapter 7 contains the conclusions, contributions, and future direction of this work.

CHAPTER 2

Background

Many artificial neural network models have been developed based on current knowledge of biological neurons and with the help of available analytic methods for linear or nonlinear dynamic systems. Network topology, computational characteristics of neuron elements, and learning rules play key roles in specifying artificial neural networks. In the first section, feedback, feedforward, and recurrent models are discussed with the learning rules associated with the network models. In recent years, many software and hardware implementations of these models have been developed. Among them, some software simulators and analog and digital electronic ANNs are discussed in the next section, followed by related issues. Pattern classification is one of the major applications for feedforward ANNs. Traditional and neural network approaches used for pattern classification are presented. This chapter concludes with a brief discussion of behavioral modeling and the VHSIC (Very High Speed Integrated Circuit) Hardware Description Language (VHDL).

2.1 Artificial Neural Networks

In this section, a brief review of biological and artificial neurons is provided. Next, typical feedback network models such as the Hopfield model and the Kennedy-Chua model are discussed in relation to ANNs. This is followed by a discussion of feedforward network models and the Boltzmann machine as a recurrent network model.

2.1.1 Biological/Artificial Neurons

2.1.1.1 Biological Neuron

The biological nervous system consists of two principal classes of cells, the neurons and the neuroglia. The neuroglia are cells that fill the spaces between the neurons [33]. The neuron is a fundamental processing unit of all nervous systems. Most neurons contain four distinct regions which carry out the specialized functions of the cell: the cell body, the dendrites, the axon, and the synapse (Figure 2.1).

Figure 2.1. A biological neuron, showing the nucleus, cell body, dendrites, axon hillock, axon, and synapse.

Axons are specialized for carrying information toward other cells without reducing the magnitude of signals. Action potentials originate at the axon hillock and travel to synapses, from which point signals are passed to other cells. Dendrites receive signals from sensory organs or from the axons of other neurons, convert these signals into electrical impulses, and transmit them to the cell body. The cell body receives signals independently. If the electrical impulses are greater than a certain threshold, action potentials are generated and are actively conducted down the axon. The action potentials are pulse streams with a pulse width of about 1 msec.
Synapses generally pass signals to other cells in only one direction; an axon terminal from a presynaptic cell sends chemical or electrical signals through a synaptic gap. The signals are collected by a postsynaptic cell. Two types of synapses exist in biological neural systems: electrical and chemical. They differ in both structure and function. Cells communicating by electrical synapses are connected by gap junctions (Figure 2.2). This allows an electrical pulse to pass from the presynaptic cell to the postsynaptic cell. In chemical synapses, chemical substances, called neurotransmitters, are involved in passing the signals [33]. An action potential is generated in the postsynaptic cell.

Figure 2.2. An electrical synapse, showing the presynaptic cell, plasma membrane, axon, gap junction connection, and postsynaptic cell.

Two types of signals occur in synapses: excitatory and inhibitory. With an excitatory synapse, the signal from the presynaptic cell causes a change in the plasma membrane of the postsynaptic cell that tends to induce an action potential. However, with an inhibitory synapse, a nerve impulse in a presynaptic neuron affects the electrical properties of the postsynaptic membrane in such a way as to prevent the generation of an action potential. Excitatory and inhibitory stimuli often affect a single neuron in combination.

2.1.1.2 Artificial Neuron

An artificial neuron can be considered as a simple processing element which sums the weighted inputs and passes the result through a threshold or activation function. Figure 2.3 shows this simplified neuron. The input signals, which come from either sensors or outputs of other neurons, form the input vector $X = (x_1, \ldots, x_i, \ldots, x_n)$. The weights associated with each input form the weight vector $W_i = (w_{i1}, \ldots, w_{ij}, \ldots, w_{in})$ for the ith neuron, where $w_{ij}$ represents the connection strength between the ith and jth neurons. A threshold function can be modeled by associating a threshold $\theta_i$ with each neuron.

Figure 2.3. A simplified artificial neuron: weighted inputs are summed and passed through a non-linear activation function to produce $y_i$.

The output of the ith neuron, $y_i$, is then given by
$$y_i = f(X \cdot W_i - \theta_i) \qquad (2.1)$$
where $f(\cdot)$ is the threshold function. The most pervasive threshold function is the sigmoid function because it is a bounded, monotonic, non-decreasing function that provides a graded, nonlinear response, most resembling a biological neuron. The sigmoid function, $y = 1/(1 + \exp(-x))$, is shown in Figure 2.4.

Figure 2.4. A sigmoid threshold function, $y = 1/(1 + \exp(-x))$.

2.1.2 Feedback Model

Two feedback ANN models are reviewed: the Hopfield model and the Kennedy-Chua model. In feedback neural networks, neural elements are connected to one another by feedback paths from outputs to inputs of neural elements. Continuous-valued neural elements are normally implemented as electrical circuits, and the network dynamics are described by differential equations. A key issue of these networks is to define an energy function which always decreases during the dynamical evolution.

2.1.2.1 The Hopfield Model

The Hopfield model is a one-layer feedback network which consists of interconnected nonlinear analog neurons. Many implementations have been built based on this model. The general structure of this network is shown in Figure 2.5. In this model, each neuron is an amplifier with a capacitor $C_i$ and a resistor $\rho_i$ at the input node. The output of neuron j, $v_j$, is connected to the input of neuron i, $u_i$, via a conductance $w_{ij}$.

Figure 2.5. The Hopfield network model, with weight connections feeding the outputs $v_1, v_2, v_3, v_4$ back to the neuron inputs.
The dynamics of an interacting system of n neurons can be described by the nonlinear differential equation
$$C_i \frac{du_i}{dt} = \sum_{j=1}^{n} w_{ij} v_j - \frac{u_i}{R_i} + I_i \qquad (2.2)$$
where
$$\frac{1}{R_i} = \frac{1}{\rho_i} + \sum_{j=1}^{n} w_{ij},$$
$I_i$ is an external input current, $v_i = f_i(u_i)$, and $f_i$ is a sigmoid function. $R_i C_i$ forms the time constant of neuron i for charging and discharging, and $u_i/R_i$ is the leakage current. The energy function defined by the integral of equation 2.2 is
$$E = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} v_i v_j - \sum_{i=1}^{n} I_i v_i + \sum_{i=1}^{n} \frac{1}{R_i} \int_0^{v_i} f_i^{-1}(\xi)\, d\xi \qquad (2.3)$$
for $C_i \frac{du_i}{dt} = -\frac{\partial E}{\partial v_i}$. If $w_{ij} = w_{ji}$ for all i and j, the time derivative of the energy function is
$$\frac{dE}{dt} = -\sum_{i=1}^{n} C_i \frac{d}{dv_i}\left[f_i^{-1}(v_i)\right] \left(\frac{dv_i}{dt}\right)^2. \qquad (2.4)$$
Since $f_i(u_i)$ is monotonically increasing, $\frac{dE}{dt} \leq 0$ for all t. As a result, the value of the energy function is strictly decreasing and becomes zero only at the equilibrium point where $\frac{dv_i}{dt} = 0$ for all i. Equations 2.2 and 2.3 define a gradient system and thus guarantee convergence.

The Hopfield model has been applied to combinatorial optimization problems, where it has been observed that the network model converges to a good solution in a few time constants [6, 10]. The objective function of the combinatorial problem is mapped to the computational energy function through the adjustments of the connectivity strengths $w_{ij}$. Local minima of the energy function correspond to solutions to the problem. When the Hopfield network is used as an associative memory, solutions for this network model may be memory patterns stored in the network. Approximately 0.15n memory patterns are simultaneously stored before the patterns become too close to each other and tend to merge [4].

2.1.2.2 The Kennedy-Chua Model

A canonical circuit model with feedback was proposed for solving both linear and nonlinear programming problems by Kennedy and Chua [35, 36]. This model uses integrators as neuron elements. The structural parameters of the networks correspond to the coefficients of the objective function and constraint descriptions. Figure 2.6 shows an architecture of the model, where p-cells are constraint amplifiers, f-cells are integrators, and V is the vector of node voltages $v_1, v_2, \ldots, v_n$. The network dynamics can be described by
$$C_i \frac{dv_i}{dt} = -\left[\frac{\partial f(V)}{\partial v_i} + \sum_{j=1}^{m} p_j\big(g_j(V)\big) \frac{\partial g_j(V)}{\partial v_i}\right] \qquad (2.5)$$
where $C_i$ is capacitance, $v_i$ is the voltage of node i, $f(V)$ is the objective function, and $g(V)$ are the constraints. The corresponding energy function is
$$E(V) = f(V) + \sum_{i=1}^{m} \int_0^{g_i(V)} p_i(s)\, ds. \qquad (2.6)$$
Since $\frac{dE}{dt} \leq 0$ for all t, E(V) is a Lyapunov function ensuring the system convergence to a stable equilibrium point without oscillation [36]. This model requires more hardware to form the integrator than the Hopfield model does, but it is superior to the Hopfield model in solving linear programming problems, for which the Kennedy-Chua model guarantees a stable equilibrium point while the Hopfield model does not [79].

Figure 2.6. The Kennedy-Chua network model.

2.1.3 Feedforward Model

The Hopfield and Kennedy-Chua models are examples of one-layer feedback structures. The interconnection structures of biological neurons are often organized into multiple layers of cells [7, 33]. Layered feedforward networks were first studied in detail by Rosenblatt and his colleagues in the early 1960's [42]. Since then, feedforward multilayered structures and learning algorithms for training have been developed. The networks are trained with a set of input-target pairs as examples and can successfully generalize what has been learned.
Feedforward networks have been applied to pattern recognition [37, 38, 49], robotics [39], and control problems [40, 41].

2.1.3.1 Simple Perceptrons

A simple perceptron is a single-layered feedforward neural network, consisting of n inputs and an output layer. Figure 2.7 illustrates an example of a simple perceptron.

Figure 2.7. A simple perceptron, with inputs connected through weights $w_{ij}$ to outputs $y_1, y_2, \ldots, y_m$.

$x_i^\mu$ is the ith element of the input pattern and $y_i^\mu$ is the output of neuron i when pattern $\mu$ is presented to the network. $w_{ij}$ is the connection weight between neuron i and the jth element of the input pattern. If the number of patterns is p, such that $\mu = 1, 2, \ldots, p$, the output in the output layer can be described by
$$y_i^\mu = f\left(\sum_{j=0}^{n} w_{ij} x_j^\mu\right)$$
where $x_0^\mu = 1$ for all $\mu$, $w_{i0} = \theta_i$ is a bias, and $f(\cdot)$ is the continuous sigmoid function. When $t_i^\mu$ is the desired output of neuron i for input pattern $\mu$, the cost function, which measures the system's performance, is defined by
$$E = \sum_\mu E^\mu = \frac{1}{2} \sum_\mu \sum_i (t_i^\mu - y_i^\mu)^2 = \frac{1}{2} \sum_\mu \sum_i \left(t_i^\mu - f\Big(\sum_j w_{ij} x_j^\mu\Big)\right)^2. \qquad (2.7)$$
The connection weights $w_{ij}$ are changed by the gradient descent algorithm,
$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} = \eta \sum_{\mu=1}^{p} \left(t_i^\mu - y_i^\mu\right) f'(net_i^\mu)\, x_j^\mu. \qquad (2.8)$$
The condition for the existence of a solution in the simple perceptron is the linear independence of the input patterns [43]. The simple perceptron cannot solve problems in which input patterns are not linearly independent, and may offer alternate partial solutions [43]. However, multilayer feedforward neural networks with nonlinear neuron elements can overcome this limitation.

2.1.3.2 Multilayer Perceptrons

A multilayer perceptron (or feedforward neural network) consists of an input layer, an output layer, and one or more hidden layers in between. Figure 2.8 shows the generic structure of a multilayer neural network. $y_i$ is the output of neuron i and $w_{ij}$ are connection strengths between neuron pairs. Outputs of any layer are weighted and summed as an input to a neuron in the next layer. An external input is applied to the input layer.

Figure 2.8. Multilayer feed-forward neural network: an input layer X, hidden layers, and an output layer producing $y_1, \ldots, y_n$.

Given pattern $\mu$, where $\mu = 1, 2, \ldots, p$, the net input $net_i^\mu$ to neuron i in any layer is
$$net_i^\mu = \sum_j w_{ij} y_j^\mu \qquad (2.9)$$
where $y_j^\mu$ is the output of neuron j in the previous layer when pattern $\mu$ is presented. $y_0^\mu = 1$ is often used. Thus, neuron i produces output
$$y_i^\mu = f(net_i^\mu) = f\Big(\sum_j w_{ij} y_j^\mu\Big) \qquad (2.10)$$
where $f(\cdot)$ is a differentiable sigmoid function.

For a given input pattern, the output of the output layer is compared to the target pattern and the connection weights between layers are modified in a backward direction according to the error. This is known as back-propagation learning. Given pattern $\mu$, the error measure is
$$E^\mu = \frac{1}{2} \sum_i (t_i^\mu - y_i^\mu)^2 \qquad (2.11)$$
where $t_i^\mu$ and $y_i^\mu$ are the desired output and actual output for the ith output neuron, respectively, when pattern $\mu$ is presented. The back-propagation rule states that
$$w_{ij}(k) = w_{ij}(k-1) + \sum_\mu \Delta_\mu w_{ij}(k). \qquad (2.12)$$
For the output-to-hidden layer connections, the gradient descent rule gives
$$\Delta_\mu w_{ij}(k) = -\eta \frac{\partial E^\mu}{\partial w_{ij}} = -\eta \frac{\partial E^\mu}{\partial y_i^\mu} \frac{\partial y_i^\mu}{\partial net_i^\mu} \frac{\partial net_i^\mu}{\partial w_{ij}} = \eta\, \delta_i^\mu y_j^\mu \qquad (2.13)$$
where $\delta_i^\mu = (t_i^\mu - y_i^\mu) f'(net_i^\mu)$.

For the hidden-to-hidden (or input) layer connections, $\Delta_\mu w_{ij}$ for the connection between neuron i in the hidden layer and neuron j in the lower layer can be obtained by using the chain rule:
$$\Delta_\mu w_{ij}(k) = -\eta \frac{\partial E^\mu}{\partial w_{ij}} = -\eta \frac{\partial E^\mu}{\partial y_i^\mu} \frac{\partial y_i^\mu}{\partial net_i^\mu} \frac{\partial net_i^\mu}{\partial w_{ij}} = \eta\, \delta_i^\mu y_j^\mu \qquad (2.14)$$
where $\delta_i^\mu = f_i'(net_i^\mu) \sum_k \delta_k^\mu w_{ki}$ and k denotes neurons in the upper layer.
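To make the update rules of equations 2.9 through 2.14 concrete, the following C fragment performs one forward pass and one per-pattern weight change for a small two-layer network. It is an illustrative sketch only: the layer sizes, bias handling, and names are assumptions made for the example, and it is not the back-propagation training program referred to elsewhere in this dissertation (Appendix B).

    /* Minimal single-pattern back-propagation step for a 2-5-1 network. */
    #include <math.h>

    #define NI 2   /* input units  */
    #define NH 5   /* hidden units */
    #define NO 1   /* output units */

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    /* One forward and backward pass for a single pattern (eqs. 2.9-2.14).
     * w1[i][j]: input-to-hidden weights (index NI holds the bias, y0 = 1).
     * w2[i][j]: hidden-to-output weights (index NH holds the bias).       */
    static void bp_step(double w1[NH][NI + 1], double w2[NO][NH + 1],
                        const double x[NI], const double t[NO], double eta)
    {
        double h[NH], y[NO], dout[NO], dhid[NH];
        int i, j;

        /* Forward pass: net_i = sum_j w_ij * y_j, y_i = f(net_i). */
        for (i = 0; i < NH; i++) {
            double net = w1[i][NI];                    /* bias term */
            for (j = 0; j < NI; j++) net += w1[i][j] * x[j];
            h[i] = sigmoid(net);
        }
        for (i = 0; i < NO; i++) {
            double net = w2[i][NH];
            for (j = 0; j < NH; j++) net += w2[i][j] * h[j];
            y[i] = sigmoid(net);
        }

        /* Output deltas: delta_i = (t_i - y_i) f'(net_i), with
         * f'(net_i) = y_i (1 - y_i) for the sigmoid.            */
        for (i = 0; i < NO; i++) dout[i] = (t[i] - y[i]) * y[i] * (1.0 - y[i]);

        /* Hidden deltas: delta_i = f'(net_i) * sum_k delta_k w_ki. */
        for (i = 0; i < NH; i++) {
            double s = 0.0;
            for (j = 0; j < NO; j++) s += dout[j] * w2[j][i];
            dhid[i] = h[i] * (1.0 - h[i]) * s;
        }

        /* Weight changes: delta_w_ij = eta * delta_i * y_j. */
        for (i = 0; i < NO; i++) {
            for (j = 0; j < NH; j++) w2[i][j] += eta * dout[i] * h[j];
            w2[i][NH] += eta * dout[i];                /* bias */
        }
        for (i = 0; i < NH; i++) {
            for (j = 0; j < NI; j++) w1[i][j] += eta * dhid[i] * x[j];
            w1[i][NI] += eta * dhid[i];                /* bias */
        }
    }

    /* Example driver: repeated updates on one (input, target) pair. */
    int main(void)
    {
        double w1[NH][NI + 1] = {{0.1}}, w2[NO][NH + 1] = {{0.1}};
        const double x[NI] = {1.0, 0.0}, t[NO] = {1.0};
        for (int n = 0; n < 500; n++)
            bp_step(w1, w2, x, t, 0.5);
        return 0;
    }

Accumulating the per-pattern increments over all patterns before applying them gives the batch form stated next.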
The overall measure of the error is therefore
$$E = \sum_\mu E^\mu. \qquad (2.15)$$
Thus, the back-propagation rule for any layer has the form
$$\Delta w_{ij} = -\eta \sum_{\mu=1}^{p} \frac{\partial E^\mu}{\partial w_{ij}} = \eta \sum_{\mu=1}^{p} \delta_i^\mu y_j^\mu. \qquad (2.16)$$
Some variations of the ordinary back-propagation algorithm have been suggested in order to help the networks learn faster or escape local minima [45-47]. Multilayer feedforward networks trained by these back-propagation algorithms have been used to solve pattern classification problems [45, 48-50].

2.1.4 Recurrent Model

Recurrent networks allow connections in both directions between a pair of layers, and within a layer to itself. The Boltzmann machine is a well-known recurrent network with symmetric connections [51, 52].

2.1.4.1 Boltzmann Machine

The Boltzmann machine consists of visible and hidden units, where the visible units can be divided into input and output units. Figure 2.9 illustrates the structure of the Boltzmann machine. The units are stochastic and take output value $v_i = +1$ with probability $f(h_i)$ and value $v_i = -1$ with probability $1 - f(h_i)$, where
$$h_i = \sum_j w_{ij} v_j \quad \text{and} \quad f(h) = \frac{1}{1 + e^{-2\beta h}}.$$

Figure 2.9. A Boltzmann machine consisting of visible units (divided into input and output units) and hidden units.

Here $\beta = \frac{1}{T}$, where T is a pseudo-temperature. If $w_{ij} = w_{ji}$ for all i and j, the energy function
$$H\{v_i\} = -\frac{1}{2} \sum_{ij} w_{ij} v_i v_j \qquad (2.17)$$
has a minimum at a stable state characterized by $v_i = \mathrm{sgn}(h_i)$, where $\mathrm{sgn}(h_i) = +1$ if $h_i \geq 0$ and $\mathrm{sgn}(h_i) = -1$ otherwise. The probability of finding the system in a particular state $\{v_i\}$, after equilibrium is reached, is given by the Boltzmann-Gibbs distribution
$$P\{v_i\} = \frac{e^{-\beta H\{v_i\}}}{Z}$$
where Z is a normalizing constant. Boltzmann learning adjusts the connections $w_{ij}$ such that the states of the visible units, $\alpha$, have a desired probability distribution. Let $\beta$ denote the states of the hidden units. The probability $P_\alpha$ of finding the visible units in state $\alpha$ irrespective of $\beta$ is
$$P_\alpha = \sum_\beta P_{\alpha\beta} = \frac{1}{Z} \sum_\beta e^{-\beta H_{\alpha\beta}} \qquad (2.18)$$
where
$$H_{\alpha\beta} = -\frac{1}{2} \sum_i \sum_j w_{ij} v_i^{\alpha\beta} v_j^{\alpha\beta}.$$
The relative entropy between the actual probabilities $P_\alpha$ and the desired probabilities $R_\alpha$ is
$$E = \sum_\alpha R_\alpha \log \frac{R_\alpha}{P_\alpha}. \qquad (2.19)$$
$E \geq 0$, and $E = 0$ if $P_\alpha = R_\alpha$ for all $\alpha$. The gradient descent rule gives
$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} = \eta\beta \left[\sum_\alpha \sum_\beta \frac{R_\alpha}{P_\alpha} P_{\alpha\beta}\, v_i^{\alpha\beta} v_j^{\alpha\beta} - \langle s_i s_j \rangle \right] \qquad (2.20)$$
where the correlations $\langle s_i s_j \rangle$ are measured by taking a time average of $s_i s_j$ and the system must reach an equilibrium state for each $\alpha$. A simulated annealing procedure is used to rapidly achieve a global minimum. Disadvantages of the Boltzmann machine are that learning requires an extremely long convergence time even with simulated annealing, and its hardware implementation is impractical. Boltzmann machines have been applied to various problems: statistical pattern recognition [7], constraint satisfaction problems [51], and combinatorial optimization [53].

2.2 Artificial Neural Network Implementations

Many current ANN models rely on software simulations run on serial or parallel digital computers. The speed of software simulation, even on a parallel machine, is far from equaling that of specialized hardware ANNs, mainly because of programming and communication overhead. To date, a number of ANN hardware prototypes have been built using electronic, optical, and opto-electronic technologies. Electronic ANN hardware implementations, software simulators run on digital computers, and related issues are discussed. ANN implementations can be divided into three categories based on the method used to express the values within the network: analog, digital, and hybrid.
2.2.1 Analog and Hybrid Implementations

Analog computation performed in analog or hybrid electronic hardware uses some fundamental physical principles, such as the linear attenuation of voltage by an electrical resistor and the nonlinear transfer characteristics of an amplifier. In a simple analog neural network, the interconnections are simple fixed-value resistors (see Figure 2.5). The output voltage of neuron i is given by
$$v_i = f\left(\sum_{j=0} w_{ij} v_j\right)$$
where $w_{ij}$ is the conductance of the resistor between neuron i and neuron j, and $f(\cdot)$ is the transfer function of the amplifier. Neural networks with fixed-value resistors can be used when the network function is known in advance and weight changes are not needed. This type of network with 256 neurons was designed on a single chip using standard CMOS technology by Jackel, et al. [15]. This circuit was not programmable due to the fixed synaptic weights.

ANNs can be programmed by storing synaptic weights in memory. A static memory cell has been used as storage for a weight bit where the neurons and synapses were binary units. Multiplication was performed by a logical XOR gate [16]. For many applications, a higher resolution for weight values is required. One way of storing analog weights is to use a capacitor [55, 56]. A weight can be stored as the voltage difference between two capacitors; the voltage difference is multiplied by the input voltage in the circuits. The main disadvantage of this dynamic storage technique is that it requires refresh circuitry to overcome the charge leakage on the capacitance.

An alternate way is to store weights digitally. In this case, a digital-to-analog (D/A) converter is required at each connection to perform an analog multiplication of the stored weight with the input signal. A matrix with 1024 multiplying D/A converters was built using CMOS technology, where a weight was represented in four-bit magnitude plus a sign bit [57]. A floating-gate field effect transistor (FET) was used as a device to combine the weight storage and the multiplication, where the weight was determined by the charge stored in the floating gate. However, the weight range and polarity difficulties were significant limitations [58]. To overcome these difficulties, a Gilbert multiplier [59] was used to carry out the weight multiplication while a floating-gate FET was used simply for weight storage [60]. Sage and his associates designed an ANN chip based on Metal Nitride Oxide Semiconductor (MNOS) floating-gate transistor technology and Charge Coupled Device (CCD) technology [14]. Analog weights were stored in MNOS floating-gate transistors. Charge packages instead of currents were added to compute the sum of products. This circuit implemented a simple Hopfield-type neural network by operating with binary inputs and analog weights.

In analog computation, available mathematical functions are limited because those functions are found in some physical principles of devices. When a complex transfer function is required, it is difficult to implement correctly using analog hardware alone. In this case, hybrid ANN hardware is more appropriate, where the sum of products is carried out with analog components, digitized for the transfer function processing, and then converted back to analog [24]. The potential advantage of analog computation is that operations in the network can be performed using inexpensive hardware.
However, analog computation results in low accuracy and limited dynamic range due to physical constraints, such as ther- mal and quantum noise of analog components. In addition, design flexibility in analog implementation is strictly constrained because only mathematical functions resulting from physical principles are available for use. 29 2.2.2 Digital Implementations In this section, the two mainstream approaches to digital implementation of AN N s are discussed: software simulations on general-purpose or special-purpose computers and dedicated VLSI implementation. 2.2.2.1 Software Simulators ANN simulations on digital computers can be divided into two categories: ANN simulations on general-purpose parallel computers and ANN simulations on special- purpose processors. Many general-purpose parallel machines, consisting of a large number of process- ing elements, are currently used for ANN simulation. Processing elements, cooperated on the same task, communicate through a single high speed data path between pro- cessing elements. A neural network and data are partitioned into different processing elements. Each processing element may have a dedicated memory to store data as- signed. For example, the Warp machine, which was a systolic array of 10 processing elements, was used to implement a back—propagation network [61]. Each processing element contained an adder, a multiplier, and an ALU. The 39 Mbyte cluster memory was used to store weights and 17 million weight updates per second was achieved. Forrest, et al. used a Distributed Array Processor (DAP) consisting of 4096 proces- sors to implement a Hopfield network [62]. The DAP was able to perform 25 million additions per second. The use of general-purpose parallel machines for ANN sim- ulations can be justified for the problems to be completed in a feasible amount of simulation time. However, large-size ANNs often require faster simulations. Special purpose processors, which are designed for ANN simulations and attached 30 as coprocessors to a host computer, are often called neurocomputers. A user program run on the host computer calls a special subroutine, and controls the neurocomputer whenever needed. Three methods for attaching a neurocomputer to a host computer have been defined [58]. The first method is to install the neurocomputer as a memory- mapped device on the host computer. In this method, the neurocomputer shares the memory space of the host computer. Data transfers between the host computer and the neurocomputer are controlled by the central processing unit in the host with ad- dresses in the memory space. The second method is to attach the neurocomputer as a peripheral device using a standard peripheral interface . The neurocomputer can be ported from one type of host computer to another relatively easily. The first and second methods have high bus loading problems on the host computer. In addition, the second method suffers from the reduced bandwidth of the peripheral interface. Thus, these two approaches are appropriate for small computers. The third approach is to attach the neurocomputer as a coprocessor to a host computer via a local area network (LAN). This method has the advantage that the neurocomputer can access memory servers and other outboard devices on a high—bandwidth LAN. Several manufacturers, such as TRW, Science Applications International Corpo- ration, and Hecht-Nielsen N eurocomputers, developed neurocomputers. For example, Mark III and Mark IV neurocomputers were developed by TRW [63]. 
The Mark III (Mark IV) machine consisted of many Motorola 68010 (68020) based single-board computers mounted on a broadcast bus backplane. These systems used the Artificial Neural System Environment (ANSE), developed at TRW, for specifying the neural network to be implemented. A neural network was called on the Mark III (IV) from user software on the DEC MicroVAX through a user interface. The Mark IV had an ultra high-speed graphics display facility for monitoring the activity of the neural network. The Mark III and Mark IV systems were able to process up to 450,000 and 5,000,000 interconnections per second, respectively.

2.2.2.2 Dedicated Hardware Implementations

In order to fully utilize the parallelism embedded in ANN computations, the design of dedicated VLSI ANN digital systems is desired. A three-layer feedforward ANN was designed to classify handwritten numbers [20]. The network consisted of 50 neurons and 6688 fixed interconnections using a 2-micron CMOS process. The resulting VLSI layout was 7.9 x 9.2 mm in size. This design is quite compact, but its flexibility was low due to the fixed synaptic weights.

Suzuki and Atlas mapped an ANN to an array of custom processors [64, 65]. Figure 2.10 shows the structure of the proposed processing element, where the blocks represent special operations for the network update. A weight matrix W and a threshold vector $\theta$ are stored in the product-sum unit (PSU). The arithmetic unit (AU) performs operations required for back-propagation.

Figure 2.10. The structure of a processing element in [64], built around a product-sum unit (PSU) and an arithmetic unit (AU).

The derivative of a nonlinear function (DF), desired outputs (DO), and a learning rate ($\eta$) are stored in the memory of the AU. Neural activations (X) and the error value ($\delta$) are accessed by both the PSU and the AU. This ANN hardware has a high design flexibility, but the hardware requirements for this design are large.

As indicated in the above two examples, dedicated digital ANN implementations can facilitate high parallelism, but it is difficult to simultaneously achieve the desired high design flexibility and high density on a chip. A new digital approach - digital ANNs using stochastic computing techniques - replaces algebraic operations in ANNs by stochastic processes using pseudo-random pulse sequences [28, 31, 32]. Simple logic gates combined with other digital components perform multiplications and nonlinear transformation of signals.

In this new approach, the values for synaptic weights and input operands are normalized after a network has been trained [28, 32], or all operands are restricted to the range between 0.0 and 1.0 both for training and testing [66]. An operand x in the pulse-mode representation is the probability of pulse occurrence in the corresponding binary sequence $x_{(p)}$ at each clock. $\hat{x}$ is the estimate of x taken over a finite number of clock periods N. Stochastic computations using random pulse sequences inherently utilize concurrent processing in all synaptic and neuron elements. Furthermore, the use of simple logic gates as computing elements allows a high neuron density on a chip and a relatively compact network architecture. High design flexibility can also be achieved by making the network programmable. However, network speed depends on the length of a sampling clock period. The sampling clock period is the time required to estimate the computation results. A longer sampling clock period yields more accurate computations.
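The accuracy half of that dependence can be given a first-order characterization with a standard binomial argument. If the pulses of $x_{(p)}$ are idealized as independent, each equal to '1' with probability x (an assumption made here only for orientation; the exact distribution obtained with LFSR-generated sequences is derived in Chapter 3), then over a sampling period of N clocks
$$\hat{x} = \frac{1}{N}\sum_{p=1}^{N} x_{(p)}, \qquad E[\hat{x}] = x, \qquad \mathrm{Var}(\hat{x}) = \frac{x(1-x)}{N},$$
so the standard deviation of the estimate shrinks only as $1/\sqrt{N}$: quadrupling the sampling period roughly halves the noise.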
Thus, there exists a trade-off between speed and accuracy in this approach. Details on the network architecture, analysis, and performance will be discussed in the following chapters.

2.3 Pattern Recognition and Neural Networks

Pattern recognition is concerned with the classification or description of complex patterns by means of some measured properties. A pattern recognition system requires data acquisition, data representation, and data classification. The design of a pattern recognition system involves the following three steps: (1) data acquisition, (2) preprocessing, and (3) decision making [68]. A typical character recognition system is illustrated in Figure 2.11.

Figure 2.11. A typical character recognition system: a digitized character matrix passes through a preprocessor (size normalization, noise cleaning), a feature extractor, and a decision maker (matching) to yield the identity of the character.

The first stage involves image processing; the last two stages deal with the pattern recognition. Masks (joint occurrences of black and white pixels), strokes and bays in various directions, the location of end points, and the intersection of line segments and loops are all popular features for character recognition. Most pattern recognition systems utilize one of the following three approaches: statistical, structural, or neural network.

2.3.1 Statistical Approach

In the statistical approach, a pattern is represented in terms of N features. Each pattern can be viewed as a point in the N-dimensional space. If the choice of features is good, then pattern vectors belonging to different classes will occupy different regions of this feature space. The objective in this approach is to establish decision boundaries in the feature space to separate patterns belonging to different classes.

Assume that a given sample pattern belongs to one of M classes $c_1, c_2, \ldots, c_M$ based on its feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_N)$ and that $\mathbf{x}$ has a class-conditional density $p(\mathbf{x}|c_i)$. Bayes decision rule states that a pattern with $\mathbf{x}$ as its feature vector is assigned to class $c_i$ if
$$p(c_i|\mathbf{x}) \geq p(c_j|\mathbf{x}) \quad \text{for all } j \neq i,$$
where $p(c_i|\mathbf{x})$ is the a posteriori density for class $c_i$, defined as
$$p(c_i|\mathbf{x}) = \frac{p(\mathbf{x}|c_i)\, p(c_i)}{\sum_{j=1}^{M} p(\mathbf{x}|c_j)\, p(c_j)}$$
where $p(c_i)$ is the a priori probability density for class $c_i$. If $p(c_i) = 1/M$, then the Bayes decision rule is identical to the maximum-likelihood decision rule. The decision boundary between pattern classes $c_i$ and $c_j$ is defined by
$$p(c_i|\mathbf{x}) - p(c_j|\mathbf{x}) = 0.$$
If the class-conditional densities are multivariate Gaussian, then $p(\mathbf{x}|c_i) = N(\mu_i, I)$, where $\mu_i$ is the mean vector for class $c_i$ and I denotes the identity covariance matrix. If $p(c_i)$ for all i are equal, then
$$p(c_i|\mathbf{x}) \propto \exp\left(-\frac{\|\mathbf{x} - \mu_i\|^2}{2}\right)$$
where $\|\cdot\|$ denotes the Euclidean norm. As a result, a pattern $\mathbf{x}$ is assigned to the class of the closest mean vector. If the class-conditional densities are known, Bayes decision rule can be used to design a classifier. If they are not known, they must be estimated by training with sample patterns.

2.3.2 Structural Approach

When the number of features required to establish a reasonable decision boundary is very large, it is more appropriate to view a pattern as being composed of simple subpatterns. In the structural approach, a complex pattern is represented in terms of the interrelationships among the simplest subpatterns, called primitives. This paradigm has been used in situations where the patterns have a definite structure which can be captured in terms of a set of rules.
The primitives or grammatical rules must be inferred from the available samples. In this approach, the difficulty resides in the segmentation and reliable extraction of the primitives from a finite number of pattern samples.

2.3.3 Neural Network Approach

The neural network approach is based on the notion that a network of simple processing elements, arranged in a manner similar to a biological neural system, might be able to organize itself to recognize and classify patterns. The Perceptron, introduced in the early 1960s, is considered the first significant development of this kind [2]. The basis for the inherent power of Perceptron devices was well understood. However, at that time, no method was known for training multilayer Perceptron devices, and the cost of fully implementing such devices was extremely high. Since then, VLSI technology has advanced and the price of processors has dropped tremendously. More significantly, the generalized delta rule developed in 1986 by Rumelhart et al. provides a practical way to train multilayer Perceptrons [8]. Today, perceptron-like models trained by the generalized delta rule are being applied to pattern recognition.

In pattern recognition systems using the neural network approach, all or some of the stages in Figure 2.11 can be combined into one neural network. The network learns the mapping from the observation space to the interpretation space through a training algorithm. In this approach, the human interaction involved in statistical or structural pattern recognition systems is minimized. Most recognition processes are performed in an autonomous manner.

2.4 Behavioral Modeling with VHDL

In the design of large systems like ANNs, the use of Design Automation (DA) becomes necessary. The simulation and verification of a design using a behavioral description language at an early stage of the design process also becomes more important as the complexity of systems continues to grow. VHDL is a typical behavioral description language that is semantically oriented toward digital systems. Digital ANNs can be modeled and simulated using VHDL.

2.4.1 Behavioral Modeling

A promising approach for implementing artificial neural networks is the fabrication of special-purpose VLSI chips. Traditionally, designers start with a gate-level or circuit-level schematic. However, as systems become more complex, a top-down design approach is needed in order to manage complexity and to reduce design time and development costs. Testing and modification of an original design can then be done at an early stage of the design process. Top-down design starts with a high-level specification which is decomposed into lower-level specifications in a hierarchical fashion. Designers view the system at an abstract level in a high-level specification. Hardware Description Languages (HDLs) are crucial to high-level design [69-72].

VHDL is a typical HDL that can be used to express the function and logical organization of circuits, ranging from simple logic gates to complex digital systems [73-77]. VHDL is fast becoming an industry standard. The U.S. government made it a standard language, requiring the use of VHDL as the design and description mechanism in Department of Defense (DoD) hardware designs. Compilers that translate a structural design in VHDL into an intermediate format such as the Caltech Intermediate Format (CIF) are being produced by many CAD vendors. In VHDL one can model the behavior of systems and simulate them to verify the design.
Modeling involves specifying the inputs and outputs of a device and describing its behavior and/or structure. For example, when an ANN is modeled in VHDL, its behavior may be described by a set of static or dynamic equations using function statements. Structure is described by interconnections of the subcomponents (synapses and neurons). Efficient and precise modeling of VLSI ANNs is facilitated by an analysis of VHDL semantics, including a detailed investigation of process statements, functions, and delay characteristics.

2.4.2 VHDL Characteristics

The primary element in VHDL is a design entity, which can represent portions of a hardware design ranging from simple logic gates to complex digital systems. A design entity consists of two different types of descriptions: the entity declaration and one or more architectural bodies. The entity declaration defines the interface between the entity and the outside world. Figure 2.12 illustrates an example entity declaration.

entity COUNTER is
  generic (time_delay : time := 10 ns);
  port (clk, reset : in bit;
        sum : buffer integer);
end COUNTER;

Figure 2.12. Entity declaration in VHDL.

The ports are the signals through which the design entity communicates with other modules. Their declaration can be of any predefined or user-defined type. The ports and local items defined in the entity declaration are made available to the architectural bodies associated with the entity. A set of parameters, called generics, provides a channel for static information to be communicated to a design entity from its environment. Generics can be used to specify timing characteristics, the bit size of ports, or other descriptive characteristics of a design such as temperature, capacitance, location, etc.

An architecture body supports three implementation styles for a design entity: behavioral, structural, and data-flow. The behavioral body describes the system model in sequential program statements, much like a program written in a high-level programming language. The structural body describes a design entity purely in terms of its subcomponents and their interconnections. Finally, the data-flow body decomposes the architecture into a set of concurrent register assignments under the control of gating signals. The data-flow style emphasizes the flow of information between memory and gating elements. All three styles may be intermixed in an architectural body.

A VHDL design entity is a template to be used in creating specific instances of a component via the component instantiation statement. A component may represent a structural partitioning of the design or a functional decomposition of a large system. Because this feature essentially isolates one level of design from another, two different design methodologies can be accommodated: the top-down approach and the bottom-up approach. In the former, the architectural body can be written in terms of abstract lower-level components. Such components must be fully described with a variety of design entities later in the code. In the latter, the local component declaration specifies the portion of the interface from an existing design entity that resides in the design library. Designers may specify the behavior of a subsystem and leave the implementation details of the structural design to others. Thus, VHDL designers can model the function of a system independently of any implementation technology. A VHDL description is evaluated when an event occurs at one of the component's inputs.
The evaluation yields a new set of projected values for the outputs of the component. This effect may, in turn, cause additional changes. Independent sequences of events can occur simultaneously. The event-driven semantics of VHDL are based on the assumptions that all signals in a design propagate in well-defined directions and that signal propagation always includes a delay.

A typical signal assignment statement consists of a driver and a target. A driver is a source of the value for a signal. A signal may have multiple sources. If a signal has more than one source, then all sources can participate in the calculation of its value. Such a signal must be a resolved signal, and its resolution function calculates one effective value from an array of values. The target of a signal assignment is the signal on the left-hand side of the assignment operator. The simulator creates a driver for each element of the target of every concurrent signal assignment [74].

Timing is one of the most important aspects of a VHDL model. The representation of time in VHDL has both a macrotime scale and a microtime scale. The macrotime scale represents real time (nanoseconds, microseconds, etc.), which is measured in discrete units. The microtime scale represents a unit delay which is essentially not measurable. Any number of micro-units of time may exist between any two macro-units of time. With two time scales, designers can perform unit-delay or real-time simulations [73].

There are two kinds of statements in VHDL: sequential and concurrent. Sequential statements are used to define algorithms for the execution of a subprogram or process. They are executed one at a time. Concurrent statements are executed in an asynchronous, pseudo-parallel fashion. They are used to define interconnected blocks and processes that jointly describe the overall behavior or structure of a design.

Figure 2.13 shows the flow of data in the design process under a VHDL hardware support environment, which includes an analyzer, a profiler, and a simulator. The design library contains intermediate representations of VHDL descriptions. The library unit resulting from the analysis of a design unit is placed into a working library. Only one library may be the working library during the analysis of any given design unit [74]. The analyzer accepts VHDL source code, translates it into the intermediate form, and stores it in the design library. It checks the syntax and semantic rules of the language. The profiler pulls all necessary design entity interfaces, bodies, functions, and packages from the library, then configures a cross-section of a design hierarchy. The simulator and other tools may use this configuration. The simulator records signal histories and dynamic errors.

An understanding of VHDL semantics and characteristics enables designers to use VHDL as an economical hardware design testbench. A system can first be modeled behaviorally with a high-level specification using appropriate modeling techniques, verifying the correctness of the design. Later, the high-level specification can be decomposed into lower-level specifications, incorporating more implementation and technological constraints. Finally, when the system is modeled with complete structural descriptions, the precise feasibility and detail of a hardware realization can be assessed.

Figure 2.13. The flow of design data in the VHDL design process (VHDL source is analyzed into an intermediate form held in the design library; the profiler configures the design hierarchy used by the simulator and other tools, and the simulator produces the results).
CHAPTER 3

Stochastic Computing in Neural Networks

An approach to performing arithmetic operations using random pulse sequences is discussed. In this approach, a number is normalized into a fraction between 0 and 1. The fractional number is encoded in a random pulse stream, where it is represented by the probability of a pulse occurrence in each clock period. Algebraic operations are replaced by stochastic processes, and computational results, expressed as probabilities, are estimated over finite clock periods. Inaccuracies are inherently associated with stochastic computing and can be described in terms of a mean and a variance. In this chapter, the method for generating random pulse streams is discussed, and a new statistical model for the estimate of the probability generated by a random pulse generator is developed. Stochastic computing techniques that can be utilized in digital artificial neural networks are presented.

3.1 Introduction

Von Neumann first observed that normalized numbers or voltages could be represented by probabilities and that some properties of the nervous system could be explained through statistics [80]. He intended to show that simple algebraic operations such as addition and multiplication could be performed by simple logic gates. Later, stochastic computing techniques using random pulse streams were proposed in the 1960s [81, 82].

In stochastic computation, the operands are normalized and represented by probabilities, which are encoded in random pulse streams. A probability is estimated as the relative frequency of '1' pulse occurrences in a finite but long pulse stream. Since the probability cannot be measured exactly, estimation errors are introduced in the form of variance when stochastic computing techniques are used. At the time stochastic computing was originally proposed in the 1960s, integration technology was not mature and the hardware cost of arithmetic devices was high. A main objective in using stochastic computing techniques was to implement algebraic computations with inexpensive, massively parallel processors at the cost of speed and accuracy. Since then, the hardware cost of digital computing elements has continued to drop as VLSI technology has advanced tremendously. Consequently, the idea of stochastic computing was discarded.

However, the idea has resurged as an alternative to deterministic computation in the area of artificial neural networks since the late 1980s. The main reason is that stochastic computing using random pulse sequences shares one very important characteristic with ANN dynamics: network performance depends not on the accuracy of the calculations performed in an individual processing element, but on the collective properties of the network (or system), where each processing element does not necessarily perform exact computations. Recently, some neural network architectures have been proposed based on this idea and applied to engineering problems such as associative memory [28] and binary classification [32].

3.2 Generating Probability

3.2.1 Pseudo-Random Pulse Sequences

A pseudo-random pulse (or binary) sequence can be generated by a tapped Linear Feedback Shift Register (LFSR) [67]. Figure 3.1 shows the diagram of an n-bit LFSR.
The feedback function $f(x_1, x_2, \ldots, x_n)$ is expressed in the form

$$f(x_1, x_2, \ldots, x_n) = c_1 x_1 \oplus c_2 x_2 \oplus \cdots \oplus c_n x_n,$$

where each constant $c_i$ is either 1 or 0, the symbol $\oplus$ denotes modulo-2 addition, and $x_1$ and $x_n$ indicate the values of the most significant and least significant bits, respectively. For a given register length $n$, the maximal length period of a sequence is $p_{max} = 2^n - 1$. Define $\{a_k\}$ to be a PN sequence if and only if it is a binary sequence satisfying the linear recurrence

$$a_k = \sum_{i=1}^{n} c_i a_{k-i} \pmod{2}$$

and has $p_{max}$ as its period. There are $2^n$ combinations from which to select the $c_i$. Only a limited number of $c_i$ combinations can form maximal-length PN sequences. In order to form a maximal-length PN sequence, the $c_i$ are determined by a primitive polynomial [67].

Figure 3.1. (a) The block diagram of a Linear Feedback Shift Register (LFSR); examples of LFSRs with a maximal length period where (b) $\{c_1, c_2, \ldots, c_7\} = \{1000001\}$ and (c) $\{c_1, c_2, \ldots, c_7\} = \{0101011\}$.

Table 3.1. Number of distinct PN sequences with the maximal length period.

  n   p_max   a_n      n    p_max    a_n
  1       1     1      8      255     16
  2       3     1      9      511     48
  3       7     2     10     1023     60
  4      15     2     11     2047    176
  5      31     6     12     4095    144
  6      63     6     13     8191    630
  7     127    18     14    16383    756

Table 3.1 shows the number of distinct PN sequences, $a_n$, with period $p = p_{max}$ with respect to the LFSR register length. For a given register length $n$, a sequence $\{a_n\}$ has the following random properties, assuming that each 0 and 1 is replaced by 1 and -1, respectively [67]:

1. The number of 1's is nearly equal to the number of -1's in a maximal period $p_{max}$. More precisely, $\left|\sum_{n=1}^{p_{max}} a_n\right| = 1$.

2. Every possible array of $n$ consecutive terms occurs exactly once, except the all-zero array. This indicates that all n-bit integer numbers from 1 to $2^n - 1$ are generated exactly once in $p_{max}$.

3. The autocorrelation of $a_n$ is

$$C(\tau) = \frac{1}{p_{max}} \sum_{n=1}^{p_{max}} a_n a_{n+\tau} = \begin{cases} 1 & \text{if } \tau = 0 \\ -1/p_{max} & \text{if } 0 < \tau < p_{max}. \end{cases}$$

These random properties of $a_n$ are utilized for encoding fractional numbers into corresponding random pulse streams.

3.2.2 Generating Probability

In order to utilize stochastic computing techniques, the values of all operands must lie between 0 and 1. The fractional numbers, represented by probabilities, are encoded in random pulse streams. If a fractional number is stored in an n-bit register, the resolution is $\frac{1}{2^n - 1}$. For example, when $n = 8$, 0.0 is stored as '00000000', 1/255 as '00000001', 2/255 as '00000010', etc. The random pulse stream corresponding to a fractional number can be generated by comparing the number with a pseudo-random number. The pseudo-random numbers can be generated from a PN sequence by taking all bits of the LFSR in parallel, as indicated by property 2 of the PN sequence. Fractional numbers from $\frac{1}{2^n - 1}$ to 1, equally spaced by $\frac{1}{2^n - 1}$, are generated exactly once in a period of $2^n - 1$. The distribution of the pseudo-random numbers is close to an ideal uniform distribution.

Figure 3.2 shows the diagram of a random pulse generator (RPG) for a fractional number $x$. The generating probability $x$ is defined as the probability of a pulse occurrence in the corresponding random pulse sequence $x_{(n)}$ at each clock. $x$ is estimated over a sampling clock period, where the sampling clock period is defined as the finite number of clock periods taken for the estimation of $x$. Error (or noise) is involved in estimating the generating probability over finite clock periods. Thus, the estimate $\hat{x}$ can be modeled as the original signal plus random noise.

Figure 3.2. A random pulse generator for fractional number x (an LFSR drives one input of a digital comparator, whose output is the pulse sequence $x_{(n)}$).
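To make the RPG of Figure 3.2 concrete, the following C sketch simulates an 8-bit maximal-length LFSR and the digital comparator. It is only an illustrative software model, not the VHDL design developed later; the tap positions follow the feedback function $f(x) = x_2 \oplus x_3 \oplus x_4 \oplus x_8$ quoted for Figure 4.1 (assumed, as the text states, to correspond to a primitive polynomial), while the seed, the operand value, and the function names are arbitrary choices for the example.

#include <stdio.h>
#include <stdint.h>

/* One step of the 8-bit maximal-length LFSR of Figure 4.1:
 * feedback f(x) = x2 xor x3 xor x4 xor x8, shifted in at x1 (the MSB).
 * Bit 7 of 'state' holds x1, bit 0 holds x8.                            */
static uint8_t lfsr_step(uint8_t state)
{
    uint8_t fb = ((state >> 6) ^ (state >> 5) ^ (state >> 4) ^ state) & 1u;
    return (uint8_t)((state >> 1) | (fb << 7));
}

int main(void)
{
    const double  x  = 0.30;                   /* fractional operand to encode    */
    const uint8_t xq = (uint8_t)(x * 255.0);   /* quantized with resolution 1/255 */
    uint8_t state = 0x01;                      /* any nonzero seed works          */
    const int Ps  = 255;                       /* sampling clock period           */
    int ones = 0;

    for (int clk = 0; clk < Ps; clk++) {
        state = lfsr_step(state);
        /* comparator: emit a '1' pulse when the operand >= the pseudo-random number */
        if (xq >= state)
            ones++;
    }
    /* over a full LFSR period every nonzero 8-bit value occurs exactly once */
    printf("x = %.4f   estimate over %d clocks = %.4f\n", x, Ps, (double)ones / Ps);
    return 0;
}

Because the sampling period here equals the full LFSR period ($P_s = N = 255$), the estimate reproduces the quantized operand exactly, anticipating the zero-variance case noted after equation 3.13 below; shorter sampling periods leave a residual estimation variance.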
3.3 Distribution of the Estimated Generating Probability

The generating probability of the random pulse generator has been modeled as a binomial distribution in the literature [82, 83]. However, the distribution can be more precisely modeled by regarding the estimate as a hypergeometric random variable. This new model of the estimate is important, especially for the case when a short sampling clock period is taken for the estimation.

3.3.1 Factorial Moment Generating Function

The following terms are defined:

$x$: A fractional number or a generating probability.
$x_{(n)}$: The pseudo-random pulse sequence for $x$.
$N$: The period of $x_{(n)}$ such that $N = 2^n - 1$, where $n$ is the order of a maximal-length LFSR.
$P_s$: The sampling clock period for estimation.
$X$: A discrete random variable indicating the number of logic-level '1' pulses occurring in $x_{(n)}$, where $0 \le X \le P_s$, such that $X = 0, 1, 2, \ldots, P_s$.
$E(X)$: The expected value of $X$.
$\mathrm{Var}(X)$: The variance of $X$.
$\hat{x}$: The estimate of $x$, a random variable such that $\hat{x} = X / P_s$.

The factorial moment generating function of the distribution of a random variable $X$ is formally defined as follows [84]:

$$\eta(t) = E(t^X) = \int_{-\infty}^{\infty} t^x f_X(x)\,dx. \quad (3.1)$$

Differentiating $\eta(t)$ $k$ times and substituting 1 for $t$ gives

$$\eta^{(k)}(1) = \left.\frac{d^k \eta(t)}{dt^k}\right|_{t=1} = E[X(X-1)\cdots(X-k+1)], \quad (3.2)$$

where the $E[X(X-1)\cdots(X-k+1)]$ are called the factorial moments. The variance of $X$ can be computed from the first two factorial moments as

$$\mathrm{Var}(X) = E[X(X-1)] + E(X) - [E(X)]^2. \quad (3.3)$$

3.3.2 Binomial Distribution Model

If the occurrence of successive pulses in a sequence $x_{(n)}$ is statistically independent, the sequence is called a Bernoulli sequence. The random variable $X$ of interest is the number of logic-level 1's occurring in a sampling clock period $P_s$; $X$ is then a Bernoulli variable. Consider $X = k$, indicating that $k$ logic-level 1's occur in $P_s$ clock periods. Let the probability of a pulse occurrence at each clock period in $x_{(n)}$ be $x = p$. The probability of one particular sequence with $k$ logic-level 1's in $n$ clock periods is $p^k (1-p)^{n-k}$. The number of sequences with $k$ logic-level 1's in $n$ clock periods is the same as the number of ways of taking $k$ objects at a time from $n$ objects,

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}.$$

The quantity $\binom{n}{k}$ is called the binomial coefficient. Thus, the probability function of $X$ can be expressed as $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, $k = 0, 1, \ldots, n$, and the density function of $X$ is

$$f_X(x) = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k}\,\delta(x - k). \quad (3.4)$$

The distribution of $X$ is binomial. The factorial moment generating function of a binomial distribution is obtained using the binomial theorem as follows:

$$\eta(t) = E(t^X) = \sum_{k=0}^{n} t^k \binom{n}{k} p^k (1-p)^{n-k} = \sum_{k=0}^{n} \binom{n}{k} (pt)^k (1-p)^{n-k} = [pt + (1-p)]^n. \quad (3.5)$$

The mean and variance of $X$ can be computed using the first two factorial moments, $\eta'(1)$ and $\eta''(1)$, as

$$E(X) = np \quad \text{and} \quad \mathrm{Var}(X) = np(1-p). \quad (3.6)$$

Thus, the mean and variance of $\hat{x}$ are, respectively,

$$E(\hat{x}) = p \quad \text{and} \quad \mathrm{Var}(\hat{x}) = \frac{p(1-p)}{n}. \quad (3.7)$$

3.3.3 New Distribution Model

A pseudo-random pulse sequence $x_{(n)}$ has been modeled in the literature as a Bernoulli sequence [82, 83]. However, the pulse occurrences in $x_{(n)}$ are not perfectly independent, because a maximal-length LFSR generates the fractional numbers between $\frac{1}{N}$ and 1 such that each number occurs exactly once in a period. Accordingly, the pulse occurrences in $x_{(n)}$ have statistical dependency.
If $P_s = n$ and $x = p = l/N$, the probability that $k$ '1' pulses in $x_{(n)}$ occur during the sampling clock period $n$ is the same as the probability that $k$ black balls are taken in $n$ withdrawals from a box containing $l$ black balls and $N - l$ white balls, one ball being withdrawn at a time without replacement. Thus, the sampling distribution of $X$ can be more closely modeled by the hypergeometric distribution. When $x = p = l/N$ and $P_s = n$, the probability function of $X$ is

$$P(X = k) = \frac{\binom{l}{k}\binom{N-l}{n-k}}{\binom{N}{n}},$$

where $l$ is a natural integer, i.e., $l \in \{0, 1, 2, \ldots, N-1, N\}$. Let $(k)_r$ be the product of $r$ consecutive integers counting down from $k$, i.e., $(k)_r = k(k-1)\cdots(k-r+1)$. Then, the $r$th factorial moment is

$$E[(X)_r] = \sum_{k=r}^{n} (k)_r\,P(X = k) = \frac{(l)_r\,(n)_r}{(N)_r}. \quad (3.8)$$

The detailed derivation of equation 3.8 can be found in Appendix A. The expected value of $X$ is

$$E(X) = E[(X)_r]\big|_{r=1} = \frac{ln}{N}, \quad (3.9)$$

and for $r = 2$ in equation 3.8,

$$E[X(X-1)] = \frac{l(l-1)\,n(n-1)}{N(N-1)}. \quad (3.10)$$

From equations 3.9 and 3.10, the variance of $X$ can be computed as

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = E[X(X-1)] + E(X) - [E(X)]^2 = \frac{l(l-1)n(n-1)}{N(N-1)} + \frac{ln}{N} - \left(\frac{ln}{N}\right)^2 = n\,\frac{l}{N}\,\frac{N-l}{N}\,\frac{N-n}{N-1}. \quad (3.11)$$

Thus, the expected value and variance of $\hat{x}$ are, respectively,

$$E(\hat{x}) = \frac{1}{n}E(X) = \frac{l}{N} = p \quad (3.12)$$

and

$$\mathrm{Var}(\hat{x}) = \frac{1}{n^2}\mathrm{Var}(X) = \frac{p(1-p)}{n}\cdot\frac{N-n}{N-1}, \quad (3.13)$$

where $0 \le \hat{x} \le 1$ when $1 \le n \le N$. As noted from equation 3.13, when $N$ is large (implying a wide LFSR) the distribution of $\hat{x}$ tends to the binomial distribution. If $P_s = N = n$, then $\mathrm{Var}(\hat{x}) = 0$ and $\hat{x}$ becomes the constant $x$. This new statistical model is used to perform an analysis of random noise effects in digital multilayer neural networks (DMNNs). The DMNN architecture will be developed in Chapter 4 and the analysis will be performed in Chapter 5.

3.4 Stochastic Computing in ANNs

Stochastic computing exploits the similarities between probability algebra and Boolean algebra. Logical operations with simple logic gates over multiple pulse sequences correspond to pseudo-analog computations.

3.4.1 Basic Stochastic Computations

A random pulse sequence $x_{(n)}$ is a sequence of pulses whose probability $x$ cannot be measured at any one clock period, but it can be approximated by a measurement of the average pulse rate. Any Boolean operation over individual pulses corresponds to an algebraic operation among variables represented by their respective average pulse rates [82]. Figure 3.3 shows the duality between logical operations on actual pulse occurrences and numerical operations on pulse occurrence probabilities.

Figure 3.3. Duality between Boolean operations and numerical operations: $w_{(n)} = u_{(n)}$ AND $v_{(n)}$ gives $w = uv$; $z_{(n)} = u_{(n)}$ OR $v_{(n)}$ gives $z = u + v - uv$; $\bar{u}_{(n)} = $ NOT $u_{(n)}$ gives $\bar{u} = 1 - u$ (example pulse trains with $u = 0.5$, $v = 0.4$, $w = 0.2$, $z = 0.7$, $\bar{u} = 0.5$; a sampling clock period of 20 is assumed, so the number of '1' pulses generated during the period in $x_{(n)}$ is $20x$ for $x$).

If two sequences $x_{(n)}$ and $y_{(n)}$ are statistically independent, the probability of a pulse occurrence in the output sequence $z_{(n)}$ of an AND gate is

$$z = P(z_{(n)} = 1) = P(x_{(n)} = 1 \wedge y_{(n)} = 1) = P(x_{(n)} = 1)\,P(y_{(n)} = 1) = xy, \quad (3.14)$$

and the probability of a pulse occurrence in the output sequence $z_{(n)}$ of an OR gate is

$$z = P(z_{(n)} = 1) = P(x_{(n)} = 1 \vee y_{(n)} = 1) = P(x_{(n)} = 1) + P(y_{(n)} = 1) - P(x_{(n)} = 1 \wedge y_{(n)} = 1) = x + y - xy. \quad (3.15)$$

If, instead of being statistically independent, the two sequences are mutually exclusive, implying that no two pulses coincide in the two random pulse sequences, then $P(x_{(n)} = 1 \wedge y_{(n)} = 1) = xy = 0$ in equation 3.15. Thus, the logical OR performs a direct summation.
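The duality expressed in equations 3.14 and 3.15 is easy to verify numerically. The short C sketch below is a simulation only: it generates two statistically independent pulse streams with the C library function rand() standing in for two independent RPGs (the operand values, sampling period, and seed are arbitrary), ANDs and ORs them, and compares the estimates with the exact results.

#include <stdio.h>
#include <stdlib.h>

/* Return one pulse ('1' with probability p) of a random pulse stream.
 * rand() stands in for an independent hardware RPG.                    */
static int pulse(double p)
{
    return ((double)rand() / ((double)RAND_MAX + 1.0)) < p;
}

int main(void)
{
    const double x = 0.6, y = 0.3;   /* operands encoded as probabilities */
    const int Ps = 1024;             /* sampling clock period             */
    int and_ones = 0, or_ones = 0;

    srand(1);
    for (int clk = 0; clk < Ps; clk++) {
        int xp = pulse(x), yp = pulse(y);
        and_ones += (xp & yp);       /* AND gate: multiplication, eq. 3.14 */
        or_ones  += (xp | yp);       /* OR gate: x + y - xy,      eq. 3.15 */
    }
    printf("AND estimate %.4f  (exact %.4f)\n", (double)and_ones / Ps, x * y);
    printf("OR  estimate %.4f  (exact %.4f)\n", (double)or_ones  / Ps, x + y - x * y);
    return 0;
}

The residual difference between each estimate and the exact value is the estimation noise whose variance was modeled in Section 3.3; it shrinks as the sampling period grows.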
The NOT gate in Figure 3.3(c) produces an output pulse whenever no input pulse occurs. If $x_{(n)}$ is the input pulse sequence of a NOT gate, the probability of a pulse occurrence in the output sequence $z_{(n)}$ is

$$z = P(z_{(n)} = 1) = 1 - P(x_{(n)} = 1) = 1 - x. \quad (3.16)$$

A complete set of examples of stochastic computations utilizing the duality between Boolean operations and algebraic operations can be found in reference [82].

3.4.2 Stochastic Computing in the DMNN

Neural operations in a stochastic neural network of the type considered here are performed with basic gates using pulse sequences as inputs. Let $w_{ij}$ and $v_j$ be the connection weight between neurons $i$ and $j$ and the neural activation of neuron $j$, respectively. If the two sequences $w_{ij(n)}$ and $v_{j(n)}$ are statistically independent, the probability of a pulse occurrence in the output sequence $m_{ij(n)}$ of an AND gate is

$$m_{ij} = P(m_{ij(n)} = 1) = P(w_{ij(n)} = 1 \wedge v_{j(n)} = 1) = w_{ij}\,v_j. \quad (3.17)$$

Input summation and nonlinear transformation can be performed simultaneously using the logical OR operation. The inputs of an OR gate are the product sequences $m_{ij(n)}$ produced by the AND gates. Two kinds of synaptic weights, $w_{ij}^+$ and $w_{ij}^-$, are necessary, positive (or excitatory) and negative (or inhibitory), for most feedforward neural networks. Thus, two separate OR gates per neuron are needed to form the excitatory and inhibitory net inputs. Let $net_i^+$ be the probability of a pulse occurrence in the output sequence $net_{i(n)}^+$ of an n-input OR gate forming the excitatory net input of neuron $i$, and $net_i^-$ likewise for the inhibitory net input (see Figure 3.4). $net_i^+$ and $net_i^-$ can be described by

$$net_i^+ = P(net_{i(n)}^+ = 1) = 1 - \left(1 - P(m_{i1(n)}^+ = 1)\right)\left(1 - P(m_{i2(n)}^+ = 1)\right)\cdots\left(1 - P(m_{in(n)}^+ = 1)\right) = 1 - \prod_{j=1}^{n}(1 - m_{ij}^+) = 1 - \prod_{j=1}^{n}(1 - w_{ij}^+ v_j) \quad (3.18)$$

Figure 3.4. Stochastic computations in the DMNN: (a) synaptic multiplication; (b) logical OR; (c) neural activation.

and

$$net_i^- = 1 - \prod_{j=1}^{n}(1 + w_{ij}^- v_j). \quad (3.19)$$

The two net inputs, formed by dedicated OR gates, are ANDed together to form the activation function. If the two sequences $net_{i(n)}^+$ and $net_{i(n)}^-$ are statistically independent, the probability of a pulse occurrence, $v_i$, in the activation sequence is

$$v_i = P(v_{i(n)} = 1) = P(net_{i(n)}^+ = 1 \wedge net_{i(n)}^- = 0) = net_i^+\,(1 - net_i^-) = \left[1 - \prod_{j=1}^{n}(1 - w_{ij}^+ v_j)\right]\prod_{j=1}^{n}(1 + w_{ij}^- v_j). \quad (3.20)$$

The nonlinear activation function described in equation 3.20 is continuous and differentiable, indicating that back-propagation can be used for training [8]. This form of stochastic computation will be used to develop a generic DMNN architecture in the next chapter.

3.5 Back-Propagation in the DMNN

The DMNN is a feedforward neural network which can be trained with the back-propagation algorithm discussed in Section 2.1.3.2. The back-propagation algorithm performs gradient descent iteratively over a sum-squared error measure. This section shows how the non-traditional neuron activation function described in the previous section is incorporated into the back-propagation algorithm. Define $n_i$ as the number of neurons in the $i$th layer; the input layer is not counted as a layer. Accordingly, for a k-layer DMNN, $n_0$ and $n_k$ indicate the number of elements in an input pattern and the number of output neurons in the output layer, respectively. The training of the DMNN can be done off-line or on-line using a digital computer.
The choice depends on whether or not the resolution of the DMNN can represent the changes of the synaptic weights during training for a particular application problem. The resolution of an n-bit DMNN is $\frac{1}{2^n - 1}$. For example, it is approximately $10^{-3}$ for a 10-bit DMNN. However, better than $10^{-5}$ precision is often required in most application problems. That is the reason the DMNN must be trained off-line in most cases.

Whenever an input pattern is presented to the network, the output pattern of the output layer is compared to the target pattern, and the connection weights between layers are modified in a backward direction according to the error. Given pattern $\mu$, the sum-squared error measure is

$$E_\mu = \frac{1}{2}\sum_{i=1}^{n_k}(t_{\mu i} - v_{\mu i})^2 \quad (3.21)$$

where $t_{\mu i}$ is the target output for the $i$th neuron in the output layer when input pattern $\mu$ is presented and $v_{\mu i}$ is the $i$th element of the actual output pattern. The overall measure of the sum-squared error over $p$ training patterns is

$$E = \sum_{\mu=1}^{p} E_\mu. \quad (3.22)$$

Thus, the back-propagation rule states that

$$\Delta w_{ij}(k) = \sum_{\mu=1}^{p} \Delta_\mu w_{ij}(k), \quad (3.23)$$

where the subscript $k$ denotes the number of iterations and $\Delta_\mu w_{ij}$ is the change to be made to the weight between the $i$th and $j$th neuron units following presentation of pattern $\mu$. The gradient descent rule for positive weights states

$$\Delta_\mu w_{ij}^+(k) = -\eta\,\frac{\partial E_\mu}{\partial w_{ij}^+} = -\eta\,\frac{\partial E_\mu}{\partial v_{\mu i}}\,\frac{\partial v_{\mu i}}{\partial net_{\mu i}^+}\,\frac{\partial net_{\mu i}^+}{\partial w_{ij}^+}. \quad (3.24)$$

Similarly, the weight change for negative weights is given by

$$\Delta_\mu w_{ij}^-(k) = -\eta\,\frac{\partial E_\mu}{\partial w_{ij}^-} = -\eta\,\frac{\partial E_\mu}{\partial v_{\mu i}}\,\frac{\partial v_{\mu i}}{\partial net_{\mu i}^-}\,\frac{\partial net_{\mu i}^-}{\partial w_{ij}^-}. \quad (3.25)$$

Define

$$\epsilon_{\mu i} = -\frac{\partial E_\mu}{\partial v_{\mu i}}, \qquad \delta_{\mu i}^+ = -\frac{\partial E_\mu}{\partial net_{\mu i}^+} = \epsilon_{\mu i}\,\frac{\partial v_{\mu i}}{\partial net_{\mu i}^+}, \qquad \delta_{\mu i}^- = -\frac{\partial E_\mu}{\partial net_{\mu i}^-} = \epsilon_{\mu i}\,\frac{\partial v_{\mu i}}{\partial net_{\mu i}^-}. \quad (3.26)$$

By equations 3.18 to 3.20, we can obtain

$$\frac{\partial v_{\mu i}}{\partial net_{\mu i}^+} = 1 - net_{\mu i}^-, \qquad \frac{\partial net_{\mu i}^+}{\partial w_{ij}^+} = (1 - net_{\mu i}^+)\,\frac{v_{\mu j}}{1 - w_{ij}^+ v_{\mu j}},$$

and

$$\frac{\partial v_{\mu i}}{\partial net_{\mu i}^-} = -net_{\mu i}^+, \qquad \frac{\partial net_{\mu i}^-}{\partial w_{ij}^-} = -(1 - net_{\mu i}^-)\,\frac{v_{\mu j}}{1 + w_{ij}^- v_{\mu j}}.$$

In the output layer,

$$\epsilon_{\mu i} = t_{\mu i} - v_{\mu i}. \quad (3.27)$$

In the hidden layers,

$$\epsilon_{\mu i} = -\frac{\partial E_\mu}{\partial v_{\mu i}} = -\sum_k\left[\frac{\partial E_\mu}{\partial net_{\mu k}^+}\,\frac{\partial net_{\mu k}^+}{\partial v_{\mu i}} + \frac{\partial E_\mu}{\partial net_{\mu k}^-}\,\frac{\partial net_{\mu k}^-}{\partial v_{\mu i}}\right] = \sum_k\left[\delta_{\mu k}^+(1 - net_{\mu k}^+)\,\frac{w_{ki}^+}{1 - w_{ki}^+ v_{\mu i}}\right] + \sum_k\left[-\delta_{\mu k}^-(1 - net_{\mu k}^-)\,\frac{w_{ki}^-}{1 + w_{ki}^- v_{\mu i}}\right]. \quad (3.28)$$

Then, the changes in the positive and negative weights resulting from the presentation of training pattern $\mu$ are described respectively by the following recursive forms:

$$\Delta_\mu w_{ij}^+(k) = \eta\,\delta_{\mu i}^+\,\frac{\partial net_{\mu i}^+}{\partial w_{ij}^+} = \eta\,\delta_{\mu i}^+(1 - net_{\mu i}^+)\,\frac{v_{\mu j}}{1 - w_{ij}^+ v_{\mu j}} \quad (3.29)$$

and

$$\Delta_\mu w_{ij}^-(k) = \eta\,\delta_{\mu i}^-\,\frac{\partial net_{\mu i}^-}{\partial w_{ij}^-} = -\eta\,\delta_{\mu i}^-(1 - net_{\mu i}^-)\,\frac{v_{\mu j}}{1 + w_{ij}^- v_{\mu j}} \quad (3.30)$$

where $\delta_{\mu i}^+ = \epsilon_{\mu i}(1 - net_{\mu i}^-)$ and $\delta_{\mu i}^- = -\epsilon_{\mu i}\,net_{\mu i}^+$. The back-propagation algorithm incorporating the activation function implemented in the DMNN therefore has two forms:

$$\Delta w_{ij}(k) = \begin{cases} \eta \displaystyle\sum_{\mu=1}^{p} \delta_{\mu i}^+(1 - net_{\mu i}^+)\,\frac{v_{\mu j}}{1 - w_{ij}^+ v_{\mu j}} & \text{if } w_{ij} = w_{ij}^+ \\[2ex] -\eta \displaystyle\sum_{\mu=1}^{p} \delta_{\mu i}^-(1 - net_{\mu i}^-)\,\frac{v_{\mu j}}{1 + w_{ij}^- v_{\mu j}} & \text{if } w_{ij} = w_{ij}^-. \end{cases} \quad (3.31)$$

Gradient descent, as described above, can be extremely slow for small $\eta$, while it can oscillate for large $\eta$ [43]. In order to achieve the most rapid learning, a learning rate $\eta$ which is as large as possible without leading to oscillation must be chosen. One way to accelerate learning is to add a momentum term:

$$\Delta w_{ij}(k) = -\eta \sum_{\mu=1}^{p} \frac{\partial E_\mu}{\partial w_{ij}} + \alpha\,\Delta w_{ij}(k-1) \quad (3.32)$$

where $\alpha$ is the momentum parameter such that $0 \le \alpha \le 1$. $\alpha$ determines the effect of past weight changes on the current direction of movement in weight space. This provides each connection weight $w_{ij}$ with a kind of momentum, so that it tends to change in the direction of the average downhill force instead of oscillating with the high-frequency variations of the error surface in weight space. In turn, the effective learning rate can be made larger without divergent oscillations occurring. A C program implementing back-propagation in the DMNN is listed in Appendix B.
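As a numerical illustration of how the activation of equation 3.20 and the weight updates of equations 3.26 through 3.30 fit together, the following C sketch evaluates one output neuron and the weight changes for a single training pattern. The weights, inputs, target, and learning rate are arbitrary example values; this is not the Appendix B program, only a minimal restatement of the formulas in code.

#include <stdio.h>

#define NJ 3   /* number of inputs feeding the neuron (illustrative) */

int main(void)
{
    /* excitatory and inhibitory weights (w- stored as negative values), inputs */
    double wp[NJ] = {0.40, 0.25, 0.10};
    double wn[NJ] = {-0.30, -0.05, -0.20};
    double v[NJ]  = {0.80, 0.50, 0.90};
    double eta = 0.1, target = 0.9;

    /* forward pass, equations 3.18 - 3.20 */
    double prod_p = 1.0, prod_n = 1.0;
    for (int j = 0; j < NJ; j++) {
        prod_p *= 1.0 - wp[j] * v[j];
        prod_n *= 1.0 + wn[j] * v[j];
    }
    double net_p = 1.0 - prod_p;          /* excitatory net input */
    double net_n = 1.0 - prod_n;          /* inhibitory net input */
    double out   = net_p * (1.0 - net_n); /* activation, eq. 3.20 */

    /* output-layer error and deltas, equations 3.26 - 3.27 */
    double eps     = target - out;
    double delta_p = eps * (1.0 - net_n);
    double delta_n = -eps * net_p;

    /* weight changes for this pattern, equations 3.29 - 3.30 */
    for (int j = 0; j < NJ; j++) {
        double dwp =  eta * delta_p * (1.0 - net_p) * v[j] / (1.0 - wp[j] * v[j]);
        double dwn = -eta * delta_n * (1.0 - net_n) * v[j] / (1.0 + wn[j] * v[j]);
        printf("j=%d  dw+ = %+.5f   dw- = %+.5f\n", j, dwp, dwn);
    }
    printf("activation = %.4f  (target %.2f)\n", out, target);
    return 0;
}

Extending this per-pattern step to a full network simply accumulates the changes over all patterns as in equations 3.23 and 3.31, optionally with the momentum term of equation 3.32.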
CHAPTER 4

Pulse-mode Digital Multilayer Neural Networks

In this chapter, digital architectures of basic elements such as synaptic elements and neuron body elements are developed. Using these basic elements, a modular architecture for digital feedforward neural networks is developed as the Digital Multilayer Neural Network (DMNN). The use of simple logic gates as computing elements and of modular design techniques leads to a DMNN architecture that is relatively compact in size and expandable to any size of network. Furthermore, the massive parallelism embedded in stochastic computations using random pulse streams is fully utilized with this architecture. A generic architecture for a DMNN coprocessor is also presented. All components in the DMNN and the DMNN coprocessor are modeled and simulated in VHDL. The use of VHDL as the modeling tool for the DMNN coprocessor is discussed briefly. Finally, the hardware complexity of the DMNN is estimated.

4.1 Basic Computing Elements

A random pulse generator, a synaptic element, an input neuron body element, and a regular neuron body element are developed as the basic computing elements of the DMNN. These basic elements are used to develop a modular network architecture.

4.1.1 Random Pulse Generator

The block diagram of a random pulse generator was presented in Chapter 3. The random pulse generator (RPG) is comprised of a tapped LFSR and a digital comparator. In Figure 4.1(a), the order of the LFSR is 8 and the example feedback function is $f(x) = x_2 \oplus x_3 \oplus x_4 \oplus x_8$, implemented by XOR logic gates, where the period of the sequence $v_{(n)}$ is $2^8 - 1 = 255$. Figure 4.1(b) shows the structure of a random pulse generator using D flip-flops, XOR logic gates, and a digital comparator.

Figure 4.1. (a) A maximal-length 8th-order LFSR where $f(x) = x_2 \oplus x_3 \oplus x_4 \oplus x_8$; (b) a pseudo-random pulse generator for $v_i$.

At every clock period, a logic '1' pulse is generated if $v_i \ge x$; otherwise, a logic '0' pulse is generated.

4.1.2 Synaptic Element

A large number of synaptic multiplications is required, even for a small feedforward neural network. For example, if the network consists of $m$ layers excluding the input layer, the number of synaptic multiplications required per feedforward operation is

$$\sum_{l=1}^{m} n_l\,n_{l-1},$$

where $n_l$ is the number of neuron elements in the $l$th layer and $n_0$ is the number of elements in an input pattern applied to the input layer. Each synaptic multiplication in the DMNN is performed relatively more slowly than a deterministic calculation done on a digital computer, but all the multiplications in the network can be performed in parallel.

Let $w_{ij}$ and $v_j$ be the synaptic weight between neuron elements $i$ and $j$ and the neural activation of neuron element $j$, respectively. Figure 4.2 shows the structure and block diagram of a digital synaptic element (SYN). The VHDL code for a SYN model is listed in Appendix C. The SYN consists of a random pulse generator (RPG), a weight register, two AND gates, and two wired-OR lines. The weight $w_{ij}$ is represented as an r-bit fractional number in sign-magnitude format, where the MSB is the sign bit and the remaining bits represent the magnitude. With $w_{ij}$ loaded into the weight register, the corresponding random pulse stream $w_{ij(n)}$ is generated through the RPG.
The pulse stream is transmitted to two AND gates: the upper one for positive weights and the lower one for negative weights. If the synaptic weight is positive, the resulting product sequence $m_{ij(n)}^+$ is transmitted to the excitatory net-input line; otherwise, $m_{ij(n)}^-$ is transmitted to the inhibitory net-input line.

Figure 4.2. The structure and block diagram of a digital synaptic element (SYN): the weight register (with load, clk, and select inputs) holds $w_{ij}$ in sign-magnitude form; the magnitude drives the RPG (LFSR and digital comparator) to produce $w_{ij(n)}$, which is ANDed with $v_{j(n)}$, and the sign bit ('1' indicating negative) routes the product pulses onto the excitatory ($net_{i(n)}^+$) or inhibitory ($net_{i(n)}^-$) wired-OR line.
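The behavior of the SYN just described can be summarized in a few lines of C. The sketch below is a behavioral simulation only (the 8-bit quantization width, the use of rand() in place of the LFSR, and all numeric values are assumptions for the example); it shows the comparator, the AND gate, and the sign-bit routing onto the excitatory and inhibitory wired-OR lines.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Behavioral sketch of one SYN clock tick: the weight magnitude is compared
 * with a pseudo-random number to form w_ij(n), ANDed with the incoming
 * activation pulse v_j(n), and routed by the sign bit to the excitatory or
 * inhibitory wired-OR net-input line.                                        */
typedef struct {
    int     sign;       /* 1 = negative (inhibitory), 0 = positive (excitatory) */
    uint8_t magnitude;  /* |w_ij| quantized to 8 bits (resolution 1/255)        */
} syn_t;

static void syn_tick(const syn_t *s, uint8_t rnd, int vj_pulse,
                     int *net_pos_line, int *net_neg_line)
{
    int wij_pulse = (s->magnitude >= rnd);   /* RPG comparator output      */
    int mij_pulse = wij_pulse & vj_pulse;    /* AND gate: w_ij * v_j       */
    if (s->sign)
        *net_neg_line |= mij_pulse;          /* wired-OR, inhibitory line  */
    else
        *net_pos_line |= mij_pulse;          /* wired-OR, excitatory line  */
}

int main(void)
{
    syn_t syn = {0, (uint8_t)(0.75 * 255)};  /* w_ij = +0.75               */
    int pos = 0, neg = 0, ones = 0, Ps = 1000;

    srand(2);
    for (int clk = 0; clk < Ps; clk++) {
        pos = neg = 0;                                    /* lines re-driven each clock */
        uint8_t rnd = (uint8_t)(rand() % 255 + 1);        /* stand-in for the LFSR      */
        int vj = ((double)rand() / RAND_MAX) < 0.60;      /* v_j = 0.60 pulse stream    */
        syn_tick(&syn, rnd, vj, &pos, &neg);
        ones += pos;
    }
    printf("estimated w*v = %.3f  (w*v = 0.45)\n", (double)ones / Ps);
    return 0;
}

In the actual SYN, many such elements share the same two wired-OR lines, which is how the OR-based net-input formation of equations 3.18 and 3.19 is realized for a whole neuron.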