PLACE ll RETURN Boxwmnmmbchockomm your record.
TO AVOID FINES Mom on or baton duo duo.

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU IIN'I Nﬂmnﬂvo WOMEN Oppoﬂunlly Inﬂation
Was-u

 

ANALYTICAL FORMULATION FOR
STATIC AND TEMPORAL PATTERN ASSOCIATION:
NEURAL ARCHITECTURES AND ALGORITHMS

By

Jz'ansheng H on

A DISSERTATION
Submitted to
Michigan State University

in partial fulﬁllment of the requirements

for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1997

 

ST;

  

Curre‘
structure

115ng oﬁ:

 

(matrix).
This d3
used for $1

$61 35 Stab

determine
matrix is 6
[Or allgmeﬁ

The Pro
adapted for
in

won an

ABSTRACT

ANALYTICAL FORMULATION FOR
STATIC AND TEMPORAL PATTERN ASSOCIATION:
NEURAL ARCHITECTURES AND ALGORITHMS

By

Jiansheng Ho'u

Currently, most Artiﬁcial Neural Network (ANN) models have standard layered
structure of interconnected neurons where the connections (weights) are computed
using off-line or on-line methods. The process of ﬁnding an appropriate weight
(matrix), if it exists, is usually computationally expensive.

This dissertation focuses on new design and training of artiﬁcial neural networks
used for storing static or temporal patterns. A product-of-norm model of artiﬁcial
neural networks is presented. It is shown that this model can store an arbitrary data
set as stable system equilibria — without the need for performing training process to
determine the weights. Consequently, the process of iteratively ﬁnding the weight
matrix is eliminated. The need of deﬁning the weights is in fact replaced by defining
(or augmenting) the architecture.

The properties of this new model are investigated, and various modiﬁcations are
adapted for practical applications. In particular, the use of the log form of the energy

function and the subgrouping method are introduced to reduce the computational

 

eﬁbrt
penahj

Thi
recurre
proposr

back-pr

 

 

effort. Moreover, methods for recognizing shifted images are proposed by adding
penalty terms to the energy function.

This investigation also includes ways to improve the current algorithms for training
recurrent networks to store temporal patterns. Speciﬁcally, two algorithms are
proposed to avoid the backward integration mode in the time-dependent recurrent

back-propagation method.

 

Copyright by
J iansheng Hou

1997

 

1 WO
his cont
Michiga
guidanc
Newhou
for theh

I am
of Elect

I am
and 9110
been ab
wannest

ACKNOWLEDGMENTS

I would like to express my sincere thanks to my advisor, Dr. Fathi M. A. Salam, for
his continued support and guidance throughout the years of my graduate studies at
Michigan State University. I also want to thank the other members of my Ph.D
guidance committee members, Dr. Hassan Khalil, Dr. John Deller, Dr. Sheldon
Newhouse, Dr. Robert Nowak, Dr. Frank Hoppensteadt and Dr. R. L. Tummala,
for their valuable comments and suggestions.

I am grateful for the ﬁnancial support from the Case Center and the Department
of Electrical Engineering of Michigan State University.

I am especially grateful to my wife, J ian Yan, for her love, understanding, support
and encouragement, particularly in the last years of this research. I would not have
been able to ﬁnish this work without her support. Finally, I want to express my

warmest gratitude to my parents for their continued support and encouragement.

LIST (

LIST (

1.1
1.2
1.3
1.4

1.5

2 AR]
SOC

torogoro
#aéAtVH

NET
3.1 '
3.9 ‘

‘-

3.4

3.6

TABLE OF CONTENTS

LIST OF TABLES ix
LIST OF FIGURES x
1 INTRODUCTION 1
1.1 Overview .................................. 1
1.2 Problem Statement ............................ 3
1.3 Research Goals of the Thesis ....................... 4
1.4 Major Contributions ........................... 7
1.5 Organization of the Thesis ........................ 8

2 ARTIFICIAL NEURAL NETWORKS FOR INFORMATION AS-
SOCIATION 10
2.1 Background ................................ 10
2.2 The Hopﬁeld Model ............................ 11
2.3 Bidirection Associative Memory ..................... 16
2.4 Existing Models for Storing Temporal Patterns ............. 18

3 THE PRODUCT-OF-NORM MODEL OF ARTIFICIAL NEURAL
NETWORKS 21
3.1 Motivation ................................. 21
3.2 Proposed Mathematical Model ...................... 23
3.3 Stability Analysis ............................. 24
3.4 The Product-of—norm Model Using L2 Norm .............. 26
3.4.1 Oneneuron/m-pattern Case ................... 28
3.4.2 The Number of Equilibria for Product-of-norm Networks . . I. 34
3.5 The Case of Leo Norm .......................... 38
3.6 Simulation Results ............................ 40

vi

4.3

5 THE
CUI
5.1
5")

'8

 

5.3 _

6 ART
PAT'
6.1
6.9

Q

AAAA

4 GENERALIZATION AND VARIATIONS OF THE “ENERGY”

FUNCTION FOR PRODUCT-OF-NORM MODEL 43
4.1 The Log Form of the “Energy” Function ................ 44
4.2 Localization of the Product-of—norm Model ............... 56
4.2.1 Motivation ............................. 56
4.2.2 Subgrouping Method ....................... 58
4.2.3 Simulation Results ........................ 62
4.3 Modiﬁed Product-of—norm Models for Recognizing Shifted/ Translated
Input Patterns .............................. 66
4.3.1 Motivation ............................. 66
4.3.2 Modiﬁed “Energy” Functions .................. 66
4.3.3 Simulation Results ........................ 72
4.3.4 Discussion and Remarks ..................... 73

5 THE BUILDING BLOCKS FOR MICRO-ELECTRONIC CIR-
CUIT REALIZATION OF THE PRODUCT-OF-NORM MODEL 77

5.1 Motivations ................................ 77
5.2 Background and Basic Circuits ..................... 78
5.2.1 MOS Transistors in Subthreshold Region ............ 78
5.2.2 Current Mirrors .......................... 79
5.2.3 Differential Pair .......................... 80
5.2.4 Current-voltage Multiplier .................... 85
5.2.5 Four-quadrant Multiplier ..................... 88

5.3 An Example Circuit of One-neuron/two-pattern Product-of-norm Model 91

6 ARTIFICIAL NEURAL NETWORKS FOR TEMPORAL-

PATTERN ASSOCIATION 94
6.1 Introduction ................................ 94
6.2 Methodology Review ........................... 95
6.2.1 Tapped Delay Lines ........................ 96
6.2.2 Context Units (Partially Recurrent Networks) ......... 98
6.2.3 Back Propagation Through Time ................ 98
6.2.4 Real-Time Recurrent Learning .................. 100
6.2.5 Time-dependent Recurrent Back-propagation . . . . ....... 103
6.2.6 The Green’s Function Method .................. 105
6.3 Motivation ................................. 107
6.4 The Initial Value Problem ........................ 109
6.5 The Choice of the Final Value of the Co-state ............. 111

vii

7 SC)

to Iv—-

A THE
NEIL

 

 

 

 

 

BIBLIO

6.6 Avoiding Backward Integration in Time-dependent Recurrent Back-

propagation: Method I .......................... 113

6.6.1 Simulation Procedure ....................... 119

6.7 Avoiding Backward Integration: Method II ............... 123
6.7.1 Updating the Time—constant T ................. 129

6.7.2 Computation Procedures ..................... 131

6.7.3 Simulation Examples ....................... 134

6.7.4 Comparison and Discussion ................... 137

7 SUMMARY AND CONCLUSION 139
7.1 Summary ................................. 139
7.2 Future Research Directions ........................ 142

A THE PSPICE CODE AND SIMULATIONS FOR A ONE-
NEURON/TWO-PATTERN CIRCUIT 143

BIBLIOGRAPHY 155

viii

 

4.1

6.1

 

LIST OF TABLES

4.1 Simulation results ............................. 73

6.1 Comparison of the algorithms ...................... 138

2.1
2.2
2.3

3.1
3.2
3.3
3.4
3.5

3.6

3.7
3.8
3.9
3.10

4.1
4.2

4.3

4.4
4.5
4.6
4.7
4.8
4.9
4.10
4.11

LIST OF FIGURES

A neuron of an analog Hopﬁeld model ................. 14
A network of Hopﬁeld model ....................... 15
Structure of Bidirection Associative Memory .............. 17
The plot of “energy” function E = 0.5(0.25 — :c)2(0.75 — (1:) ...... 32
The plot of function H (v) with v'"1 = 0.25 and v'"2 = 0.75 ....... 32

The plot of “energy” function E = 0.5(0.25 — x)2(0.5 - x)2(0.75 — 3:)2 33
The plot of function H (v) with v“ = 0.25 v"‘2 = 0.5 and v‘3 = 0.75 . 33
An H (22) function with stable equilibria 0.2, 0.4, 0.7 and unstable ones

at 0.3 and 0.6 ............................... 35
An H (2)) function with stable equilibria 0.2, 0.4, 0.7 and unstable

equilibria at 0.3, 0.5 ............................ 35
An example vector ﬁeld for Loo norm .................. 39
Initial and ﬁnal outputs of the network with six pre—speciﬁed patterns 41
Stored two facial images ......................... 42
Image retrieving process ......................... 42
A modularized structure of Product-of-norm model .......... 57

An example structure of dividing a big network into isolated smaller
subnetworks ................................ 61
An example structure of dividing a big network into overlapped smaller

subnetworks ................................ 61
Three stored patterns ........................... 63
Test patterns ............................... 63
No over-lap grouping ........................... 64
Over-lap grouping ............. - ................ 64
The retrieving process of the network using non-overlap method . . . 65
The retrieving process of the network using overlap method ...... 65
Shifted input pattern ........................... 67
An example way of dividing the network into small regions ...... 71

 

 

4.12 Stored patterns .............................. 74

4.13 Test patterns ............................... 74
5.1 I”. vs IQ, ................................. 80
5.2 Two diode-connected transistors ..................... 81
5.3 Two types of current mirrors ....................... 82
5.4 11, 12 and 11 — 12 vs V1 — V2 ....................... 83
5.5 Differential pair .............................. 83
5.6 Transconductance ampliﬁer ....................... 84
5.7 Current-voltage multiplier ........................ 86
5.8 Schematic symbol for current-voltage multiplier ............ 87
5.9 The plot of a current-voltage multiplier ................. 87
5.10 Gilbert multiplier ............................. 88
5.11 Spice simulation of a Gilbert-multiplier ................. 90
5.12 Schematic symbol of a voltage square circuit .............. 90
5.13 One-neuron/two-pattern circuit ..................... 92
6.1 A time-delay neural network ....................... 97
6.2 Architectures of networks using context units. Shaded arrows represent

fully connected connections ....................... 99
6.3 A general recurrent network ....................... 101
6.4 A unfolded recurrent network for 4 time steps ............. 101
6.5 Network topology of XOR problem ................... 110
6.6 XOR network with input = [0, 0] .................... 121
6.7 XOR network with input = [0,1] .................... 121
6.8 XOR network with input = [1,1] .................... 122
6.9 XOR network with input = [1, 0] .................... 122
6.10 XOR network with input = [0,0] .................... 135
6.11 XOR network with input = [0,1] .................... 135
6.12 XOR network with input = [1,1] .................... 136
6.13 XOR network with input = [1,0] .................... 136
A.1 The response of the one-neuron/two-pattern circuits to different initial

voltages. v“ = 0.4V and v‘2 = 0.7V. .................. 154
A2 The response of the One-neuron/two—pattern circuits to different initial

voltages. v'"1 = 0.3V and v"2 = 0.75V .................. 154

xi

CH

 

 

 

1.1

Since t
human
have b
indUSt;
though
many 1
39 Gas}
by dig]
Year-01
Compu

ID 1111”]

'R

t1.

CHAPTER 1

INTRODUCTION

1 . 1 Overview

Since the birth of the ﬁrst electronic digital computer — the Von Neumann machine,
human civilization has entered a new era of information processing. Today, computers
have been used in almost every area in our life, from home to ofﬁce, from science to
industry. It is hard to imagine what life will look like without computers today. Even
though the digital computers are so advanced and so widely used, they still have
many limitations when compared to human brains. Many things, which can be done
so easily by human brains, can be very difficult, if not impossible, to be accomplished
by digital computers. A good example is the processing of visual information: an one-
year-old baby is much faster at recognizing human faces and objects than the best
computers today [HKP90]. The following distinguished features have been attributed

to human brain (neural networks):

0 Robust and fault tolerant. Nerve cells die every moment and this does not affect

the performance of neural networks signiﬁcantly.

0 Flexible and adaptive. It can easily adjust to a new environment by “learning”.

It does not have to be programmed like a digital computer.
0 High degree of parallelism and distributed storage.
0 The ability to process information that is fuzzy, probabilistic or inconsistent.

o Compactness.

These features have inspired generations of researchers to study the human brain
— neural networks, to investigate the principles of how human brain works and to
build models to mimic the behavior of the human brain. This lead to the birth of
“artiﬁcial neural network” (ANN) —— “ the next generation of computers”.

The study of artiﬁcial neural network has gone through several circles of up and
down. It began in the 1940’s with McCullough and Pitt. Second peak happened in
the 1960’s with Rosenblatt’s perceptron convergence theorem. It began to explode
again in the 1980’s inspired by John Hopﬁeld’s energy approach and the modern
era of multi-layered neural networks and back-propagation algorithm. Researchers
have been investigating the possible applications of artiﬁcial neural networks in all
areas. Information association using artiﬁcial neural networks is a major ﬁeld which
researchers have studied extensively. The research results can be used in a wide range
of real world applications, such as image association, pattern recognition, associative
memory and etc..

Associative memory is one of the early themes of the research works on artificial
neural networks. It is such a kind of network that presents the same output when sim-
ilar inputs are applied. Many artiﬁcial neural network architectures and algorithms

have been proposed and studied by researchers. Some of the prOposed networks

use feet

IlQlWOI :~

1.2

Some m

l. 111:;

 

 

2. the
usii
While it
51‘stems
PrOhibitj
HEIWm-k
COHSE‘qt
order of
COmPUt.‘
the in“
the data
In t}
3830mm
10 OVerc
On redu

100‘

use feedforward structures, while the majority use feedback structures (the recurrent

networks). The focus of this dissertation research is on the recurrent networks.

1.2 Problem Statement

Some major issues in designing functional neural networks are [8891, SWC91]:

1. the ability to store (or learn) any number of pre-speciﬁed set of data successfully;

2. the determination of a set of parameters (or weights) which would archive 1

using efﬁcient off-line or on-line computations.

While it is possible to employ analogous approaches to the algorithms employed in
systems theory, control and adaptive control, the amount of computation may be
prohibitive by present implementation media. In particular, for n—neuron neural
network update methods, one must update or determine in the order of n2 weights.
Consequently, most learning or weight updating methods can successfully store in the
order of 7: patterns while updating about 112 weights. In addition, the weight updating
computation is largely time consuming. Another limitation of current algorithms is
the insufﬁcient ability to store arbitrary set of data. Many of the algorithms require
the data to be stored meet certain prerequisite conditions.

In this thesis, current ANN architectures and algorithms used in static information
association will be analyzed. New ANN architectures and algorithms will be presented
to overcome the limitations identiﬁed in the analysis. Speciﬁcally, this thesis will focus
on reducing the computation complexity of the networks and increasing the capacity
of ANN networks. Implementation of new models via hardware will be investigated

too.

A .V .‘
great p1
extensiu
Ways to

be inx'es]

1.3

Goal 1:

Objectix

 

 

SignlﬁCa]

Approach

ANN networks used for temporal pattern association is another area which has

great potential applications. Most algorithms proposed so far are, to certain degree,

extensions of the back-propagation method. The computation is very expensive.

Ways to improve the performance of ANN used in temporal pattern association will

be investigated in this thesis too.

1.3 Research Goals of the Thesis

Goal 1:

Objective:

Signiﬁcance:

Approach:

Improve the performance of ANN used for static pattern association

Analyze Current ANN network architectures, models and algo-
rithms, especially recurrent networks will be analyzed, so that their
limitations can be identiﬁed. New architectures and algorithms will

be presented to overcome the limitations identiﬁed.

Analysis of current approaches is a very important step. Through
this step, advantages and limitations of current ANN networks used
for association of static patterns will be identiﬁed. Thus new archi-
tectures and algorithms can be generated. A new architecture with
better performance will have a wide range of applications. It is a

step further in mimicking the behavior of neural networks.

Current models and methods of ANN used for association of static
patterns, especially feedback networks, will be investigated ﬁrst. The
capacity, speed, convergence and computation complexity of various

nets will be studied. A new ANN architecture will be proposed

 

Goal 2;

Objecti‘

 

‘ |
SlgnlﬁCa

AppI‘OaCl

Goal 2:

Objective:

Signiﬁcance:

Approach:

with a new “energy” function. The resulting network will be a
gradient-like system, thus it will have nice stable behavior. The
proposed system will be able to store arbitrary analog data set.
The determination of network parameters (weights) will be simpliﬁed
using this model. Static patterns will be stored as stable equilibria
of the system so that given enough hint (input) the network will be
able to retrieve the correct pattern. Simulations will be performed
to verify the theoretic results and demonstrate the potential usage

of the new model.
Generalization and extension of the proposed model.

The proposed model will be generalized. Several ways of modifying
the “energy” function will be investigated so that the resulting

network can be tailored for various application needs.

Generalization of the “energy” function can gain insights of the
property of the new model. Investigation of ways to modify the
“energy” function will lay a foundation for the new model to be

used in real world applications.

Different forms of the “energy” function will be explored to simplify
the model and improve the performance. Ways to modify the model
so that it can be easily modularized and parallelized will be inves-
tigated. Methods of modifying the model so that it can be used
in image association and recognition will be studied. In particular,

variations of the “energy” function to recognize shifted images will

 

Goal 3:

Objecti

 

Signiﬁca

APlJroa

Goa] 4:

OblECti

SigniﬁC

Goal 3:

Objective:

Signiﬁcance:

Approach:

Goal 4:

Objective:

Signiﬁcance:

be investigated. Software simulations will be performed to verify the

correctness of theoretical result and computation complexity.

Micro-electronic hardware implementation

Study ways to implement the proposed network in hardware; par-
ticularly, VLSI technology. Basic circuits building blocks will be

presented.

Implementing the new ANN model in hardware makes it possible to
employ this model in real world applications. The implementation
process can also provide feedback about directions to improve the

theoretical model.

Analog MOS transistors biased in subthreshold region will be used
as the basic elements of circuits for the new ANN model. Circuit
building blocks will be designed and veriﬁed using “Pspice”. Small

networks will be built.

Extensions to for temporal pattern association

Analyze current models and algorithms used for temporal patterns
association. Improve the performance of ANN by proposing new

models or algorithms.

ANNs used for temporal pattern association have great potential

usage in real world applications.

Appro

1.4

The maj

 

 

' Var
as 1

lug

Approach: Current models and algorithms for temporal pattern association
will be surveyed ﬁrst, then the limitations and advantages of each
approach will be analyzed. New algorithms will be proposed to

improve the performance and overcome the limitations identiﬁed.

1.4 Major Contributions

The major contributions of this work can be summarized as follows:

0 A new ANN architecture — the product-of—norm model — is proposed. This
model eliminates the dynamic learning of the weight connections. Weights are
ﬁxed instead. This network can store arbitrary set of data as stable system
equilibria. Thus the data are guaranteed to be recallable. This model can be

used in a wide range of real world applications.

0 Various methods of simplifying the product-of-norm model are presented, such
as using the log form of the “energy” function, the inﬁnity norm and subgroup-

ing method. The simpliﬁed network can be easily parallelized and modularized.

0 Several ways of tailoring the “energy” function of the product-of—norm model
are pr0posed so that the model can be used in speciﬁc applications. In partic-
ular, tailoring the “energy” function for recognizing slightly shifted images is

demonstrated.

0 Basic building blocks for implementing the product-of-norm model in hardware

using VLSI technology are proposed.

 

in
de‘
val

M?

O‘VCI‘E

 

for gener

beCome s.

1.5 q

The rest .
Chapt
1101] of St;
Chapt
{101-m m0<
Nations
is inl’estig
mall’Zed‘ .
Chaple
kinetic)“

; u
[UHCIIOI].

0 Numerical algorithms are proposed which extend the time-dependent recurrent
back-propagation method used in storing temporal patterns using recurrent
networks. They eliminate the need of integration backward in time as needed
in the original time-dependent recurrent back-propagation method. Rules are
developed for setting the ﬁnal boundary conditions of the co—state and the initial
values of the neurons of the time-dependent recurrent back-propagation method

are derived using optimal control theory.

Overall, this work improves the current methods for artiﬁcial neural networks used
for general association. Arbitrary data set can be stored and retrieved. “Learning”

become simpler.

1.5 Organization of the Thesis

The rest of the dissertation is organized as follows:

Chapter 2 gives a basic introduction to artiﬁcial neural networks used in associa-
tion of static patterns and identiﬁes limitations of current approaches.

Chapter 3 presents a new model of artiﬁcial neural network — the product-of-
norm model of ANN. A new “energy” function is proposed and the system dynamic
equations are derived from the “energy” function. The stability of this new system
is investigated. The properties of the new model using different kinds of norms are
analyzed, especially the L2 norm.

Chapter 4 further investigates the ways to modify and generalize the “energy”
function proposed for the new model, speciﬁcally the log form of the “energy”

function. The subgrouping method is proposed to localize the connections. It

transistc
of the net
I

Chap'
lem using
each met
limitation
Finall}

direction.

 

 

 

 

also demonstrates the ways of tailoring the “energy” function for image recognition
purpose. Simulations are proved.

Chapter 5 presents some circuit building blocks for the new model using analog
transistors biased in the subthreshold region. “Pspice” simulations are given for some
of the networks. A one-neuron-/two-pattern network of the new model is also showed.

Chapter 6 analyzes the current approaches for temporal pattern association prob-
lem using artiﬁcial neural networks. It identiﬁes the limitations and advantages of
each method. Finally it presents some new algorithms to overcome some of the
limitations.

Finally, chapter 7 summarizes the whole thesis, points out the future research

direction.

 

 

 

AR’]

CHAPTER 2

ARTIFICIAL NEURAL

NETWORKS FOR

INFORMATION ASSOCIATION

2.1 Background

Some of the major concerns or criteria in designing associative neural networks are:

0 How many patterns can a network store? Can it store arbitrary patterns? How
big is the network capacity (we deﬁne the capacity of an associative network as
the ratio of the number of patterns it can store over the number of neurons in

the network) ?

0 Will there be any spurious patterns stored? In other words, does the network

store patterns which are not desired?

10

0 H'.

be

D HQ

pa’

 

0 Cam

These
111 genera

teIIlporal 1

 

 

Many ,
and Studiet
memory 5
majority u
in the later

for Store st.

11

o How are the patterns stored? What is the training method? Can the network

be trained on-line?

o How easy is it for the network to store additional patterns without altering the

patterns which have already been stored?

0 Speed of the network. How fast can it be trained? How fast can a pattern be

retrieved?
0 Can the network be easily implemented in hardware/ software?

These criteria measure the merit of a designed associative artiﬁcial neural network.
In general, the above criteria apply to associative networks for both static and
temporal patterns.

Many artiﬁcial neural network architectures and algorithms have been proposed
and studied by researchers. Most of them are focused on the applications of associative
memory. Some of the networks proposed use the feedforward structure. While the
majority use the feedback structure (the recurrent structure). Our main interest is
in the later. In the following sections, we will study some of the popular ANNs used

for store static or temporal patterns.

2.2 The Hopﬁeld Model

The analog Hopﬁeld model [Hop84] is a recurrent network. One of the major contri-
butions of this model is that it can be easily built with analog circuit and it is suitable
for VLSI implementation. Figure 2.1 shows a circuit model of an element (neuron)

of a Hopﬁeld model. Figure 2.2 shows a Hopﬁeld network. In Hopﬁeld model, each

procesir
a resisto
transfer

ﬁnite co:

are expre

Where I, j

Leta]:

 

12

processing unit (neuron) i, i = 1,2, . . . ,n, is an ampliﬁer with a capacitor C: and
a resistor p; at the input node. The processing node uses a sigmoid function S to
transfer input u,- to output vi. Node (neuron) i is connected to node (neuron) j via a
ﬁnite conductance Tij~ The connection matrix T = [ng] is symmetric and T.-,- = 0 for
i 6 {1,2, . . . ,n}. The system dynamic equations can be derived by using KCL, and
are expressed as:

dug 7'

U:
7t— Zﬁj(vj-u.)—;+I.

j=l

= ZTaJ-vj- (ZT..+— —)u.—+I i=1,2,...,n, (2.1)

:1

Ci

where I, is an input bias current to neuron i, and

vj=S(u,-) j=1,2,...,n. (2.2)
Let Ry: T and
_1_ _ l. n _1_
Bi Pi “=1n<R'fj
Pi j=1

The system differential equations become:

Cd , n ,
071—1; 2:72-11),- — E:- + 1,2 =1,2,...,n. (2.4)

j=1

The t

The derir.

3.52

Where 1:,- :

When LE -
d: ~

 

iISae

stable DOIm

SB‘St
em [Sh

Hopﬁem

13
The “energy” function of the system is deﬁned as

n n

__ ...._.. "_1_‘“-1..
E— ZZTUW), 21.2”: 12‘ f0 5 (170.117,. (2.5)

i=1 j=1 i— £21

The derivative of the energy function along the trajectory of the system can be derived

 

as:
9'5— ” 1‘15““) a—E-2 (26)
dt — i=1 C,‘ du, 02); i
where v,- = S (11;) and S is a sigmoid function. Thus
d
d—iT > 0. (2.7)
Since C; > 0 and (jg-EV Z 0,
if g 0 for t 2 o. (2-8)
When “a? = 0
3E
E): 0 2 — 1,2, ,n
du,
=> Cg? — 0

So u.- is a equilibrium point of the system. Therefore the system will converge to a
stable point (equilibrium point) starting from any point. This system is a gradient-like
system. [SWC91] has more detailed analysis of this system.

Hopﬁeld network functions as associative memory by storing desired patterns as

14

 

 

 

—’VVV‘—' , [> v,

 

 

Figure 2.1. A neuron of an analog Hopﬁeld model

stable equilibrium points of the system. Thus, when the system is presented with an
input “similar” to one of the stored patterns ( within the region of attraction of one of
the stable equilibrium point ), the system will converge to the closest stored pattern.
However, it is difﬁcult to determine the symmetric connection (weight) matrix T. The
network cannot store arbitrary patterns either. There are many papers published on
ways of ﬁnding the weight matrix for a set of given patterns, such as the one proposed
by Barbosa [Bar91], and the one by Tseng et al [THL93].

There is a discrete version of Hopﬁeld model too [Hop82]. The system dynamics

can be described by:
V: = sign (2 Tij) (2-9)
1

We omit the detailed analysis here.

The Hopﬁeld—type model has the following limitations:

0 The storage capacity is relatively small, i.e. for a ﬁxed number of neurons the

15

 

 

 

 

 

 

 

 

 

 

 

 

 

Il I2 1..
T. T... T.
T2 T2 Th
Tn T,z Th
9. p. p.

 

 

 

 

 

 

 

 

Figure 2.2. A network of Hopﬁeld model

I111:

o Gi‘~

im;

whe

0 011-

2.3 I

Bart Kosk
HGIWOfk :}
Pattern) in
(Unnected

thS: the 0]

 

16

number of “storable” patterns is small.

0 Given an arbitrary set of patterns to be stored, it is difﬁcult and maybe
impossible to ﬁnd a weight matrix. That is, the network cannot store arbitrary
sets of patterns. For example, assume I, = 0, then the system equilibria of the
Hopﬁeld model must satisfy the condition: F T(Vi)Vj = F T(VJ')V'i, for i 75 j;

where F(V) = R’IS‘1(V) [SWC91].

0 On-line learning is difficult and computationally expensive.

2.3 Bidirection Associative Memory

Bart Kosko extended the original Hopﬁeld type associative memory to a two-layer
network [Kos88], as shown in Figure 2.3. Each layer can represent a vector (or
pattern) and the two layers can have different number of neurons. The two layers are
connected through the weight matrix (connection matrix) M and its transpose M T.
When the system is given an initial input to layer one, say A’, the system works like
this: the output of the layer two is calculated by B’ = S (M A’), S() is a threshold
function; the output of layer two is then fed to layer one again through [connection
MT; the output of layer one then become A” = S'(MTB)’ until B = S(MA) and
A = S (M ,T B ) A and B are the stored patterns of the network ( may not be a desired
one ). Given pattern A as the input, the network can retrieve pattern B; similarly,
given B as a input the network can retrieve pattern A for layer one. Kosko [K0388]
proved that any real matrix M is bidirectionally stable (i.e. use matrix M as the
connection matrix, the resulting network is stable). Kosko proposed the following

way to ﬁnd the connection matrix: for a given set of patterns (A1, Bl), (A2, 32), . . .,

1.4....

when

dEHﬁI

Ho

guard“

17

(Am, Bm), matrix M can be constructed as follows:

M = ZXFIG, (2.10)

i=1

where X,- and Y.- are the bipolar form of patterns A,- and B,- respectively and are

deﬁned as:
a;,- If (1.5 > 0
1‘35 = , (2.11)
—1 If (1,} S 0

bgj if (25,“ > 0
31.3 = . (2.12)
—1 if 12,-,- g 0

 

 

 

 

Figure 2.3. Structure of Bidirection Associative Memory

However, as pointed by Wang et al. [WCM89], this proposed matrix cannot

. guarantee the recall of the desired patterns, i.e. some of the desired patterns may

not be e

 

Training:
single pa
proposeti
idea is t.
ln [WCJ
connectic
associatix
also prop

IOI a g] \‘e

 

The a
limitation

It 15 also (

18

not be stable equilibrium points of the system. Wang et al. proposed “multiple
Training Method” to build the matrix M, which can guarantee a successful recall of a
single pair of patterns [WCM90]. To guarantee the recall of all training pairs, Wang
proposed a method called “Dummy Augmentation Encoding” [WCM90]. The basic
idea is to increase the dimension of the stored patterns to make them “noise-free”.
In [WCJ91], Wang proposed a necessary and sufﬁcient condition for a generalized
connection matrix (also called correlation matrix in Wang’s paper) of a bidirectional
associative memory (BAM) which guarantees the recall of all the training pairs. He
also proposed to use linear programming techniques to ﬁnd such matrix. However,
for a given set of patterns this condition may not be satisﬁed.

The advantage of BAM is its ability to store patterns in pairs. The major
limitation with BAM is that it cannot guarantee to store an arbitrary set of patterns.

It is also difﬁcult to ﬁnd the connection matrix if it exists at all.

2.4 Existing Models for Storing Temporal Pat-

terns

Many algorithms have been prOposed for training ANNs so that they can
store/generate temporal patterns. To be able to generate temporal patterns, A
recurrent network structure is usually required. Our main interest here is to extend
current algorithms and to improve the performance of current methods. In this
section, a brief introduction will be given to some popular algorithms. Detailed
analysis will be given in later chapters. Our focus is on the algorithms of training a

recurrent network to store/ generate temporal patterns.

 

Back-P

Bach
[th 'Sr‘.
so that ‘

correspo:

connecth

   
  
   
   
 

the origi:
feedforwa
The limit
The lengr

requires t

 

Real-ti
A met

of the "en

“Gt

19

Back-Propagation Through Time

Back-propagation through time was proposed by Rumelhart, Hinton and Williams
[RHW86]. This method “unfold” a fully connected recurrent network through time,
so that the network can be represented by a feedforward network with each layer
corresponding to a state of the recurrent network at certain time constant. The
connections between each layer are the same, and they are the weight matrix of
the original recurrent network. Therefore, the popular back-propagation method of
feedforward network can be slightly modiﬁed to be used to train this kind of network.
The limitation is that this algorithm cannot be used to train arbitrary length patterns.
The length of the training patterns need to be ﬁxed and known. This method also
requires too much computer resources and is difﬁcult to implement in hardware.
Real-time Recurrent Learning

A method proposed by Williams and Zipser [WZ89b, WZ89a] can be used to train
general recurrent networks. It does not require the patterns to have ﬁxed length. The
training algorithms use a gradient descent method to update weights. The derivative
of the “energy” function with respect to weights is calculated by integrating forward
the derivative of the outputs of the neurons with respect to weights. Thus this
method is a forward propagation method. The drawbacks with this method are: it is
computationally very expensive, each time step requires 0(n4) of operation. It also
needs large memory storage.
Time-dependent Recurrent Back-propagation

This method was proposed by Pearlmutter [Pea89] to train general recurrent
neural networks. Other researchers developed similar method, such as the “adjoint

function” method proposed by Toomarian and Barhen [TB91]. The time-dependent

IfCUIIEI.

method

‘energy‘
(”adjoint
it needs
and hard

The Grc

Sun. (

 

 

function 1
it used a
(omputar
equation
SimUlatio

In the
HE‘IWQI-ks
the limit;
the ”93's

20

recurrent back-propagation method is in essence an extension to the backepropagation
method used in feed forward networks. This method updates the weight / connections
using a gradient-descent algorithm too. However, this time, the derivative of the
“energy” function with respect to weights is calculated using a co—state variable
(adjoint system). The computation per time step is in the order of 0(n2). However,
it needs to integrate the co—state backward in time. Thus difﬁcult to be used on-line
and hard to be implemented in hardware.

The Green’s Function Method

Sun, Chen and Lee proposed another way of ﬁnding the derivative of the “energy”
function with respect to weights [SCL91]. Instead od using back-propagation method,
it used a “Green” function. This method can be integrated forward in time. The
computation steps are in the order of 0(n3). This method needs to solve a linear
equation at each time step, which may not have a unique solution in numerical
simulations.

In the rest of this dissertation, we will ﬁrst propose a new model of artiﬁcial neural
networks for association of static patterns. This new model will overcome some of
the limitations we have pointed out in current approaches. Then we will investigate
the ways to improve current methods and models used for association of temporal

patterns.

C]

T]

M

3.1

tum

S[Dre

CHAPTER 3

THE PRODUCT-OF-NORM

MODEL OF ARTIFICIAL

NEURAL NETWORKS

3.1 Motivation

As can be seen from previous chapters, two of the major issues in the design of
functional artiﬁcial neural networks (ANN) are [$391, SWC91]: (1) the ability to
store (or learn) any number of pre—speciﬁed set of data (patterns) successfully; (2)
the determination of a set of parameters (or weights, connections) which would achieve
(1) by using efficient off—line or on-line computations.

Dynamic learning has been a more attractive way in (adaptively) computing
the parameters of an ANN. For (static) feedforward artiﬁcial neural network mod-
els, the error-back-propagation is one of such dynamic methods [RHW86]. Pineda

[Pin87, Pin88, Pin89] and some other researchers have ported the back-propagation

21

techr

non

Inetl
The
net“

Capa

22

techniques to recurrent networks too. Other learning methods such as the one
proposed by Perlmutter [Pea89] in essence are error back-propagation method too.

However, most of the proposed algorithms proposed so far are computationally
expensive and the number of sets of data (patterns) which can be reliable learned
is limited. Moreover, additional data may be inadvertently learned making what is
known as spurious states.

The usual models of ANN ﬁx the architecture and the number of units (neurons)
ﬁrst, then seek to “compute” parameter values, either through off-line or on-line
methods, which render a given set of data stable equilibria of input/output pair.
The structure is not dependent on the number of elements of the set of data. These
networks spend lots of time in learning stage and are well known to have limited
capacity in storing arbitrary set of data.

In this proposed model, we make trade-offs: we eliminate the dynamic learning
of parameters but make the number of units or neurons dependent on the number
of data to be stored. The resulting network does not “waste” any time on adapting
its parameters and the given patterns are guaranteed to be the stable equilibria of
the system (i.e. they are guaranteed to be recalled). However, the network’s “size”
will depend on the number of desired data to be stored. The network becomes more

complex: the connectivity or interaction among neurons increases.

3.2

We st

with:

W [18?

23

3.2 Proposed Mathematical Model

We start the network by proposing a new general “energy” function of the network
with:

E =1ftm)Hne'H;, (3.1)
p l=1

where e1 9—- [(vi'l - v1), (viil — “02),. - . 7(1):] - vnlli
||e1||p means the p-th norm of vector e’ with 1 < p < 00,
(p = 1 and p = 00 need special treatment) ;
v,- : the output of the j-th neuron;
n : the number of neurons of the system;

m : the number of patterns (data set) to be stored;

v : the l-th speciﬁed pattern;
v- : the j-th component of vector V”;
f : a scaling function and f (m) > 0;
u,- : the input to the j-th neuron, and v,- = S(u,~);
S : a sigmoid function (a bounded, monotone increasing diffeomorphism);

In particular we have used 5 = tanh() in our simulations.

We deﬁne the system dynamic equations by

0E

12' : __
' 0v,-

i=1,2,...,n. (3.2)
This is a gradient-like system. It preserves all the characteristics of a gradient

system, such as no oscillations and global convergence. Since E Z O, the “energy”

is bounded from below. Thus, the system is guaranteed to converge to its limit set

which

(L) 1,1771

Proo

liter:

and

fen

24

which is composed of equilibria only [H874].

3.3 Stability Analysis

For the new artiﬁcial neural network architecture, the ﬁrst issue we want to address
is: if the desired patterns are the stable equilibria of the system, i.e. can they be

recalled? Theorem 3.1 answers this question.

Theorem 3.1 For the system deﬁned by equation (3.1) and equation (3.2), suppose
V" = [vf', 125’, . . . ,v’,“,’]T, l = 1,2, . . . ,m, are the desired/pre-speciﬁed patterns (note:

in this paper we use bold face to represent vectors), Then V", I = 1,2,.. . ,m, are

asymptotically stable equilibria of the system, i.e. they are guaranteed recallable.

Proof:
Since
el(v) = [(1)1 _ v1): (v2 _ ’02), - ° ~ 1(2):;1— vnllTa
then
e’(v") = [0, 0, . . . ,0]T
and
Ile’llp = 0 => E(V") = o.
If
VaiéV‘l Vl=l,2,...,m,
then

new)”, >0 v1=1,2,...,m.

- 1
e -
Srnc p

Therefor'
assume \
pattern l
that E13

V #V“.

 

System C

Where it

25

Since % > 0 and f(m) > 0, thus

E(V)>0 Wé {V'1,V"2,...,V"m}.
Therefore, V", l = 1,2,... ,m, are isolated minima of the “energy” function (we
assume V‘“, l = 1,2, . . . ,m, are distinct). Consider a small nerghberhood Q around

pattern k such that V"" is the only system equilibrium point inside 0, It is obvious
that E(V) is continuously differentiable over 9. E(V’k) = 0 and E(V) > 0 for
V 94 V“, V 6 Q. The derivative of the “energy” function along the trajectory of the
system can be expressed as:

dE(V) ” 8E 62),- ,

i=1

n BE 2 av.-
'-*Xdaﬂznr an

i=1

 

where 32—3: = 513%). By our choice of the function 5', we have git? > 0. Thus

dﬂV)

a <0 mrv¢v* mm

 

Therefore, according to Lyapunov’s stability theorem [Kha91], V'k is an asymptoti-
cally stable equilibrium point of the system.
The second question about the new system is if the regions of attraction for each
pre-speciﬁed patterns can be estimated, i.e. how stable are the desired patterns?
From the proof above we know that the “energy” function E can serve as a

Lyapunov function for the equilibrium point V‘k over the region Q(V"‘). Thus we

can cab

l::l.2
In H

_ l
wdlana

3.4

llhen u

unbev

The 53's“

 

26

can calculate the Lyapunov surface to estimate the regions of attract for each V‘“,
k=1,2,...,m.
In this section, we investigated the overall stability of the new architecture. We

will analyze this new model for speciﬁc norms in the following sections.

3.4 The Product-of-norm Model Using L2 Norm

When using L2 norm (p = 2) in the product-of—norm model, the “energy” function

can be written as:

E = gnmrﬁuetvmi
l=1
= gummnvrhvu:
= if (m)1'[[(v; ’+—v1)2 (valgv2)2+...+(v;I—vn)2]. (3.6)

The system dynamic equations now become

in = f(m)Z( (vfl-v.) f1 IIV'k- VH3
l=l kil
_—_ f(m )(f: v”’—v,-)ﬁ [fluff—22,) 2] i=1,2,...,n. (3.7)
l=1 k=1j-l
k¢l

From the above system dynamic equations, it is obvious that at any desired

pattern, say V", U = 0. It should also be observed that for a vector V to be a

sr'stem

its comj

Otherwi

 

Thus a!
range of

the IOllo
TheOre

“Tl,

27

system equilibrium point, i.e.

m) in? — v.) 11 [for — ..,)2] ,

(=1 k=1 i=1
k¢l

its components must satisfy the following conditions:

min{v‘-"I :l=1,,.2 ..,m}$v,- i=1,2,...,n
(3.8)
max{v,-‘“ :l=1,,.2 ..,m}2v,~ i=1,2,...,n.
Otherwise, there will exist an i 6 {1,2, . . . ,n} such that
(vII—vg)ﬁzl(v; - I‘-vJ-) 2>0 Vl€{1,2,...,m} (3.9)
52;} “=1
or
. “—v,) )H —vJ-) <0 VlE{1,2,...,m}. (3.10)
k=1j=1(v
k 1

Thus it,- # 0, and V is not a system equilibrium point. Inequality (3.8) speciﬁes the
range of the locations of all the system equilibria. To summarize the above, we have

the following theorem:

Theorem 3.2 For the dynamic system deﬁned by equation (3.7)
o The desired patterns V", l = 1, 2, . . . , m, are the asymptotically stable equilibria.

o For a vector V to be a system equilibrium point, it must satisfy the inequality

(3.8).

Note that this theorem can be applied to general pth norm also.

3AL1

For our

vestrge

and the

Theore

 

 

lies:

0T.

f0

be

Proof:

Let

28
3.4.1 One-neuron/m—pattern Case

For one-neuron/m-pattern case, the system’s characteristic can even be further in-

vestigated. When n = 1, the “energy” function becomes

E = —f(m) H(v" — v)2 (3.11)

and the system dynamic equation becomes:

m

u = f(m) (E: (1)“1 — v) ﬁw'k — v)2 . (3.12)
=1 iii

Theorem 3.3 The system represented by equation (3.12) has the following proper-

ties:
o The system has a total of 2m — 1 equilibrium points.

0 The pre-speciﬁed patterns v". (l = 1,2, . . . ,m) are asymptotic stable equilibria.

The other m — 1 equilibria are unstable.

0 Suppose v", (I = l,2,...,m), are distinct and v"1 < v"2 < < 22"”, then
for every i E {l,2,...,m — 1}, there is an unstable equilibrium point v'li'f‘l'”

between v‘“ and v41“).

Proof:

Let us deﬁne a new function:

m

H(v) = f(m): (v"l — v) ﬁ(v'k - v)2 . (3.13)

l=l k=l
k¢l

Then ti
Since ii
That m

Re“

 

Where w

 

II [S 0bV

ThuS r I!
b .
e gll’en

’31 a

29

Then the roots of H (v) are the equilibria of the system deﬁned by equation (3.12).
Since H (v) is a polynomial of order 2m — 1, it generally has a total of 2m — 1 roots.
That means system (3.12) will have at most 2m - 1 equilibria.

Rewrite H(v) as

H(v) = f(m)[ImI(v"-v)] filer-v)
l=l l=1 til
= Ho(v)H1(v), (3.14)

where we deﬁne

Hotv) = f(m) (fun—w], (3.15)

H1(v) = in: ﬁ (v'l -- v) . (3.16)
52::

It is obvious that at the speciﬁed patterns 12'“ (l = 1, 2, . . . , m),
110(1)“) = 0 and H(v") .-= 0. (3.17)

Thus 1)", l = 1, 2, . . . , m, are equilibria of the system. The other m — 1 equilibria will
be given by the roots of function H1(v).

At a speciﬁc pattern v”,

H100“) = iﬁ(vtk_v¢i)

Since 2-

i.e. sigr.

Because
interval
HUmber
these to,

The

 

been 5110
1. Can b(

The f

EVa

lu

30

= liljwk — v")] [ f} (M — m] . (3.13)

k=i+l

Since v'k — v" < 0 for every k 6 (1,2,. .. ,i — 1],

{-1
Sign (H(v‘k - v”)) = (-1)“‘,
k=1
i.e. sign(H1(v‘f)) = (—1)l'1. Therefore
H1(v")H1(v'(‘+1))< 0 i = 1,2, . . . ,m — 1.

Because H1(v) is a continuous function of v, there must be a root v‘llt‘H) in each
interval (v‘i,v’(‘+1)), i = l,2,...,m — 1, such that H1(v'li’f+1)) = 0. The total
number of roots we have now is 2m — 1. Since H (v) is a polynomial of order 2m — 1,
these roots must be the only equilibria of the system.

The stability of the system at the pre—speciﬁed patterns v", i = 1, 2, . . . , m, has
been shown at previous section. The instability of patterns v‘li'l“), i = 1, 2, . . . , m —
1, can be proved as follows:

The function H1 (v) may now be expressed as

H1(v) = eniil(v‘(q’q+ll — v), (3.19)

q=l

where c > 0, and c is a constant.

Evaluating the derivative of equation (3.14) with respect to v at v‘lr'r’”), we obtain

d d
__ ’.(r,r+l) = :(r,r+l) _ .(r'r+])
dell ) Ho(v )dv H1(v ). (3.20)

where

Thereft

Since

 

thus

Thereﬁ
So the)

Fig-
HeUron
are glo
equatic
Points ‘
toms b

CIOSS I,[

31

where
d 1 +1) "H ( +1) ( +1)
_ :1 r,r : _ t q,q _ n1 r,r
de1(v ) c 3;]; (v v )
q¢r
m—1
= (—1)'"‘c H Iv‘m’“) — NW”). (3.21)
:1
3.1+
Therefore
- d 111(1'1'4-1) T-I
Sign EH1“) ’ ) = (—1) . (3.22)
Since
sign (Ho(v'("'+1))) = sign (f(m) LH(v'l — v'("'+1))])
:1
thus
_d_H(v1-1(r,r+1)) = 1H (vt(r,r+1))Ho(vn-(r,r+1)) < 0 (3 24)
do dv ‘ ‘ '
Therefore, Vim“), r = 1,2, . . . ,m — 1, are the maxima of the “energy” function.
So they are unstable equilibria. QED

Figure 3.1 and 3.3 show two example plots of the “energy” function for two one-
neuron/m-pattern cases. As we can see from the plots, all the desired patterns
are global minima of the “energy” function. The corresponding system differential
equations are plotted in Figure 3.2 and 3.4 where we can see that all the desired
points are the roots at which the plots cross the :r-axis with a negative slope. All the
roots between the desired ones are between two consecutive stable ones and the plots

cross the :r-axis at those points with a positive slope.

32

 

0.002
0.0013
0.0016
0.0014

L740.0012

g5 0.001
£00003
0.0006
0.0004
0.0002

 

 

 

 

 

 

 

 

 

Figure 3.2. The plot of function H(v) with v"‘1 = 0.25 and v'2 = 0.75

Figure

 

33

 

3.5e-005 1 1 1 1 1
3e-005
2.5e-005
E26005
Elie-005
le—005

5e-006

 

 

 

0
0.2

 

Figure 3.3. The plot of “energy” function E = 0.5(0.25 — 2:)2(0.5 -— :r)2(0.75 -— 0:)2

 

0.002 1 1 1 1 1
0.0015
0.001
0.0005

-0.0005
-0.001
-0.0015

-0.002 J_ 1 J l l
0.2 0.3 0.4 0.5 0.6 0.7 0.8

 

 

 

 

Figure 3.4. The plot of function H(v) with v‘1 = 0.25 v“2 = 0.5 and v"'3 = 0.75

Le:
neuron
each s;
for ex:
Withor

m-lr

and thel

 

 

 

The [OCT
Stable e.

lOCatiOn.

C3583 for

34

Let us look back again at the proof of theorem (3.3). We ﬁnd that for the one-
neuron/m-pattern case, we can even further control the regions of attractions for

each speciﬁed pattern by specifying the unstable equilibria between the stable ones.

1 2

For example, suppose we want to store 22" , v“ , - - - , and 22"" as the stable equilibria.
Without loosing generality, we assume v"1 < v"2 < < u”, then we can select

m — 1 points: 111,122, - . - , 11““, such that
‘1 < 0""; (3.25)

v‘l<vl<v'2<v2<v’3<~-<vm

and the system differential equation can be constructed as

u = (1)"1 — v)(v1- v)(v"2 — v) - - - (v'm — v)
= Lﬁ(v" — 12)] [iii-1(2)“ — v)] . (3.26)

The locations of 121, v2

,---, and v"“1 will control the regions of attractions for those
stable equilibria. Figure 3.5 and ﬁgure 3.6 show two example functions H (v) deﬁned

this way.

3.4.2 The Number of Equilibria for Product-of-norm Net-

works

From theorem (3.3), we have already known the number of equilibria and their
locations for the case of one-neuron/m-pattern. Here we will investigate some special

cases for two-neuron networks.

 

Figure
0.3 an.

Fléun
9911in]

 

35

 

.9
O
o
o
to

1
1

I
53
O
o
o
[\3

I

1

I
.o
O
o
O
up
I
L

 

 

 

0.2 0.3 0.4 0.5 0.6 0.7

Figure 3.5. An H(v) function with stable equilibria 0.2, 0.4, 0.7 and unstable ones at
0.3 and 0.6

 

 

 

 

 

Figure 3.6. An H(v) function with stable equilibria 0.2, 0.4, 0.7 and unstable
equilibria at 0.3, 0.5

When 1

Since the
number 0:
the syster

For 111

It is Obvit
is Stable

FOr 1.;
the Sl’Ste

The ,

TherefOI

36

When n = 2, the system differential equations are :

0, = [23(1):l — v1) in [(vf" — 121)2 + (ka — v2)2]

:1 :1

m *3? . (3.27)
112 = 2(v31— 1’2) H [(vik — v112 + (”5" — "2)21

l=1 £2}

Since the shift or rotation of the axes does not change the relative locations and the
number of equilibria of the system, we can always transform the system to simplify
the system equations in the analysis.

For the case of n = 2 and m = l, The system is trivial

1'11 = (vi - '01)

(3.28)

112 = (v; — 122)

It is obvious that the system only has one equilibrium point and the equilibrium point
is stable.

For the case of m = 2, since we can always transform the coordinates to simplify
the system, we can assume, without losing generality, 051 = 1252 = 0.

The system equations are

1'11 = (v? - 'v1)l('vi‘2 - v1)? + v3] + (v12 - 121)[(vi‘1 - v1)2 + v3]

(3.29)

122 = —v2[(v‘2 — 111)2 + v3] — 121(1)." — v2)? + v3]

Therefore, at system equilibria we have

u; = 0 and (vi'1 — v1)(v;'2 — 211)2 + (121"2 — v1)(v‘l — 121)2 = 0.

501:1 = I"

for th1
only inves
straight 11'

The 5}

and

AI an €

37

01 v02 . . .
So v1 = v“, v"‘2 or 31%. Thus the system has a total of three equrllbrla.

For the case of m = 3 and n = 2, the system becomes complicated. Here we

only investigate the special case: the three pre-speciﬁed patterns are located on one

straight line. Without losing generality, we assume v31 = of = v;3 = 0 and 12:3 = 0.

The system differential equations become

1'11 = (vil - v1)[(vi'2 - v1)2 + villvf + 03]
+(vi‘2 - 1201(1);1 - 111)2 + 1231112? + v31

+(—vlll(vi2 '— ”112 + vilKvil — ”02 + ”21 (3-30)
and

112 = ’vzflvi'2 - 01)“ + villlvi3 - '01)2 + 0%]
+111? + v§][(v;‘ — 1).)? + v5]

+102“ — 1).)? + vilhvf" — 1).)? + v31}. (3.31)
At an equilibrium point v2 = 0 and

0 = ('01.1 — ”111(012 " ”llzllvil +
(”i2 - ”llllvil " 1)2llvil "' lllllviz — ”1121104l - ”llzl

0 = 112(v1'2 — 111) + v]"(v;‘l — v1) — v1(v]"2 — v1)(v;'l — v1). (3.32)

Thus r1 =

The systen

3.5 T

F01 the C35

CXISI. The

 

software or

Since 1},

SFStem gram

38

Thus v1 = 0 or

0 = 221(vi‘2 — v1) + 121(vi'1 - v1) — (vi'2 — v1)(v;'1 — v1) (3.33)

 

 

—(v;1 + 1:021: ,/(..;1)2_ 1111):: + (112)?
-3 '

=> ’01 = (3.34)

The system has a total of ﬁve equilibria.

3.5 The Case of Loo Norm

For the case of p = 00 ( Loo norm ), the derivative of the norm does not always
exist. The “energy” function is not continuous. This case need special treatment in

software or hardware implementations. Here we deﬁne the “energy” function as:

E = f(m)ﬁ|lV"-Vlloo
l=1

f(m) ﬁ max Iv;fl — vJ-I. (3.35)
l=l '7

Since the partial derivative aimllV"I - Vlloo does not always exist, to make the

system gradient-like, we deﬁne

0 if (1);.“ — 12,-! 71 111311111123“I - v.)
a .1 def . 1 .1
EZIIV - Vlloo — —1 1f v1- — 12,-: —mJax|vJ — ml (336)

° .1 _ cl '
1 if v,- -— v, — mjax [v]- — vJI

"z“ a“. (3-37)

l
The syster

 

Since the 1

simulation

39

The system differential equations are deﬁned as

u,- = f(m) E 6;..- H mJax lug" — vJ- . (3.38)
i l=1
111

Since the a. is not continuous, we need to choose f (m) to be very small in software

simulations.

I
\

1'

/

\
1.

ll
11

l

I

Figure 3.7. An example vector ﬁeld for Loo norm

Figure 3.7 shows how the vector ﬁeld of this system will look like around a stable
equilibrium point in two dimension case. As it can be seen from the ﬁgure that the
system has limit number of directions to go at any point. If we ﬁx the step size fo the
system, the system will become a grid system. The system will move from one node
of the grid to another. This may make it easier to simulate the system using digital

computers.

40

3.6 Simulation Results

To verify the correctness of the theoretic results and demonstrate how this model can
be used in practical applications, we present several simulation results of this model
used in image recognition and association. In the following examples, we use the
outputs of neurons to represent pixels of images. Each pixel is shown as a square
in the ﬁgures below. The neuron output value of 1 represents a full back square
in the ﬁgures, the value of 0 represents a full white square and values between 0
and 1 represent black squares proportional to the values. We use L2 norm in the
simulations.

In example one, we considered a 304-neuron network. For each desired image, we
stored the image and an assigned code in the network. This way we can read the
output of the network by reading the outputs of the neurons representing the codes.
To compare its performance with BAM [WCM89, WCM90, WCJ91], we have used
the same image patterns as those used in [WCM89].

Figure 3.8 shows an example where six image-code pairs are stored. When the
network was given the initial input as shown in the ﬁgure, the network retrieved the
corresponding images correctly.

In example two, we used a network with 37011 (169x219) neurons to store two
facial images as shown in ﬁgure 3.9. Figure 3.10 shows that given a part of an image,

the network was able to retrieve the who image correctly.

41

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.8. Initial and ﬁnal outputs of the network with six pre-speciﬁed patterns

42

 

 

 

 

 

 

 

 

 

Figure 3.9. Stored two facial images

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.10. Image retrieving process

CHAPTER 4

GENERALIZATION AND

VARIATIONS OF THE

“ENERGY” FUNCTION FOR

PRODUCT-OF-NORM MODEL

In general, the proposed product-of—norm model is very complex: neurons are fully
connected to one another. For the L2 norm case, the order of the system increases as
the number of data sets increases. It is also difﬁcult to modularize or parallelize the
system in software simulations or hardware implementations. In this chapter, we will
investigate various ways to simplify the system. We will also present some example
ways of tailoring the “energy” function for speciﬁc applications. Our focus here is
the L2 norm form of the product-of-norm model. However, the techniques discussed

here can be applied to other forms of norm too.

43

44

4.1 The Log Form of the “Energy” Function

If we take log() on both sides of the original “energy” function (3.1). We get a new

“energy” function as:

1‘2 élogE=f.(m)+}:logne'H:, (4.1)
l=1

where fl is a function of m. It is a scaling function like f (m)

Now the system differential equations can be calculated as:

111'

ll
1
IV]

 

—-7||e‘||” i=1,2,...,n. (4.2)

These dynamic equations describe a gradient-like system. Different from the
original product-of-norm model (3.1), this system has been changed to a sum-ofnorm
form. In the system differential equations, the terms corresponding to each pattern are
now computed separately and then added together. However, the “energy” function
is no longer bounded below. Fortunately, the points at which the “energy” function
goes to negative inﬁnity are the pre-speciﬁed patterns. In simulations, the system can
stop at certain point according to the speciﬁed accuracy. Thus in simulation, even
though the “energy” function is not bounded below, starting from any initial point,
the system will still “converge” to one of its stored patterns.

To avoid the problem that the “energy” function is not bounded from below, we

45

can modify the “energy” function as follows:

E 2 S": [log(ne'n; + on] . (4.3)

l=1

where c) > 0, l = 1,2, . . . ,m, are pre-chosen constants.
The new system differential equations become

3

. m 1 , p .
i=- —— =1,2,...,. 4.4
" Elle‘llhczavd'e“? ‘ " ( )

This system “energy” function now is bounded from below. In addition, it preserves
the features of gradient systems. The terms corresponding to each pre-speciﬁed
pattern in the system differential equations are added together. However, the pre-
speciﬁed patterns will generally no longer be system equilibria. We will address the
relation between the equilibria of the new log-form system and the original product-

of—norm system next.

Lemma 4.1 For a system with “energy” function

E = ﬁtlle’llt + or). (4.5)

where c; > 0 (l = 1,2, . . . ,m), and c; are constants. The system diﬂerential equations

are deﬁned by:

. m a r .
u,- = 1}; 8—mue'ngg [||e'°||;+c,.] z = l,2,...,n. (4.6)
- 11321

If V' is a (isolated) local energy minima of the system (4.5), then V' is a (isolated)

46

local minima of the “energy” function of (4.3).

Proof:

To distinguish the two “energy” functions, we refer to the “energy” deﬁned by

equation (4.3) as E in this proof. Since V“ is a local minima of the “energy” function

of (4.5), there exists a small neighborhood I) of V", V" 6 R, such that for any point

ofV#V*andV€Q,wehave

E(V) — E(V‘) 2 0.

For the system (4.5), since E > 0,

Thus

Since

Thus

E(V)—E(V') = logE(V)—logE(V")

_ E(V)

 

E( )= log E(V) and E(V") = log E(V‘).

(4.7)

(4.3)

(4.9)

(4.10)

(4.11)

(4.12)

47

and

 

2 0. (4.13)

Therefore, V' is a local minima of the new “energy” function (4.3) too. It can been
seen from the proof above that if we change the inequalities to strict form we can

prove the case of isolated “energy” minima. We omit the proof here. QED.

Lemma 4.2 A pattern V" is an asymptotically stable equilibrium of the system

deﬁned by (4.4)/(4.6) iff it is an isolated local minima of the “energy”function deﬁned

by (4.3)/( 4.5).

Proof:

Firstly, we want to show that if V" is an asymptotically stable equilibrium of the
system (4.4)/(4.6) it must be a local “energy” minima of (4.3)/(4.5). We prove it by
contradiction method.

Case 1. Suppose V‘ is an asymptotically stable equilibrium of (4.4)/(4.6), but it
is not a minima of the “energy” function (4.3)/(4.5). Deﬁne a small neighborhood D
around V‘, such that V“ is the only system equilibrium point inside D. Consider a
new function E 2- E (V‘) — E (V) It is obvious that E is continuously differentiable
over D. Since V‘ is not a local minima of the energy function E, there exits some

V0 with arbitrarily small llVo — V‘H and

E(Vo) < E(V“) => E(vo) > E(vr) = 0. (4.14)

Deﬁne a set 5 as

n = {v e B. : E(V) > 0} (4.15)

48

where the ball B, 3 {V E R" : ||V — V*|| g r} and B, C D. Along the trajectory of

the system

E=—E

5E 8v,- .
- £37317;

3v,- 2

= 2:51:12;

> 0 for VEB. (4.16)

According to Chetavev’s theorem [Kha91], V" is unstable. Contradiction! thus, V"
must be a local energy minima.

Case 2. Suppose V“ is a local energy minima of the system (4.3)/(4.5) but not an
isolated one. Then for any small neighborhood of V" deﬁned by Q = {V : [lV—V'II <
e}, e > 0, there exits a local energy minima V0 6 I? such that 3% = 0 —-> Ulv0 = 0.
Thus V" is not asymptotically stable. contradiction! Therefore, V‘ has to be an
isolated local energy minima.

Now we will show that if V“ is an isolated “energy” minima, then it is an
asymptotically stable equilibrium point.

Since V“ is an isolated “energy” minima,

0E
WIV. = 0. (4.17)

Thus V" is an equilibrium of the system (4.4)/(4.6). Consider an small neighborhood
0 around V‘ such that V“ is the only system equilibrium point inside (I. Then

consider the continuously differentiable function E(V) over It, E(V') = 0 and

49

E(V) > 0 for V 75 V‘, V E 9. Thus E can serve as a Lyapunov function candidate

over 9. Along the trajectory of the system,

 

 

dE(V) _ dE(V)
dt _ dt
< 0 for v¢vzven (4.13)

Thus V" is an asymptotically stable equilibrium of the system (4.4)/(4.6). QED

Lemma 4.3 Suppose V", l = 1,2, . . . ,m, are the pre-speciﬁed patterns of the sys-
tem deﬁned by equations (4.5) and (4.6). Given any small positive constant r, there
always exist a set of constants c; > 0, l = 1,2, . . . ,m, such that, for l = 1,2, . . . ,m,
the system (4.5) has a local “energy” minima V" and “V"1 — ‘7‘!“ < r; V" is an

equilibrium point of system (4.6).

Proof:

For simplicity, let us start by choosing

0<c1<1 l=1,2,...,m.

Deﬁne a small neighborhood (I of V"‘1 as:

Q ={(V.1+ AV) I ”AV” E 7'1, 7" > T1) 0} (4.19)

such that

v:1 go, l=2,3,...,m. (4.20)

50

Deﬁne another neighborhood 91 of V"1 as:
01 = {(V'"1 + AV) : r1 2 “AV“ 2 r2, r1 > r2 > 0}.
The Open neighborhood Q; is deﬁned as
$12 = Q — (21.
Consider the system “energy” function (4.5) at pattern V'I:

E(V“) [Ile‘(V“)II£ + c1] E -ltve)

= CrEn—1(Vd)a

here [Ie1(V'1)||p = 0 and we deﬁne

13,._1 = II [Hang + a] .
l=2
At a point V = V'1+AV, V E 91,
E(V) = [A8 + Cl]En_1(V),

where Ac = ||e1(V)||§. Therefore,

E(V) — E(V’l) = [Ae +, c1]E _1(V'I+AV) — clEn_1(V'l)

= AeEn-1(V’1+AV) + C1 En-1(V'1+AV) '-

(4.21)

(4.22)

(4.23)

(4.24)

(4.25)

51

ClEn-1(VT1)
> AeE.._1(V‘1+AV) + c1173.-. (v:1+ AV) —

61En_1(VT1). (4.26)
where En_1(V) = {I}; [lle’llg + 1] and En_1(V) = 117;, Mel”; . For V‘1+AV 6 91,
Ae > 0, En_1(V*1+AV) > 0 and c1 > 0,

Since Ae and En-1 (V*l+ AV) are continuous with respect to AV over the compact set
Q1, there exists a > 0, such that AeEn-1(V"l+AV) > a for every point (V‘1+AV) 6
I'll. Similarly, there exists b > 0 such that En_.1(V"1) < b for V‘1+ AV E 91.

Therefore, for V’l+ AV E 521, we can always ﬁnd a constant c1 > 0 such that
a
Cl < h and c1 < 1. (4.27)
Thus, for any (V‘1+AV) E (11

AeE.._1(V"'1+AV) + C; [E -1(V"1+AV) — E,_,(v*1)] > a — clb

> 0. (4.28)

Since E (V) is continuous over the compact set 9, E (V) must have a minima V" 6 0

such that E(V“) S E (V) for every V 6 9. From inequality (4.28), we know

“V”l — V'lll < r2 < r, i.e. V“ 6 02. (4.29)

 

52

That is V" is a local “energy” minima. Since this is a gradient like system, V“ is a
equilibrium point of the system.
Similarly we can prove the existence of c), l = 2,3, . . . ,m. we omit the detail

here. QED

From lemma 4.1 and lemma 4.3 we have the following theorem:

Theorem 4.1 Suppose V", l = l,2,...,m, are the pre-speciﬁed patterns of the
system deﬁned by equations (4.3). Given any small positive constant r, there always
exist a set of constants C) > 0, l = l,2,...,m, such that, for l = l,2,...,m, the

system (4.3) has a equilibrium point V" and [[V"'1 — V‘IH < r.

For the case of L2 norm, we can further study the stability of the equilibria of the

new system.

Lemma 4.4 For case of L2 norm, suppose V", l = 1,2, . . . , m, are the pre-speciﬁed
distinct patterns of the system deﬁned by equations (4.5) and (4.6). Given any small
positive constant 6, there always exist a set of constants c) > 0, l = 1,2, . . . ,m, such
that, forl = l,2,...,m, the system (4.6) has a corresponding asymptotically stable

equilibrium point V" and ”V" — V"[| < 6.

Proof:

For L2 norm, equation (4.6) becomes

4:4: (v:’—v.-)II [Ile"||§+61] 1'

(=1

l,2,...,n. (4.30)

k=l
k¢l

53

It can be written in general vector form as
U = f(U)1 (4°31)

where f is a C1 vector ﬁeld. Similarly, by taking f(m) = 2, we can write equation
(3.7) as

U = f"(U). (4.32)

Note that when c1: 0, l=1,2, . . . ,m, we have f = f‘.

We have already showed V", l = 1, 2, . . . , m, are asymptotically stable equilibria
of the system (3.7). The Jacobian matrix of system (3.7) evaluated at a speciﬁed
pattern V" is

2\1

Df (U)

 

,,.. = (4.33)

 

 

’\n
J

where

m n .. .12 av.-
A1=-HLZ(”jk‘vjl[[a—m

k=l i=1
k¢l

 

] < 0. (4.34)

Therefore V", l = 1, 2, . . . , m, are hyperbolic equilibria. It is obvious that D f (U)|v.1
is invertible.

We deﬁne the Cl—norm [IhIIC1 of a vector ﬁeld h E v(R") (v(R") means all the
vector ﬁeld over R”) to be the least upper bound of all the numbers of ||h(U)|| and
||Dh(U)||, for U E R". According to the perturbation theory [H874], for any 6 > 0

there always exists a neighborhood a = {g : ||g — f‘HCx < r, r > 0} C v(R")

54

of f“ such that for any 9 E a there is a unique equilibrium V"1 of U = g(U) and

“V“1 — V*1[[< 6.

Since [If—f‘llcx is a continuous function of C), l = 1,2, . . . ,m, and [If-f‘llcx = 0
when c1 = 0, l = 1,2, . . . ,m, we can always ﬁnd a c1 = c; > 0 such that the vector
ﬁeld

f“) E a, (4.35)
where
M) = f()I..=..=...=c-=o. (436)

Therefore, there is a unique equilibrium point V“ of U = f(1l(U) and “V"1 —
V‘IH < 6. Further more V“ is asymptotically stable and hyperbolic.

Now consider a vector ﬁeld f (2) deﬁned by

f(2)() = f()lcl=c;,C3=C4=-~=cm=0 (4.37)

Since V'"1 is a hyperbolic equilibrium of U = f(1)(U), the eigenvalues of the J acobian

matrix D f (1)

 

v.1 will have nonzero real parts. Thus, D f(l)

 

9.1 is invertible. According
to the perturbation theorems, there exists a neighborhood 01 of f (1) such that for
any g, E 01 there is a unique equilibrium point V“ of U = g1(U) and ”V” —
V'lH < 61 < 6. Since ”V" — V“|| < e, we can always make cl small enough so that
“Val — V*1|| < 6. Further more, V1.1 is hyperbolic and asymptotically stable. Since
Hflzlllcx is a continuous function of c2, and fl?) 2 fm when c2 = 0, there exists a

open set (0, b1) with b1 > 0 such that for very 62 E (0, b1), fl“) 6 01. Thus, there exits

55
a unique asymptotically stable equilibrium point VT] of U = f (2)(U) and

—._-*

(IV 1 — V“|| < e. (4.33)

Since V‘“ is a hyperbolic asymptotically stable equilibrium point of f (1) too (this can
be veriﬁed like those of f ‘, we omit the detail here), we can similarly show that there
exists a set (0,b2) with b2 > 0 such that for every c2 6 (0,b2), there exits a unique

asymptotically stable equilibrium point V‘“ of U = f(2)(U) and
[IV‘2 — V‘zll < e. (4.39)

We can choose c; E (0,min(b1, b2)) to have both (4.38) and (4.39) hold.

Similarly, we can show that there exist C3 > 0, c4 > 0, . . ., cm > 0 such that there
exits a unique asymptotically stable equilibrium V", l = 3,4,... ,m and ”V" —
V"|| < e. QED

From lemma (4.1), (4.2) and (4.4) we have the following theorem:

Theorem 4.2 For case of L2 norm, suppose V", l = l,2,...,m, are the pre-
speciﬁed distinct patterns of the system deﬁned by equations (4.3) and (4.4). Given
any small positive constant 6, there always exist a set of constants c) > 0, l =
l,2,...,m, such that, forl = l,2,...,m, the system (4.4) has a corresponding

asymptotically stable equilibrium point V" and ||V"" —- V‘lll < e.

The proof is straight forward. We omit it here.
From theorem 4.2, we know that we can use the new log form system to store

patterns as close to (not equal to) the desired ones as we want.

56

For the new system, the right hand side of the equations (4.4) is in the form of
summation, and the effects of the desired patterns are additive. This would make it
easier to implement the model in standard software/ hardware modules which can be
easily parallelized. This parallelized architecture of modules is depicted in Figure 4.1.
As can be seen from the picture, each processing module works on one pattern
alone. Therefore, the process can be parallelized. Some simulations have shown that,
compare to those networks which use the original “energy” function, networks using
this log-form “energy” function converge (retrieve) faster. It can also be observed
that it is easier for this system to “learn” new patterns. To “learn” (store) a new
pattern, the original structure of the system does not have to be changed, only a new
module is added to the original network. The stored patterns of the old system are

still stored (to certain degree) in the new system.

4.2 Localization of the Product-of—norm Model

4.2.1 Motivation

The original product-of—norm system model described in equation (3.7) is fairly
complex: the feedbacks are nonlinear; the structure is fully connected. As the number
of units (neurons) increases, the number of connections will increase in the order of n“.
This will limit the size of the network in implementations. In software, the limitation
will appear in the time to simulate a reasonable-sized network. In hardware, the
limitation will manifest itself in the full connectivity. Thus it appears to be difﬁcult
to directly implement this model in hardware using present technology. Therefore it

is necessary to simplify the original networks.

 

 

Processing
Module 1

 

 

 

 

 

 

 

 

 

Processing
Module M

 

 

Processing
Module 2 '
Neurons

 

 

 

 

Figure 4.1. A modularized structure of Product-of-norm model

 

 

58

In last section we investigated the way to use log-formed “energy” function to
modularize the original system. In this section we will explore other approaches to
simplify the network. Our ﬁrst goal here is to localizethe connectivity. As oppose to
haphazard structure, we seek to select a local structure that in addition to preserving
some of the delineated features of the original product-of—norm model, would only
add a limited number of spurious equilibria.

The speciﬁc approach rests on retaining the gradient systems property while
rendering the pre-speciﬁed patterns as isolated minima of the “energy” function.
Hence it can be guaranteed that the pre-speciﬁed patterns will always be asymptotic

stable equilibria of the system.

4.2.2 Subgrouping Method

A naive way to simplify the network is to divide the whole network into smaller
subnetworks which are isolated from one another. One builds each subnetwork
separately and assembles the small networks together to form the large network.
Figure 4.2 depicts an example of a network grouped this way. The “energy” function

which describes this model is:

E = i ITZ('J.‘-'-vj)2+l'nl2(1).l"-v1)2

l=IjEN1 [=1 jEN2
+... +11 2 (v;l—vj)2 , (4.40)
l=1j€Np

Where Nk, for k = 1,2, . . . , p, are the index of neuron in subnet k (p is the total

number of subnetworks). N,- r) NJ- : 05, for i 9193' and i,j 6 {1,2, . . . ,p}. The system

59

dynamic equations are:

Di=-—. i=1,2,...,n (4.41)

i.e.

4:701) lion-10ft : («at—1).)? . (4.42)

k=1 JEN}.
1‘9“ 16M.

Since at the pre-speciﬁed pattern V", the “energy” E(V'“) = 0 and E(V) > 0
for V 94 V", l = 1,2, . . . ,m, V" are the isolated “energy” minima of the system.
Therefore they are asymptotic stable equilibria.

However, all the combinations of the minima of the separated p subnets are stable
equilibria of the system also. This leads to the generation of more so-called spurious
equilibria. Thus the region of attractions for each equilibrium point will become
smaller than that in the original product-of—norm model.

In this model, each neuron is only directly connected to neurons in its own subset.
Compared to the original model, the connectivity has dropped dramatically. A similar
approach is to divide the original network into smaller ones too. However, this time
each small subnetworks is coupled with one or more other subnetworks. That is, the
subnets overlap with one another. Figure 4.3 shows one example structure of such
network whose original network is divided into four subnetworks. The neurons which
are marked in dark color belong to two subnets, while the rest belong to one subnet
only.

The motivation of building subnetworks with overlap is to overcome the generation
of spurious equilibria of the previous model. In the previous model, each subnetwork
works almost independently. Thus any combination of the “energy” minima of each

subnet composes one system stable equilibrium point. While in this new model,

60

the “energy” minima of one subnet will affect the minima of other subnets. Thus
intuitively, it should have fewer spurious equilibria than the previous model does.

Compared to the previous model, this model will have more connections. The
network will be a little more complex. However, compared to the original product-
of-norm model, the connectivity is much less and the structure is much simpler. The
system is still a gradient-like system and retains the features of a gradient system. The
choice of the size of each fully connected subnetworks and the degree of overlapping
with their “neighbor” subnetworks should be chosen judiciously for each application,
i.e. they should be application dependent. We observe that there will always be a
trade-off between the connectivity of these subnetworks and the application-related
performance.

The “energy” function for this new approach can be formalized as:
l m kl m :l
E = a H 211.- -v.-)2+II 21v.- w)?
(=1J'5N; (=1 jeNz’

+ . .. + ﬁ 2 (1);! _ 101-)2 , (4.43)

l=l jENT',

where p is the total number of subnetworks. N = {1,2,. . . ,n}, N: C N for every
i€{1,2,...,p}.Ni’f'i{N,’,N§,...,N,-’_,,N,-’+1,...,N1’,}9E 45 for i = l,2,...,p, and
ULM=N.

If we deﬁne I‘(i) to be the set of subnets which contains neuron i, that is

mmmm,gnwg MM)

61

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Group 3

Figure 4.2. An example structure of dividing a big network into isolated smaller
subnetworks

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 4.3. An example structure of dividing a big network into overlapped smaller
subnetworks

and

i e N; 4:) N,’c c Ni), (4.45)

then, the system dynamic equation for the model can be expressed as

11,-: f(m) Z Z (1):“ — v.) H 2 (vs-”f — vJ-)2 , i=1,2,...,n. (4.46)
N66I‘(i) l=l 52:} JEN;

As we will show in the simulation examples later, if the portion of neurons correspond-
ing to those coupling neurons (neurons belonging to more than one group, shown in
dark color in Figure 4.3) are the same for all the pre-speciﬁed patterns, the coupling
neurons may lose their coupling effects, i.e. they may fail to pass information between
the two overlapping subnetworks, therefore, they cannot prevent the generation of

spurious equilibria.

4.2.3 Simulation Results

To investigate the performance of the two new models, we built a network with 288
neurons using different models presented in previous section to store three patterns.
Figure 4.4 shows the patterns/ images to be stored by the network. Figure 4.5 shows
the two test patterns which we used as input to the network.

Network one is built using the “energy” function described by equation (4.40).
Figure 4.6 shows the grouping used in this network, which does not use overlapping
method. Network two is built using the “energy” function (4.43). Figure 4.7 shows the
grouping used in this network. In the simulation, both networks successfully retrieved

the correct patterns when applied with the test patterns shown in ﬁgure 4.5.

63

   

Figure 4.4. Three stored patterns

   

Figure 4.5. Test patterns

64

Figure 4.8 and ﬁgure 4.9 show the retrieve process for these two network when a
given input is composed of half of the image “1” and half of the image of“3”. The
network which was built using overlap method successfully retrieved a stored image

“1”, while the one with no overlap failed to retrieve a stored image.

Group 1

/
g

 

Figure 4.6. No over-lap grouping

. -----"‘U

l

 

Figure 4.7. Over-lap grouping

From the simulation results presented above, we can make the following remarks:

0 The grouping of the neurons affects the performance of the network. The

65

   

Figure 4.8. The retrieving process of the network using non-overlap method

   

Figure 4.9. The retrieving process of the network using overlap method

66
grouping is dependent on the patterns to be stored.

o The model of overlapping groups has better overall performance for correct
retrieval of the stored patterns. This demonstrated that the overlapping among
subnetworks is essential for prohibition of the generation of spurious equilibria

and the enlargement of the basin of attraction for the desired patterns.

4.3 Modiﬁed Product-of—norm Models for Recog-

nizing Shifted / Translated Input Patterns

4.3.1 Motivation

The original product-of—norm model described in equations (3.7) cannot retrieve the
correct patterns effectively when the input pattern is a shifted/translated stored
pattern. An example network which stored patterns shown in ﬁgure 4.4 failed to
retrieve the correct pattern when given a shifted version of the stored pattern as
shown in ﬁgure 4.10. In this section we will explore various approaches to modify the
original model, so that it can tolerate (to certain degree) shifted/ translated images
while retaining the features of being a gradient-like system. This section also serves
as an example of how the “energy” function can be tailored to suit speciﬁc application

needs.

4.3.2 Modiﬁed “Energy” Functions

We seek to modify the “energy” function to take the shift/translation effects into

account. Let us assume the images to be stored are in rectangular form and the

67

 

Figure 4.10. Shifted input pattern

neurons are indexed in such a way that the images are represented row by row, that
is, the ﬁrst row of the image corresponds to neuron 1 through c (c is the number

of columns of the image), and the second row corresponds to neuron c + 1 through

2c, - - -. We propose a modiﬁed “energy” function as follows:
1 m n
E — 5701011 Ztv.’ — v.)” +
[:1 j=1

. 2
$14 f]: L 2 (v;‘ — an] (4.47)
1=11=1 =(i—l)c+l

where K : is a constant and K > 0. Used here as a scaling factor;

c : the number of columns of the digital image/ pattern;

12 : total number of neurons of the network, which equals

to the total number of pixels of the stored image;
r : the number of rows of the digital image/pattern;

m : number of images/ patterns to be stored.

68

In comparison with the original “energy” function (3.7), this modiﬁed “energy”
function has an additional term, which actually measures the error (i.e. the difference
between the current output of the neurons and the stored images) of each row as a
whole. This term serves as a penalty function. It is zero when the input pattern
is a horizontally shifted version of one of the stored patterns, while is non-negative
otherwise. It is obvious that at a stored pattern V", l = 1,2, . . . ,m, E = 0. Thus
V"1 is an isolated minima of the “energy” function. Therefore, V", l = 1,2, . . . ,m,
are stable equilibria of the system. When the current state of the system is close to a
shifted version of a stored pattern, the added term will be relatively small and make E
small also compared to the case of a random image as input. This, in a sense, builds
a “energy” valley to attract t hose shifted images to the correct patterns stored. This
can also be viewed as a simple feature mapping, the goal is to choose such maps that
will separate input patterns in feature space while being shift-invariant.

The “energy” function introduced this way may introduce more local “energy”
minima (spurious equilibria) and in general will not be able to recognize all the
shifted versions of the pre-speciﬁed patterns. The value of K becomes crucial when

it is used for different applications. The new dynamic equations now become:

111' = f(m)ZW‘-01)(Hz Uri-”02+
l=l k=l j=1
krﬁl
m J(3)+C m 1- pc 2
K: T Z (”id—‘04)] II XL 2 (vii—v11]
i=1 1'_J(.') k=1 p=l ’=(:'-1)c+1
k¢l
for i = 1,2,. . . ,n, (4.48)

where R(i) is the index number of the row which contains neuron i, and J (z) is the

69

index of the ﬁrst neuron of H(i).
Another way of modifying the “energy” function to achieve the same goal is to

add the penalty function as follows:

Kill 2 (v;’—v,-)] (4.49)

i=1 =(i-1)c+l

The “energy” function of this form can be used in conjunction with the log modiﬁ-
cation method described earlier to achieve faster speed. It will also be easier to be
parallelized or be modularize in software simulation or hardware implementation.

The effects of this term can be controlled via the coefﬁcient K. Later simulations
showed that varying this constant is a very effective way of changing the regions
of attractions for each stable equilibrium point. While, technically speaking, the
values of K should be computed from a nonlinear version of the Lagrange multiplier
approach, we opted for simplicity and practical implementation in software in our
treatment.

It should be clear from the discussion above that we can similarly construct an
“energy” function such that the resulting neural network would be able to recognize
vertically shifted patterns as well. Here we omit the detailed derivation.

To modify the network so that it is shift-invariant, i.e. it is able to recognize
both horizontally and vertically shifted images, we can obviously combine the two
penalty functions used for horizontal shift and vertical shift together. Here we want

to present a different approach. We propose to divide the network, more precisely

70

the neurons whose outputs represent an image, into small regions as we did in the
subgroup method (See Figure 4.11 for example). Each region consists of a number of
neurons which form a rectangular area of the image. For a region, I, obtained this

way, we deﬁne the error for region I corresponding to the stored pattern V" as

e,= 2(1) - -v.-)], (4.50)

i6!

where V is the current output vector of the neuron and v.- is its ith component.
We use the summation of the errors of all the regions as the penalty term and add

it to the original “energy” function. The new “energy” function can be expressed as:

m n

E m)H 2(1)}! — vj)2+

1:1 j=l

zéngﬁZ[Z(v: -—v.-)]2, (4.51)

l- 1166' iEI

where K > 0 and G is the set of all the regions. The penalty term can also be added
as in equation (4.49) to take the advantage of the log forms later. The new system

differential equations are (assuming regions not overlapping with one another):

"1);“ fl — vi) H 2(1), " "1)2
(=1 k=l j=1
k¢l
m 2
+KZ 2(1); - ’,—v )H 2 k[:(v; - “-vj)] , (4.52)
(=1 :61. Igé} leG jEI

where I,- is the region which contains neuron 2°.

The motivation for choosing such an error function can be explained as follows:

71

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 4.11. An example way of dividing the network into small regions

72

Referring to the grouping of neurons shown in Figure 4.11, if the input image is
shifted one box (pixel) in the direction shown by the arrow, the changes in the error
for each region will be in the order of (595-)? In general, if the region is w pixel wide
and I pixel high (long) and the shift is hl pixels horizontally and fig pixels vertically,
the change of errors for each region will be in the order of [1 — Egg—"Mr. When
In and h; is much smaller than w and I respectively, the change of errors for each
region will be very small. When h1 = h; = 0, the change equals zero. Thus it is not
sensitive to shifts. This will help the network recognize shifted images as we explained
in the horizontal-shift case. There is a trade-off between the tolerance to shift and
the complexity of the computation. The choice of the size of the region is important.

If w or I is too small, then the error deﬁned for each region:

 

wl

e z [1— (I _ “1““ " M] (4.53)

will be too large for a small shift of the image. If the region is too large, the term

[2w — 4)]? (4.54)

in the penalty function may become too small to detect the shift of the image.

4.3.3 Simulation Results

To test the performance of the proposed modiﬁcations, we tried to build three 288-
neuron networks using different “energy” functions. We used the images shown in

Figure 4.12 as images to be stored and used images shown in Figure 4.13 as test

73

input patterns. Using these patterns, we tested three networks built using different
models. Network one was built using the “energy” function (3.6). Network two was
built using the “energy” function (4.47). Network three was built using the “energy”
function described in (4.51).

Table 4.1 compares the retrieve results for each case. As can be seen from the
table, the proposed new network was able to recognize the shifted images in addition
to those input images which are blurred by noises. However, we noticed that the
modiﬁed network converges slower than the original model does in simulation. This

is due to the complexity of the added connections.

Table 4.1. Simulation results

 

 

Test Patterns
(60 (b) (C) ((1)
model 1 x x \/ \/
model 2 \/ \/ x \/
model 3 \/ \/ \/ \/

Note: \/ means successful and x means failed.

 

A
(D
V

 

 

 

KKK
KKKE

 

 

 

 

 

 

 

 

 

 

 

4.3.4 Discussion and Remarks

So far we have surveyed current models and algorithms for neural networks used for
information association, such as associative memory, especially in the application of

character recognition and image retrieval. Our focus was on recurrent networks. We

74

   

Figure 4.12. Stored patterns

 

(d) (e) (D

Figure 4.13. Test patterns

75

proposed a product-of—norm model which has the following features:

0 The system eliminates the dynamic “learning” process. It can also be viewed

as that the learning process becomes “static”.

o The stored patterns are built into the network by design. In theory, such
network can store any pattern as the system equilibria. There is no limit as to

how many patterns can be stored.

o The system is a gradient system which has no oscillation. The system is globally

convergent .

Some of the limitations of this network are: the original network is fully connected
and is very complex, making it difﬁcult to be implemented in hardware.

We also investigated the product-of-norm model using different norms. Based on
the original production-of—norm model, we proposed several modiﬁed network models.
All of them retain the property of being a gradient-like system while still being able
to store the desired patterns as stable equilibria. Especially, we investigated ways of
tailoring the “energy” function for the application of image retrieval. We proposed
ways of modifying the “energy” function of the original network to control the regions
of attractions for each pattern, so that the network can recognize slightly shifted
images.

We proposed ways to simplify the network, speciﬁcally, we proposed to use “log”
form of the “energy” function. Such networks can be easily parallelized in software
simulation and modularized in hardware implementation. To store additional pat-
terns, the network does not have to be built from scratch, it only needs to add some

new modules to the original one.

76

We analyzed various ways to divide the original network into subnetworks so
that the connectivity of the network can be reduced while we can still store the
desired patterns. We demonstrated that the grouping with overlapping has better
performance.

We want to remark here: the simulation / application examples shown here are for
demonstration purpose only. They illustrated how this model may be used in real life
applications. The examples here are not intended for real image associations. Some
of the methods are heuristic.

As demonstrated by the examples, the new proposed models have great potential
of being used in a wide range of real life applications: such as associative memories,

categorization, character recognition and image association, etc..

CHAPTER 5

THE BUILDING BLOCKS FOR
MICRO-ELECTRONIC CIRCUIT
REALIZATION OF THE

PRODUCT-OF-NORM MODEL

5.1 Motivations .

One of the goals of the research on artiﬁcial neural networks is to understand how
human brain works so that we can use hardware (circuits) to mimic the human brain
to solve problems in real life applications. With today’s technology of very large
scale integrated (VLSI) circuits, it is possible to fabricate tens of millions of devices
(transistors) interconnected on a single wafer. This makes it possible to build complex
ANN circuits for real world applications where digital computers have limited use.

In this chapter, we will investigate ways of implementation of the product-of-norm

77

78

model using analog transistors biased in the subthreshold region. There are several

advantages with circuits operating in the subthreshold region [Mea89]:
0 Power dissipation is extremely low - from 10‘12 to 10'6 watt for a typical circuit;

0 In the subthreshold region, a MOS transistor’s drain current saturates in a few
KT/ q (thermal voltage), allowing the transistor to operate as a current source

over most of the voltage ranging from near ground to Vdd;

o The exponential nonlinearity is an ideal computation primitive for many appli-

cations.

First, we will propose some basic circuits which can be used as building blocks for
the implementation of product-of-norm model of AN Ns. We will then present some
examples of ways of implementation of small product-of-norm model. Please note
that the example circuits presented here are for demonstration purpose only. They

are not targeted to speciﬁc real world applications.

5.2 Background and Basic Circuits

5.2.1 MOS Transistors in Subthreshold Region

The drain current of an N-channel metal-oxide-semiconductor (MOS) transistor bi-

ased in the subthreshold region can be expressed as [Mea89]:

I = Ioe;#”(e95? — 6%,), (5.1) I

79

where V9 : gate voltage.
V, : souce voltage.
Vd : drain voltage.

10 : a constant for a given MOS transistor.

Isl .

q . thermal voltage.

When all voltages are normalized to quantity of %, the above equation can be

written as:

I = Ioe"V9(e’V‘—e"v")

.—. Ioer-V-(r — 67V“), (5.2)

where Vd, is the voltage between the drain and source nodes. When V4, is large
enough, we have

I z I... = Ioe"V9‘V' (5.3)

where I... is the so-called saturation current of the transistor. Figure 5.1 shows the

relationship of the saturation current vs gate voltage.

5.2.2 Current Mirrors

Figure 5.2 shows two diode-connected transistors, their gates are connected to their
drains respectively. For a typical process, Io is so small that even the smallest drain
current used (10'12 amp) requires IQ, approximately equal to 0.4 volt [Mea89], thus
makes the transistor well intro the saturation region of the drain current. Higher

drain current requires higher V2,, (V4,). Thus for any useful V9,, the drain current of

80

 

16-005 3 I I I l I I T

 

 

 

 

1e-006 : _:
a 3
~73 4
1e-007 : A,
i i
,
1
16-008 1 1 1 1 1 1 1
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
V..

Figure 5.1. I,“ vs V9,

the transistor has the same exponential dependence on the gate-source voltage. The
current can be controlled by V9,.

In circuit design, it is often necessary to change the sign (direction) of a current.
Figure 5.3 shows two types of current mirrors using diode connected transistors. The
input current I,“ biases a diode-connected transistor Q1. The resulting Vg, is just
sufﬁcient to bias the second transistor Q2 to a saturation current 10“. equal to 1m. As

long as Q; stays in saturation region, ID“, is nearly independent of the drain voltage

Of Q2.

5.2.3 Differential Pair

Figure 5.5 shows a circuit which is usually called “differential pair”. The circuit uses

the difference of the voltages as the input. Since we will use this kind of circuits quite

81

 

 

Figure 5.2. Two diode-connected transistors

often, I will give a brief analysis here.
Assume all transistors are working in the saturated region, the saturation current

can be calculated as:

I,M _—. IOeKVv-V'. (5.4)
Thus we have
I1 = Ioe‘VI'V, (5.5)
12 = Ioean-V. (5.6)
Since
15 = 11 + 12
= Ioe’v(e"v1 + e"V’), (5.7)
we have,
6'“ - I” l (5.8)

10 city] + 85V; °

82

 

 

 

 

 

 

 

 

 

 

 

 

Figure 5.3. Two types of current mirrors

Substitute e‘V back to equation (5.5) and (5.6), we obtain the following:

CKVI
I] = 15W (5.9)
88V:
12 = bm, (5.10)
and
CKV] _ eKVg
I] — 12 = “65V; + 63V;
— V2
= 15 tanh ﬂY—l. (5.11)
When V1 — V2 is small enough,
K
11- 12 z 1550/1 - V2). (5-12)

Figure 5.4 shows the relationship of II, 12, 11 — 12 vs V1 — V2. Using current mirror,

83

 

 

 

 

Current 10'7A
o
I

 

-3 " [1‘12“ 1

1 1 m

 

 

-3 -2 -1 o 1
V1 — V2 (V)

[\D

Figure 5.4. II, I; and 11 — 12 vs V1 — V2

 

 

Figure 5.5. Differential pair

84

 

Q. IV 4| Q.

 

 

 

 

 

Figure 5.6. Transconductance ampliﬁer

85

we can easily get 11 — 12 as output. As shown in Figure 5.6, where

V — V
[out = II — 12 = 1;, tanh K—(-—l—2—2—). (5.13)

The circuit shown in Figure 5.6 is called transconductance ampliﬁer.

5.2.4 Current-voltage Multiplier

Figure 5.7 shows a prototype of current-voltage multiplier circuits. "Compared to
the transconductance ampliﬁer shown in the last section, the circuit output 10m now
is drawn through two current mirrors instead of directly from the differential pair.
Therefore, the output does not disturb the operation of the differential pair. It is

obvious that the output current is:

”(V1 - V2)

10‘“: [1 —12=Ibtanh 2

(5.14)

When V1 - V2 is small enough,

1
Iout z 516(% — ‘6) (5°15)

Compared with the differential pair, 15 now is controlled by the input current In

via a current mirror. In normal operation, the bias transistor will be in saturation

region. Thus Ib = Iin and

”(V1 — V2)

[out = [in tanh (5.16)

I?

glinU/l - V2) (5'17)

86

II ’7‘

 

I)

 

 

 

Q; Q. s Q.
I 1 F I

 

 

 

 

 

 

“ﬁts Q.}W

 

 

 

 

 

 

 

I... 1J1 Q. ”Bi—h” E”

Figure 5.7. Current-voltage multiplier

87

So [out is the product of the input current and the difference between the input
voltages.

Figure 5.9 shows a pspice simulation for this circuit. We use the schematic symbol
shown in Figure 5.8 to denote this circuit. Since Ib cannot be negative, this circuit is

a two-quadrant multiplier.

Figure 5.8. Schematic symbol for current-voltage multiplier

 

4e-08
36-08
2e-08
1e-08
0
-1e-08
—2e-08
-3e-08
-4e—08 ' 1 J ' i l ' 1 ‘

1 -0.8 -0.6 -0.4 ~0.2 0 V().2 0.4 0.6 0.8 1

l-Vz

Iout(A)

 

 

 

 

Figure 5.9. The plot of a current-voltage multiplier

88

 

V
V
H

 

 

.I-Il

 

 

 

 

 

 

 

 

 

 

 

J ,
I,
v,-—| .

Figure 5.10. Gilbert multiplier

5.2.5 Four-quadrant Multiplier

Figure 5.10 shows a four-quadrant multiplier which is also called the “Gilbert Multi-

plier”. From the analysis of the differential pair, we know

eKVI

”85V; + eKV2

N
u—l
I

89

_ [b H(Vi-Vz)
— 2(1+tanh 2 ),

I _
12 =§b(1—tanh M

).

Similarly, we can obtain the following currents:

4(V3 - V.)
—————2 ),
114 = %(1 — tanh iii/3512),
"(V8 — WI)
2 3
NM - V4)
2 )

I
113 = é—(l+tanh

123 = 123(1-l-tanh

I
124 = -23(1—tanh

Thus

11+12 + I] —12tanh K(%—%)

2 2 2

 

1+:

and

= I1+Iz _ Il-Iztanh’cUé—V”).

2 2 2

 

Therefore,

[out : 1+ "I-

= (II — 12)tanh “(ﬁg—“Q

———”(V1 _ V”) tanh

= lb tanh 2 2

When V1 — V3 and V3 — V, are small enough, we have

2

I... z 1,,1‘4—(14 — V2)(V3 — v4).

“(V3 - V4)

(5.18)

(5.19)

(5.20)
(5.21)
(5.22)

(5.23)

(5.24)

(5.25)

(5.26)

(5.27)

90

Figure 5.11 shows a pspice simulation of this circuit.

 

1e-007 .

5e—008 —

(10-3)

"
’
’
f
”
v
’
I
‘-
’
-’
a
p

N3513.008 -

I

-1e-007

 

l

 

1

l

1 1

0.3

 

-1.5e-007
-0.3

-0.2

-0.1

0
Vl-V2

0.1

0.2

Figure 5.11. Spice simulation of a Gilbert-multiplier

In the following section, we will need to calculate (V1 —V2)2 for the product-of-norm

model. We can use this four-quadrant multiplier to achieve that by setting V3 = V1

and V4 = V2. Figure 5.12 shows the schematic symbol for this square voltage-square

circuit.

 

._. SQ ..L

 

 

 

Figure 5.12. Schematic symbol of a voltage square circuit

91

5.3 An Example Circuit of One-neuron/two-

pattern Product-of—norm Model

As we discussed in section 3.4.1 for one-neuron/m-pattn case, the product-of-norm

model’s dynamic equations can be written as:

it = f(m) 2 (v‘1 — v) H (v'“ - v)2 . (5.28)
1:1
when m = 2, the equation becomes
a = f(m)[(v" — v)(v'2 — v)” + (v‘2 -— v)(v" - vﬂ. (5.29)

If we use voltage to represent the output (state of the neurons), the terms (1)"1 — v)2
can be easily calculated with the voltage-square circuit.

Figure 5.13 shows an example circuit of the one-neuron/two—pattern case. As can
be seen, the circuit structure is symmetric. The state of the neuron is represented by
the voltage 1) over the capacitor C. For the voltage-square circuit, 12" and v are given

as input. The output current [1 is then given by

(v‘1 — v) Ir.(v"'l — v)

11 = 1;, tanh K tanh (5.30)

This current is then fed to a current-voltage multiplier X2 through a current mirror.

Thus the output current 12 of the current-voltage multiplier can be written as:

t2

2

12 = Iltanh ”(v —v)

92

 

 

 

V' 1.
SQ >

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 5.13. One-neuron/two—pattern circuit

:1 _ 2 12 _
= 1,, tanh U tanh M. (5.31)
2 2
When I)"1 — v and v'"2 — v are small, 12 can be written as
12 = 01(1)"2 — '1))(v"'1 — v)”, (5.32)

where a is a constant which incorporates the effects of I), and 1:. Similarly, we can

show that the current 1., can be expressed as

'2 __ .1 __
I4 = Ib[tanh ”CE—{£212 tanh 5(2) 2 ’0)

0(1)"1 — 1))(1)"2 — v)”. (5.33)

22

93

Currents I; and 14 are then summed up and fed to the capacitor C. Thus the voltage

over the capacitor has the following dynamics:

Ci) = 12 + 14. (5.34)

When 1)”1 — v and v‘"2 -— v are small, we have

C1) = a(v"'l — v)(1)"2 — v)2 + (1)"2 - 1))(1)"l — 1))“. (5.35)

The formula above is a differential equation for a one-neuron/two—pattern product-
of-norm model.

The PSpice code and simulation examples are presented in Appendix A.

CHAPTER 6

ARTIFICIAL NEURAL

NETWORKS FOR

TEMPORAL-PATTERN

ASSOCIATION

6.1 Introduction

In previous chapters, we studied networks that could store static patterns and were
used in applications like image recognition, pattern classiﬁcation and associative
memory. In this chapter, we will investigate neural networks that can store/learn
temporal patterns. This type of networks can be used in many applications, such as
speech recognition, temporal pattern generation and forecast.

Learning or storing temporal patterns can be viewed as an extension to the general

association of static patterns. Temporal patterns can be discrete or continuous. The

94

95

problem of learning temporal patterns can be classiﬁed into three different tasks

[HKP90]:

1. Sequence Recognition: Here one wants to produce a particular output
pattern when (or perhaps just after) a speciﬁc input pattern is seen). There is no
need to reproduce the input pattern. This is appropriate for speech recognition
problems where the output might indicate the word just spoken. It can also be

used for pattern categorization.

2. Sequence Reproduction: In this case, the network must be able to generate
the rest of the sequence itself when it receives part of it. This is the generaliza-
tion of auto-association or pattern completion of dynamic patterns. It would
be appropriate for learning a set of songs or for predicating the future course of

a time series from examples.

3. Temporal Association: In this case, a particular pattern must be produced
in response to a speciﬁc input pattern. The input and output patterns might
be quite different, so this is the generalization of hetero-association to dynamic
patterns. It includes , as special case, pure pattern generation and the previous

two cases.

6.2 Methodology Review

There are many methods proposed so far to address some of the tasks mentioned
above. The ﬁrst task —— sequence recognition — does not necessarily require a

recurrent network. For the other two tasks, recurrence is usually needed. In this

96

section, we will review some of the models currently available.

6.2.1 Tapped Delay Lines

Tapped delay line networks turn a temporal sequence into a spatial pattern on the
input layer of a feedforward network. Therefore, the conventional back-propagation
method can be used to learn and recognize sequences. Figure 6.1 shows a typical
structure of such network. In this type of network, values of :1:(t), :1:(t — A), ...,
a:(t — (m — 1)A) from an input signal a:(t) were presented simultaneously to the input
layer of the network. The values are usually obtained by feeding the signal into a delay
line that is tapped at various intervals. In a synchronous network, a shift register can
be used. The tapped delay line network has been widely used in speech recognition
problem [Wai89, WHH+89].

There are several drawbacks of this approach used for sequence recognition
[M0289]. The length of the delay line or shift register must be chosen in advance
to accommodate the longest possible sequences. It cannot handle arbitrary-length
patterns. The input signal must be properly registered in time with the clock
controlling the delay line and must arrive at exactly the correct rate.

Tank and Hopﬁeld [TH87] suggested a way of compensating the last limitations
mentioned above. The idea is to replace the ﬁxed delays by ﬁlters that broaden the

signal in time as well as to delay it.

97

 

 

 

 

 

D D D D

 

 

x(t)

k

Figure 6.1. A time-delay neural networ

98
6.2.2 Context Units (Partially Recurrent Networks)

There are several approaches falling into this category. The idea of the approach
here is to introduce a carefully chosen set of feedback connections in a feedforward
network. The recurrence lets the network remember cues from the recent past but
does not complicate the training. In most cases, the feedback connections are ﬁxed,
not trainable, so back-propagation algorithm may easily be used for training.

Figure 6.2 shows several popular architectures of this type of network which can
recognize sequences on the basis of its state at the end of the sequences. In some
cases it can generates sequences too. For each net, it is not known what kind of
patterns the network will be able to generate in advance. Bert DeVries proposed a

Gamma model which introduces subnetworks as short-term memories to be added to

the network [VP92].

6.2.3 Back Propagation Through Time

This type of networks is usually fully connected. Figure 6.3 shows a simple example
of such network. If the patterns we are interested are of a maximum length, we
can turn such network into a feedforward network by unfolding it through time.
The idea is to represent the state of the network at different time by a layer in a
feedforward network. Thus if the maximum length of the patterns concerned is T,
then the network will have T layers with each layer connected by the same weight
(connection) matrix. Figure 6.4 shows the unfolded version of the network shown
in Figure 6.3 with four time steps. The back-propagation method can be modiﬁed

to be used for training this new feedforward network. The desired patterns of the

99

 

 

Output Output

 

 

 

 

 

 

1 m 1 m

 

 

 

 

 

 

 

 

 

 

 

Context Input a Context Input

 

 

 

 

(a) (b)

 

 

 

 

 

 

 

 

 

 

Output 03‘?“
. Hidden 2
Hidden

 

 

 

 

 

Context ”

a .

ai Gontext l I
I Input

(C) (d)

 

 

 

 

 

 

 

Figure 6.2. Architectures of networks using context units. Shaded arrows represent
fully connected connections

100

original net at different time become the desired values of different layer of the new
feedforward net. The main problem with this approach is the need for large computer
resources. It is probably very difficult to be implemented in hardware too. For a long
sequence or for a sequence of unknown length, this approach becomes impractical. It

cannot be used for continuous patterns either.

6.2.4 Real-Time Recurrent Learning

Williams and Zipser [WZ89b, WZ89a] proposed a way to construct a learning rule for
general recurrent networks. This method can handle patterns of ﬁxed or indeﬁnite
length. It can be run on-line too. Here is a brief description of this method.

Assume the system difference equation is

V,(t) = g(hi(t—1))= g(ngVJ-(t — 1) + C1(t - 1)), (6-1)

where (g(t) is the desired (target) output of unit 2' at time t. By differentiating this

equation with respect to w,,-, we have

0V1“)

awpq

= g’(h.-(t — 1)) amt — 1) + wa .

 

(6.2)

awry

The desired values Ck(t) may be deﬁned for some k’s and t’s only. The error on

neuron k at time t is deﬁned by

Ck“) — Vk(t) if Ck is deﬁned at t
Eklt) = . (6.3)

0 otherwise

101

Wll W12 w22
10-02
W21

Figure 6.3. A general recurrent network

 

Figure 6.4. A unfolded recurrent network for 4 time steps

102

The system “energy” function (the overall network error) is

where

W) = @5512. (6.5)

Thus the gradient-decent weight updating rule can be written as

 

 

0E(t)
AW (t) = -r]
P9 aqu
3Vk(t)
7] Ek“) 1 (66)
P aw...
where r] is a positive constant. We have
Aqu = Z Aqu(t). (6.7)

t

Therefore, if we know “..,—$3, we can solve this equation recursively. Because we assume
that the initial state of the network has no functional dependence on the weight, we

have

014(0)

awpq

 

= o. (6.8)

Williams and Zipser [WZ89b] found that updating the weight after each time step
instead of waiting until the sequence is ended ( at t = T ) worked well if )7 was
sufﬁciently small. They called it real-time recurrent learning.

This system requires large memory storage, and is computationally expensive.

The number of operations for each time step is 0(n‘).

103

6.2.5 Time-dependent Recurrent Back-propagation

Pearlmutter proposed a training algorithm for general continuous-time recurrent
networks [Pea89]. Some other researchers derived the same result independently using

different approaches [SCL91]. The general system differential equations are described

by
dy,

Tia

=—y,~+0‘(x,')+1,' i=1,2,...,n, (6.9)

where x,- = Z,- wjiyj is the total input to unit 2, y.- is the state of unit 2, T.- is the time
constant of unit 2', a is a sigmoid function, wij, i,j = 1,2, . . . ,n, are the weights,
y.-(to) and the driving functions I,-(t) are the inputs to the system.

The “energy” function is deﬁned by

T
E = f, 25.0) — f1(t))2dt, (6.10)

=0keo

where O is the index of the output neurons, fk(t) is the desired output of neuron k

at time t, T is the ﬁnal time of concern. The functional derivative

544) = 555)) = 141a) — 5.5)]. (6.11)

 

The derivative of E with respect to weight w,,- is deﬁned by (See [Pea89] for detail)

BE 1

T r
aw“ = T/O yia ($j)ZJ'dt, (6.12)
1} 1

 

104

where Z is the Lagrange multiplier (or co-state variable):

(VB

”(1“ = 0y.(t)’

 

(6.13)

where 0+ denotes an ordered derivative. The dynamic equations of the co-state is

dz,- 12
7t- = i2 —Z:T :Wﬁa' (:rJ)zJ (6.14)

where

= g; = y.-(t) - W) (6.15)

Equation (6.14) can be derived using ﬁnite difference approximation [Pea89] or
calculus of variation and Lagrange multipliers [SCL91]. We omit the derivation detail
here. Similarly, we can ﬁnd the derivative of the “energy” E with respect to the time

constant T,:

6B 1 T dy,
$-71?!) 2.5111. (6.16)

Therefore, we can use gradient-descent method to update w.-J and T5.

In the training process, we ﬁrst integrate equation (6.9) forward from time to to
t1 = T. Then set the boundary condition Z.-(t1) = 0, and integrate the system Z
backward from t1 to to. In the backward process, o"(:cJ-)y,- and “ii are also integrated.
Thus at the end, we can computeaaw — Jandg— a(:1:,-)’ and y,- are stored in the forward
pass and replayed in the backward pass.

This algorithm can be used for training general continuous-time recurrent net-

works. It is related to other methods mentioned above. It can derive the real-

105

time recurrent learning method. It may also be viewed as an extension of the back-
propagation through time, or extension to recurrent back-propagation.

The drawback of this approach is that it needs to integrate backward in time.
Thus it needs to store many intermediate values. Therefore, it needs large storage

and it is difﬁcult to be implemented in hardware.

6.2.6 The Green’s Function Method

Sun, Chen and Lee [SCL92] proposed a method that uses a Green function to
construct the derivative of the “energy” function with respect to weight Wij. By
doing so, they avoided a lot of redundant computations. The resulting algorithm
does not need backward integration while the operations per time step is in the order
of 0(n3), which is one order faster than that of the real-time recurrent learning.

For a general system described by
at) = F(r(t), W. 1(0). (6.17)

where W is a matrix representing the set of weights and all other adjustable param-

eters. The nonlinear function F may look like
F(:r(t),W,I(t))= -a:(t)+g(W,:L‘)+I(t). (6.18)
The “energy” function can be given as

E(x,:r') = ft! e(:r(t),:r"(t))dt, (6.19)

to

106

where :r'(t) is the target value at time t. Using gradient descent method, the weight

is modiﬁed according to

5E _ ‘1 Be 02:

The on-line form is

as 81:), (6.21)

AW“) = —7l (5; ' 5W-

where 1] is a positive constant. By using Green’s function (we omit the derivation

here, see [SCL92]) the weight updating becomes
AW = —n(v(t)n(t)), (6.22)

where 11 is a third order tensor with the following dynamics:

dII 6F
41 ( 0W . (6.23)
H(to) = 0

v is a vector and can be obtained by solving the following linear equation:

6e
v(t)U(t) — a—x. (6.24)
U(t) is a matrix and is deﬁned by
dU(t) 3F
—— + U(t)— = 0
dt Bx . (6.25)

U(to) = 1

107

From systems science point of view, this method is basically the transition matrix
method applied to the sensitivity function with the initial condition set to zero (V(t)
in [SCL92] is the transition matrix).

The advantage of this method is that it can be used on-line and the operations
required per time step is only 0(n3).

Overall, each algorithm has its own advantages and disadvantages. Current

approaches still have the following limitations:
0 Require large storage.
0 Computationally expensive
0 Difﬁcult to implement

For training general recurrent networks, the time-dependent recurrent back propa-
gation approach is probably the best unless on-line learning is needed [HKP90]. It can
be used for both continuous-time and discrete-time systems. When on-line learning
is needed, the Green’s function method is a better choice.

In the rest of this chapter, we will focus on algorithms used for general recurrent
networks. We will several propose ways to improve the original time-dependent

recurrent back-propagation algorithm.

6.3 Motivation

Among the training algorithms proposed for recurrent networks so far, the time-
dependent recurrent back-propagation algorithm proposed by Pearlmutter and some

other researchers has the following advantages:

108

The system is very general and can be applied to a wide range of applications.

It can derive many other learning algorithms.

The adjoint variable Z has clear physical meaning.

0 The operations required per time step is only 0(n2).

It can be easily extended using variational approach to solve other problems

beside trajectory learning.
However, the original algorithms did not fully address the following problem:

0 The effects of the initial state: The training algorithm does not specify
how the initial state of those neurons without desired values should be chosen
(if the initial values of those neurons can be controlled). If the initial values
are chosen arbitrarily during the training process, the retrieval process can have

totally different dynamics.

0 The final values of the system: How to control the ﬁnal values of the system?
The ﬁnal value of a pattern may be more important than the patterns of the
intermediate states. Pearlmutter’s original algorithm assumed a free ending

problem and set 2;, = 0.

o Avoiding the backward integration: How to avoid the backward integration

without increasing the computational complexity too much?

In the following sections we will try to further investigate the areas mentioned

above.

109

6.4 The Initial Value Problem

The original algorithm of time-dependent recurrent back-propagation does not specify
how to set the initial values of all the neurons. It assumes that the initial values are
given and ﬁxed. While in real life applications, the initial values of some of the
neurons mayinot be given, and are changeable. If the initial values used in the
retrieval process are different from those used in the training process, the system will
behave differently. Therefore, it is natural to ask how we should set those initial
values.

Let us take the XOR problem described in [Pea89] as an example, Figure 6.5
shows the structure of such network. The network has a total of four neurons, with
two neurons in input layer, one in output layer and the other one in hidden layer. The
input neurons have external connections. The goal is to select a proper set of weights
so that the relationship between the external inputs and the output of the output
neuron is XOR. Here the initial values of the neurons can be given to arbitrary values.
Obviously, some choices of the initial values will make the training easier while others
will not.

The co—state Z,-(t) used in [Pea89] is deﬁned by

_ (TE
3.71)“)

 

21(1) (6.26)

That is, Z;(t) is the ordered derivative of E with respect to the system state y,(t).

Thus
6+E

2““) = air-(0)

 

(6.27)

110

 

 

Input Hidden Output

 

 

 

Figure 6.5. Network topology of XOR problem

111

Since aayfg) represents the change in E over the time period T caused by inﬁnitesimal

 

change of y,-(0), we can use gradient-descent method to update y;(0) in the training

process to ﬁnd the optimal initial values by using
— = ”721(0), (6-28)
where 17 is a positive constant.
6.5 The Choice of the Final Value of the Co-state
For a general system deﬁned by:
Y = f(X, 1,1), (6.29)
where X = WTY. If we deﬁne the performance index function as
J(t) = <I>(Y(tf),tf) + [01, L(X,I,t)dt, (6.30)
and the ﬁnal state has constraint is
w(Y(tf),tf) = 0. (6.31)

The Hamiltonian is deﬁned as

H(Y, 1,1) = L(X.I,T) — ZTf(X,I,t). (6.32)

112

The boundary condition can be derived using optical control theory [L895]:

Y(to) is given

, (6.33)
(¢x 'f' 11’3” — Z)TltJd3/(tf)+ (451+ 1b,?!) + H)|¢Jdtf = 0
where v is also a Lagrange Multiplier. Since the ﬁnal t f is ﬁxed, we have
(I’x'l'dgv—Z) = 0
Z I), = (45. + 14;” v)|.,. (6.34)

Equation (6.34) speciﬁes how to find the ﬁnal value of the co—state Z. The boundary
condition used in original time-dependent recurrent back-propagation is a special case

of equation (6.34) which can be derived by setting <15: 2 0 and II): = 0:

2|., = 0. (6.35)

In real applications, we may have constrains on the ﬁnal state and ¢(y(t;), t1) may
not be a constant either. A good example is to train a network to learn a circular
trajectory. It is desired to let the ﬁnal state equal to the initial state. This can be

emphasized by given a constraint function (I).

113

6.6 Avoiding Backward Integration in Time-

dependent Recurrent Back-propagation:

Method I

One major limitation of the original time-dependent recurrent back-propagation
method is that it needs to integrate the co-state Z backward. In the forward process,
the state variables and some other variables need to be stored so that they can be
played back during the backward integration. Therefore, this algorithm requires large
storage and is difﬁcult to be implemented in hardware. There are other methods
available from optimal control theory [BcH75]. However, each method has its own
limitations, for example, the method of guessing the initial Z (to) and simulate Z
forward, the difﬁculty lies in guessing the Z (to) to be close enough to the correct one.
In this section, we will present a numerical algorithm that can avoid the backward
integration.
The system equations concerned can be written in matrix form:

Y./T -_- C—Y+0(X) (636)

X=WTY

9

Here we deﬁne the operator “./” as Anx1./Tnx1 [a,-/T,-]nx1. C is the input vector.

The co-state (the Lagrange multiplier) Z has the following dynamics:

2' = Z./T — E — {Wdiag[a’(:r)./T]}Z. (6.37)

114

Here diag[V] is deﬁned as a diagonal matrix with the diagonal elements equal to the

corresponding elements of vector V. E is deﬁned as

E=Y"—Y,

(6.38)

where Y“ is the desired value. The derivative of the “energy” E with respect to

weight W can be written as:

3E ‘1 , , T
6—W_ — to Y[d1ag[a ($)./T]Z] dt.
Or in other form:
BE :1 - I T
W _ /. d1ag[0' (:6)./T]ZY dt.

In the following discussion, we deﬁne

1b = diag[a'(:z:)./T].

Thus

6E
BWT

 

t
= f ’ zpszdt
to

and the derivative of the “energy” with respect to T is

6E 11 _ .
551-; = —/to d1ag[Z]Ydt.

(6.39)

(6.40)

(6.41)

(6.42)

(6.43)

Using the ﬁrst-order approximation, the dynamic equation of the co-state becomes

115

(See [Pea89])

Zn

(1 — At. /:r)z,.+1 + 61E, + At(Wdiag[a’(X,,). /T]Z,,+1)

= {I — At./T + At(Wdiag[a’(:r,,)./T])}Zn“ + AtEn. (6.44)

Where I is an identity matrix. Deﬁne

p, = I — At(1./T — Wdiag[o’(:rn)./T])

(6.45)
B, = AtE,
Then,
Z, = annH + B". (6.46)
Write equation (6.42) in difference form:
0E N T
—6WT — At]; IlikaYk - (6°47)
Here we assume the ﬁnal time T = N At. At time t1, we have
21 = [1122 + B] (6.48)
m = I — At [I./T - Wdiag[a(X1)./T]] (6.49)
Bl = AtEl. (6.50)

If we denote

8E ) k T
— = 4,2,1! , (6.51)
(awn :4 ,

116

we have

 

BE )
= 1121213” (6.52)
(aw/T1 1
= 1#1011 22 + Bl)Y1T (6.53)
= 2(11}41Z2Y1T + $131K]: (6.54)
At t2 we have
Zg = [1223 + Bz (6.55)

 

8E T 6E
(aw). - W + (w).
= $2223? + $1M ZzYlT + I/JlBlYlT

= «12201223 + 82)}? + ¢1F1(#2za + 82))? + wall/1T

= $211223)”; + ¢il11ﬂ2Z3Y1T + ¢2le/2T + $1.411 Bail/1T + $130,112 (6-56)

Similarly, at t3:

BE
3er 3

 

 

3E
T
wszsig + (aw)?

= 1123;13Z4Y3T + ¢2#2_#SZ4Y2T + ¢1l11ﬂ211324Y1T +
2p333}? + WMBBYzT + wlﬂlﬂzBal/F +
¢2B2Y2T + $111132le +

$1 Bl Y1T° (6.57)

 

If we consider jab: = [ BE ]
8W 8w.) an

Then at time t1 we have

BWk

At t2:

8E
('JlW,c

We define

6f
6;

3Wk

117

 

 

 

 

 

3E
( ) = T/Jlﬂlylkzz + $113134:-
‘1

(—) = [Mikel/2k + ¢IPIF2Y1kl Z3 +
*2

[6219* + «ma/1*] 82

+¢IBIY1E

1111 Ylk
[1122162]c + wlﬂlylk]

¢2Y2k + Bic/11 -

(6.58)

(6.59)

(6.60)

(6.61)

(6.62)

118

Then at time t3

8E
( ) = {$3}? + Mini/2k + walﬂzyikl [#324 ‘1‘ B3] + 3:32 + £81
t3

8Wk
= [$3Y3k + 5:112] [#3Z4 + B3] + 5:32 + ﬂicBl- (6-63)
Thus by deﬁning
Bf. = (6.1/n" + 61;l ,1..-“ n = 2,3, . . ., (6.64)
we have
6E k t k
— = Z + B
(8‘40): ﬂ: Ft t+1 £31 I
= ,BtkﬂtZtH + Qf, (6.65)
where
B; = At[€g(t)]nx1 (6.66)
[1: = I — At(I + Wdiag[a'(a:)./T]) (6.67)
t
Q? = Zﬂl‘Bz. (6.68)
(=1

At the ﬁnal step t = N and

BE
(5791:)“, = ﬂlIiII‘NZT + in- (6-69)

119

If we use the same boundary condition as used by Pearlmutter in [Pea89]:
Z7 2 0, (6.70)
we could further simplify equation (6.69) as:

0E N k
—— = B k=1,2,..., . 6.71
(awk)1v l 151 I n ( )

Thus
6E BE
(5675) ~ (567.),- ‘6'”)
6.6.1 Simulation Procedure

In simulation, from initial time to to the ﬁnal time T, we need to calculate the
following at each time step (interval At):

First we need to simulate the system difference equation (6.36) forward, which will
calculate the values of K, E, and X). Then (I): can be computed using equation (6.41).
Variables ,n, and B, can be found using equation (6.45). We need to calculate Bf using
equation (6.64) for each R: = 1, 2, . . . ,n. Thus we can obtain Qf using equation (6.68)
for each 16 = 1,2, . . . ,n. At the ﬁnal step T, we can compute the derivative of the

“energy” with respect to weight W using

6E
(m) = ﬁtkﬂtZHl'l'Qfa (6-73)

120

and
8E 6W

Therefore, we can update the weight using gradient-descent method:

AW = ——17 (31%) (6.75)
Wnew = WOld+AW’ (6.76)

here 1) is a positive constant.

Thus this algorithm needs only the forward integration. If we select ZT = 0, this
algorithm can be further simplified and used on-line too. The computation complexity
can be estimated as follows:

For each time step, calculating Y, E and X needs 0(2n2) operations. Computing
2,!) needs 0(n) operations. Updating p and B requires 0(4n2) operations. The total
operations for calculating ,B", k = 1, 2, . . . , n, is 0(n4 + n3). Calculating Q also needs
0(n3) operations per time step. Therefore, the total operations per time step will be
0(n4), which is the same order as the real-time recurrent method used by Williams
and Zipser [WZ89a]. The advantages of this method are that it can use the ZT
for different boundary conditions and it avoids backward integration. However, the
computation complexity is still high.

Figure 6.6 to ﬁgure 6.9 show the trajectories of an example XOR circuit, as shown

in ﬁgure 6.5, trained using the above method.

121

 

0.75- moon 4

 

 

 

 

Figure 6.6. XOR network with input = [0, 0]

 

0.751.

 

 

 

Figure 6.7. XOR network with input = [0, 1]

 

075' nabv'oq (

025>

 

 

 

on
.-
n 5
(J
b
0

Figure 6.8. XOR network with input = [1,1]

 

0.75} .4

025? nuumom 1

 

 

 

Figure 6.9. XOR network with input = [1,0]

123

6.7 Avoiding Backward Integration: Method II

In last section, we presented one way of avoiding the backward integration used
in time-dependent recurrent back-propagation method. In this section, we will
present another numerical algorithm, which avoids backward integration too while
not increasing computational complexity too much.

From last section, we know that the co-state dynamical equations can be written

in vector form as:

z' = Z./T - E — (Wdiag[a’(X)./T])Z, (6.77)

where Z. / T g [2,,- / t j] and diag[X] is a diagonal matrix with diagonal elements equal
to the element of vector X. In the following discussion, we denote 2p 2 diag[a’ (X ) / T].
The system deﬁned by equation (6.77) is a linear time-varying system. It can be

written in general form as

z‘ = A(t)Z + 3(7). (6.78)

The transition matrix of this system <I>(t, to) can be calculated using
<i> —_— A(t)<I> with 600,70) = 1. (6.79)
and we have

t
z = <1>(t,to)z,,+ /. <I>(t,r)B(r)d7

= <1>(7,to)z,o + <I>(t,to) [it <I>‘l(7,to)B(‘r)d7-. (6.80)

124

Thus

6E
BWT

 

= f” zszTdt
-_: ft! —t07[2[‘I>(t,to)Zto + <I>(t,to) [it <I>-1(7',t0)B(T)dT]YTdt

t t t
= 2,, ’¢¢(t,to)dt+/ ’¢¢(t,to) (1)-1(T,to)B(7)drdt. (6.81)
to to to
\ ’ L J

 

term 1 term 2

Both term 1 and term 2 can be calculated forward. If we deﬁne ‘11 = <1)“, then it can

be simulated with

\iI = —A(t)\I‘ with \Il(t,to) = 1 (6.82)

Zto can be calculated at ﬁnal time t I using

Zto = <I>'l(tf,to)[Zt, — ft” <I>(tf,7')B(7')dT]. (6.83)

0

We now show how to implement the idea above numerically. We use the following

approximation with sufficiently small time step At:

Zn+l — Zn __ L
At — Zn, (6.84)
where 2,, = 2,3,“. Therefore,
271+] = -n + Ath
= 2.. + At(Zn./T — E, — (Wdiag[a'(X,,)./T]) 2.). (6.85)

125

If we deﬁne
11,, = I + AtI./T — At(Wdiag[a’(Xn)./T])
, (6.86)
C, = —AtE,,
we have
Z,+1 = 14.2,. + C... (6.87)

From equation (6.87), we have:

Zn = Vn-IZ —1 + C -1
= Vn—lan—2Zn—2 + Cit-2] "l" C -l
= Vn—an—ZZn-2 + Vn—ICn-2 + 011-]

= Vn-an—z ' ' ' V1Z1 + Vn-an-2 ' ° ' V201 + Vn—IVn-2 ' ° ' V302 + '° ° + Cn—l-

Thus
Zk = pk-l 21 + qk_1 k = 2, 3, . . . , N, (6.88)
where we deﬁne:
(II: = quk—1 + Ck (6.89)
Pk = VkPk—i, (6.90)
with
ql = Cl (6.91)

126

p1 = V1. (6.92)

The partial derivative of total “energy” with respect to weight can be approxi-

mated in the following way:

8E At i (p Z YT (6 93)
—— = k I: ~ -
aWT k=1 k

Here we assume T = AtN. For simplicity, we omit the At in the following derivations.

If we deﬁne
6E k —
(BWT) = 226,257, (6'94)
1: [:1

then, at each time step we have

 

613) = 912qu?” (6.95)
1

Q)
Q.)
6%.

 

= $121KT + ¢2IV121 + Cl]}/2T (6.96)

Q:
E
"l

 

dual/1T + 11126421 + CID/2T + ¢3IV2V121 + V201 + C21Y3T (6-97)

Q)
(“:1

 

Q:
E
"I

Q:
EQD
~30:
VVV
II

I: k
= Z 97127-12le + Z «12qusz (6.98)
[:1 (:2

With 170 = I.

Consider the ith column of (5%)“ which is

$121}? + $211121)? + . . . + liming-121K:

‘Hbz‘hY; + wsqu; + - -- + wkq._1Y,:. (6.99)

127

By deﬁning:
Ric = Ic—l + I(kph-1Y7: (6.100)
with
R0 = 0 (6.101)
and
k
Qk = 2: lbl‘IIAYIT, (6.102)
(:2
we have
6E i i ' 2'
W k = RkZ1 + Qk. (6.103)

Therefore, if Z; is known, we can integrate the system forward. However, the

boundary condition we know is ZN, not 21.

Now, let us consider simulating the co-state system backward using difference

equations. For sufﬁciently small At, we have the following approximations:

~ ~

Zn — 2,, '~
_—At +1 = -2“, (6.104)

from equation (6.104), we have:

Zn = Zn-H—Aténi-l

= 2,.... — At(Z,,+1./T — 137,.+1 — [Wdiag[or'(X,,+1)./T]]Zn+1).

We assume the simulation time period is [0, t1], and t1 = N At, N is an integer.

128

Deﬁne
pn+1 = I — AtI./T + At(Wdiag[a'(Xn+1)./T]) (6.105)
and
Bn+1 = AtEn+1. (6.106)
We have
2.. = MHZ... + Bu“. (6.107)

Thus, use equation (6.107) recursively:

Z1 = #222 + B2
= #201323 + 33) + 32

= 112/1323 + #233 + 32

= #2113 ° ' ' #1ka + #2ﬂ3 ° ' ° #k—in + #2113 ' ' ' #k—sz—i + - - - + 32-

The equation above can be written in general format as

21 =Uka+Dk k=2,3,...,N, (6.108)

where we deﬁne:
U]: = Uk—lllk k = 3,4,...,N (6.109)
D}; = Dk-l +Uk_1Bk k =3,4,...,N (6.110)

D2 = 82 (6.111)

129

U2 = [12. (6.112)

When the time step At is sufficiently small, the two systems 2 and 2 should be

so close that Z, x Z. Therefore, we have

ZN _—. ZN (6.113)

Thus equation (6.103) can be written as

(aWT), = 17,(UNZN + DN) + Q. (6115)
At k = N, it becomes

LE i — R‘ U 2 D "

8W7 N -- N( N N + N) ‘1’ QN' (6116)

Since R, U, D and Q all can be calculated forward in time, we can simulate this
system forward in time and at time T calculate the quantity (gap-5:7). This way we

can avoid the backward integration.

6.7 .1 Updating the Time-constant T

The partial derivative of total “energy” with respect to T can be approximated using:

6E N . - —
—- = —At 2 diag[Yk]Zk. (6.117)
8T k=l

130

As with the updating of weight, we omit At in the following derivation. Therefore,

at time step 1 we have

6E . - -
('8'?)1 = —d1ag[Y1]Z1.

At step 2,
8E . . _
(ﬁ)2 — —d1ag[Y2]Zg
=.- —diag[f€](u121+C1).
At step 3
6E _ . _
(5T), — —d1ag[Y3]Z3
= ’diale31(P2ZI+Q2)°
Deﬁne
S). = Sk_1+diag[Yk]Pk-1 k=1,2,...,N (6.118)
J). = J,._1+dia1g[1'/,.]q,._1 k=2,3,...,N, (6.119)

with initial conditions

50 = 0, (6.120)

J1 = Jo=0. (6.121)

131

Then

(9E) _
— = —SN21 — JN. (6.122)
(3T total

Therefore, we can calculate S and .I forward. At the ﬁnal step T, we calculate

Q9
elm

(a

)total'

6.7.2 Computation Procedures

From the derivation above, we can summarize the computation ﬂow as follows:

Start with k = 0, at time step k:

Step 1. Calculate Y and E (for k = 0,1, . . . ,N — 1) using:
Yk+1 = Y1: + At[0(Xk) + C1: + Ykl-/T
E]; = Y]; — Y]:
X;‘ = WTYk.

Step 2. Compute 70:

11):: = diag [0’(Xk)./T].
Step 3. Calculate u and C:

11,. = 1 + AtI./T — Athpk

Ck = —AtEk

132

Step 4. Calculate p and B:

Mk = I—AtI./T+Ath)k

B). = A15).

Step 5. U and D (for k = 2,3, . . . , N) can be calculated with

Uk Uk—iﬂk

Dk = Dk—1+Uk-lBk

Step 6. Compute q and p (for k = 1,2, . . . ,N — 1) using

9!: = VkCIk—l'l'Ck

Pk = VkPk—l-

or using

Pk = (III-+11

‘11: = U131 Dk+1-

(6.123)

(6.124)

(6.125)

(6.126)

(6.127)

(6.128)

Step 6a. If we want to update T (for k = 1,2, . . . , N), we also need to calculate

S]: = Sk-1+diag[f/k]pk_1

Jk Jk-l + diag[Yquk—i-

133

Step 7. Calculate Q:

Q1: = Qk—1+¢ka—1YkT

Q0 = Q1 = 0.
Step 8. For each column I, I = 1,2, . . . ,n, calculate
R; = 1224+ wkpk_1Y,: (6.129)

Step 9. Loop back to step 1 until k = N. Then, go to step 10.

Step 10. Calculate Z1:

21: UNZN + DN

Step 11. For each column I, I = 1,2, . . . ,n, calculate

6E )’ , ,
— = RNZI + QN-
(6W7 total

Step 11a. If we want to update T, we also need to calculate

6E)
— = —SNZ —JN. (6.131)
(8T total 1

134

Step 12. Calculate new weight W with gradient-decent method.

3E
W = W — a (—) .
pre
0W total

Calculate T with
3E
T = Tpre — 18 (—) -
6T total

Where a and 6 are updating rates for weight and time factor respectively. Loop

back to step 1 until the total “energy” is smaller than the desired value.

6.7 .3 Simulation Examples

For comparison purpose, we try to solve the same XOR problem as used in [Pea89].
The network topology is shown in ﬁgure 6.5. The input layer neurons have external
input 11 and 12. The goal is to choose a proper set of weights so that the output of
the network is the XOR of the two input signals between time t = 2 and t = 3.

We deﬁne the “energy” to be E = Z). f23(yc(1k) - d("))2dt, where k ranges over the
four cases, d“) is the desired output for case In, yo is the state of the output neuron.
The input signals 11 and 12 were set to be [1,1], [1,0], [0,1] and [0,0] respectively for
each case. The sigmoid function is chosen as 0(3) = (1 + 6”) — 1. The initial values
Y of neurons are set to be I + 0(0). Initial weights W are chosen randomly from
(—1,1).

As with Pearlmutter’s case, when weight W and time factor T were updated
simultaneously, T tended to go to 0 too quickly. In our simulations, we usually

choose T’s update factor to be very small. Figure 6.10 to ﬁgure 6.13 show the network

135

 

075+ nanpooq .

g

O25]

 

 

 

Figure 6.10. XOR network with input = [0, 0]

 

 

...]

g 6.5

025* bu-pvoq ,

 

 

 

cf'
-
u»
u
a
a

Figure 6.11. XOR network with input = [0,1]

136

 

0.75 * Inn-(1 1001 ,

025*

 

 

 

Figure 6.12. XOR network with input = [1,1]

 

025* mvooq ,

 

 

 

Figure 6.13. XOR network with input = [1,0]

137

dynamics after it has been trained.

6.7 .4 Comparison and Discussion

The number of computations in each time step can be estimated as follows: For each
time step 1:, we need to calculate step 1 to 8. For step 1, the number of arithmetic
computations is 0(2n2). For step 2, the number of computation steps is 0(n). Step
3 and step 4 each needs 0(4n2) computations. Step 5 needs 0(2n3). Step 6 needs
0(2n3). Step 6a needs 0(2n2) number of computations. Step 7 needs 0(2n2). Step
8 needs 0(2n3). Thus the total number of computations is in the order of 0(4n3).

In the simulation process, we need to store variables: U, D, q, p, S, J, Q, and R.
The added storage is 0(n3).

Compare the proposed algorithm with the original time-dependent recurrent back—
propagation method, the proposed one avoids the backward integration, thus the
memory storage request is reduced. The price is the increase of computations per
time step.

Compared with other methods, This algorithm retains some of the advantages of
the original time-dependent recurrent back-propagation method. The system used
is very general and can be applied to many applications. The co-state Z has clear
physical meaning. The boundary condition ZT can be chosen according to the system
constraint.

Toomarian and Barhen proposed a way to avoid the backward integration of the
adjoint-function method [TB91]. They proposed similar way of ﬁnding Zto using ZT.
However, the computation may be numerically unstable.

Table 6.1 shows the comparison of each method.

138

Table 6.1. Comparison of the algorithms

 

 

 

 

 

 

 

TDRBP RTRL Green’s proposed 1 proposed 2
function
Computations 0(n2) 0(n4) 0(723) 0(n4) 0(n3)
Backward yes no no no no
integration
On-line no yes yes yes yes
Remarks co—state co—state co-state
can be set can be set can be set
according according according
to system to system to system
constraint constraint constraint

 

 

 

 

 

 

 

 

CHAPTER 7

SUMMARY AND CONCLUSION

7 .1 Summary

Artiﬁcial neural networks have developed very rapidly in the past decade. They have
great potential of solving many problems which current digital computers cannot
solve efﬁciently or cannot solve at all. One of the fundamental problem of designing
artiﬁcial neural networks is to determine a set of weights such that the network can
store or learn patterns or signals. When an input is applied to the network, the
network can generate “desired” output. This behavior is referred to as “association”.
A simple association is to store static patterns and to retrieve the stored patterns
when some clues are given to the network as input. Associative memory is such
a kind of application. A more complex case is to store and to generate temporal
patterns.

Numerous architectures and algorithms have been proposed so far for these kinds
of tasks. Most of the proposed models have a pre-ﬁxed structure. The weights

are computed through on-line/off-line algorithms, which are usually computationally

139

140

expensive. In some cases, the weight matrix may not exist at all for a particular set
of patterns.

In this thesis, the product-of—norm model of artiﬁcial neural network architecture
is proposed. The new structure is based on gradient system theory using a special
“energy” function. Speciﬁcally, the product of the norm of the errors is used as
the “energy” function. The system dynamical equations are deﬁned using gradient-
descent method. The resulting system is a gradient system and is globally stable.
By using the product of the norm of the errors as the “energy” function, the desired
patterns are guaranteed to be the global energy minima of the system. Thus they
can be retrieved from the network. A distinguished feature of this network is that the
weights are ﬁxed by design. The weights learning is eliminated. The trade-off is the
increasing of the network complexity. The stability of the product-of-norm model was
analyzed using Lyapunov theory. Both theoretic proof and simulation results were
given in chapter 3 to show the ability of the network to store and retrieve arbitrary set
of patterns. For the one neuron case, the locations of the unstable system equilibria
were given too. Based on the analysis, another way of design one neuron networks
with pre-set stable and unstable equilibria was proposed. The possibility of using the
inﬁnity norm in the product-of-norm model was also discussed.

In chapter 4, various techniques of generalizing of the “energy” function of the
product-of-norm model were proposed. Ways to simplify the network structure were
investigated too. It was shown that the system could be easily parallelized and
modularized using the log form of the “energy” function. Using the log form “energy”
function, the system can learn new patterns by simply adding modules to the existing

network. The subgrouping method was proposed to localize the connections of the

141

network, so that the network could be simpliﬁed and the computations required could
be reduced. Several ways of subgrouping were discussed, in general we found that
the method of dividing the network into subgroups with overlapping neurons had
better performance. chapter 4 showed that the product-of-norm system could be
used to store and recognize digital image patterns by adding a penalty function to
the original “energy” function. In particular, a method of dividing the images into
small regions is discussed. The penalty function was calculated corresponding to each
region. The resulting network was able to recognize both vertically and horizontally
slightly shifted images.

Chapter 5 investigated possible ways to implement the product-of-norm model
in hardware. Analog transistors biased in subthreshold region were proposed as
basic building elements. Several building blocks were discussed and an example one-
neuron / twqpattern network was shown.

Artiﬁcial neural networks used for association of temporal patterns were studied
in chapter 6. Among training algorithms available for recurrent networks so far, time-
dependent recurrent back-propagation is the best if on-line learning is not concerned.
One of the limitations of this method is that it needs to integrate the co-state
backward in time. Two algorithms were proposed to avoid the backward pass. The
penalty is the increase of computation complexity. However, they are still among
the simplest methods. Ways to set the initial values for the neurons and the ﬁnal

conditions for the co-state Z were analyzed also using optimal control theory.

142

7 .2 Future Research Directions

The product-of-norm model is relatively complex. It requires high order feedbacks.
Future work should be focused on further simplifying the network. Current hardware
implementation example for product-of-norm model is not scalable for large networks.
Hardware implementation of the product-or-norm model is another area which needs

more investigation.

APPENDIX

APPENDIX A

THE PSPICE CODE AND
SIMULATIONS FOR‘A
ONE-NEURON/TWO-PATTERN

CIRCUIT

In chapter 5, we presented some basic circuits structures to be used in implementing
the product-of-norm model. In this chapter, we will list the PSpice code used for the
simulations of the circuits. The following PSpice library ﬁle contains the deﬁnitions of
some basic circuits: n-mirror, p-mirror, current—voltage multiplier and voltage-square

circuits.

**********************************************************************

* PSPice ciruit library: neuron.lib
*
* This library contains sub-circuits which may be used to

* build product-of-norm model neural networks
a:

143

144

* Jiansheng hou
* Sun 02-23-1997 1:49:28 pm
a:

**********************************************************************

**********************************************************************
*

* MOSFET used in this library

*
***************************t************************#*****************
.MODEL nfet NMOS LEVEL=2.00000

+LD=0.2500U TOX=401.000E-10

+NSUB=6.257042E+15 VT080.771327 KP=5.60600E-05

+GAMMA=O.53 PHI=O.60 U0=650.996 UEXP=O.153507

+UCRIT=41931.5 DELTA=O.472546 VMAX=67660.8 XJ=0.2SOOOOOU
+LAMBDA=3.089401E-02 NFS=4.834708E+12 NEFF=1

+NSS=1.000000E+10 TPG=1.00000 RSH=29.600 CGSU=3.229249E-10
+CGDO=3.229249£-10 CGBO=6.72642QE-10 CJ=1.13650E-4

+MJ=O.6862O CJSH=5.3187OE-1O MJSH=O.26510 PB=O.80000

.MODEL pfet PMOS LEVEL=2.00000

+LD=O.209610U TOX=401.000E-10

+NSUB=6.722092E+15 VTO=-O.78821 KP=2.23000E-05

+GAMMA=O.5486 PHISO.60 U0=259 UEXP=O.182346

+UCRIT=12594.5 DELTA=O.844423 VMAX=41222.6 XJ=O.2SOOU
+LAMBDA=4.981694E-02 NFS=1.000E+11 NEFF=1.00I

+NSS=1.000000E+1O TPG=-1.00000 RSH=90.7OO CGSO=2.707532E-10
+CGDO=2.707532E-10 CGBO=5.839625E-10 CJ=2.513600E-4 "380.547100
+CJSH=2.9447OOE-10 MJSW80.318800 PB=O.80

*************#*******************************************************
*

* Sub-circuits
*

**************#******************************************************

***********#*******#***#*******************#***
Pmos current mirror

4:

a:

* + ----- Iin
* l +--- Iout
* I l +- Vdd
. Ill
.. III
1 2 3

.subckt pmirror

145

m1 1 1 3 3 pfet L=12u H=4u
m2 2 1 3 3 pfet L=12u H=4u
.ends

.subckt pmirror2 1 2 3

m1 1 1 3 3 pfet L=7u H=4u
m2 2 1 3 3 pfet L=7u H=4u
.ends

.subckt pmirror3 1 2 3

m1 1 1 3 3 pfet L=24u H=2u
m2 2 1 3 3 pfet L=24u H=2u
.ends

***********************************************
a:

* N MOS mirror

:1:

* 1 ---Iin 2 -- Iout, 3 --- GND

.subckt nmirror 1 2 3

m1 1 1 0 O nfet L=12u H=2u

m2 2 1 0 0 nfet L=12u H=2u

.ends

* 1 ---Iin, 2 --- Iout, 3 --- GND
.subckt nmirror2 1 2 3

m1 1 1 O O nfet L=12u H=8u

m2 2 1 O O nfet L=12u H=8u

.ends

* 1 ---Iin 2 --- Iout, 3 --- GND
.subckt nmirror3 1 2 3

m1 1 1 O O nfet L=24u H=1u

m2 2 1 O 0 nfet L=24u H=1u

.ends

* 1 ---Iin 2 --- Iout, 3 --- GND
.subckt nmirror4 1 2 3

m1 1 1 O O nfet L=14u H=3u

m2 2 1 0 0 nfet L=14u H=3u

.ends

146

*1 ---Iin 2---Iout, 3---GND
.subckt n_cascode_mirror 1 2 3
m1 1 1 3 3 nfet L=10u H=14u
m2 2 1 4 3 nfet L=10u H=14u
m3 5 5 3 3 nfet L=6u W=14u

m4 4 5 3 3 nfet L=6u H=14u
.ends

* 1 ---Iin 2 --- Iout, 3 --- GND
.subckt n_nmirror 1 2 3

m1 1 1 3 3 nfet L=1200u H=1u

m2 2 1 3 3 nfet L=1200u W=1u
.ends

* 1 ---Iin 2 --- Iout, 3 --- GND
.subckt w-nmirror 1 2 3

m1 1 1 O 0 nfet L=2000u H=16u
m2 2 1 O O nfet L=2000u H=16u
.ends

********************#******************#*******
* Wide range IV multiplier

a:

* 1 --- Iout, 2 ~-- Iin,
* 3 --- V1, 4.--- V2

* 5 ---Vdd, 6---gnd

.subckt iv_mul-wid 1 2 3 4 5 6
m1 8 3 7 O nfet

m2 9 4 7 0 nfet

*x0 2 7 O nmirror

x0 2 7 O n_nmirror

xi 8 1 5 pmirror

x2 9 10 5 pmirror

x3 10 1 O nmirror

.ends

*********************************#*****#*******
* IV multiplier

4:

* 1 --- Iout, 2 --- Iin,
* 3 --- V1, 4 --- V2,
* 5 --- Vdd, 6 ---gnd
*

* 6 is not really used

147

*

.subckt iv_mul 1 2 3 4 5 6
m1 8 3 7 O nfet L=4u H=2u
m2 9 4 7 0 nfet L=4u H=2u
x0 2 7 O nmirror3

x1 8 1 5 pmirror2

x2 9 10 5 pmirror2

x3 10 1 0 nmirror

.ends

*******t********************#*********#********
1 --- Iout, 2 --- Iin, '

3 -" V1, 4 '-'V2,

5 ---Vdd, 6---gnd

6 is not really used

******

.subckt new_iv_mul 1 2 3 4 5 6
m1 8 3 7 O nfet L=4u H=2u

m2 9 4 7 O nfet L=4u H=2u

x0 2 7 O nmirror2

x1 8 1 5 pmirror2

x2 9 10 5 pmirror2

13 10 1 0 nmirror

.ends

******************************#******#*********

* 1 --- Iout, 2 --- Iin,
* 3 --- V1, 4 ---V2,

* 5 ---Vdd, 6---gnd
*

* 6 is not really used
*

.subckt new,iv_mul2 1 2 3 4 5 6
m1 8 3 7 0 nfet L=4u H=2u

m2 9 4 7 0 nfet L=4u H=2u

x0 2 7 O nmirror2

x1 8 1 5 pmirror2

x2 9 10 5 pmirror2

x3 10 1 O n_cascode_mirror
.ends

**t********************************************

148

Negative IV multiplier
10 --- Iout, 2 --- Iin,
3 ---V1, 4 ---V2,
5 ---Vdd, 6 ---gnd

6 is not really used

********

.subckt neg_iv-mu1 10 2 3 4 5 6
m1 8 3 7 O nfet L=4u H=2u

m2 9 4 7 O nfet L=4u H=2u

x0 2 7 O nmirror2

xi 8 1 5 pmirror2

x2 9 10 5 pmirror2

x3 1 10 O n_cascode-mirror
.ends

tt************************#*********#**********
Negative IV multiplier

*
*
* 1O --- Iout, 2 --- Iin,
* 3 ---V1, 4 ---V2,

* 5 ---Vdd, 6 ---gnd

1.:
a:
:1:

6 is not really used

.subckt neg_iv_mul2 10 2 3 4 5 6
m1 8 3 7 O nfet L=4u H=2u

m2 9 4 7 O nfet L=4u H=2u

x0 2 7 O nmirror2

11 8 1 5 pmirror2

x2 9 1O 5 pmirror2

13 1 1O 0 nmirror3

.ends

*t*****#***********************#****************
Test IV multiplier

I'l-

*

1 --- Iout, 2 --- Iin,

3 --- V1, 4 ---V2,

5 ---Vdd, 6 ---gnd

.subckt iv_mul-test 1 2 3 4 5 6
m1 8 3 7 0 nfet

m2 1 4 7 0 nfet

xO 2 7 O nmirror

**

149

x1 8 1 5 pmirror
*x2 9 10 5 pmirror
*x3 10 1 0 nmirror
.ends

************************************************
* N MOS pair

a:

* 1---V1, 2---v2, 3---Iout, 4---Iin1, 5---Iin2
.subckt n_pair 1 2 3 4 5 6

m1 4 1 3 0 nfet L=4u H=3u

m2 5 2 3 0 nfet L=4u H=3u

.ends

******************************#**************
* gilbert multiplier

* 1-v1, 2-v2, 3-v3, 4-v4, S-Ib, 6-Iout, 7-Vdd, 8-gnd
*

.subckt gilbert_mul 1 2 3 4 5 6 7 8

x0 3 4 10 12 6 8 n_pair

x1 3 4 11 6 12 8 n_pair

x2 1 2 9 10 11 8 n_pair

x3 12 6 7 pmirror

m1 9 S 8 8 nfet

.ends

***************************#***********************
* Voltage square circuit

a:

*1---v1, 2---v2, 3---Iout, 4---Ib, 8---Vdd, 9---gnd
.subckt v-square 1 2 3 4 8 9

10 1 2 10 5 6 9 n_pair

x1 1 2 11 6 5 9 n-pair

x2 1 2 12 10 11 9 n_pair
x3 5 3 8 pmirror3

x4 6 7 8 pmirror

x5 7 3 9 nmirror2

ml 12 4 9 9 nfet L=4u H=3u

.ends

****************************************************************

* another voltage square circuit
*

*1---v1, 2---v2, 3---Iout1, 7--Iout2, 4---Ib, 8---Vdd, 9---gnd

150

*9 is not really used

:1:

.subckt new_v_square 1 2 3 7 4 8 9
x0 1 2 10 5 6 0 n_pair

x1 1 2 11 6 5 0 n_pair

x2 1 2 12 10 11 0 n-pair

x3 5 3 8 pmirror3

x4 6 7 8 pmirror3

*x5 7 3 O nmirror2

ml 12 4 O O nfet L=4u H=3u
.ends

**ttt**************t****#**********************************#*t**#
* v_sqaure circuit with two current outputs

*1---v1, 2---v2, 3---Iout1, 7---Iout2, 4---Ib, 8---Vdd, 9---gnd
.subckt v_square-2out 1 2 3 7 4 8 9

x0 1 2 10 5 6 O n_pair

x1 1 2 11 6 5 O n_pair

x2 1 2 12 10 11 O n_pair

x3 5 3 8 pmirror

x4 6 7 8 pmirror

*x5 7 3 O nmirror

ml 12 4 0 O nfet L=4u H=3u

*m1 is the bias mosfet

.ends

¢********************¢¢******¢***¢************
* test voltage square circuit

:1:

*1---v1, 2---v2, 4---Ib, 8---Vdd, 9---gnd
.subckt v-squaretest 1 2 3 4 8 9

x0 1 2 10 5 6 O n_pair

x1 1 2 11 6 5 0 n-pair

x2 1 2 12 10 11 0 n_pair

x3 5 3 8 pmirror

x4 6 7 8 pmirror

x5 7 3 O nmirror

ml 12 4 O 0 nfet

.ends

t*****tt***********t#*********#****t***#*****************
* Another test voltage square circuit

*

*1---v1, 2---v2, 3---Iout, 4---Ib, 8---Vdd, 9---gnd
.subckt v_squaretest2 1 2 3 4 8 9

x0 1 2 10 5 6 0 n_pair
x1 1 2 11 6 5 0 n_pair
x2 1 2 12 10 11 0 n-pair
x3 6 3 8 pmirror

x4 5 7 8 pmirror

x5 7 3 0 w_nmirror

mtesti 7 3 0 0 nfet L=12u H=4u
mtest2 7 3 0 0 nfet L=12u H=4u
ml 12 4 0 0 nfet

.ends

151

The following is the PSPice ﬁle for a one-neuron/two—pattern circuit.

1n2p circuit
at

* This circuit is a one-neuron/two-pattern circuit
* of the product-of-norm neural network model
*

**********************************************

* The structure

******ttt*********************¥***************

x0 6 4 80 81 9 1 0 new_v_square

x1 61 82 5 6 1 0 new-iv_mul
x2 62 83 5 6 1 0 neg_iv_mu12
r1 80 82 0.00001

r2 81 83 0.00001

r5 61 6 0.00001

r6 62 6 0.00001

x3 6 5 70 71 9 1 0 new_v_square

x4 63 72 4 6 1 0 new_iv_mul
x5 64 73 4 6 1 0 neg_iv_mul2
r3 70 72 0.00001

r4 71 73 0.00001

r7 63 6 0.00001

r8 64 6 0.00001

**************************¥**#*****#**********

* The settings

*******#**************************************

152

vd 1 O 5v

vc 6 0 iv

vb 9 0 0.5V
v_star1 4 0 0.6v
v_star2 5 0 0.4V

*#11:*******************************************

* The subcircuits
******************tt****tttt**********¥*******

********t*************************
* N MOS current mirror

* 1 ---Iin 2 --- Iout, 3 --- GND
.subckt nmirror8 1 2 3

m1 1 1 0 0 nfet L=20u H=2u

m2 2 1 0 0 nfet L=20u H=2u

.ends

*********************************
* N MOS current mirror

* 1 ---Iin 2 --- Iout, 3 --- GND
.subckt nmirror9 1 2 3

m1 1 1 0 0 nfet L=7u H=5u

m2 2 1 0 0 nfet L=7u H=5u

.ends

t*********#**********************
* Negative IV multiplier

a:

.subckt neg_iv_mul 10 2 3 4 5 6
m1 8 3 7 0 nfet L=4u N=2u

m2 9 4 7 0 nfet L=4u W82u

x0 2 7 0 nmirror2

x1 8 1 5 pmirror2

x2 9 10 5 pmirror2

x3 1 10 0 n_cascode_mirror
.ends

***********************************************
Another negative IV multiplier

3 --- V1, 4 ---V2,

1.:
1|:
* 10 --- Iout, 2 --- Iin,
*
. 5 ---Vdd, 6 ---gnd
1|:

153

* 6 is not really used

:1:

.subckt neg_iv_mul2 10 2 3 4 5 6
m1 8 3 7 0 nfet L=4u H=2u

m2 9 4 7 0 nfet L=4u N=2u

x0 2 7 0 nmirror2

x1 8 1 5 pmirror2

x2 9 10 5 pmirror2

x3 1 10 0 nmirror9

.ends

*****************************¢*****************
IV multiplier

*
:1:
* 1 --- Iout, 2 --- Iin,
* 3 --- V1, 4 ---V2,

* 5 ---Vdd, 6 ---gnd
:1:

:1:

:1:

6 is not really used

.subckt new_iv_mul 1 2 3 4 5 6
m1 8 3 7 0 nfet L=4u W=2u

m2 9 4 7 0 nfet L=4u H=2u

x0 2 7 0 nmirror2

x1 8 1 5 pmirror2

x2 9 10 5 pmirror2

x3 10 1 0 nmirror8

.ends

.lib c:/pspice/hou/neuron.lib

.TEMP 25

.dc vc 0 1.0 0.01
.probe

.end

Setting 0‘1 = 0.4V and v"2 = 0.7V, we plot the response of the circuit to different
initial voltages in Figure A.1. Figure A.2 shows the simulations of the same circuit
with v" = 0.3V and v‘2 = 0.75V. As can be seen from the ﬁgures, the circuit stored
two stable points, which are very close to the desired points. Because the circuit
function we derived can only work in certain range, the resulting circuit cannot store

arbitrary data.

154

 

 

 

 

0.9 I I T I I I I
0.8 *- _
0.7 ~ -
0.6 1
> -
go 0.5 -\\ ““““““““ _
d , ____ .4"
:3 L” ..
>0 0'41
0.3 a
0.2 ﬂ
0.1 - _

 

 

l I l L l l l

0
0 0.0005 0.001 0.0015 0.002 0.0025 0.003 0.0035 0.004

Time t

 

Figure A.1. The response of the one-neuron/two-pattern circuits to different initial
voltages. v"1 = 0.4V and v‘2 = 0.7V.

 

0.9

 

 

 

 

 

1 1 I
..L .
0.7 — /’# I
/
0.6 — // _ A
> // ,
61) 0.5 rem—'7'". -
jg hhhh ‘
—>5 04 "‘~~~-::::::: _______ d
0.3 - ‘
0.2 d
0.1 d
0 1 1 l
0 0.0005 0.001 0.0015 0.002

Time I

Figure A2 The response of the One-neuron/two—pattern circuits to different initial
voltages. v"1 = 0.3V and v"2 = 0.75V

BIBLIOGRAPHY

[Bar91]

[BcH 75]

[HKP90]

[Hop82]

[Hop84]

[H874]

[Kha91]

[K0588]

[LS95]

[Mea89]

BIBLIOGRAPHY

Valmir C. Barbosa. Learning in analog hopﬁeld networks. In Proceedings
of IJCNN 1991, volume ii, pages 183—186, Seattle, July 1991.

Arthur E. Bryson, Jr. and Yu chi Ho. Applied Optical Control Opti-
mization, estimation, and control. Ohemisphere Publishing Corporation,
Washington, DC, 1975.

John Hertz, Anders Krogh, and Richard G. Palmer. Introduction To The
Theory of Neural Computation. Addison-Wesley Publishing Company,
1990.

John Hopﬁeld. Neural networks and physical systems with emergent
collective computational abilities. In Proceedings of the National Academy
of Sciences, volume usa 79, pages 2554-2558, 1982.

John J. Hopﬁeld. Neurons with graded response have collective compu-
tational properties like those of two-state neurons. In Proceedings of the
National Academy of Sciences, volume usa 81, pages 3088-3092, 1984.

Morris W. Hirsch and Stephen Smale. Differential Equations, Dynamical
Systems, and Linear Algebra. Academic Press, Inc, 1974.

Hassan K. Khalil. Nonlinear Systems. Macmillan Publishing Company,
1991.

Bart Kosko. Bidirectional associative memories. IEEE Trans. 0n Sys-
tems, Man, and Cybernetics, 18(1):49—60, Jan/Feb 1988.

Frank L. Lewis and Vassilis L. Syrmos. Optical Control. John Wiley &
Sons, Inc., 1995.

Carver Mead. Analog VLSI and Neural Systems. Addision-wesley Pub-
lishing Company, 1989.

155

[M0289]

[Pea89]

[Pin87]

[Pin88]

[Pin89]

[RHW86]

[SB91]

[SCLQH

[SCL92]

[SWC91]

[TB91]

156

M. C. Mozer. A focused back-propagation algorithm for temporal pattern
recognition. Complex Systems, 3:349-381, 1989.

B. A. Pearlmutter. Learning state space trajectories in recurrent neural
networks. In International Joint Conference on Neural Networks, Wash-
ington 1989, volume II, pages 365—372, Seattle, Washington, 1989.

Fernando J. Pineda. Generalization of back-propagation to recurrent
neural networks. Physical Review Letters, 59:2229—2232, 1987.

Fernando J. Pineda. Dynamics and architecture for neural computations.
Journal of Complexity, 4:216—245, 1988.

Fernando J. Pineda. Recurrent back-propagation and the dynamical

approach to adaptive neural computation. Neural Computation, 1:161-
172, 1989.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal
representations by error prOpagation. In Parallel Distributed Processing:
Explorations in the microstructures of cognition, pages 318-362. MIT
Press, Cambridge, 1986.

Fathi M. A. Salam and Shi Bai. A new feedback neural network with
supervised learning. IEEE Trans. 0n Neural Networks, 1(1):170—173,
Jan 1991.

Guo-Zheng Sun, Hsing—Hen Chen, and Yee-Chun Lee. A fast on-line
learning algorithm for recurrent neural networks. In Proceedings of 1.] CNN
1991, volume ii, pages 13-18, Seattle, July 1991.

Guo—Zheng Sun, Hsing-Hen Chen, and Yee-Chun Lee. Greens’ function
method for fast on-line learning algorithms of recurrent neural networks.

In Advance in Neural Information Processing Systems, volume 4, pages

333-340. Morgan Kaufmann Publishing Inc., 1992.

Fathi M. A. Salam, Yiwen Wang, and Myung-Ryul Choi. On the analysis
of dynamic feedback neural nets. IEEE Trans. 0n Circuits and Systems,
38(2), Feb 1991.

N. Toomarian and J. Barhen. Adjoint-functions and temporal learning
algorithms in neural networks. In Advance in Neural Information Pro-
cessing Systems, volume 3, pages 113—120. Morgan Kaufmann Publishing
Inc., 1991.

[TH87]

[THL93]

[VP92]

[Wa189]

[WCJ91]

[WCM89]

[WCM90]

[WHH+89]

[WZ89a]

[WZ89b]

157

D. W. Tank and J. J. Hopﬁeld. Neural computation by time compression.
In Proceedings of the National Academy of Sciences USA, volume 84,
pages 1896-1900, 1987.

H. Chris Tseng, Victor H. Hwang, and Ling Lu. A fast supervised learn-
ing scheme for recurrent neural networks with application to associative
memory design. In Proceedings of 1993 IEEE International Conference
on Neural Netw orks, volume ii, pages 789—793. IEEE Neural Network
Concel, April 1993.

Bert De Vries and Jose C. Principe. The gamma model — a new neural
model for temporal processing. Neural Networks, 5:565—576, 1992.

A. Waibel. Modular construction of time-delay neural networks for speech
recognition. Neural Computation, 1:39—46, 1989.

Yeou-Fang Wang, Jose B. Cruz Jr., and J. H. Mulligan, Jr. Guaranteed
recall of all training pairs for bidirectional associative memory. IEEE
Trans. 0n Neural Networks, 2(6):559—567, Nov 1991.

Yeou-Fang Wang, Jose B. Cruz, Jr., and J. H. Mulligan, Jr. An en-
hanced bidirectional associative memory. In Proceedings of IJ CNN 1989,
volume i, pages 105-110, June 1989.

Yeou-Fang Wang, Jose B. Cruz Jr., and J. H. Mulligan Jr. Two coding
strategies for bidirectional associative memory. IEEE Trans. 0n Neural
Networks, l(1):81—92, Mar 1990.

A. Waibel, T. Hanazawa, G. Hinton, K Shikano, and K Lang. Phoneme
recognition using time-delay neural networks. IEEE Transactions on
Acoustics, Speech, and Signal Processing, 37:328-339, 1989.

R. J. Williams and D. Zipser. Experimental analysis of the real-time
recurrent learning algorithm. Connection Science, 1:87-111, 1989.

R. J. Williams and D. Zipser. A learning algorithm for continually running
fully recurrent neural networks. Neural Computation, 1:270-280, 1989.