This is to certify that the dissertation entitled

Supervised Learning of Feedforward Artificial Neural Networks Using Different Energy Functions

presented by Maqbool Ahmad has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

Major professor

Date: 11/15/91

SUPERVISED LEARNING OF FEEDFORWARD ARTIFICIAL NEURAL NETWORKS USING DIFFERENT ENERGY FUNCTIONS

By

Maqbool Ahmad

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1991

ABSTRACT

SUPERVISED LEARNING OF FEEDFORWARD ARTIFICIAL NEURAL NETWORKS USING DIFFERENT ENERGY FUNCTIONS

By

Maqbool Ahmad

A framework, supported by analysis and computer simulations, is developed to improve the supervised learning of feedforward artificial neural networks with a view towards their software implementation. The framework encompasses various learning rules derived from carefully selected energy functions in order to speed up learning and enable convergence to acceptable weights. We describe the Cauchy, the Polynomial and the Exponential energy functions and characterize their properties and the performance of their associated learning rules. We use continuous-time gradient update laws as learning rules because of particular advantages over their discrete-time counterparts. We perform comparative investigations among these learning rules vis-a-vis the usual Gaussian (sum-of-the-squared) energy function.

The choice of an energy function specifies the back-propagated error in an update law. It also determines the learning performance in the sense of the speed of convergence. Carefully selected energy functions can in principle enable the neural system to "skip" undesirable minima, thus permitting convergence to acceptable minima.
Consequently, a carefully selected energy function can render a learning process that enables the neural network to execute good classification/association and exhibit improved generalization capability.

The Gaussian energy function has been used frequently. In contrast, we use the Cauchy energy function, derived from the Cauchy distribution, in the gradient descent weight update law. We provide a statistical rationale for the selection of the Cauchy energy. Analytical methods and computer simulations show that it can speed up the learning dynamics and improve performance in a certain sense. We also show that all the stable equilibria of the learning dynamics based on the Cauchy energy are contained in the learning dynamics based on the Gaussian energy. Therefore, the network still converges to one of the equilibrium points of the Gaussian-based system.

We define a Polynomial energy function of order r as a linear combination of the L^p norms (1 <= p <= r). We show analytically and using computer simulations that employing the Polynomial energy function improves the learning performance in a certain sense. The Polynomial energy function of order 2 does not introduce additional equilibria to the Gaussian system. As a consequence, the network converges to one of the minima of the Gaussian system.

In order to achieve good classification and generalization results, the network ought to converge to a "good" or desirable minimum of the neural system. An Exponential energy function is introduced to enable convergence to one of the desirable minima. A bound is established on the parameters of the Exponential energy function to achieve faster convergence to a desirable minimum. We also show that the learning dynamics based on the Exponential energy function possess the same minima as the learning dynamics based on the original Gaussian energy function.

The performance of each of the energy functions addressed in this work is evaluated using the XOR and the English character recognition problems. A highly user-oriented character recognition network is developed for the evaluation, with two separate modules: a feature extraction network to be trained on a large data base and a classification network to be subsequently trained by the user according to his/her requirements. An error correction capability is incorporated to enhance the character recognition performance.

To my mother, who always prayed for my success but could not live to see this day.

ACKNOWLEDGMENTS

Praise be to Allah Subhanahu-wa-Taala, who created this universe and all that is contained in it, mankind being the most superior due to its thinking faculties. The human brain inspires from its own structure numerous ideas and opens up many areas of pursuit and excellence in research to benefit the whole of humanity. The more I pursue this research, the more He reveals His greatness to me. Whatever I am today, I wholly owe it to Him.

I am grateful for the financial support from the Government of Pakistan, the National Science Foundation and the State of Michigan Research Excellence Fund (REF). Without this support, this research effort would not have been possible.

I would like to express my sincere gratitude to my major advisor, Prof. F. M. A. Salam, for his inspiration, guidance and support throughout the years of my graduate studies. Thanks to Dean F. C. Hoppensteadt, Prof. H. K. Khalil and Prof. J. R. Deller, Jr. for their valuable comments and suggestions.
I wish to express my sincere thanks and appreciation to my parents for their patience, continued prayers, support and encouragement. I deeply owe my heartiest feelings of gratitude to my wife, Ghazala, and my children, Muhammad, Hiraa and Saad, for their love, understanding and support.

TABLE OF CONTENTS

List of Tables
List of Figures
List of Symbols

Chapter 1 Introduction
1.1 Neurobiology and artificial neural networks
1.1.1 A biological inspiration
1.1.2 Topologies of artificial neural networks
1.2 Why new energy functions?
1.3 Contribution of the thesis

Chapter 2 Background
2.1 The feedforward neuron model
2.2 The feedforward neural network connection topology
2.3 The back-propagation architecture
2.4 Supervised training of the back-propagation network
2.5 The weight update laws and the energy functions
2.6 The Gaussian energy function
2.6.1 The learning dynamics due to the Gaussian energy function
2.6.2 Software implementation
2.7 The problem description
2.8 The objectives and the outline of the thesis

Chapter 3 The Cauchy energy function
3.1 Motivation
3.2 The learning dynamics
3.3 Analytical comparison with the Gaussian energy function
3.3.1 A criterion for the learning performance
3.3.2 The learning dynamics for a single pattern
3.3.3 The learning dynamics for multiple patterns
3.4 Software implementation of the learning dynamics
3.5 Simulation example: the XOR problem
3.6 Discussion

Chapter 4 The Polynomial energy function
4.1 Motivation
4.2 The learning dynamics
4.3 Analytical comparison with the Gaussian energy function
4.3.1 The learning dynamics for a single pattern
4.3.2 The learning dynamics for multiple patterns
4.4 Software implementation of the learning dynamics
4.5 Simulation example: the XOR problem
4.6 Discussion

Chapter 5 The Exponential energy function
5.1 Motivation
5.2 The learning dynamics
5.3 Analytical comparison with the simple energy function
5.3.1 Comparison of the learning dynamics
5.4 Software implementation of the learning dynamics
5.5 Simulation example: the XOR problem
5.6 Discussion

Chapter 6 The pattern/character recognition problem
6.1 Problem description
6.2 Data base
6.3 The network design
6.3.1 The feature extraction or preprocessing network
6.3.2 The classification network
6.4 Learning evaluation with different energy functions
6.4.1 Training of the feature extraction network
6.4.2 Training of the classification network
6.5 Discussion

Chapter 7 Summary and Conclusion
7.1 Summary
7.2 Conclusion

Appendix A Entropy of the Gaussian and the Cauchy distributions
Appendix B Proof of Lemma 3.1
Appendix C Proof of Theorem 3.1
Appendix D Proof of Lemma 3.2
Appendix E Proof of Theorem 3.2
Appendix F Proof of Proposition 3.1
Appendix G Proof of Theorem 4.1
Appendix H Proof of Theorem 4.2
Appendix I Proof of Proposition 4.1
Appendix J Proof of Proposition 5.1
Appendix K List of the fonts used for the pattern recognition problem

Bibliography

LIST OF TABLES

2.1 The learning algorithm
3.1 Training results comparing the Gaussian and the Cauchy learning
4.1 Training results comparing the Gaussian, the Cauchy and the Polynomial learning
5.1 Training results comparing the Exponential (kappa = 1) of the Gaussian, the Cauchy and the Polynomial learning
5.2 Training results comparing the Exponential (kappa = 1.8) of the Gaussian, the Cauchy and the Polynomial learning
6.1 Error(s) and the number of steps at convergence of the learning algorithms used for the training of the feature extraction network
6.2 Character recognition using the feature extraction network trained with the Gaussian learning
6.3 Character recognition using the feature extraction network trained with the Polynomial learning
6.4 Character recognition using the feature extraction network trained with the Exponential of the Gaussian learning
6.5 Error(s) and the number of computer steps at convergence of the learning dynamics for the training of the classification network
6.6 Character recognition when the feature extraction and the classification networks are trained with set 1
6.7 Character recognition when the feature extraction and the classification networks are trained with set 1 and set 2, respectively

LIST OF FIGURES

1.1 Simplified structure of a real neuron
2.1 Neuron activation functions
2.2 The sigmoidal activation function with threshold and shape modification
2.3 The feedforward neuron model
2.4 A structure of a multilayer FFANN
2.5 The back-propagation architecture of a FFANN
3.1 Random error distribution
3.2 The network used to solve the XOR problem
3.3 Comparison of the Gaussian and the Cauchy learning
4.1 Comparison of the Gaussian, the Cauchy and the Polynomial learning
4.2 Trajectory showing the increase in the L1 norm while the L2 norm is decreasing
4.3 Trajectory showing the increase in the L2 norm while the L1 norm is decreasing
4.4 Trajectory showing both the L1 and the L2 norms decreasing
4.5 Energy level diagram of the Polynomial energy function
4.6 Training results of the XOR problem, trained with the Modified Polynomial energy function
5.1 Graph representing the relationship (5.7)
5.2 Comparison of the Exponential (kappa = 1, 1.8) and the Gaussian learning
5.3 Comparison of the Exponential (kappa = 1, 1.8) and the Cauchy learning
5.4 Comparison of the Exponential (kappa = 1, 1.8) and the Polynomial learning
6.1 Typical examples of the fonts
6.2 Block diagram of the FFANN used for the character recognition problem
6.3 Structure of the FFANN used for feature extraction
6.4 Targets used in the feature extraction network

LIST OF SYMBOLS

u : neuron's input
v : neuron's output
$\theta$ : neuron's threshold
net : neuron's input excluding threshold
f(.) : neuron's input/output function
x : input to a neural network
y : output of a neural network
t : desired output of a neural network
v : neural network's actual mapping
g : neural network's desired mapping
E : energy function
_ : function of an energy function
$\varepsilon$ : output error of a neural network
$\xi$ : error criterion
$\delta$ : error signal to be propagated back
w : weights
$\Delta w$ : weight change
$\omega$ : stopping criterion
p : index for patterns
$\rho$ : dummy index for patterns
IN : index of input layer
$H_j, H_k, H_l, H_m$ : indices of hidden layers
O : index of output layer
i : index of input layer neurons
j, k, l, m : indices of hidden layer neurons
n : index of output layer neurons
$\eta$ : dummy index of output layer neurons
$\gamma$ : learning rate
$\kappa$ : skipping factor
d(.) : density function
L(.) : likelihood function
H(.) : entropy function
$\sigma$ : standard deviation of the Gaussian distribution
$\lambda$ : parameter of the Cauchy distribution
a, b : parameters of the Polynomial energy function
$\kappa, \mu$ : parameters of the Exponential energy function
$\Delta$ : change
$\nabla$ : gradient
$\partial$ : partial derivative
$\alpha, \beta$ : general constants

CHAPTER 1

INTRODUCTION

Von Neumann machines have proven to be very useful in digital computation and have contributed significantly to the development of other spheres of life. However, there are various tasks where conventional digital computers cannot compete with the performance of humans.
Vision, classification, association, speech and pattern recognition are only a few examples. It is also recognized that no matter how fast digital sequential computers become, they are unlikely to outperform humans in these fields. Observations in neurobiology and psychology have inspired many researchers to model and design systems which could emulate these human capabilities. Motivated by the highly interconnected networks of nerve cells referred to as neural networks, which form the basis of the human nervous system and the brain, many researchers have proposed various architectures that model the information processing aspect of the brain and the nervous system. These architectures are often referred to as artificial neural networks.

Artificial neural networks are systems made of a large number of simple processing elements operating in parallel whose function is determined primarily by the topology that connects them. These systems are capable of high-level functions such as adaptation and learning, and/or lower-level functions such as data processing. These networks are inspired both by neurobiology and by various mathematical theories of learning and information processing.

Neurobiological research in recent years has suggested that many types of map-like data representation are used in the brain [1],[2]. Some of these are now being used in neural network models [1],[3],[4]. Other work has shown that complex types of temporal processing and non-linear operations appear to be performed by neurons. Complex neuron models are being used in some neural network models [5]-[10].

Neuron models for the purpose of studying the functioning of the brain are quite involved and complex. These models are used as paradigms to test hypotheses and theories on the functioning of the brain. They fall into the category of reverse engineering and are of primary interest to neurobiologists, psychologists and physiologists. On the other hand, neuron models for use in engineering technology fall under the category of forward engineering. These models are used to implement the principles of operation to produce artificial neural networks capable of emulating some aspects of the human brain, and thus aim at improving the artificial processing of data or information.

1.1 Neurobiology and Artificial Neural Networks

There are two basic types of neural network models: (1) the neurobiological models that are intended as computational models of the biological nervous system, and (2) the biologically inspired models that are intended as computational devices, which are referred to simply as neural network models.

The main objective of artificial neural network research is to design new architectures and algorithms that can solve problems which are difficult to solve by conventional methods but are easy for intelligent biological organisms. For the inspirational development of artificial neural networks, we must understand the essence of the neurobiological architecture of real neurons and their connectivity. A simplified description and a modest analogy of the real neuron are presented in the following subsections.

1.1.1 A Biological Inspiration

The human cerebral cortex is comprised of approximately 100 billion (10^11) neurons, each having roughly 1,000 dendrites that form some 100,000 billion (10^14) synapses. Given that this system operates at about 100 Hz, it functions at some 10,000,000 billion (10^16) interconnections per second.
The cortex weighs approximately three pounds, covers about 0.15 square meters, and is about two millimeters thick. This is clearly beyond anything which can be reconstructed or modeled; but it is, perhaps, possible to understand how the brain performs information processing, and we hope that this understanding can be modeled and ultimately implemented in hardware.

The basic anatomical unit of the nervous system is the nerve cell or neuron [11]-[13]. Every neuron is a tiny information processing system with thousands of connections through which signals are sent and received. No two of these cells are exactly alike, but most share similar features. Each neuron has an inside and an outside separated by a plasma membrane. The inside of the cell and the fluid surrounding the cell have different concentrations of charged ions, which create a potential difference. The major parts of a typical neuron are shown in Figure 1.1.

Figure 1.1: Simplified structure of a biological neuron. (The figure labels the dendrites, the myelin sheath, the nodes of Ranvier, the axon, the nucleus within the cell body, and the synapse.)

The major parts of a biological neuron are:

Cell Body: It is composed of cytoplasm and contains the cell nucleus, nucleolus, mitochondria and ribosomes. The nucleus synthesizes the information which controls the activity of the neuron.

Dendrites: These are the outgrowths of the cell body through which it picks up signals from other cells.

Axon Hillock: It acts as a nonlinear threshold device that produces a rapid voltage increase or decrease, called an action potential.

Axon: It is the longest process of a cell body, and carries messages away from it. It is very resistive and carries impulses to and from the cell body. Axons are enclosed in a thick sheath made of the fatty substance myelin, which is a special low-dielectric-constant material. The myelinated sheath has regular indentations along its length called the nodes of Ranvier, which restore the pulses periodically along the axon.

Synapses: These are contacts on a neuron which are the termination points for axons from other neurons. The synapse forms a gap between the slightly enlarged ending, or synaptic knob, of an axon terminal and the membrane of the next cell. At the synapse, the electrical impulses that travel through the axon convert into chemical signals, then back to electrical impulses. When a nerve impulse reaches the synaptic knob, it causes the release of a chemical called a neurotransmitter from minute vesicles in which the substance is formed or stored. The neurotransmitters diffuse across the gap and may cause depolarization of the receptor cell membrane, thus starting an excitatory signal. If the membrane is charged more negatively than its resting voltage, it is said to be hyperpolarized, in which case the signal is inhibitory.

1.1.2 Topologies of Artificial Neural Networks

Artificial neurons are analogous to their biological inspirers. In electronic circuit form, the cell body is represented by a nonlinear element such as an operational amplifier, the axons and the dendrites by electrical wires, and the synapses by variable resistors. The input voltages to a neuron are weighted by the synapses, which are represented by the variable resistors. The resistors are connected to an operational amplifier on which a threshold has been set.
When the sum of these input voltages reaches a preset threshold, the nonlinear element fires, imitating a biological neuron.

The processing nonlinear elements can interact in many ways, depending on the manner in which they are interconnected. The interconnections may be feedforward or may contain feedback loops. The design of a neural network's feedback loops has implications for the nature of its adaptivity, while the design of the network's interconnections has implications for its parallelism.

The modeling and the connection strategy of these artificial neurons are discussed in some detail in Chapter 2. When many of these processing elements are connected together, they form an artificial neural network which mimics the neurobiological system in a very restricted way.

1.2 Why New Energy Functions?

Artificial neural networks, unlike standard computers, do not execute programs. Their behavior is determined by parameters referred to as synaptic weights. These parameters are computed by a learning procedure, during which a set of examples (input-output pairs) is presented to the network; the weights are updated so that the output corresponding to each input from the training set is as close as possible to the desired value. When the learning phase is completed, the network can be presented with some unknown data and is anticipated to generalize from the data of the training set and produce a correct response.

The learning capability of neural networks plays a key role when we want to use them for certain practical tasks. The automatic learning rules give artificial neural networks unique capabilities. Presently, many neural networks learn by using gradient descent methods associated with an energy surface. The energy is usually defined as an error function in terms of the adjustable weights. The most commonly applied energy function gives rise to the so-called back-propagation update rule. This rule is discussed in detail in Chapter 2.

Pictorially, an energy surface can be thought of as consisting of hills and valleys. The surface represents a function of the output error of the network, which in turn depends on the network weights. The gradient descent algorithm aims at finding the nearby lowest point, called a local minimum. When the discrete-time gradient descent update law is implemented, the steps are discrete and we may observe oscillations and other undesirable behavior, which is more pronounced if the step size is large.

We use the continuous-time gradient descent update law due to specific advantages over its discrete-time counterpart. These laws ensure convergence to the nearest local minimum and do not exhibit oscillations. However, the attained local minimum may not be useful for solving the problem for which the network is trained.

The goal of this research is to make the learning faster and to reach an acceptable minimum quickly. We view the immediate domain of implementation to be software on digital computers. In that context, the learning can be sped up by increasing the learning rate; but that would also increase the possibility of local oscillations and delayed convergence. In this work, energy surfaces are defined so that we do not add more minima to the already existing minima of the conventional energy surface. Moreover, starting from any initial point, the aim is to speed up the convergence to an acceptable minimum.
This is achieved by using different energy functions in the continuous-time update law, in contrast with the usual sum-of-the-squared energy function.

1.3 Contribution of the Thesis

Properly designed and trained neural networks aim at achieving human-like performance in solving association and recognition problems. Neural networks are not so much programmed as they are trained with data; thus many believe that the use of neural networks can relieve today's computer programmer of a significant portion of their programming load [14]-[16]. Moreover, neural networks improve with experience: the more data they are fed, the more accurate or complete their response becomes.

A framework, supported by analysis and computer simulations, is developed to improve the supervised learning of feedforward artificial neural networks. The framework encompasses various learning rules derived from carefully selected energy functions in order to speed up learning and enable convergence to acceptable weights. We use continuous-time gradient update laws as learning rules due to particular advantages over their discrete-time counterparts.

We describe the Cauchy and the Polynomial energy functions and characterize their properties and the performance of their associated learning rules. We show analytically and with computer simulations that using the proposed energy functions in the continuous-time weight update law can speed up convergence to a local minimum.

The local minimum may not be useful for solving the desired problem. The Exponential energy function is proposed to overcome this problem. We show analytically and with computer simulations that the continuous-time weight update law using the Exponential energy function can "skip" such minima and ensures convergence to acceptable minima.

We perform comparative investigations among these learning rules vis-a-vis the update law using the usual Gaussian energy function. We also demonstrate an application to the character recognition problem in which various fonts of printed characters are used.

CHAPTER 2

BACKGROUND

Neural net structures are based on our present understanding of the biological nervous systems. Neural net models are specified by the node characteristics, the net topology, and the training or learning rules. By virtue of the type of input/output connections, the models can be categorized as the feedforward neuron model or the feedback neuron model. The feedforward neuron model is discussed in section 2.1, whereas the feedforward neural network connection topology using these neurons is of primary interest in this work and is described in section 2.2. The resulting network, which is reminiscent of the multilayer perceptron [16],[17], will be referred to as a feedforward artificial neural network (FFANN).

Learning in these networks is inspired by adaptive classification networks, beginning with the perceptron convergence procedure and the least mean square (LMS) algorithm and extending to the more recent back-propagation, feature map and reduced coulomb energy (RCE) algorithms [15]. Back-propagation is currently the most widely applied neural network architecture used for learning. This popularity primarily revolves around the ability of the back-propagation network to learn complicated multidimensional mappings [18].
The back-propagation architecture, discussed in section 2.3, is used for all the learning strategies considered in this work.

Back-propagation neural networks can be trained using supervised, unsupervised or self-supervised training. Supervised training requires labeled training data and a teacher who provides error information through a feedback signal after each trial. Unsupervised training forms internal clusters automatically with unlabeled data. Self-supervised training uses internal monitoring and error correction to improve performance without an external teacher. This work mainly concentrates on supervised training of the FFANN, using (continuous-time) gradient descent methods.

Back-propagation learning makes use of gradient descent update laws. The performance of these update laws mainly depends upon the associated energy function. Section 2.5 describes the weight update laws and the influence of the associated energy functions. The Gaussian energy function is described in section 2.6. Section 2.7 is devoted to the current state of the art in supervised training of FFANNs and highlights the contributions of this work.

2.1 The Feedforward Neuron Model

Modeling of this class of artificial neural networks has its roots in the work of McCulloch and Pitts [19]. The feedforward model of neurons was clearly mentioned in [20] by Widrow and Hoff, where they called it an adaptive neuron. The neurons are treated as threshold logic units whose output $v_k$ is given by

$$v_k = f(u_k) = f\Big(\sum_j w_{kj}\, x_j + \theta_k\Big) \qquad (2.1)$$

where $x_j$ is the external input, $w_{kj}$ the weight or the strength of the connection from the $j$th input to the $k$th neuron, and $\theta_k$ is the threshold or the bias for the neuron. Here $f(u_k)$ is the sgn function, which is $+1$ if $u_k > 0$, $0$ if $u_k = 0$ and $-1$ if $u_k < 0$, and is diagrammatically shown in Figure 2.1a.

It was in 1972 that T. Kohonen [21] and J. A. Anderson [22] considered the fact that neurons are not binary but analog. They assumed that the activation of a neuron was sufficiently linear, of the type shown in Figure 2.1b, so that simple linear algebra would approximate the output of the system. It was realized later on [7],[23]-[26] that the sigmoidal function of the type shown in Figure 2.1c is the most pervasive, because it provides a graded and nonlinear response which is closest to that of a real neuron.

Figure 2.1: Neuron activation functions. (a) Sgn function; (b) nonlinear ramp function; (c) logistic function.

The range of the sigmoidal function is sometimes changed from [-1,1] to [0,1] depending upon the application. The logistic function is a convenient activation relationship that represents a sigmoidal function and is given as

$$f(net_k) = \frac{1}{1 + e^{-(net_k + \theta_k)/\theta_0}}. \qquad (2.2)$$

Here $\theta_k$ controls the firing activity of the neuron for some given inputs. If $\theta_k$ is negative, the sigmoidal function shifts horizontally to the right and firing is delayed until a sufficient input arrives. $\theta_0$ is used to modify the shape of the sigmoid according to the gain requirements of the neuron. This is illustrated in Figure 2.2.

Figure 2.2: The sigmoidal activation function with threshold and shape modification.

$net_k$ represents the summation of the weighted inputs to the neuron and is given by

$$net_k = \sum_j w_{kj}\, x_j. \qquad (2.3)$$

The complete feedforward model is shown in Figure 2.3.

Figure 2.3: The feedforward neuron model.
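As a concrete illustration of (2.1)-(2.3), the following minimal sketch evaluates one feedforward neuron with the logistic activation (2.2). It is written in Python purely for exposition (the software in this work was written in C), and the parameter names are illustrative choices, not part of the dissertation:

```python
import math

def logistic(net, theta_k=0.0, theta_0=1.0):
    """Logistic activation (2.2): theta_k shifts the sigmoid
    (delaying firing when negative); theta_0 scales its shape/gain."""
    return 1.0 / (1.0 + math.exp(-(net + theta_k) / theta_0))

def neuron_output(x, w, theta_k=0.0, theta_0=1.0):
    """Feedforward neuron (2.1): the weighted sum of the inputs (2.3)
    is passed through the sigmoidal activation."""
    net = sum(w_j * x_j for w_j, x_j in zip(w, x))   # net_k of (2.3)
    return logistic(net, theta_k, theta_0)

# Example: a two-input neuron with a negative threshold (delayed firing).
print(neuron_output([1.0, 0.0], [0.5, -0.3], theta_k=-0.2))
```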
2.2 The Feedforward Neural Network Connection Topology

The neuron interconnection topology found in the human brain is very complicated but follows some pattern. Neurons in the retina and cortex are organized into layers with intra- and interlayer connections. Connections within a layer are referred to as short-term memory (STM), whereas connections between layers are referred to as long-term memory (LTM). STM connections are usually considered to be unidirectional, whereas LTM connections may propagate signals in the feedforward and/or feedback direction.

We will use the feedforward connection topology throughout this work. This type of network topology makes use of feedforward neurons in different layers. In general, the network consists of an input layer, $m$ hidden layers and an output layer. These layers are indexed, input through output, by $IN, H_1, \ldots, H_j, H_k, \ldots, H_m, O$ consecutively. The input layer is assumed to have $I$ passive nodes with indices $1,\ldots,i,\ldots,I$. The input passive nodes simply accept the individual components $x_i$ of the input vector $x$ and distribute them, without modification, to all of the neurons of the first hidden layer. The $m$ hidden layers may have a sufficient number of neurons to perform the assigned task satisfactorily [27]-[29]. It was discovered early in neural network research [17],[30],[31] that a sufficient number of hidden layer(s) are required to solve a problem of normal complexity. Recently, it has been stated that one hidden layer suffices for a large class of applications [32].

Each neuron in each layer receives the output signal of each of the neurons of the layer preceding it. This continues through all of the layers of the network until the final output layer. The output layer of the network consists of $N$ neurons with indices $1,\ldots,n,\ldots,N$. The $N$-dimensional output $y$ of the network is available at the output of the $O$th layer. It is assumed that the neurons in any particular layer, as depicted in Figure 2.4, have no lateral or feedback connections.

A FFANN formally carries out a non-linear bounded mapping $v : X \subset R^I \to R^N$, from a compact subset $X$ of $I$-dimensional space to the bounded subset $v(X) = Y$ of $N$-dimensional space. When an input vector $x$ is fed into the network's input layer, the passive nodes of this layer transmit all of the components of $x$ to all of the neurons of the first hidden layer. The outputs of all of the neurons of the first hidden layer are then transmitted to all of the neurons of the second hidden layer, and so on, until finally the $N$ output units emit the components of the output vector $y$.

Figure 2.4: A structure of a multi-layer feedforward artificial neural network.

2.3 The Back-propagation Architecture

The back-propagation architecture as a technique was popularized by Rumelhart et al. in 1986 [7,8]. This architecture was earlier discovered by Bryson and Ho in 1969 [33] and independently rediscovered by Werbos in 1974 [34]. Today it is the most commonly used architecture in FFANN learning strategies [18].

Besides the feedforward connections discussed earlier in section 2.2, each unit of a hidden layer receives an error signal from each of the units in the subsequent layer, as depicted in Figure 2.5. The output units directly receive the error signal whenever the network is presented with some input.

Figure 2.5: The back-propagation architecture of the FFANN. (The figure shows the error signals $\delta[p][1][O], \delta[p][2][O], \ldots, \delta[p][n][O], \ldots, \delta[p][N][O]$ fed back from the output units.)

The network functions in two stages: a forward pass and a backward pass.
Consider the network's input consisting of $P$ vectors, corresponding to $P$ input patterns, indexed $1,\ldots,p,\ldots,P$, where each vector consists of $I$ elements corresponding to the $I$ input passive nodes, indexed $1,\ldots,i,\ldots,I$. The forward pass starts by inserting the $I$-dimensional vectors $[x_p]_1^P$ into the network's input, and the network emits the corresponding $N$-dimensional output vectors $[y_p = v(x_p)]_1^P$, which are the network's estimates of the desired or target outputs $[t_p = g(x_p)]_1^P$. After the estimates $[y_p]_1^P$ are emitted, each of the output units is supplied with its correct or target output vector $[t_p]_1^P$. The elements of the error vectors are computed at the output of the network as

$$[\varepsilon_{pn}]_1^P = \big[\,|t_{pn} - y_{pn}|\,\big]_1^P, \qquad n = 1,\ldots,N. \qquad (2.4)$$

The error elements $\varepsilon_{pn}$ and the error signal $\delta[p][n][O]$ (to be discussed in section 2.6) are calculated and are available at each unit $n$ of the output layer $O$.

In the backward pass, the output units directly receive the error signals from their outputs. The hidden units in the layer below the output layer receive their error signals from the output units to which they are connected. The error signal is thus passed back through the same links which were used for the forward pass, but in the backward direction. The error signal $\delta[p][j][H_j]$ of a unit $j$ in a hidden layer $H_j$ is thus a function of the error signals of the units in the $H_k$th layer and their interlinking weights. The error is propagated back until each of the units in the first hidden layer gets its corresponding error signal. The output layer becomes the input layer in the backward pass.

To summarize, in each training cycle the input is fed at the input layer of the network. The output of the network is determined in the forward pass. The error signal for each neuron in the network is calculated and propagated back in the backward pass. Depending upon the error signal, the weight or synaptic strength of each link in the network is adjusted in a manner so as to reduce the output error.

2.4 Supervised Training of the Back-propagation Network

Having designed the network with the back-propagation architecture, using feedforward neurons in the feedforward connection topology as discussed in sections 2.1, 2.2 and 2.3, the most critical task for the network is to learn, i.e., to achieve weight values that realize the mapping. Learning is one of the key issues in artificial neural network research.

With respect to learning, neural networks fall into a number of classes. Some networks are completely fixed, in which the design of the network determines the weights. Other networks have synaptic weights that depend on the problem being addressed by the network, but the weights are pre-programmed into the network. Other networks can be updated continuously by an outside data source; networks of this type can perform adaptive learning in response to a changing environment. The final class is in situ learning, where the learning algorithm is incorporated directly into the network.

Most neural network research has focussed on direct learning to solve specific tasks instead of building an internal model. Direct learning involves learning a specific action to solve each task. Learning internal world models is opening up an important area of learning where internal models of a limited environment are built through interaction and do not have to be programmed into the network. One of the approaches is to use a FFANN to mimic the input/output behavior of a network interacting with the environment.
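The forward pass and the output-error computation (2.4) can be sketched as follows. This is an illustrative Python fragment under our own conventions (a layer is represented by a hypothetical (W, theta) pair), not the dissertation's C implementation:

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward_pass(x, layers):
    """Forward pass through a FFANN. `layers` is a list of (W, theta)
    pairs, one per layer; W[k][j] weights the signal from node j of
    the previous layer into neuron k. Returns all layer activations."""
    activations = [x]
    for W, theta in layers:
        x = [logistic(sum(w_j * x_j for w_j, x_j in zip(row, x)) + th)
             for row, th in zip(W, theta)]
        activations.append(x)
    return activations

def output_errors(t, y):
    """Error elements of (2.4): e_pn = |t_pn - y_pn|."""
    return [abs(t_n - y_n) for t_n, y_n in zip(t, y)]

# Example: one hidden layer of two neurons, one output neuron.
layers = [([[0.5, -0.4], [0.3, 0.8]], [0.0, 0.0]),   # hidden layer H1
          ([[1.2, -0.7]], [0.1])]                    # output layer O
acts = forward_pass([1.0, 0.0], layers)
print(output_errors([1.0], acts[-1]))
```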
This work focusses on the supervised training of FFANNs with the back-propagation architecture. It is a means of training adaptive neural networks which requires labeled training data and an external teacher. The teacher knows the desired correct response and provides an error signal when an error is made by the network. This is sometimes called reinforcement learning, or learning with a critic, when the teacher only indicates whether a response was correct or incorrect but does not provide detailed error information.

During supervised training, a training set consisting of examples $(x_p, t_p)$ of the mapping between the input $x_p$ and the corresponding desired output $t_p = g(x_p)$, for the input-output patterns $p = 1,\ldots,P$, is provided. It is assumed that such examples of the mapping $g$ are representative of the whole class of maps from the input space to the desired space. When the input data set $[x_p]_1^P$ is presented to the network, it produces the estimate $[y_p = v(x_p)]_1^P$ of $[t_p = g(x_p)]_1^P$ and computes the output error vectors $\varepsilon_p$, $p = 1,\ldots,P$, as given by (2.4). The error signal is generated from the error vectors according to a certain weight update law, which is discussed in section 2.5. The error signal, when propagated back, adjusts the weights of the network in a direction so as to minimize the output error. With this training, the network thus learns to associate an input pattern with its corresponding output pattern. If the training examples were true representatives of the class of the mapping $g$, the network hopefully will associate all patterns (including those for which it has not been trained) correctly.

2.5 The Weight Update Laws and the Energy Functions

The weight update law is very critical in determining the error signal which ultimately makes the network learn. The Generalized Delta Rule (GDR) is a gradient descent method of updating the weights. It has been proposed by Rumelhart et al. [7,8] and is given as

$$\Delta w_{kj} = -\gamma \, \nabla_{w_{kj}} E. \qquad (2.5)$$

Here $w_{kj}$ is the weight connecting the output of the $j$th node in the $H_j$th layer to the input of the $k$th node in the $H_k$th layer, $\gamma$ is the learning rate, $\Delta$ represents the change, and $\nabla_{w_{kj}}$ represents the partial derivative with respect to $w_{kj}$. $E$ is the energy function and is given as

$$E = \sum_p E_p, \qquad E_p = \frac{1}{2} \sum_n \varepsilon_{pn}^2. \qquad (2.6)$$

The weight update law (2.5) achieved a breakthrough in training FFANNs. It can be conveniently used in most cases; however, its convergence cannot be guaranteed. Describing the update law as a system of differential equations [35]-[39] in the form

$$\dot{w}_{kj} = -\gamma \, \nabla_{w_{kj}} E \qquad (2.7)$$

has its advantages over the system of difference equations (2.5). In particular, being a continuous-time gradient descent system, it (a) ensures convergence to local minima, i.e., stable equilibria, (b) prevents the existence of oscillations and complicated behaviors, and (c) lends itself to natural implementation via analog VLSI/LSI silicon circuits [35]-[39].

The choice of energy function specifies the error signal to be propagated back. It also determines the learning performance in the sense of the speed of convergence and the size of the domains of attraction of the stable equilibrium points in the weight space. We remark that the sum-of-the-squared energy function is the L2 (Euclidean) norm [40],[41]. The sketch below contrasts the discrete update (2.5) with a numerical integration of the continuous-time law (2.7).
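The following fragment contrasts one discrete GDR step (2.5) with a forward-Euler integration of the continuous-time law (2.7). The quadratic toy energy and all numerical values are our own illustrative choices:

```python
def gdr_step(w, grad_E, gamma):
    """One Generalized Delta Rule step (2.5): w <- w - gamma * grad E(w)."""
    return [w_i - gamma * g_i for w_i, g_i in zip(w, grad_E(w))]

def continuous_descent(w, grad_E, gamma, dt, steps):
    """Euler integration of the continuous-time law (2.7),
    dw/dt = -gamma * grad E(w); faithful only when gamma*dt is small."""
    for _ in range(steps):
        w = [w_i - gamma * dt * g_i for w_i, g_i in zip(w, grad_E(w))]
    return w

# Toy energy E(w) = w1^2 + w2^2 with gradient (2*w1, 2*w2).
grad = lambda w: [2 * w[0], 2 * w[1]]
print(gdr_step([0.9, -0.5], grad, gamma=0.1))
print(continuous_descent([0.9, -0.5], grad, gamma=1.0, dt=0.01, steps=500))
```

With $\gamma \Delta t$ small, the Euler iterates track the continuous trajectory and decay monotonically to the minimum at the origin; a large discrete step can overshoot it.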
2.6 The Gaussian Energy Function

A learning rule which achieves the desired values of the weights (preferably the global minimum) is statistically equivalent to a maximum likelihood estimator (MLE). In other words, obtaining an MLE is equivalent to minimizing the errors given by some energy function. In the process of maximizing the likelihood estimate, the energy function is derived given the probability distribution of the errors from the desired values in the output of the FFANN.

A well known method of approximating a distribution on the basis of partial knowledge is to choose the density function $d(x)$ which maximizes the entropy subject to the constraint of moments [42]-[46]. It is well established that the maximum entropy distribution subject to the second moment constraint on $(-\infty,+\infty)$ is the Gaussian distribution [46]-[48].

Now suppose that we would like the errors in the output of a FFANN to be Gaussian random variables with zero mean. In order to make the network an MLE, the criterion is to minimize the sum-of-the-squared-error energy function [48]. This is specifically shown below. The Gaussian distribution of the errors is given by

$$d(\varepsilon_{pn}) = \frac{1}{\sqrt{2\pi}\,\sigma_{pn}} \exp\left[\frac{-\varepsilon_{pn}^2}{2\sigma_{pn}^2}\right]. \qquad (2.11)$$

Assuming the output errors to be independently distributed for all the output nodes, the joint Gaussian distribution is then given by

$$d_j(\varepsilon_p) = (2\pi)^{-N/2}\, |\Sigma_p|^{-1} \exp\left[-\frac{1}{2}\,(\varepsilon_p)^T (\Sigma_p)^{-2} (\varepsilon_p)\right]. \qquad (2.12)$$

Here $N$ denotes the total number of neurons in the output layer of the FFANN and

$$\varepsilon_p = [\varepsilon_{p1}\; \varepsilon_{p2}\; \cdots\; \varepsilon_{pN}]^T, \qquad \Sigma_p = \mathrm{diag}\,[\sigma_{p1}\; \sigma_{p2}\; \cdots\; \sigma_{pN}],$$

where $\sigma_{pn}$ is the standard deviation of the error distribution of the $n$th output node when the $p$th pattern is used. Since the errors $\varepsilon_p$, $p = 1,2,\ldots,P$, are independent, the likelihood function $L$ is

$$L(\varepsilon_p) = \prod_p d_j(\varepsilon_p) = (2\pi)^{-NP/2} \prod_p |\Sigma_p|^{-1} \exp\left[-\frac{1}{2}\sum_p (\varepsilon_p)^T (\Sigma_p)^{-2} (\varepsilon_p)\right]. \qquad (2.13)$$

Choose $\sigma_{p1} = \sigma_{p2} = \cdots = \sigma_{pN} = 1$ for all the patterns $p = 1,2,\ldots,P$. Maximizing the likelihood function then corresponds to minimizing the energy function in equation (2.6), which is referred to as the Gaussian energy function. Of course, in general, $\Sigma_p$ in (2.13) can be any symmetric nonsingular matrix.

The continuous-time learning dynamics of the update law (2.7), using the Gaussian energy function, have been explored in [35]-[39]. In subsection 2.6.1 we describe these learning dynamics, whereas the software implementation is outlined in subsection 2.6.2.
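To make the link between (2.13) and (2.6) explicit, take the negative logarithm of the likelihood with unit standard deviations (this one-step derivation is ours, added for clarity):

$$-\ln L(\varepsilon_p) = \frac{NP}{2}\ln(2\pi) + \frac{1}{2}\sum_p \sum_n \varepsilon_{pn}^2 = \mathrm{const} + \sum_p E_p = \mathrm{const} + E,$$

so maximizing $L$ over the weights is precisely minimizing the Gaussian energy $E$ of (2.6).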
Now the weight update law (2.15 ) canbewrittenas warn =7; 8pit - sgn (spa) - ypn ~(1-an)- 2pm =7; (tpn-ypn) - an - (l-ypn) - 2pm =7); 6[p1[n][0] .ylpltmlle]. (2.18) where y [p ][m][H,,,] is the output an of the mth node in the H", th layer. 5[p ][n][0] is the error signal which is completely determined by the desired and the actual outputs of node n in the output layer and is given by 5[P][n][0] = (tpn-ypn) -)’pn - (1 -ypn)- (2.19) Observe that the weight w,.,,, , is updated depending upon the error signal of the nth node and output of mth node. The weight update law for the weights wk,- , connecting the jth node of a hidden/input layer H j to the kth node in the subsequent hidden layer Hg, is derived similarly and is given by Wu =7; 5[P1[k][Hh]-)’[P]U][Hj], (220) where 5[P][k][Hk] 3= 21 MINIMUM - Wlk .th - (1 - ypk)- (2-21) In (2.21), I is the index of the nodes in the H1 th layer to which node k feeds its output 22 ylk and w”; is the weight of the corresponding connection. y [p][j][H,-] is the output of the node j in the H ,- th layer which is fed to node k in the I-Ik th layer. 2.6.2 Software implementation For the software implementation of the update law (2.7), using the Gaussian energy function, the learning algorithm is formalized in Table 2.1. Table 2.1 The Learning Algorithm 1 Present the input vectors xp , p = 1,...,P , at the input of the FFANN. Determine the network’s output yp = v(xp.), p = 1,...,P. Calculate the error signal for the output nodes. Propagate the error signal back to all the nodes in the network. Update the weights. O’tlllkwN If the weight change of the network is greater than the stopping criterion go to step 1, otherwise stop. The stopping criterion to can be defined as to := 2; |Aw,,,- I. (2.22) W” where wk,- are all the weights in the network. Theoretically in order to reach a local minimum the weight change criterion should be zero, which may never be achieved. In practice we can choose a sufficiently small value of (n, which when achieved indicate that the trajectories of the system have reached a local minimum. The weight change Awkj of the system can be determined as 23 Awkj := Wk} (t +At )-ij (t ). (2.23) The term At is the time step of the integration routine. Using the fourth order Runge- Kutta integration routine the learning algorithm outlined in Table 2.1 was implemented into computer software using the C—language [36]. The software is designed to initialize the weights of the network randomly, by a ran- dom number generator. The seed, which is an integer used to start the random number generator, is used to represent the set of initial weights. The value of the initial weights w is restricted to be -o.9 s w s 0.9. An error criterion g is defined as g := g3; 8,," 2, (2.24) is used to characterize the training: the lower the error, the better is the training. The error criterion can also prove to be useful in some cases, to terminate the training of the network. 2.7 The problem description We consider the implementation of FFANN into software on digital computers. For learning of such FFANNs, the continuous-time weight update law (2.7) which is a set of differential equations is implemented on digital computers. For faithful implementation of the gradient descent update law (2.7), the learning rate 7 and the time step At (2.23) must be very small. As a consequence of the small value of 7.At, (l) the learning is slow (2) the algorithm converges to the nearest local minimum which in most cases may not be satisfactorily useful. 
One may consider increasing value of the learning rate yas an immediate solution to the above problems. This may not work in all the cases. With a large value of 7, while we increase the tendency of skipping the nearby local minima, we may also make it prone to 24 skip desirable minima and delay or prevent the convergence. The so-called back-propagation update law in its existing form uses an almost Euler approximation to the continuous-time update law. In addition, most reported simulations use a large step size or a learning rate. Such large values prevent tractable analysis and have not been proven to ensure convergence. Thus the implementation of FFANNs with the two conflicting properties remains a challenge: (a) the update in speed is fast. (b) Gauranteed convergence to an acceptable minimum and the occurence of no oscilla- tions. In this work, we aim at achieving this challenge. 2.8 The objectives and the outline of the thesis (1) In this work we deal with this problem in two steps. We present a method to make the learning faster by using different energy functions in the update law (2.7). For the supervised learning of FFANN we use the continuous-time weight update law (2.7) which is derived from an energy function. If we choose an appropriate energy function the network may learn faster. The energy function can be viewed as (i) related to the distribution of the random errors in the output of the network, or (ii) a mathematical norm. We can choose a maximum entropy distribution of random errors in the output of the network, subject to the constraint of moments. Then, proper selection of the energy function for training the FFANN can make it an MLE in the statistical sense. In chapter 3, we show that in order to make the learning faster, we can choose a dis- tribution other than the Gaussian distribution. We choose the Cauchy distribution and motivate a rational for its selection and show that learning in this case is faster 25 or at least does not deteriorate. This conclusion is also supported by computer simu- lations. When the energy function is viewed as a mathematical norm, we have the option of selecting any LP norm. We combine various norms to form what we call the Polynomial energy function. In chapter 4, we show the usefulness of choosing the Polynomial energy function analytically and with simulation support. (2) We present a method which make it possible to "skip" unacceptable local minima and at the same time ensures faster convergence to one of the acceptable minima. The problem of skipping the unacceptable local minima is tackled in chapter 5. We propose an exponential energy function, which skips the unacceptable local minima and ensures faster convergence to one of the acceptable minima. The Exponential energy function is used in conjunction with the Gaussian, the Cauchy or the Polynomial energy functions. The performance of the learning algorithm using weight update law (2.7) with dif- ferent energy functions (Table 2.1) is evaluated in the application to the pattern recogni- tion problem in chapter 6. After discussing the network design, various computer simula- tions are presented. Finally in chapter 7, summary of this work is collected along with concluding remarks. CHAPTER 3 THE CAUCHY ENERGY FUNCTION 3.1 Motivation This chapter addresses the issue of increasing the learning speed of the FFANNs. In what follows, we describe the intuitive rational for selecting the Cauchy energy function from the point of view of statistical signal processing. 
On the other hand, the real rationale for this work may stem from the convergence speed of the proposed weight update learning rule as compared to the usual update rule (2.6)-(2.7).

The derivation of a supervised training algorithm for a FFANN implies the selection of a norm criterion which gives a suitable measure of a particular distribution of errors in the output of the network. There exists a correspondence between the L_p norms and the error distributions [48]. Many researchers [44]-[48] consider the maximum entropy of the output errors, subject to certain moment constraints, to be a suitable measure to estimate the error distribution. The larger the number of moment constraints, the smaller the choice of error distributions. It is well known that, subject to the first and second moment constraints, the maximum entropy distribution is the Gaussian distribution [44]-[48]. Corresponding to the Gaussian distribution, the criterion is to select the L_2 norm for the energy function in the weight update law (2.7) [49].

Restricting our attention to the first moment constraint, there are many choices of an error distribution. The Cauchy distribution [50], given as

d(\epsilon_{pn}) = \frac{1}{\pi}\,\frac{\lambda_{pn}}{\lambda_{pn}^2 + \epsilon_{pn}^2},   (3.1)

is chosen due to the following reasons.
(1) The entropy of the Cauchy distribution (3.1) is about 1.78 times higher than the entropy of the Gaussian distribution (2.11), when the parameters λ_pn and σ_pn for these distributions are chosen to be unity. This is analytically shown in Appendix A.
(2) The speed of convergence of the weight update law (2.7) improves when the energy function corresponding to the Cauchy distribution is used instead of the Gaussian energy function (2.6). This is discussed in detail in this chapter.

In order to derive the energy function corresponding to the Cauchy distribution of errors, let us consider the errors ε_pn, 1 ≤ p ≤ P and 1 ≤ n ≤ N, to be independently distributed. The joint Cauchy distribution for pattern p will then be

d_j(\epsilon_p) = \left(\frac{\lambda}{\pi}\right)^{N} \prod_{n=1}^{N} \frac{1}{\lambda^2 + \epsilon_{pn}^2}.   (3.2)

Since the errors for all the desired patterns p = 1,2,...,P are assumed to be independent, the likelihood function is

L(\epsilon) = \prod_{p=1}^{P} d_j(\epsilon_p) = \left(\frac{\lambda}{\pi}\right)^{PN} \prod_p \prod_n \frac{1}{\lambda^2 + \epsilon_{pn}^2}.   (3.3)

We choose λ_p1 = λ_p2 = ... = λ_pN = 1 for all the patterns p = 1,2,...,P. Then maximizing the likelihood estimator (3.3) corresponds to minimizing the quantity

E^c = \frac{1}{2} \prod_p \prod_n (1 + \epsilon_{pn}^2).   (3.4)

(The constant factor 1/2 does not affect the minimization and is kept for notational convenience in the derivations that follow.) We refer to the quantity in (3.4) as the Cauchy energy function [52].

For a qualitative comparison, we set σ_pn in (2.11) and λ_pn in (3.1) to 1 and obtain the two distributions depicted in Figure 3.1. From the graphs in Figure 3.1, observe that the Cauchy distribution of the error is more spread out than the Gaussian distribution. This "spread out" feature signifies higher entropy and more tolerance in the errors of the Cauchy distribution.

Figure 3.1: Random error distribution. The horizontal axis shows the random error and the vertical axis shows the density with which the errors are distributed. The solid line represents the Gaussian error distribution as given by (2.11) with σ_pn = 1 and the dotted line represents the Cauchy error distribution as given by (3.1) with λ_pn = 1.

In this work, we desire to train the FFANN using a learning update law derived from the Cauchy energy function (3.4). When the update law converges to a weight value that minimizes the Cauchy energy function, the FFANN would then emulate a classifier with a Cauchy distribution of errors from the target values.
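For a quick side-by-side view (a sketch, not part of the dissertation's simulator; eps[p][n] is assumed to hold the output errors t_pn - y_pn, and the names are hypothetical), the two energies can be computed as:

/* Sketch: Gaussian (2.6) and Cauchy (3.4) energies from an error matrix.
   eps[p][n] = t_pn - y_pn; P patterns, N output nodes. */
double gaussian_energy(int P, int N, double eps[P][N])
{
    double E = 0.0;
    for (int p = 0; p < P; p++)
        for (int n = 0; n < N; n++)
            E += eps[p][n] * eps[p][n];
    return 0.5 * E;                 /* E = (1/2) sum_p sum_n eps^2 */
}

double cauchy_energy(int P, int N, double eps[P][N])
{
    double Ec = 1.0;
    for (int p = 0; p < P; p++)
        for (int n = 0; n < N; n++)
            Ec *= 1.0 + eps[p][n] * eps[p][n];
    return 0.5 * Ec;                /* E^c = (1/2) prod_p prod_n (1 + eps^2) */
}

Note the multiplicative structure of the Cauchy energy: each additional pattern or output node multiplies the energy rather than adding to it, which is what produces the amplifying factors α in the learning dynamics analyzed in section 3.3.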
The chapter is organized as follows. In section 3.2, we analyze the learning dynamics of the weight update law (2.7) using the Cauchy energy function. In section 3.3, we analytically compare the learning dynamics due to the Gaussian (2.6) and the Cauchy (3.4) energy functions. Software implementations of the learning dynamics are presented in section 3.4, which are then used to train a FFANN. As an illustrative example, we specialize the software to solve the XOR problem. Simulation results are reported in section 3.5. We discuss the advantages and shortcomings of using the Cauchy energy function in section 3.6.

3.2 The Learning Dynamics

In this section, the continuous-time learning dynamics of the update law (2.7), using the Cauchy energy function, are explicitly derived. As described earlier in section 2.2, the output layer is layer O, with nodes indexed by n. The O-1th layer, which feeds its output to the output layer, is actually the last hidden layer H_m, whose nodes are indexed by m. The learning dynamics of the weights are obtained by using the Cauchy energy function (3.4) in the update law (2.7); specifically, for the output layer

\dot{w}^c_{nm} = -\gamma \nabla_{w_{nm}} E^c,   (3.5i)

and for any hidden layer

\dot{w}^c_{kj} = -\gamma \nabla_{w_{kj}} E^c.   (3.5ii)

Using the chain rule, the dynamics for the output layer are

\dot{w}^c_{nm} = -\gamma \cdot \frac{\partial E^c}{\partial \epsilon_{pn}} \cdot \frac{\partial \epsilon_{pn}}{\partial y_{pn}} \cdot \frac{\partial y_{pn}}{\partial u_{pn}} \cdot \frac{\partial u_{pn}}{\partial w_{nm}},   (3.6)

where u_pn is the net input to the nth node of the output layer and is given by

u_{pn} = \sum_m w_{nm} z_{pm} + \theta_n.   (3.7)

Here z_pm denotes the output of the mth node in the H_m th hidden layer, and θ_n is the threshold of the nth node in the output layer. The partial derivatives in equation (3.6) can be explicitly calculated as

\frac{\partial E^c}{\partial \epsilon_{pn}} = \frac{1}{2}\left[\prod_{\rho \neq p} \prod_{\eta} (1+\epsilon_{\rho\eta}^2)\right]\left[\prod_{\eta \neq n} (1+\epsilon_{p\eta}^2)\right] \cdot 2\,\epsilon_{pn},   (3.8i)

\frac{\partial \epsilon_{pn}}{\partial y_{pn}} = -\mathrm{sgn}(t_{pn} - y_{pn}),   (3.8ii)

\frac{\partial y_{pn}}{\partial u_{pn}} = y_{pn}(1-y_{pn}),   (3.8iii)

and

\frac{\partial u_{pn}}{\partial w_{nm}} = z_{pm}.   (3.8iv)

Here ρ is a dummy variable representing patterns 1 through P and η is a dummy index for nodes 1 through N of the output layer. We have selected the so-called logistic function [7,8]

y_{pn} = \frac{1}{1+e^{-u_{pn}}}

to describe the input-output relationship of a neuron. Now the weight update law can be written as

\dot{w}^c_{nm} = \gamma \sum_p \delta[p][n][O] \cdot y[p][m][H_m],   (3.9)

where

\delta[p][n][O] := \left[\prod_{\rho \neq p} \prod_{\eta} (1+\epsilon_{\rho\eta}^2)\right]\left[\prod_{\eta \neq n} (1+\epsilon_{p\eta}^2)\right] (t_{pn}-y_{pn})\, y_{pn}(1-y_{pn})   (3.10)

and y[p][m][H_m] is the output z_pm of the mth node in the H_m th layer.

The weight update law for the weights w_kj, connecting the jth node of the hidden/input layer H_j to the kth node in the subsequent hidden layer H_k, is derived similarly and is given by

\dot{w}^c_{kj} = \gamma \sum_p \delta[p][k][H_k] \cdot y[p][j][H_j],   (3.11)

where

\delta[p][k][H_k] := \sum_l \delta[p][l][H_l] \cdot w_{lk} \cdot y_{pk}(1-y_{pk}).   (3.12)

In (3.12), l is the index of the nodes in the H_l th layer to which node k feeds its output y_pk, and w_lk is the weight of the corresponding connection. y[p][j][H_j] is the output of node j in the H_j th layer which is fed to node k in the H_k th layer.
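As a sketch of how the error signal (3.10) might be evaluated in software (hypothetical names; eps[p][n] stores the signed error t_pn - y_pn and y[p][n] the actual output):

/* Sketch: the Cauchy error signal of eq. (3.10) for output node n, pattern p. */
double cauchy_delta(int P, int N, double eps[P][N], double y[P][N],
                    int p, int n)
{
    double amp = 1.0;
    /* product over all patterns rho != p, all output nodes eta */
    for (int rho = 0; rho < P; rho++) {
        if (rho == p) continue;
        for (int eta = 0; eta < N; eta++)
            amp *= 1.0 + eps[rho][eta] * eps[rho][eta];
    }
    /* product over output nodes eta != n within pattern p */
    for (int eta = 0; eta < N; eta++) {
        if (eta == n) continue;
        amp *= 1.0 + eps[p][eta] * eps[p][eta];
    }
    /* the Cauchy delta is the Gaussian delta (2.19) amplified by the
       accumulated error of all other (pattern, node) pairs */
    return amp * eps[p][n] * y[p][n] * (1.0 - y[p][n]);
}

Since every factor 1 + ε² ≥ 1, the amplification amp is at least 1; this is the source of the speed comparison made next in section 3.3.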
3.3 Analytical Comparison with the Gaussian Energy Function

3.3.1 A criterion for the learning performance

To compare the dynamics of the (Gaussian) learning (2.14) and the (Cauchy) learning (3.5), we define the performance criterion delineated in the following procedure. First we consider V := E as a candidate Liapunov function [53]-[54]. Then, we calculate the derivative of V along the trajectories of the (Gaussian) learning (2.14) -- denote this derivative by \dot{V}. Similarly, we calculate the derivative of V along the trajectories of the (Cauchy) learning (3.5) -- denote this derivative by \dot{V}^c.

Now if \dot{V}^c - \dot{V} ≤ 0, then we say that the learning performance* of the second system is "faster" than or at least the same as that of the first system. This is motivated by the fact that \dot{V}^c - \dot{V} gives the projection of the rate of change of the difference in the weight vectors, say (W^c - W), along the divergence vector \nabla_W V. This we take as a measure of the speed of updating the weight-vector difference (W^c - W), which in turn is equivalent to the comparative speed in updating the weight vectors W^c and W.

* Improvement in the learning performance of system (3.5) here means that the derivative of the sum-of-the-squared error at any point on the error surface along the trajectories of system (2.14) is less in magnitude than that of system (3.5).

Formally, we observe that

\dot{V}^c - \dot{V} = \nabla_W V \cdot (\dot{W}^c - \dot{W}),   (3.13)

where W^c and W denote vectors containing all the corresponding weights. Equations (3.5) give the components of \dot{W}^c and equations (2.14) give the components of \dot{W}.

3.3.2 The learning dynamics for a single pattern

Consider the case when only one input-output pattern is applied to the network. For ease of notation we will drop the subscript p for patterns and define the Gaussian and the Cauchy energy functions as

E := \frac{1}{2}\sum_n \epsilon_n^2   (3.14i)

and

E^c := \frac{1}{2}\prod_n (1+\epsilon_n^2),   (3.14ii)

respectively. By straightforward calculations, for the output layer equation (3.5i) can be rewritten as

\dot{w}^c_{nm} = -\gamma \left[\nabla_{w_{nm}} E + \alpha_n \nabla_{w_{nm}} E\right].   (3.15i)

For the hidden layer H_k the learning algorithm in (3.5ii) can be similarly rewritten as

\dot{w}^c_{kj} = -\gamma \left[\nabla_{w_{kj}} E + \frac{1}{2}\sum_n \alpha_n \nabla_{w_{kj}} \epsilon_n^2\right].   (3.15ii)

Here α_n ≥ 0 and is given explicitly as

\alpha_n = E^n_1 + E^n_2 + \dots + E^n_{(N-1)},

where E^n_1 := \sum_{\eta \neq n} \epsilon_\eta^2 is the sum of the squared errors not including ε_n², E^n_2 is the summation of all \binom{N-1}{2} pairwise products ε_i²ε_j² not including ε_n², and finally

E^n_{(N-1)} := \epsilon_1^2 \epsilon_2^2 \cdots \epsilon_N^2 not including \epsilon_n^2.

Before we formally characterize the learning performance due to the Gaussian and the Cauchy energy functions in the form of a theorem, Lemma 3.1 is in order, which uses Assumption 3.1 given below.

Assumption 3.1: All the eigenvalues of the matrix with entries a_ij = α_i + α_j are non-negative.

Remark: In our simulations, we always initialize the weights in the learning dynamics so that the weight components have small values (typically within [-0.9, 0.9]). In doing so, the outputs of the individual nodes are likely to be close to 0.5. In that case, α_1 ≈ α_2 ≈ ... ≈ α_n ≈ ... ≈ α_N. If the equality holds, then Assumption 3.1 is automatically satisfied. See Appendix B.

Lemma 3.1: Given α_i, x_i ∈ R, α_i ≥ 0, and assuming that Assumption 3.1 holds. Then,

\left[\sum_i x_i\right]\left[\sum_i \alpha_i x_i\right] \ge 0.

The proof of Lemma 3.1 is provided in Appendix B.

Theorem 3.1: Assume that Assumption 3.1 holds. Consider V := E given by (3.14i) to be a candidate Liapunov function. Let \dot{V} be the derivative of V along the trajectories of system (2.14) and \dot{V}^c be the derivative of V along the trajectories of system (3.15). Then the learning performance of system (3.15) is "faster" than or at least is the same as that of system (2.14).

The proof of this theorem, presented in Appendix C, analytically shows that the learning performance improves if the Cauchy energy function is used instead of the Gaussian energy function.
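To make the structure of (3.15) concrete, consider a worked illustration (not from the dissertation) with a single pattern and N = 2 output errors:

\begin{aligned}
E^c &= \tfrac{1}{2}(1+\epsilon_1^2)(1+\epsilon_2^2),\\
\nabla E^c &= \epsilon_1(1+\epsilon_2^2)\,\nabla\epsilon_1 + \epsilon_2(1+\epsilon_1^2)\,\nabla\epsilon_2\\
&= \big[\epsilon_1\nabla\epsilon_1 + \epsilon_2\nabla\epsilon_2\big] + \epsilon_2^2\,\epsilon_1\nabla\epsilon_1 + \epsilon_1^2\,\epsilon_2\nabla\epsilon_2,
\end{aligned}

where the bracketed term is exactly \nabla E for the Gaussian energy (3.14i). Hence α_1 = ε_2² and α_2 = ε_1²: each output's Gaussian gradient term is amplified by the squared error still present at the other output node, and the amplification vanishes as the errors shrink.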
3.3.3 The learning dynamics for multiple patterns

Now, we consider the learning dynamics of the weights for P patterns. The (Gaussian) weight update law is given by equation (2.14), where E is given by (2.6). The (Cauchy) learning dynamics of the weights take the form of (3.5), where E^c now is given by (3.4). By straightforward calculations, \dot{w}^c_{nm} can be written as

\dot{w}^c_{nm} = -\gamma \left[\nabla_{w_{nm}} E + \frac{1}{2}\sum_p \alpha_{pn} \nabla_{w_{nm}} \epsilon_{pn}^2\right].   (3.16i)

Similarly, for the hidden layers the learning algorithm of the Cauchy energy function (3.4) can be written as

\dot{w}^c_{kj} = -\gamma \left[\nabla_{w_{kj}} E + \frac{1}{2}\sum_p \sum_n \alpha_{pn} \nabla_{w_{kj}} \epsilon_{pn}^2\right].   (3.16ii)

Here α_pn ≥ 0 and is given by

\alpha_{pn} = E^{pn}_1 + E^{pn}_2 + \dots + E^{pn}_{(PN-1)},

where E^{pn}_1 := \sum_{(\rho,\eta)\neq(p,n)} \epsilon_{\rho\eta}^2 is the sum of the squared errors not including ε_pn², E^{pn}_2 is the summation of all \binom{PN-1}{2} pairwise products not including ε_pn², and finally

E^{pn}_{(PN-1)} := \epsilon_{11}^2 \epsilon_{12}^2 \cdots \epsilon_{PN}^2 not including \epsilon_{pn}^2.

Before we compare the learning performance due to the weight update laws (2.14) and (3.16), we state Lemma 3.2, which uses the following assumption.

Assumption 3.2: All the eigenvalues of the matrix with entries a_ij = α_i + α_j, the index running over all (pattern, node) pairs, are non-negative.

Remark: Assumption 3.2 is equivalent to Assumption 3.1.

Lemma 3.2: Given α_ij, x_ij ∈ R, α_ij ≥ 0, and assuming that Assumption 3.2 holds. Then,

\left[\sum_i \sum_j x_{ij}\right]\left[\sum_i \sum_j \alpha_{ij} x_{ij}\right] \ge 0.

The proof of Lemma 3.2 is provided in Appendix D.

Theorem 3.2: Assume that Assumption 3.2 holds. Consider a candidate Liapunov function V := E given by equation (2.6). Let \dot{V} be the derivative of V along the trajectories of (2.14) and \dot{V}^c be the derivative of V along the trajectories of the system (3.16). Then the learning performance of system (3.16) is "faster" than or at least is the same as that of system (2.14).

Again the proof of this theorem is included in Appendix E. The analysis clearly indicates that the learning performance in "speed" can improve (or at least does not deteriorate) when the Cauchy energy function is used.

Proposition 3.1: The equilibria of system (3.16) are also the equilibria of system (2.14).

The proof of Proposition 3.1 is provided in Appendix F.

Remark: Proposition 3.1 implies that when we use the Cauchy energy function (3.4) in the weight update law (2.7), we still converge to one of the equilibria of system (2.14). However, this does not mean that, starting from the same initial condition, the two systems would follow the same trajectory or that their trajectories would converge to the same equilibrium point.

3.4 Software implementation of the learning dynamics

For the software implementation of the learning dynamics of the weights connected to the output layer (3.9) and the weights connected to the hidden layers (3.11), the learning algorithm formalized in Table 2.1 is used. The weight update laws (3.9) and (3.11) are implemented in computer software using the C language on Sun SPARC stations. A fourth-order Runge-Kutta integration routine is used for the integration of these update laws.

A stopping criterion ω was set to determine if the system has converged to a solution. The parameter ω was defined as

\omega := \sum_{w_{kj}} |\Delta w_{kj}|,   (3.17)

where the w_kj are all the weights in the network and

\Delta w_{kj} := w_{kj}(t+\Delta t) - w_{kj}(t).   (3.18)

The term Δt is the time step chosen for the fourth-order Runge-Kutta integration of the update laws. An error criterion ξ was used to characterize the training: the lower the value of the error, the better the training. The (output) error parameter is defined as

\xi := \sum_p \sum_n \epsilon_{pn}^2.   (3.19)

Note that this error parameter is essentially the Gaussian energy function. The weights of the network can be initialized by giving their values or selected randomly by a random number generator of the UNIX system.

The training simulator, using the Runge-Kutta integration routine, reads the input data from an input file, which provides the values of the time step Δt, the learning rate γ, the error criterion ξ, the stopping criterion ω, the network structure, the training patterns and the corresponding target values.
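For illustration only, such an input file maps naturally onto a parameter record of the following form; the actual file format of the dissertation's software is not specified, and every field name here is hypothetical:

/* Hypothetical parameter record for the training simulator. */
typedef struct {
    double  dt;          /* integration time step (Delta t)           */
    double  gamma;       /* learning rate                             */
    double  xi_stop;     /* error criterion (3.19) for early stopping */
    double  omega_stop;  /* stopping criterion (3.17) on weight change*/
    int     n_layers;    /* network structure: number of layers ...   */
    int    *layer_sz;    /* ... and nodes per layer                   */
    int     n_patterns;  /* number of training patterns P             */
    double **inputs;     /* training patterns                         */
    double **targets;    /* corresponding target values               */
} sim_params;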
The training is terminated when either the stopping criterion ω or the error criterion ξ is achieved. The simulator writes the intermediate and/or the final values of the network weights and/or other parameters into an output file.

3.5 Simulation example: the XOR problem

The training performance of the two software simulators, using the Cauchy and the Gaussian energy functions, is described and a comparison is made in this section. A prototype FFANN used to solve the XOR problem had two input passive nodes (a passive node does not include a sigmoidal transfer function), two hidden nodes and one output node, as shown in Figure 3.2.

Figure 3.2: The network used to solve the XOR problem. The input layer has two passive nodes, whereas the hidden and the output layers have two and one artificial feedforward neurons, respectively.

The network was fully connected and the weights were initialized randomly by a random number generator. The seed, which is a number used to start the random number generator, is used to represent the set of initial weights. The value of any initial weight w is restricted to -0.9 ≤ w ≤ 0.9.

Figure 3.3: Comparison of the Gaussian (dotted) and the Cauchy (solid) learning for the six sets of initial weights (error versus iterations). In all of the above simulations, the learning algorithm using the Cauchy energy function achieves the same error with fewer iterations.

Table 3.1: Training results comparing the Gaussian and the Cauchy learning for the six sets of initial conditions: the final error, the number of iterations and the computer time, both when the weight-change criterion ω ≤ 0.0001 is achieved and when the error criterion ξ is achieved.

We assume that the network has converged when the stopping criterion ω ≤ 0.0001 is satisfied. In all the simulation cases, the network converged to a local minimum. A comparison of 6 different sets of initial conditions is depicted in Figure 3.3. In all of the six simulations, γ and Δt are chosen to be 1 and 0.1 seconds, respectively. For the cases when the weights were initialized by sets 1, 3, 5 and 6, the training did achieve excellent results.
In the other cases, when the weights were initialized by sets 2 and 4, the network was stuck in an undesirable local minimum with an error value equal to 0.5. The training results using these six different sets of initial conditions are summarized in Table 3.1.

In all six simulations the (Cauchy) learning algorithm achieves the same error performance as the Gaussian learning, in many fewer iterations. Finally, we also observe that the update law (3.5) outperforms the update law (2.14) in achieving any fixed error value in terms of the number of iterations required to realize that error value. It may also be observed in Figure 3.3 that the bulk of the number of iterations is executed very near the local minimum. It may therefore be useful to set an error criterion ξ to a desirable value along with the stability criterion ω, for terminating the training.

3.6 Discussion

In this chapter, we derived a gradient descent weight update law using a Cauchy energy function for supervised error back-propagation learning for FFANNs. We presented a statistical rationale for selecting the Cauchy energy function to be used in the weight update law. We also showed analytically that the learning performance may improve if the Cauchy energy function is used instead of the Gaussian energy function. This improvement is characterized in terms of the speed of the continuous-time update law for the weights, which results in fewer iterations in achieving a specified error (energy) value.

We observe that each component of the vector field of (3.16) is always greater than or equal in amplitude to the corresponding component of the vector field of (2.14). This means that at every point in the weight state-space the dynamics of (3.16) are "faster" than the dynamics of (2.14). However, this notion of "speed" is misleading: one cannot rely on the vector fields alone; one must consider the trajectories crossing into well-defined common, constant-energy levels of the same function [40],[53]-[54]. This amounts to the trajectories of both dynamics crossing into regions defined by the same error function E.

Even though the proposed dynamics require more computations at every step, the total number of steps to converge is usually reduced. Because the computations grow with the number of output nodes and the number of input patterns, a software implementation of the algorithm may not always be suitable. Networks implemented in analog hardware can still derive full benefit from the learning speed of the proposed algorithm.

Numerous simulations focusing on the XOR problem, which support the above analysis, have been conducted. A sample of these simulations is included. The analysis indicates that the Cauchy learning dynamics are faster at each point in the weight space. On the other hand, the simulations focusing on the XOR problem seem to support a stronger conclusion: the Cauchy learning dynamics are faster along each trajectory in the weight space as compared to the Gaussian learning dynamics.

CHAPTER 4
THE POLYNOMIAL ENERGY FUNCTION

4.1 Motivation

The energy function in the weight update law (2.7) specifies the error of the output (from the desired values) to be propagated back for the adjustment of the weights. This error, or the energy, can be viewed as a mathematical norm. The usual sum-of-the-squared (Gaussian) energy function is the L_2 norm. Consequently we may choose other L_p norms (1 ≤ p ≤ ∞) [54].

For every error vector ε_p ∈ R^N, p = 1,2,...,P, we know that all the norms are equivalent in the sense that

K_1 |\epsilon_p|_\alpha \le |\epsilon_p|_\beta \le K_2 |\epsilon_p|_\alpha.   (4.1)

Here K_1 and K_2 are two positive constants and α and β denote two norms. This equivalence does not imply that if, for example, the L_2 norm of an error vector ε_p is greater than the L_2 norm of another error vector ε_p', then the L_1 norm of ε_p is also greater than the L_1 norm of ε_p'. This non-implication is a motivation to specify an energy function such that, when it is used in the weight update law (2.7), the error decreases in all the specified norms.

While error reduction in all the specified norms provides an intuitive rationale for selecting the Polynomial energy function, the real rationale for this work may stem from the convergence speed of the proposed weight update law as compared to the usual weight update law (2.6)-(2.7).

A learning rule which achieves the desired values of the weights (preferably the global minimum) by reducing the error in the first r norms may use a Polynomial energy function [57] of order r, given as

E^p = a \sum_p \sum_n (\epsilon_{pn} + b)^r,   (4.2)

where a and b are any suitable real constants. It should be noted that p used as a superscript stands for "polynomial" and when used as a subscript stands for the pattern p = 1,2,...,P. In this chapter we use the Polynomial energy function of order 2 with a = b = 1/2 in the weight update law (2.7) to train a FFANN. The Polynomial energy function then reduces to

E^p = \frac{1}{2} \sum_p \sum_n \left(\epsilon_{pn} + \frac{1}{2}\right)^2.   (4.3)

It should be noted that the Polynomial energy function of order 2 (4.3) is basically a combination of the L_1 and L_2 norms.
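As a sketch of how (4.3) translates into software (hypothetical names; eps is assumed to store the non-negative error magnitudes |t_pn - y_pn|, consistent with the appearance of sgn(t_pn - y_pn) in the learning dynamics derived in section 4.2):

/* Sketch: order-2 Polynomial energy (4.3) and its output error signal. */
#include <math.h>

double poly_energy(int P, int N, double eps[P][N])
{
    double Ep = 0.0;
    for (int p = 0; p < P; p++)
        for (int n = 0; n < N; n++) {
            double e = eps[p][n] + 0.5;   /* (eps + 1/2)^2 mixes L1 and L2 */
            Ep += e * e;
        }
    return 0.5 * Ep;
}

/* Output error signal for one node (cf. eq. (4.9) in section 4.2):
   sgn(t - y) * (|t - y| + 1/2) * y * (1 - y). */
double poly_delta(double t, double y)
{
    double e = fabs(t - y);
    double s = (t > y) - (t < y);         /* sgn(t - y) */
    return s * (e + 0.5) * y * (1.0 - y);
}

Compared with the Gaussian error signal (2.19), the additive 1/2 in poly_delta keeps the back-propagated error from vanishing as the error magnitude approaches zero.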
For every error vector 5,, e R",p =1,2,...,P, we know that all the norms are equivalent in the sense that K1 lep Ins Iep Isus lepla ‘ (4.1) Here K 1 and K 2 are any two constants and or. and B are two norms. This equivalence does not imply that if, for example, the L2 norm of an error vector 6,, is greater than the L2 norm of another error vector 6;, then the L1 norm ofe, is also greater than the L1 norm of e; . This implication is a motivation to specify an energy function which when used in the weight update law (2.7), the error decreases in all the specified norms. While error reduction in all the specified norms provide an intuitive rational for selecting the Polynomial energy function, the real rational for this work may stem from the convergence speed of the would-be-proposed weight update law as compared to the usual weight update law (2.6)-(2.7). A learning rule which achieves the desired values of the weights (preferably the glo- bal minimum) by reducing the error in the first r norms, may use a Polynomial energy function [57] of order r given as 42 43 EP =a 22¢," +b)’, (4.2) p‘ n where a and b are any suitable real constants. It should be noted that p used as super- script stands for polynomial and when used as a subscript stands for the pattern p =1,2,...,P. In this chapter we use the Polynomial energy function of order 2 and a =b = 71- in the weight update law (2.7), to train a FFANN. The Polynomial energy function would then reduce to _ l 1 2 EP_.2.;2n:(epn+-2-). . (4.3) It should be noted that the Polynomial energy function of order 2 (4.3) is basically a combination of L1 and L2 norms. This chapter is organized as follows. In section 4.2, we analyze the learning dynam- ics of the weight update law (2.7), using the Polynomial energy function (4.3). In section 4.3, we analytically compare the learning dynamics due to the Gaussian (2.6) and the Polynomial (4.3) energy function. Software implementations of the learning dynamics are presented in section 4.4, which are then used to train a FFANN. As an illustrative example, we specialize the software to solve the XOR problem. Simulation results are reported in section 4.5. In section 4.6 we conclude with the discussion on various advan- tages and the disadvantages of using the Polynomial energy function in the weight update law (2.7). 4.2 The learning dynamics In this section, the continuous-time learning dynamics of the update law (2.7), using the Polynomial energy function, are explicitly derived. As described earlier in section 2.2, the output layer is layer 0 , with nodes indexed by n. The 0-lth layer which feeds its output to the output layer is actually the last hidden layer Hm, whose nodes are indexed by m. The learning dynamics of the weights using the Polynomial energy 44 function (4.3) in the update law (2.7); specifically for the output layer WK," = - 7 VW_ EP , (4.4i) and for any hidden layer W5- : — y VWEP . (4.4ii) Using the chain rule, the dynamics for the output layer are ° _ BEP 38p" 3y n a“ n W5” ——7. 36,," . 3y," . a; . 8w; . (45) where up" and ypn are respectively the net input and the output to the nth node of the output layer and are given by up" = 2 WM, 2p", + 9,, (4.6i) m 'and 1 an = m- (4.6ii) Here 2pm is the output of the mth node in the Hmth layer. 0,, is the threshold of the nth node in the output layer. The partial derivatives in equation (4.5) can be explicitly calcu- lated as BE P l . = n + 4.7 36,». E3 [2,, ’2'] ( 1) 8y n _ _ 33% - an (1 an ). (4.7111) and Bu 379% 1' 2pm . 
Now the weight update law can be written as

\dot{w}^p_{nm} = \gamma \sum_p \delta[p][n][O] \cdot y[p][m][H_m],   (4.8)

where

\delta[p][n][O] := \mathrm{sgn}(t_{pn} - y_{pn}) \cdot \left(\epsilon_{pn} + \frac{1}{2}\right) \cdot y_{pn}(1-y_{pn})   (4.9)

and y[p][m][H_m] is the output z_pm of the mth node of the H_m th layer.

The weight update law for the weights w_kj, connecting the jth node of a hidden/input layer H_j to the kth node in the subsequent hidden layer H_k, is derived similarly and is given by

\dot{w}^p_{kj} = \gamma \sum_p \delta[p][k][H_k] \cdot y[p][j][H_j],   (4.10)

where

\delta[p][k][H_k] := \sum_l \delta[p][l][H_l] \cdot w_{lk} \cdot y_{pk}(1-y_{pk}).   (4.11)

In (4.11), l is the index of the nodes in the H_l th layer to which node k feeds its output y_pk, and w_lk is the weight of the corresponding connection. y[p][j][H_j] is the output of node j in the H_j th layer which is fed to node k in the H_k th layer.

4.3 Analytical Comparison with the Gaussian Energy Function

Error back-propagation learning with the update law (2.7) using the Gaussian energy function (2.6) is very widely accepted. In this section we compare the (Gaussian) learning (2.14) and the (Polynomial) learning (4.4). In order to do so, we use a criterion for the learning performance similar to that discussed in subsection 3.3.1; the comparison of the learning dynamics using the Polynomial and the Gaussian energy functions is made in subsections 4.3.1 and 4.3.2.

4.3.1 The learning dynamics for a single pattern

In this subsection we consider the case when we only use one pattern. For ease of notation we will drop the subscript p and define the Gaussian (sum-of-the-squared) and the Polynomial energy functions as

E := \frac{1}{2}\sum_n \epsilon_n^2   (4.12i)

and

E^p := \frac{1}{2}\sum_n \left(\epsilon_n + \frac{1}{2}\right)^2,   (4.12ii)

respectively. By straightforward calculations, for the output layer equation (4.4i) can be rewritten as

\dot{w}^p_{nm} = -\gamma \left[\nabla_{w_{nm}} E + \nabla_{w_{nm}} E_1\right],   (4.13i)

where

E_1 := \frac{1}{2}\sum_n \epsilon_n.   (4.13ii)

For the hidden layer H_k the learning algorithm in (4.4ii) can be similarly rewritten as

\dot{w}^p_{kj} = -\gamma \left[\nabla_{w_{kj}} E + \nabla_{w_{kj}} E_1\right].   (4.13iii)

In order to formally compare the learning performance due to the Gaussian and the Polynomial energy functions, we state Theorem 4.1.

Theorem 4.1: Assume that Assumption 3.1 holds. Consider V := E given by (4.12i) to be a candidate Liapunov function. Let \dot{V} be the derivative of V along the trajectories of system (2.14) and \dot{V}^p be the derivative of V along the trajectories of system (4.13). Then the learning performance of system (4.13) is "faster" than or at least is the same as that of system (2.14).

The proof of this theorem, presented in Appendix G, analytically shows that the learning performance improves if the Polynomial energy function is used in comparison with the usual sum-of-the-squared energy function.

4.3.2 The learning dynamics for multiple patterns

Now, we consider the learning dynamics of the weights for P patterns. The weight update law is given by equation (2.14), where E is given by (2.6). The learning dynamics of the weights due to the Polynomial energy function are of the form (4.4), where E^p now is given by (4.3). By straightforward calculations, \dot{w}^p_{nm} can be written as

\dot{w}^p_{nm} = -\gamma \left[\nabla_{w_{nm}} E + \nabla_{w_{nm}} E_1\right],   (4.14i)

where

E_1 := \frac{1}{2}\sum_p \sum_n \epsilon_{pn}.   (4.14ii)

Similarly, for the hidden layers the learning algorithm of the Polynomial energy function (4.3) can be written as

\dot{w}^p_{kj} = -\gamma \left[\nabla_{w_{kj}} E + \nabla_{w_{kj}} E_1\right].   (4.14iii)

When the network is trained with multiple patterns, we formally compare the learning performance due to the Gaussian and the Polynomial energy functions in Theorem 4.2.

Theorem 4.2: Assume that Assumption 3.2 holds. Consider a candidate Liapunov function V := E given by equation (2.6).
Let \dot{V} be the derivative of V along the trajectories of (2.14) and \dot{V}^p be the derivative of V along the trajectories of the system (4.14). Then the learning performance of system (4.14) is "faster" than or at least is the same as that of system (2.14).

Again the proof of this theorem is included in Appendix H. The analysis clearly indicates that the learning performance in "speed" can improve (or at least does not deteriorate) when the Polynomial energy function is used.

Proposition 4.1: The equilibria of system (4.14) are also the equilibria of system (2.14).

The proof of Proposition 4.1 is provided in Appendix I.

Remark: Proposition 4.1 implies that when we use the Polynomial energy function (4.3) in the weight update law (2.7), we still converge to one of the equilibria of system (2.14). However, this does not mean that, starting from any initial condition, the trajectories followed by both systems would be the same or that the trajectories would converge to the same equilibrium point.

4.4 Software implementation of the learning dynamics

For the software implementation of the learning dynamics of the weights connected to the output layer (4.8) and the weights connected to the hidden layers (4.10), the learning algorithm formalized in Table 2.1 is used. The weight update laws (4.8) and (4.10) are implemented in computer software using the C language on Sun SPARC stations. A fourth-order Runge-Kutta integration routine is used for the integration of these update laws.

A stopping criterion ω was set to determine if the system has converged to a solution. The parameter ω was defined as

\omega := \sum_{w_{kj}} |\Delta w_{kj}|,   (4.15)

where the w_kj are all the weights in the network and

\Delta w_{kj} := w_{kj}(t+\Delta t) - w_{kj}(t).   (4.16)

The term Δt is the time step chosen for the fourth-order Runge-Kutta integration of the update laws. An error criterion ξ was used to characterize the training: the lower the value of the error, the better the training. The (output) error parameter is defined as

\xi := \sum_p \sum_n \epsilon_{pn}^2.   (4.17)

Note that this error parameter is essentially the Gaussian energy function. The weights of the network can be initialized by giving their values or selected randomly by a random number generator of the UNIX system.

The training simulator, using the Runge-Kutta integration routine, reads the input data from an input file, which provides the values of the time step Δt, the learning rate γ, the error criterion ξ, the stopping criterion ω, the network structure, the training patterns and the corresponding target values. The training is terminated when either the stopping criterion ω or the error criterion ξ is achieved. The simulator writes the intermediate and/or the final values of the network weights and/or other parameters into an output file.

4.5 Simulation example: the XOR problem

The training performance of the three software simulators, using the Polynomial, the Cauchy and the Gaussian energy functions, is described and a comparison is made in this section. A prototype FFANN used to solve the XOR problem had two input passive nodes (a passive node does not include a sigmoidal transfer function), two hidden nodes and one output node, as shown in Figure 3.2.

The network was fully connected and the weights were initialized randomly by a random number generator. The seed, which is a number used to start the random number generator, is used to represent the set of initial weights. The value of any initial weight w is restricted to -0.9 ≤ w ≤ 0.9.
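As an illustration (a sketch; the dissertation's software relies on a UNIX random number generator whose exact calls are not listed), seed-based initialization into [-0.9, 0.9] can be written as:

/* Sketch: seed-based random initialization of the weights in [-0.9, 0.9]. */
#include <stdlib.h>

void init_weights(double *w, int n_weights, unsigned seed)
{
    srand(seed);  /* the seed identifies the set of initial weights */
    for (int i = 0; i < n_weights; i++)
        w[i] = 1.8 * ((double)rand() / RAND_MAX) - 0.9;
}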
We assume that the network has converged when the stopping criterion ω ≤ 0.0001 is satisfied. In all the simulation cases, the network converged to a local minimum. A comparison of 6 different sets of initial conditions is depicted in Figure 4.1. In all six simulations, γ and Δt are chosen to be 1 and 0.1 seconds, respectively. For the cases when the weights were initialized by sets 1, 3, 5 and 6, the training did achieve excellent results. In the other cases, when the weights were initialized by sets 2 and 4, the network was stuck in an undesirable local minimum with an error value equal to 0.5. The training results using these six different sets of initial conditions are summarized in Table 4.1.

Figure 4.1: Comparison of the Gaussian (dotted), the Cauchy (dashed) and the Polynomial (solid) learning for the six sets of initial weights (error versus iterations). In all of the above simulations, the Polynomial learning algorithm achieves the same error with fewer iterations.

Table 4.1: Training results comparing the Gaussian, the Cauchy and the Polynomial learning for the six sets of initial conditions: the final error, the number of iterations and the computer time, both when the weight-change criterion ω ≤ 0.0001 is achieved and when the error criterion ξ is achieved.

It should be noted that in all six simulations the (Polynomial) learning algorithm achieves the same error performance as the Gaussian or the Cauchy learning, in many fewer iterations and less computer time. Finally, we also observe that the update law (4.4) outperforms the update laws (2.14) and (3.5) in achieving any fixed error value in terms of the number of iterations required to realize that error value.

As the bulk of the number of iterations is executed very near the local minimum, it is useful to set an error criterion ξ to a desirable value along with the stability criterion ω, for terminating the training. For the cases when the weights were initialized by sets 1, 3, 5 and 6, ξ was set to a value of 0.005. In the other cases, when the weights were initialized by sets 2 and 4, ξ was set to 0.51.

4.6 Discussion

Training of a FFANN with the update law (2.7) using the sum-of-the-squared energy function (2.6) ensures that the L_2 norm of the error vector ε, for all the output nodes n = 1,2,...,N and all the patterns p = 1,2,...,P, decreases as the training proceeds and finally converges to a local minimum value. While the L_2 norm of the error vector decreases, other L_p norms (1 ≤ p ≤ ∞) may increase or decrease. Restricting our attention to the present case, where we are using the Polynomial energy function of order 2, we only consider the L_1 and the L_2 norms. To illustrate the above fact, we consider a simple case of a trajectory in a two-dimensional error space in
9...... ....... .8... 3.83... 9.... .... ....... 9...... .3... .8... 89...... ...e. .3... . mean... 98. .8... 59... 98... .8... ..... 98... .9... ...... 9...... 98. a... :9... 98... 2.... .8... 98... .9... ..9.. o 3.83... 98. .3. .9... 98... ..... .9... 98.8. ....... ....... . can... 98. a... 8... 98... .3... .3... 98... 2.... .2... 98.... 98. a... 8... 98... 2...... ...... 98... 3.... ...... 3.353... 98. 8... .9... 98... ..... .9... 988... ....... 5... .55....» 3...... 8:62...» :6 6.8%.... :8 9.8.... E... :8 $23.03.... .8:..=m. n...»— I... .afiagus 9.8 8:83am...» 8 5.98 .2. A»... .5...» =8 0.....- «.9... So 9......» 9.... 50 3.553.... 32.5. 2.828. 3:88.... 98 8...... 50 «:9 2.8193 3.. 50 £05... 9.5m... 3.83.3 8 .m 8.5.6... 53 Figure 4.2. If the training starts from point e,- and proceed along the trajectory and settles down at a local minimum ef , we observe that while the L2 norm of the error vector decreases the L1 norm increases throughout this learning. We can quantify the possibility of increase in the L1 norm while the L2 norm decreases at a point, in terms of the shaded area in Figure 4.2; the larger the area, the higher is the possibility. Similarly if we train the network, by decreasing the L1 norm of the error vector 8, we will notice as depicted in Figure 4.3, that the L2 norm of the errors may increase. We can again quantify the possibility of increasing the L2 norm while the L1 norm decreases at a point, by measuring the shaded area in Figure 4.3; the larger the area, the higher is the possibility. If e; is a point, where the L1 norm is tangent to the L2 norm (i.e) the shaded area measures zero, then possibility of a norm increasing while the other norm is decreasing reduces to zero at that point. This is depicted in Figure 4.4. This criterion can be easily extended to an n-dimensional error space, where we have an n-dimensional sphere representing the L2 norm and n-1 dimensional surface representing the L1 norm. The enclosed volume between the two norms in the direction of the trajectory, measures the possibility of increasing the output error in one norm while the error decreases in the other norm. This quantification is another aspect of this research which can be pursued further independently. When a network is trained with the Polynomial energy function, we decrease the error in the L1 as well as the L2 norm. We observe from Figure 4.5 that we reduce the possibility of any norm increasing while the other is decreasing with the measure given above. Giving more weight to any norm in the energy function, will enhance the tendency of decreasing that specific norm more than the other. In Other words we enhance the pos- sibility of other norm to increase while the first is decreased. This is illustrated by the fol- lowing example, in which we give more weight to the L1 norm in the energy function. We use the modified Polynomial energy function E "'P , which is /Z\ \/ Figure 4.2: Trajectory showing the increase in Llnorm while 14 norm is decreasing. 8i and 8f represent the initial and the final points of a trajectory in the error space. The size of the sha- ded area represents the possibility of L1 norm increasing while 14 norm is decreasing. 81 L1 1,2 Figure 4.3: Trajectory showing the increase in L; norm while L1 norm is increasing. 6i and 6f represent the initial and the final points of a trajectory in the error space. The size of the sha- ded area represent the possibility of 14 norm increasing while L1 norm is decreasing. 
E^{mp} = \frac{1}{2} \sum_p \sum_n (\epsilon_{pn} + 1)^2.   (4.18)

Figure 4.4: Trajectory showing both the L_1 and the L_2 norms decreasing. ε_i and ε_f represent the initial and the final points of a trajectory in the error space. The L_1 norm is tangent to the L_2 norm at the initial point of the trajectory, indicating that both norms of the error vector decrease as the training proceeds along the trajectory.

Figure 4.5: Energy level diagram of the Polynomial energy function. The network trained with the Polynomial energy function reduces the possibility of the error increasing in one norm (L_1 or L_2) while it is decreasing in the other.

When the network is trained with the weight update law (2.7) using the modified Polynomial energy function and the initial weights are given by set 5, we notice in Figure 4.6 that at some stage the error, which is the L_2 norm, starts to increase. Also notice that the L_1 norm and the modified energy of the errors keep decreasing throughout this training until the local minimum is achieved.

Figure 4.6: Training results for the XOR problem, trained with the modified Polynomial energy function. The L_1 and L_2 norm values are represented by dotted and solid lines, respectively. Note that the L_2 norm of the error increases while the L_1 norm decreases after 18000 iterations.

Besides the main advantage of increased speed of convergence, in terms of the number of iterations and less time, we can pick an energy function that lowers the value of the errors in the desired norms, to suit our requirements. Using the Polynomial energy function to train a FFANN, in comparison with the usual sum-of-the-squared energy function, is thus of threefold advantage: (1) speed of convergence in terms of computer training cycles, (2) time, and (3) the choice to decrease the output error in any particular norm or combination of norms.

CHAPTER 5
THE EXPONENTIAL ENERGY FUNCTION

5.1 Motivation

In chapters 3 and 4 respectively, we used the Cauchy (3.4) and the Polynomial (4.3) energy functions in the weight update law (2.7) to speed up convergence to a local minimum. Using the Cauchy energy function ensures convergence to a local minimum in fewer steps in comparison with the Gaussian energy function; due to more computations in each step, however, the Cauchy learning may take more computer time than the Gaussian learning. When we used the Polynomial energy function (4.3) in the update law (2.7), the algorithm outperformed the Gaussian and the Cauchy learning not only in the number of steps but also in computer time.

We speed up the convergence to a local minimum by using the Cauchy or the Polynomial energy function in the continuous-time update law. The local minimum that we converge to, however, may not be good enough to solve the problem for which the network is trained. For instance, in the case of the XOR problem, when we initialize the weights with sets 2 and 4, we do not converge to good minima. In this chapter, we formalize skipping the minima that are at a higher energy value and converging to an acceptable minimum at a lower energy value. Many researchers have proposed various modifications of the discrete-time update law (2.5), based on heuristic arguments, to speed up the convergence to acceptable weights [7],[8],[58],[59]. We propose an exponential energy function [60] of the form

F = e^{\kappa E},   (5.1)

to be used in the weight update law (2.7) to overcome the above-mentioned problems. Here κ is any positive real number whose parameterization is discussed later in this chapter.
When we use the Exponential energy function (5.1) in the continuous-time update law (2.7), the system is still a gradient descent system with all of its nice properties. In section 5.2 we include the learning dynamics of the update law (2.7) using the Exponential energy function F. We take discrete steps when we implement the update law (2.7) via software on a digital computer; this feature adds the capability of skipping the undesirable local minima. We present explicit derivations of the learning algorithm based on the Exponential energy function in that section. In section 5.3, we compare the learning dynamics of the weight update law (2.7) when it uses the Exponential energy function F and when it uses a simple energy function E, where E can be the Gaussian, the Cauchy, the Polynomial or any other suitable energy function. Software implementations of these derivations are then used to train a FFANN. As an illustrative example, we specialize the software to solve the XOR problem in section 5.4. Simulation results are presented in section 5.5. We conclude by discussing the advantages and the limitations in section 5.6.

5.2 The learning dynamics

Consider the learning dynamics of the weights using the simple energy function, which may be the Gaussian (2.6), the Cauchy (3.4) or the Polynomial (4.3) energy function; specifically,

\dot{w} = -\gamma \nabla_w E,   (5.2)

where w represents all the weights in the network. Now consider the learning dynamics of the weights using the Exponential energy function (5.1); specifically,

\dot{w}^e = -\gamma \nabla_w F   (5.3i)
          = -\gamma \kappa e^{\kappa E} \nabla_w E   (5.3ii)
          = -\gamma \chi \nabla_w E   (5.3iii)
          = \chi \dot{w},   (5.3iv)

where χ is the skipping factor and is defined as

\chi := \kappa e^{\kappa E}.   (5.4)

Observe that χ is constant at any particular step of the integration and varies exponentially with the output error (energy) from one step to another. As the learning proceeds and the output error (energy) E decreases, the skipping factor χ also decreases exponentially. Therefore the likelihood of skipping the local minima which are at higher energy levels is greater than that of skipping the local minima at lower energy levels. Here \dot{w} gives the learning dynamics of the update law (5.2); these learning dynamics have already been explored in chapters 2, 3 and 4, corresponding to the Gaussian, the Cauchy and the Polynomial energy functions.
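A sketch of a single discrete step under (5.3) (illustrative; E and its gradient are assumed to have been produced by the back-propagation pass of Table 2.1, and the names are hypothetical):

/* Sketch: one discrete step of the Exponential learning (5.3).
   E is the current simple energy (Gaussian, Cauchy or Polynomial) and
   grad[i] its gradient with respect to weight i. */
#include <math.h>

void exponential_step(double *w, const double *grad, int n_weights,
                      double E, double kappa, double gamma, double dt)
{
    double chi = kappa * exp(kappa * E);      /* skipping factor, eq. (5.4) */
    for (int i = 0; i < n_weights; i++)
        w[i] += -gamma * chi * grad[i] * dt;  /* Euler form of (5.3iii) */
}

The product γχ acts as an energy-dependent learning rate: it is large while E is large, helping the discrete steps jump over poor minima, and it shrinks as E decreases.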
Let V be the derivative of V along the the trajectories of (5.2) and V‘ be the derivative of V along the trajectories of the system (5.3), then the learning performance of system (5.3) is better than or at least the same as system (5.2), provided the skipping factor x given by (5 .4) is greater than or equal to unity. Proof : In order to analyze the learning performance, let’s evaluate V‘ - V as given below. V‘-V=VwV.(We-W) Py- q r- “W =[..VWE..].—'y1 waE - V,',E >, (5.5i) = - y§[x - 1] [Vw E] 2. (5.5a) In order for the update law (2.7) using the Exponential energy function (5.1) to improve (or at least does not deteriorate) the learning performance , V‘ - I} should be non- positive. This implies x 2 1. (5.6) To improve the learning performance, the relationship (5.6) imposes the following res- triction on the energy function. _ lnx E 2 —K_' (5.7) 62 The above relationship is depicted in Figure 5 .1. We observe that for non-positive K the system does not behave as a gradient descent system. For positive K the learning performance only improves if for a certain value of K, E has a value that satisfies the relationship (5.7). We also observe that if K 21, the learning with the Exponential energy function always improve. If for some reason we want faster learning for some higher energy value greater than E o and slower learning for E < Eo, then we can chose appropriate value of K such that 0 < K < l. 20- 15— E10— o_L_----- Figure 5.1: Graph representing the relationship (5.7). It is an essential condition to improve the learning performance of system (5.3). The learning performance only improves if the parameters are chosen above the graph line representing (5.7). 63 Proposition 5.1: The equilibria of system (5.2) and system (5.3) are same. The proof of Proposition 5.1 is provided in Appendix J. Remark: Proposition 5.1 implies that when we use Exponential learning (5.3) we still converge to one of the equilibria of system (5 .2). However, this does not mean that start- ing from any initial condition, the trajectories followed by the systems would be same or the trajectories would converge to the same equilibrium point. 5.4 Software implementation of the learning dynamics For the software implementation of the weight update law (5.3), the learning algo- rithm formalized in Table 2.1 is used. The weight update law (5 .3) is implemented into computer software using the C-language on Sun SPARC stations. Fourth order Runge- Kutta integration routine is used for the integration of these update laws. A stopping criterion to, was set to determine if the system has converged to a solu- tion. The parameter to was defined as a) :2 MW I. (5.8) where w are all the weights in the network and Aw := w(t+At) - w(t). (5.9) The term At is the time step chosen for the fourth order Runge-Kutta integration of the update laws. An error criterion 5 was used to characterize the training: the lower the value of the error, the better is the training. The (output) error parameter is defined as § := g g spa 2. (5.10) Note that this error parameter is exactly the Gaussian energy function. The weights of the network can be initialized by giving their values or selecting randomly by a 64 random number generator of the UNIX system. 
The training simulator, using the Runge-Kutta integration routine, reads the input data from an input file, which provides the values of the time step Δt, the learning rate γ, the error criterion ξ, the stopping criterion ω, the energy function parameter κ, the network structure, the training patterns and the corresponding target values. The training is terminated when either the stopping criterion ω or the error criterion ξ is achieved. The simulator writes the intermediate and/or the final values of the network weights and/or other parameters into an output file.

5.5 Simulation example: the XOR problem

The training performance of the software simulators, using the Exponential and the simple energy functions, is described and a comparison is made in this section. For the simple energy functions, we used the Gaussian (2.6), the Cauchy (3.4) and the Polynomial (4.3) energy functions. A prototype FFANN used to solve the XOR problem had two input passive nodes (a passive node does not include a sigmoidal transfer function), two hidden nodes and one output node, as shown in Figure 3.2.

The network was fully connected and the weights were initialized randomly by a random number generator. The seed, which is a number used to start the random number generator, is used to represent the set of initial weights. The value of any initial weight w is restricted to -0.9 ≤ w ≤ 0.9.

We assume that the network has converged when the stopping criterion ω ≤ 0.0001 is satisfied. In all the simulation cases, the network converged to a local minimum when we chose the energy function parameter κ = 1. A comparison of 6 different sets of initial conditions is depicted in Figures 5.2, 5.3 and 5.4. In all six simulations, γ and Δt are chosen to be 1 and 0.1 seconds, respectively. For the cases when the weights were initialized by sets 1, 3, 5 and 6, the training did achieve excellent results. In the other cases, when the weights were initialized by sets 2 and 4, the network was stuck in an undesirable local minimum with an error value equal to 0.5. The training results using these six different sets of initial conditions are summarized in Tables 4.1 and 5.1.

We increase the tendency of skipping the local minima, in order to avoid the cases that we observe when the weights are initialized by sets 2 and 4, by choosing a larger skipping factor. This is achieved by selecting κ = 1.8, which subsequently increases the skipping factor χ. Comparisons for the six different sets of initial conditions are made in Figures 5.2, 5.3 and 5.4. The training results of these simulations are summarized in Table 5.2.

We observe in Figures 5.2, 5.3 and 5.4 that when κ was chosen to be 1.8, the update law (2.7) using the Exponential of the Polynomial energy function skips the local minima when the weights were initialized by sets 2 and 4. For the same two sets of initial weights, the update law using the Exponential of the Gaussian and of the Cauchy energy functions still could not skip the local minima. It can be shown easily that by increasing the value of κ we can skip these undesirable minima also.

In all of the simulations, observe that the Exponential (5.3) learning achieves the same error criterion as the simple (5.2) learning in many fewer iterations and less computer time. Depending upon the value of κ, or of the skipping factor χ, the Exponential learning algorithm can be made to skip the undesirable local minima, thus allowing it to converge to one of the desirable minima.

As the bulk of the number of iterations is executed very near the local minimum, it is useful to set an error criterion ξ to a desirable value along with the stability criterion ω, for terminating the training. For the XOR problem, the error criterion ξ = 0.005 provides reasonably good results and saves a lot of computer steps and time.

Figure 5.2: Comparison of the Exponential (κ = 1 and κ = 1.8) and the Gaussian learning for the six sets of initial weights (error versus iterations). The Gaussian is represented by dotted lines, whereas the Exponential with κ = 1 and κ = 1.8 is represented by dashed and solid lines, respectively.
As the bulk of the number of iterations is executed very near the local minimum, it is therefore useful to set an error criterion g to a desirable value along with the stability criterion 0) , for terminating the training. For the XOR problem the error criterion §= 0.005 provides reasonably good results and it will save us lot of computer steps and time. 66 1+ error 0.5 ._ error -0 _ I I I I T I I -0 5000 10000 15000 -0 5000 10000 set # 1 - iterations set # 2 - iterations 1 error 0.5 — error -0 _ ................................ I I I I I I -0 5000 10000 15000 -0 5000 10000 set # 3 - iterations set # 4 - iterations I I I I I I I I -0 5000 100001500020000 -0 5000 10000 15000 set # 5 - iterations set # 6 - iterations Figure 5.2: Comparison of the Exponential (KL-1, 1.8) and the Gaussian learning. The Gaussian is represented by (...) whereas the Exponential (K‘=l, 1.8) is represented by (---) and (_) respectively. 67 1 error 0.5 _ error -0 _ ”‘M WW I I T I I I I I m -0 5000 10000 15000 -0 2000 4000 6000 8000 set # 1 - iterations set # 2 - iterations 1 - 1.2 — .2 i. It 1 ._ error 0.5 — error 0.8 q . 0.6 — _0 _ , .... k; ....................... I I I I n I I I I -0 5000 10000 15000 -0 2000 4000 6000 8000 set # 3 - iterations set # 4 - iterations 1 1 ° . . . 1. “0’05— -0 _ gm n. bxm I I I I I I I -0 5000 10000 15000 -0 5000 10000 set # 5 - iterations set # 6 - iterations Figure 5.3: Comparison of the Exponential (ml, 1.8) and the Cauchy learning. The Cau- chy is represented by (...) whereas the Exponential (Ic=l, 1.8) is represented by (—--) and (__) respectively. 68 1 w\ ............. ‘1... t 1 .... \° ........... \ ....... 1 \‘ tl \ . error 05 " E error 0 5 4 \_ ......................... _0_ LL ......... .0_ L I I I I I I I I I -0 1000 2000 3000 4000 -0 1000 2000 3000 4000 5000 set # 1 - iterations set # 2 - iterations 1 _ P ............ .2 \ 1 —l “‘1'. ...... \ ..... I. ‘ ‘I error 0.5 -— 1 error t 0 5 _ ¥ oooooooooooooooooooooooo i -0 _ L g ' ......... .0 _. L T I T I I I I T I I -0 1000 2000 3000 4000 -0 1000 2000 3000 4000 set # 3 - iterations set # 4 - iterations 1 __ 152‘ ............. 1 _ \;'.. ....... \ ‘\ I ‘ '. error 0.5 — ‘. cm" 0 5 —- 1 '. I i t . _0 _ L k °' ..... -0 — L \_ .................. I I I I I I I -0 2000 4000 6000 -0 1000 2000 set # 5 - iterations set # 6 - iterations Figure 5.4: Comparison of the Exponential (14:1, 1.8) and the Polynomial learning. The Polynomial is represented by (...) whereas the Exponential (x=l, 1.8) is represented by (---) and (__) respectively. ace—o u.— 69 fi _ 555% as _ 35.93 «2 a em _ . N 2.2.3 9580: _ . Ea». tam—.8 _ _ 39 u an ”Range @383 23—. t on menu; 33%.88 _ «:9 t on ”.233“ gauge _ omens 8.8“ .8; 8.8 8.8.8 38¢ ~38 8.8.3 .388 88.8 983. 8.8“ :8 8.8 c.8§ ~88 6%.. 8.88.3 :88 2 .3 88558 8.8“ :8 :8 8.8.3 58 3.8 8.888 488 8.8 u 9.33 as 83 :8 8.85 38 8:5 8.8.8 .88 8.8 98.: as 88 8.8 328 am: :95 888 .88 8...: 8.5.8.8 is 83 8% 8.8.8 3% 3.8 88; 8.5 88.8 a means 8.8“ ES .838 8.8—8 :5 “and 8.8.8 as; 898 98.: c.8u 88 8.: 98.8 8.8 .88 8.888 :88 8 .8 3.3538 98“ :8 5.8 8.8—8 .2: 3.8 8.888. 8.: 8.98 a eczema. 88 8: 3% 32% 883 88.8 888 .88 Sea l 9.53. 35 88 88 8.8.2 38.. 5.8 888 $88 83.8 8.3538 58 83 5.8 8.8.2 33 8.8 38E use 2%.. a mean. c.8u an: and» 8.8.8 .88 2 _.8 8.8.3 38¢ 2 .8 98.: 8.8“ 8: 3.8 8.8—8 33 8.3 9888 .88 84.8 3.485.». 8.8“ :8 5.8 88—8 83 8.8 8.888— 3: «3.8 8 cases. 2.8 :8. 5.: 8.85 3:. 88.8 8.85 35. 88.8 983. e.§ as 8.3 8.8;. ~88 88:3 888 at: 85.. 8.483.». c.8u 8a 8.8 8.8.8 as 3.3 8.888 88.. 898 diam—i .833 8333a Go mag—«BE can 3 on 509:3»... 
:8 Ex Ex. :6 3.3538. gwfi. 1:8. 3.— 55525-8 98 83m 90 mix—.883. Ann—ven§0w§.50n§3.na=6ve_§§§ n§§.§8981§=8n§3§3n§n5o§82§03§emu 8328. 70 as...» 9. _ 53.5% a.» a.» .2» «a. a 5.. . 0:88. 3.83.. I......». aim...» — 23» * om .885...» @388. _ 9.3» a 5.. .8333 a 5.. .839...» . 02.8.»: 98m 83 49% 98.3 . aged. 98w Ba... 5...» 98.3 3.553.». 98m 8.. 93 98.3 n 02.8.»: 9m... ”6... awn... 9.8.3 988.... 0283. 9m... cum 9% 98.8 3... “a... 9.8.3 3.8 .393 3.553.». 9m... 8 9... 98.3 8 9... 9°88: 380 5......» H. 09.8.»: 98m .35 8.... 98.8 New”... v.98 989.8 33» 3...... 028.... 98m 88 up...» 985m 3..» 3.8 988.8 3%.. 89... I 3.353.». 98. 3.. 93 98.». as M...» 99589 3.8 393 A 088.»... 9m... 38 3.8 98.9 3...... .35.. 9.8a... Sue... 398 0253. 9m... one 9% 93.3 3.8 .945 9.8.5.. .53» 393 $55.53.». 9.... ma 9% 98.9 3 9% 9958.. .33.. 3...... M 02.8.»: 98m 33 88 98.3 58 398 9838 to». 3...: 9.5.6. 98m ”5.6 9.5. 98.3 .8.» 3.8 9888 5.80 um...» 3.355.». 98m .3 Pu. 98.3 5... 9w. 9:88.. «.83 3.... o 69.8.»... 98m .38 8b.. 98.». ~83 $98 988.... .38 uqmu. 955 98. 88 ~98 98.». .82 3.8 988.... 5.3.. 898 123.». 98. u... 9.... 98.». um... Maw 9958.. 3.8 .393 .....»......m 3a....» 83.513 So mace—.89.». c. u .9 5.. a.» 65.8.»... ...o 9.83 E... a.» 3.553.». .83.»? 3:». a... 5835......» a»... 5.5 5» 9.53:2». 3.85.. an n .8 5.. 5» 02.8.»... a.» 9.53. E... a.» 3.353.». 2.95. 3.5.9.». 8986...... a...» £30.. a.» 35» 2.82.9. n 2... 5» £05... n..»..mo 2.815.. 8 .m 85.96... 71 5.6 Discussion Weight update law (2.7) using the Exponential energy function (5.1) corresponding to the Gaussian (2.6), the Cauchy (3.4) or the Polynomial (4.3) energy function provides a variable skipping factor x, which changes with each computer step. The learning dynamics of this system is given by (5.3iii) as W‘ = — y x Vw E. If we consider 7); as the cumulative learning rate, then we can say that the learning rule (5.3iii) provides a variable learning rate, which varies with each computer step. The changes in the cumulative learning rate are such that it is large at higher energy levels and is small at lower energy levels. In digital software based networks, like the one we used in this work, the concept of variable learning rate is vary useful. The rea- son is that, we want the undesirable minima at higher energy levels to be skipped whereas it should converge to desirable minima at lower energy levels. As the calculations in section 5.3 reveal, the weight update law (2.7) using the Exponential energy function contributes to overall better learning performance as com- pared to when the Gaussian, the Cauchy or the Polynomial energy function is used. This is observed practically in all of our simulations with the XOR problem. Some of these simulations are reflected in Figures 5.2, 5.3 and 5.4. Apparently, we can make the learning as fast as desired, by choosing appropriate values of K. This is misleading because if we increase the values of the skipping factor beyond certain limit (which varies from problem to problem), the weights change so abruptly that the outputs of the network clamp to saturation (O or 1) values. As a result the partial derivatives of the outputs with respect to the corresponding inputs of the neu- rons are very very small. Consequently, the error signal (2.19), (3.10) and (4.9) which is propagated back is negligible, resulting in a smaller weight change. This makes the algorithm converge to a false minimum. This can be avoided to a certain extent by using 72 lower gain neurons as described in section 2.1, or using appropriate cost factor. 
Alternatively, we restrict the value of κ to a certain limit. When the weights are initialized randomly within the range [-0.9, 0.9] (as is the case in all of the simulations), restricting the value of κ to [1, 2.5] normally provides good results. The exact parameterization of κ is not addressed here and can be pursued independently, depending upon the task.

CHAPTER 6

THE PATTERN/CHARACTER RECOGNITION PROBLEM

6.1 Problem description

Pattern recognition is one of the major application areas of FFANNs. In this chapter we employ FFANNs for character recognition; the characters chosen for this problem are the Arabic numerals in different fonts. Character recognition using a FFANN has attracted the attention of many researchers, for it finds many applications in office automation and computer-aided design. Merging the classical techniques of pattern recognition with the classification capabilities of a FFANN, the process of character recognition as viewed by many researchers [61]-[63] consists of feature extraction by classical methods and classification by a FFANN. Many others, motivated by the association and classification capabilities of the human brain, expect both tasks to be performed by a FFANN [64]-[66]. Networks reminiscent of the Neocognitron [64], dedicated to solving the character recognition problem, use a large number of neurons and trainable weights. Many researchers exploit the capabilities of multilayer perceptrons [17] to solve the character recognition problem by using a sufficient number of neurons in the hidden layers [27]-[29].

We decompose the character recognition problem into two parts: (1) feature extraction or preprocessing and (2) classification. We use a two-layer FFANN for feature extraction and an additional one-layer network for classification. The two networks are concatenated in a feedforward manner, the feature extraction network followed by the classification network.

In section 6.2, we discuss the data-base used for training and testing the network. The network design is discussed in section 6.3. The training and the learning algorithm used to train the network are discussed in section 6.4; in that section we also carry out the learning evaluation of update law (2.7) using different energy functions. In section 6.5, the chapter is concluded with a discussion of the network design and the various learning algorithms used.

6.2 The data-base

We have collected 100 different fonts from printed material [67]-[70]. Each of these printed fonts was scanned using a Toshiba scanner and stored as a "Tiff" file. The files were subsequently converted to the "Hips" format, which easily handles binary data. The pixel images of the various digits in the different fonts were then obtained from the formatted Hips files.

The images of these fonts were stretched appropriately to fit within a pixel window of 17x15 dimension while leaving a small margin on all sides. With this preprocessing, we also achieve scaling invariance of the characters to some extent. The grey levels of these pixels were set to 0 or 1. Pixel images of some of the fonts formatted in this way are depicted in Figure 6.1. The collected fonts include roman, bold and italic versions of different classes. The network was trained with half of these fonts (which included roman, bold and italic versions of some of the classes). The second half of the data-base was used to test the character recognition and generalization capabilities of the network.
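As an illustration of this preprocessing, the fragment below sketches how a scanned grey-scale glyph might be reduced to the 17x15 binary input array. It is a hypothetical sketch rather than the routine actually used: the nearest-neighbour resampling, the 0.5 threshold, and the omission of the margin handling are all assumptions.

    /* Sketch: stretch a grey-scale glyph of arbitrary size (h x w) onto
       the 17x15 input window by nearest-neighbour resampling, then set
       each grey level to 0 or 1.  The threshold and resampling scheme
       are illustrative assumptions. */
    #define ROWS 17
    #define COLS 15

    void format_glyph(const double *grey, int h, int w, int out[ROWS][COLS])
    {
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++) {
                int sr = r * h / ROWS;             /* nearest source row    */
                int sc = c * w / COLS;             /* nearest source column */
                out[r][c] = (grey[sr * w + sc] > 0.5) ? 1 : 0;
            }
        }
    }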
6.3 The network design

The character recognition network [71] that we use consists of the feature extraction and the classification networks, as depicted in Figure 6.2. The feature extraction network is designed to extract features locally, with local connections from the previous layer, while the classification is done with global (i.e., full) connections.

Figure 6.1: Pixel examples of some of the fonts used as input to the network (the legible panel labels include Badoni bold and Breite).

Training of the network used for character recognition takes a long time. We split the network into two, so that one part, the feature extraction network, which takes the major portion of the training time, can be provided as a commercially available chip trained off-line on a very large data-base. The second part is available to the user to train on-line as new data are added. We discuss the design of the feature extraction and the classification networks in subsections 6.3.1 and 6.3.2, respectively.

Figure 6.2: Block diagram of the FFANN used for the character recognition problem (Input -> Feature Extraction Network -> Classification Network -> Class).

6.3.1 The feature extraction network

A partially connected FFANN with one hidden layer is chosen to extract features of the Arabic numerals of various fonts used for common printing. The input layer of this feature extraction network is a 17x15 array of passive nodes corresponding to the input patterns. These patterns are the characters 0 through 9 in different fonts. The input passive nodes simply accept the individual components of the input image and distribute them to the specified neurons of the hidden layer. The output of each neuron is in the range (0,1), as the activation output is determined by the so-called logistic function [7]-[8]

    output = \frac{1}{1 + e^{-net}} ,    (6.1)

where net denotes the weighted sum of the neuron's inputs.

A sliding window of a 7x7 array of input nodes is connected to a neuron in the hidden layer. The window slides all over the input array, moving one column or one row of 7 input nodes at a time. As a result, the dimension of the hidden layer reduces to 11x9. A similar sliding window of a 5x5 array of hidden-layer neurons is connected to a neuron in the output layer; the array size of the output layer consequently reduces to 7x5. The structure of the network is shown in Figure 6.3.

Figure 6.3: Structure of the FFANN used for feature extraction (input layer, hidden layer, output layer).

Ten 7x5 array targets are constructed, one corresponding to each input digit. The targets are constructed so as to retain the general shape of the characters and, in addition, to be pseudo-orthogonal to each other. Figure 6.4 shows all the targets used in the training of this network. The targets would be considered orthogonal if the Hamming distance between them were equal to the dimension of the targets, namely 7x5 = 35. In the present case, it was not possible to construct orthogonal targets while retaining the geometric shapes of the digits. The minimum and the maximum Hamming distances between the targets given in Figure 6.4 are 17 and 22, respectively. As a consequence, this coding has an error correction capability of up to 8 errors in the output pixels.
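The arithmetic of the sliding-window layer can be summarized in code. The sketch below computes the hidden layer of the feature extraction network from the 17x15 input; the second layer (a 5x5 window onto the 11x9 hidden array, yielding the 7x5 outputs) follows the same pattern. The per-neuron weight patches and the bias term are assumptions about the layout, not details taken from this work; the activation is the logistic function (6.1).

    /* Sketch of the first (partially connected) layer of the feature
       extraction network: each hidden neuron sees a 7x7 window of the
       17x15 input array, giving an 11x9 hidden layer.  Independent
       weights per hidden neuron and a bias term are assumptions. */
    #include <math.h>

    #define IN_R 17
    #define IN_C 15
    #define WIN  7
    #define HID_R (IN_R - WIN + 1)   /* 11 */
    #define HID_C (IN_C - WIN + 1)   /*  9 */

    static double logistic(double net) { return 1.0 / (1.0 + exp(-net)); }

    void hidden_layer(const double in[IN_R][IN_C],
                      const double w[HID_R][HID_C][WIN][WIN],
                      const double bias[HID_R][HID_C],
                      double out[HID_R][HID_C])
    {
        for (int r = 0; r < HID_R; r++)
            for (int c = 0; c < HID_C; c++) {
                double net = bias[r][c];
                for (int i = 0; i < WIN; i++)      /* 7x7 window at (r,c) */
                    for (int j = 0; j < WIN; j++)
                        net += w[r][c][i][j] * in[r + i][c + j];
                out[r][c] = logistic(net);         /* activation (6.1)    */
            }
    }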
If pixels with similar and dissimilar activation states (specifically, a high or a low state) correlate to +1 and -1 respectively, the correlation matrix C for all the targets becomes

    C = \begin{pmatrix}
    35 & -3 & -1 & -5 & -1 & -1 & +1 & -1 & -3 & -3 \\
    -3 & 35 & -7 & -3 & -3 & +1 & -1 & -3 & -1 & -1 \\
    -1 & -7 & 35 & -1 & -1 & -1 & -7 & -5 & +1 & +1 \\
    -5 & -3 & -1 & 35 & -1 & -1 & +1 & -1 & -7 & -3 \\
    -1 & -3 & -1 & -1 & 35 & -1 & -3 & -1 & +1 & +1 \\
    -1 & +1 & -1 & -1 & -1 & 35 & +1 & -1 & -3 & +1 \\
    +1 & -1 & -7 & +1 & -3 & +1 & 35 & -3 & -1 & -9 \\
    -1 & -3 & -5 & -1 & -1 & -1 & -3 & 35 & -3 & +1 \\
    -3 & -1 & +1 & -7 & +1 & -3 & -1 & -3 & 35 & -1 \\
    -3 & -1 & +1 & -3 & +1 & +1 & -9 & +1 & -1 & 35
    \end{pmatrix} .    (6.2)

The performance of the feature extraction network can be measured by noting the correlation of the output digits with the targets. If the output is within a Hamming distance of 8 of a target, it is classified to that target; otherwise we consider the network unable to classify the particular input.

6.3.2 The classification network

The classification network is a fully connected FFANN. There is only one layer, consisting of 10 neurons corresponding to the different classes of characters, namely 0, 1, ..., 9.

Figure 6.4: Targets used in the feature extraction network for the various digits.

The output of the feature extraction network from its 35 output nodes is fed directly to all of the classification neurons. We chose orthogonal targets for all 10 classes. The targets are chosen so that, for any particular class, the output of the corresponding node is high and all others are low. If the outputs of more than one node go high, the network is considered unable to classify.

6.4 Learning evaluation with different energy functions

We used the gradient descent method with the error back-propagation learning technique for the training of the network. The feature extraction and the classification networks can be trained together or separately. We trained the two networks separately with a view to achieving the following advantages: (1) the feature extraction and the classification networks can be viewed as two separate modules; (2) the feature extraction network is fully trained on a large data-base off-line, and the classification network can be quickly trained on-line when new fonts are added. This enhances the character recognition capability of the network, as will be demonstrated in our work.

We used a training simulator developed in [35]-[39],[57],[60],[71], coding the weight update law (2.7) in the C language on a Sun SPARC station. For the integration of the update laws, the simulator uses a fourth-order Runge-Kutta integration routine. A stopping criterion ω was set to determine whether the system has converged to a solution. (The solution is actually the local minimum in the vicinity of the initial conditions in the weight space.) The weights were initialized by a random number generator; the value of any initial weight w is such that -0.9 ≤ w ≤ 0.9. We assume that the network has converged when the specified stopping criterion ω is achieved. The parameter ω is defined as

    \omega := \sum_w |\Delta w| ,    (6.3)

where w ranges over all the weights in the network and

    \Delta w := w(t + \Delta t) - w(t) .    (6.4)

The term Δt is the time step of the fourth-order Runge-Kutta integration which is used to integrate the update laws (2.7). We also used an error criterion to characterize the training: the lower the value of the error, the better the training. The error parameter is defined as

    \xi := \sum_p \sum_n e_{pn}^2 .    (6.5)

Half of the data-base was used as set 1 and the other half as set 2. Each of these sets included 50 different fonts, listed in Appendix K.
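The two termination tests can be stated compactly in code. The following sketch, with hypothetical names and a flat array layout rather than the simulator's actual data structures, evaluates (6.3)-(6.5) after an integration step; training stops when either quantity falls below its threshold (for example, ω ≤ 0.15 or ξ ≤ 400 in section 6.4.1 below).

    /* Sketch: evaluate the stopping criterion (6.3)-(6.4) and the error
       criterion (6.5) after an integration step. */
    #include <math.h>

    int converged(const double *w, const double *w_prev, int n_weights,
                  const double *e, int n_errors,   /* e_pn over all p, n */
                  double omega_stop, double xi_stop)
    {
        double omega = 0.0, xi = 0.0;
        for (int i = 0; i < n_weights; i++)
            omega += fabs(w[i] - w_prev[i]);       /* (6.3): sum of |dw|   */
        for (int k = 0; k < n_errors; k++)
            xi += e[k] * e[k];                     /* (6.5): sum of e_pn^2 */
        return (omega <= omega_stop) || (xi <= xi_stop);
    }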
We trained the network in different stages to study the response of the different modules under different conditions. Training of the feature extraction and the classification networks is discussed in subsections 6.4.1 and 6.4.2, respectively.

6.4.1 Training of the feature extraction network

We trained the feature extraction network with the update law (2.7) using the Gaussian (2.6), the Polynomial (4.3) and the Exponential (5.1) energy functions. The Cauchy energy function (3.4) is not used in the update law (2.7), as its computations grow exponentially with the number of patterns and output nodes. With 50 different training fonts the input patterns number 500, and the network has 35 output nodes; it is therefore not feasible to use the Cauchy energy function for the training of this pattern recognition problem.

Training simulators using the learning algorithm outlined in Table 2.1 were developed corresponding to update law (2.7) with the Gaussian (2.6), the Polynomial (4.3) and the Exponential (5.1) energy functions. In all the training simulations the values of Δt and γ were chosen to be 0.0005 and 1, respectively. The κ parameter of the Exponential energy function (5.1) was chosen to be 0.004. We assume that the network has converged when the stopping criterion ω ≤ 0.15 is satisfied. The error criterion ξ = 400 was also chosen to terminate the training of the network. For the Gaussian and the Polynomial energies, learning was terminated when the stopping criterion ω was achieved; for the Exponential energy, learning was stopped when the error criterion was achieved. The error ξ and the number of steps at convergence are tabulated in Table 6.1.

Table 6.1

            Gaussian   Polynomial   Exponential Gaussian
    ξ        526.25      638.96          399.96
    steps     16205       18442           10598

Table 6.1 shows the error ξ and the number of steps at convergence for the Gaussian, the Polynomial and the Exponential Gaussian learning used in the training of the feature extraction network.

After the network was trained with set 1, it was tested with set 2. The character recognition performance is reported in Tables 6.2, 6.3 and 6.4, corresponding to the Gaussian, the Polynomial and the Exponential Gaussian energy functions. In all of the results the network incorporates the error correction capability that is built into the targets (i.e., if the binarized output is within a Hamming distance of 8 of a target, the corresponding input is classified to that target).

Table 6.2

    patterns    correct   incorrect   reject
    set 1        100%        0%         0%
    set 2        92.6%       3%         4.4%

Table 6.2 shows the character recognition using the feature extraction network trained with the Gaussian learning.

Table 6.3

    patterns    correct   incorrect   reject
    set 2         93%       1.8%       5.2%

Table 6.3 shows the character recognition using the feature extraction network trained with the Polynomial learning.

Table 6.4

    patterns    correct   incorrect   reject
    set 1        100%        0%         0%
    set 2        92.4%      2.6%        5%

Table 6.4 shows the character recognition using the feature extraction network trained with the Exponential of the Gaussian learning.

On convergence we observe that the Polynomial learning achieved a better recognition performance, even though its error criterion value is higher than those of the Gaussian and the Exponential Gaussian learning. The reason is obvious if we look at the L1 norm: when the training terminated, the L1 norm of the errors for the Polynomial learning was 882.95, in contrast with 1694.85 for the Gaussian and 1398.56 for the Exponential Gaussian learning.
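The error-correcting decision rule used throughout Tables 6.2-6.4 is easy to state in code. The sketch below binarizes the 35 network outputs and classifies by Hamming distance, rejecting when no target is within distance 8; the data layout and the 0.5 binarization threshold are assumptions.

    /* Sketch of the error-correcting decision rule: binarize the 7x5
       output, then accept the unique target within Hamming distance 8.
       Returns the digit 0-9, or -1 when the input must be rejected. */
    #define N_OUT 35   /* 7x5 output nodes */
    #define N_CLS 10   /* digits 0..9      */

    int classify(const double out[N_OUT], const int targets[N_CLS][N_OUT])
    {
        for (int t = 0; t < N_CLS; t++) {
            int dist = 0;
            for (int k = 0; k < N_OUT; k++) {
                int bit = out[k] > 0.5;            /* binarize (assumed)   */
                dist += (bit != targets[t][k]);
            }
            if (dist <= 8)
                return t;  /* targets are >= 17 apart, so match is unique */
        }
        return -1;         /* no target close enough: reject              */
    }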
We also notice that the Exponential Gaussian learning achieved almost the same recognition performance as the simple Gaussian learning, but in a much smaller number of computer steps. In conclusion, an appropriate energy function can be used in the update law (2.7) to improve the learning speed and/or performance.

6.4.2 Training of the classification network

We used the Gaussian learning (2.6)-(2.7) to train the classification network. The values of Δt, ω and γ were chosen to be 0.01, 0.1 and 1, respectively, for the training of the classification network. The training was terminated when the stopping criterion ω was achieved. The error ξ and the number of computer steps at convergence are tabulated in Table 6.5.

Table 6.5

            set 1   set 2
    ξ        2.46    8.32
    steps     296     618

Table 6.5 shows the error ξ and the number of computer steps at convergence of the learning dynamics for the training of the classification network.

The classification network was trained with two different sets. First we used the same set (set 1) as was used for the training of the feature extraction network. With this set as input to the trained feature extraction network, the output of this network was fed to the classification network, which was trained on it. This is just like training the feature extraction and the classification networks together with set 1. The whole network (the feature extraction and the classification) was then tested with set 2, and the results are reported in Table 6.6.

In the second case, set 2 was input to the trained feature extraction network and the output of this network was used to train the classification network. This is done assuming that the additional data available to the users to train the classification network would be different from the data on which the feature extraction network was trained. As such, set 2 was used for training and set 1 was used for testing. The classification results of the network are reported in Table 6.7.

Table 6.6

    patterns    correct   incorrect   reject
    set 1        100%        0%         0%
    set 2         94%        2%         4%

Table 6.6 shows the character recognition when the feature extraction and the classification networks are both trained with set 1, using the Gaussian learning.

Table 6.7

    patterns    correct   incorrect   reject
    set 1        99.8%       0%        0.2%
    set 2        100%        0%         0%

Table 6.7 shows the character recognition when the feature extraction and the classification networks are trained with set 1 and set 2, respectively, using the Gaussian learning.

For both of the cases reported above, we trained the network with update law (2.7) using the Gaussian energy function. For the training of the classification network we used only one energy function (the Gaussian), as the emphasis here is on studying the classification capability and not the learning speed of the network.

Observe that when the feature extraction network is trained with set 1 and the classification network is trained with set 2 (that is, with the output of the feature extraction network when the input is set 2), the character recognition improves considerably. This feature highlights the importance of using the modular approach. Besides improving the recognition performance, it provides training convenience to users, as they have to train only a smaller network (the classification network) on the new data to be added.
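Before closing the evaluation, the decision rule of the classification layer described in section 6.3.2 can also be stated in a few lines; the "high" threshold of 0.5 is an assumption, as is the array layout.

    /* Sketch of the classification-layer decision: one neuron per digit,
       accept only when exactly one output is high.  Returns the class
       index, or -1 when zero or several nodes fire (unable to classify). */
    int decide(const double y[10])
    {
        int winner = -1;
        for (int i = 0; i < 10; i++) {
            if (y[i] > 0.5) {                /* "high" threshold (assumed) */
                if (winner >= 0) return -1;  /* more than one node high    */
                winner = i;
            }
        }
        return winner;                       /* -1 if no node went high    */
    }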
6.5 Discussion

We decomposed the problem of pattern recognition into two tasks, namely, the feature extraction and the classification of the input characters. Both tasks are performed by FFANNs. The network performing the feature extraction is partially connected, whereas the classification is done with a fully connected network. The two networks are concatenated, the feature extraction followed by the classification.

The two networks performing the feature extraction and the classification are treated as two separate modules. The feature extraction network is trained on a large data-base off-line, whereas the user can train the classification network on-line according to his or her requirements, as and when new data are added. In this work we trained the feature extraction network on a certain data-base and tested it with another set of data. The training is supervised, in that we provide the targets, which incorporate an error correction capability due to their structure.

We illustrated that the training can be improved by using an appropriate energy function in the update law (2.7). When the Polynomial energy function is used, we lower the output error in the L2 as well as the L1 norm. Observe that even when the L2 norm is higher in value, we achieve good recognition results due to the fact that the L1 norm of the output error is also reduced.

Observe also that when the classification network is trained (according to the requirements of the user) with different fonts than the training set (with which the feature extraction network was trained), the overall pattern recognition capability is enhanced. Besides improving the recognition performance, this modular approach provides training convenience to users.

The training simulators based on the update law (2.7), incorporating different energy functions, can also be used to train FFANNs for various other applications. The simulators have been used effectively to train multilayer feedforward artificial neural networks for handwritten numeric character recognition [72].

CHAPTER 7

SUMMARY AND CONCLUSION

Artificial neural networks are mathematical models originally designed to mimic some of the functions of the nervous systems of higher animals. There are two main approaches to realizing these artificial neural networks, namely, implementation via digital software and manufacture via analog hardware. This research follows the former approach as applicable to the feedforward structure of these networks. Supervised learning is the key to making these networks useful for automated information processing of data. In this work we addressed some of the issues of supervised learning as applicable to FFANNs.

7.1 Summary

In this research, theory supported by computer simulations is described to improve the supervised learning of FFANNs. Error back-propagation learning, being widely accepted, is used to realize the gradient descent method of updating the weights of the network. Realizing the advantages of the continuous-time update law, we preferred it over the conventional (Euler) discrete-time update law and used it throughout this work.

The choice of an energy function specifies the back-propagated error in the update law. It also determines the learning performance in the sense of the speed of convergence and the size of the domains of attraction of the stable equilibrium points in the weight space. The Gaussian (sum-of-the-squared) energy function is commonly used due to its simplicity. Another motivating factor for its usage is that it derives from the maximum-entropy distribution subject to the second-moment constraint (the Gaussian distribution). The Gaussian energy function is basically the L2 (Euclidean) norm.
The choice among maximum-entropy distributions widens if the constraint is reduced to the first moment only. We chose the Cauchy distribution from the point of view that the corresponding energy function (namely, the Cauchy energy function) improves the learning performance in a certain sense*. Though the Cauchy energy function may not be a good choice for solving bigger problems, due to the exponential growth of its computations, it remains a candidate for possible hardware-implemented neural networks.

* Improvement in the learning performance of the Cauchy system here means that the derivative of the sum-of-the-squared error at any point on the error surface along the trajectories of the Gaussian system is less in magnitude than that of the Cauchy system.

Reduction of the output errors of a network in the L2 norm may not always achieve the desired results; we may have to reduce the output errors in other norms or in a combination of norms. The Polynomial energy function not only takes care of this problem but also improves the learning performance. A simple example of the Polynomial energy function is discussed in which a combination of the L1 and L2 norms is used. Besides reducing the error in both norms, it improves the learning speed in terms of computer steps and time.

A very common problem inherent to gradient descent systems is convergence to the nearest local minimum, which may not be good for solving the desired problem. The requirement of skipping the undesirable local minima (at higher energy levels) and converging to a desirable minimum (at lower energy levels) is met by using the Exponential energy function. Moreover, the convergence to the desirable minimum is much faster in terms of the number of computer steps and time when the Exponential energy function is used.

Numerous simulations on the XOR problem have been conducted to evaluate the learning performance of the gradient descent update law using different energy functions. All the simulations support the analytical derivations, indicating that a suitable choice of the energy function may improve the learning performance in terms of the learning steps and/or time.

Finally, we evaluated the use of the energy functions described in this work on the practical problem of character recognition. A highly user-oriented character recognition network is developed with two separate modules: a feature extraction network to be trained on a large data-base and a classification network to be trained by the user according to his/her requirements. It is shown that different energy functions can be used to update the weights in order to improve the learning performance. In addition, the proposed structure is easy to use, since the user can train the network on any additional input patterns in order to improve recognition of newly acquired characters.

7.2 Conclusion

In this research, we have developed a method of exploiting the energy functions in the gradient descent weight update law to improve the learning performance of a FFANN. The improvement of learning performance aims at improving the learning speed and/or the generalization. This methodology has been specialized to two typical problems: the prototype XOR problem and the practical character recognition problem. We have shown that investigating various energy functions, with a view towards eventual implementation in software, has produced valuable results.
Analytically, it is tractable to investigate the learning rules in the continuous-time setting. Yet in the software implementation we had to anticipate the step size and the learning rate, and select them appropriately, so as to achieve "faster" convergence and, in addition, guarantee the absence of oscillations.

APPENDICES

Appendix A

Entropy comparison of the Gaussian and the Cauchy distributions

Consider the distribution of a random variable x to be Gaussian with zero mean, given as

    f_g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2} .    (A.1)

Choosing \sigma = 1, the Gaussian distribution becomes

    f_g(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} .    (A.2)

The entropy H_g(f) [43] of the Gaussian distribution f_g(x) is calculated as

    H_g(f) = -\int_{-\infty}^{+\infty} f_g(x)\, \log f_g(x)\, dx    (A.3)
           = -\log e \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,
             \ln\!\Bigl( \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \Bigr) dx
           = \log e \Bigl[ \ln\sqrt{2\pi} \int_{-\infty}^{+\infty} f_g(x)\, dx
             + \int_{-\infty}^{+\infty} \frac{x^2}{2}\, f_g(x)\, dx \Bigr] .

The first integral equals one. Solving the second integral by parts, and evaluating x e^{-x^2/2} from -\infty to +\infty using l'Hôpital's rule [51], we get

    H_g(f) = \log e \Bigl[ \ln\sqrt{2\pi} + \frac{1}{2} \Bigr] .    (A.4)

Now consider the distribution of the random variable x to be Cauchy with zero mean, given as

    f_c(x) = \frac{1}{\pi} \cdot \frac{\lambda}{\lambda^2 + x^2} .    (A.5)

Choosing \lambda = 1, the Cauchy distribution of x becomes

    f_c(x) = \frac{1}{\pi} \cdot \frac{1}{1 + x^2} .    (A.6)

The entropy H_c(f) of the Cauchy distribution f_c(x) is calculated as

    H_c(f) = -\int_{-\infty}^{+\infty} f_c(x)\, \log f_c(x)\, dx    (A.7)
           = -\log e \int_{-\infty}^{+\infty} \frac{1}{\pi(1+x^2)}\,
             \ln\!\Bigl( \frac{1}{\pi(1+x^2)} \Bigr) dx
           = \log e \Bigl[ \ln\pi + \frac{1}{\pi} \int_{-\infty}^{+\infty}
             \frac{\ln(1+x^2)}{1+x^2}\, dx \Bigr] .

Now, substituting x = \tan\theta and using

    \int_0^{\pi/2} \ln\sec\theta\, d\theta = \frac{\pi}{2} \ln 2 ,

we get

    H_c(f) = \log e \cdot \ln(4\pi) .    (A.8)

Comparing the Gaussian and the Cauchy entropy values from (A.4) and (A.8), we observe that the Cauchy entropy is about 1.78 times the Gaussian entropy.
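As a cross-check of (A.4) and (A.8), the short program below estimates both entropies numerically. It is our verification sketch, not part of the thesis software; the integration ranges and step sizes are ad hoc choices, and the tangent substitution for the Cauchy integral mirrors the substitution used in the derivation above.

    /* Numerical check of Appendix A: the Cauchy entropy should come out
       to ln(4*pi) = 2.531 nats, about 1.78 times the Gaussian entropy
       ln(sqrt(2*pi)) + 1/2 = 1.419 nats. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double pi = acos(-1.0);
        double hg = 0.0, hc = 0.0;

        /* Gaussian: integrate -f ln f over [-10, 10]; tails negligible. */
        for (double x = -10.0; x <= 10.0; x += 1.0e-4) {
            double f = exp(-x * x / 2.0) / sqrt(2.0 * pi);
            hg -= f * log(f) * 1.0e-4;
        }

        /* Cauchy: substitute x = tan(t), so -f(x) ln f(x) dx becomes
           -(1/pi) ln(cos(t)^2 / pi) dt on (-pi/2, pi/2). */
        double dt = 1.0e-6;
        for (double t = -pi / 2.0 + dt; t < pi / 2.0 - dt; t += dt) {
            double c2 = cos(t) * cos(t);
            hc -= (1.0 / pi) * log(c2 / pi) * dt;
        }

        printf("H_gauss  = %.4f  (analytic %.4f)\n", hg, log(sqrt(2.0 * pi)) + 0.5);
        printf("H_cauchy = %.4f  (analytic %.4f)\n", hc, log(4.0 * pi));
        printf("ratio    = %.3f\n", hc / hg);   /* approximately 1.78 */
        return 0;
    }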
Appendix B

Proof of Lemma 3.1

Lemma 3.1: Given α_i, x_i ∈ R with α_i ≥ 0, and assuming that Assumption 3.1 holds, then

    \Bigl[ \sum_i x_i \Bigr] \Bigl[ \sum_i \alpha_i x_i \Bigr] \ge 0 .

Proof: Stacking the x_i into a vector x and the α_i into a vector α,

    \Bigl[ \sum_i x_i \Bigr] \Bigl[ \sum_i \alpha_i x_i \Bigr]
      = x^T \mathbf{1}\, \alpha^T x    (B.1)
      = x^T A\, x    (B.2)
      = x^T \bigl[ \tfrac{1}{2}(A + A^T) \bigr] x    (B.3)
      = x^T \bar{A}\, x    (B.4)
      \ge 0 .    (B.5)

In step (B.2), A = 1αᵀ has non-negative entries. In step (B.3), A in the symmetric form xᵀAx is replaced by its symmetric part ½(A + Aᵀ) =: Ā [41],[55]. By Assumption 3.1, Ā is positive semidefinite.

Remark: Note that if all the α_i are equal, then in step (B.2) the matrix A is symmetric positive semidefinite. This is obvious, since A = α_i 11ᵀ and xᵀα_i11ᵀx = α_i ‖1ᵀx‖².

Appendix C

Proof of Theorem 3.1

Theorem 3.1: Assume that Assumption 3.1 holds. Let V := E, given by (3.14i), be a candidate Liapunov function. Let V̇ be the derivative of V along the trajectories of system (2.14) and V̇ᶜ be the derivative of V along the trajectories of system (3.15). Then the learning performance of system (3.15) is "faster" than, or at least the same as, that of system (2.14).

Proof:

Case 1 (output nodes). In order to analyze the learning performance, we evaluate V̇ᶜ − V̇ as

    \dot{V}^c - \dot{V} = \nabla_W V \cdot (\dot{W}^c - \dot{W}) ,    (C.1)

where W is the weight vector containing all the weights. Equation (3.5i) gives the components of Ẇᶜ and equation (2.14i) gives the components of Ẇ. Then, for the output layer, we get

    \dot{V}^c - \dot{V} = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E^c - \nabla_{w_{mn}}E \bigr)    (C.2i)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( (1+\alpha_n)\nabla_{w_{mn}}E - \nabla_{w_{mn}}E \bigr)    (C.2ii)
      = -\gamma \sum_{w_{mn}} \alpha_n \bigl| \nabla_{w_{mn}}E \bigr|^2    (C.2iii)
      \le 0 , \quad \text{since } \alpha_n \ge 0 \text{ and } \gamma > 0 .    (C.2iv)

Case 2 (hidden nodes). Now consider the dynamics of the weights connecting the neurons in a hidden layer. The components of Ẇᶜ and Ẇ are given by equations (3.5ii) and (2.14ii), respectively. Plugging the values of Ẇᶜ and Ẇ into equation (C.1), we get

    \dot{V}^c - \dot{V} = -\gamma \sum_{w_{kl}} \nabla_{w_{kl}}E\,
        \Bigl( \nabla_{w_{kl}}E + \tfrac{1}{2} \sum_n \alpha_n \nabla_{w_{kl}} e_n^2
        - \nabla_{w_{kl}}E \Bigr) ,    (C.3)

where ∇_{w_{kl}}E = ½ Σ_n ∇_{w_{kl}} e_n². Therefore we get

    \dot{V}^c - \dot{V} = -\frac{\gamma}{4} \sum_{w_{kl}}
        \Bigl[ \sum_n \nabla_{w_{kl}} e_n^2 \Bigr]
        \Bigl[ \sum_n \alpha_n \nabla_{w_{kl}} e_n^2 \Bigr] .    (C.4)

For each entry of the summation over w_{kl} we use Lemma 3.1 to show

    \Bigl[ \sum_n \nabla_{w_{kl}} e_n^2 \Bigr]
    \Bigl[ \sum_n \alpha_n \nabla_{w_{kl}} e_n^2 \Bigr] \ge 0 ,    (C.5)

therefore V̇ᶜ − V̇ ≤ 0.

Appendix D

Proof of Lemma 3.2

Lemma 3.2: Given α_{ij}, x_{ij} ∈ R with α_{ij} ≥ 0, and assuming that Assumption 3.2 holds, then

    \Bigl[ \sum_i \sum_j x_{ij} \Bigr] \Bigl[ \sum_i \sum_j \alpha_{ij} x_{ij} \Bigr] \ge 0 .

Proof: Stacking the x_{ij} into a vector x and the α_{ij} into a vector α,

    \Bigl[ \sum_i \sum_j x_{ij} \Bigr] \Bigl[ \sum_i \sum_j \alpha_{ij} x_{ij} \Bigr]
      = x^T \mathbf{1}\, \alpha^T x    (D.1)
      = x^T A\, x    (D.2)
      = x^T \bigl[ \tfrac{1}{2}(A + A^T) \bigr] x    (D.3)
      = x^T \bar{A}\, x    (D.4)
      \ge 0 .    (D.5)

In step (D.2), A = 1αᵀ has non-negative entries. In step (D.3), A in the symmetric form xᵀAx is replaced by its symmetric part ½(A + Aᵀ) =: Ā [41],[56]. By Assumption 3.2, Ā is positive semidefinite.

Remark: Note that if all the α_{ij} are equal, then in step (D.2) the matrix A is symmetric positive semidefinite. This is obvious, since A = α_{ij} 11ᵀ and xᵀα_{ij}11ᵀx = α_{ij} ‖1ᵀx‖².

Appendix E

Proof of Theorem 3.2

Theorem 3.2: Assume that Assumption 3.2 holds. Consider a candidate Liapunov function V := E given by equation (2.6). Let V̇ be the derivative of V along the trajectories of (2.14) and V̇ᶜ be the derivative of V along the trajectories of system (3.16). Then the learning performance of system (3.16) is "faster" than, or at least the same as, that of system (2.14).

Proof:

Case 1 (output nodes). In order to analyze the learning performance, we calculate V̇ᶜ − V̇ as

    \dot{V}^c - \dot{V} = \nabla_W V \cdot (\dot{W}^c - \dot{W}) ,    (E.1)

where W is the weight vector containing all the weights. Equation (3.5) gives the components of Ẇᶜ and equation (2.14) gives the components of Ẇ. Then, for the output layer, using equations (2.14i) and (3.5i), we get

    \dot{V}^c - \dot{V} = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E^c - \nabla_{w_{mn}}E \bigr)    (E.2i)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \Bigl( \nabla_{w_{mn}}E + \tfrac{1}{2} \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2
        - \nabla_{w_{mn}}E \Bigr) ,    (E.2ii)

where ∇_{w_{mn}}E = ½ Σ_p ∇_{w_{mn}} e_{pn}². Therefore we get

    \dot{V}^c - \dot{V} = -\frac{\gamma}{4} \sum_{w_{mn}}
        \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn}^2 \Bigr]
        \Bigl[ \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 \Bigr] .    (E.3)

Using Lemma 3.1, each entry of the summation over w_{mn} on the right-hand side of equation (E.3) is ≥ 0. Subsequently V̇ᶜ − V̇ ≤ 0.

Case 2 (hidden nodes). Now consider the dynamics of the weights connecting the neurons in any hidden layer. The components of Ẇᶜ and Ẇ are given by equations (3.5ii) and (2.14ii), respectively. Again we calculate the difference V̇ᶜ − V̇ as follows:

    \dot{V}^c - \dot{V} = -\gamma \sum_{w_{kl}} \nabla_{w_{kl}}E\,
        \Bigl( \nabla_{w_{kl}}E + \tfrac{1}{2} \sum_p \sum_n \alpha_{pn} \nabla_{w_{kl}} e_{pn}^2
        - \nabla_{w_{kl}}E \Bigr) ,    (E.4)

where ∇_{w_{kl}}E = ½ Σ_p Σ_n ∇_{w_{kl}} e_{pn}². With this substitution, we get

    \dot{V}^c - \dot{V} = -\frac{\gamma}{4} \sum_{w_{kl}}
        \Bigl[ \sum_p \sum_n \nabla_{w_{kl}} e_{pn}^2 \Bigr]
        \Bigl[ \sum_p \sum_n \alpha_{pn} \nabla_{w_{kl}} e_{pn}^2 \Bigr] .    (E.5)

Considering the summation over w_{kl} on the right-hand side of (E.5), we use Lemma 3.2 on each of its entries to show

    \Bigl[ \sum_p \sum_n \nabla_{w_{kl}} e_{pn}^2 \Bigr]
    \Bigl[ \sum_p \sum_n \alpha_{pn} \nabla_{w_{kl}} e_{pn}^2 \Bigr] \ge 0 .

As each entry of the summation over w_{kl} on the right-hand side of equation (E.5) is ≥ 0, therefore V̇ᶜ − V̇ ≤ 0.
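As a concrete illustration (ours, not from the original text) of why the assumptions matter in Lemmas 3.1 and 3.2: in the special case of the remark, with all weights equal, α_i = α ≥ 0, the claim is immediate, since

    \Bigl[ \sum_i x_i \Bigr] \Bigl[ \sum_i \alpha\, x_i \Bigr]
      = \alpha \Bigl( \sum_i x_i \Bigr)^{2} \ge 0 .

With unequal α_i the assumption cannot be dropped: for l = 2, α = (1, 0) and x = (1, -2),

    \Bigl[ \sum_i x_i \Bigr] \Bigl[ \sum_i \alpha_i x_i \Bigr]
      = (1 - 2)\,(1 \cdot 1 + 0 \cdot (-2)) = -1 < 0 ,

and indeed \bar{A} = \tfrac{1}{2}(A + A^{T}) = \tfrac{1}{2} \begin{pmatrix} 2 & 1 \\ 1 & 0 \end{pmatrix} has a negative eigenvalue, so Assumption 3.1 is violated in this case.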
Appendix F

Proof of Proposition 3.1

Proposition 3.1: The equilibria of system (3.16) are also equilibria of system (2.14).

Proof: First we consider the case of the output nodes of a network. The equilibria of the weights connecting the output neurons of system (2.14) are given by

    0 = \dot{w}_{mn} = \nabla_{w_{mn}}E
      = \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}^2 ,    (F.1)

and the equilibria of the weights connecting the output neurons of system (3.5) are given by

    0 = \dot{w}_{mn}^c = \nabla_{w_{mn}}E
        + \frac{1}{2} \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2
      = \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}^2
        + \frac{1}{2} \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 .    (F.2)

This implies

    \sum_p \nabla_{w_{mn}} e_{pn}^2 = -\sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 .    (F.3)

Multiplying both sides by Σ_p α_{pn} ∇_{w_{mn}} e_{pn}², we get

    \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn}^2 \Bigr]
    \Bigl[ \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 \Bigr]
      = -\Bigl[ \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 \Bigr]^2 .    (F.4)

According to Lemma 3.1 the left-hand side is non-negative, whereas the right-hand side is non-positive. This implies

    \sum_p \nabla_{w_{mn}} e_{pn}^2 = \sum_p \alpha_{pn} \nabla_{w_{mn}} e_{pn}^2 = 0 .    (F.5)

This proves the claim for the weights connecting the output nodes. Following similar derivations and using Lemma 3.2, the claim for the weights connecting the hidden nodes is easily proved.

Appendix G

Proof of Theorem 4.1

Theorem 4.1: Assume that Assumption 3.1 holds. Let V := E, given by (4.12i), be a candidate Liapunov function. Let V̇ be the derivative of V along the trajectories of system (2.14) and V̇ᵖ be the derivative of V along the trajectories of system (4.13). Then the learning performance of system (4.13) is "faster" than, or at least the same as, that of system (2.14).

Proof:

Case 1 (output nodes). In order to analyze the learning performance, we evaluate V̇ᵖ − V̇ as

    \dot{V}^p - \dot{V} = \nabla_W V \cdot (\dot{W}^p - \dot{W})    (G.1)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E^p - \nabla_{w_{mn}}E \bigr)    (G.2)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E + \nabla_{w_{mn}}E_1 - \nabla_{w_{mn}}E \bigr) ,    (G.3)

where E₁ = ½ Σ_n e_n. Simplifying the expression in (G.3), we get

    \dot{V}^p - \dot{V} = -\gamma \sum_{w_{mn}}
        (\nabla_{w_{mn}}E)(\nabla_{w_{mn}}E_1)    (G.4)
      = -2\gamma \sum_{w_{mn}}
        (e_n \nabla_{w_{mn}}E_1)(\nabla_{w_{mn}}E_1)    (G.5)
      = -2\gamma \sum_{w_{mn}} e_n (\nabla_{w_{mn}}E_1)^2 ,    (G.6)

therefore V̇ᵖ − V̇ ≤ 0.    (G.7)

Case 2 (hidden nodes). For the hidden nodes we can similarly derive the expression

    \dot{V}^p - \dot{V} = -\gamma \sum_{w_{kl}}
        (\nabla_{w_{kl}}E)(\nabla_{w_{kl}}E_1) .    (G.8)

As the energy functions E and E₁ are not direct functions of the weights w_{kl}, we can write

    \dot{V}^p - \dot{V} = -\frac{\gamma}{2} \sum_{w_{kl}}
        \Bigl[ \sum_n e_n \nabla_{w_{kl}} e_n \Bigr]
        \Bigl[ \sum_n \nabla_{w_{kl}} e_n \Bigr] .    (G.9)

For each entry of the summation over w_{kl} we use Lemma 3.1 to show

    \Bigl[ \sum_n e_n \nabla_{w_{kl}} e_n \Bigr]
    \Bigl[ \sum_n \nabla_{w_{kl}} e_n \Bigr] \ge 0 ,    (G.10)

therefore V̇ᵖ − V̇ ≤ 0.    (G.11)

Appendix H

Proof of Theorem 4.2

Theorem 4.2: Assume that Assumption 3.2 holds. Consider a candidate Liapunov function V := E given by equation (2.6). Let V̇ be the derivative of V along the trajectories of (2.14) and V̇ᵖ be the derivative of V along the trajectories of system (4.14). Then the learning performance of system (4.14) is "faster" than, or at least the same as, that of system (2.14).

Proof:

Case 1 (output nodes). In order to analyze the learning performance, we evaluate V̇ᵖ − V̇ as

    \dot{V}^p - \dot{V} = \nabla_W V \cdot (\dot{W}^p - \dot{W})    (H.1)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E^p - \nabla_{w_{mn}}E \bigr)    (H.2)
      = -\gamma \sum_{w_{mn}} \nabla_{w_{mn}}E\,
        \bigl( \nabla_{w_{mn}}E + \nabla_{w_{mn}}E_1 - \nabla_{w_{mn}}E \bigr) ,    (H.3)

where E₁ = ½ Σ_p Σ_n e_{pn}. Simplifying the expression in (H.3), we get

    \dot{V}^p - \dot{V} = -\gamma \sum_{w_{mn}}
        (\nabla_{w_{mn}}E)(\nabla_{w_{mn}}E_1)    (H.4)
      = -\frac{\gamma}{2} \sum_{w_{mn}}
        \Bigl[ \sum_p e_{pn} \nabla_{w_{mn}} e_{pn} \Bigr]
        \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn} \Bigr] .    (H.5)

For each entry of the summation over w_{mn} we use Lemma 3.1 to show

    \Bigl[ \sum_p e_{pn} \nabla_{w_{mn}} e_{pn} \Bigr]
    \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn} \Bigr] \ge 0 ,    (H.6)

therefore V̇ᵖ − V̇ ≤ 0.    (H.7)

Case 2 (hidden nodes). For the hidden nodes we can similarly derive the expression

    \dot{V}^p - \dot{V} = -\gamma \sum_{w_{kl}}
        (\nabla_{w_{kl}}E)(\nabla_{w_{kl}}E_1) .    (H.8)

As the energy functions E and E₁ are not direct functions of the weights w_{kl}, we can write

    \dot{V}^p - \dot{V} = -\frac{\gamma}{2} \sum_{w_{kl}}
        \Bigl[ \sum_p \sum_n e_{pn} \nabla_{w_{kl}} e_{pn} \Bigr]
        \Bigl[ \sum_p \sum_n \nabla_{w_{kl}} e_{pn} \Bigr] .    (H.9)

For each entry of the summation over w_{kl} we use Lemma 3.2 to show

    \Bigl[ \sum_p \sum_n e_{pn} \nabla_{w_{kl}} e_{pn} \Bigr]
    \Bigl[ \sum_p \sum_n \nabla_{w_{kl}} e_{pn} \Bigr] \ge 0 ,    (H.10)

therefore V̇ᵖ − V̇ ≤ 0.    (H.11)
Appendix I

Proof of Proposition 4.1

Proposition 4.1: The equilibria of system (4.14) are also equilibria of system (2.14).

Proof: First we consider the case of the output nodes of a network. The equilibria of the weights connecting the output neurons of system (2.14) are given by

    0 = \dot{w}_{mn} = \nabla_{w_{mn}}E
      = \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}^2 ,    (I.1)

and the equilibria of the weights connecting the output neurons of system (4.4) are given by

    0 = \dot{w}_{mn}^p = \nabla_{w_{mn}}E + \nabla_{w_{mn}}E_1
      = \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}^2
        + \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn}
      = \sum_p e_{pn} \nabla_{w_{mn}} e_{pn}
        + \frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn} .    (I.2)

This implies

    \sum_p e_{pn} \nabla_{w_{mn}} e_{pn}
      = -\frac{1}{2} \sum_p \nabla_{w_{mn}} e_{pn} .    (I.3)

Multiplying both sides by Σ_p ∇_{w_{mn}} e_{pn}, we get

    \frac{1}{2} \Bigl[ \sum_p \nabla_{w_{mn}} e_{pn} \Bigr]^2
      = -\Bigl[ \sum_p \nabla_{w_{mn}} e_{pn} \Bigr]
         \Bigl[ \sum_p e_{pn} \nabla_{w_{mn}} e_{pn} \Bigr] .    (I.4)

According to Lemma 3.1 the right-hand side is non-positive, whereas the left-hand side is non-negative. This implies

    \sum_p \nabla_{w_{mn}} e_{pn} = \sum_p e_{pn} \nabla_{w_{mn}} e_{pn} = 0 .    (I.5)

This proves the claim for the weights connecting the output nodes. Following similar derivations and using Lemma 3.2, the claim for the weights connecting the hidden nodes is easily proved.

Appendix J

Proof of Proposition 5.1

Proposition 5.1: The equilibria of system (5.2) and system (5.3) are the same.

Proof: We consider the Exponential (5.3) and the simple (5.2) systems, rewritten as

    \dot{W} = -\gamma \kappa \nabla_W E ,    (J.1)

and

    \dot{W} = -\gamma \nabla_W E ,    (J.2)

where E can be the Gaussian (2.6), the Cauchy (3.4), the Polynomial (4.3) or any other suitable energy function. Let w* be an equilibrium of system (J.1); then

    0 = -\gamma \kappa \nabla_W E \big|_{w^*} .    (J.3)

This implies

    \nabla_W E \big|_{w^*} = 0 .    (J.4)

Consequently, w* belongs to the equilibria of system (J.2).    (J.5)

Now let w** be an equilibrium of system (J.2); then

    0 = -\gamma \nabla_W E \big|_{w^{**}} ,    (J.6)

which implies

    \nabla_W E \big|_{w^{**}} = 0 .    (J.7)

As a consequence, w** belongs to the equilibria of system (J.1).    (J.8)

(J.5) and (J.8) prove the claim.

Appendix K

List of fonts used for the pattern recognition problem

Set 1: Ancient, Athenaeum italic, Augustea, Banco, Breite, Century, Choc, Corvinus, Craw Clarendon bold, Craw Modern, Diotima, Diotima italic, Egyptienne italic, Elizabeth italic, Falstaff, Flash, Fluidum, Fluidum italic, Folkwang, Francesca, Garamond, Grotesque, Horizon, Hyperion, Italian, Kaufmann, Letter 9, London, Lydian, Matura, Michel Angelo, Microgramma bold, Mistral, Montan, Mosaik, New Clarendon, Nicholas, Normandy, Paganini italic, Peignot, Petra, Playbill, Saltino, Salut, Slim Black, Stradivarius, Stencil, Tea Chest, Walbaum italic, Weiss italic.

Set 2: Albertus, Amanda, Badoni bold, Baskerville, Bembo, Cartoon bold, Caslon italic, Consort bold, Craw Modern bold, Eckmann, Editor, Egizio, Erbar italic, Folio, Fortune, Ganton, Gill Sans, Gothic bold italic, Goudy bold, Hercules, Jacno, Keyboard, Klang, Letter 4, Letter 6, Letter 7, Melior, Mercurius, Modern 20, New Gothic, Noname, Paganini, Perpetua italic, Placard, Plantin, Psitt, Quirinus bold, Radiant bold, Ritmo, Rockwell, Standard, Tiemann, Times, Times bold, Topic bold, Trump, Universe, Vendome, Walbaum, Weiss.

BIBLIOGRAPHY

[1] D. H. Ballard, Interpolation coding: A representation of numbers in neural models, Biological Cybernetics, vol. 57, pp. 389-402, 1987.
[2] E. I. Knudsen, S. du Lac and S. D. Esterley, Computational maps in the brain, Annual Review of Neuroscience, vol. 10, pp. 41-65, 1987.
[3] M. Kuperstein, An adaptive neural model for mapping invariant target position, Behavioral Neuroscience, pp. 148-162, 1988.
[4] D. Walters, Selection of image primitives for general purpose visual processing, Computer Vision, Graphics and Image Processing, vol. 37, pp. 261-298, 1987.
[5] S. Dehaene, J. Changeux and J. Nadal, Neural networks that learn temporal sequences by selection, Proceedings of the National Academy of Sciences, USA, Biophysics, vol. 84, pp. 2713-2727, 1987.
[6] J. A. Feldman and D. H. Ballard, Connectionist models and their properties, Cognitive Science, vol. 6, pp. 205-254, 1982.
[7] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Cambridge, MA: MIT Press, vol. 1, pp. 318-362, 1986.
[8] D. E. Rumelhart and J. L. McClelland, Explorations in Parallel Distributed Processing: A Handbook of Models, Cambridge, MA: MIT Press, pp. 121-159, 1988.
[9] D. Tank and J. J. Hopfield, Concentrating information in time: analog neural networks with application to speech recognition problems, 1st International Conference on Neural Networks, IEEE, June 1987.
[10] F. C. Hoppensteadt, An Introduction to the Mathematics of Neurons, Cambridge University Press, 1986.
[11] D. Purves, Body and Brain: A Trophic Theory of Neural Connections, Harvard University Press, Cambridge, Mass., 1988.
[12] R. J. MacGregor, Neural and Brain Modeling, Academic Press, San Diego, Calif., 1987.
[13] M. Berry, The Nerve Cell, MacKeith Press, London, 1986.
[14] B. Widrow, study director, DARPA Neural Network Study, AFCEA International Press, February 1988.
[15] J. A. Anderson and E. Rosenfeld, Neurocomputing: Foundations of Research, Cambridge, MA: MIT Press, 1988.
[16] F. Rosenblatt, Perceptrons and the Theory of Brain Mechanisms, Spartan Books, New York, 1962.
[17] R. P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Magazine, vol. 4, pp. 4-22, April 1987.
[18] R. H. Nielsen, Theory of the back-propagation neural network, International Joint Conference on Neural Networks, vol. 1, pp. 593-605, June 1989.
[19] W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[20] B. Widrow and M. E. Hoff, Adaptive switching circuits, IRE WESCON Convention Record, New York: IRE, pp. 96-104, 1960.
[21] T. Kohonen, Correlation matrix memories, IEEE Transactions on Computers, C-21, pp. 353-359, 1972.
[22] J. A. Anderson, A simple neural network generating an interactive memory, Mathematical Biosciences, vol. 14, pp. 197-220, 1972.
[23] S. Grossberg, Studies of Mind and Brain, Reidel, Boston, 1982.
[24] R. J. Williams, Unit activation rules for cognitive network models, ICS Report 8303, Institute of Cognitive Sciences, University of California at San Diego, November 1983.
[25] G. E. Hinton, T. J. Sejnowski and D. H. Ackley, Boltzmann machines: constraint satisfaction networks that learn, Report CMU-CS-84-157, Carnegie-Mellon University, 1984.
[26] J. J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proc. Natl. Acad. Sci., vol. 81, pp. 3088-3092, May 1984.
[27] M. Arai, Mapping abilities of three-layer neural networks, International Joint Conference on Neural Networks, vol. 1, pp. 419-423, 1989.
[28] G. Mirchandani and W. Cao, On hidden nodes of neural nets, IEEE Transactions on Circuits and Systems, vol. 36, no. 5, pp. 661-664, May 1989.
[29] S. C. Huang and Y. H. Huang, Bounds on the number of hidden neurons in multilayer perceptrons, IEEE Transactions on Neural Networks, vol. 2, no. 1, Jan. 1991.
[30] M. Minsky and S. Papert, Perceptrons, Cambridge, MA: MIT Press, 1969.
[31] G. Cybenko, Continuous valued neural networks with two hidden layers are sufficient, Tech. Rep., Department of Computer Science, Tufts University, March 1988.
[32] K. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks, vol. 2, pp. 183-192, 1989.
[33] A. E. Bryson and Y. C. Ho, Applied Optimal Control, Blaisdell, New York, 1969.
[34] P. J. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Applied Mathematics, Harvard University, November 1974.
[35] F. M. A. Salam, Neural nets and engineering implementations, keynote address at the 31st Midwest Symposium on Circuits and Systems, St. Louis, Missouri, August 10-12, 1988.
[36] F. M. A. Salam, Artificial Neural Nets: Basic Theory and Engineering Implementations, Department of Electrical Engineering, Michigan State University, East Lansing, MI 48824, October 1989.
[37] F. M. A. Salam and M. R. Choi, An all-MOS analog feedforward neural circuit with learning, 1990 IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, Louisiana, May 1-3, 1990.
[38] F. M. A. Salam, Learning algorithms for artificial neural nets for analog circuit implementation, Interface '90 Proceedings, East Lansing, MI, May 15-16, 1990.
[39] I. K. Sethi and A. K. Jain, Artificial Neural Networks and Statistical Pattern Recognition, North Holland, 1991.
[40] M. W. Hirsch and S. Smale, Differential Equations, Dynamical Systems and Linear Algebra, Academic Press, 1974.
[41] C. T. Chen, Linear System Theory and Design, Holt, Rinehart and Winston, New York, 1984.
[42] E. T. Jaynes, Information theory and statistical mechanics, Phys. Rev., vol. 106, pp. 620-630, 1957.
[43] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, 1949.
[44] A. Wragg and D. C. Dowson, Fitting continuous probability density functions over [0,∞) using information theory ideas, IEEE Trans. Info. Theory, vol. IT-16, pp. 226-230, March 1970.
[45] D. C. Dowson and A. Wragg, Maximum-entropy distributions having prescribed first and second moments, IEEE Trans. Info. Theory, vol. IT-19, pp. 689-693, Sept. 1973.
[46] J. M. Einbu, On the existence of a class of maximum-entropy probability density functions, IEEE Trans. Info. Theory, vol. IT-23, pp. 772-775, Nov. 1977.
[47] A. B. Carlson, Communication Systems: An Introduction to Signals and Noise in Electrical Communication, McGraw-Hill, 1986.
[48] P. Burrascano, A norm selection criterion for the generalized delta rule, IEEE Transactions on Neural Networks, vol. 2, no. 1, Jan. 1991.
[49] J. Wang and B. Malakooti, On training of artificial neural networks, International Joint Conference on Neural Networks, Washington D.C., vol. 2, pp. 387-393, June 1989.
[50] A. Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill, New York, 1965.
[51] C. H. Edwards, Jr. and D. E. Penney, Calculus and Analytic Geometry, Prentice-Hall, Inc., 1982.
[52] M. Ahmad and F. M. A. Salam, Error back-propagation learning using the Cauchy energy function, under preparation.
[53] J. K. Hale, Ordinary Differential Equations, Wiley-Interscience, 1969.
[54] H. K. Khalil, Nonlinear Systems, Macmillan, 1991.
[55] M. Fiedler, Special Matrices and their Applications in Numerical Mathematics, Martinus Nijhoff Publishers, Dordrecht, The Netherlands, 1986.
[56] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1983.
[57] M. Ahmad and F. M. A. Salam, Error back-propagation learning using the Polynomial energy function, under preparation.
[58] S. M. Ebeid, J. A. Sirat and J. R. Viala, A rationalized error back-propagation learning algorithm, International Joint Conference on Neural Networks, vol. II, pp. 373-380, June 18-22, 1989.
[59] T. Tollenaere, SuperSAB: Fast adaptive back propagation with good scaling properties, Neural Networks, vol. 3, pp. 561-573, 1990.
[60] M. Ahmad and F. M. A. Salam, Error back-propagation learning using the Exponential energy function, under preparation.
[61] W. E. Weideman, M. T. Manry and H. C. Yau, A comparison of a nearest neighbor classifier and a neural network for numeric handprint character recognition, International Joint Conference on Neural Networks, vol. 1, pp. 117-120, 1989.
[62] P. Smagt, A comparative study of neural network algorithms applied to optical character recognition, Association for Computing Machinery, pp. 1037-1044.
[63] J. S. Denker, W. R. Gardner, H. P. Graf, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, H. S. Baird and I. Guyon, Neural network recognizer for hand-written zip code digits, Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., M. Kaufmann, San Mateo, CA, 1989.
[64] K. Fukushima, S. Miyake and T. Ito, Neocognitron: a neural network model for a mechanism of visual pattern recognition, IEEE Transactions on Systems, Man and Cybernetics, SMC-13, pp. 826-834, 1983.
[65] K. Fukushima and N. Wake, Handwritten alphanumeric character recognition by the Neocognitron, IEEE Transactions on Neural Networks, vol. 2, no. 3, May 1991.
[66] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, Handwritten digit recognition with a back-propagation network, Advances in Neural Information Processing Systems 2, pp. 396-404, 1990.
[67] A. Bastien, Alphabet in Type, West Drayton, Middlesex, England, 1958.
[68] F. Lamberty, Letter Forms, Hastings House Publishing, New York, 1964.
[69] D. Freres, Cent Alphabets Monotype, Union Bibliophile de France, Paris, 1963.
[70] W. P. Jaspert, W. T. Berry and A. F. Johnson, The Encyclopedia of Type Faces, Blandford Press Ltd., London, 1970.
[71] M. Ahmad and F. M. A. Salam, Feedforward artificial neural network structure for character recognition, to appear in the 29th Annual Allerton Conference on Communication, Control and Computing, October 1991.
[72] S. Bebromdhane, M. Ahmad and F. M. A. Salam, Handwritten numeric character recognition using a multilayer neural network, to appear in the 29th Annual Allerton Conference on Communication, Control and Computing, October 1991.