. . . . . K. 1 ﬁn . a. Q I
. , . .. . . dram. .uﬁrgmﬁr
. _ .zwp..._$u.rs;v £3.53.
v.55... lama. .
. . . 9%.... a. o. 2
ﬂat}... . . . up .9 .
1...
.90. .3 {I}:
mgrugv t

‘- . .
hmwuﬁvu.
:2 4:54...

 

n 3. 0,11...

..
.
s
, , .. an» . up.
Wauhgkzﬁ . . . . . , . . . . . ,
2. . . . . . . . , . ravmummmh

. . 5.. Ir, .3...
. . .gi 2..: £5.
. . . . . 321 J... . .12.. a.
. . . . . U Vt «
. l o
4E . . . . . . .c.
. .. .A n . . . I ‘O
G .lﬂzfn. . .

:
I“ '

all-(1’0“: : I

...... .

. . rm... $11.23

:1 .

..:.....

:03 .
6 A
v
A

i
. i. w...
uh»... 3..

m...
3 n. .. i.
“arranguﬁ

f t? ..
3H......:

.3.
.27
.2... . «33.53%. 3' "Nam:
.h’): M .‘Imnlou
I. 157 .173)?! . I.

it? I!)

.2
5:3- It s. 2
5323...... ...w
3: \I’»E\2i {urn v
3..... - .5 ,
.. .

it;

. .
.mMWiL ammo”...

 

. Inn. 13‘
(curihnbu

dub... w...“ R.

 

.2..?..I.cl~.|52vltﬁ I!
y... I: u:

, u... {:1
.u... 1.1: ,.. I get... :3. .

‘lr4
SWEW.

!

 

.l::. . .U. 3.4:... rirfir... . 2. . t I .15? thanﬁx ‘4»! I...

‘W‘HESQS Icmom sure

a ”tilt/lull mll’l‘“"”°”‘

 

 

 

I lMil/1Mil/lllllilllL

3 1293 01421 6190

 

 

 

 

 

 

 

This is to certify that the

dissertation entitled

EFFICIENT EXTENDED KALMAN FILTER
LEARNING FOR FEEDFORWARD
LAYERED NEURAL NETWORKS

presented by
Saida Benromdhane

has been accepted towards fulﬁllment
of the requirements for

Ph.D. degree in Electrical Engineering

 

    

Major professor

Date W

MS U i: an Afﬁrmative Action/Equal Opportunity Institution 0-12771

 

LIBRARY

Michigan State
Unlverslty

 

 

 

PLACE N RETURN BOX to remove We checkout from your record.
TO AVOID FINES return on or before dete due.

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU le An Atﬂnnetlve AotlonlEquel Opportunity Inetltulon
Walla-o1

Efﬁcient Extended Kalman Filter Learning for

Feedforward Layered Neural Networks

By

Saida Bem‘omdhane

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1996

ABSTRACT

Efﬁcient Extended Kalman Filter Learning for

Feedforward Layered Neural Networks
By

S aida Benromdhane

The thesis focuses on the computationally efﬁcient convergence to satisfactory
local minima of the Extended Kalman Filter Algorithm (EKF) when it is used in the
supervised learning of Artiﬁcial Feedforward Neural Networks. There are two stages
to our research work.

In the ﬁrst stage, the effect of different choices of the energy parameter or weighting
factor A on the convergence of the EKF algorithm is investigated. We limit our
attention to problems related to the supervised learning of Feedforward Artiﬁcial
Neural Networks. Through the simulation of two region classiﬁcation problems and
the analysis of the results, we demonstrate that when /\ is chosen slightly smaller than
1, the algorithm experiences explosive divergence : The Least Square Error (LSE)
grows indeﬁnitely. However, for a choice of A slightly greater than 1, the algorithm
is stable but often converges to unsatisfactory local minima, from the point-of-view
of performance and computation time.

The second stage of our work is where we propose several modiﬁcations of the

algorithm. These modiﬁcations are aimed at improving the efﬁciency of the algorithm
both in terms of performance and speed of convergence. One modiﬁcation in the
algorithm is the development of an update mechanism for the exponential weighting
factor /\ which self—adjusts to the (LSE). The second modiﬁcation is an augmentation
of the recursion formulae of the algorithm. Both of these modiﬁcations result in a
signiﬁcant improvement in performance as well as marked decrease in convergence

time when compared with the original EKF algorithm.

To my father, who always prayed for my success but could not live to see this day.

iv

ACKNOWLEDGEMENTS

In The Name Of Allah
Most Gracious, Most Merciful.

 

 

All praises be to Allah Subhanahu-wa-Taala, the creator and sustainer of this
universe for providing me the strength and the patience to complete this work.

I am greatful for the ﬁnancial support from the Government of Tunisia, and the
department of Electrical Engineering at Michigan State University. Without this
support, this research work would not have been accomplished.

I would like to express my sincere thanks to my major advisor, Dr. F. M. A.
Salam for his continued support and guidance throughout the years of my graduate
studies. I also would like to thank the members of my committee, Dr. Hassan Khalil,
Dr. Joel Shapiro, and Dr. Majid Nayeri for their helpful criticism and suggestions.

I wish to express my greatest thanks and appreciation to my parents for their
continued prayers, support and encouragement. I deeply owe my heartiest feelings of
gratitude to my husband, Izzat, for his patience and support especially during the
last years of my research. The warmest feelings of gratitude are owed to my sisters
Souad and Sonia, and my brothers Lassaad, Soﬁane and Souheil. They have always

prayed for my success and supported my pursuit of a Doctoral degree.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 Introduction

1.1 Inherent Properties of Neural Networks .................
1.2 Why a Different Training Approach? ..................

1.3

The Contribution of the Thesis ............

2 Background

2.1

2.2

2.3

2.4

2.5
2.6
2.7

Basic Components of Neural Network Models . . . .
2.1.1 Processing units .................
2.1.2 Topology .....................
2.1.3 Propagation rule ................
2.1.4 Learning rule ..................
Learning Types .....................
2.2.1 Supervised learning ...............
2.2.2 Unsupervised learning .............
Computational Features of Neural Networks .....
2.3.1 Function approximation ............
2.3.2 Data compression ................
2.3.3 Optimization ..................
Examples of Models of Feedforward Neural Networks
2.4.1 Perceptron and adaline .............
2.4.2 Multilayer perceptron ..............
The Optimal Filtering Technique ...........
Thesis Problem Description ..............
The Objectives and the Outline of the Thesis . . . .
2.7.1 Outline of the thesis ..............

3 The Extended Kalman Filter Algorithm

vi

eeeeeeeee

eeeeeeeee

viii

g.

#WIOH

HHF—‘F—‘t—‘r—‘F—‘HP—‘t—H
NGQN-skéiQQIQlQr-‘QONCEU‘

j—I
x]

18
19
19
20
21

23

3.1 Motivation ................................. 23

3.2 The Learning Dynamics ......................... 24
3.2.1 The global extended Kalman ﬁlter algorithm .......... 25

3.2.2 Computation of the gradient matrix ............... 27

3.2.3 The local appraches and the (NDEKF) ............. 31

3.3 Simulation Examples ........................... 34
3.4 Discussion and Analysis of Results ................... 49
3.4.1 Conjecture 1 ............................ 50

4 Convergence to Satisfactory Local Minima 51
4.1 Motivation ................................. 51
4.2 Divergence Phenomena .......................... 52
4.3 Proposed Solutions ............................ 53
4.3.1 A Modiﬁed update procedure .................. 54

4.3.2 The forgetting factor as a function of error ........... 55

4.4 Discussion and Categorization of Results ................ 80
4.4.1 Conjecture 2 ............................ 81

5 A Different Approach to the Computation of the Gain Matrix 83
5.1 Motivation ................................. 83

5.2 Recomputing the Gain Matrix ...................... 84
5.3 Simulation Results ............................ 85

6 Variable Slopes 87
6.1 Motivation ................................. 87
6.2 Preliminary Results ............................ 91
6.3 The Augmented Recursion Formulas .................. 96
6.4 Simulation Results ............................ 99

7 Summary and Conclusion 105
7.1 Summary ................................. 106
7.2 Conclusion and Future Research ..................... 107

A Derivation of the Extended Kalman Algorithm Equations 108
BIBLIOGRAPHY 1 1 1

vii

3.1
3.2

4.1
4.2

5.1

6.1
6.2

LIST OF TABLES

The 2-region classiﬁcation problem ................... 38
The 4-region classiﬁcation problem ................... 39
The 2-region classiﬁcation problem ................... 58
The 4-region classiﬁcation problem ................... 59

The 4-region classiﬁcation problem with a modiﬁed computation of the

gain matrix ................................ 86
The 4-region classiﬁcation problem with ﬁxed slopes .......... 91
The 4-region classiﬁcation problem with variable slopes ........ 99

viii

3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
3.10
3.11
3.12
3.13
3.14
3.15
3.16
3.17
3.18
3.19
3.20
3.21
3.22
3.23
3.24
3.25
3.26

4.1
4.2
4.3

LIST OF FIGURES

The 2-region classiﬁcation pattern .................... 36
The 4-region classiﬁcation pattern .................... 37
The TAE with A‘1 = l .......................... 40
The TAE with A’1 = 1.01 ........................ 40
The TAE with A”1 = 0.99 ........................ 41
The IAE with A‘1 = 1 .......................... 41
The IAE with A‘1 = 1.01 ......................... 41
The IAE with A'1 = 0.99 ......................... 42
The TAE with A’1 = 1 .......................... 42
The TAE with A“1 = 1.01 ........................ 42
The TAE with A“1 = 0.99 ........................ 43
The IAE with A‘1 = 1 .......................... 43
The IAE with A’1 = 1.01 ......................... 43
The IAE with A“ = 0.99 ......................... 44
The TAE with A"1 = l .......................... 44
The TAE with A"1 = 1.01 ........................ 44
The TAE with A“1 = 0.99 ........................ 45
The IAE with A’1 = 1 .......................... 45
The IAE with A'1 = 1.01 ......................... 46
The IAE with A’1 = 0.99 ......................... 46
The TAE with A'1 = 1 .......................... 47
The TAE with A'1 = 1.01 ........................ 47
The TAE with A”1 = 0.99 ........................ 47
The IAE with A‘1 = 1 .......................... 48
The IAE with A"1 = 1.01 ......................... 48
The IAE with A’1 = 0.99 ......................... 48
A = tanh(a*6) with a = 1000 and 6 is the IAE difference ....... 57
The TAE for A"1 = 1 ........................... 60
The TAE for update choicel ....................... 60

ix

4.4 The TAE for update choice2 .......................
4.5 The TAE for update choice3 .......................
4.6 The TAE for update choice4 .......................
4.7 The IAE for A“1 = 1 ...........................
4.8 The IAE for update choicel .......................
4.9 The IAE for update choice2 .......................
4.10 The IAE for update choice3 .......................
4.11 The IAE for update choice4 .......................
4.12 The TAE for A“1 = 1 ...........................
4.13 The TAE for update choicel .......................
4.14 The TAE for update choice2 .......................
4.15 The TAE for update choice3 .......................
4.16 The TAE for update choice4 .......................
4.17 The IAE for A"1 = 1 ...........................
4.18 The IAE for update choicel .......................
4.19 The IAE for update choice2 .......................
4.20 The IAE for update choice3 .......................
4.21 The IAE for update choice4 .......................
4.22 The TAE for A’1 = 1 ...........................
4.23 The TAE for update choicel .......................
4.24 The TAE for update choice2 .......................
4.25 The TAE for update choice3 .......................
4.26 The TAE for update choice4 .......................
4.27 The IAE for A"1 = 1 ...........................
4.28 The IAE for update choicel .......................
4.29 The IAE for update choice2 .......................
4.30 The IAE for update choice3 .......................
4.31 The IAE for update choice4 .......................
4.32 The 4-region classiﬁcation pattern ....................
4.33 Classiﬁcation Performance for A”1 = 1 .................
4.34 Classiﬁcation Performance for update choicel .............
4.35 Classiﬁcation Performance for update choice2 .............
4.36 Classiﬁcation Performance for update choice3 .............
4.37 Classiﬁcation Performance for update choice4 .............
4.38 The TAE for A'1 = l ...........................
4.39 The TAE for update choicel .......................

4.40 The TAE for update choice2 .......................

61
61
61

62
63
63
63
64
64
65
65
65
66
66
67
67
67
68
68
69
69
69
70
70

~1~1~J~1~1~1~1~1
“commerce-draw

73
74
74
75

4.41
4.42
4.43
4.44
4.45
4.46
4.47
4.48
4.49
4.50
4.51
4.52
4.53

6.1
6.2
6.3
6.4
6.5
6.6
6.7
6.8
6.9
6.10
6.11
6.12
6.13
6.14
6.15
6.16
6.17
6.18
6.19
6.20
6.21
6.22
6.23
6.24

The TAE for update choice3 ....................... 75
The TAE for update choice4 ....................... 75
The IAE for A‘1 = 1 ........................... 76
The IAE for update choice1 ....................... 76
The IAE for update choice2 ....................... 77
The IAE for update choice3 ....................... 77
The IAE for update choice4 ....................... 77
The 4-region classiﬁcation pattern .................... 78
Classiﬁcation Performance for A’1 = 1 ................. 78
Classiﬁcation Performance for update choice1 ............. 78
Classiﬁcation Performance for update choice2 ............. 79
Classiﬁcation Performance for update choice3 ............. 79
Classiﬁcation Performance for update choice4 ............. 79
The sigmoid function with slope s = 0.5 ................ 89
The sigmoid function with slope s = 1 ................. 89
The sigmoid function with slope s = 1.5 ................ 89
The sigmoid function with slope s = 2 ................. 89
The derivative of the sigmoid function with slope s = 0.5 ....... 90
The derivative of the sigmoid function with slope s = 1 ........ 90
The derivative of the sigmoid function with slope s = 1.5 ....... 90
The derivative of the sigmoid function with slope s = 2 ........ 90
The TAE for a ﬁxed slope s = 1 ..................... 92
The TAE for a ﬁxed slope s = 0.5 .................... 92
The TAE for a ﬁxed slope s = 1.5 .................... 92
The TAE for a ﬁxed slope s = 2 ..................... 92
The IAE for a ﬁxed slope s = 1 ..................... 93
The IAE for a ﬁxed slope 3 = 0.5 .................... 93
The IAE for a ﬁxed slope s = 1.5 .................... 93
The IAE for a ﬁxed slope s = 2 ..................... 93
The 4-region classiﬁcation pattern .................... 94
Classiﬁcation Performance for a ﬁxed slope s = 1 ........... 94
Classiﬁcation Performance for a ﬁxed slope s = 0.5 .......... 94
Classiﬁcation Performance for a ﬁxed slope s = 1.5 .......... 95
Classiﬁcation Performance for a ﬁxed slope s = 2 ........... 95
The TAE for varying the slope with so = 1 ............... 100
The TAE for varying the slope with so = 0.5 .............. 100
The TAE for varying the slope with so = 1.5 .............. 100

xi

6.25
6.26
6.27
6.28
6.29
6.30
6.31
6.32
6.33
6.34

The TAE for varying the slope with so = 2 ............... 100
The IAE for varying the slope with so = 1 ............... 101
The IAE for varying the slope with so = 0.5 .............. 101
The IAE for varying the slope with so = 1.5 .............. 101
The IAE for varying the slope with so = 2 ............... 101
The 4-region classiﬁcation pattern .................... 102
Classiﬁcation Performance for varying the slope with so = 1 ..... 102
Classiﬁcation Performance for varying the slope with so = 0.5 . . . . 102
Classiﬁcation Performance for varying the slope with so = 1.5 . . . . 103
Classiﬁcation Performance for varying the slope with so = 2 ..... 103

xii

CHAPTER 1

Introduction

Even though Artiﬁcial Neural Networks may be limited and imperfect, they are still
one of the few paradigms, if not the only ones, that seek to mimic natural intelligence
at an architectural level. They have performed impressively in some limited applica-
tions. The success and failure of Neural Networks can only suggest that research will
ultimately give rise to artiﬁcial systems that can perform the same tasks that only
the humans are currently able to perform. That is what constitutes the major drive
behind the exponential growth of Neural Networks research.

An outcome of the renewed interest in Neural networks is their use in numerous
applications ranging from pattern classiﬁcation and completion to automatic control.
The network paradigms used in these applications can be very distinct each carrying
their own name. They are made of a large number of processing elements operating
in parallel. The topology that connects these processing elements is what determines
the distinction between two or more network paradigms.

Given that the main goal for Artiﬁcial Neural Networks is to possess human intel-
ligence, the inspiration for research had to come from more speciﬁcally from neuro-
biology. Neuro-biology offers a cellular formulation of the principles of intelligent
behavior which inspire researchers to construct better intelligent systems known as

“Artiﬁcial Neural Networks”.

1.1 Inherent Properties of Neural Networks

This new hope ﬁnds its origins in all the inherent properties of Neural Networks,
but more so in these four particularly important ones: (1)generalization (2) graceful
degradation (3)adaptivity and learning and (4)parallelism [1, 2]. A focus on these
properties will be sufﬁcient to illustrate the importance of Neural Networks for related
ﬁelds.

Given that Neural Networks emulate continuous activation units, they provide a
continuous representation and inference system with similarity metrics such that when
similar real-world data inputs are presented, outputs or inferences are also similar. A
smooth generalization from stored cases to new ones is then expected.

The generalization performance of logic-based systems deteriorates considerably
when the data available is inaccurate or incomplete. In contrast, the generalization
performance of Neural Networks deteriorates proportionally to the degree of inac-
curacy or incompleteness of the data, which is usually small. However, what really
makes Neural Networks distinct from other expert systems are the two last charac-
teristics mentioned above. The ﬁrst one manifests itself in the way Neural Networks
learn and adapt. Neural networks are taught by presentation of examples. They
also adapt very easily to changing conditions. This issue is still an unsolved problem
with other expert systems. The second characteristic is present in the architecture
of Neural Networks adaptation and learning which are conducted totally in a parallel
structure. There are other parallel reasoning models that have been developed and
implemented, but in most cases they have to be supported by the user. By contrast,
in Neural Networks, units in the same layer can be updated simultaneously. In this

way, computationally expensive tasks can be performed efﬁciently.

1.2 Why a Different Training Approach?

For several years the back propagation algorithm (BP) which is based on the gra-
dient descent algorithm has been widely used to train Multi-Layered Neural Net-
works(MN N s) to perform a desired task. The common use of the back propagation
algorithm is attributed to its strengths which include a generalization ability from
a modest number of training patterns, an acquiring of arbitrarily complex nonlinear
mappings, all within a parallel structure of computation. However, like any other
algorithm, the back propagation algorithm has limitations that are easily recognized
during its use. The ﬁrst noticeable one is the slow convergence time. The second
limitation is an off-line encoding requirement which makes it unsuitable for certain
applications, especially when dealing with temporal signals.

In order to improve the speed of convergence of the (BP) algorithm, numerous
schemes have been suggested, for example [3, 4, 5]. One of these schemes [5] used
different energy functions to achieve faster convergence of the (BP)algorithm. All of
these enhancements are computationally inexpensive, but they often require tunable
parameters, thereby adding cumbersome guesswork.

Given that the enhanced backpropagation algorithm, like the standard one, may
not adequately handle temporal signals, it would not be suitable for some speciﬁc
applications, for instance, the identiﬁcation of temporal signals. In the face of this
limitation, the problem that poses itself is how to formulate a Feedforward Artiﬁcial
Neural Networks algorithm that possesses the temporal capability and is also efficient
in terms of convergence time, guesswork and computation. In the last few years much
interest focused on training algorithms[6, 7] that are faster than the back-propagation
algorithm or are capable of solving more complex problems. A class of alternative
algorithms that have been used in different applications is based on the Extended

Kalman Filter algorithm. The Kalman algorithm, in contrast to gradient techniques,

computes the optimum value of the network parameters each time a new data point

is presented, therefore it is well suited for temporal signal processing.

1.3 The Contribution of the Thesis

The main topic of this thesis is concerned with the Extended Kalman Filter Approach
for training Feedforward Artiﬁcial Neural Networks FFANN. This is a supervised
learning which is different from the backpropagation.

Motivated by the help of previous and current analysis [8], a framework supported
by computer simulations is developed to improve the supervised learning of F FANN
using the Kalman Filter approach. The framework encompasses modiﬁcations of the
Extended Kalman Filter algorithm including the development of an update mecha-
nism for the exponential weighting factor A [8]. This mechanism self-adj usts according
to the mean square error LSE. The weighting factor is also implemented as a function
of the error. The update is aimed to speed up and enable convergence to minima
that are satisfactory from the point-of—view of performance. A different computation
of the gain matrix is also pursued.

A ﬁnal and important added feature to the algorithm is the augmentation of its
recursion formulas. A recursion formula for the slope of the sigmoid nonlinearity
is developed where the update is based on the gradient descent rule. The gradient

descent rule is known to decrease the MSE without adding any oscillations.

CHAPTER 2

Background

The inspiration for Neural Networks structures is mainly based on our present under-
standing of the nervous biological system. To completely specify a Neural Network
model, its physical parts have to be deﬁned along with the required algorithms. Sim-
ilar to any other network a Neural Network is comprised of nodes we call neurons as
its processing units, and edges connecting the neurons therefore deﬁning the topol-
ogy of the network. Once a network’s topology is deﬁned, this Neural Network is
classiﬁed in one of several Neural Networks models. These models include the feed-
forward neuron-model, the feedback neuron-model and a completely interconnected
neuron-model. Our primary interest in this work is in the feedforward neuron-model
and the topology that deﬁnes the connection between its neurons. In the following
two sections the feedforward neuron model and its topology will be discussed in de-
tail. The resulting network will be referred to as the Feedforward Artiﬁcial Neural
Network(FFANN)

In the design of a Neural Network model, deﬁning the topology between the
neurons is only the ﬁrst step. Devising the learning rule for this network is the
most challenging part in the design. The learning rule is the core constituent of
the algorithm necessary for the operation of the neurons. It consists of changing the

state of individual neurons using inputs from neighboring neurons. This rule can take

various particular forms, therefore we will only emphasize the forms that have been
used in (FFAN N). We should also mention the one feature that is common to many
models: most learning rules can be viewed as modiﬁcations of the stochastic gradient
optimization of a certain objective function or performance criterion. An algorithm
where this feature is very well known is the error backpropagation algorithm. Our
primary interest in this thesis is another model that also has this feature and is a
(FFANN) where the learning is based on the Kalman optimal ﬁltering technique. This
model will be introduced in the fourth section of this chapter and discussed more in
detail in the remaining chapters. The ﬁrst four sections of this chapter are aimed
at describing Neural Networks in a broader context, they prepare for a complete

understanding of the sections that follow.

2.1 Basic Components of Neural Network Mod-
els

A Neural Network is speciﬁed completely by the deﬁnition of its physical parts and the
algorithms that are applied to it. Like any other network or graph, a Neural Network
consists of processing units that are physically called nodes and serve as the artiﬁcial
counterpart of neurons in the brain. It also consists of edges deﬁning its topology or
connectivity. In the succeeding material, the information-processing algorithm within
a Neural Network is completely speciﬁed by deﬁning the three following components
: the functionality of processing units, topology, propagation and learning rules. with
the learning rule being the key to the processing algorithm. Another algorithm, which
may require some programming, is also required for any processing system. In Neural

Networks, this task is summarized into a simple learning rule.

2.1 .1 Processing units

The processing units in a Neural Network mimic the function of neurons in the brain.
Therefore, information processed in them is determined by how the input of each
node relates to its own output and also how each processing unit connects to others

or the environment. This is described completely in these terms [1] by

Input into the unit from other units or the environment.

Output sent to other units or the environment.

The unit’s internal state, also referred to as the unit’s activation.

e The rule for computing the next state of the unit also called the activation rule

or activation function

The rule for computing the output from the current state. This is deﬁned by

the output function.

Biological neurons are complicated physical systems. Their dynamics, if realistically
modeled would involve a large number of differential equations. This feature of bio-
logical neurons is the origin of a strong motivation toward simplifying the neuronal
model. A common means of simpliﬁcation is to have two separate models each describ-
ing a functional mode of the neuron. The two functional modes are l)the processing
mode, and 2)the learning mode.

The dynamics of the processing state are described by a single differential equation.
Widespread models such as perceptrons own processing units with no genuine state.
Their output is a simple function of the input. This function is based primarily on
the simpliﬁed neuron model of Pitts [9]. The simplest version is a step function of
the weighted sum of individual inputs. This function is described in the following

equation :

1 if a: > o
f(:r,0) = (2.1)

0 otherwise

An attempt to make the activation function differentiable resulted in the following

so—called logistic function.
1

f($,0)=W

(2.2)

The processing state is described as combining inputs into a weighted sum, then
passing them by the help of a so-called activation function such as the step or sigmoid
function.

As described above, the processing state is the equivalent to the state of activation
of a simpliﬁed artiﬁcial neuron. However, the learning state is characterized by a
vector of parameters of the activation function. These parameters are to be changed
according to a desired performance criteria of a Neural Network. The parameters of

an activation function consist of

e Weights of individual inputs, with the help of which the weighted sum is com-

puted.

e A threshold of the step or sigmoid function.

2.1.2 Topology

If a system made of the processing units that are described in the previous section is
capable of solving complex tasks in a way reminiscent of human cognitive capabilities,
its sophistication is attributed to the processing units themselves and more impor-
tantly to the way they interact. The topology of the network is what determines the
interactions and thus is of crucial importance for the performance of the network.

There are various cases of interconnected networks. The simplest one is a com-
pletely interconnected network. This topology allows for arbitrarily complex feedback
structures. Although feedback extends the class of possible behaviors immensely, in
some cases and for reasons not mentioned, it may be desirable to exclude the pos-
sibility of feedback in the topology level. The way to achieve feedback exclusion is
simply to make it impossible, following any path along directed connections, to enter
a unit once left. Such a network, is called a feedforward network.

Since a feedforward network has the properties of a directed graph, each unit in
the network can be assigned a rank giving the number of units on the longest path
between it and some input unit. It is then easily understood that, for any pair of
interconnected units, the rank of the unit the connection leaves is lower than the rank
of the unit the connection enters.

The scheme for iterative evaluation according to the rank of individual units sug-
gests for collecting units into subsets of equal rank. Adding the restriction that the
rank difference between interconnected units is 1, we obtain a feedforward layered
network.

The evaluation of this network is done in a number of iterations that equals the
number of interconnection layers. The kth layer consists of the connections between
the units with ranks k and k - 1. This evaluation can be viewed as a sequence of

vector operators. A simple case of this evaluation is that of a single-layer feedforward

10

network, which is the topology used for one of the ﬁrst Neural Networks models, the
perceptron of Rosenblatt [10].

We should also mention the other networks that are layered. These are the feedback
layered networks with some examples like the Adaptive Resonance Theory(ART) of

Grossberg [11] and the Bidirectional Associative Memory of Kosko [12].

11

2.1 .3 Propagation rule

After a thorough description of the Neural Network in terms of its processing units
and topology, what now remains is to describe the propagation rule for information
processing. This rule will specify the time at which, and the order in which, the unit
activations are to be updated. The approach that is closest to the biological system
is a simultaneous and asynchronous update of all units.

Since the asynchronous approach does not seem to have any clear advantages,
most existing propagation rules are synchronous and vary accordingly with the Neural
Network model type. The variation of the propagation rule from a network type to
the other is manifested in the two algorithms that we are to describe next.

The ﬁrst one is for feedforward layered Neural Networks and is a modiﬁcation of

the algorithm used for (FFANN). The algorithm is described in these basic steps:
e The input units are set equal to the input pattern.

e In the kth step, the activation of units of the kth layer are computed using the

activations of the (k - l)th layer.

e After a number of steps equal to the number of layers, the state of the output

units corresponds to the output pattern.

The processing in this FFANN is viewed as the evaluation of a nested non—recursive
function. The situation is different for feedback networks. A detailed discussion of the
propagation rule including the steps of the algorithm to be followed in the processing

of information in a FFANN are found in [1].

12

2.1 .4 Learning rule

The learning rule is the most difﬁcult constituent of a Neural Network. Formally, it
resembles the propagation rule since it also operates on a network of interconnected
units. The learning rule consists of changing the state of individual units (the state of
activation function parameters) using inputs from the neighboring units. Given the
existence of variations of the particular form of this rule, we will limit our discussion
of this rule to a feature that is common to many models. This feature is observed in
the fact that most'learning rules are modiﬁcations of stochastic gradient optimization
of a certain objective function or performance criterion.

Models that have this feature include the following:
e The error backpropagation algorithm of Rumelhart, Hinton, and Williams [13]

e The perceptron of Rosenblatt [10], minimizing a differentiable modiﬁcation of

the Bayesian misclassiﬁcation rate.

e The unsupervised learning rule of Oja[14], minimizing the difference between the

original pattern and the pattern reconstructed from its feature representation.

Other aspects of learning rules with a classiﬁcation according to learning types are

discussed in the next section.

2.2 Learning Types

The current interest in Artiﬁcial Neural Networks methods is mainly attributed to
their ability to learn from experience. This ability to learn offers a powerful alternative
to programming.

Learning types or methods can be classiﬁed as supervised and unsupervised, with

a great many paradigms implementing each type.

13

2.2.1 Supervised learning

Examples of supervised learning paradigms are the original perceptron, and more
recently backpropagation. Supervised learning involves the training of the network
on a training set consisting of vector pairs. One vector is applied to the input of the
network; the other is used as a “target” which represents the desired output.

Training is performed by an adjustment of the network weights so as to minimize
the difference between the desired and actual output. This process can either be
achieved in an iterative procedure, or the weights can be computed by closed form
equations. The latter form of training may fail to qualify the system as a Neural
Network since its method is so far from the biological method. However, such methods
are useful, and provide a broader deﬁnition of Artiﬁcial Neural Networks.

We can describe iterative training in the following statements:
e Apply an input vector to the network.
e Compare the output produced to the target vector.

e Use the error signal obtained from the above comparison to modify the network

weights.

The modiﬁcation of the weights which we can describe as the correction of the weights
can be general, equally applied as a reinforcement to all parts or it may be speciﬁc,
with each weight receiving an appropriate adjustment. The intention behind adjusting
the weights at each step is to reduce the difference between the output and target
vectors. Vectors from the training set are applied to the network repeatedly until the
error is at an acceptably low value. The training process is successful if the network

is capable of performing the desired mapping.

14

2.2.2 Unsupervised learning

This type of learning, in contrast with supervised learning, requires no information
and only needs the input vectors to train the network. Unsupervised learning is
also called self-organization [15]. In the course of training, the network weights are
adjusted so that similar inputs produce similar outputs. The training algorithm
accomplishes this by extracting statistical regularities from the training set, repre-
senting them as the values of network weights. As Wasserman phrased it in [2];“self-
organization is reminiscent of the manner in which, in some cases, the human brain
modiﬁes its structure under the inﬂuence of its experiences without a ‘teacher’ ”.
Although applications of unsupervised learning are not as frequent as for super-
vised learning, they have been used in combination with other paradigms and have

produced useful results. An example of these applications is the counterpropagation

method[16] .

2.3 Computational Features of Neural Networks

Neural networks research is directed toward ﬁnding models that can accomplish useful
tasks such as association, pattern recognition etc. The most important classes Of
application tasks that Neural Networks are able to perform will be identiﬁed in this

section.

2.3.1 Function approximation

A deterministic feedforward network represents a mapping between the input layer
units and the output layer units. The evaluation of the mapping is achieved by
propagating the activations from the input layer to the output layer.

The network topology and activation functions are what determine the class of

15

functions that can be represented by a network. For instance, a single-layer perceptron

output results from the input in the following way:

y = Z wixi (2.3)

It is then clear that the class of linear mappings can be represented by the linear
perceptron. However, a broader range of function classes, usually nonlinear map-
pings, have been represented by multi-layer perceptrons. In addition to single-layer
and multi-layer perceptrons, layered feedforward Neural Networks with radial basis
activation functions seem to perform especially well for general functional approxi-
mation tasks. One of the functional approximation tasks that is of great application
importance is classiﬁcation. Even though any network model could be used for classi-
ﬁcation, speciﬁc properties of this application require particular activation functions
and learning rules.

The classiﬁcation task is characterized in the following way:

e A set of objects that are each characterized by a ﬁxed-length vector of numeric
or Boolean features is supplied. A set of classes is also given, such that each

object is only assigned to one class.

e A subset is then selected and called a training set, for which the class assignment

is explicitly known.

e The Neural Network can be trained to estimate the class of objects that did not

belong to the training set.

e The performance of a neural classiﬁer is measured by the proportion of objects

for which their class assignment estimate (recognition rate) is correct.

The feature vector describing the objects is frequently referred to as a pattern and

the entire task as pattern recognition.

16

2.3.2 Data compression

The task of a Neural Network model applied to data compression is to ﬁnd a mapping
that reduces the original pattern to a compressed pattern, usually of a substantially
lower dimension. The learning rule of Oja[l4] is a Neural Network model for linear
data compression. However, for nonlinear data compression, multilayer perceptrons
in an autoassociative mode can be used: The desired output of the perceptron is
set equal to the input. If an autoassociative perceptron with a hidden layer that is
narrower than the input is used (and output) layer is trained to produce outputs that
are very close to the inputs, the hidden layer constitutes a nonlinear compression of
input patterns. The inverse mapping of compressed patterns to the original ones is

given by the output layer.

2.3.3 Optimization

The class of optimization tasks that can be solved by Neural Networks are of the
type that can be solved by relaxation of annealing and can be characterized by the

following properties:

e A neighborhood system must be deﬁned on the variables of the optimization

task.

e The objective function to be maximized has to be additive in terms of cliques,
groups of variables within which each variable is a neighbor of every other

variable.

As in the case of functional approximation, optimization consists of a number of
subclasses which can also be viewed as task classes. One of the subclasses that is

frequently tackled by Neural Networks approaches is associative memory.

17
2.4 Examples of Models of Feedforward Neural

Networks

Given that the proposed research in this document will be restricted to FFANN in a
supervised learning framework, only the models that belong to the same setting will

be reviewed in this section. The discussion of additional models can be found in [1].

2.4.1 Perceptron and adaline

The perceptron model proposed by Rosenblatt [10] is one of the ﬁrst models ever pro-
posed and is still important to current research. The classiﬁcation of visual patterns
was one of its primary goals. Widrow [17, 18] also has pursed a similar approach.
The perceptron is the simplest version of a (FFANN). It consists of a single layer for
input units and a single output unit. The one layer inputs are weighed by vector of
connection strengths then fed into a step function. The activation function of the

output unit is described as follows:

2 = 6(2 war,- — a) (2.4)

1 for a: > 0
6(ar) = (2.5)

0 otherwise.

with 9:.- being the ith element of input patterns, 2 the output, to, the connection weight
of the ith input, and a the output unit’s threshold. The output unit activation can
assume two values, zero and 1. Each value corresponds to one of two pattern classes
that have to be separated.

The learning rule of the perceptron is supervised . It consists of a very simple
strategy of changing the weights only if the pattern x is misclassiﬁed, that is, if the
activation of the output unit for this pattern is not equal to the correct class of the

pattern. The amount of change is

18

x,- if class =1
Aw,- = (2.6)

—a:.- if class = 0.

This simple learning rule has an important property. If both classes are linearly

separable, the weights will converge to the values that materialize the separation.

2.4.2 Multilayer perceptron

The fact that single-layer perceptrons can only separate linearly separable classes was
pointed out by Minsky and Papert [110] at the end of the 1960’s, and this led to a
substantial decrease in interest in Neural Networks research.

Rumelhart along with Hinton and Williams [13] formulated the backpropagation

learning rule. The backpropagation model is based on two principles:

e 1. To overcome the limitations of the single-layer perceptron, all that is neces-
sary is to insert one or more additional layers between input and output. These
layers consist of processing units with nonlinear activation functions, typically
sigmoid functions (3). Arbitrary convex classes can be separated by a network
with one such hidden layer (a two-layer network), and arbitrary non convex

classes by a network with two hidden layers (a three-layer network);

e 2. The delta rule (8) can be used to learn output layer weights. Remaining
weights can be learned by recursive application of the chain rule for computing
derivatives. For a two-layer network with cc.- representing input unit activations,
yj hidden-unit activations, 2;, output unit activations, ij output layer weights,

and vi.- hidden-layer weights, the rule is the following:

—0E :23 2.£?_2I_E:_3yi__‘ L ”1““
avji k Jazk8y16 ‘vﬁx, 8ng

= Z]. 2,-(21. - down-yin — yj):z:.- (2.7)

19

The related recursive formulas are given in [13]
A large number of applications are based on the model constituted by the simple

principles described above.

2.5 The Optimal Filtering Technique

The computation of the weights of a FFANN in order to achieve an input/ output
mapping is a problem that could be classiﬁed as a parameter estimation problem or
a high dimensional nonlinear system identiﬁcation problem. Nonlinear optimization
techniques are well suited to solve the problem. The only disadvantage of these
techniques is their computational cost especially if the problem is attacked globally
instead of being divided into a set of manageable subproblems. The problem of
estimating the weights in a (FFANN) is nonlinear due to the sigmoidal activation
function, but it is smooth. Therefore methods of linear estimation theory could be
applied to such a nonlinear problem by linear approximation of the effects of small
perturbations in the state of the nonlinear system from a “nominal” value. Since the
state variables are not known beforehand the nominal trajectory has to be deﬁned
“on the ﬂy”[19] as the current best estimate of the actual trajectory. The approach
used here is called extended kalman ﬁltering. The advantage in this approach is
that the perturbations include only the state estimation errors which are usually
smaller than perturbations from any predeﬁned nominal trajectory and therefore

better conditioned for linear approximation.

2.6 Thesis Problem Description

During the investigation of the work done so far in using the kalman ﬁlter algorithm

in the supervised training of the (FFANN), the major problem that attracted our

20

attention is that several trials and simulations had to be done before the performance
was satisfactory. The need for several trials, as we understand from the work previ-
ously reported, is that for certain initial conditions the algorithm was trapped in local
minima that were not acceptable in terms of performance. The only way to escape
the problem that has been adopted so far is to run the algorithm with different initial
conditions. No major work has been done to make the algorithm escape the local
minima once it is trapped. Even though the algorithm has the advantage of being
stable under certain conditions, it does not always converge to a satisfactory solution.

Our main goal in this thesis is to develop solutions for the problem outlined above.

2.7 The Objectives and the Outline of the Thesis

Having highlighted the problems that are of interest to us in the previous section, we
now, address these problems by stating the goals that we plan to achieve. A research
plan is then developed that includes the tasks to be accomplished.

As mentioned earlier in this document, guesswork due to tunable parameters,
etc. can hurt the efficiency of training a Neural Network to perform a desired task.
This is one of the factors that has motivated researchers to focus their efforts on
seeking algorithms that are more suitable, easily comprehended and also theoretically
supported. It is believed that better understanding of an algorithm and the main
theory behind it, is the basis for efficient training of (FFANN). The goal of two
planned research tasks in this work is to improve the efficiency of a. newly introduced

feedforward training algorithm without a degradation of its performance.

21

2.7.1 Outline of the thesis

Since the research goals have already been determined, the research plan will be
organized in a manner as to achieve these goals and arrive at a stage of subsequent
developmental research. The tasks to be accomplished in the proposed work include

the following main ones:

Task No 1: Characterize the essential features of the learning algorithms based on

Kalman ﬁlters.

Objective: Through an investigation of the theory behind the algorithm and com-
puter simulations to observe the behavior, point out the problems that this
algorithm faces in terms of how the convergence is affected with different initial
conditions. It is also beneﬁcial to be aware of the assumptions made, if any,

while designing this algorithm or any post-improvement of it.

Signiﬁcance: The simulation of algorithms that are theoretically comprehended
and convergent will be meaningful in two ways. On one hand, it will help identify
any problems with the algorithm that were or were not theoretically anticipated.
On the other hand, it will provide insight or how to modify the algorithm to

work better in terms of computation and convergence time efﬁciency.

Approach: We start with a careful review of the work that was done on the sub-
ject until present. This is accomplished by gathering most of the literature
published. The theoretical understanding of the work that was previously pub-
lished is the only way to discover any discrepancies or neglected issues in the
problem. The next step is to check the accuracy of the results by a careful sim-
ulation. Also, a veriﬁcation of any claims that are not theoretically supported

will be conducted during the computer simulation. This way we can discover

22

all the weaknesses of the method or the algorithm, and by further investigation,

a modiﬁcation will be developed for future improvements.

Task No 2: Develop a modiﬁcation of the learning algorithm aimed at its improve-

ment.

Objective: To overcome the identiﬁed problems of the previously proposed algo-
rithm(see appendix); the major ones being convergence to none satisfactory

solutions and the high load of computation.

Signiﬁcance: Once the modiﬁcation is successful, this algorithm will be very com-
petitive with the backpropagation algorithm, especially in terms of the compu-

tational efﬁciency.

Approach: Having identiﬁed the problems with the previous approach to the prob-

lem at hand, a modiﬁcation is suggested based on the following two criteria:

e The modiﬁcation should avoid any approximations or assumptions that

were not theoretically justiﬁed.

e The modiﬁcation is aimed toward enabling the algorithm to converge to
satisfactory solutions regardless of the choice of initial conditions and in

an adequate amount of time.

After the development of the modiﬁcation that is based on theoretical under-
standing and aims toward an improvement of the algorithm , simulations on
different examples are conducted. The results are then compared with the pre-
viously reported ones. The simulation results are expected to support what was

anticipated from the modiﬁcation theoretically.

CHAPTER 3

The Extended Kalman Filter
Algorithm

3.1 Motivation

The classical (discrete-time) backpropagation algorithm, although widely used, has
numerous disadvantages. One of them is an extreme sensitivity to the initial choice
of the set of weights: There are many choices of weights for which the algorithm
will not converge and ﬁnally when it does converge for some set of weights, the error
diminishes only after a great number of iterations. There is also another problem
with the backpropagation algorithm exhibited as an inconsistency in training. This
means, the mean squared error could remain the unchanged for many iterations then
suddenly decrease to a lower value.

The Extended Kalman Filter algorithm, presented here, does not suffer from such
problems, it is not extremely sensitive to the initial choice of weights and it converges

comparatively faster.

23

24
3.2 The Learning Dynamics

The task of ﬁnding weights, which would achieve a speciﬁc input / output mapping, is
viewed as a parameter estimation problem. The network we use is a fully connected
Feedforward Artiﬁcial Neural Network or (FFANN) which is multi-dimensional and
nonlinear. The parameters to be estimated are the weights [20, 21, 8, 22, 23, 24].
An arbitrary FFAN N which depends on the application is considered. It consists of
a large T dimensional weight vector 0 made up of all the synaptic connections with
sigmoidal nonlinearities. The objective in training the (FFANN) is to determine the
weight values producing the L dimensional output vectors g(n) which are the closest
possible to the desired output values d(n). The input sequence i(n) is made up of N
dimensional vectors.

Here, we consider searching for the weights on-line with patterns presented in
sequence at the input of the network and with updates made recursively. The synaptic
weight vector 0 is viewed as the state of a static nonlinear dynamical system described

by the equations

an =aj—n =a

d(i) = h(90,i(j))+e(j),

where j is the time index, h(t9o, i(j) is the time varying function describing the net-
work and e( j ) is the multidimensinal sequence of modeling errors[21]. To obtain the
dynamic estimates d(n) of do, a traditional approach is adopted based on a deter-
ministic formulation, in which (1( j), d( J)) is a sequence of non-random input / output

pairs and the cost to be minimized at time n is

ewon=$§nan—hvommnntvv. e1)

i=1

25

Here e( j ) is the deterministic sequence of the modelling errors, and An'j (0 < A < 1)
is a forgetting function which exponentially discounts the examples presented in the
past. This is the so—called weighted recursive least squares approach(WRLS). In the

case of h being linear, the solution could be obtained using the classical RLS algorithm

[8]-

3.2.1 The global extended Kalman ﬁlter algorithm

The extended Kalman ﬁlter(EKF) equations are derived by expanding the nonlinear
function h(0,i(n)) around the current estimate parameter vector d(n - 1)(estimated

from all the data up to time (n - 1)). Namely, the state model is rewritten as

d(n) = h(é(n—1),i(n))+HT(n)(vo—o‘(n—1))+p(n)+e(n), (3.2)

where H(n) is the T matrix given by

8h(0, i(n))

H(n) = —5,—— 1.3-(M, (3.3)
and p(n) is the residual in the Taylor expansion of h.
The state model becomes
H(n) = 0(n — 1) 2 do
d(n) = HT(n)0o +£(n) + e(n), (3-4)

with {(n) = h(0(n —1)), i(n)) - HT(n)é(n — 1) + p(n).

26

Both, in the deterministic and in the stochastic formulations the estimate d(n) is
obtained as the optimal regression of do in (3.2). We concentrate on the deterministic

formulation, for which the estimates are obtained from the minimization of

4n) = >_: ll e(j) ”2 W. (3.5)

The solution is derived from the normal equation

V6060?) = 2Z?=1H(j)(d(j) — HT(j)00 — {OD/V” = 0. (3-6)

which gives

i(n) = o-1(n)r(n), (3.7)

with
<I>(n) = Z) H(j)HT(j)A""' , (3-8)
r<n1= : v-‘J'Hoxdm — 5(1)) (39)

The key assumption here is that that the residual sequence {(72) does not depend on
do.

The EKF recursions for the network are

G(n) = A’1P(n —1)H(n) [1+ A—IHT(n)P(n —1)H(n)]'1, (3.10)
i(n) = to. — 1) + G(n)(d(n) — h(é(n —1),i(n)), (3.11)
P(n) = A'1P(n — 1) —- A—1G(n)HT(n)P(n — 1). (3.12)

G is a T matrix (the so called Kalman gain) and P(n) = <I>‘1(n) is a T matrix.

To initialize the algorithm, the covariance matrix is chosen as P(O) = 6'11, with 6

27

greater than zero, a small arbitrary value and weights 8(0) to some nonzero random

values.

3.2.2 Computation of the gradient matrix

In this section we take a closer look to the gradient matrix H T(n) Given the archi-
tecture of the (FFAN N), the elements of the weight vector 0 are arranged so that the
rows of H (n) contain the derivatives of the global outputs with respect to all the T
synaptic weights for 0 = d(n — 1). The neurons are numbered from 1 to 7),, so that 9

= (w1, w2, ..., wm). The equation describing the desired output can then be rewritten

 

 

 

 

 

 

 

 

(d1) ((58%? (53%)T (ﬁll—Fl (ml) (61)

3 T 32 T 3 T 102 62
d2 <n>= 15%) 155’?) . Iii) >< . + . (n)
(01.) MW (av <§a>T1 (...,) (.L)

Where w; is the vector containing the synaptic weights of neuron i, and e(n) =
C(n)+e(n). The computation of the derivatives in the matrix H(n) is achieved via a
back propagation of the output values g(n) through the network.

In order to illustrate how these derivatives are obtained, we ﬁrst need to describe
the structure of the (FFAN N) used. The description included in this section is iden-
tical to the one given in [25] since their description of the structure was clear and
complete.

Let M be the total number of layers in the (F FAN N) with the input and output
layers included. The i — th neuron in the s — th layer is denoted by neuron (s,i),
and 1), denotes the total number of neurons in the s — th layer. 2:, is the input of the

neuron (1,i), x: is the output of the neuron (s,i), wﬁk is the linkweight coefﬁcient

28

from the neuron (s, k) to neuron (3 +1, i), and 213;? is the threshold of the neuron (s, i).
Usually, the thresholds are treated as weights that connect an input unit, which 03

always on, i.e. its value is always one, to the neuron, then

1 1 u}: if 2 < s < M
97:44.1 = 1, 2333M and wild“ =

0 ifs = M
Further, let
T
3 ,3 8-1 s—1
21 $1 wi,1 wl )
23: 11": awf-l— aws—l:
T
8 8 3‘1 8-1
Zn. $773+1 wink—1+1 wn. )
where wf’l is a (773.1 + 1) x 1 weight vector of the neur0n(s,i), u)”-1 is a 17, x

(1),-1 + 1) weight matrix of the s — th layer, and x" is a n, x 1 output vector of the
s — th layer. Then, the operation equation of the network can be expressed in the

following vector form

a: ifs=1 2’ ifszlorM
2‘: anda"=

w"_1:r"1 if 2 $33M h(w3‘1x“1) if 2533M -l

where the function h(.) may be chosen as the hyperbolic tangent function h(x) =

tanh(.r) and

h((wi“)Tx“‘)

h(ws—lxs—l) =

The weight training algorithm requires the computation of the partial deriva-
tives of the output mM of the (FFANN) with respect to the weight vectors. These
derivatives may be derived from the operation equation stated above. A computation
procedure for partial derivatives from layer (M — 1) to layer (M — 2),- - -, until the

layer 2 of the (FFANN) is considered. These derivatives are given as follows

29

 

 

 

 

 

( o i
0
M
“a?” = (xM-IT
awﬁ "1
0
1 0 l
where £337“: are (nM_1+1)ng matrices. For 2 S i 3 M—1, [3 = 1,2, - - - JIM-:41;
B
one can obtain
an!" _ BxM axM‘l 0xM"+2 BxM‘H’l __ ‘4 BrM’j 62:114-,“
awgl—i _ 81,114-] axM-2 --- axM—H-l 61034-1 _ ( OOTM-j-l) 81024—3.
J:

Also for 2 S i S M — 1, ﬂ = 1, 2, - - - ,17M_,-+1, the following relations are obtained

from the operation equation

axM—j hl((w]W-J'-1)TxM—j—1)w[W—j-l
WIT—1 2 if #0

wag-3:1rad-1110,9123

axM—i-i-l

 

M—i _
aUJﬁ

where h’(:r) = 1/cosh2(:r).

 

30

 

31

3.2.3 The local appraches and the (NDEKF)

Due to the high computational load of the EKF algorithm, researches thought to
simplify the global approach by partitioning the global problem into a set of man-
ageable subproblems. In the case of (FFANN), the partition of the problem may
be down to a group of neurons, to the layer level, to the level of the single neuron,
or, ultimately down to the level of a single synaptic weight. In this work, we are
interested in partitioning the problem down to the neuron level and solve the global
problem(minimizing the cost) by independently updating the weights each neuron at
each step.

The effect of the neuron i in the global output of the network is locally described

by the gradient matrix H,T(n) [26]which is the i — th column of the HT(n).

((3% (air)
H-T(n)= (321T = (43?
((23:11? ($3.1? )

 

 

Since the gradient matrix is now decomposed into a set of smaller gradient matrices,
we can also think abot decomposing the weight vector into few subvectors each one
associated with the corresponding gradient matrix.

The decomposed parameter estimation problem is described by the following equa-
tion:

g(k) = f(waxlkll : f(wi, " 1wylna' ' ' iwiw-li' ° ' wM-l $06))

7 "M 3

where y(k) is a 17M x 1 vector of output of the network at time k, 103, 1 S )6 S 1104.1,
1 S a S M — 1 are 170, x 1 weight vectors to be estimated, :r(k) is a 171 x 1 vector of
the input of the network at time k, f () is a 17M X 1 smooth vector function.

We now introduce the new parameter vectors 0;, 1 S i S n, n = X1127” that

32

represent the weight vectors wg, let 0,- : 203, then the scripts i and 01, 3 are related by
015:1, ﬂ(i) = i, ifi S 712 and a(i)=1+max{j:(i— {_2 m) > 0}, ﬂ(i) = i—Ef’le m,
if i > 1);. Consequently the 0.- are (71a(i)+1) x 1) vectors, and the nonlinear euation is
rewritten as y(k) = f(01,- - - ,0mr(k)). Given that p(k) is the input pattern, 314(k)
is a 11M x 1 vector of desired output pattern of the (FFANN), and 0,- at time k. The
weight training oof the (FFAN N ) is aimed toward estimating the weight vectors 0,-
such that the output y(k) of the FFANN tracks the desired output yd(k) with an
error that the algorithm can make converge to zero as k —+ inf.

In order to decouple the global problem acuretly, we can only use the (M544 x
Mg“) diagonal blocks of the error covariance matrix. Neglecting the error covariance
matrices p;,-(k) of the estimations 9,-(k) in the recursive estimation procedure is based
on the assumption that the error covariance matrix P(n) normally contains most of
its energy around its diagonal blocks. [27]. It also results in obtaining an approximate
neuron-decoupled extended kalman ﬁlter equations or N DEKF formulations from the

standard extended kalman ﬁlter (EKF) formulations [25] as follows:

 

A(n) = [I + A“ ,2: ij,(n —1)Hf(n)]"‘, (3.13)
0,-(n) = A‘lp,(n — 1)H,-T(n)A(n), (3.14)

Mn) = A“(1 - Ge(n))H.'(n)p.-(n -1). (3-15)
0,-(n) = 0,-(n — 1) + G,(n)e(n), (3.16)

H.(n) = 6f(91(n — 2%; 3'41”) _ 1)’ 3:02)). (3.17)
e(n) = yd(n) — f(61(n — 1),...,19,,(n —1),:v(n)), (3.18)

where A(k) is a 17M x 17M matrix, G;(k) are (770(i)+1) x anatrices of the ﬁltering
gain, and p,(k) = p,T( k) are (”01(2)“) x ("e(i)+1) matrices of the error covariance matrix

of the estimation 0,-(k). The procedure of recursive training described in the above

33

equations can be explained as follows.The N DEKF uses the old and the new outputs
to recursively train the weights. In every iteration, the NDEKF predicts the network
output y(k) based on the previous estimated weight vectors and the new input. It
then compares the difference between the predicted output y(k) and the new desired
output yd(k), the prediction error e(k) can be obtained. Based on this prediction error
and the information of the entire history of the input, desired output pair, which is
stored in the covariance matrices p;(k -— 1), a set of modiﬁcation coefﬁcients, gain
matrices G,(k) corresponding to the weight vectors 6,- of the neuron(a(i) + l,ﬂ(i)),
can be calculated. The new trained set of the weight vectors 0,-(k) is then determined
by the sum of the last trained weight vectors 0,-(k — 1) and the innovation term which
is a multiplication of the gain matrices G;(k) and the prediction error e(k). At last,
the information of the new input and desired output is stored in the error covariance

matrices p;(k) to be used in the next iteration.

34
3.3 Simulation Examples

Simulations, were conducted to compare the performance of the NDEKF to the
proposed modiﬁcations. In this chapter, we report the performance of the original
NDEKF on two continuous 2-dimensional patterns. The XOR problem is described
as a 2-region classiﬁcation problem and is performed on a one of the 2-dimensional
patterns. A 4-region classiﬁcation problem is performed on the second 2-dimensional
pattern. The network architecture consists of two inputs followed by two hidden lay-
ers. The size of the output layer we use is 2 or 4 depending on the problem. The
size of the hidden layers is 10. The network size is thus (2-10-10-2) for the 2-region
classiﬁcation problem and (2-10-10-4) for the 4-region classiﬁcation problem. This
network size has a weight vector, including the bias weights, of 162(respectively 184)
elements including the bias weights.

The input pattern was the uniformly distributed random points in the region
[0, 1]”. The function to be approximated or the desired output values for the network
are set to 1 and 0 inside and outside the regions depicted in the ﬁgures below. If
a point falls into a region of a certain color, shade or class, it is assigned a target
output vector for that class. The target output vectors are the two columns of the
2 x 2 matrix for the 2—region classiﬁcation problem and the four columns of the
4 x 4 identity matrix for the four region classiﬁcation problem. The inputs were not
presented in sweeps but obtained continuously by a random number generator. The
initial weights were randomly selected and also uniformly distributed between -0.5
and 0.5. The P matrices were initialized to 1000 times the identity matrix. The

adaptation of the weights is done as follows:
1. Pick a random point and propagate forward to the output layer.

2. backpropagate the error values (error is computed as the difference between

the target output value and the actual computed output) through the whole

1

35

network.
3. Simultaneously update all the network weights and matrices.

The mean squared error is averaged every 250 iterations (We call this average: an
interval average error).

Simulations for the original (N DEKF ) were performed on 5 different initial condi-
tions. The training is completed when a stoppage criterion is achieved. The stoppage
criterion is described by a threshold of the interval average error or a maximum num-
ber of iterations set by the user. In our simulations we chose a threshold error(0.05,
0.3) respectively. The maximum number of iterations was set to 50,000 for the 4-
region classiﬁcation problem and 25000 for the 2—region classiﬁcation problem. The
performance results are described in the following table and ﬁgures. These include re-
sults for both the XOR 2-region classiﬁcation problem and the 4-region classiﬁcation
problem.

We deﬁne the Total Average Error (TAE) as

C(71) = (E II e(j) ||2)/n, (3-19)
1:1
and the Interval Average Error (IAE) as

k+L

614"): (2 ll e(j) ll'zl/La (3-20)

1:}:

where k = j/L when j is a multiple of L.

0.9

0.8

0.

N

0.6

0.54

0.4

0.3

A

0.2

0.1

 

36

 

 

’fxxxl‘x

 

' 0.7 0.8

Figure 3.1. The 2-region classiﬁcation pattern

37

 

 

 

 

 

 

Figure 3.2. The 4-region classiﬁcation pattern

38

Table 3.1. The 2-region classiﬁcation problem

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

lambda choices ICl 1C2 IC3 1C4 IC5 IC6
A = 1 TAE 0.1095 0.1255 0.1102 0.1326 0.1136 0.1348
IAE 0.0823 0.1069 0.1022 0.1061 0.1006 0.1176

class % 97.2 96.2 97.0 96.2 97.0 95.8

#iterations 25000 25000 25000 25000 25000 25000
A = 0.99 TAE 0.7845 1605 0.6530 0.9377 0.7760 0.7759
IAE 0.9467 17.6 0.5663 1.0554 0.5952 0.9878

class % 37.3 25.4 80.5 20.2 54 36.9

#iterations 25000 25000 25000 25000 25000 25000

A = 0.98 TAE NAN NAN NAN NAN NAN NAN
IAE NAN NAN NAN NAN NAN NAN

class % 0 0 0 0 0 0

#iterations 5000 5500 3250 3000 3750 2500
A = 1.01 TAE 0.2384 0.2597 0.3620 0.3551 0.2706 0.3480
IAE 0.2452 0.2721 0.3595 0.3614 0.2812 0.3458

class % 85.7 86.1 77.4 80.0 85.1 79.9

#iterations 25000 25000 25000 25000 25000 25000
A = 1.02 TAE 0.3801 0.5089 0.4659 0.4324 0.5248 0.3724
IAE 0.3824 0.5565 0.4950 0.4150 0.5255 0.3950

class % 76.6 69.2 74.1 73.8 65.8 78.1

#iterations 25000 25000 25000 25000 25000 25000

 

 

 

 

 

 

 

 

 

39

Table 3.2. The 4-region classiﬁcation problem

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

lambda choices ICl IC2 IC3 1C4 IC5 IC6
A = l TAE 0.5757 0.5885 0.5718 0.5357 0.5596 0.5624
IAE 0.5736 0.5416 0.5295 0.4916 0.5145 0.5210
class ‘70 76.9 75.1 77.7 77.1 78.2 78.4
#iterations 50000 50000 50000 50000 50000 50000
A = 0.99 TAE 0.7604 4.7649E6 0.6638E3 NAN NAN NAN
IAE 0.7936 0.0179E6 1.5E3 NAN NAN NAN
class ‘70 44.7 22.5 23.3 0 0 0
#iterations 50000 50000 50000 46750
A = 0.98 TAE NAN NAN NAN NAN NAN NAN
IAE NAN NAN NAN NAN NAN NAN
class 70 0 0 0 0 0 0
#iterations 36260 36500 22750 11000 17500 24000
A = 1.01 TAE 0.7167 0.7223 0.7002 0.6387 0.7160 0.6636
IAE 0.7199 0.7030 0.6786 0.6165 0.6958 0.6416
class % 51.4 51.12 52.5 66.8 51.0 54.0
#iterations 50000 50000 50000 50000 50000 50000
A = 1.02 TAE 0.7375 0.7206 0.7331 0.7046 0.7493 0.6933
IAE 0.7392 0.6993 0.7331 0.7046 0.7337 0.6687
class % 49.5 54.2 51.3 54.5 50.4 54.8
#iterations 50000 50000 50000 50000 50000 50000

 

 

 

 

 

 

 

 

 

 

40

2—mglon orlgln-l problem IC2

 

J
”500”
I

I

 

 

 

 

a O O —4
.—

0.0 ’- ..

0.4 r- _,

O 2[ "

o A M A L
O 0.5 1 1 .5 2 2.5 3
numbor ol lloratlon. x .‘ 04

Figure 3.3. The TAE with A“1 = 1

2—roolon problem wllh Invomo lambda —‘l .01 IC2

 

 

 

 

 

2600 T
2000- \I\Jl\ ..
51500»- .4
E
gnocc- -t
aoo~ q
00 05 1 5 25 a

1 .6
numbor of "Or-(lon-

Figure 3.4. The TAE with A"1 = 1.01

1.8
1.0

1.4

.81....

g 0.8

.—
0.6

0.4

0.2

0.9
0.8

0.7

‘5 0.0

0.5
0.4
E ....
0.2

0.1

O 1 O 20 30 4O 60 60 7O 80 90 1 OO

12

1O

mmau
0

41

z—mglon problam wllh lnvaraa lambda —O.90 IC2

V v

 

r
l

 

 

A A A A A

 

 

1 .6
numbar o! ltarallona

Figure 3.5. The TAE with A'1 = 0.99

a—mglon orlglnal problam IC2

 

V V V

T v v

 

 

 

A A A A A A A A A

 

numbar o! ltarallona

Figure 3.6. The IAE with A’1 = 1

x ‘04 2—roglon problam wllh lnvaraa lambda --‘l .01 IC2

r Y

 

v v v V v V

 

 

 

 

 

4 - -1
2 '- .4
o A A A A A A A A A
O 1 O 20 30 4O 50 60 70 BO 90 1 00

numbar o! ltarallona

Figure 3.7. The IAE with A‘1 = 1.01

42

2—raglon problam wllh lnvama lambda —O.99 ICZ

 

V v

 

 

 

o A 4A A A A A A A A
O 1 0 20 30 40 50 60 70 80 90 1 00
numbar ol‘ Ital-anon-

 

Figure 3.8. The IAE with A’1 = 0.99

Z—nglon orlglnal problarn ICQ

 

 

 

 

 

 

 

 

 

 

 

o A L A A A
O 0.5 1 1 .6 2 2.5 3
numbar of ltaratlona x 1 04
e ‘ —1
Figure 3.9. The TAE With A = 1
2—roglon problam wllh lnvarao lambda-— 1 .01 IC6
1 .4 v v v
1 .2 -
1 -l
E... -
a 0.6 ~ A
F-
0.4 '- -1
0.2 r- _.
o A A A A
0 0.6 1 1 .6 2 2.5 :e
numbar of liar-Nona x 1 04

Figure 3.10. The TAE with A"1 = 1.01

43

 

2—reglon problem wllh lnverae lambda— 0.99 too

1.4 v r v
1.2 ‘t
1 e-

 

 

 

 

0.6 l- .4
0.4 '- —1
0.2 ’- -1

0 . A A .
O 0.6 1 1 .5 2 2.6 3
number 0! llerallona x 10‘

Figure 3.11. The TAE with A"1 = 0.99

2—reglon orlglnal problem IC6

‘ v v v

 

 

 

 

 

0 1 O 20 30 40 60 00 ‘70 80 90 1 00
number of lterallona

Figure 3.12. The IAE with A‘1 = 1

Zureglon problem wllh lnverae Iambda— 1 .01 Ice

 

 

 

 

 

0 1 0 20 30 40 60 60 7O 80 00 1 00
number or lterallona

Figure 3.13. The IAE with A‘1 = 1.01

44

2~reglon problem with lnverae lambda— 0.90 Ice
1 7 ‘7 V Y I

 

 

 

 

 

0.2 — -1
O. 1 *- _,
o A A A A A A A A A
O 1 O 20 30 4O 50 80 7O 80 90 1 00

number at Me rallona

Figure 3.14. The IAE with A‘1 = 0.99

4—reglon orlglnal problem I02
77 V V

 

 

 

 

 

0.2 *- .7

O A I .
0 1 2 a 4 6 6
number 01 lterallona x ‘0.

Figure 3.15. The TAE with A"1 = 1

x .' O. 4—reglon problem wllh lnveree lambda—1 .01 IC2

 

 

 

 

 

6 r v v
5 - .\ ..4
4 v- _
5
E“ ‘
2 >- —<
°o ; é a 3 8 a
number 0! lteratlona x 1 04

Figure 3.16. The TAE with A‘1 = 1.01

1.2

0.0

E”

0.4

0.9

0.06

‘g 0.0

0.75

E 0.7
0.05
0.6

0.56

0.6

45

4—reglon problem wllh lnveree lambda—0.00 Ica

 

 

V V Y

A

 

 

Figure 3.17. The TAE with A‘1

:3
nur'nbe r 0' lie rallona

4—reglon orlglnal problem IC2

= 0.99

 

l

 

v T Y

 

20

40 00 80 1 00 1 20 1 40
number or lteratlona

Figure 3.18. The IAE with A“1 = 1

100

180

 

woman
5 0|

0

46

4—reglon problem wllh Inver'ee lambda—1 .01 IC2

 

 

Y I

A A A A

 

 

 

2O

2O 1 4O 1 60

4O 60 BO 1 OO 1
number of Iteration.

Figure 3.19. The IAE with A“1 = 1.01

100

200

 

0.0 '

 

 

4—reglon problem wlth Inveree lambda—0.99 lC'Z

L A

 

 

20

4O 60 80 1 00 1 20 1 40
number 0! llerallona

Figure 3.20. The IAE with A’1 = 0.99

180

200

47

4—reglon origin-l problem ICO
V *7 f

 

d
I
1

9
O
I
1

Taiwan
9
0

 

 

 

 

 

 

 

 

 

 

 

 

 

0.4 - -4
0-2 - .4
o .
O 1 2 3 4 5 6
number 0' "er-None x 1 0‘
Figure 3.21. The TAE with A‘1 = 1
1 4 4—reglon problem with lnveree lernbde—‘l .01 ICO
1 .2 v- ..
‘ 1
go... _ .i
g 0.6 r- -‘
o 4 +— J
0.2 ’- .4
O .
O 0.5 1 1 .6 2 2 5
number of “er-None x 1 0‘
Figure 3.22. The TAE with z\‘1 = 1.01
‘ 4 4—reglon problem with lnveree lambda—0.99 Ice
1 .2 '- -4
1 4
5.0.0 -‘
g 0.6 r— -*
0.4 r- -*
0.2 >- _
O A . .
0 1 2 3 4 5 a
number 0' llerellone x 1 04

Figure 3.23. The TAE with /\'l = 0.99

0.8

0.7

s 0.6
0.5
0.4

5
0.2

0.1

48

4—reglon orlglnel probler‘n IC6
V V V

 

 

 

v 7’

ﬁr v

A A A A A A

 

 

2O 4O 60 80 ‘l 00 1 2O 1 4O 1 60 1 BO 200
number 0! llerellone

Figure 3.24. The IAE with A'1 = 1

4—reglon problem wltn lnveree lembde—‘l .01 lCO

 

 

V V V Y Y

 

A A A A A A A A

 

4o 60 60
number of lte rellon.

Figure 3.25. The IAE with A“ = 1.01

 

 

4—reglon problem wllh lnveree lambda—0.99 ICO
Y I Y V V

V ‘7

A A A A A

 

 

BO 1 00 1 20 1 4O 1 60 1 60 200
number 0' lleretlon-

Figure 3.26. The IAE with A‘1 = 0.99

49
3.4 Discussion and Analysis of Results

The results of the simulations summarized in the above tables and ﬁgures suggest that
the convergence and the stability of the Kalman ﬁlter algorithm are not very sensitive
to the choice of the initial conditions. However, for different exponential weighting
factors A ’s, the algorithm exhibits different behaviors. When A = 1, the algorithm
shows a stable behavior but does not converge to an acceptable local minimum. In fact
the IAE as deﬁned in the previous section, stalls for a large number of iterations. If we
then choose A slightly greater than 1, which is not in accordance with the forgetting
factor deﬁnition, the algorithm is still stable but the same stalling phenomenon occurs.
Note that A grater than 1 guarantees the stability of the discrete recursion equations
that include the update of the covariance matrix described in equation (3-12). A
slightly less than 1 however is a value that ﬁts the deﬁnition of the forgetting factor
and it has previously been used in other applications of the Kalman ﬁlter approach
where it has performed satisfactorily. Though, in our application of the Kalman
ﬁlter algorithm to supervised learning, the behavior of the algorithm under A slightly
less than 1 shows a divergence phenomenon that will be given the name “explosive
divergence” in the next chapter. This behavior can be expected if we carefully observe
the recursion equations of the algorithm. Equations (3-10) through (3-12) show that
when A < 1, A’1 > 1 is expected to drive both the covariance matrix elements and the
weight parameters to higher and higher values as the number of iterations increases.

The reason that this divergence problem occurs in this application (Supervised
learning) of the algorithm and not in so many others may be due to the characteristics
of the nonlinear model. The sigmoidal nonlinearity present in the (FFANN) model
is the only nonlinearity present in the network. And its shape and the shape of its
derivative could be at the origin of the poor performance of the algorithm. However,

as described earlier in the section, divergence problems could be easily predicted from

50

the recursion formulas described in equations (3-10) through (3-12).

in the above conclusions the following conjecture is formulated :

3.4.1 Conjecture 1

Let 6(d(n)) be the Least Square Error deﬁned as :

«an» = Z N am - h(é(n),i(j)> H2 WI (3.21)

we conjecture that the following statements are true :

o If A is < 1 then the EKF algorithm exhibits an unstable behavior described by

an explosive divergence.

o If A is 2 1 then the EKF algorithm is stable but often converges to unsatisfactory

local minima.

N ow that the simulations performed in the previous section identiﬁed some weak-
nesses of the Extended Kalman ﬁlter approach to supervised learning, we propose to

search for solutions for the following problem :

o How can one make it possible for the algorithm to converge to satisfactory local
minima regardless of the complexity of the problem and its initial conditions.

and,

o How can one improve the computational efﬁciency of the algorithm in cases

where the performance is satisfactory but the convergence time is not adequate.

CHAPTER 4

Convergence to Satisfactory Local

Minima

4.1 Motivation

The literature review done on the subject of kalman ﬁlters as an approach to super-
vised learning suggests that convergence to satisfactory local minima is not guaran-
teed and is very sensitive to the choice of initial conditions. This conclusion is drawn
from the results reported in several papers [26, 25, 21]. The authors on these papers
had to conduct several trials with different initial conditions then average the results.
Seldom did they report all the trials with their individual results. In a practical
situation, however, the averaging of the results does not give a solution to the prob-
lem. The solution is usually chosen after several trials are performed to give the best
performance. The need to try different initial conditions presents an inconvenience
because of the extra time and resources used to arrive to a satisfactory solution. This
drawback of the algorithm has motivated the research in this chapter. The analy-
sis of the problem begins by investigating the reasons behind the divergence of the
algorithm and categorizing them. The investigation is expected to outline the steps

toward a solution of the problem and is done in the second section of this chapter.

51

52

The last two sections in the chapter are designated to the proposed solutions followed

by simulations on two region classiﬁcation problems.

4.2 Divergence Phenomena

The kalman ﬁlter algorithm as with the special case of the recursive least squares(RLS)

algorithm may exhibit two forms of divergence.
o explosive divergence
o lock-up divergence

The ﬁrst form of divergence described as explosive is due to the covariance matrix
P(n) losing its positive deﬁniteness which means that it becomes singular. However
this kind of divergence could also occur due to roundoff errors and it is usually ob-
served when the forgetting factor A is less than 1. When A is equal to 1 and the
algorithm is left updating for a large number of iterations, the roundoff errors can
become large and cause the algorithm to diverge.

The second form of divergence called lock-up divergence usually occurs when A
is equal to 1. This divergence shows up when the algorithm stalls. This stalling
phenomena occurs when the algorithm stops updating which means that the ﬁlter
weights stop changing in value. It was mentioned in [28] that The experience with
this divergence may be temporary or permanent but in our own experience it was
permanent. When the gain matrix which is used to update the weights becomes
very small the ﬁlter weights will stop changing. However to compute the gain matrix
the covariance matrix is used. If we look at the computation of the gain matrix we
conclude that if the covariance matrix P(n) becomes very small the gain matrix G(n)
also becomes small. We can then assert that the stalling phenomena is due to the

covariance matrix becoming small. If the elements of the covariance matrix are near

53

zero the lock-up divergence is permanent.

4.3 Proposed Solutions

After having deﬁned the two forms of divergence that are frequently present in the
Extended Kalman ﬁlter algorithm, we now proceed to summarize the previous at-
tempts from several researchers to solve such problems with divergence. We will then
propose and implement our own solutions to the same problem.

The problems of divergence especially the lock—up divergence form described as
the algorithm getting trapped in local minima that are often undesirable, is not only
a characteristic of the extended Kalman ﬁlter algorithm. In fact the backpropagation
algorithm in its original version, although guaranteed convergence, often converges to
unsatisfactory local minima. Several approaches have been tried to escape these local
minima. One approach was to add a momentum term to the update of the weights.
Another approach which proved to be successful is the use of different energy functions
or cost functions.

In the Kalman ﬁlter algorithm one technique has been used to enable escaping
local minima. This technique entails the addition of an artiﬁcial process noise in the
update equation of the covariance matrix. This technique was mainly used to render
the covariance matrix positive deﬁnite [27]. This was the only technique that we can
ﬁnd in the literature that was claimed to help with the lock-up divergence. The lack
of any extensive research aimed toward a solution to the two forms of divergence
described in the previous section is what motivated our research in this chapter. We
will now begin outlining our own strategy for ﬁnding a solution to this problem.

The Kalman ﬁlter algorithm as applied to supervised learning has been derived
by considering a speciﬁc energy function for which the goal is to ﬁnd a set of optimal

parameters that will minimize it. The energy function is usually deﬁned as a mean

54

square error between a desired output or target and the actual output given by the
network. This (LSE) could also be weighted by a suitable function or coefﬁcient [29].
This weight factor has proven to be beneﬁcial in terms of improving the convergence.
In our work we investigate the influence of the weighting factor A on the convergence
of the algorithm. Even though A was always present in the recursion formulas of the
algorithm in the earlier work [26] it was kept ﬁxed at A equal to 1. Based on the
theoretical analysis of the stability and the convergence problems the algorithm can
exhibit [8], it would be safe to keep A ﬁxed at 1. However this is a very conserva-
tive choice and consequently the algorithm can present a stalling phenomena that is
usually observed around an unsatisfactory local minimum.

We concentrate on the deterministic formulation, for which the estimates are

obtained from the minimization of

e(n) = _§_: n e(j) n: W: (4.1)

We propose an adaptation scheme for A which depends on how fast the error is
decreasing and whether a stalling phenomena is present or not. We keep A smaller or
equal to 1 as necessary. Preliminary results on two classical classiﬁcation problems
are very promising. Varying A has enabled the network to automatically escape from
the local minimum and decrease the error to a better minimum. A comparison of
the algorithm with or without ﬁxing A using the same set of initial conditions will be

given.

4.3.1 A Modiﬁed update procedure

The original (NDEKF) used a forgetting factor that was ﬁxed at 1. This value of A
guarantees stability without always guaranteeing a good performance. To solve this

problem we propose the following modiﬁed updates based on varying A. After the

55

mean squared error is averaged every 250 iterations (We call this average: an interval
average error), this error is then compared to the previous error. If the difference
is smaller than a certain threshold, it is decided that the network is stalling. At
that moment A is changed to a value that is smaller but still very close to 1. If
this new value is kept unchanged for a large number of iterations, the network shows
some instability behavior. This phenomenon is justiﬁable if we examine the update
formulas. In order to guarantee the stability a mechanism that will switch A back to

1 had to be incorporated in the algorithm. Three different ways were considered :

1. One choice is to change A for only a few iterations.

2. A second choice is to change A until the total average error shows some increase

over the previous average error.

3. A third choice is to change A until the instantaneous error increases as compared

to the one computed 1000 iterations before.

4.3.2 The forgetting factor as a function of error

Since the mechanism of switching A back and forth between 1 and less than 1 depends
on how the so called IAE behaves (whether it is decreasing or stalling), it would be
convenient to model A as a function of this IAE. The inconvenience of the other choices
is manifested in one of them at least. The time to switch A back to 1 was done by
trial and error and that could take few trials. However from the few experimental
trials we could conclude that as long as we only switch A for less than 50 iterations
the stability is preserved. A suitable function that would model the behavior of A as a
function of the error has to satisfy at least these two conditions in order to guarantee

stability.

0 A is slightly < 1 when the difference between two consecutive interval average

errors is less than a certain threshold.

56

o A is 1 when the difference deﬁned above exceeds the same threshold chosen

above.

Several functions were picked but the one that was closest in satisfying the conditions

is the following:

tanh(a*6) if 6 > 7

A = (4.2)
0.98 otherwise
where 6 is deﬁned
k+1+L k+L
6 = ( 2 ll em ||2)/L — ()3 II e(j) ||2)/L- (4.3)
j=k+l j=k

7 is a difference of error threshold chosen as 0.0025 in our simulations. It could
be changed accordingly with the problem. and k = i/L and i is the number of
iterations that is a multiple of L. In our simulations L was chosen as the value 250 for
comparison purposes. a is decided accordingly with the desired behavior of A. In our

simulations 0 = 1000 ﬁts best with the desired conditions for the chosen function.

57

 

 

 

 

The forgetting function
1 r I l l ' I I
0.9 ~ ‘
0.8 - d
0.7 - i
0.6 - 1
(U
B
E 0 5 " d
2
o.4~ ‘
0.3 - i
0.2 - l
0.1 - J
O l 1 l l | 1 I I
-5 .4 -3 -2 -1 o 1 2 3 4 5
Interval Average error difference x 10‘3

Figure 4.1. A = tanh(a*6) with a = 1000 and 6 is the IAE difference

58

Table 4.1. The 2-region classiﬁcation problem

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

update choices ICl IC2 IC3 1C4 IC5 IC6
= 1 TAE 0.1095 0.1255 0.1102 0.1326 0.1136 0.1348
IAE 0.0823 0.1069 0.1022 0.1061 0.1006 0.1176

class % 97.2 96.2 97.0 96.2 97.0 95.8
#iterations 25000 25000 25000 25000 25000 25000
choice1 TAE 0.1550 0.1452 0.1275 0.1240 0.1136 0.1447
IAE 0.0409 0.0413 0.0468 0.0493 0.1006 0.0500

class % 96.6 98.2 98 98 97 96.9
#iterations 6500 9750 10250 19500 25000 19000
choice2 TAE 0.2052 0.2419 0.1864 0.2157 0.2047 0.2502
IAE 0.0427 0.1236 0.0329 0.0987 0.0756 0.0839

class % 97.6 95.6 97.5 96.0 96.4 96.2

#iterations 4750 3500 4750 6500 4750 3500
choice3 TAE 0.1165 0.12 0.1128 0.1291 0.1319 0.1399
IAE 0.0374 0.0687 0.0459 0.0780 0.0693 0.0818

class % 97.0 98.0 97.2 97.1 96.4 96.4
#iterations 19750 25000 22000 25000 14500 21000
choice4 TAE 0.1250 0.2667 0.1179 0.1503 0.1998 0.1602
IAE 0.0450 0.1388 0.1084 0.0661 0.0660 0.0573

class % 98.6 95.6 97.5 96.7 96.8 97.2
#iterations 12000 25000 25000 13750 4750 11750

 

 

 

 

 

 

 

 

 

59

Table 4.2. The 4-region classiﬁcation problem

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

update choices 1C1 IC2 IC3 1C4 IC5 IC6
A = 1 TAE 0.5757 0.5885 0.5718 0.5357 0.5596 0.5624
IAE 0.5736 0.5416 0.5295 0.4916 0.5145 0.5210

class ‘70 76.9 75.1 77.7 77.1 78.2 78.4

#iterations 50000 50000 50000 50000 50000 50000
choice1 TAE 0.4355 0.4741 0.4356 0.4537 0.4917 0.4983
IAE 0.2892 0.29 0.2811 0.27 0.2822 0.2851

class % 93.1 93.6 91.7 95.3 95.5 94.8

#iterations 22250 50000 23000 41750 29750 30500

choice2 TAE 0.5229 0.5197 0.3027 0.4869 0.4413 0.4851
IAE 0.2966 0.2973 0.2673 0.2899 0.2893 0.2976

class ‘70 91.9 92.6 93.9 93.1 95.6 92.7

#iterations 16250 20750 26750 25250 26750 24000
choice3 TAE 0.4370 0.5146 0.4632 0.4555 0.4700 0.5137
IAE 0.3483 0.4260 0.3648 0.3566 0.3868 0.4282

class ‘70 93.5 87.1 87.5 90.1 90.0 89.9

#iterations 50000 50000 50000 50000 50000 50000
choice4 TAE 0.5288 0.5132 0.4696 0.4862 0.4676 0.4534
IAE 0.2891 0.3430 0.3859 0.2639 0.2764 0.2794

class % 93.9 92.1 87.3 92.1 93.06 94.0

#iterations 23750 22500 50000 26750 50000 37250

 

 

 

 

 

 

 

 

 

60

2—reglon orlgln-l problem ICZ

 

2 r v

1.5” "
1.6-“ “
1.4" "‘
1.2" -4

Truman

 

 

 

 

 

 

 

 

0-8 4
0.6 i- -‘
0.4 i- -
0.2 r- ._
00 0.6 1 1 .5 2.6 3
number 0' "er-(lone x j 04
° —1
Figure 4.2. The TAE for A = 1
2—reglon problem 60 lC2
2 V7 T Y V V Y
1 .8 *- .1
1 .6 >- -‘
1 .4 ~ -‘
s5 1 .2 - -1
g ‘ d
3 o a - q
.—
0.0 - '4
0.4 .— -1
0.2 i— ‘4
00 1 000 2000 3000 4000 5000 6000 7000 8000 0000 1 0000

 

number of llerellone

Figure 4.3. The TAE for update choicel

Truman

9.0.00

2—reglon problem 24 ICE!

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 4.6. The TAE for update choice4

2 r v
.8 '- -«
.8 r- «-
.4 '- -'
.2 -r
1 4
.8 -!
a .. ..
4 i- -r
2 _
00 500 1 600 1 5.00 2600 25.00 3600 35‘00 4000
number 0' llerellone
Figure 4.4. The TAE for update ch01ce2
2 2—reglon problem 200 I02
.8 *-
.8 '- -4
.4 '- -<
.2 '- -.
1 _
.8 .1
.0 ~ -i
4 .- ..
2 '- ..f
00 O 5 1 1.L5 2 2 5 8
number of lleretlone x 1 0‘
. .
Figure 4.5. The TAE for update ch01ce3
2 2—-reglon problem verlnble lembde ICE?
.8 *- -‘
.8 '- -‘
.4 '- —.
.2 '- -‘
1 .f
.8 '-
.8 )- .1
4 - -f
2 - ff -
number or Iteretlone x ‘04

62

2—reglon orlglnel problem ICE

 

0.9 r

T v V

 

 

 

 

00 1 0 2O 30 4O 50 8O 70 80 90 1 00

number 0! lteretlone

Figure 4.7. The IAE for A‘1 = 1

2—reglon problem 60 lC2

 

 

 

 

 

O 5 1 O 1 6 20 26 30 36 40
number of llerellone

Figure 4.8. The IAE for update choice1

63

2—reglon problem 24 :02

1 ‘1' v

 

0.9 '-
O.8 '-
O.7 '-
O.8 '-

0.6 #-

O.3 '-

0.2 i-

 

O.1 b

 

 

 

number of llerellone

Figure 4.9. The IAE for update choice2

2-—-reglon problem 200 lC2

 

 

 

 

 

O 1 O 20 30 40 so so 70 60 90 1 00
number of lteretlone

Figure 4.10. The IAE for update choice3

 

2—reglon problem verloble lembde ICZ

 

 

 

 

o 20 4O 60 BO 1 OO 1 20
number of leer-None

Figure 4.11. The IAE for update choice4

64

 

2—reglon orlglnel problem IC6
Y V V

 

 

 

 

 

 

J
00 0:5 1 1:6 5 2:6 3
number of Her-(lone x 1 0‘
Figure 4.12. The TAE for x\"1 = 1
2—reglon problem 60 ICC

1 .4 r v v v u r v v v
1 .2 q
1 —

.°.
0

flrﬁ
1

 

 

 

O 0.2 0.4 0.6 0.8 1 1 .2 1 .4 1 .6 1 .8 2
number of Iteration.

Figure 4.13. The TAE for update choice1

65

1.4

 

2—reglon problem 24 ICC

d
J

Truman
.0
O

 

 

 

0.6 i- u

0.4 '- -

0.2 '- ..
00 SCADO 1 600 1 5‘00 260 2g00 3600 36AOO 4000

0
number of llerellone

Figure 4.14. The TAE for update choice2

2—reglon problem ICO
V V

 

 

 

 

 

 

O 1 1 .5
number of lterellone x ‘ 0‘

Figure 4.15. The TAE for update choice3

2-reglon problem verleble lembde lCO

 

 

 

 

 

o A A A A A
O 2000 4000 6000 8000 1 0000 1 2000
number of lterellone

 

Figure 4.16. The TAE for update choice4

66

 

2—reglon onglnel problem Ice

 

 

 

 

‘0 50 BO 70 BO 90 1 (’0
number 0' “er-(lone

Figure 4.17. The IAE for A‘1 = 1

2—reglon problem so Ice

 

 

 

 

4O 60 80 70 8°
numb-r of “Cr-ﬁlo"-

Figure 4.18. The IAE for update choice1

67

2—reglon problem 24 I00

 

 

 

 

 

 

“ r
0.9 '-
O.‘ -
0.7 '-
5 0.. -
30.. _
5 0.4 ~
0.3 '-
O.2 P
0.1 L
00 5 number 0' lleretlone 1‘0 1 5
Figure 4.19. The IAE for update choice2
‘ 2—reglon prObIOVTI 200 Ice
0.0 '- -4
0.3 P u
0.7 - ~
.3 0.. . -
EO.C — -4
i 0.4 '- -1
0.3 — -
0.2 - ..1
0.1 '— .4
00 1A0 2A0 32) - . 6‘0 7‘0 8A0 90

 

 

 

40 60
number 0' lte rellone

Figure 4.20. The IAE for update choice3

 

2—reglon problem verleble lembde Ice
7 V r T V

 

 

 

o A A A A A A A

 

0 5 1 0 1 6 20 25 30 36 4O 45 60
number of lterellone

Figure 4.21. The IAE for update choice4

68

4—reglon orlglnel problem lC2

 

 

 

 

 

 

 

 

 

0.4 '- _4
0.2 '— -1

o A A L A A
0 1 2 3 4 6 a
number 01 llerellone x ‘ on

Flgure 4.22. The TAE for A = 1
4—reglon modllled problem 60 I02
1 .2 ﬁr Y I I

1 .-
0.8 "' cut
30.6 P— -‘
0-4 >- _‘
0.2 " —1

0 . . L . .
0 1 2 a 4 5 6
number 0! llerellone x 1 04

Figure 4.23. The TAE for update choice1

69

4—realon rnodlfled problem 24 lC2

 

1 .2 v V
1 -4
0.8 '- -<

 

 

 

0.4 *- -1
0.2 '— —1
o A A A A
O 0.5 2 2 6

1 1 .6
number 0! llerellone x ‘0‘

Figure 4.24. The TAE for update choice2

4—reglon modllled problem 200 IC2
1 Y Y

 

 

 

 

 

0.4 ”- .4
0.2 — _.

o L A A
O 1 2 :3 4 5 8
number 0! llerellone x ‘0‘

Figure 4.25. The TAE for update choice3

4—reglon modlﬂed problem verleble lembde lC2
1 .2 V ‘7’ V

 

'7

 

 

 

0.4 - _.
0.2 '- .1
o A A A A
0 0.6 1 1 .6 2 2.5
number 0' llerellone x ‘04

Figure 4.26. The TAE for update choice4

7O

4—reglon orlglnel problem IC2

 

 

 

 

0.95 T f ﬁ 7
0.9 ..
0.06 ﬂ
3 0.8 —4
$0.75 —<
0.7 - d
Ea.“ _
0.0 - _
0.65 i—
0.50 20 :0 50 1 :10 1 00 1 00 200

80 1 00 1 20
number ol llerellone

Figure 4.27. The IAE for /\'l = 1

4—reglon modlfled problem 60 ICZ

 

 

 

 

0.4 "' -1
0‘3 I- —i
0.2 A A A A

O 60 1 0° 1 50 20° 250

number of llerellone

Figure 4.28. The IAE for update choice1

71

4—reglon modll'led problem 24 IC2

 

 

 

 

0.4 >- -
0.3 I— —l
0.2 A A A A A A A

O 1 O 20 30 60 70 BO 90

40 50
number of lteretlone

Figure 4.29. The IAE for update choice2

4—reglon modl'led problem 200 I02

 

 

 

 

 

O 20 4O 60 80 1 OO 1 20 1 40 1 60 1 GO 200
number of lleretlone

Figure 4.30. The IAE for update choice3

4—reglon modllled problem verleble lambda IC2
W V V U V

 

1 T r Y Y

 

 

 

 

0.2 — d
O. 1 >— —1
o L A A A A A A A A
O 1 0 2O 30 70 BO 90 1 00

4O 60 60
number of lteretlone

Figure 4.31. The IAE for update choice4

72

 

"1"I 31-: .“ 1.1-- I
{'l.:; I; :x 33"}:1“ Ill“- l I: 1

1.I#~i “1}:

    

 

 

 

 

1' “In: .I" :5. n ‘l-ig .I' “f
”1.5? kit'ﬂw‘w'h git-‘5‘“
“I {Di-IF“ I aﬁgow , I‘?' “
”n?€4°3~i"°° 5 3535.
u.. ,3’3‘ '- '1

% .-' .7
”w :1. mg "- *-
°‘i.‘": coax: .‘ j '4
u 1" 0 "I . x. .l '3'
“u Ill-l" -::“,-,;- I :A I

II... f:‘ll ‘ . II I
on I .. . I. I. . _ .

:.'i‘ "' '- gew'J-TFW’

M 05 0‘

 

 

 

Figure 4.33. Classiﬁcation Performance for A'1 = 1

 

 

Figure 4.34. Classiﬁcation Performance for update choicel

73

 

_Ilh.ll
07 u

 

 

 

 

 

 

 

 

Figure 4.36. Classiﬁcation Performance for update choice3

o-ap-cu-u-uueu

m" I I "I .i
all f‘m5~;‘ x'xl, '1:
' ' ‘5 32:2“2' r‘h‘ “-

 

 

 

 

 

Figure 4.37. Classiﬁcation Performance for update choice4

74

4—reglon orlglnel problem too

 

d

WWW
O
b

 

 

 

 

 

 

 

 

 

0.6 r- .1
0.4 F —1
0.2 ’— d
O L A .
O 1 2 3 4 6
number 0' lleretlone 1 04
' 1
Figure 4.38. The TAE for A = 1
4—reglon modlﬂed problem 60 ICB

“ .4 r v v v f
1 .2 *- _.
1 ..
‘g 0.. l. —4
$0.6 "“ —4

\

0.4 *- —I
0.2 r— .J

o A A A L A

O 0.6 1 1 .5 2 2.5 3.6

number 0! "er-None x ‘04

Figure 4.39. The TAE for update choice1

75

4—reglon modlﬂed problem 24 ICO

 

J
L

.0
0
l
1

Tamar:
.0
O

 

 

 

 

0.4 ’- -d
0.2 ~ .7
°o ois é 1T5 é 2.5
number of Her-None x 1 04
Figure 4.40. The TAE for update choice2
4—reglon modlﬂed problem 200 ICO
1 .4 ﬁr T T T f
1 .2 — a
" -+

 

 

 

 

0.4 >- -‘
0.2 '- _.
°o 4 5 :2. i is a
number 0! lleretlone x 1 0‘
Figure 4.41. The TAE for update choice3
4—reglon modlﬂed problem verleble lembde Ice

1 .4 . T T T T T T
1 .2 '- .4
1 _

 

 

 

o o l- J
0.4 - -
0.2 r— q

o A A A A A J A
O 0.6 1 1 .6 2 2. 3 3.5 4
number 0' lleretlone x 1 04

Figure 4.42. The TAE for update choice4

76

4—reglon orlglnel problem Ice

 

 

 

r

V 1 V

 

A

 

20

4O 60 BO 00 1 20 1 4o 1 60 1 60 200
number 0! Ilereuone

Figure 4.43. The IAE for A'1 = 1

4—reglon rnodlfled problem 60 ICC
I ﬂ 1

 

 

f Y

 

 

60 80 1 OO 1 2O 1 40
number o! Ileretlone

Figure 4.44. The IAE for update choice1

77

4—reglon modiﬁed problem 24 ICC

 

1 V V V

 

 

 

 

O 1 O 20 30 4O 60 CO 70 CO 90 1 00
number 0' lteretlone

Figure 4.45. The IAE for update choice2

 

4—reglon modlfled problem 200 ICC

‘ v 7 ﬁ 1' r

 

 

 

 

0.2 '- —1
O. 1 '- -<
o A A A A A A
O 20 40 CO 14° 16° 1 8° 20°

80 1 OO 1 20
number 0! lteretlone

Figure 4.46. The IAE for update choice3

4—reglon modlfled problem verleble lernbde ICC
1 ‘r Y

 

 

 

 

O A .
O 60 1 OO 1 60
number 0! lleretlone

 

Figure 4.47. The IAE for update choice4

78

 

, agm' '.x' 31.51;: 1%.“: r 2:}

I" ""i "3:" °i ;".",'-'-. ;!.u,;.‘§ ’35 i

...,:c: :5“ L‘n
°° 09 ‘4’ '3'-¥:., I

>- ‘I zooogjf . ....:$‘£Jo "1... .1 Efi

A

I .‘
I.“

4

z 3
43-“
M
if
as

2 s
...,;5
f

3 2
'3 a"
{.50. .
“0;. W. . '
o J .3. .'.. 'u: 3'
I 9%.. ‘22- ...I . 3..
8 ‘A
O
f
o 8
o 0.
{03
Fe
$
I AL! :N

p

 

 

u _r.,. ,,
LIP. - $1:

:'=- "t "

 

 

ouuuuuuuuua

Figure 4.49. Classiﬁcation Performance for A“ = 1

 

 

Figure 4.50. Classiﬁcation Performance for update choice1

79

 

 

 

  

 

 

' 3 Th '3‘ u I 1'
”Li',{:5:'::‘§: “to” ¥'x'{'. “ch“:
as" ox°IP°° "2'”? ii. I“
”It: at” .3051'.." 0P1"
”in ..0 00 ° 0° 80% :'. ‘ h i‘
wh'x’b 30° ' .13. “Hoffa. . :3 . i!- E
Rikki .o ;'., ‘- .93‘33; If;
w :3 00° °. ' ’.r' 0:0:o°:':. “I"
M ':I.°o;‘ :3. .34‘ 09.4. .I‘
I 0 0° n' ‘ ,1 ° ,0“. '3' ' "I
un'. I°ocgu°§ -. ‘0 a?» " In";
“L. mum» ; ~ "*r
..-'. l , 1 r J.

“C‘i'r'ﬂi:°:3&é:1‘.‘i ""17

Figure 4.53. Classiﬁcation Performance for update choice4

80
4.4 Discussion and Categorization of Results

The results of the simulations summarized in the above tables and ﬁgures suggest
that using a ﬁxed forgetting factor A is the cause of both types of divergences. The
case of /\ = 1 shows a lock-up divergence problem while the case of A < l exhibits
an explosive divergence phenomenon. Let’s now analyze the results of the different
update choices closely. In the case of the 2-region classiﬁcation problem, most of the
update choices suggested decreased the number of iterations needed for an acceptable
local minimum without degrading the performance. In this case, what is gained is
the computational efﬁciency of the algorithm. Choice4 of the updates in this case
did not work as well as the others but can be improved if the parameters of the
suggested forgetting function are adjusted accordingly with the initial conditions. In
most methods that are suggested for improving any type of algorithm, there is always
tunable parameters that need to be dealt with.

The more complex case of the 4-region classiﬁcation problem is where the proposed
update choices prove to considerably improve the algorithm. In fact, most of the
update choices escape from unsatisfactory local minima to search for more satisfactory
ones. When a stalling phenomena is observed for an interval of some number of
iterations, the algorithm reacts automatically and decides to update A accordingly.
Convergence to satisfactory local minima is then achieved at a faster rate which not
only improves the performance of the algorithm but also its computational efﬁciency.

Choice4 of the updates gives the algorithm more ﬂexibility by allowing the user to
decide what parameters of the forgetting function ﬁt best with his / her problem. One
is able to make the decision after only a single simulation trial on his/ her problem.

In summary, these different choices of updates show a satisfactory performance
even though some are better than others. The advantage of having a variety of

updates to pick from is an increase of the chance of success in achieving satisfactory

81

local minima for any given initial condition. Even if choice1 through choice3 don’t
achieve the goal, choice4 can be made to converge by a simple adjustment of the
parameters of the forgetting function. (i.e a,L and 7)

Based in the above conclusions the following conjecture is formulated :

4.4.1 Conjecture 2

Let e(é(n)) be the Least Square Error (LSE) deﬁned as :

«an» = 2: II do) — h(é(n), i(n) “2 W: (4.4)
i=1
We conjecture that the following statements are true :

0 Switching A to a value less than 1 results in escaping local minima but often

induces explosive divergence.

0 Switching A only for few iterations escapes local minima without inducing ex-
plosive divergence. The number of these iterations can be decided from one

experimental trial of the specific application at hand.
0 When A is modeled as a function of the error which satisﬁes these two conditions:

1. A is slightly less than 1 when the difference between two consecutive inter-

val average errors is less than a certain threshold.
2. A is 1 when the difference deﬁned above exceeds the same threshold chosen

above.

An example function is given as follows :

tanh a =0: 6 if 6 >
,\ = ( ) 7 (4-5)
0.98 otherwise

82

, then this function guarantees convergence to satisfactory local minima if
the parameters (i.e a,L and 7) are selected according to the problem at hand

after one experimental trial.

CHAPTER 5

A Different Approach to the

Computation of the Gain Matrix

5.1 Motivation

Although the Extended Kalman Filter algorithm to supervised learning is a good
alternative to the backpropagation algorithm, some of its drawbacks still make it
unattractive for some researchers in the neural network community. One major draw-
back is a high computational load which we tried to solve by decreasing the number of
iterations needed for convergence. Also if we examine the recursion equations needed
to update the weights, we note that the algorithm computes the inverse of a matrix
whose dimension depends on the dimension of the output. If the dimension of the
output is high, the computation of the inverse becomes cumbersome.

In this chapter, we propose a modiﬁcation that does not require the computation

of the inverse. This modiﬁcation is also aimed toward an analog implementation

83

84
5.2 Recomputing the Gain Matrix

The computation of the gain matrix for the global network is equivalent to solving
the following equation:

— G(n)A-l(n) + A"1P(n —1)HT(n) = 0, (5.1)

where

A-1(n) = I + A‘IHT(n)P(n — 1)H(n), (5.2)

In this equation, we note that instead of conducting a crude computation of the
root of the equation, which will require a computation of the matrix A(n), we trans-
form the equation into a continuous time ordinary differential equation (o.d.e.). After
convergence, the o.d.e will give a stable solution for the root which can be used as
the new value of the gain matrix of each neuron. Consider the associated differential
equation for each ﬁxed 11

3:1:(t)

— x(t)A-1(n) + A'1P(n -1)HT(n) = at , (5.3)

 

Since for each ﬁxed 11, A‘1(n) is expected to be positive deﬁnite, and hence invertible,
the differential equation is a stable linear system. Thus $.~(t) will converge to a
constant matrix which we take to be G(n). This way, we avoid the cumbersome
computation of the inverse altogether. We observe also that such an o. d. e. is
suitable for analog circuit implementation which can be incorporated as a co—processor
to a digital computer. In this chapter, our interest is mainly in the (N DEKF) since
it was proven to be more efﬁcient than the global (EKF). However, in the NDEKF
we will have as many 0. d. e. equations to simulate as the total number of neurons
in the network not including the input-layer and this is at every update of the weight

parameters or at every iteration.

85
5.3 Simulation Results

Simulations, were conducted to compare the performance of the neuron decoupled al-
gorithm to the proposed modiﬁcation. In this chapter, we will report the performance
on the 4-region classiﬁcation problem which we have already used on both chapters
three and four. The pattern used for this region classiﬁcation problem was previously
described in the earlier chapters. We initialize all parameters including the weights
to the same values we chose in the earlier simulations for comparison purposes.

The adaptation of the weights is done as follows:

1. Pick a random point and propagate forward to the output layer.

2. backpropagate the error values (error is computed as the difference between
the target output value and the actual computed output) through the whole

network.

3. Now do one simultaneous update for all the network weights and matrices. Our
modiﬁcation of the algorithm occurs at this step of the iteration. We do not do
a static computation of the gain matrices. Every computation of a gain matrix
which involves the computation of an inverse of a matrix, is now a simulation of
an o.d.e equation which converges to the gain matrix in few steps. The initial
values for the o. d. e. were chosen to be the values of the gain matrices from

the previous iteration.

The results of simulations using the same initial condition one will be given in the
following table.

The results in this table are very similar to the ones obtained previously with
the computation of the inverse. However, these simulations were left training only

for 10000 iterations but we still achieved the same performance results. The reason

86

Table 5.1. The 4-region classiﬁcation problem with a modiﬁed computation of the
gain matrix

 

 

 

 

 

performance parameters 1C1 IC2 IC3
TAE 0.5872 0.5891 0.5767
IAE 0.5463 0.5420 0.5216

class % 74.6 75.4 76.1
#iterations 10000 10000 10000

 

 

 

 

 

 

behind limiting the training number of iterations is to due to the constraint of time.
The o.d.e themselves take few steps to converge inside each training iteration.
In conclusion the computation of the gain matrix using this approach is equivalent

to the earlier method that includes the computation of the inverse.

CHAPTER 6

Variable Slopes

6.1 Motivation

Given the considerable success of the linear estimation methods in solving linear
problems, researchers have extended these methods to slightly nonlinear problems and
sometimes completely nonlinear problems. The “standard” Kalman ﬁlter is linear so
in order to apply the approach of Kalman ﬁltering to problems of nonlinear type, the
Extended Kalman ﬁlter was derived. The essential idea of the Extended kalman ﬁlter
was proposed by Stanley F.Scmidt. It was then named after him as the “Kalman-
Scmidt” ﬁlter. The Kalman ﬁlter is extended mainly by expanding the nonlinear
function h(0, z(n)) describing the model around the current estimate g(n—l) estimated
from all data up to time (n — 1). The nonlinearity present in the network is described
by the model function h(0,i(n)) which is the sigmoidal function chosen as tanh(a:)
in the (FFANN). Due to the chain rule, the derivative of the sigmoidal function
is present almost in every element of the gradient matrix H (n) The slope of the

sigmoidal function was always ﬁxed at 1.

87

88

Even though this function plays a big role in the Extended Kalman ﬁlter equations,
the inﬂuence of the steepness of its slopes on the behavior of the algorithm was
never investigated in previous related research. However, in the backpropagation
algorithm, the effect of these steepness parameters especially on convergence was well
investigated [30]. It was argued in their discussion that a high value of s can make
the weights update slowly thus needing a large number of iterations to converge. A

too low value of 3 according to Branko can have these two negative effects.

1. The derivative decreases even in unsaturated regions inducing slow convergence

2. The output of the linear combiner will always fall in the linear region of the sig-
moidal function thus loosing any nonlinear effect of the network and dissolving

its multilayer structure.

These observations can only suggest that the value of the slope s can not be decided
a priori. The solution that was suggested before [30] is to determine an optimum
value of 3 via adaptive means. In the backpropagation algorithm, the slopes of the
activation function were updated according to the delta rule [30].

In the case of the Kalman ﬁlter algorithm, we propose to investigate the effect
of different slopes of the sigmoidal function on the convergence to satisfactory local
minima. We will then develop a similar update of the slopes to the one used in the

backpropagation algorithm.

89

The emwetlon lunatlon mm .609. -o.o

 

 

 

 

0.. v v
0.. '- -1
0.4 >- ..
0.2 >- d
E o _ .
ﬂ
-O.2 - _,
—o.4 >- -4
—-0.0 > ..
‘03:: —170 _; —o-a a of: 1 {a r
output of the llneer con-blue!

Figure 6.1. The sigmoid function with slope s = 0.5

The eoﬂvetlon tuneuon wlth elope -1

 

 

 

1 -
o.- »- ..
o.- - -
°-‘ P' -4
on p .
é o . .
a .02 »- -
—o.4 ,. -
~o.o » -<
-—0.. r- ~
_“I -105 -3 A A A v via a

 

—O.5 O 0.6
output 0' the Ilneer combiner

Figure 6.2. The sigmoid function with slope s = 1

The eetlvetlon Ounotlon Mth elope -1 .8

 

 

 

 

1 v . Y Y

0.. .- .1
0.. >- ..
0.4 I- 4
0.2 ~ -‘

§ ., _ .
g —0.2 >- ..
-0.4 L- "
—o.o ,. d
—o.e )- a
‘15: — 1.6 -i A A 1 1:5 I!

—O.5 O 0.5
oupul 00 the llneer oomblner

Figure 6.3. The sigmoid function with slope s = 1.5

The ecuvellon hat-vouch wnh slope -2

1 v v a 1

 

o.- » u
o.- l— -4
0.4 r- -
0.8 v- ..

-0.8 '- 1
-0.4 v- -4
-0.. - a
-O.. '- ..

_1 A L A L

 

 

 

—-O.6 o 0.6
output 0' the llneer oomblher

Figure 6.4. The sigmoid function with slope s = 2

0.0
0.45
0.4
0.00
0 0

g 0.20
a 0.2
0. 1

0.06

0

The denvedve a! the eonveuoh hat-touch Mth elope —0.6

90

 

 

 

r

 

 

—10

—2 O 2
output 0! (he llheer ooonblher

Or

Figure 6.5. The derivative of the sigmoid function with slope

0.0

0.0

0.7

0.0

0.6

mm

0.4

0.0

0.2

0.1

0

The derivetfve o! the Mellon “phonon whh slope -1
v v of f v .

 

I r T 1

 

 

 

 

—10

_2 0 2
output of the llheer eonsblher

3:0.5

Figure 6.6. The derivative of the sigmoid function with slope s = 1

1.0

mm

0.6

The derlveﬂve of the eetlveCIoh 'uno‘lon whh elope -1 .8

 

 

 

 

4 —4

—2 O 2
was." at the llheer oornblner

10

Figure 6.7. The derivative of the sigmoid function with slope

1.0

1.0

1.4

1.2

0.0

0.0

0.4

0.2

—°1o

The eeﬂvetlve o! the eeuvellon Ouhouon with elope —2

 

T Y T I

 

 

 

_. —4

—2 0 2
output or! the llheer oomblher

3:1.5

Figure 6.8. The derivative of the sigmoid function with slope s = 2

91
6.2 Preliminary Results

In this section of the chapter, we will illustrate via the 4—region classiﬁcation problem
how using different slopes can have an effect the convergence of the algorithm to
satisfactory local minima. The results for each case are tabulated for three different

initial conditions.

Table 6.1. The 4-region classiﬁcation problem with ﬁxed slopes

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

slope choices ICI IC2 IC3
s = 1 TAE 0.5757 0.5885 0.5718
IAE 0.5736 0.5416 0.5295

class % 76.9 75.1 77.7

#iterations 50000 50000 50000
.9 = 0.5 TAE 0.6097 0.6282 0.5991
IAE 0.6054 0.5663 0.5490

class % 73.5 71.3 76.6

#iterations 50000 50000 50000
3 = 1.5 TAE 0.5174 0.5636 0.3875
IAE 0.5079 0.5178 0.2961

class ‘70 83.0 79.1 88.9

#iterations 50000 50000 50000
3 = 2.0 TAE 0.4306 0.5227 0.3681
IAE 0.4074 0.4667 0.2946

class % 86.7 84.0 89.5

#iterations 50000 50000 50000

 

 

92

4—reglon orlglnel problem 001
1 .4 - w - f

 

was.
.09
00‘

 

 

 

 

0.4 *- '1
Figure 6.9. The TAE for a ﬁxed slope s = 1

.°
0
I
1

run—yam
0
*-

 

 

 

 

 

 

 

 

 

 

 

0.‘ '— _,
O0 ; ; number o? Her-None :‘ a u 1 0:.
Figure 6.10. The TAE for a ﬁxed slope s = 0.5
“ '4 4—reglon oﬂglnel problem e—1 .BVIO‘I
.s 0.. L _
§O.e - _.
0.‘ '- _,
o" ; 5 number 0. lleretlone 3 é a .‘ 0:.
Figure 6.11. The TAE for a ﬁxed slope s = 1.5
‘ .4 e—reolon orlalnel'problern e—a I'o1 '7
1 .2 - ..l
1 a
E... .
a 0.. .— -
0‘) 1 0 3 3 0 CI
number 0' llerellone x 1 0‘

 

 

Figure 6.12. The TAE for a ﬁxed slope s = 2

0.0
0..
0.7

0.0

30.5

93

e—reolon orlolnel problem 001

 

 

v7

v v v

 

 

 

 

B 0.4 F- —‘
0.5 r- _,
0.2 h -1
0.1 +- _.

00 20 40 00 0 1 00 1 2 1 40 1 00 1 00 200

number 0' lteretlone
I
F1gure 6.13. The IAE for a ﬁxed slope s = 1
e—reolon oﬂglnel problern e—O.B IO1

1 v v f v v v v
0.0 —4
0.. -4
0.7 p --1

5 0,. h—

E 0.4 L- -
0.0 .— —<
0.2 ~ -4
0.1 - <

O . L - -
<3 20 40 00 1 40 1 00 1 .0 200

 

 

 

e0 1 oo 1 no
number or ltereclone

Figure 6.14. The IAE for a ﬁxed slope s = 0.5

4—reolon origin-l problem e—1 .0 IO1

 

 

 

 

 

 

 

 

 

 

0,0 I— —<
0.2 >- -1
0.1 >- —J
00 20 40 00 0 1 00 1 2 1 40 1 00 1 .0 2(30
number 0' lteretlone
' f —
F1gure 6.15. The IAE or a ﬁxed slope s — 1.5
4—realon orlalnel problem e—a l01
0.0 v v v v v
0..
0.7
.5 °"
30..
E 0.4
0.0 >- -
0.8 .— T
0.1 >- -4
00 20 40 00 00 1 00 1 20 1 40 1 00 1 00 a<
number of lteretlone

Figure 6.16. The IAE for a ﬁxed slope s = 2

)0

94

 

l
'h

: xii-1:? ”all :5; I“;

.31.. ° 8 .

-:.x§er

:1».- -

w

I
R
1

.-
C

8

8

$13}. 5
" a I 'I
_ = .I'. 1;... "‘7." III”
M u a: on u u M u u

0

1

 

 

 

Figure 6.17. The 4-region classiﬁcation pattern

assesses

 

'3

 

‘

:_g::g::§

 

 

Figure 6.19. Classiﬁcation Performance for a ﬁxed slope s = 0.5

95

asecgggg-

2

 

 

Figure 6.20. Classiﬁcation Performance for a ﬁxed slope s = 1.5

 

Figure 6.21. Classiﬁcation Performance for a ﬁxed slope s = 2

96
6.3 The Augmented Recursion Formulas

The (EKF) recursion equations start by a computation of the gain matrix C(n) which
is then used to update the weights. The update of the weight parameters is followed
by the update of the covariance matix P(n). To these two update equations we add
an update equation for the slopes of the sigmoidal function. These slopes are updated
at every node of all hidden layers in the network. The update is based on the gradient
descent rule. The slope of the nonlinearity 3;, where I denotes the hidden layer I and
j one of the nodes in that layer, is detemined so as to minimize a certain energy
function.

We deﬁne the energy function as follows:

E701) =1/2(|| d(n) - h(90(7"t),i(n)) ”2)-

where n is the time index and d(n) is the vector of desired outputs at time n.
h(0(n),i(n)) is the nonlinear model function which is a composition of sigmoidal
functions.

In this section, without loss of generality, we develop the update equations of
the slopes based on the same network architecture used in the 4-region classiﬁcation
problem simulation example. The network is composed of a total of 4 layers including
two hidden layers.

Let us call S2 the vector of slopes for layer 2 and S3 the vector of slopes for layer
3. The following equations describe the gradients of the energy function with respect

to these vectors of slopes.

ii: = (do) — h(ém.i(n))ahngf‘j”

 

 

 

97

 

 

 

8E1 _ 4 4 01:4 09:3 02:2
852 — (:6 -xd)8x369:2 652
03:4 3
53—5 — w

( I 2 2 T 2 2 \
h (51(w1) :1: )wl
£713 ._
0‘” h'(5¥o(wio>Tx2)w%o
K 0 J

 

 

 

 

82 I h’(Sf(w1)Tx‘)(w})Tx1 )
CL‘
B's—2:
\h’(5%o(wi°>Txl)<wio)Txl J
8E7: _ 4_ 4 334%
053 ‘(3 $067-$653
h'(s%(w:)Tx2)(w%)Tx'-’ )
8E
55:1“ x3)w3

where 2:4 is the vector of actual outputs for the output layer and x3 is the vector of

desired outputs. 1:3 and 32 are the outputs to layers 2 and 3. x1 is the input to layer

1.

 

hI(Sfo(wf0)T$2)(wfo)T$2 }

Also h(.r) = tanh(s:c) and h'(:c) = l/cosh2(sx)

The update equations for the vector of slopes corresponding to layer 2 and 3 are

given by the following equations:

52(72) = 52(n —

BET

1)— ﬁ+

10(52(n - 1) - 52(n - 2))

98

53(7)) = 53(n — 1) — 535% + p(53(n — 1) — 53(n — 2))

The augmented recursion formulas are then:

A(n) = [I + i Hm?) — 1)H3" (nu-1,

A=1

e(n) = 1).-(n —1)H.T(n)A(n),

 

 

 

99
6.4 Simulation Results

We conducted the same number of simulations as for the preliminary results. We used
the same three initial conditions but since the update of the slopes is being added,
we needed to also choose initial conditions for the slope parameters present in the
update equations of the slopes. There are four slope initial conditions for each initial
condition on the weight parameters.

The results are tabulated in the table and the ﬁgures that follow.

Table 6.2. The 4-region classiﬁcation problem with variable slopes

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

initial parameters choices 1C1 IC2 IC3
so = 16 = 0.1p = 0.001 TAE 0.5467 0.4019 0.4799
IAE 0.5013 0.2985 0.3729

class % 79.3 88.2 84.6

#iterations 50000 41250 50000
so = 0.53 = 0.1p = 0.001 TAE 0.4010 .4012 0.3265
IAE 0.2922 0.2958 0.2584

class % 92.5 86.2 89.7

#iterations 22250 15500 50000
so = 1.55 = 0.1;) = 0.001 TAE 0.4857 0.5444 0.4668
IAE 0.4692 0.4814 0.3832

class % 82.4 77.9 85.9

#iterations 50000 50000 50000
so = 2.0ﬂ = 0.1;) = 0.001 TAE 0.4617 0.4606 0.3761
IAE 0.4393 0.3970 0.2998

class ‘70 82.2 85.6 91.5

#iterations 50000 50000 6250

 

 

Mayan

ramps)»

9

9

.0

100

e—reolon orlclnel problem verelope ell-1 lo1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 6.25. The TAE for varying the slope with so = 2

2 )— .,
00 ; a number oé llerellene ; at 1 0:.
Figure 6.22. The TAE for varying the slope with so = 1

‘ e—reglon orlglnel problem verel'ope e—0.° I01 r

.2 — -)

1 -.

0‘) 0:6 ‘7 1 .6 3 2.0
number 0' llerellone x .‘ 0“

Figure 6.23. The TAE for varying the slope with so = 0.5

,4 e—realon orlglnel problem verelope ..-‘ .3 IO'l

0 .. -1

4 )- ..

00 ; 5 number 0? lleretlone 3 u 10:.

Figure 6.24. The TAE for varying the slope with so = 1.5

" e—realen orlglnel problem verelepe :—2 lo1

.2 '- .4

1 -1

0 —4

0‘, 1 2 a A ID
number o! llerellone x ‘04

4—reolon orlglnel problem verelope e—1 IO1

101

 

 

0
0
rrrt

 

 

20

Figure 6.26. The IAE for varying the slope with so = 1

4O

00

00 1 00 1 20
number 0' llerellone

4—reglon orlglnel problem verelope e—o.6 lc1

100

1-0

 

2‘30

 

 

 

 

 

0.0

20

40 50
number 0' llerellone

00

70

-0

Figure 6.27. The IAE for varying the slope with so = 0.5

4—reolon orlqlnel problem verelope e—1 .8 lO1

00

 

O.-
0.7
0.0

0.0

E 0..
0.0

 

v

f

v

 

 

 

._ 4
0.2 F- -i
0.1 )— _.
01:) 20 40 co co 1 oo 1 20 1 40 1 oo 1 eo 200
number or llerellone

Figure 6.28. The IAE for varying the slope with so = 1.5

e—reglon orlolnel problem verelope e—a 001

 

 

f

 

 

 

20

Figure 6.29. The IAE for varying the slope with so = 2

4O

00

O 1 00 1 2
number 0' llerellone

140

1-0

102

 

 

 

 

 

O 0.1 01 0.3 M 05 0.0 0.7 0.. u 1

Figure 6.31. Classiﬁcation Performance for varying the slope with so = 1

 

 

omozuuuuuuui

Figure 6.32. Classiﬁcation Performance for varying the slope with so = 0.5

103

 

 

 

Figure 6.33. Classiﬁcation Performance for varying the slope with so = 1.5

 

 

Figure 6.34. Classiﬁcation Performance for varying the slope with so = 2

104

The numbers in the table suggest that an initial choice of so = 0.5 is the best
choice amongst all since it is the more efﬁcient in terms of both the performance and
the computation load. However, when the slopes were ﬁxed in the previous section, a
ﬁxed slope of s = 2 was the best especially in terms of performance. In conclusion, if
we have no knowledge of what the optimum value of s has to be, which is the case in
general, the method of varying the slopes should be used while keeping in mind that
the initial choice so should remain in the range of 0.49 < so < 1. The update of the
slopes in conjunction with the previous modiﬁcations of the algorithm including the

update of the A parameter is a tremendous improvement of the algorithm altogether.

CHAPTER 7

Summary and Conclusion

Mathematical modeling of real physical systems has been the interest of a consider-
able number of researchers in the engineering ﬁeld. Artiﬁcial Neural Networks are
amongst the most attractive and challenging models these researchers have developed
over the years. The strength of these artiﬁcial models relies on their ability to mimic
natural intelligence. The two main approaches to design these artiﬁcial models are
those implemented via digital software and those manufactured via analog hardware.
This research, however, is designated to the software approach as applied to feedfor-
ward artiﬁcial neural networks(FFANN). Since supervised learning has been the main
learning type used in these networks, the work in this document addressed some of
the issues of supervised learning as applicable to FFANNs while using the Extended

Kalman Filter approach for training.

105

106
7 .1 Summary

The research in this thesis was dedicated to a recently introduced (the 1990’s) super-
vised learning algorithm called the Extended Kalman Filter EKF. The more widely
used has been the Neuron-Decoupled Extended Kalman Filter (NDEKF) due to its
tremendously lower computation load. In the previous reported work this approach
has been used in different applications ranging from system identiﬁcation to nonlin-
ear control. However, the problems that would be faced while practically using the
algorithm have never been addressed in any of the previous work.

The lack of any previous discussion of the problems we experimented with while
using the algorithm ourselves, was our main motivation for the research in this work.
This research has accomplished many tasks that led to the improvement of the algo-
rithm. Starting with our analysis and understanding of how the Extended Kalman
Filter algorithm operates, we were able to point out the many problems and obstacles
that any user of the algorithm is likely to encounter while using the algorithm. We
recall such obstacles as the high computational load and the two forms of divergence
that are frequently manifested. The theory behind the algorithm supports most of
our ﬁndings about the problems and the weaknesses that are present in the algorithm.
Also with the help of theory we were able to propose the different modiﬁcations aimed
toward a more efﬁcient kalman ﬁlter training algorithm for supervised learning. These
modiﬁcations are present in chapters four through six. They range from a suitable
update of the energy parameter()\) to an update of the nonlinearity slopes without
adding cumbersome work such as a large number of tunable parameters.

These modiﬁcations have the advantage of giving the user the ﬂexibility of adapt-
ing the algorithm to any application at hand. Thanks to generalizing the deﬁnition of
certain functions and parameters the user is able to adjust the variables accordingly

with his or her application.

107
7 .2 Conclusion and Future Research

The main contribution in this research is in familiarizing any reader of this thesis with
the approach of Extended Kalman Filtering to supervised learning while enabling him
or her to decide if the approach with its all advantages and disadvantages is suitable
to achieve any research goal he or she may have. The improvements of the algorithm
present in this thesis give the user a better more efﬁcient algorithm to work with.
The efﬁciency is improved in terms of both the computation and the performance of
the algorithm.

Due to the constraint of time in this research neither all aspects of the algorithm
were investigated nor all improvements strategies that were thought of could be im-
plemented. However, future research in the subject is possible and will include trying
alternative energy or cost functions that could reduce the convergence time while

keeping the performance satisfactory.

APPENDICES

APPENDIX A

Derivation of the Extended

Kalman Algorithm Equations

In order to keep this thesis self-contained, the standard derivations for the recursive
least squares recursions are included. They are a multidimensional extension of the

ones given in [8]

g(n) = A: An-l-J‘Homm) + H(n)HT<n) (A4)
<I>(n) = mm — 1) + H(n)HT(n). (A.2)
r<n> = Arm — 1) + H(n)(d(n) — an» (A3)

Using the matrix inversion lemma:

A = B-1 + CCT (AA)
A“ = B — BC(I + CTBC)“CTB. (A.5)

with B = 9'10: — 1)/\‘1 and C = H(n), we get

108

109

(1)402) = A-IQ-l(n — 1) — A-2@-1(n —1)H(n)
X[I + A'IHT(n)<P‘1(n —1)H(n)]’l

XHT(n)<I>-1(n — 1), (A.6)
Deﬁning P(n) = tIP‘1(n — 1), and
G(n) = A—1P(n — l)H(n)[I + A‘IHT(n)P(n — 1)H(n)]‘1, (A.7)

we can rewrite eqn(A6) as

P(n) = A‘1P(n — 1) — X'1G(n)HT(n)P(n — 1). (A.8)

Equation (A7) can be rewritten as
G(n) + A‘1G(n)HT(n)P(n — 1)H(n) = A‘1P(n —1)H(n), (A.9)

01‘

G(n) = [A'1P(n — 1) — A'1G(n)HT(n)P(n —1)]H(n). (A.10)

Therefore the Kalman gain can be expressed as
G(n) = P(n)H(n), (A.ll)
The parameter estimate for the ﬁlter weight 0)(n) can then be expressed as

(i(n) = P(n)r(n) (A.12)

110
(i(n) = P(n)>\r(n - 1) + P(n)H(n)(d(n) - 60%))-

Substituting (A8) in the ﬁrst term of (A12) we get

0(n) = P(n —1)r(n — 1) — G(n)HT(n)X

P(n —1)r(n — 1) + P(n)H(n)(d(n) —- g(n)).

From (All) we get

é<n> = é<n — 1) + G(n)(d(n) — :(n) - HT(n)é(n —1)),

01‘

(i(n) = é(n — 1) + G(n)(d(n) — h(é(n —1),i(n))).

(A.13)

(A.14)

(A.15)

(A.16)

BIBLIOGRAPHY

BIBLIOGRAPHY

[1] T. Hrycej, Modular Learning in Neural Networks. Toronto, Canada: John Wiley
and Sons, Inc., 1992.

[2] P. D. Wasserman, Advanced Methods in Neural Computing. London, England:
International Thomson Publishing, 1993.

[3] R. A. Jacobs, “Increased rates of convergence through learning rate adaptation,”

Neural Networks, pp. 295—307, 1988.

[4] T. Samad, “Backpropagation improvements based on heuristic arguments,” Pro-
ceedings of International Joint Conference On Neural Networks, pp. 565—568,
Washington, DC: IEEE Press 1990.

[5] M. Ahmad, Supervised Learning of Feedforward Artiﬁcial Neural Networks Us-
ing Diﬁerent Energy Functions. Michigan State University, East Lansing, Mi:
Michigan State University, Dec 1991.

[6] R. W. Prager and Fallside, “The modiﬁed Kanerva Model for Automatic Speech
Recognition,” In IEEE workshop on Speech Recognition, vol. Arden House, Har-
riman NY, May 1988.

[7] B. Irie and S. Miyake, “Capabilities of Three—layered Perceptrons,” Proceedings
of the IEEE International Conference on Neural Networks, vol. 1, pp. 641—648,
June 1988.

[8] S. Haykin, Adaptive ﬁlter theory. NJzPrentice Hall: Englewood Cliffs, 1991.

[9] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas of immanent
in nervous activity,” Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133,
1943.

[10] F. Rosenblatt, Principles of Neurodynamics. Washington, DC: Spartan Books,
1961.

111

112

[11] S. Grossberg, “Adaptive pattern classiﬁcation and universal recoding I: Paral-

lel development and coding of neural feature detectors,” Biological Cybernetics,
vol. 23, pp. 121—134, 1976.

[12] B. Kosko, “Bidirectional associative memories,” IEEE Transactions on Systems,

Man and Cybernetics SMC, vol. 18, pp. 49—60, 1987.

[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Internal Repre-
sentations by error back propagation in parallel distributed processing: Explo-

rations in the microstructures of cognition,” MA :MI T Press. Cambridge, vol. 1,
pp. 318—362, 1986.

[14] E. Oja, “A simpliﬁed neuron model as a principal component analyzer,” Journal
of Mathematical Biology, pp. 267—273, 1982.

[15] T. Kohonen, Self-Organization and Associative Memory. New York: Springer-
Verlag, 1988.

[16] Hecht-Neilson, “Counterpropagation Networks,” Proceedings of the IEEE First
International Conference on Neural Networks, vol. 2, pp. 19—32, 1987.

[17] B. Widrow, “An adaptive “Adaline” neuron using chemical “Memistors”,” Tech-
nical Report, vol. 1553-2, October 1960.

[18] B. Widrow and F. W. Smith, “Pattern recognizing control systems,” Computer
and Information Sciences Symposium Proceedings, vol. Sparton Books, Wash-

ington D. C. 1963.

[19] M. S. Grewal, Kalman ﬁltering : theory and practice. N. J. : Prentice-Hall:
Englewood Cliffs, 1993.

[20] H. W. Sorenson, “Kalman Filtering Techniques,” In Advances in Control Systems
Theory and Applications, vol. 3, pp. 219—292, ed. C. T. Leondes. New York:
Academic Pres 1966.

[21] S. Singhal and Wu, “Training Multilayer Perceptrons with the Extended Kalman
Algorithm,” In Advances in Neural Information Processing Systems, pp. 133—
140, Denver 1988, led. D. S. Touretsky, San Mateo, CA: Morgan Kaufmann
1988.

[22] S. A. Shah and F. Palmieri, “MEKA - A Fast, Local Algorithm for Training
Feedforward Neural Networks,” In International Joint Conference on Neural
Networks, vol. 3, pp. 41—45, NewYork: IEEE 1990.

113

[23] F. Palmieri and S. Shah, “Fast training of multilayer perceptron using multilin-

ear parameterization,” Proceedings of International Joint Conference on Neural

Networks, vol. 1, pp. 696—699, IEEE Press 1990.

[24] K. S. Narendra and K. Parasarathy, “Identiﬁcation and Control of Dynamical
Systems Using Neural Networks,” IEEE Transactions on Neural Networks, pp. 4—
27, 1990.

[25] L. Jin, P. N. Nikiforuck, and M. M. Gupta, “Decoupled Recursive Estimation
Training and Trainable Degree of Feedforward Neural Networks,” IJ CNN, 1991.

[26] S. Shah, F. Palmeiri, and M. Datum, “Optimal Filtering Algorithms for Fast
Learning in Feedforward Neural Networks,” Neural Networks, vol. 5, pp. 779—
787, 1992.

[27] G. V. Puskorius and L. A. Feldkamp, “Decoupled Extended Kalman Filter Train-
ing of Feedforward Layered Networks,” In IEEE International Conference on
Neural Networks, vol. 1, pp. 771—777, 1991.

[28] G. E. Bottomley and S. T. Alexander, “A Theoretical Basis for the Divergence
of Conventional Recursive Least Squares Filters,” , 1989.

[29] G. V. Puskorius and L. A. Feldkamp, “Neurocontrol of Nonlinear Dynamical Sys-
tems with Kalman Filter Trained Recurrent Networks,” In IEEE Transactions
on Neural Networks, vol. 5, pp. 279—297, March 1994.

[30] B. Soucek and the Iris Group, Neural and intelligent systems integration.
Newyork- Brisbane— Toronto- Singapore: A wiley Interscience Publication, John

Wiley Sons, Inc., 1991.

MICHIGAN STATE UNIV. LIBRARIES
[I[I[1]][HI][11]]1]][I][[1]][111111111]HI
31293014216190