SYSTEM IDENTIFICATION BY STOCHASTIC APPROXIMATION

Thesis for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
DELL D. MONAHAN
1970

This is to certify that the thesis entitled "System Identification by Stochastic Approximation," presented by Dell D. Monahan, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering and Systems Science.

Major Professor

ABSTRACT

SYSTEM IDENTIFICATION BY STOCHASTIC APPROXIMATION

By Dell D. Monahan

The primary purpose of this work is to develop a practical means for finding a mathematical model for an unknown discrete-time system from observations of its input and output variables. The system is assumed to be excited by a random input process having bounded moments but otherwise unknown statistical properties. The system itself may be nonlinear, in a specified sense, and time-varying. It is observed with additive random errors having unknown distributions. In view of these conditions, an on-line procedure is selected which uses an adaptive model. The adaptive model has a state-space representation in which the equations are assumed known up to a set of parameters. General conditions restricting the form of the model are given. The problem is to find the parameter values which minimize the error between the system output and the model output. The method chosen for computation is a stochastic gradient-search algorithm which updates the model parameters with each new set of input-output measurements. Under a liberal set of assumptions the parameter estimator is shown to converge, within a small bias error, in the mean-square and almost-sure senses to the true parameter value as the number of observations increases. The arguments for convergence rely on stochastic approximation theory. Therefore, the procedure considered is a stochastic approximation system identification algorithm. It is formulated in a general sense and shown to apply to certain practical problems. The verification is carried out by a digital computer simulation. The examples simulated include the identification of a system which is nonlinear in its state, input, and unknown parameter.

The method extends the work that appears in the literature on system identification in at least four ways. First, the identification procedure is generalized to apply to systems which have nonlinear and time-varying characteristics. Second, a geometric interpretation of identifiability is given. Third, under liberal assumptions, mean-square and almost-sure convergence of the estimation algorithm is proven. Finally, the rate of convergence of the estimator is evaluated.

SYSTEM IDENTIFICATION BY STOCHASTIC APPROXIMATION

By Dell D. Monahan

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Electrical Engineering and Systems Science
1970

ACKNOWLEDGEMENTS

The author wishes to express his sincere gratitude to his thesis advisor, Dr. G.L. Park, for his continued assistance and encouragement during the course of this research. The author is indebted to Dr. R.C. Dubes, who contributed considerably to the author's graduate study. Dr. Dubes' careful proofreading of the manuscript, which led to several improvements, is also gratefully acknowledged. Thanks are also due to Dr.
H. Salehi, Dr. R. Zemach, and Dr. W.L. Kilmer for their interest and critical reviews of the thesis.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
CHAPTER I    INTRODUCTION
     1.1 Motivation and Models for System Identification
     1.2 A Survey of System Identification Techniques
     1.3 Thesis Objectives
CHAPTER II   DEVELOPMENT OF THE ALGORITHM
     2.1 The System Model
     2.2 Assumptions for Identification and Identifiability
     2.3 The System Identification Algorithm
     2.4 Consistency of the Estimate
     2.5 Approximation to the System Identification Algorithm
CHAPTER III  CONVERGENCE PROOFS
     3.1 Mean-square Convergence
     3.2 Almost-sure Convergence
     3.3 Rate of Convergence
     3.4 Extension to a Larger Class of Nonlinear Systems
     3.5 Summary
CHAPTER IV   EXAMPLES
     4.1 Nonlinear State Model: Linear in the Unknown Parameter
     4.2 General Nonlinear System: Scalar Case
CHAPTER V    CONCLUSIONS
     5.1 Thesis Results
     5.2 Possible Extensions
APPENDIX
BIBLIOGRAPHY

LIST OF FIGURES

1  Hyperplane H, with Unit Normal z, and Vector Distances |f_i'z| from H
2  Estimator Error Squared as a Function of Algorithm Iterations; Large Initial Estimator Error and Various Gains
3  Estimator Error Squared as a Function of Algorithm Iterations; Small Initial Estimator Error and Various Gains
4  Relative Squared Error as a Function of Algorithm Iterations; Least-squares Gain with Various Noise Levels
5  Relative Squared Error as a Function of Algorithm Iterations; Deterministic Gain with Various Noise Levels

CHAPTER I

INTRODUCTION

In this chapter the definition and purpose of dynamic system modeling and identification are briefly stated. Then, a representative set of methods for obtaining dynamic system models is discussed with respect to the practical problems. From these methods a particular approach to system identification is selected. Finally, the general aim of this work is stated, establishing this approach as a practical method for system identification.

1.1 Motivation and Models for System Identification

Generally the subject of system identification is motivated by problems connected with control and synthesis. The present study had its origin in the problem of stability of large-scale interconnected electric power systems. Modeling such systems provides a means for predicting the transient reaction of the system to step changes in load as, for example, caused by distant faults. Knowledge of such responses leads to better design and control policies for improved system reliability. This application has an immediate parallel in automatic control theory and the development of adaptive control systems. The definition of a system will be considered first.
Using operator notation, a system may be defined as a triplet (x,u,H) where (1.1.1) x Hu i.e., the operator H maps an admissable input function of time u into an output function of time x. Then, x is the response of the system H to the excitation u. The system identification problem may then be thought of as the problem of using observed x's and u's to determine the unknown operator H. The question of uniqueness immediately arises, and one finds that, generally, H belongs to an equivalence class of H- operators in which each member produces an identical output Hu, for each admissible input excitation u. This is evident from the following example. (1.1.2) x hll hl2 ul x2 h21 h22 U2 1“ There is no single pair of outputs and inputs (x1,x2) and (u1,u2) which will uniquely define {hijz i,j = 1,2}. There are infinitely many {h that satisfy (1.1.2) (R3). In 13} some cases there are canonical forms associated with the equi- valent classes; for example, linear time-invariant single-input single-output systems of known order (L2). In any event, the lack of uniqueness is a basic difficulty in a system identifica- tion problem since one is only able to find a model equivalent to the system observed. Hence, models that are made available for design and control studies are usually, at most, equivalent to the unknown system. The first step in an identification problem is to select, from knowledge of the problem, a form for H which is known explicitly up to a set of components, typically specified by a finite set of parameters. This selection defines a set of systems which is assumed to contain a unique H-operator which is equivalent to the unknown system. The problem is then to find the parameters of this H-operator. By definition the system is "identified" when this set of parameters is found. A wide variety of mathematical models are available for representing the unknown system. Some examples are briefly reviewed here: (a) The Volterra Functional Series The model (1.1.1) may be more explicitly described by the Volterra expansion for continuous functionals t (1.1.3) x(t) = ho(t) +I h1(t,1'1)u('r1)d'rl + t t +-I ... I hn(t,T1,...,Tn)u(T1)u(T2)...u(Tn)d¢1...d¢n + . In (1.1.1) H represents the integral operation (1.1.3). Any nonlinear, deterministic, time-varying, controllable system can be described by a relation of the type (1.1.3) (K6, V3, 32). If the system is linear, only the first two terms of (1.1.3) need to be considered. Kwatny and Shen (K6) have shown that a parameterized approximation to (1.1.3) is determined by trun- cating the expansion at some N > O and approximating hn(.,...,.) by a multidimensional series in the orthogonal functions I¢i(-)}:=1 for each n = 1,...,N. Then, for example, if N = 2, (1.1.3) becomes k k t (1.1-4) x(t) = .2 aicpi(t) + .2: biImi(T1)u(t - T1)d¢1 1=l I=l «a k t t + . ;_ cij I cpi('r1)U(t - r1)dr1I cpj('f2)U(t - 3)de 1,3-1 -m «a k where h2(t,T1,T2) is replaced by i §=1cij¢i(T1)¢j(T2) and the time variation is accounted for by allowing the parameters ai’bi and cij to vary in time. Thus, with the approximation (1.1.4), the system identification is accomplished by determin- ing the parameter values which best establish a mathematical equivalence of the unknown system. If inputs and outputs are sampled periodically, the integrals in (1.1.4) may be replaced with summations. 
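As an illustration of that sampled-data form, the following sketch assembles the regressor vector of a second-order (N = 2) discrete Volterra model, which is linear in the unknown coefficients a_i, b_i, c_ij of (1.1.4). The memory length, the exponential basis functions used for the phi_i, and the random input sequence are hypothetical choices made only for the example, not values taken from the thesis.

import numpy as np

def volterra_features(u, k, M, phis):
    """Regressor for a truncated (N = 2) discrete Volterra model, cf. (1.1.4).

    u    : array of input samples
    k    : current time index (uses u[k], u[k-1], ..., u[k-M+1])
    M    : memory length of the discretized convolutions
    phis : list of basis functions phi_i(tau), tau = 0..M-1
    """
    window = u[k - M + 1:k + 1][::-1]            # u[k], u[k-1], ..., u[k-M+1]
    s = np.array([np.dot([phi(t) for t in range(M)], window) for phi in phis])
    quad = [s[i] * s[j] for i in range(len(s)) for j in range(i, len(s))]
    return np.concatenate(([1.0], s, quad))      # constant, linear, and bilinear terms

# Hypothetical exponential basis; sampled Laguerre functions would serve the same role.
phis = [lambda t, a=a: (1.0 - a) * a ** t for a in (0.3, 0.6, 0.9)]

rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, 200)
X = np.array([volterra_features(u, k, 20, phis) for k in range(20, 200)])
print(X.shape)    # each row multiplies a coefficient vector to predict x(k), as in (1.1.4)

Each output sample is then a linear combination of one such row, so the identification problem reduces to estimating the coefficient vector, at the cost of a regressor whose length grows quadratically with the number of basis functions.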
Kwatny and Shen suggested sampled-data Laguerre functions for the orthogonal discrete weighting functions “Hamil ' The chief reasons for using the type of representation in (1.1.4) are the genrality in describing nonlinear, time- varying deterministic systems and the linearity in the unknown kernels, hn(°""")’ or the unknown parameters of (1.1.4). However, practical difficulties arise since the required number of orthogonal functions and coefficients is usually much greater than the order of the System. IAlso, in the linear case, for example, the representation (1.1.4) does not directly reveal the natural frequencies of the system, or alternatively, the natural modes or eigenvalues of the system. (b) The Frequency ReSponse Function A useful mathematical representation for linear, time- invariant systems results when the Fourier transform is used in connection with the system equation given by a (1.1.5) x =I h(¢)u(t - 1')d'r -@ Here {u(t)} is assumed to be a wide-sense stationary process, h(-) is the system impulse response function and {x(t)} is the system output process. An estimate of h(o) is obtained implicitly by estimating its Fourier transform. This is accom- plished by evaluating the power Spectral density of the output and the cross-Spectral density of the input and output, and then applying the Fourier transformed version of (1.1.5) given by (1.1.6) Sxx (w) = Sux (w)H(w) Here, the power Spectral density of the output process is as (1.1.7) Sxx(w) =I E[x(t)x(c + T)]exp(-jw*r)d'r and the cross-spectral density of the input and output pro- cesses is (1.1.8) Sux(w) ==I E[u(t)x(t + ¢)]exp(-ij)dT The frequency reSponse function is given by a (1.1.9) H(u.)) =I’ h(t)exp(-jwt)dt -a Time averaging may be used to estimate the correlation func- tions in (1.1.7) and (1.1.8) if the input process {u(t)} is ergodic in the wide sense (P2). Evaluating Sxx(m) and Sux(w) gives the system frequency response function H(w) using the non-parametric representation (1.1.6). Some smoothing pro- cedures and goodness-of-fit criteria have been develOped for use in conjunction with this calculation (B4, A1). The appli- cation of (1.1.6) leads to optimal (minimum mean-square error) estimates of h(-) (L2, Jl). In order to use spectral methods to find the frequency response function, i.e., the Fourier transform of h(-), the assumption of causality (h(t) = 0 for t < 0) is dropped. However, if the data are actually taken from a linear physical system, the realizibility condition will automatically appear in the estimate of h(-). In practice, certain difficulties arise. The model (1.1.5) is restricted to linear time-invariant systems with stationary inputs. The procedure requires a large set of input-output data samples in order to obtain good estimates of Sxx(w) and Sux(w) which are to be computed for all m, as opposed to computing a finite parameter set. Furthermore, perfect observations of the input-output variables should be assumed. The primary advantage of this representation is that the order of the system need not be known. (c) The State-Space Representation The unknown (causal) system may be described by a system of differential equations written in the form of an automomous or dynamical first-order stochastic system. 
Such a form is the state-space representation given by (1.1.10) §< = §(X(t).U(t),W(t),a,t) y(t) = H> v(t) = G(u.p where x(t) is a vector-valued function representing the system state and Q(-) is a vector-valued function of the state x(t), the input vector u(t), an internal noise vector w(t), a con- stant vector parameter a, and time t. The observable quan- tities, y(t) and v(t), are noisy observation vectors of the output and input, respectively. As such they depend on the noise vectors w(t) and p(t) according to the functions H(-) and G(-), reSpectively. This representation encompasses a wide variety of nonlinear, time-varying, causal systems. In the form (1.1.10), the functions Q(-), H(-) and G(-) are assumed known and the unknown system parameters are lumped into the vector a- In the discrete-time case the continuous-time system (1.1.10) may be written as a discrete-time model given by (1.1.11) xk+1 = §k(xk,uk,wk,a) yk = H(xk"Ik) = G(uk.pk) where the dimensionalities remain unchanged and the index k replaces the independent variable t in (1.1.10). Thus, (1.1.11) serves as a model when the data is sampled periodically. Generally such differential-difference models are convenient because the number of parameters may be selected in advance. Also Since there is no practical way to determine §k(-), H(-), and G(-), their forms are assumed based on available information about the system and the kind of model one wishes to fit to it. If the system is assumed linear and time in- variant, then (1.l.11) becomes (1.1.12) x k-I-l Axk + Buk + wk Yk “HXk+I‘k v =Guk+pk where the unknown parameter a is contained implicitly in the matrices A and B. When the system is a single-input single- output system, (1.1.12) reduces to a linear difference equation with unknown coefficients (L2) such as n to (1.1.13) x(k) = z aix(k-i) + z: b.u(k-i) + §k i=1 i=0 1 where the noise variables are lumped into the random variable gk and the coefficients ai and bi are unknown. Thus the generalized version (1.1.11) may be specialized to a variety of cases depending on assumptions characterizing the unknown system. The intent here is to use the most practical form of model. Since data are often acquired by sampling, discrete- time models will be considered. Furthermore, experience has shown that most practical systems can be adequately described by a difference model of low order (32). Hence, in the interest of practicality and flexibility, only models of the form (1.1.11) will be discussed in this thesis. 1.2 A Survey of System Identification Techniques System identification methods fall into two categories: on-line and off-line methods. The on-line approach indicates a procedure that updates the estimate of the system parameters with each new set of observations, so the on-line estimator is recursively defined in terms of its last value and the most recent set of observations. The off-line method uses the entire data set to perform a single calculation to estimate the unknown parameter. The literature seems to indicate that on-line methods are the most flexible because systems that are nonlinear and time- varying and that are excited by nonstationary inputs can usually be treated by on-line methods; in contrast, off-line approaches are usually restricted to Stationary linear systems with sta- tionary inputs. However, consistent estimates are obtained more easily off-line since on-line estimators are often biased (A4, Bl). The literature on this subject is too extensive to review completely here. 
However, some recent results related to the identification of the model (1.1.11) will be surveyed. (a) The Maximum-likelihood Method The maximum-likelihood estimate of the parameter a is * the value of the estimator a of a for which the conditional 10 density of the system output given the past inputs and outputs and estimate a* is maximized. Consequently, computing the maximum-likelihood estimate of 0 requires a knowledge of probability densities, specifically the likelihood function for the observable random variables. Gaussian, independent random variables are usually assumed for the noise since these lead to analytic expressions for the likelihood functions (D2). This method has been verified in the linear time-invariant case (1.1.13), where the {gk} are independent and Gaussian. In particular, an off-line approach gives a maximum-likelihood estimator for a which is consistent, asymptotically normal, and efficient (A4, 85). In his approach, Astron (A4) used the Newton-Raphson method and Steiglitz and Rodgers (SS) used a Gauss-Newton method to compute the maximum of the likelihood function. Both of these methods are iterative gradient search schemes for recursively estimating the maximum using the gradient of the likelihood function. The function to be maximized is defined for (1.1.13) as S: [ZVar(§k)]-1 - constant 1 (1.2.1) r II WIN 2 where gk is intrepreted as the error between the measured output x(k) of (1.1.13) and the predicted output which is n m 2 a,x(k-i) +' Z biu(k-i). Derivatives are taken with reSpect i=1 i=0 to the unknown parameters and are used to construct the iterative numerical solutions, as an analytic solution is not possible. 11 Generally, the limitations associated with this method include: (1) the amount of computation required increases rapidly with the size of the data set N, and (2) there exists an uncertainty that a computed maximum is global instead of local. (b) Learning Techniques A variety of methods under the heading of "learning" techniques have been studied for system identification ranging from statistical methods to deterministic procedures (Cl, D3, F3, F2, CZ, N1, R2, B3). Such techniques have been used in conjunction with the performance criteria for determining goodness-of-fit determined by minimizing the loss function * 2 (1.2.2) HekHZ = H0 ‘ “k“ where is the error between the unknown parameter a of 6k * (1.1.11) and the estimate after k observations ak’ while H'H denotes the Euclidean norm for vectors. For some models in the class of systems (1.1.11), statistical learning techniques will minimize the expectation of the loss (1.2.2), that is, the mean-square risk function (1.2.3) EHekHZ = EH0 - GZHZ * Here the estimate is the expected value of the estimator ak given the past observations (D2). AS with the maximum likeli- hood method, a prior knowledge of the noise distributions is required; gaussian assumptions are usually made in order to * obtain analytic expressions for the estimator akf With such 12 gaussian assumptions, on-line estimators for minimizing the risk (1.2.3) have been develOped with the help of Bayes rule for the special case of (1.1.11) as in (1.2.4) below (D3, F3). (1.2.4) xk+1 = §k(xk,uk)a +-wk yk g xk + ‘I‘k Vk =‘fic + pk Under reasonable approximations and assumptions, consistent estimators for a were found. At the other extreme is the deterministic approach. A simple example of an application of deterministic learning occurs with the deterministic model (1.1.13) resulting from setting the random gk's to zero (N1, R2, B3). 
The model (1.1.13) becomes n In (1.2.5) x(k) = : a,x(k-i) + z b,u(k-i) i=1 1 i=0 1 and the algorithm is given in two parts by (1.2.6) a’i‘(j+1) = a:(j) +Aj[x(j) - z(j)]x(j-i), i=1,2,...,n and (1.2.7) b:(j+l) = 13:0) +Aj[x(j) - z(j)]x(j-i), i=0,1,...,m where a:(j) is the estimate of the unknown scalar ai at the jth iteration of (1.2.6). The quantity b:(j) is defined similarly, and the scalar Aj is given by n m —1 (1.2.8) A. = [2: x2(j'i) + Zu2(j"i)] - c 3 i=1 i=0 13 where 0 < c < 2. The scalar variable z(j) is the projected prior estimate defined as m n * * (1~2-9) z(j) = Z 31(j)X(j'i) + 2 bi(j)u(j'i) i=1 i=0 Roberts (R2) showed that the loss function, similar to (1.2.2), given by n m (1.2.10) .10) = z (ai - a’i‘unz + 23001 - bjonz i=1 i‘ is strictly decreasing with increasing j, under liberal assumptions on the initial estimates {a:(l)} and {b:(l)} and the input. Similar convergence arguments were given by Nagumo and Noda (N1). (c) Least Squares and Stochastic Approximation Methods A commonly-used performance criteria for parameter estimators is the minimization of the quadratic loss function. For estimation of the parameter a in unknown discrete systems, such as those special cases of (1.1.11) given above, the quadratic loss function is defined as * k 2 (1.2.11) Jk+1(a ) = iElIIyi+l - 91+1II where yi+1 is the system output measurement at time i+l and y is the computed value of the model output which i+l O * 0 depends on an estimated parameter a , Input-output measure- ments and the system equations which are assumed known except for the system parameter a- 14 Application of the criteria (1.2.11) represents the least-squares approach for fitting the model to the unknown system. The method consists of (l) finding the value of a* which minimizes the loss function (1.2.11), and (2) showing that this value approaches a in some sense, e.g., mean-square or with probability one, as k increases to infinity. The important features of this method are its computational sim- plicity and that no assumptions defining the distributions of the noise variables need be made. If knowledge of these dis- tributions is available and appropriate weighting of the data is used in (1.2.11), the least-square estimator exhibits behavior Similar to that of the maximum-likelihood and Bayesian estimators. However, depending on assumptions made regarding noise variances, the least-squares method leads to biased estimates of the un- known system parameter 0. Generally, the bias will appear small if the noise power is small relative to the excitation power. Also, in an on-line formulation, bias is usually a problem with all estimation methods and consequently an estimate of the resulting error is required. The least-squares method has been proven effective in the linear case (1.1.13) when the noise variable §k is assumed to contain both system and input—output observation noise, i.e., the input and output are observed through additive noise (S4, C3, H3, H4, P1, H5, K5, L2). In this case the estimator is defined recursively in terms of its last value and the most recent set of observations. In particular, the estimator takes the form: 15 1 2 12 * - * + P z k+l z' * ( ° ' ) ak+l “k k+l k 0 and p > 1 and a set of integers l = VI’Vz --- with pk = v - v k+l k and 0 < pk s p where Jk = {vk,vk + l,...,vk+1 - 1} such that ian inf I . 2 QI(y.,v.)§.(y.,v.flfl 2 I a.s. 
k {yj’vajEJk} m1n[jEJk J J J J J J 0 where §j(yj,vj) is the known nXm matrix transition function of the observables yj and vj, the func- tion 'xmin(') takes the value of the minimum eigen- value of the matrix placed in its argument, the ' denotes the matrix transpose, and a.s. means almost surely (or, equivalently, with probability one). It is also assumed that 22 i' (y.vi)i (yjjw) infl: Inf Amin< 2 “j. (31’, VjfljIj “Q (y V )II)] 2 T. > 0 a.s. {yj.vaj€Jk} jeJk j j’ j j’ JI Assumption A5 implies that there is a partition of the integers into sets such that in the kth set Jk’ any random observable output sequence {yjIj 6 JR} and random observable input sequence {vaj E Jk} that occurs for which <2k I. (y. ,v. )0. (y. .v. > J6 )‘min occurs with zero probability. Further implications of this assumption and the next assumption A6, are discussed after stating A6. A6: Using the notation of A5, there is a real constant T > 0 such that, mjn Am ax(9j(yj,V. .)9. J(yj.vj)) k inf inf r I 2 T a.s. k [{yj,vj‘jEJk} mjx xmaij (yj svj)§j (yj :Vj)) k where kmax(.) denotes the maximum eigenvalue of its argument. The assumptions A4, A5, and A6 constitute an identi- fiability criterion for unknown systems using the identification method developed here. Assumption A4 is certainly a necessary condition, for if part of the system were not connected into the input, but were connected into the output, then the system would not be controllable from that input; conversely, if the same part were not connected to the output, then the system would not be completely observable from its output. The 23 necessity of this condition is discussed more generally by Kalman, Lee and Aoki (Kl, L2, A3). Assumption A5 has the following geometric intrepreta- tion which explains why it is necessary for identifiability. For a set of p nXm.matrices Fk with I kl kl (2.2.1) F; = (f: , £2 ,..., fn ) , k = 1,2,...,p k where the fi are row vectors of Fk’ for each k (V1) <2 2 2) < p f [ < p ] .. ), )3 F'F ) = in z' 2 F'F )2 min k=1 k k IIZII=1 k=1 k k P P n 2 = inf [ 2 (sz)'(sz)] = inf 2 Z (fiz) ] HZH=1 k=l qu=1 k=l i=1 The inner product IszI can be intrepreted as the distance between the vector f: and a hyperplane passing through the origin having 2 as its unit normal vector. Then, the Xmin defined above is the sum of all such squared distances between the vectors f? and this m-l dimensional hyperplane. This is indicated geometrically in Figure 1 below for m=3 and p=l. A Figure 1. 1 Hyperplane H, With Unit Normal 2, and Vector Distances IfizI From H 24 If A or the sum of squared distances is positive,then the min vectors f: have spanned the m-dimensional Euclidean space; if the sum is zero the vectors f: have at most spanned the (m-l)-dimensional space of the hyperplane, i.e., they lie in the hyperplane. Similarly, assumption A5 requires that the sequence of matrices {§j(yj,vj)Ij = 1,2,...} can be parti- tioned into finite groups indexed by the sets Jk’ k = 1,2,... in such a way that the row vectors in each group {§j(yj,vj)Ij 6 Jk} Span the m-space. This is guaranteed for almost every set of observations {(yj,vj)Ij 6 JR} for each k = 1,2,... in such a way that the sum of squared distances is bounded away from zero with probability one. That is, (2.2.3) inf z' E §!(y.,v.)§.(y ,v ) z 2 )( >0 with probability one or for almost every observation set {(yj,vj)Ij E Jk} corresponding to the kth interval, k = 1,2,... 
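Assumption A5 can also be checked numerically on a record of observations: for each block J_k one forms the Gram matrix formed by summing the products of each transition matrix with its transpose and verifies that its smallest eigenvalue stays above some fixed positive bound. The sketch below performs this check; the block length p and the randomly generated regressor matrices are stand-ins for the actual transition matrices of a particular system.

import numpy as np

def excitation_margin(Phi_block):
    """Smallest eigenvalue of the block Gram matrix (the quantity bounded below in A5)."""
    gram = sum(P.T @ P for P in Phi_block)        # m-by-m symmetric matrix
    return float(np.linalg.eigvalsh(gram)[0])     # eigenvalues returned in ascending order

# Stand-in regressors: n = 2, m = 3, block length p = 4, exciting by construction.
rng = np.random.default_rng(1)
p = 4
blocks = [[rng.normal(size=(2, 3)) for _ in range(p)] for _ in range(50)]

margins = [excitation_margin(block) for block in blocks]
print(min(margins))    # A5 requires this margin to stay bounded away from zero over all blocks

A strictly positive margin on a block means the row vectors of the transition matrices in that block span the m-dimensional parameter space, which is exactly the geometric condition pictured in Figure 1.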
The necessity of this requirement is evident from con- sidering a direct estimate o* of a from the defining equa- tion (2.1.2); that is, using the equation, * (2'2'4) yk+1 ' Q1<(y1<"’1<)"’ + wk + Ik+1 ' * I . Note that Qk(yk,vk)§k(yk,vk) must have max1mum rank or a cannot be determined, even in the noise-free case. If I . §k(yk,vk)§k(yk,vk) does not have maximum rank for some k then a solution is still possible if the sum of a finite set of equa- * tions like (2.1.6) gives the coefficient of a maximum rank. 25 Thus, with noise present it is necessary that the row vectors of §k(yk,vk) have a non-zero projection on all m coordinate axes of the m-space containing a infinitely often as k tends to infinity, as in (2.1.4), in order that a sequence of estimates of a may converge to a- Assumption A5 requires this pro- perty in §k(yk,vk) in a precise manner. Furthermore, A5 may be interpreted as a requirement on the input sequence {uk}, and indeed, for linear time-invariant single-input single-output systems, it reduces to an input requirement (A4). In this case Astrom (A4) found it necessary for an input covariance matrix of appropriate dimension to be positive definite giving the result that the input excites all modes of the system or all components of the unknown vector. This property is analogous to the characteristics of assumption A5. Assumption A6 completes the conditions for an identi- fiability criterion in the sense that it is a necessary con- dition imposed on the input and system function §k(-,') to Obtain an estimator a* that converges in some probabilistic sense to a- The assumption A6 is quite reasonable in practice and as such, with A4 and A5, it renders the system identifiable in the sense suggested by Aoki (A3); i.e., that there is a of a such that lim & = a in sequence of estimates k “k some sense. In conclusion, the assumptions Al through A6 are practical and realistic in the sense that wide-sense stationarity and prior knowledge of the noise distributions are not assumed. Furthermore, the assumptions are well-suited to the on-line case where the 26 effect of the initial state of the system is insignificant and the system is identified from its actual input and output as opposed to a preselected input, such as white noise. The assumption that the system is controllable and observable, as well as stable, is also reasonable from a physical and practical vieWpoint. 2.3 The Identification Algorithm In order to judge how well the model fits the unknown system, it is necessary to define a loss function as a measure of "fit". In the present case, a logical choice is the quadratic loss function which is a function of the random error given by k A _ A 2 (2'3'1) Lk+1(°‘k+1) iElII§i(yi’vi)ak+l yi-I-lII where the norm function H-H is the Euclidean norm for vectors, &k+l is an estimate of a based on k+l observations of the systems input and output, and the remaining symbols are defined in (2.1.1). This function does not depend explicitly on noise covariancesand, as it is quadratiC, its minimization in &k+l is in accordance with the least-squares criterion for estimating o- If some knowledge of the error covariance is available, then the loss function (2.3.1) can be weighted by replacing each term on the right-hand side with A 2 _ A I II§i(yi’vi)°’k+l ‘ yi+lIIVi ’ (§i(yi’vi)ak+l " yi+l) ' vi(§i(yi’vi)ak+l ' yi+l) ’ 27 where the V1 are positive-definite matrices of appropriate dimension. 
These act as weights for the different residual errors given by the quantities i§i(yi’vi)&k+l - yi+1}, and are defined as the inverse of the covariance matrices of these errors. In the arguments to follow, the prior weighting of the quadratic loss (2.3.1) plays no significant role and since the error covariances are not known, no such weighting will be included. The objective is, then, to find the &k*l which yields the minimum value of the loss Lk+l(&k+l) given by (2.3.1)- This is random, as is the loss. However, its value is a'k+l determined from the data with (2.3.1) in a deterministic sense, and then its characteristics are examined in a probabilistic sense. With the application of this loss function, two assump- tions are now required in addition to the set Al through A6. A7: As was stated previously, the unknown system (2.1.2) is assumed to be stable (in the asymptotic sense) (P4, L3); it is further assumed that the model of the system (2.1.2), defined with the notation of (2.1.1), namely, (2.3.2) yk+1 = ik(yk.vk)ak . is also stable. This means that the system (2.1.2) is also stable for 0 replaced with any estimate O’k+1 °f °" If the condition A7 is not available, then all estimates of a may be constrained to a closed and bounded m-Space containing 28 a corresponding to the region of stability. In this event, the arguments to follow for the system identification algorithm and proof of its validity are not changed. A8: It is assumed that the loss function (2.3.1) has a unique minimum in &k+1. Assumption A8 has special importance for the case when systems of the type (2.1.1) are considered. This class will be considered later. Now consider the minimization of the loss function * (2.3.1). Let “k be such that * A 1 . * (2.3.3) Lk(ak) < Lk(ak) for a l ark 7‘ ak . * Then, ak gives an optimal fit, in the least-squares sense, of the model to the data based on k observations. Under certain assumptions on the noise, it is an unbiased estimator of the * unknown parameter a as well. In such cases ak is a least- squares estimator of a, (H1, D2). Otherwise is a biased * “k estimator of o and the bias must be evaluated. In a gaussian a I * O formulation, w1th prOper selection of the Vi’ ak gives the Bayesian estimator, or conditional expectation, of a, (A2, A3). Such assumptions are not made here and so the loss (2.3.1) will be considered as it stands. The minimization is approached by * * k+l in terms Of a and the most recent observations finding a k * of the system's input and output, (Vk’yk)‘ This defines ak+l as a recursive estimator. The resulting recursion is a well- known consequence of classical least-squares theory (H2, L2, A3). 29 Its development in the context of the present problem is outlined here. For convenience, let (2.3.4) Fk = kak’vk) and Fk = §k(xk,uk) Then (2.3.1) may be written as (2'3'5) Lk+1(°’k+1) Lk(°’k+1) +IIFlcc"1<+1 yk+1II ' If 2 3 6 * - “ + ( ‘ ' ) OZk+1 ak+l Ac’k+1 ’ * “ d ' ' . . where ak+l and ak+l are efined 1n (2 3 3), then for any A * a ak+l’ ak+l must satisfy k A A * ' - = (2°3°7) (A“k+1) iElFi(Fi°‘k+1 yi+l) 0 ' This result follows from a direct substitution of the right side of (2.3.6) into the right side of (2.3.3) and then per- forming the indicated cancellations and algebraic reductions as follows. Let k = 1 and drop the subscripts k+l in (2.3.6). Then (2.3.3) implies I210? - y.\\2 s n29 - yznz .. \Iila*\\2 - 2.291; + 11sz2 s Ifqanz - 2y; 2,2 + IIszIz = HF1(o* - Aa)H2 - 2yé(§1)(a* - Au) + HYZHZ so that 30 t * 2 . * 2 ~ * 2 . * . 
IIFla II " 23'2' Fla + IIYZII 5 IIFIQ’ II " (Fla )Fl M A A * A 2 A * A 2 - Aa'Fl'Fla + IIF1 AaII - ZyéFla + 2y2'F1Aa + IIY2II Therefore 2 * IA “ 2 l " -2(F1a ) FlAa +IIF1Aa/II + 2y2 Flm 2 o and this is true if and only if o* is such that '1? '(i‘ * ) = 0 A“ 1 1°‘ " y2 Application of this result to each of the k terms in (2.3.3) gives (2.3.7). Now specialize (2.3.6) to the case * * that is where the arbitrary estimate &k+1 is particularized * to ak' Then )=L(*+ )= kIF<*+Aa) II2 L1<("'k+1 k “k A"’1.<+1 121‘ 1 “k k+l y1+1 k A * 2 A A ' A * 1.’,:'=1I:IIFi°’k ' yi+lII + IIFi A"’1<+1II2 + (ZFiMk-I-l) (Fiak ‘ yi+l)} and the last term on the right is zero by (2.3.7). With this relation the definitions (2.3.1) and (2.3.8) give * _ * a * 2 Lk+1(°’k+1) " Lk(°’k+1) + IIFk(°’k + Mk+1) ' yk-I-lII k—l ‘7 .EIIIIFi“: ' 29.1““ + Iiimmuzi + Hike: - ykfluz +\\2.Aa...1\\2 + 2(FkMk-I-l) (Fkak yk+1)’ 31 Hence * = * A * - 2 ' A. A * (2'3'9) Lk+1(°’k+1) Lkmk) + IIFkO’k y1<+1II + 2 The mean-square convergence theorem is now considered. The gist of the proof is described here. First, the algorithm is considered within an interval, Jk’ of observations. The true parameter a is then subtracted from &k, the algorithm is written in an iterated form, and the algorithm equation is 41 squared. Frequent applications of the Cauchy and triangle in- equalities are made, and expectations are taken. It is then observed that mean-square convergence will occur if the sequence of estimates q A a - a converges in the mean-square sense; Vk Vk i.e., the sequence of estimates corresponding to the first or th . vk observation in each interval Jk’ k = 1,2,... . Second, the set of estimates {qvk} is considered. A similar con- struction is made from iterating the equation, squaring, using the triangle and Cauchy inequalities and taking expectations. The resulting expression is considered term-wise with the help of Lemma 4 (Appendix). Convergence is finally obtained by achieving bounds for these terms which tend to zero. The Appendix and assumptions Al-A8 are applied frequently. The mean-square convergence theorem for the estimator &k is stated and proved below. Theorem I : Mean-square Convergence Under the assumptions Al through A8 of section 2.2 the estimator , given in (3.1.2), 57 k converges in the mean-square sense to the unknown parameter a of the system (3.1.1). Proof: First substract o from each Side of (3.1.2) and let Then (3.1.2) becomes (3°13) q1<+1 = qk ' Ak(Hqu ' Yk) 42 In accordance with assumption AS the sequence of observations {(yi,vi)Ii = 1,2,...} is partitioned into sets indexed by the sets of integers Jk where v - 1}, k = 1,2,... k+1"'°’ k+l (3.1.4) Jk = {vk, v and the length of the partition is uniformly bounded such that k+1 Vk " pk S P < “ ‘ Now let n be arbitrary such that n 2 p. Then n belongs to one of the index sets Jk’ for some k. Fix this k, and using (3.1.3), iterate (3.1.3) with itself from n back to the first index in the set Jk’ namely Vk’ getting 11 n n (3.1.5) qn+1 = .9 (I - AjHj)qv + .§ ._H (I - AiHi)Ajvj J-vk k J-vk i-j+l where factors in the product terms are indexed in descending order. Premultiplying (3.1.5) by its transpose gives, with application of the triangle and Cauchy inequalities for matrices (V1), I _ 2 n 2 2 (3°1'6) qn+1qn+1 ” an+1H S .H “I " AjHjH qu H J=vk k II n I n .. I I + 2 q n I - A.H. z n I - A.H. A v, n n 2 + <2 H II -A1H1IIIIA,-vj\\> j=vk i=j+1 where the norm function H'H is defined in the Appendix. 
Now according to Lemma 4 (Appendix), there is an MI > 0 Such that 43 (3.1.7) igj HI - AiHiH 3 M1 a.s. k for all sufficiently large k. Applying (3.1.7) to (3.1.6), taking expectations and noting that IIAv H 2 “Ajn for all k j E Jk’ gives (3.1.3) EIIqm.1II2 s miEIIq,kII2 + ZMiE + IA, Min? IIvaI>2 . k j=vk Applying the Schwarz inequality (L4) to the second term of (3.1.8) and noting that assumptions Al and A2 imply that there is an M2 > 0 such that n 2 2 (3.1.9) E( 2 IIvaI) N as in Lemma 4, and taking expectations, (3.1.13) becomes (3.1.14) EHqV H2 ‘2 k+1 .<. <1 - c'uA, .1\\>2 Enqvx k+1 k "1<-+1'1 + 2 ' ' - A E(qv¥§k jEJk i=g+1 (I iHi)AjvJ-) + miuAv H2 E< 2 uvjufi . k jEJk Now in order to complete the proof it remains necessary to show that the right side of (3.1.14) tends to zero. In order to accomplish this consider first the second term of (3.1.14). In the second term the product term may be expanded to get 45 "M1"1 j l = . (3.1.15) Qk i n (I - AiHi) (I + 2;: AiHi) =J+1 l-V k 2 - igJ (A111i + HiAi) + Ak ij , k where 2 2 2 mm A, 4 ,2 M M 1€J k and (3.1.17) ij Q (all remaining product factors)-A;(2 and note that ij is uniformly bounded with probability one by Lemma 4 (Appendix). Then with (3.1.15) the second term of (3.1.14) may be written as -1 vk+1 (3.1.18) 2 E(q' Q' 2 n (I - A.H.)A.Y.) vk 1‘ jeJk i=j+1 1 1 J J J' =2E'I+ A,H A,,-2E' BAH +HA,A (qv ( .§ 1 i) 3Y3) (qv (. i ' i 1) ij) k 1-vk k JEJk 2 + 2 ' A k In order to obtain the required bounds on the second term of (3.1.14), consider each of the three terms on the right side of (3.1.18). Dropping the coefficient 2 and noting that Yj is zero-mean, the first term may be written as follows 46 (3.1.19) EE(q' (1 + 2 AH mum )lq > Vk i=vk vk =EE(q' (2 AWHMqu vk i=vk Vk) and by the Cauchy-Schwarz inequality (L4) J' s EEnqvku> and by assumptions A1 and A2 there is an M3 > 0 such that s M3HAka2 Enqvkll . The second term on the right side of (3.1.18) may be evaluated the same way using the Cauchy inequality and assumptions Al and A2; this gives (3.1.20) EE(q' (2 AiiH +HA i)Ajjv |q ) Vk 131k vk s EE(qu “(1' ZHAiHHHiH>\\AJ-\H\Yj\\\qv > k 1€Jk k s M.nAvknznuqvku for an appropriate M4 > 0. Likewise, the third term on the right side of (3.1.18) may be evaluated using the definition of A: given in (3.1.16), the Cauchy inequality and assump- tions Al and A2 as before. The result is 2 3.1.21 EE 'A G ( ) (qVk k jk Aj wjlq s EE(qukH HAVRHZ [iéijHiHZ JHGJ-kH uAju ijll \qvk) s nAvku2 MSEuqvkn . 47 where M5 is again appropriately selected. Then, defining M 6 to be (M3 +-M. +-M5), (3.1.18) becomes 4 "k+1'1 2 (3.1.22) E(q' qt 2 n (I - A,H.)A. ) S M A E . Vk k iEJk i=j+1 1 1 JV] 6“ VkH “(1ka Since generally a2 + b2 2 ab and by Jensen's inequality (D2) 2 2 (3.1.23) E qu H s Equ H , k k the right side of (3.1.22) becomes 2 2 2 2 2 (3.1.24) MHA H E q H SMHA H Eq H + A | . 6 vk H Vk 6 Vk H vk H vk‘ With the bounds thus obtained for the right side of (3.1.22), and hence for (3.1.18), the right side of (3.1.14) may be bounded as follows 2 2 2 (3.1.25) Eq 5 (1 - c' A _ ) E q H n ,an u v1... .1 n .k 2 2 2 2 2 2 2 +MHA H EHq H +HA H +MHA H EH: Hv-H> - 6 Vk vk Vk 1 Vk jeJk J In the first and second terms on the right note that there are constants M , M , and N >~N, where N is defined in 7 8 1 Lemma 4 (Appendix), such that (1 - c'nAka-1\\>2 + M:\\A,kuz s 1 -M7\\A,kn and M8 2 E( z HY.H)2-M2 . J 1 JEJk 48 for all k 2 N . 
Then (3.1.25) reduces to 1 (3.1.26) Equk+1H2 s (1 - M7HAVkH) Euqkaz + MBHAkaZ k 1 n 2 = n - M A E H: 7H.j> qu1\\ k k W 2 n <1 -M7uA., n2>nAv n"- j=N1k=j+l i J' The first term on the right side evidently tends to zero as k tends to infinity. Also since 2 “AV H2 < a , kle k Lemma 8 (Appendix) implies the second term tends to zero as well. Hence 2 (3.1.27) lim EHq H = o , k V k+1 which (3.1.26) now implies. Consequently the subsequence {qv } converges to zero in the mean-square sense, and this k completes the proof of mean-square convergence of the algorithm (3.1.2). Q.E.D. 3.2 Almost-sure Convergence The convergence proof presented here shows that lim Hc‘rn - a\\ = lim anH = 0 n n with probability one or almost surely (a.s.). The steps taken 49 to accomplish this proof are as follows. As before, the algorithm is first considered on the Jk interval of observa- tions. The algorithm equation is written in an iterated form, norms are taken and the Cauchy and triangular inequalities are applied. It is then observed that the convergence is proven when it is shown that the subsequence of estimates {qvk} con- verges to 0 a.s. Then, the algorithm, written iteratively in terms of the subsequence {qV }, is considered termwise for finding bounds which tend to zero with probability one. The required bounds are obtained with the help of some algebraic relations due to Albert and Gardner (A2) and the Lemmas in the Appendix. This convergence proof, as with the mean-square case, may be stated in the form of a theorem. Theorem 11 Almost-sure Convergence Under the assumptions Al through A8 of section 2.2, the estimator ak given in (3.1.2) converges with probability one to the unknown parameter a of the system (3.1.1). Proof: The proof begins in a manner similar to the previous mean-square proof. The algorithm is written in terms of the error qk A ak - a, and then, the equation is iterated back to the first observation in the J interval. First let n k be arbitrary; then n belongs to JR for some k. Iterating the recursive algorithm (3.1.3) of qj from n back to the first estimate of the set Jk’ namely q\J , gives the same result k as (3.1.5) which is repeated here as HR. -|2\~ _ 50 n n n (3-2-1) (1 = II (I " A.H.)q + )3 1'1 (1 ' A.H.)A Y. , where Jk is given in (3.1.4) and the other variables are given in (3.1.2). Taking norms in (3.2.1) is the next step toward finding bounds for those terms on the right. With frequent use of the triangle and Cauchy inequalities the normed version of (3.2.1) becomes 3. n lux - AininnAjvjn (3.2.2) nqnflu s 3 HI winiu nqv n + °_ k k J k J-v By Lemma 4 (Appendix) there is a constant M1 > O which bounds the product terms on the right of (3.2.2), as indicated in (3.1.7), and as a result (3.2.2) gives (3.2.3) an+1H s Mlqu H + pM1( max HAjyjH) a.s. k jEJk In (3.2.3) the p is the uniform bound over the length of the Jk intervals as given in (3.1.4). Since the bound M1 holds with probability one, (3.2.3) necessarily holds with probability one. Now Lemma 1 (Appendix) implies that the partial sums of (Ajyj) form a Cauchy sequence so that (3.2.4) lim ( max HA,y H) = o a.s. k 363k 3 j Therefore, almost-sure convergence of the estimator (3.1.2) is proven when it is shown that (3.2.5) lim Hq H = o a.s. k Vk 51 In order to accomplish (3.2.5), it is first necessary to con- struct an algebraic relation using the following definitions: n (3.2.6) R Q R,(n) 9 n (I - A H ) for j = 1,2,...,n J k- kk “J Q I for j > n where product indexing is in descending order, and (3 2 7) s Q :4 A with s Q 0 . . 
k+l . -Y- 1 . J=1 With these definitions, n n n (3.2.8) 2'. 1'1 (I - A. 1.11 1)A y = Z‘. R v k=1 i=k+l k kk=1 k+1A k k n n n =£R(S -S)‘2Rs-2:Rs k=l k+l k+1 k k=l k+1 k+l k=l k+l k n-l n = 2 R s + s - 2 R s k=1 k+1 k+1 n+1 k=l k+1 k n n = Z R s + s - 2 R s k=1 k k n+1 k-l k+l k n = (flak-+1“ " Aka)sk ' Rk+131<) + Sn+1 n = ' E Rk+lAkfikSk + Sn+1 k-l n n = A - - + - A £21 Rk+1 kaS (I R1)S Sn+l £21 Rk+1 kflksk where the first two terms on the right side are equal by Lemma 6 (Appendix) and finally 52 n ‘ R18 + (sn+1 ‘ S) ' £21 Rk+lAka(sk ' 8) Now let (3.2.9) n = v - l and substitute this value for n into (3.2.1). By sub- stituting (3.2.8) into (3.2.1), (3.2.1) becomes K k "k+1'1 (3-2-10) q = 11 (Q q ) + Z 2 1'1 (1 - A.H.)A Y Vk+1 j=l 3 v1 j=1 rer i=r+l 1 1 r r E Q ( + ) + ( ) = , q s s . - s j=l 3 V1 Vk+1 _ K ?-j21 rgJ Rr+l(vk+l - 1)ArHr($r - s) 3 Taking norms in (3.2.10) gives k l \ H (3.2.11) q 5. 1'1 Q q + s + s -s + z 2 R (v -l)AH(s -s) . Hj=l rEJj r+l k+l r r r H As k tends to infinity, the limit of the first and second terms on the right is zero almost surely by Lemma 5 and Lemma 1 reSpectively (Appendix). Also, for every r E Jj’ k (3.2.12) R (v -1)H s n Q HM a.s. H r+l k+1 i=j+1u i 9 when M9 > 0 is selected such that vj+1-1 (3.2.13) H 1'1 (1 - AiHi)H s M i=r+l a.s. 9 53 according to Lemma 4 (Appendix). The third term of (3.2.11) is k (3.2.14) ngl rEJer+1(vk+1-1)Arur(sr - s)H k k s M, 2 .n HQiH 2 (HArH HHrH>Hsr - sH a-s- j=l 1=j+l rer k k s M. 2 .r; HQiH 2 <1A.HHH.H> j=l 1=J+1 rEJj rEJj Now note that with the definition (3.1.16) of A: , (3.2.15) 1 s [ 2 “Ar“ HHrH] A31 rEJj 2 HA HZHH H2 + 2' 2 H H HH H HA H HH H A: _ reJJ r r k,rer,k#r Ak k r ' r : HAHZHHAF rEJj ’ (2.,.(p2 - 1» max \1A.1\2\\H.Hz> s 1 + r613 1 max HA.\\21\H.\\2 rEJj = (92 - p +-1)% 6,. where the left inequality holds because of the generally true inequality (3.2.16) 2 \a,| 2 ( z \a \ ) . 3:1 .1 3:1 3 Therefore, the right side of (3.2.14) is bounded as follows 54 (3.2.17) M, jg ijfluqiu £1, cuArn HHrH>Hsr - s\\ k k S Mgp' 3'21 i=ljll+1HQiH “322:: “Sr - SH) N N k = Mgp' 531 i=gI+IHQiH A, 1 =1EIHHMH (r213: HS.- ' SH) k k + Mgp' j=§+l i=gl+lHQiH (522:: H3,. ' SH) k k S M10 1=HII+1(1 - c A1) +IM11(1/C)j;§+1bkj(r2j: Hsr - S“) where N and c are defined in Lemma 4 (Appendix), Ak is given in (3.1.16), M10 and M are appropriate constants and 11 k (3.2.18) b . = c A. n (1 - c A.) 1” J i=j+l 1 The bound developed in (3.2.17) for the right side of (3.2.14) is the quantity which bounds the last term on the right side of (3.2.11). The proof is completed by showing that this bound tends to zero. Considering each term on the right side of (3.2.17), the first term tends to zero with probability one by Lemma 5 (Appendix). The second term has a limit of zero by the following argument. (3.2.19) lim bkj = O a.s. for each j by Lemma 2 and Lemma 5 (Appendix). Also note that 55 k k (3.2.20) lim. 2 bkj = lim (1 - n (1 - c Aj)) = l a.s. k j‘N k j=N by Lemma 5 and Lemma 6 (Appendix). In addition, (3.2.21) bkj 2 o a.s. and (3.2.22) lim ( max “8 - s“) = O a.s. . r J rEJj by Lemma 1 (Appendix). It then follows that (3.2.19) through (3.2.22) satisfy Lemma 7 (Appendix), and Lemma 7 implies k (3.2.23) lim (l/c) z bkj( max us - SH) = o a.s. j=N rEJ r J Hence, the second term on the right of (3.2.17) has the limit of zero with probability one. Thus (3.2.24) lim Hq H = o a.s. k Vk-+1 because each of the three terms on the right in (3.2.11) have now been shown to have the limit of zero (a.s.). Therefore, the proof of almost-sure convergence is complete. Q.E.D. 
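The two theorems just proved rest on one underlying recursion, and it is convenient to record it together with their conclusions before turning to the rate of convergence. With q_k denoting the estimation error and the notation of (3.1.2) through (3.1.5), the recursion, its iterated form over an interval J_k, and the two convergence statements are:

q_{k+1} = q_k - A_k (H_k q_k - \gamma_k) = (I - A_k H_k)\, q_k + A_k \gamma_k ,

q_{n+1} = \prod_{j=v_k}^{n} (I - A_j H_j)\, q_{v_k}
        + \sum_{j=v_k}^{n} \Big[ \prod_{i=j+1}^{n} (I - A_i H_i) \Big] A_j \gamma_j ,
        \qquad n \in J_k ,

\lim_{k\to\infty} E\,\|q_k\|^{2} = 0 \ \ \text{(Theorem I)},
\qquad
\Pr\Big( \lim_{k\to\infty} \|q_k\| = 0 \Big) = 1 \ \ \text{(Theorem II)} .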
3.3 Rate of Convergence In this section the rate of convergence of the mean- squared estimator &k’ given in (3.1.2) or (2.5.4), is calculated in the sense that a bound, which has limit zero, is found for the second moment of the error norm. In other words, a maximum 56 rate of convergence is found for the mean-square error. The notation is the same as in the previous two sections. In order to obtain the rate of convergence, consider as the starting point the recursive relation for the mean-square error developed in section 3.1 and given in (3.2.26). This re- lation is repeated here as 2 2 2 can Euqvmu s <1 - M7\\Avku>auq,ku +4.16ka Here A - qvk. avk. a and Vk is the smallest integer in the index set Jk defined in assumption AS. M7 and M8 are positive constants selected such that (3.3.1) is true for sufficiently large k. Recalling that 3V 'V (3.3.2) k+1 k pk in assumption AS, a useful change of notation in (3.3.1) is as follows. Clearly there are constants z > 0 and c1 > 0 such that (3.3.1) may be written as 2 sz-z 2 -2 (3.3.3) EHqVkHH s[ - Vk-l] EqukH + civk Now let A (3.3.4) zj (zpj_2)/Vj_1 and choose the index r such that with iterating (3.3.3) back- wards to r, 57 2 k 2 (3.3.5) Euq H s n (1 - z.)EHq H Vk.+l j=r J vk k k _2 + C1 Z 11 (1 ' z.)Vn a n=r j=n+l J and (3.3.6) 0 < zj < 1 for all j > r . By Lemma 9 (Appendix) there is a constant dk-l 2 1 such that (3.3.7) lim d = l k k an’d’f $71111“ j ‘2 ‘r k vs-2 2 (3.3.8) n (1 - z.) s d ( ) , for s = r,r+l,...,k . . J s+1 J=s _ k-l Then 2 vr-Z Z 2 (3.3.9) EHq H s d _ [ ] EHq H Vk+l r 1 Vk-l vr k v z + C1 2 d [ r-2] -2 n=r n Vk-l n Now let A z/2 (3.3.10) tk+1 Vk-l qv k+l Then 58 2 “.22 2 -2 (3.3.11) Eutk+lu s (6:_1) dr [v: 1] (Eucrn )(vr_2) Z n V V + c 2 dnc32—-)zc-51l) 1 n r k-l Vn2 k z-2 1 Z dnvn n=r =%EMJ2+C The first term on the right is finite and the second term con- verges to a finite limit as k increases to infinity. if (3.3.12) 2 < 1 . However, z must also satisfy the derivation of (3.3.3) from (3.3.1), i.e., 2 must satisfy (3.3.13) (1 - zpk_2/vk_1) 2 (1 - M7HAka) By direct substitution into (3.3.13), it can be shown that (3.3.14) 2 s M7HA1H satisfies (3.3.13) for all Vk > p. Therefore, with (3.3.15) 0 < z < min(1,M7HA1H) , (3.3.11) implies that 3 1 2 _ -z ( .3. 6) EHqVkH - 0(vk_2) Here the function 0(-) is defined in Lemma 4 (Appendix). The bound in (3.3.16) on the rate of convergence indicates that the second moment of the error norm, at best, 59 cannot diminish at a rate faster than 1/k where k indexes the observations. Alternatively, k indicates the observation time. The matter of convergence rate in stochastic approxima- tion methods for system identification has been considered for linear systems (K5, D4, 83). However, it has not been extended to a larger class of systems such as the state-model form (2.1.2). The resulting limit of 1/k on the rate, in any case, seems to be in agreement with these previous efforts. The proof of convergence rate above evidently lends further support to the claim of Deutsch (D2) that certain applications of the method of stochastic approximation "can be frequently unattractive because of agonizingly slow con- vergence after the first few iterations". However, optimal estimates cannot be achieved without some assumptions on the noise variables. For example, gaussian, independent random noise may be assumed in order to find analytic expressions for Bayesian estimates of parameters (H1). 
Even if the estimates of a are optimal, in a maximum-likelihood sense, an on-line scheme as preposed here would be developed in the same way as in Section 2.3 with the correct application of the weighting matrices {Vi} indicated there, and the resulting convergence rate would still be no faster than 1/k. However, suboptimal on-line estimators require larger observation time to achieve the same accuracy as that of optimal estimators. In Chapter IV, results of computer-simulated examples illustrate this fact. 60 3.4 Extension To a Larger Class of Nonlinear Systems With the algorithm (2.5.4) stated and proved for the class of systems (2.1.2), it is worth noting that, to some degree, the method may be extended to a larger class of systems. This class is given in (2.1.1) and is restated here as (3.4.1) x k,a) + w k+1 Q“"16” yk g xk 1’ Hk k vk=uk+pk where the §k(°,°,') is regarded as a known nxl vector-valued system transition function. In this section, the requirements for extending the algorithm to include some systems in the class (3.4.1) will be considered. The extension will be indicated by considering only the scalar case, i.e., where n = 1, since, with the theory of previous sections, the multidimensional case follows directly. The chief requirement necessary to carry out this exten- sion is that the first partial derivatives of §k(°,°,-) with respect to the m components of the vector u exist and are continuous, and also that the function may be closely approxi- mated by the first two terms of its Taylor expansion in a. In particular, let the scalar system function be (3°4°2) Fk(01) = §k(xk,uk’a) and (3.4.3) aQa*+M. 61 Here Au represents the difference between the true parameter * a and some estimated value a of a- Then, using the first two terms of the Taylor expansion of Fk(a), ~ * .' * (3.4.4) Fk(a) = Fk(a ) + Fk(a )Aa . -‘ * where the mxl vector-valued gradient Fé(a ) is defined as * * * . * aFk(a ) aFk(a ) aFk(a ) (3.4.5) FILQY ) = —_*_ a —"'.k_ a°°°a —;_" 801 302 Bum * * and the a1 are scalar components of the mxl vector 0 . This step linearizes (3.4.1) in the unknown parameter a while allowing it to remain nonlinear in the other variables. With this linearization, the assumptions A1 through A8 remain the same in substance. However, the relations stated in A5 and A6 are changed. In particular, it is assumed under the statements of A5 that (3.4.6) inf inf A . [ 2 §.(&)§'(&)il 2 x > 0 a.s. k [1(Yj:vj’&)HjEJk} min jEJk J J o and that A i A FJ(&)Fj(a) (3.4.7) inf [ inf , [ 2 , 2 ] 2 T' > o a.s. k {(yj,vj,&)\j€Jk} “‘1“ 39!. quu where the nXl vector-valued function Fj(&) is given by 3-4-8 3. A = Q. . v. “ ( ) J(a) J(YJ: J90) Similarly, under the statements of A6 it is assumed that 62 . l: A ;l 5 min )‘max[Fj (0L!)Fj (Gil (3.4.9) inf inf 1‘ . 2 21-20 a.s. A . ~ “ A I A k {(yj avj 30’)‘JEJk} mix )‘max[Fk(a)Fj (0)] k The relations (3.4.6), (3.4.7), and (3.4.9) constitute the basic changes in the assumptions Al through A8. The development of the algorithm defining the estimator &k of a now proceeds exactly as in section 2.3 with no addi- tional complication. The loss function Lk+l(ak+l)’ given by k A _ A 2 (3°4'10) Lk+1(°’k+1) ’ iEl(§i(yi"’i"”k+1) ' VH1) o a c A _ * h * 0 has a unique m1n1mum.when ak+l - ak+1, w ere ak+1 1s recursively defined as 3 4 11 * - * + p 1'3 ( * 1? * ( ' ' ) ak+1 “k k+1 kak)(yk+l k(°'k)) ’ and in this case -1 A .: * .2, * -1 (3.4.12) P - 2 F (a )F (a ) + P , k = 1,2,... 
k , j k j k k-l JEJk The steps leading to the result (3.4.11) are omitted here since they constitute a repetition of the steps of Section 2.3 leading to the analogous result (2.3.10). As before, with the help of (3.4.6) and Lemma 10 (Appendix) it may be shown that (3.4.13) “Pk“ = 0(k) a.s. and so the matrix Pk in (3.4.11) may be approximated with a matrix Ak given by 63 (3.4.14) Ak = E(P1)/(kp) , k = 1,2,... where E(P1) is positive definite and the p is given in assumption A5. A sample-average approximation may be used to estimate E(P1) and a value p may be selected to correspond to the maximum length of the Jk intervals. As mentioned pre- viously, any positive definite matrix having the appropriate dimensions may be used for A1 and convergence of the algorithm will be assured; however, the Ak suggested above best approxi- mates the Pk necessary for a least-squares estimate of a. The algorithm (3.4.11) may now be applied in the form A (3.4.15) ak+1 = ak +Ak+1 Fkfiak)(yk+1 - Fk(ak)), where is the suboptimal estimate of 0 replacing the o’1<+1 * minimum quadratic loss estimator ak+1° The suboptimal estimator & has the same asymptotic behavior as does the k+l optimal estimator a:+1° A bias occurs in the estimation scheme (3.4.15) as occurred in the previous estimator (2.3.10). With the assump- tion of wide-sense stationary random variables, the bias may be determined by finding the value of & which minimizes the expected loss function given by . ~ 2 (3.4.16) E(L(&)) = E(Fk(a) - yk+1) . Applying the Taylor expansion (3.4.4) to the function Fk(&), the value of the mxl vector E which minimizes the risk (3.4.16) is given by 64 (3.4.17) 82* -- 01 + E“1(i‘3~k(a>f}1;> Eé'kmmkm) - Row by using the same procedure as used in section 2.4. The bias- error evidently depends on the choice of functions §k(-,-,-). Whether the bias introduces a large error is best determined by evaluating the second term.with assumed noise variances based on the available knowledge of the system behavior. The algorithm to use in practice is given in (3.4.15) with the Taylor expansion applied to the Fk(&k) appearing on the right side. However, if some information is available to estimate the effect of the bias it may be incorporated into (3.4.15) by subtracting an estimate of the term bk given by (3.4.18) bk == E(f;‘k(a)[Fk(a) - §k(oz)]) The term b , which appears as a factor in (3.4.11), i.e., k without the expectation, should be subtracted from (3.4.15) in such a way that (3'4'19) ak+1 = “k + A1<+10515031)W15“) + wk + Hk+1 ' Fk(°’k)] ' bk" Here the quantity following the Ak+1 now tends to have zero mean as tends to 0 since the w and ¢k+l are un- a'k . k correlated with the §k(&k) by assumption Al. The same pro- cedure was used in the previous estimator (2.5.4) to center the estimator, in the same asymptotic sense as in (3.4.19), around its mean value. The final form for the estimator (3.4.19) may be shown to converge in the mean-square sense and the almost-sure sense 65 to the unknown system parameter a. The procedure to accomplish this proof, with the help of (3.4.4), (3.4.6), and (3.4.9), is entirely the same as that used in Sections 3.1 and 3.2 and con- sequently will not be pursued here. Also the convergence rate may be shown to behave the same as in Section 3.3. 3.5 Summary In this chapter the characteristics of the system identification algorithm (2.5.4) were examined. 
In Sections 3.1 and 3.2 arguments were presented that show that the para- meter estimator, defined by (2.5.4), converges in the mean- square and almost-sure senses to the true value. These proofs depend on stochastic approximation theory. Also, the rate of convergence was calculated in Section 3.3. This result indicates that the mean-squared estimator (2.5.4) cannot converge faster than a rate proportional to the inverse of the observation time. It is further indicated that this is true for both suboptimal and optimal on-line parameter estimators. In Section 3.4 it is shown that, with additional hypotheses, the identification algorithm (2.5.4) may be modified to apply to systems which are nonlinear in the unknown parameter. Here, this extension is developed for scalar systems, as the multidimensional case is a parallel result to the original algorithm (2.5.4). In order to verify the practicality of the estimation scheme stated and proved in this chapter, models of various forms have been simulated on the digital computer. With the help of a random number generator, the estimation method has been demon- strated. The results of this verification are discussed in the next chapter. CHAPTER IV EXAMPLES The system identification algorithm developed in the previous chapters is demonstrated in this chapter by a digital computer simulation. Two examples are considered here. The first is a state model of the type (2.1.2) which is linear in the unknown parameter and nonlinear in the state and input variables. The second example is a one-dimensional system which is nonlinear in all variables including the unknown para- meter. This example verifies the extensions to the algorithm discussed in Section 3.4. These two examples were simulated on a Control Data 6500 digital computer. The inputs and noise variables were generated internally. 4.1 Nonlinear State Model: Linear in the Unknown Parameter Consider the discrete-time nonlinear state model given by (4.1.1) x1(k+1) Tanh(x2(k) + u(k)) x1(k) a 1 x2 (k+1) x2 (k) Atan(x1(k)) a2 w1(k) + W2 00 y1' x1 11m = + y2 (k), x2 00 H2 00 v(k) = u(k) +'p(k) 66 67 (4.1.1) is a purely academic example of systems of the type (2.1.2). In this section, results of a simulated identifica- tion of this system are reviewed. Some characteristics of the deterministic part of the system are mentioned first. Expanding the nonlinear terms in (4.1.1) in a Taylor series and suppressing the terms past the first reduces the non-random part of (4.1.1) to the linear system given by x1(k+1) a a x1(k) a 2 (4.1.2) - 4- u(k) x2(k+1) a2 a1 x2(k) 0 This system is asymptotically stable if its eigenvalues (0, a1 +'a2) lie inside the unit circle. When this is true, the system (4.1.1) is asymptotically stable (Petrovski, K.T.K.). If it also happens that 0’1 “1’2 (4.1.3) Det # 0 , 0 “P32 then the system (4.1.1) satisfies conditions for (null) con- trollability (Ll). Furthermore, the system (4.1.1) is observ- able since the state vector is observed directly (L1). The system (4.1.1) was modeled on the digital computer with a random input u(k) which was zero-mean, independent, and uniformly distributed. The noise variables wi(k), Hi(k), and p(k) had the same characteristics as those of u(k). The vector selected for the unknown parameter is given by 68 (4'1”4) 01' = (01902) = (”'25: "70) This vector satisfies condition (4.1.3) and the stability criterion. The identification algorithm used is given by &1(k+1) 611(k) y1 611(k) (4.1.5) = + 9 fi'[ - ‘1" :1 - x x k+1 k k x 6, (1+1) 0:2 (1.) y, (1+1) a2 (1.) 
where Tanh i‘k = y,(k> Acan and the 2X2 matrix Pk+1 is given in (2.3.12). A deterministic gain Ak+1 was also used in place of Pk+l for comparison. The gain Ak is given by (4.1.6) A =——— where a1 and a2 are positive constants. This gain is selected in keeping with the theory of Chapter III and Lemma 10 (Appendix). The simulation was carried out for a high initial estimator error and again for a low initial error. In both cases, various gains were used. Figures 2 and 3 show averages of the squared error of ten runs of the identification algorithm over 512 observations for each gain selected. The squared error is defined as (4.1.» 111.112 = Her 3.112 69 where a is given in (4.1.4) and &k is the vector estimator given in (4.1.5) Figure 2 illustrates the comparitive performance of the estimator for the deterministic and least-squares optimal gains AR and Pk. This simulation indicates that both types of gains give similar results. However, in order to chose the best deterministic gain, the constants a1 and a2 should be chosen so that Ak decreases in a manner similar to the eigen- values of Pk. This requires testing Pk over a few iterations using the matrix inversion lemma (2.3.12). Figure 2 further indicates that when a large initial estimation error is anticipated, the initial gain values, PO and A0, should be larger than unity. This correSponds to a large initial variance in the estimate. Thus, the two lower curves in Figure 2 are a result of slightly higher initial gain values. Figure 3 illustrates the identification results when the initial error is relatively small. Here, the various gains do not cause a significant change in the convergence of the estimator. A comparison of Figures 2 and 3 indicates that the algorithm rapidly compensates for a large initial error in the estimate, i.e., Figure 2 shows a steeper descent than Figure 3. With regard to choosing gains, this experiment indicated some effects which are not apparent from the figures. There appears to be some best value for A0 when a large initial error is anticipated. Figure 2 shows when A0 is equal to 2 the result is better than when AO equals one. However, when A0 is much larger, for example 10, two undesirable effects occur. 70 The initial estimator variance becomes large and the average error increases before it begins to decrease. In particular, for the example in Figure 2, a gain of 100/(10 + k) produced 19 a maximum squared error of 4x10 at 32 iterations and a minimum of 2.5x10'2 at 512 iterations over a 10 run average; 1 for 32 one run in this set produced a squared error of 9X10- iterations. Conversely, when A0 is selected small, con- vergence is very slow and the estimator's variance is small. In the same example of Figure 2, a gain of 1/(10 + k) produced a minimum squared error of 6X103 after 512 iterations of the algorithm (4.1.5). The variance of the estimate when PR is used is much better behaved for all choices of Po' Small values of “Po“ give very slow convergence while very large values tend to result in a sharp initial drop in the estimator's error. The sharp initial drop is typically followed by gradual convergence resulting in errors which are no improvement over those obtained with smaller “Po“ values. It appears, there- fore, that there is no general advantage in choosing very large or very small initial gain values. It is also noteworthy that the application of the deterministic gains reduced the central processing time of the CDC 6500 by one half for this second-order state model simula- tion. 
Higher-order models would lead to greater time savings than those gained here because the computation of the gain P k would become more time consuming. 71 2 4 = l. 9 X 10 quH 9 H = (+100, -100) 2 o squared error, quu \wi(k)‘ s .2 ~ . k) s .2 105 + NJ 1 \p (k)| s .2 0 6+0 ‘U (k?\ S 1.0 104 x 61' . ' 3; + gain Pk’ P0 = I X . 3' 0 gain Ak = 10/(10+k) 3 ‘ 3 10 r x g x gain Pk’ P0 = 10I x O , gain Ak = 20/(10+k) .2 10 run avera e 102 - . g x t X 1 + 10 F o o + x 100 _ o X 10'1 - ' x 10"2 . ‘ 1‘ l 1 l l l l l l 2 I 21 1 2 4 8 16 32 64 128 256 512 Iterations, k Figure 2 Estimator Error Squared as a Function of Algorithm Iterations; Large Initial Estimator Error and Various Gains 72 2 quu .5525 a; = (0.0) |wi(k)\ s .2 ”100‘ 5 '2 |p (k)| s .2 Squared error, quHZ ‘u (k)‘ s 1.0 , + gain Pk, PC) = I 100 . o gain Ak = 10/(10+k) X gain Pk’ Po 8 10I ‘b 0 gain Ak ' 20/(10+k) g t . . 10 run average 3: 9' $ §_ 10‘1 - " x 6 x + 5 " + X X 00 .... X o. t o. -2 + 10 - >‘ l 2 4 8 16 32 64 128 256 512 Figure 3 Iterations, k Estimator Error Squared as a Function of Algorithm Iterations; Small Initial Estimator Error and Various Gains 73 4.2 General Nonlinear System: Scalar Case The extension of the identification algorithm to systems which are nonlinear in the unknown parameter as well as in the state and input variables has been verified by a digital computer simulation. The results are discussed in this section. The example considered here is the system of equations 3 Xk+1 ” 0’le + “zuk + 3(“1xk + °’2”k + "k (4.2.1) yk=xk+¢k < II k uk+pk where all the variables are scalar quantities. This system is.asymptoticallystable when a1 lies inside the unit circle. When a2 # 0 conditions for (null) controllability are satisfied (L1). The system was modeled on the computer with the para- meter a given by (4.2.2) 0' = (al,a2) = (-.25,+.70) The noise disturbances and input were selected from a uniformly distributed independent process generated by the computer. The system.was identified using the algorithm developed in Section 3.4 given as (4.2.3) 61 = 01k + Pk+11‘k(11k)(yk+1 - ly‘k(ak)) where 74 A A A A A A 3 = + Fk(01k) a1(k)yk 012(k))!k + 3(al(k)yk + a2(k)vk) . x A 2 .1 yk + 9Yk(crl(1<)yk + 012 (k)Vk) F (& ) = k k x x 2 Vk + 9Vkia1(k)yk + a2(k)Vk) and Pk+l is given by (3.4.12). In accordance with the theory of Section 3.4, and in particular assumption (3.4.4), the initial estimator error should be small. Therefore the initial estimator value was set at zero giving an initial squared error of .5525. With this initial condition, convergence of the estimator is illustrated in Figures 4 and 5. Here, the application of a deterministic gain is shown to be a valid substitute for the least-squares optimal gain Pk which, with this type of system, is a function of the estimate itself. When larger initial errors were tried convergence was poor. For example, an initial estimate of a; = (&1(0), &2(0)) = (10,10), giving an initial squared error of quH2 = 191.6, gave a final squared-error of 22.9 after 512 iterations. An increase in the initial gain value, “Po“ and A0, did not improve the estimate. This result verifies the necessity for a smaller initial error as discussed in Section 3.4. 75 2 quH = .5525 6; = (0.0) ‘wk‘ 5 .25 Relative “q “2 Hwkl S '1 Squared error, k 2 ‘pk‘ s l quH ‘Uk‘ S .5 H 3 ’-‘° >3 gain Pk, Po =1 ’2 3? noise free case P . o X state noise only f 0 state noise and observation 0 '- x 0 noise included 0 o X 0 X P X o "' o o L- . 
10 run average 1 1 1 _1 1 g 1 1 1 # 1 2 4 8 16 32 64 128 256 512 Iterations, k Figure 4 Relative Squared Error as a Function of Algorithm Iterations; Least-Squares Gain with Various Noise Levels 76 quH2 = .5525 a; = (0.0) \wk\ s .25 Relative qunz Wk‘ 5 .1 Squared error,'———jf “qOH \Pkl S ’1 0 f "o g luk‘ 5.5 >2 )2 gain Ak -- 5/(5+k) . - » fl . Noise-free case 0 .( X State noise only . O»o State and observation X - ‘ 0 noise included 0 X 0 X H. O 10 run average I J 1 1 l 1 I 1 n n l 2 4 8 16 32 64 128 256 512 Iterations, k Figure 5 Relative Squared Error as a Function of Algorithm Iterations; Deterministic Gain with Various Noise Levels GHAPTER V CONCLUSIONS This chapter includes a discussion of the general results of this thesis. Possible extensions to this work are also included. 5.1 Thesis Results The chief result of this thesis is the development of a practical method for finding a mathematical model for a discrete-time system. This method is based on the assumption that the system is known up to a finite set of parameters. The procedure develOped provides a means for finding the unknown parameters from observations of the system's input and output variables. The method for finding the parameters is an on-line, stochastic, gradient-search algorithm.which updates an estimate of the parameters with each new set of input-output data. The system identification algorithm has a general formulation based on a liberal set of assumptions. The formulation includes approximating the least-squares optimal estimator gain with a deterministic gain. The result is a system-parameter estimation scheme which has at least three practical features. First, a wide variety of time-varying non- linear systems may be identified by this algorithm. Secondly, the on-line strategy of the algorithm permits identification 77 78 when the moments of the random processes are time-varying and when the system transition function is time-varying as well. The third practical advantage is the savings in computation time resulting from the use of deterministic gains in the parameter estimation algorithm. The reduction in computation time increases when the order of the system is increased. Two convergence theorems are presented in Chapter III which prove that the parameter estimator converges in the mean- square and almost-sure senses. These two proofs, introduced as modifications to the theory of nonlinear regression, are new to the literature in system identification. The convergence arguments were based on the liberal set of assumptions outlined in Chapter 11, some of which were shown to define a criteria for system identifiability. Two important assumptions used in the convergence proofs are (l) the system is linear in the unknown parameter and (2) a deterministic gain is used in place of the least-squares optimal gain. Later, in Section 3.4, the algorithm.was extended to include systems which are nonlinear in the unknown parameter. This extended version is new to the system-identification literature. The maximum rate of convergence of the estimator was also calculated in Chapter III. This result indicates that, in an asymptotic sense, the mean-squared estimator cannot con- verge faster than a rate inversely proportional to the observa- tion time. It is shown that this is true for optimal on-line estimation schemes as well. 79 The algorithm was demonstrated by a digital computer simulation. An example was selected to show the sensitivity of the parameter estimator to various gains. 
It was demon- strated that the deterministic gain is a useful substitute for the least-squares gain, since convergence rates were comparable and estimates using the deterministic gains were computed faster. A second example was selected to demonstrate the extension of the algorithm to systems which are nonlinear in the unknown parameter. Here, it was found that large initial errors in the estimator lead to slow convergence. This did not occur in the previous example. However, a smaller initial estimator error improved the convergence rate significantly. 5.2 Possible Extensions The identification examples given in Chapter IV indicate a more extensive investigation of gains would be use- ful. In Chapter IV, three types of gains were verified: deterministic gains given by (4.1.6), least-squares (random) given by (2.3.12) and least-squares (random-adaptive) given by (3.4.12). What is needed is a method for choosing a gain, or its initial value, in such a way that the estimator error will drop quickly during the first few iterations, independent of the initial error. Some gain sequences may be develOped which accelerate the initial estimator's convergence. Such gains have been suggested in stochastic-approximation liter- ature (K2, V2). 80 Another extension to this work is the parallel solution to the identification problem in continuous time. This requires the application of continuous-time stochastic approximation theory. Although most of the literature on stochastic approxima- tion deals with the discrete-time case (A2, D1), some studies on continuous~time stochastic approximation theory are avail- able (81, S3, B6). Finally, this identification method may be connected with the study of stability of large-scale interconnected electric power systems. Here, there is a need for modeling and control methods based on input-output variables available on the power system's tie-lines. It is in this general problem that this thesis had its origin. APPENDIX APPENDIX In this appendix various lemmas which are required for the convergence arguments in Chapter III are stated. Those lemmas which are taken directly from the literature are stated without proof. In some cases the proofs rely on results in the litera- ture and so the proof is presented with the appropriate references. In the remaining cases, namely Lemmas 3, 4 and 10, the proofs given are new. In the interest of completeness all propositions required for Chapter III are stated here with the exception of well-known classical theorems. For convenience, the algorithm to be considered is re- stated here in the notation of Chapter III. The recursive estimator &k of the unknown parameter a of the model selected from the class (2.1.2) is given in (2.5.4) and (3.1.2) and is A °’k+1 g “k ’ A141011531: ' a) ' Yk) where 11 AM =i'(y,v>1 . k kk k k k k k k Vk = £15k ’ bk ’ Ak Q Al/k -- Bap/(pk) . 5k " [§k(xk’uk) ' §k(yk’vk)]°’ “'1. + Hk ’ and 81 82 bk = E(§£(yk’vk)§k) These relations are discussed in Sections 2.4 and 2.5. The norm function H°H for vectors is defined to be the Euclidean norm, that is for a real nXl vector 2 where . z ) v- z (21, 22, ... n the norm of z is given by n a qu= [2 12,12] - i=1 The norm function H'“ applied to square matrices is the "spectral" norm, that is for an an matrix A and nXl vectors x HAH = s M ”P x¢0 \x“ where the Euclidean norm given above is used on the right. When A is hermitian “A“ = Max (A) ’ that is the norm of A is the largest eigenvalue of A (V1). These two definitions will be used throughout this section. 
Lemma 1 (A2, L4) Let Then 83 lim 3 = s a s n where HS” < m a.s. Proof: By assumptions Al and A2 the moments of Yk are uniformly bounded, so 124141216113 < 111.112 Then, with the Cauchy-Schwarz inequality (B5) kg, EHAnkH2 s .51 “1141121416 5 constant - n 2 t t n HAle #21 “AR“ 5 cons n kgl k2 .< a . Also, by definition of Yk , E(Akyk) = AkE(vk) = O . Hence the convergence of the sum of the second moments of the zero-mean random variable AkYk guarantees that lim 5n = s a.s. with “s” < a a.s. n by direct application of a theorem due to Loeve (Theorem D, Section 29.1), (L4). Q.E.D. Lemma 2 13111 “An“ Han“ = o a.s. 84 Proof: RE]. E(“Ak“2HHkH2) = .kEl HARHZENH‘kNZYN “ 2 s constant 2 “Aku < a k=1 2 since. EHHkH is uniformly bounded for each k by assumptions Al and A2. Also by the Fubini theorem (P3) ‘° 2 2 °° 2 2 ~> 8HA\\E<\HH>=E 2 AHHHH. k=1 1. \1. k=1H 1. 1. According to the Monotone Convergence Theorem (L4), if 1:. kgluAkHZHHkHZ = .. . then um E kguAkuzuakuz = .. . The contrapositive of this statement is °° 2 2 E 13111111 11111 < w implies that °° 2 2 1:1 111,1 111,1 < .. . . Now this implication means that kgl (1,12,61,12. .. implies that 85 ' 2 2 = a.s. 131141 111,1 0 from which it follows that lim HAkH “HRH = 0 a.s. Q.E.D. Lemma 3 lim k2]- HARM “HR“ = on a.s. Proof: Using the notation of assumptions A5 where Jk is defined as the index set of all integers between the integers and v - 1, that is vk k+1 Jk = {Vk, Vk+l, ... , Vk+1'1} : we have that 13-3-1 HAkH HHkH = 1:1 {1ng “An“ HHnH 2 2 1min(Av ) 2 “Hr,“ k=1 k néJk 2 2 H . (A ) 2 H k=1 mm Vk HneJk “H from the triangle inequality (BS); and by assumption A5 Q 2 £21 xmin(Avk) lo a.s., 86 The last equality is true since co 2 A (A ) = O k=1 1" min and every infinite subsequence of {xmin(Ak)} Sums to infinity. Q.E.D. Lemma 4 For each realization of the random sequence {Hn}’ where g . Hn Qn(yn’vn)§n(yn’vn) ’ there exist constants c,c' > 0 and 0 < N < a such that -l OSHQ \ N. Here the variables are defined as A - Qk .11 (I AiHi) IEJk where the indexing is in descending order, that is, Q=(I-A H _)(I-A _H _) 1‘ Vk+1'1 Vk+1 1 Vk+1 2 Vk+1 2 (I - AVkHVk) and A 2 2 3 A), = (Z HAiH HfliH ) iGJk Proof: First premultiply th by its transpose and note that AR and Hk are symmetric. This gives 87 0qu = n (1 - A111,) 11 (1 - 1,111) i . EJk IEJk =(I-H A )... (I-H _1A _1). Vk “k Vk+1 Vk+1 (I-A _1H _1)...(I-A H ) Vk+1 Vk+1 Vk vk 2 I - .2 (AiHi + HiAi) + Aka 16.1k where Gk = [(second order product terms) + (higher order terms)]-( 2 “AiHZHHiH2)-1 . iEJk Now there is a constant MI > 0 such that “Gk“ 3 M1 a.s. for all sufficiently large k for each realization {Hi}. This is true since (1) Lemma 2 implies that . . -2 11m “(higher order terms)\\Ak = 0 a.s. k because the numerator tends to zero faster than the quantity 2 k’ and A (2) “(second order product terms)HA1-(2 -2 s 1 + “("off-diagonal" second order product terms)“Ak which follows from application of the Cauchy-Schwarz and triangle inequalities. Furthermore, 88 “(Hoff-diagonal" second order product terms)uA;2 -2 S 2 ,2 (HA-HHHH\\“‘n\\HH.H>A . {i#j;i,je.1k} 1 1 J J k by the triangle inequality, and with p 2 vk+1 - vk for all k, 2 (2,6,2 - p>max 11111211112 1;} iEJk s 202032 - p>> 0 89 x [2 (AH +HA)]A-1>M a.s. min . i i i i k 2 16.1k for all sufficiently large k. To accomplish this last step first note the following matrix properties: For positive definite real matrices Bi’ 1 = 1,2,...,n, n n (a) )‘min(i§1 Bi) 2 121 "min(Bi) (b) xmin(BlBZ) 2 Xmin(Bl)Xmin(BZ) (c) (313 ) is positive definite for all i and j, J (V1). 
Using these properties, -1 -1 2 mini + HiAiflAk 2 kmin[ Z (AiHiflAk ieJk iEJk A min[ -1 + kmin [12‘] (HiAigAk k Now consider the first term on the right. Using the properties (a) - (c) further, -1 -1 1. (2 A.H.>1 2 z 1. (A.H.>> min iEJk 1 1 k iEJk min 1 1 Hvk jGJk j‘ -1 \\A,kH ° 1 k Jk k Jk Now since A1 is positive definite by choice, for some M3 > 0, A (A )mm0A1)(1/vk) min Vk —“-——n—v £11“ (A)(1/v)2M >0. Hence by assumption A6 1 mm“) (mJin 1111111) k 11A 11aax11n.11> «1.13.1.1... V 1 k J k Therefore, 1 . ( Z AiHi)Al-<12M > 0 a.s. min WEJ 3 The same argument applies to the second similar term to the one on the left in the last inequality. The result is, therefore, )..”: AH) 1mm: HA) min ieJk i nieJk 1 1 Ak ( 2 (Ai H +'H1A )) xmin iEJk i i 2 2 M > 0 a.s. Ak 2 for some M2 > 0. 91 Consequently for sufficiently large k, 1212 =11QQ'11s1 «A +0012) .1 s k k k 2 k k ' ' and so with an appropriate c > 0, “QR“ s 1 - cAk a.s. The second part of the Lemma follows directly; note that 2 2 35 2 35 A}, = <2 HA1“ “Hi“ > 2 HA, -1H( '2 NH,“ > iEJk k+l 16k 2 “Avk+1-1H x0 a.s. by assumption AS. Therefore for sufficiently large k -1) “QR“ S 1 ' CAk s 1 ' CHAvk_'_l-1H)‘o = 1 ' c'/("1<+1 almost surely, where c' A CHAIHxO Q.E.D. Lemma 5 n lim II “QR“ = 0 a.s. n kfll Proof: By Lemma 4, n n _1 o s kgl “QR“ 5. 1:210 - c (ka-l) ) n -l - ' - 5 exp( c £21(Vk+l l) ) a.s. 92 Since l/k sums to infinity, every infinite subsequence does also, so “ -1 lim 2 (v -1) = on n k=1 k+l Therefore “ 1 . I _ ‘ .. 11m exp( C 2 (Vk+1 1) ) 0 : n k=1 and hence the result follows. Q.E.D. Lemma 6 (A2) "Let A1,A2,..., be a sequence of square matrices. Then for all 1 s k s n and n 2 l, n n n 2 n (I-AiM. =1 - no "‘1’ j=k i=j+1 3 i=k where products are to be read backwards and void ones as the identity." Lemma 7 (K3) L6t (1) {xn} be a real sequence such that lim x = O n n (2) {a } be a set of reals such that uv n gio‘ankj < K n = 0,1,2,... for some constant K > O 93 Now note that for some r > 0, fixed, that lim a = 0 . nr n Then (1) and (2) imply n lim 2 a x = 0 . n k=0 nk k Lemma 8 (A2) "Let Pn = n (1 - a.) when a belongs to (0,1) jgl J j for all j 2 n and lim Pn = 0. Then if n 2 X < w a k k then n lim max" t. (P /P,)x = 0 . " n 15km j=k “ J 3 Lemma 9 In keeping with the notation of assumption A5 let ID V n n+1-vn 0 and N' > 0 is chosen so that zj < l for all j 2 N' Then, n 2 .H (1 - zj) s dk(vj_2/vn_1) J=k where k =N',N' + 1,...,n, and dkz 1 such that lim d = 1 . k k Proof: (A2) Note that for all x,y > 0 Then let so that Since then exp(x) > (1 + x/y)y . xéz. J sza - Bj) 2(1-Bj) exp(zj) > [1 + szj/zj(l - ej)] -Z(1 - Bj) = (1 -Bj) -1 (1 + x/y) = (1 - Bj) = (vj_1/vj_2) z zej exp(zj) > (”j-l/Vj-Z) (1 ‘ Bj) Now invert and take the product from k to n getting 95 n n n H (1 - z.) s H exp(-z ) = exp(- 2 z ) j=k J j=k j .1ng n z z j z S ij (vj _2/v j- _1) (1 - Bj) = (vk_2/vn_1) -S where n ~25, 0<1n(S) = Elna-B) J j=k j =-Z zB 1n(1 -B.) , jrk j J Let A Q dk - exp [-2 jEk len(1 - Bj)] 2 S . Then since for sufficiently large j, Bj < 1 giving exptej/<1 - ej)] > (1 - ej>'1 so that B. T-J—>-ln(l - B) 'B 1 Therefore d < exp 2 E 82/(1 - a ) k j=k J j Now since Os-z Bj1n(1 ‘53) s z 33/(1 -5j) jgk j: 96 then as k tends to infinity the right-hand sum tends to zero as does the left-hand sum. Hence by the definition of d k lim d = 1 . k k Finally, replacing S with dk gives n z .9 (l = zj) s (Vk-Z/Vn-l) dk J-k Q.E.D. The following lemma has appeared elsewhere in the liter- ature (H2, D3) but is proved here distinctly in the context of the assumptions A1 through A8. 
Lemma 10 In accordance with the notation of Chapter II let -1 ~ . -1 = i = Pk FiF. + P k 1,2,... -l iEJk 1 k and P0 is preselected as a positive-definite symmetric matrix or alternatively set equal to zero. Then “P1.“ = 0(k) a.s. Proof: Suppose Po is positive definite. Then P;1 is also positive definite. Since P;1 is symmetric it follows that P is unitarily similar to a diagonal matrix D k (F1), and k the eigenvalues of P; appear on the diagonal of Dk. Then PR is evidently unitarily similar to DLI. Therefore, 97 -1 -l xmax(Pk) g xmin