SYSTEM IDENTIFICATION BY STOCHASTIC APPROXIMATION

Thesis for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
DELL D. MONAHAN
1970

This is to certify that the thesis entitled "System Identification by Stochastic Approximation," presented by Dell D. Monahan, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering and Systems Science.

Major Professor

ABSTRACT

SYSTEM IDENTIFICATION BY STOCHASTIC APPROXIMATION

By Dell D. Monahan

The primary purpose of this work is to develop a practical means for finding a mathematical model for an unknown discrete-time system from observations of its input and output variables. The system is assumed to be excited by a random input process having bounded moments but otherwise unknown statistical properties. The system itself may be nonlinear, in a specified sense, and time-varying. It is observed with additive random errors having unknown distributions. In view of these conditions, an on-line procedure is selected which uses an adaptive model. The adaptive model has a state-space representation in which the equations are assumed known up to a set of parameters. General conditions restricting the form of the model are given. The problem is to find the parameter values which minimize the error between the system output and the model output. The method chosen for computation is a stochastic gradient-search algorithm which updates the model parameters with each new set of input-output measurements. Under a liberal set of assumptions the parameter estimator is shown to converge, within a small bias error, in the mean-square and almost-sure senses to the true parameter value as the number of observations increases. The arguments for convergence rely on stochastic approximation theory. Therefore, the procedure considered is a stochastic approximation system identification algorithm. It is formulated in a general sense and shown to apply to certain practical problems. The verification is carried out by a digital computer simulation. The examples simulated include the identification of a system which is nonlinear in its state, input, and unknown parameter.

The method extends the work that appears in the literature on system identification in at least four ways. First, the identification procedure is generalized to apply to systems which have nonlinear and time-varying characteristics. Second, a geometric interpretation of identifiability is given. Third, under liberal assumptions, mean-square and almost-sure convergence of the estimation algorithm is proven. Finally, the rate of convergence of the estimator is evaluated.

SYSTEM IDENTIFICATION BY STOCHASTIC APPROXIMATION

By Dell D. Monahan

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Electrical Engineering and Systems Science
1970

ACKNOWLEDGEMENTS

The author wishes to express his sincere gratitude to his thesis advisor, Dr. G.L. Park, for his continued assistance and encouragement during the course of this research. The author is indebted to Dr. R.C. Dubes, who contributed considerably to the author's graduate study. Dr. Dubes' careful proofreading of the manuscript, which led to several improvements, is also gratefully acknowledged. Thanks are also due to Dr.
H. Salehi, Dr. R. Zemach, and Dr. W.L. Kilmer for their interest and critical reviews of the thesis.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
CHAPTER I    INTRODUCTION
     1.1 Motivation and Models for System Identification
     1.2 A Survey of System Identification Techniques
     1.3 Thesis Objectives
CHAPTER II   DEVELOPMENT OF THE ALGORITHM
     2.1 The System Model
     2.2 Assumptions for Identification and Identifiability
     2.3 The System Identification Algorithm
     2.4 Consistency of the Estimate
     2.5 Approximation to the System Identification Algorithm
CHAPTER III  CONVERGENCE PROOFS
     3.1 Mean-square Convergence
     3.2 Almost-sure Convergence
     3.3 Rate of Convergence
     3.4 Extension to a Larger Class of Nonlinear Systems
     3.5 Summary
CHAPTER IV   EXAMPLES
     4.1 Nonlinear State Model: Linear in the Unknown Parameter
     4.2 General Nonlinear System: Scalar Case
CHAPTER V    CONCLUSIONS
     5.1 Thesis Results
     5.2 Possible Extensions
APPENDIX
BIBLIOGRAPHY

LIST OF FIGURES

1  Hyperplane H, with Unit Normal z, and Vector Distances |f_i'z| from H
2  Estimator Error Squared as a Function of Algorithm Iterations; Large Initial Estimator Error and Various Gains
3  Estimator Error Squared as a Function of Algorithm Iterations; Small Initial Estimator Error and Various Gains
4  Relative Squared Error as a Function of Algorithm Iterations; Least-squares Gain with Various Noise Levels
5  Relative Squared Error as a Function of Algorithm Iterations; Deterministic Gain with Various Noise Levels

CHAPTER I

INTRODUCTION

In this chapter the definition and purpose of dynamic system modeling and identification are briefly stated. Then, a representative set of methods for obtaining dynamic system models is discussed with respect to the practical problems. From these methods a particular approach to system identification is selected. Finally, the general aim of this work is stated, establishing this approach as a practical method for system identification.

1.1 Motivation and Models for System Identification

Generally the subject of system identification is motivated by problems connected with control and synthesis. The present study had its origin in the problem of stability of large-scale interconnected electric power systems. Modeling such systems provides a means for predicting the transient reaction of the system to step changes in load as, for example, caused by distant faults. Knowledge of such responses leads to better design and control policies for improved system reliability. This application has an immediate parallel in automatic control theory and the development of adaptive control systems. The definition of a system will be considered first.
Using operator notation, a system may be defined as a triplet (x,u,H) where (1.1.1) x Hu i.e., the operator H maps an admissable input function of time u into an output function of time x. Then, x is the response of the system H to the excitation u. The system identification problem may then be thought of as the problem of using observed x's and u's to determine the unknown operator H. The question of uniqueness immediately arises, and one finds that, generally, H belongs to an equivalence class of H- operators in which each member produces an identical output Hu, for each admissible input excitation u. This is evident from the following example. (1.1.2) x hll hl2 ul x2 h21 h22 U2 1“ There is no single pair of outputs and inputs (x1,x2) and (u1,u2) which will uniquely define {hijz i,j = 1,2}. There are infinitely many {h that satisfy (1.1.2) (R3). In 13} some cases there are canonical forms associated with the equi- valent classes; for example, linear time-invariant single-input single-output systems of known order (L2). In any event, the lack of uniqueness is a basic difficulty in a system identifica- tion problem since one is only able to find a model equivalent to the system observed. Hence, models that are made available for design and control studies are usually, at most, equivalent to the unknown system. The first step in an identification problem is to select, from knowledge of the problem, a form for H which is known explicitly up to a set of components, typically specified by a finite set of parameters. This selection defines a set of systems which is assumed to contain a unique H-operator which is equivalent to the unknown system. The problem is then to find the parameters of this H-operator. By definition the system is "identified" when this set of parameters is found. A wide variety of mathematical models are available for representing the unknown system. Some examples are briefly reviewed here: (a) The Volterra Functional Series The model (1.1.1) may be more explicitly described by the Volterra expansion for continuous functionals t (1.1.3) x(t) = ho(t) +I h1(t,1'1)u('r1)d'rl + t t +-I ... I hn(t,T1,...,Tn)u(T1)u(T2)...u(Tn)d¢1...d¢n + . In (1.1.1) H represents the integral operation (1.1.3). Any nonlinear, deterministic, time-varying, controllable system can be described by a relation of the type (1.1.3) (K6, V3, 32). If the system is linear, only the first two terms of (1.1.3) need to be considered. Kwatny and Shen (K6) have shown that a parameterized approximation to (1.1.3) is determined by trun- cating the expansion at some N > O and approximating hn(.,...,.) by a multidimensional series in the orthogonal functions I¢i(-)}:=1 for each n = 1,...,N. Then, for example, if N = 2, (1.1.3) becomes k k t (1.1-4) x(t) = .2 aicpi(t) + .2: biImi(T1)u(t - T1)d¢1 1=l I=l «a k t t + . ;_ cij I cpi('r1)U(t - r1)dr1I cpj('f2)U(t - 3)de 1,3-1 -m «a k where h2(t,T1,T2) is replaced by i §=1cij¢i(T1)¢j(T2) and the time variation is accounted for by allowing the parameters ai’bi and cij to vary in time. Thus, with the approximation (1.1.4), the system identification is accomplished by determin- ing the parameter values which best establish a mathematical equivalence of the unknown system. If inputs and outputs are sampled periodically, the integrals in (1.1.4) may be replaced with summations. 
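As an illustration of that sampled-data form, the following sketch assembles the regressor vector of a second-order (N = 2) discrete Volterra model, which is linear in the unknown coefficients a_i, b_i, c_ij of (1.1.4). The memory length, the exponential basis functions used for the phi_i, and the random input sequence are hypothetical choices made only for the example, not values taken from the thesis.

import numpy as np

def volterra_features(u, k, M, phis):
    """Regressor for a truncated (N = 2) discrete Volterra model, cf. (1.1.4).

    u    : array of input samples
    k    : current time index (uses u[k], u[k-1], ..., u[k-M+1])
    M    : memory length of the discretized convolutions
    phis : list of basis functions phi_i(tau), tau = 0..M-1
    """
    window = u[k - M + 1:k + 1][::-1]            # u[k], u[k-1], ..., u[k-M+1]
    s = np.array([np.dot([phi(t) for t in range(M)], window) for phi in phis])
    quad = [s[i] * s[j] for i in range(len(s)) for j in range(i, len(s))]
    return np.concatenate(([1.0], s, quad))      # constant, linear, and bilinear terms

# Hypothetical exponential basis; sampled Laguerre functions would serve the same role.
phis = [lambda t, a=a: (1.0 - a) * a ** t for a in (0.3, 0.6, 0.9)]

rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, 200)
X = np.array([volterra_features(u, k, 20, phis) for k in range(20, 200)])
print(X.shape)    # each row multiplies a coefficient vector to predict x(k), as in (1.1.4)

Each output sample is then a linear combination of one such row, so the identification problem reduces to estimating the coefficient vector, at the cost of a regressor whose length grows quadratically with the number of basis functions.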
Kwatny and Shen suggested sampled-data Laguerre functions for the orthogonal discrete weighting functions “Hamil ' The chief reasons for using the type of representation in (1.1.4) are the genrality in describing nonlinear, time- varying deterministic systems and the linearity in the unknown kernels, hn(°""")’ or the unknown parameters of (1.1.4). However, practical difficulties arise since the required number of orthogonal functions and coefficients is usually much greater than the order of the System. IAlso, in the linear case, for example, the representation (1.1.4) does not directly reveal the natural frequencies of the system, or alternatively, the natural modes or eigenvalues of the system. (b) The Frequency ReSponse Function A useful mathematical representation for linear, time- invariant systems results when the Fourier transform is used in connection with the system equation given by a (1.1.5) x =I h(¢)u(t - 1')d'r -@ Here {u(t)} is assumed to be a wide-sense stationary process, h(-) is the system impulse response function and {x(t)} is the system output process. An estimate of h(o) is obtained implicitly by estimating its Fourier transform. This is accom- plished by evaluating the power Spectral density of the output and the cross-Spectral density of the input and output, and then applying the Fourier transformed version of (1.1.5) given by (1.1.6) Sxx (w) = Sux (w)H(w) Here, the power Spectral density of the output process is as (1.1.7) Sxx(w) =I E[x(t)x(c + T)]exp(-jw*r)d'r and the cross-spectral density of the input and output pro- cesses is (1.1.8) Sux(w) ==I E[u(t)x(t + ¢)]exp(-ij)dT The frequency reSponse function is given by a (1.1.9) H(u.)) =I’ h(t)exp(-jwt)dt -a Time averaging may be used to estimate the correlation func- tions in (1.1.7) and (1.1.8) if the input process {u(t)} is ergodic in the wide sense (P2). Evaluating Sxx(m) and Sux(w) gives the system frequency response function H(w) using the non-parametric representation (1.1.6). Some smoothing pro- cedures and goodness-of-fit criteria have been develOped for use in conjunction with this calculation (B4, A1). The appli- cation of (1.1.6) leads to optimal (minimum mean-square error) estimates of h(-) (L2, Jl). In order to use spectral methods to find the frequency response function, i.e., the Fourier transform of h(-), the assumption of causality (h(t) = 0 for t < 0) is dropped. However, if the data are actually taken from a linear physical system, the realizibility condition will automatically appear in the estimate of h(-). In practice, certain difficulties arise. The model (1.1.5) is restricted to linear time-invariant systems with stationary inputs. The procedure requires a large set of input-output data samples in order to obtain good estimates of Sxx(w) and Sux(w) which are to be computed for all m, as opposed to computing a finite parameter set. Furthermore, perfect observations of the input-output variables should be assumed. The primary advantage of this representation is that the order of the system need not be known. (c) The State-Space Representation The unknown (causal) system may be described by a system of differential equations written in the form of an automomous or dynamical first-order stochastic system. 
Such a form is the state-space representation given by (1.1.10) §< = §(X(t).U(t),W(t),a,t) y(t) = H> v(t) = G(u.p where x(t) is a vector-valued function representing the system state and Q(-) is a vector-valued function of the state x(t), the input vector u(t), an internal noise vector w(t), a con- stant vector parameter a, and time t. The observable quan- tities, y(t) and v(t), are noisy observation vectors of the output and input, respectively. As such they depend on the noise vectors w(t) and p(t) according to the functions H(-) and G(-), reSpectively. This representation encompasses a wide variety of nonlinear, time-varying, causal systems. In the form (1.1.10), the functions Q(-), H(-) and G(-) are assumed known and the unknown system parameters are lumped into the vector a- In the discrete-time case the continuous-time system (1.1.10) may be written as a discrete-time model given by (1.1.11) xk+1 = §k(xk,uk,wk,a) yk = H(xk"Ik) = G(uk.pk) where the dimensionalities remain unchanged and the index k replaces the independent variable t in (1.1.10). Thus, (1.1.11) serves as a model when the data is sampled periodically. Generally such differential-difference models are convenient because the number of parameters may be selected in advance. Also Since there is no practical way to determine §k(-), H(-), and G(-), their forms are assumed based on available information about the system and the kind of model one wishes to fit to it. If the system is assumed linear and time in- variant, then (1.l.11) becomes (1.1.12) x k-I-l Axk + Buk + wk Yk “HXk+I‘k v =Guk+pk where the unknown parameter a is contained implicitly in the matrices A and B. When the system is a single-input single- output system, (1.1.12) reduces to a linear difference equation with unknown coefficients (L2) such as n to (1.1.13) x(k) = z aix(k-i) + z: b.u(k-i) + §k i=1 i=0 1 where the noise variables are lumped into the random variable gk and the coefficients ai and bi are unknown. Thus the generalized version (1.1.11) may be specialized to a variety of cases depending on assumptions characterizing the unknown system. The intent here is to use the most practical form of model. Since data are often acquired by sampling, discrete- time models will be considered. Furthermore, experience has shown that most practical systems can be adequately described by a difference model of low order (32). Hence, in the interest of practicality and flexibility, only models of the form (1.1.11) will be discussed in this thesis. 1.2 A Survey of System Identification Techniques System identification methods fall into two categories: on-line and off-line methods. The on-line approach indicates a procedure that updates the estimate of the system parameters with each new set of observations, so the on-line estimator is recursively defined in terms of its last value and the most recent set of observations. The off-line method uses the entire data set to perform a single calculation to estimate the unknown parameter. The literature seems to indicate that on-line methods are the most flexible because systems that are nonlinear and time- varying and that are excited by nonstationary inputs can usually be treated by on-line methods; in contrast, off-line approaches are usually restricted to Stationary linear systems with sta- tionary inputs. However, consistent estimates are obtained more easily off-line since on-line estimators are often biased (A4, Bl). The literature on this subject is too extensive to review completely here. 
However, some recent results related to the identification of the model (1.1.11) will be surveyed. (a) The Maximum-likelihood Method The maximum-likelihood estimate of the parameter a is * the value of the estimator a of a for which the conditional 10 density of the system output given the past inputs and outputs and estimate a* is maximized. Consequently, computing the maximum-likelihood estimate of 0 requires a knowledge of probability densities, specifically the likelihood function for the observable random variables. Gaussian, independent random variables are usually assumed for the noise since these lead to analytic expressions for the likelihood functions (D2). This method has been verified in the linear time-invariant case (1.1.13), where the {gk} are independent and Gaussian. In particular, an off-line approach gives a maximum-likelihood estimator for a which is consistent, asymptotically normal, and efficient (A4, 85). In his approach, Astron (A4) used the Newton-Raphson method and Steiglitz and Rodgers (SS) used a Gauss-Newton method to compute the maximum of the likelihood function. Both of these methods are iterative gradient search schemes for recursively estimating the maximum using the gradient of the likelihood function. The function to be maximized is defined for (1.1.13) as S: [ZVar(§k)]-1 - constant 1 (1.2.1) r II WIN 2 where gk is intrepreted as the error between the measured output x(k) of (1.1.13) and the predicted output which is n m 2 a,x(k-i) +' Z biu(k-i). Derivatives are taken with reSpect i=1 i=0 to the unknown parameters and are used to construct the iterative numerical solutions, as an analytic solution is not possible. 11 Generally, the limitations associated with this method include: (1) the amount of computation required increases rapidly with the size of the data set N, and (2) there exists an uncertainty that a computed maximum is global instead of local. (b) Learning Techniques A variety of methods under the heading of "learning" techniques have been studied for system identification ranging from statistical methods to deterministic procedures (Cl, D3, F3, F2, CZ, N1, R2, B3). Such techniques have been used in conjunction with the performance criteria for determining goodness-of-fit determined by minimizing the loss function * 2 (1.2.2) HekHZ = H0 ‘ “k“ where is the error between the unknown parameter a of 6k * (1.1.11) and the estimate after k observations ak’ while H'H denotes the Euclidean norm for vectors. For some models in the class of systems (1.1.11), statistical learning techniques will minimize the expectation of the loss (1.2.2), that is, the mean-square risk function (1.2.3) EHekHZ = EH0 - GZHZ * Here the estimate is the expected value of the estimator ak given the past observations (D2). AS with the maximum likeli- hood method, a prior knowledge of the noise distributions is required; gaussian assumptions are usually made in order to * obtain analytic expressions for the estimator akf With such 12 gaussian assumptions, on-line estimators for minimizing the risk (1.2.3) have been develOped with the help of Bayes rule for the special case of (1.1.11) as in (1.2.4) below (D3, F3). (1.2.4) xk+1 = §k(xk,uk)a +-wk yk g xk + ‘I‘k Vk =‘fic + pk Under reasonable approximations and assumptions, consistent estimators for a were found. At the other extreme is the deterministic approach. A simple example of an application of deterministic learning occurs with the deterministic model (1.1.13) resulting from setting the random gk's to zero (N1, R2, B3). 
The model (1.1.13) becomes n In (1.2.5) x(k) = : a,x(k-i) + z b,u(k-i) i=1 1 i=0 1 and the algorithm is given in two parts by (1.2.6) a’i‘(j+1) = a:(j) +Aj[x(j) - z(j)]x(j-i), i=1,2,...,n and (1.2.7) b:(j+l) = 13:0) +Aj[x(j) - z(j)]x(j-i), i=0,1,...,m where a:(j) is the estimate of the unknown scalar ai at the jth iteration of (1.2.6). The quantity b:(j) is defined similarly, and the scalar Aj is given by n m —1 (1.2.8) A. = [2: x2(j'i) + Zu2(j"i)] - c 3 i=1 i=0 13 where 0 < c < 2. The scalar variable z(j) is the projected prior estimate defined as m n * * (1~2-9) z(j) = Z 31(j)X(j'i) + 2 bi(j)u(j'i) i=1 i=0 Roberts (R2) showed that the loss function, similar to (1.2.2), given by n m (1.2.10) .10) = z (ai - a’i‘unz + 23001 - bjonz i=1 i‘ is strictly decreasing with increasing j, under liberal assumptions on the initial estimates {a:(l)} and {b:(l)} and the input. Similar convergence arguments were given by Nagumo and Noda (N1). (c) Least Squares and Stochastic Approximation Methods A commonly-used performance criteria for parameter estimators is the minimization of the quadratic loss function. For estimation of the parameter a in unknown discrete systems, such as those special cases of (1.1.11) given above, the quadratic loss function is defined as * k 2 (1.2.11) Jk+1(a ) = iElIIyi+l - 91+1II where yi+1 is the system output measurement at time i+l and y is the computed value of the model output which i+l O * 0 depends on an estimated parameter a , Input-output measure- ments and the system equations which are assumed known except for the system parameter a- 14 Application of the criteria (1.2.11) represents the least-squares approach for fitting the model to the unknown system. The method consists of (l) finding the value of a* which minimizes the loss function (1.2.11), and (2) showing that this value approaches a in some sense, e.g., mean-square or with probability one, as k increases to infinity. The important features of this method are its computational sim- plicity and that no assumptions defining the distributions of the noise variables need be made. If knowledge of these dis- tributions is available and appropriate weighting of the data is used in (1.2.11), the least-square estimator exhibits behavior Similar to that of the maximum-likelihood and Bayesian estimators. However, depending on assumptions made regarding noise variances, the least-squares method leads to biased estimates of the un- known system parameter 0. Generally, the bias will appear small if the noise power is small relative to the excitation power. Also, in an on-line formulation, bias is usually a problem with all estimation methods and consequently an estimate of the resulting error is required. The least-squares method has been proven effective in the linear case (1.1.13) when the noise variable §k is assumed to contain both system and input—output observation noise, i.e., the input and output are observed through additive noise (S4, C3, H3, H4, P1, H5, K5, L2). In this case the estimator is defined recursively in terms of its last value and the most recent set of observations. In particular, the estimator takes the form: 15 1 2 12 * - * + P z k+l z' * ( ° ' ) ak+l “k k+l k 0 and p > 1 and a set of integers l = VI’Vz --- with pk = v - v k+l k and 0 < pk s p where Jk = {vk,vk + l,...,vk+1 - 1} such that ian inf I . 2 QI(y.,v.)§.(y.,v.flfl 2 I a.s. 
k {yj’vajEJk} m1n[jEJk J J J J J J 0 where §j(yj,vj) is the known nXm matrix transition function of the observables yj and vj, the func- tion 'xmin(') takes the value of the minimum eigen- value of the matrix placed in its argument, the ' denotes the matrix transpose, and a.s. means almost surely (or, equivalently, with probability one). It is also assumed that 22 i' (y.vi)i (yjjw) infl: Inf Amin< 2 “j. (31’, VjfljIj “Q (y V )II)] 2 T. > 0 a.s. {yj.vaj€Jk} jeJk j j’ j j’ JI Assumption A5 implies that there is a partition of the integers into sets such that in the kth set Jk’ any random observable output sequence {yjIj 6 JR} and random observable input sequence {vaj E Jk} that occurs for which <2k I. (y. ,v. )0. (y. .v. > J6 )‘min occurs with zero probability. Further implications of this assumption and the next assumption A6, are discussed after stating A6. A6: Using the notation of A5, there is a real constant T > 0 such that, mjn Am ax(9j(yj,V. .)9. J(yj.vj)) k inf inf r I 2 T a.s. k [{yj,vj‘jEJk} mjx xmaij (yj svj)§j (yj :Vj)) k where kmax(.) denotes the maximum eigenvalue of its argument. The assumptions A4, A5, and A6 constitute an identi- fiability criterion for unknown systems using the identification method developed here. Assumption A4 is certainly a necessary condition, for if part of the system were not connected into the input, but were connected into the output, then the system would not be controllable from that input; conversely, if the same part were not connected to the output, then the system would not be completely observable from its output. The 23 necessity of this condition is discussed more generally by Kalman, Lee and Aoki (Kl, L2, A3). Assumption A5 has the following geometric intrepreta- tion which explains why it is necessary for identifiability. For a set of p nXm.matrices Fk with I kl kl (2.2.1) F; = (f: , £2 ,..., fn ) , k = 1,2,...,p k where the fi are row vectors of Fk’ for each k (V1) <2 2 2) < p f [ < p ] .. ), )3 F'F ) = in z' 2 F'F )2 min k=1 k k IIZII=1 k=1 k k P P n 2 = inf [ 2 (sz)'(sz)] = inf 2 Z (fiz) ] HZH=1 k=l qu=1 k=l i=1 The inner product IszI can be intrepreted as the distance between the vector f: and a hyperplane passing through the origin having 2 as its unit normal vector. Then, the Xmin defined above is the sum of all such squared distances between the vectors f? and this m-l dimensional hyperplane. This is indicated geometrically in Figure 1 below for m=3 and p=l. A Figure 1. 1 Hyperplane H, With Unit Normal 2, and Vector Distances IfizI From H 24 If A or the sum of squared distances is positive,then the min vectors f: have spanned the m-dimensional Euclidean space; if the sum is zero the vectors f: have at most spanned the (m-l)-dimensional space of the hyperplane, i.e., they lie in the hyperplane. Similarly, assumption A5 requires that the sequence of matrices {§j(yj,vj)Ij = 1,2,...} can be parti- tioned into finite groups indexed by the sets Jk’ k = 1,2,... in such a way that the row vectors in each group {§j(yj,vj)Ij 6 Jk} Span the m-space. This is guaranteed for almost every set of observations {(yj,vj)Ij 6 JR} for each k = 1,2,... in such a way that the sum of squared distances is bounded away from zero with probability one. That is, (2.2.3) inf z' E §!(y.,v.)§.(y ,v ) z 2 )( >0 with probability one or for almost every observation set {(yj,vj)Ij E Jk} corresponding to the kth interval, k = 1,2,... 
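Assumption A5 can also be checked numerically on a record of observations: for each block J_k one forms the Gram matrix formed by summing the products of each transition matrix with its transpose and verifies that its smallest eigenvalue stays above some fixed positive bound. The sketch below performs this check; the block length p and the randomly generated regressor matrices are stand-ins for the actual transition matrices of a particular system.

import numpy as np

def excitation_margin(Phi_block):
    """Smallest eigenvalue of the block Gram matrix (the quantity bounded below in A5)."""
    gram = sum(P.T @ P for P in Phi_block)        # m-by-m symmetric matrix
    return float(np.linalg.eigvalsh(gram)[0])     # eigenvalues returned in ascending order

# Stand-in regressors: n = 2, m = 3, block length p = 4, exciting by construction.
rng = np.random.default_rng(1)
p = 4
blocks = [[rng.normal(size=(2, 3)) for _ in range(p)] for _ in range(50)]

margins = [excitation_margin(block) for block in blocks]
print(min(margins))    # A5 requires this margin to stay bounded away from zero over all blocks

A strictly positive margin on a block means the row vectors of the transition matrices in that block span the m-dimensional parameter space, which is exactly the geometric condition pictured in Figure 1.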
The necessity of this requirement is evident from con- sidering a direct estimate o* of a from the defining equa- tion (2.1.2); that is, using the equation, * (2'2'4) yk+1 ' Q1<(y1<"’1<)"’ + wk + Ik+1 ' * I . Note that Qk(yk,vk)§k(yk,vk) must have max1mum rank or a cannot be determined, even in the noise-free case. If I . §k(yk,vk)§k(yk,vk) does not have maximum rank for some k then a solution is still possible if the sum of a finite set of equa- * tions like (2.1.6) gives the coefficient of a maximum rank. 25 Thus, with noise present it is necessary that the row vectors of §k(yk,vk) have a non-zero projection on all m coordinate axes of the m-space containing a infinitely often as k tends to infinity, as in (2.1.4), in order that a sequence of estimates of a may converge to a- Assumption A5 requires this pro- perty in §k(yk,vk) in a precise manner. Furthermore, A5 may be interpreted as a requirement on the input sequence {uk}, and indeed, for linear time-invariant single-input single-output systems, it reduces to an input requirement (A4). In this case Astrom (A4) found it necessary for an input covariance matrix of appropriate dimension to be positive definite giving the result that the input excites all modes of the system or all components of the unknown vector. This property is analogous to the characteristics of assumption A5. Assumption A6 completes the conditions for an identi- fiability criterion in the sense that it is a necessary con- dition imposed on the input and system function §k(-,') to Obtain an estimator a* that converges in some probabilistic sense to a- The assumption A6 is quite reasonable in practice and as such, with A4 and A5, it renders the system identifiable in the sense suggested by Aoki (A3); i.e., that there is a of a such that lim & = a in sequence of estimates k “k some sense. In conclusion, the assumptions Al through A6 are practical and realistic in the sense that wide-sense stationarity and prior knowledge of the noise distributions are not assumed. Furthermore, the assumptions are well-suited to the on-line case where the 26 effect of the initial state of the system is insignificant and the system is identified from its actual input and output as opposed to a preselected input, such as white noise. The assumption that the system is controllable and observable, as well as stable, is also reasonable from a physical and practical vieWpoint. 2.3 The Identification Algorithm In order to judge how well the model fits the unknown system, it is necessary to define a loss function as a measure of "fit". In the present case, a logical choice is the quadratic loss function which is a function of the random error given by k A _ A 2 (2'3'1) Lk+1(°‘k+1) iElII§i(yi’vi)ak+l yi-I-lII where the norm function H-H is the Euclidean norm for vectors, &k+l is an estimate of a based on k+l observations of the systems input and output, and the remaining symbols are defined in (2.1.1). This function does not depend explicitly on noise covariancesand, as it is quadratiC, its minimization in &k+l is in accordance with the least-squares criterion for estimating o- If some knowledge of the error covariance is available, then the loss function (2.3.1) can be weighted by replacing each term on the right-hand side with A 2 _ A I II§i(yi’vi)°’k+l ‘ yi+lIIVi ’ (§i(yi’vi)ak+l " yi+l) ' vi(§i(yi’vi)ak+l ' yi+l) ’ 27 where the V1 are positive-definite matrices of appropriate dimension. 
These act as weights for the different residual errors given by the quantities i§i(yi’vi)&k+l - yi+1}, and are defined as the inverse of the covariance matrices of these errors. In the arguments to follow, the prior weighting of the quadratic loss (2.3.1) plays no significant role and since the error covariances are not known, no such weighting will be included. The objective is, then, to find the &k*l which yields the minimum value of the loss Lk+l(&k+l) given by (2.3.1)- This is random, as is the loss. However, its value is a'k+l determined from the data with (2.3.1) in a deterministic sense, and then its characteristics are examined in a probabilistic sense. With the application of this loss function, two assump- tions are now required in addition to the set Al through A6. A7: As was stated previously, the unknown system (2.1.2) is assumed to be stable (in the asymptotic sense) (P4, L3); it is further assumed that the model of the system (2.1.2), defined with the notation of (2.1.1), namely, (2.3.2) yk+1 = ik(yk.vk)ak . is also stable. This means that the system (2.1.2) is also stable for 0 replaced with any estimate O’k+1 °f °" If the condition A7 is not available, then all estimates of a may be constrained to a closed and bounded m-Space containing 28 a corresponding to the region of stability. In this event, the arguments to follow for the system identification algorithm and proof of its validity are not changed. A8: It is assumed that the loss function (2.3.1) has a unique minimum in &k+1. Assumption A8 has special importance for the case when systems of the type (2.1.1) are considered. This class will be considered later. Now consider the minimization of the loss function * (2.3.1). Let “k be such that * A 1 . * (2.3.3) Lk(ak) < Lk(ak) for a l ark 7‘ ak . * Then, ak gives an optimal fit, in the least-squares sense, of the model to the data based on k observations. Under certain assumptions on the noise, it is an unbiased estimator of the * unknown parameter a as well. In such cases ak is a least- squares estimator of a, (H1, D2). Otherwise is a biased * “k estimator of o and the bias must be evaluated. In a gaussian a I * O formulation, w1th prOper selection of the Vi’ ak gives the Bayesian estimator, or conditional expectation, of a, (A2, A3). Such assumptions are not made here and so the loss (2.3.1) will be considered as it stands. The minimization is approached by * * k+l in terms Of a and the most recent observations finding a k * of the system's input and output, (Vk’yk)‘ This defines ak+l as a recursive estimator. The resulting recursion is a well- known consequence of classical least-squares theory (H2, L2, A3). 29 Its development in the context of the present problem is outlined here. For convenience, let (2.3.4) Fk = kak’vk) and Fk = §k(xk,uk) Then (2.3.1) may be written as (2'3'5) Lk+1(°’k+1) Lk(°’k+1) +IIFlcc"1<+1 yk+1II ' If 2 3 6 * - “ + ( ‘ ' ) OZk+1 ak+l Ac’k+1 ’ * “ d ' ' . . where ak+l and ak+l are efined 1n (2 3 3), then for any A * a ak+l’ ak+l must satisfy k A A * ' - = (2°3°7) (A“k+1) iElFi(Fi°‘k+1 yi+l) 0 ' This result follows from a direct substitution of the right side of (2.3.6) into the right side of (2.3.3) and then per- forming the indicated cancellations and algebraic reductions as follows. Let k = 1 and drop the subscripts k+l in (2.3.6). Then (2.3.3) implies I210? - y.\\2 s n29 - yznz .. \Iila*\\2 - 2.291; + 11sz2 s Ifqanz - 2y; 2,2 + IIszIz = HF1(o* - Aa)H2 - 2yé(§1)(a* - Au) + HYZHZ so that 30 t * 2 . * 2 ~ * 2 . * . 
IIFla II " 23'2' Fla + IIYZII 5 IIFIQ’ II " (Fla )Fl M A A * A 2 A * A 2 - Aa'Fl'Fla + IIF1 AaII - ZyéFla + 2y2'F1Aa + IIY2II Therefore 2 * IA “ 2 l " -2(F1a ) FlAa +IIF1Aa/II + 2y2 Flm 2 o and this is true if and only if o* is such that '1? '(i‘ * ) = 0 A“ 1 1°‘ " y2 Application of this result to each of the k terms in (2.3.3) gives (2.3.7). Now specialize (2.3.6) to the case * * that is where the arbitrary estimate &k+1 is particularized * to ak' Then )=L(*+ )= kIF<*+Aa) II2 L1<("'k+1 k “k A"’1.<+1 121‘ 1 “k k+l y1+1 k A * 2 A A ' A * 1.’,:'=1I:IIFi°’k ' yi+lII + IIFi A"’1<+1II2 + (ZFiMk-I-l) (Fiak ‘ yi+l)} and the last term on the right is zero by (2.3.7). With this relation the definitions (2.3.1) and (2.3.8) give * _ * a * 2 Lk+1(°’k+1) " Lk(°’k+1) + IIFk(°’k + Mk+1) ' yk-I-lII k—l ‘7 .EIIIIFi“: ' 29.1““ + Iiimmuzi + Hike: - ykfluz +\\2.Aa...1\\2 + 2(FkMk-I-l) (Fkak yk+1)’ 31 Hence * = * A * - 2 ' A. A * (2'3'9) Lk+1(°’k+1) Lkmk) + IIFkO’k y1<+1II + 2 The mean-square convergence theorem is now considered. The gist of the proof is described here. First, the algorithm is considered within an interval, Jk’ of observations. The true parameter a is then subtracted from &k, the algorithm is written in an iterated form, and the algorithm equation is 41 squared. Frequent applications of the Cauchy and triangle in- equalities are made, and expectations are taken. It is then observed that mean-square convergence will occur if the sequence of estimates q A a - a converges in the mean-square sense; Vk Vk i.e., the sequence of estimates corresponding to the first or th . vk observation in each interval Jk’ k = 1,2,... . Second, the set of estimates {qvk} is considered. A similar con- struction is made from iterating the equation, squaring, using the triangle and Cauchy inequalities and taking expectations. The resulting expression is considered term-wise with the help of Lemma 4 (Appendix). Convergence is finally obtained by achieving bounds for these terms which tend to zero. The Appendix and assumptions Al-A8 are applied frequently. The mean-square convergence theorem for the estimator &k is stated and proved below. Theorem I : Mean-square Convergence Under the assumptions Al through A8 of section 2.2 the estimator , given in (3.1.2), 57 k converges in the mean-square sense to the unknown parameter a of the system (3.1.1). Proof: First substract o from each Side of (3.1.2) and let Then (3.1.2) becomes (3°13) q1<+1 = qk ' Ak(Hqu ' Yk) 42 In accordance with assumption AS the sequence of observations {(yi,vi)Ii = 1,2,...} is partitioned into sets indexed by the sets of integers Jk where v - 1}, k = 1,2,... k+1"'°’ k+l (3.1.4) Jk = {vk, v and the length of the partition is uniformly bounded such that k+1 Vk " pk S P < “ ‘ Now let n be arbitrary such that n 2 p. Then n belongs to one of the index sets Jk’ for some k. Fix this k, and using (3.1.3), iterate (3.1.3) with itself from n back to the first index in the set Jk’ namely Vk’ getting 11 n n (3.1.5) qn+1 = .9 (I - AjHj)qv + .§ ._H (I - AiHi)Ajvj J-vk k J-vk i-j+l where factors in the product terms are indexed in descending order. Premultiplying (3.1.5) by its transpose gives, with application of the triangle and Cauchy inequalities for matrices (V1), I _ 2 n 2 2 (3°1'6) qn+1qn+1 ” an+1H S .H “I " AjHjH qu H J=vk k II n I n .. I I + 2 q n I - A.H. z n I - A.H. A v, n n 2 + <2 H II -A1H1IIIIA,-vj\\> j=vk i=j+1 where the norm function H'H is defined in the Appendix. 
Now according to Lemma 4 (Appendix), there is an MI > 0 Such that 43 (3.1.7) igj HI - AiHiH 3 M1 a.s. k for all sufficiently large k. Applying (3.1.7) to (3.1.6), taking expectations and noting that IIAv H 2 “Ajn for all k j E Jk’ gives (3.1.3) EIIqm.1II2 s miEIIq,kII2 + ZMiE + IA, Min? IIvaI>2 . k j=vk Applying the Schwarz inequality (L4) to the second term of (3.1.8) and noting that assumptions Al and A2 imply that there is an M2 > 0 such that n 2 2 (3.1.9) E( 2 IIvaI) N as in Lemma 4, and taking expectations, (3.1.13) becomes (3.1.14) EHqV H2 ‘2 k+1 .<. <1 - c'uA, .1\\>2 Enqvx k+1 k "1<-+1'1 + 2 ' ' - A E(qv¥§k jEJk i=g+1 (I iHi)AjvJ-) + miuAv H2 E< 2 uvjufi . k jEJk Now in order to complete the proof it remains necessary to show that the right side of (3.1.14) tends to zero. In order to accomplish this consider first the second term of (3.1.14). In the second term the product term may be expanded to get 45 "M1"1 j l = . (3.1.15) Qk i n (I - AiHi) (I + 2;: AiHi) =J+1 l-V k 2 - igJ (A111i + HiAi) + Ak ij , k where 2 2 2 mm A, 4 ,2 M M 1€J k and (3.1.17) ij Q (all remaining product factors)-A;(2 and note that ij is uniformly bounded with probability one by Lemma 4 (Appendix). Then with (3.1.15) the second term of (3.1.14) may be written as -1 vk+1 (3.1.18) 2 E(q' Q' 2 n (I - A.H.)A.Y.) vk 1‘ jeJk i=j+1 1 1 J J J' =2E'I+ A,H A,,-2E' BAH +HA,A (qv ( .§ 1 i) 3Y3) (qv (. i ' i 1) ij) k 1-vk k JEJk 2 + 2 ' A k In order to obtain the required bounds on the second term of (3.1.14), consider each of the three terms on the right side of (3.1.18). Dropping the coefficient 2 and noting that Yj is zero-mean, the first term may be written as follows 46 (3.1.19) EE(q' (1 + 2 AH mum )lq > Vk i=vk vk =EE(q' (2 AWHMqu vk i=vk Vk) and by the Cauchy-Schwarz inequality (L4) J' s EEnqvku> and by assumptions A1 and A2 there is an M3 > 0 such that s M3HAka2 Enqvkll . The second term on the right side of (3.1.18) may be evaluated the same way using the Cauchy inequality and assumptions Al and A2; this gives (3.1.20) EE(q' (2 AiiH +HA i)Ajjv |q ) Vk 131k vk s EE(qu “(1' ZHAiHHHiH>\\AJ-\H\Yj\\\qv > k 1€Jk k s M.nAvknznuqvku for an appropriate M4 > 0. Likewise, the third term on the right side of (3.1.18) may be evaluated using the definition of A: given in (3.1.16), the Cauchy inequality and assump- tions Al and A2 as before. The result is 2 3.1.21 EE 'A G ( ) (qVk k jk Aj wjlq s EE(qukH HAVRHZ [iéijHiHZ JHGJ-kH uAju ijll \qvk) s nAvku2 MSEuqvkn . 47 where M5 is again appropriately selected. Then, defining M 6 to be (M3 +-M. +-M5), (3.1.18) becomes 4 "k+1'1 2 (3.1.22) E(q' qt 2 n (I - A,H.)A. ) S M A E . Vk k iEJk i=j+1 1 1 JV] 6“ VkH “(1ka Since generally a2 + b2 2 ab and by Jensen's inequality (D2) 2 2 (3.1.23) E qu H s Equ H , k k the right side of (3.1.22) becomes 2 2 2 2 2 (3.1.24) MHA H E q H SMHA H Eq H + A | . 6 vk H Vk 6 Vk H vk H vk‘ With the bounds thus obtained for the right side of (3.1.22), and hence for (3.1.18), the right side of (3.1.14) may be bounded as follows 2 2 2 (3.1.25) Eq 5 (1 - c' A _ ) E q H n ,an u v1... .1 n .k 2 2 2 2 2 2 2 +MHA H EHq H +HA H +MHA H EH: Hv-H> - 6 Vk vk Vk 1 Vk jeJk J In the first and second terms on the right note that there are constants M , M , and N >~N, where N is defined in 7 8 1 Lemma 4 (Appendix), such that (1 - c'nAka-1\\>2 + M:\\A,kuz s 1 -M7\\A,kn and M8 2 E( z HY.H)2-M2 . J 1 JEJk 48 for all k 2 N . 
Then (3.1.25) reduces to 1 (3.1.26) Equk+1H2 s (1 - M7HAVkH) Euqkaz + MBHAkaZ k 1 n 2 = n - M A E H: 7H.j> qu1\\ k k W 2 n <1 -M7uA., n2>nAv n"- j=N1k=j+l i J' The first term on the right side evidently tends to zero as k tends to infinity. Also since 2 “AV H2 < a , kle k Lemma 8 (Appendix) implies the second term tends to zero as well. Hence 2 (3.1.27) lim EHq H = o , k V k+1 which (3.1.26) now implies. Consequently the subsequence {qv } converges to zero in the mean-square sense, and this k completes the proof of mean-square convergence of the algorithm (3.1.2). Q.E.D. 3.2 Almost-sure Convergence The convergence proof presented here shows that lim Hc‘rn - a\\ = lim anH = 0 n n with probability one or almost surely (a.s.). The steps taken 49 to accomplish this proof are as follows. As before, the algorithm is first considered on the Jk interval of observa- tions. The algorithm equation is written in an iterated form, norms are taken and the Cauchy and triangular inequalities are applied. It is then observed that the convergence is proven when it is shown that the subsequence of estimates {qvk} con- verges to 0 a.s. Then, the algorithm, written iteratively in terms of the subsequence {qV }, is considered termwise for finding bounds which tend to zero with probability one. The required bounds are obtained with the help of some algebraic relations due to Albert and Gardner (A2) and the Lemmas in the Appendix. This convergence proof, as with the mean-square case, may be stated in the form of a theorem. Theorem 11 Almost-sure Convergence Under the assumptions Al through A8 of section 2.2, the estimator ak given in (3.1.2) converges with probability one to the unknown parameter a of the system (3.1.1). Proof: The proof begins in a manner similar to the previous mean-square proof. The algorithm is written in terms of the error qk A ak - a, and then, the equation is iterated back to the first observation in the J interval. First let n k be arbitrary; then n belongs to JR for some k. Iterating the recursive algorithm (3.1.3) of qj from n back to the first estimate of the set Jk’ namely q\J , gives the same result k as (3.1.5) which is repeated here as HR. -|2\~ _ 50 n n n (3-2-1) (1 = II (I " A.H.)q + )3 1'1 (1 ' A.H.)A Y. , where Jk is given in (3.1.4) and the other variables are given in (3.1.2). Taking norms in (3.2.1) is the next step toward finding bounds for those terms on the right. With frequent use of the triangle and Cauchy inequalities the normed version of (3.2.1) becomes 3. n lux - AininnAjvjn (3.2.2) nqnflu s 3 HI winiu nqv n + °_ k k J k J-v By Lemma 4 (Appendix) there is a constant M1 > O which bounds the product terms on the right of (3.2.2), as indicated in (3.1.7), and as a result (3.2.2) gives (3.2.3) an+1H s Mlqu H + pM1( max HAjyjH) a.s. k jEJk In (3.2.3) the p is the uniform bound over the length of the Jk intervals as given in (3.1.4). Since the bound M1 holds with probability one, (3.2.3) necessarily holds with probability one. Now Lemma 1 (Appendix) implies that the partial sums of (Ajyj) form a Cauchy sequence so that (3.2.4) lim ( max HA,y H) = o a.s. k 363k 3 j Therefore, almost-sure convergence of the estimator (3.1.2) is proven when it is shown that (3.2.5) lim Hq H = o a.s. k Vk 51 In order to accomplish (3.2.5), it is first necessary to con- struct an algebraic relation using the following definitions: n (3.2.6) R Q R,(n) 9 n (I - A H ) for j = 1,2,...,n J k- kk “J Q I for j > n where product indexing is in descending order, and (3 2 7) s Q :4 A with s Q 0 . . 
k+l . -Y- 1 . J=1 With these definitions, n n n (3.2.8) 2'. 1'1 (I - A. 1.11 1)A y = Z‘. R v k=1 i=k+l k kk=1 k+1A k k n n n =£R(S -S)‘2Rs-2:Rs k=l k+l k+1 k k=l k+1 k+l k=l k+l k n-l n = 2 R s + s - 2 R s k=1 k+1 k+1 n+1 k=l k+1 k n n = Z R s + s - 2 R s k=1 k k n+1 k-l k+l k n = (flak-+1“ " Aka)sk ' Rk+131<) + Sn+1 n = ' E Rk+lAkfikSk + Sn+1 k-l n n = A - - + - A £21 Rk+1 kaS (I R1)S Sn+l £21 Rk+1 kflksk where the first two terms on the right side are equal by Lemma 6 (Appendix) and finally 52 n ‘ R18 + (sn+1 ‘ S) ' £21 Rk+lAka(sk ' 8) Now let (3.2.9) n = v - l and substitute this value for n into (3.2.1). By sub- stituting (3.2.8) into (3.2.1), (3.2.1) becomes K k "k+1'1 (3-2-10) q = 11 (Q q ) + Z 2 1'1 (1 - A.H.)A Y Vk+1 j=l 3 v1 j=1 rer i=r+l 1 1 r r E Q ( + ) + ( ) = , q s s . - s j=l 3 V1 Vk+1 _ K ?-j21 rgJ Rr+l(vk+l - 1)ArHr($r - s) 3 Taking norms in (3.2.10) gives k l \ H (3.2.11) q 5. 1'1 Q q + s + s -s + z 2 R (v -l)AH(s -s) . Hj=l rEJj r+l k+l r r r H As k tends to infinity, the limit of the first and second terms on the right is zero almost surely by Lemma 5 and Lemma 1 reSpectively (Appendix). Also, for every r E Jj’ k (3.2.12) R (v -1)H s n Q HM a.s. H r+l k+1 i=j+1u i 9 when M9 > 0 is selected such that vj+1-1 (3.2.13) H 1'1 (1 - AiHi)H s M i=r+l a.s. 9 53 according to Lemma 4 (Appendix). The third term of (3.2.11) is k (3.2.14) ngl rEJer+1(vk+1-1)Arur(sr - s)H k k s M, 2 .n HQiH 2 (HArH HHrH>Hsr - sH a-s- j=l 1=j+l rer k k s M. 2 .r; HQiH 2 <1A.HHH.H> j=l 1=J+1 rEJj rEJj Now note that with the definition (3.1.16) of A: , (3.2.15) 1 s [ 2 “Ar“ HHrH] A31 rEJj 2 HA HZHH H2 + 2' 2 H H HH H HA H HH H A: _ reJJ r r k,rer,k#r Ak k r ' r : HAHZHHAF rEJj ’ (2.,.(p2 - 1» max \1A.1\2\\H.Hz> s 1 + r613 1 max HA.\\21\H.\\2 rEJj = (92 - p +-1)% 6,. where the left inequality holds because of the generally true inequality (3.2.16) 2 \a,| 2 ( z \a \ ) . 3:1 .1 3:1 3 Therefore, the right side of (3.2.14) is bounded as follows 54 (3.2.17) M, jg ijfluqiu £1, cuArn HHrH>Hsr - s\\ k k S Mgp' 3'21 i=ljll+1HQiH “322:: “Sr - SH) N N k = Mgp' 531 i=gI+IHQiH A, 1 =1EIHHMH (r213: HS.- ' SH) k k + Mgp' j=§+l i=gl+lHQiH (522:: H3,. ' SH) k k S M10 1=HII+1(1 - c A1) +IM11(1/C)j;§+1bkj(r2j: Hsr - S“) where N and c are defined in Lemma 4 (Appendix), Ak is given in (3.1.16), M10 and M are appropriate constants and 11 k (3.2.18) b . = c A. n (1 - c A.) 1” J i=j+l 1 The bound developed in (3.2.17) for the right side of (3.2.14) is the quantity which bounds the last term on the right side of (3.2.11). The proof is completed by showing that this bound tends to zero. Considering each term on the right side of (3.2.17), the first term tends to zero with probability one by Lemma 5 (Appendix). The second term has a limit of zero by the following argument. (3.2.19) lim bkj = O a.s. for each j by Lemma 2 and Lemma 5 (Appendix). Also note that 55 k k (3.2.20) lim. 2 bkj = lim (1 - n (1 - c Aj)) = l a.s. k j‘N k j=N by Lemma 5 and Lemma 6 (Appendix). In addition, (3.2.21) bkj 2 o a.s. and (3.2.22) lim ( max “8 - s“) = O a.s. . r J rEJj by Lemma 1 (Appendix). It then follows that (3.2.19) through (3.2.22) satisfy Lemma 7 (Appendix), and Lemma 7 implies k (3.2.23) lim (l/c) z bkj( max us - SH) = o a.s. j=N rEJ r J Hence, the second term on the right of (3.2.17) has the limit of zero with probability one. Thus (3.2.24) lim Hq H = o a.s. k Vk-+1 because each of the three terms on the right in (3.2.11) have now been shown to have the limit of zero (a.s.). Therefore, the proof of almost-sure convergence is complete. Q.E.D. 
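The two theorems just proved rest on one underlying recursion, and it is convenient to record it together with their conclusions before turning to the rate of convergence. With q_k denoting the estimation error and the notation of (3.1.2) through (3.1.5), the recursion, its iterated form over an interval J_k, and the two convergence statements are:

q_{k+1} = q_k - A_k (H_k q_k - \gamma_k) = (I - A_k H_k)\, q_k + A_k \gamma_k ,

q_{n+1} = \prod_{j=v_k}^{n} (I - A_j H_j)\, q_{v_k}
        + \sum_{j=v_k}^{n} \Big[ \prod_{i=j+1}^{n} (I - A_i H_i) \Big] A_j \gamma_j ,
        \qquad n \in J_k ,

\lim_{k\to\infty} E\,\|q_k\|^{2} = 0 \ \ \text{(Theorem I)},
\qquad
\Pr\Big( \lim_{k\to\infty} \|q_k\| = 0 \Big) = 1 \ \ \text{(Theorem II)} .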
3.3 Rate of Convergence In this section the rate of convergence of the mean- squared estimator &k’ given in (3.1.2) or (2.5.4), is calculated in the sense that a bound, which has limit zero, is found for the second moment of the error norm. In other words, a maximum 56 rate of convergence is found for the mean-square error. The notation is the same as in the previous two sections. In order to obtain the rate of convergence, consider as the starting point the recursive relation for the mean-square error developed in section 3.1 and given in (3.2.26). This re- lation is repeated here as 2 2 2 can Euqvmu s <1 - M7\\Avku>auq,ku +4.16ka Here A - qvk. avk. a and Vk is the smallest integer in the index set Jk defined in assumption AS. M7 and M8 are positive constants selected such that (3.3.1) is true for sufficiently large k. Recalling that 3V 'V (3.3.2) k+1 k pk in assumption AS, a useful change of notation in (3.3.1) is as follows. Clearly there are constants z > 0 and c1 > 0 such that (3.3.1) may be written as 2 sz-z 2 -2 (3.3.3) EHqVkHH s[ - Vk-l] EqukH + civk Now let A (3.3.4) zj (zpj_2)/Vj_1 and choose the index r such that with iterating (3.3.3) back- wards to r, 57 2 k 2 (3.3.5) Euq H s n (1 - z.)EHq H Vk.+l j=r J vk k k _2 + C1 Z 11 (1 ' z.)Vn a n=r j=n+l J and (3.3.6) 0 < zj < 1 for all j > r . By Lemma 9 (Appendix) there is a constant dk-l 2 1 such that (3.3.7) lim d = l k k an’d’f $71111“ j ‘2 ‘r k vs-2 2 (3.3.8) n (1 - z.) s d ( ) , for s = r,r+l,...,k . . J s+1 J=s _ k-l Then 2 vr-Z Z 2 (3.3.9) EHq H s d _ [ ] EHq H Vk+l r 1 Vk-l vr k v z + C1 2 d [ r-2] -2 n=r n Vk-l n Now let A z/2 (3.3.10) tk+1 Vk-l qv k+l Then 58 2 “.22 2 -2 (3.3.11) Eutk+lu s (6:_1) dr [v: 1] (Eucrn )(vr_2) Z n V V + c 2 dnc32—-)zc-51l) 1 n r k-l Vn2 k z-2 1 Z dnvn n=r =%EMJ2+C The first term on the right is finite and the second term con- verges to a finite limit as k increases to infinity. if (3.3.12) 2 < 1 . However, z must also satisfy the derivation of (3.3.3) from (3.3.1), i.e., 2 must satisfy (3.3.13) (1 - zpk_2/vk_1) 2 (1 - M7HAka) By direct substitution into (3.3.13), it can be shown that (3.3.14) 2 s M7HA1H satisfies (3.3.13) for all Vk > p. Therefore, with (3.3.15) 0 < z < min(1,M7HA1H) , (3.3.11) implies that 3 1 2 _ -z ( .3. 6) EHqVkH - 0(vk_2) Here the function 0(-) is defined in Lemma 4 (Appendix). The bound in (3.3.16) on the rate of convergence indicates that the second moment of the error norm, at best, 59 cannot diminish at a rate faster than 1/k where k indexes the observations. Alternatively, k indicates the observation time. The matter of convergence rate in stochastic approxima- tion methods for system identification has been considered for linear systems (K5, D4, 83). However, it has not been extended to a larger class of systems such as the state-model form (2.1.2). The resulting limit of 1/k on the rate, in any case, seems to be in agreement with these previous efforts. The proof of convergence rate above evidently lends further support to the claim of Deutsch (D2) that certain applications of the method of stochastic approximation "can be frequently unattractive because of agonizingly slow con- vergence after the first few iterations". However, optimal estimates cannot be achieved without some assumptions on the noise variables. For example, gaussian, independent random noise may be assumed in order to find analytic expressions for Bayesian estimates of parameters (H1). 
Even if the estimates of a are optimal, in a maximum-likelihood sense, an on-line scheme as preposed here would be developed in the same way as in Section 2.3 with the correct application of the weighting matrices {Vi} indicated there, and the resulting convergence rate would still be no faster than 1/k. However, suboptimal on-line estimators require larger observation time to achieve the same accuracy as that of optimal estimators. In Chapter IV, results of computer-simulated examples illustrate this fact. 60 3.4 Extension To a Larger Class of Nonlinear Systems With the algorithm (2.5.4) stated and proved for the class of systems (2.1.2), it is worth noting that, to some degree, the method may be extended to a larger class of systems. This class is given in (2.1.1) and is restated here as (3.4.1) x k,a) + w k+1 Q“"16” yk g xk 1’ Hk k vk=uk+pk where the §k(°,°,') is regarded as a known nxl vector-valued system transition function. In this section, the requirements for extending the algorithm to include some systems in the class (3.4.1) will be considered. The extension will be indicated by considering only the scalar case, i.e., where n = 1, since, with the theory of previous sections, the multidimensional case follows directly. The chief requirement necessary to carry out this exten- sion is that the first partial derivatives of §k(°,°,-) with respect to the m components of the vector u exist and are continuous, and also that the function may be closely approxi- mated by the first two terms of its Taylor expansion in a. In particular, let the scalar system function be (3°4°2) Fk(01) = §k(xk,uk’a) and (3.4.3) aQa*+M. 61 Here Au represents the difference between the true parameter * a and some estimated value a of a- Then, using the first two terms of the Taylor expansion of Fk(a), ~ * .' * (3.4.4) Fk(a) = Fk(a ) + Fk(a )Aa . -‘ * where the mxl vector-valued gradient Fé(a ) is defined as * * * . * aFk(a ) aFk(a ) aFk(a ) (3.4.5) FILQY ) = —_*_ a —"'.k_ a°°°a —;_" 801 302 Bum * * and the a1 are scalar components of the mxl vector 0 . This step linearizes (3.4.1) in the unknown parameter a while allowing it to remain nonlinear in the other variables. With this linearization, the assumptions A1 through A8 remain the same in substance. However, the relations stated in A5 and A6 are changed. In particular, it is assumed under the statements of A5 that (3.4.6) inf inf A . [ 2 §.(&)§'(&)il 2 x > 0 a.s. k [1(Yj:vj’&)HjEJk} min jEJk J J o and that A i A FJ(&)Fj(a) (3.4.7) inf [ inf , [ 2 , 2 ] 2 T' > o a.s. k {(yj,vj,&)\j€Jk} “‘1“ 39!. quu where the nXl vector-valued function Fj(&) is given by 3-4-8 3. A = Q. . v. “ ( ) J(a) J(YJ: J90) Similarly, under the statements of A6 it is assumed that 62 . l: A ;l 5 min )‘max[Fj (0L!)Fj (Gil (3.4.9) inf inf 1‘ . 2 21-20 a.s. A . ~ “ A I A k {(yj avj 30’)‘JEJk} mix )‘max[Fk(a)Fj (0)] k The relations (3.4.6), (3.4.7), and (3.4.9) constitute the basic changes in the assumptions Al through A8. The development of the algorithm defining the estimator &k of a now proceeds exactly as in section 2.3 with no addi- tional complication. The loss function Lk+l(ak+l)’ given by k A _ A 2 (3°4'10) Lk+1(°’k+1) ’ iEl(§i(yi"’i"”k+1) ' VH1) o a c A _ * h * 0 has a unique m1n1mum.when ak+l - ak+1, w ere ak+1 1s recursively defined as 3 4 11 * - * + p 1'3 ( * 1? * ( ' ' ) ak+1 “k k+1 kak)(yk+l k(°'k)) ’ and in this case -1 A .: * .2, * -1 (3.4.12) P - 2 F (a )F (a ) + P , k = 1,2,... 
k , j k j k k-l JEJk The steps leading to the result (3.4.11) are omitted here since they constitute a repetition of the steps of Section 2.3 leading to the analogous result (2.3.10). As before, with the help of (3.4.6) and Lemma 10 (Appendix) it may be shown that (3.4.13) “Pk“ = 0(k) a.s. and so the matrix Pk in (3.4.11) may be approximated with a matrix Ak given by 63 (3.4.14) Ak = E(P1)/(kp) , k = 1,2,... where E(P1) is positive definite and the p is given in assumption A5. A sample-average approximation may be used to estimate E(P1) and a value p may be selected to correspond to the maximum length of the Jk intervals. As mentioned pre- viously, any positive definite matrix having the appropriate dimensions may be used for A1 and convergence of the algorithm will be assured; however, the Ak suggested above best approxi- mates the Pk necessary for a least-squares estimate of a. The algorithm (3.4.11) may now be applied in the form A (3.4.15) ak+1 = ak +Ak+1 Fkfiak)(yk+1 - Fk(ak)), where is the suboptimal estimate of 0 replacing the o’1<+1 * minimum quadratic loss estimator ak+1° The suboptimal estimator & has the same asymptotic behavior as does the k+l optimal estimator a:+1° A bias occurs in the estimation scheme (3.4.15) as occurred in the previous estimator (2.3.10). With the assump- tion of wide-sense stationary random variables, the bias may be determined by finding the value of & which minimizes the expected loss function given by . ~ 2 (3.4.16) E(L(&)) = E(Fk(a) - yk+1) . Applying the Taylor expansion (3.4.4) to the function Fk(&), the value of the mxl vector E which minimizes the risk (3.4.16) is given by 64 (3.4.17) 82* -- 01 + E“1(i‘3~k(a>f}1;> Eé'kmmkm) - Row by using the same procedure as used in section 2.4. The bias- error evidently depends on the choice of functions §k(-,-,-). Whether the bias introduces a large error is best determined by evaluating the second term.with assumed noise variances based on the available knowledge of the system behavior. The algorithm to use in practice is given in (3.4.15) with the Taylor expansion applied to the Fk(&k) appearing on the right side. However, if some information is available to estimate the effect of the bias it may be incorporated into (3.4.15) by subtracting an estimate of the term bk given by (3.4.18) bk == E(f;‘k(a)[Fk(a) - §k(oz)]) The term b , which appears as a factor in (3.4.11), i.e., k without the expectation, should be subtracted from (3.4.15) in such a way that (3'4'19) ak+1 = “k + A1<+10515031)W15“) + wk + Hk+1 ' Fk(°’k)] ' bk" Here the quantity following the Ak+1 now tends to have zero mean as tends to 0 since the w and ¢k+l are un- a'k . k correlated with the §k(&k) by assumption Al. The same pro- cedure was used in the previous estimator (2.5.4) to center the estimator, in the same asymptotic sense as in (3.4.19), around its mean value. The final form for the estimator (3.4.19) may be shown to converge in the mean-square sense and the almost-sure sense 65 to the unknown system parameter a. The procedure to accomplish this proof, with the help of (3.4.4), (3.4.6), and (3.4.9), is entirely the same as that used in Sections 3.1 and 3.2 and con- sequently will not be pursued here. Also the convergence rate may be shown to behave the same as in Section 3.3. 3.5 Summary In this chapter the characteristics of the system identification algorithm (2.5.4) were examined. 
In Sections 3.1 and 3.2 arguments were presented that show that the para- meter estimator, defined by (2.5.4), converges in the mean- square and almost-sure senses to the true value. These proofs depend on stochastic approximation theory. Also, the rate of convergence was calculated in Section 3.3. This result indicates that the mean-squared estimator (2.5.4) cannot converge faster than a rate proportional to the inverse of the observation time. It is further indicated that this is true for both suboptimal and optimal on-line parameter estimators. In Section 3.4 it is shown that, with additional hypotheses, the identification algorithm (2.5.4) may be modified to apply to systems which are nonlinear in the unknown parameter. Here, this extension is developed for scalar systems, as the multidimensional case is a parallel result to the original algorithm (2.5.4). In order to verify the practicality of the estimation scheme stated and proved in this chapter, models of various forms have been simulated on the digital computer. With the help of a random number generator, the estimation method has been demon- strated. The results of this verification are discussed in the next chapter. CHAPTER IV EXAMPLES The system identification algorithm developed in the previous chapters is demonstrated in this chapter by a digital computer simulation. Two examples are considered here. The first is a state model of the type (2.1.2) which is linear in the unknown parameter and nonlinear in the state and input variables. The second example is a one-dimensional system which is nonlinear in all variables including the unknown para- meter. This example verifies the extensions to the algorithm discussed in Section 3.4. These two examples were simulated on a Control Data 6500 digital computer. The inputs and noise variables were generated internally. 4.1 Nonlinear State Model: Linear in the Unknown Parameter Consider the discrete-time nonlinear state model given by (4.1.1) x1(k+1) Tanh(x2(k) + u(k)) x1(k) a 1 x2 (k+1) x2 (k) Atan(x1(k)) a2 w1(k) + W2 00 y1' x1 11m = + y2 (k), x2 00 H2 00 v(k) = u(k) +'p(k) 66 67 (4.1.1) is a purely academic example of systems of the type (2.1.2). In this section, results of a simulated identifica- tion of this system are reviewed. Some characteristics of the deterministic part of the system are mentioned first. Expanding the nonlinear terms in (4.1.1) in a Taylor series and suppressing the terms past the first reduces the non-random part of (4.1.1) to the linear system given by x1(k+1) a a x1(k) a 2 (4.1.2) - 4- u(k) x2(k+1) a2 a1 x2(k) 0 This system is asymptotically stable if its eigenvalues (0, a1 +'a2) lie inside the unit circle. When this is true, the system (4.1.1) is asymptotically stable (Petrovski, K.T.K.). If it also happens that 0’1 “1’2 (4.1.3) Det # 0 , 0 “P32 then the system (4.1.1) satisfies conditions for (null) con- trollability (Ll). Furthermore, the system (4.1.1) is observ- able since the state vector is observed directly (L1). The system (4.1.1) was modeled on the digital computer with a random input u(k) which was zero-mean, independent, and uniformly distributed. The noise variables wi(k), Hi(k), and p(k) had the same characteristics as those of u(k). The vector selected for the unknown parameter is given by 68 (4'1”4) 01' = (01902) = (”'25: "70) This vector satisfies condition (4.1.3) and the stability criterion. The identification algorithm used is given by &1(k+1) 611(k) y1 611(k) (4.1.5) = + 9 fi'[ - ‘1" :1 - x x k+1 k k x 6, (1+1) 0:2 (1.) y, (1+1) a2 (1.) 
where Tanh i‘k = y,(k> Acan and the 2X2 matrix Pk+1 is given in (2.3.12). A deterministic gain Ak+1 was also used in place of Pk+l for comparison. The gain Ak is given by (4.1.6) A =——— where a1 and a2 are positive constants. This gain is selected in keeping with the theory of Chapter III and Lemma 10 (Appendix). The simulation was carried out for a high initial estimator error and again for a low initial error. In both cases, various gains were used. Figures 2 and 3 show averages of the squared error of ten runs of the identification algorithm over 512 observations for each gain selected. The squared error is defined as (4.1.» 111.112 = Her 3.112 69 where a is given in (4.1.4) and &k is the vector estimator given in (4.1.5) Figure 2 illustrates the comparitive performance of the estimator for the deterministic and least-squares optimal gains AR and Pk. This simulation indicates that both types of gains give similar results. However, in order to chose the best deterministic gain, the constants a1 and a2 should be chosen so that Ak decreases in a manner similar to the eigen- values of Pk. This requires testing Pk over a few iterations using the matrix inversion lemma (2.3.12). Figure 2 further indicates that when a large initial estimation error is anticipated, the initial gain values, PO and A0, should be larger than unity. This correSponds to a large initial variance in the estimate. Thus, the two lower curves in Figure 2 are a result of slightly higher initial gain values. Figure 3 illustrates the identification results when the initial error is relatively small. Here, the various gains do not cause a significant change in the convergence of the estimator. A comparison of Figures 2 and 3 indicates that the algorithm rapidly compensates for a large initial error in the estimate, i.e., Figure 2 shows a steeper descent than Figure 3. With regard to choosing gains, this experiment indicated some effects which are not apparent from the figures. There appears to be some best value for A0 when a large initial error is anticipated. Figure 2 shows when A0 is equal to 2 the result is better than when AO equals one. However, when A0 is much larger, for example 10, two undesirable effects occur. 70 The initial estimator variance becomes large and the average error increases before it begins to decrease. In particular, for the example in Figure 2, a gain of 100/(10 + k) produced 19 a maximum squared error of 4x10 at 32 iterations and a minimum of 2.5x10'2 at 512 iterations over a 10 run average; 1 for 32 one run in this set produced a squared error of 9X10- iterations. Conversely, when A0 is selected small, con- vergence is very slow and the estimator's variance is small. In the same example of Figure 2, a gain of 1/(10 + k) produced a minimum squared error of 6X103 after 512 iterations of the algorithm (4.1.5). The variance of the estimate when PR is used is much better behaved for all choices of Po' Small values of “Po“ give very slow convergence while very large values tend to result in a sharp initial drop in the estimator's error. The sharp initial drop is typically followed by gradual convergence resulting in errors which are no improvement over those obtained with smaller “Po“ values. It appears, there- fore, that there is no general advantage in choosing very large or very small initial gain values. It is also noteworthy that the application of the deterministic gains reduced the central processing time of the CDC 6500 by one half for this second-order state model simula- tion. 
Higher-order models would lead to greater time savings than those gained here because the computation of the gain P k would become more time consuming. 71 2 4 = l. 9 X 10 quH 9 H = (+100, -100) 2 o squared error, quu \wi(k)‘ s .2 ~ . k) s .2 105 + NJ 1 \p (k)| s .2 0 6+0 ‘U (k?\ S 1.0 104 x 61' . ' 3; + gain Pk’ P0 = I X . 3' 0 gain Ak = 10/(10+k) 3 ‘ 3 10 r x g x gain Pk’ P0 = 10I x O , gain Ak = 20/(10+k) .2 10 run avera e 102 - . g x t X 1 + 10 F o o + x 100 _ o X 10'1 - ' x 10"2 . ‘ 1‘ l 1 l l l l l l 2 I 21 1 2 4 8 16 32 64 128 256 512 Iterations, k Figure 2 Estimator Error Squared as a Function of Algorithm Iterations; Large Initial Estimator Error and Various Gains 72 2 quu .5525 a; = (0.0) |wi(k)\ s .2 ”100‘ 5 '2 |p (k)| s .2 Squared error, quHZ ‘u (k)‘ s 1.0 , + gain Pk, PC) = I 100 . o gain Ak = 10/(10+k) X gain Pk’ Po 8 10I ‘b 0 gain Ak ' 20/(10+k) g t . . 10 run average 3: 9' $ §_ 10‘1 - " x 6 x + 5 " + X X 00 .... X o. t o. -2 + 10 - >‘ l 2 4 8 16 32 64 128 256 512 Figure 3 Iterations, k Estimator Error Squared as a Function of Algorithm Iterations; Small Initial Estimator Error and Various Gains 73 4.2 General Nonlinear System: Scalar Case The extension of the identification algorithm to systems which are nonlinear in the unknown parameter as well as in the state and input variables has been verified by a digital computer simulation. The results are discussed in this section. The example considered here is the system of equations 3 Xk+1 ” 0’le + “zuk + 3(“1xk + °’2”k + "k (4.2.1) yk=xk+¢k < II k uk+pk where all the variables are scalar quantities. This system is.asymptoticallystable when a1 lies inside the unit circle. When a2 # 0 conditions for (null) controllability are satisfied (L1). The system was modeled on the computer with the para- meter a given by (4.2.2) 0' = (al,a2) = (-.25,+.70) The noise disturbances and input were selected from a uniformly distributed independent process generated by the computer. The system.was identified using the algorithm developed in Section 3.4 given as (4.2.3) 61 = 01k + Pk+11‘k(11k)(yk+1 - ly‘k(ak)) where 74 A A A A A A 3 = + Fk(01k) a1(k)yk 012(k))!k + 3(al(k)yk + a2(k)vk) . x A 2 .1 yk + 9Yk(crl(1<)yk + 012 (k)Vk) F (& ) = k k x x 2 Vk + 9Vkia1(k)yk + a2(k)Vk) and Pk+l is given by (3.4.12). In accordance with the theory of Section 3.4, and in particular assumption (3.4.4), the initial estimator error should be small. Therefore the initial estimator value was set at zero giving an initial squared error of .5525. With this initial condition, convergence of the estimator is illustrated in Figures 4 and 5. Here, the application of a deterministic gain is shown to be a valid substitute for the least-squares optimal gain Pk which, with this type of system, is a function of the estimate itself. When larger initial errors were tried convergence was poor. For example, an initial estimate of a; = (&1(0), &2(0)) = (10,10), giving an initial squared error of quH2 = 191.6, gave a final squared-error of 22.9 after 512 iterations. An increase in the initial gain value, “Po“ and A0, did not improve the estimate. This result verifies the necessity for a smaller initial error as discussed in Section 3.4. 75 2 quH = .5525 6; = (0.0) ‘wk‘ 5 .25 Relative “q “2 Hwkl S '1 Squared error, k 2 ‘pk‘ s l quH ‘Uk‘ S .5 H 3 ’-‘° >3 gain Pk, Po =1 ’2 3? noise free case P . o X state noise only f 0 state noise and observation 0 '- x 0 noise included 0 o X 0 X P X o "' o o L- . 
10 run average 1 1 1 _1 1 g 1 1 1 # 1 2 4 8 16 32 64 128 256 512 Iterations, k Figure 4 Relative Squared Error as a Function of Algorithm Iterations; Least-Squares Gain with Various Noise Levels 76 quH2 = .5525 a; = (0.0) \wk\ s .25 Relative qunz Wk‘ 5 .1 Squared error,'———jf “qOH \Pkl S ’1 0 f "o g luk‘ 5.5 >2 )2 gain Ak -- 5/(5+k) . - » fl . Noise-free case 0 .( X State noise only . O»o State and observation X - ‘ 0 noise included 0 X 0 X H. O 10 run average I J 1 1 l 1 I 1 n n l 2 4 8 16 32 64 128 256 512 Iterations, k Figure 5 Relative Squared Error as a Function of Algorithm Iterations; Deterministic Gain with Various Noise Levels GHAPTER V CONCLUSIONS This chapter includes a discussion of the general results of this thesis. Possible extensions to this work are also included. 5.1 Thesis Results The chief result of this thesis is the development of a practical method for finding a mathematical model for a discrete-time system. This method is based on the assumption that the system is known up to a finite set of parameters. The procedure develOped provides a means for finding the unknown parameters from observations of the system's input and output variables. The method for finding the parameters is an on-line, stochastic, gradient-search algorithm.which updates an estimate of the parameters with each new set of input-output data. The system identification algorithm has a general formulation based on a liberal set of assumptions. The formulation includes approximating the least-squares optimal estimator gain with a deterministic gain. The result is a system-parameter estimation scheme which has at least three practical features. First, a wide variety of time-varying non- linear systems may be identified by this algorithm. Secondly, the on-line strategy of the algorithm permits identification 77 78 when the moments of the random processes are time-varying and when the system transition function is time-varying as well. The third practical advantage is the savings in computation time resulting from the use of deterministic gains in the parameter estimation algorithm. The reduction in computation time increases when the order of the system is increased. Two convergence theorems are presented in Chapter III which prove that the parameter estimator converges in the mean- square and almost-sure senses. These two proofs, introduced as modifications to the theory of nonlinear regression, are new to the literature in system identification. The convergence arguments were based on the liberal set of assumptions outlined in Chapter 11, some of which were shown to define a criteria for system identifiability. Two important assumptions used in the convergence proofs are (l) the system is linear in the unknown parameter and (2) a deterministic gain is used in place of the least-squares optimal gain. Later, in Section 3.4, the algorithm.was extended to include systems which are nonlinear in the unknown parameter. This extended version is new to the system-identification literature. The maximum rate of convergence of the estimator was also calculated in Chapter III. This result indicates that, in an asymptotic sense, the mean-squared estimator cannot con- verge faster than a rate inversely proportional to the observa- tion time. It is shown that this is true for optimal on-line estimation schemes as well. 79 The algorithm was demonstrated by a digital computer simulation. An example was selected to show the sensitivity of the parameter estimator to various gains. 
It was demon- strated that the deterministic gain is a useful substitute for the least-squares gain, since convergence rates were comparable and estimates using the deterministic gains were computed faster. A second example was selected to demonstrate the extension of the algorithm to systems which are nonlinear in the unknown parameter. Here, it was found that large initial errors in the estimator lead to slow convergence. This did not occur in the previous example. However, a smaller initial estimator error improved the convergence rate significantly. 5.2 Possible Extensions The identification examples given in Chapter IV indicate a more extensive investigation of gains would be use- ful. In Chapter IV, three types of gains were verified: deterministic gains given by (4.1.6), least-squares (random) given by (2.3.12) and least-squares (random-adaptive) given by (3.4.12). What is needed is a method for choosing a gain, or its initial value, in such a way that the estimator error will drop quickly during the first few iterations, independent of the initial error. Some gain sequences may be develOped which accelerate the initial estimator's convergence. Such gains have been suggested in stochastic-approximation liter- ature (K2, V2). 80 Another extension to this work is the parallel solution to the identification problem in continuous time. This requires the application of continuous-time stochastic approximation theory. Although most of the literature on stochastic approxima- tion deals with the discrete-time case (A2, D1), some studies on continuous~time stochastic approximation theory are avail- able (81, S3, B6). Finally, this identification method may be connected with the study of stability of large-scale interconnected electric power systems. Here, there is a need for modeling and control methods based on input-output variables available on the power system's tie-lines. It is in this general problem that this thesis had its origin. APPENDIX APPENDIX In this appendix various lemmas which are required for the convergence arguments in Chapter III are stated. Those lemmas which are taken directly from the literature are stated without proof. In some cases the proofs rely on results in the litera- ture and so the proof is presented with the appropriate references. In the remaining cases, namely Lemmas 3, 4 and 10, the proofs given are new. In the interest of completeness all propositions required for Chapter III are stated here with the exception of well-known classical theorems. For convenience, the algorithm to be considered is re- stated here in the notation of Chapter III. The recursive estimator &k of the unknown parameter a of the model selected from the class (2.1.2) is given in (2.5.4) and (3.1.2) and is A °’k+1 g “k ’ A141011531: ' a) ' Yk) where 11 AM =i'(y,v>1 . k kk k k k k k k Vk = £15k ’ bk ’ Ak Q Al/k -- Bap/(pk) . 5k " [§k(xk’uk) ' §k(yk’vk)]°’ “'1. + Hk ’ and 81 82 bk = E(§£(yk’vk)§k) These relations are discussed in Sections 2.4 and 2.5. The norm function H°H for vectors is defined to be the Euclidean norm, that is for a real nXl vector 2 where . z ) v- z (21, 22, ... n the norm of z is given by n a qu= [2 12,12] - i=1 The norm function H'“ applied to square matrices is the "spectral" norm, that is for an an matrix A and nXl vectors x HAH = s M ”P x¢0 \x“ where the Euclidean norm given above is used on the right. When A is hermitian “A“ = Max (A) ’ that is the norm of A is the largest eigenvalue of A (V1). These two definitions will be used throughout this section. 
Lemma 1 (A2, L4) Let Then 83 lim 3 = s a s n where HS” < m a.s. Proof: By assumptions Al and A2 the moments of Yk are uniformly bounded, so 124141216113 < 111.112 Then, with the Cauchy-Schwarz inequality (B5) kg, EHAnkH2 s .51 “1141121416 5 constant - n 2 t t n HAle #21 “AR“ 5 cons n kgl k2 .< a . Also, by definition of Yk , E(Akyk) = AkE(vk) = O . Hence the convergence of the sum of the second moments of the zero-mean random variable AkYk guarantees that lim 5n = s a.s. with “s” < a a.s. n by direct application of a theorem due to Loeve (Theorem D, Section 29.1), (L4). Q.E.D. Lemma 2 13111 “An“ Han“ = o a.s. 84 Proof: RE]. E(“Ak“2HHkH2) = .kEl HARHZENH‘kNZYN “ 2 s constant 2 “Aku < a k=1 2 since. EHHkH is uniformly bounded for each k by assumptions Al and A2. Also by the Fubini theorem (P3) ‘° 2 2 °° 2 2 ~> 8HA\\E<\HH>=E 2 AHHHH. k=1 1. \1. k=1H 1. 1. According to the Monotone Convergence Theorem (L4), if 1:. kgluAkHZHHkHZ = .. . then um E kguAkuzuakuz = .. . The contrapositive of this statement is °° 2 2 E 13111111 11111 < w implies that °° 2 2 1:1 111,1 111,1 < .. . . Now this implication means that kgl (1,12,61,12. .. implies that 85 ' 2 2 = a.s. 131141 111,1 0 from which it follows that lim HAkH “HRH = 0 a.s. Q.E.D. Lemma 3 lim k2]- HARM “HR“ = on a.s. Proof: Using the notation of assumptions A5 where Jk is defined as the index set of all integers between the integers and v - 1, that is vk k+1 Jk = {Vk, Vk+l, ... , Vk+1'1} : we have that 13-3-1 HAkH HHkH = 1:1 {1ng “An“ HHnH 2 2 1min(Av ) 2 “Hr,“ k=1 k néJk 2 2 H . (A ) 2 H k=1 mm Vk HneJk “H from the triangle inequality (BS); and by assumption A5 Q 2 £21 xmin(Avk) lo a.s., 86 The last equality is true since co 2 A (A ) = O k=1 1" min and every infinite subsequence of {xmin(Ak)} Sums to infinity. Q.E.D. Lemma 4 For each realization of the random sequence {Hn}’ where g . Hn Qn(yn’vn)§n(yn’vn) ’ there exist constants c,c' > 0 and 0 < N < a such that -l OSHQ \ N. Here the variables are defined as A - Qk .11 (I AiHi) IEJk where the indexing is in descending order, that is, Q=(I-A H _)(I-A _H _) 1‘ Vk+1'1 Vk+1 1 Vk+1 2 Vk+1 2 (I - AVkHVk) and A 2 2 3 A), = (Z HAiH HfliH ) iGJk Proof: First premultiply th by its transpose and note that AR and Hk are symmetric. This gives 87 0qu = n (1 - A111,) 11 (1 - 1,111) i . EJk IEJk =(I-H A )... (I-H _1A _1). Vk “k Vk+1 Vk+1 (I-A _1H _1)...(I-A H ) Vk+1 Vk+1 Vk vk 2 I - .2 (AiHi + HiAi) + Aka 16.1k where Gk = [(second order product terms) + (higher order terms)]-( 2 “AiHZHHiH2)-1 . iEJk Now there is a constant MI > 0 such that “Gk“ 3 M1 a.s. for all sufficiently large k for each realization {Hi}. This is true since (1) Lemma 2 implies that . . -2 11m “(higher order terms)\\Ak = 0 a.s. k because the numerator tends to zero faster than the quantity 2 k’ and A (2) “(second order product terms)HA1-(2 -2 s 1 + “("off-diagonal" second order product terms)“Ak which follows from application of the Cauchy-Schwarz and triangle inequalities. Furthermore, 88 “(Hoff-diagonal" second order product terms)uA;2 -2 S 2 ,2 (HA-HHHH\\“‘n\\HH.H>A . {i#j;i,je.1k} 1 1 J J k by the triangle inequality, and with p 2 vk+1 - vk for all k, 2 (2,6,2 - p>max 11111211112 1;} iEJk s 202032 - p>> 0 89 x [2 (AH +HA)]A-1>M a.s. min . i i i i k 2 16.1k for all sufficiently large k. To accomplish this last step first note the following matrix properties: For positive definite real matrices Bi’ 1 = 1,2,...,n, n n (a) )‘min(i§1 Bi) 2 121 "min(Bi) (b) xmin(BlBZ) 2 Xmin(Bl)Xmin(BZ) (c) (313 ) is positive definite for all i and j, J (V1). 
Using these properties, -1 -1 2 mini + HiAiflAk 2 kmin[ Z (AiHiflAk ieJk iEJk A min[ -1 + kmin [12‘] (HiAigAk k Now consider the first term on the right. Using the properties (a) - (c) further, -1 -1 1. (2 A.H.>1 2 z 1. (A.H.>> min iEJk 1 1 k iEJk min 1 1 Hvk jGJk j‘ -1 \\A,kH ° 1 k Jk k Jk Now since A1 is positive definite by choice, for some M3 > 0, A (A )mm0A1)(1/vk) min Vk —“-——n—v £11“ (A)(1/v)2M >0. Hence by assumption A6 1 mm“) (mJin 1111111) k 11A 11aax11n.11> «1.13.1.1... V 1 k J k Therefore, 1 . ( Z AiHi)Al-<12M > 0 a.s. min WEJ 3 The same argument applies to the second similar term to the one on the left in the last inequality. The result is, therefore, )..”: AH) 1mm: HA) min ieJk i nieJk 1 1 Ak ( 2 (Ai H +'H1A )) xmin iEJk i i 2 2 M > 0 a.s. Ak 2 for some M2 > 0. 91 Consequently for sufficiently large k, 1212 =11QQ'11s1 «A +0012) .1 s k k k 2 k k ' ' and so with an appropriate c > 0, “QR“ s 1 - cAk a.s. The second part of the Lemma follows directly; note that 2 2 35 2 35 A}, = <2 HA1“ “Hi“ > 2 HA, -1H( '2 NH,“ > iEJk k+l 16k 2 “Avk+1-1H x0 a.s. by assumption AS. Therefore for sufficiently large k -1) “QR“ S 1 ' CAk s 1 ' CHAvk_'_l-1H)‘o = 1 ' c'/("1<+1 almost surely, where c' A CHAIHxO Q.E.D. Lemma 5 n lim II “QR“ = 0 a.s. n kfll Proof: By Lemma 4, n n _1 o s kgl “QR“ 5. 1:210 - c (ka-l) ) n -l - ' - 5 exp( c £21(Vk+l l) ) a.s. 92 Since l/k sums to infinity, every infinite subsequence does also, so “ -1 lim 2 (v -1) = on n k=1 k+l Therefore “ 1 . I _ ‘ .. 11m exp( C 2 (Vk+1 1) ) 0 : n k=1 and hence the result follows. Q.E.D. Lemma 6 (A2) "Let A1,A2,..., be a sequence of square matrices. Then for all 1 s k s n and n 2 l, n n n 2 n (I-AiM. =1 - no "‘1’ j=k i=j+1 3 i=k where products are to be read backwards and void ones as the identity." Lemma 7 (K3) L6t (1) {xn} be a real sequence such that lim x = O n n (2) {a } be a set of reals such that uv n gio‘ankj < K n = 0,1,2,... for some constant K > O 93 Now note that for some r > 0, fixed, that lim a = 0 . nr n Then (1) and (2) imply n lim 2 a x = 0 . n k=0 nk k Lemma 8 (A2) "Let Pn = n (1 - a.) when a belongs to (0,1) jgl J j for all j 2 n and lim Pn = 0. Then if n 2 X < w a k k then n lim max" t. (P /P,)x = 0 . " n 15km j=k “ J 3 Lemma 9 In keeping with the notation of assumption A5 let ID V n n+1-vn 0 and N' > 0 is chosen so that zj < l for all j 2 N' Then, n 2 .H (1 - zj) s dk(vj_2/vn_1) J=k where k =N',N' + 1,...,n, and dkz 1 such that lim d = 1 . k k Proof: (A2) Note that for all x,y > 0 Then let so that Since then exp(x) > (1 + x/y)y . xéz. J sza - Bj) 2(1-Bj) exp(zj) > [1 + szj/zj(l - ej)] -Z(1 - Bj) = (1 -Bj) -1 (1 + x/y) = (1 - Bj) = (vj_1/vj_2) z zej exp(zj) > (”j-l/Vj-Z) (1 ‘ Bj) Now invert and take the product from k to n getting 95 n n n H (1 - z.) s H exp(-z ) = exp(- 2 z ) j=k J j=k j .1ng n z z j z S ij (vj _2/v j- _1) (1 - Bj) = (vk_2/vn_1) -S where n ~25, 0<1n(S) = Elna-B) J j=k j =-Z zB 1n(1 -B.) , jrk j J Let A Q dk - exp [-2 jEk len(1 - Bj)] 2 S . Then since for sufficiently large j, Bj < 1 giving exptej/<1 - ej)] > (1 - ej>'1 so that B. T-J—>-ln(l - B) 'B 1 Therefore d < exp 2 E 82/(1 - a ) k j=k J j Now since Os-z Bj1n(1 ‘53) s z 33/(1 -5j) jgk j: 96 then as k tends to infinity the right-hand sum tends to zero as does the left-hand sum. Hence by the definition of d k lim d = 1 . k k Finally, replacing S with dk gives n z .9 (l = zj) s (Vk-Z/Vn-l) dk J-k Q.E.D. The following lemma has appeared elsewhere in the liter- ature (H2, D3) but is proved here distinctly in the context of the assumptions A1 through A8. 
Lemma 10 In accordance with the notation of Chapter II let -1 ~ . -1 = i = Pk FiF. + P k 1,2,... -l iEJk 1 k and P0 is preselected as a positive-definite symmetric matrix or alternatively set equal to zero. Then “P1.“ = 0(k) a.s. Proof: Suppose Po is positive definite. Then P;1 is also positive definite. Since P;1 is symmetric it follows that P is unitarily similar to a diagonal matrix D k (F1), and k the eigenvalues of P; appear on the diagonal of Dk. Then PR is evidently unitarily similar to DLI. Therefore, 97 -1 -l xmax(Pk) g xmin