This is to certify that the thesis entitled "Low Power Analog Chips for the Computation of the Maximal Principal Component" presented by Shanti Swarup Vedula has been accepted towards fulfillment of the requirements for the M.S. degree in Electrical Engineering at Michigan State University.

Low Power Analog Chips for the Computation of the Maximal Principal Component

By

Shanti Swarup Vedula

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Electrical Engineering

1995

ABSTRACT

Low Power Analog Chips for the Computation of the Maximal Principal Component

By

Shanti Swarup Vedula

The eigenvector projection method, also called the principal component analysis (PCA) approach, is a popular linear projection algorithm used in unsupervised processing. This method involves computation of the covariance matrix of the input patterns, its eigenvalues and eigenvectors. A number of neural network models have been proposed which compute the eigenvectors directly without computing the covariance matrix and the eigenvalues. The neural network 'learns' on its own, based on the input patterns presented, and after 'learning', the parameters (weights) of the network specify the eigenvectors. This work focuses on the implementation of a compact, modular, low power, subthreshold analog VLSI circuit for computing the maximal principal component based on a new analog VLSI non-linear model. The model uses MOS elements and differential pairs operating in the subthreshold regime of MOS operation. The model for computing the principal component has been simulated (using MATLAB and PSpice) and implemented up to six dimensions on tiny prototype chips fabricated via the MOSIS service.

To My Mother

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor Dr. Fathi M.A. Salam for his able guidance and constructive criticism throughout the course of my research. I would like to thank Dr. Anil K. Jain and Dr. Diane T. Rover for being on my graduate thesis committee. Their critical comments and useful discussions aided in improving the quality of my thesis.

My colleagues at the Circuits, Systems and Artificial Neural Networks Laboratory, Department of Electrical Engineering, MSU, provided extensive help during the course of this research. I would like to sincerely thank Hwa-Joon Oh, whose brilliant programs made conducting the experiments as painless as possible, and Ammar Gharbi for his help, moral support and useful discussions.

I also record my appreciation of the help rendered by the staff and the faculty of the Department of Electrical Engineering, MSU.

I would like to express my sincere appreciation to Chitra Dorai, Sharath Pankanti and Nalini Ratha at the Pattern Recognition and Image Processing Laboratory, Department of Computer Science, MSU, for their generous help and patient replies to my numerous questions on image processing and computer vision.

I take this opportunity to thank a very dear and special person, Dr.
Houria Has- souna, at the Special Coagulation Center, Division of Thrombosis and Haemastosis, College of Human Medicine, MSU. I am deeply indebted to her for the love and re- spect she showed me besides the graduate assistantship which she provided me during a critical period of this research. . I would like to express my sincere gratitude to my parents, especially my mother for her love, affection and blessings. Finally, a BIG heartfelt thanks to my friends Krunali, Umashankar, Roomi, San- jay, Chitra, Mona, Vijay, Sumant, Niranjan, Annica and many others. They were always there to share the “ups and downs” of my life; without them, life would have been miserable! iv TABLE OF CONTENTS LIST OF FIGURES vii 1 Introduction 1 1.1 Motivation ................................. 1 1.2 Graphical Approach ........................... 5 1.2.1 Pictorial Display Method ..................... 5 1.2.2 Functional Mapping Method ................... 6 1.3 Projection Approach ........................... 16 1.3.1 Linear Projection Algorithms .................. 17 1.3.2 Nonlinear Projection Algorithms ................ 29 2 Artificial Neural Networks Approach 32 2.1 Introduction ................................ 32 2.2 Fundamental Concepts .......................... 33 2.2.1 The Neuron Model ........................ 33 2.2.2 Types of Neural Networks .................... 35 2.3 Previous Modeling Efforts ........................ 38 2.3.1 Hebb’s Rule ............................ 39 2.3.2 Oja’s Rule ............................. 41 2.3.3 Sanger’s Rule ........................... 42 2.4 The Proposed Circuit Model ....................... 45 3 Computing the Maximal Principal Component 48 3.1 Introduction ................................ 48 3.2 Basic Building Blocks ........................... 49 3.2.1 Differential Pair .......................... 50 3.2.2 Current Mirrors .......................... 51 3.2.3 Transconductance Amplifier ................... 52 3.3 MOS Circuit Realization ......................... 55 3.4 Computation of the Maximal Principal Component of 2-D Gaussian Data .................................... 65 3.4.1 Mathematical Simulation .................... 65 3.4.2 PSpice Simulation ........................ 74 3.4.3 Experimental Verification .................... 74 3.5 Computation of the Maximal Principal Component of a Color Image 84 3.5.1 Theoretical Result ........................ 86 3.5.2 Simulation Result ......................... 87 3.5.3 Experimental Result ....................... 88 4 Conclusions and Future Work 92 BIBLIOGRAPHY 94 vi 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.10 1.11 1.12 1.13 1.14 1.15 2.1 2.2 2.3 2.4 2.5 3.1 3.2 3.3 3.4 3.5 3.6 LIST OF FIGURES Analysis of high—dimensional data ..................... Chernoff faces for the Iris data with the category labels ......... Chernoff faces for the Iris data without the category labels. ...... Chernoff faces for the IMOX data with the category labels. ...... Chernoff faces for the IMOX data without the category labels. Functional plot of the Iris data. ..................... Functional plot of the four classes of the IMOX data. ......... Functional plot of the four classes of the IMOX data (All Classes). . . Projection of the Iris data along the first two principal components. . Projection of the Iris data along the third and fourth principal com- ponents. .................................. Projection of the IMOX data along the first two principal components. Projection of the IMOX data along the third and fourth principal com- ponents. 
.................................. Projection of the IMOX data along the seventh and eighth principal components ................................. Projection of the Iris data using Discriminant Analysis ......... Projection of the IMOX data using Discriminant Analysis. ...... A biological neuron ............................. McCulloch-Pitts model. ......................... Sigmoidal functions. ........................... Types of neural networks. ........................ A simple linear unit. ........................... Schematic of a differential pair. ..................... Current mirrors ............................... Schematic of a transconductance amplifier ................ Response of an on-chip transamp ..................... Circuit schematic of nstage. ....................... Circuit layout of nstage. ......................... vii COCO-x)“ 10 14 15 16 24 25 25 26 26 29 3O 34 35 36 37 39 5O 51 52 53 54 54 3.7 3.8 3.9 3.10 3.11 3.12 3.13 3.14 3.15 3.16 3.17 3.18 3.19 3.20 3.21 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29 3.30 3.31 3.32 Circuit schematic of pstage. ....................... 56 Circuit layout of pstage. ......................... 56 Chip layout of Chipl ............................ 58 Circuit schematic of 2-D circuit using nstage ............... 59 Circuit layout of 2-D circuit using nstage ................. 59 Chip layout of Chip? ............................ 60 Circuit schematic of 2-D circuit using pstage ............... 61 Circuit layout of 2-D circuit using pstage ................. 61 Circuit schematic of 3-D circuit using nstage ............... 62 Circuit layout of 3—D circuit using nstage ................. 62 Chip layout of Chip3 ............................ 63 Circuit schematic of 6-D circuit using pstage ............... 64 Circuit layout of 6-D circuit using pstage ................. 64 Simulation results for a Gaussian distribution. ............. 65 Simulation results for the initial condition wo = [ 0.2190 ; 0.0470 ]. (a)-(b) Oja’s model. (c)-(d) Circuit model ................ 66 Simulation results for the initial condition WO 2 [ 0.6789 ; 0.6793 ]. (a)-(b) Oja’s model. (c)-(d) Circuit model ................ 67 Simulation results for the initial condition we = [ 0.5297 ; 0.6711]. (a)-(b) Oja’s model. (c)-(d) Circuit model ................ 68 Simulation results for the initial condition WO 2 [ 0.9347 ; 0.3835 ]. (a)-(b) Oja’s model. (c)-(d) Circuit model ................ 69 Simulation results of the Reconfigured data for the initial condition wo = [0.2190 ; 0.0470 ]. (a)-(b) Oja’s model. (c)-(d) Circuit model. . . . 70 Simulation results of the Reconfigured data for the initial condition wo = [0.6789 ; 0.6793 ]. (a)-(b) Oja’s model. (c)-(d) Circuit model. . . . 71 Simulation results of the Reconfigured data for the initial condition wo = [0.5297 ; 0.6711 ]. (a)-(b) Oja’s model. (c)—(d) Circuit model. . . . 72 Simulation results of the Reconfigured data for the initial condition wo = [0.9347 ; 0.3835 ]. (a)-(b) Circuit model. .............. 73 PSpice simulation: w.- signals for input signals at 2 kHz in the increas- ing order for the initial condition of 2.5V ................. 75 PSpice simulation: w.- signals for input signals at 2 kHz in the increas- ing order for the initial condition of 0V .................. 75 PSpice simulation: w,- signals for input signals at 2 kHz in the decreas- ing order for the initial condition of 2.5V ................. 76 The maximal principal component. ................... 
77 viii 3.33 3.34 3.35 3.36 3.37 3.38 3.39 3.40 3.41 3.42 3.43 3.44 3.45 z.- signals for equal amplitude input signals ................ 79 6-D Implementation: 2,- signals for equal amplitude input signals. . . . 79 z,- signals for input signals in the increasing order ............ 80 6-D Implementation: 2:.- signals for input signals in the increasing order. 80 z,- signals for input signals in the decreasing order ............ 81 6-D Implementation: 2,- signals for input signals in the decreasing order. 81 z,- signals for input signals at high frequency ............... 82 z,- signals for input signals with capacitors. ............... 82 Response of 21 to change in the We, of the input transamps for the 2-D Implementation ........................... 83 Response of 21 to change in the We! of the input transamps for the 3-D Implementation ........................... 83 RGB-to-YIQ conversion. ......................... 85 Experimental setup ............................ 89 RGB-to-YIQ conversion by computing the maximal principal compo- nent of the image .............................. 90 ix CHAPTER 1 Introduction 1 . 1 Motivation As innovative and more economical methods of data acquisition are developed, the emphasis in a variety of disciplines is now on quantification. The problem of handling large volumes of data has partly been solved with the advent of faster processors equipped with larger memory. But, there are some problems which are difficult to solve, for example, visualization and analysis of high dimensional data. An object can be represented as a pattern (or point) in the measured feature space. Consider, as an example, objects being represented in terms of their features such as height and weight. A pattern (2,3) might then represent an object with height 2m and weight 3kg in the ‘height-weight’ feature space. Since human visual perception is limited to three dimensions, graphical representation and visualization of data becomes impos- sible as the dimensionality of the feature space is increased beyond three. As a result of this, visual identification of relationship among data points, for example, identi- fying clusters, is no longer possible. It is also a common practice in data analysis applications to start with a large set of features. Some of these features, however, may not contribute any significant information for the purpose of data analysis. So, in order to cut down the feature measurement costs and the computational burden, it is essential to weed out the unnecessary features. In fact, having a large set of fea- tures affects classification accuracy adversely when the number of patterns is small (see [6], [31]). Hence, it is essential to reduce the dimensionality of the feature space either by choosing the m new features to be a subset (feature selection) or a linear combination (feature extraction) of the original d features. Several methods have been proposed to visualize and analyze high dimensional data. These approaches can be mainly classified as graphical methods and projection methods. The primary goal of the graphical techniques is to provide an alternative, visual representation for the original high dimensional data. Pictorial display and functional mapping methods are the chief constituents of the graphical approach. Pro- jection algorithms attempt to provide a representation in a lower dimensional space (not necessarily 2-D). 
Since representation of data in the lower dimension involves loss of information, these algorithms aim at preserving the information content as much as possible. The data projection methods can be further subdivided into linear and non-linear methods. Each of these methods can be applied to both unsupervised and supervised cases. In unsupervised learning, the class membership of the data points is not known, while in supervised learning, the class labels are known. All these approaches are summarized in Figure 1.1. Irrespective of the approach chosen, the aim is to retain as much structural information of the high-dimensional data as possible in the lower dimension. The category information is not available in a wide variety of applications. A linear projection method called the principal component approach, is the most widely used approach in such (unsupervised) cases. This method involves computation of the covariance matrix of input patterns, its eigenvalues and eigenvectors. The eigenvectors define a linear transformation which not only expresses the new features as a linear combination of the original features, but in doing so also preserves the information content as much as possible. So, due to its simplicity, data visualization (projection) seas. «cocoa—.80 .335... Figure 1.1. Analysis of high-dimensional data. and feature extraction capabilities, this method is the most favored approach. It may, however, be noted that the entire process is sequential in nature because of which the time and complexity involved in the computations increases rapidly as the dimensionality of the feature space increases. Neural network architectures have been developed for a number of applications in both supervised and unsupervised cases. This approach is particularly useful and convenient to the eigenvector problem because there is no explicit computation of the covariance matrix, its eigenvalues and eigenvectors. The network ‘learns’ on its own, based on the input patterns presented, and the parameters (weights) of the network specify the eigenvectors. This method not only simplifies the computational burden but also leads to possible hardware implementation. Hardware implementation being dependent only on the time constants of the circuits would then eliminate the time complexity involved thereby making it suitable for real-time applications. A number of neural network models have been proposed to compute the eigenvectors directly but none of them leads to a compact circuit realization. A recent interest in analog VLSI is to provide circuit implementation for various existing algorithms. A new model proposed by Salam and Vedula [26] aims at a compact circuit implementation of the eigenvector problem in analog VLSI medium. Computing the largest eigenvector (also called the maximal principal component) is the first major step towards such implementation, because it contains maximum information pertaining to the original data. The focus of this research is on verifying the efficacy of this model for computing the maximal principal component. Mathe- matical and circuit simulation results as well as experimental results from prototype chips are presented in support of the model. The remainder of this chapter describes the various approaches for analyzing high dimensional data with special emphasis on the eigenvector approach. The pictorial display and the functional mapping methods which fall under the graphical approach category are described in section 1.2. 
The projection algorithms (both linear and non-linear) are presented in section 1.3. Chapter 2 presents the neural networks approach to computing the principal components. It examines the previous modeling efforts and describes the proposed model. Implementation of the model, simulation and experimental results and an application to image-data compression are discussed in Chapter 3. Finally, conclusions and future work are presented in Chapter 4. 1 .2 Graphical Approach In both the pictorial display and functional mapping approaches, the emphasis is on alternative representation for the original high dimensional data. They try to preserve the information in the original data completely by utilizing all the features as attributes of graphical representation for the patterns. 1.2.1 Pictorial Display Method Chernoff’s method [4] is an example of this approach. In this method, each pattern is represented as a cartoon face, with each feature in the data mapped to a feature on the face, like the shape of the face, the size of the eyes, length of the nose, curvature of the mouth, distance between the eyes, etc. Patterns belonging to the same group have similar faces and differ from those belonging to a different group. However, this method is limited to 18 features and in practice to perhaps only 10 features. Analysis of the data represented in this fashion becomes cumbersome when the number of patterns is large. Besides, the precise mapping of the data features to the facial features also plays an important role. The Iris and IMOX data sets, widely used in the field of cluster analysis and pattern recognition are used here for the study. The Iris data set represents three classes of flowers belonging to the group Iris: setosa, versicolor and virginica with 50 patterns from each class. Each pattern represents a flower with the sepal length, sepal width, petal length and petal width being the measured features. The IMOX data has four classes representing the ‘I’, ‘M’, ‘O’, ‘X’ characters, each class having 48 patterns and eight features. The first 15 patterns in each of the classes are utilized for the present study. Figure 1.2 shows the Chernoff faces for the Iris data set with category labels. Figure 1.3 shows the faces for the same data set without the labels. Similarly, Figures 1.4 and 1.5 represent the results for the IMOX data set. It is evident from the Chernoff diagrams that all faces belonging to class ‘8’ are characterized by smaller faces and very short noses (almost insignificant) while faces of class ‘V’ have very short distances between the nose and the mouth. Class ‘S’ is very well separated from ‘C’ and ‘V’ while classes ‘C’ and ‘V’ appear to be very close to each other (elongated faces). It is quite easy to make these conclusions from Figure 1.2. However, if the category labels are removed, then grouping can be difficult. For example, patterns (23,37) and (19,26) in Figure 1.3 maybe grouped together. Patterns 15 and 45 appear to be outliers. Looking at the IMOX data, ‘I’s are characterized by wide smiles, while ‘O’s and ‘M’s have sad faces. ‘M’s have pretty elongated faces too. ‘I’s and ‘X’s are pretty close to each other. Once again, if the pattern labels are removed, errors might occur. For example, patterns 15 and 55 in Figure 1.5 might be grouped together if there is no a priori information about their category information. 1.2.2 Functional Mapping Method This approach [5] transforms the higher-dimensional data into a function of a sin- gle variable. 
The function is then plotted against the variable, presenting a two-dimensional but functional view of the original data. A suitable transformation would then detect clusters, help to select features, as well as perform statistical analysis of the data.

Figure 1.2. Chernoff faces for the Iris data with the category labels.
Figure 1.3. Chernoff faces for the Iris data without the category labels.
Figure 1.4. Chernoff faces for the IMOX data with the category labels.
Figure 1.5. Chernoff faces for the IMOX data without the category labels.

For a given d-dimensional pattern $x_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^T$, a functional mapping is defined as follows:

$$f_{x_i}(t) = \frac{x_{i1}}{\sqrt{2}} + x_{i2}\sin t + x_{i3}\cos t + x_{i4}\sin 2t + x_{i5}\cos 2t + \cdots + \begin{cases} x_{id}\sin\frac{d}{2}t & \text{if } d \text{ is even} \\ x_{id}\cos\frac{d-1}{2}t & \text{if } d \text{ is odd} \end{cases} \quad \text{for } -\pi \le t \le \pi \qquad (1.1)$$

A plot of $f_{x_i}(t)$ for $t \in [-\pi, \pi]$ would then capture the structure of the d-dimensional pattern. The functional mapping in eqn. (1.1) has the following mathematical and statistical properties (see [5]):

• The functional mapping is linear. It can be easily proved that for a given d-dimensional pattern $x = a x_1 + b x_2$, $f_x(t)$ can be written as

$$f_x(t) = a f_{x_1}(t) + b f_{x_2}(t) \qquad (1.2)$$

where $x_1$ and $x_2$ are two d-dimensional patterns and $a$ and $b$ are constants.

• The functional mapping preserves the mean (average) of the data. If $\mu$ is the mean of the n patterns $x_1, x_2, \ldots, x_n$,

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1.3)$$

then the function corresponding to $\mu$ can be proved, with the help of the linearity property, to be the mean of the functions corresponding to the n patterns. This can be put mathematically as

$$f_{\mu}(t) = \frac{1}{n}\sum_{i=1}^{n} f_{x_i}(t) \qquad (1.4)$$

• The functional mapping preserves distances. A distance measure for two functions corresponding to two points $x_1$ and $x_2$ may be defined as

$$\| f_{x_1}(t) - f_{x_2}(t) \|^2 = \int_{-\pi}^{\pi} [f_{x_1}(t) - f_{x_2}(t)]^2 \, dt \qquad (1.5)$$

Making use of the orthogonality properties of the sine and the cosine functions, it can be shown that this distance is proportional to the squared Euclidean distance between the two points in the d-space, i.e.,

$$\| f_{x_1}(t) - f_{x_2}(t) \|^2 = K \| x_1 - x_2 \|^2 = K \sum_{j=1}^{d} (x_{1j} - x_{2j})^2 \qquad (1.6)$$

where K is a proportionality constant. Data points that are close in the original d-space will appear as close functions and vice versa, thereby preserving the structural relationship among the data points.

• The functional mapping yields one-dimensional projections. For a given value of $t = t_0$, the function $f_{x_i}(t_0)$ is proportional to the projection of the data vector $x_i$ on the directional vector $P_{t_0}$, where

$$P_{t_0} = \left[\frac{1}{\sqrt{2}}, \sin t_0, \cos t_0, \sin 2t_0, \cos 2t_0, \ldots\right]^T \qquad (1.7)$$

As $t_0$ changes, the projections of $x_i$ on a continuum of directional vectors are recorded in the functional plot. When the functions of the data vectors are plotted, these projections, or functional values, might reveal various structural relationships in the data. For example, if data vectors form clusters in different orthogonal spaces, these different clusters will be reflected in the projections of $P_t$ as it passes through or near these spaces.

• The functional mapping preserves variances. If the original data vectors $x_i$ have been suitably transformed so that the components are approximately independent with equal variance $\sigma^2$, then the variance of $f_{x_i}(t)$ is given by

$$\mathrm{Var}[f_{x_i}(t)] = \sigma^2\left(\tfrac{1}{2} + \sin^2 t + \cos^2 t + \sin^2 2t + \cos^2 2t + \cdots\right) = \begin{cases} \sigma^2\,\frac{d}{2} & \text{if } d \text{ is odd} \\ \sigma^2\left(\frac{d}{2} + \epsilon\right), \; \epsilon \in \left[-\tfrac{1}{2}, \tfrac{1}{2}\right] & \text{if } d \text{ is even} \end{cases} \qquad (1.8)$$

Thus, the functional mapping has a constant or near-constant variance.

The functional mapping of eqn. (1.1) was used to represent the first 15 patterns of each class of the Iris and the IMOX data sets. Although in Figure 1.6 (a)-(c) it is clear that each class has a distinct characteristic, the clarity is soon lost (especially between 'C' and 'V') when the class information is not available (Figure 1.6(d)). Similar comments apply to the IMOX data represented in Figure 1.7 and Figure 1.8. Several other approaches, including representing the high dimensional data as stars, trees, castles, etc., have been tried. All these techniques are useful only when the number of patterns and the number of features are small (n < 50, d < 10).

Figure 1.6. Functional plot of the Iris data: (a) class 'S'; (b) class 'C'; (c) class 'V'; (d) all classes.
Figure 1.7. Functional plot of the four classes of the IMOX data.
Figure 1.8. Functional plot of the four classes of the IMOX data (all classes).
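To make the functional mapping of eqn. (1.1) concrete, the short sketch below evaluates these curves (often called Andrews curves) for two nearby patterns and confirms that close patterns yield close functions, as stated in eqn. (1.6). It is an illustrative sketch only: the example feature values and the grid of t values are assumptions, and Python with NumPy is used here rather than the MATLAB environment employed elsewhere in this work.

import numpy as np

def functional_mapping(x, t):
    # Eqn. (1.1): x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
    x = np.asarray(x, dtype=float)
    f = np.full_like(t, x[0] / np.sqrt(2.0))
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2                      # harmonic index 1, 1, 2, 2, ...
        f += xj * (np.sin(k * t) if j % 2 == 1 else np.cos(k * t))
    return f

t = np.linspace(-np.pi, np.pi, 200)
# Two hypothetical 4-dimensional patterns that are close in feature space.
fa = functional_mapping([5.1, 3.5, 1.4, 0.2], t)
fb = functional_mapping([4.9, 3.0, 1.4, 0.2], t)
print(np.max(np.abs(fa - fb)))                # small, since the patterns are close

Plotting fa and fb against t, one curve per pattern, reproduces the kind of functional plot shown in Figures 1.6-1.8.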
1.3 Projection Approach

Projection algorithms map a set of n d-dimensional patterns onto an m-dimensional space, where m < d. Linear projection algorithms are non-iterative. They are relatively simple to use, preserve the structural properties of the data and have well understood mathematical properties. The mapping is defined or computed by a precise transformation and hence the lower-dimensional representation in this case is unique. On the other hand, the non-linear approach is iterative; there is no explicit mapping (mathematical equation) which explains the relationship between the higher dimensional data and its lower dimensional representation. Consequently, the lower dimensional representation may not be unique. An excellent analysis of the projection algorithms is presented in [12]. Biswas et al. evaluated the various widely used projection algorithms and summarized the results in [2].

1.3.1 Linear Projection Algorithms

A linear projection reduces the dimensionality (m < d) by expressing the m new features as a linear combination of the original d features:

$$y_i = \mathcal{H}_m x_i, \quad i = 1, 2, \ldots, n$$

In the above expression, $y_i$ is the (m-dimensional) projected pattern of $x_i$, $x_i$ is the (d-dimensional) original pattern and $\mathcal{H}_m$ (an m × d matrix) is the linear transformation. The type of linear projection used depends on the availability of the category information in the form of labels on the patterns. If category information is available (supervised), then discriminant analysis is used. If, on the other hand, the category information is not available (unsupervised), then the principal component method is used.

Principal Component Analysis

Let $\mathcal{A}^*$ be an n × d pattern matrix with each row representing a single d-dimensional pattern $x_i^*$. Thus, an element $x_{ij}^*$ of this matrix represents the jth feature of the ith pattern. Each of the n patterns may then be normalized as:

$$x_{ij} = \frac{x_{ij}^* - \mu_j}{\sigma_j} \qquad (1.9)$$

where $\mu_j$ is the mean and $\sigma_j^2$ is the variance of the jth feature, given by:

$$\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}^*, \qquad \sigma_j^2 = \frac{1}{n}\sum_{i=1}^{n} (x_{ij}^* - \mu_j)^2$$

If $\mathcal{A}$ represents the normalized pattern matrix, its covariance matrix $\mathcal{R}$ is defined as

$$\mathcal{R} = \frac{1}{n}\mathcal{A}^T\mathcal{A} \qquad (1.10)$$

Each element of this d × d matrix is given by

$$r_{ij} = \frac{1}{n}\sum_{k=1}^{n} x_{ki}\,x_{kj}$$

Since $\mathcal{R}$ is symmetric, its eigenvalues are all real and its eigenvectors can be taken as orthogonal (see [22]). Further, since $\mathcal{R}$ is obtained by taking the outer product of $\mathcal{A}$ with itself, $\mathcal{R}$ is positive semi-definite and hence all its eigenvalues are positive or zero. Thus, the eigenvalues of $\mathcal{R}$ can be labeled so that

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$$

with the corresponding eigenvectors (also called principal components) being $c_1, c_2, \ldots, c_d$. The eigenvectors corresponding to the m largest eigenvalues of the covariance matrix $\mathcal{R}$ constitute the rows of the transformation matrix $\mathcal{H}_m$ and the projected patterns are given by

$$\mathcal{B}_m = \mathcal{A}\mathcal{H}_m^T \qquad (1.11)$$

The matrix $\mathcal{H}_m$ projects the d-dimensional pattern space onto an m-dimensional space whose axes are in the directions of the eigenvectors corresponding to the m largest eigenvalues of $\mathcal{R}$. Thus, the transformation merely rotates the axes of the pattern space and makes the new coordinate system align along the eigenvectors of $\mathcal{R}$. For the purpose of illustration, if the data points of $\mathcal{A}$ are pictured as an ellipsoidal swarm of points in the 3-D pattern space, the axes in the rotated pattern space would be along the axes of the ellipsoid.
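As a concrete illustration of eqns. (1.9)-(1.11), the minimal sketch below normalizes a pattern matrix, forms the covariance matrix R, and projects the patterns onto the m leading eigenvectors. The randomly generated pattern matrix and the choice m = 2 are assumptions made only for illustration; NumPy is used in place of the MATLAB routines used later in this thesis.

import numpy as np

rng = np.random.default_rng(0)
A_star = rng.normal(size=(150, 4))       # n x d pattern matrix (synthetic)
n, d = A_star.shape
m = 2                                    # dimension of the projected space

# Eqn. (1.9): normalize each feature to zero mean and unit variance.
A = (A_star - A_star.mean(axis=0)) / A_star.std(axis=0)

# Eqn. (1.10): covariance matrix R = (1/n) A^T A.
R = (A.T @ A) / n

# Eigenvalues and eigenvectors of the symmetric matrix R, largest first.
lam, C = np.linalg.eigh(R)
lam, C = lam[::-1], C[:, ::-1]

# Rows of H_m are the m leading eigenvectors; eqn. (1.11): B_m = A H_m^T.
H_m = C[:, :m].T
B_m = A @ H_m.T
print(B_m.shape)                         # (n, m) projected patterns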
The eigenvector projection defined in eqn. (1.11) preserves some of the important statistical properties [12], [5]:

• The m new features are uncorrelated. The eigenvalues of $\mathcal{R}$ are given by $|\mathcal{R} - \lambda I_d| = 0$, where $I_d$ is a d × d identity matrix. An eigenvector $c_i$ corresponding to $\lambda_i$ satisfies

$$(\mathcal{R} - \lambda_i I_d)\,c_i = 0, \qquad c_i^T c_j = \begin{cases} 0 & \text{if } i \ne j \\ 1 & \text{if } i = j \end{cases}$$

From the above properties it follows that $c_j^T \mathcal{R} c_i = \lambda_i\, c_j^T c_i$, or, in matrix form,

$$C_d^T \mathcal{R}\, C_d = \Lambda_d$$

where $\Lambda_d = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$ is a diagonal matrix of order d and $C_d = [c_1, c_2, \ldots, c_d]$. From eqns. (1.10) and (1.11),

$$\frac{1}{n}\mathcal{B}_m^T\mathcal{B}_m = \mathcal{H}_m \mathcal{R}\, \mathcal{H}_m^T \qquad (1.12)$$

Since $C_d$ is an orthogonal matrix, its inverse is nothing but its transpose. Thus, $C_d^{-1} = C_d^T$ and $\mathcal{R} = C_d \Lambda_d C_d^T$. Substituting the above expression for $\mathcal{R}$ in eqn. (1.12),

$$\frac{1}{n}\mathcal{B}_m^T\mathcal{B}_m = \mathcal{H}_m C_d \Lambda_d C_d^T \mathcal{H}_m^T \qquad (1.13)$$

The matrix $\mathcal{H}_m C_d$ can be partitioned as $[I_m \,|\, O]$, where $I_m$ is an m × m identity matrix and $O$ is an m × (d − m) zero matrix. Thus, the covariance matrix in the new space given in eqn. (1.13) reduces to a diagonal matrix $\Lambda_m = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$, which means that the features in the m-space are uncorrelated.

• The first m principal components retain the maximum variance in the m-space. If $y_{ij}$ is the projected component of $x_i$ along the jth principal component $c_j$, then

$$y_{ij} = c_j^T x_i$$

$$\mathrm{Var}(y_{ij}) = \mathrm{Var}(c_j^T x_i) = E[\,c_j^T(x_i - \mu)(x_i - \mu)^T c_j\,] = c_j^T \mathcal{R}\, c_j$$

and

$$\mathrm{Cov}(y_{ij}, y_{ik}) = c_j^T \mathcal{R}\, c_k = 0, \quad j \ne k$$

Thus, the variance retained in the m-space along the jth principal component is equal to $\lambda_j$. Hence, in order to retain maximum variance in the m-space, the principal components corresponding to the m largest eigenvalues should be chosen (since the eigenvalues are ordered largest first, this corresponds to the first m principal components). The sum of all the d eigenvalues of $\mathcal{R}$ represents the variance in the original space and the sum of the first m eigenvalues represents the variance retained in the m-space. Therefore, the ratio

$$r_m = \frac{\sum_{j=1}^{m}\lambda_j}{\sum_{j=1}^{d}\lambda_j}$$

represents the fraction of the original variance retained in the m-space. Usually, m is chosen so that $r_m \ge 0.95$. Thus, a 'good' eigenvector projection is one which retains a large proportion of the variance present in the original feature space with only a few features in the transformed space, and this is achieved by choosing the first m principal components.

• The eigenvector projection minimizes the mean-square-error. Consider a linear transformation $y_i = \mathcal{L}^T x_i$, where $\mathcal{L} = [L_1, L_2, \ldots, L_m]$ is a d × m matrix and $L_i$ is the ith column of $\mathcal{L}$. Since m < d, there is a loss of information in representing the pattern $x_i$ by $y_i$. A loss function may be defined which would represent the extent to which $x_i$ may be predicted by knowing $y_i$. In other words, the loss function defines the loss of information or the error incurred in representing the pattern in the m-space. It can then be shown that the transformation which minimizes the loss function is composed of nothing but the principal components. Let $L_1^T x_i, L_2^T x_i, \ldots, L_m^T x_i$ be m linear functions of $x_i$ and let $\sigma_j^2$ be the residual variance (squared error) in predicting $x_{ij}$ based on $L_1^T x_i, L_2^T x_i, \ldots, L_m^T x_i$. Assume that $L_1, L_2, \ldots, L_m$ can be chosen so that

$$L_j^T \mathcal{R}\, L_j = 1, \qquad L_j^T \mathcal{R}\, L_k = 0, \quad j \ne k$$

This implies that the components of the transformed variables are uncorrelated and have unit variance. The residual variance in predicting $x_{ij}$ on the basis of $L_1^T x_i, L_2^T x_i, \ldots, L_m^T x_i$ is given by

$$\sigma_j^2 = \sigma_{jj}^2 - [\mathrm{Cov}(x_{ij}, L_1^T x_i)]^2 - \cdots - [\mathrm{Cov}(x_{ij}, L_m^T x_i)]^2 = \sigma_{jj}^2 - (L_1^T \mathcal{R}_j)^2 - \cdots - (L_m^T \mathcal{R}_j)^2$$

where $\mathcal{R}_j$ is the jth column of $\mathcal{R}$ and $\sigma_{jj}^2$ is the variance of $x_{ij}$. The overall residual variance (total squared error) is

$$\sum_{j=1}^{d}\sigma_j^2 = \sum_{j=1}^{d}\sigma_{jj}^2 - L_1^T\Big(\sum_{j=1}^{d}\mathcal{R}_j\mathcal{R}_j^T\Big)L_1 - \cdots - L_m^T\Big(\sum_{j=1}^{d}\mathcal{R}_j\mathcal{R}_j^T\Big)L_m = \mathrm{tr}(\mathcal{R}) - (L_1^T\mathcal{R}\mathcal{R}L_1 + \cdots + L_m^T\mathcal{R}\mathcal{R}L_m)$$

To minimize $\sum_{j=1}^{d}\sigma_j^2$, the quantity $(L_1^T\mathcal{R}\mathcal{R}L_1 + \cdots + L_m^T\mathcal{R}\mathcal{R}L_m)$ has to be maximized. Consider a set of orthonormal vectors $Q_1, Q_2, \ldots, Q_d$. Then an important property [5] states that

$$\sum_{i=1}^{s} Q_i^T \mathcal{R}\, Q_i \le \sum_{i=1}^{s} c_i^T \mathcal{R}\, c_i = \lambda_1 + \cdots + \lambda_s$$

where s = 1, 2, ..., d and the maximum is attained when $Q_i = c_i$. So, from this property, it is clear that $(L_1^T\mathcal{R}\mathcal{R}L_1 + \cdots + L_m^T\mathcal{R}\mathcal{R}L_m)$ will be maximized only when $L_i$ is the ith eigenvector of $\mathcal{R}\mathcal{R}$, which is the same as the eigenvector of $\mathcal{R}$. Thus, the squared error is minimized only when the transformation consists of the m principal components. The minimum value of the error is $err_{min} = \lambda_{m+1} + \cdots + \lambda_d$.

• The eigenvector projection maximizes the scatter. The scatter for the normalized pattern matrix $\mathcal{A}$ may be defined in terms of its covariance matrix $\mathcal{R}$ by

$$|\mathcal{S}| = n^d\,|\mathcal{R}|$$

where $|\mathcal{R}|$ is the determinant of the covariance matrix $\mathcal{R}$. Since $\mathrm{tr}(\mathcal{R}) = \sum_{i=1}^{d}\lambda_i$ and $|\mathcal{R}| = \prod_{i=1}^{d}\lambda_i$, the scatter can be expressed in terms of the eigenvalues as

$$|\mathcal{S}| = n^d\prod_{i=1}^{d}\lambda_i$$

Since the scatter is dependent only on the eigenvalues of the covariance matrix, the scatter is preserved completely if m = d and is preserved to the maximum extent possible when the first m principal components are chosen in the reduced space.

Thus, the eigenvector projection de-correlates the original features and retains only those which (as the sketch following this list verifies numerically):

• provide the largest spreads (variance),
• minimize the squared error, and
• maximize the scatter.
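The three properties above can also be checked numerically. The self-contained sketch below (again using a synthetic pattern matrix as a stand-in for the Iris or IMOX data) verifies that the projected features are uncorrelated, reports the retained-variance ratio r_m, and confirms that the mean squared reconstruction error equals the sum of the discarded eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(150, 4))
A = (A - A.mean(axis=0)) / A.std(axis=0)      # normalized patterns, eqn. (1.9)
n, d = A.shape
m = 2
R = (A.T @ A) / n                             # eqn. (1.10)
lam, C = np.linalg.eigh(R)
lam, C = lam[::-1], C[:, ::-1]                # eigenvalues largest first
H_m = C[:, :m].T
B_m = A @ H_m.T                               # eqn. (1.11)

# Uncorrelated features: covariance in the m-space is diag(lam_1, ..., lam_m).
print(np.allclose((B_m.T @ B_m) / n, np.diag(lam[:m])))      # True

# Fraction of the original variance retained in the m-space (r_m).
print(lam[:m].sum() / lam.sum())

# Minimum squared error: reconstruction error equals lam_{m+1} + ... + lam_d.
A_hat = B_m @ H_m                             # back-projection to the d-space
mse = np.mean(np.sum((A - A_hat) ** 2, axis=1))
print(np.isclose(mse, lam[m:].sum()))                        # True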
The fact that the smaller the eigenvalue, the lower is the information content retained by the corresponding principal component is also evident from Figures 1.11-1.13 which show the projections for the IMOX data. Hence, the performance deteriorates from Figure 1.11 to Figure 1.13. nilfifétft'gifié‘ii’lia V Y” c C lb 7 Cc V W S e S C c 00 c . “3’ S C vv v 3 $5 C W 8 o C if S a V V v m s v C v $1 83 C V 3 s 4 6 8 FurstConponent Figure 1.9. Projection of the Iris data along the first two principal components. Discriminant Analysis The projection of patterns using discriminant analysis requires category information i.e, each pattern needs a class or category label. So, assuming that the (1 dimensional patterns have been separated into K clusters or categories, the unnormalized patterns in the ith class maybe denoted as: .4: = pg“), x3“), . . . ,x*<">]T n: 25 32333093332833 V N. O .. Vi ~°H c (is S C V v io. c c g C C sg qlv : S S 3 1‘3"- V s S 9 S V c V V N C s V c v -0.5 0.0 0.5 ThirdConponent Figure 1.10. Projection of the Iris data along the third and fourth principal compo- nents. Angl’jiisii’spfil 33138183321 (9 0° 12 -10 O ‘14 I Second Component 2 g I x x M l ‘3 ”M M M ’I I W M x I l H I 9 X m I M X I § X X o x N x x 0 5 10 15 F1131 Cormonent Figure 1.11. Projection of the IMOX data along the first two principal components. 26 AnFiIIiI‘siiifSI 8%‘Ifi83i'liita I N MI x x M M I" M _ 4 x §I MMI‘M‘: )50 3 I°x (D x E x o M I 0° 8Q« 0 0 00 O E x I ll 3 l .2 x x i 0 I? X x M x I x 9‘ I ' M x s 4 2 o 2 4 6 a ThirdCorrponent Figure 1.12. Projection of the IMOX data along the third and fourth principal com- ponents. An‘ZII‘sS'iDSI fiW‘Bséta 0 M0 MM “5.. l Ix 0 ' M "Mx 3‘ XII IX ’6 l i x O 0 ngLM X o M I M 5 M 9° M Ind. O X 0 It? '0 9. 0., I ' 2 o 2 4 SeventhConpomm Figure 1.13. Projection of the IMOX data along the seventh and eighth principal components. 27 where, n,- is the total number of patterns in class i and the jth pattern in the class i maybe represented as: x-ym _ 4i) .(.~) *(iIT J -— [le ,xJ-2 ,...,xjd] The mean of the kth feature for the ith group is given by: III." = (Uni) 2 mil." i=1 Combining all the means for the features in a vector form for the i th group would result in a mean It“) for the ith group given by: I1“) = [#I", #3", - - . mi"? Let ,u denote the mean for all the patterns (considered as a single group): K #1 =(1/n)Zn.-u(" 2:] where n = 23:1 71,-. If the patterns are normalized by subtracting the overall mean (i) = XTI‘) from all the patterns, x, J — p, then the scatter matrix maybe defined as: Kn.- 8=ZZ®W4W i=1 j=1 The scatter matrix for the ith group is defined as: "i 50) = 209‘“) _ “(i)“xjdi) _ #0))T i=1 The within-group scatter matrix, Spy, is defined as the sum of the group scatter matrices: A, - 8w = >38“) i=1 28 Finally, the between-group scatter matrix, 53, is defined as the scatter matrix for the group means. K 71." SB = 2 20¢“) — I001“) - MT i=1 j=l It can be easily shown that S = 83 +5w. This shows that the total scatter is divided into the between-group scatter and within—group scatter. There exists a linear transformation ’Ho (a (K — 1) x (1 matrix) which projects (1') each pattern x, (j th pattern in class i) into a (K — 1)-dimensional subspace by: yltl Z Hoxg‘i) J :12. ,n. z = 1,2,. ,K In doing so, this projection tries to maximize the between—group scatter 83 while holding the within-group scatter 8w constant, thereby maintaining the ratio %I (also called Wilks’s lambda statistic) constant. 
This method is very similar to the PCA approach. The category information is, however, used here to maximize the separation between classes. As a result, the results of discriminant analysis are usually better than those of PCA. Figure 1.14 and Figure 1.15 show the results of discriminant analysis for the Iris and the IMOX data sets, respectively. The separation between classes 'C' and 'V' is much better in Figure 1.14 than in Figure 1.9.

Figure 1.14. Projection of the Iris data using Discriminant Analysis.

1.3.2 Nonlinear Projection Algorithms

Linear projection algorithms yield good results if the data in the original space has a simple structure. On the other hand, if the patterns have a complex structure, such as patterns lying on a curved surface (for example, a helix), then the linear algorithms fail to maintain the inter-pattern distances and consequently do not preserve the structure in the lower dimensions. Nonlinear projection algorithms have an edge over the linear algorithms on this issue and hence have become very popular. Most nonlinear projection algorithms are formulated as an optimization problem whereby a function of a large number of variables is either maximized or minimized.

Figure 1.15. Projection of the IMOX data using Discriminant Analysis.

There is no explicit mathematical expression for the nonlinear transformation. Consequently, there is no unique representation for the data in the new space. An unsupervised nonlinear technique which aims at preserving all the inter-pattern distances in the reduced space was proposed by Sammon [29]. Given a set of n d-dimensional patterns $\{x_i\}$, let $d(i,j)$ represent the distance between the patterns $x_i$ and $x_j$. Sammon's algorithm starts with n random points in the reduced m-space (m < d). Let $D(i,j)$ represent the distance between the points corresponding to $x_i$ and $x_j$ in the m-space. Then a mean-square-error between the two sets of distances may be defined as follows:

$$E = \frac{1}{\sum_{i<j} d(i,j)} \sum_{i<j} \frac{[d(i,j) - D(i,j)]^2}{d(i,j)}$$
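Sammon's method minimizes this error iteratively. The sketch below uses plain gradient descent on the low-dimensional coordinates rather than Sammon's original second-order update; the helix test data, learning rate, and iteration count are assumptions chosen only for illustration.

import numpy as np

def sammon(X, m=2, iters=500, lr=0.1, seed=0):
    # Gradient descent on E; d(i, j) are distances in the original d-space,
    # D(i, j) are distances between the corresponding points in the m-space.
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, 1.0)                  # placeholder; diagonal is unused
    c = d[np.triu_indices(n, k=1)].sum()      # normalizing constant
    Y = np.random.default_rng(seed).normal(scale=1e-2, size=(n, m))
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]
        D = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(D, 1.0)
        w = (d - D) / (d * D)                 # pairwise error weights
        np.fill_diagonal(w, 0.0)
        grad = -(2.0 / c) * (w[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    np.fill_diagonal(D, 1.0)
    E = ((d - D) ** 2 / d)[np.triu_indices(n, k=1)].sum() / c
    return Y, E

# Points on a 3-D helix, the kind of curved structure mentioned above.
t = np.linspace(0, 4 * np.pi, 60)
X = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])
Y, E = sammon(X)
print(Y.shape, E)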