1.9.“

:I'.‘

. .1...

"g .,

 

 

 

 

 

 

 

 

 

 
 

 

 

 

 

 

 

 

 
 

 

 

 
 

 

 

 

 

  

IIIIIIIIIIIIIIIIIIIIIIIIIIII

IIIIIIIIIIIIIZIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

301411 2894 I

 

TﬁESts
<- LIBRARY
Mlchtgan State
University

 

 

 

This is to certify that the

tlfesis entitled
LOW POWER ANALOG CHIPS

FOR THE COMPUTATION OF THE
MAXIMAL PRINCIPAL COMPONENT

presented by

SHANTI SNARUP VEDULA

has been accepted towards fulfillment
of the requirements for

M. S. degree in Electrical Eng.

I 11A 4
Major professor

Tm

 

 

DateM

0-7639 MS U is an Afﬁrmative Action/Equal Opportunity Institution

 

 

 

I

PLACE ll RETURN BOX to romovo this chockout from your rooord.
TO AVOID FINES Mum on or botoro dot. duo.

DATE DUE DATE DUE DATE DUE

f {ﬁlm'u‘lj 3‘33 2“

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU loAn Aﬁnnottvo Mon/Equal Opportunity Irutltwon
mum

—- — —_ _ ___._ __ _

Low Power Analog Chips for the Computation of

the Maximal Principal Component

By

Shantz' Swamp Vedula

A THESIS

Submitted to
Michigan State University
in partial fulﬁllment of the requirements

for the degree of

MASTER OF SCIENCE

Department of Electrical Engineering

1995

ABSTRACT

Low Power Analog Chips for the Computation of

the Maximal Principal Component
By

Shantz' Swarup Vedula

The eigenvector projection method, also called the principal component analy-
sis (PCA) approach, is a popular linear projection algorithm used in unsupervised
processing. This method involves computation of covariance matrix of the input pat-
terns, its eigenvalues and eigenvectors. A number of neural network models have
been proposed which compute the eigenvectors directly without computing the co-
variance matrix and the eigenvalues. The neural network ‘learns’ on its own, based
on the input patterns presented, and after ‘learning’, the parameters (weights) of
the network specify the eigenvectors. This work focuses on the implementation of
a compact, modular, low power, subthreshold analog VLSI circuit for computing the
maximal principal component based on a new analog VLSI non—linear model. The
model uses MOS elements and differential pairs operating in the subthreshold regime
of MOS operation. The model for computing the principal component has been sim—
ulated (using MATLAB and PSpice) and implemented upto six dimensions on tiny

prototype chips fabricated via the M0818 service.

To My Mother

iii

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor Dr. F athi M.A. Salam
for his able guidance and constructive criticism throughout the course of my research.

I would like to thank Dr. Anil K. Jain and Dr. Diane T. Rover for being on my
graduate thesis committee. Their critical comments and useful discussions aided in
improving the quality of my thesis.

My colleagues at the Circuits, Systems and Artiﬁcial Neural Networks Labora-
tory, Department of Electrical Engineering, MSU, provided extensive help during the
course of this research. I would like to sincerely thank Hwa—Joon Oh whose bril-
liant programs made conducting the experiments as painless as possible and Ammar
Gharbi for his help, moral support and useful discussions.

I also record my appreciation of the help rendered by the staff and the faculty of
the Department of Electrical Engineering, MSU.

I would like to express my sincere appreciation to Chitra Dorai, Sharath Pankanti
and Nalini Ratha at the Pattern Recognition and Image Processing Laboratory, De-
partment of Computer Science, MSU, for their generous help and patient replies to
my numerous questions on image processing and computer vision.

I take this opportunity to thank a very dear and special person, Dr. Houria Has-
souna, at the Special Coagulation Center, Division of Thrombosis and Haemastosis,
College of Human Medicine, MSU. I am deeply indebted to her for the love and re-
spect she showed me besides the graduate assistantship which she provided me during
a critical period of this research. .

I would like to express my sincere gratitude to my parents, especially my mother
for her love, affection and blessings.

Finally, a BIG heartfelt thanks to my friends Krunali, Umashankar, Roomi, San-
jay, Chitra, Mona, Vijay, Sumant, Niranjan, Annica and many others. They were
always there to share the “ups and downs” of my life; without them, life would have

been miserable!

iv

TABLE OF CONTENTS

LIST OF FIGURES vii
1 Introduction 1
1.1 Motivation ................................. 1
1.2 Graphical Approach ........................... 5
1.2.1 Pictorial Display Method ..................... 5

1.2.2 Functional Mapping Method ................... 6

1.3 Projection Approach ........................... 16
1.3.1 Linear Projection Algorithms .................. 17

1.3.2 Nonlinear Projection Algorithms ................ 29

2 Artiﬁcial Neural Networks Approach 32
2.1 Introduction ................................ 32
2.2 Fundamental Concepts .......................... 33
2.2.1 The Neuron Model ........................ 33

2.2.2 Types of Neural Networks .................... 35

2.3 Previous Modeling Efforts ........................ 38
2.3.1 Hebb’s Rule ............................ 39

2.3.2 Oja’s Rule ............................. 41

2.3.3 Sanger’s Rule ........................... 42

2.4 The Proposed Circuit Model ....................... 45

3 Computing the Maximal Principal Component 48
3.1 Introduction ................................ 48
3.2 Basic Building Blocks ........................... 49
3.2.1 Differential Pair .......................... 50

3.2.2 Current Mirrors .......................... 51

3.2.3 Transconductance Amplifier ................... 52

3.3 MOS Circuit Realization ......................... 55

3.4 Computation of the Maximal Principal Component of 2-D Gaussian

Data .................................... 65

3.4.1 Mathematical Simulation .................... 65

3.4.2 PSpice Simulation ........................ 74

3.4.3 Experimental Verification .................... 74

3.5 Computation of the Maximal Principal Component of a Color Image 84
3.5.1 Theoretical Result ........................ 86

3.5.2 Simulation Result ......................... 87

3.5.3 Experimental Result ....................... 88

4 Conclusions and Future Work 92
BIBLIOGRAPHY 94

vi

1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
1.10

1.11
1.12

1.13

1.14
1.15
2.1
2.2
2.3
2.4
2.5
3.1
3.2
3.3
3.4
3.5
3.6

LIST OF FIGURES

Analysis of high—dimensional data .....................
Chernoff faces for the Iris data with the category labels .........
Chernoff faces for the Iris data without the category labels. ......
Chernoff faces for the IMOX data with the category labels. ......
Chernoff faces for the IMOX data without the category labels.

Functional plot of the Iris data. .....................
Functional plot of the four classes of the IMOX data. .........
Functional plot of the four classes of the IMOX data (All Classes). . .

Projection of the Iris data along the first two principal components. .

Projection of the Iris data along the third and fourth principal com-
ponents. ..................................
Projection of the IMOX data along the ﬁrst two principal components.
Projection of the IMOX data along the third and fourth principal com-
ponents. ..................................
Projection of the IMOX data along the seventh and eighth principal
components .................................
Projection of the Iris data using Discriminant Analysis .........
Projection of the IMOX data using Discriminant Analysis. ......
A biological neuron .............................
McCulloch-Pitts model. .........................
Sigmoidal functions. ...........................

Types of neural networks. ........................
A simple linear unit. ...........................
Schematic of a differential pair. .....................
Current mirrors ...............................
Schematic of a transconductance ampliﬁer ................
Response of an on-chip transamp .....................
Circuit schematic of nstage. .......................

Circuit layout of nstage. .........................

vii

COCO-x)“

10
14
15
16
24

25
25

26

26
29
3O
34
35
36
37
39
5O
51
52
53
54
54

3.7

3.8

3.9

3.10
3.11
3.12
3.13
3.14
3.15
3.16
3.17
3.18
3.19
3.20
3.21

3.22

3.23

3.24

3.25

3.26

3.27

3.28

3.29

3.30

3.31

3.32

Circuit schematic of pstage. ....................... 56

Circuit layout of pstage. ......................... 56
Chip layout of Chipl ............................ 58
Circuit schematic of 2-D circuit using nstage ............... 59
Circuit layout of 2-D circuit using nstage ................. 59
Chip layout of Chip? ............................ 60
Circuit schematic of 2-D circuit using pstage ............... 61
Circuit layout of 2-D circuit using pstage ................. 61
Circuit schematic of 3-D circuit using nstage ............... 62
Circuit layout of 3—D circuit using nstage ................. 62
Chip layout of Chip3 ............................ 63
Circuit schematic of 6-D circuit using pstage ............... 64
Circuit layout of 6-D circuit using pstage ................. 64
Simulation results for a Gaussian distribution. ............. 65
Simulation results for the initial condition wo = [ 0.2190 ; 0.0470 ].

(a)-(b) Oja’s model. (c)-(d) Circuit model ................ 66
Simulation results for the initial condition WO 2 [ 0.6789 ; 0.6793 ].

(a)-(b) Oja’s model. (c)-(d) Circuit model ................ 67
Simulation results for the initial condition we = [ 0.5297 ; 0.6711].

(a)-(b) Oja’s model. (c)-(d) Circuit model ................ 68
Simulation results for the initial condition WO 2 [ 0.9347 ; 0.3835 ].

(a)-(b) Oja’s model. (c)-(d) Circuit model ................ 69
Simulation results of the Reconﬁgured data for the initial condition wo

= [0.2190 ; 0.0470 ]. (a)-(b) Oja’s model. (c)-(d) Circuit model. . . . 70
Simulation results of the Reconfigured data for the initial condition wo

= [0.6789 ; 0.6793 ]. (a)-(b) Oja’s model. (c)-(d) Circuit model. . . . 71
Simulation results of the Reconﬁgured data for the initial condition wo

= [0.5297 ; 0.6711 ]. (a)-(b) Oja’s model. (c)—(d) Circuit model. . . . 72
Simulation results of the Reconﬁgured data for the initial condition wo

= [0.9347 ; 0.3835 ]. (a)-(b) Circuit model. .............. 73
PSpice simulation: w.- signals for input signals at 2 kHz in the increas-

ing order for the initial condition of 2.5V ................. 75
PSpice simulation: w.- signals for input signals at 2 kHz in the increas-
ing order for the initial condition of 0V .................. 75
PSpice simulation: w,- signals for input signals at 2 kHz in the decreas-
ing order for the initial condition of 2.5V ................. 76

The maximal principal component. ................... 77

viii

3.33
3.34
3.35
3.36
3.37
3.38
3.39
3.40
3.41

3.42

3.43

3.44
3.45

z.- signals for equal amplitude input signals ................ 79
6-D Implementation: 2,- signals for equal amplitude input signals. . . . 79
z,- signals for input signals in the increasing order ............ 80
6-D Implementation: 2:.- signals for input signals in the increasing order. 80
z,- signals for input signals in the decreasing order ............ 81
6-D Implementation: 2,- signals for input signals in the decreasing order. 81
z,- signals for input signals at high frequency ............... 82
z,- signals for input signals with capacitors. ............... 82

Response of 21 to change in the We, of the input transamps for the

2-D Implementation ........................... 83
Response of 21 to change in the We! of the input transamps for the

3-D Implementation ........................... 83
RGB-to-YIQ conversion. ......................... 85
Experimental setup ............................ 89

RGB-to-YIQ conversion by computing the maximal principal compo-

nent of the image .............................. 90

ix

CHAPTER 1

Introduction

1 . 1 Motivation

As innovative and more economical methods of data acquisition are developed, the
emphasis in a variety of disciplines is now on quantiﬁcation. The problem of handling
large volumes of data has partly been solved with the advent of faster processors
equipped with larger memory. But, there are some problems which are difficult to
solve, for example, visualization and analysis of high dimensional data. An object
can be represented as a pattern (or point) in the measured feature space. Consider,
as an example, objects being represented in terms of their features such as height and
weight. A pattern (2,3) might then represent an object with height 2m and weight
3kg in the ‘height-weight’ feature space. Since human visual perception is limited to
three dimensions, graphical representation and visualization of data becomes impos-
sible as the dimensionality of the feature space is increased beyond three. As a result
of this, visual identiﬁcation of relationship among data points, for example, identi-
fying clusters, is no longer possible. It is also a common practice in data analysis
applications to start with a large set of features. Some of these features, however,
may not contribute any signiﬁcant information for the purpose of data analysis. So,

in order to cut down the feature measurement costs and the computational burden,

it is essential to weed out the unnecessary features. In fact, having a large set of fea-
tures affects classiﬁcation accuracy adversely when the number of patterns is small
(see [6], [31]). Hence, it is essential to reduce the dimensionality of the feature space
either by choosing the m new features to be a subset (feature selection) or a linear
combination (feature extraction) of the original d features.

Several methods have been proposed to visualize and analyze high dimensional
data. These approaches can be mainly classiﬁed as graphical methods and projection
methods. The primary goal of the graphical techniques is to provide an alternative,
visual representation for the original high dimensional data. Pictorial display and
functional mapping methods are the chief constituents of the graphical approach. Pro-
jection algorithms attempt to provide a representation in a lower dimensional space
(not necessarily 2-D). Since representation of data in the lower dimension involves
loss of information, these algorithms aim at preserving the information content as
much as possible. The data projection methods can be further subdivided into linear
and non-linear methods. Each of these methods can be applied to both unsupervised
and supervised cases. In unsupervised learning, the class membership of the data
points is not known, while in supervised learning, the class labels are known. All
these approaches are summarized in Figure 1.1. Irrespective of the approach chosen,
the aim is to retain as much structural information of the high-dimensional data as
possible in the lower dimension.

The category information is not available in a wide variety of applications. A
linear projection method called the principal component approach, is the most widely
used approach in such (unsupervised) cases. This method involves computation of the
covariance matrix of input patterns, its eigenvalues and eigenvectors. The eigenvectors
deﬁne a linear transformation which not only expresses the new features as a linear
combination of the original features, but in doing so also preserves the information

content as much as possible. So, due to its simplicity, data visualization (projection)

 

seas.

«cocoa—.80
.335...

 

    

 

 

Figure 1.1. Analysis of high-dimensional data.

and feature extraction capabilities, this method is the most favored approach. It
may, however, be noted that the entire process is sequential in nature because of
which the time and complexity involved in the computations increases rapidly as the
dimensionality of the feature space increases.

Neural network architectures have been developed for a number of applications
in both supervised and unsupervised cases. This approach is particularly useful and
convenient to the eigenvector problem because there is no explicit computation of the
covariance matrix, its eigenvalues and eigenvectors. The network ‘learns’ on its own,
based on the input patterns presented, and the parameters (weights) of the network
specify the eigenvectors. This method not only simpliﬁes the computational burden
but also leads to possible hardware implementation. Hardware implementation being
dependent only on the time constants of the circuits would then eliminate the time
complexity involved thereby making it suitable for real-time applications. A number
of neural network models have been proposed to compute the eigenvectors directly
but none of them leads to a compact circuit realization. A recent interest in analog
VLSI is to provide circuit implementation for various existing algorithms. A new
model proposed by Salam and Vedula [26] aims at a compact circuit implementation
of the eigenvector problem in analog VLSI medium.

Computing the largest eigenvector (also called the maximal principal component)
is the ﬁrst major step towards such implementation, because it contains maximum
information pertaining to the original data. The focus of this research is on verifying
the efﬁcacy of this model for computing the maximal principal component. Mathe-
matical and circuit simulation results as well as experimental results from prototype
chips are presented in support of the model.

The remainder of this chapter describes the various approaches for analyzing high
dimensional data with special emphasis on the eigenvector approach. The pictorial

display and the functional mapping methods which fall under the graphical approach

category are described in section 1.2. The projection algorithms (both linear and
non-linear) are presented in section 1.3. Chapter 2 presents the neural networks
approach to computing the principal components. It examines the previous modeling
efforts and describes the proposed model. Implementation of the model, simulation
and experimental results and an application to image-data compression are discussed

in Chapter 3. Finally, conclusions and future work are presented in Chapter 4.

1 .2 Graphical Approach

In both the pictorial display and functional mapping approaches, the emphasis is
on alternative representation for the original high dimensional data. They try to
preserve the information in the original data completely by utilizing all the features

as attributes of graphical representation for the patterns.

1.2.1 Pictorial Display Method

Chernoff’s method [4] is an example of this approach. In this method, each pattern is
represented as a cartoon face, with each feature in the data mapped to a feature on
the face, like the shape of the face, the size of the eyes, length of the nose, curvature
of the mouth, distance between the eyes, etc. Patterns belonging to the same group
have similar faces and differ from those belonging to a different group. However, this
method is limited to 18 features and in practice to perhaps only 10 features. Analysis
of the data represented in this fashion becomes cumbersome when the number of
patterns is large. Besides, the precise mapping of the data features to the facial
features also plays an important role.

The Iris and IMOX data sets, widely used in the ﬁeld of cluster analysis and
pattern recognition are used here for the study. The Iris data set represents three

classes of ﬂowers belonging to the group Iris: setosa, versicolor and virginica with

50 patterns from each class. Each pattern represents a ﬂower with the sepal length,
sepal width, petal length and petal width being the measured features. The IMOX
data has four classes representing the ‘I’, ‘M’, ‘O’, ‘X’ characters, each class having
48 patterns and eight features. The ﬁrst 15 patterns in each of the classes are utilized
for the present study. Figure 1.2 shows the Chernoff faces for the Iris data set with
category labels. Figure 1.3 shows the faces for the same data set without the labels.
Similarly, Figures 1.4 and 1.5 represent the results for the IMOX data set.

It is evident from the Chernoff diagrams that all faces belonging to class ‘8’ are
characterized by smaller faces and very short noses (almost insigniﬁcant) while faces
of class ‘V’ have very short distances between the nose and the mouth. Class ‘S’
is very well separated from ‘C’ and ‘V’ while classes ‘C’ and ‘V’ appear to be very
close to each other (elongated faces). It is quite easy to make these conclusions
from Figure 1.2. However, if the category labels are removed, then grouping can be
difﬁcult. For example, patterns (23,37) and (19,26) in Figure 1.3 maybe grouped
together. Patterns 15 and 45 appear to be outliers.

Looking at the IMOX data, ‘I’s are characterized by wide smiles, while ‘O’s and
‘M’s have sad faces. ‘M’s have pretty elongated faces too. ‘I’s and ‘X’s are pretty
close to each other. Once again, if the pattern labels are removed, errors might occur.
For example, patterns 15 and 55 in Figure 1.5 might be grouped together if there is

no a priori information about their category information.

1.2.2 Functional Mapping Method

This approach [5] transforms the higher-dimensional data into a function of a sin-
gle variable. The function is then plotted against the variable, presenting a two—
dimensional but functional view of the original data. A suitable transformation would
then detect clusters, help to select features as well as perform statistical analysis of

the data.

for the Iris data with the category labels.

Figure 1.2. Chernoff faces

O 0 <9 C CD

11

G o CD CD 0 0 CD

26

21

16

42

27

17

12

37

32

38 43

33

28

18

13

o o o 9 0 O O (D

0

45

CD

30

4O

35

25

20

15

10

Figure 1.3. Chernoff faces for the Iris data without the category labels.

Figure 1.4. Chernoff faces for the IMOX data with the category labels.

10

6%

®%

7
2

Own.
®w
®9

®3

8
2

0%
Om
Gm
Q4

Gm
HUN
on
@5

CD
48 54

42

6%

30

24

18

Gm...
©6

Figure 1.5. Chernoff faces for the IMOX data without the category labels.

11

For a given d-dimensional pattern x,- = [;r,-1,a:,-2,. . .,a:,-d]T, a functional mapping

is deﬁned as follows:

1 . .
— (L‘il + 22,2 Slnt + 13.3 cost + 13-4 sm 2t

f$i(t) ﬂ

and sin gt if d is even
+175 cos2t + ...+
:rid cos d—g—l-t if dis odd

for-rStSr (1.1)

A plot of fr,(t) for t 6 [—7r, 7r] would then capture the structure of the d—dimensional
pattern. The functional mapping in eqn.(1.1) has the following mathematical and

statistical properties (see [5]):

o The functional mapping is linear.

It can be easily proved that for a given d—dimensional pattern x = ax] + (”(2,

fx(t) can be written as :

fx(t) = afx1(t) + bfx2(t) (1-2)

where, X] and x2 are two d—dimensional patterns and a and b are constants.

c The functional mapping preserves the mean (average) of the data.

If u is the mean of the n patterns x1,x2, . . . , xn:

n .

1 n
[1: —ZX; (1.3)
1:1

then the function corresponding to p can be proved with the help of the linearity

property, to be the mean of the functions corresponding to the n patterns. This

12

can be put mathematically as:

f#(t): £212.“) (1'4)

The functional mapping preserves distances.

A distance measure for two functions corresponding to two points x1 and X;

may be deﬁned as:
Ilfx1(t)— fx.(t) n: [3mm -— march (1.5)

Making use of the orthogonal properties of the sine and the cosine functions,
it can be shown that this distance is proportional to the squared Euclidean

distance between the two points in the d-space, i.e.,

II fx1(t) — fan“) II = K II X1 — X2 II2
:1
= “2(31j_1’2j)2 (1-6)

i=1

where K is a proportionality constant. Data points that are close in the original
d—space will appear as close functions and vice versa thereby, preserving the

structural relationship among the data points.

The functional mapping yields one-dimensional projections.

For a given value of t = to, the function fx,(t0) is proportional to the projection

of the data vector x,- on the directional vector Pro where

sin t0,cost0,sin 2t0,cos 2t0, . . .]T (1.7)

1
P10 = [$3

13

As to changes, the projections of x.- on a continuum of directional vectors are
recorded in the functional plot. When the functions of the data vectors are
plotted, these projections, or functional values might reveal various structural
relationships in the data. For example, if data vectors form clusters in different
orthogonal spaces, these different clusters will be reﬂected in the projections of

Pt as it passes through or near these spaces.

0 The functional mapping preserves variances.

If the original data vectors x.- have been suitably transformed so that the compo-
nents are approximately independent with equal variance 02, then the variance

of fx‘.(t) is given by:

Var[fx,(t)] = 02 (g + sin2 t + coszt + sin2 2t + cos2 2t + . . .)
02 i , if dis odd
= (2) (1.8)

02(g + 6),] e [3% if d is even

Thus, the functional mapping has a constant or near constant variance.

The functional mapping eqn.(1.1) was used to represent the ﬁrst 15 patterns of
each class of the Iris and the IMOX data sets. Although in Figure 1.6 (a)—(c), it is
clear that each class has a distinct characteristic, the clarity is soon lost (especially
between ‘C’ and ‘V’) when the class information is not available (Figure 1.6(d)).
Similar comments apply to the IMOX data represented in Figure 1.7 and Figure 1.8.

Several other approaches including representing the high dimensional data as stars,
trees, and castles etc. have been tried. All these techniques are useful only when the

number of patterns and the number of features are small (n < 50, d < 10).

14

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

26» Ed

0» OI

‘45 44 3 4 4% 8 1 5 i i s in 4 5 5 4 I i a I

(a) Class ‘S’ (b) Class ‘C’

to» 10»
Be» 85:-

or 4 0I
isiséiéi$$fls is4$$$oaa ll

 

 

 

 

 

 

 

(c) C ass ‘V’ (d) All Classes

Figure 1.6. Functional plot of the Iris data.

 

 

15

 

 

 

 

 

 

a.” Clan“
no» so
0» T 0
”L 4 w
B an 5 so
I
l
tor to
o> ] o
10» . 10
‘11s 4 o a 1 o r a a a I! ‘2; 4
I

 

 

 

 

 

 

 

 

 

 

 

 

 

 

am- 4 ”
co» 4 do
a. n.

230
10* to»
0’ 0r

 

.‘°>

 

A A A A A A l A

 

 

 

 

,1:

-l 4 12 1 O 1 2 J I 15
I

v

 

 

 

 

 

 

 

(c) Class ‘0’

Figure 1.7. Functional plot of the four classes of the IMOX data.

 

 

 

 

Figure 1.8. Functional plot of the four classes of the IMOX data (All Classes).

1 .3 Projection Approach

Projection algorithms map a set of n d-dimensional patterns onto an m-dimensional
space, where m < d. Linear projection algorithms are non-iterative. They are rel-
atively simple to use, preserve the structural properties of the data and have well
understood mathematical properties. The mapping is deﬁned or computed by a pre-
cise transformation and hence, the lower-dimensional representation in this case is
unique. On the other hand, the non-linear approach is iterative; there is no explicit
mapping (mathematical equation) which explains the relationship between the higher
dimensional data and its lower dimensional representation. Consequently, the lower
dimensional representation may not be unique. An excellent analysis of the projec-
tion algorithms is presented in [12]. Biswas et al, evaluated the various widely used

projection algorithms and summarized the results in [2].

17

1.3.1 Linear Projection Algorithms

A linear projection reduces the dimensionality (m < d) by expressing the m new

features as a linear combination of the original (I features:
inHmX; i=l,2,...,n

In the above expression, y,- is the (m—dimensional) projected pattern of x,, x,- is
the (d—dimensional) original pattern and 71,", (mx d matrix) is the linear transforma-
tion. The type of linear projection used depends on the availability of the category
information in the form of labels on the patterns. If category information is available
(supervised), then discriminant analysis is used. If on the other hand, the category
information is not available (unsupervised), then the principal component method

is used.

Principal Component Analysis

Let A“ be a n x (1 pattern matrix with each row representing a single d-dimensional
pattern x'f. Thus, an element :rf, belonging to this matrix represents the jth feature

of the ith pattern. Each of the n patterns may then be normalized as:

(Egj = —— (1.9)
where “1' is the mean and of is the variance for the jth feature given by:

n

#j =(1/nlz$fj

i=1

n

0} =(1/n)2(1€ij— #02

i=1

18

If .4 represents the normalized pattern matrix, its covariance matrix ’R. is deﬁned

R = (1/n)ATA (1.10)

Each element of this d>< d matrix is given by:

m = (U71): xkifckj
k=1

Since ”R. is symmetric, its eigenvalues are all real and its eigenvectors can be taken as
orthogonal (see [22]). Further, since ’R. is obtained by taking the outer product of A,
’R. is positive semi-deﬁnite and hence, all its eigenvalues are positive or zero. Thus,

the eigenvalues of ’R can be labeled so that:
AIZMZH-ZMZO

with the corresponding eigenvectors (also called principal components) being
c1,c2,.. . ,cd. The eigenvectors corresponding to the m largest eigenvalues of the
covariance matrix ”R constitute the rows of the transformation matrix Hm and the

projected patterns are given by:
B... = AH; (1.11)

The matrix ’Hm projects the d-dimensional pattern space onto a m-dimensional
space whose axes are in the directions of eigenvectors corresponding to the m largest
eigenvalues of ’R. Thus, the transformation merely rotates the axes of the pattern
space and makes the new coordinate system align along the eigenvectors of ’R. For

the purpose of illustration, if the data points of A are pictured as an ellipsoidal swarm

19

of points in the 3-D pattern space, the axes in the rotated pattern space would be
along the axes of the ellipsoid.
The eigenvector projection deﬁned in eqn.(1.11) preserves some of the important

statistical properties [12],[5]:
o The m new features are uncorrelated.

The eigenvalues of R are given by I ’R — AL I: 0, where L1 is a d x d identity

matrix. An eigenvector c.- corresponding to A,- satisﬁes:

(R—AiId)Ci = 0
0 ifi¢j

1ifi=j

T __
thj —

From the above properties, it implies that:

C,T(R - AiId)C.' = 0

or in the vector form:

cRRcf,‘ = AR

where A3 = diag(A1,/\2,...,/\d) is a diagonal matrix of order d and C; =

[c1,c2,...,cd]. From eqns.(1.10) and (1.11),

(1/n)13’3,‘,13’m = Hmmr; (1.12)

Since C3 is an orthogonal matrix, its inverse is nothing but its transpose. Thus,

65‘ = Ct"

72 = chRcR

20

Substituting the above expression for R in eqn.(1.12),

(1 main," = HmchRmmcgf (1.13)

The matrix Itng can be partitioned as [Im | 0], where Im is an m X m
identity matrix and O is an m X (d —- m) zero matrix. Thus, the covari-
ance matrix in the new space given in eqn.(1.13) reduces to a diagonal matrix
Am = diag(/\1,A2,...,/\m) which means that the features in the m—space are

uncorrelated.

o The ﬁrst m principal components retain the maximum variance in the m space.

If 31,-]- is the projected component of x.- along the jth principal component C],

then,

T
310' = Cj Xi

Var(y,-J-) = Variance(y,-J-)
= Var(cfx.-)
= EICJT(Xi-#)(Xi-H)chl

_ T ,
— c]. RC]

and,

000(yijal/ik) = Covariance(y.-j,y,~k)
= CfRCk

=0 jaék

21

Thus, the variance retained in the m space along the jth principal component
is equal to M. Hence, in order to retain maximum variance in the m-space, the
principal components corresponding to the m largest eigenvalues should be cho-
sen (since,the eigenvalues are ordered largest ﬁrst, this corresponds to the ﬁrst
m principal components). The sum of all the d eigenvalues of R represents the
variance in the original space and the sum of the ﬁrst m eigenvalues represents
the variance retained in the m—space. Therefore, the ratio
Tm : 23:1 :j
j=1 i
represents the fraction of the original variance retained in the m—space. Usually,
m is chosen so that rm 2 0.95. Thus, a ‘good’ eigenvector projection is that
which retains a large proportion of the variance present in the original feature
space with only a few features in the transformed space and this is achieved by

choosing the ﬁrst m principal components.

The eigenvector projection minimizes the mean-square-error.

Consider a linear transformation y.- = LTx; where, C = [L1,L2, . . .,Lm] is a
d X m matrix and L.- is the ith column of L. Since m < d, there is a loss of
information in representing the pattern x,- by y;. A loss function maybe deﬁned
which would represent the extent to which x,- maybe predicted by knowing y,-.
In other words, the loss function deﬁnes the loss of information or the error
incurred in representing the pattern in the m—space. It can then be shown that
the transformation which minimizes the loss function is composed of nothing

but the principal components.

Let L? x;,Lg’ xi, . . . ,L; x,- be m linear functions of x.- and let 0,2- be the resid-

ual variance (squared-error) in predicting :r,-,- based on L? x;, L; x,-, . . . , Lg, xi.

22

Assume that L1, L2, . . . , Lm can be chosen so that:

LfnL, = 1

LfnLk = 0 j¢k

This implies that the components of the transformed variables are uncorrelated

and with variance unity. The residual variance in predicting 33,-,- on the basis of

L? x;, L; x;, . . . ,L; x,- is given by:
012 = 0"?)- — [002433051]? X,‘)]T — . . . — [000013;], L: X;)]T
= 0% - (LIE-)2 - ~-- (Lin-)2
2 sz-j — (LijRle — . . . — LﬂRﬂthm)

where Rj is the jth column of R and a}, is the variance of 33.5. The overall

residual variance (total square-error) is:

d d d
E a} = E of, — L,T(Z njnfuf — . . . — L3,,‘(ZR,R,T)L;
i=1

j:l j: jzl

l
= tr(R) — (L,T121aL1 + . . . + Li’RRLm)

To minimize 231:1 a}, (LirRRLl + . . .+ L,T,,RRLm) has to be maximized. Con-
sider a set of orthonormal vectors Q1, Q2, . . . , Qd. Then an important property

[5] states that:
ZQFRQ; g Zcfnc. = A1 +...+ A,
i=1 {:1

where s = 1,2,. . . , d and the maximum is attained when Q.- = Cg. So, from this
property, it is very clear that (LglRRLl + . . . + LiRRLm) will be maximized
only when L,- is the ith eigenvector of RR, which is the same as the eigenvector of

R. Thus, the square-error is minimized only when the transformation consists

23

of the m principal components. The minimum value of the error errm,n =

Ad+1+...+Ad.

o The eigenvector projection maximizes the scatter.

The scatter for the normalized pattern matrix A maybe deﬁned in terms of its

covariance matrix R by:

I5I=nleI

where I R I is the determinant of the covariance matrix R.

Since tr(R) 2] R [2 i=1 Ag, the scatter can be expressed in terms of the

eigenvalues as:

(Ti/\i)

M
i=1“-

d
[8 I: ndH)‘,
i=1

1

1

Since the scatter is dependent only on the eigenvalues of the covariance matrix,
the scatter is preserved completely if m = d and is preserved to the maximum
extent possible when the ﬁrst m-principal components are chosen in the reduced

space.

Thus, the eigenvector projection de-correlates the original features and retains

only those which:
0 provide the largest spreads (variance),
0 minimize the square error, and
o maximize the scatter

Figure 1.9 shows the eigenvector projection of the ﬁrst 15 patterns of each class of
the Iris data set along the ﬁrst two principal components. Figure 1.10 shows results of
the projection of the data along the third and the fourth principal components. It is

evident that the structural information is preserved to a very good extent by the ﬁrst

24

and the second principal components. The fact that the smaller the eigenvalue, the
lower is the information content retained by the corresponding principal component
is also evident from Figures 1.11-1.13 which show the projections for the IMOX data.

Hence, the performance deteriorates from Figure 1.11 to Figure 1.13.

nilfifétft'giﬁé‘ii’lia
V

 

 

 

 

Y” c
C
lb
7 Cc V
W
S
e S C c
00
c .
“3’ S C vv v
3 $5 C W
8
o C
if S a V V v
m s v
C v

$1 83 C V
3 s

4 6 8

FurstConponent

Figure 1.9. Projection of the Iris data along the ﬁrst two principal components.

Discriminant Analysis

The projection of patterns using discriminant analysis requires category information
i.e, each pattern needs a class or category label. So, assuming that the (1 dimensional
patterns have been separated into K clusters or categories, the unnormalized patterns
in the ith class maybe denoted as:

.4: = pg“), x3“), . . . ,x*<">]T

n:

25

32333093332833

 

 

 

 

V
N.
O
.. Vi
~°H c (is S C V v
io. c c
g C C sg qlv
: S S 3
1‘3"- V s S
9 S
V c V
V
N C s
V c v
-0.5 0.0 0.5
ThirdConponent

Figure 1.10. Projection of the Iris data along the third and fourth principal compo-
nents.

Angl’jiisii’spﬁl 33138183321

(9 0°

 

12 -10
O

‘14
I

Second Component
2
g
I
x
x

 

 

 

M l
‘3 ”M M M ’I I
W M x I l
H I
9 X m I
M X I
§ X X
o x
N x x
0 5 10 15
F1131 Cormonent

Figure 1.11. Projection of the IMOX data along the ﬁrst two principal components.

26

 

AnFiIIiI‘siiifSI 8%‘Iﬁ83i'liita

 

 

 

I
N
MI x
x M
M I" M
_ 4 x
§I MMI‘M‘: )50
3 I°x (D x
E x o
M I 0°
8Q« 0 0
00 O
E x I ll
3 l
.2 x x i 0
I? X x M x
I x
9‘ I
' M
x
s 4 2 o 2 4 6 a
ThirdCorrponent

Figure 1.12. Projection of the IMOX data along the third and fourth principal com-
ponents.

An‘ZII‘sS'iDSI ﬁW‘Bséta

 

 

 

 

0
M0 MM
“5.. l
Ix 0 '
M
"Mx
3‘ XII IX
’6 l
i x O 0 ngLM
X
o M I M
5 M
9° M
Ind. O X
0
It? '0
9.
0., I
' 2 o 2 4
SeventhConpomm

Figure 1.13. Projection of the IMOX data along the seventh and eighth principal
components.

27

where, n,- is the total number of patterns in class i and the jth pattern in the class i

maybe represented as:

x-ym _ 4i) .(.~)

*(iIT
J -— [le ,xJ-2 ,...,xjd]

The mean of the kth feature for the ith group is given by:
III." = (Uni) 2 mil."
i=1

Combining all the means for the features in a vector form for the i th group would

result in a mean It“) for the ith group given by:

I1“) = [#I", #3", - - . mi"?

Let ,u denote the mean for all the patterns (considered as a single group):

K

#1 =(1/n)Zn.-u("

2:]

where n = 23:1 71,-. If the patterns are normalized by subtracting the overall mean

(i) = XTI‘)

from all the patterns, x, J — p, then the scatter matrix maybe deﬁned as:

Kn.-

8=ZZ®W4W

i=1 j=1
The scatter matrix for the ith group is deﬁned as:

"i

50) = 209‘“) _ “(i)“xjdi) _ #0))T

i=1

The within-group scatter matrix, Spy, is deﬁned as the sum of the group scatter

matrices:
A, -
8w = >38“)

i=1

28

Finally, the between-group scatter matrix, 53, is deﬁned as the scatter matrix for the

group means.

K 71."
SB = 2 20¢“) — I001“) - MT

i=1 j=l
It can be easily shown that S = 83 +5w. This shows that the total scatter is divided
into the between-group scatter and within—group scatter.

There exists a linear transformation ’Ho (a (K — 1) x (1 matrix) which projects

(1')

each pattern x, (j th pattern in class i) into a (K — 1)-dimensional subspace by:

yltl Z Hoxg‘i)
J :12. ,n.
z = 1,2,. ,K

In doing so, this projection tries to maximize the between—group scatter 83 while
holding the within-group scatter 8w constant, thereby maintaining the ratio %I (also
called Wilks’s lambda statistic) constant. The rows of Ho consist of the eigenvectors
corresponding to the (K — 1) nonzero eigenvalues of 8,},183. The between-group
scatter matrix 53 involves only K vectors and a mean is subtracted thereby making
the rank of the matrix less than or equal to (K — 1). Since the rank of a product of
two matrices cannot exceed the ranks of the components, 8;,ng has at most only
(K — 1) nonzero eigenvalues. It is assumed here that n — K 2 d,d 2 K and that
Spy is nonsingular. If the eigenvalues of SEVISB are ordered such that {1 Z {2 Z

. 2 {K4 2 0, then in order to project the patterns onto a t-dimensional space
(t < (K - 1)), only the ﬁrst teigenvectors are chosen to form ’HO.

This method is very similar to the PCA approach. The category information is

however, used here to maximize the separation between classes. As a result of this,

the results of discriminant analysis are usually better than those of PCA. Figure 1.14

29

and Figure 1.15 show the results of discriminant analysis for the Iris and the IMOX

data sets respectively. The separation between classes ‘C’ and ‘V’ is much better in

Figure 1.14 than in Figure 1.9.

Figure 1.14.

Discriminqrr']; ﬁgwsis of the
c

 

 

 

 

" c
w, v C C
‘3 V c 85
E v c C 5
go V C S
B ‘I/ C c SsS
° v v 3 SS
3r V S
v S S
s
V v C s
in V S
v
I’ v
10 5 o 5 10
FirstVariable

Projection of the Iris data using Discriminant Analysis.

1.3.2 Nonlinear Projection Algorithms

Linear projection algorithms yield good results if the data in the original space has a

simple structure. On the other hand, if the patterns have a complex structure such as

patterns lying on a curved surface (for example, a helix), then the linear algorithms

fail to maintain the inter-pattern distances and consequently do not preserve the

structure in the lower dimensions. Nonlinear projection algorithms have an edge over

the linear algorithms on this issue and hence, have become very popular.

Most nonlinear projection algorithms are formulated as an optimization problem

whereby a function of a large number of variables is either maximized or minimized.

30

 

Discrimimﬁaalgis of the
M

 

 

 

X
an
X xx x
d M X qt x
M MM w
o, M M
.. M M x
g MM x x I
3 X x
«I III
> |
260‘ II
g I
l
I I
o 0000 I
o q) 0 o I
0 O O [
0 O
' Yo I
2 4 6 8 1O 12
ﬁrstVariable

Figure 1.15. Projection of the IMOX data using Discriminant Analysis.

There is no explicit mathematical expression for the nonlinear transformation. Con-
sequently, there is no unique representation for the data in the new space.

An unsupervised nonlinear technique which aims at preserving all the inter-pattern
distances in the reduced space was proposed by Sammon [29]. Given a set of n d-
dimensional patterns {xi}, let d(i, j) represent the distance between the patterns x;
and xj. Sammon’s algorithm starts with n random points in the reduced m space
(m < (1). Let D(i, j) represent the distance between the points corresponding to x,-
and x, in the m space. Then a mean-square-error between the two sets of distances

maybe deﬁned as follows:

_ 1 [d(iij) - D(i,j)]2
E _ Egg Zi<j d(i,j) Z: d( I

i<j i<j 2,]

 

 

The double sum in the above equation is carried out over the set {(i, j) : 1 S i < j S
n}. Sammon’s algorithm then uses the gradient descent method to reconﬁgure the

patterns in the m-space so as to minimize E. The algorithm stops when the error goes

31

below an acceptable threshold or there is no further change in the value of E. The
algorithm is repeated for several initial conﬁgurations in order to avoid local minima
and hence, achieve global minimum for E. Details and results of this algorithm can
be found in [12], [29].

The main disadvantage of this method (true for nonlinear algorithms in general),
is that there is no explicit mapping function. Consequently, new patterns cannot be
‘projected’. So, if more patterns are added or m is changed, the whole process has to
be repeated again. Despite its drawbacks, this method yields good results.

The various projection approaches presented so far represent key approaches to
visualize and analyze high dimensional data. Attempts have been made to implement
these algorithms using neural networks approach [16]. Several other techniques do

exist but are not of interest from the neural networks point of view.

CHAPTER 2

Artiﬁcial Neural Networks

Approach

2. 1 Introduction

Digital computers are very good at (among other tasks) solving complex mathemat-
ical equations involving large number of variables. However, apart from ‘number
crunching’, their performance is not considered to be good at tasks which humans
perceive as ‘simple’ such as face recognition. For example, if an image of a person is
stored in the computer’s memory, then the computer has great difﬁculty in recogniz-
ing a test image if it is not exactly same as the stored image, even if it belongs to the
same person. The approach of neural computing or artiﬁcial neural networks (ANN)
is to capture the guiding principles that underly the brain’s solution to such problems
and apply them to computer systems. A unique feature of the neural network models
is their ability to ‘learn’ which allows them to be ‘trained’ for a suitable task.

The ﬁeld of artiﬁcial neural networks draws its inspiration from the biological

neural networks, which have very interesting properties [18],[11],[1]:

0 They are highly parallel.

32

33

The human brain derives its power from billions of heavily interconnected neu-
rons working in parallel.
0 They are highly robust.

They can deal with fuzzy, probabilistic, noisy or inconsistent information.

0 They are adaptive.

The networks can easily ‘learn’ to adjust to a new environment

0 They are fault-tolerant.

Nerve cells in the brain die everyday without affecting the performance of the

brain.

0 They consume and dissipate very little power.

Despite the large number of neurons, the brain is small, compact and dissipates

very little power.

The remainder of this chapter presents brieﬂy few essential concepts and then
describes previous modeling efforts to solve the principal components problem using

neural network approach.

2.2 Fundamental Concepts

2.2.1 The Neuron Model

The neuron (Figure 2.1) is the basic unit of the brain, and is a stand-alone analog
processing unit. It has a cell body (also called soma), which contains the nucleus. At-
tached to the soma are long, irregularly shaped, tree-like nerve ﬁbers called dendrites
which receive input from other neurons; the cell body being the other entity capable

of receiving the input. Another type of nerve process attached to the soma is called an

34

axon which serves as an output channel of the neuron. The axon after branching into
strands and substrands terminates in a specialized contact called synapse (located at

the tips of these strands / substrands), which couples the axon to the dendrites or cell

bodies of various other neurons.

synapse

/
K/

O /._\_/

cell body (soma)

P‘\/

dendrites

Figure 2.1. A biological neuron.

A neuron sends a signal to another neuron at the synapse by releasing a complex,
chemical substance called neurotransmitters. The effect of this transfer is either to
raise or lower the electric potential in the body of the receiving cell. If the potential
in the receiving cell exceeds a threshold (resting potential), the receiving neuron ‘ﬁres’
or sends a pulse along the axon. A simple model which captures the above salient

points was proposed by McCulloch and Pitts [17]:

(t+1) O(Zw,~j€,(t) M)

The ith neuron according to this model (Figure 2.2), basically computes a weighted

35

Input Nodes

Figure 2.2. McCulloch—Pitts model.

sum of its inputs and ﬁres if the weighted sum exceeds a threshold fig. {,- is the jth
component of the input pattern 6 (also, jth input to the neuron), w, is the weight
representing the strength of the synapse connecting neuron (or input) j to neuron i

and v; is the (binary) output of the neuron. (9(a) is the unit step function:

1 ifcmO
9(3) =

0 otherwise
A continuous valued neuron is modeled by replacing O by Sigmoidal functions such

as logistic function (Figure 2.3(a)) or a hyperbolic tangent function (Figure 2.3(b)).

2.2.2 Types of Neural Networks

Neurons can be connected to other such neurons to form an interconnected network.
Although the model is very simple, a single or a small network of neurons are capable
of performing all the boolean functions [11]. A class of networks of particular interest
is the feed-forward network (Figure 2.4(a)), in which the neurons are arranged in
the form of layers. The neurons in each layer process the inputs received from the

previous layer and their outputs are presented as inputs to the next layer. Thus,

36

 

 

 

 

 

 

 

 

 

Loo’etic Function Hyperbolic Tengent Function
1 v M T T I r v v
0.9’ 08
0.8" 06
0,7h 04
:06 a? 0.2%
50.5 E 0L -4
3770.4 -o.2 -I
‘5
0.3 —O.4
02 -O.6I-
01? -0.8
35 -4 4i -2 -i Ii i 2 3 i 5 '13 -2 -i Ii i 2 3
I M
(a) Logistic function. (b) Hyperbolic tangent function.

Figure 2.3. Sigmoidal functions.

the data ﬂows from the input layer to the output layer through the hidden layers.
The other types of networks are feedback or Hopﬁeld type networks (Figure 2.4(b)),
where a neuron is connected to every other neuron (including to its own, by output
feedback ) and hybrid networks (Figure 2.4(c)), where the outputs of some of the
neurons maybe connected to the inputs of the neurons in the previous or same layer.

A typical neural network application ﬁrst involves ‘training’ in which the network
is presented a set of input patterns. In some applications, the true outputs for the
input patterns are known. In this case (supervised learning), an error measure
(such as least squares error) can be deﬁned in terms of the difference between the
true output and the actual output, which can then be used to update the weights
so as to minimize the error. In some applications, the true outputs are not known
(unsupervised learning). The network hence, must ﬁgure out on its own, patterns,
features, regularities, correlations, or categories in the input data and code for them
in the output. The network in such cases relies heavily on the redundancy in the input

data. In either case, the learning stops (and the network is considered to have been

37

Output

 

 

 

 

 

 

 

 

Figure 2.4. Types of neural networks.

38

‘trained’ for the task) when the error goes below an acceptable threshold or there is
no further change in the update of the weights. Several weight update rules can be
found in [33], [9]. These ‘converged’ weights can then be used to process the new

data.

2.3 Previous Modeling Efforts

Two important applications in unsupervised learning are the familiarity problem and
the principal component analysis. In the former, a single continuous valued output
indicates how similar a new input pattern is to typical or average patterns seen in the
past. The network gradually learns what is ‘typical’. An extension to several units
involving the construction of multi-component basis, or a set of axes, along which to
measure similarity to previous examples is the focus of principal component analysis.
It will be shown shortly that the familiarity problem is equivalent to computing the
maximal principal component.

As noted in Chapter 1, computation of principal components involves computation
of covariance matrix of the input patterns, its eigenvalues and eigenvectors. However,
the entire process is sequential in nature because of which the time and complexity
involved in the computations increases rapidly as the dimensionality of the feature
space increases. Neural network architectures have been developed for a number of
applications in both supervised and unsupervised cases. This approach is particularly
useful and convenient to the eigenvector problem because there is no explicit compu-
tation of the covariance matrix, its eigenvalues and eigenvectors. The network ‘learns’
on its own, based on the input patterns presented, and the weights of the network
specify the eigenvectors. This method not only simpliﬁes the computational burden
but also leads to possible hardware implementation. Hardware implementation being

dependent only on the time constants of the circuits would then, eliminate the time

39

complexity involved thereby making it suitable for real-time applications. A number

of neural network models have been proposed to compute the eigenvectors directly.

2.3.1 Hebb’s Rule

5:1 .\
/ V :1; w!“ ii
Input Nodes

Figure 2.5. A simple linear unit.

Consider a simple, linear output unit shown in Figure 2.5. A random input pattern 6
drawn from a probability distribution 19(5) is applied to the network at each instant
of time. After seeing enough such samples, the network should indicate as its output,
how well a particular input pattern conforms to the distribution. The output v of the
linear unit is given by:

v = ije. = W = W (2.1)

where w is the weight vector and 5 is the input pattern and {j is its jth component.
Since the network consists of only one unit, the output of this neuron is expected
to represent a scalar measure of familiarity, i.e., the more probable that a particular

input pattern 6 is, the larger the output I v | should become, at least on the average.

40

According to Hebb’s rule [10], the weights should be updated so that:
Aw,- = 7105; (2.2)

where E; is the ith component of the input pattern 6, Aw.- is the change in the weight
connected to the ith input and 17 is a constant called learning factor. In the continuous
time domain, Aw, is replaced by a. This weight update rule in essence, strengthens
the weights (thereby making the output high) to only those inputs which belong to
P(§). At the equilibrium point, the average changes in the weights should be zero.

This implies that:
0 = Aw.- = '17:.- = 2 was, = 2 00-10,: Cw (2.3)
i j

where the over-line indicates an average over the input distribution P(§) and Cij =
3?,- or in the vector form, C = {(7. C is called the correlation matrix and differs from
the covariance matrix R (eqn.(1.10)) deﬁned in section 1.3.1 in that, the patterns are
normalized for the computation of R. However, C has properties that are similar to

those of R, i.e.,:

e C is symmetric. Hence, all eigenvalues of C are real and eigenvectors can be

taken as orthogonal.

e C is positive semi-deﬁnite. Hence, all eigenvalues are positive or zero.

Eqn.(2.3) implies that w is an eigenvector of C with eigenvalue zero. However,
this cannot be the equilibrium point because C would have at least some positive
eigenvalues and any change having a component along an eigenvector with positive
eigenvalue would grow exponentially with time. So, it is evident that the growth would
be maximum along the direction corresponding to the maximum eigenvalue Am”.

Thus, the maximal principal component is the solution to the familiarity problem.

41

The main disadvantage of the Hebbian weight update rule is that the weights
keep on increasing and the weight update never stops. This is unsuitable for circuit
implementation because the voltages or currents used to represent the weights have
deﬁnite bounds. The weights of course, could be normalized after each update thereby

restricting their growth, but that would increase the computational burden.

2.3.2 Oja’s Rule

Oja [20],[21] modiﬁed the Hebb’s rule by adding a weight decay proportional to v2,
as a result of which the weight vectors approach a constant unit length, thereby
obviating the need for external normalization. Oja’s rule for weight update is given
by:

Aw, = 17v“; — vwg) (2.4)

In addition to being of unit length, the weight vector also aligns along the direction
corresponding to Amaa: (maximal principal component) of C.

Extending this to higher dimensions, it is possible to have a m-unit network that
extracts the ﬁrst m principal components. The network consists of a one-layer feed-

forward network with the mapping given by:
v = wTE (2.5)

where the output vector v = [v1, . . . ,vm]T, w = [w1,...,wm], .5 = [51,. . . ,Ifd]T. It
maybe noted that eqn.(2.5) is very similar to the transformation in section 1.3.1. The

ith output v,- is given by:
v.- = Z wijéj = WiTé = {TWi (2-6)
j

where w; is the (column) weight vector for the ith output. Oja’s m-unit rule is given

42

Awe = nvzléj — i vkwk.) (2.7)
k=1

The fact that Oja’s weight update rule indeed yields weight vectors that span the

ﬁrst m-principal components has been proved in great detail in [11],[9].

2.3.3 Sanger’s Rule

Sanger [30] proposed a new weight update rule given by:

sz‘j = 770451 — 2’: vkwkj) (2-8)
k=1

which reduced to some extent the computational burden in Oja’s rule. Sanger refor-

mulated the rule as (in the vector form):

i—l
W, 2 vi — Z vgkak — v,-2w,- (2.9)

k=1

In the continuous time domain, Aw.- is replaced by w;. From the above equation, it
is evident that in contrast to the Oja’s m-unit rule, Sanger’s modiﬁcation makes the
update law for the ith node dependent on the information up to the previous node.
It maybe noted that both Oja’s rule and Sanger’s rule reduce to eqn.(2.4) for m = 1
and i = 1 respectively. In both cases, the w,- vectors converge to orthogonal unit
vectors wfw, = 6,]- and project a d-dimensional input vector onto a m-dimensional
space spanned by the ﬁrst m-principal components. For Oja’s rule, the m-weight
vectors converge to span the same subspace as the ﬁrst m eigenvectors, but do not
ﬁnd the eigenvectors themselves. Further, the weight vectors although spanning the
right subspace, might vary from trial to trial depending on the initial weights. Sanger’s

rule on the other hand, makes the weight vectors become exactly the ﬁrst m principal

Component directions in order and the results are reproducible (up to sign differences)

43

for a given data set (for non-degenerate eigenvalues).

An important observation is the fact that although the approach is linear, the
resulting models (2.7) and (2.9) are always nonlinear (for other rules, see [14], [24]
[23]). The basic building blocks of analog circuits are also inherently nonlinear which
motivates the reformulation of the original algorithm in the nonlinear domain to
make circuit implementation feasible. A new model which aims at compact circuit
implementation was proposed by Salam in [25] and its analysis and implementation
results were subsequently presented in [26], [32], [27], and [28]. The circuit model and
its analysis presented in this chapter is based on the material presented in [26].

Using (2.6), eqn.(2.9) can be rewritten for the purpose of analysis as:

i—l

w,- = [I — z: wkwfkgTw, — vfw; (2.10)
k=1
For i = 1, one obtains
W1 = €€TW1 — ’0le (2.11)

If the frequency contents of the input signal 5 is very high in comparison with the
inverse of the time constant of the update laws of the weights (2.10), then one can

consider the time-averaged outer product matrix of the input signals, viz.,

T+to

35731323; €(t)€(t)Tdt} (2.12)

0

replacing {5T in equations (2.10) and (2.11). It maybe noted that the matrix 3:7 is
a constant symmetric matrix. Suppose also that the eigenvalues of 357 are ordered,
i.e.,/\l > A2 > > An >0.

Equation (2.11) has the form of Oja’s learning rule for a single output. The update
law for wl is decoupled from the other weight vectors. However, every subsequent

weight vector is coupled to all prior weights as in (2.10). It may also be observed that

44

the ﬁrst term of the vector ﬁeld of (2.11) is linear in w] and has the matrix {IT while
the ﬁrst term of the vector ﬁeld of (2.10), which is linear in Wg, has the following

matrix:

i—l
Qi = (I — Z WIWEKET (2.13)
k=l
The matrix:
i-l
Pi = Z wkwf (2.14)
k=1

is a (linear) projection operator onto the subspace spanned by the vectors Wk, for
all k < i. Therefore, the matrix [I — Pg] is a (linear) projection operator onto the
subspace spanned by the vectors orthogonal to the set of vectors wk, for all k < i. As
Oja’s (single output) rule converges to the maximal eigenvector of the outer product
matrix of driving signals [11], namely ET, it is now clear that subsequently, the
projection operator Qg sets all the eigenvalues A1,A2,...Ag_1 to zero. Hence, the
ordered eigenvalues now become, 0,... ,0, Ag, ..., An. Thus, the linear network with
the scalar output vg is a single output Oja’s network and its dynamics converge to
the maximal eigenvector of Qgéﬁ—T, namely the eigenvector associated with Ag.

Eqn.(2.10) may be written as

min = II - PIKETWI' — vgzwi (2.15)

where n.- represents a diagonal matrix of time constants. Eqn.(2.15) can be rewritten

in the form

UIV'Vi = II " PillﬁéTWi - viwil — [Film-2W; (2-16)

In this setting, the vector ﬁeld is obtained by projecting

ItéTw: — viwi] (2.17)

45

onto the subspace spanned by vectors orthogonal to the subspace spanned by wk, for
all k < i, minus the projection of vEWg onto the subspace spanned by the vectors wk,
for all k < i.

This form is analogous to the formulation of the LEAP system reported by Chen
and Liu [3] which was for discrete systems. However, there is a difference, namely
that, the second term in the vector ﬁeld is replaced by [Pg]w,-. That is, the term v? is
absent from the second term.

Hence, the basic dynamic system is that of the Oja single output system typiﬁed
by (2.11). Sanger’s modiﬁcation basically serves to project the matrix {57’ so that
the largest eigenvalue of its time-averaged matrix becomes Ag and all the eigenvalues
Ak, k < i are set to zero. Thus, each weight vector dynamics strives to converge to

the maximal eigenvector for its associated matrix [I — Pg]§{—T.

2.4 The Proposed Circuit Model

Now the focus is on the performance of the single output Oja’s update rule. Recon—

sider Eqn.(2.15), namely,
Ui‘lvi = [I — PilééTWi — vgzwt

Assuming that all prior weight vectors wk, 1: < i have already converged, this is a
dynamic equation in the state Wg with a linear term [I — Pg]{£Tw.-. The second term
v?w,- is actually cubic in the state vector Wg. Noting the deﬁnition of vg, this term
gives the cubic form WgwiTéﬁTWg. It may also be noted that in general, vg Z 0 and
speciﬁcally, vg > 0 for i = 1 (See [11]). The cubic term maybe replaced by a cubic
nonlinear function of Wg.

With a View towards circuit implementation, the inverse of the Sigmoidal function

46

(or a scaled hypertangent function) was used to approximate the cubic term. Thus,

in comparison to (2.10) and (2.11), the modiﬁed rules are:

0% = “Uzi " aizi (2-18)
Wg' = S(Zg') (2.19)
where S = [51, ...,Sd]T is a vector composed of Sigmoidal functions. The ordered

constants a1 > a; > > am > 0 act as pseudo-eigenvalues. The variation of this
rule makes it suitable to implement. Using transconductance ampliﬁers or differential
pairs as building blocks, see Mead [18], the following variation makes it more suitable

to implement.

nijétj = ~01izij + vildifj, (2.20)
ng = tanh(zgj) (2.21)
[diff] = tanh(fj) (2.22)

v,- = wild.” (2.23)

where Zg'j is the jth element of the vector zg, mg is its time constant, 0g is a positive
constant (pseudo-eigenvalue), ng is the j th component of the vector Wg, and f]- is the
jth element of the input vector 6. Basically, the input vector is passed through the
nonlinearity tanh(.) and so is the state Zgj which is passed through a similar Sigmoidal
nonlinearity.

In vector form, system (2.20)-(2.23) can be rewritten as:

mil = —OtiZi + [Idifflgg'fflwi (2.24)

47

The equilibria are at:

0 = —ag'S—1(Wg') + [IdiffIzz’gfflwi (2.25)

Using a tanh function ensures global stability and bounds Wg. Also, as the bound is

sufﬁciently small, S ‘1(Wg) z Wg. Hence,
agWg = [Idg'ffIg-ff]wg' (2.26)

If Ag is the largest eigenvalue of IdgHIdTiff, then

A.I§fw.-I = i—jImII-ﬁfw.) (2.27)
This says that the equilibrium weight vector associated with the largest eigenvalue
Ag, will be of length 3? instead of the usual unity. Moreover, its stability is due to the
cubic (inverse Sigmoidal) function. This model eliminates the need for the term v,-2Wg
which would require a multiplication of three terms.

It should be noted that due to the replacement of the cubic weight term by the
inverse Sigmoidal term and due to the introduction of additional nonlinearities (sig-
moidal functions), the precision in computing the principal component is affected.
It is not the principal component per se that is being computed but is actually a
nonlinear analog equivalent of it. The goal is to provide a stable model suitable

for compact circuit implementation which qualitatively computes an analog of the

principal component.

CHAPTER 3

Computing the Maximal Principal

Component

3. 1 Introduction

Principal component analysis (PCA) is most appropriate for unsupervised learning
situations, where there is little or no a priori knowledge about the data, its nature
or categorization. This feature makes PCA the perfect ﬁt for sensory data process-
ing systems. In many such applications, real-time computation is called for, which
motivates a physical implementation of the algorithm. Oja’s and Sanger’s methods
being linear, there are limits to the direct realization of these algorithms since the
basic building blocks of analog circuits are naturally nonlinear. The objective of the
model proposed in section 2.4 was to reconstruct the PCA algorithm in the nonlinear
domain to make circuit implementation feasible.

Computing the eigenvector corresponding to the largest eigenvalue, also called the
Ihaximal principal component, is the ﬁrst major step towards such implementation,
because it contains maximum information pertaining to the original data. For the

COrnputation of the maximal principal component, the network consists of only one

48

49

unit. Dropping the subscript i, the equations (2.20)-(2.23) become:

7712} = —az,- + vIdg'ffJ (3.1)
w, = tanh(zg) (3.2)
[dim = tanh(fj) (3-3)

v = WTIdgff (3.4)

where 23- is the jth element of the vector z, 77g is its time constant, a is a positive
constant (pseudo-eigenvalue), wj is the jth component of the vector w, and {j is the
jth element of the input vector 6. The model presented in (3.1)-(3.4) is tailored for
circuit realization. It can be implemented by using MOS elements and differential
pairs which operate in the subthreshold regime of MOS operation. Consequently,
power consumption is also reduced.

Section 3.2 describes brieﬂy, the basic building blocks needed to implement the
above model. Circuit schematics and their layouts describing the actual implemen-
tation are presented in section 3.3. Section 3.4 deals with the simulation and ex-
perimental results of the circuit model for the computation of the maximal principal
component of 2-D Gaussian data while section 3.5 deals with the RGB-to-YIQ con-
version of a color image by computing the maximal principal component of the color

image.

3.2 Basic Building Blocks

Differential pairs, current mirrors and transconductance ampliﬁers are some of the
widely used building blocks in analog VLSI circuits. These and other such building

blocks have been widely used in Optical motion sensing, retina and cochlea modeling

50

[18], construction of neural network architectures and signal processing [19], sepa-
ration of mixed signals [8], stereo [15] and feature matching circuits [13], etc. This
section provides a brief description of the building blocks used in the implementation

of the model (eqns.(3.1)-(3.4)). In-depth details can be found in [18].

3.2.1 Diﬂerential Pair

 

 

 

 

 

 

 

 

Vb —‘] Qb

 

 

 

/77/

Figure 3.1. Schematic of a differential pair.

A typical differential pair is shown in Figure 3.1. The bias transistor (2;, acts as a
current source. Its drain voltage V is usually large enough to make the drain current
1;, to be saturated (determined by the bias voltage Vb). The (drain) current ﬂowing

through a transistor in saturation is given by:

1,0,: [Carve-Vs (3.5)

51

where V9 and V, are the gate and the source voltages. Applying this to transistors

Q1 and Q2, 1,11 = IoeKVI‘V and L12 = Iced/2“,. The currents Id] and Id; should add

up to 15. Hence,

II = I.“ + L12 = Joe-VIN + cw?)

From the above equation:

6

—v z £2_1__
IO eKV} + etch/2

Substituting this value into the expressions for Lg and L12 yields:

L11 =

Ln =

eKVI

etch/1 + CKI/Q
eKV2

lb

(3.8)

(3.9)

If V1 >> V2, Q2 gets turned off. This means then, that Iggl R: 1;, and Id; z 0. The

converse is true when V2 >> V1.

3.2.2 Current Mirrors

v Iin

v Iout

 

 

 

 

 

 

 

 

 

 

 

 

777'

(a) n-mirror

Figure 3.2.

 

 

 

 

 

 

 

\/ Iin \V 1out

(b) p-mirror

Current mirrors.

52

There are two types of current mirrors: n-mirror and p—mirror shown in Figures
3.2(a) and 3.2(b) respectively. A current mirror is used either to reﬂect a current or
to make a copy of a current. In both the types, since the gate and the source voltages

of both the transistors are same, from eqn.(3.5), it follows that [out = Ign.

3.2.3 Transconductance Ampliﬁer

 

 

 

 

 

Figure 3.3. Schematic of a transconductance ampliﬁer.

A p—mirror added to the differential pair as shown in Figure 3.3 results in the transcon—
ductance ampliﬁer (or transamp). The mirror reﬂects the current L11, thereby making
I“ = L3 = Idl. From Figure 3.3 and eqns.(3.8) and (3.9), the resulting current Ioggg

of the transamp is given by:

Iout = Id4_Id2

= 141 — Id2

53

 

eKV1 _ 651/2
bm (3.10)
Multiplying both the numerator and denominator by e—(V1+V2)/2,
eKlvl-Vzl/Z _ e-‘(VI-Vzl/Z
[out beK(I/1-V2)/2 + e—K(VI-VV2)/2 (3.11)
K. V — V
= II, tanh —(——1—2—) (3.12)

The response of an on-chip transamp to the difference of the input voltages is shown
in Figure 3.4. The input voltage V1 was varied from 2.1V to 2.9V;V2 was ﬁxed at

2.5V. An important observation from (3.12) is that the output current law of the

 

 

 

 

 

 

Figure 3.4. Response of an on-chip transamp.

transamp is proportional to the bias current 15. This means that the transamp can
be used as a multiplier replacing Ig, with a suitable quantity; a fact that has been
extensively used in the implementation (for example, to compute 'UIdiffJ in eqn.(3.1)

and v in eqn.(3.4)).

54

 

 

 

 

  

W,“ ‘ mnh(§g)

 

 

Figure 3.5. Circuit schematic of nstage.

 

Figure 3.6. Circuit layout of nstage.

55

3.3 MOS Circuit Realization

Two circuit versions which are compatible in architecture and similar in processing
have been developed to implement the model. Figure 3.5 shows the circuit represen-
tation (nstage) for a single stage of the model presented in the equations (3.1)-(3.4).
2,- represents the voltage across the capacitor. The output current of the transamp
T2 represents the weight w,- (= tanh(zj), eqn.(3.2)). This current is then reﬂected
by the n-mirror (for proper direction) and fed as bias current to transamp T3, which
also receives 5,- as the input voltage. The output current of T3 is therefore, the bias
current (Wj) multiplied by IgggffJ (2 wJ- tanh(ﬁj), eqns.(3.3)-(3.4)). For a multistage
circuit, the output current of T3 from all the stages are summed up to yield the out-
put v. The output (current) v is reﬂected by a n-mirror and fed as bias current to
the transamp T1. The transamp T1 presents at its output a current that is equal to
vIdg f f). The current ﬂowing through the capacitor is this current minus the current
ﬂowing through the resistance (eqn.(3.1)). In actual implementation, the resistance
has not been used. The input resistance of the MOS transistor acts as the resistance
represented in the circuit. The layout for this circuit representation is shown in Figure
3.6.

The current mirror immediately after the transamp (T2) that computes the weight
wJ- can be eliminated (thereby reducing one current mirror per stage) by suitably
modifying the transamp (T3)(that computes ijdg 1' f1) after the current mirror. This
compact circuit realization (pstage) is presented in Figure 3.7 and its layout is shown
in Figure 3.8.

It is evident from both the realizations that the model being modular in nature,
additional stages may be added to process higher-dimensional data; the number of
stages being equal to the number of dimensions of the input data. Conversely, some

of the inputs may be grounded if not needed. (A circuit realization to handle n-

56

 

 

 

Figure 3.7. Circuit schematic of pstage.

 

Figure 3.8. Circuit layout of pstage.

57

dimensional data will be referred to as n-D circuit). A 2-D circuit consisting of two
nstages was included as part of a tiny chip Chip] fabricated through the MOSIS
service. The layout for the chip Chip] is shown in Figure 3.9. The circuit schematic
and its layout are shown in Figures 3.10 and 3.11 respectively. The circuit path for
the weights was broken to facilitate their measurement; the currents were converted
to voltages using external op-amps.

A useful observation from eqn.(3.2) is that the weight wg (current) follows zj (volt-
age). Since it is difﬁcult to measure small currents (resulting in loss of accuracy), and
the fact that the circuit path has to be broken to facilitate the measurements (con-
sumption of more pins on the chip), it is preferable to measure 2,- (voltage across the
capacitors). Another reason which supports this idea is the fact that after ‘learning’,
it is easier to apply a constant voltage (2,) than a constant current (w,) to process
data. These facts were taken into consideration during the design of Chip? which
consisted of a 2-D circuit making use of the pstage and a 3-D circuit making use of
the nstage. The objectives of this chip design were to test the working of the com-
pact circuit realization using pstage (2-D circuit) and to test the modularity of the
model (3-D circuit). The chip layout of Chip? is shown in Figure 3.12. The circuit
schematic of the 2-D circuit using pstage and its layout are shown in Figures 3.13 and
3.14 respectively. The counterparts for the 3-D circuit implementation are shown in
Figure 3.15 and Figure 3.16.

A circuit realization for processing (much higher dimensional) 6-D data designed
on Chip3 (shown in Figure 3.17) using nstage is shown in Figure 3.18 and its layout

is shown in Figure 3.19. All the layouts on the chips were designed using LEdit.

DEED

 

l:

58

DDQDBDDDD

 

E1 :1 I: C: :1 CI. 2:] I: :21 1:. C. :2}
2:!— g
I -
.. .................. E

 

 

 

 

III]

 

 

 

Figure 3.9. Chip layout of ChipI.

59

 

 

 

Figure 3.10. Circuit schematic of 2-D circuit using nstage.

 

Figure 3.11. Circuit layout of 2-D circuit using nstage.

 

 

. I . < ' . .. .. 4 A ,4
‘g . --... -
4:. ; . . .
: . _ - . - . _ : .

I—L.

 

 

 

 

UZDIIFIIIII I

 

 

 

 

 

 

Figure 3.12. Chip layout of Chip?

 

4 Mg . . , - _ A
5“, mg}. 1 N :5: It . - I g . g , 'J. g ‘
1.-.. g .. .. ,- . , w. ..I .

61

 

 

Figure 3.13. Circuit schematic of 2-D circuit using pstage.

 

 

 

 

 

 

Figure 3.14. Circuit layout of 2-D circuit using pstage.

62

 

 

 

 

 

 

 

 

Figure 3.15. Circuit schematic of 3-D circuit using nstage.

 

Figure 3.16. Circuit layout of 3—D circuit using nstage.

63

 

Figure 3.17. Chip layout of Chip3.

,—

 

 

 

 

is
as

Figure 3.18. Circuit schematic of 6-D circuit using pstage.

 

 

 

 

 

 

 

 

 

 

 

 

 

e

u

g t

— u 1
;.

I...
i
qubtwlm~ﬁvlmvﬂwﬂ~W .-.-.._. . . . . . wt...— IE

Figure 3.19. Circuit layout of 6-D circuit using pstage.

65
3.4 Computation of the Maximal Principal Com-

ponent of 2-D Gaussian Data

3.4.1 Mathematical Simulation

The Gaussian distribution is an ideal data set to test the efﬁcacy of any PCA model.
This is because of the fact that for a 2-D representation, the data set looks like an el—
lipse (for unequal variances) in which case, the maximal principal component is along
the direction of the major axis. Such a data set containing 2000 random samples was
generated (using SPlus) with u = (0 0)T and 01 = 1,012 = 021 = 0,02 = 2. Addi-
tional data sets were generated by rotating this data set in various directions. The
model (3.1)-(3.4) which represents the circuit model was simulated using MATLAB
for these data sets and as seen from Figure 3.20, the maximal principal component is

approximately along the major axis.

Data Data rotated by 90 degrees

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.20. Simulation results for a Gaussian distribution.

66

Comparison of Oja’s Model and the Circuit Model

 

 

   

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

.. waozmnoml Mn?“
3»
2»
6»
0’ 3" I. . . 3'. .
’ '%"'x I
at '3" ' T- 7 ‘3
I: r ' 3
u or — S o» 5 "
. .g
-2. ;' g
-1» ‘
.4.
.6»
-2.
.0»
’30 3 -i -4 3 M) i 4 T Ii 10 ‘3': -2 -i o i 2 3
at W!
(a) The maximal principal component. (b) W2 vs. W1.
.. 3040213). owol . to“ W2 n WI
3. . 3 . ' . I 4 on»
'I oo-
0 o" ‘ ., -
.. .'~-.-.; 5-5-1"; - . I
a g 0 . .i‘g'ﬁ‘yu
0 ,.'-' it’s 3:3:
-'.'.t::'§ls:.e '
-2» 4 -02* “I :U :l .
4b ‘0‘ . o. . c‘
4 -. 4w
4» '0" 1
[3,0 _; j, _; _.:, a 2 ; é ; ,0 '11 —oia as -oI4 a: I) 02 (it do 0’; t
n m no4
(c) The maximal principal component. ((1) W2 vs. wl.

Figure 3.21. Simulation results for the initial condition we 2 [ 0.2190 ; 0.0470 ].
(a)-(b) Oja’s model. (c)-(d) Circuit model.

Figures 3.21—3.24 show the performance of the Oja’s model (eqn.(2.4)) and the cir-
cuit model (eqn.(3.1)-(3.4)) in terms of computing the maximal principal component
of the 2-D Gaussian data for different initial conditions. Each ﬁgure compares the

ﬁnal direction and the characteristic of the weights for the models for a given initial

67

 

 

   

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

.. M'mnm g WNW!
at
2»
.'
‘- il .l
2» « 1!
u o _ i 5 0’ ' ' .'
.;
4" i i .. a ‘H 4 if!
.4. . ..... 4 4. .0 3‘
4.. “,.
-2»
.0»- 4
‘ﬁo .3 .8 .2 .5 6 5 i i i 10 i1 .é -i 6 i 5 a
It '11
(a) The maximal principal component. (b) W2 vs W1
«Humanism no van-m
1: V Y Y *7 Y 1 f T
l 0.8
3.. .. , i. . .3 "1' . i . 05> . ' l
4» . 0.4» - '. I
a o . - S ° ' .i'fk‘éir'x '
. ... 40;",
.02» ' .o ‘43:.‘r't
4* a; ' ’.
-o.4 ':
.‘l—
~06
4’ «
418
4» 4
.3 r r . i . l . . . '31 .0} —o‘o -o‘.4 .412 5 0L2 074 aft of: |
-10 4 4 .4 .2 0' 2 4 o a 10 V" no"
I
(c) The maximal principal component. (d) W2 vs. W1.

Figure 3.22. Simulation results for the initial condition W0 2 [ 0.6789 ; 0.6793 ].
(a)-(b) Oja’s model. (c)-(d) Circuit model.

68

 

um; um]

w 1 r r v r

 

 

 

 

 

mum"

 

.....

 

 

41

"I'm-44440244.»

 

(c) The maximal principal component.

 

 

 

 

 

 

 

 

 

W2 ”W1
2 l-
' ' a
o I
£ 0 > .
c I "
‘x -‘
s " .‘ ' 4".
-1 r . J
I. 3
-2 > 4
‘33 -2 -1 o 1 2 3
II
(b) W2 vs . wl .
I ‘0" m."‘
l v v v T W
0.0%
0.6- . .
o“ P . . . . ". .
' ' I . . \ .
" .' .' . v. \
02 b . '2. ﬁ. 0... '
a; ’ 0’
S o I‘ . .1. .
4. "‘ : ‘t. ‘
I . z." .-{.'
'02 ’ ' .25‘0 .91.. . 'o
I 3 t. . .
-o.4» ~ ' ‘9' . <
~05 *
-0.8
-' A l A l I 1 A L 1
-1 -03 4.6 -0.4 ~02 0 02 0.4 0.6 0.8 1
W1 1 10"

(d) W2 VS. W1.

Figure 3.23. Simulation results for the initial condition W0 = [ 0.5297 ; 0.6711].

(a)—(b) Oja’s model. (c)-(d) Circuit model.

69

 

 

   

 

 

 

 

 

 

 

 

   

 

 

 

 

 

 

.. M.W7,ﬂ.£3$] ‘ mum
.p
2» 4
6’
4»
1r-
2r
Q o 4 S o» 3 +
.2. ..... ‘ }“. t”;
-1»
4. . ‘ 3
4» °‘
.2»
.8».
'le : -4i -4 -2 3 2 4 i a 10 ‘33 .2 -i o i 2 3
ll W1
(a) The maximal principal component. (b) W2 vs. W1.
mmmm no" Ham
:3 T I T I Ti I I ‘ Y I' .I V V Y
a ..... 0‘.
a» or,
4» 0.4,
2- 02) 1
g 0 - 1 g 0
.2» .1 42'
4. .04
.5. :2 l' . -o.o>
.g. ~03
13° 3 _; J .2 3 g i ; i w ’31 41‘: j? .0; 4:2 3 012 074 ole of: 1
:1 In :10“
(c) The maximal principal component. ((1) W2 vs. W1.

Figure 3.24. Simulation results for the initial condition W0 = [ 0.9347 ; 0.3835 ].
(a)-(b) Oja’s model. (c)-(d) Circuit model.

70

 

 

   

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

m W on: $04021”; 0.0470] ‘- W2 VI W1
3»
2»
T
‘L ' l 1 .°
2> , - f.‘
9 or £0' . :73. H:
i
‘2’ l .. '1. s .‘4 I”...
-1» -' .
4’ I 3
4. ‘2 l
. 4+
4»
"'30 .3 4i .3 .5 3 i i i 3 1o ‘33 .5 -i 6 i 5 :1
:1 W!
(a) The maximal principal component. (b) W2 vs. w].
.. 1104021». 0.0470) x ‘04 W2 1» m
o» W 06' ' , . .
‘» 4 0‘ . .
2. 02 '15 $533 '

u o» s of :22"; :52!” . 4
.2- 42”}:3154' 4
4t -0.‘ .
4L .. . . . ; . ~ . g ‘ ..... . 41,6)

4;. ‘08'
"1310 j, _; J _2 i 2 j ; Q ,0 '31 41‘; 4717—074 .0} o 072 0:4 jg oi: 1
ll W1 1104
(c) The maximal principal component. ((1) W2 vs. W1.

Figure 3.25. Simulation results of the Reconﬁgured data for the initial condition we

= [ 0.2190 ; 0.0470 ]. (a)-(b) Oja’s model. (c)-(d) Circuit model.

71

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

.. www.mo.m . 11121131141
or»
2.
l’ .. 1
2 .L ' ' .-
i ' -'
.. b " “a:
-1.
- n
-2.
‘3: -5 -i T i 5 a
m
(b) W2 VS. W1.
[104 ”I'1
0.11-
05. . . ‘ i
114- l . ‘ 5. I . 1 (
02- '. -'.=-"..'1' '
S o. . ”:‘F‘ﬁiii.’ .
42' .' 4-1:;.r,‘~1.\1:l§': ' 1
-0.‘*' .. 4
41.0» (
p «on»
30 4 ,3 4 4 1i 2 ‘ o f ,0 “.11 .0} 4; -oi4 .02 o 02 04 0; on 1
:1 In no"
(c) The maximal principal component. (d) W2 vs. W1.

Figure 3.26. Simulation results of the Reconﬁgured data for the initial condition W0
= [ 0.6789 ; 0.6793 ]. (a)-(b) Oja’s model. (c)—(d) Circuit model.

72

 

 

   

 

 

 

 

 

 

 

 

   

 

 

 

 

 

 

M WMMWLOBHH . W2nW1
a»
2»
6»
t» 4 4
1) .
2r 1 .3
11 o — g o» ' ' .-
1
-2~ "l .. \- .‘. :4:
4» -‘h .‘. ’0 .0 l
.5.
.2»
.2. ......
:30 -; 3 -4 a." :1 i i 40 3 1o '3': -2 -i 1i i 2 3
:1 V"
(a) The maximal principal component. (b) W2 vs. W1.
.. .o-pm; W11] ' , 104 W'Wl
.l‘ 08 1
o» 0.6
4» “T .~
2» 02- u
11 o — S 0* “'1‘.
.2. 412’ .s’
4... 4 -O.‘
4r - , i 4.6- i
.2... 1 410'
‘30 _; 3 _; 3 3 2 2 g ; ,0 '31 .03 .0711 41:4 .0} i1 02 oi4 o} of: 1
:1 In no"
(c) The maximal principal component. ((1) w; vs. w].

Figure 3.27. Simulation results of the Reconﬁgured data for the initial condition W0

= [0.5297 ; 0.6711 ]. (a)-(b) Oja’s model. (c)-(d) Circuit model.

73

 

.3
T .l‘o ”um
I '

 

   

 

 

 

 

 

 

'1‘! o 1 S 0*
.2. .( 411’
.4. 4 'Q"
4... . it. 1 ‘ ., '0"
-0.l>
.2.
:3 . . l i . . . L. {1.0331343230124040} 0131
-10 4 4 .4 .2 i: 2 4 o I 10 '" no"
I
(a) The maximal principal component. (b) W2 vs. w].

Figure 3.28. Simulation results of the Reconﬁgured data for the initial condition W0

= [ 0.9347 ; 0.3835 ]. (a)-(b) Circuit model.

condition.

Since the last ﬁve samples of the original data set were not its ‘true’ representa-
tives, and since the ﬁnal direction of the weights is considerably inﬂuenced by the
last segment of the samples, the data was reconﬁgured randomly. Due to the re—
ordering of the data set (which removed the ‘bad’ samples in the ﬁnal segment), the
performance of both the models on this reconﬁgured data set improved considerably
(Figures 3.25-3.28). However, the simulation for the Oja’s rule for the reconﬁgured
data set aborted for the initial condition WO 2 [ (19347 03835 ]T when the weights
became too large.

In summary, the ﬁnal direction obtained by the circuit model was quite close to
that of Oja’s rule. However, the direction obtained by both the models is dependent
on the initial condition. In order to nullify the effect of dependence of the ﬁnal
direction on the initial condition and the ﬁnal data sample, it is a better idea to take
the ‘average’ direction over the ‘ﬁnal’ segment. The characteristic of the weights in

almost all the cases shows that this average direction is along the major axis.

74

3.4.2 PSpice Simulation

A convenient way of generating a Gaussian distribution for practical purposes is by
applying sine signals (of same dc level but of different amplitudes) as inputs to the
circuit. The 2,- signals which constitute the maximal principal component vector
should then ‘follow’ the corresponding input signals. Extensive PSpice simulations
were performed to verify this aspect. A bias of 2.5V was used to bias the transcon-
ductance ampliﬁers properly. The training data consisted of two sine signals with a
frequency of 2kHz. The amplitude of the ﬁrst signal was 25mV and that of the second
signal was 100mV (increasing order). The circuit was simulated for different initial
conditions (initial voltages across the capacitors). Figure 3.29 shows the weights’for
an initial condition of 2.5V while Figure 3.30 shows the weights for the case with zero
initial condition. In both the cases, the amplitude (above the average) of the second
component 1.02 is approximately four times that of the ﬁrst component wl.

In the second experiment, the amplitude of the ﬁrst component was 50mV while
the second had an amplitude of 25mV (decreasing order). Again, as Figure 3.31

demonstrates, the weights ‘follow’ the input data.

3.4.3 Experimental Veriﬁcation

Although it is clear from the PSpice simulations that the weights follow the input
signals, it may not be obvious that the z,’s indeed form the maximal principal com-
ponent. To prove this fact and verify it experimentally, 4 of the 6 inputs of the 6-D
circuit were grounded and input signals (10mv and 30mV) were applied to the re-
maining two inputs in the increasing order. Since the weights w,- follow the z,- signals
and since they are in phase, the maximum of each 2,- was taken as the component of

the maximal vector and a plot of these signals with respect to the two inputs (Figure

 

Since the PSpice simulations were performed prior to the design of Chip], the current signals
w,- representing the weights were measured instead of z.- signals.

75

   

(a) Initial weights. (b) Final weights.

Figure 3.29. PSpice simulation: w, signals for input signals at 2 kHz in the increasing
order for the initial condition of 2.5V.

 

 

1’“ .......................................................................

 

(a) Initial weights. (b) Final weights.

Figure 3.30. PSpice simulation: w,- signals for input signals at 2 kHz in the increasing
order for the initial condition of 0V.

76

 

0

«Ont d an “Arr-
° IQXMRRI .W) 0 IGXMRRSIQ)

Figure 3.31. PSpice simulation: w,- signals for input signals at 2 kHz in the decreasing
order for the initial condition of 2.5V.

3.32) shows that the experimental result is in close agreement with the simulation
results. However, it should be noted that the data in the experiment is different from
that of the simulations. Further, in both the simulations and the experiment, the
error in the direction obtained by the circuit model can perhaps be attributed to the
nonlinearities, transistor mismatches and the fact that the model is not computing

the actual principal component but only the nonlinear analog equivalent of it.

Experimental results of the 2-D, 3-D and 6-D circuits

Since it is clear from the experimental veriﬁcation that the z,- signals ‘following’ the
input data is equivalent to computing the maximal principal component, this section
describes similar experimental results [32],[27] for the different circuit realizations. It
will be noted that the results for all the implementations are analogous to each other
thereby proving the working and the modularity of the model.

In the ﬁrst experiment, sine signals at 2kHz, each with a peak-to—peak amplitude

77

 

 

 

 

 

_o' l l l 1 L 1 l l I
3.05 -0.04 -0.03 -0.Q -0.01 0 0.01 0.02 0.03 0.04 0.05

Figure 3.32. The maximal principal component.

of 15mV (30mV for the 6-D circuit) were applied as input. The dc bias for these
signals was ﬁxed at 2.52V. Figures 3.33 and 3.34 show that the z,- signals for each
implementation have the same amplitude and almost the same dc level. The slight
discrepancy in the dc levels is attributed to transistor mis—matches.

In the second experiment, the input signals had the same frequency. The ampli-
tudes however, were changed to IOmV and 20mV respectively for the 2-D implemen-
tation and 10mV,20mV and 33mV respectively for the 3-D and 6-D implementations:r
The signals were obtained from a common source by means of a set of voltage dividers
with individual dc biases added on. The 2,- signals for this data set as shown in Figure
3.35 and Figure 3.36 are also in the increasing order.

In the third experiment, input signals were reversed (decreasing order). In re-
sponse, as shown in Figures 3.37 and 3.38, the order of the 25,8 also got reversed. In
conclusion, in all the three cases, the z,- signals, and hence the weights, are aligned
along the data direction.

In contrast to the time varying z,- signals obtained so far, it is desirable to have

the z,- signals, (and hence the weights w,-) approach constant voltages. It would then

 

To make the experiment easier, inputs 1 and 3, 2 and 5, 3 and 6 were short-circuited.

78

be possible to apply these converged weights to process test patterns or signals. With
this end result in mind, an experiment was carried out in which it was observed that
the weights approached constant values when the frequency of the input signals was
increased to a high value. Figure 3.39(a) shows the result for the 2-D implementation
at 100kHz and Figure 3.39(b) shows the result for the 3-D implementation at 60kHz.
These weights deﬁne a single direction along the maximal eigenvector of the time-
averaged outer product matrix of the input vectors. An equivalent result was obtained
when the capacitive values in the circuit were externally increased (Figure 3.40).
These actions amount to separating the time—scales of the circuit dynamics and the
driving input signal.

An experiment was carried out to determine the circuit parameter improvements
needed to expand the architecture beyond three dimensions. The reference voltage
for the input transamps, the bias voltage and the reference voltage for the weight
transamps are the key parameters useful in preventing clipping of the z,- signals. F ig-
ures 3.41 and 3.42 show the effect of varying the reference voltage We, for the input
transamps. Although the z, voltages follow the input signals in general, there is a
slight discrepancy with respect to the dc levels and amplitudes. This is attributed to
the transistor mis-matches. The effect of such circuit anomalies becomes more promi-
nent as the number of stages, i.e., the dimensionality of the input space increases.
To alleviate these problems, the reference voltages of the input transamps and the
weight transamps for each of the stages were separated from each other on Chip? for

‘ﬁne-tuning’.

79

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

l l l I 1 l l l

A J A
"o 02 0.4 0.0 00 1 12 1.4 1.5 1.0 2 "0 0.2 0.4 0.6 00 1 12 1.4 1.0 13 2

 

 

(a) 2-D Implementation (b) 3—D Implementation

Figure 3.33. z, signals for equal amplitude input signals.

 

 

 

 

 

 

 

 

 

0 0.5 1 1.5 2 O 0.5 . 1 1 .5
time x 10-: time x 10

(“to

 

 

 

 

 

 

 

 

 

o 0.5 1 1.5 2 o 0.5 '1 1.5
"m x 10'3 “m x 10"

N

 

 

 

 

 

 

 

 

 

O 0.5 1 1 .5 0.5 1 1 .5 2
time x 10 time x 104

(no
0

Figure 3.34. 6-D Implementation: 2,- signals for equal amplitude input signals.

80

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

"0 02 0.4 0.6 00 1 12 1.4 1.5 10 2 "o 02 0.4 0.5 0.5 1 12 1.4 1.5 1.0 2

(a) 2-D Implementation (b) 3-D Implementation

Figure 3.35. z,- signals for input signals in the increasing order.

 

 

 

 

 

 

 

 

 

 

4+~~ ’ 4" l
._ W V W
N2 N2 ....................................... a

o ‘ 0 ‘

0 05 1 15 2 0 05 1 15 2

.4114 um“

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

O 0.5 1 1.5 2 0 0.5 1 1.5 2
x 10'3 x 10‘1

Figure 3.36. 6-D Implementation: 2,- signals for input signals in the increasing order.

81

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

"0 0.2 0.1 05 0.0 1 12 1.4 1.5 12 2 "0 02 0.4 0.6 0.0 1 12 1.4 1.5 12 2

(a) 2-D Implementation (b) 3-D Implementation

Figure 3.37. z; signals for input signals in the decreasing order.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

O 05 1 15 2 0 05 1 15 2
ﬁ - - .2194 - - "“4
4, i . .',. ...‘ ., J 41 4
:32__.,, a.
O ‘ ‘ '* 0 ‘ ‘
O 05 1 1.5 2 0 05 1 15 2
x104 11104

Figure 3.38. 6-D Implementation: 2,- signals for input signals in the decreasing order.

82

 

 

 

 

Z1

 

 

 

 

 

 

 

 

 

5 T 1 T T r I T r r r 1 1 r 1 f 1 T I
4

__ ﬁ - N21

3 1 1 1 1 1 1 1 1 1

‘ 0 02 0‘ 05 0.3 1 1.2 14 15 18

x10

1 1 1 1 1 1 1 1 1 I I I I I I I ’ I

110 ‘ i ' ' ‘ ' ' '
:1?

r r ﬁ T ﬁn r r r T 1 1 1 1 J 1 1 l 1

 

 

 

"0 0.2 0.4 0.0 0.0 1 12 1.4 1.0 1.0

 

-.-.- .---.-- . 1110
vav f h "V V‘ vvvvvv

 

 

 

 

 

 

 

 

I I Y T i I Y T T
4 . .
n .
. ”2»
02 04 00 00 1 12 14 10 10 2 c0 02 0‘4 010 is i 12 1'4 1’0 1’0
110” 1110
(a) 2-D Implementation (b) 3-D Implementation

Figure 3.39. z,- signals for input signals at high frequency.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

T I I I r I T I Y 1 I I I V I I 1 I
: 4" .........
““‘:‘.."'f““‘. AAAeJ-e'e; N2"
* v0 02 04 00 0.0 1 12 14 10 10
110
L l l 1 l l l l l ‘ T I I I ' I I I I
02 0.4 0.0 0.0 1 12 1.4 1.0 10 2 ~ m r
4
1
10 a? ';'
' . "0 02 0.4 0.0 0.0 1 12 1.4 1.0 1.0
' x10
3 "—“**
I I I I I T T r I
‘.
0
..... N2.
1 L l J. l L J. l I c l 1 l l l l l l l
02 0.4 0.0 0.0 1 12 1.4 1.0 1.0 2 0 0.2 0.4 0.0 0.0 1 12 1.4 1.0 10
4
110 110
(a) 2-D Implementation (b) 3-D Implementation

Figure 3.40. z,- signals for input signals with capacitors.

83

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Vref=2.49V Vref=2.50V
‘ ' W1
4 4» . .
N 2’ 1 N 2
Go 0.5 1 1.5 2 Go 0.5 1 1:5 2
x 10‘3 x 10"
Vf01=2 .51 V V181=2.52V
4 1
N 2 ................................
o A -
o o 5 1 1 5 2 co 0 5 1 1 5 2
x 10" x 10"
Vre1=2.53v Vre1=2.54v
o 0.5 1 1.5 2 o 0.5 1 1.5 2
x 10" x 10"

Figure 3.41. Response of 21 to change in the Vref of the input transamps for the 2-D
Implementation

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Vre1=2.50V Vref=2.51V
5 . . . 5 . . .
4, .................. ‘. ....... ,
3 A1 3 1
111 ’15
2» 1 2i
1) ' 11
o A A A c A A A
0 0.5 1 1.5 2 O 0.5 1 1.5 2
1110“, x103
Vre1=2.54V Vre1=2.55
5 : - . 5 . +
4, . . '_.. ,. 4. .
3, . ..:. .. . .. . 2 3p 4
F1 ; . ﬂ
2 .. 2 ............ .
0o 0.5 1 1:5 2
x10"

 

Figure 3.42. Response of 21 to change in the Vref of the input transamps for the 3-D
Implementation

84
3.5 Computation of the Maximal Principal Com-
ponent of a Color Image

A color model is a speciﬁcation of a 3-D color coordinate system and a visible subset
in the coordinate system within which all colors in a particular color gamut lie [7]. For
example, the RGB (red, green, blue) color model is the unit cube subset of the 3-D
Cartesian coordinate system. The purpose of a color model is to allow convenient
speciﬁcation of colors within some color gamut. Three color models (RGB, YIQ,
CMY) are widely used for the hardware devices. The RGB model is used for the
color CRT monitors; the YIQ model for the color-TV broadcast system; and CMY
(cyan, magenta, yellow) for certain color—printing devices. There are also models (such
as HSV) which directly relate to the hue, saturation, and brightness as perceived by
the human eye. Further, for each model, there is a means of converting to another
speciﬁcation.

The YIQ model used in US. commercial color-TV broadcasting is a recoding of
RGB for transmission efﬁciency and for downward compatibility with black-and—white
television. The recoded signal is then transmitted using the National Television Sys-
tem Committee (NTSC) standard. The Y component of the YIQ model represents
the luminance. The chromaticity is encoded in I and Q. However, only the Y compo-
nent of a color TV signal is shown on black-and-white televisions. The RGB-to-YIQ

mapping is given by:

. , . .
Y 0.299 0.587 0.114 R

I = 0.596 —0.275 —0.321 G (3-13)

 

 

 

 

 

 

1Q L0.212 —0.528 0.311 B

d h J

The quantities in the ﬁrst row reflect the relative importance of green and red and

the relative unimportance of blue in brightness. However, an average of the (R,G,B)

85

values for each pixel is a good approximation for the Y value for practical purposes.
For instance, an original color image, the converted image (using the Y value from
eqn.(3.13)) and the average image (average of RGB representing the Y value) are
shown in Figure 3.43 (a)-(c).

 

(b) Converted image

      

(c) Average image (d) Theoretical projection-1

Figure 3.43. RGB-to—YIQ conversion.

The objective of this experiment is to explore the possibility of the maximal prin-

cipal component approximating such a conversion replacing the (R,G,B) values of

86

each pixel in a color image by a single gray scale value while retaining the ‘informa-
tion’ content to the maximum extent possible. This is by no means a better method
over the previously discussed techniques. However, it deﬁnitely demonstrates the ca—
pability of the chip and lends support to the new model for computing the maximal

principal component.

3.5.1 Theoretical Result

The (raw) pattern matrix A" for the image is formed by considering each pixel as an
individual pattern. The covariance matrix ’R (= i-ATA) ffor the normalized pattern

matrix A (computed using MATLAB) is given by:

1- a

4.7591 4.8695 4.9278
R=103>< 4.8695 5.0767 5.1001 (3.14)

 

 

L4.9278 5.1001 5.1925

The eigenvalues (represented in the form of a diagonal matrix) D1 and the eigenvectors

V1 of this matrix are given by:

 

 

 

 

r . . -
0.0048 0 0 —0.8216 0.0937 0.5623
D1=104>< 0 0.0034 0 V1: 0.4734 0.6618 0.5813
0 0 1.4946 1 0.3177 —0.7438 0.5881

(3.15)

The maximal eigenvector corresponding to the largest eigenvalue of R is given by

T
m1 = [ 0,5623 0,5813 0,5881 ] . The components of m1 can be expressed in terms

 

T
of ratio as m'f = “”321”,!!!” = [ 03247 0,3357 0,3396 1 . The image projected

along m; is shown in Figure 3.43(d) (with the caption ‘Theoretical projection-1’).

The correlation matrix C (= AETA‘TA“) for the pattern matrix A“ (computed using

 

gee section 1.3.1

87

MATLAB) is given by:

p

2.1991 2.3114 2.2178
C=104>< 2.3114 2.4393 2.3364 (3-16)

 

 

2.2178 2.3364 2.2461

The eigenvalues (represented in the form of a diagonal matrix) D2 and the eigenvectors

V2 of this matrix are given by:

 

 

 

 

7 . .
0.0048 0 0 I-—0.8250 0.0047 0.5651
D2=104>< 0 0.0043 0 V2: 0.4117 0.6900 0.5953
0 0 6.8755 0.3872 —0.7238 0.5712

(3.17)

The maximal eigenvector corresponding to the largest eigenvalue of C is given by

T
m2 = [ 0,5651 0,5953 0,5712 ] . The components of H12 can be expressed in terms

 

:r
of ratio as m; = 111124.322; +1113. = [ 0.3263 0.3438 0,3299 ] . The image projected

along m; is shown in Figure 3.45(a) (with the caption ‘Theoretical projection-2’).

3.5.2 Simulation Result
Sequential Approach

In this approach, each pixel of a 64 x 64 image was processed independently and
sequentially (treated as a spatial signal). The differential equations representing the
circuit model were used for the simulation. The pixel intensity values which range
from 0-255 were mapped to 0—5V range. The z vector (solution of the differential
equations) obtained for each pixel was used to replace the (R,G,B) triple of that pixel
by the output value v of the model. The resulting image presented in [28] is shown in

Figure 3.45(b).

88

Random Approach

Instead of processing each pixel sequentially, pixels were picked at random using a
uniform random generator. A total of 8192 pixels (twice the size of image) were
chosen in order to provide a good representation of the image. A pixel was applied
at every instant of time to the circuit model. The ﬁnal value of 2 obtained after
processing all the selected pixels was then used to ‘project’ the entire image wherein
the (R,G,B) values of each pixel were now replaced by the output value v.

So, in the sequential approach, 2 was calculated for each pixel individually, while
in the random approach, it was approximated over the entire image. The random
approach is not only faster but also captures the ‘global essence’ of the image and
hence, yields better results in terms of preserving the information content of the

original image. The resulting image from this approach is shown in Figure 3.45(c).

3.5.3 Experimental Result

,An experiment was performed to verify the simulation result for the random approach.
The schematic of the experimental setup is shown in Figure 3.44. The RGB intensity
values for each pixel were converted to [—2.5V,2.5V] range. The converted RGB
values were down loaded to three Hewlett-Packard Arbitrary Waveform Generators
(HP33120A). The opamp-resistor network (R1 = 910109, R; = 9.11:0) was used to
map the input range to i25mV range. The output of the network was then applied
to a voltage follower. Individual dc bias was added to the output of the voltage follower
and then applied as input to the chip Chip3§ The 2,- values were then captured by a
Philips (PM3394A) Oscilloscope through a voltage follower network.

Due to experimental limitations, only 4096 of the 8192 randomly generated pixels

T
were used for ‘training’. From the experiment, the value of z = [ 2.58 2,82 1.64 ]

 

The inputs 4-6 were grounded.

89

 

Philips

Oscilloscope

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.44. Experimental setup

90

was used to project the entire image. The output v (current) for each pixel was
converted into a voltage by an op-amp network. These voltages (which now represent
single intensity values for the image) were converted to 0-225 range. The resulting

image is shown in Figure 3.45(d).

 

(c) Random approach. (d) Experimental result.

Figure 3.45. RGB-to—YIQ conversion by computing the maximal principal component
of the image.

In both the simulations and the experiment, it was decided to eliminate the addi-

tional step of normalizing the input data. So, it is essentially the maximal principal

91

component of the correlation matrix that is being computed (and hence, the inclu-
sion of the Figure 3.45(a) for the purpose of comparison). Figure 3.43(d) and Figures
3.45(a),(c) and ((1) show that the maximal principal component does a good approx-
imation of the RGB-to—YIQ conversion and that the model and the chip do a good

job of computing it.

CHAPTER 4

Conclusions and Future Work

The principal component analysis is a projection approach that is widely used in
cluster analysis and other pattern recognition applications. Since the neural network
approach is based on ‘learning’, it reduces the computational burden (especially when
the dimensionality of the feature space is high) by avoiding the actual computation
of the covariance matrix and its eigenvalues. Further, analog circuit implementation
being dependent only on the time constants of the circuit, would speed-up the process.
Hence, this problem is worth implementing on hardware. The model presented in [25]
and [26] is tailored for such circuit implementation.

Two prototype circuit implementations that compute the maximal principal com-
ponent based on this model have been presented. The circuit model utilizes differen-
tial pairs and MOS elements operating in the subthreshold regime of MOS operation.
Since the circuit operates in the subthreshold region, the power consumption is very
low; typically in the nano—watts to micro-watts range (depending on the number
of stages). The model is very compact (15 transistors per stage) and is modular in
nature and hence, facilitates extension to higher dimensions. The model has been im-
plemented upto six dimensions, which is the maximum size that can be implemented
on a single tiny-chip. Although there is large amount of chip real-estate available, the

limitation is the number of available pins.

92

93

Mathematical and PSpice circuit simulation results as well as experimental results
for computing the maximal principal component of 2-D Gaussian data have been
presented. Simulation results of the circuit model for different initial conditions have
been shown to be close to those of the Oja’s rule. Further, the experimental results
have been shown to be in close agreement with the simulation results.

Finally, the experiment on RGB-YIQ conversion is by no means a novel or a
better technique over the existing ones. However, the maximal principal component
does a good approximation of the conversion and the experiment demonstrates the
capability of the chip thereby lending further support to the new model for computing
the maximal principal component.

It should be emphasized that due to the replacement of the cubic weight term in
the Sanger’s rule by the inverse Sigmoidal term, and due to the introduction of addi-
tional nonlinearities (Sigmoidal functions), the precision in computing the principal
component is affected. The model does not compute the principal component per se
but, it provides a compact circuit realization which qualitatively computes a nonlinear
analog of the principal component with the additional features such as modularity,
low power consumption etc.

Future work involves extending the implementation to compute principal compo-
nents corresponding to smaller eigenvalues thereby, providing an implementation that
can perform ‘complete’ principal component analysis. In some of the experiments, the
original data had to be converted to the milli-volts range to suit the circuit require-
ments. This maybe sometimes difficult and possibly result in a poor or inaccurate
representation of the original data. Hence, circuit modiﬁcations needed to handle a

wider range of input signals need to be explored.

BIBLIOGRAPHY

BIBLIOGRAPHY

[1] R. Beale and T. Jackson. Neural Computing: An Introduction. Adam Hilger,
New York, 1990.

[2] G. Biswas, A.K. Jain, and R.C. Dubes. Evaluation of projection algorithms.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 3(6):701—708,
November 1981.

[3] H. Chen and R. Liu. An on-line unsupervised learning machine for adaptive

feature extraction. In Proceedings of IEEE Symposium on Circuits and Systems,
volume 1, pages 535—538, Chicago, IL, 1993.

[4] H. Chernoff. The use of faces to represent points in k-dimensional space graphi-
cally. Journal of American Statistical Association, 68:361-368, June 1973.

[5] Y. Chien. Interactive Pattern Recognition. M. Dekker, New York, NY, 1978.

[6] Foley. Sample and feature size. IEEE Transactions on Information Theory, pages
619—622, September 1972.

[7] J. Foley, A. van Dam, S.K. Feiner, J .F . Hughes, and R.L. Phillips. Introduction
to Computer Graphics. Addision-Wesley, New York, NY, 1994.

[8] A. Gharbi and F.M.A. Salam. Implementation and test results of a chip for the
separation of mixed signals. In IEEE International Symposium on Circuits and
Systems, Seattle, WA, 1995.

[9] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, New
York, NY, 1994.

[10] D.O. Hebb. The Organization of Behaviour. Wiley, New York, 1949.

[11] J.A. Hertz, A. Krogh, and R. Palmer. Introduction to Theory of Neural Compu-
tation. Addison-Wesley, New York, NY, 1989.

94

95

[12] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall,
Englewood Cliffs, NJ, 1988.

[13] B. Ling, S.S. Vedula, and F.M.A. Salam. Subthreshold analog feature matching
circuits with large output voltages. In Proceedings of IEEE Midwest Symposium
on Circuits and Systems, volume 2, pages 1251—1254, Detroit , MI, 1993.

[14] R. Linsker. From basic network principles to neural architecture. Proc. of the
National Academy of Sciences, USA 83:7508—7512, 8390—8394, 8779—8783, 1986.

[15] M.A. Mahowald and T. Delbruck. Cooperative stereo matching using static
and dynamic image features. In C. Mead and M. Ismail, editors, Analog VLSI
Implementation of Neural Systems. Kluwer Academic Publishers, 1989.

[16] J. Mao and A.K. Jain. Artiﬁcial neural networks for feature extraction and
multivariate data projection. IEEE Transactions on Neural Networks, 6(2):296—
317, March 1995.

[17] W.S. McCulloch and Pitts. A logical calculus of the ideas imminent in nervous
activity. Bulletin of Mathematical Biophysics, 5:115—133, 1943.

[18] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, New York, NY,
1989.

[19] H.J. Oh and F.M.A. Salam. Analog cmos implementation of neural network for
adaptive signal processing. In Proceedings of IEEE Symposium on Circuits and
Systems, London, 1994.

[20] E. Oja. A simpliﬁed neuron model as a principal component analyzer. Journal

of Mathematical Biology, 15:267—273, 1982.

[21] E. Oja. Neural networks, principal components, and subspaces. International
Journal of Neural Systems, 1:61-68, 1989.

[22] J .M. Ortega. Matrix Theory: A Second Course. Plenum Press, New York, NY,
1987.

[23] J. Rubner and K. Schulten. Development of feature detectors by self-
organization. Biol. Cybern., 62:193—199, 1990.

[24] J. Rubner and P. Tavan. A self-organizing network for principal component
analysis. Europhysics Letters, 10:693-698, 1989.

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

96

F.M.A. Salam. EE 963C:Class Notes. Michigan State University, MI, 1993.

F .M.A. Salam and SS. Vedula. Data-driven subthreshold analog circuit that
computes principal components. In Proceedings of IEEE Midwest Symposium on
Circuits and Systems, volume 1, pages 770—773, Detroit, MI, 1993.

F.M.A. Salam, S.S. Vedula, and G. Erten. Low power analog chips for the com-
putation of the maximal principal component. In Proceedings of IEEE Interna-
tional Conference on Neural Networks, volume VII, pages 4746—4749, Florida,
FL, 1994.

F.M.A. Salam, S.S. Vedula, A.B.A. Gharbi, and H.J. Oh. Neural chips with learn-
ing for data compression/ decompression and blind separation of signals: real-

time experiments. In Proceedings of International Conference on Fuzzy Logic,
Neural Networks, and Soft Computing, IIZUKA, pages 485—486, Japan, 1994.

J .W. Sammon. A non-linear mapping for data structure analysis. IEEE T rans-
actions on Computers, C(18):401—409, May 1969.

T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward
neural network. Neural Networks, 2:459—473, 1989.

G.V. Trunk. A problem of dimensionality: A simple example. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 1(3):306—307, July 1979.

SS. Vedula, F.M.A. Salam, and G. Erten. Subthreshold analog circuit for com-
puting the maximal principal component of 3-d data. In Proceedings of IEEE
Symposium on Circuits and Systems, pages 371—374, London, 1994.

J.M. Zurada. Introduction to Artiﬁcial Neural Systems. West Publishing, St.Paul,
MN, 1992.