SENSORY MAPPING, DEVELOPMENT AND THEIR APPLICATIONS TO FEATURE AND ATTENTION SELECTION

By

Nan Zhang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2006

ABSTRACT

SENSORY MAPPING, DEVELOPMENT AND THEIR APPLICATIONS TO FEATURE AND ATTENTION SELECTION

By Nan Zhang

A rich literature has been generated on computer vision systems. However, an implementable computational model of immediate vision for general and unknown environments is still elusive. Motivated by the autonomous development process of humans, we are interested in building a robot vision system that automatically develops its visual cognitive skills through real-time interactions with the environment. Developmental vision is required by the demands of general-purpose vision systems for complex human environments.

Based upon the above requirement, in this dissertation we propose a general architecture called Staggered Hierarchical Mapping (SHM) that performs feature derivation for a set of receptive fields and attention selection. The work reported here is motivated by the structure of the early visual pathway. We use several layers of staggered receptive fields to model the needed units of local analysis. From sequentially sensed video frames the proposed algorithm develops a hierarchy of filters whose outputs are maximally uncorrelated within each layer, with receptive fields whose scale increases from low to high layers. We also show how this general architecture can be applied to occluded face recognition, which demonstrates a case of attention selection.

Besides this general neural network architecture, we develop several sensory mapping learning rules for deriving feature detectors, including a fast incremental independent component analysis (ICA) method called Lobe Component Analysis (LCA), which derives independent components for many natural cases. A mathematical analysis of the algorithm has been carried out, which shows the advantages of the LCA method.

We push the research further by investigating the LCA method in an overcomplete setting, which means that the number of basis functions is greater than the dimension of the observation. A new incremental method is developed to solve the equivalent regression problem with a sparse regularization term, i.e., the LASSO regression. We show that the LCA method is a special case when noise is not present. This makes LCA's principle applicable to a large variety of applications, e.g., classification, regression, and feature selection.

To Eric.

ACKNOWLEDGMENTS

While working on the subject of this thesis for the last four years, I enjoyed the support of my colleagues and other people. Of course, the most direct support came from my supervisor, Dr. Juyang Weng, who helped me choose the topic of this dissertation and has been a great source of ideas, comments and encouragement. Without his support, this dissertation would not have been completed. I am grateful to Dr. Jain for offering an excellent course on pattern classification and machine intelligence, from which I learned so much. My gratitude also goes to all the professors on my committee, Dr. Ofria, Dr. Henderson, and Dr.
Stockman, for providing many valuable critiques and comments on my research topic and for their personal support. I would like to express my special thanks to Dr. Zhang, who took pains to travel across the country to attend my committee meetings and provided a research grant for my research project.

I used several pieces of software and code developed by my colleagues. Wey-Shiuan Hwang provided the HDR code. Yilu Zhang invested a huge amount of time to help me with the CCIPCA algorithm and code. Shuqing Zeng joined forces with me to develop many of the test experiments in this dissertation. Raja Ganjikunta and Martin Law have shared many valuable ideas with me in the research. Without their help, it would have taken me much longer, if it had been possible at all, to finish this work.

Thanks also to the members of both the EI and PRIP Laboratories, past and present, at Michigan State University. I benefited from many discussions with my colleagues. Thanks go to Xiao Huang, Feilong Chen, Zhengping Ji, Wei Tong, Xiaoguang Lv, Hong Chen, Yi Chen, Ameet Joshi and Jason Massey. Special thanks to Gil Abramovich, who provided support throughout and helped me get rid of the occasional gloomy thoughts while working on this dissertation.

I am greatly indebted to my parents for their continuous encouragement and support throughout my life. Thanks also go to my wife, Yue Wang, and my son Eric for their patience and support during the years at Michigan State University.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1 Introduction
1.1 Motivation
1.2 Background of biological visual system
1.2.1 Biological visual system
1.2.2 Early Visual Pathway
1.2.3 Simple and Complex Cells
1.2.4 Topographic Maps in Brain
1.3 Early visual processing models
1.4 Models for visual attention
1.4.1 Attention mechanism in the brain
1.4.2 Bottom-up attention selection
1.4.3 Top-down attention selection
1.4.4 Summary
1.5 Discussion
2 Staggered Hierarchical Mapping for Attentive Perception: A Developmental Model
2.1 Introduction
2.2 Staggered Hierarchical Mapping
2.3 Developmental Algorithm
2.4 Computational complexity analysis
2.5 Input Image Reconstruction
2.6 Attention Effectors
2.7 Experiment with SHM
2.8 Experiment with Occlusion
2.9 Experiment with Navigation Simulation
2.10 Conclusions
3 Lobe Component Analysis
3.1 Introduction to Independent Component Analysis
3.1.1 Problem statement and assumptions
3.1.2 Overview of ICA methods
3.1.3 Methods based on Non-Gaussianality measurement
3.1.4 High order cumulant based method
3.1.5 Methods based on Maximum likelihood and Infomax
3.1.6 Methods based on Sparseness measurement
3.1.7 Summary
3.2 Lobe Component Analysis
3.2.1 Fine structure of high-dimensional density
3.2.2 Global: Whitening the data
3.2.3 Local: Lobe components
3.2.4 Belongingness
3.2.5 Statistical efficiency
3.2.6 Amnesic mean
3.2.7 Efficiency of the amnesic mean
3.2.8 Proposed CCILCA Algorithm
3.2.9 Time and space complexities
3.2.10 LCA is ICA for super-Gaussians
3.3 Experiment Results
3.3.1 Low dimensional simulation
3.3.2 Comparison with Other ICA Methods
3.3.3 Cocktail Party Problem
3.3.4 Natural Image ICA
3.4 Derivation of the LCA algorithm
3.4.1 lp norm sparseness measurement
3.4.2 Infinity norm sparseness optimization
3.5 Broader Impact
3.6 Conclusion
4 Sparse representation and Feature Selection
4.1 Introduction: The Supervised Learning Problem
4.2 Method
4.2.1 Duality
4.2.2 Derivations
4.2.3 Quadratic Programming
4.2.4 Filter Development
4.2.5 Gradient Sparseness Optimization Algorithm
4.3 Experiments
4.3.1 Kernel Regression
4.3.2 Classification
4.4 Conclusions
5 Conclusions
5.1 Summary
5.2 Contributions
5.3 Summary of the future work
APPENDICES
A The Candid Covariance-free Incremental PCA Algorithm
B Derivative of Maximum Function

LIST OF FIGURES

1.1 Mammalian early visual pathway. (Adapted from (Kandel et al., 2000).)
1.2 The receptive fields in the primary visual cortex are different and more varied than those in the retina and LGN. (Adapted from (Hubel and Wiesel, 1962).)
1.3 The orientation preference map of the macaque monkey's striate cortex, measured by optical imaging technology.
1.4 Brain areas involved in attention control.
1.5 Bottom-up attention selection model. Adapted from (Treisman, 1986).
2.1 Recognition under occlusion through features extracted using a sensory mapping architecture.
2.2 A comparison of the neural cylinder (a) and the staggered receptive fields (b).
2.3 Localized receptive field with size 4.
2.4 Computational order of eigen-groups.
2.5 The architecture of SHM.
2.6 Sequence of residual images.
2.7 The filter connections that cover the outliers of the image have zero input.
2.8 Several samples of the 5,000 natural images collected.
2.9 The filters developed in the first layer by the sharing method.
2.10 The reconstructed images from different levels.
2.11 The reconstruction error of each layer.
2.12 The schematic illustration of operation of the occlusion experiment.
2.13 Merged image sample from the two cameras.
2.14 Six attention positions on the image.
3.1 The cocktail party problem.
3.2 The density functions of super-Gaussian, Gaussian, and sub-Gaussian distributions.
3.3 Two Laplace distributions.
3.4 Two Laplace distributions mixed by mixing matrix A.
3.5 Example of two uniform distributions.
3.6 After whitening, the joint distribution is sphered.
3.7 Joint distribution of two unit-variance Gaussian random variables.
3.8 Gaussian signal vs. sparse signal.
3.9 Two neurons with horizontal inhibitions.
3.10 Illustration of lobe components.
3.11 The sample space of a zero-mean white random vector x in 2-D space, partitioned into lobe regions.
3.12 The amnesic average coefficient.
3.13 The error coefficients C(n) for amnesic means with different amnesic functions p(n).
3.14 The proposed method on a 2-D Laplace distribution data set.
3.15 Comparison of ICA results among (Type-3) NPCA, (Type-2.5) ExtBS1 and (Type-4) CCILCA for super-Gaussian sources.
3.16 Comparison among (Type-2) Extended Infomax, (Type-1) FastICA and three variants of the proposed Type-4 CCILCA algorithms.
3.17 Cocktail party problem (source, mixed, and recovered sound signals).
3.18 Lobe components from natural images (not factorizable).
3.19 Results of Algorithm 5: topographically ordered basis functions in ICA of natural images.
3.20 Results of Algorithm 5: number of times each independent component is the "winner."
3.21 Data set formed by the joint distribution of two unit-variance Laplace distributions.
3.22 The lp norm criterion functions under different p.
4.1 Geometric explanation of duality.
4.2 Kernel regression results.
4.3 Objective function vs. number of iterations.
4.4 The effect of the control parameter t on mean square error and sparseness.
4.5 The decision boundary on the artificial XOR data set.
5.1 The flow diagram of a developmental vision system.

LIST OF TABLES

1.1 Differences in the sensitivity of M and P cells to stimulus features. (Adapted from (Kandel et al., 2000))
2.1 Summary of the occlusion experiment.
2.2 Summary of the navigation image simulation.
4.1 The result of the 5-fold cross-validation.
5.1 Contributions of the dissertation.

Chapter 1

Introduction

1.1 Motivation

The performance of current computer vision systems lags far behind that of biological visual processing systems. Computer vision systems struggle with many problems that even primitive animal vision systems solve easily. One possible reason is that the brain is built on a much larger scale in terms of computational units. For example, the human brain contains about 10^15 synapses, compared to fewer than 10^8 transistors, in addition to the more complex function of the neural cells (Kandel et al., 2000). It is almost impossible to design and manufacture a computer with 10^15 transistors and make it work.
How does nature manage this problem? It is believed that the genome plays a major role in forming the proteins, and thus the tissues and organs. Is the form of the synapses completely determined by human genes? The answer is negative. The genome has only approximately 10^5 genes. Even if we assume that most of the genes are dedicated to coding the synapses, the required compression ratio would be as high as 1 to 10^10. So far, we do not observe that much redundancy in the neural synapses. Examination of the neural system reveals that the environment has a large effect on the form of the neural system. As early as 1970, Blakemore and Cooper (Blakemore and Cooper, 1970) reported that kittens raised after birth in a controlled environment, where only edges of a certain orientation were present, did not develop visual cortical cells sensitive to the edge orientations they had never observed. Recent studies by Sur and coworkers (Sur et al., 1999) have shown that the input signal can fundamentally determine the filters generated during development. They rewired the visual sensory axons of ferrets to the auditory cortex and tested the sensory responses there. The auditory cortex shows orientation-sensitive cells if it receives visual signals early in life, which cannot be found in a natural auditory cortex. It is now widely accepted that the neural system is largely formed by the signals it conveys. In engineering terms, this is a data-driven self-organizing system.

Another possible reason that natural vision systems outperform their computer counterparts lies in the selective visual attention mechanism used by the brain. Attention implements an information-processing control gate, allowing only a small part of the incoming sensory input to reach short-term memory and visual awareness (Desimone and Duncan, 1995). Instead of trying to fully process the entire sensory input in parallel, nature has devised a serial strategy to achieve near-optimal, real-time performance despite limited computational resources. Selective visual attention breaks the cluttered visual field down into a series of computationally light, localized visual analyses.

In this chapter, I will focus on these two problems and review the relevant works. Our goal is to find the relationship between neural development and attention selection and to explore the possibility of learning to pay attention.

This chapter is organized as follows: Section 1.2 introduces some background knowledge of the biological visual system. Section 1.3 reviews several models of early visual processing with an emphasis on modeling the topographic maps of the cortex. Models of visual attention, which are the major concern of this chapter, are reviewed in Section 1.4. In the last section we discuss how visual system development relates to the learning of visual attention.

1.2 Background of biological visual system

The visual system has the most complex neural circuitry of all the sensory systems. The auditory nerve contains 30,000 fibers, but the optic nerve contains over one million! The task of constructing such a system is tremendously complex. Early stages of this process, in which neurons find their proper locations and send axons to roughly the right region of the correct target structure, are known to be independent of neural activity and thus could be wired by genes. Later stages of this process, however, depend on neural electrical activity to form the precise synaptic structure.

What is known about the activity-dependent process of synapse formation is generally consistent with a hypothetical rule proposed by Hebb (Hebb, 1949), also known as the "Hebbian learning rule": synapses are strengthened if there is a temporal correlation between their presynaptic and postsynaptic patterns of activity, and weakened otherwise. In a concise phrase, "neurons that fire together will be wired together."
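To make the rule concrete, here is a minimal numerical sketch (my own illustration, not a model taken from the literature reviewed in this chapter): a single linear unit whose weights grow with the correlation between presynaptic input and postsynaptic response, with an Oja-style decay term assumed in order to keep the weights bounded.

import numpy as np

def hebbian_step(w, x, eta=0.01):
    y = float(np.dot(w, x))            # postsynaptic response to presynaptic activity x
    # Hebbian term y*x strengthens synapses whose input correlates with the response;
    # the subtractive Oja term y*y*w (an added assumption) keeps ||w|| from diverging.
    return w + eta * y * (x - y * w)

rng = np.random.default_rng(0)
w = rng.normal(size=2)
for _ in range(5000):
    x = rng.normal(size=2) * np.array([3.0, 0.5])   # anisotropic "stimulus" stream
    w = hebbian_step(w, x)
print(w / np.linalg.norm(w))           # aligns with the input's dominant correlation direction

Run repeatedly, the weight vector settles along the direction of the strongest input correlations, which is the sense in which correlated activity "wires together" the corresponding synapses.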
In this section, I will give a brief introduction to the biological early visual system and review some of the evidence that patterned-activity-driven synapse modification operates in neural development.

1.2.1 Biological visual system

The biological visual system has been studied in several species, including cat, ferret, and monkey. Although their visual systems do contain some structural and functional differences, the main mechanisms of visual processing and of the development of the circuitry are expected to apply to other species as well.

Fig. 1.1 shows the early visual pathway of the human. Visual processing can be separated into roughly three stages: the retina, the lateral geniculate nucleus (LGN) and the primary visual cortex. Light projected onto the retina excites the photoreceptors and related cells on the rear surface of the eye. The activity of the photoreceptors is encoded by the retinal ganglion cells. These cells are not evenly distributed: the densest section, called the fovea, is around the center of the visual field. The visual fields of the left and the right eye have significant overlap, so that the visual field has both binocular and monocular zones. Light from the binocular zone is projected onto both eyes, while light from the monocular zone appears on only one of the retinas.

Figure 1.1: Mammalian early visual pathway. (Adapted from (Kandel et al., 2000).)

Visual information travels in separate pathways for each half of the visual field. The signal in the left hemifield of both retinas goes to the left subcortical area; similarly, the right hemifield only has connections to the right subcortical area. The optic nerves from each eye project to the optic chiasm, where fibers from each eye destined for one or the other side of the brain are sorted out and bundled into the bilateral optic tracts, which project to three major subcortical areas: the pretectum, the superior colliculus and the lateral geniculate nucleus (LGN). The former two subcortical areas are known to be in charge of saccadic eye movements and pupillary reflexes, respectively, which is outside the scope of this chapter. Ninety percent of the retinal axons terminate in the lateral geniculate nucleus of the thalamus. The LGN is the main terminus for input to the visual cortex (Kandel et al., 2000). We will focus on the projection to the LGN. Visual information from the two eyes is kept segregated in different neural layers of the LGN. From the LGN, the signal continues to the primary visual cortex (V1; also called striate cortex or area 17). V1 is the first cortical area that processes visual information. The output from V1 goes on to many higher areas, including areas underlying high-level cognitive processes such as face recognition and top-down attention control.
1.2.2 Early Visual Pathway

The retina, the LGN and the primary visual cortex constitute the early visual pathway. Although the fundamental function of the early visual pathway is still somewhat unclear, many facts already revealed help us to recognize its functionality.

First, this pathway is the major pathway conducting visual information. Without it, visual perception is lost, although some very limited stimulus detection and movement toward objects in the visual field is still possible.

Second, this pathway does not project the original visual field directly into the cerebral cortex. At the photoreceptor level, the representation of the visual field is much like an image, but significant processing occurs in the subsequent subcortical and early cortical stages, which distorts and recodes the visual information (a review can be found in (Kandel et al., 2000)).

Third, the neural cells in this pathway are organized in a hierarchy and have a variety of receptive field sizes. (The receptive field of a cell, by Hubel and Wiesel's definition (Hubel and Wiesel, 1962), is "the region of retina (or visual field) over which one can influence the firing of the cell.") Neurons in higher layers have larger receptive fields than earlier neurons. Within the same layer, different neurons cover different regions of the visual field, and a neuron's firing pattern is a representation of its receptive field. Thus neurons in different layers and at different locations form multi-scale, multi-position representations of the visual field.

Fourth, there is not just one single pathway in early visual processing. As mentioned before, the optic nerves from the left and the right eyes are segregated in this pathway. Furthermore, psychophysical experiments show that the early visual pathway can also be divided into two parallel pathways: the magnocellular pathway (M channel) and the parvocellular pathway (P channel). These pathways start from the magnocellular ganglion cells (M cells) and the parvocellular ganglion cells (P cells), respectively, and remain separated until they are fused in higher cortical areas. M cells and P cells have been shown to be sensitive to different stimulus modalities: M cells are sensitive to luminance contrast and temporal frequency signals, whereas P cells prefer color contrast and spatial frequency changes. In a later section we will discuss why this is believed to be the main biological evidence for the bottom-up attention control mechanism.

Fifth, the early visual pathway is largely formed by visual experience. This is evident from Blakemore and Cooper's newborn kitten experiment (Blakemore and Cooper, 1970) and Sur's optic nerve rewiring experiment (Sur et al., 1999).

The early visual pathway serves an important role in visual information transmission and processing. To study the function of this pathway, one needs to address the processing of individual cells.

1.2.3 Simple and Complex Cells

As we move from the retina to the primary visual cortex, a fundamental change takes place in the nature of the representation of visual information. In the retina, each ganglion cell effectively represents a small region of the image. The receptive field often receives input from only one cone in the fovea, which effectively samples the image. In frequency terms, each cell has a broad bandwidth that is essentially equal among cells of the same type. In the cortex, the situation is strikingly different.
Receptive fields in the cortex are narrow-band and oriented, and may differ from one another in their orientation, size and peak frequency. This means that we have gone from a single representation by relatively homogeneous cells to multiple representations by distinct populations of cells differing in orientation and spatial frequency.

Hubel and Wiesel tried to classify the neural cells in the visual cortex by measuring the response properties of visual cortical cells when the subject was exposed to a black screen with only one light spot or light bar visible. The light stimulus was placed in different positions and the response of the cell was measured to determine its receptive field properties. They found two distinct cell types. The first type of cell has explicit excitatory and inhibitory regions in its receptive field. Illumination of part or all of an excitatory region increased the firing of the cell, whereas a light shone in the inhibitory region suppressed the firing. A large spot confined to either area produced a greater change in firing rate than a small spot, indicating summation within either region. This group of cells is called simple cells. While the receptive fields of simple cells are similar to those of retinal and LGN cells in possessing explicit excitatory and inhibitory regions, they differ tremendously in the spatial arrangement of these regions. Fig. 1.2 shows the difference between retinal and LGN cells and simple cortical cells. The orientation-sensitive arrangement makes simple cells fire strongly to specifically oriented stimuli, which motivated the hypothesis, proposed by Barlow (Barlow, 1972) among others, that simple cells act as "edge detectors."

Figure 1.2: The receptive fields in the primary visual cortex are different and more varied than those in the retina and LGN. Areas marked with "+" denote excitatory regions; correspondingly, "-" denotes inhibitory regions. (A) The concentric receptive field of cells found in the retina and LGN. (B) Receptive field of simple cells in the primary visual cortex. (Adapted from (Hubel and Wiesel, 1962).)

The firing patterns of the other group of neurons are far more intricate and elaborate. The receptive fields of these cells are termed "complex." In the experiments, small spots fail to map out excitatory and inhibitory regions; the "on" and "off" firing patterns are triggered by more complex stimuli, such as slits, edges, and dark bars.

The classification into simple and complex cells is merely the result of external measurement; the two types cannot be separated anatomically. Recall that even the auditory cortex can develop the properties of the visual cortex, so the difference comes to lie in the arrangement of the cells in the preceding layer and in the input signal. Furthermore, studies of neighborhoods of neurons in the visual cortex confirm a topographic distribution of cell properties (Adrian, 1941) (Marshall and Talbot, 1942). That is, whether simple or complex, cells within a close range tend to fire uniformly to the same stimuli, thus forming a visible topographic map in the brain.
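The standard quantitative idealization of such an oriented, excitatory/inhibitory simple-cell receptive field is a Gabor function (an oriented sinusoid under a Gaussian envelope). The sketch below is only an illustration of that idealization, not a model used in this dissertation; the filter size, wavelength and envelope width are arbitrary choices of mine.

import numpy as np

def gabor_filter(size=15, wavelength=6.0, theta=0.0, sigma=3.0):
    # Oriented, simple-cell-like receptive field: positive lobes play the role of the
    # excitatory ("+") subregions, negative lobes the inhibitory ("-") subregions.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)        # coordinate across the preferred orientation
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) * np.cos(2.0 * np.pi * xr / wavelength)
    return g - g.mean()                               # zero mean: balanced excitation and inhibition

# Response of the model "cell" to a vertical grating patch of matching wavelength:
# the response is largest when the filter orientation matches the stimulus.
half = 7
y, x = np.mgrid[-half:half + 1, -half:half + 1]
grating = np.cos(2.0 * np.pi * x / 6.0)
for theta in (0.0, np.pi / 4, np.pi / 2):
    print(round(theta, 2), round(float(np.sum(gabor_filter(theta=theta) * grating)), 2))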
1.2.4 Topographic Maps in the Brain

In all mammals, much of the neocortex consists of orderly representations, or maps, of neurons that are typically topographic at global levels while being iso-property at local levels. Two prominent topographic maps are ocular dominance columns and orientation preference maps.

The ocular dominance columns represent an orderly arrangement of cells that receive inputs only from the left or the right eye and are important for binocular interaction. By injecting radio-labelled amino acids, the ocular dominance columns can be visualized (as stripe-like patterns). The neural pathways from the left and the right eye are shown to be largely segregated through the LGN and the primary visual pathway.

The orientation preference map is the phenomenon that neurons in the primary visual cortex fire selectively to a specific direction of stimuli. Neurons in a nearby region have similar, but not identical, preferences. The preferences repeat continuously at regular intervals (approximately 1-2 mm) in every direction. Fig. 1.3 shows the orientation preference map of the monkey's striate cortex. In some locations, the map forms a pinwheel-like pattern.

Figure 1.3: The orientation preference map of the macaque monkey's striate cortex, measured by optical imaging technology. Each color represents one preferred direction. From the image we can see that neighboring neurons tend to prefer similar directions, forming the unitary color regions called iso-orientation blobs.

Although the topographic map is currently a prominent research topic in neuroscience, its functional role in perception is still controversial (Kass, 1997) (Weinberg, 1997). Weinberg argued that topography could be a mere epiphenomenon of minimal intrinsic significance, but he also admitted that it may be useful for examining possible explanations of cortical function. At the least, this easy-to-observe phenomenon can be used as a guideline for building computational models of the early visual pathway, especially when the goal function of this pathway is still unknown. In fact, many neural computation models have attempted to explain the topographic maps with some success. In the following section we will review some computational models of the topographic maps and simple cells of the early visual pathway.

1.3 Early visual processing models

Visual system development is a complex process. It is difficult to integrate the incoherent experimental results into a profound understanding of how this system is constructed. Computational neural models provide useful ways to recognize functions of the neural system. These can help to validate current theory, and perhaps predict some undiscovered facts.

There has been a long history of researchers using quantitative models to explain some observed topographic maps in the brain, such as ocular dominance stripes and orientation preference maps in the early visual pathway. Although limited by the computational power available, some early models did show that orientation selection could be developed. As early as 1973, von der Malsburg showed that orientation-selective neurons could be developed from oriented image patterns by an unsupervised learning rule (von der Malsburg, 1973). As a pioneering study in this field, von der Malsburg's model had many features that were adopted by many later researchers, such as using a two-dimensional array to represent the subcortical and cortical layers, a nearby-excitatory and faraway-inhibitory mechanism, and an incremental Hebbian learning rule to update the connections (a minimal sketch of this combination of ingredients is given below). von der Malsburg's result is not surprising, because the input patterns were oriented.
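The sketch below shows that combination only in schematic form. It is my own toy illustration, not von der Malsburg's actual model: the activity bump around the best-matching unit is a Kohonen-style shortcut for the settled state of the nearby-excitation/faraway-inhibition dynamics, and the sheet size, bump width and learning rate are assumptions.

import numpy as np

rng = np.random.default_rng(1)
SHEET, DIM, ETA = 8, 16, 0.05                      # 8x8 cortical sheet, 16-D input patterns
W = rng.random((SHEET * SHEET, DIM))
W /= np.linalg.norm(W, axis=1, keepdims=True)      # normalized afferent weights
pos = np.array([(i, j) for i in range(SHEET) for j in range(SHEET)], float)

for _ in range(3000):
    phase = rng.uniform(0.0, np.pi)                # toy "oriented" input: a phase-shifted sinusoid
    x = np.cos(np.linspace(0.0, 2.0 * np.pi, DIM, endpoint=False) + phase)
    resp = W @ x                                   # afferent response of every unit
    winner = int(np.argmax(resp))
    # Activity bump around the winner: stand-in for lateral excitation near the winner
    # and inhibition far from it.
    d2 = ((pos - pos[winner]) ** 2).sum(axis=1)
    act = np.exp(-d2 / 2.0)
    W += ETA * act[:, None] * x[None, :]           # incremental Hebbian update (post * pre)
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # renormalization keeps the weights bounded

# Nearby units end up preferring similar phases, a toy analogue of an
# orientation preference map over the sheet.
tests = np.linspace(0.0, np.pi, 12, endpoint=False)
pref = np.array([int(np.argmax([w @ np.cos(np.linspace(0, 2 * np.pi, DIM, endpoint=False) + t)
                                for t in tests])) for w in W])
print(pref.reshape(SHEET, SHEET))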
Linsker, however, demonstrated that even with random-noise input, without any pre-applied orientation signal, his model could develop orientation preference maps (Linsker, 1988). In his three consecutive papers (Linsker, 1986), Linsker showed how the "on-off" center-surround neuron, the orientation-selective neuron, and orientation preference maps can be developed from a simple set of rules (including Hebbian-type modification). Linsker's model gets support from the biological finding that newborn animals possess orientation preference maps even before their eyes open (Crair et al., 1998). This map, driven by internally generated signals, is modeled by training with random noise.

Linsker's model motivated several other models that integrate a Hebbian learning rule with a localized lateral inhibition rule. In Miikkulainen et al.'s RF-LISSOM model (Sirosh and Miikkulainen, 1997) (short for Receptive-Field Laterally Interconnected Synergetically Self-Organizing Map), two layers of neurons model the LGN layer and the visual cortex, and the cortical layer applies a modifiable lateral excitation-inhibition interconnection. Both afferent and lateral connections are adapted by the same Hebbian mechanism in a purely local and unsupervised learning process. This model can develop ocular dominance columns and orientation preference maps. Other similar models that use lateral inhibition and Hebbian learning can be found in (Burger and Lang, 1999) and (Bartsch and van Hemmen, 2001).

Miller (Miller, 1994) used a different approach to make the simulation more practical. He showed that if the visual system is essentially linear, then the sequence of input patterns can be replaced with a simple function representing its long-term correlations. His model is essentially Hebbian-learning based. Instead of incrementally updating the connections with each input image or pattern, his model integrates the whole input sequence into the Hebbian learning equation and arrives at an equivalent form that only requires the correlation function of the image set. This modification significantly improved the convergence speed of the computation. Compared to the thousands of images used to train other models, this correlation-based model requires only tens of iterations to reach a stable result.

In this section we have reviewed several early visual pathway models that explain the development of orientation preference maps. One prominent question should be asked: why is the orientation preference map important? We believe the orientation map, as a low-level feature extraction mechanism, is very important for the attention selection process. As discussed in the next section, the bottom-up attention selection mechanism fuses several feature maps to obtain the position of the next attention shift. The orientation map is one of the basic feature maps that contribute to attention control. Most important of all, the orientation map is developed from visual experience, which makes attention from learning possible.

1.4 Models for visual attention

The human brain does not take in all the sensory information at one time. Only a subset of the sensory input can be processed. Sometimes this selection happens between different sensory modalities. When focusing on reading a book, one can be very insensitive to the surrounding sounds and voices, and when trying to listen to a sound very carefully, one tends to close one's eyes to help focus on the auditory input. In some sensory modes, selection within the modality is also possible. In the visual modality, it is done by suppressing information outside a spatially circumscribed region of the visual field.
This is called the "focus of attention." In mammals, this attention selection mechanism is implemented either by rapid, saccadic eye movements (also referred to as "overt" shifts of attention) or by so-called "covert" shifts of visual attention (internal shifts, i.e., without moving the eyes).

In the first case, overt attention can be observed externally, as it orients the sensor, or the highest-resolution part of it, toward the attended stimuli. This allows the selected information to enter short-term memory and to remain there long enough to reach conscious and cognitive levels of representation. Most mammalian eye systems have anatomical specializations to accommodate this "spotlight" mechanism as well. In the retinas of many species, the photoreceptors are not distributed evenly: they contain a specialized region around the center of the visual field with a much higher cell density than in the peripheral region. This is called the "fovea" in primates, the "central area" in cats and the "visual streak" in rabbits.

In contrast, covert attention is not directly observable but takes place by filtering or by assigning resources. Covert attention selection usually precedes the overt attention mechanism (Ditterich et al., 2000). In a task of fast visual search for an object, for example, one looks at the scene without a specific fixation, and covert attention filters out irrelevant distractions; after a few candidate positions are selected, saccadic eye movements take place for further visual analysis. Although overt and covert visual attention are modelled in similar ways, they have different costs and effects. Moving the sensors takes longer than internal attention selection. Furthermore, overt attention selection requires shutting down the sensory input during the saccade. But in some real-time active-vision systems, saccadic sensor movement is necessary to prevent the attended object from moving out of the visual field and to take advantage of the highest sensory resolution.

There are at least two advantages that a visual attention mechanism can bring about. First, the amount of data to be processed is reduced, resulting in a lower demand on computational resources; this enables further information processing. Second, distracting information is suppressed in this procedure, so that relevant information can receive more consideration. For example, in a typical object classification system, it is very difficult to deal with samples that come with a large variety of backgrounds. With the help of visual attention selection, the background information is suppressed, so that only the foreground object influences the performance of the system. In this respect, attention is a central part of the system, because it selects the information on which the system's activities are based.

Selective visual attention has been researched in psychology, physiology, and
In the following section we first give some background knowledge Of the attention mechanism in the brain, and then review several computational models of selective vi- sual attention. It is almost impossible to cover all the related work in this area. Some good reviews can also be found in this literature (Itti and Koch, 2001) (Desimone and Duncan, 1995). 1.4.1 Attention mechanism in the brain The control of selective visual attention involves several brain regions, including the early visual pathway and the prefrontal cortex. Cooperation between two streams of control signal is believed necessary to achieve attention selection. The where stream will find where to attend to, primarily controlled by the dorsal visual processing pathway, which comprises cortical areas in the posterior parietal cortex. The what stream controlled by the ventral visual processing pathway, involving cortical areas in the inferior temporal cortex, is primarily in charge of localized object recognition. The 17 result of the two streams could affect each other. For example the recognition result from the what stream can bias the next attention Shift, make it to some predicted focal attention area where it may contain next interesting Object, and vice versa. Fig. 1.4. shows the brain areas that participate in the control of visual attention. Visual information I’ \ ’ \ f \ I \ I \ J \ I \ ’ \ I r .. I P pathway‘«\\_ (: :j ‘,,> M pathway ~ ‘~ ‘~ ~\ ~ ‘~ ~. LGN Ventral Filthy [Visual Corteg “real pathway Inferotemporal cortex Posterior parietal cortex Prefrontal \C ,/ cortex /._\ Memory and f Motor and other system cognition Superior colliculus Saccadic eye movement r—w l L.) Figure 1.4: Brain areas that involve in attention control. This diagram only shows the control flow. The spatial layout of the pathways can be found in (Kandel et al., 2000). (Adapted from (Itti and Koch, 2001) and (Kandel et al., 2000)) Visual information enters the retina and is processed by two kinds Of ganglion cells: the magnocellular ganglion cells (M cells) and the parvocellular ganglion cells 18 Stimulus feature M cells P cells Color contrast N 0 Yes Luminance contrast Higher Lower Spatial frequency Lower Higher temporal frequency Higher Lower Table 1.1: Differences in the sensitivity Of M and P cells to stimulus features. (Adapted from (Kandel et al., 2000)) (P cells). The information that is conveyed by M and P cells is kept segregated in the LGN and projected to separate layers of the primary visual cortex. Thus, it has led to the view that the separate sequence Of retinal ganglion, LGN, and visual cortical cells can be regarded as two parallel pathways, referred to as the M and P pathways. As indicated in Table 1.1, there are striking differences between cells in the M and P pathways. By selectively removing one or the other pathway, experiments Show that the M and P pathway make different contributions to perception. The P pathway is critical for color vision and for vision that requires high spatial and low temporal resolution (e.g. static color images). The M pathway contributes most to vision requiring low spatial and high temporal resolution (e.g. motions). Based on anatomical and functional evidence, Ungerleider and Mishkin argued that visual processing centers in the cerebral cortex are also believed to be organized in two pathways: the dorsal processing pathway to the posterior parietal cortex and the ventral pathway to the inferotemporal cortex (Ungerleider and Mishkin, 1980). 
Anatomical evidence also shows that the dorsal pathway is the cortical continuation of the M pathway of early visual processing, and that the ventral pathway corresponds to the P pathway accordingly. Thus it is natural and reasonable to assume that the dorsal and ventral pathways implement some high-level perceptual combination of the information that the M and P pathways convey.

The dorsal pathway is believed to process a subset of high-level percepts, including motion and depth. These percepts are very likely derived from M pathway information (motion and depth perception from the low-spatial, high-temporal-frequency features). The neurons in this system are relatively insensitive to color and poor at analyzing stationary objects. Therefore, the dorsal pathway is more concerned with where objects are than with what they are.

The ventral pathway deals with color and form perception. Color and form perception are derived from P pathway features (color contrast and high-spatial, low-temporal-frequency features), because a great deal of information about shape and form is derived from colors, edges, and borders. This system is capable of high resolution, which is important for seeing stationary objects in detail. Thus the ventral pathway is more concerned with what is seen.

The dorsal pathway (where stream) and the ventral pathway (what stream) end in different regions of the prefrontal cortex that specialize, respectively, in visual-spatial working memory and object-recognition working memory.

1.4.2 Bottom-up attention selection

How is information about color, form, depth, and motion, which is conveyed by different neural pathways, combined and translated into a comprehensive perception? When we see a red moving car, we do not see separate features; we combine the different properties, color (red), shape (car), and motion (linear movement), into one percept. It is impossible to develop specialized neural cells to detect all such combinations, so there must be a fusion of these pathways somewhere at a higher level, which is also called a binding mechanism. From a computational perspective, this fusion of low-level features resulting in a high-level attention selection is also referred to as the bottom-up attention selection mechanism.

One of the earliest theoretical models that addresses this problem is in the psychophysical literature. Treisman et al. proposed that different features are derived from different feature maps in different brain regions (Treisman and Gelade, 1980). To solve the binding problem, the brain dedicates some processing, as a master map, to combining the codes of the feature maps. Fig. 1.5 shows the outline of Treisman's model. Later, more detailed and neurally plausible models of visual attention selection were proposed (Koch and Ullman, 1985) (Eriksen and Yeh, 1985) (Olshausen et al., 1993), although these are still qualitative models.

The most prominent model of visual attention was proposed by Koch and Ullman in 1985 (Koch and Ullman, 1985). The model includes the following elements:

1. It has a process called early representation, in which a number of elementary features, such as color, orientation, direction of movement, disparity, etc., are represented in parallel in different topographical maps. This corresponds to the feature maps of Treisman's model.
2. It has a selective mapping from these representations into a central representation which at any instant contains only the properties of a single visual location, the selected location. This corresponds to the master map.

3. It has certain rules that determine which location is mapped to the central representation. This is implemented by a Winner-Take-All network.

4. It has a mechanism called inhibition of return, which suppresses the currently attended location and enables shifting to the next salient location.

Figure 1.5: Bottom-up attention selection model. The elementary properties (such as color, orientation, size, and depth) of the visual field are extracted by separate parallel pathways, each of which generates a feature map. Selected features are fused into a master map, which is a representation of those features that have the most saliency. Selective attention happens after the features have been associated in a small region of the master map. Adapted from (Treisman, 1986).

Itti, Koch et al. proposed a purely bottom-up attention control model (Itti et al., 1998) (Itti et al., 2001). They integrated color, intensity, and orientation as basic features. A Gaussian pyramid was created to extract intensity information at six scales. Similarly, Gaussian pyramids were constructed to extract features from the red-green opponency channel and the blue-yellow opponency channel. Four orientation-selective pyramids were also created using Gabor filtering at 0, 45, 90, and 135 degrees at six different scales. A total of 42 feature maps was thus created: 6 for intensity, 12 for color, and 24 for orientation. Then they fused the feature maps into a so-called saliency map. The task of the saliency map is to compute a scalar quantity representing the salience at every position in the visual field. In their combination scheme, they promoted those feature maps with a few strong peaks and suppressed those with many comparable peaks or with evenly distributed activity. This combination strategy outperforms regular normalization and also keeps biological plausibility (Sillito et al., 1995). After normalization, the feature maps for intensity, color, and orientation are summed across scales into three separate conspicuity maps, one for each modality. Searching for the most salient position is done by a winner-take-all (WTA) network with global lateral inhibition (Koch and Ullman, 1985). It has long been known that lateral inhibition in a neural network can lead to a WTA competition, so that only a single neuron is active at the steady state. To avoid reselecting the first attention position repeatedly, they used a mechanism to inhibit the recently selected object and shift to another salient position. In psychophysics, this process is called inhibition of return.
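The following sketch condenses the flavor of this pipeline into a few lines: feature maps are normalized so that maps with a few strong peaks are promoted, summed into a saliency map, and attention positions are then read out by a winner-take-all choice with inhibition of return. It is a minimal illustration of the general scheme, not the Itti-Koch implementation; the map sizes, the crude peak-count normalization, and the inhibition radius are assumptions of mine.

import numpy as np

def normalize_map(m, thresh=0.5):
    # Promote maps with a few strong peaks, suppress maps with many comparable peaks:
    # a crude stand-in for the peak-promoting normalization described above.
    m = (m - m.min()) / (np.ptp(m) + 1e-9)
    n_peaks = int((m > thresh).sum())
    return m / max(n_peaks, 1)

def saliency(feature_maps):
    # Fuse normalized feature maps (e.g. intensity, color, orientation) into one saliency map.
    return sum(normalize_map(m) for m in feature_maps)

def attend(sal, n_shifts=3, inhibit_radius=5):
    """Winner-take-all readout with inhibition of return."""
    sal = sal.copy()
    yy, xx = np.mgrid[0:sal.shape[0], 0:sal.shape[1]]
    fixations = []
    for _ in range(n_shifts):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)   # WTA: most salient location
        fixations.append((int(y), int(x)))
        sal[(yy - y) ** 2 + (xx - x) ** 2 <= inhibit_radius ** 2] = 0.0  # inhibition of return
    return fixations

# Toy usage with random "feature maps" standing in for intensity/color/orientation channels.
rng = np.random.default_rng(0)
maps = [rng.random((32, 32)) for _ in range(3)]
print(attend(saliency(maps)))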
Backer et al. applied a similar strategy in an active-vision system called NAVIS (Neural Active VISion), which emphasizes visual attention selection in a dynamic visual scene (Backer et al., 2001). Instead of directly using low-level features like orientation and intensity, they add further processing to find middle-level features, such as symmetry and eccentricity, with which to build the feature maps. Fusion of the conspicuity maps is done by so-called Dynamic Neural Fields, proposed by Amari (Amari, 1977), which is, in fact, still a lateral inhibition network. Due to the nature of a dynamic scene, the inhibition-of-return process is somewhat more difficult than in visual search over a stationary scene. To name one difficulty, the object that was first selected and then inhibited could be moving, which can corrupt the selection of the next attention shift. In the case of a gaze shift, the positions in the saliency map before and after the shift should be put into correspondence. In Backer's model, these problems are addressed by specific approaches.

Many of the bottom-up attention selection models share similar ideas. The major differences lie in the variety of features, the strategies for fusing the feature maps, and the rules that find the most salient location. Although the bottom-up attention model may explain the first few hundred milliseconds after a new scene is presented, a more complete model of attentional control must obviously include top-down, volitional biasing of the shifts as well.

1.4.3 Top-down attention selection

Top-down attention selection involves much more complex perceptual procedures. One obvious phenomenon is that we are able to move our eyes voluntarily toward any position, no matter how inconspicuous it is. Top-down selection is believed to involve object recognition, understanding of the scene, and even some planning. The precise mechanism by which a voluntary shift is elicited remains elusive, although several studies have examined the possible brain regions that are involved (Hopfinger et al., 2000).

To our knowledge, no computational model can explicitly solve this problem right now, although some models, combined with bottom-up attention selection, have tried to accommodate this mechanism.

Tsotsos et al. (Tsotsos et al., 1995) implemented attention selection using the combination of a feed-forward bottom-up feature extraction structure and a top-down selective tuning (of these features) mechanism. First, the representation in the network is calculated, based on feed-forward activation from low levels to high levels. Then the target of attention is selected. The location is then propagated back through the feature extraction hierarchy: in the top-down process, the WTA competition is restricted to the units connected with the winner of the previous layer. In the lowest layer, this procedure results in an accumulation of units corresponding to the most conspicuous region in the image.

Although Tsotsos's model includes some top-down control features, it is not usually considered a true top-down attention selection model. The model's top-down selective tuning only helps select the conspicuous location; it does not know what it is attending to. A successful top-down attention system should accommodate object recognition capability in order to implement the "what" stream of the brain's attention control pathway.

One of the earliest models that combines object recognition and attention is MORSEL (Pashler, 1996), in which attention is shown to help recognition. This model is designed to recognize words. Without attention, neighboring words could conflict and yield ambiguity and confusion in recognition. The addition of top-down attention control allows the system to focus on one word at a time.
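Before turning to a model that couples recognition and attention more tightly, the computational idea common to such top-down control can be sketched very simply: bottom-up feature maps are reweighted by target-specific gains before fusion, so that locations sharing the target's features become more salient. The snippet below is my own generic illustration of that idea, not MORSEL, not Tsotsos's selective tuning, and not the model described next; the channels, gains, and planted target location are all assumptions.

import numpy as np

def top_down_saliency(feature_maps, target_weights):
    # Bias each bottom-up feature map by a target-specific gain before fusion.
    # Searching for a "red, vertical" target would put high gains on the red-opponency
    # and vertical-orientation maps and low gains elsewhere.
    return sum(w * m for w, m in zip(target_weights, feature_maps))

# Toy usage: three feature channels; the target is assumed to light up channel 0.
rng = np.random.default_rng(1)
maps = [rng.random((32, 32)) for _ in range(3)]
maps[0][10, 20] += 2.0                      # the "target" location in channel 0
gains = [1.0, 0.2, 0.2]                     # volitional bias toward channel-0 features
sal = top_down_saliency(maps, gains)
y, x = np.unravel_index(np.argmax(sal), sal.shape)
print((int(y), int(x)))                     # the biased fusion points attention at (10, 20)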
The second component contains a hierarchical knowledge tree, which represents object classes by defining several critical points and corresponding eye movement commands that maximize the information gain. This permits the model to discriminate between candidate object classes.

1.4.4 Summary

In this section, we have reviewed the brain mechanisms of attention control and two basic attention selection strategies: bottom-up and top-down. Although bottom-up attention selection is more mature and has been implemented in several successful computational models, a successful focal attention selection model relies on a good combination of the two strategies.

1.5 Discussion

In this chapter, we have reviewed several early visual pathway models and attention control models. These two problems appear, at first approximation, different and unrelated. After reviewing the related work, however, we find it is possible to put them in a single theoretical framework.

The topographic map in the early visual pathway may serve as a feature or saliency map in the bottom-up attention control process. The lateral inhibition mechanism in models of the orientation preference map is also adopted by the saliency map fusion of the attention control models. This suggests a possible uniform solution that addresses attention through learning. The following chapters will focus on models of feature maps and attention selection. Two learning mechanisms and a general vision architecture will be presented and discussed.

Chapter 2

Staggered Hierarchical Mapping for Attentive Perception: A Developmental Model

In this chapter we propose a general architecture that can accommodate attention selection and the development of a vision system. The work reported here is motivated by the structure of the early visual pathway. We use several layers of staggered receptive fields to model the pattern of positioning of the processing cells of the early visual pathway. From sequentially arriving video frames, the proposed algorithm develops a hierarchy of filter banks, whose outputs are uncorrelated within each layer, with receptive fields whose scale increases from low to high layers. We also show how this general architecture can be applied to occluded face recognition, which demonstrates a case of attention selection.

2.1 Introduction

There is a rich literature on computer vision systems. However, an implementable computer vision model for general-purpose, attention-guided recognition in unknown environments is still elusive. The appearance-based approach, e.g., eigenfaces (Kirby and Sirovich, 1990; Turk and Pentland, 1991), has opened the possibility of dealing with unknown, uncontrolled, and complex real-world scenes because of its capability of deriving features from any vector distribution. The appearance-based approach can learn any type of invariance exhibited in the training samples, but it requires pre-registered objects because it lacks an attention selection capability. Current appearance-based techniques require that the object to be recognized be registered at a predetermined position and size. This is provided either manually by a human or by a separate object detector. However, in many scenarios this is not applicable.

Along the line of appearance-based approaches, the developmental vision architecture proposed here is motivated by the developmental process of biological vision systems.
The idea of developmental vision is consistent with the idea of purposive active vision (Aloimonos, 1992), because the developmental process is purposive and active, and so is the performance process. If the system is exposed to a wide variety of environments and has learned many tasks in the developmental process, the resulting system is large and complex, and it "acts like" a general-purpose system. If the system is trained in a controlled setting for a single task, the resulting system is small and lean, and it "acts like" a special-purpose system.

An important factor in such a general architecture is the attention mechanism. An object must be recognized at any position and size after it is projected onto the retina, without turning the eyes. Attention selection is essential for suppressing irrelevant information. Otherwise, the image vector that includes both the object of interest and the background is an entirely new vector that has no good match in the learned experience. To learn an object, active attention must be applied not only in the performance session but also in the learning session.

Many studies have proposed hierarchical models for vision and pattern recognition systems, but few have addressed attention selection as a major goal. Fukushima et al. proposed a neural network called the "Neocognitron" (Fukushima, 1980). The Neocognitron is a multilayered network consisting of cascaded connections of many layers of cells. The information of the stimulus pattern given to the input layer is processed step by step through the stages of the network. The synapses between the neurons are updated by a supervised or unsupervised learning method, but the retained features are hand-picked by humans. Weng et al. proposed a dynamic neural network model called the "Cresceptron" (Weng et al., 1997), which can automatically grow a hierarchy of maps directly from image inputs through a learning-with-teacher process. The network grows by creating new neurons, connections, and an architecture that memorizes new image structures and context as they are detected. Although these studies address the global structure of sensory mapping, the efficiency and completeness of the generated feature representation are problematic, since those methods do not explicitly use the statistical properties of the signals in filter generation.

Other studies concentrate on the subproblem of feature derivation from the statistical properties of input images. They include the entropy reduction method (Atick and Redlich, 1993), the sparse coding method (Olshausen and Field, 1997), and the independent component method (Bell and Sejnowski, 1997; Hyvarinen and Hoyer, 2000). These studies applied principal component analysis (PCA) or independent component analysis (ICA) to a single-size, fixed receptive field. To deal with the compounding problems of multiple receptive fields, a general object recognizer must be able to recognize a pattern at different positions and with different sizes. In our proposed architecture, we use a hierarchical structure with staggered receptive fields inside each layer.

Some researchers (Linsker, 1986; Sirosh and Miikkulainen, 1997) used a hierarchical network to simulate the effect of a population of neurons and to explain some biologically observed phenomena, such as orientation preference maps. These studies did not address representation issues, nor how to use the representation for later classification and recognition.
Attention was not a subject of concern in these studies.

We believe that a major purpose of the early visual pathway is to provide a coupled solution to two mutually dependent problems: (a) provide a representation of the input under the current attention selection (suppressing unattended areas) to the later recognition and attention signal generator; (b) execute the attention selection generated by the later recognition and attention signal generator. Together, these two issues accomplish both the bottom-up and the top-down information flow. This raises a series of new issues, such as how receptive fields are organized, the criteria for feature representation, how attention is applied, and how the output of the sensory mapping is used.

Another essential requirement for a sensory mapping developmental algorithm is that the computational method must be incremental, without estimating the covariance matrix directly from online real-time experience. Further, the computation must respond quickly in real time. For sensory mapping, this means that the current sensory mapping must be updated within the time interval of every sensory frame. This indicates the challenges in the design of a real-time developmental program.

This chapter proposes an architecture of hierarchical sensory mapping, called Staggered Hierarchical Mapping (SHM), for attention selection and attention-based recognition, together with its developmental algorithm for active vision systems. Fig. 2.1 illustrates a case of how the sensory mapping is used. The algorithm develops filters for a family of receptive fields. The responses of all filters are available for further analysis by the following classifier/mapping function. The proposed sensory mapping therefore develops a hierarchical representation for a large set of receptive fields, but does not complete vision by itself; the following classifier/mapping function is needed to work in tandem. The autonomous development of cells in the visual sensory mapping is performed by the CCIPCA method (Weng et al., 2003), which addresses the convergence problems of existing incremental PCA methods. However, other filter derivation methods besides PCA can also be used with the proposed architecture, e.g., Independent Component Analysis (Olshausen and Field, 1996; Bell and Sejnowski, 1997) and Slow Component Analysis (Wiskott and Sejnowski, 2002). In this chapter, we use CCIPCA as an example.

An important issue for developing a sensory mapping is the completeness of the representation.

Figure 2.1: Recognition under occlusion through features extracted using a sensory mapping architecture. (a) In the learning session the sensory mapping learns the features of the upper face (indicated by "On") through active attention to the upper face. (b) In the performance session the lower part of the face is occluded, and attention focused on the upper face allows the sensory mapping to deliver features for the non-occluded part of the face (the "On" part of the block), which are fed to the classifier for successful recall. In general, a sensory mapping enables suppression of unattended receptive fields, while providing signals from attended receptive fields for generating attention control signals by the following classifier/mapping function.
Since the environment is unknown at the time the system is designed, the programmer designs the mechanism through which the actual representation (including the network architecture, neural weights, and responses) is developed from sensory experience. If the mechanism does not allow the derivation of a representation that is essential for a particular environment, the resulting mapping is deficient in performance. If the mechanism enables the representation to be complete, in the sense of no loss of information, then the resulting representation is able to handle any environment that an intelligent agent runs into. Completeness for a single receptive field is clear, due to the reconstruction property of PCA. However, the reconstruction issue for several staggered receptive fields is less obvious. We address the completeness of the resulting SHM mapping by reconstructing the original images from the generated representation.

The remainder of this chapter is organized as follows. Section 2.2 explains the structure of the proposed SHM model. Section 2.3 describes the developmental method. In Section 2.4, we analyze the computational complexity of this method. Section 2.5 discusses the completeness of the representation in terms of reconstruction. Section 2.6 describes the proposed model as an attention effector. In Section 2.7 we show the results of some experiments with SHM. Section 2.8 presents a classification system in which SHM is used for attention selection. Section 2.9 applies SHM to a navigation simulation. Section 2.10 provides concluding remarks.

2.2 Staggered Hierarchical Mapping

We assume each line of sensory input can be digitized into a real number, so the input is represented in Euclidean n-space R^n. For visual sensory input, we can assume the input units are organized in a 2-D plane with fixed r rows and c columns, so all visual sensory inputs form an R^{r \times c} space, where R is the set of real numbers. Thus we have

X(t) = [x_{ij}(t)] =
\begin{bmatrix}
x_{11}(t) & x_{12}(t) & \cdots & x_{1c}(t) \\
x_{21}(t) & x_{22}(t) & \cdots & x_{2c}(t) \\
\vdots    & \vdots    &        & \vdots    \\
x_{r1}(t) & x_{r2}(t) & \cdots & x_{rc}(t)
\end{bmatrix}.   (2.1)

We also call a visual sensory input an image; we will use these two terms interchangeably in this chapter.

Analysis on the matrix space R^{r \times c} is difficult, because it introduces unnecessary complexity. Therefore, the vector space model is used to represent images. Under this model, each image is modeled as a single vector, and a collection of images is modeled as a single data matrix, where each column of the matrix corresponds to an image and each row corresponds to a feature dimension. Thus, we introduce a vectorization operator to convert the 2-D matrix representation to the vector space representation.

Definition 2.2.1 Let A \in R^{r \times c} and let a_j be its j-th column. Then vec A is an rc x 1 vector,

vec A = [a_1^T, a_2^T, ..., a_c^T]^T.

It is then natural to obtain the vector space representation I of an image X, i.e., I = vec X.

Without knowing the original row and column dimensions, the vector representation loses information about the R^{r \times c} space. So we make an important assumption: the statistical properties of the image are stationary. Basically, the stationarity assumption states that the statistical properties of the image space are independent of position in the image. Under this assumption, the statistical properties of the image space can be captured by the covariance matrix and the higher-order tensors of the image vector space.

In SHM, a filter is modeled by a column vector w_i, and w_i has the same dimension as the image vector representation I.
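As a concrete illustration of the vectorization operator and the filter model just introduced, the following NumPy sketch (a toy example, not part of the SHM implementation) stacks the columns of a small image into a single vector and computes one linear filter response w^T I.

    import numpy as np

    r, c = 4, 3                      # a tiny r x c "image"
    X = np.arange(r * c, dtype=float).reshape(r, c)

    # vec: stack the columns of X into an rc-dimensional vector (column-major order)
    I = X.reshape(-1, order="F")

    w = np.random.randn(r * c)       # one filter with the same dimension as I
    response = w @ I                 # linear response w^T I (no non-linearity f)

    print(I.shape, response)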
All filters that have connections to the image then form a matrix W = [w_1, w_2, ..., w_n], where n is the number of filters. The responses of a set of filters are R = f(W^T I), where T is the transpose operator and f(.) is a non-linear function. In the proposed architecture, we do not use the non-linear function f(.), so the model is linear.

When a family of receptive fields is considered, we have observed that Principal Component Analysis is a suitable mechanism for filter development. It gives a set of filters that minimize the mean square error between the input and the reconstructed space for a fixed dimensionality under a linear projection. The responses of the PCA filters are mutually uncorrelated, which increases the efficiency of the sensory coding. Furthermore, incremental PCA that adopts a Hebbian learning mechanism has some support from biological models of the cortex. It has been shown (Oja, 1982; Sanger, 1989; Rubner and Schulten, 1990) that artificial neurons updated by the Hebbian rule, while being inhibited by nearby neurons (lateral inhibition), produce synaptic weights that are the principal components produced by PCA.

However, conventional PCA also has its limitations. It applies only to the full input vector, and it is not capable of extracting local features in the input space. We shall design an architecture in which the responses are computed from a family of localized receptive fields. A stack of filters developed from a single receptive field is shown in Fig. 2.2(a). This raises two problems: first, the spatial resolution is low if the receptive fields do not overlap; second, if they overlap, the computational cost is high. In the proposed staggered scheme, we trade off spatial resolution against computational cost, as shown in Fig. 2.2(b). In the proposed architecture, filters are not stacked to cover the same receptive fields. Instead, the receptive fields of the filters are staggered and cover different regions. In practice, this means each filter only receives input from a local region of the input. However, for the convenience of analysis, we shall still model filters as covering the entire input range. Therefore, the matrix W is usually sparse, since filters do not connect to all the inputs, and the connections, or synapses, are localized. Intertangling this definition with the issue of filter development would make the analysis much more difficult. We therefore introduce a receptive field matrix to restrict the connections to a local area, and place no constraints on the filter matrix W.

Figure 2.2: A comparison of the neural cylinder (a) and the staggered receptive fields (b). In (a) a group of filters share the same receptive field, resulting in a low spatial resolution. In (b) each filter is centered at a different position, resulting in a higher density of spatial sampling.

Definition 2.2.2 The receptive field D_i of the filter w_i is a diagonal matrix

D_i = diag(d_{i,1}, d_{i,2}, ..., d_{i,K}),

with d_{i,j} \in {0, 1} and K = r x c the dimension of the image vector space.

The response of a single filter can be written as r_i = f((D_i w_i)^T I); D_i w_i defines a filter with a receptive field. This definition does not constrain the position of the receptive field, so the synapses could be scattered over the entire image. In practice, however, the receptive field of a filter is usually localized to a small region of the image. This topological information is discarded during the vectorization process. So, do we lose important information?
The answer to this question is twofold. First, we do lose the exact topological relationship between the filters. However, the correlation between filter pairs is preserved in the covariance matrix, and higher-order correlations can be found in the higher-order tensors. Moreover, the signal consists of natural images rather than white noise, so there is high correlation among neighboring input areas. In this sense, the topological structure is encoded in the input signal. A data-driven learning mechanism can find these higher-order correlations, which thus determine the receptive field.

It is worth noting that the introduction of the receptive field matrix D is merely for the convenience of mathematical analysis. In the actual implementation, the weight vector is the product D_i w_i, where zero connections are omitted to save computational resources.

A localized receptive field is defined by its starting row p, its starting column q, and its size z. For simplicity of the model, we usually assume every localized receptive field in the same layer has the same size z. From now on, if not otherwise specified, we will always assume this condition. Fig. 2.3 illustrates a localized receptive field D_{pq} with size 4.

Figure 2.3: Localized receptive field D_{pq} with size 4.

Filters are partitioned into non-intersecting groups g_i, i = 1, 2, ..., which are called eigen-groups. Within an eigen-group g_i, the receptive fields of the filters have neither overlap nor gaps between them. Therefore, the receptive fields of filters in the same group pave the entire image seamlessly, i.e., for any k \ne l and D_k, D_l \in g_i, we have

D_k D_l = \Theta,   (2.2)

where \Theta is the zero matrix, and

\sum_{D_k \in g_i} D_k = E,   (2.3)

where the summation is element-wise matrix addition and E is the identity matrix.

In the SHM model, filters in the same eigen-group compute the principal component of the same order. If the size of the receptive field is n x n, then there can be at most n x n different eigen-groups. No two eigen-groups share the same receptive field, i.e., for all D_i \in g_k and D_j \in g_l with k \ne l, D_i \ne D_j. We can then define the distance between eigen-groups as

d = min{ |p - r|, |q - s| },   for D_{pq} \in g_k, D_{rs} \in g_l, k \ne l.

We call the minimum value of d the "inter-filter distance", or "inter-neuron distance". Typically, the structure of SHM is homogeneous, so the inter-filter distance is the distance between any adjacent filters.

A randomly predefined order is given to all eigen-groups, shown for example by the numbers in Fig. 2.4. The numbers determine the order of the eigenvector computed by each eigen-group, and the positions of the numbers determine the relative positions of the eigen-groups. After each input image arrives, the filters in the lower-order eigen-groups are updated first, and their responses are generated. The filter updating rule will be discussed in Section 2.3.

Figure 2.4: Computational order of eigen-groups. There are 16 eigen-groups. The numbers in the table denote the order of updating using the residual image. The positions of the numbers represent the relative positions of the eigen-groups.
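To make the eigen-group bookkeeping concrete, the sketch below uses an illustrative assumption of an 8 x 8 image paved by 4 x 4 non-overlapping receptive fields (not the actual SHM configuration). It builds the diagonal masks D_i for one eigen-group and checks the paving properties of Eqs. (2.2) and (2.3).

    import numpy as np

    r = c = 8          # image size (assumed for the example)
    z = 4              # receptive-field size
    K = r * c

    def mask_for_field(p, q):
        """Diagonal receptive-field matrix D for a z x z field starting at (p, q)."""
        m = np.zeros((r, c))
        m[p:p + z, q:q + z] = 1.0
        return np.diag(m.reshape(-1, order="F"))   # same vec ordering as the image

    # One eigen-group: fields paved seamlessly over the image (no overlap, no gap).
    group = [mask_for_field(p, q) for p in range(0, r, z) for q in range(0, c, z)]

    # Eq. (2.2): masks of different fields in the group annihilate each other.
    assert np.all(group[0] @ group[1] == 0)
    # Eq. (2.3): the masks of one eigen-group sum to the identity.
    assert np.allclose(sum(group), np.eye(K))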
Here, we first take a look at what we call residue-image processing between the computations of consecutive eigen-groups. Suppose we have finished eigen-group g_k. The response of this eigen-group is r_i = (D_i w_i)^T I_{k-1}, where D_i \in g_k, I_{k-1} is the residue image from the previous eigen-group, and I_0 = I. The residue image after this computation is defined by

I_k = I_{k-1} - \sum_{D_i \in g_k} (D_i w_i)^T I_{k-1} (D_i w_i).   (2.4)

The residue image can be considered as the remaining variance that is not captured by the current eigen-group, and it serves as the input to the next eigen-group.

Besides the feed-forward processing, filters also interact within the same layer. If we assume the lateral interaction is linear, we can model it with another matrix.

Definition 2.2.3 The lateral interaction matrix L is an n x n matrix, L = [l_{ij}]_{n x n}. The element l_{ij} determines the inhibitory or excitatory synapse from filter i to filter j.

If L is a symmetric matrix, then the lateral interaction is symmetric; otherwise it is asymmetric. The responses are affected by W, D, and L, i.e., r_i = (L_i D_i w_i)^T I.

The multi-layer network is a natural extension of the single-layer network. We denote by W_k the filter set of the k-th layer, and by D_k and L_k the corresponding receptive field matrix and lateral inhibition matrix, respectively. The total filtering matrix of one layer, without considering a non-linear mapping function, is F_k = W_k D_k L_k. The response of the multi-layer network is then defined by

R = \left( \prod_{i=1}^{k} F_i \right)^T I.

Since each F_i is a linear transformation, we call C = \prod_{i=1}^{k} F_i the combined filter matrix. Each column of C corresponds to one combined filter. The size of the receptive field of the combined filters increases with the order of the layer. Thus, SHM gives a representation at different scales in different layers. Furthermore, the staggered receptive fields give representations at different positions over the input image. Representation at different scales and different positions is the key reason why SHM is better than other flat representations. The structure of the proposed SHM is shown in Fig. 2.5, in which only the non-zero weight connections are drawn.

Figure 2.5: The architecture of SHM. Each square denotes a filter. Layer 0 is the input image. The filters marked in black in layer 1 belong to different eigen-groups, because they overlap. Bold lines that are derived from a single filter and expanded to the original image mark the receptive field of the combined filter. In this diagram, the size of the receptive field in a particular layer is 20% larger than in its previous layer, as shown at the right; the size of the receptive field is rounded to the nearest integer.

2.3 Developmental Algorithm

The developmental algorithm for the SHM must be incremental and fast for real-time applications. By "incremental" we mean that each input image must be discarded after it has been used for updating the SHM; it is not practical to store all of the images for batch processing due to the large amount of space required. By "fast" we mean that updating for each image must be computationally efficient; for example, iteration within each update is not allowed. These requirements limit the complexity of the computation that can be performed by a biological cortex. They also make the design of a developmental program for SHM challenging. We have further developed the CCIPCA algorithm (Weng et al., 2003) with these requirements as design constraints. This algorithm can be found in the appendix.
The filter development procedure for a single layer is described as follows.

Algorithm 1 The filter development algorithm of a single layer.
1: Use steps 1 to 2 of the CCIPCA algorithm in the appendix to initialize each filter in the network.
2: Initialize the residue image I_0 = I(n), where I(n) is the n-th input image.
3: k ← 0.
4: for all filters w_i in eigen-group g_k do
5:   Use Eq. A.2 of the CCIPCA algorithm to update the connection weight vector w_i(n) of each corresponding filter with the residue image I_k:
        w_i = \alpha(n) w_i + \beta(n) ((D_i w_i)^T I_k) I_k,
     where \alpha(n) and \beta(n) are the amnesic average coefficients defined in Eq. A.3.
6:   Use Eq. A.4 to compute the corresponding response r_i(n) of each filter:
        r_i = (D_i w_i)^T I_k / ||D_i w_i||.
7:   Use Eq. A.6 to compute the residual image for the next eigen-group:
        I_{k+1} = I_k - \sum_{D_i \in g_k} (D_i w_i)^T I_k (D_i w_i) / ||D_i w_i||^2.
8: end for
9: k ← k + 1; go to step 4 until all the eigen-groups are updated.
10: Get the next input image I(n + 1) and go to step 2.

Each layer is similar in that it takes its input and runs the CCIPCA algorithm on it, producing both the filters for this layer and the output response image for the next layer. The first layer of the network uses the original input image, and each subsequent layer uses the output response images from the previous layer.

The receptive fields at the same level are not concentrated at a few center positions. Instead, we use the masking matrices D_i to spread the receptive fields over the entire input layer. The staggered positions, along with the increasing size of the receptive field, make it possible for input regions to have local representations at any size and any position (up to a sampled resolution) in the network.

Inhibition between neurons plays a vital role in the function of the visual cortex. We use residual images to represent the effect of inhibitive competition between neurons. This SHM network is not a pure feed-forward network, because it deals with the re-entrant processing of neurons. Each time a filter is computed, its contribution is subtracted from the current image, leaving the information that has not yet been extracted by the current eigen-group. The next filter can only "see" the part of the input that the previous filters did not extract; thus, the earlier filter inhibits the later one, which makes the process asymmetric. Each filter "sees" input data that has not been "seen" by previous neurons. This is more efficient than symmetric inhibition, where it takes time to compete for the order and achieve the same result.

In SHM, the lateral interaction is not explicitly modeled by a lateral interaction matrix L. Instead, filters affect each other laterally through the residue-image process. If we write F_i = D_i w_i, then the response of filter r_m, where D_m \in g_k, can be expressed as

r_m = F_m^T \left[ E - \sum_{F_i \in g_{k-1}} F_i F_i^T \right] \cdots \left[ E - \sum_{F_i \in g_1} F_i F_i^T \right] I.

The product of the terms in square brackets is the transpose of the lateral interaction matrix, L^T.

Fig. 2.6 shows a sequence of residual images. It can be seen that the variance of the successive residual images decreases as more orthogonal components are extracted.

Figure 2.6: Sequence of residual images. From left to right and top to bottom, starting from the original input image, the sequence displays the process in which each component is subtracted from the previous residual image.
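For readers who prefer code, the following NumPy sketch caricatures the single-layer loop of Algorithm 1 under simplifying assumptions that are ours, not the dissertation's: one filter per eigen-group with a full-image receptive field (no staggering, no masks D_i), a fixed amnesic parameter, and no normalization inside the update. It is meant only to show the update-respond-subtract cycle.

    import numpy as np

    def develop_layer(frames, n_components, amnesic=2.0):
        """Simplified, non-staggered caricature of Algorithm 1 (one filter per eigen-group)."""
        dim = frames.shape[1]
        W = np.random.randn(n_components, dim) * 0.01    # filter vectors w_k
        for n, image in enumerate(frames, start=1):
            alpha = (n - 1 - amnesic) / n if n > amnesic + 1 else 0.5
            beta = 1.0 - alpha                            # amnesic average pair
            residue = image.copy()                        # I_0 = I(n)
            for k in range(n_components):
                w = W[k]
                W[k] = alpha * w + beta * (w @ residue) * residue   # Hebbian-like update
                u = W[k] / (np.linalg.norm(W[k]) + 1e-12)
                response = u @ residue                    # r_k
                residue = residue - response * u          # residue for the next eigen-group
        return W

    # Usage: 500 random 16 x 16 "frames" flattened to vectors.
    frames = np.random.randn(500, 256)
    filters = develop_layer(frames, n_components=5)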
2.4 Computational complexity analysis

We would like to give a concise expression for the amount of computation. Since it is unnecessary to keep the same inter-filter distance everywhere, we can assume a similar number of connections for every cell in the sensory mapping. The number of processing elements is roughly the same across the different layers, and can be estimated from the input dimension d. Suppose that in layer i the average number of neural connections per filter is c_i, and the array of filters has h_i rows and w_i columns. The amount of computation required to update the connections using the incremental Hebbian-like algorithm is linear in c_i. Thus, with L layers, i = 1, 2, ..., L, the amount of computation for the entire network is

T_s = \sum_{i=1}^{L} h_i w_i O(c_i) \approx O(c d),

where d is the number of pixels in the sensory input and c is the average number of connections per filter. Since the sensory mapping stores information in the same place where the cell computation is done, and the processing results need to be stored too, the space complexity is

S_s = O(d c L).

2.5 Input Image Reconstruction

With SHM, the input image is coded as responses at several layers. To show how complete the representation is at each single layer, we need to reconstruct the original image from the responses of that layer alone. The reconstruction is the reverse of the mapping: the responses of a layer, which are the sample's projections along the principal components, are used to recover its component in the sample space, which is then added back to the residual image. The reconstructed image I_r is computed by

I_r = \sum_{k} \sum_{D_i \in g_k} r_i (D_i w_i),   (2.5)

where r_i is the response of the i-th filter.

Lemma 2.5.1 The filters with localized receptive fields in the same eigen-group are orthogonal to each other: for all filters D_i, D_j \in g with i \ne j, we have ⟨D_i w_i, D_j w_j⟩ = 0, where ⟨., .⟩ denotes the inner product.

Proof. ⟨D_i w_i, D_j w_j⟩ = w_i^T D_i^T D_j w_j. From Eq. 2.2 we have D_i^T D_j = \Theta, where \Theta is the zero matrix. Therefore ⟨D_i w_i, D_j w_j⟩ = 0.

Lemma 2.5.2 The residue vector I_k of an image I_{k-1} after applying eigen-group g_k is orthogonal to every filter in g_k, i.e., for all D_j \in g_k, ⟨I_k, D_j w_j⟩ = 0.

Proof. From Eq. 2.4,

⟨I_k, D_j w_j⟩ = (D_j w_j)^T I_{k-1} - \sum_{D_i \in g_k} (D_i w_i)^T I_{k-1} (w_i^T D_i^T D_j w_j)
             = (D_j w_j)^T I_{k-1} - (D_j w_j)^T I_{k-1} = 0.

Since D_j \in g_k, by Eq. 2.2 each addend in the summation is zero when i \ne j, and equals (D_j w_j)^T I_{k-1} when i = j. Note that here we assume ||D_i w_i|| = 1.

Therefore, the initial image space is partitioned into a sequence of subspaces, each spanned by the vectors in one eigen-group, and each of these subspaces is orthogonal to the next residue image. It is then easy to show that the total reconstruction error is

E = ||I - I_r||^2
  = ||I - \sum_{j=1}^{k} \sum_{D_i \in g_j} r_i (D_i w_i)||^2
  = ||I_1 - \sum_{j=2}^{k} \sum_{D_i \in g_j} r_i (D_i w_i)||^2
  = ||I_{k+1}||^2,

where ||.|| is the l_2 norm and I_{k+1} is the last residue image after processing by the k eigen-groups.
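A small numerical check of Eq. (2.5) can be written directly from the definitions. The sketch below uses illustrative assumptions of our own (a 64-dimensional toy image and one eigen-group of four unit-norm filters with disjoint supports); it verifies the orthogonality claim of Lemma 2.5.2 and shows how much energy one eigen-group captures.

    import numpy as np

    rng = np.random.default_rng(0)
    K = 64                                   # an 8 x 8 image flattened (toy assumption)
    I = rng.standard_normal(K)

    # One eigen-group: four filters with disjoint supports of 16 pixels each,
    # each normalized so that ||D_i w_i|| = 1 as assumed in Lemma 2.5.2.
    supports = np.arange(K).reshape(4, 16)
    filters = []
    for idx in supports:
        f = np.zeros(K)
        f[idx] = rng.standard_normal(16)
        filters.append(f / np.linalg.norm(f))

    responses = [f @ I for f in filters]                          # r_i
    I_rec = sum(r * f for r, f in zip(responses, filters))        # Eq. (2.5), one group
    residue = I - I_rec                                           # Eq. (2.4)

    # Lemma 2.5.2: the residue is orthogonal to every filter of the eigen-group.
    assert all(np.isclose(f @ residue, 0.0) for f in filters)
    # The reconstruction error equals the energy left in the residue image.
    print(np.sum(I ** 2), np.sum(residue ** 2))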
We also need to consider the border conditions. If a localized filter covers the border of the input image, then the connections that fall outside the image receive zero input. This condition enlarges the image by (z - 1) x 2 rows and (z - 1) x 2 columns, as shown in Fig. 2.7, so the new image space has dimension (r + 2(z - 1))(c + 2(z - 1)). Since the outlying pixels are set to zero, the true dimension of the image space is still r x c. The total number of filters whose localized receptive fields cover at least part of the input image is (r + 2(z - 1))(c + 2(z - 1)), and for all w_pq and w_rs we have ⟨w_pq, w_rs⟩ = 0. The number of filters is larger than the effective dimension of the image space; thus, the filters are linearly dependent if the image border is also covered by filters.

Figure 2.7: The filter connections that cover the outliers of the image receive zero input.

The single-layer reconstruction algorithm is shown in Algorithm 2.

Algorithm 2 The single-layer reconstruction algorithm
1: I_r = \Theta, where \Theta is the zero matrix.
2: for all response outputs r_i, with corresponding filter vector w_i and receptive field D_i, do
3:   Update the reconstructed image by I_r = I_r + r_i (D_i w_i).
4: end for

2.6 Attention Effectors

The response of each filter (or processing cell) at every level is available for later processing. The goal of the attention effectors is to suppress the responses from areas that are beyond the attended visual field. However, the filters share input planes through the staggered receptive fields shown in Fig. 2.5. This means that a region of filters in a layer of SHM does not correspond to a clear-cut region in the input plane. A special case of this situation occurs when the region is so small that it contains only one filter. Interactions between filters expand the extent of each pixel beyond that marked by direct inputs to filters. Therefore, we should not expect that selecting the responses from a region in any layer will result in a clear-cut attended region in the input.

We define an attended region as a 3-D ellipsoid centered at a position (x, y, l) in SHM, where x, y denote the position of the filter and l denotes the layer. The length of each axis of the ellipsoid corresponds to the scale along that dimension. All filters falling inside the ellipsoid have their output passed on, and all others are blocked.

2.7 Experiment with SHM

Testing began with a collection of over 5000 natural images. They were taken using a Sony digital camcorder in an outdoor environment. A variety of recordings were made, including buildings, trees and shrubbery, close-ups of bark and wood chips, flowers, and other scenes found in nature and on campus. The frames from the video were then captured and digitized into 118 x 158 images (each has 118 x 158 pixels). Fig. 2.8 shows a few sample images from the image set.

Figure 2.8: Several samples of the 5000 natural images collected.

There are two methods used to develop filters: weight-sharing and non-weight-sharing. The weight-sharing method assumes that the filters in the same eigen-group have the same connection weights. This is a good estimate only if the images have the same statistical properties across the entire image frame. Thus, the sharing method updates only a single filter for each eigen-group. The non-sharing method updates a separate set of weights for each filter in the eigen-group.

The filters of the first layer, trained by the sharing method, are shown in Fig. 2.9. There are 64 eigen-groups in this layer, and the size of each filter is 16 x 16. The filters are reordered according to their computation order so that the dominant principal components are shown first. The first several filters appear similar to the receptive-field excitation-inhibition patterns reported in Hubel and Wiesel's experiments (Hubel and Wiesel, 1962). Later ones show more complex patterns, which may fall into cell classes that were not categorized by Hubel and Wiesel.

Figure 2.9: The filters developed in the first layer by the sharing method. The order of dominance is from left to right and from top to bottom.
When input was placed into the network, a scalar output response value was generated for each filter at its respective receptive field. This value measures the amount of energy of the input along the direction of the filter and represents the firing strength of the filter. The filter responses were then organized into groups based on their order in the eigen-group, combined into a structured image vector, and passed on as input to the next layer of the network.

The responses of a layer are reconstructed by its following layer, finally reaching the original input layer. The reconstructed images are shown in Fig. 2.10, which indicates that the responses of each single layer are sufficient to reconstruct the original image fairly well. We also test the reconstruction error of each layer. The error is computed as the Euclidean distance between the original image and the reconstructed image. The two images are normalized by subtracting the mean and dividing by their standard deviation. The distance is then divided by the number of pixels so that images of different sizes can be compared. The error is averaged over 100 reconstruction results. Fig. 2.11 illustrates the reconstruction error of each layer.

Figure 2.10: The reconstructed images from different levels. The first image is the original input image. The others are reconstructed images from Layer 1 to Layer 4, respectively.

Figure 2.11: The reconstruction error of each layer (average pixel error versus layer of reconstruction).
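The per-pixel error metric described above is easy to restate in code; the helper below is our paraphrase of that description (z-score both images, take the Euclidean distance, divide by the pixel count), not the original evaluation script.

    import numpy as np

    def average_pixel_error(original, reconstructed):
        """Normalized per-pixel reconstruction error, as described in Section 2.7."""
        a = (original - original.mean()) / original.std()
        b = (reconstructed - reconstructed.mean()) / reconstructed.std()
        return np.linalg.norm(a - b) / a.size

    # Example with a noisy "reconstruction" of a random 118 x 158 image.
    img = np.random.rand(118, 158)
    rec = img + 0.05 * np.random.randn(118, 158)
    print(average_pixel_error(img, rec))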
If we use the nearest neighbor method (NN) to find the prototype, we call it monolithic + NN testing. We also study the use of active internal action in visual perception. For this 55 purpose, we construct an active system Shown in Fig. 2.12. Three classifiers C 1, C2 and C3 are integrated in the system. Here, we use the Hierarchical Discriminant Regression Tree (HDR Tfee) (Hwang and Weng, 2000). To learn the mapping, C3 is to determine what specific occlusion case the image is, U or L. In our test, the classifier C3 has reached a 100% classification rate. The 31 and 5'2 are SHMS. 51 is for U views and 8'2 is for L views. Cl and C2 are two HDR Trees used to classify images. Which SHM + HDR should be used is determined by the attention signal from C3. Occluded face images NV A S 1 $2 C1 / C2 Integration of the results Figure 2.12: The schematic illustration of operation. C3 is the attention signal generator. C1 and C2 are the classifiers for each occlusion case. S1 and S2 are SHMs, SI for U views and S2 for L views. The experiment used the face data from the Weizmann data set. The set was taken from 28 human subjects, each having 30 images with all possible combinations of two different expressions, three lighting conditions and five different facial orientations. 56 Method Recognition Rate U L U+L Monolithic+N N 51.43% 75.83% 82.38% SHM+HDR 92.86% 95.95% 98.57% Method Testing Time (ms) U L U+L Monolithic+N N 765.5 765.5 2263.0 SHM+HDR 702.4 702.4 2016.6 Table 2.1: Summaries Of the occlusion experiment. We use the leave-one-out cross-validation method. 30 experiments were conducted, in which 1 out of 30 samples Of each person is picked up as a testing sample, and all of the remaining ones as training samples. The average of the recognition result is reported in Table 2.1, along with the time run on a dual Pentium III 600 MHz processor PC. The U, L columns are for testing with U, L views only. The U+L column is for an integration of 2 test views, U and L are to give a Single classification (person). The U+L integration is done in the following way. For each view, top K best matches form a top labels list. Thus, each pair of U and L views give a U label list and a L label list. The label that has the minimum rank sum is the output label. Table 2.1 shows that the proposed SHM with the HDR classifier is effective in integrating active partial views, U and L. Without active attention, monolithic + NN performed poorly, even with U+L integration. This is because it lacks a mechanism to actively extract local views when presented with a global view. The speed of the SHM + HDR is fast enough to reach about 2 Hz refreshing rate. 57 2.9 Experiment with Navigation Simulation We also applied the SHM algorithm to a navigation simulation task. A total of 18,330 images are collected from two cameras mounted on an autonomous vehicle. The images of the camera are aligned and merged together to offer a broad View of the environment. A sample image with dimension 120 by 320 pixels is shown in Fig. 2.13. Figure 2.13: Merged image sample from the two cameras. 2,541 images from the data set are used to train a three level SHM network. The filter size of each layer is 16, 8, 8, starting from the bottom layer. Unlike the previous occluded face recognition experiment, a more general attention selection effector is used. As shown in Fig. 2.14, we use a GUI program to manually select 6 points on the road boundary, which are marked as circles. These 6 points are attention positions. 
2.9 Experiment with Navigation Simulation

We also applied the SHM algorithm to a navigation simulation task. A total of 18,330 images were collected from two cameras mounted on an autonomous vehicle. The images from the two cameras are aligned and merged to offer a broad view of the environment. A sample image, with dimensions 120 by 320 pixels, is shown in Fig. 2.13.

Figure 2.13: Merged image sample from the two cameras.

2,541 images from the data set are used to train a three-level SHM network. The filter sizes of the layers are 16, 8 and 8, starting from the bottom layer. Unlike the previous occluded face recognition experiment, a more general attention selection effector is used. As shown in Fig. 2.14, we use a GUI program to manually select 6 points on the road boundary, marked as circles. These 6 points are the attention positions. A small window around each attention position, across the levels of SHM, is extracted. Let (S1, S2, S3, S4, S5, S6) denote the concatenation of all 6 windows. The coordinates of the attention positions (r1, c1, r2, c2, ..., r6, c6) are also concatenated with the input vector. Since these are of a different data modality than the SHM output, we normalize each section by its respective variance. Thus, the input vector to the HDR mapping is

( (S1, S2, S3, S4, S5, S6) / \sigma_1 , (r1, c1, r2, c2, ..., r6, c6) / \sigma_2 ),

where \sigma_1 is the variance of the SHM output and \sigma_2 is the variance of the attention point coordinates. This manipulation makes sure that the coordinate information is not overwhelmed by the high-dimensional image input. Three driving directions (going left, going right and going forward) are also provided by the GUI program as classification labels for supervised learning. Note that the attention signals are all manually provided at this stage; automatic attention selection is a topic of further research on the SHM model.

Figure 2.14: Six attention positions on the image.

We compare the results with a monolithic representation, in which only a flat image is used as input to the HDR mapping, with the same class labels. Since the input image is of high dimension, we scale it down to one sixteenth of its original size. 2,541 images are used as training samples, and the remaining 15,789 images are used as the testing set. The results are compared in Table 2.2, which clearly shows that the SHM representation is much better than the monolithic representation.

    Method            Recognition Rate    Testing Time (ms) per image
    Monolithic+HDR    44.72%              103.2
    SHM+HDR           85.52%              748.1

Table 2.2: Summary of the navigation simulation.
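The construction of the HDR input vector just described can be sketched as follows; the window size, the use of the standard deviation for the scaling, and the array names are illustrative assumptions rather than the dissertation's exact recipe.

    import numpy as np

    def build_hdr_input(shm_windows, coords):
        """Concatenate attention-window responses and coordinates, each scaled by its own spread."""
        s = np.concatenate([w.ravel() for w in shm_windows])   # (S1, ..., S6)
        r = np.asarray(coords, dtype=float).ravel()            # (r1, c1, ..., r6, c6)
        s_scaled = s / (s.std() + 1e-12)
        r_scaled = r / (r.std() + 1e-12)
        return np.concatenate([s_scaled, r_scaled])

    # Example: six 8 x 8 response windows and six (row, col) attention points.
    windows = [np.random.randn(8, 8) for _ in range(6)]
    points = [(40, 30), (45, 90), (50, 150), (52, 200), (60, 260), (70, 300)]
    x = build_hdr_input(windows, points)
    print(x.shape)   # (6 * 64 + 12,) = (396,)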
2.10 Conclusions

This chapter presented the design principles, architecture and developmental algorithm of a general-purpose sensory mapping, the Staggered Hierarchical Mapping, which tightly couples attention-guided bottom-up information generation with the execution of top-down attention selection. The experimental data show that the representation at each layer has high completeness. The results of the occluded image recognition experiment show that the SHM has a local analysis property, and thus it can be used to achieve attention control in an active vision system. A future research direction is to study mechanisms other than PCA for developing sensory maps, as well as their advantages and disadvantages.

Chapter 3

Lobe Component Analysis

Orientation-selective cells in V1 are well known, but the underlying computational principles that guide their emergence (i.e., development) are still elusive. The result reported here is based on modeling a high-dimensional probability distribution using a network structure. We introduce a concept called the lobe component. Each lobe component represents a high concentration of probability density of the high-dimensional sensory inputs. A developmental algorithm for a sensory network has been designed and tested based on the concept of lobe components. Our simulations of network development revealed that many of the lobe component cells developed by the algorithm from natural images are orientation-selective cells similar to those found in V1. Therefore, we make the hypothesis that orientation selection is merely a natural consequence of cortical development, whereas probability density estimation by neurons is a fundamental principle of cortical development. If this hypothesis is true, the principle of neuronal probability approximation may have deep implications for the development and self-organization of other sensory cortices (e.g., the auditory and somatosensory cortices) and higher cortices.

The proposed developmental algorithm is not meant to be biologically verified, but it is biologically plausible. It is based on two well-known computational neural mechanisms, Hebbian learning and lateral inhibition. It is an in-place developmental algorithm in which every cell is both a signal processor and the developer of the cortex; there is no need for a separate network to handle development (and learning). The algorithm is purely incremental in the sense that no storage is needed for information other than what can be stored and incrementally computed by the processing network itself.

To clarify technically, four types of algorithms have been considered, with progressively more restrictive conditions: batch (Type-1), block-incremental (Type-2), incremental (Type-3) and covariance-free incremental (Type-4). The proposed Candid Covariance-free Incremental (CCI) Lobe Component Analysis (LCA) algorithm appears to be the first Type-4 algorithm for independent component analysis (ICA). Preliminary simulation studies showed that it outperformed, in convergence rate and computation time, some well-known state-of-the-art ICA algorithms of Types 3 through 1 on high-dimensional data streams. The proposed LCA algorithm also has a simple structure, with the lowest possible space and time complexities.

In this chapter, we first briefly introduce the background of Independent Component Analysis in Section 3.1, which is closely related to the LCA method. We then propose the method in Section 3.2 and analyze its performance.

3.1 Introduction to Independent Component Analysis

Recently, there has been a resurgence of interest in independent component analysis (ICA). ICA has wide application in signal and image processing, telecommunications, and medical data processing. The basic assumption of ICA is that the signal observed at the sensory input is a linear combination of several latent independent sources. (Some methods also consider the non-linear mixing case, which is outside the scope of this chapter.) The main purpose of ICA is to recover as much of the latent independence from the signal as possible, without prior knowledge of the source distributions or the mixing model.

In contrast to the Principal Component Analysis (PCA) method, ICA considers information beyond the second-order statistics (covariances). In PCA, all the necessary information is in the covariance matrix; the principal axes can be found by an eigen-decomposition of the covariance matrix. In other words, PCA only considers pair-wise correlation. Many ICA methods, on the other hand, require a preprocessing procedure that removes the covariance information. This preprocessing, often referred to as "whitening" or "sphering", diagonalizes the covariance matrix and normalizes each principal component to unit length. After whitening, the covariance matrix becomes an identity matrix, so no second-order statistics remain.
ICA therefore deals with higher-order statistics, and attempts to find a linear transform that makes the resulting components as statistically independent from each other as possible.

Several different research communities are interested in the ICA problem. In the statistics community, several approaches have been proposed to find "interesting" projections of a multivariate random variable that reveal the structure of the data. This approach, called "projection pursuit" (Friedman and Tukey, 1974; Friedman, 1987; Huber, 1985), tries to find the projection direction for which the projected signal distribution is least like a Gaussian distribution, which is very similar to those ICA approaches that try to maximize non-Gaussianity.

Blind source separation (BSS) is a class of problems in the signal processing community. It aims at recovering unobserved signals or "sources" from observed mixtures, exploiting only the assumption of mutual independence between the signals. So it is clear that BSS and ICA address a similar problem; practical BSS problems are usually solved by ICA methods.

Another way to describe this problem is the sparse coding mechanism. The sparse coding problem can be found in the neuroscience literature (Field, 1994; Olshausen and Field, 1996). It aims at explaining the coding mechanism of the neural system. The sparse coding hypothesis suggests that the principle of sensory coding is to produce a sparse, distributed representation of the sensory input. Sparseness here means that a neuron has a high probability of staying inactive, and also a relatively high probability of a large response, where "relatively" means compared to a Gaussian distribution with the same variance. For a family of neurons, this means that at each time instant only a small number of neurons have a large firing rate, and the rest have low activity. Since sparseness is essentially the property of a super-Gaussian distribution, maximizing sparseness is equivalent to maximizing super-Gaussianity. Thus the sparseness principle finally leads to the solution of independent super-Gaussian components. In this sense, sparse coding is merely a special case of ICA.

In this chapter, we first define the ICA problem and then survey related ICA work. A new ICA algorithm is proposed, and we also discuss the experimental results of the new method and its limitations.

3.1.1 Problem statement and assumptions

Cocktail party problem

To illustrate the problem of independent component analysis, let us start from a real-world example. Suppose there are n people speaking simultaneously in a room. At the same time, n microphones are set up to pick up the sound signals. This scenario is illustrated in Fig. 3.1. Obviously, each microphone input is a mixture of the sound of all the people. Since the sound strength is inversely proportional to the square of the distance from the sound source, the mixing factor for each sound source is different. The problem is how to separate the n original source signals from the n mixed signals. This problem is often called the "cocktail party problem."

The difficulty of this problem lies in the fact that we have no prior knowledge, e.g., statistical properties, of the speakers, nor do we know the setting of the microphones, e.g., how far each microphone is from each person. The only thing we know is that the sound sources are independent of each other.
The only thing we know is that each sound source is independent from each other. Of course in 65 Figure 3.1: The cocktail party problem. semantic level the conversations may not be independent: speakers may all talk about the weather. But in signal level, since every person has independent vocal organs, each sound source can be think as independent. Knowing independence does not help us much, because there is so much arbitrariness. It is also worth noting that this is only a simplified version of the actual cocktail party problem. We have assumed that there were no noises, no echo effects, and all sound signals reach microphones instantaneously. To formalize the problem, let us denote the source signal by 31(t), 32(t), ..., sn(t), and the mixed signal by x1(t), x2(t), ...,xn(t). As shown in Fig. 3.1, the extenuating factors of the ith microphone to each of the jth sources are denoted by aij, which are unknown and inversely proportional to the square of the distances between source and sensor. Suppose microphones do not introduce any non-linearity. The mixed 66 signals x1(t),x2(t), ...,xn(t) can be expressed as: $10") = a1181(t) + a1282 (t) + + aInSn (i) xn (t) = anlsl (t) + an232 (t) + + annsn (t). TO simplify the notation, we can write above equations into a matrix form: x (t) = As(t), (3.1) where x(t) = [x1(t),:1:2(t),...,xn(t)]T is the mixed Signal vector, s(t) = s t ,s t ,..., sn t T is the source signal vector, and l 2 all...a1n b d is the mixing matrix. The mixing is assumed to be instantaneous, which assumes the sources arrive at different sensors at the same time without delay. Since we have this assumption, the time index t can be dropped to simplify the notation. Then the mixture model becomes x 2 As. Apparently if A is known, 3 can be easily solved by a set of linear equations. Even prior knowledge, e.g. probability density function (pdf), Of source signal s can be useful for estimation the unknown parameters, e.g. to be used in a maximum likelihood estimation method. Unfortunately, s and A are both unmeasurable, which 67 makes this problem considerately more difficult. However, the recently developed method of Independent Component Analysis can be used to estimate A based on the information of independence. The goal of ICA is to find a linear transformation W for the dependent sensory signals x that makes the recovered signal 11 as independent as possible: 11 = Wx = WAS, (3.2) where u is an estimation of the sources. The sources 5 will be fully recovered when W is the inverse Of A up to a permutation and scaling factor. In the following sections we will review the general principles that lead tO the solution Of this problem. What is independence The definition Of statistical independent is defined as follows (Tucker, 1962). Definition 3.1.1 Let C denote a collection of events. The events in C are said to be independent if the probability of the joint occurrence of any finite number of them equals the product of their probabilities. Basically, this is said that if a random vector x that consists of n random variables x1,x2, ...,xn with the joint pdf function p(x) is equal to the multiplication of the marginal pdf’s, i.e.: n P(X) = pr’l’z‘). i=1 68 then x1, x2, ..., xn are independent. A set Of random variables is mutually independent if and only if the joint pdf p(x) is factorizable in the above way. A weak form of independent random variables is uncorrelated random variables. 
However, the recently developed method of Independent Component Analysis can be used to estimate A based on the information of independence. The goal of ICA is to find a linear transformation W for the dependent sensory signals x that makes the recovered signals u as independent as possible:

u = W x = W A s,   (3.2)

where u is an estimate of the sources. The sources s are fully recovered when W is the inverse of A, up to a permutation and a scaling factor. In the following sections we review the general principles that lead to the solution of this problem.

What is independence

Statistical independence is defined as follows (Tucker, 1962).

Definition 3.1.1 Let C denote a collection of events. The events in C are said to be independent if the probability of the joint occurrence of any finite number of them equals the product of their probabilities.

Basically, this says that if a random vector x consisting of n random variables x_1, x_2, ..., x_n has a joint pdf p(x) equal to the product of the marginal pdfs, i.e.,

p(x) = \prod_{i=1}^{n} p_i(x_i),

then x_1, x_2, ..., x_n are independent. A set of random variables is mutually independent if and only if the joint pdf p(x) is factorizable in the above way.

A weaker notion than independence is uncorrelatedness, defined as follows.

Definition 3.1.2 Two random variables x and y are said to be uncorrelated if their covariance is zero: cov(x, y) = 0.

An important property of independent random variables is the following.

Theorem 3.1.1 Independent random variables are uncorrelated.

The proof of this theorem is relatively straightforward and can be found in much of the classical literature; for convenience, we show it here.

Proof. We first show that the theorem holds for the case of two random variables; the extension to the n-variable case is natural. We need to show that cov(x, y) = 0, where x and y are independent random variables. This is equivalent to showing E(xy) = E(x)E(y). Since x and y are independent,

E(xy) = \int\int xy\, p(x, y)\, dx\, dy = \int\int xy\, p(x) p(y)\, dx\, dy = \int x p(x)\, dx \cdot \int y p(y)\, dy = E(x) E(y).

It is worth noting that the converse of Theorem 3.1.1 is not true: uncorrelated random variables are not necessarily independent, with the single exception of Gaussian random variables. So uncorrelatedness is only a necessary condition for independence. Many ICA algorithms have a preprocessing stage which enforces uncorrelatedness of the data. Although this processing does not by itself produce independent components, it greatly reduces the uncertainty in the ICA model and results in faster convergence. Independence and uncorrelatedness can be explained intuitively by visualizing the data, as we will show in Section 3.1.1.

Super-Gaussian, sub-Gaussian and kurtosis

Random variables can be classified by their similarity to the Gaussian distribution, or Gaussianity. There is a class of random distributions whose probability density function is more peaked at the center than a Gaussian distribution, and whose probability density at the outliers is considerably higher than the Gaussian pdf. This class of distributions is often called super-Gaussian. Super-Gaussian distributions are also referred to as "heavy-tailed" distributions, because of the higher density at the outliers. Another class of random distributions, on the other hand, is "shallower" at the center and has lower density at the outliers than the Gaussian distribution; accordingly, these are referred to as sub-Gaussian distributions. Fig. 3.2 shows three typical distributions: the Laplace distribution, the Gaussian distribution, and the uniform distribution. The Laplace distribution is also called the double exponential distribution.

Figure 3.2: The density functions of super-Gaussian, Gaussian, and sub-Gaussian distributions (Laplace, kurtosis = 3; normal, kurtosis = 0; uniform, kurtosis = -6/5). All distributions are normalized to unit variance. The Laplace distribution has positive kurtosis and is thus super-Gaussian; the uniform distribution has negative kurtosis, which implies sub-Gaussian; and the normal distribution has zero kurtosis.
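The kurtosis values quoted in Fig. 3.2 (formalized in Eq. (3.4) below) can be checked empirically; the following throwaway snippet estimates E{y^4} - 3(E{y^2})^2 from samples of the three unit-variance distributions.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 1_000_000

    def kurtosis(y):
        """Kurtosis as in Eq. (3.4): E{y^4} - 3 (E{y^2})^2."""
        return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

    laplace = rng.laplace(scale=1 / np.sqrt(2), size=N)      # unit variance, kurtosis ~ 3
    normal = rng.standard_normal(N)                          # kurtosis ~ 0
    uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), size=N)   # unit variance, kurtosis ~ -6/5

    print(kurtosis(laplace), kurtosis(normal), kurtosis(uniform))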
This makes comparing kurtosis possible. Kurtosis is a quantitative measurement of Gaussianality, and the kurtosis values of distributions are comparable only when they have the same variance. The classical definition of kurtosis is

kurt(y) = E\{y^4\} - 3 (E\{y^2\})^2,     (3.4)

where E denotes the expectation. If we assume y has unit variance, the kurtosis simplifies to E\{y^4\} - 3. The kurtosis, which is in fact a normalized fourth-order cumulant, can be either positive or negative. Distributions with positive kurtosis are called super-Gaussian, and those with negative kurtosis are called sub-Gaussian; the Gaussian distribution and certain non-Gaussian distributions have zero kurtosis. In the statistical literature, these three cases are also referred to as leptokurtic, platykurtic, and mesokurtic, respectively.

Illustration of independence

Intuition is very important for understanding complex and abstract linear algebra problems. In this section, we try to explain the ICA problem with intuitive figures. We focus on examples in the 2-D data space, since they can easily be plotted.

Suppose we have a random vector s = [s_1, s_2]^T, where the random variables s_1 and s_2 are Laplace distributed with the probability density defined in Eq. 3.3. Fig. 3.3 shows the joint distribution of these random variables. The x and y coordinates of each data sample denote the values of s_1 and s_2. The dot-dash lines represent the source component axes.

Figure 3.3: Two Laplace distributions.

The mixing matrix A is defined in Eq. (3.5). By applying the mixing matrix A to the source signal, we get the mixed signal x = As. Fig. 3.4 displays the mixed distribution. Since the mixing matrix is a linear transformation, it performs a rotation and scaling of the sample data points. The dot-dash lines in Fig. 3.4 are the original independent component axes transformed by the same mixing matrix. The task is then to find the independent components given those mixed data points.

Figure 3.4: Two Laplace distributions mixed by mixing matrix A.

Fig. 3.5 shows an example of sub-Gaussian distributions under the same mixing matrix as in Eq. 3.5. The source distributions are uniform distributions of unit variance with the following pdf:

p(s_i) = \begin{cases} \frac{1}{2\sqrt{3}} & \text{if } |s_i| \le \sqrt{3}, \\ 0 & \text{otherwise.} \end{cases}     (3.6)

Figure 3.5: Example of two uniform distributions. (a) The source signal. (b) Two uniform distributions mixed by mixing matrix A.

The task of ICA is then to find the original source axes, represented by the dot-dash lines.

Preprocessing procedure

The central problem of ICA is to find the mixing matrix or its inverse (the de-mixing matrix). In the complete case (the number of sources equals the number of sensory inputs), the number of degrees of freedom of the mixing matrix A is n x n, where n is the number of sources. This makes the search for the mixing matrix less tractable, since n^2 parameters need to be determined. However, with a carefully designed preprocessing procedure the degrees of freedom can be reduced, which leads not only to a better-conditioned problem but also to faster convergence for iterative approaches. In this section we discuss some common preprocessing strategies for ICA.

Centering: Subtracting the sample mean may be the commonest preprocessing step. It is defined as

\tilde{x} = x - E\{x\}.

This process centers the sample cluster and removes any translation of the data from the estimation of the mixing matrix A. To fully recover the original source signals, the mean E\{x\} can be added back to the recovered signal u(t). In the following sections, unless specified otherwise, we assume the sample vector x has already been centered.
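As a brief numerical aside before turning to whitening, the following sketch is an added illustration (NumPy and the sample size are assumptions): it centers unit-variance samples from the three distributions of Fig. 3.2 and estimates their kurtosis with Eq. (3.4), reproducing the values quoted in the figure caption (about +3, -6/5, and 0).

```python
import numpy as np

def kurt(y):
    """Sample kurtosis of Eq. (3.4): E{y^4} - 3 (E{y^2})^2."""
    y = y - y.mean()                      # centering, as described above
    return np.mean(y**4) - 3.0 * np.mean(y**2)**2

rng = np.random.default_rng(0)
n = 200_000

laplace  = rng.laplace(scale=1/np.sqrt(2), size=n)   # unit variance: var = 2 b^2
uniform  = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # unit variance: var = (2a)^2 / 12
gaussian = rng.normal(0.0, 1.0, n)

print(kurt(laplace))    # ~ +3    (super-Gaussian, leptokurtic)
print(kurt(uniform))    # ~ -1.2  (sub-Gaussian,   platykurtic)
print(kurt(gaussian))   # ~  0    (mesokurtic)
```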
Whitening: Whitening is a very useful preprocessing procedure. As mentioned in the preceding section, the mixing matrix A performs both a rotation and a scaling transformation. The whitening procedure removes the scaling transformation and leaves only the unknown rotation to the ICA step. It therefore reduces the degrees of freedom of the mixing matrix to n(n - 1)/2 (the number of degrees of freedom of an n x n orthogonal matrix), about half of the n^2 degrees of freedom of an arbitrary mixing matrix.

Whitening is well established and can be achieved by an eigenvalue decomposition (EVD) of the covariance matrix E\{xx^T\} = VDV^T, where x is the (centered) sample vector, E denotes the expectation, V is the orthogonal matrix whose columns are the eigenvectors of the covariance matrix, and D = diag(d_1, ..., d_n) is the diagonal matrix of the eigenvalues. Whitening is achieved by

\tilde{x} = D^{-1/2} V^T x.     (3.7)

After whitening, the covariance matrix of \tilde{x} is the identity matrix. Intuitively, whitening rotates the original axes to the principal component directions and scales along each axis by the square root of the corresponding eigenvalue. The whitened vector \tilde{x} therefore has unit variance in every component and its components are decorrelated; for this reason whitening is also referred to as "sphering". Fig. 3.6 shows the effect of whitening on the joint distributions of the mixed signals of Fig. 3.4 and Fig. 3.5(b). What remains is an (up to now) unknown rotation, after which the original signals can be recovered.

Figure 3.6: After whitening, the joint distribution is sphered. (a) Laplace joint distribution. (b) Uniform joint distribution.

Why not Gaussian?

One of the basic assumptions of ICA is that, to avoid ambiguity, at most one source may have a Gaussian distribution. The reason is that for jointly Gaussian distributions decorrelation already implies independence. A whitened Gaussian distribution is centrally symmetric, so any set of orthogonal basis vectors could serve as the independent components. As shown in Fig. 3.7, the density is completely symmetric, so any two orthogonal axes could be the independent basis.

Figure 3.7: Joint distribution of two unit-variance Gaussian random variables.

More rigorously, we need to show that under an arbitrary orthogonal transformation, a Gaussian random vector with identity covariance matrix yields the same probability density.

Theorem 3.1.2 Suppose y = Mx, where M is an n x n orthogonal matrix and x is a Gaussian random vector with zero mean and identity covariance matrix. Then y has the same probability density function as x.

Proof. In general, for any random vector x with density p_x and any invertible matrix M, the density of y = Mx is given by p_x(M^{-1}y) |det M^{-1}| (Papoulis, 1991). Since x has zero mean and identity covariance, its pdf is

p_x(x) = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2} x^T x\right),     (3.8)

and, using M^{-1} = M^T and |det M^{-1}| = 1 for an orthogonal matrix,

p_y(y) = p_x(M^{-1}y) |det M^{-1}| = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2} (M^T y)^T (M^T y)\right)     (3.9)
       = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2} y^T y\right).     (3.10)

Hence y and x have the same pdf. □
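To close this subsection, here is a minimal sketch of the two preprocessing steps just described, centering and EVD-based whitening as in Eq. (3.7). It is illustrative only; the particular mixing matrix, sample size, and NumPy calls are assumptions rather than anything specified in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mixed observations: two unit-variance Laplace sources, mixed as x = A s.
S = rng.laplace(scale=1/np.sqrt(2), size=(2, 10_000))
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)           # centering: x_tilde = x - E{x}

# Whitening by EVD of the covariance matrix, Eq. (3.7): x_tilde = D^{-1/2} V^T x.
C = np.cov(X)                                   # sample estimate of E{x x^T}
d, V = np.linalg.eigh(C)                        # C = V diag(d) V^T
X_white = np.diag(d ** -0.5) @ V.T @ X

print(np.cov(X_white).round(2))                 # ~ identity matrix ("sphered" data)
```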
3.1.2 Overview of ICA methods

The origins of the ICA problem date back to the 1980s, when Herault and Jutten (Herault and Jutten, 1986) proposed a feed-forward neural network with feedback connections. The Herault-Jutten algorithm is based on a gradient-descent updating rule that was able to separate independent sources, although the algorithm converges only under some severe restrictions, e.g. sub-Gaussian source signals. This work opened a remarkable chapter in the history of independent component analysis and blind source separation and inspired many others.

The general framework of ICA introduced by Herault and Jutten was analyzed by Comon (Comon, 1994) in a more theoretical way. Comon proposed to minimize the mutual information of the signals u, where u = Wx is an estimate of the source signals. Viewed differently, this is equivalent to minimizing the Kullback-Leibler (Kullback and Leibler, 1951; Kullback, 1959) distance between the joint pdf and the product of the marginal pdfs. Since the mutual information requires both the joint pdf of u and the marginal pdfs of the u_i, which are all unknown, one can only use approximations in place of the true probability densities. Comon used a high-order cumulant expansion of the Kullback-Leibler distance to approximate the probability densities.

In parallel to blind source separation research, unsupervised learning rules based on information theory were also considered by researchers in the neural network community. Linsker proposed a learning mechanism to maximize the mutual information between the input and output of a neural network (Linsker, 1992). Linsker did not explicitly connect it with ICA or blind source separation, but his method in some respects resembles the infomax principle independently proposed by other researchers (Bell and Sejnowski, 1995; Nadal and Parga, 1994), which is used to separate independent sources. The infomax approach is based on maximizing the information flow (the mutual information between the inputs and outputs) of a neural network with non-linear output units. At about the same time, Maximum Likelihood Estimation (MLE) was proposed to recover the independent sources (Pham et al., 1992). The MLE approach considers the mixed signals to be generated by a statistical model with unknown parameters to be determined; the model parameters are then estimated by maximizing the likelihood of the observations. Strikingly, it was independently proved by several authors (Cardoso, 1997; Pearlmutter and Parra, 1997) that the MLE and infomax principles, as far as source separation is concerned, are mathematically equivalent.

In the neuroscience community, the ICA problem is usually considered as a sparse coding problem. Sparseness is a characteristic of super-Gaussian random variables. Due to the central limit theorem, mixing several super-Gaussian random variables decreases the sparseness of the mixed signals, so searching for the independent directions by maximizing sparseness is a desirable strategy. Many studies have addressed the sparseness measurement issue and derived efficient ICA algorithms (Field, 1994; Olshausen and Field, 1996; Hyvarinen, 2001).

In the following sections we categorize the existing ICA methods and explain the details of each principle.

3.1.3 Methods based on Non-Gaussianality measurement

A straightforward way to recover independent components is to measure their similarity to the Gaussian distribution.
The Central Limit Theorem tells us that the sum of n random variables is also a random variable, whose probability density approaches a Gaussian as n approaches infinity. Intuitively, under fairly general conditions the sum of several non-Gaussian random variables is more "Gaussian-like" than any of the individual random variables. So maximizing the non-Gaussianality of the estimated signals u = Wx with respect to W leads to a solution of the ICA problem.

Kurtosis

A quantitative non-Gaussianality measurement is kurtosis. The kurtosis of a random variable y is defined as kurt(y) = E\{y^4\} - 3(E\{y^2\})^2. Since in the ICA problem signals are generally pre-whitened and of unit variance, the kurtosis reduces to kurt(y) = E\{y^4\} - 3, a shifted fourth-order cumulant. It is worth noting that kurtosis values are comparable only when the variances are equal. For super-Gaussian random variables the kurtosis is generally greater than zero, while for sub-Gaussian random variables it is less than zero. Kurtosis is defined only for a single random variable; therefore ICA based on kurtosis optimization is usually a single-unit algorithm, meaning that each run of the algorithm can separate only one non-Gaussian signal. The criterion function of this estimation is then defined as the kurtosis of the estimated source signal u, i.e. J(w) = kurt(w^T x). The goal is to search the w space and to maximize or minimize the criterion function, for which gradient ascent or other optimization approaches can be used. Hyvarinen and Oja proposed a criterion function (Hyvarinen and Oja, 1997) of the following form,

J(w) = E\{(w^T x)^4\} - 3\|w\|^4 + F(\|w\|^4),     (3.11)

where the first two terms on the right-hand side represent the kurtosis. The last term is a penalty that enforces the norm of w to be one, since without this constraint the kurtosis can be made arbitrarily large. The above criterion function can be optimized with a standard gradient descent/ascent algorithm, but the result is then affected by the choice of learning rate. In the same paper, Hyvarinen et al. used a method related to Lagrange multipliers and obtained an updating rule that only involves computing an expectation, which speeds up the convergence.

Kurtosis is a simple way to measure non-Gaussianality; however, it suffers from several disadvantages. Kurtosis is a normalized fourth-order cumulant, so it is very sensitive to noise in the outliers: a few noise samples with large values will greatly change the kurtosis. Another drawback is that the Gaussian distribution is not the only distribution with zero kurtosis, so the kurtosis measurement cannot recover the independent directions of such distributions.

Negentropy

An important conclusion in information theory is that Gaussian random variables have the largest entropy among all random variables with the same variance; in this sense Gaussian random variables are the most random. In fact, entropy is the optimal criterion for measuring non-Gaussianality. The entropy of a discrete random variable Y is defined as

H(Y) = -\sum_i P(Y = a_i) \log P(Y = a_i).     (3.12)

This definition can be generalized to continuous random variables, in which case it is often called the differential entropy, defined as

H(y) = -\int p(y) \log p(y)\, dy.

The negentropy of y is then defined as J(y) = H(y_{gauss}) - H(y), where y_{gauss} is a Gaussian random vector with the same covariance matrix as y; it is always nonnegative and is zero if and only if y is Gaussian.

3.1.4 High order cumulant based method

The fourth-order cumulants of the output vector u are defined as

C_{ijkl}(u) = ⟨u_i u_j u_k u_l⟩ - ⟨u_i u_j⟩⟨u_k u_l⟩ - ⟨u_i u_k⟩⟨u_j u_l⟩ - ⟨u_i u_l⟩⟨u_j u_k⟩,

where ⟨·⟩ denotes mathematical expectation. Cumulants of a given order form a tensor. The diagonal elements characterize the distribution of single components u_i.
For example, the autocumulants Of second, third, and forth order define variance, skewness, and kurtosis, respectively. The off-diagonal elements or cross—cumulants (all cumulants with i jkl 75 iiii) characterize the statistical dependencies between components. If and only if (iff) all components u,- are statistically independent, the off—diagonal elements vanish and the cumulant tensors of all orders are diagonal (assuming infinite amount of data) (Blaschke and Wiskott, 2002). The ICA method based on high order cumulants is thus to find a linear transforma- tion W that will minimize the absolute value of off-diagonal elements in the cumulant tensors Of u = Wx of all order. In practical terms, apparently infinite tensors are not tractable, therefore third and fourth order cumulants are usually used to approxi- 86 mate this problem (Comon, 1994; Blaschke and Wiskott, 2002). The advantage Of this method is that it requires zero knowledge of the probability densities. Moreover, it is also used to approximate the mutual information as shown in section 3.1.3. Then it clearly shows the relation of high order cumulants and probability density. The order of cumulants used to approximate the problem is a trade-off between computational expense and accuracy. 3.1.5 Methods based on Maximum likelihood and Infomax The goal of maximum likelihood estimation (MLE) is to model the Observation x as generated from the latent random variables 5 via a linear mapping A. In the noiseless case, we can use a parametric density estimator f (x; 6) to find the parameter vector 9 that maximizes the probability of observations x. The estimator can be formulated as 11 ”V MN {I f (w:x(t )) + Nlog IdetWI, (3.20) where W is the de—mixing matrix and is also treated as the parameter vector 0; N is number of Observations; f2; is the marginal probability density Of sources 11 with W as the parameter to be estimated (Pham et al., 1992; Pham and Garrat, 1997). Here we assume [2 is known up to a scaling factor here. The derivation of this MLE equation is due to the fact that: for any random vector 5 with density p; and for any invertible matrix W, the density of x = W‘ls is given by p3(Wx) ldeth (Papoulis, 1991). In parallel to the MLE study, unsupervised learning rules based on information theory were proposed by Linsker(Linsker, 1992). The basic idea here was to maximize 87 the mutual information between the output and signal portion in the input of a linear neural network. This principle was related to the redundance reduction principle suggested by Barlow (Barlow, 1961) as an coding strategy in neurons. By adding a non—linear transfer function to the network, Bell and Sejnowski (Bell and Sejnowski, 1995) proposed the infomax principle (information maximization), which used the stochastic gradient method to seek the maximization of mutual information between the input and output. A similar method was independently developed by Roth and Baram (Roth and Baram, 1996). Strikingly, the infomax and MLE are proved to be equivalent by Cardoso (Cardoso, 1997) and Pearlmutter and Parra(Pearlmutter and Parra, 1997) independently. They Show that the MLE and the infomax can both be explained by the Kullback-leibler divervence between Wx and the estimated source 11. The major problem of MLE and infomax principle is that they both assume that the marginal probability density Of source s is known, which in most of the cases is unknown or intractable. In practise, both methods need an approximation of the density. 
For example, in Bell and Sejnowski’s method they assumed a sigmoid function as the cumulative density function of the source random variables. Since the corresponding probability density function of the sigmoid function is a super-Gaussian distribution, Bell and Sejnowski’s method can only deal with some super-Gaussian sources with questionable accuracy. Other researchers extended the infomax approach to address this problem. Lee, Girolami and Sejnowski (Lee et al., 1999) suggested an extended infomax algorithm, in which they use mixture of Gaussian distributions to simulate the prob- 88 ability density Of sub— and super-Gaussian signals and then are able to separate both type Of signals. The original infomax algorithm involves inverting the feedforward weight matrix W, which makes the computation expensive. Amari et. al (Amari, 1998; Amari et al., 1996) and Cardoso et. al (Cardoso and Laheld, 1996) soon realized that it can be improved by using the natural gradient (or referred as relative gradient in Cardoso et. al’s paper), which multiplies the gradient of the W matrix by a positive definite matrix WTW. This change will speed up the convergence by removing the matrix inversion. 3.1.6 Methods based on Sparseness measurement As proposed by Olshausen and Field and others (Field, 1994; Olshaushen and Field, 1996; Olshaushen and Field, 1997), the cerebral cortex uses a sparse coding scheme. Statistically, given any input from the environment, only a few neurons respond actively. Most other neurons do not, or hardly, respond. The Sparse coding is desirable since it effectively transforms an input event that is represented by many active pixels quickly down to a few neurons, which is Similar to the redundance reduction principle suggested by Barlow (Barlow, 1961). The reason that Sparseness is a reasonable conjecture, suggested by Olshaushen (Olshaushen and Field, 1997), lies in that natural scene can be usually described by a few structure primitives, such as edges, lines and curves. These features, if filtered with Garbor or wavelet filters, will Show a sparse response pattern. 89 Olshausen and Field’s suggested the following criteria function: 2 E(w) = Z I(t) — gig-(ow,- + ZS (El—0(9), (3.21) t i where the first term at the right Side of the equal sign is the reconstruction constrain, and second term is the Sparseness measure. The reconstruction constrain is used merely for avoiding the trivial solution, which means all component vectors are zero vectors. '50 2‘0 4‘0 60 so Figure 3.8: Gaussian signal vs. sparse signal. The sub plot on the top shows a zero mean unit variance Gaussian random variable. The X axis denotes different Observations, and Y axis is the value. The bottom sub plot shows a sparse random variable with same mean and variance. The sparse coding principle implies that the statistics Of natural images is super Gaussian: it consists of a high probability region near the center, but with heavy tails. The high probability region near the center corresponds to most cases where inputs are nearly orthogonal to the feature that the neuron represents. The heavy tails correspond to the cases that the neuron represents and, thus, should respond. How does a set of neurons detect super-Gaussianality? A common neural archi- 90 tecture is shown in Fig. 3.9. Figure 3.9: Two neurons with horizontal inhibitions. Hollow triangles indicate excitatory connections and solid triangles indicate inhibitory ones. Fmrn the figure we can see that each neuron does not develop its weights alone. 
Consider a neuron, its response is inhibited by the response from nearby neurons that shares the Similar receptive field. It also inhibits its neighboring neurons. If this neuron responses most strongly in the neighborhood, it most effectively inhibits its neighboring neurons, which further weakens the inhibition from its neighbors. After a few milliseconds, it effectively inhibits its neighbors, although not totally. In section 3.2.8 we will propose an algorithm called Lobe Component Analysis (LCA) based on the Sparseness measure. And in section 3.4, we will discuss the Sparseness in more details. 3.1.7 Summary ICA research has become very active. There has been at least one regularly scheduled conference series (Int’l Conf. on Independent Component Analysis and Blind Signal Separation) and several special issues (IEEE Proceedings, Oct. 1998; Signal Process- ing, vol. 73, 1999; IEICE Transactions, March 2003; Journal of Machine Learning Re- search, Dec., 2003; Neurocomputing, vol. 49, 2002, etc.). Many alternative methods 91 have been proposed, using the principles of minimum mutual information, maximum likelihood, infomax, higher-order cumulants such as Kurtosis, etc. (see a survey by Hyvarinen (Hyvarinen, 1999)). In the following, we introduce four types of algorithm. Type-1: Batch. A batch algorithm collects observed data as a batch and then calculates the results. A batch algorithm cannot deal with very long or open-ended data streams and, therefore, cannot compute the results in real-time. Type-2: Block-incremental. A block- incremental algorithm is not allowed to store training data as a batch for computation. Instead, it can only store statistics, e.g., mean, covariance matrix and other higher order statistics for incremental computation. The block enables it to compute these statistics from the data in a temporal block (a small batch). A block-incremental algorithm can deal with Open-ended data streams but the result is delayed by a block size. Type-3: Incremental. An incremental (also called sequential or adaptive) algorithm is a framewise incremental algorithm (block size is one time frame). By default, an incremental algorithm has a block size of 1 unless the term “block” is stated. An incremental algorithm must update its statistic estimates for one time frame at a time and, thus, it can process Open-ended data streams with little delay. Type-4: Covariance-free incremental. Such an algorithm is an incremental algo- rithm, but further it is not allowed to compute the covariance matrix or other higher order statistics matrices. In summary, from Type 1 to Type 4, the conditions under which an algorithm is run becomes progressively more restrictive. In general, one Should not expect that an algorithm of Type (i + 1) to converge faster than one Of Type (i), i = 1,2,3. 92 The state—Of-the-art batch algorithms include FastICA by Hyvarinen & Oja (Hyvarinen and Oja, 1997; Hyvarinen and Oja, 2000), which is among the fastest Type-1 ICA algorithms in terms of speed of convergence and its high capability to handle high-dimensional data. The Extended Infomax algorithm by Sejnowski and coworkers (Bell and Sejnowski, 1997; Lee et al., 1999) is a well-known Type-2 ICA algorithm. The NPCA—RLS algorithm by Karhunen (Karhunen and Pajunen, 1997) is a Type-3 ICA algorithm. We are not aware of any previously existing Type-4 ICA algorithm. A Type4 algorithm has several major advantages: It can process Open-ended streams, causing little delay, and does not even store covariance matrix. 
The last property is especially important when the dimension d is large. Its potential rela- tionship with the “in-place” development Of a neuron will be discussed in Section 3.5 Broader Impact. By in-place development, we mean that an (artificial) neuron has to learn on its own while interacting with nearby neurons, to develop into a feature detector. A neuron with d connections does not have (1 x (1 storage Space at its disposal. In other words, a Type-4 ICA algorithm for each ICA component is an in—place learner. We can expect that designing an effective Type—4 ICA algorithm is not trivial. 3.2 Lobe Component Analysis Suppose a sequentially arriving series Of vectors x1, x2, ..., where each input vector xt, drawn from a high-dimensional random space X, is spatially highly correlated. It 93 is assumed that the temporal dependence of xt is weak and not of interest. The basic need of our proposed research is to estimate the probability structure of X. Given an arbitrary high-dimensional space X, the distribution of x E X may not necessarily have independent components. In other words, there exist no independent T components w1,w2, ...,wd, so that their projection from x E X : x,- = x wi, i = 1, 2, ..., d, are statistically independent so that their pdf. is factorizable: p(X) = P1($1)P2($2) ° - ram)- Fig. 3.10 illustrates three cases. The first two cases are factorizable. The third case, a more general case, is not factorizable. In many high-dimensional applications (e.g., images and financial data streams), the pdf. is not factorizable. 3.2.1 Fine structure of high-dimensional density High-dimensional density estimation is an important and yet very challenging problem which has been extensively studied in mathematical statistics, computer science and engineering. Major methods for probability density estimation include histograms, naive estimators, kernel estimators, nearest neighbor estimators, variable kernel meth- ods, orthogonal series estimators, maximum penalized likelihood estimators and gen- eral weight function estimators (see, e.g., Silverman (Silverman, 1986) for a survey). For high-dimensional data, kernel-based methods, e.g., EM algorithms for a finite normal mixture (also called a mixture of Gaussians) (McLachlan, 1997), have been widely used. 94 These traditional methods are problematic when they are applied to practical high-dimensional data. The problems include: (1) The lack Of a method for high- dimensional density estimation that satisfies the three stringent operative require— ments: incremental, coveriance—free and undersample. By undersample, we mean that the incremental algorithm must work even when the number Of samples is smaller than the dimension d. (2) The lack Of an effective method to determine the model param- eters (e.g., the means, covariances, and weights in the mixture-Of-Gaussian models). (3) The lack Of a method that gives a correct convergence (a good approximation for high-dimensional data), not just convergence to a local extremum (as with the EM method for mixture of Gaussians). (4) The lack of a method that is not only optimal in terms Of the objective function defined, but also in terms of the convergence speed in the sense Of statistical efficiency. Approaches to estimating statistical distribution fall into two categories: global and local. A global approach uses shape characteristics of the distribution described by global higher order statistics. 
A global statistic is an expectation of the entire space of distribution, such as covariance matrix, kurtosis and other higher order cumulants (or moments). A local approach estimates the distribution by decomposing the space of distri- bution into smaller regions, and each region is estimated by relatively lower order statistics (which is sufficient for local data). The kernel-based methods (e.g, the mixture-Of-Gaussian methods) belong to the local approach. Techniques of global approaches can be used for local approaches. A major advantage Of a local method is to decompose a complex global modeling 95 problem into simpler local ones SO that lower order statistics are sufficient. Higher order moments are sensitive to noise and outliers and are not effective for estimating a complex distribution. Local density approximation using lower-order statistics is more effective.3 The challenges with a local approach include how to effectively decompose the global complex problem into local simpler ones and how to integrate solutions to local problems into the solution to the original global complex problem. The proposed LCA uses a global-tO-local (coarse-tO-fine) estimation scheme. (1) First, we estimate the position: the mean i of the data (through incremental es- timation of i) and then subtract 3‘: out of the observation. Thus, the transformed data has a zero mean. (2) Next, we estimate the scale: Incrementally estimate the covariance matrix of x, and then whiten the zero mean observation so that it has a unit covariance matrix. (3) Estimate the lobe components from the zero-mean, unit covariance Signals. Each lobe component represents a high concentration of p.d.f.. Without whitening the data, the drastic scale difference in different data components can obscure the fine details of the lobe components. 3.2.2 Global: Whitening the data For step (2), we use PCA on a d-dimensional distribution (the proposed CCILCA algorithm will compute PCA with a slight modification), the unit principal component vectors v1 , v2, ..., Vd (eigenvectors of the covariance matrix of x E X) are found where v,- is associated with the eigenvalue M, with A1 2 A2 2 2 Ad. The space X is 3To see this global versus local contrast more intuitively, consider the problem of shape approx- imation. The (global) Taylor expansion is not practical for approximating a complex real-world shape due to its use of higher order derivatives. In contrast, low order splines are more effective. 96 divided into two parts, the major space M: M = span{v1,v2, ...,vk} and the residual space R = span{vk+1,vk+2, ...,vd} so that X = M ED R where EB denotes the orthogonal sum. The dimension k of space M is chosen so that the ratio of the covariance in R over the original space X, Zf=k+1 Ai/Zgzl A7; is a small percentage (e.g., 1%). From the result of PCA, the whitening matrix is: —1/2 /2 —1/2 ,A2 ,...,,\k V = [v1,v2, ...,vk]diag{/\1~1 }. The whitened observation is y = VT(x — i). In the following, we assume that y is whitened: it has a zero mean and a unit covariance matrix. 3.2.3 Local: Lobe components The sample space of a white, zero-mean random vector y in k-dimensional space can be illustrated by a lit-dimensional hypersphere, as shown in Fig. 3.11. It is known that although the whitening process only normalizes the variance along orthogonal principal directions, the whitened vector y has a unit variance in all the possible directions. That is the meaning of the circle in Fig. 3.11. But the data can go out of the circle. 
If the distribution of y is Gaussian, the samples will be equally dense in all direc- tions. In general, the distribution is not Gaussian and the probability density may 97 concentrate along certain directions (although the global covariance Of projections along any given direction is unit). Each major cluster along a direction is called a lobe, illustrated in Fig. 3.11 as a petal “lobe”. Each lobe may have its own fine structure (e.g., sublobes). The Shape of a lobe can be of any type, depending on the distribution, not necessarily like the petals in Fig. 3.11. If we assume that x and —x are equally likely (e.g., a patch of skin will have the equal chance of feeling the onset and the offset of a press), the distribution is then symmetric about the origin. (But this is not necessarily true in general and, consequently, nonsymmetric lobes can be defined.) Given a limited resource — c vectors, LCA divides the sample space X of x into c mutually nonoverlapping regions, called lobe regions: 32: R1 URgU...URc, (3.22) where R, (l Rj = d), if i 76 j, as illustrated in Fig. 3.11. Each region is represented by a single unit feature vector “’2': called the lobe component. The partition issue presented here is similar to Kohonen’s Self Organizing Maps (SOM) (Kohonen, 2001) or, more general, vector quantization (Gray and Neuhoff, 1998; Zador, 1982; Gersho, 1979). We do not use Euclidean distance as with SOM and quantization (we use the concept of high-dimensional probability density of a whitened random vector space instead). Further, each region is infinite (while most cells in SOM and vector quantization are bounded), Since we will generate a response T(x) for any vector in the whitened space. T (x) does not approximate x but represents 98 it as a feature parameter vector (vector quantization uses the code itself as a direct approximation of input x). The response is useful for visualization, decision-making, prediction or regression. Suppose we have computed c lobe components (column unit vectors) wi, i = 1. 2, ..., c. They are not necessarily orthogonal and not necessarily linearly indepen- dent, depending on the distribution of the whitened space and the number 0. They span a lobe feature subspace F = span{w1, W2, ...,wc}. (3.23) Any vector x in M is represented by the inner product vector: l= [w1,w2,...,wc]Tx. It is called the response from the lobe components, which can be used for visualization or further decision-making using support vector machines (Christianini and Shawe- Taylor, 2000; Lin, 2002) or the Incremental Discriminant Regression developed by Weng & coworkers (Hwang and Weng, 2000; Weng and Hwang, 2002). The discrete probability in the lit-dimensional whitened space M can be estimated in the following way. Each region R,- keeps the number of hits n(i), which records the number of times samples of x fall into region R,. Then, the continuous distribution 99 of x can be estimated by a discrete probability distribution of c regions: P{x E R1} = ”—782, i=1,2,...,c. The larger the number c, the more regions can be derived and, thus, the finer partition of the sample space 32. 3.2.4 Belongingness The remaining major issue is how to compute the lobe regions. We need to determine what kind of vector wz- is best to represent a lobe region R,- (Or for that matter, any subregion of M). 
If x belongs to a region Ri, x can be approximated by wz- as x z x = (x - wi)w,- (it minimizes the mean square error E ”x — itl|2 for a fixed unit Wil- We determine the unit vector w,- as the one that minimizes the following error of approximation: T llx — (x - wave-n? = x x - (x - w.)2, (3.24) after expansion using the fact that wz- is a unit vector. The expected error Of the above approximation over region 12,- is 2 T 2 T T T EIIX- (X'Wilwill = Elx X- (X°Wz') l= Elx X-Wi XX wt] 2 E[xTx] - w;E[xxT]wz- = trace(2x) — wfzxwgsas) 100 where 2x is the covariance matrix of x. We determine the unit wi to maximize the above approximation error. Since trace(2x) is constant, the unit w,- that minimizes the above expression is the one that maximizes wz-T Exwi. From the standard theory of PCA (e.g., see (Jolliffe, 1986)), we know that the solution wz- is the unit eigenvector of Ex associated with the largest eigenvalue A“: Aidwi = Exwi. In other words, wz- is the first principal component of Ex, where expectation of Ex is over the region of approximation, Rt or any other region. Relacing 2x by the sample covariance matrix, we have )‘iJWi a: i—Z x(t)x(t)Tw,- = %Z(x(t) ~w,-)x(t). (3.26) t=l i=1 We can see that the best lobe component vector can be estimated by the average of x(t) multiplied by the projection (x(t) w,). This is an important batch estimation equation: It means that the candid version of wz- is equal to the average of yt = (x(t) - wi)x(t). By candid, we mean that we keep the covariance of the projections onto Wi (which is A231) with wz- and, thus, the estimator for Ai,lwi is computed instead of the unit “’2' alone. This scheme is needed for the quasi-optimal efficiency to be discussed in Sec. 3.2.7. If wi(n — 1) has already been estimated using n — 1 samples, we can use it to 101 compute yt(n) defined as: z x(t) - (“n(n — nix“) W) iiw.(n — 1)“ (3.27) Then Eq. (3.26) states that the lobe component vector is estimated by the average: 1 Tl A,,1w,- z E z yt('n). (3.28) t=1 The larger the inner product x(t) - wi, the more applicable the input x(t) is to the region that the lobe component wz- represents. Thus, the inner product x(t) - “’2' is the “belongingness” Of x(t) to the region represented by Wi- For symmetric region Ri, we use the absolute value of x(t) - w,- as the belongingness. This mechanism not only enables us to compute the best wz- given a region, but also enables many lobe component vectors to compete when data x(t) are sequentially received. The vector wz- whose belongingness is the highest is the “winner,” which best inhibits all other vectors. The winner uses the current input x(t) to update its vector, as in Eq. (3.26), but all others do not. 3.2.5 Statistical efficiency Suppose that there are two estimators F1 and F2, for vector parameter 0 = (01, ..., 6k), which are based on the same set of observations 8 = {y1,y2, ..., yn}. If the expected square error Of F1 is smaller than that of F2, i.e., Elll‘l — 6H2 < E||F2 — 0H2, the estimator F1 is more statistically efficient than F2. 102 Statistical estimation theory reveals that for many distributions (e.g., Gaussian and exponential distributions), the sample mean is the most efficient estimator of the population mean. This follows directly from Theorem 4.1, p. 429-430 of Lehmann (Lehmann, 1983), which states that under some regularity conditions satisfied by most distributions (such as Gaussian and exponential distributions), the maximum likelihood estimator (MLE) 6 of the parameter vector 6 is asymptotically efficient, i.e. 
its asymptotic covariance matrix is the Cramér-Rao information bound (the lower bound) for all unbiased estimators, via convergence in probability to a normal distribution: «at? — 9) 3» N{O,I(6)_1} (3.29) in which the Fisher information matrix I (6) is the covariance matrix of the score WOW) WOW 59 ‘ k 1 , .., }, and f (y, 6) is the probability density of random vector vector { y if the true parameter value is 6 (see, e.g., Lehmann (Lehmann, 1983), p. 428). The matrix I (6)”1 is called information bound Since under some regularity constraints, any unbiased estimator 6 Of the parameter vector 6 satisfies cov(6 — 6) 2 I (6)"1 /n (see, e.g., Lehmann (Lehmann, 1983), p.428 or Weng et a1. (Weng et al., 1993), pp. 203204)? Since in many cases the MLE of the population mean is the sample mean, we estimate the mean 6 Of vector y by the sample mean. Thus, we estimate an inde- pendent vector Alwi by the sample mean in Eq. (3.28), where yt(n) is a “random “For two real symmetric matrices A and B of the same size, A 2 B means that A — B is nonnegative definite, which implies, in particular, that the diagonal elements are all nonnegative, which gives the lower bound for the variance of every element of the vector estimator of 6. 103 Observation” . Since we do not know the distribution of yt(n) and it is even dependent on the currently estimated w,- (i.e., the observations are from a nonstationary process), we use the amnesic mean technique below which gradually “forgets” Old “Observations” (which use bad yt (it) when n is small) while keeping the estimator reasonably efficient. 3.2.6 Amnesic mean The mean in Eq. (3.28) is a batch method. For incremental estimation, we use what is called an amnesic mean (Weng et al., 2003). _ n — 1 — n _ _ 1 + n y(n) = :U'( )y(n I) + [J'( )Yn (330) n n where p(n) is the amnesic function depending on n. If u E 0, the above gives the straight incremental mean. The way to compute a mean incrementally is not new but the way to use the amnesic function Of n is new for computing a mean incrementally. 0 if n S n1, n(n) = C(n — n1)/(n2 — n1) if n1 < n S 17.2, (3-31) c+(n—n2)/r ifng 10, the amnesic mean with n(t) E 1 increased about 50% (for the same n) from that for p(t) E 0 and with p(t) E 2 it increased about 100%. From Fig. 3.13 we can see that a constant positive p(t) is not best when t is small. The multi-sectional amnesic function p(t) in Eq. (3.31) performs a straight average for small t to reduce the error coefficient for earlier estimates, and when t is very large, the amnesic function changes with t to track the slowly changing distribution. Therefore, the multisectional amnesic function p(t) is more suited for practical signals with unknown nonstationary statistics. It is appropriate to note that the exactly Optimality of the multisectional amnesic function is unlikely under an unknown nonstationary process (not i.i.d.), unless an assumption of certain types of nonstationary process is imposed, which iS not, however, necessarily true in reality. In summary, we should not expect that an estimator suited for an unknown non- stationary process tO have the same expected efficiency as ones for an i.i.d. stationary process. The distribution of signals received in many applications is typically non- stationary and, therefore, an amnesic mean with a multisectional (dynamic) amnesic function is better (see later Fig. 3.16). 107 3.2.8 Proposed CCILCA Algorithm The Candid Covariance-free Incremental LCA algorithm incrementally computes the lobe matrix W(t) =[w1(tt),wg ), .. 
.,w£t)], for lobes, from whitened samples x(l),x(2), of dimension k without computing the k X k covariance matrix. The number c is the number Of lobe components to be extracted (predetermined limited resource). The length Of wz- is the variance of projections Of the vectors x(t) in the i-th region onto Wi- The algorithm is shown in Algorithm 3. Algorithm 3 The CCILCA algorithm. 1: Initialize using observations: for t = 1, 2, ,,c .wgc— C)- xt( ) and at = 1. 2: for t = c+ 1,c+2,... do 3: for i = 1,2, ...,c do 4: Compute the response Of all lobe vectors x(t) -w§t‘”/iiw§“” u. 5: Decide the winner: j— - arg maxlSiSCflli |}. 6: Update the number of wins nj <— nj + 1, compute the amnesic weight coef- ficient a(nj + 1) and 5(le + 1)with Eq. 3. 32 and Eq. 3. 33. 7: update winner using amnesic mean Of the winner lobe component: (n ') (Tr-1) j J = a(nj +1)Wj '7 + C(71): +1)ljx(t) (3.38) 8: end for 9: end for Algorithm 3 requires that input be whitened. The following is the CCILCA algo- rithm that does not require inputs to be whitened. It normalizes the output power Of each lobe component while computing all the lobe components. 108 Algorithm 4 CCILCA algorithm without pre-whitening. Initialize using observations: n = 0. For t = 1, 2, ..., e, do WEC) = x(t), nt = 0, and at = IIX(t)Il2- for t = c+ 1,c+2, do For i = 1, 2, ...c, compute the projection of the i-th cell 1.- = x(t) -w§t‘1’/iiw§t’”il. Generate the normalized response 2,- = I”, /o,~. Decide the winner among the normalized responses: '=ar max z~ . J gISiSC“ ill For the winner, the j-th cell, it - r— n -+1, compute the amnesic weight coefficient a(nj + 1) and 6(nj + 1)with Eq. 3.32 and Eq. 3.33. Update the amnesic average oj (the variance Of response) of all cells: oj <— a(nj + 1)oj + H(nj + 1)l§. Update the winner’s lobe component vector using the amnesic average (cell’s Hebian learning): Update the weights for the winner: (i) J = a(nj + vii-"1” (“"3 +1)zjx(t)/\/BT’ end for 109 The above two algorithms are simple. They never compute any k X k matrix (covariance-free) . The above algorithm computes CCI PCA (Weng et al., 2003) if the step 2 is cut short to contain only steps (a) and (c). Step (c) is modified so that for each vector (t) “’2': the residual from the previous vector wz._1 is computed, and then the residual ( is used as the input to the next vector wit—l) for updating. NO winner-takeall is needed. Therefore, the pre—whitening step can be computed by the same type (CCI) of algorithm. The difference in the pre—whitening step is to use the residual so that the resulting vectors are orthogonal (a requirement Of PCA). After we computed the whitening matrix V and the lobe component matrix W, we have BT = WTVT and l = BT(y — y) as projections onto lobe components, or response of lobe components. We can also define a neighborhood function hJ-i(t), which acts as a smoothing kernel. Usually, hjz-(t)) = h(||rj — Till: t), where rj, rz: are the location vectors of node j and i. hfl( t) Often has the shape Of “Mexican hat” or triangle. Then, the algorithm is changed to Algorithm 5. 3.2.9 Time and space complexities Given each k-dimensional input x, the time complexity for updating c lobe compo- nents and computing all the responses from x is 0(ck). Since LCA is meant to run in real-time, this low update complexity that is important. If there are t input vectors, 110 Algorithm 5 Topographic CCILCA. 1: w,- = y;, i = 0,1,2,...,k. 2: ni=1,i=1,2,...,k. 
3: for t = 1,2, do 4: j = arg max i 5: for All component vector wi(n,-), i = 1, 2, ..., is do 6: w-(n-+1) = a(n-)w-(n-) + h--(t)fl(n-)wi(ni)T ’Yty (3 39) ' z z z ' 1* ’ ”WI-(null " ' where wi(ni) is the component vector wz- after the ni-th updating, a(n,-) and 6(ni) are given in Eqs. 3.32 and 3.33, respectively. 7: nz- = ni + l. 8: end for 9: end for the total amount Of computation is 0(ckt). The LCA’S Space complexity is 0(ckt), for c neurons with 1: dimensional input x. In fact, the above Space and time complexities are the lowest possible ones: Since c vectors need to be computed and each vector has 1:: components, the space complexity cannot be lower than 0(ck). Phrther, the time complexity cannot be lower than 0(ckt) because the responses I = (l1, l2, ..., lc) for each of t inputs need that many computations. Suppose that each lobe component (vector) is considered as a “neuron” and the number of hits n(j) is its clock Of “maturity,” which determines the single weight wl (2122 = 1 -— ml) for its updating. The CCILCA algorithm is an in-place development algorithm, in the sense that the network does not need extra storage or an extra developer. The winner-takeall mechanism is a computer simulation of the lateral in- hibition mechanism in the biological neural networks (see, e.g., Kandel et al. (Kandel et al., 2000) p. 4623). The inhibition-winner updating rule is a computer simulation lll of the Hebbian mechanism in the biological neural networks (see, e.g., Kandel et a1 (Kandel et al., 2000) p.1262.). 3.2.10 LCA is ICA for super-Gaussians Further, the components that have a super-Gaussian distribution correspond to lobe components we defined here. Each linear combination Of k super-Gaussian indepen- dent components correspond tO symetric lobes, illustrated in Fig. 3.11. Therefore, if components in s(t) are all super-Gaussian, finding lobe components by CCILCA is equivalent to finding independent components. 3.3 Experiment Results 3.3.1 Low dimensional simulation The proposed algorithm was first applied to a low dimensional data set, which contains 2 i.i.d source components. Each of the source components is taken from a Laplace distribution of zero mean and unit variance, thus, is a super-Gaussian random vari- able. Fig. 3.14(a) shows 5000 sample points used in the experiment. The directions of the source independent components and evaluated independent components are illustrated, which are almost identical. A more detailed error over the number Of samples used is shown in Fig. 3.14(b). 112 3.3.2 Comparison with Other ICA Methods Suppose that the distribution of k-dimensional random input x is stationary. In CCILCA, the lobe component vector w,- converges to the eigenvalue scaled eigenvector Ai.1hi.1 in the mean square sense and the speed Of convergence is estimated as t 2 2 ErllWl ) - Ai,1hi,1ll z Elm where o is the estimated average component-wise variance of observation yt = (x(t) - (t-I) wz- )x(t)/||wzlt_1)||. The convergence is not to a local extremum. It is to the correct solution. The quasi-Optimal convergence speed is due to the use of statistical efficiency in the algorithm design. If the distribution of x changes slowly, the above error estimate is still applicable. The quasi-Optimal statistical efficiency appears to drastically improve the capac- ity to deal with high—dimensional data. 
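Before the comparison, a compact sketch of the winner-take-all CCILCA update of Section 3.2.8 may help fix ideas. This is a simplified, illustrative reading of Algorithm 3, not the reference implementation: the amnesic weight formulas and the parameter values in mu(n) below are assumptions (Eqs. (3.32)-(3.33) are not reproduced above), and the input is assumed to be already whitened.

```python
import numpy as np

def amnesic_mu(n, n1=20, n2=200, gain=2.0, r=2000.0):
    """Multi-sectional amnesic function mu(n) of Eq. (3.31); parameter values are guesses."""
    if n <= n1:
        return 0.0
    if n <= n2:
        return gain * (n - n1) / (n2 - n1)
    return gain + (n - n2) / r

def ccilca(X, c):
    """Winner-take-all CCILCA update on whitened samples X (k x T) with c lobe components.

    A simplified reading of Algorithm 3: only the winner updates, as an amnesic
    average of its response-weighted input l_j * x(t); all other vectors are untouched.
    """
    k, T = X.shape
    W = X[:, :c].copy()                 # initialize lobe vectors with the first c observations
    n = np.ones(c)                      # per-component update counts ("maturity" clocks)
    for t in range(c, T):
        x = X[:, t]
        l = W.T @ x / np.linalg.norm(W, axis=0)       # responses of all lobe vectors
        j = int(np.argmax(np.abs(l)))                 # lateral inhibition: winner takes all
        n[j] += 1
        mu = amnesic_mu(n[j])
        alpha = (n[j] - 1 - mu) / n[j]                # retention rate (assumed form of Eq. (3.32))
        beta = (1 + mu) / n[j]                        # learning rate  (assumed form of Eq. (3.33))
        W[:, j] = alpha * W[:, j] + beta * l[j] * x   # Hebbian winner update, cf. Eq. (3.38)
    return W
```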
We selected two state-of-the-art incremental ICA algorithms, the Type2 Extended Bell-Sejnowski (ExtBS) (Lee et al., 1999) and Type-3 (NPCA-LS) (Pajunen and Karhunen, 1998; Karhunen et al., 1998) for per- formance comparison with our proposed Type-4 CCILCA algorithm. The choice of these two algorithms is due to their superior performance in the comparison results of (Giannakopoulos et al., 1998). We used the downloaded code from the authors for the ExtBS algorithm. The NPCA algorithm used was proposed in (Pajunen and Karhunen, 1998) with [3 = 0.998 and g as tanh. The ExtBS algorithm was run with the following set Of parameters: blocksize = 1, learning rate = 0.001, learning factor = 0.985, 113 momentum constant = 0.1, number of iterations = 1, step = 5, and block size for kurtosis estimation is 1000. This version is called ExtBSl, the number 1 indicating that the block Size for updating was 1. Thus, ExtBS 1 is a partial sequential algo— rithm, sequential for independent component update but computation for kurtosis is block-incremental. We called it Type-2.5. To investigate the capabilities to handle high-dimensional data, we synthetically generated random data with different dimensions. The number n Of samples used in each run is a function of the input dimension d (we used n = 1000d samples). Each independent source has a Laplacian distribution. The mixing matrix was chosen randomly and was non-degenerate. The error between the direction of the true and the estimated independent components was measured as the angle between them in radians. The results were averaged over 50 runs. Figs. 3.15(a) and (b), Show that these two better performing ICA algorithms have problems in converging when the dimension d increases beyond 5 and the correct convergence fails when d = 25. The CCILCA consistently give considerably smaller error and take the least CPU time among the three. To Show more detail about the profile of decreasing error, we plot the errors in Fig. 3.15(c). As indicated by the figure, both the Type-3 algorithm NPCA and Type-2.5 ExtBSl do not converge for a moderate dimension of d = 25 although ExtBSl does fine when d = 2. The proposed Type-4 CCILCA does well. Next, we compare our CCILCA algorithm with Type-2 Extended Bell-Sejnowski (or extended infomax) (Lee et al., 1999) with block size 1000 and the Type-1 batch algorithm FastICA (Hyvarinen and Oja, 1997; Hyvarinen and Oja, 2000). This is not a fair comparison since a Type-4 algorithm (like CCILCA) should not be expected to 114 out-perform a Type-1 or Type-2 algorithm. We compare them anyway to understand the limit when CCILCA is compared with two state-of—the—art Type-1 and Type-2 algorithms. It is well known that ICA algorithms are “data grizzlies”. Typically, even for a low dimension simulation task (e.g., d = 2), ICA algorithms need thousands of samples to approach the independent components. Convergence in respect to the number of samples used in training is a good evaluation of the efficiency of ICA algorithms. For a higher dimension, we synthetically generated random observations from an i.i.d. Laplacian random vector with dimension Of 100. The results are Shown in Fig. 3.16, where the y—axis marks the number of samples and the x—axis indicates the average error in radians. In order to Show more detailed aspects of CCILCA, three variations are tested. “LCA with fixed m” (m stands for the amnesic function u(n)) and “LCA with dynamic m” are original LCA methods with a fixed u and varying n(n) as defined in Eq. (3.31), respectively. 
“LCA eliminating cells” algorithm dynamically eliminates cells whose hitting rate is smaller than 3/4 of the average hitting rate, Since sometimes two vectors share a single lobe (which is very rare and does not significantly affect the purpose of density estimation by lobe components but does affect our error measure). As shown in Fig. 3.16, all the three Type-4 LCA algorithms converged very fast, faster than the Type-2 algorithm Extended infomax and even the Type-1 FastICA. The batch Extended Infomax algorithm needs much more samples, therefore, it did not converge in these tests. It is somewhat surprising that the proposed Type-4 CCILCA algorithm, oper- ating under the most restrictive condition, out-performs the state—Of-the-art Type-3, 115 Type—2, and Type-1 algorithms by a remarkably wide margin. We think this is mainly due to the new lobe component concept and the quasi-Optimal property of the statis- tical efficiency. The simple structure of the CCILCA algorithm is also very attractive. 3.3.3 Cocktail Party Problem We test the algorithm on a simulation of the cocktail party problem. Nine sound sources are mixed by an orthogonal matrix. Each sound source is 6.25 seconds long and the sampling rate is 8.0KHz in 8 bits mono format. Therefore each sound source contains 50000 values. The sound clips are downloaded from http://www. cis .hut . f i/pro jects/ ica/ cocktail/ cocktai1_en. cgi, where they use these sound clips to test the FastICA algorithm (Hyvarinen, 2001). Here, we simplified the problem Since the mixing matrix is a pure rotation matrix, 116 thus no whitening process is needed. The mixing matrix is shown in Eq. —0.47 0.31 —0.30 —0.07 0.13 -—0.22 —0.17 —0.11 0.36 0.14 —0.52 -—0.18 —0.21 0.08 —0.66 0.41 —0.18 —0.02 0.41 0.05 M: 0.43 —0.26 —0.11 —0.29 —0.57 —-0.26 -—O.28 0.62 0.12 —0.17 0.13 0.24 —0.45 0.55 —0.23 0.05 0.06 —0.32 —0.54 0.03 L—0.11 -0.78 —0.39 -0.04 0.34 0.27 0.55 —0.07 0.24 —0.65 0.05 0.21 —0.07 0.11 0.71 0.25 —0.18 0.03 0.12 0.14 —0.11 0.16 —0.40 —0.46 0.05 —0.37 0.20 —0.34 —0.66 —0.24 0.22 -—0.02 3.40. —0.43 -—0.53 0.39 0.18 —0.54 —0.20 —0.04 0.11 0.06 (3.40) Fig. 3.17(a) shows one Of the nine original source signals. Fig. 3.17(b) displays one of the nine mixed sound signals. The mixed signals are first whitened, then we applied the proposed algorithm to the mixed sound signals. It is worth noting that the proposed algorithm is an incremental method. Therefore, unlike other batch ICA method doing iterations on the date set, we have used data only once and then discarded them. In Fig. 3.17(c) The independent components quickly converge to the true ones, with a good approximation as early as 1.5 second. 3.3.4 Natural Image ICA As one Of our major applications, we also conduct our experiment on natural image patches. 500,000 Of 16 x l6—pixel image patches are randomly taken from thirteen 117 natural images, available from http://www.cis.hut.fi/projects/ica/ imageica/. The CCILCA algorithm is applied to the pre—whitened image patches k = 16 x 16 = 256 to update the lobe component matrix V. The matrix (ET)-1 is then computed. Each column Of the matrix is Shown in Fig. 3.18(a) by a 16 x 16 patch, as the features of the natural scenes. A total of 256 256-dimensional vectors are displayed in the figure. They all look smooth and most of them are localized (only a small patch are non-zero, or gray) as expected. 
The entire process take less than 46 minutes (i.e., 181 frames / second) on a Pentium III 700MHz PC with 512MB memory compared to over 10 hours of learning time for the FastICA algorithm using 24% of the 500,000 samples (disk thrashing is also a factor). Fig. 3.18(b) shows how many times each lobe component was the top “winner”. Most components have roughly a similar rate of hits, except relatively few leftmost (top) ones and rightmost (tailing) ones. Although it is not exactly true that each lobe component is equally likely to be hit, nearly equal hits for the majority is a desirable property for a high-dimensional density estimation due to the following consideration: Suppose that a random number 2 can take a discrete number of values 2 = 2i, with P{z = 22:} = Pi: i = 1, 2, ..., c. Given a fixed c, the discrete distribution that maximizes the entropy is a uniform distribution (Papoulis, 1991): p,- = 1/c, i = 1, 2, ..., c. In other words, the distribution of hits by lobe components has nearly the largest entropy. We apply the topological LCA algorithm, shown in Algorithm 5, to natural image set as well. Fig. 3.19 displays the filters of the topographic CCILCA algorithm. Filters show iso—orientation preference in a neighborhood. And the orientation blocks 118 are surrounding a singular position at row No.9 and column No.8. This is resembling the pin-wheel structure found in the biological visual cortex. The winning times are shown in Fig. 3.20. The neuron at the singular position get significantly more hits than other neurons. 3.4 Derivation Of the LCA algorithm In this section we discuss the mathematical analysis of the LCA algorithm from a sparse optimization point of view. Here we start from a intuitive Sparseness measure- ment explanation, and end up with the LCA learning algorithm. 3.4.1 6" norm Sparseness measurement As we mentioned in the section 3.1.6, Sparseness is an important property Of super- Gaussian random variable. Maximize Sparseness is a desirable strategy that is used by different ICA algorithms. But the measurement of Sparseness is still illusive. Many heuristic measurements of Sparseness have been proposed, e. g. Olshausen and Field proposed the following form: Sp = - ES (21.). (3.41) where S p(u) is defined as the Sparseness of random vector u; S (u?) is a nonlinear function, and u,- is the response Of the neuron. Olshausen and Field suggest S (u,) 2 to be an even nonnegative function, e.g. —e_“ , log (1 + 1:2), and lul. Intuitively, 119 when there are more neurons (random vector u) firing at the same time, the larger the function Z S (u,) would be. Therefore, maximizing Eq. 3.41 will maximize the ’l Sparseness. Karvanen and Cichocki (Karvanen and Cichocki, 2003) generalized the measure of Sparseness to [p norm criteria: Sp(u) 2E Z (3.42) It is clear that Olshausen et. al’s Sparseness measure is a special case of (p norm, when p is equal to 1 and |u| is chosen as the S function. When p = 4, the [p norm is related to the widely used non-Gaussianality measure kurtosis 5. Karvanen et. al suggest the range of p in (0, 1], and particularly smaller p, such as p = 0.1 or p = 0.01, should be used. However we found when p —> 00, it will lead to a surprisingly good independent component analysis algorithm. Let us first take lOOk at an example. Fig. 3.21 shows the joint distribution of two Laplace random variables. The original components are mixed with a rotation matrix of 20 degree. 
We then projected the data set onto a series of orthogonal bases, rotated from 0 degrees to 180 degrees in 1-degree intervals. After projection onto each basis, the means of different ℓp norms are calculated and plotted. Fig. 3.22 displays the mean of the ℓp norm for the different projections. It is interesting to see that the ℓ2 norm is a straight line, which makes sense because a rotation does not change Euclidean distance. Surprisingly, the optima of the ℓp norm curves on the two sides of the ℓ2 line are inverted: the maxima of the ℓp norm with p > 2 correspond to the positions of the minima of the ℓp norm with p < 2. Then, to find the independent components (in this case they lie along the 20-degree and 110-degree directions), one needs to find the minima of the ℓp norm with p < 2 or the maxima of the ℓp norm with p > 2.

Another issue is the robustness of the estimation when noise is present. From Fig. 3.22 we can see that almost all the norm curves agree on the same optimum positions (except the ℓ2 norm), but a norm curve with a greater difference between its maxima and minima will be more robust when noise is present. So the choice of p can be either close to zero or close to infinity. That is why Karvanen et al. suggested small p, such as 0.1 or 0.01. The problem with small p is that the optimization process is difficult, since the gradient of such criteria typically involves matrix inversion. On the other hand, at first glance p approaching infinity does not look like a good choice either, but it generally holds that

$$ \lim_{p\to\infty} \Big( \sum_i |u_i|^p \Big)^{1/p} = \max_i \{ |u_i| \}, \qquad (3.43) $$

so it leads to a surprisingly simple computation. In the next section we will also see that the gradient of the infinity norm has a simple form, so the optimization process is relatively easy in the infinity-norm case. Therefore, we can define the following criterion function:

$$ J(\mathbf{u}) = E\Big\{ \lim_{p\to\infty} \Big( \sum_i |u_i|^p \Big)^{1/p} \Big\} = E\Big\{ \max_i |u_i| \Big\}, \qquad (3.44) $$

and maximizing this function solves the sparseness maximization problem.

3.4.2 Infinity norm sparseness optimization

The criterion function defined in Eq. 3.44, which maximizes the mean of the ℓ∞ norm, can also be written as

$$ J(W) = \int \max_j \big[ (\mathbf{w}_j^{\top}\mathbf{y})^2 \big]\, p(\mathbf{y})\, d\mathbf{y}, \qquad (3.45) $$

where we replace |u_j| with (w_j^T y)^2 for convenience, since the absolute value function is not differentiable. So the goal is to maximize this function with respect to W. Intuitively speaking, if w_j lies along a tail direction of the multivariate super-Gaussian distribution, the criterion function is maximized. Our goal is therefore to maximize Eq. 3.45 with respect to the w_j. A closed form for w_j is not available for a general random variable y with pdf p(y), so we have to resort to an iterative approximation scheme.

The difficulty of the analysis lies in the discontinuity caused by the maximum function, max_j [(w_j^T y)^2]. This problem can be circumvented by going back to the infinity-norm form shown in Eq. 3.43. Let c = arg max_j [(w_j^T y)^2]. Thus the maximum part can be rewritten as

$$ \max_j \big[ (\mathbf{w}_j^{\top}\mathbf{y})^2 \big] = \lim_{r\to\infty} \Big[ \sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r} \Big]^{1/r}. \qquad (3.46) $$

To use a gradient approach to find the maximum of Eq. 3.45, we take the partial derivative of Eq. 3.45 with respect to w_j:

$$ \frac{\partial J}{\partial \mathbf{w}_j} = \int \lim_{r\to\infty} \frac{\partial}{\partial \mathbf{w}_j} \Big[ \sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r} \Big]^{1/r} p(\mathbf{y})\, d\mathbf{y}. \qquad (3.47) $$

Let

$$ A = \sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r}. \qquad (3.48) $$

Eq. 3.46 can then be written as the limit of A^{1/r} as r approaches infinity. Since

$$ \frac{\partial A}{\partial \mathbf{w}_j} = 2r \sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r-1}\, \frac{\partial (\mathbf{w}_i^{\top}\mathbf{y})}{\partial \mathbf{w}_j} = 2r\, (\mathbf{w}_j^{\top}\mathbf{y})^{2r-1}\, \mathbf{y}, \qquad (3.49) $$

we then have

$$ \frac{\partial A^{1/r}}{\partial \mathbf{w}_j} = \frac{1}{r} A^{\frac{1}{r}-1} \frac{\partial A}{\partial \mathbf{w}_j} = 2 A^{\frac{1}{r}-1} (\mathbf{w}_j^{\top}\mathbf{y})^{2r-1}\, \mathbf{y}. \qquad (3.50) $$
Denote

$$ B = \frac{(\mathbf{w}_j^{\top}\mathbf{y})^{2r-1}}{A} = \frac{(\mathbf{w}_j^{\top}\mathbf{y})^{2r-1}}{\sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r}} = \Big[ \sum_i \frac{(\mathbf{w}_i^{\top}\mathbf{y})^{2r}}{(\mathbf{w}_j^{\top}\mathbf{y})^{2r}} \Big]^{-1} (\mathbf{w}_j^{\top}\mathbf{y})^{-1} = \Big[ \frac{(\mathbf{w}_1^{\top}\mathbf{y})^{2r}}{(\mathbf{w}_j^{\top}\mathbf{y})^{2r}} + \cdots + \frac{(\mathbf{w}_c^{\top}\mathbf{y})^{2r}}{(\mathbf{w}_j^{\top}\mathbf{y})^{2r}} + \cdots \Big]^{-1} (\mathbf{w}_j^{\top}\mathbf{y})^{-1}. \qquad (3.51) $$

Given w_j, the ratio (w_c^T y)^{2r} / (w_j^T y)^{2r} is the largest term. When r approaches infinity, the term (w_c^T y)^{2r} dominates the summation. So

$$ \lim_{r\to\infty} B = \lim_{r\to\infty} (\mathbf{w}_c^{\top}\mathbf{y})^{-2r} (\mathbf{w}_j^{\top}\mathbf{y})^{2r-1} = \delta_{cj}\, (\mathbf{w}_j^{\top}\mathbf{y})^{-1}, $$

where δ_{cj} is the Kronecker delta: δ_{cj} = 1 if c = j, and 0 otherwise. Substituting B into Eq. 3.50, we obtain

$$ \lim_{r\to\infty} \frac{\partial A^{1/r}}{\partial \mathbf{w}_j} = 2\, (\mathbf{w}_c^{\top}\mathbf{y})^2\, \delta_{cj}\, (\mathbf{w}_j^{\top}\mathbf{y})^{-1}\, \mathbf{y} = 2\, \delta_{cj}\, (\mathbf{w}_j^{\top}\mathbf{y})\, \mathbf{y}. \qquad (3.52) $$

This gradient is also the updating rule for the weight matrix W. With the consideration of the most efficient estimation, we arrive at an updating rule exactly the same as Eq. 3.38.

3.5 Broader Impact

In neuroscience and machine perception, there is a growing interest in simulating the developmental process of feature detectors in the early sensory pathways (see, e.g., Elman, Bates, Johnson, Karmiloff-Smith, Parisi & Plunkett (Elman et al., 1997); Weng, McClelland, Pentland, Sporns, Stockman, Sur & Thelen (Weng et al., 2001); Weng & Stockman (Weng and Stockman, 2002)). In such sensory pathways, there is a need to develop feature detectors for all areas (receptive fields), across different positions and sizes, in the sensor array (e.g., a camera CCD array or the retina). The total number of such areas is so large (sampled in the y-x positions and in the size range) that it is impractical for each feature detector of k connections to have extra storage space of k x k for its development. This raises the critical need for Type-4 incremental ICA algorithms.

We will place the Type-4 algorithms developed in this project on the web, freely available to everybody. Due to their simple structure, lowest possible order of time and space complexities, quasi-optimal statistical efficiency, and Type-4 nature, we expect that this class of algorithms will be widely used.

3.6 Conclusion

The proposed CCILCA algorithm is simple and fast under multiple demanding computational requirements: it is high dimensional, incremental, and free of higher-order statistics computation. This is due to the simplicity of the winner-take-all principle and the most-efficient-estimation concept used in the algorithm design. The sparse representation can come from a winner-take-all competitive learning neural network, which helps to explain some learning mechanisms in the neural system.
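As a quick numerical check of two facts used in the derivation above, namely that the ℓp norm approaches the maximum as p grows (Eq. 3.43) and that only the winning vector receives a non-zero gradient (Eq. 3.52), the following sketch compares the analytic winner-take-all gradient with a finite-difference estimate. The test vectors and tolerances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.laplace(size=10)

# Eq. 3.43: the l_p norm approaches max|u_i| as p grows.
for p in (1, 2, 8, 32, 128):
    print(p, np.sum(np.abs(u) ** p) ** (1 / p), np.max(np.abs(u)))

# Eq. 3.52: the gradient of max_j (w_j^T y)^2 with respect to w_j is
# 2 * delta_cj * (w_j^T y) * y, i.e. only the winning row gets a nonzero gradient.
W = rng.standard_normal((5, 10))
y = rng.standard_normal(10)
c = np.argmax((W @ y) ** 2)              # the winner
grad = np.zeros_like(W)
grad[c] = 2.0 * (W[c] @ y) * y           # winner-take-all gradient

# Finite-difference check of d/dW max_j (w_j^T y)^2.
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (np.max((Wp @ y) ** 2) - np.max((Wm @ y) ** 2)) / (2 * eps)
print(np.allclose(grad, num, atol=1e-4))
```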
Figure 3.10: Illustration of lobe components. The plus signs are random samples. 1st column: the original source signals. 2nd column: the observed mixed signals (after an affine transform of the left column). 3rd column: whitened signals. Lobe components are marked by arrows. Rows, from top to bottom: a distribution factorizable into two independent uniform distributions (sub-Gaussian); a distribution factorizable into two independent Laplacian distributions (super-Gaussian); a nonfactorizable distribution.

Figure 3.11: The sample space of a zero-mean white random vector x in 2-D space can be illustrated by a circle. Each mark + indicates a random sample of x. The distribution is partitioned into c = 3 (symmetric) lobe regions R_i, i = 1, 2, 3, where R_i is represented by the lobe component (vector) w_i.

Figure 3.12: The amnesic average coefficient. The x axis is the number of visits; the y axis is the weight coefficients of w_1 and w_2.

Figure 3.13: The error coefficients c(n) for amnesic means with different amnesic functions mu(n).

Figure 3.14: The proposed method on a 2-D Laplace distribution data set. (a) The source independent components and the estimated independent components. The dash-dot lines indicate the directions of the source independent components, and the two arrows indicate the components estimated by the proposed method. (b) Average error in the angle between the estimated independent component vectors and the source independent component vectors, as a function of the number of samples.

Figure 3.15: Comparison of ICA results among (Type-3) NPCA, (Type-2.5) ExtBS1 and (Type-4) CCILCA for super-Gaussian sources. (a) Average error for different data dimensions d. (b) Average running time for different data dimensions d. (c) Average error for d = 25 as a function of the number of samples.

Figure 3.16: Comparison among (Type-2) Extended Infomax, (Type-1) FastICA and three variants of the proposed Type-4 CCILCA algorithm (fixed mu, dynamic mu, and eliminating cells).

Figure 3.17: Cocktail party problem. (a) A music sound clip in its original form; it is one of the nine sound sources. (b) One of the nine mixed sound signals. (c) Recovered music sound wave. Compared with (a), the sound signal is recovered after approximately 1.5 seconds.

Figure 3.18: Lobe components from natural images (not factorizable). (a) LCA-derived features from natural images, ordered by the number of hits in decreasing order. (b) The numbers of hits of the corresponding lobe components in (a).

Figure 3.19: Results of Algorithm 5. Topographically ordered basis functions in ICA of natural images.

Figure 3.20: Results of Algorithm 5. Number of times for each independent component to be the "winner." The noticeably large hit count corresponds to the neuron at row 9 and column 8.

Figure 3.21: The data set is the joint distribution of two unit-variance Laplace distributions. The two independent Laplace random variables are mixed by a rotation matrix of 20 degrees.
Figure 3.22: The ℓp norm criterion functions for different p. The data set is projected onto a series of orthogonal bases, and the expectation of the ℓp norms of all the projected samples is then computed.

Chapter 4

Sparse Representation and Feature Selection

In Chapter 3 the Lobe Component Analysis algorithm, based on a novel sparseness measure, was discussed. In this chapter we further discuss sparseness optimization. We propose a new perspective to achieve sparseness via the winner-take-all principle for linear kernel regression and classification tasks. We form the dual of the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996) criterion, and transform an ℓ1 norm minimization into an ℓ∞ norm maximization problem. Two solutions are proposed. First, the problem can be solved by standard quadratic programming with linear inequality constraints; we show that the number of parameters to be estimated in the ℓ∞-normed space is half of that in the solution suggested by David Gay for the ℓ1-normed space. Second, we introduce a novel winner-take-all neural network solution derived from gradient descent, which links sparse representation with the competitive learning scheme. This scheme is a form of unsupervised learning in which each input pattern comes, through learning, to be associated with the activity of one or at most a few neurons. However, the lateral interaction between neurons in the same layer is strictly preemptive in this model. This framework is applicable to a variety of problems, such as Independent Component Analysis (ICA), feature selection, and data clustering.

4.1 Introduction: The Supervised Learning Problem

The central problem of supervised learning or regression can be formulated as function approximation. In either case we have pairwise correspondences of samples x_i and y_i, where x_i is in R^d, y_i is in R, and d is the dimension of the input space. In particular, when y_i is discrete, e.g. y_i in {-1, 1}, the problem is called classification, whereas a continuous y_i (for example y_i in R) indicates a regression problem. The task is to infer a function f(·) such that y = f(x). More precisely, once the model of the function is chosen, the function can be written as y = f(x, β), where β is the parameter vector of the model, y = [y_1, ..., y_n]^T, and n is the number of training samples. The model of the function f(·) can have various forms. For example, in a three-layer perceptron neural network (Bishop, 1995), the function has the following form:

$$ \mathbf{y} = f_2\big( W_2 f_1( W_1 \mathbf{x} + \mathbf{b}_0 ) + \mathbf{b}_1 \big), \qquad (4.1) $$

where f_i(·) is a non-linear limiting function, like the sigmoid. Then β = {W_1, W_2, b_0, b_1} is the parameter set to be optimized. In this study, we consider the form of the function to be linear with respect to the parameter vector β, i.e.,

$$ f(\mathbf{x}, \boldsymbol{\beta}) = \sum_{i=1}^{k} \beta_i h_i(\mathbf{x}) = \boldsymbol{\beta}^{\top} \mathbf{h}(\mathbf{x}), \qquad (4.2) $$

where h(x) = [h_1(x), ..., h_k(x)]^T is a vector of k fixed functions of the input vector x. Typical settings of this function are:

1. Linear regression, where h(x) = [1, x_1, ..., x_d]^T, and d is the dimension of the input space. Apparently, k = d + 1.

2. Non-linear regression, where h(x) = [phi_1(x), ..., phi_k(x)]^T. If phi_i = g(Wx), where g(·) is the sigmoid function, this model is the single-layer perceptron. Note that usually phi_1(x) = 1.

3. Kernel regression, where h(x) = [1, K(x, x_1), ..., K(x, x_{k-1})]^T, and K(x, x_i) is some kernel function, e.g. a Gaussian kernel.
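As a concrete illustration of these settings, the following sketch builds design matrices for the linear and kernel cases. The Gaussian kernel and its width are assumptions made for illustration; any kernel function could be substituted.

```python
import numpy as np

def linear_design(X):
    """h(x) = [1, x_1, ..., x_d]^T for each sample; returns the n x (d+1) design matrix H."""
    n = X.shape[0]
    return np.hstack([np.ones((n, 1)), X])

def kernel_design(X, centers, sigma=1.0):
    """h(x) = [1, K(x, x_1), ..., K(x, x_{k-1})]^T with Gaussian kernels (an assumed choice)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-d2 / sigma ** 2)
    return np.hstack([np.ones((X.shape[0], 1)), K])

# usage: y = H @ beta for any parameter vector beta of matching length
X = np.random.randn(150, 1)
H_lin = linear_design(X)            # 150 x 2
H_ker = kernel_design(X, X[:20])    # 150 x 21
```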
Typically, it is assumed that the output variable y from the training set is contaminated by additive white Gaussian noise, i.e. y = f(x, β) + w, where w = [w_1, ..., w_n]^T is a set of i.i.d. white Gaussian random variables with variance sigma^2. Thus the conditional probability p(y | β) is Gaussian, i.e.

$$ p(\mathbf{y} \mid \boldsymbol{\beta}) = N\big( H\boldsymbol{\beta},\, \sigma^2 I \big), \qquad (4.3) $$

where I is the identity matrix. We write H = h(x)^T, and H is called the design matrix. The element (i, j) of H, denoted by H_ij, is given by H_ij = h_j(x_i). Since the noise is assumed to be i.i.d., by simply applying the Maximum Likelihood Estimator (MLE) we obtain the log-likelihood

$$ l(\boldsymbol{\beta}) = \ln p(\mathbf{y} \mid \boldsymbol{\beta}) = -\frac{1}{\sigma^2} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{k} H_{ij}\beta_j \Big)^2 = -\frac{\| \mathbf{y} - H\boldsymbol{\beta} \|_2^2}{\sigma^2}, \qquad (4.4) $$

up to an additive constant, where the subscript 2 denotes the ℓ2 norm, i.e. the Euclidean norm of a vector, defined by ||z||_2 = sqrt(z^T z). The estimated parameter vector is

$$ \hat{\boldsymbol{\beta}} = \big( H^{\top}H \big)^{-1} H^{\top} \mathbf{y}. \qquad (4.5) $$

The problem with least-squares estimation is that it emphasizes fitting accuracy; with limited data samples it usually leads to severe over-fitting. That is, the trained function works well on the training sample set but generates large error on the testing set. This drawback is often referred to as low generalization ability. Note that in this case there is no preference on the parameter vector β, so its prior is a uniform distribution. This prior typically leads to some large values of the coefficients β_i, and the model is considered to have more complexity and roughness. With a zero-mean Gaussian prior for β with diagonal covariance matrix A, the estimation can be formalized by a maximum a posteriori (MAP) process. The prior on β then becomes an ℓ2-norm regularization term in the log-posterior, which favors a smaller β.
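The least-squares estimator of Eq. (4.5), together with the ridge estimator that this Gaussian prior leads to (its closed form is given in the next paragraph), can be sketched as follows; the noise level, prior variance, and toy data are placeholders.

```python
import numpy as np

def least_squares(H, y):
    """Eq. (4.5): beta = (H^T H)^{-1} H^T y (lstsq is used for numerical stability)."""
    return np.linalg.lstsq(H, y, rcond=None)[0]

def ridge(H, y, sigma2=1.0, mu2=1.0):
    """MAP estimate under a zero-mean Gaussian prior with A = mu^2 I:
    beta = (sigma^2 / mu^2 * I + H^T H)^{-1} H^T y."""
    k = H.shape[1]
    return np.linalg.solve((sigma2 / mu2) * np.eye(k) + H.T @ H, H.T @ y)

# usage with a random, over-parameterized design matrix
rng = np.random.default_rng(0)
H = rng.standard_normal((50, 40))
beta_true = np.zeros(40); beta_true[:3] = [2.0, -1.0, 0.5]
y = H @ beta_true + 0.1 * rng.standard_normal(50)
print(np.linalg.norm(least_squares(H, y)), np.linalg.norm(ridge(H, y)))
```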
It means that most of the [3’s components will be zero, and the probability of having a large value is relatively high, compared to the Gaussian distribution with the same variance. Fig. 3.2 compares three prior distributions, i.e. Laplacian, Gaussian, and Uniform. Utilizing the same MAP process, the estimation of 5 is given by A a = ...ng {lly — Hang + t Hall}, (4.6) where t = 2020 is a control parameter, which can favor either the squared error term or the él-norm regularization term. This criterion is also known as Least Absolute Shrinkage Selection Operator (LASSO) (Tibshirani, 1996). It is worth noting here that due to the non-Gaussian prior, the MAP estimation is not equivalent to the Bayesian estimation as it is in the ridge regression. Therefore, the estimation is bi- 142 ased. To make a unbiased estimation, one needs to integrate in all 5 space, which is computationally prohibitive. However, if the posterior concentrates at a certain point, then this biased estimation may only have a small variance from the unbi- ased estimation, which is desirable. This is one reason why a concentrated sparse prior is preferred. Some researchers introduced “hyper-parameter” to further steepen the prior, for example, in Relevance Vector Machine (RVM) (Tipping, 2001) and F igueiredo’s work (Figueiredo, 2003). Another reason that a sparse representation is desirable is because it improves the generalization of a learning system. For example, in supervised learning, the goal is to infer a mapping based on the training samples. The generalization capability is accomplished by reducing the complexity of the model, which is characterized by the number of non-zero parameters. This problem is formalized as [BO-norm minimiza- tion. It is known that EO-norm minimization is NP-hard (Amaldi and Kann, 1998). However, it is established that the solution of the [1 problem is the same as the [0 problem if a certain condition is satisfied. So, El-norm problem is still an important issue. In this chapter, we propose a method to solve the €1-norm version of the problem. We construct the dual problem of LASSO criterion as in Eq. (4.6) and use a gradient- based method to find the solution. This method is by no means claimed to be superior to the quadratic programming (QP) based method; however, it opens a perspective to address the problem differently. We have shown that the proposed method actually applies the competitive learning principle. In addition, this method can also motivate other biologically plausible models for solving similar problems. 143 The reminder of the chapter is organized as follows: In Section 4.2 we formulate the dual problem of the LASSO, and propose a solution based on gradient descent. We reformulate the proposed algorithm in Section 4.2.5. The experiment results and comparison with existing sparse optimization algorithms is in Section 4.3. Section 4.4 provides conclusions. 4.2 Method 4.2.1 Duality First, we will construct the dual problem of Eq. (4.6). Fig. 4.1 illustrates this problem. The Eq. (4.6) can be geometrically explained as the minimum €1-norm between the origin and the convex set K. This distance, according to duality theory, is equal to the maximum Koo-norm distance between the origin and the plane that separates the origin and the convex set K. In Fig. 4.1, the parabola denotes the convex set K. Point B is the closest point to the origin, in terms of the El-norm. Actually, the Zl—norm of B consists of two parts, the quadratic form of Eq. (4.6) and t ”fill 1 in Eq. 
4.2 Method

4.2.1 Duality

First, we construct the dual problem of Eq. (4.6). Fig. 4.1 illustrates this problem. Eq. (4.6) can be explained geometrically as the minimum ℓ1-norm distance between the origin and the convex set K. This distance, according to duality theory, is equal to the maximum ℓ∞-norm distance between the origin and a plane that separates the origin and the convex set K. In Fig. 4.1, the parabola denotes the convex set K. Point B is the closest point to the origin in terms of the ℓ1 norm. The ℓ1 norm of B consists of two parts, the quadratic form of Eq. (4.6) and the term t||β||_1 of Eq. (4.6), as illustrated in Fig. 4.1. Point A is on the tangent plane of point B, and A is the closest point to the origin in terms of the ℓ∞ norm. Duality theory establishes that the ℓ∞ norm reaches its maximum value while the ℓ1 norm reaches its minimum value. Thus, with a few manipulations, we get the following theorem to formalize this intuitive geometric explanation.

Figure 4.1: Geometric explanation of the duality. Point B is the closest point in K to the origin.

Theorem 4.2.1 If

$$ \boldsymbol{\beta}' = \arg\max_{\boldsymbol{\beta}} \frac{ t\, \boldsymbol{\beta}^{\top} J(\boldsymbol{\beta}) }{ Z\, \| J(\boldsymbol{\beta}) \|_{\infty} }, $$

where Z = ||y - Hβ||_2^2, then β' = β̂, where β̂ is the same as in Eq. (4.6).

The generalized proof can be found in (Luenverger, 1969).

4.2.2 Derivations

Let Z = ||y - Hβ||_2^2 = β^T H^T Hβ - 2y^T Hβ + y^T y, and denote

$$ J(\boldsymbol{\beta}) = \begin{cases} \dfrac{2}{t}\big( H^{\top}H\boldsymbol{\beta} - H^{\top}\mathbf{y} \big), & \text{if } \dfrac{2}{t}\,\big| \boldsymbol{\beta}^{\top}\mathbf{c}_m - \mathbf{y}^{\top}\mathbf{h}_m \big| > 1, \\[4pt] [0, 0, \ldots, 0]^{\top}, & \text{otherwise,} \end{cases} \qquad (4.12) $$

where

$$ m = \arg\max_{j} \big( \big| \boldsymbol{\beta}^{\top}\mathbf{c}_j - \mathbf{y}^{\top}\mathbf{h}_j \big| \big), \qquad (4.13) $$

C = H^T H, c_j is the j-th column of C, and h_j is the j-th column of H. Rearranging these equations, we have the final updating rules,

$$ \nabla\boldsymbol{\beta} \;\propto\; \frac{2C\boldsymbol{\beta}}{\| J(\boldsymbol{\beta}) \|_{\infty}} \;-\; \frac{4\big[ \boldsymbol{\beta}^{\top}C\boldsymbol{\beta} - \mathbf{y}^{\top}\mathbf{y} \big]}{t\, \| J(\boldsymbol{\beta}) \|_{\infty}} \big( \boldsymbol{\beta}^{\top}\mathbf{c}_m - \mathbf{y}^{\top}\mathbf{h}_m \big)\, \mathbf{c}_m. \qquad (4.14) $$

The switching condition between the two updating rules is the same as that in Eq. (4.12).

Remarks: Eq. (4.13) defines a competitive learning process. The algorithm finds the winner and uses the updating rule with the winner's value. The method searches for the optimal value of β that minimizes the criterion function in Eq. (4.6). With the ℓ1-norm constraint, some of the components β_i of the vector β will gradually decay. However, due to the nature of the gradient method, they will not become exactly zero, so an additional step that applies a hard threshold to set small β_i to zero helps the convergence of the algorithm.

The choice of learning rate in any gradient algorithm is important. Here we use a dynamic learning rate introduced in Chapter 3, i.e. the amnesic average. Suppose there are two statistical estimators Γ_1 and Γ_2 for estimating a parameter θ. If E||Γ_1 - θ||^2 < E||Γ_2 - θ||^2, Γ_1 is said to be more statistically efficient than Γ_2. We consider the right-hand side of Eq. 4.14 as an "observation." The goal is to obtain the mean of this observation while β is estimated incrementally. It is known that for many distributions the sample mean is the most efficient estimator of the mean of a random variable. When the distribution is unknown, the sample mean is the best linear estimator, which results in the minimum error variance. For many distributions, the sample mean reaches or approaches the Cramér-Rao bound (CRB). An efficient estimator is one that has the least variance from the real parameter, and its variance is bounded below by the CRB. Thus, we estimate the parameter vector by the sample mean of the observation in Eq. 4.14. The algorithm is guided by statistical efficiency, but it is not absolutely the most efficient one, because

1. the true distribution of the observation is unknown;

2. the distribution changes as β is incrementally estimated;

3. the amnesic average is used to gradually discount "old" observations, which reduces the statistical efficiency moderately.

4.2.3 Quadratic Programming

Quadratic programming (QP) is used for finding the solution of LASSO regression. The method suggested by David Gay (Tibshirani, 1996) transforms the LASSO into a QP problem with 2n variables and (2n + 1) constraints, where n denotes the number of kernels. In this subsection, we propose the dual-QP algorithm, which converts the original problem into a problem with fewer variables (n) and 2n constraints.
The dual problem of LASSO regression is described in Eq. (4.8); we recapitulate it as finding β such that

$$ \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \frac{ t\, \boldsymbol{\beta}^{\top} J(\boldsymbol{\beta}) }{ Z\, \| J(\boldsymbol{\beta}) \|_{\infty} }. \qquad (4.15) $$

This optimization problem is equivalent to

$$ \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \frac{t}{Z}\, \boldsymbol{\beta}^{\top} J(\boldsymbol{\beta}) \quad \text{s.t.} \quad \| J(\boldsymbol{\beta}) \|_{\infty} = 1. \qquad (4.16) $$

The constraint in Eq. (4.16) can be written as

$$ \max\Big( \Big\| \frac{2}{t}\big( H^{\top}H\boldsymbol{\beta} - H^{\top}\mathbf{y} \big) \Big\|_{\infty},\; 1 \Big) = 1, $$

and is simplified to

$$ \Big\| \frac{2}{t}\big( H^{\top}H\boldsymbol{\beta} - H^{\top}\mathbf{y} \big) \Big\|_{\infty} \le 1. \qquad (4.17) $$

Written in component form, Eq. (4.17) becomes

$$ \frac{2}{t}\, \big| \mathbf{c}_i^{\top}\boldsymbol{\beta} - a_i \big| \le 1, \quad \text{for } i = 1, \ldots, n, $$

where c_i^T is the i-th row of H^T H and a_i is the i-th element of H^T y. We can write the inequality in vector form as

$$ \frac{2 H^{\top}\mathbf{y}}{t} - \mathbf{1} \;\le\; \frac{2 H^{\top}H\boldsymbol{\beta}}{t} \;\le\; \mathbf{1} + \frac{2 H^{\top}\mathbf{y}}{t}. $$

Thus, the equivalent QP problem becomes

$$ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \big[ \boldsymbol{\beta}^{\top} H^{\top}H \boldsymbol{\beta} \big] \quad \text{s.t.} \quad -\frac{2 H^{\top}H\boldsymbol{\beta}}{t} \le \mathbf{1} - \frac{2 H^{\top}\mathbf{y}}{t}, \qquad \frac{2 H^{\top}H\boldsymbol{\beta}}{t} \le \mathbf{1} + \frac{2 H^{\top}\mathbf{y}}{t}. $$

It is worth noting that the above QP has n variables and 2n constraints.
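Any off-the-shelf QP solver can handle the n-variable, 2n-constraint problem stated above. The sketch below uses CVXPY, an external package not used in the dissertation, and writes the quadratic objective as a sum of squares; the test data and the value of t are placeholders.

```python
import numpy as np
import cvxpy as cp

def dual_qp_lasso(H, y, t):
    """Solve the dual QP stated above: minimize beta^T H^T H beta under the 2n box constraints."""
    n = H.shape[1]
    HtH, Hty = H.T @ H, H.T @ y
    beta = cp.Variable(n)
    constraints = [-2 * HtH @ beta / t <= 1 - 2 * Hty / t,
                    2 * HtH @ beta / t <= 1 + 2 * Hty / t]
    # beta^T H^T H beta is written as ||H beta||_2^2 to keep the objective in a solver-friendly form
    cp.Problem(cp.Minimize(cp.sum_squares(H @ beta)), constraints).solve()
    return beta.value

rng = np.random.default_rng(0)
H = rng.standard_normal((80, 30))
beta_true = np.zeros(30); beta_true[[2, 11]] = [1.5, -2.0]
y = H @ beta_true + 0.05 * rng.standard_normal(80)
print(np.round(dual_qp_lasso(H, y, t=5.0), 3))
```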
4.2.4 Filter Development

Let us reconsider the original linear kernel regression problem in a special case. Suppose the design matrix H is an orthogonal n x n matrix and there is no additive noise, so β = H^T y. Also, we now consider optimization not with respect to β but with respect to the design matrix H. Then we get the criterion function adapted from Eq. (4.6):

$$ H = \arg\min_{H} \big\{ \| H^{\top}\mathbf{y} \|_1 \big\}. $$

Using the same gradient technique described in Section 4.2.2, we can get the updating rule of the design matrix H,

$$ \nabla \mathbf{h}_i \;\propto\; 2\,\delta_{ci}\, (\mathbf{h}_i^{\top}\mathbf{y})\, \mathbf{y}, $$

where c = arg max_i |h_i^T y|, and δ_{ci} is the Kronecker delta: δ_{ci} = 1 if c = i, and 0 otherwise. This is precisely a "winner-take-all" algorithm: only the winning kernels (neurons) get updated. For more details about this method, see (Zhang and Weng, 2004).

4.2.5 Gradient Sparseness Optimization Algorithm

We summarize the training procedure in Algorithm 6. Each iteration of this algorithm is computationally efficient, because it involves only matrix multiplication and a maximum function. The time complexity of each iteration is O(kn), where k is the number of basis vectors and n is the dimension of each basis vector.

Algorithm 6 Gradient Sparseness Optimization Algorithm
1: Preprocessing: Get the training set (x_i, y_i), i = 0, 1, 2, ..., n. For each i, replace y_i by y_i - ȳ, where ȳ = (1/n) times the sum of the y_i.
2: β_i = 0, i = 0, 1, 2, ..., k.
3: Initialize the design matrix H, and compute C = H^T H.
4: repeat
5:   k = 1.
6:   Compute m = arg max_j ( |β^T c_j - y^T h_j| ).
7:   if (2/t)(β^T c_m - y^T h_m) > 1 then
8:     Update β with
         β_new = w_1 β_old + w_2 [ 2Cβ / ||J(β)||_∞ - (4[β^T Cβ - y^T y] / (t ||J(β)||_∞)) (β^T c_m - y^T h_m) c_m ],
       where w_1 and w_2 are the amnesic average coefficients defined in Eq. (3.32) and Eq. (3.33), respectively.
9:   else
10:    Update β with
         β_new = w_1 β_old + w_2 [ 2Cβ / ||J(β)||_∞ ].
11:  end if
12:  k = k + 1.
13: until the objective function in Eq. (4.8) reaches the target value, or the change in β is less than some threshold.

4.3 Experiments

4.3.1 Kernel Regression

Our first experiment illustrates the performance of the proposed algorithm in kernel regression. The regression model is

$$ y = f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \sum_{i=1}^{k} \beta_i K(\mathbf{x}, \mathbf{x}_i), $$

where K(x, x_i) = exp(-||x - x_i||^2 / sigma^2) is the kernel function, and x_i and sigma are the kernel parameters. The function to be approximated is the 1-D sinc function y = sin(x)/x. We randomly collected 150 samples and added Gaussian noise with variance 0.01. The first row in Fig. 4.2 shows the fitting results of the proposed method, ridge regression, and the proposed dual-QP LASSO regression algorithm, respectively. The dots are samples with noise, and the dashed lines are the ground-truth sinc function. Solid lines show the approximation results. The circled dots correspond to the kernels with nonzero weights, a.k.a. the "supporting kernels." In the second row, we use bar plots to display the weights of those kernels. As Fig. 4.2 clearly indicates, both the proposed gradient sparseness method and dual-QP LASSO achieve sparseness. The ℓ0 norms are also marked on the panels. In our testing, the proposed method performs better than LASSO regression. Fig. 4.3 shows the convergence procedure; the proposed method takes about 200 iterations to converge. Fig. 4.4 illustrates how the control parameter t = 2 sigma^2 alpha affects the mean square error and model sparseness. We conducted 20 tests with t ranging from 0.3 to 1.5. As indicated in the figure, a greater t makes the model fit well but increases its complexity, and vice versa.

Figure 4.2: Kernel regression results. The dashed lines are the true sinc function. Solid lines are the approximation results. The circles shown in the first row are the selected kernel functions. Note that we do not show circles in the ridge regression result, because every kernel is selected. (a) Proposed gradient method (ℓ0 = 10). (b) Ridge regression (ℓ0 = 150). (c) LASSO regression (ℓ0 = 11).

Figure 4.3: Objective function vs. number of iterations. We utilize a dynamic learning rate mechanism called the amnesic average. Interested readers can refer to (Weng and Zhang, 2004) for more details.

Figure 4.4: The effect of the control parameter t on the mean square error and sparseness.

4.3.2 Classification

The experiments addressed the kernel-based classifier for two-class problems: a special case of the regression problem with y in {+, -}. The classifier is formulated as the following regression function:

$$ p(y \mid \mathbf{x}) = \psi\Big( \beta_0 + \sum_{i=1}^{k} \beta_i K(\mathbf{x}, \mathbf{x}_i) \Big), $$

where psi denotes the logistic function psi(z) = (1 + exp(-z))^{-1}. If p(y | x) > 0.5, x belongs to the class +; otherwise, x belongs to the class -.
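A minimal sketch of such a kernel logistic classifier is given below. The Gaussian kernel width and the plain gradient-ascent fit are stand-ins for illustration; in this chapter the weights β would instead come from the proposed sparse optimizer.

```python
import numpy as np

def gaussian_kernel(X, centers, sigma=1.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def fit_kernel_logistic(X, y, sigma=1.0, lr=0.1, n_iter=2000):
    """Fit p(y|x) = sigmoid(beta_0 + sum_i beta_i K(x, x_i)) by plain gradient ascent."""
    K = np.hstack([np.ones((len(X), 1)), gaussian_kernel(X, X, sigma)])
    beta = np.zeros(K.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-K @ beta))
        beta += lr * K.T @ (y - p) / len(X)     # gradient of the log-likelihood
    return beta

def predict(beta, X, centers, sigma=1.0):
    K = np.hstack([np.ones((len(X), 1)), gaussian_kernel(X, centers, sigma)])
    p = 1.0 / (1.0 + np.exp(-K @ beta))
    return (p > 0.5).astype(int)                # class "+" when p(y|x) > 0.5

# XOR-style toy data, as in the first classification experiment described next
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))
y = (np.sign(X[:, 0] * X[:, 1]) > 0).astype(int)
beta = fit_kernel_logistic(X, y, sigma=0.5)
print((predict(beta, X, X, sigma=0.5) == y).mean())
```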
First, we run the classifier on an artificial data set drawn randomly from the 2-D XOR function: y = sign(x_1 * x_2), where x_1, x_2 are in (-1, 1). The data set includes 30 samples. Two methods are compared. Fig. 4.5(a) shows the decision boundary of our proposed method, and Fig. 4.5(b) illustrates the values of the learned coefficient vector β; for comparison, Fig. 4.5(c) and (d) give the corresponding results using the LSSVM (Suykens and Vandewalle, 1999). The sparseness is improved tremendously.

Second, we use three data sets from real-world problems: the Pima Indian diabetes set (downloadable at www.stats.ox.ac.uk/pub/PRNN), which is collected from women of Pima heritage, where the goal is to decide whether a subject has diabetes based on 7 different tests; the Wisconsin breast cancer (WBC) set (downloadable at www.ics.uci.edu/mlearn/MLSummary.html), whose goal is to diagnose (benign/malignant) based on the results of 9 measurements; and the AR face database (available at rvl1.ecn.purdue.edu/vl/ARdatabase/ARdatabase.html), from which two male subjects were selected, each with 26 samples. In WBC, we removed the cases with missing attributes for simplicity.

Figure 4.5: Results on the artificial XOR data set. (a) The decision boundary of the proposed winner-take-all method. (b) Kernel selection of the proposed method. (c) The decision boundary of the Least Squares Support Vector Machine (LSSVM). (d) Kernel selection of LSSVM.

Table 4.1 shows the results of the proposed classifier in the 5-fold cross-validation experiment. For comparison, we also include the results of the logistic classifier.

Table 4.1: The results of the 5-fold cross-validation (ROC / No. of kernels).

                        Pima        WBC         AR face
  Logistic              0.75/200    0.77/455    0.81/40
  LSSVM                 0.90/200    0.76/455    0.96/40
  Proposed classifier   0.96/70     0.98/253    0.97/20

In all data sets, the proposed classifier is better. The performance is improved partially by setting some decayed kernel weights to exactly zero.

4.4 Conclusions

In this chapter, we have formulated the dual problem of ℓ1-norm based sparse approximation. We show the geometric relation of the duality and solve the dual ℓ∞ maximization problem by gradient descent and by quadratic programming. The algorithm's performance is close to or better than the result of LASSO regression. Compared with traditional methods, the efficiency of this algorithm is quite remarkable.

Chapter 5

Conclusions

5.1 Summary

This dissertation presents a general-purpose vision architecture, shown in Fig. 5.1, which can accommodate attention selection and the development of a vision system. Motivated by the structure of the early visual pathway, several layers of staggered receptive fields are used to model the pattern of positioning of the processing cells of the early visual pathway. From sequentially arriving video frames, the proposed SHM algorithm develops a hierarchy of filter banks whose outputs are uncorrelated within each layer, with an increasing scale of receptive fields from low to high layers. We also show how this general architecture can be applied to occluded face recognition, which demonstrates a case of attention selection. The architecture shown in Fig. 5.1 is equally applicable to other sensory modalities, such as vision, speech and touch, each with a different set of developmental parameters (e.g., the extent of temporal context).

Figure 5.1: The flow diagram of a developmental vision system. The blocks include the retina and a delayed retina, sensory-mapping areas A1 through A4 with their delayed copies, a cognitive mapping, a motor mapping (to motors), innate behaviors, and top-down attention control that inhibits responses from black cells. The sensorimotor module for innate behavior is illustrated by a single block only for simplicity; it also contains the three mappings, which are, however, much smaller in storage and much simpler in the mappings they realize.

Besides the general architecture, a developmental algorithm for sensory networks has been designed and tested, based on the concept of lobe components. Our simulation results of network development revealed that many lobe component cells developed by the algorithm from natural images are orientation-selective cells similar to those found in V1. The proposed LCA method is based on two well-known computational neural mechanisms, Hebbian learning and lateral inhibition.
It is an in-place developmental algorithm in which every cell is both a signal processor and the developer of the cortex. There is no need for a separate network to deal with the development (and learning). The algorithm is purely incremental in the sense that there is no need for extra storage of information beyond what can be stored and incrementally computed by the processing network itself.

We further discuss sparseness optimization. We propose a new perspective to achieve sparseness via the winner-take-all principle for linear kernel regression and classification tasks. We form the dual of the LASSO criterion and convert an ℓ1 norm minimization into an ℓ∞ norm maximization problem. The winner-take-all updating rule coming from the ℓ∞ term is especially efficient due to the simplicity of the maximum function. This method has been successfully tested on several tasks, including basis/feature selection, classification and regression.

5.2 Contributions

The major contributions of this dissertation are summarized in Table 5.1.

Table 5.1: Contributions of the dissertation.

  General vision architecture: Proposition and implementation of a general vision architecture; automatically developing filters from natural images; demonstrating a case of attention selection.
  Blind source separation: An efficient solution to the super-Gaussian blind source separation problem by the LCA method.
  Feature derivation: A fast LCA algorithm that applies to spatial filter development and solves the local feature extraction problem online and in real time.
  Natural image filters: Developing LCA filters from a natural image set; showing orientation-preference neurons and the pinwheel structure of the filter topology.
  Feature selection: Sparse optimization and a winner-take-all method for optimal selection of basis functions and features.

5.3 Summary of future work

The research in this dissertation shows that a general-purpose vision system can solve a variety of vision problems. However, some issues need to be explored further:

1. An interesting research direction in ICA is independent subspace analysis, in which the components are divided into groups or subspaces so that components in different subspaces are independent, but components in the same subspace are not. In particular, the distribution of the components in a subspace is assumed to depend only on the norm of the projection onto that subspace. Typically, this implies that the components of a subspace tend to be active simultaneously. The proposed LCA algorithm could be applied to this problem. In the original LCA algorithm, only one winner is updated to "take care of" each sample in the space, which leads to the direction of maximum sparseness. If we modify the algorithm to update the top n winners among all the vectors that have the maximum subspace projection norm, then we could find the sparse structure in this subspace representation. At the current stage we have not yet tested this hypothesis, but it is a reasonable research direction for future work.

2. Other basis development algorithms should also be considered and explored, including but not limited to Linear Discriminant Analysis (LDA), non-linear PCA, etc.

3. Temporal information is extremely important in perception; however, the current model does not deal with much temporal information.
A promising research direction is to incorporate temporal information and look for spatio-temporal filter development algorithms that can capture structure in both the spatial and temporal domains.

4. A general-purpose vision system including an attention selection mechanism that accommodates both top-down and bottom-up attention selection should be tested in real environments. This architecture will serve as the sensory mapping part of the autonomous mental development (AMD) paradigm.

Appendix A

The Candid Covariance-free Incremental PCA Algorithm

The main goal of the CCIPCA algorithm (Weng et al., 2003) is to compute the principal component vectors incrementally without estimating the covariance matrix of the input vectors. Similar techniques date back to (Oja, 1982) and (Sanger, 1989). However, in this method statistically efficient estimation is considered; thus it is much faster than the previous learning-rate based methods. The following algorithm does not assume staggering; staggering only requires that the residual images are tiled together before the next principal vector is computed at a shifted position.

In this algorithm, the principal components of the input are computed incrementally. Each input is a column vector x_1, x_2, ..., x_n, ..., with dimension d. The covariance matrix A of the vectors is unknown and cannot be estimated. An incremental approximation is needed of the first k largest eigenvalues λ_1(n), λ_2(n), ..., λ_k(n) and the associated eigenvectors V(n) = [v_1(n), v_2(n), ..., v_k(n)], for n ≥ k, where each v_i(n) is a d-dimensional column vector, so V(n) has k columns. Since retaining each of x_1, x_2, ..., x_n would require too much space, all operations are done incrementally. The previous eigenvalues λ_i(n-1) and previous eigenvectors V(n-1) must be updated using only the new sample input x_n.

CCIPCA algorithm
1: Choose the initial guess of x̄(0) and k, where x̄(0) is the sample mean and k is the number of eigenvectors with the largest k eigenvalues.
2: Choose the initial guess of V(0). For example, it can be any normalized random image. Choose the initial guess λ_i(0) = 0, i = 1, 2, ..., k.
3: for n = 1, 2, 3, ..., N, where N is the total number of samples, do
4:   Get a new sample x_n.
5:   Use the following incremental method to compute the amnesic average x̄(n):
       x̄(n) = ((n - 1 - l)/n) x̄(n-1) + ((1 + l)/n) x_n,   (A.1)
     where l is the amnesic average parameter. When n - 1 - l ≤ 0, use non-amnesic averaging (the above equation with l temporarily set to zero).
6:   Compute the scatter vector: u_n = x_n - x̄(n).
7:   Set the first residual vector u_1 = u_n.
     {Now we proceed to update the i-th principal vector and project the residual vector. Between iterations, the residual vector (next receptive field) is returned to a larger residual image before the next filter is extracted. At this point, shifting can occur to implement the "scattered" receptive field method.}
8:   for i = 1, 2, 3, ..., k do
9:     Update the i-th principal vector:
         v_i(n) = α(n) v_i(n-1) + β(n) ( u_i^T v_i(n-1) / ||v_i(n-1)|| ) u_i,   (A.2)
       where
         α(n) = (n - 1 - l)/n,   β(n) = (1 + l)/n   (A.3)
       are the coefficients of the amnesic average, and l is a constant.
10:    Compute the coordinate of the residual vector along the current i-th principal component:
         y_i = u_i^T v_i(n) / ||v_i(n)||.   (A.4)
11:    Update the i-th eigenvalue:
         λ_i(n) = α(n) λ_i(n-1) + β(n) y_i^2.   (A.5)
12:    Compute the remaining residual vector (next input):
         u_{i+1} = u_i - y_i v_i(n) / ||v_i(n)||.   (A.6)
13:  end for
14: end for
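A compact sketch of the loop above follows. It simplifies the early-sample handling (clamping n - 1 - l at zero rather than switching to plain averaging), omits the staggered-receptive-field tiling mentioned in the comment of step 7, and uses an arbitrary initialization; it is meant only to make the updates (A.1)-(A.6) concrete.

```python
import numpy as np

def ccipca(X, k, l=2.0):
    """Incremental PCA following the CCIPCA updates (A.1)-(A.6), without staggering."""
    d = X.shape[1]
    mean = np.zeros(d)
    V = np.random.default_rng(0).standard_normal((k, d)) * 1e-2  # initial guesses of v_i
    lam = np.zeros(k)                                            # eigenvalue estimates
    for n, x in enumerate(X, start=1):
        a = max(n - 1 - l, 0) / n                                # amnesic coefficients
        b = 1.0 - a
        mean = a * mean + b * x                                  # Eq. (A.1)
        u = x - mean                                             # scatter vector
        for i in range(k):
            vn = np.linalg.norm(V[i]) + 1e-12
            V[i] = a * V[i] + b * (u @ V[i] / vn) * u            # Eq. (A.2)
            vn = np.linalg.norm(V[i]) + 1e-12
            y_i = u @ V[i] / vn                                  # Eq. (A.4)
            lam[i] = a * lam[i] + b * y_i ** 2                   # Eq. (A.5)
            u = u - y_i * V[i] / vn                              # Eq. (A.6)
    return V / np.linalg.norm(V, axis=1, keepdims=True), lam

# usage on synthetic data with two dominant directions
rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 10)) * np.array([5, 3] + [1] * 8)
V, lam = ccipca(X, k=3)
print(lam)
```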
The eigenvectors that are incrementally developed by this algorithm represent a basis of the sample space. Their respective eigenvalues represent the weight of each of the filters and are used to normalize them, so that each filter has the same weight when computing a response to the input. This procedure is also referred to as "whitening." After whitening, the outputs of the filters are uncorrelated and have an identity covariance matrix. The whitened filter output is computed by dividing each filter's response by the square root of the corresponding eigenvalue.

Appendix B

Derivative of the Maximum Function

To simplify the derivation, let

$$ f_i(\boldsymbol{\beta}) = \Big( \frac{2}{t}\big( \boldsymbol{\beta}^{\top}\mathbf{c}_i - \mathbf{y}^{\top}\mathbf{h}_i \big) \Big)^2, \quad i < n, \qquad f_n(\boldsymbol{\beta}) = 1. \qquad (B.1) $$

Then

$$ \max_j \{ f_j(\boldsymbol{\beta}) \} = \lim_{r\to\infty} \Big[ \sum_j f_j(\boldsymbol{\beta})^r \Big]^{1/r}. $$

The derivative of the maximum function can therefore be written as

$$ \frac{\partial}{\partial\boldsymbol{\beta}} \max_j \{ f_j(\boldsymbol{\beta}) \} = \lim_{r\to\infty} \Big\{ \frac{1}{r} \Big[ \sum_j f_j(\boldsymbol{\beta})^r \Big]^{\frac{1}{r}-1} \sum_j r\, f_j(\boldsymbol{\beta})^{r-1} \frac{\partial f_j(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} \Big\} = \lim_{r\to\infty} \Big\{ \Big[ \sum_j f_j(\boldsymbol{\beta})^r \Big]^{\frac{1}{r}-1} \sum_j f_j(\boldsymbol{\beta})^{r-1} \frac{\partial f_j(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} \Big\}. \qquad (B.2) $$

Let

$$ A = \sum_j f_j(\boldsymbol{\beta})^r, \qquad (B.3) $$

and let m = arg max_j { f_j(β) }. Then

$$ \frac{\partial}{\partial\boldsymbol{\beta}} \max_j \{ f_j(\boldsymbol{\beta}) \} = \lim_{r\to\infty} A^{\frac{1}{r}-1} \sum_j f_j(\boldsymbol{\beta})^{r-1} \frac{\partial f_j(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}. \qquad (B.4) $$

Notice that when r approaches infinity, A^{1/r - 1} approaches f_m(β)^{1-r}, so only the term f_m(β)^{r-1} / f_m(β)^{r-1} = 1 remains and all other terms vanish. We obtain the derivative of the maximum function,

$$ \frac{\partial}{\partial\boldsymbol{\beta}} \max_j \{ f_j(\boldsymbol{\beta}) \} = \frac{\partial f_m(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}, \qquad (B.5) $$

where m = arg max_j { f_j(β) }.

Bibliography

Adrian, E. D. (1941). Afferent discharges to the cerebral cortex from peripheral sense organs. J. of Physiology, 100:154-191.

Aloimonos, Y. (1992). Purposive active vision. Computer Vision, Graphics, and Image Processing: Image Understanding, 17(1-3):285-348.

Amaldi, E. and Kann, V. (1998). On the approximibility of minimizing non zero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209:237-260.

Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27:77-87.

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276.

Amari, S.-I., Cichocki, A., and Yang, H. (1996). A new learning algorithm for blind source separation. In Advances in Neural Information Processing Systems 8, pages 757-763. MIT Press.

Atick, J. J. and Redlich, A. N. (1993). Convergent algorithm for sensory receptive field development. Neural Computation, 5:45-60.

Backer, G., Mertsching, B., and Bollmann, M. (2001). Data- and model-driven gaze control for an active-vision system. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(12):1415-1429.

Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages. In Rosenblith, W. A., editor, Sensory Communication, pages 217-234. MIT Press.

Barlow, H. B. (1972). Single units and sensation: a neurone doctrine for perceptual psychology? Perception, 1:371-394.

Bartsch, A. P. and van Hemmen, J. L. (2001). Combined hebbian development of geniculocortical and lateral connectivity in a model of primary visual cortex. Biological Cybernetics, 84:41-55.

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159.

Bell, A. J. and Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327-3338.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University Press, Oxford.

Blakemore, C. and Cooper, G. F. (1970). Development of the brain depends on the visual environment. Nature, 228:477-478.

Blaschke, T. and Wiskott, L. (2002). An improved cumulant based method for independent component analysis. In Dorronsoro, J. R., editor, Proc. Intl. Conf.
on Artificial Neural Networks - I CANN ’02, Lecture Notes in Computer Science, pages 1087—1093. Springer. Burger, T. and Lang, E. (1999). An incremental Hebbian learning model of the primary visual cortex with lateral plasticity and real input patterns. Zeitschrift fur Naturforschung C—A Journal of Biosciences, 54:128—140. Cardoso, J .-F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4:112—114. Cardoso, J.—F. and Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12):3017—3030. Christianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Ma- chines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK. Comon, P. (1994). Independent component analysis, A new concept? Signal Pro- cessing, 36:287—314. Crair, M., Gillespie, D., and Stryker, M. (1998). The role of visual experience in the development of columns in cat visual cortex. Science, 279:566—570. Desimone, R. and Duncan, J. (1995). Neural mechanism of selective visual attention. Annual Review Neuroscience, 18:193—222. Ditterich, J ., Eggert, T., and Straube, A. (2000). The rule of attention focus in the visual information processing underlying saccadic adaptation. Vision Research, 40:1125—1134. Elman, J., Bates, E. A., Johnson, M. H., Karmiloff—Smith, A., Parisi, D., and Plun- kett, K. (1997). Rethinking Innateness: A connectionist perspective on develop- ment. MIT Press, Cambridge, Massachusetts. Eriksen, C. and Yeh, Y. (1985). Allocation of attention in the visual field. Journal of experimental Psychology: Human Perception and Performance, 11(5):583—597. 173 Field, D. J. (1994). What is goal of sensory coding? Neural Computation, 6:559~601. F igueiredo, M. (2003). Adaptive sparseness for supervised learning. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(9):1150—1159. Friedman, J. (1987). Exploratory projection pursuit. J. of the American Statistical Association, 82(397):249—266. Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for ex- ploratory data analysis. IEEE Trans. of Computers, c-23(9):881-—890. Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193—202. Gersho, A. (1979). Asymptotically optimal block guantization. IEEE Trans. on Information Theory, 25(4):373—380. Giannakopoulos, X., Karhunen, J., and Oja, E. (1998). Experimental comparison of neural ICA algorithms. In Proc. Int. Conf. on Artificial Neural Networks (ICANN’98), pages 651—656, Skvde, Sweden. Gray, R. M. and Neuhoff, D. L. (1998). Guantization. IEEE Trans. on Information Theory, 44(6):1—63. Hebb, D. O. (1949). The organization of behavior; a neuropsychological theory. Wiley, New York. Herault, J. and Jutten, J. (1986). Space or time adaptive signal processing by neural network models. In Denker, J. S., editor, Neural Networks for Computing: AIP Conference Proceedings, volume 151, New York. American Institute for Physics. Hoerl, A. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthog- onal problems. Technometrics, 12:55—67. Hopfinger, J., Buonocore, M., and Mangun, G. (2000). The neural mechanism of top-down attention control. Nature Neuroscience, 3(3):284-291. Hubel, D. and Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. of Physiology, 160:106—154. Huber, P. J. (1985). 
Projection pursuit. The Annals of Statistics, 13(2):435—475. Hwang, W. S. and Weng, J. (2000). Hierarchical discriminant regression. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11):1277—1293. Hyvarinen, A. (1998). New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems, volume 10, pages 273—279. MIT Press. 174 Hyvarinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2:94—128. Hyvarinen, A. (2001). Fast independent component analysis. In Roberts, S. and Everson, R., editors, Independent Component Analysis: Principles and Practice. Cambridge University Press. in press. Hyvarinen, A. and Hoyer, P. O. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705—1720. Hyvarinen, A. and Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483-1492. Hyvarinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4—5):411—430. Itti, L., Gold, C., and Koch, C. (2001). Visual attention and target detection in clutterd natural scenes. Optical Engineering, 40(9):1784-1793. Itti, L. and Koch, C. (2001). Computational modeling of visual attention. Nature Reviews Neuroscience, 2(3):194—203. Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(11):1254—1259. James, W. (1890/1981). The Principles of Psychology. Harvard University Press, Cambridge, MA. Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag, New York. Jones, M. and Sibson, R. (1987). What is projection pursuit ? J. of the Royal Statistical Society, Ser. A, 150:1—36. Kandel, E. R., Schwartz, J. H., and Jessell, T. M., editors (2000). Principles of Neural Science. McGraw-Hill, New York, 4th edition. Karhunen, J. and Pajunen, P. (1997). Blind source separation using least-squares type adaptive algorithms. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pages 3048—3051, Munich, Germany. Karhunen, J ., Pajunen, P., and Oja, E. (1998). The nonlinear pca criterion in blind source separation: Relations with other approaches. Neurocomputing, 22:5—20. Karvanen, J. and Cichocki, A. (2003). Measuring sparseness of noisy signals. In Proceedings of 4th International symposium on Independent Component Analysis and Blind Signal Seperotion (ICA2003), pages 125—130. 175 Kass, J. (1997). Topographic maps are fundamental to sensory processing. Brain Research Bulletin, 44(2):107-112. Kirby, M. and Sirovich, L. (1990). Application of the karhunen—loeve procedure for the characterization of human faces. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(1):103—108. Koch, C. and Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neuronbiology, 42219—227. Kohonen, T. (2001). Self-organizing Maps. Springer-Verlag, New York. 3rd edition. Kullback, S. (1959). Information Theory and Statistics. Wiley. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22:79—86. Lee, T. W., Girolami, M., and Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed sub—gaussian and super- gaussian sources. 
Neural Computation, 11(2):417—441. Lehmann, E. L. (1983). Theory of Point Estimation. John Wiley and Sons, Inc., New York. Lin, Y. (2002). Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, 6:259—275. Linsker, R. (1986). From basic network principles to neural architecture: emergence of spatial-oponent cells. Proc. Natl. Acad. Sci, 83:7508—7512, 8390—8394, 8779—- 8783. Linsker, R. (1988). Self-organization in a perceptual network. Computer, pages 105— 117. Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual informa- tion in a linear network. Neural Computation, 4:691—702. Luenverger, D. G. (1969). Optimization by Vector Space Methods. John Wiley and Sons, Inc. Marshall, W. and Talbot, S. (1942). Recent evidence for neural mechanisms in vision leading to a general theory of sensory acuity. Bio. Symp., 7:117—164. McLachlan, G. J. (1997). The EM Algorithm and Extensions. Wiley, New York. Miller, K. D. (1994). A model for the development of simple cell receptive fields and the odered arrangement of orientation columns through activity-dependent competition between on- and off-center inputs. J. of Neuroscience, 14:409—441. 176 Nadal, J.-P. and Parga, N. (1994). Non-linear neurons in the low noise limit: a factorial code maximizes information transfer. Network, 52565-581. Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. of Mathematical Biology, 15:267—273. Olshausen, B., Anderson, C., and Essen, D. V. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13(11):4700-4719. Olshaushen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607—609. Olshaushen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy used by v1? Vision Research, 37(23):3311—3325. Pajunen, P. and Karhunen, J. (1998). Least-squares methods for blind source sepa- ration based on nonlinear PCA. Int. J. of Neural Systems, 8(5-6):601-612. Papoulis, A. (1991). Probability, random variables, and stochastic processes. McGraw- Hill, New York. 3rd edition. Pashler, H., editor (1996). Attention. University College London, London. Pearlmutter, B. A. and Parra, L. C. (1997). Maximum likelihood blind source sepa- ration: a context-sensitive generalization of ICA. In Advances in Neural Infor- mation Processing Systems, volume 9, pages 613-619. Pham, D.—T. and Garrat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Signal Processing, 45(7):1712-1725. Pham, D.-T., Garrat, P., and Jutten, C. (1992). Separation of a mixture of inde- pendent sources through a maximum likelihood approach. In Proc. E USIPCO, pages 771—774. Roth, Z. and Baram, Y. (1996). Multidimensional density shaping by sigmoids. IEEE Trans. on Neural Networks, 7(5):1291—1298. Rubner, J. and Schulten, K. (1990). Development of feature detectors by self- organization. Biological Cybernetics, 62:193—199. Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedfor- ward neural network. IEEE Trans. on Neural Networks, 2(7):459—473. Schill, K., Umkehrer, E., Beinlich, S., Krieger, G., and Zetzsche, C. (2001). Scene analysis with saccadic eye movemens: Top-down and bottom-up modeling. Jour- nal of Electronic Imaging, 10(1):152—160. 
177 Sillito, A., Grieve, K., Jones, H., Cudeiro, J., and Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature, 378:492—496. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chap- man and Hall, London. Sirosh, J. and Miikkulainen, R. (1997). Topographic receptive field and patterned lateral interaction in a self-organizing model of the primary visual cortex. Neural Computation, 9:577—594. Sur, M., Angelucci, A., and Sharm, J. (1999). Rewiring cortex: The role of pat- terned activity in development and plasticity of neocortical circuits. Journal of Neurobiology, 41:33—43. Suykens, J. A. K. and Vandewalle, J. (1999). Least squares support vector machine classifiers. Journal of Neural Processing Letters, 9(3):293—300. Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. J. Royal Statistical Society (B), 58:267—288. Tipping, M. (2001). Sparse bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211—244. Treisman, A. (1986). Features and objects in visual processing. Scientific American, 255(5):114-—125. Treisman, A. and Gelade, G. (1980). A feature—intergration theory of attention. Cognitive Pshchology, 12(1):97—136. Tsotsos, J., Culhane, S., Wai, W., Lai, Y., Davis, N., and Nuflo, F. (1995). Modeling visual attention via selective tuning. Artificial Intelligence, 78:507—545. Tucker, H. G. (1962). An introduction to probability and mathematical statistics. Academic Press, New York. Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71—86. Ungerleider, L. and Mishkin, M. (1980). Two cortical visual systems. In Ingle, D. J ., Goodale, M. A., and Mansfield, R. J. W., editors, Analysis of Visual Behavior. MIT Press, Cambridge, MA. von der Malsburg, C. (1973). Self-organization of orientation-sensitive cells in the striate cortex. Kybernetik, 15285—100. Weinberg, R. J. (1997). Are topographic maps fundamental to sensory processing? Bruin Rearch Bulletin, 44(2):113—116. Weng, J ., Ahuja, N., and Huang, T. (1997). Learning recognition and segmentation using the cresceptron. International Journal of Computer Vision, 25(2):109—143. 178 Weng, J., Huang, T. S., and Ahuja, N. (1993). Motion and Structure from Image Sequences. Springer-Verlag, New York. Weng, J. and Hwang, W. (2002). Online image classification using IHDR. Interna- tional Journal on Document Analysis and Recognition, 5(2-3):118—125. Weng, J., McClelland, J., Pentland, A., Sporns, O., Stockman, I., Sur, M., and Thelen, E. (2001). Autonomous mental development by robots and animals. Science, 291(5504):599—600. Weng, J. and Stockman, I. (2002). Autonomous mental deveIOpment: Workshop on development and learning. AI Magazine, 23(2):95—98. Weng, J. and Zhang, N. (2004). A quasi-optimally efficient algorithm for independent component analysis. In Proc. IEEE Int. Conf. on Acoustics Speech, and Signal Processing, Montreal, Canada. Weng, J., Zhang, Y., and Hwang, W. (2003). Candid covariance-free incremental principal component analysis. IEEE Trans. Pattern Analysis and Machine In- telligence, 25(8):1034—1040. Wiskott, L. and Sejnowski, T. (2002). Slow feature analysis: unsupervised learning of invariances. Neural Computation, 14(4):715—770. Zador, P. L. (1982). Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Trans. on Information Theory, 28(2):139—149. Zhang, N. and Weng, J. (2004). 
Sparse representation from a winner-takeall neu- ral network. In Proc. IEEE Int. Joint Conf. on Neural Networks, Budapest, Hungary. 179