SENSORY MAPPING, DEVELOPMENT AND THEIR APPLICATIONS TO FEATURE AND ATTENTION SELECTION

By

Nan Zhang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2006

ABSTRACT

SENSORY MAPPING, DEVELOPMENT AND THEIR APPLICATIONS TO FEATURE AND ATTENTION SELECTION

By Nan Zhang

A rich literature has been generated on computer vision systems. However, an implementable computational model of immediate vision for general and unknown environments is still elusive. Motivated by the autonomous development process of humans, we are interested in building a robot vision system that automatically develops its visual cognitive skills through real-time interactions with the environment. Developmental vision is required by the demands of general-purpose vision systems for complex human environments.

Based upon the above requirement, in this dissertation we propose a general architecture called Staggered Hierarchical Mapping (SHM) that performs feature derivation for a set of receptive fields and attention selection. The work reported here is motivated by the structure of the early visual pathway. We use several layers of staggered receptive fields to model the needed units of local analysis. From sequentially sensed video frames the proposed algorithm develops a hierarchy of filters whose outputs are maximally uncorrelated within each layer, with receptive fields whose scale increases from low to high layers. We also show how this general architecture can be applied to occluded face recognition, which demonstrates a case of attention selection.

Besides this general neural network architecture, we develop several sensory mapping learning rules for deriving feature detectors, including a fast incremental independent component analysis (ICA) method called Lobe Component Analysis (LCA), which derives independent components for many natural cases. A mathematical analysis of the algorithm has been carried out, which shows the advantages of the LCA method.

We push the research further by investigating the LCA method in an overcomplete setting, which means that the number of basis functions is greater than the dimension of the observation. A new incremental method is developed to solve the equivalent regression problem with a sparse regularization term, i.e., the LASSO regression. We show that the LCA method is a special case when noise is not present. This makes LCA's principle applicable to a large variety of applications, e.g., classification, regression, and feature selection.

To Eric.

ACKNOWLEDGMENTS

While working on the subject of this thesis for the last four years, I enjoyed the support of my colleagues and other people. Of course, the most direct support came from my supervisor, Dr. Juyang Weng, who helped me choose the topic of this dissertation and has been a great source of ideas, comments and encouragement. Without his support, this dissertation would not have been completed. I am grateful to Dr. Jain for offering an excellent course on pattern classification and machine intelligence, from which I learned so much. My gratitude also goes to all the professors on my committee, Dr. Ofria, Dr. Henderson, and Dr.
Stockman, for providing many valuable critiques and comments on my research topic and for their personal support. I would like to express my special thanks to Dr. Zhang, who took pains to travel across the country to attend my committee meetings and provided a research grant for my research project.

I used several pieces of software and code developed by my colleagues. Wey-Shiuan Hwang provided the HDR code. Yilu Zhang invested a huge amount of time to help me with the CCIPCA algorithm and code. Shuqing Zeng joined forces with me to develop many of the test experiments in this dissertation. Raja Ganjikunta and Martin Law have shared many valuable ideas with me in the research. Without their help, it would have taken me much longer, if it had been possible at all, to finish this work.

Thanks also to the members of both the EI and PRIP Laboratories, past and present, at Michigan State University. I benefited from many discussions with my colleagues. Thanks go to Xiao Huang, Feilong Chen, Zhengping Ji, Wei Tong, Xiaoguang Lv, Hong Chen, Yi Chen, Ameet Joshi and Jason Massey. Special thanks to Gil Abramovich, who provided support throughout and helped me get rid of the occasional gloomy thoughts while working on this dissertation.

I am greatly indebted to my parents for their continuous encouragement and support throughout my life. Thanks also go to my wife, Yue Wang, and my son Eric for their patience and support during the years at Michigan State University.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1 Introduction
1.1 Motivation
1.2 Background of biological visual system
1.2.1 Biological visual system
1.2.2 Early Visual Pathway
1.2.3 Simple and Complex Cells
1.2.4 Topographic Maps in Brain
1.3 Early visual processing models
1.4 Models for visual attention
1.4.1 Attention mechanism in the brain
1.4.2 Bottom-up attention selection
1.4.3 Top-down attention selection
1.4.4 Summary
1.5 Discussion
2 Staggered Hierarchical Mapping for Attentive Perception: A Developmental Model
2.1 Introduction
2.2 Staggered Hierarchical Mapping
2.3 Developmental Algorithm
2.4 Computational complexity analysis
2.5 Input Image Reconstruction
2.6 Attention Effectors
2.7 Experiment with SHM
2.8 Experiment with Occlusion
2.9 Experiment with Navigation Simulation
2.10 Conclusions
3 Lobe Component Analysis
3.1 Introduction to Independent Component Analysis
3.1.1 Problem statement and assumptions
3.1.2 Overview of ICA methods
3.1.3 Methods based on Non-Gaussianality measurement
3.1.4 High order cumulant based method
3.1.5 Methods based on Maximum likelihood and Infomax
3.1.6 Methods based on Sparseness measurement
3.1.7 Summary
3.2 Lobe Component Analysis
3.2.1 Fine structure of high-dimensional density
3.2.2 Global: Whitening the data
3.2.3 Local: Lobe components
3.2.4 Belongingness
3.2.5 Statistical efficiency
3.2.6 Amnesic mean
3.2.7 Efficiency of the amnesic mean
3.2.8 Proposed CCILCA Algorithm
3.2.9 Time and space complexities
3.2.10 LCA is ICA for super-Gaussians
3.3 Experiment Results
3.3.1 Low dimensional simulation
3.3.2 Comparison with Other ICA Methods
3.3.3 Cocktail Party Problem
3.3.4 Natural Image ICA
3.4 Derivation of the LCA algorithm
3.4.1 lp norm sparseness measurement
3.4.2 Infinity norm sparseness optimization
3.5 Broader Impact
3.6 Conclusion
4 Sparse representation and Feature Selection
4.1 Introduction: The Supervised Learning Problem
4.2 Method
4.2.1 Duality
4.2.2 Derivations
4.2.3 Quadratic Programming
4.2.4 Filter Development
4.2.5 Gradient Sparseness Optimization Algorithm
4.3 Experiments
4.3.1 Kernel Regression
4.3.2 Classification
4.4 Conclusions
5 Conclusions
5.1 Summary
5.2 Contributions
5.3 Summary of the future work
APPENDICES
A The Candid Covariance-free Incremental PCA Algorithm
B Derivative of Maximum Function

LIST OF FIGURES

1.1 Mammalian early visual pathway. (Adapted from (Kandel et al., 2000).)
1.2 The receptive fields in the primary visual cortex are different and more varied than those in the retina and LGN. (Adapted from (Hubel and Wiesel, 1962).)
1.3 The orientation preference map of the macaque monkey's striate cortex, measured by optical imaging technology.
1.4 Brain areas involved in attention control.
1.5 Bottom-up attention selection model. Adapted from (Treisman, 1986).
2.1 Recognition under occlusion through features extracted using a sensory mapping architecture.
2.2 A comparison of the neural cylinder (a) and the staggered receptive fields (b).
2.3 Localized receptive field with size 4.
2.4 Computational order of eigen-groups.
2.5 The architecture of SHM.
2.6 Sequence of residual images.
2.7 The filter connections that cover the outliers of the image have zero input.
2.8 Several samples of the 5,000 natural images collected.
2.9 The filters developed in the first layer by the sharing method.
2.10 The reconstructed images from different levels.
2.11 The reconstruction error of each layer.
2.12 The schematic illustration of operation of the occlusion experiment.
2.13 Merged image sample from the two cameras.
2.14 Six attention positions on the image.
3.1 The cocktail party problem.
3.2 The density functions of super-Gaussian, Gaussian, and sub-Gaussian distributions.
3.3 Two Laplace distributions.
3.4 Two Laplace distributions mixed by mixing matrix A.
3.5 Example of two uniform distributions.
3.6 After whitening, the joint distribution is sphered.
3.7 Joint distribution of two unit-variance Gaussian random variables.
3.8 Gaussian signal vs. sparse signal.
3.9 Two neurons with horizontal inhibitions.
3.10 Illustration of lobe components.
3.11 The sample space of a zero-mean white random vector x in 2-D space, partitioned into lobe regions.
3.12 The amnesic average coefficient.
3.13 The error coefficients C(n) for amnesic means with different amnesic functions p(n).
3.14 The proposed method on a 2-D Laplace distribution data set.
3.15 Comparison of ICA results among (Type-3) NPCA, (Type-2.5) ExtBS1 and (Type-4) CCILCA for super-Gaussian sources.
3.16 Comparison among (Type-2) Extended Infomax, (Type-1) FastICA and three variants of the proposed Type-4 CCILCA algorithms.
3.17 Cocktail party problem (source, mixed, and recovered sound signals).
3.18 Lobe components from natural images (not factorizable).
3.19 Results of Algorithm 5: topographically ordered basis functions in ICA of natural images.
3.20 Results of Algorithm 5: number of times each independent component is the "winner."
3.21 Data set formed by the joint distribution of two unit-variance Laplace distributions.
3.22 The lp norm criterion functions under different p.
4.1 Geometric explanation of duality.
4.2 Kernel regression results.
4.3 Objective function vs. number of iterations.
4.4 The effect of the control parameter t on mean square error and sparseness.
4.5 The decision boundary on the artificial XOR data set.
5.1 The flow diagram of a developmental vision system.

LIST OF TABLES

1.1 Differences in the sensitivity of M and P cells to stimulus features. (Adapted from (Kandel et al., 2000))
2.1 Summary of the occlusion experiment.
2.2 Summary of the navigation image simulation.
4.1 The result of the 5-fold cross-validation.
5.1 Contributions of the dissertation.

Chapter 1

Introduction

1.1 Motivation

The performance of current computer vision systems lags far behind that of biological visual processing systems. Computer vision systems struggle with many problems that even primitive animal vision systems solve easily. One possible reason is that the brain is built on a much larger scale in terms of computational units. For example, the human brain contains about 10^15 synapses, compared to fewer than 10^8 transistors, in addition to the more complex function of the neural cells (Kandel et al., 2000). It is almost impossible to design and manufacture a computer with 10^15 transistors and make it work.
How does nature manage this problem? It is believed that the genome plays a major role in forming the proteins, and thus the tissues and organs. Is the form of the synapses completely determined by human genes? The answer is negative. The genome has only approximately 10^5 genes. Even if we assume that most of the genes are dedicated to coding the synapses, the required compression ratio would be as high as 1 to 10^10. So far, we do not observe that much redundancy in the neural synapses. Examination of the neural system reveals that the environment has a large effect on the form of the neural system. As early as 1970, Blakemore and Cooper (Blakemore and Cooper, 1970) reported that kittens raised after birth in a controlled environment, where only edges of a certain orientation were present, did not develop visual cortical cells sensitive to the edge orientations they had never observed. Recent studies by Sur and coworkers (Sur et al., 1999) have shown that the input signal can fundamentally determine the filters generated during development. They rewired the visual sensory axons of ferrets to the auditory cortex and tested the sensory responses there. The auditory cortex shows orientation-sensitive cells if it receives visual signals early in life, which cannot be found in a natural auditory cortex. It is now widely accepted that the neural system is largely formed by the signals it conveys. In engineering terms, this is a data-driven self-organizing system.

Another possible reason that natural vision systems outperform their computer counterparts lies in the selective visual attention mechanism used by the brain. Attention implements an information-processing control gate, allowing only a small part of the incoming sensory input to reach short-term memory and visual awareness (Desimone and Duncan, 1995). Instead of trying to fully process the entire sensory input in parallel, nature has devised a serial strategy to achieve near-optimal, real-time performance despite limited computational resources. Selective visual attention breaks the cluttered visual field down into a series of computationally light, localized visual analyses.

In this chapter, I will focus on these two problems and review the relevant works. Our goal is to find the relationship between neural development and attention selection and to explore the possibility of learning to pay attention.

This chapter is organized as follows: Section 1.2 introduces some background knowledge of the biological visual system. Section 1.3 reviews several models of early visual processing with an emphasis on modeling the topographic maps of the cortex. Models of visual attention, which are the major concern of this chapter, are reviewed in Section 1.4. In the last section we discuss how visual system development relates to the learning of visual attention.

1.2 Background of biological visual system

The visual system has the most complex neural circuitry of all the sensory systems. The auditory nerve contains 30,000 fibers, but the optic nerve contains over one million! The task of constructing such a system is tremendously complex. Early stages of this process, in which neurons find their proper locations and send axons to roughly the right region of the correct target structure, are known to be independent of neural activity and thus could be wired by genes. Later stages of this process, however, depend on neural electrical activity to form the precise synaptic structure.

What is known about the activity-dependent process of synapse formation is generally consistent with a hypothetical rule proposed by Hebb (Hebb, 1949), also known as the "Hebbian learning rule": synapses are strengthened if there is a temporal correlation between their presynaptic and postsynaptic patterns of activity, and weakened otherwise. In a concise phrase, "neurons that fire together will be wired together."
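To make the rule concrete, here is a minimal numerical sketch (my own illustration, not a model taken from the literature reviewed in this chapter): a single linear unit whose weights grow with the correlation between presynaptic input and postsynaptic response, with an Oja-style decay term assumed in order to keep the weights bounded.

import numpy as np

def hebbian_step(w, x, eta=0.01):
    y = float(np.dot(w, x))            # postsynaptic response to presynaptic activity x
    # Hebbian term y*x strengthens synapses whose input correlates with the response;
    # the subtractive Oja term y*y*w (an added assumption) keeps ||w|| from diverging.
    return w + eta * y * (x - y * w)

rng = np.random.default_rng(0)
w = rng.normal(size=2)
for _ in range(5000):
    x = rng.normal(size=2) * np.array([3.0, 0.5])   # anisotropic "stimulus" stream
    w = hebbian_step(w, x)
print(w / np.linalg.norm(w))           # aligns with the input's dominant correlation direction

Run repeatedly, the weight vector settles along the direction of the strongest input correlations, which is the sense in which correlated activity "wires together" the corresponding synapses.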
In this section, I will give a brief introduction to the biological early visual system and review some of the evidence that patterned-activity-driven synapse modification operates in neural development.

1.2.1 Biological visual system

The biological visual system has been studied in several species, including cat, ferret, and monkey. Although their visual systems do contain some structural and functional differences, the main mechanisms of visual processing and of the development of the circuitry are expected to apply to other species as well.

Fig. 1.1 shows the early visual pathway of the human. Visual processing can be separated into roughly three stages: the retina, the lateral geniculate nucleus (LGN) and the primary visual cortex. Light projected onto the retina excites the photoreceptors and related cells on the rear surface of the eye. The activity of the photoreceptors is encoded by the retinal ganglion cells. These cells are not evenly distributed: the densest section, called the fovea, is around the center of the visual field. The visual fields of the left and the right eye have significant overlap, so that the visual field has both binocular and monocular zones. Light from the binocular zone is projected onto both eyes, while light from the monocular zone appears on only one of the retinas.

Figure 1.1: Mammalian early visual pathway. (Adapted from (Kandel et al., 2000).)

Visual information travels in separate pathways for each half of the visual field. The signal in the left hemifield of both retinas goes to the left subcortical area; similarly, the right hemifield only has connections to the right subcortical area. The optic nerves from each eye project to the optic chiasm, where fibers from each eye destined for one or the other side of the brain are sorted out and bundled into the bilateral optic tracts, which project to three major subcortical areas: the pretectum, the superior colliculus and the lateral geniculate nucleus (LGN). The former two subcortical areas are known to be in charge of saccadic eye movements and pupillary reflexes, respectively, which is outside the scope of this chapter. Ninety percent of the retinal axons terminate in the lateral geniculate nucleus of the thalamus. The LGN is the main terminus for input to the visual cortex (Kandel et al., 2000). We will focus on the projection to the LGN. Visual information from the two eyes is kept segregated in different neural layers of the LGN. From the LGN, the signal continues to the primary visual cortex (V1; also called striate cortex or area 17). V1 is the first cortical area that processes visual information. The output from V1 goes on to many higher areas, including areas underlying high-level cognitive processes such as face recognition and top-down attention control.
1.2.2 Early Visual Pathway

The retina, the LGN and the primary visual cortex constitute the early visual pathway. Although the fundamental function of the early visual pathway is still somewhat unclear, many facts already revealed help us to recognize its functionality.

First, this pathway is the major pathway conducting visual information. Without it, visual perception is lost, although some very limited stimulus detection and movement toward objects in the visual field is still possible.

Second, this pathway does not project the original visual field directly into the cerebral cortex. At the photoreceptor level, the representation of the visual field is much like an image, but significant processing occurs in the subsequent subcortical and early cortical stages, which distorts and recodes the visual information (a review can be found in (Kandel et al., 2000)).

Third, the neural cells in this pathway are organized in a hierarchy and have a variety of receptive field sizes. (The receptive field of a cell, by Hubel and Wiesel's definition (Hubel and Wiesel, 1962), is "the region of retina (or visual field) over which one can influence the firing of the cell.") Neurons in higher layers have larger receptive fields than earlier neurons. Within the same layer, different neurons cover different regions of the visual field, and a neuron's firing pattern is a representation of its receptive field. Thus neurons in different layers and at different locations form multi-scale, multi-position representations of the visual field.

Fourth, there is not just one single pathway in early visual processing. As mentioned before, the optic nerves from the left and the right eyes are segregated in this pathway. Furthermore, psychophysical experiments show that the early visual pathway can also be divided into two parallel pathways: the magnocellular pathway (M channel) and the parvocellular pathway (P channel). These pathways start from the magnocellular ganglion cells (M cells) and the parvocellular ganglion cells (P cells), respectively, and remain separated until they are fused in higher cortical areas. M cells and P cells have been shown to be sensitive to different stimulus modalities: M cells are sensitive to luminance contrast and temporal frequency signals, whereas P cells prefer color contrast and spatial frequency changes. In a later section we will discuss why this is believed to be the main biological evidence for the bottom-up attention control mechanism.

Fifth, the early visual pathway is largely formed by visual experience. This is evident from Blakemore and Cooper's newborn kitten experiment (Blakemore and Cooper, 1970) and Sur's optic nerve rewiring experiment (Sur et al., 1999).

The early visual pathway serves an important role in visual information transmission and processing. To study the function of this pathway, one needs to address the processing of individual cells.

1.2.3 Simple and Complex Cells

As we move from the retina to the primary visual cortex, a fundamental change takes place in the nature of the representation of visual information. In the retina, each ganglion cell effectively represents a small region of the image. The receptive field often receives input from only one cone in the fovea, which effectively samples the image. In frequency terms, each cell has a broad bandwidth that is essentially equal among cells of the same type. In the cortex, the situation is strikingly different.
Receptive fields in the cortex are narrow-band and oriented, and may differ from one another in their orientation, size and peak frequency. This means that we have gone from a single representation by relatively homogeneous cells to multiple representations by distinct populations of cells differing in orientation and spatial frequency.

Hubel and Wiesel tried to classify the neural cells in the visual cortex by measuring the response properties of visual cortical cells when the subject was exposed to a black screen with only one light spot or light bar visible. The light stimulus was placed in different positions and the response of the cell was measured to determine its receptive field properties. They found two distinct cell types. The first type of cell has explicit excitatory and inhibitory regions in its receptive field. Illumination of part or all of an excitatory region increased the firing of the cell, whereas a light shone in the inhibitory region suppressed the firing. A large spot confined to either area produced a greater change in firing rate than a small spot, indicating summation within either region. This group of cells is called simple cells. While the receptive fields of simple cells are similar to those of retinal and LGN cells in possessing explicit excitatory and inhibitory regions, they differ tremendously in the spatial arrangement of these regions. Fig. 1.2 shows the difference between retinal and LGN cells and simple cortical cells. The orientation-sensitive arrangement makes simple cells fire strongly to specifically oriented stimuli, which motivated the hypothesis, proposed by Barlow (Barlow, 1972) among others, that simple cells act as "edge detectors."

Figure 1.2: The receptive fields in the primary visual cortex are different and more varied than those in the retina and LGN. Areas marked with "+" denote excitatory regions; correspondingly, "-" denotes inhibitory regions. (A) The concentric receptive field of cells found in the retina and LGN. (B) Receptive field of simple cells in the primary visual cortex. (Adapted from (Hubel and Wiesel, 1962).)

The firing patterns of the other group of neurons are far more intricate and elaborate. The receptive fields of these cells are termed "complex." In the experiments, small spots fail to map out excitatory and inhibitory regions; the "on" and "off" firing patterns are triggered by more complex stimuli, such as slits, edges, and dark bars.

The classification into simple and complex cells is merely the result of external measurement; the two types cannot be separated anatomically. Recall that even the auditory cortex can develop the properties of the visual cortex, so the difference comes to lie in the arrangement of the cells in the preceding layer and in the input signal. Furthermore, studies of neighborhoods of neurons in the visual cortex confirm a topographic distribution of cell properties (Adrian, 1941) (Marshall and Talbot, 1942). That is, whether simple or complex, cells within a close range tend to fire uniformly to the same stimuli, thus forming a visible topographic map in the brain.
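The standard quantitative idealization of such an oriented, excitatory/inhibitory simple-cell receptive field is a Gabor function (an oriented sinusoid under a Gaussian envelope). The sketch below is only an illustration of that idealization, not a model used in this dissertation; the filter size, wavelength and envelope width are arbitrary choices of mine.

import numpy as np

def gabor_filter(size=15, wavelength=6.0, theta=0.0, sigma=3.0):
    # Oriented, simple-cell-like receptive field: positive lobes play the role of the
    # excitatory ("+") subregions, negative lobes the inhibitory ("-") subregions.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)        # coordinate across the preferred orientation
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) * np.cos(2.0 * np.pi * xr / wavelength)
    return g - g.mean()                               # zero mean: balanced excitation and inhibition

# Response of the model "cell" to a vertical grating patch of matching wavelength:
# the response is largest when the filter orientation matches the stimulus.
half = 7
y, x = np.mgrid[-half:half + 1, -half:half + 1]
grating = np.cos(2.0 * np.pi * x / 6.0)
for theta in (0.0, np.pi / 4, np.pi / 2):
    print(round(theta, 2), round(float(np.sum(gabor_filter(theta=theta) * grating)), 2))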
1.2.4 Topographic Maps in the Brain

In all mammals, much of the neocortex consists of orderly representations, or maps, of neurons that are typically topographic at global levels while being iso-property at local levels. Two prominent topographic maps are ocular dominance columns and orientation preference maps.

The ocular dominance columns represent an orderly arrangement of cells that receive inputs only from the left or the right eye and are important for binocular interaction. By injecting radio-labelled amino acids, the ocular dominance columns can be visualized (as stripe-like patterns). The neural pathways from the left and the right eye are shown to be largely segregated through the LGN and the primary visual pathway.

The orientation preference map is the phenomenon that neurons in the primary visual cortex fire selectively to a specific direction of stimuli. Neurons in a nearby region have similar, but not identical, preferences. The preferences repeat continuously at regular intervals (approximately 1-2 mm) in every direction. Fig. 1.3 shows the orientation preference map of the monkey's striate cortex. In some locations, the map forms a pinwheel-like pattern.

Figure 1.3: The orientation preference map of the macaque monkey's striate cortex, measured by optical imaging technology. Each color represents one preferred direction. From the image we can see that neighboring neurons tend to prefer similar directions, forming the unitary color regions called iso-orientation blobs.

Although the topographic map is currently a prominent research topic in neuroscience, its functional role in perception is still controversial (Kass, 1997) (Weinberg, 1997). Weinberg argued that topography could be a mere epiphenomenon of minimal intrinsic significance, but he also admitted that it may be useful for examining possible explanations of cortical function. At the least, this easy-to-observe phenomenon can be used as a guideline for building computational models of the early visual pathway, especially when the goal function of this pathway is still unknown. In fact, many neural computation models have attempted to explain the topographic maps with some success. In the following section we will review some computational models of the topographic maps and simple cells of the early visual pathway.

1.3 Early visual processing models

Visual system development is a complex process. It is difficult to integrate the incoherent experimental results into a profound understanding of how this system is constructed. Computational neural models provide useful ways to recognize functions of the neural system. These can help to validate current theory, and perhaps predict some undiscovered facts.

There has been a long history of researchers using quantitative models to explain some observed topographic maps in the brain, such as ocular dominance stripes and orientation preference maps in the early visual pathway. Although limited by the computational power available, some early models did show that orientation selection could be developed. As early as 1973, von der Malsburg showed that orientation-selective neurons could be developed from oriented image patterns by an unsupervised learning rule (von der Malsburg, 1973). As a pioneering study in this field, von der Malsburg's model had many features that were adopted by many later researchers, such as using a two-dimensional array to represent the subcortical and cortical layers, a nearby-excitatory and faraway-inhibitory mechanism, and an incremental Hebbian learning rule to update the connections (a minimal sketch of this combination of ingredients is given below). von der Malsburg's result is not surprising, because the input patterns were oriented.
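The sketch below shows that combination only in schematic form. It is my own toy illustration, not von der Malsburg's actual model: the activity bump around the best-matching unit is a Kohonen-style shortcut for the settled state of the nearby-excitation/faraway-inhibition dynamics, and the sheet size, bump width and learning rate are assumptions.

import numpy as np

rng = np.random.default_rng(1)
SHEET, DIM, ETA = 8, 16, 0.05                      # 8x8 cortical sheet, 16-D input patterns
W = rng.random((SHEET * SHEET, DIM))
W /= np.linalg.norm(W, axis=1, keepdims=True)      # normalized afferent weights
pos = np.array([(i, j) for i in range(SHEET) for j in range(SHEET)], float)

for _ in range(3000):
    phase = rng.uniform(0.0, np.pi)                # toy "oriented" input: a phase-shifted sinusoid
    x = np.cos(np.linspace(0.0, 2.0 * np.pi, DIM, endpoint=False) + phase)
    resp = W @ x                                   # afferent response of every unit
    winner = int(np.argmax(resp))
    # Activity bump around the winner: stand-in for lateral excitation near the winner
    # and inhibition far from it.
    d2 = ((pos - pos[winner]) ** 2).sum(axis=1)
    act = np.exp(-d2 / 2.0)
    W += ETA * act[:, None] * x[None, :]           # incremental Hebbian update (post * pre)
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # renormalization keeps the weights bounded

# Nearby units end up preferring similar phases, a toy analogue of an
# orientation preference map over the sheet.
tests = np.linspace(0.0, np.pi, 12, endpoint=False)
pref = np.array([int(np.argmax([w @ np.cos(np.linspace(0, 2 * np.pi, DIM, endpoint=False) + t)
                                for t in tests])) for w in W])
print(pref.reshape(SHEET, SHEET))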
Linsker, however, demonstrated that even with random-noise input, without any pre-applied orientation signal, his model could develop orientation preference maps (Linsker, 1988). In his three consecutive papers (Linsker, 1986), Linsker showed how the "on-off" center-surround neuron, the orientation-selective neuron, and orientation preference maps can be developed from a simple set of rules (including Hebbian-type modification). Linsker's model gets support from the biological finding that newborn animals possess orientation preference maps even before their eyes open (Crair et al., 1998). This map, driven by internally generated signals, is modeled by training with random noise.

Linsker's model motivated several other models that integrate a Hebbian learning rule with a localized lateral inhibition rule. In Miikkulainen et al.'s RF-LISSOM model (Sirosh and Miikkulainen, 1997) (short for Receptive-Field Laterally Interconnected Synergetically Self-Organizing Map), two layers of neurons model the LGN layer and the visual cortex, and the cortical layer applies a modifiable lateral excitation-inhibition interconnection. Both afferent and lateral connections are adapted by the same Hebbian mechanism in a purely local and unsupervised learning process. This model can develop ocular dominance columns and orientation preference maps. Other similar models that use lateral inhibition and Hebbian learning can be found in (Burger and Lang, 1999) and (Bartsch and van Hemmen, 2001).

Miller (Miller, 1994) used a different approach to make the simulation more practical. He showed that if the visual system is essentially linear, then the sequence of input patterns can be replaced with a simple function representing its long-term correlations. His model is essentially Hebbian-learning based. Instead of incrementally updating the connections with each input image or pattern, his model integrates the whole input sequence into the Hebbian learning equation and arrives at an equivalent form that only requires the correlation function of the image set. This modification significantly improved the convergence speed of the computation. Compared to the thousands of images used to train other models, this correlation-based model requires only tens of iterations to reach a stable result.

In this section we have reviewed several early visual pathway models that explain the development of orientation preference maps. One prominent question should be asked: why is the orientation preference map important? We believe the orientation map, as a low-level feature extraction mechanism, is very important for the attention selection process. As discussed in the next section, the bottom-up attention selection mechanism fuses several feature maps to obtain the position of the next attention shift. The orientation map is one of the basic feature maps that contribute to attention control. Most important of all, the orientation map is developed from visual experience, which makes attention from learning possible.

1.4 Models for visual attention

The human brain does not take in all the sensory information at one time. Only a subset of the sensory input can be processed. Sometimes this selection happens between different sensory modalities. When focusing on reading a book, one can be very insensitive to the surrounding sounds and voices, and when trying to listen to a sound very carefully, one tends to close one's eyes to help focus on the auditory input. In some sensory modes, selection within the modality is also possible. In the visual modality, it is done by suppressing information outside a spatially circumscribed region of the visual field.
This is called the "focus of attention." In mammals, this attention selection mechanism is implemented either by rapid, saccadic eye movements (also referred to as "overt" shifts of attention) or by so-called "covert" shifts of visual attention (internal shifts, i.e., without moving the eyes).

In the first case, overt attention can be observed externally, as it orients the sensor, or the highest-resolution part of it, toward the attended stimuli. This allows the selected information to enter short-term memory and to remain there long enough to reach conscious and cognitive levels of representation. Most mammalian eye systems have anatomical specializations to accommodate this "spotlight" mechanism as well. In the retinas of many species, the photoreceptors are not distributed evenly: they contain a specialized region around the center of the visual field with a much higher cell density than in the peripheral region. This is called the "fovea" in primates, the "central area" in cats and the "visual streak" in rabbits.

In contrast, covert attention is not directly observable but takes place by filtering or by assigning resources. Covert attention selection usually precedes the overt attention mechanism (Ditterich et al., 2000). In a task of fast visual search for an object, for example, one looks at the scene without a specific fixation, and covert attention filters out irrelevant distractions; after a few candidate positions are selected, saccadic eye movements take place for further visual analysis. Although overt and covert visual attention are modelled in similar ways, they have different costs and effects. Moving the sensors takes longer than internal attention selection. Furthermore, overt attention selection requires shutting down the sensory input during the saccade. But in some real-time active-vision systems, saccadic sensor movement is necessary to prevent the attended object from moving out of the visual field and to take advantage of the highest sensory resolution.

There are at least two advantages that a visual attention mechanism can bring about. First, the amount of data to be processed is reduced, resulting in a lower demand on computational resources; this enables further information processing. Second, distracting information is suppressed in this procedure, so that relevant information can receive more consideration. For example, in a typical object classification system, it is very difficult to deal with samples that come with a large variety of backgrounds. With the help of visual attention selection, the background information is suppressed, so that only the foreground object influences the performance of the system. In this respect, attention is a central part of the system, because it selects the information on which the system's activities are based.

Selective visual attention has been researched in psychology, physiology, and
In the following section we first give some background knowledge Of the attention mechanism in the brain, and then review several computational models of selective vi- sual attention. It is almost impossible to cover all the related work in this area. Some good reviews can also be found in this literature (Itti and Koch, 2001) (Desimone and Duncan, 1995). 1.4.1 Attention mechanism in the brain The control of selective visual attention involves several brain regions, including the early visual pathway and the prefrontal cortex. Cooperation between two streams of control signal is believed necessary to achieve attention selection. The where stream will find where to attend to, primarily controlled by the dorsal visual processing pathway, which comprises cortical areas in the posterior parietal cortex. The what stream controlled by the ventral visual processing pathway, involving cortical areas in the inferior temporal cortex, is primarily in charge of localized object recognition. The 17 result of the two streams could affect each other. For example the recognition result from the what stream can bias the next attention Shift, make it to some predicted focal attention area where it may contain next interesting Object, and vice versa. Fig. 1.4. shows the brain areas that participate in the control of visual attention. Visual information I’ \ ’ \ f \ I \ I \ J \ I \ ’ \ I r .. I P pathway‘«\\_ (: :j ‘,,> M pathway ~ ‘~ ‘~ ~\ ~ ‘~ ~. LGN Ventral Filthy [Visual Corteg “real pathway Inferotemporal cortex Posterior parietal cortex Prefrontal \C ,/ cortex /._\ Memory and f Motor and other system cognition Superior colliculus Saccadic eye movement r—w l L.) Figure 1.4: Brain areas that involve in attention control. This diagram only shows the control flow. The spatial layout of the pathways can be found in (Kandel et al., 2000). (Adapted from (Itti and Koch, 2001) and (Kandel et al., 2000)) Visual information enters the retina and is processed by two kinds Of ganglion cells: the magnocellular ganglion cells (M cells) and the parvocellular ganglion cells 18 Stimulus feature M cells P cells Color contrast N 0 Yes Luminance contrast Higher Lower Spatial frequency Lower Higher temporal frequency Higher Lower Table 1.1: Differences in the sensitivity Of M and P cells to stimulus features. (Adapted from (Kandel et al., 2000)) (P cells). The information that is conveyed by M and P cells is kept segregated in the LGN and projected to separate layers of the primary visual cortex. Thus, it has led to the view that the separate sequence Of retinal ganglion, LGN, and visual cortical cells can be regarded as two parallel pathways, referred to as the M and P pathways. As indicated in Table 1.1, there are striking differences between cells in the M and P pathways. By selectively removing one or the other pathway, experiments Show that the M and P pathway make different contributions to perception. The P pathway is critical for color vision and for vision that requires high spatial and low temporal resolution (e.g. static color images). The M pathway contributes most to vision requiring low spatial and high temporal resolution (e.g. motions). Based on anatomical and functional evidence, Ungerleider and Mishkin argued that visual processing centers in the cerebral cortex are also believed to be organized in two pathways: the dorsal processing pathway to the posterior parietal cortex and the ventral pathway to the inferotemporal cortex (Ungerleider and Mishkin, 1980). 
Anatomical evidence also shows that the dorsal pathway is the cortical continuation of the M pathway of early visual processing, and that the ventral pathway corresponds to the P pathway accordingly. Thus it is natural and reasonable to assume that the dorsal and ventral pathways implement some high-level perceptual combination of the information that the M and P pathways convey.

The dorsal pathway is believed to process a subset of high-level percepts, including motion and depth. These percepts are very likely derived from M pathway information (motion and depth perception from the low-spatial, high-temporal-frequency features). The neurons in this system are relatively insensitive to color and poor at analyzing stationary objects. Therefore, the dorsal pathway is more concerned with where objects are than with what they are.

The ventral pathway deals with color and form perception. Color and form perception are derived from P pathway features (color contrast and high-spatial, low-temporal-frequency features), because a great deal of information about shape and form is derived from colors, edges, and borders. This system is capable of high resolution, which is important for seeing stationary objects in detail. Thus the ventral pathway is more concerned with what is seen.

The dorsal pathway (where stream) and the ventral pathway (what stream) end in different regions of the prefrontal cortex that specialize, respectively, in visual-spatial working memory and object-recognition working memory.

1.4.2 Bottom-up attention selection

How is information about color, form, depth, and motion, which is conveyed by different neural pathways, combined and translated into a comprehensive perception? When we see a red moving car, we do not see separate features; we combine the different properties, color (red), shape (car), and motion (linear movement), into one percept. It is impossible to develop specialized neural cells to detect all such combinations, so there must be a fusion of these pathways somewhere at a higher level, which is also called a binding mechanism. From a computational perspective, this fusion of low-level features resulting in a high-level attention selection is also referred to as the bottom-up attention selection mechanism.

One of the earliest theoretical models that addresses this problem is in the psychophysical literature. Treisman et al. proposed that different features are derived from different feature maps in different brain regions (Treisman and Gelade, 1980). To solve the binding problem, the brain dedicates some processing, as a master map, to combining the codes of the feature maps. Fig. 1.5 shows the outline of Treisman's model. Later, more detailed and neurally plausible models of visual attention selection were proposed (Koch and Ullman, 1985) (Eriksen and Yeh, 1985) (Olshausen et al., 1993), although these are still qualitative models.

The most prominent model of visual attention was proposed by Koch and Ullman in 1985 (Koch and Ullman, 1985). The model includes the following elements:

1. It has a process called early representation, in which a number of elementary features, such as color, orientation, direction of movement, disparity, etc., are represented in parallel in different topographical maps. This corresponds to the feature maps of Treisman's model.
2. It has a selective mapping from these representations into a central representation which at any instant contains only the properties of a single visual location, the selected location. This corresponds to the master map.

3. It has certain rules that determine which location is mapped to the central representation. This is implemented by a Winner-Take-All network.

4. It has a mechanism called inhibition of return, which suppresses the currently attended location and enables shifting to the next salient location.

Figure 1.5: Bottom-up attention selection model. The elementary properties (such as color, orientation, size, and depth) of the visual field are extracted by separate parallel pathways, each of which generates a feature map. Selected features are fused into a master map, which is a representation of those features that have the most saliency. Selective attention happens after the features have been associated in a small region of the master map. Adapted from (Treisman, 1986).

Itti, Koch et al. proposed a purely bottom-up attention control model (Itti et al., 1998) (Itti et al., 2001). They integrated color, intensity, and orientation as basic features. A Gaussian pyramid was created to extract intensity information at six scales. Similarly, Gaussian pyramids were constructed to extract features from the red-green opponency channel and the blue-yellow opponency channel. Four orientation-selective pyramids were also created using Gabor filtering at 0, 45, 90, and 135 degrees at six different scales. A total of 42 feature maps was thus created: 6 for intensity, 12 for color, and 24 for orientation. Then they fused the feature maps into a so-called saliency map. The task of the saliency map is to compute a scalar quantity representing the salience at every position in the visual field. In their combination scheme, they promoted those feature maps with a few strong peaks and suppressed those with many comparable peaks or with evenly distributed activity. This combination strategy outperforms regular normalization and also keeps biological plausibility (Sillito et al., 1995). After normalization, the feature maps for intensity, color, and orientation are summed across scales into three separate conspicuity maps, one for each modality. Searching for the most salient position is done by a winner-take-all (WTA) network with global lateral inhibition (Koch and Ullman, 1985). It has long been known that lateral inhibition in a neural network can lead to a WTA competition, so that only a single neuron is active at the steady state. To avoid reselecting the first attention position repeatedly, they used a mechanism to inhibit the recently selected object and shift to another salient position. In psychophysics, this process is called inhibition of return.
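The following sketch condenses the flavor of this pipeline into a few lines: feature maps are normalized so that maps with a few strong peaks are promoted, summed into a saliency map, and attention positions are then read out by a winner-take-all choice with inhibition of return. It is a minimal illustration of the general scheme, not the Itti-Koch implementation; the map sizes, the crude peak-count normalization, and the inhibition radius are assumptions of mine.

import numpy as np

def normalize_map(m, thresh=0.5):
    # Promote maps with a few strong peaks, suppress maps with many comparable peaks:
    # a crude stand-in for the peak-promoting normalization described above.
    m = (m - m.min()) / (np.ptp(m) + 1e-9)
    n_peaks = int((m > thresh).sum())
    return m / max(n_peaks, 1)

def saliency(feature_maps):
    # Fuse normalized feature maps (e.g. intensity, color, orientation) into one saliency map.
    return sum(normalize_map(m) for m in feature_maps)

def attend(sal, n_shifts=3, inhibit_radius=5):
    """Winner-take-all readout with inhibition of return."""
    sal = sal.copy()
    yy, xx = np.mgrid[0:sal.shape[0], 0:sal.shape[1]]
    fixations = []
    for _ in range(n_shifts):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)   # WTA: most salient location
        fixations.append((int(y), int(x)))
        sal[(yy - y) ** 2 + (xx - x) ** 2 <= inhibit_radius ** 2] = 0.0  # inhibition of return
    return fixations

# Toy usage with random "feature maps" standing in for intensity/color/orientation channels.
rng = np.random.default_rng(0)
maps = [rng.random((32, 32)) for _ in range(3)]
print(attend(saliency(maps)))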
Backer et al. applied a similar strategy in an active-vision system called NAVIS (Neural Active VISion), which emphasizes visual attention selection in a dynamic visual scene (Backer et al., 2001). Instead of directly using low-level features like orientation and intensity, they add further processing to find middle-level features, such as symmetry and eccentricity, with which to build the feature maps. Fusion of the conspicuity maps is done by so-called Dynamic Neural Fields, proposed by Amari (Amari, 1977), which is, in fact, still a lateral inhibition network. Due to the nature of a dynamic scene, the inhibition-of-return process is somewhat more difficult than in visual search over a stationary scene. To name one difficulty, the object that was first selected and then inhibited could be moving, which can corrupt the selection of the next attention shift. In the case of a gaze shift, the positions in the saliency map before and after the shift should be put into correspondence. In Backer's model, these problems are addressed by specific approaches.

Many of the bottom-up attention selection models share similar ideas. The major differences lie in the variety of features, the strategies for fusing the feature maps, and the rules that find the most salient location. Although the bottom-up attention model may explain the first few hundred milliseconds after a new scene is presented, a more complete model of attentional control must obviously include top-down, volitional biasing of the shifts as well.

1.4.3 Top-down attention selection

Top-down attention selection involves much more complex perceptual procedures. One obvious phenomenon is that we are able to move our eyes voluntarily toward any position, no matter how inconspicuous it is. Top-down selection is believed to involve object recognition, understanding of the scene, and even some planning. The precise mechanism by which a voluntary shift is elicited remains elusive, although several studies have examined the possible brain regions that are involved (Hopfinger et al., 2000).

To our knowledge, no computational model can explicitly solve this problem right now, although some models, combined with bottom-up attention selection, have tried to accommodate this mechanism.

Tsotsos et al. (Tsotsos et al., 1995) implemented attention selection using the combination of a feed-forward bottom-up feature extraction structure and a top-down selective tuning (of these features) mechanism. First, the representation in the network is calculated, based on feed-forward activation from low levels to high levels. Then the target of attention is selected. The location is then propagated back through the feature extraction hierarchy: in the top-down process, the WTA competition is restricted to the units connected with the winner of the previous layer. In the lowest layer, this procedure results in an accumulation of units corresponding to the most conspicuous region in the image.

Although Tsotsos's model includes some top-down control features, it is not usually considered a true top-down attention selection model. The model's top-down selective tuning only helps select the conspicuous location; it does not know what it is attending to. A successful top-down attention system should accommodate object recognition capability in order to implement the "what" stream of the brain's attention control pathway.

One of the earliest models that combines object recognition and attention is MORSEL (Pashler, 1996), in which attention is shown to help recognition. This model is designed to recognize words. Without attention, neighboring words could conflict and yield ambiguity and confusion in recognition. The addition of top-down attention control allows the system to focus on one word at a time.
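Before turning to a model that couples recognition and attention more tightly, the computational idea common to such top-down control can be sketched very simply: bottom-up feature maps are reweighted by target-specific gains before fusion, so that locations sharing the target's features become more salient. The snippet below is my own generic illustration of that idea, not MORSEL, not Tsotsos's selective tuning, and not the model described next; the channels, gains, and planted target location are all assumptions.

import numpy as np

def top_down_saliency(feature_maps, target_weights):
    # Bias each bottom-up feature map by a target-specific gain before fusion.
    # Searching for a "red, vertical" target would put high gains on the red-opponency
    # and vertical-orientation maps and low gains elsewhere.
    return sum(w * m for w, m in zip(target_weights, feature_maps))

# Toy usage: three feature channels; the target is assumed to light up channel 0.
rng = np.random.default_rng(1)
maps = [rng.random((32, 32)) for _ in range(3)]
maps[0][10, 20] += 2.0                      # the "target" location in channel 0
gains = [1.0, 0.2, 0.2]                     # volitional bias toward channel-0 features
sal = top_down_saliency(maps, gains)
y, x = np.unravel_index(np.argmax(sal), sal.shape)
print((int(y), int(x)))                     # the biased fusion points attention at (10, 20)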
The second component contains a hierarchical knowledge tree, which represents object classes by defining several critical points and corresponding eye movement commands that maximize the information gain. This permits the model to discriminate between candidate object classes.

1.4.4 Summary

In this section, we have reviewed the brain mechanisms of attention control and two basic attention selection strategies: bottom-up and top-down. Although bottom-up attention selection is more mature and has been implemented in several successful computational models, a successful focal attention selection model relies on a good combination of the two strategies.

1.5 Discussion

In this chapter, we have reviewed several early visual pathway models and attention control models. These two problems appear, at first approximation, different and unrelated. After reviewing the related work, however, we find it is possible to put them in a single theoretical framework.

The topographic map in the early visual pathway may serve as a feature or saliency map in the bottom-up attention control process. The lateral inhibition mechanism in models of the orientation preference map is also adopted by the saliency map fusion of the attention control models. This suggests a possible uniform solution that addresses attention through learning. The following chapters will focus on models of feature maps and attention selection. Two learning mechanisms and a general vision architecture will be presented and discussed.

Chapter 2

Staggered Hierarchical Mapping for Attentive Perception: A Developmental Model

In this chapter we propose a general architecture that can accommodate attention selection and the development of a vision system. The work reported here is motivated by the structure of the early visual pathway. We use several layers of staggered receptive fields to model the pattern of positioning of the processing cells of the early visual pathway. From sequentially arriving video frames, the proposed algorithm develops a hierarchy of filter banks, whose outputs are uncorrelated within each layer, with receptive fields whose scale increases from low to high layers. We also show how this general architecture can be applied to occluded face recognition, which demonstrates a case of attention selection.

2.1 Introduction

There is a rich literature on computer vision systems. However, an implementable computer vision model for general-purpose, attention-guided recognition in unknown environments is still elusive. The appearance-based approach, e.g., eigenfaces (Kirby and Sirovich, 1990; Turk and Pentland, 1991), has opened the possibility of dealing with unknown, uncontrolled, and complex real-world scenes because of its capability of deriving features from any vector distribution. The appearance-based approach can learn any type of invariance exhibited in the training samples, but it requires pre-registered objects because it lacks an attention selection capability. Current appearance-based techniques require that the object to be recognized be registered at a predetermined position and size. This is provided either manually by a human or by a separate object detector. However, in many scenarios this is not applicable.

Along the line of appearance-based approaches, the developmental vision architecture proposed here is motivated by the developmental process of biological vision systems.
The idea of developmental vision is consistent with the idea of purposive active vision (Aloimonos, 1992), because the developmental process is purposive and active, and so is the performance process. If the system is exposed to a wide variety of environments and has learned many tasks in the developmental process, the resulting system is large and complex, and it "acts like" a general-purpose system. If the system is trained in a controlled setting for a single task, the resulting system is small and lean, and it "acts like" a special-purpose system.

An important factor in such a general architecture is the attention mechanism. An object must be recognized at any position and size after it is projected onto the retina, without turning the eyes. Attention selection is essential for suppressing irrelevant information. Otherwise, the image vector that includes both the object of interest and the background is an entirely new vector that has no good match in the learned experience. To learn an object, active attention must be applied not only in the performance session but also in the learning session.

Many studies have proposed hierarchical models for vision and pattern recognition systems, but few have addressed attention selection as a major goal. Fukushima et al. proposed a neural network called the "Neocognitron" (Fukushima, 1980). The Neocognitron is a multilayered network consisting of cascaded connections of many layers of cells. The information of the stimulus pattern given to the input layer is processed step by step through the stages of the network. The synapses between the neurons are updated by a supervised or unsupervised learning method, but the retained features are hand-picked by humans. Weng et al. proposed a dynamic neural network model called the "Cresceptron" (Weng et al., 1997), which can automatically grow a hierarchy of maps directly from image inputs through a learning-with-teacher process. The network grows by creating new neurons, connections, and an architecture that memorizes new image structures and context as they are detected. Although these studies address the global structure of sensory mapping, the efficiency and completeness of the generated feature representation are problematic, since those methods do not explicitly use the statistical properties of the signals in filter generation.

Other studies concentrate on the subproblem of feature derivation from the statistical properties of input images. They include the entropy reduction method (Atick and Redlich, 1993), the sparse coding method (Olshausen and Field, 1997), and the independent component method (Bell and Sejnowski, 1997; Hyvarinen and Hoyer, 2000). These studies applied principal component analysis (PCA) or independent component analysis (ICA) to a single-size, fixed receptive field. To deal with the compounding problems of multiple receptive fields, a general object recognizer must be able to recognize a pattern at different positions and with different sizes. In our proposed architecture, we use a hierarchical structure with staggered receptive fields inside each layer.

Some researchers (Linsker, 1986; Sirosh and Miikkulainen, 1997) used a hierarchical network to simulate the effect of a population of neurons and to explain some biologically observed phenomena, such as orientation preference maps. These studies did not address representation issues, nor how to use the representation for later classification and recognition.
Attention was not a subject of concern in these studies.

We believe that a major purpose of the early visual pathway is to provide a coupled solution to two mutually dependent problems: (a) provide a representation of the input under the current attention selection (suppressing unattended areas) to the later recognition and attention signal generator; (b) execute the attention selection generated by the later recognition and attention signal generator. Together, these two issues accomplish both the bottom-up and the top-down information flow. This raises a series of new issues, such as how receptive fields are organized, the criteria for feature representation, how attention is applied, and how the output of the sensory mapping is used.

Another essential requirement for a sensory mapping developmental algorithm is that the computational method must be incremental, without estimating the covariance matrix directly from online real-time experience. Further, the computation must respond quickly in real time. For sensory mapping, this means that the current sensory mapping must be updated within the time interval of every sensory frame. This indicates the challenges in the design of a real-time developmental program.

This chapter proposes an architecture of hierarchical sensory mapping, called Staggered Hierarchical Mapping (SHM), for attention selection and attention-based recognition, together with its developmental algorithm for active vision systems. Fig. 2.1 illustrates a case of how the sensory mapping is used. The algorithm develops filters for a family of receptive fields. The responses of all filters are available for further analysis by the following classifier/mapping function. The proposed sensory mapping therefore develops a hierarchical representation for a large set of receptive fields, but does not complete vision by itself; the following classifier/mapping function is needed to work in tandem. The autonomous development of cells in the visual sensory mapping is performed by the CCIPCA method (Weng et al., 2003), which addresses the convergence problems of existing incremental PCA methods. However, other filter derivation methods besides PCA can also be used with the proposed architecture, e.g., Independent Component Analysis (Olshausen and Field, 1996; Bell and Sejnowski, 1997) and Slow Component Analysis (Wiskott and Sejnowski, 2002). In this chapter, we use CCIPCA as an example.

An important issue for developing a sensory mapping is the completeness of the representation.

Figure 2.1: Recognition under occlusion through features extracted using a sensory mapping architecture. (a) In the learning session the sensory mapping learns the features of the upper face (indicated by "On") through active attention to the upper face. (b) In the performance session the lower part of the face is occluded, and attention focused on the upper face allows the sensory mapping to deliver features for the non-occluded part of the face (the "On" part of the block), which are fed to the classifier for successful recall. In general, a sensory mapping enables suppression of unattended receptive fields, while providing signals from attended receptive fields for generating attention control signals by the following classifier/mapping function.
Since the environment is unknown at the time the system is designed, the programmer designs the mechanism through which the actual representation (including the network architecture, neural weights, and responses) is developed from sensory experience. If the mechanism does not allow the derivation of a representation that is essential for a particular environment, the resulting mapping is deficient in performance. If the mechanism enables the representation to be complete, in the sense of no loss of information, then the resulting representation is able to handle any environment that an intelligent agent runs into. Completeness for a single receptive field is clear, due to the reconstruction property of PCA. However, the reconstruction issue for several staggered receptive fields is less obvious. We address the completeness of the resulting SHM mapping by reconstructing the original images from the generated representation.

The remainder of this chapter is organized as follows. Section 2.2 explains the structure of the proposed SHM model. Section 2.3 describes the developmental method. In Section 2.4, we analyze the computational complexity of this method. Section 2.5 discusses the completeness of the representation in terms of reconstruction. Section 2.6 describes the proposed model as an attention effector. In Section 2.7 we show the results of some experiments with SHM. Section 2.8 presents a classification system in which SHM is used for attention selection. Section 2.9 applies SHM to a navigation simulation. Section 2.10 provides concluding remarks.

2.2 Staggered Hierarchical Mapping

We assume each line of sensory input can be digitized into a real number, so the input is represented in Euclidean n-space R^n. For visual sensory input, we can assume the input units are organized in a 2-D plane with fixed r rows and c columns, so all visual sensory inputs form an R^{r \times c} space, where R is the set of real numbers. Thus we have

X(t) = [x_{ij}(t)] =
\begin{bmatrix}
x_{11}(t) & x_{12}(t) & \cdots & x_{1c}(t) \\
x_{21}(t) & x_{22}(t) & \cdots & x_{2c}(t) \\
\vdots    & \vdots    &        & \vdots    \\
x_{r1}(t) & x_{r2}(t) & \cdots & x_{rc}(t)
\end{bmatrix}.   (2.1)

We also call a visual sensory input an image; we will use these two terms interchangeably in this chapter.

Analysis on the matrix space R^{r \times c} is difficult, because it introduces unnecessary complexity. Therefore, the vector space model is used to represent images. Under this model, each image is modeled as a single vector, and a collection of images is modeled as a single data matrix, where each column of the matrix corresponds to an image and each row corresponds to a feature dimension. Thus, we introduce a vectorization operator to convert the 2-D matrix representation to the vector space representation.

Definition 2.2.1 Let A \in R^{r \times c} and let a_j be its j-th column. Then vec A is an rc x 1 vector,

vec A = [a_1^T, a_2^T, ..., a_c^T]^T.

It is then natural to obtain the vector space representation I of an image X, i.e., I = vec X.

Without knowing the original row and column dimensions, the vector representation loses information about the R^{r \times c} space. So we make an important assumption: the statistical properties of the image are stationary. Basically, the stationarity assumption states that the statistical properties of the image space are independent of position in the image. Under this assumption, the statistical properties of the image space can be captured by the covariance matrix and the higher-order tensors of the image vector space.

In SHM, a filter is modeled by a column vector w_i, and w_i has the same dimension as the image vector representation I.
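As a concrete illustration of the vectorization operator and the filter model just introduced, the following NumPy sketch (a toy example, not part of the SHM implementation) stacks the columns of a small image into a single vector and computes one linear filter response w^T I.

    import numpy as np

    r, c = 4, 3                      # a tiny r x c "image"
    X = np.arange(r * c, dtype=float).reshape(r, c)

    # vec: stack the columns of X into an rc-dimensional vector (column-major order)
    I = X.reshape(-1, order="F")

    w = np.random.randn(r * c)       # one filter with the same dimension as I
    response = w @ I                 # linear response w^T I (no non-linearity f)

    print(I.shape, response)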
All filters that have connections to the image then form a matrix W = [w_1, w_2, ..., w_n], where n is the number of filters. The responses of a set of filters are R = f(W^T I), where T is the transpose operator and f(.) is a non-linear function. In the proposed architecture, we do not use the non-linear function f(.), so the model is linear.

When a family of receptive fields is considered, we have observed that Principal Component Analysis is a suitable mechanism for filter development. It gives a set of filters that minimize the mean square error between the input and the reconstructed space for a fixed dimensionality under a linear projection. The responses of the PCA filters are mutually uncorrelated, which increases the efficiency of the sensory coding. Furthermore, incremental PCA that adopts a Hebbian learning mechanism has some support from biological models of the cortex. It has been shown (Oja, 1982; Sanger, 1989; Rubner and Schulten, 1990) that artificial neurons updated by the Hebbian rule, while being inhibited by nearby neurons (lateral inhibition), produce synaptic weights that are the principal components produced by PCA.

However, conventional PCA also has its limitations. It applies only to the full input vector, and it is not capable of extracting local features in the input space. We shall design an architecture in which the responses are computed from a family of localized receptive fields. A stack of filters developed from a single receptive field is shown in Fig. 2.2(a). This raises two problems: first, the spatial resolution is low if the receptive fields do not overlap; second, if they overlap, the computational cost is high. In the proposed staggered scheme, we trade off spatial resolution against computational cost, as shown in Fig. 2.2(b). In the proposed architecture, filters are not stacked to cover the same receptive fields. Instead, the receptive fields of the filters are staggered and cover different regions. In practice, this means each filter only receives input from a local region of the input. However, for the convenience of analysis, we shall still model filters as covering the entire input range. Therefore, the matrix W is usually sparse, since filters do not connect to all the inputs, and the connections, or synapses, are localized. Intertangling this definition with the issue of filter development would make the analysis much more difficult. We therefore introduce a receptive field matrix to restrict the connections to a local area, and place no constraints on the filter matrix W.

Figure 2.2: A comparison of the neural cylinder (a) and the staggered receptive fields (b). In (a) a group of filters share the same receptive field, resulting in a low spatial resolution. In (b) each filter is centered at a different position, resulting in a higher density of spatial sampling.

Definition 2.2.2 The receptive field D_i of the filter w_i is a diagonal matrix

D_i = diag(d_{i,1}, d_{i,2}, ..., d_{i,K}),

with d_{i,j} \in {0, 1} and K = r x c the dimension of the image vector space.

The response of a single filter can be written as r_i = f((D_i w_i)^T I); D_i w_i defines a filter with a receptive field. This definition does not constrain the position of the receptive field, so the synapses could be scattered over the entire image. In practice, however, the receptive field of a filter is usually localized to a small region of the image. This topological information is discarded during the vectorization process. So, do we lose important information?
The answer to this question is twofold. First, we do lose the exact topological relationship between the filters. However, the correlation between filter pairs is preserved in the covariance matrix, and higher-order correlations can be found in the higher-order tensors. Moreover, the signal consists of natural images rather than white noise, so there is high correlation among neighboring input areas. In this sense, the topological structure is encoded in the input signal. A data-driven learning mechanism can find these higher-order correlations, which thus determine the receptive field.

It is worth noting that the introduction of the receptive field matrix D is merely for the convenience of mathematical analysis. In the actual implementation, the weight vector is the product D_i w_i, where zero connections are omitted to save computational resources.

A localized receptive field is defined by its starting row p, its starting column q, and its size z. For simplicity of the model, we usually assume every localized receptive field in the same layer has the same size z. From now on, if not otherwise specified, we will always assume this condition. Fig. 2.3 illustrates a localized receptive field D_{pq} with size 4.

Figure 2.3: Localized receptive field D_{pq} with size 4.

Filters are partitioned into non-intersecting groups g_i, i = 1, 2, ..., which are called eigen-groups. Within an eigen-group g_i, the receptive fields of the filters have neither overlap nor gaps between them. Therefore, the receptive fields of filters in the same group pave the entire image seamlessly, i.e., for any k \ne l and D_k, D_l \in g_i, we have

D_k D_l = \Theta,   (2.2)

where \Theta is the zero matrix, and

\sum_{D_k \in g_i} D_k = E,   (2.3)

where the summation is element-wise matrix addition and E is the identity matrix.

In the SHM model, filters in the same eigen-group compute the principal component of the same order. If the size of the receptive field is n x n, then there can be at most n x n different eigen-groups. No two eigen-groups share the same receptive field, i.e., for all D_i \in g_k and D_j \in g_l with k \ne l, D_i \ne D_j. We can then define the distance between eigen-groups as

d = min{ |p - r|, |q - s| },   for D_{pq} \in g_k, D_{rs} \in g_l, k \ne l.

We call the minimum value of d the "inter-filter distance", or "inter-neuron distance". Typically, the structure of SHM is homogeneous, so the inter-filter distance is the distance between any adjacent filters.

A randomly predefined order is given to all eigen-groups, shown for example by the numbers in Fig. 2.4. The numbers determine the order of the eigenvector computed by each eigen-group, and the positions of the numbers determine the relative positions of the eigen-groups. After each input image arrives, the filters in the lower-order eigen-groups are updated first, and their responses are generated. The filter updating rule will be discussed in Section 2.3.

Figure 2.4: Computational order of eigen-groups. There are 16 eigen-groups. The numbers in the table denote the order of updating using the residual image. The positions of the numbers represent the relative positions of the eigen-groups.
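To make the eigen-group bookkeeping concrete, the sketch below uses an illustrative assumption of an 8 x 8 image paved by 4 x 4 non-overlapping receptive fields (not the actual SHM configuration). It builds the diagonal masks D_i for one eigen-group and checks the paving properties of Eqs. (2.2) and (2.3).

    import numpy as np

    r = c = 8          # image size (assumed for the example)
    z = 4              # receptive-field size
    K = r * c

    def mask_for_field(p, q):
        """Diagonal receptive-field matrix D for a z x z field starting at (p, q)."""
        m = np.zeros((r, c))
        m[p:p + z, q:q + z] = 1.0
        return np.diag(m.reshape(-1, order="F"))   # same vec ordering as the image

    # One eigen-group: fields paved seamlessly over the image (no overlap, no gap).
    group = [mask_for_field(p, q) for p in range(0, r, z) for q in range(0, c, z)]

    # Eq. (2.2): masks of different fields in the group annihilate each other.
    assert np.all(group[0] @ group[1] == 0)
    # Eq. (2.3): the masks of one eigen-group sum to the identity.
    assert np.allclose(sum(group), np.eye(K))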
Here, we first take a look at what we call residue-image processing between the computations of consecutive eigen-groups. Suppose we have finished eigen-group g_k. The response of this eigen-group is r_i = (D_i w_i)^T I_{k-1}, where D_i \in g_k, I_{k-1} is the residue image from the previous eigen-group, and I_0 = I. The residue image after this computation is defined by

I_k = I_{k-1} - \sum_{D_i \in g_k} (D_i w_i)^T I_{k-1} (D_i w_i).   (2.4)

The residue image can be considered as the remaining variance that is not captured by the current eigen-group, and it serves as the input to the next eigen-group.

Besides the feed-forward processing, filters also interact within the same layer. If we assume the lateral interaction is linear, we can model it with another matrix.

Definition 2.2.3 The lateral interaction matrix L is an n x n matrix, L = [l_{ij}]_{n x n}. The element l_{ij} determines the inhibitory or excitatory synapse from filter i to filter j.

If L is a symmetric matrix, then the lateral interaction is symmetric; otherwise it is asymmetric. The responses are affected by W, D, and L, i.e., r_i = (L_i D_i w_i)^T I.

The multi-layer network is a natural extension of the single-layer network. We denote by W_k the filter set of the k-th layer, and by D_k and L_k the corresponding receptive field matrix and lateral inhibition matrix, respectively. The total filtering matrix of one layer, without considering a non-linear mapping function, is F_k = W_k D_k L_k. The response of the multi-layer network is then defined by

R = \left( \prod_{i=1}^{k} F_i \right)^T I.

Since each F_i is a linear transformation, we call C = \prod_{i=1}^{k} F_i the combined filter matrix. Each column of C corresponds to one combined filter. The size of the receptive field of the combined filters increases with the order of the layer. Thus, SHM gives a representation at different scales in different layers. Furthermore, the staggered receptive fields give representations at different positions over the input image. Representation at different scales and different positions is the key reason why SHM is better than other flat representations. The structure of the proposed SHM is shown in Fig. 2.5, in which only the non-zero weight connections are drawn.

Figure 2.5: The architecture of SHM. Each square denotes a filter. Layer 0 is the input image. The filters marked in black in layer 1 belong to different eigen-groups, because they overlap. Bold lines that are derived from a single filter and expanded to the original image mark the receptive field of the combined filter. In this diagram, the size of the receptive field in a particular layer is 20% larger than in its previous layer, as shown at the right; the size of the receptive field is rounded to the nearest integer.

2.3 Developmental Algorithm

The developmental algorithm for the SHM must be incremental and fast for real-time applications. By "incremental" we mean that each input image must be discarded after it has been used for updating the SHM; it is not practical to store all of the images for batch processing due to the large amount of space required. By "fast" we mean that updating for each image must be computationally efficient; for example, iteration within each update is not allowed. These requirements limit the complexity of the computation that can be performed by a biological cortex. They also make the design of a developmental program for SHM challenging. We have further developed the CCIPCA algorithm (Weng et al., 2003) with these requirements as design constraints. This algorithm can be found in the appendix.
The filter development procedure for a single layer is described as follows.

Algorithm 1 The filter development algorithm of a single layer.
1: Use steps 1 to 2 of the CCIPCA algorithm in the appendix to initialize each filter in the network.
2: Initialize the residue image I_0 = I(n), where I(n) is the n-th input image.
3: k ← 0.
4: for all filters w_i in eigen-group g_k do
5:   Use Eq. A.2 of the CCIPCA algorithm to update the connection weight vector w_i(n) of each corresponding filter with the residue image I_k:
        w_i = \alpha(n) w_i + \beta(n) ((D_i w_i)^T I_k) I_k,
     where \alpha(n) and \beta(n) are the amnesic average coefficients defined in Eq. A.3.
6:   Use Eq. A.4 to compute the corresponding response r_i(n) of each filter:
        r_i = (D_i w_i)^T I_k / ||D_i w_i||.
7:   Use Eq. A.6 to compute the residual image for the next eigen-group:
        I_{k+1} = I_k - \sum_{D_i \in g_k} (D_i w_i)^T I_k (D_i w_i) / ||D_i w_i||^2.
8: end for
9: k ← k + 1; go to step 4 until all the eigen-groups are updated.
10: Get the next input image I(n + 1) and go to step 2.

Each layer is similar in that it takes its input and runs the CCIPCA algorithm on it, producing both the filters for this layer and the output response image for the next layer. The first layer of the network uses the original input image, and each subsequent layer uses the output response images from the previous layer.

The receptive fields at the same level are not concentrated at a few center positions. Instead, we use the masking matrices D_i to spread the receptive fields over the entire input layer. The staggered positions, along with the increasing size of the receptive field, make it possible for input regions to have local representations at any size and any position (up to a sampled resolution) in the network.

Inhibition between neurons plays a vital role in the function of the visual cortex. We use residual images to represent the effect of inhibitive competition between neurons. This SHM network is not a pure feed-forward network, because it deals with the re-entrant processing of neurons. Each time a filter is computed, its contribution is subtracted from the current image, leaving the information that has not yet been extracted by the current eigen-group. The next filter can only "see" the part of the input that the previous filters did not extract; thus, the earlier filter inhibits the later one, which makes the process asymmetric. Each filter "sees" input data that has not been "seen" by previous neurons. This is more efficient than symmetric inhibition, where it takes time to compete for the order and achieve the same result.

In SHM, the lateral interaction is not explicitly modeled by a lateral interaction matrix L. Instead, filters affect each other laterally through the residue-image process. If we write F_i = D_i w_i, then the response of filter r_m, where D_m \in g_k, can be expressed as

r_m = F_m^T \left[ E - \sum_{F_i \in g_{k-1}} F_i F_i^T \right] \cdots \left[ E - \sum_{F_i \in g_1} F_i F_i^T \right] I.

The product of the terms in square brackets is the transpose of the lateral interaction matrix, L^T.

Fig. 2.6 shows a sequence of residual images. It can be seen that the variance of the successive residual images decreases as more orthogonal components are extracted.

Figure 2.6: Sequence of residual images. From left to right and top to bottom, starting from the original input image, the sequence displays the process in which each component is subtracted from the previous residual image.
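For readers who prefer code, the following NumPy sketch caricatures the single-layer loop of Algorithm 1 under simplifying assumptions that are ours, not the dissertation's: one filter per eigen-group with a full-image receptive field (no staggering, no masks D_i), a fixed amnesic parameter, and no normalization inside the update. It is meant only to show the update-respond-subtract cycle.

    import numpy as np

    def develop_layer(frames, n_components, amnesic=2.0):
        """Simplified, non-staggered caricature of Algorithm 1 (one filter per eigen-group)."""
        dim = frames.shape[1]
        W = np.random.randn(n_components, dim) * 0.01    # filter vectors w_k
        for n, image in enumerate(frames, start=1):
            alpha = (n - 1 - amnesic) / n if n > amnesic + 1 else 0.5
            beta = 1.0 - alpha                            # amnesic average pair
            residue = image.copy()                        # I_0 = I(n)
            for k in range(n_components):
                w = W[k]
                W[k] = alpha * w + beta * (w @ residue) * residue   # Hebbian-like update
                u = W[k] / (np.linalg.norm(W[k]) + 1e-12)
                response = u @ residue                    # r_k
                residue = residue - response * u          # residue for the next eigen-group
        return W

    # Usage: 500 random 16 x 16 "frames" flattened to vectors.
    frames = np.random.randn(500, 256)
    filters = develop_layer(frames, n_components=5)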
2.4 Computational complexity analysis

We would like to give a concise expression for the amount of computation. Since it is unnecessary to keep the same inter-filter distance everywhere, we can assume a similar number of connections for every cell in the sensory mapping. The number of processing elements is roughly the same across the different layers, and can be estimated from the input dimension d. Suppose that in layer i the average number of neural connections per filter is c_i, and the array of filters has h_i rows and w_i columns. The amount of computation required to update the connections using the incremental Hebbian-like algorithm is linear in c_i. Thus, with L layers, i = 1, 2, ..., L, the amount of computation for the entire network is

T_s = \sum_{i=1}^{L} h_i w_i O(c_i) \approx O(c d),

where d is the number of pixels in the sensory input and c is the average number of connections per filter. Since the sensory mapping stores information in the same place where the cell computation is done, and the processing results need to be stored too, the space complexity is

S_s = O(d c L).

2.5 Input Image Reconstruction

With SHM, the input image is coded as responses at several layers. To show how complete the representation is at each single layer, we need to reconstruct the original image from the responses of that layer alone. The reconstruction is the reverse of the mapping: the responses of a layer, which are the sample's projections along the principal components, are used to recover its component in the sample space, which is then added back to the residual image. The reconstructed image I_r is computed by

I_r = \sum_{k} \sum_{D_i \in g_k} r_i (D_i w_i),   (2.5)

where r_i is the response of the i-th filter.

Lemma 2.5.1 The filters with localized receptive fields in the same eigen-group are orthogonal to each other: for all filters D_i, D_j \in g with i \ne j, we have ⟨D_i w_i, D_j w_j⟩ = 0, where ⟨., .⟩ denotes the inner product.

Proof. ⟨D_i w_i, D_j w_j⟩ = w_i^T D_i^T D_j w_j. From Eq. 2.2 we have D_i^T D_j = \Theta, where \Theta is the zero matrix. Therefore ⟨D_i w_i, D_j w_j⟩ = 0.

Lemma 2.5.2 The residue vector I_k of an image I_{k-1} after applying eigen-group g_k is orthogonal to every filter in g_k, i.e., for all D_j \in g_k, ⟨I_k, D_j w_j⟩ = 0.

Proof. From Eq. 2.4,

⟨I_k, D_j w_j⟩ = (D_j w_j)^T I_{k-1} - \sum_{D_i \in g_k} (D_i w_i)^T I_{k-1} (w_i^T D_i^T D_j w_j)
             = (D_j w_j)^T I_{k-1} - (D_j w_j)^T I_{k-1} = 0.

Since D_j \in g_k, by Eq. 2.2 each addend in the summation is zero when i \ne j, and equals (D_j w_j)^T I_{k-1} when i = j. Note that here we assume ||D_i w_i|| = 1.

Therefore, the initial image space is partitioned into a sequence of subspaces, each spanned by the vectors in one eigen-group, and each of these subspaces is orthogonal to the next residue image. It is then easy to show that the total reconstruction error is

E = ||I - I_r||^2
  = ||I - \sum_{j=1}^{k} \sum_{D_i \in g_j} r_i (D_i w_i)||^2
  = ||I_1 - \sum_{j=2}^{k} \sum_{D_i \in g_j} r_i (D_i w_i)||^2
  = ||I_{k+1}||^2,

where ||.|| is the l_2 norm and I_{k+1} is the last residue image after processing by the k eigen-groups.
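A small numerical check of Eq. (2.5) can be written directly from the definitions. The sketch below uses illustrative assumptions of our own (a 64-dimensional toy image and one eigen-group of four unit-norm filters with disjoint supports); it verifies the orthogonality claim of Lemma 2.5.2 and shows how much energy one eigen-group captures.

    import numpy as np

    rng = np.random.default_rng(0)
    K = 64                                   # an 8 x 8 image flattened (toy assumption)
    I = rng.standard_normal(K)

    # One eigen-group: four filters with disjoint supports of 16 pixels each,
    # each normalized so that ||D_i w_i|| = 1 as assumed in Lemma 2.5.2.
    supports = np.arange(K).reshape(4, 16)
    filters = []
    for idx in supports:
        f = np.zeros(K)
        f[idx] = rng.standard_normal(16)
        filters.append(f / np.linalg.norm(f))

    responses = [f @ I for f in filters]                          # r_i
    I_rec = sum(r * f for r, f in zip(responses, filters))        # Eq. (2.5), one group
    residue = I - I_rec                                           # Eq. (2.4)

    # Lemma 2.5.2: the residue is orthogonal to every filter of the eigen-group.
    assert all(np.isclose(f @ residue, 0.0) for f in filters)
    # The reconstruction error equals the energy left in the residue image.
    print(np.sum(I ** 2), np.sum(residue ** 2))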
We also need to consider the border conditions. If a localized filter covers the border of the input image, then the connections that fall outside the image receive zero input. This condition enlarges the image by (z - 1) x 2 rows and (z - 1) x 2 columns, as shown in Fig. 2.7, so the new image space has dimension (r + 2(z - 1))(c + 2(z - 1)). Since the outlying pixels are set to zero, the true dimension of the image space is still r x c. The total number of filters whose localized receptive fields cover at least part of the input image is (r + 2(z - 1))(c + 2(z - 1)), and for all w_pq and w_rs we have ⟨w_pq, w_rs⟩ = 0. The number of filters is larger than the effective dimension of the image space; thus, the filters are linearly dependent if the image border is also covered by filters.

Figure 2.7: The filter connections that cover the outliers of the image receive zero input.

The single-layer reconstruction algorithm is shown in Algorithm 2.

Algorithm 2 The single-layer reconstruction algorithm
1: I_r = \Theta, where \Theta is the zero matrix.
2: for all response outputs r_i, with corresponding filter vector w_i and receptive field D_i, do
3:   Update the reconstructed image by I_r = I_r + r_i (D_i w_i).
4: end for

2.6 Attention Effectors

The response of each filter (or processing cell) at every level is available for later processing. The goal of the attention effectors is to suppress the responses from areas that are beyond the attended visual field. However, the filters share input planes through the staggered receptive fields shown in Fig. 2.5. This means that a region of filters in a layer of SHM does not correspond to a clear-cut region in the input plane. A special case of this situation occurs when the region is so small that it contains only one filter. Interactions between filters expand the extent of each pixel beyond that marked by direct inputs to filters. Therefore, we should not expect that selecting the responses from a region in any layer will result in a clear-cut attended region in the input.

We define an attended region as a 3-D ellipsoid centered at a position (x, y, l) in SHM, where x, y denote the position of the filter and l denotes the layer. The length of each axis of the ellipsoid corresponds to the scale along that dimension. All filters falling inside the ellipsoid have their output passed on, and all others are blocked.

2.7 Experiment with SHM

Testing began with a collection of over 5000 natural images. They were taken using a Sony digital camcorder in an outdoor environment. A variety of recordings were made, including buildings, trees and shrubbery, close-ups of bark and wood chips, flowers, and other scenes found in nature and on campus. The frames from the video were then captured and digitized into 118 x 158 images (each has 118 x 158 pixels). Fig. 2.8 shows a few sample images from the image set.

Figure 2.8: Several samples of the 5000 natural images collected.

There are two methods used to develop filters: weight-sharing and non-weight-sharing. The weight-sharing method assumes that the filters in the same eigen-group have the same connection weights. This is a good estimate only if the images have the same statistical properties across the entire image frame. Thus, the sharing method updates only a single filter for each eigen-group. The non-sharing method updates a separate set of weights for each filter in the eigen-group.

The filters of the first layer, trained by the sharing method, are shown in Fig. 2.9. There are 64 eigen-groups in this layer, and the size of each filter is 16 x 16. The filters are reordered according to their computation order so that the dominant principal components are shown first. The first several filters appear similar to the receptive-field excitation-inhibition patterns reported in Hubel and Wiesel's experiments (Hubel and Wiesel, 1962). Later ones show more complex patterns, which may fall into cell classes that were not categorized by Hubel and Wiesel.

Figure 2.9: The filters developed in the first layer by the sharing method. The order of dominance is from left to right and from top to bottom.
When input was placed into the network, a scalar output response value was generated for each filter at its respective receptive field. This value measures the amount of energy of the input along the direction of the filter and represents the firing strength of the filter. The filter responses were then organized into groups based on their order in the eigen-group, combined into a structured image vector, and passed on as input to the next layer of the network.

The responses of a layer are reconstructed by its following layer, finally reaching the original input layer. The reconstructed images are shown in Fig. 2.10, which indicates that the responses of each single layer are sufficient to reconstruct the original image fairly well. We also test the reconstruction error of each layer. The error is computed as the Euclidean distance between the original image and the reconstructed image. The two images are normalized by subtracting the mean and dividing by their standard deviation. The distance is then divided by the number of pixels so that images of different sizes can be compared. The error is averaged over 100 reconstruction results. Fig. 2.11 illustrates the reconstruction error of each layer.

Figure 2.10: The reconstructed images from different levels. The first image is the original input image. The others are reconstructed images from Layer 1 to Layer 4, respectively.

Figure 2.11: The reconstruction error of each layer (average pixel error versus layer of reconstruction).
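The per-pixel error metric described above is easy to restate in code; the helper below is our paraphrase of that description (z-score both images, take the Euclidean distance, divide by the pixel count), not the original evaluation script.

    import numpy as np

    def average_pixel_error(original, reconstructed):
        """Normalized per-pixel reconstruction error, as described in Section 2.7."""
        a = (original - original.mean()) / original.std()
        b = (reconstructed - reconstructed.mean()) / reconstructed.std()
        return np.linalg.norm(a - b) / a.size

    # Example with a noisy "reconstruction" of a random 118 x 158 image.
    img = np.random.rand(118, 158)
    rec = img + 0.05 * np.random.randn(118, 158)
    print(average_pixel_error(img, rec))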
If we use the nearest neighbor method (NN) to find the prototype, we call it monolithic + NN testing. We also study the use of active internal action in visual perception. For this 55 purpose, we construct an active system Shown in Fig. 2.12. Three classifiers C 1, C2 and C3 are integrated in the system. Here, we use the Hierarchical Discriminant Regression Tree (HDR Tfee) (Hwang and Weng, 2000). To learn the mapping, C3 is to determine what specific occlusion case the image is, U or L. In our test, the classifier C3 has reached a 100% classification rate. The 31 and 5'2 are SHMS. 51 is for U views and 8'2 is for L views. Cl and C2 are two HDR Trees used to classify images. Which SHM + HDR should be used is determined by the attention signal from C3. Occluded face images NV A S 1 $2 C1 / C2 Integration of the results Figure 2.12: The schematic illustration of operation. C3 is the attention signal generator. C1 and C2 are the classifiers for each occlusion case. S1 and S2 are SHMs, SI for U views and S2 for L views. The experiment used the face data from the Weizmann data set. The set was taken from 28 human subjects, each having 30 images with all possible combinations of two different expressions, three lighting conditions and five different facial orientations. 56 Method Recognition Rate U L U+L Monolithic+N N 51.43% 75.83% 82.38% SHM+HDR 92.86% 95.95% 98.57% Method Testing Time (ms) U L U+L Monolithic+N N 765.5 765.5 2263.0 SHM+HDR 702.4 702.4 2016.6 Table 2.1: Summaries Of the occlusion experiment. We use the leave-one-out cross-validation method. 30 experiments were conducted, in which 1 out of 30 samples Of each person is picked up as a testing sample, and all of the remaining ones as training samples. The average of the recognition result is reported in Table 2.1, along with the time run on a dual Pentium III 600 MHz processor PC. The U, L columns are for testing with U, L views only. The U+L column is for an integration of 2 test views, U and L are to give a Single classification (person). The U+L integration is done in the following way. For each view, top K best matches form a top labels list. Thus, each pair of U and L views give a U label list and a L label list. The label that has the minimum rank sum is the output label. Table 2.1 shows that the proposed SHM with the HDR classifier is effective in integrating active partial views, U and L. Without active attention, monolithic + NN performed poorly, even with U+L integration. This is because it lacks a mechanism to actively extract local views when presented with a global view. The speed of the SHM + HDR is fast enough to reach about 2 Hz refreshing rate. 57 2.9 Experiment with Navigation Simulation We also applied the SHM algorithm to a navigation simulation task. A total of 18,330 images are collected from two cameras mounted on an autonomous vehicle. The images of the camera are aligned and merged together to offer a broad View of the environment. A sample image with dimension 120 by 320 pixels is shown in Fig. 2.13. Figure 2.13: Merged image sample from the two cameras. 2,541 images from the data set are used to train a three level SHM network. The filter size of each layer is 16, 8, 8, starting from the bottom layer. Unlike the previous occluded face recognition experiment, a more general attention selection effector is used. As shown in Fig. 2.14, we use a GUI program to manually select 6 points on the road boundary, which are marked as circles. These 6 points are attention positions. 
2.9 Experiment with Navigation Simulation

We also applied the SHM algorithm to a navigation simulation task. A total of 18,330 images were collected from two cameras mounted on an autonomous vehicle. The images from the two cameras are aligned and merged to offer a broad view of the environment. A sample image, with dimensions 120 by 320 pixels, is shown in Fig. 2.13.

Figure 2.13: Merged image sample from the two cameras.

2,541 images from the data set are used to train a three-level SHM network. The filter sizes of the layers are 16, 8 and 8, starting from the bottom layer. Unlike the previous occluded face recognition experiment, a more general attention selection effector is used. As shown in Fig. 2.14, we use a GUI program to manually select 6 points on the road boundary, marked as circles. These 6 points are the attention positions. A small window around each attention position, across the levels of SHM, is extracted. Let (S1, S2, S3, S4, S5, S6) denote the concatenation of all 6 windows. The coordinates of the attention positions (r1, c1, r2, c2, ..., r6, c6) are also concatenated with the input vector. Since these are of a different data modality than the SHM output, we normalize each section by its respective variance. Thus, the input vector to the HDR mapping is

( (S1, S2, S3, S4, S5, S6) / \sigma_1 , (r1, c1, r2, c2, ..., r6, c6) / \sigma_2 ),

where \sigma_1 is the variance of the SHM output and \sigma_2 is the variance of the attention point coordinates. This manipulation makes sure that the coordinate information is not overwhelmed by the high-dimensional image input. Three driving directions (going left, going right and going forward) are also provided by the GUI program as classification labels for supervised learning. Note that the attention signals are all manually provided at this stage; automatic attention selection is a topic of further research on the SHM model.

Figure 2.14: Six attention positions on the image.

We compare the results with a monolithic representation, in which only a flat image is used as input to the HDR mapping, with the same class labels. Since the input image is of high dimension, we scale it down to one sixteenth of its original size. 2,541 images are used as training samples, and the remaining 15,789 images are used as the testing set. The results are compared in Table 2.2, which clearly shows that the SHM representation is much better than the monolithic representation.

    Method            Recognition Rate    Testing Time (ms) per image
    Monolithic+HDR    44.72%              103.2
    SHM+HDR           85.52%              748.1

Table 2.2: Summary of the navigation simulation.
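The construction of the HDR input vector just described can be sketched as follows; the window size, the use of the standard deviation for the scaling, and the array names are illustrative assumptions rather than the dissertation's exact recipe.

    import numpy as np

    def build_hdr_input(shm_windows, coords):
        """Concatenate attention-window responses and coordinates, each scaled by its own spread."""
        s = np.concatenate([w.ravel() for w in shm_windows])   # (S1, ..., S6)
        r = np.asarray(coords, dtype=float).ravel()            # (r1, c1, ..., r6, c6)
        s_scaled = s / (s.std() + 1e-12)
        r_scaled = r / (r.std() + 1e-12)
        return np.concatenate([s_scaled, r_scaled])

    # Example: six 8 x 8 response windows and six (row, col) attention points.
    windows = [np.random.randn(8, 8) for _ in range(6)]
    points = [(40, 30), (45, 90), (50, 150), (52, 200), (60, 260), (70, 300)]
    x = build_hdr_input(windows, points)
    print(x.shape)   # (6 * 64 + 12,) = (396,)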
2.10 Conclusions

This chapter presented the design principles, architecture and developmental algorithm of a general-purpose sensory mapping, the Staggered Hierarchical Mapping, which tightly couples attention-guided bottom-up information generation with the execution of top-down attention selection. The experimental data show that the representation at each layer has high completeness. The results of the occluded image recognition experiment show that the SHM has a local analysis property, and thus it can be used to achieve attention control in an active vision system. A future research direction is to study mechanisms other than PCA for developing sensory maps, as well as their advantages and disadvantages.

Chapter 3

Lobe Component Analysis

Orientation-selective cells in V1 are well known, but the underlying computational principles that guide their emergence (i.e., development) are still elusive. The result reported here is based on modeling a high-dimensional probability distribution using a network structure. We introduce a concept called the lobe component. Each lobe component represents a high concentration of probability density of the high-dimensional sensory inputs. A developmental algorithm for a sensory network has been designed and tested based on the concept of lobe components. Our simulations of network development revealed that many of the lobe component cells developed by the algorithm from natural images are orientation-selective cells similar to those found in V1. Therefore, we make the hypothesis that orientation selection is merely a natural consequence of cortical development, whereas probability density estimation by neurons is a fundamental principle of cortical development. If this hypothesis is true, the principle of neuronal probability approximation may have deep implications for the development and self-organization of other sensory cortices (e.g., the auditory and somatosensory cortices) and higher cortices.

The proposed developmental algorithm is not meant to be biologically verified, but it is biologically plausible. It is based on two well-known computational neural mechanisms, Hebbian learning and lateral inhibition. It is an in-place developmental algorithm in which every cell is both a signal processor and the developer of the cortex; there is no need for a separate network to handle development (and learning). The algorithm is purely incremental in the sense that no storage is needed for information other than what can be stored and incrementally computed by the processing network itself.

To clarify technically, four types of algorithms have been considered, with progressively more restrictive conditions: batch (Type-1), block-incremental (Type-2), incremental (Type-3) and covariance-free incremental (Type-4). The proposed Candid Covariance-free Incremental (CCI) Lobe Component Analysis (LCA) algorithm appears to be the first Type-4 algorithm for independent component analysis (ICA). Preliminary simulation studies showed that it outperformed, in convergence rate and computation time, some well-known state-of-the-art ICA algorithms of Types 3 through 1 on high-dimensional data streams. The proposed LCA algorithm also has a simple structure, with the lowest possible space and time complexities.

In this chapter, we first briefly introduce the background of Independent Component Analysis in Section 3.1, which is closely related to the LCA method. We then propose the method in Section 3.2 and analyze its performance.

3.1 Introduction to Independent Component Analysis

Recently, there has been a resurgence of interest in independent component analysis (ICA). ICA has wide application in signal and image processing, telecommunications, and medical data processing. The basic assumption of ICA is that the signal observed at the sensory input is a linear combination of several latent independent sources. (Some methods also consider the non-linear mixing case, which is outside the scope of this chapter.) The main purpose of ICA is to recover as much of the latent independence from the signal as possible, without prior knowledge of the source distributions or the mixing model.

In contrast to the Principal Component Analysis (PCA) method, ICA considers information beyond the second-order statistics (covariances). In PCA, all the necessary information is in the covariance matrix; the principal axes can be found by an eigen-decomposition of the covariance matrix. In other words, PCA only considers pair-wise correlation. Many ICA methods, on the other hand, require a preprocessing procedure that removes the covariance information. This preprocessing, often referred to as "whitening" or "sphering", diagonalizes the covariance matrix and normalizes each principal component to unit length. After whitening, the covariance matrix becomes an identity matrix, so no second-order statistics remain.
ICA therefore deals with higher-order statistics, and attempts to find a linear transform that makes the resulting components as statistically independent from each other as possible.

Several different research communities are interested in the ICA problem. In the statistics community, several approaches have been proposed to find "interesting" projections of a multivariate random variable that reveal the structure of the data. This approach, called "projection pursuit" (Friedman and Tukey, 1974; Friedman, 1987; Huber, 1985), tries to find the projection direction for which the projected signal distribution is least like a Gaussian distribution, which is very similar to those ICA approaches that try to maximize non-Gaussianity.

Blind source separation (BSS) is a class of problems in the signal processing community. It aims at recovering unobserved signals or "sources" from observed mixtures, exploiting only the assumption of mutual independence between the signals. So it is clear that BSS and ICA address a similar problem; practical BSS problems are usually solved by ICA methods.

Another way to describe this problem is the sparse coding mechanism. The sparse coding problem can be found in the neuroscience literature (Field, 1994; Olshausen and Field, 1996). It aims at explaining the coding mechanism of the neural system. The sparse coding hypothesis suggests that the principle of sensory coding is to produce a sparse, distributed representation of the sensory input. Sparseness here means that a neuron has a high probability of staying inactive, and also a relatively high probability of a large response, where "relatively" means compared to a Gaussian distribution with the same variance. For a family of neurons, this means that at each time instant only a small number of neurons have a large firing rate, and the rest have low activity. Since sparseness is essentially the property of a super-Gaussian distribution, maximizing sparseness is equivalent to maximizing super-Gaussianity. Thus the sparseness principle finally leads to the solution of independent super-Gaussian components. In this sense, sparse coding is merely a special case of ICA.

In this chapter, we first define the ICA problem and then survey related ICA work. A new ICA algorithm is proposed, and we also discuss the experimental results of the new method and its limitations.

3.1.1 Problem statement and assumptions

Cocktail party problem

To illustrate the problem of independent component analysis, let us start from a real-world example. Suppose there are n people speaking simultaneously in a room. At the same time, n microphones are set up to pick up the sound signals. This scenario is illustrated in Fig. 3.1. Obviously, each microphone input is a mixture of the sound of all the people. Since the sound strength is inversely proportional to the square of the distance from the sound source, the mixing factor for each sound source is different. The problem is how to separate the n original source signals from the n mixed signals. This problem is often called the "cocktail party problem."

The difficulty of this problem lies in the fact that we have no prior knowledge, e.g., statistical properties, of the speakers, nor do we know the setting of the microphones, e.g., how far each microphone is from each person. The only thing we know is that the sound sources are independent of each other.
The only thing we know is that each sound source is independent from each other. Of course in 65 Figure 3.1: The cocktail party problem. semantic level the conversations may not be independent: speakers may all talk about the weather. But in signal level, since every person has independent vocal organs, each sound source can be think as independent. Knowing independence does not help us much, because there is so much arbitrariness. It is also worth noting that this is only a simplified version of the actual cocktail party problem. We have assumed that there were no noises, no echo effects, and all sound signals reach microphones instantaneously. To formalize the problem, let us denote the source signal by 31(t), 32(t), ..., sn(t), and the mixed signal by x1(t), x2(t), ...,xn(t). As shown in Fig. 3.1, the extenuating factors of the ith microphone to each of the jth sources are denoted by aij, which are unknown and inversely proportional to the square of the distances between source and sensor. Suppose microphones do not introduce any non-linearity. The mixed 66 signals x1(t),x2(t), ...,xn(t) can be expressed as: $10") = a1181(t) + a1282 (t) + + aInSn (i) xn (t) = anlsl (t) + an232 (t) + + annsn (t). TO simplify the notation, we can write above equations into a matrix form: x (t) = As(t), (3.1) where x(t) = [x1(t),:1:2(t),...,xn(t)]T is the mixed Signal vector, s(t) = s t ,s t ,..., sn t T is the source signal vector, and l 2 all...a1n b d is the mixing matrix. The mixing is assumed to be instantaneous, which assumes the sources arrive at different sensors at the same time without delay. Since we have this assumption, the time index t can be dropped to simplify the notation. Then the mixture model becomes x 2 As. Apparently if A is known, 3 can be easily solved by a set of linear equations. Even prior knowledge, e.g. probability density function (pdf), Of source signal s can be useful for estimation the unknown parameters, e.g. to be used in a maximum likelihood estimation method. Unfortunately, s and A are both unmeasurable, which 67 makes this problem considerately more difficult. However, the recently developed method of Independent Component Analysis can be used to estimate A based on the information of independence. The goal of ICA is to find a linear transformation W for the dependent sensory signals x that makes the recovered signal 11 as independent as possible: 11 = Wx = WAS, (3.2) where u is an estimation of the sources. The sources 5 will be fully recovered when W is the inverse Of A up to a permutation and scaling factor. In the following sections we will review the general principles that lead tO the solution Of this problem. What is independence The definition Of statistical independent is defined as follows (Tucker, 1962). Definition 3.1.1 Let C denote a collection of events. The events in C are said to be independent if the probability of the joint occurrence of any finite number of them equals the product of their probabilities. Basically, this is said that if a random vector x that consists of n random variables x1,x2, ...,xn with the joint pdf function p(x) is equal to the multiplication of the marginal pdf’s, i.e.: n P(X) = pr’l’z‘). i=1 68 then x1, x2, ..., xn are independent. A set Of random variables is mutually independent if and only if the joint pdf p(x) is factorizable in the above way. A weak form of independent random variables is uncorrelated random variables. 
However, the recently developed method of Independent Component Analysis can be used to estimate A based on the information of independence. The goal of ICA is to find a linear transformation W for the dependent sensory signals x that makes the recovered signals u as independent as possible:

u = W x = W A s,   (3.2)

where u is an estimate of the sources. The sources s are fully recovered when W is the inverse of A, up to a permutation and a scaling factor. In the following sections we review the general principles that lead to the solution of this problem.

What is independence

Statistical independence is defined as follows (Tucker, 1962).

Definition 3.1.1 Let C denote a collection of events. The events in C are said to be independent if the probability of the joint occurrence of any finite number of them equals the product of their probabilities.

Basically, this says that if a random vector x consisting of n random variables x_1, x_2, ..., x_n has a joint pdf p(x) equal to the product of the marginal pdfs, i.e.,

p(x) = \prod_{i=1}^{n} p_i(x_i),

then x_1, x_2, ..., x_n are independent. A set of random variables is mutually independent if and only if the joint pdf p(x) is factorizable in the above way.

A weaker notion than independence is uncorrelatedness, defined as follows.

Definition 3.1.2 Two random variables x and y are said to be uncorrelated if their covariance is zero: cov(x, y) = 0.

An important property of independent random variables is the following.

Theorem 3.1.1 Independent random variables are uncorrelated.

The proof of this theorem is relatively straightforward and can be found in much of the classical literature; for convenience, we show it here.

Proof. We first show that the theorem holds for the case of two random variables; the extension to the n-variable case is natural. We need to show that cov(x, y) = 0, where x and y are independent random variables. This is equivalent to showing E(xy) = E(x)E(y). Since x and y are independent,

E(xy) = \int\int xy\, p(x, y)\, dx\, dy = \int\int xy\, p(x) p(y)\, dx\, dy = \int x p(x)\, dx \cdot \int y p(y)\, dy = E(x) E(y).

It is worth noting that the converse of Theorem 3.1.1 is not true: uncorrelated random variables are not necessarily independent, with the single exception of Gaussian random variables. So uncorrelatedness is only a necessary condition for independence. Many ICA algorithms have a preprocessing stage which enforces uncorrelatedness of the data. Although this processing does not by itself produce independent components, it greatly reduces the uncertainty in the ICA model and results in faster convergence. Independence and uncorrelatedness can be explained intuitively by visualizing the data, as we will show in Section 3.1.1.

Super-Gaussian, sub-Gaussian and kurtosis

Random variables can be classified by their similarity to the Gaussian distribution, or Gaussianity. There is a class of random distributions whose probability density function is more peaked at the center than a Gaussian distribution, and whose probability density at the outliers is considerably higher than the Gaussian pdf. This class of distributions is often called super-Gaussian. Super-Gaussian distributions are also referred to as "heavy-tailed" distributions, because of the higher density at the outliers. Another class of random distributions, on the other hand, is "shallower" at the center and has lower density at the outliers than the Gaussian distribution; accordingly, these are referred to as sub-Gaussian distributions. Fig. 3.2 shows three typical distributions: the Laplace distribution, the Gaussian distribution, and the uniform distribution. The Laplace distribution is also called the double exponential distribution.

Figure 3.2: The density functions of super-Gaussian, Gaussian, and sub-Gaussian distributions (Laplace, kurtosis = 3; normal, kurtosis = 0; uniform, kurtosis = -6/5). All distributions are normalized to unit variance. The Laplace distribution has positive kurtosis and is thus super-Gaussian; the uniform distribution has negative kurtosis, which implies sub-Gaussian; and the normal distribution has zero kurtosis.
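The kurtosis values quoted in Fig. 3.2 (formalized in Eq. (3.4) below) can be checked empirically; the following throwaway snippet estimates E{y^4} - 3(E{y^2})^2 from samples of the three unit-variance distributions.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 1_000_000

    def kurtosis(y):
        """Kurtosis as in Eq. (3.4): E{y^4} - 3 (E{y^2})^2."""
        return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

    laplace = rng.laplace(scale=1 / np.sqrt(2), size=N)      # unit variance, kurtosis ~ 3
    normal = rng.standard_normal(N)                          # kurtosis ~ 0
    uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), size=N)   # unit variance, kurtosis ~ -6/5

    print(kurtosis(laplace), kurtosis(normal), kurtosis(uniform))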
This makes comparing kurtosis possible. Kurtosis is a quantitative measurement of Gaussianality, and the kurtosis values of distributions are comparable only when they have the same variance. The classical definition of kurtosis is

kurt(y) = E\{y^4\} - 3 (E\{y^2\})^2,     (3.4)

where E denotes the expectation. If we assume y has unit variance, the kurtosis simplifies to E\{y^4\} - 3. The kurtosis, which is in fact a normalized fourth-order cumulant, can be either positive or negative. Distributions with positive kurtosis are called super-Gaussian, and those with negative kurtosis are called sub-Gaussian; the Gaussian distribution and certain non-Gaussian distributions have zero kurtosis. In the statistical literature, these three cases are also referred to as leptokurtic, platykurtic, and mesokurtic, respectively.

Illustration of independence

Intuition is very important for understanding complex and abstract linear algebra problems. In this section, we try to explain the ICA problem with intuitive figures. We focus on examples in the 2-D data space, since they can easily be plotted.

Suppose we have a random vector s = [s_1, s_2]^T, where the random variables s_1 and s_2 are Laplace distributed with the probability density defined in Eq. 3.3. Fig. 3.3 shows the joint distribution of these random variables. The x and y coordinates of each data sample denote the values of s_1 and s_2. The dot-dash lines represent the source component axes.

Figure 3.3: Two Laplace distributions.

The mixing matrix A is defined in Eq. (3.5). By applying the mixing matrix A to the source signal, we get the mixed signal x = As. Fig. 3.4 displays the mixed distribution. Since the mixing matrix is a linear transformation, it performs a rotation and scaling of the sample data points. The dot-dash lines in Fig. 3.4 are the original independent component axes transformed by the same mixing matrix. The task is then to find the independent components given those mixed data points.

Figure 3.4: Two Laplace distributions mixed by mixing matrix A.

Fig. 3.5 shows an example of sub-Gaussian distributions under the same mixing matrix as in Eq. 3.5. The source distributions are uniform distributions of unit variance with the following pdf:

p(s_i) = \begin{cases} \frac{1}{2\sqrt{3}} & \text{if } |s_i| \le \sqrt{3}, \\ 0 & \text{otherwise.} \end{cases}     (3.6)

Figure 3.5: Example of two uniform distributions. (a) The source signal. (b) Two uniform distributions mixed by mixing matrix A.

The task of ICA is then to find the original source axes, represented by the dot-dash lines.

Preprocessing procedure

The central problem of ICA is to find the mixing matrix or its inverse (the de-mixing matrix). In the complete case (the number of sources equals the number of sensory inputs), the number of degrees of freedom of the mixing matrix A is n x n, where n is the number of sources. This makes the search for the mixing matrix less tractable, since n^2 parameters need to be determined. However, with a carefully designed preprocessing procedure the degrees of freedom can be reduced, which leads not only to a better-conditioned problem but also to faster convergence for iterative approaches. In this section we discuss some common preprocessing strategies for ICA.

Centering: Subtracting the sample mean may be the commonest preprocessing step. It is defined as

\tilde{x} = x - E\{x\}.

This process centers the sample cluster and removes any translation of the data from the estimation of the mixing matrix A. To fully recover the original source signals, the mean E\{x\} can be added back to the recovered signal u(t). In the following sections, unless specified otherwise, we assume the sample vector x has already been centered.
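As a brief numerical aside before turning to whitening, the following sketch is an added illustration (NumPy and the sample size are assumptions): it centers unit-variance samples from the three distributions of Fig. 3.2 and estimates their kurtosis with Eq. (3.4), reproducing the values quoted in the figure caption (about +3, -6/5, and 0).

```python
import numpy as np

def kurt(y):
    """Sample kurtosis of Eq. (3.4): E{y^4} - 3 (E{y^2})^2."""
    y = y - y.mean()                      # centering, as described above
    return np.mean(y**4) - 3.0 * np.mean(y**2)**2

rng = np.random.default_rng(0)
n = 200_000

laplace  = rng.laplace(scale=1/np.sqrt(2), size=n)   # unit variance: var = 2 b^2
uniform  = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # unit variance: var = (2a)^2 / 12
gaussian = rng.normal(0.0, 1.0, n)

print(kurt(laplace))    # ~ +3    (super-Gaussian, leptokurtic)
print(kurt(uniform))    # ~ -1.2  (sub-Gaussian,   platykurtic)
print(kurt(gaussian))   # ~  0    (mesokurtic)
```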
Whitening: Whitening is a very useful preprocessing procedure. As mentioned in the preceding section, the mixing matrix A performs both a rotation and a scaling transformation. The whitening procedure removes the scaling transformation and leaves only the unknown rotation to the ICA step. It therefore reduces the degrees of freedom of the mixing matrix to n(n - 1)/2 (the number of degrees of freedom of an n x n orthogonal matrix), about half of the n^2 degrees of freedom of an arbitrary mixing matrix.

Whitening is well established and can be achieved by an eigenvalue decomposition (EVD) of the covariance matrix E\{xx^T\} = VDV^T, where x is the (centered) sample vector, E denotes the expectation, V is the orthogonal matrix whose columns are the eigenvectors of the covariance matrix, and D = diag(d_1, ..., d_n) is the diagonal matrix of the eigenvalues. Whitening is achieved by

\tilde{x} = D^{-1/2} V^T x.     (3.7)

After whitening, the covariance matrix of \tilde{x} is the identity matrix. Intuitively, whitening rotates the original axes to the principal component directions and scales along each axis by the square root of the corresponding eigenvalue. The whitened vector \tilde{x} therefore has unit variance in every component and its components are decorrelated; for this reason whitening is also referred to as "sphering". Fig. 3.6 shows the effect of whitening on the joint distributions of the mixed signals of Fig. 3.4 and Fig. 3.5(b). What remains is an (up to now) unknown rotation, after which the original signals can be recovered.

Figure 3.6: After whitening, the joint distribution is sphered. (a) Laplace joint distribution. (b) Uniform joint distribution.

Why not Gaussian?

One of the basic assumptions of ICA is that, to avoid ambiguity, at most one source may have a Gaussian distribution. The reason is that for jointly Gaussian distributions decorrelation already implies independence. A whitened Gaussian distribution is centrally symmetric, so any set of orthogonal basis vectors could serve as the independent components. As shown in Fig. 3.7, the density is completely symmetric, so any two orthogonal axes could be the independent basis.

Figure 3.7: Joint distribution of two unit-variance Gaussian random variables.

More rigorously, we need to show that under an arbitrary orthogonal transformation, a Gaussian random vector with identity covariance matrix yields the same probability density.

Theorem 3.1.2 Suppose y = Mx, where M is an n x n orthogonal matrix and x is a Gaussian random vector with zero mean and identity covariance matrix. Then y has the same probability density function as x.

Proof. In general, for any random vector x with density p_x and any invertible matrix M, the density of y = Mx is given by p_x(M^{-1}y) |det M^{-1}| (Papoulis, 1991). Since x has zero mean and identity covariance, its pdf is

p_x(x) = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2} x^T x\right),     (3.8)

and, using M^{-1} = M^T and |det M^{-1}| = 1 for an orthogonal matrix,

p_y(y) = p_x(M^{-1}y) |det M^{-1}| = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2} (M^T y)^T (M^T y)\right)     (3.9)
       = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2} y^T y\right).     (3.10)

Hence y and x have the same pdf. □
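To close this subsection, here is a minimal sketch of the two preprocessing steps just described, centering and EVD-based whitening as in Eq. (3.7). It is illustrative only; the particular mixing matrix, sample size, and NumPy calls are assumptions rather than anything specified in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mixed observations: two unit-variance Laplace sources, mixed as x = A s.
S = rng.laplace(scale=1/np.sqrt(2), size=(2, 10_000))
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)           # centering: x_tilde = x - E{x}

# Whitening by EVD of the covariance matrix, Eq. (3.7): x_tilde = D^{-1/2} V^T x.
C = np.cov(X)                                   # sample estimate of E{x x^T}
d, V = np.linalg.eigh(C)                        # C = V diag(d) V^T
X_white = np.diag(d ** -0.5) @ V.T @ X

print(np.cov(X_white).round(2))                 # ~ identity matrix ("sphered" data)
```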
3.1.2 Overview of ICA methods

The origins of the ICA problem date back to the 1980s, when Herault and Jutten (Herault and Jutten, 1986) proposed a feed-forward neural network with feedback connections. The Herault-Jutten algorithm is based on a gradient-descent updating rule that was able to separate independent sources, although the algorithm converges only under some severe restrictions, e.g. sub-Gaussian source signals. This work opened a remarkable chapter in the history of independent component analysis and blind source separation and inspired many others.

The general framework of ICA introduced by Herault and Jutten was analyzed by Comon (Comon, 1994) in a more theoretical way. Comon proposed to minimize the mutual information of the signals u, where u = Wx is an estimate of the source signals. Viewed differently, this is equivalent to minimizing the Kullback-Leibler (Kullback and Leibler, 1951; Kullback, 1959) distance between the joint pdf and the product of the marginal pdfs. Since the mutual information requires both the joint pdf of u and the marginal pdfs of the u_i, which are all unknown, one can only use approximations in place of the true probability densities. Comon used a high-order cumulant expansion of the Kullback-Leibler distance to approximate the probability densities.

In parallel to blind source separation research, unsupervised learning rules based on information theory were also considered by researchers in the neural network community. Linsker proposed a learning mechanism to maximize the mutual information between the input and output of a neural network (Linsker, 1992). Linsker did not explicitly connect it with ICA or blind source separation, but his method in some respects resembles the infomax principle independently proposed by other researchers (Bell and Sejnowski, 1995; Nadal and Parga, 1994), which is used to separate independent sources. The infomax approach is based on maximizing the information flow (the mutual information between the inputs and outputs) of a neural network with non-linear output units. At about the same time, Maximum Likelihood Estimation (MLE) was proposed to recover the independent sources (Pham et al., 1992). The MLE approach considers the mixed signals to be generated by a statistical model with unknown parameters to be determined; the model parameters are then estimated by maximizing the likelihood of the observations. Strikingly, it was independently proved by several authors (Cardoso, 1997; Pearlmutter and Parra, 1997) that the MLE and infomax principles, as far as source separation is concerned, are mathematically equivalent.

In the neuroscience community, the ICA problem is usually considered as a sparse coding problem. Sparseness is a characteristic of super-Gaussian random variables. Due to the central limit theorem, mixing several super-Gaussian random variables decreases the sparseness of the mixed signals, so searching for the independent directions by maximizing sparseness is a desirable strategy. Many studies have addressed the sparseness measurement issue and derived efficient ICA algorithms (Field, 1994; Olshausen and Field, 1996; Hyvarinen, 2001).

In the following sections we categorize the existing ICA methods and explain the details of each principle.

3.1.3 Methods based on Non-Gaussianality measurement

A straightforward way to recover independent components is to measure their similarity to the Gaussian distribution.
The Central Limit Theorem tells us that the sum of n random variables is also a random variable, whose probability density approaches a Gaussian as n approaches infinity. Intuitively, under fairly general conditions the sum of several non-Gaussian random variables is more "Gaussian-like" than any of the individual random variables. So maximizing the non-Gaussianality of the estimated signals u = Wx with respect to W leads to a solution of the ICA problem.

Kurtosis

A quantitative non-Gaussianality measurement is kurtosis. The kurtosis of a random variable y is defined as kurt(y) = E\{y^4\} - 3(E\{y^2\})^2. Since in the ICA problem signals are generally pre-whitened and of unit variance, the kurtosis reduces to kurt(y) = E\{y^4\} - 3, a shifted fourth-order cumulant. It is worth noting that kurtosis values are comparable only when the variances are equal. For super-Gaussian random variables the kurtosis is generally greater than zero, while for sub-Gaussian random variables it is less than zero. Kurtosis is defined only for a single random variable; therefore ICA based on kurtosis optimization is usually a single-unit algorithm, meaning that each run of the algorithm can separate only one non-Gaussian signal. The criterion function of this estimation is then defined as the kurtosis of the estimated source signal u, i.e. J(w) = kurt(w^T x). The goal is to search the w space and to maximize or minimize the criterion function, for which gradient ascent or other optimization approaches can be used. Hyvarinen and Oja proposed a criterion function (Hyvarinen and Oja, 1997) of the following form,

J(w) = E\{(w^T x)^4\} - 3\|w\|^4 + F(\|w\|^4),     (3.11)

where the first two terms on the right-hand side represent the kurtosis. The last term is a penalty that enforces the norm of w to be one, since without this constraint the kurtosis can be made arbitrarily large. The above criterion function can be optimized with a standard gradient descent/ascent algorithm, but the result is then affected by the choice of learning rate. In the same paper, Hyvarinen et al. used a method related to Lagrange multipliers and obtained an updating rule that only involves computing an expectation, which speeds up the convergence.

Kurtosis is a simple way to measure non-Gaussianality; however, it suffers from several disadvantages. Kurtosis is a normalized fourth-order cumulant, so it is very sensitive to noise in the outliers: a few noise samples with large values will greatly change the kurtosis. Another drawback is that the Gaussian distribution is not the only distribution with zero kurtosis, so the kurtosis measurement cannot recover the independent directions of such distributions.

Negentropy

An important conclusion in information theory is that Gaussian random variables have the largest entropy among all random variables with the same variance; in this sense Gaussian random variables are the most random. In fact, entropy is the optimal criterion for measuring non-Gaussianality. The entropy of a discrete random variable Y is defined as

H(Y) = -\sum_i P(Y = a_i) \log P(Y = a_i).     (3.12)

This definition can be generalized to continuous random variables, in which case it is often called the differential entropy, defined as

H(y) = -\int p(y) \log p(y)\, dy.

The negentropy of y is then defined as J(y) = H(y_{gauss}) - H(y), where y_{gauss} is a Gaussian random vector with the same covariance matrix as y; it is always nonnegative and is zero if and only if y is Gaussian.

3.1.4 High order cumulant based method

The fourth-order cumulants of the output vector u are defined as

C_{ijkl}(u) = ⟨u_i u_j u_k u_l⟩ - ⟨u_i u_j⟩⟨u_k u_l⟩ - ⟨u_i u_k⟩⟨u_j u_l⟩ - ⟨u_i u_l⟩⟨u_j u_k⟩,

where ⟨·⟩ denotes mathematical expectation. Cumulants of a given order form a tensor. The diagonal elements characterize the distribution of single components u_i.
For example, the autocumulants Of second, third, and forth order define variance, skewness, and kurtosis, respectively. The off-diagonal elements or cross—cumulants (all cumulants with i jkl 75 iiii) characterize the statistical dependencies between components. If and only if (iff) all components u,- are statistically independent, the off—diagonal elements vanish and the cumulant tensors of all orders are diagonal (assuming infinite amount of data) (Blaschke and Wiskott, 2002). The ICA method based on high order cumulants is thus to find a linear transforma- tion W that will minimize the absolute value of off-diagonal elements in the cumulant tensors Of u = Wx of all order. In practical terms, apparently infinite tensors are not tractable, therefore third and fourth order cumulants are usually used to approxi- 86 mate this problem (Comon, 1994; Blaschke and Wiskott, 2002). The advantage Of this method is that it requires zero knowledge of the probability densities. Moreover, it is also used to approximate the mutual information as shown in section 3.1.3. Then it clearly shows the relation of high order cumulants and probability density. The order of cumulants used to approximate the problem is a trade-off between computational expense and accuracy. 3.1.5 Methods based on Maximum likelihood and Infomax The goal of maximum likelihood estimation (MLE) is to model the Observation x as generated from the latent random variables 5 via a linear mapping A. In the noiseless case, we can use a parametric density estimator f (x; 6) to find the parameter vector 9 that maximizes the probability of observations x. The estimator can be formulated as 11 ”V MN {I f (w:x(t )) + Nlog IdetWI, (3.20) where W is the de—mixing matrix and is also treated as the parameter vector 0; N is number of Observations; f2; is the marginal probability density Of sources 11 with W as the parameter to be estimated (Pham et al., 1992; Pham and Garrat, 1997). Here we assume [2 is known up to a scaling factor here. The derivation of this MLE equation is due to the fact that: for any random vector 5 with density p; and for any invertible matrix W, the density of x = W‘ls is given by p3(Wx) ldeth (Papoulis, 1991). In parallel to the MLE study, unsupervised learning rules based on information theory were proposed by Linsker(Linsker, 1992). The basic idea here was to maximize 87 the mutual information between the output and signal portion in the input of a linear neural network. This principle was related to the redundance reduction principle suggested by Barlow (Barlow, 1961) as an coding strategy in neurons. By adding a non—linear transfer function to the network, Bell and Sejnowski (Bell and Sejnowski, 1995) proposed the infomax principle (information maximization), which used the stochastic gradient method to seek the maximization of mutual information between the input and output. A similar method was independently developed by Roth and Baram (Roth and Baram, 1996). Strikingly, the infomax and MLE are proved to be equivalent by Cardoso (Cardoso, 1997) and Pearlmutter and Parra(Pearlmutter and Parra, 1997) independently. They Show that the MLE and the infomax can both be explained by the Kullback-leibler divervence between Wx and the estimated source 11. The major problem of MLE and infomax principle is that they both assume that the marginal probability density Of source s is known, which in most of the cases is unknown or intractable. In practise, both methods need an approximation of the density. 
For example, in Bell and Sejnowski’s method they assumed a sigmoid function as the cumulative density function of the source random variables. Since the corresponding probability density function of the sigmoid function is a super-Gaussian distribution, Bell and Sejnowski’s method can only deal with some super-Gaussian sources with questionable accuracy. Other researchers extended the infomax approach to address this problem. Lee, Girolami and Sejnowski (Lee et al., 1999) suggested an extended infomax algorithm, in which they use mixture of Gaussian distributions to simulate the prob- 88 ability density Of sub— and super-Gaussian signals and then are able to separate both type Of signals. The original infomax algorithm involves inverting the feedforward weight matrix W, which makes the computation expensive. Amari et. al (Amari, 1998; Amari et al., 1996) and Cardoso et. al (Cardoso and Laheld, 1996) soon realized that it can be improved by using the natural gradient (or referred as relative gradient in Cardoso et. al’s paper), which multiplies the gradient of the W matrix by a positive definite matrix WTW. This change will speed up the convergence by removing the matrix inversion. 3.1.6 Methods based on Sparseness measurement As proposed by Olshausen and Field and others (Field, 1994; Olshaushen and Field, 1996; Olshaushen and Field, 1997), the cerebral cortex uses a sparse coding scheme. Statistically, given any input from the environment, only a few neurons respond actively. Most other neurons do not, or hardly, respond. The Sparse coding is desirable since it effectively transforms an input event that is represented by many active pixels quickly down to a few neurons, which is Similar to the redundance reduction principle suggested by Barlow (Barlow, 1961). The reason that Sparseness is a reasonable conjecture, suggested by Olshaushen (Olshaushen and Field, 1997), lies in that natural scene can be usually described by a few structure primitives, such as edges, lines and curves. These features, if filtered with Garbor or wavelet filters, will Show a sparse response pattern. 89 Olshausen and Field’s suggested the following criteria function: 2 E(w) = Z I(t) — gig-(ow,- + ZS (El—0(9), (3.21) t i where the first term at the right Side of the equal sign is the reconstruction constrain, and second term is the Sparseness measure. The reconstruction constrain is used merely for avoiding the trivial solution, which means all component vectors are zero vectors. '50 2‘0 4‘0 60 so Figure 3.8: Gaussian signal vs. sparse signal. The sub plot on the top shows a zero mean unit variance Gaussian random variable. The X axis denotes different Observations, and Y axis is the value. The bottom sub plot shows a sparse random variable with same mean and variance. The sparse coding principle implies that the statistics Of natural images is super Gaussian: it consists of a high probability region near the center, but with heavy tails. The high probability region near the center corresponds to most cases where inputs are nearly orthogonal to the feature that the neuron represents. The heavy tails correspond to the cases that the neuron represents and, thus, should respond. How does a set of neurons detect super-Gaussianality? A common neural archi- 90 tecture is shown in Fig. 3.9. Figure 3.9: Two neurons with horizontal inhibitions. Hollow triangles indicate excitatory connections and solid triangles indicate inhibitory ones. Fmrn the figure we can see that each neuron does not develop its weights alone. 
Consider a neuron, its response is inhibited by the response from nearby neurons that shares the Similar receptive field. It also inhibits its neighboring neurons. If this neuron responses most strongly in the neighborhood, it most effectively inhibits its neighboring neurons, which further weakens the inhibition from its neighbors. After a few milliseconds, it effectively inhibits its neighbors, although not totally. In section 3.2.8 we will propose an algorithm called Lobe Component Analysis (LCA) based on the Sparseness measure. And in section 3.4, we will discuss the Sparseness in more details. 3.1.7 Summary ICA research has become very active. There has been at least one regularly scheduled conference series (Int’l Conf. on Independent Component Analysis and Blind Signal Separation) and several special issues (IEEE Proceedings, Oct. 1998; Signal Process- ing, vol. 73, 1999; IEICE Transactions, March 2003; Journal of Machine Learning Re- search, Dec., 2003; Neurocomputing, vol. 49, 2002, etc.). Many alternative methods 91 have been proposed, using the principles of minimum mutual information, maximum likelihood, infomax, higher-order cumulants such as Kurtosis, etc. (see a survey by Hyvarinen (Hyvarinen, 1999)). In the following, we introduce four types of algorithm. Type-1: Batch. A batch algorithm collects observed data as a batch and then calculates the results. A batch algorithm cannot deal with very long or open-ended data streams and, therefore, cannot compute the results in real-time. Type-2: Block-incremental. A block- incremental algorithm is not allowed to store training data as a batch for computation. Instead, it can only store statistics, e.g., mean, covariance matrix and other higher order statistics for incremental computation. The block enables it to compute these statistics from the data in a temporal block (a small batch). A block-incremental algorithm can deal with Open-ended data streams but the result is delayed by a block size. Type-3: Incremental. An incremental (also called sequential or adaptive) algorithm is a framewise incremental algorithm (block size is one time frame). By default, an incremental algorithm has a block size of 1 unless the term “block” is stated. An incremental algorithm must update its statistic estimates for one time frame at a time and, thus, it can process Open-ended data streams with little delay. Type-4: Covariance-free incremental. Such an algorithm is an incremental algo- rithm, but further it is not allowed to compute the covariance matrix or other higher order statistics matrices. In summary, from Type 1 to Type 4, the conditions under which an algorithm is run becomes progressively more restrictive. In general, one Should not expect that an algorithm of Type (i + 1) to converge faster than one Of Type (i), i = 1,2,3. 92 The state—Of-the-art batch algorithms include FastICA by Hyvarinen & Oja (Hyvarinen and Oja, 1997; Hyvarinen and Oja, 2000), which is among the fastest Type-1 ICA algorithms in terms of speed of convergence and its high capability to handle high-dimensional data. The Extended Infomax algorithm by Sejnowski and coworkers (Bell and Sejnowski, 1997; Lee et al., 1999) is a well-known Type-2 ICA algorithm. The NPCA—RLS algorithm by Karhunen (Karhunen and Pajunen, 1997) is a Type-3 ICA algorithm. We are not aware of any previously existing Type-4 ICA algorithm. A Type4 algorithm has several major advantages: It can process Open-ended streams, causing little delay, and does not even store covariance matrix. 
The last property is especially important when the dimension d is large. Its potential rela- tionship with the “in-place” development Of a neuron will be discussed in Section 3.5 Broader Impact. By in-place development, we mean that an (artificial) neuron has to learn on its own while interacting with nearby neurons, to develop into a feature detector. A neuron with d connections does not have (1 x (1 storage Space at its disposal. In other words, a Type-4 ICA algorithm for each ICA component is an in—place learner. We can expect that designing an effective Type—4 ICA algorithm is not trivial. 3.2 Lobe Component Analysis Suppose a sequentially arriving series Of vectors x1, x2, ..., where each input vector xt, drawn from a high-dimensional random space X, is spatially highly correlated. It 93 is assumed that the temporal dependence of xt is weak and not of interest. The basic need of our proposed research is to estimate the probability structure of X. Given an arbitrary high-dimensional space X, the distribution of x E X may not necessarily have independent components. In other words, there exist no independent T components w1,w2, ...,wd, so that their projection from x E X : x,- = x wi, i = 1, 2, ..., d, are statistically independent so that their pdf. is factorizable: p(X) = P1($1)P2($2) ° - ram)- Fig. 3.10 illustrates three cases. The first two cases are factorizable. The third case, a more general case, is not factorizable. In many high-dimensional applications (e.g., images and financial data streams), the pdf. is not factorizable. 3.2.1 Fine structure of high-dimensional density High-dimensional density estimation is an important and yet very challenging problem which has been extensively studied in mathematical statistics, computer science and engineering. Major methods for probability density estimation include histograms, naive estimators, kernel estimators, nearest neighbor estimators, variable kernel meth- ods, orthogonal series estimators, maximum penalized likelihood estimators and gen- eral weight function estimators (see, e.g., Silverman (Silverman, 1986) for a survey). For high-dimensional data, kernel-based methods, e.g., EM algorithms for a finite normal mixture (also called a mixture of Gaussians) (McLachlan, 1997), have been widely used. 94 These traditional methods are problematic when they are applied to practical high-dimensional data. The problems include: (1) The lack Of a method for high- dimensional density estimation that satisfies the three stringent operative require— ments: incremental, coveriance—free and undersample. By undersample, we mean that the incremental algorithm must work even when the number Of samples is smaller than the dimension d. (2) The lack Of an effective method to determine the model param- eters (e.g., the means, covariances, and weights in the mixture-Of-Gaussian models). (3) The lack Of a method that gives a correct convergence (a good approximation for high-dimensional data), not just convergence to a local extremum (as with the EM method for mixture of Gaussians). (4) The lack of a method that is not only optimal in terms Of the objective function defined, but also in terms of the convergence speed in the sense Of statistical efficiency. Approaches to estimating statistical distribution fall into two categories: global and local. A global approach uses shape characteristics of the distribution described by global higher order statistics. 
A global statistic is an expectation of the entire space of distribution, such as covariance matrix, kurtosis and other higher order cumulants (or moments). A local approach estimates the distribution by decomposing the space of distri- bution into smaller regions, and each region is estimated by relatively lower order statistics (which is sufficient for local data). The kernel-based methods (e.g, the mixture-Of-Gaussian methods) belong to the local approach. Techniques of global approaches can be used for local approaches. A major advantage Of a local method is to decompose a complex global modeling 95 problem into simpler local ones SO that lower order statistics are sufficient. Higher order moments are sensitive to noise and outliers and are not effective for estimating a complex distribution. Local density approximation using lower-order statistics is more effective.3 The challenges with a local approach include how to effectively decompose the global complex problem into local simpler ones and how to integrate solutions to local problems into the solution to the original global complex problem. The proposed LCA uses a global-tO-local (coarse-tO-fine) estimation scheme. (1) First, we estimate the position: the mean i of the data (through incremental es- timation of i) and then subtract 3‘: out of the observation. Thus, the transformed data has a zero mean. (2) Next, we estimate the scale: Incrementally estimate the covariance matrix of x, and then whiten the zero mean observation so that it has a unit covariance matrix. (3) Estimate the lobe components from the zero-mean, unit covariance Signals. Each lobe component represents a high concentration of p.d.f.. Without whitening the data, the drastic scale difference in different data components can obscure the fine details of the lobe components. 3.2.2 Global: Whitening the data For step (2), we use PCA on a d-dimensional distribution (the proposed CCILCA algorithm will compute PCA with a slight modification), the unit principal component vectors v1 , v2, ..., Vd (eigenvectors of the covariance matrix of x E X) are found where v,- is associated with the eigenvalue M, with A1 2 A2 2 2 Ad. The space X is 3To see this global versus local contrast more intuitively, consider the problem of shape approx- imation. The (global) Taylor expansion is not practical for approximating a complex real-world shape due to its use of higher order derivatives. In contrast, low order splines are more effective. 96 divided into two parts, the major space M: M = span{v1,v2, ...,vk} and the residual space R = span{vk+1,vk+2, ...,vd} so that X = M ED R where EB denotes the orthogonal sum. The dimension k of space M is chosen so that the ratio of the covariance in R over the original space X, Zf=k+1 Ai/Zgzl A7; is a small percentage (e.g., 1%). From the result of PCA, the whitening matrix is: —1/2 /2 —1/2 ,A2 ,...,,\k V = [v1,v2, ...,vk]diag{/\1~1 }. The whitened observation is y = VT(x — i). In the following, we assume that y is whitened: it has a zero mean and a unit covariance matrix. 3.2.3 Local: Lobe components The sample space of a white, zero-mean random vector y in k-dimensional space can be illustrated by a lit-dimensional hypersphere, as shown in Fig. 3.11. It is known that although the whitening process only normalizes the variance along orthogonal principal directions, the whitened vector y has a unit variance in all the possible directions. That is the meaning of the circle in Fig. 3.11. But the data can go out of the circle. 
If the distribution of y is Gaussian, the samples will be equally dense in all direc- tions. In general, the distribution is not Gaussian and the probability density may 97 concentrate along certain directions (although the global covariance Of projections along any given direction is unit). Each major cluster along a direction is called a lobe, illustrated in Fig. 3.11 as a petal “lobe”. Each lobe may have its own fine structure (e.g., sublobes). The Shape of a lobe can be of any type, depending on the distribution, not necessarily like the petals in Fig. 3.11. If we assume that x and —x are equally likely (e.g., a patch of skin will have the equal chance of feeling the onset and the offset of a press), the distribution is then symmetric about the origin. (But this is not necessarily true in general and, consequently, nonsymmetric lobes can be defined.) Given a limited resource — c vectors, LCA divides the sample space X of x into c mutually nonoverlapping regions, called lobe regions: 32: R1 URgU...URc, (3.22) where R, (l Rj = d), if i 76 j, as illustrated in Fig. 3.11. Each region is represented by a single unit feature vector “’2': called the lobe component. The partition issue presented here is similar to Kohonen’s Self Organizing Maps (SOM) (Kohonen, 2001) or, more general, vector quantization (Gray and Neuhoff, 1998; Zador, 1982; Gersho, 1979). We do not use Euclidean distance as with SOM and quantization (we use the concept of high-dimensional probability density of a whitened random vector space instead). Further, each region is infinite (while most cells in SOM and vector quantization are bounded), Since we will generate a response T(x) for any vector in the whitened space. T (x) does not approximate x but represents 98 it as a feature parameter vector (vector quantization uses the code itself as a direct approximation of input x). The response is useful for visualization, decision-making, prediction or regression. Suppose we have computed c lobe components (column unit vectors) wi, i = 1. 2, ..., c. They are not necessarily orthogonal and not necessarily linearly indepen- dent, depending on the distribution of the whitened space and the number 0. They span a lobe feature subspace F = span{w1, W2, ...,wc}. (3.23) Any vector x in M is represented by the inner product vector: l= [w1,w2,...,wc]Tx. It is called the response from the lobe components, which can be used for visualization or further decision-making using support vector machines (Christianini and Shawe- Taylor, 2000; Lin, 2002) or the Incremental Discriminant Regression developed by Weng & coworkers (Hwang and Weng, 2000; Weng and Hwang, 2002). The discrete probability in the lit-dimensional whitened space M can be estimated in the following way. Each region R,- keeps the number of hits n(i), which records the number of times samples of x fall into region R,. Then, the continuous distribution 99 of x can be estimated by a discrete probability distribution of c regions: P{x E R1} = ”—782, i=1,2,...,c. The larger the number c, the more regions can be derived and, thus, the finer partition of the sample space 32. 3.2.4 Belongingness The remaining major issue is how to compute the lobe regions. We need to determine what kind of vector wz- is best to represent a lobe region R,- (Or for that matter, any subregion of M). 
If x belongs to a region Ri, x can be approximated by wz- as x z x = (x - wi)w,- (it minimizes the mean square error E ”x — itl|2 for a fixed unit Wil- We determine the unit vector w,- as the one that minimizes the following error of approximation: T llx — (x - wave-n? = x x - (x - w.)2, (3.24) after expansion using the fact that wz- is a unit vector. The expected error Of the above approximation over region 12,- is 2 T 2 T T T EIIX- (X'Wilwill = Elx X- (X°Wz') l= Elx X-Wi XX wt] 2 E[xTx] - w;E[xxT]wz- = trace(2x) — wfzxwgsas) 100 where 2x is the covariance matrix of x. We determine the unit wi to maximize the above approximation error. Since trace(2x) is constant, the unit w,- that minimizes the above expression is the one that maximizes wz-T Exwi. From the standard theory of PCA (e.g., see (Jolliffe, 1986)), we know that the solution wz- is the unit eigenvector of Ex associated with the largest eigenvalue A“: Aidwi = Exwi. In other words, wz- is the first principal component of Ex, where expectation of Ex is over the region of approximation, Rt or any other region. Relacing 2x by the sample covariance matrix, we have )‘iJWi a: i—Z x(t)x(t)Tw,- = %Z(x(t) ~w,-)x(t). (3.26) t=l i=1 We can see that the best lobe component vector can be estimated by the average of x(t) multiplied by the projection (x(t) w,). This is an important batch estimation equation: It means that the candid version of wz- is equal to the average of yt = (x(t) - wi)x(t). By candid, we mean that we keep the covariance of the projections onto Wi (which is A231) with wz- and, thus, the estimator for Ai,lwi is computed instead of the unit “’2' alone. This scheme is needed for the quasi-optimal efficiency to be discussed in Sec. 3.2.7. If wi(n — 1) has already been estimated using n — 1 samples, we can use it to 101 compute yt(n) defined as: z x(t) - (“n(n — nix“) W) iiw.(n — 1)“ (3.27) Then Eq. (3.26) states that the lobe component vector is estimated by the average: 1 Tl A,,1w,- z E z yt('n). (3.28) t=1 The larger the inner product x(t) - wi, the more applicable the input x(t) is to the region that the lobe component wz- represents. Thus, the inner product x(t) - “’2' is the “belongingness” Of x(t) to the region represented by Wi- For symmetric region Ri, we use the absolute value of x(t) - w,- as the belongingness. This mechanism not only enables us to compute the best wz- given a region, but also enables many lobe component vectors to compete when data x(t) are sequentially received. The vector wz- whose belongingness is the highest is the “winner,” which best inhibits all other vectors. The winner uses the current input x(t) to update its vector, as in Eq. (3.26), but all others do not. 3.2.5 Statistical efficiency Suppose that there are two estimators F1 and F2, for vector parameter 0 = (01, ..., 6k), which are based on the same set of observations 8 = {y1,y2, ..., yn}. If the expected square error Of F1 is smaller than that of F2, i.e., Elll‘l — 6H2 < E||F2 — 0H2, the estimator F1 is more statistically efficient than F2. 102 Statistical estimation theory reveals that for many distributions (e.g., Gaussian and exponential distributions), the sample mean is the most efficient estimator of the population mean. This follows directly from Theorem 4.1, p. 429-430 of Lehmann (Lehmann, 1983), which states that under some regularity conditions satisfied by most distributions (such as Gaussian and exponential distributions), the maximum likelihood estimator (MLE) 6 of the parameter vector 6 is asymptotically efficient, i.e. 
its asymptotic covariance matrix is the Cramér-Rao information bound (the lower bound) for all unbiased estimators, via convergence in probability to a normal distribution: «at? — 9) 3» N{O,I(6)_1} (3.29) in which the Fisher information matrix I (6) is the covariance matrix of the score WOW) WOW 59 ‘ k 1 , .., }, and f (y, 6) is the probability density of random vector vector { y if the true parameter value is 6 (see, e.g., Lehmann (Lehmann, 1983), p. 428). The matrix I (6)”1 is called information bound Since under some regularity constraints, any unbiased estimator 6 Of the parameter vector 6 satisfies cov(6 — 6) 2 I (6)"1 /n (see, e.g., Lehmann (Lehmann, 1983), p.428 or Weng et a1. (Weng et al., 1993), pp. 203204)? Since in many cases the MLE of the population mean is the sample mean, we estimate the mean 6 Of vector y by the sample mean. Thus, we estimate an inde- pendent vector Alwi by the sample mean in Eq. (3.28), where yt(n) is a “random “For two real symmetric matrices A and B of the same size, A 2 B means that A — B is nonnegative definite, which implies, in particular, that the diagonal elements are all nonnegative, which gives the lower bound for the variance of every element of the vector estimator of 6. 103 Observation” . Since we do not know the distribution of yt(n) and it is even dependent on the currently estimated w,- (i.e., the observations are from a nonstationary process), we use the amnesic mean technique below which gradually “forgets” Old “Observations” (which use bad yt (it) when n is small) while keeping the estimator reasonably efficient. 3.2.6 Amnesic mean The mean in Eq. (3.28) is a batch method. For incremental estimation, we use what is called an amnesic mean (Weng et al., 2003). _ n — 1 — n _ _ 1 + n y(n) = :U'( )y(n I) + [J'( )Yn (330) n n where p(n) is the amnesic function depending on n. If u E 0, the above gives the straight incremental mean. The way to compute a mean incrementally is not new but the way to use the amnesic function Of n is new for computing a mean incrementally. 0 if n S n1, n(n) = C(n — n1)/(n2 — n1) if n1 < n S 17.2, (3-31) c+(n—n2)/r ifng 10, the amnesic mean with n(t) E 1 increased about 50% (for the same n) from that for p(t) E 0 and with p(t) E 2 it increased about 100%. From Fig. 3.13 we can see that a constant positive p(t) is not best when t is small. The multi-sectional amnesic function p(t) in Eq. (3.31) performs a straight average for small t to reduce the error coefficient for earlier estimates, and when t is very large, the amnesic function changes with t to track the slowly changing distribution. Therefore, the multisectional amnesic function p(t) is more suited for practical signals with unknown nonstationary statistics. It is appropriate to note that the exactly Optimality of the multisectional amnesic function is unlikely under an unknown nonstationary process (not i.i.d.), unless an assumption of certain types of nonstationary process is imposed, which iS not, however, necessarily true in reality. In summary, we should not expect that an estimator suited for an unknown non- stationary process tO have the same expected efficiency as ones for an i.i.d. stationary process. The distribution of signals received in many applications is typically non- stationary and, therefore, an amnesic mean with a multisectional (dynamic) amnesic function is better (see later Fig. 3.16). 107 3.2.8 Proposed CCILCA Algorithm The Candid Covariance-free Incremental LCA algorithm incrementally computes the lobe matrix W(t) =[w1(tt),wg ), .. 
.,w£t)], for lobes, from whitened samples x(l),x(2), of dimension k without computing the k X k covariance matrix. The number c is the number Of lobe components to be extracted (predetermined limited resource). The length Of wz- is the variance of projections Of the vectors x(t) in the i-th region onto Wi- The algorithm is shown in Algorithm 3. Algorithm 3 The CCILCA algorithm. 1: Initialize using observations: for t = 1, 2, ,,c .wgc— C)- xt( ) and at = 1. 2: for t = c+ 1,c+2,... do 3: for i = 1,2, ...,c do 4: Compute the response Of all lobe vectors x(t) -w§t‘”/iiw§“” u. 5: Decide the winner: j— - arg maxlSiSCflli |}. 6: Update the number of wins nj <— nj + 1, compute the amnesic weight coef- ficient a(nj + 1) and 5(le + 1)with Eq. 3. 32 and Eq. 3. 33. 7: update winner using amnesic mean Of the winner lobe component: (n ') (Tr-1) j J = a(nj +1)Wj '7 + C(71): +1)ljx(t) (3.38) 8: end for 9: end for Algorithm 3 requires that input be whitened. The following is the CCILCA algo- rithm that does not require inputs to be whitened. It normalizes the output power Of each lobe component while computing all the lobe components. 108 Algorithm 4 CCILCA algorithm without pre-whitening. Initialize using observations: n = 0. For t = 1, 2, ..., e, do WEC) = x(t), nt = 0, and at = IIX(t)Il2- for t = c+ 1,c+2, do For i = 1, 2, ...c, compute the projection of the i-th cell 1.- = x(t) -w§t‘1’/iiw§t’”il. Generate the normalized response 2,- = I”, /o,~. Decide the winner among the normalized responses: '=ar max z~ . J gISiSC“ ill For the winner, the j-th cell, it - r— n -+1, compute the amnesic weight coefficient a(nj + 1) and 6(nj + 1)with Eq. 3.32 and Eq. 3.33. Update the amnesic average oj (the variance Of response) of all cells: oj <— a(nj + 1)oj + H(nj + 1)l§. Update the winner’s lobe component vector using the amnesic average (cell’s Hebian learning): Update the weights for the winner: (i) J = a(nj + vii-"1” (“"3 +1)zjx(t)/\/BT’ end for 109 The above two algorithms are simple. They never compute any k X k matrix (covariance-free) . The above algorithm computes CCI PCA (Weng et al., 2003) if the step 2 is cut short to contain only steps (a) and (c). Step (c) is modified so that for each vector (t) “’2': the residual from the previous vector wz._1 is computed, and then the residual ( is used as the input to the next vector wit—l) for updating. NO winner-takeall is needed. Therefore, the pre—whitening step can be computed by the same type (CCI) of algorithm. The difference in the pre—whitening step is to use the residual so that the resulting vectors are orthogonal (a requirement Of PCA). After we computed the whitening matrix V and the lobe component matrix W, we have BT = WTVT and l = BT(y — y) as projections onto lobe components, or response of lobe components. We can also define a neighborhood function hJ-i(t), which acts as a smoothing kernel. Usually, hjz-(t)) = h(||rj — Till: t), where rj, rz: are the location vectors of node j and i. hfl( t) Often has the shape Of “Mexican hat” or triangle. Then, the algorithm is changed to Algorithm 5. 3.2.9 Time and space complexities Given each k-dimensional input x, the time complexity for updating c lobe compo- nents and computing all the responses from x is 0(ck). Since LCA is meant to run in real-time, this low update complexity that is important. If there are t input vectors, 110 Algorithm 5 Topographic CCILCA. 1: w,- = y;, i = 0,1,2,...,k. 2: ni=1,i=1,2,...,k. 
3: for t = 1,2, do 4: j = arg max i 5: for All component vector wi(n,-), i = 1, 2, ..., is do 6: w-(n-+1) = a(n-)w-(n-) + h--(t)fl(n-)wi(ni)T ’Yty (3 39) ' z z z ' 1* ’ ”WI-(null " ' where wi(ni) is the component vector wz- after the ni-th updating, a(n,-) and 6(ni) are given in Eqs. 3.32 and 3.33, respectively. 7: nz- = ni + l. 8: end for 9: end for the total amount Of computation is 0(ckt). The LCA’S Space complexity is 0(ckt), for c neurons with 1: dimensional input x. In fact, the above Space and time complexities are the lowest possible ones: Since c vectors need to be computed and each vector has 1:: components, the space complexity cannot be lower than 0(ck). Phrther, the time complexity cannot be lower than 0(ckt) because the responses I = (l1, l2, ..., lc) for each of t inputs need that many computations. Suppose that each lobe component (vector) is considered as a “neuron” and the number of hits n(j) is its clock Of “maturity,” which determines the single weight wl (2122 = 1 -— ml) for its updating. The CCILCA algorithm is an in-place development algorithm, in the sense that the network does not need extra storage or an extra developer. The winner-takeall mechanism is a computer simulation of the lateral in- hibition mechanism in the biological neural networks (see, e.g., Kandel et al. (Kandel et al., 2000) p. 4623). The inhibition-winner updating rule is a computer simulation lll of the Hebbian mechanism in the biological neural networks (see, e.g., Kandel et a1 (Kandel et al., 2000) p.1262.). 3.2.10 LCA is ICA for super-Gaussians Further, the components that have a super-Gaussian distribution correspond to lobe components we defined here. Each linear combination Of k super-Gaussian indepen- dent components correspond tO symetric lobes, illustrated in Fig. 3.11. Therefore, if components in s(t) are all super-Gaussian, finding lobe components by CCILCA is equivalent to finding independent components. 3.3 Experiment Results 3.3.1 Low dimensional simulation The proposed algorithm was first applied to a low dimensional data set, which contains 2 i.i.d source components. Each of the source components is taken from a Laplace distribution of zero mean and unit variance, thus, is a super-Gaussian random vari- able. Fig. 3.14(a) shows 5000 sample points used in the experiment. The directions of the source independent components and evaluated independent components are illustrated, which are almost identical. A more detailed error over the number Of samples used is shown in Fig. 3.14(b). 112 3.3.2 Comparison with Other ICA Methods Suppose that the distribution of k-dimensional random input x is stationary. In CCILCA, the lobe component vector w,- converges to the eigenvalue scaled eigenvector Ai.1hi.1 in the mean square sense and the speed Of convergence is estimated as t 2 2 ErllWl ) - Ai,1hi,1ll z Elm where o is the estimated average component-wise variance of observation yt = (x(t) - (t-I) wz- )x(t)/||wzlt_1)||. The convergence is not to a local extremum. It is to the correct solution. The quasi-Optimal convergence speed is due to the use of statistical efficiency in the algorithm design. If the distribution of x changes slowly, the above error estimate is still applicable. The quasi-Optimal statistical efficiency appears to drastically improve the capac- ity to deal with high—dimensional data. 
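Before the comparison, a compact sketch of the winner-take-all CCILCA update of Section 3.2.8 may help fix ideas. This is a simplified, illustrative reading of Algorithm 3, not the reference implementation: the amnesic weight formulas and the parameter values in mu(n) below are assumptions (Eqs. (3.32)-(3.33) are not reproduced above), and the input is assumed to be already whitened.

```python
import numpy as np

def amnesic_mu(n, n1=20, n2=200, gain=2.0, r=2000.0):
    """Multi-sectional amnesic function mu(n) of Eq. (3.31); parameter values are guesses."""
    if n <= n1:
        return 0.0
    if n <= n2:
        return gain * (n - n1) / (n2 - n1)
    return gain + (n - n2) / r

def ccilca(X, c):
    """Winner-take-all CCILCA update on whitened samples X (k x T) with c lobe components.

    A simplified reading of Algorithm 3: only the winner updates, as an amnesic
    average of its response-weighted input l_j * x(t); all other vectors are untouched.
    """
    k, T = X.shape
    W = X[:, :c].copy()                 # initialize lobe vectors with the first c observations
    n = np.ones(c)                      # per-component update counts ("maturity" clocks)
    for t in range(c, T):
        x = X[:, t]
        l = W.T @ x / np.linalg.norm(W, axis=0)       # responses of all lobe vectors
        j = int(np.argmax(np.abs(l)))                 # lateral inhibition: winner takes all
        n[j] += 1
        mu = amnesic_mu(n[j])
        alpha = (n[j] - 1 - mu) / n[j]                # retention rate (assumed form of Eq. (3.32))
        beta = (1 + mu) / n[j]                        # learning rate  (assumed form of Eq. (3.33))
        W[:, j] = alpha * W[:, j] + beta * l[j] * x   # Hebbian winner update, cf. Eq. (3.38)
    return W
```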
We selected two state-of-the-art incremental ICA algorithms, the Type2 Extended Bell-Sejnowski (ExtBS) (Lee et al., 1999) and Type-3 (NPCA-LS) (Pajunen and Karhunen, 1998; Karhunen et al., 1998) for per- formance comparison with our proposed Type-4 CCILCA algorithm. The choice of these two algorithms is due to their superior performance in the comparison results of (Giannakopoulos et al., 1998). We used the downloaded code from the authors for the ExtBS algorithm. The NPCA algorithm used was proposed in (Pajunen and Karhunen, 1998) with [3 = 0.998 and g as tanh. The ExtBS algorithm was run with the following set Of parameters: blocksize = 1, learning rate = 0.001, learning factor = 0.985, 113 momentum constant = 0.1, number of iterations = 1, step = 5, and block size for kurtosis estimation is 1000. This version is called ExtBSl, the number 1 indicating that the block Size for updating was 1. Thus, ExtBS 1 is a partial sequential algo— rithm, sequential for independent component update but computation for kurtosis is block-incremental. We called it Type-2.5. To investigate the capabilities to handle high-dimensional data, we synthetically generated random data with different dimensions. The number n Of samples used in each run is a function of the input dimension d (we used n = 1000d samples). Each independent source has a Laplacian distribution. The mixing matrix was chosen randomly and was non-degenerate. The error between the direction of the true and the estimated independent components was measured as the angle between them in radians. The results were averaged over 50 runs. Figs. 3.15(a) and (b), Show that these two better performing ICA algorithms have problems in converging when the dimension d increases beyond 5 and the correct convergence fails when d = 25. The CCILCA consistently give considerably smaller error and take the least CPU time among the three. To Show more detail about the profile of decreasing error, we plot the errors in Fig. 3.15(c). As indicated by the figure, both the Type-3 algorithm NPCA and Type-2.5 ExtBSl do not converge for a moderate dimension of d = 25 although ExtBSl does fine when d = 2. The proposed Type-4 CCILCA does well. Next, we compare our CCILCA algorithm with Type-2 Extended Bell-Sejnowski (or extended infomax) (Lee et al., 1999) with block size 1000 and the Type-1 batch algorithm FastICA (Hyvarinen and Oja, 1997; Hyvarinen and Oja, 2000). This is not a fair comparison since a Type-4 algorithm (like CCILCA) should not be expected to 114 out-perform a Type-1 or Type-2 algorithm. We compare them anyway to understand the limit when CCILCA is compared with two state-of—the—art Type-1 and Type-2 algorithms. It is well known that ICA algorithms are “data grizzlies”. Typically, even for a low dimension simulation task (e.g., d = 2), ICA algorithms need thousands of samples to approach the independent components. Convergence in respect to the number of samples used in training is a good evaluation of the efficiency of ICA algorithms. For a higher dimension, we synthetically generated random observations from an i.i.d. Laplacian random vector with dimension Of 100. The results are Shown in Fig. 3.16, where the y—axis marks the number of samples and the x—axis indicates the average error in radians. In order to Show more detailed aspects of CCILCA, three variations are tested. “LCA with fixed m” (m stands for the amnesic function u(n)) and “LCA with dynamic m” are original LCA methods with a fixed u and varying n(n) as defined in Eq. (3.31), respectively. 
“LCA eliminating cells” algorithm dynamically eliminates cells whose hitting rate is smaller than 3/4 of the average hitting rate, Since sometimes two vectors share a single lobe (which is very rare and does not significantly affect the purpose of density estimation by lobe components but does affect our error measure). As shown in Fig. 3.16, all the three Type-4 LCA algorithms converged very fast, faster than the Type-2 algorithm Extended infomax and even the Type-1 FastICA. The batch Extended Infomax algorithm needs much more samples, therefore, it did not converge in these tests. It is somewhat surprising that the proposed Type-4 CCILCA algorithm, oper- ating under the most restrictive condition, out-performs the state—Of-the-art Type-3, 115 Type—2, and Type-1 algorithms by a remarkably wide margin. We think this is mainly due to the new lobe component concept and the quasi-Optimal property of the statis- tical efficiency. The simple structure of the CCILCA algorithm is also very attractive. 3.3.3 Cocktail Party Problem We test the algorithm on a simulation of the cocktail party problem. Nine sound sources are mixed by an orthogonal matrix. Each sound source is 6.25 seconds long and the sampling rate is 8.0KHz in 8 bits mono format. Therefore each sound source contains 50000 values. The sound clips are downloaded from http://www. cis .hut . f i/pro jects/ ica/ cocktail/ cocktai1_en. cgi, where they use these sound clips to test the FastICA algorithm (Hyvarinen, 2001). Here, we simplified the problem Since the mixing matrix is a pure rotation matrix, 116 thus no whitening process is needed. The mixing matrix is shown in Eq. —0.47 0.31 —0.30 —0.07 0.13 -—0.22 —0.17 —0.11 0.36 0.14 —0.52 -—0.18 —0.21 0.08 —0.66 0.41 —0.18 —0.02 0.41 0.05 M: 0.43 —0.26 —0.11 —0.29 —0.57 —-0.26 -—O.28 0.62 0.12 —0.17 0.13 0.24 —0.45 0.55 —0.23 0.05 0.06 —0.32 —0.54 0.03 L—0.11 -0.78 —0.39 -0.04 0.34 0.27 0.55 —0.07 0.24 —0.65 0.05 0.21 —0.07 0.11 0.71 0.25 —0.18 0.03 0.12 0.14 —0.11 0.16 —0.40 —0.46 0.05 —0.37 0.20 —0.34 —0.66 —0.24 0.22 -—0.02 3.40. —0.43 -—0.53 0.39 0.18 —0.54 —0.20 —0.04 0.11 0.06 (3.40) Fig. 3.17(a) shows one Of the nine original source signals. Fig. 3.17(b) displays one of the nine mixed sound signals. The mixed signals are first whitened, then we applied the proposed algorithm to the mixed sound signals. It is worth noting that the proposed algorithm is an incremental method. Therefore, unlike other batch ICA method doing iterations on the date set, we have used data only once and then discarded them. In Fig. 3.17(c) The independent components quickly converge to the true ones, with a good approximation as early as 1.5 second. 3.3.4 Natural Image ICA As one Of our major applications, we also conduct our experiment on natural image patches. 500,000 Of 16 x l6—pixel image patches are randomly taken from thirteen 117 natural images, available from http://www.cis.hut.fi/projects/ica/ imageica/. The CCILCA algorithm is applied to the pre—whitened image patches k = 16 x 16 = 256 to update the lobe component matrix V. The matrix (ET)-1 is then computed. Each column Of the matrix is Shown in Fig. 3.18(a) by a 16 x 16 patch, as the features of the natural scenes. A total of 256 256-dimensional vectors are displayed in the figure. They all look smooth and most of them are localized (only a small patch are non-zero, or gray) as expected. 
The entire process take less than 46 minutes (i.e., 181 frames / second) on a Pentium III 700MHz PC with 512MB memory compared to over 10 hours of learning time for the FastICA algorithm using 24% of the 500,000 samples (disk thrashing is also a factor). Fig. 3.18(b) shows how many times each lobe component was the top “winner”. Most components have roughly a similar rate of hits, except relatively few leftmost (top) ones and rightmost (tailing) ones. Although it is not exactly true that each lobe component is equally likely to be hit, nearly equal hits for the majority is a desirable property for a high-dimensional density estimation due to the following consideration: Suppose that a random number 2 can take a discrete number of values 2 = 2i, with P{z = 22:} = Pi: i = 1, 2, ..., c. Given a fixed c, the discrete distribution that maximizes the entropy is a uniform distribution (Papoulis, 1991): p,- = 1/c, i = 1, 2, ..., c. In other words, the distribution of hits by lobe components has nearly the largest entropy. We apply the topological LCA algorithm, shown in Algorithm 5, to natural image set as well. Fig. 3.19 displays the filters of the topographic CCILCA algorithm. Filters show iso—orientation preference in a neighborhood. And the orientation blocks 118 are surrounding a singular position at row No.9 and column No.8. This is resembling the pin-wheel structure found in the biological visual cortex. The winning times are shown in Fig. 3.20. The neuron at the singular position get significantly more hits than other neurons. 3.4 Derivation Of the LCA algorithm In this section we discuss the mathematical analysis of the LCA algorithm from a sparse optimization point of view. Here we start from a intuitive Sparseness measure- ment explanation, and end up with the LCA learning algorithm. 3.4.1 6" norm Sparseness measurement As we mentioned in the section 3.1.6, Sparseness is an important property Of super- Gaussian random variable. Maximize Sparseness is a desirable strategy that is used by different ICA algorithms. But the measurement of Sparseness is still illusive. Many heuristic measurements of Sparseness have been proposed, e. g. Olshausen and Field proposed the following form: Sp = - ES (21.). (3.41) where S p(u) is defined as the Sparseness of random vector u; S (u?) is a nonlinear function, and u,- is the response Of the neuron. Olshausen and Field suggest S (u,) 2 to be an even nonnegative function, e.g. —e_“ , log (1 + 1:2), and lul. Intuitively, 119 when there are more neurons (random vector u) firing at the same time, the larger the function Z S (u,) would be. Therefore, maximizing Eq. 3.41 will maximize the ’l Sparseness. Karvanen and Cichocki (Karvanen and Cichocki, 2003) generalized the measure of Sparseness to [p norm criteria: Sp(u) 2E Z (3.42) It is clear that Olshausen et. al’s Sparseness measure is a special case of (p norm, when p is equal to 1 and |u| is chosen as the S function. When p = 4, the [p norm is related to the widely used non-Gaussianality measure kurtosis 5. Karvanen et. al suggest the range of p in (0, 1], and particularly smaller p, such as p = 0.1 or p = 0.01, should be used. However we found when p —> 00, it will lead to a surprisingly good independent component analysis algorithm. Let us first take lOOk at an example. Fig. 3.21 shows the joint distribution of two Laplace random variables. The original components are mixed with a rotation matrix of 20 degree. 
We then projected the data set onto a series of orthogonal bases, rotated from 0 degrees to 180 degrees in 1-degree intervals. After projection onto each basis, the means of different ℓp norms are calculated and plotted. Fig. 3.22 displays the mean of the ℓp norm for the different projections. It is interesting to see that the ℓ2 norm is a straight line, which makes sense because a rotation does not change Euclidean distance. Surprisingly, the optima of the ℓp norm curves on the two sides of the ℓ2 line are inverted: the maxima of the ℓp norm with p > 2 correspond to the positions of the minima of the ℓp norm with p < 2. Then, to find the independent components (in this case they lie along the 20-degree and 110-degree directions), one needs to find the minima of the ℓp norm with p < 2 or the maxima of the ℓp norm with p > 2.

Another issue is the robustness of the estimation when noise is present. From Fig. 3.22 we can see that almost all the norm curves agree on the same optimum positions (except the ℓ2 norm), but a norm curve with a greater difference between its maxima and minima will be more robust when noise is present. So the choice of p can be either close to zero or close to infinity. That is why Karvanen et al. suggested small p, such as 0.1 or 0.01. The problem with small p is that the optimization process is difficult, since the gradient of such criteria typically involves matrix inversion. On the other hand, at first glance p approaching infinity does not look like a good choice either, but it generally holds that

$$ \lim_{p\to\infty} \Big( \sum_i |u_i|^p \Big)^{1/p} = \max_i \{ |u_i| \}, \qquad (3.43) $$

so it leads to a surprisingly simple computation. In the next section we will also see that the gradient of the infinity norm has a simple form, so the optimization process is relatively easy in the infinity-norm case. Therefore, we can define the following criterion function:

$$ J(\mathbf{u}) = E\Big\{ \lim_{p\to\infty} \Big( \sum_i |u_i|^p \Big)^{1/p} \Big\} = E\Big\{ \max_i |u_i| \Big\}, \qquad (3.44) $$

and maximizing this function solves the sparseness maximization problem.

3.4.2 Infinity norm sparseness optimization

The criterion function defined in Eq. 3.44, which maximizes the mean of the ℓ∞ norm, can also be written as

$$ J(W) = \int \max_j \big[ (\mathbf{w}_j^{\top}\mathbf{y})^2 \big]\, p(\mathbf{y})\, d\mathbf{y}, \qquad (3.45) $$

where we replace |u_j| with (w_j^T y)^2 for convenience, since the absolute value function is not differentiable. So the goal is to maximize this function with respect to W. Intuitively speaking, if w_j lies along a tail direction of the multivariate super-Gaussian distribution, the criterion function is maximized. Our goal is therefore to maximize Eq. 3.45 with respect to the w_j. A closed form for w_j is not available for a general random variable y with pdf p(y), so we have to resort to an iterative approximation scheme.

The difficulty of the analysis lies in the discontinuity caused by the maximum function, max_j [(w_j^T y)^2]. This problem can be circumvented by going back to the infinity-norm form shown in Eq. 3.43. Let c = arg max_j [(w_j^T y)^2]. Thus the maximum part can be rewritten as

$$ \max_j \big[ (\mathbf{w}_j^{\top}\mathbf{y})^2 \big] = \lim_{r\to\infty} \Big[ \sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r} \Big]^{1/r}. \qquad (3.46) $$

To use a gradient approach to find the maximum of Eq. 3.45, we take the partial derivative of Eq. 3.45 with respect to w_j:

$$ \frac{\partial J}{\partial \mathbf{w}_j} = \int \lim_{r\to\infty} \frac{\partial}{\partial \mathbf{w}_j} \Big[ \sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r} \Big]^{1/r} p(\mathbf{y})\, d\mathbf{y}. \qquad (3.47) $$

Let

$$ A = \sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r}. \qquad (3.48) $$

Eq. 3.46 can then be written as the limit of A^{1/r} as r approaches infinity. Since

$$ \frac{\partial A}{\partial \mathbf{w}_j} = 2r \sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r-1}\, \frac{\partial (\mathbf{w}_i^{\top}\mathbf{y})}{\partial \mathbf{w}_j} = 2r\, (\mathbf{w}_j^{\top}\mathbf{y})^{2r-1}\, \mathbf{y}, \qquad (3.49) $$

we then have

$$ \frac{\partial A^{1/r}}{\partial \mathbf{w}_j} = \frac{1}{r} A^{\frac{1}{r}-1} \frac{\partial A}{\partial \mathbf{w}_j} = 2 A^{\frac{1}{r}-1} (\mathbf{w}_j^{\top}\mathbf{y})^{2r-1}\, \mathbf{y}. \qquad (3.50) $$
Denote

$$ B = \frac{(\mathbf{w}_j^{\top}\mathbf{y})^{2r-1}}{A} = \frac{(\mathbf{w}_j^{\top}\mathbf{y})^{2r-1}}{\sum_i (\mathbf{w}_i^{\top}\mathbf{y})^{2r}} = \Big[ \sum_i \frac{(\mathbf{w}_i^{\top}\mathbf{y})^{2r}}{(\mathbf{w}_j^{\top}\mathbf{y})^{2r}} \Big]^{-1} (\mathbf{w}_j^{\top}\mathbf{y})^{-1} = \Big[ \frac{(\mathbf{w}_1^{\top}\mathbf{y})^{2r}}{(\mathbf{w}_j^{\top}\mathbf{y})^{2r}} + \cdots + \frac{(\mathbf{w}_c^{\top}\mathbf{y})^{2r}}{(\mathbf{w}_j^{\top}\mathbf{y})^{2r}} + \cdots \Big]^{-1} (\mathbf{w}_j^{\top}\mathbf{y})^{-1}. \qquad (3.51) $$

Given w_j, the ratio (w_c^T y)^{2r} / (w_j^T y)^{2r} is the largest term. When r approaches infinity, the term (w_c^T y)^{2r} dominates the summation. So

$$ \lim_{r\to\infty} B = \lim_{r\to\infty} (\mathbf{w}_c^{\top}\mathbf{y})^{-2r} (\mathbf{w}_j^{\top}\mathbf{y})^{2r-1} = \delta_{cj}\, (\mathbf{w}_j^{\top}\mathbf{y})^{-1}, $$

where δ_{cj} is the Kronecker delta: δ_{cj} = 1 if c = j, and 0 otherwise. Substituting B into Eq. 3.50, we obtain

$$ \lim_{r\to\infty} \frac{\partial A^{1/r}}{\partial \mathbf{w}_j} = 2\, (\mathbf{w}_c^{\top}\mathbf{y})^2\, \delta_{cj}\, (\mathbf{w}_j^{\top}\mathbf{y})^{-1}\, \mathbf{y} = 2\, \delta_{cj}\, (\mathbf{w}_j^{\top}\mathbf{y})\, \mathbf{y}. \qquad (3.52) $$

This gradient is also the updating rule for the weight matrix W. With the consideration of the most efficient estimation, we arrive at an updating rule exactly the same as Eq. 3.38.

3.5 Broader Impact

In neuroscience and machine perception, there is a growing interest in simulating the developmental process of feature detectors in the early sensory pathways (see, e.g., Elman, Bates, Johnson, Karmiloff-Smith, Parisi & Plunkett (Elman et al., 1997); Weng, McClelland, Pentland, Sporns, Stockman, Sur & Thelen (Weng et al., 2001); Weng & Stockman (Weng and Stockman, 2002)). In such sensory pathways, there is a need to develop feature detectors for all areas (receptive fields), across different positions and sizes, in the sensor array (e.g., a camera CCD array or the retina). The total number of such areas is so large (sampled in the y-x positions and in the size range) that it is impractical for each feature detector of k connections to have extra storage space of k x k for its development. This raises the critical need for Type-4 incremental ICA algorithms.

We will place the Type-4 algorithms developed in this project on the web, freely available to everybody. Due to their simple structure, lowest possible order of time and space complexities, quasi-optimal statistical efficiency, and Type-4 nature, we expect that this class of algorithms will be widely used.

3.6 Conclusion

The proposed CCILCA algorithm is simple and fast under multiple demanding computational requirements: it is high dimensional, incremental, and free of higher-order statistics computation. This is due to the simplicity of the winner-take-all principle and the most-efficient-estimation concept used in the algorithm design. The sparse representation can come from a winner-take-all competitive learning neural network, which helps to explain some learning mechanisms in the neural system.
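As a quick numerical check of two facts used in the derivation above, namely that the ℓp norm approaches the maximum as p grows (Eq. 3.43) and that only the winning vector receives a non-zero gradient (Eq. 3.52), the following sketch compares the analytic winner-take-all gradient with a finite-difference estimate. The test vectors and tolerances are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.laplace(size=10)

# Eq. 3.43: the l_p norm approaches max|u_i| as p grows.
for p in (1, 2, 8, 32, 128):
    print(p, np.sum(np.abs(u) ** p) ** (1 / p), np.max(np.abs(u)))

# Eq. 3.52: the gradient of max_j (w_j^T y)^2 with respect to w_j is
# 2 * delta_cj * (w_j^T y) * y, i.e. only the winning row gets a nonzero gradient.
W = rng.standard_normal((5, 10))
y = rng.standard_normal(10)
c = np.argmax((W @ y) ** 2)              # the winner
grad = np.zeros_like(W)
grad[c] = 2.0 * (W[c] @ y) * y           # winner-take-all gradient

# Finite-difference check of d/dW max_j (w_j^T y)^2.
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (np.max((Wp @ y) ** 2) - np.max((Wm @ y) ** 2)) / (2 * eps)
print(np.allclose(grad, num, atol=1e-4))
```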
Figure 3.10: Illustration of lobe components. The plus signs are random samples. 1st column: the original source signals. 2nd column: the observed mixed signals (after an affine transform of the left column). 3rd column: whitened signals. Lobe components are marked by arrows. Rows, from top to bottom: a distribution factorizable into two independent uniform distributions (sub-Gaussian); a distribution factorizable into two independent Laplacian distributions (super-Gaussian); a nonfactorizable distribution.

Figure 3.11: The sample space of a zero-mean white random vector x in 2-D space can be illustrated by a circle. Each mark + indicates a random sample of x. The distribution is partitioned into c = 3 (symmetric) lobe regions R_i, i = 1, 2, 3, where R_i is represented by the lobe component (vector) w_i.

Figure 3.12: The amnesic average coefficient. The x axis is the number of visits; the y axis is the weight coefficients of w_1 and w_2.

Figure 3.13: The error coefficients c(n) for amnesic means with different amnesic functions mu(n).

Figure 3.14: The proposed method on a 2-D Laplace distribution data set. (a) The source independent components and the estimated independent components. The dash-dot lines indicate the directions of the source independent components, and the two arrows indicate the components estimated by the proposed method. (b) Average error in the angle between the estimated independent component vectors and the source independent component vectors, as a function of the number of samples.

Figure 3.15: Comparison of ICA results among (Type-3) NPCA, (Type-2.5) ExtBS1 and (Type-4) CCILCA for super-Gaussian sources. (a) Average error for different data dimensions d. (b) Average running time for different data dimensions d. (c) Average error for d = 25 as a function of the number of samples.

Figure 3.16: Comparison among (Type-2) Extended Infomax, (Type-1) FastICA and three variants of the proposed Type-4 CCILCA algorithm (fixed mu, dynamic mu, and eliminating cells).

Figure 3.17: Cocktail party problem. (a) A music sound clip in its original form; it is one of the nine sound sources. (b) One of the nine mixed sound signals. (c) Recovered music sound wave. Compared with (a), the sound signal is recovered after approximately 1.5 seconds.

Figure 3.18: Lobe components from natural images (not factorizable). (a) LCA-derived features from natural images, ordered by the number of hits in decreasing order. (b) The numbers of hits of the corresponding lobe components in (a).

Figure 3.19: Results of Algorithm 5. Topographically ordered basis functions in ICA of natural images.

Figure 3.20: Results of Algorithm 5. Number of times for each independent component to be the "winner." The noticeably large hit count corresponds to the neuron at row 9 and column 8.

Figure 3.21: The data set is the joint distribution of two unit-variance Laplace distributions. The two independent Laplace random variables are mixed by a rotation matrix of 20 degrees.
Figure 3.22: The ℓp norm criterion functions for different p. The data set is projected onto a series of orthogonal bases, and the expectation of the ℓp norms of all the projected samples is then computed.

Chapter 4

Sparse Representation and Feature Selection

In Chapter 3 the Lobe Component Analysis algorithm, based on a novel sparseness measure, was discussed. In this chapter we further discuss sparseness optimization. We propose a new perspective to achieve sparseness via the winner-take-all principle for linear kernel regression and classification tasks. We form the dual of the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996) criterion, and transform an ℓ1 norm minimization into an ℓ∞ norm maximization problem. Two solutions are proposed. First, the problem can be solved by standard quadratic programming with linear inequality constraints; we show that the number of parameters to be estimated in the ℓ∞-normed space is half of that in the solution suggested by David Gay for the ℓ1-normed space. Second, we introduce a novel winner-take-all neural network solution derived from gradient descent, which links sparse representation with the competitive learning scheme. This scheme is a form of unsupervised learning in which each input pattern comes, through learning, to be associated with the activity of one or at most a few neurons. However, the lateral interaction between neurons in the same layer is strictly preemptive in this model. This framework is applicable to a variety of problems, such as Independent Component Analysis (ICA), feature selection, and data clustering.

4.1 Introduction: The Supervised Learning Problem

The central problem of supervised learning or regression can be formulated as function approximation. In either case we have pairwise correspondences of samples x_i and y_i, where x_i is in R^d, y_i is in R, and d is the dimension of the input space. In particular, when y_i is discrete, e.g. y_i in {-1, 1}, the problem is called classification, whereas a continuous y_i (for example y_i in R) indicates a regression problem. The task is to infer a function f(·) such that y = f(x). More precisely, once the model of the function is chosen, the function can be written as y = f(x, β), where β is the parameter vector of the model, y = [y_1, ..., y_n]^T, and n is the number of training samples. The model of the function f(·) can have various forms. For example, in a three-layer perceptron neural network (Bishop, 1995), the function has the following form:

$$ \mathbf{y} = f_2\big( W_2 f_1( W_1 \mathbf{x} + \mathbf{b}_0 ) + \mathbf{b}_1 \big), \qquad (4.1) $$

where f_i(·) is a non-linear limiting function, like the sigmoid. Then β = {W_1, W_2, b_0, b_1} is the parameter set to be optimized. In this study, we consider the form of the function to be linear with respect to the parameter vector β, i.e.,

$$ f(\mathbf{x}, \boldsymbol{\beta}) = \sum_{i=1}^{k} \beta_i h_i(\mathbf{x}) = \boldsymbol{\beta}^{\top} \mathbf{h}(\mathbf{x}), \qquad (4.2) $$

where h(x) = [h_1(x), ..., h_k(x)]^T is a vector of k fixed functions of the input vector x. Typical settings of this function are:

1. Linear regression, where h(x) = [1, x_1, ..., x_d]^T, and d is the dimension of the input space. Apparently, k = d + 1.

2. Non-linear regression, where h(x) = [phi_1(x), ..., phi_k(x)]^T. If phi_i = g(Wx), where g(·) is the sigmoid function, this model is the single-layer perceptron. Note that usually phi_1(x) = 1.

3. Kernel regression, where h(x) = [1, K(x, x_1), ..., K(x, x_{k-1})]^T, and K(x, x_i) is some kernel function, e.g. a Gaussian kernel.
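As a concrete illustration of these settings, the following sketch builds design matrices for the linear and kernel cases. The Gaussian kernel and its width are assumptions made for illustration; any kernel function could be substituted.

```python
import numpy as np

def linear_design(X):
    """h(x) = [1, x_1, ..., x_d]^T for each sample; returns the n x (d+1) design matrix H."""
    n = X.shape[0]
    return np.hstack([np.ones((n, 1)), X])

def kernel_design(X, centers, sigma=1.0):
    """h(x) = [1, K(x, x_1), ..., K(x, x_{k-1})]^T with Gaussian kernels (an assumed choice)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-d2 / sigma ** 2)
    return np.hstack([np.ones((X.shape[0], 1)), K])

# usage: y = H @ beta for any parameter vector beta of matching length
X = np.random.randn(150, 1)
H_lin = linear_design(X)            # 150 x 2
H_ker = kernel_design(X, X[:20])    # 150 x 21
```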
Typically, it is assumed that the output variable y from the training set is contaminated by additive white Gaussian noise, i.e. y = f(x, β) + w, where w = [w_1, ..., w_n]^T is a set of i.i.d. white Gaussian random variables with variance sigma^2. Thus the conditional probability p(y | β) is Gaussian, i.e.

$$ p(\mathbf{y} \mid \boldsymbol{\beta}) = N\big( H\boldsymbol{\beta},\, \sigma^2 I \big), \qquad (4.3) $$

where I is the identity matrix. We write H = h(x)^T, and H is called the design matrix. The element (i, j) of H, denoted by H_ij, is given by H_ij = h_j(x_i). Since the noise is assumed to be i.i.d., by simply applying the Maximum Likelihood Estimator (MLE) we obtain the log-likelihood

$$ l(\boldsymbol{\beta}) = \ln p(\mathbf{y} \mid \boldsymbol{\beta}) = -\frac{1}{\sigma^2} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{k} H_{ij}\beta_j \Big)^2 = -\frac{\| \mathbf{y} - H\boldsymbol{\beta} \|_2^2}{\sigma^2}, \qquad (4.4) $$

up to an additive constant, where the subscript 2 denotes the ℓ2 norm, i.e. the Euclidean norm of a vector, defined by ||z||_2 = sqrt(z^T z). The estimated parameter vector is

$$ \hat{\boldsymbol{\beta}} = \big( H^{\top}H \big)^{-1} H^{\top} \mathbf{y}. \qquad (4.5) $$

The problem with least-squares estimation is that it emphasizes fitting accuracy; with limited data samples it usually leads to severe over-fitting. That is, the trained function works well on the training sample set but generates large error on the testing set. This drawback is often referred to as low generalization ability. Note that in this case there is no preference on the parameter vector β, so its prior is a uniform distribution. This prior typically leads to some large values of the coefficients β_i, and the model is considered to have more complexity and roughness. With a zero-mean Gaussian prior for β with diagonal covariance matrix A, the estimation can be formalized by a maximum a posteriori (MAP) process. The prior on β then becomes an ℓ2-norm regularization term in the log-posterior, which favors a smaller β.
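The least-squares estimator of Eq. (4.5), together with the ridge estimator that this Gaussian prior leads to (its closed form is given in the next paragraph), can be sketched as follows; the noise level, prior variance, and toy data are placeholders.

```python
import numpy as np

def least_squares(H, y):
    """Eq. (4.5): beta = (H^T H)^{-1} H^T y (lstsq is used for numerical stability)."""
    return np.linalg.lstsq(H, y, rcond=None)[0]

def ridge(H, y, sigma2=1.0, mu2=1.0):
    """MAP estimate under a zero-mean Gaussian prior with A = mu^2 I:
    beta = (sigma^2 / mu^2 * I + H^T H)^{-1} H^T y."""
    k = H.shape[1]
    return np.linalg.solve((sigma2 / mu2) * np.eye(k) + H.T @ H, H.T @ y)

# usage with a random, over-parameterized design matrix
rng = np.random.default_rng(0)
H = rng.standard_normal((50, 40))
beta_true = np.zeros(40); beta_true[:3] = [2.0, -1.0, 0.5]
y = H @ beta_true + 0.1 * rng.standard_normal(50)
print(np.linalg.norm(least_squares(H, y)), np.linalg.norm(ridge(H, y)))
```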
It means that most of the [3’s components will be zero, and the probability of having a large value is relatively high, compared to the Gaussian distribution with the same variance. Fig. 3.2 compares three prior distributions, i.e. Laplacian, Gaussian, and Uniform. Utilizing the same MAP process, the estimation of 5 is given by A a = ...ng {lly — Hang + t Hall}, (4.6) where t = 2020 is a control parameter, which can favor either the squared error term or the él-norm regularization term. This criterion is also known as Least Absolute Shrinkage Selection Operator (LASSO) (Tibshirani, 1996). It is worth noting here that due to the non-Gaussian prior, the MAP estimation is not equivalent to the Bayesian estimation as it is in the ridge regression. Therefore, the estimation is bi- 142 ased. To make a unbiased estimation, one needs to integrate in all 5 space, which is computationally prohibitive. However, if the posterior concentrates at a certain point, then this biased estimation may only have a small variance from the unbi- ased estimation, which is desirable. This is one reason why a concentrated sparse prior is preferred. Some researchers introduced “hyper-parameter” to further steepen the prior, for example, in Relevance Vector Machine (RVM) (Tipping, 2001) and F igueiredo’s work (Figueiredo, 2003). Another reason that a sparse representation is desirable is because it improves the generalization of a learning system. For example, in supervised learning, the goal is to infer a mapping based on the training samples. The generalization capability is accomplished by reducing the complexity of the model, which is characterized by the number of non-zero parameters. This problem is formalized as [BO-norm minimiza- tion. It is known that EO-norm minimization is NP-hard (Amaldi and Kann, 1998). However, it is established that the solution of the [1 problem is the same as the [0 problem if a certain condition is satisfied. So, El-norm problem is still an important issue. In this chapter, we propose a method to solve the €1-norm version of the problem. We construct the dual problem of LASSO criterion as in Eq. (4.6) and use a gradient- based method to find the solution. This method is by no means claimed to be superior to the quadratic programming (QP) based method; however, it opens a perspective to address the problem differently. We have shown that the proposed method actually applies the competitive learning principle. In addition, this method can also motivate other biologically plausible models for solving similar problems. 143 The reminder of the chapter is organized as follows: In Section 4.2 we formulate the dual problem of the LASSO, and propose a solution based on gradient descent. We reformulate the proposed algorithm in Section 4.2.5. The experiment results and comparison with existing sparse optimization algorithms is in Section 4.3. Section 4.4 provides conclusions. 4.2 Method 4.2.1 Duality First, we will construct the dual problem of Eq. (4.6). Fig. 4.1 illustrates this problem. The Eq. (4.6) can be geometrically explained as the minimum €1-norm between the origin and the convex set K. This distance, according to duality theory, is equal to the maximum Koo-norm distance between the origin and the plane that separates the origin and the convex set K. In Fig. 4.1, the parabola denotes the convex set K. Point B is the closest point to the origin, in terms of the El-norm. Actually, the Zl—norm of B consists of two parts, the quadratic form of Eq. (4.6) and t ”fill 1 in Eq. 
4.2 Method

4.2.1 Duality

First, we construct the dual problem of Eq. (4.6). Fig. 4.1 illustrates this problem. Eq. (4.6) can be explained geometrically as the minimum ℓ1-norm distance between the origin and the convex set K. This distance, according to duality theory, is equal to the maximum ℓ∞-norm distance between the origin and a plane that separates the origin and the convex set K. In Fig. 4.1, the parabola denotes the convex set K. Point B is the closest point to the origin in terms of the ℓ1 norm. The ℓ1 norm of B consists of two parts, the quadratic form of Eq. (4.6) and the term t||β||_1 of Eq. (4.6), as illustrated in Fig. 4.1. Point A is on the tangent plane of point B, and A is the closest point to the origin in terms of the ℓ∞ norm. Duality theory establishes that the ℓ∞ norm reaches its maximum value while the ℓ1 norm reaches its minimum value. Thus, with a few manipulations, we get the following theorem to formalize this intuitive geometric explanation.

Figure 4.1: Geometric explanation of the duality. Point B is the closest point in K to the origin.

Theorem 4.2.1 If

$$ \boldsymbol{\beta}' = \arg\max_{\boldsymbol{\beta}} \frac{ t\, \boldsymbol{\beta}^{\top} J(\boldsymbol{\beta}) }{ Z\, \| J(\boldsymbol{\beta}) \|_{\infty} }, $$

where Z = ||y - Hβ||_2^2, then β' = β̂, where β̂ is the same as in Eq. (4.6).

The generalized proof can be found in (Luenverger, 1969).

4.2.2 Derivations

Let Z = ||y - Hβ||_2^2 = β^T H^T Hβ - 2y^T Hβ + y^T y, and denote

$$ J(\boldsymbol{\beta}) = \begin{cases} \dfrac{2}{t}\big( H^{\top}H\boldsymbol{\beta} - H^{\top}\mathbf{y} \big), & \text{if } \dfrac{2}{t}\,\big| \boldsymbol{\beta}^{\top}\mathbf{c}_m - \mathbf{y}^{\top}\mathbf{h}_m \big| > 1, \\[4pt] [0, 0, \ldots, 0]^{\top}, & \text{otherwise,} \end{cases} \qquad (4.12) $$

where

$$ m = \arg\max_{j} \big( \big| \boldsymbol{\beta}^{\top}\mathbf{c}_j - \mathbf{y}^{\top}\mathbf{h}_j \big| \big), \qquad (4.13) $$

C = H^T H, c_j is the j-th column of C, and h_j is the j-th column of H. Rearranging these equations, we have the final updating rules,

$$ \nabla\boldsymbol{\beta} \;\propto\; \frac{2C\boldsymbol{\beta}}{\| J(\boldsymbol{\beta}) \|_{\infty}} \;-\; \frac{4\big[ \boldsymbol{\beta}^{\top}C\boldsymbol{\beta} - \mathbf{y}^{\top}\mathbf{y} \big]}{t\, \| J(\boldsymbol{\beta}) \|_{\infty}} \big( \boldsymbol{\beta}^{\top}\mathbf{c}_m - \mathbf{y}^{\top}\mathbf{h}_m \big)\, \mathbf{c}_m. \qquad (4.14) $$

The switching condition between the two updating rules is the same as that in Eq. (4.12).

Remarks: Eq. (4.13) defines a competitive learning process. The algorithm finds the winner and uses the updating rule with the winner's value. The method searches for the optimal value of β that minimizes the criterion function in Eq. (4.6). With the ℓ1-norm constraint, some of the components β_i of the vector β will gradually decay. However, due to the nature of the gradient method, they will not become exactly zero, so an additional step that applies a hard threshold to set small β_i to zero helps the convergence of the algorithm.

The choice of learning rate in any gradient algorithm is important. Here we use a dynamic learning rate introduced in Chapter 3, i.e. the amnesic average. Suppose there are two statistical estimators Γ_1 and Γ_2 for estimating a parameter θ. If E||Γ_1 - θ||^2 < E||Γ_2 - θ||^2, Γ_1 is said to be more statistically efficient than Γ_2. We consider the right-hand side of Eq. 4.14 as an "observation." The goal is to obtain the mean of this observation while β is estimated incrementally. It is known that for many distributions the sample mean is the most efficient estimator of the mean of a random variable. When the distribution is unknown, the sample mean is the best linear estimator, which results in the minimum error variance. For many distributions, the sample mean reaches or approaches the Cramér-Rao bound (CRB). An efficient estimator is one that has the least variance from the real parameter, and its variance is bounded below by the CRB. Thus, we estimate the parameter vector by the sample mean of the observation in Eq. 4.14. The algorithm is guided by statistical efficiency, but it is not absolutely the most efficient one, because

1. the true distribution of the observation is unknown;

2. the distribution changes as β is incrementally estimated;

3. the amnesic average is used to gradually discount "old" observations, which reduces the statistical efficiency moderately.

4.2.3 Quadratic Programming

Quadratic programming (QP) is used for finding the solution of LASSO regression. The method suggested by David Gay (Tibshirani, 1996) transforms the LASSO into a QP problem with 2n variables and (2n + 1) constraints, where n denotes the number of kernels. In this subsection, we propose the dual-QP algorithm, which converts the original problem into a problem with fewer variables (n) and 2n constraints.
The dual problem of LASSO regression is described in Eq. (4.8); we recapitulate it as finding β such that

$$ \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \frac{ t\, \boldsymbol{\beta}^{\top} J(\boldsymbol{\beta}) }{ Z\, \| J(\boldsymbol{\beta}) \|_{\infty} }. \qquad (4.15) $$

This optimization problem is equivalent to

$$ \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \frac{t}{Z}\, \boldsymbol{\beta}^{\top} J(\boldsymbol{\beta}) \quad \text{s.t.} \quad \| J(\boldsymbol{\beta}) \|_{\infty} = 1. \qquad (4.16) $$

The constraint in Eq. (4.16) can be written as

$$ \max\Big( \Big\| \frac{2}{t}\big( H^{\top}H\boldsymbol{\beta} - H^{\top}\mathbf{y} \big) \Big\|_{\infty},\; 1 \Big) = 1, $$

and is simplified to

$$ \Big\| \frac{2}{t}\big( H^{\top}H\boldsymbol{\beta} - H^{\top}\mathbf{y} \big) \Big\|_{\infty} \le 1. \qquad (4.17) $$

Written in component form, Eq. (4.17) becomes

$$ \frac{2}{t}\, \big| \mathbf{c}_i^{\top}\boldsymbol{\beta} - a_i \big| \le 1, \quad \text{for } i = 1, \ldots, n, $$

where c_i^T is the i-th row of H^T H and a_i is the i-th element of H^T y. We can write the inequality in vector form as

$$ \frac{2 H^{\top}\mathbf{y}}{t} - \mathbf{1} \;\le\; \frac{2 H^{\top}H\boldsymbol{\beta}}{t} \;\le\; \mathbf{1} + \frac{2 H^{\top}\mathbf{y}}{t}. $$

Thus, the equivalent QP problem becomes

$$ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \big[ \boldsymbol{\beta}^{\top} H^{\top}H \boldsymbol{\beta} \big] \quad \text{s.t.} \quad -\frac{2 H^{\top}H\boldsymbol{\beta}}{t} \le \mathbf{1} - \frac{2 H^{\top}\mathbf{y}}{t}, \qquad \frac{2 H^{\top}H\boldsymbol{\beta}}{t} \le \mathbf{1} + \frac{2 H^{\top}\mathbf{y}}{t}. $$

It is worth noting that the above QP has n variables and 2n constraints.
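Any off-the-shelf QP solver can handle the n-variable, 2n-constraint problem stated above. The sketch below uses CVXPY, an external package not used in the dissertation, and writes the quadratic objective as a sum of squares; the test data and the value of t are placeholders.

```python
import numpy as np
import cvxpy as cp

def dual_qp_lasso(H, y, t):
    """Solve the dual QP stated above: minimize beta^T H^T H beta under the 2n box constraints."""
    n = H.shape[1]
    HtH, Hty = H.T @ H, H.T @ y
    beta = cp.Variable(n)
    constraints = [-2 * HtH @ beta / t <= 1 - 2 * Hty / t,
                    2 * HtH @ beta / t <= 1 + 2 * Hty / t]
    # beta^T H^T H beta is written as ||H beta||_2^2 to keep the objective in a solver-friendly form
    cp.Problem(cp.Minimize(cp.sum_squares(H @ beta)), constraints).solve()
    return beta.value

rng = np.random.default_rng(0)
H = rng.standard_normal((80, 30))
beta_true = np.zeros(30); beta_true[[2, 11]] = [1.5, -2.0]
y = H @ beta_true + 0.05 * rng.standard_normal(80)
print(np.round(dual_qp_lasso(H, y, t=5.0), 3))
```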
4.2.4 Filter Development

Let us reconsider the original linear kernel regression problem in a special case. Suppose the design matrix H is an orthogonal n x n matrix and there is no additive noise, so β = H^T y. Also, we now consider optimization not with respect to β but with respect to the design matrix H. Then we get the criterion function adapted from Eq. (4.6):

$$ H = \arg\min_{H} \big\{ \| H^{\top}\mathbf{y} \|_1 \big\}. $$

Using the same gradient technique described in Section 4.2.2, we can get the updating rule of the design matrix H,

$$ \nabla \mathbf{h}_i \;\propto\; 2\,\delta_{ci}\, (\mathbf{h}_i^{\top}\mathbf{y})\, \mathbf{y}, $$

where c = arg max_i |h_i^T y|, and δ_{ci} is the Kronecker delta: δ_{ci} = 1 if c = i, and 0 otherwise. This is precisely a "winner-take-all" algorithm: only the winning kernels (neurons) get updated. For more details about this method, see (Zhang and Weng, 2004).

4.2.5 Gradient Sparseness Optimization Algorithm

We summarize the training procedure in Algorithm 6. Each iteration of this algorithm is computationally efficient, because it involves only matrix multiplication and a maximum function. The time complexity of each iteration is O(kn), where k is the number of basis vectors and n is the dimension of each basis vector.

Algorithm 6 Gradient Sparseness Optimization Algorithm
1: Preprocessing: Get the training set (x_i, y_i), i = 0, 1, 2, ..., n. For each i, replace y_i by y_i - ȳ, where ȳ = (1/n) times the sum of the y_i.
2: β_i = 0, i = 0, 1, 2, ..., k.
3: Initialize the design matrix H, and compute C = H^T H.
4: repeat
5:   k = 1.
6:   Compute m = arg max_j ( |β^T c_j - y^T h_j| ).
7:   if (2/t)(β^T c_m - y^T h_m) > 1 then
8:     Update β with
         β_new = w_1 β_old + w_2 [ 2Cβ / ||J(β)||_∞ - (4[β^T Cβ - y^T y] / (t ||J(β)||_∞)) (β^T c_m - y^T h_m) c_m ],
       where w_1 and w_2 are the amnesic average coefficients defined in Eq. (3.32) and Eq. (3.33), respectively.
9:   else
10:    Update β with
         β_new = w_1 β_old + w_2 [ 2Cβ / ||J(β)||_∞ ].
11:  end if
12:  k = k + 1.
13: until the objective function in Eq. (4.8) reaches the target value, or the change in β is less than some threshold.

4.3 Experiments

4.3.1 Kernel Regression

Our first experiment illustrates the performance of the proposed algorithm in kernel regression. The regression model is

$$ y = f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \sum_{i=1}^{k} \beta_i K(\mathbf{x}, \mathbf{x}_i), $$

where K(x, x_i) = exp(-||x - x_i||^2 / sigma^2) is the kernel function, and x_i and sigma are the kernel parameters. The function to be approximated is the 1-D sinc function y = sin(x)/x. We randomly collected 150 samples and added Gaussian noise with variance 0.01. The first row in Fig. 4.2 shows the fitting results of the proposed method, ridge regression, and the proposed dual-QP LASSO regression algorithm, respectively. The dots are samples with noise, and the dashed lines are the ground-truth sinc function. Solid lines show the approximation results. The circled dots correspond to the kernels with nonzero weights, a.k.a. the "supporting kernels." In the second row, we use bar plots to display the weights of those kernels. As Fig. 4.2 clearly indicates, both the proposed gradient sparseness method and dual-QP LASSO achieve sparseness. The ℓ0 norms are also marked on the panels. In our testing, the proposed method performs better than LASSO regression. Fig. 4.3 shows the convergence procedure; the proposed method takes about 200 iterations to converge. Fig. 4.4 illustrates how the control parameter t = 2 sigma^2 alpha affects the mean square error and model sparseness. We conducted 20 tests with t ranging from 0.3 to 1.5. As indicated in the figure, a greater t makes the model fit well but increases its complexity, and vice versa.

Figure 4.2: Kernel regression results. The dashed lines are the true sinc function. Solid lines are the approximation results. The circles shown in the first row are the selected kernel functions. Note that we do not show circles in the ridge regression result, because every kernel is selected. (a) Proposed gradient method (ℓ0 = 10). (b) Ridge regression (ℓ0 = 150). (c) LASSO regression (ℓ0 = 11).

Figure 4.3: Objective function vs. number of iterations. We utilize a dynamic learning rate mechanism called the amnesic average. Interested readers can refer to (Weng and Zhang, 2004) for more details.

Figure 4.4: The effect of the control parameter t on the mean square error and sparseness.

4.3.2 Classification

The experiments addressed the kernel-based classifier for two-class problems: a special case of the regression problem with y in {+, -}. The classifier is formulated as the following regression function:

$$ p(y \mid \mathbf{x}) = \psi\Big( \beta_0 + \sum_{i=1}^{k} \beta_i K(\mathbf{x}, \mathbf{x}_i) \Big), $$

where psi denotes the logistic function psi(z) = (1 + exp(-z))^{-1}. If p(y | x) > 0.5, x belongs to the class +; otherwise, x belongs to the class -.
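A minimal sketch of such a kernel logistic classifier is given below. The Gaussian kernel width and the plain gradient-ascent fit are stand-ins for illustration; in this chapter the weights β would instead come from the proposed sparse optimizer.

```python
import numpy as np

def gaussian_kernel(X, centers, sigma=1.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def fit_kernel_logistic(X, y, sigma=1.0, lr=0.1, n_iter=2000):
    """Fit p(y|x) = sigmoid(beta_0 + sum_i beta_i K(x, x_i)) by plain gradient ascent."""
    K = np.hstack([np.ones((len(X), 1)), gaussian_kernel(X, X, sigma)])
    beta = np.zeros(K.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-K @ beta))
        beta += lr * K.T @ (y - p) / len(X)     # gradient of the log-likelihood
    return beta

def predict(beta, X, centers, sigma=1.0):
    K = np.hstack([np.ones((len(X), 1)), gaussian_kernel(X, centers, sigma)])
    p = 1.0 / (1.0 + np.exp(-K @ beta))
    return (p > 0.5).astype(int)                # class "+" when p(y|x) > 0.5

# XOR-style toy data, as in the first classification experiment described next
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))
y = (np.sign(X[:, 0] * X[:, 1]) > 0).astype(int)
beta = fit_kernel_logistic(X, y, sigma=0.5)
print((predict(beta, X, X, sigma=0.5) == y).mean())
```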
First, we run the classifier on an artificial data set drawn randomly from the 2-D XOR function: y = sign(x_1 * x_2), where x_1, x_2 are in (-1, 1). The data set includes 30 samples. Two methods are compared. Fig. 4.5(a) shows the decision boundary of our proposed method, and Fig. 4.5(b) illustrates the values of the learned coefficient vector β; for comparison, Fig. 4.5(c) and (d) give the corresponding results using the LSSVM (Suykens and Vandewalle, 1999). The sparseness is improved tremendously.

Second, we use three data sets from real-world problems: the Pima Indian diabetes set (downloadable at www.stats.ox.ac.uk/pub/PRNN), which is collected from women of Pima heritage, where the goal is to decide whether a subject has diabetes based on 7 different tests; the Wisconsin breast cancer (WBC) set (downloadable at www.ics.uci.edu/mlearn/MLSummary.html), whose goal is to diagnose (benign/malignant) based on the results of 9 measurements; and the AR face database (available at rvl1.ecn.purdue.edu/vl/ARdatabase/ARdatabase.html), from which two male subjects were selected, each with 26 samples. In WBC, we removed the cases with missing attributes for simplicity.

Figure 4.5: Results on the artificial XOR data set. (a) The decision boundary of the proposed winner-take-all method. (b) Kernel selection of the proposed method. (c) The decision boundary of the Least Squares Support Vector Machine (LSSVM). (d) Kernel selection of LSSVM.

Table 4.1 shows the results of the proposed classifier in the 5-fold cross-validation experiment. For comparison, we also include the results of the logistic classifier.

Table 4.1: The results of the 5-fold cross-validation (ROC / No. of kernels).

                        Pima        WBC         AR face
  Logistic              0.75/200    0.77/455    0.81/40
  LSSVM                 0.90/200    0.76/455    0.96/40
  Proposed classifier   0.96/70     0.98/253    0.97/20

In all data sets, the proposed classifier is better. The performance is improved partially by setting some decayed kernel weights to exactly zero.

4.4 Conclusions

In this chapter, we have formulated the dual problem of ℓ1-norm based sparse approximation. We show the geometric relation of the duality and solve the dual ℓ∞ maximization problem by gradient descent and by quadratic programming. The algorithm's performance is close to or better than the result of LASSO regression. Compared with traditional methods, the efficiency of this algorithm is quite remarkable.

Chapter 5

Conclusions

5.1 Summary

This dissertation presents a general-purpose vision architecture, shown in Fig. 5.1, which can accommodate attention selection and the development of a vision system. Motivated by the structure of the early visual pathway, several layers of staggered receptive fields are used to model the pattern of positioning of the processing cells of the early visual pathway. From sequentially arriving video frames, the proposed SHM algorithm develops a hierarchy of filter banks whose outputs are uncorrelated within each layer, with an increasing scale of receptive fields from low to high layers. We also show how this general architecture can be applied to occluded face recognition, which demonstrates a case of attention selection. The architecture shown in Fig. 5.1 is equally applicable to other sensory modalities, such as vision, speech and touch, each with a different set of developmental parameters (e.g., the extent of temporal context).

Figure 5.1: The flow diagram of a developmental vision system. The blocks include the retina and a delayed retina, sensory-mapping areas A1 through A4 with their delayed copies, a cognitive mapping, a motor mapping (to motors), innate behaviors, and top-down attention control that inhibits responses from black cells. The sensorimotor module for innate behavior is illustrated by a single block only for simplicity; it also contains the three mappings, which are, however, much smaller in storage and much simpler in the mappings they realize.

Besides the general architecture, a developmental algorithm for sensory networks has been designed and tested, based on the concept of lobe components. Our simulation results of network development revealed that many lobe component cells developed by the algorithm from natural images are orientation-selective cells similar to those found in V1. The proposed LCA method is based on two well-known computational neural mechanisms, Hebbian learning and lateral inhibition.
It is an in-place developmental algorithm in which every cell is both a signal processor and the developer of the cortex. There is no need for a separate network to deal with the development (and learning). The algorithm is purely incremental in the sense that there is no need for extra storage of information beyond what can be stored and incrementally computed by the processing network itself.

We further discuss sparseness optimization. We propose a new perspective to achieve sparseness via the winner-take-all principle for linear kernel regression and classification tasks. We form the dual of the LASSO criterion and convert an ℓ1 norm minimization into an ℓ∞ norm maximization problem. The winner-take-all updating rule coming from the ℓ∞ term is especially efficient due to the simplicity of the maximum function. This method has been successfully tested on several tasks, including basis/feature selection, classification and regression.

5.2 Contributions

The major contributions of this dissertation are summarized in Table 5.1.

Table 5.1: Contributions of the dissertation.

  General vision architecture: Proposition and implementation of a general vision architecture; automatically developing filters from natural images; demonstrating a case of attention selection.
  Blind source separation: An efficient solution to the super-Gaussian blind source separation problem by the LCA method.
  Feature derivation: A fast LCA algorithm that applies to spatial filter development and solves the local feature extraction problem online and in real time.
  Natural image filters: Developing LCA filters from a natural image set; showing orientation-preference neurons and the pinwheel structure of the filter topology.
  Feature selection: Sparse optimization and a winner-take-all method for optimal selection of basis functions and features.

5.3 Summary of future work

The research in this dissertation shows that a general-purpose vision system can solve a variety of vision problems. However, some issues need to be explored further:

1. An interesting research direction in ICA is independent subspace analysis, in which the components are divided into groups or subspaces so that components in different subspaces are independent, but components in the same subspace are not. In particular, the distribution of the components in a subspace is assumed to depend only on the norm of the projection onto that subspace. Typically, this implies that the components of a subspace tend to be active simultaneously. The proposed LCA algorithm could be applied to this problem. In the original LCA algorithm, only one winner is updated to "take care of" each sample in the space, which leads to the direction of maximum sparseness. If we modify the algorithm to update the top n winners among all the vectors that have the maximum subspace projection norm, then we could find the sparse structure in this subspace representation. At the current stage we have not yet tested this hypothesis, but it is a reasonable research direction for future work.

2. Other basis development algorithms should also be considered and explored, including but not limited to Linear Discriminant Analysis (LDA), non-linear PCA, etc.

3. Temporal information is extremely important in perception; however, the current model does not deal with much temporal information.
A promising research direction is to incorporate temporal information and look for spatio-temporal filter development algorithms that can capture structure in both the spatial and temporal domains.

4. A general-purpose vision system including an attention selection mechanism that accommodates both top-down and bottom-up attention selection should be tested in real environments. This architecture will serve as the sensory mapping part of the autonomous mental development (AMD) paradigm.

Appendix A

The Candid Covariance-free Incremental PCA Algorithm

The main goal of the CCIPCA algorithm (Weng et al., 2003) is to compute the principal component vectors incrementally without estimating the covariance matrix of the input vectors. Similar techniques date back to (Oja, 1982) and (Sanger, 1989). However, in this method statistically efficient estimation is considered; thus it is much faster than the previous learning-rate based methods. The following algorithm does not assume staggering; staggering only requires that the residual images are tiled together before the next principal vector is computed at a shifted position.

In this algorithm, the principal components of the input are computed incrementally. Each input is a column vector x_1, x_2, ..., x_n, ..., with dimension d. The covariance matrix A of the vectors is unknown and cannot be estimated. An incremental approximation is needed of the first k largest eigenvalues λ_1(n), λ_2(n), ..., λ_k(n) and the associated eigenvectors V(n) = [v_1(n), v_2(n), ..., v_k(n)], for n ≥ k, where each v_i(n) is a d-dimensional column vector, so V(n) has k columns. Since retaining each of x_1, x_2, ..., x_n would require too much space, all operations are done incrementally. The previous eigenvalues λ_i(n-1) and previous eigenvectors V(n-1) must be updated using only the new sample input x_n.

CCIPCA algorithm
1: Choose the initial guess of x̄(0) and k, where x̄(0) is the sample mean and k is the number of eigenvectors with the largest k eigenvalues.
2: Choose the initial guess of V(0). For example, it can be any normalized random image. Choose the initial guess λ_i(0) = 0, i = 1, 2, ..., k.
3: for n = 1, 2, 3, ..., N, where N is the total number of samples, do
4:   Get a new sample x_n.
5:   Use the following incremental method to compute the amnesic average x̄(n):
       x̄(n) = ((n - 1 - l)/n) x̄(n-1) + ((1 + l)/n) x_n,   (A.1)
     where l is the amnesic average parameter. When n - 1 - l ≤ 0, use non-amnesic averaging (the above equation with l temporarily set to zero).
6:   Compute the scatter vector: u_n = x_n - x̄(n).
7:   Set the first residual vector u_1 = u_n.
     {Now we proceed to update the i-th principal vector and project the residual vector. Between iterations, the residual vector (next receptive field) is returned to a larger residual image before the next filter is extracted. At this point, shifting can occur to implement the "scattered" receptive field method.}
8:   for i = 1, 2, 3, ..., k do
9:     Update the i-th principal vector:
         v_i(n) = α(n) v_i(n-1) + β(n) ( u_i^T v_i(n-1) / ||v_i(n-1)|| ) u_i,   (A.2)
       where
         α(n) = (n - 1 - l)/n,   β(n) = (1 + l)/n   (A.3)
       are the coefficients of the amnesic average, and l is a constant.
10:    Compute the coordinate of the residual vector along the current i-th principal component:
         y_i = u_i^T v_i(n) / ||v_i(n)||.   (A.4)
11:    Update the i-th eigenvalue:
         λ_i(n) = α(n) λ_i(n-1) + β(n) y_i^2.   (A.5)
12:    Compute the remaining residual vector (next input):
         u_{i+1} = u_i - y_i v_i(n) / ||v_i(n)||.   (A.6)
13:  end for
14: end for
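A compact sketch of the loop above follows. It simplifies the early-sample handling (clamping n - 1 - l at zero rather than switching to plain averaging), omits the staggered-receptive-field tiling mentioned in the comment of step 7, and uses an arbitrary initialization; it is meant only to make the updates (A.1)-(A.6) concrete.

```python
import numpy as np

def ccipca(X, k, l=2.0):
    """Incremental PCA following the CCIPCA updates (A.1)-(A.6), without staggering."""
    d = X.shape[1]
    mean = np.zeros(d)
    V = np.random.default_rng(0).standard_normal((k, d)) * 1e-2  # initial guesses of v_i
    lam = np.zeros(k)                                            # eigenvalue estimates
    for n, x in enumerate(X, start=1):
        a = max(n - 1 - l, 0) / n                                # amnesic coefficients
        b = 1.0 - a
        mean = a * mean + b * x                                  # Eq. (A.1)
        u = x - mean                                             # scatter vector
        for i in range(k):
            vn = np.linalg.norm(V[i]) + 1e-12
            V[i] = a * V[i] + b * (u @ V[i] / vn) * u            # Eq. (A.2)
            vn = np.linalg.norm(V[i]) + 1e-12
            y_i = u @ V[i] / vn                                  # Eq. (A.4)
            lam[i] = a * lam[i] + b * y_i ** 2                   # Eq. (A.5)
            u = u - y_i * V[i] / vn                              # Eq. (A.6)
    return V / np.linalg.norm(V, axis=1, keepdims=True), lam

# usage on synthetic data with two dominant directions
rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 10)) * np.array([5, 3] + [1] * 8)
V, lam = ccipca(X, k=3)
print(lam)
```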
The eigenvectors that are incrementally developed by this algorithm represent a basis of the sample space. Their respective eigenvalues represent the weight of each of the filters and are used to normalize them, so that each filter has the same weight when computing a response to the input. This procedure is also referred to as "whitening." After whitening, the outputs of the filters are uncorrelated and have an identity covariance matrix. The whitened filter output is computed by dividing each filter's response by the square root of the corresponding eigenvalue.

Appendix B

Derivative of the Maximum Function

To simplify the derivation, let

$$ f_i(\boldsymbol{\beta}) = \Big( \frac{2}{t}\big( \boldsymbol{\beta}^{\top}\mathbf{c}_i - \mathbf{y}^{\top}\mathbf{h}_i \big) \Big)^2, \quad i < n, \qquad f_n(\boldsymbol{\beta}) = 1. \qquad (B.1) $$

Then

$$ \max_j \{ f_j(\boldsymbol{\beta}) \} = \lim_{r\to\infty} \Big[ \sum_j f_j(\boldsymbol{\beta})^r \Big]^{1/r}. $$

The derivative of the maximum function can therefore be written as

$$ \frac{\partial}{\partial\boldsymbol{\beta}} \max_j \{ f_j(\boldsymbol{\beta}) \} = \lim_{r\to\infty} \Big\{ \frac{1}{r} \Big[ \sum_j f_j(\boldsymbol{\beta})^r \Big]^{\frac{1}{r}-1} \sum_j r\, f_j(\boldsymbol{\beta})^{r-1} \frac{\partial f_j(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} \Big\} = \lim_{r\to\infty} \Big\{ \Big[ \sum_j f_j(\boldsymbol{\beta})^r \Big]^{\frac{1}{r}-1} \sum_j f_j(\boldsymbol{\beta})^{r-1} \frac{\partial f_j(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} \Big\}. \qquad (B.2) $$

Let

$$ A = \sum_j f_j(\boldsymbol{\beta})^r, \qquad (B.3) $$

and let m = arg max_j { f_j(β) }. Then

$$ \frac{\partial}{\partial\boldsymbol{\beta}} \max_j \{ f_j(\boldsymbol{\beta}) \} = \lim_{r\to\infty} A^{\frac{1}{r}-1} \sum_j f_j(\boldsymbol{\beta})^{r-1} \frac{\partial f_j(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}. \qquad (B.4) $$

Notice that when r approaches infinity, A^{1/r - 1} approaches f_m(β)^{1-r}, so only the term f_m(β)^{r-1} / f_m(β)^{r-1} = 1 remains and all other terms vanish. We obtain the derivative of the maximum function,

$$ \frac{\partial}{\partial\boldsymbol{\beta}} \max_j \{ f_j(\boldsymbol{\beta}) \} = \frac{\partial f_m(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}, \qquad (B.5) $$

where m = arg max_j { f_j(β) }.

Bibliography

Adrian, E. D. (1941). Afferent discharges to the cerebral cortex from peripheral sense organs. J. of Physiology, 100:154-191.

Aloimonos, Y. (1992). Purposive active vision. Computer Vision, Graphics, and Image Processing: Image Understanding, 17(1-3):285-348.

Amaldi, E. and Kann, V. (1998). On the approximibility of minimizing non zero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209:237-260.

Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27:77-87.

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276.

Amari, S.-I., Cichocki, A., and Yang, H. (1996). A new learning algorithm for blind source separation. In Advances in Neural Information Processing Systems 8, pages 757-763. MIT Press.

Atick, J. J. and Redlich, A. N. (1993). Convergent algorithm for sensory receptive field development. Neural Computation, 5:45-60.

Backer, G., Mertsching, B., and Bollmann, M. (2001). Data- and model-driven gaze control for an active-vision system. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(12):1415-1429.

Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages. In Rosenblith, W. A., editor, Sensory Communication, pages 217-234. MIT Press.

Barlow, H. B. (1972). Single units and sensation: a neurone doctrine for perceptual psychology? Perception, 1:371-394.

Bartsch, A. P. and van Hemmen, J. L. (2001). Combined hebbian development of geniculocortical and lateral connectivity in a model of primary visual cortex. Biological Cybernetics, 84:41-55.

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159.

Bell, A. J. and Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327-3338.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University Press, Oxford.

Blakemore, C. and Cooper, G. F. (1970). Development of the brain depends on the visual environment. Nature, 228:477-478.

Blaschke, T. and Wiskott, L. (2002). An improved cumulant based method for independent component analysis. In Dorronsoro, J. R., editor, Proc. Intl. Conf.
on Artificial Neural Networks - I CANN ’02, Lecture Notes in Computer Science, pages 1087—1093. Springer. Burger, T. and Lang, E. (1999). An incremental Hebbian learning model of the primary visual cortex with lateral plasticity and real input patterns. Zeitschrift fur Naturforschung C—A Journal of Biosciences, 54:128—140. Cardoso, J .-F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4:112—114. Cardoso, J.—F. and Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12):3017—3030. Christianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Ma- chines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK. Comon, P. (1994). Independent component analysis, A new concept? Signal Pro- cessing, 36:287—314. Crair, M., Gillespie, D., and Stryker, M. (1998). The role of visual experience in the development of columns in cat visual cortex. Science, 279:566—570. Desimone, R. and Duncan, J. (1995). Neural mechanism of selective visual attention. Annual Review Neuroscience, 18:193—222. Ditterich, J ., Eggert, T., and Straube, A. (2000). The rule of attention focus in the visual information processing underlying saccadic adaptation. Vision Research, 40:1125—1134. Elman, J., Bates, E. A., Johnson, M. H., Karmiloff—Smith, A., Parisi, D., and Plun- kett, K. (1997). Rethinking Innateness: A connectionist perspective on develop- ment. MIT Press, Cambridge, Massachusetts. Eriksen, C. and Yeh, Y. (1985). Allocation of attention in the visual field. Journal of experimental Psychology: Human Perception and Performance, 11(5):583—597. 173 Field, D. J. (1994). What is goal of sensory coding? Neural Computation, 6:559~601. F igueiredo, M. (2003). Adaptive sparseness for supervised learning. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(9):1150—1159. Friedman, J. (1987). Exploratory projection pursuit. J. of the American Statistical Association, 82(397):249—266. Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for ex- ploratory data analysis. IEEE Trans. of Computers, c-23(9):881-—890. Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193—202. Gersho, A. (1979). Asymptotically optimal block guantization. IEEE Trans. on Information Theory, 25(4):373—380. Giannakopoulos, X., Karhunen, J., and Oja, E. (1998). Experimental comparison of neural ICA algorithms. In Proc. Int. Conf. on Artificial Neural Networks (ICANN’98), pages 651—656, Skvde, Sweden. Gray, R. M. and Neuhoff, D. L. (1998). Guantization. IEEE Trans. on Information Theory, 44(6):1—63. Hebb, D. O. (1949). The organization of behavior; a neuropsychological theory. Wiley, New York. Herault, J. and Jutten, J. (1986). Space or time adaptive signal processing by neural network models. In Denker, J. S., editor, Neural Networks for Computing: AIP Conference Proceedings, volume 151, New York. American Institute for Physics. Hoerl, A. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthog- onal problems. Technometrics, 12:55—67. Hopfinger, J., Buonocore, M., and Mangun, G. (2000). The neural mechanism of top-down attention control. Nature Neuroscience, 3(3):284-291. Hubel, D. and Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. of Physiology, 160:106—154. Huber, P. J. (1985). 
Projection pursuit. The Annals of Statistics, 13(2):435—475. Hwang, W. S. and Weng, J. (2000). Hierarchical discriminant regression. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11):1277—1293. Hyvarinen, A. (1998). New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems, volume 10, pages 273—279. MIT Press. 174 Hyvarinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2:94—128. Hyvarinen, A. (2001). Fast independent component analysis. In Roberts, S. and Everson, R., editors, Independent Component Analysis: Principles and Practice. Cambridge University Press. in press. Hyvarinen, A. and Hoyer, P. O. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705—1720. Hyvarinen, A. and Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483-1492. Hyvarinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4—5):411—430. Itti, L., Gold, C., and Koch, C. (2001). Visual attention and target detection in clutterd natural scenes. Optical Engineering, 40(9):1784-1793. Itti, L. and Koch, C. (2001). Computational modeling of visual attention. Nature Reviews Neuroscience, 2(3):194—203. Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(11):1254—1259. James, W. (1890/1981). The Principles of Psychology. Harvard University Press, Cambridge, MA. Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag, New York. Jones, M. and Sibson, R. (1987). What is projection pursuit ? J. of the Royal Statistical Society, Ser. A, 150:1—36. Kandel, E. R., Schwartz, J. H., and Jessell, T. M., editors (2000). Principles of Neural Science. McGraw-Hill, New York, 4th edition. Karhunen, J. and Pajunen, P. (1997). Blind source separation using least-squares type adaptive algorithms. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pages 3048—3051, Munich, Germany. Karhunen, J ., Pajunen, P., and Oja, E. (1998). The nonlinear pca criterion in blind source separation: Relations with other approaches. Neurocomputing, 22:5—20. Karvanen, J. and Cichocki, A. (2003). Measuring sparseness of noisy signals. In Proceedings of 4th International symposium on Independent Component Analysis and Blind Signal Seperotion (ICA2003), pages 125—130. 175 Kass, J. (1997). Topographic maps are fundamental to sensory processing. Brain Research Bulletin, 44(2):107-112. Kirby, M. and Sirovich, L. (1990). Application of the karhunen—loeve procedure for the characterization of human faces. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(1):103—108. Koch, C. and Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neuronbiology, 42219—227. Kohonen, T. (2001). Self-organizing Maps. Springer-Verlag, New York. 3rd edition. Kullback, S. (1959). Information Theory and Statistics. Wiley. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22:79—86. Lee, T. W., Girolami, M., and Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed sub—gaussian and super- gaussian sources. 
Neural Computation, 11(2):417—441. Lehmann, E. L. (1983). Theory of Point Estimation. John Wiley and Sons, Inc., New York. Lin, Y. (2002). Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, 6:259—275. Linsker, R. (1986). From basic network principles to neural architecture: emergence of spatial-oponent cells. Proc. Natl. Acad. Sci, 83:7508—7512, 8390—8394, 8779—- 8783. Linsker, R. (1988). Self-organization in a perceptual network. Computer, pages 105— 117. Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual informa- tion in a linear network. Neural Computation, 4:691—702. Luenverger, D. G. (1969). Optimization by Vector Space Methods. John Wiley and Sons, Inc. Marshall, W. and Talbot, S. (1942). Recent evidence for neural mechanisms in vision leading to a general theory of sensory acuity. Bio. Symp., 7:117—164. McLachlan, G. J. (1997). The EM Algorithm and Extensions. Wiley, New York. Miller, K. D. (1994). A model for the development of simple cell receptive fields and the odered arrangement of orientation columns through activity-dependent competition between on- and off-center inputs. J. of Neuroscience, 14:409—441. 176 Nadal, J.-P. and Parga, N. (1994). Non-linear neurons in the low noise limit: a factorial code maximizes information transfer. Network, 52565-581. Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. of Mathematical Biology, 15:267—273. Olshausen, B., Anderson, C., and Essen, D. V. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13(11):4700-4719. Olshaushen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607—609. Olshaushen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy used by v1? Vision Research, 37(23):3311—3325. Pajunen, P. and Karhunen, J. (1998). Least-squares methods for blind source sepa- ration based on nonlinear PCA. Int. J. of Neural Systems, 8(5-6):601-612. Papoulis, A. (1991). Probability, random variables, and stochastic processes. McGraw- Hill, New York. 3rd edition. Pashler, H., editor (1996). Attention. University College London, London. Pearlmutter, B. A. and Parra, L. C. (1997). Maximum likelihood blind source sepa- ration: a context-sensitive generalization of ICA. In Advances in Neural Infor- mation Processing Systems, volume 9, pages 613-619. Pham, D.—T. and Garrat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Signal Processing, 45(7):1712-1725. Pham, D.-T., Garrat, P., and Jutten, C. (1992). Separation of a mixture of inde- pendent sources through a maximum likelihood approach. In Proc. E USIPCO, pages 771—774. Roth, Z. and Baram, Y. (1996). Multidimensional density shaping by sigmoids. IEEE Trans. on Neural Networks, 7(5):1291—1298. Rubner, J. and Schulten, K. (1990). Development of feature detectors by self- organization. Biological Cybernetics, 62:193—199. Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedfor- ward neural network. IEEE Trans. on Neural Networks, 2(7):459—473. Schill, K., Umkehrer, E., Beinlich, S., Krieger, G., and Zetzsche, C. (2001). Scene analysis with saccadic eye movemens: Top-down and bottom-up modeling. Jour- nal of Electronic Imaging, 10(1):152—160. 
177 Sillito, A., Grieve, K., Jones, H., Cudeiro, J., and Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature, 378:492—496. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chap- man and Hall, London. Sirosh, J. and Miikkulainen, R. (1997). Topographic receptive field and patterned lateral interaction in a self-organizing model of the primary visual cortex. Neural Computation, 9:577—594. Sur, M., Angelucci, A., and Sharm, J. (1999). Rewiring cortex: The role of pat- terned activity in development and plasticity of neocortical circuits. Journal of Neurobiology, 41:33—43. Suykens, J. A. K. and Vandewalle, J. (1999). Least squares support vector machine classifiers. Journal of Neural Processing Letters, 9(3):293—300. Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. J. Royal Statistical Society (B), 58:267—288. Tipping, M. (2001). Sparse bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211—244. Treisman, A. (1986). Features and objects in visual processing. Scientific American, 255(5):114-—125. Treisman, A. and Gelade, G. (1980). A feature—intergration theory of attention. Cognitive Pshchology, 12(1):97—136. Tsotsos, J., Culhane, S., Wai, W., Lai, Y., Davis, N., and Nuflo, F. (1995). Modeling visual attention via selective tuning. Artificial Intelligence, 78:507—545. Tucker, H. G. (1962). An introduction to probability and mathematical statistics. Academic Press, New York. Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71—86. Ungerleider, L. and Mishkin, M. (1980). Two cortical visual systems. In Ingle, D. J ., Goodale, M. A., and Mansfield, R. J. W., editors, Analysis of Visual Behavior. MIT Press, Cambridge, MA. von der Malsburg, C. (1973). Self-organization of orientation-sensitive cells in the striate cortex. Kybernetik, 15285—100. Weinberg, R. J. (1997). Are topographic maps fundamental to sensory processing? Bruin Rearch Bulletin, 44(2):113—116. Weng, J ., Ahuja, N., and Huang, T. (1997). Learning recognition and segmentation using the cresceptron. International Journal of Computer Vision, 25(2):109—143. 178 Weng, J., Huang, T. S., and Ahuja, N. (1993). Motion and Structure from Image Sequences. Springer-Verlag, New York. Weng, J. and Hwang, W. (2002). Online image classification using IHDR. Interna- tional Journal on Document Analysis and Recognition, 5(2-3):118—125. Weng, J., McClelland, J., Pentland, A., Sporns, O., Stockman, I., Sur, M., and Thelen, E. (2001). Autonomous mental development by robots and animals. Science, 291(5504):599—600. Weng, J. and Stockman, I. (2002). Autonomous mental deveIOpment: Workshop on development and learning. AI Magazine, 23(2):95—98. Weng, J. and Zhang, N. (2004). A quasi-optimally efficient algorithm for independent component analysis. In Proc. IEEE Int. Conf. on Acoustics Speech, and Signal Processing, Montreal, Canada. Weng, J., Zhang, Y., and Hwang, W. (2003). Candid covariance-free incremental principal component analysis. IEEE Trans. Pattern Analysis and Machine In- telligence, 25(8):1034—1040. Wiskott, L. and Sejnowski, T. (2002). Slow feature analysis: unsupervised learning of invariances. Neural Computation, 14(4):715—770. Zador, P. L. (1982). Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Trans. on Information Theory, 28(2):139—149. Zhang, N. and Weng, J. (2004). 
Sparse representation from a winner-takeall neu- ral network. In Proc. IEEE Int. Joint Conf. on Neural Networks, Budapest, Hungary. 179