MULTI-LAYER IN-PLACE LEARNING FOR AUTONOMOUS MENTAL DEVELOPMENT

By

Matthew D. Luciw

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science and Engineering

2006

ABSTRACT

MULTI-LAYER IN-PLACE LEARNING FOR AUTONOMOUS MENTAL DEVELOPMENT

By Matthew D. Luciw

Cortical self-organization during open-ended development is a core issue for perceptual development. Traditionally, unsupervised learning and supervised learning are two different types of learning conducted by different networks. However, there is no evidence that the biological nervous system treats them in a disintegrated way. The computational model presented here integrates both types of learning using a new biologically inspired network whose learning is in-place. By in-place learning, we mean that each neuron in the network learns on its own while interacting with other neurons; there is no need for a separate learning network. Presented in this thesis is the Multi-Layer In-Place Learning Network (MILN) for regression and classification. This work concentrates on its two-layer version (without incorporating an attention selection mechanism). The network enables both unsupervised and supervised learning to occur concurrently. Within each layer, the adaptation of each neuron is nearly optimal in the sense of the least possible estimation error given the observations. MILN is intended to provide memory organization capability and to be used as a general-purpose regressor in Autonomous Mental Development.
The theory behind the network is discussed in detail. Experimental results are presented to show the effects of the properties investigated.

TABLE OF CONTENTS

List of Figures
List of Tables
1 Introduction
2 Background
  2.1 Autonomous Mental Development
    2.1.1 Drawbacks of task-specific perception
    2.1.2 Embodiment
  2.2 Link to Biological Development
3 Criteria and Evaluation of Methods
  3.1 Restrictions of a Developmental Program
  3.2 Types of Learning Algorithms
  3.3 Review of Existing In-Place Implementations
  3.4 Unsupervised and Supervised Learning Methods
4 A Developmental Perceptual Mapping
  4.1 The Sensory Mapping
  4.2 Theory of Sensory Mapping
    4.2.1 Input formulation
    4.2.2 Events
    4.2.3 Receptive fields
    4.2.4 Areas and layers
    4.2.5 Output formulation
  4.3 Architecture
5 Optimal Representation
  5.1 Introduction
  5.2 A Neuron's Input
  5.3 Position Normalization
  5.4 Amnesic Mean
  5.5 Scatter Vector
  5.6 Statistical Efficiency
    5.6.1 Efficient estimator
    5.6.2 Cramer-Rao lower bound
  5.7 Efficiency Study of Amnesic Average
6 Principal Components Analysis
  6.1 Introduction
  6.2 Principal Component
  6.3 Whitening
  6.4 Candid, Covariance-Free, Incremental PCA
    6.4.1 Deriving the first eigenvector
    6.4.2 Geometric explanation
    6.4.3 Convergence speed
    6.4.4 Deriving lower-order eigenvectors
    6.4.5 Link to developmental neuroscience
  6.5 Problems with PCA
7 Independent Component Analysis
  7.1 ICA Formulation
  7.2 ICA and Vision
8 Lobe Component Analysis
  8.1 Lobe components
  8.2 Optimality
  8.3 Lobe components for nonstationary processes
  8.4 Single-layer updating algorithm
  8.5 LCA is ICA for super-Gaussians
9 The Multi-Layer In-Place Learning Network
  9.1 Introduction of the Network
    9.1.1 Three types of projections
    9.1.2 Lateral projection
    9.1.3 Bottom-up projections
  9.2 Integrating LCA into MILN
    9.2.1 Input preprocessing
    9.2.2 Layers and weights
    9.2.3 Layer responses and weight updating
    9.2.4 Adaptive sigmoid
    9.2.5 Top-down influence
    9.2.6 Top-"K" parameters for sparse representation
    9.2.7 Network output
    9.2.8 Multiple-layer in-place learning
10 Experimental Results
  10.1 Lobe Components of Natural Images
    10.1.1 LCA comparison
    10.1.2 Topographic LCA
    10.1.3 Principal components as features
  10.2 Recognition of Handwritten Digits
    10.2.1 Limited cells
    10.2.2 Sparse coding
    10.2.3 Topography and invariance
  10.3 Regression Using Single-Dimensional Input
  10.4 Regression Using Two-Dimensional Input
    10.4.1 Approximation with a neuron grid
    10.4.2 Approximation using lobe components
    10.4.3 Concentration of limited neurons
  10.5 High-Dimensional Regression in a Robotic Application
11 Conclusions
  11.1 Concluding Remarks
A Proofs
  A.1 Proof of Equation 5.11
  A.2 Proof of Equation 5.12
BIBLIOGRAPHY

LIST OF FIGURES

4.1 Each of the four images represents a vector of equal direction, but different magnitude. There is no need to distinguish between these in a classification task.
4.2 Flow diagram of a developmental perceptual system (figure courtesy of [37]).
8.1 Three lobe components converge to the optimal representative prototype within the region of each, from whitened input. In the unsigned version of LCA, each lobe component represents both positive and negative signals (figure courtesy of [42]).
8.2 Components extracted by LCA are super-Gaussian, having a high kurtosis. The statistics of natural data, such as natural images, have been shown to be mixtures of super-Gaussian edge components (figure courtesy of [42]).
9.1 The architecture of the Multi-Layer In-Place Learning Network. A circle indicates a cell (neuron). The thick segment from each cell indicates its axon. The connection between a solid signal line and a cell indicates an excitatory connection. The connection between a dashed signal line and a cell indicates an inhibitory connection. Projection from other areas indicates excitatory or inhibitory supervision signals (e.g., excitatory attention selection).
9.2 In Hebbian learning, a neuron adapts to become sensitive to the part of the input signal that is present when it fires. In lateral inhibition, a neuron with high output suppresses the firing of other nearby neurons (figure courtesy of [39]).
10.1 Example of a natural image used in the experiment.
10.2 Lobe components from natural images (with whitening).
10.3 Number of wins per component when whitening was used.
10.4 Lobe components from natural images (without whitening).
10.5 Number of wins per component when whitening was not used.
10.6 Topographic map of lobe components from natural images.
10.7 Principal components from natural images.
10.8 100 examples of MNIST digits.
10.9 The effects of the limited number of layer-1 cells and the update of lobe components. In "Pure initialization," every weight vector in layer-1 is purely initialized by a single training sample and is not updated.
10.10 The effect of multiple responses versus the neuronal density.
10.11 The topographic map of layer-one neurons for the hand-written digit eight.
10.12 (a) The topographic map of layer-1 neurons. (b) Learned weights of the layer-2 neuron for the digit "1" class.
10.13 Using only 10 neurons to approximate the function y = sin(x)/x. The positions of neurons in the input space are displayed as the lower triangles, and the neurons positioned in the output space are displayed as the triangles on the right. A small number of neurons relative to the complexity of the function leads to large quantization errors.
10.14 Increasing the number of input-layer neurons is helpful, but y-quantization error remains.
10.15 Output smoothing helps reduce the y-quantization error, but the number of neurons in the output layer provides insufficient coverage.
10.16 Increasing the number of output-layer neurons is helpful in providing greater coverage, e.g., at the center peak of this function.
10.17 The function to be approximated.
10.18 Approximation results when the input-layer neurons are placed on a grid. In (1), the training and testing samples both lie exactly on the neuron grid, and the approximation appears flawless.
10.19 Approximation results when the input-layer neurons are placed on a grid. In (2), the accuracy of (1) is shown to be worse when testing is done in between the neuron locations. In (3), this is somewhat remedied through use of smoothing in the output space.
10.20 (4): Randomly placing input-layer neurons leads to larger possible x-quantization errors than in (5), where lobe component updating is used to represent the uniform input surface.
10.21 Relatively few neurons cannot approximate the function well over the entire input space.
10.22 A concentration of few neurons can lead to good performance in one particular area. If the other areas have not been experienced recently, there is no need to provide any resource to represent them.
10.23 The SICK PLS laser range finder is mounted on the front of the Dav robot.
10.24 The pre-attentional map is approximated by a polygon P, represented by h endpoints. The post-attentional laser threshold is given by the half-circle T. Points that lie beyond this threshold are capped to the threshold distance. Points that lie within the threshold are returned normally. Figure courtesy of [46].
10.25 Error of the trained networks on the heading direction.
10.26 Error of the trained networks on the speed.

LIST OF TABLES

10.1 Summary of all tests using the cross function.
10.2 Summary of parameters and results.

Chapter 1

Introduction

Autonomous Mental Development [41] intends to advance the AI field through the developmental approach, based on the mechanisms of perceptual and cognitive learning and development in humans from infancy to adulthood.
The program that controls the machine is called the developmental program, inspired by the genetic program that drives the growth of an intelligent human being from the single-celled zygote. This is a push away from traditional, task-specific programs in robotics. In AMD, the developmental program must be general purpose, allowing a robot to be taught new tasks that were not conceived of at programming time. This is not to say the machine's development occurs from zero capability; innate behaviors can be very useful as a starting point. But the main goal is to allow intelligent capability to be incrementally scaled from the ground up.

The developmental program must provide a broad developmental framework for sensory, cognitive and motivational growth, and thus cannot be task-specific. Additionally, it must be able to fulfill some difficult requirements, such as incremental learning given high-dimensional signals, real-time performance, performing while being taught, and adaptability to unexpected environments. Recent evidence from neuroscience suggests that developmental mechanisms in the brain might be very similar or identical across different sensing modalities [24]. Thus, the "complexity problem", an argument against a brain-based approach to AI based on the sheer number of neurons and synapses in the developed brain, might not be as much of an issue as was previously thought.

Past work on a developmental program was heavily dependent on the IHDR technique [38]. IHDR, briefly, is a general-purpose regressor that can deal with very high-dimensional input, due to its class-discriminating dimensionality-reduction capability, and perform in real time, due to its automatically imposed tree structure. However, the concept of IHDR is not totally founded in biological plausibility. In particular, the internal representation of the environment was somewhat inefficient, as IHDR stores nearly every sample, making memory an issue. Further, IHDR is not an in-place algorithm. By in-place learning [42] [40], it is meant that the signal processing network itself deals with its own adaptation through its own internal physiological mechanisms and interactions with other connected networks; thus, there is no need to have an extra network that accomplishes the learning (adaptation). This is different from most classic neural network approaches to learning, which require a mechanism that is not accounted for by neuron behaviors and connections to train the network.

This thesis is concerned with explaining the motivation, theory, and applications of an in-place network for general-purpose regression, which can fulfill the difficult requirements of AMD. The Multi-Layer In-Place Learning Network (MILN) was proposed by Weng and Luciw [40] and Weng et al. [39] for such a purpose. MILN integrates supervised learning due to teaching with unsupervised learning due to natural correlations within the input and output. Currently there is a lack of biologically inspired networks that integrate these two different learning modes using a single learning network. MILN enables unsupervised and supervised learning to take place at the same time throughout the network. It is not convincing that biological networks use two different types of networks for unsupervised and supervised learning, which occur in an intertwined way in the process of development.
When a child learns to draw, his parent can hold his hand during some periods to guide his hand movement (i.e., supervised) but leave him practicing on his own during other periods (i.e., unsupervised). Does the brain switch between two totally different networks, one for supervised moments and the other for unsupervised moments? The answer to this type of question is not clear at the current stage of knowledge. However, there is evidence that the cortex has widespread projections both bottom-up and top-down [30] (pages 99-103). For example, cells in layer 6 of V1 project back to the lateral geniculate nucleus [21] (page 533). Can projections from later cortical areas be used as supervision signals?

One of the major advantages of supervised learning is the development of certain invariant representations. Some networks have built-in (programmed-in) invariance, whether spatial, temporal or of some other signal property. Other networks do not have built-in invariance. The required global invariance then must be learned object-by-object. However, such networks cannot share invariance of subparts (or locally invariant features) across different objects. Consequently, the number of samples needed to reach the desired global invariance in object recognition is very large. The general-purpose, multi-layer network discussed herein will learn invariance from experience.

The network is biologically inspired. The network has multiple layers; later layers take the response from early layers as their input. This work concentrates on two layers. The network enables supervision from two types of projections: (a) supervision from the succeeding layer; (b) supervision from other cortical regions (e.g., as attention selection signals). The network is self-organized with unsupervised signals (input data) from bottom-up and supervised signals (motor signals, attention selection, etc.) from top-down.

From a mathematical point of view, in each layer of the network, unsupervised learning enables nodes (neurons) to generate a self-organized map that approximates the statistical distribution of the bottom-up signals (input vector space), while supervised learning adjusts the node density in such a map so that those areas in the input space that are not related (or weakly related) to the output from this layer receive no (or fewer) nodes. Therefore, more nodes in each layer will respond to output-relevant input components. This property leads to increasing invariance from one layer to the next in a multi-layer network. Finally, global invariance emerges at the last (motor) layer. The network learns to represent the input in a near-optimal way. Theoretically, the stored representations of the environment will be quasi-optimally efficient, in the sense of least estimation error from the observations. But each representation never totally converges, providing the capability to adapt to new environments.

The remainder of this thesis is organized as follows. Autonomous Mental Development (AMD) and background information are discussed in Chapter 2. Chapter 3 establishes AMD-relevant criteria for learning algorithms and discusses some existing methods with respect to those criteria. Chapter 4 explains the theory of a developmental sensory mapping, part of a proposed developmental architecture presented in [47] and [37]. This sensory mapping poses a very difficult learning problem, which the proposed network is meant to address.
Chapter 5 introduces the sensory neuron, the fundamental unit of feature detection, and discusses the theory of optimal representation of the environment. The next three chapters deal with possible ways of automatic feature detection by groups (layers) of neurons: Principal Component Analysis (Chapter 6), Independent Component Analysis (Chapter 7), and Lobe Component Analysis (Chapter 8). Chapter 9 presents the Multi-Layer In-Place Learning Network described above, which uses Lobe Component Analysis on each layer. Chapter 10 presents experimental results. Chapter 11 provides concluding remarks.

Chapter 2

Background

2.1 Autonomous Mental Development

Autonomous Mental Development, proposed by Weng et al. [41], is a new AI field emphasizing embodiment, autonomy, and a concept called task-nonspecificity. In autonomous learning, the emergence of behaviors and skills is a result of the interaction between the learner and the environment; these behaviors and skills are not necessarily hard-coded to appear. In an autonomous-learning program, the mechanisms of learning must then be entirely contained within the program itself. Tasks are learned as a combination of "innate", or preprogrammed, drives and values, and a set of algorithms designed not so that a robot learns to perform a particular task, but instead so that it interacts with the environment in such a way that task learning automatically emerges, given a necessary set of environmental constraints conducive to learning that task (such as a teacher, and a set of innate abilities that map the teacher's actions to the set of the robot's innate values).

A program that is both embodied and allows autonomous learning is called a developmental program. A developmental program contains all the mechanisms necessary for the robot to learn all tasks it might need to learn, even tasks that the programmer might not have thought of. The hope is that this concept will further influence a paradigm shift in programming robots: away from task-specific programs and towards more adaptive, learning-based programs. Task-specific programs are limited by the programmer's choice of task. No matter how sophisticated the robot's physical properties, i.e., sensors and effectors, it can never use them to do anything beyond that task. If a new task becomes necessary, a new program must be written. In AMD, each task is learned as a result of the robot's interactions with the environment, with the teacher becoming the driving force of new task emergence.

2.1.1 Drawbacks of task-specific perception

In order to interact with the complex world in a human-like way, a robot must parse complex, high-dimensional sensory input so that its (non-imposed) current task can be accomplished. The predominant method used to do this is the use of hand-designed features, such as oriented edges or colors. These features extract task-specific information and convert the complex sensory stream into a simpler feature vector.

From our perspective, there are several major drawbacks to the use of hand-designed features. First is the limitation of a robot's potential abilities through an imposed, task-specific model of the environment. A set of preselected features can lead to good performance on a preselected task, but might not be applicable to a novel task, or even the same task in a different environment.
Second, use of preselected features limits the online learning ability of a robot, as information not represented by the features will be lost from the raw signal, no matter what that signal is. If, in a novel environment, this information becomes important to behavior, it should not be thrown out.

2.1.2 Embodiment

Desirable (from a developmental programmer's point of view) properties of human perception include our ability to learn a large number of different categories and to classify an observed object despite a limited range of variations (e.g., size, pose, lighting, etc.). In traditional computer vision, these are hard problems, with no current program performing at a human level. The theory of AMD posits that invariant categories must be learned through real-time, online interactions with the environment. Researchers such as Edelman [10] and Pfeifer [29] also view this problem as one of learning sensory-motor correlations. In an interesting recent result [23], it was shown that a mobile robot learned to detect invariant object classes from continuous visual input, primarily because of the robot's self-motion. The slow-changing stream of input was also shown to be essential to developing the invariant representations. As for the large number of categories learnable by humans, it is possible this is attributable to the hierarchical nature of object detection [15], in that most objects can be thought of as a collection of parts.

2.2 Link to Biological Development

AMD establishes a link between developmental robotics and biological development through the concept of the developmental program, which is analogous to the genome. In humans and animals, development starts within a single cell (called the zygote) containing the genome of the organism. All physical and mental development thereafter is dependent on the genome's coding. The emergence of behaviors and skills is a process driven by the genes. At latest estimate, the human genome contains 20,000 to 25,000 genes [8]. Compare that to the estimated 10^11 neurons and 10^14 synapses in the central nervous system of a mature adult. The discrepancy in number between the two estimates implies that the operations done by many neurons are common.

The hypothesis (strongly put forth in [28]) of the input-driven self-organization of the cortical areas of the brain is that fully developed neurons will respond to different stimuli because of adaptation to different types of input over time. The operations done by each neuron are the same, but the input is different. An experiment [24] that came very close to testing exactly this was done on a newborn ferret. Input from the visual sensors was redirected to the auditory cortex, and input from the auditory areas was disconnected. It was found, after some development, that what would have been the auditory cortex in a normal ferret had altered its representations to function as the visual cortex for the "rewired" ferret. Thus, it is thought that the categorization and recognition ability of the different cortical areas develops as a result of the same mechanism throughout. Many neurons in the first layer of the visual cortex (known as V1) develop sensitivity to edges at preferred orientations [16]. The neurons in V2, which takes afferent input from V1, become sensitive to more complex shapes, such as corners [15].
By the hypothesis of input-driven self-organization, neurons in both these areas are operationally identical, i.e., if their positions were switched before stimulation, they would learn to detect features similar to other neurons in their new areas. If this is the case, then the same self-organizing learning mechanism is active in both areas.

Feature detection by cortical neurons is not only a function of their afferent inputs; it is known that their competitive interactions (e.g., lateral inhibition) with one another play a role. What could neurons be competing for? It is possible that they compete for stimulation. A non-stimulated neuron will be induced to destroy itself through a process called apoptosis [9], so there is, at least on the surface, a motivation for neurons to compete.

An aim in computational neuroscience is to understand and to model the mechanisms of learning at the neural level. This is also a goal for AMD, as it may lead to understanding of how the human perceptual system, with its properties of automatic derivation of many object classes and invariant detection of these classes, develops.

Chapter 3

Criteria and Evaluation of Methods

This chapter discusses some of the challenges of usable programs in AMD, presents a set of criteria (from [42] and [40]) for evaluating algorithms for use in AMD, and reviews a set of algorithms with respect to those criteria.

3.1 Restrictions of a Developmental Program

- On-line Operation. In AMD, learning should be done sample-by-sample. Samples arrive in a high-dimensional continuous stream, and cannot be stored for very long.

- Real-Time Operation. For each sample, learning must occur within a limited time period. A frame rate (e.g., 30 frames per second) can be specified. Any algorithm must finish processing each frame before the next arrives.

- High-Dimensional Inputs and Outputs. A developmental robot must have high-dimensional visual sensors, and complex effectors are necessary to manipulate most objects. Thus, any learning algorithm must be able to handle (converge in, run at all on, etc.) high-dimensional data spaces. This also has an impact on the previous criteria, since usually, more dimensions mean that more computations must be done.

- Plasticity. A non-plastic algorithm has a learning rate that approaches zero as time goes to infinity. These algorithms cannot adapt to different environments over time, or even to variations in the same environment, e.g., different lighting conditions in the same room.

- Autonomy. If an agent is to be truly autonomous - a self-directed learner - the mechanisms of learning must be entirely contained within the program itself. There is no off-line tuning of the program. Any necessary distinction between a training phase and a testing phase, and any changing of program parameters, must occur in response to sensory input.

3.2 Types of Learning Algorithms

Traditionally, a learning algorithm is called online when it discards each sample after it is observed. An online algorithm learns after each sample, starting with just a single sample. All other algorithms have traditionally been labeled as offline, and learn only once, after the point when all samples have been collected. For purposes of developmental robotics and AMD, a finer classification system is necessary. Any algorithm that requires all samples be available before learning is useless to us, due to the on-line operation constraint.
But if an algorithm only requires a certain number (a block) of samples before updating, and can update more than once, it might be usable, depending on memory constraints. Algorithms that do not need to store any input samples at all are even better, since sensation is incremental as well as high-dimensional. But because of these high-dimensional sensors, higher-order statistics such as the covariance matrix become infeasible to estimate due to memory limits (and time).

As discussed previously, we desire algorithms for AMD to perform in-place neuron-learning. In addition to the purely practical limits described above, in-place algorithms are biologically founded. AMD is not about solving an engineering problem only. It also seeks to test and validate embodied models of the biological development of intelligence using robot platforms. With an in-place algorithm, learning occurs as a side effect of low-level unit behaviors and competitive interactions. There is no guiding developmental mechanism overseen by a separate learning network, such as minimizing error rate in gradient-based backpropagation learning, unless it is biologically founded (i.e., using rewards from something similar to the dopaminergic system).

To classify the suitability of learning algorithms for AMD, we define five types:

Type-1 batch: A batch learning algorithm L1 computes g and w using a batch of vector inputs B = {x_1, x_2, ..., x_b}, where b is the batch size.

Type-2 block-incremental: A Type-2 learning algorithm, L2, breaks a series of input vectors into blocks of a certain size b (b > 1) and computes updates incrementally between blocks. Within each block, the processing by L2 is in a batch fashion.

Type-3 incremental: Type-3 is the extreme case of Type-2, in the sense that block size b = 1.

Type-4 covariance-free incremental: A Type-4 learning algorithm L4 is a Type-3 algorithm, but it is not allowed to compute the 2nd- or higher-order statistics of the input x. The CCIPCA algorithm [43] is a Type-4 algorithm.

Type-5 in-place neuron learning: A Type-5 learning algorithm L5 is a Type-4 algorithm, but, further, the learner L5 must be implemented by the signal processing neuron. Learning is a side effect of the neuron interactions within the system.

It is desirable that a developmental system use an in-place developmental program, due to its simplicity and biological plausibility. Further, biological in-place learning mechanisms can facilitate our understanding of biological systems, since there is no evidence that each biological network has a separate network to handle its learning. The five types of algorithms have progressively more restrictive conditions, with batch (Type-1) being the most general and in-place (Type-5) being the most restrictive. For an algorithm to be useful in AMD, it must at least be Type-3, due to the requirement of real-time, online and incremental processing. Due to the high dimensionality of the inputs, a Type-4 algorithm is necessary, as storage of the higher-order statistics becomes impractical with high-dimensional input. Our goal is to build a system that is usable to fulfill the requirements of AMD through functional biological plausibility; thus we aim for a Type-5, in-place, implementation.
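To make the distinction between types concrete, the following sketch contrasts a Type-1 and a Type-3 estimate of the mean of an input stream (a minimal illustration in Python; the function names are mine, not from the thesis):

    import numpy as np

    def type1_batch_mean(samples):
        # Type-1: the whole batch must be stored before learning begins.
        return np.mean(np.stack(samples), axis=0)

    def type3_incremental_mean(mean, x, t):
        # Type-3: block size b = 1. The estimate is refined one sample
        # at a time, and x can be discarded immediately afterwards.
        return ((t - 1) / t) * mean + (1.0 / t) * x

Both compute the same value, but only the incremental version respects the on-line operation and memory restrictions listed in Section 3.1.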
3.3 Review of Existing In-Place Implementations

Recent research involving Type-5 implementations includes the development of computational maps similar to those found in the visual cortex using the LISSOM technique [28], the Dynamic Field Approach [11], and the "brain-based" architecture used in the Darwin VII robot [23]. Our work differs from the others in that the goal is Autonomous Mental Development realized in a developmental robot. The purpose of [28] is a programmed, biologically plausible model of vision development. Thus, real-time, online robot performance is not their concern. The Dynamic Field Approach is founded on a very broad idea - modeling the "dynamics of development" through layered neural systems - but so far the work has not been realized in a robotics application. The Darwin VII's architecture is specialized to fulfill the goal of the study: a real-time system where the robot learns invariant object categories from sensory-motor experience in a controlled "block-world" environment. So, the preprocessing of the visual stream was done in a task-specific way, by detecting only horizontal lines, vertical lines or blobs. All the experiments were done using striped blocks. It is probable that the same result would not have occurred if non-striped, or perhaps circular, objects were used, but, of course, that was not their goal. However, in AMD, we cannot use a task-specific early visual representation.

3.4 Unsupervised and Supervised Learning Methods

Well-known unsupervised learning algorithms include the Self-Organizing Map (SOM) [22], vector quantization, PCA [20], Independent Component Analysis (ICA) [7], Isomap [33], and Non-negative Matrix Factorization (NMF) [25]. Only a few of these algorithms have been expressed in in-place versions (e.g., SOM and PCA [43] - and SOM does not take into account statistical efficiency, making it difficult to tune when given high-dimensional input). Supervised learning networks include feed-forward networks with back-propagation learning (e.g., [45]), radial-basis functions with iterative model fitting (based on gradient or similar principles) (e.g., [31]), the Cascade-Correlation Learning Architecture [12], support vector machines (SVM) [6], and Hierarchical Discriminant Regression (HDR) [17]. None of these have been expressed in in-place versions.

Chapter 4

A Developmental Perceptual Mapping

4.1 The Sensory Mapping

The purpose of what is named the sensory mapping by Weng et al. [47] [37] is threefold. First, to automatically derive representative features and classes from the experiences of an agent. Second, to provide a source for learning attention selection - learning which part of the image is important for the task at hand. Third, to provide a target for attention itself. The MILN network is meant to address these problems.

It is known that neurons in the early visual pathway detect light at different places on the retina. The area of activity per neuron is called its receptive field. These receptive fields are very small in the early areas, such as V1, thus providing a sparse coding of the image for later areas. Questions that pertain to a developmental sensory mapping for AMD are how these receptive fields are developed, what the features detected in early vision are, and how this organization can help coordinate our attention selection capability.

4.2 Theory of Sensory Mapping

Here, the sensory mapping is formally defined.
As such, the formulation of the sensory map- ping will be presented here with images containing the information to be perceived. The following defines the sensory mapping’s input space, originally from [37]. A digital, grayscale image contains a set of pixels organized into r rows and c columns. At each time t, an image is provided, in the following form: [- - $1.10) 1”120) 171.c(t) X(t)= 932,10) $2.20) firm-(t) (4.1) _:r.,.’1(t) $r,2(t) mnca)‘ Any image in matrix form must be mapped onto vector space for further processing. Each image matrix X is mapped to a vector :1: with d elements: at) = (171(t),:1t2(t), ...,;rd$(t))T (4.2) By concatenating each row onto the end of the previous row, so that x(j—1)C+l(t) : Xi,j(t),'l=1,2,...,T,j:1,2,...,c (4.3) Then the length of the image vector :r(t) gives the dimensionality dz = rc. Real-world visual input is given as a video stream. A spatiotemporal input vector includes the last h input frames, where h is the length of the short term memory. The spatiotemporal input vector y( t) contains 5 elements: W) = (y1(t)ay2(t), ydy(t)) (4.4) This vector is composed by concatenating the vector of each frame in order from the most recent through the last h: 15 y(k-1)d+i(t) 2 11‘1“ — k +1),Z=1,2,...,d,k =1,2,...,h (4.5) The length (1,, of this vector is the number of pixels from the last it image frames: dy = dxh = rch. The images may have multiple components per pixel. E.g. in RGB format, each pixel is associated with a different intensity value for red, green, and blue components. There are other such formats, like YUV, where each pixel is composed of separate (usually three) components. In these cases, each component of each pixel becomes a single value in X. It is useful (and necessary, if receptive fields are to be used) to place the components of each pixel in the same row and in three subsequent columns. Note that the choice of concatenating row by row in :10 versus column by column is immaterial, as long as it is done that way consistently, since the possible correlation between every possible pair of elements within y will be considered. Each vector y gives a spatiotemporal input to the sensory mapping. The space of all possible values for y is given by y = {y(t) I t = 0,1,...,oo} (4.6) the space 32 gives the input space of the sensory mapping. The sensory mapping component of a developmental architecture is a function szH—MC (4-7) where [I is the output space of the sensory mapping, and the mapping itself, T, is developed through experience. 4.2.2 Events Learning the function 7 depends on the existence of repeated events within 3?. l6 An event is a set of correlations among a set of dimensions of the input space. The structure of these may be primarily spatial or spatial with a significant temporal compo- nent. An event that occurs often should be learned to be detected by T. To do this, a number of different event—detection units (which are commonly called many things: fil- ters, feature detectors, neurons, and lobe components) are used. Each one of these has a set of weights for all dimensions of the input, giving a weight vector 212. Each weight vector corresponds to an event which can be detected within the input y by use of vector ‘6 9’ . inner product (or dot product). This operation is denoted by the symbol, and is done as follows y(t) - w(t) = E.g.-(trait) (4.8) i=1 From now on, it will be useful to discuss comparisons between weights and inputs geometrically. 
From now on, it will be useful to discuss comparisons between weights and inputs geometrically. In the last section, the input vector was introduced as a set of components, each of which corresponds to a dimension. In a Euclidean coordinate space, the vector is best visualized as a line with a certain direction and length (referred to as magnitude), emanating from the origin point. The inner product operation takes the product of the magnitudes of the two vectors and the cosine of the angle θ between them, as follows:

y(t) \cdot w(t) = \|y(t)\| \, \|w(t)\| \cos(\theta)    (4.9)

The cosine of the angle gives a larger value when the angle between the two vectors is near 0° or 180°, and converges to zero as the angle goes to ±90°. Since we take the absolute value of the input, for each neuron, the highest result from the inner product operation occurs when θ is smallest, i.e., when the vector directions are most similar. The direction of the weight vector gives the sensitivity of the neuron, as it determines the components in the input space where, when they are high, the inner product operation gives the highest value. The length of a vector can be normalized, i.e., set equal to one, by dividing each element by its length: w' = w / \|w\|. Normally, w is normalized before computing the inner product. In order to find the best-matching neuron to an input stimulus, the responses of neurons should be based on their direction, and not their length. This has to do with competition, and will be explained in more detail in later sections. In any case, the inner product operation is the crucial component in determining what is called the response of each neuron, and this response will be high when w and y are similar in direction.

Figure 4.1: Each of the four images represents a vector of equal direction, but different magnitude. There is no need to distinguish between these in a classification task.

Sensitivity is better defined using vector direction instead of Euclidean distance. Events with the same direction, but with different magnitudes, should be classified as the same, as in Fig. 4.1.

A fundamental problem is how to handle variations of similar events within y. The same set of relative correlations could occur at two different spatial positions within the visual plane. The system should recognize this event as the same in both cases. For example, consider one of the images in Fig. 4.1. If the digit within that image is translated several pixels to the left, then an event-detector with sensitivity based on the originally placed digit will not respond as highly to the translated image. A system with translation-invariant detectors would respond the same in both cases.

There is another purpose of the sensory mapping besides recognition: providing a source and target for attention. An agent that interacts with the environment must know the positions of objects. So it is not enough to just detect that something is within the image, as in some probabilistic models of visual object recognition. The location (which in some cases may need to be very precise) of the event is also required. How can many different areas within the image plane be represented internally, so that the input can be automatically segmented? To handle the above issues, a receptive field architecture is used.
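The role of direction in the response can be checked with a small sketch (same assumptions as the previous listing; each row of W is one neuron's weight vector):

    import numpy as np

    def responses(y, W):
        # Eq. (4.9) with each weight vector normalized to unit length:
        # the response depends on the input and on the angle to each w,
        # not on the stored length of any w.
        W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
        return W_hat @ y

    # Scaling the input (the four digits of Fig. 4.1 differ only in
    # magnitude) scales every response equally, so the best-matching
    # neuron is unchanged.
    W = np.random.rand(5, 18)
    y = np.random.rand(18)
    assert np.argmax(responses(y, W)) == np.argmax(responses(3.0 * y, W))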
4.2.3 Receptive fields

In the appearance-based approach to recognition, feature extraction, selection and detection are done using the entire input vector. In the input formulation introduced in Section 4.2.1, the space Y contained the entirety of each image from each time. If features are developed that use this entire vector, invariant detection becomes computationally heavy (a different neuron is needed for every possible variation of the feature) and fundamentally flawed. Additionally, using this (lack of an) architecture to learn attention selection is problematic. This basic approach is monolithic. If the program input is a set of same-size objects, each at the exact same position within the image, the features extracted from the input will only be useful for objects at that exact position. The receptive field approach is nonmonolithic - object recognition is eventually handled by detection of subparts; e.g., a face can be recognized as two eyes, a nose, and a mouth in a certain relative configuration.

The receptive field of each neuron is defined as the set of dimensions in y for which the neuron has a nonzero weight. For convenience, "nonzero" can be taken to mean "significant", so that dimensions associated with very small weights, which will have no bearing on the developed filters, can be considered to be outside of the receptive field. Receptive fields can be hardcoded, developed, or a combination of both. In the last case, each neuron is given an initial receptive field, and the nonzero weights will be further pruned during development. The following parameters define all receptive fields:

1. The number of different receptive field locations n_r, as well as the total number of sensory neurons n_s. Note that these two parameters are not necessarily equal - multiple neurons may have the same receptive field. For each neuron, the following four parameters must be set:

2. The position or center of the field. This is defined as the center of the field in the 3D input space - i and j give the spatial center, k gives the temporal center.

3. The shape of the field's coverage over the 3D input space. For example, cubic, cylindrical, spherical, multiple cylinders (disconnected), or a combination of several different shapes.

4. The size of the field. The parameterized shape function is s(θ), where s is the shape function chosen, and size is determined by θ (which may represent multiple parameters, such as radius and height for a cylinder).

5. The cortical area of the neuron(s) associated with the field. There are L areas A_1, A_2, ..., A_L. Afferent input (meaning: not considering the lateral or feedback inputs) to neurons on A_1 is directly from the input space. Neurons in higher-numbered areas take direct afferent input from the outputs of the previous area. However, the classical receptive field of higher-area neurons is still defined in terms of coverage over the input space. See Section 4.2.4 for more discussion.

The classical receptive field of each neuron is defined as Y = S(s(θ), i, j, k), where s(θ) is the parameterized shape function defining what relative inputs to include, and i, j, k determine the center of that shape in the 3D input space. These parameters define all starting receptive fields in the system. They should be further developed so that the system will converge to some quasi-optimal organization of receptive fields for the agent and environment. Exactly how to do this for all five parameters is still an open problem, although automatic development in terms of size, position, and shape was shown in [48].
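For illustration, a square, single-frame shape function under these definitions could be realized as follows (a hypothetical sketch; the function and parameter names are mine, and the temporal center k is fixed at the current frame):

    import numpy as np

    def receptive_field_input(X, i, j, radius):
        # A square shape function s(theta) with theta = radius: the field
        # covers a (2 * radius + 1)^2 patch centered at pixel (i, j),
        # clipped at the image border.
        r0, r1 = max(0, i - radius), min(X.shape[0], i + radius + 1)
        c0, c1 = max(0, j - radius), min(X.shape[1], j + radius + 1)
        return X[r0:r1, c0:c1].flatten()

    # A neuron with this field sees only its patch, never the whole image.
    patch = receptive_field_input(np.random.rand(32, 32), i=16, j=16, radius=3)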
The 20 most effective set of receptive fields, in terms of possible attended locations, is one for every possible combination of dimensions. Due to time and memory restrictions and computational load, that is probably not feasible. The most efficient set of receptive fields is just one, at a single pixel. This provides very poor coverage as well as very poor features, obviously. The issue is one of segmentation: if an object is positioned where it is bisected by two (or more) receptive fields, the recognition will be worse than in the case where it is centered. That is why an overgrowth of receptive field locations may be necessary, although a saccadic-like attention mechanism may be able to center any possible objects. 4.2.4 Areas and layers According to the well-known ”curse of dimensionality”, it requires exponentially more samples to cover a higher-dimensional input space than one of lower-dimension. There- fore, convergence of learning algorithms in higher-dimensional space is exponentially slower, and requires exponentially more samples. If the ”initial guess” for the algorithm is in a poorly populated area, convergence may not even occur (a problem as real-world input spaces are usually sparsely populated). But a useful input space for a robot that uses vision must be very high-dimensional. If we use a camera, which gives a relatively small 50 x 50 image, our input dimensionality is 50 * 50 = 2500. If we want binocular vision capability, we’ll need two cameras, and the dimension will double to 5000 pixels. Important events might take up a large area of the input space, or cover many pixels. If an event covers just a tenth of all pixels, features will have to converge in a 500-d subspace, containing 256500 possible sub—images. We will need receptive fields over large areas of the input space to extract and detect features for these large events. But because of the curse of dimensionality, extracting features using all dimensions for this size receptive field may be too slow, or might not work at all. To effectively derive features for larger receptive fields, we will organize feature detec- 21 tors hierarchically into areas. Neurons from the first area perform feature extraction and detection, and thus dimensionality reduction, using lower-dimensional receptive fields and send the dimension-reduced signals for the next area to use as input. A neuron in a higher area will have a receptive field equal to the union of the receptive field elements of the connected neurons in the lowest area. There are ph (which stands for ”pyramid height”) areas A1, A2, ..., APh‘ The afferent inputs of neurons in area A,,i > 1 are from the outputs of the neurons in area A,. 1. The inputs to neurons on A1 are from the external environment. The assumption we make by using a hierarchical architecture is that all events over a large area can be represented as a set of sub-events, each over a smaller area. E.g., a contour is detected by first detecting a set of smaller edges, arranged in a particular way. There is ample evidence that this is the case - the human visual system is also organized hierarchically - but there are a few issues we must handle to get this working in practice. 1. Timing. Operations in area A,- cannot start until area A,_1 is done since the input of A,- is the output of A,_1. However, area Ai_1 can start working on new input while area A,- is working. This is the same as the notion of pipelining in computer architecture. 
To maximize throughput, we will allow each area to begin processing the next input as soon as it (both the input and the area) is available. This introduces a delay into the system. The outputs of area A_{i-1} will be from one frame ahead of those of area A_i.

2. Possible information loss. Features on each level must derive representations using only (the responses of) the previous area's features, not the direct input. Any information lost after features are detected in the input space can't be recovered. The next area considers an input space using only those features. We want to have a good array of features so that the features extracted by the next area are useful to us. If we extract only one feature at each location, we might be missing out on lower-order correlations within the same field that may be useful further on. We therefore introduce the concept of layers. Each area contains pd (standing for "pyramid depth") layers L_1, L_2, ..., L_{pd}. This means there are pd feature detectors in an area with receptive fields centered at the same location. Using multiple layers allows different features to be extracted from the same set of inputs. Ideally, the optimal value for pd should be automatically derived.

3. Speed. The system is meant to be run on a developmental robot in real time. If the processing is at too low of a frame rate, it will not be very useful. Without predicted attention, the system probably will be too slow to use.

4.2.5 Output formulation

In Section 4.2.1, the input space Y of the sensory mapping was defined. Originally from [37], the output space L is defined here. Every neuron can be uniquely identified by its receptive field center i, j, k, area a, and layer within each area l. Each neuron v_{i,j,k,l,a} can be thought of as a function that takes input u_{i,j,k,l,a} and produces output z_{i,j,k,l,a}. For simplicity, all neurons can be indexed by a single value n, so that

v_n = v_{i,j,k,l,a}, \quad n = 1, 2, \ldots, n_s    (4.10)

From the vector of inputs at time t, each neuron produces output

z_n(t) = v_n(u_n(t))    (4.11)

The output of all neurons in the sensory mapping can be represented by z(t):

z(t) = (z_1(t), z_2(t), \ldots, z_{n_s}(t))    (4.12)

All possible outputs give the output space, and the data manifold L in this space is given by the value of z at every time:

L = \{ z(t) \mid t = 1, 2, \ldots \}    (4.13)

The sensory mapping is a function T that takes input y(t) and produces output z(t):

z(t) = T(y(t), a(t))    (4.14)

T automatically develops over time, and is influenced by the attention vector a(t), described next.
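Under these definitions, one pass of the mapping reduces to a few lines (a hypothetical sketch: each neuron is modeled as a pair of receptive-field indices and a weight vector, and only first-area neurons reading directly from y are shown):

    import numpy as np

    def sensory_mapping_output(neurons, y):
        # Eqs. (4.10)-(4.12): neuron n reads its input u_n, the slice of
        # y fixed by its receptive field, and all outputs are collected
        # into the vector z(t).
        return np.array([np.dot(y[idx], w) for idx, w in neurons])

    # Two toy neurons over a 6-dimensional input.
    y = np.random.rand(6)
    neurons = [(np.array([0, 1, 2]), np.ones(3)),
               (np.array([3, 4, 5]), np.ones(3))]
    z = sensory_mapping_output(neurons, y)   # z.shape == (2,)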
A regressor maps previous attended loca- tions and features detected within to the next attended locations. The core assumption we make is that the positions of known objects over time can be predicted after having seen enough examples of that object’s movement (if it has any) or having experienced how the agent’s self-movement influences the visual position of objects. The attention vector a(t), 24 To motors . M0 or Subsumptron mapping Top-down attention control Innate Cognitive (inhibits responses behavrors mapping from black cells) 4 Delayed A4 MAN ourth area A4' 1,...— F ( j I“? 310;:(1M, s- 0 «one “\“i Delayed A3 11'» ‘I"§§,"§s writers.~~':~'.-;:~\\ “in, are (M) J» a» a9; .39 fi_, 3 ‘ #Ilyfllyfl \IoYQiY’ii- ' Mi’m’.’ MR"! Sensory ‘ ‘w‘u‘ j'\\s;«\\‘.\‘\".\“sj‘.\~s.c f'\~‘s"‘,\w;’.¢\“- ' Dela ed A2 llail"ls,"s,’«1"83’38 ttli'ltli'cliitlg“. "“9"" Se area (A2) L.’ * 9 *o$o$p¢e‘o*o , ——> ' )M‘ly‘eel(1.1.0.1131! \g.:§,\;."ll‘\."g\x’l\;.’\4\;\.:lt'\t’\l. \ 9‘1 1 ‘I ‘ Delayed Al this: Elfii‘; 3". "3%.“: .‘;\\\.‘§\ first area (A 1) ..ro- —> Delayed retina Retina 1““ 'n" ‘0‘.“ _»iu iiiit iris iii, Figure 4.2: Flow diagram of a developmental perceptual system (figure courtesy of [37]). originating from outside the sensory mapping, inhibits the responses from most neurons at each time. In it’s simplest form, the attention vector is a binary vector with dimension equal to the number of receptive field centers. Each component is coupled with a recep- tive field center. A 1 in the attention vector means that the receptive fields at that location would be attended. The rest (corresponding with 0 values) would be ignored. Figure 4.2 shows a flow diagram of a developmental vision system. The vision system is separated into three mappings: sensory, cognitive, and motor, along with a block for innate behaviors. The retina and delayed retina give the set of input pixels (the input vec- tor from y). The black and white circles represent the many feature detectors in different areas and layers. They are connected, either directly or indirectly (through lower num- bered areas), to the input. The cognitive mapping [36] learns a function from input z(t) 25 (the sensory output vector) to output a(t). Mimicking a saccadic mechanism, attention is applied by the motor area. White circles in the figure indicate receptive fields within an attended area, while dark circles indicate receptive fields that are not attended. The lower-level innate behaviors are in themselves a complete sensorimotor loop, and give some predefined behaviors for when the higher-level system is foundering. For example, if no objects have been found, perform a random window search. The subsumption (in- troduced in [4]) mediates the action produced by this module and the action produced by the cognitive mapping, and only allows one (with the highest confidence with preference given to the higher level) to be applied. It should be noted that a the input to this structure doesn’t have to take only visual input. It will work the same given input from a set of any sensory modalities. 26 Chapter 5 Optimal Representation 5.1 Introduction The next few chapters of this thesis concern the activities of the system at the neuron level. This section is about neural adaptation mechanisms without considering competition with other neurons. In feature extraction, the goal is to find and best represent the important sections on the input manifold, i.e., the parts with significant correlation among different signal lines. 
Chapter 5
Optimal Representation

5.1 Introduction

The next few chapters of this thesis concern the activities of the system at the neuron level. This chapter is about neural adaptation mechanisms, without considering competition with other neurons. In feature extraction, the goal is to find and best represent the important sections of the input manifold, i.e., the parts with significant correlation among different signal lines. In AMD, the input space is high-dimensional, and the input to each neuron will contain many signals. The following sections detail a set of operations each neuron can perform on its input, over time, to make it more likely that the features that do exist are detected in each neuron's input space. Per the restrictions of development, our goal is to find ways each operation can be done incrementally, without use of a separate storage area or a neuron developer outside of the neuron itself.

5.2 A Neuron's Input

A neuron's input comes from a set of $d$ signal lines. Activity on these lines is sampled at time intervals $t = 1, 2, \ldots$. The input to the neuron itself is the resulting sample vector $x(t)$, which has dimension $d$. The neuron does not have any information about how the values of $x(t)$ relate to the (usually very large) world-specific input space. Modeling such world-specific knowledge is not necessary: an agent's sensors contain enough information for it to learn to act appropriately, without explicit knowledge of how the sensory information relates to the environment [5].

5.3 Position Normalization

Having no absolute knowledge, each neuron must compute everything relative to what it has previously experienced. Position normalization places the base of each neuron's future experience at the location of the sample mean. All future samples are then evaluated by their position relative to the sample mean. Essentially, position normalization moves the origin to an "important", highly populated area of the vast, sparsely populated input space. The neuron is then closer to any features that may exist, and convergence will be more likely and faster. An in-place way to compute the mean is the amnesic averaging technique, introduced in [43] and discussed here. The amnesic average is so named because it "forgets" old observations. This is useful for an agent experiencing multiple environments: without forgetting, only the early environments experienced would be represented.

5.4 Amnesic Mean

When samples generated by a new process are observed, samples from previous processes should be forgotten. But there is no way to know a priori when a new process will begin. A way to do this forgetting with no prior knowledge is to increase the learning rate as time increases. The amnesic mean [43] allows such forgetting. The incremental, non-amnesic mean is defined as

$\bar{x}(t) = \frac{t-1}{t}\,\bar{x}(t-1) + \frac{1}{t}\,x_t$   (5.1)

where $x_t$ is the current observation and $\bar{x}(t-1)$ is the previously stored average. The following equations, originally from [42], increase the weight of new samples by adding the value returned by the amnesic function $\mu(t)$, and define the amnesic mean. Add $\mu(t)$ to the weight of the new sample:

$w_2 = \frac{1 + \mu(t)}{t}$   (5.2)

and subtract the same value from the weight of the previous average:

$w_1 = \frac{t - 1 - \mu(t)}{t}$   (5.3)

Since the same value that is added to $w_2$ is subtracted from $w_1$, the two weights still sum to 1:

$\frac{t - 1 - \mu(t)}{t} + \frac{1 + \mu(t)}{t} = 1$   (5.4)

The amnesic averaging equation, given the previous average $\bar{x}(t-1)$ and the new sample $x_t$, is as follows:

$\bar{x}(t) = w_1\,\bar{x}(t-1) + w_2\,x_t$   (5.5)

The function $\mu(t)$ is characterized by two "switch-points", $t_1$ and $t_2$. The amnesic mean is the same as the incremental mean, i.e., $\mu(t) = 0$, from $t = 0$ until switch-point $t_1$. From $t_1$ until $t_2$ the learning rate increases linearly until $\mu(t) = c$. After $t_2$, $\mu(t)$ increases as a function of time and the parameter $r$:

$\mu(t) = \begin{cases} 0 & \text{if } t \le t_1, \\ c\,(t - t_1)/(t_2 - t_1) & \text{if } t_1 < t \le t_2, \\ c + (t - t_2)/r & \text{if } t_2 < t \end{cases}$   (5.6)
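A minimal sketch of the update in Eqs. (5.2) through (5.6) (Python/NumPy; the switch-points and the constants c and r are free parameters, chosen here only for illustration):

```python
import numpy as np

def mu(t, t1=20, t2=200, c=2.0, r=2000.0):
    """Amnesic function with two switch-points; Eq. (5.6)."""
    if t <= t1:
        return 0.0
    if t <= t2:
        return c * (t - t1) / (t2 - t1)
    return c + (t - t2) / r

def amnesic_mean_update(mean, x, t):
    """One step of the amnesic mean; Eqs. (5.2), (5.3), (5.5)."""
    m = mu(t)
    w1 = (t - 1 - m) / t   # weight of the stored average
    w2 = (1 + m) / t       # boosted weight of the new sample
    return w1 * mean + w2 * x

# Track a source whose mean shifts from 0 to 5 halfway through.
rng = np.random.default_rng(0)
mean = np.zeros(1)
for t in range(1, 2001):
    x = rng.normal(0.0 if t <= 1000 else 5.0, 1.0, size=1)
    mean = amnesic_mean_update(mean, x, t)
print(mean)  # approaches 5: old observations have been "forgotten"
```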
5.6.2 Cramer-Rao lower bound

For an unbiased estimator $\hat{\Theta}$ of a scalar parameter $\theta$, computed from $n$ observations of a random variable $x$ with probability density $p(x, \theta)$, the Cramer-Rao lower bound states that

$E\,\|\hat{\Theta} - \theta\|^2 \;\ge\; \frac{1}{n \int_{-\infty}^{\infty} \left[\frac{\partial \log p(x, \theta)}{\partial \theta}\right]^2 p(x, \theta)\,dx}$   (5.9)

The Cramer-Rao bound is a useful tool for evaluating the efficiency of an unbiased estimator. More efficient estimators have expected square error closer to the bound. If the expected square error is equal to the bound, then the estimator is by definition the most efficient estimator of the parameter. It can be proven that the sample mean, as an estimator of the population mean, has mean square error equal to the Cramer-Rao bound if the p.d.f. has a known standard deviation $\sigma$ and is Gaussian or exponential [27]. Thus, the sample mean is the most efficient estimator of the population mean for these distributions.

5.7 Efficiency Study of the Amnesic Average

It must first be shown that the amnesic mean is an unbiased estimator of the population mean. If this is the case, its convergence to the population mean is guaranteed. First, note that the amnesic mean is a weighted sum of all $n$ samples, for any $n$:

$\bar{x}(n) = \sum_{t=1}^{n} w_t(n)\,x_t$   (5.10)

It can be proven by induction on $n$ that the following (originally provided in [42]) gives the weight of any sample $t = 1, 2, \ldots, n$, for any $n$:

$w_t(n) = \frac{1 + \mu(t)}{t} \prod_{j=t+1}^{n} \frac{j - 1 - \mu(j)}{j}$   (5.11)

Each weight is nonnegative, $w_t(n) \ge 0$, and all weights sum to 1 for any $n$ (the following was also originally provided in [42]):

$\sum_{t=1}^{n} w_t(n) = 1$   (5.12)

When $n = 1$ the product in equation 5.11 is empty and $w_1(1) = 1$, since $\mu(1) = 0$. This relation can be proved by induction. The two induction proofs for equations 5.11 and 5.12 were left as an exercise; I have done them, and they are contained in Appendix A.

If the samples $x_t$ are independently and identically distributed (i.i.d.) from a random variable $x$, then the amnesic mean is unbiased. To prove this, $E\bar{x}(n) = Ex$ must be shown. The proof, originally provided in [42], is as follows. First, substitute using equation 5.10:

$E\bar{x}(n) = E \sum_{t=1}^{n} w_t(n)\,x_t$   (5.13)

Each weight is a constant, so the expectation moves inside the summation:

$E \sum_{t=1}^{n} w_t(n)\,x_t = \sum_{t=1}^{n} w_t(n)\,E x_t$   (5.14)

Now substitute using equation 5.12. Since the expected value of a sample that is i.i.d. from a random variable equals the expected value of that variable, substitute $Ex$ for $Ex_t$, and the proof is complete:

$\sum_{t=1}^{n} w_t(n)\,E x_t = 1 \cdot Ex = Ex$   (5.15)

So the amnesic mean is unbiased. Now the efficiency is evaluated. It is expected that the efficiency of the amnesic mean is less than that of the sample mean, and that is the case. To compare the efficiency of the amnesic mean with that of the sample mean, we derive an error coefficient for each. First note that the variance of an unbiased estimator equals its expected mean square error: $\mathrm{Var}(\bar{x}(n)) = E\|\bar{x}(n) - \tilde{x}\|^2$, where $\tilde{x}$ is the population mean. Assume the random variable $x$ has known variance $\sigma^2$. Then

$\mathrm{Var}(\bar{x}(n)) = \mathrm{Var}\!\left(\sum_{t=1}^{n} w_t(n)\,x_t\right) = \sum_{t=1}^{n} w_t^2(n)\,\mathrm{Var}(x_t) = \sum_{t=1}^{n} w_t^2(n)\,\sigma^2$

gives the expected mean square error of the amnesic mean. The rule $\mathrm{Var}(c\,x) = c^2\,\mathrm{Var}(x)$, where $c$ is a constant and $x$ is a random variable, together with the independence of the samples, is used to move from the first to the second step; the rest is substitution. The expected mean square error of the sample mean, obtained when $\mu(t) = 0$, is

$\mathrm{Var}(\bar{x}(n)) = \sum_{t=1}^{n} w_t^2(n)\,\mathrm{Var}(x_t) = \sum_{t=1}^{n} \frac{\sigma^2}{n^2} = \frac{\sigma^2}{n}$

since each weight in this case equals $1/n$. The difference in efficiency between the sample mean and the amnesic mean can now be evaluated by comparing the amnesic mean's error coefficient $\sum_{t=1}^{n} w_t^2(n)$ to the sample mean's error coefficient $1/n$.
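The two error coefficients are easy to compare numerically. The following sketch (assuming the μ(t) profile of Eq. (5.6) with the same illustrative constants used earlier) computes the closed-form weights of Eq. (5.11) and both coefficients:

```python
import numpy as np

def mu(t, t1=20, t2=200, c=2.0, r=2000.0):
    """Amnesic function; Eq. (5.6)."""
    if t <= t1:
        return 0.0
    if t <= t2:
        return c * (t - t1) / (t2 - t1)
    return c + (t - t2) / r

def amnesic_weights(n):
    """Closed-form weights w_t(n); Eq. (5.11)."""
    w = np.empty(n)
    for t in range(1, n + 1):
        prod = 1.0
        for j in range(t + 1, n + 1):
            prod *= (j - 1 - mu(j)) / j
        w[t - 1] = (1 + mu(t)) / t * prod
    return w

n = 1000
w = amnesic_weights(n)
print(w.sum())                # 1.0, as Eq. (5.12) requires
print((w ** 2).sum(), 1 / n)  # amnesic coefficient exceeds 1/n
```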
When $\mu(t) \equiv 0$, the error coefficient is that of the sample mean. In each of the cases considered, the amnesic mean converges to the population mean. Note the large difference in expected error for the early samples. From this alone, a lower $\mu(t)$ would seem more beneficial. But when $\mu(t)$ is smaller, there is less capability to adapt to a changing distribution. This is why the multi-sectional amnesic function in equation 5.6 is used. Until the first switch-point, the sample mean is used for maximum statistical efficiency; it is assumed that these early samples are all from the same process. After the first switch-point, $\mu(t)$ is gradually increased, becoming a function of time after the second switch-point, so that the estimator can continue to adapt as $t \to \infty$.

In conclusion, the multi-sectional amnesic averaging technique is useful for real experiences, such as robot vision over a large number of frames, where the samples will probably be generated by different processes over time. The multi-sectional amnesic mean is unbiased (so that convergence to the population mean is guaranteed for a stationary process) and efficient, but at the same time it can adapt to a nonstationary process or multiple generative processes. In the remainder of this document, the term mean (or average) refers to the amnesic mean presented here.

Chapter 6
Principal Components Analysis

6.1 Introduction

Principal Components Analysis (PCA) [20] is one technique that can be used to automatically derive a representation from the input samples. It has been popular in the appearance-based approach to computer vision, e.g., Turk and Pentland's Eigenfaces [34]. The set of principal components of a set of samples gives an optimal linear manifold for any dimensionality less than the original sample dimensionality. The samples can be projected into this smaller space to represent them in an efficient, linear way. Each of this basis' components, known as the principal components, is a feature. Weng et al. [43] proposed a fast-converging, candid, incremental algorithm for PCA that will converge in high-dimensional sample spaces. This algorithm, known as CCIPCA, can be used to extract the principal components under the constraints of AMD.

The theory of CCIPCA is very important for the CCILCA technique used in MILN, since each lobe component is derived as the first principal component of the samples contributing to it. Additionally, the first principal component gives the optimal representation of the contributing samples.

6.2 Principal Component

A principal component is a vector obtained through eigen-decomposition of the sample covariance matrix. The covariance matrix $A$ of a set of samples $x(t), t = 1, 2, \ldots, n$ with mean vector $\bar{x}$ is defined as

$A(x) = \frac{1}{n-1} \sum_{t=1}^{n} (x(t) - \bar{x})(x(t) - \bar{x})^T$   (6.1)

If the samples are already position normalized, equation 6.1 is the same as

$A(u) = \frac{U U^T}{n-1}$   (6.2)

where each sample $u(t), t = 1, 2, \ldots, n$ is a column vector placed in the $d$-by-$n$ matrix $U$, and $d$ is the dimensionality of each sample. The denominator in both equations is set to $n - 1$ instead of $n$ because the sample covariance matrix is then an unbiased estimator of the population covariance matrix (which is not the case if we divide by $n$).
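As a quick numerical check that Eqs. (6.1) and (6.2) agree (a sketch; the data here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 500
X = rng.normal(size=(d, n))      # n samples of dimension d, as columns

mean = X.mean(axis=1, keepdims=True)
U = X - mean                     # position-normalized samples

# Eq. (6.1): explicit sum of outer products.
A1 = sum(np.outer(U[:, t], U[:, t]) for t in range(n)) / (n - 1)

# Eq. (6.2): the same matrix, written as U U^T / (n - 1).
A2 = U @ U.T / (n - 1)

print(np.allclose(A1, A2))       # True
```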
The covariance matrix is an extension of the concept of variance to multiple dimensions. Along the diagonal, the value at row $i$ and column $i$ gives the variance of component $i$. Visually, this usually (assuming there are no outliers) indicates how much the samples are spread out along the axes of the coordinate space. The values off the diagonal indicate the covariance between two components; visually, these values indicate how much the samples are spread out along directions in between two of the original dimensions.

If the components are first scale-normalized, the value at row $r$ and column $c$ of the covariance matrix gives the correlation between components $u_r$ and $u_c$ of the vector $u$ over time. The higher the absolute value, the higher the correlation. A high correlation indicates a high level of usefulness as a feature. Correlation among the components shows up graphically as the shape of the distribution. Note that the direction in which the correlations are highest might not lie along any of the sample coordinate axes. This is where PCA comes in. The first principal component is the direction in which the variance is the largest. This is known as the Most Expressive Feature (MEF) [32].

A principal component is an eigenvector of the covariance matrix. An eigenvector $v$ of the covariance matrix has a corresponding eigenvalue $\lambda$; $v$ has unit length, as do all principal components, and $\lambda$ gives the variance of the samples along $v$:

$A(u)\,v = \lambda\,v$   (6.3)

The eigenvector with the largest eigenvalue is the first principal component $v_1$. The direction of $v_1$ is along the axis of largest variance in the sample space. $v_2$'s direction is the axis of largest variance that is orthogonal to $v_1$; $v_3$'s direction is the axis of largest variance that is orthogonal to both $v_1$ and $v_2$; and so on, until $d$ vectors are obtained. The corresponding eigenvalues are ordered such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Note that there may be multiple solutions for the principal components. For example, when the data fill a sphere, every direction has equal variance; in that case, any nonzero vector is a solution for $v_1$.

All principal components form an orthogonal basis for a $d$-dimensional vector space, so each sample can be decomposed as a weighted sum of the components:

$u(t) = y_1 v_1 + y_2 v_2 + \cdots + y_d v_d$   (6.4)

where $y_1, y_2, \ldots, y_d$ are scalar coefficients. The principal components are ordered by variance: a lower-indexed component lies in a direction of higher sample variance than its higher-indexed components. Another way to phrase this is to say the samples vary most along the first principal axis, then the second principal axis, and so on until the last. So $v_1$ is the most important component, $v_2$ is the second-most important, and so on. It is expected that the lower-order components will not contain much useful information about the samples, and it is beneficial to eliminate them. We can select a value $k$, where $k < d$, and project the samples into a $k$-dimensional space using the responses to only the first $k$ principal components, $y = (y_1, y_2, \ldots, y_k)^T$. Define the approximate vector $\hat{u}$ as the reconstruction of $u$ using the first $k$ principal components and the projection weights $y_1, \ldots, y_k$ of $u$ onto each:

$\hat{u} = y_1 v_1 + y_2 v_2 + \cdots + y_k v_k$   (6.5)

The optimality of PCA is defined as follows. The basis made up of principal components is optimal in the sense that the expected squared error between the approximate sample (in $k$ dimensions) and the original vector (in $d$ dimensions) is the least possible for any set of $k$ orthogonal vectors. This can be proved formally by induction, but only an outline is presented here, as it is quite intuitive.
Think of the case where $k = 1$, i.e., we use just one principal component. Then the problem is equivalent to fitting a line to the data to minimize the squared errors. From the theory of linear least squares, this line lies along the dimension of highest sample variance. As stated above, this is the first principal component, and it is the optimal axis to use as a one-dimensional basis in terms of minimum information loss. So PCA is optimal when $k = 1$. Now, assume we already have an optimal basis for $k$ dimensions and seek an optimal basis in $k + 1$ dimensions. The new component must be orthogonal to the previous $k$. Consider the sample subspace obtained by removing the axes of the first $k$ components. In this space, every vector is orthogonal to all previous ones. The best axis in terms of least information lost in this space lies along the direction of highest variance in this space; the first principal component of the subspace is therefore the best solution for basis vector $k + 1$, minimizing the information lost for a basis with $k + 1$ components. Thus, PCA derives an optimal linear basis, in the least-squared-error sense, for any $k$.

In many cases, the number of dimensions that can be reduced is very large, especially when the dimensionality is large to begin with. For further processing, the samples are represented only by their responses (projections) onto the principal components. In this space, the responses are uncorrelated, meaning their covariance is zero.

6.3 Whitening

An important (see the experiment in Section 10.1.1) preprocessing procedure called whitening can now be introduced. Note that the components $y_1, \ldots, y_k$ of the responses $y(1), \ldots, y(t)$ after PCA are uncorrelated; that is, the covariance of every pair of different components is zero. Whitening takes this a step further and normalizes each response component by its standard deviation, so that each variance is unit. After whitening, the covariance matrix of $y(t)$ is the identity matrix.

The variance of component $y_i$ is given by the eigenvalue $\lambda_i$ of eigenvector $v_i$. The whitening matrix $W$ is generated by taking the matrix of principal components $V = [v_1\, v_2 \ldots v_k]$ and dividing each column by its standard deviation (the square root of its variance). Let $D$ be the $k$-by-$k$ diagonal matrix whose component $(i, i)$ (along the diagonal) is $1/\sqrt{\lambda_i}$; then $W = VD$. To whiten, multiply the (position-normalized) sample by the transpose of this matrix, which yields the PCA response vector with each component divided by its standard deviation:

$\hat{y} = W^T u = D\,V^T u$   (6.6)

A dewhitening matrix $Z$ can be created to undo this procedure. It is defined by

$u = Z \hat{y} = V D^{-1} \hat{y}$   (6.7)

where $D^{-1}$ is the inverse of $D$. Visually, whitening can be thought of as a "sphering" procedure, since afterwards the data have unit variance in all directions. It is a useful procedure to perform before feature extraction, since extraction procedures based on high variance will otherwise be biased towards the directions with the larger eigenvalues. It should be noted that an adaptive sigmoid performing scale normalization on the response signal line of each principal component is equivalent to equation 6.6.
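A sketch of whitening (Python/NumPy; here the components are obtained by batch eigen-decomposition rather than the incremental method of the next section, and the data are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 2000
M = rng.normal(size=(d, d))
U = M @ rng.normal(size=(d, n))    # correlated toy data (columns)
U -= U.mean(axis=1, keepdims=True) # position normalization

A = U @ U.T / (n - 1)              # sample covariance, Eq. (6.2)
lam, V = np.linalg.eigh(A)         # batch eigen-decomposition
order = np.argsort(lam)[::-1]      # sort so lambda_1 >= ... >= lambda_d
lam, V = lam[order], V[:, order]

D = np.diag(1.0 / np.sqrt(lam))    # D_ii = 1 / sqrt(lambda_i)
W = V @ D                          # whitening matrix, W = V D
Y_white = W.T @ U                  # whitened responses, Eq. (6.6)

# After whitening, the covariance of the responses is the identity.
print(np.round(Y_white @ Y_white.T / (n - 1), 2))
```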
6.4 Candid, Covariance-Free, Incremental PCA

How can the principal components be computed given the constraints of the developmental program? Weng et al. derived a Type-4 algorithm that computes the principal components. The theory and equations in this section are from that paper [43], but all explanation is as per my understanding. The CCILCA technique used on each layer of the in-place learning network computes the first principal component for each neuron using this technique.

6.4.1 Deriving the first eigenvector

Recall the definition of a principal component:

$\lambda v = A(u)\,v$   (6.8)

where $A(u)$ is the covariance matrix of the samples. Substitute for $A(u)$ using its definition,

$\lambda v = \frac{1}{n-1} \sum_{t=1}^{n} u(t)\,u(t)^T v$   (6.9)

and note that $u\,u^T v = (u \cdot v)\,u$:

$\lambda v = \frac{1}{n-1} \sum_{t=1}^{n} (u(t) \cdot v)\,u(t)$   (6.10)

There are multiple solutions to $\lambda v$ with regard to the scaling factor, or eigenvalue, $\lambda$. From now on, enforce the length of $v$ to be unit by normalizing it, so there is only one eigenvalue per eigenvector direction. When $v$ is unit, the value of $\lambda$ equals the variance of the observations along the axis $v$. We can then represent the left side simply as $v$, whose length $\|v\|$ gives $\lambda$:

$v = \frac{1}{n-1} \sum_{t=1}^{n} \left( u(t) \cdot \frac{v}{\|v\|} \right) u(t)$   (6.11)

The right side of the equation gives the parameter to estimate incrementally, notated as $b(t) = \left(u(t) \cdot \frac{v}{\|v\|}\right) u(t)$:

$v = \frac{1}{n-1} \sum_{t=1}^{n} b(t)$   (6.12)

This is just the batch average of $b(t)$. Convert it to an incremental average,

$v(t) = \frac{t-2}{t-1}\,v(t-1) + \frac{1}{t-1} \left( u(t) \cdot \frac{v(t-1)}{\|v(t-1)\|} \right) u(t)$   (6.13)

with samples arriving at $t = 2, 3, \ldots$. Note that the incremental version must start at $t = 2$, since the first scatter vector is a zero vector: $u(1) = 0$ (because $\bar{x}(1) = x(1)$).

Lastly, instead of the incremental average, the amnesic average is used to estimate $b(t)$. This is done for two reasons. First, $b(t)$ is a function of $v(t-1)$, which is not stationary. Second, there is no guarantee that the generative process itself is stationary. This is the same reason the amnesic mean is used to estimate the population mean: the developmental program's input will typically be samples from a long-running sensory stream in a changing environment.

$v(t) = w_1(t)\,v(t-1) + w_2(t) \left( u(t) \cdot \frac{v(t-1)}{\|v(t-1)\|} \right) u(t)$   (6.14)

where

$w_1(t) = \frac{t - 2 - \mu(t)}{t-1}$   (6.15)

and

$w_2(t) = \frac{1 + \mu(t)}{t-1}$   (6.16)

6.4.2 Geometric explanation

Intuitively, the estimate of the principal component vector $v$ is "pulled" by the direction of each observation. That is, $v(t)$'s direction changes to be closer to that of each sample. The convergence of $v(t)$ depends on the direction where the "force of pulling" is greatest. When the dot product is expanded, the magnitude of $b(t)$ becomes $\|u(t)\|^2 \cos(\theta)$, where $\theta$ is the angle between $u(t)$ and $v(t-1)$. As $\|u(t)\|^2$ is just a measure of the variance of the samples, the direction where the force of pulling is largest is the axis of most variance, i.e., the first principal component. The maximum amount of pulling changes depending on the weighting factors $w_1$ and $w_2$. If the amnesic function $\mu(t)$ is a constant, $v(t)$ will eventually converge in a statistically efficient way. If $\mu(t)$ is a function of time, as it is after the second switch-point, $v$ can continue to shift in direction to adapt to new distributions at any time, and it does this with near, or quasi-optimal, efficiency.
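A minimal sketch of the first-eigenvector update of Eq. (6.13) (Python/NumPy; μ(t) is fixed to 0 here for simplicity, giving the plain incremental average, and the data are a toy anisotropic cloud):

```python
import numpy as np

def ccipca_first_component(U):
    """Estimate the first principal component incrementally.

    U: (n, d) array of position-normalized samples, one per row.
    Implements Eq. (6.13); v's direction estimates v_1 and its
    length ||v|| estimates lambda_1.
    """
    v = U[0].copy()                  # initialize from the first sample
    for t in range(2, len(U) + 1):   # updates start at t = 2
        u = U[t - 1]
        b = (u @ v / np.linalg.norm(v)) * u
        v = (t - 2) / (t - 1) * v + b / (t - 1)
    return v

rng = np.random.default_rng(3)
U = rng.normal(size=(5000, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
v = ccipca_first_component(U - U.mean(axis=0))
print(v / np.linalg.norm(v))         # close to +/-(1, 0)
print(np.linalg.norm(v))             # close to 9, the variance along it
```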
6.4.3 Convergence speed

The formula

$E\|v(n) - v\|^2 = \sum_{t=1}^{n} w_t^2(n)\,E\|b\|^2 = \sum_{t=1}^{n} w_t^2(n)\,\mathrm{trace}(\Sigma_b)$   (6.17)

gives the expected estimation error as a function of the number of samples $n$. It depends on the amnesic error coefficient, on how spread out the data are, and on the dimensionality, through $\mathrm{trace}(\Sigma_b)$, the sum of the variances within the covariance matrix of $b$. Since the amnesic average is used, the right side never truly reaches zero. However, it constantly decreases as $n$ increases, and it does converge to zero as $n$ goes to infinity. Its main use is to estimate the number of samples needed to get below an acceptable error bound.

6.4.4 Deriving lower-order eigenvectors

The preceding section derived only the first principal component. What about the other $k - 1$? Recall that these vectors compose an orthogonal basis, so each component $v_i$, $i = 2, \ldots, k$, must be orthogonal to all previous components. Orthogonality is enforced by the use of residual vectors. Take the derivation of $v_2$ as an example. In order for $v_2$ to be orthogonal to $v_1$, first compute the projection vector of the sample onto $v_1$, then subtract that vector from the sample, effectively deriving a new sample in a residual space. In the residual space, every vector is orthogonal to $v_1$, and the first principal component of the residual space is equal to $v_2$. For any number $k$ of principal components, $k - 1$ residuals are computed at each step.

Denote by $u_i(t)$ the sample, possibly in a residual space, used when updating vector $v_i$. When $i = 1$, $u_1(t) = u(t)$, the original sample. When $i > 1$, $u_i$ is a residual vector. First, project $u_i$ onto $v_i$:

$y_i(t) = u_i(t) \cdot \frac{v_i(t)}{\|v_i(t)\|}$   (6.18)

Second, subtract the projection vector from $u_i$ to obtain $u_{i+1}$:

$u_{i+1}(t) = u_i(t) - y_i(t)\,\frac{v_i(t)}{\|v_i(t)\|}$   (6.19)

This is done $k - 1$ times per sample.

6.4.5 Link to developmental neuroscience

Hebb's principle of neural adaptation states that a neuron strengthens its existing connection to another neuron when both fire at the same time. In Long-Term Potentiation, groups of neurons develop sensitivity to input connections corresponding to repetitive (common) input stimuli; most neuroscientists believe that the adaptation described by Hebb's rule leads to this sensitivity. In a computational neuron, sensitivity among the input dimensions is defined by its weights. A higher weight means the neuron is more sensitive to that input component. The inner product enforces this sensitivity: the inner product of an input sample and a neuron's weight vector is highest when their directions are the same, i.e., when the relationships between the sample components match those among the neuron's weights. Each neuron adapts by the learning rule in equation 6.13: an input vector "pulls" the weight vector. By Hebb's rule, when the part of the input space that a neuron is sensitive to is strongly active, the neuron is most likely to fire. The response of a computational neuron can be thought of as a firing rate. If the same stimulus is presented over and over again, the firing rate (inner-product response) of the neuron becomes larger and larger.

A neuro-computational view of CCIPCA is as follows. Each neuron is a principal component detector, and each inhibits the neuron of the next index ($v_i$ influences $v_{i+1}$) through the use of residuals. The competitive interactions between neurons are strictly enforced by equation 6.19. Inhibition through subtraction is not based on any observations from neuroscience. It is hypothesized that early neural processing decorrelates the input [1], which CCIPCA does as well, but it is not known exactly how this is done biologically, and there is no evidence of anything like residuals. Thus, CCIPCA is a Type-4 algorithm: it is online and does not use any higher-order statistics. It is not, however, a Type-5 (in-place) algorithm, since it requires an "overseer" to enforce the assignment of index to the neurons. Nevertheless, an in-place, Hebbian-like learning method is used to compute the component in each neuron's input space (which may be a residual space).
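Extending the earlier sketch to k components with the residuals of Eqs. (6.18) and (6.19) (again with μ(t) = 0; a sketch of the idea, not the exact published algorithm):

```python
import numpy as np

def ccipca(U, k):
    """Incrementally estimate the first k principal components.

    U: (n, d) position-normalized samples. Each v[i] is updated by
    Eq. (6.13) on a sample that Eq. (6.19) has made orthogonal to
    the estimates of all lower-indexed components.
    """
    v = U[:k].copy().astype(float)       # crude initialization
    for t in range(2, len(U) + 1):
        u = U[t - 1].copy()
        for i in range(k):
            b = (u @ v[i] / np.linalg.norm(v[i])) * u
            v[i] = (t - 2) / (t - 1) * v[i] + b / (t - 1)
            # Eqs. (6.18)-(6.19): residual for the next component.
            vi = v[i] / np.linalg.norm(v[i])
            u = u - (u @ vi) * vi
    return v

rng = np.random.default_rng(4)
U = rng.normal(size=(20000, 3)) * np.array([3.0, 1.0, 0.2])
V = ccipca(U - U.mean(axis=0), k=2)
print(V / np.linalg.norm(V, axis=1, keepdims=True))  # ~axis-aligned
```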
6.5 Problems with PCA

There are four problems with PCA that limit its usefulness as the only feature extractor in developmental perception.

1. Lower-order components are constrained. PCA derives an optimal basis, in the least-squared-error sense, for any dimension less than the original dimension. As basis components, all vectors must be orthogonal to each other. The principal components can also be called the most expressive features (MEFs). Each is the best feature in terms of maximum covariance (components that are often active together), given the constraint of orthogonality to all previous components. The response of a sample to feature $v_i$ is largest when the sample matches the feature, since $y_i = u \cdot v_i$; so principal components, as features, detect what they represent. This is not necessarily good. As the index increases, the components become more constrained, and they may not represent anything useful. A major part of PCA's theory (and how its optimality was defined) is minimizing reconstruction error. But we are not necessarily interested in exactly reconstructing the input, since the representation does not have to be world-specific.

2. PCA extracts global features. PCA cannot handle the case where there are multiple concentrations of correlated inputs. As an example, assume the samples fall into two well-separated clusters. The first eigenvector will converge to a direction between them, and may not correspond to any samples observed so far.

3. PCA is linear. The basis made up of the principal components spans a linear manifold, but the most natural shape of the data may be a nonlinear manifold. For example, if the data lie on a spherical surface, the axis along which the data have the most variance is not a line but a curve. As a real-world example, many maps represent the surface of Earth, a spherical object in 3D, as a 2D surface. The first component is the equator, the curved line that goes around the planet at its widest point; this leads to an "unfolding" of the surface into the 2D space. Two linear vectors, as used in PCA, would not be able to separate one side of the sphere from the other in 2D.

4. CCIPCA is not a Type-5 (in-place) algorithm. There is no evidence from neuroscience that anything similar to residual subtraction is used. Only the first principal component can be justified as developed by an in-place algorithm (by the mechanism of Hebbian learning). The lower-order components require a separate developer and are not in-place.

Chapter 7
Independent Component Analysis

This chapter briefly introduces the concepts of Independent Component Analysis, an approach that extracts sparse features. Independent Component Analysis [18] requires the features to be not only uncorrelated but also independent.

7.1 ICA Formulation

At every time instance $t = 1, 2, \ldots$, a sample vector is generated from $n$ sources: $s(t) = (s_1(t), s_2(t), \ldots, s_n(t))^T$. The sources themselves are unknown, since they are mixed by the mixing matrix $A$, which is full-rank. The mixing matrix transforms each source vector linearly into an observed (sensed) vector at each time: $x(t) = A\,s(t)$, where $x(t)$ is an $n$-dimensional vector of observed signals. The goal of ICA is to find a linear transformation such that the original sources can be recovered. The core assumption in ICA is that the original signals are independent. However, it is too difficult to extract truly independent components in practice, so a variety of other criteria are used, such as maximizing kurtosis or negentropy (non-Gaussianity) or mutual information (infomax), to name a few; see [18] for a survey.
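A sketch of the generative model x(t) = A s(t) (Python/NumPy; the sources and the mixing matrix are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, T = 2, 5000

# Two independent, super-Gaussian (Laplacian) sources.
S = rng.laplace(size=(n, T))

# Unknown full-rank mixing matrix A; the sensors observe x = A s.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# ICA seeks an unmixing matrix W with W X ~ S (up to permutation
# and scaling). With A known, W = A^{-1} recovers the sources.
W = np.linalg.inv(A)
print(np.allclose(W @ X, S))  # True
```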
7.2 ICA and Vision

The ICA model applies very well to blind source separation, where the number of source signals is typically small and known. But the model is too strong for natural images, where the number of sources is not known and may be much larger than the number of sensors (pixels); in fact, the meaning of a "source" in vision is somewhat unclear. The features derived by non-Gaussianity-maximizing ICA methods, when given natural images, are not independent sources, but simply the most non-Gaussian components. Most of these correspond to localized edge filters [3]. Decomposing an image using these features leads to a sparse coding of the image, as most edges do not overlap. Sparse features are desirable in AMD, since the generalization capability of the system is increased, and the many weights equal to zero can be pruned and ignored, leading to fast performance.

However, most ICA algorithms are rather inefficient. FastICA is a Type-1 algorithm, while Extended Infomax is Type-2; these do not suit AMD implementations. The CCI LCA technique [42] is a Type-5 ICA algorithm for natural data. It is based on principles of statistical efficiency and quasi-optimal representation. CCI LCA functions as an ICA algorithm when the input is a super-Gaussian mixture [48]. The criterion maximized can be thought of as sparseness, through use of a winner-take-all approach. Somewhat surprisingly, it outperforms the popular, less restrictive, state-of-the-art ICA algorithms by a very wide margin. CCI LCA is the primary method used to derive features in the in-place learning system.

Chapter 8
Lobe Component Analysis

This chapter is adapted from Weng and Zhang [42] and Weng and Luciw [40]. The former introduced the concepts presented here, while the latter concerns the multi-layer version.

8.1 Lobe components

Atick and coworkers [1] proposed that early sensory processing decorrelates inputs. Weng et al. [43] proposed an in-place algorithm that develops a network that whitens the input. Therefore, we can assume that prior processing has been done, so that its output vector $y$ is roughly white. By white, we mean its components have unit variance and are pairwise uncorrelated. The sample space of a $k$-dimensional white input random vector $y$ can be illustrated by a $k$-dimensional hypersphere.

A concentration of the probability density of the input space is called a lobe, which may have its own finer structure (e.g., sublobes). The shape of a lobe can be of any type, depending on the distribution. For non-negative input components, the lobe components lie in the section of the hypersphere where every component is non-negative (corresponding to the first octant in 3-D).

[Figure 8.1 appears here.] Figure 8.1: Three lobe components converge to the optimal representative prototype within the region of each, from whitened input. In the unsigned version of LCA, each lobe component represents both positive and negative signals (figure courtesy of [42]).

Given a limited cortical resource, $c$ cells fully connected to input $y$, the developing cells divide the sample space $\mathcal{Y}$ into $c$ mutually nonoverlapping regions, called lobe regions:

$\mathcal{Y} = R_1 \cup R_2 \cup \cdots \cup R_c$   (8.1)

(where $\cup$ denotes the union of two spaces).
Each region $R_i$ is represented by a single unit feature vector $v_i$, called the lobe component. Given an input $y$, many cells, not only $v_i$, will respond. The response pattern forms a new population representation of $y$.

Suppose that a unit vector (neuron) $v_i$ represents a lobe region $R_i$. If $y$ belongs to $R_i$, $y$ can be approximated by $v_i$ as the projection onto $v_i$: $y \approx \hat{y} = (y \cdot v_i)\,v_i$. Suppose the neuron $v_i$ minimizes the mean square error $E\|y - \hat{y}\|^2$ of this representation when $y$ belongs to $R_i$. According to the theory of Principal Component Analysis (PCA) (e.g., see [20]), we know that the best solution for the column vector $v_i$ is the principal component of the conditional covariance matrix $\Sigma_{y,i}$, conditioned on $y$ belonging to $R_i$. That is, $v_i$ satisfies $\lambda_{i,1}\,v_i = \Sigma_{y,i}\,v_i$. Replacing $\Sigma_{y,i}$ by the estimated sample covariance matrix of the column vector $y$, we have

$\lambda_{i,1}\,v_i \approx \frac{1}{n} \sum_{t=1}^{n} y(t)\,y(t)^T v_i = \frac{1}{n} \sum_{t=1}^{n} \big(y(t) \cdot v_i\big)\,y(t)$   (8.2)

We can see that the best lobe component vector $v_i$, scaled by the "energy estimate" eigenvalue $\lambda_{i,1}$, can be estimated by the average of the input vector $y(t)$ weighted by the linearized (without $g$) response $y(t) \cdot v_i$, whenever $y(t)$ belongs to $R_i$. This average expression is crucial for the concept of optimal statistical efficiency discussed below.

8.2 Optimality

Suppose that there are two estimators, $\Gamma_1$ and $\Gamma_2$, for a vector parameter (i.e., synapses or a feature vector) $\theta = (\theta_1, \ldots, \theta_k)$, which are based on the same set of observations $S = \{x_1, x_2, \ldots, x_n\}$. If the expected square error of $\Gamma_1$ is smaller than that of $\Gamma_2$ (i.e., $E\|\Gamma_1 - \theta\|^2 < E\|\Gamma_2 - \theta\|^2$), the estimator $\Gamma_1$ is more statistically efficient than $\Gamma_2$. Given the same observations, among all possible estimators, the optimally efficient estimator has the smallest possible error. The challenge is how to convert a nonlinear search problem into an optimal estimation problem using the concept of statistical efficiency.

For in-place development, each neuron does not have extra space to store all the training samples $y(t)$. Instead, it uses its physiological mechanisms to update its synapses incrementally. If the $i$-th neuron $v_i(t-1)$ at time $t-1$ has already been computed using the previous $t-1$ inputs $y(1), y(2), \ldots, y(t-1)$, the neuron can be updated into $v_i(t)$ using the current sample defined from $y(t)$ as:

$x_t = \frac{y(t) \cdot v_i(t-1)}{\|v_i(t-1)\|}\;y(t)$   (8.3)

Then Eq. (8.2) states that the lobe component vector is estimated by the average:

$\lambda_{i,1}\,v_i \approx \frac{1}{n} \sum_{t=1}^{n} x_t$   (8.4)

Statistical estimation theory reveals that for many distributions (e.g., Gaussian and exponential distributions), the sample mean is the most efficient estimator of the population mean (see, e.g., Theorem 4.1, pp. 429-430 of Lehmann [27]). In other words, the estimator in Eq. (8.4) has nearly the minimum error given the observations.

8.3 Lobe components for nonstationary processes

The sensory environment of a developing brain is not stationary; that is, the distribution of the environment changes over time. Therefore, the sensory input process is a nonstationary process too. We use the amnesic mean technique below, which gradually "forgets" old "observations" (which use bad $x_t$ when $t$ is small) while keeping the estimator quasi-optimally efficient. The mean in Eq. (8.4) is a batch method. For incremental estimation, we use what is called an amnesic mean [43]:

$\bar{x}^{(t)} = \frac{t - 1 - \mu(t)}{t}\,\bar{x}^{(t-1)} + \frac{1 + \mu(t)}{t}\,x_t$   (8.5)

where $\mu(t)$ is the amnesic function depending on $t$. If $\mu \equiv 0$, the above gives the straight incremental mean. We adopt the following profile of $\mu(t)$:

$\mu(t) = \begin{cases} 0 & \text{if } t \le t_1, \\ c\,(t - t_1)/(t_2 - t_1) & \text{if } t_1 < t \le t_2, \\ c + (t - t_2)/r & \text{if } t_2 < t \end{cases}$   (8.6)
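A minimal winner-take-all sketch of the lobe component update of Eqs. (8.3) and (8.5) (Python/NumPy; the winner rule, the number of cells c, and the μ(t) constants are illustrative choices, not the exact algorithm of [42]):

```python
import numpy as np

def mu(t, t1=20, t2=200, c=2.0, r=2000.0):
    """Amnesic function; Eq. (8.6)."""
    if t <= t1:
        return 0.0
    if t <= t2:
        return c * (t - t1) / (t2 - t1)
    return c + (t - t2) / r

def lca_update(V, ages, y):
    """Update the winning lobe component for one input y.

    V: (c, k) array of lobe component vectors v_i (scaled by energy).
    ages: per-neuron update counts (each neuron keeps its own t).
    """
    responses = V @ y / np.linalg.norm(V, axis=1)      # y . v_i / ||v_i||
    i = int(np.argmax(responses))                      # winner's region R_i
    ages[i] += 1
    t, m = ages[i], mu(ages[i])
    x_t = responses[i] * y                             # Eq. (8.3)
    V[i] = (t - 1 - m) / t * V[i] + (1 + m) / t * x_t  # Eq. (8.5)
    return i

rng = np.random.default_rng(6)
c, k = 3, 4
V = rng.normal(size=(c, k))
ages = np.zeros(c, dtype=int)
for _ in range(10000):
    lca_update(V, ages, rng.laplace(size=k))
```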
- +' Top three nonzero responses _ . 4‘ \\\ \‘ 0.3% ~ . . I \ ‘t b \ \ \ 20.25» ~, ‘. . to ‘\ ‘ : 02*- A\ \\ “ o ' x R t x . I.“ \ ‘ 0.15 e\\\ , \\¥\ \A \ 0.1 * \\:‘ _ 53‘ .m‘ 0.05“ *‘*::é::'$ T 0 i i ......1 r 4 turn: 1 grunt +. ......t I . . it... o 1 2 3 4 5 10 10 10 10 10 10 Layer-one neurons Figure 10.10: The effect of multiple responses versus the neuronal density. 10.2.2 Sparse coding In this experiment, we study whether it will help to allow more neurons to fire. We used the same network structure as the first experiment, but all the lobe components in layer-l are updated. In the first network type “top one nonzero response,” only the top- one winner in layer-l is allowed to fire. In the second type called “top three nonzero responses,” the top-three responses from layer-l were multiplied by a factor of 1, 0.66 and 0.33, respectively, and all other layer-l neurons give zero response. The error rates for different numbers of layer-l neurons are shown in Fig. 10.10. We know that top-3 responses give more information about the position of the input in relation with the top- three winning neurons. However, Fig. 10.10 showed that multiple responses did not help when the density of the prototypes is low. This is reasonable because bringing in far—away prototypes in decision making may add noise. This situation changed when the number of prototypes is sufficiently large so that the positions of nearby prototypes better predict the shape of the class boundaries indicated by the crossing point in Fig. 10.10. 78 unvrwwwmmm *("KVV'V‘M‘hNNMoo 'v0\~«»%%%mmm wmmmwmmmmm i% 9% 2% EB 88 88 88 22 82 S? 33% 82% 858 888 888 888 888 288 888 888 Figure 10.1 I: The topographic map of layer—one neurons for the hand—written digit eight. 10.2.3 Topography and invariance The next experiments focus on topographical cortical self-organization and its implica- tion to within-class invariance. As observed in V1 and other cortical areas, nearby neu- rons have similar representations. This means nearby neurons represent nearby inputs or concepts. We place all layer-1 neurons in a two-dimensional, square grid, simulating a “cortical sheet.” The winner neuron that updates for each input will also cause its nearby neurons within a distance of 2 from the winner to update as well, with their gain rug in the algorithm weighted by their distance to the winner, but wl +w2 : 1 still holds. Fig. 10.l1 shows an example with a 10 x 10 grid and all of the training samples were from the class “.”8 We can see that within-class variation of the digit 8 is represented by this cortical area, sampled by the discrete number of prototypes. The invariance for such within—class variation is achieved by the positive weights from all of these neurons to the 8-th layer-2 neuron. The above case does not include between-class competition for resources (neurons). To show the self-organization of resource for all 10 classes, Fig. 10.12(a) shows the layer- } neurons in a 40 x 40 grid trained with all training samples. There are areas which can be seen to correspond to a particular class, since there is no guarantee that the area of a single 79 class is connected in the topographic map, which is also the case in biological networks (see, e.g., [35]). It depends on the size of neighborhood update, the inputs and degree of input variations. Fig. 10.12(b) shows the learned weights of the layer-2 neuron corresponding to the digit-l class. The darker the intensity, the larger the weight. 
Fig. 10.12(b) shows the learned weights of the layer-2 neuron corresponding to the digit-1 class. The darker the intensity, the larger the weight. As shown, invariance is achieved by the positive weights from the corresponding digit "1" neurons to the corresponding output neuron. Therefore, the within-class invariance shown at the output layer can be learned from multiple regions in the previous layer.

[Figure 10.12 appears here.] Figure 10.12: (a) The topographic map of layer-1 neurons. (b) Learned weights of the layer-2 neuron for the digit "1" class.

10.3 Regression Using Single-Dimensional Input

In this section of experiments, we move from classification to a simple regression task. We choose a simple problem initially, in order to work out the issues that arise in regression, and also to better understand the effect of the tunable parameters on the accuracy and on the efficient storage of the resulting weights. The function to be approximated here is $y = \sin(x)/x$, where $x$ is in the input space and $y$ is in the output space; $x$ and $y$ are one-dimensional signals. Sampling of $x$ is uniform over the range of the function for all tests. The sampling of the output space is dependent on the output of the function.

The difference between classification and regression is that the values in the output space are (typically) no longer discrete. This means that every possible output can no longer be stored exactly. The output must now be represented in the same way as the input: using a limited number of neurons as representative "prototypes" of the output space. In the teaching phase, samples contain both an $x$ and a $y$ component: $S_i = (x_i, y_i)$, where $S_i$ is the $i$-th training sample. We assign $n_x$ layer-one neurons to the input space, and $n_y$ layer-two neurons to the output space. A one-dimensional version of LCA is used to position the neurons from the samples. The closest layer-one neuron to $x_i$ wins, and adapts to better represent $x_i$. In the same way, the closest layer-two neuron to $y_i$ wins, and adapts to better represent $y_i$. The positions of the neurons are displayed as triangles in this section's figures, as in Figure 10.13. The function itself is approximated by learning sensory-motor neuron correlations. There are $n_x n_y$ weights from the input layer to the output layer (each neuron in the output space has an $n_x$-dimensional weight vector). When a supervised sample is given, the closest neuron in the motor space to $y$ adapts its weight vector to the layer-one response from $x$.
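A compact sketch of this two-layer scheme on y = sin(x)/x (Python/NumPy; nearest-prototype winners, a simple fixed learning rate in place of the full amnesic LCA update, and winner-take-all weights, so the constants are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(8)
nx, ny = 33, 20
px = np.linspace(0.01, 3 * np.pi, nx)      # layer-1 prototypes in x
py = np.linspace(-0.25, 1.0, ny)           # layer-2 prototypes in y
W = np.zeros((ny, nx))                     # sensory-motor weights

for _ in range(20000):
    x = rng.uniform(0.01, 3 * np.pi)
    y = np.sin(x) / x
    i = int(np.argmin(np.abs(px - x)))     # winning input neuron
    j = int(np.argmin(np.abs(py - y)))     # winning output neuron
    px[i] += 0.05 * (x - px[i])            # prototypes adapt toward samples
    py[j] += 0.05 * (y - py[j])
    r = np.zeros(nx); r[i] = 1.0           # winner-take-all layer-1 response
    W[j] += 0.05 * (r - W[j])              # motor winner adapts to r

def predict(x):
    i = int(np.argmin(np.abs(px - x)))
    j = int(np.argmax(W[:, i]))            # most correlated output neuron
    return py[j]

print(predict(1.0), np.sin(1.0) / 1.0)     # prototype-quantized estimate
```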
Recall that $k = (k_x, k_y, k_z, k_r, k_o)$ is the top-"k" parameter vector for a two-layer network. In these experiments, we set $k_x = k_z = k_y = 1$, meaning winner-take-all updating is used in the three applications of LCA. $k_r$ is the number of neurons in the first layer that are allowed to fire (the responses of the rest are set to zero). $k_o$ is the number of neurons in the second (output) layer that are allowed to fire. The averaged positions of these neurons in motor space give $z$, the network's output.

[A result panel appears here: r(train) = 1, r(test) = 1, o = 1, Sparse = 0.48, MSE = 0.28.] Figure 10.13: Using only 10 neurons to approximate the function $y = \sin(x)/x$. The positions of neurons in the input space are displayed as the lower triangles, and the neurons positioned in the output space are displayed as the triangles on the right. A small number of neurons relative to the complexity of the function leads to large quantization errors.

Figure 10.13 shows a result when $n_x = 5$, $n_y = 5$, $k_r = 1$, and $k_o = 1$. The true function is shown as the solid line; the approximated result after several epochs of training is the dotted line. Noted on each figure, "MSE" is the mean-square error of the approximation from the true function, and "Sparse" is the sparseness measure of the input-to-output layer weights: the percentage of these weights that are equal to zero. Sparseness is desirable in AMD for efficient storage and speed, since weights that converge to zero do not have to be stored and do not matter in any computations.

The obvious source of error in the first figure is attributable to quantization in both the input and output spaces. To reduce the $x$-quantization error, we increase $n_x$ to 33, giving on average three samples per neuron. A result is displayed in Figure 10.14. The error decreases, and sparseness increases since there are more weights. But the $y$-quantization error remains.

[A result panel appears here: r(train) = 1, r(test) = 1, Sparse = 0.75, MSE = 0.044.] Figure 10.15: Output smoothing helps reduce the y-quantization error, but the number of neurons in the output layer provides insufficient coverage.

[A result panel appears here: r(train) = 1, r(test) = 3, o = 3, Sparse = 0.91, MSE = 0.0067.] Figure 10.16: Increasing the number of output-layer neurons is helpful in providing greater coverage, e.g., at the center peak of this function.
In the third experiment, we set the output smoothing parameter $k_o$ to three, and $k_r$ to three during the testing phase. Figure 10.15 shows a result. The error decreases without increasing the number of neurons or decreasing sparseness. However, another source of possible error becomes evident. As averages of samples, the neurons tend to move away from the edges of the sample space. This leads to errors at the edges, as is the case with the samples that correspond to the center peak in the $y$-space, and the samples on the far left and right in the $x$-space. In Figure 10.16, we increase $n_y$ so that enough neurons are available to approximate the function at the peak. Sparseness increases again, since the total number of weights increases.

In general, keeping $k_r$ low during training leads to a sparse solution, but possibly at the cost of accuracy, depending on the complexity of the function. In [40], it was shown that increasing $k_r$ during training in a high-dimensional classification task was beneficial when there were many neurons, so that the close neurons were all members of the same class; otherwise, the neurons that responded only served as distracter points (noise). The present function is very simple, so increasing $k_r$ during training was not beneficial. Ideally, we would find a solution that has few non-zero weights but also acceptable accuracy.

[Figure 10.17 appears here.] Figure 10.17: The function to be approximated, $y = \max(e^{-10 x_1^2}, e^{-50 x_2^2}, e^{-5 (x_1 + x_2)^2})$.

10.4 Regression Using Two-Dimensional Input

The next experiments involve the approximation of a more interesting function. The "cross" function involves a two-dimensional input vector $x = (x_1, x_2)^T$ and a one-dimensional output value $y$. It is defined as

$y = \max\{\, e^{-10 x_1^2},\; e^{-50 x_2^2},\; e^{-5 (x_1 + x_2)^2} \,\}$   (10.1)

The range of the input is from $[-1, -1]$ to $[1, 1]$. The range of the output is from -2 to 2. The different tests are briefly summarized in Tables 10.1 and 10.2. Constant parameters for all tests are as follows: in all tests, the neurons in the second layer were set to quantize the output space by 0.05, so $n_y$ was set to 80. For the top-k parameters, $k_x = k_z = k_y = 1$ for all tests.

Table 10.1: Summary of all tests using the cross function

Experiment number | Training            | Testing
(1)               | uniform grid (41 x 41) | uniform grid (41 x 41)
(2)               | uniform grid (41 x 41) | uniform grid (82 x 82)
(3)               | uniform grid (41 x 41) | uniform grid (82 x 82)
(4)               | uniform (subset)       | uniform grid (82 x 82)
(5)               | uniform                | uniform grid (82 x 82)
(6)               | uniform grid (41 x 41) | uniform grid (41 x 41)
(7)               | single gaussian        | single gaussian

Table 10.2: Summary of parameters and results

Experiment number | $n_1$         | Epochs | $k_r$         | $k_o$  | RMSE   | Sparseness
(1)               | $41^2$        | 1      | 1             | 1      | n/a    | 0.99
(2)               | $41^2$        | 1      | 1             | 1      | 3.7e-2 | 0.99
(3)               | $41^2$        | 1      | 2+ (dynamic)  | $n_y$  | 7.0e-3 | 0.96
(4)               | $41^2$        | 1      | 1             | 1      | 2.1e-2 | 0.99
(5)               | $41^2$        | 5      | 1             | 1      | 3.4e-3 | 0.91
(6)               | $\lfloor 41^2/10 \rfloor$ | 5 | 2+ (dynamic) | $n_y$ | 1.2e-2 | 0.70
(7)               | $\lfloor 41^2/10 \rfloor$ | 5 | 2+ (dynamic) | $n_y$ | 4.1e-3 | 0.73
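For concreteness, the cross function, as reconstructed in Eq. (10.1), and a uniform-grid sample set can be generated as follows (a sketch; grid size as in test (1)):

```python
import numpy as np

def cross(x1, x2):
    """The 'cross' function of Eq. (10.1)."""
    return np.maximum.reduce([
        np.exp(-10.0 * x1 ** 2),
        np.exp(-50.0 * x2 ** 2),
        np.exp(-5.0 * (x1 + x2) ** 2),
    ])

# Training samples on a 41 x 41 uniform grid over [-1, 1] x [-1, 1].
g = np.linspace(-1.0, 1.0, 41)
X1, X2 = np.meshgrid(g, g)
Y = cross(X1, X2)
samples = np.column_stack([X1.ravel(), X2.ravel(), Y.ravel()])
print(samples.shape)  # (1681, 3): 41^2 input-output pairs
```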
10.4.1 Approximation with a neuron grid

The first test is meant to provide a starting point. The neurons in the first layer are placed on a 41-by-41 grid, at equal distances from one another; thus, there are $41^2$ neurons. There are also $41^2$ training and testing samples, placed at the same locations as the neurons. The weights were trained over five iterations, and the order of the training samples was permuted before each iteration of the main loop. The result is shown in (1) of Figure 10.19. Note that the mesh plot does not show the results at all areas of the input space: only the values at the testing locations are given, and the plotting function interpolates in between. In this first simple test, there appears to be no difference between the result and the actual function, and as expected, the only error in the approximation result is attributable to quantization in the $y$-space.

Next, we double the density of the testing samples and test over an 82-by-82 grid. Samples that fall in between neurons in the input space are given the value associated with the nearest neuron, since $k_o$ was set to one. The result resembles the "stepwise" result of Figure 10.13, but in two dimensions. It is shown in (2) of Figure 10.19. In the same way as in the one-dimensional case, this error is attributable to quantization in the output space. This error can be reduced by interpolation, i.e., smoothing in the output layer. To interpolate the approximated function for samples in between neuron locations, $k_o$ must be set to a value greater than one.

In the third test, a method to dynamically change $k_r$ is used. We set a response threshold on the $x$ part of the sample based on the top two responding neurons: if their responses are $r_1$ and $r_2$, respectively, we set the threshold at $r_2 - (r_1 - r_2)$, and all responses below this threshold are set to zero. This is done because of symmetry: if a sample falls exactly in between two neurons and $k_r$ is set to one, then one neuron will be abnormally ignored. In testing, $k_o$ is set to the number of output neurons. The result is shown in part (3) of Figure 10.19; it looks smooth compared to the previous one. In Table 10.2, it can be seen that the error is nearly halved when smoothing is used, but sparseness (and thus speed, in an AMD implementation) decreases.

[Figure 10.18 appears here.] Figure 10.18: Approximation results when the input-layer neurons are placed on a grid. In (1), the training and testing samples both lie exactly on the neuron grid, and the approximation appears flawless.

10.4.2 Approximation using lobe components

The previous set of tests dealt with the reduction of $y$-quantization error through the top-k parameters. This experiment shows the effect of using lobe components on the $x$-quantization error. In the first case, we randomly select the positions of the $41^2$ neurons in the first layer, and do not allow them to change. In the second, LCA (with the winner decided by Euclidean distance instead of inner product) is used to update the neurons over several epochs through the set of training samples. The resulting neuron positions are shown in Figure 10.20, where the top figure is the first result and the bottom figure is the second. Using LCA updating reduces the larger $x$-quantization errors. The error measured in the two tests reflects this, as the MSE of the second test is more than six times better than that of the first.
In the second, training and testing samples were gener- ated by a gaussian process, centered at :1: = (0,0) with a covariance matrix of 2 = .05 I , where I is the identity matrix. The MSE of the second result was nearly three times better than that of the first. In a real context, many places in the world will never be observed by an agent. The latter experiment can be interpreted in this way, where the true function is representative of the environment, and the approximation is representative of an agent’s experience. 90 Figure 10.19: Approximation results when the input-layer neurons are placed on a grid. In (2), the accuracy of (1) is shown to be worse when testing is done in-between the neuron locations. In (3), this is somewhat remedied through use of smoothing in the output space. 91 Inputsp-eety O ‘ Q 0 °%% 0 c? ‘11“; M °§ 0 O on O 0 3’“. O 0 O o 0 -0.5 > ‘38 8%: 0 Input space (x,) (4) 1‘ o ”swam e80 ° 930%99 ooqpoooo connections?“ goofgga°98°g iggomo‘hsogf Sumac!” '33 o 0“ than? 2% 0000 g §o%°§°>%oooo 0 (’fi Nonfictflgoo‘so °D°o <90 0 o 3w: 0 8 of 3“ $300120 90°t§° d: 800:25 0.5> i°°° cogs a???” 93%? “0°13." 3.92%”?ogoai 300809 1 ° °%°o°°°°g°:g° $0 on ° °u° 0° (3‘9 ‘30 fiesta °° «3e 32.36 59°23 31,3 3:65.," ”0° it coo! ° 8 095° 0 Q30 3%) go & g g °%°8° °Q900°ozo° 0%30 0500 fi 00 g 80%8gm000300 out}: 00 W O . “1% O 0 30 o of 0 § 0 1% D 0 G 4’9 0° 0 0 1. 0 °8 0a" 0 m &0 "1pm 5m (5) (La 0 o “pd" 30¢! ° 3‘ o i 00 c &> t 37 3: 3‘ 0 48° 3° °°"‘1'¥P9b t1° ° %8 0‘?) -0.5 * 80 1:93) $ 3 i *4» ”W es ’ V3 «3’ gJon one a." ‘80 000 (.9 9° °Z° 8 {a a o o 8995::gg 3o (p c S’ “3 i 8) e 5.. 3 8 3 9* i of, ($0 '29 I _L '0- 0 Input space (x1) (5) Figure 10.20: (4): Randomly placing input-layer neurons leads to larger possible z-quantization errors than in (5), where lobe component updating is used to represent the uniform input surface. 92 1. oooo ooo°°oeoso.°‘°J O O o o o 0 ° ° 0 0 O o O O . 0 0 Q . 0 0 0-5 o e o o o c o e o '7. 9 0 ° 0 " 2 90 e 5. 0 ° 9 0 ° 9 ° 9 9 ° 0 0 ° 0 ° 8 o ° ° ' e o a 0' ° ° 9 ° 0 ° " a o o o o o o o i 0 ° 0 ° 0 9 ° ‘° 0 : °° 0 ° 0 ° ° ° 9 — ° ° 0 o o o 0 e o o 0 o o 05‘ 0 0 e 0 o °° 0 e 0 O o 0 ° 0 ° 0 o o 0 o 0 0 °o° °¢oooo°o°oeo°°ogo -1. l l l I l -1 -O.5 0 .5 1 Input space (111) Figure 10.21: Relatively few neurons cannot approximate the function well over the entire input space. 93 (7a) Input space (x2) 0 O 0 i 1 l l I 0 0.5 1 Input space (111) (7b) Figure 10.22: A concentration of few neurons can lead to good performance in one particular area. If the other areas are not experienced recently, there is no need to provide any resource to represent them. 94 Figure 10.23. The SICK PLS laser range finder is mounted on the front of the Dav robot. 10.5 High-Dimensional Regression In a Robotic Applica- tion This experiment is adapted from Zeng and Weng’s work in [46], in which it was done using IHDR. Here, several two-layer in-place networks are used to judge the performance. The Dav robot (Figure 10.23) is equipped with a SICK PLS laser range finder. [t is mounted on the front of the robot, and is angled 38° downwards. The range finder returns a vector of values corresponding to the distance to the closest objects 180° around the front of the robot. This vector x 6 R361, which is the input space, and x is quantized to set two values per degree. As in [46], we wish to use x to direct the reactive behavior of the robot so that it will avord possible collisions when moving. 
The movement of the robot can be controlled by a two-dimensional vector $y = (y_1, y_2)$, where $y_1$ gives the heading direction in radians and $y_2$ gives the movement velocity. The regression task in this problem is to learn the function $y(t+1) = f(x(t))$, where the action generated at time $t+1$ is best suited to avoid collisions with nearby obstacles.

[Figure 10.24 appears here.] Figure 10.24: The pre-attentional map is approximated by a polygon P, represented by h end-points. The post-attentional laser threshold is given by the half-circle T. Points that lie beyond this threshold are capped to the threshold distance. Points that lie within the threshold are returned normally. Figure courtesy of [46].

The network was trained using supervised examples. Supervision was done in real time on the robot using a training program: the current range map from the range finder was displayed as an image, and the teacher's mouse-click location on the image was interpreted as a velocity and direction. The $x$ and $y$ parts were coupled and saved as a training sample. In general, the robot was trained to move at a moderate speed and straight ahead when no object was nearby. When an object was close, it was trained to turn slowly in a direction away from the object. This type of local reactive behavior can be used in parallel with purpose-driven movement, and can take over when an obstacle is nearby.

A two-layer network was trained and tested using 3926 samples previously collected from teaching, done by S. Zeng for the study in [46]. In each run, 75% of the samples were randomly selected for training and the remaining 25% were set aside for a disjoint test. The data contain samples from many different scenarios, including interactions with single and multiple stationary obstacles, single and multiple moving obstacles (people), moving in a hallway (where the robot must avoid running into the walls), and moving in a wider space.

As in [46], a preprocessing attention module was used on the range vector when an obstacle was nearby. The attentional mechanism implements a threshold on the distances returned by the range finder (think of it as a synthetic enclosure) when a small value is returned, meaning an object is nearby that should be attended to. This reduces the number of training samples needed for obstacle avoidance, since the surface beyond the threshold has no impact on the robot's action when an obstacle is close; these cases should be classified the same.

We tested the performance using different numbers of input-layer neurons. The number of output-layer neurons was set to the number of unique actions (around 1000) in the dataset. The top-k parameters were $k_x = k_z = k_r = k_o = 1$ and $k_y = 10$ (the number of sensory-motor weights to adjust), since there were a large number of output-layer neurons. Results are presented in Figures 10.25 and 10.26. In testing, the program ran at nearly 30 frames per second, without pruning; due to the sparseness of the solution, we expect it to run much faster after pruning. As of now, there is no way to tell whether the observed errors would lead to collisions in practice, since the system may quickly adjust on a future sample before a collision occurs. There is no way to truly judge the collision-avoidance capability without actually testing it on the robot, which will be done very soon.
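A sketch of the post-attentional thresholding just described (Python/NumPy; the threshold value and the trigger rule are hypothetical, since only the capping behavior is specified above):

```python
import numpy as np

def attend_range_map(x, threshold=2.0, trigger=1.0):
    """Cap a 361-dimensional laser range vector when an obstacle is near.

    If any reading falls below `trigger` (an object is close), every
    reading beyond `threshold` is capped to the threshold distance,
    so far-away structure has no impact on the chosen action.
    """
    if x.min() < trigger:
        return np.minimum(x, threshold)
    return x

rng = np.random.default_rng(9)
x = rng.uniform(0.5, 8.0, size=361)  # two readings per degree over 180 deg
print(attend_range_map(x).max())     # <= 2.0 whenever something is close
```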
[Plot: error on the heading direction versus number of input-space neurons (0 to 3000); curves: resubstitution test and disjoint test.]

Figure 10.25: Error of the trained networks on the heading direction.

[Plot: error rate for speed versus number of input-space neurons (0 to 3000); curves: resubstitution test and disjoint test.]

Figure 10.26: Error of the trained networks on the speed.

Chapter 11

Conclusions

11.1 Concluding Remarks

This thesis concerns the motivating factors and theory of a multi-layer, in-place network that integrates bottom-up unsupervised learning and top-down supervised learning. A two-layer version of this network was built and tested with a variety of experiments to empirically validate the theoretical properties and to better understand the effect of the parameters.

This is the first multi-layer in-place learning network for general-purpose regression and classification with near-optimal within-layer statistical efficiency. The experimental networks showed that the lobe components on each layer can automatically develop oriented edge filters with localized receptive fields from whitened natural input, and can learn the complex shape of the class manifolds in the layer's input (inner-product) space. Other experiments showed that the two-layer networks can learn to perform regression by adapting to correlations within sensory-motor input, with lobe components representing each layer in a near-optimal way. Application to AMD was shown by verifying that a two-layer network could accurately learn a real-world, high-dimensional regression problem. Due to the sparseness of the solution, it is expected that the speed will be suitable for real-time use in a robotics application.

A plan for future work is to (1) study the extension of MILN to more than two layers, (2) compare the real-time capabilities of the program with those of IHDR, (3) examine MILN's use for developmental vision in Dav, and (4) study the possibility of its use as both a source and a target for learning attention.

Appendix A

Proofs

In [42], in proving the near-optimality of the amnesic average, the following two equations were presented. The proofs were not given at the time, instead being left as an exercise. Here, I prove each.

A.1 Proof of Equation 5.11

The formula for the amnesic weight $w_t$, $t = 1, 2, \ldots, n$, for any $n$,

$$w_t(n) = \frac{1 + \mu(t)}{t} \prod_{j=t+1}^{n} \frac{j - 1 - \mu(j)}{j}, \tag{A.1}$$

is here proved using induction on $n$.

• Base Case. $n = 1$, meaning only one sample has been observed. From equation A.1,

$$w_1(1) = \frac{1 + \mu(1)}{1} = 1, \tag{A.2}$$

since $\mu(1) = 0$ and the product on the right is empty when $t = n$.

• Induction Hypothesis. Now $n = k$. Assume that

$$w_t(k) = \frac{1 + \mu(t)}{t} \prod_{j=t+1}^{k} \frac{j - 1 - \mu(j)}{j}. \tag{A.3}$$

• Induction Step. Another sample is observed, so $n = k + 1$. By equation 5.2, the weight of the newest sample is $(1 + \mu(k+1))/(k+1)$. And from equation 5.3, the previous average $\bar{x}(k)$ is multiplied by $((k+1) - 1 - \mu(k+1))/(k+1)$. For the purposes of this proof, denote this quantity by $w_m$. Since $\bar{x}(k)$ is a weighted sum of all samples observed from time 1 to $k$, each term is multiplied by $w_m$ as follows:

$$w_m \bar{x}(k) = w_m \bigl( w_1(k) x_1 + w_2(k) x_2 + \cdots + w_k(k) x_k \bigr) = w_m w_1(k) x_1 + w_m w_2(k) x_2 + \cdots + w_m w_k(k) x_k. \tag{A.4}$$

Finally, we substitute using the induction hypothesis. For any $t = 1, 2, \ldots, k$,

$$w_t(k+1) = w_t(k)\, w_m = \frac{1 + \mu(t)}{t} \left( \prod_{j=t+1}^{k} \frac{j - 1 - \mu(j)}{j} \right) \frac{(k+1) - 1 - \mu(k+1)}{k+1} = \frac{1 + \mu(t)}{t} \prod_{j=t+1}^{k+1} \frac{j - 1 - \mu(j)}{j}.$$

For $t = k+1$, the weight is $(1 + \mu(k+1))/(k+1)$, which also agrees with equation A.1 since the product is then empty. This completes the induction.
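The closed form (A.1) can be checked numerically against the recursive update. The following Python sketch is illustrative only: the amnesic function $\mu(t)$ below is a hypothetical choice with $\mu(1) = 0$, not the one defined in Chapter 5, and the routine names are assumptions.

```python
import numpy as np

def mu(t):
    # Hypothetical amnesic function; any mu with mu(1) == 0 works here.
    return 0.0 if t <= 2 else 1.0

def recursive_weights(n):
    """Weights produced by the recursive update (equations 5.2 and 5.3): at
    step t, old weights are scaled by (t - 1 - mu(t))/t and the new sample
    receives weight (1 + mu(t))/t."""
    w = []
    for t in range(1, n + 1):
        w = [wi * (t - 1 - mu(t)) / t for wi in w]
        w.append((1 + mu(t)) / t)
    return np.array(w)

def closed_form_weights(n):
    """Weights from the closed form (A.1); the empty product equals one."""
    return np.array([
        (1 + mu(t)) / t
        * np.prod([(j - 1 - mu(j)) / j for j in range(t + 1, n + 1)])
        for t in range(1, n + 1)
    ])

n = 50
assert np.allclose(recursive_weights(n), closed_form_weights(n))  # equation A.1
```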
A.2 Proof of Equation 5.12

This section proves that, for any number of samples $n$, the amnesic weights $w_t$, $t = 1, 2, \ldots, n$, sum to one:

$$\sum_{t=1}^{n} w_t(n) = 1, \tag{A.5}$$

using induction on $n$.

• Base Case.

$$\sum_{t=1}^{1} w_t(1) = w_1(1) = \frac{1 + \mu(1)}{1} = 1, \tag{A.6}$$

using equation 5.11 and the fact that $\mu(1) = 0$.

• Induction Hypothesis. Assume that, for some number of samples $n = k$,

$$\sum_{t=1}^{k} w_t(k) = 1. \tag{A.7}$$

• Induction Step. Another sample is observed and $n = k + 1$. By equations 5.2 and 5.3, the weight of sample $k+1$ is $(1 + \mu(k+1))/(k+1)$, and all the previous weights are multiplied by $((k+1) - 1 - \mu(k+1))/(k+1)$. Using the induction hypothesis, the sum of the weights is still equal to one:

$$\sum_{t=1}^{k+1} w_t(k+1) = \frac{(k+1) - 1 - \mu(k+1)}{k+1} \sum_{t=1}^{k} w_t(k) + \frac{1 + \mu(k+1)}{k+1} = \frac{(k+1) - 1 - \mu(k+1) + 1 + \mu(k+1)}{k+1} = \frac{k+1}{k+1} = 1.$$
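Equation A.5 can be confirmed with the same hypothetical sketch used for equation A.1 above: for any $n$, the weights returned by either routine sum to one, up to floating-point error.

```python
# Continuing the sketch above: the amnesic weights always sum to one (A.5).
for n in (1, 5, 50, 500):
    assert abs(closed_form_weights(n).sum() - 1.0) < 1e-8
```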
BIBLIOGRAPHY

[1] J. J. Atick and A. N. Redlich. Towards a theory of early visual processing. Neural Computation, 2:308–320, 1990.

[2] H. B. Barlow. Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1:371–394, 1972.

[3] A. J. Bell and T. J. Sejnowski. The ‘independent components’ of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.

[4] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14–23, March 1986.

[5] R. A. Brooks. Intelligence without representation. Artificial Intelligence, 47:139–160, 1991.

[6] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.

[7] P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287–314, 1994.

[8] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431:931–945, 2004.

[9] M. J. Crowe, J. C. Bresnahan, S. L. Shuman, J. N. Masters, and M. S. Beattie. Apoptosis and delayed degeneration after spinal cord injury in rats and monkeys. Nature Medicine, 3:73–76, 1997.

[10] G. M. Edelman. Neural Darwinism: The Theory of Neuronal Group Selection. Basic Books, New York, 1987.

[11] W. Erlhagen and G. Schoner. Dynamic field theory of movement preparation. Psychological Review, 109(3):545–572, 2002.

[12] S. E. Fahlman. The recurrent cascade-correlation architecture. In J. E. Moody and D. S. Touretzky, editors, Advances in Neural Information Processing Systems, pages 190–196. Morgan Kaufmann, San Mateo, CA, 1991.

[13] D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4:2379–2394, 1987.

[14] D. J. Field. What is the goal of sensory coding? Neural Computation, 6:559–601, 1994.

[15] J. Hegde and D. C. van Essen. Selectivity for complex shapes in primate visual area V2. Journal of Neuroscience, 20:RC61, 2000.

[16] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160(1):107–155, 1962.

[17] W. S. Hwang and J. Weng. Hierarchical discriminant regression. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11):1277–1293, 2000.

[18] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, New York, 2001.

[19] A. Hyvarinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483–1492, 1997.

[20] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.

[21] E. R. Kandel, J. H. Schwartz, and T. M. Jessell, editors. Principles of Neural Science. McGraw-Hill, New York, 4th edition, 2000.

[22] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 3rd edition, 2001.

[23] J. Krichmar and G. Edelman. Machine psychology: autonomous behavior, perceptual categorization and conditioning in a brain-based device. Cerebral Cortex, 2002.

[24] L. von Melchner, S. L. Pallas, and M. Sur. Visual behavior mediated by retinal projections directed to the auditory pathway. Nature, 404:871–876, 2000.

[25] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

[26] T. W. Lee, M. Girolami, and T. J. Sejnowski. Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural Computation, 11(2):417–441, 1999.

[27] E. L. Lehmann. Theory of Point Estimation. John Wiley and Sons, Inc., New York, 1983.

[28] R. Miikkulainen, J. A. Bednar, Y. Choe, and J. Sirosh. Computational Maps in the Visual Cortex. Springer, Berlin, 2005.

[29] R. Pfeifer and C. Scheier. Sensory-motor coordination: the metaphor and beyond. Robotics and Autonomous Systems, 20:157–178, 1997.

[30] M. I. Posner and M. E. Raichle. Images of Mind. Scientific American Library, New York, 1994.

[31] M. Rosenblum and L. S. Davis. An improved radial basis function network for visual autonomous road following. IEEE Trans. on Neural Networks, 7(5):1111–1120, 1996.

[32] D. L. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(8):831–836, 1996.

[33] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

[34] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

[35] X. Wang, M. M. Merzenich, K. Sameshima, and W. M. Jenkins. Remodeling of hand representation in adult cortex determined by timing of tactile stimulation. Nature, 378:71–75, 1995.

[36] J. Weng. Developmental robotics: Theory and experiments. International Journal of Humanoid Robotics, 1(2):199–235, 2004.

[37] J. Weng, G. Abramovich, and D. Dutta. Adaptive part inspection through developmental vision. Journal of Manufacturing Science and Engineering, 127(4):846–856, 2005.

[38] J. Weng and W. S. Hwang. Incremental hierarchical discriminant regression. IEEE Trans. on Neural Networks, 2006. Accepted, to appear.

[39] J. Weng, H. Lu, T. Luwang, and X. Xue. In-place learning for positional and scale invariance. In Proc. World Congress on Computational Intelligence, Vancouver, Canada, July 16–21, 2006.

[40] J. Weng and M. D. Luciw. Optimal in-place self-organization for cortical development: Limited cells, sparse coding and cortical topography. In Proc. IEEE 5th International Conf. on Development and Learning (ICDL 2006), Bloomington, IN, May 30 – June 3, 2006.

[41] J. Weng, J. McClelland, A. Pentland, O. Sporns, I. Stockman, M. Sur, and E. Thelen. Autonomous mental development by robots and animals. Science, 291(5504):599–600, 2001.

[42] J. Weng and N. Zhang. Optimal in-place learning and the lobe component analysis. In Proc. World Congress on Computational Intelligence, Vancouver, Canada, July 16–21, 2006.

[43] J. Weng, Y. Zhang, and W. Hwang. Candid covariance-free incremental principal component analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(8):1034–1040, 2003.
[44] J. Weng, N. Zhang, and R. Ganjikunta. Distribution approximation: An in-place developmental algorithm for sensory cortices and a hypothesis. Technical Report MSU-CSE-04-40, Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, September 2004.

[45] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[46] S. Zeng and J. Weng. Obstacle avoidance through incremental learning with attention selection. In Proc. IEEE Conf. on Robotics and Automation, New Orleans, Louisiana, April 26 – May 1, 2004.

[47] N. Zhang and J. Weng. A developing sensory mapping for robots. In Proc. IEEE 2nd International Conf. on Development and Learning (ICDL 2002), pages 13–20, MIT, Cambridge, Massachusetts, June 12–15, 2002.

[48] N. Zhang and J. Weng. Sparse representation from a winner-take-all neural network. In Proc. of International Joint Conference on Neural Networks, Budapest, Hungary, 2004.