DEVELOPMENTAL LEARNING WITH APPLICATIONS TO ATTENTION, TASK TRANSFER AND USER PRESENCE DETECTION

By

Xiao Huang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2005

ABSTRACT

DEVELOPMENTAL LEARNING WITH APPLICATIONS TO ATTENTION, TASK TRANSFER AND USER PRESENCE DETECTION

By Xiao Huang

How to build an intelligent machine has fascinated researchers for more than half a century. To tackle this problem, scientists have taken one of four approaches: knowledge-based, learning-based, behavior-based and evolution-based. However, autonomous mental development (AMD) of robots did not receive sufficient attention. Motivated by animal and human autonomous mental development from infancy to adulthood, a developmental paradigm for robots has recently been proposed. With this paradigm, a robot develops its mental skills through real-time, online interactions with the environment. We call such a robot a developmental robot.

My work focuses on the motivational system and sensorimotor learning. The major contributions of this work include: (1) Proposed and implemented the Developmental, Observation driven, Self-Aware, Self-Effecting, Markov Decision Process (DOSASE MDP) model, which integrates multimodal sensing, action-imposed learning, reinforcement learning, and communicative learning. (2) Integrated novelty and reinforcement learning into a robotic value system (motivational system) for the first time. As an important part of AMD, a value system signals the occurrence of salient sensory inputs, modulates the mapping from sensory inputs to action outputs, and evaluates candidate actions. (3) Proposed and implemented the Locally Balanced Incremental Hierarchical Discriminant Regression (LBIHDR) algorithm as the engine of cognitive mapping for learning in non-stationary environments.

Based upon the above architecture and basic techniques, we have designed and implemented a prototype robot that learns the following cognitive behaviors: (1) Visual attention via novelty and rewards. We treat visual attention as a behavior guided by the value system. No salient feature is predefined; instead, novelty is based on experience, which makes the approach applicable to any task. (2) Covert perceptual capability development for vision-based navigation. An agent develops its covert capability via reinforcement learning through interactions with trainers. (3) Cross-task learning in developmental settings.
A developmental robot learns multiple tasks incrementally and uses acquired knowledge to speed up learning new tasks. (4) Audio/visual user presence detection. The developmental learning paradigm has been applied to a developmental agent, which detects human activities in an office using multimodal context information.

The work reported in this thesis serves as a starting point of the on-going research on developmental learning.

To my parents.

ACKNOWLEDGMENTS

I wish to thank my thesis advisor, Professor Juyang Weng, who has been the guiding force of my professional development. This dissertation would not have been completed without his insight, inspiration and encouragement. I am lucky to have such a distinguished and responsible professor as my advisor. I am also grateful to Prof. Ning Xi, Prof. Jim Zacks, and Prof. Rong Jin, for serving on my committee, sharing ideas with me, and providing critical comments on my thesis.

During my four years at Michigan State University, I was fortunate to benefit from several respected scholars. A special acknowledgment goes to Dr. Zhengyou Zhang at Microsoft Research for his guidance in the "user presence detection" project, which is a very interesting component of my dissertation. I would like to thank Prof. Anil Jain, who served on my committee for one year and provided me with a solid pattern recognition background.

I learned a lot from many of my colleagues. The early version of the HDR program provided by Wey-Shiuan Hwang was the starting point of the work reported in this thesis. Yilu Zhang, another early SAIL group member, gave me a jump-start on my research at Michigan State University. During the development of the SAIL project, I enjoyed many open discussions with my labmates: Micky Badgero, Yi Chen, Raja Ganjikunta, Jianda Han, Zhengping Ji, Ameet Joshi, Zain Kazim, Wei Tong, Shuqing Zeng, and Nan Zhang. We shared friendship and precious time together.

Thanks also go to current and former PRIPpies: Hong Chen, Rein-Lein Hsu, Xiaoguang Lu, Anoop Namboodiri, Arun Ross, and Umut Uludag, for sharing their wonderful ideas with me in the seminars, which opened a wider view over related research fields. Those wonderful group activities also made my Ph.D. study unforgettable.

It goes without saying that I owe everything to my dear parents, Laying Kang and Hefu Huang. Their endless love, care, patience, and tremendous support accompanied me throughout my whole life. Now that I am finally graduating, I wish to turn over the diploma to them as a present.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES

1 Introduction
1.1 Introduction to autonomous mental development
1.1.1 Traditional approaches
1.1.2 A new direction in AI — autonomous mental development (AMD)
1.1.3 Eight requirements of AMD
1.1.4 Comparison of approaches
1.2 Survey on animal learning
1.2.1 Nonassociative learning
1.2.2 Classical conditioning
1.2.3 Instrumental conditioning
1.2.4 Cognitive learning
1.3 New machine learning types
1.3.1 Classification of learning types
1.3.2 Comparison with traditional definitions
1.4 Thesis outline
2 DOSASE MDP Model
2.1 Introduction
2.2 An agent model
2.3 Observation-driven Markov Decision Processes
2.3.1 Observation-driven MDP
2.3.2 Observation-driven SASE MDP
2.4 System architecture
2.5 Major components
2.5.1 Sensory mapping
2.5.2 Complementary Candid Incremental Principal Component Analysis
2.5.3 The cognitive mapping: IHDR tree
2.5.4 Motor mapping
2.5.5 Innate value system
2.5.6 Sensorimotor algorithm
2.6 The SAIL robot
2.7 Summary

3 Locally Balanced Incremental Hierarchical Regression
3.1 Introduction
3.2 Locally balanced incremental hierarchical discriminant regression
3.2.1 Incremental hierarchical discriminant regression
3.2.2 Locally balanced tree and time smoothing
3.2.3 Algorithm
3.3 The experimental results
3.3.1 Experiments using synthetic data
3.3.2 Experiments using marketing data
3.3.3 Experiments using navigation image data
3.4 Conclusions

4 The Value System of A Developmental Robot
4.1 Introduction
4.1.1 The value system of a developmental robot
4.1.2 Innate value system
4.1.3 The requirements of value systems
4.2 Background
4.3 Neurobiological background for value systems
4.4 Computational model of value systems
4.4.1 Modeling operant conditioning
4.4.2 Modeling value systems
4.5 The value system of an AMD robot
4.6 System architecture
4.6.1 Incremental hierarchical discriminant regression
4.6.2 The value system
4.7 Simulation results
4.7.1 Habituation effect
4.7.2 Integration of novelty and positive reward
4.7.3 Increase novelty with a moving object
4.7.4 Suppress novelty with punishment
4.8 Experiments with SAIL robot
4.8.1 Novelty and multiple reinforcers for different actions
4.8.2 Boltzmann softmax exploration
4.9 Conclusions

5 Covert Perceptual Capability Development for Vision Based Navigation
5.1 Introduction
5.2 Problem Description
5.3 System Architecture of Robotic Cognitive Development
5.3.1 K Nearest Neighbor Prototype Updating Queue
5.3.2 Algorithm of Covert Capability Development
5.4 Experimental Results
5.4.1 Experiments Using Top-1 Prototype Updating Queue
5.4.2 Experiment Using KNN Prototype Updating Queue
5.5 Conclusions and Future Work

6 Cross-Task Learning
6.1 Introduction
6.2 Problem Description
6.3 Architecture of DOSASE MDP
6.3.1 Observation-driven state transition function
6.3.2 Online learning of one task through cognitive mapping
6.3.3 One-task learning algorithm by a developmental agent
6.3.4 Multiple task learning through DOSASE
6.3.5 Attention mechanism for the break point set
6.3.6 Cross-tasking learning algorithm
6.4 Experimental Results
6.4.1 Trajectory and transferred knowledge
6.4.2 Complexity reduced in terms of space and time
6.5 Conclusions

7 Adaptive Multimodal Context-Aware User Presence Detection
7.1 Introduction
7.2 System Overview and Challenges
7.3 System Architecture
7.3.1 Acoustic signal classification using IHDR
7.3.2 HMMs for motion pattern classification
7.3.3 Integration component
7.4 Experimental results
7.4.1 Experimental results of the auditory component
7.4.2 Experimental results of the motion component
7.4.3 Experimental results of the high-level reasoning
7.5 Summary and Discussion

8 Conclusions and Future Work
8.1 Contributions
8.2 Future directions

LIST OF FIGURES

1.1 Kitten carousel.
1.2 The AMD paradigm for machine agents.
1.3 An illustration of classical conditioning.
2.1 Observation-driven Markov decision process with dynamically generated states, context-driven state representation, and value-based action generation. Note: the state is not symbolic; it is a vector in a high-dimensional space.
2.2 The architecture of a multi-sensor multi-effector agent: observation-driven Markov decision process. Each square in the temporal streams denotes a smallest admissible mask. This architecture takes the entire image frame without applying any mask. The block marked with L is a set of context states (prototypes), which are clusters of all observed context vectors l(t).
2.3 The DOSASE MDP architecture for developmental learning.
2.4 A block diagram of the architecture of a sensorimotor system. The lower cognitive mapping realizes the reality mapping and the upper one realizes the priming mapping. GK: gate keeper, an internal effector to actively control the update of the last context.
2.5 The spatiotemporal organization of areas in the sensory mapping. The ellipses represent receptive fields covering both space and time. Areas are organized in a hierarchical way, and the output of an earlier (lower-order) neural area is used as input to the later (higher-order) neural area. Along the pathway of information processing, a neuron in a later area has a larger receptive field than one in an early area. In order to make the correct signal available at the right time for the right input line, we use a time-shifting technique. Acting as a time delay unit, the shifter moves each signal to the next line at each time instant.
2.6 Regions in the input space marked i.j, where i is the index of the parent region and j is the index of the child region. The leaves of the tree represent the finest partition of the space. The decision boundary of each region is of quadratic form.
2.7 The motor mapping as a reverse application of the sensory mapping, but with signal reconstruction.
2.8 An illustration of a simple motor mapping for a single effector. The left is a gating mechanism and the right is a subsumption mechanism.
2.9 The SAIL robot at Michigan State University.
2.10 The SAIL robot system diagram: left side view.
2.11 The SAIL robot system diagram: right side view.
2.12 The SAIL robot: the Eshed robot arm.
3.1 Y-clusters in space Y and the corresponding X-clusters in space X. Each sample is indicated by a number which denotes the order of arrival. The first- and second-order statistics are updated for each cluster. The first statistic gives the position of the cluster, while the second statistic gives the size, shape and orientation of each cluster.
3.2 A one-dimensional discriminant subspace is formed for two classes.
3.3 Locally balanced node. Two clusters are generated for class 1; a more discriminant subspace is formed.
3.4 Initial distribution, two clusters.
3.5 The balanced distribution, three clusters.
3.6 SAIL robot navigating through the corner of the corridor.
3.7 A subset of images which are inputs to guide the robot turn.
3.8 The timing recording for learning a set of 715 images.
4.1 The standard reinforcement-learning model (adapted from [54]).
4.2 The system architecture of SAIL experiments.
4.3 State and its primed contexts. Each state is associated with a list of primed contexts. Each primed context consists of a primed action, a primed sensation and a primed Q-value.
4.4 (a) Learning rate based on amnesic average (the parameters are shown in the title); (b) temperature of Boltzmann softmax exploration based on a Gaussian density model.
4.5 The GUI simulator. The arrow indicates the position and the viewing angle of the robot.
4.6 Habituation effect. On the left part, the 1st, 2nd and 3rd plots correspond to the Q-value of action 0, action 1, and action 2, respectively. On the right part, the frequency of actions in different time frames.
4.7 Integration of novelty and immediate reward.
4.13 Boltzmann Softmax Exploration: The total probability is 1 in the first plot. The probabilities of action 0, 1, 2 are plotted at the top, middle and bottom, respectively. The star denotes the random value received at that time. The second plot shows the temperature. .................... 5.1 (a) A pair Of images in vision based outdoor navigation with 6 attention win- dows; (b) Sub-image of one window and the type of road boundary (pl’s type is R1) .................................. 5.2 The system operation architecture for robotic cognitive development. The first- level learning modules generate the mapping from visual input to road boundary type while the second-level learning module generates the map- ping from road boundary type vector to heading direction for navigation. . 5.3 The system architecture of each learning module for robotic cognitive develop- ment. ................................... 5.4 Input sample pairs for navigation test ...................... 5.5 The Q-value of each action in one state using top-1 PUQ. .......... 5.6 Boltzmann Softmax Exploration: The total probability is 1 in the first plot. The probabilities of action 0, 1, 2, 3, 4 are plotted from top to bottom, respectively. ................................ 5.7 The Q-value Of each action. .......................... 5.8 Temperature of Boltzmann exploration. .................... 5.9 Boltzmann Softmax Exploration: The total probability is 1 in the first plot. The probabilities of action 0, 1, 2, 3, 4 are plotted from top to bottom, respectively. The “+” denotes the random value received at that time. The second plot shows the action sequence .................... 5.10 Retrieval time of each image for one tree ................... 6.1 Typical setting of cross-task learning. ..................... 6.2 The cross-task learning system architecture ................... 6.3 Observation-driven state transition function ................... xii 105 106 107 108 110 116 117 122 126 127 127 134 136 6.4 6.5 6.6 6.7 6.8 6.9 6.10 6.11 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 State space for 2 tasks .............................. Discriminate state transition using multi-modal information. ......... Simulation interface. .............................. A subset of input images. ........................... TIajectory of two tasks. ........................... Task label information. The plot shows the retrieved task label of each sample when testing on task 2. ..................... HDR tree structure. The first two plots Show the number of states in different layers for task 1 and task 2, respectively. The third plot shows the tree structure if cross-task learning is conducted. ..... Training time of each step for task 1. ................... Typical setting of a user presence detection system ............... Overview of the user presence detection system. ............... A detailed architecture of the user presence detection system .......... The procedure to extract F FT, Mel-F FT and MFCC features. ........ Different acoutsic signals: from top to down are background noise, conversation, phone ring and other noise (clapping). x-axis denotes the number of raw signal points, y-axis denotes the normalized energy. ............ The first row shows the video sequence, which is classified as “Moving near the door”; the second row shows the bounding box of the moving object. . Likelihood of motion patterns over time. ................... Likelihood of human activities over time. ................... 
LIST OF TABLES

1.1 Approaches to Artificial Intelligence
1.2 Eight types of learning
3.1 Financial data sets
3.2 Error rates of IHDR and HDR
3.3 Resubstitution and disjoint tests for two trips with different rejection measurements
3.4 Comparison of IHDR and LBIHDR
3.5 The tree structure of the navigation experiment
6.1 Transferred knowledge when testing on task 2
6.2 Complexity reduced in terms of space and time
7.1 Comparison of different features for acoustic signals
7.2 Confusion matrix of offline training (C=Conversation, P=Phone)
7.3 Confusion matrix after online training (C=Conversation, P=Phone)
7.4 Recognition rate of each auditory set (C=Conversation; UN=Uncertain Noise; P=Phone; S=Silence)
7.5 Recognition rate of human activities (N=Nobody; OA=Other activity; R=Rest; C=Conversation)

Chapter 1

Introduction

1.1 Introduction to autonomous mental development

"Can machines think?" The fundamental question in artificial intelligence (AI) raised by Alan Turing [103] has attracted researchers for about half a century. In order to tackle this mysterious problem, four basic approaches have been proposed: knowledge-based, learning-based, behavior-based, and evolution-based.

1.1.1 Traditional approaches

In his well-known paper in 1950 [103], Turing wrote: "one might have a complete system of logical inference 'built in'." "The store would be largely occupied with definitions and propositions. The propositions would have various kinds of status, e.g., well-established facts, conjectures, mathematically proven theorems, statements given by an authority, expressions having the logical form of proposition but not belief-value." This approach is later called the knowledge-based approach in the field of AI: it is the responsibility of the human programmer to program knowledge into the machine. Knowledge-based systems emerged in the 1970s. MYCIN [15], a typical example, is an interactive program that diagnoses certain infectious diseases, prescribes antimicrobial therapy, and can explain its reasoning in detail. These systems proved to be very successful in areas such as medical diagnosis and aviation control. In this paradigm a domain engineer finds out the rules to solve the problem and then designs data structures and algorithms. What a computer needs to do is to run the program.

By contrast, the learning-based approach allows machines to learn. Among the earliest learning-based systems, Samuel's checker player [81] was so successful that it was able to compete in some very strong human tournaments. In this framework, a human engineer proposes a learning model, and the parameters of the model can be obtained by training the system. The basic difference between the knowledge-based approach and the learning-based approach is: with the former, it is humans who supply the knowledge for the task at hand; with the latter, humans use input data to indirectly supply a part of the knowledge required for the task. How extensive the learned part is depends on the actual application task. However, the learning-based approach is still task specific — the learning algorithm is designed for a specific given task.
A great deal of human engineering effort is put into analyzing and understanding the given task. Then a procedure is defined by the human designer. He specifies:

1. which features should be used;
2. which task representation model should be used, so that the given task is reduced to determining a manageable number of parameters for the model;
3. what input training data are required to train the system;
4. what computational tool should be used, which determines how the input training data are used to determine the parameters of the task representation model.

Rodney Brooks proposed the behavior-based approach in the 1980s [13], trying to overcome some constraints of the knowledge-based approach. A major emphasis of the behavior-based approach is to avoid explicitly modeling the world in which a robot operates. Instead, programming for the robot concentrates on modeling robot behaviors. A behavior-based system starts with designing a set of basic behaviors instead of building a complete model of the world, which may not even exist in the first place. The system interacts with the world by manipulating and organizing these basic behaviors. In his well-known article "Intelligence Without Reason" [11], presented during his Computers and Thought Award speech at the International Joint Conference on Artificial Intelligence, 1991, Brooks summarized a number of key aspects characterizing this style of work.

Situatedness: The robots are situated in the world — they do not deal with abstract descriptions, but with the here and now of the world directly through their sensors and effectors.

Embodiment: The robots have bodies and experience the world directly.

Intelligence: The robots are observed to be intelligent — but the source of intelligence is not limited to just the computational engine. It also comes from the situation in the environment of which the robot is a part.

Emergence: The intelligence of the system emerges from the system's interactions with the world and sometimes from indirect interactions between its components — it is sometimes hard to point to one event or place within the system and to explain why an external action was manifested.

Unfortunately, behavior-based approaches also have fundamental limitations. It seems clear now that humans are unable to adequately describe and decompose complex behaviors, not to mention their interactions. Using this paradigm, the behaviors are designed manually in advance. As a result, for complicated systems such as COG, a human-shaped robot at MIT [12], the design of behaviors becomes a serious bottleneck.

Evolutionary approaches are motivated by the evolutionary process in biological systems. In the 1950s, several researchers independently studied simulation systems with the idea of solving engineering problems using evolution as a search tool. The idea in these studies was to evolve a population of candidate solutions to a given problem, using operators inspired by natural genetic variation and natural selection. John Holland [40] invented "genetic algorithms," which are potentially the most "knowledge-free" way of developing intelligent machines. A strategy does not need to be as complete as an algorithm, since the detail of actions in a strategy does not need to be specified. In principle, it is possible to use an evolutionary strategy to search for an intelligent machine "species" from scratch. However, both the cost and time required for such an evolutionary process are daunting.
Therefore, researchers have been using evolutionary ideas in a task-specific way.

From the above survey, we know that the traditional AI approaches generally follow a manual development paradigm:

1. Given a task, the human engineer analyzes it and translates it into representations and rules that a computer program may work on.
2. The human engineer writes a program that transforms the input information into the representation and follows the rules to control the machine.
3. The human engineer runs the program on the machine.

In this paradigm it is the human designer, not the robot, who understands the task. Even though some machine learning techniques might be used, the built-in representation and parameters defined by the human designer are task-specific. This manual development paradigm has met tremendous difficulties for tasks that require complex cognitive and behavioral capabilities, such as autonomous navigation, target finding, and human-robot interaction through gesture in unknown environments.

1.1.2 A new direction in AI — autonomous mental development (AMD)

Is it true that the human brain has built-in representations for the tasks that humans generally do? The psychologist Piaget proposed assimilation-accommodation as a model of cognitive development [74]. Assimilation involves the incorporation of new events into preexisting cognitive structures. Accommodation means that existing structures change to accommodate new information. By repeatedly accommodating to and assimilating novel environmental elements, cognitive development takes place; that is, the cognitive system changes its internal structure and gradually evolves with maturation and experience.

A classical experiment supporting this perspective was performed by Richard Held and Alan Hein [37] with kittens raised from birth in total darkness. The kittens (shown in Fig. 1.1) were placed in pairs in an apparatus called the "kitten carousel" when they were old enough to walk. The first kitten in the pair was harnessed to pull the carousel so that it could learn how what it sees changes due to its own actions. The other was carried in the gondola. It saw passively, and what it saw was determined by the first kitten's actions. The kittens did this for three hours every day and lived in darkness for the rest of the day. After 42 days these kittens were lowered onto the surface of a visual cliff. It was observed that the passive kittens did not develop the cliff avoidance behavior but the active ones did. This experiment demonstrated that passive movements are not sufficient for normal mental development.

Recent studies of brain plasticity in neuroscience have shown that a human brain is not as task-specific as commonly believed. For example, Mriganka Sur and his coworkers rewired the visual input to an animal's (ferret) auditory cortex early in life. The target tissue in the auditory cortex, which is supposed to take on auditory representation, was found to take on visual representation instead [92].

Figure 1.1: Kitten carousel.

Based on evidence in psychology and neuroscience, the new mental developmental paradigm for robots [29] is proposed.

Procedure 1 The new AMD paradigm is as follows:

1. Designing a robot body: According to the general ecological conditions in which the robot will work (e.g., on land or underwater), human designers determine the sensors, the effectors and the computational resources that the robot needs, and then the human designs a sensor-rich robot body. The computational resources include computer components such as the CPU, memory, disk, and the controllers of the sensors and effectors, etc.

2. Designing a developmental program: The human designer designs the developmental program for the robot. Since he does not know what specific tasks the machine will learn, the developmental program does not require any information about the tasks. However, the developmental program may take advantage of the sensors, effectors and the computational resources that have been previously chosen.
The computational resources include computer components such as the CPU, memory, disk, and the controllers of the sensors and eflectors, etc. 2. Designing a developmental program: The human designer designs the de- velopmental program for the robot. Since he does not know what specific tasks the machine will learn, the developmental program does not require any informa- tion about the tasks. However, the developmental pmgram may take advantage of the sensors, effectors and the computational resources that have been previ- 7 ously chosen. 3. Birth: The human operator turns on the robot whose computer starts to run the developmental program. 4. Developing the mind autonomously: Humans mentally “raise” the de- velopmental robot by interacting with it. The robot develops its mental skills through real-time, online interactions with its environment, including humans (e.g., let them attend special lessons). Human operators teach robots through verbal, gestural or written commands very much like the way parents teach their children. New skills and concepts are teamed by the robots daily. The software (brain) can be downloaded from robots of diflerent mental ages to be run by mil- lions of other computers, e.g., desktop computers. The human operator turns on the robot whose computer starts to run the developmental program. This AMD paradigm does not start with any specific task and, in fact, the tasks are unknown at the time of machine construction (or programming). The hallmark of the AMD paradigm is that the human engineer does not need to understand or even anticipate the tasks to be learned by the machine. Consequently, the task-specific representation must be generated autonomously by the robot itself, instead Of be— ing designed by the human programmer. Of course, the teachers who will instruct the machine to perform the tasks must understand the tasks, but the human engi- neer who programs the machine would not be required to. A robot built through the autonomous mental development paradigm is called a developmental robot. The developmental phase of this paradigm is Shown in Fig. 1.2. 8 . Given to Given to Given to Ecological conditions . . ‘ Given to Release ' . _ _ . Turn Training, Training, Training, ‘] on testing testing testing Sense, act, sense, act, sense, act.. Agent Construction & Autonomous Time programming —> 4—— development ———> base phase Figure 1.2: The AMD paradigm for machine agents. 1.1.3 Eight requirements of AMD Practical robotic AMD requires that the following eight technically challenging con- ditions are all dealt with: 1. Environmental Openness: Due to the task-nonspecificity, AMD must deal with unknown and uncontrolled environments, including various human environ- ments. 2. High-dimensional sensors: The dimension of a sensor is the number of sealer values per unit time. AMD must deal directly with continuous raw Signals from high-dimensional sensors, e.g., vision, audition and taction. For example, learning from a video camera is more difficult than learning from a laser range finder, because the former typically contains more data per unit time and the photo-electric information is affected by more factors than the range (distance). 3. Completeness in using sensory information. Due to the environmental openness and task nonspecificity, it is not desirable for a developmental program to dis- 9 card, at the program design stage, sensory information that may be useful for future, unknown tasks. 
Of course, its task-Specific representation, autonomously derived after birth, does discard information that is not useful for a particular task. . Online processing: At each time instant, what the machine will sense next depends on what the machine does now. Off-line processing is unable to accom- plish AMD. . Real-time speed: The sensory/memory refreshing rate must be high enough so that each physical event (e.g., motion and speech) can be temporally sam- pled and processed in real-time (e.g., about 15Hz for vision). Time consuming iterations must be avoided. . Incremental processing: This requires incremental processing. Thus, batch pro- cessing is not practical for AMD. Each new observation must be used to update the current complex representation and the raw sensory data must be discarded after it is used for updating. . Perform while learning: Conventional machines perform after they are built. An AMD machine must perform while it “builds” itself mentally. . Scale up to muddy tasks: For large perceptual and cognitive tasks, an AMD machine must be able to scale its capabilities up to handle multimodal con- texts, large long-term memory and generalization, and capabilities Of increasing maturity, all without catastrophic memory loss. 10 Table 1.1: Approaches to Artificial Intelligence Approach Species World System Task- architecture knowledge behavior specific Knowledge—based Programming Manual Manual Yes modeling modeling Learning-based Programming Modeling with Modeling with Yes parameters parameters Behavior-based Programming Do without Manual Yes representation modeling Evolutionary Genetic Modeling with Modeling with Yes search parameters parameters Developmental Partly Avoid Avoid No programming modeling modeling 1.1.4 Comparison of approaches What are the major differences between the new developmental approach and the existing approaches to making an intelligent machine? Table 1.1 outlines the major differences in terms of four categories: species architecture, world knowledge, system behaviors and task- specificity. From Table 1.1, we can see that the developmental approach stands on middle ground between two extremes: at one extreme, the agent is totally hand-programmed by human beings (the knowledge-based approach) and at the other extreme, the agent is constructed using the genetic search (the evolutionary approach)‘. The former extreme requires a large amount of human domain knowledge and, thus, is the most domain specific and ad hoc in nature. The latter extreme requires the least amount of human knowledge but requires a tremendous amount of computation time for evolving a sophisticated agent. The developmental approach liberates humans from explicit 1It is worth noting that in principle, the evolutionary approach does not necessarily require a task-specific representation. However, due to the complexity considerations, the current genetic al- gorithms have been used with a human designed, task-specific chromosomes representation and only a particular part of the evolving system is searched for genetically (e. g., Animate [122] , AutonoMouse [26] and the work by Steels [91]). 11 design of (a) any task-specific representation and knowledge and (b) system behavior representation, behavior modules and their interactions. However, the developmental program supplied at “birth” must be designed by human beings. The developmental approach is the first approach that is not task-specific. 
The task-nonspecific nature of the developmental approach implies a fundamental departure from the traditional ways of developing a machine. It allows a more sys- tematic and more tractable way of approaching artificial intelligence than all other existing approaches. This does not mean that the approach is easy. The design Of developmental program is definitely very challenging, both conceptually and tech- nically. However, Since a developmental program can be written without requiring explicit knowledge about the actual tasks in the applications, all the task-related ad hoc messy things —— things that are called muddy — are completely excluded from consideration at the time of programming. In fact, the designer of the developmen- tal program does not need to understand or know the endless muddy tasks that the machine will end up learning. The design problem becomes systematic, easier to un- derstand and relatively tractable. The miracle of learning various interesting tasks occurs after the programming, not before. This fundamental change in the way we approach artificial intelligence will produce a new kind of machine — developmental machine — that can demonstrate capabilities that no traditional machines have. However, we must be aware of that all of these capabilities cannot be realized overnight. Scaling up capabilities require a significant amount of developmental ex- perience, which in turn requires time. Furthermore, the developmental programs need repeated modifications and improvements before a high level of mental capabilities 12 can be developed and demonstrated. And since this thesis is a starting-point Of devel- opmental learning, we still adopt the successful components of traditional approaches, which will be shown in the following chapters. 1.2 Survey on animal learning Since the developmental learning paradigm is derived from psychology, neuroscience and cognitive science. It is important to know the progress in these fields, especially in animal learning. Let us first examine the two different definitions of learning: Definition 1.2.1 Learning is an adaptive change in behavior resulting from experi- ence. Compare it with the following definition [24] (page13): Learning is an enduring change in the mechanisms Of behavior involving specific stimuli and/ or responses that results from prior experience with similar stimuli and responses. The latter definition defines learning in terms of a change in the mechanisms Of behavior rather than a change in behavior itself. The main reason is that a change in behavior is determined by many factors in addition to learning. For example, whether you eat food or not depends on how hungry you are and how much you like the food. You start to eat food when you are hungry. After you have had enough, you stop eating. The second definition does not include this action change, from eating to not eating, as learning, but the first definition does. In other words, the 13 second definition does not include what is called non-associative learning explained below. In this thesis, I adopt Definition 1.2.1. Learning takes place all the time during development, whether you are eating, working, entertaining or sleeping. Every stimulus has a potential to change the behavior. 1.2.1 Nonassociative learning Habituation and sensitization are examples Of nonassociative learning. In habituation learning, a subject learns about the properties of a Single stimulus — such as a loud noise from a toy gun or a view Of a new toy. 
A number of autonomic changes are triggered in the human body: the heart beats faster and breathing becomes more rapid. If the same stimulus is repeated at a short interval or is sustained for a long period of time, these responses will abate. This is habituation, a type of learning that people experience routinely in one way or another. With habituation an animal learns about the properties of a benign stimulus, the result of which is a reduced response. With sensitization an animal learns about the properties of a harmful or threatening stimulus. A person startled by a gunshot is likely to jump at any loud noise heard minutes thereafter.

How are sensitization and habituation related? A mouse will be startled when it is exposed to a loud noise for the first time. As the same loud noise is repeated, the mouse will habituate to it and no longer respond. The startle response to the noise can be quickly restored by delivering a single electric shock to the feet of the mouse. Habituation and sensitization effects reflect how the agent ends up sorting out what to ignore and what to respond to. Groves and Thompson's dual-process theory [36] of sensitization and habituation assumes that different types of underlying neural processes are responsible for the increases and decreases in responsiveness to stimulation. Solomon's opponent-process theory [87] [86] assumes that an important function of the mechanisms that control emotional behaviors is to minimize deviations from emotional neutrality or stability, a level called homeostasis.

The neural basis of learning has been a very active subject of study in neural science. The weakening and strengthening of synaptic connections between sensory neurons and motor neurons (through interneurons) are found to be involved in habituation and sensitization learning in the limb withdrawal reflex of the cat and the withdrawal of the gills and siphons of the marine snail called Aplysia. (For a more detailed explanation of the study, the reader is referred to [90], pages 36-45.) Enhancement of synaptic connections is sufficient for short-term habituation and sensitization, where the same stimuli are repeated for a short period of time. In long-term non-associative learning, the stimuli repeat for days or weeks and the change of behavior also lasts much longer. This long-term change in behavior requires the synthesis of new proteins and the growth of new synaptic connections along the pathway from sensory neurons to motor neurons.

In nonassociative learning, the subject learns about the properties of a single stimulus by being exposed to it repeatedly. In associative learning, discussed below, a subject learns the relationship between two stimuli (classical conditioning) or the relationship of a stimulus to the subject's behavior (instrumental or operant conditioning).
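The qualitative pattern described above (responsiveness that decays under repetition of a benign stimulus and is restored by a strong aversive one) can be mimicked in a few lines. This is only a toy illustration; it is not the value-system model developed later in this thesis, and the decay and recovery constants are arbitrary.

```python
# Toy illustration of habituation and sensitization (not the thesis's model).
# A normalized startle response shrinks as the same benign stimulus repeats,
# and is restored to full strength by a strong aversive stimulus.

def startle_responses(stimuli, decay=0.7):
    response = 1.0
    trace = []
    for s in stimuli:
        if s == "noise":            # repeated benign stimulus -> habituation
            trace.append(response)
            response *= decay
        elif s == "shock":          # aversive stimulus -> dishabituation
            response = 1.0
    return trace

print(startle_responses(["noise"] * 5 + ["shock"] + ["noise"] * 3))
# [1.0, 0.7, 0.49, ...] then back to 1.0 after the shock
```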
Figure 1.3: An illustration of classical conditioning.

1.2.2 Classical conditioning

Systematic studies of classical conditioning began with the great work of the Russian physiologist Ivan P. Pavlov. According to Pavlov, what animals and humans learn is not the association of ideas, as suggested by Aristotle, but the association of stimuli.

Classical conditioning involves the association or pairing of two stimuli: an unconditioned stimulus (US), such as food or a shock to the leg, and a conditioned stimulus (CS), such as a tone or a light. The unconditioned stimulus always produces a response, called the unconditioned response (UR), such as salivation or leg withdrawal. It is called an unconditioned response because it is a hardwired (inborn) response to the unconditioned stimulus. Suppose that a tone (CS) is repeatedly followed by the presentation of food (US), which always elicits a salivation response (UR) from a horse. Then the tone (CS) alone starts to elicit salivation (CR) even without food (US), as shown in Fig. 1.3. Although classical conditioning is classified into appetitive conditioning (the US is appetitive) and defensive conditioning (the US is aversive), the US can be neutral, as long as the US is reliably linked with the UR, either hardwired or learned. The essence of classical conditioning is a means by which animals learn to predict relationships between two events in the environment — the CS predicting the US.

After the training session, if the tone is repeatedly presented without food, i.e., the CS is no longer followed by the US, the tone is no longer able to evoke salivation. This process is called extinction. The available evidence in animal studies indicates that extinction is not simply forgetting. During extinction, the animal learns something new: the conditioned stimulus no longer predicts that the unconditioned stimulus will occur; it instead predicts that the unconditioned stimulus will not occur. In our developmental robot, inconsistency with past experience (surprisingness) will raise the value of the current event, which enhances the memory process. The animal is paying attention to the delivery of the unconditional stimulus (US) following the conditional stimulus (CS). If the unconditional stimulus does not occur, the inconsistency with the experience is detected and the value system sends an increased value to enhance the memory process for the current event.

Another important phenomenon of classical conditioning is called blocking, demonstrated in a three-phase experiment conducted by L. J. Kamin in 1968 [56]. In phase 1, a light stimulus (CS A) is paired with an aversive unconditional stimulus, a strong electric shock, until the rats totally suppressed their lever pressing behavior (UR) whenever the light (CS A) was presented. Next, in phase 2, a new tone stimulus (CS B) was presented together with the light stimulus (CS A) and paired with the electric shock (US). Finally, in phase 3, the tone stimulus (CS B) was presented alone in a test trial to see if it would also suppress lever pressing (CR). Interestingly, less conditioned response (suppression) was observed than in the control group of rats (which skipped phase 1). In this example, CS B is blocked (partially) by CS A.

The idea that the surprisingness of an unconditioned stimulus determines its effectiveness in producing new learning was developed into a model by Robert Rescorla and Alan Wagner [80]. An event is surprising if it is different from what is expected. If the CS perfectly predicts the US, the rate of learning becomes zero. This is also consistent with our discussion of the value system, which depends on surprisingness.
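In the Rescorla-Wagner rule, the associative strength V of each CS present on a trial changes by dV = alpha * beta * (lambda - V_total), where lambda is the strength the US can support (for example 1 when the US occurs, 0 when it is omitted) and V_total is the summed strength of all CSs present; when the US is fully predicted, the error term, and hence learning, goes to zero. The sketch below reproduces the blocking result qualitatively. The trial structure and parameter values are illustrative assumptions, not the procedure of [56] or [80].

```python
# Rescorla-Wagner update: each CS present on a trial changes its associative
# strength by alpha * beta * (lam - V_total). Once the US is fully predicted,
# the error term is ~0 and a newly added CS (the tone) gains little strength.

def rescorla_wagner(trials, alpha=0.3, beta=1.0):
    V = {}                                      # associative strength per CS
    for present, us in trials:                  # present: set of CSs on the trial
        lam = 1.0 if us else 0.0                # strength supported by the US
        error = lam - sum(V.get(cs, 0.0) for cs in present)   # "surprise"
        for cs in present:
            V[cs] = V.get(cs, 0.0) + alpha * beta * error
    return V

phase1 = [({"light"}, True)] * 20               # light alone is paired with shock
phase2 = [({"light", "tone"}, True)] * 20       # tone added; shock is no longer surprising
print(rescorla_wagner(phase1 + phase2))         # the tone ends up nearly blocked
```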
A hungry rat is placed in a chamber which has a lever protruding from the wall. A press on the lever will release a food pellet into a cup. If pressing the lever is not something that the rat already does occasionally, it may never “discover” on its own what it has to do to Obtain the food. We can use a sequence of training steps called shaping that has been used by animal trainers for centuries. At first, food is given if the rat does anything remotely related to the desired response. For example, at first, the rat is given a food pellet each time it gets up on its hind legs anywhere in the chamber. Once the rearing response has been established, the rat iS given food only if it rears 18 over the lever. Once it is over the lever, the rat is given food only if the rat actually depresses the lever. With this training, whenever the rat is hungry and finds himself in the chamber, it is likely to press the lever to get food. Classical conditioning learning forms a predicate relationship between two stimu- lus (CS and US). Instrumental learning can be considered as a formation Of a predicate relationship between a stimulus and the animal’s response. Unlike classical condition- ing, where the conditional response is evoked by a specific conditional stimulus, the response in the instrumental conditioning is emitted without a recognizable stimulus (but the animal still uses some environmental cues, such as the view of the lever to evoke the behavior). In both types of learning, timing is critical. The reinforcer (food) should closely follow the operant response (pressing lever). If the time delay is tOO long, only weak conditioning is established. The neural mechanisms of classical conditioning and of instrumental conditioning are very similar. If the chain of events predicts a reward, the response is emitted. If a punishment is predicted, the corresponding response is suppressed. In classical conditioning, the chain of events is CS-US-UR (e.g., tone, food, salivation). With instrumental conditioning, the chain of events is E-R-UR, where E is environmental cues (in the chamber or the view of a lever) and R is the shaped response (pressing lever) and UR is the food reward. In both cases, if UR is a punishment, the animal will evoke avoidance behaviors. In classical conditioning, CS will suppress the animal’s ongoing activity or elicit the learned avoidance behavior (e.g., leg withdrawal). In instrumental conditioning, the animal would decrease the response rate or suppress the response that leads to the punishment. 19 To summarize what is necessary (but may not be sufficient) for nonassociative and associative learning, the following gives a list: 1. Biased sensors that define appetitive and aversive stimuli. 2. A value generation system. It decreases the output value when there is no unexpected inconsistency with the experience (lack of surprise or novelty). It increases the output value quickly when novelty is high or aversive stimuli are sensed. 3. An event prediction mechanism that is able to predict the chain of events (de- fined as sensorimotor signals) in time. 4. A response selection mechanism. If appetitive or avoidance of aversive stimulus is predicted, emit the corresponding response. If aversive stimulus is predicted or omission Of appetitive stimulus is predicted, suppress the response. 1.2.4 Cognitive learning The above types of animal learning does not cover a more complex learning, which is called cognitive learning in psychology. 
The word “cognition” comes from the Latin meaning “knowledge or thinking.” Cognitive learning covers more complex behaviors of human and animal learning. The subject is closely related to the mechanisms of organizing memory and forgetting. Timing learning refers to the ability of learning to estimate the duration of time. Although all the above types of learning require a proper duration of time for each 20 event, animals are able to estimate and distinguish the length of an event and respond accordingly. For example, pigeons (or rats) can be trained to peek a red key if a light is turned on for a short period of time (e.g., 2 seconds) and to peek a green key if the light is on for a long period of time (e.g., 10 seconds). During the learning, peeks on the correct key are reinforced (e.g., with food) (see, e.g., [107] [19]). In serial pattern learning, animals learn the order of events. A panel consists of five squares. Five distinct stimuli (e.g., dots, circles, and the like) were presented at the same time in the panel, each stimulus in one square. Let us call the stimuli A, B, C, D, E. The task for monkeys is to press the squares in the panel in the prescribed order: A, B, C, D, E. Training started with a presentation of stimulus A alone. After the monkeys learned to press A, B was added. After they learned the AB sequence, C was added, and so forth. After the monkeys learned the entire sequence, they were tested on subsets of two and three stimuli (AB, BC, CD, DE, ABC, BCD, DCE) and reached above-chance accuracy (see, e.g., [22]). Perceptual concept learning is the capability Of learning a concept. For example, although chairs have different colors, shapes and heights, we can all agree on what is a chair and what is not. The “chair” is an example of a perceptual concept. Pigeons have been trained to respond to the presence or absence of fish in underwater photographs [39], to the presence of the letter A as opposed to other letters of the alphabet in various fonts [66] and even to discriminate pictures of Monet from those of Picasso [108]. Language learning is probably the most complex cognitive Skill learned by humans and animals. In fact, many have assumed that the linguistic skill is so complex and 21 Specialized that it is uniquely human. Some assume that the ability of using language depends on certain innate processes that have evolved only in humans (see, e.g., Chomsky 1972 [18]). Some animals are intelligent enough to learn language when they are properly trained. Since they cannot speak a human language, the American Sign Language has been taught to Chimpanzees [33], gorillas [72], dolphins [38], sea lions [34] and African grey parrots [73]. In Allen and Beatrice Gardner’s study [33], their chimpanzee, named Washoe, learned to Sign well over 100 words. AS we can see above, the models Of animal learning have been studied based on specific types. A model that was designed to explain a type of learning is not appli- cable tO other types of learning. One Of the major goals of this volume is to present a unified developmental model that is capable of developing an architecture that can perform all major types of animal and human learning, including nonassociative learn- ing, classical conditioning, instrumental conditioning and the major forms of complex cognitive learning. 1.3 New machine learning types In AMD, the environment can affect the robot brain through three channels. The first channel is its sensors. The second channel is its effectors. 
Affecting an effector can be done by imposing a force on it, either by stopping or by changing its motion. The third way of affecting the brain of an animal is to directly change the internal state of the brain. For example, a neural surgeon can open the human skull to perform brain surgery. For machines, we can feed data directly into the program. These ways 22 of accessing the brain directly change the internal state of the brain. In summary, there are three channels through which the environment can affect an agent: sensors, effectors, and internal state. The sensors are divided into two classes, biased and unbiased. If the machine has a predefined preference pattern to the signals from a sensor at the birth time, this sensor is a biased sensor. Otherwise, it is an unbiased sensor. We have a total of four entities, u, b, i, and e, representing unbiased sensors, biased sensors, internal state, and effectors, respectively. In order to fully understand the meaning Of developmental learning, we investigate learning types from three entities, b, i, and e, representing biased sensors, internal state, and effectors, respectively. Depending on whether the action e in {b,i,e} is imposed or not, the learning can be classified into efi’ector-imposed learning and effector-autonomous learning. Effector-imposed learning is such that the effector ac- tion is supplied by the trainer. For example, when an adult teaches a child to hold a pen properly, the adult can place the pen into the child’s hand and manipulate the hand to hold it properly. This is effector-imposed learning. Otherwise, the learning is effector-autonomous learning. Depending on whether the biased sensor b in (b, i, e} is used or not, the learning can be classified into reinforcement learning and communicative learning. Reinforce- ment learning is such that a biased sensor is used to reinforce or punish certain responses from the machine agent. Communicative learning is such that biased sen- sors are not used and only unbiased sensors are. This requires the learner to correctly interpret the signal from unbiased sensors, an instruction for an action, an encourage- ment, or an explanation, etc. Learning by a human adult is conducted mostly in the 23 communicative learning mode, not in the reinforcement learning mode. For example, a student goes to a university everyday to earn his Ph.D. degree. The future of her Ph.D. degree does not give him any immediate physical pleasure. On the contrary, his studies at the university may often cause physical displeasure, such as fatigue. 1.3.1 Classification of learning types Thus, any type of learning can be represented by a 3-tuple (i,b, e), which contains three components i, b, and e, each of which can be either represented by 0 or au- tonomous 1. i = 1 means that the internal state is imposable and i = 0 state- autonomous. b = 1 indicates reinforcement and b = 0 communicative. e = 1 denotes effector imposed and e = 0 effector free. Thus, depending on what attribute each component takes (0 or 1) there are a total of 8 different 3-tuples, representing a total of 8 different learning types. For example, (i, b, e) = (1,1,0) means state-imposable, reinforcement effector-autonomous learning. This is the typical mode of reinforce- ment learning in the machine learning community. If we consider ibe as three binary bits of the type index number of learning, we have 8 types of learning defined below: We can also name each type. 
For example, Type 0 is state-autonomous, communicative, effector-autonomous learning. Type 7 is state-imposable, reinforcement, effector-imposed learning. All traditional learning methods correspond to Types 4 to 7, because the representation is designed after the task is given and so are the constraints on the internal representation. Autonomous development uses Types 0 to 3.

Table 1.2: Eight types of learning

Type (binary) | Internal state | Biased sensor | Effector
0 (000)       | Autonomous     | Not used      | Autonomous
1 (001)       | Autonomous     | Not used      | Imposed
2 (010)       | Autonomous     | Used          | Autonomous
3 (011)       | Autonomous     | Used          | Imposed
4 (100)       | Imposable      | Not used      | Autonomous
5 (101)       | Imposable      | Not used      | Imposed
6 (110)       | Imposable      | Used          | Autonomous
7 (111)       | Imposable      | Used          | Imposed

However, the state-autonomous nature is not sufficient to characterize the power of autonomous development, since merely realizing state-autonomy is not difficult and is not of much importance without the real power for dealing with muddy tasks.

1.3.2 Comparison with traditional definitions

The above definitions of learning types are different from traditional definitions for machine learning. The term supervised learning, often used in the machine learning and pattern recognition research communities, means that human teachers supervise the learner during the learning process. Under the strictest definition of unsupervised learning, all of the above learning modes are supervised by humans to some degree. It is difficult to identify any type of learning that is completely unsupervised. For a better understanding of the eight new learning modes, the explicit definitions in Table 1.2 are required beyond the coarse classification of supervised, unsupervised, and reinforcement learning that the scientific field is familiar with. According to the above refined definition, traditional supervised learning falls into Type 5, while traditional reinforcement learning belongs to Type 6. Thus, the traditional learning methods belong to state-imposable learning, Types 4 to 6, because humans must define the feature space after the tasks are given (thus state-imposable). If system states are used, the meanings of the internal states are manually assigned by the programmer after the tasks are given. Human (and animal) learning does not allow the teacher to exercise direct access to the brain (the natural counterpart of the learning program). Thus, all the modes of human and other animal learning belong to Types 0 to 3. During the early stages of human child development, Types 2 and 3 are often used and are effective. For instance, a mother rewards a well-behaving toddler with candy. Types 2 and 3 are less effective for human adults. For example, offering candies or slapping the face does not necessarily succeed in making a 30-year-old do what the teacher wants. Type 1 is often used for older children whose capability of communicative learning is still limited, or when the learner is a novice on the subject being taught. For example, a ballet or athletic teacher may often hold the limbs of a novice student in order to teach the student to master a technique. Type 0 learning is dominant in human learning. It occurs probably as early as the fetus stage. The majority of a human's time, whether as an infant, a child, an adolescent or an adult, is spent in the Type 0 learning mode. Reinforcement through biased sensors is relatively rare, compared to the time when no significant signals are received from the biased sensors.
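As a concrete illustration of the (i, b, e) encoding in Table 1.2, the short sketch below (hypothetical helper names, not part of any system described in this thesis) maps the three binary attributes to the corresponding type index and a readable name.

```python
# Sketch: encode the (i, b, e) learning-type tuple of Table 1.2.
# i: 1 = internal state imposable, 0 = state-autonomous
# b: 1 = biased sensor used (reinforcement), 0 = not used (communicative)
# e: 1 = effector imposed, 0 = effector-autonomous

def learning_type(i: int, b: int, e: int) -> tuple[int, str]:
    """Return the type index (0-7) and a descriptive name."""
    index = (i << 2) | (b << 1) | e          # read "ibe" as three binary bits
    name = ", ".join([
        "state-imposable" if i else "state-autonomous",
        "reinforcement" if b else "communicative",
        "effector-imposed" if e else "effector-autonomous",
    ])
    return index, name

print(learning_type(1, 1, 0))  # Type 6: traditional reinforcement learning
print(learning_type(1, 0, 1))  # Type 5: traditional supervised learning
print(learning_type(0, 0, 0))  # Type 0: the dominant mode of human learning
```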
We realize that the learning types applicable to autonomous development are Types 0 to 3. How are these types used in the actual development process? This is the major issue of this dissertation.

1.4 Thesis outline

The organization of this thesis is as follows. Chapter 2 proposes the Developmental, Observation driven, Self-Aware, Self-Effecting, Markov Decision Process (DOSASE MDP) model, which integrates multimodal sensing, action-imposed learning, reinforcement learning, and communicative learning. Chapter 3 is mainly concerned with building the cognitive mapping engine using the Locally Balanced Incremental Hierarchical Discriminant Regression (LBIHDR) technique. Chapter 4 presents our model for the value system of a developmental robot. Novelty and reinforcement learning are integrated into a value system for the first time. As an indispensable part of AMD, a value system signals the occurrence of salient sensory inputs, modulates the mapping from sensory inputs to action outputs, and evaluates candidate actions. The value system is applied to guide a robot's visual attention behavior. Chapter 5 presents a robot system that develops covert perceptual capability for vision-based autonomous navigation through online interactions with human trainers using reinforcement learning. This work is quite new compared to the dominant supervised learning techniques used in this domain. Chapter 6 provides the developmental agent with a cross-task learning capability. That is, an agent can learn multiple tasks and use acquired knowledge to learn new tasks. Chapter 7 presents a user presence detection system. Context information from multi-sensory inputs is integrated to infer a user's activities in an office. We design a layered architecture to model human activities at different granularities. The prototype system verifies the effectiveness of the developmental learning paradigm. Chapter 8 concludes with a summary of the contributions of this thesis. Some thoughts on future directions will be discussed.

Chapter 2 DOSASE MDP Model

In this chapter, we propose the Developmental, Observation driven, Self-Aware, Self-Effecting, Markov Decision Process (DOSASE MDP) model, which integrates multimodal sensing, action-imposed learning, reinforcement learning, and communicative learning. Two major components of the architecture, the cognitive mapping engine and the value system, will be discussed in Chapter 3 and Chapter 4, respectively. At the end of this chapter, we will briefly present the hardware architecture of the validation platform, the SAIL robot.

2.1 Introduction

Many different architectures have been proposed in the intelligent robot community. Robot perception and perception-based behavior generation have proved to be very difficult, especially in unknown or partially unknown environments. Some work on robot learning was motivated by human learning and development: from simple to complex. BAIRN (a Scottish word for "child") is a symbolic self-modifying information processing system used as an implementation for a theory of cognitive development [106]. Drescher [27] utilized the schema, a symbolic tripartite structure in the form of a "context-action-result," to model the ideas of child sensory-motor learning in Piaget's constructivist theory of cognitive development. Soar [59] and ACT-R [4] are two well-known symbolic systems with an aim to model cognitive processes, although cognitive development was not the goal of these efforts.
The model of Albus [2] takes sensors as input and produces control signals to effectors as output, thus allowing a finer numeric representation of sensory features when a specific task is given. The Finite State Machine (FSM) and its probability-based variant, the Markov Decision Process (MDP), are two general frameworks that have been used for conducting autonomous learning with a given goal in a symbolic world (e.g., Shen [85]) or reinforcement learning (e.g., Kaelbling et al. [54]). The explanation-based neural network, called "lifelong learning," was used by Thrun [99]. Designing a program and its representation in a task-specific way using a traditional approach is typically complex, ad hoc, and labor intensive. The resulting system tends to be brittle, especially in unknown and uncontrolled environments. Recently an important direction of research aims at reducing or avoiding the human-imposed limitations (e.g., features, models) of the world for better environment adaptation in learning. Cresceptron [113] is a system that allows a human teacher to interactively segment natural objects from complex images, through which it incrementally grows a network that performs both recognition and segmentation. SHOSLIF [45], SARCOS [105], Cog [14], and Kismet [9] are motivated by simulating infant skills via learning through interactions with the environment. The efforts discussed above were motivated by human learning and cognitive development to various degrees. However, these proposed architectures are fundamentally different from human learning and have not reached autonomous mental development (AMD) [119]. The SAIL robot [111] and the Darwin V robot [3] are two prototypes of developmental robots. Darwin V was designed to provide a concrete example of how the computational weights of neural circuits are determined in a controlled environment by the behavioral and environmental interactions of an autonomous device. The SAIL developmental robot was designed for developing perceptual and behavioral skills through interactions in uncontrolled, complex human environments. There has been a lack of systematic theory for developmental mental architecture. This chapter introduces a theory of mental architecture appropriate to autonomous development. Section 2.2 formulates the developmental agent model. The Developmental, Observation driven, Self-Aware, Self-Effecting, Markov Decision Process (DOSASE MDP) model is introduced in Section 2.3. Sections 2.4 and 2.5 present the architecture and the major components of the SAIL robot, respectively. A detailed description of SAIL is given in Section 2.6. Conclusions are drawn in Section 2.7.

2.2 An agent model

To be precise in our further discussion, we need mathematical notation.

Definition 2.2.1 Context: A context c(t) of an agent is a stochastic process. It consists of two parts, c(t) = (x(t), a(t)), where x(t) denotes the sensory vector at time t, which collects all signals sensed by the agent at time t, and a(t) is the effector vector consisting of all the signals sent to the effectors of the agent at time t.

Definition 2.2.2 Length of context: The context related to the agent from a previous time t1 up to a later time t2 is a realization of a stochastic process {c(τ) | t1 ≤ τ ≤ t2}. The length of the context is L = t2 − t1.

Definition 2.2.3 Given an agent at time t1, suppose that the agent produces different action contexts a1 and a2 from two different contexts C1 = {c(t) | t1 ≤ t ≤ t2} and C2 = {c(t) | t1 ≤ t ≤ t3}, respectively.
If a1 and a2 are considered different by a social group (human or robot), conditioned on C1 and C2, then we say that the agent discriminates the two contexts C1 and C2 in the society. Otherwise, we say that the agent does not discriminate C1 and C2 in the society.

Definition 2.2.4 The last context at time t is the recent part of history, l(t) = {(x(τ), a(τ)) | τ = t − k, ..., t − 1, t}. The last context consists of a short segment of recent experience of temporal length k + 1. Currently in our work, k does not change with t, and a different k is designed for different sensors and levels.

Definition 2.2.1 (Environment) The internal environment of an agent is the brain (or "the central nervous system") of the agent. The external environment consists of all the remaining parts of the world, including the agent's own body (excluding the brain).

Having defined the external and internal environments, we define the sensors and effectors associated with these concepts.

Definition 2.2.2 (Internal sensors and effectors) An external sensor Se is a sensor that senses the external environment. An internal sensor Si is a sensor that senses the internal environment. An external effector Ee is an effector that acts on the external environment. An internal effector Ei is an effector that acts on the internal environment.

Definition 2.2.5 A state s(t) is the internal representation of the context in an agent's internal information environment.

2.3 Observation-driven Markov Decision Processes

According to the above definitions, an agent views the world as a general random process. The state gives a more compact representation than a simple concatenation of last contexts. Due to sensory uncertainty and partial observation, the state is also a random vector. The state at time instance t depends on both the last context l(t) and the last state s(t − 1). Therefore, we can recursively generate the state with

s(t) = f(s(t − 1), l(t)), (2.1)

where the function f generates states using feature derivation, temporal chunking, and vector quantization. The recursive estimation process can be modeled by a Markov decision process (MDP), where the state transition follows a conditional distribution,

P(s(t) = s | s(t − 1) = s', l(t) = l), (2.2)

and the representation s(t) here is a high-dimensional vector. The most probable state at time t is given by

s*(t) = arg max_{s ∈ S} P(s(t) = s | s(t − 1) = s', l(t) = l), (2.3)

where S is the set of all possible states, which is infinite. In a practical implementation, we approximate the set S by a finite number of prototype states. These prototype states are incrementally merged cluster centers of many experienced state vectors. In practice, the maximization in Eq. (2.3) is too computationally expensive. We use an IHDR tree to approximate it, taking advantage of the numerical vector representation of s(t) and l(t).
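The following sketch illustrates, under simplifying assumptions, how Eqs. (2.1)-(2.3) can be approximated with a finite set of incrementally merged prototype states. The blend-and-match rule, the novelty threshold and the nearest-neighbor retrieval are illustrative stand-ins for the feature derivation, temporal chunking and IHDR-based retrieval described above; they are not the thesis implementation.

```python
import numpy as np

class PrototypeStates:
    """Approximate Eqs. (2.1)-(2.3) with a finite, incrementally grown set of
    prototype state vectors (a crude stand-in for the IHDR retrieval)."""

    def __init__(self, novelty_threshold: float, blend: float = 0.5):
        self.prototypes = []          # cluster centers of experienced state vectors
        self.counts = []              # number of samples merged into each prototype
        self.tau = novelty_threshold
        self.blend = blend

    def update_state(self, prev_state: np.ndarray, last_context: np.ndarray) -> np.ndarray:
        # Eq. (2.1): s(t) = f(s(t-1), l(t)); here f is a simple blend of the
        # previous state and the last-context vector (same dimensionality assumed).
        s = self.blend * prev_state + (1.0 - self.blend) * last_context
        if not self.prototypes:
            self.prototypes.append(s.copy())
            self.counts.append(1)
            return s
        # Eq. (2.3): pick the most probable prototype; nearest neighbor stands in
        # for the probability maximization.
        dists = [np.linalg.norm(s - p) for p in self.prototypes]
        j = int(np.argmin(dists))
        if dists[j] > self.tau:       # unfamiliar context: spawn a new prototype
            self.prototypes.append(s.copy())
            self.counts.append(1)
            return s
        # Merge s into the matched prototype (incrementally updated cluster center).
        n = self.counts[j]
        self.prototypes[j] = (n * self.prototypes[j] + s) / (n + 1)
        self.counts[j] += 1
        return self.prototypes[j]
```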
2.3.1 Observation-driven MDP

An agent has a number of sensors and effectors. Fig. 2.2 illustrates a multi-sensor multi-effector model of an agent. The agent A(t) operates at equally spaced discrete time instances t = 0, 1, .... We assume that an image is produced at each time instance by the sensor, independent of the sensing modality (visual, auditory, touch, etc.). Without loss of generality, we assume that the agent has two external sensors and two external effectors.

Figure 2.1: Observation-driven Markov decision process with dynamically generated states, context-driven state representation, and value-based action generation. Note: the state is not symbolic. It is a vector in a high-dimensional space.

Each external sensor Sei, i = 1, 2, senses a random multidimensional sensory frame xe(t) = (xe1(t), xe2(t)) at each time instance t, and the sensed signal is fed into the agent. Each external effector Eei, i = 1, 2, receives from the agent an effector frame ae(t) = (ae1(t), ae2(t)) at each time instance t. Let xt ∈ X and pt ∈ P be the observations and outcome covariates (i.e., random vectors) at time t, respectively. Note that we change a variable of a vector to its subscript (e.g., change x(t) to xt) when it is more convenient to consider the variable as a discrete index number. Let Ht be the random vector of the history: Ht = {xt, xt−1, ..., x0, pt−1, ..., p0}. At time t, the agent A(t) needs to estimate the distribution of P(pt | Ht = h). If t is large, Ht is too large to be practical and it contains much information that is not very related to the outcome.

Definition 2.3.1 A mental architecture is called a k-th order observation-driven Markov Decision Process (MDP) [21] if Ht is restricted to contain only the last k observations: lt = {xt, xt−1, ..., xt−k, pt−1, ..., pt−k}, as shown in Fig. 2.2.

Figure 2.2: The architecture of a multi-sensor multi-effector agent: observation-driven Markov decision process. Each square in the temporal streams denotes a smallest admissible mask. This architecture takes the entire image frame without applying any mask. The block marked with L is a set of context states (prototypes), which are clusters of all observed context vectors l(t).

The random observations lt across time t = 0, 1, ... are the source from which the agent automatically generates states in the form of clusters l ∈ L, where L consists of all possible observations of the last contexts, L = {Ht | 0 ≤ t}. We define the state s(t) as in Eq. 2.1, which means that the state is also a k-th order MDP. The predicted consequence pt consists of the predicted action at and the predicted value vt, pt = (at, vt). A major difference between a regular MDP [54, 93] (or HMM [79]) and an observation-driven MDP is that the states st of a regular MDP are hand-designed by a human programmer, while the states of an observation-driven MDP can be automatically generated (developed). With a regular MDP, the programmer must provide initial estimates for the prior probability distribution P(s0), the state transition probability P(st | xt, st−1) and the state observation probability P(xt | st). This, in turn, requires that the human programmer establish a correspondence between the meanings of the physical events being modeled and the states. Because physical events are not known at programming time for a developmental robot, the regular MDP is not suited to the developmental program. In contrast, the observation-driven MDP requires no prior probability, and all the probability distributions P(pt | lt = l) can be estimated incrementally on the fly.
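The claim that P(pt | lt = l) can be estimated incrementally, without any prior distributions, can be illustrated with a simple frequency-counting sketch. The explicit lookup table below is illustrative only; the thesis estimates this mapping with an IHDR tree rather than a table.

```python
from collections import defaultdict

class PrimedContextModel:
    """Incrementally estimate P(p | l) by counting, for each context cluster,
    which primed contexts (action, value) have followed it so far."""

    def __init__(self):
        # cluster id -> {primed context -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, cluster_id: int, primed_context: tuple) -> None:
        # One experienced transition: context cluster l followed by outcome p.
        self.counts[cluster_id][primed_context] += 1

    def distribution(self, cluster_id: int) -> dict:
        # Current estimate of P(p | l = cluster_id), built on the fly.
        outcomes = self.counts[cluster_id]
        total = sum(outcomes.values())
        return {p: c / total for p, c in outcomes.items()} if total else {}
```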
Another important difference between the states in the traditional MDP and in the observation-driven MDP is the very different nature of their representation. The states in the former correspond to some objects or events in the world by human hand-design. The entire set of states is a monolithic representation of the modeled part of the world. That results in the high brittleness of the agent in dealing with unexpected events. In contrast, the states in the latter are clusters of observational vectors. They are not monolithic in the sense that an object in the world can correspond to many state clusters. Further, each cluster may correspond to an observation of several objects in the world. Therefore, there is no strict one-to-one correspondence between a state in the observation-driven MDP and an object of the world. It is the behavior of the agent (e.g., the same apple-picking behavior from different views of the apples) that shows the discrimination and generalization power of the agent. In practice, we implement the regressor R using Incremental Hierarchical Discriminant Regression (IHDR) [116, 115]. Given any observed (last) context l(t), the system generates the current state s(t) using Eq. 2.1; the regressor R produces multiple consequences (primed contexts) p1(t), ..., pk(t) with a high probability:

{p1(t), ..., pk(t)} = R(s(t)). (2.4)

Thus, the regressor R is a mapping from the space L of last contexts to the power set of P:

R: L → 2^P. (2.5)

R is developed incrementally through real-time experience. For any t > 0 (after the birth), it is a total function, since it is defined for all elements in L, but it does not do well for most elements in L that it has not experienced. It is not an onto function, since its range covers only a very small part of 2^P. Therefore, we need a value system that selects desirable contexts from the multiple primed ones. The value system V(t) takes a set of (e.g., k) contexts from the regressor R and selects a single context:

V(R(s(t))) = V({p1(t), p2(t), ..., pk(t)}) = pi(t), (2.6)

where 1 ≤ i ≤ k and k varies according to experience. The value function selects the consequence pi(t) that has the best value vi(t) in pi(t) = (ai(t), vi(t)). For example, V({p1(t), p2(t), ..., pk(t)}) = pi(t) if i = arg max{v1(t), v2(t), ..., vk(t)}. The real-time Q-learning algorithm [110] can be used to estimate the value of each consequence pi(t), i = 1, 2, ..., k, and the agent selects the one (action) with the highest value. Therefore, the value system V is a mapping from the power set of P to the space P:

V: 2^P → P. (2.7)
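A minimal sketch of the action selection of Eq. (2.6), together with the kind of real-time Q-learning update mentioned above, is given below. The learning rate and discount factor are illustrative choices, not values specified in this thesis.

```python
def select_primed_context(primed):
    """Eq. (2.6): pick the primed context p = (action, value) with the largest value."""
    return max(primed, key=lambda p: p[1])

def q_update(q, state, action, reward, next_values, alpha=0.1, gamma=0.9):
    """One real-time Q-learning step for the value attached to (state, action).
    next_values: values of the primed contexts returned for the next state."""
    old = q.get((state, action), 0.0)
    target = reward + gamma * (max(next_values) if next_values else 0.0)
    q[(state, action)] = old + alpha * (target - old)

# Usage sketch: pick the best of k primed contexts, then refine its value.
primed = [("turn-left", 0.2), ("go-straight", 0.7), ("turn-right", 0.1)]
best_action, best_value = select_primed_context(primed)
```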
2.3.2 Observation-driven SASE MDP

As defined in Weng [112], a Self-Aware and Self-Effecting (SASE) agent not only has external sensors Se and external effectors Ee that sense and act upon the external environment (including the robot body), but also has internal sensors Si and internal effectors Ei that sense and act upon the internal representation of the brain (not including the body). In neuroscience, there is no concept of internal sensors and effectors. The brain does not need a receptor to sense the signals in the brain, since the brain signals are already in the desirable form. The concept of internal sensors and effectors is introduced to facilitate computational understanding of the SASE agent.

Definition 2.3.2 (Awareness) If an agent A senses the states (presence, absence, different forms) of an object b through its sensors, we say that the agent A senses the object b. If the agent is able to recall the association between the sensed states of b and their resulting effects in a task domain D sensed by the agent, we say that the agent is aware of object b in the task domain D.

By definition, an agent must use its sensors, the entry point of its sensory architecture (the input of T), to sense an object. For example, if the state of an object is fed into the architecture, e.g., through the middle of the regressor R, the motor mapping M or the value system V, the agent does not sense the object by this definition, because the agent cannot use such information properly as it does with its sensors. In the above definition of awareness, we consider a task domain D because any awareness has a scope. A person who is aware of the boiling temperature of water in one domain (e.g., in a normal environment) may not necessarily be aware of the boiling temperature of water in another domain (e.g., lower in a low-pressure environment). In the definition, we require that awareness must associate the various states of b with the corresponding resulting effects. For example, if a human were not aware of the correspondence between various states of gravity (presence, absence, strong gravity) and the resulting effects (e.g., on a falling object), he would not be aware of the object. With the above definition in hand, we are able to deal with the issue of self-awareness and self-effecting.

Theorem 2.3.1 (Necessary conditions of awareness) Suppose an agent is aware of its mental activities in a domain. Then the following points must be true: (1) It senses such activities using its sensors. (2) It feeds the sensed signal into its perceptual entry point just as it does for external sensors. (3) It recalls the association between the different statuses of the activities and the resulting effects on the environment.

Proof: Point (1) is true because, according to the definition of awareness of an object, the agent must sense the object using its sensors. Point (2) is true because the status of the object must be sensed and fed into the entry point for sensors for proper perception and effect recall. Point (3) is true because the definition of awareness requires the recall of such an association. □

Based on the previous theorem, let us examine the issue of self-awareness. If an agent runs a Q-learning algorithm (or any algorithm for that matter) but it does not sense the algorithm using its sensors, which are linked to its entry point for sensors, the agent is not aware of its own algorithm. For the same reason, humans do not sense the way their primary cortex works and, therefore, normally they are not aware of their own earlier visual processing. This early processing is subconscious, in the sense that it does not require a conscious decision. However, the voluntary part of the mental decision process does require a conscious, willful decision. Therefore, in the architecture design, the parts that require voluntary decisions must be sensed by the agent, and the sensed signals must enter through the entry point of the sensors.

Definition 2.3.3 The DOSASE architecture is based on the observation-driven MDP mental architecture, but additionally, the internal voluntary decision is sensed by the internal sensors Si and the sensed signals are fed into the entry point of sensors, i.e., the entry point of the attention selector T.
In order to recall the effects of the voluntary actions, not only is the expected reward value estimated by the value system, but so is the primed context, which includes not only the primed action but also the primed sensation.

Figure 2.3: The DOSASE MDP architecture for developmental learning.

The architecture illustrated in Fig. 2.3 is the DOSASE MDP architecture. Two voluntary internal actions are modeled: by Ei1 for attention selection, and by Ei2 for action release. Both internal actions are sensed by the internal (virtual) sensors Si1 and Si2, respectively. The external action (not yet released) is sensed by the virtual internal sensor Si3. Note that when the action is released, it is sensed as an external action from Ee1 and Ee2. The primed sensation (which predicts the sensation of Se1 and Se2) is used by the value system in selecting the best action.

Figure 2.5: The spatiotemporal organization of areas in the sensory mapping. The ellipses represent receptive fields covering both space and time. Areas are organized in a hierarchical way, and the output of an earlier (lower-order) neural area is used as input to a later (higher-order) neural area. Along the pathway of information processing, a neuron in a later area has a larger receptive field than one in an earlier area. In order to make the correct signal available at the right time for the right input line, we use a time-shifting technique. Acting as a time delay unit, the shifter moves each signal to the next line at each time instant.

... locations and sizes is nearly complete in the sense that it does not lose important information.

2.5.2 Complementary Candid Incremental Principal Component Analysis

Complementary Candid Incremental Principal Component Analysis (CCIPCA) is a stochastic approximation algorithm that estimates eigenvalues and eigenvectors iteratively. It is the major tool used to implement the sensory mapping. PCA is a well-known technique in data compression and feature extraction. It gives a linear transform that converts a set of d-dimensional data into a lower-dimensional space by minimizing the error in the least mean square (LMS) sense. Therefore, PCA is used to generate new representations from sensory inputs. A well-known approach to PCA is to solve an eigensystem problem. Given A as the sample covariance matrix of the data set, one can find its eigenvectors and eigenvalues, sort the eigenvalues in descending order, and construct a k x d matrix T with the rows being the eigenvectors corresponding to the largest k eigenvalues. The matrix T is the sought transform [31]. This is the basic idea behind most techniques for PCA, such as the QR method [35]. However, since this approach requires an estimate of the covariance matrix, the data set usually needs to be completely available at computation time. This is not appropriate for a developmental robot for two reasons. First, a developmental robot is an online agent, which senses and responds to the environment continuously.
It should not wait until all data is accumulated before doing the processing. Second, when the dimension of the data is high, both the computation and storage complexity grow dramatically. For example, in the eigenface method [57] [104], one of the promising face recognition methods that involves PCA, a moderate grey-level image has 88 rows and 64 columns, which results in a 5632-dimensional vector. Since the sample covariance matrix of a data set of d-dimensional random vectors contains d(d + 1)/2 independent numbers, this amounts to 15,862,528 numbers! [120] proposes an algorithm, called complementary candid incremental PCA (CCIPCA), to incrementally compute the principal components of sequentially arriving samples without estimating the covariance matrix. CCIPCA has been shown empirically to be a very efficient estimation algorithm compared with SGA and GHA for high-dimensional data [127].

2.5.3 The cognitive mapping: IHDR tree

The R and F mappings in Fig. 2.4 are implemented by two Incremental Hierarchical Discriminant Regression (IHDR) trees [47]. Learning a function f: X → Y in real time for high-dimensional data remains a nontrivial problem, particularly when the complexity of the function to be estimated is unknown. By surveying the literature on regression in statistical learning, one can identify two classes of methods: global model fitting methods and local model fitting methods. Local model fitting methods (e.g., IHDR) use temporally and spatially localized (thus computationally efficient) models and, in the meanwhile, increase the complexity automatically (i.e., the number and organization of local models) to account for the nonlinearity and the complexity of the problem. They are more suited for incremental real-time learning, especially in situations where there is limited knowledge about the scene and the robot itself needs to develop the representation of the scene in a generative and data-driven fashion. IHDR generates local models to sample the high-dimensional space X × Y sparsely, based on the presence of data points, in a vector quantization (VQ) [58] manner. IHDR enjoys two nice properties. First, IHDR automatically derives discriminating feature subspaces in a coarse-to-fine manner from the input space X. Discriminating features are derived automatically at the internal nodes of the tree. The features are most discriminative in the sense that they maximize the trace (or the determinant) of the between-class scatter. In this way, input components that are irrelevant to the mapping's output are disregarded to achieve better discrimination and generalization.

Figure 2.6: Regions in the input space marked i.j, where i is the index of the parent region and j is the index of the child region. The leaves of the tree represent the finest partition of the space. The decision boundary of each region is of quadratic form.

Second, IHDR organizes its local models in a hierarchical way, as shown in Fig. 2.6. IHDR's tree structure recursively excludes many far-away local models from consideration (e.g., an input face does not search among nonfaces); thus, the time to retrieve and update the tree for each newly arrived data point x is O(log(n)), where n is the size of the tree, i.e., the number of local models. This extremely low time complexity is essential for real-time learning with a very large memory.
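A toy sketch of this kind of coarse-to-fine retrieval is given below. It descends by the nearest cluster center at each level and answers with the nearest stored prototype in the reached leaf. The real IHDR uses probability-based distances in automatically derived discriminating subspaces, so this is only an illustration of why the retrieval cost grows roughly logarithmically in the number of local models.

```python
import numpy as np

class Node:
    """One node of a simplified coarse-to-fine retrieval tree (IHDR-like):
    an internal node keeps a few cluster centers, each leading to a child;
    a leaf keeps locally stored prototypes (x, y)."""

    def __init__(self):
        self.centers = []      # np.ndarray cluster centers
        self.children = []     # one child Node per center (empty if leaf)
        self.prototypes = []   # list of (x, y) pairs kept in a leaf

    def is_leaf(self) -> bool:
        return not self.children

def retrieve(root: Node, x: np.ndarray):
    """Descend by nearest cluster center, then return the output of the
    nearest prototype in the reached leaf. Each step discards all subtrees
    except one, which is the source of the logarithmic retrieval time."""
    node = root
    while not node.is_leaf():
        dists = [np.linalg.norm(x - c) for c in node.centers]
        node = node.children[int(np.argmin(dists))]
    xs, ys = zip(*node.prototypes)   # assumes the leaf holds at least one prototype
    nearest = int(np.argmin([np.linalg.norm(x - xp) for xp in xs]))
    return ys[nearest]
```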
2.5.4 Motor mapping

A major goal of the motor mapping is to generate a concise representation for stereotypical motor trajectories, so that skilled actions that involve many motors can be represented by high-level motor primitives in a lower-dimensional space. As shown in Fig. 2.7, the architecture of the motor mapping is very similar to the spatiotemporal sensory mapping, with signal flow in two directions: one for action space learning and one for skilled action synthesis.

Figure 2.11: The SAIL robot system diagram: right side view.

Figure 2.12: The SAIL robot: the Eshed robot arm.

The Eshed controller, the PTU controllers and the ADR 2000 A/D converter are connected to an AccelePort 8e multiport serial adapter from Digi International Inc., which enables serial communication with the host computer. The "eyes" of the SAIL robot are Panasonic GP-KS152 industrial color 1/2-inch CCDs with auto gain control and auto white balance. Two Matrox Meteor II video cards are used for real-time image capturing and digitization. The "ear" of the SAIL robot is a WMS-PRO wireless microphone system. The wireless microphone operates within a range of 250 feet. It is connected to a sound card (Creative Sound Blaster 16) on the host computer. Currently, the host computer is a Xeon 2.2 GHz dual-processor workstation with 1 GB RAM. This allows real-time memory recall and update as well as real-time effector control. The monitor is a ViewSonic ViewPanel VPA138 14-inch LCD display, which is flat, light and easy to carry. The power of the SAIL robot is provided either by heavy-duty batteries or by line power. An automatic power system, APS 750 from TRIPP LITE power protection, is used to switch between these two sources.

2.7 Summary

This chapter introduces the Developmental, Observation driven, Self-Aware, Self-Effecting, Markov Decision Process (DOSASE MDP) model for developmental learning. The system architecture is presented; its major components are the sensory mapping, the cognitive mapping with its value system, and the motor mapping. In the next two chapters, we will focus on the cognitive mapping and the value system.

Chapter 3 Locally Balanced Incremental Hierarchical Regression

In this chapter, we propose the Locally Balanced Incremental Hierarchical Discriminant Regression (LBIHDR) algorithm as the engine of cognitive mapping for learning in non-stationary environments.

3.1 Introduction

Decision trees (class labels as outputs) and regression trees (numerical vectors as outputs) organize the data in a hierarchy so that retrieval time can be logarithmic, which is essential for the AMD paradigm. Traditionally, classification and regression trees use a univariate split at each internal node, as in CART [10], C5.0 [77] and many others. This means that the partition of the input space by each node uses a hyper-plane that is orthogonal to an axis of the input space X. OC1 [67] and the SHOSLIF tree [96] are two methods for constructing oblique trees. OC1 uses iterative search to find a split. SHOSLIF uses principal component analysis (PCA) and linear discriminant analysis (LDA) to directly compute splits. The Hierarchical Discriminant Regression (HDR) algorithm [47] casts both classification problems and regression problems into a unified regression problem. Virtual labels in the output space provide membership information for forming clusters in the input space.
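For contrast with univariate splits, the sketch below computes a two-class oblique split direction of the form D = Sw^(-1)(m1 − m2), the kind of LDA-style direction used by SHOSLIF/HDR-type trees and appearing again in Eq. (3.4) of Section 3.2.2. The regularization term is an illustrative safeguard, not part of the original algorithms.

```python
import numpy as np

def oblique_split_direction(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Two-class discriminant direction D = Sw^(-1) (m1 - m2).
    x1, x2: arrays of shape (n_samples, dim) for the two clusters."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    # Within-class scatter: sum of per-class scatter matrices.
    sw = np.cov(x1, rowvar=False) * (len(x1) - 1) + np.cov(x2, rowvar=False) * (len(x2) - 1)
    sw += 1e-6 * np.eye(sw.shape[0])   # regularize for high-dimensional, correlated data
    d = np.linalg.solve(sw, m1 - m2)
    return d / np.linalg.norm(d)

# A sample x can be routed by the sign of d . (x - 0.5 * (m1 + m2)),
# giving an oblique hyperplane instead of an axis-orthogonal one.
```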
If the training samples are images and video data, the problem becomes harder. These applications give rise to high-dimensional data with strong correlation among input components. CART, C5.0, OC1 and other published tree classifiers are not designed for highly correlated high-dimensional data. The problem becomes even harder if the learning must be incremental. The incremental regression problem is as follows: an unknown function y = f(x) generates a stream of input-output pairs (x1, y1), (x2, y2), ..., where xi is the input to the system and yi is the output from the system. The goal is to approximate the function f incrementally by a regressor fi, where fi is based on the previous approximator fi−1 and the new sample (xi, yi), i = 1, 2, .... Fig. 3.1 shows the procedure. In AMD, the regression technique has to satisfy all eight requirements of AMD. Generally, the incremental hierarchical discriminant regression (IHDR) algorithm by Hwang and Weng [49] grows the regression tree incrementally with the eight requirements as the design goal. However, IHDR faces several challenging problems. First, some data sets have high-dimensional data in the input space while there are only a few labels in the output space. Some financial and marketing data have only two labels; for instance, usually there are two kinds of customers in the market: satisfied or not. In this case, the dimensionality of the discriminant subspace of IHDR is only one, which is not enough to partition a high-dimensional input space.

Figure 3.1: Y-clusters in space Y and the corresponding X-clusters in space X. Each sample is indicated by a number which denotes the order of arrival. The first- and second-order statistics are updated for each cluster. The first statistic gives the position of the cluster, while the second statistic gives the size, shape and orientation of each cluster.

Second, in real-time learning, the statistics of a segment of the sensory stream are not stationary, but are typically very biased. Consider vision-based navigation. If the robot learns by trying first in a straight corridor, almost all incoming samples are associated with the control signal "Go straight." A node must spawn children if the number of stored prototypes in it hits a certain limit. Otherwise, the retrieval time increases linearly. This situation of biased statistics is also true of human mental development [29]. The above problems are very challenging and have not been addressed by the existing literature. We propose a novel node self-organization and spawning strategy which generates a more discriminant subspace by forming multiple clusters for one class and balancing the number of samples in each cluster. The requirements for real-time and incremental learning are satisfied too.

3.2 Locally balanced incremental hierarchical discriminant regression

3.2.1 Incremental hierarchical discriminant regression

Many problems like content-based retrieval, vision-based navigation and intelligent data engineering can be formulated as a complicated function which maps high-dimensional input and the current state to low-dimensional output signals. We use a decision tree to approximate this function. In order to build such a tree, two types of clusters are incrementally updated at each node of the IHDR algorithm: y-clusters and x-clusters [47]. The y-clusters are clusters in the output space Y and x-clusters are those in the input space X.
There are a maximum of q clusters of each type at each node. The idea is that each node models a region of the input space using q Gaussians. For each new sample (x, y), the y part finds the nearest y-cluster in Euclidean distance and updates the center of that y-cluster. This y-cluster indicates which corresponding x-cluster the input (x, y) belongs to. Then the x part of (x, y) is used to update the statistics of that x-cluster (the mean vector and the covariance matrix). Each node keeps up to q x-clusters. The centers of these q x-clusters are denoted by

C = {c1, c2, ..., cq | ci ∈ X, i = 1, 2, ..., q}. (3.1)

In online learning, the robot updates the tree's statistics incrementally for each new training example. The average of n + 1 input examples x1, x2, ..., xn+1 can be recursively computed from the current input xn+1 and the previous average x̄(n) by equation (3.2):

x̄(n+1) = ((n − p(n)) / (n + 1)) x̄(n) + ((1 + p(n)) / (n + 1)) xn+1, (3.2)

where p(n) is a parameter. If p(n) > 0, the new input gets more weight than old inputs. We call this implementation the amnesic average. The covariance matrix can be updated incrementally by using the amnesic average too:

Γx(n+1) = ((n − 1 − p(n)) / n) Γx(n) + ((1 + p(n)) / n) (xn+1 − x̄(n+1))(xn+1 − x̄(n+1))^T. (3.3)

If the node is mature enough (the number of samples hits a certain limit), it spawns children, which means the region of space is modeled by a finer Gaussian mixture. Thus, a coarse-to-fine approximation of the probability distribution of the training samples is realized. A more detailed description of the algorithm can be found in [49].
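A minimal sketch of the amnesic updates of Eqs. (3.2) and (3.3) is shown below, assuming a constant amnesic parameter and a small clamp for the first few samples; it is illustrative, not the IHDR source code.

```python
import numpy as np

class AmnesicStats:
    """Incrementally track the amnesic mean (Eq. 3.2) and covariance (Eq. 3.3)
    of a vector stream; p > 0 gives newer samples slightly more weight."""

    def __init__(self, dim: int, p: float = 1.0):
        self.n = 0
        self.p = p
        self.mean = np.zeros(dim)
        self.cov = np.zeros((dim, dim))

    def update(self, x: np.ndarray) -> None:
        n = self.n
        # Clamp the amnesic parameter for the first few samples so all weights
        # stay non-negative (an illustrative guard, not part of the thesis).
        p = min(self.p, max(n - 1, 0))
        # Eq. (3.2): amnesic mean.
        self.mean = ((n - p) / (n + 1)) * self.mean + ((1 + p) / (n + 1)) * x
        # Eq. (3.3): amnesic covariance, using the freshly updated mean.
        if n >= 1:
            d = (x - self.mean).reshape(-1, 1)
            self.cov = ((n - 1 - p) / n) * self.cov + ((1 + p) / n) * (d @ d.T)
        self.n += 1
```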
Spawning children requires distribution of micro-prototypes which could slow down the system response without a carefully designed spawning strategy. In order to tackle these problems, we proposed a new node self-organization and spawning strategy. Suppose there are two classes in a task, say c1 (“+”), 02 ( “-"), as shown in Fig. 3.3. The number of micro—prototypes of each class is N,, i = 1,2. The number of clusters in a node is q. If q > 2 (say q = 3), we try to allocate some micro-prototypes of CI to 68 those clusters, which are not in use. Each class in the old tree could be mapped to multiple clusters in the balanced tree. For the ith class, the algorithm will generate q,’ clusters in the new node. I Ni (12' = (17V: (35) where N, = total number of micro-prototypes in this leaf node. In Eq. 3.5, the q; could be float. Eq. 3.6 converts it to an integer. q; = [Z qgi — 221‘;- (3.6) where i = 1, 2, ..., Ny, Ny is the number of classes of training samples. ‘q‘; is defined by Eq. 3.7, which guarantees that at least one cluster is generated for each old class. I Eli = max{1, Qi} (3-7) Fig. 3.3 shows the balanced node. Originally, there were 2 clusters. Now, there are 3 clusters. Two clusters are formed for c1. A two-dimensional subspace passes the head tips of the three center vectors: m1, m2 and m3. The scatter vectors are defined by the following equation: where 2' = 1, 2, 3, fit is the mean of all center vectors. Let S be the set that contains 69 Figure 3.3: Locally balanced node. Two clusters are generated for class 1. a more discriminant subspace is formed. these scatter vectors: S = {silt = 1, 2, 3}. The discriminant subspace spanned by S, denoted by span(S), consists of all the possible linear combinations from vectors in S. By applying Gram-Schmidt Orthogonalization (GSO), we can get two orthogonalized discriminant vectors DI and D2. Obviously, more discriminating features are derived from the new node. We should note that the algorithm only balances the micro- prototypes in each node, while the entire tree may not be strictly balanced. So this is a local balancing method. After the node is balanced, the statistics of the new structure must be generated. However, in real-time navigation, the updating frequency is around 10Hz. If the robot calculates the statistical information in one time slot, it will get stuck in the road, 70 which is not allowed. In order to smooth the updating time for each new sample, the algorithm keeps two sets of statistics. The old statistic is used to build the tree. At the same time, the new balanced statistic is updated incrementally after the old statistic is half-mature (the number of micro-prototypes in the node hits half of the limitation). When the new statistics is mature, the old one is thrown away and micro-prototypes are redistributed and updated based on the new structure. This smoothing strategy works very well for real-time navigation. When the number of micro—prototypes hits the limitation N, a node should spawn children based on the new statistics. The performance cannot degrade after spawning. This can be resolved by redistributing micro-prototypes. 3.2.3 Algorithm Procedure 2 Update-tree: Given the root of the tree and sample (:r, y), update the tree using (1:, y). 1. From the root of the tree, update the node by calling Update-node. 2. For every active cluster received, check if it points to a child node. If it does, explore the child node by calling Update-node. 3. Do the above steps until all leaf nodes are reached. 4. 
Each leaf node keeps micro-prototypes (23,-, 3],) that belong to it. If y is not given, the output is 3],- if :5,- is the nearest neighbor among these micro-prototypes. 5. If y is given, do the following: If ”1' — 113,-“ is less than certain threshold, (any) 71 updates (33,-, 37,-) only. Otherwise, (it, y) is a new sample to keep in the leaf. 6. If the leaf node is half-mature, call Balance-node. 7. If the number of micro-prototypes hits the limitation required for estimating statistics in the new children, the leaf node spawns q children and is frozen. The micro-prototypes are redistributed through multiple steps. Procedure 3 Update-node: Given a node N and (Lay), update the node N using (2:, y) incrementally. 1. Find the top matched :r-cluster in the following way. If y is given, do (a) and (b); otherwise do (b). (a) Update the mean of the y-cluster nearest y in Euclidean distance. In- crementally update the mean and the covariance matrix of the :t-cluster corresponding to the y-cluster. (b) Find the nearest III-cluster according to the probability-based distances. Up- date the z-cluster if it has not been updated in (a). Mark this central z-cluster as active. 2. Return the chosen as-cluster as active cluster. Procedure 4 Balance-node: When the total number of micro-prototypes in this leaf is equal to half of the limitation, then balance the micro-prototypes of each cluster. Ny is the number of classes in the old node. 1. For each y, ifq g Ny, do y-space clustering. 72 2. Otherwise, (q > Ny), sort X-clusters according to the number of micro- prototypes (increasing). 3. Calculate the number of clusters for each old class by using Eq. ( 3. 6) and Eq. ( 3. 7). 4. Conduct K-mean algorithm [28/ to generate new clusters. Reallocate micro- pmtotypes to each new cluster. 5. Calculate new statistics for the balanced node. 3.3 The experimental results In order to show the effectiveness of the new algorithm, we applied our method to two kinds of problem: financial data engineering and vision-based navigation. First, we tried synthetic data. 3.3.1 Experiments using synthetic data The motivation of using synthetic data for testing is to investigate the behavior of balancing a tree. This experiment used a data set that has two clusters. The number of dimension is two. Each cluster are modeled by a Gaussian distribution. The centers of the clusters are at (—2,-2) and (2,0), respectively. q is set to 3. The covariance matrix are: 10 62 73 One cluster has 20 micro-prototypes; another cluster has 40. Fig. 3.4 shows the initial distribution of these two clusters. Fig. 3.5 shows the balanced distribution, Figure 3.4: Initial distribution, two clusters. now we have 3 clusters. Obviously, the micro-prototypes are balanced in the new -2- 1 0 Figure 3.5: The balanced distribution, 3 clusters. node. The center of each cluster is specified by a star. 74 3.3.2 Experiments using marketing data Two kinds of marketing data were used to test the algorithm (Tab. 3.1). The first data set is about the ranking of decision making. Totally there are 12404 samples. Two thirds of all samples were used for training. The remaining samples were used to for testing. The input dimensionality is 12 and there are 16 rankings of decision making. The second data set is the ranking of customer satisfaction. Only two types of labels are provided: satisfied or not. 
Table 3.1: Financial data sets Data Training Testing Input Classes set samples samples dimensions Datal 9306 3098 12 16 Data2 1768 590 18 2 We compared the error rate of the proposed LBIHDR algorithm with HDR, IHDR and other methods. As we can see from Tab. 3.2, if the number of classes is larger (data set 1), then the LBIHDR has the same performance as that of IHDR. However, data set 2 shows that the new algorithm improves the recognition rate. In this experiment, q = 6 and N; = 100. The the dimensionality of discriminant space is five while that of the original IHDR is only one. More discriminant subspace is formed and the error rate reduces from 12.92% to 8.22%. 3.3.3 Experiments using navigation image data We applied the Locally Balanced IHDR algorithm to vision-based indoor navigation on our SAIL robot, as shown in Fig. 3.6. The robot is trained online on second floor 75 Table 3.2: Error rates of IHDR and HDR. Figure 3.6: SAIL robot navigating through the corner of the corridor. in the Engineering building of our University. In the training phase, at each time interval. the robot accepts a pair of stereo images and a control signal, incrementally updates its context which contains past sensory inputs and actions, and then updates the statistics of the decision tree. In the testing phase, a pair of input images is used to retrieve the best match. The associated control signal of the match is the output. This is a very challenging problem, which approximates the mapping from high dimensional input space into a low dimensional output space by using a decision tree and all the computations have to be performed in real time (about 10Hz). Some of the example input images are shown in Fig 3.7. 76 Figure 3.7: A subset of images which are inputs to guide the robot turn. The robot is trained around the corridor. In each trip there are about 5000 thousand pairs of images and corresponding control signals. In table 3.3, a total of 5443 color stereo images with the corresponding heading directions were used for training (one trip). The resolution of each image is 30 by 40. The input dimensionality of the IHDR algorithm is 30 x 40 x 3 x 2 = 7200, where 3 is the length of context. We used the other 5764 (another trip) stereo images to test the performance of the trained system. There are three kinds of output control signals: “Go Straight”, “Left”, “Right”. We performed both resubstitution and disjoint tests with different rejection measurements. The parameters are defined as follows: q = 2, the limitation of the number of samples in each node is 200. In table 3.3, R stands for resubstitution test, D stands for disjoint test. The performance is very good for such a large high dimensional data set. The rejection measure determines whether a coming sample is used to update the tree or not. If the distance between the coming sample and the 77 retrieved micro-prototype is less than certain measure, the sample would be rejected. For example, if the rejection measure is 0, then all samples will be saved in the tree. In this case, the tree is about 130Mb. If the rejection measure is 2000, the tree is only 45Mb and the retrieval speed is faster. Table 3.3: Resubstitution and disjoint tests for two trips with different rejection measurements. Rejection Training Correct Correction measure set number rate 0 (R) 5443 5443 100% 2000 (R) 5443 5389 99.01% 0 (D) 5764 5434 94.28% 2000 (D) 5764 5412 93.88% The performance is improved significantly compared with the existing IHDR [49]. As shown in Tab. 
3.4, in the resubstitution test, the recognition rate of LBIHDR can reach 100%. In the disjoint test, the recognition rates of LBIHDR is 94% while IHDR can only reach 87%. The retrieval time is also less because the new algorithm balances the distribution of samples in each node. The result testifies that LBIHDR works very well for nonstationary data like visual images in navigation. The tree Table 3.4: Comparison of IHDR and LBIHDR. Method Resubstitution Disjoint Testing test Test time (mm) IHDR 92.72% 87.35% 28.6 LBIHDR 100% 94.28% 15.7 structure of the navigation experiment is shown in Tab. 3.5. The left part is the number of nodes in each layer; the left part is the number of samples in each layer. 78 Because the number of samples is balanced in each internal node, the tree will not grow too deep (only four level) that helps build up more reliable statistics in each node. Table 3.5: The tree structure of the navigation experiment. Layer No. of nodes No. of samples 1 1 0 2 8 335 3 48 2527 4 64 2702 In order to guarantee that IHDR can be implemented in real time, we applied the time smoothing algorithm. Fig. 3.8 records the updating time for each sample in different steps. Here only shows the time profile of 715 samples. The first row shows the total learning time of each sample. The second row shows the updating time except doing local batch and redistribution. Time records of calculating new statistics and redistribution are shown in third and fourth rows respectively. When the node is half mature (i.e, the node already has 100 micro-prototypes), the new statistics are computed through multiple steps; when the node is mature (i.e, the node has 200 micro—prototypes), micro-prototypes are redistributed. At these moments, the learning time reaches peaks. However, with the time smoothing algorithm, the updating rate still could reach about 3—4Hz, which is extreme important for online learning. 79 f: l l .l g . E .2 . _’ 0 1 1 i n 1 L J. 0 50 100 200 300 400 500 600 700 800 a . I I I T T I I .E a E El- : 0 My A4 .r-L.~...JWLP 0 50 100 200 300 400 500 600 700 800 g . r 1 v v v .9 g g t- 0 0 J L 1 L 1 L L h 1 8 0 50 100 200 300 400 500 600 700 800 '5 , r r w r r r T 3 g E .‘L’ l- g 0 _____1 IL 41 4 w .L. _ _L___flL____..- m 0 1 200 300 400 500 600 700 800 Step Figure 3.8: The timing recording for learning a set of 715 images. 3.4 Conclusions In order to solve the major problems due to the nonstationary nature of input-output experience and fewer labels in output space with high-dimensional input space, a new node self-organization and spawning strategy is proposed. Partition of the input space is carried out without relying on a sufficient number of output clusters. Multiple clusters could be generated for one class. The number of samples in each cluster is balanced. Thus, more discriminant subspace can be formed. The performance of the tree is stable before and after children spawn. The distribution of computational load over time during spawning avoids a significant slow down of the system response time. The experiments with marketing data and the real-time, incremental learning of vision-based navigation showed that the new locally balanced IHDR technique performed well in the different applications. 80 Chapter 4 The Value System of A Developmental Robot In this chapter, we propose a value system based on novelty and reinforcement learn- ing will be proposed. The working of the value system is demonstrated by an exper- iment that producing visual attention controls. 
4.1 Introduction

A developmental agent faces a continuously changing stream of input data. If it sits there and does nothing, how can it develop mentally? We therefore have to ask why a developmental robot should do anything in the first place: "Something has to start the process in the first place. Something has to motivate infants to look, to reach, to mouth, to seek out information about their worlds" [98]. This issue also arises in the case of a developmental robot: what drives agents to explore their environment and learn about it? How do we model this motivation? Psychologists have addressed this issue by introducing the concept of intrinsic motivational forces such as "drives." Piaget [74] suggested that the need to adapt to the environment drives infants' behaviors. The essence of these proposals is that this internal driving force builds biases into the agent. These biases make the agent prefer one behavior over another. That is, they encode what is of value to an agent.

4.1.1 The value system of a developmental robot

An indispensable part of a developmental program is its value system, which signals the occurrence of salient sensory inputs, modulates the mapping from sensory inputs to action outputs, and evaluates candidate actions. It provides the motivation that drives mental development. Since a developmental robot is not task specific, without a value system it simply does nothing. This is very different from a traditional robot, which is designed to perform a specific task only: the task-specific representation and rules have already been programmed into the robot. The value system of the central nervous system of a robot at its "birth" time is its innate value system. It develops further and continuously throughout the robot's "life" experience. A value system at a particular mental age after birth is called a developed value system.

4.1.2 Innate value system

The innate value system of a developmental robot is designed by the programmer. It includes the following two aspects: 1. innate spatial bias; 2. innate temporal bias. The term "spatial" here means that different sensory elements are located at different locations on the robot body. The term "temporal" means that the spatial bias changes with time. A pain signal from a pain sensor is assigned a low value and the signal from a sweet taste is assigned a high value; this is the innate spatial bias. If a pain sensor continuously sends signals to the brain for a long period of time, a newborn does not feel the pain as strongly as when it was sensed for the first time; this is the temporal bias.

4.1.3 The requirements of value systems

The challenges of designing and implementing the value system for a developmental robot include: (1) It must be applicable to all possible unknown tasks. For example, it is not possible to predefine salient features, because salient features for one task (e.g., motion in intruder detection) may not be salient for another task (e.g., driving). (2) It must be applicable to different maturation stages. For instance, playing with toys is interesting for youngsters but not for adults. (3) It must provide guidance for behaviors all the time, in every fraction of a second, since the robot's actions are updated at this high temporal frequency. Reinforcers (e.g., sweet tastes and pain senses) are not sufficient for a value system; a developmental robot mostly lives during times when a search for such reinforcers is not a goal (e.g., while a child plays).
(4) The value system must work with a real-time system that incrementally grows and updates its representation and memory through interactions with an open, complex, real physical world. Without this, the value system is unable to deal with increasingly complex tasks in unpredictable environments.

4.2 Background

4.3 Neurobiological background for value systems

Neuroscience studies have shown that the value system corresponds to a basic function of the multiple diffuse ascending systems of the vertebrate brain [83]. The detailed mechanisms of the value system and its development are mostly unknown, although some characterizations of this system are available. An adaptive organism must be able to predict future events such as the presence of mates, food, and danger. Predictions give an animal time to prepare behavioral reactions and can help improve the choices an animal makes in the future. Some theories of reward-dependent learning suggest that learning is driven by the unpredictability of the reward and sensory input. Motivational values arise either through innate mechanisms or, more often, through learning. In this way, rewards help to establish value systems for behavior and serve as key references for behavioral decisions. Schultz [82] described the basic reward process in the brain.

Experiments on monkeys' brains show that information about rewards seems to be processed in different forms by neurons in different brain structures, ranging from the detection and perception of rewards, through the expectation of imminent rewards, to the use of information about predicted rewards for the control of goal-directed behavior. Here is one description of how the brain processes reward: "Only a limited number of brain structures seem to be implicated in the detection and perception of rewards and reward-predicting stimuli. The midbrain dopamine neurons report the occurrence of reward relative to its prediction and emit a global reinforcement signal to all neurons in the striatum and many neurons in the prefrontal cortex. The dopamine reward response might be the result of reward-detection activity in several basal ganglia structures, the amygdala and the orbitofrontal cortex. Some neurons in the dorsal and ventral striatum, orbitofrontal cortex, amygdala and lateral hypothalamus discriminate well between different rewards, which might contribute to the perception of rewards, be involved in identifying particular objects and/or provide an assessment of their motivational value."

Many researchers have focused on dopamine neurons and their relation to reward prediction and the value system. Montague and his colleagues [64] developed a theoretical framework that shows how dopamine systems could distribute to their targets a signal that represents information about future expectations. In particular, they show how activity in the cerebral cortex can make predictions about the future receipt of reward and how fluctuations in the activity of neurons in diffuse dopamine systems above and below baseline levels would represent errors in these predictions that are delivered to cortical and subcortical targets.

4.4 Computational model of value systems

Based on abundant results in neuroscience, psychology and animal learning, scientists have tried to build value systems into autonomous agents.

4.4.1 Modeling operant conditioning

In the last three decades, a number of computational models of operant conditioning have been produced.
This approach provides a way of programming agents by reward and punishment without needing to specify how the task is to be achieved. Typically, in reinforcement learning [54], statistical techniques and dynamic programming methods are used to estimate the utility of actions. A standard model is shown in Fig. 4.1.

Figure 4.1: The standard reinforcement-learning model (adapted from [54]).

An agent is connected to its environment via perception and action. In each step of interaction the agent receives an input i, some indication of the current state s of the environment; the agent then chooses an action a to generate as output. The action changes the state, and the value of this state transition is communicated to the agent through a reinforcement signal r. The agent's behavior B should choose actions that tend to increase the long-run sum of values of the reinforcement signal. The basic idea is to learn what to do (how to map situations to actions) so as to maximize a numerical reward signal through a trial-and-error procedure. More elaborate models are TD(λ) and Q-learning [95] [110]. We should note that all of these models only consider the interaction between an agent and the external world; the internal sensors and effectors, which sense and act on the internal world (the brain), are not discussed.

4.4.2 Modeling value systems

Although reinforcement learning is a breakthrough in machine learning, it is not enough for modeling all the behaviors of animals. Researchers have tried to apply value systems to model more complex adaptive behaviors of robots such as chaining, foraging, etc. Touretzky [102] extended reinforcement learning to model some functions of operant learning, such as chaining. Mataric's work [62] shows foraging behaviors of mobile robots, where foraging is constructed from routines such as wandering, homing, aggregation and dispersion. However, both of these cases use symbolic methods, which means the human trainers define the representation for the robots; that is not applicable to developmental robots.

A developmental perspective on neural models of learning

Ogmen [68] proposed a developmental perspective on the neural network model of learning. The choices that the network makes regarding which objects to approach are not based on minimization of a fixed, predetermined, global cost function. Instead, the choices are based on three interacting sets of criteria: reinforcement, novelty, and habit. The model combines design principles such as adaptive resonance, opponent processing, and lateral inhibition. In this framework the interplay between reinforcement, novelty, and habit is dynamic; which of these factors is more important depends heavily on the environmental context in which the network (or robot) finds itself. Even the stimulus criteria that maximize reinforcement can change dynamically with the context. However, only a simple simulation experiment is reported in the paper, and the work is not directly applicable to complex high-dimensional sensory signals with distributed representation.

Modeling dopamine neurons

Much evidence, reviewed by Schultz [82], suggests that dopamine (DA) cells in the primate midbrain play an important role in reward and action learning. As a neurotransmitter, dopamine affects brain processes that control movement, emotional response, and the ability to experience pleasure and pain.
Electrophysiological studies in both instrumental and classical conditioning tasks support a theory in which DA cells signal a global prediction error for summed future reward in appetitive conditioning tasks, which is very similar to a temporal difference (TD) learning model. In this model of primate dopamine neurons, their phasic activity reports a prediction error for future reward. Dayan [55] addresses two important sets of anomalies: generalization and novelty. Generalization responses are treated as the natural consequence of partial information; novelty responses are treated by the suggestion that dopamine cells multiplex information about reward bonuses, including exploration bonuses and shaping bonuses. They interpret this additional role for dopamine in terms of its mechanistic attentional and psychomotor effects, which guide exploration. However, they provided neither a source of information nor a computational model to measure novelty. Another point to note is that dopamine neurons are only part of the value system in the brain; more comprehensive models are needed to capture the complex functionality of value systems.

4.5 The value system of an AMD robot

Studies in animal learning testify that different reinforcers (e.g., sweet tastes and pain senses) carry different values in an animal's value system. However, computational studies of reinforcement often model rewards as a single value delivered from a separate reward channel [94] [89]. This single-value modeling facilitates understanding and simplifies computation, but primed sensation (what is predicted by the robot) has been neglected. The value of an action in a state is determined by the rich nature of the primed sensation, not just a global value. In the value system presented here, value is derived not only from separate reward channels but also from primed sensation.

The framework also addresses a fundamental limitation of reinforcer-based value systems. Reinforcers are typically sparse in time: they are delivered at infrequent spots along the time axis. Novelty from primed sensation is, however, dense in time, defined at every sensory refresh cycle. Thus, the framework enables continuous exploration by a developmental robot, not just exploration of a task-specific setting. We should note that the value system so defined is only an innate value system, which should develop through the robot's life experience. Different primed actions and primed sensations are associated with different Q-values. These Q-values determine the probability of each action being taken. The novelty and the reinforcers jointly determine the Q-value, which changes with experience. In this way, the value system is learned through a developmental robot's interactions with the world.

To demonstrate the working of the value system, we chose a challenging behavior domain: visual attention through neck pan actions. Although there is only one motor degree of freedom, the difficulty lies in the highly complex, uncontrolled visual environment. Our approach is fundamentally different from traditional task-specific approaches in that we cast visual attention selection as a behavior guided by a value system, which is applicable to any task. For example, our value system does not define saliency of features, but instead novelty based on experience. A stimulus that is novel for a robot at one time is no longer novel if it is sensed repeatedly for a long time by the same robot.
Visual attention selection in a general visual environment is a very challenging problem that cannot be systematically addressed without the general, task-nonspecific approach described here.

4.6 System architecture

The basic architecture implemented for the SAIL robot is shown in Fig. 4.2. The sensory input is represented by a high-dimensional vector at each time instant, so that each component corresponds to a pixel. It is the cognitive mapping module that derives the most discriminating features from the input streams and maps each input vector to the corresponding effector control signal.

Figure 4.2: The system architecture of the SAIL experiments.

The cognitive mapping is realized by Incremental Hierarchical Discriminant Regression (IHDR) [114], simulating the high-dimensional mapping of a cortical area. A more detailed explanation is beyond the scope of this section. Basically, given each input vector, the IHDR tree provides an output consisting of three components: the primed sensation, the primed action and the corresponding Q-values. Primed actions are the possible actions in each state. The probability of taking each primed action is based on its Q-value. The primed sensation predicts what the actual sensation will be if the corresponding primed action is taken.

Novelty is measured by the difference between the primed sensation and the actual sensation. In the value system, novelty is integrated with reinforcement learning so that humans can issue rewards to modulate a developmental robot's behavior. In order to let the robot explore more states, Boltzmann Softmax exploration is implemented. In traditional Q-learning algorithms, all states have to be updated, which is not practical for online learning. We therefore add a prototype updating queue module to the architecture, which keeps the most recently visited states (indicated by dashed lines in the figure). Only states in that queue are updated at each time instant; state updating is thus local and only along the trajectory of visited states. In the following sections, we describe each component of the architecture in detail.

4.6.1 Incremental hierarchical discriminant regression

Mathematically, the cognitive mapping is formulated as a mapping M: S × X → X' × A × Q, where S is the state space, X is the current sensory space, X' is the space of primed sensation, A is the primed action space, and Q is the space of value. They are all numeric vector spaces. At every time step, M accepts the current sensory input x(t), combines it with the current state s(t), and maps them to the primed sensation x'(t), the corresponding action output a(t+1) and its value. The cognitive mapping is realized by the Incremental Hierarchical Discriminant Regression (IHDR) tree [51] [114]; more details of IHDR can be found in Chapter 3. It is worth noting that in this chapter IHDR is integrated with reinforcement learning. A prototype in IHDR is also a state in the reinforcement learning algorithm; to avoid confusion, we only use "state" in the following sections. Thus, given an input vector x, the tree finds the best matched state s. If they are close, x is used to update s. Otherwise x is considered a new state.
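To make the interface of the cognitive mapping concrete, the following is a minimal sketch, not the SAIL implementation, of a state record holding primed contexts and of a nearest-prototype query standing in for the IHDR tree. The names (PrimedContext, State, CognitiveMap), the plain Euclidean matching and the fixed novelty threshold are illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class PrimedContext:
    """One primed context c' = (primed action, primed sensation, Q-value)."""
    action: int
    primed_sensation: np.ndarray   # prediction of the next sensation
    q_value: float = 0.0

@dataclass
class State:
    """A state (prototype) with its primed contexts and bookkeeping information."""
    center: np.ndarray                             # prototype vector in sensory space
    contexts: list = field(default_factory=list)   # list[PrimedContext]
    age: int = 0                                   # visit count; drives learning rate and temperature
    sigma2: np.ndarray = None                      # per-component expected squared deviation (for novelty)

class CognitiveMap:
    """Stand-in for the IHDR tree: nearest-prototype lookup with a creation threshold."""
    def __init__(self, distance_threshold, n_actions, dim):
        self.states, self.tau_new = [], distance_threshold
        self.n_actions, self.dim = n_actions, dim

    def query(self, x):
        """Return the matched (or newly created) state for input vector x."""
        x = np.asarray(x, dtype=float)
        if self.states:
            dists = [np.linalg.norm(s.center - x) for s in self.states]
            best = int(np.argmin(dists))
            if dists[best] < self.tau_new:
                s = self.states[best]
                s.age += 1
                s.center += (x - s.center) / s.age   # incremental averaging of the prototype
                return s
        # otherwise create a new state with one primed context per action
        s = State(center=x.copy(),
                  contexts=[PrimedContext(a, np.zeros(self.dim)) for a in range(self.n_actions)],
                  age=1, sigma2=np.ones(self.dim))
        self.states.append(s)
        return s
```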
Each state s (Fig. 4.3) is associated with a list of primed contexts {c'_1, c'_2, ..., c'_n}. Each primed context c'_i consists of a primed sensation x'_i, a primed action a_i and a corresponding Q-value. The primed sensation is what the robot predicts it will sense after taking the corresponding primed action. The Q-value is the expected value of the corresponding action. Given the list of primed contexts, it is the value system that determines which primed action should be taken, based on its Q-value. Besides primed contexts, a state holds four other kinds of information: age, learning rate, temperature, and the standard deviation of the primed sensation. The age of a state is used to determine the learning rate and the temperature of Boltzmann Softmax exploration when the state is visited. The standard deviation is used to calculate novelty. More details can be found in the next section.

Figure 4.3: State and its primed contexts. Each state is associated with a list of primed contexts. Each primed context consists of a primed action, a primed sensation and a primed Q-value.

4.6.2 The value system

The value system reported here integrates novelty and reinforcement learning.

Novelty

Rewards are an important source of value, but they are sparse in time. In contrast, the novelty of signals is dense in time: it is defined for every time instant. In order to motivate a developmental robot at any time, it is essential to integrate novelty with rewards. As shown in Fig. 4.3, every state retrieved from the IHDR tree is associated with a list of primed contexts c', which include primed sensations X' = (x'_{p1}, x'_{p2}, ..., x'_{pn}), primed actions A = (a_{p1}, a_{p2}, ..., a_{pn}) and corresponding Q-values Q = (q_{p1}, q_{p2}, ..., q_{pn}). Novelty is measured by the agreement between what the robot predicts and what it actually senses: if the robot can predict what will happen, novelty is low. If the ith action is chosen, we define novelty as the normalized distance between the ith primed sensation x'_i = (x'_1, x'_2, ..., x'_m) at time t and the actual sensation x(t+1) at the next time:

n(t) = \frac{1}{m} \sum_{j=1}^{m} \frac{(x'_j(t) - x_j(t+1))^2}{\sigma_j^2(t)}    (4.1)

where m is the dimension of the sensory input. Each component is divided by the expected squared deviation \sigma_j^2(t), which is the time-discounted average of the squared difference (x'_j - x_j)^2, as shown in Eq. (4.2):

\sigma_j^2(t) = \frac{t - 1 - \mu(t)}{t} \, \sigma_j^2(t-1) + \frac{1 + \mu(t)}{t} \, (x'_j - x_j)^2    (4.2)

where \mu(t) is the amnesic parameter, which gives more weight to new samples. With an appropriate \mu, \sigma_j(t) represents the short-term variation of the sensation. When a state is generated for the first time, the initial value of \sigma_j^2 is copied from its nearest neighbor in the same leaf node. The amnesic parameter is formulated by Eq. (4.3):

\mu(t) = \begin{cases} 0 & \text{if } t \le n_1 \\ c\,(t - n_1)/(n_2 - n_1) & \text{if } n_1 < t \le n_2 \\ c + (t - n_2)/m & \text{if } t > n_2 \end{cases}    (4.3)

where n_1 and n_2 are two switch points, and c and m are two constants that determine the shape of \mu. After the above calculations, x'_j(t) is set to x_j(t+1), which becomes the new primed sensation.
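As a concrete illustration of Eqs. (4.1)-(4.3), here is a minimal sketch of the novelty computation with the amnesic average. The function names and the flat NumPy representation are assumptions rather than the thesis implementation, and the small epsilon guarding against division by zero is an added convenience.

```python
import numpy as np

def amnesic_mu(t, n1=1, n2=3, c=2.0, m=2.0):
    """Amnesic parameter mu(t) of Eq. (4.3); m here is the Eq. (4.3) constant, not the sensory dimension."""
    if t <= n1:
        return 0.0
    if t <= n2:
        return c * (t - n1) / (n2 - n1)
    return c + (t - n2) / m

def update_sigma2(sigma2, primed, actual, t):
    """Per-component amnesic average of the squared prediction error, Eq. (4.2)."""
    mu = amnesic_mu(t)
    diff2 = (primed - actual) ** 2
    return ((t - 1 - mu) / t) * sigma2 + ((1 + mu) / t) * diff2

def novelty(primed, actual, sigma2):
    """Normalized prediction error n(t) of Eq. (4.1)."""
    eps = 1e-8   # avoid division by zero for brand-new states
    return float(np.mean((primed - actual) ** 2 / (sigma2 + eps)))

# usage: after taking primed action i and observing x_next
# n = novelty(state.contexts[i].primed_sensation, x_next, state.sigma2)
# state.sigma2 = update_sigma2(state.sigma2, state.contexts[i].primed_sensation, x_next, state.age)
# state.contexts[i].primed_sensation = x_next.copy()   # becomes the new primed sensation
```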
Integration of Novelty and Rewards

However, novelty is only a low-level measure. The system's preference for a sensory input is typically not just a simple function of n(t). Besides novelty, human trainers and the environment can shape the robot's behaviors through its biased sensors. A biased sensor for a developmental robot is one whose signal has an innate preference pattern at the birth time. For example, a biased sensor value b = 1 means the human teacher pressed the "good" button (one biased sensor) and b = -1 means the teacher pressed the "bad" button (another biased sensor). Furthermore, studies in animal learning show that different reinforcers have different effects: punishment typically produces a change in behavior much more rapidly than other forms of reinforcers [25]. We integrate novelty and immediate rewards so that the robot can take these different factors into account. The combined reward is defined as a weighted sum of the physical reinforcers and the novelty:

r(t) = w_b r_b(t) + w_g r_g(t) + w_n n(t)    (4.4)

where w_b > w_g > w_n are the normalized weights of punishment, reward and novelty, respectively, satisfying w_b + w_g + w_n = 1.

Q-Learning Algorithm and Boltzmann Softmax Exploration

There are two major problems. First, the reward r is not always consistent: humans may make mistakes in giving rewards, and thus the relationship between an action and the actual reward is not always certain. Second is the delayed reward problem: the reward due to an action is typically delayed, since the effect of an action is not known until some time after the action is complete. These two problems are dealt with by the following Q-learning algorithm. Q-learning is one of the most popular reinforcement learning algorithms [110]. The basic idea is as follows. At each state s, keep a Q-value Q(s, c') for every possible primed context c', which includes a primed sensation x_p and a possible action a_p; Q(s, c') indicates the value of context c' at the current state s. The action with the largest value is selected as output, and then a reward r(t+1) is received. The modified Q-learning update is:

Q(s(t), c'(t)) \leftarrow (1 - \alpha)\, Q(s(t), c'(t)) + \alpha \big( r(t+1) + \gamma\, Q(s(t+1), c'(t+1)) \big)    (4.5)

where \alpha and \gamma are two positive numbers between 0 and 1. Here \alpha = (1 + \mu(t))/t is a time-varying learning rate based on the amnesic average parameter \mu(t), where t is the age of the state, and \gamma discounts value in time. With this algorithm, Q-values are updated according to the immediate reward r(t+1) and the next Q-value; thus a delayed reward can be back-propagated in time during learning.

An important difference between this design and the traditional Q-learning algorithm is that we use a changing learning rate, based on the amnesic average, for each state, instead of a single global constant learning rate for the robot. The idea is derived from human development: at different maturation stages, the learning rules of humans are different, and a single value is not enough to model this. For example, the first time we meet an unknown person, we remember him right away (high learning rate); later, when we meet him in different clothes, we gradually update his image in our brains with a lower learning rate. The formulation of \alpha guarantees that it has a large value at the beginning and converges to a smaller constant value through the robot's experience. Fig. 4.4(a) shows an example of the learning rate based on the amnesic average.

Figure 4.4: (a) Learning rate based on amnesic average (n1 = 1, n2 = 3, m = 2, c = 2); (b) temperature of Boltzmann Softmax exploration based on the Gaussian density model (mu = 10, sigma = 2, C1 = 5, C2 = 0.1).
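A minimal sketch of the reward integration (Eq. (4.4)) and the modified Q-learning update (Eq. (4.5)) with the amnesic learning rate. The specific default weights are illustrative; the thesis only requires w_b > w_g > w_n with the weights summing to one, and mu is assumed to be computed as in the previous sketch.

```python
def combined_reward(r_bad, r_good, nov, w_b=0.5, w_g=0.3, w_n=0.2):
    """Weighted sum of punishment, reward and novelty, Eq. (4.4)."""
    return w_b * r_bad + w_g * r_good + w_n * nov

def learning_rate(age, mu):
    """Time-varying rate alpha = (1 + mu(age)) / age, based on the amnesic average of Eq. (4.3)."""
    return (1.0 + mu) / age

def q_update(q_sa, r_next, q_next, age, mu, gamma=0.5):
    """Modified Q-learning update of Eq. (4.5): blend the old Q-value with the backed-up target."""
    alpha = learning_rate(age, mu)
    return (1.0 - alpha) * q_sa + alpha * (r_next + gamma * q_next)

# usage (hypothetical values):
# q_new = q_update(q_sa=0.2, r_next=1.0, q_next=0.4, age=5, mu=2.0, gamma=0.5)
```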
We applied Boltzmann Softmax exploration [95] to the Q-learning algorithm. At each state s, the robot has a list of actions A(s) = (a_{p1}, a_{p2}, ..., a_{pn}) to choose from. The probability of action a being chosen at s is:

P(s, a) = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in A(s)} e^{Q(s,a')/\tau}}    (4.6)

where \tau is a positive parameter called the temperature. With a high temperature, all actions in A(s) have almost the same probability of being chosen; when \tau \to 0, Boltzmann Softmax exploration is more likely to choose the action a with a high Q-value. The question is how to determine \tau. In [42], \tau is simply set to a constant value, but this runs into problems when the value system is applied to the real world. When we sense a novel stimulus for the first time, we pay attention to it for a while. In this case a small \tau is preferred, because the Q-value of the action "stare" is high and the robot should choose this action; if \tau is too large, the probability of each action is almost equal, which is not the case in attention. After staring at the novel stimulus for a while, the robot tires of it and pays attention to other stimuli, so a larger \tau is preferred. After a period of exploration, \tau should drop again, meaning the state has been fully explored and the robot can take the action with the highest Q-value. If we choose a large constant \tau, the robot explores even when it visits a state for the first time; if we choose a small \tau, the robot faces the local-minimum problem and cannot explore enough states. A Gaussian density model (Eq. (4.7)) for the local temperature solves this problem:

\tau(t) = C_1 \exp\!\left( -\frac{(t - \mu)^2}{2\sigma^2} \right) + C_2    (4.7)

where C_1 is a constant controlling the maximal value of the temperature, C_2 controls the minimal value, t is the age of the state, and \mu and \sigma are the center and width of the Gaussian. The plot of the model is shown in Fig. 4.4(b).

Let us briefly consider the convergence of the algorithm when the learning rate changes. We use \alpha(t) to denote the learning rate at time t. The conditions required to assure convergence with probability one are:

\sum_{t=1}^{\infty} \alpha(t) = \infty    (4.8)

and

\sum_{t=1}^{\infty} \alpha^2(t) < \infty    (4.9)

The first condition guarantees that the learning rates are large enough to eventually overcome any initial conditions or random fluctuations. The second guarantees that they eventually become small enough to assure convergence. In our case, the learning rate starts at 1 and converges to 0.5. The second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards. This property is actually desirable in a nonstationary environment: for a developmental robot that has to explore unknown environments, the problems it faces are effectively nonstationary. Thus, a developmental robot is not limited to learning a specific task, but learns multiple tasks through its experience in the world.
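The exploration mechanism of Eqs. (4.6)-(4.7) can be sketched as follows. The default parameter values mirror those used later in the simulation (C1 = 5, C2 = 0.1, mu = 20, sigma = 2), but the function names are assumptions.

```python
import numpy as np

def temperature(age, c1=5.0, c2=0.1, mu=20.0, sigma=2.0):
    """Gaussian density model for the local temperature, Eq. (4.7):
    low for a brand-new state, peaks around age mu, then drops back toward c2."""
    return c1 * np.exp(-((age - mu) ** 2) / (2.0 * sigma ** 2)) + c2

def boltzmann_select(q_values, tau, rng=None):
    """Boltzmann Softmax exploration, Eq. (4.6): sample an action index with
    probability proportional to exp(Q / tau)."""
    if rng is None:
        rng = np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    logits = (q - q.max()) / tau          # subtract the max for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(q), p=p)), p

# usage (hypothetical Q-values for three actions):
# action, probs = boltzmann_select([0.1, 0.6, -0.2], temperature(age=7))
```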
Prototype Updating Queue

In the batch learning mode of a Q-learning algorithm, the back-up is applied to all states. For real-time development, this global iteration is not applicable due to the excessive time required; we must use a local method that involves only a small number of computations. That is why we designed the prototype updating queue in Fig. 4.2, which stores the addresses of formerly visited states. Thus, not only is the Q-value back-propagated, so is the primed sensation. This back-up is performed iteratively from the tail of the queue back to its head. After the entire queue is updated, the current state's address is pushed into the queue and the oldest state at the head is pushed out. Because we can limit the length of the queue, real-time updating becomes possible.

Algorithm of Innate Value System

The innate value system works in the following way:
1. Grab the new sensory input x(t).
2. Query the IHDR tree to get a state s(t) and its list of primed contexts.
3. If x(t) is significantly different from s(t), it is considered a new state and the IHDR tree is updated by saving x(t). Otherwise, x(t) updates s(t) through incremental averaging.
4. Update the age of the state and calculate its temperature with Eq. (4.7).
5. Use Boltzmann Softmax exploration (Eq. (4.6)) to choose an action based on the Q-value of every primed action, and execute the action.
6. Calculate the novelty with Eq. (4.1) and integrate it with the immediate reward r(t+1) using Eq. (4.4).
7. Update the learning rate of the current state based on the amnesic average. Update the primed sensation using the actual sensation.
8. Update the Q-values of the states in the prototype updating queue (PUQ). Go to step 1.

4.7 Simulation results

In order to test the value system, a simulation environment was developed. The simulator GUI is shown in Fig. 4.5.

Figure 4.5: The GUI simulator. The arrow indicates the position and the viewing angle of the robot.

The big window shows the viewing environment, while the small window shows the image the robot currently observes. Several buttons control the position and viewing angle of the robot, and the "Good" and "Bad" buttons are used to issue rewards. In every state, the baby robot has three possible actions: stay at the current viewing angle (action 0), turn the neck left 30 degrees (action 1), and turn the neck right 30 degrees (action 2). The representation of a state consists of the visual image and the absolute viewing angle. The dimension of the input image is 100 x 100. We assume that the robot cannot look backward, and the number of absolute viewing angles is 7. The parameters are defined as follows: gamma = 0.5 in Eq. (4.5); c = 2, n1 = 1, n2 = 3, m = 2 in Eq. (4.3); C1 = 5, C2 = 0.1, mu = 20 and sigma = 2 in Eq. (4.7).

4.7.1 Habituation effect

In the first experiment we allowed the robot to explore on its own by looking around. The total number of states equals the number of viewing angles (7). The initial Q-value of each action is 0. Fig. 4.6 shows how the Q-value of each action changes based on novelty in one state (the absolute viewing angle of the state is 0; the input image of the state is shown in the small window of Fig. 4.5). In the beginning, the primed sensation is set to a long vector in which every element is zero. After taking an action, the current sensation is very different from this initial sensation; that is, the novelty value is high. We can see from Fig. 4.6 that in the first several steps the Q-value of each action increases. However, once the primed sensation has been updated, it is the same as the actual sensation if the robot takes the action again; the novelty then becomes zero and the Q-value decreases. After a long period of training (300 steps), the robot can predict the actual sensation of the next step whatever action it takes, so the Q-value of each action converges to the same value (0). This means each action has the same probability of being chosen. The right part of Fig. 4.6 shows the number of times each action was taken in different time frames (60 steps per frame).
After 300 steps, the Q-values of the actions are nearly equal, so the action counts are close. The experiment shows that, because of the habituation effect, the robot loses interest in every action after exploration and chooses actions randomly. Because a state is not visited at every time instant, there are flat periods in the Q-value plot during which the robot is not in the state and the Q-value does not change.

Figure 4.6: Habituation effect. On the left, the 1st, 2nd and 3rd plots correspond to the Q-values of action 0, action 1 and action 2, respectively. On the right, the frequency of each action in different time frames.

4.7.2 Integration of novelty and positive reward

After the above experiment, we began to issue rewards. For example, when the robot turns left, a human teacher gives it a positive reward (+1); for action 2, negative rewards (-1) are issued. The actual reward the robot receives is then an integration of novelty and the immediate reward. The Q-value of action 0 converges to 0. That of action 2 becomes negative after a punishment is issued at step 348. The Q-value of action 1 is always positive because we kept issuing positive rewards. As shown on the left of Fig. 4.7, after training the Q-value of action 1 is much larger than that of the other actions. As shown on the right of Fig. 4.7, action 1 gradually became the most frequently chosen action.

Figure 4.7: Integration of novelty and immediate reward.

4.7.3 Increase novelty with a moving object

In order to show novelty preference, a moving toy was added to the simulation environment after the first experiment. The testing images are shown in Fig. 4.8. Now the total number of states is 12, five more than in the first two experiments. Every time the robot is in the state with absolute viewing angle 0, one of these images is generated randomly; thus the primed sensation of action 0 can differ from the actual sensation. As shown on the left of Fig. 4.9, the Q-value of action 0 is positive because of the high novelty, whereas the Q-values of action 1 and action 2 are close to zero. After training, the robot found staying in the current state the most interesting, so action 0 was chosen most often.

Figure 4.8: Simulation of a moving object.

Figure 4.9: Increase novelty with a moving object.

4.7.4 Suppress novelty with punishment

After the third experiment, we issued positive rewards to action 2 (turn right) and negative rewards to action 0. Thus, even though the novelty is high when the robot stares at a moving object, the immediate rewards suppress the novelty. Gradually, the Q-value of action 2 increased. As shown in Fig. 4.10, after training the robot chose action 2 most of the time. However, action 0 and action 1 were still chosen a few times because of Boltzmann Softmax exploration.
Figure 4.10: Suppress novelty with immediate rewards.

4.8 Experiments with SAIL robot

The value system was also tested on our SAIL robot (short for Self-organizing Autonomous Incremental Learner) through vision-guided neck action selection. SAIL, shown in Fig. 2.9, is a human-size robot custom-made at Michigan State University. It has two "eyes," which are controlled by fast pan-tilt heads. In real-time testing, at each step SAIL has 3 action choices: turn its neck left, turn its neck right, or stay. In total there are 7 absolute neck positions: center is position 0, and from left to right the positions run from -3 to 3. Because there is a lot of noise in real-time testing (people come in and out of view), we restricted the number of states by applying a Gaussian mask to the image input after subtracting the image mean. The total number of states in the real-time experiment is about 50. The dimension of the input image is 30 x 40 x 3 x 2, where 3 arises from the RGB colors and 2 from the two eyes; the resolution of each image is 30 x 40. Toys used as moving objects are shown in Fig. 4.11. The input representation consists of the visual images and the absolute position of the robot's neck; the two components are normalized so that each has similar weight in the representation. Biased touch sensors are used to issue punishment and reward. We used the same parameters as in the simulation.

Figure 4.11: Toys used as moving objects (adopted from [126]).

4.8.1 Novelty and multiple reinforcers for different actions

In order to show the effect of novelty, we allowed the robot to explore by itself for about 5 minutes (200 steps), then kept moving toys at neck position -1. At each position there can be multiple states, because the input images at a given neck position can change. Fig. 4.12 shows the information of one state at position -1. The image part of the state is the fourth image of the first row shown in Fig. 4.11, which is the background of the experiment. The first three plots are the Q-values of each action (stay, left, right), the fourth plot is the reward of the corresponding action, the fifth plot is the novelty value, and the last one is the learning rate of the state.

Figure 4.12: The Q-value, reward, novelty and learning rate of each action of one state at position -1 when multiple reinforcers are issued.

After exploration (200 steps later), we moved toys in front of the robot, which increased the novelty and the Q-value of action 0 (stay). After training, the robot preferred the toys and kept looking at them from step 230 to step 270. A subset of the image sequence is shown in Fig. 4.11. If the actual sensations in the second row and the corresponding primed sensations in the third row are very different, the novelty is high.
The novelty value is shown in the fifth plot; the novelty of actions 0, 1 and 2 is marked by '.', '*' and '+', respectively. After step 300, the trainers began to issue different reinforcers to different actions. Punishments were issued to action 0 at steps 297 and 298 (the fourth plot) and to action 2 at step 315. Rewards were issued to action 1 at steps 322 and 329. The Q-values of action 0 and action 2 became negative while that of action 1 became positive, which means that the visual attention ability of the robot was developed through interactions with the environment. Even though the novelty of action 0 could be high, the robot preferred action 1 because of its experience. The learning rate in the last row shows that at the beginning the robot immediately remembered the new stimuli and then gradually updated them.

4.8.2 Boltzmann Softmax exploration

As mentioned above, Boltzmann Softmax exploration is applied so that the robot can experience more states. In Fig. 4.13, only information from step 1 to step 60 of the above state is shown. The first plot is the probability of each action based on its Q-value; the total probability is 1. The probabilities of actions 0, 1 and 2 are plotted at the top, middle and bottom, respectively. The star denotes a random value generated from a uniform distribution: if the random value falls in one range, say the middle range, then action 1 is taken. Because the robot is not always in this state, the plot is sparse. The second plot shows the temperature based on the Gaussian density model of Eq. (4.7). At the beginning, tau is small and the novelty of the state is high (the initial Q-values of the other actions are zero), so the probability of the action "stay" is the largest (almost 100%) and the robot stares at the stimulus for a while. Then the temperature increases, the probabilities of the actions become similar, and the robot begins to choose other actions and explore more states. After about 10 visits, the temperature drops to a small value (0.1) again, and the action with the larger Q-value has a greater chance of being taken. Thus, we solved the problem in [42].

Figure 4.13: Boltzmann Softmax exploration. The total probability is 1 in the first plot. The probabilities of actions 0, 1 and 2 are plotted at the top, middle and bottom, respectively. The star denotes the random value received at that time. The second plot shows the temperature.

4.9 Conclusions

In this chapter, a model for the development of a robot's value system is proposed. Novelty and reinforcement learning are integrated into the value system. Novelty is derived from the high-dimensional primed sensation, so value is not just a scalar number: it represents the rich information of the world. Furthermore, the primed-sensation-based value system can guide the developmental robot all the time, since novelty from primed sensation is defined at every sensory refresh cycle while reinforcers are sparse in time. Since no salient features are predefined, the value system is applicable, in principle, to any task and provides guidance to the robot at all times. The working of the value system is shown through vision-based neck action selection. Our SAIL robot learns to pay attention to salient visual stimuli. More importantly, the robot's responses to certain visual stimuli change after interacting with human trainers.
That is, the robot develops its value system through its experience in the real world.

Chapter 5
Covert Perceptual Capability Development for Vision-Based Navigation

In this chapter, we propose a model to develop a robot's covert perceptual capability using reinforcement learning. Covert capabilities such as attention cannot be developed using supervised learning methods. In contrast, reinforcement learning helps robots develop these capabilities via internal but non-enforceable probes. We treat covert perceptual behavior as an action selected by a motivational system based on reinforcement learning, and we apply this model to vision-based navigation. The goal is to enable a robot to learn the type of road boundary. Instead of dealing with simple problems in controlled environments with limited states and low input dimension, we test the model on images captured in non-stationary environments for navigation. In order to handle the huge state space of the active vision domain, Incremental Hierarchical Discriminant Regression is used to generate states on the fly; its coarse-to-fine tree structure guarantees real-time retrieval for high-dimensional input. A K-nearest-neighbor strategy is adopted to further reduce the training time complexity.

5.1 Introduction

Supervised (action-imposed) learning is an effective learning mode to help a robot develop intelligent capabilities. However, psychology and neuroscience [74] [30] indicate that many cognitive capabilities cannot be developed in this learning mode. For example, visual attention and thinking are two internal behaviors that cannot be imposed from outside but develop through interactions with the environment. If the behavior of a robot is not visible from the outside, we call it covert behavior.

In this chapter, we apply the developmental learning paradigm to covert perceptual capability development for a robot. The goal is to teach a robot the type of road boundaries for vision-based navigation. Instead of giving the correct type explicitly, the human teacher issues rewards and punishments according to the robot's guess. In this sense, the robot develops this covert capability via internal but non-enforceable probes; no action (boundary type) is imposed. Once this learning stage is mature, the robot can learn the correct heading direction for navigation in the same mode. This paradigm is important and indispensable. First, action-imposed learning is not enough to model sophisticated robotic cognitive development. Second, if the robot learns wrong actions it cannot make a correction in that learning mode. By contrast, reinforcement learning is a good model for learning intelligent capabilities when imposed actions are not acceptable, and it gives the robot the ability to recover from errors.

There are quite a few studies that apply reinforcement learning to vision and attention problems. Ballard and Whitehead's study [121] is one of the earliest attempts to use reinforcement learning as a model of active perception. Balkenius and Hulth [5] model attention as selection for action: they train a robot to pick out the correct focus of attention based on reinforcement learning, and experimental results are reported on a simple simulator. Bandera [6] used reinforcement learning to solve the gaze control problem; however, the application is in an artificial environment and its performance in the real world is unknown. Minut and Mahadevan [63] proposed a reinforcement learning model of selective visual attention.
The goal is to use a fixed pan-tilt-zoom camera to find an object in a cluttered lab environment; that is, the environment is stationary. In summary, most previous studies that apply reinforcement learning to model covert perceptual behaviors are tested in controlled environments, where the input dimension of the state vector is low and the number of states is very limited. In order to reach the goal of real-time active vision development in non-stationary environments (outdoor navigation), we propose IHDR (Incremental Hierarchical Discriminant Regression) to generate a continuous state space on the fly for high-dimensional visual input; its coarse-to-fine structure speeds up state retrieval. Furthermore, we implement a K-nearest-neighbor rule for Q-learning so that at each time instant multiple similar states can be updated, which dramatically reduces the number of rewards human teachers have to issue. This implementation shares the same idea as [52]; instead of using a weighted nearest neighbor, we use the top-K nearest neighbors directly. We also propose a multi-layer reinforcement learning architecture [97]: the first-level learning modules handle covert perceptual capability development, while the second-level learning module deals with navigation behaviors.

In what follows, we first describe the covert capability development problem in outdoor navigation. The detailed architecture of the developmental paradigm is presented in Section 3. The experimental results are reported in Section 4. Finally, we draw our conclusions and discuss future work.

5.2 Problem Description

Our goal is to enable a robot to develop its covert perceptual capability for vision-based outdoor navigation. A typical example is shown in Fig. 5.1(a). Suppose the robot has two eyes (left image and right image). Instead of using the entire image as input to each first-level learning module, we divide the two input images into 6 sub-windows (specified by rectangles). If the size of the original image is 160 x 120, the dimension of the state vector would be 19200, which is almost intractable. With six windows, the dimension of the state vector is reduced to 80 x 40 = 3200. In this way we reduce the dimension and control the state space, and it turns out that these six windows cover most road boundaries. Our goal is to teach the robot the type of road boundary. Consider the top window on the right (Fig. 5.1(b)). There is a road edge, which intersects the window at two points (p1, p2). For each intersection point, we define five boundary types (ranges). Taking p1 as an example, there are 5 possible ranges (R1, R2, R3, R4, R5) and p1 fits in R1. Suppose the image of the window is x(t) (a long vector, each element corresponding to a pixel). The robot needs to learn the mapping from x(t) to the correct road boundary type a, where a is one of {R1, R2, R3, R4, R5}. Since each window has two intersection points, we can generate a vector A = (a1, a2, ..., a12). This vector can be the input to the second-level module for heading direction learning.

Figure 5.1: (a) A pair of images in vision-based outdoor navigation with 6 attention windows; (b) sub-image of one window and the type of road boundary (p1's type is R1).

A major part of the developmental learning paradigm is Hierarchical Discriminant Regression [51]. Using the HDR tree, the robot learns the type of road boundary in each window (A = (a1, a2, ..., a12)).
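A minimal sketch of the attention-window setup described above: cutting six fixed sub-windows out of a stereo pair and flattening each into the 3200-dimensional input vector of its first-level module. Only the 160 x 120 image size, the six windows and the 80 x 40 window size come from the text; the window coordinates and function names are illustrative assumptions.

```python
import numpy as np

# Hypothetical window layout: three 80x40 sub-windows per 160x120 image,
# roughly covering where road boundaries appear (exact coordinates are assumptions).
WINDOWS = {
    "win1": ("left", 10, 0),   "win2": ("left", 50, 0),   "win3": ("left", 80, 40),
    "win4": ("right", 10, 80), "win5": ("right", 50, 80), "win6": ("right", 80, 40),
}
WIN_H, WIN_W = 40, 80
BOUNDARY_TYPES = ["R1", "R2", "R3", "R4", "R5"]   # five ranges per intersection point

def extract_window_vectors(left_img, right_img):
    """Cut the six attention windows out of a 120x160 stereo pair and flatten
    each into a 3200-dimensional input vector for its first-level learning module."""
    imgs = {"left": left_img, "right": right_img}
    vectors = {}
    for name, (side, top, left) in WINDOWS.items():
        patch = imgs[side][top:top + WIN_H, left:left + WIN_W]
        vectors[name] = patch.astype(np.float32).reshape(-1)   # 40 * 80 = 3200 components
    return vectors

# usage with random stand-in images:
# left = np.random.randint(0, 256, (120, 160), dtype=np.uint8)
# right = np.random.randint(0, 256, (120, 160), dtype=np.uint8)
# vecs = extract_window_vectors(left, right)   # each vecs[name].shape == (3200,)
```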
The overall system operation architecture for robotic cognitive development is shown in Fig. 5.2. There are HDR trees at two levels. At the first level, 6 HDR trees learn the mapping from visual input to road boundary type. The second-level HDR tree maps the road boundary types of all 6 windows to the correct heading direction.

Figure 5.2: The system operation architecture for robotic cognitive development. The first-level learning modules generate the mapping from visual input to road boundary type, while the second-level learning module generates the mapping from the road boundary type vector to the heading direction for navigation.

It is worth noting that in this study we only conduct experiments for the first-level learning modules. How the robot learns each mapping is discussed in the next section.

5.3 System Architecture of Robotic Cognitive Development

Figure 5.3: The system architecture of each learning module for robotic cognitive development.

The basic architecture for robotic cognitive development is shown in Fig. 5.3. The sensory input x(t) is represented by a high-dimensional vector: at the first level the input is a visual image, while at the second level the input is the road boundary type vector. In this chapter we only discuss the first-level learning modules. x(t) is fed into the cognitive mapping module. The cognitive mapping is realized by Incremental Hierarchical Discriminant Regression (IHDR) [51]. IHDR clusters x(t) into the current state s(t) and then maps the current state to the corresponding action output. Thus, IHDR generates states for high-dimensional visual input on the fly, which is a very difficult problem; this is why most studies only deal with low-dimensional state spaces. In contrast, IHDR makes it possible to handle a high-dimensional state space. In the testing phase, given a state s(t), IHDR finds the best matched s' associated with a list of primed contexts c' = (x', a', q), which include primed sensations X' = (x'_1, x'_2, ..., x'_n), primed actions A' = (a'_1, a'_2, ..., a'_n) and corresponding Q-values Q = (q_1, q_2, ..., q_n), where n is the number of different actions. In other words, the function of IHDR is g : S -> X' x A' x Q. Primed actions are the possible actions in each state; the probability of taking each primed action is based on its Q-value. The motivational system of the developmental learning paradigm works as an action selection function v : 2^{A'} -> A (2^{A'} denotes all possible subsets of A'), which chooses an action from the list of primed actions. At the beginning the robot chooses random actions; human teachers issue reward and punishment based on its actions. Through this reinforcement learning procedure the robot develops the capability to learn the road type without imposed actions. In order to let the robot fully explore the action space, Boltzmann Softmax exploration is implemented.
To meet the requirement of real-time updating in developmental learning, we add a prototype updating queue module to the architecture, which keeps the most recently visited states (indicated by dashed lines). Only states in that queue are updated at each time instant. However, in non-stationary environments (for instance, outdoor navigation) the state space is huge, and with the standard Q-learning algorithm the training time is prohibitive. In order to reduce the training time, we adopt a top-K nearest neighbor strategy. For each state, its top K nearest neighbors are saved in the prototype updating queue (Fig. 5.3). If a reward is issued, the system updates not only the current state-action pair but also the K-NN state-action pairs, which dramatically reduces the time complexity of the training procedure. Please refer to the last chapter for the description of IHDR, Q-learning and Boltzmann exploration; the major difference here is that we implement a new prototype updating queue.

5.3.1 K Nearest Neighbor Prototype Updating Queue

Even though IHDR makes learning in non-stationary environments possible, the time to train the system is still not acceptable. The major reason is that, as the robot explores new environments, the state space keeps growing. In order to reduce the training time, we adopt the K-nearest neighbor strategy [31]. Suppose the state space is S and the current state is s(t). Define D as the distance between s(t) and the kth nearest neighbor of s(t). The volume of the set of all states whose distance from s(t) is less than D is:

A(k, S, s(t)) = \frac{2 D^n \pi^{n/2}}{n \, \Gamma(n/2)}    (5.1)

where n is the dimension of the state space. Using the K-NN updating rule, the updated region grows from the single point s(t) to the volume A, which is a tremendous improvement. We can also look at the training complexity. Suppose that for each state we need to issue m rewards to get a reliable training result (Q converges) and that the number of states is N; the time complexity of training is then mN. If top-K updating is applied and we assume each state has the same probability of being chosen as a match, the time complexity is N + (m-1)N/K, where the first N means that each state has to be visited at least once, while (m-1)N/K is the number of additional training steps needed using top-K updating. The ratio of the time complexity using top-1 updating to that using top-K updating is

\frac{mN}{N + (m-1)N/K} = \frac{mK}{K + m - 1}.

As shown in Fig. 5.8, the temperature theta converges to 0 as a state is visited more, and a greedy strategy is then applied. We use Boltzmann exploration to choose the action for each state. In Fig. 5.9, only information from step 1 to step 15 is shown. The first plot is the probability of each action based on its Q-value; the total probability is 1. The probabilities of actions 0, 1, 2, 3 and 4 are plotted from top to bottom, respectively.

Figure 5.8: Temperature of Boltzmann exploration (based on a sigmoid function, c1 = 10, c2 = 1).

The '+' denotes a random value generated from a uniform distribution. If the random value falls in one range, say the bottom range, then action 0 is taken. As we can see, at the beginning each action has a similar probability (about 0.2) of being chosen. After training, action 1 is chosen most of the time. The action sequence is shown in the second plot. When the number of visits n_v > 4, the system chose only action 1; that is, for each attention point it takes about 5 visits to converge. This again shows the advantage of KNN-based reinforcement learning.
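As a concrete illustration of the top-K update of Section 5.3.1, here is a minimal sketch: when a reward arrives, the K prototypes nearest to the current state all receive the same update. The flat array of prototypes, the Euclidean distance and the immediate-reward-only update are simplifying assumptions; the thesis stores the neighbors in the prototype updating queue and retrieves them through the IHDR tree.

```python
import numpy as np

def top_k_q_update(centers, q_table, x, action, reward, k=5, alpha=0.2):
    """Blend the immediate reward into the Q-value of (state, action) for the
    current state and its K-1 nearest prototypes. Delayed-reward back-up through
    the prototype updating queue is omitted for brevity.

    centers: (N, d) array of state prototype vectors
    q_table: (N, n_actions) array of Q-values, modified in place
    """
    dists = np.linalg.norm(centers - x, axis=1)
    neighbors = np.argsort(dists)[:k]              # indices of the K nearest prototypes
    q_table[neighbors, action] = (1 - alpha) * q_table[neighbors, action] + alpha * reward
    return neighbors

# usage with hypothetical data: 200 prototypes of dimension 3200, 5 boundary-type actions
# centers = np.random.rand(200, 3200); q_table = np.zeros((200, 5))
# top_k_q_update(centers, q_table, x=np.random.rand(3200), action=0, reward=1.0)
```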
Retrieval Time

The retrieval time for each image is shown in Fig. 5.10. The average training time per step is 0.01863 s and the highest is 0.128 s. Even though we have 12 trees for the 6 windows, the system still works in real time. This shows how efficient IHDR is for robot mental development.

Figure 5.9: Boltzmann Softmax exploration. The total probability is 1 in the first plot. The probabilities of actions 0, 1, 2, 3 and 4 are plotted from top to bottom, respectively. The '+' denotes the random value received at that time. The second plot shows the action sequence.

Figure 5.10: Retrieval time of each image for one tree.

5.5 Conclusions and Future Work

In this chapter, Q-learning is used to model robotic covert perceptual capability development in navigation. It is applicable to the development of any cognitive capability for which enforceable probes are not acceptable. IHDR is implemented so that learning in real time becomes possible, and we adopt a K-nearest-neighbor strategy, which dramatically reduces the training time complexity in non-stationary environments. The experimental results show the effectiveness of the model, which enables a robot to learn road boundary types through interactions with teachers. In the future we will extend the model to develop the robot's vision-based navigation capability.

Chapter 6
Cross-Task Learning

Cross-task learning is essential for an artificial agent exploring the real world. With this capability, an agent can learn multiple tasks and use acquired knowledge to learn new tasks. In this chapter, the DOSASE MDP model enables an agent to find the relatedness between different tasks and to develop cross-task learning capability by interacting with the environment. One challenge of this work is that no task is defined in advance: the trainer shapes the behavior of the agent to learn different tasks interactively and continuously through interactions. The development is conducted in real time using high-dimensional input, which is another challenging issue. We tested the architecture on an artificial agent for vision-based navigation. The results show that DOSASE is an effective model for autonomous mental development (AMD) and dramatically reduces both time and space complexities for cross-task learning.

6.1 Introduction

Cross-task learning capability is very important for an artificial agent exploring the real world. By cross-task, we mean that the same system must learn multiple tasks incrementally in the same mode, dealing with task-specific contexts correctly. With this capability, the agent can learn different tasks and transfer learned knowledge to new tasks. Scientists in psychology have proposed many models to simulate human cognition in cross-task learning. Lovett [60] used an architecture called ACT-R to model how people organize knowledge and produce intelligent behaviors. They claimed that, after separately fitting performance on a preliminary task, the model can make zero-parameter cross-task predictions of performance on a second task. Computer scientists are also interested in building multiple-task learning models for artificial agents. Thrun [100] proposed the task-clustering algorithm, which learns the relationship between different tasks. When facing a new task, the algorithm first finds the most related task, then exploits information from it.
Pratt [75] and Caruana [17] investigated a similar problem: how to use information from one neural network to help a second network learn a related task. The limitations of these methods are: 1) the human programmers know the task knowledge at the beginning; 2) because the relationship between different tasks must be built up beforehand, real-time learning is impossible. For a typical developmental agent, these limitations are not acceptable: a developmental agent has to run in real time and face different tasks that are unknown to it before training.

In this chapter, we apply the DOSASE MDP model to conduct cross-task learning. The advantages of the model include: 1) no task is defined at the beginning; by interacting with the trainer, the agent incrementally learns different tasks; 2) the agent can learn task relatedness online in real time; 3) the agent transfers learned knowledge to new tasks and dramatically reduces both time and space complexities. The model has been successfully tested on an artificial agent for vision-based navigation.

In the following section, we first formulate what cross-task learning is. The detailed DOSASE MDP model is introduced in Section 3. Experimental results are shown in Section 4, and then we conclude with a summary and a discussion of future work.

6.2 Problem Description

To be precise in our further discussion, we need mathematical notation.

Definition 6.2.1 Given an agent at time t1, suppose that the agent produces different action contexts a1 and a2 from two different contexts C1 = {c(t) | t1 <= t <= t2} and C2 = {c(t) | t1 <= t <= t3}, respectively. If a1 and a2 are considered different by a social group (human or robot), conditioned on C1 and C2, then we say that the agent discriminates the two contexts C1 and C2 in the society. Otherwise, we say that the agent does not discriminate C1 and C2 in the society.

For example, given a voice command "Go to the elevator," the speech signals of the command differ from person to person. However, humans consider the commands to be the same. In this case, the agent should not discriminate the above commands from different people.

Definition 6.2.2 Cross-task learning: 1) There are N tasks, Gamma = {T_i | i = 1, 2, ..., N}. 2) For each task, the learning goal is to generate the mapping from context to action, M : C -> A, where A is the action space and C is the context space, which can consist of different sensory inputs: vision, audition, touch, etc. 3) Given the first N tasks, the agent uses acquired knowledge to speed up learning the (N+1)th task.

A typical setting is shown in Fig. 6.1. There are two tasks, T1 and T2. For example, in the autonomous navigation problem, T1 is "go around" and T2 is "go to the elevator." The contexts of the two tasks are C1 and C2, respectively.

Definition 6.2.3 Shared context: The overlapping trajectory between the two tasks is called the shared context, C_share = C1 intersect C2. The non-overlapping trajectory is called the non-shared context, C_non-share = C1 union C2 - C_share.

Taking the figure as an example, C1 = {c11, c12, c13, c14, c15} and C2 = {c21, c22, c23, c24, c25}. In the task space, the shared context is C_share = {c12, c14} and the non-shared context is C_non-share = {c11, c13, c15, c21, c23, c25}.

Definition 6.2.4 Merge point and break point: A point where the contexts of two or more tasks begin to overlap is called a "merge point" (M1, M2). A point where the contexts of two or more tasks begin to diverge is a "break point" (B1, B2).
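To make Definitions 6.2.3-6.2.4 concrete, here is a small sketch that, given two tasks' context sequences, computes the shared and non-shared contexts and locates merge and break points. Representing each context as a hashable identifier that is literally equal in both lists when shared is a simplifying assumption.

```python
def shared_context(c1, c2):
    """C_share = C1 ∩ C2 and C_non-share = (C1 ∪ C2) - C_share, as sets of context ids."""
    s1, s2 = set(c1), set(c2)
    share = s1 & s2
    non_share = (s1 | s2) - share
    return share, non_share

def merge_break_points(c1, c2):
    """Merge points: positions in c1 where the two trajectories start to overlap;
    break points: positions where they start to diverge again."""
    share, _ = shared_context(c1, c2)
    merges, breaks = [], []
    inside = False
    for i, c in enumerate(c1):
        if c in share and not inside:
            merges.append(i); inside = True
        elif c not in share and inside:
            breaks.append(i); inside = False
    return merges, breaks

# example mirroring Fig. 6.1, where the second and fourth contexts are shared:
# C1 = ["c11", "c12", "c13", "c14", "c15"]; C2 = ["c21", "c12", "c23", "c14", "c25"]
# shared_context(C1, C2)      -> ({"c12", "c14"}, {"c11", "c13", "c15", "c21", "c23", "c25"})
# merge_break_points(C1, C2)  -> ([1, 3], [2, 4])
```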
Figure 6.1: Typical setting of cross-task learning.

Figure 6.2: The cross-task learning system architecture.

6.3 Architecture of DOSASE MDP

We apply the Developmental, Observation-driven, Self-Aware, Self-Effecting, Markov Decision Process (DOSASE MDP) model to conduct cross-task learning. The detailed architecture of this model for cross-task learning is shown in Fig. 6.2. There are two level-building element (LBE) components. The first LBE learns online the association between the task label and verbal commands. The second LBE learns the specific tasks (for example, different vision-based navigation tasks). A developmental agent is self-aware and self-effecting, which means the agent can sense both internal and external inputs and generate both internal and external actions. A typical internal action is thinking.

Definition 6.3.1 Internal action and external action: An action consists of two components. That is, $a(t) = (a_i(t), a_e(t))$, where $a_i(t)$ and $a_e(t)$ are the internal action and the external action, respectively. $a_i(t)$ changes the internal state while $a_e(t)$ changes the external state.

The input goes through a channel selector, which pays attention to only part of the context. For example, the first LBE pays attention to auditory input while the second one pays attention to visual input. The input passed by the channel selector is fed into an observation-driven state transition function. After the current state is generated, the cognitive mapping engine finds the best match and outputs the associated action. It is worth noting that in the experiment only the second LBE is used, for vision-based navigation; the experimental results of voice command learning are not presented.

6.3.1 Observation-driven state transition function

After going through the channel selector, the sensory input is fed into the observation-driven state transition component. Let us first define the last context, $l(t) = f(a(t-1), x(t))$, which consists of the last action $a(t-1)$ and the current input $x(t)$. $l(t)$ is pushed into a context queue, as shown in Fig. 6.3, and is combined with the last state information to generate the current state.

Figure 6.3: Observation-driven state transition function.
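A minimal sketch of the state transition of Fig. 6.3 is given below, assuming vector-valued inputs and actions. The queue length K and the simple concatenation used to form s(t) are illustrative stand-ins; in the actual system the resulting state is matched and refined by the IHDR engine described next.

```python
from collections import deque
import numpy as np

class ObservationDrivenTransition:
    """Sketch of the observation-driven state transition (Fig. 6.3)."""

    def __init__(self, k=3):
        self.queue = deque(maxlen=k)     # context queue holding the last K contexts

    def step(self, x_t, a_prev):
        # Last context l(t) = f(a(t-1), x(t)); here f is simple concatenation.
        l_t = np.concatenate([np.atleast_1d(a_prev).ravel(), np.ravel(x_t)])
        self.queue.append(l_t)
        # Current state s(t) combines the queued contexts, which implicitly
        # carry the recent state history.
        return np.concatenate(list(self.queue))
```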
6.3.2 Online learning of one task through cognitive mapping

Many problems, such as content-based retrieval and vision-based navigation, can be formulated as a complicated function which maps the high-dimensional input and the current state to low-dimensional output signals. We use a decision tree to approximate this function, implemented by Locally Balanced Incremental Hierarchical Discriminant Regression (LBIHDR) [43] [46]. A detailed explanation is beyond the scope of this chapter. Basically, given a state $s$, IHDR finds the best-matched $s'$ and the action associated with it. The mapping is done through a coarse-to-fine tree structure. In the testing phase, given $s$, we use the K-nearest-neighbor rule to find the best match in the leaf node:

$$s' = \arg\min_{s_i} \| s_i - s \| \qquad (6.1)$$

where the $s_i$ are the prototype states stored in the leaf node.

Using IHDR, the relatedness of different tasks is measured by the similarity between the states generated for the different tasks. If some of the states generated by two or more tasks are the same, then we say these tasks are related. This is why our architecture is better than the task-clustering method [100], which has to define the different tasks in advance in order to measure their relatedness. Using the DOSASE MDP model, we do not need to know the tasks in advance; states for different tasks are generated incrementally. IHDR intrinsically has the incremental online learning capability to adapt to new inputs, which is a requirement for autonomous mental development. Because of its efficiency, it can also handle high-dimensional input, which is where it outperforms typical neural network algorithms [17].

6.3.3 One-task learning algorithm by a developmental agent

The algorithm for learning one task by a developmental agent is as follows:

Procedure 5 One-task learning. Learn the association between context and action.
1: Collect the current context c(t).
2: Go through the channel selector to get the sensory input $x(t) = c_v(t)$ (the visual component, for the vision-based navigation task).
3: Push x(t) to the context queue and generate the last context l(t).
4: Use the observation-driven state transition function to generate the current state s(t).
5: Find the best match s' of s(t). If they are similar, use the amnesic average to update s'.
6: If an imposed action is given, take this action. Otherwise, choose the action a associated with s'.
7: Go back to step 1.
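A compact sketch of the Procedure 5 loop follows. The flat nearest-neighbor search stands in for the IHDR retrieval of Eq. 6.1, and amnesic_average below is a simplified stand-in for the amnesic average used by the system; the similarity threshold and the data layout are illustrative assumptions.

```python
import numpy as np

def amnesic_average(old, new, n, mu=1.0):
    """Simplified amnesic average: recent samples get slightly more weight."""
    w = (1.0 + mu) / n
    return (1.0 - w) * old + w * new

def one_task_step(s_t, memory, imposed_action=None, match_thresh=0.5):
    """One pass of the Procedure 5 loop.

    `memory` is a list of [prototype_state, action, visit_count] entries,
    standing in for the prototypes stored in the IHDR leaf nodes.
    """
    s_t = np.asarray(s_t, dtype=float)
    if memory:
        dists = [np.linalg.norm(proto - s_t) for proto, _, _ in memory]
        i = int(np.argmin(dists))
        if dists[i] < match_thresh:                    # step 5: similar, so update s'
            memory[i][2] += 1
            memory[i][0] = amnesic_average(memory[i][0], s_t, memory[i][2])
            if imposed_action is not None:             # step 6: imposed action is taken
                memory[i][1] = imposed_action
            return memory[i][1]
    # No close prototype: remember the new state with the taught action (if any).
    memory.append([s_t, imposed_action, 1])
    return imposed_action
```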
6.3.4 Multiple-task learning through DOSASE

How can this architecture learn multiple tasks? We know that the agent can learn one task with the model. Using the same architecture, multiple tasks can be learned if the $C_i$ and $A_i$ are provided. Now, the problem is how to speed up learning if the agent has acquired useful knowledge from former tasks. Taking the tasks in Fig. 6.1 as an example, if $T_1$ ("go around") has been learned, how does the agent learn $T_2$ ("go to the elevator")?

Figure 6.4: State space for two tasks.

Let us take a look at the state space generated for each task. A part of the state space generated by these two tasks is shown in Fig. 6.4. The context of $T_1$ is $C_1 = \{c_1, c_2, c_3\}$. The states generated for $T_1$ are $S_1 = \{s_1, s_2, s_3\}$, where $s_j = \{s_{j1}, s_{j2}, \ldots, s_{jn_j}\}$, $j \in \{1, 2, 3\}$, and $n_j$ denotes the number of states generated for the $j$th context. Now the agent faces $T_2$. The context is $C_2 = \{c_a, c_b, c_c\}$. The states generated for $T_2$ are $S_2 = \{s_a, s_b, s_c\}$, where $s_m = \{s_{m1}, s_{m2}, \ldots, s_{mn_m}\}$, $m \in \{a, b, c\}$, and $n_m$ denotes the number of states generated for the $m$th context. The shared context of $T_1$ and $T_2$ is $c_2$ ($c_b$). The advantage of cross-task learning is that if the agent has learned $C_{share}$ for $T_1$, then there is no need to experience it again for $T_2$.

Lemma 6.3.1 If the number of states generated for the non-shared context of $T_1$ and $T_2$ is $N_{nonshare}$ and the number of states generated for the shared context is $N_{share}$, the ratio of the space complexity saved by multiple-task learning is $N_{share} / (N_{nonshare} + N_{share})$.

Now let us consider the complexity in terms of training time.

Lemma 6.3.2 If the $(N+1)$th task shares context with the former $N$ tasks, $C_{share} = (\bigcup_{i=1}^{N} C_i) \cap C_{N+1}$, then the time saved in training the $(N+1)$th task is $t_{share}$, the time that would otherwise be spent training on the shared context.

6.3.5 Attention mechanism for the break point

Suppose the agent has been trained with $C_1$ and $C_{share}$. In the testing phase, the agent would go through the same context trajectory. However, there is a problem if the agent experiences $s_{bn_b}$ (the last state of context $c_b$) again, since for the different tasks the external actions $a_e$ associated with $s_{bn_b}$ are different. For $T_1$, the external action should make the state move from $s_{2n_2}$ to $s_{31}$, while for $T_2$, the external action should make the state move from $s_{bn_b}$ (equal to $s_{2n_2}$) to $s_{c1}$. What should the agent do at the "break point"?

An attention mechanism is necessary to solve the problem. Remember that the agent learns the task label through the first LBE. If the agent gets into the state $s_{2n_2}$ (i.e., $s_{bn_b}$), there are two external actions $a_{e1}$ and $a_{e2}$. Each external action is associated with a task label. Using the task label information, the agent knows what action should be taken to reach the next state. Fig. 6.5 shows how the task label information helps discriminate the state transition. Adding the task label as another dimension, the break point is split into two points. By checking the current task label, the agent knows which action to take. The internal action generated by the attention mechanism is defined as:

$$a_i = \begin{cases} 1 & \text{if } T_i = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (6.2)$$

The output external action is $a = a_e \otimes a_i$, where $\otimes$ is component-wise multiplication. Only after checking the task label information can an action be issued.

Figure 6.5: Discriminating state transitions using multimodal information.
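The break-point attention of Eq. 6.2 amounts to gating the candidate external actions by the current task label. A minimal sketch, assuming each stored external action carries the task label under which it was learned (the action names in the example are hypothetical):

```python
def gate_by_task_label(candidates, current_task):
    """Eq. 6.2 as a filter over the actions stored at a break-point state.

    `candidates` is a list of (external_action, task_label) pairs; the
    internal attention action a_i is 1 for a matching label and 0
    otherwise, so a = a_e * a_i keeps exactly the current task's action.
    """
    return [a_e for a_e, label in candidates if label == current_task]

# At the break point of Fig. 6.5, the two stored actions differ only by task label.
chosen = gate_by_task_label([("turn_right", 1), ("go_straight", 2)], current_task=2)
```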
6.3.6 Cross-task learning algorithm

The algorithm is as follows:

Procedure 6 Cross-task learning.
1: Collect the current context c(t); x(t) = c(t).
2: Push x(t) to the context queue and generate the last context l(t).
3: Use the observation-driven state transition function to generate the current state s(t).
4: Find the best match s' of s(t). If they are similar, use the amnesic average to update s'. Check the task label; if the two states have different task labels, generate a new action for s'.
5: If an imposed action is given, take this action. Otherwise, check the task label, then use Eq. 6.2 to pay attention to the task label and take the corresponding action.
6: Go back to step 1.

6.4 Experimental Results

In order to test the cross-task learning capability of the DOSASE MDP model, a simulation environment was developed. The simulation map is shown in Fig. 6.6. The artificial agent (rectangle with an arrow) can navigate through the white area. There is an elevator at the bottom right (white rectangle). In every state, the agent has three possible actions: go straight, turn left, and turn right. Some example input images are shown in Fig. 6.7. The second and third images in the third row are images near the elevator. The dimension of each input image is 25 x 25.

Figure 6.6: Simulation interface.

Figure 6.7: A subset of input images.

6.4.1 Trajectory and transferred knowledge

Only two tasks are considered: $T_1$, "go around," and $T_2$, "go to the elevator." The contexts of these tasks are $C_1$ and $C_2$. The non-shared context is $C'$. The trajectories of the two tasks are shown in Fig. 6.8. In the training stage, $C_1$ (solid line) is trained and then $C'$ (dotted line from point 'A' to point 'B') is trained. $C'$ includes the context near the elevator. In the testing phase, the agent successfully finished both navigation tasks. For task 2, its trajectory differs from that of task 1 around the elevator, since this part is the non-shared context, while the other part of its trajectory overlaps with that of task 1, because that part is not trained for task 2 but transferred from the knowledge of task 1.

Figure 6.8: Trajectory of two tasks.

Fig. 6.9 shows the task label of each sample when testing on task 2. The task label in the testing phase switches between 2 and 1. Tab. 6.1 shows that there are 5380 samples in total: 2442 of them are retrieved with task label 2, while 2938 of them are retrieved with task label 1, which means that for this test about 54.61% of the knowledge is transferred from task 1 to task 2.

Figure 6.9: Task label information. The plot shows the retrieved task label of each sample when testing on task 2.

Table 6.1: Transferred knowledge when testing on task 2.
                  Task 1   Task 2   All    Transferred knowledge
No. of samples    2938     2442     5380   54.61%

6.4.2 Complexity reduced in terms of space and time

The structures of the IHDR trees for the different tasks are shown in Fig. 6.10. The first two plots correspond to the IHDR tree architectures when task 1 and task 2 are trained separately. The plots show the number of states in different layers. The depth of both trees is 6, and the fourth layer stores most of the states. The third plot shows the IHDR tree structure when the agent learns $C_1$ first and then learns $C'$. As can be seen, the depth of the tree is still 6, which means that the system learns two tasks but the space complexity does not increase much compared with learning only one task.

Figure 6.10: HDR tree structure. The first two plots show the number of states in different layers for task 1 and task 2, respectively. The third plot shows the tree structure when cross-task learning is conducted.

Table 6.2 shows how cross-task learning reduces both the space and the time complexity. If we train the two tasks separately, the sizes of the IHDR trees are 22.6M and 27.5M, respectively. If we train task 1 ($C_1$) and then train the non-shared context ($C'$) of $C_1$ and $C_2$, the size of the tree is 36.6M. The percentage of saved space is $(22.6 + 27.5 - 36.6)/(22.6 + 27.5) = 26.95\%$. The training times of tasks 1 and 2 are 147.42 s and 143.75 s, respectively. If cross-task learning is conducted, the training time is 205.83 s. The percentage of saved time is about 29.31%. The experimental result shows that the model is effective in principle and that cross-task learning substantially reduces both time and space complexities.

Table 6.2: Complexity reduced in terms of space and time.
                  Task 1     Task 2     All        Complexity reduced
Tree size         22.6M      27.5M      36.6M      26.95%
Training time     147.42 s   143.75 s   205.83 s   29.31%

The training time of each step for task 1 is shown in Fig. 6.11. The average training time of each step is 0.043 s and the highest training time is 0.14 s. Obviously, the system can work in real time, which is a necessary condition for conducting autonomous mental development on an artificial agent.
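As a quick check of the percentages reported in Table 6.2, both savings follow directly from the separately trained figures and the cross-task figures:

```python
# Space saved: two trees trained separately vs. one tree trained on C1 and then C'.
space_saved = (22.6 + 27.5 - 36.6) / (22.6 + 27.5)           # ~0.2695 -> 26.95%

# Time saved: training both tasks separately vs. cross-task training.
time_saved = (147.42 + 143.75 - 205.83) / (147.42 + 143.75)  # ~0.2931 -> 29.31%
```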
Figure 6.11: Training time of each step for task 1.

6.5 Conclusions

In this chapter, we apply the DOSASE MDP model to cross-task learning for autonomous mental development. No prior knowledge is needed for any task. The model helps an agent find the relatedness between different tasks by comparing the similarity of the states generated for each task. The agent does not need to know all the tasks in advance. The knowledge learned in the shared context can be transferred to a new task, which speeds up the training procedure. The model is tested in a simulation environment for vision-based navigation. Two tasks have been incrementally learned in real time, with large gains in terms of time and space complexities. The experimental result shows the effectiveness of the AMD paradigm.

For future work, we will test the model on a robot in the real world. This is a challenging problem, since in the real world the number of states could be huge, which makes learning difficult. We would also like to expand the model to learn from multimodal inputs, to see how a robot develops its visual and auditory capabilities at the same time.

Chapter 7

Adaptive Multimodal Context-Aware User Presence Detection

Context-aware systems have drawn increasing attention recently. The goal is to observe users' status and provide proper assistance. In this chapter, we present a user presence detection system. Both visual and acoustic contexts are integrated to infer a user's activities in an office. Since human behaviors are organized in a hierarchical manner, we implement Layered Hidden Markov Models (LHMM) to model human activities with different granularities. A more challenging issue is to make the system adapt to different users and different environments. Instead of building in all the world knowledge in advance (which is intractable), the system's adaptive capability enables it to learn user-centered knowledge (which is tractable). The IHDR (Incremental Hierarchical Discriminant Regression) algorithm is used to automatically generate models for acoustic signals related to a user. IHDR can learn new signals online, and its coarse-to-fine tree structure guarantees learning in real time. A new type of feature is developed to further improve acoustic signal classification. The prototype system has been successfully tested in open domains.

7.1 Introduction

Context-aware systems [65] have drawn increasing attention from researchers and engineers. Context is defined as "the physical and social situation in which computational devices are embedded." Context-awareness is the key component of the next generation of human-computer interaction techniques, which tend to measure information about "where," "what," "when," and "who." A few prototype context-aware systems have been implemented in the last decade. Shafer, Brumitt & Cadiz [84] create an "EasyLiving" home environment, which is similar to the intelligent office environment [69]. In [71] [101], drivers' behaviors are modeled so that intelligent assistance systems can improve the safety of driving. Wearable computing [88] is another hot area, where intelligent devices are embedded in clothes, watches, or glasses. The goal of context-aware computing is to provide users with services that are appropriate to their particular situational information.
The system detects whether users are in certain environments (room, office, car, etc.) and what their status is (in, out, online, offline, awake, sleeping, etc.). However, current presence detection systems rely on keyboard and mouse activities, which do not reflect actual user activities. For example, one current presence system may set the presence status to offline after 15 minutes of keyboard inactivity. If the user left the office immediately after working on the computer, his presence status would still be shown as "active." Our work aims at computing the presence reliably in real time. In order to build such a system, we integrate visual-acoustic contextual information and other aspects of a user's information, such as past states.

A significant portion of previous work related to context-aware computing focuses on recognizing human activities based on a single-modality input in a specific environment. Aggarwal and Cai [1] give a review of human motion analysis. There are two popular probabilistic approaches in visual activity recognition: the Hidden Markov Model (HMM) and the Bayesian Belief Network (BBN). One of the earlier attempts to apply HMMs to activity recognition is found in [123]. Since then, many extensions of HMMs have been tried to model different human activities. The variable-length HMM [32] is applied to exercise behavior recognition. Brand & Oliver use the coupled HMM [8] to detect interactions between multiple people. The entropic HMM [7] is introduced to detect and recognize activities in video. Ivanov and Bobick [53] apply stochastic context-free parsing to recognize activity sequences generated by low-level HMMs. The Bayesian Belief Network is another popular approach. Buxton & Gong [16] adopt BBNs for visual surveillance. Madabhushi & Aggarwal [61] use a BBN to recognize actions including sitting down, standing up, hugging, etc. One limitation of the above studies is that there have been few studies on human activity recognition using multimodal context information. In [20], audio and visual contexts are combined to classify simple activities such as crossing a road and passing through a door. Oliver [70] proposes a layered HMM architecture that integrates information from multimodal inputs to recognize six activities in an office. In order to push the limit of context-aware computing, we need further study of multimodal context-awareness systems.

A more challenging issue is the adaptation capability of context-aware systems. These systems should be able to work in different environments and adapt to different users. Most current user presence detection systems are limited to a constrained setting; a reliable system for a user in one environment is not useful for other users in different settings. For example, if we can detect the phone ring signal in an office, it is helpful for inferring whether a user is going to have a phone conversation. But each room has a different telephone and each user has a different cell phone. It is impossible to build a system that detects all kinds of phone rings using a pre-trained classifier. A useful system must adapt to its user and learn user-centered signals on the fly, in real time.

In this chapter, we propose a real-time audio/video user presence detection system to monitor the behaviors of a single human in an office. The system has the following features: 1) A multimodal context-aware system is built. We integrate both audio and visual information. A layered architecture is implemented to model human activities with different granularities.
2) IHDR (Incremental Hierarchical Discriminant Regression) is implemented to learn new acoustic signals online and adapt to new settings, which is the major difference between this work and [70]. This novel component is crucial for adaptation to unknown environments as a consumer product. Instead of building in world knowledge in advance (which is intractable), the system's adaptive capability enables it to learn user-centered knowledge (which is tractable). IHDR is organized in a tree structure; the coarse-to-fine structure guarantees learning in real time. 3) A new type of feature for acoustic signals is developed to further improve the performance.

In what follows, we first describe the system and the challenges we face in Section 7.2. The system architecture is discussed in Section 7.3. Then we present the experimental results and summarize our work.

7.2 System Overview and Challenges

A typical setting of a context-aware user presence detection system is shown in Fig. 7.1. A table and a chair are in an office. On the table, there are a telephone, a personal computer, a video camera, and a microphone.

Figure 7.1: Typical setting of a user presence detection system.

The sensors we use to collect context information are: 1) USB camera: a Logitech QuickCam camera, sampled at 15 fps, is used to detect human motion; the resolution is 640 x 480. 2) Microphone: a built-in microphone associated with the camera.

The goal of this system is to detect human activities in an office. The overview of the user presence detection system is shown in Fig. 7.2. Two kinds of sensory inputs are used: auditory and visual. The audition pattern classifier discriminates four kinds of auditory patterns: "Phone ring," "Conversation," "Uncertain noise," and "Silence." A motion detector captures motion information by computing the difference between two consecutive images. A sequence of motion activities is classified into four motion patterns: "Rest," "Moving near door," "Moving in the office," and "Out." The outputs from the above two components are combined to infer human activities in the office. There are in total four kinds of activities: "Conversation," "Other activity," "Rest," and "Nobody around."

Figure 7.2: Overview of the user presence detection system.

One technical difficulty is fusing the different context inputs. In this system, the sampling rate of the acoustic signal is 50 Hz while the sampling rate of the visual signal is 15 Hz. A "Moving in the office" pattern can be detected in 0.1 second, while to detect "Nobody around" the system has to wait for 2 seconds to confirm the status. Psychological studies show that human behaviors are hierarchical [124]. In order to model human activities with different granularities, we implement Layered Hidden Markov Models (LHMM). Low-level HMMs are used for motion pattern classification while high-level HMMs are used for human activity classification.

Adaptation to the user and the environment is a more challenging issue. As we mentioned in Section 7.1, if the system can detect the user's voice or his phone ring signal (office telephone or cell phone), it would be easy to infer whether he is in the office and what his status is (face-to-face conversation, phone conversation, silent, etc.).
Since each user has a unique voice and there are thousands of different phone ring tones in the world, it is impossible to have a system that recognizes all these acoustic signals. If the system can adapt to each user and the related environment, however, we have a tractable solution. We propose the IHDR (Incremental Hierarchical Discriminant Regression) algorithm for acoustic signal classification and learning. HMMs have been used for speech recognition because of their dynamic modeling capability for sequential signals. However, one limitation of the HMM is that it is only a computational model, in the sense that an HMM is not designed to be generated automatically from observations. Engineers have to manually design the system for given settings in advance, which limits its adaptive capability in unknown environments. For example, if a new type of sensory input is added to the system, the designer has to create new HMMs and specify parameters for each HMM, which is inconvenient, especially for non-professional users. In contrast, IHDR organizes the learned patterns in a tree structure. New patterns are added as new leaf nodes. One tree is enough to model all acoustic signals related to a user, and its coarse-to-fine tree structure enables efficient learning in real time. We also develop the Mel-FFT feature to further improve the performance of acoustic signal classification. The details of each component are discussed in the next section.

7.3 System Architecture

The detailed architecture of the user presence detection system is shown in Fig. 7.3. Mel analysis [23] is applied to raw auditory signals to extract Mel-FFT features, which are fed into an IHDR tree to classify four kinds of auditory patterns. Sequences of motion activities are classified by low-level HMMs into four motion patterns. High-level HMMs integrate the outputs from the above two components to infer human activities.

Figure 7.3: A detailed architecture of the user presence detection system.

7.3.1 Acoustic signal classification using IHDR

Feature Extraction

With the audition component, we try to discriminate the following four signals: conversation, phone ring, silence (background noise), and uncertain noise (radio, music, flapping, clapping, etc.). In most speech processing systems, Mel-frequency Cepstral Coefficients (MFCCs) [23] are the dominant feature because MFCC captures the properties of human speech signals. The procedure to calculate the MFCC feature is shown in Fig. 7.4.

Figure 7.4: The procedure to extract FFT, Mel-FFT, and MFCC features.

First, we apply the FFT (Fast Fourier Transform) to get features in the frequency domain. Usually, we calculate the logarithmic energy of the FFT coefficients. Critical band filters are applied to derive the appropriate frequency components for computing the MFCC. This implementation is described in Eq. 7.1,

$$Y(i) = \sum_{k=F_{is}}^{F_{ie}} H_i(x) \log|S(k)| \qquad (7.1)$$

where $\log|S(k)|$ is the $k$th log energy of the Fourier coefficients, $N'$ is the number of triangular bands, and $H_i$ is a triangular filter:

$$H_i(x) = \begin{cases} 1 - |x|, & |x| \le 1 \\ 0, & |x| > 1 \end{cases} \qquad (7.2)$$

The variable $x$ is defined in Eq. 7.3:

$$x = \frac{2(k - F_{is})}{F_{ie} - F_{is}} - 1 \qquad (7.3)$$

where $F_{is}$ and $F_{ie}$ are the starting frequency and the ending frequency of the $i$th triangular band, respectively, and $F_{is} \le k \le F_{ie}$.

After we get the sequence of $Y(i)$, we can apply the IDFT (Inverse Discrete Fourier Transform) to this sequence, and then we get the MFCC feature using Eq. 7.4:

$$\mathrm{MFCC}(m) = \frac{1}{N'} \sum_{k=0}^{N'-1} Y(k) \cos\!\left(\frac{\pi m (k + \frac{1}{2})}{N'}\right) \qquad (7.4)$$

where $m$ is the index of the MFCC feature. In Eq. 7.1, $Y(i)$ denotes the sum of the weighted $\log|S(k)|$ within the $i$th critical band filter; $Y(i)$ is sometimes called the weighted log energy in the $i$th critical band. In this chapter, we call $Y(i)$, $i = 1, 2, \ldots, N'$, the Mel-FFT feature. It is worth noting that the system does not need to understand the meaning of conversational signals, only to classify different sounds. In Section 7.4, we compare the FFT feature, the Mel-FFT feature, and the MFCC feature. It turns out that Mel-FFT is the best choice for this application.
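The sketch below follows Eqs. 7.1-7.4: log-FFT energies are weighted by triangular band filters to give the Mel-FFT feature Y(i), and a cosine transform of Y gives the MFCC. The band edges, frame length, and window in the example are illustrative assumptions; the actual Mel-scaled placement of the critical bands follows [23] and is not reproduced here.

```python
import numpy as np

def mel_fft_features(frame, band_edges):
    """Triangular-filtered log-FFT energies Y(i) (Eqs. 7.1-7.3, sketch)."""
    log_s = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)       # log|S(k)|
    y = np.empty(len(band_edges))
    for i, (f_start, f_end) in enumerate(band_edges):
        k = np.arange(f_start, f_end + 1)
        x = 2.0 * (k - f_start) / (f_end - f_start) - 1.0    # band mapped onto [-1, 1]
        h = np.clip(1.0 - np.abs(x), 0.0, None)              # triangular filter H_i
        y[i] = np.sum(h * log_s[k])                          # Y(i)
    return y

def mfcc_from_mel_fft(y):
    """Cosine transform of Y(i), as in Eq. 7.4 (sketch)."""
    n = len(y)
    m = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    return (y[None, :] * np.cos(np.pi * m * (k + 0.5) / n)).sum(axis=1) / n

# Example: one 512-sample frame and ten placeholder bands.
frame = np.hanning(512) * np.random.randn(512)
bands = [(20 * i, 20 * i + 40) for i in range(10)]
mel_fft = mel_fft_features(frame, bands)      # 10-dimensional Mel-FFT feature
mfcc = mfcc_from_mel_fft(mel_fft)
```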
Incremental Hierarchical Discriminant Regression

After extracting the features of the acoustic signals, how do we classify them? Classification and regression problems can be formulated as a complicated function which maps the high-dimensional input and the current state to low-dimensional output signals. The decision tree [10] has been a popular method for function approximation. We use IHDR for acoustic signal classification.

Each leaf node generates a number of primitive prototypes (blocks in Fig. 7.3), which represent different patterns (model generator). In the testing phase, if a prototype is reached (shaded block), the label associated with it is the output (computational model). A more detailed description of the algorithm can be found in [48]. The advantages of IHDR include: 1) IHDR maps the high-dimensional data into a low-dimensional subspace. LDA is conducted in the discriminating subspace, which saves a lot of computational power. 2) IHDR organizes the learned knowledge from coarse to fine, which speeds up the retrieval procedure. The time complexity is N log N, where N is the number of prototypes in the tree. 3) IHDR intrinsically has the incremental online learning capability to adapt to new signals, since the sufficient statistics are updated incrementally. This is why we use IHDR instead of the traditional HMM to classify auditory patterns. For practical applications, the system has to work in different offices; each office has different people and different telephones. In order to make HMMs adapt to new settings, a bank of new HMMs has to be created manually. In contrast, one IHDR tree is enough to handle all types of auditory signals related to a user. Also important is that the adaptation can be conducted incrementally, in real time.

7.3.2 HMMs for motion pattern classification

In order to recognize human motion activities, we have to find the moving body and its location. We implement a motion detector to find the bounding box of a moving body from two consecutive images. First, we calculate the difference between the two images. Then a low-pass filter is used to remove noise in the difference image. The bounding box of the moving body is determined from the histogram of the image [76]. We then know the size and the location of the bounding box, which is the observation of motion activities.

After getting a sequence of motion sub-patterns, we use a discrete Hidden Markov Model (HMM) to classify human motion behaviors. An HMM is denoted by $\lambda = (A, B, \pi)$, where $A$ is the state transition probability matrix, $B$ is the observation symbol probability matrix, and $\pi$ is the initial state distribution. Specification of an HMM involves the choice of the number of states $N$ and the number of observations $M$. With training data, we can estimate $\lambda$ using the Baum-Welch algorithm [78]. Given a model $\lambda$ and an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$, the likelihood of the sequence is

$$P(O \mid \lambda) = \sum_{i} \alpha_T(i), \quad 1 \le i \le N, \qquad (7.5)$$

where $\alpha_t(j)$ is defined as

$$\alpha_t(j) = \Big[ \sum_{i} \alpha_{t-1}(i)\, a_{ij} \Big] b_j(o_t), \qquad (7.6)$$

where $a_{ij}$ is an element of $A$ and $b_j(o_t)$ is the probability, for state $j$ of model $\lambda$, of observing $o_t$. The meaning of $\alpha_t(i)$ is the probability of the partial observation sequence $o_1, o_2, \ldots, o_t$ and state $i$ at time $t$, given the model $\lambda$. Let $K$ be the number of motion/activity patterns; we usually need to generate a bank of $K$ HMMs $\lambda_k$ ($1 \le k \le K$). The likelihood of the observation sequence under each model is $L_k = P(O \mid \lambda_k)$. Suppose the maximal likelihood is $L_{max}$ and the minimal likelihood is $L_{min}$; the normalized likelihood is

$$L'_k = \frac{L_k - L_{min}}{L_{max} - L_{min}}. \qquad (7.7)$$

The HMM bank chooses the pattern $\hat{k}$ with the largest normalized likelihood as output:

$$\hat{k} = \arg\max_k \{ L'_k \}. \qquad (7.8)$$
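A minimal sketch of the likelihood computation of Eqs. 7.5-7.8 is given below, assuming discrete observation symbols and per-pattern models (A, B, pi) supplied as NumPy arrays; training the models with Baum-Welch is omitted.

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """P(O | lambda) by the forward recursion (Eqs. 7.5-7.6)."""
    alpha = pi * B[:, obs[0]]                   # alpha_1(i)
    for o_t in obs[1:]:
        alpha = (alpha @ A) * B[:, o_t]         # Eq. 7.6
    return alpha.sum()                          # Eq. 7.5

def classify_pattern(models, obs):
    """Pick the pattern with the largest normalized likelihood (Eqs. 7.7-7.8)."""
    L = np.array([forward_likelihood(A, B, pi, obs) for A, B, pi in models])
    L_norm = (L - L.min()) / (L.max() - L.min())
    return int(np.argmax(L_norm))
```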
7.3.3 Integration component

The system is able to recognize four types of human activities: "Conversation," "Other activity," "Rest," and "Nobody around." Integrating the above two low-level components to infer human activities is difficult, since the different modalities have different updating frequencies and they can be either related or unrelated. High-level HMMs are necessary for reasoning about human activities based on the information from the low-level components. Usually this reasoning is conducted at a larger time granularity (for example, 2 seconds), while in motion pattern classification and auditory pattern classification the granularity is less than 1 second. The outputs from the low-level components are symbols, which can be fed into the integration HMMs. Since the vision component outputs 4 different patterns (and so does the audition component), the combination of these two components gives 16 types of observations. For example, if the motion observation $O_{motion}$ is "Moving in the office" and the acoustic observation $O_{acoustic}$ is "Conversation," let us define $O_{motion} = 2$ and $O_{acoustic} = 2$. The integrated observation is then $O_{integration} = O_{motion} \times 4 + O_{acoustic} = 2 \times 4 + 2 = 10$. The Baum-Welch algorithm [78] is used to train on sequences of combined observations.
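A one-line sketch of this observation encoding, consistent with the worked example (2 x 4 + 2 = 10); the inverse mapping is added only for clarity and is not part of the original description.

```python
def encode_observation(o_motion, o_acoustic, n_acoustic=4):
    """Combine the two low-level outputs (each coded 0-3) into one of 16 symbols."""
    return o_motion * n_acoustic + o_acoustic

def decode_observation(o_int, n_acoustic=4):
    """Recover (o_motion, o_acoustic) from the combined symbol."""
    return divmod(o_int, n_acoustic)

assert encode_observation(2, 2) == 10
assert decode_observation(10) == (2, 2)
```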
v Figure 7.5: Different acoutsic signals: from top to down are background noise, conversation, phone ring and other noise (clapping). x-axis denotes the number of raw signal points, y-axis denotes the normalized energy. 163 system has to incrementally learn about 50 frames per second. Considering training time and recognition rate, we choose Mel-FFT as the feature of acoustic signals. Table 7.1: Comparision of different features for acoustic signals. Features Dim Training Correct Total Rate Time FFT 512 115ms 4314 5174 83.38% MelFFT 21 8.6ms 4641 5174 89.70% MFCC 13 7.3ms 4448 5174 85.92% Online learning As we mentioned earlier. we use IHDR rather than HMM for acoustic signals classi- fication. The reason is that one IHDR tree can adapt to new signals, new users and new environments. In contrast, using HMM the designer has to create new models for new signals. In order to verify the adaptive capability of IHDR, we tested the online learning component. We divided the dataset into 3 sets: Set A consists of conversation signal and 8 types of phone ring signals; Set B consists of conversation signal and 7 types of phone rings signals; Set C consists of one type of phone ring signal, which is in A but not in B. We first trained set B offline then tested on set A. The confusion matrix is shown in Tab. 7.2, the recognition rate of phone ring signal is only 81.5%. Table 7.2: Confusion matrix of offline training. (C=Conversation, P=Phone) Data C P Total Rate C 4064 254 4318 94.12% P 275 1212 1487 81.50% 164 In the next experiment, using the offline generated IHDR tree, we trained the system with set C and then tested on set A again. The result of online learning is shown in Tab. 7.3. As we can see, the recognition rate of phone ring signals is improved from 81.50% to 89.64%, which proves the adaptive capability of IHDR. Table 7.3: Confusion matrix after online training. (C=Conversation, P=Phone) Data C P Total Rate C 4071 247 4318 94.28% P 154 1333 1487 89.64% Then we tested on all four kinds of acoustic signals. The confusion matrix is shown in Tab. 7.4. Conversation and uncertain noise signals are confusing sometimes Table 7.4: Recognition rate of each auditory set. (C=Conversation; UN=Uncertain Noise; P=Phone; S=Silence) Data C UN P S Total Rate C 2275 188 6 4 2470 92.11% UN 216 2104 0 0 2320 90.69% P 8 14 1768 5 1795 98.50% S 0 0 0 1141 1141 100% (188 conversation vectors are recognized as uncertain noise) because the uncertain noise covers a very large feature space. The overall recognition rate is 94.33%. Com- paring to [44], the performance gains about 0.5%. We need to notice that the test is source-dependent. In other words, the testing set and the training set come from the same source. If we test the system with signals of a telephone we never trained, the performance would drop. That’s why IHDR is important for this application since it can incrementally learn new auditory patterns. 165 Figure 7.6: The first row shows the video sequence, which is classified as “Moving near the door”; the second row shows the bounding box of the moving object. 7 .4.2 Experimental results of the motion component Fig. 7.6 shows a motion sequence, which is classified as “Moving near the door”. The first row is the video sequence while the second row shows the bounding box of the moving object. With the size and the position of the moving body, we can feed the motion sub- pattern sequence (observations) into HMMs. In this experiment, the parameters of the low-level HMMs are: N, = 6, M24 and K=4. 
The different observations are: "Motion in door area," "Motion in room but not door," "Static & last motion in door area," and "Static & last motion in room but not door." The normalized likelihood of each motion pattern is shown in the first four plots of Fig. 7.7. The x-axis is a time line of about 400 seconds; the granularity of the HMMs is 1 second. If $L'_k = 1$, then pattern $k$ is reported. The ground truth of the activity sequence is shown in the fifth plot, which goes as follows: the user first moved in the room (pattern 1), rested for a while (2), moved around the door (3), and then went out (4). The motion behaviors are clearly recognized. A mistake occurs around step 320: the system classified pattern (3) as pattern (1) when the user moved near the boundary between the door area and the remaining part of the room.

Figure 7.7: Likelihood of motion patterns over time.

7.4.3 Experimental results of the high-level reasoning

The specifications of the high-level HMMs are N = 4, M = 16, and K = 4. The normalized likelihood of each human activity is shown in the first four plots of Fig. 7.8. The ground truth of the activity sequence is in the fifth plot, which goes as follows: the user moved around in the room (activity 1), rested (2), moved around again (1), talked for a while (3), moved to the door (1), and went out (4). The x-axis is a time line of about 800 time frames. "Nobody" and "Rest" are perfectly classified, while "Conversation" and "Other activity" are confused a little, because one can move and talk at the same time. Forty minutes of office activity were recorded (about 10 minutes for each activity); half of the data is used for training and the other half for testing. The recognition rate of human activities is shown in Tab. 7.5. About 4% of "Conversation" is incorrectly recognized as "Other activity," which is consistent with the results in Fig. 7.8.

Figure 7.8: Likelihood of human activities over time.

Table 7.5: Recognition rate of human activities (N = Nobody; OA = Other activity; R = Rest; C = Conversation).
Data   N      OA     R      C
N      100%   0      0      0
OA     -      -      -      -
R      0      0      100%   0
C      0      4.5%   0      95.5%

7.5 Summary and Discussion

In this chapter, we proposed a multimodal context-aware system to detect user presence in an office. Two levels of HMMs with different granularities handle motion pattern classification and the integration of low-level outputs, respectively. A more exciting feature of the system is its adaptive capability. IHDR, as a model generator, generates representations for different auditory patterns and can easily adapt to user-centered signals. For example, IHDR can learn its owner's voice and the phone ring signal in his or her office in real time. The initial results of the prototype system are promising. However, building a highly adaptive system for consumers needs further study.
Here are some future works: 1) Detect motion activities of multiple persons. Right now, only one user can be detected. With multi-person detection, we can find out the interactions between different people. 2) Dynamic CPU usage adaptation. In the current system, the audio and visual components use up most of the computa— tional power. We are going to implement an attention mechanism, which can turn off a component when it is not useful. Thus, the user can run the system in the background while allocating computational power to other applications. 169 Chapter 8 Conclusions and Future Work 8.1 Contributions Studies in human cognitive development have shown that interactions between a higher animal and its environment is essential for perception development and knowl- edge acquisition. This thesis reports some recent research on developmental robots, the robots that learn autonomously through real-time interactions with the environ- ment. The feasibility of the developmental robots is demonstrated under eight chal- lenging requirements for autonomous mental development (AMD) as we discussed in Section 1.1.3. At this early stage of research in developmental robots, we are facing an array of challenging technical issues. We have developed two major techniques based on a developmental architecture, which was implemented on a real robot, SAIL, an early developmental robot prototype. A summary of the contributions is given below. 170 . Propose the Developmental, Observation driven, Self-Aware, Self-Effecting, Markov Decision Process (DOSASE MDP) model, which integrates multimodal sensing, action-imposed learning, reinforcement learning, and communicative learning. . Build the cognitive mapping engine using the Locally Balanced Incremental Hi- erarchical Discriminant Regression (LBIHDR) technique, which has better gen- eralization capability under non-stationary environments, especially for vision- based navigation. . Novelty and reinforcement learning are integrated into a robotic value system for the first time. As an important part of AMD, a value system signals the oc- currence of salient sensory inputs, modulates the mapping from sensory inputs to action outputs, and evaluates candidate actions. The value system is applied to guide a robot’s visual attention behavior. With this value system, develop- mental robots has better autonomy with novelty (dense input) and reinforcer (sparse input). . Present a robotic system that develops covert perceptual capability for vision- based autonomous navigation via reinforcement learning. The agent learns different road boundaries through online interactions with human trainers. All these cannot be achieved by traditional supervised learning techniques since these kind of behavior cannot be imposed. . Provide the developmental agent with cross-task learning capability. That is, an 171 agent can learn multiple tasks and use acquired knowledge to speed up learning new tasks. With capability, a developmental robot can save time and save space for multiple tasks. This is very important for developmental learning at current stage since resource is limited. 6. Applies the above techniques to a practical application funded by industry. Context information from multi-sensory inputs is integrated to infer a user’s activities in an office. This is the first open-domain audio/ visual learning system for user presence detection. The prototype system verifies the effectiveness of the developmental learning paradigm. 
Currently, we are seeking the possibility for turning the system into products. 8.2 Future directions Autonomous mental development by a robot is an interdisciplinary research area. While more and more researchers are beginning to realize the importance and po— tential power, it is still at its early stage. This thesis represents initial steps in this direction. In addition to the above contributions, this work has raised many new questions and left out many interesting but unresolved problems as well. Attention-based navigation. Vision-based navigation is a hard problem. How to use attention mechanisms to generalize learned knowledge could be a solution. Further studies are needed to fully explore the capability of DOSASE MDP model for navigation task learning. For instance, communicative learning could be a good choice to guide a robot’s attention behavior. 172 A sophisticated value system. In this thesis, we propose a value system integrating novelty and reinforcement learning for the first time. We only test the value system for the visual attention behavior. The problem is how to scale up the value system for complex behaviors? While it is clear that the behavior of the value system should be developed through experiences, it is not clear how such a system can learn from an unstructured and highly inconsistent environment. Should this value system be a centralized universal control coordinator or a distributed system? How does a developmental robot conduct thinking under the guidance of its value system? Applications in human machine interactions. While enjoying the improved quality of life brought by the more and more sophisticated machines, we are facing many new problems, such as the machines’ flexibility, ease of usage, and efficient cooperation with human users. The related broad research areas can be grouped under human machine interactions (HMI). Machines conducting perceptual learning and behavior learning autonomously in real-time would resolve some of the above problems of HMI. It is about time to identify some HMI related applications, such as intelligent rooms and intelligent vehicles, and apply the principles and techniques of autonomous mental development. The prototype user presence detection system provides a good test-bed. 173 Bibliography [1] J. K. Aggarwal and Q. Cai. Human motion analysis: A review. Computer Vision and Image Understanding, 73(3):428—440, 1999. [2] J. S. Albus. Outline for a theory of intelligence. IEEE Trans. Systems, Man and Cybernetics, 21(3):473—509, May/ June 1991. [3] N. Almassy, G. M. Edelman, and O. Sporns. Behavioral constraints in the development of neural properties: A cortical model embedded in a real-world device. Cerebral Cortex, 8(4):346—361, 1998. [4] J. R. Anderson. Rules of the Mind. Lawrence Erlbaum, Mahwah, New Jersey, 1993. [5] C. Balkenius and N. Hulth. Attention as selection-for-action: a scheme for active perception. In Proceedings of EUROBOT ’99, pages 113—119. 2001. [6] C. Bander, F. Vico, J .Bravo, M. Harmon, and L. Baird. Residual q-learning applied to visual attention. In Proc. of the 13th International Conference on Machine Learning, pages 20—27. Bari, Italy, July 3 - 6 1996. [7] M. Brand and V. Kettnaker. Discovery and segmentaion of activities in video. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8), 2000. [8] M. Brand, N. Oliver, and A. Pentland. Coupled hidden markov models for modeling interacting processes. In Proc. of Int. Conf. on Computer Vision and Pattern Recognition, pages 994 — 999, 1996. 
[9] C. Breazeal. Social constraints on animate vision. Cambridge, MA, June 20—25, 2000. [10] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Triees. Chapman & Hall, New York, 1993. [11] R. Brooks. Intelligence without reason. In Proc. Int ’1 Joint Conf. on Artificial Intelligence, pages 569—595, Sydney, Australia, August 1991. [12] R. Brooks. Cog, a humanoid robot. Technical report, MIT Artificial Intelligence Laboratory, 1997. private communication. 174 [13] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14—23, March 1986. [14] R. A. Brooks, C. Breazeal, M. Marjanovic, B. Seassellati, and M. M. William. The cog project: Building a humanoid robot. In C. L. Nehaniv, editor, Compu- tation for Metaphors, Analogy and Agents, vol. 1562 of Springer Lecture Notes in Artificial Intelligence. Springer-Verlag, New York, NY, 1999. [15] BC. Buchanan and EH. Shortliffe, editors. Rule-Based Expert Systems: The M YCIN Experiments of the Stanford Heuristic Programming Project. Addison- Wesley, Reading, MA, 1984. [16] H. Buxton and S. Gong. Advanced Visual Surveillance using Bayesian Net- works. In Proc. of Int. Conf. on Computer Vision, pages 111—123, Cambridge, Massachusetts, June 1995. [17] R. Caruana. Multitask learning. Machine Learning, 28:41—75, 1997. [18] N. Chomsky. Language and Mind. Harcourt Brace Jovanovich, New York, 1972. [19] R. M. Church, D. J. Getty, and N. D. Lerner. Duration discrimination by rats. Journal of Experimental Psychology: Animal Behavior Processes, 2:303—312, 1976. [20] B. Clarkson and A. Pentland. Unsupervised clustering of ambulatory audio and video. In Int. Joint Conf. on Acoustic, Speech and Signal Processing, I CASSP ’99, pages 3037 — 3040, 1999. [21] D. R. Cox. Statistical analysis of time series: Some recent developments. Scand. J. Statist, 8(2):93—115, 1981. [22] M. R. D’Amato and M. Colombo. Representation of serial order in monkeys (ce- bus apella). Journal of Experimental Psychology: Animal Behavior Processes, 14:131—139, 1988. ' [23] J. Deller, J. Proakis, and J. Hansen. Discrete-time processing of speech signals. Institute of Electrical and Electronics Engineers Press, New York, 2000. [24] M. Domjan. The Principles of Learning and Behavior. Brooks/ Cole, Belmont, CA, fourth edition, 1998. [25] M. Domjan. The Principles of learning and behavior. Brooks/ Cole Publishing Company, Belmont, CA, 1998. [26] M. Dorigo and M. Colombetti. Robot shaping: Developing autonomous agents through learning. ‘ Artificial Intelligence, 71(2):321-370, 1994. [27] G. L. Drescher. Made- Up Minds. MIT Press, Cambridge MA, 1991. 175 [28] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, New York, NY, 2001. [29] J. Weng et al. Autonomous mental development by robots and animals. Science, 291:599—600, January 26, 2001. [30] J .H. Flavell, P.H. Miller, and SA. Miller. Cognitive Development. Prentice—Hall, Englewood Cliffs, NJ, 1993. [31] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, NY, second edition, 1990. [32] A. Galata, N. Johnson, and D. Hogg. Learning variable length markov models of behaviour. Computer Vision and Image Understanding, 81(3):398—413, 2001. [33] R. A. Gardner and B. T. Gardner. Teaching sign language to a chimpanzee. Science, 165:664—672, 1969. [34] R. Gisiner and R. J. Schusterman. Sequence, syntax, and semantics: Responses of a languagetrained sea lion (zalophus californianus) to novel sign combina- tions. 
Psychology, 106:78-91, 1992. [35] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 1989. [36] R. M. Groves and R. F. Thompson. Habituation: A dual process theory. Psy- chological Review, 77:419—450, 1970. [37] R. Held and A. Hein. Movement—produced stimulation and the development of visually guided behaviors. Journal of Comparative and Physiological Psychology, 56:872-876, 1963. [38] L. M. Herman. Receptive competencies of language-trained animals. In J. S. Rosenblatt, C. Beer, M. C. Busnel, and P. J. B. Slater, editors, Advances in the study of behavior (Vol. 17), pages 1—60. Academic Press, New York, 1987. [39] R. J. Herrnstein and P. A. deVilliers. Fish as a natural category for people and pigeons. In G. H. Bower, editor, The psychology of learning and motivation (Vol. 14), pages 60—97. Academic Press, New York, 1980. [40] J. H. Holland. Adaptation in Naural and Artificial Systems. University of Michigan Press, Ann Arbor, MI, 1975. [41] X. Huang and J. Weng. Novelty and reinforcement learning in the value system of developmental robots. In Proc. Second International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems (EPIROB’UQ), pages 47—55, Edinburgh, Scotland, August 10 - 11, 2002. 176 [42] X. Huang and J. Weng. Novelty and reinforcement learning in the value sys- tem of developmental robots. In Second International Workshop on Epigenetic Robotics, Edinburgh, Scottland, August 10—11 2002. [43] X. Huang and J. Weng. Locally balanced incremental hierarchical discriminant regression. In Fourth International Conference on Intelligent Data Engineering and Automated Learning. 2003. [44] X. Huang, J. Weng, and Z. Zhang. Office presence detection using multimodal context information. In Proc. of the International ConferenceAcoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Quebec, Canada, USA, 2004. [45] W. Hwang and J. Weng. Vision-guilded robot manipulator control as learn— ing and recall using SHOSLIF. In Proc. IEEE Int ’1 Conf. on Robotics and Automation, Albuquerque, NM, April 20—25 1997. [46] W. Hwang and J. Weng. Hierachical discriminant regression. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11):127 7 -1293, 2000. [47] W. Hwang and J. Weng. Hierarchical discriminant regression. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11):1277—1293, 2000. [48] W. Hwang and J. Weng. An online training and online testing algorithm for OCR and image orientation classification using hierarchical discriminant regres- sion,. In Proc. Fourth IAPR International Workshop on Document Analysis Systems, Rio De Janeiro, Brazil, December 10—13, 2000. [49] W. Hwang and J. Weng. Incremental hierarchical discriminant regression for indoor visual navigation. In Int ’l Conf. on Image Processing, Thessaloniki, Greece, October 2001. [50] W.S. Hwang. Visual learning and its application to sensorimotor actions. PhD thesis, Department of Computer Science and Engineering, Michigan State Uni- versity, 2000. [51] W.S. Hwang and J .J . Weng. Hierarchical discriminat regression. IEEE Trans. on Patten Analysis and Machine Intelligence, 22(11):1277—1293, 1999. [52] L. Baird III and A. Klopf. Reinforcement learning with high-dimensional, con- tinuous actions. Technical Report WL-TR-93-1147 Wright Laboratory, 1993. [53] Y. Ivanov and A. Bobick. Recognition of visual activities and interactions by stochstic parsing. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):852—872, 2000. [54] LP. 
Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237—285, 1996. 177 [55] S. Kakade and P. Dayan. Dopamine bonuses. In Advances in Neural Information Processing Systems, volume 13, 2000. [56] L. J. Kamin. “attention-like” processes in classical conditioning. In M. R. Jones, editor, Miami symposium on the prediction of behavior: Aversive stimulation, pages 9—31. University of Miami Press, Miami, FL, 1968. [57] M. Kirby and L. Sirovich. Application of the karhunen-loéve procedure for the characterization of human faces. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(1):103—108, Jan. 1990. [58] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, second edition, 1997. [59] J. E. Laird, A. Newell, and P. S. Rosenbloom. Soar: An architecture for general intelligence. Artificial Intelligence, 33:1—64, 1987. [60] MC. Lovett, L.Z. Daily, and L.M. Reder. A source activation theory of working memory: Cross-task prediction of performance in act-r. Cognitive Systems Research, 1:99—118, 2000. [61] A. Madabhushi and J. Aggarwal. A bayesian approach to human activity recog- nition. In Proc. of The second Int. workshop on Visual surveillance, pages 25—30, 1999. [62] M. Mataric. Designing and understanding adaptive group behavior. Adaptive Behavior, 4:1:51—80, 1995. [63] S. Minut and S. Mahadevan. A reinforcement learning model of selective visual attention. In Proceedings of the fifth international conference on Autonomous agents, pages 457 — 464. Montreal, Quebec, Canada, 2001. [64] P. Montague, P. Dayan, and T. Sejnowski. A framework for mesencephalic dopamine systems based on predictive hebbian learning. The journal of Neu- roscience, 16(5):1936—1947, 1996. [65] T. Moran and P. Dourish. Introduction to this special issue on context-aware computing. Human Computer Interaction, 16:87—95, 2001. [66] M. J. Morgan, M. D. Fitch, J. G. Holman, and S. E. Lea. Pigeons learn the concept of an “A”. Perception, 5:57—66, 1976. [67] S. K. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence, 2, August 1994. [68] H. Ogmen. A developmental perspective to neural models of intelligence and learning. In D. Levine and R. Elsberry, editors, Optimality in Biological and Artificial Networks .9, pages 363—395. Lawrence Erlbaum Associates, Publishers, Hillsdale, NJ, 1997. 178 [69] N. Oliver and E. Horvitz. Selective perception policies for guiding sensing and computation in multimodal systems: A comparative analysis. In Proc. of Int. Conf. on Multimodal Interfaces, pages 3—8, 2003. [70] N. Oliver, E. Horvitz, and A. Garg. Layered representation for human activity recognition. In Proc. of Int. Conf. on Multimodal Interfaces, pages 3—8, 2002. [71] N. Oliver and A. Pentland. Driver behaVior recognition and prediction in a smartcar. In Proc. of SPIE Aemsense2000 ’Enhanced and Synthetic Vision’, Orlando, Florida, 2000. [72] F. G. Patternson. The gestures of a gorilla: Language acquisition in another pongid. Brain and Language, 5:56—71, 1978. [73] I. M. Pepperberg. Some cognitive capabilities of an African Grey parrot (psit- tacus erithacus). In P. J. B. Slater, J. S. Rosenblatt, and C. Beer, editors, Advances in the study of behavior (Vol. 19), pages 357—409. Academic Press, New York, 1990. [74] J. Piaget. The Origins of Intelligence in Children. International Universities Press, INC, New York, 1952. [75] Lorien Y. Pratt, Jack Mostow, and Candace A. Kamm. 
[76] W. Pratt. Digital Image Processing. John Wiley & Sons, Inc., New York, NY, 1991.
[77] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[78] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[79] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[80] R. A. Rescorla and A. R. Wagner. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black and W. F. Prokasy, editors, Classical Conditioning II: Current Research and Theory, pages 64–99. Appleton-Century-Crofts, New York, 1972.
[81] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.
[82] W. Schultz. Multiple reward signals in the brain. Nature Reviews: Neuroscience, 1:199–207, 2000.
[83] W. Schultz, P. Dayan, and R. Montague. A neural substrate of prediction and reward. Science, 275(14):1593–1599, 1997.
[84] S. Shafer, B. Brumitt, and J. Cadiz. Interaction issues in context-aware interactive environments. Human Computer Interaction, 16, 2001.
[85] W.-M. Shen. Autonomous Learning from the Environment. Computer Science Press, New York, 1994.
[86] R. L. Solomon and J. D. Corbit. An opponent-process theory of motivation: II. Cigarette addiction. Journal of Abnormal Psychology, 81:158–171, 1973.
[87] R. L. Solomon and J. D. Corbit. An opponent-process theory of motivation: I. Temporal dynamics of affect. Psychological Review, 81:119–145, 1974.
[88] F. Sparacino, A. Pentland, and G. Davenport. Wearable performance. In The First International Symposium on Wearable Computers, Cambridge, MA, 1997.
[89] O. Sporns, N. Almassy, and G. Edelman. Plasticity in value systems and its role in adaptive behavior. Adaptive Behavior, 2000.
[90] L. R. Squire and E. R. Kandel. Memory: From Mind to Molecules. Scientific American Library, New York, 1999.
[91] L. Steels. Emergent functionality in robotic agents through on-line evolution. In R. A. Brooks and P. Maes, editors, Artificial Life IV, pages 8–14. MIT Press, Cambridge, Massachusetts, 1994.
[92] M. Sur, A. Angelucci, and J. Sharma. Rewiring cortex: The role of patterned activity in development and plasticity of neocortical circuits. Journal of Neurobiology, 41:33–43, 1999.
[93] R. S. Sutton and A. Barto. Reinforcement Learning. The MIT Press, Cambridge, Massachusetts, 1998.
[94] R. S. Sutton and A. G. Barto. Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, pages 135–170, 1981.
[95] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
[96] D. L. Swets and J. Weng. Hierarchical discriminant analysis for image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 21(5):386–401, 1999.
[97] Y. Takahashi and M. Asada. Vision-guided behavior acquisition of a mobile robot by multi-layered reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 395–402, 2000.
[98] E. Thelen and L. Smith. A Dynamic Systems Approach to the Development of Cognition and Action. MIT Press, Cambridge, MA, 1994.
[99] S. Thrun. Explanation-Based Neural Network Learning: A Lifelong Learning Approach. Kluwer Academic, Boston, Massachusetts, 1996.
[100] S. Thrun and J. O'Sullivan. Clustering learning tasks and the selective cross-task transfer of knowledge. In S. Thrun and L. Y. Pratt, editors, Learning to Learn. Kluwer Academic Publishers, 1998.
[101] K. Torkkola, N. Massey, B. Leivian, C. Wood, J. Summers, and S. Kundalkar. Classification of critical driving events. In Proc. of the International Conference on Machine Learning and Applications (ICMLA), pages 81–85, Los Angeles, CA, USA, 2003.
[102] D. Touretzky and L. Saksida. Operant conditioning in Skinnerbots. Adaptive Behavior, 5(3/4):219–247, 1997.
[103] A. M. Turing. Computing machinery and intelligence. Mind, 59:433–460, October 1950.
[104] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[105] S. Vijayakumar and S. Schaal. Real-time learning in humanoids: A challenge for scalability of online algorithms. Cambridge, Massachusetts, Sept. 7–8, 2000.
[106] I. Wallace, D. Klahr, and K. Bluff. A self-modifying production system model of cognitive development. In D. Klahr, P. Langley, and R. Neches, editors, Production System Models of Learning and Development, pages 359–435. MIT Press, Cambridge, Massachusetts, 1987.
[107] E. A. Wasserman, R. E. DeLong, and M. B. Larew. Temporal order and duration: Their discrimination and retention by pigeons. Annals of the New York Academy of Sciences, 423:103–115, 1984.
[108] S. Watanabe, J. Sakamoto, and M. Wakita. Pigeons' discrimination of paintings by Monet and Picasso. Journal of the Experimental Analysis of Behavior, 63:165–174, 1995.
[109] C. Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, England, 1989.
[110] C. J. Watkins. Q-learning. Machine Learning, 8:279–292, 1992.
[111] J. Weng. Learning in image analysis and beyond: Towards automation of learning. In C. W. Chen and Y. Q. Zhang, editors, Visual Communication and Image Processing. Marcel Dekker, New York, NY, 1998.
[112] J. Weng. A theory for mentally developing robots. In Proc. IEEE 2nd International Conference on Development and Learning (ICDL 2002), pages 131–140, MIT, Cambridge, Massachusetts, June 12–15, 2002.
[113] J. Weng, N. Ahuja, and T. S. Huang. Learning recognition using the Cresceptron. Int'l Journal of Computer Vision, 25(2):109–143, Nov. 1997.
[114] J. Weng and W. Hwang. An incremental learning algorithm with automatically derived discriminating features. In Proc. of Fourth Asian Conference on Computer Vision, pages 426–431, Taipei, Taiwan, January 8–9, 2000.
[115] J. Weng and W. Hwang. Online image classification using IHDR. International Journal on Document Analysis and Recognition, 5(2–3):118–125, 2002.
[116] J. Weng and W. S. Hwang. An incremental learning algorithm with automatically derived discriminating features. In Proc. Asian Conference on Computer Vision, pages 426–431, Taipei, Taiwan, Jan. 8–9, 2000.
[117] J. Weng, W. S. Hwang, Y. Zhang, and C. Evans. Developmental robots: Theory, method and experimental results. In Proc. 2nd International Conference on Humanoid Robots, pages 57–64, Tokyo, Japan, Oct. 8–9, 1999. IEEE Press.
[118] J. Weng, W. S. Hwang, Y. Zhang, C. Yang, and R. Smith. Developmental humanoids: Humanoids that develop skills automatically. In Proc. First IEEE Conf. on Humanoid Robots, MIT, Cambridge, Massachusetts, Sept. 7–8, 2000. IEEE Press.
[119] J. Weng, J. McClelland, A. Pentland, O. Sporns, I. Stockman, M. Sur, and E. Thelen. Autonomous mental development by robots and animals. Science, 291(5504):599–600, 2001.
[120] J. Weng, Y. Zhang, and W. Hwang. Candid covariance-free incremental principal component analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(8):1034–1040, 2003.
[121] S. Whitehead and D. Ballard. Active perception and reinforcement learning. Neural Computation, 2:409–419, 1990.
[122] S. Wilson. Classifier systems and the Animat problem. Machine Learning, 2(3):199–228, 1987.
[123] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. In Proc. of Int. Conf. on Computer Vision and Pattern Recognition, pages 379–385, 1992.
[124] J. Zacks and B. Tversky. Event structure in perception and cognition. Psychological Bulletin, 127(1):3–21, 2001.
[125] N. Zhang and J. Weng. A developing sensory mapping for robots. In Proc. IEEE 2nd International Conference on Development and Learning (ICDL 2002), MIT, Cambridge, MA, June 12–15, 2002.
[126] Y. Zhang. Online Development of Cognitive Behaviors by a Robot: A Case Study Using Auditory and Visual Sensing. PhD thesis, Department of Computer Science and Engineering, Michigan State University, 2002.
[127] Y. Zhang and J. Weng. Complementary candid incremental principal component analysis. Technical Report MSU-CSE-01-24, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, August 2001.
[128] Y. Zhang and J. Weng. Action chaining by a developmental robot with a value system. In Proc. IEEE 2nd International Conference on Development and Learning (ICDL 2002), MIT, Cambridge, MA, June 12–15, 2002.