an 3:4 3“ ”tuft”. 5.. 5w. ‘ 4“. ‘. . .y .. A: . J: immunuxhi . ‘ ‘ . V ‘ . In , . , . . .1 1 .333 .. . . .365 1 , . .lefitfifiw‘? 1.. . L. t.» ‘1 “B It (A! WW3.” .,_..“...uw._.,.. Etta-.. LIBRARY Michigan State University This is to certify that the thesis entitled Autonomous Mental Development in High Dimensional and Continuous State and Action Spaces and its Application in Autonomous Learning of Speech presented by Ameet V. Joshi has been accepted towards fulfillment of the requirements for the MS. degree in Electrical and Computer Engineering 6361@L73 lg/ ‘ Major Professor’s Signature Wednesday, May 07, 2003 Date . —-—--.-o-—-_-—.—-—--—-._.__..._ MSU is an Affirmative Action/Equal Opportunity institution PLACE IN RETURN Box to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 6/01 cJCIFlC/Dateoue.965«p.15 AUTONOMOUS MENTAL DEVELOPMENT IN HIGH DIMENSIONAL AND CONTINUOUS STATE AND ACTION SPACES AND ITS APPLICATION IN AUTONOMOUS LEARNING OF SPEECH. By Ameet J oshi A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Electrical and Computer Engineering 2003 ABSTRACT AUTONOMOUS MENTAL DEVELOPMENT IN HIGH DIMENSIONAL AND CONTINUOUS STATE AND ACTION SPACES AND ITS APPLICATION IN AUTONOMOUS LEARNING OF SPEECH. By Ameet J oshi Autonomous Mental Development (AMD) of robots opened a new paradigm for developing machine intelligence, using neural network type of techniques and it fun- damentally changed the way an intelligent machine is developed from manual to autonomous. The work presented in this thesis is a part of the SAIL (Self-Organizing Autonomous Incremental Learner) project which deals with autonomous development of humanoid robot with vision, audition, manipulation and locomotion. The major issue addressed here is the challenge of high dimensional action space (5 to 10) in addition to the high dimensional context space (hundreds to thousands and beyond), typically required by an AMD machine. This is the first work that studies a high di- mensional (numeric) action space in conjunction with a high dimensional perception (context state) space, under the AMD mode. Two new learning algorithms, Direct Update on Direction Cosines (DUDC) and High-Dimensional Conjugate Gradient Search (HCGS), are developed, implemented and tested. The convergence properties of both the algorithms and their targeted applications are discussed. Autonomous learning of speech production under reinforcement learning is studied as an example. ACKNOWLEDGMENTS I would like to express sincere gratitude towards my thesis advisor, Dr. Juyang Weng for his continuous encouragement in my research work as well as expert guidance. Right from the beginning he nourished the passion towards the research. Working with him and pursuing the goals and finding solutions was a joyful experience. I would also like to thank my fellow lab mates: Yilu Zhang, Nan Zhang, Xiao Huang, Yi Chen, Micky Badgero, Shuking Zheng, David Cherba, Raja Ganjikunta and Gil Abramovich. Working in this enthusiastic and amicable group was fun. I also had many interesting conversations with them which led to lot of new ideas. I would like to express deep regards towards my beloved father, Mr. Vijay Joshi, mother Mrs. 
Madhuri Joshi and brother Mandar, who, in Spite of being far away from me, made a deep contribution in this thesis by continuously encouraging me and making me believe in my abilities. This work was not possible without their support. I would like to express sincere regards towards my thesis defense comity members, Dr. Fernanda Ferreira and Dr. George Stockman. Dr. Ferreira’s insightful sugges- tions and comments helped me correct the thesis report on the basis of psychological iii development and Dr. Stockman’s subtle remarks helped me understand the different perspectives of the topic. Last but not the least, I would also like to thank many of my friends and room- mates with whom I shared a valuable time in the form of long discussions regarding my research work and who always encouraged me towards the research. iv TABLE OF CONTENTS LIST OF FIGURES vii LIST OF TABLES ix 1 Introduction 1 1.1 What is Intelligence? ............................. 1 1.2 SAIL Robot Project ............................. 4 1.3 Autonomous Mental Development (AMD) and Speech .......... 5 2 Background 7 2.1 Early Development of Language in Humans ................ 7 2.2 Speech Synthesis ............................... 9 2.2.1 Classification based on Target Vocabulary ................ 10 2.2.2 Classification based on Target “Intelligibility and Quality” tradeoff . . 10 2.2.3 Classification based on Synthetic versus Biologically motivated methods 11 2.3 Reinforcement Learning ........................... 13 2.3.1 Reinforcement Learning Algorithm .................... 14 2.3.2 Numerical example of reinforcement learning ............... 17 2.3.3 Mathematical model of reinforcement learning .............. 18 2.4 Q-Learning .................................. 19 2.5 Exploration and Exploitation ........................ 20 2.5.1 Directed Exploration ............................ 21 2.5.2 Undirected Exploration .......................... 21 2.5.3 The Search Methods ............................ 24 2.6 Developmental Learning ........................... 24 2.6.1 The eight challenges of AMD ....................... 25 3 Working of the Organs 27 3.1 Working of Ear ................................ 27 3.1.1 Cepstral parameter encoding ....................... 28 3.2 Working of Mouth .............................. 29 3.3 Working of IHDR Tree ............................ 30 4 System Architecture and Objective 35 4.1 System Architecture ............................. 35 4.1.1 The Notion of Pseudo—Parallelism ..................... 35 4.2 Context Based Learning ........................... 38 4.2.1 Obtaining Context ............................. 39 4.2.2 Normalization of the Parameters ..................... 39 4.2.3 Information Storage and IHDR ...................... 40 4.3 Reinforcement Learning ........................... 41 5 The Two Approaches 46 5.1 Learning with Direct Update on Direction Cosines (DUDC) ....... 47 5.1.1 EMMP ................................... 47 5.1.2 Robust Learning .............................. 49 5.1.3 Consideration about Rewards ....................... 51 5.2 Learning with High-dimensional Conjugate Gradient Search (HCGS) . . 54 5.2.1 Amnesic VQ in Action Space ....................... 54 5.2.2 Interpolation Using Q-values ....................... 56 5.2.3 Conjugate Gradient Method ........................ 58 5.2.4 The Algorithm ............................... 59 5.2.5 Q-Learning ................................. 67 5.2.6 Boltzmann Exploration .......................... 
68 5.2.7 The Detailed Behavior of the Action Micro-Clusters ........... 68 5.3 Developmental Learning ........................... 71 6 Results and Discussion 74 6.1 Testing of DUDC ............................... 74 6.1.1 Testing with Synthetic Teacher ...................... 77 6.1.2 Testing with Human Teacher ....................... 78 6.2 Testing with HCGS .............................. 79 7 Contributions and Conclusions 86 7.1 Contributions ................................. 86 7.2 Conclusions .................................. 87 8 Future Scope 88 8.1 Future Scope ................................. 88 vi 2.1 2.2 2.3 3.1 3.2 3.3 3.4 4.1 4.2 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 LIST OF FIGURES The vocal tract structure ............................ 12 The fundamental model of reinforcement learning ................ 14 The variation in the probability distributions (pdf’s) of the actions with change in Boltzmann temperature “0”. ...................... 23 The hamming window schema .......................... 29 The vocal tract with HL parameters. ..................... 30 The block diagram of KLSYN99 synthesizer. ................. 31 The block diagram of HLsyn. ......................... 32 The temporal behavior of the system. ..................... 38 The modified reinforcement learning model used in the thesis .......... 42 The solid circle represents the target and the hollow circle is the starting point. The arrow denotes the direction of the previous action and its end denotes the new start position. The different variations of the positions of the new starting point and the target are shown. Although in each case the distance from the target is same, the direction towards the target is entirely different from the starting direction ........ 50 The system architecture with DUDC. ..................... 53 The plots of the Interpolated functions and its gradients for various com- binations of the distribution of sample points and their Q-values. . . . 57 Modified Conjugate Gradient method applied to simulation data similar to practical data. The starting point is chosen arbitrarily ........ 62 Modified Conjugate Gradient method applied to simulation data similar to practical data. In spite of starting point being near maxima, new search method still converges to minima. ................ 63 Modified Conjugate Gradient method applied to general sin-cosine func- tion. The starting point is chosen arbitrarily ............... 64 Modified Conjugate Gradient method applied to general sin-cosine func- tion. In spite of starting point being near maxima, new search method still converges to minima. ........................ 65 The plots of the Interpolated functions and performance Of CG on it. In each figure first plot shows convergence of CG v/s starting point of the search and second plot shows the interpolated function. The dotted lines enclose the region where the CG search is deviating from the target. 66 vii 5.9 The Search trajectory of CG in two dimensions. The real trajectory in 3D and its contour path are shown. ..................... 5.10 The system architecture with HCGS. ..................... 5.11 The Flow Chart of the Program ......................... 5.12 The block diagram of the system as part of the entire SAIL architecture. . . . 6.1 This figure shows the distribution of the four vowels in the reduced dimensions (using PCA). The point in the left top corner represents one end of the search space and the point in the right bottom corner represents the other end. 
The distribution as can be seen is sparse and also overlapping. . . . . 6.2 The variation in convergence rate with variable dimensions and error rate in rewards .................................. 6.3 The Convergence plots generated during the learning of different vowels during 600 iterations using ST. The plots in the order of left to right and top to bottom are for vowels ‘/a/’, ‘/e/’, ‘/i/’, ‘/ae/’ and ‘/u/’. The context used in this testing was the direct internal sensory information in the form of HL parameter. ...................... 6.4 The figure shows the convergence plots obtained from the test carried with synthetic teacher. The plots show the results for contexts from one to four in the order left to right and top to bottom. The synthetic teacher played the wave files recorded in my voice to create context and gave rewards based on distance to the target. ................ 6.5 The modified timing diagram for learning with Human Teacher. ....... 6.6 The Convergence plots from the real time test with human teacher. The left figure shows the convergence at a context based on the distance from the final point. Due to unknown target parameters the normal- ized distances are found from the final convergence point. Hence the distance of zero is the final convergence. The right figure shows the trajectory of the search is plotted in reduced dimensional space. The dimension reduction is done using PCA. ................ 6.7 The figure on left shows the convergence with number of micro-clusters as 100 and figure on right shows the convergence with number of micro- clusters as 30. ............................... viii 75 77 83 83 84 85 LIST OF TABLES 2.1 The simulation of Reinforcement Learning ................. 17 6.1 The table displaying the average learning rate with variable dimensions and variable error rates in rewards. ................... 76 6.2 Difference in the two methods of testing .................... 79 ix Chapter 1 Introduction 1.1 What is Intelligence? The concept of intelligence has always been an issue of heated debate among the philosophers all over the world. There are hundreds of definitions of the word intel- ligence available, most of which are the result of playing around with words and the nuances of their meanings. The common thought that flows in all of them is that the intelligence has to be based on the skills acquired by experience. The type of skills vary to a great extent from humans to other animals of birds. With the emergence of machines and their revolution and culmination into the birth of a computer has given a new direction to the definition of intelligence. Now it is not just human intelligence that we deal with but also the intelligence of the machines. Computers possess very high raw computing capabilities, which are far beyond human scope. This fact leads to the interesting comparison of human intelligence with machine intelligence. Although these machines possess some powerful comput- 1 ing capabilities, they fail miserably when dealing with trivial things. There lies an important distinction between the type of tasks computers find easy to do and ex- tremely difficult to do. The tasks are called as “muddy” tasks when humans find them trivial, but computers find them extremely difficult. For more information about the muddy tasks readers are suggested to read [34]. However it is undeniable that these machines do possess “something” which is better than humans. 
If someone wants to make a firm statement about the comparative intelligence of a human and a machine a clear definition of intelligence is customary. The concept of “Artificial Intelligence” has evolved in this quest of finding the solution to this problem. The definition of Artificial Intelligence as given by ‘Alan Turing’, a pioneering scientist and mathemati- cian, in 1950 is considered as the most appropriate even till today by many people. Turing’s test is defined as, ”You are talking to ”somebody” behind the curtain and if, after talking to that ”somebody” you feel that the ”somebody” is a human then that ”somebody” is intelligent. This definition assumes that when we feel something as ‘human’ it is intelligent and associates the intelligence absolutely with each and every human behavior. Due to the wide popularity of the phrase ”Artificial Intelligence”, the word was misused to represent the family of softwares which deal with complex logical computa- tions and complex decision making processes. These softwares were called intelligent softwares or expert systems. The performance of these machines was impressive and they superficially used to appear ‘intelligent’, as these machines could do something which only an expert person in that field only could do. These software were written by experts who already knew how to solve a class of problems, hence the intelligence shown was really a human intelligence. The program just used to act as an imitator and do its job. There was no new task that an expert could not solve and the machine could. In order to make a system Artificially Intelligent it should pass through the above mentioned Turing’s test. The most remarkable ability of humans is to learn new things from the knowledge of old things. We, can give birth to children who can learn new things that parents do not know, but we cannot, till date, make a machine which can outperform the programmers in learning new things. All the intelligence of the machine is present at the beginning of the system. With acquisition of some extra information the performance might improve, but there is no significant change the behavior of the system. The developmental machines proposed in this thesis do not have any expertise in any specific task. All they possess is the the capability of autonomously generating the symbolic representation of new things. The developmental learning algorithm proposed by Weng [34],[32] is based on these considerations. The difference between the evolutionary algorithms and devel- opmental algorithm can be stated as: the evolutionary theories are based on mu- tations or natural selection, most of which occur by accident or by chance, while the developmental algorithm is based on human “development” from childhood to adulthood. Another most discriminating factor in the deveIOpmental learning is the task non-specific nature of learning. None of the behaviors is pre—programmed by the programmer [33]. The SAIL project is aimed at making a humanoid using this algorithm. 1.2 SAIL Robot Project SAIL stands for Self-organizing Autonomous Incremental Learner. This project deals with the creation of robotic machines which are capable of learning autonomously. Any human being can learn various behaviors in his physical capabilities by getting appropriate training throughout life while conventional machines, after their manu- facturing, are just reproducers. 
The goal of this project is to break this hardwired concept about machines and develop robotic machines that are capable of learning af- ter their creation. The robot has various sensors that are similar to humans or other living organisms and is capable of continuously acquiring information from them; store it and process it. The work reported in this thesis is part of this project and, hence, is also based on the developmental learning framework. Speech is one of the most difficult modality even for a human baby to learn, as can be seen from its development. An average human baby takes about a year to speak the first word while by this time the development of the other organs is so advanced that it can even dance to music, wave a hand, and even play meccano [18],[20]. The properties of speech that make its learning difficult include the complicated process by which sound is generated from our vocal tract. The extent of variation in the production and sound of the same utterance even by the same person. Another aspect that becomes prominent in speech learning is overlapping of sensory information with its own action. In the early childhood the baby does not have a sense of separate identity. The fact that the baby is a separate individual and it is not a part of it’s parents is also a learned concept. Hence, to separate its own utterances from surrounding sounds is also something that a baby needs to learn from experience. There are no well defined actions in speech the way they are in locomotory organs. All these aspects reflect in the current development of robots and other intelligent systems. Thus it was expected that the performance of this system will be slow compared to the other locomotory and vision related autonomous systems being developed. 1.3 Autonomous Mental Development (AMD) and Speech The Autonomous Mental Development (AMD) [33] paradigm is the heart of the whole system. It is characterized by: 1. The existence of the body. The body will possess intelligence and hence intelligence is not purely conceptual. Most traditional work in the field of AI assumes the system as a pure computer program which can solve complicated problems. The task is well defined and the input to the system is given electron- ically. However, with the context of a developmental robot, the first difference that needs to be considered for the development of the algorithm is the existence of the body and the system should be designed to Optimize the performance that can be obtained from that body. 2. The developmental program. This program knows the capabilities of the body and can generate symbols for the perceptions obtained by the sensors on the body. This part is also present in the traditional programs. 3. Birth. This is again an important difference, as the real development of the machine starts after the birth. With the start of the program the machine becomes alive or the robot is born. It starts interacting with the environment and starts learning. 4. Development. The acquisition of the rewards from the environment facilitate the development of the robot and it starts producing actions that give him more and more rewards. This marks the final objective if the AMD. In the view of AMD the body that is considered in this thesis consists of mouth in the form of multimedia speakers and an ear which is in the form of microphone. The brain of the robot and the developmental learning program is in the form of software. 
With the birth of the robot, it starts blabbering random utterances and also responds to the sounds from the environment. The rewards given by the environment help improve the utterances based on the context in which the reward was acquired. The research focus of this thesis is AMD learning in high dimensional context and action spaces using generalized reinforcement learning mechanisms. The autonomous learning of speech by robot is taken as a challenging application to test the methods developed. Chapter 2 Background The system developed in the thesis is based on three main topics, ‘Speech synthesis’, ‘Reinforcement Learning’ and ‘High Dimensional Search’. This section describes the three topics in detail. The early development of the language production in humans from the perspective of cognitive psychology is also essential for understanding of the system. Before discussing these details the objective of the this work is presented. This work deals with learning of primitive vowels only and is not aimed at learning of consonants or words. 2.1 Early Development of Language in Humans The ‘how’ of the development of a child in its early months up to first year is one of the most puzzling question that all the psychologists and cognitive science people face. The absolute understanding of this development, if it is possible, may solve most of the mysterious problems faced by the psychologists and cognitive scientists 7 and also can lead to the production of real life humanoids. As this work is aimed towards making a robot, which can learn to produce sounds like a human baby, the development is viewed with the perspective of language acquisition in infants. The baby starts crying and is also capable of listening immediately after the birth. However, the linguistic development takes place much slower than the devel- opment of other behaviors. The primary reason behind this can be given on the basis of the structural development hypothesis [10], [6]. According to this hypothesis, there are distinct levels in the behavior shown by the babies. The operations can be labelled as first order and second order. The first order operations include the logico-mathematical operations and physical operations. The logico-mathematical operations deal with reasoning out of the perceptual information. The infants try to correlate the observed phenomena. The physical operations include the movements that relate with the causality and the cognition of the space, time and objects. The infant starts building its understanding of the world from these elementary opera- tions. The structural development also states that the development is recursive. The abilities obtained from the first order operations are used to generate the second order operations through experience and repeated use of the acquired behaviors. The high level linguistic development comes in the second order development. The understand- ing of the words and associating them with some physical objects is carried out in this stage. The words are assigned some meaning at this stage. The later development in the language acquisition deals with the understanding of the grammar. This is an extremely complex process and is not discussed in detail here due to limitation of the scope of this thesis. The initial vocabulary of the infant is marked with some meaningless blabbering. These utterances are produced with certain specific movement of the lips and the inner sound producing parts. 
The motion is governed by the amount to which the baby can stretch its organs. These utterances are shaped in later stages to generate the simple words like “Mommy” or “doggy”. The shaping of the primitive words is the main focus of this thesis. The scope limited to five vowels, ‘/a/’ as in ‘car’, ‘/e/’ as in ‘bell’, ‘/I/’ as in ‘bit’, ‘/ae/’ as in ‘hat’, ‘/u/’ as in ‘put’. 2.2 Speech Synthesis The field of speech synthesis finds its origin at the beginning of the twentieth century. From early 70s the real efforts towards the implementation of a machine or IC capable of synthesizing some form of speech were visible [2]. The speech synthesis was studied with different approaches and depending on the approach different methods were devised. Although I have discussed them briefly here, it is beyond scope of this thesis to describe all the approaches and methods, but an interested reader can find them from, [26] and [16]. The approaches to speech synthesis can be broadly classified on the basis of (1)Tar- get Vocabulary, (2)Target “Intelligibility-Quality” tradeoff and (3)Consideration of biological and psychological aspects of speech production by humans. The third criteria primarily differs from the former methods on the issue of purely synthetic methods versus biologically motivated methods. 2.2.1 Classification based on Target Vocabulary The target vocabulary can be limited to few words or it can have a large set of words. In the former case the system can be customized for the required utterances and system design is simple at the cost of loss of generalization. However if the target vocabulary is large then the customization does not work and more sophisticated model needs to be used. The systems targeted towards small vocabulary generally use synthesis by rule technique and have a database of the utterances. The details of such algorithms are discussed in [7]. 2.2.2 Classification based on Target “Intelligibility and Qual- ity” tradeoff Some systems are required to produce voices which are supposed to be understood as commands. For example, the voices in the automated telephone answering machine. In such cases the quality and the fidelity of the sound are not important but the only requirement is to have an utterance which can be appropriately ‘recognized’ or in other words, the utterance should have good intelligibility. Other applications might demand for a good quality sound. For example, a machine capable of singing songs like human singers. With small vocabulary good quality of speech can be produced at a reasonable cost, but with large vocabularies the cost of the system can be astronomic if it is implemented using similar techniques [4]. 10 2.2.3 Classification based on Synthetic versus Biologically motivated methods For simple systems, discussed in the previous two classifications, the purely synthetic methods can be effectively used. Synthetic methods refer to algorithms which con- sider speech data as data obtained from a general non—stationary statistical process. These methods do not take into consideration the processes involved in the production of sound by humans. The data is then modelled using different techniques like Linear Predictive Coding (LPC) [2], Cepstral or Homomorphic Analysis [23], Vocoders [19], Formant Synthesizers [19], Hidden Markov Models (HMM) [19], etc. Although the- oretically speech is a non-stationary process, for all practical purposes it is assumed to be stationary for a small time factor of ‘twenty milliseconds’. 
Exploiting this lim- itation, these systems use a sampling window of twenty milliseconds and encode the data in each window. To make the operation smoother, overlapping windows are also used, e.g. hamming windows. The templates corresponding to each window are used for representing the utterances. For high quality sound representation the amount of data needed for encoding is enormous and the systems cease to be practical for large time requirements. The biologically motivated methods are more effective in these situations. These methods try to model the human vocal tract with a “source and filter model”. The sources are models of the air puffs generated inside and the filters are models of the various organs in the auditory channel. To model a human vocal tract it is essential to have a complete understanding of the acoustics involved in human sound production. The structure of our vocal tract is shown in Figure 2.1. 11 The air puffs originating in the lungs form the source for sound production. It is then modulated as it travels through the various organs in the vocal tract until it comes out of the mouth. The entire column serves as filter for sound modulation. The nature .. _ a alveolar ridge , - ., a hard palate {I “at, t‘ velum ,. , nasal it 0' vocal folds ”,2 cavity I ’ teeth —{ 1, 19,. --.-,,,, , 2,. i Q tongue If". , ‘,.- lowerllp '5 ,c: . ,, . 4', pharynx . “a . . -o. . : m; ‘.. ‘2? tongue root fl, , ,'.- ,1,“- ‘ epIglottIsfii larynx 5 ‘ trachea esophagus Figure 2.1: The vocal tract structure of acoustics needed for modelling is complicated and requires an involved analytical computation and approximation to arrive at a set of parameters which represent the structure of the vocal tract. Obtaining the correspondence between the parameter values and the generated utterances remains after establishment of the model. In view of the objective of this thesis, choice of biologically motivated method was mandatory. The most widely accepted model of the vocal tract, developed by Klatt 12 [1], is used in this project. The details of this model are discussed in the next chapter. The Klatt model has been in use since 1988 and since that time many systems have been developed based on it. The current trends in speech synthesis revolve around such combinations [16] and also use psychological information in generating more human—like speech. Some of the modern systems use extremely sophisticated methods, targeted towards obtaining generalized results in multilingual speech synthesis [27]. In the background of existence of such systems, one might question about devising a new method to implement speech synthesis. However, this thesis should not be confused with a pure speech synthesis system. The objective of this thesis is to have a developmentally learning system like a human baby, which is capable of learning different behaviors based on the utterances of the five vowels. These behaviors can be unknown to the programmer and are designed by the teacher. The algorithm is task independent and speech synthesis is just an application of it. The following discussion, deals with reinforcement learning. 2.3 Reinforcement Learning The reinforcement learning is the most intuitive way of learning based on experience. It is also the commonly observed learning mechanism in living organisms. In simple terms, the reinforcement learning means mapping the usefulness or effectiveness of actions to the situations, based on the rewards obtained. 
One interesting aspect of this learning is the delayed acquisition of rewards. This may not seem absurd in real life, as we are used to it, but in the background of computer algorithms, this makes 13 the problem quite complex. After an action is taken by machine a reward is obtained from the environment which is to be used to improve the action, which is already taken in previous context. Hence its effect can only be seen when the same context repeats. When the same context repeats the machines identifies its similarity with the previous context and also recollects the action taken before and the reward obtained in response. Using this information the machine then improves its performance. In a fundamental reinforcement learning experiment, we have three objects, the learner, the environment and the rewards, as shown in Figure 2.2. The steps in learning can be listed as: Action Figure 2.2: The fundamental mode] of reinforcement learning. 2.3.1 Reinforcement Learning Algorithm 1. The agent senses the current state of the environment and performs some action. 2. A reward is given to the agent from the environment in response to the action. 14 3. The action by the agent changes the state of the environment, also the environ- ment is active itself and it can change its state on its own on top of the changes made by the agent. As a result a new state is created. 4. Go to step 1. This cycle continues indefinitely. In most of the practical situations, there lies a fixed goal in the form of some desired action or a set of actions for given state or states. The rewards are generated in such a way that the actions taken by the agent keep improving and then ultimately they converge to the desired ones. The mathematical modelling of the state and action spaces is the most crucial aspect of reinforcement learning. In a simple problem the number of states is small and the number of actions the agent can take can be modelled with few integer numbers. Maze solving problem is a typical case in such problems. Most of the research in this area assume these conditions and many strategies are devised and are tested in this area [13], [30]. However many times the situation is not as simple and the number of state spaces is large [14] or the space might also be continuous [17] and real valued. The continuous and real valued case is the most challenging as in this case we cannot have a discrete set of actions. Hence either a quantization has to be imposed on the state space or the whole search algorithm needs to be implemented in a different way. If in the case of continuous state and action spaces, the dimensionality is also high the problem becomes more challenging. Very few research examples are observed who have tried to tackle this problem [15]. The problem considered in this thesis has one of the most general framework of state and action spaces. The search 15 dimensions also suffer from inter-dependencies. This is discussed in more detail in chapter 4. Many times the same situation is described using Bayesian Network(BN) or Markov Decision Process (MDP). Few interesting examples of reinforcement learning with this framework can be found in [14],[3]. Although all these models model similar situations the mathematical framework is entirely different in them. The reward can be obtained in variety of ways. It can be boolean in the form of ‘Good’ and ‘Bad’ or sometimes it can be tristate with possibility of no reward or zero reward. 
However sometimes the situations demand to have more information in the rewards and then we can have integral or even real numbered rewards. For most cases and especially for the situations where learning lasts for long time or for large number of iterations, the boolean reward is sufficient. When the process is long lasting the improvement in each iteration is also proportionally less, consequently the effect of the reward in the improvement is also less. In other words the reward carries lesser information. The reinforcement learning is generally defined by the policy used by the agent, to improve the mapping of the actions to the states based on the rewards obtained. Although the nature of state and action spaces is variable the objective of the policy is unique and can be stated as “to reach the target as soon as possible”. Although this goal is apparently simple it needs to be defined as a mathematical expression in order to implement it using computers. This conversion is not at all obvious. There were many attempts made to fix this, but every approach seems to lack something [30]. The most widely accepted goal is to maximize the long term rewards. The policy which achieves this is called as optimal policy. The definition of the long term 16 Table 2.1: The simulation of Reinforcement Learning Iteration No. State Action Reward 1 S1 A1 0 2 81 A4 0 3 S4 A3 0 4 83 A2 B 5 S2 A1 0 6 S3 A1 B 7 81 A1 0 8 83 A3 G 9 S3 A3 G reward is also relative to the task at hand. Generalizing this statement, if we want to maximize the reward in next h iterations, then the optimal policy can be defined as the one which maximizes, E (2?:0 rt). A simple numerical example of reinforcement learning will make the whole discussion easier to understand. 2.3.2 Numerical example of reinforcement learning Consider a system with six states denoted as ‘Sl’ to ‘86’ and four possible actions as ‘A1’ to ‘A4’. The reward is tristate in nature, with ‘G’ for good, ‘B’ for bad and ‘Zero (0)’ for no reward. The objective is to train the agent to take action ‘A3’ in the state ‘S3’. A sample learning is shown in table 2.1. The policy used here is choose actions randomly and avoid actions which received ‘Bad’ reward before. The table illustrates the reinforcement learning as it occurs in discrete state and action spaces with specific target available. When we start generalization of the algorithm by making the state and action spaces continuous and hence infinite and then further making the availability of the goal obscure, the design becomes more and more complicated. Devising fast learning and also consistently convergent policy 17 in this situation becomes challenging. 2.3.3 Mathematical model of reinforcement learning With reference to Figure 2.2 the mathematical model of reinforcement learning is now presented. Let “3” denote the current state of environment. Let “a” be the current action taken among the entire set of actions “A” available and “r” is the reward obtained by the agent in response to it. “s’” be the next state of the environment. The action chosen by the agent is the result of a certain policy that is being followed by it to decide a particular action among the set of available actions. Let us denote the policy as “(I)”. Now as this policy is function of the set of possible actions and also the current state of the environment let us denote it as “(s, a)”. 
In order to choose one of the possible actions and also to update their probabilities of occurrence based on the rewards obtained it is necessary to have a value system associated with the actions. These are commonly called as “Q” values. As there is a “Q” value associated with each action there exist a one-to—one mapping between them and we can we will use notation of Q(s, a) for “Q” values. Thus the updated notation for the policy is “(Q(a, s), a)”. With this understanding of the system setup, we can write the symbolic mathematical formulation of the learning algorithm as Q’ (s, a) = f (Q(s, a),r), where Q'(s,a) is the updated set of Q values from the reward 7‘. Among the different variations of the above formula we will be considering only the algorithm called Q-learning first designed by Watkins in his PhD dissertation [5]. The information about the other variations and their scopes can be obtained from 18 [30]. 2.4 Q-Learning We have described the process of Q-learning in a symbolic way with the introduction of the policy of the action choosing and the function of Q-value update. The Q- learning is also called as an Off-Policy TD control. The value update rule is given in equation 2.1. Q(s, a) = (1 — a)Q(s,a) + a(r + 7 man Q(s', a’)). (2.1) The may in equation 2.1 means the Q-value of the action having maximum Q-value in the state 3’. Let us call this value as maxQ. Here the usage of the future Q-values might seem to be absurd, but as this whole process is iterative and the same states are going to be visited again and again the future values for current states might already have occurred before. This makes the use of future Q-values justified. In the beginning of the process all the Q values are initialized to zero. ‘a’ is called as learning rate. There are no theoretical ways to decide the value of this learning rate, but most of the time a positive value which is very close to zero is used, e.g. ‘0.1’. This equation can be interpreted as, the current Q-value is updated by adding the difference between weighted maxQ and the current Q-value of the chosen action and the reward, both weighed by the learning rate to the original value. In this algorithm only one future reward r and one future Q-value is used to update the current Q—value Q(a, 3). This concept can be extended and multiple future discounted rewards and Q-values can be used to update current Q-value Q(a, s). For proof of the convergence 19 of this method and its variations can be found in [30]. The former methods are called TD(1) while the generalized methods are called as TD(A) methods. 2.5 Exploration and Exploitation The choice of maximizing the immediate reward always points towards the action with maximum Q-value. If we just keep on taking that action, the system will soon be stagnated at a local optima and there will be no scope of improvement. In order to take other than optimal action we need to choose some action which has a Q-value less than the maximum. This raises a question as to which one to choose among a set of actions all of which are not optimal. The second best or third or the worst? This question is vague due to lack of information and most of the time the answer is to choose any action arbitrarily or randomly. The method of choosing the optimal action is called as exploitation and the method of choosing the action non-optimally is called as exploration. In order to have the system working well and reach the target in desired time we need to have suitable tradeoff between exploration and exploitation. 
The exploration always aims at reaching the absolute Optima while exploitation uses the learned behavior and tries to give optimal performance in given situation, which might not be the globally optimal performance. The considerations about the choice of exploration and exploitation are based on the situation where the learning is to be performed. Broadly the strategies of exploration can be classified into two types [31]: (1)Directed and (2)Undirected. 20 2.5.1 Directed Exploration These methods exploit the process or task specific information and thus are more efficient. Mostly these methods give superior performance in small time duration tasks with fast convergence requirements [24]. However they do not have generalization capabilities. 2.5.2 Undirected Exploration There are basically two types of undirected methods, (1)Random walk and (2)Boltz- mann Exploration. Random walk The random walk exploration is purely random process as the name suggests and hence not useful in situations where the system needs to settle after some time even if the absolute Optimal solution is not found. Boltzmann Exploration Before dealing with the mathematical formula of the Boltzmann exploration, it is essential to know the underlying concept. This method starts with almost random exploration and all the possible actions have equal probability of occurrence. However as time progresses the system starts giving more and more preference to the actions with higher Q-values. As the system learns it starts choosing the actions leading to optimal learned solution more often. Thus this method makes a smooth transition 21 from the learning phase to the testing phase. The details of the method are discussed in the following subsection. Let us denote the probabilities of all the “n” actions as ‘p,,z’ = 1 to n’ and Q-values as ‘q,-,z' = 1 to n’. Then the probabilities are generated based on following equation. ”ma-1) Llama-0‘1) (2'2) 1'— 0 is called as Boltzmann temperature. Essentially this factor controls how much importance the Q-value will get in deciding the individual probability of each action. The more the value of the temperature the lesser the importance of the Q-values. The lesser importance of Q-values means that almost all the actions will get equal probability irrespective of their Q—values. This is illustrated in Figure 2.3. When Boltzmann temperature is infinity, all the actions are equally probable. This is ideal random case. The importance of this method is that the parameter 0 can be controlled programmatically and thus the randomness in exploration can be monitored. As the machine learns, the Q—values start getting closer to their optimal values. The tendency towards using the previously learned actions should be increased along with. Maturity of a machine can be a controlling parameter in the randomness of the exploration. This modification in the randomness that is introduced through the Boltzmann temperature has not originated from any task specific information and hence it is general. After the creation of cumulative density functions (cdfs) for all the actions using Boltzmann exploration formula, a random number between ‘0’ and ‘1’ is generated and the corresponding action is taken. 
This method essentially gives one action 22 0.9 0.8 0.7 0.6 0.4 0.3 0.2 0.1 0 2 4 6 The variation of the pdf’s of the q-values with theta The q value of 1.5 The q value of 1.0 / The q value of 0.5 / / I I I I f L l l 4 l 8 1 0 theta 12 Figure 2.3: The variation in the probability distributions (pdf’s) of the actions with change in Boltzmann temperature “0”. among a set of actions from their Q-values. This limits the scope of this method to only discrete action space problems. When the action space is continuous and the number of actions is practically infinite, this method cannot be directly used. One way is to discretize the space and create Q—value system and then use the method, but in some cases and especially the one used in this project this was not feasible and hence a new system of exploration was developed. Though the mechanism is different it is based on the principle of slow maturation in time and based on that reduction in exploration probabilities. The method is explained in the chapter 4. 23 2.5.3 The Search Methods In a generalized case of a learning environment, the state and action spaces are continuous and high dimensional. To have an efficient search algorithm, the problem of the curse of dimensionality that arises has to be tackled. The solution becomes more difficult when the data is unstable or sparse. The parameters used in this thesis for speech synthesis have sparsely distributed data and inter-dependency among the dimensions. The sparse distribution of data in reduced dimensionality is shown in Fig. 6.1. The inter-dependency in the dimensions arises due to the fact that the five formant frequencies need to have monotonically increasing values. I did not find any reference to this type of problem in the literature that I read during the two years of the deve10pment. Few had dealt with high dimensional cases, but the work that comes closest to mine is [15]. However, in this referred paper, the main objective was to solve the maze-like problems with well-defined goals. They did not have the problem of the sparseness of the data and dimensional inter-dependency. The problem is tackled with two approaches, (1)Direct Update On Direction Cosines (DUDC) and (2)High-Dimensional Conjugate Gradient Search (HCGS). Both algorithms are discussed in detail in chapter 5 2.6 DeveIOpmental Learning The most important aspect of developmental learning is that it tackles the problem of computational representation of the cognitive learning observed in human infants. Autonomous Mental Development (AMD) framework is capable of producing typical 24 classical conditioning described in the cognitive psychology [34]. The AMD mode of operation discussed in chapter 1 is the building block of the system. The most distinguishing property of the developmental program is that the programmer does not know the tasks the the robot ends up learning after the birth. Therefore a developmental program must be able to generate internal representation on the fly for virtually any task. The capability of the machine is developed through real time interactions with the physical world. It depends on five constraints: (1) sensor, (2) effector, (3) computational resource, (4) developmental program and (5) the way robot is taught. Before going into the details of the system architecture the main challenges that AMD tackles are discussed. 2.6.1 The eight challenges of AMD 1. 
Environmental openness: Due to the task non-specific nature, AMD must deal with unknown and uncontrolled environments, including various human envi- ronments. 2. High-dimensional sensors: AMD should be capable of dealing with continuous digitized signals coming from the different sensors. The problem of curse of dimensionality arises here which needs to be tackled. 3. Completeness in using sensory information: There is no way the robot can determine which information is more useful and which is not. Hence it must try to utilize all the information that is available to it. All it can take advantage of is the statistical similarity in it. 25 01 Online processing: The robot itself is affecting the state which is sensed by it. Off-line processing is unable to accomplish AMD. . Real-time speed: The sensory/ memory refreshing rate must be high enough so that each physical event (e.g., motion and speech) can be temporally sampled and processed in real time (e.g., about 15Hz for vision). The notion of pseudo parallelism explains the duration of one second that is used in this thesis. . Incremental processing: Acquired skills must be used to assist in the acquisition of new skills, as a form of “scaffolding.” This requires incremental processing. Thus, batch processing is not practical for AMD. Each new observation must be used to update the current complex representation and the raw sensory data must be discarded after it is used for updating. Perform while learning: Conventional machines perform after they are built. An AMD machine must perform while it “builds” itself “mentally.” Muddy tasks: For large perceptual and cognitive tasks, an AMD machine must handle multi-modal contexts, large long-term memory and generalization, and capabilities for increasing maturity, all without catastrophic memory loss. 26 Chapter 3 Working of the Organs The system comprises of three main organ simulations. ‘The brain and associated memory’ and ‘the ear’ and ‘the mouth’. Before going into the details of the working of the system as a whole the individual organs (ear, mouth and memory) and their functionality is discussed in this chapter. 3.1 Working of Ear The ear is a purely sensory organ and does not have any action associated with it. The function of ear is to collect the speech data as obtained from the multimedia sound card and encode it into some parameters, which can then be processed. The human ear perceives the sound information through the vibrations on the diaphragm and eventually it is converted into some form of electrical signal which is then carried to the brain through neurons. The choice of encoding is based on few criteria like the capability of representation and compression power. There need not be any hard and 27 fast rule about choice of encoding. Also it is not quite essential from the conceptual point of view that the parameters should be similar to what humans use. However, it is still unknown to us how exactly the encoding of speech signal works in human ear. Out of the other possible alternatives like Linear Predictive Coding (LPC) pa- rameters or the formant frequencies, the cepstral parameters were chosen. The main advantage of these parameters is their ease of computation. The processing of the raw speech to cepstral parameter extraction is now explained in detail. 3.1.1 Cepstral parameter encoding The speech is sampled at 11.025 Khz and with eight bit resolution for each sample. Thus one second duration of speech contains 11025 samples. 
A hamming window of size 256 is used to extract the cepstral coefficients. Instead of using separated hamming windows the neighboring windows are overlapped. An overlap of 56 samples is maintained in order to have smoother performance. Hence the effective window size is reduced to 200 samples as can be seen from Figure 3.1. The 200 samples correspond to precisely 18.14 milliseconds. The fastest response time for human ear is experimentally found to be near 17 milliseconds [19]. One set of cepstral coefficients contains 13 parameters. Thus, one second sample of the raw speech data is converted to total of 55 sets of cepstral coefficients, or total of 55 x 13, i.e. 715 cepstral parameters. 28 Complete Hamming Window Overlapping Region Efective Window Size A 256 /LA 200 / .1 Figure 3.1: The hamming window schema. These parameters are kept in a queue and as new parameters arrive, the queue is updated. The old items are discarded from one end and new items enter from another. The brain keeps track of this queue and acquires the information after the end of every one second. This forms the dimensionality of the external sensory information as 715. 3.2 Working of Mouth The speech producing organ is the only active action generation organ or effector in the system. The speech synthesizer developed by Klatt, known as KLSYN88 [21], [1] is used. There are total of 40 parameters in the original model developed by Klatt. This model was later enhanced by Stevens and Bickley [28] who came up with total of ten constraints which were able to control the actual forty parameters. These ten parameters were called as ‘High Level’ parameters of HL parameters. This model is not an artificial speech generation model, but is entirely based on human vocal tract. These ten HL parameters can be broadly divided into two categories. Five of them represent the various physical dimensions of the vocal tract and its components and the remaining five represent the pitch and the formant frequencies. The structure of 29 vocal tract along with the HL parameters is shown in Figure 3.2 l1, l2, t3, M an .5 1 3/3. ab ue ll 39 Figure 3.2: The vocal tract with HL parameters. The block diagram of the KLSYN88 synthesizer is shown in Figure 3.3. The “SensimetricsTMInc.” used this model to develop a software called as ‘HLsyn’. This software is used for this thesis. The program of HLsyn directly maps the HL param- eters to raw speech data. This data can then be converted into wave file or played directly through program. The ‘dll’ file of the HLsyn is called directly from the main program to generate the sound from the HL parameters. The block diagram of HLsyn is shown in Figure 3.4. 3.3 Working of IHDR Tree The associative memory is implemented using the HDR tree. HDR stands for Hi- erarchical Discriminant Regression. The incremental version of the HDR was then modified for incremental update as the data is acquired sequentially. This reduces the memory requirements drastically. The new version is called as IHDR. Here IHDR tree is described in brief as required for this thesis. Basically IHDR is a statistical 30 Filtered Impulse Train KLGLO'I‘T88 ' Model Tracheal Pole-Zero F1 F2 F3 F4 F5 Pair ‘ 0 Output m... 0 n “W a u Figure 3.3: The block diagram of KLSYN99 synthesizer. decision tree and its details can be found in [11] and [12]. The IHDR tree is a bimodal tree. One lobe of the tree (lobe A) is developed using the statistical or mahalanobis distance metric, while the other lobe (lobe B) is developed using euclidian distance. 
Two main operations can be performed on the tree, (i) adding a sample and (ii) re- trieving a sample. The sample is always in the form of a set of two vectors, called X and Y. Vector X goes to lobe A while vector Y goes to lobe B. During the process of adding a sample to the tree the sample is first searched for a nearest neighbor in lobe A. When a suitable match is found, it is added into that node. Then the corresponding node in lobe B is extracted from the mapping that is maintained with the development of the tree. Then the second vector is added to that position. As the samples are added to the tree only the statistics of the node is updated and only 31 | fo fl t2 f3 f4 ag ”J HLParameters Mapping Relations 7 (pressures, flows, etc.) Lower Level Parameters Speech Output Sources Transfer Functions Figure 3.4: The block diagram of HLsyn. limited number of samples are preserved. After a node is accumulated with sufficient number of additions the node is frozen and it spawns into child nodes. The number of maximum nodes that can be created in a tree is predetermined. In the initial phase the new nodes are created when the incoming sample has large distance from the pre-existing nodes. In the later cases when all the nodes are populated, the incoming sample is classified into one of the existing nodes. The IHDR tree structure is capable of handling the seven stringent requirements that any regression tree should. The requirements are listed below. 1. It must take high-dimensional inputs, with unknown correlation between the components. Some input components might not be related to output at all. 2. It must perform one—instance learning. An event represented by only a single input sensory frame must be learned and recalled. Thus, iterative learning methods such as back-propagation learning are not applicable. 3. It must dynamically adapt to increasing complexity. It cannot have fixed num- ber of parameters, like a traditional neural network, since the complexity of the desired regression function is unpredictable. 32 4. It must deal with the problem of local minima. Due to the online real-time learning requirement, the tree being built must be successful. Simultaneously, keeping multiple networks, each starting with a different random initial guess, and then selecting the best performing network, is not applicable to real-time online learning 5. It must be incremental. The input must be discarded as soon as it is used for updating the memory. It is impossible to save all the training samples since the space required is too large. 6. It must be able to retain most of the information of the long-term memory without catastrophic memory loss. However, it must also forget and neglect unrelated details for memory efficiency and generalization. With an artificial network with back-propagation learning, the effect of old samples will be lost if these samples do not appear later. 7. It must have low time complexity in computing and update so that the response time is within fraction of a second for real-time learning, even if the memory size has grown very large. This outlines the development of the IHDR tree. When the sample is to be re- trieved it is extracted using the vector Y from lobe B, using the euclidian distance. Then from the mapping, its counterpart, vector X from lobe A is extracted. Though the tree has two different lobes, different distance metric is used to classify them. 
Hence, to exploit this feature, the two vectors X and Y used in each sample are identical and both represent the context. The context consists of external sensory information, as discussed in 'Working of Ear', and internal sensory information in the form of HL parameters, as discussed in 'Working of Mouth'.

Chapter 4

System Architecture and Objective

4.1 System Architecture

Having dealt with the working of the individual organs, this chapter discusses the functioning of the system as a whole. Although all the organs function as separate threads, a certain synchronization needs to be maintained, and the issue of true parallelism versus pseudo-parallelism becomes important in this scenario. The next subsection discusses the need for pseudo-parallelism.

4.1.1 The Notion of Pseudo-Parallelism

While we are aware of our surroundings, we can simultaneously listen to music, think about something else, tap our hands on a table and shake our legs, without ever noticing the kind of parallelism our body is performing. The brain is flooded with data from all the sensory organs, which must be processed, and responses must be generated for all the effectors that carry them out. These processes go on continuously, yet we remain unaware of them. However, when designing a system that aims to imitate a human, it is necessary to have a clear understanding of these processes. Although we say the body performs all its actions simultaneously, the brain is unique. It can be argued that there are multiple parts of the brain, all processing simultaneously (which is partly true), but there will always be instances where the same part of the brain has to make multiple decisions at the same time. This is, strictly speaking, impossible. However, if the same processing part performs time slicing and takes the different decisions one after another, and if the time each decision takes is sufficiently small, the discontinuity goes unnoticed. In this case the actual processing is sequential, but the observed processing is parallel. The notion of 'the same time' is in practice limited to the smallest interval resolvable by the organs: for example, a gap of a microsecond in processing the information received by the ear makes no difference to the resulting observed response. In other words, all the observed parallelism in humans is limited by a certain small fraction of time, which means that the observed parallel human activities are effectively quantized. In reality, these actions may be carried out sequentially, one after the other. This 'observed' parallelism is termed pseudo-parallelism. Most of the time we do not encounter activities that occur at a speed near our quantization limits, but sometimes it becomes impossible to explain things without this notion. A classic example is television. It makes us believe that a moving picture is being shown, when in fact it is a series of colored dots scanned serially, left to right and top to bottom, across the screen. If we could sense the picture in an infinitesimally small time, television would not exist, for a simple sequence of a few seconds would require an almost infinite amount of data. The purpose of stressing the importance of pseudo-parallelism is the fact that it is impossible to build a machine that processes data infinitely fast.
Moreover, the machine developed in this thesis is to be executed on a standard PC, which has only one processing unit and must also execute operating system processes and other programs alongside this system. Hence the processing mechanism involved is inherently pseudo-parallel. The primary goal of this thesis is to develop an algorithm capable of learning to produce speech, like a human baby, in an interactive, context-based environment using reinforcement learning. After considering all the practical aspects of speech production on a multimedia computer, a time quantization of one second is fixed for the system. This time quantization is clearly large and would be unacceptable for humans, but for the scope of this thesis it is sufficient. The working of the system under this constraint demonstrates the validity of the algorithm; the time quantization can then be reduced arbitrarily by using more powerful systems or dedicated processors.

The pseudo-parallelism in this system is implemented by parallel-running VC++ threads. There are exactly three threads, one for each organ. As the sensory and effector organs work, the brain keeps collecting the data, and at the end of each second it performs its action by making a decision. The decision is based on the external data obtained from the ear and on the internal data obtained from the mouth; the context is formed from both data streams. Rewards from the environment can arrive at any time, and as they become available they are recorded by the brain thread. The rewards are used by the brain, along with the context, for decision making. The timing diagram shown in Figure 4.1 gives a precise idea of the system behavior.

Figure 4.1: The temporal behavior of the system (alternating one-second teacher and robot intervals; rewards arrive as momentary pulses).

4.2 Context Based Learning

The learning is context based. The rewards are the only inputs given to the system to shape its behavior. The learning is non-supervised (the internal parameters are never controlled directly by the teacher), as ideal human-like learning should be. With the reinforcement learning algorithm, the system learns based on the rewards. The context parameters and their acquisition are now explained.

4.2.1 Obtaining Context

The parameters taken from the ear are cepstral parameters, and one set of cepstral parameters corresponds to a duration of about 18 milliseconds. It has been found experimentally that, although the human ear can recognize sounds as short as 17 ms, most of the time it accumulates sound over a duration of 200 ms [19] before processing it. A detailed description of the functioning of the ear is given in chapter 3.

The parameters used to model the mouth are HL parameters, of which ten are useful. There are actually 13 parameters in one set of HL parameters, as specified in the original model, but the parameters beyond the tenth are used mainly to distinguish different speakers. A constant setting can be used for producing all the sounds of a single person, and our robot is supposed to be a single person with a specific voice.

4.2.2 Normalization of the Parameters

In total there are 715 parameters obtained from the ear and ten parameters obtained from the mouth. The context is based on the combination of both. Although there are many more parameters from the ear, their importance in deciding the context is equal to that of the ten parameters obtained from the mouth.
The overall importance of the two parameter sets should be independent of the number of parameters in each set and of their magnitudes; for example, one set can have large values but few parameters, and vice versa. Hence, in order to remove the effects of these properties, the variance of each set is taken as the fundamental quantity and the parameters are normalized based on it. The variance of both sets is estimated from simulation data and the values are used in the program. In a later version this can be done on the fly, with the variance constants updated at each iteration.

4.2.3 Information Storage and IHDR

The most important aspect of the working of the brain is the storage of information. The storage and retrieval of data from memory is not sequential, as in a traditional computer memory, and this plays a key role in the system architecture. The precise way in which information is stored in the brain is not known to date. However, the associative learning observed in the brain strongly suggests the existence of a tree structure in which data is stored on the basis of its statistical similarity. The structure used in this thesis is called 'IHDR', which stands for Incremental Hierarchical Discriminant Regression, as discussed in chapter 3. As soon as the context is given to the brain, it searches the IHDR tree and tries to retrieve the nearest neighbor of that context. If a good neighbor is found, it is extracted; the choice of a good neighbor is based on a distance threshold. The actions taken in the past in a similar context are retrieved along with their Q-values. Due to the continuous, real-valued nature of the actions, they are represented either as a fixed number of micro-clusters of similar actions or as a single set of direction cosines together with a step size. The creation of these micro-clusters and of the direction cosines is described in chapter 5. The brain then uses this information, applies the action-selection policy, and generates a new conceptual action. The new conceptual action is a ten-dimensional vector in HL-parameter space; the real action is then taken by the mouth. The policy the brain uses to choose the next action is discussed next. It involves reinforcement learning and gradient-search algorithms, as well as the tradeoff between exploration and exploitation.

4.3 Reinforcement Learning

The fundamental model of reinforcement learning and its algorithm are discussed in chapter 2. In the traditional reinforcement learning model there is a distinct boundary between the environment to be sensed and the agent. However, with the complex architecture of the humanoid robot considered in this thesis, this model becomes inadequate. The humanoid robot has a physical body which consists of locomotory organs such as hands, feet and mouth, sensory organs such as ears and eyes, and a brain that controls the organs. The brain is more of an abstract organ. In order for the robot to function properly, its brain should have precise knowledge of the relative positions of the organs along with their limitations. During the robot's response to the environment, the conceptual action is generated by the brain and the real action is taken by its organs. Hence, it is essential that the brain keep track of the state of its own organs. With regard to the brain, the state of the organs is external; for the humanoid robot as a whole, this state is internal.
This state information is called internal sensory information, and the brain obtains it by communicating with the respective organs. Figure 4.2 shows the modified reinforcement learning system used in this thesis.

Figure 4.2: The modified reinforcement learning model used in the thesis (the agent exchanges conceptual actions and internal sensory information with the somatic organs, which in turn exchange environmental actions and environmental sensory information with the environment).

The complete state information is generated by combining the external state information and the internal state information. The external sensory state information takes the form of the cepstral coefficients obtained from the ear, and the internal sensory state information takes the form of the HL parameters obtained from the mouth. There are 715 cepstral parameters, as explained in chapter 3, and ten HL parameters, so the total state space has dimensionality 725.

Any reinforcement learning system has to deal with a certain search space. The learning of the system is characterized by a multitude of variations depending on the specification of the search space and the constraints it imposes on the learning. The possible combinations are discussed below.

The search space can be n-dimensional, where n can be small (one to three) or large (more than three). The space can be discrete or continuous: in the discrete case the number of actions in any state is fixed, while in the continuous case the number of actions is unlimited. With this distinction we can have different combinations of search spaces, ranging from low-dimensional discrete to high-dimensional continuous. One more aspect that often affects the learning is the set of constraints imposed on the search space by practical limitations. The basic constraints arise from the boundaries of the search space in all dimensions, since we cannot range from negative infinity to positive infinity in every dimension. A further limitation arises from dependencies among the dimensions: certain combinations of values cannot be realized in practice, even when each individual value lies within its specified range. This leads to the notion of a variable range in each dimension that depends on the other dimensions.

The other parameter that is crucial to the performance of a reinforcement learning method is the generation of rewards. The reward is the only guiding signal used in reinforcement learning. The reward can be boolean ('1' and '0'), tristate ('-1', '0' and '+1'), quantized with more than three levels, or real valued. In the traditional reinforcement learning paradigm, given by equation 4.1, any of these values can be used:

Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \bigl( r + \gamma \max_{a'} Q(s', a') \bigr).    (4.1)

The mathematical modelling of Q-learning, or of any of its variations, is based on the ultimate goal of maximizing the reward. The time horizon over which the reward is to be maximized is a crucial aspect of modelling the learning equation. When the horizon is short and the search space is low-dimensional and discrete, the reward must be utilized heavily in order to improve; in the other extreme, when the search space is large, high-dimensional and continuous, with inter-dependencies among the dimensions, the horizon has to be long and the use of rewards is minimal.
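As a concrete illustration of the tabular update in equation 4.1, the sketch below applies one step of the rule. It is a minimal example, not the thesis code: the table keyed by discrete state and action indices, the function name, and the default discount factor of 0.9 are assumptions made only for the illustration (the learning rate of 0.1 follows the value used later in the thesis).

```cpp
#include <algorithm>
#include <limits>
#include <map>
#include <utility>
#include <vector>

// Q-table keyed by (state index, action index); the discretization is purely
// illustrative -- the thesis state and action spaces are continuous.
using QTable = std::map<std::pair<int, int>, double>;

// One application of equation 4.1:
// Q(s,a) <- (1 - alpha) Q(s,a) + alpha ( r + gamma * max_{a'} Q(s',a') ).
void qUpdate(QTable& Q, int s, int a, double r, int sNext,
             const std::vector<int>& actions,
             double alpha = 0.1, double gamma = 0.9)
{
    double best = -std::numeric_limits<double>::infinity();
    for (int ap : actions)
        best = std::max(best, Q[{sNext, ap}]);
    if (actions.empty())
        best = 0.0;                      // no successor actions to look ahead to

    double& q = Q[{s, a}];
    q = (1.0 - alpha) * q + alpha * (r + gamma * best);
}
```

The rest of the thesis is concerned precisely with what to do when neither the state s nor the action a can be enumerated in such a table.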
Most of the work in this field focuses on the simpler cases and on optimizing the learning performance in specific learning environments. Even the more general cases considered by many authors deal with optimizing the performance of the system for some specific environment. The current work started with the intention of developing a learning system for the most general case in reinforcement learning, with a high-dimensional (ten-dimensional) and continuous action space, each dimension having a different range, and with interdependencies among the values in the different dimensions. The specific goal is to use this system to build a robotic system capable of learning to speak like a human baby. There is no supervised learning at all, and the learning has to be general: the baby should learn to respond in the way the teacher desires. Given this most general case, and the imposition of not exploiting task-specific information that is not known to the baby, developing the algorithm is a very difficult task.

The system presented here is intended to keep working for an unlimited time; hence it should be capable of handling an unlimited number of interactions. This goal is drastically different from the commonly used objective of achieving some desired behavior in a limited number of steps. It is therefore important for the system to be capable of unlearning certain things learned in the past while continuing to learn from new data. The speed of learning is not a very important criterion, since time is not a critical parameter; however, the system should continuously improve in a certain direction, based on its interaction with the environment, and in the long run the program should converge to the target with probability one.

During the development of the system two different approaches and their variations were experimented with and evaluated; a summary of each approach and its performance in specific cases is discussed in the next chapters.

Chapter 5

The Two Approaches

This chapter deals with the two approaches developed during the research. As described in the earlier chapters, the challenging issue in this research is to use the reinforcement learning model effectively in settings that are more complicated than the traditional reinforcement learning algorithm assumes. In practical situations it is difficult to have a discrete set of actions that remains constant irrespective of the context state; however, the quantification used in reinforcement learning requires it. The same applies to the assumption of a finite set of states, while in reality there are infinitely many variations of states. Hence some modifications of the algorithm are mandatory. To tackle this problem, two different approaches were tried. The details of the algorithms are discussed in the following sections.

5.1 Learning with Direct Update on Direction Cosines (DUDC)

The most important change in this method is the removal of the set of actions and its replacement by a single set of direction cosines, Psi(i), i = 1, ..., 10. This set represents the current estimate of the optimal direction in the given search space at a given level of the step-size hierarchy. The objective of the reinforcement learning is to find the optimal set of direction cosines. The set of direction cosines, coupled with the step size, represents the action. Hence we now have just one action, and the objective is to shape it using reinforcement learning.
Since this framework is different from traditional learning, a new mathematical model of learning is required. Before explaining the new model of reinforcement learning, it is essential to explain the exploration-exploitation tradeoff mechanism used in this framework.

5.1.1 EMMP

EMMP stands for "Exploration with Maturation with Multidimensional Perturbation". Given the set of direction cosines Psi(i), the concept of exploration is to change the given direction randomly, so that all possible directions are equally likely. The concept of maturation arises from the tradeoff between exploration and exploitation: as the Psi(i) are shaped by the reinforcement learning, their direct use is the exploitation. Following the well-known principle that exploration leads to finding the global optimum while exploitation leads to using the learned local optimum, the exploration is slowly reduced as the system matures. The quantity kappa is used as the quantitative measure of maturation. The mathematical model developed with this concept is:

\kappa = \log\left(1 + \frac{i}{40}\right)                         (5.1)
\Omega(i) = e + \Psi(i)\,\kappa                                     (5.2)
\xi = \left(\sum_{i} \Omega^2(i)\right)^{1/2}                       (5.3)
\Psi'(i) = \frac{\Omega(i)}{\xi}                                    (5.4)

Here e is a real-valued random number between -1 and +1, taking all values in that range with equal probability, and i is the iteration number, which is incremented with the arrival of a context. The number 40 was chosen after empirical evaluation of various other values. The first step calculates kappa, the quantitative measure of the maturity of the system, from the number of iterations. The next three steps describe how the new set Psi'(i) is obtained from the old one using the EMMP method. The weights Omega(i) are intermediate variables used before normalization, and xi is the normalizing factor. This method uses a variation of the exploration-exploitation tradeoff based on the principles of Boltzmann exploration. The Psi(i) represent the original direction cosines and the Psi'(i) the new direction cosines obtained after the exploration. The reinforcement learning update is based on the EMMP method and is given by

\Psi(i) \leftarrow \Psi(i) + \alpha\, r\, \bigl(\Psi'(i) - \Psi(i)\bigr).    (5.5)

A value of 0.1 is normally used for alpha, which is also called the learning rate. The EMMP step followed by the reinforcement learning update constitutes the first step in the entire hierarchical search mechanism. The complete action is obtained by multiplying the direction cosines by the step size delta. The choice of the step size delta is crucial to the working of the method: with any given step size there is a limit to how close the search can come to the target. In order to reach any desired proximity to an arbitrary target from an arbitrary starting point, a coarse-to-fine search is necessary. With a change of step size the previously learned direction becomes useless, as can be seen from Figure 5.1: after reaching a certain point with a certain step size, the step size is reduced, and the direction required with the new step size is totally different from the direction that was required with the previous step size. The method described here tries to improve the direction by trying different possible directions; it does not try to model the possible distribution of the rewards and therefore does not seek information from the neighborhood. The advantage of this method is its speed of convergence, as discussed in chapter 6.
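A minimal sketch of one EMMP iteration, as I read equations 5.1-5.5, is given below. The constant 40 and the learning rate of 0.1 follow the text; the function name, the random-number machinery, and the choice of drawing a fresh perturbation e for each dimension (rather than a single scalar, which the text leaves open) are assumptions made only for the illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One EMMP exploration step followed by the direction-cosine update (5.1)-(5.5).
// psi holds the current direction cosines, i is the iteration count, and r is
// the scalar reward for the attempted direction.
void emmpStep(std::vector<double>& psi, int i, double r, std::mt19937& rng,
              double alpha = 0.1)
{
    std::uniform_real_distribution<double> uni(-1.0, 1.0);

    double kappa = std::log(1.0 + i / 40.0);           // (5.1) maturation measure

    std::vector<double> omega(psi.size());
    double xi = 0.0;
    for (std::size_t d = 0; d < psi.size(); ++d) {
        omega[d] = uni(rng) + psi[d] * kappa;           // (5.2) perturbed direction
        xi += omega[d] * omega[d];
    }
    xi = std::sqrt(xi);                                 // (5.3) normalizer

    for (std::size_t d = 0; d < psi.size(); ++d) {
        double psiPrime = omega[d] / xi;                // (5.4) explored cosines
        psi[d] += alpha * r * (psiPrime - psi[d]);      // (5.5) reinforcement update
    }
}
```

The full action would then be the direction cosines scaled by the current step size delta, with delta itself reduced or restored according to the robust rule described in the next subsection.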
5.1.2 Robust Learning

The EMMP method finds a near-optimal direction towards the target under the constraint of the given step size. With each reduction of the step size delta, the search becomes finer and the scope of the search also shrinks. The reduction of delta is controlled solely by the rewards, the details of which are discussed in the following subsection. In order to take into account possible errors in the rewards, and to make the algorithm converge in spite of erroneous rewards, the following rule is applied to change the step size.

Figure 5.1: The solid circle represents the target and the hollow circle the starting point. The arrow denotes the direction of the previous action and its tip denotes the new start position. Different placements of the new starting point relative to the target are shown: although in each case the distance to the target is the same, the direction towards the target is entirely different from the starting direction.

Reversible step-size change algorithm:

1. Initialize delta.
2. Initialize a counter for bad rewards.
3. Use EMMP to update the Psi(i), while also updating the bad-reward counter.
4. If the number of bad rewards exceeds the threshold, then increase delta and restart the EMMP.

The above algorithm makes sure that the EMMP does not stagnate at a certain point in the search space because an erroneous reward caused an unwanted reduction of delta. This makes the algorithm robust.

5.1.3 Consideration about Rewards

As discussed before, the nature of the rewards plays a substantial role in the overall performance of the system; the behavior assumed of the rewards in this mechanism is now discussed. In general, rewards can be relative or absolute. In the former case a 'Good' reward means that the performance of the system has improved compared to the previous attempt, while in the latter case a 'Good' reward means that the performance of the system is within certain predetermined bounds, irrespective of the previous performance. In this particular case, both types of rewards are expected by the system in a specific way. The rewards are of three types: 'Good', 'No reward' and 'Bad'. The 'Good' reward is considered absolute, and the latter two are considered relative. A 'Good' reward means that the search has come sufficiently close to the target and the step size can be reduced. A 'Bad' reward means that the performance is getting worse compared to the previous performance. 'No reward' can mean that the performance is unchanged, or that it has improved compared to the previous attempt but the improvement is still not within the predetermined bounds required to move to the next hierarchical level. This structure of rewards is constructed with real-time evaluation in mind, where the rewards are obtained from a human teacher. It is observed that a 'Good' reward definitely indicates sufficient proximity to the target. When the system is exploring in areas away from the target, it is difficult for a human teacher to determine whether the system is going in the right direction; hence, in such cases, no reward is offered. However, when the performance deteriorates, giving a 'Bad' reward is possible.

Algorithm

1. Initialize the start point.
2. Initialize the direction cosines in all dimensions; the values are normalized so that their norm is unity.
3. Initialize the iteration number, n = 1.
4. Start the iterations.
5. Get the reward for the iteration.
6. Use the reinforcement learning equation to update the direction cosines.
7. Take the action.
8. Increment the iteration number, n = n + 1.
9. Go to step five. (There is no stop.)

The architecture is shown in Figure 5.2.

Figure 5.2: The system architecture with DUDC (the ear receives external sensory information and rewards from the environment or human teacher, the IHDR tree adds and retrieves patterns, the value system handles the exploration/exploitation decision, and the mouth carries out the external action described by the HL parameters).

5.2 Learning with High-dimensional Conjugate Gradient Search (HCGS)

In traditional Q-learning, as described in chapter 2, there is a set of discrete actions with a Q-value associated with each action; depending on the policy, an action is chosen and implemented. This method proves inadequate for continuous, real-valued state and action spaces. However, given the proven convergence properties of Q-learning and its robustness to occasional incorrect rewards [29], it is suitable for this problem. In order to fit the current problem into the Q-learning architecture, vector quantization (VQ) of the action space is required. The VQ used in this thesis is called amnesic VQ and is discussed in the following section.

5.2.1 Amnesic VQ in Action Space

The quantized actions in the action space are called micro-clusters; the actions are quantized using a suitable threshold. Each micro-cluster represents a single action and possesses a Q-value. Each context is limited to a specific number of micro-clusters of actions; after trying values from 10 to 100 for this number, it was finally fixed at 30. The mean value that represents a micro-cluster is updated using the method of 'amnesic average'. In a traditional average, once a large number of actions has accumulated, the contribution of each new action becomes infinitesimal. In practice this is not acceptable, since all the system properties need to change with time. The amnesic update eliminates this drawback by using an amnesic factor that depends on the number of updates of the action [34]. The amnesic update is based on the amnesic parameter l(n), defined as

l(n) = \begin{cases} 0 & \text{if } n \le n_1 \\ 2(n - n_1)/(n_2 - n_1) & \text{if } n_1 < n \le n_2 \\ 2 + (n - n_2)/m & \text{if } n > n_2 \end{cases}    (5.6)

The typical values of n_1 and n_2 are 20 and 2000. Thus, when the number of updates is less than 20, the update is the same as a traditional average; after that, the amnesic parameter l(n) is designed so that more weight is given to the incoming action vector. After n_2 updates the weight of the incoming sample changes again and increases at a rate 1/m, with m chosen so that after 2000 updates the weight of the incoming sample is almost constant at about 0.1%. The update equation using this amnesic factor is

R^{(n+1)} = \frac{n - l(n)}{n + 1}\, R^{(n)} + \frac{1 + l(n)}{n + 1}\, V,    (5.7)

where R^{(n)} and R^{(n+1)} denote the n-th and (n+1)-th updated vectors respectively, and V denotes the new action taken in the (n+1)-th iteration. As the learning proceeds, the micro-clusters develop and their Q-values are updated. 'Action micro-clusters' and 'actions' are used interchangeably to refer to the same list. These micro-clusters are used to generate the interpolation function for the Q-value in the composite context-action space, as described in the next section.
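The sketch below implements equations 5.6 and 5.7 as I have reconstructed them from the garbled original. The values n_1 = 20 and n_2 = 2000 come from the text; the value m = 1000 is an assumption chosen to be consistent with the stated asymptotic weight of about 0.1%, and the function names are illustrative only.

```cpp
#include <cstddef>
#include <vector>

// Amnesic parameter l(n) of equation (5.6).
double amnesicParam(int n, int n1 = 20, int n2 = 2000, double m = 1000.0)
{
    if (n <= n1) return 0.0;
    if (n <= n2) return 2.0 * (n - n1) / double(n2 - n1);
    return 2.0 + (n - n2) / m;
}

// Amnesic mean update of equation (5.7): blends the stored micro-cluster mean R
// (updated n times so far) with the new action V, giving the newcomer a weight
// that never vanishes, unlike a plain running average.
void amnesicUpdate(std::vector<double>& R, const std::vector<double>& V, int n)
{
    double l = amnesicParam(n);
    double wOld = (n - l) / (n + 1.0);    // weight of the old mean
    double wNew = (1.0 + l) / (n + 1.0);  // weight of the incoming action
    for (std::size_t d = 0; d < R.size(); ++d)
        R[d] = wOld * R[d] + wNew * V[d];
}
```

The key design point is that wNew tends to a small but non-zero constant, so a micro-cluster keeps drifting toward recent actions even after thousands of updates.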
5.2.2 Interpolation Using Q-values

In order to use the Q-values of the various actions effectively to find the optimal direction, it is essential to have an interpolation function representing the distribution of the Q-values in the context-action space. The CG method can then effectively find the optimal action using this distribution. For the CG method to work effectively, it is important that the interpolation function is smooth and has a single well-defined local minimum. In order to obtain these properties, a new interpolation method called "Density Sensitive Kernel Interpolation" (DSKI) is used. The details of the method are as follows:

G(x; x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} w_i(x)\, y_i,    (5.8)

where x_i \in R^d and y_i \in R; the value of d is 10 in the current context. Here x_1, x_2, ..., x_n are the n nearest neighbors of x \in R^d taken from S, where S is the set of micro-clusters in R^d representing an approximation of the reciprocal of the density in R^d; S can also be viewed as a finite set of neurons for each state. The definition of the weights w_i is based on the squared local sparseness sigma^2, defined as

\sigma^2 = \frac{k}{n} \sum_{i=1}^{n} \| x_i - x \|^2,    (5.9)

where k is called the kernel variance factor. The larger the value of k, the larger the kernel variance and the flatter the interpolated function; the effect of k on the distribution is shown in the figure. The expression for the weights is

w_i = C \exp\!\left( -\frac{\| x - x_i \|^2}{2\sigma^2} \right), \quad i = 1, 2, \ldots, n.    (5.10)

The constant C is computed such that \sum_{i=1}^{n} w_i = 1, and y_i = f(x_i) is the function to be approximated; in this case, y_i denotes the Q-value of an action. The action list contains about 30 samples, and the top 15 neighbors out of the 30 are chosen to interpolate the function. Sample plots generated by the DSKI method from simulated one-dimensional data are shown in Figure 5.3.

Figure 5.3: Plots of the interpolated functions and their gradients for various combinations of the distribution of the sample points and their Q-values (random samples with uniform Q-values, uniform samples with random Q-values, and random samples with random Q-values; each panel shows the interpolated function, its gradient, and the sample data points).

5.2.3 Conjugate Gradient Method

The Conjugate Gradient method tackles the problem of searching in high dimensions in an elegant and sophisticated way. There are many iterative methods for finding the minimum of a function in multiple dimensions, for example Steepest Descent or Conjugate Directions [25]; the Conjugate Gradient method is the best of these in terms of efficiency and memory usage. Originally the method was developed for finding the minimum of a symmetric quadratic function of the form

f(x) = \tfrac{1}{2} x^T A x - b^T x + c,    (5.11)

where A is a positive-definite, symmetric matrix. In that case the method guarantees convergence in n steps, where n is the number of dimensions of x. The full theory behind the method is beyond the scope of this thesis and can be found in [25]. The same method can be extended to find the minimum of a non-linear function.
In this case the function, along with its first and second order derivatives, should either be known analytically or be easily computable numerically with small numerical error. Nonlinear CG does not guarantee convergence in n steps; it can converge to a local minimum, or it may diverge, depending on the characteristics of the given function. A detailed analysis of the convergence of the method in various cases can be found in [8]. However, most of the work in this area concentrates on finding the global minimum among a set of local minima [9] [25], while the case of a function having a maximum in the neighborhood has not been studied in detail. There are two variations of the method used in the non-linear case, the Fletcher-Reeves method and the Polak-Ribiere method; the Polak-Ribiere method is the better of the two in most cases [25]. Both algorithms share most of their steps and differ only in the way the intermediate variable beta is computed. The algorithm is given below.

5.2.4 The Algorithm

1. d_{(0)} = r_{(0)} = -f'(x_{(0)})
2. Find \alpha_{(i)} that minimizes f(x_{(i)} + \alpha_{(i)} d_{(i)})
3. x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)}
4. r_{(i+1)} = -f'(x_{(i+1)})
5. Fletcher-Reeves: \beta_{(i+1)} = \dfrac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}};   Polak-Ribiere: \beta_{(i+1)} = \dfrac{r_{(i+1)}^T \bigl(r_{(i+1)} - r_{(i)}\bigr)}{r_{(i)}^T r_{(i)}}
6. d_{(i+1)} = r_{(i+1)} + \beta_{(i+1)} d_{(i)}

In this nonlinear version of CG there is no direct way to compute the step size alpha, as there is in the linear version; hence line-minimization techniques are employed to find it. This step size denotes the optimal distance to be travelled in the direction d found by the CG algorithm. Iterative methods such as Newton-Raphson or the secant method are generally used for this purpose; both are based on the Taylor series expansion of the function. The analysis of the secant method is now presented.

f(x + \alpha d) \approx f(x) + \alpha \left[ \frac{d}{d\alpha} f(x + \alpha d) \right]_{\alpha=0} + \frac{\alpha^2}{2} \left[ \frac{d^2}{d\alpha^2} f(x + \alpha d) \right]_{\alpha=0}    (5.12)
\approx f(x) + \alpha\, [f'(x)]^T d + \frac{\alpha^2}{2}\, d^T f''(x)\, d    (5.13)
\frac{d}{d\alpha} f(x + \alpha d) \approx [f'(x)]^T d + \alpha\, d^T f''(x)\, d    (5.14)

The secant method approximates the second-order derivative of the function using the first-order derivative evaluated at two different positions:

\frac{d^2}{d\alpha^2} f(x + \alpha d) \approx \frac{\left[ \frac{d}{d\alpha} f(x + \alpha d) \right]_{\alpha=\sigma} - \left[ \frac{d}{d\alpha} f(x + \alpha d) \right]_{\alpha=0}}{\sigma}    (5.15)
d^T f''(x)\, d \approx \frac{[f'(x + \sigma d)]^T d - [f'(x)]^T d}{\sigma}    (5.16)

Using equations 5.13 and 5.16, the derivative of f(x + \alpha d) can be written as

\frac{d}{d\alpha} f(x + \alpha d) \approx [f'(x)]^T d + \frac{\alpha}{\sigma} \Bigl( [f'(x + \sigma d)]^T d - [f'(x)]^T d \Bigr).    (5.17)

To minimize the function, the first-order derivative of f(x + \alpha d) is set to zero, which gives an expression for alpha:

\alpha = -\sigma\, \frac{[f'(x)]^T d}{[f'(x + \sigma d)]^T d - [f'(x)]^T d}.    (5.18)

The secant method suffers from a subtle drawback. When it solves for alpha, it may land on a maximum or a minimum, whichever is closer; there is no way for the method to guarantee which. An immediate solution might be to examine the second-order derivative to distinguish a maximum from a minimum, but this has two main practical problems: (i) finding the second-order derivative is highly computation intensive and becomes considerably error prone when dealing with small neighborhoods and slowly changing functions; and (ii) most of the time the second-order derivatives return arrays filled with zeros. In order to remove this drawback and make the system always seek the minimum, the algorithm is slightly modified.
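Before describing that modification, the unmodified step can be sketched: one Polak-Ribiere CG iteration that uses the secant step size of equation 5.18 in place of an exact line minimization. This is an illustrative reading of the algorithm above, not the thesis implementation; the probe step sigma, the restart safeguard, and all names are assumptions made for the sketch.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// One Polak-Ribiere CG iteration using the secant step size of equation 5.18.
// grad(x) returns f'(x); x is the current point, r the residual (-gradient)
// and d the search direction, all updated in place. Initialize with
// d = r = -grad(x). sigma is the small probe step of the secant estimate.
void cgStep(const std::function<Vec(const Vec&)>& grad,
            Vec& x, Vec& r, Vec& d, double sigma = 0.01)
{
    // alpha = -sigma [f'(x)]^T d / ( [f'(x + sigma d)]^T d - [f'(x)]^T d )   (5.18)
    Vec g0 = grad(x);
    Vec xProbe = x;
    for (std::size_t i = 0; i < x.size(); ++i) xProbe[i] += sigma * d[i];
    double denom = dot(grad(xProbe), d) - dot(g0, d);
    double alpha = (denom != 0.0) ? -sigma * dot(g0, d) / denom : 0.0;

    // x_(i+1) = x_(i) + alpha d_(i);  r_(i+1) = -f'(x_(i+1)).
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += alpha * d[i];
    Vec gNew = grad(x);
    Vec rNew(gNew.size());
    for (std::size_t i = 0; i < gNew.size(); ++i) rNew[i] = -gNew[i];

    // Polak-Ribiere: beta = r_(i+1)^T ( r_(i+1) - r_(i) ) / ( r_(i)^T r_(i) ).
    double rr = dot(r, r);
    double beta = (rr > 0.0) ? (dot(rNew, rNew) - dot(rNew, r)) / rr : 0.0;
    if (beta < 0.0) beta = 0.0;   // restart safeguard; an addition, not from the thesis

    for (std::size_t i = 0; i < d.size(); ++i) d[i] = rNew[i] + beta * d[i];
    r = rNew;
}
```

Note that nothing in this step distinguishes a maximum from a minimum, which is exactly the weakness the modification described next is meant to address.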
When a suitable step size alpha has been computed using the secant method (the Newton-Raphson method requires computation of the second-order derivative, so the secant method is preferred over it), the value of the function in the direction opposite to that suggested by the secant method is compared with the value in the suggested direction. If the value in the opposite direction is smaller than the value in the suggested direction, it is assumed that the secant method is heading for a maximum instead of a minimum, and the opposite direction is chosen instead. It is also observed that sometimes, due to rapid changes in the value of the gradients, the step size grows drastically and the algorithm diverges from the target. In order to restrict this behavior of the secant method, the values of the function at the consecutive points x it produces are compared; if the method is observed to be diverging, it is immediately terminated at the previously found value of x and the CG method takes over. The effects of these changes are shown in figures 5.4, 5.5, 5.6 and 5.7. Both methods still bear a considerable drawback when the function also has a maximum in the neighborhood: both rely on finding the point in the neighborhood of the given point where the first-order derivative of the function reduces to zero, which inherently finds either a maximum or a minimum.

The function generated using DSKI is fed to the CG search to find the minimum. The Q-values are sign-reversed before generating the interpolated function, so that the minimum of the function corresponds to the maximum Q-value and hence to the optimal action. Plots of the interpolated function and the performance of the CG method are shown in Figure 5.8. The performance of CG in the two-dimensional case is shown in Figure 5.9, together with the starting point and the trajectory followed by CG to reach the optimal point.

Figure 5.4: Modified Conjugate Gradient method applied to simulation data similar to the practical data. The starting point is chosen arbitrarily.

Figure 5.5: Modified Conjugate Gradient method applied to simulation data similar to the practical data. In spite of the starting point being near a maximum, the new search method still converges to a minimum.

Figure 5.6: Modified Conjugate Gradient method applied to a general sine-cosine function.

Figure 5.7: Modified Conjugate Gradient method applied to a general sine-cosine function. In spite of the starting point being near a maximum, the new search method still converges to a minimum.
Figure 5.8: Plots of the interpolated functions and the performance of CG on them, for uniform and random samples with uniform and random Q-values. In each panel the first plot shows the convergence of CG versus the starting point of the search and the second plot shows the interpolated function; the dotted lines enclose the region where the CG search deviates from the target.

Figure 5.9: The search trajectory of CG in two dimensions. The real trajectory in 3D and its contour path are shown.

Due to the inherent limitations of linear line-minimization methods, they cannot be used to give optimal performance in all situations. Hence, to preserve the generality of the algorithm, the line-minimization part of the CG algorithm is removed, and the initial guess for the step size is calculated based on the density of the samples at the starting point.

5.2.5 Q-Learning

The mathematical description of Q-learning is given in chapter 2. A queue of the n most recent states, called the 'Prototype Updating Queue' (PUQ), is maintained, and with each incoming reward the Q-values of the states in the PUQ are updated. The Q-value of the most recent state is updated using the direct Q-learning rule, as given in equation 2.1. The Q-values of the succeeding states are updated using

Q(s, a) = (1 - \alpha)\, Q(s, a) + \alpha\, (\gamma \hat{Q}),    (5.19)

where \hat{Q} represents the updated Q-value of the preceding state in the PUQ.

5.2.6 Boltzmann Exploration

The details of Boltzmann exploration are given in chapter 2 and equation 2.2. Boltzmann exploration, like Q-learning, requires the actions to be discrete in order to make a choice. The generation of micro-clusters solved this problem for Q-learning, but on careful investigation it can be seen that Boltzmann exploration still cannot be used directly on these actions. The subtle difference between this list of action micro-clusters and a traditional action set is that the traditional set encompasses all the possible actions, while the current list does not: actions outside the list of micro-clusters can also be taken, and they can even be better than the ones in the list. Hence a separate random exploration is used alongside the exploitation choice produced by the CG search. The random exploration uses the upper and lower bounds on the action, together with the inter-dependencies among the action values, to generate a random action, while the CG search uses all the action micro-clusters to generate a new action. These two candidates are fed to the Boltzmann exploration with probabilities based on the status of the action list: if the action list contains enough actions with good Q-values, more probability is assigned to the exploitation, and vice versa.
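The two-candidate selection just described can be sketched as follows. The thesis does not give the exact formula by which the two probabilities are derived from the action list, so the scores and the temperature parameter below are assumptions made purely for illustration; only the overall shape (a Boltzmann choice between the CG-search candidate and a bounded random candidate) follows the text.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Choose between the CG-search (exploitation) candidate and a random-exploration
// candidate with Boltzmann probabilities. exploitScore might, for instance, be
// the best Q-value in the action list and exploreScore a fixed baseline -- both
// are illustrative assumptions, not the thesis formula.
std::vector<double> chooseAction(const std::vector<double>& cgAction,
                                 const std::vector<double>& randomAction,
                                 double exploitScore,
                                 double exploreScore,
                                 std::mt19937& rng,
                                 double temperature = 1.0)
{
    double eExploit = std::exp(exploitScore / temperature);
    double eExplore = std::exp(exploreScore / temperature);
    double pExploit = eExploit / (eExploit + eExplore);

    std::bernoulli_distribution pickExploit(pExploit);
    return pickExploit(rng) ? cgAction : randomAction;
}
```

A sparsely populated or poorly rated action list would push exploitScore down and therefore shift probability mass toward the random-exploration candidate, which is the behavior the text requires.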
However, the final decision whether to use the exploitation or the random exploration is ultimately taken by the Boltzmann exploration.

5.2.7 The Detailed Behavior of the Action Micro-Clusters

During the initial phase, when the action micro-clusters are being created for the first time, a default distance threshold is used. In later stages the micro-clusters are updated with new samples, and with each update a micro-cluster moves towards the new sample. The amnesic averaging mechanism makes sure that, even after a large number of updates, new samples still make a noticeable impact on the micro-cluster. It is observed that, with exploitation superseding exploration, some of the micro-clusters closer to the target are updated much more often than others, and most of the micro-clusters are under-utilized. As a consequence, the required uniform convergence is not observed: after the search reaches a certain proximity, the algorithm stagnates. In order to address this issue, a re-organization of the micro-clusters is performed after the context has been repeated a certain number of times. The mechanism of the re-organization is as follows. The weighted mean of the micro-cluster distribution is evaluated, and all the micro-clusters are pulled towards it in proportion to the number of times each one has been visited. This is followed by a reduction of the search space, which ensures that the random exploration is also restricted to the important area. After the repositioning of the micro-clusters, the update count of each micro-cluster is re-initialized along with its Q-value, which starts a new search in the reduced space. This mechanism essentially implements a hierarchical coarse-to-fine search.

In another approach, the top n clusters with the highest Q-values are retained while the remaining clusters are removed and then refilled with new incoming data; the search boundaries are reduced at the same time in order to restrict the random exploration. However, this method suffers from the drawback that the true target often ends up outside the reduced search space after a few iterations of cluster re-organization, which causes the search to stop after reaching a certain accuracy level.

Considering the learning in more detail: the IHDR tree is initialized with a maximum of five clusters in each node, and the maximum number of samples that can be held in a node is restricted to 200. The sample list is maintained as a common list for the entire node, and each sample is marked with a membership number corresponding to its cluster in the node. With each context retrieval the nearest sample is found; if the distance to the nearest sample is above a threshold (a threshold of 7.0 is used), it is assumed that a new context has been encountered, the context is added to the tree, and a default action, which is "keeping silent", is taken. If the distance to the nearest neighbor is below the threshold, it is assumed that the context is similar to one observed in the past; the action list corresponding to that context is extracted, and an action is chosen and taken using the policy described above. The block diagram of this method is shown in Figure 5.10.
Figure 5.10: The system architecture with HCGS (the context is matched to its nearest-neighbor sample, whose action micro-clusters and Q-values feed the interpolation and conjugate gradient search; the resulting exploitation candidate and a random-exploration candidate constrained by upper and lower bounds are arbitrated by Boltzmann exploration).

5.3 Developmental Learning

Finally, all these components are grouped into a complete system within the Developmental Learning paradigm [36]. The software platform of the SAIL project is used for the development of the system. The sound-acquisition part using cepstral components was already developed and is used directly. The other components, for the mouth and for the speech learning system, were coded independently and integrated into the system. The multi-threaded architecture of the system made it easy to add components running in separate threads. The IHDR tree is at the heart of the system, providing an associative memory similar to human memory. The sensory information, internal as well as external, is combined, normalized and added to the tree continuously as the system learns. The block diagram of the system architecture and the flow chart of all the parallel-running processes in the system are shown in figures 5.2 and 5.11 respectively.

This thesis is part of the SAIL project, in which an entire humanoid robotic system is being developed with vision and locomotion. The basic block diagram of the system as part of SAIL is shown in Figure 5.12. In the overall block diagram of SAIL, the sensory-mapping block comprises all the sensory organs, including those for vision and locomotion. The details of the system can be found in [35].

Figure 5.11: The flow chart of the program (external sensory information and rewards, and internal sensory information, are combined into a context; the IHDR tree is retrieved and updated using the reward, the value system chooses between exploration and exploitation, an internal action is chosen, its value is updated, and the result is sent to the effectors).

Figure 5.12: The block diagram of the system as part of the entire SAIL architecture (auditory input, effector signals and the internal state pass through the sensory mapping into the cognitive mapping, whose values are updated in the IHDR tree).

Chapter 6

Results and Discussion

The testing of the system is performed in multiple stages. The two approaches are tested independently, according to their requirements and objectives, and the results are discussed. The distribution of the four vowels /a/, /e/, /i/ and /u/ to be learned, in their formant space, is shown in Figure 6.1. The data for the fifth vowel, /ae/, was not available and hence is not shown in the figure. The original data is obtained from [22]. As can be seen, there is considerable overlap among the vowels, which is a challenging issue for the learning.

Figure 6.1: The distribution of the four vowels in the reduced dimensions (using PCA). The point in the top left corner represents one end of the search space and the point in the bottom right corner represents the other end. The distribution, as can be seen, is sparse and also overlapping.

6.1 Testing of DUDC

The objective of the system is to learn in a context-based environment. The DUDC method is designed to learn fast in a high-dimensional space by utilizing absolute rewards; the distinction between absolute and relative rewards is discussed in chapter 5. Although the system can handle erroneous rewards, it demands more
As the learning occurring at two different contexts is independent of each other, it is essential that the validity of the learning in individual context is tested first. With the mechanism of obtaining the context in place the rest of the functioning is extension of single context learning. Before tackling the actual system, the algorithm is tested using a simulation program. A data with variable dimensions from two to ten is used for the testing. The different locations of the target relative to the starting point are chosen and the results are tabulated in the table 6.1. The start point is chosen arbitrarily as its position is not 75 Dim. Error Rate Target Point Location Avg. no. of steps Center 29.91 0% Boundary 3.44 2 Random 54.48 Center 35.24 25% Boundary 3.84 Random 44.32 Center 257.37 0% Boundary 55.66 5 Random 319.18 Center 303.49 25% Boundary 50.12 Random 856.04 Center 1248.22 0% Boundary 445.27 10 Random 1089.82 Center 1844.08 25% Boundary 717.64 Random 2279.45 Table 6.1: The table displaying the average learning rate with variable dimensions and variable error rates in rewards. important, but the target point location is varied in three stages as ‘near the center of the search space’, ‘near the boundary of the search space’ and ‘arbitrarily anywhere in the search space’. The testing is carried out for two, five and ten dimensions. Each experiment is conducted for 100 iterations of search to get near perfect statistics of the data. The table 6.1 shows the average values of iterations for each dimensional data and also in each case the rewards without error and with 25% error are used. The measure of convergence is defined as the reduction in the target distance by the factor of 20 with respect to the starting distance. This corresponds to 5% of the initial distance. The percentage rise of the average steps for convergence is plotted in Figure 6.2 76 2500 l 7 i l l l T — Center, No Error - — Boundary, No Error . . 3 . 1 - Random, No Error : . : ; i ” -—- Center,25% Error 3 ; : ; . ” — Boundary 25% Error ; . ; . ; I : : ' ' ; : ’1’ t I / q" ’ ’. / ’I L .l .' ' f .' T .' '/ 1000 500 Figure 6.2: The variation in convergence rate with variable dimensions and error rate in rewards. After establishing the convergence of the algorithm, it is incorporated in the main system having all the three organs working parallel, the brain, the mouth and the ear. The test results with different methods are now discussed. 6.1.1 Testing with Synthetic Teacher The concept of a Synthetic Teacher (ST) is introduced. This ST is implemented as a standalone program running parallel with the system and doing the job of a human teacher. The testing with the ST is carried out in two stages. In the first stage, the ear and mouth are disabled and the learning context is not varied. This approach is 77 similar to the testing of the simulation data. The HL parameters, generated in the form of action, are directly fed to the ST to receive a reward. Due to the absence of delay due to utterance, the learning is fast. The plots of convergence are shown in the Figure 6.3 In the second stage, the ear and mouth are enabled. The system produces sounds and also listens to the environment. The ST produces sounds as pre-recorded wave files every alternate second to provide the context. However, the rewards are gener- ated from the distance to the target using the HL parameters. Because this learning requires the one second wait in each state, the learning is slow. 
Also as the context is obtained from the external and internal sensory parameters, there are multiple con- texts the system has to deal with. The plots of convergence are shown in Figure 6.4. 6.1.2 Testing with Human Teacher Although the final goal of the system is to have it interact with humans and learn from the rewards obtained, some practical problems with this testing are pre-identified: 1. The synchronization of the states of the machine and the utterance of the teacher is difficult to obtain, which is needed for optimal performance. 2. The inherent variation in the human voice and change in the volume can make the similar utterances sensed as different contexts. This is theoretically accept- able but it will delay the learning process. 3. The teacher is likely to get confused in giving rewards when robot is speaking 78 Table 6.2: Difference in the two methods of testing Testing with ST Testing with Human Teacher 1. Accurate. 1. Prone to errors. 2. Have a single and well 2. The target is a cluster of different defined target. utterances which are recognized by the human teacher as similar. simultaneously. 4. The similarity in the utterances is sometimes difficult to identify and this makes the learning process slow. These problems are only going to make the learning slow while the capability of the system to converge remains unhindered. The results are plotted in Figure 6.6. Two plots are shown for each convergence one of them shows the distance from the final utterance and the second one shows the actual positions of the parameters in reduced dimensions. The dimension reduction is performed using PCA. As can be seen the convergence is faster, but the final utterance is not of very good quality as is obtained with the ST. The reason being all the rewards given by ST are perfect (excluding the case of erroneous rewards). The testing has been conducted for more than two hours. The temporal description of the mechanism is shown in the Figure 6.5. 6.2 Testing with HCGS The method of learning with HCGS is designed to tackle the problem of developing the system to handle very large amount of interactions. The demand from the rewards is relaxed in this method and it can take absolute as well as relative rewards. The 79 objective is to have a convergence rate is not as fast as in the previous method, however the system should not deviate from the target in the long run. The testing framework is similar to the previous method, except that due to slow rate of convergence the last two real time testing steps are not carried out. The simulation results on the generalized sine-cosine function and also on a sample data obtained from the real testing are discussed in chapter 5. The performance of the system in real time simulation with ST is now discussed. The number of micro-clusters in the action list is the most crucial parameter in the overall performance. With the curse of dimensionality constructing an interpolation function in ten dimensions and then using the CG search on it to get the minima depends a lot on the number of samples available. The number of micro-clusters is varied from ten to 100 and in each case the convergence plots are generated. The convergence plots as shown in Figure 6.7 need some further explanation. The case with number of micro-clusters as 30 the convergence is better compared with the other case when the number of micro-clusters is 100. However in either cases the search does stagnates after certain level of convergence. 
The reason behind this can be explained as below. ‘Initially as the micro-clusters are formed by random exploration some of the clusters are close to the target. However, along with learning the micro—clusters start moving with amnesic average, only the ones that are close to the target are chosen according to the HCGS method and the same clusters keep getting updated. As a result the other clusters remain unchanged and the clusters which are already closer to target keep moving. Research has been carried out in order to organize these clusters after certain 80 stage and various approaches tried are discussed in chapter 5. However the results are not satisfactory in this direction as of now. It is kept as a future work to find the optimal method for the reorganization of the action micro-clusters. 81 0.5 r I I [ —— lhcremefit Size ] 0’5 ll—L__i_ Ihcrement Size ] .l 12000 [ — Distance from Target ] M m . L—— Distanee from Target I L 200 300 .400 500 600 00 Number of Iteration .5 I r , I f 0.5 0 [ll [ — Increment Size l 100 500 600 .0. as. ,5. Number of Iteration Ll—,_;Wl — lhcrement Size ] WT L—I Distance from Target ] r T [ —1 Distance from Targg] 0 1(1) 200 300 400 500 600 0 Number of Iteration 100 200 300 300 Number of Iteration 0.5 i r T x I ] ' ' [ —— Increment Size ] c l 1 A J 12000 r fl M x J [ — Distance from Target I 10000 . ‘ " 1 200 300 400 Number of Iteration 500 600 Figure 6.3: The Convergence plots generated during the learning of different vowels during 600 iterations using ST. The plots in the order of left to right and top to bottom are for vowels ‘/a/’, ‘/e/’, ‘/i/’, ‘/ae/’ and ‘/u/’. The context used in this testing was the direct internal sensory information in the form of HL parameter. 82 0.5 12000 ‘ I_\—\fi [— Increment Sizej I —1— Dlstance from Tafgit] m. I mi- 0 l l L L l 0 20 40 100 120 140 160 Number of Iteration 5 L r— IncrementSize] t— Distance frbm Target I . ' ' SOgIImIIA Iiililiiiimi 6m Number of Iteration c l 0 100 200 180 v0 0.5 I ._’ Increment Size I M I — Distance from Tar at 20 .0 a a. to .5. Number of Iteration L7 A M k U I — lhcremenT Size ] M 1m L— Distanée from Targit] moo. . . , . , ,. . 8000 6W0 4000» 800 O .2. a a .50 Number of Iteration Figure 6.4: The figure shows the convergence plots obtained from the test carried with synthetic teacher. The plots show the results for contexts from one to four in the order left to right and top to bottom. The synthetic teacher played the wave files recorded in my voice to create context and gave rewards based on distance to the target. §——r~— $T1T1T1T1T2 ’- D ‘6 P'— .Q 0 II .9 D (U 3 ll 0.) II ’ Figure 6.5: The modified timing diagram for learning with Human Teacher. 83 L M E:— InEiemenf Size I 1000 —1000 1 i -2000] I J —4000 M -5000 o 20 .2. .. g. . ... Number of Iteration 120 140 i g i i - The Starting Point The intermediate Path o The Final Point it 0 L 2m 4000 0000 I soon 1 1m 1m Figure 6.6: The Convergence plots from the real time test with human teacher. The left figure shows the convergence at a context based on the distance from the final point. Due to unknown target parameters the normalized distances are found from the final convergence point. Hence the distance of zero is the final convergence. The right figure shows the trajectory of the search is plotted in reduced dimensional space. The dimension reduction is done using PCA. 84 Figure 6.7: The figure on left shows the convergence with number of micro-clusters as 100 and figure on right shows the convergence with number of micro-clusters as 30. 
Chapter 7

Contributions and Conclusions

7.1 Contributions

During the development of this thesis, while working towards the objective of developing the system in the AMD architecture, some new areas in the field of reinforcement learning were explored. The previously existing methods were found to be insufficient or, in some cases, incompatible with the current architecture. The pursuit of these problems resulted in new contributions in the related areas. The main contributions are listed below.

(1) The first work on learning actions in a high-dimensional action space (5 to 10 dimensions) using the new AMD mode.

(2) A new technique, DUDC, is developed for reinforcement learning in a continuous and high-dimensional action space, along with EMMP as an alternative to Boltzmann exploration.

(3) The Conjugate Gradient method is modified for use in high dimensions. The secant method used for line minimization suffers from the drawback of not being able to identify a neighboring optimum as a maximum or a minimum; the method is modified to tackle this issue successfully.

(4) The first work towards learning speech production interactively using the reinforcement learning framework, without a supervised learning mode.

7.2 Conclusions

The methods developed in this thesis appear promising for tackling the problem of autonomously developing a robotic system capable of learning to produce high-dimensional (e.g., 10-D) actions interactively and autonomously. The techniques designed and implemented in this work seem capable of realizing the initial development of basic, early behaviors in a high-dimensional space through the AMD mode.

In the later stages of speech-learning development the system shows goal-directed behaviors, which facilitate faster learning. This is marked by a representation of the goal formed from early learning experience, the use of the goal to activate actions, and explicit changes of direction (e.g., understanding the goal by biting down on one's tongue and giving it a few tries). The current work is, however, new and very important in bootstrapping higher-level goal-directed learning in the later developmental stage.

Chapter 8

Future Scope

8.1 Future Scope

In this thesis the most generalized case in reinforcement learning is considered, with the objective of making a robotic system capable of learning vowels. From the literature survey conducted throughout the research, it was found that even proven, sophisticated methods seem insufficient to handle this situation. Practical problems in learning are also identified, and they give a direction for the future research that needs to be carried out.

Two new methods, DUDC and HCGS, are developed and tested. The former worked well under the given circumstances; the latter also seems promising, given further research into ways of organizing the action micro-clusters.

The current thesis limits its scope to the production of simple vowels, where the position of the human mouth and vocal tract is unchanged during the entire utterance. This also means that the HL parameters used for modelling the vocal tract do not change. There are numerous possibilities for enhancing the project. The ultimate goal is the generation of a machine that is capable of speaking like an adult human.
Although the approaches discussed here are intended for speech production, they are also useful for general motor mapping in the developmental SAIL robot, or in any situation where one has to deal with high-dimensional, continuous, real-valued search spaces. This work marks the first step towards modelling high-dimensional sensory inputs and effector outputs with reinforcement learning. New developments in the field of Artificial Intelligence cannot escape these problems, and hence this work should provide some useful feedback in this direction.

The learning algorithm is motivated by the learning behavior observed in human infants; however, the current work does not claim to be an exact replica of a human baby. It does, nevertheless, provide an engineering solution to the type of problems encountered in this area.