OPTIMALITY AND HIERARCHICAL REPRESENTATION IN EMERGENT NEURAL TURING MACHINES AND THEIR VISUAL NAVIGATION

By

Zejia Zheng

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2018

ABSTRACT

OPTIMALITY AND HIERARCHICAL REPRESENTATION IN EMERGENT NEURAL TURING MACHINES AND THEIR VISUAL NAVIGATION

By

Zejia Zheng

Traditional Turing Machines (TMs) are symbolic in the sense that the representations in these TMs are static and hand-crafted. This paper presents a new kind of TM: the emergent neural Turing Machine. By neural, we mean that the control of the TM has neurons as its basic computing elements. By emergent, we mean that the internal representations are formed during learning without hand-crafting. Developmental Network-1 (DN-1) uses emergent representation to perform Turing computation, but its internal hierarchy is handcrafted over emergent features. The major novelty of the proposed TM (Developmental Network-2) over DN-1 is that the representational hierarchy inside DN-2 is emergent and fluid. DN-2 grows complex hierarchies by dynamically allowing initialization of neurons with different domains of connection. Its optimality in terms of maximum likelihood properties is established under the conditions of limited learning experience and resources. Although DN-2 is meant for general learning tasks, we experimented with a complex task: vision-guided navigation in simulated and natural worlds using DN-2. Real-world and simulated navigation experiments showed that DN-2 successfully learned rules of navigation with image and other inputs. The hierarchical representation formed in DN-2 focuses on important navigation features like road edges while disregarding distractors like shadow edges.

Copyright by
ZEJIA ZHENG
2018

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

CHAPTER 1 INTRODUCTION
  1.1 Related Work
  1.2 Optimality of DN-2
  1.3 Discussion: Sensorimotor Recursive Experiments vs. Batched Data Classification Experiments
  1.4 Example: Shadow edges and road edges

CHAPTER 2 DEVELOPMENTAL NETWORK 2
  2.1 Main Procedure
  2.2 Details of the algorithm
    2.2.1 Patterning
    2.2.2 Response Computation and Competition
    2.2.3 Hebbian Learning of excitation
    2.2.4 Initialization
    2.2.5 Controlling α(t) for multistage learning
    2.2.6 Receptive fields initialization
    2.2.7 Synaptic maintenance
CHAPTER 3 MAXIMUM LIKELIHOOD ESTIMATION IN DN-2
  3.1 Review of the three theorems in DN-1
  3.2 Definition of DN-2
  3.3 Conditions on DN-2 learning
  3.4 Lemma 1: DN-2 optimizes its weights under maximum likelihood for each update, conditioned on C
  3.5 Theorem: DN-2 learns optimally under maximum likelihood from its incremental learning experience, conditioned on C
  3.6 Discussion
    3.6.1 On curse of dimensionality
    3.6.2 On over-fitting
    3.6.3 On local minima

CHAPTER 4 EMERGENT TURING MACHINE AND HIERARCHICAL REPRESENTATION IN DN-2
  4.1 The Challenges
  4.2 Emergent FA learning in DN-2
  4.3 Hierarchical representation in DN-2
  4.4 Universal Turing Machine and Developmental Network
    4.4.1 Definition of UTM
    4.4.2 Differences between UTM and DN-2
  4.5 Autonomous navigation needs Emergent Turing Machine

CHAPTER 5 SIMULATION WITH MAZE ENVIRONMENT
  5.1 The environment and the teacher FA
  5.2 The agent
    5.2.1 Stage 1: Learning concepts where, what, and scale
  5.3 Stage 2: Skills, route, and distance
    5.3.1 Teaching skills
    5.3.2 Teaching route concept
    5.3.3 Evaluation of performance
  5.4 Experiment results and analysis
    5.4.1 Skill learning and chaining
    5.4.2 GPS blurring and noisy inputs
  5.5 Discussion

CHAPTER 6 REAL-WORLD EXPERIMENTS AND RESULTS
  6.1 Batched vision data learning
  6.2 Real-world navigation: incremental training and testing
    6.2.1 Training and testing
    6.2.2 Visualization: Bottom-up weights of type 100 neurons
    6.2.3 Visualization: Lateral weights from type 100 neurons to type 111 neurons

CHAPTER 7 CONCLUSION

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Y neuron types and perfect match value θ
Table 4.1: Multiple types of neurons learning an FA transition
Table 5.1: Experiment results
Table 6.1: DN-2 performance compared to benchmarks on AIML 2016
Table A.1: Real-time training and testing details
Table A.2: Real-time testing results

LIST OF FIGURES

Figure 1.1: Summary of the DN-2 framework. DN-2 has seven types of internal neurons with different types of connections. Blue lines: top-down connections from Z. Red lines: lateral connections from Y. Green lines: bottom-up connections from X. In this paper, we show the optimality of the learning procedure of DN-2 by proving Lemma 1 and Theorem 1, shown in this figure. The verbal version of Theorem 1 can be found in the italic paragraph in the Introduction. Details about the theorem are presented in Sec. 3.5.

Figure 1.2: Demonstration of how low-level where-what representation facilitates learning complex navigation rules. Low-level 100 Y neurons look at local input patterns. Some neurons learn to recognize road edges (red square) at specific locations, and others learn to recognize shadow edges (blue square). Before synaptic maintenance, the high-level Y neuron of type 011 learns different firing patterns from the low-level 100 Y neurons. After synaptic maintenance, the stable connections (on constant road edges) are kept while the unstable connections (on varying shadow edges) are cut from the high-level Y neurons, forming a shadow-invariant representation to learn the navigation rule "correct facing direction when the left road edge is in the middle of the input image".

Figure 1.3: Demonstration of how top-down navigation context affects internal firing. In this example, all type 100 Y neurons are learning local features (road edges and shadow edges). But given different GPS navigation contexts (red letter in the Z motor), different high-level neurons (type 011) would fire. The attention of the network is thus shifted toward the most relevant feature based on the current navigation context.

Figure 1.4: Demonstration of bottom-up and top-down effects on internal firing combined. The network shifts its attention according to the navigation context (red letter). The demonstrations in this figure are synthetic examples for the purpose of illustration.

Figure 5.1: Simulated maze environment and the agent design. (a) and (b): simulated environment and corresponding GUI. The environment is 450 pixels by 450 pixels, with the maze no larger than nine by nine blocks. The agent is 20 pixels by 20 pixels, moving continuously in the maze environment. (1) Wall blocks are red blocks in the GUI. (2) Obstacle blocks are blue blocks in the GUI. (3) Destination blocks are green blocks in the GUI. (4) Reward/punishment blocks are yellow blocks in the GUI. (5) The agent's route is presented in the GUI.
Black routes are skills already learned or going-back routes. Red routes are novel skills to be learned. (6) The agent in the environment, discussed in detail in subfigure (c). (7) and (8) Information panels for training and testing. (9) Traffic light block with two states: in (a) the traffic lights are in the 'go' state, and in (b) they are in the 'stop' state. (10) The 2D vision image seen by the agent. (c) Agent representation in the GUI. (11) GPS signal from the environment. GPS indicates where the destination is in the simulated environment, but it is not aware of the obstacles, so the agent needs to learn the skill "avoid obstacle". (12) Vision sensor of the agent. To simplify the illustration, we show seven of the 43 vision lines, which form a 210-degree field of view reaching 75 pixels around the agent.

Figure 5.2: Skills and routes taught to the navigation agent in the simulation experiment. Skill 1: move forward when the next block is open. Skill 2: avoid an obstacle and correct the facing direction. Skill 3: turn left at the corner when GPS indicates left, and avoid the obstacle. Skill 4: move through the narrow path. Skill 5: turn right when GPS indicates right. Skill 6: turn left when GPS indicates left. Skill 7: turn left and then turn right to reach the destination. During teaching, traffic lights are placed along the route as shown in Fig. 5.1, and the agent needs to learn to follow the rules of the traffic light as well. Although the agent has 43 vision lines as shown in Fig. 5.1, we show only 7 of them for clarity. A full video of training and testing is available at https://youtu.be/CbhS1qvWZn0.

Figure 6.1: Real-world experiment flow chart. The lower-level Y neurons (type 100) detect local features in the input image. The higher-level Y neurons are more robust against changes in local variances, while forming a temporal context for navigation.

Figure 6.2: Sample data from the AIML 2016 navigation dataset. All images are labeled with four Z motor concepts: action, GPS, attention, and type. Data was collected around the university campus using an HTC One M8 Android device.

Figure 6.3: AIML 2016 vision performance comparison among different network configurations. The X axis is the number of high-level neurons in DN-2 (i.e., types 101, 011, and 111). The Y axis is the action error rate with four-fold cross-validation on the AIML testing data. Each configuration is labeled with (receptive field size of the type 100 neurons, number of type 100 neurons at each pixel location, and the type of high-level neurons used). DN-2 with local 100 neurons and 500 high-level 111 neurons offers the best performance with reasonable computational resource requirements.

Figure 6.4: Real-time experiment setup. 1. Stereo USB camera connected to the mobile device. The cameras stream a 480 by 640 stereo RGB image pair in real time. 2. OnePlus 5 phone running the DN-2 equipped navigation application in CPU and GPU modes. 3. Moga Pro controller for motor supervision and testing result recording. 4. The navigation application interface for training and testing.

Figure 6.5: Bottom-up weights learned by type 100 neurons. Each 100 neuron focuses on a 5 by 5 region of the input image, as shown in Fig. 6.1,
and conducts local dynamic top-k competition. As the neurons take stereo RGB inputs, we show only the weights corresponding to the left camera. Each small square shows the 5 by 5 RGB weights of a specific neuron. The weight values are normalized to the range 0 to 1. These bottom-up weights are projected to image space in Fig. 6.6 and Fig. 6.7.

Figure 6.6: Projected lateral weights for type 111 neurons with no synaptic maintenance. Type 111 neurons with low firing ages form evenly distributed attention due to the local competition zones of the lower-level 100 neurons.

Figure 6.7: Projected lateral weights for type 111 neurons with synaptic maintenance. After synaptic maintenance, the high-level neurons focus attention on consistent road edges and become invariant to the changes in the highly variant shadow shapes. The connections from type 100 neurons to type 111 neurons shift toward these consistent features, while the highly unstable connections are cut from the range of connection.

Figure 6.8: Training and testing routes around the university during different times of day and under different natural lighting conditions. Disjoint testing sessions were conducted along paths that the machine had not learned.

LIST OF ALGORITHMS

Algorithm 1: Environment for teaching individual skills
Algorithm 2: Environment for teaching route concept

CHAPTER 1
INTRODUCTION

The learning process in human children appears to be cumulative, with the task complexity progressively increasing through the child's exploration of and interaction with the learning environment (e.g., supervised by a teacher or reinforced by self-exploration). The view that new knowledge must be constructed from existing knowledge is supported by research in the field of developmental psychology (e.g., [40], [52]). Most importantly, this scaffolding learning principle has been demonstrated across modalities, from language acquisition [2], to text understanding [34], to motor skill learning [45].

In this paper we investigate the basic mechanisms that allow a neural network to learn like a child, i.e., to learn incrementally from simple to complex while being task-nonspecific. By task-nonspecific, we mean that the programmer does not know about the environment when programming the learning agent. After birth, the agent learns to perform tasks according to its sensory and motor experience, while developing its hierarchical internal representations without the teacher manually selecting features. To demonstrate the power of such a computational framework (i.e., Developmental Network-2, or DN-2), a simulated agent and a real-world application equipped with DN-2 learn to incrementally navigate in complex scenarios using only visual input (simulated vision input for the simulated agent and stereo RGB input for the real-world application) and GPS signals.

The following conceptual steps guided us toward the targeted framework:

Concept 1: Turing Machine and Universal Turing Machine. Many real-world traffic rules can be broken down into state transitions. For example, when in the state "moving forward", an agent should take the action "stop" if the input is recognized as "obstacle".
These rules can be formulated as a Finite Automaton (FA) transition, q(t) −σ(t)→ q(t + 1), where q(t) is the state of the agent at time t, and σ(t) is the input observed by the agent at time t. However, in real-world navigation we not only act according to the input; we also change the environment. E.g., we actively change the signal of the traffic light when we press the wait button at a crossing. A more subtle example is that we often associate landmarks with the concept of distance during navigation, much like hunters putting down stone piles in a forest to mark how far they have traveled. These actions all change the environment and are beyond the computational capability of a Finite Automaton. A Turing machine (TM) [49, 18, 33] is a better computational model in this sense, as it offers an additional read-write head that allows the agent to alter the input tape. The input tape in our case is the environment.

Following the definition of a Turing Machine, we can thus formulate a navigation sequence as T = (Q, Σ, Γ, q0, δ), where Q is the set of states (navigation states such as "moving forward", "turning left", "arriving"), Σ is the input (current input images), Γ is the tape alphabet (all possible images), q0 is the initial state ("start navigation"), and δ is the transition function:

    δ : Q × Γ → Q × Γ × {R, L, S}    (1.1)

where {R, L, S} are the head motions right, left, and stationary in the context of a Turing Machine, but can be redefined and expanded to the actions taken by the agent to alter the environment.

Our DN-2 aims to incrementally learn these individual Turing Machines of single navigation segments. Thus our DN-2 aims to behave like a universal Turing Machine (UTM) [18, 33]. The states Q in Eq. (1.1) are reflected in the concept zones of the network in the output layer. However, the formulation above uses symbolic representation (i.e., the states and symbols are human-defined, with clear boundaries among those symbolic states), which requires human hand-crafting and prior knowledge about the task.

Using emergent representation (i.e., representation that emerges from the interactions between the agent and the environment without human hand-crafting) to learn clear logic is difficult. Marvin Minsky and others stated that symbolic models are logical and neat, but connectionist models (using emergent representations) are analogical and scruffy [37]. Weng argued that DN-1, a connectionist model, learns an FA error-free while using emergent representation; the DN family thus resolves the scruffy abstraction problem and has clean logic inside. DN-2 inherits this from DN-1 [55]. Detailed correspondence between our DN-2 and a UTM is presented in Sec. 4.4 and Sec. 4.5.
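As a toy illustration of the symbolic formulation in Eq. (1.1) (ours, and exactly the kind of hand-crafted table that DN-2 is designed to avoid), a single navigation segment could be written down as a transition dictionary. All state and symbol names below are hypothetical:

import sys

# Hypothetical symbolic controller for one navigation segment.
# delta: (state, observed input) -> (next state, action on the environment).
delta = {
    ("moving_forward", "open_road"):   ("moving_forward", "none"),
    ("moving_forward", "obstacle"):    ("stopped",        "none"),
    ("stopped",        "wait_button"): ("stopped",        "press_button"),
    ("stopped",        "go_light"):    ("moving_forward", "none"),
}

def step(state, observation):
    # One transition q(t) -> q(t+1) on input sigma(t). The returned action may
    # alter the environment (the "tape"), which is what lifts the model from
    # an FA to a TM. Unknown pairs keep the state unchanged in this sketch.
    return delta.get((state, observation), (state, "none"))

state, action = step("moving_forward", "obstacle")
print(state, action, file=sys.stdout)   # -> stopped none

Every rule here had to be hand-crafted with clear symbol boundaries; the rest of the paper develops how DN-2 learns such transitions emergently instead.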
Concept 2: Hierarchical representation. The formulation in Eq. (1.1) requires the neural network to learn a complete tape alphabet Γ, which in the context of vision-based navigation is the set of all possible input images. This is computationally impossible, as real-world images contain distractors of numerous shapes, colors, and scales.

Hierarchical representation is helpful in this sense, as it abstracts lower-level details into noise-invariant representations. The lower-level representations are often formed within a local receptive field, limiting the number of variations of appearance within that area. A higher-level representation sums up the local features and produces robust high-level feature maps for decision making. Ideally, this feature map should achieve invariance with regard to the numerous low-level distractors while saving computational resources.

Symbolic models (models that use symbolic representation, e.g., Markov models based on FA [13, 10], graphical models [3, 20], or belief nets [16, 17]) use hierarchical representation to cut off symbolic links: they cannot handle all possible links and thus disregard most of them. This hierarchy is external and can never emerge from internal connections, as it relies on outside restriction through human hand-crafting. Emergent brain-like hierarchical models for navigation are proposed in [47, 30], but the hierarchy inside their methods is rigid (static region boundaries) and one-directional (no top-down context for internal representations). The landmark concept tree for navigation in [51] exists only in the concept area of the neural network, while the internal representation remains without hierarchy. In DN-2, the hierarchy is not external but internal and emergent. DN-2 allows all possible connections to be learned automatically because it uses natural sensor and motor patterns as internal states, and each neuron has its own competition zone. The hierarchy inside DN-2 is thus fluid rather than rigid.

1.1 Related Work

As a brain-inspired learning framework, DN-2 touches many aspects of autonomous mental development, machine learning, and neural networks. Here we compare DN-2 only with the research and experiments most relevant to the scope of this paper.

Comparison with incremental learning. Kraft et al. proposed an incremental learning system that builds object representations and associated object-specific grasp affordances through exploration with no supervision [23]. Incremental learning is also investigated in the error-minimized extreme learning machine (EM-ELM) [9], where the true label of the training data is always provided during learning. The difference between DN-2 and such systems lies in the learning mode of DN-2: the agent updates its internal representation based on the current context in sensors and motors, regardless of whether it is supervised or not. This allows the teacher to partially supervise the output layer (e.g., supervise only higher-level concepts while leaving lower-level concepts emergent), thus allowing concept scaffolding.

Comparison with transfer learning. Transfer learning refers to the situation where the target domain is different from the source domain, and the knowledge representation learned in the source domain needs to be transferred to the target domain [39]. The experiments in this paper can be viewed as a transfer-learning problem where the source domain is where we train the agent to form low-level representations, and the target domain is where we apply those low-level representations while training the agent to learn navigation skills. However, the DN-2 framework by itself is more flexible than transfer-learning scenarios, as DN-2 does not differentiate between domains. The source and target domains are implicitly defined by the teacher when teaching the network different concepts, but the learning algorithm itself requires no such separation among domains.

Comparison with SLAM. SLAM (simultaneous localization and mapping) comprises the simultaneous estimation of the state of a robot equipped with on-board sensors and the construction of a model (the map) of the environment that the sensors are perceiving [8, 5].
Because we rely on the GPS embedded in the smart phone and the Google Maps API to give general guidance on direction, we do not aim to tackle the problem of localization, which is heavily investigated in many SLAM-based outdoor navigation studies (e.g., [35, 36, 5]). Compared to brain-inspired SLAM architectures such as RatSLAM, our DN-2 forms a hierarchical representation that is more similar to the local-to-global hierarchies found in the visual cortex [50] than to the topological representations found in the hippocampus [36].

Comparison with neural architectures for vision. Many neural network architectures are also inspired by the visual cortex. Convolutional Neural Networks (CNNs) [25, 26, 6] gained state-of-the-art performance on image recognition by stacking spatially invariant convolution operations with pooling operations. The learning mechanism in these networks is based on stochastic gradient descent (SGD), which requires many iterations of training to reach convergence and is thus unsuitable for incremental learning. Moreover, global tuning with SGD also disrupts long-term memory, preventing the network from keeping old knowledge (lower-level skills) when learning newer knowledge (higher-level skills). DN-2 uses competitive learning and Hebbian learning instead of SGD. Finally, the hierarchies in CNNs (and also in HMAX, a feedforward hierarchical neural network of alternating template matching and pooling operations that models object recognition in the visual cortex [41]) are static in the sense that the network's structure is manually picked and predefined by the programmer. In DN-2, neurons at different levels of the hierarchy are initialized when the current match is poor, forming a fluid hierarchy based on the learning experience. In the experiment section we present a baseline model using a CNN architecture.

Other related work. The problem we address in this paper is similar to the problem addressed by the Qualitative Learner of Actions and Perception algorithm (QLAP) [38]. QLAP, however, assumes perfect detection results from the environment by assuming "the agent has trackers for a small number of moving objects (including its own body parts) within an otherwise static environment". This means that the hierarchy in QLAP is built in the concept area, while DN-2 builds its hierarchy of representation internally to handle real-world input with noise and distractors.

The Hebbian learning mechanism used in the DN framework resembles the learning in Self-Organizing Maps (SOM) [22]. However, DN-2 differs from SOM in the following aspects: 1) SOM initializes all neurons at t = 0, whereas DN-2 releases neurons based on goodness of match; the initialization of neurons in DN-2 is more similar to the Growing Neural Gas algorithm (GNG) [12]. However, 2) SOM and GNG have no internal hierarchy of representations, while DN-2 uses lateral connections to form hierarchical representations. 3) SOM and GNG do not take temporal information into account, while DN-2 updates recursively and thus has a temporal aspect in its internal representation.

The experiments in this paper can be categorized as 'topological map based outdoor navigation' as proposed in [4], where a topological map is defined as a graph of landmark nodes associated with specific instructions (e.g., cross road, turn, or go straight ahead). In our case, the topological map is the instruction (e.g.,
turn left, turn right, move forward, or arrive) from the GPS system at each intersection way-point, defined by the Google Maps API. Unlike the studies listed in [4], where most experiments are performed on unmanned vehicles or robots with mechanical effectors, in this paper the agent's actions are broadcast as audio instructions to the human user, with the user performing the actions according to the output. The developed application is intended to help visually impaired people walk around campus using a smart phone with a pair of stereo cameras and the proposed application.

Comparison with neural networks doing Turing computations. Siegelmann and Sontag showed that there is a finite recurrent neural network with the capability of Turing computation [42, 43]. However, their simulation in [43] is still symbolic, as the binary tape used in their experiments carries hand-crafted specific meaning. Graves et al. proposed a neural network with a read-write head and external memory in [15, 14]. However, their network did not deal with emergent representation and did not use firing patterns as a state and action. Unlike the work of Siegelmann and Sontag [42, 43], Graves et al. [15, 14] did not prove their network's equivalence to a Turing Machine. In contrast, Weng proved that DN-1 can perform Turing computations in [56]. DN-2 can achieve the computational capabilities of DN-1 by using just type 101 neurons (neurons with connections from only X, the input area, and Z, the motor area).

Comparison with DN-1. As Weng pointed out in [56], the Developmental Network (DN) was the first general-purpose emergent FA that:

1. uses fully emergent representations;

2. allows natural sensory firing patterns;

3. allows the motor area to have subareas where each subarea represents either an abstract concept (cost, skill, means, etc.) or natural muscle actions (e.g., move forward or turn left/right);

4. learns incrementally – taking one pair of sensory pattern and motor pattern at a time to update the network and discarding the pair immediately after;

5. realizes the maximum likelihood estimate of the network, conditioned on the limited computational resources in the network and the limited learning experience in the network's "lifetime".

The DN-1 framework has had several implementations, named Where-What Networks (WWN), which recognize and localize foreground objects directly from cluttered scenes. WWN-1 and WWN-2 [21] recognize two types of information for single foregrounds over natural backgrounds: type recognition given location information, and location finding given type information. WWN-3 [31] recognizes multiple objects in natural backgrounds. WWN-4 [32] demonstrates the advantages of direct inputs from the sensory and motor sources. In WWN-5 [44], apparent object scales are learned using an added scale motor concept zone. WWN-6 [54] uses synapse maintenance to form dynamic receptive fields in the hidden layers incrementally, without hand-crafting. WWN-7 [59] learns multiple scales for each foreground object using short-time video input.

However, the computation in DN-1 relies on exact matching between the stored weights and the current input pattern (from both the sensor area X and the motor area Z), and there is no hierarchy inside the representation of DN-1. This paper presents DN-2, a novel neural network architecture based on DN-1.
DN-2 extends DN-1 by introducing the following mechanisms into the original framework:

First, lateral connections enable hierarchies of representation. To compute the internal firing pattern Y(t), DN-1 relies only on the sensory input pattern X(t−1) and the motor input pattern Z(t−1). This is primitive in two ways: (1) such features are concrete, with no abstraction, and (2) such features carry no temporal context from the last internal state of the network. In DN-2, the sources of input for the internal area Y(t) extend to the previous internal firing pattern Y(t−1). This allows the network to develop a hierarchy of representations, discussed in Sec. 4.3. The learned hierarchy of representations is thus (1) more abstract, and therefore more invariant to noise in the input, and (2) more extensive in temporal dependency.

Second, multiple levels of neurons form a hierarchy in representation. There are seven types of Y neurons in DN-2 (shown in Fig. 1.1); we use three binary digits to indicate whether a neuron has connections from the input, internal, and output areas, respectively. DN-1 has only one type of neuron (with only bottom-up and top-down connections, i.e., type 101) and no hierarchy of representation. In DN-2, type 100 and type 001 neurons form the lowest level of representation, with their attended information integrated into higher-level neurons (e.g., types 011, 110, and 111). The lower-level representations may be sensitive to local changes in the input patterns, while the higher-level neurons (with global receptive fields) use the firing patterns from the lower-level neurons to incorporate both the global view and the local finer details of the current input.

1.2 Optimality of DN-2

The most important theorem about DN-2 in this paper can be summarized in the following sentence:

Under the constraint of skull-closed incremental learning and the pre-defined network hyper-parameters, DN-2 optimizes internal parameters to generate maximum likelihood firing patterns in network areas, conditioned on its sensory and motor experience up to the network's last update.

The theorem is formally introduced in Sec. 3.5, separated into Lemma 1 and Theorem 1, and also illustrated in Fig. 1.1. To prove this theorem, we formulated the learning problem under Maximum Likelihood Estimation, which is attached in the Appendix.

Figure 1.1: Summary of the DN-2 framework. DN-2 has seven types of internal neurons with different types of connections. Blue lines: top-down connections from Z. Red lines: lateral connections from Y. Green lines: bottom-up connections from X. In this paper, we show the optimality of the learning procedure of DN-2 by proving Lemma 1 and Theorem 1, shown in this figure. The verbal version of Theorem 1 can be found in the italic paragraph in the Introduction. Details about the theorem are presented in Sec. 3.5.

DN-2 mitigates some of the most notorious learning problems:

Brittle symbolic modules. There are no symbolic modules in DN-2. The environment has too many rules and variances to hand-craft clean, symbolic representations without missing an aspect. Emergent representations in DN and DN-2 are not restricted by handcrafted design; new representations are initialized to learn the specific input if the current best match is insufficient. DN-2 is thus immune to the brittleness of symbolic representation (if all neurons are already initialized, the network performs maximum likelihood estimation, as stated by the theorem).
Curse of dimensionality problem. DN-2 is less likely to suffer from the curse of dimensionality, which is caused by using high-dimensional input during learning [19]. In DN-2, this curse is dealt with by a local-to-global hierarchy developed incrementally according to the network's learning experience. All receptive fields in DN-2 develop invariance, using the attention context from the Z area. A group concept neuron takes only the relevant components as input and disregards the irrelevant ones, because when the group concept neuron fires, relevant component neurons fire stably while irrelevant component neurons do not fire consistently. This problem is discussed again in Sec. 3.6.1.

Over-fitting problem. DN-2 is less likely to over-fit. The over-fitting problem refers to the phenomenon where the learning agent remembers only the training data but generalizes poorly on new data, often due to a large number of parameters and a small number of training samples. DN is less likely to over-fit because it starts with no internal neurons and gradually initializes new neurons only when the trained neurons cannot recognize the current input well. When each neuron has seen only one input frame in its receptive field, the formed representation is optimal. Later input frames for each neuron are taken into account through a recursive average, which is optimal in the sense of maximum likelihood estimation. Neuron initialization is introduced in detail in Sec. 2.2.4. This problem is discussed again in Sec. 3.6.2.

Local minima problem. DN-2 is less likely to get stuck in local minima. Typical learning agents (e.g., agents using error back-propagation) aim to minimize a task-specific loss function and thus often get stuck at local minima during optimization. As we state in Theorem 1 (and reiterate in the introduction above), DN-2 optimizes (in the sense of Maximum Likelihood Estimation) over the entire learning experience, conditioned on incremental learning and limited resources. The side effect is that DN-2's performance depends on the external teaching schedule (i.e., the learning experience). This is discussed again in Sec. 3.6.3.

Please note that we do not claim DN-2 completely solves these learning issues; rather, by conceptually investigating the learning procedures and properties of DN-2, we believe these issues are to a large extent mitigated.

1.3 Discussion: Sensorimotor Recursive Experiments vs. Batched Data Classification Experiments

Should we compare DN's performance on standard image classification datasets (e.g., NORB [28], MNIST [27], CIFAR-10 [24], and ImageNet [7]) with other image classification networks? Standard image classification datasets are widely used to benchmark performance of different neural network architectures. These datasets usually contain large numbers of images or video sequences labeled with the corresponding classification information. Batched data experiments are not sensorimotor recursive, meaning that the motor output of the agent has no impact on the training or testing data. The focus of this paper, however, is to build a neural network that uses emergent representation and incrementally learns to behave like a Turing Machine with a read-write head. This requires the agent to perform in a sensorimotor recursive environment.
DN can of course handle image classification tasks with proper teaching: Wagel and Weng used a variation of the Where-What Network for the CIFAR-10 classification task and achieved an error rate of 19.8%. The same network architecture achieved an error rate of 5.0% on the NORB dataset [53]. However, a task-nonspecific, general-purpose agent that learns incrementally should not be unfairly compared with agents optimized on batch data, as those agents are task-specific. The comparison is like pitting a human brain against an electronic calculator at long-number multiplication: the calculator is specifically designed for this task, while the general-purpose human brain would make more mistakes and also be slower.

1.4 Example: Shadow edges and road edges

In this example, we outline how DN-2 uses the learned where-what representation to form a robust representation for navigation. The following discussion corresponds to the steps illustrated in Fig. 1.2.

Figure 1.2: Demonstration of how low-level where-what representation facilitates learning complex navigation rules. Low-level 100 Y neurons look at local input patterns. Some neurons learn to recognize road edges (red square) at specific locations, and others learn to recognize shadow edges (blue square). Before synaptic maintenance, the high-level Y neuron of type 011 learns different firing patterns from the low-level 100 Y neurons. After synaptic maintenance, the stable connections (on constant road edges) are kept while the unstable connections (on varying shadow edges) are cut from the high-level Y neurons, forming a shadow-invariant representation to learn the navigation rule "correct facing direction when the left road edge is in the middle of the input image".

In the context of navigation, the agent is required to pay attention to the left road edge and the right road edge. When making a left turn, the agent must adjust its facing direction by turning slightly right when the left road edge is recognized in the center of the image (unless there is an obstacle on the right-hand side). However, the recognition of road edges is often disrupted by shadows, which are usually monotone and of various shapes.

Bottom-up feature hierarchy. Low-level neurons with local receptive fields pay attention to local road edges (stable features) and shadow edges (unstable features). During learning, a higher-level neuron groups these low-level features together. Synapse maintenance cuts away the unstable features, focusing the network's attention on the important stable features. See Fig. 1.2 (a).

Figure 1.3: Demonstration of how top-down navigation context affects internal firing. In this example, all type 100 Y neurons are learning local features (road edges and shadow edges). But given different GPS navigation contexts (red letter in the Z motor), different high-level neurons (type 011) would fire. The attention of the network is thus shifted toward the most relevant feature based on the current navigation context.

Figure 1.4: Demonstration of bottom-up and top-down effects on internal firing combined. The network shifts its attention according to the navigation context (red letter). The demonstrations in this figure are synthetic examples for the purpose of illustration.

Top-down context affects the firing pattern. With a different firing context, different higher-level neurons would fire.
These neurons are linked with different groups of lower-level neurons, thus focusing on different parts of the same input image. See Fig. 1.2 (b).

Combined effect. Combining these effects, we have a dynamic attention system that not only focuses on important bottom-up features but also incorporates the top-down navigation context from the previous computation. See Fig. 1.2 (c).

This is just one example in the context of real-time navigation. We want our DN-2 to automatically form these rules, which are too numerous to hand-craft. The detailed DN-2 mechanism to achieve this is introduced in the following chapter.

CHAPTER 2
DEVELOPMENTAL NETWORK 2

2.1 Main Procedure

X is the sensory zone, with input x(t) represented as a vector (e.g., an input image reshaped to a vector). Y is the hidden layer(s) inside the network, with all neurons' final response values represented as a vector y(t). Z is the motor area with multiple concept zones; concatenating all responses in Z gives z(t). The dimensions and representations of the X and Z areas are based on the sensors and effectors specified by the programmer (or the DNA). Y is skull-closed, not directly accessible from the outside.

In the scope of this paper, the teacher/programmer is required to design the teaching schedule and the concept zones (motor areas) of the learning agent. Motor areas need to be designed to accomplish the corresponding tasks (e.g., if there is no action motor, then it would be impossible for the agent to navigate, no matter how 'smart' the agent is). However, the internal representation of the agent remains emergent and skull-closed.

1. For the Y area, all Y neurons enter the initialization stage (not active) with random weights and zero firing ages. Z neurons initialize their adaptive weights to zero. All Z neurons are active at birth time. N_y is the adaptive part of the hidden layer; N_z is the adaptive part of the motor area.

2. At time t = 0, supervise the initial state z and input the first sensory input x. Internal states in Y are all zeros, as no neurons are firing; thus y(0) = 0.

3. At time t = 1, ..., repeat the following steps:

a) Compute the Y area's response vector for all active Y neurons in parallel, and update the adaptive part of the hidden layer to the new N′_y:

    (y(t), N′_y) = f_y(p_y, N_y)    (2.1)

where p_y denotes the tuple (x(t−1), y(t−1), z(t−1)), and f_y is the response computation, response competition, and learning function defined in Sec. 2.2.2 and Sec. 2.2.3. Also, one or several Y neurons in the initialization stage enter the learning stage if none of the currently active Y neurons matches the input vector well, as defined in Sec. 2.2.4.

b) Compute the Z area's response vector for all Z neurons in parallel, and update the adaptive part of the motor layer to the new N′_z:

    (z(t), N′_z) = f_z(p_z, N_z)    (2.2)

where p_z = y(t−1).
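The following is a minimal, self-contained sketch of the main procedure above (our simplification, not the reference implementation): it uses a single neuron type with a global top-1 competition, omits the initialization stage and growth rates, and folds the response, competition, and Hebbian mechanisms of Sec. 2.2 into one area function.

import numpy as np

def unit(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

class Area:
    """One adaptive area (Y or Z); row i of w is neuron i's weight vector."""
    def __init__(self, num_neurons, input_dim, rng):
        self.w = rng.standard_normal((num_neurons, input_dim))
        self.age = np.zeros(num_neurons)            # firing ages m_i

    def step(self, p):
        """Pre-response by cosine similarity, global top-1 competition,
        then the amnesic Hebbian update of the single winner."""
        p = unit(np.asarray(p, dtype=float))
        pre = np.array([unit(wi) @ p for wi in self.w])
        y = np.zeros_like(pre)
        i = int(np.argmax(pre))
        y[i] = 1.0                                   # winner fires
        self.age[i] += 1
        beta2 = 1.0 / self.age[i]                    # learning rate
        self.w[i] = (1.0 - beta2) * self.w[i] + beta2 * p
        return y

rng = np.random.default_rng(0)
nx, ny, nz = 8, 30, 4
Y = Area(ny, nx + ny + nz, rng)                      # p_y = (x, y, z) at t-1
Z = Area(nz, ny, rng)                                # p_z = y at t-1
x, y, z = rng.random(nx), np.zeros(ny), np.eye(nz)[0]   # supervise initial z
for t in range(10):
    y = Y.step(np.concatenate([x, y, z]))            # step a)
    z = Z.step(y)                                    # step b)
    x = rng.random(nx)                               # next sensory frame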
2.2 Details of the algorithm

2.2.1 Patterning

For human brains, the early wirings between neurons are determined before birth (by genes or DNA) [46, 11]. Inspired by this, we create several types of early Y neurons with different connection regions to simulate early connections. The seven Y neuron types are shown in Table 2.1. Each type of neuron has a different growth rate, controlled by α(t) in Eq. (2.6). Top-k competitions, described in Sec. 2.2.2, are performed within each type of neuron.

Table 2.1: Y neuron types and perfect match value θ

Type   X (Sensor)   Y (Hidden)   Z (Motor)   θ
001    No           No           Yes         1
010    No           Yes          No          1
011    No           Yes          Yes         2
100    Yes          No           No          1
101    Yes          No           Yes         2
110    Yes          Yes          No          2
111    Yes          Yes          Yes         3

2.2.2 Response Computation and Competition

The area functions f_y and f_z are based on the theory of Lobe Component Analysis (LCA) [58]. Each active Y neuron calculates its pre-response. The bottom-up part of the pre-response, for neurons with a bottom-up connection, is calculated as follows:

    r′_{b,i} = ⟨ x(t−1)/‖x(t−1)‖ , w_{b,i}/‖w_{b,i}‖ ⟩    (2.3)

where w_{b,i} is the bottom-up weight of that neuron. The brackets denote the inner product of two unit vectors; this equation calculates the cosine similarity between the stored pattern (i.e., w_{b,i}) and the input vector. The top-down response r′_{t,i} and the lateral response r′_{l,i} are calculated in a similar way. For neuron i, its pre-response r′_i is then calculated as:

    r′_i = r′_{b,i} + r′_{t,i} + r′_{l,i}    (2.4)

If the neuron has no lateral connection (or no bottom-up connection, or no top-down connection), then r′_{l,i} = 0 (or r′_{b,i} = 0, or r′_{t,i} = 0).

To simulate inhibition within Y, we define a dynamic competition set for each neuron. A neuron can fire only when it is among the top-k winners in its dynamic competition set. In this paper, k = 1, and each neuron's dynamic competition set is defined as Y_i = { j | type_j = type_i and (rf_j ∩ rf_i ≠ ∅ or type_i ≠ 100) }, where rf_i is the receptive field initialized according to Sec. 2.2.6. If neuron i has the highest response within its competition zone, it fires with y_i = 1; otherwise, y_i = 0. After all Y neurons compute their responses in parallel, we obtain y′ (the new Y response). The Z area computes its response z′ similarly. In this paper, a Z neuron's competition zone is its concept zone (i.e., all neurons within the same concept zone compete with each other).

2.2.3 Hebbian Learning of excitation

If a neuron wins the multistep lateral competition described above (meaning its firing rate is greater than zero), its bottom-up weight (top-down weight, or lateral weight, depending on the type of neuron) is updated with the following Hebbian learning rule:

    w_i ← β1 w_i + β2 r_i x(t−1)    (2.5)

β1 and β2 determine the retention and learning rates of the neuron, respectively:

    β1 = (m_i − 1)/m_i ,  β2 = 1/m_i

with β1 + β2 ≡ 1, where m_i is the neuron's firing age: m_i = 1 when the neuron first enters the learning stage, and it increases by one every time the neuron wins a competition.
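As an illustration of Eqs. (2.3) and (2.4) together with Table 2.1, the sketch below (ours; it ignores competition zones) computes a neuron's pre-response as the sum of cosine similarities over only the sources it is connected to, so the perfect-match value θ is simply the number of connected sources.

import numpy as np

def unit(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def pre_response(type_bits, parts, weights):
    """Eqs. (2.3)-(2.4): sum of cosine similarities over connected sources.
    type_bits are the (X, Y, Z) connection flags from Table 2.1, e.g.,
    (0, 1, 1) for a type 011 neuron; parts/weights hold the corresponding
    input vectors and stored weight blocks."""
    return sum(unit(w) @ unit(p)
               for bit, p, w in zip(type_bits, parts, weights) if bit)

bits = (0, 1, 1)                   # type 011 neuron
theta = sum(bits)                  # perfect match value: 2 (Table 2.1)
y_in, z_in = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0])
r = pre_response(bits, (None, y_in, z_in), (None, y_in, z_in))
print(r, theta)                    # r is approximately 2.0, i.e., theta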
2.2.4 Initialization

The random GDN initialization method is used in DN-2. If the top-1 winner in Y has a pre-response lower than the almost perfect match m(t), a neuron in the initialization stage is taken to fire, with its response set to the almost perfect match. The almost perfect match m(t) is defined as follows:

    m(t) = α(t)(θ − ε)    (2.6)

where ε is the machine zero, θ is the perfect response value for each type defined in Table 2.1, and α(t) is a handcrafted table that controls the speed of neuronal growth. We discuss the design of α(t) in Sec. 2.2.5. The new neuron suppresses other neurons from firing, as it is guaranteed to win the competition. Because its firing age is 1 at this time, its learning rate is 1; according to LCA, it learns the current input perfectly.

2.2.5 Controlling α(t) for multistage learning

By controlling the growth rate α(t) for each type of neuron, the DN releases the seven types of neurons at different speeds. Growth is fastest when α(t) = 1, at which point m(t) = θ − ε (the almost perfect response for that type of neuron) according to Eq. (2.6): a small variation in the input makes the highest response among Y neurons fall below the perfect value, thus initializing a new neuron. When α(t) = 0, no new neurons are initialized.

In the experiments presented in this paper, multistage learning is realized by adjusting the teaching schedule according to the growth rate table of the network. From t = 0 to 1 hour, the growth rate of type 100 neurons is set to 0.85 while the growth rates of the other types are set to 0; during this stage we train the network to recognize basic visual information like edges, corners, etc. After t = 1 hour, the growth rate of type 111 neurons is 0.8 and the growth rates of the other types are 0; we then train the network on higher-level concepts with more complicated actions.
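A minimal sketch of the recruiting rule in Eq. (2.6) and the two-stage growth schedule just described (ours; the schedule values 0.85 and 0.8 come from the text above, everything else is a hypothetical placeholder):

import sys

EPS = sys.float_info.epsilon              # "machine zero" in Eq. (2.6)

def almost_perfect_match(alpha, theta):
    # Eq. (2.6): the recruiting threshold m(t) = alpha(t) * (theta - eps).
    return alpha * (theta - EPS)

def maybe_recruit(best_pre_response, alpha, theta, reserve_pool):
    """Sec. 2.2.4: if the best active neuron matches worse than m(t),
    promote one neuron from the initialization stage; with firing age 1
    its learning rate is 1, so it memorizes the current input exactly."""
    if reserve_pool and best_pre_response < almost_perfect_match(alpha, theta):
        return reserve_pool.pop()
    return None

def growth_rate(neuron_type, t_hours):
    # Two-stage schedule from Sec. 2.2.5: only type 100 grows in the first
    # hour, only type 111 afterwards; all other types stay at 0.
    if neuron_type == "100":
        return 0.85 if t_hours < 1.0 else 0.0
    if neuron_type == "111":
        return 0.0 if t_hours < 1.0 else 0.8
    return 0.0

print(growth_rate("100", 0.5), growth_rate("111", 2.0))   # -> 0.85 0.8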
and introduce a smooth synaptogenic factor f (r) defined as: (2.7) (2.9) (2.10) (2.11)  f (r) = 1 βb(t)−r βb(t)−βs(t) 0 if r < βs(t) if βs(t) ≤ r ≤ βb(t) otherwise 20 where βs(t) and βb(t) are parameters (can be updated with step through time) to control the number of synapse with active connection (f (r) > 0). In this paper we set βs(t) = 0.8, and βb = 1.5. Meaning that the synapses deviating more than 1.5 times the average of the deviation would be cut, and the synapses deviating less than 0.8 times the average of the deviation would be intact. Then trim the weight vector vi = (vi1, vi2, ..., vid) to be vij ← f (rij)vij (2.12) j = 1, 2, ..., d. Similarly, trim the input vector pi = (pi1, pi2, ..., pid). In short, synaptic maintenance cuts the connections where the weight’s deviation is too large (i.e. more than βb times the average deviation among the neuron’s connections). Only the stable connections are kept after synaptic maintenance. 21 CHAPTER 3 MAXIMUM LIKELIHOOD ESTIMATION IN DN-2 3.1 Review of the three theorems in DN-1 In order the understand the property of DN-2, we need to look back at the three important properties of DN-1 proved in [56]: 1. With enough neurons, DN-1 incrementally learns any FA error-free by observing the transi- tions in the target FA only once with two updates of the network, supervised by inputs from X and Z. 2. When frozen, DN-1 generates responses in Z with maximum likelihood estimation, condi- tioned on the last state of the network. 3. With limited resources, DN “thinks” (i.e., learns and generalizes) recursively and optimally in the sense of maximum likelihood. The proof in [56] is under two assumptions, which are no longer satisfied in the context of DN-2: 1. Only type 101 Y neurons. DN-1 only uses Y neurons with bottom-up and top-down con- nections. However, DN-2 now has seven types of neurons, as shown in Fig. 1.1. 2. Global top-1 competition among Y neurons. Only one Y neuron in DN-1 is firing at any given time, thus learning the exact X pattern with a specific Z pattern associated to this Y neuron. However, DN-2 has several Y neurons firing at any given time. Thus the assumptions for DN-1 optimality is no longer available, requiring a new way to prove the optimality of learning in DN-2. 22 3.2 Definition of DN-2 A DN-2 network at time t can be defined as: N (t) = {θ(t), D(t), Γ} (3.1) Where Γ denotes the hand-picked hyper parameters for the network, D(t) denotes the long-term statistics inside the network. Parameters in D(t) are not used for optimization. θ(t) contains the weights that are actually being optimized at time t. Γ = {G, k, lin, lout, ny, nz} D(t) = {{g|g ∈ Y, Z},{¯g(t)|g ∈ Y, Z}, a(t), L(t)} θ(t) = {W (t)} G is the growth rate table. k is the top-k parameter for competition. lin is the limit of con- lout is the limit of connections from a neuron to other neurons. nections to a specific neuron. ny is the maximum number of neurons in Y zone. nz is the maximum number of neurons in Z zone. g is the set of neurons in Y and Z area. ¯g(t) is the inhibition field of the specific neuron at time t. L(t) = {L(g, t)|g ∈ Y, Z} is the set of receptive field (line subspace) for each neu- ron at time t. a(t) = {a(g, t)|g ∈ Y, Z} is the set of connection ages for each neuron at time t. W (t) = {W (g, t)|g ∈ Y, Z} is the of weights for each neuron at time t. 3.3 Conditions on DN-2 learning During learning, DN-2 is constrained by the following conditions, denoted as C = {Ci, i = 1, 2, 3} C1: Incremental learning. 
CHAPTER 3
MAXIMUM LIKELIHOOD ESTIMATION IN DN-2

3.1 Review of the three theorems in DN-1

In order to understand the properties of DN-2, we need to look back at three important properties of DN-1 proved in [56]:

1. With enough neurons, DN-1 incrementally learns any FA error-free by observing each transition in the target FA only once, with two updates of the network, supervised by inputs from X and Z.

2. When frozen, DN-1 generates responses in Z with maximum likelihood estimation, conditioned on the last state of the network.

3. With limited resources, DN "thinks" (i.e., learns and generalizes) recursively and optimally in the sense of maximum likelihood.

The proof in [56] rests on two assumptions, which are no longer satisfied in the context of DN-2:

1. Only type 101 Y neurons. DN-1 uses only Y neurons with bottom-up and top-down connections. DN-2, however, has seven types of neurons, as shown in Fig. 1.1.

2. Global top-1 competition among Y neurons. Only one Y neuron in DN-1 fires at any given time, learning the exact X pattern together with the specific Z pattern associated with this Y neuron. DN-2, however, has several Y neurons firing at any given time.

The assumptions behind DN-1's optimality are thus no longer available, requiring a new way to prove the optimality of learning in DN-2.

3.2 Definition of DN-2

A DN-2 network at time t can be defined as:

    N(t) = {θ(t), D(t), Γ}    (3.1)

where Γ denotes the hand-picked hyper-parameters of the network, D(t) denotes the long-term statistics inside the network (parameters in D(t) are not used for optimization), and θ(t) contains the weights actually being optimized at time t:

    Γ = {G, k, l_in, l_out, n_y, n_z}
    D(t) = {{g | g ∈ Y, Z}, {ḡ(t) | g ∈ Y, Z}, a(t), L(t)}
    θ(t) = {W(t)}

G is the growth rate table. k is the top-k parameter for competition. l_in is the limit on the number of connections to a specific neuron; l_out is the limit on the number of connections from a neuron to other neurons. n_y is the maximum number of neurons in the Y zone; n_z is the maximum number of neurons in the Z zone. g is the set of neurons in the Y and Z areas. ḡ(t) is the inhibition field of a specific neuron at time t. L(t) = {L(g, t) | g ∈ Y, Z} is the set of receptive fields (line subspaces) for each neuron at time t. a(t) = {a(g, t) | g ∈ Y, Z} is the set of connection ages for each neuron at time t. W(t) = {W(g, t) | g ∈ Y, Z} is the set of weights for each neuron at time t.

3.3 Conditions on DN-2 learning

During learning, DN-2 is constrained by the following conditions, denoted as C = {C_i, i = 1, 2, 3}:

C1: Incremental learning. The network never stores training data, learns on-line, and has no pre-programmed priors about the task at birth time.

C2: Skull-closed learning. Human teachers can only supervise the motor and input areas of the network once it begins learning.

C3: Limited resources. The hand-picked hyper-parameters Γ (which limit the number of neurons used by the network) remain unchanged during learning.

3.4 Lemma 1: DN-2 optimizes its weights under maximum likelihood for each update, conditioned on C

Lemma 1. Define p(t+1) = {x(t), y(t), z(t)} (p is thus the set of all responses in the three zones). At time t+1, DN incrementally adapts its parameters θ as the Maximum Likelihood (ML) estimator for the input in Y and Z, based on its learning experience with limited resources Γ:

    θ(t+1) = arg max_θ f{p(t+1) | N(t), C},  t ≥ 0    (3.2)

The probability density f{p(t+1) | N(t), C} is the probability density of the new observation p(t+1), conditioned on the last status of the network N(t), based on the network's learning experience.

The proof of Lemma 1 is attached in the Appendix. Lemma 1 states that at each update DN-2 provides the best estimate: inside the 'skull', the neurons' weights are updated optimally, conditioned on C. It is not guaranteed, however, that the external environment provides the optimal teaching schedule. The following theorem is obtained by applying Lemma 1 recursively:

3.5 Theorem: DN-2 learns optimally under maximum likelihood from its incremental learning experience, conditioned on C

Theorem 1. Define X_{t0}^t = {x(t0), x(t0+1), ..., x(t−1)} and Z_{t0}^t = {z(t0), z(t0+1), ..., z(t−1)}. At time t+1, DN-2 adapts its parameters as the ML estimator for the current p(t+1) = {x(t), z(t)}, conditioned on the sensory and motor experience X_0^t and Z_0^t with limited resources Γ:

    θ(t+1) = arg max_θ f′{p(t+1) | X_0^t, Z_0^t, C},  t ≥ 0    (3.3)

The probability density f′{p(t+1) | X_0^t, Z_0^t, C} is the probability density of the new observation p(t+1), conditioned on the entire sensory experience and the pre-defined hyper-parameters.
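Schematically, and only as a reading aid (the actual proof is in the Appendix), the recursion from Lemma 1 to Theorem 1 hinges on the skull-closed state N(t) being the network's only summary of its experience under C1 and C2:

    f′{p(t+1) | X_0^t, Z_0^t, C} = f{p(t+1) | N(t), C},  where  N(t) = Update(N(t−1), p(t))

so maximizing the left-hand side at time t+1 reduces, by applying Lemma 1 at times 1, 2, ..., t+1, to the per-update maximizations the network already performs.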
3.6.2 On over-fitting

DN is less likely to over-fit because it starts with no internal neurons and gradually initializes new neurons by comparing the current best match with the almost perfect match m(t). If large numbers of training samples cluster around similar samples (e.g., training navigation along the same route repeatedly), the best match value will constantly be high, so DN will simply retrain the few neurons representing these samples instead of tuning the network globally to fit them, avoiding over-fitting the entire network on these samples. Uncommon but important samples will not be ignored by DN-2, as their goodness of match would be low and new neurons would be assigned to remember them. In summary, the model's capacity is determined by the diversity of training.

3.6.3 On local minima

DN is task non-specific, needs no handcrafted loss function, and learns incrementally. In DN, only the winner neurons fire and update their weights with the Hebbian learning rule. Such an incremental computation of the local average of many vectors does not suffer from the well-known local-minima problem, since it converts a highly nonlinear global optimization problem into a composition of many local linear problems (the incremental update of a local weight is a local linear problem), which have no local minima. Unfortunately, there is still a local-minima problem in the external teaching. For example, if the teacher teaches a complex task first and then a simple task, the learner does not have the skills from the simple task to learn the complex task. But this "local minima" problem is outside the network. The network itself is still internally optimal.
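The "composition of many local linear problems" can be made concrete in a few lines. In the sketch below (illustrative Python, not the production DN code), a winner neuron's update is the candid incremental average used by LCA, so after n wins its weight vector equals the exact sample mean of the inputs it fired on, a closed-form solution with no gradient descent and hence no local minima:

import numpy as np

def winner_update(w, age, p):
    # incremental mean: after this update, w is exactly the average
    # of all inputs this neuron has won on (age counts those wins)
    age += 1
    w = ((age - 1) / age) * w + (1.0 / age) * p
    return w, age

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 5))
w, age = np.zeros(5), 0
for p in inputs:
    w, age = winner_update(w, age, p)
print(np.allclose(w, inputs.mean(axis=0)))  # True: exact running mean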
CHAPTER 4
EMERGENT TURING MACHINE AND HIERARCHICAL REPRESENTATION IN DN-2

4.1 The Challenges

In navigation, skills are the navigation movements learned in different sections of the environment. For example, one skill can be "turn slightly left when seeing an obstacle". During real-time training, the agent learns different skills during several training sessions, each with a different context (e.g., in the simulation there is a concept zone with individual neurons corresponding to each skill; in the real-world environment the starting and ending points differ for each training session). Individual navigation skills can be viewed as separate transition entries inside the transition table of the Finite Automaton that navigates in the desired environment. The challenges are:

1. Going beyond static hierarchy. In other hierarchical neural networks (e.g., Convolutional Neural Nets) the hierarchies are static, with pre-defined layers and connection patterns. DN-1 has several areas simulating the visual pathways inside the human brain (e.g., the PP, IT and L4 areas in [31]). A static hierarchy is rigid and non-developmental: hand-crafting region boundaries before training results in suboptimal resource distribution. In DN-2 we have an emergent hierarchy, meaning dynamic boundaries and a dynamic number of regions, controlled by the learning experience and the growth rate of the network. DN-2 initializes new neurons when needed, forming a fluid hierarchy among regions.

2. Emergent Turing Machine (ETM) from the fluid hierarchy. Our network uses emergent representation to function as a Turing Machine. Based on the lower-level concepts (e.g., where and what), higher-level concepts (e.g., skills and actions) are invariant to changes in the input image (e.g., appearance variances and shadows) as long as the lower-level concepts abstract correctly. The firing pattern in Z is thus supported by the fluid hierarchy of internal representation, so that invariance emerges from multi-stage learning.

Table 4.1: Multiple types of neurons learning an FA transition

Zone \ Time | t    | t + 0.33               | t + 0.66                     | t + 1
Z (Motor)   | z(t) | z(t)                   | z(t)                         | z(t + 1)
Y 011       | *    | *                      | y011(t) = {−, y100(t), z(t)} | y011(t)
Y 100       | *    | y100(t) = {x(t), −, −} | y100(t)                      | y100(t)
X (Input)   | x(t) | x(t)                   | x(t)                         | *

4.2 Emergent FA learning in DN-2

In order to understand the properties of DN-2, we need to look back at the three important properties of DN-1 proved in [56]:

1. With enough neurons, DN-1 learns any FA incrementally and error-free by observing the transitions in the target FA only once, with two updates of the network, with supervision in X and Z.

2. When frozen, DN-1 generates responses in Z with maximum likelihood estimation, conditioned on the last state of the network.

3. Under limited resources, DN "thinks" (i.e., learns and generalizes) recursively and optimally in the sense of maximum likelihood.

The proof of these properties relies on two steps: 1) exact matching between the current input (x(t), z(t)) and the stored patterns (top-down and bottom-up weights) in the firing Y neuron, and 2) the firing Y neuron corresponding to an output state z(t + 1). With enough resources, a single Y neuron corresponds to a single entry in the FA's transition table {x(t), z(t)} → z(t + 1). With limited resources, the Y neurons perform a tessellation of the {X, Z} space, and optimality (in the sense of maximum likelihood) comes from the tessellation.

DN-2 extends DN-1 by introducing lateral connections among neurons and multiple types of neurons. In the above proof for DN-1, we observe that the one-to-one correspondence between a Y neuron and a transition can be relaxed: we only need to guarantee that each transition is uniquely represented by the firing Y neurons that are connected to the Z area. The output in Z can then be connected to these firing Y neurons in the same way as in DN-1. There are multiple ways to represent (x(t), z(t)) uniquely in the Y area of DN-2, with the help of multiple types of Y neurons:

1. Type 101 neurons. With only type 101 neurons, DN-2 degrades to DN-1, and DN-1's properties still hold in this situation.

2. Type 100 + type 011 neurons. The network needs to update twice to fire a type 011 neuron that corresponds to the transition, as illustrated in Table 4.1. Assuming top-1 competition within each type of neuron (i.e., only one neuron firing) and that the network can always initialize new neurons when matching is not perfect, the higher-level 011 neuron will always be able to learn the transition after two updates.

3. Type 110 + type 001 neurons. A similar argument as the type 100 + type 011 situation.

4. Type 111 neurons. With only type 111 neurons, DN-2 performs a tessellation of the {X, Y, Z} space. The 111 neurons memorize transitions {x(t), y(t), z(t)} → z(t + 1), which form a superset of the transitions in the FA defined for DN-1.

Thus, the error-free learning of an FA in DN-1 still holds with multiple types of neurons in DN-2. The optimality in the sense of maximum likelihood under the condition of limited resources is a natural result of the tessellation, and thus can be adapted from the proof in [56] with little modification.
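The argument above can be illustrated with a toy simulation. The sketch below (Python; it assumes top-1 competition, an exact-match criterion, and unlimited neurons, so it corresponds to the error-free, resource-rich case rather than the full DN-2 algorithm) grows one neuron per observed transition {x(t), z(t)} → z(t + 1) and then replays the FA without error:

import numpy as np

class ToyEmergentFA:
    # each neuron memorizes one (x, z) pattern pair (bottom-up and
    # top-down weights) plus the Z output that followed it
    def __init__(self):
        self.neurons = []                    # list of (x, z, z_next)

    def learn(self, x, z, z_next):
        for i, (bx, bz, _) in enumerate(self.neurons):
            if np.array_equal(bx, x) and np.array_equal(bz, z):
                self.neurons[i] = (bx, bz, z_next)   # retrain the winner
                return
        self.neurons.append((x, z, z_next))  # imperfect match: new neuron

    def step(self, x, z):
        for bx, bz, z_next in self.neurons:
            if np.array_equal(bx, x) and np.array_equal(bz, z):
                return z_next                # the firing neuron drives Z
        raise ValueError("unseen transition")

# an FA with states z in {0, 1} and inputs x in {0, 1}: z(t+1) = x XOR z;
# each transition is observed once and then replayed error-free
fa = ToyEmergentFA()
for x in (0, 1):
    for z in (0, 1):
        fa.learn(np.array([x]), np.array([z]), x ^ z)
assert fa.step(np.array([1]), np.array([0])) == 1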
4.3 Hierarchical representation in DN-2

We use type 100 neurons (lower-level representation, local receptive fields) and type 111 neurons (higher-level representation, global receptive fields) as an example; other types follow a similar discussion. We present a simplified case of learning specific groups of features F = {fi | i = 1, 2, ..., m} across all possible locations L = {lj | j = 1, 2, ..., l} inside the input. The Z motor thus has a location concept and a type concept.

During stage 1 the network develops representation in type 100 neurons (no other types of neurons have been initialized, controlled by α(t)), with each neuron learning a specific feature fi at location lj. Depending on the number of lower-level features present in the input image, one or several low-level neurons learning specific features at specific locations {fi, lj} will be firing at any given time.

During stage 2 (t = t2, where t2 is the starting time of stage 2), α(t) for type 100 neurons decreases, meaning that few new lower-level neurons are initialized, while α(t) for type 111 neurons increases, meaning that the network is now growing higher-level neurons. As a specific group of fi is presented, one or more 100 neurons fire, forming a lower-level firing pattern y100(t). At time t + 1, a higher-level type 111 neuron y111 fires and learns this lower-level firing pattern, thus forming a hierarchy of representation, denoted as y100(t) → y111(t + 1).

Multi-stage learning allows coarse-to-fine integration of representation. DN-1 has only global matching of representation using type 101 neurons. Under limited resources, each neuron in DN-1 averages over multiple inputs globally, where detailed local features are blurred and neglected. Synaptic maintenance resolves this issue to some extent by shaping the receptive field gradually, but this process requires extensive training. In DN-2, the finer, detailed representation from the local receptive fields of type 100 neurons supports the coarser, higher-level firing among type 111 neurons.
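A minimal sketch of this two-stage schedule (Python; the thresholds, patch layout, and data structures are illustrative assumptions, not the DN-2 internals): stage 1 grows type-100-like local feature detectors, and stage 2 grows a type-111-like neuron for each distinct lower-level firing pattern, realizing y100(t) → y111(t + 1):

import numpy as np

def stage1_firing(image, patches, dictionary, tau=0.95):
    # type-100-like layer: for each local patch, the best-matching
    # stored feature fires; a new neuron is grown if no match exceeds tau
    fired = []
    for (r, c, s) in patches:
        patch = image[r:r+s, c:c+s].ravel()
        patch = patch / (np.linalg.norm(patch) + 1e-9)
        sims = [float(d @ patch) for d in dictionary]
        if sims and max(sims) > tau:
            fired.append(int(np.argmax(sims)))
        else:
            dictionary.append(patch)          # grow a new 100 neuron
            fired.append(len(dictionary) - 1)
    return tuple(fired)

def stage2_learn(pattern, memory):
    # type-111-like layer: one higher-level neuron per distinct
    # lower-level firing pattern
    if pattern not in memory:
        memory[pattern] = len(memory)         # grow a new 111 neuron
    return memory[pattern]

rng = np.random.default_rng(1)
img = rng.random((8, 8))
d100, m111 = [], {}
y100 = stage1_firing(img, [(0, 0, 4), (0, 4, 4), (4, 0, 4), (4, 4, 4)], d100)
y111 = stage2_learn(y100, m111)               # y100(t) -> y111(t+1)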
4.4 Universal Turing Machine and Developmental Network

4.4.1 Definition of UTM

A Universal Turing Machine (UTM) simulates the behavior of any TM, given that the TM and its input are encoded onto the input tape of the UTM. Formally, a UTM receives an input string in the form of e(T)e(x) on the input tape, where e is the encoding function, T is the targeted TM, and x is the input. It simulates the computation of T on data x, and outputs e(z), where z is the output of T on data x. The UTM does not really know the meaning of z [33].

4.4.2 Differences between UTM and DN-2

The encoding function under the UTM framework is hand-crafted. There are many possible ways to build such an encoding function, which is designed to represent the symbolic transition table. DN-2 does not have this encoding function because it does not use symbolic representation. DN-2 uses natural input as patterns in X. It also uses emergent states as patterns in Z. The actual encoding is the transformation from the {X, Z} patterns to the neuronal weights in the network. This transformation is not a hand-crafted encoding, but rather the result of competition based on the biologically inspired mechanisms of DN, such as Hebbian learning in Lobe Component Analysis and dynamic inhibition regions.

For a UTM, the input tape contains both the TM part e(T) and the input part e(x). In DN-2, there is no fixed or static division between rules T and data x. Earlier sensorimotor experiences from X and Z tend to be treated as data x for the agent to recognize and to extract rules from. Later sensorimotor experiences enable the learning agent to acquire rules very quickly from very few examples, because the rules appear in abstract form in such experiences. In other words, the X and Z experience contains both the instructions and the data.

To summarize in computational terms, a UTM performs using an encoding function: it looks up the current input e(x) in e(T), and produces the corresponding state e(q) and movement e(z). DN performs using emergent patterns in X, Y and Z: it searches the entire learning experience embedded in all weights of the network, and after competition, the corresponding states and movements emerge in Z.

4.5 Autonomous navigation needs Emergent Turing Machine

Here we apply our DN-2 theory to a real-world scenario: autonomous navigation. Autonomous navigation requires general-purpose learning like DN-2 instead of hand-crafted rules and feature detectors, for two reasons:

1. Dynamic environment. In a real-world setting, we cannot anticipate the kinds of landmarks, concepts, and context needed for an unknown driving environment.

2. Complex traffic rules. The sheer number of rules for a navigation FA is too large to enumerate. These rules also interact with each other, and the interaction is impossible to hand-craft.

In DN-2, the interactions among different rules are handled by lateral connections. The representation emerges from the context. Concepts and contexts are generated on demand: if the agent is familiar with the current navigation setting (internal firing value close to perfect), no new context is created; if not, new context is created on the fly to make up for what is lacking.

In a navigation setting, the X inputs are the sensor inputs (e.g., GPS input, vision input from cameras, or LIDAR input from laser sensors). The low-level Z skills correspond to recognition results (recognized type information and location information) at a given location. High-level Z concepts correspond to the actions taken in different driving settings (e.g., turn slightly right when an obstacle is detected on the left-hand side). Using this framework, DN-2 learns multiple Z concepts using different levels of reasoning: "what" action to take at a specific location ("where"), and "which" recognized object is most relevant to the current situation. Unlike a TM, where the internal states must be human-defined and their transition rules must be hand-crafted, DN-2 uses emergent representation internally to learn clear logic with optimality.

In the following sections we present three experiments with DN-2:

1. Simulated navigation in a maze environment. This experiment is similar to the simulation in [48], but we train the agent with hierarchical concepts and simulated visual inputs. The DN-2 agent in this simulation is equipped with three sensory areas (X areas) and six concept areas (Z areas). It is taught to recognize object types and locations as its lowest level of concepts. Based on the recognition results, the agent is then taught different skills (small navigation segments) with the action motor supervised and the recognition motors unsupervised. The agent is then taught the two major routes from start to finish, with only the highest-level concept "route" supervised. The agent needs to chain the different lower-level skills together with no supervision from the teacher.

2. Batch vision data learning. We compare the performance of a standard Convolutional Neural Network (an AlexNet-style network with 20 convolution kernels of size 5, 2 by 2 max pooling, 50 convolution kernels of size 5, 2 by 2 max pooling, a 500-neuron fully connected layer, and a fully connected layer to the output [25, 60]; see the sketch after this list), DN-1, and DN-2 using the same dataset. Although the data were collected in batch mode and the benchmark CNN is trained with iterations, we train the DNs with an incremental on-line learning setup. The CNN iterated through the dataset for 500 iterations to reach convergence, while the DN networks viewed the dataset only twice with better performance. This experiment demonstrates the capability of DN-2 and also helped us find desirable hyper-parameters.

3. Real-world navigation with stereo RGB images. We put a DN-2 agent with the hyper-parameters from the previous experiment into a real-world scenario. The agent is taught to recognize objects at different locations and then learns to navigate in different sections. The testing routes are disjoint from the training routes, and the agent needs to generalize what it learned to these novel settings.
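For reference, a minimal PyTorch sketch of the benchmark architecture described in item 2 above. The layer sizes come from the text; the input channel count, output class count, and the resulting flattened size are our assumptions for the 38 by 38 gray-scale inputs described in Chapter 6, not specifications from the contest code:

import torch.nn as nn

def benchmark_cnn(in_channels=1, n_actions=6):
    # 38 -> conv5 -> 34 -> pool2 -> 17 -> conv5 -> 13 -> pool2 -> 6,
    # so the flattened feature map has 50 * 6 * 6 values
    return nn.Sequential(
        nn.Conv2d(in_channels, 20, kernel_size=5), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(50 * 6 * 6, 500), nn.ReLU(),
        nn.Linear(500, n_actions),
    )

model = benchmark_cnn()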
CHAPTER 5
SIMULATION WITH MAZE ENVIRONMENT

The navigation agent and a sample of the simulated environment are presented in Fig. 5.1. This simulation helps us verify our argument that DN-2 learns the navigation FA perfectly with enough resources. A video demonstrating the training and testing process can be found at https://youtu.be/CbhS1qvWZn0.

5.1 The environment and the teacher FA

The environment is a block-based maze with different types of blocks: open, wall, obstacle, traffic light, and destination. Each block, except for the traffic light, can be of only one type, with a size of 50 pixels in height and 50 pixels in width. Traffic lights are small objects, 20 pixels wide and 20 pixels tall, on top of a wall block. Open blocks are transparent. Wall blocks are red. Obstacle blocks are blue. Destination blocks are dark green. A traffic light has two colors: purple (stop) and light green (pass). The interface shows the maze environment from a top-down viewing angle with all the blocks plotted in 2D.

The vision sensor of the agent gives it a 43 by 30 image, as shown in Fig. 5.1. The walls are 30 pixels tall, while the obstacles are 15 pixels tall and occupy the bottom half of the image. The traffic lights are 15 pixels tall and occupy the top half of the image.

The teacher FA follows the hand-crafted rules listed below (a minimal sketch of the teacher's decision rule follows this list):

1. Follow the GPS direction when there is no obstacle or traffic light within the range of the agent's vision input.

2. If there is an obstacle seen by the agent and the obstacle is less than 20 pixels away (or, equivalently, wider than 8 pixels in the vision image of the agent), the agent should turn and move toward the open block to avoid hitting the obstacle.

3. If there is a purple traffic light seen by the agent, the agent should stop and wait.

4. If there is a light green traffic light seen by the agent, the agent follows the other rules.

5. When stepping from a block to its adjacent block, the agent should put down a marker to keep track of distance.

6. After the agent reaches the destination, it should go back to the starting point.
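One possible encoding of the teacher FA (a hedged Python sketch: the rule priority, the symbolic action names, and the use of apparent obstacle width as a stand-in for distance are our illustrative choices, not part of the simulator):

def teacher_action(gps_dir, obstacle_width, light):
    # rules 3/4: a purple light stops the agent; a light green one
    # defers to the remaining rules
    if light == "stop":
        return "stop"
    # rule 2: an obstacle closer than 20 pixels appears wider than
    # 8 pixels in the 43-pixel-wide vision image
    if obstacle_width > 8:
        return "turn_toward_open_block"
    # rule 1: otherwise follow the GPS direction
    return gps_dir

assert teacher_action("left", 0, "go") == "left"
assert teacher_action("left", 12, "go") == "turn_toward_open_block"
assert teacher_action("left", 12, "stop") == "stop"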
5.2 The agent

The agent is of size 20 by 20 pixels and moves continuously inside the simulated environment. The network inside the agent has three X areas:

Xvision: The agent has 210-degree vision with a range of 75 pixels. This 210-degree view gives the agent a 2D RGB image of 43 by 20 pixels.

Xgps: The agent is also equipped with a simulated GPS to find the correct turn at each crossroad. GPS is shown as a black arrow in the GUI. The GPS cannot sense the presence of obstacles, so the agent cannot follow the GPS signals blindly. There are three input neurons in this area, corresponding to left, forward, and right.

Xmark: The agent is equipped with a marking sensor (a single neuron), which fires with value one if the agent senses a marker it previously put down.

Figure 5.1: Simulated maze environment and the agent design. (a) and (b): the simulated environment and corresponding GUI. The environment is 450 by 450 pixels, with the maze no larger than nine by nine blocks. The agent is 20 by 20 pixels, moving continuously in the maze environment. (1) Wall blocks are red in the GUI. (2) Obstacle blocks are blue in the GUI. (3) Destination blocks are green in the GUI. (4) Reward/punishment blocks are yellow in the GUI. (5) The agent's route is presented in the GUI: black routes are skills already learned or going-back routes; red routes are novel skills to be learned. (6) The agent in the environment, discussed in detail in subfigure (c). (7) and (8) Information panels for training and testing. (9) Traffic light block with two states: in (a) the traffic lights are in the 'go' state, and in (b) they are in the 'stop' state. (10) The 2D vision image seen by the agent. (c) The agent's representation in the GUI. (11) GPS signal from the environment. The GPS indicates where the destination is in the simulated environment, but it is not aware of the obstacles, so the agent needs to learn the skill "avoid obstacle". (12) Vision sensor of the agent. To simplify the drawing, we show seven of the 43 vision lines forming the 210-degree vision with a range of 75 pixels around the agent.

The agent also has 7 Z areas:

Zwhere: Lowest-level concept. There are 43 neurons in this area, corresponding to the 43 possible locations in the input image.

Zwhat: Lowest-level concept. There are four types of objects the agent needs to recognize: open, obstacle, traffic light (stop), and traffic light (go).

Zscale: Lowest-level concept. The agent is trained to recognize objects at different scales (from 1 pixel wide to 21 pixels wide).

Zaction: Mid-level concept. There are four neurons in this area: forward, left, right, and stop. Each forward movement advances the agent's position toward its heading direction by 1 pixel. Each left/right turn increases/decreases the agent's heading by 10 degrees.

Zskill: Higher-level concept. There are eight neurons in this area, corresponding to 8 different scenarios that are essential for navigation in the simulated environment.

Zroute: Highest-level concept. There are two ways to reach the same destination in the final test. Thus this area has three neurons: route 1, route 2, and go back.

Zmark: There are two neurons in this area, corresponding to putting down a marker or not putting down a marker. This motor can be viewed as the write head of our ETM, and the marker is sensed by Xmark in later iterations of navigation in the same environment.

We allocated 1300 Y neurons inside the agent. We organized the lessons such that only type 101 neurons are released when learning where-what concepts and basic skills. Type 011 neurons are released when learning individual routes.
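Because Zmark writes and Xmark later reads the same environment state, the pair behaves like the write and read heads of a tape. A minimal sketch of that loop on the environment side (the grid keys and method names are illustrative assumptions):

class MarkerTape:
    # environment side of the ETM tape: the Zmark motor writes a marker
    # at the agent's current block; the Xmark sensor reads it back (0/1)
    # on later visits
    def __init__(self):
        self.marks = set()

    def write(self, block):
        self.marks.add(block)

    def read(self, block):
        return 1 if block in self.marks else 0

tape = MarkerTape()
tape.write((3, 4))              # agent steps into block (3, 4): Zmark fires
assert tape.read((3, 4)) == 1   # Xmark fires on the next visit
assert tape.read((0, 0)) == 0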
Algorithm 1: Environment for teaching individual skills

initialize agent DN-2;
initialize environment;
for zskill ← 1 to skill_num do
    for i ← 1 to epoch_num do
        Initialize a random maze with crucial blocks;
        // Agent moves to the destination block
        while Agent is not at destination do
            Get supervised action zaction from teacher;
            Get current X input x (for all X areas);
            Supervise agent with x, zaction, and zskill (other concepts are zero vectors);
            Update agent's DN-2;
            Agent moves according to supervision;
        end
        // Agent goes back to initial position
        while Agent is not at initial position do
            Get supervised action zaction from teacher;
            Get current X input x (for all X areas);
            zskill ← go back;
            Supervise agent with x, zaction, and zskill (other concepts are zero vectors);
            Update agent's DN-2;
            Agent moves according to supervision;
        end
    end
end

5.2.1 Stage 1: Learning concepts where, what and scale

We randomly generate mazes and put the agent into random locations in the maze. At each time step the teacher supervises the where, what and scale motors according to the current object in the input image. We trained the agent on 1000 images with where-what information. At this stage only type 101 neurons are used.

5.3 Stage 2: Skills, route, and distance

5.3.1 Teaching skills

At this stage, we teach the agent seven basic skills to navigate in the maze environment. These skills are presented in Fig. 5.2. Note that the presented mazes are only one instance among numerous randomly generated maze environments with the key blocks unchanged. Thus the learned skills become environment-invariant when learning takes enough iterations.

Algorithm 2: Environment for teaching the route concept

for zroute ← 1 to route_num do
    Initialize maze with specific route;
    while Agent is not at destination do
        Get current X input x (for all X areas);
        Get current zmark from environment;
        Supervise agent with x, zroute and zmark (other concepts are emergent from the previous update);
        Update agent's DN-2, record emergent skill and action;
        Agent moves according to emergent action;
    end
    // Agent goes back to initial position
    while Agent is not at initial position do
        Get supervised action zaction from teacher;
        Get current X input x (for all X areas);
        zskill ← go back; zroute ← go back;
        Supervise agent with x, zaction, zskill, and zroute;
        Update agent's DN-2;
        Agent moves according to supervision;
    end
end

When teaching skills, we supervise Zaction and Zskill during each network update. The lower-level concepts (where, what and scale) are emergent, based on the learning results of the previous stage. The higher-level concepts are all zero vectors, and no connection is learned in the irrelevant concept zones. The teacher's behavior follows Algorithm 1. After the agent reaches the destination, the teacher leads the agent back to the starting position. This simulates the continuous context of the agent inside the simulation, with no sudden change of the external environment.

Figure 5.2: Skills and routes taught to the navigation agent in the simulation experiment. Skill 1: move forward when the next block is open. Skill 2: avoid the obstacle and correct the facing direction. Skill 3: turn left on the corner when GPS indicates left, and avoid the obstacle. Skill 4: move through the narrow path. Skill 5: turn right when GPS indicates right. Skill 6: turn left when GPS indicates left. Skill 7: turn left and then turn right to reach the destination. During teaching, traffic lights are placed along the route as shown in Fig. 5.1, and the agent needs to learn to follow the rules of the traffic light as well. Although the agent has 43 vision lines as shown in Fig. 5.1, we show only 7 of them for clarity. A full video of training and testing is available at https://youtu.be/CbhS1qvWZn0.
5.3.2 Teaching the route concept

After each skill is learned for enough iterations, we teach the agent two ways to navigate to the entrance at the lower right corner of a specific environment. The teacher's behavior is presented in detail in Algorithm 2. At this stage Zaction and Zskill are emergent (they emerge from the computation of the agent). The teacher only supervises Zroute and the write head Zmark of the network.

5.3.3 Evaluation of performance

Performance of the simulated agent is evaluated at two stages:

1. Skill chaining with the higher-level concept supervised. When learning Zroute, the lower-level motors are emergent. The agent needs to successfully chain these skills together. We evaluate the success rate of skill chaining in this situation and report it in Sec. 5.4.1.

2. Navigation with the higher-level concept unsupervised. The agent relies only on the GPS and vision input, with all Z motors emergent. We evaluate the success rate of navigation (reaching the final destination) in this situation and report it in Sec. 5.4.1.

5.4 Experiment results and analysis

5.4.1 Skill learning and chaining

We initialized 15 DN-2 agents in parallel to learn the lower-level skills first, according to Algorithm 1. After these basic skills are mastered by the agents, the teacher leaves the lower-level skills emergent while supervising the higher-level concept zones according to Algorithm 2. At this stage, the agent needs to chain the different lower-level skills together, with the navigation context different from the context of the previous stage. As shown in Table 5.1, all 15 DN-2 agents successfully chained these lower-level skills together when training the higher-level skills.

5.4.2 GPS blurring and noisy inputs

The next experiment blurs the GPS input and the sensor inputs at the same time by different degrees, as shown in Table 5.1. As learning route 1 (5 subtasks) is a more difficult task than learning route 2 (3 subtasks), more agents failed at learning route 1 when the GPS was blurred. Nevertheless, all agents succeeded in learning both routes when the noise level was relatively low (less than five percent).

Table 5.1: Experiment result

noise level          | 0%            | 5%            | 10%            | 20%
route 1 chaining     | 15/15         | 15/15         | 10/15          | 2/15
route 1 GPS blurring | 15/15         | 15/15         | 6/15           | 0/15
route 2 chaining     | 15/15         | 15/15         | 15/15          | 15/15
route 2 GPS blurring | 15/15         | 15/15         | 14/15          | 13/15
total success rate   | 60/60 (100%)  | 60/60 (100%)  | 55/60 (91.67%) | 32/60 (53.33%)

5.5 Discussion

A video recording of the entire training and testing scenario can be found at https://youtu.be/CbhS1qvWZn0. The network successfully learned the navigation rules listed in Sec. 5.1 with 100% accuracy when the noise in the input or GPS signals is small. The result verifies our claim that, with enough resources, DN-2 learns the FA error-free using emergent representation.

CHAPTER 6
REAL-WORLD EXPERIMENTS AND RESULTS

6.1 Batched vision data learning

The AIML 2016 contest [1] provided batch data of three modalities (vision, audition, and text) for on-line learning agents to test their learning capability with limited resources. In this paper, we use the AIML vision data to validate our design and select hyper-parameters for our real-world navigation application.
This dataset contains 4,109 images collected using a mobile device and its onboard camera. Each image (gray-scale, rescaled to 38 by 38 pixels) is labeled with detailed information about the desired action, the corresponding GPS signal, the most prominent landmark, and the landmark's location. Sample images from this dataset are presented in Fig. 6.2.

The baseline model (implemented with a Convolutional Neural Network, i.e., an AlexNet with a fully connected layer concatenated with GPS information) achieved an error action rate of 24.70%, with around 500 epochs of iteration through the training data [60]. The best performing model of the contest achieved an error rate of 26.40% at the first epoch of training, with 1500 globally connected neurons in the Y area. With multiple views and additional resources (see the comparison in Table 6.1), a DN-1 implementation achieved an error rate of 22.80% [60].

Figure 6.1: Real-world experiment flow chart. The lower-level Y neurons (type 100) detect local features in the input image. The higher-level Y neurons are more robust against changes in local variances, while forming a temporal context for navigation.

Figure 6.2: Sample data from the AIML 2016 navigation dataset. All images are labeled with four Z motor concepts: action, gps, attention and type. Data was collected around the university campus using an HTC One M8 Android device.

Figure 6.3: AIML 2016 vision performance comparison among different network configurations. The X axis is the number of high-level neurons in DN-2 (i.e., type 101, type 011 and type 111). The Y axis is the error action rate with four-fold cross validation on the AIML testing data. Each configuration is labeled with (receptive field size of type 100 neurons, number of type 100 neurons at each pixel location, and the type of high-level neurons used). DN-2 with local 100 neurons and 500 high-level 111 neurons offers the best performance with reasonable computational resource requirements.

Table 6.1: DN-2 performance compared to benchmarks on AIML 2016

                        | CNN            | DN-1   | Best AIML DN | DN-2
# of neurons (type 100) | 46,986 (total) | 0      | 0            | 4,900
# of neurons (type 101) |                | 3,900  | 1,500        | 0
# of neurons (type 111) |                | 0      | 0            | 500
best performance        | 24.70%         | 22.80% | 26.50%       | 20.75%

As discussed in previous sections, DN-1 has no hierarchy in Y (spatial or temporal) and thus has limited representational power. We conducted experiments with DN-2 using different types and numbers of high-level Y neurons; see Fig. 6.3. To make a fair comparison and simulate an incremental learning scenario, all networks (except the Convolutional Neural Net) went through the following training and testing process:

1. Training low-level representations with where-what information. When training a DN-2 network, we first release the type 100 neurons to form low-level representations like edges and gradients. DN-1 has no type 100 neurons, so this step is skipped.

2. Training higher-level representations with navigation action, GPS information, and where-what recognition results. In a DN-2 network, type 111 (or 011, 010, etc.) neurons are released according to the previously decided growth rate table. At this stage, higher-level representation is formed on top of the lower-level representation from stage 1.

3. Testing action accuracy with the network frozen. The network is no longer updating at this stage; we test its performance by recording the accuracy of the generated action for each image.
We experimented with several different network configurations using four-fold cross-validation. The final results are presented in Fig. 6.3 and Table 6.1. The best network, with type 100 neurons and 1500 type 111 neurons, achieved an error rate of 20.01% after viewing the training data only once, which is more accurate than the CNN and DN-1 benchmarks.

Figure 6.4: Real-time experiment setup. 1. Stereo USB camera connected to the mobile device. The cameras stream a 480 by 640 stereo RGB image pair in real time. 2. OnePlus 5 phone running the DN-2-equipped navigation application in CPU and GPU mode. 3. Moga Pro controller for motor supervision and testing result recording. 4. The navigation application interface for training and testing.

6.2 Real-world navigation: incremental training and testing

Guided by the batch experiment results, we developed an Android application for real-time pedestrian navigation using DN-2.

The general setup for the navigation application is presented in Fig. 6.4. The application runs on a OnePlus 5 Android phone connected to a stereo camera (two USB webcams mounted on top of a helmet). The four motors, shown in Fig. 6.1, can all be supervised via the Moga Pro Bluetooth controller during training. The controller is also used to record the network's performance during testing.

The network is trained around the campus of the university to learn the task of autonomous navigation on the sidewalk. Fig. 6.8 illustrates the extensiveness of training and testing. The inputs to the DN came from the same mobile phone that performs the computation, including the stereo image from a separate camera and the GPS signals from the Google Directions interface. The outputs of the system include the heading direction or stop, the location of attention, the type of the object at the attended location (which detects a landmark), and the scale of attention.

The wide variety of real-world visual scenes implied by the extensive routes in Fig. 6.8 presented great and rich challenges to this camera-only system, which uses no laser device. DN-2 uses multiple types of neurons to form robust representations of the road edges and obstacles, as discussed in Sec. 1.1.

6.2.1 Training and testing

We set the growth hormone of DN-2 to use only two types of neurons: type 100 for low-level image feature extraction and type 111 for robust high-level representations, according to the results of our batch experiment. Testing is performed with the network frozen (weights not updating, but the neurons still generating responses). As shown in Fig. 6.8, the testing routes are novel settings for the learned network, but with obstacles, roads, and bushes similar to the training settings.

Performance of the network is summarized in Table A.1 and evaluated using two different metrics: differences from the user's intention (diff) and absolute errors (error). The distinction is that in a 'diff' situation the network can still recover from the movement in subsequent frames (e.g., zig-zagging instead of going straight forward), while absolute errors are situations where the network gets stuck in an unrecoverable action (e.g., stopping and not moving, or bumping into obstacles). According to Table A.2, most errors came from untrained obstacles (e.g., bicycles in the middle of the lane, or pedestrians on skateboards). This can be resolved by more extensive training and a larger network.
Figure 6.5: Bottom-up weights learned by type 100 neurons. Each type 100 neuron focuses on a 5 by 5 region of the input image, as shown in Fig. 6.1, and conducts local dynamic top-k competition. As the neurons take stereo RGB inputs, we show only the weights corresponding to the left camera. Each small square is the 5 by 5 RGB weight of a specific neuron. The weight values are normalized to the range 0 to 1. These bottom-up weights are projected into image space in Fig. 6.6 and Fig. 6.7.

6.2.2 Visualization: Bottom-up weights of type 100 neurons

Fig. 6.5 shows part of the bottom-up weights of the type 100 neurons in the trained network. As shown in the figure, the lower-level neurons pick up basic features such as edges and color gradients in the image. After dynamic top-k competition, the firing neurons among these low-level neurons focus on the locally most prominent features, which can be either road edges (important features) or shadow edges (distractors). The firing pattern of these lower-level neurons is fed into the type 111 neurons (the higher level of representation), which use synaptic maintenance to cut away the unstable connections to distractors, forming a more robust representation.

Figure 6.6: Projected lateral weights for type 111 neurons without synaptic maintenance. Type 111 neurons with low firing ages form evenly distributed attention due to the local competition zones of the lower-level 100 neurons.

6.2.3 Visualization: Lateral weights from type 100 neurons to type 111 neurons

Fig. 6.6 and Fig. 6.7 show the projected lateral weights of high-level 111 neurons in the trained network. Each sub-figure shows the Y-to-Y connections of one 111 neuron, with the connected 100 neurons' bottom-up weights projected into image space. Each subfigure is equivalent to showing only the red receptive fields for type 111 neurons in Fig. 6.1. The 111 neurons in Fig. 6.6 have lateral connections evenly distributed among the lower-level 100 neurons, as they have not entered the synaptic maintenance stage due to their low firing ages. Lateral weights of 111 neurons that have entered the synaptic maintenance stage are shown in Fig. 6.7. Synaptic maintenance cuts away the unstable connections with high variances. As shown in the visualization, these neurons focus their attention on road edges and become invariant to changes in the monotone shadows.

This visualization shows that, using multiple types of neurons combined with lateral connections, we can form robust hierarchies of representation with the internal features.

Figure 6.7: Projected lateral weights for type 111 neurons with synaptic maintenance. After synaptic maintenance, the high-level neurons focus attention on consistent road edges and become invariant to changes in the highly variable shadow shapes. The connections from type 100 neurons to type 111 neurons shift towards these consistent features, while the highly unstable connections are cut from the range of connection.

Figure 6.8: Training and testing routes around the university during different times of day and under different natural lighting conditions. Disjoint testing sessions were conducted along paths that the machine had not learned.
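The projection used in Fig. 6.6 and Fig. 6.7 can be sketched as follows (Python with NumPy; the array layout and receptive-field bookkeeping are illustrative assumptions): each type 100 neuron connected to the chosen 111 neuron contributes its bottom-up filter, scaled by the lateral weight, at its receptive-field location in the image:

import numpy as np

def project_lateral_weights(lateral_w, rf_corners, filters, img_shape, k=5):
    # accumulate each connected 100 neuron's k-by-k bottom-up filter,
    # scaled by its lateral weight, at its receptive-field location
    canvas = np.zeros(img_shape)
    for w, (r, c), f in zip(lateral_w, rf_corners, filters):
        canvas[r:r+k, c:c+k] += w * f
    return canvas / (canvas.max() + 1e-9)   # normalize for display

rng = np.random.default_rng(0)
demo = project_lateral_weights(
    lateral_w=[0.9, 0.1],
    rf_corners=[(0, 0), (5, 5)],
    filters=[rng.random((5, 5)), rng.random((5, 5))],
    img_shape=(10, 10),
)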
CHAPTER 7
CONCLUSION

In this paper, we established that DN-2 is an emergent Turing Machine that learns the rules of a navigation TM with hierarchical representation. DN-2 uses multiple types of neurons to form hierarchical representation internally. Compared to FA-based methods with no hierarchy in representation, this framework forms robust representations of important features while disregarding distractors in the inputs. Experimental results demonstrated that DN-2 learns navigation rules with satisfactory performance.

APPENDIX

Table A.1: Real-time training and testing detail

Date     | Time          | Detail                                                                                                                          | # of samples
8/5/2017 | 14:30 - 15:30 | Training type 100                                                                                                               | 3051
8/6/2017 | 14:15 - 15:30 | Training type 111, neuron num: 500; Right: 214, SLRight: 338, Forward: 1182, SLLeft: 352, Left: 210, Stop: 136                  | 1492
8/7/2017 | 14:00 - 15:45 | Testing: total segments: 18, total steps: 1155, diff count: 62 (5.36%), error count: 9 (0.78%); Right: 92, SLRight: 163, SLLeft: 156, Left: 97, Forward: 560, Stop: 87 | 940

Video recordings of the experiment can be found in our YouTube video list: https://www.youtube.com/playlist?list=PLVs0MJh9CrcFZqYxCJ9pM6zRQU90kCxj9.

A.1 Proof of Lemma 1

Proof. The proof is broken down into two steps. Step 1 demonstrates that the firing pattern in each zone is a binary ML estimator of the current input, conditioned on N(t). Step 2 shows that after learning, the weights are incrementally updated according to ML estimation.

Step 1, case 1 (a new neuron enters the learning stage at time t + 1): When a new neuron g enters the learning stage for the first time, it memorizes the input vector p(t + 1) perfectly (explained in Sec. 2.2.4). This neuron also suppresses the other neurons inside its ḡ from firing (via competition, as the neuron's firing value is perfect), thus becoming the ML estimator of the current input p(t + 1). As stated in Sec. 2.2.4, a new neuron is initialized when the current match is bad. The current match can be bad for two reasons: 1) the current input is a novel input. In this case the new neuron does not affect the learning of the previous neurons, as no active neuron has learned this input yet. The new neuron is the only neuron that has learned the current input and is thus the ML estimator of the current input.
2) the current input has already been learned by an active neuron j (the current best match). This rarely happens, as

m(t) > r_j = ⟨Σ x_i / ‖Σ x_i‖, p / ‖p‖⟩ ≥ (1/n) Σ ⟨x_i / ‖x_i‖, p / ‖p‖⟩   (1)

where p = p(t) is the current input, m(t) is the current almost perfect match, x_i are the inputs that have been learned by neuron j, and r_j is the response of neuron j. This means at least one learned input must be greatly different from p (its similarity measure much less than m(t)) and yet still have been picked up by the neuron in previous learning. We rarely observe this during learning.

Table A.2: Real-time testing results

Section | Total | Diff | Error | Route                                                          | Descriptions
1       | 50    | 4    | 0     | F->SL->F->L->SR->L->F                                          | rain stain, rocks, facing direction correction
2       | 50    | 3    | 0     | F->R->SL->R->F->SR->F->Stop                                    | rain stain, overturn, facing direction correction
3       | 59    | 1    | 0     | F->SR->F->R->SL->R->F                                          | rocks, overturn, facing direction correction
4       | 60    | 3    | 0     | F->L->SR->L->F->R->SL->R->F->Stop                              | rain stain, overturn, facing direction correction
5       | 107   | 4    | 0     | F->SR->SL->SR->R->SL->R->F->Stop                               | shadows, dirt road, protruding trees, overturn, facing direction correction
6       | 104   | 6    | 3     | F->SL->SR->SL->L->SR->L->SL->SR->F->Stop                       | shadows, protruding trees, overturn, facing direction correction, error at overturn
7       | 73    | 5    | 2     | F->R->SR->F->SL(Error)->F->Stop                                | novel obstacle, shadows, bushes, overturn, error at untrained obstacle
8       | 75    | 4    | 2     | F->SL(Error)->F->L->F->SL->R->F->Stop                          | shadows, facing direction correction, overturn
9       | 56    | 3    | 2     | F->SR->L->F->Stop                                              | facing direction correction, shadows
10      | 55    | 2    | 0     | F->SR->SL->F->SL->Stop                                         | untrained obstacle, shadows, bushes
11      | 55    | 0    | 0     | F->SL->F->SL->F->R->F->Stop                                    | facing direction correction, untrained obstacle
12      | 66    | 7    | 0     | F->SR->L->F->SL->F->Stop                                       | facing direction correction, bushes, shadows
13      | 71    | 5    | 0     | F->SR->F->SL->L->SR->L->F->R->SL->R->F->Stop                   | facing direction correction, overturn, rocks on the side of road, untrained obstacles, bushes
14      | 53    | 2    | 0     | F->SR->F->SL->F->SR->F->SL->SR->F->SL->F->SR->SL->SR->F->Stop  | winding road, constant facing direction correction
15      | 41    | 3    | 0     | F->SR->SL->F->SR->L->F->Stop                                   | winding road, bushes, facing direction correction, overturn
16      | 59    | 8    | 0     | F->SR->F->SL->SR->R->SL->F->Stop                               | shadows, uphill, facing direction correction, overturn
17      | 70    | 0    | 0     | F->SR->F->SL->F->SR->F->R->F->Stop                             | untrained obstacles, bushes, winding road
18      | 51    | 2    | 0     | F->SR->F->SR->F->SL->F                                         | untrained obstacles, bushes, winding road
Total   | 1155  | 62   | 9     | L: 97, SL: 156, F: 560, SR: 163, R: 92, Stop: 87               |

Step 1, case 2 (no new neurons are added at time t + 1): For each neuron g ∈ Y, Z, consider its inhibition zone at time t, ḡ(t). Let c(g, t) be the number of neurons in its inhibition zone ḡ(t). The normalized weights of these c neurons can be denoted as (w_1, w_2, ..., w_{c(g,t)}).
Then we can define c(g, t) Voronoi regions R_j, j = 1, 2, ..., c(g, t) in L(g, t) (the receptive field of g at time t, which is a linear subspace of X × Y × Z), where each R_j contains all p ∈ L(g, t) that are closer to w_j than to any other w_i:

R_j = {p | j = argmax_{1 ≤ i ≤ c} w_i · p},  j = 1, 2, ..., c

Given the observation p(t + 1), the conditional probability density h(p(t + 1) | L(g, t), W(g, t)) is zero if p(t + 1) falls outside the Voronoi region of neuron g:

f{p(t + 1) | L(g, t), W(g, t)} = f_i{p(t + 1) | L(g, t), W(g, t)}, if p(t + 1) ∈ R_i; 0, otherwise   (2)

where f_i{p(t + 1) | L(g), W(g)} is the probability density within R_i. Note that the distribution of f_i{p(t + 1) | L(g), W(g)} within R_i is irrelevant as long as it integrates to 1.

Given p(t + 1), the ML estimator for the binary vector y(t) (also z(t)) needs to maximize f{p(t + 1) | N(t)}, which is equivalent to finding the set of firing neurons n(t + 1):

n(t + 1) = {argmax_{g ∈ Y,Z} f{p(t + 1) | N(t)}}
         = {g | g = argmax_{g ∈ Y,Z} f{p(t + 1) | L(g, t), W(g, t)}}
         = {g | g = argmax_{j ∈ ḡ(t)} w(j) · p(t + 1)}   (3)

since finding the ML estimator in Eq. (2) is equivalent to finding the Voronoi region to which p(t + 1) belongs. This is exactly what the Y area does, supposing k = 1 for top-k competition within each neuron's competition zone.

Step 2: Here we demonstrate the statistical efficiency of the LCA learning rule for W(g), where g ∈ n(t) (meaning that g is among the firing neurons at time t). Eq. (11) in [58] shows that the candid version of LCA is an incremental estimation of the average of the inputs that triggered firing in that specific neuron:

w(g, t + 1) = ((a(g, t) − 1) / a(g, t)) w(g, t) + (1 / a(g, t)) p(t + 1)
            = (1 / a(g, t)) Σ_{i=1}^{a(g,t)} p(t_i)   (4)

where w(g, t) is the weight vector w(g) at time t, a(g, t) is the age of neuron g at time t, and t_i, i = 1, 2, ..., a(g, t) are the times at which neuron g won the dynamic top-k competition and fired.

Following the proof of optimality of LCA [58], statistical estimation theory reveals that for many distributions (e.g., Gaussian and exponential distributions) the sample mean is the most efficient estimator of the population mean. This follows directly from Th. 4.1, pp. 429-430 in [29], which states that under some regularity conditions satisfied by many distributions (such as Gaussian and exponential distributions), the maximum likelihood estimator (MLE) θ̂ of the parameter vector θ is asymptotically efficient, in the sense that its asymptotic covariance matrix reaches the Cramer-Rao information bound (the lower bound) for all unbiased estimators, via convergence in probability to a normal distribution:

√n (θ̂ − θ) →p N{0, I(θ)^{−1}}   (5)

in which the Fisher information matrix I(θ) is the covariance matrix of the score vector:

{∂f(x, θ)/∂θ_1, ..., ∂f(x, θ)/∂θ_k}   (6)

where f(x, θ) is the probability density of the random vector x if the true parameter value is θ. The matrix I(θ)^{−1} is called the information bound since, under some regularity constraints, any unbiased estimator θ̃ of the parameter vector θ satisfies cov(θ̃ − θ) ≥ I(θ)^{−1}/n (see, e.g., [29], p. 428, or [57], pp. 203-204). Thus, as Weng et al. showed in [58], the LCA algorithm learns the optimal weights in the sense of maximum likelihood estimation.

Combining Step 1 and Step 2, we have:

θ(t + 1) = argmax_{θ(t)} f{p(t + 1) | N(t)}   (7)

Please note that Lemma 1 applies not only to Y neurons, but to Z neurons as well: Z neurons act the same as Y neurons, with a fixed range of lateral inhibition to form the different concept zones.
Lemma 1 applies to any neuron that learns with respect to its inhibition zone and updates its weights using the LCA learning rule.

A.2 Proof of Theorem 1

Proof. Intuitively, although f′{p(t + 1) | X^t_0, Z^t_0, Γ} is in a different format than f{p(t + 1) | N(t)} in Lemma 1, the two probability functions are the same, as N(t) is determined by X^t_0, Z^t_0 and Γ. As there is no random weight initialization in DN-2, two DN-2-equipped learning agents will be exactly identical given the same hyper-parameters Γ and the same learning experience X^t_0, Z^t_0. To prove this formally, we recursively use the conclusion of Lemma 1. At t + 1, we can reuse Eq. (7) from Lemma 1:

θ(t + 1) = argmax_{θ(t)} f{p(t + 1) | N(t)}
         = argmax_{θ(t)} f{p(t + 1) | L(N(t − 1), x(t), z(t), Γ)}
         = argmax_{θ(t)} f^1{p(t + 1) | N(t − 1), X^t_t, Z^t_t, Γ}
         = argmax_{θ(t)} f^2{p(t + 1) | N(t − 2), X^t_{t−1}, Z^t_{t−1}, Γ}
         = ...
         = argmax_{θ(t)} f^t{p(t + 1) | N(0), X^t_0, Z^t_0, Γ}
         = argmax_{θ(t)} f′{p(t + 1) | X^t_0, Z^t_0, Γ}   (8)

where L is the learning function of the network, and f^i is the probability density function of the current input p, conditioned on the network i time steps ago and the learning experience from t − i to t.

BIBLIOGRAPHY

[1] Artificial intelligence machine learning contest 2016. http://www.brain-mind-institute.org/AIMLcontest/index-2016.html. Accessed: 2016-10-02.

[2] Arthur N Applebee and Judith A Langer. Instructional scaffolding: Reading and writing as natural language activities. Language Arts, 60(2):168–175, 1983.

[3] David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

[4] Francisco Bonin-Font, Alberto Ortiz, and Gabriel Oliver. Visual navigation for mobile robots: A survey. Journal of Intelligent and Robotic Systems, 53(3):263–296, 2008.

[5] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on Robotics, 32(6):1309–1332, 2016.

[6] Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, and Jurgen Schmidhuber. Convolutional neural network committees for handwritten character classification. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 1135–1139. IEEE, 2011.

[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[8] Gamini Dissanayake, Shoudong Huang, Zhan Wang, and Ravindra Ranasinghe. A review of recent developments in simultaneous localization and mapping. In Industrial and Information Systems (ICIIS), 2011 6th IEEE International Conference on, pages 477–482. IEEE, 2011.

[9] Guorui Feng, Guang-Bin Huang, Qingping Lin, and Robert Gay. Error minimized extreme learning machine with growth of hidden nodes and incremental learning. IEEE Transactions on Neural Networks, 20(8):1352–1357, 2009.

[10] Shai Fine, Yoram Singer, and Naftali Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1):41–62, 1998.

[11] Sharon E Fox, Pat Levitt, and Charles A Nelson III. How the timing and quality of early experiences influence the development of brain architecture. Child Development, 81(1):28–40, 2010.

[12] Bernd Fritzke. A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems, pages 625–632, 1995.
[13] Paul A Gagniuc. Markov Chains: From Theory to Implementation and Experimentation. John Wiley & Sons, 2017.

[14] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476, 2016.

[15] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

[16] Geoffrey E Hinton. Deep belief networks. Scholarpedia, 4(5):5947, 2009.

[17] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[18] John E Hopcroft, Rajeev Motwani, and Jeffrey D Ullman. Automata Theory, Languages, and Computation. International Edition, 24, 2006.

[19] Gordon Hughes. On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14(1):55–63, 1968.

[20] Finn V Jensen. An Introduction to Bayesian Networks, volume 210. UCL Press, London, 1996.

[21] Zhengping Ji, Juyang Weng, and Danil Prokhorov. Where-what network 1: "Where" and "what" assist each other through top-down connections. In Development and Learning, 2008. ICDL 2008. 7th IEEE International Conference on, pages 61–66. IEEE, 2008.

[22] Teuvo Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1–6, 1998.

[23] Dirk Kraft, Renaud Detry, Nicolas Pugeault, Emre Baseski, Frank Guerin, Justus H Piater, and Norbert Kruger. Development of object and grasping knowledge by robot exploration. IEEE Transactions on Autonomous Mental Development, 2(4):368–383, 2010.

[24] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.

[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[26] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.

[27] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

[28] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–104. IEEE, 2004.

[29] Erich Leo Lehmann. Theory of Point Estimation. New York: Wiley, 1983.

[30] Martin Llofriu, Gonzalo Tejera, M Contreras, Tatiana Pelc, Jean-Marc Fellous, and Alfredo Weitzenfeld. Goal-oriented robot navigation learning using a multi-scale space representation. Neural Networks, 72:62–74, 2015.

[31] Matthew Luciw and Juyang Weng. Where what network 3: Developmental top-down attention with multiple meaningful foregrounds. In International Joint Conference on Neural Networks, pages 4233–4240, 2010.

[32] Matthew Luciw and Juyang Weng. Where-what network-4: The effect of multiple internal areas. In Development and Learning (ICDL), 2010 IEEE 9th International Conference on, pages 311–316. IEEE, 2010.

[33] John C Martin. Introduction to Languages and the Theory of Computation, volume 4. McGraw-Hill, NY, 1991.
[34] Danielle S McNamara and Walter Kintsch. Learning from texts: Effects of prior knowledge and text coherence. Discourse Processes, 22(3):247–288, 1996.

[35] Michael J Milford and Gordon F Wyeth. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1643–1649. IEEE, 2012.

[36] Michael J Milford, Gordon F Wyeth, and David Prasser. RatSLAM: A hippocampal model for simultaneous localization and mapping. In Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on, volume 1, pages 403–408. IEEE, 2004.

[37] Marvin L Minsky. Logical versus analogical or symbolic versus connectionist or neat versus scruffy. AI Magazine, 12(2):34, 1991.

[38] Jonathan Mugan and Benjamin Kuipers. Autonomous learning of high-level states and actions in continuous environments. IEEE Transactions on Autonomous Mental Development, 4(1):70–86, 2012.

[39] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[40] Jean Piaget. The Grasp of Consciousness (Psychology Revivals): Action and Concept in the Young Child. Psychology Press, 2015.

[41] Thomas Serre, Lior Wolf, and Tomaso Poggio. Object recognition with features inspired by visual cortex. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 994–1000. IEEE, 2005.

[42] Hava T Siegelmann and Eduardo D Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80, 1991.

[43] Hava T Siegelmann and Eduardo D Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995.

[44] Xiaoying Song, Wenqiang Zhang, and Juyang Weng. Where-what network 5: Dealing with scales for objects in complex backgrounds. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 2795–2802. IEEE, 2011.

[45] Jason Stanley and John W Krakauer. Motor skill depends on knowledge of facts. Frontiers in Human Neuroscience, 7, 2013.

[46] Joan Stiles and Terry L Jernigan. The basics of brain development. Neuropsychology Review, 20(4):327–348, 2010.

[47] Huajin Tang, Weiwei Huang, Aditya Narayanamoorthy, and Rui Yan. Cognitive memory and mapping in a brain-like system for robotic navigation. Neural Networks, 87:27–37, 2017.

[48] Jun Tani and Naohiro Fukumura. Learning goal-directed sensory-based navigation of a mobile robot. Neural Networks, 7(3):553–563, 1994.

[49] Alan Mathison Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(1):230–265, 1937.

[50] David C Van Essen and John HR Maunsell. Hierarchical organization and functional streams in the visual cortex. Trends in Neurosciences, 6:370–375, 1983.

[51] Horatiu Voicu. Hierarchical cognitive maps. Neural Networks, 16(5-6):569–576, 2003.

[52] Lev Semenovich Vygotsky. Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, 1980.

[53] Nikita Wagle and Juyang Weng. Developing dually optimal LCA features in sensory and action spaces for classification. In Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on, pages 1–8. IEEE, 2012.
[54] Yuekai Wang, Xiaofeng Wu, and Juyang Weng. Brain-like learning directly from dynamic cluttered natural video. In Proc. International Conference on Brain-Mind, pages 1–8.

[55] J Weng. Why have we passed "neural networks do not abstract well"? Natural Intelligence: the INNS Magazine, 1(1):13–22, 2011.

[56] Juyang Weng. Brain as an emergent finite automaton: A theory and three theorems. International Journal of Intelligence Science, 5(02):112, 2015.

[57] Juyang Weng, Thomas S Huang, and Narendra Ahuja. Motion and Structure from Image Sequences. Springer Science & Business Media, 1993.

[58] Juyang Weng and Matthew Luciw. Dually optimal neuronal layers: Lobe component analysis. IEEE Transactions on Autonomous Mental Development, 1(1):68–85, 2009.

[59] Xiaofeng Wu, Qian Guo, and Juyang Weng. Skull-closed autonomous development: WWN-7 dealing with scales. In Proc. International Conference on Brain-Mind, pages 1–8. BMI Press, East Lansing, Michigan, 2013.

[60] Zejia Zheng and Juyang Weng. Mobile device based outdoor navigation with on-line learning neural network: A comparison with convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 11–18, 2016.