ONLINE DEVELOPMENT OF COGNITIVE BEHAVIORS BY A ROBOT: A CASE STUDY USING AUDITORY AND VISUAL SENSING

By

Yilu Zhang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2002

ABSTRACT

ONLINE DEVELOPMENT OF COGNITIVE BEHAVIORS BY A ROBOT: A CASE STUDY USING AUDITORY AND VISUAL SENSING

By Yilu Zhang

Audition and vision are two major sensory modalities through which humans sense and understand the world. Although significant progress has been made in automatic speech recognition and visual object recognition, these fields still face tremendous difficulties. Motivated by the autonomous development process of humans, we are interested in building a robot that automatically develops its auditory and visual cognitive and behavioral skills through real-time interactions with the environment, which typically involves humans. We call such a robot a developmental robot. To resolve the technical challenges for a developmental robot, three basic techniques have been developed and implemented: (1) a fast incremental principal component analysis algorithm, the complementary candid incremental principal component analysis (CCIPCA) algorithm; (2) a customized version of hierarchical discriminant regression (HDR) for long temporal contexts; (3) a developmental architecture and algorithm that integrate multimodal sensing, action-imposed learning, reinforcement learning, and communicative learning. Based upon these three basic techniques, we have designed and implemented a prototype robot that learns cognitive behaviors from simple to complex: (1) Grounded speech learning. The system develops its audition-driven behaviors through physical interactions with the environment. This is the first attempt we know of that realizes developmental auditory learning from unsegmented and unlabeled speech streams without using a priori knowledge such as handcrafted word models or task-specific heuristics. This emulates a mode in which human children learn. (2) Task transfer. The system applies what has been learned in one context to new contexts (tasks) through verbal instructions from trainers. This is the first system of this kind in an autonomous mental development (AMD) mode. (3) Semantics learning.
The system acquires simple semantics through real-time multimodal inputs (vision and audition). This is the first robot with both developmental auditory and developmental visual capabilities to do online learning. It takes advantage of the spatiotemporal continuity of the physical world during learning. Among the above three learning types, the first uses only the reflexive capability of the proposed architecture, the second uses an additional priming capability, and the third uses an additional multimodal integration capability. The work reported in this thesis serves as a starting point of the on-going research on a highly integrated developmental robot.

To my parents.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my thesis advisor, Professor Juyang Weng, who has been the guiding force of my professional development. This dissertation would not have been completed without his insight, inspiration and encouragement.

During my four years at Michigan State University, I had the fortune of benefiting from several respected scholars. Anil Jain provided me a solid pattern recognition background and was always helpful and supportive of my research. John Deller led me into the speech recognition field and constantly cared about my career; I was especially impressed by his rigorous scholarship. Lijian Yang, with his insight and knowledge, helped me provide rigorous proofs for theoretical statistical problems. Sridhar Mahadevan introduced me to the broad spectrum of machine learning areas. George Stockman has the talent of turning research into a joyful experience, which made discussions with him always fun. I am especially grateful to Professors John Deller, Anil Jain, and Lijian Yang for serving on my committee, sharing ideas with me, and providing critical comments on my thesis.

A special acknowledgment goes to Dr. Jialin Zhong at Bell Labs, Lucent Technologies, where I spent a summer as a research intern. A broad exposure to various research problems made my internship a valuable experience.

I have benefited a lot from many of my colleagues. The early version of the IHDR program provided by Wey-Shiuan Hwang was the starting point of the work reported in this thesis. He and Colin Evans, another early SAIL group member, gave me a jump-start on my research at Michigan State University. The discussions and cooperation with them were always interesting and helpful. Yong-Beom Lee wrote the early cepstrum analysis code and taught me my first class on speech recognition. During the development of the SAIL project, I enjoyed many open discussions with my labmates, Micky Badgero, Amy Bardenhagen, Greg Bloy, Shaoyun Chen, Raja Ganjikunta, Jianda Han, Xiao Huang, Ameet Joshi, Zain Kazim, Mahesh Krishnaswamy, Jason Sperber, Georgios Theocharous, Jingliang Wang, Changjiang Yang, Shuqing Zeng, and Nan Zhang. We shared friendship as well as time together.

I wish to acknowledge some current and former PRIPpies: Paul Albee, Vera Bakic, Scott Connell, Nicolae Duta, Dan Gutchess, Rein-Lien Hsu, Xiaoguang Lu, Silviu Minut, Anoop Namboodiri, Salil Prabhakar, Arun Ross, Umut Uludag, Aditya Vailaya, and Jianguo Wang. The conversations with them and the seminars they gave offered me a broad view of related research fields. Those wonderful group activities also made my Ph.D. study unforgettable.

It goes without saying that I owe everything to my dear parents, Shouzhen Wang and Yunruo Zhang.
Their endless love, care, patience, and tremendous support accompanied me through all my life, especially the last four difficult years. Now that I am finally graduating, I wish to turn over the diploma to them. They have earned it at least as much as I have.

This research was supported in part by the National Science Foundation under grant No. IIS 9815191, DARPA ETO under contract No. DAAN02-98-C-4025, and DARPA ITO under grant No. DABT63-99-1-0014.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

1 Introduction
1.1 Methodology limitations
1.1.1 Separation of natural language understanding and speech recognition
1.1.2 Symbolic representation
1.1.3 Current training methodology
1.2 Research motivation and technical challenges
1.3 Thesis outline

2 Background
2.1 Existing speech recognition methods
2.1.1 Dynamic time warping
2.1.2 Artificial neural networks
2.1.3 Hidden Markov model
2.1.4 ANN-HMM hybrid methods
2.2 Brief survey on cognitive development
2.2.1 Nativist perspective
2.2.2 Environmental-learning perspective
2.2.3 Constructivist perspective
2.2.4 Cultural-context perspective
2.2.5 Lack of computational study
2.2.6 Summary
2.3 Developmental robots
2.3.1 Motivation
2.3.2 A new direction in AI: autonomous mental development (AMD)
2.3.3 Eight requirements of AMD
2.3.4 Early investigations

3 A Developmental System
3.1 An agent model
3.2 System architecture
3.3 Incremental hierarchical discriminant regression (IHDR) in cognitive mapping
3.3.1 Unified analysis of classification and regression
3.3.2 Hierarchical double clustering
3.3.3 Incremental hierarchical discriminant regression
3.4 Learning strategies
3.4.1 Supervised learning
3.4.2 Reinforcement learning
3.4.3 Unified learning strategy
3.5 The SAIL robot

4 Incremental PCA in Sensory Mapping
4.1 Introduction
4.2 Derivation of the algorithm
4.2.1 First eigenvector
4.2.2 Intuitive explanation
4.2.3 Higher-order eigenvectors
4.2.4 Equal eigenvalues
4.2.5 Algorithm summary
4.3 Empirical results on convergence
4.3.1 Experiments on speech data set
4.3.2 Experiments on FERET face data set
4.4 Conclusions

5 Grounded Speech Learning
5.1 Introduction
5.2 Implementation details
5.2.1 IHDR
5.2.2 Auditory sensation
5.2.3 Behavior generation
5.2.4 Learning procedure
5.2.5 Communicative learning
5.3 Experiments on audition-driven behavior development with reinforcement learning
5.4 Experiments on selective attention learning
5.5 Experiments on the SAIL robot
5.5.1 Preliminary experiment on number data
5.5.2 Real robot experiments
5.5.3 Speed issues
5.6 Experiments on communicative learning
5.7 Conclusions

6 Task Transfer
6.1 Introduction
6.2 Problem description
6.2.1 Principles of autonomous learning
6.2.2 Numerical representations
6.2.3 Missing context
6.3 Single-level architecture
6.3.1 Handling the missing context
6.3.2 The value system
6.3.3 Mechanism for cross-modality attention
6.3.4 Algorithm
6.4 Multilevel architecture
6.4.1 Algorithm
6.5 Experiments on AudioDeveloper
6.6 Experiments on the SAIL robot
6.6.1 Behavior extinction
6.6.2 Task transfer
6.6.3 Execution time and memory size
6.7 Conclusions

7 Semantics Learning through Multiple Modalities
7.1 Introduction
7.2 Problem description
7.3 Architecture and algorithm
7.4 Experimental results
7.5 Conclusions

8 Conclusions and Future Work
8.1 Contributions
8.2 Future directions

APPENDICES

A Convergence Analysis of CCIPCA
A.1 Relation to a differential equation
A.2 Proof that v1(n) → ±λ1e1
A.2.1 ||v1|| is bounded
A.2.2 aj/a1 → 0
A.2.3 a1 → ±λ1
A.2.4 Summary
A.3 Proof that vi(n) → ±λiei by induction
A.4 Conclusions

LIST OF FIGURES

2.1 Structure of a typical automatic speech recognition system.
2.2 A typical HMM topology for phonemes.
2.3 Kitten carousel.
3.1 A temporal illustration of the agent model. The last context includes information from sensors and the status of the agent's own actions. Depending on both the last context and the last state, the mapping f returns a new state, associated with a list of actions.
The action with the highest primed value is selected, which contains the effector control signals.
3.2 Observation-driven Markov decision process with dynamically generated states, context-driven state representation, and value-based action generation. Note: the state is not symbolic; it is a vector in a high-dimensional space.
3.3 Basic architecture of the SAIL robot.
3.4 Detailed architecture of the sensory mapping module.
3.5 The adding-on cognitive mapping modules: (a) the basic module; (b) the enhanced module; (c) the two-level architecture; (d) the semantics learning system architecture.
3.6 The adding-on cognitive mapping modules (cont'd): the semantics learning system architecture.
3.7 Double clustering in both input and output spaces.
3.8 Hierarchical discriminant analysis.
3.9 Flowchart for unified learning: the system learns while performing.
3.10 The SAIL robot at Michigan State University.
3.11 The SAIL robot system diagram: left side view.
3.12 The SAIL robot system diagram: right side view.
3.13 The SAIL robot system diagram: hardware connections.
3.14 The SAIL robot: the Eshed robot arm.
4.1 Intuitive explanation of CCIPCA.
4.2 CCIPCA algorithm: compute the first k dominant eigenvectors, v1(n), v2(n), ..., vk(n), directly from u(n), where n = 1, 2, ....
4.3 Number utterance data set: the correctness, or the coherence, represented by dot products, of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) CCIPCA, respectively.
4.4 Number utterance data set: the correctness of the eigenvalue (ratio of the estimated to the true eigenvalue) computed by CCIPCA.
4.5 Number utterance data set: the correctness, or the coherence, represented by dot products, of the first 10 eigenvectors computed by SGA with a finely tuned learning rate.
4.6 FERET data set: the correctness, or the coherence, represented by dot products, of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) CCIPCA with the amnesic parameter l = 2.
4.7 FERET data set: the correctness of the eigenvalue (ratio of the estimated to the true eigenvalue) computed by CCIPCA.
4.8 FERET data set: the effect of the amnesic parameter. The correctness of the first 10 eigenvectors computed by CCIPCA with the amnesic parameter l = 0; a comparison with Fig. 4.6 (c).
4.9 FERET data set: the correctness of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) CCIPCA (with the amnesic parameter l = 2), respectively, over 20 epochs (a long data stream).
4.10 FERET data set: the first 10 eigenfaces obtained by (a) batch PCA and (b) CCIPCA (with amnesic parameter l = 2), shown as images.
5.1 The sample distributions in the input and output space should have the same topology.
5.2 The processing of the auditory sensation before it enters the cognitive mapping module.
5.3 The GUI of AudioDeveloper: (a) during online learning; (b) after online learning.
5.4 A real-time system: correct action rate vs. training time.
5.5 Simulation: correct action rate vs. epochs.
5.6 Simulation: smoothed Q values of the internal actions of one of the states vs.
the number of times it was updated.
5.7 Selective attention: overall performance vs. training epochs.
5.8 Part of the IHDR tree after training using the number data set. Shown in each image is the x-cluster mean.
5.9 Engineering Building 2nd floor at MSU.
5.10 The SAIL robot: the execution time in each loop (one-trainer case): (a) MFCC calculation; (b) IHDR tree construction/retrieval; (c) total execution time.
5.11 The SAIL robot: the shape of the IHDR tree in the one-trainer case. The horizontal axis specifies the depth of the tree and the vertical axis specifies the number of nodes.
5.12 The SAIL robot running autonomously in the hallway of the Engineering Building of Michigan State University.
5.13 Examples of the image sequence seen by the SAIL robot.
6.1 (I) The SAIL robot, house-built at Michigan State University. (II) Behaviors represented as gripper tip trajectories of the SAIL robot. (a)-(d): individual behaviors as petal drawings, each of which starts from the black dot. (e)-(g): drawings consisting of more than one petal, as behaviors developed through multiple task transfers.
6.2 A new view of task transfer and classical conditioning. Two vertically aligned circles constitute a context at a particular time. Solid arrows indicate previously established associations, and dashed arrows denote newly established associations through priming.
6.3 The cognitive mapping module.
6.4 The enhanced cognitive mapping.
6.5 The behavior of prediction model (6.4) with different γ, l, and PUQ sizes.
6.6 Handling the missing context.
6.7 The comparison of a system with and without cross-modality attention. This figure is presented in color.
6.8 A two-level architecture of SAIL.
6.9 Internal mechanism of the two-level architecture.
6.10 The GUI of AudioDeveloper: (a) during online learning; (b) after online learning.
6.11 A fraction of the training session of step (2).
6.12 A closer look at the training session of step (2).
6.13 A fraction of the test session of step (2).
6.14 A closer look at the test session of step (2).
6.15 The behavior changes of SAIL with the two-level architecture.
6.16 The behavior changes of SAIL with the one-level architecture.
6.17 The SAIL robot learned the new task after being verbally instructed by human trainers.
6.18 The average execution time of the two-level system in each execution step is lower than 18.1 ms, the required interval of each speech frame.
6.19 The node distribution and primitive prototype distribution in the trees: (a) R-tree of the first-level LBE; (b) P-tree of the first-level LBE; (c) P-tree of the second-level LBE.
7.1 The semantics learning system architecture.
7.2 The objects used in the experiment.
7.3 Part of a sample image sequence.
7.4 The alignment of image and speech data (a) during training and (b) during testing.
7.5 The two correct answer rates of SAIL vs.
the question positions in each image sequence.
7.6 The node distribution and primitive prototype distribution in the trees: (a) P-tree of A-LBE; (b) P-tree of V-LBE; (c) R-tree of H-LBE.
7.7 A sample sequence of the primed visual sensation.
7.8 The average execution time of the semantics learning system at each time step is much shorter than 18.1 ms, the required interval of each speech frame.
7.9 The node distribution and primitive prototype distribution in the trees: (a) P-tree of A-LBE; (b) P-tree of V-LBE; (c) R-tree of H-LBE.

LIST OF TABLES

4.1 The average execution time.
5.1 Simulation: confusion table for test.
5.2 Selective attention: correct rate (C.R.) for using different sensory mapping layers.
5.3 Frequency of the attention selection behavior per external behavior.
5.4 Results on number recognition.
5.5 Performance of the SAIL robot in the one-trainer case.
5.6 Performance of the SAIL robot when following the 12th trainer's command in the multi-trainer case.
5.7 Performance of the SAIL robot on off-line test data in the multi-trainer case.
5.8 The SAIL robot: the average execution time in each loop (one-trainer case).
6.1 Performance of AudioDeveloper trained by multiple users. C.R.1 and C.R.2 represent the correct rate for action A31 and the new task, respectively.
6.2 Performance of SAIL doing task transfer.
7.1 SAIL's responses on question 1 ("Name?") when the questions were aligned with image frame No. 250.
7.2 SAIL's responses on question 2 ("Size?") when the questions were aligned with image frame No. 250. The italic numbers represent the correct rates.
7.3 The correct answer rate 2 (majority correct rate) of SAIL when the questions were aligned with image frame No. 250.
8.1 Summary of contributions on developmental robot architecture.

Chapter 1

Introduction

Speech is a convenient communication tool for human beings. Children start to acquire skills related to speech when they are very young. To most of us, this learning process is so smooth that we do not realize how complex a phenomenon speech perception actually is. However, building a spoken language interface to computers still faces many difficulties after over five decades of research.

The speech-related scientific fields include automatic speech recognition (ASR), natural language understanding (NLU), computational linguistics, and speech synthesis, which together are called speech and language processing [41]. Research on speech and language processing has been conducted in a broad range of disciplines, from linguistics, psychology, and electrical engineering to computer science, where plenty of related questions are still waiting for answers.

Consider ASR, for example. ASR is one of the few perception-related fields that have demonstrated commercial products. In the past ten years, the speech recognition community has moved from small-vocabulary, isolated-word, speaker-dependent recognition problems towards the problems of very-large-vocabulary, speaker-independent, continuous speech recognition [64] [91] [39] [41]. Commercial speech recognition products started to appear in the market in the mid 90s.
Two of the well-known products are ViaVoice from IBM and Dragon NaturallySpeaking from Dragon Systems¹. Despite this progress and the increasing customer base, some insightful speech recognition researchers have predicted a plateau in future development [70]. With the state-of-the-art methodology, it has become very difficult to improve the capability of handling environmental, context, and speaker variations in continuous, large-vocabulary recognition. The technical challenges that face this community include robustness (insensitivity to environment change), portability (rapid design, development and deployment for new applications), adaptation (adaptation to changing conditions, e.g., new speakers), spontaneous speech (speech with false starts, hesitation, ungrammatical constructions), and prosody (stress, intonation, and rhythm that convey user intentions) [112].

¹The system was originally developed by Dragon Systems and was once acquired by Lernout & Hauspie. Now it belongs to ScanSoft Inc.

Some major problems related to ASR can be illustrated by an example of using the dictation system of Dragon NaturallySpeaking version 5. After following the instructions step-by-step to train the system, a speaker read the following text:

"Likewise proposals have been made to remedy some of the weak points of the symbolic approach, by introducing fuzzy versions of classic calculi, or importing non-monotonic techniques for dealing with incomplete information."

The recognized text was:

"Likewise proposals have been made to remedy some of the weak points of the symbolic approach, by introducing fuzzy versions of classic calculi, all it in parking noun monotonic techniques for dealing with incomplete information."

While showing a quite amazing recognition rate, Dragon NaturallySpeaking made very interesting mistakes on "or importing non-monotonic" vs. "all it in parking noun monotonic." These are mistakes related to missing contextual information, which humans are very unlikely to make. So our question is: why do current speech recognition systems have a harder time incorporating contextual information than human beings do?

1.1 Methodology limitations

In this section, we discuss some of the problems associated with the basic methodology of current speech and language processing techniques, which have posed fundamental limitations to its further progress.

1.1.1 Separation of natural language understanding and speech recognition

Natural language understanding (NLU) and speech recognition (SR) are two major technologies in the field of speech and language processing. There is much evidence
NLU research was originally motivated by a desire to understand cognitive processes. The theories have been mainly developed in linguistics and psychology with symbolic approaches. Practical applications have been less important than increasing intuitions about the underly- ing cognitive process. On the other hand, ASR research focuses on the Signal-level attributes of Speech. To convert the speech from the acoustic form to the string form, ASR research typically does not go into the semantics of the conveyed information. While the cooperation between NLU and ASR is desirable for both communities, the initial differences between them in motivations, interests, techniques, tools, and criteria for success made the integration very difficult [69]. 4 1 . 1 . 2 Symbolic representation NLU research typically uses symbolic representations. For example, in semantic anal- ysis, to Show the meaning of the sentence I have a car, one Of the well-understood representations is the first order predicate calculus, 33:, yHaving(:r) /\ Haver(Speaker, :11) /\ HadThz'ng(y, :13) /\ Car(y), where the terms like “Speaker” are symbolic tokens. ASR research is mainly con- ducted in electrical engineering domain, where numerical representation is more pop- ular. However, symbolic representations such as phonemes and words are also used, especially when compatibility with NLU research is needed in order to use some techniques from NLU, e.g., language models. There are two major problems with symbolic representations. First, a symbol is indivisible. Once a symbol is defined, its representation limit is decided. Unless we define another symbol, no smaller unit of information may be represented. AS we never know what Size of an information unit iS appropriate in the real world, the dilemma is that a too small unit is inefficient while a too large unit lacks discriminat- ing power. Second, it is difficult for machines to generate new symbols. Once a system is built, its vocabulary of symbols is typically fixed. There have been discussions on dynamically creating new symbols according to some predetermined rules. Unfortu- nately, these efforts usually result in one of the following two Situations: if the rules to create new symbols are too simple, a human designer would rather create symbols manually instead; if the rules are too complicated, a human designer could not afford C11 to program them. So, it is highly desirable to avoid symbolic representations for the sake of high representation power and system generalization. 1.1.3 Current training methodology Speech is a communication tool. TO successfully exchange information, the parties involved in communication use a lot of contextual information, which is either em- bedded in the Speech environment or has been acquired from previous sensory experi- ences. We call the procedure of learning directly from sensory experiences, grounded learning. Current methodology used to train a speech processing system does not incorporate grounded learning. Building acoustic or language models starts from collecting a large amount of Speech or text data. The speech data are then manually segmented and translated into strings of symbols representing the corresponding linguistic units. This translation procedure is called transcription, which is very labor—intensive and requires Special expertise. A transcribed speech data set is called a speech corpus and is used to train the models. 
The problem here is that after the above data collection and transcription procedure, a large amount of contextual information contained in the speech environment is discarded. A system trained using a Speech corpus, therefore, fails to use any environmental cues to boost its recognition rate. This is a fundamental problem, especially for a dialog system. Another problem related to this methodology is that the testing data is not guar- anteed to “resemble” the training data. The issue of maintaining high performance 6 in various environments is called robustness. One way to achieve robust high perfor- mance is to use a huge training data set in an attempt to cover all possible acoustic variabilities. There is an axiom Of Speech research saying “there are no data like more data” [49]. However, a corpus generation procedure is usually very expensive, which makes it difficult to extend an application system. A more serious problem here is that statistical models trained based on data collected in a wide range of conditions will be so diffused and they will not be able to do a good job in any Specific condition. No need to say, there are always new testing data unseen during training. A lot of efforts have been made to resolve the robustness issue in ASR. Currently, the most promising approach is adaptation, especially online incremental adaptation, as discussed in a nice review by CH. Lee [50]. Online adaptation has an issue related to data supervision. Supervised adaptation, with transcribed data, usually gives better results than those of unsupervised adaptation. However, it is more realistic to do unsupervised adaptation in practice, which does not need data transcription but requires some form of goodness of fit to verify data. While various methods have been proposed for unsupervised adaptation, they usually involve task-Specific heuristic. For example, a telephone number recognition system has knowledge of non-existing area codes. So when it discovers that a recognized digit sequence spoken by a new customer does not exist, the system can modify the models with the criterion Of reducing the chance to generate this sequence. The task-Specific heuristic is essentially a type of contextual information, which brings us back to the grounded-learning issue we raised in the beginning of this section. 1.2 Research motivation and technical challenges Human beings have powerful speech recognition and understanding capabilities. These capabilities are acquired gradually along with the acquisition of other complex mental capabilities, such as vision. This procedure is characterized by the following features: 0 An autonomous procedure. Although human babies are usually cared for by their parents, their mental functionality development is an internal procedure. Nobody can Open their brains and manipulate the information for them. In this sense, it is an autonomous procedure. 0 Raw sensory experiences. Like any other biological systems, human beings access the external world through their sensory organs. The sensory experiences form the grounding to acquire knowledge. 0 Active interactions with the environment. Human babies actively participate in the interaction with the environment by forming their own internal reflection of the environment and acting according to it [66]. l\-’lotivated by the developmental process of humans, we are interested in building a robot that develops its cognitive and behavioral skills through online, real-time inter- actions with the environment. 
We call such a robot a developmental robot². This line of work is based on the following considerations: (1) Studies on human and animal cognitive development have shown that interactions between a higher animal and its environment are essential for perceptual and cognitive development. (2) To successfully communicate through speech, the parties involved should share certain common knowledge, which is usually embedded in the environment and is grounded upon the parties' sensory experiences. (3) The capability of autonomous online learning frees human engineers from labor-intensive data collection and transcription procedures in developing a traditional speech recognition system, and it may potentially enable robot systems to reach a performance level that is not practical with "spoon-fed-data" learning.

²A more detailed discussion on human development procedures and developmental robots is presented in Chapter 2.

The autonomous cognitive development of a robot potentially makes the realization of machine perception easier and allows a machine to reach higher performance than the traditional methodology [24] [23]. Studies in developmental psychology imply that the new developmental approach also has the potential to address many current difficulties in computer vision and artificial intelligence in general, in addition to speech recognition.

However, this new capability also raises a series of new challenges that a traditional speech learning system does not have to deal with. Due to these technical challenges in realizing autonomous developmental learning directly from continuous sensory streams, our current goal is not to demonstrate performance as good as that reached by manual development in constrained domains. Instead, we concentrate on architecture and self-organization schemes that realize developmental learning. In the work presented here, we address the following challenges.

Automatic generation of internal representation. Internal representation includes, among others, features extracted from perceived signals, subspaces represented by clusters in sensory inputs, inter-connections between nodes, and states consisting of short-history information. For a traditional learning robot, the representation is typically designed by a human programmer for a particular task. The programmer takes advantage of knowing the task and the environment by programming various constraints into the robot program. For the developmental robot we are interested in, no task-specific information is available until the system has been built, programming has been finished, and the robot has entered the user stage. During the user stage, users will be the trainers and any internal-data-level intervention by human programmers is no longer possible. Interactions between the robot and its environment, including the users, are only through the robot's sensors and effectors. Therefore, a developmental robot must collect data and generate internal representations automatically.

Unsegmented sensory stream. The environment is sensed by a robot as continuous data streams, e.g., the auditory stream from a microphone, visual streams from cameras, and tactile streams from touch sensors. To associate sensory inputs with robot actions, an important issue is deciding the appropriate contexts. For example, meaningful speech units (phonemes, words, phrases, and sentences) have varying temporal lengths.
Without relying on a human designer to segment or transcribe sensory inputs, a robot must automatically associate the appropriate contexts with the desired actions. The temporal closeness between the sensory inputs and the actions is of essential importance for sensorimotor association. In addition, association establishment is affected by the feedback the robot receives from the environment, such as rewards supplied by human trainers (encouraging or discouraging signals) or by the nonhuman environment (collisions). However, a reward is typically delayed and is often inconsistent in its delivery time.

Learning internal behaviors. Autonomous learning is not effective without developing internal behaviors. By internal behaviors, we mean the perception-invoked actions that are applied to internal effectors, such as attention effectors, action-release effectors, etc. For example, closely related to the issue of the unsegmented sensory stream is the need for a selective attention behavior that allows a robot to focus on the part of the input data critical in the current situation. Another example of internal behaviors is the manipulation of internal states. Since internal behaviors are not accessible from the world outside the robot, it is very challenging to enable the robot to develop desired internal behaviors through external interactions via its sensors.

One-instance learning without local minima. Online learning implies that the learning algorithm must be very adaptive while highly reliable, which are two conflicting criteria. A system must be able to learn from as few as a single instance for association, yet without getting stuck in local minima. Methods such as hidden Markov models (HMMs) typically need a preprocessing stage to estimate the initial observation probabilities and transition probabilities before the learning algorithm (e.g., the Baum-Welch algorithm) starts. This is because the Baum-Welch algorithm does not give a good solution starting from a random initial guess. However, an online learning algorithm must not give a bad solution even without a preprocessing stage.
Chapter 4 presents the third major techniques, an incremental method to com- pute the principal components for sequentially arriving observations without esti- 12 mating and maintaining the covariance matrix. The proposed complementary candid incremental principal component analysis (CCIPCA) algorithm is fast in convergence rate and low in the computational complexity, compared to existing methods. A mathematical proof on the convergence of CCIPCA is given in the appendix of this thesis. Chapter 5 presents a robot system that develops some basic audition-driven be- haviors through online real-time interactions with human trainers. We Show how the above architecture and three major teclmiques contribute to resolving some of the challenges of a developmental robot. Chapter 6 provides an improvement of the cognitive mapping module used in Chapter 5. The resulted robot system is able to do task transfer, i.e., applying what has been learned in one context to new contexts under verbal instructions by human trainers. Chapter 7 discusses a further enhancement of the cognitive mapping module, which helps a robot to do real-time multimodality learning. With the ability to handle perception of both audition and vision, a robot system conducts Simple semantics learning and is able to answer a few verbal questions about presented Objects. Chapter 8 concludes with a summary of the contributions of this thesis. Some thoughts 011 future directions are discussed. 13 Chapter 2 Background In this chapter, we will briefly review some of the areas related to building a robot that develops its audition-related cognitive and behavioral Skills through online, real-time interactions with the environment. 2.1 Existing speech recognition methods Since the work reported in this thesis iS highly related to the auditory system of de- velOpmental robots, it is helpful to have a brief survey on the existing ASR methodsl. The task of ASR is to find the most probable word sequence given a sequence of acoustic observations. Shown in Fig. 2.1 is a basic architecture of an ASR system. The elements are as follows: Raw speech. Speech is typically sampled at a high frequency, e.g., 16KHz over a microphone or 8KHz over a telephone. This yields a sequence of amplitude values 1A very good textbook by John Deller et a1. [19] has a broad and deep coverage over speech recognition and other speech Signal processing techniques. 14 acoustic models w -‘. l l _ ' ‘. [flu-[l [ I "one","two",... ‘l I l. t ' H ' ' signal acoustic . _ . . ___> analySIs analySIs raw speech word speech frames sequence Figure 2.1: Structure of a typical automatic Speech recognition system over time. Signal analysis. Raw Speech Should be initially transformed and compressed, in order to Simplify subsequent processing. Many signal analysis techniques are available which can extract useful features and compress the data by a factor of ten without losing any important information. Followings are some of the most popular Signal analysis conducted in speech recognition. 0 Linear Predictive Coding (LPC) yields coefficients of a linear system, which produces the short segment of raw speech signals when driven by certain input excitation sequence [19]. o Perceptual Linear Prediction (PLP) takes the LPC features and modifies them in ways consistent with human hearing. For example, the spectral resolution Of human hearing is worse at high frequencies, and the perceived loudness of a sound is related the cube rate of its intensity. 
Speech frames. The result of signal analysis is a sequence of speech frames, each lasting 5-20 ms and represented by tens of coefficients. These frames may be augmented by their first and/or second derivatives, providing explicit information about speech dynamics, which typically leads to improved performance.

Acoustic models. In order to analyze the speech frames for their acoustic content, a set of acoustic models of acoustic units, such as phonemes, syllables, and words, needs to be constructed during the training session. There are many kinds of acoustic models, varying in representation, granularity, context dependence, and other properties. Some of the popular models are discussed in the remainder of this section.

Acoustic analysis. Acoustic analysis applies the acoustic models to the speech frame sequence and computes likelihoods to decide the recognized words. In reality, depending on the nature of the model, the product of acoustic analysis may be other acoustic units, such as phonemes.

Word sequence. The final result is a word sequence, the sentence hypothesis for the utterance. Actually, it is common to return several such sequences. Further constraints, such as language models, are used to pick the best one.

2.1.1 Dynamic time warping

Dynamic time warping (DTW) is one of the oldest and most important algorithms in speech recognition [95] [38] [82]. It matches acoustic models with speech frame sequences by finding the optimal nonlinear alignment.

The simplest way to recognize an isolated word sample is to compare it against a number of stored word templates and determine which is the best match. This goal is complicated by a number of factors. First, different samples of a given word will have somewhat different durations. This problem can be eliminated by simply normalizing the templates and the unknown speech so that they all have an equal duration. However, another problem is that the rate of speech may not be constant throughout the word; in other words, the optimal alignment between a template and the speech sample may be nonlinear. DTW is an efficient method for finding such an alignment.

Consider a matrix D, where D(x, y) is the Euclidean distance between the x-th input speech frame and the y-th frame of the reference template. Starting from (0, 0) in D, every movement from one entry in D to an adjacent entry is called a transition. Denote C(x, y) as the cumulative score along an optimal alignment path that originates from (0, 0) and leads to (x, y). Then we have

C(x, y) = min{C(x - 1, y), C(x - 1, y - 1), C(x, y - 1)} + D(x, y).

It is now clear that DTW is an instance of the general class of algorithms known as dynamic programming. Its time and space cost grows only linearly with the duration of the speech sample and with the vocabulary size. By keeping track of backpointers, an optimal alignment path can be computed for each reference word template. The path with the lowest cumulative score is considered the best match for the unknown speech frame sequence. There are many variations on DTW algorithms. For example, it is common to impose local path constraints, e.g., weighting the transitions in various ways [82].
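The recurrence above translates directly into code. The following is a minimal Python sketch of the cumulative-score computation (frames as numpy vectors; boundary handling simplified; an illustration of the technique, not the implementation of any system cited here):

    import numpy as np

    def dtw_score(input_frames, template_frames):
        """Cumulative DTW score between two frame sequences using
        C(x, y) = min(C(x-1, y), C(x-1, y-1), C(x, y-1)) + D(x, y)."""
        X, Y = len(input_frames), len(template_frames)
        C = np.full((X + 1, Y + 1), np.inf)
        C[0, 0] = 0.0
        for x in range(1, X + 1):
            for y in range(1, Y + 1):
                d = np.linalg.norm(input_frames[x - 1] - template_frames[y - 1])
                C[x, y] = min(C[x - 1, y], C[x - 1, y - 1], C[x, y - 1]) + d
        return C[X, Y]

    # Isolated-word recognition: score the utterance against every stored
    # template and pick the word with the lowest cumulative score, e.g.:
    # best_word = min(templates, key=lambda w: dtw_score(utt, templates[w]))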
2.1.2 Artificial neural networks

The artificial neural network (ANN) was introduced in the 1960s and was reintroduced in the early 1980s as a powerful tool for pattern recognition problems. ANNs have been used by many researchers in speech recognition. Some excellent results have been achieved in such basic tasks as voiced/unvoiced discrimination [99], phoneme recognition [97], and spoken-digit recognition [26].

There are two basic approaches to speech recognition using neural networks: static and dynamic. In static approaches, the neural network sees all of the input speech at once and makes a single decision. By contrast, in dynamic approaches, the neural network sees only a small window of the speech, and this window slides over the input speech while the network makes a series of local decisions, which have to be integrated into a global decision at a later time.

Huang [33] demonstrated that the static approach can form complex decision surfaces from speech data. They applied a multi-layer perceptron with only two inputs, 50 hidden units, and 10 outputs to Peterson and Barney's collection of vowels produced by men, women and children, using the first two formants of the vowels as the input speech representation. After 50,000 iterations of training, the network produced nearly optimal decision regions, resembling the decision regions that would be drawn by hand. The classification accuracy achieved was comparable to that of more conventional algorithms, such as k-nearest neighbor and Gaussian classification.

A problem with the static scheme is that the input, if it includes context, is a fixed-time window without any possibility for alignment or time-stretching. "Time delay neural networks" (TDNNs) solve part of the problem by tying the weights from every frame of input to the hidden layer. The same can be done with a range of hidden-layer units. Sections of the network can be made to perform the same computation, which makes the network time-shift invariant [97]. This form of weight sharing and this imposition of constraints on the connectivity can be seen as the incorporation of prior knowledge into the ANN design.

Another dynamic approach that avoids the problem of fixed time windows is to introduce cycles into the network graph. This lets the net keep information about past inputs for an amount of time that is not fixed a priori, but that depends on the weights and on the input data. Variations of the BP algorithm have been developed to train this kind of recurrent ANN [76].

In addition to feed-forward or recurrent nets trained by error back-propagation, there are other ANN architectures for classification tasks. Two approaches that can be mentioned briefly are learning vector quantization (LVQ) [44], an algorithm to train a two-layer network for optimum discrimination between pattern classes, and radial basis function (RBF) networks, which are based on an idea very similar to the nearest neighbor method [56].
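For concreteness, a forward pass of a small static network of the kind used in the Huang experiment above (2 formant inputs, 50 hidden units, 10 vowel classes) can be sketched in a few lines of Python; the weights shown are random stand-ins for what back-propagation training would produce, so this is an illustrative toy rather than a trained recognizer:

    import numpy as np

    rng = np.random.default_rng(0)
    # 2 inputs (first two formants, which in practice would be normalized),
    # 50 hidden units, 10 vowel classes; untrained placeholder weights.
    W1, b1 = rng.standard_normal((50, 2)), np.zeros(50)
    W2, b2 = rng.standard_normal((10, 50)), np.zeros(10)

    def classify_vowel(formants):
        """One forward pass: sigmoid hidden layer, softmax output."""
        h = 1.0 / (1.0 + np.exp(-(W1 @ formants + b1)))
        z = W2 @ h + b2
        p = np.exp(z - z.max())
        p /= p.sum()
        return int(np.argmax(p))

    vowel_class = classify_vowel(np.array([0.7, 1.2]))  # scaled F1, F2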
Incoming Speech can be mapped as the path of best-responding cells of the SOM. Such a mapping can be used as a basis of speech recognition [93]. 2.1.3 Hidden Markov model In current Speech recognition techniques, hidden Markov model (HMM) is the most widely used method [34] [71] [57] [41]. General speaking, it models Speech with a doubly stochastic process [73] [72]. The first stochastic process models Short speech segments. Essentially, Speech is a non-stationary process. However, it is Shown that speech segments within a Short time, typically under 20mS, can be effectively modeled as the output of a linear time- invariant all-pole filter excited by appropriate sources. So we may treat these Short speech segments as stationary units and represent them by a set of filter parameters. The second stochastic process models the piecewise stationary process over a se- quence of Speech frames. Speech is produced by a physical system — the human vocal system, which does not conduct dramatic changes. So the parameters of the speech frames are usually held fairly steadily over a certain period Of time and evolves to another set of parameters gradually. We may identify these quasi-stationary periods and characterize how one such period evolves to the next. In HMM, these steady pe- 20 @—>®~®—>O W Figure 2.2: A typical HMM topology for phonemes. riods are termed as states while the Shifting procedures between periods are modeled as transition probabilities. Notice that the state in HMM iS a real entry but is not accessible explicitly. That is where the term “hidden” comes from. A typical discrete HIV’IM can be represented by a stochastic finite automaton A = (S, A, B, 7r} with a set of stationary states S, a transition probability matrix A, a emission probability matrix B and a set of initial state probabilities 7r. Usually, speech recognition systems use strictly left to right HMMS to model words, syllables, phonemes or sub-phonetic units. Often, words are modeled as a sequence of phonemes, which in turn are modeled as a sequence of HMM states. Fig. 2.2 shows the topology of a typical phoneme Hl\»’ll\/’l. There are three problems arising, when using HMMS to model Speech: (1) Eval- uation: W hat is the probability that a given HMM generated a given sequence of speech frames; (2) Decoding; Given a sequence Of Speech frames and a HMM, what is the most likely sequence of states through the HMM that lead to the generation of the speech frames; (3) Parameter estimation: Given a HMM and a set of Speech frames to be modeled by this HMM, how can we adapt the parameters (emission and transition probability distribution) of the model to maximize the likelihood of generation. 21 All of the above three problems have very efficient solutions in form of special cases of dynamic programming algorithms. For instance, the evaluation problem occurs in isolated word recognition where we want to score different word HMMS according to their likelihood. It can be solved using the Forward algorithm. The decoding problems occurs in continuous speech recognition where we are seeking the most probable path through a very large HIV'IM consisting of all possible sequences of basic sound units. Once we found this path, we can derive the most probable sequence of phonemes or words. The decoding problem can be solved using the Viterbi algorithm. 
The last problem, also called the training problem, can be solved by the Forward-Backward, or Baum-Welch, algorithm, which is essentially a version of the Expectation-Maximization (EM) algorithm. In the case of left-right HMMs with a constant small number of transitions per state, all three algorithms have a computational complexity of only O(NT), where N is the number of states in the HMM and T is the number of speech frames.
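As a concrete illustration of the decoding problem, here is a minimal log-domain Viterbi sketch in Python for a discrete HMM (notation follows the (S, A, B, π) automaton above; an illustrative toy, not a production decoder):

    import numpy as np

    def viterbi(obs, log_A, log_B, log_pi):
        """Most likely state sequence for a discrete HMM.
        log_A[i, j]: log transition prob i -> j; log_B[i, o]: log emission
        prob of symbol o in state i; log_pi[i]: log initial prob of state i."""
        N, T = log_A.shape[0], len(obs)
        delta = np.empty((T, N))       # best log score ending in each state
        back = np.zeros((T, N), int)   # backpointers
        delta[0] = log_pi + log_B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A   # N x N predecessor scores
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):                # trace backpointers
            path.append(int(back[t][path[-1]]))
        return path[::-1]

For strictly left-to-right topologies, where each state has only a constant number of allowed transitions, the inner maximization touches only a few entries per state, which is what yields the O(NT) cost noted above.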
Although there is not yet a unified theory with coherent principles to address the general problems, researchers in cognitive development tend to classify the variety of developmental theories into four perspectives, namely, the nativist, the environmental-learning, the constructivist, and the cultural-context perspectives [74] [16].

2.2.1 Nativist perspective

According to the nativist perspective, the major cause of cognitive development is maturation. Development is viewed as a largely automatic process. In other words, the basic sequence of changes that characterizes development comes from inside the organism as a consequence of the genes the organism inherits. The environment only works as a "trigger" or "fine-tuner," but does not substantially alter the development's course and final form.

There are two major categories under the nativist perspective. Some nativists believe in hard-wired knowledge, while others emphasize genetic constraints. The latter viewpoint is usually called "soft nativism." One of the famous examples of the first category comes from Noam Chomsky's theory of language. He points out that most of what a child hears in everyday language experience is highly diverse, faulty, and piecemeal, and thus he rejects the idea that the child learns grammar by imitating sentences heard. Chomsky concludes that a child's abilities in grammar must be inborn, in the form of a set of innate rules coded in the genes and biologically inherited [74]. According to soft nativism, knowledge is not innate, but the overall structure is. The structure determines the kinds of information that can be received, the kinds of problems that can be solved, and the kinds of representations that can subsequently be stored [22].

2.2.2 Environmental-learning perspective

Researchers with the environmental-learning perspective believe that the major cause of knowledge acquisition is sensory experience. One of the interesting experiments supporting this perspective was done on cats.² In the experiments reported in [31], some kittens were raised in an environment that allowed them to see only horizontal lines for several months. It was observed that they had a weaker capability to detect vertical lines than normally raised cats. Further studies showed that certain cells in a cat's brain respond best to horizontal lines,

²Although cognitive development is an area studying the human mind, experiments on animal subjects are sometimes conducted to avoid inappropriate effects on human subjects.

whereas other cells respond best to vertical lines. Both kinds of cells are present in large numbers in newborn kittens that have no visual experience. When kittens are raised in an environment with no vertical lines, they produce far fewer of the nerve cells that respond to vertical lines than normal kittens do. This shows that kittens' internal representation of the world, which is one important aspect of their knowledge, is largely influenced by environmental effects.

It is observed that biological systems have some basic learning mechanisms. The environment acts through these learning mechanisms, and subsequently shapes the system's development. The learning mechanisms that are believed to operate throughout the whole development process include nonassociative learning, classical conditioning and instrumental conditioning [20].

In nonassociative learning, a subject learns about the properties of a single stimulus by being exposed to it repeatedly.
Habituation learning is an example of nonassociative learning. In habituation learning, a subject gradually decreases the attention paid to a repeated stimulus, for example a view of a toy. On the contrary, paying attention again when some aspect of the stimulus has been changed (e.g., a new toy) is called dishabituation.

Classical conditioning is a process by which an organism learns which events in its environment go together. In the famous experiments done by Pavlov, after several experiences of hearing a tone just before food was placed in its mouth, a dog would begin to salivate in response to the tone, before it received any food. Classical conditioning is a process by which previously existing behaviors come to be elicited by new stimuli.

Instrumental conditioning is a process following a basic idea: changes in a behavior occur as a result of the positive or negative consequences the behavior produces. In other words, organisms tend to repeat behaviors that lead to rewards and give up behaviors that lead to punishment. The behaviors of animals in circuses are good examples showing the effects of instrumental conditioning.

The major impact of the environmental-learning perspective is that it emphasizes a new cause of cognitive development, the environment. While those learning mechanisms could be innate, the concepts and the behaviors are believed to be acquired.

2.2.3 Constructivist perspective

The difference between the nativist perspective and the environmental-learning perspective reflects the long-debated nature-nurture question: what controls the development process, heredity or experience? In the constructivist perspective, nature and nurture are believed to be equally necessary. Moreover, constructivist theories attribute to an agent³ a greater role in shaping its own development. In the constructivists' opinion, the mind is an agent that actively structures its observations by forming its own internal representations and creating new functional relations between its parts.

Jean Piaget, one of the most prominent constructivists, proposes that humans are born with some initial schemata, where a schema is defined as a mental structure that

³By an agent here, we mean anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors.

provides an organism with a model for actions in similar or analogous circumstances. The initial schemata are strengthened and transformed into new schemata through two processes, "assimilation" and "accommodation." Assimilation is a process by which children incorporate new experiences into their existing schemata. Accommodation is a process by which children modify their existing schemata in order to adapt to new experiences.

A classical experiment supporting the constructivist perspective was done by Richard Held and Alan Hein [29] with kittens raised from birth in total darkness. When the kittens were old enough to walk, they were placed in pairs in an apparatus called the "kitten carousel," as shown in Fig. 2.3. One kitten was harnessed to pull the carousel while the other was carried in the gondola. The kittens did this for three hours every day and lived in darkness the rest of the day. After 42 days, these kittens were lowered onto the surface of a visual cliff.⁴ It was observed that the kittens harnessed to the carousel hesitated to land on the deep side of the cliff, while the kittens in the gondola did not try to avoid it.
These two groups of kittens had roughly the same visual experiences but different interactions with the world. This experiment suggests that active interactions between the agent and the environment make a big difference in learning the interpretation of the environment. Similar phenomena were observed in human babies too [6].

⁴The visual cliff is a research apparatus with a transparent table top. It is divided into two regions: (1) a "shallow side," in which a checkerboard pattern is flush against the glass surface, and (2) a "deep side," in which the checkerboard pattern is much closer to the floor, far away from the glass surface.

Figure 2.3: Kitten carousel.

2.2.4 Cultural-context perspective

Like the constructivist perspective, the cultural-context perspective emphasizes that development involves the active engagement of the individual. The two perspectives differ in that the cultural-context perspective assumes that both the baby and the caretaker are active agents, and views development as a "co-constructive" process. In addition to allowing more room for individual variability, this idea stresses the role of an active environment in establishing and/or accelerating cognitive development.

As suggested by its name, the cultural-context perspective insists that cognitive development is shaped by the specific cultural-historical context a child lives in. The culture includes the designs for living, speaking, beliefs, values and customs. Adults, who have already inherited the culture from the last generation, influence the development of children through their behaviors.

There is a lot of evidence showing that the knowledge of the society determines many of the behaviors of newborn babies. Japanese adults have difficulty distinguishing the phonemes /r/ and /l/. However, studies show that Japanese babies actually can perceive the difference between them. This capability to make phonemic distinctions begins to diminish at about 6 to 8 months of age [16]. Another example is that children growing up among the Oksapmin of New Guinea establish a very different counting system (counting by body parts), while they seem to have the same ability to grasp basic number concepts as those growing up in Paris or Pittsburgh [58].

2.2.5 Lack of computational study

Although developmental psychology is a well-studied subject, past studies in the field have been largely qualitative in nature. This situation is probably due to the difficulties of approaching such a complex subject in exact computational terms. Not until recently did some researchers start to address cognitive development from a computational perspective [100] [22] [3]. These recent efforts model cognitive and behavioral development through computational processes, which allows little ambiguity, and the resulting models are verifiable through quantitative experiments. This direction may help to deepen our understanding of cognition, behavior and other related issues.

2.2.6 Summary

Except for the nativist perspective, the perspectives surveyed above give a lot of credit to the effect of the environment in the course of cognitive development. While we still do not know the grand truth of where intelligence comes from, the experiments done on human and animal subjects seem convincing to us that intelligence has a lot to do with interactions between an agent and its environment. This impression is the starting point of our intention of building a developmental robot, as we will see in the next section.
2.3 Developmental robots

2.3.1 Motivation

With the invention of the computer, people started to raise questions such as "Can we emulate human intelligence on a machine?" A new field appeared in the middle of the 20th century, whose name, artificial intelligence (AI), was officially coined in a two-month workshop at Dartmouth in the summer of 1956. After 50 years, AI now covers areas ranging from game playing, logical inference, theorem proving, planning, and medical diagnosis, to vision, natural language understanding, and robotics [81].

In general, AI problems can be viewed uniformly as the development of intelligent agents. While it is very easy to imagine an autonomous robot with a real body as an agent, pure software systems, such as the chess player Deep Blue, are also agents. Deep Blue perceives the coded chess configuration and acts by generating the next move. Various agents have been proposed in the last 50 years, ranging from reflex agents to fully deliberative, knowledge-based agents.

Knowledge-based systems emerged in the 1970s. Since MYCIN [11], these systems have proved to be very successful in areas such as medical diagnosis. To set up such a system, one typically needs a domain engineer, who is responsible for interviewing the experts in the related area and building the domain knowledge for a computer. Typically, the domain engineer finds out the rules followed by the experts to solve the problem, and designs data structures and algorithms so that a computer can run these rules. As these rules are usually not explicitly expressed by the experts, the domain engineer needs to be extremely smart to extract them. The procedure to establish domain knowledge requires a lot of human effort and expertise.

In contrast with the manual knowledge-feeding approach, the learning-based approach enables machines to learn. Among the earliest learning-based systems, Samuel's checker player [84] was so successful that it was able to compete on equal terms in some very strong human tournaments. Following this approach, a human engineer first needs to identify a task by specifying the input, the output and the performance evaluation procedure. He then decides on a representation (model) and figures out a learning algorithm to train the parameters of the model. Many data are typically collected to train the system before it is released.

While learning gives a system much power to establish new knowledge, there are some problems with the current way of building a learning system. First, a learning-based system is usually designed to solve a specific task. However, intelligence is not the capability to do a single task. Moreover, there are tasks that are not separable from others; e.g., language translation is closely related to cultural background in addition to the syntax of the languages. Second, in many current learning-based systems, the training process is off-line, i.e., training data need to be collected and, usually, edited before the training procedure starts. As human-collected training data will never cover all the situations in the real world, off-line learning will never be complete.

The behavior-based approach, originally proposed in the 1980s [10], tries to escape from some constraints of the knowledge-based approach by avoiding modeling the world. The design of a behavior-based system starts from designing a set of basic behaviors instead of building a complete model of the world, which may not even exist in the first place.
The system interacts with the world by manipulating and organizing these basic behaviors. In some sense, the representation of the world is expressed through the basic behaviors, which are distributed, fragmented and even allowed to be inconsistent. One of the limitations of the behavior-based approach lies in the fact that the behaviors are designed manually in advance. It seems clear by now that humans are unable to adequately describe and decompose complex behaviors. As a result, for complicated systems, such as COG, a human-shaped robot at MIT [9], the design of behaviors becomes a serious bottleneck.

From the above description, we can see that existing AI systems generally follow a manual development paradigm:

1. Given a task, the human engineer analyzes it and translates it into representations and rules that a computer program may work on.

2. The human engineer writes a program that transforms the input information into the representation and follows the rules to control the machine.

3. The human engineer runs the program on the machine.

In this paradigm, it is the responsibility of the human engineer to tackle the task. In this sense, this paradigm may be viewed as "... to solve a hard problem, you almost have to know the answer already" [81].

By comparing this tendency in intelligent agent development with the perspectives in cognitive development, we can see a surprising similarity between this major AI paradigm and the nativist viewpoint. We seem to believe that the agent should know everything it needs for "survival" before being "born," just as nativists believe that everything necessary has been stored in the genes. Agent development thus reduces to a procedure of coding and feeding knowledge to the agent by the designer. Even in the case of learning-based methods, the knowledge of how to learn and what to learn is decided by the human designer in advance according to a specific task. However, considering the other perspectives in cognitive development, we seem to be missing some of the discoveries from our biological counterparts. R. Brooks, one of the founders of behavior-based systems, has great insight into this problem. He once pointed out that intelligence is not limited to just a computational engine but also comes from the situation in the environment of which the robot is a part [8]. Knowledge does not necessarily need to be possessed by the agent at birth, just as in the kitten carousel example: a kitten is not born knowing the danger of a visual cliff. It figures it out by interacting with the environment.

2.3.2 A new direction in AI: autonomous mental development (AMD)

A developmental robot is one that learns and practices autonomously in the real physical world by interacting with the environment through sensors and effectors, possibly under human supervision. This is an important direction, as discussed in a recent article in Science [24]. A developmental robot follows a new autonomous mental development (AMD) paradigm [23]:

1. Design a body: A human designer designs a robot body (the sensors, the effectors and the computational resources) according to the general ecological conditions in which the robot will work (e.g., on land or underwater).

2. Design a developmental program: A human programmer designs a task-nonspecific developmental program for the robot.

3. Birth: A human operator turns on the robot, whose computer runs the developmental program.

4. Develop mind: Humans mentally "raise" the developmental robot by interacting with it.
The robot develops its mental skills through real-time, online interactions with the environment, which includes humans (e.g., letting the robot attend special lessons). Human trainers teach robots through verbal, gestural or written commands in much the same way as parents teach their children. New skills and concepts are autonomously learned by the robot every day.

While the developmental program is the central part of a developmental robot, the "develop mind" step is the key to this new paradigm. First, in this step, the robot generates its representations autonomously according to the tasks it encounters instead of having them decided in advance. Second, what the robot does is determined in this step by the experiences accumulated in the real world. In contrast to the traditional paradigm, the knowledge is no longer pre-programmed and, more importantly, learning happens after the system is released for final usage instead of during the process of building the system. As we mentioned above, intelligence is not the capability to solve a single task. Only with learning after "birth" may we achieve the possibility of a task-nonspecific solution.

The AMD paradigm borrows ideas from the environmental-learning, constructivist, and cultural-context perspectives. First of all, interaction with the environment is considered essential for behavioral and mental development, which is the major point shared by the three perspectives. Secondly, it is emphasized that, for a developmental robot, this interactive process has to be autonomous. After it is "born," the robot is on its own to generate the representation of the world and the behaviors to conduct, instead of letting a human designer program them for it. In other words, the robot must be actively engaged in this interaction, as stressed by the constructivist perspective. Thirdly, this paradigm involves a lot of participation by human trainers (not human designers). These human trainers act as the active environment of the robot and the source of knowledge, whose necessity is emphasized by the cultural-context perspective. For example, the robot will learn to respond to English if it is "raised" in an English-speaking environment, while it will learn Chinese in a Chinese-speaking environment.

2.3.3 Eight requirements of AMD

Practical AMD is technically very challenging. The following eight requirements are all necessary for practical AMD.

1. Environmental openness: Due to task-nonspecificity, AMD must deal with unknown and uncontrolled environments, including various human environments. Although a human child cannot do everything that a human adult can do, he is exposed to nearly the full complexity of a human environment early in his life.

2. High-dimensional sensors: The dimension of a sensor is the number of scalar values it produces per unit time. AMD must directly deal with continuous raw signals from high-dimensional sensors, e.g., vision, audition and taction. For example, learning from a video camera is more difficult than learning from a laser range finder, because the former typically contains more data per unit time and the photo-electric information is affected by more factors than the range (distance). Symbolic input is allowed as a clean (hand-processed) but simpler sensing modality.

3. Completeness in using sensory information:
Due to the environmental openness and task nonspecificity, it is not desirable for a developmental program to discard, at the program design stage, sensory information that may be useful for some future, unknown tasks. Of course, the task-specific representations autonomously derived after birth do discard information that is not useful for a particular task.

4. Online processing: At each time instant, what the machine will sense next depends on what the machine does now. Offline processing is unable to accomplish AMD.

5. Real-time speed: The sensory/memory refreshing rate must be high enough so that each physical event (e.g., motion and speech) can be temporally sampled and processed in real time (e.g., about 15 Hz for vision). This speed must be maintained even when a full (very large but finite) physical "machine brain size" is used. It must handle one-instance learning: learning from one instance of experience. Time-consuming iterations must be avoided.

6. Incremental processing: Acquired skills must be used to assist in the acquisition of new skills, as a form of "scaffolding." This requires incremental processing. Thus, batch processing is not practical for AMD. Each new observation must be used to update the current complex representation, and the raw sensory data must be discarded after it is used for updating.

7. Perform while learning: Conventional machines perform after they are built. An AMD machine must perform while it "builds" itself "mentally."

8. Muddy tasks: For large perceptual and cognitive tasks, an AMD machine must handle multimodal contexts, large long-term memory and generalization, and capabilities for increasing maturity, all without catastrophic memory loss.

2.3.4 Early investigations

Lately, more and more research on robots is motivated by mental development. Early examples of developmental robots include the SAIL (short for Self-organizing, Autonomous, Incremental Learner) robot at Michigan State University [106], the Darwin V robot at the Neurosciences Institute, San Diego [3], and the Illy robot at the University of Illinois, Urbana-Champaign [51]. However, among these early research efforts, SAIL is the only one designed to fulfill all eight requirements of AMD.

The SAIL robot was designed as an engineering testbed for developmental programs that are meant to scale up to complex cognitive and behavioral capabilities [100] [105] [37]. The SAIL-2 developmental program has been tested for autonomous derivation of architecture and representation through online, real-time development of associations (1) between visual stimuli of objects and eye aiming at the objects (object-evoked visual attention); (2) between visual stimuli of objects and arm pre-reaching for the objects (vision-evoked object reaching); (3) between voice stimuli and arm actions (verbal command learning and execution); and (4) between visual stimuli and locomotion effectors (vision-guided navigation). Ongoing investigations include temporal signal processing, more sophisticated attention mechanisms, and value system development [106].

The Darwin V robot by Almassy, Sporns, and Edelman was designed to provide a concrete example of how the computational weights of neural circuits are determined by the behavioral and environmental interactions of an autonomous device [3]. Darwin V had a group of pre-designed behaviors and sensory patterns. Its visual capability was implemented by blob-pattern and stripe-pattern detectors.
Darwin V has been tested for developing behaviors in response to visual stimuli at different positions and orientations (visual invariance learning). It has also been tested for the association of aversive and appetitive stimuli with visual stimuli (value learning). This work concentrated on the development of high-level plasticity of the association between stimuli and behaviors.

The Illy robot project by Levinson, Huang, and coworkers focuses on spoken language acquisition. The paper [51] reported how to teach the Illy robot to act according to short voice commands in different languages. This work paralleled our work on auditory learning by the SAIL robot. However, the speech acquisition system of the Illy robot had some preprogrammed modules such as "loud sound detection" and "end point detection," while the method reported here does not use such modules, in order to reach a fully developmental robot. Our method is potentially applicable to both word and non-word situations, such as natural sound and music.

Another work worth mentioning was done by Roy, Schiele and Pentland at the MIT Media Lab [79] [77]. They developed a multimodal system which automatically learns shape categories and their corresponding spoken names from an input consisting of naturally spoken utterances paired with visual representations of the objects. This work was motivated by theories of infant cognition and aimed at enabling a robot to learn semantic representations through sensorimotor experiences. However, the learning process was not done through real-time interactions between an agent and the environment. In their experiments, the visual-auditory signal pairs were prepared manually from separately recorded images and speech signals. Moreover, human designers intervened in the system's signal processing procedure. The auditory signals were first segmented at utterance boundaries using a recurrent neural network trained off-line on the TIMIT database. The visual representation of an object is a two-dimensional histogram of distances between edge points and relative angles of edges, which limits the representational power of the resulting system to no more than narrowly defined shape concepts. This kind of human involvement prevents a system from doing AMD. These limitations need to be overcome before the system can imitate infant cognitive development.

The computational mechanisms of mental development are getting attention from more and more researchers. These investigations are expected to improve our systematic understanding of the wide variety of cognitive and behavioral capabilities in humans and to enable autonomous development of these highly complex capabilities by robots and other artificial systems. A Workshop on Development and Learning (WDL), funded by NSF and DARPA, was held at Michigan State University, April 5-7, 2000. This workshop was attended by about 30 distinguished researchers in neuroscience, developmental psychology, machine learning and robotics. It resulted in a whitepaper [23] to inform the US government about this new direction. The success of WDL led to a regularly scheduled conference, the International Conference on Development and Learning (ICDL), which was first held at MIT, June 12-15, 2002 [1].

Chapter 3

A Developmental System

In this chapter, we first propose an agent model and present a basic system architecture that implements this model.
Then we discuss two of the three major techniques, incremental hierarchical discriminant regression (IHDR) and the unified learning strategy. The third major technique, complementary candid incremental PCA, will be discussed in Chapter 4. At the end of this chapter, we briefly present the hardware architecture of the validation platform, the SAIL robot.

3.1 An agent model

To handle numerical representations of the complex physical world, we propose an agent model for a robot, starting with the following three definitions.

Definition 3.1.1 The history at time t is defined as h(t) = {(x(τ), a(τ)) | τ = 0, 1, 2, ..., t}, a realization of the random process from the "birth" time t = 0, where x(τ) is the numerical sensory vector acquired at time τ, and a(τ) is the numerical control vector of the effectors at time τ.

The history includes all information available to the agent from the "birth" time, such as the auditory sensations and the status of the agent's own actions.

Definition 3.1.2 The last context at time t is the recent part of the history, l(t) = {(x(τ), a(τ)) | τ = t − b(t), ..., t − 1, t}.

The last context consists of a short segment of recent experience of temporal length b(t) + 1. Currently in our work, b(t) does not change with t, and a different b(t) is designed for different sensors and levels.

Definition 3.1.3 A state s(t) is the internal representation of the context in the agent's internal information environment.

According to the above definitions, an agent views the world as a general random process, called the history. The state gives a more compact representation than a simple concatenation of last contexts. Due to sensory uncertainty and partial observation, the state is also a random vector. The state at time instance t depends on both the context l(t) and the last state s(t − 1). Therefore, we can recursively generate the state with

s(t) = f(s(t − 1), l(t)),     (3.1)

where the function f generates states using feature derivation, temporal chunking, and vector quantization. The recursive estimation process can be modeled by a Markov decision process (MDP), where the state transition follows a conditional distribution,

P(s(t) = s | s(t − 1) = s′, l(t) = l),     (3.2)

and the representation s(t) here is a high-dimensional vector. The most probable state at time t is given by

s*(t) = arg max_{s ∈ S} P(s(t) = s | s(t − 1) = s′, l(t) = l),     (3.3)

where S is the set of all possible states, which is infinite. In a practical implementation, we approximate the set S by a finite number of prototype states. These prototype states are incrementally merged cluster centers of many experienced state vectors. In practice, the maximization in Eq. (3.3) is too computationally expensive. We use an IHDR tree¹ to approximate it, taking advantage of the numerical vector representation of s(t) and l(t).

¹See Section 3.3.

Figure 3.1: A temporal illustration of the agent model. The last context includes information from the sensors and the status of the agent's own actions. Depending on both the last context and the last state, the mapping f returns a new state, associated with a list of actions. The action with the highest primed value is selected, which contains the effector control signals.

Figure 3.2: Observation-driven Markov decision process with dynamically generated states, context-driven state representation, and value-based action generation. Note: the state is not symbolic; it is a vector in a high-dimensional space.
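To make the prototype approximation of Eq. (3.3) concrete, the following sketch replaces the maximization over the infinite set S by a nearest-prototype lookup with incremental merging. The Euclidean metric, the novelty threshold, and all names here are our own illustrative simplifications; the actual system performs this search with the IHDR tree of Section 3.3.

```python
import numpy as np

def match_state(prototypes, counts, s_raw, novelty_thresh):
    """Approximate s*(t) of Eq. (3.3) by the nearest stored prototype.

    prototypes -- list of cluster-center state vectors
    counts     -- number of samples merged into each prototype
    s_raw      -- the raw state vector produced by f at this time step
    """
    if prototypes:
        d = [np.linalg.norm(s_raw - p) for p in prototypes]
        k = int(np.argmin(d))
        if d[k] <= novelty_thresh:
            counts[k] += 1                    # merge: running cluster mean
            prototypes[k] += (s_raw - prototypes[k]) / counts[k]
            return prototypes[k]
    prototypes.append(np.asarray(s_raw, dtype=float).copy())
    counts.append(1)                          # a sufficiently novel context
    return prototypes[-1]                     # becomes a new prototype
```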
The overt behavior of the agent is determined through a value system with respect to the states. To evaluate the value of each action a at each state s, a Q-value function q(s, a) realizes a mapping q : S × A → Q, where A is the action space and Q is a scalar representing the estimated value. The innate value system is such that the agent executes the action that has the highest estimated Q-value. Fig. 3.1 gives a temporal illustration of the agent model.

The above way of modeling a decision process, as shown in Fig. 3.2, is called the observation-driven Markov model (ODMM) [17] [110]. Different from traditional MDPs and HMMs, all the states here are numerical instead of symbolic. They are generated on the fly. Further, the state here represents the agent's internal context, instead of the state of a world event as in all existing non-developmental agents.

Figure 3.3: Basic architecture of the SAIL robot.

3.2 System architecture

The architecture that implements the above agent model is shown in Fig. 3.3. Inputs from each sensor first enter a module called the sensory mapping, which consists of a bank of developed filters. As shown in Fig. 3.4, considering the spatial and temporal expanse of the sensory inputs, we treat a sensor as a window moving in a spatiotemporal space, with only the information inside the window getting through. We divide such a window into overlapped regions of different sizes, named receptive fields, each of which corresponds to an artificial neural column. These neural columns are organized in a hierarchical manner. The output from multiple lower-layer neural columns is the input to the neural columns at the next higher layer. Thus, the latter have larger receptive fields. If the multiple lower-layer neural columns spread along the spatial dimension, a larger receptive field stands for a wider field of view. If the multiple lower-layer neural columns spread along the temporal dimension, a larger receptive field stands for a longer period of context.

A neural column has two roles, developing a filter and computing the response using the filter updated so far. The role of filter development is carried out mainly by a technique called incremental principal component analysis (IPCA), which computes the principal directions, along which the data have the largest variance. The response is computed by the inner product between the input vector and the principal component vectors. The outputs from the neural columns are fed into the next module, the cognitive mapping, after being selected by control signals from the cognitive mapping. This selection is an internal behavior that enables the selective attention mechanism to handle the unsegmented sensory stream.

Figure 3.4: Detailed architecture of the sensory mapping module.

The cognitive mapping module is the central part of the system. It is responsible for learning the association between the sensory information from the environment and the behaviors that the system is supposed to produce. The behaviors can be both external and internal.
External behaviors correspond to generating control signals for external effectors, such as a joint motor of a robot arm, or any peripherals that the system is equipped with to act on the environment. Internal effectors include the above-mentioned attention selection effector in the sensory mapping module, the effector that manipulates internal states, and the threshold control effector in the gating system in Fig. 3.3.

Mathematically, we have what is called the observation-driven state transition function f,

f : S × L → S,

and the action generation function g,

g : S × L → 2^A,

where S is the (local) state space, L is the context space, A is the output action space, and 2^A denotes the power set of A (all possible subsets of A). In other words, the cognitive mapping accepts the last state s(t − 1) and the current context l(t) to generate a set of action candidates. The value system implements an action selection function v,

v : 2^A → A,

according to the value of every action in the candidate action set. Typically, v selects the candidate with the highest expected value. While the state transition function f is pre-defined in the current implementation, the action generation function g is a learned mapping. Specifically, the mapping f is done by maintaining the state as a first-in first-out queue. At every time instance, the state is updated by replacing its oldest part with l(t).

In the real world, the number of all possible contexts is infinite. However, the resources of an agent are finite, and so is the number of state prototypes in the memory. We have to resolve the conflict that S × L is an unbounded space while S is a bounded space. To do this, we use the method of IHDR [35] [104]. In some sense, an IHDR tree behaves as a vector quantizer (VQ). In effect, the IHDR tree does much more than a traditional VQ by enforcing a discriminant regression mechanism. It derives the features that are most relevant to the outputs and disregards the irrelevant ones. It uses a tree structure to find the best-matching input prototype in fast logarithmic time for real-time speed.

The simplest way of implementing the cognitive mapping uses one IHDR tree, which is discussed in Chapter 5. To enable the robot to handle more complicated perception and behavior, we have continuously augmented the cognitive mapping module (Figs. 3.5 and 3.6). Detailed discussions are given in Chapter 6 and Chapter 7. As one will find out, the enhanced modules are built one upon the other. A good property of such a spectrum of designs is that later modules keep all the functions of earlier modules. Therefore, the final system is not task-specific, which is an important requirement of a developmental robot.

The gating system is a part of the value system too. It evaluates whether an intended action has accumulated sufficient thrust to be issued as an actual action. In this way, actions are taken only when a sufficient number of outputs have been suggested by the cognitive mapping module over time.
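For concreteness, a minimal sketch of the three mappings f, g and v follows. The flat nearest-neighbor search stands in for the IHDR lookup, and the data layout and all names are our own illustrative choices.

```python
import numpy as np

def f_update(state, context):
    """f : S x L -> S as a first-in first-out queue: the oldest part of the
    state vector is dropped and the current last context l(t) is appended."""
    return np.concatenate([state[len(context):], context])

def select_action(state, prototypes, candidates, q_values):
    """g : S x L -> 2^A followed by v : 2^A -> A.

    The best-matching prototype primes a candidate action set (here the
    context is assumed to be already folded into `state` by f_update);
    v returns the candidate with the highest Q-value.
    """
    k = int(np.argmin([np.linalg.norm(state - p) for p in prototypes]))
    primed = candidates[k]                     # g(s, l): a subset of A
    return max(primed, key=lambda a: q_values.get((k, a), 0.0))
```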
Figure 3.5: The add-on cognitive mapping modules: (a) the basic module, (b) the enhanced module, (c) the two-level architecture, (d) the semantics learning system architecture.

Figure 3.6: The add-on cognitive mapping modules (cont'd): the semantics learning system architecture.

3.3 Incremental hierarchical discriminant regression (IHDR) in cognitive mapping

Hierarchical discriminant regression (HDR) is a new hierarchical statistical modeling method introduced by Hwang and Weng [35]. Basically, HDR clusters in both the input and output spaces, with the input-space clustering done in a subspace derived according to the output-space clustering. In this way the discriminant analysis is enforced. Both clusterings are done in a coarse-to-fine manner so as to reduce the computational complexity.

3.3.1 Unified analysis of classification and regression

Before getting to the details of HDR, it helps to take a unified view of classification and regression. Typically, we have two types of problems in discriminant analysis. One is the classification problem, with the desired output being a class label. The other is the regression problem, where we have numerical output. Because most statistical tools involve numerical computation, we may cast a classification problem into a regression problem.

For a classification problem, the training samples are typically given as a set, L = {(x_i, l_k) | i = 1, 2, ..., n, k = 1, 2, ..., c}, where x_i ∈ X is an input vector and l_k is the label of x_i. The problem is to determine the class label of any unknown input x ∈ X. Here are two ways to cast it into a regression problem.

1. If a cost matrix [c_ij] is known, where c_ij is the cost of confusing classes i and j, one can design c-dimensional output vectors y_i and y_j for classes i and j, respectively, so that ‖y_i − y_j‖ is as close to c_ij as possible. Here ‖·‖ means any appropriate distance metric.
In a simple and special case, we may have a 0-1 cost matrix, where c_ij = 1 − δ_ij. The output vector y_i may then be designed as a vector with its i-th component being one and all others zero. This is also called the canonical mapping.

2. If the cost matrix is not known, we may design the output vector y_i according to the input space. That is, set y_i to the sample mean of all x that belong to the i-th class. This design is equivalent to taking the distance measure in the input space as the cost function. It is a good choice when the real cost matrix is unknown.

With either one of the above two transformations, the original classification problem on the set L = {(x_i, l_k)} is converted to a problem on the set L = {(x_i, y_k)}, where y_k is a numerical vector instead of a symbolic label. We are now ready to do double clustering.

3.3.2 Hierarchical double clustering

Consider a general regression problem: approximating a mapping h : X → Y from a set of training samples {(x_i, y_i) | x_i ∈ X, y_i ∈ Y, i = 1, 2, ..., n}. HDR employs the idea of linear discriminant analysis (LDA) [21] to find such a mapping. LDA uses class information and seeks an optimal linear transformation from the original data space to a new space in which the samples are well separated (or discriminated). In a typical usage of LDA, one needs to estimate the between-class and within-class scatter matrices, which requires class information. However, for a regression problem, y_i is a numerical vector and there are very few samples sharing a single y. In other words, there are very few samples in each class. Therefore, the within-class scatter matrix will be poorly estimated, especially for high-dimensional input data, e.g., a few thousand dimensions in vision problems or a few hundred dimensions in audition problems.

HDR handles this issue in a coarse-to-fine way. First of all, q y-clusters in the output space are generated according to the output part of the training samples using a k-means-like algorithm. The training samples are thus gathered into q super groups. The x-part of the samples is accordingly clustered, and we call these clusters x-clusters (see Fig. 3.7).

Figure 3.7: Double clustering in both input and output spaces.

Since q is typically a small number,
e.g., 5, it is effective to estimate the between-class and within-class scatter matrices for these q x-clusters. We call this process double clustering. The q x-cluster centers span a (q − 1)-dimensional space, which we call a discriminant space because it characterizes the discriminant information among the x-clusters. A probability-based distance in the discriminant space is used to determine which x-cluster a test sample belongs to.

As we may immediately realize, while the double-clustering process is done on the whole training set, it can be used recursively on the data in each of the super groups (see Fig. 3.8), which ends up with a tree structure. Moreover, the deeper a node is in the tree, the smaller the variance of its data. After the tree is constructed, we may generate the output for a new input x by retrieving from the tree. The final output y is the mean of the y-cluster of the leaf node on the retrieval path.

Figure 3.8: Hierarchical discriminant analysis.

In summary, HDR has two major advantages. First, it automatically generates representations by finding the discriminant spaces according to the events (data) encountered. These subspaces are optimal for different particular tasks in terms of discriminant capability. Second, the hierarchical structure not only reduces the computational complexity of searching but also enables the system to make use of more samples at shallow levels, where the variance of the data is large, which is critical to parameter estimation. At deep levels, the sample variance is tremendously reduced and correct decisions can be made with a very small number of samples. Because of the first advantage, HDR has little limitation in terms of representational power and can potentially fit any data. Because of the second advantage, HDR can learn without any iterations. As a result, HDR realizes one-instance learning without local minima, i.e., zero error on training data without iterations.

3.3.3 Incremental hierarchical discriminant regression

HDR has been successfully used in face recognition [35] and OCR [36]. In these applications, we typically have a training data set ready for the computation of statistics. In a system like a developmental robot, however, we are conducting online learning, which means the data are only available sequentially. In this case, the above HDR method needs to be modified so that it can be used incrementally [104].

A major problem of the incremental HDR (IHDR) implementation is that incremental computation of the statistics may result in bad estimates of the cluster boundaries. These statistics include the sample means and scatter matrices of both the x-clusters and the y-clusters. Because the number of samples is small at the beginning of the incremental procedure, the clusters are usually ill-generated. So is the cluster boundary, which will influence later partitions of the tree. In IHDR, we use the amnesic average technique to gradually get rid of the effects of earlier data.

Suppose {a_i | i = 1, 2, ..., n} are the data coming into the system sequentially. Their sample mean and scatter matrix are given by

ā^(n) = (1/n) Σ_{i=1}^{n} a_i,     (3.4)

Γ^(n) = (1/n) Σ_{i=1}^{n} (a_i − ā^(n))(a_i − ā^(n))^T.     (3.5)

Writing these equations in an incremental style, we have

ā^(n+1) = (n/(n+1)) ā^(n) + (1/(n+1)) a_{n+1},     (3.6)

Γ^(n+1) = (n/(n+1)) Γ^(n) + (1/(n+1)) (a_{n+1} − ā)(a_{n+1} − ā)^T.     (3.7)

From Eqs. (3.4) and (3.5), we know that these incremental estimates weight each sample's contribution equally (1/n). As time elapses, n increases, and a new sample makes less and less difference to the estimate. The amnesic average is implemented as

ā^(n+1) = ((n − l)/(n + 1)) ā^(n) + ((1 + l)/(n + 1)) a_{n+1},     (3.8)

Γ^(n+1) = ((n − l)/(n + 1)) Γ^(n) + ((1 + l)/(n + 1)) (a_{n+1} − ā)(a_{n+1} − ā)^T.     (3.9)

The amnesic parameter l changes the relative weighting of the old estimate and the new samples. By setting l larger than 0, new samples have more effect and old samples are "forgotten" gradually.

3.4 Learning strategies

In general, there are three learning strategies: supervised learning, reinforcement learning and unsupervised learning. As we believe no system can really evolve without any feedback from the environment, we leave out unsupervised learning and discuss only the first two here.

3.4.1 Supervised learning

Supervised learning is the most extensively studied learning strategy in machine learning. It requires the external environment of the system to provide examples in the form of input-output pairs.
The system then derives the association or mapping that fits the data according to certain criteria. Under the stationarity assumption², the learned mapping will predict the output correctly given a new input.

²The stationarity assumption is introduced by Valiant [81]. Basically, it says that the training and test sets are drawn randomly from the same population of examples using the same probability distribution.

Supervised learning is widely used in statistical pattern recognition problems. For example, a fingerprint recognition system identifies a person from a fingerprint image. The provided sample would be the fingerprint image and its identity label. The learning objective is to derive the mapping from the extracted features of the image to the identity label.

Supervised learning is also used in the intelligent robot area, where the provided examples are perception-action pairs. A good example is ALVINN (Autonomous Land Vehicle In a Neural Network) [67]. The major component of ALVINN is a multilayer perceptron neural network with a 30-by-32 road image as input and a 30-dimensional vector as output to specify the driving direction. The training data are collected as pairs of road images and the corresponding correct driving directions.

While supervised learning is a powerful and successful learning strategy, the requirement of input-output pairs as training data is sometimes too demanding for an agent like a real-time robot. ALVINN is lucky because it is easy to collect the training data by having a human drive the vehicle and recording the image/direction pairs. It would be extremely difficult to collect the training data when the behavior of the robot becomes complicated. Actually, even for ALVINN, many synthetic data were generated from real data by rotating the image to simulate situations where the vehicle needs to recover from being off course. Moreover, it is more desirable for a robot to learn from its own experience without a knowledgeable supervisor around.

3.4.2 Reinforcement learning

The term reinforcement learning was first introduced in the psychology domain. People observed the phenomenon that biological systems, such as dogs and humans, increase the likelihood of behaviors that lead to encouraging stimuli. The central idea of reinforcement learning is that an agent receives some biased feedback on its action, such as encouragement or punishment, instead of being told the correct action itself. This is a more realistic situation for a real-time agent. That is, in many cases, the feedback from the environment is a reinforcing stimulus, a reward that encourages or discourages the actions the agent takes, instead of exact guidance or a demonstration of actions or action sequences. In this sense, reinforcement learning is a more powerful learning mechanism for a complete, interactive, goal-seeking agent.

A reinforcement learning algorithm has to resolve two basic issues: trial-and-error search and delayed reward [90]. Trial-and-error search means that the learner has to discover the most rewarding actions through exploration, since it is not told which action to take. Delayed reward means that the consequences of the learner's action are not necessarily adjacent to the action; they typically appear later. It is the responsibility of the learning algorithm to credit the reward to the appropriate actions.

The product of a reinforcement learning system is a policy, which tells the agent how to behave in different situations or states.
The system usually maintains a value estimate, which reflects the degree of "happiness" in the current system state. This value defines an objective function, because the ultimate goal of the system is to maximize it by following a certain policy. It is updated by the reward received from the environment.

Q-learning is one of the most popular reinforcement learning algorithms [98] [90]. In Q-learning, each state maintains a set of action values, called Q-values. The action with the largest value is selected as the system output. The Q-learning update is as follows:

Q(s(t−1), a(t)) ← (1 − α) Q(s(t−1), a(t)) + α [r(t) + γ max_{a′} Q(s(t), a′)],     (3.10)

where a and a′ are actions associated with a state, s(t − 1) is the state at time t − 1, s(t) is the state the system lands in after executing action a(t), r(t) is the reward received from the environment, α is the learning rate, and γ is the discount factor. The algorithm shows that Q-values are updated according to the immediate reward r(t) and the value of the next state, which allows a delayed reward to be back-propagated in time during learning. After enough trial-and-error experience, the system may predict the reward correctly when a similar context occurs the next time. While reinforcement learning gives a more realistic learning model for a real-time agent, it is usually slow for an agent to acquire the optimal policy.

3.4.3 Unified learning strategy

In our developmental robot, the above two learning strategies are implemented in a unified mode (see Fig. 3.9). A developmental robot starts to conduct behaviors as soon as it is released to the world. At any time, its behaviors depend on what it has learned and the specific environmental context. At the same time, learning goes on according to the interaction between the robot and the environment.

Figure 3.9: Flowchart for unified learning: the system learns while performing.

If the trainer imposes an action on an effector, the robot complies, depending on whether the imposed action is consistent with what the robot itself plans to do. If it is, everything is fine: go ahead and do it. If it is not, the robot modifies its learned behaviors so as to follow the trainer's instruction (the imposed action). This is a supervised learning procedure. This learning mode is much more efficient than reinforcement learning when input-output pairs are available.

There are situations, however, when supervised learning cannot be conducted. For example, the robot has internal behaviors such as selecting receptive fields and manipulating internal states. The trainer does not have access to these internal actions. By observing the external behaviors of the robot, the trainer may give reinforcement stimuli to encourage or discourage the robot. From the viewpoint of the robot, it receives certain biased sensory inputs instead of imposed actions. If the reward is positive, the robot is considered to "understand" and follow the trainer's intention. If the reward is negative, the robot will modify its behavior generation strategy (policy) and derive some new internal or external behaviors. This is a reinforcement learning procedure.

In either case, the robot memory is updated and the new association of the input-output pair is internalized. We have adapted Q-learning to integrate the supervised learning and reinforcement learning strategies into a unified mode (Fig. 3.9)³.

³Related work on integrating reinforcement and supervised learning can be found in [15].
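A minimal sketch of this adapted update is given below, assuming a table of Q-values keyed by (state, action); the learning rate, discount factor and boost amount are illustrative values, not the settings used on SAIL. The integration logic it implements is described in the next paragraph.

```python
def update_q(q, s_prev, a, r, s_next, actions,
             imposed=False, alpha=0.1, gamma=0.9, boost=10.0):
    """One step of the unified learning rule built on Eq. (3.10).

    If the trainer imposed action a, raise its value by a large fixed
    amount so the system follows the instruction (supervised case);
    otherwise apply the ordinary Q-learning update (reinforcement case).
    """
    old = q.get((s_prev, a), 0.0)
    if imposed:
        q[(s_prev, a)] = old + boost
    else:
        best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
        q[(s_prev, a)] = (1 - alpha) * old + alpha * (r + gamma * best_next)
```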
In our integrated learning mode, if no action is imposed by the environment, we follow the Q-learning algorithm to update the action values. If an action is imposed by the environment, we increase the value of the corresponding action by a fixed amount. Typically, we make this amount large so as to reflect the fact that the system should follow the instructions of the trainer.

Notice that, in the reinforcement learning case, the biased sensory input may come from both biased and unbiased sensors. By biased sensors, we mean those hardwired so that their signals represent reward directly. For example, humans' preference for sweet tastes and dislike of bitter tastes are hardwired. On the other hand, sensors such as audition and vision are essentially unbiased, because they usually do not lead to pain or happiness at birth. However, these unbiased sensors may be turned into biased ones through experience. For example, a low-pitched sound of "bad dog!" may make a well-trained dog very upset, although it makes no difference at all to wild dogs.

3.5 The SAIL robot

All the experiments presented in the following chapters have been validated on our SAIL robot (Fig. 3.10). SAIL is a human-size mobile robot made in-house at Michigan State University. Its hardware components are shown in Figs. 3.11 and 3.12. Their connections are outlined in Fig. 3.13.⁴

⁴Figures 3.11, 3.12 and 3.13 are obtained from W. S. Hwang's PhD dissertation [37] with the permission of the author.

Figure 3.10: The SAIL robot at Michigan State University.

Figure 3.11: The SAIL robot system diagram: left side view.

Figure 3.12: The SAIL robot system diagram: right side view.

Figure 3.13: The SAIL robot system diagram: hardware connections.

The drive base is adapted from a wheelchair, which enables SAIL to operate both indoors and outdoors. It has two driving wheels in the middle and three supporting wheels, two in front and one in the rear. The drive base does not have a steering wheel; its turning is achieved by the differential speed of the two driving wheels.

The other two major effectors are the robot arm and the pan-tilt units. The six-joint SCORBOT-ER III robot arm was developed by Eshed Robotec (Fig. 3.14). Its controller can control eight motors in total. In addition to the six joint motors, we used one of the two spare channels to control the rotary table which serves as the "neck" of the SAIL robot. The two pan-tilt units (PTUs) are located in the "head" of the SAIL robot. They are produced by Directed Perception, Inc. Mounted on these PTUs are two CCD cameras, the "eyes" of SAIL. By controlling each of these PTUs, the "eyes" can pan and tilt to cover a larger visual field.
The SAIL robot has four pressure sensors on its torso, which sense push actions and force. Twenty-eight touch sensors are distributed on its arm, neck, head, and bumper, which allow a human to teach its behaviors by direct touch. These 32 inputs are multiplexed into an eight-channel A/D converter (ADR 2000) using hardwired analog multiplexers.

Figure 3.14: The SAIL robot: the Eshed robot arm.

The Eshed controller, the PTU controllers, and the ADR 2000 A/D converter are connected to an AccelePort 8e multiport serial adapter from Digi International Inc., which enables serial communication with the host computer.

The "eyes" of the SAIL robot are Panasonic GP-KS152 industrial color 1/2-inch CCDs with auto gain control and auto white balance. Two Matrox Meteor II video cards are used for real-time image capture and digitization. The "ear" of the SAIL robot is a WMS-PRO wireless microphone system. The wireless microphone operates within a range of 250 feet. It is connected to a sound card (Creative Sound Blaster 16) on the host computer.

The host computer, the "brain" of the SAIL robot, used to be a Pentium II 333 MHz dual-processor PC with 512 MB RAM and an internal 27 GB three-drive disk array. It retired after finishing the experiments of Chapter 6. Currently, the host computer is a Xeon 2.2 GHz dual-processor workstation with 1 GB RAM. This allows real-time memory recall and update as well as real-time effector control. The monitor is a ViewSonic ViewPanel VPA138 14-inch LCD display, which is flat, light, and easy to carry. The power of the SAIL robot is provided either by heavy-duty batteries or by line power. An automatic power system, APS 750 from TRIPP LITE power protection, is used to switch between the two sources.

Chapter 4

Incremental PCA in Sensory Mapping

4.1 Introduction

In Chapter 3, we discussed that each neuron in the sensory mapping module performs PCA. PCA is a well-known technique in data compression and feature extraction. It gives a linear transformation that converts a set of d-dimensional data into a lower-dimensional space while minimizing the least mean square (LMS) error.

A well-known computational approach to PCA involves solving an eigensystem problem, i.e., computing the eigenvectors and eigenvalues of the sample covariance matrix, using a numerical method such as the power method or the QR method [27]. This approach requires that all the training images be available before the principal components can be estimated, and is therefore called a batch method. This type of method no longer satisfies an emerging trend of computer vision research [102], in which all visual filters are incrementally derived from very long online real-time video streams, motivated by the development of animal vision systems. Online development of visual filters requires that the system perform while new sensory signals flow in. Further, when the dimension of the image is high, both the computation and storage complexity grow dramatically. For example, in the eigenface method, a moderate grey image of 64 rows and 88 columns results in a d-dimensional vector with d = 5632. The symmetric covariance matrix requires d(d + 1)/2 elements, which amounts to 15,862,528 entries! A clever saving method can be used when the number of images is smaller than the number of pixels in the image [86]. However, an online developing system must observe an open-ended number of images, and that number is larger than the dimension of the observed vectors.
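The storage figure quoted above is easy to verify; a one-line check in plain Python, with the numbers taken straight from the eigenface example in the text:

```python
d = 88 * 64                 # a 64 x 88 grey image as a d-dimensional vector
entries = d * (d + 1) // 2  # upper triangle of the symmetric covariance matrix
print(d, entries)           # 5632 15862528, i.e. ~15.9 million entries
```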
Thus, an incremental method is required to compute the principal components for observations arriving sequentially, where the estimate of the principal components is updated by each arriving observation vector. No covariance matrix is allowed to be estimated as an intermediate result. There is evidence that biological neural networks use incremental methods to perform various kinds of learning, e.g., Hebbian learning [30].

Several IPCA techniques have been proposed to compute principal components without the covariance matrix [59] [60] [85]. However, they run into convergence problems when facing high-dimensional vectors. We will discuss the underlying problem of these methods and present a new method, complementary candid IPCA (CCIPCA). CCIPCA is motivated by a well-known statistical concept called the efficient estimate. An amnesic average technique is also used to dynamically determine the retaining rates of the old and new data, instead of using a fixed learning rate. Experimental results on high-dimensional data show the better convergence rate of CCIPCA over the existing methods. A mathematical proof of the convergence of CCIPCA is given in Appendix A.

4.2 Derivation of the algorithm

4.2.1 First eigenvector

Suppose that sample vectors are acquired sequentially, u(1), u(2), . . ., possibly infinitely many. Each u(n), n = 1, 2, . . ., is a d-dimensional vector, and d can be as large as 5000 and beyond. Without loss of generality, we can assume that u(n) has zero mean (the mean may be incrementally estimated and subtracted out). A = E{u(n)u^T(n)} is the d × d covariance matrix, which is neither known nor allowed to be estimated as an intermediate result.

By definition, an eigenvector x of the matrix A satisfies

\[ A x = \lambda x, \tag{4.1} \]

where λ is the corresponding eigenvalue. By replacing the unknown A with the sample covariance matrix, and replacing the x of Eq. (4.1) with its estimate x(i) at each time step i, we obtain an illuminating expression for v = λx:

\[ v(n) = \frac{1}{n} \sum_{i=1}^{n} u(i)\, u^T(i)\, x(i), \tag{4.2} \]

where v(n) is the n-th step estimate of v. As we will see soon, this equation is motivated by statistical efficiency. Once we have the estimate of v, it is easy to get the eigenvector and the eigenvalue, since λ = ||v|| and x = v/||v||. Now the question is how to estimate x(i) in Eq. (4.2). Considering x = v/||v||, we may choose x(i) as v(i − 1)/||v(i − 1)||, which leads to the following incremental expression:

\[ v(n) = \frac{1}{n} \sum_{i=1}^{n} u(i)\, u^T(i)\, \frac{v(i-1)}{\|v(i-1)\|}. \tag{4.3} \]

To begin with, we set v(0) = u(1), the first direction of data spread. For incremental estimation, Eq. (4.3) is written in a recursive form:

\[ v(n) = \frac{n-1}{n}\, v(n-1) + \frac{1}{n}\, u(n)\, u^T(n)\, \frac{v(n-1)}{\|v(n-1)\|}, \tag{4.4} \]

where (n − 1)/n is the weight for the last estimate and 1/n is the weight for the new data. We have proved that with the algorithm given by Eq. (4.4), v₁(n) → ±λ₁e₁ when n → ∞, where λ₁ is the largest eigenvalue of the covariance matrix of {u(n)} and e₁ is the corresponding eigenvector (see Appendix A).

The derivation of Eqs. (4.2)-(4.4) is motivated by statistical efficiency. An unbiased estimate Q̂ of the parameter Q is said to be the most efficient estimate for the class D of distribution functions if, for every distribution density function f(u, Q) of D, the variance D²(Q̂) (squared error) attains the minimal value given by

\[ D^2(\hat{Q}) = E\!\left[(\hat{Q}-Q)^2\right] \ \geq\ \frac{1}{\,n \int \left[ \dfrac{\partial \ln f(u,Q)}{\partial Q} \right]^2 f(u,Q)\, du\,}. \tag{4.5} \]

The right side of inequality (4.5) is called the Cramér-Rao bound. In other words, the most efficient estimate is the one whose variance about the real parameter is the least possible, and that variance is bounded below by the Cramér-Rao bound.
For example, the sample mean, \(\bar{w} = \frac{1}{n}\sum_{i=1}^{n} w(i)\), is the most efficient estimate of the mean of a Gaussian distribution with a known standard deviation σ [25]. For a vector version of the Cramér-Rao bound, the reader is referred to [103]. If we define w(i) = u(i) u^T(i) x(i), then v(n) in Eq. (4.2) can be viewed as the mean of the "samples" w(i). That is exactly why our method is motivated by statistical efficiency in using averaging in Eq. (4.2). In other words, statistically, the method tends to converge most quickly, or the estimate has the smallest error variance, given the currently observed samples. Of course, w(i) is not necessarily drawn independently from a Gaussian distribution, and thus the estimate using the sample mean in Eq. (4.4) is not strictly most efficient. However, the estimate v(n) still has a high statistical efficiency and a fairly low error variance, as we will show experimentally.

The Cramér-Rao lower error bound in Eq. (4.5) can also be used to estimate the error variance, or equivalently the convergence rate, using a Gaussian distribution model, as proposed and experimented with by Weng et al. [103, Section 4.6]. This is a reasonable estimate because of our near-optimal statistical efficiency here. Weng et al. [103] demonstrated that the actual error variance is not very sensitive to the distribution (e.g., uniform or Gaussian). This error estimator is especially useful for estimating roughly how many samples are needed for a given tolerable error variance.

IPCA algorithms have been studied by several researchers [61] [92] [59] [60]. An early work with a rigorous proof of convergence was given by Oja & Karhunen [59] [60], where they introduced their stochastic gradient ascent (SGA) algorithm. SGA computes

\[ \tilde{v}_i(n) = v_i(n-1) + \gamma_i(n)\, u(n)\, u^T(n)\, v_i(n-1), \tag{4.6} \]
\[ v_i(n) = \text{orthonormalize } \tilde{v}_i(n) \text{ w.r.t. } v_j(n),\ j = 1, 2, \ldots, i-1, \tag{4.7} \]

where v_i(n − 1) is the estimate of the i-th dominant eigenvector of the sample covariance matrix A = E{u(n)u^T(n)}, and ṽ_i(n) is the new, unnormalized estimate. In practice, the orthonormalization in Eq. (4.7) can be done by a standard Gram-Schmidt orthonormalization (GSO) procedure. The parameter γ_i(n) is a stochastic approximation gain. The convergence of SGA has been proved under some assumptions on A and γ_i(n) [60].

SGA is essentially a gradient method, and associated with any gradient method is the problem of choosing the learning rate, here γ_i(n). Simply speaking, the learning rate should be chosen so that the second term (the correction term) on the right side of Eq. (4.6) is comparable to the first term, neither too large nor too small. In practice, γ_i(n) depends very much on the nature of the data and usually requires a trial-and-error procedure. Oja gave some suggestions on γ_i(n) in [59], typically 1/n multiplied by some constant. However, procedure (4.6) is at the mercy of the magnitude of the observation u(n): the first term has unit norm, but the second can take any magnitude. If u(n) has a very small magnitude, the second term will be too small to make any change in the new estimate. If u(n) has a large magnitude, which is the case with high-dimensional images, the second term will dominate the right side until a very large n, and hence a small γ_i(n), has been reached. In either case, the updating is inefficient and the convergence will be slow. It is true that γ_i(n) can be manually selected to take the magnitude of u(n) into account, but γ_i(n) alone cannot accomplish statistical efficiency.
Further, such a manual selection is not suited to an online learning algorithm, since the user does not necessarily have the required expertise and the real-time experiences may not be repeatable. The algorithm must compute data-sensitive parameters automatically.

Contrasted with SGA, the first term on the right side of Eq. (4.4) is not normalized. In effect, v(n) in Eq. (4.4) converges to λe instead of e as it does in Eq. (4.6), where λ is the eigenvalue and e is the eigenvector. In Eq. (4.4), statistical efficiency is realized by keeping the scale of the estimate at the same order as the new observations (the first and second terms on the right side of Eq. (4.4) are properly weighted to form a sample mean), which allows full use of every observation in terms of statistical efficiency. Note that the coefficient (n − 1)/n in Eq. (4.4) is as important as the "learning rate" 1/n in the second term for realizing the sample mean. Although (n − 1)/n is close to 1 when n is large, it is very important for fast convergence with early samples: if the estimate does not converge well at the beginning, it is harder to pull it back later, when n is large. Thus, one does not need to worry about the nature of the observations. This is also the reason we used "candid" in naming the new algorithm.

There is a further improvement to procedure (4.4). In Eq. (4.4), all the "samples" w(i) = u(i) u^T(i) v(i−1)/||v(i−1)|| are weighted equally. However, since w(i) is generated from v(i), and v(i) is far away from its real value at an early estimation stage, w(i) is a "sample" with large "noise" when i is small. To speed up the convergence of the estimation, it is preferable to give smaller weights to these early "samples." A way to implement this idea is to use an amnesic average, changing Eq. (4.4) into

\[ v(n) = \frac{n-1-l}{n}\, v(n-1) + \frac{1+l}{n}\, u(n)\, u^T(n)\, \frac{v(n-1)}{\|v(n-1)\|}, \tag{4.8} \]

where the positive parameter l is called the amnesic parameter. Note that the two modified weights still sum to 1. With the presence of l, larger weight is given to new "samples" and the effect of old "samples" fades out gradually. Typically, l ranges from 2 to 4.

4.2.2 Intuitive explanation

An intuitive explanation of procedure (4.4) is as follows. Consider a set of 2-dimensional data with a Gaussian probability distribution function (for any other physically arising distribution, we can consider its first two orders of statistics, since PCA does so). The data is characterized by an ellipse, as shown in Fig. 4.1. According to the geometrical meaning of eigenvectors, we know that the first eigenvector is aligned with the long axis of the ellipse. Suppose v₁(n − 1) is the (n − 1)-th step estimate of the first eigenvector.

Figure 4.1: Intuitive explanation of CCIPCA.

Noticing that u^T(n) v₁(n−1)/||v₁(n−1)|| is a scalar, we know that the second term of Eq. (4.4) is essentially a scaled version of the vector u(n). According to procedure (4.4), v₁(n) is a weighted combination of the last estimate, v₁(n − 1), and this scaled vector. Therefore, geometrically speaking, v₁(n) is obtained by pulling v₁(n − 1) toward u(n) by a small amount.

A line, l₂, orthogonal to v₁(n − 1), divides the whole plane into two halves, the upper and the lower one. Every point u_t in the lower half plane forms an obtuse angle with v₁(n − 1), so the scalar u_t^T v₁(n−1)/||v₁(n−1)|| is negative. For such a u_t, since u_t u_t^T = (−u_t)(−u_t)^T, Eq. (4.4) may equivalently be written as

\[ v_1(n) = \frac{n-1}{n}\, v_1(n-1) + \frac{1}{n}\, (-u_t)(-u_t)^T\, \frac{v_1(n-1)}{\|v_1(n-1)\|}, \]

where −u_t is an upper-half-plane point obtained by rotating u_t by 180 degrees with respect to the origin.
Since the ellipse is centrally symmetric, we may rotate all the lower-half-plane points into the upper half plane and consider only the pulling effect of upper-half-plane points. For the points u_t in the upper half plane, the net force will pull v₁(n − 1) towards the long axis, since there are more data points to the right side of v₁(n − 1) than to the left side (see Fig. 4.1). As long as the first two eigenvalues are different, this pulling force always exists, and the pulling direction is towards the eigenvector corresponding to the larger eigenvalue. v₁(n − 1) will not stop moving until it is aligned with the long axis, where the pulling forces from both sides are balanced. In other words, v₁(n) in Eq. (4.4) will converge to the first eigenvector. As one would expect, the larger the ratio of the first eigenvalue to the second, the more unbalanced the force, and the faster the pulling, i.e., the convergence, will be. However, when λ₁ = λ₂, the ellipse degenerates to a circle. The movement will not stop, which makes it seem that the algorithm does not converge. Actually, since any vector in that circle can represent the eigenvector, the lack of convergence does no harm. We will return to the case of equal eigenvalues in Section 4.2.4.

4.2.3 Higher-order eigenvectors

Procedure (4.4) only estimates the first dominant eigenvector. One way to compute the other, higher-order eigenvectors is to follow what SGA does: start with a set of orthonormalized vectors, update them using the suggested iteration step, and recover the orthogonality using GSO. For real-time online computation, we need to avoid the time-consuming GSO. Further, breaking and then recovering orthogonality slows down the convergence compared with keeping orthogonality all along. We know that eigenvectors are orthogonal to each other. So, it helps to generate "observations" only in a complementary space for the computation of the higher-order eigenvectors. For example, to compute the second-order eigenvector, we first subtract from the data its projection on the estimated first-order eigenvector, as shown in Eq. (4.9),

\[ u_2(n) = u_1(n) - u_1^T(n)\, \frac{v_1(n)}{\|v_1(n)\|}\, \frac{v_1(n)}{\|v_1(n)\|}, \tag{4.9} \]

where u₁(n) = u(n). The obtained residual, u₂(n), which lies in the complementary space of v₁(n), serves as the input data to the iteration step. In this way, orthogonality is always enforced once convergence is reached, although not exactly so at the early stages. This makes good use of the available samples and thus speeds up convergence. That is the reason we use "complementary" in referring to the proposed algorithm.

A similar idea has been used by other researchers. Kreyszig proposed an algorithm that finds the first eigenvector using a method equivalent to SGA and subtracts the first component from the samples before computing the next component [47]. Sanger suggested an algorithm, called the generalized Hebbian algorithm (GHA), based on the same idea, except that all the components are computed at the same time [85]. However, in both cases, statistical efficiency was not considered.

4.2.4 Equal eigenvalues

Let us consider the case where there are equal eigenvalues. Suppose the ordered eigenvalues between λ_l and λ_m are equal:

\[ \lambda_{l-1} > \lambda_l = \lambda_{l+1} = \cdots = \lambda_m > \lambda_{m+1}. \]

According to the explanation in Section 4.2.2, a vector estimate converges to the direction with a larger eigenvalue first. Therefore, the estimates of the eigenvectors e_i, where i < l, will not be affected.
The vector estimates of e_l to e_m will move into the subspace spanned by themselves but will not converge inside that subspace, since the shape in Fig. 4.1 is now a hypersphere. This does no harm, because any vector in that subspace is a good estimate.

According to Section 4.2.3, the eigenvectors with smaller eigenvalues are estimated in the subspace orthogonal to the space spanned by all the eigenvectors corresponding to larger eigenvalues. Although the estimates with equal eigenvalues keep moving within their subspace, the subspace itself does not move and, therefore, neither does its complementary subspace. So, the estimates of the eigenvectors with smaller eigenvalues will not be affected, either.

When two eigenvalues are nearly equal, the shape in Fig. 4.1 is nearly a hypersphere. The convergence will be slower than with an elongated shape. But the negative effect is minimal, since the corresponding vectors are almost equivalent, as we discussed above for the equal-eigenvalue case.

4.2.5 Algorithm summary

Combining the mechanisms discussed above, we have the complementary candid IPCA algorithm shown in Fig. 4.2.

For n = 1, 2, . . ., do:
  1. u₁(n) = u(n).
  2. For i = 1, 2, . . . , k, do:
     2.1. If i < n, update the i-th eigenvector estimate and form the residual for the next order:
          v_i(n) = ((n − 1 − l)/n) v_i(n − 1) + ((1 + l)/n) u_i(n) u_i^T(n) v_i(n − 1)/||v_i(n − 1)||,
          u_{i+1}(n) = u_i(n) − u_i^T(n) (v_i(n)/||v_i(n)||) (v_i(n)/||v_i(n)||).
     2.2. If i = n, initialize the i-th eigenvector: v_i(n) = u_i(n).
     2.3. If i > n, the i-th eigenvector is not yet initialized: v_i(n) = 0.

Figure 4.2: CCIPCA algorithm: compute the first k dominant eigenvectors, v₁(n), v₂(n), . . . , v_k(n), directly from u(n), where n = 1, 2, . . ..

4.3 Empirical results on convergence

We performed experiments to study the statistical efficiency of the new algorithm as well as of the existing IPCA algorithms. The experiments were conducted on two data sets, one with medium dimensionality (130-D) and the other with very high dimensionality (5632-D).

4.3.1 Experiments on the speech data set

The first data set consisted of utterances of ten numbers ("one" to "ten") by 15 persons, collected in the Pattern Recognition and Image Processing (PRIP) lab at MSU. Each number had five utterances. The sampling rate was 11.025 kHz. In total we had 374 seconds of speech data. The 13-order Mel-frequency cepstral coefficients (MFCCs) [19] were computed for speech frames generated using a 256-point Hamming window. The MFCCs of ten consecutive frames were concatenated into a single vector, which served as a sample. Therefore, the sample dimensionality is 130, and there are in total 20,612 samples for the 374 seconds of speech data.

We first computed the eigenvectors using batch PCA with the QR method and used them as the ground truth. The code for batch PCA was adapted from the C recipes of [68]. Then, SGA, GHA, and CCIPCA were implemented to estimate the eigenvectors incrementally. We divided the whole data set into 20 subsets. As the data went through the IPCA algorithms, the estimates of the eigenvectors were saved after each subset was passed. In SGA, the learning rates were chosen according to the suggestions in [59, page 54], i.e., γ₁(n) = 0.7/n, γ₂(n) = 2.5/n, γ₃(n) = 10/n, γ₄(n) = 20/n, and γ₅(n) = 32/n. Since there were suggestions only for the first five diagonal components of the gain matrix, we extrapolated them and chose γ₆(n) = 46/n, γ₇(n) = 62/n, γ₈(n) = 80/n, γ₉(n) = 100/n, and γ₁₀(n) = 130/n. In GHA, we set γ(n) = 1/n. The amnesic parameter l was set to 2 in CCIPCA.
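As a concrete rendering of the algorithm in Fig. 4.2, the following is a minimal sketch, assuming zero-mean samples arriving one at a time; the function and variable names are illustrative, not from the dissertation.

```python
import numpy as np

def ccipca_step(V, n, u, l=2.0):
    """One CCIPCA iteration (the algorithm of Fig. 4.2) on sample u.

    V : k x d array; row i-1 holds v_i, the unnormalized estimate of the
        i-th eigenvector (its norm estimates the eigenvalue lambda_i).
        Rows not yet initialized are zero.
    n : 1-based index of the current sample
    u : d-dimensional, assumed zero-mean (and nonzero) observation
    l : amnesic parameter (typically 2 to 4)
    """
    k = V.shape[0]
    u = u.astype(float).copy()            # u_1(n) = u(n)
    for i in range(1, k + 1):             # i as in Fig. 4.2 (1-based)
        if i == n:
            V[i - 1] = u                  # initialize v_i(n) = u_i(n)
        elif i < n:
            v = V[i - 1]
            # Amnesic average of Eq. (4.8); note the early weights are
            # noisy by design and fade as n grows.
            V[i - 1] = ((n - 1 - l) / n) * v \
                       + ((1 + l) / n) * u * (u @ v) / np.linalg.norm(v)
            # Residual of Eq. (4.9): input for the next, higher order.
            vn = V[i - 1] / np.linalg.norm(V[i - 1])
            u = u - (u @ vn) * vn
        # i > n: v_i stays zero until its turn.
    return V
```

After feeding a stream of samples, `V[i] / np.linalg.norm(V[i])` approximates the (i+1)-th eigenvector and `np.linalg.norm(V[i])` the corresponding eigenvalue, which is exactly the property (||v_i|| → λ_i) evaluated in the experiments below.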
The coherence between an estimated unit eigenvector v and the corresponding one computed by the batch method, v′, also normalized, is represented by their inner product v · v′. Because ||v − v′||² = 2(1 − v · v′), we have v = v′ iff v · v′ = 1. In other words, the closer the inner product is to 1, the better the coherence of the two vectors.

As we can see from Fig. 4.3, SGA does not converge at all, while GHA shows some trend toward convergence during the last portion of the experiment. In contrast, the eigenvectors obtained by CCIPCA converge very well. ||v_i|| converges to λ_i reliably, as shown in Fig. 4.4, where the vertical axis is the ratio ||v_i||/λ_i, i.e., the magnitude of each of the first 10 eigenvector estimates obtained by CCIPCA divided by the corresponding eigenvalue computed by batch PCA.

Figure 4.3: Number utterance data set: the correctness, or coherence, represented by dot products, of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) CCIPCA, respectively.

Figure 4.4: Number utterance data set: the correctness of the eigenvalues, ||v_i||/λ_i, by CCIPCA.

To show more clearly why CCIPCA works better than SGA, let us divide both sides of Eq. (4.4) by ||v_i(n − 1)||:

\[ \frac{v_i(n)}{\|v_i(n-1)\|} = \frac{n-1}{n}\, \frac{v_i(n-1)}{\|v_i(n-1)\|} + \frac{1}{n\,\|v_i(n-1)\|}\, u_i(n)\, u_i^T(n)\, \frac{v_i(n-1)}{\|v_i(n-1)\|}. \tag{4.10} \]

Denoting v_i(n)/||v_i(n − 1)|| as ṽ_i(n), we may rewrite Eq. (4.10) as

\[ \tilde{v}_i(n) = \frac{n-1}{n}\, \frac{v_i(n-1)}{\|v_i(n-1)\|} + \frac{1}{n\,\|v_i(n-1)\|}\, u_i(n)\, u_i^T(n)\, \frac{v_i(n-1)}{\|v_i(n-1)\|}. \tag{4.11} \]

Eq. (4.11) is very similar to Eq. (4.6) if we take γ_i(n) = 1/(n ||v_i(n − 1)||). In other words, the CCIPCA algorithm can be viewed as an SGA algorithm with a learning rate that adapts to the magnitude of v_i(n − 1). Considering that ||v_i(n − 1)|| → λ_i, where λ_i is the i-th largest eigenvalue obtained by batch PCA, we set γ_i(n) = 1/(λ_i n) in Eq. (4.6) and ran the experiment again. Interestingly, SGA converged much better than before, as shown in Fig. 4.5, which means 1/(λ_i n) was a very good learning rate for SGA.
Comparing Fig. 4.5 with Fig. 4.3(c), we observe that the results of SGA with the finely tuned learning rate resemble those of CCIPCA, especially for the first few principal components. However, for a real-time application we cannot use a trial-and-error method to find the eigenvalues in advance. Therefore, CCIPCA has a great advantage over SGA.

Figure 4.5: Number utterance data set: the correctness, or coherence, represented by dot products, of the first 10 eigenvectors computed by SGA with a finely tuned learning rate.

4.3.2 Experiments on the FERET face data set

As a second example, we conducted experiments on the FERET face data set [65], which has even higher dimensionality. We used the frontal views of 457 subjects from the data set. Most of the subjects have two views, while 34 of them have four views and two of them have one view, which results in a frontal face data set of 982 images. The size of each image is 88-by-64 pixels, or 5632 dimensions.

Again, we first computed the eigenvectors using the batch PCA algorithm. Then we divided the entire data set into 20 subsets and saved the estimates of the eigenvectors after each subset went through the IPCA algorithms. The learning rates for SGA and GHA were set as in the speech data set experiment. Fig. 4.6 and Fig. 4.7 give a summary of the results, which show convergence patterns similar to those in the number data set experiment. To demonstrate the effect of the amnesic parameter l in Eq. (4.8), we also show the result of eigenvector estimation with l = 0.
Figure 4.6: FERET data set: the correctness, or coherence, represented by dot products, of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) CCIPCA with the amnesic parameter l = 2.

Figure 4.7: FERET data set: the correctness of the eigenvalues, ||v_i||/λ_i, by CCIPCA.

Figure 4.8: FERET data set: the effect of the amnesic parameter. The correctness of the first 10 eigenvectors computed by CCIPCA with the amnesic parameter l = 0; compare with Fig. 4.6(c).

Comparing Fig. 4.8 with Fig. 4.6(c), we can see that the amnesic parameter did help to achieve faster convergence.

A further experiment was done to show the performance of the algorithm on a long data stream. Since the statistics of a real-world image stream are not necessarily stationary (for example, the mean and variance may change with time), a changing mean and variance make correctness evaluation difficult. To avoid this effect, we fed the same set of FERET image data repeatedly into the algorithms to simulate a statistically stable long data stream. Fig. 4.9 shows the result after 20 epochs. As expected, all IPCA algorithms converged better with the long data stream, and CCIPCA was the quickest.

In order to provide a visual view of the eigenvectors computed, Fig. 4.10 shows the first 10 eigenfaces obtained by batch PCA and by CCIPCA (with amnesic parameter l = 2), respectively, shown as images. The corresponding eigenfaces computed by the two very different methods are very similar.

The average execution times of SGA, GHA, and CCIPCA per estimation step are shown in Table 4.1. As shown, without the GSO procedure, GHA and CCIPCA run significantly faster than SGA. CCIPCA has a further computational advantage over GHA because of a saving in normalization.

Table 4.1: The average execution time per estimation step.

  SGA      GHA      CCIPCA
  0.42s    0.083s   0.072s

4.4 Conclusions

This chapter concentrated on the challenging issue of computing the dominant eigenvectors and eigenvalues from an incrementally arriving, high-dimensional data stream, without computing the corresponding covariance matrix and without knowing the data in advance. The proposed CCIPCA algorithm is fast in convergence rate and low in computational complexity. Our results showed that whether or not the concept of the most efficient estimate is used plays a dominating role in convergence speed for high-dimensional data. An amnesic average technique was implemented to further improve the convergence rate.
Figure 4.9: FERET data set: the correctness of the first 10 eigenvectors computed by (a) SGA, (b) GHA, and (c) CCIPCA (with the amnesic parameter l = 2), respectively, over 20 epochs (a long data stream).

Figure 4.10: FERET data set: the first 10 eigenfaces obtained by (a) batch PCA and (b) CCIPCA (with amnesic parameter l = 2), shown as images.

The importance of the result presented here potentially goes beyond the technical scope of interest to the computer vision community. As discussed in [102], what a human brain does is not just computing, i.e., processing data, but, more importantly and more fundamentally, developing the computing engine itself from real-world, online sensory data streams. Although many studies remain to be done and many open questions are waiting to be answered, the incremental development of a "processor" plays a central role in brain development. The "processor" here is closely related to a procedure widely used in appearance-based vision: the inner product of an input scatter vector u with an eigenvector, something that a neuron does before its sigmoidal nonlinearity. What is the relationship between IPCA and our brain? A clear answer is not yet available, but Rubner & Schulten [80] proved that the well-known mechanisms of biological Hebbian learning and lateral inhibition between nearby neurons [42] (pages 1020, 376) result in an incremental way of computing PCA. Although we do not claim that the computational steps of the proposed CCIPCA can be found physiologically in the brain, the link between incremental PCA and the developmental mechanisms of our brain is probably more intimate than we can fully appreciate now.

Chapter 5

Grounded Speech Learning

5.1 Introduction

We view a speech recognition system as an intelligent agent with audition-driven behaviors. The development of such a system is an agent development problem. We have discussed some of the challenges of such a system in Chapter 1. In this chapter, we show how the architecture and the techniques developed in the last two chapters contribute to resolving those problems.
We would like to show that an artificial system can develop its audition-dependent behaviors through online, real-time interaction with the environment. This is one step toward fully autonomous learning, which may eventually make ASR systems autonomous.

5.2 Implementation details

5.2.1 IHDR

The first requirement for a system that learns directly from its real-time sensory experiences is to eliminate the manual design of internal representation. Internal representation includes, among others, features extracted from perceived signals, subspaces represented by clusters in sensory inputs, inter-connections between nodes, and states consisting of short-history information. As discussed in Chapter 3, we choose the method of incremental hierarchical discriminant regression (IHDR) to achieve this goal.

IHDR does clustering in both the input and output spaces. When constructing (or learning) the tree, we first derive the clusters in the output space, according to which the clusters in the input space are formed. When using the constructed or partially constructed tree to do the mapping, we search the tree in the input space to find the primitive prototype, and then locate the corresponding prototype in the tree of the output space to generate the output. To guarantee performance, IHDR needs the sample distributions in the input and output spaces to have the same topology, i.e., the ordering of distances between samples should be consistent in both spaces. Otherwise, the input space clusters will not be appropriately constructed. For example, in Fig. 5.1, because y₂ is close to y₃ and x₂ is close to x₃, the corresponding input and output clusters are well organized. But it is not desirable to form the input space cluster of x₁ and x₄, which corresponds to the output space cluster of y₁ and y₄.

Figure 5.1: The sample distributions in the input and output spaces should have the same topology.

However, for audition-driven behavior development, the distribution of the input to the system, the auditory signals, does not necessarily maintain the same topology as that of the output, the motor control signals. For example, there are cases in which the motors do not move while non-silent auditory signals are heard by the robot. We surely do not want to group all these non-silent auditory signals into one big cluster corresponding to a zero output. To comply with the requirement of the IHDR method, we did not directly use the auditory signals and the motor control signals as the input-output pairs entering the IHDR tree. Instead, we set the output part to be the concatenation of the auditory signal and the motor control signal,

\[ l(t) = \langle x_s(t),\, x_a(t) \rangle, \]

where l(t) is the output part of the input-output pair, and x_s(t) and x_a(t) are the auditory sensation vector and the control signal vector, respectively. In this way, when the robot is not performing any action, x_a(t) is zero and the IHDR tree is constructed according to the auditory signals only. When the robot is conducting certain actions, both x_s(t) and x_a(t) are used to guide the clustering in the output space, and eventually the clustering in the input space.

Since the auditory signal part has a different dimension, mean, and variance from the control signal part, we normalized the two parts before concatenation, to avoid the domination of either part:

\[ \tilde{v}_i(t) = \frac{v_i(t) - \bar{v}_i(t)}{\sigma_i(t)}, \qquad i = 1, 2, \]

where v₁(t) = x_s(t), v₂(t) = x_a(t), ṽ_i(t) is the normalized signal, v̄_i(t) is the mean of v_i(t), and σ_i(t) is the variance of the norm of v_i(t).
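The construction of the output part is straightforward; the following is a minimal sketch, with hypothetical names. In the system the means and variances are maintained incrementally, whereas here they are simply passed in.

```python
import numpy as np

def output_part(x_s, x_a, mean_s, sigma_s, mean_a, sigma_a):
    """Form l(t) = <x_s(t), x_a(t)>: normalize the auditory part and the
    control part separately, so neither dominates the distance metric,
    then concatenate them into a single output vector."""
    v1 = (x_s - mean_s) / sigma_s      # normalized auditory part
    v2 = (x_a - mean_a) / sigma_a      # normalized control-signal part
    return np.concatenate([v1, v2])

# Illustrative usage with made-up dimensions (130-D audition, 4-D action):
l_t = output_part(np.random.randn(130), np.zeros(4),
                  np.zeros(130), 1.0, np.zeros(4), 1.0)
```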
5.2.2 Auditory sensation

For a robot, the auditory signal enters the microphone as a continuous stream. Without the help of a human designer to edit or transcribe the sensory input, meaningful speech units, such as phonemes, words, phrases, and sentences, no longer exist for the robot, which has a twofold effect. On one hand, the robot only needs to make sure actions are conducted under the appropriate auditory contexts; it does not need to decide whether a context corresponds to a phoneme, a word, or a sentence, which makes the system design simpler. On the other hand, the context has to be determined automatically and appropriately by the robot itself through its interactive real-time experiences, which is very challenging. The sensory mapping we proposed (see Section 3.2) is an attempt to resolve this challenge, although it may not be the best or final solution. Nevertheless, our experiments did show some interesting results by enabling a selective attention mechanism, modeled as an internal behavior, to choose the context.

Speech is a linear signal in the sense that the information is distributed over time. Therefore, a single, short auditory frame usually does not contain enough information for behavioral decision making. In the implementation hereafter, unless specified otherwise, a newly captured auditory sensation includes the 20 most recent auditory frames at each time instance. This short history proved to be enough for the short phrases that are typical of the early audition development stage.

Fig. 5.2 illustrates how the auditory sensation is processed before entering the cognitive mapping module. Cepstrum analysis is done over the raw auditory signals and generates an auditory stream of Mel-frequency cepstral coefficients (MFCCs) [19]. A sliding window over the continuous auditory stream turns the stream into a series of last auditory contexts, each containing 20 auditory frames. The sensory mapping module takes these last contexts and generates responses from its multiple layers. In the upper panel of Fig. 5.2, each row of rectangles corresponds to one layer's response. These sensory mapping responses eventually enter the cognitive mapping.

Figure 5.2: The processing of the auditory sensation before it enters the cognitive mapping module.
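The sliding window that produces the last contexts is simple to state in code; a minimal sketch follows, with illustrative class and method names (the frame and window sizes match the text).

```python
import numpy as np
from collections import deque

class LastContext:
    """Sliding window that turns a stream of 13-D MFCC frames into the
    'last auditory contexts' of Fig. 5.2, each the concatenation of the
    most recent 20 frames (a 260-D vector here)."""

    def __init__(self, n_frames=20):
        self.n_frames = n_frames
        self.frames = deque(maxlen=n_frames)

    def push(self, mfcc_frame):
        """Add one frame; return the context vector, or None while the
        window is still filling up."""
        self.frames.append(np.asarray(mfcc_frame, dtype=float))
        if len(self.frames) < self.n_frames:
            return None
        return np.concatenate(self.frames)
```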
5.2.3 Behavior generation

The nature of the autonomous development of our system prevents us from defining the behaviors in advance. While this offers flexibility in the ultimate behavioral capability of the system, it immediately introduces a problem: how to acquire the desired actions among such a very large number of possible ones? The action space is determined at system programming time. For example, once the robot arm is built, the action space is defined, although not all points in this space are reachable. The system can generate all possible external actions from a small set of internal action increments. The actual action outputs of the system are the accumulation of these action increments. For example, to touch every (x, y) position in a bounded 2-D space, a minimum of four increment vectors is sufficient: (1, 0), (−1, 0), (0, 1), (0, −1).

5.2.4 Learning procedure

At each time instance, the robot conducts the following learning procedure (a compact code sketch of this loop is given at the end of this section):

1. Grab the new auditory sensation x_s(t).
2. Update each neural column of the sensory mapping module by doing PCA incrementally.
3. Concatenate the outputs from the neural columns in the sensory mapping module and the reading from the internal action sensor, x_a(t), into a single vector, the last context l(t).
4. Suppress the outputs from some of the neural columns to zero according to the attention selection signals (part of x_a(t)).
5. Replace the oldest part of the internal state s(t − 1) with l(t) to get a vector s′(t)¹.
6. Query the partially constructed IHDR tree and get a primitive prototype, s(t), that is closest to s′(t), using the fast tree search.
7. If s′(t) is significantly different from s(t), it is considered a new sample and we update the IHDR tree using the input-output pair ⟨s′(t), l(t)⟩. Otherwise, s′(t) updates s(t) through incremental averaging. After updating the tree, we query it again using s′(t) to get the primitive prototype s(t), which becomes the new state. This completes the state transition shown in Fig. 3.2.
8. If an action is given (imposed by a trainer) through the touch sensors as a_i(t), increase the value of the action associated with s(t) that is most similar to a_i(t), and return a_i(t) as the action to be executed at this time instance. Otherwise, update the Q-values of s(t − 1) using Eq. (3.10), and return
   a(t) = arg max_{a′} Q(s(t), a′)
   as the action to be executed at this time instance.
9. Return to step 1.

¹See page 48.

5.2.5 Communicative learning

While supervised learning and reinforcement learning each have advantages, and they compensate for each other in our unified learning strategy, they are relatively low-level learning mechanisms and demand a lot of training effort for complex behavior learning. For example, to teach a robot to "go left at the corner," the human trainer has to either push the robot to turn (supervised learning) or give rewards when the robot happens to turn left at the corner (reinforcement learning). Actually, in animal and human learning there is a higher-level learning mode, which we call communicative learning. Communicative learning takes advantage of a higher animal's capability for language acquisition to transfer sophisticated skills from the trainer to the learner through language instructions. In the previous example, suppose "recognizing a corner" and "following a 'turn left' command" are two learned behaviors. Teaching "go left at the corner" would then be as easy as saying "turn left" when the robot reaches the corner.
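Returning to the learning procedure of Section 5.2.4, the following control-flow sketch condenses the nine steps into code. Every component here (`sensors`, `smap`, `ihdr`, `state`, `q_update`) is an assumed interface standing in for the real modules; nothing in this sketch reproduces the actual implementation.

```python
def learning_step(sensors, smap, ihdr, state, q_update):
    """One pass of the per-time-step procedure (steps 1-9 above)."""
    x_s = sensors.grab_audio()                      # 1. new sensation
    smap.update_pca(x_s)                            # 2. incremental PCA
    l_t = smap.context(sensors.x_a())               # 3-4. last context,
                                                    #      with attention
    s_query = state.shift_in(l_t)                   # 5. candidate s'(t)
    s_t = ihdr.nearest(s_query)                     # 6. fast tree search
    if ihdr.is_novel(s_query, s_t):                 # 7. learn the sample
        ihdr.insert(s_query, l_t)
    else:
        ihdr.average_into(s_t, s_query)
    s_t = ihdr.nearest(s_query)                     #    re-query: new state
    a_imp = sensors.imposed_action()
    if a_imp is not None:                           # 8a. supervised branch
        ihdr.boost_value(s_t, a_imp)
        return a_imp
    q_update(state.prev, s_t, sensors.reward())     # 8b. Eq. (3.10)
    return ihdr.best_action(s_t)                    #     then loop (step 9)
```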
A simulated 4-joint robot arm was the effector. As shown in Fig. 5.3, the GUI of AudioDeveloper displays the audio wave (the top part), the rewards (the middle part), and the actions (the bottom part) along the horizontal time axis. Hu- man trainers first spoke to AudioDeveloper and then gave the rewards by pushing the “G” and “B” buttons on the GUI toolbar for “good” and “bad”, respectively, ac— cording to AudioDeveloper’s behavior. To facilitate the training procedure, another software agent, AudioTeacher, was programmed as the real-time virtual teacher. Au- dioTeacher played the pre-recorded sound through a set of computer speakers, and generated and sent the reward to AudioDeveloper according to AudioDeveloper’s re- sponse. The training procedure was run in real-time for 6 hours, during which the resulting developed system was saved every half an hour so that we could examine 100 ”muons Novel [And loll] 1 1. luulll‘i‘mml‘ :Jgjgj MflkllN'V vvflnpl‘ |n¢m1'| lll'.‘ Dnzamwmwwkmdm _ ...lgm memwwmmam 9‘9 1’ . c a Bil 1 II G s Audlu .slfllfl Alli] Audla Rev-rd _ Reaction Dlm 4 Dlm l Dlm Z (a) (b) Figure 5.3: The GUI of AudioDeveloper: (a)During online learning; (b)After online learn- ing. the performance later. The speech was contributed by 140 persons with a variety of nationalities (Amer- ican, Chinese, French, Indian, Malaysian, and Spanish) and ages (from 18 to 50). Each person made five utterances for each of the four vowels, / A / (hard), / E/ (head), / i / (heed) and /c/(haul)2. There was silence lasting about 0.55 between two consecu- tive utterances. The sound was digitized at 11.025 kHz. In this way, we got an speech data set with 2800 isolated utterances. For the five sets of vowel utterances, we did a 5-fold leave-one-out cross-validation in the experiment. That is the AudioTeacher played four out of five sets of utterances during training and played the independent set of utterances during testing. Before the speech data reached the sensory mapping, the 13-order MFCCS were computed over every 256 sound sample points (a single auditory frame of about 20ms) before the data reached the sensory mapping module. Covering 10 auditory frames 2We use single symbol version of ARPAbet to represent the phonetic alphabets. 101 the dimension of a new grabbed auditory sensation vector is 10 x 13 = 130. The sensory mapping module had four layers. The neural columns in each layer took as the input the outputs from ten neural columns of the next lower layer spread- ing along the temporal dimension. Thus, the length of context covered by the four layers was about 200ms, 400ms, 800ms, and 1600ms, respectively. When doing PCA, we determined the number of principal components so that 95% of the variance in the data would be kept. To be specific, the number of principal components in the four sensory mapping layers was 31, 27, 22, and 18, respectively. The control signal to the effector of the system was a 4—D action vector. Four desired behaviors were defined, each for one of the four vowels. Behaviors were iden- tified by the component of the action vector with the maximum value. For example, if the first component of the action vector had the maximum value, it was identified as action 1. Internal actions were defined as the increment on each component of the action vector. There was an extra internal action that reset the whole action vector to zero. The rewards were decided as follows. 
If the system made a correct action within a window of (−200 ms, 200 ms) around the end of an utterance, the system received a reward of 1. If the action was wrong, or no action was made within that time window, the system received a reward of −1. In all other cases, including the silence periods, the system received a reward of −0.001 (this schedule is restated in code at the end of this section).

The performance was evaluated as follows. Within a short period before or after the end of an utterance, if the system reacted correctly once or more than once, we counted it as a success. Otherwise it was an error. Fig. 5.4 shows that the system gradually and steadily improved its behavior through practice as time passed.

Figure 5.4: A real-time system: correct action rate vs. training time.

To examine the behavior of the system in more detail, we did a simulation experiment in the same way as the real-time system experiment above. The only difference was that the MFCCs were computed in advance and fed into the system manually. We call a round of feeding all the training data an epoch. The performance evaluated after each epoch is shown in Fig. 5.5, which has a trend similar to that of the real-time system. The confusion table of the expected behaviors vs. the actual behaviors is shown in Table 5.1.

Table 5.1: Simulation: confusion table for the test (expected behavior vs. actual behavior).

            /A/   /E/   /i/   /c/   Rejection
  /A/       125     3     0     5     7
  /E/         1   127     9     0     3
  /i/         0     5   134     0     1
  /c/         5     0     0   133     2

Figure 5.5: Simulation: correct action rate vs. epochs.

To take a closer look at the behavior of the system, we traced one of the primitive prototypes in the IHDR tree that corresponded to the vowel /A/. All of this state's internal action values were set to zero initially, except for that of the extra "reset" action, whose initial value was 0.1. This meant that the system preferred to do nothing at the very beginning. As the development process went on, the Q-value of internal action 1 increased while those of the other internal actions decreased (Fig. 5.6). This suggests that the system gradually and reliably figured out that it should increase the first component of the action vector (as defined by the meaning of the internal actions) under this state, which was exactly what we wanted.

Figure 5.6: Simulation: smoothed Q-values of the internal actions of one of the states vs. the number of times it was updated.
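For reference, the reward schedule described above can be restated compactly; the function below is an illustrative paraphrase of the rules in the text, not code from the system.

```python
def reward(at_utterance_end, acted, correct):
    """Reward schedule of Section 5.3: +1 for a correct action within
    (-200 ms, +200 ms) of the end of an utterance, -1 for a wrong or
    missing action in that window, -0.001 everywhere else (incl. silence).
    """
    if at_utterance_end:
        return 1.0 if (acted and correct) else -1.0
    return -0.001
```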
5.4 Experiments on selective attention learning

In the experiments presented in Section 5.3, the attention selection signals were set in advance so that the outputs from all sensory mapping layers were selected. We did this because we observed that selecting all layers performed better than any single layer for this task, as shown in Table 5.2.

Table 5.2: Selective attention: correct rate (CR) when using different sensory mapping layers.

  Layer   1       2       3       4       All
  CR      88.6%   89.7%   83.3%   78.25%  92.7%

We attribute this phenomenon to two causes. On one hand, a higher layer loses some information when the higher-order principal components are ignored; hence the performance of layer 4 degraded compared with the lower layers. On the other hand, by covering different lengths of context, different layers extracted uncorrelated information, which benefited the final decision making in different cases. Hence, their combination gave better performance than any single layer.

To specifically test the system's attention selection capability, we collected speech data of ten numbers (1-10) from five people, which had more context information than the vowel data. The simulation experiments were conducted in a way similar to those of the last section. The differences were that the sensory mapping module here had only two layers, and that the system had one more internal action, to select the output from one of the layers. After the system was trained for ten epochs, the correct rate of the external actions improved steadily (Fig. 5.7). We also examined and recorded the system's choice of attention at the times the system produced external behaviors.

Figure 5.7: Selective attention: overall performance vs. training epochs.

Table 5.3: Frequency of the attention selection behavior per external behavior.

  External behavior   1     2     3     4     5     6     7     8     9     10
  1st layer           0     1.00  0.73  0.83  0.66  0.74  0     0.63  0.25  0.40
  2nd layer           1.00  0     0.27  0.17  0.33  0.26  1.00  0.38  0.75  0.60

In Table 5.3, the first line shows the external behaviors represented by their corresponding number utterances. The second and third lines show the frequency with which the system selected the output from the first or the second layer of the sensory mapping when firing external behaviors. An interesting thing happened for external behaviors 1 and 7, as we expected: the system exclusively chose the second layer when making these decisions. On reflection, one realizes that the tail parts (about 200 ms) of the utterances of "one" and "seven" are similar. Therefore, the first layer did not cover enough context to distinguish these two words, and the system chose the second layer to make the decision. For "nine" and "ten", which also have similar tail parts, the system showed its preference for the second layer too. In other words, the attention selection behavior was successfully acquired.

5.5 Experiments on the SAIL robot

In the experiments presented above, we showed that a real-time artificial system can learn to react to auditory inputs through interaction with the environment. However, the vocabulary size was very small and the behaviors were simple. In this section, we present work done on a real robot that learned to follow more complicated verbal commands.

5.5.1 Preliminary experiment on number data

Before working on the SAIL robot, we did a preliminary experiment on a larger vocabulary. The settings were the same as those in Section 5.3, except for the following changes:

• The effector of the system was represented by a 10-D action vector. Ten desired behaviors were defined, one for each of the ten numbers ("one" to "ten"). Behaviors were still identified by the component of the action vector with the maximum value.

• To simulate the situation a robot would face in the real world, the eigenvectors in the sensory mapping were derived from more auditory data: 5 hours of radio programs that included both news and music.

• The training procedure consisted of two phases, supervised learning followed by reinforcement learning.
  In supervised learning, the imposed actions were given to the system at the end of each utterance. The settings for reinforcement learning were the same as before.

• The auditory data were collected at the same time as the vowel data. Each of the 63 persons made five utterances for each of the ten numbers, "one" to "ten". There was a silence of about 0.5 s between two consecutive utterances. The data set had in total 3150 isolated utterances.

The test was done using 5-fold leave-one-out cross-validation. The results, summarized in Table 5.4, show that our system reliably responded to the complicated auditory inputs after online interactive learning. Fig. 5.8 shows part of the resulting IHDR tree.

Table 5.4: Results on number recognition.

  External behavior   Correct rate (%)   Incorrect rate (%)   Rejection rate (%)
  1                   98.4               1.6                  0
  2                   95.2               3.2                  1.6
  3                   93.7               3.2                  3.2
  4                   96.8               3.2                  0
  5                   95.2               4.8                  0
  6                   93.7               3.2                  3.2
  7                   96.8               3.2                  0
  8                   92.1               3.2                  4.8
  9                   93.7               3.2                  3.2
  10                  93.7               3.2                  3.2
  Average             94.9               3.2                  1.9

Figure 5.8: Part of the IHDR tree after training with the number data set. Shown in each image is the x-cluster mean.

5.5.2 Real robot experiments

To teach the SAIL robot, in the supervised learning phase a trainer spoke a command C to the robot and then imposed a desired action A by pressing a pressure sensor or a touch sensor linked to the corresponding effector. In reinforcement learning, a "good" button and a "bad" button were used to simulate appetitive and aversive sensors, respectively. The reward was decided in the same way as in the experiments presented in Section 5.3. We again gave the SAIL robot a default attention selection behavior that selects all sensory mapping layers.

The training process was conducted online, in real time, through physical interactions between a trainer and the SAIL robot. After being trained for 15 minutes, the SAIL robot could follow commands with about a 90% correct rate. Table 5.5 summarizes the performance of the SAIL robot when it was guided by verbal commands to navigate through the corridors of the Engineering Building at MSU (see Fig. 5.9). The arm and eye commands were issued ten times each, at different locations.

Table 5.5: Performance of the SAIL robot in the one-trainer case.

  Commands     Total times   Correct rate (%)
  Go left      35            97.1
  Go right     23            91.3
  Forward      65            93.8
  Backward     7             100.0
  Freeze       5             80.0
  Arm left     10            100.0
  Arm right    10            90.0
  Arm up       10            100.0
  Arm down     10            100.0
  Hand open    10            90.0
  Hand close   10            90.0
  See left     10            100.0
  See right    10            100.0
  See up       10            100.0
  See down     10            100.0

To further test the system's capability to deal with speaker variation, we conducted a multi-trainer experiment. Eleven persons participated in training. Each spoke each of the 15 commands five times, which resulted in 825 utterances. The speech data (four of the five utterances of each command) were fed into the SAIL robot off-line, appended with the appropriate actions at the end of each utterance. The SAIL robot, with its partially trained "brain", then started to run in real time. Then a trainer, acting as the 12th trainer, taught the SAIL robot through physical interactions four times for each command. In this way, we simulated the situation in which a partially developed SAIL robot continuously develops its audition-driven behaviors.
After training, the 12th trainer tested the SAIL robot by guiding it through the second floor of the Engineering Building, just as was done in the one-trainer case. The performance is summarized in Table 5.6. More trainers introduced more variance into the speech data. The results show that the performance of the SAIL robot in the multi-trainer case degraded a little compared with the one-trainer case, but it was still reasonable.

Table 5.6: Performance of the SAIL robot when following the 12th trainer's commands in the multi-trainer case

  Commands     Total times   Correct rate (%)
  Go left      36            88.9
  Go right     28            89.3
  Forward      70            92.8
  Backward     8             87.5
  Freeze       9             88.9
  Arm left     10            90.0
  Arm right    10            90.0
  Arm up       10            100.0
  Arm down     10            100.0
  Hand open    10            90.0
  Hand close   10            80.0
  See left     10            100.0
  See right    10            100.0
  See up       10            100.0
  See down     10            100.0

The performance for the other trainers was evaluated off-line using the left-out utterances; it is summarized in Table 5.7.

Table 5.7: Performance of the SAIL robot on off-line test data in the multi-trainer case

  Commands     Correct rate (%)
  Go left      94.5
  Go right     89.9
  Forward      92.7
  Backward     100.0
  Freeze       100.0
  Arm left     100.0
  Arm right    90.9
  Arm up       96.3
  Arm down     92.7
  Hand open    89.9
  Hand close   89.9
  See left     90.0
  See right    92.7
  See up       100.0
  See down     100.0

Figure 5.9: Engineering Building 2nd floor at MSU.

To our knowledge, this is the first work on online speech learning without a pre-designed model and with the number of words and the number of speakers totally open.³ Although the performance described above does not match that of traditional SR systems in terms of vocabulary size, this work has made solid progress in a very difficult new learning mode.

³The work in [107] was for recognizing five vowels in a simulated online mode.

5.5.3 Speed issues

Execution time is a very important issue for a real-time system. In each computation loop, the SAIL robot needs to collect the sensory data from the sound card and the touch and pressure sensors, compute the MFCCs for each auditory frame, construct and retrieve the IHDR tree, and submit the control signals to the controllers of the effectors. As each speech frame covers 256 points with a 56-point overlap, and the auditory digitization frequency is 11.025 kHz, the above computations should be finished within 18.1 ms, which is a very tight time budget.
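The 18.1 ms figure follows directly from the frame geometry: consecutive frames advance by 256 - 56 = 200 fresh samples, and 200 samples at 11.025 kHz last about 18.1 ms. A minimal check:

```python
FRAME_LEN = 256   # samples per speech frame
OVERLAP = 56      # samples shared by consecutive frames
FS = 11025        # auditory digitization frequency, Hz

hop = FRAME_LEN - OVERLAP            # 200 fresh samples per frame
budget_ms = 1000.0 * hop / FS        # time available per computation loop
print(f"per-loop budget: {budget_ms:.1f} ms")   # prints 18.1 ms
```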
While the system worked smoothly in this experiment, we wanted to keep track of its speed performance. We recorded the execution time of the MFCC calculation and of the IHDR tree construction/retrieval in each loop, which were the most time-consuming portions (Fig. 5.10). The average execution time is shown in Table 5.8. As can be seen, the MFCC calculation took the major share of the execution time, and the total average time was well under 18.1 ms. The shape of the resulting IHDR tree is shown in Fig. 5.11, where the horizontal axis is the depth of the tree and the vertical axis is the number of nodes.

Table 5.8: The SAIL robot: the average execution time in each loop (one-trainer case)

                         Computation of MFCCs   IHDR tree retrieval   Total
  Ave. exec. time/loop   8.3 ms                 0.59 ms               8.9 ms

Figure 5.10: The SAIL robot: the execution time in each loop (one-trainer case). (a) MFCC calculation; (b) IHDR tree construction/retrieval; (c) total execution time.

Figure 5.11: The SAIL robot: the shape of the IHDR tree in the one-trainer case. The horizontal axis specifies the depth of the tree and the vertical axis specifies the number of nodes.

5.6 Experiments on communicative learning

The above experiments demonstrated a process of simple language acquisition with rich semantics and simple syntax.⁴ The semantics was naturally acquired because the speech was learned within the physical context. Upon learning to follow verbal commands, the SAIL robot was taught to perform vision-guided navigation through communicative learning.⁵

⁴This is typically the case for a child.
⁵This work was done jointly by Wey Hwang and me.

Communicative learning includes two phases. The first phase is language acquisition, which was achieved above. The second phase is teaching using language (Fig. 5.12). Using the trained commands, the trainer guided and instructed the robot to go through the entire second floor of the Engineering Building at Michigan State University. The actions executed under the corresponding voice commands were associated with the visual context by another IHDR tree in real time. Fig. 5.13 shows some of the images seen by the SAIL robot. Later, when the robot saw a similar visual context, it would retrieve the IHDR mapping and find the practiced action. In the experiment presented here, the switching of attention between vision and audition was an innate (programmed-in) volume-based behavior of the robot, and a voice command had higher priority in controlling the robot than the learned vision-guided controls. After training, the SAIL robot was tested for ten turns in the Engineering Building, until the robot's batteries ran low. It needed further voice instructions at only five locations, which were difficult for vision-guided navigation because of changing lighting (windows) or very narrow corners. The communicative learning reported here saved the trainer a great deal of training effort through the use of verbal commands.

Figure 5.12: The SAIL robot running autonomously in the hallway of the Engineering Building of Michigan State University.

Figure 5.13: Examples of the image sequence seen by the SAIL robot.

In the next chapter, we will discuss a more sophisticated architecture for communicative learning: learning complex behaviors through verbal instructions after acquiring simple ones.

5.7 Conclusions

We have demonstrated that it is feasible for an artificial system to develop its audition-driven behaviors through online, real-time interaction with the environment. This work is among the few that enable a machine to learn directly from unsegmented and unlabeled speech streams, a mode in which human children learn.

IHDR played an important role in the success of the experiments. It automatically derives discriminant features and thus automatically generates internal representations. The behaviors of the robot were developed through a unified learning strategy. Both external behaviors, such as body movement, and internal behaviors, such as selective attention, were acquired.
Our experiments show that after a short process of grounded simple language acquisition, the robot can produce desired behaviors upon hearing verbal commands. We have also shown that this simple language capability can later be used for other behavior learning, such as vision-guided navigation. In the next chapter, we report more results on teaching a robot to conduct complex behaviors through verbal instructions.

Chapter 6

Task Transfer

6.1 Introduction

Task transfer is the capability of autonomously applying what has been learned in one context to new contexts (tasks) [13]. It is more than memorizing and learning. Here is an example of task transfer. Suppose that a tiger in a circus has learned two skills: (1) jump onto the table top immediately after hearing the "Table!" command from the trainer; (2) jump through the fire ring immediately after hearing the "Ring!" command. The new task is to perform (1) and (2) in sequence after hearing a new command, "Start!", without the stepwise commands "Table!" and "Ring!"

Task transfer is an important capability for testing the success of what is called cognitive learning by humans and animals. Cognitive learning is considered a higher-level form of learning than classical conditioning [63] and instrumental conditioning [53]. Unlike classical conditioning and instrumental conditioning, no reinforcement is involved in cognitive learning. As far as we know, this is the first work to formulate and realize task transfer in the AMD paradigm.

Symbolically, classical conditioning has been described by the following so-called acquisition procedure (A.P.):

    CS → US → UR  ⇒  CS → CR                                    (6.1)

where CS stands for conditioned stimulus (e.g., tone), US for unconditioned stimulus (e.g., food), UR for unconditioned response (e.g., salivation), CR for conditioned response (e.g., salivation), → means "followed by," and ⇒ means "develops." Other conditioning protocols that have been widely studied include secondary conditioning, where one CS is preceded by another CS,

    CS2 → CS1 → CR  ⇒  CS2 → CR                                 (6.2)

and instrumental conditioning, where the association between a stimulus (S) and a response (R) is affected by the positive or negative reinforcement (reward) that follows.

The major problems of the traditional computational models of these animal autonomous learning phenomena include:

1. They are symbolic models, in that the stimuli and responses are treated as atomic entities [88] [89] [43] [4] instead of spreading over real-time sensory streams. See [5] for an excellent survey. Some of these models have been implemented on robots. For example, in [94] [83], a robot functioned upon a set of perceptions and behaviors that were defined in advance. In [87], a mobile robot learned to associate the visual sensations of predefined classes (blob and stripe patterns on cubes) with predefined action sequences. Such computational models are not directly applicable to multimodal real sensory streams, just as a text language model does not directly apply to raw auditory signals.

2. They are ad hoc in nature. A model for instrumental conditioning is not applicable to classical conditioning, and vice versa. There is a dilemma of model applicability when they are applied to a general autonomous learning setting: which model is applicable at any given time?

3. They overlook the role of autonomous attention on the learning agent's part. Traditionally, task-related information is manually extracted for symbolic description, as in A.P.s (6.1) and (6.2), without modeling why only this information is relevant among the many other environmental stimuli and bodily internal signals, e.g., motor signals.
In this chapter, we augment the architecture presented in Chapter 5 and make it applicable to all kinds of associative learning, including classical conditioning, secondary conditioning, instrumental conditioning, and the more general class, cognitive learning. In order to fully explain how an AMD agent works, instead of partially modeling one kind of learning, this architecture also captures the important cross-modality attention mechanism in the process of autonomous learning.

Figure 6.1: (I) The SAIL robot, built in-house at Michigan State University. (II) Behaviors represented as gripper-tip trajectories of the SAIL robot. (a)-(d): Individual behaviors as petal drawings, each of which starts from the black dot. (e)-(g): Drawings consisting of more than one petal, as behaviors developed through multiple task transfers.

6.2 Problem description

Suppose a robot has learned to draw a correct petal following the commands "left," "right," "upper," and "lower," respectively, as shown in Fig. 6.1 (II). Now the teacher wants the robot to transfer the learned skills of drawing individual petals to a new task: drawing a flower consisting of multiple petals, triggered by a new command. Of course, these basic skills can be transferred to any number of new composite tasks, each corresponding to a new command.

6.2.1 Principles of autonomous learning

Although the literature in psychology has clearly classified several animal learning types, such as classical conditioning, instrumental conditioning, and cognitive learning, they are all enabled by a single brain architecture. The challenging issue here is to investigate that architecture and implement it, to test whether it works on a developmental robot. We hold the view that the different learning types described in psychology are enabled by a common set of principles, involving context, association, priming, attention, and value.

To see these important principles, we illustrate task transfer and classical conditioning in an unconventional way in Fig. 6.2. In the figure, a circle in the upper row corresponds to a sensation context and a circle in the lower row corresponds to an action context. In the same row, from left to right, different circles denote contexts at different time instances. A context at a time instance is denoted by two vertically aligned circles: a sensation context in the upper row and an action context in the lower row.

Figure 6.2: A new view of task transfer and classical conditioning. Two vertically aligned circles constitute a context at a particular time. Solid arrows indicate previously established associations, and dashed arrows denote newly established associations through priming.

The attention control at each time instance determines which part of the current context is attended, forming the current attended context. If an attended context C1 is repeatedly followed by another attended context C2 in time, an association from C1 to C2 is established in memory, where C1 and C2 can be a sensation context, an action context, a part of each, or a combination thereof. Once such an association is established, C2 will be primed whenever there is a context similar to C1.
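As a deliberately simplified illustration of the association and priming principles, the following sketch stores an association whenever one attended context is repeatedly followed by another, and primes the stored successor whenever a sufficiently similar context recurs. The vector representation, the cosine-similarity matching, the threshold, and the repetition count are assumptions made for this illustration only; the actual system uses the IHDR trees described below.

```python
import numpy as np

class AssociativeMemory:
    """Toy context association: C1 repeatedly followed by C2 => C1 primes C2."""

    def __init__(self, sim_threshold=0.9, min_repeats=2):
        self.pairs = []                  # list of [c1, c2, count]
        self.sim_threshold = sim_threshold
        self.min_repeats = min_repeats

    @staticmethod
    def _similar(a, b, thresh):
        # cosine similarity as a stand-in for the system's prototype matching
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return sim >= thresh

    def observe(self, c1, c2):
        """Record that attended context c1 was followed by attended context c2."""
        for pair in self.pairs:
            if self._similar(pair[0], c1, self.sim_threshold) and \
               self._similar(pair[1], c2, self.sim_threshold):
                pair[2] += 1
                return
        self.pairs.append([np.asarray(c1, float), np.asarray(c2, float), 1])

    def prime(self, c):
        """Return the primed successor context for c, if an association exists."""
        for c1, c2, count in self.pairs:
            if count >= self.min_repeats and self._similar(c1, c, self.sim_threshold):
                return c2
        return None
```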
The human brain, however, does not just prime the context for the next time instance; it also primes farther ahead, as indicated by the dashed arrows in Fig. 6.2.

Some associations are innate (e.g., the food-to-salivation association in classical conditioning) while others are learned. If C1 corresponds to a sensation and C2 corresponds to an action, the value of C2 estimated by the value system (also called the motivational system) is crucial in determining whether the action should be sent to the motors for execution. The value system of a developmental agent starts with an innate value system at birth, which prefers appetitive stimuli (e.g., a sweet taste) and avoids aversive stimuli (e.g., a pain sense). The added learned value system can greatly enrich and alter this inborn value system.

Our SAIL developmental robot is based on the above theory of human and animal learning, which differs from the traditional type-specific models of animal learning. This theory serves as a guide in the development of the SAIL general architecture. However, designing a practical architecture from the above theory faces a series of challenges. We discuss two of them here: the challenge of numerical representation, and the challenge of missing contexts.

6.2.2 Numerical representations

It is important to note that each circle in Fig. 6.2 corresponds to a realization of a random process; in other words, the (discrete) experience of a developmental robot after "birth" can be represented by a series of random vectors,

    E = {C(t) = (X(t), A(t)) | t = 0, 1, 2, ...},

where C(t) denotes the context random vector at time t, and X(t) and A(t) denote the sensory and action contexts, respectively. The context is the part of the information that the agent attends to at time t. It is stored in working memory. Since hand-segmentation and hand-labeling of sensory streams are not allowed in the AMD mode, we cannot use a symbolic label to represent a context. Instead, each context is a numerical vector representing information in (sensory and motor) space and time (the last few context vectors in time, possibly processed).

At each time instance t, the sensory input s(t) is a long vector collected by the real sensors, such as cameras, microphones, and touch sensors, as well as by internal sensors such as those that sense the selective-attention effectors, covering the last several time frames, t-k, t-k+1, ..., t-1, t. The dimension of s(t) ranges from a few hundred (for audition) to a few thousand (for vision) and beyond. With the same sampling rate, vectors of control signals are sent to the motor controllers. Depending on the number of motors, the vector of control signals may vary from 1 to a few dozen components in the case of the SAIL robot.

The numerical representation is motivated by the distributed neural coding in biological brains. An atomic symbol loses its similarity information with respect to other symbols, but a numerical vector representation does not. The high-dimensional vector representation potentially has a higher generalization capability than the symbolic one, as the number of classes is dynamic and virtually unlimited. Therefore, a numerical representation enables the same program to learn new stimuli and behaviors. However, it also poses greater challenges than symbolic representations. First, many different sensory contexts X(t) require the same action vector; for example, different people have different voices but can speak the same word. Second, different action contexts A(t) are perceptually equivalent, just as a word can be handwritten differently each time. Third, it is not known which part of the sensory input is related to the action output; for instance, the appropriate temporal length of a context vector is hard to determine at "birth" time.
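To make the representation concrete, a last context stacks the most recent sensory frames and sensed actions into one long vector. A minimal sketch follows; the window length and per-frame dimensions are arbitrary choices for illustration, not the system's actual sizes.

```python
import numpy as np
from collections import deque

K = 5            # number of recent frames kept in the context window (illustrative)
AUDIO_DIM = 13   # e.g., one MFCC vector per auditory frame (illustrative)
ACTION_DIM = 3   # e.g., a 3-joint arm (illustrative)

audio_window = deque(maxlen=K)
action_window = deque(maxlen=K)

def last_context(audio_frame, sensed_action):
    """Stack the last K sensory and action frames into one long vector C(t)."""
    audio_window.append(np.asarray(audio_frame, float))
    action_window.append(np.asarray(sensed_action, float))
    pad = K - len(audio_window)          # zero-pad right after "birth"
    x = np.concatenate([np.zeros(pad * AUDIO_DIM)] + list(audio_window))
    a = np.concatenate([np.zeros(pad * ACTION_DIM)] + list(action_window))
    return np.concatenate([x, a])        # the context C(t) = (X(t), A(t))
```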
To keep the discussion clear, we will still use symbols to denote random processes, although the internal representation does not use any symbols as concepts. With this notation, the task transfer process can be written as the following A.P.:

    Cc → Cs1 → As1 → Cs2 → As2  ⇒  Cc → As1 → As2              (6.3)

where Cc is a composite command (e.g., "Start"), Cs1 and Cs2 are simple commands (e.g., "Left"), and As1 and As2 are the actions (e.g., drawing individual petals).

6.2.3 Missing context

Another challenge in task transfer comes from the missing contexts: the commands Cs1 and Cs2 in A.P. (6.3). Logically, Cc should not elicit As1 → As2 without the presence of Cs1 and Cs2. The teacher wants to make use of the learned skills, Cs1 → As1 and Cs2 → As2, to save training effort, so on the learner side there must be some mechanism to handle the missing contexts. The study of classical conditioning shows that animals take advantage of the causal property of the physical world (its orderly event sequences) to achieve task transfer, which is a general mechanism fulfilling our task-nonspecific requirement. However, because of the numerical representation, Cs1 and Cs2 are random processes with varying durations. A powerful architectural design is needed.

6.3 Single-level architecture

In the previous chapters, we proposed an architecture that handles numerical representations, which enabled the SAIL robot to conduct grounded speech learning. Without any task-specific information available, such as pre-defined acoustic models, the SAIL developmental program generated internal representations and architecture autonomously, according to the events encountered. All the learning processes of SAIL were conducted online, in real time, through physical interactions between trainers and the SAIL robot. It is on the basis of such acquired sensorimotor skills that the task transfer reported here can take place.

6.3.1 Handling the missing context

The cognitive mapping module of the architecture presented in Chapter 3 (Fig. 6.3) is essentially a reflexive one, which produces the corresponding behavior output given a context input. It cannot look ahead and predict sufficiently far. If any context is missing, the IHDR tree will give a badly matched state, and the agent's behavior is not predictable.

Figure 6.3: The cognitive mapping module.

To handle the missing context, we augmented the reflexive cognitive module by adding another IHDR tree (Fig. 6.4). The new tree is identical to the original one, except that it is associated with a prototype updating queue (PUQ). We call the original tree the reality tree, or R-tree, and the added one the priming tree, or P-tree. The goal of the PUQ for the P-tree is to enable a looking-ahead (farther priming) mechanism. The PUQ maintains a list of pointers to the primed contexts retrieved by the P-tree.
A primed context is a sensation-action-value tuple, p = (x, a, Q), where the primed sensation (mental image) x is useful for developing a high-level value system and for planning, the primed action a is a possible action at the current state, and Q is its value, estimated by the value system. At every time instance, a pointer to a newly retrieved primed context enters the PUQ while the oldest one moves out. While the pointers are kept in the PUQ, the primed contexts they point to are updated with a recursive model adapted from Q-learning [98]:

    p^(n)(t) = p^(n-1)(t) + ((1+l)/n) [γ p^(n-1)(t+1) - p^(n-1)(t)]        (6.4)

where p^(n)(t) is the primed context at time instance t, n represents the number of times p^(n)(t) has been updated, and γ is a time-discount rate. l is an amnesic parameter introduced by us (e.g., l = 2), which is typically positive and is used to give more weight to newer data points.

Figure 6.4: The enhanced cognitive mapping.

Reorganizing Eq. (6.4), we have

    p^(n)(t) = ((n-1-l)/n) p^(n-1)(t) + ((1+l)/n) γ p^(n-1)(t+1)           (6.5)

which shows more clearly that a primed context p^(n)(t) is updated by averaging its last version p^(n-1)(t) and the time-discounted version of the current primed context p^(n-1)(t+1). In this way, the information embedded in the future context, p^(n-1)(t+1) in model (6.4), is recursively backpropagated into earlier primed contexts. Eq. (6.4) is therefore effectively a prediction model: when an earlier context is recalled, it contains the expected future information. Note that this kind of manipulation of contexts can only be done on a numerical representation, which makes the power of the numerical representation clear.

To view this effect more intuitively, we show the behavior of the prediction model on a simple example. Suppose the primed contexts appearing over time are represented by a series of scalars, where a scalar of value 1 means the primed context contains certain information and 0 means no information is embedded. An example of such a series is shown by the solid line in Fig. 6.5 (a), where there is information over five consecutive time instances (t = 55, 56, ..., 59) and nothing elsewhere. Applying model (6.4) with γ = 0.9 and l = 0, using a PUQ of size 30, we get the dotted line in Fig. 6.5 (a), where the information has been backpropagated, with the peak appearing at t = 54. In other words, at an earlier time instance, the model predicts about 60% of the information.

Figure 6.5: The behavior of the prediction model (6.4) with different γ, l, and PUQ sizes.

Three parameters influence the behavior of model (6.4): the time-discount rate γ, the amnesic parameter l, and the size of the PUQ. As shown in Fig. 6.5 (b), a larger γ helps the model to predict more of the information, while a larger l enables the model to predict longer into the future (by propagating the information back farther in time). The performance is not very sensitive to the PUQ size, but the PUQ should be long enough that a newly entered primed context can still affect the one that is about to be moved out of the PUQ.
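The scalar example is easy to reproduce. The sketch below implements model (6.4) over a sliding PUQ; the within-step update order is our choice, and the function reproduces the qualitative behavior of Fig. 6.5 (a), with the backpropagated information peaking just before the pulse.

```python
import numpy as np

def puq_backprop(signal, gamma=0.9, l=0, size=30):
    """Model (6.4): amnesic, time-discounted backpropagation of primed contexts.

    Each value is updated while it stays in a PUQ of the given size,
    averaging in the time-discounted value of its successor.
    """
    p = np.asarray(signal, float).copy()
    n = np.ones_like(p)                    # update counts per time instance
    for t_new in range(len(p)):            # t_new: context entering the PUQ
        start = max(0, t_new - size + 1)   # oldest context still in the PUQ
        for t in range(t_new - 1, start - 1, -1):
            n[t] += 1
            p[t] += (1 + l) / n[t] * (gamma * p[t + 1] - p[t])
    return p

pulse = np.zeros(80)
pulse[55:60] = 1.0                 # the input of Fig. 6.5 (a)
out = puq_backprop(pulse)
print(round(out[54], 2))           # information predicted just before the pulse
```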
With the prediction model (6.4), the process of handling a missing context can be viewed in Fig. 6.6. At each time instance, the robot decides the primed context, especially the primed action, based upon the observed last context. At the same time, the operations in the PUQ cause the primed contexts to be backpropagated over time. When there is a missing context, e.g., missing "Left," "Right," and "Upper," the drawings will still be executed, because of the newly established associations between last contexts and primed contexts in the P-tree. Therefore, one way to look at the roles played by the R-tree and the P-tree is as follows: both trees map the last context to a primed context; the R-tree does the immediate mapping, while the P-tree maps with prediction into the near future.

Figure 6.6: Handling the missing context.

6.3.2 The value system

One problem of having a double-tree architecture in the cognitive mapping module is decision making: when both trees provide action vectors, which one should the agent choose and send to the effectors? This is the responsibility of a value system. Subsumption [10] is a simple example of a value system, and the state-action value in Q-learning is another. A sophisticated value system should be developed from experience.

In the value system of the presented work, each primed context includes a value, Q(p). It is assigned an initial value when a prototype is stored in the IHDR tree: if an action is imposed at the time, the initial value is set to 1; otherwise, it is 0. Since the prediction model (6.4) is also applied to this value when the primed context enters the PUQ, the value is distributed among consecutive primed contexts. In addition to this value, another factor affecting decision making is the goodness of match d(s), the Euclidean distance between the current state s and the prototype state retrieved by the tree. A small d(s) means that the agent recalls the new context with high confidence and, therefore, the decision is reliable. Putting the two together, we build a quantity called the confidence index,

    c(p) = Q(p) / (d(s) + ε)

where ε is a small number that avoids a zero denominator. The action selector in Fig. 6.4 simply compares the confidence indices associated with the action vectors given by the R-tree and the P-tree. The one with the higher confidence index gets through and goes to the corresponding effector.
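A minimal sketch of the action selector, assuming the ratio form of the confidence index given above:

```python
def confidence_index(Q, d, eps=1e-6):
    """High value Q and a close prototype match (small d) both raise confidence."""
    return Q / (d + eps)

def select_action(r_primed, p_primed):
    """Pick the primed action of the tree with the higher confidence index.

    Each argument is an (action_vector, Q, d) triple retrieved from the
    R-tree or the P-tree, respectively.
    """
    (a_r, q_r, d_r), (a_p, q_p, d_p) = r_primed, p_primed
    if confidence_index(q_r, d_r) >= confidence_index(q_p, d_p):
        return a_r   # the reflexive (reality) mapping wins
    return a_p       # the predictive (priming) mapping wins
```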
6.3.3 Mechanism for cross-modality attention

We now have two trees doing different mappings and a value system selecting between their outputs. The robot still cannot do task transfer without a cross-modality attention mechanism.

In the work here, we consider only one complex sensory modality, audition. There is one more internal sensing modality, the sensation of the agent's own action. Each voluntary effector needs a corresponding sensor. This sensation is very important because it informs the agent of its current status, which may affect its later behavior. One way of treating these two modalities is to put the sensation vectors together as a single last context entering the IHDR tree. However, as we will see, such a naive approach does not work.

First, we need to understand that a last context covers a short duration instead of a single time instance. That means that even when an utterance has finished, some non-silent auditory sensation is still presented to the robot. The same holds for the action sensation when an action lasts more than one time instance. This is shown in the legend of Fig. 6.7. We represent each last context by a color bar. Since there are two sensing modalities, the last context at each time instance can be described by a pair of color bars. For example, the first last context shown in Fig. 6.7 is <indigo, white>, where indigo represents the last auditory sensation and white represents the last action sensation. The external action is also represented as a color bar in Fig. 6.7.

Figure 6.7: Comparison of a system with and without cross-modality attention. (This figure is presented in color.)

Notice that the sensed action is the external action at the last time instance; the action sensation is therefore a shifted bar sequence of the external action. In general, a white bar means no sound or no action is sensed or executed.

During training, the teacher gives all three commands: "start" (Cc), "one" (Cs1), and "two" (Cs2). The R-tree learns the immediate mapping from a last context to a primed context, which includes a primed action. This mapping is shown by the solid arrows in Fig. 6.7. The P-tree learns a mapping with prediction into the near future, as shown by the dashed arrows in Fig. 6.7. The choice to execute the primed action of the R-tree or of the P-tree is made by the value system.

Now consider a system running without cross-modality attention. During testing, after the utterance of "start," the P-tree successfully gives the mapping <indigo, white> → <red>, and the value system chooses the output of the P-tree as the external action. Everything is perfect so far. Unfortunately, the last context configuration at the next time instance, <white, red>, never showed up during training, leading to unpredictable behavior. The new task skill is not acquired.

A cross-modality attention mechanism resolves the problem. Notice that although <white, red> never appeared during training, the red action bar is always followed by the green action bar. Suppose the robot pays attention to its action sensation only (the red action bar) after it starts to move during training. Then the green and blue bars will not appear in the auditory sensation. Now we have the last context configuration <white, red>, which is mapped to the external action <green> by the R-tree. When action 1 is finished, the auditory sensation should be restored to sense the utterance of "two," and a similar switch to the action sensation is needed after action 2 starts to be executed. In general, it is unlikely that a global context will be matched; selective attention brings the related information to the recognition process.

In summary, we need a mechanism that can switch between the two sensation channels.
The principle is that the channel with the higher short-term variation is more salient and receives exclusive attention. To implement this, we designed a simple cross-modality attention module, the channel selector shown in Fig. 6.4. The standard deviation σ(t) of the magnitude m(t) of the sensation from each modality is computed recursively as

    σ(t) = ((t-1-l)/t) σ(t-1) + ((1+l)/t) |m(t) - m̄(t)|

where m̄(t) is the correspondingly maintained amnesic mean of m(t), and l is the amnesic parameter that gives more weight to the new samples. With an appropriate l, σ(t) represents the short-term variation of the sensation. The channel selector implements a "winner-takes-all" criterion by blocking the channel with the lower σ(t).

6.3.4 Algorithm

As a summary of the above, at each time instance the agent executes the following learning process. Since there are no explicitly separated training and testing sessions, the algorithm is executed all the time.

1. Collect the sensation from both the auditory sensor, x_s(t), and the action sensor, x_a(t).

2. The channel selector forms a single sensation vector (the last context), l(t) = (x_s(t), x_a(t)), by replacing x_s(t) or x_a(t) with a zero vector.

3. Both the R-tree and the P-tree conduct learning from l(t), using the IHDR tree.

4. The best-matched states, or prototypes, s_R(t) and s_P(t), are retrieved from the two trees, together with their associated primed contexts, p_R(t) and p_P(t), respectively.

5. If there is an imposed action, a_i(t), the primed action part of p_R(t) is set to a_i(t) and its value is set to 1. Otherwise, the primed action part of p_R(t) is set to 0 (the default non-action) and its value is set to 0, too.

6. Compute the confidence index for both trees and decide the external action, a_e(t). Its associated value is denoted Q_e(t).

7. The primed context p_P(t) of s_P(t) enters the PUQ, and each entry in the PUQ is updated according to model (6.4). Specifically, the sensation part is updated using s_P(t), the action part using a_e(t), and the value part using Q_e(t).

8. Send a_e(t) to the corresponding effectors.
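Step 2's channel selector can be sketched as follows. Treating m(t) as the norm of each modality's sensation vector, and writing the deviation as |m(t) - m̄(t)|, is our reading of the recursion above; the warm-up handling is an implementation assumption.

```python
import numpy as np

class ChannelSelector:
    """Winner-takes-all attention between audition and action sensation."""

    def __init__(self, l=2.0):
        self.l = l                    # amnesic parameter
        self.t = 0
        self.mean = [0.0, 0.0]        # amnesic mean of m(t), per channel
        self.sigma = [0.0, 0.0]       # amnesic short-term variation, per channel

    def _update(self, i, m):
        t, l = self.t, self.l
        w_old, w_new = (t - 1 - l) / t, (1 + l) / t
        self.sigma[i] = w_old * self.sigma[i] + w_new * abs(m - self.mean[i])
        self.mean[i] = w_old * self.mean[i] + w_new * m

    def select(self, x_audio, x_action):
        """Zero out the less salient channel; return the combined last context."""
        self.t += 1
        if self.t <= self.l + 1:      # warm-up: keep both channels
            for i, x in enumerate((x_audio, x_action)):
                self.mean[i] = float(np.linalg.norm(x))
            return np.concatenate([x_audio, x_action])
        self._update(0, float(np.linalg.norm(x_audio)))
        self._update(1, float(np.linalg.norm(x_action)))
        if self.sigma[0] >= self.sigma[1]:
            return np.concatenate([x_audio, np.zeros_like(x_action)])
        return np.concatenate([np.zeros_like(x_audio), x_action])
```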
6.4 Multilevel Architecture

The system described above learns the basic actions through supervised learning. While supervised learning has the advantage of efficiency, it is tedious for the teacher and does not allow for "extinction" or teacher errors. For example, suppose the robot has been taught to lift its arm when hearing the command "one," and later the teacher does not want the robot to lift its arm any more; this is known in psychology as "extinction." Therefore, we need to add a reinforcement learning mechanism to the system. One might think that this can be done by simply feeding reinforcement signals to the system and modifying the prediction model for updating Q into the following one:

    Q^(n)(t) = Q^(n-1)(t) + ((1+l)/n) [r(t) + γ Q^(n-1)(t+1) - Q^(n-1)(t)]      (6.6)

where r(t) is the reinforcement signal at time t. In this way, one could tune the Q value in the primed context through the reinforcement signals and, consequently, tune the behavior of the system.

However, this straightforward strategy did not work well. Because of the prediction model (6.4) and the PUQ, the primed context of a particular primitive prototype is propagated to other prototypes, so that multiple primitive prototypes may have a similar primed context. In a real application, however, the exact context is not guaranteed to repeat, which means that not all the propagated primitive prototypes may show up in the following training session. As a result, while we may adjust the primed contexts of certain primitive prototypes, we may not be able to adjust all the propagated ones. This is the "abstraction-from-signals" issue in the new, challenging AMD mode.

To solve this problem, we designed the architecture shown in Fig. 6.8. We call the whole module of Fig. 6.4 a level building element (LBE). The new architecture has two levels of LBEs. All the LBEs run the algorithm of Section 6.3.4, except that the second-level one takes the primed sensations of the lower-level one as its input.

Figure 6.8: A two-level architecture of SAIL.

The underlying idea of the two-level system is as follows. Because of the prediction model (6.4), the primed sensation vectors are averaged versions of their future context vectors. This means that vectors which used to fall into different primitive prototypes in the first-level LBE, L1, may be grouped into one primitive prototype of the second-level LBE, L2, as shown in Fig. 6.9. As a result, even though it may be rare for a particular primitive prototype to be revisited in L1, it is very likely that L2 covers the context represented by that primitive prototype when a similar context appears.

Figure 6.9: Internal mechanism of the two-level architecture.

To make decisions on actions from multiple levels, L2 checks the Q value of the action that is closest to the one selected by L1. If it is larger than zero, the corresponding control signals are sent to the effectors; otherwise, the control signals are blocked and no action is taken. The Q value given by L2 enters L1 as an internal reward, so the behavior of L1 is tuned even when there is no constant environmental feedback. In the current implementation, the R-tree of L2 is not used; it will be used when a higher level is built. To assign the reward to the right context quickly, a credit-assignment criterion is preprogrammed: the reward is targeted at the starting point of a recently conducted action, as shown in Fig. 6.9.
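A minimal sketch of L2's gating role as just described; the data layout is a simplification of the IHDR retrieval, with L2's retrieved primed contexts given as plain (action, Q) pairs.

```python
import numpy as np

def l2_gate(l1_action, l2_primed_contexts):
    """Second-level gating: find the L2 primed context whose action is closest
    to the action proposed by L1; pass the action only if its Q value is
    positive, and return that Q as the internal reward for L1.

    l2_primed_contexts: list of (action_vector, Q) pairs retrieved in L2.
    """
    dists = [np.linalg.norm(np.asarray(a) - l1_action)
             for a, _ in l2_primed_contexts]
    _, q2 = l2_primed_contexts[int(np.argmin(dists))]
    if q2 > 0:
        external = l1_action                 # send L1's action to the effectors
    else:
        external = np.zeros_like(l1_action)  # block: the default non-action
    return external, q2                      # q2 also serves as internal reward
```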
6.4.1 Algorithm

In summary, the following is the developmental learning algorithm of the two-level system at each time instance.

1. Collect the sensation from both the auditory sensor, x_s(t), and the action sensor, x_a(t).

2. For L1, do steps 2 through 5 of the algorithm of Section 6.3.4. Denote the primitive prototypes retrieved by the R-tree and the P-tree as s_R1(t) and s_P1(t), respectively. Find the primed context with the highest confidence index among the primed contexts associated with s_R1(t) and s_P1(t), and denote it p1(t).

3. Take the primed sensation part of p1(t) as the input to L2 and do steps 2 through 5 of the algorithm of Section 6.3.4 for L2. Denote the primitive prototype retrieved by the P-tree as s_P2(t). Find the primed context of s_P2(t) whose primed action part is most similar to that of p1(t), and denote it p2(t). Let p2(t) enter the PUQ of L2 and update it according to model (6.4); for updating Q, use model (6.6).

4. If the Q value of p2(t) is larger than zero, send the primed action part of p1(t) to the corresponding effectors. Otherwise, send the zero vector to the corresponding effectors.

5. Let p1(t) enter the PUQ of L1 and update it according to model (6.4). For updating Q, use

    Q^(n)(t) = Q^(n-1)(t) + ((1+l)/n) [r(t) + Q2^(n)(t) + γ Q^(n-1)(t+1) - Q^(n-1)(t)]

where Q2^(n)(t) is the Q value of p2(t), acting as the internal reward from L2.

From the above discussion, we can see that the role of L2 is effectively that of a module evaluating the behaviors of L1. This means our system has some characteristics of the well-known actor-critic methods [90, page 151]. The major differences here are that (1) our method starts from raw sensors instead of symbolic input, (2) both the "critic" L2 and the "actor" L1 learn from the environment, and (3) the critic is not task-specific. This gives the system a quicker response to changes in the environment, by not letting the actor wait until the critic realizes the change. Another important characteristic of our system is that the "critic" and the "actor" have similar architectures and learn in the same manner, which makes the architecture more systematic. L2 was motivated by the higher-order cortex in the biological brain [42], but further discussion is beyond the scope of this chapter.

6.5 Experiments on AudioDeveloper

The task transfer performance of the agent model was validated on a real-time software agent called AudioDeveloper. This software testing and evaluation environment provides a detailed performance record that is not readily available from a real robot. AudioDeveloper has two kinds of sensors: a microphone and touch sensors (the buttons on the toolbar of the AudioDeveloper GUI). The touch sensors are used by the trainer to impose actions. The agent has an additional internal sensation for sensing its own actions. A simulated 3-joint robot arm is the effector. In the GUI (Fig. 6.10), the top panel shows the audio waves. The reactions and the imposed actions of the agent are shown in the third and fourth panels of the GUI, along the horizontal time axis. The time interval between two consecutive time instances is about 18.1 ms.

There are three simple behaviors (As1, As2, and As3) corresponding to three voice commands (Cs1 = "one," Cs2 = "two," and Cs3 = "three"), respectively. The behaviors are identified by the component of the action vector with the maximum value; for example, if the first component of the action vector has the maximum value, the behavior is identified as As1. The new task Ac is defined, in the teacher's mind, as As1 → As2 → As3, and the corresponding voice command (Cc) is "start." The developmental program was written without any task-specific information.

Figure 6.10: The GUI of AudioDeveloper. (a) During online learning; (b) after online learning.

Before learning the new task, the agent should be able to perform the simple actions. So the whole experiment is divided into two steps: (1) the agent learns simple skills following voice commands; (2) the agent learns complex skills following new voice commands through task transfer. All the learning in both steps is done in the mode required by AMD, i.e., through real-time physical interactions between the agent and the trainer.
As we discussed above, whether an interaction is for training or for testing exists only in the mind of the teacher; the same algorithm works in a single mode all the time. In the following discussion, when we mention a test session, we simply mean "freezing" any changes to the cognitive mapping module so that we may evaluate the system.

The training process of step (1) is as follows:

1. A voice command is spoken to AudioDeveloper.

2. At the end of the utterance, a corresponding action is imposed by pressing a button in the toolbar of AudioDeveloper.

3. Wait for a couple of seconds and go back to 1.

As long as two consecutive commands and actions are separated by a sufficient amount of time, it does not matter how long the separation time is. The above training process is a grounded speech learning step, which has been discussed in more detail in [111]. While a supervised learning procedure was conducted in this experiment, it can also be done with reinforcement learning [109]. After training, the voice commands were spoken to AudioDeveloper again with no actions imposed. As we will see in the experimental results below, AudioDeveloper correctly conducted the appropriate actions after the commands "one," "two," and "three."

The training process of step (2) is as follows:

1. A command, "start," followed by "one," "two," and then "three," is spoken to AudioDeveloper.

2. Wait for a couple of seconds and go back to 1.

This is essentially a process in which the trainer teaches the system to do the new task through verbal instructions. Fig. 6.11 shows a fraction of this training session. In Fig. 6.11, the upper panel presents the energy of the voice commands along the time axis. The next three panels display the 3D action vectors of the R-tree, the P-tree, and the enhanced cognitive mapping. The behaviors of the robot were identified by the component of the action vector with the maximum value; for example, if the first component of the action vector had the maximum value, it was identified as action 1. The fifth panel shows the confidence of the R-tree and the P-tree in their outputs. The bottom panel shows the attention of the system over the auditory sensation and action sensation channels.

The sequence of the external action vectors shows that the system responded to commands Cs1, Cs2, and Cs3 accordingly. Let us take a closer look at how the system behaved around time instance 100, when the command "one" was being spoken (Fig. 6.12). Note that the confidence subplot is shown in a semi-logarithmic way in this figure. At time instance 98, the P-tree started to show a significant trend toward conducting action As1, and it had a higher confidence than the R-tree, so the enhanced cognitive mapping fired the external action As1. Once the action was started, the attention was switched from the auditory sensation channel to the action sensation channel. With the sensation of the start of action As1, the R-tree began to have higher confidence and kept conducting action As1 until time instance 104, when the R-tree decided to stop the action. For a better view of the transition of the confidence of the R-tree and the P-tree between time instances 100 and 104, the reader is referred back to Fig. 6.11.

After training, when only the composite command "start" was given, the system successfully repeated the action sequence Ac (= As1 → As2 → As3), as shown in Fig. 6.13. Again, let us give the system's behavior a closer look in Fig. 6.14, which expands the time axis around time instance 460.¹

¹The confidence subplot is shown as semi-logarithmic.
Figure 6.11: A fraction of the training session of step (2).

During training, the utterance of "start" was always followed by the utterance of "one." Because of the backpropagation of the primed context, upon hearing "start" the P-tree learned to produce the primed context associated with "one." Therefore, without hearing "one" during testing, the P-tree started to show a significant trend toward executing action As1 after hearing "start" at time instance 464. Since the P-tree also had higher confidence than the R-tree at that time, action As1 was fired as the external action of the enhanced cognitive mapping. The attention was then switched to the action sensation channel at time instance 465, and the R-tree took over until action As1 was stopped at time instance 471, since it had a higher confidence (see Fig. 6.13 for an expanded view of the confidence).

Figure 6.12: A closer look at the training session of step (2).

With the numerical representation, an utterance of a voice command typically corresponds to multiple primitive prototypes saved in the IHDR tree, depending on the duration and the auditory changes of the utterance. During training, the primed contexts of all these prototypes may be changed because of the primed context backpropagation. Therefore, there was more than one case in which the P-tree could start action As1. If the simple action lasted too short a time, the simple action would be fired a second time. This is why we see the action repeated after the system finished it once, at time instance 472. In the real world, the duration of an action is typically longer than an utterance, so such action repetition would be avoided, as we will see in the experiment on a real robot below.
Figure 6.13: A fraction of the test session of step (2).

Similar to the case in which "start" elicited action As1: during training, the P-tree "saw" action As1 followed by the utterance of "two" many times. Therefore, the primed context of the utterance of "two" was backpropagated to the last context of seeing action As1. As a result, the P-tree started to show a high confidence for conducting action As2 at time instance 480, although "two" was not presented. The above process repeated until the end of action As3, which means the agent successfully learned to conduct the new task.

Figure 6.14: A closer look at the test session of step (2).

Multiple-user results. The above experimental results were obtained through the interaction between AudioDeveloper and one trainer. To test AudioDeveloper under wider variance, we conducted more experiments. To save human effort in training AudioDeveloper, we first built another software agent, AudioTeacher, as a "teacher." It can play pre-recorded sound and give imposed actions to AudioDeveloper. The auditory data were taken from the number data set contributed by 63 persons with a variety of nationalities (American, Chinese, French, Indian, Malaysian, and Spanish) and ages (from 18 to 50). Each person made five utterances for each of the ten numbers, one to ten. In the experiment here, we used only the utterances of "one," "two," and "ten," where the first two were the simple commands and the last one was the composite command. Among the five utterances of each number, three were used in training and two were used in testing. In effect, in the step (2) testing, only the utterances of the composite command were needed.

AudioTeacher was programmed to lead the two-step training process discussed above. Since we did not have enough speech data from each subject, the data were replayed after each round; we call one cycle through the data set an "epoch." In the step (1) training, one epoch was sufficient for AudioDeveloper to acquire the simple actions.
In the step (2) training, multiple epochs (reviews and practices) were needed for AudioDeveloper to acquire the new task. We stopped the training after 20 epochs. Starting from epoch six, the cognitive mapping was saved after each epoch for later performance evaluation. The entire training process took about 6 hours for the 63 subjects' data.

The performance of AudioDeveloper was evaluated as follows: after a composite command was spoken, if AudioDeveloper performed action As1 one or several times, followed by action As2 one or several times, we counted it as a success; otherwise, it was a failure. Table 6.1 shows the percentage of successful acquisitions of the new task after each epoch.

Table 6.1: Performance of AudioDeveloper trained by multiple users. C.R.1 and C.R.2 represent the correct rates for action As1 and the new task, respectively.

  Epoch No.   6      7      8      9      10     11     12     13
  C.R.1 (%)   42.86  57.14  60.32  71.43  74.60  74.60  74.60  77.78
  C.R.2 (%)   1.59   9.52   26.19  38.89  42.06  57.94  67.46  73.81

  Epoch No.   14     15     16     17     18     19     20
  C.R.1 (%)   77.78  77.78  77.78  77.78  77.78  77.78  77.78
  C.R.2 (%)   76.98  81.75  84.92  84.92  89.68  91.27  92.86

After 20 epochs, AudioDeveloper learned the new task reliably, with a correct response rate of about 93%. Also shown in the table are the success rates for action As1. As expected, As1 was acquired earlier than the new task. The reason its rate did not reach a very high percentage is that there were cases in which both action As1 and the new task were acquired within one epoch; in those cases, the success of acquiring action As1 was not counted.

The failed cases were related to three subjects. We examined them one by one and found that the reasons for failure were the same: these three subjects' utterances were very short. Since AudioTeacher played the utterances one after another in the step (2) training, the utterances of these three subjects tended to be very close together, which prevented them from being recognized correctly. If the simple commands are not recognized correctly, the step (2) training is not really effective. After adjusting the separation time between the utterances for these three subjects, the new task was successfully acquired after about ten epochs. A lesson we learned here is that it is important to provide sufficient pause time between tasks to avoid unwanted associations, a phenomenon well recognized in animal learning.

6.6 Experiments on the SAIL robot

SAIL, shown in Fig. 6.1 (I), is a human-size mobile robot built in-house at Michigan State University. It has a drive base, a six-joint robot arm, a neck, and two pan-tilt units, on which two CCD cameras (as eyes) are mounted. A wireless microphone functions as an ear. SAIL has four pressure sensors on its torso and 28 touch sensors on its eyes, arm, neck, and bumper. Its main computer is a dual-processor PII PC workstation with 512 MB RAM and an internal 27 GB three-drive disk array. All the sensory information processing, memory recall and update, and effector control are done in real time.

We conducted experiments on the SAIL robot by interacting with it through its auditory sensor, a microphone, and the micro-switch sensors on its arm. One of the switch sensors was defined as a biased sensor to accept reinforcement signals: if the reading of the biased sensor was positive, the reward was 1, and if the reading was negative, the reward was -1. Two other switch sensors were used to impose the four basic arm actions, drawing the four petals shown in Fig. 6.1 (II).

We first taught the robot the basic actions in the same way as in Section 6.5. Specifically, the training process went as follows: (1) the trainer spoke one of the voice commands ("one," "two," and "three," in this experiment); (2) at the end of the utterance, the trainer pressed a switch sensor of SAIL to impose one of the four actions; (3) wait for a couple of seconds and go back to (1).
The trainer might need to repeat some of the commands for SAIL to practice, but according to our observations, within three repetitions a single behavior could be established with a correct rate of 95-100%. This type of fast learning is not possible with iterative mapping learning algorithms such as artificial neural networks; it is possible here thanks to IHDR, which enables one-instance learning.

6.6.1 Behavior extinction

We first report an experiment showing how to extinguish the established behaviors of the robot. Suppose the robot has learned to draw the upper petal after hearing "one" and, later, the teacher no longer wants it to move the arm after the same command. The training process goes as follows: (1) the trainer speaks "one"; (2) if the robot moves its arm, the trainer presses a reward switch sensor to give a punishment; (3) wait for a couple of seconds and go back to (1).

To trace the internal changes of the robot, we saved the information related to decision making at every time step, at an interval of about 20 milliseconds. The upper panel of Fig. 6.15 shows the sound volume along the time axis. The vertical dashed lines show the time instances when the robot started the upper-petal drawing; among them, the first one is when the trainer imposed the action. The lower panel of Fig. 6.15 shows the changes of the Q values of two actions associated with a particular primitive prototype. A non-action means the robot stays still, and action 1 means upper-petal drawing. In the real-time experiment, there were typically more than one thousand primitive prototypes saved in each IHDR tree. We got hold of the best-matched primitive prototype whenever it was retrieved; for the time instances when that particular primitive prototype was not retrieved, we simply show its Q value from the previous retrieval. Therefore, the Q value of a particular prototype-action pair changed only when the prototype was visited. This is also the reason the changes of the Q values show up much later than the reinforcement signals. Nevertheless, we do see the Q value of action 1 starting to grow after the action was imposed, while that of the non-action dropped; the robot gradually came to prefer action 1 to the non-action. After receiving a punishment reinforcement signal, at the time instance shown by the vertical dashed line in the lower panel of Fig. 6.15, the robot's preference was reversed by the changing Q values. From the upper panel, we can see that the previously established behavior was extinguished thereafter.

Figure 6.15: The behavior changes of SAIL with the two-level architecture.

Figure 6.16: The behavior changes of SAIL with the one-level architecture.

For comparison, we did the same experiment on a single-level system. As shown in Fig. 6.16, although that system can establish the actions, it took a much longer time to extinguish the established behavior, which shows the power of the multilevel architecture.
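The extinction dynamics can be illustrated with a toy calculation: repeatedly punishing a prototype-action pair with the Q update of Section 6.4.1 drives its value below zero, at which point L2 blocks the action. This is an illustration only, not the robot's recorded trace; the constants are arbitrary.

```python
def amnesic_q(q_prev, q_next, q2, r, n, gamma=0.9, l=2.0):
    # the L1 value update of Section 6.4.1 (internal reward q2 from L2)
    return q_prev + (1.0 + l) / n * (r + q2 + gamma * q_next - q_prev)

q = 1.0          # the action was once imposed, so its initial value is 1
for n in range(2, 12):
    # each visit: punishment r = -1 from the "bad" button, no successor value
    q = amnesic_q(q, q_next=0.0, q2=0.0, r=-1.0, n=n)
    print(n, round(q, 3))
# q quickly turns negative, so L2 blocks the action: the behavior is extinguished
```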
Figure 6.17: The SAIL robot learned the new task after being verbally instructed by human trainers.

6.6.2 Task transfer

In this experiment, we teach the robot to do the composite action. The training process goes as follows: (1) A trainer speaks "start," followed by "one," "two," "three," "four." (2) Wait for a couple of seconds and go back to (1). This is essentially a process in which the trainer teaches the system to do the composite action through verbal instructions. After repeating the procedure several times, the trainer speaks only "start." If the robot continuously does the four actions successfully, we count it as a success. Fig. 6.17 shows that the SAIL robot learned to execute an extended action sequence upon a new stimulus (the new command) by transferring previously learned skills.

We repeated 20 times the experiment of teaching SAIL to draw individual petals and then to draw the 4-petal flower upon learning the new "start" command. Table 6.2 shows that there was only one case among the 20 in which the robot failed to learn the new task. In that particular case, the robot confused the command "start" with the command "three" because their beginning portions sounded similar. Even in this failed case, the invocation of the first two actions was still successful.

Table 6.2: Performance of SAIL doing task transfer

  No. of actions:   1    2    3    4
  C.R. (%):       100  100   95   95

6.6.3 Execution time and memory size

The IHDR trees keep only clusters, instead of samples. Typically, the more complex the sound distribution is, the larger the trees are. And a larger tree size means a (slightly) longer retrieval time, because IHDR has a logarithmic time complexity. To push the system to its extreme in terms of tree size, we played loud music to it and recorded the execution time of the above computations within each time step. Fig. 6.18 shows that the execution time tended to grow at the beginning and flattened out after about 100 seconds. Overall, the execution time of each time cycle is well under the interval of 18.1 ms, though it varies all the time because a primitive prototype may be stored at and retrieved from different depths of the trees. The total size of the three trees reached about 300 MB after 550 seconds of "music listening." We do not expect execution time to be a major issue even for very large IHDR trees, because of the logarithmic time complexity. To provide an idea about the structure of the trees, some structural data about the resulting IHDR trees are shown in Fig. 6.19, where the horizontal axes denote the depth of the tree and the vertical axes are the number of nodes or the number of primitive prototypes saved. The P-tree of L2 is significantly smaller than the R-tree and the P-tree of L1 because of the "grouping" we mentioned in Section 6.4.

Figure 6.18: The average execution time of the two-level system in each execution step is lower than 18.1 ms, the required interval of each speech frame. (Horizontal axis: time in seconds, up to about 550 s.)

Figure 6.19: The node distribution and primitive prototype distribution in the trees: (a) R-tree of the first-level LBE; (b) P-tree of the first-level LBE; (c) P-tree of the second-level LBE. (Horizontal axes: tree depth; vertical axes: number of nodes and number of primitive prototypes.)
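To illustrate why retrieval work grows with tree depth rather than with the number of stored prototypes, the following toy sketch descends a tree of cluster centers. It is our own minimal illustration, not the IHDR implementation (IHDR additionally performs incremental discriminant analysis at each internal node, omitted here); the data layout and names are invented.

```python
import numpy as np

def retrieve(node, x):
    """Descend a cluster tree: at each internal node, follow the child whose
    center is nearest to x; at a leaf, return the best-matching prototype.
    Work per query is proportional to tree depth, roughly O(log n) for a
    balanced tree over n prototypes, not to the total number stored."""
    while node["children"]:
        centers = np.stack([c["center"] for c in node["children"]])
        node = node["children"][int(np.argmin(np.linalg.norm(centers - x, axis=1)))]
    protos = np.stack(node["prototypes"])
    return node["prototypes"][int(np.argmin(np.linalg.norm(protos - x, axis=1)))]

leaf1 = {"center": np.array([0.0, 0.0]), "children": [], "prototypes": [np.array([0.1, 0.0])]}
leaf2 = {"center": np.array([5.0, 5.0]), "children": [], "prototypes": [np.array([5.2, 4.9])]}
root = {"center": None, "children": [leaf1, leaf2], "prototypes": []}
print(retrieve(root, np.array([4.8, 5.1])))   # -> the prototype near (5, 5)
```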
6.7 Conclusions

The capability of transferring acquired skills from one task to another setting is crucial for a robot to autonomously learn complex cognitive behaviors. In this chapter, we presented the principles, architecture, and computational framework of a developmental agent capable of task transfer. Our experiments showed that it is theoretically sound and experimentally practical to construct an artificial device that can autonomously learn new complex cognitive skills by applying its previously learned skills.

The implication of this conclusion lies in the fact that everything is done online, in real time, without a human programmer supplying a task-specific representation. A major contribution of this work is that the learning process of the agent can be conducted through online, real-time interactions between the agent and trainers, which had not been achieved by any previous research. The high-dimensional numerical representation made this possible, although it posed very challenging technical issues of high-dimensional incremental regression. Thanks to the grounded speech learning capability, none of the perceptions and actions, including even the actual number of their classes, needed to be available before the programming was finished. The advantage of this architecture is that the artificial agent has great flexibility for perception and behavior learning.

The missing-context problem requires the robot to look ahead or to use knowledge about what will happen in the future. A system using future knowledge is not causal and cannot be implemented directly. We followed the idea of Q-learning and proposed a model that performs backpropagation over the primed contexts. This backward model effectively realized a prediction mechanism and enabled the robot to tolerate some missing context while doing task transfer.

The multilevel architecture resolved the abstraction issue in perception and facilitated the behavior adjustment process.

With possible implications for the understanding of how humans form abstract symbolic concepts from numerical sensory inputs, we have demonstrated that symbolic behaviors, including complex phenomena such as task transfer, can emerge from a non-symbolic, distributed, vector-type internal representation.

There is no clear boundary between audition, speech, and language in such a developmental robot. This idea has been successfully used in a later research project, reported in the next chapter, where a robot integrates vision and audition in a dynamic world via online dialogue.

Chapter 7

Semantics Learning through Multiple Modalities

7.1 Introduction

While communication seems so natural and easy for human beings, the situation for machines is much worse. Many efforts have been made to duplicate the fascinating language capability of humans on an artificial agent. However, even though the performance of automatic speech recognition (ASR) technologies has constantly improved in the past decade, in terms of communication, none of the existing systems is even close to the capability of a three-year-old child.

Language is composed of two inter-related components: syntax and semantics. Syntax studies the structure of well-formed phrases (spelled out as sound sequences). Semantics deals with the way syntactic structures are interpreted.
In some sense, syntax plays the role of a protocol in communication, while semantics is the content. If two people share the same syntax and semantics, they can communicate with each other. Between syntax and semantics, the latter seems more essential for communication. We all have the experience that two people from different countries can have some simple communication even if neither of them knows the other's language.

The acquisition of semantics is a very challenging part of machine language learning. Semantics is the meaning of a string in a language [2], shared by the users of the language. So, the acquisition of semantics is the development of this shared understanding. There is no common definition of "meaning" or "shared understanding" among cognitive scientists yet [96], which partially contributes to the difficulty of machine language learning. Here we would like to propose a working definition. All the activities of an agent can be regarded as motor control driven by sensory input. This is quite obvious for external behaviors such as locomotion and speech, while it is also true for internal behaviors such as visual attention [101]. Our definition of understanding meaning or semantics is to conduct appropriate behaviors under the appropriate context. By context, we mean a configuration of available information that an agent uses for making sense of language in particular situations. Under this definition, a semantics acquisition process is a process of internalizing and organizing the context, while a communication process is a process of retrieving the appropriate context and producing corresponding verbal behaviors.

As we have been discussing in the previous chapters, during both of these processes, an important requirement is grounding [40] [28] [14]. Grounding means that representations inside the machine should be connected to their referents in the external world. For example, the representation of "dog" should be related to the presence of actual dogs in the environment. Grounding is accomplished through real-time sensory experiences. The fact that people speaking different languages can communicate with each other is a side support for the importance of grounding: since people share the same physical world, we develop similar semantics and can communicate even if we do not share the same syntax.

In this chapter, we present an embodied system that acquires simple semantics in real time. We explore some of the difficulties in the process of grounded semantics learning and propose a computational model to resolve the challenges.

7.2 Problem description

Semantics is grounded upon rich sensory inputs. In the process of children developing language capability, they take in information through all the senses - sight, hearing, smell, touch, and taste - and they integrate the information and act upon it. There is evidence showing that if visual, auditory, and tactile inputs never have the chance to occur together, there is no opportunity to develop an integrated linkage between what is seen, heard, and felt [31]. Therefore, our study on robot semantics learning is based upon multimodal information. In communication, we make sense of what someone says (semantics) based on whatever information resources we think are relevant.
These contextual resources are likely to be found in such things as [55]:

- the physical surroundings;
- the past shared experience and relationship of the speakers;
- the speakers' shared tasks or goals;
- the speakers' experience of similar kinds of conversation.

Retrieving contextual information from different resources involves complex mental activities, which a young baby is not capable of. Neither is an early-stage learning system. So, at this research stage, we limit the contextual resources of the robot to its immediate physical surroundings or situations. In summary, we would like a robot to conduct appropriate behaviors given a similar visuoauditory context. In particular, we present a system that can answer verbal questions appropriately, given visual stimuli of some objects. This is a simple semantics acquisition process, which requires the robot to develop visual and auditory perception and to associate the visuoauditory context with appropriate behaviors, all in real time. We have presented real-time audition learning in previous chapters. However, real-time multimodality learning has its own challenges.

Visual representation of objects. To interact successfully with objects in the environment, the robot must be able to recognize them across changes in their orientations. The rich studies of visual representations of objects can be divided into two major types: object-based models and image-based models. The former suggest that objects are represented as structural descriptions of their 3-D parts and the relations between those parts, in a manner that is independent of the objects' orientation relative to the observer [54]. The latter propose that objects are represented as a set of 2-D views or snapshots taken at specific orientations relative to the observer. While the former models sound intuitively more plausible for reaching viewpoint invariance, the latter receive more support from recent psychophysical and neurophysiological studies [12] [52] [62]. One advantage of image-based models over object-based models is that image-based models do not require manually designed object-specific 3-D models. Since a developmental robot needs to handle various objects without manual intervention, we adopted image-based models for the visual representation of our system. Our previous studies have shown the feasibility of real-time visual learning [106]. However, given the combination of vision and audition, the speed of the system was yet to be tested, especially considering the time-critical nature of the audition information.

Imperfect alignment. Many existing works on multimodality learning rely on a strict coupling between vision and audition information, such as the movement of the mouth and the utterance produced [18] [32]. In our simple semantics learning problem, the alignment between the visual stimuli and the auditory stimuli is imperfect. In the real world, the presence of an object is usually coupled with related auditory signals, such as the noise made by the object or the verbal label given by a teacher. However, since image-based models are viewpoint-dependent, the observations from different views of an object are different. The auditory signals are not likely to appear exactly when the same visual observation is acquired. In other words, if we call a visual-auditory stimulus pair at a particular time instance a visuoauditory context, it is unlikely that a particular visuoauditory context will be exactly repeated.
A recent study proposed a mutual-information-based framework to resolve the problem [78]. But, in the reported experiment, the visual stimuli were unrealistic still images of the objects, although the auditory data were very challenging real mother-baby verbal interactions. We view this problem as an abstraction problem and use the spatiotemporal correlations among contiguous visual and auditory information to resolve it.

7.3 Architecture and algorithm

In Chapter 6, we introduced the module of LBE with two IHDR trees to handle immediate reflexive behaviors and missing or delayed behaviors, respectively. The primed contexts of the P-tree carry future information, i.e., the sensations about to be captured and the action about to be made, together with its value. Since the prediction model (Eq. (6.4)) is actually a low-pass filter, the primed sensation changes slowly compared to the corresponding last sensation. Taking advantage of this, we constructed a multi-level system using LBE as the basic element to handle different abstractions at different levels.

The abstraction idea can also be used in the simple semantics learning problem. As discussed above, one of the major challenges here is that an object appears to the robot as a sequence of images captured from different viewpoints. We need a representation upon which the robot can group these different images into a single cluster, i.e., an object, before the object-specific action can be learned. If manual transcription were allowed, we might design a representation, say a label, and assign it to the corresponding images one by one. However, in the developmental paradigm, such task-specific design and data-structure-level manipulation are not allowed. Nor are they practical for a robot running in an unstructured and complex environment. Fortunately, there is a very important property of the physical world that we may take advantage of: continuity.

In the real world, an object does not emerge from nothing, and it does not disappear like magic. We may make use of the shared image features of the spatiotemporally contiguous views of an object. As an object moves in and out of the agent's field of view, two consecutive views of the object are similar to a large extent when the capturing speed is high enough. If we filter out the high-frequency components, the images change even more slowly and may be considered identical in some cases, which is exactly what we need.

Fig. 7.1 shows the architecture we used to do simple semantics learning. It has three LBE modules: a vision LBE (V-LBE), an audition LBE (A-LBE), and a high-level LBE (H-LBE). The visual sensation is the original image captured by a CCD camera. We do not do any pre-processing on the images to derive low-level features such as an edge histogram; the important discriminant features are derived automatically by the IHDR trees. The auditory sensation is captured by a sound blaster card through a microphone. We do cepstral analysis on the original sound signals before the data enter the A-LBE. Since sound is a linear signal, in the sense that the information is distributed over a period of time, each auditory sensation vector actually covers 20 speech frames, as will be seen in the experimental results. The primed sensations from the P-trees of both V-LBE and A-LBE are inputs to H-LBE. After the low-pass filtering in the PUQ, the primed sensation keeps only the low-frequency components of the last context.
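Eq. (6.4) is not restated here, but since the text characterizes the prediction model as a low-pass filter over successive sensations, its qualitative effect can be shown with a first-order recursive filter. The sketch below is an assumption-based illustration: the exponential form and the smoothing factor `beta` are ours, chosen only to show how filtering makes spatiotemporally contiguous views nearly identical.

```python
import numpy as np

def primed_sensation(frames, beta=0.9):
    """Recursively low-pass filter a sequence of sensation vectors:
    primed <- beta * primed + (1 - beta) * current.
    The result changes slowly across consecutive views, so nearby views
    of the same object map to nearly identical primed sensations."""
    primed = np.array(frames[0], dtype=float)
    for x in frames[1:]:
        primed = beta * primed + (1.0 - beta) * np.asarray(x, dtype=float)
    return primed

# Two consecutive noisy views of the same object yield similar primed vectors.
rng = np.random.default_rng(0)
view = rng.random(500)                       # e.g., a flattened 25x20 image
smoothed = primed_sensation([view, view + 0.01 * rng.random(500)])
print(np.allclose(smoothed, view, atol=0.05))   # True: output stays close
```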
We discuss how the architecture works in detail while presenting the experimental results.

Figure 7.1: The semantics learning system architecture. (The vision and audition LBEs each contain an IHDR R-tree and an IHDR P-tree with channel selectors, attention control signals, and a prototype updating queue; their reflexive and backpropagated primed contexts feed the H-LBE, which drives the effectors.)

7.4 Experimental results

The multilevel semantics learning architecture has been implemented on the SAIL robot. The experiment was done in the following way. After SAIL started running, the trainer mounted objects one after another on the gripper of SAIL and let SAIL rotate the gripper in front of its eyes at a speed of about 3.6 s per round. In total, 13 objects were presented to SAIL (Fig. 7.2). All of these real-world objects had very complex shapes (for example, the hair of "Harry Potter") and non-rigid forms. It would be extremely difficult, if not impossible, to model them using 3-D representations.

Figure 7.2: The objects used in the experiment. (The recoverable labels include Baby 2, Barbie, Kitty, Hugme, Dwarf, Micky Mouse 1, Micky Mouse 2, Winnie, and Harry Potter.)

The properties of each object were taught to SAIL in a question-and-answer session. First, the trainer verbally asked questions, such as "name?" and "size?", when an object was presented. Then the trainer gave the appropriate answers by pushing the switch sensors of SAIL, where different switch sensor statuses represented different answers. Since the objects were rotated and moved in and out of SAIL's field of view continuously, the orientations and positions of the objects kept changing. There was hardly any chance that SAIL could see the same images of an object when the same question was asked again. A sample video sequence seen by SAIL is shown in Fig. 7.3. We expected SAIL to correctly answer the taught questions the next time the objects were presented and the questions were asked.

Figure 7.3: Part of a sample image sequence.

The images were captured by a Matrox Meteor II board as gray-scale images at 30 frames per second. The dimension of the images was 25-by-20. The speech data were digitized at 11.025 kHz by a normal sound blaster card. We did cepstral analysis on the speech data, and 13-order Mel-frequency cepstral coefficients (MFCCs) were computed over 256-point-wide frame windows. There was an overlap of 56 points between two consecutive frames. Therefore, the MFCCs entered the auditory channel of SAIL at a rate of about 50 Hz. We concatenated 20 consecutive MFCC vectors together as a single auditory sensation vector, because a 20 ms speech frame is too short to convey any meaningful information.
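To make the frame bookkeeping concrete, the following sketch assembles auditory sensation vectors by concatenating 20 consecutive MFCC frames, as described above. It is a minimal illustration assuming each frame is a 13-dimensional MFCC vector arriving at about 50 Hz; the function and variable names are our own.

```python
import numpy as np

def make_sensation_vectors(mfcc_frames, window=20):
    """Concatenate each run of `window` consecutive MFCC frames into one
    auditory sensation vector (covering roughly 0.4 s of speech at 50 Hz).
    `mfcc_frames` is an (n_frames, n_coeffs) array; the result has shape
    (n_frames - window + 1, window * n_coeffs)."""
    n, d = mfcc_frames.shape
    return np.stack([mfcc_frames[i:i + window].ravel()
                     for i in range(n - window + 1)])

# 100 frames of 13 MFCCs -> 81 sensation vectors of dimension 260.
frames = np.random.default_rng(1).random((100, 13))
print(make_sensation_vectors(frames).shape)   # (81, 260)
```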
To compensate for the slower capture rate of the image data, the cognitive mapping module of SAIL used the last captured image together with the new MFCC vector when a new image was not available.

To examine the behavior of SAIL in detail and evaluate the performance, we first pursued an experiment on pre-recorded data. The image data for each object were five video sequences of the object moving into SAIL's field of view, rotating for roughly one round, and then moving out of SAIL's field of view. Each image sequence contained 350 frames:

- Frames 1-50: background images;
- Frames 51-100: the object moving to the center of SAIL's field of view;
- Frames 101-300: the object rotating along its center axis;
- Frames 301-350: the object moving out of SAIL's field of view.

The speech data were part of the number data set used in Section 5.5.1. Ten subjects were randomly selected from the total of 63. Each number was spoken five times by each subject. We used the utterances of "one" to represent "name" and "two" to represent "size." During training, the switch sensor inputs (a numerical vector) were given after the utterances were finished, which was when SAIL was taught the answers. Of the five sets of image and speech data, we used four in training and the left-out one for testing. So, with 13 objects, ten persons, and two questions, SAIL was taught 1040 times in training and evaluated 260 times in testing. To emulate the situation that the trainer would not be able to ask questions when the objects were presented with exactly the same orientations and positions in testing as in training, we randomly chose the point at which to align the image sequences and speech sequences (Fig. 7.4). Specifically, the end point of the questions was aligned with image No. 300 during training. When testing, we aligned the end point of the questions with image No. 100, 150, 200, 250, and 300, respectively.

The behavior of SAIL was evaluated in two different ways. First, we counted the total number of times SAIL responded with certain answers after the question utterances. This number was usually larger than the number of image sequences because there were chances when SAIL responded more than once after a single question. For each of the responses, if it was correct with respect to the object presented at the time, we counted it as correct. So, we got the first correct answer rate (C.A.R.1),

$$\mathrm{C.A.R.1} = \frac{\text{no. of correct responses}}{\text{total no. of responses}}.$$

In the second evaluation, we counted all the responses during one object image sequence as one trial. During each object image sequence, if the majority of the responses were correct, we counted the trial as correct. Otherwise, we counted it as wrong. This gave the second correct answer rate (C.A.R.2),

$$\mathrm{C.A.R.2} = \frac{\text{no. of image sequences with correct majority responses}}{\text{no. of image sequences}}.$$

Figure 7.4: The alignment of image and speech data (a) during training; (b) during testing.
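The two rates can be computed directly from a log of per-response records. Below is a small sketch under the assumption that each response is recorded together with the id of the image sequence in which it occurred; the names are our own.

```python
from collections import defaultdict

def answer_rates(responses):
    """responses: list of (sequence_id, is_correct) pairs, one per response.
    Returns (C.A.R.1, C.A.R.2) as defined above."""
    car1 = sum(ok for _, ok in responses) / len(responses)
    by_seq = defaultdict(list)
    for seq, ok in responses:
        by_seq[seq].append(ok)
    # A sequence counts as correct when the majority of its responses are.
    majority = sum(sum(oks) > len(oks) / 2 for oks in by_seq.values())
    car2 = majority / len(by_seq)
    return car1, car2

log = [(0, True), (0, True), (0, False), (1, True), (2, False), (2, False)]
print(answer_rates(log))   # (0.5, 0.666...): 3 of 6 responses, 2 of 3 sequences
```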
We plot the correct answer rates with respect to the question positions during testing in Fig. 7.5. The visuoauditory scenes during testing were never exactly the same as during training when the questions were asked. When the question-position difference between training and testing was not large, SAIL maintained a high correct answer rate. With the increase of the question-position difference, the correct answer rate dropped gradually. Also, while the objects were moving into or out of SAIL's field of view, SAIL's performance was low, because SAIL did not have an attention mechanism to locate the object in the center of its field of view.

Figure 7.5: The two correct answer rates of SAIL vs. the question positions in each image sequence. (Vertical axis: correct answer rate (%); horizontal axis: the testing question positions in each image sequence, spanning the object moving in, rotating, and moving out, with the training question position at frame 300.)

In particular, detailed confusion tables (Table 7.1 and Table 7.2) show SAIL's C.A.R.1 on the different objects and the two questions when the questions were aligned with image frame No. 250. The average C.A.R.1 is 87.89%. Table 7.3 shows SAIL's C.A.R.2, with an average rate of 95.77%.

Table 7.1: SAIL's responses on the question "name?" when the questions were asked with image No. 250 (per-object confusion data).

Table 7.2: SAIL's responses on the question "size?" when the questions were asked with image No. 250 (per-object confusion data).

The size of the whole "brain" after training was 806 MB. The shapes of the three major trees of the three LBEs are shown in Fig. 7.6. Because of the tree structure, the average execution time at each time step is 3.4 ms, much lower than 18.1 ms, the required interval of a single speech frame. To see why SAIL was able to respond when the questions were not asked at exactly the same time in testing as in training, we show the primed sensation of V-LBE in Fig. 7.7. Since the operation done in the PUQ of V-LBE was a low-pass filtering, the primed visual sensation was a blurred version of the real visual sensation.

Appendix A

This appendix proves the convergence of the CCIPCA algorithm,

$$v_i(n) = v_i(n-1) + \frac{1}{n}\left[\frac{u_i(n)\, u_i^T(n)\, v_i(n-1)}{\|v_i(n-1)\|} - v_i(n-1)\right], \tag{A.1}$$

$$u_i(n) = u_{i-1}(n) - u_{i-1}^T(n)\, \frac{v_{i-1}(n)}{\|v_{i-1}(n)\|}\, \frac{v_{i-1}(n)}{\|v_{i-1}(n)\|}. \tag{A.2}$$

We prove that $v_i(n) \to \pm\lambda_i e_i$ when $n \to \infty$, where $\lambda_i$ is the $i$-th largest eigenvalue of the covariance matrix of $\{u(n)\}$, $e_i$ is the corresponding eigenvector, and $u_1(n) = u(n)$, under the following assumptions:

1. $A(n) = u(n) u^T(n)$ are mutually statistically independent with $E\{A(n)\} = A$ for all $n$.
2. $A$ is of full rank, which means $\lambda_d$, the smallest eigenvalue of $A$, is not zero.
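As a quick way to experiment with the update rules (A.1) and (A.2) as reconstructed above, here is a compact sketch in Python. It is our own minimal illustration, not the thesis code: the amnesic-averaging refinements of the published CCIPCA algorithm are omitted, and the initialization and data are invented.

```python
import numpy as np

def ccipca_step(V, u, n):
    """One CCIPCA update over all components. V is a list of current
    estimates v_i (each converging to lambda_i * e_i); u is the new sample;
    n is the 1-based sample index. Update v_i with the current residual
    u_i, then deflate u_i for the next component."""
    u_i = np.array(u, dtype=float)
    for i, v in enumerate(V):
        V[i] = v + (1.0 / n) * ((u_i @ v) * u_i / np.linalg.norm(v) - v)  # Eq. (A.1)
        vn = V[i] / np.linalg.norm(V[i])
        u_i = u_i - (u_i @ vn) * vn                                       # Eq. (A.2)
    return V

# The norm of each estimate approaches the eigenvalue: v_i(n) -> +/- lambda_i e_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])  # eigenvalues 9, 4, 1, 0.25
V = [rng.normal(size=4), rng.normal(size=4)]
for n, u in enumerate(X, start=1):
    ccipca_step(V, u, n)
print([round(float(np.linalg.norm(v)), 1) for v in V])  # approaches the top eigenvalues, 9 and 4
```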
In the following sections, starting from the case of $i = 1$, we first prove that, under certain assumptions, the above algorithm converges with probability one (w.p.1) to the asymptotically stable solutions of a related differential equation. Then, we show that the asymptotically stable solutions are $\pm\lambda_1 e_1$ and that all the assumptions are satisfied. We prove $v_i(n) \to \pm\lambda_i e_i$ for $i > 1$ using mathematical induction.

A.1 Relation to a differential equation

When $i = 1$, the algorithm is

$$v_1(n) = v_1(n-1) + \frac{1}{n}\left(\frac{A(n)}{\|v_1(n-1)\|} - I\right) v_1(n-1), \tag{A.3}$$

where $A(n) = u_1(n) u_1^T(n)$. It may be rewritten as

$$v_1(n) = v_1(n-1) + \frac{1}{n}\left(\frac{A}{\|v_1(n-1)\|} - I\right) v_1(n-1) + \frac{1}{n}\, \frac{A(n) - A}{\|v_1(n-1)\|}\, v_1(n-1), \tag{A.4}$$

where $A = E\{A(n)\}$ for all $n$.

Lemma A.1.1 For any $\varepsilon > 0$,

$$\lim_{n \to \infty} P\left\{\sup_{m \ge n} \left\|\frac{1}{m}\, \frac{(A(m) - A)\, v_1(m-1)}{\|v_1(m-1)\|}\right\| > \varepsilon\right\} = 0.$$

Proof. Since $\left\|\frac{(A(n) - A)\, v_1(n-1)}{\|v_1(n-1)\|}\right\| \le \|A(n) - A\|$ and $A = E\{A(n)\}$ for all $n$, it is simple to conclude the lemma. $\Box$

Lemma A.1.2 $v_1(n)$ is bounded w.p.1.

Proof. According to Eq. A.3, we have

$$\|v_1(n)\|^2 = \left(1 - \frac{1}{n}\right)^2 v_1^T(n-1) v_1(n-1) + \frac{2}{n}\left(1 - \frac{1}{n}\right) \frac{v_1^T(n-1) A(n) v_1(n-1)}{\|v_1(n-1)\|} + \frac{1}{n^2}\, \frac{v_1^T(n-1) A^2(n) v_1(n-1)}{\|v_1(n-1)\|^2}. \tag{A.5}$$

We know that $v_1^T(n-1) A(n) v_1(n-1) \le \lambda_{\max}(A(n))\, v_1^T(n-1) v_1(n-1)$, where $\lambda_{\max}(A(n))$ is the largest eigenvalue of $A(n)$. If $\|v_1(n-1)\| \ge 2\lambda_{\max}(A(n))$, we have

$$\frac{2}{n}\, \frac{v_1^T(n-1) A(n) v_1(n-1)}{\|v_1(n-1)\|} \le \frac{2\lambda_{\max}(A(n))}{n}\, \frac{v_1^T(n-1) v_1(n-1)}{\|v_1(n-1)\|} \le \frac{1}{n}\, v_1^T(n-1) v_1(n-1). \tag{A.6}$$

Similarly, since $v_1^T(n-1) A^2(n) v_1(n-1) \le \lambda_{\max}^2(A(n))\, v_1^T(n-1) v_1(n-1)$, if $\|v_1(n-1)\| \ge \lambda_{\max}(A(n))$ for a certain $n > 2$, we have

$$\frac{1}{n^2}\, \frac{v_1^T(n-1) A^2(n) v_1(n-1)}{\|v_1(n-1)\|^2} \le \frac{1}{n^2}\, \lambda_{\max}^2(A(n))\, \frac{v_1^T(n-1) v_1(n-1)}{\|v_1(n-1)\|^2} \le \frac{1}{n^2}\, v_1^T(n-1) v_1(n-1). \tag{A.7}$$

Further, when $n$ is large enough ($n > 2$), we have

$$\frac{1}{n^2}\, v_1^T(n-1) v_1(n-1) < \frac{1}{2n}\, v_1^T(n-1) v_1(n-1). \tag{A.8}$$

Considering Eqs. A.6-A.8, we have, for large enough $n$: if $\|v_1(n-1)\| \ge 2\lambda_{\max}(A(n))$, then $\|v_1(n)\| \le \|v_1(n-1)\|$; otherwise, $\|v_1(n-1)\| < 2\lambda_{\max}(A(n))$. Since $\lambda_{\max}(A(n))$ is bounded w.p.1 because $A = E\{A(n)\}$ for all $n$, $\|v_1(n)\|$ is bounded w.p.1. $\Box$

Theorem A.1.1 Let $v_1^0$ be a locally asymptotically stable (in the sense of Liapunov) solution to

$$\dot{v}_1 = \left(\frac{A}{\|v_1\|} - I\right) v_1 \tag{A.9}$$

with domain of attraction $D(v_1^0)$. If there is a compact set $\mathcal{A} \subset D(v_1^0)$ such that $v_1(n) \in \mathcal{A}$ infinitely often, then $v_1(n)$ tends to $v_1^0$ almost surely.

Proof. We prove it using Theorem 2.3.1 in Kushner and Clark [48]. Assumptions A2.2.1, A2.2.2, and A2.2.3 in [48] are trivial. Lemma A.1.1 fulfills assumption A2.2.4. Together with Lemma A.1.2, all the assumptions are satisfied and, thus, Theorem 2.3.1 implies Theorem A.1.1 here. $\Box$

A.2 Prove $v_1(n) \to \pm\lambda_1 e_1$

In this section, we find the asymptotically stable solutions of Eq. A.9. Expanding $v_1$ in terms of the eigenvectors, which form a basis of the $d$-dimensional space, we have $v_1 = \sum_{j=1}^d \alpha_j e_j$, where $\alpha_j = v_1^T e_j$. Considering $A e_j = \lambda_j e_j$, we can write Eq. A.9 as

$$\dot{\alpha} = \left(\frac{\Lambda_\lambda}{\sqrt{\sum_{k=1}^d \alpha_k^2}} - I\right) \alpha, \tag{A.10}$$

where $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_d)^T$, $\|v_1\| = \sqrt{\sum_{k=1}^d \alpha_k^2}$, and $\Lambda_\lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$.

A.2.1 $\|v_1\|$ is bounded

Let $r = \|\alpha\| = (\sum_{k=1}^d \alpha_k^2)^{1/2}$. We have

$$\dot{r} = \frac{1}{r} \sum_{k=1}^d \alpha_k \dot{\alpha}_k = \sum_{k=1}^d \lambda_k \cos^2\theta_k - r = \lambda_d + \sum_{k=1}^{d-1} \cos^2\theta_k\, (\lambda_k - \lambda_d) - r, \tag{A.11}$$

where $\theta_k$ is the angle between $v_1$ and $e_k$. Because $\sum_{k=1}^d \cos^2\theta_k = 1$ and $\lambda_1 > \lambda_2 > \dots > \lambda_d > 0$, we have $0 \le \sum_{k=1}^{d-1} \cos^2\theta_k\, (\lambda_k - \lambda_d) \le \sum_{k=1}^{d-1} (\lambda_k - \lambda_d) < \sum_{k=1}^d \lambda_k$. Hence, $\|v_1\|$ is bounded between $\lambda_d$ and $\sum_{k=1}^d \lambda_k$ when $t \to \infty$.

A.2.2 $\alpha_j / \alpha_1 \to 0$

Let $\delta_j = \alpha_j / \alpha_1$ for $j > 1$ when $\alpha_1 \ne 0$. We have

$$\dot{\delta}_j = \frac{1}{\alpha_1^2} \left(\dot{\alpha}_j \alpha_1 - \alpha_j \dot{\alpha}_1\right)$$
$$= \delta_j\, \frac{\lambda_j - \lambda_1}{\sqrt{\sum_{k=1}^d \alpha_k^2}}. \tag{A.12}$$

Since $\lambda_d \le \|v_1\| = \sqrt{\sum_{k=1}^d \alpha_k^2} < \sum_{k=1}^d \lambda_k$ and $\lambda_j < \lambda_1$, $\delta_j \to 0$ as $t \to \infty$. Again, because of the upper bound of $\|v_1\|$, $\alpha_1$ is bounded and, consequently, $\alpha_j \to 0$ for $j > 1$ when $t \to \infty$.

A.2.3 $\alpha_1 \to \pm\lambda_1$

From Eq. A.10, we have

$$\dot{\alpha}_1 = \left(\frac{\lambda_1}{\sqrt{\sum_{k=1}^d \alpha_k^2}} - 1\right) \alpha_1.$$

We may drop $\alpha_j$ for $j > 1$ when $t \to \infty$ and get

$$\dot{\alpha}_1 = \pm\lambda_1 - \alpha_1,$$

which means $\alpha_1 \to \pm\lambda_1$ when $t \to \infty$.

A.2.4 Summary

Above, we assumed $\alpha_1(t) \ne 0$. When $\alpha_1(t) = 0$, $v_1(t)$ is in a $(d-1)$-dimensional space, $R_1^{d-1} = \mathrm{span}\{e_i,\, i = 2, \dots, d\}$. In other words, $D(\{\pm\lambda_1 e_1\})$ is $R^d - R_1^{d-1}$. Since it is very unlikely that we choose $v_1(0)$ in the space $R_1^{d-1}$, $v_1(n)$ enters $D(\{\pm\lambda_1 e_1\})$ w.p.1. Applying Theorem A.1.1, we have $v_1(n) \to \pm\lambda_1 e_1$ w.p.1 when $n \to \infty$.

A.3 Prove $v_i(n) \to \pm\lambda_i e_i$ with induction

We want to prove that $v_i(n) \to \pm\lambda_i e_i$ under the induction assumption that $v_j(n) \to \pm\lambda_j e_j$ $(j < i)$. From Eq. A.2, we have

$$u_i(n) = \prod_{j=1}^{i-1} \left[I - \frac{v_j(n)}{\|v_j(n)\|}\, \frac{v_j^T(n)}{\|v_j(n)\|}\right] u_1(n) \triangleq \Pi_i(n)\, u_1(n), \tag{A.15}$$

where $\Pi_i(n) = I$ if the upper and lower indices of the product cross over. So, Eq. A.1 can be written as

$$v_i(n) = v_i(n-1) + \frac{1}{n} \left(\frac{\Pi_i(n)\, A(n)\, \Pi_i(n)}{\|v_i(n-1)\|} - I\right) v_i(n-1), \tag{A.17}$$

where $A(n) = u_1(n) u_1^T(n)$. Similar to the case of $v_1(n)$, we may find the related differential equation of Eq. A.17 as

$$\dot{v}_i = \left(\frac{\Pi_i A \Pi_i}{\|v_i\|} - I\right) v_i, \tag{A.18}$$

where $A = E\{A(n)\}$ and $\Pi_i = \prod_{j=1}^{i-1} \left[I - \frac{v_j}{\|v_j\|}\, \frac{v_j^T}{\|v_j\|}\right]$. With the induction assumption, we write, for $j < i$,

$$\frac{v_j}{\|v_j\|} = e_j + \epsilon_j(t)\, w_j(t), \tag{A.19}$$

where $w_j$ is a time-varying unit-length vector and, for $j < i$, $\epsilon_j(t) \to 0$ as $t \to \infty$. (The following analysis would be similar if we wrote $v_j / \|v_j\| = -e_j + \epsilon_j w_j$.)

From the definition of $\Pi_i$, we have

$$\Pi_i(t) = \prod_{j=1}^{i-1} \left[I - (e_j + \epsilon_j(t) w_j(t))(e_j + \epsilon_j(t) w_j(t))^T\right] = I - \sum_{j=1}^{i-1} e_j e_j^T - O(\epsilon(t)), \tag{A.20}$$

$$\Pi_i(t)\, A\, \Pi_i(t) = A - \sum_{j=1}^{i-1} \lambda_j e_j e_j^T - O(\epsilon(t)). \tag{A.21}$$

Expanding $v_i$ in terms of the eigenvectors, we have

$$v_i(t) = \sum_{k=1}^d \alpha_k(t)\, e_k, \tag{A.22}$$

where $\alpha_k(t) = v_i^T(t)\, e_k$. Substituting Eq. A.21 and Eq. A.22 into Eq. A.18, and ignoring $O(\epsilon(t))$ when $t$ is large, we have

$$\dot{\alpha} = \left(\frac{\Lambda_{\lambda_i}}{\sqrt{\sum_{k=1}^d \alpha_k^2}} - I\right) \alpha,$$

where $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_d)^T$, $\|v_i\| = \sqrt{\sum_{k=1}^d \alpha_k^2}$, and $\Lambda_{\lambda_i} = \mathrm{diag}(0, 0, \dots, 0, \lambda_i, \lambda_{i+1}, \dots, \lambda_d)$. Since $\dot{\alpha}_j = -\alpha_j$ for $j < i$, we have $\alpha_j \to 0$ when $t \to \infty$. Dropping $\alpha_j$ $(j < i)$, we have differential equations very similar to those in Eq. A.10. Following the proof in Sections A.2.1-A.2.3, we can conclude that $\alpha_i \to \pm\lambda_i$ when $t \to \infty$. Similar to the analysis in Section A.2.4, we have $v_i(n) \to \pm\lambda_i e_i$ w.p.1 when $n \to \infty$.

A.4 Conclusions

As a conclusion of the above analysis, with the algorithm given by Eq. A.1 and Eq. A.2, $v_i(n) \to \pm\lambda_i e_i$ when $n \to \infty$, where $\lambda_i$ is the $i$-th largest eigenvalue of the covariance matrix of $\{u(n)\}$, and $e_i$ is the corresponding eigenvector.

Bibliography

[1] Proc. of the 2nd International Conference on Development and Learning. Cambridge, MA, June 12-15, 2002.

[2] K. Allan. Natural Language Semantics. Blackwell Publishers Ltd, Malden, MA, 2001.

[3] N. Almassy, G.M. Edelman, and O. Sporns. Behavioral constraints in the development of neuronal properties: a cortical model embedded in a real-world device. Cerebral Cortex, 8:346-361, June 1998.

[4] C. Balkenius. Generalization in instrumental learning. In P. Maes, M.J. Mataric, J.A. Meyer, et al., editors, From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior. The MIT Press/Bradford Books, Cambridge, MA, 1996.

[5] C. Balkenius and J. Moren.
Computational models of classical conditioning: a comparative study. Lund University Cognitive Studies, 62, 1998.

[6] B.I. Bertenthal, J.J. Campos, and K.C. Barrett. Self-produced locomotion: an organizer of emotional, cognitive, and social development in infancy. In R. Emde and R. Harmon, editors, Continuities and Discontinuities in Development. Plenum Press, New York, NY, 1984.

[7] H. Bourlard and N. Morgan. Neural networks for statistical inference: generalizations with applications to speech recognition. In Proc. International Joint Conference on Neural Networks, pages 242-247, Singapore, 1991.

[8] R. Brooks. Intelligence without reason. In Proc. Int'l Joint Conf. on Artificial Intelligence, pages 569-595, Sydney, Australia, August 1991.

[9] R. Brooks. Cog, a humanoid robot. Technical report, MIT Artificial Intelligence Laboratory, 1997. Private communication.

[10] R.A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14-23, March 1986.

[11] B.G. Buchanan and E.H. Shortliffe, editors. Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley, Reading, MA, 1984.

[12] H.H. Bulthoff and S. Edelman. Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proceedings of the National Academy of Sciences, 89:60-64, January 1992.

[13] J.P. Byrnes. Cognitive Development and Learning in Instructional Contexts. Allyn and Bacon, Boston, 1995.
Hein. Movement-produced stimulation and the development of visualll guided behaviors. Journal of Comparative and Physiological Psychology, 56:872—876, 1963. [30] J. Hertz, A. Krogh, and R.G. Palmer. Introduction To the Theory of Neural Computation. Addison-Wesley, Reading, MA, 1991. 196 BM [32] [331 [34] [37] H.V.B. Hirsch and D.N. Spinelli. Modification of the distribution of recep- tive field orientation in cats by selective visual exposure during develOpment,. Experimental brain research, 13:509—527, 1971. TS. Huang, L.S. Chen, and H. Tao. Bimodal emotion recognition by man and machine. In Proc. ATR Workshop on Virtual Communication Environments, Kyoto, Japan, April, 1998. W.M. Huang and R. Lippmann. Neural net and traditional classifiers,. In D. Anderson, editor, Neural Information Processing Systems,, pages 387—396. American Institute of Physics, New York, NY, 1988. X.D. Huang. Phoneme classification using semicontinuous hidden markov mod- els. IEEE Trans. Signal Processing, 40(5), 1992. W. Hwang and J. Weng. Hierachical discriminant regression. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11):1277 —1293, 2000. W. Hwang and J. Weng. An online training and online testing algorithm for OCR and image orientation classification using hierarchical discriminant re- gression. In Proc. Fourth IAPR International Workshop on Document Analysis Systems, Rio De Janeiro, Brazil, December 10-13, 2000. W.S. Hwang. Visual learning and its application to sensorimotor actions. PhD thesis, Department of Computer Science and Engineering, Michigan State Uni- versity, 2000. 197 [38] [391 [401 [41] [42] [43] [44] [45] F. Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Trans. Acoustics, Speech and Signal Processing, 23:67—72, February 1975. G. Zavaliagkos, J. McDonough, D. Miller, et al. The bbn byblos 1997 large vocabulary conversational speech recognition system. In Proc. IEEE Int’l Conf. Acoust, Speech and Signal Processing, pages 905—908, Seattle, Washington, May, 1998. M. Johnson. The body in the mind: the bodily basis of meaning, imagination, and reason. The University of Chicago, Chicago and London, 1974. D. Jurafsky and J.H. Martin. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Prentice-Hall, Inc., Upper Saddle River, New Jersey, 2000. BR. Kandel, J.H. Schwartz, and TM. Jessell, editors. Principles of Neural Science. Appleton and Lance, Norwalk, Connecticut, third edition, 1991. A.H. Kloph. A neuronal model of classical conditioning. Psychobiology, 16:85- 125, 1988. T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, second edition, 1988. T. Kohonen. Self-Organization and Associative Memory. Springer, Berlin, third edition, 1989. [46] T. Kohonen. The self-organizing map. Proc. of the IEEE, 78:1464—1480, 1990. 198 [47] E. Kreyszig. Advanced engineering mathematics. Wiley, New York, 1988. [48] H.J. Kushner and D.S. Clark. Stochastic approximation methods for constrained and unconstrained systems. Springer-Verlag, New York, 1978. [49] L. Lamel and R. Cole. Language resources: spoken language corpora. In G. B. Varile and A. Zampolli, editors, Survey of the State of the Art in Human Lan- guage Technology. Cambridge University Press, Cambridge, UK, 1997. [50] CH. Lee. On stochastic feature and model compensation approaches to robust speech recognition. Speech Communication, 25:29—47, 1998. [51] Q. Liu, S. Levinson, Y. Wu, and T. Huang. 
Robot speech learning via entropy guided LV Q and memory association. In Proc. INNS—IEEE International Joint Conference on Neural Networks, pages 2176—2181, Washington, DC, July 14- 19, 2001. [52] N Logothetis, J. Pauls, and T. Poggio. Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5(5):552—563, 1995. [53] NJ. Mackintosh. Conditioning and Associative Learning. Clarendon Press/ Oxford University Press, Oxford/ New York, 1983. [54] D. Marr and H.K. Nishihara. Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B: Biological Sciences, 200:269—294, 1978. 199 [55] [56] [571 [58] [60] [61] [62] N. Mercer. Words and Minds: How We Use Language to Think Together. Routledge, London, UK, 2000. J. Moody and C. Darken. Fast learning in networks of locally tuned processing units. Neural Computation, pages 281—294, 1989. RD. Mori and F. Brugnara. Hmm methods in speech recognition. In G. B. Var- ile and A. Zampolli, editors, Survey of the State of the Art in Human Language Technology. Cambridge University Press, Cambridge, UK, 1997. E. Ochs and B Schieffelin. Language acquisition and socialization: three devel- opmental stoires and their implications. In Shweder R. and LeVine R., editors, Culture Theory: Essays on Mind, Self, and Emotion. Cambridge University Press, Cambridge, UK, 1984. E. Oja. Subspace Methods of Pattern Recognition. Research Studies Press, Letchworth, UK, 1983. E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Application, 106:69—84, 1985. N.L. Owsley. Adaptive data orthogonalization. In Proc. IEEE Int’l Conf. Acoust., Speech and Signal Processing, pages 109—112, Tulsa, Oklahoma, April 10-12 1978. J. Pauls, E. Bricolo, and N Logothetis. View invariant representations in mon- key temporal cortex: position, scale, and rotational invariance. In Shree K. 200 [63] [64] [65] [67] [68] [69] Nayar and Tomaso Poggio, editors, Early Visual Learning, pages 9—41. Oxford University Press, New York Oxford, 1996. LP. Pavlov. Conditioned Reflexes: an Investigation of the Physiological Activity of the Cerebral Cortex. Oxford University Press, London, 1927. B. Peskin, L. Gillick, and N. Liberman. Progress in recognizing conversational telephone speech. In Proc. IEEE Int’l Conf. Acoust., Speech and Signal Pro— cessing, pages 1811—1814, Munich, Germany, April, 1997. P. J. Phillips, H. Moon, P. Rauss, and S. A. Rizvi. The F ERET evaluation methodology for face-recognition algorithms. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 137—143, Puerto Rico, June 1997. J. Piaget. The origins of intelligence in children. International Universities Press, New York, 1956. D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In D. Touretzky, editor, Advances in Neural Information Processing, volume 1, pages 305—313. Morgran-Kaufmann Publishers, San Mateo, CA, 1989. W .H. Press, B.P. Flannery, S.A. Teukolsky, and WT. Vetterling. Numerical Recipes in C. Cambridge University Press, New York, 2nd edition, 1986. P. Price. Spoken language input: spoken language understanding. In G. B. Var- ile and A. Zampolli, editors, Survey of the State of the Art in Human Language Technology. Cambridge University Press, Cambridge, UK, 1997. 201 [70] SE. Levinson, Q. Liu, et al. 
The role of sensorimotor function, associative memory and reinforcement learning in automatic acquisition of spoken language by an autonomous robot. In Proc. Workshop on Development and Learning, East Lansing, Michigan, April 5-7, 2000. [71] L. Rabiner and B. Juang. Fundamentals of Speech Recognition. Prentice Hall, [721 Englewood Cliffs, NJ, 1993. L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of IEEE, 77(2):257—286, 1989. [73] LR. Rabiner and RH. J uang. An introduction to hidden markov models. IEEE [74] [751 [751 [77] ASSP Magazine, pages 4—16, January 1986. K. Richardson. Models of Cognitive Development. Psychology Press Ltd, East Sussex, UK, 1998. G. Rigoll. Maximum mutual information neural networks for hybrid connectionist-hmm speech recognition systems. IEEE Trans. on Speech and Audio Processing, 2(1), January 1994. A. Robinson and F. Fallside. Static and dynamic error propapgation networks with application to speech coding. In D. Anderson, editor, Neural Information Processing Systems, pages 632—641. American Institute of Physics, New York, NY, 1988. D. Roy. Learning from multimodal observations. In Proc. IEEE Int. Conf. Multimedia and Expo, New York, NY, July, 2000. 202 [78] D. Roy and A. Pentland. Learning words from sights and sounds: A computa- tional model. Cognitive Science, 26(1):113—146, 2002. [79] D. Roy, B. Schiele, and A. Pentland. Learning audio-visual associations using mutual information. In Workshop on Integrating Speech and Image Understand- ing, Proc. Int’l Conf. Comp. Vision, Corfu, Greece, September, 1999. [80] J. Rubner and K. Schulten. Development of feature detectors by self- organization. Biological Cybernetics, 62:193—199, 1990. [81] S. Russell and R. Norvig. Artificial Intelligence — A Modern Approach. Prentice- Hall, Inc., New Jersey, 1995. [82] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoustics, Speech and Signal Processing, 26:43—49, February 1978. [83] L.M. Saksida, S.M. Raymond, and D.S. Touretzky. Shaping robot behavior us- ing principles from instrumental conditioning. Adaptive Behavior, 22(3/4):231— 249, 1998. [84] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210—229, 1959. [85] TD. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. IEEE Trans. Neural Networks, 2:459—473, 1989. 203 [86] I. Sirovich and M. Kirby. Low-dimensional procedure for the caracterization of human faces. Journal of Optical Society of America A, 4(3):519—524, March 1987. [87] O. Sporns, N. Almassy, and GM. Edelman. Plasticity in value systems and its [881 [89] [90] [91] [92] role in adaptive behavior. Adaptive Behavior, 7(3), 1999. RS. Sutton and AG. Barto. Toward a modern theory of adaptive networks: expectation and prediction. Psychological Review, 88:135—170, 1981. RS. Sutton and AG. Barto. A temporal-difference model of classical condition- ing. In Proc. the Ninth Conference of the Cognitive Science Society, Erlbaum, 1987. RS. Sutton and AG. Barto. Reinforcement Learning — An Introduction. The MIT Press, Chambridge, MA, 1998. Zeppenfeld T., Finke M., Ries K., Westphal M., and Waibel A. Recognition of conversational telephone speech using the janus speech engine. In Proc. IEEE Int ’l Conf. Acoust., Speech and Signal Processing, pages 1815—1818, Munich, Germany, April, 1997. PA. Thompson. 
An adaptive spectral analysis technique for unbiased frequency estimation in the presence of white noise. In Proc. 13th Asilomar Conf. on Circuits, System and Computers, pages 529—533, Pacific Grove, CA, 1979. 204 [93] K. Torkkola and M. Kokkonen. Using the topology-preserving properties of sofms in speech recognition. In Proc. IEEE Int’l Conf. Acoust., Speech and Signal Processing, pages 261—264, Toronto, Canada, 1991. [94] D.S. Touretzky and L.M. Saksida. Operant conditioning in skinnerbots. Adap- tive Behavior, 5(3/4):219—247, 1999. [95] T. Vintsyuk. Element-wise recognition of continuous speech composed of words from a specified dictionary. K ibernetika, 7:133—143, March-April 1971. [96] P. Violi. Meaning and Experience. Indiana University Press, Bloomington, IN, 2001. [97] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech and Signal Processing, 37, March 1989. [98] C. J. Watkins. Q-learning. Machine Learning, 82279-292, 1992. [99] R. W'atrous. Speech Recognition using Connectionist Networks. PhD thesis, University of Pennsylvania, 1988. [100] J. Weng. The living machine initiative. Technical Report CPS 96—60, Depart- ment of Computer Science, Michigan State University, East Lansing, MI, Dec. 1996. 205 [101) [102] [103) [104] [105] [106] [107] J. Weng. A theory for mentally developing robots. In Proc. IEEE 2nd Inter- national Conference on Development and Learning (ICDL 2002), MIT, Cam- bridge, MA, June 12-15 2002. J. Weng and I. Stockman (eds). Proceedings of NSF/DARPA Workshop on Development and Learning. East Lansing, Michigan, April 5-7, 2000. J. Weng, T. S. Huang, and N. Ahuja. Motion and Structure from Image Se- quences. Springer-Verlag, New York, 1993. J. Weng and W. Hwang. An incremental learning algorithm with automatically derived discriminating features. In Proc. Fourth Asian Conference on Computer Vision, pages 426—431, Taipei, Taiwan, January 8-9, 2000. J. Weng, W.S. Hwang, Y. Zhang, and G. Evans. Developmental robots: theory, method and experimental results. In Proc. 2nd International Symposium on Humanoid Robots, pages 57—64, Tokyo, Japan, October 8-9, 1999. J. Weng, W.S. Hwang, Y. Zhang, C. Yang, and R. Smith. Developmental hu- manoids: humanoids that develop skills automatically. In Proc. The First IEEE- RAS International Conference on Humanoid Robots, Boston, MA, September 7-8, 2000. J. Weng, Y. B. Lee, and C. H. Evans. The developmental approach to mul- timedia speech learning. In Proc. IEEE Int ’1 Conf. on Acoustics, Speech, and Signal Processing, pages 3093—3096, Phoenix, Arizona, March 15—19, 1999. 206 [108] [109] [110] [111] (112) J. Weng, J McClelland, A. Pentland, O. Sporns, I. Stockman, M. Sur, and E. Thelen. Autonomous mental development by robots and animals. Science, 291:599—600, January 26, 2001. J. Weng, Y. Zhang, and W.S. Hwang. Teaching a learning vehicle - a develop- mental perspective. In Proc. Robotics and Mechatronics Congress (RMCQOOI), Singapore, June 6-8, 2001. S.L. Zeger and B. Qaqish. Markov regression models for time series: a quasi- likelihood approach. Biometrics, 44(4):1019—1031, December 1988. Y. Zhang and J. Weng. Grounded auditory development of a developmental robot. In Proc. INNS-IEEE International Joint Conference on Neural Networks, pages 1059—1064, Washington, DC, July 14-19, 2001. V. Zue, R. Cole, and W. Ward. Spoken language input: speech recognition. In G. B. Varile and A. 
Zampolli, editors, Survey of the State of the Art in Human Language Technology. Cambridge University Press, Cambridge, UK, 1997. 207 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII J|[[11]][1[[[J]]JJ]]]|]]IJl