2.. 1‘3

z .11?in ‘.11.l1ﬂ
.121111.11 1 1'

23101“: 12:; 11391133212133.

3 411:
r '72 ”mu '1‘” "2 ,‘1331
"3; ' ‘4 3"?

. " '3 h ' 2
. ‘ ',.3' ‘ ‘ "92'1“? ".1111.
I “ .3- F351;" ”51-“. '
'. ' 12122923 333’} 1 ‘2 '
‘ . '11:" ,t'v3211 3F? '
‘

 

. 2'
M1,..-
2"1L‘231 3"." .22";
3":Irﬂ12511'2
1111-313 3 3 n 31'“
'23:: "0'" 33/.

2
13
"3:: .1...

n
a" ,

Al!
&'

~

53—:-
In
~‘ 0...»... -
a -m-r .
. . .

3, 1..

n — :1:

1.1.“.

. . . . - 1 <
.. .
.. . .., _.
. ._ . .
_ .-..' '. _ 1 ..
W... M.“ _ '1; .....
. - 3
. w
.. . .. ,.
1 w 2“.
1 .. v Iv
~. 4: -2:

-4. Jun... ..'.F
v

.. . <,,._...‘,.-._

. . ----
an. .- -....~c

. . I. . .
<- ”0.9”“
1.. .. . ..
u . a
.. . 1.

,. '4- “ '
'-.- -!
. .txn-..

1 11. 11,11 1 1 11:2.
31' 31—22 453172: ‘i;13f”é" {‘313Eﬂ3i‘éﬁr
mm 2‘ ’ 3.1 f 1
33'1’.’"'

:10.“

"-l-i‘ a

.‘ ' ‘ .- ‘ ‘ H.132", 117.1...

2 . . 3"; 2’32 ”"3223
.., 51 :21 me ~t .21'; ‘ .' -
. =-"3' ”3" 1133'?

1135.1'

13.3.53 ,1
.121o'1313ﬁ11331 ’

' 2 , z 33'.

.2 ”’12”; I $I’y- Fig; :15“: Li,
m‘w “'3--31'~21..ﬁ‘ 1 '31 g
' 11311133331133 52'!“ :I

:13: '3‘” 3:3

"u- AH.” "'
.1' '“ ""

 

"2‘4: 2‘3
1.
13223:"

2 “32311333325?
”33331231233" 2:111.
'2'323'21'22332221212-
""32: - 114212.133!
1 '1 .
E’rt n.1, ,
11351:}: Li 111311111
IE

'b 3
33'3"". : .' 2

2‘3 ”15‘3”;sz
1 ~ ”3"

' "2%" 'j:
‘ a

l

2-- «um

2.53:1 '
231323 ~14

2.332333

Ow" . ..
c..-.

a
’ 3

w 7:23;:

«a:

4r
'.u

~ . "-
run-rau—

TIE" '
.a.~

3213'. 5.72
1 33 111i??? 23’

,3
F

“'21;
2

. 3 <
«vi-— u...”

- F.).L..‘v':wv . ' .

2122:2221 ' ‘
232:1;

. 7‘31... H
z - ‘.r
wit-«J».
m” ‘ "$5
.1: n".

".1?

«3 .
p;- .: .. --
- ‘A V 1 l

' 1141-21 (1.11:; "(31:
3332223 ~21 211111.

323: 1.323.313
33%} 313'

Am

2;"3"

an
-m-

'31:” M
m.-
M97. M
:w
E":

22-.»
”ﬁbéﬂ

<~

v

1‘ 1.3031" t?

.1 m;

- in. , k
:' u . 233
' "" 24.135”;

'31:“
’-:W,'. ":2

.. "an.
1. ‘ 1 .
..1-4u_-u-.n. . a»

a
v:
u .
.2”.
2‘ “'

.‘m

._..
. .m—A.

«-
w a.-

m
w

_ .
A- - 2 ' . " .
..... . . ..
”A"...
u. 4 . ' . ,.
. ‘ .5 .
2
”0‘ V
.....~.:z 4 a

.
..—~..
- 1.." '
«nor-M. .2.
1-»

.‘7‘7
115'

23:"-

”322
I .3
.x u. .41:

"‘33:.
2:11;

244.-....
u ....

:35”?!

3. ‘2

,5
-! l

2‘32

h
'1" 1:... ,5... .111-
3‘.

2'
m an

r: ’.’= 3'!
"'
"2:

 

1.22 2‘" '2 73%

”‘5' .. .1 u..-

I .u.

.
. . .31. ”rd-d
r ..

"if”?

This is to certify that the
dissertation entitled
V.‘4u oJ WM“? “A I“ AFFW *‘W‘ 1’9
622w 4015-440“! A news

presented by
We} " “GUM“ llkjav‘g

has been accepted towards fulﬁllment
of the requirements for

F14. D degreein Cw’wﬂ/ éll‘WU? 4M1 gnzmry

 

 

Mfg/V7?

MS U is an Affirmative Action/Equal Opportunity Institution 0-12771

 

 

 

LIBRARY

Michigan State

University

 

 

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE

DATE DUE

DATE DUE

 

12D

imﬂ‘ia‘ﬁ!‘

 

 

 

 

 

 

 

 

 

 

 

 

 

6’01 cJCIRCJDateDue.p65-p.15

 

VISUAL LEARNING AND ITS APPLICATION TO
SENSORIMOTOR ACTIONS

By

Wey-Shiuan Hwang

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department Of Computer Science and Engineering

1999

ABSTRACT
VISUAL LEARNING AND ITS APPLICATION TO SENSORIMOTOR
ACTIONS
By

Wey-Shz'uan H wang

The capability of recognition is a key indicator of the capability of an autonomous
agent. However, high dimensionality and variation of sensory input make learning
for recognition a very challenging task. The goal of the work presented here is to
enable a system to learn directly from unsegmented and unedited sensory streams
while interacting with the environment including human teachers.

Automatically generate states for the tasks that the programmer does not know or
even understand. Thus, the internal representation must be automatically generated.
Although the framework is applicable to various types of learning mode, this work
concentrates on the mode where desired actions are imposed in real time during
training.

A major technical challenge to realize the above Objective is to automatically es-
tablish the mapping between a high dimensional input space and an output space.
This mapping is accomplished by a doubly clustered subspace-based hierarchical dis-

criminating regression (HDR) tree proposed in this work to efﬁciently deal with both

classiﬁcation and regression problems in high dimensional space. The major charac-
teristics of this algorithm include: (1) Clustering is performed in both output Space
and input space at each internal node and thus the term “doubly clustered.” (2) Dis-
criminants in the input Space are automatically derived from the clusters in the input
space. (3) A hierarchical probability distribution model is applied to the resulting
discriminating subspace at each internal node. This realizes a coarse—tO-ﬁne approxi-
mation of probability distribution of the input samples. (4) A ample-size dependent
negative-log-likelihood (N LL) is introduced to relax the per-class sample requirement.
(5) The execution of HDR tree is fast, due to the logarithmic time complexity of the
HDR tree.

To learn interactively in real time in the environment, an incremental version of
HDR tree (IHDR tree) was designed to meet this requirement. The IHDR tree rejects
or accepts a learning sample according to the real time response. A forgetting process
is applied to the IHDR tree algorithm to constrain the growth of the memory while
not having a sudden decrease in the execution performance. In order to have a longer
context for the robot, the current state of the robot and previous action feedback
were also a part of the input to the robot.

The HDR tree algorithms were tested for different types Of data: synthetic data
for examining the near-Optimal performance, large raw face-image data bases along
with a comparison with some major existing methods, such as CART, C5.0 and 0C1.
In addition to these data, the IHDR tree was applied to the vision-based navigation
problem using simulated data. The proposed algorithm was also applied on the real-

time tracking and reaching tasks for the robot application.

ACKNOWLEDGMENTS

I wish to thank my thesis advisor Professor Juyang Weng for his guidance and en-
couragement in the development of this thesis. He provided me with numerous ideas,
suggestions, and comments. Without him, this dissertation would not have been
completed. I am very lucky to have such a distinguished and responsible professor as
my advisor. I am grateful to professors George Stockman, Sridhar Mahadevan, Ning
Xi, and Lijian Yang, for serving my committee, sharing ideas with me, and providing
critical comments on my dissertation.

I beneﬁted from many discussions with members who participated SHOLSLIF
project, especially Shaoyun Chen, Yuntao Cui, Sally Howden, and Dan Swets. During
the development of SAIL project, I enjoyed many open discussions with, Greg Bloy,
Colin Evans, Mahesh Krishnaswamy, Yong-Beom Lee, Gongjun Li, Guo L. Liu, Jason
Sperber, Changjiang Yang, and Yilu Zhang.

During my stay in the Pattern Recognition and Image Processing (PRIP) Lab-
oratory at Michigan State University, I met many distinguished visitors. I would
like to thank Dr. Sei-Wan Chen from National Taiwan Normal University and Dr.
Chaur-Chin Chen form National Chiao Tung University in Taiwan for many moti-

iv

vated discussions and ideas sharing. A special acknowledgment goes to Dr. J ianZhong
Qian and Dr. Ming Fang of Siemens Research Corporate, where I spent two sum-
mers as a research intern. A broad exposure to various research problems made my
internship a valuable experience.

Thanks also to PRIPpies: Paul Albee, Vera Bakic, Scott Connell, Nicolae Duta,
Dan Gutchess, Rein-Lein Hsu, Erin l\«"Icgarrity, Salil Prabhakar, Arun Ross, Aditya
Vailaya, Jianguo Wang and former PRIPpies Jinlong Chen, Chitra Dorai, Marrie-
Piere Dubuisson, Kamen Geuntchev, Lin Hong, Kalle Karu, Yonghong Li, Jinhui
Liu, Jianchang Mao, Sharathcha Pankanti, and Nalini Ratha, Bin Yu, and Yu Zhong.
Many Of them have become good friends of mine. I enjoy many open discussions and
wonderful group activities that made my graduate study unforgettable.

Finally, my sincere thank goes to my dear wife Meng-Ling Hsieh for her love,
patience, encouragement, support, and understanding. I share all my academic ac-
complishments with her. Thanks to my daughter, Sophia, who was born in this
period, for having brought a lot Of joy to my family. I would also like to express
my heartful thanks to my mother, Shiow-Jy Chen, for her lifelong love, sacriﬁce and
support. Without her love, sacriﬁce, and hard work during my childhood and teenage

years, I would not have received any education.

TABLE OF CONTENTS

LIST OF FIGURES viii
LIST OF TABLES xii
1 Introduction 1
1.1 Existing machine learning mode ....................... 2
1.2 Comparisons of different approaches .................... 4
1.3 Developmental learning ............................ 9
1.3.1 AA-learning ................................. 9
1.3.2 States .................................... 11
1.3.3 Learning types ............................... 13
1.4 Learning in hight dimensional Space ..................... 15
1.5 Thesis overview ................................ 17

2 Hierarchical Discriminant Regression Tree: Batch Learning Mode 19

2.1 Introduction .................................. 19
2.2 Classiﬁcation and regression ......................... 22
2.3 Discriminant analysis for numerical output ................. 24
2.4 Distance in discriminating space ....................... 29
2.4.1 Discriminating subspace .......................... 29
2.4.2 Size-dependent negative-log-likelihood .................. 31
2.4.3 Computational considerations ....................... 39
2.5 The experimental results ........................... 41
2.5.1 Experiments using synthetic data ..................... 41
2.5.2 Experiments using real image data .................... 50
2.5.3 Experiments using data with manually extracted features ........ 63

3 Hierarchical Discriminant Regression Tree: Incremental Learning

Mode 67
3.1 introduction .................................. 67
3.2 Incremental Hierarchical Discriminant Regression ............. 69
3.3 Amnesic average ............................... 72
3.4 The experimental results ........................... 75
3.4.1 Experiments using synthetic data ..................... 75
3.4.2 Experiments using real face data ..................... 77
3.4.3 Experiments with autonomous navigation problem ........... 85
3.4.4 Experiments using data with manually extracted features ........ 89

vi

4 Learning from Continuous Input Stream 90
4.1 System Overview ............................... 90
4.2 The Split SHOSLIF .............................. 92
4.3 Forgetting ................................... 95
4.4 Spatiotemporal clustering .......................... 96
4.5 The experiments ............................... 99
4.5.1 The system flow .............................. 99
4.5.2 Experimental results ............................ 101
5 Application to the Robotics 112
5.1 Introduction .................................. 112
5.2 System Architecture ............................. 116
5.2.1 The Hardware Setup ............................ 116
5.2.2 The Software Setup ............................ 122
5.3 Experimental Results ............................. 128
6 Conclusions 136
6.1 Summary ................................... 136
6.2 Future Work ................................. 138
APPENDICES 141
A Linear manifold 142
B Cholesky Decomposition 145
C Sensor System of SAIL Robot 146
C.1 Communication speciﬁcation ......................... 146
C2 The A/D board ................................ 147

vii

LIST OF FIGURES

1.1 An AA-learning agent has two types of channels to interact with the en-
vironment, sensors and effectors. The double arrow for the effectors
means that the actions imposed by the environment (e.g., human) can
be sensed by the “brain.” ........................

2.1 Y-clusters in space y and the corresponding x-clusters in space X. The ﬁrst
and the second order statistics are updated for each cluster. By default, the
normalized Mahalanobis distance is used for x—cluster and the Euclidean
distance is used for distance to y-cluster. .................

2.2 Decision boundaries estimated by 500 samples per class. Lines ’BIZ’ mean deci-
sion boundaries between class 1 and class 2. Similarly, lines ’B13’ represent
boundaries between class 1 and class 3 and lines ’B23’ represent boundaries
between class 2 and class 3. Lines ‘B’ mean decision boundaries for Bayesian
decision rule. This method uses the ground truth for distribution and thus
is independent of samples. Lines ‘E’ are for Euclidean distance measured
from a scale covariance matrix p21. Lines ‘G’ are measured by Gaussian
NLL using estimated full sample covariance matrices for all clusters. Lines
‘M’ are for Mahalanobis distance using a single estimated covariance ma-
trix 3,”. Lines ‘L’ use our SDNLL. (a) Bayesian decision boundaries. (b)
Gaussian likelihood boundaries. (c) Mahalanobis likelihood boundaries. (d)

10

25

Euclidean likelihood boundaries. (e) SDNLL boundaries. (f) The overall view. 43

2.3 Decision boundaries estimated by 50 samples per class. (a) Bayesian decision
boundaries. (b) Gaussian likelihood boundaries. (c) Mahalanobis likelihood
boundaries. ((1) Euclidean likelihood boundaries. (e) SDNLL boundaries.
(f) The overall view. ............................

2.4 Decision boundaries estimated by 5 samples per class. (a) Bayesian decision
boundaries. (b) Gaussian likelihood boundaries. (c) Mahalanobis likelihood
boundaries. (d) Euclidean likelihood boundaries. (e) SDNLL boundaries.
(f) The overall view. ............................

2.5 Decision boundaries for unbalanced samples distribution. Class one has 500
samples. Class two has 50 samples and class three has only 5 samples.
(a) Bayesian decision boundaries. (b) Gaussian likelihood boundaries. (c)
Mahalanobis likelihood boundaries. (d) Euclidean likelihood boundaries.
(e) SDNLL boundaries. (f) The overall view. ...............

2.6 The data used in the second experiment. ...................

2.7 Face images from Weizmann Institute, all the combination of 3 lightings, 2
expressions, and 5 orientations. ......................

45

46

2.8

2.9

2.10
2.11

2.12

2.13

2.14

2.15

2.16

3.1

3.2

3.3

3.4

3.5

3.6

3.7

3.8

The demonstration of the image normalization process. (a) and (e): The original
image from the F ERET data set. (b) and (f): The normalized image. (c)
and (g): The masked image. ........................

The plot of error rate vs. number of x-clusters for FERET face test 1 using
Euclidean distance and the new SDN LL distance. .............

mean images in the root. ...........................

The tree structures of FERET face test 1. (a) The plot of depth of the tree vs
q for different distance Options. (b) and (c): The plots Of tree structures for
different Options with q = 2 for EU and SDNLL, respectively. EU: Euclidean
Distance. .................................

The timing data of FERET face test 1. (a) The plot of the average training
time vs q. (b) The plot Of the average testing time vs q. EU: Euclidean
Distance. .................................

The performance plots of FERET test 2. The plots of error rate vs. number Of
x-clusters. EU: Euclidean Distance. ....................

The tree structures of FERET face test 2. (a) the plot Of depth Of the tree
vs q for different Options. (b) and (c) are the plots of tree structures for
Euclidean Distance and SDNLL distance with q = 2, respectively.

The timing data Of F ERET face test 2. (a) The plot Of the average training
time vs q. (b) The plot of the average testing time vs q. EU: Euclidean
Distance. .................................

Letter images from car license plates. .....................

The effect of y-clustering algorithm. The number of samples for (a), (b), (c),
(d), and (e) are 3, 33, 63, 93, and 123, respectively. ............
The effect of y-clustering algorithm (Continue from Fig. 3.1.) The number
of samples for (f), (g), (h), (i), and (j) are 153, 183, 213, 243, and 273,
respectively. ................................
Face images from Weizmann Institute. The training images of one subject. 3
lightings, 2 expressions, and 3 orientations are included in the training set.
Face images from Weizmann Institute. The testing images of one subject. 3
lightings, 2 expressions, and 2 orientations are included in the testing set. .
The performance plot of F ERET face test 1. The plot of error rate vs. number
of epochs. The “Resub” line means the resubstitution error rate. The “Test”
line represents the testing error rate .....................
A subset of images used in autonomous navigation problem. The number right
below the image shows the heading direction associated with that image. .
The performance of the autonomous navigation. (a) The plot for maximum
error rates vs. epochs. (b) The plot for mean error rates vs. epochs. The
solid line represents the error rates for resubstitution test. The dash line
represents the error rates for the testing set .................
The histograms of the error rates. Plot (a), (b), (c), and (d) correspond to the
histograms at epoch 1, 6, 11, 20, respectively. ...............

ix

55

56
56

58

59

59

60

61
62

78

79

81

81

85

86

87

4.1 Learner and trainer. Xi: visual input. Ziz numerical sensory input from the
trainer. 1;: action to be imposed. Aiz action to be performed. ......
4.2 Update of memory trace (strength) M through time t. The solid curve repre-
sents an element which is visited Often enough to be kept. The dashed curve
indicates an element that is not visited often enough and thus, it falls below
the threshold T before being visited again. ................
4.3 A schematic illustration of the temporal adjacency cluster. Each region in Y
space represents a P-cluster. A dashed curve is used to roughly indicate the
samples that fall into the cluster. The circles to the right of the SHOSLIF
tree are the leaf nodes of the tree. Each —-,= or so: Sign indicates the relative
positions of leaf nodes (cf-prototype) in the input space to the SHOSLIF tree.
Two different signs representing two semantically different input sequences.
Each leaf node in the SHOSLIF tree has a pointer to the Y space, which
points to the center of the corresponding P-cluster or close to it after a
signiﬁcant amount of gravity pulling and merging. Y is shown here as 2-D
for visualization, but it is typically of the same dimensionality as X or with
a reduced dimension by a factor of dimension reduction. .........
4.4 The effect of pulling in Spatiotemporal clustering. The SHOSLIF tree is repeat-
edly visited by 2:1 z 2:2 2 x3 that are nearby in Spatiotemporal domain.
The two leaf nodes are among the top A: matched leaf nodes, although they
are separated early in the tree. Consequently, their Y vectors are merged
when they are sufficiently close and the corresponding prototypes of the two
leaf nodes belong to the same P-cluster. .................

97

98

4.5 The ﬂowchart for action-imposed learning. The system learns while performing. 100

4.6 A temporally subsampled segment of the real-time video stream during which
eight persons were presented to the system for training. Only one person is

involved in this segment. The images are shown in the English reading order. 107

4.7 State representations at level zero. The values of numerical sensor and numerical
action are represented as (S, A). Where 5' = 0 means “silence” and S = 1
means “asking question”. A = 0 means “no action” and A = 1 means
“person 1” ..................................

4.8 A temporally subsampled segment of the real-time video stream during the test
session, corresponding to the person shown in Fig. 4.6. ..........

4.9 Attention gates for two channels in 100 i.r.c. (a) is for numerical input
and (b) is for image input. When the attention is 1, the gate is open.
Otherwise, the gate is closed. ......................

4.10 The testing results which uses attention selection mechanism. (a) the
system response during the testing phase. (b) the attention for the
numerical channel. The peak at 347 i.r.c in (a) indicates a wrong
answer at the beginning Of person 4 entering. The wrong response is
corrected by the latter signals. .....................

5.1 The SAIL robot built at the Pattern Recognition and Image Processing Labo-
ratory at Michigan State University. ...................
5.2 The Eshed robot arm. .............................

108

109

110

5.3
5.4
5.5

5.6

5.7
5.8

5.9
5.10

5.11

5.12

5.13

5.14

A.1

The system diagram of the SAIL robot (right side view). ...........
The system diagram Of the SAIL robot (left side view). ...........
The system diagram of the SAIL robot. (a) hardware connection. (b) power
connection. ................................
The flow diagram of the system. Dashed line box indicates a separate thread.
Oval box means thread. Solid line is for control flow and dashed line rep-
resents data ﬂow. Dot line indicates spawning a thread. Frame time is up
every 100 ms. ...............................
The inputs/outputs of the brain process. ...................
(a) A system without feedback. The x axis means the input I1. The training
samples are mixed up on the boundary Of the action 0 the other two actions.
The system will behave dangling between action 0 and the other two actions.
(b) Use feedback as input to the system. The x axis means the input Il
and y axis is the feedback input 12. The distance between action 0 and the
other two actions are pulled away. When the system outputs action 0 from
action 1, it will remain in action 0 unless 11 changes a lot. ........
The input /Output of the IHDR tree .......................
Real-time training and testing for SAIL robot: ﬁnding the nearby ball and then
reaching for it. ...............................
A subset of images used in the tracking problem. The number right below the
image shows the PTU position associated with that image. From left to
right, one image sequence of ball-tracking is shown. ............
The transitions of sensors/effectors for ball tracking problem. (a) The training
session. (b) The testing session. ......................
An illustration of how the training and testing sequence match. The ﬁrst four
rows show a sequence of training samples, from (A) to (P). The related
sensors/effector is shown for each sample. The sensory inputs shown are
the image I(t) and In(t)=(TSK(t), P(t), PA(t-1)). The output Out(t) is
the action of the pan PA(t). Under each image shown the associated inputs
and outputs [In(t), Out(t)]. The last four rows shows a sequence of testing
samples. ..................................
(a) The memory usage for the Off-line simulation of ball tracking task. (b) The
performance of the testing. (c) The average CPU time for each sequence. .

The linear variety (hyper plane) that passes through the head points of the
vectors. It can be represented by 27 + span(S), the spanned space from
scatter vectors translated by the center vector 17 ...............

xi

123
124

126
127

128

129

130

133

135

143

1.1

2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10
2.11

3.1
3.2
3.3
3.4
3.5
3.6

4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8

C.1

LIST OF TABLES

Comparison of Approaches .........................

Characteristics of Three Types of Scatter Matrices ...........
Optimality for 2-D Synthetic Data ....................
Optimality for 3-D Synthetic Data ....................
Optimality for 100-D Synthetic Data ...................
The performance for Weizmann face dataset ...............

The performance comparison of decision trees for the FERET test 1

The performance for F ERET face dataset II ...............
The performance for character images from car license plates .....
Test results on letter image recognition data ...............
Test results on satellite image dataset ...................
Test results shuttle dataset .........................

Error rates for 2-D Synthetic Data ....................
Error rates for 3-D Synthetic Data ....................
Error rates for 100-D Synthetic Data ...................
The performance for Weizmann face data set ..............
The performance of decision tree for FERET test ............
The performance for vision-based navigation ...............

The representation for image input ....................

The representation for numerical sensor input and action output

The Training and the resulting primitive clusters ............
Testing with noise-free images ......................
Result when inputs are noisy with SNR = 14db .............
Result when inputs are noisy with SN R = 2db .............
Real Images: The Resubstitution Test ..................
Real Images: The Disjoint Test ......................

The output of the SAIL sensor system ..................

xii

Chapter 1

Introduction

It is well known that vision is extremely difficult, especially for tasks such as rec-
ognizing Objects in more general settings. Recognition of Objects must cope with a
wide variety of factors, such as lighting, viewing angle, viewing distance and Object
changes (e.g., facial expressions). It is known that learning plays a central role in
the deveIOpment of human versatile visual capabilities and it takes place over a long
period (e.g., Carey [16], Hubel [30], Anderson [5], Martinez & Kessner [36]). Human
vision appears to be more a process Of learning and recalling than relying on an un-
derstanding of the physical processes of image formation and Object-modeling (e.g.,
the “Thatcher’s illusion” [91] and the overhead light source assumption in Shape from
shading [80]). A large amount Of visual data is processed along with other information
is learned by a human individual everyday.

I

1.1 Existing machine learning mode

The importance Of learning in developing intelligent machines has long been recog-
nized. However, learning has been used as a technique for manually deveIOping a
static system, instead of a fundamental vehicle for continued development. With
some thinking, we may realize that the continued development requires more than
just learning for longer time. The current prevailing mode of developing an advanced

system using learning techniques requires two major stages:

Content-level programming. Content-level programming here means the pro-
gramming task for modeling the content or knowledge represented in the infor-
mation of sensory input, other than preprocessing such as Fourier transforms.
Modeling visual shape is an example of content-level programming. One must
(1) start with a particular problem and understand it, (2) create a problem-Space
representation for the problem, called problem-speciﬁc model, (3) convert the
actual problem into a problem-speciﬁc model using the selected representation,
(4) design an algorithm, (5) write a program that implement the algorithm.
The human designers must explicitly deﬁne the content-level mapping between
the sensory input and the problem-speciﬁc model. Here are some examples.
For face recognition, one designs 3 neural networks, one for each of the three
facial parts, eyes, nose and mouth, respectively, and then designs another neu-
ral network that combines the output of the three neural networks. For speech
recognition, a separate HMM is responsible for each word and other HMMS for
phrases that use outputs from word-level networks. For language understand-

2

ing, a particular syntax model will take care of a particular syntax. For mobile
robot navigation using reinforcement learning, each state of the robot is deﬁned

for a speciﬁc location Of the predeﬁned navigation space.

Data compilation and training. Next, humans manually compile sensory data,
which includes creating scenarios, collecting sensory data, labeling data, seg-
menting data, ordering data, feeding data, etc. For supervised learning, typi-
cally a class label is assigned to each segmented data item. For some Simpler
problems that can be sufficiently modeled by a simulated environment, Simula-

tion techniques are extensively used instead of using the real sensory data.

Two major restrictions are direct consequences from such a mode of system devel-
Opment — low quantity and low quality Of information fed into the system. In terms
of quantity, the computers are allowed to Observe far less environmental variation,
context variation, and content domain variation than they really need for the tasks
that they are assigned for. It is difﬁcult to conduct extensive system training due to
the large amount of manual labor that is required in preparing the training data. In
terms of quality, these compiled training data are very much disconnected from the
environment from which the data arise. The rich meaning of the live sensory expe-
rience is degenerated into isolated segments, each being tied to a class label which
is meaningless to the system. The lack of environmental context of manually fed
training segments make it impossible for machines to learn beyond what is modeled
by the content-level programming. Unfortunately, what is modeled by content-level
programming is typically insufﬁcient for the challenging tasks at hand, due to our

3

human limitation in understanding and fully modeling the complex mechanisms of
human cognitive process.

For example, recognition of human faces must cope with a wide variety of vari-
ation factors, such as lighting, viewing angle, viewing distance, and facial changes
(e.g., expressions, hair styles, eye wears etc). A similar situation is true in speech
recognition (e.g., variation in time warping, coarticulation, intonation, age, gender,
etc) and language understanding (ambiguity without context, ambiguity without un-
derstanding, cultural differences, language differences). It is extremely difﬁcult, if not
impossible, for human designers to adequately represent and model these factors, to
design effective knowledge-level rules for them, to collect sufficient training data to
cover the variations, and to keep the system up to date. Certain degree of automation

is very desirable for these challenging problems.

1.2 Comparisons of different approaches

Existing approaches to artiﬁcial intelligence fall into the following four categories:
The world-knowledge-based approaches typically require a predeﬁned problem
space or world space. Human programmers manually model knowledge and spoon feed
knowledge. Researchers in each subﬁeld have been manually developing knowledge-
level theories and methods, and using them to write programs or build hardware.
Then,they manually “spoon feed” knowledge into the systems at the programming
level (e.g., CART [59], CYC [48] [49], lexical database, WordNet [58]). Such an ap-
proach may produce a system that appears to produce some intelligent results. How-

4

ever, the limitation of such systems have been recognized [12] [22]. Such a methodol~
ogy requires a huge amount of human labor and it faces a fundamental limit of humans
to fully model and specify the cognitive process required by challenging robotic tasks.

The behavior-based approaches avoid modeling world and instead models robot
behavior. The subsumption architecture was proposed by Brooks to allow a more so-
phisticated behavior layer to be added to the existing primitive behavior layers [14].
Each layer is a ﬁnite state machine, with states deﬁned and named by the program-
mer. The programmer is also responsible to program the state machine in each
layer for a desired behavior. Thus, this approach can be characterized by the terms
“manually-modeling-behavior and hand-coding-behavior”. Aloimonos [4] and others
also advocated behavior-based approach for active vision.

The evolutionary approaches are motivated by evolution of biological species. The
law of survival of the ﬁttest is used to select advantageous genotypes which code the
structure and / or behavior of simulated robots [54] [28]. SO far, the selection process
have been mostly simulated by computers using a simulated environment, due to the
obvious difﬁculties in carrying on evolutionary process with a large number Of robots
and performing a long-time physical evolution. The simulation method is attractive
due to the low cost beneﬁt. Also, the approach does not require the programmer to
code knowledge or behavior rules. However, evolutionary approaches leave the hard
task Of intelligent system design to the process of random trials and environment
selection, an extremely slow and costly process. Three major issues stand out: (1)
The Chromosome representation for a sophisticated system. The more sophisticated
the system is, the more sophisticated the chromosome is. (2) The extremely high cost

5

of real-robot evolution when high—dimensional perception and sophisticated actions
are required, such as vision, Speech and language. Simulation is not sufﬁcient for
challenging vision, speech and language functionalities. Each system must experience
the real-world. (3) The time, required to ﬁnd a good chromosome, is on the order of a
large number of generations. So far, genetic algorithms are typically used for simple
environment (e.g., symbolic) and Simple behaviors (e.g., symbolic) with carefully
designed environment-speciﬁc and behavior-speciﬁc chromosome representation (see,
e.g., Animate and AutonoMouse.

The learning-based approaches include all the existing learning methods, such
as supervised learning and reinforcement learning. Learning approaches are typi-
cally more efﬁcient than the corresponding evolutionary approaches Since the learning
mechanism of the system is hand-coded by the programmer with the former, but is
either absent or has to emerge from the trial-and-selection process with the latter. For
perception of high-dimensional inputs, learning seems the only viable method. Vari-
ous learning methods have produced impressive results for challenging cognition tasks
involving complex modalities, such as visual recognition (e.g.,[92] [88] [61]), speech
recognition (e.g., using HMM [79] [37]), vision-guided robot manipulation (e.g., [31])
and vision-guided navigation (e.g.,[97]). Supervised learning is typically more efﬁ-
cient than reinforcement learning. However, learning-based methods have not yet
produced systems that truly understand anything. The major reasons include: (1)
All the existing learning methods can only be used for a speciﬁc problem at a time.
For example, an neural network is trained for mapping from every image in a set of
face images to a name label. In reinforcement learning using Q-learning algorithm,

6

a problem space must be given ﬁrst. Then, the human designer must translate the
problem state to the internal representation (e.g., states) of the system model (e.g.,
Q-learning model). However, a system only for a particular problem cannot truly
understand anything. (2) A huge amount Of manual labor required in translating a
Speciﬁc problem to a learning tool. (3) A huge amount of manual labor required in
training a system, including collecting data and testing the system. These reasons
hinder further scaling up to more general, larger-size robotic tasks.

In actuality, a particular system may use a combination of several approaches.
For example, Robot-Soar [46] combines a world-knowledge—based approach with a
learning approach. It requires the human to feed knowledge about the environment
and to deﬁne the problem space. Then, the system learns to perform predeﬁned
tasks. The learning classiﬁer system uses a combination of reinforcement learning
and genetic algorithm.

Table 1.1 summarizes the four existing approaches and the new developmental ap-
proach. Among the ﬁve approaches in the table, the developmental approach appears
to require the least amount of human labor in terms of system design; but it requires
a large number of population individuals and a huge amount of genetic search time,
which makes physical evolution for complex systems impractical. On the other hand,
the knowledge-based, behavior-based, and conventional learning-based approaches all
require extensive human labor in problem-speciﬁc or behavior-Speciﬁc design, which
makes a general-purpose system impractical. The developmental approach seems to
be in the middle in terms of human designer’s effort and the system deveIOpmental cost
(time and money). It requires human designers to properly design a general-purpose

7

Table 1.1: Comparison of Approaches

 

 

Approach Species World System Task
architecture knowledge behavior speciﬁc
Knowledge-based programming manual modeling manual modeling Yes
Behavior-based programming avoid modeling manual modeling Yes
Learning-based programming model with model with Yes
parameters parameters Yes
Evolutionary genetic search model with model with Yes
parameters parameters Yes
Developmental programming avoid modeling avoid modeling No

 

 

 

 

 

 

learning mechanism, but they do not need to explicitly program for the content of
what is to be learned — neither for the world knowledge nor for the system behavior.
It requires only one or a few physical systems to be built to learn. These physical
systems take advantage the richness of intelligence in the human environment. For
example, suppose that in reinforcement learning in a computer simulation environ-
ment, a punishment is given when a sequence of system actions eventually lead to a
failure. The system does not know which action in the action sequence is wrong. In
the developmental approach, however, the human teacher can analyze the observed
system action sequence and can tell the system which action is wrong by (a) ﬁrst
bringing the system to the right context (e.g., lead it to the location of its bad ac-
tion) and then (b) giving it a punishment. Such a very powerful tell-you-what-it-is—for
mechanism can speed up learning tremendously. This type of information-rich learn-
ing environment is more powerful than the time-discounted average reward model
in Q-learning and the time-average reward model in R-learning [75] in dealing with
ubiquitous delayed-reward situations in learning.

8

 

1.3 Developmental learning

Development must be automatic. What is the Operational mode of automatic de-
veIOpment? A machine agent M may have several sensors. At the time Of “birth”,
its sensors fall into one Of the two categories, biased and unbiasedl. If the agent
has a predeﬁned (innate) preference for the Signal from a sensor, this sensor is then
called biased. Otherwise, it is an unbiased sensor, although a preference can be de-
veloped by the agent later through learning. For example, a human being has an
innate preference to sweet and bitter tastes from the taste sensor, but does not have
a strong preference to visual images of various furniture items. By deﬁnition, the ex-
troceptive, proprioceptive and interoceptive sensors are, respectively, those that sense
stimuli from external environment (e.g., visual), relative position of internal control

(e.g., arm position), and internal events (e.g., internal clock).

1.3.1 AA-learning

We introduce a computational deﬁnition of AA-learning (named after automated,

animal-like learning without claiming to be complete) for a machine agent.

Deﬁnition 1 A machine agent M conducts AA-learning at discrete time instances
if after its “birth” the following conditions are met for all the time instances t =
0, 1,2, (I) It! has a number of sensors (biased or unbiased, ertroceptive, proprio-
ceptive, or interoceptive), whose signal at time t is collectively denoted by :r(t). (II)

M has a number of effectors, whose control signal at time t is collectively denoted by

 

1This is an engineering deﬁnition. For a biological organism, it is hard to say that any of its
sensors is absolutely unbiased.

 

 

 

Sensors 4

 

 

Environment

 

Closed brain

 

Teacher Effectors H

 

 

 

 

 

 

Figure 1.1: An AA-learning agent has two types of channels to interact with the
environment, sensors and effectors. The double arrow for the effectors means that
the actions imposed by the environment (e.g., human) can be sensed by the “brain.”

a(t). The effectors include ezrtro-eﬁectors {those acting on the external world) and
intero-eﬁ'ectors {those acting on internal mechanism, e.g., attention). (III) M has a
“brain” denoted by b(t) at time t. (IV) At each time t, the time-varying state-update

function ft updates the “brain” based on sensory input $(t) and the current “brain”

b(t):

b(t+1) = ft(b(t),1r(t)) (1-1)

and the action-generation function g, generates the effector control signal based on

the updated “brain” b(t + 1):

a(t+1) = g,(b(t+ 1)) (1.2)

where a(t + 1) can be a part of the next sensory input :r(t+1). (V) The “brain” of M
is closed in that after the birth (the ﬁrst operation), b(t) cannot be altered directly by

human teachers for teaching purposes. It can only be updated according to Eq. (1.1).

Fig. 1.1 illustrates the relationship between an AA-learning agent and the world. The

10

design for a “brain” representation b(t), the time-varying state—update function f,,
and the action-generation function g, determines the AA-learning mechanism as well
as the maturation schedule. It is worth noting that t can be dropped from ft and
gt in the deﬁnition since b(t) is not restricted in the deﬁnition. For example, any
time varying function ft(b(t),:r(t)) can be represented by a time invariant function
f(:r(t), b(t), t) and (b(t), t) can be deﬁned as the “brain.” The deﬁnition for continuous

time is analogous.

From the deﬁnition we can see that AA-learning does not require two separate
learning and performance phases. The machine agent learns while performing. This
is important for continuous, Open-ended cognitive development. AA-learning does
not require humans to provide edited and segmented sensory input (i.e., no need
to spoon-feed data). The system accepts a continuous, unsegmented sensory input
stream on-line.

The design of the AA-learning mechanism can take into account various phenom-

ena known about animal learning and human learning. Therefore, this seems to be

new ground where artiﬁcial intelligence and biological intelligence can converge.

1.3.2 States

In behavior-based learning approaches, the states of an agent are manually bound to
a set Of predeﬁned task concepts before training [39] [55]. An AA-learning algorithm
must automatically generate states without being given any task.

The state of the“brain” is denoted by a state vector s(t). If s(t) is considered a

11

random process, Eqs. (1.1) and (1.2) are closely related to the formulations for Markov
decision processes (MDP) [75], or HMMS (hidden Markov models) if the action part
is omitted [78] [37]. Indeed, the state transition function f and the decision function
g can be based on probability distributions Shown below to take into account the

uncertainty in states, Observations and actions:

P(s(t + I) : s' | :I:(t), s(t) = s)

and

P(a(t+1) = a | s(t+1) = s')

where P() denotes the probability. However, the states in MDPS have been typi-
cally deﬁned as a set of symbols and there is no distance metric deﬁned to measure
the similarity between any two symbols (see, however, various MDP generalization

techniques surveyed by Kaelbling, Littman & Moore [39]).

We deﬁne a state s to be a vector in a high dimensional space 8. Thus, our state
has an explicit representation. 8 must contain all the possible sensory input :1: E X.
In contrast to existing MDP methods, we require that the state records temporal
context. Thus, we define the state Space to be S = X x R(S), where x denotes
Cartesian product and R() denotes a re-sampling Operator.

12

1.3.3 Learning types

Eqs. (1.1) and (1.2) identify four components of the AA-learning agent for each time

instance t:

(a(t + 1), s(t + 1), Sam-(m. (1.3)

They involve three entities: action, state, and sensor.

Depending on whether the action is imposed or not, the learning can be classiﬁed
into action-imposed learning and action-autonomous learning. Action-imposed learn-
ing is such that the extro—effector part of a(t + 1) is supplied by the trainer. For
example, hand-in-hand learning can be used by human adult to teach a child how to
use a pen. Otherwise, the learning is action-autonomous learning.

Depending on whether the state s(t) is imposed or not, learning can be classiﬁed
into state-imposed and state-autonomous. The state-imposed learning is such that s(t)
and s(t + 1) are set by the human trainer during the learning. If a learning method
requires a task-speciﬁc representation, the representation typically determines the
meaning of states and thus the learning must use state-imposed learning. AA-learning
is state-autonomous learning. As explained earlier, with AA-learning, the state of
the system is determined by the developmental algorithm autonomously. Another
concept is the state—readability. If the state of the system is not directly readable to
the teacher, the learning is state-readable. Otherwise, it is state-unreadable.

Depending on whether the unbiased sensor is used or not, the learning can be
classiﬁed into reinforcement learning and communicative learning. Reinforcement

13

learning is such that a biased sensor is used to reinforce or punish certain response
from the machine agent. Communicative learning is such that only unbiased sensors
are used in learning. This requires that the agent to correctly interpret the signal
from unbiased sensors, either it is an instruction for action, an encouragement, an ex-
planation, etc. Learning by a human adult is mostly conducted in the communicative
learning mode.

The learning type can be represented by a 3-tuple (A,S,X) where A E {i,a}
denotes if action is imposed or autonomous, S E (i, a} denotes the state is imposed
or autonomous, and X E {r, c} denotes the biased sensor is used or not. There are
8 different 3-tuples, representing a total of 8 different learning types. AA-learning is
state-autonomous learning. Thus, there are 4 types Of AA-learning: Type (1) action-
imposed and reinforcement, Type (2) action-imposed and communicative, Type (3)
action-autonomous and reinforcement, and Type (4) action-autonomous and commu-
nicative. It is worth noting that these four types are typically interleaved in a natural
learning environment for animals and humans. These deﬁnitions are required beyond
the coarse classic deﬁnition of supervised and unsupervised learning2. This thesis only
studies the type (2) action-imposed and communicative learning and other types are
not addressed further in this thesis. If the trainer imposes an action on an effector at
any time through, e.g., a joystick, the system performs action-imposed learning for
that effector. Otherwise, the system performs action-autonomous learning, during

which communicative learning is used.

 

2From the most strict deﬁnition of unsupervised learning, all the above learning modes are su-
pervised by human to some degree. It is difficult to identify any type of learning that is completely
unsupervised.

14

1.4 Learning in hight dimensional space

To realize an AA-learning, a powerful tool to map high dimensional sensory inputs to

continuous action outputs is required. There are two issues for this mapping tools.

0 Representation: There are two possible representation: the model-based and the
appearance-based. The model-based approaches use manually deﬁned features
to represent Objects in the images. A lot of efforts has been made using this
approachi[35] [45]. The focus of this approach is to design an efficient algorithm
from a set of manually selected features. This method is appealing because of its
efﬁciency. However, the model-based approach is difﬁcult to generalize. For ex-
ample, the face features become useless when a car image database is presented.

To make domain-extensible, we can not use the model-based representation.

The appearance—based approaches have recently drawn a lot Of attention in
machine vision [92] [61] [88]. Using an appearance-based approach, any sensory
input is considered as a vector in a high-dimensional space, no matter the input
is an image or something else. The retrieval task then becomes the problem
Of ﬁnding the best match among the training samples. To use fewer features
to represent a set of images, the principal component analysis (PCA) has been
used for face recognition [92]. PCA can reconstruct the images with the least
mean square errors. However, the PCA features which are good for image
reconstruction are not necessarily good for recognition. The features derived
from Fisher’s linear discriminant analysis (LDA) are better for recognition [88]
provided the samples are sufﬁciently rich to cover the typical variation within

15

each class.

However, LDA cannot, by itself, judiciously group Similar classes into a single
class for coarse classiﬁcation. Such a grouping will equivalently increase the
number Of samples in each merged class and thus will allow better estimate
of the within-class variation which is critical for the effectiveness of LDA. We
cast a classiﬁcation problem into a regression problem SO that each class is
represented by a vector in the output space. The similarity of classes can be
measured through distance in the output Space, which guides class merging in
the input space for each coarse classiﬁcation. Even in the case where the users
assign distance-preserved class vectors, merged classes have distance measures

that reflect how classes are merged.

Organization and incremental learning: It is very important for a real time
system to well-organize the represented features SO that the retrieval is both
fast and successful. Linear search is very time-consuming which makes it not
practical for a real time system. One way to solve this problem is to use a

classiﬁcation and regression tree.

Traditionally, classiﬁcation and regression trees use univariate split at each in-
ternal node, such as in CART [11], C50 [77] and many others. This means
that the partition by each node uses a hyper-plane that are orthogonal to an
axis in the input space X. Multivariate linear splits correspond to partition
hyper-planes that are typically not orthogonal to any axis of the input space.
Trees that use multivariate Splits are called Oblique tree. AS early as in the

16

mid 70’s, Freidman [23] proposed a discriminating node-Split for building a
tree which would result in an Oblique tree. OC1 by Murthy et al. 1994 [40]
and SHOSLIF tree by Weng 1994 [95] and 1996 [96] are two methods for con-
structing Oblique trees. For an extensive survey of decision trees, see a recent
survey by Murthy [62]. OC1 uses iterative search to ﬁnd a Split. SHOSLIF uses
principal component analysis (PCA) and linear discriminant analysis (LDA)
to directly compute Splits. Multivariate nonlinear Splits correspond to curved
partition surfaces with any orientation. An incremental hierarchical discrimi-
nant regression (IHDR) tree algorithm is prOposed to use multivariate nonlinear

Splits, with multivariate linear splits as a Special case.

1 .5 Thesis overview

The rest of this thesis is organized as follows. Chapter 2 introduces a new techniques
for the regression and classiﬁcation problems. This algorithm casts classiﬁcation
problems (class labels as output) and regression problems (numeric values as output)
into a uniﬁed regression problem. The method automatically derives discriminating
features in input space by clustering samples from output Space. A decision tree
structure is adopted to facilitate the speed of the execution. A sample-Size dependent
negative-log—likelihood measurement is proposed for the smooth transition among dif-
ferent statistic measurement which is suitable for different sample population. Several

experiments were conducted to verify the performance of the proposed algorithms.
In Chapter 3, the incremental version Of the hierarchical discriminant regression

17

tree (IHDR tree) is described. This is necessary for the on-line training and testing
Of the robotic tasks. The proposed algorithms use amnesic average to compute the
necessary statistics so that the computation complexity is low and the real time
performance can be achieved. In addition to the experiments done in Chapter 2,
this algorithm was tested for the vision-based navigation problem using real images
collected in the indoor environment.

The current technology in computer vision and pattern recognition requires hu-
mans to collect images, store images, segment images for computers and train com-
puter recognition systems using these images. It is unlikely that such a manual labor
process can meet the demands of many challenging recognition tasks that are crit-
ical for generating intelligent behavior, such as face recognition, object recognition
and speech recognition. In Chapter 4, a system to learn directly from continuous
image stream is described. We demonstrate the performance of the algorithm on the
problem of face recognition using video sequences from different subjects.

In Chapter 5, the proposed algorithms were applied to the real robot. The archi-
tecture Of the SAIL (named after Self-organizing Autonomous Incremental Learner) is
described in this chapter. Since the robot equips with several sensors and actuators,
multiple thread processes are implemented for the software settings. The tracking
tasks were experimented in the real robot testing.

Finally, Chapter 6 concludes the dissertation by summarizing the contribution

and giving the future research directions.

18

Chapter 2

Hierarchical Discriminant
Regression Tree: Batch Learning

Mode

2. 1 Introduction

The capability of computers to efﬁciently and effectively retrieve information from
image databases gives a signiﬁcant impact on the progress Of digital library technology.
A central task of a multimedia information system is to efficiently store, fast and
correctly retrieve, and easily manage images in the database [67] [77].

An essential issue for image database is the representation Of the image. We
can categorize the content-based image retrieval into two types: the model-based
and the appearance-based. The model-based approach uses manually deﬁned fea-
tures to represent Objects in the images. A lot of efforts has been made in this

19

approach [47] [7] [35] [45] [27]. Most of them have been focusing on designing an
efﬁcient algorithm from a set of manually selected features. The strength of model-
based approach is the efficiency in representing images. With a proper design and a
restricted domain Of images, only a very small number of parameters is sufﬁcient to
represent the Objects in the image and to distinguish among different Objects. How-
ever, the model—based approach is difﬁcult to generalize. For examples, given a face
image database, the designer needs to manually ﬁnd the features for faces. The face
features become be useless when a car image database is presented. The designer has
to ﬁnd another set of features for car images. Such a process of manually design-
ing features cannot scale up to large and complex domains Since there are countless
models to be built.

The appearance approach has recently drawn much attention in machine vi-
sion [92] [61] [63] [96]. Instead of relying on human designer to deﬁne features, the
appearance-based approach enables machines to automatically derive features from
image samples. To do so, it considers a two dimensional image as a long vector. Sta-
tistical classiﬁcation tools are applied directly to the sample vectors. One example is
the nearest neighbor (NN) classiﬁer. As is well-known, NN classiﬁer is very time and
space consuming for high-dimensional image space or a large image database. To use
fewer features to represent a set of images, the principal component analysis (PCA)
has been used for face recognition [92] [70]. PCA can optimistically reconstruct the
images represented with the least mean square errors. However, the features which
can well represent the original data set are not necessarily good for the purpose of

classiﬁcation. The features derived from the linear discriminant analysis (LDA) are

20

meant for well distinguishing different classes and thus are relatively better for the

purpose of classiﬁcation, provided that the samples contain sufﬁcient information [88].

The second issue for image database is how to organize the represented features
so that the retrieval is both fast and successful. Linear search is very time-consuming
which makes it not practical for very large image databases. One way to solve this
problem is to use a decision tree. A well designed decision tree can retrieve a matched
sample with a logarithmic time complexity. This is a very useful characteristic for
large image databases. There is a very rich literature about decision trees, see sur-
veys [19] [82] [29] [62] [11]. However, the applications of decision trees have been
traditionally for a low-dimensional feature space with manually selected features.
This is true largely because humans cannot deﬁne a large number of useful features.
Appearance-based approach drastically changed this situation. Tiaditional decision
trees for low-dimensional input Space have been found not suited for input dimen-
sionality of a few thousands and up, even after data-dimensional reduction using
techniques such as PCA. A major reason is the high complexity of sample distri-
bution that cannot be adequately captured by a single-level PCA. As demonstrated
by [87], if a different subspace is computed at each internal node of the tree, a better
generalization power results. A series Of limitations of existing tree classiﬁers have
motivated the work presented here with advances and advantages summarized in the
abstract.

21

2.2 Classiﬁcation and regression

The tasks Of discriminant analysis can be categorized into two types according to
their outputs: class-label (symbolic) output and numerical output. The former case
is called classiﬁcation and the latter case is called regression. A classiﬁcation task can
be stated as follows. Given training sample set L = {(13, l,) | i = 1, 2, . . . , n}, where
3:,- E X is an input (feature) vector and l,- is the symbolic label of 13,-, the task is to
determine the class label of any unknown input :1: E X. A regression task is similar
to the corresponding classiﬁcation one except that the class label 1,- is replaced by a
vector y,- in output space so that y,- E y,i = 1, 2, . . . , n. A regression task is stated
as follows. Given training set L’ = {(2:,-, y,) | i = 1, 2, . . .,n} and any testing vector

:1: E X, the goal is to estimate the vector y(.r) E y.

In a classiﬁcation problem, the class labels themselves do not provide more infor-
mation in terms of how different two classes are. We would like to cast a classiﬁcation
task regressive one SO that we can conveniently form coarse classes by merging some
original classes. These coarse classes are useful for performing coarse-to-ﬁne classiﬁ-

cation and regression using a decision tree, as we will explain later in this chapter.

There are three ways to cast a classiﬁcation task into a regression one. (1) If a
cost matrix [cu] is readily available from application, where cij is the cost Of confusing
classes i and j, one can embed n class labels into an (n — 1)-dimensional Euclidean
outputs Space by assigning vector y,- tO class i, i = 1, 2, . . .,n, so that Hy,- — y,” is
as close to as 0,], as much as possible. This process is not always possible Since a
pre-deﬁned cost matrix [cij] is not always easy to provide. (2) Canonical mapping.

22

Map n class labels into an n dimensional output Space so that the i-th class label
corresponds to a vector in which the i-th component is 1 and all other components are
zeros. After this mapping, the distance between any two different class labels is the
same: 1. This label mapping does not assign different distances to different output
vectors but will do S0 for coarse classes in a coarse-tO-ﬁne classiﬁcation. (3) Mapping
labels into the input space. Each sample (r,,l,) belonging to class I,- is converted
to (a:,-,y,) where y,, the vector class label, is the mean of all x,- that belong to the
same class. This label-mapping scheme considers the distance in input space as that
between different classes. In many applications, this is a desirable way. How can we
get the class label from the numerical vector in X? In each leaf node of the regression
tree, each training sample (17,-, y,) has a link to label l,- so that when (17,-,yi) is found
as a good match for unknown input as, l,- is directly output as the class label. There is
no need to search for the nearest neighbor in the output space for the corresponding

class label.

TO cast a classiﬁcation problem into a regression one can be achieved by the above
three methods. The other way around is not always possible. One cannot map a nu—
meric output space into a set Of class labels without losing the numeric properties
among an inﬁnite number of possible numerical vectors. Therefore, a regression prob-
lem is more general than the corresponding classiﬁcation problem. AS can be seen, the
numerical representation in output Space allows us to form hierarchical discriminating
subspaces for constructing a decision tree.

23

2.3 Discriminant analysis for numerical output

Consider the general regression problem: approximating a mapping h : X +—> y from
a set of training samples {(x,,y.,-) | x,- E X, y,- E y, i z 1, 2, . . . , n}. If y, was a class
label, linear discriminant analysis (LDA) [24] can be applied since the within-class
scatter and between—class scatter matrices are all deﬁned. Unfortunately, if each class
has only very few classes, the within-class scatter matrix is poorly estimated and the
LDA is not very effective. If the classiﬁcation problem is cast into a regression one,
it is possible to form coarse classes each having more samples which enables better
estimation of the within class scatter matrix. However, if y,- is a numerical output,
which can take any value for each component, it is a challenge to ﬁgure out an effective
discriminant analysis procedure.

To attack this challenge, a new hierarchical statistical modeling method is intro-
duced. Consider the mapping h : X I—> y, which is to be approximated by a regression
treel, called hierarchical discriminant regression (HDR) tree, for the high dimensional
space X. Our goal is to automatically derive discriminant features although no class
label is available (other than the numerical vectors in Space 32). In addition, for com-
putational efficiency, each sample (22,-, y,) must be processed to update the HDR tree
using only a minimal amount of computation.

Two types of clusters are formed at each node of the HDR tree — y—clusters and
x-clusters. as shown in Fig. 2.1. The y-clusters are clusters in the output Space 32

and x-clusters are those in the input Space X. There are a maximum of q (e.g.,

 

1A regression tree is, by deﬁnition, a decision tree whose output is a numeric vector while a
classiﬁcation tree is a decision tree whose output is a class label [11].

24

 

 

 

 

 

 

X space Y space

Figure 2.1: Y-clusters in space 32 and the corresponding x-clusters in Space X. The ﬁrst
and the second order statistics are updated for each cluster. By default, the normalized
Mahalanobis distance is used for x-cluster and the Euclidean distance is used for distance
to y-cluster.

9

q = 6) clusters of each type at each node. The q y—clusters determine the virtual
class label of each training sample (:13, y) based on its y part. The virtual class label
is used to determine which x-cluster the input sample (2:,y) should update using
its a: part. Each x-cluster approximates the sample population in X Space for the
samples that belong to it. It May spawn a child node from the current node if a
ﬁner approximation is required. At each node, y in (as, y) ﬁnds the nearest y-cluster
in, e.g., Euclidean distance (more general distance metrics can also be used). This
y-cluster indicates to which corresponding x-cluster the input (:13, y) belongs. Then,
the a: part of (:r, y) is used to compute the statistics of the x-cluster (the mean vector
and the covariance matrix). These statistics of every x-cluster are used to estimate
the probability for the sample (:r,y) to belong to the x-cluster, whose probability
distribution is modeled as a multidimensional Gaussian at this level. A total Of
q centers of the q x-clusters give q — 1 discriminant features which span (q — 1)—

25

dimensional discriminant space. A probability-based distance (to be discussed in
Section 2.4) from :1: to each of the q x-clusters is used to determine which x-cluster
Should be further searched. If the probability is high enough, the sample (:13, y) should
further search the corresponding child (maybe more than one but with an upper bound
k) recursively, until the corresponding terminal nodes are found.

For computational efﬁciency, none of the x-clusters and y-clusters keep actual
input samples, unlike the traditional clustering methods. Only the ﬁrst orders of
statistics are used to represent the clusters. For example, each y-cluster keeps the
mean vector and the covariance matrix, depending on the distance metric in the 32
space, while each x-cluster keeps the mean vector and the full covariance matrix in
an efﬁcient form.

In summary, the algorithm recursively builds a HDR tree from a set of training
samples. The deeper a node is in the tree, the smaller the variances of its x-clusters

are. The following is the outline of the algorithm for tree building and retrieval.

Procedure 1 BuildSubtree: Given a node N and a subset S’ of the training sam-
ples that belong to N, among the samples in S = {($,,y,) | :17,- E X, y,- E y, i =
1,2, . . .,n}, build the subtree which roots from the node N using S recursively. At

most q clusters are allowed in one node.

1. Let p be the number of the clusters in node N.

0 Call Clustering-Y (procedure 2) to obtain p y-clusters.

o If y,- belongs to i-th y—cluster, then 1:,- belongs to i-th z-cluster.

26

2. Compute the mean and covariance matrices of each x-cluster.

3. For every :13,- of (13,, y,) in S’:

c Find the nearest z-clusterj according to the probability-based distances.
0 Suppose that the sample (23,-, y,) belongs to cluster j.
4. For each clusterj, only a portion of samples S,- was assigned to the 112-cluster. If
the largest Euclidean distance among y,- ’s in the r-cluster is larger than a number
6y, a child node Nj of N is created from the :r-cluster and this procedure is called

recursively with input samples Sj and node Nj. The number 6,, represents the

sensitivity of the HDR tree in the space y.

Procedure 2 Clustering-Y: Given a set of output vectors Y = (y1,y2, . . .,yn), re-

turn p y-clusters. p S q, where q represents the maximum clusters allowed in one

node.

1. Let the mean Y1 of y-cluster 1 be yl. Set p = 1 and i = 2.

2. Forifrom 2 to 71 do

((1) Find the nearest y-clusterj for y,.
(b) Compute the Euclidean distance d = dist(y,~, Yj).
(c) Ifd 2 6y andp < q, let the mean Yp+1 of y—clusterp+1 be y,. Setp = p+l.

(d) Otherwise, update y-cluster j using the new member y,.

27

 

The procedure to create a HDR tree just calls procedure BuildSubtree with root
R and all the training samples S = {(x,,y,~) | x,- E X, y, E y, i = 1,2, . ..,n}. The

procedure for query the HDR tree for an unknown sample 1: is described in below.

Procedure 3 Retrieval: Given a HDR tree T and sample x, estimate the corre-
sponding output vector y. A parameter k speciﬁes the upper bound in the width of

parallel tree search.

1. From the root of the tree, compute the probability-based distance to every cluster
in the node. Select at most top It x-clusters which have the smallest probability-

based distances to x. These x-clusters are called active x-clusters.

2. For every active cluster received, check if it points to a child node. If it does,
mark it inactive and explore its child node by computing the probability-based
distances of x-clusters in the child node. At most k2 active x-clusters can be

returned.

3. Mark at most It active x-clusters according to the smallest probability-based dis-

tances.

4. Do the above steps 2 through 3 recursively until all the resulting active x-clusters

are all terminal.

5. Let the cluster c have the shortest distance among all reached leaf nodes. Output

the corresponding mean of its y-cluster as the estimated output y for x.

The above tree gives a coarse-to-ﬁne probability model. If Gaussian distribution is
used to model each x-cluster, this is a hierarchical version of the well-known mixture-

28

 

of-Gaussian distribution models: the deeper the tree is, the more Gaussians are used
and the ﬁner are these Gaussians. At Shallow levels, the sample distribution is ap-
proximated by a mixture of large Gaussians (with large variances). At deep levels,
the sample distribution is approximated by a mixture of many small Gaussians (with
small variances). The multiple search paths guided by probability allow a sample
1: that falls in-between two or more Gaussians at each shallow level to explore the
tree branches that contain its neighboring x-clusters. Those x-clusters to which the
sample (x,y) has little chance to belong are excluded for further exploration. This
results in the well-known logarithmic time complex for tree retrieval: O(log m) where
m is the number of leaf nodes in the tree, assuming that the number of samples in

each leaf node is bounded above by a constant.

2.4 Distance in discriminating space

2.4.1 Discriminating subspace

In the above section, it is necessary to estimate the distance for an input vector x
to belong to an x-cluster. For a real-time system, it is typically the case that the
system cannot afford to keep all the samples in each cluster. Thus, each cluster will
be represented by some statistical measures with an assumed distribution.

We ﬁrst consider x-clusters. Each x-cluster is represented by its mean as its center
and the covariance matrix as its size. However, since the dimensionality of the space
X is typically very high, it is not practical to directly keep the covariance matrix.

29

If the dimensionality of X is 3000, for example, each covariance matrix requires

3000 x 3000 = 9, 000, 000 numbers! We adopt a more efficient method.

As explained in Section 2.3, each internal node keeps up to q x-clusters. The

centers of these q x-clusters are denoted by
C = {61,62, ...,c,, | c,- E X,i = 1,2, ...,q}. (2.1)

The locations of these q centers tell us the subspace D in which these q centers lie. D is
a discriminant space Since the clusters are formed based on the clusters in space y. In
this space, the between-cluster scatter and within—cluster scatter can be computed.
Suppose that the number of samples in cluster i is n, and thus the grand total of

samples is n = Z321 n,. The mean of the cluster center, denoted by c is computed as

1

c:—
n

q
2 nici.
i=1

The covariance matrix of cluster i is denoted by 1“,, i = 1, 2, ..., q. The within-cluster

scatter matrix is the weighted average of the within-cluster scatter matrices:

SD:

311‘

q
2'”in (2.2)
i=1

The between-cluster scatter matrix is the sample covariance matrix for the cluster

30

centers:

Sb = ii; 711(02' _ C)(Ci _ OT (23)

The sample mixture matrix is the covariance matrix of all the samples regardless of

their cluster assignments, and it iS also equal to
Sm = Sw + Sb.

The Fisher’s linear discriminant analysis [24] [100] ﬁnds a subspace that maximizes
the ratio of between-cluster scatter and within-cluster scatter: ISbl/lSwl. Since the
entire discriminant space D is used, it is not necessary to consider the within-cluster
scatter here in ﬁnding D and thus Simpliﬁes the computation. Once this discriminat-
ing space D is found, the size-dependent negative-log—likelihood (SDNLL) distance,
as discussed in Section 2.4.2, will be applied to take care of the reliability of each

dimension in D using information that is richer than the matrix Sw.

2.4.2 Size-dependent negative—log-likelihood
The characteristics of different metrics

We need a system that fully uses the information available no matter how many
samples have been observed. This is an issue that has received little attention Since
it is typically assumed that sufficient samples are available for each class but it is

often not the case in practice. In terms of belongingness of a vector to a cluster,

31

there are three measures, likelihood, Mahalanobis distance and Euclidean distance.
The likelihood assumes Gaussian density and thus uses the covariance matrix of each
individual cluster. It requires that each x-cluster has enough samples to estimate the
(q — 1) x (q — 1) covariance matrix. It is the most demanding among the three in the
richness of Observations. The Mahalanobis distance uses the average of covariance
matrices. It is less demanding Since it just requires that the average of covariance
matrix has reasonably rich observations but not necessarily every x-cluster. The
Euclidean distance is estimated by p21 and thus has only one parameter p. Thus it
is the least demanding. When very few samples are available for all the clusters, the

Euclidean distance is the suited distance.

Let us consider the negative-log—likelihood (NLL) deﬁned from Gaussian density

of dimensionality q — 1:

q—l

 

G(x, c,) : $(x — 0,)TFf1(x — c,) + ln(27r) + %1n(|I’,-[). (2.4)

We call it Gaussian NLL for x to belong to the cluster i. c,- and F,- are the
cluster mean and covariance matrix, respectively. Similarly, the Mahalanobis NLL

and Euclidean NLL are deﬁned as:

 

M(x,c,)=%(1: — Afr-la; — c.) + q;11n(2n)+%ln(ll’|). (2.5)

32

 

I —I I
E(x, 0,) = ~2—(x -— c,)Tp2I"1(x — Ci) + q 2 ln(27r) + §ln(]p21|). (2.6)

I‘ is the within-class scatter matrix of each node ~— the average of covariance

matrices of q clusters.

When the number of samples in the node is small, the Euclidean NLL is preferred.
Gradually, as the number of samples increases, the within-class scatter matrix of q
x-clusters are better estimated. Then, the Mahalanobis NLL is more suitable. When
a cluster has very rich observations, the full Gaussian NLL for it will be the best

choice.

In order to properly decide when and how to transit among the three NLLs,
different characteristics of them are discussed here. Suppose that the input Space is
X and the discriminating subspace for an internal node is D. The Euclidean NLL
treats all the dimensions in the discriminating subspace D the same way, although
some dimensionalities can be more important than others. It has only one parameter
p to estimate. The Mahalanobis NLL uses within-class scatter matrix Sw computed
from all the samples in all the q x-clusters. It uses the inverse matrix 3,31 as the
weight in computing NLL. The meaning Of this Mahalanobis matrix weight Sujl is
as follows. The matrix S;1 properly rotates the original basis b1, b2, ...,bq_1 of the
subspace D so that the correlation about the new basis vectors b’l, b’2, ...,b’q_1 all
vanishes. Then S51 applies to each rotated basis vector b2, i = 1, ..., q — 1, a weight
which is the inverse of the sample variance along b;. By the way, using Mahalanobis

33

NLL as the weight for subspace D is equivalent to using Euclidean N LL in the basis
computed from Fisher’s LDA procedure [24] [88]. Thus, the Mahalanobis NLL takes
care of the reliability of different input components but the Euclidean NLL does not.
The former does not only decorrelate the input components, but also weight each
decorrelated new components. The number of parameters in Sw is q(q — 1) /2, and
thus, the Mahalanobis NLL requires more samples than the Euclidean NLL. Next,
consider the Gaussian NLL. AS can be seen, the Mahalanobis NLL does not treat
different x-lusters differently because it uses a single within-class scatter matrix Sw
for all the q x-clusters in each internal node. This is not the case with Gaussian NLL.
Using Gaussian NLL, L(x, c,-) in Eq.(2.4) uses the covariance matrix F,- Of x-cluster i.
In other words, Gaussian NLL not only decorrelates the correlations but also applied
a different weight at different location along each rotated basis. Note that the decision
boundary of the Euclidean NLL and the Mahalanobis NLL is linear and that by the
Gaussian NLL is quadratic.

In other words, consider the partition of the Space D into two regions, one contain-
ing c1 and the other containing C2. If L(x, c1) < L(x, c2), then x is with c1. Otherwise
x is with egg. The boundary is marked by all x’s that satisfy L(x, c1) = L(x, c2). Such
a boundary is a hyperline (linear) if Euclidean or Mahalanobis NLL is used for L(~, ).

The boundary is a quadratic in general if the Gaussian NLL is used.

The transition among different metrics

How do we realize a proper automatic transition from the three NLLS? We would
like to make this transition smooth when the number of samples increases. Two al-

34

ternative measurements of maturity for each cluster i can be deﬁned, the number of
samples n,- and the elapsed time t,- since its creation. The former is appropriate for
off-line application where the elapsed time is not applicable. The latter is more suited
for real-time application where a very large number of similar samples may be ob-
served during a Short time period, causing n, to increase Signiﬁcantly without having
observed enough variation in the data. In the following, 11,- is used as the maturity
measurement. For each node, the within-class scatter 5,0 is computed, which has a
total of n = 2321 11, samples. For each x-cluster in the node, we start with a scalar
covariance matrix p21. We would like to use the same Gaussian NLL expression as
the measure for belongingness but gradually change the estimated covariance matrix
according to the number of samples received for the required matrix. For the three
types of NLLS, we have three matrices, p21, Sw, and I}. Consider the number of
scales received to estimate each parameters, called number of scales per parameter
(NSPP), of the element of the matrices. The NSPP for p21 is (n — 1)(q — 1), since the
ﬁrst sample does not give any estimate of the variance and each independent vector
contains q — 1 scales. A bounded NSPP is deﬁned to limit the growth of NSPP so
that other matrices that contain more scalars can take over when there are a sufﬁcient

number of samples for them. Thus, the bounded NSPP for p21 is
be 2 min{(n ‘1)(q _1):ns}

where n, denotes the switch point for the next more complete matrix to take over.

At this switch point, a weight of about 50% will be used for p21 and another weight

35

of 50% for the next matrix. How large should n, be? Consider a serious of random
variables drawn independently from a distribution with a variance 02, the expected
sample mean of n random variables has a covariance 02/(n — 1). We can choose a
switch conﬁdence value a for 1 / (n — 1). When 1/ (n — 1) = a, we consider that the
estimate can take about a 50% weight. Thus, 11 = 1/01+1. As an example, let a = 0.1
meaning that we trust the estimate with 50% weight when the expected variance of
the estimate is reduced to about 10% of that of a single random variable. We get

then n = 11, which leads to n, = 11.

Next, consider the NSPP for Sw matrix for the Mahalanobis N LL. The number of
independent vectors received is n — q because each of the q x-cluster requires a vector
to form its mean vector. Thus, there are (n — q)(q — 1) independent scalars. There
are (q — 1)q/2 estimated parameters in the (symmetric) matrix Sw. Thus, the NSPP

for 5,, is

(n—nm-l) un—v_

(q - 1)q/2 _ q

 

To avoid the value to be negative when n < q, we take NSPP for Sw to be

Since the Gaussian NLL cannot be trusted until every matrix F,- has received

36

enough samples, we deﬁne the NSPP for F,- estimates for the Gaussian NLL as the

minimum among all the q x-clusters:
2 ’ i '— I
by = min {ﬂ—l}, (2.7)

This is somewhat conservative since it may be the cases where x-cluster that has
the least samples is not among the nearest x-clusters. The above N SPP is meant to
contain the worst error (when the x-cluster with the fewest samples is the nearest
x-cluster). Alternatively, if we are interested in containing the mean error, we may

choose NSPP to be the average number of samples per x-cluster:

by = {M} = M (2.8)

1
(11:1 ‘1

In the above computation, we only consider the x-clusters that have at least one
sample. It is worth noting that the NSPP for the Gaussian NLL does not need to
be bounded, Since among our models it is the best estimate with a large number of

samples.

Let us consider the three NSPP: be, bm and by. From the deﬁnitions, we know
that they grow roughly at rates n, 2n/ q and 2n/q2, respectively. be gets saturated by
n, the earliest. Then, bm does. by never saturates. Table 2.1 summarizes the result

of the N SPP values of the above derivation.

We deﬁne a size-dependent scatter matrix (SDSM) W,- as a weighted sum of three

37

Table 2.1: Characteristics of Three Types of Scatter Matrices
Type Euclidean p21 Mahalanobis Sw Gaussian F,-

 

 

 

 

 

 

NSPP (n — 1)(q — 1) @3222 29:11)
matrices:
W, = wepQI + mew + ung, (2.9)

where we 2 be / b, wm = bm / b, wg = by / b and b is a normalization factor so that these
three weights sum to 1: b 2 be +bm +bg. Using this size-dependent scatter matrix W,,
the size-dependent negative log likelihood (SDNLL) for x to belong to the x-cluster

with center c,- iS deﬁned as

—1

 

1
Law.) = W — 0,)TW,"1(x — c,) + ‘1 111(271) + gnaw).

(2.10)

It is worth noting the relation between LDA and SDN LL metric. Fisher’s LDA in
Space D gives a basis for a subspace D’ g D. This basis is a properly oriented and
scaled version for D so that the within-cluster scatter in D’ is a unit matrix [24] (Sec-
tions 2.3 and 10.2). In other words, all the basis vectors in D’ are already weighted
according to the within-cluster scatter matrix Sw of D. If D’ has the same dimension-
ality as D, the Euclidean distance in D’ is equivalent to the Mahalanobis distance in
D, up to a scale factor. However, if the covariance matrices are very different across
different x-clusters and each of them has enough samples to allow a good estimate

38

of individual covariance matrix, Fisher’s LDA in Space D is not as good as Gaussian
likelihood. The SDNLL in (2.10) allows automatic and smooth transition between
three different types of likelihood, Euclidean, Mahalanobis and Gaussian, according

to the predicted effectiveness of each likelihood.

2.4.3 Computational considerations

We are now ready to discuss the computational steps for the previous procedures.
The ﬁrst issue is how to represent the space D which is spanned by the centers

of x-clusters in Eq.(2.1). These centers are vectors in X, which typically has a very

high dimensionality. The matrix weighted squared distance from a vector x E X to

each X-cluster with center c, is deﬁned by

(12(1), 6;“) 2 (IL‘ — Ci)Tl’Vi-1($ — Ci) (2.11)

which is the ﬁrst term of Eq.(2.10).

We have two major issues to deal with. First, the SDSM W,- is very large if
we represent it in X directly. Second, the sample covariance matrix, which we will
be using to estimate matrix W,, is not invertible before the number of samples has
reached the high dimensionality of X.

A way to address the ﬁrst issue is to represent the discriminating space D by a
basis of orthonormal vectors. Using the method explained in Appendix A, we keep an
orthonormal basis of the linear manifold D. To address the second issue, we represent
the covariance matrix in the orthonormal basis for subspace D instead of X. Since

39

the dimensionality of D is at most q — 1, the matrix W,- in the orthonormal basis is

much smaller than that in the original space X.

The computational steps are described as follows. Suppose that the dimensionality
of Space X is d. From q x-cluster centers in Eq.(2.1) in X, use the GSO procedure
in Appendix A to compute the q — 1 orthonormal basis vectors M = [61, 62, ..., 6q_1],
where each column 6 is a unit basis vector, i = 1, 2, . . . , q -- 1, and M is a d x (q — 1)
matrix. For each x-cluster center c,-, its coordinate vector in the orthonormal basis
M is given by

81' = MTCI‘.

Thus, each x-cluster c,- is represented by only a (q — 1)-dimensional vector e,. Given
an unknown vector x E X, project it onto the basis e = M ’x. Then, the matrix-
weighted squared distance in Eq.(2.11) is computed only in (q — 1)-dimensional space
using the basis M. The SDSM W,- for each x-cluster in then only a (q — 1) x (q — 1)
square symmetric matrix, of which only q(q — 1) / 2 parameters need to be estimated.

When q = 6, for example, this number is 15.

Given a column vector v represented in the discriminating subspace with an or-
thonormal basis whose vectors are the columns of matrix 111, the representation of v

in the original space X is x = M v.

To compute the matrix weighted squared distance in Eq.(2.11), we Should use a
numerically efﬁcient method. For example, we can use Cholesky factorization [25]
(Sec. 4.2) which is for a positive deﬁnite matrix (which is symmetric). The Cholesky
decomposition algorithm computes a lower triangular matrix L from W so that W is

40

represented by W = LLT. The procedure is relegated to Appendix B.
With the lower triangular matrix L, we ﬁrst compute the difference vector from
the input vector x and each x-cluster center c,: v = x — 0,. The matrix weighted

squared distance is given by

d2(x, Ci) 2 le/l/',-_1v = vT(LLT)"’v = (L‘lv)T(L-1v). (2.12)

We solve for y in the linear equation Ly = v and then y = L"1v and d2(x,c,-) =
(L‘lv)T(L‘1v) = ||y|[2. Since L is a lower triangular matrix, the solution for y in

Ly = v is trivial Since we Simply use the backsubtitution method as described in [74]

(page 42).

2.5 The experimental results

Several experiments were conducted using the proposed new HDR algorithm. First,
we present the experimental results using synthetic data. Then we Show the experi-
mental results for real face images. In addition to these, the proposed algorithm was

also applied to the data with manually extracted features from images.

2.5.1 Experiments using synthetic data

The motivation of using synthetic data for test is to investigate the behavior of dif-
ferent distance matrices and to examine the near optimality potential of our new
algorithm with known distributions as a ground truth (but our algorithm does not

41

know the distribution).

The ﬁrst experiment used data set that has 3 clusters. The number of dimension
is two. Each cluster was modeled by a Gaussian distribution. The centers of the
clusters are at (0,0), (5,0), and (0,5), respectively. The covariance of the ﬁrst cluster

is an identity matrix. Those for the second and the third are,

4040

0 1 0 2.25

respectively. We ﬁrst Show the effects of the number of samples. In Fig. 2.2, every
cluster has 500 samples to estimate the required statistics for each measurement. The
Bayesian decision boundaries were obtained by using the true means and covariance
matrices for each cluster. To compute the Gaussian decision boundaries, we estimated
the sample means and covariance for each cluster by using the training samples. As
can be seen, the Gaussian NLL is very close to the Bayesian decision boundaries
because we have sufﬁcient number of samples to calculate those statistics. The differ-
ence between the Gaussian NLL and Mahalanobis NLL is that the covariance matrix
for each cluster is replaced by the average of individual covariance matrices for the
Mahalanobis NLL. Thus the decision boundaries for the Mahalanobis NLL are linear.
The Euclidean NLL used identical matrices as the covariance matrices for each cluster
and the boundaries are also linear. he decision boundary of SDNLL then is very close
to that of Gaussian NLL which is an appropriate distance metric here because of the
large number of available samples.

42

 

 

 

 

 

 

 

 

 

 

 

 

 

812
10» — B13
. ~ 323
Si
0.
-5]
-10»
"-15 -1‘0
15 .
812
1o»— B13
. . 1323
5»
01.
_5.
-10» i 1 -10 ]
1 1

 

 

 

 

 

 

_-§ . 1 I . ’ I . 1 ' 5 . 15

 

 

-5-

-10.

 

 

 

 

 

 

 

"—‘15 —10 -S o 5 1‘0 15

Figure 2.2: Decision boundaries estimated by 500 samples per class. Lines ’B12’ mean
decision boundaries between class 1 and class 2. Similarly, lines ’B13’ represent boundaries
between class 1 and class 3 and lines ’B23’ represent boundaries between class 2 and class
3. Lines ‘B’ mean decision boundaries for Bayesian decision rule. This method uses the
ground truth for distribution and thus is independent of samples. Lines ‘E’ are for Eu-
clidean distance measured from a scale covariance matrix p21. Lines ‘G’ are measured by
Gaussian NLL using estimated full sample covariance matrices for all clusters. Lines ‘M’ are
for Mahalanobis distance using a single estimated covariance matrix Sw. Lines ‘L’ use our
SDNLL. (a) Bayesian decision boundaries. (b) Gaussian likelihood boundaries. (c) Maha-
lanobis likelihood boundaries. ((1) Euclidean likelihood boundaries. (e) SDNLL boundaries.
(f) The overall view.

43

There are 50 samples per class in Fig. 2.3. In this case, either Gaussian distance
and Mahalanobis distance has reasonable decision boundaries. The decision bound-
ary of SDNLL is between those of Gaussian distance and Mahalanobis distance. In
Fig. 2.4, only ﬁve samples per class are used to estimate the decision boundaries.
Since the number of samples is very small, the SDNLL is very much that of the Eu-
clidean distance, which results in a reasonable boundary as shown. Fig. 2.5 shows
the behaviors under unbalanced sample Situation where the 3rd cluster receives much
fewer samples than the ﬁrst while the number of samples for the 2nd cluster is in-
between. Fig. 2.5 indicated that the SDNLL distance metric behaves in the way we
wanted.

For the large-sample case of Fig. 2.2, we would like to examine how close the error
rates are to the best possible Bayesian error rates which use the ground truth about
the distribution instead of samples. In the‘experiment, we used 500 samples per class
to train. Testing is performed on the other 500 samples for each class. The bases
derived from the our algorithm are (0.89, —0.45) and (—0.45, —0.89). The error rate is
estimated on the basis we derived. The following table shows the classiﬁcations from
(1) ground truth of distribution using Bayesian optimality (Bayesian-GT), (2) the
parameters estimated from the training data based on Bayesian optimality (Bayesian-
Sample), and (3) the proposed new algorithm (SDNLL). Table 2.2 Shows that the
classiﬁcation errors. Of course, our method would not be able to be close to the
Bayesian error rates if there are not enough samples per class.

The second experiment presented here is for 3—D where the discriminating space
D is 2-D. There were 3 clusters, each being modeled by a Gaussian distribution with

44

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

15 . 15
B12 1.
1o» — B13 10-
523
5 5]
0, 0_ B12
— B13
-5~ -5» B23
-10» 1 -1o~
115 -10 1‘0 15 "-‘ls -1’0 -5 b 5 1’0 15
(b)
15 1 15 1 . , -
--. 812 - 812 [
1o»—— B13 [[0 ] 1o»—— 313 f,
_. . 323 9:29,“? 1:) 323 0:6 "”1 1:
5’ “...: 5’ ’81s"
0 . '-.-;f0‘1.l"i '55231' 1 o 1 _. gar.) .2" $153."). ..
..'.'-" " ...-2'11. :'
-5~ l -5. ;'
-101 3 -1o» '
z 1 1 l A '4 1 1 l 1 1
"-15 -1o -5 o 5 1o 15 '35 -1o -5 o 5 1o 15
(C) (d)
15 .
B12 .. 12
1o-“.—,‘ B13 ' 101
5,823
51
0»
-5l
-10.
“-15 -1’0

 

Figure 2.3: Decision boundaries estimated by 50 samples per class. (a) Bayesian decision
boundaries. (b) Gaussian likelihood boundaries. (c) Mahalanobis likelihood boundaries.
((1) Euclidean likelihood boundaries. (e) SDNLL boundaries. (f) The overall view.

45

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

15 . ' 1 1
- - 312 - ‘ 812
10> — 313 —— 313 .I
823 ' 823 o ,I o I
0’ ° ” n l
‘5’ -5. 7 ’
-1o~ _
'35 -1o -5 b 10 15 .1 r -1 1 1 .‘1 .1
(a) (b)
15 1 1 ,~ 15 . - ,- .
- 312 - 312
1o~—— 313 1o-— 313 , V 1
323 1 , 323 1 ,
5. .1 5» , 1
r. ExfT/IT
0‘ of " " [if . a I ’
-5» ..51- 1 v 1"
x "
-1o~ [ -10- l
"-‘15 -1o -5 o 10 15 ”-515 -1‘0 -5 b 5 1’0 15
(C) (d)
12 . -
15 . . . . , 1 __ B L —l
_- 312 10. G 1
1o» -—— 313 < 3» ,
323 1 .. _ 0 .
5. o 1” , , 1 61- ’ 1
M _ a g <
0. .‘, . '1' . ﬂ ' . 4
2 4
_5( «1 0.
-10 _2- 1
E
- 1 1 1 1 _4. 1 1
-‘rls -1o -5 o 10 15 10
(e)

 

Figure 2.4: Decision boundaries estimated by 5 samples per class. (a) Bayesian decision
boundaries. (b) Gaussian likelihood boundaries. (c) Mahalanobis likelihood boundaries.
(d) Euclidean likelihood boundaries. (e) SDNLL boundaries. (f) The overall view.

46

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

  

-10'

 

 

 

 

 

, . -41- '1 . .
‘35 -1o -5 o 5 1o 15 0 5 10

 

 

Figure 2.5: Decision boundaries for unbalanced samples distribution. Class one has 500
samples. Class two has 50 samples and class three has only 5 samples. (a) Bayesian decision
boundaries. (b) Gaussian likelihood boundaries. (c) Mahalanobis likelihood boundaries. ((1)
Euclidean likelihood boundaries. (e) SDNLL boundaries. (f) The overall view.

47

Table 2.2: Optimality for 2-D Synthetic Data

 

 

 

 

Error rate Bayesian-GT Bayesian-Sample SDN LL
class 1 0.0402 0.0443 0.0498
class 2 0.0660 0.0673 0.0682
class 3 0.0289 0.0298 0.0317

 

 

 

 

 

 

means, respectively, (0,0,0), (5,0,0), (0,5,0) and covariance matrices

 

 

 

 

 

 

00100100225

There were 500 samples per class for training and testing, respectively. The training

data is plotted in Fig. 2.6.

 

 

 

 

 

 

 

Figure 2.6: The data used in the second experiment.

We know that the basis is on x-y plane for the ground truth and we expect the
deriving discriminating Space D to be roughly so. The basis derived from the pro-
posed algorithm is: (0.89, —0.45,0.001), (-0.45, —0.89, —0.025) which are very close

48

Table 2.3: Optimality for 3—D Synthetic Data

 

 

 

 

 

 

 

 

 

 

 

class 1 class 2 class 3
. average error 0.0678 0.0494 0.0232
Bayes‘“n'GT 0 0.0082 0.0096 0.0068
average error 0.0694 0.0498 0.023
Bayes‘an'sample 0 0.011 0.0108 0.0074
average error 0.0511 0.058 0.027
SDNLL a 0.011 0.0109 0.0076
SL12 0.411 0.85 0.89
SL13 0.00 0.00 0.01
SL23 0.00 0.00 0.01

 

 

 

 

 

 

to what we expect. The error rate is estimated on the basis we derived. Table 2.3
shows the classiﬁcation results from Bayesian-GT, Bayesian-Sample rule and our new
algorithm SDNLL, like the case of 2D. We used Signiﬁcance level ( [50], pp. 268)
to measure the error difference among each method. In Table 2.3, S L12 means the
Signiﬁcance level for Bayesian-GT and Bayesian-Samples. Similarly, S L13 means the
signiﬁcance level for Bayesian-GT and SDN LL and S L23 means the Signiﬁcance level
for Bayesian-Samples and SDNLL. The higher value of Signiﬁcance level indicates the
more similar results between two methods. A total of 10000 trials have indicated the
error differences between Bayesian-GT and Bayesian-Sample are statistically insignif-
icant. Of course, the Bayesian-Sample algorithm cannot deal cases with small sample

and unbalanced samples.

The third experiment used a high-dimensionality and six clusters. Each
class was modeled by a Gaussian distribution in 100 dimensions with means

(0,...,0), (0,5,0,...,0), (0,0,5,0,...,0), (0,0,0,5,0,...,0), (0,0,0,0,5,0,...,0),

49

Table 2.4: Optimality for 100-D Synthetic Data

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

class 1 class 2 class 3 class 4 class 5 class 6
avg. err. 0.112 0.025 0.025 0.026 0.025 0.026
B ' -
ayes‘a" GT 0 0.014 0.0076 0.0067 0.0075 0.0086 0.0059
. avg. err. 0.113 0.025 0.025 0.027 0.0255 0.026
Bayes‘an'samp‘e a 0.016 0.0084 0.007 0.0079 0.0089 0.0064
avg. err. 0.090 0.0289 0.029 0.03 0.029 0.029
SDNLL
a 0.014 0.009 0.0081 0.0082 0.009 0.0068
SL12 0.69 0.71 0.718 0.7 0.69 0.75
SL13 0.00 0.015 0.0024 0.012 0.019 0.0089
SL23 0.00 0.046 0.0079 0.037 0.053 0.025
(0,0, 0,0,0,5,0, . . .,0), respectively. The covariance matrix for class 0 is an identity

matrix. The covariance matrix for the class i,i > 0 is an identity matrix except that
the (i, i) element is equal to 2.25. There were 500 samples per class for training and
the other 500 samples per class for testing. We expect the basis for discriminating
subspace D is very much in the ﬁrst six dimensions. The resulting basis is indeed
very close to what we expect. Table 2.4 Shows the error rates for the three types of
classiﬁcation rules under comparison. Since the ﬁrst cluster overlaps with the other
clusters, the errors listed in Table 2.4 in the ﬁrst row and column are larger than other
rows and columns. The error rates from the proposed algorithm are comparable with

those that use direct Bayesian optimality.

2.5.2 Experiments using real image data

Since our primary interest is in images which have a high dimensionality, we applied

the new algorithm to appearance-based face image retrieval tasks. The ﬁrst two

50

experiments used face images from Weizmann Institute and F ERET face database,
respectively. The third experiment was conducted on OCR images from digitized car

license plates.

Face recognition on Weizmann dataset

The ﬁrst experiment used face images from the Weizmann Institute at Israel. The
image data base were constructed from 28 human subjects, each having thirty images
with all possible combinations of two different expressions under three different light-
ing conditions with ﬁve different orientations. An example of the face images from

one human subject is shown in Fig. 2.7.

We applied leave-one-out cross validation method to test this image data set. For
each trial, a total of 840 images were used for the testing. Table 2.5 compares different
appearance-based methods. For the principal component analysis (PCA), the number
of eigenvectors used is determined by keeping 95% of the total sample variance. This
gives 127 eigenvectors for PCA. A PCA tree is a binary classiﬁcation tree where each

node uses PCA.

Further, we compared the error rate of the proposed HDR algorithm with some
major tree algorithms. CART2 and C50 are among the best known classiﬁcation
trees. However, like most other decision trees, they are univariate trees in that each
internal node used only one input component to partition the samples. This means

that the partition of samples is done using hyperplanes that are orthogonal to one

 

2We have experimented the same data set using CART implemented by OC1. The performance
is far worse than those reported in the Table 2.5. See CART for F ERET set in Table 2.6.

51

 

Figure 2.7: Face images from Weizmann Institute, all the combination of 3 lightings, 2
expressions, and 5 orientations.

52

axis. We do not expect this type of tree can work well in a high dimensional space
for highly correlated multimedia data like images. Thus, we also tested a more re-
cent multivariate tree OC1. We realize that these trees were not designed for high
dimensional Spaces like those from images. We also tested the corresponding versions
by performing PCA before using CART, C50, and OC1 and call them CART with
PCA, C5.0 with PCA, and OC1 with PCA, respectively.

AS Shown in Table 2.5, LDA Shares the best performance with our new HDR
method in this test. However, the new HDR method is faster than LDA and has a
more compact representation. The speed difference will be more signiﬁcant when the

data set is much larger.

Table 2.5: The performance for Weizmann face dataset

 

 

 

 

 

 

 

 

Method Error rate Avg. testing time (msec)

PCA 0.95% 290
PCA tree 1.79% 76
LDA 0.00% 110

N N 1.31% 370

C50 with PCA 27.98% 197
OC1 with PCA 37.62% 203
HDR 0.00% 82

 

 

 

 

 

Face recognition on FERET dataset

We performed two experiments using the F ERET face dataset [71] [72] [81]. We used
the frontal views from the data set. There are 457 persons with frontal views, thirty
three of whom with four frontal images, one with Six images, and the remaining 423
persons having only two images each.

A face normalization program was used to translate, scale, and rotate each face

53

 

image into a canonical image of 88 rows and 64 columns [2] so that eyes are located
at prespeciﬁed positions as shown in Fig. 2.8. To reduce the effect of background and
non-facial areas, image pixels are weighted by a function of the radical distance from
the image center. Further, the image intensity is masked by a linear function so that
the minimum and maximum values of each image are 0 and 255, respectively. Fig. 2.8
shows the effect of such a series of transformations.

Thirty four human subjects were involved in the ﬁrst experiment for the FERET
data set. Each person had three face images for the purpose of training. The other
face image was used for testing. We compare different options of the proposed algo-
rithms. First, we used Euclidean distance in the discriminating subspace instead of
SDNLL distance. With different choices of number of x-clusters (q), we found that
the performance does not Signiﬁcantly increase with the increase of q. Then we used
SDNLL distance and the result is Shown in Fig. 2.9. From the ﬁgure, we found that
the best q (q = 18 and beyond) resulted in 100% recognition rate.

To give an intuitive display about what are the centers of the x-clusters, we Show
in Fig. 2.10 the mean face images at the root with q = 15. The dimensionality of the
discriminating subspace is then 14.

Fig. 2.11 (a) shows the depth of the HDR trees with different q’s and distance
metrics. All the Options resulted in a Similar tree height. The tree constructed using
Euclidean distance has the most shallow depth. Figs. 2.11 (b) and (c) give the nodes
counts at every level of the trees for q = 2. It is worth noting that the structure
of the trees affects the speed of the algorithm. As Shown in Fig. 2.12 (b), a deeper
tree results in a faster tree retrieval because it works on a lower dimensional space

54

 

 

(e)

 

Figure 2.8: The demonstration of the image normalization process. (a) and (e): The
original image from the FERET data set. (b) and (f): The normalized image. (c) and (g):
The masked image.

55

 

 

1 ﬂ 1
— EU
0.8»
------ SDNLL
%0.6
“E
9
“70.41
0.2-
0% T _____ ~ -

 

 

Figure 2.9: The plot of error rate vs. number of x-clusters for FERET face test 1 using
Euclidean distance and the new SDNLL distance.

 

Figure 2.10: mean images in the root.

56

Table 2.6: The performance comparison of decision trees for the FERET test 1

 

 

 

 

 

 

 

 

 

Method Error rate Time (sec)
Training Testing Training Testing

CART 10% 53% 2108.00 0.029
C50 1% 41% 21.00 0.030
OC1 6% 56% 2206.00 0.047
CART with PCA 11% 53% 10.89 0.047
C5.0 with PCA 6% 41% 9.29 0.047
OC1 with PCA 5% 41% 8.89 0.046
HDR 0% 0% 12.25 0.027

 

 

 

 

 

 

 

at each level. Fig. 2.12 indicates that the SDNLL distance metric does not require

signiﬁcantly more time to compute.

A summary of performance comparison with some existing major tree classiﬁers
is listed in Table 2.6. Notice that the training time is measured for the total time
to train the corresponding system. The testing time is the average time per query.
To make a fair comparison, the computation time for PCA is included in C50 with
PCA, OC1 with PCA, and CART with PCA. As Shown, none of the existing decision
trees can deal with FERET set acceptably well, not even the versions that use PCA

as a preprocessing step.

The second experiment for the F ERET data set used all the available data (Ex-
periment 1 did not use all the available FERET data because capacity-limitation of
these existing decision trees compared). AS described before, most of subjects have
only two views. We used leave-one—out cross validation method. For each trial, one
image was selected for each person for testing and the remaining images were used for
training. The number of samples for each cluster thus is not equal. We use a similar

57

 

A
DJ

 

A
c?

m SDNLL

CD

Maximum tree depth
3 e

N

 

 

 

 

No of Nodes
No of Nodes
P

M

   

 

 

 

123456789 1234567891011
depth depth

(b) (C)

Figure 2.11: The tree structures of FERET face test 1. (a) The plot of depth of the tree
vs q for different distance options. (b) and (c): The plots of tree structures for different
options with q = 2 for EU and SDN LL, respectively. EU: Euclidean Distance.

58

 

 

 

 

 

 

 

 

o 7 - f 50 P
,Kasr ———— EU .gas -——— EU
8 a:
.§05~ -- SDNLL1 EID[ SDNLL ,
a "‘ r
:E 0-3 .3
g02 £30
0.1 1 < 25
Go 110 2’0 3‘0 40 200 1'0 2‘0 3’0 40
Q Q
(a) (b)

Figure 2.12: The timing data of FERET face test 1. (a) The plot of the average training
time vs q. (b) The plot of the average testing time vs q. EU: Euclidean Distance.

analysis to experiment one. The results are Shown in Figs. 2.13, 2.14, and 2.15.

 

1 Y 7 ' v 1'

 

EU
0.8:
------- SDNLL
$0.6»
$0.4

.0
N

 

 

 

Figure 2.13: The performance plots of FERET test 2. The plots of error rate vs. number
of x-clusters. EU: Euclidean Distance.

Optical character recognition from car license plates

We also applied our method on the OCR problem. Images of characters extracted
from car license plates were used in the experiment. Fig 2.16 Shows the examples of
the character ‘A.’ A total of 23 different characters appeared in the plates. Totally 222

59

 

QVGJ

 

EU
SDNLL

 

Depth
9‘

h

 

 

 

4N0»)

 

 

 

12131-111111 10011

No. of nodes
No of Nodes

 

 

 

 

 

12345678
Depth

('0)

 

Figure 2.14: The tree structures of FERET face test 2. (a) the plot of depth of the tree vs
q for different options. (b) and (c) are the plots of tree structures for Euclidean Distance

and SDNLL distance with q = 2, respectively.

Table 2.7: The performance for FERET face dataset II

 

 

 

 

 

 

 

 

Method Error rate Avg. testing time (msec)
PCA 4.38% 203
PCA tree 7.66% 34
LDA Segmentation fault Nil
NN 3.50% 270
HDR (q=64) 8.32% 132
HDR (q=2) 12.91% 37

 

 

 

 

60

.13
O)

 

..n.
&

 

EU

.3
N

 

SDNLL

g. trairgng time (sec)
1;, .1

Av
.0
9

0.4 >

 

 

 

.0
o”

50 100
Q

(a)

150

 

l

 

EU

3?

 

SDNLL

'03

Av .testin time sec
09 p g o‘ cb
1o

0
—l

 

 

 

o 50 100 150
G

(b)

Figure 2.15: The timing data of FERET face test 2. (a) The plot of the average training
time vs q. (b) The plot of the average testing time vs q. EU: Euclidean Distance.

Table 2.8: The performance for character images from car license plates

 

 

 

 

 

 

 

 

 

 

Method Error rate Avg.testing time (msec)
PCA 4.28% 2.0
PCA tree 4.53% 1.5
LDA 0.50% 1.6
NN 3.78% 3.0
C5.0 with PCA ' 17.50 1.7
OC1 with PCA 17.04% 1.9
HDR 0.00% 1.7

 

 

 

images were used for training and 397 images were used for testing. We summarized

the results in Table 2.8. The new HDR method has the best accuracy and the PCA

tree method is the fastest algorithm.

61

 

Figure 2.16: Letter images from car license plates.

62

 

2.5.3 Experiments using data with manually extracted fea-
tures

We further investigated how our HDR algorithm performs on lower dimensional real
data, such as those publically available data sets that use human deﬁned features.
Thus, we also tested our algorithm on some data available publically. These data

used manually selected features. We reported comparison results for three data sets.

1. Letter image recognition data: There are 26 classes which corresponding to 26
capital letters. Each sample has 16 numeric features. 15000 samples were used

for training and 5000 samples were used for testing.

2. Satellite image dataset: There are six decision classes representing different
types of soils from satellite image. Each sample has 36 attributes. Training set

includes 4435 samples and testing set includes 2000 samples.

3. shuttle dataset: There are seven classes. The number of attributes is 7. 43500

samples were used for training and 14500 samples were used for testing.

We inserted the performance of the new HDR tree algorithm as well as the IHDR
tree algorithm described in Chapter 2 to the results which were published in the
StatLog project [57] as Shown in Tables 2.9, 2.10, and 2.11. For these lower dimen-
sional data sets, the performance of HDR tree algorithm is comparable with other

best existing ones.

63

Table 2.9: Test results on letter image recognition data

 

 

 

 

 

 

 

 

Algorithm Error rate Time (sec)
training testing training testing
Alloc80 0.065 0.064 39575 ‘?
KN N 0 0.068 15 2135
* HDR tree 0 0.070 212.7 30
IHDR tree 0 0.072 1150 41
LVQ 0.057 0.079 1487 48
QuaDisc 0.101 0.113 3736 1223
Cn2 0.021 0.115 40458 52
BayTTee 0.015 0.124 276 7
NewId 0 0.128 1056 2
IndCart 0.010 0.130 1098 1020
C45 0.042 0.132 309 292
Dip0192 0.167 0.176 1303 80
Radial 0.220 0.233 ‘? ?
LogDisc 0.234 0.234 5062 39
AC2 0 0.245 2529 92
Castle 0.237 0.245 9455 2933
Kohonen 0.218 0.252 ? ?
C315 0.158 0.253 1033 8
Smart 0.287 0.295 400919 184
Discrim 0.297 0.302 326 84
BackProp 0.323 0.327 277445 22
Bayes 0.516 0.529 75 18
Itrule 0.585 0.594 22325 69
Default 0.955 0.960 ? ?
Cascade 1.0
Cart 1.000

 

64

 

 

Table 2.10: Test results on satellite image dataset

 

 

 

 

Algorithm Error rate Time (sec)
training testing training testing
KN N 0.089 0.094 2105 944
LVQ 0.048 0.105 1273 44
* HDR tree 0 0.108 2.36 0.41
Dip0192 0.051 0.111 746 111
Radial 0.111 0.121 564 74
Alloc80 0.036 0.132 63840 28757
IHDR tree 0 0.135 220 0.85
IndCart 0.023 0.138 2109 9
Cart 0.079 0.138 330 14
BackProp 0.112 0.139 72495 53
BayTree 0.020 0.147 248 10
Newld 0.067 0.150 226 53
Cn2 0.010 0.150 1664 36
C45 0.040 0.150 434 1
C315 0.125 0.151 764 7
QuaDisc 0.106 0.155 157 53
AC2 ? 0.157 8244 17403
Smart 0.123 0.159 27376 11
LogDisc 0.119 0.163 4414 41
Cascade 0.112 0.163 7180 1
Discrim 0.149 0.171 68 12
Kohonen 0.101 0.179 12627 129
Castle 0.186 0.194 75 80
Bayes 0.308 0.287 75 17
Default 0.758 0.769
Itrule ? 100.00 '

 

 

 

 

 

65

 

 

Table 2.11: Test results Shuttle dataset

 

Error Rate TIME
Algorithm Train Test Train Test

 

 

** HDR tree 0 0.0021 41.9 3.76

Newld 0 0.01 6180 ?
BayTree 0 0.02 240 17
Cn2 0 0.03 11160 ?
Cal5 0.03 0.03 313 10
IHDR tree 0 0.04 820.7 6.55
Cart 0.04 0.08 79 2
IndCart 0.04 0.09 1152 16
C45 0.04 0.10 13742 11
Ac2 0 0.32 2553 2271
Itrule ? 0.41 91970 ?
BackPrOp 4.50 0.43 5174 21
KNN 0.39 0.44 32531 10482
LVQ 0.40 0.44 2813 84

Dip0192 0.44 - 0.48 2068 176
Smart 0.61 0.59 110010 93
Alloc80 0.95 0.83 55215 18333
Radial 1.60 1.40 ? ?
Castle 3.70 3.80 461 150
LogDisc 3.94 73.83 6946 106
Bayes 4.60 4.50 1030 22
Discrim 4.98 4.83 508 102
QuaDisc 6.35 6.72 709 177
Default 21.59 20.84
Cascade ? 100
Kohonen ? 100

 

66

 

Chapter 3

Hierarchical Discriminant
Regression Tree: Incremental

Learning Mode

3. 1 introduction

AS far as we know, there is no published incremental statistical method for construct-
ing a regression tree for high dimensional input space based on discriminant analysis.
By high dimensional Space we mean that the dimensionality is above a few thousands
and the number of samples can be smaller than the dimensionality 1.

This high dimensional issue becomes increasingly important with increased use of

high dimensional digital multimedia data such as images and video where each pixel

 

1When the number of samples is smaller than the input dimensionality (i.e., the number of
features), Brieman et al. [11] and Murthy [40] called the situation data underﬁts the concept and
thus disregard the situation.

67

 

1.1,," ‘_r- ._—‘ " "
‘c'.

 

value is a component of the input vector 2. These applications present a high degree
of correlation among components of high-dimensional input, not matched by typical
lOwer-dimensional human-prepared feature data. None of CART, C5.0, OC1 or other
published tree classiﬁers that we know was designed for this high-dimensional, highly
correlated input Situation. SHOSLIF tree is for high input dimensionality and it
has an incremental version [97], but it uses PCA only for splits. It is technically
challenging to incrementally construct a classiﬁcation or regression tree that uses
discriminant analysis, due to the complex nature of the problem involved. Why
is discriminant analysis important? Discriminant analysis uses information of the
output space in addition to the information in the input space to compute the splits.
PCA uses only information in the input Space and cannot use information in the
output Space. Consequently, variations in the input Space that is totally useless for
output (e.g., pure noise components) will also be captured by PCA. On the other
hand, various neural networks are incremental in training. They tend to build a
network with some discriminating power. However, since neural networks typically
do not use a statistical model, they suffer from local minima problem and thus give
poorer performance as we will Show later in this chapter.

We present an incremental way of constructing a regression tree that uses dis-
criminant analysis. Further, we deal with the unbalanced sample problem in that
some regions of input space may have a very large number of samples while other

regions have only very Sparse samples. A sample-Size dependent likelihood measure

 

2This corresponds to a now well-accepted and highly successful approach called appearance-based
approach, with which the human system designer does not deﬁne features at all but rather applies
statistical methods directly to high-dimensional, preprocessed image frame [42] [92] [61] [88].

68

 

1ft":—

is proposed to make subOptimal decision for different sample Sizes, which is also very

critical for an incremental algorithm which receive training data incrementally. We

also require real-time Speed of the regression system which is essential for many in-

teractive applications and thus a hierarchical data pruning structure such as a tree is
a must. We present experimental result to demonstrate the performance of the new

technique and compare it with some major published methods.

3.2 Incremental Hierarchical Discriminant Re-

gression

The algorithm incrementally builds an IHDR tree from a sequence of training samples.
The deeper a node is in the tree, the smaller the variances Of its x-clusters are. When
the number of samples in a node is too small to give a good estimate of the statistics
of q x-clusters, this node is a leaf node. The following is the outline of the incremental

algorithm for tree building (also tree retrieval when y is not given).

Procedure 4 Update-node: Given a node N and (1:,y) where y is either given or
not given, update the node N using (x, y) recursively. Output: top matched terminal
nodes. The parameters include: 19 which speciﬁes the upper bound in the width of
parallel tree search; (5,, the sensitivity of the IHDR tree in X space as a threshold to
further explore a branch; and c representing if a node is on the central search path.
Each returned node has a flag c. If e :- 1, the node is a central cluster and c = 0
otherwise.

69

_ .-__._.__'

1. Find the top matched x-cluster in the following way. Ifc = 0 skip to step (2).

Ify is given, do (a) and (b); otherwise do (b).

(a) Update the mean of the y-cluster nearest y in Euclidean distance by using
amnesic averages. Update the mean and the covariance matrix of the x-

cluster corresponding to the y-cluster by using amnesic average.

(b) Find the x-cluster nearest 1: according to the probability-based distances.
The central x-cluster is this x-cluster. Update the central x-cluster if it has

not been updated in (a). Mark this central x-cluster as active.

2. For all the x-clusters of the node N, compute the probability-based distances for

x to belong to each x-cluster.
3. Rank the distances in increasing order.

4. In addition to the central x-cluster, choose peripheral x-clusters according to

increasing distances until the distance is larger than 6,, or a total of k x-clusters

have been chosen.
5. Return the chosen x-clusters as active clusters.

From the above procedure, we can observe the following points. (a) When y is given,
the corresponding x-cluster iS updated, although this x-cluster is not necessarily the
one on the central path from which the tree is explored. Thus, we may update two
x-clusters, one corresponding to the given y, the other being the one used for tree
eXploration. The update for the former is an attempt to pull it to the right location.
The update for the latter is an attempt to record the fact that the central x-cluster

70

 

has hit this x-cluster once. (1)) NO matter y is given or not, the x—cluster along the
central path is always updated. (c) Only the x-clusters along the central path are
updated, other peripheral x-clusters are not. We would like to avoid, as much as

possible, storing the same sample in different brother nodes.

Procedure 5 Update-tree: Given the root of the tree and sample (x, y), update the
tree using (x, y). Ify is not given, estimate y and the corresponding conﬁdence. The
parameters include: k which speciﬁes the upper bound in the width of parallel tree

search.
1. From the root of the tree, update the node by calling Update-node using (x,y).
2. For every active cluster received, check if it points to a child node. If it does,

mark it inactive and explore the child node by calling Update-node. At most q2

active x-clusters can be returned this way if each node has at most q children.
3. The new central x-cluster is marked as active.

4. Mark additional active x-clusters according to the smallest probability-based dis-

tance (1, up to k total if there are that many x-clusters with d g 63,.

5. Do the above steps 2 through 4 recursively until all the resulting active x—clusters

are all terminal.

6. Each leaf node keeps samples (1:,, y,) that belong to it. The output is y,- if x,- is

the nearest neighbor among these samples.

7. If the current situation satisﬁes the spawn rule, i.e. the number of samples ex-
ceeds the number required for estimating statistics in new child, the top-matched

71

.L....___.."

an-

 

x-cluster in the leaf node along the central path spawns a child which has q new
x-clusters. All the internal nodes are ﬁxed in that their clusters do not fur-
ther update using future samples so that their children do not get temporally

inconsistent assignment of samples.

The above incrementally constructed tree gives a coarse-tO-ﬁne probability model.
If we use Gaussian distribution to model each x-cluster, this is a hierarchical version
of the well-known mixture-of-Gaussian distribution models: the deeper the tree is,
the more Gaussians are used and the ﬁner are these Gaussians. At shallow levels,
the sample distribution is approximated by a mixture of large Gaussians (with large
variances). At deep levels, the sample distribution is approximated by a mixture of
many small Gaussians (with small variances). The multiple search paths guided by
probability allow a sample x that falls in-between two or more Gaussians at each
Shallow level to explore the tree branches that contain its neighboring x-clusters.
Those x-clusters to which the sample (x, y) has little chance to belong are excluded
for further exploration. This results in the well-known logarithmic time complex for
tree retrieval: O(log m) where m is the number of leaf nodes in the tree, assuming

that the number of samples in each leaf node is bounded above by a constant.

3.3 Amnesic average

In incremental learning, the initial centers of each state clusters are largely determined
by early input data. When more data are available, these centers move to more
appropriate locations. If these new locations of the cluster centers are used to judge

72

'1'" "' "'— '-———=r

the boundary of each cluster, the initial input data were typically incorrectly classiﬁed.
In other words, the center of each cluster contains some earlier data that do not belong
to this cluster. To reduce the effect of these earlier data, the amnesic average can
be used to compute the center of each cluster. The amnesic average can also track

dynamic change of the input environment better than a conventional average.

The average of 71. input data 13,, x2, ..., xn is given by

1 1

23(71): — 2131‘ = Z £1134. (3.1)
=1

"1‘ 1:1

In the above expression, every x,- is multiplied by a weight 1/n and the product is
summed up together. Therefore, each x,- receives the same weight l/n. This iS called
an equally weighted average. If x,- arrives incrementally and we need to compute the
average for all the inputs received so far, it is more efﬁcient to recursively compute

the current average based on the previous average:

 

 

 

x(n+1) = "LEW + ”Tn“ _ n -(n) 1

— n. . 3.2
n+1 n+1$ n+1$+1 ( ’

In other words, the previous average 13‘") gets a weight n/(n + 1) and the new input
13,,“ gets a weight 1 / (n + 1). These two weights sum to one. The recursive equation
Eq. (3.2) gives an equally weighted average. In amnesic average, the new input gets
more weight as given in the following expression:

n—l_(n) 1+1
——x +

n+1 n+1

xn+l (33)

 

73

,where l is a parameter.

The amnesic average can also be applied to the recursive computation of a co-
variance matrix I} from incrementally arriving samples: $1,332, ..., (L‘n, where 1:,- is
a column vector for z' = 1, 2, The unbiased estimate of the covariance matrix from

these 71. samples x1, 3:2, ..., 23,, is given in a batch form as

 

1 in, — 22m,- — 5:)T (3.4)

with n > 1, where i is the mean vector of the 77. samples. Using the amnesic average,
53”“), up to the (n + 1)—th sample, we can compute the amnesic covariance matrix

up to the (n + 1)-th sample as

P(n+1) = n_:l_—l

Tl,

1+1
($n+1 _ i(n+1

 

1‘5") + >><xn+1— x<n+l>)T (3.5)

for n > Z + 1. When n S l + 1, we may use the batch version as in expression (3.4).
Even with a single sample 11:1, the corresponding covariance matrix should not be
estimated as a zero vector, since 231 is never exact if it is measured from a physical
event. For example, the initial variance matrix I‘ftl) can be estimated as 021, where
a2 is the expected digitization noise in each component and I is the identity matrix
of the appropriate dimensionality.

74

1).-Sr.

3.4 The experimental results

The proposed algorithm was tested with the data which was used in the batch mode.
First, we present the experimental results using synthetic data. Then we show the
the power of the method use real face images as high dimensional input vectors for
classiﬁcation. For the regression problem, we demonstrated the performance of our
algorithm for autonomous navigation where input is current image and output is the

required steering signal.

3.4.1 Experiments using synthetic data

The ﬁrst experiment used the same 3-clusters data set as used in the batch-mode
testing. The training set is shown in Fig. 2.2 (c) with 500 samples per class. The bases
derived from the IHDR tree algorithm are (——0.88, 0.44), (0.45, 0.92). The error rate
is estimated on the basis we derived. Table 3.1 shows the classiﬁcation errors from
(1) ground truth of distribution using Bayesian optimality (Bayesian-GT), (2) the
parameters estimated from the training data based on Bayesian optimality (Bayesian-

Sample), and (3) the proposed new algorithm (IHDR).

Table 3.1: Error rates for 2-D Synthetic Data

 

 

 

 

Bayesian-GT Bayesian-Sample IHDR
class 1 7% 7.2% 7.6%
class 2 6.8% 8.2% 5.6%
class 3 1.6% 2.0% 2.0%

 

 

 

 

 

 

The second experiment is for 3-D. The discriminating space ’D is 2D — x-y plane.

75

 

There were 3 clusters, each being modeled by a Gaussian distribution with means,

respectively, (0,0,0), (5,0,0), (0,5,0) and covariance matrices

 

 

 

 

 

 

001001002251

There are 500 samples per class for training and testing, respectively. The ba-
sis derived from the proposed algorithm is: (043,092, —0.021), (—0.89,0.44,0.001)
which are very close to x-y plane as we expected. The error rate is estimated on the
basis we derived. Table 3.2 shows the classiﬁcation from ground truth Bayesian rule,

sample—based Bayesian rule and our new algorithm, like the case of 2D.

Table 3.2: Error rates for 3-D Synthetic Data

 

 

 

 

Bayesian-GT Bayesian-Sample IHDR
class 1 7% 7.2% 4%
class 2 6.8% 8.2% 4.6%
class 3 1.6% 2.0% 4.8%

 

 

 

 

 

 

The third experiment used a high-dimensionality and six clusters. Each
class was modeled by a Gaussian distribution in 100 dimensions with means
(0,...,0), (0,5,0,...,0), (0,0,5,0,...,0), (0,0,0,5,0,...,0), (0,0,0,0,5,0,...,0),
(0,0,0, 0, 0, 5, 0, . . . ,0) , respectively. The covariance matrix for class 0 is an identity
matrix. The covariance matrix for the class i,i > 0 is an identity matrix except that
the (2', 2') element is equal to 2.25. There were 500 samples per class for training and
the other 500 samples per class for testing. We expect the basis for discriminating

76

subspace D is in the ﬁrst six dimensions. Table 3.3 shows the error rates for the
three types of classiﬁcation rules under comparison. Since the ﬁrst cluster overlaps
with the other clusters, the errors listed in Table 3.3 in the ﬁrst row and column are
larger than other rows and columns. The error rates from the proposed algorithm are

comparable with those that use Bayesian optimality.

Table 3.3: Error rates for 100-D Synthetic Data

 

 

 

 

 

 

 

Bayesian-GT Bayesian-Sample IHDR
class 1 9.6% 11.2% 18%
class 2 2.6% 2% 2%
class 3 2.8% 3.2% 4%
class 4 4.4% 3.8% 5%
class 5 2.8% 2.6% 2.8%
class 6 2.8% 3% 3.8%

 

 

 

 

 

 

The experimental results for these three synthetic data set show that the IHDR
algorithm has similar performance to the batch version — HDR algorithm. Its classi-
ﬁcation rates are also comparable to that of Bayesian decision rule for these tests.

The last experiment for the synthetic data is to show the effect of the y-clustering
algorithm. We used the 2-D synthetic data to show how the y means are clustered.
As shown in Fig. 3.1 and Fig. 3.2, the y-means are gradually converge to the correct

cluster centers.

3.4.2 Experiments using real face data

A major application of the presented algorithm is to directly deal with high-
dimensional multi-media data such as images, video, or speech. We present our

77

 

 

 

 

 

x X
l x x
X
l l l " x x
' X
\ 4
,2 . . o ,
F‘ J p ...-1,. I 1 d» +
l ..
t
l l L 4
k * 5 <4
‘1 ‘4
l l
‘JL.__. 1 l -‘ __.__._.’¥ 11.L11_.—?1- .1__._1~_._. AA—J
“b v 1’} *5 3 T l
C ,_1.,.... 1 I .T. ‘I 3 111111 1 1 —r“" 1-- ..
l 4 L j
L 4 ,.
l J l
l x x . l x x l
: x )(’< x r X)& Xxx x ,1
5.» x x x 3' x x x ~-‘
x xx ’3‘ xx x . L x xi‘xz )$< xx
2 X l x X l
X
l x x X , t x x x X .
\ 7 )f .
N I ' a 4’ “ ‘ ‘ ‘0 K 4'
x . \
‘ >7 0 .. + + ' -
\J 3‘42. K *1, 4 ‘ i \J ‘4 ‘ 9 0+ , f l‘
’ x 2 ‘ : l S. . '1
. . . +
l V 1
:3. 1.11.11 1 : "l 1 l l
~b " 5 70 S C '> 10
1 1 r . . . . . 1..., ,
x }
X X X x
X XiX X
x x
x xx ’8‘ xx
x x . x
X X 4x k X
_ _ 1 X
x xi. +
n .. . {hp . a. 7‘; 1 r _§ ’7
l ' x" " / I If 9 f I
. f - i‘ O l J
l V
L i
l
l
l r. .1
‘0

Figure 3.1: The effect of y-clustering algorithm. The number of samples for (a), (b), (c),

(d), and (e) are 3, 33, 63, 93, and 123, respectively.

78

 

 

 

 

 

 

 

1
11‘

 

 

[ T—ﬁi #‘7W T——V ﬁ" HAW ’—'-—_ A j #7“- °—L_‘—_—'_wf ‘V—kmi
L l l
l X X . l, X X
x x x x ‘ x x xx x x
- x X
x Xxx: Xx x l x >8:ng x
l " xx)‘ *’ x x 1“ 5x} .. ‘< x
xx ‘xX ’8‘ 2‘" . x‘ -xx & i“ x
x x X x x x x 1 x x x x x x
X X X
‘(X X X xx . X ‘ &X X XX X
‘ . \
X X
\ X x ‘ 4 i "-1, X \ 5 4», I
[ .,. + . l \ * ¢. 11‘ l‘ +
‘ ‘ 11 " 4. 1 be ,5. " “ 1 .
_".v "‘ 1.11- "i a '4' d ‘ +. * Tl Q l j. at: 1'6 f 1 ‘ +
‘ ‘r " 3 9
x“ x t 1, , a ‘ l - 4 .‘
l .( . \ - w w
' I ' 7 F
[ \ r i
\
L 4 l
l
l l
5s_1+ ‘Hi 7 L_._1L# 7 ..1 7111—.7 __ .__1 J ”“JL1+ .1 41_.__ __ iggfg1 1—.__
-“ C 13') v“ w b
3 I ~' {37" ' I T“
v- ‘ [-
l L
l X X 1 x X
x x x x l x X ‘95 x x
‘ L X .
l x x X&x§tx x x 1 g “ 853%)? ix x"‘
K r'.. K ‘
J, X 2 . )f X x
“Hwy" .. . “isms” x
1 X ~ ' x X x l x x x X x
x x x x x x x x x x x x
x X x
x X 'S<x X
l ’5‘ x ’9‘ x x X3: ., x
l * ‘ i x l I g} i x 6
I u , , w. - y .
N[ ‘ ‘ .., . ‘ é rrrrr gt. +1104 ‘ ‘l k .‘K .. “ ‘ t1 ;'1 +’_. ;*
U 9 4’ o i 9 3“ 4 f l v‘ ’ °r‘ 3 r; 0.- f f
Q :‘~ ‘ " ' 1 “I
_ . ~ ~ 0
5‘ ‘P , 1 ‘7 9 1'
l I V l f.
r _ l.
‘ l
5*. L 3.-....»111. 1 1 l. 11- .1“
-—‘) U f) 7 ’) " :5

X
~ X '1

x x x xx x x x 4
t» )&x XX XX \ Xx
l . d. A’ X + r 4
..‘.)_<. + ' 3 .,. i , 4
. ~‘ "1‘ .~ . ’ + "r. ya, : w
x ‘ 4 9 , v , 4 l

11_I_
i
f 9.
l 1 ‘ ~. :
I O 1
. .
#1.“—

Figure 3.2: The effect of y-clustering algorithm (Continue from Fig. 3.1.) The number of
samples for (f), (g), (h), (i), and (j) are 153, 183, 213, 243, and 273, respectively.

79

_ 1.11___.. :11'L_._1_ _g -i

 

111m.1_‘1_11m11 L 17-1 L -11.
[FI—___

l

J r
‘ L ..
V.)

experiments with images here. We treat each image of m rows and 72. columns as
a mn-dimensional vector, where each component of the vector corresponds to the
intensity of each pixel. Statistical methods have been applied directly to these vec-
tors of high dimensionality. This type of approaches has been very successful in
the ﬁeld of computer vision and now has been commonly called appearance-based
methods [92] [60] [88]. Although appearance-based methods themselves do not have
invariance in position, size, and orientation when applied to appearance-based object
recognition, they have been well-accepted for their superior performance when input
images are pre-processed images with roughly normalized position and size.

The ﬁrst experiment used face images from the Weizmann Institute at Israel. The
image database were constructed from 28 human subjects, each having thirty images
all combinations of two different expressions under three different lighting conditions
with ﬁve different orientations. An example of the face images from one human
subject is shown in Fig 3.3 3.4. The preprocessed images have a resolution of 88 x 64,
resulting an input space of 5632. The task here is to classify images into person’s
ID as class label. We used the mean of all training images of each person as the
corresponding y vector.

The data set was divided into two groups: training set and testing set. The
training set contains 504 face images. Each subject contributed 18 face images in the
training set. The 18 images include three different poses, three different lightings,
and two different expressions. The remaining 336 images were used for the testing
set. Each subject had 12 images for testing, which include two different poses, three
different lightings, and two expressions. In order to present enough training samples

80

 

\r

 

Figure 3.3: Face images from Weizmann Institute. The training images of one subject. 3
lightings, 2 expressions, and 3 orientations are included in the training set.

 

Figure 3.4: Face images from Weizmann Institute. The testing images of one subject. 3
lightings, 2 expressions, and 2 orientations are included in the testing set.

81

for the IHDR algorithm to build a stable tree, we artiﬁcially increase the samples by
presenting training samples to the program 20 times (20 epochs). Table 3.4 compares
different appearance-based methods. We used 95% sample variance in determining
the number of basis vectors (eigenvectors) in the principal component analysis (PCA).
PCA is faster than nearest neighbor (NN) and shares a similar accuracy. However,
the 95% of variance results in about 98 eigenvectors which are much less than that
of NN (5632-Dl). PCA organized with a binary tree was faster than straight NN as
shown in the Table. It is the fastest algorithm among all the methods we tested but
the performance is worse than those of PCA and N N. The accuracy of LDA is the
third best. Our new IHDR method is faster than LDA and resulted in the lowest
error rate.

We also applied support vector machines (SVM) [15] to this image set to compare
the performance. Support vector machines utilizes the structural risk minimization
principle [93]. It results in a maximum separation margin and the solution depends
only on the training samples (support vectors) which are located on the supporting
planes. SVM has been applied on both classiﬁcation and regression problems. We
used the SVM software obtained from Royal Holloway, University of London [83]
for this experiment. We used the PCA of the face images as the input features for
the SVM3 The best result we obtained by tuning the parameters of the software is
reported in Table 3.4. The recognition rate of the SVM with PCA is similar to that
of PCA alone. However, SVM with PCA is faster than PCA. This is because SVM

has more compact representation and PCA alone needs to conduct linear search for

 

3The software failed when we used the original image input with dimensionality 5362.

82

 

every training sample.

We compared the error rate of the proposed IHDR algorithm with some major tree
classiﬁers. CART [11] and C50 [76] are among the best known classiﬁcation trees4 .
However, like most other decision trees, they are univariate trees in that each internal
node used only one input component to partition the samples. This means that the
partition of samples is done using hyperplanes that are orthogonal to one axis. We
do not expect this type of tree can work well in a high dimensional space. Thus, we
also tested a more recent multivariate tree OC1 [62]. We realize that these trees were
not designed for high dimensional spaces like those from images We also tested the
corresponding versions by performing PCA before using CART, C50, and OC1 and

call them CART with PCA, C5.0 with PCA, and OC1 with PCA, respectively.

We have also compared the batch version of this algorithm [34] [98]. The batch
version, named hierarchical discriminating regression (HDR) tree, computes statistics
of training samples in a batch fashion. We expect the batch method out-perform the
incremental one. However, the error rate of IHDR tree is lower than that of HDR tree
for this set of data. The reason is that the same training samples might distribute
in different leaf nodes for the IHDR tree because we run several iterations during
training. For batch version, each training sample will only be allocated in one leaf

node.

Then we performed an experiment on FERET face data set. We used the frontal

views from the data set. Thirty four human subjects were involved in this experiment

 

4We have experimented the same data set using CART implemented by CC 1. The performance
is signiﬁcantly worse than those reported in the Table 3.4.

83

 

Table 3.4: The performance for Weizmann face data set

 

 

 

 

 

 

 

 

 

 

 

 

 

Method Error rate Avg. testing time (msec)

PCA 12.8% 115
PCA tree 14.58% 34
LDA 2.68% 105

NN 12.8% 164

SVM with PCA 12.5% 90
(35.0 with PCA 45.8% 95
OC1 with PCA 44.94% 98
HDR tree 1.19% 78
IHDR tree 0.6% 74

 

 

with four face images per subject. Each person had three face images for the purpose

of training. The other face image was used for testing.

A summary of comparison is listed in Table 3.5. Notice that the training time is
measured for the total time to train the corresponding system. The testing time is the
average time per query. To make a fair comparison, the computation time for PCA is
included in C50 with PCA, OC1 with PCA, and CART with PCA. As shown, none
of the existing decision trees can deal with FERET set well, not even the versions
that use PCA as a preprocessing step. The batch version of the proposed algorithm
(HDR) tree shares the same error rate as the IHDR tree. The HDR tree is faster than
the IHDR in both training and testing. This is because we ran several epochs for the

IHDR tree and the IHDR tree has more redundant information inside.

In order to display how the IHDR tree converges, Fig 3.5 shows the error rates
vs. epoch plot. As can be seen, the resubstitution error rate converges to zero at the

5th epoch. The testing error rate reaches zero at 6th epoch.

84

 

Table 3.5: The performance of decision tree for FERET test

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Method Error rate Time (sec)
Training Testing Training Testing
CART 10% 53% 2108.00 0.029
C50 41% 41% 21.00 0.030
OC1 6% 56% 2206.00 0.047
CART with PCA 11% 53% 10.89 0.047
C5.0 with PCA 6% 41% 9.29 0.047
OC1 with PCA 5% 41% 8.89 0.046
IHDR tree 0% 0% 12.25 0.027
IHDR tree 0% 0% 23.41 0.041
0.1
____________ ———— Resub
0.08-
--- Test
goes .
mo.04~
0.02»
0o s 10

 

4 6
Epochs
Figure 3.5: The performance plot of FERET face test 1. The plot of error rate vs. number

of epochs. The “Resub” line means the resubstitution error rate. The “Test” line represents
the testing error rate.

3.4.3 Experiments with autonomous navigation problem

All the tasks above are classiﬁcation tasks. We present the result for a regression
task. A vision-based navigation system accepts an input image X and outputs the
control signal C to update the heading direction of the vehicle. The navigator can be
denoted by an function f that maps the input image space X to control signal space
C. The learning process of the autonomous navigation problem then can be realized

85

as a function approximation. This is a very challenging task since the function to
be approximated is for a very high dimensional input space and the real application

requires the navigator to perform in real time.

 

Figure 3.6: A subset of images used in autonomous navigation problem. The number right
below the image shows the heading direction associated with that image.

We applied our algorithm to this challenging problem. Some of the example input
images are shown in Fig 3.6. Totally 318 images with the corresponding heading
directions were used for training. The resolution of each image is 30 by 40. We used
the other 204 images to test the performance of the trained system. Fig 3.7 shows the
maximum error rates and the mean error rates versus the number of training epochs.
Both maximum error and mean error converge around the 15th epoch. Fig 3.8 gives
plots of the histograms of the error rates at different epochs. As shown even after the
ﬁrst epoch, the performance of the IHDR tree is already reasonably good. With the
increase of the epochs, we observed the improvement of the maximum error and mean
error. The improvement stopped at the 15th epoch because we did not use any new
training samples in each epoch and the system has perfectly ﬁt the existing training

86

 

data set. our test on real mobile robot has shown that a system of such an error level

can navigate the robot very reliably for hours until the batteries are exhausted.

 

 

 

 

 

 

 

 

 

 

30 25
25’ 21
52% ---- g
5 £1.5-
g15* g
Q 0
. c 1r
610» 8
2 2
5, 0.5-
o L L A G 1 1 I L L
O 5 10 15 20 25 30 0 5 10 15 20 25 30
Epochs Epochs

(a) (b)

Figure 3.7: The performance of the autonomous navigation. (a) The plot for maximum
error rates vs. epochs. (b) The plot for mean error rates vs. epochs. The solid line
represents the error rates for resubstitution test. The dash line represents the error rates

for the testing set.

We also compare our experimental results with two artiﬁcial neural networks
(ANN) as reported in [99] with a consideration that the pattern-by-pattern train-
ing mode of artiﬁcial neural networks is also an incremental learning method. A
two-layer feed-forward (FF) network and a radial basis function (RBF) network were
used to train and test for the mapping from the image space to control signal space
using the same data set as used in our IHDR tree algorithm. The results are listed
in Table 3.6 which shows that our algorithm outperforms these two ANN methods.

87

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

30 -10 Deg 1o 20 3 1‘0 20
fee
(a)
0.3 0.8 T r
0.7» 0.7[
0.6- 0-6'
0.5» gos- ‘
0,4. 30"
'5
0.3 a 0.3»
0.2% 0.2»
0.1 »
0.1 ~
. .30 -1o {0 20
30 —10 O 10 20 Degree
(C) ((1)

Figure 3.8: The histograms of the error rates. Plot (a), (b), (c), and (d) correspond to the
histograms at epoch 1, 6, 11, 20, respectively.

Table 3.6: The performance for vision-based navigation

 

 

 

 

Algorithm Mean error (degree) Max. error (degree)
Resubstitution Testing set Resubstitution Testing set

FF 1.02 2.00 10 12

RBF 1.53 1.84 12 12

IHDR tree 0.00 1.25 0 13

 

 

 

 

 

 

88

3.4.4 Experiments using data with manually extracted fea-
tures

We further investigated how our IHDR algorithm performs on lower dimensional real
data. These data used manually selected features. The same data sets from StatLog
project [57] were adopted in the experiments in order to have a fair comparison
with the batch HDR tree algorithm. We inserted our performance to the results in
Table 2.9 and Table 2.10. The performance is slightly worse than the batch one while

still comparable to all the other classiﬁers.

89

Chapter 4

Learning from Continuous Input

Stream

4. 1 System Overview

For the purpose of learning automation, the program-level representation should not
be constrained by, or embedded with, handcrafted knowledge—level world models or
system behaviors. It is very difﬁcult to manually build a sufficient set of rules or
behaviors that is general and complete enough to handle the challenging recognition
tasks we have to deal with. Fig. 4.1 shows the relation between the learner and the
trainer. Let X, be the image input at time 2', where X,- is with dimensionality r x 0,
representing an image of r rows and 0 columns. Z, denotes the numerical sensory
input from the trainer at time i. When there is no numerical input from the trainer,
Z,- = O (default value). The human trainer decides when to impose the numerical
inputs. A,- denotes the action performed by the learner at time 2'. When A,- = 0,

90

 

*F

no action is performed. 1,- denotes the action imposed by the trainer on the learner.
When there is no action to be imposed, the trainer set I, = 0. Otherwise I,- # O and

A,- <— 1,, meaning the imposed action is performed immediately.

 

 

 

 

V"b1 w 1d Xi l Ai *
151 e or 1) Learner l ’[ Effector

Zi [ Ii

Trainer

 

 

 

 

 

 

Figure 4.1: Learner and trainer. Xi: visual input. Zi: numerical sensory input from the
trainer. Ii: action to be imposed. Ai: action to be performed.

At each time instant 2', the learner performs the following steps:
0 Receives (X,,Z,-,I,-).

o If I,- is on (non-zero), then A,- <— 1,. Otherwise, the learner uses its current state

S,- to query the prediction tree and get the predicted action Api. Let A = Api.

0 Use (X,, 2,, A,) as input to query and update decision tree. This gives a prim-

itive cluster H.

0 Use (P,,P,~_1,S,-) to query and update a association tree. This gives a state

Sm-
0 Use (5,, 5.11) to update the prediction tree. Let the current state be Si“.

The following sections explain how these are realized.

91

4.2 The Split SHOSLIF

SHOSLIF [96] [87] is a general approximator for supervised or unsupervised learning.
It approximates a high dimensional function f : X +——> Y, by incrementally creating a
tree from training samples. SHOSLIF uses the principal component analysis (PCA)
and linear discriminant analysis (LDA) to recursively build a feature space for every
internal node of the tree. The basis vectors from PCA are called most expressive
features (MEF) vectors and those from LDA are called the most discriminating feature
(MDF) vectors. Each leaf stores the sample :16,- it represents. Given any input :1: E X,
the tree ﬁnds the top It matched leaf nodes. For supervised learning, each leaf node
stores the desired output y, = f (21,) The output f (:r) is approximated by a weighted
average among the output vectors of top 1: leaf nodes. For unsupervised learning, it
reports the top It matched leaf nodes with a conﬁdence measure. The time complexity
for ﬁnding top k (k is a constant) matched leaf nodes is O(log(n)) for a tree with n
leaf nodes.

An incremental version of SHOSLIF is described in [97] whose task is to build a
tree incrementally. It uses a scheme that involves microtrees and macrotrees. The
required computation time for learning each sample is not uniform. In order for each
machine cycle to be ﬁnished within a roughly constant time, we introduce a new
scheme here. We do not evolve the node, unless it becomes a leaf after the deleting
process has deleted the subtree rooted from it. The new update-node algorithm is as

follow:

Procedure 6 Update-node (N, X): Given a node N and the input X, update the

92

 

subtree rooted from node N.
1. if (node is a leaf)

(a) if input-sensitive(X, node-gmean} // if X is not the same as mean with

noise considered
0 split-node(node, X); // node is split into two nodes
(b) else

0 matched(node, X); // no need to learn
2. else if (MEF-projection(X - mean) < 0 ) // decide which half-space Xfalls into
0 update-node{node-gleftchild, X);
3. else
0 update-node{node-grightchild, X);

In order to avoid storing unnecessary samples, especially those that are very sim-
ilar, a learning sample a: does not cause a new leaf node to be created if a: is within
6 distance from the top matched node :1),- and only :13,- is updated by :1: to record the
mean and variance of the samples falling into the leaf node. Thus, each leaf stores a
prototype sample with the pre-deﬁned input-sensitivity 6.

In the procedure split-node, we do not change the X-ﬁeld of the existing node.
The X-ﬁeld (mean) of a node B is ﬁxed as soon as it is created. However, we keep a
reserved splitter inside B as the mean vector ml") and an estimate of the MEF vector
pl”) as more inputs pass through B, where n is the number of visits to B. As soon

93

as B becomes a leaf node (by e.g. the forgetting process described in section 4.3 that
deletes nodes), we replace ml") and pl") for its current splitter. When B node spawns
new children, this new splitter is used to split its children population into left and

right subtrees.

Now, consider how to update the reserved splitter. Suppose that a node B with

ml") and pl”) receives a new input :13. The new mean is updated as

nm(") + a:

(n+1) _
m _ (3

+(1—5lx

where 0 < B S 1 is a depreciation factor for the history so that very old samples will
be disregarded faster than the standard case with B = 1. When n = 2, the vector
“(2) is a unit vector aligned with m“) — :13. When n > 2, vector is updated by :1:
using an appropriate method, so that it aligns roughly with the principal component.
For example, we can use either the power method or a probably more Space saving

method (no need to store covariance matrix):

v...» = w + (1 — 7)(m‘") — as)

where 0 S 'y < 1 determines how much the old pl") should be retained as opposed
to the simple difference method where B = 1 for n = 2. Note that this vector
can happened to be zero, in which case, (mln) — as) can be directly used. The vector
(m(") —a:) can never be zero because the input sensitivity test guarantees ||m(") —:r|| >
0.

94

 

 

 

 

 

1- L ,
I‘,
' \
I \
I \\
I s
I ‘s‘
T—r .......................... 'L ....... ~11, ............
I ______
I “--
‘ + I
II ’2 ’3 ’4

Figure 4.2: Update of memory trace (strength) M through time t. The solid curve rep-
resents an element which is visited often enough to be kept. The dashed curve indicates
an element that is not visited often enough and thus, it falls below the threshold T before
being visited again.

4.3 Forgetting

Due to a ﬁnite memory space, the system cannot remember all the associations that
it has come across. In fact, it should forget for generalization and for space limitation.
Many associations must be forgotten. The utilization of association is indicated by

co-occurrence frequency and occurrence intervals.

Let us consider a memory element, a tree node, which is used in our system
for memorizing an association it represents. Each element has a memory residual
register whose updating curve is shown in Fig. 4.2 which resembles what we know

about human memory characteristics [6] [36].

Each visit to the same element makes the trace to be reset to 1 and then the curve
declines using a next slower speed. For example, we can deﬁne a series of memory
fade factors (11 < 02 < < am z 1. a,- is used for an element that has been visited
i times. The memory trace r can be updated by r <— raﬁ where t is the number
of system cycles (refresh) elapsed since the last visit to the node. Thus, we do not
need to visit all the elements at every system cycle. When an element is visited, its

95

memory trace is updated ﬁrst from what remains from the last visit. If the memory
trace falls below the designated threshold, the memory element should be deleted and
so it is marked as to-be-deleted. If what is deleted is more than a single element, the
deleting process will not delete it right away to avoid consuming too much CPU time
in a real time process. Instead, it puts those elements in a garbage buffer which is to

be cleaned when the learner is “sleeping.”

4.4 Spatiotemporal clustering

The objective of Spatiotemporal clustering is to form primitive clusters (P-clusters)
using not just spatial information, but also temporal information. A P-cluster consists
of a number of prototypes each being represented by a leaf node. Each P-cluster
corresponds to a state of the system.

Fig. 4.3 shows a schematic illustration of the temporal-adjacency cluster. The
cluster represents a mapping 9 : X v—> Y which maps from input space X to output
space Y. X is the sensory input space of a particular sensor and Y is the output
space containing P-clusters. Given 3:, y = g(a:) is the image of 2:, representing the
corresponding P-cluster. In order to preserve the necessary t0pology in X, Y has the
same dimensionality as .X or has a reduced dimensionality. If X has a dimensionality
n x n corresponding to an image space of n x n pixels, then Y = Rmxm, where R is
the set of all real numbers. The ratio of dimensionality difference (n x n)/ (m x m), is
called the factor of dimensionality reduction (FDR). This reduction of dimensionality
is reasonable because all we need is to have Y space to roughly keep the topology of

96

I b ’
x SHOSLIF —o\ \ ‘ :> Y

<3 , *
_O___/ a:

<%

 

 

 

 

 

Figure 4.3: A schematic illustration of the temporal adjacency cluster. Each region in Y
space represents a P-cluster. A dashed curve is used to roughly indicate the samples that
fall into the cluster. The circles to the right of the SHOSLIF tree are the leaf nodes of
the tree. Each —,= or * sign indicates the relative positions of leaf nodes (6-prototype) in
the input space to the SHOSLIF tree. Two different signs representing two semantically
different input sequences. Each leaf node in the SHOSLIF tree has a pointer to the Y space,
which points to the center of the corresponding P-cluster or close to it after a signiﬁcant
amount of gravity pulling and merging. Y is shown here as 2-D for visualization, but it
is typically of the same dimensionality as X or with a reduced dimension by a factor of
dimension reduction.

X space. When a 6-prototype is ﬁrst created, it is represented by a single training
sample x,. We let its image y be the same as 33,, y,- = g(a:,-) = 2:,- (with FDR taken into
account if FDR > 1.) Later successful matching with x,- will cause y,- to be pulled
toward the Y vectors of its neighbors in X.

Consider an input stream :13 = {3(0), $(2), 23(3), ..., :1:(t), ....} where each frame 23(2')
is in X and 13 is the time index. The stream :1: forms a trajectory in X x T space, where
T = {0, 1, 2...t...} represents the set of time indexes. We need to chunk the trajectory
into segments of some appropriate length of events. Each segment corresponds to
a P-cluster. Two frames 22(1) and a(m) are temporally adjacent if one is followed
immediately by another within a time window w, i.e., 0 < ll — m] S 112, where w

97

—‘.' m 1.- _-. '- ~ ~

21> 21>

l >Y : >Y Y

 

 

Figure 4.4: The effect of pulling in Spatiotemporal clustering. The SHOSLIF tree is re-
peatedly visited by $1 a: 2:2 z 173 that are nearby in Spatiotemporal domain. The two leaf
nodes are among the t0p k matched leaf nodes, although they are separated early in the
tree. Consequently, their Y vectors are merged when they are sufﬁciently close and the
corresponding prototypes of the two leaf nodes belong to the same P-cluster.

is the adjacency window (default is 1). Given each frame x(t) at time t, its top Is
matched prototypes are considered. The top match gives its image y. Among top k
matched 5 prototypes with a sufficient matching conﬁdence, if two 6-prototypes are
less than a (a > 6) distance apart in X space and they are temporally-adjacent, the
images of these two 6-prototypes are pulled together using a simulated force, where
the number of visits of each prototypes is the “mass”. Thus, among the temporally
adjacent (5-prototypes with a a radius in X, those mostly often visited prototypes
tends to become the center of the cluster. When the y images of two prototypes
are very close, their 3] images become one. Merged y vectors form the center of
the P-cluster. Merged centers of y vectors have a higher mass. It can quickly pull
nearby y vectors and merge them. This speeds up the clustering speed. To avoid two
neighboring P-clusters to be slowly pulled together, a maturity schedule is applied so
that P-clusters that are mature enough will no longer be moved — an effect of long
term memory. Fig. 4.4 shows that the pointers emitting from leaf nodes gradually
point to the center of the P-cluster in Y space.

98

Note that two inputs 2:,- and :17]- will not have the same image (state) if they are
within a distance but never occur one after another. In other words, the P-clusters
are clusters of Spatiotemporal clusters, but 6-prototypes in SHOSLIF leaf nodes are
spatial—only prototypes and their spatial radius are typically much smaller than that

of a Spatiotemporal cluster.

The output from the Spatiotemporal cluster can be regarded as the lowest-level
discrete state. The techniques used here have a close relationship with learning vector
quantizer (LVQ) [43] [44] and k-mean clustering [38]. The differences are (1) temporal
concept; (2) unknown number of clusters; (3) the nearest neighbors must be found

quickly and thus the SHOSLIF tree is used; (4) learning is incremental.

4.5 The experiments

4.5.1 The system flow

Depending on whether the action is imposed or not, the learning can be classified into
action-imposed learning and action—autonomous learning. Action-imposed learning is
such that the extro-effector part of a(t + 1) is supplied by the trainer. For example,
hand-in-hand learning can be used by human adult to teach a child how to use a pen.
Otherwise, the learning is action—autonomous learning.

Fig. 4.5 illustrates a ﬂow chart of action-imposed learning. If the trainer imposes
an action on an effector at any time through, e.g., through a joystick, the system
performs action-imposed learning for that effector. Otherwise, the system performs

99

l

 

 

I Grab current sensory frame 4

 

 

 

 

 

 

 

 

-——" Update memory <

I

 

Figure 4.5: The ﬂowchart for action-imposed learning. The system learns while performing.

action-autonomous learning.

Suppose that the machine agent M has recorded in memory B =
{(x(z'),s(z'),a(z')) | 2' = 0,1,...,t — 1} U {s(t),a(t)}. Note that s(t),a(t) are the re-
sult from sensory input $(t — 1). According to the ﬂow diagram in Fig. 4.5, M grabs
the current sensory frame s(t). Then, M computes the next state s(t + 1). If an ac-
tion is imposed, a(t + 1) is supplied by a human being (or environment) and thus M
complies by sending a(t + 1) to the effector and then updates its memory by replacing
B by B U {:1:(t),s(t + 1),a(t + 1)}. If an action is not imposed, M derives action
a(t + 1) based on the past experience using a simpliﬁed g in Eq. (1.2) as follows.

First, M ﬁnds the best matched state:

3': argminogigtllsﬂ + 1) — 3(2)”. (4.1)

Then, the output action is determined as the action associated with the best matched
s(j): a(t+1) = a(j). The memory update is done as before. After a:(t+ 1) is grabbed
in the next machine cycle, which may include the resulting reward sensed by the
biased sensor, the system memory becomes B = {(2:(z'), s(z'), a(z')) | 2' = 0,1,...,t+1}.

100

 

As can be seen, this oversimpliﬁed version of action—imposed learning can do only
a little generalization by extending the action learned by the nearest neighbor 3(3)
(or multiple neighbors with action interpolation) to the current new state s(t + 1),

whenever no action is imposed by the human.

4.5.2 Experimental results

Although the method does not restrict the content domain in which the system learns,
we apply it to face recognition training, with supervised learning. The hypothetical
theme is to train the learner to learn “Tell the name of the face”. The interactive
training was conducted on-line. Several persons entered the scene one by one, stayed
for a while, and left. During the stay, the system trainer gives a coded numerical
sensor signal Q. In the mind of the trainer, it means question “who is it?” because
the trainer immediately imposes a numerical action which is regarded as the name of
that person by the trainer. First, we present the result of experiment using synthetic
data. The goal is to see how the system works. Data for person A (PA) and person B
(P3) are simulated. To simplify the notation of later presentation, we use the symbols

summarized in the following tables.

Table 4.3 explains how the training session was conducted through continuous
machine cycles, where each T, covers several image fresh cycles (i.r.c.) as listed
below: T1:0—19,T2:20—29,T3:30—39,T4:40—49,T525O-69,T6:60—79,
T7z80—89,T8:90—119,T9:120—129,T10:130—139,T11:140—149,T12:150—169,
T13 : 160 — 179, T14 : 180 — 189, T15 : 190 — 199. The appropriate action were imposed

101

Table 4.1: The representation for image input

 

Symbol Meaning
b Background
A, PA entering
A3 PA stay
A, PA leaving
Be PB entering
BS PB stay
B, PB leaving

 

 

 

 

 

 

 

 

 

 

 

Table 4.2: The representation for numerical sensor input and action output

 

 

 

 

 

 

 

Symbol Meaning
S Silence
Q Asking question
N N 0 action
A Answering “A”
B Answering “B”

 

 

 

 

by the human in the training session. Since this is a noise-free case, the resulting
primitive clusters are combinations of images, numerical sensors, and actions. These
clusters are the inputs to level-one in the automatic Spatiotemporal level building.
Notice that the recurrence input to the level-one tree helps to remember the history of
training experience. If we do not use the recurrence input, an ambiguous prediction

will occur. In this case, P2 will go to both P3 and P5.

The training session is followed by the testing session. A question is asked after
presenting a face. The results are summarized in Table 4.4. The results show that
both faces can be recognized correctly because the correct reply “A” or “B” was

received. The time periods correspond to different image refresh cycles in the test

102

 

Table 4.3: The Training and the resulting primitive clusters

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Period See Hear Action P-Clusters
T1 b S N P0
T2 A, S N P1
T3 A, S N P2
T4 As Q N P3
T5 A, S A P4
T5 A, S N P2
T, A; S N P5
T3 b S N P0
T9 B, S N P5
T10 B, S N P7
T11 83 Q N P 8
T12 B, S B P9
T13 B, S N P7
T14 B, S N P10
T15 b S N P0

 

 

 

 

 

 

 

session were listed as follows. T1 : 0 — 14, T2 : 15 — 20, T3 : 21 — 25, T4 : 26 — 27,
T5 : 28—30, T6 : 31—35, T7 : 36—52, T3 : 53—58, T9 : 59—66, T10 : 67—68,
T11 :69 — 70, T12 : 71 — 75, T13 :76 — 80.

In the next experiment, we added some noise to the synthetic input data (both
training and testing sessions). Different signal noise ratios (SN R) resulted in different
primitive clusters at level zero. The number of states at level zero increased when
the SNR decreased. Since we do not manually deﬁne the state representation, the
number of states in level zero does not affect the performance much. Because of the
application of nearest neighbors, very similar prototypes are used as inputs at level
one even if the matches are not exactly the same. However, if SNR was too small,
the retrieval was not perfect. The test results are shown in Table 4.5 and 4.6, where
each period T,’ corresponds to almost the same but somewhat different image refresh

103

 

Table 4.4: Testing with noise-free images

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Period See Hear Action
Tl’ b S N
T2’ A, S N
T; A, S N
T; A, Q N
T; A, S A
Té A, S A
T7’ b S N
Té B, S N
T; B, S N
Tllo Bs Q N
Ti 1 B, S B

1’, B, S B
1’3 b S N

 

 

cycles as that for Table 4.4. The system failed to reply correctly when SN R went down
about 2db as shown in Table 4.6. However, this does not reﬂect the system’s noise
immunity because the data are simulated and the task in the simulation is relatively
simple.

Next, we present the experiment results for real images. Fig. 4.6 shows some
frames that involve one person. The weight of the center pixels of an image is larger
than that of the boundary as indicated by the ﬁrst image in Fig. 4.6 so that a scene
element at center will have a larger effect than one at the periphery. The system au-
tonomously grabbed images frames from the incoming video stream and self-organized
its internal representation. The resolution of an input image was 50 by 50 pixels.
Fig. 4.7 shows the resulting states of level zero that correspond to the segment in
Fig. 4.6 after forgetting, pulling, and merging. The number of level zero states were
reduced to 18. Since the input of the level zero contains numerical input, image in-

104

 

Table 4.5: Result when inputs are noisy with SNR = 14db

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Period See Hear Action
Tl’ b S N
T 2’ A, S N
T 3: A, S N
T; A, Q N
T g A, S A
T, A; S A
T 7’ b S N
T; B, S N
T; B, S N

{o B, Q N
1’ 1 B, S B
1’, B, S B
1’3 b S N

 

put, and numerical action, the state representation is a combination of these three
components. In Fig. 4.7 we display the states by showing the image and the values
of numerical sensor and numerical action as a pair (S, A). These automatically de-
rived states approximate the prototypes of sample population with detail forgotten
(merged).

The system automatically built level one. The input to level one includes states
from level zero (primitive clusters) and the current state of level one. The state
representation in level one is recursive and it is difﬁcult to give an intuitive visual
display. In the training session of this experiment, eight people were presented to
the system one by one. When a person stayed in front of the camera, the trainer
asked the question by using numerical sensor input and imposed action (the label of
that person) after the question. Later, the person entered the scene again for test.
No action was imposed in the testing phase. The system performed according to its

105

Table 4.6: Result when inputs are noisy with SNR = 2db

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Period See Hear Action
Tl’ b S N
T 2’ A, S N
T; A, S N
T 4 A, Q N
T5’ A, S N
Té b S N
T7’ B, S N
Tg B, S N
T ,5 B, Q N
Tl’0 B, S N
Th b S N

 

 

Table 4.7: Real Images: The Resubstitution Test

 

 

 

 

 

 

 

 

 

 

 

 

 

Person Ask at (i.r.c.) Answer at (i.r.c.) Answer
1 27-36 32-44 1
2 137-146 145-154 2
3 237-246 243-254 3
4 335-344 346-352 4
5 432-441 437-450 5
6 530-539 536-547 6
7 637-646 642-654 7
8 743-752 754-760 8

 

 

predicted action. Some image frames seen by the system during testing were shown
in Fig. 4.8. During the entire test session for all eight persons, the system replied
correctly except for one person at roughly the right time (after being asked) and did
not “speak” when it was not supposed to. The results are summarized in Table 4.7
and 4.8. The test result shown in Table 4.7 uses video sequence for training. That in
Table 4.8 uses new video stream for testing and thus involved various noise and view
variation.

106

 

Figure 4.6: A temporally subsampled segment of the real-time video stream during which
eight persons were presented to the system for training. Only one person is involved in this
segment. The images are shown in the English reading order.

107

 

(0,0) (0,0) (0,0)

Figure 4.7: State representations at level zero. The values of numerical sensor and nu-
merical action are represented as (S, A). Where S = 0 means “silence” and S = 1 means
“asking question”. A = 0 means “no action” and A = 1 means “person 1”.

108

 

 

Figure 4.8: A temporally subsampled segment of the real—time video stream during the
test session, corresponding to the person shown in Fig. 4.6.

109

Table 4.8: Real Images: The Disjoint Test

 

 

 

 

 

 

 

 

 

 

Person Ask at (i.r.c.) Answer at (i.r.c.) Answer
1 37-46 38-45 1
2 137-146 147-162 2
3 237-246 256-263 3
4 335-344 354-357 4
5 425-434 445-451 5
6 537-546 556-563 4
7 632-641 651 7
8 744-753 763-770 8

 

 

 

 

 

The attention mechanism is then applied to improve the performance. A typical
attention gates for both channels are shown in Fig. 4.9. The attention mechanism
selects the modalities according to the novelty of the signals. The results are shown
in Fig. 4.10. All eight persons got right response at roughly appropriate time during

the testing phase, except the single incorrect short transition shown in Fig. 4.10.

Attention for Sensor 1 Attention for Sensor 2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1 - 1
7
0.8L - 0.8 *
§ .§
go 6 g, 0.6
< <
0 4 l 0.4 *
0 2 0 2
O 1 1 1 1 G 1 1 1 1
0 20 4O 60 80 100 0 20 40 60 80 100

Frames number Frames number

(a) (b)

Figure 4.9: Attention gates for two channels in 100 i.r.c. (a) is for numerical input
and (b) is for image input. When the attention is 1, the gate is open. Otherwise, the
gate is closed.

110

g

Response of the system during testing

 

Action
4}

 

 

 

 

 

 

 

 

 

 

 

 

o 260 460 660 800
Image refresh cvcles

(a),

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

o 260 460 660 800
Image refresh cvcles

(b)

Figure 4.10: The testing results which uses attention selection mechanism. (a) the
system response during the testing phase. (b) the attention for the numerical channel.
The peak at 347 i.r.c in (a) indicates a wrong answer at the beginning of person 4
entering. The wrong response is corrected by the latter signals.

111

l

Chapter 5

Application to the Robotics

5.1 Introduction

The conventional mode of developmental process for a robot is not automatic —
the human designer is in the loop. Given a robotic task, it is the human designer to
understand the task. Based on his understanding, he comes up with a representation,
chooses a computational method, and writes a program that implements the method
for the robot. During this developmental process, some machine learning may be
used, during which some parameters are adjusted according to the collected training
data. However, these parameters are deﬁned by the human designer for the task
given. The resulting program is for this task only, not for any other tasks. This
manual development paradigm has met tremendous difﬁculties for tasks that are too
muddy for human to provide an effective representation. For example, tasks require
perception of environment but the environment is totally unknown at the time of
programming.

112

 

Due to the tremendous difﬁculties in sensing and reconstructing unknown scenes,
the behavior methods, such as those by Brooks [13], aim to avoid the drawbacks of
the conventional sense-model-plan-act (SMPA) framework. With a behavior-based
method, the programmer converts hand-designed behavior rules for the robot into
the control program. Although such an approach has successfully created robot’s
behaviors that have been explicitly programmed in, the huge number of sensed con-
ditions and the desired behaviors requires very complex behavior rules, which tends
to be difﬁcult to design and verify, especially when the number of behaviors become
large. In the work reported here, we investigate the possibility of letting a system
learn without explicitly programming behavior rules into a system. In particular,
the system has not been calibrated, because it is not a trivial task to calibrate a
dynamically changing hand-eye—head system. The task of system design is mainly to
deveIOp a self-organization scheme for the system. The system acquires task execu-
tion capability through interactive learning process, in which it learns by sensing and
doing.

To work in an unstructured environment, robots need sensors to sense the environ-
ments and translate what they sensed to what they can relate to their actions. The
basic issue is the calibration issue, calibrating the relationship between the sensor and
the other parts of the robot. For vision-guided robotic system, the calibration is not
a trivial task. Two types of approaches exist. The ﬁrst type requires the human de-
signer to explicitly model the relationships between sensors and actuators. The result
is a set of equations with a number of parameters that need to be estimated. Most of
the existing works belong to this category (e.g., [85], [53], [94], [68], [17], [21], [26]).

113

a,

 

W

The second type of approach does not employ explicit parameterized models. But
rather, a general approximator is used to map the sensor space to the actuator space.
A common tool to accomplish this is the artiﬁcial neural network [41], [53], [94]. The
ﬁrst type of approach is effective when the system can be accurately modeled with a
few parameters, but has problems when the system conﬁguration is time varying or
has a lot of degree of freedoms. The second type does not have the latter limitation due
to the generality of the neural network approximator. However, the neural networks
require a huge number of iterations and thus, not suited for interactive learning or task
learning where each case presented must be learned immediately on line. The system
must remember each training example presented and generalize to other location
based on relatively few examples learned.

For vision tracking, positioning control such as in [9], they do not use a learning
scheme. Some of them focus on the moving objects as in [52]. For learning by
watching such as in [101] and [102], they generate a sequence of robot control language
as ﬁnal outputs. Some researches are in the area of learning task level as in [56]
and [1]. Many approaches place a constraint on the positioning and(or) movement
of the target as in [69], which assumes that the depth of the plane on which objects
travel is known a priori. In [3], it is assumed that the target will maintain a regular
and repetitive movement that can be analyzed and intercepted. In [84], the three-
dimensional movement of the manipulator is decomposed into movements which are
at most two-dimensional. In [31] [32] [33], a recursive partition tree (RPT) is used
to approximate the mapping from the input to the output and applied to stereo
calibration and sensor-based action sequence learning and execution.

114

 

 

The work reported here is to study a mechanism which enables a hand-eye—head
system to perform some task sequences with explicitly hand-crafting task rules into
the control program. With this goal in mind, two criteria are important. The ﬁrst is
the generality — the system should not be imposed with certain rules that prevent it
from learning various tasks. The second is the scalability — the system can not slow
down quickly when the number of learned cases or tasks increase. The scalability is

important since the system needs to perform in real time eventually.

Speciﬁcally, we use the incremental hierarchical discriminant regression (IHDR)
tree described in Chapter 3 to organize the input space into a hierarchy of coarse-to-
ﬁne partitions which approximate the mapping between the sensing-task space and
the temporal action space. This tree shares the common characteristic with the well
known decision trees [76], regression trees [11] and clustering trees [38] in the sense
that uses a tree to organize information in a hierarchical manner. Moreover, the
building procedure of the IHDR tree is not iterative. This allows the system to learn
in real time. The incremental construction capabilities of the IHDR tree allows the
tree to grow according to the performance of the current tree. We also perform inter-
leaf interpolation for the multiple path probing, which results in a good approximation
and generalization performance without need for time-consuming batch analysis of
the entire set of data during tree construction. The method has been tested on a
hand-eye—head system for performing vision-guided temporal tasks, such as vision
tracking and reaching objects.

115

 

5.2 System Architecture

5.2.1 The Hardware Setup

A real robot called SAIL was assembled, as shown in Fig. 5.1. We enumerate the

hardware components as follow:

1. A Pentium II 333 dual-processor dual-bus CPU with 512 MB RAM and an
internal 27 GB three-drive disk array. This allows a real-time memory recall

and update as well as real-time effector controls.

 

2. ViewSonic ViewPanel VPA138 14 inches LCD display. The display is both flag

and light. These features make the robot platform easy to carry the display.

3. Two Panasonic GP-KSl52 industrial Color 1 / 2 inch CCDs with auto gain con-

trol and auto white balance.

4. A Matrox video card with two-way splitter for image capture and digitization.

5. Two Citizen M938 3.8 inches LCD Color video monitors.

6. An AccelePort 8e multiport serial adapter from Digi International Inc.. This

allows the host extend extra 8 serial communication ports.

7. Two Pan-Tilt Units (PTU) from Directed Perception, Inc. to control pan and
tilt motion of two eyes (CCD cameras). A C Programming Interface (PTU-
PCI) is used for fast binary communication between the PTU controller and
the host computer.

116

 

Figure 5.1: The SAIL robot built at the Pattern Recognition and Image Processing Labo-
ratory at Michigan State University.

117

8. A ﬁve—joint SCORBOT-ER III Robot Arm from Eshed Robotec, as shown in
Fig. 5.2. The controller can control two additional motors. We use one to

control the rotary table as the “neck” of the SAIL robot.

 

   

Shoulder

 

 

Base >

 

 

Figure 5.2: The Eshed robot arm.

9. 4 pressure sensors to sense push actions and force. 28 touch sensors on its arm,
neck, head, and bumper to allow human to teach how to act by direct touch.

The conﬁguration of these sensors is described in Appendex C.

10. An automatic power system APS 750 from TRIPP LITE power protection to

switch between heavy-duty batteries and line power.

The system diagrams are shown in Fig. 5.3, 5.4, and 5.5.

118

. '33-'31 Witt“

 

I. F:-

LCD Monitor _ LCD Display

i— Key Board

Camera —.

    
  
  

 

Pan—Tilt Unit

 

Neck

‘

Host Computer

Battery

 

 
  

-__:."

Eshed Robot Arm —>

—
_ DfiVe Base
Bumper _>

Figure 5.3: The system diagram of the SAIL robot (right side view).

119

Camera —

     
 
        
   
 
  

 

Pan—Tilt Unit % LCD Monitor LCD Display
7////A
*— Key Board
Neck -------

Drive Base Controller

‘

<— Drive Base AC Power Strip

ADR 2000

Pan—Tilt Unit Controller

Pan—Tilt Unit Power Box "IIIIIIIIIIIIIH'

’G)‘ llllllllllllllllll

©

 

O-I

Eshed Controller

 

AccelePort 8e
Eshed Robot Arm —>

 

Battery

Power Inverter

Drive Base
Bumper _>

Figure 5.4: The system diagram of the SAIL robot (left side view).

120

 

 

 

I Eshed Robot Armj [Rotary table (neck) J L32 sensors ] [Pan—Tilt Unit ] [Pan-Tilt Unit I

 

 

 

 

 

 

 

 

 

 

 

 

 

I , l -
7 Pan-Tilt Unit Pan—Tilt Unit
| Eshed Controller ] [ ADR 2000 | Comm", [ Comm",
1 I J
[AccelePort 8n

 

I LCD Monitor LCD Monitor J

[LCD Dismay l—_ [Matrox Video board ]

Host Computer

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I KeyBoard ]
(a)
--------- -> —>
AC Power DC Power
Pan-Tilt Unit - Host Com uter Pan-Tilt Unit
[Controller [Eco rem] p [LCD Monitor] Commie,

 

 

 

Ir---

 

 

.r----*i
.L---’
i---*

 

 

t

 

 

I
l

l Eshed Controller P-----LAC Multiple Outlet |------>I LCD Display |

p

l'
L

 

 

 

 

 

 

 

 

 
 

 

 

 

Line Power ____- APS 750 Automatic
Power System
CCD L ADR 2000 I CCD

 

 

 

 

(b)

Figure 5.5: The system diagram of the SAIL robot. (a) hardware connection. (b) power
connection.

121

5.2.2 The Software Setup

The programs were developed using Microsoft Foundation Class (MF C) for Graphic
User Interface (GUI) and multiple threads processes. The multiple thread processes

are illustrated in Fig. 5.6.

Instead of having four threads continuously querying the status of sensory inputs,
our system collects the information whenever is needed. It ﬁrst spawns four threads
to acquire 1) the current positions of the Pan-Tilt Units, 2) the 28 touch sensors
and four pressure sensors, 3) the current joint positions of the arm, and 4) the stereo
images. The main 100p of the program then starts from checking the completion of
these threads. Whenever a thread from one sensory input is ﬁnished, the main loop
then spawns a thread for that device. When all the information is acquired, the brain
process runs and then goes to the other iteration. Currently the output of the brain
process consists of two actions: the movement of the robot arm and the PTU. This is
trig from the brain process by spawning the robot arm action thread and PTU action
thread. This design not only saves the system resources by query from demand but
also solves the synchronization problem of multiple threads process for a real time

system.

The brain process takes four sensory inputs and two action feedback inputs. It
generates two action outputs. The original image input gives 480 x 640 resolution
with three color bands. The input dimension of the original stereo images is as high
as 480 x 640 x 3 x 2 = 184320 bytes. This is a very high dimensional input and is not
feasible for a real time system under the current CPU speed. Instead, we only use

122

T '7‘ pawn,

 

 

 

 

I Start image query I..+-> Image Acquisition ..n,

i

IStart sensors query I--+-> Sensory Acquisition

i

I Startarmquery lump-» ArmSignal Acquisition

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I Start PTU query I--'r--> PTU Acquisition

 

 

 

, Sleep till frame '
time is up

I Get Images H .......... -.‘mi

' .9.

rSm image query I [

- i -
I Get Sensor Signals Iﬁ. ........... ........ .

IStart sensors query I
i

I GetArmSignals Hm-.. .......... .

I Startarmquery I

i

 

 

 

 

 

 

 

 

 

 

 

 

 

I StartPTUquery J

i

Brain Process

 

 

7 --.tu Arm Action

 

 

 

-...i L» PTU Action

 

 

 

 

 

 

 

l _______ ‘. Thread 0 Variable

-------------- 9 Data flow ——> Control ﬂow -------> Spawn thread

Figure 5.6: The ﬂow diagram of the system. Dashed line box indicates a separate thread.
Oval box means thread. Solid line is for control ﬂow and dashed line represents data flow.
Dot line indicates spawning a thread. Frame time is up every 100 ms.

123

1

Delay

[

l

 

——> Arm action AA(t+1)
32 sensors reading S(t)

PTU aciton PA(t) —>
Stereo images I(t) .._>
—> PTU action PA(t+1)
PTU position P(t) .
Brain Process

Arm position A(t) .

—>

—>

Arm action AA(t)

’ E

Figure 5.7: The inputs/outputs of the brain process.

 

 

 

 

 

region of interest (ROI) of the stereo images I L and I R for the left and right images
with r x c resolution, respectively. The ROI is further down-scaled to a factor f,
f S 1.0. The dimension of the image input then is DIM(I) = r x c x 3 x 2 x f. The
choice of the system parameters (r,c,p) depends on the tasks given. For example, if
the robot arm wants to pick up a cup in front of it, it needs f = 1 with small (7‘, c).
If we want to do navigation, we need to have T = 480 and c = 640 with a small f to
cover the whole views and the attention might need to take care of part of the images

with a large f for locations like doors or crossings.

We have two PTUs to control the motion of two “eyes”. Each PTU reports
its current pan and tilt position. The inputs from PTUs are P = (PL,PR) with
PL = (panL,tz'ltL) and PR 2 (pang, tiltR), where pan], and tiltL are the pan and tilt

position of the left PTU. The input dimension from PTU thus is four.

The robot arm has ﬁve joints and one gripper. The inputs from the position of

124

the robot arm are A = (j 1, jg, . . . , 3'6), where jl is the base position, j2 is the shoulder
position, 33 is the elbow position, 34 is the roll position, 35 is the pitch position, and
j6 is the gripper position.

There are 28 touch sensors and 4 pressure sensors for the sensory input. The
usage of these sensory data is task dependent. In the current design, 20 touch sen-
sors are used. There are four touch sensors TIP 2 (Ephﬂprﬂlpm’ﬂpd) to control
the left PTU. The touch sensors to control the left PTU to turn left and right is
called TM and Tlpr. The touch sensors to control the left PTU to tilt up and down
is called TIP, and szd- Similar to left PTU, the right PTU has four touch sensors
to control the motion: T r, = (T,,,,T,,,,T,p,,Trpd). Each robot arm joint has two
touch sensors to control the motion. Thus the touch sensors for the arm joints are:
T, = (le1, ler, Tjgu, Tjgd, Tjgu, TJ-3d, T34”, Tj4d, Tj51, Tj5r). Like the naming of the PTU
touch sensors, u is for “up”, (1 is for “down”, l is for “left”, and r is for “right”. For
the gripper, (736,,73-60) are the touch sensors to control the open and close motion.
The left front pressure sensor T; f is used to specify the task type for the current imple-
mentation. The left rear pressure sensor Tl, tell the system in the learning mode and
the right rear pressure sensors T". is used to trig the system in the testing mode. The
sensory reading for the brain process thus is S = (Tlp, Trp, Ta], TJ-6,, T360, TU, T1,, T").

The brain process generates two action outputs for the current implementation.
One for the robot arm AA(t) and the other for the PTUs PA(t). The values of these
actions do not represent the amounts of the control signals. Every action only takes
one of the three values (—1,0, 1). For example, the action of the joint angle 1 (base)
of the robot arm is 1 if we want it to move left. A value —1 for the base action

125

 

means turn to right and 0 for stop motion. Such a design enables us to perform real
time training and testing. This is very different from the “snap shot” approach which
trains the system to achieve the goal state with the current observation without any
feedback control signals sent in between.

Besides the inputs from the hardwares, the brain process also takes action feed-
backs from PTU PA(t — 1) and arm AA(t - 1) as inputs. These feedbacks can both
make the input features more distinguishable and stabilized the action sequences, as

shown in Fig. 5.8.

 

 

 

 

 

 

 

 

 

1 ﬂ - . . ' ' r
. 1~ ..
...... Action-1
005_ 0.5'_
0. out 00...“) (SIDS-4 :- "\I- g 0- m m ]
_0.5' -0.5
own ActionO ' """ Action1 _1, ......
"1 —1:5 —i -015 no 05 i 1.5 —1:5 -i -o:5 no 05 i 1.5

(a) (b)

Figure 5.8: (a) A system without feedback. The x axis means the input II. The training
samples are mixed up on the boundary of the action 0 the other two actions. The system
will behave dangling between action 0 and the other two actions. (b) Use feedback as input
to the system. The x axis means the input 11 and y axis is the feedback input 12. The
distance between action 0 and the other two actions are pulled away. When the system
outputs action 0 from action 1, it will remain in action 0 unless [1 changes a lot.

The brain process has a general IHDR core with a task interpreter module.
The IHDR tree incrementally mapping from high dimensional input space to ac-
tion space as described in Chapter 3. The task interpreter module speciﬁes the

meaning of the sensor inputs. For example, a touch on the front left pressure sen-

126

sor means the robot is doing task one. A touch on the rear left pressure sensor
means start to train. In the training mode, the robot will takes both the input
In(t) = (TSK(t),I(t), A(t), P(t), PA(t — 1), AA(t — 1)) and the corresponding out-
put Out(t) = PA(t), AA(t)) to build the IHDR tree, where TSK(t) is the type of the
task interpreted by the task interpreter module. In the test mode, the robot gives
the current input In(t) = (TSK(t),I(t),A(t),P(t),PA(t — 1),AA(t — 1)) to query
the IHDR tree and outputs the actions Out(t) = PA(t), AA(t)).

Since each input component has different dimensionality ranging from hundreds
(image input) to one (e.g. task type), we need to normalize each component so that

each component is comparable. The input to IHDR tree thus is a long vector :

X(t) = (’LUtskTSK(t),’w[I(t),194140),ZUpP(t),’LUpAPA(t — l),wAAAA(t — 1))

 

X t —> .
( ) IHDR tree Y0“)

S(t)=X(t- 1) —>

 

 

 

xa)

 

Task type T(t)
Stereo images [(t)
m position p0) PTU action PA(t+l)
Arm position A(t) Arm action AA(t+1)

Y(t+l)

 

S(t)=X(t— l)

rb----.

 

 

PTU aciton PA(t)
Arm action AA(t)

 

 

 

Figure 5.9: The input/output of the IHDR tree.

127

 

5.3 Experimental Results

We performed several experimental on the SAIL robot. Since tracking objects and
reaching objects are sensorimotor behaviors ﬁrst developed in early infants, we trained
our SAIL robot for two tasks. In the ﬁrst task, called ﬁnding-ball task, we trained the
SAIL robot to ﬁnd a nearby ball and then turn eyes to it so that the ball is located on
the center of sensed image. In the second task, called pre-reaching task, we trained

the SAIL robot to reach for the object once it has been located and the eyes ﬁxate

on it.

 

Figure 5.10: Real—time training and testing for SAIL robot: ﬁnding the nearby ball and
then reaching for it.

Existing studies on visual attention selection are typically based on low-level

128

saliency measures, such as edges and texture [8]. In Birnbaum’s work [10], the visual
attention is based on the need to explore geometrical structure in the scene. In our
case, the visual attention selection is a result of past learning experience. Thus, we
do not need to deﬁne any task-speciﬁc saliency features. It is the SAIL robot that
automatically derive most discriminating features for the tasks being learned. At the
time of learning, the ball was presented in the region of interest (ROI) inside the
stereo images. The human trainer interactively pulls the robot’s eyes toward the ball
(through the touch sensors for the pan-tilt heads) so that the ball is located on the
center of the region of ROI (ﬁxating the eyes on the ball). The inputs to the devel-
opmental algorithm are the continuous sequence of stereo images and the sequence
of the pan-tilt head control signal. Three actions are deﬁned for the pan-tilt head in
pan direction: 0 (stop), 1 (move to the left), or -1 (move to the right). The size of
ROI we chose for this experiment is deﬁned as 120 x 320. In the mind of trainer, the
ROI is divided into ﬁve regions so that each region is of size 120 x 64. The goal of the

ﬁnding-ball task, is to turn the pan-tilt head so that the ball is at the center region.

Fig. 5.11 shows some example images for the tracking task.

30 63

 

 

   

324 357

     

Figure 5.11: A subset of images used in the tracking problem. The number right below the
image shows the PTU position associated with that image. From left to right, one image
sequence of ball-tracking is shown.

The scenarios for training and testing are shown in Fig. 5.13 and Fig. 5.12. The

129

 

 

Image

Task

Pan Position

Prev. Pan Action

Pan Action

Initialization

Time step

Image

Task

Pan Position

Prev. Pan Action

Pan Action

Initialization

Time step

 

\

 

—- ____, .—
—-——1 .

 

\

 

 

 

 

 

 

 

 

 

 

..-.v-. __ _1,_. — ._ v-.- — _ »—- __ —_.. —— .....— L .___ — ._.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[I
l I!
w
d [I
i ii
I L]
I I]
J. H
l [I
l II
t TT+1
(a)
]/ir]
’ ||
1 ii
I I]
] l
M
I ii
I’ [#1
lJ il
I II
I I;
i II
I ||
1 ii
I ll
1 I]
t TT+1

(b)

 

Figure 5.12: The transitions of sensors / effectors for ball tracking problem. (a) The training
session. (b) The testing session.

130

transitions during the training session are described below:

1. The task input is initiated by pushing a pressure sensor of the robot (or key in
a letter using keyboard) before imposing action to pan the camera. The action

of the pan is zero at this time since there is no action is imposed.

2. The action of the pan is imposed at time t. The initialization flag is on at the _

same time. The main program issues a pan thread to pan the camera.

 

3. The PTU starts to pan and the pan position as well as the image change. Note

at t+1 the previous pan action is zero.

4. When the ball is at the ﬁxation of the view at time T, we stopped impose action

by issuing a stop thread to the PTU and the initialization ﬂag is off.

5. At time T+1, the PTU stOpped moving the image did not change anymore. It

is worth noting that the pan action are all zero after time T-l.

Similarly, the testing session can be explained at the follow:

1. We push a pressure sensor to give the task input as well as turn the initialization

flag on at time t.

2. The action of the pan is given by query the IHDR tree. A non-zero action is

expected according to the trained examples.

3. The PTU and the image starts to move.

131

4. When the ball is at the ﬁxation of the view at time T, the query result of the
IHDR is a zero action. The program issues a st0p thread to the PTU and the

initialization ﬂag is off.

5. At time T+1, the PTU stopped moving the image did not change anymore.

If we did not use the context (previous pan action) as input, the image and the
pan position will be very similar at the point which the action should stop. This
will make the PTU st0p and go at the boundary point. As explained in Fig. 5.8, the
additional context is necessary to make the system stable.

To better understand the task performed, we ﬁrst used the off-line training-testing
mode to analyze the behavior of the algorithm. In the training mode, we positioned
the ball in ﬁve different position in the ROI and turned the PTU to locate the ball
in each session. Two sessions were trained for the system. In the test mode, we
presented the ball in 7 different position in the ROI. We trained the system ﬁrst used
two training sessions and then followed by a test session. The training and testing
were ran for 10 epochs.

Fig. 5.14(a) shows the memory usage of the program. In the ﬁrst stage, the tree
grows since the samples are accumulated in the root node. When the sample buffer
is saturated, the memory curve goes to ﬂat. At the end of the second epoch (runs 19
and 20), the tree started to grow child nodes and the memory start to increase again.
In the run 21 and 22, the forgetting process was initiated and the memory still slowly
grows because the second level nodes were accumulating samples. Note at this point
the tree had one root node and two child nodes. From run 23 to run 28, the second

132

 

       

(A) [(0,:66,0), (0)] (B) [(1,-66,0), (1)] (C) [(1,-67,1), (1)] (D) [(1,-102,1), (1)]

“...”? Lug w-~ \ - -'4I
. . , . ‘j -. _ v. .
Igl“ ii ' .1 r

(E) [(1 -134 1) (1)] (F) [(1,-167,1), (1)] (G) [(1,-199,1), (1)] (H) [(1,-232,1), (1)]

   

 

     

Ed m . —. Ed 1
(I) [(1, -2,)971, (1)] (J) [(1,-,),3301()] (K) [(1,-362,1), (1)] (L) [(1,-396,1), (1)]
(M) [(i,-428,1),(0)1 (N) [(1)-458,0), (0)] (O) [(194589) (0)] (P) [(1,-458,0). (0)]

 

(b) [(1,-55,0), (1)] (C) [(12'7121)2 (1)] (d) [(1a'105i1)2 (1)]

.... -» . :4:
. J
b .. '

(e) [(1,-i38.1),(1)i (f)[(1,-171,1),(1)] (g)[(1-2031 (1)) (h) ((12361) (1))

'4 _'.*. . ,Ti ‘ 1
; ‘ a - , ’ ‘
a, i a

(i) [(1,-269,1), (1)] (i) [(1,-302,1), (1)] (k) [(1)-334,1), (1)] (1) [(1,-367,1), (1)]

“t _ ' ' " Jr?“ .
. ”f ' ' i ' V :
:‘z . ‘ ‘
, - , I”

(m ) [(1'399 1)) (1)] (n) [(17'43221)2 (0)] (0) [(13'46220)2 (0)] (P) [(12’346220)) (0)]

 

   

Figure 5.13: An illustration of how the training and testing sequence match. The ﬁrst four
rows show a sequence of training samples, from (A) to (P). The related sensors/effector is
shown for each sample. The sensory inputs shown are the image I(t) and In(t)=(TSK(t),
P(t), PA(t-1)). The output Out(t) is the action of the pan PA(t). Under each image shown
the associated inputs and outputs [In(t), Out(t)]. The last four rows shows a sequence of
testing samples.

133

level nodes are accumulating samples. Run 29 to run 36 the samples are saturated
and the memory curve stay ﬂat. At run 37, the third node of the second level were
spawned. The memory grows till the run 48. The memory then saturated and keep
ﬂat afterward.

Fig. 5.14(b) shows the performance of the program for the test session. After the
3-rd epoch, the systems can achieve the desired performance to locate the ball in the
center of ROI. Fig. 5.14(c) gives the plot of the average CPU time for each action
sequence. The average CPU time for each action sequence is within the mental cycle
time: 100 millisecond. Since the IHDR tree is dynamically updated, all the forgetting,
updating, and rejecting processes occurred during each sequence. The average CPU

time for each sequence is up and down, depending on the given training examples.

134

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

f
A 8 f -(
m
E
8 6*
(I)
3
5 4~
E
s
2,
G r . a r
0 20 40 60 80 100
Runs
(a)
0.08 . . . . 106
’6
an 80-
0.06) g
76’
2‘3 g 60
90041 D
g n.
h 0
m m 40 i
002- ] g
' 9 20
<
0 * 4 4 G ‘ ‘ ‘
0 2 4 6 8 10 0 50 100 150 200
Epochs No. sequence
(b) (c)

Figure 5.14: (a) The memory usage for the off-line simulation of ball tracking task. (b)
The performance of the testing. (c) The average CPU time for each sequence.

135

 

 

 

Chapter 6

Conclusions

6. 1 Summary

A new algorithm is ﬁrst introduced to cast both classiﬁcation and regression prob-
lems into a uniﬁed regression framework. This allows us to design the new doubly-
clustered method. Clusters in output space provide coarse-to—ﬁne virtual class labels
from clusters in the input space. To deal with high-dimensional input space, a differ-
ent discriminating subspace is automatically derived at each internal node of the tree.
A size-dependent probability-based distance metric SDNLL is proposed to deal with
large sample cases, small sample cases, and unbalanced sample cases. Our experi-
mental study with synthetic data showed that the method can achieve near-Bayesian
optimality for both low dimensional data and high dimensional data with low dimen-
sional data manifolds. With the help of the new decision tree, the retrieval time for
each sample is of a logarithmic complexity. The output of the system can be both
class label or numerical vectors, depending on how the system trainer gives the train-

136

 

 

 

ﬁ-L-l.

ing data. The experimental results have demonstrated that the algorithm can deal
with a wide variety of sample sizes with a wide-variety of dimensionality.

The incremental building of the HDR tree opens up the possibility of real-time
interactive training where the number of training samples can be extremely large but
the resulting tree does not need to store all the training samples. Thanks to the
amnesic average and the forgetting process, the computational complexity is very low
compared with the previous batch version of HDR tree. We have also demonstrated
the performance of the incremental version of the HDR tree is comparable to the
batch version. The experiments on the vision-based navigation show the feasibility
of the IHDR tree algorithm applied on the regression problem.

A new way of dealing with recognition tasks is proposed. The recognition of an
object is treated as a dynamic, online, incremental, continuous learning process. The
system learns while performing. The programming stage does not require humans
to specify content-speciﬁc rules. The goal is to relieve humans from the tedious task
of directly controlling the internal representation. No reprogramming is required
to train the system for more cases, more types of input, and even new tasks for
new domains. This is because our framework does not require humans to provide
a problem space. Of course, the idea of active vision [4] [18] calls for the dynamic
aspect. The autonomous learning concept introduced here is a further advance.

The application on the robotics develops an robot which can interacts with the
environment in real-time. A technical challenge to perform such a task is the map—
ping engine that scalable — keeping real-time speed and a stable performance for
a very large number of high dimensional vector data. With the proposed powerful

137

 

q

 

 

FFJ '-" I)". H.411“. a.“

mapping engine, the robot is able to operate in real time. A new style of program-
ming multi-sensory multi-effector real-time system is implemneted. Every thread is
ﬁnished within each memory cycle. The proposed algorithm has successfully run on
the SAIL robot for real-time interactive training and testing for two sensori-motor
tasks: ﬁnding ball and reaching the centered ball, two early tasks that infants often

learn to perform.

6.2 Future Work

The proposed HDR tree algorithm provide an efﬁcient way to deal with both clas-
siﬁcation and regression problems. However, there are several issues remain to be

explored:

o The theoretic proof of the algorithm. A formal proof of the algorithm is desired
to better understand the behavior and the limit of the algorithm. Although
the synthetic data simulation in Section 2.5 shows the HDR tree algorithm
reaches near optimality of Bayesian bound, a further study of the theoretic
bound can give more precise insight of the algorithm and help to choose the

suitable applications.

0 More applications and larger data set to test the algorithm. Several experiments
have bee conducted for the prOposed algorithm including synthetic data, real
image data, data with manually extracted features. Since what we proposed in
a general framework for both classiﬁcation and regression problem, we would
like to test more applications of the algorithm. To make others easy to try

138

 

our algorithms for their own application, the software can be accessed in the
web page: http : //www.cse.msu.edu/rgroups/amdl/hdrt.html. In the face

recognition problem, we would like to apply our algorithm to larger databases.

0 Improvement of the decision tree. The proposed HDR tree algorithm does not
handle problem with missing data attribute values. Methods in [64] [90] [89]
can be used to deal with this problem. Besides, if the cost matrix is provided for
a classiﬁcation problem, the current implemented HDR tree did not utilize this
information to obtain the discriminating features. Some data mining techniques

such as boosting [20] can also be incorporated into the HDR tree algorithm.

Similar to the HDR tree, the future research directions for the incremental HDR

tree are:

a A more comprehensive study of the dynamic behaviors of the proposed algo-
rithm. We would like to see further analysis of the parameters used in the

algorithm and the theoretic bound for the method.

0 We have tested the algorithm in the simulation of the vision-based navigation
problem. It is interesting to see the algorithm applied to the real robot and

compare with other existing systems [99] [73].
0 Improve the decision tree by using other data mining techniques.
For the robotic application, the future directions are as follow:

0 Longer context: The current tasks trained does not need much context. We
would like to have a longer context for the future work.

139

 

 

Integration of short and long temporal context: We would like to have a mech-

anism for automatic Spatiotemporal level building.

Non-monolithis sensory processing: All vision, speech, and touch sensory inputs

have similar receptive ﬁeld mechanisms.

Intramodal and intermodal attention: We would like to deal with both in-
tramodal (e.g., visual attention) and intermodal attention (selection which sen-

sory inputs are important.)

Mechanisms of self-generated intention: This is accomplished in motivational

system in the human central nervous system.

140

 

1......

 

APPENDICES

 

 

Appendix A

Linear manifold

Given a set of vectors V = {121, v2, ..., 12,} which is a subset of a vector space X. We

want to express the subspace that passes the head tips of the vectors in V.

For numerical stability, we use the center of the vectors in V,

and deﬁne the set of scatter vectors from their center: 8,- = 1),—T), z' = 1, 2, ..., n. These
77. scatter vectors are not linearly independent because their sum is equal to a zero
vector. Let S be the set that contains these scatter vectors: S = {3, | z' = 1, 2, ..., n}.
The subspace spanned by S, denoted by span(S), consists of all the possible linear
combinations from the vectors in S.

A translation of a subspace is called a linear manifold [66] (also called linear
variety [51]). The subspace M translated to vector v0 is denoted by v0 + M: 220 +M 2
{be + m ] m E M}. Thus, the subspace that passes the head tips of the vectors in S

142

 

 

Figure A.1: The linear variety (hyper plane) that passes through the head points of the
vectors. It can be represented by r) + span(S), the spanned space from scatter vectors
translated by the center vector 17.

can be represented by the linear manifold D = o + span(S), as shown in Fig. A.
The orthonormal basis a1, a2, ..., an_1 of the subspace span(S) can be constructed
from the radial vectors 31,32, ..., 3,, using the Gram—Schmidt Orthogonalz'zatz'on (GSO)

procedure:

Procedure 7 GSO Procedure: Given vectors 81,32,...,8n_1, compute the or-

thonormal basis vectors 31,32, ..., 3,,_1.
1. a1: sl/Ilslll.

2. Fort = 2, 3, ..., n - 1, do the following

(a) 02 = s,- - §;l(8far)aj-

(b) at = aé/llaéll.

In the above procedure, a degeneracy occurs if the denominator is zero. In the ﬁrst
step, the generacy means 31 is a zero vector. In the remaining steps, it means that
the corresponding vector 3,- is a linear combination of the previous radial vectors. If a
degeneracy occurs, the corresponding 3,- should be discarded in the basis computation.

143

 

The number of basis vectors that can be computed by the GSO procedure is the
number of linearly independent radial vectors in S.

Given a vector :1: E X, we can compute its scatter part 3 = :1: — v“. Then compute
the projection of :r onto the linear manifold. It’s i-th coordinates in the orthonormal
basis is given by [3,- : sTa,, t = 1, 2, ..., n— 1. We call the vector f = (61,62, ...,Bn_1)T

the feature vector of a: in the linear manifold S.

144

 

Appendix B

Cholesky Decomposition

Procedure 8 Cholesky factorization: Given an n x n positive deﬁnite matrix

A = [aij], compute lower triangular matrix L = [1”] so that A = LLT.

Fori=1,2,...,n do

1. Forj=1,2,...,i-—1 do

j—l
lo = (“2) " Z likljkl/ 1,,-
kzl

 

_ i—l 2
2- [ii — \/aii — kzl lik

145

 

Appendix C

Sensor System of SAIL Robot

C.1 Communication speciﬁcation

The serial A / D board can be connected to the serial port of the host computer through

the following protocol:
1. Baud rate: 38,000.
2. parity: No.

3. data bits: 8.

4. stop bits: 1.

5. No software or hardware handshaking is needed

146

 

 

0.2 The A/D board

The sensor system consists of (32) 12 bit analog input channels with a range of 0 to
5 Volts. This corresponds to a digital range of 0X000 to OXFFF in hexadecimal, or
0 to 4095 in decimal notation.

The 32 inputs are multiplexed into an 8 channel A/ D converter using hardwired
analog multiplexers. Each of the 8 lines has a 50 K ohm pull up resistor to +5V DC.
A short on any line should produce an analog reading between 0.5 volts that gives a
decimal reading below 400. An open should produce a voltage above 4.5 volts which
gives a decimal reading above 3600.

Four of the analog inputs are connected to resistive force sensors which have a
static resistance around 50K ohms. This produces a voltage input on these analog
channels of about 2.5 volts which corresponds to a decimal reading around 2000.
Pressure applied to the resistive force sensor will cause its resistance to change about
10% to 20%, giving a decimal reading change of about 300.

The command sent to the serial A/D board to cause it to scan all 32 ports is a
“q” followed with a carriage return and line feed. The serial A/ D will scan all 32
ports and return 32 4 character decimal values with space separators and a carriage
return plus line feed terminator string. The output is arranged to two rows with each

row consists of 16 port readings, as listed in the Table C1

147

Table C1: The output of the SAIL sensor system

 

 

 

port ﬁrst row second row
left eye left left eye down
neck left neck right

 

right eye left

right eye down

 

case left rear sensor

case right rear sensor

 

arm switch 1 (gripper) down

arm switch 1 (gripper) up

 

arm switch 3 (pitch) down

arm switch 3 (pitch) up

 

arm switch 4 (elbow) down

arm switch 4 (elbow) up

 

arm switch 6 (base) down

arm switch 6 (base) up

 

left eye right

left eye up

 

bumper front

bumper right front

 

right eye right

right eye up

 

case right front sensor

case left front sensor

 

arm switch 2 (roll) down

arm switch 2 (roll) up

 

bumper right rear

bumper rear

 

arm switch 5 (shoulder) up

arm switch 5 (shoulder) down

 

 

agzagzgomqamawwp:

 

bumper left rear

 

bumper left front

 

148

 

 

Bibliography

[1] Alan D. Christiansen, Mattew T. Mason, and Tom M. Mitchell. Learning
Reliable Manipulation Strategies without Initial Physical Models. In Proc of the

IEEE Int ’1 Conf on Robotics and Automation, pages 1224—1230. IEEE, 1990.

[2] Hamid Alavi. State self-organization for continuous video stream input, a mas-
ter’s project report. Technical report, Department of Computer Science, Michi-

gan State University, 1997.

[3] Allen, PK. and A. T imcenko and B. Yoshimi and P. Michelman. Automated
Tracking and Grasping of a Moving Object with a Robotic Hand-Eye System.

IEEE Trans on Robotics and Automation, 9(2):]52—165, Apr 1993.

[4] J. Aloimonos. Purposive and qualitative active vision. In Proc. 10th Int ’l Conf.

Pattern Recognition, pages 346—360, Atlantic City, NJ, June 1990.

[5] J. R. Anderson. Cognitive Psychology and Its Implications. Freeman, New

Worky, third edition, 1990.

[6] M. H. Ashcraft. Human Memory and Cognition. Harper Collins College Pub-
lishers, New Royk, NY, 1994.

149

 

l7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

Martin Bichsel. Strategies of Robust Object Recognition for the Automatic Iden-
tiﬁcation of Human Faces. PhD thesis, Eidgenéssischen Technischen Hochschule

Ziirich, 1991. Diss. ETH Number 9467.

Martin Bichsel. Strategies of Robust Object Recognition for the Automatic
Identiﬁcation of Human Faces. Swiss Federal Institute of Technology, Zurich,

Switzerland, 1991.

Billibon H. Yoshimi and Peter K. Allen. Visual Control of Grasping and Ma-
nipulation Tasks. In ARPA Image UnderstandingWorkshop, pages 1151—1157,

Morgan Kaufmann, San Francisco, CA, Oct 1994.

Lawernce Birnbaum, Matthew Brand, and Paul C00per. Looking for Trouble:
Using Causal Semantics to Direct Focus of Attention. In Proc of the IEEE Int’l

Conf on Computer Vision, pages 49—56, Berlin, Germany, May 1993. IEEE

Computer Press.

L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classiﬁcation and Regression

Trees. Chapman & Hall, New York, 1993.

R. Brooks. Intelligence without reason. In Proc. Int’l Joint Conf. on Artiﬁcial

Intelligence, pages 569—595, Sydney, Australia, August 1991.

R. A. Brooks. Symbolic reasoning among 3—D models and 2-D images. Artiﬁcial

Intelligence, 17(1-3):285—348, Aug. 1981.

R. A. Brooks. A robots layered control system for a mobile robot. IEEE Journal
of Robotics and Automation, 2(1):14—23, March 1986.

150

 

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

C. J. C. Burges. A tutorial on support vector machines for pattern recognition.

Data Mining and Knowledge Discovery, 2(2):121—167, 1998.

S. Carey. Conceptual Change in Childhood. The MIT Press, Chambridge, MA,

1985.

Castané, A. and S. Hutchinson. Hybrid Vision/Position Servo Control of a
Robotic Manipulator. In Proc of the IEEE Int’l Conf on Robotics and Automa-

tion, pages 1264—1269, Los Alamitos CA, May 1992. IEEE Computer Society

Press.

T. Darrell and Alex Pentland. Active gesture recognition using partial observ-
able Markov decision processes. In Proc. 10th Int’l Conf. Pattern Recognition,

Vienna, Austria, 1996.

G. R. Dattatreya and L. N. Kanal. Decision tress in pattern recognition. In
L. Kanal and A. Rosenfeld, editors, Progress in Pattern Recognition, pages

189—239. Elsevier Science, New York, NY, 1985.

H. Drucker, C. Cortes, L.D. Jackel, Y. Lecun, and V. Vapnik. Boosting and

other ensemble methods. Neural Computation, 6(6):1289—1301, 1994.

Feddema, J .T. and OR. Mitchell. Vision-Guided Servoing with Feature-Based
Trajectory Generation. IEEE Trans on Robotics and Automation, 5(5):691—700,

Oct 1989.

S. Franklin. Artiﬁcial Minds. MIT Press, Cambridge, MA, 1997.

151

 

[23]

[24]

[25]

[26]

[27]

[28]

[30]

J. H. Friedman. A recursive partition decision rule for nonparametric classiﬁ-

cation. IEEE Trans. on Computers, 262404—408, April 1977.

K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press,

New York, NY, second edition, 1990.

G. H. Golub and C. F. van Loan. Matrix Computations. The Johns HOpkins

University Press, Baltimore, MD, 1989.

Graf, DH. and W.R. LaLonde. A Neural Controller for Collision-Free Move-
ment of General Robot Manipulators. In Int’l Conf on Neural Networks, vol-

ume 1, pages 77—84, San Diego CA, Jul 1988. IEEE.

W. Eric L. Grimson. Object Recognition by Computer: The Role of Geometric

Constraints. The MIT Press, 1990.

1. Harvey. Evolutionary robotic and SAGA: The case for hill crawling and tour-
nament selection. Technical Report CSRP 222, University of Sussex, Brighton,

UK, 1992.

E. G. Henrichon, Jr. and K. S. Fu. A nonparametric multivariate partitioning
procedure for pattern classiﬁcation. IEEE Trans. on Computers, 18:614-624,

July 1969.

D. H. Hubel. Eye, Brain, and Vision. Scientiﬁc American Library, Distributed
by Freeman, New York, 1988.

152

 

[31]

[32]

[33]

[34]

[35]

[36]

[37]

[38]

W. Hwang, S. J. Howden, and J. Weng. Performing temporal action with a
hand-eye system using the SHOSLIF approach. In Proc. Int’l Conference on

Pattern Recognition, Vienna, Austria, Aug. 1996.

W. Hwang and J. Weng. Vision-guilded robot manipulator control as learn-
ing and recall using SHOSLIF. In Proc. IEEE Int’l Conf. on Robotics and

Automation, Albuquerque, NM, April 1997.

W. Hwang and J. Weng. Autonomous vision-guided robot manipulation control.
In Proc. of Asian Conference on Computer Vision, pages 503—510, Hong Kong,

China, January 1998.

W.-S. Hwang, J. Weng, M. Fang, and J. Qian. A fast image retrieval algorithm
with automatically extracted discriminant features. In Proc. IEEE Workshop
on Content-based Access of Image and Video Libraries, pages 8—15, Fort Collins,

Colorado, June 1999.

K. Ikeuche and T. Kanade. Automatic generation of object recognition pro-

grams. Proceedings of the IEEE, 76(8):1016—1035, 1988.

Jr. J. L. Martinez and R. P. Kessner (eds). Learning 6'5 Memory: A Biological

View. Academic Press, San Diego, CA, 2 edition, 1991.

Jr. J. R. Deller, Jone G. Proakis, and John H. L. Hansen. Discrete-Time

Processing of Speech Signals. Macmillan, New York, NY, 1993.

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall,
New Jersey, 1988.

153

 

[39]

[40]

[41]

[42]

[43]

[44]

[45]

[46]

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A

survey. Journal of Artiﬁcial Intelligence Research, 4:237—285, 1996.

S. K. Murthy S. Kasit and S. Salzberg. A system for induction of oblique

decision trees. Journal of Artiﬁcial Intelligence, 221—33, Aug. 1994.

Kieffer, S. and V. Morellas and M. Donath. Neural Network Learning of the
Inverse Kinematic Relationships for a Robot Arm. In Proc of the IEEE Int ’1
Conf on Robotics and Automation, pages 2418—2425, Sacramental CA, Apr

1991. IEEE.

M. Kirby and L. Sirovich. Application of the karhunen-loéve procedure for the
characterization of human faces. IEEE Trans. Pattern Analysis and Machine

Intelligence, 12(1):103—108, Jan. 1990.

T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag,

Berlin, second edition, 1988.

T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, second edition,

1997.

D. J. Kriegman and J. Ponce. On recognizing and positioning curved 3-D objects
from image contours. IEEE Trans. Pattern Analysis and Machine Intelligence,

12(12)1127—1137,1990.

J. E. Laird, E. S. Yager, M. Hucka, and C. M. Tuck. Robo-soar: An integra-
tion of external interaction, planning, and learning using Spar. Robotics and
Autonomous Systems, 8:113—129, 1991.

154

 

[47] Y. Lamdan and H. J. Wolfson. Geometric hashing: A general and efﬁcient
model-based recognition scheme. In Proc. 2nd Int ’1 Conf. Computer Vision,

pages 238—249, 1988.

[48] D. B. Lenat. Cyc: A large-cale investment in knowledge infrastructure. Com-

munications of the ACM, 38(11):33-38, 1995.

[49] D. B. Lenat, G. Miller, and T. T. Yokoi. Cyc, wordnet, and edr: Critiques and

responses. Communications of the ACM, 38(11):45—48, 1995.

[50] E. Lloyd. Handbook of Applicable Mathematics, Volume VI, Part A, Statistics.

J. W. Arrowsmith Ltd, Bristol, 1984.

[51] D. G. Luenberger. Optimization by Vector Space Methods. Wiley, New York,

1969.

[52] Marcos Salganicoff and Ruzena K. Bajcsy. Robotic Sensorimotor Learning in
Continuous Domain. In Proc of the IEEE Int’l Conf on Robotics and Automa-

tion, Los Alamitos CA, May 1992. IEEE.

[53] Martinetz, RM. and H.J. Ritter and K.J. Schulten. 3D-Neural-Net for Learning
Visuomotor-Coordination of a Robot Arm. In Int’l Jt Conf on Neural Networks,

volume 2, pages 351—356, Jun 1989.

[54] M. J. Mataric. A distributed model for mobile robot envrionment - learning
and navigation. Technical Report AI-TR—1228, MIT Artiﬁcial Intelligence Lab-
oratory, Cambridge, MA, 1990.

155

 

[55] A. K. McCallum. Reinforcement learning with selective perception and hidden
state. Technical report, PhD thesis, University of Rochester, Rochester, New

York, 1996.

[56] Michael S. Branicky. Task—Level Learning: Experiments and Extensions. In
Proc of the IEEE Int’l Conf on Robotics and Automation, pages 266—271, Sacra-

mental CA, Apr 1991. IEEE.

[57] D. Michie, D.J. Spiegelhalter, and CC. Taylor (eds). Machine Learning, Neural

and Statistical Classiﬁcation. Ellis Horwood, 1994.

[58] G. A. Miller. Worknet: A lexical basebase for English. Communications of the

ACM, 38(11):39—41, 1995.

[59] H. P. Moravec. The Stanford Cart and the CMU Rover. Proceedings of the

IEEE, 71(7):872—884, 1982.

[60] H. Murase and S. K. Nayar. Illumination planning for object recognition in
structured environments. In Proc. IEEE Conf. Comp. Vision Pattern Recogni—

tion, pages 31—38, Seattle, WA, June 1994.

[61] H. Murase and S. K. Nayar. Visual learning and recognition of 3-D objects from

appearance. Int’l Journal of Computer Vision, 14(1):5—24, January 1995.

[62] S. K. Murthy. Automatic construction of decision trees from data: A multidis-
ciplinary survey. Data Mining and Knowledge Discovery, 1998.

156

[63] S. K. Nayar, H. Murase, and S. A. Nene. Learning, positioning, and tracking
visual appearance. In Proc. IEEE Conf. on Robotics and Automation, May

1994.

[64] M. Nunez. The use of background knowledge in decision tree induction. Machine

Learning, 6(3):231—250, 1991.

[65] J. Yamato J. Ohya and K. Ishii. Recognizing human action in time-sequential
images using hidden Markov model. In IEEE Trans. Pattern Analysis and

Machine Intelligence, pages 379—385, 1992.

[66] E. Oja. Subspace Methods of Pattern Recognition. Research Studies Press,

Letchworth, UK, 1983.

[67] Eitetsu Oomoto and Katsumi Tanaka. OVID: Design and implementation of a
video-object database system. IEEE Trans. Knowledge and Data Engineering,

5(4):629ff, August 1993.

[68] Papanikolopoulos, NP. and PK. Khosla and T. Kanade. Adaptive Robotic
Visual Tracking. In Proc of the 1991 American Control Conference, pages 962—

967, Jun 1991.

[69] Papanikolopoulos, NP. and PK. Khosla and T. Kanade. Visual Tracking of a
Moving Target by a Camera Mounted on a Robot: A Combination of Control
and Vision. IEEE Trans on Robotics and Automation, 9(1):14—35, Feb 1993.

157

 

[70]

[71]

[72]

[73]

[74]

[75]

[76]

[77]

Alex Pentland, Baback Moghaddam, and Thad Starner. View-based and modu-
lar eigenspaces for face recognition. In Proc. IEEE Conf. Comp. Vision Pattern

Recognition, pages 84—91, June 1994. Seattle, Washington.

P. J. Phillips, H. Moon, P. Rauss, and S. A. Rizvi. The F ERET evaluation
methodology for face-recognition algorithms. In Proc. IEEE Conf. Comp. Vi-

sion Pattern Recognition, pages 137—143, San Juan, Puerto Rico, June 1997.

P. J. Phillips, H. Moon, P. Rauss, and S. A. Rizvi. The FERET september 1996
database and evaluation procedure. In Proc. Int ’1 Conf. on Audio and Video-
based Biometric Person Authentication, Cranse-Montana, Switzerlan, March

1997.

D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network.
In D. Touretzky, editor, Advances in Neural Information Processing, volume 1,

pages 305—313. Morgran-Kaufmann Publishers, San Mateo, CA, 1989.

W.H. Press, B.P. F lannery, S.A. Teukolsky, and WT. Vetterling. Numerical

Recipes. Cambridge University Press, New York, 1986.

M. L. Puterman. Markov Decision Processes. Wiley, New York, NY, 1994.

J. Quinlan. Introduction of decision trees. Machine Learning, 1(1):81—106,

1986.

J. Quinlan. C45: Programs for Machine Learning. Morgan Kaufman, San
Mateo, CA, 1993.

158

 

[78]

[79]

[80]

[81]

[82]

[83]

[84]

[85]

L. R. Rabiner. A tutorial on hidden Markov models and selected applications

in speech recognition. Proceedings of IEEE, 77(2):257—286, 1989.

L. R. Rabiner, L. G. Wilpon, and F. K. Soong. High performance connected
digit recognition using hidden Markov models. IEEE Trans. Acoustics, Speech

and Signal Processing, 37:1214—1225, Aug. 1989.

V. S. Ramachandran. Perceiving shape from shading. In I. Rock, editor, The

Perceptual World, pages 127—138. Freeman, San Francisco, CA, 1990.

P. Rauss, P. J. Phillips, M. K. Hamilton, and A. T. DePersia. F ERET (face-
recognition technology) recognition algorithms. In Proc. of Automatic Target

Recognizer System and Technology Symposium, July 1996.

S. R. Safavin and D. Landgrebe. A survey of decision tree classiﬁer methodology.

IEEE Trans. Systems, Man and Cybernetics, 21(3):660—674, May/June 1991.

C. Saunders, M. O. Stitson, J. Weston, L. Bottou, B. Schlkopf, and A. Smola.
Support vector machine reference manual. Technical Report CSD-TR-98-O3,

Royal Holloway, University of London, Egham, UK, March. 1998.

Schrott, A. Feature-Based Camera-Guided Grasping by an Eye-in-Hand Robot.
In Proc of the IEEE Int’l Conf on Robotics and Automation, pages 1832—1837,

Los Alamitos CA, May 1992. IEEE Computer Society Press.

Sharma, R. and Y. Aloimonos. Visual Motion Analysis Under Interceptive
Behavior. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, pages 148—153, Champaign, IL, June 1992. IEEE.

159

 

[86] T. Starner and A. Pentland. Visual recognition of American sign language
using hidden Markov models. In Proc. Int ’l Workshop on Automatic Face- and

Gesture-Recognition, pages 189—194, Zurich, Switzerland, June 1995.

[87] D. Swets and J. Weng. Discriminant analysis and eigenspace partition tree for
face and object recognition from views. In Proc. Int’l Conference on Automatic
Face- and Gesture-Recognition, pages 192—197, Killington, Vermont, Oct. 14-16

1996.

[88] D. L. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval.

IEEE Trans. Pattern Analysis and Machine Intelligence, 18(8):831—836, 1996.

[89] M. Tan. Cost-sensitive learning of classiﬁcation knowledge and its applications

in robotics. Machine Learning, 13(1):1—33, 1993.

[90] M. Tan and J. C. Schlimmer. Two case studies in cost-sensitive concept acqui-

sition. In AAAI—QO, 1990.

[91] P. Thompson. Margaret thatcher: a new illusion. Perception, 9:483—484, 1980.

[92] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive

Neuroscience, 3(1):71—86, 1991.

[93] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New
York, 1995.

160

[94] Walter, J .A. and K.J. Schulten. Implementation of Self-Organizing Neural Net-
works for Visuo-Motor Control of an Industrial Robot. IEEE Trans on Neural

Networks, 4(1):86—95, Jan 1993.

[95] J. Weng. On comprehensive visual learning. In Proc. NSF /ARPA Workshop
on Performance vs. Methodology in Computer Vision, pages 152—166, Seattle,

WA, June 1994.

[96] J. Weng. Cresceptron and SHOSLIF: Toward comprehensive visual learning. In
S. K. Nayar and T. Poggio, editors, Early Visual Learning. Oxford University

Press, New York, 1996.

[97] J. Weng and S. Chen. Incremental learning for vision-based navigation. In

Proc. Int’l Conference on Pattern Recognition, Vienna, Austria, Aug. 1996.

[98] J. Weng and W. Hwang. An incremental learning algorithm with automatically
derived discriminating features. In Proc. of Asian Conference on Computer

Vision, Taipei, Taiwan, January 2000. accepted and to appear.

[99] Juyang Weng and Shaoyun Chen. Vision-guided navigation using SHOSLIF.

IEEE Trans. Neural Networks, 11:1511-1529, 1998.

[100] S. S. Wilks. Mathematical Statistics. Wiley, New York, NY, 1963.

[101] Yasuo Kuniyoshi, Masayuki Inaba, Hirockika Inoue. Seeing, Understanding
and Doing Human Task. In Proc of the IEEE Int ’1 Conf on Robotics and
Automation, volume 1, pages 2—9, Nice, France, May 1992.

161

 

[102] Yasuo Kuniyoshi, Masayuki Inaba, Hirockika Inoue. Learning by Watching:
Extracting Reusable Task Knowledge from Visual Observation of Human Per-

formance. IEEE Trans on Robotics and Automation, 10(6):799—822, Dec 1994.

162