LEARNING-BASED VISION AND ITS APPLICATION TO AUTONOMOUS INDOOR NAVIGATION

By

Shaoyun Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

1998

ABSTRACT

LEARNING-BASED VISION AND ITS APPLICATION TO AUTONOMOUS INDOOR NAVIGATION

By Shaoyun Chen

Adaptation is critical to autonomous navigation of mobile robots. Many adaptive mechanisms have been implemented, ranging from simple color thresholding to complicated learning with artificial neural networks (ANN). The major focus of this thesis is machine learning for vision-based navigation. Two well-known vision-based navigation systems are ALVINN and ROBIN, developed by Carnegie Mellon University and the University of Maryland, respectively. ALVINN uses a two-layer feedforward neural network, while ROBIN relies on a radial basis function network (RBFN). Although current ANN-based methods have achieved great success in vision-based navigation, they have two major disadvantages: (1) The local minimum problem: the training of either a multilayer perceptron or a radial basis function network can get stuck at poor local minima. (2) The flexibility problem: after the system has been trained under certain road conditions, it is hard to make the system adapt to new road conditions while retaining good performance for the road conditions that have already been learned. This is sometimes termed a "memory loss" problem. As part of our SHOSLIF (Self-organizing Hierarchical Optimal Subspace Learning and Inference Framework) effort, SHOSLIF-N (SHOSLIF for Navigation) treats vision-based navigation as a content-based retrieval problem. The major components of SHOSLIF-N are: (1) Automatic feature derivation: instead of starting with random initial weights, the system employs either principal component analysis or linear discriminant analysis to derive features that are best suited for navigation tasks. (2) Nonparametric recursive partitioning regression, which is more flexible than the global parametric regression used in either ALVINN or ROBIN, is employed in direct input-to-output mapping; it is realized with a recursive partition tree (RPT). (3) A self-organizing mechanism. (4) Low computational complexity: the recursive partition tree has a logarithmic retrieval complexity and can be used to address the complexity issue in learning a large number of scenes.
For a binary RPT, only the most dominant eigenvector of principal component analysis or linear discriminant analysis is needed for further partitioning of each inner node. This leads to an efficient online incremental learning algorithm: the system learns or rejects a learning sample "on the fly" with real-time response. Similar to ALVINN and ROBIN, the basic SHOSLIF-N maps a single-frame retinal input into an output steering signal. The system was successfully tested but exhibited limited capability in handling more complicated situations. When the number of different turns or corners was increased beyond a certain extent, the system sometimes failed to make the turn. One way to tackle this problem is to incorporate state information, which indicates the relative position between the robot and the oncoming corner or intersection, into the system. Therefore, state-based SHOSLIF-N, a system that incorporates states and uses a simple yet efficient visual attention mechanism that helps determine the correct state transitions, is proposed and tested. With a set of fewer than 300 learning samples, state-based SHOSLIF-N has been successfully tested in indoor navigation on the 2nd and 3rd floors of our Engineering Building. Using a SUN SPARC-1 and a framegrabber, both online incremental learning and autonomous navigation were done in real time. A comparative study with two ANN-based methods has shown the advantages of the system: faster learning and better performance for the tasks tested.

ACKNOWLEDGMENTS

I am indebted to my advisor, Professor John J. Weng, who helped me choose the thesis topic and has been an inexhaustible source of ideas throughout my Ph.D. program at Michigan State University. Without his support, this dissertation would not have been completed.

I am grateful to Professors Anil K. Jain, George C. Stockman, R. Lal Tummala, Bang-Yen Chen, and Jim Zacks for serving on my committee and providing critical comments on my dissertation. I would like to express my special thanks to Dr. Jain and Dr. Stockman for their advice on my research topic.

I enjoyed many helpful discussions with former and current members working on SHOSLIF-related projects, especially Yuntao Cui, Dan Swets, and Wey-Shyan Hwang. I received much help from our PRIP lab managers: Lisa Lees, Hans Dulimarta, Chitra Dorai, Sally Howden, Karissa Miler, Paul Albee, and Jason Sperber. I would like to thank the former Prippies who built our mobile robot Rome, which was used throughout my tests. Here, I would like to express my special gratitude to Hans Dulimarta, who provided bug-free programs for robot control.

I benefited from many discussions with former and present PRIP members. Thanks go to: Paul Albee, Hamid Alavi, Vera Bakic, Chaur-Chin Chen, Jinlong Chen, Sei-Wang Chen, Yao Chen, Scott Connell, Nicolae Duta, Kamen Guentchev, Qian Huang, Lin Hong, Marie-Pierre Jolly, Yonghong Li, Jinhui Liu, Jianchang Mao, Yogesh Pathak, Sharath Pankanti, Nalini Kanta Ratha, Prasun Sinha, Yarle Strand, Daniel Swets, Aditya Vailaya, Oivind Due Trier, Marilyn Wulfekuhler, Shanti Vedula, Gang Wang, Bin Yu, and Yu Zhong.

I owe special thanks to my former advisor Prof. Xueyin Lin at Tsinghua University, P. R. China, for his guidance in vision-based navigation, which is closely related to my dissertation topic.

Finally, I sincerely express my gratitude to my parents and my grandmother for their love and encouragement throughout my life.
I would like to thank my wife Xueai Fang, who joined me only recently, for her support and understanding.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

1 Introduction
1.1 A Brief Review of Mobile Robot Vision
1.2 Road Following Techniques
1.2.1 Edge detection
1.2.2 Region analysis
1.2.3 Stereo-based road following
1.2.4 3D perception
1.2.5 ANN-based approaches
1.2.6 Dickmanns' 4D approach
1.3 Some Major Difficulties in Autonomous Navigation
1.4 Adaptive and Learning Mechanisms: Direct Input-to-Output Mapping
1.5 Differences between SHOSLIF-N and Two Neural Network Approaches
1.6 Our Mobile Robot ROME
1.7 Thesis Overview

2 SHOSLIF for Autonomous Navigation
2.1 Introduction
2.1.1 Existing systems
2.1.2 Motivations
2.2 Learning as Efficient Function Approximation
2.2.1 Image space
2.2.2 Functions in the image space
2.2.3 Learning as an approximation process
2.2.4 Case-based learning for function approximation
2.2.5 Hierarchical partition and the RPT
2.2.6 Approximation using the RPT
2.3 Automatic Feature Derivation
2.3.1 Meaningful images in the image space
2.3.2 The most expressive features
2.3.3 The most discriminating features
2.3.4 The DKL projection
2.3.5 Separability
2.4 Self-Organizing RPT Generation through Learning
2.4.1 Data organization
2.4.2 Binary recursive partition tree
2.4.3 Constructing MEF RPT
2.4.4 Constructing MDF RPT
2.4.5 Learning phase and performing phase
2.4.6 Differences between SHOSLIF-N and two neural network approaches
2.5 Complexity
2.5.1 Complexity in the learning phase
2.5.2 Complexity in the performing phase
2.6 Experimental Results
2.6.1 Image acquisition
2.6.2 MEFs, MDFs and their clustering effects
2.6.3 Better clustering in children
2.6.4 Effects on the tree size
2.6.5 RPT retrieval
2.6.6 Real runs
2.7 Comparison with Two ANN-based Approaches
2.7.1 Simulation of a feedforward neural network
2.7.2 Simulation of a radial basis function network
2.7.3 Comparison of SHOSLIF-N with two ANN-based approaches
2.7.4 Mapping SHOSLIF-N into ANN architecture
2.8 Conclusions and Future Work

3 On-Line Incremental Learning for Vision-Guided Navigation
3.1 Introduction
3.2 Incremental Learning and Dominant Eigenvector Computation
3.2.1 Navigation as a content-based retrieval problem
3.2.2 The power method
3.2.3 The Rayleigh quotient conjugate gradient (RQCG) method
3.3 Update for Incremental Learning
3.3.1 Insert a sample
3.3.2 Reconstruction of the W matrix from the parent node's W matrix
3.3.3 The incremental RQCG method
3.4 Incremental Learning Algorithm (ILA)
3.4.1 Analysis
3.5 Experimental Results
3.6 Conclusions

4 State-Based SHOSLIF for Indoor Navigation
4.1 Introduction
4.2 State and Attention: Observation-Driven Markov Model
4.3 Stochastic Recursive Partition Tree for Nearest Neighbor Regression
4.3.1 Navigation as a content-based retrieval problem
4.4 A Case of Indoor Navigation
4.5 Experimental Results
4.6 Comparison with Two ANN-based Approaches
4.6.1 Simulation of a feedforward neural network
4.6.2 Simulation of a radial basis function network
4.6.3 Comparison with two ANN-based approaches
4.7 Conclusions and Future Work

5 Conclusions
5.1 Summary
5.2 Contributions
5.3 Future Work

APPENDICES
A Maps of Test Sites
B A Set of Training Samples Used in State-Based SHOSLIF-N

LIST OF FIGURES

1.1 Differences between the traditional edge-based or region-based approach and direct input-to-output mapping. (a) Traditional approach; (b) direct input-to-output mapping, where the function f directly maps the input image into the output steering signal.
1.2 Schematic difference between current neural network approaches for vision-based navigation and SHOSLIF-N. (a) Direct input-to-output mapping using neural networks. (b) SHOSLIF-N.
1.3 A picture of Rome outside our PRIP lab. A single CCD camera is mounted on top of the robot with a looking-down tilt angle.
2.1 Navigator as a mapping from the image and path selection space to the control signal space.
2.2 A 2-D illustration of a hierarchical Dirichlet partition and the corresponding recursive partition tree (RPT). (a) The partition, where the label indicates the center of a cell. The label of the child to which its parent's center belongs is not shown due to the lack of space. (b) The recursive partition tree.
2.3 A 2-D illustration of the most expressive features (MEF). The MEFs are v1 and v2 and the MEF values are the projected values onto the y1 and y2 axes.
2.4 A 2-D illustration of the most discriminating features (MDF). The MDF is the projection along z1. The MEF along y1 cannot separate the two subclasses.
2.5 Geometrical illustration of the binary RPT. (a) The geometric cells. Thicker lines indicate boundaries of coarser-level cells; thinner lines indicate boundaries of finer-level cells. (b) The corresponding RPT.
2.6 The learning and performing phases.
2.7 Schematic difference between current neural network approaches for vision-based navigation and SHOSLIF-N. (a) Direct input-to-output mapping using neural networks. (b) SHOSLIF-N.
2.8 Piecewise approximation of SHOSLIF-N: the mapping of visual input to output control signal is based on a local neighborhood of the input image. The sphere is used to represent a high-dimensional input space. 'x' denotes a sample visual input. The output associated with input 'x' is a value based on interpolation of the outputs associated with its nearest neighbors, which are marked as solid dot patterns.
2.9 Some sample learning images in the straight corridors and the corners. The upper row is a straight corridor. The lower one is a left turn.
2.10 The difference between MEF and MDF in representing learning samples from a straight corridor and a corner. (a) The first five MEFs of the learning set. (b) The first five MDFs of the learning set (r = 80%). (c) Learning samples represented in the subspace spanned by the first two MEFs. (d) Learning samples represented in the subspace spanned by the first two MDFs. The numbers in the plot space are the class labels of the learning samples.
2.11 Test images in the subspace spanned by the first two MEFs and MDFs, respectively, none of which are included in the learning set. (a) The MEF subspace. (b) The MDF subspace.
2.12 Fewer classes in the learning set: using only straight corridor images for training. (a) Learning samples represented in the subspace spanned by the first two MEFs. (b) Learning samples represented in the subspace spanned by the first two MDFs. The numbers in the plot space are the class labels of the learning samples.
2.13 Test images represented in the subspaces learned from a learning set with 170 corridor images only (without corner images). (a) The MEF subspace. (b) The MDF subspace.
2.14 Distribution of absolute angular error (degrees) of retrieved heading directions. These plots are based on queries using a test set of 204 images, of which 170 are from the straight hallway and 34 from the learned corner. (a) Using a MEF RPT (k = 1). (b) Using a MDF RPT (k = 1). (c) Using a KNDB MDF RPT (k = 4).
2.15 Examples of retrieval. The first row: query images; the second row: retrieved using the MDF RPT; the third row: retrieved from the MEF RPT.
2.16 A sample of autonomous navigation in the trained corridor: two straight corridors and one corner. The circles indicate the locations where images are taken.
2.17 Rome navigates automatically at two corners: the first corner in the 1st row and the second corner in the 2nd row.
2.18 Sample training input-output pair: the top part is a 30 x 40 image; the last row is the associated output signal, which is a Gaussian distribution peaked at the corrected heading direction.
2.19 Some synthesized sample images generated at different locations and orientations along the simulated navigation path. Images were generated under a perspective projection camera model.
2.20 Training with Set 1. 100 trials with different random guesses of initial weights. (a) shows the record of sum of square error (SSE). The trial with minimal SSE is marked with a circle. (b) shows the target output for five classes of training samples with different corrected headings: -10°, -5°, 0°, 5° and 10°. (c) is the error histogram of retrieved heading directions.
(d1) to (d5) show the network output overlaid on the target output for each class: simulated outputs and target outputs are plotted in solid lines and dotted lines, respectively.
2.21 Training record for Set 2: sum of square error for 100 trials. The best SSE solution is marked with a circle.
2.22 Comparison of three approaches: SHOSLIF-N, the best feedforward network and the best RBF network we obtained. A set of 318 images is used for training and a disjoint set of 204 images is used in testing. The error histogram for the training set is shown in the left column while the error histogram for the test set is shown in the right column.
3.1 The evolution of an incremental learning tree. In this example, three nonfixed nodes of the macro tree are evolved into fixed nodes.
3.2 Update of the RPT when a new image is accepted. The search path is represented by dashed lines. The updates are done only at two nodes: the first nonfixed macro node on the search path and the leaf micro node. They are marked as "black nodes".
3.3 Sample training hallway images. The upper row is a straight hallway, while the lower is a left turn.
3.4 The timing record on a SparcStation-2 for learning a set of 182 images (the images ignored in incremental learning are not recorded; P1 = 50 and P2 = 10).
3.5 Distribution of absolute angular error (in degrees) of retrieved heading directions. These three plots are based on queries using a test set of 328 images, of which 195 are from the straight hallway and 133 from the two learned corners.
3.6 The map of the tested sites; trained sections and untrained sections are marked with thicker and thinner lines, respectively.
3.7 Rome navigates automatically at two turns: the first turn in the 1st row and the second turn in the 2nd row.
3.8 The timing records on a SPARC-20/model 61 for learning a set of 400 images (P1 = 150, P2 = 30).
4.1 Why states and attention? (a1) and (b1) show two images around a corner. White boxes in (a2) and (b2) are attention windows of (a1) and (b1), respectively. The attention images of (a1) and (b1) are further shown in (a3) and (b3). (c) and (d) show the correlation image of (a1) with (b1) and (a3) with (b3), respectively, after zeros are padded to the periphery. (c) and (d) are resized to have the same spatial resolution.
4.2 The architecture of state-based SHOSLIF for vision-guided navigation. "Att" stands for "Attention control".
4.3 Different types of corridor structure in our indoor navigation.
4.4 Observation-driven Markov model for indoor vision-based navigation, where x/y is used to represent a state x with associated action y. 'a' or 'g' is used to indicate whether two local attention views or a single global view is used. Associated with each arc is the current observation, either a local view or a global view.
4.5 The representation of the input and output parts.
4.6 A map of the test site. The test loop is indicated by thick solid lines. Six different corners or intersections are trained in our tests. They are marked with bold-faced Arabic numbers.
4.7 Some sample training images around corners or intersections.
Six sets of four consecutive images are from six different corners or intersections in the test loop.
4.8 Sample state transitions covering more than two passes of continuous running along the tested loop with six corners or intersections, which are labeled with digits in this plot.
4.9 Some sample control signals and the associated states. The control signals plotted in this set of figures are corrected heading directions in degrees. (a) and (c) are cases of a left turn. (b) is an "RZ", which has a right turn followed by a left turn. (d) is a straight section between corners 6 and 1. We can see that the change in corrected heading along the straight section is smaller.
4.10 Sample trajectories around the 4th corner show how states can help the robot in navigation. (a) and (b) show two failure cases when states were not used. "X" in (a) or (b) indicates an upcoming collision with the wall. (c) shows the robot successfully made the turn when states were used.
4.11 Rome navigates autonomously around corner 5. The first five images are video sequences taken from behind the robot. The last three images are taken from in front of the robot.
4.12 The mobile robot's view sequence around corner 5 during a real navigation. Each image was taken at every time step. This sequence shows the robot approached the fifth corner, successfully made the turn and entered the straight section.
4.13 Time record of incrementally learning a set of 272 training images. The learning time was recorded on a SPARC-10.
4.14 Sample training input-output pair: the top part is a 30 x 40 image. The second row is the associated output signal, which is a Gaussian distribution peaked at the corrected heading direction. The last row is the binary state fields, which has a one in the first field and zeros in the other fields.
4.15 Training with Set A. 100 trials with different random guesses of initial weights. (a) shows the record of sum of square error (SSE). The trial with minimal SSE is marked with a circle. (b) shows the target output for five classes of training samples with different corrected headings: -10°, -5°, 0°, 5° and 10°. (c) is the error histogram of retrieved heading directions. (d1) to (d5) show the network output overlaid on the target output for each class: simulated outputs and target outputs are plotted in solid lines and dotted lines, respectively.
4.16 Training with Set A using PCR for initialization of weights. The network outputs for 100 training images are plotted with solid lines, while the desired target outputs are plotted in dashed lines for comparison.
4.17 Comparison of a two-layer perceptron with principal component analysis and linear discriminant analysis. The two-layer perceptron is initialized with the principal component regression approach described in the text.
4.18 Comparison of three approaches: SHOSLIF, the best feedforward network and the best RBF network we obtained. The error histograms for the training set are shown in the left column while the error histograms for the test set are shown in the right column.

LIST OF TABLES

1.1 A comparison of the thesis work with state-of-the-art vision-based navigation algorithms.
2.1 The sizes of the MDF tree and the MEF tree, trained with the set of 318 images.
2.2 The total learning time for a set of 363 images.
2.3 Time for retrieving from the trained RPT, given a new input.
2.4 The comparison in retrieval timing on a SPARC-10 between the MEF tree version and flat versions, where tree structures are not used. A set of 2850 learning images was used in training.
3.1 The sizes of the learned RPT for a set of 180 accepted images.
4.1 Comparison of three approaches in state prediction.
4.2 Comparison of three approaches in response time.
4.3 Qualitative comparison of three different approaches. The following notations are used: "local/global" for local regression/global regression; d for input dimension; h for the number of hidden nodes in a MLP; c for the number of centers in a RBFN; n for the number of learned samples. The following abbreviations are also used: "Para." for parametric regression; "compl." for complexity; "Hierach." for hierarchical; "Corr. conv." for correct convergence; "Increm." for incremental; "mem." for memory.
B.1 Summary of state transitions of the training set, where "p. state/c. state" stands for previous state/current state.

Chapter 1

Introduction

The goal of this thesis was to develop vision-based learning techniques that can facilitate the training of mobile robots for autonomous indoor navigation. Practical learning algorithms for this purpose should be able to meet the following three requirements: real-time response, incremental learning, and good performance. These three issues are addressed throughout this dissertation.

Mobile robot systems provide a unique opportunity to develop perception and navigation techniques in real-world environments. In order to perceive the external world, a mobile robot may use various kinds of sensors, including color video, scanning laser rangefinders, stereo video, sonar, radar, and inertial navigation, to collect the necessary information. Higher-level processing then integrates the different sensory inputs, along with other a priori knowledge such as coarse maps or road models, to direct vehicle actions.

1.1 A Brief Review of Mobile Robot Vision

Visual input is the most important sensory source for human drivers. Thompson [116] and Moravec [82] are among the pioneers in exploiting the role of machine vision in autonomous navigation. The JPL robot [116] used visual input to form polygonal terrain models for path construction. The Stanford Cart [82, 83] employed Moravec's interest operator to pick out distinctive features in the images. Three-dimensional information about these features was obtained by a correlation-based stereo algorithm. FIDO [117] used a hierarchical correlation for stereopsis and successfully navigated through a cluttered environment and along a sidewalk. Tsuji et al. [122] used another stereo vision system based on matching vertical edges and inferring surfaces.

The early mobile robots had very limited vision systems. The launch of the Autonomous Land Vehicle project in 1984, funded by the Defense Advanced Research Projects Agency, led to numerous research activities [120], including motion tracking by the Univ. of Massachusetts and Honeywell, tracking using 3D data by SRI, and qualitative navigation by Advanced Decision Systems. Martin-Marietta built its VITS [123], and CMU built the Navlab [118]. The CMU Navlab group concentrated on difficult vision problems in realistic environments and produced many practical algorithms for mobile robot vision.
Their research on the Navlab project up to 1990 is well summarized in book form [119]. Techniques developed for autonomous navigation can be applied not only for strategic defense uses, but also for civilian purposes that can improve the safety of human drivers and significantly lower the number of traffic accidents.

Beyond the DARPA community, the past decade witnessed several other mobile robot projects. A good survey can be found in Thorpe et al. [120]. In the United States, General Motors worked on lane following. The FMC Corporation [72] built a road-following system based on color thresholding. Fujitsu and Nissan in Japan have developed prototype road-following software. In Munich, the group led by Dickmanns has been working on autonomous navigation for a number of years [25, 28, 29, 30].

1.2 Road Following Techniques

There are two basic tasks in vision-based navigation: road following and obstacle avoidance. A mobile robot needs to travel in road areas while avoiding obstacles. In general, these two tasks should not be separated. However, the techniques for these tasks vary in practice since the sensor inputs for the two tasks can be different. Direct range measurements coming from sonar or infrared sensors are useful for obstacle avoidance. Sonar or infrared sensors are not useful in outdoor road following, since they need reflecting surfaces, which are generally not available in outdoor environments, and the obtained signals are not reliable. On the other hand, rich visual information is necessary for both road following and obstacle avoidance. Since sonar or infrared sensory inputs are of low dimension and can be handled with less computation, many obstacle avoidance systems make use of sonar or infrared sensors.

Vision-based road following algorithms can be broadly classified into six categories: (1) edge-based techniques: edges are detected from the input images and used in road following; (2) region analysis: statistical pattern recognition is applied to road segmentation, e.g., clustering of road and nonroad regions; (3) stereo-based road following: a stereo algorithm is used to obtain depth information from either edges or regions for road following; (4) 3D perception using laser scanners; (5) ANN-based approaches: an ANN is used either to map detected features into a qualitative output or to directly map visual input into output steering signals; (6) Dickmanns' 4D approach: feedback from tracking objects modeled with spatio-temporal models is used in optimal control of a vehicle. This approach is a good coupling of optimal control with machine vision.

1.2.1 Edge detection

Edge detection was commonly used in early mobile robot navigation systems [53, 131]. The Terregator [125], a predecessor of the Navlab, also detected road edges and tried to find the central line of the path. The general assumption is that the road lies on the ground plane and the two road edges are parallel. A Hough transform or other line-fitting process is used to find the path. IPM (Inverse Perspective Mapping) [125, 62, 15] maps an input image into a view with a looking-straight-down view direction. A view generated from IPM is also called a bird's-eye view. Since the road lines of a path on the ground plane appear parallel after the IPM transform, IPM simplifies optical flow computation, obstacle detection, and road following. IPM is commonly used in vision-based navigation systems [125, 62, 2].
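Under the flat-ground assumption, IPM reduces to warping the image by a fixed homography. The following is a minimal sketch of that idea, not the implementation used in any of the cited systems; the four ground-plane correspondences are hypothetical values that would normally come from camera calibration.

```python
import cv2
import numpy as np

# Hypothetical calibration data: pixel locations of four ground-plane points
# in the forward-looking view and where they should land in the bird's-eye view.
src_pts = np.float32([[220, 310], [420, 310], [560, 470], [80, 470]])
dst_pts = np.float32([[100, 0], [220, 0], [220, 300], [100, 300]])

# With a planar ground, the image-to-ground mapping is a single homography.
H = cv2.getPerspectiveTransform(src_pts, dst_pts)

def inverse_perspective_map(frame, out_size=(320, 300)):
    """Warp a forward-looking camera frame into a bird's-eye (IPM) view."""
    return cv2.warpPerspective(frame, H, out_size)
```

In such a warped view, the two edges of a straight road appear as parallel vertical bands, which is what makes the subsequent line fitting and obstacle checks simpler.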
The problems with this type of road-following technique are: (1) The detected edges may not be useful for navigation; e.g., in outdoor navigation the strongest edges are usually caused by shadows, while the real road edges are weaker and sometimes not detectable. The same situation occurs in indoor navigation: the true path edges may not be detectable due to ambient lighting or other distractions from the background. (2) The real road edges do not necessarily fit the parallel-lines model, and thus the assumption behind line fitting does not apply. (3) The road edges are sometimes not available due to occlusions.

Edge detection is better suited for structured roads where lane markings or stripes are available. One way to overcome the limitation of a single line-feature tracking algorithm is to explicitly model the different available constraints and features of a structured road. YARF [64, 119] takes advantage of prior knowledge of the environment. According to different feature models, different feature tracking algorithms are designed. One feature tracking method takes over if the current feature tracking fails. Algorithms taking into account more than general edge detection can achieve better performance. The availability of lane markings can be used to greatly enhance the adaptation performance of a driving system. RALPH [96] is a good example. In RALPH, possible road curvatures are predicted, and the one with the highest signature score is picked as the winner. This technique leads to a successful lateral position handler.

1.2.2 Region analysis

An alternative to edge detection is region analysis. A road scene can be classified into road and nonroad regions. This type of algorithm does not rely as heavily as edge-based approaches on a geometrical road model. Different cues, e.g., texture and color, can be used to classify a pixel into road and nonroad. Because of the rich information available in chromatic space, color is widely used in outdoor navigation. Color video is used in several vision-based navigation systems, including the Navlab [118], VITS [123] and FERMI [72]. Both VITS and FERMI project image pixels onto the color space and do a simple adaptive thresholding.

More robust methods are used in the Navlab. Crisman and Thorpe [119] applied statistical pattern recognition techniques to road scene segmentation. SCARF (Supervised Clustering Applied to Road Following) classifies each pixel into road and nonroad. Each class is represented by a Gaussian distribution, of which the mean vector and the covariance matrix are obtained from a supervised learning process. The classifier is a maximum likelihood classifier. SCARF uses reduced-resolution images with 60 x 64 pixels and achieves real-time performance. A voting scheme using the Hough transform is used to obtain the road model. UNSCARF (Unsupervised Clustering Applied to Road Following) makes use of spatial color appearance changes. UNSCARF uses a tuple of five components, x = (red, green, blue, row, column)^t, where the first three are the RGB components of a color scene, and the last two are the spatial location of each pixel. The algorithm clusters pixels with nearby locations and similar color appearances into a class with an ISODATA clustering algorithm. The appropriate model is chosen from a set of candidate road models based on a global matching of road edges detected from the UNSCARF region analysis. The algorithm is suited for unstructured roads.
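As a concrete illustration of the per-pixel maximum likelihood classification that SCARF performs, the sketch below fits one Gaussian per class to labeled training pixels and labels each pixel of a new image by the larger log-likelihood. It is a toy under those assumptions, not the Navlab code; SCARF in addition maintains several color classes per category and re-estimates them as the vehicle drives.

```python
import numpy as np

def fit_gaussian(pixels):
    """Estimate mean and covariance from an (N, 3) array of RGB training pixels."""
    return pixels.mean(axis=0), np.cov(pixels, rowvar=False)

def log_likelihood(image, mu, cov):
    """Per-pixel Gaussian log-likelihood for an (H, W, 3) float image."""
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    d = image.reshape(-1, 3) - mu
    maha = np.einsum('ij,jk,ik->i', d, inv, d)   # Mahalanobis distances
    return (-0.5 * (maha + logdet)).reshape(image.shape[:2])

def classify_road(image, road_params, nonroad_params):
    """Label each pixel 1 (road) or 0 (nonroad) by maximum likelihood."""
    ll_road = log_likelihood(image, *road_params)
    ll_non = log_likelihood(image, *nonroad_params)
    return (ll_road > ll_non).astype(np.uint8)
```

The resulting binary road mask is what the subsequent Hough-based voting step would consume to recover a road model.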
Both SCARF and UNSCARF have been mapped onto the CMU WARP machine with a systolic architecture. Due to the iterative nature of the unsupervised learning used in UNSCARF, it is much slower than SCARF.

1.2.3 Stereo-based road following

Stereo provides depth information that can be used for vision-based navigation. Typically, some measures are taken to reduce the computational complexity, either by hierarchical matching or by matching a few feature points or detected edges. The Stanford Cart [82] used the Moravec interest operator to pick out a set of features, which were tracked and correlated between frames to obtain depth information. The algorithm was very slow: it took five hours to navigate a 20-meter course. The FIDO system [117] used a hierarchical correlation of points chosen by an interest operator. Stereo based on matched edges is commonly used in both indoor and outdoor navigation [122, 1, 71, 74]. Stereo algorithms under a real-time constraint can only obtain a sparse depth map, which is the major limitation for their application in vision-based navigation.

1.2.4 3D perception

3D range measurements, if reliably obtained in real time, can be directly used for navigation. In contrast to stereo-based algorithms, 3D range measurements from a laser scanner can provide a dense depth map. 3D range sensing was used in the Navlab project [119]. The Navlab used the ERIM (Environmental Research Institute of Michigan) scanner, which measured the phase shift of an amplitude-modulated laser. A laser scanner has certain advantages over more traditional vision techniques such as stereo algorithms: it is insensitive to ambient illumination. However, the major disadvantage is that the acquisition rate is too low, which limits its use in high-speed navigation. The typical acquisition rate is one frame per second. 3D perception can be used in more places than road following; e.g., it can also be used in off-road navigation. The laser scanner can be used to build a terrain map for cross-country navigation.

1.2.5 ANN-based approaches

Artificial neural networks have been successfully applied in many engineering applications. ANNs have also found their role in vision-based navigation. One effort using a multilayer perceptron is NEURO-NAV, developed by Meng and Kak [81] at Purdue University for indoor navigation. NEURO-NAV is a system that couples low-level neural-network-based modules with high-level semantically based planning. The Hough transform was used to detect edges from intensity images. The result was then fed into neural networks to produce a qualitative output: a "near," "far" or "at" output indicating the relative position to a landmark. These outputs are then directly fed into a planner for higher-level planning.

The most well-known system is ALVINN (Autonomous Land Vehicle In a Neural Network) [95], developed by the NAVLAB group at Carnegie-Mellon University. ALVINN is a feedforward multilayer artificial neural network trained by a back-propagation learning algorithm. The system maps input images to output steering signals for autonomous navigation. ALVINN does not explicitly impose the type of image features upon which the system is based. The neural network approach is computationally simple, and real-time navigation performance can be achieved. ALVINN has been successfully tested in a variety of road situations.
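The core of such a direct image-to-steering network can be sketched in a few lines. The layer sizes, weight initialization, and output decoding below are illustrative assumptions rather than Pomerleau's actual configuration; in a real system the two weight matrices would be trained by back-propagation on recorded driving data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: a 30x32 input retina, 5 hidden units, and 30
# output units encoding a distribution over steering directions.
D_IN, D_HID, D_OUT = 30 * 32, 5, 30
W1 = rng.normal(0.0, 0.1, (D_HID, D_IN))
W2 = rng.normal(0.0, 0.1, (D_OUT, D_HID))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(image):
    """Map a flattened, normalized input image to steering-unit activations."""
    hidden = sigmoid(W1 @ image)
    return sigmoid(W2 @ hidden)

def steering_angle(outputs, max_angle=30.0):
    """Decode the peak of the output units into a steering angle in degrees."""
    idx = np.argmax(outputs)
    return -max_angle + 2.0 * max_angle * idx / (len(outputs) - 1)
```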
Extensions to ALVINN include MANIAC (Multiple ALVINN Networks In Autonomous Control) [56], which combines several ALVINN-type networks and allows the vehicle to move smoothly from one type of road to another. ELVIS [44] uses an eigen-subspace method as an alternative to the artificial neural network in ALVINN.

Another alternative to ALVINN has been successfully tested: ROBIN [105, 106], developed at the Univ. of Maryland and later extended at Lockheed Martin Corporation, uses a radial basis function neural network (RBFN) to map a low-resolution input image into an output steering signal. Because of its built-in regularization mechanism, ROBIN achieved smoother behavior than ALVINN did, as shown in [106].

The self-organizing map (SOM), another major component in neural networks, has also been used in autonomous navigation, but only a simulation study [49] has been reported.

1.2.6 Dickmanns' 4D approach

Vision-based navigation is essentially a control problem with visual sensory feedback. The ideal method should be a perfect coupling of optimal control and machine vision techniques. However, not much work has been done to seamlessly couple optimal control with computer vision for navigation. With great effort along this line, Dickmanns' group [26, 31, 32] used Kalman filtering for feedback control with detected visual features as observations. The internal representation is a set of spatio-temporal models of objects. Since the objects are three-dimensional and temporal information is used, this approach is termed a 4D approach.

The 4D object databases incorporate generic models of objects as background knowledge and parameterized representations with specified ranges for each subclass. The visual appearance is coded as a parameterized distribution around the object center and the aspect conditions. This approach tracks a set of model-driven features and measures the error between predicted and measured features, which is used as feedback to control the vehicle. The 4D approach employs an extended Kalman filter to estimate internal states. The nonlinear system model is linearized along nominal trajectories. As the computing engines become more and more powerful, the state update cycle becomes shorter and shorter and the linearized model better characterizes the true nonlinear system. Dickmanns' group has been exploring this approach since 1986. Early demonstrations showed its potential. With an inexpensive inertial system added and more powerful computing engines with Pentium Pros, remarkable performance was demonstrated in 1997 [32].

The success of the 4D approach lies in its efficient use of spatio-temporal information from visual inputs and its seamless coupling with optimal control. The model-driven feature tracking achieves efficiency in real-time navigation by taking advantage of the structure of well-marked roads. On the other hand, this also limits the system's applicability.
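The predict/update cycle at the heart of this coupling can be illustrated with a deliberately simplified linear model. The state variables, matrices, and noise levels below are hypothetical, and the 4D approach itself uses an extended Kalman filter with full spatio-temporal object models rather than this two-state toy.

```python
import numpy as np

# Hypothetical 2-state model for illustration only: lateral offset y (m) and
# heading error psi (rad) relative to the lane, observed through the image
# position of a tracked lane-edge feature.
dt, v = 0.1, 10.0                       # control cycle (s), vehicle speed (m/s)
F = np.array([[1.0, v * dt],            # y   <- y + v*dt*psi
              [0.0, 1.0]])              # psi <- psi
H = np.array([[1.0, 0.0]])              # the feature measures lateral offset only
Q = np.diag([0.01, 0.001])              # process noise
R = np.array([[0.05]])                  # measurement noise

def kalman_step(x, P, z):
    """One predict/update cycle given measurement z from feature tracking."""
    # Predict the state forward one control cycle.
    x = F @ x
    P = F @ P @ F.T + Q
    # Correct with the innovation between predicted and measured feature.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P
```

The corrected state estimate is what the controller acts on, and the predicted feature position is what drives the next round of model-based feature search.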
1.3 Some Major Difficulties in Autonomous Navigation

As discussed above, significant advances have been made toward autonomous navigation. However, we are still facing some major difficulties.

The first major difficulty lies in dealing with different road types. Existing approaches to autonomous navigation typically impose restrictions on the environment, such as good pavement conditions, clear lane marks, absence of shadows, continuous road edges, etc. Unfortunately, although such requirements may sometimes be met, this is not always the case. Thus, it is desirable to seek a method that is more general, applicable to a wide range of scenes that may be encountered in real-world driving conditions.

The second major difficulty lies in illumination. Lighting conditions pose a challenge for current vision-based navigation. This is true for both indoor and outdoor navigation, especially outdoor. The road edges can be invisible due to poor lighting, and false edges caused by shadows and other objects can also fool the system. For outdoor navigation, the variation in illumination from night time to day time, or from cloudy days to sunny days, requires a dynamic range that is not achievable with current visual sensors.

The third major difficulty lies in the lack of efficiency. The above generality requirement means that the system must learn a huge number of scenes, e.g., a few hundred to a few million. Almost all existing autonomous navigation systems have significant limitations on the type of applicable scenes, with one exception: ALVINN (and its extensions), which virtually does not impose any restriction on the scene type and thus is capable of learning any scene. However, it still has several problems. A major one is a lack of capability to handle a large number of scene types. Other major ones include the local minima problem caused by the use of artificial neural networks, and the slow learning problem, which provides little possibility for on-line learning.

1.4 Adaptive and Learning Mechanisms: Direct Input-to-Output Mapping

The amount of road condition variation is a crucial factor in determining the applicability of a system. To handle a variety of environmental conditions, most experimental navigation systems have employed certain adaptation mechanisms to various degrees. For navigation on paved driveways, when lane marks are well illuminated and easily traceable, a simple technique, such as adaptive color thresholding [72], can be used. In situations with unstructured roads, unmarked lanes and sudden changes in illumination (e.g., when the sun is occluded by clouds, or shadows are cast from nearby buildings), the design of a reliable autonomous navigation system becomes a great challenge. As mentioned in the earlier section, the Navlab group developed various adaptation mechanisms for various outdoor driving environments.

Figure 1.1: Differences between the traditional edge-based or region-based approach and direct input-to-output mapping. (a) Traditional approach; (b) direct input-to-output mapping, where the function f directly maps the input image into the output steering signal.

ALVINN [94, 120] and ROBIN [105, 106] directly map preprocessed input images into output steering signals for autonomous navigation. Direct input-to-output mapping is very different from traditional edge-based or region-based approaches. A schematic difference between direct input-to-output mapping and the traditional edge-based or region-based method is shown in Fig. 1.1. In an edge-based or region-based approach, edges or regions are first extracted and later used in subsequent image processing steps to obtain the information required for vehicle steering. The extracted edges or regions may not be sufficient for making decisions in autonomous navigation.
The direct input-to-output mapping uses the whole preprocessed image as input, which carries richer information than an edge map or region segmentation. It also bypasses the intermediate steps of processing edge maps or segmented regions into higher-level information for navigation. The whole input-to-output mapping is treated as a global optimization process. Therefore, if properly done, direct input-to-output mapping is expected to achieve more reliability and efficiency.

1.5 Differences between SHOSLIF-N and Two Neural Network Approaches

The work presented here is the navigational part of the Self-organizing Hierarchical Optimal Subspace Learning and Inference Framework (SHOSLIF) [133]. The navigation system SHOSLIF-N described here was motivated by the following considerations.

• Automatic feature derivation, which minimizes the imposition of restrictions on the type of scene the system will handle. The framework automatically derives the most expressive features (MEF) and the most discriminating features (MDF), which respectively maximize representation and classification powers from the information available in the input data. Therefore, the method is potentially applicable to a wide variety of conditions.

• Nonparametric recursive partitioning regression. Nonparametric recursive partitioning regression [37], which is more flexible than the global parametric regression used in either ALVINN or ROBIN, is used in direct input-to-output mapping. Nonparametric recursive partitioning regression is realized using a recursive partition tree (RPT).

• Self-organizing mechanism. The approach uses a self-organizing mechanism for systemwise learning, which is responsible for mapping input images to vehicle control signals.

• Low computational complexity. The recursive partition tree has a logarithmic retrieval complexity and can be used to address the complexity issue in learning a large number of scenes. Due to this low complexity, the experimental system Rome is capable of real-time navigation while computation is performed by an on-board SUN SPARC-1 without using any special image processing hardware other than a frame grabber.

A schematic difference between our SHOSLIF-N and current neural network approaches is shown in Fig. 1.2. Current neural network approaches use an explicit function f to directly map a visual input I into an output signal O. Here f is represented by a linear or nonlinear function of a linear or nonlinear combination of first- and second-layer neurons. The adjustable parameters are the weights of the first and second layers in a feedforward network, and the widths and centers of the first layer and the weights of the second layer for an RBF network. In SHOSLIF-N, a novel input I is used to retrieve its k nearest neighbors among the learned samples. Then the associated outputs of the k samples are interpolated to obtain the actual output.

Figure 1.2: Schematic difference between current neural network approaches for vision-based navigation and SHOSLIF-N. (a) Direct input-to-output mapping using neural networks. (b) SHOSLIF-N.

Notice that there is a task decomposition here: the visual input is used in content-based retrieval, while the associated outputs are weighted according to their distances to the current input in the interpolation process to get the desired output.
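A minimal sketch of this retrieve-then-interpolate step is given below. The brute-force nearest-neighbor search stands in for the RPT retrieval described in Chapter 2, and inverse-distance weighting is used as one plausible interpolation scheme; it illustrates the idea rather than reproducing the exact SHOSLIF-N formula.

```python
import numpy as np

def knn_steering(query, samples, outputs, k=4, eps=1e-6):
    """Distance-weighted interpolation of the outputs of the k nearest
    learned samples.  `samples` is an (n, d) array of flattened learning
    images, `outputs` an (n,) array of their steering signals, and `query`
    a flattened input image of length d."""
    dists = np.linalg.norm(samples - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)       # closer samples count more
    return float(np.sum(weights * outputs[nearest]) / np.sum(weights))
```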
Direct input-to-output mapping in Fig. 1.2(a) can be regarded as a global approximation trying to minimize the global sum of squared errors between target outputs and simulated network outputs. SHOSLIF-N can be regarded as a piecewise approximation of the input-to-output mapping based on learning samples. Better and better approximations can be achieved by adding more and more learning samples. On the other hand, more space is required for storing these samples.

SHOSLIF-N uses local adaptive nonparametric regression, which is more flexible than global parametric regression. Recursive partitioning provides a good tradeoff between speed and performance. Projection pursuit regression [52] is another well-known adaptive algorithm for function approximation, but with very high computational complexity. Friedman [37] provides an excellent discussion on this subject.

1.6 Our Mobile Robot ROME

Our mobile robot, ROME (Robotic Mobile Experiment), was built by former students Steve Walsh, Jon Courtney and Hans Dulimarta. The major hardware components consist of:

1. A Sun SPARCstation-1 enhanced with 10 serial ports.
2. A 386SX laptop as a terminal.
3. A TRC LABMATE wheeled locomotion base, which can be controlled by sending commands via an RS-232 serial port.
4. 24 ultrasonic sensors and 8 infrared sensors, placed in a circular pattern, which can be controlled via an RS-232 serial port through the TRC Proximity Subsystem. The maximum detection range of the ultrasonic sensors is 10 m. Each infrared sensor returns a binary number which indicates whether it detects an object within a range of 76 cm.
5. Two Panasonic GP-KR202 CCD color cameras with 6 mm and 3.6 mm lenses.
6. Two Sun VideoPix image grabbers.
7. A pan/tilt carousel for mounting the cameras.

Walsh [126] developed a vision-based navigation scheme using symbolic landmark maps. Courtney [17] applied classification techniques for localization of a mobile robot. Dulimarta [33] used a client-server control architecture for robot navigation.

Figure 1.3: A picture of Rome outside our PRIP lab. A single CCD camera is mounted on top of the robot with a looking-down tilt angle.

The above setup of Rome served as the testbed for all experiments presented in this thesis. A picture of Rome outside our PRIP lab is shown in Fig. 1.3. Currently Rome is being upgraded to a more powerful version. The major upgrades include:

1. The host computer will be replaced by a Pentium Pro 200 MHz computer.
2. A Matrox framegrabber.
3. More flexible pan/tilt units from Directed Perception Inc.
4. 3.6 mm auto-iris lenses.

1.7 Thesis Overview

In this thesis research, the author focuses on developing machine learning techniques and applying them to vision-based navigation by mobile robots. In Chapter 2 the application of SHOSLIF [133] to vision-based navigation is described. The vision-based navigation problem is treated as a content-based retrieval problem where each newly grabbed image is used to query a recursive partition tree (RPT), which is trained by sample drives of the robot under joystick control. The retrieved output control signal associated with the best-matched image is used as the next steering signal. In Chapter 3, the above learning scheme is performed incrementally for ease of training Rome to obtain satisfactory performance. The complexity of incremental learning is low, and it can be done on-line in real time. Several experiments were conducted to verify the performance of the proposed algorithms.
Table 1.1: A comparison of the thesis work with state-of-the-art vision-based navigation algorithms.

| System | Developer | Brief description | Real-time performance | On-line incremental learning | Uses more than 1 frame of information |
|---|---|---|---|---|---|
| ALVINN | Pomerleau (CMU) | Feedforward multilayer perceptron (MLP) with back-propagation learning. | X | X | |
| ELVIS | Hancock & Thorpe (CMU) | Eigensubspace method. The concatenation of an input image and its associated output signal is treated as a single long 1D vector. | X | | |
| ROBIN | Rosenblum & Davis (Maryland) | Radial basis function neural network (RBFNN). | X | | |
| SHOSLIF-N | Weng & Chen (MSU) | Content-based retrieval with a recursive partition tree (RPT). An input image is represented as a long 1D vector. | X | X | |
| State-based SHOSLIF-N | Weng & Chen (MSU) | SHOSLIF-N enhanced with state information and simple visual attention. | X | X | X |

To deal with more scene types and improve the system's performance, state information and a simple visual attention mechanism are incorporated into SHOSLIF as state-based SHOSLIF. The state-based SHOSLIF is described in Chapter 4. A comparison of the proposed learning mechanism for vision-based navigation with some representative state-of-the-art mobile robot vision systems that directly map an input image into an output control signal is shown in Table 1.1. Although so far the algorithms described in this dissertation have been widely tested only in indoor navigation, they could possibly be applied to outdoor navigation as well. Finally, Chapter 5 gives a summary of the work done in this dissertation, and future research directions are discussed.

Chapter 2

SHOSLIF for Autonomous Navigation

In this chapter, the earliest version of SHOSLIF-N, batch-mode SHOSLIF-N, for vision-based navigation is described.

2.1 Introduction

A central part of an autonomous vision-guided navigation system is the navigator. It accepts an input image X as well as an intention signal P which indicates which path to take. The navigator outputs the control signal vector C which controls the vehicle. Fig. 2.1 gives a schematic illustration of such a view. Thus, the navigator can be denoted by a function f that maps every element in the space of (X, P) to an element in the space of C. A challenging task of autonomous navigation is to construct an automatic system which approximates (i.e., learns) the complicated function f(X, P) and computes quickly in real time given X and P.

Figure 2.1: Navigator as a mapping from the image and path selection space to the control signal space, C = f(X, P).

2.1.1 Existing systems

The function f is typically extremely complex, because of the high dimension of the image space of X. One common way to address this problem is to design some rules based on a particular type of image collection, and then a system is developed by implementing these rules. For example, some outdoor road followers use human-designed rules to follow road edges (e.g., Wallace et al. [125], Dickmanns and Zapp [25], Thorpe et al. [118], Hebert [48]). Indoor corridor followers mainly rely on floor or ceiling edges (e.g., Meng and Kak [81], Lebesgue and Argarwal [74]). In the past, autonomous navigation systems that used only intensity images relied very much on a type of predefined feature, such as a road edge, lane mark, floor edge, etc.

The amount of road condition variation is a crucial factor in determining the applicability of a system.
To handle a variety of environmental conditions, several experimental navigation systems have employed certain adaptation mechanisms to various degrees. For navigation on paved driveways, when lane marks are well illuminated and easily traceable, a simple technique, such as adaptive color thresholding [72], can be used. In situations with unstructured roads, unmarked lanes and sudden changes in illumination (e.g., when the sun is occluded by clouds, or shadows are cast from nearby buildings), the design of a reliable autonomous navigation system becomes a great challenge. Investigations around the Navlab developed different adaptation mechanisms for different outdoor driving environments. SCARF [20] [120] was designed to handle various roads with adaptive color classification. YARF [64] [120] deals with different structured roads by explicitly modeling the different available constraints and features, since a single technique for road detection may fail on different structured roads.

Many existing systems predefine the type of features that the system will use. For greater adaptability, some systems do not specify which specific features to use. With ALVINN [94] [120], a feedforward artificial neural network (ANN), trained by a back-propagation learning algorithm, maps input images to output steering signals for autonomous navigation. ALVINN does not explicitly impose the type of geometric image features upon which the system is based. ALVINN has been successfully tested in a variety of road situations. An alternative neural network approach is ROBIN [105], [106], developed at the Univ. of Maryland and later extended at Lockheed Martin Corporation. ROBIN used a radial basis function (RBF) neural network to map a low-resolution input image into an output steering signal. Because of its built-in regularization mechanism, ROBIN achieved smoother behavior than ALVINN did, as reported in [106].

2.1.2 Motivations

The work presented here is the navigational part of the Self-organizing Hierarchical Optimal Subspace Learning and Inference Framework (SHOSLIF) [133]. The navigation system SHOSLIF-N described here was motivated by the following considerations.

• Automatic feature derivation, which minimizes imposing restrictions on the type of scene the system will handle. The framework automatically derives the most expressive features (MEF) and the most discriminating features (MDF), which maximize representation and classification powers, respectively, from the information available in the input data. Therefore, the method is potentially applicable to a wide variety of conditions.

• Self-organizing mechanism. The approach uses a self-organizing mechanism for systemwise learning, which is responsible for mapping input images to vehicle control signals.

• Low computational complexity. To address the complexity issue in learning a large number of scenes, a recursive space-partition tree is introduced, which has a logarithmic retrieval complexity. Due to this low complexity, the experimental system Rome is capable of real-time path following while computation is performed by an on-board SUN SPARC-1 without using any special image processing hardware other than a frame grabber.

Although feedforward networks and radial basis function networks can also deal with high-dimensional mappings, our experimental results indicated that SHOSLIF-N performs better for the tasks tested. SHOSLIF-N does not need to start with a random initial guess, which can be difficult to lead to the correct solution [95].
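To make the automatic feature derivation above concrete, the sketch below derives MEFs as the leading principal components of the flattened training images and MDFs as Fisher discriminant directions computed inside the MEF subspace. It is a simplified stand-in under those assumptions; the exact DKL projection used by SHOSLIF is defined in Section 2.3.

```python
import numpy as np

def mef_basis(X, m):
    """Most expressive features: top-m principal components of the samples.
    X is (n, d), with each row a flattened training image."""
    Xc = X - X.mean(axis=0)
    # The right singular vectors of the centered data are the eigenvectors
    # of the sample scatter matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:m]                               # (m, d); rows are MEF vectors

def mdf_basis(Y, labels, m):
    """Most discriminating features: Fisher directions computed on the MEF
    coefficients Y (n, p), where the within-class scatter is nonsingular."""
    mean = Y.mean(axis=0)
    Sw = np.zeros((Y.shape[1], Y.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in np.unique(labels):
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        Sw += (Yc - mc).T @ (Yc - mc)
        Sb += len(Yc) * np.outer(mc - mean, mc - mean)
    # Generalized eigenproblem Sb v = lambda Sw v, solved via inv(Sw) @ Sb.
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:m]].T           # (m, p) MDF directions
```

Projecting a new image onto the MEF basis and then onto the MDF directions yields the low-dimensional feature vector on which the tree-based retrieval operates.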
It does not require human interactive selection of population centers as with radial basis networks [106].

This chapter is organized as follows. Section 2.2 introduces basic concepts and related results in learning with a nearest neighbor (NN) approximator. Section 2.3 discusses statistical tools for automatically deriving linear features from learning samples. Section 2.4 explains the self-organizing method which automatically constructs a tree. The complexity of SHOSLIF-N is analyzed in Section 2.5. Some experimental results are reported in Section 2.6, and a brief comparison with current neural network approaches is presented in Section 2.7. Finally, Section 2.8 provides some concluding remarks.

2.2 Learning as Efficient Function Approximation

2.2.1 Image space

A digital image with r pixel rows and c pixel columns can be denoted by a vector in an (rc)-dimensional space. For example, the set of image pixels {I(i, j) | 0 <= i < r, 0 <= j < c} can be written as a vector X = (x_1, x_2, ..., x_d)^t where x_{ir+j+1} = I(i, j) and d = rc. The actual mapping from the 2-D position of every pixel to a component in the d-dimensional vector X is not essential but is fixed once it is selected. Since the pixels of all practical images can only take values in a finite range, we define X as a point in a domain S of a finite size, where

    S = { X | X = (x_1, x_2, ..., x_d)^t, |x_i| <= M/2, i = 1, 2, ..., d }    (2.1)

which can be called a d-dimensional hypercube. The second-order statistics between pixels are represented by the corresponding covariance matrix Σ_x of the random vector X, as we will see later. For example, if two pixels of the ensemble at locations (i, j) and (i', j') are highly correlated in the original image, their corresponding cross-covariance, stored in the element at (ir + j, i'r + j') of the matrix Σ_x, typically has a large absolute value. Therefore, treating a two-dimensional image as a one-dimensional vector X is for notational convenience, and the representation itself does not necessarily lose any two-dimensional information.

2.2.2 Functions in the image space

The navigator shown in Fig. 2.1 is a function f defined in a huge space of (X, P). For notational simplicity, we drop the input P in notation and simply denote f(X, P) as f(X), regarding P as a part of X. In fact, many sensor-based understanding and control problems can be modeled as a function C = f(X), where X is the input vector and the output vector C can be real-valued. In a control problem, C is a real-valued vector. In an autonomous navigation problem, the vector can be, e.g., C = (c_1, c_2, c_3), where c_1 is the heading direction, c_2 the speed, and c_3 the next step size. In classification problems, C can be a scalar y in {1, 2, 3, ..., k} indicating a class label of the input X. However, in practice, such a classification output is often not sufficient without a confidence measure. More generally, the nominal output for n classes can be n-dimensional, with the i-th component indicating the confidence (or some probability measure) for the input X to arise from the i-th class. Therefore, the categorical output can be considered as a special case of real-valued output. From this discussion, we can see that many problems in sensor-based classification and control can be represented by a partial function f that maps elements in a domain S to an element in a codomain C:

    f: S --> C    (2.2)

where S is defined in (2.1) and C is an n-dimensional space C = R^n.
We regard f as a partial function because it is not necessarily defined on the entire set S. For example, S may denote all the possible images, but the images we see in our life are only a subset of S.

2.2.3 Learning as an approximation process

Our objective of constructing a navigator is equivalent to approximating the function f: S --> C by another function f^: S --> C. The error of approximation can be indicated by a certain measure of the error f - f^. One such measure is the mean square error

    E||f - f^||^2 = \int_S ||f(X) - f^(X)||^2 dF(X)

where F(X) is the probability distribution function of X in S. In other words, f^ can differ a lot from f in parts where X never occurs, without affecting the error measure. Another measure is the pointwise error ||f(X) - f^(X)|| for any point X in S', where S' is a subset of S that is of interest to a certain problem. In navigation, S' may consist of all the inputs that may arise.

Of course, the function f is typically high-dimensional and highly complex. For navigation, it maps from a high-dimensional input vector X, which represents a meaningful image and the current intention, to the correct control parameter vector. A powerful method of constructing f^ is learning. Specifically, a series of cases is acquired as the learning data set:

    L = {(X_i, f(X_i)) | i = 1, 2, ..., n}.    (2.3)

Then, f^ is constructed based on L. For notational convenience, the set of sample points in L is denoted by X(L):

    X(L) = {X_i | i = 1, 2, ..., n}.

X(L) should be drawn from the real situation so that the underlying distribution of X(L) is as close to the real distribution as possible.

2.2.4 Case-based learning for function approximation

Superficially, the function approximation problem we are dealing with here looks very much like a typical regression problem in statistics or pattern recognition, i.e., predicting y from a given feature vector X. They are in fact very different due to the underlying assumptions. In statistics, features are preselected so that a simple relationship between y and X exists. Therefore, the linear regression tree and the k-d tree (also called a tree classifier) are popular in statistics [5], [45] and pattern recognition [16], [38]. However, those methods do not work well here because of the very complex nature of the function f: it is highly nonlinear, high-dimensional in domain, and the number of samples can be even smaller than the dimension of the domain. For example, such a function cannot be reasonably approximated by any parameter set, let alone the linear regression parameter vector used in linear regression. The k-d tree does not work either, because no threshold on any single component of X can give a meaningful intermediate classification. The challenging problem of learning a highly nonlinear and high-dimensional function from real-world cases is fundamental to intelligence but has not received much attention in the past. In the following, we introduce some concepts for this task and establish related results.

A learning set L as in (2.3) consists of a sequence of cases. We need to investigate its relation with the space S.

Definition 1: A partition P = {P_1, P_2, ..., P_n} of S (i.e., P_i and P_j are disjoint when i is not equal to j, and the union of P_1, ..., P_n is S) is consistent with a learning set L if X_i belongs to P_i, i = 1, 2, ..., n.

A partition P consistent with L can be considered as a result of partitioning S based on the learning set L. Given an L, infinitely many consistent partitions are possible.
The Dirichlet tessellation, also well known as the Voronoi diagram [54], is a special partition consistent with L in which each P_i consists of those points of S that are closer to X_i than to any other X_j in X(L) with i not equal to j.

Definition 2: Given a learning set L, a nearest-neighbor (NN) approximator f^ of f associated with L is defined as follows. For any X in S,

    f^(X) = f(X_i),

where X_i is the nearest neighbor of X in X(L). That is, X_i belongs to X(L) and ||X - X_i|| < ||X - X_j|| for any X_j in X(L) with j not equal to i.

An NN approximator f^ of f is a piecewise constant function, constant in every P_i of the Dirichlet tessellation P = {P_1, P_2, ..., P_n} which is consistent with L. The NN approximator is not unique, given an L, because the value at the boundary of the Dirichlet tessellation is arbitrary. In practice, its value along the boundary can take the value of either side of the boundary, whichever is convenient.

Definition 3: Given a learning set L, a k-nearest-neighbor distance-based (KNDB) approximator f^ of f associated with L is defined as follows. For any X in S,

    f^(X) = (1 / \sum_{i=1}^{k} w_i) \sum_{i=1}^{k} w_i f(X_{n_i})

where X_{n_i} in X(L) is the i-th nearest neighbor of X and w_i = w(X, X_{n_1}, ..., X_{n_k}) is the value of a scalar weighting function.

The KNDB approximator uses more neighbors than the NN version and thus is more immune to noise in the learning set L. When the top s <= k nearest neighbors have the same distance to X, the value of f^(X) is roughly the average of the values at these top s neighbors. As an example, the weighting function can take the form

    w_i = a^{ -||X - X_{n_i}|| / (c + ||X - X_{n_1}||) }

where c is a small positive number to prevent the denominator from becoming zero. The value of a determines how fast the weight decreases for the other runners-up. A point X_{n_i} at twice the distance of the nearest neighbor X_{n_1} will have its weight decreased by a factor of 1/a from that of the nearest neighbor. For example, a = 10 is a reasonable choice for a. When a approaches infinity, the KNDB approximator degenerates into the NN approximator. Other interpolation schemes that may be used here include generalized multiquadrics [113], [133] or radial basis functions [97], [98], [47].
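As a concrete illustration of the KNDB approximator and the weighting function above, the following sketch (written in Python with NumPy purely for exposition; it is not part of the original system, and the function and variable names are hypothetical) interpolates the outputs of the k nearest learning samples:

    import numpy as np

    def kndb_estimate(X, samples, outputs, k=4, a=10.0, c=1e-6):
        # samples: (n, d) array of learned inputs X_i
        # outputs: (n, m) array of the associated outputs f(X_i)
        dists = np.linalg.norm(samples - X, axis=1)   # distances to all learned samples
        near = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
        d = dists[near]
        # w_i = a^(-||X - X_ni|| / (c + ||X - X_n1||)); the weight drops by a factor
        # of 1/a when the distance doubles relative to the nearest neighbor
        w = a ** (-d / (c + d[0]))
        return (w[:, None] * outputs[near]).sum(axis=0) / w.sum()

With a very large a, the estimate approaches the plain NN approximator, as noted above.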
2.2.5 Hierarchical partition and the RPT

An immediate issue that arises here is that a large number of samples is required in the learning phase to approximate a high-dimensional function f. In the performing phase, given an input X, the NN approximator needs to find the nearest neighbor among a large number n of recorded items. A linear search in this database requires O(n) time, which is not acceptable for real-time navigation with a large n, e.g., n = 1,000,000,000. The following hierarchical space partition scheme leads to an O(log(n)) retrieval complexity.

Definition 4: A hierarchical partition P of S is a set of partitions P = {P_0, P_1, ..., P_m}, where every P_i, i = 0, 1, ..., m, is a partition of S and P_{i+1} is a finer partition of P_i. Suppose P_i = {P_{i,j+1}, P_{i,j+2}, ..., P_{i,j+l}}. For each P_{i,j} in P_i, P_{i+1} contains either P_{i,j} itself or P_{i+1,k+1}, P_{i+1,k+2}, ..., P_{i+1,k+b} so that

    P_{i,j} = \bigcup_{t=1}^{b} P_{i+1,k+t}.

A recursive partition tree (RPT) is a tree which represents a hierarchical partition P. P_0 = S at level 0 is the root, representing the entire space S. A node P_{i,j} at level i has b children if P_{i,j} is further partitioned at level i + 1 by b cells P_{i+1,k+1}, P_{i+1,k+2}, ..., P_{i+1,k+b}. A node P_{i,j} is a leaf node if it is not further partitioned by more cells in P_{i+1}. If P_{i,j} appears at several levels, it is represented by a leaf at the lowest level at which it appears. A hierarchical Dirichlet tessellation (or hierarchical Voronoi diagram) is a hierarchical partition that satisfies an additional condition: every cell that is further partitioned at a deeper level is partitioned only by a Dirichlet tessellation. As shown in Fig. 2.2, although the further partition of each cell is a Dirichlet partition, the entire partition P_m, which corresponds to all the leaf nodes in the RPT, is not necessarily a Dirichlet partition.

Figure 2.2: A 2-D illustration of a hierarchical Dirichlet partition and the corresponding recursive partition tree (RPT). (a) The partition, where the label indicates the center of a cell. The label of the child to which its parent's center belongs is not shown due to the lack of space. (b) The recursive partition tree.

The entire space S is typically huge. The hierarchical partition covers the space using cells of different sizes. In areas where samples are very sparse, only large cells are created to cover them. In areas where samples are very dense, many small cells are created to cover them. As we will see, the size of the cells is directly related to the accuracy of approximation. In other words, regions in which f is not defined or of low probability are approximated very roughly by a few coarse cells, but regions of high density are approximated in detail by many small cells.

2.2.6 Approximation using the RPT

We defer our discussion about the construction of the RPT to Section 2.4. In this section, we study how to use the RPT for function approximation. The use of the hierarchical partition structure drastically reduces the complexity of finding the cell in P_m to which a given X belongs.

RPT Search Algorithm: Given X, find a leaf node c so that X belongs to c.

  node = root;
  loop {
    for each child c of node {
      if (X belongs to c) {
        if (c is not a leaf) { node = c; break; }
        else return(c);
      }
    }
  }

Suppose the tree is a complete tree in which every internal node has b children. A tree of level l + 1 has n = b^l leaf nodes. At each level, the algorithm needs to visit b children. The number of nodes visited by the above search algorithm is bl = b log_b(n). With a complete tree, the search complexity is therefore logarithmic, O(log_b(n)), in the number of leaf nodes. We will analyze the complexity in detail in Section 2.5.
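To make the descent concrete, the following small sketch (Python/NumPy, illustrative only and not the original implementation; the node representation is an assumption) searches a tree whose children are described by their cell centers, so that membership of X in a child cell is decided by the nearest center, as in a Dirichlet tessellation:

    import numpy as np

    def rpt_search(root, X):
        # Each node is assumed to be a dict {'center': vector, 'children': [...]};
        # a leaf has an empty child list.
        node = root
        while node['children']:                       # one level per iteration, O(log n) levels
            dists = [np.linalg.norm(X - child['center']) for child in node['children']]
            node = node['children'][int(np.argmin(dists))]   # descend into the nearest cell
        return node

Since only b children are examined at each of the O(log_b n) levels, the number of visited nodes matches the b log_b(n) count given above.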
Although a hierarchical partition allows a very fast retrieval algorithm, the final partition P_m in a hierarchical partition P is not a Dirichlet partition in general, as indicated in the example shown in Fig. 2.2. This is because the boundary of a coarse cell at level l is defined according to the cells at level l. Such a coarse-level boundary persists in the finest partition P_m and is typically inconsistent with the Dirichlet tessellation defined on all the given samples. Therefore, the NN approximator cannot be directly defined on a hierarchical partition. Our way of dealing with this problem is to explore k competitive paths down the RPT and use the technique of the KNDB approximator at the k leaf nodes that are reached.

Definition 5: Given an RPT corresponding to a hierarchical partition P = {P_1, P_2, ..., P_m} and an integer constant k > 0, the k competitive leaves of an input X are defined recursively as follows:

1. At level 1, the k children of the root whose centers are the k nearest neighbors of X, among all nodes at level 1, are the k competitors at level 1.

2. At level l, let C contain all the children of the k competitors at level l - 1 plus those competitors at level l - 1 which are already leaves at a level <= l - 1. The k elements in C whose centers are the k nearest neighbors of X, among all elements in C, are the k competitors at level l.

The k competitive leaves of X are the k competitors at the last level m.

Note that the k competitive leaves of an RPT are not necessarily the k nearest neighbors of X, although they could be. In particular, the nearest neighbor of X is included in the k competitive leaves only if every node along the path from the root to the nearest-neighbor leaf is included in the k competitors at the corresponding level. The larger k is, the more likely that occurs.

Definition 6: A k-competitor distance-based (KCDB) approximator f^ of f associated with the given learning set L and the associated RPT is the same as the KNDB approximator, except that the k nearest neighbors are replaced by the k competitive leaves obtained from the RPT.

As long as the nearest neighbor is included in the k competitors, the value of the interpolation from the k competitive leaves is typically very close to that of the KNDB approximator, because it is the nearest neighbor of X that plays the major role in the interpolation. If the nearest neighbor is missing from the k competitive leaves, the result of the interpolation should still be reasonably good if the k competitive leaves are not all too far from X. The resilience of the KCDB approximator is due to the use of interpolation instead of a single nearest neighbor from the k competitive leaves.

The next important issue is how to derive features which determine the shape and size of the cells in the partition so that each cell covers a large region in which the function f is nearly constant or does not change much. This is very important since it not only reduces the number of leaves in the RPT for efficiency, but also allows the RPT to generalize effectively from a relatively small number of learning samples.

2.3 Automatic Feature Derivation

If we cannot restrict the types of scenes that a vehicle can come across, it is improper to predefine the features that the system can use. For example, if we define long straight edges to use for indoor navigation, the system will fail in curved hallways or places where no long straight edges are available. If we define smooth curved edges to use for outdoor navigation, the system will fail when the road condition is so poor that no clearly definable edges are present. Even when road edges are clear, an edge tracker may fail in the presence of passers-by who occlude road edges. Therefore, we must let the system derive features automatically by itself, depending on the type of environment in which the system learns and operates.

2.3.1 Meaningful images in the image space

We know that the dimension of an image space is typically very large. For a moderate 128 x 128-pixel image, the dimension is d = rc = 128 x 128 = 16,384! Suppose that the absolute intensity value of every image pixel is bounded by a number M, and the space of all the possible images is given by S in (2.1). All the images a robot can see in its operational environment distribute within a very small subset S' of this huge space S, although the region shape of S' can be very complex.
In other words, the function f^ needs only to approximate f within a much smaller space S'. An objective of learning is to find a representation that represents the subspace S' well enough for the approximation of f in it. The actual value of f^ in S - S' is not of major concern.

The Karhunen-Loeve projection [24] is a very efficient way to represent a small subspace in a high-dimensional space. It reduces the dimension of representation from d in S to a much lower dimension for S' while still keeping most of the information in the data. However, the Karhunen-Loeve projection is not effective for discriminating different scenes, for which we will use discriminant analysis to derive the best features. For a general discussion of dimension reduction using projection pursuit, the reader is referred to an excellent survey by Huber [52].

2.3.2 The most expressive features

As we discussed above, images of objects in a category can be regarded as samples represented by a d-dimensional random vector X in S, which can be represented by a linear combination of d orthonormal basis vectors v_1, v_2, ..., v_d, so that

    X = \sum_{i=1}^{d} y_i v_i = VY

where V is an orthogonal d x d matrix consisting of the orthonormal column vectors v_i, i = 1, 2, ..., d. Without loss of generality, we can assume that the mean of the random vector X is a zero vector, since we can always redefine X - EX as the feature vector.

To represent vectors in S', we can predict that it might be enough to use a relatively small number of expansion terms to characterize the object scene to some degree of precision. Suppose we use m basis vectors, each of which is a column vector of V, and the covariance matrix of X is Σ_x = E[(X - EX)(X - EX)^t]. The approximate representation of X is X^(m) = \sum_{i=1}^{m} y_i v_i. It has been proved [75] that the optimal v_1, v_2, ..., v_m that minimize the mean square error

    e^2(m) = E||X - X^(m)||^2 = E||X - \sum_{i=1}^{m} y_i v_i||^2

are the m unit eigenvectors of the covariance matrix Σ_x associated with the m largest eigenvalues. After the computation of the v_i's, the m components y_i of the projected vector of X can be computed by

    y_i = v_i^t (X - EX),  i = 1, 2, ..., m    (2.4)

or Y = V^t X, where V = [v_1 v_2 ... v_m]. This is known as the Karhunen-Loeve projection [75].

Since the m features, the v_i's in (2.4), give the minimum mean-square error, we can call them the most expressive features (MEF) in that they best describe the sample population in the sense of a linear transformation. We can choose m so that the ratio of the mean square error over the total variance satisfies

    (\sum_{i=m+1}^{d} lambda_i) / (\sum_{i=1}^{d} lambda_i) <= r    (2.5)

(e.g., r = 5%), where lambda_i is the i-th largest eigenvalue of Σ_x.

In practice, we are given a set of n learning images X_1, X_2, ..., X_n, and Σ_x is approximated by the scatter matrix

    S = \sum_{i=1}^{n} (X_i - m)(X_i - m)^t = U U^t    (2.6)

where U = [U_1 U_2 ... U_n] with U_i = X_i - m, and m = (1/n) \sum_{i=1}^{n} X_i. If the number of learning images n is smaller than the dimension d, instead of computing the eigenvalues and eigenvectors of the very large d x d matrix U U^t directly, we can find the eigenvectors and eigenvalues of the smaller n x n matrix U^t U, which in fact has the same non-zero eigenvalues as S = U U^t. If w_i is an eigenvector of U^t U associated with the eigenvalue lambda_i, then v_i = U w_i is the eigenvector of S = U U^t associated with the same eigenvalue. The Karhunen-Loeve projection reduces the image dimension from d to m, which is typically a lot smaller than d.
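The eigenvector computation described above can be sketched as follows (Python/NumPy, for illustration only; the routine is a sketch of the technique rather than the code actually used on the robot):

    import numpy as np

    def mef_basis(images, m):
        # images: (n, d) array whose rows are the learning images, with n possibly
        # much smaller than d.  Returns the sample mean and the first m MEF vectors.
        mean = images.mean(axis=0)
        U = (images - mean).T                    # d x n matrix with columns U_i = X_i - m
        lam, W = np.linalg.eigh(U.T @ U)         # eigen-decomposition of the small n x n matrix
        order = np.argsort(lam)[::-1][:m]        # indices of the m largest eigenvalues
        V = U @ W[:, order]                      # v_i = U w_i are eigenvectors of U U^t
        V /= np.linalg.norm(V, axis=0)           # normalize to unit length
        return mean, V

    # MEF values of the learning images: Y = (images - mean) @ V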
The regions within which the sample points distribute has a much lower dimension than the 39 r 1‘1 Figure 2.3: A 2-D illustration of the most expressive features (MEF). The MEFS are v1 and v2 and the MEF values are the projected values onto y] and y2 axes. original space. The first two eigenvectors v1 and v2 tell the most significant directions of sample variation. The MEF S are useful to reduce the dimension Of representation. They are the best linear features to maximally preserve the Euclidean norm among the samples. For some works on using MEF for recognition-related problems, the reader is referred to [124] [86]. 2.3.3 The most discriminating features During navigation, the system needs to distinguish a variety Of road types. Is it a cross intersection, a Y-shaped branching, a T-junction or a straight way? The road types will be used, coupled with the intention P, to decide next navigation action. Furthermore, given the same type Of scene, different current heading directions (as- suming that the camera is fixed on the vehicle), result in different scene appearances. For example, the next heading direction should depend on the way in which the road extends as seen from the current heading direction, left, straight, or right. All these 40 require that the navigator to classify scenes according what it sees. In the MEF Space, a nearest neighbor type Of criterion is not well suited for scene classification, Since the Euclidean distance in MEF Space does not necessarily indicate a perceptual distance. Due to the effects Of many unrelated factors, such as lighting, the Euclidean distance between two views Of the same scene at different times may be much larger than those between two different scenes. In other words, the pixel- to-pixel distance, whether in the original image space or MEF Space, cannot well characterize why two scenes are considered different. When the category labels Of the sample data are known, the discriminant analysis explained below can be used to automatically derive features that best characterize different categories. Suppose samples, Y’s, are m-dimensional random vectors from c classes. The ith class has a probability p,, a mean vector m,- and a scatter matrix 23,-. The within-class scatter matrix is defined by Sw = ipiEflY — m,)(Y — m,)’ [Y 6 class i} = ipfl, (2.7) i=1 i=1 The between-class scatter matrix is Sb 2 2le p,(m,- - m)(m,- — m)‘ where m is defined as m = EY 2 21°21 pimi. The mixture scatter matrix is defined by Sm = E {(Y—m)(Y—m)‘} = Sw+Sb. We want to maximize the between-class scatter with respect to the within-class scatter. In the discriminant analysis [142], an Objective 41 function we wish to maximize is trace{S;’Sb} (2.8) In other words, we want to find features in which the scatter between classes is large but the scatter within every class is small. Suppose we use lst-dimensional linear features Z = W’Y where W is an m x k rectangular matrix whose column vectors are linearly independent. The above mapping represents a linear projection from m-dimensional space to k dimensional space. The Objective is to determine an m x k matrix IV so that in the new Space, the Objective function in (2.8) fz(m) = trace{SZ",[,SZb} is maximized. Alternatively, we can also consider another Objective function. det{SZw} and det{SZb} measure the hyperellipsoidal scattering volume of Szw and Sgb, respectively. 
We may maximize their ratio

    det{S_{Zb}} / det{S_{Zw}} = det{W^t S_b W} / det{W^t S_w W}    (2.9)

It has been proved that both objective functions (2.8) and (2.9) lead to the same projection matrix W, whose column vectors are the eigenvectors of S_w^{-1} S_b associated with the k largest eigenvalues. The column vectors of W are called the most discriminating features (MDF). They are the best linear features that maximize the ratio in (2.9). Since the rank of S_b is at most c - 1, we know that at most c - 1 features are needed and the others do not contribute to the maximum of the objective functions.

Figure 2.4: A 2-D illustration of the most discriminating features (MDF). The MDF is the projection along z_1. The MEF along y_1 cannot separate the two subclasses.

Fig. 2.4 illustrates the meaning of the MDF. Without class labels, all the samples are used to compute the MEFs. In the figure, the first MEF along y_1 cannot separate the two subclasses. The direction along the difference of the two means cannot either. Although the variation along the first MDF z_1 is not large, it catches the major feature that is crucial for classifying the two subclasses.

In practice, we are given images from c categories. Let the number of samples belonging to the i-th class be n_i, the j-th sample in the i-th class be Y_j^(i) and the mean vector of the i-th class be m^(i). The grand mean vector for all samples is replaced by the grand sample mean

    m = (1/n) \sum_{i=1}^{c} n_i m^(i),  where n = \sum_{i=1}^{c} n_i.

The mixture scatter matrix is replaced by the sample scatter matrix

    S_m = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (Y_j^(i) - m)(Y_j^(i) - m)^t.

Define the sample scatter matrix for the i-th class as

    S^(i) = \sum_{j=1}^{n_i} (Y_j^(i) - m^(i))(Y_j^(i) - m^(i))^t.

The within-class scatter matrix is replaced by S_w = \sum_{i=1}^{c} S^(i), and the between-class scatter matrix is replaced by

    S_b = \sum_{i=1}^{c} n_i (m^(i) - m)(m^(i) - m)^t.

The relation S_m = S_w + S_b still holds for the sample scatter matrices. The stable solution of the eigenvectors and associated eigenvalues of S_w^{-1} S_b can be obtained via simultaneous diagonalization [38].

2.3.4 The DKL projection

The discriminant analysis procedure breaks down when the within-class scatter matrix S_w becomes degenerate, which is the case when the dimension of the input image is larger than the number of learning samples. In fact, the discriminant analysis can be performed in the space of the Karhunen-Loeve projection (i.e., the MEF space), where the degeneracy typically does not occur. Thus, the new overall discriminant projection is decomposed into two projections: the Karhunen-Loeve projection followed by the discriminant projection. To do this, we first project the d-dimensional X-space onto the m-dimensional MEF space (Y-space) using the Karhunen-Loeve projection. Then, we project the Y-space onto the k-dimensional MDF space (Z-space). Mathematically, we define the new DKL projection (short for Discriminant Karhunen-Loeve projection) from the d-dimensional space of X to the k-dimensional space of Z as Z = W^t V^t X.

We must determine the dimension m of Y so that S_w is not degenerate. Given s samples from c classes, the maximum rank of S_w is s - c. Therefore, in order to make S_w nondegenerate, the input dimension m of the MDF projection (i.e., the size of S_w) cannot be larger than s - c. That is, m <= s - c. This means that, in the MEF projection, we need to discard the (s - c + 1)-th up to the (s - 1)-th largest eigenvalues of Σ_x. As we can expect, these eigenvalues are typically extremely small, because the (s - 1)-th eigenvalue is the smallest among all the nonzero eigenvalues. In practice, we would like to discard more, because it is very unlikely that they are important for classification either. On the other hand, m cannot be smaller than the number of classes c. Therefore we have k < c <= m <= s - c. Typically, s is much larger than c and thus m can be chosen as an integer value in [c, s - c] which also satisfies the variation criterion in (2.5).
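Under the same notation, a compact sketch of the DKL projection (again in Python/NumPy and for illustration only; the eigenproblem here is solved directly on S_w^{-1} S_b rather than by the simultaneous diagonalization used in the actual system) is:

    import numpy as np

    def dkl_projection(images, labels, m, k):
        # images: (s, d) learning images; labels: (s,) class labels.
        # m: MEF dimension (c <= m <= s - c so that S_w is nonsingular); k: MDF dimension (< c).
        mean = images.mean(axis=0)
        U = (images - mean).T                              # Karhunen-Loeve (MEF) step
        lam, Wn = np.linalg.eigh(U.T @ U)
        V = U @ Wn[:, np.argsort(lam)[::-1][:m]]
        V /= np.linalg.norm(V, axis=0)
        Y = (images - mean) @ V                            # samples in the m-dimensional Y-space
        Sw = np.zeros((m, m))                              # within-class scatter in Y-space
        Sb = np.zeros((m, m))                              # between-class scatter in Y-space
        gmean = Y.mean(axis=0)
        for cl in np.unique(labels):
            Yc = Y[labels == cl]
            mc = Yc.mean(axis=0)
            Sw += (Yc - mc).T @ (Yc - mc)
            Sb += len(Yc) * np.outer(mc - gmean, mc - gmean)
        evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)   # MDF step: eigenvectors of S_w^{-1} S_b
        W = np.real(evecs[:, np.argsort(np.real(evals))[::-1][:k]])
        return mean, V @ W

A new image X is then projected with Z = (X - mean) @ (V W), so the two steps combine into a single d x k matrix.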
2.3.5 Separability

All the features discussed here are linear features, since only linear projections are considered for deriving features. These linear features possess the optimal properties discussed above and they allow fast computation. However, an important question is separability. That is, can linear features separate regions of arbitrary shapes in the space S?

Geometrically, we have seen in Section 2.2.6 that the final partition in a hierarchical partition P is the finest partition P_m. The other intermediate partitions are used mainly to speed up the search. The use of linear features here implies that the cells in the partition P_m are bounded by hyperplanes. Geometrically, we can see from Fig. 2.2 that piecewise linear hyperplanes can approximate virtually any region to a desired accuracy.

Mathematically, Cover [18] showed that the NN estimator is a general estimator: under mild continuity conditions, the unconditional risk of the NN estimate is bounded by twice the Bayesian risk when the sample size approaches infinity. Therefore, using our coarse-to-fine hierarchical partition scheme, where the tessellation is based on the finest cells, linear features are sufficient to solve virtually any smooth nonlinear problem. Unlike feedforward neural networks trained with a type of iterative minimization method such as back-propagation, a successful class partition is guaranteed if a sufficient number of samples are used. The use of the MDF allows the partition to be accomplished with fewer hyperplanes for better generalization.

2.4 Self-Organizing RPT Generation through Learning

2.4.1 Data organization

First, consider how to automatically organize visual data. We might first find clusters in the sample data and search for the best clusters to be assigned to each child. Then, cells are defined, each of which corresponds to a child. Recursively, a hierarchical clustering structure in the data may be explored so that a hierarchical tree can be constructed. However, algorithms for clustering analysis are very time consuming due to their iterative nature, and their complexity is not suited to learning a huge number of samples.

Our objective is to approach a learning complexity that is close to O(n) for a learning set of size n. We cannot hope for any lower complexity since each item needs to be examined for learning. With this goal, we can recursively partition samples into smaller and smaller sets, without worrying about whether each cell corresponds to a cluster or not. Because we are interested in finding the best matching sample for an input, clustering is not a major concern. Our partition tree is different from existing ones in that features are recursively derived based on the actual samples each node receives in the learning phase. In other words, nodes at the same level or at different levels all use different sets of features, each best suited to the task of its node.

Figure 2.5: Geometrical illustration of the binary RPT. (a) The geometric cells. Thicker lines indicate boundaries of coarser-level cells; thinner lines indicate boundaries of finer-level cells. (b) The corresponding RPT.
2.4.2 Binary recursive partition tree

For real-time navigation, the speed of the tree search is a central issue. To project a d-dimensional input into a k-dimensional space, k inner products need to be computed, each for a d-dimensional vector. Since d is large, ranging from 300 to 4800 with a reduced-resolution image in our experiments, we chose to adopt k = 1 for the advantage of minimum computation and simplicity in implementation. In other words, each node uses only a single feature. Since only one feature is used at each internal node, we use the sample feature mean as the threshold to break the samples into two groups, one for the left child and one for the right. Thus, the result is a binary tree, as shown in Fig. 2.5.

In the context of organizing samples into a decision tree, various ways have been used to select the feature at each node that assigns samples to child nodes. Mui and Fu [84] used an iterative method to classify nucleated blood cells. Sethi and Savarayudu [107] maximized the amount of average mutual information gain at each partition step. Chou [16] employed the criterion of information divergence. In our case, we use the Karhunen-Loeve projection or the DKL projection, respectively, to automatically derive the linear features that are best at each node. In applications where no sample label is available, the MEF feature is used. When class labels are available, the MDF is used.

2.4.3 Constructing the MEF RPT

Given a set of learning samples

    L = {(X_i, f(X_i)) | i = 1, 2, ..., n},    (2.10)

the objective is to automatically construct an MEF RPT. The root of the tree accepts the entire set L for learning. At each node, the samples that go to this node are used to compute the first MEF feature vector. All the samples that have a negative MEF value go to the left child and the remaining ones go to the right. Such a process is performed recursively from the root down to each node's children. The recursive subdivision is terminated at a node when it has only one sample.

The power method [39] can be used to compute the dominant eigenvector as the first MEF. Let n and d be the number of samples and the dimension of the input sample vector, respectively, with n <= d. The time complexity for computing U^t U is O(n^2 d), that for the dominant eigenvector of U^t U is O(n^2), and that for projecting n input samples onto the first MEF feature is O(nd). The total complexity is then O(n^2 d). When the number of learning samples is very large, the computation of the MEF features becomes time consuming. Coarse-level nodes correspond to large cells and thus they have more samples than finer-level nodes. However, at a coarse level we do not need to use all the samples to compute the exact MEF, because the partition at a coarse level is rough anyway. Thus, we adopt a constant q as the maximum number of samples used to compute the MEF at every level. Whenever the number of samples available at a node exceeds q, a subset of q samples is randomly selected for the MEF computation. Of course, we will not waste the other remaining samples: all the samples of a node will be assigned to the corresponding children, where they will be used. At fine levels, the number of samples assigned to each node becomes small. As soon as the number of samples of a node is not larger than q, all the samples are used for its MEF computation. The following is a pseudo code of the algorithm.

MEF RPT Construction Algorithm: Given L, construct the MEF RPT.

  call construct(root, L);

  construct(node, L) {
    if (||L|| == 1) { node = L; return; }
    else if (||L|| > q) L' = pick-q-samples(L);
    else L' = L;
    compute MEF1 from L';
    mean = average of the MEF1-projections of all s in L';
    left-set = right-set = empty;
    for each s in L {
      if (MEF1-projection(s) - mean < 0) left-set = left-set + {s};
      else right-set = right-set + {s};
    }
    construct(node->leftchild, left-set);
    construct(node->rightchild, right-set);
  }
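A runnable counterpart of this pseudo code (Python/NumPy, illustrative only, with hypothetical names, and using a random subset of at most q samples for the MEF as described above) might look like:

    import numpy as np

    def build_mef_rpt(samples, outputs, q=50):
        # samples: (n, d) images reaching this node; outputs: (n, m) control signals.
        n = len(samples)
        if n <= 1:
            return {'leaf': True, 'samples': samples, 'outputs': outputs}
        idx = np.random.choice(n, min(n, q), replace=False)   # at most q samples for the MEF
        sub = samples[idx]
        mean = sub.mean(axis=0)
        U = (sub - mean).T
        lam, W = np.linalg.eigh(U.T @ U)
        v1 = U @ W[:, -1]                                      # dominant eigenvector = first MEF
        norm = np.linalg.norm(v1)
        if norm == 0:                                          # all selected samples identical
            return {'leaf': True, 'samples': samples, 'outputs': outputs}
        v1 /= norm
        proj = samples @ v1
        thresh = proj[idx].mean()                              # mean of the MEF1 projections
        left = proj - thresh < 0
        if left.all() or (~left).all():                        # degenerate split: stop subdividing
            return {'leaf': True, 'samples': samples, 'outputs': outputs}
        return {'leaf': False, 'mef': v1, 'thresh': thresh,
                'left': build_mef_rpt(samples[left], outputs[left], q),
                'right': build_mef_rpt(samples[~left], outputs[~left], q)}

Retrieval then descends by the sign of a query's MEF1 projection minus the stored threshold, mirroring the construction rule.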
2.4.4 Constructing the MDF RPT

The MDF RPT uses MDFs and thus class labels are required for each sample in the learning set. Two types of labels can be used: a single-level label or a hierarchical label. A hierarchical label is of the form X_1.X_2.....X_t, where each X_i is a class label for level i. The hierarchical label is intended to direct the RPT to group the samples in the way desired by the hierarchical label. For example, a coarse-level label X_1 can have two classes, corridor and corner, so that the tree will try to group samples into corridor images and corner images before further subdividing the samples into finer classes. A single-level label is a special case of the hierarchical label in which t = 1.

The construction algorithm of the MDF RPT is similar to that of the MEF RPT. The major differences are the computation of the MDF instead of the MEF, and the use of labels. A set of labeled learning samples is of the form

    L = {(X_i, f(X_i), b_i) | i = 1, 2, ..., n},    (2.11)

where b_i is the label of sample X_i. If the labels are hierarchical, b = b_1.b_2.....b_t, where the label b_i is intended to be used at level i. For any node at a level number larger than t, the deepest-level label, b_t, is used. The recursive subdivision of the samples is terminated at a node if the node has only samples of a single label b_t. A form of pseudo code of the algorithm is given below.

MDF RPT Construction Algorithm: Given a labeled L, construct the MDF RPT.

  call construct(root, L, 0);

  construct(node, L, level) {
    if (the samples in L have the same deepest label b_t) { node = L; return; }
    else if (||L|| > q) L' = pick-q-samples(L);
    else L' = L;
    compute MDF1 from L' using the labels in L';
    mean = average of the MDF1-projections of all s in L';
    left-set = right-set = empty;
    for each s in L {
      if (MDF1-projection(s) - mean < 0) left-set = left-set + {s};
      else right-set = right-set + {s};
    }
    construct(node->leftchild, left-set, level+1);
    construct(node->rightchild, right-set, level+1);
  }

Of course, instead of using only the mean for assigning the left and right sets, the mean and variance of the samples in each class can be used to select a better threshold so that the boundary will not cut through a single-class cluster. Again, as we know, the boundary does not have to give a perfect classification, due to the coarse-to-fine nature of the RPT partition. A fast algorithm is more important when the number of sample images is very large.

2.4.5 Learning phase and performing phase

The result of the learning phase is an RPT, as illustrated in Fig. 2.6. At each leaf node, each learned image X_i has its corresponding control signal vector f(X_i). In the performing phase, the KCDB approximator will use more than one leaf node if k > 1. However, which k leaves will be used depends on the query input X.

Figure 2.6: The learning and performing phases. In the learning phase, acquired learning images and control signals are fed to the RPT builder, which produces the RPT; in the performing phase, real-time images are used to retrieve from the RPT, and the associated control signals are interpolated into the output control signal.

Therefore, the connection from the k leaves to the interpolation part is neither a part of the tree nor stored.
It is generated at run time. The RPT can be regarded as a visual information database. Since the entire input X is used for the query, the retrieval from this database is content-based.

2.4.6 Differences between SHOSLIF-N and two neural network approaches

A schematic difference between our SHOSLIF-N and current neural network approaches is shown in Fig. 2.7. Current neural network approaches use an explicit function f to directly map a visual input I into an output signal O. Here f is represented by a linear or nonlinear function of linear or nonlinear combinations of first- and second-layer neurons. The adjustable parameters are the weights of the first and second layers in a feedforward network, and the widths and centers of the first layer and the weights of the second layer for an RBF network. In SHOSLIF-N, a novel input I is used to retrieve its k nearest neighbors among the learned samples. Then the associated outputs of the k samples are interpolated to obtain the actual output. Notice that there is a task decomposition here: the visual input is used in content-based retrieval, while the associated outputs are weighted according to their distances to the current input in the interpolation process to get the desired output.

Figure 2.7: Schematic difference between current neural network approaches for vision-based navigation and SHOSLIF-N. (a) Direct input-to-output mapping using neural networks. (b) SHOSLIF-N: content-based retrieval from an appearance database followed by interpolation based on the k nearest neighbors.

Direct input-to-output mapping in Fig. 2.7(a) can be regarded as a global approximation that tries to minimize the global sum of squared errors between the target outputs and the simulated network outputs. SHOSLIF-N can be regarded as a piecewise approximation of the input-to-output mapping based on learning samples, as shown in Fig. 2.8. Better and better approximation can be achieved by adding more and more learning samples. On the other hand, more space is required for storing these samples.

Figure 2.8: Piecewise approximation of SHOSLIF-N: the mapping of a visual input to an output control signal is based on a local neighborhood of the input image. The sphere represents a high-dimensional input space. 'x' denotes a sample visual input. The output associated with input 'x' is a value based on interpolation of the outputs associated with its nearest neighbors, which are marked as solid dots.

We notice that, simultaneously with our work [11], Hancock and Thorpe [44] applied an eigen-subspace method to outdoor navigation in their ELVIS system. ELVIS uses a global parametric regression method. Each training sample pair, with input image I and output O (a Gaussian distribution peaked at the desired heading), is treated as a monolithic vector. The coordinates of I in the eigenspace of (I, O) are used as weights to generate the output from the output part of the eigenvectors. Our method is a nonparametric method which uses local interpolation and thus can approximate a function better than the global parametric method.

2.5 Complexity

A detailed complexity analysis for the MEF RPT is given below. The analysis for the MDF RPT is similar, except that at each nonleaf node the MEF and MDF projections (i.e., DKL) are both employed.

2.5.1 Complexity in the learning phase

The time complexity for computing the first MEF feature from n samples of dimension d is O(nd^2), when n > d.
Given a set of N learning images of dimension d, a low resolution (30 x 40) is typically sufficient for the images and we use a large number of learning samples; therefore N > d. The MEF RPT will have an average leaf-node depth of O(log N) [65] and thus the time complexity of the learning algorithm is

    \sum_{l=1}^{log N} \sum_{i=1}^{2^l} O(N_{l,i} d^2)

where N_{l,i} is the number of learning images coming to the i-th node at level l. We know that \sum_{i=1}^{2^l} N_{l,i} <= N, so the complexity is O(N d^2 log N). The time complexity is thus O(N log N), where N is the number of learning samples. This low learning complexity is due to the fact that the method avoids time-consuming iterative analysis of the input samples, which makes it possible to extend the method to future incremental learning, where the time required for learning a single sample at a time is O(log N).

Next, we consider the space complexity. The number of nodes in the binary RPT is no more than 2N - 1. The number of leaves in the RPT is no more than N. Each child node needs only to store the dominant eigenvector. The space complexity of the learning phase is O(dN) = O(N).

2.5.2 Complexity in the performing phase

First consider the time complexity. The average leaf-node depth of an N-leaf binary tree is O(log N). A query with a new input image needs one projection at each inner node along the searched path, so the complexity of retrieving the best match for a new image is O(d log N) = O(log N). This low time complexity allows the system to respond quickly even when N is large. For each internal node of the learning tree, we need to keep only the dominant eigenvector. The space complexity of the performing phase is then O(Nd) = O(N).

2.6 Experimental Results

The experiments include performance evaluation and real runs with Rome, a mobile robot at MSU built on a Labmate platform from TRC.

2.6.1 Image acquisition

The test site for our navigation experiments is inside the Engineering Building at MSU. The experimental area consists of several corridors with various turns. Rome was controlled manually to take pictures at different positions for learning. At each position, a set of five images with different heading directions was obtained: two left headings (5 and 10 degrees), two right headings (5 and 10 degrees) and one straight-ahead direction. The corresponding corrected heading directions were also recorded. For straight corridors, each set of pictures was taken roughly every 2 to 3 meters. At turns, learning images were taken during sample drives controlled manually using a joystick, and the corresponding corrected heading directions were obtained using consecutive heading increments. The pictures at turns were grabbed in shorter steps, less than 0.5 meter per step, since more frequent heading updates are needed during a turn. Some of the sample learning images are shown in Fig. 2.9.

Figure 2.9: Some sample learning images in the straight corridors and the corners. The upper row is a straight corridor. The lower one is a left turn.

2.6.2 MEFs, MDFs and their clustering effects

We show some experimental results to indicate quantitatively how the MEF and the MDF may perform very differently in classifying scenes. We computed MEFs and MDFs, respectively, from 210 images along a straight corridor and 108 images at a corner, all grouped into 6 classes.
The first five classes are straight corridor images classified according to the next heading direction needed to recover the correct heading: class 0 for 10 degrees; class 1 for 5 degrees; class 2 for 0 degrees; class 3 for -5 degrees; and class 4 for -10 degrees. Class 5 consists of the 108 images at the corner. Fig. 2.10 shows the first five MEFs and MDFs, respectively. As can be seen, the first MEFs mainly record large areas of contrast, while the MDFs record the locations of edges with increasing spatial resolution. Fig. 2.10(c) shows the learning samples in the subspace spanned by the first two MEFs and Fig. 2.10(d) shows them in the subspace spanned by the first two MDFs. As clearly shown, in the MEF subspace the samples from a single class spread out widely and the samples of different classes tend to mix together. However, in the MDF subspace the samples of each class are clustered more tightly and the samples from different classes are farther apart. From Fig. 2.10, we can see that to classify an unknown image from the same environment using the nearest neighbor rule, the MEF space is not as good as the MDF space.

Figure 2.10: The difference between MEF and MDF in representing learning samples from a straight corridor and a corner. (a) The first five MEFs of the learning set. (b) The first five MDFs of the learning set (r = 80%). (c) Learning samples represented in the subspace spanned by the first two MEFs. (d) Learning samples represented in the subspace spanned by the first two MDFs. The numbers in the plot space are the class labels of the learning samples.

To observe how the first two MEFs and MDFs, respectively, perform when they are used to represent images that have not been included in the learning sample set, we plot a set of 227 new images, of which 170 are from the straight corridor, 34 from the trained corner and 23 from an untrained corner, in Fig. 2.11. In Fig. 2.11(b), the structure of each class cluster is still visible in the MDF subspace, though not as tight as in Fig. 2.10(d).

Figure 2.11: Test images in the subspaces spanned by the first two MEFs and MDFs, respectively, none of which are included in the learning set. (a) The MEF subspace. (b) The MDF subspace.
Figure 2.12: Fewer classes in the learning set: using only straight corridor images for training. (a) Learning samples represented in the subspace spanned by the first two MEFs. (b) Learning samples represented in the subspace spanned by the first two MDFs. The numbers in the plot space are the class labels of the learning samples.

2.6.3 Better clustering in children

In the RPT, the children of a node divide the sample images handled by the parent node into smaller groups, and each group has, in general, a smaller number of classes. To see how fewer classes allow the MDF to cluster classes more effectively, we computed MDFs separately, one set from the straight corridor images and another set from the corner images. Using the first two MDFs computed from the corresponding images, the straight corridor learning images are shown in Fig. 2.12 in the corresponding MEF and MDF subspaces, respectively. Comparing Fig. 2.12(b) with Fig. 2.10(d), we see that each class in Fig. 2.12(b) groups more tightly and the between-class distances are larger. In Fig. 2.13, straight corridor images not in the learning set are plotted in the subspaces of MEF and MDF, respectively, that used only corridor images as the learning set. As we can predict, the test images now group more tightly than those in Fig. 2.11.

Figure 2.13: Test images represented in the subspaces learned from a learning set with 170 corridor images only (without corner images). (a) The MEF subspace. (b) The MDF subspace.

2.6.4 Effects on the tree size

For comparison purposes, two types of tree were experimented with, the MEF RPT and the MDF RPT. The former uses MEF features and the latter uses MDF features. However, since a pure MEF RPT does not perform well, as shown in the previous examples, we used the MDF at the root level of the MEF RPT to classify coarse scene types, corridor and corner, before using MEFs for further classification. Table 2.1 shows the distribution of nodes over the levels in the MDF tree and the MEF tree, respectively. Both trees used the same 318 learning images, 210 from the straight corridor and 108 from the corner. The MDF tree has a total of only 69 nodes, with only 35 leaf nodes, while the MEF tree has a total of 635 nodes, with 318 leaf nodes. As expected, an MDF tree is typically smaller than the corresponding MEF tree, since the hyperplanes represented by the MDF are more effective at cutting along the boundaries of classes.
Level | Number of nodes (MDF tree) | Number of nodes (MEF tree)
0 | 1 | 1
1 | 2 | 2
2 | 4 | 4
3 | 8 | 8
4 | 10 | 16
5 | 14 | 32
6 | 22 | 64
7 | 8 | 128
8 | - | 208
9 | - | 138
10 | - | 34
Total | 69 | 635

Table 2.1: The sizes of the MDF tree and the MEF tree, trained with the set of 318 images.

Figure 2.14: Distribution of the absolute angular error (in degrees) of the retrieved heading directions. The plots are based on queries using a test set of 204 images, of which 170 are from the straight hallway and 34 from the learned corner. (a) Using an MEF RPT (k = 1). (b) Using an MDF RPT (k = 1). (c) Using a KNDB MDF RPT (k = 4).

2.6.5 RPT Retrieval

Since we have taken a large number of test images that are not in the learning set, we conducted some "virtual runs." Fig. 2.14 shows the error distribution of the retrieved heading directions for a set of 204 test images, when the trees are trained using a disjoint set of 318 images from the straight hallway and the corner. Comparing Fig. 2.14(a) and Fig. 2.14(b), we can see that the errors when retrieving from the MDF tree are relatively smaller, especially for the bin of 5-degree error, which implies better generalization power. To give some intuition about what kind of images were retrieved using the RPT, some retrieval results are shown in Fig. 2.15.

Figure 2.15: Examples of retrieval. The first row: query images; the second row: retrieved using the MDF RPT; the third row: retrieved using the MEF RPT.

2.6.6 Real runs

For the real runs, the chosen control signal has three components: corrected heading direction, speed and step size. When a new image is grabbed, the retrieval program searches down the RPT for the matching leaf node, and uses the corresponding stored control signal to control the robot. For the learning, r in (2.5) is set to 5%. We trained Rome through three learning drives in which 363 learning images were taken, 280 of which were from two straight corridors and 83 from a corner. The image size for the MEF and MDF projections is 30 x 40, rescaled from its original 480 x 640 size with smoothing to reduce the effect of input noise. The total computer time for the learning phase ranged from 4.5 to 7.5 minutes, depending on the SUN workstation used, as shown in Table 2.2.

Workstation | Time (minutes)
SPARC-20 | 4.5
SPARC-10 | 7.5

Table 2.2: The total learning time for a set of 363 images.

We let Rome, trained using the MEF RPT, navigate autonomously, at a walking speed, over 30 times along three straight corridors and two corners (including one corridor that had not been learned). All of these more than 30 drives were successful [11]. By Spring 1995, the MDF RPT version had been implemented, and it has been used for Rome testing and demonstrations since then. The MDF version exhibited a more consistent direction history than the MEF version. During various tests and demonstrations there were people walking along the corridors, but Rome was not bothered, because it recognized the entire scene instead of a particular type of landmark (e.g., road edges).

Table 2.3 shows the time used for retrieval from the trained MEF RPT. Clearly, a SPARC-1 is fast enough (13 Hz) for real-time navigation in this experiment. The MDF RPT is even faster.

Workstation | Average time per retrieval
SPARC-10 | 19.0 ms
SPARC-2 | 37.0 ms
SPARC-1 | 74.7 ms

Table 2.3: Time for retrieving from the trained RPT, given a new input.

Fig. 2.16 shows a trajectory of a sample navigation and Fig. 2.17 shows some images to illustrate the environment.
To test the scalability of the algorithm, we collected a large set of 2850 images from various corridor sections for training and obtained timing data for a comparison between the tree version and the flat version, where the nearest neighbor is found by exhaustive search in the corresponding space. The computer timing data are shown in Table 2.4. From this table, it is clear that the use of the MEF tree can greatly speed up the retrieval, and a real-time speed can be achieved only with the tree version.

Figure 2.16: A sample of autonomous navigation in the trained corridor: two straight corridors and one corner. The circles indicate the locations where images are taken.

Figure 2.17: Rome navigates automatically at two corners: the first corner in the 1st row and the second corner in the 2nd row.

Per retrieval | MEF tree | Flat version in MEF space | Flat version in image space
Time (in milliseconds) | 27.7 | 738.3 | 2853.7
Time ratio w.r.t. MEF tree version | 1 | 26.7 | 103.0

Table 2.4: The comparison in retrieval timing on a SPARC-10 between the MEF tree version and the flat versions, where tree structures are not used. A set of 2850 learning images was used in training.

2.7 Comparison with Two ANN-based Approaches

Artificial neural networks have been successful in various engineering applications, including vision-based navigation. ALVINN [94] and ROBIN [105], [106], developed by CMU and Univ. of Maryland respectively, are two well-known systems that map visual inputs directly into output control signals. ALVINN uses a two-layer feedforward network while ROBIN uses an RBF network. To compare our method with these two approaches, we use MATLAB to do the simulation.

We used the same output scheme as ALVINN and ROBIN: the output pattern of each training sample is a Gaussian distribution peaked at the desired corrected heading. In our simulation, the output layer has 21 nodes with a resolution of two degrees apart. The i-th (i = 1, 2, ..., 21) output node corresponds to (i - 11) x 2 degrees in corrected heading. A sample training input-output pair is shown in Fig. 2.18. After the neural network is trained, the simulated output heading is taken as the peak of a Gaussian fit to the outputs of the neurons at the output layer. (An illustrative sketch of this output encoding and decoding is given after the description of the data sets below.)

Figure 2.18: Sample training input-output pair: the top part is a 30 x 40 image; the last row is the associated output signal, which is a Gaussian distribution peaked at the corrected heading direction.

Three sets of data are used in this study:

Set 1: Real images from straight hallway sections: a set of 100 (30 x 40)-pixel images with five different corrected heading directions: -10, -5, 0, 5 and 10 degrees. Each of these five classes has 20 samples.

Set 2: A full set of 318 (30 x 40)-pixel real images. This set of images was used in our training of SHOSLIF-N for vision-based navigation. It includes 210 images from straight hallway sections and 108 images from corner sections.

Set 3: A synthesized data set: given a trajectory map, synthesized sample images are generated with different orientations and translations. For a square trajectory, a typical set consists of 100 (30 x 40)-pixel synthetic images with five different headings: -10, -5, 0, 5 and 10 degrees. Some sample images are shown in Fig. 2.19.

Figure 2.19: Some synthesized sample images generated at different locations and orientations along the simulated navigation path. Images were generated under a perspective projection camera model. (a) The road map: the dark area is nonroad and the white area is the road. (b) Synthesized sample images.
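The output coding just described can be sketched as follows (Python/NumPy, illustrative only; the Gaussian width sigma is an assumed value, since the text above only specifies a Gaussian peaked at the desired heading, and the decoder below uses an activation-weighted mean as a simple stand-in for a full Gaussian fit):

    import numpy as np

    DEGREES = (np.arange(1, 22) - 11) * 2.0        # node i covers (i - 11) * 2 degrees, i = 1..21

    def encode_heading(heading_deg, sigma=4.0):
        # Gaussian activation pattern over the 21 output nodes, peaked at the desired heading
        return np.exp(-(DEGREES - heading_deg) ** 2 / (2.0 * sigma ** 2))

    def decode_heading(activations):
        # Approximate the peak of a Gaussian fit by the activation-weighted mean heading
        w = np.clip(activations, 0.0, None)
        return float((w * DEGREES).sum() / w.sum())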
Figure 2.19: Some synthesized sample images generated At different locations and orien- tations along the simulated navigation path. Images were generated under a perspective projection camera model. These three sets of images are used to test different learning approaches under various situations. 2.7.1 Simulation of a feedforward neural network For simulation of a two-layer feedforward neural network, we used “trainbpx” with adaptive learning rate available in the MATLAB neural network toolbox. Similar to ALVINN, we use four to nine hidden nodes in our simulation. When the size of problem is small, i.e. the number of images is small and the size of image is also small, the training of feedforward network went successfully: it found a reasonable solution within a few trials starting from initial random guesses. But we found that when the input dimension is high, e.g. 30 x 40, the neural network would not converge to reasonable solutions with random initial weights. We run our MATLAB script files for 100 times, each time with a different random guess of initial weights. The training epoch is set to 10,000, large enough to converge to a local 69 minimum. Fig. 2.20 shows the training record for Set 1. Here we use four hidden nodes as indicated in [95]. Fig. 2.20 shows that the weights does sometimes converge to a good solution for the training set, although only a few trials provide solutions with small sum-of—squares (SSE) errors as shown in Fig. 2.20(a). Our simulation with synthetic image set got consistent results: the learning process can easily get stuck to poor local minimums. Fig. 2.21 shows 100 trial training record for Set 2, the error histogram correspond- ing to the best SSE solution marked with a circle in Fig. 2.21 is shown in left column of Fig. 2.22(b). Again, the convergence to a reasonable solution is difficult. Training of Set 2 with 10,000 epoches for each random initial guess took about 75 minutes on a SUN SPARC-Ultra-I. To get a reasonable solution, typically about 2,000 epoches were needed. It took about 15 minutes for batch training of Set 2 with 2,000 epoches. All 100 random trials shown in Fig. 2.21 took 5.2 days of CPU time. 2.7 .2 Simulation of a radial basis function network The implementation of ROBIN is different from a typical RBF network in the follow- ing aspects: 1. Normalization. The responses of the receptive fields are normalized to have unit sum response for all neurons in the first layer. This normalization improves the interpolation capability, as shown by references sited in [106]. 2. Center selection. Rosenblum and Davis [105], [106] used K-means clustering algorithm for center 70 8 L § Sum of square error e ”o 20 4o Trial (a) Sum of square error for 100 trials. Frequency 9 o ‘ O i. W#¢.L 20 25 0 24651012 u 18 15 Abeohmerrorfindoqou) (b) Target outputs. (c) Error histogram. 20 25 7.25.35 ((11) Outputs of class 0. (d2) Outputs of class 1. .«K‘ 1[ '1. “ I 20 is o 5 1o 15 20 25 a ‘5 10‘ 15 ' Nodadmllyu NodewmlwlI-yev (d3) Outputs of class 2. (d4) Outputs of class 3. ((15) Outputs of class 4. Figure 2.20: Training with Set 1. 100 trials with different random guesses of initial weights. (a) shows the record of sum of square error (SSE). The trial with minimal SSE is marked with a circle. (b) shows the target output for five classes of training samples with different corrected headings: —10°, —5°, 0°, 5° and 10°. (c) is the error histogram of retrieved heading dierctions. 
(d1) to (d5) show the network output overlaid on target output for each class: simulated outputs and target outputs are plotted in solid lines and dotted lines respectively. 71 '3’ l 3% F } id ”§ Sfrfi fi' Sum of square error .§§§§§ 4o . so so 100 Trial 0 N . 0 Figure 2.21: Training record for Set 2: Sum of square error for 100 trials. The best SSE solution is marked with a circle. selection but the results were not satisfactory. So they used “forced-clustering” by manually assigning subset of input patterns as centers. This center selection process needs tedious human-machine interaction. For our simulation of a RBF network, we manually select a subset of typical views from the training set. For Set 1 and Set 3, we got zero error in heading direction for the training set when about 20 centers were chosen. For Set 3, we used similar number of centers as in [106]: 60 centers were manually selected. ROBIN [106] used 53 centers. We trained the RBF network with images in Set 3 and tested with the set of 204 images used in Fig. 2.14. After several trials with manual center selection, the error histograms of the training and test sets for the most reasonable selected centers are shown in Fig. 2.22(c). After manual center selection, the training of Set 2 took about 20 minutes on a SPARC-Ultra—I. 72 2.7.3 Comparison of SHOSLIF-N with two ANN-based ap- proaches We used a realistic image set —-— Set 3 to do the training. Another disjoint set of 204 images, the same data set used in Fig. 2.14, was used for testing. The error histograms are plotted in Fig. 2.22, for all three different approaches. SHOSLIF-N always gets perfect retrieval for the training set, so the error histogram concentrates on a single bin with zero error. Comparing Fig. 2.22(b) with Fig. 2.22(c), we can see that the feedforward neural network and the RBF network obtain compatible error performance in heading for the test set, but the RBF network is a little bit worse for the training set. For both the training and test sets, SHOSLIF-N gives better heading accuracy. According to our experience in simulation, SHOSLIF-N is advantageous in at least two aspects: (1) ease of training; (2) good performance after it is trained. Al- though the feedforward network could give reasonable solution, many random trials are needed for initial weights. The RBF network does provide comparable perfor- mance. However, the manual center selection is tedious and we don’t even know how many centers will be sufficient. When most of the training samples are used as centers, the RBF network will act like a nearest neighbor estimator. But then the computation complexity will be prohibitive when the number of training samples is large. On the other hand, our SHOSLIF-N has to pay extra cost in storing training samples. The memory size of the SHOSLIF-N tree is 3.1M bytes for Set 2. It seems that the payoff from extra storage enables SHOSLIF-N to reach a good performance 73 accuracy with a real-time speed. The training is also relatively fast. 2.7 .4 Mapping SHOSLIF-N into ANN architecture Learning algorithm used in SHOSLIF can be easily mapped into neural network architecture. The most dominant eigenvector for binary partition of each inner node can be computed by various neural learning algorithms (e.g. Oja, 1992 and references therein). The tree structure of SHOSLIF can also be mapped into neural network architecture [108]. Suppose that we have a binary tree with t splits and t+ 1 leaves. 
A network in which the first hidden layer of t nodes computes the split functions, and the second layer has a node for each of the t+ 1 paths to a leaf which ANDs the outputs along the nodes on the same path. The output layer then ORs the leaves with similar desired outputs. 2.8 Conclusions and Future Work The framework presented here does not restrict the scene type and thus is potentially applicable to various scenes. Instead of relying on a particular scene feature, it uses automatically derived spatially global features for decision making, and thus is rela- tively insensitive to local and small scene changes, such as road edges being occluded by objects or the presence of passers-by. The MDF version of the RPT seems to generalize better than the MEF version with its relatively tighter class clusters and a Smaller tree size. Due to availability of inexpensive, large hard disks and the fast log- arithmic retrieval that is made possible by the RPT, it seems that storing a relatively 74 0 ..__-._—._P__._L_._.—_.__a_ 1-‘1 H u 'I (a) SHOSLIF-N. MM Iva n Joy-e- (b) A feedforward network. w w l a. 7 O O O ‘C I? M V. O O l l O I ‘0 ‘3 (c) A RBF network. Figure 2.22: Comparison of three approaches: SHOSLIF-N, the best feedforward network and the best RBF network we obtained. A set of 318 images are used for training and a disjoint set of 204 images are used in testing. The error histogram for the training set is shown in left column while the error histogram for the test set is shown in right column. 75 large number of learning images is not unreasonable. The method presented seems particularly attractive for situations where consistent floor edges are not available or passers-by are not avoidable. The future work includes experiments with a wider variety of driving environ- ments. Since this approach requires a significant number of images for learning, real time incremental learning is a future direction for this line of research. 76 Chapter 3 On-Line Incremental Learning for Vision-Guided Navigation In last chapter a batch method of training RPT for vision-based navigation is dis- cussed. In this chapter the incremental learning issue is addressed. Since binary RPT requires only the most dominant eigenvector at each inner node for finer parti- tion, it turns out that the learning mechanism discussed in last chapter can be done incrementally with low computational complexity. 3.1 Introduction Autonomous vision-guided navigation has been a challenging task. In the past, most of the autonomous navigation systems relied on tracking specific features, such as lane markings of outdoor roads, floor or ceiling edges of hallways, while others detected road regions based on features such as color or texture. All of these systems had a 77 strong reliance on an a prion' model about the road’s appearance, and hand—crafted rules were implemented as the model fitting algorithm. Unfortunately, these models and rules are not always suited due to changes of environmental conditions. Some roads may not have clear lane markings, and some none at all. Variations in illumi- nation and road condition can often invalidate the underlying assumptions used in the vision algorithms. In order to deal with changes in environmental conditions, it is necessary for the system to employ adaptive mechanisms [20, 81, 94, 96]. 
According to different design goals and working environments, adaptive mechanisms may vary from simple adaptive color thresholding [72] to complex machine learning using, e.g., artificial neural networks [94, 57]. For outdoor navigation, ALVINN [94], an artificial neural network trained by a back-propagation training algorithm, learns characteristic features of particular roads under current conditions. It has been successfully tested in a variety of road condi- tions. ELVIS [44] uses an eigen-subspace method as an alternative of the artificial neural network in ALVINN. The availability of lane markings can be used to greatly enhance the adaptation performance of a driving system, as shown in the RALPH [96]: Possible road curvatures are predicted, and the correct one with the highest sig- nature score is picked up as the winner. This technique leads to a successful lateral position handler. In this chapter, we explore the incremental learning issue in the framework of SHOSLIF-N [138]. Incremental learning is a natural way in which humans gain their driving experience. With incremental learning, the training can be done on-line. 78 This is very important in machine learning for autonomous navigation, since the total sensed information in training is huge and highly redundant, and may exceed the limit of physical storage device. A batch training method is less practical and maybe less efficient. The training of artificial neural networks [47] has built-in incremental learning capability, e.g. pattern-by-pattern updating. However, since these computations with neural networks are iterative using the same set of images many times, it is hard to implement learning in real time, which is essential in a domain like autonomous driving. We proposed in [12, 139] an incremental learning algorithm by using eigen- subspace learning with a tree structure. Here we present new computation methods that reduce maximal response time, which is critical in on-line learning. The remainder of this chapter is organized as follows: Section 3.2 describes efficient dominant eigenvector computation for incremental learning. Section 3.3 presents how to realize the incremental update. Section 3.4 gives the incremental learning algorithm. The experimental results and conclusions are reported in Section 3.5 and Section 3.6, respectively. 3.2 Incremental Learning and Dominant Eigenvec— tor Computation First, we briefly describe our autonomous navigation scheme within the framework of SHOLSIF-N [138]. 79 3.2.1 Navigation as a content-based retrieval problem A navigator can be regarded as a complicated function which maps high-dimensional input images into low-dimensional output control signal. In the learning phase, a set of training images is used to build a recursive partition tree (RPT), in which each learned image has its corresponding control signals. At each inner node of the RPT, the eigen-subspace method is used to extract characteristic features for further partition of input image space. This process is done recursively down the tree until the current node contains only images with a similar control signal. In the performance phase, each newly grabbed image is used to retrieve the best-matched image from the trained RPT, and the control signal associated with the retrieved image is used as the next steering control signals. Notice that ELVIS [44] also uses an eigen-subspace method for outdoor navigation. In ELVIS input image and its output vector are treated as a long monolithic vector. 
Here we treat the input image and the output steering vector separately, and we deal with on-line incremental learning while ELVIS does not.

We chose binary partition for its simplicity. At each inner node, further space partition is based on the images' projection onto the single eigenvector associated with the largest eigenvalue, which captures the largest variance in the sample images assigned to the node. Therefore, at each inner node, we need only obtain the eigenvector associated with the largest eigenvalue. This leads to an efficient on-line incremental learning algorithm.

Incremental eigenspace update algorithms have been discussed in [43, 78, 85]. The incremental update of an eigenspace using QR decomposition is discussed in the classic textbook by Golub and Van Loan [39]. All of these discussions focus on the incremental update of an eigen-subspace. However, since we only need to solve for the eigenvector associated with the largest eigenvalue, much simpler and more efficient methods can be used. The power method, the Lanczos algorithm, and the Rayleigh quotient conjugate gradient method [34, 39, 73, 104, 146] are among the most popular ways of computing the dominant eigenvalue/eigenvector of a real symmetric matrix.

3.2.2 The power method

The power method is one of the oldest methods for solving an eigensystem [39, 73, 141]. Suppose that a matrix A is diagonalizable, so that E^{-1} A E = diag(\lambda_1, ..., \lambda_n) with E = [e_1, ..., e_n], and the \lambda_i (i = 1, ..., n) are in decreasing order. Let u_0 be an arbitrary vector and let the sequences v_s and u_s be defined by the equations

    v_{s+1} = A u_s,    u_{s+1} = v_{s+1} / ||v_{s+1}||,                                (3.1)

where ||x|| is used to denote the l_2 norm. Clearly, we have

    u_s = A^s u_0 / ||A^s u_0||,                                                        (3.2)

and if we write

    u_0 = \sum_{i=1}^{n} a_i e_i,                                                       (3.3)

then, apart from the normalizing factor, u_s can be written as

    \sum_{i=1}^{n} a_i A^s e_i = \lambda_1^s [ a_1 e_1 + \sum_{i=2}^{n} a_i (\lambda_i / \lambda_1)^s e_i ].   (3.4)

If |\lambda_1| > |\lambda_2| >= |\lambda_3| >= ... >= |\lambda_n|, then, provided that a_1 != 0, we have u_s -> e_1 / ||e_1||. If several eigenvectors correspond to the dominant eigenvalue, \lambda_1 = \lambda_2 = ... = \lambda_r and \lambda_r > \lambda_{r+1} >= ... >= \lambda_n, the procedure still converges, since we have

    A^s u_0 = \lambda_1^s [ \sum_{i=1}^{r} a_i e_i + \sum_{i=r+1}^{n} a_i (\lambda_i / \lambda_1)^s e_i ] -> \lambda_1^s \sum_{i=1}^{r} a_i e_i.

The iterates tend to converge to some vector lying in the subspace spanned by the eigenvectors e_1, ..., e_r. When the ratio \lambda_{r+1} / \lambda_r is close to unity, the convergence is slow, but the iterates still tend to a meaningful vector. The term in square brackets of equation (3.4) can be regarded as a weighted sum of the eigenvectors e_1, ..., e_n, with coefficients a_1, a_2 (\lambda_2/\lambda_1)^s, ..., a_n (\lambda_n/\lambda_1)^s. The contribution from eigenvector e_i shrinks by a factor of \lambda_i / \lambda_1 at each power-method iteration, so only the dominant eigenvectors have significant contributions. A random vector can be chosen as u_0, since a_1 is always non-zero due to truncation error. In practice, a fixed number of iterations usually gives satisfactory results. Letting the number of iterations be K_p, the time complexity of the power method is O(n^2 K_p).
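As an illustration of this fixed-iteration scheme, the following is a minimal NumPy sketch of the power method described by (3.1)-(3.4); the function name and the choice of a random start vector are ours, and this is a sketch rather than the exact routine used in the system.

    import numpy as np

    def power_method(A, num_iters=30, rng=None):
        """Estimate the dominant eigenpair of a real symmetric matrix A.

        A random start vector is used; its component along e_1 is almost surely
        nonzero, so the normalized iterates converge toward e_1 (eqs. (3.1)-(3.4))."""
        rng = np.random.default_rng() if rng is None else rng
        u = rng.standard_normal(A.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(num_iters):       # K_p iterations, O(n^2) work each
            v = A @ u
            u = v / np.linalg.norm(v)
        eigval = u @ A @ u               # Rayleigh quotient estimate of lambda_1
        return eigval, u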
3.2.3 The Rayleigh quotient conjugate gradient (RQCG) method

Let A be a real symmetric matrix of order n, let \lambda_1, \lambda_2, ..., \lambda_n be its nonincreasing eigenvalues, and let their associated eigenvectors be e_1, e_2, ..., e_n, respectively. Then

    \lambda_1 = max_{x != 0} (Ax, x) / (x, x),                                          (3.5)

where (., .) denotes the inner product of two vectors and ||.|| defines the norm of a vector. Let R(x) = (Ax, x)/(x, x); R(x) is called the Rayleigh quotient. From eq. (3.5), it is clear that the dominant eigenvector is the vector that maximizes the Rayleigh quotient. A nice property of the Rayleigh quotient R(x) is that its critical points are either unstable saddle points or the global maximum. A practical way to solve the maximization problem of (3.5) is the conjugate gradient method. Let g(x) be the gradient of the Rayleigh quotient R(x). Then

    g(x) = 2Ax - 2R(x)x.

From any approximation x_k of e_1, an improved approximation x_{k+1} is computed by

    x_{k+1} = x_k + \delta_k p_k.

Here, p_k is the search direction found by the conjugate gradient technique,

    p_k = -g_k + \beta_k p_{k-1}.

The value of \delta_k is chosen by solving

    max_{\delta_k} R(x_k + \delta_k p_k).                                               (3.6)

The closed-form solution of (3.6) is

    \delta_k = ( -b_k - sqrt(b_k^2 - 4 a_k c_k) ) / (2 a_k),

where

    a_k = (x_k, p_k)(p_k, A p_k) - (x_k, A p_k)(p_k, p_k),
    b_k = (p_k, A p_k) - (x_k, A x_k)(p_k, p_k),
    c_k = (x_k, A p_k) - (x_k, A x_k)(x_k, p_k).                                        (3.7)

The value of \beta_k can be computed in various ways [146, 77]. A practical choice is the Polak-Ribiere method, where

    \beta_{k+1} = (g_{k+1} - g_k, g_{k+1}) / (g_k, g_k)

is used. The initial dominant eigenvector is randomly chosen on the unit n-dimensional sphere. The above RQCG method is iterative, each iteration taking O(n^2) time. The iteration is repeated up to a certain number of times or until the change in R_k becomes very small. In practice, the RQCG method typically converges rapidly, much faster than the power method. Letting the number of RQCG iterations be K_r, the time complexity of the RQCG method is O(n^2 K_r).

Both the power method and the RQCG method are easy to implement, and both have a complexity of O(n^2 K), where K is the fixed maximal number of iterations; we use K_p and K_r to denote the iteration numbers of the power and RQCG methods, respectively. Two other candidates for computing the dominant eigenpair of a real symmetric matrix are the accelerated power method [104] and the Lanczos algorithm; writing a numerically stable Lanczos algorithm for this purpose is highly nontrivial.

3.3 Update for Incremental Learning

Let {u_i | i = 1, 2, ..., k} be the set of images at an inner node, and let its mean vector and covariance matrix be m_k and V_k, respectively. If we let \tilde{u}_i = u_i - m_k and \tilde{U}_k = [\tilde{u}_1, \tilde{u}_2, ..., \tilde{u}_k], then V_k = (1/k) \tilde{U}_k \tilde{U}_k^t. Assuming that u_i is represented as a 1D vector of length d, \tilde{U}_k \tilde{U}_k^t has dimensionality d x d. In our vision-based navigation, the number of images at an inner node, k, is typically smaller than the number of pixels d. It can easily be shown that if \Phi is an eigenvector of the covariance matrix V_k with eigenvalue \lambda, then \tilde{U}_k^t \Phi is an eigenvector of W_k = (1/k) \tilde{U}_k^t \tilde{U}_k with the same eigenvalue \lambda (and, conversely, if \phi is an eigenvector of W_k, then \tilde{U}_k \phi is an eigenvector of V_k). Therefore, instead of computing the eigensystem of the d x d matrix V_k, we can calculate the eigensystem of the smaller k x k matrix W_k = (1/k) \tilde{U}_k^t \tilde{U}_k.

3.3.1 Insert a sample

The mean vector m_k can be updated incrementally:

    m_{k+1} = (1/(k+1)) [ k m_k + u_{k+1} ].                                            (3.8)

Letting \delta m_{k+1} = m_{k+1} - m_k, \tilde{u}_{k+1} = u_{k+1} - m_k, and \tilde{U}_{k+1} = [\tilde{u}_1, ..., \tilde{u}_{k+1}] (all centered at the old mean m_k), W_{k+1} can be updated incrementally:

    W_{k+1} = \frac{1}{k+1} \left( \begin{bmatrix} k W_k & \tilde{U}_k^t \tilde{u}_{k+1} \\ \tilde{u}_{k+1}^t \tilde{U}_k & \tilde{u}_{k+1}^t \tilde{u}_{k+1} \end{bmatrix}
              - \tilde{U}_{k+1}^t \delta m_{k+1} \mathbf{1}^t
              - \mathbf{1} \, \delta m_{k+1}^t \tilde{U}_{k+1}
              + (\delta m_{k+1}^t \delta m_{k+1}) \, \mathbf{1} \mathbf{1}^t \right),    (3.9)

where \mathbf{1} is the (k+1)-dimensional vector of ones. The above update rule for inserting a sample can be carried out with time complexity O(k^2 + kd). Since in our application we typically have d >> k, the time complexity for inserting or deleting a sample is O(kd).
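The following NumPy sketch realizes the insertion update of eqs. (3.8)-(3.9) for one new sample. The class and variable names are hypothetical, deletion and rejection of samples are omitted, and the bookkeeping differs slightly from the thesis implementation while keeping the stated O(k^2 + kd) cost per insertion.

    import numpy as np

    class NodeStats:
        """Running statistics of one inner node: mean m_k, the raw samples, and
        W_k = (1/k) * U~^t U~ (U~ centered at the current mean), from which the
        dominant eigenvector is computed."""

        def __init__(self, u0):
            self.m = u0.astype(float).copy()       # mean of the samples seen so far
            self.U = u0.astype(float)[:, None]     # raw samples, one column each (d x k)
            self.W = np.zeros((1, 1))              # centered Gram matrix, scaled by 1/k

        def insert(self, u):
            k = self.U.shape[1]
            m_new = (k * self.m + u) / (k + 1)                 # eq. (3.8)
            dm = m_new - self.m
            # all samples (old + new) centered at the *old* mean:  O(kd)
            Uc = np.hstack([self.U, u[:, None]]) - self.m[:, None]
            a = Uc.T @ dm                                      # u~_i^t dm, O(kd)
            G = np.empty((k + 1, k + 1))
            G[:k, :k] = k * self.W                             # old centered Gram, reused
            G[:k, k] = G[k, :k] = Uc[:, :k].T @ Uc[:, k]       # new row/column, O(kd)
            G[k, k] = Uc[:, k] @ Uc[:, k]
            # rank-one corrections for the mean shift, as in eq. (3.9):  O(k^2)
            G -= a[:, None] + a[None, :] - dm @ dm
            self.W = G / (k + 1)
            self.m = m_new
            self.U = np.hstack([self.U, u[:, None]])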
3.3.2 Reconstruction of the W matrix from the parent node's W matrix

Let p be an inner node of the RPT with a child node q, and let W and \tilde{W} denote p->W and q->W, respectively. Suppose that p has n samples {u_i | i = 1, 2, ..., n} with mean vector m, and that \tilde{n} of p's samples go to its child node q, with mean vector \tilde{m}: u_{l_i} (i = 1, 2, ..., \tilde{n}). Let A_{ij} = (u_i - m)^t (u_j - m) and \tilde{A}_{ij} = (u_i - \tilde{m})^t (u_j - \tilde{m}). Then

    W = (1/n) [A_{ij}],   i, j = 1, 2, ..., n,

and

    \tilde{W} = (1/\tilde{n}) [\tilde{A}_{ij}],   i, j = l_1, l_2, ..., l_{\tilde{n}},

where [A_{ij}] and [\tilde{A}_{ij}] denote the matrices with elements A_{ij} and \tilde{A}_{ij}, respectively. We know that

    A_{ij} = (u_i - m)^t (u_j - m) = u_i^t u_j - m^t u_j - u_i^t m + m^t m,   i, j = 1, 2, ..., n,   (3.10)

and

    \tilde{A}_{ij} = (u_i - \tilde{m})^t (u_j - \tilde{m}) = u_i^t u_j - \tilde{m}^t u_j - u_i^t \tilde{m} + \tilde{m}^t \tilde{m},   i, j = l_1, l_2, ..., l_{\tilde{n}}.   (3.11)

Combining (3.10) and (3.11), \tilde{A}_{ij} can be expressed as

    \tilde{A}_{ij} = A_{ij} + (m - \tilde{m})^t u_j + u_i^t (m - \tilde{m}) + (\tilde{m}^t \tilde{m} - m^t m),   i, j = l_1, l_2, ..., l_{\tilde{n}}.   (3.12)

The computation of \tilde{W} using (3.11) takes O(\tilde{n}^2 d) time, while the reconstruction of \tilde{W} with (3.12), using the available W, takes only O(\tilde{n} d) time.

3.3.3 The incremental RQCG method

Instead of starting from a random initial guess of the dominant eigenvector, the RQCG method can be run incrementally: the estimated dominant eigenvector of W_k can be used to generate a good initial dominant eigenvector of W_{k+1}, so the number of iterations can be reduced significantly. We use K_{ri} to denote the number of iterations of the incremental RQCG method. From (3.9) we know that when k >> 1, we can assume that \delta m_{k+1} is very small, and thus the second, third and fourth terms on the RHS of (3.9) can be ignored. In this case we have

    W_{k+1} \approx \frac{1}{k+1} \begin{bmatrix} k W_k & \tilde{U}_k^t \tilde{u}_{k+1} \\ \tilde{u}_{k+1}^t \tilde{U}_k & \tilde{u}_{k+1}^t \tilde{u}_{k+1} \end{bmatrix}.

Ignoring the scale factor, the only difference between W_{k+1} and W_k is the newly appended (k+1)-th row and (k+1)-th column. Because of the momentum effect, the true dominant eigenvector of W_{k+1} is close to the true dominant eigenvector of W_k appended with a zero in the (k+1)-th component. So the estimated dominant eigenvector of W_k, appended with a zero, can be used as the initial dominant eigenvector of W_{k+1}.

3.4 Incremental Learning Algorithm (ILA)

We modify our RPT [138] so that it can be used for incremental learning. The RPT has to be built dynamically as more input images come in, and whether each incoming image should be ignored or not must be decided on-line. Here we briefly describe our incremental learning algorithm; further details can be found in [12, 139].

Figure 3.1: The evolution of an incremental learning tree. In this example, three nonfixed nodes of the macro tree evolve into fixed nodes. (Legend: nodes of the micro tree; fixed nodes of the macro tree; nonfixed nodes of the macro tree.)

Two parameters P_1 and P_2 are chosen. When the number of images going to a node reaches P_1, the dominant eigenpair is computed using the power method and the node is fixed in subsequent learning; its child nodes are created. But the number of images flowing into a node may still be large. In order to speed up the retrieval, a second type of RPT, a micro tree, is introduced; the former RPT is therefore called a macro tree for easy discrimination. Both types of RPT are built recursively by projecting along the dominant eigenvector. P_2 is used to decide when to stop growing child nodes: when the number of images going to a node is less than P_2, the node stops growing and becomes a leaf node. The structure of an incremental learning RPT is shown in Fig. 3.1. There are three types of nodes in the RPT: fixed nodes of the macro tree, nonfixed nodes of the macro tree, and nodes of the micro tree. The number of images going to a nonfixed macro node lies between P_2 and P_1. When a nonfixed node gets fixed, the dominant eigenvector is computed and its child RPT's are rebuilt.
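This rebuilding of child RPT's relies on the reconstruction formula (3.12) from Section 3.3.2. The following NumPy sketch shows that reconstruction for one child node; the function name and argument conventions are hypothetical, and the parent mean is recomputed here for self-containment although the system stores it per node.

    import numpy as np

    def child_W(W_parent, U, idx):
        """Reconstruct a child node's W~ matrix from the parent's W matrix, eq. (3.12).

        W_parent : n x n matrix (1/n) * [(u_i - m)^t (u_j - m)] of the parent node
        U        : d x n matrix of the parent's raw samples (one column per image)
        idx      : indices of the samples that flow into the child node
        """
        n = U.shape[1]
        n_child = len(idx)
        m = U.mean(axis=1)                   # parent mean m
        m_child = U[:, idx].mean(axis=1)     # child mean m~
        A = n * W_parent                     # parent centered Gram entries A_ij
        b = (m - m_child) @ U[:, idx]        # (m - m~)^t u_j for the child's samples, O(n~ d)
        c = m_child @ m_child - m @ m        # constant term m~^t m~ - m^t m
        A_child = A[np.ix_(idx, idx)] + b[None, :] + b[:, None] + c
        return A_child / n_child

The cheap part is exactly what (3.12) promises: only the \tilde{n} inner products (m - \tilde{m})^t u_j need fresh O(d) work, while the A_{ij} entries are reused from the parent.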
Each new image, with its associated control signals, is checked with a cross-validation procedure to decide whether it should be ignored or not. A near neighbor of the input image is obtained by content-based retrieval from the learned RPT. If the retrieved control signals are within tolerance of the desired control signals, the input image is ignored; otherwise, incremental learning is conducted.

The following items are maintained for each node:

    fixed                            /* 1: fixed; 0: nonfixed */
    k                                /* number of images flowing into the node */
    *list                            /* index numbers of the k images */
    m_k                              /* incrementally updated mean vector */
    M_0                              /* mean vector for retrieving */
    U_k = [u_1, u_2, ..., u_k]       /* accepted learning images flowing into the node */
    W_k = \tilde{U}_k^t \tilde{U}_k  /* (1/k) \tilde{U}_k \tilde{U}_k^t is the covariance matrix */
    H                                /* dominant eigenvector */

The hierarchical partition structure of the RPT achieves efficient retrieval. The retrieval procedure is shown below.

RPT Search Algorithm: Given an input image u, find its corresponding control signal c_r.

    call retrieve(root, u);

    retrieve(node, u) {
      if (node is a leaf) {
        do a linear search among the images stored in node->U and find the nearest image
            u_r = arg min_i ||u - u_i||;
        return (c_r: the control signal associated with u_r);
      } else {
        if ( (node->H)^t (u - node->M_0) <= 0 )
          retrieve(node->leftchild, u);
        else
          retrieve(node->rightchild, u);
      }
    }

Figure 3.2: Update of the RPT when a new image is accepted. The search path is represented by dashed lines. The updates are done only at two nodes: the first nonfixed macro node on the search path and the leaf micro node. They are marked as "black nodes."

For each newly accepted learning image, the update of the RPT can be done in the way shown in Fig. 3.2. First we locate inode, the first nonfixed macro node on the search path, insert the image into inode, and then do the incremental update. If inode is not a leaf micro node, we find the leaf micro node on the search path and update it as well. Given the two fixed numbers P_1 and P_2, the following procedure builds the recursive partition tree (RPT) incrementally.

    1. Grab a new image u and get its associated desired control signal c.

    2. Retrieve the learned RPT by calling c_r = retrieve(root, u). If the retrieved control
       signal c_r satisfies the condition ||c - c_r|| <= T_c (where T_c is a tolerance
       threshold), u is ignored; go back to step 1.

    3. Incremental learning: locate the first nonfixed macro node along the search path of
       retrieve(root, u) and denote it as inode.
       For each node between inode and the leaf node on the search path of retrieve(root, u) {
         if (node == inode OR node is a leaf node) {
           node->k = node->k + 1;
           incrementally update node->m_k and node->W_k according to eqs. (3.8) and (3.9);
         }
         if (node->k == P_1 and node->fixed == 0) {
           compute the eigenvector associated with the dominant eigenvalue using the power method;
           rebuild the child RPT's;
           node->fixed = 1;
           node->M_0 = node->m_k;
         } else if (node->k >= P_2) {
           compute the eigenvector associated with the largest eigenvalue;
           build its two child nodes;
           node->M_0 = node->m_k;
         }
       }

3.4.1 Analysis

In the unusual cases when the child RPT's need rebuilding, the time complexity is at most O(P_1 d) + O(log(n)) + O(P_1^2 K), where d, n and K are the image dimension, the number of accepted learned images, and the number of iterations used for computing the dominant eigenvector, respectively. The first term is the time complexity of updating the W_k matrix, the second is the complexity of retrieving from the learned RPT, and the third is the complexity of computing the eigenvector using the power or RQCG method.
The RPT built using ILA has an important property that the accepted learned images are perfectly retrieved. Property 1 For each accepted learned image it, the content-based retrieving using retrieve(root, u) always locate the same image. Proof: In ILA, each newly accepted learned image is inserted to the first nonfixed macro node inode and the leaf micronode if necessary. This guarantees the exact retrieval before any new evolution of the learned RPT. As the learned RPT further evolves, at each fixed time, each accepted learned image has a fixed retrieved path determined by the tree rebuilding procedure. This always results in the same search path for each accepted image. 3.5 Experimental Results Our mobile robot, Rome, was built on a Labmate platform from TRC. It uses a SUN SPARC-l as the host computer. Rome was controlled to take pictures at different locations for our incremental learning. The corresponding corrected heading direc- tions were also recorded. Some sample images for incremental learning are shown in Fig. 3.3. The tested sites are indoor hallways with various turns. The lighting con- dition is relatively fixed in the tested hallway sections. To take into account lighting changes between daytime and night time, all images are preprocessed to have a zero mean and a unit variance. 93 We use the following notation in describing different versions of ILA, which are different in the computations for the dominant eigenvector and the W matrix com- putation: o ILA 0: our earlist version of ILA [12, 139] where the power method is used and the W matrix of a child node is computed by (3.11). o ILAI: improved version of ILA. The RQCG method is used and the IV matrix of a child node is computed by (3.12). o ILA2: further improved version of ILAl. The incremental RQCG method is used and the IV matrix of a child node is computed by (3.12). An initial dom- inant eigenvector of W} is computed with the nonincremental RQCG method when k first reaches P1 / 2. The later dominant eigenvector of the same node is computed incrementally. In our experiments, P1 and P2 were set to 50 and 10 respectively. These two num- bers were chosen experimentally. The first P1. accepted images were cross-validated for acceptance using linear search. The maximal iterations K r of the RQCG method is set to 30. Fig. 3.4 shows the timing records on a SparcStation-2 for learning the first 182 accepted images. From Fig. 3.4(b) we can see that the time for incrementally learning each accepted image is typically within 0.3 second. There are certain times when the learning takes about 1 second. It happens when the number of images flow- ing into a nonfixed macro node reaches P1 and the reconstructions of child RPT’S are recursively conducted. The Wk matrices and their associated dominant eigenvectors for all the nodes of child RPT’S require much computation; they have to be computed 94 Figure 3.3: Sample training hallway images. The upper row is a straight hallway, while the lower is a left turn. ,____MML_ 20 40 60 M N 120 ‘40 1Q 00 200 20 ‘0 60 so 100 MIN-9011mm uqumlhlhngo (a) ILAO: K, = 25. (b) ILAl: K, = 30. ‘20 ‘40 I” no I” nurtu- Figure'3.4: The timing record on a SparcStation-2 for learning a set of 182 images (those ignored images in incremental learning are not recorded. P1 = 50 and P2 = 10.) in a batch mode, which takes most of the time. The improvement of ILAl over ILAO is obvious. The structure of the learned tree with 180 accepted images is shown in table 3.1. 
The learned tree has 7 levels, including macro nodes and micro nodes. In order to evaluate the system performance, we tested the learned tree with a set of 328 images. The error histogram in the retrieved heading directions is shown in Fig. 3.5. Since there is no guarantee to retrieve the nearest image in the accepted 95 Level Macro nodes Micro nodes Nodes 0 1 0 1 1 2 0 2 2 4 0 4 3 2 6 8 4 0 16 16 5 0 20 20 6 0 4 4 Total 9 46 55 Table 3.1: The sizes of the learned RPT for a set of 180 accepted images. learned images, it is likely to obtain better performance by exploring P competitive paths and picking up the best one. For the comparison, we also show the error histogram of linearly searching the best match among the set of all 180 accepted learn images. From Fig. 3.5, we can see that the use of P competitive paths does provide better results. hum-an —.x_ Mun ‘- (a) P = (b) P = (c) Linear search. Figure 3.5: Distribution of absolute angular error (in degrees) of retrieved heading direc- tions. These three plots are based on queries using a test set of 328 images, of which 195 are from straight hallway and 133 from the two learned corners. The tested sites are shown in Fig. 3.6, where learned hallway sections and corners are marked with thicker lines. Our robot Rome was trained to go in both clockwise _ and counter—clockwise directions. In real navigation, Rome rejects sudden change in heading direction by assuming a smooth steering behavior. The driving speed is slowed down when the retrieved image comes from places around a corner. With 96 -—> Learned sections ' I.~ —-> Unleamed sections --A- - - - _ V ' Figure 3.6: The map of the tested sites, trained sections and untrained sections are marked with thicker and thinner lines respectively. Figure 3.7: Rome navigates automatically at two turns: the first turn in the lst row and the second turn in the 2nd row. the above trained tree, Rome has consistently performed well in our numerous ex- periments. Fig. 3.7 shows two sample sequences of images when Rome navigates autonomously in the hallway. To compare the response time of three versions of ILA, a set of 400 samples was used. By artificially assigning output vectors, the algorithm will always accept each new sample. The parameters were set as P1 = 150, P2 = 30. Fig. 3.8 shows the timing data of ILAO, ILA1 and ILA2. The maximal response time of three ILA algorithms are 7.35, 1.13 and 0.98, respectively. The improvement of ILA1 over ILAO is obvious. The further improvement of ILA2 by using the incremental RQCG method is less, 97 but still visible. As P1 increases, the improvement of ILA2 over ILA1 will increase since the time complexity for computing dominant eigenvector increases quadratically. The heights at x-coordinates 150, 284 and 316 in Fig. 3.8(a) are different, since the constructed trees have different degrees of balance. Three peaks due to reconstruction when the number of samples reach P1 are all visible in these three figures. lllllLl ...m .4 0 so 100 150 200 250 300 350 400 Win-90mm (a) ILAO: K, = 25. 12 W T V r Y I Y 09 ..l o a .0 o T— he (In seconds) 2 the (h wanna) o i.» O . .0 (a L . * 4 (b) ILA1: K, = 30. (c) ILA2: K. = 30, K,.- = 5. Figure 3.8: The timing records on SPARC-20/mode161 for learning a set of 400 images (P1=150, P2=30). 98 3.6 Conclusions The author has proposed improved versions, ILA1 and ILA2, over previous incre- mental learning algorithm ILAO, to reduce the maximal response latency. 
The ILA algorithms have been successfully applied to indoor mobile robot navigation. The incremental learning can be done on-line in real time. In various experiments the RQCG method converged much faster. It results in consistent, stable behavior and outperforms the power method. The eigen-subspace method for feature derivation in this chapter is applicable to linear discriminant analysis (LDA) which has been shown to work well for indoor navigation [138]. The RQCG algorithm can be used to solve generalized eigenvalue problem Ax 2 A82: without the computation of B‘1 [147]. Several issues need to be pursued further. For example, the scalability of improved ILA versions for increased P1 can be studied more carefully. The ILA algorithms make extensive use of vector processing, which can be greatly improved by the use of a vector processor. And finally, an interpolation of the outputs can be used to achieve smoother navigation behavior. 99 Chapter 4 State-Based SHOSLIF for Indoor Navigation Our earlier efforts of using SHOSLIF tree for indoor navigation have experienced some difficulties when more complicated situations are involved. When the number of corners or intersections was increased, the robot might fail to make a turn. The system could get confused around a corner: when to start a turn, and when did the robot really realize that it did enter a corner. State information is useful in solving the ambiguity problem under these situations. We also notice that visual attention mechanism is useful for solving the ambiguity problem. In this chapter, states and a simple visual attention are incorporated into our earlier SHOSLIF. 100 4.1 Introduction One difficulty we experienced in applying SHOSLIF-N to more complicated environ- ment is that learning more training samples from sample drives could cause conflicts which are hard to solve using a single-framed image. This problem could be solved by using states which keep the information about the relative position between the robot and the environment. The Markov chain and its variant the hidden Markov model (HMM) or partially observable Markov decision process (POMDP) have been successfully applied to indoor navigation by several researchers [66]. Koenig and Simmons [66, 111] applied POMDP to integrate topological and metric information for navigation of Xavier. Xavier maintained a probability distribution over its cur- rent pose and successfully handled the uncertainty intrinsic in navigation, including actuator and sensor uncertainty. Two kinds of inputs were used in [111, 66]: motor reports derived from odometer on-board the robot, and sensor reports generated from circularly deployed sonar sensors. A great gain in using state information is the capability of using visual attention to disambiguate globally similar but locally very different scenes. With the state information the robot is able to determine when to acquire local views from a global scene. We use a finite state machine to model the states in navigation. But the state transition probability is not directly estimated due to a high cost in estimating the probability distributions when high-dimensional visual inputs, rather than sonar and odometer sensory inputs, are used. The probability is retrieved by SHOSLIF and approximated by the distances to training samples. Since vision provides more 101 global information compared to sonar, fewer states are needed. 
Howvever, it should be noticed that states have little help in dealing with unanticipated events and hence some measures should be taken to deal with unexpected events. For example, in autonomous navigation, the mobile robot can be programmed to stop when it detects a novelty beyond a certain range. In the following sections, we show how state infor- mation can be naturally embedded into our SHOSLIF-N with the use of a recursive partition tree (RPT). To efficiently disambiguate states, attention windows, similar to virtual cameras in [58] are employed. In this chapter, the author presents a systematic framework through which sys- tem states can be defined and learned online to deal with situations where a stateless system can not handle, such as where visual attention is required. The learning-based approach allows the teacher to define states online during learning, instead of prepro- gramming control rules into a static control scheme. Thus, the same learning scheme can potentially handle more navigation scenes, without the need for reprogramming for each different scene. The remainder of this chapter is organized as follows: In Section 4.2, we show that the problem can be formulated as an observation-driven Markov model realized by the nearest neighbor regression. The nearest neighbor re- gression is computed efficiently with a stochastic recursive partition tree (SRPT), as shown in Section 4.3. Vision-based indoor navigation is discussed in Section 4.4. Sec- tion 4.5 presents some experimental results. Section 4.6 compares SHOSLIF-N with two ANN-based approaches. Finally, some conclusions and future work are discussed in Section 4.7. 102 4.2 State and Attention: Observation-Driven Markov Model One difficulty we experienced in applying an input-to—output mapping scheme, such as SHOSLIF [11, 139], or available neural networks, to more complicated autonomous navigation, is that learning more training samples from sample drives could cause ambiguities which are hard to be solved using a monolithic image. In Fig. 4.1, (a1) and (b1) Show two images taken around a corner. These two images are similar, which implies that they have a small overall Euclidean distance. However, they require very different actions: one going straight ahead while the other turning left. But if we look at the upper-left subregions marked with white boxes, we can see that the difference is more obvious. Moreover, from the correlation images in (c) and (d), we can see that attention images are more sensitive to the translation along column direction along which depth increases. Here the heights of two blobs in (c) and ((1) indicate the sensitivity of robot’s translation along column direction. Therefore, we can see the use of attention window, a window that extracts a part of view, if properly chosen, can make it easier in judging different stages of an action. This is well known as landmark selection issue. However, it is not trivial to incorporate this mechanism using a systematic learning method, without writing a separate program for every different scene. In our work reported here, we use the framework of observation- driven Markov model coupled with the high-dimensionality, real-time capability of SHOSLIF. Along straight sections of a hallway, normally global input images are sufficient 103 (C) ((1) Figure 4.1: Why states and attention? (a1) and (b1) show two images around a corner. White boxes in (a2) and (b2) are attention windows of (a1) and (b1) respectively. 
The attention images of (a1) and (b1) are further shown in (a3) and (b3). (c) and (d) show the correlation image of (a1) with (b1) and of (a3) with (b3), respectively, after zeros are padded to the periphery. (c) and (d) are resized to have the same spatial resolution.

for visual navigation. Attention windows are needed only for critical sections, e.g. corners and intersections. How does the system know when it is time to use local views, and how does it act differently for global and local views? This problem can be solved by using system states, which keep the history information.

Let Y_t be the outcome random covariates at time t. Let D_t be the present and past input images and the past outcomes, i.e.

    D_t = {X_t, X_{t-1}, ..., X_{t-p}, Y_{t-1}, Y_{t-2}, ..., Y_{t-p-1}}.

Define the regression function as

    m(d_t) = E(Y_t | D_t = d_t),                                                        (4.1)

which is an "observation-driven" Markov model (ODMM) [19, 149]. Our goal is to estimate Y_t given D_t. If p = 1, Eq. (4.1) is an observation-driven first-order Markov model and can be rewritten as

    m(X, Y) = E(Y_t | (X_t, Y_{t-1}) = (X, Y)).                                         (4.2)

The learning set consists of triples of the form (X_t, Y_{t-1}, \hat{Y}_t), t = 1, 2, ..., n, where n is the total number of learning samples, and X_t, Y_{t-1} and \hat{Y}_t are the current observation, the previous outcome and the current outcome, respectively. For efficiency under high-dimensional X_t, we use the nearest neighbor (NN) estimator to approximate the function m. Let the NN approximator be denoted as \hat{m}; then

    \hat{m}(X, Y) = \hat{Y}_i   if (X_i, Y_{i-1}) is the nearest neighbor of (X, Y) in the learning set.

Here X_i and Y_{i-1} are the input part, while \hat{Y}_i is the output part, of a training sample (X_i, Y_{i-1}, \hat{Y}_i). The nearest neighbor estimator [18] has been widely used in function approximation, y = f(x), i.e. the nearest neighbor estimation of E(y|x). It has also been used in the approximation of Markov time series. Nonparametric estimators, including the kernel estimator [145] and the nearest neighbor estimator, have been used for the prediction of Markov time series, E[X_{t+1} | X_t]. Yakowitz [144] successfully applied the nearest neighbor estimator to the prediction of rainfall/runoff time series. Under certain assumptions, the convergence properties of the nearest neighbor estimation of E[X_{t+1} | X_t] for Markov time series have been proved (cf. Yakowitz [144] and references therein). Here we apply the nearest neighbor estimator to approximate the observation-driven Markov model.

The Markov chain and its variants, the hidden Markov model (HMM) and the partially observable Markov decision process (POMDP), have been successfully applied to indoor navigation by several researchers. Koenig and Simmons [66] applied a POMDP to integrate topological and metric information for the navigation of Xavier. We note the difference between ODMM's and the popularly used hidden Markov models: normal hidden Markov models [100] are stationary, i.e. the transition probability does not depend on the input, while the transition probability of an observation-driven Markov model explicitly depends on the observation at each time step. Therefore the observation-driven Markov model is nonstationary.

In our vision-based navigation, three outcomes are needed: the state (s), the control signal (c) for the mobile robot, and the visual attention signal (a) used to choose the attention window. The observation is a preprocessed input image (I) at each time step. Thus, in our model, X_t = I_t and the fed-back previous outcome is the state, Y_{t-1} = s_{t-1}.
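As a concrete illustration of this estimator, the lookup can be realized as a nearest-neighbor search over stored (image, previous-state) pairs. The sketch below uses a brute-force scan instead of the SRPT introduced later, and the names and the state_weight parameter are hypothetical choices for the illustration only.

    import numpy as np

    def odmm_nearest_neighbor(samples, image, prev_state, state_weight=1.0):
        """Nearest-neighbor approximation m^ of the observation-driven Markov model.

        samples    : list of (x_i, y_prev_i, outcome_i), where x_i is a flattened,
                     normalized image, y_prev_i a one-hot previous-state vector, and
                     outcome_i the recorded (state, control, attention) outputs.
        image      : current preprocessed image I_t (flattened)
        prev_state : one-hot encoding of s_{t-1}
        """
        query = np.concatenate([image, state_weight * prev_state])
        best, best_d = None, np.inf
        for x_i, y_i, outcome_i in samples:      # brute force; the SRPT makes this logarithmic
            d = np.linalg.norm(query - np.concatenate([x_i, state_weight * y_i]))
            if d < best_d:
                best, best_d = outcome_i, d
        return best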
Three nearest neighbor estimators are needed for the following regressions:

    m_s(I, s) = E( s_t     | (I_t, s_{t-1}) = (I, s) ),
    m_a(I, s) = E( a_{t+1} | (I_t, s_{t-1}) = (I, s) ),                                 (4.3)
    m_c(I, s) = E( c_{t+1} | (I_t, s_{t-1}) = (I, s) ).

Among them, only the first equation is an observation-driven Markov model, since its outcome is the next state; the other two are ordinary mapping functions without recurrence of variables. The overall architecture for vision-based navigation is shown in Fig. 4.2. A single SRPT tree is used for the simultaneous nearest neighbor estimation of s (state), c (control signal) and a (attention signal). The current input image I_t and the previous state s_{t-1} are used to derive the next control signal c_{t+1}, the next attention signal a_{t+1}, and the current state s_t. During real navigation, c_{t+1} and a_{t+1} are used, respectively, to control the next motion of the mobile robot and to extract the following attention windows from the video camera.

Figure 4.2: The architecture of state-based SHOSLIF for vision-guided navigation. "Att" stands for "attention control."

4.3 Stochastic Recursive Partition Tree for Nearest Neighbor Regression

A challenge here is that I_t has a very high dimensionality, typically at least a few thousand. We briefly describe how SHOSLIF [139] addresses this challenging problem.

4.3.1 Navigation as a content-based retrieval problem

A navigator can be regarded as a complicated function which maps a high-dimensional input (image and state) into the corresponding low-dimensional output (control signal, action, the new state, etc.). In the learning phase, a set of training images is used to build a recursive partition tree (RPT), in which each learned sample records a desired input-output pair. At each inner node of the RPT, an eigen-subspace method is used to extract characteristic features from the high-dimensional input space for further partition of the input space. This process is carried out recursively down the tree until the current node contains only samples with a similar output signal. In the performance phase, each newly grabbed input is used to retrieve the best-matched input from the trained RPT, and the output signal associated with the retrieved input is used as the desired output.

We chose binary partition for its simplicity and speed. At each inner node, the further space partition is based on the projection of the inputs onto the principal component vector (from PCA) of the sample inputs assigned to the node, which captures the largest variance in these inputs. Therefore, at each inner node, we only need to obtain the eigenvector associated with the largest eigenvalue. This leads to our efficient on-line incremental learning algorithm [139].

The above learning process was done incrementally. First, the data covariance matrix C was estimated from the n input samples x_1, x_2, ..., x_n as C = (1/n) \sum_{i=1}^{n} x_i x_i^t. Then the dominant eigenvector of the estimated C was computed by either the power method or the Rayleigh quotient conjugate gradient (RQCG) method. However, the dominant eigenvector can also be estimated without estimating the covariance matrix C as a prerequisite. Several authors have addressed this problem using stochastic approximation, which has been shown to be especially useful when high accuracy is not required and only a few eigenvectors need to be estimated. Here we apply Oja and Karhunen's method [92] for the stochastic approximation of the dominant eigenvector.
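As a preview of the iteration stated formally in (4.4)-(4.5) below, the following is a minimal NumPy sketch of this per-sample update. The learning-rate schedule mu0/k is a hypothetical choice that satisfies the step-size conditions of Theorem 1 and is not necessarily the schedule used in the system.

    import numpy as np

    def oja_dominant_eigenvector(samples, mu0=0.1, rng=None):
        """Stochastic approximation of the dominant eigenvector of (1/n) sum_k x_k x_k^t,
        using the normalized Hebbian update u <- (u + mu_k * x_k x_k^t u) / ||...||."""
        rng = np.random.default_rng() if rng is None else rng
        u = rng.standard_normal(samples.shape[1])
        u /= np.linalg.norm(u)
        for k, x in enumerate(samples, start=1):
            mu_k = mu0 / k                    # mu_k >= 0, sum mu_k = inf, sum mu_k^2 < inf
            u = u + mu_k * x * (x @ u)        # A_k u with A_k = x_k x_k^t   (eq. (4.4))
            u = u / np.linalg.norm(u)         # renormalize                  (eq. (4.5))
        return u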
For stochastic matrices A_1, A_2, ..., with finite constant mean A = E{A_k}, Oja and Karhunen's iteration can be represented as follows:

    v_k = u_{k-1} + \mu_k A_k u_{k-1},                                                  (4.4)
    u_k = v_k / ||v_k||,                                                                (4.5)

where \mu_k is the learning rate and ||.|| denotes the l_2-norm of a vector. In our computation of the dominant eigenvector of the covariance matrix A, A_k is set to the outer product of the sample x_k, that is A_k = x_k x_k^t, following Oja and Karhunen [92]. The convergence of the above stochastic approximation has been proved [92].

Theorem 1 (Oja and Karhunen, 1985) Under the following assumptions:

1. Each A_k is almost surely bounded and symmetric, and the A_k matrices are mutually statistically independent with E[A_k] = A for all k.

2. The largest eigenvalue of A has unit multiplicity.

3. \mu_k >= 0, \sum_k \mu_k^2 < \infty, \sum_k \mu_k = \infty.

4. Each A_k has a probability density bounded away from zero uniformly in k in some neighborhood of A in R^{n x n}.

Then u_k tends to the eigenvector associated with the largest eigenvalue almost surely as k -> \infty.

For each inner node, the incremental learning of a new sample can be conducted recursively. When learning a new sample, the mean and the dominant eigenvector are updated incrementally; for the corresponding child nodes, only the changes caused by this update need to be processed. Let old_list and new_list be the old and new sample lists going into a child node, respectively, and let m_k and H_k be the mean vector and dominant eigenvector of the child node. Then the processing for each child node can be done recursively as shown below:

1. Accept new_list from the parent node.

2. Form d_list and i_list: d_list = old_list - new_list; i_list = new_list - old_list.

3. Delete the samples in d_list, updating m_k and H_k.

4. Insert the samples in i_list, updating m_k and H_k.

5. Obtain new_list for the child nodes.

4.4 A Case of Indoor Navigation

Compared with outdoor road following, vision-based indoor navigation faces similar challenges, because there are no stable features (e.g. floor edges) and no stable contrast pattern along typical navigation paths. For our indoor navigation, the corridor types are shown in Fig. 4.3. In this figure, we use the following brief notation to indicate corridor types: "L" for left turn; "R" for right turn; "LZ" for left Z junction; "RZ" for right Z junction; "X" for four-way intersection; and "S" for straight corridor. For each corner or intersection, the situations are further grouped into states: approaching, entering, and exiting the corner or intersection. To reduce the number of
For this arrangement, the attention signal needs only a binary flag to specify whether global view or local view will be used. 1 1f St 2 1, at+1 :- (4.6) 0 otherwise. Koenig and Simmons [66] use sonar and odometer sensory inputs as observation in their HMM model. They define states on the spatial grid of the hallway paths. Usually a large number of states are needed. If similar HMM is directly applied to our 112 Figure 4.4: Observation-driven Markov model for indoor vision-based navigation, where :r/y is used to represent a state a: with associated action y. ‘a’ or ‘g’ are used to indicate either two local attention views or a single global view is used. Associated with each arc is the current observation, either a local view or a global view. vision-based navigation, the complexity in training HMM will be very high, since here the input dimensionality with image is much higher. Our vision-based navigation task is much more challenging than sonar-based navigation, since the visual images does not relate to control as directly as range data. On the other hand, with the use of visual images the number of states can be greatly reduced, since visual input provides richer information and the system does not need to remember odometer information. For the ODMM model shown in Fig. 4.4, its corresponding representation in our case is shown in Fig. 4.5. Each image It is preprocessed to have zero mean and a unit lg-norm. So to S5 are binary fields representing state 0 to 5 respectively; 3, is the predicted current state; at“ is the predicted attention selection, which is a binary flag indicating whether global view or attention window will be used; 0H1 is the associated output control signal for current (It, st_1) pair. The fields marked with solid boxes are concatenated to form a single long 1D vector as input to SRPT. Those fields in 113 Input part Output part A / r _ '__ _ [ Preprocessed raw image I So I 31 132 153 l 34 [35 ]| :‘JLaillfilitzil Figure 4.5: The representation of input and output parts. dashed boxes are either predicted or associative quantities. We need to collect quadruples of (st_1, It, 3,, ct). The control signal of mobile robot is the corrected heading direction. We need to train all the state transitions in our observation-driven Markov model. 4.5 Experimental Results Our mobile robot Rome (RObotic Mobile Experiment), built on a TRC labmate, was used to test our algorithms. In our experiments, we trained the robot to navigate along the loop shown in Fig. 4.6, which is on the third floor of MSU Engineering Building. The floor plane covers an area of approximately 136 x 116 square meters. We have five types of corridor structure shown in Fig. 4.3: “L”, “R”, “LZ”, “RZ”, ‘T’ and. “S”. “X” type in Fig. 4.3 was ignored without affecting the performance. We tested our algorithm on the third floor of our Engineering Building. The loop includes a “L”, a “LZ”, a “R2” and three ‘T’ junctions. Some sample training images around these corners and junctions are shown in Fig. 4.7. These six corners or junctions are very typical for indoor navigation. Their appearances and structures are quite different, and the widths of straight corridor segments are different too. Navigation using only vision in this environment is challenging for any method that uses predefined features such as floor edges. 114 10m I b ‘, F— - Figure 4.6: A map of the test site. The test loop is indicated by thick solid lines. Six different corners or intersections are trained in our tests. 
They are marked with bold-faced arabic numbers. ‘ Figure 4.7: Some sample training images around corners or intersections. Six sets of four consecutive images are from six different corners or intersections in the test loop. 115 The training samples were collected under human control from sample drives. At each site, an input image, previous state, current state and its associated output control signal were obtained. The input image was normalized to have a zero mean and a unit lg—norm to suppress the effect of absolute lighting brightness and contrast variation to some degree. Then this preprocessed raw input image and the previous state information were used to query the SRPT. If the retrieved sample had the same current state and its associated heading was within tolerance, the new sample was discarded without being learned; Otherwise, incremental learning was conducted for this training sample. First we collected a reasonable set of training samples around each section of the loop. Then we tested our algorithm, letting the system continue to learn at each location. When all the presented samples are rejected, Rome can be set free. Rome normally roams at a speed of 40 cm/s. The robot slows down when the correction in heading direction is larger than 10° or the robot enters a non-straight state. In order to remove the spurious state transitions without resorting to higher order Markov models, a voting scheme was used. We kept the state history up to five steps. The histogram of history states was computed. The state transition was confirmed only when the number of votes was more than two; otherwise the current state was unchanged. Fig. 4.8 shows sample state transitions of more than two passes along the tested loop. Around each corner or intersection, a transition to state “A” always precedes the confirmed corner or intersection state. From Fig. 4.8 we can see that the smoothing scheme did help in achieving a stable navigation behavior. 116 5 5 g 5 § 5 l 4 4 4 4 '1 3 i=3, 3 3 3 6 1 2 6 1 ‘0 1 2 6 1 2 6 1 . 2. i I [ ‘ 1" ‘ .Illl] [ll] i..i[]i[] [] ]] [l]] [l]] 500 1000 1500 o 500 1000 1500 Sequential time step Sequential time step (a) Raw output states. (b) Output states after smoothing. Figure 4.8: Sample state transitions covers more than two passes of continuous running along the tested loop with six corners or intersections, which are labeled with digits in this plot. Some sample control signals with associated states are showed in Fig. 4.9. The robot turned left in Figs. 4.9(a) and (c). In Fig. 4.9(b), the robot turned right then left around corner 4, which is a “RZ”. From Fig. 4.9(d), it can be observed that along a straight section, the robot turned with much smaller magnitude which resulted smooth navigation behavior. Without the use of states, stateless SHOSLIF experienced difficulties in guiding the robot to navigate through the tested loop. It usually failed to make the 4th corner, which is a “RZ”. Fig. 4.10 shows the scenario with plots collected from sample navigation trajectories. Fig. 4.10(a) and (b) show two failure cases when states were not used. Fig. 4.10(a) shows a case when the robot retrieved images with left turn around the critical turning points, since the visual appearances around these positions are similar to some other images around a left turn, e.g. the 6th corner. The robot turned left and then failed to make the turn. Fig. 4.10(b) shows another case where the robot underturned right and failed to make the 4th corner. Fig. 
4.10(c) shows how the robot performed after the state information was incorporated: the robot 117 50 E 0 O o '50 20 40 "60 80' ' 100 2" £2 4 (D 20 40 60 80 100 Sequential time step (a) Around corner 2 (“L”) 50 6 .3 OWHJWMEWL O o _50 1 L 0 50 100 150 24. i G r 0 so 100 Sequential time step (c) Around corner 6 (“L”) 150 -50 1 A 1 Control 0 ET? WWW“ O 20 4O 60 80 100 120 140 160 T Y 1' o 20 40 60 100 120 140 160 Sequential time step (b) Around corner 4 (“RZ”) 50 v v :5 8 oWH—MLW _50 1 A l 1 J 0 10 20 30 40 50 60 70 80 90 2“” ‘ m 632‘ ‘ G l J l_ I l l L I 0 10 20 30 40 50 60 7O 80 90 Sequential time step ((1) Between corners 6 and 1 (“S”) Figure 4.9: Some sample control signals and the associated states. The control signals plotted in this set of figures are corrected heading directions in degrees. (a) and (c) are cases of left turn. (b) is a “RZ”, which has a right turn followed by a left turn. (0) is a straight section between corners 6 and 1. We can see that the change in corrected heading along straight section is smaller. 118 Figure 4.10: Sample trajectories around the 4th corner show how states can help the robot in navigation. (a) and (b) show two failure cases when states were not used. “X” in (a) or (b) indicates an upcoming collision with the wall. (c) shows the robot successfully made the turn when states were used. Figure 4.11: Rome navigates autonomously around corner 5. The first five images are video sequences taken from behind the robot. The last three images are taken from front of the robot. successfully made the turn with state-based SHOSLIF. The states were very helpful in disambiguating critical scenarios. Fig. 4.11 shows a sample run around the fifth corner. Fig. 4.12 shows the mo- bile robot’s view sequence during real navigation around corner 5. Each image was grabbed at every single time step. The content-based retrieval of each (It, s¢_1), a pair of current image and previous state, was performed at a frequency of 6 HZ on the onboard Sun Spare-1 of the 119 Figure 4.12: The mobile robot’s view sequence around corner 5 during a real navigation. Each image was taken at every time step. This sequence shows the robot approached the fifth corner, successfully made the turn and entered straight section. robot. The incremental learning was conducted using the onboard SPARC-l. In a batch training experiment, a set of 272 samples took about 26 seconds on a SUN Spare-10. The time record is shown in Fig. 4.13. The maximal response time per learning sample is within 1 second. For most of the training samples, the incremental learning took less than 0.3 second. We have conducted test runs to observe the performance stability of the learned robot. One pass of autonomous navigation along the tested loop took about 20 minutes. The robot was observed to continuously run for longer than 5 hours sei/eral times before the onboard batteries which provided the power were low. In dozens of such tests conducted, so far the robot all performed flawlessly. 4.6 Comparison with Two ANN-based Ap- proaches To show how SHOSLIF performs compared with other alternative methods, we com- pared it with feedforward neural networks and radial basis function networks. We 120 s" send-i .. .. (a) .,. Time .0 N lullfll ] 'H“; l- 1 l .0 d 0 50 100 . 150 200 250 300 Sequential Image number Figure 4.13: Time record of incremental learning a set 272 training images. The learning time was recorded on a SPARC-IO. used two separate sets of data for comparison. 
Both sets of data were collected on two separate time on the 3rd floor of our Engineering Building. Training with either set of training sample results a tree which leads to successful performance when tested in the trained loop. Set 1 and Set 2 contain about 500 and 300 samples of (It, st_1, 3,, ct) respectively. Set 1 is more redundant and covers more scenarios. To map (I¢,st_1) into current heading ct, we used the same output scheme as ALV INN and ROBIN: the output pattern of each training sample is a Gaussian distribution peaked at the desired corrected heading. In our simulation, the output layer has 21 to 31 nodes with a resolution of two degrees apart. A sample training inputeoutput pair is shown in Fig. 4.14. After the neural network is trained, the simulated output heading is taken as the peak of a Gaussian fit to the outputs of neurons at output layer. 121 I— Figure 4.14: Sample training input—output pair: the top part is a 30 X 40 image. The second row is the associated output signal, which is a Gaussian distribution peaked at the corrected heading direction. The last row is the binary state fields, which has a one in first fields and zero in other fields. 4.6.1 Simulation of a feedforward neural network For simulation of a two-layer feedforward neural network, we used “trainbpx” with adaptive learning rate available in the MATLAB neural network toolbox. Similar to ALVINN, we used four to nine hidden nodes in our simulation. Our simulation showed that it was very unlikely for multilayer perceptron (MLP) to converge to reasonable weights if starting from initial random weights. Three sets of data in Section 2.7 are used to study the behavior of a two-layer perceptron. Set A Real images from straight hallway sections: a set of 100 (30 x 40)-pixel images with five different corrected heading directions: —10°, —5°, 0°, 5° and 10°. Each of these five classes has 20 samples. Set B A full set of 318 (30 x 40)-pixel real images are used. This set of images are used in our training of SHOSLIF-N for vision-based navigation. It includes 210 images from straight hallway sections and 108 images from corner sections. Images were collected from corners 1 and 2 and three straight sections connected 122 to these two corners. Set C Synthesized data set: Given a trajectory map, synthesized sample images are generated with different orientations and translations. For a square trajectory, a typical set consists of 100 (30 x 40)-pixel synthetic images with five different headings: —10°, —5°, 0°, 5° and 10°. When the size of problem was small, i.e. the number of images was small and the size of image was also small, the training of feedforward network went successfully: it found a reasonable solution within a few trials starting from initial random guesses. But we found that when the input dimension was high, e.g. 30 x 40, the neural network would not converge to reasonable solutions with random initial weights. We ran our MATLAB script files for 100 times, each time with a different random guess of initial weights. The training epoch was set to 10,000, large enough to converge to a local minimum. Fig. 4.15 shows the training record for Set A. Here we use four hidden nodes as indicated in [95]. Fig. 4.15 shows that the weights do sometimes converge to a good solution for the training set, although only a few trials provide solutions with small sum-of—squares (SSE) errors as shown in Fig. 4.15(a). 
Our simulation with Sets B and C got consistent results: the learning process can easily get stuck to poor local minimums. Therefore we used principle component regression (PCR) [35] to provide the initial weights: the weight of first layer was initialized with the principal components of the input patterns, while the second layer was initialized with linear regression between desired outputs and the response of hidden layer. Initialization MLP using PCR led 123 Sum of square error q "o 20 4o 60 so 100 Trial (a) Sum of square error for 100 trials. . v , A, , \‘ Aihgig 7., 4.4VVA. 7A 0 5 10 15 20 25 0 Z I 5 B IO ‘2 u 16 II Nodooloupul layer Atacama error (In degrees) (b) Target outputs. (c) Error histogram. Am 20 —2'5 0 s A“ to 15“ Nodoutmhyu IO "'13 ‘20 Nodeoiomptnlayor (d1) Outputs of class 0. (d2) Outputs of class 1. ..l 5:] it T '20 3.5 o s 20 25 ”o 5 20 25 IO 15 Nodeolml-yu (d3) Outputs of class 2. (d4) Outputs of class 3. (d5) Outputs of class 4. ID 15 Madonna layer ‘0 I 5 Node a! oupul lay-r Figure 4.15: Training with Set A. 100 trials with different random guesses of initial weights. (a) shows the record of sum of square error (SSE). The trial with minimal SSE is marked with a circle. (b) shows the target output for five classes of training samples with different corrected headings: —10°, —5°, 0°, 5° and 10°. (c) is the error histogram of retrieved heading directions. (d1) to (d5) show the network output overlaid on target output for each class: simulated outputs and target outputs are plotted in solid lines and dotted lines respectively. 124 25 ' 0 5 10 1‘5 Node of output layer Figure 4.16: ’Itaining with Set A using PCR for initialization of weight. The network outputs 100 training images are plotted with solid lines, while the desired target outputs are plotted in dashed lines for comparison. to much better convergence in our experiments. One example is shown in Fig. 4.16. Set A was trained with 1000 epoches. The learning process converged quickly and resulted in very small sum~of—square error. We also notice that the network outputs initialized with PCR are much smoother and the network output corresponding to each class forms a compact class. For comparison, the weight images of a two-layer perceptron are shown in Fig. 4.17. The set of 318 images used in chapter 2 was used for training. Four hidden nodes and twenty-one output nodes are used. The initial weights of the first layer using PCR are the first four principle components of input patterns, shown in Fig. 4.17(a). The final weights of the first layer after back—propagation learning are shown in Fig. 4.17(b). For comparison, the first four eigenvectors from linear discriminant analysis are shown in Fig. 4.17(c). It is clear that learning with back—propagation can get similar results as linear discriminant analysis, if the two-layer feedforward network is properly ini- tialized. 125 (a) Initial weights of the first layer. (c) First five eigenvectors of linear discriminant analysis Figure 4.17: Comparison of a two-layer perceptron with principle component analysis and linear discriminant analysis. The two—layer perceptron is initialized with principle component regression approach described in the text. 4.6.2 Simulation of a radial basis function network The implementation of ROBIN is different from a typical RBF network in the follow- ing aspects: ( 1) Normalization. The responses of the receptive fields are normalized to have unit sum response for all neurons in the first layer. 
This normalization im- proves the interpolation capability, as shown by references sited in [106]. (2) Center selection. Rosenblum and Davis [106] used K-means clustering algorithm for center selection but the results were not satisfactory. So they used “forced-clustering” by manually assigning subset of input patterns as centers. The author did the same way for comparison. Here orthogonal least squares (OLS) [10] is also used to do the automatic center selection for better performance. 126 4.6.3 Comparison with two ANN-based approaches We used Set 1 to for training and Set 2 for testing. The error histograms are plotted in Fig. 4.18, for all three different approaches. SHOSLIF always gets perfect retrieval for the training set, so the error histogram concentrates on a single bin with zero CI'I‘OI‘. Since we need to study both the accuracy of retrieved heading and that of state- prediction, the accuracy of state-prediction is studied for each of the three approaches. Here the current image and previous state (It,s,_1) are mapped into current state 5,. This mapping is classification, rather than approximation which maps (It, 3,-1) into current heading ct. Therefore networks for classification, instead of approxima- tion, should be used. A MLP with a linear output layer [132] is known to perform discriminant analysis or classification. We used a three-layered perceptron, with two hidden layers, for state prediction. The number of output nodes is the same as the number of states, which are coded as binary patterns. For network output, the node with the highest response corresponds to the predicted state. The number of nodes in first hidden layer ranges from 4 to 15, while the number of nodes in second hidden layer is smaller, with a range from 4 to 10. RBF can also be used for classification. Similar to MLP, the states are coded as binary patterns and the number of output nodes is the same as the number of states. The output node with the highest response gives the predicted state. The centers are again selected with orthogonal least squares [10]. We used one set of data for training and the other for testing, then reverse the 127 0.9 0.7 50.6 3 0.5 l: 0.4 0.3 < 02 i 0.1 4 oo 2 4 s 9101214161820224262330 Absolute error (In degrees) o 2 4 e 81012141619202224262330 Absolute error (in degrees) (a) State-based SHOSLIF. 0 2 4 s 61012141613202224262030 Absolute error (in degrees) o 2 4 e 810121416.1820222428283‘6 Absolute error (in degrees) (b) A feedforward network. o 2 4 e 81012141618202224262030 Absolute error (in degrees) (c) A RBF network. 0.25 Frquency o 2 4 a 810121416_18202224262830 Absolute error (in degrees) Figure 4.18: Comparison of three approaches: SHOSLIF, the best feedforward network and the best RBF network we obtained. The error histograms for the training set are shown in the left column while the error histograms for the test set are shown in the right column. 128 order of the training set and the test set for cross-validation. The accuracy of state prediction is reported in Table 4.1. For MLP, it has a tendency to overtrain with the training set, which usually results in poor generalization performance for the test set. Therefore, the number of epochs is determined by a simple cross-validation process: check the state prediction accuracy of both the training set and the test set. The number of epochs which leads to balanced performance for both the train and test sets was chosen. The images of Set 2 are shown in appendix B. 
Since Set 1 is more redundant than Set 2, training with Set 1 leads to better performance when tested with the disjoint set. From Table 4.1 it is clear to see that state-based SHOSLIF performs better than both MLP and RBFN. A conclusion can be drawn: state-based SHOSLIF is better suited for the tasks here. Train Test MLP RBFN SHOSLIF Setl Setl 80.71% 90.64% 100.00% Setl Set2 73.16% 80.88% 89.08% Set2 Setl 66.10% 79.59% 87.09% Set2 Set2 85.66% 84.56% 100.00% Table 4.1: Comparison of three approaches in state prediction. The response speeds of these three approaches are reported in Table 4.2. After training with each learning mechanism with Set 2, the response time in milliseconds was recorded on a SUN SPARC-lO. The recorded time is the CPU time spending on mapping into to output, not including time spent on grabbing images. When using ANN-based approaches, the learned parameters were loaded from MATLAB, but the response time was obtained using C programs for network mapping. A qualitative comparison among MLP, RBF N and SHOSLIF is summarized in 129 MLP RBFN SHOSLIF Response time 5.9 ms 84.0 ms 28.3 ms Table 4.2: Comparison of three approaches in response time. Table 4.3. All three approaches are suited for hardware implementation. The major power of SHOSLIF comes from its local adaptive nonparametric regression, which is more flexible than global parametric regression. Recursive partitioning provides a good tradeoff between speed and performance. Projection pursuit regression [52] is another well known adaptive algorithm for function approximation, but with very high computational complexity. Friedman [37] provides an excellent discussion on this subject. Both MLP and RBF N use global parametric regression and try to minimize the global least squares error. When a learned MLP is exposed to more scenarios, the network could perform deteriorately for the learned samples. This is termed as “memory loss” problem. RBFN for navigation has been shown to experience fewer memory loss problems [106]. With SHOSLIF, memory loss problem is further alleviated due to its use of local nonparametric regression. Local Para. Hierarch. Time Space Corr. Increm. global compl. compl. conv. learning MLP global yes non/ few O(d x h) O(d x h) trials yes PCR mem. loss RBFN global yes non/ few O(d x c) O(d x c) OLS hard SHOSLIF-N local no full O(d x log n) O(d x 71.) yes feasible Table 4.3: Qualitative comparison of three different approaches. The following notations are used: “local/ global” for local regression/ global regression; d for input dimension; h for the number of hidden nodes in a MLP; c for the number of centers in a RBFN; n for the number of learned samples. The following abbreviations are also used: “Para.” for parametric regression; “compl.” for complexity; “Hierach.” for hierarchical; “Corr. conv.” for correct convergence; “Increm.” for incremental; “mem.” for memory. With MLP or RBFN, our experiments showed that there was few problems in 130 training the network to make a single turn. But when the network was exposed to several corners or intersections, things were different. Both MLP and RBFN had very hard time in navigating along the tested loop, especially around corners 3 and 4 in Fig. 4.6. ALVINN has experienced similar problems when exposed to various road conditions. Pomerleau [95] used a rule-based method to arbitrate individual networks trained for specific roads. The integration or arbitration of several networks is a difficult issue. 
Here we used state information to solve the problem in our navigation. No explicit rule for network arbitration was involved. According to our experience in simulation, SHOSLIF is advantageous in at least two aspects: (1) ease of training; (2) good performance after it is trained. Al- though the feedforward network could give reasonable solution, many random trials are needed for initial weights if random initial weights are used. The RBF network does provide comparable performance. When most of the training samples are used as centers, the RBF network will act like a nearest neighbor estimator. But then the computation complexity will be prohibitive when the number of training samples is large. On the other hand, our SHOSLIF-N has to pay extra cost in storing training samples. It seems that the payoff from extra storage enables state-based SHOSLIF to reach a good performance accuracy with a real-time speed. The training is also faster. 131 4.7 Conclusions and Future Work The characteristics of the proposed algorithm can be summarized as follows: (1) An observation-driven Markov model for vision-based navigation. The state informa- tion and visual attention have been incorporated into the framework to deal with more complex scenes. (2) The ODMM is realized by a nonparametric approach—the nearest neighbor regression. A stochastic recursive partition tree is used for efficient nearest neighbor estimation for the high—dimensional visual data in real-time. The overall learning algorithm is a local nonparametric adaptive regression, which exhibits more flexibility than global parametric regression used in both MLP and RBFN. (3) The appearance-based method. Compared with other navigation approaches using sonar or odometer sensors, the use of visual image provides richer information and hence the number of states is greatly reduced. Some future researches will be conducted: (1) More extensive tests will be con— ducted, especially under different navigation environments. (2) The use of direction commands which allows to make different turns at the same intersection. This can be done using our current framework by allowing states to include information about the desired turning direction. The feasibility of the framework can be further utilized in these future studies. 132 Chapter 5 Conclusions 5.1 Summary This dissertation presents a learning-based visual navigation scheme. The navigator is treated as a content-based retrieval system which directly maps a preprocessed input image into an output steering signal. In contrast with global parametric regres- sion used by current ANN-based approaches, nonparametric recursive partitioning regression is used in this dissertation. Nonparametric recursive partitioning regres- sion is more flexible than global parametric regression. Projection pursuit regression can also be used to approximate the input-to—output mapping but has a very high computational complexity. Nonparametric recursive partitioning regression [37] used in this dissertation is a good tradeoff between speed and performance. The nonparametric recursive partitioning regression is realized by a recursive par- tition tree (RPT). In the training phase, principle component analysis or linear dis- criminant analysis is used at each inner node to automatically derive the best feature 133 for further partition. In the autonomous navigation mode, each newly grabbed image is used to do the content-based retrieval. 
The steering control signal associated with the best matching learned sample is used as the next control signal. The system has been successfully tested in indoor navigation. Since only the most dominant eigenvector of PCA or LDA is needed in binary partition, an on-line incremental learning scheme is developed with very low compu- tational complexity. This online incremental learning algorithm greatly facilitates the training, compared with previous batch-training mode. State information is quite useful in navigation. In chapter 4 state information is incorporated into the system, termed as “state-based SHOSLIF-N.” State information and a simple visual attention mechanism are naturally incorporated into the system. The system is modeled as an observation-driven Markov model (ODMM), which is again realized by a recursive partitioning regression through an RPT. Using a set of fewer than 300 learning samples, the state-based SHOSLIF-N has been successfully tested on the second and third floors of our Engineering Building. 5.2 Contributions The major contribution of this dissertation can be summarized as follows: 0 Nonparametric recursive partitioning regression which directly maps prepro— cessed input images into output steering signals for vision-based navigation. Recursive partitioning regression of SHOSLIF-N is different from global para- metric regression used in either ALVINN or ROBIN. Recursive partitioning 134 regression uses local regression splines and exhibits more flexibility. 0 An online incremental learning algorithm with low computational complexity. In SHOSLIF-N, only the most dominant eigenvector of principle component analysis or linear discriminant analysis is computed at each inner node of RPT. This fact is used in Chapter 3 to develop an efficient incremental learning algo- rithm which greatly eases the training for vision-based navigation. 0 State-based SHOSLIF-N incorporates state information and a simple visual attention mechanism. State-based SHOSLIF-N is modeled as an observation-driven Markov model (ODMM) which is implemented via a recursive partition tree. State information helps the system in disambiguating scenarios with similar visual appearances but different steering signals in autonomous navigation. State information and visual attention provide the possibility for the system to work in more compli- cated situations. They are naturally incorporated into the system. State-based SHOSLIF-N does not need an extra arbitration layer to determine which fea- ture tracking algorithm or which navigation network suited for a trained path should be used, as in YARF [119] or MANIAC [56]. 5.3 Future Work The research work done in this dissertation shows that direct input-to—output mapping using nonparametric recursive partitioning regression is successful in indoor naviga- 135 tion. However, some issues remain to be explored: o Obstacle avoidance. In this dissertation, the author concentrates on path-following for indoor navi- gation. One natural question is how to extend the framework to make the robot avoid obstacles during navigation. In practice, depth information can be more useful in obstacle avoidance. For example, a stereo algorithm or direct range measures from sonar or infrared sensors may be more appropriate for obstacle avoidance. e The lighting problem and outdoor navigation. Lighting has been a major challenge for vision-based navigation, especially for outdoor navigation. In our tests of indoor navigation, the ambient lighting does not change much. 
A simple normalization scheme which rescales each input image to zero mean and a unit variance is used to deal with possible changes in ambient lighting. The normalization scheme works for our indoor navigation. To deal with much greater variations in lighting conditions, more elaborate mechanisms should be used. In our indoor navigation, only a black and white visual image is used. For outdoor navigation, typically color images are preprocessed to enhance the contrast between road and nonroad areas [119]. Since our Robot Rome is designed to work only in indoor environment, only limited outdoor experiments have been conducted. Rome has a small wheel diameter and low clearance. It would be interesting to see how the algorithms perform in outdoor navigation. If similar preprocessed inputs like those used in 136 ALVINN [95] are fed to the system, good performance could be expected. Automatic design of states. In Chapter 4, those states used in state-based SHOSLIF-N are manually de- signed by the author for indoor navigation. For an autonomous navigation system, this kind of hand-crafted design should be avoided. Some of the on- going efforts [134] using spatio-temporal clustering aim for automatic derivation of states for visual navigation. Visual attention. Right now only a simple visual attention mechanism is employed by the system. The system uses attention windows with fixed positions, and the only decision is whether an attention window or a global view should be used. Visual attention in state-based SHOSLIF-N is used only in discriminating those scenarios which have similar global views but different local appearances in attention windows and requires different steering signals. For a practical autonomous navigation system, more elaborate visual attention mechanisms could be used. Some efforts have been made by the Navlab group at Carnegie-Mellon Uni- versity. Jochem [58] used virtual cameras, or attention windows which are extracted from a global scene. Virtual cameras have been shown useful in lane transition and intersection navigation. Dickmanns’ 4D approach [26, 32] nat- urally incorporates visual attention into the system. The internal state of the 4D approach includes state information for visual attention. This is one of the reasons that the 4D approach is successful. But the 4D approach relies on the 137 model-driven object tracking. There is still much space for improvement in visual attention for autonomous navigation. 138 APPENDICES Appendix A Maps of Test Sites The second- and third- floors of our Engineering Building serve as the test sites of the thesis work. The tested loops are marked with blue lines. 139 Engineering Building 2nd Floor 140 Engineering Building 3rd Floor 141 Appendix B A Set of Training Samples Used in State-Based SHOSLIF-N This set of training images was collected along the tested loop shown in Fig. 4.6 with a Panasonic GP-KR202 CCD camera, with a 3.6mm fish-eye lens. It includes 273 training samples which were taken during the incremental training on the 3rd floor of the Engineering Building. This set of training samples was used in the study of state-based SHOSLIF-N described in chapter 4. Each training image is displayed with its associated training data (previous state, current state, heading correction). . .7 ' , “k: 3...": .1 ~- a. . . ~ _ o..- ‘ .‘ . , , VFK’, ~ '.. “:5 4501‘ I n‘, . ..- . v, ‘ ,. 3 T ‘ ,; : t: :1. kw. -~ 'w‘ “I“ uni » \ , (0,0,0°) (0,0, —8°) (0, 0,9°) (0, 1, 0°) (0,1,—7°) (0,1,10°) (1,2,0°) (1, 2, 0°) , . 33! 
142 (1,2,0°) (1,2,8°) (1,2,8°) (1,2,8°) (1,2,5°) (1,2,5°) (1,2,5°) (1,1,0°) :3. 3 “.32 (1, 1, 0°) (1 1, 0°) (1,1, —7°) (1, 1, —7°) (1, 1, —7°) (1, 1, 4°) (1, 1, 4°) (1, 1, 4°) (2,2,11°) (2,2, —7°) (2, 2, 0°) (2, 2, 8°) (2, 2,10°) (2, 2, 20°) (2, 2, 20°) (2,2, -11°) (2, 2, 5°) (2,2,1?) (2,2,-7°) (212.100 (2,0,6°) (210,—9°) (0,0, 0°) (OWL-10°) (0, 0, 12°) (0, 0, 0°) (0,0, —8°) (0,0, 7°) (0, 0, 0°) (0,0, -9°) (0, 0, 10°) (0, 1, 5°) (0,1,10°) (01 0, 5°) (0, 0, 0°) (0,110°) (011,-7°) (1,1, -8°) (1,1,- 8°) (1 1 -8°) . .1. " " '2' _ (1,2,6°) (126°) (1,2,6°) (1,2,7°) (1,2,7°) (1,2,7°) (2,2,5°) (2,2,5°) (2,2,14°) (2, 2,10°) (2,2, 14°) (22, 12°) (2,2,10°) (2, 2 (2, 211°) (2,0,10°) (2,0, —2°) (0, 0, 7°) (0, 0, 0°) (0,0, 8°) (0, 0, 0°) (0, 0, —-3°) (0,0, 8°) (2,2, 16°) 143 (2,2, 17°) (2,2,22°) (2,2,20°) (2,2, 17°) (0,1,0°) (0,1,-7°) (1,1,0°) (1,1,0°) ..“fil'l‘tflmy'arm .' . ...,, (ml W. 3 S. ...Z:1J . -. :5" (1,1,0°) (1,3,10°) (1,3,10°) (1,3,10°) (1,3,19°) (1,3,19°) (1,3,19°) (3,3,16°) (3, 3, 41°) (3, 3, 34°) (3, 3, —11°) (3, 3, 24°) (3,3,11°) (3, 3, 0°) (3,3, 9°) (3,3, —9°) (3,0, 0°) (3,0, —12°) (3, 0, 4°) (0, 0, 7°) (0,0, -7°) (0,0, —11°) (0,0, 0°) (0, 0, 7°) (0,0, 0°) (0,0, —7°) (0,0,10°) (0,1,0°) (0,1,—5°) (0,1,5°) (1,4,0°) (1,4,0°) (1,4, 0°) (1,1,0°) (1, 1, 0°) (1, 1, 0°) (4, 4, 0°) (4,4, —7°) (4,4, -10°)(4 4, —25°) (4,4, —20°) (4, 4, 0°) (4,4, -7°) (4,4, 7°) (4,4,0°) (4, 4, 6°) (4,4, 0°) (4,4, 23°) (4, 4, 25°) (4,4, 20°) (4,4, 20°) (4, 4, 0°) (4,0,— (414,20°) (0,0,17°) (0, 0, 4°) (010,-4°) (0, 010°) (0, 0, 5°) (OWL-6°) (0, 010°) 144 (0,0, —6°) (0, 0, 5°) ) (1, 5, 0°) (1,5, 0°) (5, 5, 0°) (5, 5, 7°) (5, 5, —8°) ( ‘ when? in: '- ,5, 20°) (5, 5, 22°) 5, 0,11°) (5, 0,—10°) (0,0, —8°) (0,0,13°) Iv'v ( ) (o, 0, 4°) (0, 0, 0°) (01 0, —6°) r»! r411.» . (0,0, 7°) (0,0,0°) (0,0, 7°) (0,0, —-6°) (0,0,4°) (0,0, -7°) (0,1,—5°) (0,1,0°) 145 Car-u- (0,1,6°) (0,1,12°) (0,1,—13°) (0,0, 0° j ) (0, 0, 11°) (0, 0, 10° 4 'r (0, 0, 8°) (0, 0, 0°) 2, 2, 18°) (2, 2, 0°) (2, 2, -13°) (2, 2, —12°) 35:...- (0,0,12°) (0,0, 4°) (0, 0, 1°) (0,0, —7°) (0, 0, 6°) (0,0, 0°) T ' ' "4' ‘A" ._. (0, 0) 8°) (01 0, _70) (07 0, 0°) ‘0 ' L (2, 2, 10°) (0,0, —4°) (0, 0, 7°) (0, 0, 6°) (0,0, —5°) 0,0, 4°) (2,2,0°) (2,2,6°) rt .. 'L . g ‘ , i - . in; ~ .. ‘l ‘- Et ' .~ (2,2,10°) (0,0,0°) (0,0,12°) (0,0, 7°) (0,0, —6°) (2, 0, 4°) (0,0, 6°) (0, 0, 9°) The state transitions of this training set are summarized in Table 31. p. state/c. state 0 1 2 3 4 5 1 sum 0 94 23 0 0 0 0 117 1 0 21 18 6 3 3 51 2 8 0 44 0 0 0 52 3 3 0 0 9 0 0 12 4 3 0 0 0 23 0 26 5 3 0 0 0 0 11 14 [ sumr111l44|62|15|26[14]l277] Table 8.1: Summary of state transitions of the training set, where “p. state/c. state” stands for previous state/current state. 146 Bibliography [1] N. Ayache and O. D. F augeras, “Maintaining representations of the environment of a mobile robot”, IEEE Trans. Robotics and Automation, Vol. 5, No.6, pp. 804-819, 1989. [2] M. Bertozzi and A. Broggi, “Vision-Based Vehicle Guidance”, Computer, Vol 30, No. 7, July 1997, pp. 49-55. [3] M. Bertozzi and A. Broggi, “GOLD: A Parallel Real-Time Stereo Vision System for Generic Obstacle and Lane Detection”, IEEE Trans. Image Processing, Vol. 7, No. 1, 1998, pp. 62-81. [4] A. F. Bobick and J. W. Davis, “An appearance-based representation of action”, MIT Media Lab Perceptual Computing Section Technical Report, No. 369, 1996. [5] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Chapman & Hall, New York, 1993. [6] R. A. Brooks, “Intelligence without reason”, Proc. 
Int ’l Joint Conf. on Artificial Intelligence, pp. 569-595, Sydney, Australia, August, 1991. 147 [7] [8] [9] [10] [11] [12] [13] C. J. Chappel and J. G. Taylor, “The temporal Kohonen map”, Neural Networks, 6:441-445, 1993. J .-L. Chen and A. Kundu, “Rotation and gray scale transform invariant texture identification using wavelet decomposition and hidden Markov model”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 16, No. 2, pp. 208-214, 1994. M.-Y. Chen, A. Kundu, and J. Zhou, “Off-line handwritten word recognition using hidden Markov model type stochastic network”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 16, No. 5, pp. 481-496, 1994. S. Chen, C. F. N. Cowan, and P. M. Grant, “Orthogonal least squares learning algorithms for radial basis function networks”, IEEE Trans. Neural Networks, Vol. 2, No. 2, pp. 302-309, 1991. S. Chen and J. Weng, “SHOSLIF-N: SHOSLIF for autonomous navigation (Phase I)”, Technical Report CPS-94-62, Department of Computer Science, Michigan State University, East Lansing, MI, December, 1994. S. Chen and J. J. Weng, “Incremental Learning for Vision-based Navigation”, Technical Report CPS 96-14, Department of Computer Science, Michigan State University, 1996. S. Chen and J. J. Weng, “On-Line Incremental Learning for Vision-Guided Real- Time Navigation Using Improved Updating”, in The 10th Scandinavian Conf. on Image Analysis, Lappeenranta, Finland, June, 1997. 148 [14] S. Chen and J. J. Weng, “State-based SHOSLIF for indoor visual navigation”, to appear in Proc. 14th Int’l Conf. on Pattern Recognition, Brisbane, Australia, August, 1998. [15] X. Chen, E. Dagless, S. Zhang, and B. Thomas, “A real-time plane-view method for following bending roads”, 19.93 IEEE Symposium on Intelligent Vehicles, July 14-16, 1993, Tokyo, Japan, pp. 219-224. [16] P. A. Chou, “Optimal partitioning for classification and regression trees”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 13, No. 4, pp. 340-354, 1991. [17] J. D. Courtney, Mobile robot localization using classification techniques, Mas- ter’s thesis, Department of Computer Science, Michigan State University, East Lansing, Michigan, 1993. [18] T. M. Cover, “Estimation by the nearest neighbor rule”, IEEE Transactions on Information Theory, Vol.IT-14, No.1, pp.50-55, Jan. 1968. [19] D. R. Cox, “Statistical analysis of time series: some recent developments”, Scand. J. Statist. , Vol. 8, No. 2, pp. 93-115, 1981. [20] J. Crisman and C. Thorpe, “Color vision for road following”, in Vision and Nav- igation: The Carnegie Mellon Navlab, C. Thorpe, ed., Kluwer, Norwell, Mass, pp. 9—23, 1990. [21] Y. Cui, D. Swets, and J. Weng, “Learning-based hand sign recognition using SHOSLIF-M”, in Proc. Int ’1 Conf. Computer Vision, MIT, MA, pp. 631-636, June 20—23, 1995. 149 [22] [23] [24] [25] [26] [27] [28] [29] T. J. Darrell, I. A. Essa and A. P. Pentland, “Task-specific gesture analysis in real-time using interpolated views”, MIT Media Lab Perceptual Computing Section Technical Report, No. 364, 1995. S. Das and M. Mozer, “Dynamic on-line clustering and state extraction: an approach to symbolic learning”, Neural Networks, Vol. 11, No. 1, pp. 53-64, 1998. P. A. Devijver and J. Kittler, Pattern Recognition, Prentice Hall, 1982. E. D. Dickmanns and A. Zapp, “A curvature-based scheme for improving road vehicle guidance by computer vision”, in Proc. SPIE Mobile Robot Conf., Cam- bridge, MA, pp. 161-168, Oct. 1986. E. D. 
Dickmanns, “Machine perception exploiting high-level spatio—temporal models”, ACARD Lecture Series 185 ‘Machine Perception’, Sept,/Oct., 1992. E. D. Dickmanns and B. D. Mysliwetz, “Recursive 3-D road and relative ego- state recognition”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 14, No. 2 , pp. 199-213, 1992. E. D. Dickmanns, “Active bifocal vision”, 7th International Conference on Image Analysis and Processing, Monopoli, Italy, September 20-22, 1993. E. D. Dickmanns, R. Behringer, D. Dickmanns, T. Hidebrant, M. Maurer, F. Thomanek, and J. Schielen, “The seeing passenger car ‘VaMors-P”’, in 1994 IEEE Symposium on Intelligent Vehicles, pp. 68-73. 150 [30] [31] [32] [33] [34] [35] [36] [37] E. D. Dickmanns, “Performance improvements for autonomous road vehicles”, Intelligent Autonomous Systems-4, March 1995, Karlsruhe, Germany. E. D. Dickmanns, “Improvements in visual autonomous road vehicle guidance 1987-1994”, in Visual Navigation: from Biological Systems to Unmanned Ground Vehicles (Ed. Y. Aloimonos), Larrence Erlbaum Associates, Publishers, New Jersey, 1997. E. D. Dickmanns, “Vehicles Capable of Dynamic Vision”, Proc. 15th Interna- tional Joint Conference on Artificial Intelligence (IJCAI—97), Nagoya. Japan, August 23-29, 1997. H. S. Dulimarta and A. K. Jain, “A client/server control architecture for robot navigation”, Pattern Recognition, Vol. 29, No. 8, pp. 1259-1284, 1996. Y. T. Feng and D. R. Owen, “Conjugate gradient methods for solving the smallest eigenpair of a large symmetric eigenvalue problems”, Int’l J. Numer. Methods Eng, Vol. 39, pp. 2209-2229, 1996. I. E. Frank and J. H. Friedman, “A statistical review of some chemometrics regression tools” Technometrics, Vol. 35, No. 2, pp. 109—148, 1993. J. H. Friedman, J. L. Bentley and R. A. Finkel, “An algorithm for findng best matches in logarithmic expected time”, ACM Trans. on Mathematical Software, Vol.3, No. 3, pp. 209-226, 1977. J. H. Friedman, “l\/Iultivariate adaptive regression splines (with discussion)”, The Annals of Statistics, Vol. 19, No. 1, pp. 1-141, 1991. 151 [38] [39] [40] [41] [42] [43] [44] [45] [46] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Ed. Academic, San Diego, 1990. G. H. Golub and C. F. van Loan, Matrix Computations, The Johns Hopkins Press, 1989. S. Grossberg, “Adaptive pattern classification and universal recoding: 1. Parallel Development and coding of neural feature detectors”, Biol. Cybernetics 23, pp. 121-134, 1976. S. Grossberg, “Adaptive pattern classification and universal recoding: II. Feed- back, expectation, olfaction, illusions”, Biol. Cybernetics 23, pp. 187—202, 1976. S. Grossberg, “Content-addressable memory storage by neural networks: A gen- eral model and global Liapunov method”, in Computational Neuroscience (E.L. Schwartz, ed.), pp. 56-65, Cambridge, MA: MIT Press, 1990. M. Gu and S. C. Eisenstat, “A Stable and fast algorithm for updating the singular value decomposition”, Technical Report YALE / DCS/ RR-966,Yale University, New Haven, CT, 1994. J. Hancock and C. E. Thorpe, “ELVIS: eigenvectors for land vehicle image sys- tem”, CMU-RI—TR—94-43, Dec. 1994. D. J. Hand, Discrimination and Classification, Wiley, Chichester, 1981. T. Hastie and W. Stuetzle, “Principle curve”, J. American Statistical Associa- tion, Vol. 84, No. 406, 1989. 152 [47] [48] [49] [501 [51] [52] [53] [54] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing, 1994. M. Hebert, “Building and navigating maps of road scenes using an active sensor”, in Proc. 
IEEE Int’l Conf. Robotics and Automation, Philadelphia, PA, pp. 1136- 1142, April, 1988. J. Heikkonen, P. Koikkalainen, and E. Oja, “Self-organizing maps for collision- free navigation”, in Proc. World Congress on Neural Networks, Protland, OR, Vol. 3, pp. 141-144, 1993. X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov Models for Speech Recog- nition, Edinburgh University Press, 1990. X. Huang, “Phoneme classification using semicontinuous hidden Markov mod- els”, IEEE Trans. Signal'Processing, Vol. 40, No. 5, pp. 1062-1067, 1992. P. J. Huber, “Projection pursuit”, The Annals of Statistics, Vol. 13, No. 2, pp. 435-475, 1985. R. M. Inigo, E. S. McVey, B. J. Berger, and M. J. Mirtz, “Machine vision applied to vehicle guidance”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 6, No. 6, pp. 820-826, 1984. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988. 153 [55] [57] [581 [59] [60] [61] [62] D. L. James and R. Miikkulainen, “SARDNET: A Self-Organizing Feature Map for Sequences”, in G. Tesauro, D. S. Touretzky and T. K. Leen (editors) Advances in Neural Processing Systems 7, 1995. T. M. Jochem, D. A. Pomerleau and C. E. Thorpe “MANIAC: a next generation neurally based autonomous road follower”, Proceedings of the Image Understand- ing Workshop, Washington D. G, April 1993. T. M. Jochem, D. A. Pomerleau and C. E. Thorpe, “Vision-based neural net- work road and intersection detection and traversal”, Proc. IEEE Symposium on Intelligent Vehicles, pp. 344-349, Detroit, Michigan, Sept. 25-26, 1995. T. M. Jochem, “V ision-based tactical driving”, Technical Report CMU-RI-TR- 96-14, The Robotics Institute, Carnegie-Mellon University, January 1996. J. Kangas, “Time—dependent self-organizing maps for speech recognition”, in Proc. of the Intern. Conf. on Artificial Neural Networks, 1591-1594, 1991. L. V. Kantorovich and G. P. Akilov, Functional Analysis in Normed Spaces, Pergamon, Oxford, 1964. J. Karhunen and E. Oja, “New methods for stochastic approximation of trun- cated Karhunen-Loeve expansions”, in Proceedings of the 6th Int’l Conf. on Pat- tern Recognition, pp. 550-553, Munich, Germany, Oct. 1982. S. K. Kenue, “Lanelok: detection of lane boundaries and vehicle tracking using image-processing techniques ”, SPIE Conference On Aerospace Sensing, Mobile Robots IV, Nov. 1989. 154 [63] [64] [65] [66] [67] [68] [69] M. Kirby and L. Sirovich, “Application of the Karhunen-Loeve procedure for the characterization of human faces”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 12, No. 1, pp. 103-108, 1990. K. Kluge and C. Thorpe, “Explicit models for robot road following”, in Vision and Navigation: The Carnegie Mellon Navlab, C. Thorpe, ed., Kluwer, Norwell, Mass, pp. 25-38, 1990. D. E. Knuth, The Art of Computer Programming: Sorting and Searching, Addison-Wiley, 1973. S. Koenig and R. G. Simmons,“ A Robot Navigation Architecture Based on Partially Observable Markov Decision Process Models”, in Artificial Intelli- gence Based Mobile Robotics: Case Studies of Successful Robot Systems, D. Ko- rtenkamp, R. P. Bonasso, R. Murphy (eds), MIT Press, 1997. T. Kohonen, “Dynamically expanding context, with application to the correction of symbol strings in the recognition of continuous speech”, Proc. of 8th Int. Conf. Pattern Recognition, pp. 1148-1151, 1986. T. Kohonen, “Improved versions of learning vector quantization”, Proc. of Int. Joint Conf. Neural Networks, Vol. I, pp. 545-550, 1990. T. Kohonen, “Self-organizing maps: optimization approaches”, in T. Kohonen, K. Makisara and J. 
Kangas, editors, Artificial Neural Networks, Evsevier Science Publishers, pp. 981-990, 1991. 155 [70] [71] [72] [73] [74] [75] [76] [77] T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas, “Engineering applica- tions of the self-organizing map”, Proc. IEEE, Vol. 84, No. 10, pp. 1358-1384, 1996. D. J. Kriegmaan, E. Triendl and T. O. Binford, “Stereo vision and navigation in buildings for mobile robots”, IEEE Trans. Robotics and Automation, Vol. 5, No.6, pp. 792-803, 1989. D. Kuan, G. Philips, and A. Hsueh, “Autonomous land vehicle road following”, in Proc. Int ’l Conf. Computer Vision, IEEE, Piscataway, N. J ., 1987. J. Kuczynski and H. Wozniakowski, “Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start”, SIAM J. Matrix Anal. Appl., Vol. 13, No. 4, pp. 1094-1122, 1992. X. Lebesgue and J. K. Argarwal, “Significant line segments for an indoor mobile robot”, IEEE Trans. Robotics and Automation, Vol. 9, No.6, pp. 801-816, 1993. M. M. Loeve, Probability Theory, NJ: Van Nostrand, 1955. D. G. Luenberger, Optimization by Vector Space Methods, John Wiley, New York, 1969. D. G. Luenberger, Linear and nonlinear programming (2nd ed.), Addison-Wesley Publishing Company, 1984. 156 [78] [79] [80] [81] [82] [83] [84] B. S. Manjunath, S. Chandrasekaran and Y. F. Wang, “An eigenspace update algorithm for image analysis”, Proc. IEEE Int ’1 Symposium on Computer Vision, Coral Gables, FL, pp. 551-556, Nov. 20-22, 1995. J. Miintysalo, K. Toekkola, and T. Kohonen, “Mapping context dependent acous- tic information into context independent from by LVQ”, Speech Communication, Vol. 14, pp. 119-130, 1994. J. Mao and A. K. Jain, “Artificial neural networks for feature extraction and multivariate data projection”, IEEE Trans. Neural Network, Vol. 6, No. 2, pp. 296-317, 1995. M. Meng and A. C. Kak, “Mobile robot navigation using neural networks and nonmetrical environment models”, IEEE Control Systems, pp. 31-42, August, 1993. H. P. Moravec, “Obstacle avoidance and navigation in the real world by seeing a robot rover”, Ph.D. dissertation, Stanford Univ., Stanford, CA, Sept. 1980. (Reprinted as Robot Rover Visual Navigation. Ann Arbor, MI: UMI Research Press, 1981. ) H. P. Moravec , “The Stanford Cart and the CMU Rover”, Proc. IEEE, Vol. 71, No. 7, pp. 872-884, 1983. J. K. Mui, and K. S. Fu, “Automated classification of nucleated blood cells using a binary tree classifier”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 2, No. 5, pp. 429-443, 1980. 157 [85] [86] [87] [88] [89] [90] [91] H. Murakami and B. V. K. Kumar, “Efficient calculation of primary images from a set of images”, IEEE Trans. Pattern Anal. Machine Intell., Vol. PAMI-4, No. 5, pp. 511—515, 1982. H. Murase and S. K. Nayar, “Illumination planning for object recognition in structured environments”, in Proc IEEE Conf. Computer Vision and Pattern Recognition, Seattle, WA, pp. 31 - 38, June 1994. H. Murase and S. K. N ayar, “Visual learning and recognition of 3D objects from appearance”, International Journal of Computer Vision, Vol .14, No. 1, pp. 5-24, 1995. H. Murase and M. Lindenbaum, “Partial eigenvalue decomposition of large im- ages using the spatial temporal adaptive method”, IEEE Trans. Image Process- ing, pp. 620-629, Vol. 4, No. 5, 1995. S. K. Nayar, H. Murase, and S. A. Nene, “Learning, positioning, and tracking visual appearance”, in Proc. IEEE Int ’1 Conf. Robotics and Automation, San Diego, CA, May 1994. S. K. Nayar, S. A. Nene, and H. Murase, “Real-time 100 object recognition system”, in Proc. 
IEEE Int’l Conf. Robotics and Automation, Twin Cities, MN, May 1996. S. A. Nene and S. K. Nayar, “A simple algorithm for nearest neighbor search in high dimensions”, Technical Report No. CUCS-030-95, Department of Computer Science, Columbia University, 1995. 158 [92] E. Oja and J. Karhunen, “On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix”, J. of Math. Analysis and Applications, Vol. 106, pp. 69-84, 1985. [93] E. Oja, “Principle components, minor components, and linear neural networks”, Neural Networks, Vol. 5, pp. 927-936, 1992. [94] D. A. Pomerleau, “Efficient training of artificial neural networks for autonomous navigation”, Neural Computation, Vol. 3, No. 1, pp. 88-97, 1991. [95] D. A. Pomerleau, “Neural Network Perception for Mobile Robot Guidance”, Klumer Academic Publishers, 1993. [96] D. A. Pomerleau, “RALPH: rapidly adapting lateral position handler”, Proc. IEEE Symposium on Intelligent Vehicles, Detroit, Michigan, Sept. 25-26, 1995. [97] T. Poggio and F. Girosi, “Networks for approximation and learning”, Proceedings of the IEEE, Vol. 78, No. 9, pp. 1481-1497, 1990. [98] T. Poggio, “A theory of how the brain might work”, Cold Spring Habor Sympo- sium on Qualitative Biology, Vol. LV, pp. 899-910, 1990. [99] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C {2th Ed. ), Cambridge University Press, 1992. [100] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition”, Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989. 159 [101] R. Reddy, “Foundations and grand challenges of artificial intelligence”, AI Mag- azine, pp. 9-21, Winter 1988. [102] R. Reddy, “To dream the possible dream”, Comm. ACM, pp. 105-112, May 1996. [103] R. Reddy, “The challenge of artificial intelligence”, Computer, pp. 86-98, Oct. 1996. [104] P. Roberti, “The accelerated power method”, Int ’1 J. Numer. Methods Eng, Vol. 20, pp. 1179-1191, 1984. [105] M. Rosenblum and L. S. Davis, “The use of a radial basis function network for visual autonomous road following”, Univ. of Maryland Center Automat. Res., Tech. Rep. CAR-TR-666, May, 1993. [106] M. Rosenblum and L. S. Davis, “An improved radial basis function network for visual autonomous road following”, IEEE Trans. on Neural Networks, Vol. 7, No. 5, pp. 1111-1120, 1996. [107] I. K. Sethi, and G. P. R. Savarayudu, “Hierarchical classifier design using mutual information”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 4, No. 4, pp. 441- 445, 1982. [108] I. K. Sethi, “Decision tree performance enhancement using an artificial neural network implementation”, in Artificial Neural Networks and Statistical Pattern Recognition (eds. I. K. Sethi and A. K. Jain), pp. 71-88, Amsterdam: North- Holland, 1991. 160 [109] S. Shah and J. K. Aggarwal, “Mobile Robot Navigation and Scene Modeling Using Stereo Fish-Eye Lens System”, Machine Vision Applications, Vol. 10, No. 4, 1997, pp. 159—173. [110] H.-Y. Shum, K. Ikeuchi, and R. Reddy, “Principle component analysis with missing data and its application to polyhedral object modeling”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 17, No. 9, pp. 854-867, 1995. [111] R. Simmons and S. Koenig, “ Probabilistic Robot Navigation in Partially Ob- servable Environments”, in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1080-1087, 1995. [112] L. Sirovich and M. 
Kirby, “Low-dimensional procedure for the characterization of human faces”, Journal of the Optical Society of America A, Vol. 4, No. 3, pp. 519-524, 1987. [113] S. E. Stead, “Smooth multistage multivariate approximation”, Ph.D. Disserta- tion, Department of Math., Brown University, 1983. [114] G. Struck, F. Geisler, H. Laubenstein, H. Nagel and G. Siegle, “Interaction be- tween digital road map system and trinocular autonomous driving”, 1993 IEEE Symposium on Intelligent Vehicles, July 14-16, 1993, Tokyo, Japan, pp. 461-466. [115] R. S. Sutton, Guest Editor, Special Issue on Reinforcement Learning, Machine Learning, Vol. 8, No. 3/4, May 1992. [116] A. M. Thompson, “The navigation system of the JPL robot”, in Proc. Fifth IJCAI, pp. 745-757, 1977. 161 [117] C. Thorpe, “FIDO: Vision and navigation for a mobile robot”, Ph.D. disserta- tion, Dept. Comput. Sci., Carnegie-Mellon Univ., Dec, 1984. [118] C. Thorpe, M. H. Hebert, T. Kanade, and S. Shafer, “Vision and navigation for the Carnegie-h‘lellon Navlab”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 10, No. 8, pp. 362-373, May 1988. [119] C. Thorpe, ed., Vision and Navigation: The Carnegie Mellon Navlab, Kluwer, Norwell, Mass, pp. 9-23, 1990. [120] C. Thorpe, M. Hebert, T. Kanade, and S. Shafer, “Toward autonomous driving: The CMU Navlab”, IEEE Expert, pp. 31-42, August, 1991. [121] TRC, LABMATE User Manual. Version 5.21L-e., Transitions Research Cor- poration, 1991. [122] S. T suji, J. Y. Zheng, and M. Asada, “Stereo vision of a mobile robot: World constraints for image matching and interpretation”, in Proc. IEEE Int. Conf. Robotics and Automation, San Francisco, CA, Apr. 1986, pp. 1594-1599. [123] M. A. Turk, D. G. Morgenthaler, K. D. Gremban, and M. Marra, “VITS—A vision system for autonomous land vehicle navigation”, IEEE Trans. Pattern Anal. Machine Intell., Vol. 10, No. 3, pp. 342-361, 1988. [124] M. Turk and A. Pentlend, “Eigenfaces for recognition”, Journal of Cognitive Neuroscience, Vol. 3, No. 1, pp. 71-86, 1991. 162 [125] R. Wallace, K. Matsuzaki, J. Crisman, Y. Goto, J. Webb, and T. Kanade, “Progress in robot road-following”, in Proc. IEEE Int’l Conf. Robotics and Au- tomation, San Francisco, CA, pp. 1426-1432, April 1986. [126] S. J. Walsh, Indoor Robot Navigation using a symbolic landmark map, PhD the- sis, Department of Computer Science, Michigan State University, East Lansing, Michigan, 1992. [127] D. L. Wang and M. A. Arbib, “Complex temporal sequence learning based on short-term memory”, Proc. IEEE, Vol. 78, pp. 1536-1543, 1990. [128] D. L. Wang and M. A. Arbib, “Timing and chunking in processing temporal order”, IEEE Trans. Syst., Man, Cybern., Vol. 23, pp. 993-1009, 1993. [129] D. L. Wang and B. Yuwono, “Anticipation-based temporal pattern generation”, IEEE Trans. Syst., Man, Cybern., Vol. 25, pp. 615-628, 1995. [130] D. L. Wang and B. Yuwono, “Incremental learning of complex temporal pat- terns”, IEEE Trans. on Neural Networks, Vol. 7, No. 6, pp. 1465-1481, 1996. [131] A. M. Waxman, J. J. Lemoigne, L. S. Davis, B. Srinivasan, T. R. Kushner, E. Liang, and T. Siddalingaiah, “A visual navigation system for autonomous land vehicles”, IEEE J. Robotics and Automation, Vol. RA-3, No. 2, pp. 124-141, 1987. [132] A. R. Webb and D. Lowe, “The optimized internal representation of multilayer classifier networks performs nonlinear discriminant analysis”, Neural Networks, Vol. 3, pp. 367-375, 1990. 163 [133] J. J. Weng, “On Comprehensive Visual Learning”, Proc. NSF /ARPA Workshop on Performance vs. 
Methodology in Computer Vision, pp. 152-166, Seattle, WA, June 24-25, 1994. [134] J. J. Weng, “The developmental approach to intelligent robots”, in Proc. 1998 AAA] Spring Symposium Series, Integrating Robotic Research: Taking The Next Leap , Stanford University, March 23-25, 1998. [135] J. J. Weng, N. Ahuja, and T. S. Huang, “Learning recognition using the Cre- ceptron”, Int’l Journal of Computer Vision , 1996. Accepted and to appear. [136] J. J. Weng and S. Chen, SHOSLIF Convergence Properties and MDF Version of SHOSLIF-N, Technical Report CPS-95-22, Department of Computer Science, l\='1ichigan State University, East Lansing, MI, May 1995. [137] J. J. Weng and S. Chen, “SHOSLIF-N: SHOSLIF for autonomous navigation (Phase 11)”, Technical Report CPS-95-22, Department of Computer Science, Michigan State University, East Lansing, MI, May 1995. [138] J. J. Weng and S. Chen, “Autonomous navigation through case-based learning”, in Proc. IEEE Int ’1 Symposium on Computer Vision, Coral Gables, FL, pp. 359- 364, Nov. 20-22, 1995. [139] J. J. Weng and S. Chen, “Incremental learning for vision-based navigation”, in Proc. Int ’1 Conf. on Pattern Recognition, Vol. IV, pp. 45-49, Vienna, Austria, Aug. 1996. 164 [140] J. J. Weng and S. Chen, “Vision-guided navigation using SHOSLIF”, Neural Networks, Vol. 11, pp. 1511-1529, 1998. [141] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, 1965. [142] S. S. VVilks, Mathematical Statistics, Wiley, New York, 1963. [143] A. Wilson and A. F. Bobick, “Using configuration states for representation and recognition of gesture”, MIT Media Lab Perceptual Computing Section Technical Report, No. 308, 1995. [144] S. Yakowitz, “Nearest-neighbor methods for time series analysis”, J. Time Ser. Anal, Vol. 8, No. 2, pp. 235-247, 1987. [145] S. Yakowitz, “Nonparametric density and regression estimation for Markov se- quences without mixing assumptions”, Journal of Multivariate Analysis, Vol. 30, pp. 124-136, 1989. [146] X. Yang, T. K. Sarkar, and E. Arvas, “A survey of conjugate gradient algorithms for solution of extreme eigen-problems of a symmetric matrix”, IEEE Trans. ASSP, Vol. 37, No. 10, pp. 1550-1555, 1989. [147] H. Yang, “Conjugate gradient method for the Rayleigh quotient minimization of generalized eigenvalue problems”, Computing, Vol. 51, pp. 79-94, 1993. 165 [148] J. A. Zandhuis, “Storing sequential data in self-organizing feature maps”, Inter- nal Report MPI-NL-4/ 92, Max-Planck—Institute fur Psycholinguistik, Nijmegen, the Netherlands. [149] S. L. Zeger and B. Qaqish, “Markov regression models for time series: a quasi- likelihood approach”, Biometrics, Vol. 44, pp. 1019-1031, 1988. 166