AN "AUTOMATIC ANIMAL-LIKE" FACE AND OBJECT RECOGNITION SYSTEM

By

Colin Evans

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science and Engineering

1999

Dr. John Weng

ABSTRACT

AN "AUTOMATIC ANIMAL-LIKE" FACE AND OBJECT RECOGNITION SYSTEM

By Colin Evans

This thesis introduces the developmental approach to machine intelligence construction and integration, and then describes a system for Automated Animal-Like Learning. At the core of this system is the HSM Tree, an incrementally generated, real-time decision tree for very high dimensional inputs. The HSM Tree is an early developmental framework, capable of learning online and supporting a real-time cycle of interleaved training and testing. It learns high-dimensional image data in real time with very fast retrieval rates. Several experiments are described that illustrate the operation and capacity of the HSM Tree. Face recognition is demonstrated using recorded video sequences, 33,889 frames in length, of 143 different subjects, with a correct recognition rate of 95.1%. Real-time online learning at a rate of 5-10 video frames per second is also demonstrated.

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr.
John Weng, who provided advice, motivation, and wisdom at opportune times. I would also like to thank my committee members, Dr. Esfahanian and Dr. Stockman, for their advice and help in completing this thesis. Special thanks go to Wey Hwang, Yilu Zhang, and Yong Lee, all of whom provided assistance with programming, data collection, and the construction of the HSM tree algorithm. Much of this was collaborative work, and it would not have been possible without their assistance.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES

1 Introduction
1.1 Task-Specific Paradigm
1.2 Developmental Paradigm
1.3 Learning Mechanism
1.4 Prior Work

2 The HSM Tree
2.1 Overview of the HSM Tree
2.2 Nodes
2.2.1 Discriminant Subspace
2.2.2 Amnesic Averaging in the X-Space and Y-Space
2.2.3 Neural Gas
2.2.4 Update Values and Freezing
2.2.5 Likelihood Metrics
2.3 Tree
2.3.1 Sample Pruning
2.3.2 Retraining Samples
2.3.3 Freezing and Spawning

3 Experiments
3.1 Synthetic Data
3.2 Simulated Face Recognition from Video Sequences
3.2.1 Experiment Description
3.2.2 Experiment Results
3.3 Real Time Face and Object Recognition

4 Conclusions

LIST OF FIGURES

2.1 This is a timeline of the freezing schedule of the tree across 4η time steps.

3.1 The decision boundaries of a 3-class 2-dimensional classification problem. Samples are selected with uniform probability from the space above.
3.2 This is a timeline of the freezing and spawning patterns of the tree over 2000 time steps. The freezing parameter η is set to 200.

3.3 These are the class decision boundaries generated by the HSM tree after 100 time steps (left) and 300 time steps (right).

3.4 These are the class decision boundaries generated by the HSM tree after 500 time steps (left) and 700 time steps (right).

3.5 These are the class decision boundaries generated by the HSM tree after 900 time steps (left) and 1100 time steps (right). The decision boundaries are more refined because the root node froze at time step 800, and two child nodes have been spawned.

3.6 These are the class decision boundaries generated by the HSM tree after 1300 time steps (left) and 1500 time steps (right).

3.7 These are the class decision boundaries generated by the HSM tree after 1700 time steps (left) and 1900 time steps (right). Smaller refinements in the decision boundaries are noticeable because the second level of nodes froze at time step 1800 and a third level of nodes was spawned.

3.8 These are the 404 training samples that were kept by the tree in leaf nodes after 2000 time steps. They were selected using the sample pruning method described previously.

3.9 This is a temporally subsampled sequence used in training. The subject's face enters the view and comes to the center of the image, remains there for a while, and then leaves the view. A triangular foveal mask is centered over the image.

3.10 These are a selection of 16 images of the 143 people used for this test. The testing set varied widely across race, gender, and age.

3.11 A timeline for a typical question-response training session for face identification of person number 19. The face image enters the view, and then a "who?" question is asked on the numerical sensor. The response 19 is trained for the immediately following time step.
The two-step memory mechanism means that the system learns to respond when the previous numerical input was a question.

3.12 This is the makeup of the input vector to the HSM tree at each time step. The previous and current sensory inputs are concatenated into one vector, giving the system a memory of one time step.

3.13 This is a data flowchart of one time step of training in this framework. The current image and numerical input are combined with the previous image and numerical input to create an X-space input vector for the tree, and the current sensory inputs are stored for the next time step as well. The imposed effector output class is used along with the X-space tree input to incrementally update the class means used for Y-space class labels in the tree. The mean of the imposed effector output class is then used as a Y-space input to the tree. Update-Mapping is then called on the tree.

3.14 Face recognition performance of the HSM tree over different k and ε values. As can be seen here, larger k values give consistently better performance with larger ε values, but large ε values with small k values will decrease performance.

3.15 Gender recognition performance of the HSM tree over different k and ε values.

3.16 The average time in seconds that it takes to perform a search on the tree, given different k and ε values. Times range from 0.029 seconds to 0.04 seconds for a search.

3.17 Several of the objects used in real-time online training of the algorithm.

LIST OF TABLES

1.1 A comparison of several task-specific approaches to machine intelligence alongside the developmental approach.

3.1 This is an outline of one epoch of training.
Three training sequences were collected for each person, and these sequences are recycled between the two tasks trained.

3.2 These are the seven face identification errors made by the HSM tree. The queried images are on the left, and the images that the tree returned are on the right.

3.3 These are the best results of the HSM tree for face and gender recognition, compared to the nearest neighbor method.

3.4 These are the worst average time results for searching and training the HSM tree for one sample, compared with times for the Nearest Neighbor classifier with 858 samples. Training the Nearest Neighbor classifier for one sample takes no time because a new sample is simply kept. Note that the HSM tree's combined training and search times are less than the search time alone of the Nearest Neighbor classifier.

Chapter 1

Introduction

The developmental approach to machine intelligence is presented here as an alternative to prior task-specific approaches. The notion of having a robot system that can be trained in the same manner as an animal or young child is very compelling, and this basis in biological models of incremental development offers a window into problems that are very difficult under current methods. We hope for a system that is able to learn continuously over a lifetime, based on real-world input and training, and without task specificity or complex task decomposition. Such systems could transform our views of technology, as electronic pets or companions become available. This framework also makes the domain of intelligent robotics accessible to people who don't have extensive backgrounds in machine learning or artificial intelligence.
The notion of real-world, real-time feedback and interaction is a model much closer to biological systems, and so it would be possible for a person to transfer their knowledge of working with small children or animals to instructing a developmental robot.

The core of the developmental framework presented here is the HSM Tree, a real-time incremental decision tree that deals specifically with high dimensional data. The tree learns online in a real-time interactive context, and is used here for appearance-based object recognition. The experiments include face recognition using recorded video sequences of 143 different people totaling 33,889 frames in length. The system gave a correct recognition rate of 95.1% for face recognition, with a retrieval rate of 0.04 seconds per image. The system was also demonstrated in a real-time framework, training at a rate of 5-10 images per second from a video camera while feedback was provided from the keyboard.

The HSM tree was developed by Dr. John Weng, Wey Shiuan Hwang, Yilu Zhang, Yong Beom Lee, and myself. My work on this project has included programming and testing of the whole HSM tree system alongside the other researchers mentioned, and formulating the use of the Neural Gas algorithm, the node and tree freezing mechanism, the sample pruning and retraining mechanisms, and the spawning criteria for the version of the tree described here. All data collection and all experimental results were done by myself as well.

1.1 Task-Specific Paradigm

Current work in robotics and machine intelligence is based on the notion of completing a task. The paradigm for approaching research in these areas is to identify a task or a problem (such as autonomous navigation, face recognition, or Robo-Soccer) and then create a system or agent that "solves" the problem or performs the task.
The success of the system largely depends on the decomposition of the problem into sub-problems and on the quality of the knowledge that the scientist is able to impart to the system, either in the form of training samples or in explicit design. This paradigm produces robust and reliable systems that are very good at completing tasks with small scopes that can be formalized and decomposed cleanly and efficiently.

Approach        | Species architecture | World knowledge  | System behavior          | Task specific
Knowledge-based | programming          | manual modeling  | manual modeling          | yes
Behavior-based  | programming          | avoid modeling   | manual modeling          | yes
Learning-based  | programming          | treatment varies | special-purpose learning | yes
Evolutionary    | genetic search       | treatment varies | genetic search           | yes
Developmental   | programming          | avoid modeling   | general-purpose learning | no

Table 1.1: A comparison of several task-specific approaches to machine intelligence alongside the developmental approach.

The task-oriented paradigm works well for producing systems that solve individual problems, such as ALVINN [11], a neural-network based autonomous navigation system that is able to drive on roads; AutonoMouse [4], a robot that learns to follow light given reinforcement, using genetic algorithms for learning; or Xavier [14], an autonomous robot navigation system that uses partially observable Markov models. It even extends to broad classes of problems, as in SOAR [6], a "general problem solver" that uses axiomatic definitions to define a world and then performs problem searches through that axiomatic space. Other examples of successful machine intelligence systems include the behavior-based subsumption architecture [3][2], reinforcement methods such as Q-learning [18], [5], and even knowledge-based methods such as CYC [7] and Wordnet [9]. However, all of these systems and methodologies stand firmly within a task-oriented design paradigm.
For any of the above systems to be successful, the designer must know and understand the problem well and be able to give a good decomposition of the problem, in order to expose the important points and relevant sub-tasks vital for success. In this kind of design process, there is a real emphasis on the task space. A good system design recognizes what constraints are present in the environment and how they can be leveraged to provide a compact and coherent representation of the problem that can then be mapped to a specific tool or algorithm. The steps involved in a task-specific approach to machine intelligence are as follows:

1. A scientist identifies a task or problem for study (such as autonomous robot navigation in building hallways, or categorizing web pages).

2. The scientist analyzes the task as best she can, locating key points of the task that are important for success and discarding parts of the task that are irrelevant to good performance of the system. This can involve decomposing the task into tractable sub-problems, identifying what kinds of information need to be present and processed for a solution to be available, and even deciding what constraints on the problem must be present to allow a good decomposition.

3. The scientist selects a tool (such as POMDP models of the building for navigation, or a "bucket of words" representation for web pages) and encodes the important components of the problem as a task space representation that will allow the tool to operate efficiently and effectively on the problem.

4. Finally, the scientist determines and encodes the parameters of the tool. This can mean incorporating hand-crafted knowledge or manually decomposed behaviors, estimating parameters from training samples using a statistical or computational learning method, or searching for parameters using a manually designed, task-specific objective function.
This paradigm is very good at providing solutions to problems that can be understood well enough to be decomposed by hand. However, it does not do well with vague and poorly specified problems, such as general vision-based object recognition, navigation in a variety of unconstrained environments (such as outdoor environments), and general human-computer interaction. Such very general problems have very few reliable constraints, and because of this it is very difficult to construct compact and consistent representations. Without a compact and reliable representation or a good task decomposition, the manual labor and domain knowledge required by the task-specific paradigm are very large.

1.2 Developmental Paradigm

We offer an alternative paradigm for approaching robotics and machine intelligence. Instead of trying to construct a complex representation for a general problem, complexity can be placed in the algorithm. Automation and adaptivity can be substituted for domain knowledge and manual decompositions. For many types of machine intelligence problems (such as theorem provers and natural language parsers) a hand-crafted problem decomposition can be essential to the success of the system. However, there are classes of problems that don't lend themselves well to handmade decompositions (such as some vision problems, and problems of robot control). In these types of problems, the information needed for correct operation is not centralized and easily represented. For example, appearance-based vision methods depend on the combined input of the thousands of pixels in an image, and any small subset of pixels contains much less information for classification than the whole image. The information in a pixel image is not easily decomposed into pieces and semantically labeled. Similar issues are present in robot control problems (mapping a set of sensor inputs to motor positions with real-time operation).
The classical method for performing robot control is to build an analytical model of the inverse kinematics of the robot. The model is usually based on geometric and physical properties of the robot, and maps the final position of, say, a robot arm to the joint angles needed to move the arm to that position. This approach to robot control decomposes each part of the control problem (motors, joints, angles) into a formal model, but such an approach must have extraordinary complexity in order to model things like friction, elasticity of joints, and motor errors. All of these things are gremlins that chip away at the accuracy of an analytical model, because the problem is not totally decomposable into semantic pieces that can be easily recognized and understood.

We believe that problems without clean and easy semantic decompositions can be better decomposed automatically. We take the biological and psychological processes of human and animal development as an inspiration: over the course of a lifetime, from birth to old age, a human acquires knowledge and skills in an incremental and autonomous fashion. The learning mechanisms present in humans are very much automated; young children acquire motor skills and language at a surprising rate. New tasks are learned without the need for reprogramming or a new task decomposition. Representations of knowledge are implicit and distributed in the system, as opposed to being explicitly represented in states, symbols, or data structures. We believe that methods that are developmental in nature offer avenues for exploration into very general and vaguely defined tasks and problems.

The goal of a developmental approach is to study the mechanism, present at the "birth" of a system, that is able to incrementally learn and acquire abilities and behaviors through real-time interaction with the environment. Some important components of such a system are outlined below:

1.
Domain extensibility: The system is able to learn new tasks without the need for reprogramming.

2. Relief from manual task decomposition: The system does not require a programmer to analyze and decompose every task that the system must learn, but instead automatically generates a task space.

3. Developmental mechanism: The system learns incrementally over time in a real-time environment. This means that the system can act on its environment in an unconstrained manner and learns from interaction and real-time feedback from the environment.

In such an arrangement, the system would learn through active training in the world. For instance, for a robot to learn a new task such as navigation in building hallways, the robot could be initially led through the building several times, and then allowed to roam autonomously. If the robot appeared to be lost or started to run into a wall, it would be corrected at the point where the error occurred, introducing new training data, and then sent on its way. In this way, a real-time interleaved cycle of training and testing is established, where the system's performance is constantly critiqued and corrected by a teacher.

This interleaving of training and testing is markedly different from classical machine learning methods. Because training and testing are done in real time in the environment, errors can be corrected immediately, as soon as they are identified. Scarcity of training samples is less of a problem: input arrives in real time, so the system is able to acquire data as needed. The training process becomes an active learning process, where the trainer can seek out parts of the task that the system has not learned well and concentrate on those weaknesses. This notion of "interactive learning" allows totally new methods of robot training which are much closer to biological models of learning from instruction and reinforcement.
Because of the "hands-on" training methodology, it is possible for people who have no formal background in robotics to train and teach a developmental robot system, simply using their experience of instructing or training animals and small children. Additionally, having the system develop over time, similar to how a biological system matures, allows for the teaching of a wide range of tasks, with prior skills and abilities acquired by the system being used in future tasks. Because the framework is totally integrated (all of the learning takes place in the same mechanism) it is simple for skills from one task to cross over to another. These are things that are difficult to do within the traditional task-specific paradigm of design.

1.3 Learning Mechanism

The mechanism we have adopted for developmental learning is called "automated animal-like learning". We use the term "animal-like" because the learning process is similar to the stimulus-response learning that an animal can do in real time. We base our mechanism around a machine agent M: a robot with various sensors (such as vision or sonar) and effectors (such as wheels or an arm). The machine agent views sensory input in discrete time steps starting from its "birth", and is capable of activating effectors at each time step as well. The training phase consists of actions being imposed on M, guiding it through its environment and presenting it with the correct effector output given the sensory input encountered at each time step. Testing of the system involves allowing the system to map sensory inputs to effector outputs and act in real time. The system updates its internal models immediately, meaning that the gap between training and testing is seamless.

At the core of our developmental algorithm is the HSM (Hierarchical Statistical Modeling) tree. This is a high-dimensional, incrementally built decision tree that maps sensor inputs to effector outputs.
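The seamless train-test interleaving just described can be sketched as a single sensing-and-acting loop. This is only an illustrative sketch: the tree's `update`/`predict` interface and the sensor and effector callbacks are hypothetical stand-ins, not the actual system's API.

```python
def life_loop(tree, read_sensors, imposed_action, actuate, steps):
    # One "lifetime" of discrete time steps. At each step the agent
    # senses; if a teacher imposes an effector output, the mapping is
    # learned immediately, otherwise the agent acts from what it has
    # learned so far. There is no separate batch training phase.
    for t in range(steps):
        x = read_sensors(t)            # high-dimensional sensory input
        y_taught = imposed_action(t)   # None when no teacher signal
        if y_taught is not None:
            tree.update(x, y_taught)   # learn the imposed mapping now
            actuate(y_taught)
        else:
            actuate(tree.predict(x))   # act on the environment
```

Because the model updates inside the loop, training and testing steps can be interleaved in any order the teacher chooses, matching the interactive-learning cycle described above.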
The purpose of this work is to describe the HSM tree as a core element of a developmental framework, and to offer several experiments that show both the operation and the capacity of the HSM tree. This is an early work, as the developmental approach is at a fledgling stage.

1.4 Prior Work

No prior work existed on an overall developmental algorithm for an artificial agent before the work described here. The HSM tree is based on the SHOSLIF [19] [15] [16] approach to appearance-based vision and robot control. This approach involves constructing a high-dimensional decision tree for mapping image vectors to robot control signals, and was originally inspired by Turk and Pentland's Eigenface [17] approach to face recognition, which involved taking principal component projections of data sets in order to reduce dimensionality and computation while maintaining robust performance. Each node of the SHOSLIF tree finds a subspace projection (such as a PCA or LDA projection) of the total image space, and decision splits at each node are found by clustering classes together or by finding splits that reduce variance. However, the SHOSLIF framework is an offline, batch-generated framework, while the HSM tree is generated online and incrementally.

There are large bodies of work surrounding decision tree methods of classification, the most successful and well known being the C4.5 [12] [13] and CART [1] methods. Both of these methods use "purity metrics" (such as negative entropy) to find single-axis partitions at each node of the tree. These methods are fast and very successful for low-dimensional problems, but their reliance on single-axis decision boundaries and their need for offline batch processing make them inappropriate for handling high dimensional image data in a real-time training framework. One other system, the OC1 [10] decision tree, offers decision boundaries that are "oblique" and not axis-parallel.
However, the OC1 algorithm searches for decision boundaries using Monte Carlo methods and gradient ascent searches, both of which are very slow and work very poorly on very high dimensional image vectors.

No prior work was found on very high dimensional, incrementally generated, real-time classifiers, which is the job that the HSM tree performs. Most classification methods that are able to operate in real time (such as decision trees and neural networks) fall prey to the curse of dimensionality when trained on 5,000-10,000 dimensional image data. These methods are able to produce correct classifications for the training examples, but do not generalize well at all to test cases. Methods that explicitly deal with this kind of high dimensional data (such as PCA and LDA projection methods) require copious amounts of offline processing and are often slow for classification as well.

Chapter 2

The HSM Tree

2.1 Overview of the HSM Tree

The HSM tree allows real-time operation of a nearest-neighbor-based regression tree by decomposing the input space recursively into a tree structure. The purpose of the algorithm is to give a good decomposition of the input space (and thus good classification performance) while minimizing the number of comparisons and the amount of computation that must be done, thus making a nearest neighbor classifier computationally feasible for a very large number of samples. The tree is a regression tree, which mainly means that similarity in the output space is assumed to imply similarity in the input space. A major requirement of the tree is online construction in an incremental fashion that allows training and testing to be interleaved. This means that there are no opportunities for offline batch processing of samples, and the tree must be able to learn and operate in real time. Also, because we are interested in working with high-dimensional images, the tree employs methods to reduce computational costs, such as dimensionality-reducing transforms.

The HSM algorithm models the output space of a function in a coarse-to-fine fashion, using simple incremental data clustering techniques to first break the output space into coarse low-resolution blocks, and then breaking these coarse groups into finer and finer groups, increasing the resolution of the output space at each step. This method of breaking up the output space into nested clusters of similar outputs gives a tree-like formulation of the data, which in turn gives a corresponding coarse-to-fine decomposition of the input space. Because the decomposition of the input space is induced by the decomposition of the output space, the partitionings of the input space are useful for classification purposes. Each of the output space clusters acts as a virtual class label, aggregating together similar outputs and thus creating clusters out of the corresponding inputs. The input clusters are then used to derive a discriminant subspace projection of the data, and are modeled by Gaussians. At the leaves of this tree, individual samples are stored for actual classification. Classification is done by traversing a path down the tree and then performing a linear nearest-neighbor search on the samples at the leaf.

Given a high dimensional input space X and a lower-dimensional output space Y, the basic structure of the HSM algorithm is a tree-like formulation of the data, where each node N of the tree contains at most q cluster mappings: function mappings from a cluster in the X-space to a cluster in the Y-space. These cluster mappings are virtual classes that aggregate Y-space classes together and then model the corresponding X-space samples as clusters. Each of these virtual classes is specified by a data cluster x_i ∈ X and a data cluster y_i ∈ Y. Given that there are m cluster mappings (1 ≤ m ≤ q) in node N, a discriminant subspace D is generated in N, where D is the (m − 1)-dimensional subspace of X that contains the m cluster centroids (cluster means). That is, D is the (m − 1)-dimensional hyperplane defined by the m X-space cluster centers. Each cluster x_i is represented by a Gaussian distribution, with a mean in the X-space and an (m − 1) × (m − 1) covariance matrix Γ_i in the discriminant space D. Membership of a given pattern vector x_new in each of the m virtual classes is determined by projecting x_new into D, projecting each of the cluster means x_i into D, and then finding the negative log-likelihood that x_new is a member of each cluster, using the covariance matrix Γ_i and the projected mean x_i as the parameters of a Gaussian distribution.

Every internal node of the tree can have up to q nodes as children, with each child node corresponding to a virtual class in the parent node. Thus, the X-space is hierarchically decomposed as a nested mixture of Gaussians. The root node breaks the X-space up into m Gaussians, with every Gaussian distribution roughly representing a class corresponding to a cluster in the Y-space. The child node N_i of each virtual class at the root takes as input every pattern vector that falls into the Gaussian distribution specified by the parent of N_i, and N_i again breaks up its input set into a mixture of Gaussians, with each Gaussian roughly representing a class (again, a cluster in the Y-space). Assuming that the underlying data is not totally random, such recursive decomposition will eventually lead to a very low level of variation and a high resolution in the Y-space at the leaves of the tree.

An x_new ↦ y_new pair that represents a mapping from the X-space to the Y-space is added to the tree by starting at the root node and finding the Y-space cluster y_i
The closest X space cluster :13]- to 33mm is then found using negative—log-likelihood, and the algorithm is repeated for the child node beneath 13,-. When the search arrives at a leaf cluster Clea], a nearest neighbor search is done of the samples at c, and if 513nm is not contained there, it is added to Clea). If there are too many samples in Clea}, Clea, spawns a child node that inherits all of the samples from Cleaf- A predicted mapping from a sample input znew can be determined by finding the path from the root to a leaf note by finding the best membership of 513nm using negative-log—likelihood in the X space at each level of the tree. When the algorithm arrives at a leaf cluster Czeaf, a sequential nearest-neighbor search is done over the samples present at the leaf. If Clea] has no samples, then the Y—space cluster centroid of Clea; is used as an output. For greater accuracy in searching the tree, multiple paths can be taken down the tree to find the top 1: matches to 33mm, and then the closest leaf sample is taken as an output. 2.2 Nodes Each node attempts to incrementally build a model that separates classes into differ- ent clusters. This is done by first breaking up the Y-space (the output space, or class labels) into at most q separate clusters, and then attempting to model the X -space clusters that result. In each node, the Y—space clusters act as virtual class labels, separating the X -space samples into different virtual classes. The node attempts to model each of these X -space virtual classes for classification. 15 Each node breaks up the Y-space using unlabeled clustering with most q Y-space clusters. If there are fewer than q classes present in a node, the node will have fewer than q clusters, with one cluster for each class, and each virtual class will correspond to a real class. 
However, if there are more than q classes present in a node, then the node must aggregate similar classes into the same cluster, forming a virtual class that corresponds to several real classes. The neural gas algorithm [8], an incremental variant of the k-means clustering algorithm, is run over the Y-space vectors to find a good grouping of classes. Neural gas is less sensitive to local minima than k-means, which is why it is used in this context.

Because the virtual class labels are decided by similarity of the Y-space outputs, the labeling cannot be random or arbitrary. In regression-type problems, similarity in the output space often implies similarity in the input space, so this method for aggregating classes works well with the raw output space of regression problems. However, in discrete classification problems where the output corresponds to a name or a symbol, arbitrary or random Y-space vector labels can give very poor partitionings of the X-space. In this case, it is useful to base the Y-space vectors on the X-space data. The method we used was to set the Y-space label for each class equal to the X-space mean of the class across all of its samples.

In the X-space, each virtual class is modeled by a mean and covariance matrix, which are estimated incrementally over the samples using amnesic averaging. Because we wish to take second-order statistics (covariance matrices) for each cluster, a dimensionality-reducing transformation is necessary to make maintaining covariance matrices computationally feasible. Without such a transformation, a 2,000-dimensional input space (not uncommon for image data) would require a covariance matrix with 2,000 × 2,000 = 4,000,000 elements, which is too large to store or to perform any reasonable computations with. As the dimensionality-reducing transformation, we take the (m − 1)-dimensional subspace that covers the m cluster centers in the node N.
This subspace captures some discriminant information, and the usefulness of each basis vector is captured in the covariance matrix of each virtual class.

Classification at each node is done using negative log-likelihood for each X-space virtual class. Because we are incrementally approximating the covariance matrix for each class, if very few samples have been viewed the covariance matrix has likely not yet been reliably estimated, and will therefore give poor classification performance. To solve this problem, we use three different distance metrics to obtain reliable distance measures: Gaussian likelihood, Mahalanobis distance, and Euclidean distance. Gaussian likelihood is based on the covariance matrix for each virtual class and needs a large number of samples for reliable parameter estimation. Mahalanobis distance uses the average of the covariance matrices across all of the virtual classes in the node, and so needs fewer samples for good reliability. However, Mahalanobis distance is a less accurate metric than Gaussian likelihood for classification because it is not class-dependent and thus carries no discriminant information. Euclidean distance requires no samples and is the naive metric used when few samples have been seen. From these three distance metrics, a weighted average is taken, based on the number of training samples that have been seen by the node and by each individual cluster.

Updates to the node are done as follows:

Procedure Update-Node: Given a node N with m clusters (1 ≤ m ≤ q) and a mapping x_new ↦ y_new, N is updated with the mapping as follows:

1. If m < q and |y_i − y_new| > γ_y for each Y-space cluster y_i, add a new virtual class to N with X-space mean x_new, Y-space mean y_new, and a zero covariance matrix, and jump to step 5.

2. Find the virtual class c with the Y-space cluster centroid y_c closest to y_new using Euclidean distance.

3.
Update the X-space and Y-space cluster centroids x_c and y_c using amnesic averaging.

4. Update the X-space covariance matrix Γ_c of c using amnesic averaging.

5. Re-compute the likelihood matrices for N.

6. Update the remaining m − 1 Y-space clusters with y_new using the neural gas algorithm.

7. Re-compute the discriminant subspace D_N of the node using Gram-Schmidt orthogonalization.

2.2.1 Discriminant Subspace

In order to reduce the dimensionality of the input space, an orthonormal subspace projection D is found. The subspace D is the (m − 1)-dimensional space that contains the m X-space cluster centers x_1 through x_m; that is, D is the (m − 1)-dimensional hyperplane defined by the m X-space cluster means x_1, ..., x_m. Distance relative to this set of m points is captured by the projection, and cluster information that is lost in the projection can be recovered in the second-order statistics that are gathered. Because the X-space cluster centers are formed by virtual class groupings in the Y-space, the projection is a discriminant projection that captures information useful for classification, as opposed to a component projection that captures variance across all of the data (such as the well-known principal component projection).

The discriminant subspace D is represented by an (m − 1)-dimensional orthonormal vector basis b_1, ..., b_{m−1} with projection matrix B_D = (b_1 | ... | b_{m−1}). The basis is found by computing the mean x_mean = (1/m) Σ_{i=1}^m x_i of the cluster centers and then taking the m difference vectors d_i = x_mean − x_i for 1 ≤ i ≤ m. Running the well-known Gram-Schmidt orthogonalization procedure on the vectors d_1, ..., d_m will produce at most m − 1 orthonormal vectors b_1, ..., b_{m−1}, which form the basis of the discriminant subspace D.
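The basis construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation; the function name and tolerance are assumptions, and zero-vector padding (used when the difference vectors are linearly dependent) follows the handling described in the text.

```python
import numpy as np

def discriminant_basis(centers, tol=1e-10):
    """Sketch of the discriminant-subspace construction.

    `centers` is an (m, d) array of X-space cluster means.  The
    difference vectors d_i = x_mean - x_i are orthogonalized with
    Gram-Schmidt; at most m - 1 of them are independent, and zero
    vectors pad the basis when dependencies reduce the dimensionality.
    """
    centers = np.asarray(centers, dtype=float)
    m, d = centers.shape
    mean = centers.mean(axis=0)
    basis = []
    for c in centers:
        v = mean - c
        for b in basis:
            v = v - (v @ b) * b          # remove components along earlier vectors
        norm = np.linalg.norm(v)
        if norm > tol:                   # keep only independent directions
            basis.append(v / norm)
    while len(basis) < m - 1:            # pad with zero vectors if degenerate
        basis.append(np.zeros(d))
    return np.array(basis[:m - 1])       # rows form the projection matrix B_D
```

A vector x is then projected into D as `B_D @ x`, reducing a d-dimensional input to m − 1 dimensions.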
If there are linear dependencies among the vectors d_1, ..., d_m, the dimensionality of the subspace may be less than m − 1; this is handled by padding the basis with zero vectors to reach m − 1 dimensions.

2.2.2 Amnesic Averaging in the X-space and Y-space

The cluster centroids x_c and y_c and the covariance matrix Γ_c are updated incrementally using amnesic averaging. Given a set of small update values α_x, α_y, and α_Γ, at time n with inputs x_new and y_new the updates are done as follows:

x_c^(n) = (1 − α_x) x_c^(n−1) + α_x x_new

y_c^(n) = (1 − α_y) y_c^(n−1) + α_y y_new

To compute the covariance matrix, the X-space vectors are projected into the discriminant subspace D using the transformation matrix B_D. This gives the following:

Γ_c^(n) = (1 − α_Γ) Γ_c^(n−1) + α_Γ (B_D x_new − B_D x_c^(n)) (B_D x_new − B_D x_c^(n))^T

The update values are decayed over time, allowing for a "freezing" process to occur. This is discussed later.

2.2.3 Neural Gas

For grouping together the Y-space clusters, we use a modified version of the neural gas algorithm [8]. When a node N is updated with a new mapping x_new ↦ y_new at time n, the virtual class with the Y-space centroid closest to y_new is updated incrementally using amnesic averaging. Then the Y-space centroids of the remaining virtual classes are incrementally updated with y_new according to an update value α_ng. α_ng is very small and quickly time-decaying, so over time the update pattern becomes similar to an incremental version of k-means, where the closest cluster is updated and all of the others are ignored.

Assume a new mapping x_new ↦ y_new and a node N with m virtual classes c_1, ..., c_m ordered such that i > j implies |y_i − y_new| ≥ |y_j − y_new|; i.e., the clusters are ordered by distance to the vector y_new, with y_1 the closest.
Vector y_1 is updated using the amnesic averaging described above, and the vectors y_2, ..., y_m are updated at time n as follows:

y_i^(n) = (α_ng / i) y_new + (1 − α_ng / i) y_i^(n−1)

This method scales the update α_ng by the rank of the cluster centroid in closeness to y_new. Over time, α_ng is decayed at a rate faster than the α_x or α_y used in amnesic averaging.

2.2.4 Update Values and Freezing

For accurate estimation, many of the statistics above depend on other statistics already being estimated. The X-space means depend on stationary Y-space virtual class vectors, and the X-space covariance matrices depend on stationary X-space means. One approach would be to delay estimating a variable until its dependencies are accurately estimated, but that approach is not incremental in nature. Instead, we keep the update values α_ng, α_y, α_x, and α_Γ constant for different periods of time, and on different schedules we begin to decay them. All of the update values start off constant; then, at some point, the update values for variables with no dependencies begin to decay. After an α has decayed a significant amount and the update amount gets close to zero, its dependents begin to decay as well. As the dependencies of a variable become stationary through update value decay, it becomes possible to estimate the variable with accuracy.

Each node keeps a local count T_N of the number of global time steps since the node was created. This is the age of the node. The decay is parameterized by a value η, the number of local time steps that pass before an update variable starts to decay. The decay pattern is set up so that as the node passes each specified age, a different update value α begins to decay. After all of the update variables have decayed for a certain time, the node ceases to update at all, and "freezes".
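The amnesic-averaging updates of Section 2.2.2 and the rank-scaled neural gas update of Section 2.2.3 can be sketched together as follows. This is a minimal illustration under stated assumptions: the function names and argument layout are hypothetical, B_D is taken as given, and the rank-scaled step α_ng / i follows the reconstructed equation above.

```python
import numpy as np

def amnesic_update(x_c, y_c, cov_c, x_new, y_new, B, ax, ay, ag):
    """Amnesic averaging for the winning cluster (Section 2.2.2).

    B is the (m-1, d) discriminant projection B_D; ax, ay, ag are the
    small update values alpha_x, alpha_y, alpha_Gamma.
    """
    x_c = (1 - ax) * x_c + ax * x_new
    y_c = (1 - ay) * y_c + ay * y_new
    d = B @ (x_new - x_c)                        # residual in the subspace D
    cov_c = (1 - ag) * cov_c + ag * np.outer(d, d)
    return x_c, y_c, cov_c

def neural_gas_update(y_centroids, y_new, a_ng):
    """Rank-scaled update of the non-winning Y-space centroids
    (Section 2.2.3): the i-th closest centroid (i >= 2, 1-indexed)
    moves toward y_new with step a_ng / i; the winner is assumed to be
    handled separately by amnesic_update."""
    order = np.argsort([np.linalg.norm(c - y_new) for c in y_centroids])
    for rank, idx in enumerate(order[1:], start=2):
        a = a_ng / rank
        y_centroids[idx] = a * y_new + (1 - a) * y_centroids[idx]
```

Because the per-rank step shrinks as 1/i and α_ng itself decays quickly, distant centroids are soon left untouched, which is the k-means-like limiting behavior described above.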
This freezing allows for reliable parameter estimation by child nodes, which depend on consistent decision boundaries in their parent nodes. The decay pattern used in our experiments is as follows:

1. The neural gas update value α_ng decays first, as part of the neural gas algorithm. It starts to decay at age T_N = 0.7η.

2. The Y-space update value α_y decays next, fixing the Y-space clusters and thus fixing the virtual classes of the node. It starts to decay at T_N = 2.0η.

3. The X-space mean update value α_x begins to decay once the virtual classes are fixed. It starts to decay at age T_N = 2.5η.

4. When the X-space means are fixed, the X-space covariance matrices can be estimated, so α_Γ begins to decay. It starts to decay at T_N = 3.0η.

5. At age T_N = 4.0η, the node ceases to update any variables and is frozen.

Figure 2.1: This is a timeline of the freezing schedule of the tree across 4η time steps. (The figure marks the decay onsets in order: neural gas, Y-space, X-space, and covariance, followed by the node freezing at 4.0η.)

Choosing a good value of η depends on knowing something about the types of tasks that must be learned. Generally, the tree should be able to be presented with instances of each task to be learned before η time steps have passed. If η is too small, nodes will freeze too early, damaging classification performance, as variables freeze before seeing samples from all of the classes. If η is too large, then the system will take a very long time to reach reasonable performance.

Assuming that a variable begins to decay at age T_v, the decaying update value α_decay is computed as a function of the original update value α as follows:

α_decay = α e^(T_v − T)

This means that as the age T of the node progresses past age T_v, the update value decays at an exponential rate.
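The onset-then-exponential-decay behavior can be sketched as below. The formula is a reconstruction of the garbled equation above (α_decay = α e^(T_v − T)); the `scale` parameter is a hypothetical addition, since the source does not state whether the exponent carries a time constant.

```python
import math

def decayed_alpha(alpha0, T, T_v, scale=1.0):
    """Sketch of an update value's decay schedule.

    Before the onset age T_v the value stays constant; afterward it
    decays exponentially with the node's age T.  `scale` is an assumed
    knob for the decay rate, not taken from the thesis.
    """
    if T <= T_v:
        return alpha0                    # constant before the onset age
    return alpha0 * math.exp((T_v - T) / scale)
```

For example, with T_v = 0.7η for α_ng and T_v = 2.0η for α_y, each value holds steady until its scheduled onset and then falls toward zero, after which its dependents can be estimated against stationary statistics.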
2.2.5 Likelihood Metrics

The negative log-likelihood, given an X-space pattern x_new, a cluster mean x_i, an (m − 1)-dimensional space, a transformation matrix B_D that projects vectors into the discriminant space, and a scatter matrix S_i, is defined as follows:

L(x_new, x_i) = (1/2) d(x_new, x_i) + ((m − 1)/2) ln(2π) + (1/2) ln(|S_i|)

d(x_new, x_i) = (B_D x_new − B_D x_i) S_i^(−1) (B_D x_new − B_D x_i)^T

If S_i = Γ_i, the covariance matrix for cluster i, then the equation above gives the Gaussian negative log-likelihood. If we take the within-class scatter matrix S_w, the weighted average of the covariance matrices in the node, then L(x_new, x_i) gives the Mahalanobis negative log-likelihood. If S_i = I, the identity matrix, then the equation gives Euclidean distance. Each of these distances has decreasing usefulness for classification, and a decreasing demand for samples for correct estimation as well. Γ_i is estimated only from samples in virtual class i, and thus gives information useful for discriminating between classes. S_w is estimated from all of the samples that have come through the node, and is thus usually estimated with good accuracy sooner than Γ_i. However, S_w contains no class-dependent discriminant information, and is thus less helpful for classification. Given v_i sample visits at each cluster i in the node, with v = Σ_{i=1}^m v_i, S_w is derived as S_w = Σ_{i=1}^m (v_i / v) Γ_i. Euclidean distance is naive, requiring no samples and making no distance-based weightings for class discrimination; it is the simplest distance metric.

We want to transition between these three distance metrics automatically based on the number of samples that have been seen. If no samples have been seen, we would like to use Euclidean distance as the metric. If there are very few samples in the virtual class i, but the node has seen many samples, then we would like to use Mahalanobis distance with S_w. If the virtual class i has seen many samples, then we would like to use Gaussian likelihood with Γ_i.
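The likelihood formula above can be written directly in NumPy. This is a minimal sketch; the function name is an assumption, and passing S_i = I reduces it (up to additive constants) to squared Euclidean distance in the discriminant subspace, as the text describes.

```python
import numpy as np

def neg_log_likelihood(x_new, x_i, B_D, S_i):
    """Negative log-likelihood of x_new under cluster i:
    L = 0.5*d + ((m-1)/2)*ln(2*pi) + 0.5*ln|S_i|, with
    d = (B x_new - B x_i) S_i^{-1} (B x_new - B x_i)^T.
    """
    diff = B_D @ (np.asarray(x_new, float) - np.asarray(x_i, float))
    d = diff @ np.linalg.solve(S_i, diff)      # Mahalanobis-type distance
    k = B_D.shape[0]                           # k = m - 1
    sign, logdet = np.linalg.slogdet(S_i)      # stable log-determinant
    return 0.5 * d + 0.5 * k * np.log(2 * np.pi) + 0.5 * logdet
```

Substituting Γ_i, S_w, or I for S_i yields the Gaussian, Mahalanobis, and Euclidean variants respectively.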
To facilitate this automatic transition, we set S_i to a weighted average of I, S_w, and Γ_i. The weights are proportional to the number and distribution of samples across the node, and as samples are added, the weighting slips from Euclidean distance to Mahalanobis distance to Gaussian likelihood. We describe the weighted average using three weights w_e, w_m, and w_g, with w = w_e + w_m + w_g. The covariance matrix used for negative log-likelihood is derived as

S_i = (w_e / w) I + (w_m / w) S_w + (w_g / w) Γ_i.

We would like a scheme for assigning w_e, w_m, and w_g so that Euclidean distance is weighted the most when there are very few sample visits to the node, Mahalanobis distance is weighted highly when the total number of sample visits v to the node is high, and Gaussian likelihood is weighted highly when the average number of visits to a single cluster in the node is high. To do this, Euclidean distance is given a static weight v_e, and w_m and w_g are treated as functions of the number of samples that have been seen, coupled with the number of parameters that must be estimated in the covariance matrices. After v visits to the node, given that each visit produces an (m − 1)-dimensional vector used to estimate one covariance matrix, there have been v(m − 1) scalar values so far for estimating the parameters of the covariance matrices. S_w, the Mahalanobis distance matrix, has m(m − 1)/2 parameters to be estimated, which means that there have been 2v/m scalars for each parameter of S_w. There are m Gaussian likelihood matrices, which means that there are on average 2v/m^2 scalars for each parameter across the m Gaussian likelihood matrices. This system of computing weights based on the number of parameters that need to be estimated gives a basis for assigning w_m and w_g. Given enough samples, we would like Gaussian likelihood to become the dominant weight, so we set an upper bound v_e for w_m.
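One way to sketch this blending is shown below. The exact weight formulas are not fully recoverable from the source, so the expressions for w_m and w_g here are illustrative assumptions based on the parameter-counting discussion above (scalars seen per parameter, with w_m capped at v_e so that Gaussian likelihood eventually dominates).

```python
import numpy as np

def blended_scatter(Gamma_i, S_w, v, v_i, m, v_e=10.0):
    """Sketch of the automatic metric transition.

    Assumed weights (not the thesis's exact formulas):
      w_e fixed at v_e,
      w_m samples per S_w parameter (~2v/m), capped at v_e,
      w_g samples per Gamma_i parameter (~2*v_i/m).
    Returns S_i = (w_e*I + w_m*S_w + w_g*Gamma_i) / (w_e + w_m + w_g).
    """
    k = Gamma_i.shape[0]                      # k = m - 1
    w_e = v_e
    w_m = min(v_e, 2.0 * v / m)
    w_g = 2.0 * v_i / m
    w = w_e + w_m + w_g
    return (w_e * np.eye(k) + w_m * S_w + w_g * Gamma_i) / w
```

With no visits the result is the identity (Euclidean distance); as node-wide visits accumulate the blend shifts toward S_w, and as per-cluster visits accumulate it converges to Γ_i.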
Given the above formulations, the weightings are assigned accordingly. The bound v_e is set low, usually to a value of 10 or 20; the tree is not very sensitive to this setting in the long run, and this method makes possible a smooth transition into Gaussian likelihood, which is the decision metric that we are really interested in using for the best classification performance.

2.3 Tree

The two major operations on the tree are searching and updating. A search on the tree starting with an X-space vector x_new involves starting at the root node, recursively finding the virtual class of x_new at each node, and then proceeding to the child node of that class. Individual samples are stored in the leaves of the tree, and when the search arrives at a leaf, x_new is compared to the samples there and the output value of the closest sample is returned.

Because the tree partitions are not perfect, we use a version of the search algorithm that follows multiple paths down the tree and performs a "redundant search". The algorithm has two parameters: k, the maximum number of paths that can be searched, and ε, a threshold on the ratio of the negative log-likelihood of the closest virtual class in a node to the other likelihoods. If the likelihood ratio is smaller than ε and there are fewer than k paths, the virtual class is added as another path to search.

Procedure Search-Tree: Let P be the set of virtual classes that designate the current paths being followed in the search, with |P| ≤ k. Assume an X-space vector x_new to be classified by tree T with root node r.

1. Take the negative log-likelihood of x_new for all virtual classes in r, and add the virtual class with lowest likelihood d_1 to P.

2. For all of the other virtual classes with likelihoods d_i, if d_1/d_i ≤ ε, add the virtual class to P as long as |P| ≤ k.

3. For each member m ∈ P, if m is not a leaf, take the child node N of m and take the likelihood of x_new for all virtual classes in N.
Add the virtual class with lowest likelihood d_1 to P and remove m.

4. For all of the other virtual classes in N with likelihoods d_i, if d_1/d_i ≤ ε, add the virtual class to P as long as |P| ≤ k.

5. Repeat steps 3 and 4 until P contains only virtual classes that are leaves.

6. Perform a linear search through all of the samples contained in each of the leaves in P, finding the sample s with lowest Euclidean distance to x_new in the X-space. If a leaf has no samples in it, take the Euclidean distance to the X-space mean of the leaf and treat it as a sample.

7. Return the Y-space mapping of the closest sample, and the leaf node of the sample.

Given a new mapping x_new ↦ y_new, the tree is updated using Add-Mapping, which calls procedure Update-Tree. Update-Tree works by starting at the root node, running Update-Node, then finding the virtual class of x_new and proceeding to the child node of that class. When the algorithm arrives at a leaf cluster, the x_new vector is compared to each of the X-space samples present at the leaf, and if the difference from every stored X-space vector is larger than a small threshold γ_x, the sample x_new ↦ y_new is added to the leaf cluster.

Procedure Update-Tree: Given a tree T with root r and a novel mapping x_new ↦ y_new, the update goes as follows:

1. Let the current node N be r.

2. Call Update-Node on N with the mapping x_new ↦ y_new if N is not frozen.

3. Find the virtual class c with lowest negative log-likelihood for x_new.

4. If c has a child node, let N be the child node of c and go to step 2.

5. Perform a linear search through all of the samples contained in c. If |x_i − x_new| > γ_x for each sample x_i in c, add x_new to the leaf c. Return c.

The Update-Tree procedure is not called directly at each time step, although it is the fundamental procedure for updating the tree.
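The multi-path Search-Tree procedure above can be sketched as follows. The node representation (a dict with `children`, a per-class `nll` function, and leaf `samples`) is a hypothetical stand-in, and the branch-keeping test, which keeps a branch whose likelihood is within a factor `eps` of the best, is one plausible reading of the garbled ratio condition.

```python
import numpy as np

def search_tree(root, x_new, k=3, eps=1.2):
    """Sketch of the redundant (multi-path) search with at most k paths."""
    paths, leaves = [root], []
    while paths:
        node = paths.pop()
        if not node['children']:
            leaves.append(node)          # reached a leaf: collect it
            continue
        nlls = node['nll'](x_new)
        order = np.argsort(nlls)
        best = nlls[order[0]]
        kept = [node['children'][order[0]]]
        for i in order[1:]:
            if len(leaves) + len(paths) + len(kept) >= k:
                break                    # path budget exhausted
            if nlls[i] <= eps * best:    # close enough to the best branch
                kept.append(node['children'][i])
        paths.extend(kept)
    # nearest-neighbor search over samples at the surviving leaves
    best_y, best_d = None, np.inf
    for leaf in leaves:
        for xs, ys in leaf['samples']:
            d = np.linalg.norm(np.asarray(x_new) - np.asarray(xs))
            if d < best_d:
                best_d, best_y = d, ys
    return best_y
```

With k = 1 the search degenerates to the single-path descent used by the pruning test in Add-Mapping.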
Sample pruning, sample retraining, and spawning are handled within the procedure Add-Mapping, which is called with a new mapping x_new ↦ y_new at each time step and which in turn calls Update-Tree if needed.

Procedure Add-Mapping: Given a tree T and a novel mapping x_new ↦ y_new at time step n, the mapping is processed as follows:

1. Test the mapping for pruning. Call Search-Tree with k = 1, ε = 0, and input x_new. If Search-Tree returns a vector y_i such that |y_i − y_new| > γ_y, call Update-Tree with the mapping x_new ↦ y_new and go to step 2. Otherwise, discard x_new ↦ y_new and jump to step 3.

2. Check spawning conditions. If the leaf c returned by Update-Tree is frozen and has more than θ samples, spawn a new child node beneath c and call Update-Node on the new node with x_new ↦ y_new.

3. Retrain one sample. Select an existing sample x_e ↦ y_e from a node in the tree, delete the sample from the tree, and call Update-Tree with x_e ↦ y_e.

2.3.1 Sample Pruning

Because the tree is expected to operate in a real-time environment with a continuous stream of input samples, an incremental pruning algorithm is employed on every new mapping x_new ↦ y_new before the tree is updated, to keep the tree from being overwhelmed with needless and redundant data. An update is only done if the tree misclassifies the vector x_new by more than a threshold γ_y. If the tree correctly classifies x_new, then the new sample is discarded.

This simple incremental pruning algorithm tends to vastly reduce the number of samples needed to perform a classification problem. In some cases, it also works to normalize the prior probabilities of different classes by pruning away samples that provide redundant and inessential information. This is useful because of the unlabeled clustering methods used for creating the Y-space virtual classes, which work to incrementally find cluster centers that reduce total squared error.
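The pruning test at the heart of Add-Mapping can be sketched as below. `tree.search` and `tree.update` are hypothetical stand-ins for Search-Tree (with k = 1) and Update-Tree, and a scalar output is assumed for the distance test.

```python
def add_mapping(tree, x_new, y_new, gamma_y):
    """Sketch of the Add-Mapping pruning rule: a new sample is kept
    only if the tree currently misclassifies it by more than gamma_y."""
    y_pred = tree.search(x_new)           # single-path search, k = 1
    if abs(y_pred - y_new) > gamma_y:     # wrong by more than the threshold
        tree.update(x_new, y_new)         # keep the informative sample
        return True
    return False                          # redundant sample: discard
```

Because only misclassified samples are retained, over-represented classes contribute fewer stored samples once the tree already classifies them well, which is the prior-normalizing effect described above.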
When prior probabilities are dramatically skewed across a set of classes, the classes with the highest prior probability will usually be responsible for the most squared error, and so the resulting clusters will carry very little information about classes with small prior probability. Such clusters tend to be non-optimal for classification, especially if many of the samples in the more frequently viewed classes do not add to classification accuracy. Pruning tackles this problem quickly and effectively.

2.3.2 Retraining Samples

Each X-space mean and Y-space virtual class is incrementally estimated as new samples arrive, and because of this, the decision boundaries induced by the means change over time. As samples are stored in the leaves of the tree, they can be "lost" or "shadowed" as a decision boundary moves and abandons some of the samples that were placed because of that boundary. These samples are useless for classification, and need to be re-inserted into the tree. To do this, a linked list of the nodes of the tree is maintained, and at each time step a single sample is selected from the tree, removed from its leaf, and the tree is retrained with that sample. Over time, all of the samples are repeatedly removed and retrained by this process. This incremental process causes samples to be redistributed through the tree as nodes slowly freeze, and it also prevents redundant samples from being stored in different leaves. A side benefit is that the variables being estimated in the tree are repeatedly exposed to a randomized mix of samples that are relevant for classification (because of the pruning process). This seems to give better and faster parameter estimation than running repeated epochs of training data: reusing samples offers a speedup in training.

2.3.3 Freezing and Spawning

The primary motivation for spawning child nodes on the tree is to defray computation time.
If a leaf cluster collects a large number of samples after many updates, the linear search of samples that must be done at that leaf in order to classify an input vector can be significant. Thus, if a leaf node collects more than θ samples, the leaf spawns a child node in order to reduce the linear search length needed for classification. The samples are incrementally redistributed into the child node through the sample retraining process, and the maximum length of linear searches is reduced.

Spawning is only done after a node has frozen completely. This is because a node that is not frozen cannot be counted on to consistently assign samples to the same virtual classes over time, which would introduce a large amount of noise for the child nodes. This means that the tree grows downward incrementally, spawning only one level of leaves at a time, and freezing each level at the same time as well.

Chapter 3

Experiments

To illustrate the operation of the HSM tree, three experiments are described below. Together, these experiments demonstrate the correct operation, scalability, and real-world usage of the HSM tree algorithm described above.

The first experiment, using very simple synthetic data, demonstrates the tree learning a simple classification problem over time. The decision boundaries generated by the tree are displayed as training goes on, and the experiment illustrates that the tree is able to approximate the decision boundaries of a simple problem.

The second experiment works with real face images, demonstrating face recognition across 143 different people. This experiment illustrates both the performance of the algorithm and its ability to scale up to very large data sets while maintaining very fast retrieval times. A very large amount of very high-dimensional data is used, and the algorithm retains its speed and gives a high classification rate.
The third experiment demonstrates the algorithm working in real time, attached to a camera and also taking input from the keyboard. The real-time operation of the system is illustrated as it learns to recognize five people and seven objects over a course of two hours, given training input from the keyboard.

3.1 Synthetic Data

To demonstrate the operation of the HSM tree in a manner that can be visualized easily, synthetic data was generated for a two-dimensional problem. Figure 3.1 displays the decision boundaries of a three-class, two-dimensional problem. Samples are uniformly drawn from the sample space shown, with the training class determined by the sample's location in the figure. The tree is given a new sample mapping at each time step. The Y-space class labels are the numerical labels 1, 2, and 3.

The tree is set to have a maximum of 2 clusters at each node (q = 2), which means that at each node the tree produces a 1-dimensional discriminant projection. The freeze value is set to η = 200, which means that the root node freezes after 800 time steps, and child nodes created right after the root node freezes become frozen after 1600 time steps. The freezing schedule is shown in Figure 3.2. This value of η was chosen somewhat arbitrarily: it is large enough that each node will see a good number of samples from each class before it freezes, which should be sufficient for our purposes. The main consideration in choosing a value of η is to be certain that it is not too small, as too small a value will cause the system to freeze with very poor X-space partitions.

The decision boundaries of the tree over different stages of incremental training across 2000 time steps are shown in the figures below. After 2000 training samples, the tree has a depth of three, with seven nodes and 404 samples spread across eight leaf clusters.
If the tree had continued to train, it would have continued to spawn leaves, continually refining the decision boundaries. This is because the different regions are adjacent, so samples are continually added in an attempt to approximate the straight-line boundaries.

As training progresses, the decision boundaries become more refined. Because the HSM tree uses the nearest-neighbor method at each leaf node, the decision boundaries are very jagged, as the tree overfits the low-dimensional data. After steps 800 and 1800, the tree spawns new levels of nodes, and larger refinements in the decision boundaries become apparent because of this. This is the coarse-to-fine growth of the tree: as the tree grows, the boundaries become more accurate and more complex.

Figure 3.1: The decision boundaries of a 3-class 2-dimensional classification problem (regions labeled Class 1, Class 2, and Class 3). Samples are selected with uniform probability from the space above.

Figure 3.2: This is a timeline of the freezing and spawning patterns of the tree over 2000 time steps, with the freezing parameter η set to 200. (The figure marks, for each level of nodes, the decay onsets of the neural gas, Y-space, X-space, and covariance update values, with new levels of nodes spawning as each level freezes.)

Figure 3.3: These are the class decision boundaries generated by the HSM tree after 100 time steps (left) and 300 time steps.

Figure 3.4: These are the class decision boundaries generated by the HSM tree after 500 time steps (left) and 700 time steps.

Figure 3.5: These are the class decision boundaries generated by the HSM tree after 900 time steps (left) and 1100 time steps.
The decision boundaries are more refined because the root node froze at time step 800 and two child nodes have been spawned.

Figure 3.6: These are the class decision boundaries generated by the HSM tree after 1300 time steps (left) and 1500 time steps.

Figure 3.7: These are the class decision boundaries generated by the HSM tree after 1700 time steps (left) and 1900 time steps. Smaller refinements in the decision boundaries are noticeable because the second level of nodes froze at time step 1800 and a third level of nodes was spawned.

Figure 3.8: These are the 404 training samples that were kept by the tree in leaf nodes after 2000 time steps. They were selected using the sample pruning method described previously.

3.2 Simulated Face Recognition from Video Sequences

3.2.1 Experiment Description

The SAIL tree was used as part of a simulated robot control system to demonstrate online face recognition and gender recognition using image sequences. This was done to demonstrate the ability of the system to scale up to very long video sequences of very high dimensionality. Data sets of the size used here are typical of what this system might encounter when training in a real-time situation.

Four video sequences of 143 different people were collected and digitized. The images were cropped close to the individuals' faces and covered with a triangular "foveal mask" that simulates the effect of a fovea, weighting the face region of a subject more heavily than the background. Each video sequence was around 50-60 frames in length, with a resolution of 88 × 64. In total, 33,889 images were collected across all of these sequences.

Figure 3.9: This is a temporally subsampled sequence used in training. The subject's face enters the view and comes to the center of the image, remains there for a while, and then leaves the view. A triangular foveal mask is centered over the image.
Three of the sequences for each individual were used for training for each task, and the fourth was used for testing.

Figure 3.10: These are a selection of 16 images of the 143 people used for this test. The testing set varied widely across race, gender, and age.

We were interested in training the system online to perform a face recognition task, where the control system is trained online and incrementally to respond with the correct name or gender of an individual when asked. Training involves presenting each sequence to the tree and imposing actions upon the system in a manner of supervised learning. Testing is done by presenting sequences of images and stimuli and recording whether the expected responses are displayed.

The two tasks were face recognition and gender recognition. The face recognition task involved asking the system to provide an identifying number that corresponds to a person's identity when it is asked "Who?". The gender recognition task involved providing a number that corresponds to the correct gender of the individual when the system is asked "Gender?". Each sequence is 50-60 images, and at each time step an image was presented to the tree, along with a numerical input sensor which corresponds to "Who?", "Gender?", or to no question. The numerical sensor input was a vector of length 500 with all values set to the same number. These inputs are concatenated into a single vector of length 88 × 64 + 500 = 6132, which is the sensory feedback at each time step. The system output at each time step was either a value of 0, indicating no action, or a value that corresponded to the name of an individual or to a gender.

Training Session | Image Sequence | Task Trained
1 | Sequence 1 | Face Recognition
2 | Sequence 2 | Face Recognition
3 | Sequence 3 | Face Recognition
4 | Sequence 1 | Gender Recognition
5 | Sequence 2 | Gender Recognition
6 | Sequence 3 | Gender Recognition

Table 3.1: This is an outline of one epoch of training.
Three training sequences were collected for each person, and these sequences are recycled between the two tasks trained.

The numerical labels used for this experiment are inappropriate for use as Y-space labels, because single numerical values provide no information for the Neural Gas clustering algorithm. We produced Y-space labels for each class by taking an incrementally updated mean of each class. With 143 people to identify, two genders, and an output of no action, there were 146 different classes. Each class had a mean vector which was incrementally updated as samples for that class were received by the system. Each vector was the mean of the inputs belonging to the class that the tree was trained on. Each individual sample was also marked with the actual numerical label of the class that it belonged to, so that classification could be done quickly in the Search-Tree procedure.

The system was presented with each sequence of images and trained sequentially across the whole training set. In each training sequence, the individual's face enters the image, and then "Who?" or "Gender?" was given as an input to the tree. The "Who?" question was represented by setting the numerical input sensor vector to 500. The "Gender?" question was represented by setting the numerical input sensor to 700. At the time step following the question, the correct response corresponding to identity or gender was imposed. At all other times, the question input was set to 0 (no question), and an output of 0 (no action) was imposed.

For testing, the person's face entered the image, and then "Who?" or "Gender?" was given to the numerical sensor. The output at every step of the sequence was recorded, and correct identification of the individual was assessed if the system replied with the correct name or gender at the correct time, and did not perform any inappropriate actions at other times.
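The incrementally updated class means described above can be maintained with the standard running-mean update. A minimal sketch, with illustrative class labels and toy two-dimensional vectors (the amnesic weighting of Chapter 2 is omitted here):

```python
class ClassMeans:
    """Maintain one incrementally updated mean vector per output class,
    used as the Y-space label for that class."""

    def __init__(self):
        self.means = {}   # class label -> mean vector
        self.counts = {}  # class label -> number of samples seen

    def update(self, label, x):
        if label not in self.means:
            self.means[label] = list(x)
            self.counts[label] = 1
        else:
            n = self.counts[label] + 1
            m = self.means[label]
            # running mean: m <- m + (x - m) / n
            for i in range(len(m)):
                m[i] += (x[i] - m[i]) / n
            self.counts[label] = n
        return self.means[label]

cm = ClassMeans()
cm.update("person_19", [1.0, 3.0])
mean = cm.update("person_19", [3.0, 5.0])  # mean is now [2.0, 4.0]
```

Because the update touches only the one class mean per sample, the cost per time step is constant regardless of how many samples have been seen, which is what makes the scheme compatible with online learning.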
Figure 3.11: A timeline for a typical question-response training session for face identification of person number 19. The face image enters the view, and then a "Who?" question is asked on the numerical sensor. The response 19 is trained for the immediately following time step. The one-step memory mechanism means that the system learns to respond when the previous numerical input was a question.

The numerical input sensor vector was given a large size and large response values in order to have equal weighting against the image vector. If the numerical input sensor vector were one-dimensional and had a small response range, then it would not be a salient input for the system.

The HSM tree makes up the core of the simulated control system. During training, at each time step the current image and numerical input sensor value are taken as an input along with an output vector. At each time step of testing, the system takes the current image and numerical input sensor value and outputs a control signal. Because we wanted the system to respond one time step after the "Who?" question is given, we equipped the system with a small memory of one time step. This was done by setting the input to the tree to be a vector containing the current image and numerical input sensor concatenated with the previous image and numerical input sensor. This meant that the current state was composed of the current and previous sensor inputs, making the input dimensionality of the tree 2 x 6132 = 12264. This "one-step memory" means that the system can respond to a stimulus up to one step after it is observed. Thus, when the system is trained to respond one time step after a numerical input of "Who?" is given, the system ends up mapping a tree input with a numerical input of "Who?" in the previous sensor input to an identifying response.
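The one-step-memory input construction can be sketched as a simple concatenation. The dimensions follow the text (88 x 64 image plus a length-500 sensor, doubled to 12264); the ordering of the current and previous pairs within the vector is an assumption, and the function names are illustrative.

```python
IMG_LEN = 88 * 64   # 5632 pixels per greyscale frame
SENSOR_LEN = 500    # length of the numerical input sensor vector

def sensor_vector(question_value):
    """Length-500 vector with every entry set to the question code:
    500 for "Who?", 700 for "Gender?", 0 for no question."""
    return [float(question_value)] * SENSOR_LEN

def tree_input(prev_image, prev_sensor, cur_image, cur_sensor):
    """Concatenate the current and previous sensory inputs into the
    12264-dimensional X-space input for the tree (ordering assumed)."""
    return cur_image + cur_sensor + prev_image + prev_sensor

prev = ([0.0] * IMG_LEN, sensor_vector(0))    # earlier step: no question
cur = ([0.0] * IMG_LEN, sensor_vector(500))   # current step: "Who?" asked
x = tree_input(prev[0], prev[1], cur[0], cur[1])
assert len(x) == 2 * (IMG_LEN + SENSOR_LEN)   # 12264
```

With this layout, a "Who?" code in the previous-sensor slice is exactly the pattern the tree learns to map to an identifying response.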
Longer memories would allow more prolonged responses and more complex tasks, but for this experiment, one step was all that was necessary. This interaction framework makes up a primitive way for a real-time vision-based robot to be trained online. Face recognition was chosen here as a test bed because it is a popular and easily understood area of research, but the system is extensible to online training on other kinds of problems. The learning limitations are only what can be represented by the interaction framework, and the real-time operation of the tree makes many kinds of online learning possible.

Figure 3.12: The makeup of the input vector to the HSM tree at each time step. The previous and current sensory inputs (image of length 5632 and numerical input of length 500) are concatenated into one vector of length 12264, giving the system a memory of one time step.

3.2.2 Experiment Results

The algorithm was run for five epochs of training, where in each epoch each face identification and gender response was trained on three different image sequences of each person. This gives 50,833 time steps in each epoch. The tree had a maximum of 9 clusters (q = 9) for each node, and the freeze weight η was set to 15,000. This value of η is large enough that the system will see an example of each class once before the node freezes. If η is set much higher, performance is not affected, but it takes a much longer time for the tree's decision boundaries to converge. After the fifth epoch, each of the 14 nodes in the tree was frozen, and the tree had stopped spawning nodes. Resubstitution error for both the face recognition and gender identification tasks was zero. At this point, the system was tested using a hold-out set of image sequences with different values of ε and k for the Search-Tree procedure. There were 16,945 images in the hold-out set.
The results are presented in Figures 3.14 and 3.15. The best result for face recognition was 7 errors out of 143 with parameters k = 4 and ε = 1.5. The best result for gender identification was 4 errors out of 143 with parameters k = 7 and ε = 1.5.

Figure 3.13: A data flowchart of one time step of training in this framework. The current image and numerical input are combined with the previous image and numerical input to create an X-space input vector for the tree, and the current sensory inputs are stored for the next time step as well. The imposed effector output class is used along with the X-space tree input to incrementally update the class means used for Y-space class labels in the tree. The mean of the imposed effector output class is then used as a Y-space input to the tree. Update-Mapping is then called on the tree.

Different values of k and ε have a large effect on the performance of the tree, and it can be seen that small values of k with large values of ε actually decrease performance. This is because large ε values cause the search to split at higher levels of the tree too quickly, leading to useless parallel searches.

The face identification and gender identification tasks were tested against a nearest neighbor classifier for comparison. The nearest neighbor classifier was provided with the individual training samples used only during the question-and-answer phases of training; the "no action" images were not used. This made for a training set of 858 images.
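The nearest neighbor baseline amounts to a brute-force 1-NN classifier over the stored question-phase samples. A minimal sketch, assuming Euclidean distance (the text does not state the metric used) and toy two-dimensional vectors:

```python
import math

def nearest_neighbor(train, query):
    """Brute-force 1-NN: train is a list of (vector, label) pairs.
    Returns the label of the closest stored sample. Unlike the HSM
    tree, every stored sample is scanned, so search time grows
    linearly with the training set size (858 samples here)."""
    best_label, best_dist = None, float("inf")
    for vec, label in train:
        d = math.dist(vec, query)  # Euclidean distance
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label

train = [([0.0, 0.0], "person_1"), ([10.0, 10.0], "person_2")]
print(nearest_neighbor(train, [1.0, 2.0]))  # closest stored sample: person_1
```

The linear scan is what the HSM tree's logarithmic retrieval is being compared against in the timing results below.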
The nearest neighbor classifier had 4 errors out of 143 for face identification, and 2 errors out of 143 for gender recognition. This shows that the HSM tree's performance is comparable to that of a nearest neighbor classifier for this specific face recognition problem. The advantage of the HSM tree over a nearest neighbor classifier is that it is fast, can be trained online, and is scalable to large data sets.

Figure 3.14: Face recognition performance of the HSM tree over different k and ε values. Larger k values give consistently better performance with larger ε values, but large ε values with small k values will decrease performance.

The average times for Search-Tree to execute over the 16,945 hold-out images for both tasks are illustrated in Figure 3.16, with a range of times from 0.029 seconds to 0.04 seconds. The average time for Add-Mapping to execute over all of the training samples across all five epochs was 0.15 seconds for one mapping. This compares with an average nearest-neighbor search time over the 858 samples of 0.2 seconds. All of these tests were done on a 400 MHz Intel Pentium II system with 512 MB of RAM.

Figure 3.15: Gender recognition performance of the HSM tree over different k and ε values.

The speed differences are impressive because the HSM tree was generated online, while the nearest neighbor classifier was generated in batch with total knowledge of which of the 50,833 training samples to exclude. Additionally, the HSM tree maintains a tree structure, which in most cases should give it retrieval times that are logarithmically proportional to the number of samples stored in the tree. For real-time applications, the HSM tree is a feasible method: fast, online, and scalable.

A function not present in this system is a "reject" option: the ability to reject faces that are not recognized, or to express low confidence in the response given.
If presented with a novel face and asked to identify it, the system would simply return the closest answer that it could find. Traditionally, such a capability is arrived at either by building an "imposter model" of imposter faces and training it against the system, or by using a manually tuned threshold found with an ROC curve or a similar method. Both of these methods represent very task-specific approaches to face recognition, and are not compatible with the developmental paradigm within which this algorithm is being developed. A reject option is a valid and important function, but it will be future work, trained into the system as a behavior by a trainer rather than constructed intrinsically in the algorithm.

Table 3.2: The seven face identification errors made by the HSM tree. The queried images are on the left, and the images that the tree returned are on the right.

    Method             Face Results    Gender Results
    ----------------   -------------   --------------
    HSM Tree           95.1% (Best)    97.2% (Best)
    Nearest Neighbor   97.2%           98.6%

Table 3.3: The best results of the HSM tree for face and gender recognition compared to the nearest neighbor method.

    Method             Search Time     Training Time
    ----------------   -------------   -------------
    HSM Tree           0.04s (Worst)   0.15s
    Nearest Neighbor   0.2s            0s

Table 3.4: The worst average time results for searching and training the HSM tree for one sample, compared with times for the nearest neighbor classifier with 858 samples. Training the nearest neighbor classifier for one sample takes no time because a new sample is simply kept. Note that the HSM tree's combined training and search times are less than the search time alone of the nearest neighbor classifier.

Figure 3.16: The average time in seconds to perform a search on the tree, given different k and ε values. Times range from 0.029 seconds to 0.04 seconds for a search.
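The recognition rates in Table 3.3 follow directly from the error counts over the 143 test subjects. A quick check of the arithmetic:

```python
def recognition_rate(errors, total=143):
    """Correct recognition rate as a percentage, rounded to one decimal."""
    return round(100.0 * (total - errors) / total, 1)

print(recognition_rate(7))  # HSM tree, face: 95.1
print(recognition_rate(4))  # HSM tree, gender (and NN, face): 97.2
print(recognition_rate(2))  # NN, gender: 98.6
```

So the 95.1% headline figure is 136 correct identifications out of 143.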
3.3 Real-Time Face and Object Recognition

To supplement the simulated recognition tests described above, the system was used for real-time online incremental training to recognize five people and seven objects using a camera and frame grabber. The purpose was to demonstrate a proof-of-concept system that is trained and tested in real time through real interaction with the trainer. The objects and faces trained were easily discriminable, but the purpose of this last experiment was not to demonstrate classification capability, only the real-time operation of the system.

The same question-response model of training outlined above was used here, with the only change to the system being that RGB color images with a superimposed foveal mask at a resolution of 80 x 60 were used. This makes the input dimensionality of the tree ((80 x 60) x 3 + 500) x 2 = 29800.

Figure 3.17: Several of the objects used in real-time online training of the algorithm.

The system was trained over the course of two hours, with three training sessions for each person or object. In sessions for face recognition, the individual's eyes were lined up with two marks on the computer screen and the correct output for the individual's identity was imposed 5 to 10 times. For object recognition, the object was placed on a uniform brown background in the center of the field of view and the correct response was trained 5 to 10 times. Care was taken to make certain that the objects were in roughly the same positions for training and testing, but explicit hand-tuned image registration was not done for either the faces or the objects. Training for each face and object was interleaved, with three consecutive sessions of training over the 12 classes. After the three training sessions, recognition of each person and object was tested once, with a correct recognition in each case.
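The input dimensionalities quoted for the two experiments can be verified directly from the frame sizes and the one-step memory:

```python
# Real-time experiment: 80 x 60 RGB frames (3 channels) plus the
# length-500 numerical sensor, doubled by the one-step memory.
rt_dim = ((80 * 60) * 3 + 500) * 2

# Simulated experiment: 88 x 64 greyscale frames plus the same sensor.
sim_dim = (88 * 64 + 500) * 2

print(rt_dim, sim_dim)  # 29800 12264
```

The jump from 12264 to 29800 dimensions comes almost entirely from the three color channels.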
The frame rate of the system was around 5-10 frames per second on a 400 MHz Intel Pentium II system with 512 MB of RAM. Training was a simple process, and this final experiment exhibits a functioning real-time system operating on a simple problem.

Chapter 4

Conclusions

The HSM tree and framework presented here represent a very simple developmental learning system. The above experiments demonstrate that the HSM tree is scalable to very large data sets and is capable of running in real time with very high dimensional data. Face recognition from video sequences across 143 different subjects and 33,889 frames of video is demonstrated with a correct identification rate of 95.1% and an average retrieval time of 0.04 seconds per image. Real-time operation is demonstrated as well, with the system processing and being trained at a rate of 5-10 frames per second.

The next steps in this work are bringing this software to a robot system and designing a training framework for the system. This is an early work, but it is a first step toward designing a truly reactive and flexible machine intelligence system. Many new problems and a rich array of research topics lie ahead of this project, including finding new ways to automate parameter selection in the HSM tree, integration of this work into a temporal learning model, and adaptation of reinforcement learning methods to our framework.

Bibliography

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Chapman & Hall, New York, 1993.

[2] R. Brooks. Cog, a humanoid robot. Technical report, MIT Artificial Intelligence Laboratory, 1997. Private communication.

[3] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14-23, March 1986.

[4] M. Dorigo and M. Colombetti. Robot shaping: Developing autonomous agents through learning. Artificial Intelligence, 71(2):321-370, 1994.

[5] L. P. Kaelbling, M. L. Littman, and A. W. Moore.
Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

[6] J. E. Laird, E. S. Yager, M. Hucka, and C. M. Tuck. Robo-Soar: An integration of external interaction, planning, and learning using Soar. Robotics and Autonomous Systems, 8:113-129, 1991.

[7] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33-38, 1995.

[8] T. M. Martinetz, S. G. Berkovich, and K. J. Schulten. "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4(4):558-569, 1993.

[9] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41, 1995.

[10] S. K. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2:1-33, 1994.

[11] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In D. Touretzky, editor, Advances in Neural Information Processing, volume 1, pages 305-313. Morgan Kaufmann Publishers, San Mateo, CA, 1989.

[12] J. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[13] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

[14] R. Simmons, R. Goodwin, K. Z. Haigh, S. Koenig, and J. O'Sullivan. A modular architecture for office delivery robots. In Proc. Autonomous Agents, pages 245-252, Feb. 1997.

[15] D. Swets and J. Weng. SHOSLIF-O: SHOSLIF for object recognition (Phase I). Technical Report CPS 94-64, Department of Computer Science, Michigan State University, East Lansing, MI, Dec. 1994.

[16] D. Swets and J. Weng. SHOSLIF-O: SHOSLIF for object recognition (Phase II). Technical Report CPS 95-39, Department of Computer Science, Michigan State University, East Lansing, MI, Oct. 1995.

[17] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.

[18] C.
Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, England, 1989.

[19] J. Weng and S. Chen. Incremental learning for vision-based navigation. In Proc. Int'l Conf. Pattern Recognition, volume IV, pages 45-49, Vienna, Austria, Aug. 25-30, 1996.