This is to certify that the thesis entitled "Selective Visual Attention in a Search Task: A Reinforcement Learning Model," presented by Silviu D. Minut, has been accepted towards fulfillment of the requirements for the M.S. degree in Computer Science and Engineering at Michigan State University.

SELECTIVE VISUAL ATTENTION IN A SEARCH TASK: A REINFORCEMENT LEARNING MODEL

By

Silviu D. Minut

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science

2000

ABSTRACT

SELECTIVE VISUAL ATTENTION IN A SEARCH TASK: A REINFORCEMENT LEARNING MODEL

By

Silviu D. Minut

This thesis proposes a model of selective attention for visual search tasks, based on a general framework for sequential decision-making. The model is implemented using a fixed pan-tilt-zoom camera in a visually cluttered lab environment, which samples the environment at discrete time steps. The system is a learning agent which has to decide where to fixate next, based purely on visual information, in order to reach the region where a target object is most likely to be found. The model consists of two interacting modules. A reinforcement learning module learns a policy on a set of regions in the room for reaching the target object, using as objective function the expected value of the sum of discounted rewards. By selecting an appropriate gaze direction at each step, this module provides top-down control in the selection of the next fixation point. The second module performs "within fixation" processing, based exclusively on visual information. Its purpose is twofold: to provide the agent with a set of locations of interest in the current image, and to perform the detection and identification of the target object.

To My Family

ACKNOWLEDGMENTS

First and foremost I would like to thank my advisor, Dr. Sridhar Mahadevan. Numerous discussions ranging from machine learning and reinforcement learning, to human and computer vision, pattern recognition and cognitive science, led to ideas that shaped this research. I would also like to thank the other members of the committee, Dr. George Stockman and Dr. John M. Henderson, for their support and for their invaluable expertise in computer vision and in vision cognition. I am also grateful to Dr. John Weng and Dr. Fred Dyer for their insight and suggestions. Last, but not least, I would like to thank the members of the SIGMA group at Michigan State for all the ideas that we have exchanged, and in particular Richard Falk, for providing some human data from the scan pattern experiments.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
2 Background
2.1 Biological Motivation
2.1.1 Saliency Map Theory
2.2 Reinforcement Learning
3 Problem Definition
3.1 Research Questions and Task Definition
4 Related Work
4.1 A Saliency Map Implementation
4.2 Reinforcement Learning
5 Model Description
5.1 The System Overview
5.2 The Q-learning Module
5.2.1 Feature Extraction
5.2.2 States, Actions and Reward
5.3 The Vision Module
5.3.1 The Symmetry Operator
5.3.2 Histogram Back-projection
5.3.3 Histogram Intersection
5.4 More Visual Routines
5.5 Algorithm
6 Experimental Results
6.1 Learning a Policy for Finding the Target
6.2 Executing the Learned Policy
6.3 Re-training for a Different Target
6.4 Comparison with Random Search
6.5 Comparison with Exhaustive Search
7 Discussion
8 Conclusions and Future Work

LIST OF TABLES

6.1 Average number of fixations for the three experiments, starting from two test points. In experiment 1, testing was done every 5th epoch; the rest of the epochs were training epochs. Experiment 2 consisted only in executing a learned policy, with the target object inverted and slightly displaced. In experiment 3, a random search was done.

LIST OF FIGURES

2.1 Human fixations (the red dots) in an indoor scene. Most of the fixations fall on objects. There are almost no fixations on large, flat, uniform surfaces (floor, walls). Courtesy of the MSU Eye-Lab. Images in this thesis are presented in color.
2.2 Human fixations (red dots) in an outdoor scene. Courtesy of the MSU Eye-Lab.
3.1 (a) RGB image of the target object. (b) Panoramic image of the environment. Images in this thesis are presented in color.
5.1 Overall architecture of the system. The top module performs the feature extraction necessary to cluster the images online. The resulting regions are used as states in a Q-learning program, and a policy for moving towards the target region is learned. The bottom module implements the "within fixation" visual processing, consisting of computation of interesting regions in the image and object recognition.
5.2 (a) Low resolution template of the target object. (b) High resolution template.
5.3 Kullback distance of pairs of similar images (red) and of different images (green).
5.4 Top: feature vectors (4 concatenated color histograms) for two similar images (bottom).
5.5 Top: feature vectors extracted from two different images (bottom).
5.6 The shaded region represents a sector in image A. The most salient point is computed (the blue square) and a new image B, centered at the new location, is taken.
5.7 If the reward is found in the periphery (image A), then upon subsequent visits to the same region (image B) the reward could be missed, despite the fact that the agent executed the "right" action.
5.8 Two edge pixels p_i and p_j will vote for their midpoint. The strength of the vote depends on the distance between p_i and p_j, and on the length and the orientation of the intensity gradient at each of the two pixels.
5.9 (a) Sample image. (b) Symmetry map obtained by maximizing the vote for pixels at distance D = 130.
5.10 (a) Sample search image. (b) Model (target) image. (c) Histogram back-projection of (b) onto (a).
5.11 (a) Indoor image. The dots represent the recorded human fixation points. Courtesy of the Michigan State University EyeLab. (b) Intensity image of (a). The red squares are the cells with large variance.
5.12 (a) Intensity map for the image in Figure 5.11 (a). (b) Contrast map. The brighter regions are the most salient. Compare with the human fixations in Figure 5.11 (a).
5.13 Gaussians used to model color opponency. The red curve weighs the responses of the central red cones, while the green curve weighs the (inhibitory) responses of the surrounding green cones. The blue curve is the difference of the two gaussians.
5.14 (a) Original RGB image. (b) Red-green opponency map. (c) Green-red opponency map. (d) Red-green opponency using (b) and (c) as the red and green channels. (e) Green-red opponency using (b) and (c) as the red and green channels. The brightest regions are the ones in which the center-surround opponency is the largest. The green-red opponency picks up the flag and the computer chair, while the red-green opponency picks up the red box.
5.15 Red: distribution of high resolution images containing the target object. Green: distribution of high resolution images not containing the target object.
5.16 Left: Low resolution image. The points p1, p2, p3 are produced using the histogram back-projection of the low resolution model M_low. At each p_i, a score s_i is computed using histogram intersection. Right: High resolution image centered at p3. The high resolution model M_high is first back-projected onto the image, and then matched using histogram intersection. If the target was not found at any of the p_i, then the camera returns to the pan and tilt coordinates it had before the local search, and the threshold T is set to max_i s_i.
6.1 Average number of fixations for every 5th epoch over five trials of 400 epochs each. At every 5th epoch, the policy learned was tested starting from pan=500, tilt=-100.
6.2 Average number of fixations for every 5th epoch over five trials of 400 epochs each. At every 5th epoch, the policy learned was tested starting from pan=200, tilt=0.
6.3 Initial scan path (green) and learned scan path (red). The rectangle [-860, 860] x [-290, 290] represents the actual pan-tilt range of the camera. The search starts at (300, 50) and the target object is found around (-380, 30).
6.4 The sequence of fixations corresponding to the learned path (red) in Figure 6.3. The camera starts from region (1) and gradually moves towards the goal region (5). Here, a sufficiently good candidate is found at low resolution (1), the camera zooms in (6) and does a high resolution match. This sequence of images corresponds to the red scan path in Figure 6.3.
6.5 States learned (the blue squares) and the best actions.
Action A0 is represented by a red square around a state. Actions A1-A8 are represented by a red segment originating at the state. The target is the green square.
6.6 Number of saccades per epoch in finding the object inverted, displaced by 60 cm from the training position. Testing from pan=500, tilt=-100.
6.7 Number of saccades per epoch in finding the object inverted, displaced by 60 cm from the training position. Testing from pan=200, tilt=0.
6.8 (a) The target object in the third experiment was the green flag. (b) Low resolution template. (c) High resolution template.
6.9 Average number of fixations for finding the MSU flag (Figure 6.8) for every 5th epoch over five trials of 400 epochs each. At every 5th epoch, the policy learned was tested starting from pan=500, tilt=-100.
6.10 Average number of fixations for finding the MSU flag (Figure 6.8) for every 5th epoch over five trials of 400 epochs each. At every 5th epoch, the policy learned was tested starting from pan=200, tilt=0.
6.11 Number of saccades per epoch in a random search. Testing from pan=500, tilt=-100.
6.12 Number of saccades per epoch in a random search. Testing from pan=200, tilt=0.

Chapter 1

Introduction

The problem of visual search is to find an object in a large, usually cluttered environment (e.g. a pen on a desk) [52]. In solving such a problem it is preferable to use wide field of view images, in order to analyse as much as possible of the environment under scrutiny. On the other hand, small objects require high resolution images, which in combination with the wide field of view requirement leads to a very high dimensional input array. Foveated vision is nature's method of choice for solving this problem and is a dominant characteristic of the vision system of virtually any vertebrate species with well developed eyes [3]. It consists of processing the visual input at non-uniform resolution, due to an increased density of photo-receptors on a small central patch (the fovea) at the level of the retina. As such, foveated vision achieves a compromise between the need for a wide field of view and the need for high acuity.

Foveal image processing reduces the dimension of the input data, but in turn generates an additional sequential decision problem. Choosing the next fixation point requires an efficient gaze control mechanism, in order to check salient objects and determine whether they are the target or not. From an engineering standpoint, a sequential attention mechanism is attractive because it has the potential of requiring only sparse local models [4]. However, the visual attention mechanism raises a plethora of difficult questions.

In the first place, since the next fixation point is generally not in the fovea, its selection must be done based on coarse, low resolution visual information, without a thorough understanding of its semantics. The question is then: what low level features are necessary in order to decide what to attend to in the next fixation? Koch and Ullman [22] propose a saliency map theory which is a task independent, bottom-up model of visual attention.
In this framework, Itti and Koch [16] extract three types of feature maps (a color map, an edge map and an intensity map) and fuse them together in a unique map. However, the selection of the next fixation must require some top-down control, since low-level visual information is usually not sufficient. Hence the second major question: how can a high level, top-down mechanism be implemented to control the low level, reactive attention? Tsotsos et al. [46] propose a model of visual attention which tries to selectively tune visual processing by means of a top-down hierarchy of winner-take-all processes.

Finally, since the vision system samples the environment, some information must be retained from one fixation to the next, and integrated across saccades, to produce a global understanding of the scene. The nature of this information is the third unknown of major concern.

We propose an overall model that integrates top-down gaze control with bottom-up reactive saliency map processing, based on reinforcement learning.

In the remainder of this section we present an outline of the thesis. In Chapter 2 we introduce the relevant terminology and present some background and motivation from a cognitive science perspective, as well as give a brief introduction to reinforcement learning. Some fundamental issues pertaining to visual attention are outlined.

In light of the problems raised in Chapter 2, we formulate in Chapter 3 the questions that are addressed in this thesis, and we state in mathematical terms the concrete task being solved.

Chapter 4 is devoted to related work, where a number of representative previously proposed models of visual attention are discussed.

The core of this thesis is described in Chapters 5 and 6. Section 5.1 provides an overview of the model and explains the logical decomposition of our system into two modules: a reinforcement learning module, which provides the top-down control in the search task, and a vision module, consisting of bottom-up visual routines that implement our version of a saliency map and a recognizer used to identify the target object of the search. We proceed to describe the components of each module in the subsequent sections, independently of one another. The feature extraction and the Q-learning routines are described in Sections 5.2, 5.2.1 and 5.2.2, while the visual routines are detailed in Section 5.3. The individual components are then "assembled" together in Section 5.5, where a detailed description of the algorithm is given.

The performance of the search system is evaluated in Chapter 6, where four experiments are described. The results show that (1) the number of fixations to the goal decreases during training, (2) after training, the system can find the target object in a slightly different location within one region of the environment, and (3) the agent can be re-trained to find another object in the same environment without fine tuning any parameters. Finally, a comparison with a random search agent and with an exhaustive search agent is done, indicating that the learning agent performs better in both cases.

Chapter 7 summarizes the strengths of the system, as well as its shortcomings. Some technical issues regarding the implementation are presented, along with some possible improvements to feature extraction.

Finally, Chapter 8 outlines a few promising directions for further work.
Chapter 2

Background

2.1 Biological Motivation

Broadly speaking, computer vision is a research area with the practical goal of building a machine that understands what it sees. In doing so, researchers often find their inspiration in biological systems, and often propose computational models that try to explain or approximate the observed biological evidence. One of the first comprehensive theories of computer vision was formulated by David Marr [26]. It was the dominant paradigm in computer vision in the 80's and it continues to be a major one even today. The aim of computer vision, according to this theory, is to recover the 3D structure of the external world from 2D images, in an attempt to form a stable, detailed, spatiotopic representation of the world. Attempts to implement such a system failed to perform well in realistic environments. A few researchers have used psychophysical evidence to argue for new biologically inspired approaches in computer vision [4]. We shall describe some of this evidence and its implications for vision, but in order to do so, we shall briefly introduce some terminology first.

It is estimated that more than half of the neurons in the brain of the macaque monkey are devoted to vision (though fewer in the human brain) [11], [32]. This is indicative of the importance and the complexity of vision. The brain must analyse the sensory input, then synthesize the information in an attempt to "make sense" of the world, and act in an appropriate, optimal way. However, given that the human field of view is about 180°, it is a highly nontrivial task to process and interpret all the visual information at constant, high resolution, even for a massively parallel super-computer such as the brain, for dimensionality reasons. It is also not necessary to keep detailed information about every single point in the visual field.

The biological solution starts with the design of the eye and consists in a trade-off between high visual acuity and amount of processing. The fovea is anatomically defined as a small, central region on the retina, with a very high density of receptive cells (cones). The density of the photo-receptors (and with it the visual acuity as well) decreases exponentially from the fovea towards the periphery. To compensate for any potential loss of information incurred by the decrease in resolution in the periphery, the eyes are rapidly re-oriented via very fast (up to 900°/s) ballistic motions called saccades. The length of the saccade varies with the task, but in general the order of magnitude is a few degrees (see [35]). Fixations are the periods between saccades (about 3 per second) during which the eyes remain relatively fixed, the visual information is processed and the location of the next fixation point is selected. In humans, the fovea has a diameter of about 1.5 mm, and accounts for about 2° of the field of view [32], [35].

This dimensionality reduction leads to a sequential decision problem: rather than processing all the visual field at once at constant high resolution to produce a detailed global picture of the whole environment, the visual system devotes a substantial part of its resources to the processing of a limited region at high resolution, in conjunction with an efficient attention mechanism, which selects the next fixation point in such a way that the brain appears to get just the right amount of information necessary for the task at hand.
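The data reduction bought by foveated sampling is easy to appreciate with a toy computation. The sketch below is purely illustrative (the system in this thesis uses a pan-tilt-zoom camera, not a retina-like sensor): it keeps a small central patch at full resolution, represents the periphery by progressively downsampled copies of the frame, and reports what fraction of the uniform-resolution pixel budget survives.

```python
import numpy as np

def foveate(img, fovea=64, levels=3):
    """Crude foveated representation: a full-resolution central patch
    plus progressively coarser downsampled versions of the whole frame.
    Returns the list of arrays and the total pixel count retained."""
    h, w = img.shape[:2]
    cy, cx = h // 2, w // 2
    # Full-resolution "fovea": a small central window.
    patches = [img[cy - fovea // 2:cy + fovea // 2,
                   cx - fovea // 2:cx + fovea // 2]]
    # "Periphery": the whole frame at coarser and coarser resolution.
    for k in range(1, levels + 1):
        step = 2 ** k
        patches.append(img[::step, ::step])
    kept = sum(p.shape[0] * p.shape[1] for p in patches)
    return patches, kept

img = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a camera frame
patches, kept = foveate(img)
print(kept / (480 * 640))  # fraction of the uniform-resolution pixel budget (~0.34)
```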
In the early 90's two closely related paradigms started to affect the course of vision research in computer science: active vision [2] and animate vision [4]. In essence, the active vision paradigm proposes that instead of trying to recover the world in the form of a task independent, detailed internal representation, one should try to find task specific solutions. These solutions could be found by decomposing the particular task at hand into simpler sub-tasks, with simpler qualitative solutions. As Aloimonos points out [2], an algorithm that requires accuracy of a few decimals cannot be robust under the virtually infinite number of environments, hence the need for qualitative algorithms.

Animate vision also acknowledges the fact that it is not necessary to create rich 3D representations of the world. An animate vision system ideally would be endowed with anthropomorphic features, such as binocular, foveated vision and a high speed gaze control mechanism. Animate vision systems can take various actions (e.g. move, or zoom in/out) in order to optimize or simplify visual search. The computations are carried out in an exocentric coordinate system (as opposed to an egocentric coordinate system in Marr's paradigm), which has the advantage of being independent of the motion of the observer (robot). The camera is not fixed, but can make approximate motions to acquire new information. Understanding vision as a continuous process of mapping perception to action allows for (reinforcement) learning algorithms which can provide a vision system with more sophisticated behavior.

O'Regan [30] suggests that only minimal information about the scene is represented internally. The scene itself could serve as "external" memory. In a now classical experiment, Ballard [5] confirmed O'Regan's finding. Specifically, human subjects were asked to reproduce a configuration of colored blocks displayed on a computer monitor. The display was divided into 3 regions: the model region, in which the given configuration was shown; the source, which contained an unordered set of colored blocks; and the workspace, which was the area where the copy of the model was supposed to be assembled. Had the subjects built a detailed internal representation of the model in their mind first (as suggested by the Marr paradigm), one (possibly long) look at the model would have been enough to produce a copy in the working area. Without exception though, the subjects made saccades back and forth between the model and the workspace, accomplishing the task by selecting one block at a time. This suggests that it is more economical to retain only the necessary information, locally both in space and time, and sample the environment as necessary, rather than keep an internal representation of the scene.

Another piece of evidence against computing and storing a stable internal model of the world is the change blindness effect [38], [37]. It has been observed that even significant (sudden) changes in a scene go undetected if they occur during a saccade¹. Again, had the brain maintained a detailed spatio-temporal model of the scene, the changes would have been noticed.

¹ It is known that visual input during a saccade is essentially turned off, a phenomenon termed saccadic suppression [35].

Selective attention is the common feature of biological systems, but it raises a number of questions, some of which we shall discuss. Rensink [37] points out that there are two categories of problems.
One category addresses questions about retinotopic representations and processes that occur within a single fixation. Equally important is what happens from fixation to fixation, in terms of how and what kind of information is integrated across saccades, as well as how this information is coordinated with the task at hand. We shall point out some fundamental questions from each category. First, the "within fixation" questions.

Yarbus [54] and subsequently several other studies (see [35] and the references therein) showed, by tracing the fixation points in a scene, that human scan patterns vary with the task. Figures 2.1 and 2.2 show human scan patterns recorded with an eye-tracker at the MSU EyeLab. Despite these variations, the fixation traces have some common characteristics. It is noteworthy, for instance, that people almost always fixate on objects, both in indoor images and in outdoor images, and almost never on flat, uniform, empty spaces (e.g. floors, walls).

It has been postulated [49] that there are (at least) two fundamental operations that a visual system must be able to perform: first it must be able to tell "where" in the image the interesting regions are, and then, once attention is devoted to one of these regions, it must tell "what" that region represents. In general, coarse, peripheral information is not sufficient for object recognition. For instance, surprising as it may seem, color perception is heavily depreciated beyond 20-30 degrees off the optical axis, due to the decrease in cone density [32].

Figure 2.1: Human fixations (the red dots) in an indoor scene. Most of the fixations fall on objects. There are almost no fixations on large, flat, uniform surfaces (floor, walls). Courtesy of the MSU Eye-Lab. Images in this thesis are presented in color.

The peripheral information is used mainly to select the next fixation point, after which recognition occurs based primarily on foveal information. In light of these facts, a logical paradox occurs: how does the visual system decide where something interesting is in the scene, before knowing what it is? Trying to answer the "what" question first seems equally hopeless, for how can it be decided what an object is, before looking at it (i.e. before knowing where it is)? One theory that attempts to explain the selection mechanism is the saliency map framework, which we discuss in some detail in the next section.

One fundamental problem in gaze control, part of "across fixations" research, is how integration across saccades is performed, i.e. what information the brain retains from fixation to fixation in order to produce the illusion of a coherent, uniform resolution image, rather than a collection of snapshots of various regions in a scene.

Figure 2.2: Human fixations (red dots) in an outdoor scene. Courtesy of the MSU Eye-Lab.

It has been suggested (see [35] and references therein) that information from successive fixations is stored in a high capacity visual buffer. The main technique used in vision research to investigate this problem is to present human viewers with an object extrafoveally, and then instruct them to direct their gaze towards the object. During the saccade the object is replaced with another object, which the subjects must name. Studies in object perception [35], however, show that the visual information is not stored in a point by point fashion, but rather as relatively abstract representations.
Integration of information across saccades remains an open problem [35].

2.1.1 Saliency Map Theory

If the understanding of a scene is a sequential process, then naturally, we must try to understand the mechanism that allows the shift of attention from one region to another. An initial step is to observe that not all visual processes are created equal. Some features (e.g. luminance, direction of motion) can be processed quickly and in parallel over the entire field of view, whereas more cognitive processes, such as shape analysis or fine recognition, have much higher complexity and require an increased amount of resources; these necessarily must be carried out locally, on a restricted region called the focus of attention, which most of the time (but not always!) coincides with the fixation point. The former processes constitute the "pre-attentive stage", whose role is to select a single region as the focus of attention, which will subsequently be analysed by the latter processes - the "attentive stage".

There exists experimental evidence that strongly supports the "selective attention" theory. In the first place, experiments by Treisman [47], [48] showed that search for a target defined by a single visual feature (e.g. color, or edge orientation) takes place in parallel over a visual display, whereas search for a target defined by a conjunction of several features is done sequentially (e.g. find a vertical red line among many red and blue lines, vertical or horizontal). It has also been established [7], [20], [47] that only a limited set of features (e.g. color, edge orientation and curvature) can be detected in parallel. Physiological evidence based on recordings of neuronal activity in awake behaving monkeys also supports the selective processing of the visual data [8], [12], [13].

In light of these facts, Koch and Ullman [22] proposed a model of visual attention called the saliency map, aimed at explaining the shift of attention and the saccadic eye movements. We describe their model in the remainder of this section. The result of the processing in the pre-attentive stage is a set of topographical cortical maps (possibly at different spatial resolutions), which are based on low-level visual information (color, edge orientation, texture, contrast, disparity, direction of motion) and provide an early representation of the environment. Each of these feature maps codes for conspicuity within one feature: the more different a location is from its surroundings, the more conspicuous (i.e. more salient) it is. The feature maps are then combined into a saliency map by computing at each location a weighted sum of the saliency in each of the features. We must point out that it is nontrivial to compute such a weighted sum, because some of the features are inherently not comparable (e.g. it is not obvious how to compare color conspicuity with edge orientation). In fact, it is quite possible that these weights may vary, and may get top-down (semantic) influence from some higher cortical centers. Once a salient location has been selected, more abstract properties of that region are created and a central, non-topographical representation of that region is formed. A "winner-take-all" network is proposed for finding the most salient region in the saliency map. A second network is also necessary for building the abstraction of the attended region into a central representation.
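As a concrete reading of the framework just described, the sketch below combines feature maps by a weighted sum and selects the focus of attention with a winner-take-all argmax, suppressing previously attended regions so that attention can move on. This is only a minimal illustration of the idea, not Koch and Ullman's implementation; in particular, the rescale-to-[0, 1] normalization is our assumption.

```python
import numpy as np

def saliency_map(feature_maps, weights):
    """Weighted sum of feature maps, each first rescaled to [0, 1]
    so that incommensurable features become loosely comparable."""
    s = np.zeros_like(feature_maps[0], dtype=float)
    for f, w in zip(feature_maps, weights):
        rng = f.max() - f.min()
        s += w * (f - f.min()) / rng if rng > 0 else 0.0
    return s

def next_focus(s, prev=None, radius=10):
    """Winner-take-all selection; previously attended spots are inhibited,
    in the spirit of the saliency decrease Koch and Ullman suggest."""
    s = s.copy()
    for (y, x) in (prev or []):
        s[max(0, y - radius):y + radius, max(0, x - radius):x + radius] = -np.inf
    return np.unravel_index(np.argmax(s), s.shape)

# Toy usage: three random "feature maps" (say color, orientation, intensity).
maps = [np.random.rand(60, 80) for _ in range(3)]
focus1 = next_focus(saliency_map(maps, [1.0, 1.0, 1.0]))
focus2 = next_focus(saliency_map(maps, [1.0, 1.0, 1.0]), prev=[focus1])
```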
Since the saliency of a region in the field of view is based on low-level visual features, it is extrinsic to the visual system, i.e. it is entirely determined by the environment. Then the most salient region in the early representation remains the most salient also after that region has been attended. Naturally, then, the question is: if the maximum saliency region is always selected, how does one ever move the eyes to another location? Koch and Ullman suggest that higher cortical centers decrease the saliency of the attended region, so that after some time, that region is no longer the most salient in the field of view, and another maximum is selected elsewhere.

A few comments are in order here. In the first place, as we have already mentioned, it is not trivial to combine the various feature maps into a single saliency map. Secondly, in the saliency map framework, it is not clear when the high level processing occurs. The saliency map only tries to explain how the next focus of attention is chosen, and, as presented above, semantics are given little consideration. In the absence of semantics, the selection of the next fixation point is a reactive, deterministic process (once an environment is given), which does not require any voluntary, cognitive actions; in other words, it is a reflex. While this may be the case during the first few fixations when presented with a new scene, we believe that as the scene is being gradually understood, semantics and the specific task at hand have an increased role in selecting the next fixation point.

We close this section by mentioning that while there is biological evidence for the existence of various feature maps, not all researchers accept the idea of a unique topographic map to represent the most salient stimuli [10].

2.2 Reinforcement Learning

We emphasized in the previous section the necessity of a top-down control mechanism to obtain the functionality of a gaze control system. Since we treat visual attention as a sequential decision problem, we shall use reinforcement learning techniques [43] to implement the top-down control in a task specific way. In this section we present the basic formalism of sequential decision-making.

An agent is a system which can be in one of a finite set of states, and which can take actions. Depending on the problem at hand, the information used to define the states could be extrinsic to the agent, describing properties of the environment (e.g. location information, "obstacle ahead", "rainy day", etc.), or intrinsic, internal to the agent (e.g. "hungry", "happy", "angry", "sure", "unsure", etc.). In general, there is one set of actions for each state, but for simplicity we shall assume that at each state the agent can choose from the same set of actions. The actions change the state of the system and the agent must re-estimate the state by observing the environment. The agent gets rewarded for each action it takes, and tries to learn what action is best in each state, with respect to some optimality criterion, by moving from one state to another until some termination conditions are met. Typically, the components of a sequential decision making problem are as follows:

- States: S = {s_1, s_2, ..., s_n}.

- Actions: A = {a_1, a_2, ..., a_m}.

- Transition probabilities: P(s'|s,a) of reaching state s' given that the agent was in state s and has taken action a. In general, the transition from s to s' by executing action a depends not only on s and a, but on the whole sequence s_0, a_1, ..., s_n = s up to the current state. In many sequential decision problems, however, the transition from one state to another does not depend on history, i.e.

    P(s_{n+1} | a_{n+1}, s_n, ..., a_1, s_0) = P(s_{n+1} | a_{n+1}, s_n)    (2.1)

  Equation (2.1) is called the Markov property, and a system in which it is satisfied is called a Markov Decision Process (MDP). We assume the Markov property throughout this thesis.

- Reward: R(s,a,s') for transiting from state s to s' as a result of action a. The reward can be positive, zero, or negative (cost). We assume that rewards are bounded in absolute value. Since the action a puts the system in state s' with probability P(s'|s,a), the quantity

    r(s,a) = E(R(s,a,s') | s,a)    (2.2)
           = Σ_{s'} P(s'|s,a) · R(s,a,s')    (2.3)

  represents the one step expected reward for executing action a in state s.

- Objective function: A function depending on long term reward, which the agent must optimize. For instance, the agent might try to maximize the sum of rewards r(s_0, a_1) + r(s_1, a_2) + ..., where a_i is the action taken at moment i, s_i is the state at time step i, and s_0 is the state the agent starts from. To guarantee that the above series is finite, we assume the rewards are bounded and we discount each of the terms by a factor 0 < γ < 1. We require the agent to maximize at each state the expected value of the sum of discounted rewards:

    E( Σ_{t=1}^∞ γ^t r(s_{t-1}, a_t) )    (2.4)

  See [21], [24] for more examples of optimality metrics.

A (stationary, deterministic) policy is a function π : S → A. The agent follows a policy if in each state s it always chooses the same action a = π(s). Given an optimality metric and a policy π, we can define for each state s a value V^π(s). For instance, for the optimality metric (2.4), the value of state s is

    V^π(s) = E( Σ_{t=1}^∞ γ^t r(s_{t-1}, π(s_{t-1})) | s, π )    (2.5)
           = r(s, π(s)) + γ · Σ_{s'∈S} P(s'|s, π(s)) V^π(s')    (2.6)

where s_0 = s and s_1 = s'.

By "optimizing" the objective function we mean that the agent must find an optimal policy, i.e. a policy π* such that

    V^{π*}(s) ≥ V^π(s),  ∀ s ∈ S

For simplicity we denote V* = V^{π*}. By (2.6) we have that

    V*(s) = max_{a∈A} [ r(s,a) + γ Σ_{s'∈S} P(s'|s,a) V*(s') ]    (2.7)

for any state s. Equations (2.7) are called the Bellman optimality equations, and there are two known dynamic programming algorithms that can be used to solve them: value iteration and policy iteration [21], [24], [33], [43]. These algorithms can be applied when we have a model of the MDP, i.e. when the transition probabilities and the reward functions are known a priori. This may be the case in some manufacturing problems, but it rarely happens in real robotics problems. Indeed, a robot does not know the outcome of an action until it has completed that action. There exists, however, an algorithm called Q-learning, due to Watkins [50], which can be used to solve the Bellman equations even when the transition probabilities or the rewards are not known. The algorithm is iterative, it is proven to converge, and it finds a solution of the Bellman equations.

To begin with, suppose that V* is a solution of (2.7). For each pair (s,a) define a function

    Q*(s,a) = r(s,a) + γ · Σ_{s'∈S} P(s'|s,a) V*(s')    (2.8)

which is precisely the quantity under the max in Equation (2.7). Clearly then,

    V*(s) = max_{a∈A} Q*(s,a)    (2.9)

Substituting (2.9) back into (2.8) we get

    Q*(s,a) = r(s,a) + γ · Σ_{s'∈S} P(s'|s,a) max_{a'∈A} Q*(s',a')    (2.10)

Unlike in the Bellman equations, in equations (2.10) the transition probabilities are not under the max. The quantity

    Σ_{s'∈S} P(s'|s,a) max_{a'∈A} Q*(s',a')

is the expected value of max_{a'∈A} Q*(s',a').
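To make equations (2.7)-(2.10) concrete, here is a minimal value iteration sketch on a hypothetical two-state, two-action MDP (all transition probabilities and rewards below are invented for illustration). It computes Q* from a known model, which is precisely the requirement that Q-learning, described next, removes.

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions (all numbers illustrative only).
# P[s, a, s'] = transition probability; R[s, a] = one-step expected reward r(s, a).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Bellman backup (2.10): Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)
V = Q.max(axis=1)           # equation (2.9)
policy = Q.argmax(axis=1)   # greedy (optimal) policy pi*(s)
```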
The Q-learning algorithm tries to find the function Q*(s,a) through successive approximations. Once Q* is found, we can find an optimal policy by

    π*(s) = argmax_{a} Q*(s,a)

Let Q_n be the approximation at time step n. Q_0 is initialized randomly, or it is identically 0. The iterative step is given by the following update equation:

    Q_{n+1}(s,a) = Q_n(s,a) + α_n [ r(s,a) + γ · max_{a'∈A} Q_n(s',a') - Q_n(s,a) ]    (2.11)

where α_n is a learning rate parameter. The transition probabilities are not present in the update equation (2.11). It can be proven rigorously [50] that the sequence Q_n converges to a function Q* which satisfies (2.10), provided that Σ α_n = ∞ and Σ α_n² is finite, and provided that all (state, action) pairs are visited infinitely often.

Chapter 3

Problem Definition

3.1 Research Questions and Task Definition

Now that we have introduced the basic ideas in active vision and reinforcement learning, we can formulate more precisely the theoretical questions that we investigate in this thesis.

The first question is how perception and action can be unified in a coherent, possibly task dependent visual behavior. In other words, given that the agent recognizes (with some certainty) an image, how does it decide what to do afterwards? We propose a model of visual attention which comprises a low level, reactive layer, which essentially implements a saliency map, controlled by a high level, semantic layer. The addition of the top layer goes beyond previous gaze control models, in which the emphasis was on selecting the most salient point in a task independent way.

Secondly, we investigate to what extent reinforcement learning can be used to integrate information across saccades. We propose that only minimal visual information about a region in the environment is retained, together with the direction of the saccade from that region towards some goal. This is analogous to the landmark navigation that organisms (from insects to humans) are known to perform.

Finally, we ask whether it is possible to learn relationships between objects. For instance, if the agent is searching for a pen, then, to restrict the search, we would like it to know from prior experience that a pen is likely to be found on a desk. Many examples from real life show that people do use this kind of prior knowledge: we look for keys in certain places, we look for a book in a bookcase, etc. We believe that reinforcement learning can partially answer this question, but a more complete solution relies heavily on the ability of the agent to recognize classes of objects, rather than specific objects (e.g. the agent can recognize "books", or "pens", not just "this book", or "this pen").

A visual search for a relatively small object which is not necessarily in an exact location, but which can usually be found near some other larger fixed object in a cluttered environment, seems to have the necessary complexity to address the above issues. At any given moment the system can only process a limited region in the environment at high resolution, and knowledge about each of the regions visited must be "remembered" from one visual frame to another. Also, moving to another region requires some way of finding a new fixation point, which requires solving the aspect of visual saliency. The fact that the target object is generally in one region of the room requires that the saccades be directed consistently towards the target region, hence the problem of top-down control.
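Before stating the task formally, it may help to see the update (2.11) as code, since it is the learning rule the agent uses for top-down control. The sketch below is a generic tabular implementation under our own assumptions: the environment is abstracted as a `step(s, a)` callback and exploration is ε-greedy, neither of which is prescribed by the thesis.

```python
import random

def q_learning(n_states, n_actions, step, steps=5000,
               alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning. `step(s, a)` is the environment: it returns
    (reward, next_state); no transition probabilities are needed."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # Epsilon-greedy exploration keeps all (state, action) pairs visited.
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        r, s2 = step(s, a)
        # Update (2.11): move Q(s,a) towards r + gamma * max_a' Q(s',a').
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q
```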
Formally, we define the search task as follows.

Task: Find a pre-specified object (Figure 3.1 (a)) in a fixed position in a room (Figure 3.1 (b)), using a fixed pan-tilt-zoom camera.

Input: An RGB image of the target object T.

Output: A set L = {L1, L2, ..., Ln} of landmarks (regions) in the room, and a policy on L for finding the target T.

Performance Metric: Number of saccades the camera makes from an arbitrary (but fixed) position, until it finds the target T.

Figure 3.1: (a) RGB image of the target object. (b) Panoramic image of the environment. Images in this thesis are presented in color.

The general approach is to create the set L incrementally, by using an instance based approach of clustering the images. Each image is represented by means of color histograms. Q-learning is used to learn the policy on the set L. The next fixation point is chosen by using two feature maps (rather than a single saliency map): we do a low resolution search for a candidate location, based on color, followed by a high resolution match at each candidate location. If the target is not found, then we compute symmetries in the image using an adapted version of the symmetry operator described in [36], and choose the maximum in a given direction as the next fixation point.

We want to emphasize here that although a gaze control system involves a great deal of image processing and pattern recognition techniques, we do not attempt to build a (nearly) perfect, general purpose recognizer, nor the perfect saliency map. What we strive for is to introduce task dependent, top-down control in visual attention. Improvements both in the image processing component and in the reinforcement learning component will be added subsequently, based on the insight gained from the current work. For now, however, we see this work as the first step towards understanding the computational issues involved in building a gaze control system capable of operating in realistic environments.

Chapter 4

Related Work

4.1 A Saliency Map Implementation

A number of approaches to visual attention, visual search and gaze control have been proposed [1], [9], [16], [22], [29], [34], [39], [46], [53]. We shall succinctly describe some of them in this section.

Perhaps the most comprehensive implementation of the original saliency map theory proposed by Koch and Ullman was done by Itti and Koch [16], [15], who implemented a model of a task independent, pre-attentive selection mechanism. A saliency map is built using a bottom-up approach. To this end, three types of feature maps are built and combined into a unique map: a color map, an edge-orientation map and an intensity contrast map, each at different scales in a pyramidal scheme, yielding a total of 45 feature maps. Each of these features uses a center-surround structure that mimics the visual receptive fields, which is meant to pick up regions that are in sharp contrast with their surroundings. For the color contrast, for instance, the difference
In [15], the same authors showed that a simple way to combine the features by normalizing them to a fixed range and summing up, did not work well. In [16], strongly motivated by biological evidence, Itti and Koch use a two dimensional difference-Of-gaussians (DOG) to model inhibitory-excitatory interactions between neurons. First, each feature map is re-scaled between 0 and 1. Iteratively, each map is convolved with a DOG filter, the original image is added to the result and the negative entries in the result are set to 0. This “within feature” competitive process produces a conspicuity map (for each feature) with a few peaks left, each peak marking a location in the image most contrasting with its surround. Finally, these three conspicuity maps are summed up linearly and the most salient point in the image is the one with the highest peak. Another important model Of visual attention is the one proposed by Rao, Zelinsky, Hayhoe and Ballard in [34]. The idea here is to define saliency based on iconic representation of each location in the image. An iconic representation at a location is nothing but visual information from a patch in the image, centered at that location. Then, given also an iconic representation of a target object, one could search for that Object in the image by defining the saliency at each location as the correlation between the representation at that location and the representation of the model. Encoding 26 icons literally as raw images, is clearly prohibitively expensive, so the authors propose an efficient way Of iconic encoding, as the response around each pixel to a number Of different filters, at different scales. Specifically, for a fixed scale, a gaussian is used to generate 3 first order, 2 second order and 4 third order derivatives, for a total of 9 filters. Three scales are considered, yielding a total Of 27 multi-scale filters. Each of these filters is applied to each Of three (modified) chromatic channels of an RGB image, producing at each location in the image a vector of 81 responses. Given a similar representation of a target object, one essentially defines the saliency at each location (:13, y) in the search image as the L2 norm between the response at (r, y) and the model representation. The next fixation point is selected as the one with the smallest norm. Unlike the Itti and Koch model, which is purely bottom-up and task independent, this model requires knowledge about the task, in that it uses the iconic representation Of the target Object to define saliency in the search image. 4.2 Reinforcement Learning - I Trevor Darrel [9] applies McCallum’s nearest sequence memory (NSM) algorithm [27] to implement a gaze control system for gesture recognition. Darrel’s system consists of two cameras: one that provides a static, low-resolution, wide field Of view image, and another one with pan-tilt action, which provides high-resolution, narrow field Of view images. Part Of the world state is fully observable by means of low resolution images. However, the hand gestures and facial expressions can only be Observed from the high resolution images provided by the active camera. The various low-level routines can 27 perform person tracking, and can guide the active camera to saccade from one body part to another, based on the global, low resolution images. Other visual routines perform the gesture recognition based on the high resolution images taken by the active camera, provided that the camera has been directed towards the relevant body parts (e.g. hand or face). 
The system is supposed to learn how to choose the next fixation point in order to obtain new images of gestures or expressions, and maximally discriminate a particular gesture. The low level routines extract at each time step 9 features from the low resolution and high resolution images: (person-present, left- arm-extended, right-arm-extended, face-expression, left-hand-pose, right-hand-pose, left-hand-foveated, right-hand-foveated, head-foveated). The state is hidden because, for instance, the left-hand-pose (which can be neutral, point or Open) cannot be known unless the active camera is pointed to the left hand. At any time, the system can choose between 4 (primitive) actions: look—body, look-head, look-left-hand and look-right-hand. Since the state is hidden, the system can be treated naturally as a partially observable MDP (POMDP). The system keeps track Of memory sequences (mt)0gtSTa where each element mt is a triple mt = (at, rt, at) of action, reward and observation at moment t, and associates with each such sequence a Q-value, to decide the utility of that sequence in dis-ambiguating the state of the world. The system does manage to learn how to perform the task, but despite the small number Of states and actions, it takes a considerable amount of time to train. An interesting idea is presented by Rimey and Brown [39] where Hidden Markov Models (HMM) are used by an artificial gaze control system to learn saccadic move- ment sequences. In one example, the pan-tilt camera is assumed to II‘lOVG in small 28 constant increments in 8 directions. Let v0, . . . , v7 be these motions. An 8-state H.\-*I.\'I is trained, in which the symbols (Observations) are precisely the v.,-’s. Observation se- quences are now produced, for instance, by drawing a scan path with the mouse on a computer screen, or by monitoring human eye movements. At time t = 0 the system “perceives” Observation vto, at moment t = 1 Observation vt,, etc. Thus, a given scan path produces Observation sequences which can be used to train the 8 state H.\I'l.\:’l, i.e. to learn the transition probabilities a,j between states and the observation prob- ability densities b,(-). The learned HMM is then used to generate symbols from the learned distributions, which drives the camera to moving along a path similar to the one used for training. Obviously this scheme does not use any visual information, so the learned path is independent of the underlying image. However, the authors modify the path generation after the HMM has been trained, to include saccades towards the most salient points in the image. Moreover, they develop the formalism for training augmented-HMMS (AHMM). As an application, the authors train two HMMS (a “where” HMM and a “what” HMM), to implement two separate behav- iors and combine them using the newly defined notion of AHMM to create a more complex behavior. Finally, we shall describe a reinforcement learning solution to Object recognition due to Bandera et. a1. [6]. Suppose we have a foveated vision system whose task is to recognize Objects from a finite number of classes. TO fix the ideas, suppose the classes are “cube”, “cone” and “pyramid”. Suppose also that the system can detect, fixate and recognize various visual features that are present in all these classes. For instance, a feature could be a 90 degree angle as it is indicative of a cube; vertical 29 and horizontal edges are also features present in a cube; a circular curve is indicative of a cone, etc. In general. 
recognition with sufficient confidence requires the presence Of a conjunction of these features. The various features have different importance for the classification task, since the more Objects share a common feature, the less discriminant power that feature has. The system must learn which features are the most discriminant for the given set Of Objects. At discrete time steps, the system chooses one feature to interrogate. As the various regions of the Object are being fixated on, a confidence measure is associated with each of the features in the “feature database” which indicates the presence or the absence of that feature in the Object. The state of the recognition process is defined intrinsically, as the vector Of these measures. These confidence measures are readily transformed into probabilities and the system must reach a state of minimal entropy by choosing the feature to be fixated next, based on its discriminant power. The one-step reward for choosing a feature is equal to the decrease in entropy it produces when interrogated. A neural net is then used to implement the residual Q-learning algorithm. Although implemented in simulation, we consider this work of great importance, because it is an example of a system in which the states are internal, reflecting the stage of the recognition process, rather than properties of the environment. Moreover, we believe this approach has great potential in automatic feature selection. 30 Chapter 5 Model Description 5.1 The System Overview The system consists Of a Sony EV I-D30 pan-tilt-zoom camera placed in an elevated location in our lab, controlled through the serial port. by a PC with a Pentium II 400 MHz processor. TO grab images we use an inexpensive BT848 board. The pan and tilt Of the camera (measured in ticks) range in the interval [~860, 862] x [—281,283]. In degrees, these ranges translate into [—100°,100°] x [~25°,25°]. At the lowest resolution, the largest field of view of the camera is approximately 48° x 33°. Figure 5.1 shows the architecture of the system. The target object is provided to the system in the form of two template images: one at low resolution, used for coarse search, and the other at high resolution, used for fine matching (see Figure 5.2). The camera takes low resolution images from the lab and feeds them, one at the time, into the two layers of the system. Each low resolution image is processed along two streams - the two main modules of our model (see Figure 5.1). The top 31 Reinforcement learning (RL) Q— learning _——_————————————————————————— ll clustering 1 motor commands back—projection intersection c: o camera ‘53 Vision module (V) *3 symmetry operator :1) .............. ., .............. 93. D H N :42 histogram : histogram sis—l i— \ / \l ‘7’ \ / Object / input image image Figure 5.1: Overall architecture of the system. The top module performs the feature extraction necessary to cluster the images online. The resulting regions are used as states in a Q-learning program and a policy for moving towards the target region is learned. The bottom module implements the “within fixation” visual processing, consisting of computation of interesting regions in the image and object recognition. module (which uses reinforcement. learning) learns a set Of clusters online consisting Of images with Similar color histograms. The clusters represent physical regions in the environment, and are used as states in the Q-learning method. 
This module learns a policy for saccading from one region to another towards the region(s) most likely to contain the target object. By selecting the gaze direction according to its utility (the Q-value), the reinforcement learning module provides top-down control to a lower, purely vision-based module. The vision module consists of low-level visual routines and its purpose is to compute two feature maps (color map and synunetry map) for 32 Millikan i t .5 ; (a) 1 (b) 2 Figure 5.2: (a) Low resolution template of the target object. (b) High resolution template. representing saliency 1 in an area of interest in the image, and to recognize the target object, both at low resolution and, if need be, at high resolution for high confidence. All computations in this module are performed in exocentric coordinates 2. Unlike [16], at this point we do not attempt to combine the feature maps to form a unique, task independent saliency map, but rather use them sequentially, first the color map (for finding candidate target locations), then the symmetry map (if the target was not found and a new saccade is necessary). The output is a set of landmarks L = {L1,L2, . ..L,,} specific to a particular environment, and a policy on the set L which directs the camera one fixation at the time, starting from any point in the room, and moving towards the region(s) where the target object is normally found. Each element in the set L is a cluster of images representing the same region in the room. The set of landmarks is learned online, and the clustering is solely based on visual input from the camera (i.e. not based on the pan-tilt coordinates of the camera). Although desirable, we don’t expect/ require 1The next fixation point is the resulting most salient point. 2Obviously, image coordinates must be transformed to camera coordinates, in order to physically carry out the saccade, but the decision making does not use pan-tilt information. 33 that the landmarks learned he a faithful representation of the environment (i.e. they may not correspond to physical objects in the room). 5.2 The Q-learning Module 5.2.1 Feature Extraction In any image classification problem, treating the raw images as vectors is unfeasible due to the very high dimension of the resulting vectors. Dimensionality reduction techniques (e.g. principal component analysis [28]) and feature extraction must be apphed. DISTRIBUTION OF KULLBRCK OISTRNCE 0N P9195 OF IHfiGES 180 T T 1' T I T U mil ir" -—--- J1, “diiierent“ -‘L' 160 i- / \ .r If. i] / " 140 I- . - Ir "- VIA I ‘\ [All 1,”, \I 3’ le {i ii '\ 120 ~ f 'x i ‘ I. ‘ {l t \l / \\ l l // \ 100 *- l \K i, ‘,_ i [ 3' \ u 80 .. ] II . . \ I . . .[ \ l . ‘ H \' 60 II] \ \\ -i l \ l , \. 40 J l “ - i z . ,/ i\ \ 2,, . _/ .. /. \,\ \\_._ A o 1 i L '- -. I"! ’ ' ""3 A l L ‘ O 1 2 3 4 5 6 7 8 9 10 kullback distance Figure 5.3: Kullback distance of pairs of similar images (red) and on different images (green) Our first attempt to reduce the size of the observation vectors was to compute the wavelet transform [42] of the images. Wavelet coefficients produce good dimension- ality reduction (an order of magnitude better than JPEG compression, with similar 34 0.035 HISTOGRHMS FOR SIMILRR IMRGES “reg onvnl z“ LregFonfi01_1"'—-—— 0.025 - FREQUENCY 0.015 - J [ [l 0.01 » ‘ .‘l 2“] l lam" ‘ ill 60 80 100 120 140 160 COLOR BINS 0.005 I 001:“ ll0 ullliiii I . 
Wavelet coefficients produce good dimensionality reduction (an order of magnitude better than JPEG compression, with similar quality), but are too sensitive to small changes in the pan-tilt of the camera, as the wavelet transform is not translation invariant. Small changes in the x, y position of the camera produce large variations in the feature space, which makes any clustering attempt useless.

A set of features that changes slowly as a function of the direction of gaze is provided by the (color) histogram. For an introduction to the computer representation of colors and color spaces see for instance [18], [19], [41]. Quantizing each channel of the RGB space into 16 bins, we get 4096 colors. However, for a given environment, many of the colors occur with zero or very small probability. We can then build a histogram on all 4096 colors of 100 random images from the whole lab and retain only the colors with a count of at least 0.5%. The number of colors retained this way was 49. Henceforth, these colors will be called dominant colors.

Figure 5.4: Top: feature vectors (4 concatenated color histograms) for two similar images (bottom).

Figure 5.5: Top: feature vectors extracted from two different images (bottom).

Histograms do have the advantage that they change little when the viewing angle changes, but they may introduce perceptual aliasing: there may be very different images (locations in the environment) which produce the same histogram. To alleviate (but not necessarily eliminate) this problem, we divide each RGB image into 4 quadrants and concatenate the histograms on the dominant colors of each of the quadrants. The resulting observation vector then has dimension 196 = 4 · 49.

Having defined the feature vectors, a suitable metric needs to be chosen, in order to tell whether or not two images are similar. Given a pair $(I_1, I_2)$ of images, two situations can occur: either $I_1$ and $I_2$ represent the same region in the environment, or not. If we define two classes

$$C_{\text{similar}} = \{(I_1, I_2) \mid I_1 \equiv I_2\}, \qquad C_{\text{different}} = \{(I_1, I_2) \mid I_1 \not\equiv I_2\}$$

then a metric on pairs of images becomes a feature which can distinguish the two classes. We therefore choose a metric that maximizes the inter-class separation. After trying the Euclidean metric and correlation, we chose the symmetric Kullback distance, which we describe next.

The amount of information produced by an event A which occurs with probability P(A) is defined as

$$\log_2 \frac{1}{P(A)} \qquad (5.1)$$

Given two probability distribution functions p(x) and q(x) of a random variable x, the Kullback contrast [17] between p and q is defined as the expected value with respect to q of the difference of information between p and q. That is,

$$k(p, q) = E_q\left(\log_2 \frac{1}{p(x)} - \log_2 \frac{1}{q(x)}\right) \qquad (5.2)$$

$$= \int q(x) \log_2 \frac{q(x)}{p(x)} \, dx \qquad (5.3)$$

It can be proven [17] that k(p, q) is positive definite, but it is not symmetric. To transform the above Kullback contrast into a metric, we symmetrize it in the standard way:

$$K(p, q) = \frac{k(p, q) + k(q, p)}{2} \qquad (5.4)$$

A histogram represents the probability distribution that a certain color occurs in an image, and so it makes sense to use the Kullback distance to measure the similarity between two histograms.
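To make the feature extraction and the metric concrete, the following Python sketch computes the 196-dimensional observation vector and the symmetric Kullback distance between two such vectors. The per-quadrant normalization and the small epsilon that guards against empty bins are our assumptions for illustration, not details taken from the actual implementation.

```python
import numpy as np

def observation_vector(image, dominant_colors):
    """Concatenated dominant-color histograms over the 4 quadrants.

    image: RGB uint8 array of shape (H, W, 3).
    dominant_colors: indices of the retained color bins (49 in the thesis).
    """
    h, w, _ = image.shape
    quadrants = [image[:h // 2, :w // 2], image[:h // 2, w // 2:],
                 image[h // 2:, :w // 2], image[h // 2:, w // 2:]]
    features = []
    for q in quadrants:
        rgb = q.astype(int) // 16                        # 16 bins per channel
        idx = rgb[..., 0] * 256 + rgb[..., 1] * 16 + rgb[..., 2]
        hist = np.bincount(idx.ravel(), minlength=4096)[dominant_colors]
        features.append(hist / max(hist.sum(), 1))       # each quadrant a pmf
    return np.concatenate(features)                      # dimension 4 * 49 = 196

def kullback_distance(p, q, eps=1e-9):
    """Symmetric Kullback distance K(p, q) = (k(p, q) + k(q, p)) / 2."""
    p, q = p + eps, q + eps                              # guard against log(0)
    return 0.5 * (np.sum(q * np.log2(q / p)) + np.sum(p * np.log2(p / q)))
```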
To see the effectiveness of the Kullback metric, we grabbed 50 random images from our lab by choosing the fixation points uniformly, and 50 images from one region in the lab, i.e. by choosing the fixation points from a gaussian distribution with a small variance, centered around a pre-specified point. We formed all the pairs within each class of images, and for each pair we computed the Kullback distance. Figure 5.3 shows the Kullback distance as a feature on each of the similar/non-similar classes of pairs of images. The separation threshold between pairs of similar images and non-similar images is around 2. However, there is a significant overlap between the two distributions, so, in order to reduce the false accept rate, we moved the threshold to the left, to 1.5, at the expense of the false reject rate. Figure 5.4 shows a pair of similar images and their feature vector representation as the concatenation of color histograms on each of the 4 quadrants of each image, while Figure 5.5 shows the feature vectors for two non-similar images.

5.2.2 States, Actions and Reward

We define a state to be a cluster of images representing the same region in the room. As explained in the previous section, from each image an observation vector is extracted by concatenating the histograms on the dominant colors in each of the 4 quadrants. The similarity metric between two observation vectors is given by the Kullback metric.

Q-learning requires that the set of states be known a priori. However, due to the sequential processing of a scene, understanding must be incremental. For this reason, we start with an empty set of clusters of images, and form new clusters as the agent fixates on new regions in the room. The clusters are spherical (with respect to the Kullback distance), and have a radius of at most 1.5 (see feature extraction). The center of each cluster is the very first observation vector used to initiate creation of that cluster, and it is not updated as new observation vectors are added.

We define 9 actions. $A_0$ is a saccade to the most salient point in the whole image. Further, we define 8 more actions $A_1, \ldots, A_8$ as follows. Choose a cartesian coordinate system with the origin at the center of the image, and for each $k = 1, \ldots, 8$ let $R_k$ be the ray through the origin which makes an angle $45° \times (k - 1)$ with the x axis. Let $S_k$ be a 90° sector in the image, bisected by $R_k$. Then we define action $A_k$ to be the saccade to the most salient point in sector $S_k$.

Figure 5.6: The shaded region represents a sector in image A. The most salient point is computed (the blue square) and a new image B, centered at the new location, is taken.

The shaded region in Figure 5.6 represents sector $S_1$ in image A. The small blue square is the most salient point in $S_1$. Once its coordinates are computed, the camera is directed toward the new point and a new image B is taken. The actions $A_1, \ldots, A_8$ always force the camera to be directed away from the current fixation point. Action $A_0$ is justified by biological evidence that human subjects sometimes make a series of very short saccades within the same region of interest, perhaps for the purpose of acquiring more information. If the agent chooses to execute $A_0$ and the most salient point is near the optical axis, the agent need not move towards a point in the periphery.

Early in the development of this system we had two extra actions: a search action and a saccade to a random point. We realized, however, that the search for the target object is part of the "within fixation" processing; consequently, it should be carried out at each time step, and should not be considered a separate action. The random saccade was also eliminated, because if the outcome of an action has a uniform distribution, then the agent has no way of learning whether it should execute that action or not. Also, before defining the actions using 90° sectors, we tried 45° sectors. We opted in favor of the current size of 90° because with too narrow an angle, some objects might not be entirely contained within a sector, so their symmetry would be influenced in a negative way. A sketch of how the most salient point in a sector can be selected is given below.
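The following sketch illustrates the sector-based selection of the next fixation point. The placement of the origin at the image center and the upward orientation of the y axis are our assumptions about the coordinate conventions; the saliency map itself is computed by the vision module described later.

```python
import numpy as np

def next_fixation(saliency, action):
    """Most salient point for action A_k (a sketch).

    saliency: 2-D map of saliency values. A_0 searches the whole image;
    A_1..A_8 search the 90-degree sector bisected by the ray at
    45 * (k - 1) degrees from the x axis.
    """
    h, w = saliency.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - w / 2.0, (h / 2.0) - ys       # cartesian system, y up
    if action == 0:
        mask = np.ones_like(saliency, dtype=bool)
    else:
        theta = np.degrees(np.arctan2(dy, dx)) % 360.0
        center = 45.0 * (action - 1)            # direction of the ray R_k
        diff = np.abs((theta - center + 180.0) % 360.0 - 180.0)
        mask = diff <= 45.0                     # 90-degree sector around R_k
    masked = np.where(mask, saliency, -np.inf)
    return np.unravel_index(np.argmax(masked), masked.shape)
```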
Figure 5.7: If the reward is found in the periphery (image A), then upon subsequent visits to the same region (image B) the reward could be missed, despite the fact that the agent executed the "right" action.

The reward function depends on the current state s, the action a taken in s, and the next state s', and is given by:

$$r(s, a, s') = \begin{cases} 100 \cdot e^{-x^2 / (2\sigma^2)} & \text{if the target is found in state } s' \\ -4 & \text{otherwise} \end{cases} \qquad (5.5)$$

In the above equation, x is the distance from the center of the image (the current fixation point) to the target object, and the variance $\sigma^2$ controls the size of the gaussian defining the reward. The reason we reward more for bringing the target to the center of the image than for finding it in the periphery is that the latter situation is unreliable. Consider the following scenario: the agent is in state s, takes action a and lands in state s', where it perceives the target somewhere near the periphery. If the agent gets a significant reward, then the action a becomes optimal in state s and will be chosen upon subsequent visits to s. By using solely visual information, the fixation points can only be computed approximately. Thus, often, when the agent revisits state s and takes action a, the target object is just out of the field of view, and the agent gets penalized for the same action that previously brought a large reward. See Figure 5.7.

The optimality metric (i.e. the function that the agent tries to maximize) is the expected discounted sum of rewards:

$$V^{\pi}(s) = E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) \qquad (5.6)$$
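As a concrete illustration of Equations (5.5) and (5.6), the following sketch evaluates the one-step reward and the discounted return. The value of sigma is illustrative only, since the text does not state the one used in the experiments.

```python
import numpy as np

def reward(target_found, x, sigma=50.0):
    """One-step reward r(s, a, s') of Equation (5.5).

    x: distance (pixels) from the image center to the target;
    sigma: illustrative width of the reward gaussian (assumed value).
    """
    if target_found:
        return 100.0 * np.exp(-x**2 / (2.0 * sigma**2))
    return -4.0

def discounted_return(rewards, gamma=0.9):
    """V = sum_t gamma^t r_t, the objective of Equation (5.6)."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))
```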
5.3 The Vision Module

The vision module (see Figure 5.1) implements some routines as part of the "within fixation" processing. We build a color map using histogram back-projection [44], [45] to obtain candidate locations for the target object in low resolution, wide field of view images. To assess the value of each of these locations we use a recognizer based on histogram intersection [45]. If any of these locations turns out to be a good candidate, the camera is directed to that location, zooms in, and, for high confidence, another match is performed using the same recognizer and the high resolution template of the target object. Finally, if the object was not found in the current frame, a new saccade is necessary. The camera returns to the pan and tilt prior to inquiring the candidate locations, and a new fixation point is selected. To this end, we use a context free symmetry operator [36] to build a symmetry map and select the next fixation as the point with the highest symmetry. For the sake of completeness, we shall briefly describe these routines below.

5.3.1 The Symmetry Operator

Man-made objects tend to have more symmetry than natural objects. In an indoor environment such as a lab, most of the objects are man-made, so it is natural to try to employ a symmetry operator in an attempt to fixate on the centers of objects.

Figure 5.8: Two edge pixels $p_i$ and $p_j$ vote for their midpoint. The strength of the vote depends on the distance between $p_i$ and $p_j$, and on the length and the orientation of the intensity gradient at each of the two pixels.

Following [36], for a graylevel image, we compute the edge map, and then we let each pair of edge pixels vote for their midpoint. The vote is modulated by the distance between the pixels, and by the direction and the length of the gradient of the intensity. More specifically, if I(x, y) is the graylevel image, for each edge pixel $p_i$ define

$$r_i = \log(1 + |\nabla I(p_i)|) \qquad (5.7)$$

For any edge pixel $p_i$ let $\theta_i$ be the argument of $\nabla I$, i.e. the oriented angle between the horizontal and the vector $\nabla I$. For any two edge pixels $p_i$ and $p_j$ let $\alpha_{ij}$ be the angle between the horizontal and the line $p_i p_j$ (see Figure 5.8). The phase factor for pixels $p_i$ and $p_j$ is defined by

$$P(i, j) = (1 - \cos(\theta_i + \theta_j - 2\alpha_{ij})) \cdot (1 - \cos(\theta_i - \theta_j)) \qquad (5.8)$$

We define the distance factor for edge pixels $p_i$ and $p_j$ by

$$D(i, j) = \frac{1}{2\pi\sigma} e^{-\frac{(\|p_i - p_j\| - a)^2}{2\sigma^2}} \qquad (5.9)$$

The vote of the edge pixels $p_i$ and $p_j$ is then defined as

$$V(i, j) = D(i, j) \cdot P(i, j) \cdot r_i \cdot r_j \qquad (5.10)$$

The gaussian in equation (5.9) achieves its maximum when the distance between $p_i$ and $p_j$ is equal to a user-specified parameter a, giving pixels at distance a an increased vote. See Figure 5.9.

Figure 5.9: (a) Sample image. (b) Symmetry map obtained by maximizing the vote for pixels at distance a = 130.

By equation (5.8), the phase factor P(i, j) achieves its maximum when

$$\theta_i + \theta_j - 2\alpha_{ij} = \pi \quad \text{and} \quad \theta_i - \theta_j = \pi$$

which is the case when the two gradients are parallel to the line $p_i p_j$ and point in opposite directions. On the other hand, P(i, j) achieves its minimum when, for instance, $\theta_i = \theta_j$, i.e. when the two gradients are parallel and point in the same direction.

5.3.2 Histogram Back-projection

In addition to symmetry, we use color to define saliency. We do not try to build a task independent, bottom-up color map which would pick up regions contrasting in color with their surroundings (as in [16]), but rather try to find locations in the image with a color distribution similar to the target template. To this end, we use the histogram back-projection algorithm [44], [45].

Figure 5.10: (a) Sample search image. (b) Model (target) image. (c) Histogram back-projection of (b) onto (a).

Given two color images I (the search image) and M (the model, provided a priori), this algorithm finds the locations in I where M is likely to be found. First, one must build the color histograms $H_M$ and $H_I$ of M and I respectively, on the same set of color bins. For improved performance, we use the L*u*v* color space [18] rather than RGB, given by:

$$L^* = 25 \left( \frac{100\, Y}{Y_0} \right)^{1/3} - 16, \qquad u^* = 13 \cdot L^* (u' - u_0), \qquad v^* = 13 \cdot L^* (v' - v_0) \qquad (5.11)$$

In the above equations,

$$u' = \frac{4X}{X + 15Y + 3Z}, \qquad v' = \frac{9Y}{X + 15Y + 3Z}$$

$$\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} 0.490 & 0.310 & 0.200 \\ 0.177 & 0.813 & 0.011 \\ 0.000 & 0.010 & 0.990 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} \qquad (5.12)$$

and $Y_0$, $u_0$, $v_0$ are the reference white values for Y, u', v', respectively. The color bins are precisely the colors that occur in I. Two different source images I and I' have different sets of color bins, and so these bins must be re-computed for each pair (I, M). For an efficient implementation we use look-up tables for the RGB → L*u*v* transformation. After computing $H_M$ and $H_I$ we compute the ratio histogram R by

$$R(j) = \min\left( \frac{H_M(j)}{H_I(j)}, 1 \right) \qquad (5.13)$$

for each color bin j. This ratio histogram is back-projected onto a gray-level image B (the color map) of the same size as the source image I by

$$B(x, y) = R(I(x, y)) \qquad (5.14)$$

That is, for each pixel (x, y), we determine what bin I(x, y) falls in, look up the value of R on that bin, and set B(x, y) equal to this value. The most likely locations of the target object represented by the model M are the pixels (x, y) for which B(x, y) is the largest.
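A minimal sketch of Equations (5.13) and (5.14) follows. It assumes the color quantization has already been carried out, so that each pixel of the search image is represented by the index of its color bin; this data layout is our assumption, not a detail of the actual implementation.

```python
import numpy as np

def back_project(search_bins, model_hist, search_hist):
    """Histogram back-projection, Equations (5.13)-(5.14).

    search_bins: 2-D int array giving, for every pixel of I, its color bin.
    model_hist, search_hist: H_M and H_I computed over the same bins.
    """
    ratio = np.minimum(model_hist / np.maximum(search_hist, 1e-9), 1.0)  # R(j)
    return ratio[search_bins]                     # B(x, y) = R(I(x, y))
```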
The histogram back-projection algorithm works well if the target object is present in the source image, but it cannot distinguish whether the target is present at all or not. The reason is that in the back-projected image B there is always a maximum. Therefore, after having found the candidate locations in the source image, we must run a recognizer to match the model template M at each location. The recognizer we used is based on histogram intersection.

5.3.3 Histogram Intersection

For the recognition of the target object with sufficient confidence, we use a classifier based on the histogram intersection [41], [44]. Let I and T be two images of the same size that we want to match. Mathematically, neither image has a preferred significance, but in practice T is a high resolution template of a given object (in our case the object is a software box) and I is an unknown image that we want to classify as similar to T or not. As with histogram back-projection, we compute the color histograms $H_I$ and $H_T$. Here, since T is given a priori and is large enough (at high resolution an object appears larger), we choose the color bins to be precisely the colors that occur in T. Define the histogram intersection distance between $H_I$ and $H_T$ by

$$d(H_I, H_T) = 1 - \frac{\sum_j \min(H_I(j), H_T(j))}{\sum_j H_I(j)} \qquad (5.15)$$

If I and T have the same size, then obviously

$$\sum_j H_I(j) = \sum_j H_T(j) = \text{length} \times \text{width}$$

so d is symmetric. Moreover, note that for each color bin j

$$\min(H_I(j), H_T(j)) \le H_I(j) \qquad (5.16)$$

with equality iff $H_I(j) \le H_T(j)$. Summing over j, we get

$$\frac{\sum_j \min(H_I(j), H_T(j))}{\sum_j H_I(j)} \le 1$$

with equality iff all inequalities in (5.16) are equalities, i.e. when $H_I = H_T$. Hence $0 \le d(H_I, H_T) \le 1$, and $d(H_I, H_T) = 0$ iff $H_I = H_T$.

The template image T is chosen to be about the size of the target object, and if its histogram is rich enough, a recognizer based on the above distance has very high accuracy. Certainly, given a template T we can produce an image I with the exact same histogram as T by re-shuffling the pixels in T, but this is rarely the case in realistic situations.
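Equation (5.15) translates directly into code; the sketch below assumes the two histograms are already computed over the same bins.

```python
import numpy as np

def histogram_intersection_distance(h_i, h_t):
    """d(H_I, H_T) = 1 - sum_j min(H_I(j), H_T(j)) / sum_j H_I(j), Eq. (5.15).

    When the two histograms have equal total mass, the distance is
    symmetric, lies in [0, 1], and is 0 iff the histograms coincide.
    """
    return 1.0 - np.minimum(h_i, h_t).sum() / h_i.sum()
```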
5.4 More Visual Routines

In addition to the feature maps produced using the symmetry operator and the histogram back-projection described in the previous sections, we have experimented with a few other feature maps, in an attempt to produce fixation points similar to recorded human fixations. These maps have not been incorporated in the current implementation. Nevertheless, we consider them good starting points for selecting visually important regions in an image, and for this reason we shall describe them briefly in this section.

Variance map. Recordings of human scan patterns [14] show that people almost never fixate on flat, uniform surfaces. Obviously, then, we can use the variance of the intensity to detect non-uniform surfaces: the larger the variance in a region, the more non-uniform that region is. We cover a gray scale image with a grid consisting of square cells. On each cell, we compute the variance of the intensity, and we retain only the cells with variance within some interval $(v_{min}, v_{max})$. Figure 5.11 shows an indoor image with the recorded human fixations marked by the green dots, and the results of our variance map on the gray scale equivalent of the original image. The red squares are the cells selected by our algorithm. The drawback of the variance map is that the thresholds $v_{min}$ and $v_{max}$, as well as the size of the cells in the grid, must be set a priori by the programmer.

Figure 5.11: (a) Indoor image. The dots represent the recorded human fixation points. Courtesy of the Michigan State University EyeLab. (b) Intensity image of (a). The red squares are the cells with large variance.

Another idea in defining saliency is to declare a pixel salient if and only if it is sufficiently different from the mean in some feature (e.g. intensity). To fix ideas, let V(x, y) be the intensity at pixel (x, y). Let $m_V$ and $\sigma_V$ be the mean and the standard deviation of the intensity V over the whole image. Construct an intensity map I by defining

$$I(x, y) = \begin{cases} |V(x, y) - m_V| & \text{if } |V(x, y) - m_V| > \sigma_V \\ 0 & \text{otherwise} \end{cases} \qquad (5.17)$$

The intensity map obtained from Figure 5.11 (a) is presented in Figure 5.12 (a). The darker a pixel, the more salient it is. If instead of intensity we use contrast, which we define at each pixel by

$$C(x, y) = \frac{|V(x, y) - m_V|}{m_V}$$

we obtain the map depicted in Figure 5.12 (b), which we term the contrast map.

Figure 5.12: (a) Intensity map for the image in Figure 5.11 (a). (b) Contrast map. The brighter regions are the most salient. Compare with the human fixations in Figure 5.11 (a).

The last feature map that we describe is a color opponency map. Unlike the previous maps, this map is biologically motivated by neuro-physiological evidence observed in humans, primates and cats [23], [31], [40], [51]. Some of the retinal photoreceptors are highly sensitive to red light (R cones), others to green light (G cones), and yet a third category of cones is sensitive to blue light (B cones). Impulses from these photoreceptors are transmitted to the bipolar cells, which project to the ganglion cells of the retina. The axons from the ganglion cells converge at the optic disk to form the optic nerve, which projects to the lateral geniculate nucleus (LGN). It has been observed that some of the ganglion cells (loosely called the red-green opponent cells) have receptive fields consisting of a center of R cones and a surround of G cones. Given an RGB image, we model the red-green opponency by computing at each pixel (x, y) a response

$$RG(x, y) = \iint_{\text{center}} R(u, v)\, g_{\text{center}}(x - u, y - v)\, du\, dv \; - \; \iint_{\text{surround}} G(u, v)\, g_{\text{surround}}(x - u, y - v)\, du\, dv \qquad (5.18)$$

where R(x, y) and G(x, y) are the red and green values at pixel (x, y), $g_{\text{center}}$ is a gaussian with a small standard deviation and $g_{\text{surround}}$ is a gaussian with a larger standard deviation. The first integral in equation (5.18) is the sum of the red values near the pixel (x, y), weighted by a gaussian peaked at (x, y), and represents the excitatory component. The inhibitory component comes from the surround, and is the sum of the green values weighted by a gaussian of a larger size. See Figure 5.13. We also compute a green-red response GR by using the same procedure. Clearly, then, we can iterate this process by letting R ← RG and G ← GR. The results are presented in Figure 5.14. The brightest spots represent the regions in the input image where the contrast between the red center and the green surround (or vice versa) is maximal.

Another color opponency observed in the ganglion cells is the blue-yellow opponency. The receptive fields of these cells are not separated in the center/surround dichotomy as in the red-green cells, but for the purposes of a computer implementation one could use similar techniques.
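Equation (5.18) amounts to a difference of two gaussian-smoothed channels, so it can be approximated with standard filtering, as in the sketch below. The two standard deviations are illustrative values of our choosing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def red_green_opponency(rgb, sigma_center=2.0, sigma_surround=6.0):
    """Approximate the red-green response of Equation (5.18).

    The excitatory center is the R channel smoothed by a narrow gaussian;
    the inhibitory surround is the G channel smoothed by a wider one.
    """
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    center = gaussian_filter(r, sigma_center)       # excitatory R center
    surround = gaussian_filter(g, sigma_surround)   # inhibitory G surround
    return center - surround

# The green-red map GR swaps the roles of the two channels; iterating with
# R <- RG and G <- GR reproduces the maps shown in Figure 5.14.
```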
Figure 5.13: Gaussians used to model color opponency. The red curve weighs the responses of the central red cones, while the green curve weighs the (inhibitory) responses of the surrounding green cones. The blue curve is the difference of the two gaussians.

Figure 5.14: (a) Original RGB image. (b) Red-green opponency map. (c) Green-red opponency map. (d) Red-green opponency using (b) and (c) as the red and green channels. (e) Green-red opponency using (b) and (c) as the red and green channels. The brightest regions are the ones in which the center-surround opponency is the largest. The green-red opponency picks up the flag and the computer chair, while the red-green opponency picks up the red box.

5.5 Algorithm

The system starts with an empty set of clusters of RGB images, and with a low resolution template $M_{low}$ and a high resolution template $M_{high}$ of the target object. The camera is pointed at a random initial location and a 384 × 384 image $I_0$ is taken at the widest field of view of the camera (hence, at the lowest resolution). An observation vector $O_0$ is extracted from $I_0$ (as described in Section 5.2.1) and a cluster $C_0 = \{O_0\}$ is formed. An array $Q(C_0, \cdot)$ of length 9 is defined and initialized to 0, and a scalar $T_0 = \infty$.

Suppose now that, after n steps, the system has clusters $C_0, \ldots, C_m$ representing various regions in the room, and Q-values $Q(C_0, \cdot), \ldots, Q(C_m, \cdot)$. Suppose also that the system has inferred it is in state $C_{current}$, based on the latest low resolution image $I_n$ grabbed by the camera. An action $a_{current}$ is chosen ε-greedy. That is, with probability $1 - \epsilon$ we choose $a_{current} = \arg\max_a Q(C_{current}, a)$, and with probability $\epsilon$ a random, exploratory action is chosen.

Execution of action $a_{current}$ is only necessary if the target object is not present in the current image $I_n$, so the agent tries to detect the object first. To this end, we build a color map $B_n$ by back-projecting the low resolution template $M_{low}$ onto $I_n$ (see Figure 5.10). Note: we do not back-project $M_{low}$ onto all of $I_n$, but onto a central patch in $I_n$ of size 192 (half the size of $I_n$). The reason is twofold: in the first place, we reduce the processing time; secondly, since in the retina the cone density decreases tremendously towards the periphery [32], [35], color perception is severely impaired far from the optical axis.

Let $p_1, p_2, p_3$ be the brightest points in $B_n$. We match the model $M_{low}$ with a neighborhood of $p_i$ for each i = 1, 2, 3 using histogram intersection, producing distances $d_1, d_2, d_3$. At low resolution, however, histogram intersection is not reliable enough, and the separation threshold varies from frame to frame. To solve this problem, the camera must zoom in at each of the $p_i$'s, grab a new image $I_i'$, and do a fine match, this time at high resolution, using the model $M_{high}$. However, since histogram back-projection produces candidate locations whether or not the target is present, the camera would have to zoom in three times per frame, most of the time unsuccessfully.
We reduce substantially the number of times we zoom in by scoring each of the points $p_i$ by

$$s_i = \frac{d_i}{B_n(p_i)} \qquad (5.19)$$

where $B_n(p_i)$ is the gray level of $p_i$ in $B_n$ (the color saliency), and $d_i$ is the histogram intersection distance (5.15). The lower the score $s_i$, the better the candidate $p_i$. Using these scores (rather than just the histogram distances $d_i$), it is possible to separate the good candidates from the bad ones in most images from the same cluster, but we could not find a unique separation threshold across clusters. The solution is to have each cluster keep track of its own threshold (call it T) and dynamically adjust it as follows. First, in a newly created cluster, set T = ∞. At each time step, one of the following two cases can occur: either (a) some of the candidates $p_i$ have scores $s_i < T$, or (b) all $s_i \ge T$.

In case (a), the camera is pointed to $p_i$ and zooms in. A high resolution image $I_i'$ is taken and the high resolution model $M_{high}$ is back-projected onto $I_i'$. A fine match of $M_{high}$ is performed, again using histogram intersection. At high resolution, histogram intersection works remarkably well, since images in which the target is present can be readily separated from images which do not contain the target, as can be seen in Figure 5.15.

Figure 5.15: Red: distribution of high resolution images containing the target object. Green: distribution of high resolution images not containing the target object.

If the object is found at $p_i$, the agent is rewarded according to Equation (5.5), and a new training epoch starts with the clusters and the Q-values learned so far, by directing the camera to a new random initial location. Otherwise, if the object is not present at $p_i$, the camera zooms back out, and proceeds to interrogate the other points $p_j$ with score $s_j < T$, if any. When all candidates $p_i$ have been examined, the threshold T is set to $T = \min\{s_i\}$, the camera returns to the pan and tilt it had before investigating the $p_i$'s, and we are in case (b). See Figure 5.16.

Figure 5.16: Left: Low resolution image. The points $p_1, p_2, p_3$ are produced using the histogram back-projection of the low resolution model $M_{low}$. At each $p_i$, a score $s_i$ is computed using histogram intersection. Right: High resolution image centered at $p_3$. The high resolution model $M_{high}$ is first back-projected onto the image, and then matched using histogram intersection. If the target was not found at any of the $p_i$'s, the camera returns to the pan and tilt coordinates it had before the local search, and the threshold T is set to $\min s_i$.

So far, the system has decided that the target object is not present in the current image $I_n$, either because all candidates $p_i$ had scores $s_i \ge T$, or because the fine match recognizer ruled out all the candidates. At this point, a new saccade is necessary, so finally, action $a_{current}$ is carried out (from state $C_{current}$). The next fixation point is selected based on the symmetry map. The winner is the point with the largest symmetry vote (Equation 5.10) from all the edge pixels in the image $I_n$ if $a_{current} = A_0$, or from the edge pixels in one of the 8 central 90° sectors if $a_{current} = A_k$ for some $k = 1, \ldots, 8$. The camera is pointed to the new fixation point and the action is complete.
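The candidate examination and per-cluster threshold adaptation just described can be summarized in a short sketch. The list-of-pairs data layout and the fine_match callable (standing in for the zoom-in plus high resolution recognition step) are hypothetical; the actual implementation may organize this differently.

```python
def examine_candidates(candidates, threshold, fine_match):
    """Sketch of the zoom-in policy with a per-cluster threshold T.

    candidates: list of (point, score) pairs, scores from Equation (5.19).
    threshold: the current value of T for this cluster.
    fine_match: callable returning True if the high resolution match at
    a point succeeds (hypothetical signature).
    Returns (target_location_or_None, updated_threshold).
    """
    for point, score in candidates:
        if score < threshold and fine_match(point):
            return point, threshold                 # target found; keep T
    # All examined candidates failed (or none was below T): tighten T.
    return None, min(score for _, score in candidates)
```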
Knowing now whether $C_{current}$ contains the target or not, the agent is rewarded according to Equation (5.5) for the transition from state $C_{previous}$ to $C_{current}$, as a result of some action $a_{previous}$. Action $a_{current}$ will be rewarded at the next iteration, because only then will the agent know whether $a_{current}$ brought the goal into the field of view or not. $Q(C_{previous}, a_{previous})$ is updated by Equation (2.11). A common technique for speeding up the propagation of the reward is to update more than a single (state, action) pair. Our agent keeps track of the previous five pairs visited, and updates in one step all the corresponding Q-values.

After the update, the camera grabs a new (low resolution) image $I_{n+1}$, and the agent estimates the state. More specifically, a feature vector $O_{n+1}$ is extracted from $I_{n+1}$, and the closest cluster to $O_{n+1}$ is found (with respect to the Kullback distance (5.4)). Denote this cluster by $C_{closest}$, and its center by $O_{closest}$. If the Kullback distance $K(O_{n+1}, O_{closest}) < k_{min} = 1.5$, then $O_{n+1}$ is assigned to cluster $C_{closest}$, and the agent is in state $C_{current} = C_{closest}$. No adjustments are made to $C_{closest}$, so that at any time, each cluster center consists of a single vector, namely the initial vector used to create that cluster. If $K(O_{n+1}, O_{closest}) > k_{max} = 2.5$, then the current observation is not sufficiently close to any of the existing clusters, and a new cluster $C_{new} = \{O_{n+1}\}$ is created. With this new cluster, a scalar $T_{new} = \infty$ is defined.

Uncertainty occurs when $k_{min} \le K(O_{n+1}, O_{closest}) \le k_{max}$, since this is the case when the overlap between the two curves in Figure 5.3 is significant. In a situation like this, it is not clear whether the images that produced the observations $O_{n+1}$ and $O_{closest}$ should be considered similar or non-similar. The system could very well be "looking" at a previously visited region, but be unable to recognize it, due to noise. Certainly, we could simply create a new cluster whenever $K(O_{n+1}, O_{closest}) \ge k_{min}$, but in doing so, the system unnecessarily ends up keeping multiple clusters representing the same region in the environment. Therefore, before creating a new cluster, the system gets a second chance at estimating the state, by making a small number of very short saccades around the current fixation point. If still none of the new observations falls within distance $k_{min}$ of the closest cluster, then the system has no choice but to create a new cluster.
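The two-threshold state estimation can be sketched as follows, reusing the kullback_distance function from the Section 5.2.1 sketch. The integer cluster ids and the retry callable (producing extra observations from the short saccades) are our assumptions for illustration.

```python
def estimate_state(obs, clusters, k_min=1.5, k_max=2.5, retry=None):
    """Assign obs to a cluster, or create a new one (a sketch).

    clusters: non-empty dict mapping integer cluster ids to their fixed
    center vectors. retry: optional callable returning extra observation
    vectors, used only in the ambiguous band [k_min, k_max].
    """
    best = min(clusters, key=lambda c: kullback_distance(obs, clusters[c]))
    d = kullback_distance(obs, clusters[best])
    if d < k_min:
        return best                            # confidently a known region
    if d <= k_max and retry is not None:
        for extra in retry():                  # a few very short saccades
            if kullback_distance(extra, clusters[best]) < k_min:
                return best
    new_id = max(clusters) + 1                 # otherwise start a new cluster
    clusters[new_id] = obs
    return new_id
```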
A new action is chosen and the process repeats until either the object is found, or a maximum number of iterations has been reached. In what follows, we use the term "epoch" to refer to a sequence of at most $n_{MAX} = 100$ fixations. If either the goal has been reached or all 100 time steps have elapsed, a new epoch is started by pointing the camera to a random location, in order to ensure sufficient exploration of the whole state space. We summarize below the algorithm used for a single training epoch. Some of the technical details are omitted in order to keep the presentation simple.

1) Initialization. Choose the gaze direction of the camera randomly. Set the number of iterations n = 0. Get a low-resolution image I and extract an observation vector O (see Section 5.2.1). Define a cluster $C_0 = \{O\}$, define $T_0 = \infty$, and an array $Q(C_0, \cdot) = 0$ of Q-values.

WHILE object not found and $n < n_{MAX}$

2) Choose an action $a_{current}$, ε-greedy.

3) Color map: back-project a low resolution target template onto I. Score each of the top 3 candidates $p_1, p_2, p_3$ by (5.19). Zoom in at each $p_i$ with $s_i < T_n$, grab a high resolution image, and match the high resolution target template using the histogram intersection. If the object is found, go to step 6. Else, zoom out, adjust $T_n = \min\{s_i\}$ and go to the next step.

4) Symmetry map: compute symmetries on the sector corresponding to action $a_{current}$ and direct the camera towards the most salient point in that sector.

5) Clustering: get a new feature vector O. Find the closest cluster $C_{closest}$ to O in the Kullback distance K. If K is sufficiently small, $C_{next} = C_{closest}$. Else, if K is sufficiently large, create a new cluster $C_{next} = \{O\}$, define $T_{next} = \infty$ and initialize an array $Q(C_{next}, \cdot) = 0$. Finally, if K is near the decision boundary (see Figure 5.3), make a few short saccades and re-compute K. If the new image still cannot be unambiguously classified, create a new cluster.

6) Q-update: compute the reward for action $a_{previous}$ using Equation (5.5). Update the last 5 Q-values according to Equation (2.11). Set n ← n + 1.

END WHILE

Chapter 6
Experimental Results

We describe in this section a number of experiments designed to determine, in the first place, whether any visual learning is at all possible with our reinforcement learning agent, and if so, how good the learning was. We also compare the performance of our agent with the performance of a random agent.

6.1 Learning a Policy for Finding the Target

In the first experiment, we trained the agent to learn in which direction to direct its gaze in order to reach the region where the target object is most likely to be found, in trials of 400 epochs each. Every 5-th epoch was used for testing, i.e. the agent simply executed the policy learned so far. The performance metric was the number of fixations to the goal. Within a single trial the starting position was the same in all test epochs. Figures 6.1 and 6.2 show the number of steps to goal for two initial starting positions.

Figure 6.1: Average number of fixations for every 5-th epoch over five trials of 400 epochs each. At every 5-th epoch, the policy learned was tested starting from pan=500, tilt=-100.

Figure 6.2: Average number of fixations for every 5-th epoch over five trials of 400 epochs each. At every 5-th epoch, the policy learned was tested starting from pan=200, tilt=0.
Figure 6.4 shows a sequence of regions “as seen” by the camera, as it fixated from the starting position (rightmost image) to the target (leftmost). Figure 6.5 plots the centers of the learned regions (states) (the blue dots) in the pan-tilt coordinate system as well as the best actions in each region. Only states with more than 15 visits are shown, out of a total of 104 states. The position of 65 (a) 6 (b) 5 (c) 4 (a) 3 (e) 2 (r) 1 Figure 6.4: The sequence of fixations corresponding to the learned path (red) in Figure 6.3. The camera starts from region (1) and gradually moves towards the goal region (5), Here, a sufliciently good candidate is found at low resolution (1), the camera zooms in (6) and does a high resolution match. This sequence of images corresponds to the red scan path in Figure 6.3 Figure 6.5: States learned (the blue squares) and the best actions. Action A0 is represented by a red square around a state. Actions A1 —A3 are represented by a red segment originating at the state. The target is the green square. 66 the target is marked by the green square. Action AD is represented as a red square around the state and actions A, — A8 are represented as a red segment originating from the center of the state. Note that the length of each segment does not. represent the length of the saccade. The actual saccade was directed at. some salient point in a sector around a red segment. This explains why the field of the red segments does not point directly and uniformly towards the target (the green square). 6.2 Executing the Learned Policy NUMBER OF FIXATIONS FOR DISPLACED TARGET 100 . a . . . T . . "stepsOngl' —— 80 [- .1 § 60 - . if s 40 - - E 20 . l . 0 “11.111111111111111111]th.IIl.].]]]1l11..h11111110111].luhnldlllllllnlllililhlll1]I 0 10 20 30 40 50 60 7O 80 90 1 00 epochs Figure 6.6: Number of saccades per epoch in finding the object inverted, displaced by 60 cm from the training position. Testing from pan=500, tilt=—100. The purpose of our second experiment was to determine what changes in the location of the target the system could manage. After the agent was trained to find the target upright in its usual location, the object was turned upside down and 67 NUMBER OF FIXATIONS FOR DISPLACED TARGET 100 I I I I I T I I 'stepsOSgpI" —— 80 I— . U) 8 § 60 - - S ”E s 40 - - E a 20 1- -4 0 1.111111111111111111111.111111111111111-.11111111111111l11llf11|1i]1]11111ll: I 11111111111111] 40 50 60 7O 80 90 0 1O 20 30 100 epochs Figure 6.7: Number of saccades per epoch in finding the object inverted, displaced by 60cm from the training position. Testing from pan=200, tilt=0. displaced to a new location (about 50-60 cm from the training position), while the agent was executing the learned policy. Again, we tested starting from a few initial locations and we averaged over several test trials. The agent successfully found the Object in a small number of fixations, as shown in Figures 6.6 and 6.7. 6.3 Re—training for a Different Target To estimate how sensitive the system was to the various parameters in the visual rou- tines, we re-trained the agent to find a different object (the green flag in Figure 6.8) without changing any parameters. Somewhat surprisingly, there is no noticeable qual- itative difference between the performance in this experiment and in experiment 1. Figures 6.9 and 6.10 show the number of fixations for finding the flag, starting from 68 (500, —100) and (200,0), respectively. Figure 6.8: (a) The target object in the third experiment was the green flag. 
Figure 6.9: Average number of fixations for finding the MSU flag (Figure 6.8) for every 5-th epoch over five trials of 400 epochs each. At every 5-th epoch, the policy learned was tested starting from pan=500, tilt=-100.

Figure 6.10: Average number of fixations for finding the MSU flag (Figure 6.8) for every 5-th epoch over five trials of 400 epochs each. At every 5-th epoch, the policy learned was tested starting from pan=200, tilt=0.

6.4 Comparison with Random Search

Finally, in our fourth experiment, we compared the performance of the reinforcement learning agent with a random agent, i.e. an agent which performed random search. The actions in the random agent were exactly the same as in the reinforcement learning agent, but their selection was random, rather than based on the Q-values. The results are presented in Figures 6.11 and 6.12. On average, the number of fixations in the random search was between 50 and 70, substantially larger than the average number of fixations in the previous experiment (6-7 fixations), where the agent was only executing a learned policy.

Figure 6.11: Number of saccades per epoch in a random search. Testing from pan=500, tilt=-100.

Figure 6.12: Number of saccades per epoch in a random search. Testing from pan=200, tilt=0.

6.5 Comparison with Exhaustive Search

Besides the comparison with the random agent, we shall present, for the sake of completeness, a comparison between the learning strategy proposed in this thesis and an exhaustive search agent. By exhaustive search we mean that the camera starts exploring its pan-tilt universe from one fixed location (e.g. the upper left corner) and moves in fixed increments, one step at a time, until the whole pan-tilt space has been searched, or until the target has been found. To this end, let us recall that the pan-tilt range of our camera is [-100, 100] × [-25, 25] degrees, and that the widest field of view for one frame is about 48 × 33 degrees. We assume that all the visual processing is the same in the search agent as in the learning agent, so first a low resolution search is carried out, and then a high resolution search around the most promising locations detected at low resolution, as described in Section 5.5. Recall from the same section that, within one frame, the search is not performed on the whole image, but rather on a central patch of half the size. Thus, each low resolution color search covers a visual field of about 24 × 17 degrees. The pan-tilt range of the camera can then be tiled by approximately 8 × 3 = 24 search regions. As the target object could be in any of these 24 regions, the system would make at most 24 fixations for finding the object in the worst case, and (1 + 2 + ... + 24)/24 = 12.5 fixations on average. These results are summarized in Table 6.1.
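The arithmetic above can be checked directly; note that 200/24 is rounded down to 8 columns, following the approximation in the text.

```python
tiles = 8 * 3                               # ~ (200 / 24) x (50 / 17) patches
worst_case = tiles                          # every region visited once
average = sum(range(1, tiles + 1)) / tiles  # (1 + 2 + ... + 24) / 24
print(worst_case, average)                  # prints: 24 12.5
```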
Start Training (box) Test (box) Training (flag) Random Exhaustive (500,—100) 15.8 6.7 18.7 66.9 worst: 24 (200,0) 11.41 6.6 12.8 51.1 avg: 12.5 Table 6.1: Average number of fixations for the three experiments starting from two test points. In experiment 1 testing was done every 5—th epoch. The rest of the epochs were training epochs. Experiment 2 consisted only in executing a learned policy, with the target object inverted and slightly displaced. In experiment 3, a random search was done. 72 Chapter 7 Discussion The set of experiments presented in the previous section is by no means comprehen- sive. NO systematic analysis is done to explore the effect of the learning parameters (the learning rate a, and the exploration rate 6), nor to observe the effect of the var- ious parameters from the visual routines. Although no further tuning was necessary to find the flag, after the software box was found, we do not expect this to be the case for any target Object. What the experiments do show, however, is that learning does occur, and that the system is able to find the target reliably, and that the search is robust to affine transformations (translations and rotations) of the Object within one region. We discuss below some technical issues that we encountered. Visual saliency. Color and symmetry were the means of defining saliency in our implementation. The symmetry operator used picks up objects fairly well, provided the rough size of the objects is known. Tuning the operator for objects of a particular size, will not work well for objects of different sizes. This is one reason why the fixations of our agent, in general, do not cluster 011 objects, the way human fixations 73 do. Moreover, the symmetry map is influenced by the fact that we apply the symmetry operator on a rather limited sector of the image. Symn‘ietric objects that simply do not fit in that sector, will have lower saliency. Although back-projection histogram is a. good way of detecting objects by color, it is not biologically plausible. As the evidence for color Opponency shows, biological vision systems use feature contrast, rather than feature values. Histograms, of course, use values. Some effort was required to distinguish genuine candidate locations of the target object from noise, in low resolution images, but back-projection histogram in conjunction with the histogram intersection is a fast and effective way of building a color saliency map, provided a template of the target object is given. Clustering. To our knowledge, an artificial general purpose classifier with nearly perfect rate does not exist, and we did not attempt to build one. Due to the prob- abilistic nature of Q-learning, however, an imperfect (but reasonable) classifier will suffice. If the classifier has accuracy of, say, 90%, then, statistically, the agent will receive the correct reward most of the time, which is all it needs to learn the correct best action. We opted to represent images by means Of color histograms, in order to categorize pairs of images as “similar” or “non-similar”. The loss of the spatial distribution of the colors in an image leads to perceptual aliasing, but we found that perceptual aliasing is unlikely to occur in a cluttered environment such as a lab, particularly if we subdivide the image into smaller pieces, and compute the histogram on each piece. The real problem that we believe is at the core of our clustering algorithm, is more mathematical in nature. Let V be the feature space produced by our feature 74 extraction method. 
Each observation is a vector in V and has dimension 196 in this implementation. For each pan-tilt (:11, y) of the camera, we get a vector 0(33, y) E V, namely the observation extracted from the image taken when the camera is pointed at (:6, y). For a given environment, we get this way a mapping from the two dimensional euclidean space into V 0:R2——>V hence a surface in V. Since the histogram is known to change slowly with (:L', y), this surface is continuous, which entails that there are no natural clusters in V to represent different regions in the room. By choosing a threshold for the similarity metric (the Kullback distance), we effectively used spherical clusters in the feature space V. We preferred to lower the Kullback threshold used to cluster the low resolution images in order to reduce the false accept rate (i.e. to reduce the number of images classified as “similar” when in fact they were “non-similar”). Inherently, this increased the false reject rate, that is, often two genuinely similar images were classified as non-similar, and a new cluster was started, representing in fact a region that had been visited previously. After 400 training epochs, we ended up with about 80- 100 clusters, out of which about 40 were distinct by visual inspection. Although “duplicate” clusters are wasteful and unnecessarily increase the number of states, they are harmless for the Q-learning algorithm, unless their number increases indefinitely, because in that case, the (state, action) pairs would not be visited sufficiently many 75 times. The way we propose to approach this problem is to invoke an alternative similarity metric when (and only when) the Kullback distance is indecisive. Certainly there will be an interval (a,b), in which the alternative metric cannot distinguish well between similar and non-similar image, just like the Kullback metric does not work well between (1.5, 2.5), but it is crucial that (a,b) and (15,25) be disjoint. Good candidates would be metrics on edge pixels (edge density, curvature, etc.), because edge information and color information are uncorrelated. An alternative idea1 is to modify the feature extraction. Instead of dividing each image into quadrants and compute the color histogram on each of the 4 regions, we can extract visual information (e.g. color histogram, or some iconic representation in the sense of [4]) locally, from around the top salient points. This way, the visual infor- mation will be more “anchored” to the environment, and consequently, less sensitive to (small) changes in the camera pan and tilt. The problem that this approach raises is that, unlike in the case of the 4 quadrants where we can always impose a fixed order, there is no canonical way Of defining an ordering on salient points, which is absolutely crucial if local observation vectors are going to be concatenated to produce an observation vector for the whole image. lJon Connell, IBM, personal conununication. Chapter 8 Conclusions and Future Work We have built a reinforcement agent which learns the region where a given object can usually be found. Learning occurs by choosing the direction of the next saccade according to its utility (the Q-value). Once the direction has been chosen by the reinforcement learning module (see figure 5.1), low level visual routines find the most salient point in that direction and aim the camera at it. The experimental results show that the agent is capable of learning, even with an unsophisticated reinforcement learning mechanism. 
The way we propose to approach this problem is to invoke an alternative similarity metric when (and only when) the Kullback distance is indecisive. Certainly, there will be an interval (a, b) in which the alternative metric cannot distinguish well between similar and non-similar images, just as the Kullback metric does not work well between (1.5, 2.5), but it is crucial that (a, b) and (1.5, 2.5) be disjoint. Good candidates would be metrics on edge pixels (edge density, curvature, etc.), because edge information and color information are uncorrelated.

An alternative idea (due to Jon Connell, IBM, personal communication) is to modify the feature extraction. Instead of dividing each image into quadrants and computing the color histogram on each of the 4 regions, we can extract visual information (e.g. a color histogram, or some iconic representation in the sense of [4]) locally, from around the top salient points. This way, the visual information will be more "anchored" to the environment, and consequently less sensitive to (small) changes in the camera pan and tilt. The problem that this approach raises is that, unlike in the case of the 4 quadrants where we can always impose a fixed order, there is no canonical way of defining an ordering on salient points, which is absolutely crucial if local observation vectors are going to be concatenated to produce an observation vector for the whole image.

Chapter 8
Conclusions and Future Work

We have built a reinforcement learning agent which learns the region where a given object can usually be found. Learning occurs by choosing the direction of the next saccade according to its utility (the Q-value). Once the direction has been chosen by the reinforcement learning module (see Figure 5.1), low level visual routines find the most salient point in that direction and aim the camera at it. The experimental results show that the agent is capable of learning, even with an unsophisticated reinforcement learning mechanism. In comparison with random search, our agent is about ten times better (see Table 6.1), which shows the usefulness of learning. Certainly, exhaustive search is always an option, but it defeats the purpose of our study.

Our agent comes short of learning spatial relationships between objects. It cannot, for instance, learn that a painting is usually on a wall, or that a pen is usually on a desk. It can learn, however, relationships of the form this pen is in this region (which happens to be a table).

In the current implementation, the information that is being integrated across fixations is color information (through histograms) and the direction of the next saccade. No deep image understanding takes place at each fixation, except to decide whether the current region has been seen before, and whether or not it contains the target object. In this sense, the information we retain is minimal, but sufficient for the search task. The visual information used to define saliency in an image is based on color and symmetry. While the symmetry operator is task independent and context free, the histogram back-projection is not, in that it uses a template image of the target object.

The performance of our system can be improved in a number of ways. In the first place, different variants of Q-learning should be investigated. For instance, one could use eligibility traces [43], which are temporary records of the occurrence of an event (e.g. visiting a (state, action) pair), used to selectively assign credit to the events that were marked as eligible; a sketch is given at the end of this chapter. Secondly, the bottom-up visual attention mechanism must be improved for a better selection of the fixation points. We proposed in Section 5.4 a number of other feature maps, but understanding how to combine them into a unique saliency map requires further investigation. Thirdly, the classifier used to cluster images should be improved by adding one more feature, "orthogonal" to the Kullback metric, or by using iconic representations around the top salient points in the image, as already discussed at the end of Chapter 7. Our clustering algorithm tries to group together similar images, irrespective of their utility. A more sophisticated algorithm [25], which clusters observations using both a metric in feature space and the Q-values, could be used for improved performance.

We view this work as an initial exploration towards the implementation of a gaze control mechanism on a mobile robot. We purposefully avoided the use of pan-tilt information in the decision making process because, due to the robot's motion, the target object (such as an exit sign, or an elevator door) changes its position with respect to the agent. A more sophisticated recognizer, capable of identifying the target at different distances and from various points of view, will be needed. We foresee some interesting applications of this system on a real robot. For instance, by visually finding a fixed object in a room, the robot could automatically align itself, which is a necessary step before navigation. Or, the robot could simply find the door and exit the room. A more complex behavior would be to train the search system to recognize several objects in several rooms, and then have the robot tell what room it is in, based on the confidence with which it finds the expected objects.
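The eligibility-trace variant mentioned above generalizes our fixed five-pair backup: recently visited (state, action) pairs receive a share of each temporal-difference error that decays over time. The following sketch follows the general form of Watkins's Q(lambda) [43]; the parameter values and the dictionary-based representation are illustrative assumptions, not part of our implementation.

```python
def q_lambda_update(Q, trace, s, a, r, s_next,
                    alpha=0.1, gamma=0.9, lam=0.8, n_actions=9):
    """One step of a Q(lambda)-style update (a sketch).

    Q: dict mapping (state, action) to values; trace: dict of eligibilities.
    The TD error is propagated to all recently visited pairs, with the
    credit decaying by gamma * lam per step.
    """
    best_next = max(Q.get((s_next, b), 0.0) for b in range(n_actions))
    delta = r + gamma * best_next - Q.get((s, a), 0.0)   # TD error
    trace[(s, a)] = trace.get((s, a), 0.0) + 1.0          # mark as eligible
    for key in list(trace):
        Q[key] = Q.get(key, 0.0) + alpha * delta * trace[key]
        trace[key] *= gamma * lam                         # decay the credit
        if trace[key] < 1e-3:
            del trace[key]                                # prune tiny traces
```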
Bibliography

[1] S. Ahmad and S. Omohundro. Efficient visual search: A connectionist solution. In Proceedings of the 13th Annual Conference of the Cognitive Science Society, Chicago, 1991.

[2] J. Aloimonos, A. Bandopadhay, and I. Weiss. Active vision. In Proceedings of the First International Conference on Computer Vision, pages 35-54, London, 1987.

[3] S.M. Anstis. A chart demonstrating variations in acuity with retinal position. Vision Research, 14:589-592, 1974.

[4] D.H. Ballard. Animate vision. Artificial Intelligence, 48:57-86, 1991.

[5] D.H. Ballard, M.M. Hayhoe, and J.B. Pelz. Memory representations in natural tasks. Journal of Cognitive Neuroscience, 7(1):66-80, 1995.

[6] C. Bandera, F. Vico, J. Bravo, M. Harmon, and L. Baird. Residual Q-learning applied to visual attention. In Proceedings of the 13th International Conference on Machine Learning, pages 20-27, Bari, Italy, July 3-6, 1996.

[7] J.R. Bergen and B. Julesz. Focal attention in rapid pattern discrimination. Nature, 303:696-698, 1983.

[8] C. Bushnell, M.E. Goldberg, and D.L. Robinson. Behavioral enhancement of visual response in monkey cerebral cortex. I. Modulation in posterior parietal cortex related to selective visual attention. Journal of Neurophysiology, 46:755-772, 1981.

[9] T. Darrell and A. Pentland. Active gesture recognition using partially observable Markov decision processes. In 13th IEEE Intl. Conference on Pattern Recognition, 1996.

[10] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18:193-222, 1995.

[11] E.A. DeYoe and D.C. Van Essen. Concurrent processing streams in monkey visual cortex. Trends in Neurosciences, 11(5):219-226, 1988.

[12] M.E. Goldberg and R.H. Wurtz. Activity of superior colliculus in behaving monkey. Journal of Neurophysiology, 35:560-574, 1972.

[13] P. Haenny, J. Maunsell, and P. Schiller. Cells in prelunate cortex alter response to visual stimuli of different behavioral significance. Perception, 13:A7, 1984.

[14] J.M. Henderson and A. Hollingworth. Eye movements during scene viewing: An overview. In Eye Guidance in Reading and Scene Perception, pages 269-295. Elsevier Science B.V., 1998.

[15] L. Itti and C. Koch. A comparison of feature combination strategies for saliency-based visual attention systems. In SPIE Human Vision and Electronic Imaging IV (HVEI'99), pages 473-482, 1999.

[16] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40:1489-1506, 2000.

[17] M. Jagersand. Saliency maps and attention selection in scale and spatial coordinates: An information theoretic approach. In Proceedings of the 5th International Conference on Computer Vision, pages 195-202, 1995.

[18] A.K. Jain. Fundamentals of Digital Image Processing. Prentice Hall, 1989.

[19] R. Jain, R. Kasturi, and B.G. Schunck. Machine Vision. McGraw-Hill, Inc., 1995.

[20] B. Julesz and J.R. Bergen. Textons, the fundamental elements in preattentive vision and perception of textures. Bell System Technical Journal, 62:1619-1645, 1983.

[21] L.P. Kaelbling. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

[22] C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 1985.

[23] P. Lennie, P.W. Haake, and D.R. Williams. The design of chromatically opponent receptive fields. In M.S. Landy and J.A. Movshon, editors, Computational Models of Visual Processing, pages 71-82. MIT Press, 1991.

[24] S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms and empirical results. Machine Learning, 22:159-195, 1996.

[25] S. Mahadevan and J. Connell.
Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55(2-3):311-365, June 1992.

[26] D.C. Marr. Vision. Freeman, 1982.

[27] A.K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996.

[28] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696-710, 1997.

[29] B.A. Olshausen, D.C. Van Essen, and C.H. Anderson. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13:4700-4719, 1993.

[30] J.K. O'Regan and A. Levy-Schoen. Integrating visual information from successive fixations: Does trans-saccadic fusion exist? Vision Research, 23(8):765-768, 1983.

[31] O. Packer, A.E. Hendrickson, and C.A. Curcio. Photoreceptor topography of the adult pigtail macaque (Macaca nemestrina) retina. Journal of Comparative Neurology, 288:165-183, 1989.

[32] S.E. Palmer. Vision Science: Photons to Phenomenology. MIT Press, 1999.

[33] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley & Sons, 1994.

[34] R.P.N. Rao, G.J. Zelinsky, M.M. Hayhoe, and D.H. Ballard. Eye movements in visual cognition: A computational study. Technical Report 97.1, University of Rochester, 1997.

[35] K. Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372-422, 1998.

[36] D. Reisfeld, H. Wolfson, and Y. Yeshurun. Context free attentional operators: The generalized symmetry transform. International Journal of Computer Vision, 14:119-130, 1995.

[37] R.A. Rensink. Seeing, sensing and scrutinizing. Vision Research, 40:1469-1487, 2000.

[38] R.A. Rensink, J.K. O'Regan, and J.J. Clark. To see or not to see: The need for attention to perceive changes in scenes. Psychological Science, 8:368-373, 1997.

[39] R.D. Rimey and C.M. Brown. Controlling eye movements with hidden Markov models. International Journal of Computer Vision, 7(1):47-65, 1991.

[40] R.W. Rodieck. Quantitative analysis of cat retinal ganglion cell response to visual stimuli. Vision Research, 5:583-601, 1965.

[41] L. Shapiro and G. Stockman. Computer Vision. Addison-Wesley, 2000.

[42] G. Strang. Wavelets and dilation equations: A brief introduction. SIAM Review, 31:613-627, 1989.

[43] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998.

[44] M.J. Swain and D.H. Ballard. Object identification using color cues. Technical report, University of Rochester, NY, 1990.

[45] M.J. Swain and D.H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.

[46] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modelling visual attention via selective tuning. Artificial Intelligence, 78:507-545, 1995.

[47] A. Treisman. The role of attention in object perception. In Physical and Biological Processing of Images, 1983.

[48] A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12:97-136, 1980.

[49] L. Ungerleider and M. Mishkin. Two cortical visual systems. In Analysis of Visual Behavior, pages 549-585, 1982.

[50] C. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, 1989.

[51] A.B. Watson. Neural contrast sensitivity. In M.S. Landy and J.A. Movshon, editors, Computational Models of Visual Processing, pages 95-107. MIT Press, 1991.

[52] L.E. Wixson. Exploiting world structure to efficiently search for objects.
Technical report, University of Rochester, July 1992.

[53] J. Wolfe, K. Cave, and S. Franzel. Guided search: An alternative to the feature integration model for visual search. Journal of Experimental Psychology: Human Perception and Performance, 15:419-433, 1989.

[54] A.L. Yarbus. Eye Movements and Vision. Plenum, 1967.