This is to certify that the thesis entitled "Selective Visual Attention in a Search Task: A Reinforcement Learning Model," presented by Silviu D. Minut, has been accepted towards fulfillment of the requirements for the M.S. degree in Computer Science and Engineering at Michigan State University.

SELECTIVE VISUAL ATTENTION IN A SEARCH TASK: A REINFORCEMENT LEARNING MODEL

By

Silviu D. Minut

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science

2000

ABSTRACT

SELECTIVE VISUAL ATTENTION IN A SEARCH TASK: A REINFORCEMENT LEARNING MODEL

By

Silviu D. Minut

This thesis proposes a model of selective attention for visual search tasks, based on a general framework for sequential decision-making. The model is implemented using a fixed pan-tilt-zoom camera in a visually cluttered lab environment, which samples the environment at discrete time steps. The system is a learning agent which has to decide where to fixate next, based purely on visual information, in order to reach the region where a target object is most likely to be found. The model consists of two interacting modules. A reinforcement learning module learns a policy on a set of regions in the room for reaching the target object, using as objective function the expected value of the sum of discounted rewards. By selecting an appropriate gaze direction at each step, this module provides top-down control in the selection of the next fixation point. The second module performs "within fixation" processing, based exclusively on visual information. Its purpose is twofold: to provide the agent with a set of locations of interest in the current image, and to perform the detection and identification of the target object.

To My Family

ACKNOWLEDGMENTS

First and foremost I would like to thank my advisor, Dr. Sridhar Mahadevan. Numerous discussions ranging from machine learning and reinforcement learning, to human and computer vision, pattern recognition and cognitive science, led to ideas that shaped this research. I would also like to thank the other members of the committee, Dr. George Stockman and Dr. John M. Henderson, for their support and for their invaluable expertise in computer vision and in vision cognition. I am also grateful to Dr. John Weng and Dr. Fred Dyer for their insight and suggestions. Last, but not least, I would like to thank the members of the SIGMA group at Michigan State for all the ideas that we have exchanged, and in particular Richard Falk, for providing some human data from the scan pattern experiments.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
2 Background
2.1 Biological Motivation
2.1.1 Saliency Map Theory
2.2 Reinforcement Learning
3 Problem Definition
3.1 Research Questions and Task Definition
4 Related Work
4.1 A Saliency Map Implementation
4.2 Reinforcement Learning
5 Model Description
5.1 The System Overview
5.2 The Q-learning Module
5.2.1 Feature Extraction
5.2.2 States, Actions and Reward
5.3 The Vision Module
5.3.1 The Symmetry Operator
5.3.2 Histogram Back-projection
5.3.3 Histogram Intersection
5.4 More Visual Routines
5.5 Algorithm
6 Experimental Results
6.1 Learning a Policy for Finding the Target
6.2 Executing the Learned Policy
6.3 Re-training for a Different Target
6.4 Comparison with Random Search
6.5 Comparison with Exhaustive Search
7 Discussion
8 Conclusions and Future Work

LIST OF TABLES

6.1 Average number of fixations for the three experiments, starting from two test points. In experiment 1, testing was done every 5th epoch; the rest of the epochs were training epochs. Experiment 2 consisted only in executing a learned policy, with the target object inverted and slightly displaced. In experiment 3, a random search was done.

LIST OF FIGURES

2.1 Human fixations (the red dots) in an indoor scene. Most of the fixations fall on objects. There are almost no fixations on large, flat, uniform surfaces (floor, walls). Courtesy of the MSU Eye-Lab. Images in this thesis are presented in color.
2.2 Human fixations (red dots) in an outdoor scene. Courtesy of the MSU Eye-Lab.
3.1 (a) RGB image of the target object. (b) Panoramic image of the environment. Images in this thesis are presented in color.
5.1 Overall architecture of the system. The top module performs the feature extraction necessary to cluster the images online. The resulting regions are used as states in a Q-learning program, and a policy for moving towards the target region is learned. The bottom module implements the "within fixation" visual processing, consisting of computation of interesting regions in the image and object recognition.
5.2 (a) Low resolution template of the target object. (b) High resolution template.
5.3 Kullback distance of pairs of similar images (red) and of different images (green).
5.4 Top: feature vectors (4 concatenated color histograms) for two similar images (bottom).
5.5 Top: feature vectors extracted from two different images (bottom).
5.6 The shaded region represents a sector in image A. The most salient point is computed (the blue square) and a new image B, centered at the new location, is taken.
5.7 If the reward is found in the periphery (image A), then upon subsequent visits to the same region (image B) the reward could be missed, despite the fact that the agent executed the "right" action.
5.8 Two edge pixels p_i and p_j will vote for their midpoint. The strength of the vote depends on the distance between p_i and p_j, and on the length and the orientation of the intensity gradient at each of the two pixels.
5.9 (a) Sample image. (b) Symmetry map obtained by maximizing the vote for pixels at distance D = 130.
5.10 (a) Sample search image. (b) Model (target) image. (c) Histogram back-projection of (b) onto (a).
5.11 (a) Indoor image. The dots represent the recorded human fixation points. Courtesy of the Michigan State University EyeLab. (b) Intensity image of (a). The red squares are the cells with large variance.
5.12 (a) Intensity map for the image in Figure 5.11 (a). (b) Contrast map. The brighter regions are the most salient. Compare with the human fixations in Figure 5.11 (a).
5.13 Gaussians used to model color opponency. The red curve weighs the responses of the central red cones, while the green curve weighs the (inhibitory) responses of the surrounding green cones. The blue curve is the difference of the two gaussians.
5.14 (a) Original RGB image. (b) Red-green opponency map. (c) Green-red opponency map. (d) Red-green opponency using (b) and (c) as the red and green channels. (e) Green-red opponency using (b) and (c) as the red and green channels. The brightest regions are the ones in which the center-surround opponency is the largest. The green-red opponency picks up the flag and the computer chair, while the red-green opponency picks up the red box.
5.15 Red: distribution of high resolution images containing the target object. Green: distribution of high resolution images not containing the target object.
5.16 Left: Low resolution image. The points p1, p2, p3 are produced using the histogram back-projection of the low resolution model M_low. At each p_i, a score s_i is computed using histogram intersection. Right: High resolution image centered at p3. The high resolution model M_high is first back-projected onto the image, and then matched using histogram intersection. If the target was not found at any of the p_i, then the camera returns to the pan and tilt coordinates it had before the local search, and the threshold T is set to max_i s_i.
6.1 Average number of fixations for every 5th epoch over five trials of 400 epochs each. At every 5th epoch, the policy learned was tested starting from pan=500, tilt=-100.
6.2 Average number of fixations for every 5th epoch over five trials of 400 epochs each. At every 5th epoch, the policy learned was tested starting from pan=200, tilt=0.
6.3 Initial scan path (green) and learned scan path (red). The rectangle [-860, 860] x [-290, 290] represents the actual pan-tilt range of the camera. The search starts at (300, 50) and the target object is found around (-380, 30).
6.4 The sequence of fixations corresponding to the learned path (red) in Figure 6.3. The camera starts from region (1) and gradually moves towards the goal region (5). Here, a sufficiently good candidate is found at low resolution (1), the camera zooms in (6) and does a high resolution match. This sequence of images corresponds to the red scan path in Figure 6.3.
6.5 States learned (the blue squares) and the best actions.
Action A0 is represented by a red square around a state. Actions A1-A8 are represented by a red segment originating at the state. The target is the green square.
6.6 Number of saccades per epoch in finding the object inverted, displaced by 60 cm from the training position. Testing from pan=500, tilt=-100.
6.7 Number of saccades per epoch in finding the object inverted, displaced by 60 cm from the training position. Testing from pan=200, tilt=0.
6.8 (a) The target object in the third experiment was the green flag. (b) Low resolution template. (c) High resolution template.
6.9 Average number of fixations for finding the MSU flag (Figure 6.8) for every 5th epoch over five trials of 400 epochs each. At every 5th epoch, the policy learned was tested starting from pan=500, tilt=-100.
6.10 Average number of fixations for finding the MSU flag (Figure 6.8) for every 5th epoch over five trials of 400 epochs each. At every 5th epoch, the policy learned was tested starting from pan=200, tilt=0.
6.11 Number of saccades per epoch in a random search. Testing from pan=500, tilt=-100.
6.12 Number of saccades per epoch in a random search. Testing from pan=200, tilt=0.

Chapter 1

Introduction

The problem of visual search is to find an object in a large, usually cluttered environment (e.g. a pen on a desk) [52]. In solving such a problem it is preferable to use wide field of view images, in order to analyse as much as possible of the environment under scrutiny. On the other hand, small objects require high resolution images, which in combination with the wide field of view requirement leads to a very high dimensional input array. Foveated vision is nature's method of choice for solving this problem and is a dominant characteristic of the vision system of virtually any vertebrate species with well developed eyes [3]. It consists of processing the visual input at non-uniform resolution, due to an increased density of photo-receptors on a small central patch (the fovea) at the level of the retina. As such, foveated vision achieves a compromise between the need for a wide field of view and the need for high acuity.

Foveal image processing reduces the dimension of the input data, but in turn generates an additional sequential decision problem. Choosing the next fixation point requires an efficient gaze control mechanism, in order to check salient objects and determine whether they are the target or not. From an engineering standpoint, a sequential attention mechanism is attractive because it has the potential of requiring only sparse local models [4]. However, the visual attention mechanism raises a plethora of difficult questions.

In the first place, since the next fixation point is generally not in the fovea, its selection must be done based on coarse, low resolution visual information, without a thorough understanding of its semantics. The question is then: what low level features are necessary in order to decide what to attend to in the next fixation? Koch and Ullman [22] propose a saliency map theory which is a task independent, bottom-up model of visual attention.
In this framework, Itti and Koch [16] extract three types of feature maps (a color map, an edge map and an intensity map) and fuse them together in a unique map. However, the selection of the next fixation must require some top-down control, since low-level visual information is usually not sufficient. Hence the second major question: how can a high level, top-down mechanism be implemented to control the low level, reactive attention? Tsotsos et al. [46] propose a model of visual attention which tries to selectively tune visual processing by means of a top-down hierarchy of winner-take-all processes.

Finally, since the vision system samples the environment, some information must be retained from one fixation to the next, and integrated across saccades, to produce a global understanding of the scene. The nature of this information is the third unknown of major concern.

We propose an overall model that integrates top-down gaze control with bottom-up reactive saliency map processing, based on reinforcement learning.

In the remainder of this section we present an outline of the thesis. In Chapter 2 we introduce the relevant terminology and present some background and motivation from a cognitive science perspective, as well as give a brief introduction to reinforcement learning. Some fundamental issues pertaining to visual attention are outlined.

In light of the problems raised in Chapter 2, we formulate in Chapter 3 the questions that are addressed in this thesis, and we state in mathematical terms the concrete task being solved.

Chapter 4 is devoted to related work, where a number of representative previously proposed models of visual attention are discussed.

The core of this thesis is described in Chapters 5 and 6. Section 5.1 provides an overview of the model and explains the logical decomposition of our system into two modules: a reinforcement learning module, which provides the top-down control in the search task, and a vision module, consisting of bottom-up visual routines that implement our version of a saliency map and a recognizer used to identify the target object of the search. We proceed to describe the components of each module in the subsequent sections, independently of one another. The feature extraction and the Q-learning routines are described in Sections 5.2, 5.2.1 and 5.2.2, while the visual routines are detailed in Section 5.3. The individual components are then "assembled" together in Section 5.5, where a detailed description of the algorithm is given.

The performance of the search system is evaluated in Chapter 6, where four experiments are described. The results show that (1) the number of fixations to the goal decreases during training, (2) after training, the system can find the target object in a slightly different location within one region of the environment, and (3) the agent can be re-trained to find another object in the same environment without fine tuning any parameters. Finally, a comparison with a random search agent and with an exhaustive search agent is done, indicating that the learning agent performs better in both cases.

Chapter 7 summarizes the strengths of the system, as well as its shortcomings. Some technical issues regarding the implementation are presented, along with some possible improvements to feature extraction.

Finally, Chapter 8 outlines a few promising directions for further work.
Chapter 2

Background

2.1 Biological Motivation

Broadly speaking, computer vision is a research area with the practical goal of building a machine that understands what it sees. In doing so, researchers often find their inspiration in biological systems, and often propose computational models that try to explain or approximate the observed biological evidence. One of the first comprehensive theories of computer vision was formulated by David Marr [26]. It was the dominant paradigm in computer vision in the 80's and it continues to be a major one even today. The aim of computer vision, according to this theory, is to recover the 3D structure of the external world from 2D images, in an attempt to form a stable, detailed, spatiotopic representation of the world. Attempts to implement such a system failed to perform well in realistic environments. A few researchers have used psychophysical evidence to argue for new biologically inspired approaches in computer vision [4]. We shall describe some of this evidence and its implications for vision, but in order to do so, we shall briefly introduce some terminology first.

It is estimated that more than half of the neurons in the brain of the macaque monkey are devoted to vision (though fewer in the human brain) [11], [32]. This is indicative of the importance and the complexity of vision. The brain must analyse the sensory input, then synthesize the information in an attempt to "make sense" of the world, and act in an appropriate, optimal way. However, given that the human field of view is about 180°, it is a highly nontrivial task to process and interpret all the visual information at constant, high resolution, even for a massively parallel super-computer such as the brain, for dimensionality reasons. It is also not necessary to keep detailed information about every single point in the visual field.

The biological solution starts with the design of the eye and consists in a trade-off between high visual acuity and amount of processing. The fovea is anatomically defined as a small, central region on the retina, with a very high density of receptive cells (cones). The density of the photo-receptors (and with it the visual acuity as well) decreases exponentially from the fovea towards the periphery. To compensate for any potential loss of information incurred by the decrease in resolution in the periphery, the eyes are rapidly re-oriented via very fast (up to 900°/s) ballistic motions called saccades. The length of the saccade varies with the task, but in general the order of magnitude is a few degrees (see [35]). Fixations are the periods between saccades (about 3 per second) during which the eyes remain relatively fixed, the visual information is processed and the location of the next fixation point is selected. In humans, the fovea has a diameter of about 1.5 mm, and accounts for about 2° of the field of view [32], [35].

This dimensionality reduction leads to a sequential decision problem: rather than processing all the visual field at once at constant high resolution to produce a detailed global picture of the whole environment, the visual system devotes a substantial part of its resources to the processing of a limited region at high resolution, in conjunction with an efficient attention mechanism, which selects the next fixation point in such a way that the brain appears to get just the right amount of information necessary for the task at hand.
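The data reduction bought by foveated sampling is easy to appreciate with a toy computation. The sketch below is purely illustrative (the system in this thesis uses a pan-tilt-zoom camera, not a retina-like sensor): it keeps a small central patch at full resolution, represents the periphery by progressively downsampled copies of the frame, and reports what fraction of the uniform-resolution pixel budget survives.

```python
import numpy as np

def foveate(img, fovea=64, levels=3):
    """Crude foveated representation: a full-resolution central patch
    plus progressively coarser downsampled versions of the whole frame.
    Returns the list of arrays and the total pixel count retained."""
    h, w = img.shape[:2]
    cy, cx = h // 2, w // 2
    # Full-resolution "fovea": a small central window.
    patches = [img[cy - fovea // 2:cy + fovea // 2,
                   cx - fovea // 2:cx + fovea // 2]]
    # "Periphery": the whole frame at coarser and coarser resolution.
    for k in range(1, levels + 1):
        step = 2 ** k
        patches.append(img[::step, ::step])
    kept = sum(p.shape[0] * p.shape[1] for p in patches)
    return patches, kept

img = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a camera frame
patches, kept = foveate(img)
print(kept / (480 * 640))  # fraction of the uniform-resolution pixel budget (~0.34)
```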
In the early 90's two closely related paradigms started to affect the course of vision research in computer science: active vision [2] and animate vision [4]. In essence, the active vision paradigm proposes that instead of trying to recover the world in the form of a task independent, detailed internal representation, one should try to find task specific solutions. These solutions could be found by decomposing the particular task at hand into simpler sub-tasks, with simpler qualitative solutions. As Aloimonos points out [2], an algorithm that requires accuracy of a few decimals cannot be robust under the virtually infinite number of environments, hence the need for qualitative algorithms.

Animate vision also acknowledges the fact that it is not necessary to create rich 3D representations of the world. An animate vision system ideally would be endowed with anthropomorphic features, such as binocular, foveated vision and a high speed gaze control mechanism. Animate vision systems can take various actions (e.g. move, or zoom in/out) in order to optimize or simplify visual search. The computations are carried out in an exocentric coordinate system (as opposed to an egocentric coordinate system in Marr's paradigm), which has the advantage of being independent of the motion of the observer (robot). The camera is not fixed, but can make approximate motions to acquire new information. Understanding vision as a continuous process of mapping perception to action allows for (reinforcement) learning algorithms which can provide a vision system with more sophisticated behavior.

O'Regan [30] suggests that only minimal information about the scene is represented internally. The scene itself could serve as "external" memory. In a now classical experiment, Ballard [5] confirmed O'Regan's finding. Specifically, human subjects were asked to reproduce a configuration of colored blocks displayed on a computer monitor. The display was divided into 3 regions: the model region, in which the given configuration was shown; the source, which contained an unordered set of colored blocks; and the workspace, which was the area where the copy of the model was supposed to be assembled. Had the subjects built a detailed internal representation of the model in their mind first (as suggested by the Marr paradigm), one (possibly long) look at the model would have been enough to produce a copy in the working area. Without exception though, the subjects made saccades back and forth between the model and the workspace, accomplishing the task by selecting one block at a time. This suggests that it is more economical to retain only the necessary information, locally both in space and time, and sample the environment as necessary, rather than keep an internal representation of the scene.

Another piece of evidence against computing and storing a stable internal model of the world is the change blindness effect [38], [37]. It has been observed that even significant (sudden) changes in a scene go undetected if they occur during a saccade¹. Again, had the brain maintained a detailed spatio-temporal model of the scene, the changes would have been noticed.

¹ It is known that visual input during a saccade is essentially turned off, a phenomenon termed saccadic suppression [35].

Selective attention is the common feature of biological systems, but it raises a number of questions, some of which we shall discuss. Rensink [37] points out that there are two categories of problems.
One category addresses questions about retinotopic representations and processes that occur within a single fixation. Equally important is what happens from fixation to fixation, in terms of how and what kind of information is integrated across saccades, as well as how this information is coordinated with the task at hand. We shall point out some fundamental questions from each category. First, the "within fixation" questions.

Yarbus [54] and subsequently several other studies (see [35] and the references therein) showed, by tracing the fixation points in a scene, that human scan patterns vary with the task. Figures 2.1 and 2.2 show human scan patterns recorded with an eye-tracker at the MSU EyeLab. Despite these variations, the fixation traces have some common characteristics. It is noteworthy, for instance, that people almost always fixate on objects, both in indoor images and in outdoor images, and almost never on flat, uniform, empty spaces (e.g. floors, walls).

It has been postulated [49] that there are (at least) two fundamental operations that a visual system must be able to perform: first it must be able to tell "where" in the image the interesting regions are, and then, once attention is devoted to one of these regions, it must tell "what" that region represents. In general, coarse, peripheral information is not sufficient for object recognition. For instance, surprising as it may seem, color perception is heavily depreciated beyond 20-30 degrees off the optical axis, due to the decrease in cone density [32].

Figure 2.1: Human fixations (the red dots) in an indoor scene. Most of the fixations fall on objects. There are almost no fixations on large, flat, uniform surfaces (floor, walls). Courtesy of the MSU Eye-Lab. Images in this thesis are presented in color.

The peripheral information is used mainly to select the next fixation point, after which recognition occurs based primarily on foveal information. In light of these facts, a logical paradox occurs: how does the visual system decide where something interesting is in the scene, before knowing what it is? Trying to answer the "what" question first seems equally hopeless, for how can it be decided what an object is, before looking at it (i.e. before knowing where it is)? One theory that attempts to explain the selection mechanism is the saliency map framework, which we discuss in some detail in the next section.

One fundamental problem in gaze control, part of "across fixations" research, is how integration across saccades is performed, i.e. what information the brain retains from fixation to fixation in order to produce the illusion of a coherent, uniform resolution image, rather than a collection of snapshots of various regions in a scene.

Figure 2.2: Human fixations (red dots) in an outdoor scene. Courtesy of the MSU Eye-Lab.

It has been suggested (see [35] and references therein) that information from successive fixations is stored in a high capacity visual buffer. The main technique used in vision research to investigate this problem is to present human viewers with an object extrafoveally, and then instruct them to direct their gaze towards the object. During the saccade the object is replaced with another object, which the subjects must name. Studies in object perception [35], however, show that the visual information is not stored in a point by point fashion, but rather as relatively abstract representations.
Integration of information across saccades remains an open problem [35].

2.1.1 Saliency Map Theory

If the understanding of a scene is a sequential process, then naturally, we must try to understand the mechanism that allows the shift of attention from one region to another. An initial step is to observe that not all visual processes are created equal. Some features (e.g. luminance, direction of motion) can be processed quickly and in parallel over the entire field of view, whereas more cognitive processes, such as shape analysis or fine recognition, have much higher complexity and require an increased amount of resources; these necessarily must be carried out locally, on a restricted region called the focus of attention, which most of the time (but not always!) coincides with the fixation point. The former processes constitute the "pre-attentive stage", whose role is to select a single region as the focus of attention, which will subsequently be analysed by the latter processes - the "attentive stage".

There exists experimental evidence that strongly supports the "selective attention" theory. In the first place, experiments by Treisman [47], [48] showed that search for a target defined by a single visual feature (e.g. color, or edge orientation) takes place in parallel over a visual display, whereas search for a target defined by a conjunction of several features is done sequentially (e.g. find a vertical red line among many red and blue lines, vertical or horizontal). It has also been established [7], [20], [47] that only a limited set of features (e.g. color, edge orientation and curvature) can be detected in parallel. Physiological evidence based on recordings of neuronal activity in awake behaving monkeys also supports the selective processing of the visual data [8], [12], [13].

In light of these facts, Koch and Ullman [22] proposed a model of visual attention called the saliency map, aimed at explaining the shift of attention and the saccadic eye movements. We describe their model in the remainder of this section. The result of the processing in the pre-attentive stage is a set of topographical cortical maps (possibly at different spatial resolutions), which are based on low-level visual information (color, edge orientation, texture, contrast, disparity, direction of motion) and provide an early representation of the environment. Each of these feature maps codes for conspicuity within one feature: the more different a location is from its surroundings, the more conspicuous (i.e. more salient) it is. The feature maps are then combined into a saliency map by computing at each location a weighted sum of the saliency in each of the features. We must point out that it is nontrivial to compute such a weighted sum, because some of the features are inherently not comparable (e.g. it is not obvious how to compare color conspicuity with edge orientation). In fact, it is quite possible that these weights may vary, and may get top-down (semantic) influence from some higher cortical centers. Once a salient location has been selected, more abstract properties of that region are created and a central, non-topographical representation of that region is formed. A "winner-take-all" network is proposed for finding the most salient region in the saliency map. A second network is also necessary for building the abstraction of the attended region into a central representation.
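As a concrete reading of the framework just described, the sketch below combines feature maps by a weighted sum and selects the focus of attention with a winner-take-all argmax, suppressing previously attended regions so that attention can move on. This is only a minimal illustration of the idea, not Koch and Ullman's implementation; in particular, the rescale-to-[0, 1] normalization is our assumption.

```python
import numpy as np

def saliency_map(feature_maps, weights):
    """Weighted sum of feature maps, each first rescaled to [0, 1]
    so that incommensurable features become loosely comparable."""
    s = np.zeros_like(feature_maps[0], dtype=float)
    for f, w in zip(feature_maps, weights):
        rng = f.max() - f.min()
        s += w * (f - f.min()) / rng if rng > 0 else 0.0
    return s

def next_focus(s, prev=None, radius=10):
    """Winner-take-all selection; previously attended spots are inhibited,
    in the spirit of the saliency decrease Koch and Ullman suggest."""
    s = s.copy()
    for (y, x) in (prev or []):
        s[max(0, y - radius):y + radius, max(0, x - radius):x + radius] = -np.inf
    return np.unravel_index(np.argmax(s), s.shape)

# Toy usage: three random "feature maps" (say color, orientation, intensity).
maps = [np.random.rand(60, 80) for _ in range(3)]
focus1 = next_focus(saliency_map(maps, [1.0, 1.0, 1.0]))
focus2 = next_focus(saliency_map(maps, [1.0, 1.0, 1.0]), prev=[focus1])
```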
Since the saliency of a region in the field of view is based on low-level visual features, it is extrinsic to the visual system, i.e. it is entirely determined by the environment. Then the most salient region in the early representation remains the most salient also after that region has been attended. Naturally, then, the question is: if the maximum saliency region is always selected, how does one ever move the eyes to another location? Koch and Ullman suggest that higher cortical centers decrease the saliency of the attended region, so that after some time, that region is no longer the most salient in the field of view, and another maximum is selected elsewhere.

A few comments are in order here. In the first place, as we have already mentioned, it is not trivial to combine the various feature maps into a single saliency map. Secondly, in the saliency map framework, it is not clear when the high level processing occurs. The saliency map only tries to explain how the next focus of attention is chosen, and, as presented above, semantics are given little consideration. In the absence of semantics, the selection of the next fixation point is a reactive, deterministic process (once an environment is given), which does not require any voluntary, cognitive actions; in other words, it is a reflex. While this may be the case during the first few fixations when presented with a new scene, we believe that as the scene is being gradually understood, semantics and the specific task at hand have an increased role in selecting the next fixation point.

We close this section by mentioning that while there is biological evidence for the existence of various feature maps, not all researchers accept the idea of a unique topographic map to represent the most salient stimuli [10].

2.2 Reinforcement Learning

We emphasized in the previous section the necessity of a top-down control mechanism to obtain the functionality of a gaze control system. Since we treat visual attention as a sequential decision problem, we shall use reinforcement learning techniques [43] to implement the top-down control in a task specific way. In this section we present the basic formalism of sequential decision-making.

An agent is a system which can be in one of a finite set of states, and which can take actions. Depending on the problem at hand, the information used to define the states could be extrinsic to the agent, describing properties of the environment (e.g. location information, "obstacle ahead", "rainy day", etc.), or intrinsic, internal to the agent (e.g. "hungry", "happy", "angry", "sure", "unsure", etc.). In general, there is one set of actions for each state, but for simplicity we shall assume that at each state the agent can choose from the same set of actions. The actions change the state of the system and the agent must re-estimate the state by observing the environment. The agent gets rewarded for each action it takes, and tries to learn what action is best in each state, with respect to some optimality criterion, by moving from one state to another until some termination conditions are met. Typically, the components of a sequential decision making problem are as follows:

- States: S = {s_1, s_2, ..., s_n}.

- Actions: A = {a_1, a_2, ..., a_m}.

- Transition probabilities: P(s'|s,a) of reaching state s' given that the agent was in state s and has taken action a. In general, the transition from s to s' by executing action a depends not only on s and a, but on the whole sequence s_0, a_1, ..., s_n = s up to the current state. In many sequential decision problems, however, the transition from one state to another does not depend on history, i.e.

    P(s_{n+1} | a_{n+1}, s_n, ..., a_1, s_0) = P(s_{n+1} | a_{n+1}, s_n)    (2.1)

  Equation (2.1) is called the Markov property, and a system in which it is satisfied is called a Markov Decision Process (MDP). We assume the Markov property throughout this thesis.

- Reward: R(s,a,s') for transiting from state s to s' as a result of action a. The reward can be positive, zero, or negative (cost). We assume that rewards are bounded in absolute value. Since the action a puts the system in state s' with probability P(s'|s,a), the quantity

    r(s,a) = E(R(s,a,s') | s,a)    (2.2)
           = Σ_{s'} P(s'|s,a) · R(s,a,s')    (2.3)

  represents the one step expected reward for executing action a in state s.

- Objective function: A function depending on long term reward, which the agent must optimize. For instance, the agent might try to maximize the sum of rewards r(s_0, a_1) + r(s_1, a_2) + ..., where a_i is the action taken at moment i, s_i is the state at time step i, and s_0 is the state the agent starts from. To guarantee that the above series is finite, we assume the rewards are bounded and we discount each of the terms by a factor 0 < γ < 1. We require the agent to maximize at each state the expected value of the sum of discounted rewards:

    E( Σ_{t=1}^∞ γ^t r(s_{t-1}, a_t) )    (2.4)

  See [21], [24] for more examples of optimality metrics.

A (stationary, deterministic) policy is a function π : S → A. The agent follows a policy if in each state s it always chooses the same action a = π(s). Given an optimality metric and a policy π, we can define for each state s a value V^π(s). For instance, for the optimality metric (2.4), the value of state s is

    V^π(s) = E( Σ_{t=1}^∞ γ^t r(s_{t-1}, π(s_{t-1})) | s, π )    (2.5)
           = r(s, π(s)) + γ · Σ_{s'∈S} P(s'|s, π(s)) V^π(s')    (2.6)

where s_0 = s and s_1 = s'.

By "optimizing" the objective function we mean that the agent must find an optimal policy, i.e. a policy π* such that

    V^{π*}(s) ≥ V^π(s),  ∀ s ∈ S

For simplicity we denote V* = V^{π*}. By (2.6) we have that

    V*(s) = max_{a∈A} [ r(s,a) + γ Σ_{s'∈S} P(s'|s,a) V*(s') ]    (2.7)

for any state s. Equations (2.7) are called the Bellman optimality equations, and there are two known dynamic programming algorithms that can be used to solve them: value iteration and policy iteration [21], [24], [33], [43]. These algorithms can be applied when we have a model of the MDP, i.e. when the transition probabilities and the reward functions are known a priori. This may be the case in some manufacturing problems, but it rarely happens in real robotics problems. Indeed, a robot does not know the outcome of an action until it has completed that action. There exists, however, an algorithm called Q-learning, due to Watkins [50], which can be used to solve the Bellman equations even when the transition probabilities or the rewards are not known. The algorithm is iterative, it is proven to converge, and it finds a solution of the Bellman equations.

To begin with, suppose that V* is a solution of (2.7). For each pair (s,a) define a function

    Q*(s,a) = r(s,a) + γ · Σ_{s'∈S} P(s'|s,a) V*(s')    (2.8)

which is precisely the quantity under the max in Equation (2.7). Clearly then,

    V*(s) = max_{a∈A} Q*(s,a)    (2.9)

Substituting (2.9) back into (2.8) we get

    Q*(s,a) = r(s,a) + γ · Σ_{s'∈S} P(s'|s,a) max_{a'∈A} Q*(s',a')    (2.10)

Unlike in the Bellman equations, in equations (2.10) the transition probabilities are not under the max. The quantity

    Σ_{s'∈S} P(s'|s,a) max_{a'∈A} Q*(s',a')

is the expected value of max_{a'∈A} Q*(s',a').
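To make equations (2.7)-(2.10) concrete, here is a minimal value iteration sketch on a hypothetical two-state, two-action MDP (all transition probabilities and rewards below are invented for illustration). It computes Q* from a known model, which is precisely the requirement that Q-learning, described next, removes.

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions (all numbers illustrative only).
# P[s, a, s'] = transition probability; R[s, a] = one-step expected reward r(s, a).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Bellman backup (2.10): Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)
V = Q.max(axis=1)           # equation (2.9)
policy = Q.argmax(axis=1)   # greedy (optimal) policy pi*(s)
```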
The Q-learning algorithm tries to find the function Q*(s,a) through successive approximations. Once Q* is found, we can find an optimal policy by

    π*(s) = argmax_{a} Q*(s,a)

Let Q_n be the approximation at time step n. Q_0 is initialized randomly, or it is identically 0. The iterative step is given by the following update equation:

    Q_{n+1}(s,a) = Q_n(s,a) + α_n [ r(s,a) + γ · max_{a'∈A} Q_n(s',a') - Q_n(s,a) ]    (2.11)

where α_n is a learning rate parameter. The transition probabilities are not present in the update equation (2.11). It can be proven rigorously [50] that the sequence Q_n converges to a function Q* which satisfies (2.10), provided that Σ α_n = ∞ and Σ α_n² is finite, and provided that all (state, action) pairs are visited infinitely often.

Chapter 3

Problem Definition

3.1 Research Questions and Task Definition

Now that we have introduced the basic ideas in active vision and reinforcement learning, we can formulate more precisely the theoretical questions that we investigate in this thesis.

The first question is how perception and action can be unified in a coherent, possibly task dependent visual behavior. In other words, given that the agent recognizes (with some certainty) an image, how does it decide what to do afterwards? We propose a model of visual attention which comprises a low level, reactive layer, which essentially implements a saliency map, controlled by a high level, semantic layer. The addition of the top layer goes beyond previous gaze control models, in which the emphasis was on selecting the most salient point in a task independent way.

Secondly, we investigate to what extent reinforcement learning can be used to integrate information across saccades. We propose that only minimal visual information about a region in the environment is retained, together with the direction of the saccade from that region towards some goal. This is analogous to the landmark navigation that organisms (from insects to humans) are known to perform.

Finally, we ask whether it is possible to learn relationships between objects. For instance, if the agent is searching for a pen, then, to restrict the search, we would like it to know from prior experience that a pen is likely to be found on a desk. Many examples from real life show that people do use this kind of prior knowledge: we look for keys in certain places, we look for a book in a bookcase, etc. We believe that reinforcement learning can partially answer this question, but a more complete solution relies heavily on the ability of the agent to recognize classes of objects, rather than specific objects (e.g. the agent can recognize "books", or "pens", not just "this book", or "this pen").

A visual search for a relatively small object which is not necessarily in an exact location, but which can usually be found near some other larger fixed object in a cluttered environment, seems to have the necessary complexity to address the above issues. At any given moment the system can only process a limited region in the environment at high resolution, and knowledge about each of the regions visited must be "remembered" from one visual frame to another. Also, moving to another region requires some way of finding a new fixation point, which requires solving the aspect of visual saliency. The fact that the target object is generally in one region of the room requires that the saccades be directed consistently towards the target region, hence the problem of top-down control.
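Before stating the task formally, it may help to see the update (2.11) as code, since it is the learning rule the agent uses for top-down control. The sketch below is a generic tabular implementation under our own assumptions: the environment is abstracted as a `step(s, a)` callback and exploration is ε-greedy, neither of which is prescribed by the thesis.

```python
import random

def q_learning(n_states, n_actions, step, steps=5000,
               alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning. `step(s, a)` is the environment: it returns
    (reward, next_state); no transition probabilities are needed."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # Epsilon-greedy exploration keeps all (state, action) pairs visited.
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        r, s2 = step(s, a)
        # Update (2.11): move Q(s,a) towards r + gamma * max_a' Q(s',a').
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q
```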
Formally, we define the search task as follows.

Task: Find a pre-specified object (Figure 3.1 (a)) in a fixed position in a room (Figure 3.1 (b)), using a fixed pan-tilt-zoom camera.

Input: An RGB image of the target object T.

Output: A set L = {L1, L2, ..., Ln} of landmarks (regions) in the room, and a policy on L for finding the target T.

Performance Metric: Number of saccades the camera makes from an arbitrary (but fixed) position, until it finds the target T.

Figure 3.1: (a) RGB image of the target object. (b) Panoramic image of the environment. Images in this thesis are presented in color.

The general approach is to create the set L incrementally, by using an instance based approach of clustering the images. Each image is represented by means of color histograms. Q-learning is used to learn the policy on the set L. The next fixation point is chosen by using two feature maps (rather than a single saliency map): we do a low resolution search for a candidate location, based on color, followed by a high resolution match at each candidate location. If the target is not found, then we compute symmetries in the image using an adapted version of the symmetry operator described in [36], and choose the maximum in a given direction as the next fixation point.

We want to emphasize here that although a gaze control system involves a great deal of image processing and pattern recognition techniques, we do not attempt to build a (nearly) perfect, general purpose recognizer, nor the perfect saliency map. What we strive for is to introduce task dependent, top-down control in visual attention. Improvements both in the image processing component and in the reinforcement learning component will be added subsequently, based on the insight gained from the current work. For now, however, we see this work as the first step towards understanding the computational issues involved in building a gaze control system capable of operating in realistic environments.

Chapter 4

Related Work

4.1 A Saliency Map Implementation

A number of approaches to visual attention, visual search and gaze control have been proposed [1], [9], [16], [22], [29], [34], [39], [46], [53]. We shall succinctly describe some of them in this section.

Perhaps the most comprehensive implementation of the original saliency map theory proposed by Koch and Ullman was done by Itti and Koch [16], [15], who implemented a model of a task independent, pre-attentive selection mechanism. A saliency map is built using a bottom-up approach. To this end, three types of feature maps are built and combined into a unique map: a color map, an edge-orientation map and an intensity contrast map, each at different scales in a pyramidal scheme, yielding a total of 45 feature maps. Each of these features uses a center-surround structure that mimics the visual receptive fields, which is meant to pick up regions that are in sharp contrast with their surroundings. For the color contrast, for instance, the difference
In [15], the same authors showed that a simple way to combine the features by normalizing them to a fixed range and summing up, did not work well. In [16], strongly motivated by biological evidence, Itti and Koch use a two dimensional difference-Of-gaussians (DOG) to model inhibitory-excitatory interactions between neurons. First, each feature map is re-scaled between 0 and 1. Iteratively, each map is convolved with a DOG filter, the original image is added to the result and the negative entries in the result are set to 0. This “within feature” competitive process produces a conspicuity map (for each feature) with a few peaks left, each peak marking a location in the image most contrasting with its surround. Finally, these three conspicuity maps are summed up linearly and the most salient point in the image is the one with the highest peak. Another important model Of visual attention is the one proposed by Rao, Zelinsky, Hayhoe and Ballard in [34]. The idea here is to define saliency based on iconic representation of each location in the image. An iconic representation at a location is nothing but visual information from a patch in the image, centered at that location. Then, given also an iconic representation of a target object, one could search for that Object in the image by defining the saliency at each location as the correlation between the representation at that location and the representation of the model. Encoding 26 icons literally as raw images, is clearly prohibitively expensive, so the authors propose an efficient way Of iconic encoding, as the response around each pixel to a number Of different filters, at different scales. Specifically, for a fixed scale, a gaussian is used to generate 3 first order, 2 second order and 4 third order derivatives, for a total of 9 filters. Three scales are considered, yielding a total Of 27 multi-scale filters. Each of these filters is applied to each Of three (modified) chromatic channels of an RGB image, producing at each location in the image a vector of 81 responses. Given a similar representation of a target object, one essentially defines the saliency at each location (:13, y) in the search image as the L2 norm between the response at (r, y) and the model representation. The next fixation point is selected as the one with the smallest norm. Unlike the Itti and Koch model, which is purely bottom-up and task independent, this model requires knowledge about the task, in that it uses the iconic representation Of the target Object to define saliency in the search image. 4.2 Reinforcement Learning - I Trevor Darrel [9] applies McCallum’s nearest sequence memory (NSM) algorithm [27] to implement a gaze control system for gesture recognition. Darrel’s system consists of two cameras: one that provides a static, low-resolution, wide field Of view image, and another one with pan-tilt action, which provides high-resolution, narrow field Of view images. Part Of the world state is fully observable by means of low resolution images. However, the hand gestures and facial expressions can only be Observed from the high resolution images provided by the active camera. The various low-level routines can 27 perform person tracking, and can guide the active camera to saccade from one body part to another, based on the global, low resolution images. Other visual routines perform the gesture recognition based on the high resolution images taken by the active camera, provided that the camera has been directed towards the relevant body parts (e.g. hand or face). 
The system is supposed to learn how to choose the next fixation point in order to obtain new images of gestures or expressions, and maximally discriminate a particular gesture. The low level routines extract at each time step 9 features from the low resolution and high resolution images: (person-present, left- arm-extended, right-arm-extended, face-expression, left-hand-pose, right-hand-pose, left-hand-foveated, right-hand-foveated, head-foveated). The state is hidden because, for instance, the left-hand-pose (which can be neutral, point or Open) cannot be known unless the active camera is pointed to the left hand. At any time, the system can choose between 4 (primitive) actions: look—body, look-head, look-left-hand and look-right-hand. Since the state is hidden, the system can be treated naturally as a partially observable MDP (POMDP). The system keeps track Of memory sequences (mt)0gtSTa where each element mt is a triple mt = (at, rt, at) of action, reward and observation at moment t, and associates with each such sequence a Q-value, to decide the utility of that sequence in dis-ambiguating the state of the world. The system does manage to learn how to perform the task, but despite the small number Of states and actions, it takes a considerable amount of time to train. An interesting idea is presented by Rimey and Brown [39] where Hidden Markov Models (HMM) are used by an artificial gaze control system to learn saccadic move- ment sequences. In one example, the pan-tilt camera is assumed to II‘lOVG in small 28 constant increments in 8 directions. Let v0, . . . , v7 be these motions. An 8-state H.\-*I.\'I is trained, in which the symbols (Observations) are precisely the v.,-’s. Observation se- quences are now produced, for instance, by drawing a scan path with the mouse on a computer screen, or by monitoring human eye movements. At time t = 0 the system “perceives” Observation vto, at moment t = 1 Observation vt,, etc. Thus, a given scan path produces Observation sequences which can be used to train the 8 state H.\I'l.\:’l, i.e. to learn the transition probabilities a,j between states and the observation prob- ability densities b,(-). The learned HMM is then used to generate symbols from the learned distributions, which drives the camera to moving along a path similar to the one used for training. Obviously this scheme does not use any visual information, so the learned path is independent of the underlying image. However, the authors modify the path generation after the HMM has been trained, to include saccades towards the most salient points in the image. Moreover, they develop the formalism for training augmented-HMMS (AHMM). As an application, the authors train two HMMS (a “where” HMM and a “what” HMM), to implement two separate behav- iors and combine them using the newly defined notion of AHMM to create a more complex behavior. Finally, we shall describe a reinforcement learning solution to Object recognition due to Bandera et. a1. [6]. Suppose we have a foveated vision system whose task is to recognize Objects from a finite number of classes. TO fix the ideas, suppose the classes are “cube”, “cone” and “pyramid”. Suppose also that the system can detect, fixate and recognize various visual features that are present in all these classes. For instance, a feature could be a 90 degree angle as it is indicative of a cube; vertical 29 and horizontal edges are also features present in a cube; a circular curve is indicative of a cone, etc. In general. 
recognition with sufficient confidence requires the presence Of a conjunction of these features. The various features have different importance for the classification task, since the more Objects share a common feature, the less discriminant power that feature has. The system must learn which features are the most discriminant for the given set Of Objects. At discrete time steps, the system chooses one feature to interrogate. As the various regions of the Object are being fixated on, a confidence measure is associated with each of the features in the “feature database” which indicates the presence or the absence of that feature in the Object. The state of the recognition process is defined intrinsically, as the vector Of these measures. These confidence measures are readily transformed into probabilities and the system must reach a state of minimal entropy by choosing the feature to be fixated next, based on its discriminant power. The one-step reward for choosing a feature is equal to the decrease in entropy it produces when interrogated. A neural net is then used to implement the residual Q-learning algorithm. Although implemented in simulation, we consider this work of great importance, because it is an example of a system in which the states are internal, reflecting the stage of the recognition process, rather than properties of the environment. Moreover, we believe this approach has great potential in automatic feature selection. 30 Chapter 5 Model Description 5.1 The System Overview The system consists Of a Sony EV I-D30 pan-tilt-zoom camera placed in an elevated location in our lab, controlled through the serial port. by a PC with a Pentium II 400 MHz processor. TO grab images we use an inexpensive BT848 board. The pan and tilt Of the camera (measured in ticks) range in the interval [~860, 862] x [—281,283]. In degrees, these ranges translate into [—100°,100°] x [~25°,25°]. At the lowest resolution, the largest field of view of the camera is approximately 48° x 33°. Figure 5.1 shows the architecture of the system. The target object is provided to the system in the form of two template images: one at low resolution, used for coarse search, and the other at high resolution, used for fine matching (see Figure 5.2). The camera takes low resolution images from the lab and feeds them, one at the time, into the two layers of the system. Each low resolution image is processed along two streams - the two main modules of our model (see Figure 5.1). The top 31 Reinforcement learning (RL) Q— learning _——_————————————————————————— ll clustering 1 motor commands back—projection intersection c: o camera ‘53 Vision module (V) *3 symmetry operator :1) .............. ., .............. 93. D H N :42 histogram : histogram sis—l i— \ / \l ‘7’ \ / Object / input image image Figure 5.1: Overall architecture of the system. The top module performs the feature extraction necessary to cluster the images online. The resulting regions are used as states in a Q-learning program and a policy for moving towards the target region is learned. The bottom module implements the “within fixation” visual processing, consisting of computation of interesting regions in the image and object recognition. module (which uses reinforcement. learning) learns a set Of clusters online consisting Of images with Similar color histograms. The clusters represent physical regions in the environment, and are used as states in the Q-learning method. 
This module learns a policy for saccading from one region to another towards the region(s) most likely to contain the target object. By selecting the gaze direction according to its utility (the Q-value), the reinforcement learning module provides top-down control to a lower, purely vision-based module. The vision module consists of low-level visual routines and its purpose is to compute two feature maps (color map and synunetry map) for 32 Millikan i t .5 ; (a) 1 (b) 2 Figure 5.2: (a) Low resolution template of the target object. (b) High resolution template. representing saliency 1 in an area of interest in the image, and to recognize the target object, both at low resolution and, if need be, at high resolution for high confidence. All computations in this module are performed in exocentric coordinates 2. Unlike [16], at this point we do not attempt to combine the feature maps to form a unique, task independent saliency map, but rather use them sequentially, first the color map (for finding candidate target locations), then the symmetry map (if the target was not found and a new saccade is necessary). The output is a set of landmarks L = {L1,L2, . ..L,,} specific to a particular environment, and a policy on the set L which directs the camera one fixation at the time, starting from any point in the room, and moving towards the region(s) where the target object is normally found. Each element in the set L is a cluster of images representing the same region in the room. The set of landmarks is learned online, and the clustering is solely based on visual input from the camera (i.e. not based on the pan-tilt coordinates of the camera). Although desirable, we don’t expect/ require 1The next fixation point is the resulting most salient point. 2Obviously, image coordinates must be transformed to camera coordinates, in order to physically carry out the saccade, but the decision making does not use pan-tilt information. 33 that the landmarks learned he a faithful representation of the environment (i.e. they may not correspond to physical objects in the room). 5.2 The Q-learning Module 5.2.1 Feature Extraction In any image classification problem, treating the raw images as vectors is unfeasible due to the very high dimension of the resulting vectors. Dimensionality reduction techniques (e.g. principal component analysis [28]) and feature extraction must be apphed. DISTRIBUTION OF KULLBRCK OISTRNCE 0N P9195 OF IHfiGES 180 T T 1' T I T U mil ir" -—--- J1, “diiierent“ -‘L' 160 i- / \ .r If. i] / " 140 I- . - Ir "- VIA I ‘\ [All 1,”, \I 3’ le {i ii '\ 120 ~ f 'x i ‘ I. ‘ {l t \l / \\ l l // \ 100 *- l \K i, ‘,_ i [ 3' \ u 80 .. ] II . . \ I . . .[ \ l . ‘ H \' 60 II] \ \\ -i l \ l , \. 40 J l “ - i z . ,/ i\ \ 2,, . _/ .. /. \,\ \\_._ A o 1 i L '- -. I"! ’ ' ""3 A l L ‘ O 1 2 3 4 5 6 7 8 9 10 kullback distance Figure 5.3: Kullback distance of pairs of similar images (red) and on different images (green) Our first attempt to reduce the size of the observation vectors was to compute the wavelet transform [42] of the images. Wavelet coefficients produce good dimension- ality reduction (an order of magnitude better than JPEG compression, with similar 34 0.035 HISTOGRHMS FOR SIMILRR IMRGES “reg onvnl z“ LregFonfi01_1"'—-—— 0.025 - FREQUENCY 0.015 - J [ [l 0.01 » ‘ .‘l 2“] l lam" ‘ ill 60 80 100 120 140 160 COLOR BINS 0.005 I 001:“ ll0 ullliiii I . 
Wavelet coefficients produce good dimensionality reduction (an order of magnitude better than JPEG compression, with similar quality), but are too sensitive to small changes in the pan-tilt of the camera, as the wavelet transform is not translation invariant. Small changes in the x, y position of the camera produce large variations in the feature space, which makes any clustering attempt useless.

A set of features that changes slowly as a function of the direction of gaze is provided by the (color) histogram. For an introduction to the computer representation of colors and color spaces see for instance [18], [19], [41]. Quantizing each channel of the RGB space into 16 bins, we get 4096 colors. However, for a given environment, many of the colors occur with zero or very small probability. We can then build a histogram on all 4096 colors of 100 random images from the whole lab and retain only the colors with a count of at least 0.5%. The number of colors retained this way was 49. Henceforth, these colors will be called dominant colors.

Figure 5.4: Top: feature vectors (4 concatenated color histograms) for two similar images (bottom).

Figure 5.5: Top: feature vectors extracted from two different images (bottom).

Histograms do have the advantage that they change little when the viewing angle changes, but they may introduce perceptual aliasing: there may be very different images (locations in the environment) which produce the same histogram. To alleviate (but not necessarily eliminate) this problem, we divide each RGB image into 4 quadrants and concatenate the histograms on the dominant colors of each of the quadrants. The resulting observation vector then has dimension 196 = 4 · 49.

Having defined the feature vectors, a suitable metric needs to be chosen, in order to tell whether or not two images are similar. Given a pair $(I_1, I_2)$ of images, two situations can occur: either $I_1$ and $I_2$ represent the same region in the environment, or not. If we define two classes

$$C_{\text{similar}} = \{(I_1, I_2) \mid I_1 \equiv I_2\}, \qquad C_{\text{different}} = \{(I_1, I_2) \mid I_1 \not\equiv I_2\}$$

then a metric on pairs of images becomes a feature which can distinguish the two classes. We therefore choose a metric that maximizes the inter-class separation. After trying the Euclidean metric and correlation, we chose the symmetric Kullback distance, which we describe next.

The amount of information produced by an event A which occurs with probability P(A) is defined as

$$\log_2 \frac{1}{P(A)} \qquad (5.1)$$

Given two probability distribution functions p(x) and q(x) of a random variable x, the Kullback contrast [17] between p and q is defined as the expected value with respect to q of the difference of information between p and q. That is,

$$k(p, q) = E_q\left(\log_2 \frac{1}{p(x)} - \log_2 \frac{1}{q(x)}\right) \qquad (5.2)$$

$$= \int q(x) \log_2 \frac{q(x)}{p(x)} \, dx \qquad (5.3)$$

It can be proven [17] that k(p, q) is positive definite, but it is not symmetric. To transform the above Kullback contrast into a metric, we symmetrize it in the standard way:

$$K(p, q) = \frac{k(p, q) + k(q, p)}{2} \qquad (5.4)$$

A histogram represents the probability distribution that a certain color occurs in an image, and so it makes sense to use the Kullback distance to measure the similarity between two histograms.
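To make the feature extraction and the metric concrete, the following Python sketch computes the 196-dimensional observation vector and the symmetric Kullback distance between two such vectors. The per-quadrant normalization and the small epsilon that guards against empty bins are our assumptions for illustration, not details taken from the actual implementation.

```python
import numpy as np

def observation_vector(image, dominant_colors):
    """Concatenated dominant-color histograms over the 4 quadrants.

    image: RGB uint8 array of shape (H, W, 3).
    dominant_colors: indices of the retained color bins (49 in the thesis).
    """
    h, w, _ = image.shape
    quadrants = [image[:h // 2, :w // 2], image[:h // 2, w // 2:],
                 image[h // 2:, :w // 2], image[h // 2:, w // 2:]]
    features = []
    for q in quadrants:
        rgb = q.astype(int) // 16                        # 16 bins per channel
        idx = rgb[..., 0] * 256 + rgb[..., 1] * 16 + rgb[..., 2]
        hist = np.bincount(idx.ravel(), minlength=4096)[dominant_colors]
        features.append(hist / max(hist.sum(), 1))       # each quadrant a pmf
    return np.concatenate(features)                      # dimension 4 * 49 = 196

def kullback_distance(p, q, eps=1e-9):
    """Symmetric Kullback distance K(p, q) = (k(p, q) + k(q, p)) / 2."""
    p, q = p + eps, q + eps                              # guard against log(0)
    return 0.5 * (np.sum(q * np.log2(q / p)) + np.sum(p * np.log2(p / q)))
```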
To see the effectiveness of the Kullback metric, we grabbed 50 random images from our lab by choosing the fixation points uniformly, and 50 images from one region in the lab, i.e. by choosing the fixation points from a gaussian distribution with a small variance, centered around a pre-specified point. We formed all the pairs within each class of images, and for each pair we computed the Kullback distance. Figure 5.3 shows the Kullback distance as a feature on each of the similar/non-similar classes of pairs of images. The separation threshold between pairs of similar images and non-similar images is around 2. However, there is a significant overlap between the two distributions, so, in order to reduce the false accept rate, we moved the threshold to the left, to 1.5, at the expense of the false reject rate. Figure 5.4 shows a pair of similar images and their feature vector representation as the concatenation of color histograms on each of the 4 quadrants of each image, while Figure 5.5 shows the feature vectors for two non-similar images.

5.2.2 States, Actions and Reward

We define a state to be a cluster of images representing the same region in the room. As explained in the previous section, from each image an observation vector is extracted by concatenating the histograms on the dominant colors in each of the 4 quadrants. The similarity metric between two observation vectors is given by the Kullback metric.

Q-learning requires that the set of states be known a priori. However, due to the sequential processing of a scene, understanding must be incremental. For this reason, we start with an empty set of clusters of images, and form new clusters as the agent fixates on new regions in the room. The clusters are spherical (with respect to the Kullback distance), and have a radius of at most 1.5 (see feature extraction). The center of each cluster is the very first observation vector used to initiate creation of that cluster, and it is not updated as new observation vectors are added.

We define 9 actions. $A_0$ is a saccade to the most salient point in the whole image. Further, we define 8 more actions $A_1, \ldots, A_8$ as follows. Choose a cartesian coordinate system with the origin at the center of the image, and for each $k = 1, \ldots, 8$ let $R_k$ be the ray through the origin which makes an angle $45° \times (k - 1)$ with the x axis. Let $S_k$ be a 90° sector in the image, bisected by $R_k$. Then we define action $A_k$ to be the saccade to the most salient point in sector $S_k$.

Figure 5.6: The shaded region represents a sector in image A. The most salient point is computed (the blue square) and a new image B, centered at the new location, is taken.

The shaded region in Figure 5.6 represents sector $S_1$ in image A. The small blue square is the most salient point in $S_1$. Once its coordinates are computed, the camera is directed toward the new point and a new image B is taken. The actions $A_1, \ldots, A_8$ always force the camera to be directed away from the current fixation point. Action $A_0$ is justified by biological evidence that human subjects sometimes make a series of very short saccades within the same region of interest, perhaps for the purpose of acquiring more information. If the agent chooses to execute $A_0$ and the most salient point is near the optical axis, the agent need not move towards a point in the periphery.

Early in the development of this system we had two extra actions: a search action and a saccade to a random point. We realized, however, that the search for the target object is part of the "within fixation" processing; consequently, it should be carried out at each time step, and should not be considered a separate action. The random saccade was also eliminated, because if the outcome of an action has a uniform distribution, then the agent has no way of learning whether it should execute that action or not. Also, before defining the actions using 90° sectors, we tried 45° sectors. We opted in favor of the current size of 90° because with too narrow an angle, some objects might not be entirely contained within a sector, so their symmetry would be influenced in a negative way. A sketch of how the most salient point in a sector can be selected is given below.
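The following sketch illustrates the sector-based selection of the next fixation point. The placement of the origin at the image center and the upward orientation of the y axis are our assumptions about the coordinate conventions; the saliency map itself is computed by the vision module described later.

```python
import numpy as np

def next_fixation(saliency, action):
    """Most salient point for action A_k (a sketch).

    saliency: 2-D map of saliency values. A_0 searches the whole image;
    A_1..A_8 search the 90-degree sector bisected by the ray at
    45 * (k - 1) degrees from the x axis.
    """
    h, w = saliency.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - w / 2.0, (h / 2.0) - ys       # cartesian system, y up
    if action == 0:
        mask = np.ones_like(saliency, dtype=bool)
    else:
        theta = np.degrees(np.arctan2(dy, dx)) % 360.0
        center = 45.0 * (action - 1)            # direction of the ray R_k
        diff = np.abs((theta - center + 180.0) % 360.0 - 180.0)
        mask = diff <= 45.0                     # 90-degree sector around R_k
    masked = np.where(mask, saliency, -np.inf)
    return np.unravel_index(np.argmax(masked), masked.shape)
```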
Figure 5.7: If the reward is found in the periphery (image A), then upon subsequent visits to the same region (image B) the reward could be missed, despite the fact that the agent executed the "right" action.

The reward function depends on the current state s, the action a taken in s, and the next state s', and is given by:

$$r(s, a, s') = \begin{cases} 100 \cdot e^{-x^2 / (2\sigma^2)} & \text{if the target is found in state } s' \\ -4 & \text{otherwise} \end{cases} \qquad (5.5)$$

In the above equation, x is the distance from the center of the image (the current fixation point) to the target object, and the variance $\sigma^2$ controls the size of the gaussian defining the reward. The reason we reward more for bringing the target to the center of the image than for finding it in the periphery is that the latter situation is unreliable. Consider the following scenario: the agent is in state s, takes action a and lands in state s', where it perceives the target somewhere near the periphery. If the agent gets a significant reward, then the action a becomes optimal in state s and will be chosen upon subsequent visits to s. By using solely visual information, the fixation points can only be computed approximately. Thus, often, when the agent revisits state s and takes action a, the target object is just out of the field of view, and the agent gets penalized for the same action that previously brought a large reward. See Figure 5.7.

The optimality metric (i.e. the function that the agent tries to maximize) is the expected discounted sum of rewards:

$$V^{\pi}(s) = E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) \qquad (5.6)$$
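As a concrete illustration of Equations (5.5) and (5.6), the following sketch evaluates the one-step reward and the discounted return. The value of sigma is illustrative only, since the text does not state the one used in the experiments.

```python
import numpy as np

def reward(target_found, x, sigma=50.0):
    """One-step reward r(s, a, s') of Equation (5.5).

    x: distance (pixels) from the image center to the target;
    sigma: illustrative width of the reward gaussian (assumed value).
    """
    if target_found:
        return 100.0 * np.exp(-x**2 / (2.0 * sigma**2))
    return -4.0

def discounted_return(rewards, gamma=0.9):
    """V = sum_t gamma^t r_t, the objective of Equation (5.6)."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))
```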
5.3 The Vision Module

The vision module (see Figure 5.1) implements some routines as part of the "within fixation" processing. We build a color map using histogram back-projection [44], [45] to obtain candidate locations for the target object in low resolution, wide field of view images. To assess the value of each of these locations we use a recognizer based on histogram intersection [45]. If any of these locations turns out to be a good candidate, the camera is directed to that location, zooms in, and, for high confidence, another match is performed using the same recognizer and the high resolution template of the target object. Finally, if the object was not found in the current frame, a new saccade is necessary. The camera returns to the pan and tilt prior to inquiring the candidate locations, and a new fixation point is selected. To this end, we use a context free symmetry operator [36] to build a symmetry map and select the next fixation as the point with the highest symmetry. For the sake of completeness, we shall briefly describe these routines below.

5.3.1 The Symmetry Operator

Man-made objects tend to have more symmetry than natural objects. In an indoor environment such as a lab, most of the objects are man-made, so it is natural to try to employ a symmetry operator in an attempt to fixate on the centers of objects.

Figure 5.8: Two edge pixels $p_i$ and $p_j$ vote for their midpoint. The strength of the vote depends on the distance between $p_i$ and $p_j$, and on the length and the orientation of the intensity gradient at each of the two pixels.

Following [36], for a graylevel image, we compute the edge map, and then we let each pair of edge pixels vote for their midpoint. The vote is modulated by the distance between the pixels, and by the direction and the length of the gradient of the intensity. More specifically, if I(x, y) is the graylevel image, for each edge pixel $p_i$ define

$$r_i = \log(1 + |\nabla I(p_i)|) \qquad (5.7)$$

For any edge pixel $p_i$ let $\theta_i$ be the argument of $\nabla I$, i.e. the oriented angle between the horizontal and the vector $\nabla I$. For any two edge pixels $p_i$ and $p_j$ let $\alpha_{ij}$ be the angle between the horizontal and the line $p_i p_j$ (see Figure 5.8). The phase factor for pixels $p_i$ and $p_j$ is defined by

$$P(i, j) = (1 - \cos(\theta_i + \theta_j - 2\alpha_{ij})) \cdot (1 - \cos(\theta_i - \theta_j)) \qquad (5.8)$$

We define the distance factor for edge pixels $p_i$ and $p_j$ by

$$D(i, j) = \frac{1}{2\pi\sigma} e^{-\frac{(\|p_i - p_j\| - a)^2}{2\sigma^2}} \qquad (5.9)$$

The vote of the edge pixels $p_i$ and $p_j$ is then defined as

$$V(i, j) = D(i, j) \cdot P(i, j) \cdot r_i \cdot r_j \qquad (5.10)$$

The gaussian in equation (5.9) achieves its maximum when the distance between $p_i$ and $p_j$ is equal to a user-specified parameter a, giving pixels at distance a an increased vote. See Figure 5.9.

Figure 5.9: (a) Sample image. (b) Symmetry map obtained by maximizing the vote for pixels at distance a = 130.

By equation (5.8), the phase factor P(i, j) achieves its maximum when

$$\theta_i + \theta_j - 2\alpha_{ij} = \pi \quad \text{and} \quad \theta_i - \theta_j = \pi$$

which is the case when the two gradients are parallel to the line $p_i p_j$ and point in opposite directions. On the other hand, P(i, j) achieves its minimum when, for instance, $\theta_i = \theta_j$, i.e. when the two gradients are parallel and point in the same direction.

5.3.2 Histogram Back-projection

In addition to symmetry, we use color to define saliency. We do not try to build a task independent, bottom-up color map which would pick up regions contrasting in color with their surroundings (as in [16]), but rather try to find locations in the image with a color distribution similar to the target template. To this end, we use the histogram back-projection algorithm [44], [45].

Figure 5.10: (a) Sample search image. (b) Model (target) image. (c) Histogram back-projection of (b) onto (a).

Given two color images I (the search image) and M (the model, provided a priori), this algorithm finds the locations in I where M is likely to be found. First, one must build the color histograms $H_M$ and $H_I$ of M and I respectively, on the same set of color bins. For improved performance, we use the L*u*v* color space [18] rather than RGB, given by:

$$L^* = 25 \left( \frac{100\, Y}{Y_0} \right)^{1/3} - 16, \qquad u^* = 13 \cdot L^* (u' - u_0), \qquad v^* = 13 \cdot L^* (v' - v_0) \qquad (5.11)$$

In the above equations,

$$u' = \frac{4X}{X + 15Y + 3Z}, \qquad v' = \frac{9Y}{X + 15Y + 3Z}$$

$$\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} 0.490 & 0.310 & 0.200 \\ 0.177 & 0.813 & 0.011 \\ 0.000 & 0.010 & 0.990 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} \qquad (5.12)$$

and $Y_0$, $u_0$, $v_0$ are the reference white values for Y, u', v', respectively. The color bins are precisely the colors that occur in I. Two different source images I and I' have different sets of color bins, and so these bins must be re-computed for each pair (I, M). For an efficient implementation we use look-up tables for the RGB → L*u*v* transformation. After computing $H_M$ and $H_I$ we compute the ratio histogram R by

$$R(j) = \min\left( \frac{H_M(j)}{H_I(j)}, 1 \right) \qquad (5.13)$$

for each color bin j. This ratio histogram is back-projected onto a gray-level image B (the color map) of the same size as the source image I by

$$B(x, y) = R(I(x, y)) \qquad (5.14)$$

That is, for each pixel (x, y), we determine what bin I(x, y) falls in, look up the value of R on that bin, and set B(x, y) equal to this value. The most likely locations of the target object represented by the model M are the pixels (x, y) for which B(x, y) is the largest.
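A minimal sketch of Equations (5.13) and (5.14) follows. It assumes the color quantization has already been carried out, so that each pixel of the search image is represented by the index of its color bin; this data layout is our assumption, not a detail of the actual implementation.

```python
import numpy as np

def back_project(search_bins, model_hist, search_hist):
    """Histogram back-projection, Equations (5.13)-(5.14).

    search_bins: 2-D int array giving, for every pixel of I, its color bin.
    model_hist, search_hist: H_M and H_I computed over the same bins.
    """
    ratio = np.minimum(model_hist / np.maximum(search_hist, 1e-9), 1.0)  # R(j)
    return ratio[search_bins]                     # B(x, y) = R(I(x, y))
```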
The histogram back-projection algorithm works well if the target object is present in the source image, but it cannot distinguish whether the target is present at all or not. The reason is that in the back-projected image B there is always a maximum. Therefore, after having found the candidate locations in the source image, we must run a recognizer to match the model template M at each location. The recognizer we used is based on histogram intersection.

5.3.3 Histogram Intersection

For the recognition of the target object with sufficient confidence, we use a classifier based on the histogram intersection [41], [44]. Let I and T be two images of the same size that we want to match. Mathematically, neither image has a preferred significance, but in practice T is a high resolution template of a given object (in our case the object is a software box) and I is an unknown image that we want to classify as similar to T or not. As with histogram back-projection, we compute the color histograms $H_I$ and $H_T$. Here, since T is given a priori and is large enough (at high resolution an object appears larger), we choose the color bins to be precisely the colors that occur in T. Define the histogram intersection distance between $H_I$ and $H_T$ by

$$d(H_I, H_T) = 1 - \frac{\sum_j \min(H_I(j), H_T(j))}{\sum_j H_I(j)} \qquad (5.15)$$

If I and T have the same size, then obviously

$$\sum_j H_I(j) = \sum_j H_T(j) = \text{length} \times \text{width}$$

so d is symmetric. Moreover, note that for each color bin j

$$\min(H_I(j), H_T(j)) \le H_I(j) \qquad (5.16)$$

with equality iff $H_I(j) \le H_T(j)$. Summing over j, we get

$$\frac{\sum_j \min(H_I(j), H_T(j))}{\sum_j H_I(j)} \le 1$$

with equality iff all inequalities in (5.16) are equalities, i.e. when $H_I = H_T$. Hence $0 \le d(H_I, H_T) \le 1$, and $d(H_I, H_T) = 0$ iff $H_I = H_T$.

The template image T is chosen to be about the size of the target object, and if its histogram is rich enough, a recognizer based on the above distance has very high accuracy. Certainly, given a template T we can produce an image I with the exact same histogram as T by re-shuffling the pixels in T, but this is rarely the case in realistic situations.
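Equation (5.15) translates directly into code; the sketch below assumes the two histograms are already computed over the same bins.

```python
import numpy as np

def histogram_intersection_distance(h_i, h_t):
    """d(H_I, H_T) = 1 - sum_j min(H_I(j), H_T(j)) / sum_j H_I(j), Eq. (5.15).

    When the two histograms have equal total mass, the distance is
    symmetric, lies in [0, 1], and is 0 iff the histograms coincide.
    """
    return 1.0 - np.minimum(h_i, h_t).sum() / h_i.sum()
```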
5.4 More Visual Routines

In addition to the feature maps produced using the symmetry operator and the histogram back-projection described in the previous sections, we have experimented with a few other feature maps, in an attempt to produce fixation points similar to recorded human fixations. These maps have not been incorporated in the current implementation. Nevertheless, we consider them good starting points for selecting visually important regions in an image, and for this reason we shall describe them briefly in this section.

Variance map. Recordings of human scan patterns [14] show that people almost never fixate on flat, uniform surfaces. Obviously, then, we can use the variance of the intensity to detect non-uniform surfaces: the larger the variance in a region, the more non-uniform that region is. We cover a gray scale image with a grid consisting of square cells. On each cell, we compute the variance of the intensity, and we retain only the cells with variance within some interval $(v_{min}, v_{max})$. Figure 5.11 shows an indoor image with the recorded human fixations marked by the green dots, and the results of our variance map on the gray scale equivalent of the original image. The red squares are the cells selected by our algorithm. The drawback of the variance map is that the thresholds $v_{min}$ and $v_{max}$, as well as the size of the cells in the grid, must be set a priori by the programmer.

Figure 5.11: (a) Indoor image. The dots represent the recorded human fixation points. Courtesy of the Michigan State University EyeLab. (b) Intensity image of (a). The red squares are the cells with large variance.

Another idea in defining saliency is to declare a pixel salient if and only if it is sufficiently different from the mean in some feature (e.g. intensity). To fix ideas, let V(x, y) be the intensity at pixel (x, y). Let $m_V$ and $\sigma_V$ be the mean and the standard deviation of the intensity V over the whole image. Construct an intensity map I by defining

$$I(x, y) = \begin{cases} |V(x, y) - m_V| & \text{if } |V(x, y) - m_V| > \sigma_V \\ 0 & \text{otherwise} \end{cases} \qquad (5.17)$$

The intensity map obtained from Figure 5.11 (a) is presented in Figure 5.12 (a). The darker a pixel, the more salient it is. If instead of intensity we use contrast, which we define at each pixel by

$$C(x, y) = \frac{|V(x, y) - m_V|}{m_V}$$

we obtain the map depicted in Figure 5.12 (b), which we term the contrast map.

Figure 5.12: (a) Intensity map for the image in Figure 5.11 (a). (b) Contrast map. The brighter regions are the most salient. Compare with the human fixations in Figure 5.11 (a).

The last feature map that we describe is a color opponency map. Unlike the previous maps, this map is biologically motivated by neuro-physiological evidence observed in humans, primates and cats [23], [31], [40], [51]. Some of the retinal photoreceptors are highly sensitive to red light (R cones), others to green light (G cones), and yet a third category of cones is sensitive to blue light (B cones). Impulses from these photoreceptors are transmitted to the bipolar cells, which project to the ganglion cells of the retina. The axons from the ganglion cells converge at the optic disk to form the optic nerve, which projects to the lateral geniculate nucleus (LGN). It has been observed that some of the ganglion cells (loosely called the red-green opponent cells) have receptive fields consisting of a center of R cones and a surround of G cones. Given an RGB image, we model the red-green opponency by computing at each pixel (x, y) a response

$$RG(x, y) = \iint_{\text{center}} R(u, v)\, g_{\text{center}}(x - u, y - v)\, du\, dv \; - \; \iint_{\text{surround}} G(u, v)\, g_{\text{surround}}(x - u, y - v)\, du\, dv \qquad (5.18)$$

where R(x, y) and G(x, y) are the red and green values at pixel (x, y), $g_{\text{center}}$ is a gaussian with a small standard deviation and $g_{\text{surround}}$ is a gaussian with a larger standard deviation. The first integral in equation (5.18) is the sum of the red values near the pixel (x, y), weighted by a gaussian peaked at (x, y), and represents the excitatory component. The inhibitory component comes from the surround, and is the sum of the green values weighted by a gaussian of a larger size. See Figure 5.13. We also compute a green-red response GR by using the same procedure. Clearly, then, we can iterate this process by letting R ← RG and G ← GR. The results are presented in Figure 5.14. The brightest spots represent the regions in the input image where the contrast between the red center and the green surround (or vice versa) is maximal.

Another color opponency observed in the ganglion cells is the blue-yellow opponency. The receptive fields of these cells are not separated in the center/surround dichotomy as in the red-green cells, but for the purposes of a computer implementation one could use similar techniques.
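Equation (5.18) amounts to a difference of two gaussian-smoothed channels, so it can be approximated with standard filtering, as in the sketch below. The two standard deviations are illustrative values of our choosing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def red_green_opponency(rgb, sigma_center=2.0, sigma_surround=6.0):
    """Approximate the red-green response of Equation (5.18).

    The excitatory center is the R channel smoothed by a narrow gaussian;
    the inhibitory surround is the G channel smoothed by a wider one.
    """
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    center = gaussian_filter(r, sigma_center)       # excitatory R center
    surround = gaussian_filter(g, sigma_surround)   # inhibitory G surround
    return center - surround

# The green-red map GR swaps the roles of the two channels; iterating with
# R <- RG and G <- GR reproduces the maps shown in Figure 5.14.
```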
Figure 5.13: Gaussians used to model color opponency. The red curve weighs the responses of the central red cones, while the green curve weighs the (inhibitory) responses of the surrounding green cones. The blue curve is the difference of the two gaussians.

Figure 5.14: (a) Original RGB image. (b) Red-green opponency map. (c) Green-red opponency map. (d) Red-green opponency using (b) and (c) as the red and green channels. (e) Green-red opponency using (b) and (c) as the red and green channels. The brightest regions are the ones in which the center-surround opponency is the largest. The green-red opponency picks up the flag and the computer chair, while the red-green opponency picks up the red box.

5.5 Algorithm

The system starts with an empty set of clusters of RGB images, and with a low resolution template $M_{low}$ and a high resolution template $M_{high}$ of the target object. The camera is pointed at a random initial location and a 384 × 384 image $I_0$ is taken at the widest field of view of the camera (hence, at the lowest resolution). An observation vector $O_0$ is extracted from $I_0$ (as described in Section 5.2.1) and a cluster $C_0 = \{O_0\}$ is formed. An array $Q(C_0, \cdot)$ of length 9 is defined and initialized to 0, and a scalar $T_0 = \infty$.

Suppose now that, after n steps, the system has clusters $C_0, \ldots, C_m$ representing various regions in the room, and Q-values $Q(C_0, \cdot), \ldots, Q(C_m, \cdot)$. Suppose also that the system has inferred it is in state $C_{current}$, based on the latest low resolution image $I_n$ grabbed by the camera. An action $a_{current}$ is chosen ε-greedy. That is, with probability $1 - \epsilon$ we choose $a_{current} = \arg\max_a Q(C_{current}, a)$, and with probability $\epsilon$ a random, exploratory action is chosen.

Execution of action $a_{current}$ is only necessary if the target object is not present in the current image $I_n$, so the agent tries to detect the object first. To this end, we build a color map $B_n$ by back-projecting the low resolution template $M_{low}$ onto $I_n$ (see Figure 5.10). Note: we do not back-project $M_{low}$ onto all of $I_n$, but onto a central patch in $I_n$ of size 192 (half the size of $I_n$). The reason is twofold: in the first place, we reduce the processing time; secondly, since in the retina the cone density decreases tremendously towards the periphery [32], [35], color perception is severely impaired far from the optical axis.

Let $p_1, p_2, p_3$ be the brightest points in $B_n$. We match the model $M_{low}$ with a neighborhood of $p_i$ for each i = 1, 2, 3 using histogram intersection, producing distances $d_1, d_2, d_3$. At low resolution, however, histogram intersection is not reliable enough, and the separation threshold varies from frame to frame. To solve this problem, the camera must zoom in at each of the $p_i$'s, grab a new image $I_i'$, and do a fine match, this time at high resolution, using the model $M_{high}$. However, since histogram back-projection produces candidate locations whether or not the target is present, the camera would have to zoom in three times per frame, most of the time unsuccessfully.
We reduce substantially the number of times we zoom in by scoring each of the points $p_i$ by

$$s_i = \frac{d_i}{B_n(p_i)} \qquad (5.19)$$

where $B_n(p_i)$ is the gray level of $p_i$ in $B_n$ (the color saliency), and $d_i$ is the histogram intersection distance (5.15). The lower the score $s_i$, the better the candidate $p_i$. Using these scores (rather than just the histogram distances $d_i$), it is possible to separate the good candidates from the bad ones in most images from the same cluster, but we could not find a unique separation threshold across clusters. The solution is to have each cluster keep track of its own threshold (call it T) and dynamically adjust it as follows. First, in a newly created cluster, set T = ∞. At each time step, one of the following two cases can occur: either (a) some of the candidates $p_i$ have scores $s_i < T$, or (b) all $s_i \ge T$.

In case (a), the camera is pointed to $p_i$ and zooms in. A high resolution image $I_i'$ is taken and the high resolution model $M_{high}$ is back-projected onto $I_i'$. A fine match of $M_{high}$ is performed, again using histogram intersection. At high resolution, histogram intersection works remarkably well, since images in which the target is present can be readily separated from images which do not contain the target, as can be seen in Figure 5.15.

Figure 5.15: Red: distribution of high resolution images containing the target object. Green: distribution of high resolution images not containing the target object.

If the object is found at $p_i$, the agent is rewarded according to Equation (5.5), and a new training epoch starts with the clusters and the Q-values learned so far, by directing the camera to a new random initial location. Otherwise, if the object is not present at $p_i$, the camera zooms back out, and proceeds to interrogate the other points $p_j$ with score $s_j < T$, if any. When all candidates $p_i$ have been examined, the threshold T is set to $T = \min\{s_i\}$, the camera returns to the pan and tilt it had before investigating the $p_i$'s, and we are in case (b). See Figure 5.16.

Figure 5.16: Left: Low resolution image. The points $p_1, p_2, p_3$ are produced using the histogram back-projection of the low resolution model $M_{low}$. At each $p_i$, a score $s_i$ is computed using histogram intersection. Right: High resolution image centered at $p_3$. The high resolution model $M_{high}$ is first back-projected onto the image, and then matched using histogram intersection. If the target was not found at any of the $p_i$'s, the camera returns to the pan and tilt coordinates it had before the local search, and the threshold T is set to $\min s_i$.

So far, the system has decided that the target object is not present in the current image $I_n$, either because all candidates $p_i$ had scores $s_i \ge T$, or because the fine match recognizer ruled out all the candidates. At this point, a new saccade is necessary, so finally, action $a_{current}$ is carried out (from state $C_{current}$). The next fixation point is selected based on the symmetry map. The winner is the point with the largest symmetry vote (Equation 5.10) from all the edge pixels in the image $I_n$ if $a_{current} = A_0$, or from the edge pixels in one of the 8 central 90° sectors if $a_{current} = A_k$ for some $k = 1, \ldots, 8$. The camera is pointed to the new fixation point and the action is complete.
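The candidate examination and per-cluster threshold adaptation just described can be summarized in a short sketch. The list-of-pairs data layout and the fine_match callable (standing in for the zoom-in plus high resolution recognition step) are hypothetical; the actual implementation may organize this differently.

```python
def examine_candidates(candidates, threshold, fine_match):
    """Sketch of the zoom-in policy with a per-cluster threshold T.

    candidates: list of (point, score) pairs, scores from Equation (5.19).
    threshold: the current value of T for this cluster.
    fine_match: callable returning True if the high resolution match at
    a point succeeds (hypothetical signature).
    Returns (target_location_or_None, updated_threshold).
    """
    for point, score in candidates:
        if score < threshold and fine_match(point):
            return point, threshold                 # target found; keep T
    # All examined candidates failed (or none was below T): tighten T.
    return None, min(score for _, score in candidates)
```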
Knowing now whether $C_{current}$ contains the target or not, the agent is rewarded according to Equation (5.5) for the transition from state $C_{previous}$ to $C_{current}$, as a result of some action $a_{previous}$. Action $a_{current}$ will be rewarded at the next iteration, because only then will the agent know whether $a_{current}$ brought the goal into the field of view or not. $Q(C_{previous}, a_{previous})$ is updated by Equation (2.11). A common technique for speeding up the propagation of the reward is to update more than a single (state, action) pair. Our agent keeps track of the previous five pairs visited, and updates in one step all the corresponding Q-values.

After the update, the camera grabs a new (low resolution) image $I_{n+1}$, and the agent estimates the state. More specifically, a feature vector $O_{n+1}$ is extracted from $I_{n+1}$, and the closest cluster to $O_{n+1}$ is found (with respect to the Kullback distance (5.4)). Denote this cluster by $C_{closest}$, and its center by $O_{closest}$. If the Kullback distance $K(O_{n+1}, O_{closest}) < k_{min} = 1.5$, then $O_{n+1}$ is assigned to cluster $C_{closest}$, and the agent is in state $C_{current} = C_{closest}$. No adjustments are made to $C_{closest}$, so that at any time, each cluster center consists of a single vector, namely the initial vector used to create that cluster. If $K(O_{n+1}, O_{closest}) > k_{max} = 2.5$, then the current observation is not sufficiently close to any of the existing clusters, and a new cluster $C_{new} = \{O_{n+1}\}$ is created. With this new cluster, a scalar $T_{new} = \infty$ is defined.

Uncertainty occurs when $k_{min} \le K(O_{n+1}, O_{closest}) \le k_{max}$, since this is the case when the overlap between the two curves in Figure 5.3 is significant. In a situation like this, it is not clear whether the images that produced the observations $O_{n+1}$ and $O_{closest}$ should be considered similar or non-similar. The system could very well be "looking" at a previously visited region, but be unable to recognize it, due to noise. Certainly, we could simply create a new cluster whenever $K(O_{n+1}, O_{closest}) \ge k_{min}$, but in doing so, the system unnecessarily ends up keeping multiple clusters representing the same region in the environment. Therefore, before creating a new cluster, the system gets a second chance at estimating the state, by making a small number of very short saccades around the current fixation point. If still none of the new observations falls within distance $k_{min}$ of the closest cluster, then the system has no choice but to create a new cluster.
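The two-threshold state estimation can be sketched as follows, reusing the kullback_distance function from the Section 5.2.1 sketch. The integer cluster ids and the retry callable (producing extra observations from the short saccades) are our assumptions for illustration.

```python
def estimate_state(obs, clusters, k_min=1.5, k_max=2.5, retry=None):
    """Assign obs to a cluster, or create a new one (a sketch).

    clusters: non-empty dict mapping integer cluster ids to their fixed
    center vectors. retry: optional callable returning extra observation
    vectors, used only in the ambiguous band [k_min, k_max].
    """
    best = min(clusters, key=lambda c: kullback_distance(obs, clusters[c]))
    d = kullback_distance(obs, clusters[best])
    if d < k_min:
        return best                            # confidently a known region
    if d <= k_max and retry is not None:
        for extra in retry():                  # a few very short saccades
            if kullback_distance(extra, clusters[best]) < k_min:
                return best
    new_id = max(clusters) + 1                 # otherwise start a new cluster
    clusters[new_id] = obs
    return new_id
```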
A new action is chosen and the process repeats until either the object is found, or a maximum number of iterations has been reached. In what follows, we use the term "epoch" to refer to a sequence of at most $n_{MAX} = 100$ fixations. If either the goal has been reached or all 100 time steps have elapsed, a new epoch is started by pointing the camera to a random location, in order to ensure sufficient exploration of the whole state space. We summarize below the algorithm used for a single training epoch. Some of the technical details are omitted in order to keep the presentation simple.

1) Initialization. Choose the gaze direction of the camera randomly. Set the number of iterations n = 0. Get a low-resolution image I and extract an observation vector O (see Section 5.2.1). Define a cluster $C_0 = \{O\}$, define $T_0 = \infty$, and an array $Q(C_0, \cdot) = 0$ of Q-values.

WHILE object not found and $n < n_{MAX}$

2) Choose an action $a_{current}$, ε-greedy.

3) Color map: back-project a low resolution target template onto I. Score each of the top 3 candidates $p_1, p_2, p_3$ by (5.19). Zoom in at each $p_i$ with $s_i < T_n$, grab a high resolution image, and match the high resolution target template using the histogram intersection. If the object is found, go to step 6. Else, zoom out, adjust $T_n = \min\{s_i\}$ and go to the next step.

4) Symmetry map: compute symmetries on the sector corresponding to action $a_{current}$ and direct the camera towards the most salient point in that sector.

5) Clustering: get a new feature vector O. Find the closest cluster $C_{closest}$ to O in the Kullback distance K. If K is sufficiently small, $C_{next} = C_{closest}$. Else, if K is sufficiently large, create a new cluster $C_{next} = \{O\}$, define $T_{next} = \infty$ and initialize an array $Q(C_{next}, \cdot) = 0$. Finally, if K is near the decision boundary (see Figure 5.3), make a few short saccades and re-compute K. If the new image still cannot be unambiguously classified, create a new cluster.

6) Q-update: compute the reward for action $a_{previous}$ using Equation (5.5). Update the last 5 Q-values according to Equation (2.11). Set n ← n + 1.

END WHILE

Chapter 6
Experimental Results

We describe in this section a number of experiments designed to determine, in the first place, whether any visual learning is at all possible with our reinforcement learning agent, and if so, how good the learning was. We also compare the performance of our agent with the performance of a random agent.

6.1 Learning a Policy for Finding the Target

In the first experiment, we trained the agent to learn in which direction to direct its gaze in order to reach the region where the target object is most likely to be found, in trials of 400 epochs each. Every 5-th epoch was used for testing, i.e. the agent simply executed the policy learned so far. The performance metric was the number of fixations to the goal. Within a single trial the starting position was the same in all test epochs. Figures 6.1 and 6.2 show the number of steps to goal for two initial starting positions.

Figure 6.1: Average number of fixations for every 5-th epoch over five trials of 400 epochs each. At every 5-th epoch, the policy learned was tested starting from pan=500, tilt=-100.

Figure 6.2: Average number of fixations for every 5-th epoch over five trials of 400 epochs each. At every 5-th epoch, the policy learned was tested starting from pan=200, tilt=0.
Figure 6.4 shows a sequence of regions “as seen” by the camera, as it fixated from the starting position (rightmost image) to the target (leftmost). Figure 6.5 plots the centers of the learned regions (states) (the blue dots) in the pan-tilt coordinate system as well as the best actions in each region. Only states with more than 15 visits are shown, out of a total of 104 states. The position of 65 (a) 6 (b) 5 (c) 4 (a) 3 (e) 2 (r) 1 Figure 6.4: The sequence of fixations corresponding to the learned path (red) in Figure 6.3. The camera starts from region (1) and gradually moves towards the goal region (5), Here, a sufliciently good candidate is found at low resolution (1), the camera zooms in (6) and does a high resolution match. This sequence of images corresponds to the red scan path in Figure 6.3 Figure 6.5: States learned (the blue squares) and the best actions. Action A0 is represented by a red square around a state. Actions A1 —A3 are represented by a red segment originating at the state. The target is the green square. 66 the target is marked by the green square. Action AD is represented as a red square around the state and actions A, — A8 are represented as a red segment originating from the center of the state. Note that the length of each segment does not. represent the length of the saccade. The actual saccade was directed at. some salient point in a sector around a red segment. This explains why the field of the red segments does not point directly and uniformly towards the target (the green square). 6.2 Executing the Learned Policy NUMBER OF FIXATIONS FOR DISPLACED TARGET 100 . a . . . T . . "stepsOngl' —— 80 [- .1 § 60 - . if s 40 - - E 20 . l . 0 “11.111111111111111111]th.IIl.].]]]1l11..h11111110111].luhnldlllllllnlllililhlll1]I 0 10 20 30 40 50 60 7O 80 90 1 00 epochs Figure 6.6: Number of saccades per epoch in finding the object inverted, displaced by 60 cm from the training position. Testing from pan=500, tilt=—100. The purpose of our second experiment was to determine what changes in the location of the target the system could manage. After the agent was trained to find the target upright in its usual location, the object was turned upside down and 67 NUMBER OF FIXATIONS FOR DISPLACED TARGET 100 I I I I I T I I 'stepsOSgpI" —— 80 I— . U) 8 § 60 - - S ”E s 40 - - E a 20 1- -4 0 1.111111111111111111111.111111111111111-.11111111111111l11llf11|1i]1]11111ll: I 11111111111111] 40 50 60 7O 80 90 0 1O 20 30 100 epochs Figure 6.7: Number of saccades per epoch in finding the object inverted, displaced by 60cm from the training position. Testing from pan=200, tilt=0. displaced to a new location (about 50-60 cm from the training position), while the agent was executing the learned policy. Again, we tested starting from a few initial locations and we averaged over several test trials. The agent successfully found the Object in a small number of fixations, as shown in Figures 6.6 and 6.7. 6.3 Re—training for a Different Target To estimate how sensitive the system was to the various parameters in the visual rou- tines, we re-trained the agent to find a different object (the green flag in Figure 6.8) without changing any parameters. Somewhat surprisingly, there is no noticeable qual- itative difference between the performance in this experiment and in experiment 1. Figures 6.9 and 6.10 show the number of fixations for finding the flag, starting from 68 (500, —100) and (200,0), respectively. Figure 6.8: (a) The target object in the third experiment was the green flag. 
Figure 6.9: Average number of fixations for finding the MSU flag (Figure 6.8) for every 5-th epoch over five trials of 400 epochs each. At every 5-th epoch, the policy learned was tested starting from pan=500, tilt=-100.

Figure 6.10: Average number of fixations for finding the MSU flag (Figure 6.8) for every 5-th epoch over five trials of 400 epochs each. At every 5-th epoch, the policy learned was tested starting from pan=200, tilt=0.

6.4 Comparison with Random Search

Finally, in our fourth experiment, we compared the performance of the reinforcement learning agent with a random agent, i.e. an agent which performed random search. The actions in the random agent were exactly the same as in the reinforcement learning agent, but their selection was random, rather than based on the Q-values. The results are presented in Figures 6.11 and 6.12. On average, the number of fixations in the random search was between 50 and 70, substantially larger than the average number of fixations in the previous experiment (6-7 fixations), where the agent was only executing a learned policy.

Figure 6.11: Number of saccades per epoch in a random search. Testing from pan=500, tilt=-100.

Figure 6.12: Number of saccades per epoch in a random search. Testing from pan=200, tilt=0.

6.5 Comparison with Exhaustive Search

Besides the comparison with the random agent, we shall present, for the sake of completeness, a comparison between the learning strategy proposed in this thesis and an exhaustive search agent. By exhaustive search we mean that the camera starts exploring its pan-tilt universe from one fixed location (e.g. the upper left corner) and moves in fixed increments, one step at a time, until the whole pan-tilt space has been searched, or until the target has been found. To this end, let us recall that the pan-tilt range of our camera is [-100, 100] × [-25, 25] degrees, and that the widest field of view for one frame is about 48 × 33 degrees. We assume that all the visual processing is the same in the search agent as in the learning agent, so first a low resolution search is carried out, and then a high resolution search around the most promising locations detected at low resolution, as described in Section 5.5. Recall from the same section that, within one frame, the search is not performed on the whole image, but rather on a central patch of half the size. Thus, each low resolution color search covers a visual field of about 24 × 17 degrees. The pan-tilt range of the camera can then be tiled by approximately 8 × 3 = 24 search regions. As the target object could be in any of these 24 regions, the system would make at most 24 fixations for finding the object in the worst case, and (1 + 2 + ... + 24)/24 = 12.5 fixations on average. These results are summarized in Table 6.1.
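The arithmetic above can be checked directly; note that 200/24 is rounded down to 8 columns, following the approximation in the text.

```python
tiles = 8 * 3                               # ~ (200 / 24) x (50 / 17) patches
worst_case = tiles                          # every region visited once
average = sum(range(1, tiles + 1)) / tiles  # (1 + 2 + ... + 24) / 24
print(worst_case, average)                  # prints: 24 12.5
```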
Start Training (box) Test (box) Training (flag) Random Exhaustive (500,—100) 15.8 6.7 18.7 66.9 worst: 24 (200,0) 11.41 6.6 12.8 51.1 avg: 12.5 Table 6.1: Average number of fixations for the three experiments starting from two test points. In experiment 1 testing was done every 5—th epoch. The rest of the epochs were training epochs. Experiment 2 consisted only in executing a learned policy, with the target object inverted and slightly displaced. In experiment 3, a random search was done. 72 Chapter 7 Discussion The set of experiments presented in the previous section is by no means comprehen- sive. NO systematic analysis is done to explore the effect of the learning parameters (the learning rate a, and the exploration rate 6), nor to observe the effect of the var- ious parameters from the visual routines. Although no further tuning was necessary to find the flag, after the software box was found, we do not expect this to be the case for any target Object. What the experiments do show, however, is that learning does occur, and that the system is able to find the target reliably, and that the search is robust to affine transformations (translations and rotations) of the Object within one region. We discuss below some technical issues that we encountered. Visual saliency. Color and symmetry were the means of defining saliency in our implementation. The symmetry operator used picks up objects fairly well, provided the rough size of the objects is known. Tuning the operator for objects of a particular size, will not work well for objects of different sizes. This is one reason why the fixations of our agent, in general, do not cluster 011 objects, the way human fixations 73 do. Moreover, the symmetry map is influenced by the fact that we apply the symmetry operator on a rather limited sector of the image. Symn‘ietric objects that simply do not fit in that sector, will have lower saliency. Although back-projection histogram is a. good way of detecting objects by color, it is not biologically plausible. As the evidence for color Opponency shows, biological vision systems use feature contrast, rather than feature values. Histograms, of course, use values. Some effort was required to distinguish genuine candidate locations of the target object from noise, in low resolution images, but back-projection histogram in conjunction with the histogram intersection is a fast and effective way of building a color saliency map, provided a template of the target object is given. Clustering. To our knowledge, an artificial general purpose classifier with nearly perfect rate does not exist, and we did not attempt to build one. Due to the prob- abilistic nature of Q-learning, however, an imperfect (but reasonable) classifier will suffice. If the classifier has accuracy of, say, 90%, then, statistically, the agent will receive the correct reward most of the time, which is all it needs to learn the correct best action. We opted to represent images by means Of color histograms, in order to categorize pairs of images as “similar” or “non-similar”. The loss of the spatial distribution of the colors in an image leads to perceptual aliasing, but we found that perceptual aliasing is unlikely to occur in a cluttered environment such as a lab, particularly if we subdivide the image into smaller pieces, and compute the histogram on each piece. The real problem that we believe is at the core of our clustering algorithm, is more mathematical in nature. Let V be the feature space produced by our feature 74 extraction method. 
Each observation is a vector in V and has dimension 196 in this implementation. For each pan-tilt (:11, y) of the camera, we get a vector 0(33, y) E V, namely the observation extracted from the image taken when the camera is pointed at (:6, y). For a given environment, we get this way a mapping from the two dimensional euclidean space into V 0:R2——>V hence a surface in V. Since the histogram is known to change slowly with (:L', y), this surface is continuous, which entails that there are no natural clusters in V to represent different regions in the room. By choosing a threshold for the similarity metric (the Kullback distance), we effectively used spherical clusters in the feature space V. We preferred to lower the Kullback threshold used to cluster the low resolution images in order to reduce the false accept rate (i.e. to reduce the number of images classified as “similar” when in fact they were “non-similar”). Inherently, this increased the false reject rate, that is, often two genuinely similar images were classified as non-similar, and a new cluster was started, representing in fact a region that had been visited previously. After 400 training epochs, we ended up with about 80- 100 clusters, out of which about 40 were distinct by visual inspection. Although “duplicate” clusters are wasteful and unnecessarily increase the number of states, they are harmless for the Q-learning algorithm, unless their number increases indefinitely, because in that case, the (state, action) pairs would not be visited sufficiently many 75 times. The way we propose to approach this problem is to invoke an alternative similarity metric when (and only when) the Kullback distance is indecisive. Certainly there will be an interval (a,b), in which the alternative metric cannot distinguish well between similar and non-similar image, just like the Kullback metric does not work well between (1.5, 2.5), but it is crucial that (a,b) and (15,25) be disjoint. Good candidates would be metrics on edge pixels (edge density, curvature, etc.), because edge information and color information are uncorrelated. An alternative idea1 is to modify the feature extraction. Instead of dividing each image into quadrants and compute the color histogram on each of the 4 regions, we can extract visual information (e.g. color histogram, or some iconic representation in the sense of [4]) locally, from around the top salient points. This way, the visual infor- mation will be more “anchored” to the environment, and consequently, less sensitive to (small) changes in the camera pan and tilt. The problem that this approach raises is that, unlike in the case of the 4 quadrants where we can always impose a fixed order, there is no canonical way Of defining an ordering on salient points, which is absolutely crucial if local observation vectors are going to be concatenated to produce an observation vector for the whole image. lJon Connell, IBM, personal conununication. Chapter 8 Conclusions and Future Work We have built a reinforcement agent which learns the region where a given object can usually be found. Learning occurs by choosing the direction of the next saccade according to its utility (the Q-value). Once the direction has been chosen by the reinforcement learning module (see figure 5.1), low level visual routines find the most salient point in that direction and aim the camera at it. The experimental results show that the agent is capable of learning, even with an unsophisticated reinforcement learning mechanism. 
The way we propose to approach this problem is to invoke an alternative similarity metric when (and only when) the Kullback distance is indecisive. Certainly, there will be an interval (a, b) in which the alternative metric cannot distinguish well between similar and non-similar images, just as the Kullback metric does not work well between (1.5, 2.5), but it is crucial that (a, b) and (1.5, 2.5) be disjoint. Good candidates would be metrics on edge pixels (edge density, curvature, etc.), because edge information and color information are uncorrelated.

An alternative idea (due to Jon Connell, IBM, personal communication) is to modify the feature extraction. Instead of dividing each image into quadrants and computing the color histogram on each of the 4 regions, we can extract visual information (e.g. a color histogram, or some iconic representation in the sense of [4]) locally, from around the top salient points. This way, the visual information will be more "anchored" to the environment, and consequently less sensitive to (small) changes in the camera pan and tilt. The problem that this approach raises is that, unlike in the case of the 4 quadrants where we can always impose a fixed order, there is no canonical way of defining an ordering on salient points, which is absolutely crucial if local observation vectors are going to be concatenated to produce an observation vector for the whole image.

Chapter 8
Conclusions and Future Work

We have built a reinforcement learning agent which learns the region where a given object can usually be found. Learning occurs by choosing the direction of the next saccade according to its utility (the Q-value). Once the direction has been chosen by the reinforcement learning module (see Figure 5.1), low level visual routines find the most salient point in that direction and aim the camera at it. The experimental results show that the agent is capable of learning, even with an unsophisticated reinforcement learning mechanism. In comparison with random search, our agent is about ten times better (see Table 6.1), which shows the usefulness of learning. Certainly, exhaustive search is always an option, but it defeats the purpose of our study.

Our agent comes short of learning spatial relationships between objects. It cannot, for instance, learn that a painting is usually on a wall, or that a pen is usually on a desk. It can learn, however, relationships of the form this pen is in this region (which happens to be a table).

In the current implementation, the information that is being integrated across fixations is color information (through histograms) and the direction of the next saccade. No deep image understanding takes place at each fixation, except to decide whether the current region has been seen before, and whether or not it contains the target object. In this sense, the information we retain is minimal, but sufficient for the search task. The visual information used to define saliency in an image is based on color and symmetry. While the symmetry operator is task independent and context free, the histogram back-projection is not, in that it uses a template image of the target object.

The performance of our system can be improved in a number of ways. In the first place, different variants of Q-learning should be investigated. For instance, one could use eligibility traces [43], which are temporary records of the occurrence of an event (e.g. visiting a (state, action) pair), used to selectively assign credit to the events that were marked as eligible; a sketch is given at the end of this chapter. Secondly, the bottom-up visual attention mechanism must be improved for a better selection of the fixation points. We proposed in Section 5.4 a number of other feature maps, but understanding how to combine them into a unique saliency map requires further investigation. Thirdly, the classifier used to cluster images should be improved by adding one more feature, "orthogonal" to the Kullback metric, or by using iconic representations around the top salient points in the image, as already discussed at the end of Chapter 7. Our clustering algorithm tries to group together similar images, irrespective of their utility. A more sophisticated algorithm [25], which clusters observations using both a metric in feature space and the Q-values, could be used for improved performance.

We view this work as an initial exploration towards the implementation of a gaze control mechanism on a mobile robot. We purposefully avoided the use of pan-tilt information in the decision making process because, due to the robot's motion, the target object (such as an exit sign, or an elevator door) changes its position with respect to the agent. A more sophisticated recognizer, capable of identifying the target at different distances and from various points of view, will be needed. We foresee some interesting applications of this system on a real robot. For instance, by visually finding a fixed object in a room, the robot could automatically align itself, which is a necessary step before navigation. Or, the robot could simply find the door and exit the room. A more complex behavior would be to train the search system to recognize several objects in several rooms, and then have the robot tell what room it is in, based on the confidence with which it finds the expected objects.
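The eligibility-trace variant mentioned above generalizes our fixed five-pair backup: recently visited (state, action) pairs receive a share of each temporal-difference error that decays over time. The following sketch follows the general form of Watkins's Q(lambda) [43]; the parameter values and the dictionary-based representation are illustrative assumptions, not part of our implementation.

```python
def q_lambda_update(Q, trace, s, a, r, s_next,
                    alpha=0.1, gamma=0.9, lam=0.8, n_actions=9):
    """One step of a Q(lambda)-style update (a sketch).

    Q: dict mapping (state, action) to values; trace: dict of eligibilities.
    The TD error is propagated to all recently visited pairs, with the
    credit decaying by gamma * lam per step.
    """
    best_next = max(Q.get((s_next, b), 0.0) for b in range(n_actions))
    delta = r + gamma * best_next - Q.get((s, a), 0.0)   # TD error
    trace[(s, a)] = trace.get((s, a), 0.0) + 1.0          # mark as eligible
    for key in list(trace):
        Q[key] = Q.get(key, 0.0) + alpha * delta * trace[key]
        trace[key] *= gamma * lam                         # decay the credit
        if trace[key] < 1e-3:
            del trace[key]                                # prune tiny traces
```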
Bibliography

[1] S. Ahmad and S. Omohundro. Efficient visual search: A connectionist solution. In Proceedings of the 13th Annual Conference of the Cognitive Science Society, Chicago, 1991.

[2] J. Aloimonos, A. Bandopadhay, and I. Weiss. Active vision. In Proceedings of the First International Conference on Computer Vision, pages 35-54, London, 1987.

[3] S.M. Anstis. A chart demonstrating variations in acuity with retinal position. Vision Research, 14:589-592, 1974.

[4] D.H. Ballard. Animate vision. Artificial Intelligence, 48:57-86, 1991.

[5] D.H. Ballard, M.M. Hayhoe, and J.B. Pelz. Memory representations in natural tasks. Journal of Cognitive Neuroscience, 7(1):66-80, 1995.

[6] C. Bandera, F. Vico, J. Bravo, M. Harmon, and L. Baird. Residual Q-learning applied to visual attention. In Proceedings of the 13th International Conference on Machine Learning, pages 20-27, Bari, Italy, July 3-6, 1996.

[7] J.R. Bergen and B. Julesz. Focal attention in rapid pattern discrimination. Nature, 303:696-698, 1983.

[8] C. Bushnell, M.E. Goldberg, and D.L. Robinson. Behavioral enhancement of visual response in monkey cerebral cortex. I. Modulation in posterior parietal cortex related to selective visual attention. Journal of Neurophysiology, 46:755-772, 1981.

[9] T. Darrell and A. Pentland. Active gesture recognition using partially observable Markov decision processes. In 13th IEEE Intl. Conference on Pattern Recognition, 1996.

[10] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18:193-222, 1995.

[11] E.A. DeYoe and D.C. Van Essen. Concurrent processing streams in monkey visual cortex. Trends in Neurosciences, 11(5):219-226, 1988.

[12] M.E. Goldberg and R.H. Wurtz. Activity of superior colliculus in behaving monkey. Journal of Neurophysiology, 35:560-574, 1972.

[13] P. Haenny, J. Maunsell, and P. Schiller. Cells in prelunate cortex alter response to visual stimuli of different behavioral significance. Perception, 13:A7, 1984.

[14] J.M. Henderson and A. Hollingworth. Eye movements during scene viewing: An overview. In Eye Guidance in Reading and Scene Perception, pages 269-295. Elsevier Science B.V., 1998.

[15] L. Itti and C. Koch. A comparison of feature combination strategies for saliency-based visual attention systems. In SPIE Human Vision and Electronic Imaging IV (HVEI'99), pages 473-482, 1999.

[16] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40:1489-1506, 2000.

[17] M. Jagersand. Saliency maps and attention selection in scale and spatial coordinates: An information theoretic approach. In Proceedings of the 5th International Conference on Computer Vision, pages 195-202, 1995.

[18] A.K. Jain. Fundamentals of Digital Image Processing. Prentice Hall, 1989.

[19] R. Jain, R. Kasturi, and B.G. Schunck. Machine Vision. McGraw-Hill, Inc., 1995.

[20] B. Julesz and J.R. Bergen. Textons, the fundamental elements in preattentive vision and perception of textures. Bell System Technical Journal, 62:1619-1645, 1983.

[21] L.P. Kaelbling. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

[22] C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 1985.

[23] P. Lennie, P.W. Haake, and D.R. Williams. The design of chromatically opponent receptive fields. In M.S. Landy and J.A. Movshon, editors, Computational Models of Visual Processing, pages 71-82. MIT Press, 1991.

[24] S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms and empirical results. Machine Learning, 22:159-195, 1996.

[25] S. Mahadevan and J. Connell.
Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55(2-3):311-365, June 1992.

[26] D.C. Marr. Vision. Freeman, 1982.

[27] A.K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996.

[28] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696-710, 1997.

[29] B.A. Olshausen, D.C. Van Essen, and C.H. Anderson. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13:4700-4719, 1993.

[30] J.K. O'Regan and A. Levy-Schoen. Integrating visual information from successive fixations: Does trans-saccadic fusion exist? Vision Research, 23(8):765-768, 1983.

[31] O. Packer, A.E. Hendrickson, and C.A. Curcio. Photoreceptor topography of the adult pigtail macaque (Macaca nemestrina) retina. Journal of Comparative Neurology, 288:165-183, 1989.

[32] S.E. Palmer. Vision Science: Photons to Phenomenology. MIT Press, 1999.

[33] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley & Sons, 1994.

[34] R.P.N. Rao, G.J. Zelinsky, M.M. Hayhoe, and D.H. Ballard. Eye movements in visual cognition: A computational study. Technical Report 97.1, University of Rochester, 1997.

[35] K. Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372-422, 1998.

[36] D. Reisfeld, H. Wolfson, and Y. Yeshurun. Context free attentional operators: The generalized symmetry transform. International Journal of Computer Vision, 14:119-130, 1995.

[37] R.A. Rensink. Seeing, sensing and scrutinizing. Vision Research, 40:1469-1487, 2000.

[38] R.A. Rensink, J.K. O'Regan, and J.J. Clark. To see or not to see: The need for attention to perceive changes in scenes. Psychological Science, 8:368-373, 1997.

[39] R.D. Rimey and C.M. Brown. Controlling eye movements with hidden Markov models. International Journal of Computer Vision, 7(1):47-65, 1991.

[40] R.W. Rodieck. Quantitative analysis of cat retinal ganglion cell response to visual stimuli. Vision Research, 5:583-601, 1965.

[41] L. Shapiro and G. Stockman. Computer Vision. Addison-Wesley, 2000.

[42] G. Strang. Wavelets and dilation equations: A brief introduction. SIAM Review, 31:613-627, 1989.

[43] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998.

[44] M.J. Swain and D.H. Ballard. Object identification using color cues. Technical report, University of Rochester, NY, 1990.

[45] M.J. Swain and D.H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.

[46] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modelling visual attention via selective tuning. Artificial Intelligence, 78:507-545, 1995.

[47] A. Treisman. The role of attention in object perception. In Physical and Biological Processing of Images, 1983.

[48] A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12:97-136, 1980.

[49] L. Ungerleider and M. Mishkin. Two cortical visual systems. In Analysis of Visual Behavior, pages 549-585, 1982.

[50] C. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, 1989.

[51] A.B. Watson. Neural contrast sensitivity. In M.S. Landy and J.A. Movshon, editors, Computational Models of Visual Processing, pages 95-107. MIT Press, 1991.

[52] L.E. Wixson. Exploiting world structure to efficiently search for objects.
Technical report, University of Rochester, July 1992.

[53] J. Wolfe, K. Cave, and S. Franzel. Guided search: An alternative to the feature integration model for visual search. Journal of Experimental Psychology: Human Perception and Performance, 15:419-433, 1989.

[54] A.L. Yarbus. Eye Movements and Vision. Plenum, 1967.