This is to certify that the dissertation entitled HIERARCHICAL LEARNING AND PLANNING IN PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES presented by Georgios N. Theocharous has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science & Engineering.

HIERARCHICAL LEARNING AND PLANNING IN PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES

By Georgios Theocharous

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Computer Science & Engineering

2002

ABSTRACT

HIERARCHICAL LEARNING AND PLANNING IN PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES

By Georgios Theocharous

Sequential decision making under uncertainty is a fundamental problem in artificial intelligence. This problem is faced by autonomous agents embedded in complex environments (e.g., physical systems, including robots, softbots, and automated manufacturing systems) that need to choose actions to achieve long-term goals efficiently. Incomplete knowledge about the state of the environment, termed hidden state, and the uncertain effects of their actions make sequential decision making difficult.

Partially Observable Markov Decision Processes (POMDPs) are a general framework for sequential decision making in environments where states are hidden and actions are stochastic. A POMDP model represents the dynamics of the environment, such as the probabilistic outcomes of the actions and the probabilistic relationships between the agent's observations and the hidden states. Unfortunately, "learning" the dynamics of the environment, and planning, scale poorly as the application domains get larger.

In this dissertation, we propose and investigate a Hierarchical POMDP (HPOMDP) model to scale learning and planning to large partially observable environments. In scaling planning, the key ideas are spatial and temporal abstraction. In spatial abstraction, grouping lower-level states into abstract higher-level states reduces the uncertainty of the agent. In temporal abstraction, the agent uses longer-duration actions instead of just primitive one-step actions. Macro-actions are crucial, in that they produce long traces of experience, which help the agent disambiguate perceptually aliased situations. In scaling learning, the main advantage of hierarchy is that a multi-resolution problem representation has advantages in ease of maintenance, interpretability, and reusability. We investigate the new HPOMDP framework within the context of indoor robot navigation.

This dissertation makes the following principal contributions. We formally introduce and describe the HPOMDP model. We derive a hierarchical EM algorithm for learning the parameters of a given HPOMDP, and show empirical convergence.
We introduce two approximate training methods, "reuse-training" and "selective-training", which result in fast training and better learned models. We also derive two planning algorithms for approximating the optimal policy for a given HPOMDP. We conduct a detailed experimental study of the learning algorithms for indoor robot navigation, and show how robot localization improves with our learning methods. We also apply the planning algorithms to indoor robot navigation, both in simulation and in the real world. The results show that the planning algorithms are very successful in taking the robot to any environment state starting from no positional knowledge, and require significantly fewer steps than flat approaches.

© Copyright 2002 by Georgios Theocharous
All Rights Reserved

To My Parents

ACKNOWLEDGMENTS

First I would like to give many thanks to my advisor, Dr. Sridhar Mahadevan, for his support, encouragement, and patience. His breadth as well as depth in the field of AI, and his enthusiasm and vision, have been invaluable lessons to me that I hope to carry into my future career. I am also grateful to my colleague Khashayar Rohanimanesh for his help, especially during the early stages. I would also like to thank my committee members, Dr. Anil K. Jain, Dr. John J. Weng, Dr. John M. Henderson (from Psychology), and Dr. Sven Koenig (from Georgia Tech) for their advice and feedback. I would like to thank all the members of the AGENTS lab, the Embodied Intelligence lab (EI), the Pattern Recognition and Image Processing lab (PRIP), and the Sequential Information Gathering in Machines and Animals lab (SIGMA) for their interest and feedback on the work presented in this dissertation. My appreciation is also extended to the DARPA MARS grant and the NSF KDI grant awarded to Dr. Sridhar Mahadevan, which provided the financial support throughout my Ph.D. student years. Finally, my deepest gratitude goes to my parents, whose unconditional support throughout my student career has made this dissertation possible.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction
1.1 Sequential decision making under uncertainty
1.2 Planning
1.3 Learning
1.4 Hierarchical learning and planning
1.5 Robot navigation
1.6 Contributions
1.7 Outline

2 Background
2.1 Markov Decision Processes
2.2 Partially Observable Markov Decision Processes
2.2.1 Planning in POMDPs
2.2.2 Learning POMDPs
2.3 Robot navigation
2.4 Application of POMDPs to robot navigation

3 Hierarchical Partially Observable Markov Decision Processes
3.1 Hierarchical Hidden Markov Models
3.2 Hierarchical Partially Observable Markov Decision Process Models
3.2.1 Hierarchical Baum-Welch for HPOMDPs
3.2.2 Learning a small corridor environment
3.2.3 Converting HPOMDPs to POMDPs
3.3 Related multi-resolution Hidden Markov Models

4 The robot navigation domain
4.1 Robot platforms
4.2 The environment
4.3 Learning feature detectors
4.4 The navigation architecture

5 Learning hierarchical spatial models for robot navigation
5.1 Approximate learning methods
5.2 Maximum likelihood parameter estimation experiments
5.2.1 Simulated learning experiments
5.2.2 Learning experiments using the robot platforms
5.3 Inferring the structure of the environment
5.3.1 Learning the structure of a three level hierarchy
5.3.2 Learning the structure of a four level hierarchy
5.4 Conclusions

6 Hierarchical planning for robot navigation
6.1 Robot navigation in corridor environments
6.2 Related work
6.2.1 Hierarchical planning in MDPs
6.2.2 Approximate POMDP solutions
6.3 Planning and execution algorithms
6.3.1 Planning
6.3.2 Execution
6.4 Planning experiments using the robot platforms
6.4.1 Experiments with hand-coded models
6.4.2 Experiments with learned models
6.5 Conclusions

7 Conclusions and future work
7.1 Summary
7.2 Contributions
7.3 Future Directions
7.3.1 Faster training algorithms
7.3.2 Structure learning
7.3.3 Sharing substructure
7.3.4 Exact HPOMDP solutions

APPENDICES
A.1 Table of HPOMDP symbols

BIBLIOGRAPHY

LIST OF TABLES

3.1 The table describes the sensor models of the product states of the HPOMDP in Figure 3.2. For every product state and action there are four sensors, one for every side of the state (W, S, E, N). Each sensor model gives the probability of a wall P(w|s, a) and opening P(o|s, a) given the current state and last action taken.
3.2 The table shows an initial sensor model which has been created randomly.
3.3 The table shows the sensor model after training. The sensor models are almost the same as the original sensor parameters in Table 3.1.
5.1 The table shows the mean and average of the distance of the learned model from the original model. The distance was measured using 500 test sequences of length 40, which were produced by sampling the original model in the small environment in Figure 4.3. The bold numbers in the last column show the shortest distance for each experiment. The "reuse-training" method produces the best results.
The EM learning algorithm for HPOMDPs always produces better models than the EM learning algorithm for POMDPs.
5.2 The table shows the localization error which was computed using the 500 test sequences of length 40. The "reuse-training" method produces the best models, with an exception in experiment 2 where "selective-training" is slightly better. The EM learning algorithm for HPOMDPs always produces better models than the EM learning algorithm for POMDPs.
5.3 The table shows the training times for short training sequences and different training methods. All training times were calculated on a 900 MHz Athlon processor running Redhat Linux 7.1.
5.4 The table shows the localization error as to the true state of the robot before and after training, in the Nomad 200 simulator and in the real environment with the actual robot Pavlov.
5.5 The table shows the percentage of transition probabilities that do not belong to the correct structure before and after learning for the three level hierarchical model.
5.6 The table shows the percentage of transition probabilities that do not belong to the correct structure before and after learning, for the four level hierarchical models.
6.1 The table shows the average number of primitive steps, and the average number of decisions at the abstract level. The results are for the small environment in the Nomad 200 simulator.
6.2 The table shows the average number of primitive steps, and the average number of decisions at the abstract level. The results are for the large environment in the Nomad 200 simulator, and in the real world.
6.3 The table shows the planning times for the different environments and models. The planning times were computed on a 900 MHz Athlon processor.
6.4 The table shows the average number of steps to reach the goal, and average entropy per run after every training stage. Simply training the submodels provides an HPOMDP model that can be used successfully for robot navigation tasks. Retraining the already learned model with additional data from the navigation tasks improves robot localization.
A.1 Table of HPOMDP symbols

LIST OF FIGURES

1.1 The agent interacts with its environment by issuing actions. The environment uses the previous state and current action and transitions into a new state according to the transition function T. The state generates an observation z and a reward signal r. The agent uses z and r to issue the next action. The agent chooses actions according to its behavior B, which maps perceptions to actions, such that its long-term reward is maximized.
1.2 The figure shows a perceptually aliased situation where locations J1 and J2 generate the same observation (T-junction). These locations can be disambiguated if the robot uses memory of past observations and actions. For example, if the robot has perceived a dead-end (J3) and then moved forward where it perceived a T-junction (J1 or J2), then it knows it is at location J2.
1.3 POMDPs can be thought of as an extension of HMMs with the addition of a separate transition matrix for each available action. In the same manner HHMMs are extended to HPOMDPs.
In these example decision models there are two actions available, one denoted with solid arrows, and the other denoted with dashed arrows.
1.4 The picture on the left shows a section from a corridor on the 3rd floor of the Engineering Building at MSU. The picture on the right shows a real robot named Pavlov (a Nomad 200). Pavlov uses its sonar sensors to perceive walls and openings in the North, West and South directions. After every perception, Pavlov takes actions such as go-forward, turn-left, and turn-right to veer toward a goal location. Navigation in such environments is challenging due to noisy sensors and actuators, long corridors, and the fact that many locations in the environment generate similar sensor values.
1.5 A corridor environment modeled as a flat POMDP. The circles are the hidden states of the model and represent the pose of the robot (location and orientation) in the environment. The figure shows the transition matrix for the three available actions for state S1, and also the observation model associated with state S1.
1.6 The figure shows a map of the third floor of the MSU Engineering Building. The squares indicate two square meter locations in the environment. The shading represents the probability distribution of possible locations, and darker means higher values. Pavlov's task is to be able to get to any one of the squares in the map starting from any other square. A problem that may arise during navigation is that the robot's belief as to its true location may be multimodal. This could happen either because the starting location of the robot was unknown, or due to sensor and motion noise. For example, due to sensor and motion noise at junction J1, the robot was unable to distinguish the east from the south corridor.
1.7 The figure illustrates a multi-resolution spatial model. The rectangles indicate abstract states representing corridors and junctions. The circles in the rectangles indicate fine resolution states which are children of the rectangle abstract states.
1.8 Topological map construction can either be learned or provided. The topological map is compiled into an HPOMDP representation which can be used for navigation tasks through planning and execution. Navigation tasks produce sequences of observations and actions that can be used in an EM training procedure to improve the parameters of the HPOMDP model.
2.1 An MDP modeling a simple navigation task. There are two actions available, east and west, which succeed with probability 0.9. If they fail the robot moves in the opposite direction, or stays in the same state if the opposite direction is a wall. The reward function is structured such that the robot gets a reward of +1 for actions leading into state 3. Otherwise the robot gets a 0 reward, except for a negative -1 reward for actions leading into the wall. All four states are completely observable since they generate unique observation symbols.
2.2 The Value Iteration algorithm is a dynamic programming solution to the non-linear set of equations defined by Equation 2.7. The algorithm works in an iterative fashion by computing the discounted finite-horizon optimal value functions as the horizon keeps increasing.
2.3 A POMDP agent is made of two main components.
The state estimator module receives observations from the environment, the last action taken and the previous belief state, and produces an updated belief state. The policy module maps a belief state to an action.
2.4 In this simple POMDP states 1, 2, and 4 generate the same observation while state 3 generates a different observation. The agent can take actions east and west which succeed with probability 0.9. If they fail, the agent moves in the opposite direction. If a movement is not possible the agent stays in the same state.
2.5 The policy tree indicates the non-stationary t-step policy to be executed for a POMDP. First the action a defined by the root is executed, and then according to the observation z_i perceived the (t-1)-steps-to-go tree is chosen.
2.6 The figure shows an example of an optimal t-step value function which is the upper surface of the value functions associated with all t-step policy trees.
2.7 The figure shows a mapping from the belief space to the appropriate action defined by the root of a t-step policy tree. The mapping is calculated by projecting the optimal value function onto the belief space.
2.8 The figure shows a landmark based topological map representing a corridor environment. Each node in the topology represents a unique landmark that can be identified by the robot. Each node also contains information as to the distance and orientation to the next node.
2.9 The figure shows an occupancy grid map that models an indoor office environment and was taken from [1].
2.10 The figure shows a POMDP model that has multiple chains for each hallway. The GROW-BW algorithm adds longer chains as needed so as to find the length that fits the data the best. The figure was taken from [2].
3.1 An example hierarchical HMM modeling two adjacent corridors. Only leaf (production) states (s4, s5, s6, s7, and s8) have associated observations. The end-states are e1, e2, and e3.
3.2 An example hierarchical POMDP with two primitive actions, "go-left" indicated with the dotted arrows and "go-right" indicated with the dashed arrows. This HPOMDP has two abstract states s1 and s2, and each abstract state has two entry and two exit states. The product states s4, s5, s6, s9, and s10 are associated with sensor models which are described in Table 3.1.
3.3 This is the Baum-Welch algorithm for learning the parameters of an HPOMDP model. The algorithm increases the log likelihood of the observation sequences given the action sequences and current model parameters after every training epoch.
3.4 The α variable gives the probability of the observation sequence z_t...z_{t+k} ending in state s at time t+k (denoted as a bold circle), given that actions a_t...a_{t+k} were taken and the parent of s, p(s), was started at time t from entry state p(s)_n (also denoted as a bold circle).
3.5 Equation 3.3 computes the probability that at time t the parent state of product state s, p(s), produces a single observation z_t. The observation is produced by a vertical activation into product state s from entry state p(s)_n.
3.6 Equation 3.4 gives us the probability that the parent of the product state s, p(s), produced observations from time t to time t+k (k >= 1) and that the last observation was produced by product state s. The preceding state is either the exit state of some abstract state or a product state.
3.7 The figure illustrates Equation 3.5, which is used for situations where the parent of state s, p(s), starting from entry state p(s)_n, produced observations from t to t+k (where k >= 0), and at time t+k abstract state s was exited from its exit state. We need to sum over all lengths of observation sequences that could have been produced by abstract state s. That is, s could have produced all observations z_t...z_{t+k}, or in the other extreme z_t...z_{t+k-1} could have been produced by the adjacent states of s, and only the last observation z_{t+k} produced by s.
3.8 Each square is a new α computation for a subsequence t+k for some internal state. The dashed arrows show dependencies on previous computations.
3.9 The β variable (Equation 3.7) denotes the probability that a state s was entered at time t from entry state s_n, and that the observations z_t...z_{t+k} were produced by the parent of s, which is p(s), and actions a_t...a_{t+k} were taken, and p(s) terminated at time t+k from exit state p(s)_x.
3.10 The figure illustrates Equation 3.8, which computes the probability that at time t the product state s produced observation z_t and then the system exited the abstract state p(s) from exit state p(s)_x.
3.11 Equation 3.10 computes the probability that at time t any of the children of abstract state s has produced observation z_t and then the parent of s, p(s), was exited from exit state p(s)_x.
3.12 The figure illustrates Equation 3.12, which defines the probability that at time t the product state s produced observation z_t and its parent p(s) continued to produce observations z_{t+1}...z_{t+k} and exited at time t+k from exit state p(s)_x, where k >= 1 (so at least two observations are produced).
3.13 The figure describes Equation 3.13, which is the probability that at time t the abstract state s, starting from entry state s_n, produced some observations, and then the parent of s, p(s), continued to produce the rest of the observations and exited at time t+k from exit state p(s)_x. Since s is an abstract state we have to consider all possible lengths of observations that could have been produced by abstract state s. So s could have produced a single observation z_t, up to all the observations z_t...z_{t+k}, where k >= 1.
3.14 Each square is a new β computation for a subsequence t+k for some internal state. The dashed arrows show dependencies on previous computations.
3.15 The ξ variable denotes the probability of making a horizontal transition from exit state s_x to entry state s'_n at time t, given a sequence of observations and actions and the model parameters.
3.16 The η_in variable defines the probability that a state s is entered at time t either vertically or horizontally, given a sequence of actions and observations up to time t-1.
3.17 The η_out variable defines the probability that a state s exited at time t and observations z_{t+1}...z_T were perceived and actions a_t,...,a_T taken.
3.18 The χ variable defines the probability that entry state s_n of child s was activated by the entry state p(s)_n of its parent p(s) at time t, given the action and observation sequence.
3.19 This is an HPOMDP model where the original transitions have been replaced by random values for training.
3.20 The random HPOMDP model, after training, is more similar to the original model (Figure 3.2).
3.21 Training runs of three different random models show that the log likelihood converges fast to the original model. The original model does not improve, as expected, since it was used to create the training data. In the log likelihood function, Z is the set of observation sequences, A is the set of action sequences, and Model is the set of parameters that are improved every training epoch (horizontal matrices, vertical vectors, and observation models).
3.22 This is the equivalent flat POMDP of the hierarchical POMDP shown in Figure 3.2.
3.23 For an observation sequence of length 2 the hierarchical likelihood will be zero, since we need at least three observations to exit the abstract state S, while the likelihood in the equivalent flat model will not be zero.
4.1 This is PAVLOV, a Nomad 200. Pavlov is equipped with 16 sonar sensors, 16 infrared sensors, two rings of bumper sensors (with a total of 16 switches), an arm, and a camera. Pavlov has a 133 MHz Pentium onboard processor and an external Pentium II 400 MHz laptop.
4.2 This figure illustrates the Nomad 200 simulator. The main window shows a section of the environment and the simulated robot. The two windows on the right show infrared and sonar sensor values at the current location.
4.3 The figure shows an approximate model of the 3rd floor of the engineering building that is used with the Nomad 200 simulator. The number next to the edges is the distance between the nodes in meters.
4.4 The figure shows the real 3rd floor of the engineering building. This environment is used for experiments both in the Nomad 200 simulator and with the real robot Pavlov on the actual physical floor. The numbers next to the edges denote the distance in meters between the nodes. The nodes represent junction locations in the environment.
4.5 This is the equivalent flat model of the HPOMDP in Figure 4.6 for the forward action.
4.6 This is a sample three-dimensional view of the HPOMDP model for the forward action. The solid bold arrows represent the vertical transition matrices. The dashed arrows represent the horizontal transitions. The bold circles inside the abstract states represent entry-states and the empty circles represent exit-states. In this representation all the children of the root are abstract states. Product states are the leaves of the model and represent two meter locations in the environment. The arrows inside a product state represent the orientation of the product state. For each product state there is an observation model which was previously described in Chapter 3.
4.7 A local occupancy grid map, which is decomposed into four overlapping quadrants (left, right, top, bottom), each of which is input to a neural net feature detector.
The output of the net is a multi-class label estimating the likelihood of each possible observation (opening, wall). The net is trained on manually labeled real data.
4.8 Learning curves for training the neural net to recognize features. The nets are trained with 736 examples in an 8-fold cross-validation fashion using quickprop.
4.9 Sample local occupancy grids generated over an actual run, with observations output by the trained neural net. Despite significant sensor noise, the net is able to produce fairly reliable observations.
4.10 The figure shows the overall navigation architecture. The modules inside the dashed box are used during actual navigation while the modules outside are used off-line.
5.1 When a short training sequence is collected in state S5, the probability that this sequence occurred in S1 will be the same as the probability that it occurred in S5, since S1 and S5 are similar. When a long training sequence is obtained from S5 to S3, then the probability that this sequence was started from S1 will be much smaller than the probability that it started in S5, since the intermediate junction nodes S4 and S2 have different observations. Therefore, for short sequences we provide the starting state.
5.2 An example HPOMDP model for which we illustrate our approximate training methods. For example, in the reuse-training method we train separately each abstract state child of the root (S1, S2, S3) as shown in Figure 5.3.
5.3 Here the submodels of the HPOMDP in Figure 5.2 are trained as separate HPOMDP models. This is the reuse-training method, since the submodels can be used in the complete model in Figure 5.2. In this particular example, we don't need to retrain the whole model in Figure 5.2 (after training separately the various submodels), since the connections between the abstract states are deterministic.
5.4 In the "selective-training" method only selected parts of the model are trained for each training sequence. Here two data sequences define two valid selections of the whole model, which are shown in bold lines.
5.5 The graph describes the fit to the training data during experiment 1. The vertical axis represents the log likelihood over all 738 training sequences. The "reuse-training" method converges faster than all other methods. The EM learning algorithm for POMDPs fits the training data slightly better than the hierarchical methods.
5.6 The graph describes the fit to the training data during experiment 2. The fit to the training data is similar to experiment 1, even though training started with a better initial model.
5.7 The graph describes the fit to the training data during experiment 3. The "reuse-training" method clearly converges faster than the other training methods.
5.8 The graph describes the fit to the training data during experiment 4. The fit to the data is better than experiment 6 (see Figure 5.10). This is because data is allowed to start from all states. Nevertheless, navigation performance is worse than experiment 6 according to Table 5.2.
5.9 The graph describes the fit to the training data during experiment 5. The POMDP model fits the training data the best, while the "reuse-training" method converges faster than all other methods.
5.10 The graph describes the fit to the training data during experiment 6. Even though the fit to the data is not as good as in experiment 4 (see Figure 5.8), navigation performance was better. The EM learning algorithm for POMDPs fits the training data the best.
5.11 Assume that an observation sequence of length 5 is started from entry state N1 and that, in the correct topology of the environment, the transition probability X1 -> N3 is zero. In the hierarchical model the topology will be learned correctly, since in order for the transition X1 -> N3 to be non-zero we need an observation sequence of at least length 6. In the flat model the topology will not be learned correctly, since the corresponding transition probability will not be zero. The hierarchical algorithm is able to better discover the relationships between groups of states and has the potential for inferring the structure at the abstract level.
5.12 The bold arrow indicates the correct structure for the particular south exit point of the junction abstract state. The dotted arrows indicate possible adjacent nodes. We chose adjacent nodes to be corridors of different size from the correct corridor.
5.13 The figure shows the transition probabilities that belong to the correct structure before training for "model-1". The probabilities shown are the ones corresponding to the HPOMDP model when converted to flat.
5.14 The figure shows the transition probabilities that belong to the correct structure after training for "model-1", using the "selective-training" method. The probabilities shown are the ones corresponding to the HPOMDP model when converted to flat.
5.15 The figure shows how "model-1" fits the data over every training epoch, and for the different training methods. The POMDP model fits the training data better than the HPOMDP models.
5.16 The figure shows how "model-2" fits the data over every training epoch, and for the different training methods. The POMDP model fits the data better than the HPOMDP models.
5.17 This is a four level representation of the environment where entire sections of the building were grouped into abstract states at a higher level. The three large dashed polygons represent the abstraction at level 2, while the smaller rectangles inside the polygons represent the spatial abstraction at level 3. The bold arrows indicate the correct structure at level 2 and the numbers next to the arrows indicate the transition probabilities before any learning for model "4level-2". The probabilities shown are the ones that belong to the correct structure when the model is converted to flat.
5.18 The arrows show the transition probabilities of the correct structure, after training model "4level-2" with the EM learning algorithm for POMDPs.
5.19 The arrows show the transition probabilities of the correct structure, after training model "4level-2" with the EM learning algorithm for HPOMDPs.
5.20 The graph shows the fit to the data during training of model "4level-1".
5.21 The graph shows the fit to training data during training of model "4level-2".
In the graph, the EM algorithm for learning HPOMDPs and the EM algorithm for learning POMDPs fit the data better than the plots in Figure 5.20, despite the fact that the models here are more ergodic.
6.1 This HPOMDP model is used to represent corridor environments. The dotted arrows from product states show non-zero transition probabilities for primitive actions. The black vertical arrows show non-zero vertical activations of product states from the entry states. The dashed arrows from exit states show non-zero transition probabilities from exit-states to adjacent states of the abstract state they are associated with.
6.2 This is the equivalent flat POMDP model of the hierarchical POMDP model shown in Figure 6.1. In this model we only represent the product states of the hierarchical model. Global transition matrices for each action are automatically computed from the hierarchical model by considering all the possible paths between the product states under the primitive actions.
6.3 This is a robot run on the Nomad 200 simulator, which was used to collect observations for the entropy plot in Figure 6.4.
6.4 The plot shows the normalized entropies of the belief of the robot's location at the global product level and at the abstract level. The initial belief was uniform and observations were collected along the trace shown in Figure 6.3.
6.5 The figure shows two macro-actions for abstract state S. The letter F means go-forward, L turn-left, and R turn-right. The solid arrows define the transition matrix of the forward action, the dashed arrows define the transition matrix of the turn-left action, and the dotted arrows define the transition matrix of the turn-right action. The policy of exit-states is always the same since exit-states can only be entered with a single type of primitive action. The two macro-actions differ only in the policy of state S4. Knowing the policy of exit-states allows us to know the outcome of the macro-actions at the level above.
6.6 When an exit state is entered through another exit state, the two exit states have the same policy.
6.7 F means go forward, R means turn right and L means turn left. The top figure is a macro-action for exiting the east side of the corridor under the "go-forward" action, the middle figure is a macro-action for exiting the west side of the corridor with the go-forward action, and the bottom figure is a macro-action for reaching a particular location within the corridor. The solid arrows represent the transition matrix for the go-forward action, the dashed arrows represent the transition matrix for the turn-left action, and the dotted arrows represent the transition matrix for the turn-right action.
6.8 The figure shows the operation of Equation 6.12. The dashed arrow is the discounted transition probability for reaching the target state from state i. The solid arrows are simply the transition probabilities for the forward action.
6.9 S1, S2, S3, and S4 cannot be activated vertically from states at the abstract level. Therefore, if the agent were to start in any of these states we don't know what is the best macro-action to initiate.
6.10 The value of a state s under a macro-action μ is computed by taking into consideration the total reward and discounted transition probability for exiting the parent of s, p(s), from all exit states s_x, and the values of the resulting adjacent states to p(s), such as s'. Equation 6.18 defines the value of a state as the maximum value among all macro-actions μ that belong to the set of macro-actions M^{p(s)} available for the parent of s, p(s).
6.11 The figure describes in detail our hierarchical-MLS execution algorithm. The algorithm initiates macro-actions by recursively finding the most likely state from root to leaves. Once a macro-action is initiated it runs until one of the ancestors of the state that the macro-action was initiated from is no longer the most likely state.
6.12 The figure shows an example HMLS execution. The bold solid arrows indicate the macro-actions while the bold dashed arrows indicate a new macro-action initiation. The beginning of a bold dashed arrow indicates the state at which we recursively start choosing the most likely state until a product state is reached, at which point we can decide the appropriate next macro-action. Starting from the root we recursively choose the most likely state, which ends up being S3. We then begin to execute the appropriate macro-action from S3 until one of its ancestors is no longer the most likely state (among its adjacent states). This happens at time T5, when S5 is the most likely state among the children of S1. Again, we recursively find the most likely product state starting from S5, which ends up being S6. We begin to execute the appropriate macro-action until time T7, at which point the most likely state among the children of the root has changed to S8. We then again recursively find the most likely state starting from state S8, which ends up being state S10. The appropriate macro-action begins to execute until time T9, at which the most likely state among the children of S8 changes to state S12. Again, we find recursively the most likely state starting from S12, which is S13, and execute the appropriate macro-action.
6.13 Each type of bold arrow represents a different macro-action that can be initiated from a product state. The H-QMDP heuristic multiplies the value of executing every macro-action from a product state with the current belief of the product state. The best macro-action to be executed is the one for which the sum of all the products of values and beliefs is the maximum. Note that choosing a macro-action also defines the abstract state to be executed under, since every macro-action is unique to a particular abstract state. In this example we have assumed that we only have one macro-action for each abstract state, but in our navigation models we have multiple macro-actions for each abstract state.
7.1 The top part shows a corridor environment and its flat POMDP representation. The bottom part shows an extended HPOMDP representation where abstract states share submodels. This future direction has the potential of scaling the algorithm to even larger domains, and applications to problems that have repetitive structure such as the task of reading.
Chapter 1

Introduction

Sequential decision making under uncertainty is a fundamental problem both for artificial intelligence agents and biological organisms (from insects to humans). It is the problem faced by entities embedded in complex environments that need to choose actions to achieve long-term goals efficiently. Incomplete knowledge about the state of the environment, and uncertainty of the action outcomes, make this problem difficult. Partially Observable Markov Decision Processes (POMDPs) are a general framework for sequential decision making in environments where states are hidden and actions are stochastic. Unfortunately, learning and planning in POMDPs scale poorly as the application domains get larger.

In this dissertation, we propose and investigate a Hierarchical POMDP (HPOMDP) model to scale learning and planning to large partially observable environments. In scaling planning, the key ideas are spatial and temporal abstraction, which reduce the agent's uncertainty. In scaling learning, the main advantage of hierarchy is that a multi-resolution problem representation has advantages in ease of maintenance, interpretability, and reusability. We investigate the new HPOMDP framework within the context of indoor robot navigation.

In this chapter, in Section 1.1, we define the problem of sequential decision making under uncertainty and provide a generic model of the problem. In Section 1.2 we describe the solution to the sequential decision problem under uncertainty. A solution to this problem can be computed through the rigorous mathematical framework of Markov decision processes. In Section 1.3 we describe the notion of learning partially observable Markov decision processes, which is achieved through an Expectation Maximization (EM) algorithm. EM type algorithms iteratively modify the parameters of the model such that the log likelihood of the training data is increased after every iteration. In Section 1.4 we motivate our hierarchical approach, explain the key ideas behind our proposed solution, and briefly describe the derivation of our hierarchical POMDP model. In Section 1.5 we present an important instance of sequential decision making under uncertainty, which occurs in robot navigation. The robot navigation domain is used throughout the main chapters of the thesis to demonstrate our hierarchical POMDP approach and its advantages over flat POMDP approaches. In Section 1.6 we list the principal contributions made in this dissertation. In the final Section 1.7, we describe the presentation structure for the rest of the dissertation.

1.1 Sequential decision making under uncertainty

Sequential decision making under uncertainty is a fundamental problem in artificial intelligence. This problem is faced by autonomous agents embedded in complex environments that need to choose actions to achieve long-term goals efficiently. It is also a core problem in many biological systems, examples of which include navigation, medical diagnosis, driving, and general everyday activity (from going to get lunch, to going to graduate school). For example, in a medical diagnosis task, the physician (agent) cannot directly observe the internal state (hidden state) of the patient (environment) but rather perceives the symptoms (observations). Yet, after observing every symptom he has to issue a treatment (action), which in turn produces a new symptom.
Treatments can produce side-effects because they can change the internal state of the patient in an unexpected manner. The goal of the physician is to come up with a treatment plan that will "heal" the patient (i.e., cause a transition in the patient's "state" from being "ill" to being "healthy"). The need to decide what is the best treatment that should be issued now is an instance of the problem of sequential decision making under uncertainty.

Formally, the problem of sequential decision making under uncertainty is visualized in Figure 1.1, which depicts a generic embedded agent interacting with its environment. The agent is the decision maker, while everything outside the agent that the agent interacts with is the environment. At each point in time the agent chooses the next action a according to its behavior B. The behavior B is a function that maps the agent's perception z into an action. The transition function T of the environment changes stochastically the state of the environment based on the current state s and current action a. The new state of the environment depends only on the current state and action pair, and not on previous state-action pairs. The assumption that knowledge of the current state and action defines the next state is typically referred to as the Markov assumption. The new state of the environment produces a numerical reward signal r, which is communicated to the agent through the reward function R, and a perception z, which is communicated to the agent through the observation function O. The agent chooses actions such that its long-term reward is maximized. For some environments, O is the identity function; that is, the agent has access to the true state of the environment. In real world domains, however, the state of the environment is only partially observable to the agent.

Figure 1.1: The agent interacts with its environment by issuing actions. The environment uses the previous state and current action and transitions into a new state according to the transition function T. The state generates an observation z and a reward signal r. The agent uses z and r to issue the next action. The agent chooses actions according to its behavior B, which maps perceptions to actions, such that its long-term reward is maximized.

1.2 Planning

A widely adopted framework for modeling the sequential decision making problem under uncertainty is the Markov Decision Process (MDP) framework [3], [4]. MDPs model the states of the environment, the transition function T, and the reward function R, and assume that the observation function O is the identity. An agent's task is to compute a policy (also referred to as a plan) π, which maps every state of the environment to an action through interaction with its environment, such that the long-term reward is maximized. The policy can be computed either through dynamic programming, if R and T are known [5], or through reinforcement learning, if R and T are unknown [3]. A detailed description of the MDP framework is given later in Section 2.1.

A more general model, which does not assume that the observation function O is the identity, is the partially observable Markov decision process model. A POMDP model represents the dynamics of the environment, such as the probabilistic outcomes of the actions (the transition function T), the reward function R, and the probabilistic relationships between the agent's observations and the states of the environment (the observation function O).
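To make these components concrete, the following minimal sketch (with hypothetical sizes and numbers, not taken from the dissertation) shows one common way of storing the transition function T, the reward function R, and the observation function O of a small finite POMDP as arrays.

```python
import numpy as np

# Minimal illustrative sketch: a tiny POMDP with |S| = 4 hidden states,
# |A| = 2 actions, and |Z| = 2 observations. All numbers are placeholders.
num_states, num_actions, num_obs = 4, 2, 2

# T[a, s, s'] = P(s' | s, a): one row-stochastic matrix per action.
T = np.full((num_actions, num_states, num_states), 1.0 / num_states)

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.zeros((num_states, num_actions))
R[3, :] = 1.0  # e.g., actions taken in state 3 are rewarded

# O[a, s', z] = P(z | s', a): probability of perceiving z after action a
# lands the agent in state s'.
O = np.full((num_actions, num_states, num_obs), 1.0 / num_obs)

# An MDP is the special case where the state is observed directly, so the
# observation function adds no uncertainty.
```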
In a POMDP model the states of the environment are typically referred to as "hidden states", since they are not directly observable by the agent. In a POMDP model the process is no longer Markovian, because an agent cannot predict the next perception based on the current perception and action. For an agent to be able to compute policies in POMDP models, using the MDP formalism, we need to create a new type of state representation which is completely observable and adheres to the Markov assumption. Intuitively, such a state representation can be formulated from the history of past observations and actions. History, or memory of past observations and actions, allows biological and artificial agents to disambiguate perceptually aliased situations (see Figure 1.2).

Figure 1.2: The figure shows a perceptually aliased situation where locations J1 and J2 generate the same observation (T-junction). These locations can be disambiguated if the robot uses memory of past observations and actions. For example, if the robot has perceived a dead-end (J3) and then moved forward where it perceived a T-junction (J1 or J2), then it knows it is at location J2.

In POMDPs memory is represented as a probability distribution over the hidden states of the environment. A probability distribution over the hidden states of the environment (otherwise called a belief state) is a sufficient information state, in that it implicitly represents a unique history of observations and actions [6]. This type of MDP is called a belief MDP and complies with the Markov assumption, in that having information about previous belief states cannot improve the choice of action [6]. The planning problem now becomes one of finding mappings from belief states to actions. Using belief states as a new type of state representation, however, does not scale well for large tasks with long planning horizons, since the number of belief states is infinite, which makes the problem computationally intractable [7], [8].

1.3 Learning

Learning a POMDP model is also difficult because an agent needs to learn the transition and observation functions of the model without true knowledge of where actions were taken and observations perceived. For example, in the robot navigation domain the robot has to learn the probability of perceiving each one of its available observations for every hidden state of the environment, and also the effects of its actions on the hidden states of the environment. This however is difficult, since the robot only maintains a probability distribution over the hidden states of the environment each time an action is taken and an observation perceived.

A general algorithm for learning the observation and transition functions of a POMDP is the Baum-Welch algorithm [9], [10], [11], which is an Expectation Maximization (EM) algorithm [12]. The EM procedure iterates between two steps. The expectation step (E-step) fixes the current parameters λ_i = (T, O) and computes the posterior probabilities over the hidden states S. The maximization step (M-step) maximizes the expected log likelihood of the perceptions Z as a function of the parameters, and produces a new set of parameters λ_{i+1}. After every iteration, log P(Z | λ_{i+1}) >= log P(Z | λ_i). Later in Section 2.2.2 we present a derivation of the EM algorithm and briefly sketch its convergence proof.
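The overall shape of this procedure can be sketched as follows. This is only a schematic outline: the e_step and m_step callables are hypothetical placeholders for the POMDP-specific forward-backward computations and re-estimation formulas discussed in Chapter 2, not the dissertation's actual implementation.

```python
def em_fit(e_step, m_step, params, max_iters=100, tol=1e-6):
    """Generic EM driver.

    e_step(params) -> (posteriors, log_likelihood): with the parameters fixed,
        infer the hidden-state posteriors (for a POMDP, a Baum-Welch style
        forward-backward pass over the observation/action sequences, which the
        caller supplies as a closure over the training data).
    m_step(posteriors) -> params: re-estimate the parameters (e.g., T and O)
        so that the expected log likelihood is maximized.
    """
    prev_ll = float("-inf")
    for _ in range(max_iters):
        posteriors, ll = e_step(params)   # E-step: fix params, infer hidden states
        params = m_step(posteriors)       # M-step: maximize expected log likelihood
        if ll - prev_ll < tol:            # log P(Z | new) >= log P(Z | old)
            break
        prev_ll = ll
    return params
```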
We also describe the Baum-Welch algorithm, which is an adaptation of the EM algorithm and is traditionally used in learning the parameters of a POMDP model [11].

1.4 Hierarchical learning and planning

In this dissertation, we explore how to scale the solution to large sequential decision making problems through hierarchical representations. Hierarchical learning and planning is well motivated from biology and occurs in many real world situations. Honeybees, for example, use different low resolution features when they are traveling either to the hive or to the food (such as mountains and trees), and fine resolution features when they are close to their goal (such as rocks and flowers) [13]. Humans also do not seem to be modeling the world at a uniform resolution [14]. In fact, humans use a hierarchical representation of spatial knowledge which is acquired through different stages (from sensory motor association to topological maps) [15], [16]. For example, if a human is about to drive from one city to the next, he does not plan based on joint movements of his arms and legs, but rather forms a plan based on a more abstract spatial resolution such as the highways or exits to take.

Hierarchical representations of sequential decision making tasks, where the actions are stochastic and the states are hidden, have the potential of significantly reducing uncertainty. There are two intuitive reasons why uncertainty would be reduced. First, uncertainty reduces at higher levels of abstraction. An intuitive example is human representation of space, where we usually feel more certain as to our world location at a coarse resolution of space than at a finer resolution. The second reason has to do with temporal abstraction, where an agent can execute multi-step actions. Macro-actions are crucial in reducing uncertainty in that they enable an agent to acquire long traces of experience from its environment, which help the agent to disambiguate perceptually aliased situations. For example, if we are lost in some unfamiliar environment, we can choose to follow a particular compass direction until we localize.

From an engineering point of view, organizing a problem hierarchically has several advantages, including interpretability, ease of maintenance, and reusability. Reusability can be an important factor in solving large scale sequential decision making problems. In learning, we can train separately the different sub-hierarchies of the complete model and reuse them in larger complete hierarchical models. Or, if different sub-hierarchies have the same dynamics, we could simply train one of them and reuse it for all the others. In planning, we can construct abstract actions available to each abstract state that can be reused for any global plan without the need to worry about the details inside the abstract states. For example, if we are given abstract actions to operate in any US city, and any US highway, then given any global destination city we can construct a plan to reach the destination city from any other city by simply using the precomputed abstract actions. Another form of reusing abstract actions would be to precompute abstract actions that can be used in every city (e.g., use the subway).

The cornerstone of our approach to hierarchical sequential decision making under uncertainty is the Hierarchical Hidden Markov Model (HHMM) framework [17].
We extend the HHMM model to a new model called Hierarchical POMDP (HPOMDP), which includes actions and rewards. POMDPs can also be viewed as extending the well-known hidden Markov model to include actions and rewards [10]. Figure 1.3 depicts the relationship between HMMs/POMDPs and HHMMs/HPOMDPs.

Figure 1.3: POMDPs can be thought of as an extension of HMMs with the addition of a separate transition matrix for each available action. In the same manner HHMMs are extended to HPOMDPs. In these example decision models there are two actions available, one denoted with solid arrows, and the other denoted with dashed arrows.

The HHMM/HPOMDP models differ from flat models in that they have abstract states which can activate children states through vertical transition matrices. Abstract states do not directly emit observations but rather produce sequences of observations which are output by their children states. Only primitive states, which are the leaves of the tree, directly emit observations. Horizontal transitions from an abstract state are not allowed to occur until the process under the abstract state has finished (the small white circles inside the abstract states have been entered, see Figure 1.3). A detailed description of the HPOMDP model is presented later in Section 3.2.

1.5 Robot navigation

In this dissertation the HPOMDP model is applied to learning and planning in the context of robot navigation, which is a classic and difficult instance of the problem of sequential decision making under uncertainty. A robot agent acts in physical real world environments where it can sense the world through sensors such as sonar, infrared, laser, and camera, and acts upon the environment through effectors such as motion movements (wheel motors), camera movements (pan-tilt), and arm actions (picking up objects). Figure 1.4 shows a real robot Pavlov (a Nomad 200), which senses its environment (the corridors of the MSU Engineering Building) through sonar sensors and produces perceptions such as wall and opening. After every perception, Pavlov can take actions such as "go-forward", "turn-left", and "turn-right". Pavlov's task is to be able to navigate to any location in the engineering building shown in Figure 1.6.

The major problem for Pavlov, however, is the determination of its exact location in its environment. Unfortunately, due to noisy sensor readings, and the fact that many locations in the environment produce similar sensor values, no amount of image processing of sensor input is guaranteed to yield unique features describing its location. This realization is quite obvious from the corridor picture in Figure 1.4, where many locations in the corridor have similar visible features. Not only does a robot agent have trouble identifying its true location, but sometimes a human has trouble too.

Figure 1.4: The picture on the left shows a section from a corridor on the 3rd floor of the Engineering Building at MSU. The picture on the right shows a real robot named Pavlov (a Nomad 200). Pavlov uses its sonar sensors to perceive walls and openings in the North, West and South directions. After every perception, Pavlov takes actions such as go-forward, turn-left, and turn-right to veer toward a goal location.
Navigation in such environments is challenging due to noisy sensors and actuators, long corridors, and the fact that many locations in the environment generate similar sensor values. In order for Pavlov to be able to navigate to any location in the building it has to solve three problems. First, it has to learn a representation of the environment, such as a map. Second, it needs to be able to construct plans which associate actions with map locations for every navigation task. And third, it has to be able to execute every navigation task successfully. A map representation could be a topological graph of the environment, where vertices are locations in the environment and edges are the outcomes of the actions. A topological representation can either be given, or can be learned through robot experience [18]. Given the graph representation, a plan would then be the shortest path from every vertex to the goal state, which can easily be computed using a shortest path algorithm such as Dijkstra's algorithm [19].

There are, however, two problems with this straightforward graph-theoretic solution: sensing and action uncertainty. In order to execute such a plan, the robot needs to know exactly which state it is in, so that it can take the action associated with that state. One way to know the exact state of the robot is to start from an initial location and keep track of odometry. Unfortunately, keeping track of odometry (path integration) is difficult due to wheel slippage, noisy wheel encoders, and the fact that the robot may not be traveling in a straight line due to dynamic obstacles. What's more, path integration requires that we give the robot its starting position. An alternative is to associate each state with a unique landmark that can be perceived by the robot's sensors. Still, landmark based navigation is extremely prone to errors due to noisy sensors, noisy perception, and the fact that many locations in the environment generate the same observation (perceptual aliasing) [20]. We need a framework that can incorporate sequences of observations as memory since the last distinct landmark. In biology, bees and ants seem to combine both path integration and landmark based navigation to produce robust navigation behavior [21], [13].

The partially observable Markov decision process framework has been proven to be a reliable approach to robot navigation, which seamlessly integrates both motion and sensor reports [22], [23], [24]. Unlike pure odometric and landmark based systems, it also incorporates the uncertainty of sensors and actuators. In POMDP based navigation systems the robot does not maintain a single estimate of its current location, but rather a probability distribution over all possible locations (hidden states) in the environment. A Bayesian update routine is used to update this probability distribution after every motion and sensor report. Thus, the robot is never completely lost because it always maintains a belief about its true location.

An example POMDP representation for robot navigation is shown in Figure 1.5, where the states correspond to fixed regions (e.g., each square meter in the environment is modeled as four states of the POMDP, one for each robot orientation).
The robot can take non-deterministic actions such as "go-forward", "turn-left", and "turn-right", and in each state the robot can perceive features such as wall and opening on the front, left, back and right side.

The advantage of the POMDP framework is that it provides rigorous algorithms for learning the transition matrices and observation models, and for computing mappings from belief states to actions. Modeling a real office environment as a flat POMDP, however, may require many thousands of states. The current algorithms scale poorly in terms of learning and planning in large POMDPs. It is difficult to learn large models since the standard learning algorithm (Baum-Welch [10]) does not provide an obvious way of reusing previously learned sub-models. Exact planning methods based on dynamic programming are also intractable in large models. Additionally, heuristic planning methods such as Most Likely State (MLS) (which only looks at the peak of the distribution), often used in their stead, are mainly effective in belief states with unimodal low entropy distributions (see Figure 1.6) [22]. In this dissertation we apply a hierarchical solution to learning and planning in robot navigation using our hierarchical POMDP framework. Figure 1.7 shows an example hierarchical representation of a corridor environment where abstract states group together fine resolution primitive states.

Figure 1.5: A corridor environment modeled as a flat POMDP. The circles are the hidden states of the model and represent the pose of the robot (location and orientation) in the environment. The figure shows the transition matrix for the three available actions for state S1, and also the observation model associated with state S1.

Figure 1.6: The figure shows a map of the third floor of the MSU Engineering Building. The squares indicate two square meter locations in the environment. The shading represents the probability distribution of possible locations, and darker means higher values. Pavlov's task is to be able to get to any one of the squares in the map starting from any other square. A problem that may arise during navigation is that the robot's belief as to its true location may be multimodal. This could happen either because the starting location of the robot was unknown, or due to sensor and motion noise. For example, due to sensor and motion noise at junction J1, the robot was unable to distinguish the east from the south corridor.

Figure 1.7: The figure illustrates a multi-resolution spatial model. The rectangles indicate abstract states representing corridors and junctions. The circles in the rectangles indicate fine resolution states which are children of the rectangle abstract states.

1.6 Contributions

The contributions of this dissertation are summarized as follows:

• We formally introduce and describe the hierarchical partially observable Markov decision process model. This model is derived from the hierarchical hidden Markov model with the addition of actions and rewards.
Unlike flat partially observable Markov decision process models, it provides both spatial and temporal abstraction. Spatial abstraction is achieved by low resolution abstract states, which group together finer resolution states in a tree-like fashion. In temporal abstraction, a transition from an abstract state cannot occur unless the process under the abstract state has finished. This model is presented in Chapter 3.

• We derive an EM algorithm for learning the parameters of a given HPOMDP. We empirically show that the algorithm converges by increasing the log likelihood of the data after every training epoch. The algorithm is presented in Chapter 3, and the empirical convergence results are described in Chapter 5.

• We introduce two approximate training methods. For the first method, which we named "selective-training", only selected parts of an overall HPOMDP are allowed to be trained for a given training sequence. This results in faster training. For the second method, which we named "reuse-training", submodels of an HPOMDP model are first trained separately and then reused in the overall model. This results in better learned models. Both algorithms and results are presented in Chapter 5.

• We derive two planning and execution algorithms for approximating the optimal policy in a given HPOMDP. The planning and execution algorithms combine hierarchical solutions for solving Markov decision processes, such as the options framework [25], and approximate flat POMDP solutions [22] (Chapter 6).

• We conduct a detailed experimental study of the learning algorithms for indoor robot navigation. Experiments are presented for various large scale models both in simulation and on a real robot platform. The experiments demonstrate that our learning algorithms can efficiently improve initial HPOMDP models, which results in better robot localization (Chapters 5 and 6). Additionally, we compare our learned models with equivalent flat POMDP models that are trained with the same data. We have developed approximate training methods to learn HPOMDP models that converge much more rapidly than standard EM procedures for flat POMDP models. When the learned HPOMDP models are converted to flat models, they give better robot localization than the flat POMDP models that are trained with the standard flat EM algorithm for POMDPs (Chapter 5).

• We apply the planning algorithms to indoor robot navigation, both in simulation and in the real world. Our algorithms are highly successful in taking the robot to any environment state starting from no positional knowledge (uniform initial belief state). In comparison with flat POMDP heuristics, our algorithms compute plans much faster, and use a much smaller number of steps in executing navigation tasks (Chapter 6).

1.7 Outline

In Chapter 2 we cover background material. We formally explain the Markov decision process framework, and the partially observable Markov decision process framework, which extends the MDP framework with hidden states. We describe the EM algorithm for learning hidden Markov models and POMDPs. Additionally, we present the mainstream approaches to robot navigation, and report on previous studies of robot navigation involving POMDPs.

In Chapter 3 we first describe the hierarchical HMM model which has been the building block of our hierarchical approach.
We then introduce the hierarchical POMDP model, which is an extension of the HHMM with the addition of actions and rewards. We describe in detail the algorithm for learning HPOMDPs, and present a small example to demonstrate the correctness of the learning algorithm. The chapter continues with a discussion of the relation of hierarchical POMDPs to flat POMDPs. Finally, the chapter concludes with a discussion of related multi-resolution hidden Markov models.

In Chapter 4 we describe the robot navigation domain. We describe the different robot platforms, the navigation environments, and how they are represented as hierarchical POMDPs. We present a description of the features that the robot can perceive, and how they are computed through the use of a neural network. We conclude with a detailed description of the overall robot navigation architecture.

In the next two chapters we describe in detail learning and planning in HPOMDPs within the context of robot navigation. Learning and planning are integrated in an overall navigation architecture shown in Figure 1.8. The topological map construction can be done either by the programmer or by learning. In our case we have not implemented any automatic construction of topological maps, but rather we provide the map by hand. Automatic topological map construction has been addressed by many researchers [18], [26], [27], and is not the focus of this dissertation. The topological map can be compiled into an HPOMDP representation, which can be used for navigation tasks. Before a navigation task can begin we construct a plan, which is a mapping from states to actions. During execution of the navigation task the robot updates its belief about its true position after every action and observation, and chooses the next appropriate action to execute according to its plan and belief state. A navigation task produces a sequence of observations and actions, which can be used through an EM learning algorithm to improve the parameters of the HPOMDP model.

In Chapter 5 we describe detailed learning experiments where we train HPOMDP models to represent hierarchical spatial maps of indoor office environments. Before any learning experiments, we present two approximate learning methods that enable fast training of large scale environments. We present a detailed experimental study where we learn the parameters of HPOMDP models for simulated spatial environments, and present learning experiments with actual robot platforms. Additionally, we present experiments for learning the structure of the environment using our learning algorithms. In structure learning, we allow for semi-ergodic models at abstract levels of the hierarchy, and prune away incorrect connections using our learning algorithms.

In Chapter 6 we describe planning and execution algorithms, which combine theory from hierarchical MDP solutions, such as the options framework [25], with approximate POMDP solutions, such as the MLS strategy. We present robot navigation results both in the Nomad 200 simulator and in the real physical world.

In the final Chapter 7, we summarize the main contributions of the dissertation, and describe directions for future research.
Figure 1.8: Topological map construction can either be learned or provided. The topological map is compiled into an HPOMDP representation which can be used for navigation tasks through planning and execution. Navigation tasks produce sequences of observations and actions that can be used in an EM training procedure to improve the parameters of the HPOMDP model.

Chapter 2

Background

In this chapter we cover background material. In Section 2.1 we explain Markov decision processes, which form the basic solution to sequential decision making. In Section 2.2 we explain the partially observable Markov decision process framework, which extends the MDP framework by taking into consideration hidden states. In Section 2.2.1 we present the planning solution in POMDPs, and in Section 2.2.2 we describe the Baum-Welch algorithm for learning POMDP models. In Section 2.3 we present the mainstream approaches to robot navigation, such as metric-based and landmark-based navigation. Finally, in Section 2.4 we describe previous studies that have used POMDP-based navigation systems.

2.1 Markov Decision Processes

Markov Decision Processes (MDPs) model sequential decision making tasks, where an agent at each point in time observes its current state and based on that chooses what action to do next [28]. When an action is chosen, it is carried out and a reward is received from the agent's environment. Figure 2.1 illustrates an MDP modeling a simple navigation task. Formally, MDPs are defined as a four tuple (S, A, T, R):

• S: A set of states that represents the state of the system at each point in time.

• A: A set of actions that an agent can take (can depend on state).

• $T : S \times A \times S \rightarrow [0,1]$: The state transition function, which maps each state-action pair into a probability distribution over the state space. The next distribution over the state space depends only on the current state-action pair and not on previous state-action pairs. This requirement ensures the Markov property of the process. We write $T(s, a, s')$ for the probability that an agent that took action $a$ from state $s$ reached state $s'$.

• $R : S \times A \rightarrow \mathbb{R}$: The immediate reward function, which indicates the reward for doing an action in some state.

An agent chooses actions so as to maximize its expected long-term reward. If the lifetime of the agent is finite, for example if the agent expects to live for the next $T$ steps, it should choose to maximize its expected reward only for the next $T$ steps. This optimality criterion is called finite-horizon optimality and is shown in Equation 2.1, which denotes the expected sum of rewards the agent will receive starting from state $s$ and following policy $\pi$ (a mapping from states to actions) for $T$ steps.

$$V^{\pi}(s) = E^{\pi}\!\left(\sum_{t=0}^{T-1} r_t\right) \qquad (2.1)$$

Figure 2.1: An MDP modeling a simple navigation task.
There are two actions available, east and west, which succeed with probability 0.9. If they fail, the robot moves in the opposite direction, or stays in the same state if the opposite direction is a wall. The reward function is structured such that the robot gets a reward of +1 for actions leading into state 3. Otherwise the robot gets a 0 reward, except for a negative −1 reward for actions leading onto the wall. All four states are completely observable since they generate unique observation symbols.

However, if the agent's lifetime is infinite, or if the lifetime is very long compared to the length of each action, then what seems to be a better optimality criterion is the infinite-horizon criterion, shown in Equation 2.2. In this model, the agent chooses actions so as to maximize its expected infinite discounted sum of rewards. The discount factor $\gamma$ plays different roles. One reason is to give more weight to rewards in the near future over rewards in the far future. Another reason is to make sure that the sum is bounded.

$$V^{\pi}(s) = E^{\pi}\!\left(\sum_{t=0}^{\infty} \gamma^{t} r_t\right), \quad 0 \le \gamma \le 1 \qquad (2.2)$$

An agent who is taking actions so as to maximize one of the optimality criteria above is said to be executing a policy. The goal of the agent is to find the best, or optimal, policy, which will also receive the most reward in the long run. In the finite-horizon model an optimal policy is typically non-stationary, that is, it changes over time. This is because the agent takes into account how many steps are left in its life, and given the same situation (state), may choose a different action depending on how many steps it has left. In the infinite-horizon case, there can be shown to exist an optimal policy which is stationary.

In order to evaluate a policy that an agent is executing, we can define the value of a state $s$, given that the agent starts in state $s$ and executes policy $\pi$, as $V^{\pi}(s)$. In the finite-horizon case the value of a state $s$ given a policy $\pi$ can be defined as the expected sum of rewards gained from starting in state $s$ and executing non-stationary policy $\pi$ for $t$ steps. This value function can be expanded recursively as shown in Equation 2.3.

$$V_t^{\pi}(s) = R(s, \pi_t(s)) + \gamma \sum_{s' \in S} T(s, \pi_t(s), s')\, V_{t-1}^{\pi}(s') \qquad (2.3)$$

In the infinite-horizon case, the value of a state under some policy $\pi$ is the expected discounted sum of future rewards for starting in state $s$ and executing stationary policy $\pi$. This value function is recursively defined in Equation 2.4. The simultaneous solution of the linear set of equations that can be derived from Equation 2.4 (one for every $s \in S$) is unique and gives us the value function of policy $\pi$.

$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V^{\pi}(s') \qquad (2.4)$$

For example, if the policy in the navigation example in Figure 2.1 is $\pi = [\text{east}, \text{east}, \text{east}, \text{west}]$, then to compute the value function we need to solve the set of four linear equations shown below:

$$V(s_1) = \gamma\,[0.9 V(s_2) + 0.1 V(s_1)]$$
$$V(s_2) = 1 + \gamma\,[0.9 V(s_3) + 0.1 V(s_1)]$$
$$V(s_3) = \gamma\,[0.9 V(s_4) + 0.1 V(s_2)]$$
$$V(s_4) = 1 + \gamma\,[0.9 V(s_3) + 0.1 V(s_4)]$$

If we set $\gamma = 0.9$ then the solution is $[4.844,\ 5.442,\ 4.946,\ 5.501]$, which represents the values of the four states for policy $\pi$ (a small numerical check of this computation is sketched below).
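The four simultaneous equations above can be verified numerically. The following sketch (our own illustration, not code from the dissertation) rearranges them into the linear system $(I - \gamma T^{\pi})V = R^{\pi}$ and solves it with numpy; the matrix layout and state ordering are our transcription of the equations in the text.

```python
import numpy as np

# Policy pi = [east, east, east, west] for the 4-state corridor of Figure 2.1.
# T_pi[i, j] = probability of moving from state i to state j under pi,
# R_pi[i]    = immediate reward for taking pi(i) in state i (as used in the text).
gamma = 0.9
T_pi = np.array([
    [0.1, 0.9, 0.0, 0.0],   # s1, east: succeed -> s2, fail -> bounce off the wall, stay
    [0.1, 0.0, 0.9, 0.0],   # s2, east
    [0.0, 0.1, 0.0, 0.9],   # s3, east
    [0.0, 0.0, 0.9, 0.1],   # s4, west
])
R_pi = np.array([0.0, 1.0, 0.0, 1.0])

# V = R_pi + gamma * T_pi V   <=>   (I - gamma * T_pi) V = R_pi
V = np.linalg.solve(np.eye(4) - gamma * T_pi, R_pi)
print(V.round(3))   # -> [4.844 5.442 4.946 5.501], matching the values quoted above
```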
But how can we find the optimal policy? One way is to use the optimal value function, from which we can derive the optimal policy. First, for the finite-horizon case, the optimal action to take on the last step is the one that gives the highest immediate reward (Equation 2.5).

$$\pi_1^{*}(s) = \operatorname*{argmax}_a R(s, a) \qquad (2.5)$$

For every other step $t$, the optimal action to take in a state $s$ is given by Equation 2.6.

$$\pi_t^{*}(s) = \operatorname*{argmax}_a \left( R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V_{t-1}^{*}(s') \right) \qquad (2.6)$$

For the infinite-horizon case, there exists an optimal stationary policy that can be followed from any initial state $s$ and that produces the maximum value for that state [5]. The value function of any optimal policy (note that there can be more than one optimal policy) is unique and is derived from the solution of the simultaneous equations defined in Equation 2.7.

$$V^{*}(s) = \max_a \left( R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^{*}(s') \right), \quad \forall s \in S \qquad (2.7)$$

Given the optimal value function we can derive a greedy optimal policy as shown in Equation 2.8.

$$\pi^{*}(s) = \operatorname*{argmax}_a \left( R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^{*}(s') \right) \qquad (2.8)$$

Thus one way to find optimal policies is to find the optimal value function. Unfortunately, due to the max term in Equation 2.7, the set of equations derived is non-linear and therefore we cannot solve the system of equations analytically. However, we can find the optimal value function using an iterative procedure called value iteration. The value iteration algorithm works by iteratively computing the discounted finite-horizon optimal value functions as the horizon keeps increasing, and can be shown to converge to the correct values [29], [30].

1. initialize $V(s)$ arbitrarily
2. loop until $\max_s |V_t(s) - V_{t-1}(s)| < \epsilon$:
   $t = t + 1$
   loop for $s \in S$: $\;V_t(s) = \max_a \left( R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V_{t-1}(s') \right)$
3. end loop

Figure 2.2: The value iteration algorithm is a dynamic programming solution to the non-linear set of equations defined by Equation 2.7. The algorithm works in an iterative fashion by computing the discounted finite-horizon optimal value functions as the horizon keeps increasing.

However, if we don't know the transition and reward functions $T$ and $R$, we can use an on-line algorithm such as Q-learning [31]. We can do that if we re-express Equation 2.3 in terms of state-action values, called Q values, shown in Equation 2.9.

$$Q_t(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V_{t-1}^{*}(s') \qquad (2.9)$$

The intuition behind this re-expression is that if an agent has learned the Q values, it can choose actions without knowledge of the transition function $T$ and reward function $R$. An agent who has learned the optimal $Q^{*}$ values (Equation 2.10) can choose the optimal action for each state by maximizing over the $Q^{*}$ values (Equation 2.11).

$$Q^{*}(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^{*}(s') \qquad (2.10)$$

$$\pi^{*}(s) = \operatorname*{argmax}_a Q^{*}(s, a) \qquad (2.11)$$

We can also express Equation 2.10 as Equation 2.12, which allows us to devise an iterative routine for calculating the optimal $Q^{*}$ values.

$$Q^{*}(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a'} Q^{*}(s', a') \qquad (2.12)$$

The $Q^{*}$ values can be estimated recursively using Equation 2.13 [31].

$$Q(s, a) = Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \qquad (2.13)$$

If each action is executed in each state an infinite number of times on an infinite run and $\alpha$ is decayed appropriately, the Q values converge with probability 1 to the optimal $Q^{*}$ values [31, 32]. The $r$ term is the immediate reward the agent receives from the environment when it executes action $a$ in state $s$, and $\alpha$ is the learning rate.
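To make the procedure of Figure 2.2 concrete, here is a hedged sketch of value iteration on the four-state corridor of Figure 2.1. The transition model follows the figure caption; the reward convention (treating $R(s,a)$ as the reward of the intended outcome, as in the policy-evaluation example above) is our own reading of the figure, so the exact numbers should be taken as illustrative.

```python
import numpy as np

S, A = 4, 2                        # states s1..s4, actions 0 = east, 1 = west
gamma, eps = 0.9, 1e-6

# T[a, s, s']: move in the chosen direction with prob 0.9, otherwise bounce back
# (or stay put when the opposite direction is the wall).
T = np.zeros((A, S, S))
for s in range(S):
    for a, step in ((0, +1), (1, -1)):
        intended = min(max(s + step, 0), S - 1)
        opposite = min(max(s - step, 0), S - 1)
        T[a, s, intended] += 0.9
        T[a, s, opposite] += 0.1

# R[a, s]: +1 if the intended move enters state 3 (index 2), -1 if it hits a wall.
R = np.zeros((A, S))
R[0, 1] = R[1, 3] = +1.0           # east from s2, west from s4 lead into s3
R[1, 0] = R[0, 3] = -1.0           # west from s1, east from s4 run into the wall

V = np.zeros(S)
while True:
    Q = R + gamma * (T @ V)        # Q[a, s] = R(s,a) + gamma * sum_s' T(s,a,s') V(s')
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < eps:
        break
    V = V_new

print(V.round(3), Q.argmax(axis=0))   # optimal values and a greedy policy (0 = east, 1 = west)
```

Replacing the model-based backup inside the loop with the sampled update of Equation 2.13 would turn this sketch into tabular Q-learning.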
2.2 Partially Observable Markov Decision Processes

To use the MDP formalism for planning, an agent needs to accurately perceive the state of the environment. Unfortunately, this is not the case in the real world. For example, imagine a mobile robot whose task is to navigate an indoor office environment and that the only features it can perceive are walls and openings. Obviously, this robot will be unable to tell its exact location in the environment since its local observation may correspond to different places in the world. Providing the robot with more observation capabilities could reduce perceptual aliasing, at the cost of defining and using more features. Nonetheless, it is always difficult to associate unique features with each possible location in the environment, especially for large scale robot navigation domains. Additionally, too many features can also give rise to perceptual aliasing in that many locations would correspond to similar features. For example, if we are navigating in a forest we are able to see an amazing number of tree shapes and sizes. Yet, for the most part it all appears the same. Thus, we would like to have an approach that deals with perceptual aliasing rather than an approach that tries to create unique features.

A way to disambiguate spatial locations that generate the same observations is to use memory, that is, to keep track of all observations and actions that the robot has taken. A popular way of representing memory is provided by the partially observable Markov decision process framework. In the POMDP framework, at each point in time the agent uses the history of past observations and actions to calculate a probability distribution over the underlying environment state (a belief state). The belief states are a sufficient statistic for the history of the process [33]. This means that, using the belief space as a new type of state representation, we can compute optimal policies for POMDPs using the MDP framework. Intuitively, an agent that is able to act based on its uncertainty as to its true state will be more successful than one simply acting on its current perception of what the true state of the environment is. For example, if the robot knows that it is highly confused, it could take actions just for the sake of reducing its uncertainty. Or, given that it believes it is at different locations at the same time, it could take the action that is best for all probable locations. Thus, planning in POMDPs means that we need to map every possible probability distribution over the agent's belief of its true state to the best action for the task at hand.

2.2.1 Planning in POMDPs

The POMDP framework is a systematic approach that uses belief states to represent memory of past actions and observations. Thus, it enables an agent to compute optimal policies using the MDP framework. Formally, a POMDP model is defined as a six tuple (S, A, T, R, Z, O):

• S, A, T, and R describe a Markov decision process.

• Z: A set of observations.

• $O : A \times S \times Z \rightarrow [0, 1]$: A function that maps the action at time $t-1$ and the state at time $t$ to a distribution over the observation set. We write $O(s', a, z)$ for the probability of making observation $z$ given that the agent took action $a$ and landed in state $s'$.

A POMDP agent can be decomposed into two components, as shown in Figure 2.3.
The state estimator takes as input the current observation, the action taken, and the previous belief state $b$, and it outputs a new belief state $b'$. This calculation is shown in Equation 2.14.

$$
b'(s') = P(s' \mid z, a, b)
= \frac{P(z \mid s', a, b)\, P(s' \mid a, b)}{P(z \mid a, b)}
= \frac{P(z \mid s', a, b) \sum_{s \in S} P(s' \mid a, b, s)\, P(s \mid a, b)}{P(z \mid a, b)}
= \frac{P(z \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, P(s \mid b)}{P(z \mid a, b)}
= \frac{O(s', a, z) \sum_{s \in S} T(s, a, s')\, b(s)}{P(z \mid a, b)}
\qquad (2.14)
$$

Figure 2.3: A POMDP agent is made of two main components. The state estimator module receives observations from the environment, the last action taken, and the previous belief state, and produces an updated belief state. The policy module maps a belief state to an action.

Next, we demonstrate a sample belief state evolution based on the POMDP model described in Figure 2.4 and the belief update procedure described in Equation 2.14. If the starting belief state is [0.33 0.33 0.0 0.33] and the agent takes action east and does not perceive the * observation, then the new belief state becomes [0.1 0.45 0.0 0.45]. If it takes the action east again and still does not perceive the * observation, then the belief state becomes [0.1 0.164 0.0 0.736], which is a good indication that the agent has moved to the rightmost state. (A small implementation sketch of this update is given below, after the belief MDP definition.)

Figure 2.4: In this simple POMDP, states 1, 2, and 4 generate the same observation while state 3 generates a different observation (*). The agent can take actions east and west, which succeed with probability 0.9. If they fail, the agent moves in the opposite direction. If a movement is not possible the agent stays in the same state.

After belief state estimation, the policy component of the POMDP agent must map the current belief state into an action. Since the belief state is a sufficient statistic, the optimal policy is the solution to a continuous-space belief MDP, which is defined as follows:

• B: A set of belief states, where each belief state is a function $b : S \rightarrow [0, 1]$ such that $\sum_{s} b(s) = 1$.

• A: A set of actions that remains the same as in the underlying MDP.

• $\tau : B \times A \times B \rightarrow [0,1]$: The state transition function, which maps each belief state–action pair into a probability distribution over the belief state space. The state transition function can be calculated according to Equation 2.15.

$$\tau(b, a, b') = P(b' \mid a, b) = \sum_{z \in Z} P(b' \mid a, b, z)\, P(z \mid a, b), \quad \text{where } P(b' \mid b, a, z) = \begin{cases} 1 & \text{if } SE(b, a, z) = b' \\ 0 & \text{otherwise} \end{cases} \qquad (2.15)$$

• $\rho : B \times A \rightarrow \mathbb{R}$: The immediate reward function on belief states. The reward function can be calculated as shown in Equation 2.16.

$$\rho(b, a) = \sum_{s \in S} b(s)\, R(s, a) \qquad (2.16)$$
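The state estimator of Equation 2.14 and the belief reward of Equation 2.16 are straightforward to implement once T, O, and R are available as arrays. The sketch below is our own generic implementation (not code from the dissertation); the tiny corridor used in the demo mirrors Figure 2.4, but the exact transition and observation probabilities at the walls are our assumptions, so the printed numbers are only qualitatively comparable to the belief evolution quoted above.

```python
import numpy as np

def state_estimator(b, a, z, T, O):
    """SE(b, a, z) of Equation 2.14. Shapes: T[a, s, s'], O[a, s', z], b[s]."""
    b_pred = b @ T[a]                      # sum_s T(s, a, s') b(s)
    b_new = O[a, :, z] * b_pred            # weight by P(z | s', a)
    norm = b_new.sum()                     # this is P(z | a, b)
    return b_new / norm, norm

def belief_reward(b, a, R):
    """rho(b, a) of Equation 2.16, with R[s, a]."""
    return b @ R[:, a]

# A 4-state corridor in the spirit of Figure 2.4 (only the 'east' action shown).
# Assumed model: success 0.9, failure bounces back; state 3 emits '*' and the
# other states emit '#', both deterministically.
T = np.zeros((1, 4, 4))
T[0] = [[0.1, 0.9, 0.0, 0.0],
        [0.1, 0.0, 0.9, 0.0],
        [0.0, 0.1, 0.0, 0.9],
        [0.0, 0.0, 0.1, 0.9]]
O = np.zeros((1, 4, 2))                    # observations: 0 = '#', 1 = '*'
O[0, :, 0] = [1, 1, 0, 1]
O[0, :, 1] = [0, 0, 1, 0]

b = np.array([1 / 3, 1 / 3, 0.0, 1 / 3])
b, _ = state_estimator(b, a=0, z=0, T=T, O=O)   # take east, do not observe '*'
print(b.round(2))                          # probability mass shifts toward states 2 and 4
```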
Since we can formulate the POMDP problem as an MDP whose states are the belief states of the POMDP, we can solve it using value iteration (see Figure 2.2). That is, we determine the optimal policy by iteratively constructing each optimal t-step discounted value function over belief space. Unfortunately, the number of belief states is infinite and iteration over belief states is intractable. Nevertheless, the optimal value function for any finite-horizon POMDP exhibits certain characteristics which alleviate the computation of optimal policies. In particular, the optimal value function for any finite-horizon POMDP is always piecewise-linear and convex over the belief space [34], [35]. This is not necessarily true for the infinite-horizon discounted value function, which remains convex but may have infinitely many facets [36]. Still, the optimal infinite-horizon discounted value function can be approximated arbitrarily well by a piecewise-linear and convex value function, or in other words by a finite-horizon value function with a sufficiently long horizon [37]. This property of the value function allows for compact representations, as well as for the development of algorithms that are able to calculate the optimal t-step discounted value function from the optimal t−1 step discounted value function.

To see how we can compactly represent the optimal finite-horizon value function of POMDPs, we review the concept of policy trees, shown in Figure 2.5 [38]. In a policy tree the root node represents the action to be taken when there are $t$ steps to go. Depending on the observation made after the action has been taken, the agent chooses the next action with $t-1$ steps to go. If the agent is at a leaf of a tree then there is only one thing to do: take the last action.

Figure 2.5: The policy tree indicates the non-stationary t-step policy to be executed for a POMDP. First the action $a$ defined by the root is executed, and then, according to the observation $z_i$ perceived, the $t-1$ steps-to-go subtree is chosen.

To calculate the expected discounted value of executing a non-stationary policy for $t$ steps, or equivalently the value of executing a policy tree $p$ from a belief state $b$, we can re-express Equation 2.3 as Equation 2.17.

$$
V_{p,t}(b) = \rho(b, a(p)) + \gamma \sum_{b' \in B} \tau(b, a(p), b')\, V_{p,t-1}(b')
= \rho(b, a(p)) + \gamma \sum_{b' \in B} \sum_{z \in Z} P(b' \mid a(p), b, z)\, P(z \mid a(p), b)\, V_{z(p),t-1}(b')
\qquad (2.17)
$$

where $V_{z(p),t-1}$ is the $t-1$ steps-to-go subtree that is entered with observation $z$ from its parent $V_{p,t}$. Since the term $P(b' \mid a(p), b, z)$ (see Equation 2.15) is deterministic, Equation 2.17 can be re-expressed as Equation 2.18.

$$V_{p,t}(b) = \rho(b, a(p)) + \gamma \sum_{z \in Z} P(z \mid a(p), b)\, V_{z(p),t-1}\big(SE(b, a(p), z)\big) \qquad (2.18)$$

The term $P(z \mid a(p), b)$ is the normalizing factor in the belief state estimation in Equation 2.14. Therefore Equation 2.18 can be expressed as Equation 2.19.

$$
V_{p,t}(b) = \sum_{s \in S} b(s) R(s, a(p)) + \gamma \sum_{z \in Z} \sum_{s' \in S} \sum_{s \in S} b(s)\, T(s, a(p), s')\, O(s', a(p), z)\, V_{z(p),t-1}(s')
$$
$$
= \sum_{s \in S} b(s) \left( R(s, a(p)) + \gamma \sum_{z \in Z} \sum_{s' \in S} T(s, a(p), s')\, O(s', a(p), z)\, V_{z(p),t-1}(s') \right)
= \sum_{s \in S} b(s)\, V_{p,t}(s)
\qquad (2.19)
$$

where $V_{p,t}(s)$ is the value of executing a policy tree $p$ of depth $t$ starting at state $s$. The interesting part of Equation 2.19 is that it tells us that the value function of every policy tree $V_{p,t}$ is linear in the belief space. If we let $\alpha_{p,t} = \langle V_{p,t}(s_1), \ldots, V_{p,t}(s_n) \rangle$, then Equation 2.19 becomes Equation 2.20.

$$V_{p,t}(b) = b \cdot \alpha_{p,t} \qquad (2.20)$$

If $P$ is the finite set of all possible policy trees of depth $t$, then the optimal t-step value for a belief state $b$ is the value of executing the best policy tree in that belief state, shown in Equation 2.21.

$$V_t^{*}(b) = \max_{p \in P} (b \cdot \alpha_{p,t}) \qquad (2.21)$$

Each policy tree $p$ induces a value function $V_{p,t}$ that is linear in $b$, and $V_t^{*}$ is the upper surface of this collection of functions, which makes the optimal t-step value function piecewise-linear and convex.
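Once a set of α-vectors is available, Equations 2.20 and 2.21 reduce to dot products and a max. The fragment below is a small illustration for a two-state POMDP in the spirit of Figure 2.6; the vectors, action names, and numbers are invented purely for illustration.

```python
import numpy as np

# Each row is an alpha vector alpha_{p,t}; action_of[p] is the root action a(p) of tree p.
alphas = np.array([[10.0, 2.0],     # V_{p1,t}
                   [6.0, 6.0],      # V_{p2,t}
                   [1.0, 12.0]])    # V_{p3,t}
action_of = ["east", "stay", "west"]

def optimal_value_and_action(b):
    values = alphas @ b              # V_{p,t}(b) = b . alpha_{p,t}      (Eq. 2.20)
    best = int(np.argmax(values))    # V_t^*(b)   = max_p b . alpha_{p,t} (Eq. 2.21)
    return values[best], action_of[best]

for p in (0.1, 0.5, 0.9):            # b(s1) = p, b(s2) = 1 - p
    b = np.array([p, 1.0 - p])
    print(p, optimal_value_and_action(b))
```

Scanning `p` from 0 to 1 traces out the upper surface of the three lines, which is exactly the piecewise-linear and convex shape sketched in Figure 2.6.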
"Lllfllh' p . (fl :1 - D 44:: .\.. Val-“IF . ‘1‘. . I reward. An example with a two state POMDP 31 , 32 is shown in Figure 2.6 where b(sl) is enough to describe the belief space. Vpl Vp3 X / X— Vp2 0 b(sl) I Figure 2.6: The figure shows an example of an optimal t-step value function which is the upper surface of the value functions associated with all t-step policy trees. If we are given a piecewise-linear and convex value function, then we can calculate the best action to do by projecting the optimal value function onto the belief space. An example is shown in Figure 2.7. Vpl I I l I ; . ; Vp2 I I I I I “fl I I I I I I I I I I I I I l 0 a(pl) a(p2) a(p3) I Figure 2.7: The figure shows a mapping from the belief space to the appropriate action defined by the root of a t-step policy tree. The mapping is calculated by projecting the optimal value function onto the belief space. Now in order to calculate V,“ from V,*_1 (which is what we want to achieve in a value iteration algorithm) really means to get the set of am vectors from the set of ap,t_1 vectors. Nevertheless, the complexity of computing the new set of a vectors 37 fur a Proxii mm; 0}" ill} grows exponentially in the number of observations |Z |. In particular, given the current set of a vectors 16-1, we need to compute |A||Vt_1||ZI oz vectors to represent the value function V}. The reason is that we need to take into consideration all possible lists (of alpha vectors in Vt_1) of length [0]. Solving POMDP problems, is intrinsically hard. In fact, computing the Optimal policy for the finite—horizon is PSPACE-complete [7] A few algorithms have been developed for solving this problem, [35], [39], [40] [41], [38]. These algorithms however, are only able to solve small state POMDP problems, and even worse, they are based on the the fact that the value function is piecewise- linear and convex which is not necessarily true in the infinite-horizon case [34]. For the infinite-horizon case finding the optimal policy of a given POMDP is actually undecidable [8], and thus may not even be computable. To avoid the computational intractability in computing the exact value function for a given POMDP, researchers have developed various approximations [42]. Ap- proximate algorithms have been developed both for approximating the optimal value function in belief-state MDPS [43], [44] as well as algorithms for approximating the optimal policy directly [45], [46], [47], [48]. In realistic domains such as robot navi- gation which require thousands of states exact algorithms are impossible. Instead we have to rely on approximate algorithms for solving POMDPS such as the algorithms that will be presented in Chapter 6. Finally, we should note that alternatives to POMDPS, in terms of computing belief states, are Kalman filters [49], [50] and temporal Bayesian networks [51]. Kalman fil- ters have been a great contribution into the world history, in that they have enabled the mid-course correction scheme for the Apollo missions to the moon [52]. With 38 Kali elml repr: Ihi: K“ tltl. 2.2.2 POX be 2., alfil" A.“ - nl . (1:10]! I I IIISI [Tact _ ' “THU! Kalman filters however only restricted sets of probability distributions can be mod- eled such as Gaussian. Gaussians are unimodal distributions which are not a good representation for robot navigation beliefs which in many situations are multimodal. This limitation is overcome with multi-hypothesis Kalman filters. Multiple-hypothesis Kalman filters represent beliefs using mixtures of Gaussians [53], [54]. 
Nonetheless, this approach still inherits the Gaussian noise assumption. Temporal Bayesian networks can also model arbitrary probability distributions. Unfortunately, the model size of temporal Bayesian nets grows linearly with the temporal look-ahead, and this limits their usefulness in planning [55].

2.2.2 Learning POMDPs

POMDPs are basically HMMs with the addition of actions and rewards. HMMs can be learned from data through an Expectation Maximization procedure. The learning algorithm is called Baum-Welch and can be derived from the general theory of EM algorithms [9], [56], [57].

In particular, the EM algorithm is well suited for two types of applications. The first occurs in situations where optimizing the likelihood function is analytically intractable but the likelihood function can be simplified by assuming the existence of, and values for, additional but missing parameters. One such example is mixture-density parameter estimation [58], [59], [60]. The second occurs when the data has indeed missing values, due to problems with or limitations of the observation process, such as the training of hidden Markov models [9], [56], [61]. The second application is the one we are interested in, since we need to learn the parameters of a POMDP model based on observation data which do not directly reveal the underlying hidden state of the environment. Next, we give a brief derivation of the EM algorithm.

For analytical convenience the likelihood function is expressed as the log likelihood function $L(\Theta) = \ln P(X \mid \Theta)$, where $X$ is the training data and $\Theta$ the set of parameters. The EM algorithm recursively estimates a new set of parameters $\Theta_{i+1}$ such that $L(\Theta_{i+1}) \ge L(\Theta_i)$, where $\Theta_i$ is the current set of parameters. The difference between the log likelihood of any new set of parameters $\Theta$ and the current set $\Theta_i$ is expressed in Equation 2.22.

$$L(\Theta) - L(\Theta_i) = \ln P(X \mid \Theta) - \ln P(X \mid \Theta_i) = \ln \frac{P(X \mid \Theta)}{P(X \mid \Theta_i)} \qquad (2.22)$$

We would like to choose the new parameters $\Theta$ such that the right-hand side of Equation 2.22 is maximized. In general, maximization of the right-hand side may be intractable. The idea of the EM algorithm is to introduce a set of hidden variables $z$, such that if $z$ were known, computation of the optimal value of $\Theta$ would become tractable. Equation 2.22 is then transformed, through conditioning, into Equation 2.23.

$$L(\Theta) - L(\Theta_i) = \ln \frac{\sum_{z} P(X \mid z, \Theta)\, P(z \mid \Theta)}{P(X \mid \Theta_i)} \qquad (2.23)$$

Unfortunately, in the above expression the logarithm of the sum is not easy to deal with. Fortunately, due to Jensen's inequality (Equation 2.24), we can replace the logarithm of a sum with a sum of logarithms.

$$\ln \sum_{j} \lambda_j\, y_j \;\ge\; \sum_{j} \lambda_j \ln y_j, \quad \text{where } \lambda_j \ge 0 \text{ and } \sum_j \lambda_j = 1 \qquad (2.24)$$

In our expression we introduce the term $\lambda_z = P(z \mid X, \Theta_i)$, which transforms Equation 2.23 into Equation 2.25.

$$
L(\Theta) - L(\Theta_i) = \ln \sum_{z} P(z \mid X, \Theta_i)\, \frac{P(X \mid z, \Theta)\, P(z \mid \Theta)}{P(X \mid \Theta_i)\, P(z \mid X, \Theta_i)}
\;\ge\; \sum_{z} P(z \mid X, \Theta_i) \ln \frac{P(X \mid z, \Theta)\, P(z \mid \Theta)}{P(X \mid \Theta_i)\, P(z \mid X, \Theta_i)}
\qquad (2.25)
$$

Equation 2.25 can be expressed as Equation 2.26.

$$L(\Theta) \;\ge\; L(\Theta_i) + \sum_{z} P(z \mid X, \Theta_i) \ln \frac{P(X \mid z, \Theta)\, P(z \mid \Theta)}{P(X \mid \Theta_i)\, P(z \mid X, \Theta_i)} \;=\; l(\Theta) \qquad (2.26)$$

Now we need to find the parameters $\Theta_{i+1}$ that maximize the right-hand side of Equation 2.26, as shown in Equation 2.27.

$$
\Theta_{i+1} = \operatorname*{argmax}_{\Theta} \left( L(\Theta_i) + \sum_{z} P(z \mid X, \Theta_i) \ln \frac{P(X \mid z, \Theta)\, P(z \mid \Theta)}{P(X \mid \Theta_i)\, P(z \mid X, \Theta_i)} \right)
= \operatorname*{argmax}_{\Theta} \sum_{z} P(z \mid X, \Theta_i) \ln P(X, z \mid \Theta)
\qquad (2.27)
$$

In short, Equation 2.27 is the EM algorithm. The E-step is the estimation of the expectation (the expectation of $\ln P(X, z \mid \Theta)$ with respect to the hidden variables), and the M-step is the maximization. At each iteration $\Theta_{i+1}$ maximizes $l(\Theta)$, and therefore $l(\Theta_{i+1}) \ge l(\Theta_i)$.
Since $L(\Theta_i) = l(\Theta_i)$ (from Equation 2.26), and since for any $\Theta$ we have $L(\Theta) \ge l(\Theta)$, the new log likelihood $L(\Theta_{i+1})$ cannot decrease, and therefore $L(\Theta_{i+1}) \ge L(\Theta_i)$.

In cases where the complete maximization of the expected log likelihood is intractable we resort to partial M-step implementations. Such implementations simply improve $l(\Theta)$ with a gradient step rather than fully maximizing it. Such algorithms are also guaranteed to increase $L(\Theta)$ and are called Generalized EM (GEM) algorithms [12].

Next we present an adaptation of the EM algorithm, the Baum-Welch algorithm, which learns the parameters of an HMM model. An HMM model is defined formally as a five tuple (S, T, Z, O, π):

• S: A set of states.

• $T : S \times S \rightarrow [0,1]$: A function that maps every state to a distribution over the state set. We write $T(s_1, s_2)$ to indicate the probability of going from state $s_1$ to state $s_2$.

• Z: A set of observations.

• $O : S \times Z \rightarrow [0,1]$: A function that maps every state to a distribution over the observation set. We write $O(s, z)$ for the probability of observation $z$ in state $s$.

• $\pi = \Pi(S)$: The initial probability distribution over states.

When we say we learn an HMM model, we mean that, given an observation sequence $M = z_1, z_2, \ldots, z_T$ and the model parameters $\lambda = (T, O, \pi)$, we must adjust the model parameters such that the log likelihood of the observation sequence is maximized; that is, maximize $\log P(M \mid \lambda)$. In order to develop the algorithm, we first need to define some variables. The first variable is called the forward variable $\alpha$ and is defined in Equation 2.28. The $\alpha$ variable gives the probability of the observation sequence $z_1, \ldots, z_t$ and of the state at time $t$ being $i$, given the model parameters $\lambda$.

$$\alpha_t(i) = P(z_1, z_2 \ldots z_t, s_t = i \mid \lambda) \qquad (2.28)$$

If we know the $\alpha$ variable then it is easy to calculate how well the model fits a sequence of observations $M$ of length $T$, as shown in Equation 2.29.

$$P(M \mid \lambda) = \sum_{s \in S} \alpha_T(s) \qquad (2.29)$$

The $\alpha$ variable can easily be calculated recursively. If $t = 1$, then

$$\alpha_1(i) = \pi(i)\, O(i, z_1)$$

else

$$\alpha_{t+1}(j) = \sum_{i=1}^{|S|} \alpha_t(i)\, T(i, j)\, O(j, z_{t+1})$$

Another variable is the backward variable, which gives the probability of the observation sequence $z_{t+1}, \ldots, z_T$ given that the state at time $t$ was $i$ (Equation 2.30).

$$\beta_t(i) = P(z_{t+1}\, z_{t+2} \ldots z_T \mid s_t = i, \lambda) \qquad (2.30)$$

The backward variable can also be calculated recursively as follows. If $t = T$, then

$$\beta_T(i) = 1$$

else

$$\beta_t(i) = \sum_{j=1}^{|S|} T(i, j)\, O(j, z_{t+1})\, \beta_{t+1}(j)$$

Note that if we know the $\beta$ variable, we can rewrite Equation 2.29 as Equation 2.31.

$$P(M \mid \lambda) = \sum_{s \in S} \alpha_t(s)\, \beta_t(s) \qquad (2.31)$$

The next variable to be defined is the $\gamma$ variable, which gives the probability that at time $t$ the process was in state $i$ (Equation 2.32).

$$\gamma_t(i) = P(s_t = i \mid M, \lambda) \qquad (2.32)$$

The $\gamma$ variable can be calculated from the $\alpha$ and $\beta$ variables as follows:

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(M \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{|S|} \alpha_t(j)\, \beta_t(j)}$$

The last variable to be defined is the $\xi$ variable, shown in Equation 2.33,

$$\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid M, \lambda) \qquad (2.33)$$

which can also be calculated from the $\alpha$ and $\beta$ variables as follows:

$$\xi_t(i, j) = \frac{\alpha_t(i)\, T(i, j)\, O(j, z_{t+1})\, \beta_{t+1}(j)}{P(M \mid \lambda)}$$

Using the variables defined above, we re-estimate the transition model as shown in Equation 2.34, the observation model as shown in Equation 2.35, and the initial distribution as shown in Equation 2.36.

$$\bar{T}(i, j) = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad (2.34)$$

$$\bar{O}(j, k) = \frac{\sum_{t=1,\, z_t = k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \qquad (2.35)$$

$$\bar{\pi}(i) = \gamma_1(i) \qquad (2.36)$$
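The recursions for α, β, γ, and ξ and the re-estimation formulas 2.34-2.36 translate almost directly into matrix code. The sketch below is a single (unscaled) Baum-Welch iteration for a generic discrete HMM; it is our own illustrative implementation, not code from the dissertation, and it assumes a sequence of at least two observations. For long sequences the scaled version given next is preferable.

```python
import numpy as np

def baum_welch_step(T, O, pi, obs):
    """One EM iteration for an HMM. Shapes: T[i, j], O[j, k], pi[i]; obs = [z_1 .. z_T]."""
    obs = np.asarray(obs)
    n, L = len(pi), len(obs)
    alpha = np.zeros((L, n)); beta = np.zeros((L, n))
    alpha[0] = pi * O[:, obs[0]]                                   # Eq. 2.28 base case
    for t in range(L - 1):
        alpha[t + 1] = (alpha[t] @ T) * O[:, obs[t + 1]]           # forward recursion
    beta[L - 1] = 1.0                                              # Eq. 2.30 base case
    for t in range(L - 2, -1, -1):
        beta[t] = T @ (O[:, obs[t + 1]] * beta[t + 1])             # backward recursion
    likelihood = alpha[L - 1].sum()                                # P(M | lambda), Eq. 2.29
    gamma = alpha * beta / likelihood                              # Eq. 2.32
    xi = np.array([np.outer(alpha[t], O[:, obs[t + 1]] * beta[t + 1]) * T
                   for t in range(L - 1)]) / likelihood            # Eq. 2.33
    T_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]       # Eq. 2.34
    O_new = np.zeros_like(O)
    for k in range(O.shape[1]):
        O_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)   # Eq. 2.35
    return T_new, O_new, gamma[0], likelihood                      # pi_new = gamma_1, Eq. 2.36
```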
It has been proven that each time the model parameters are re-estimated, $P(M \mid \lambda)_{new} \ge P(M \mid \lambda)_{previous}$. Even though the maximization does not necessarily converge to the globally optimal solution, it converges to a local maximum [56]. This limitation requires that a good initial model be provided to the learning algorithm, as well as "adequate" training data.

In [11] an extension of the Baum-Welch algorithm for HMMs [10] is described which learns POMDPs, and we describe it next. In this version there is also a scaling factor, which allows the algorithm to run for long training sequences without fear of exceeding the precision range of any machine (a code sketch of this procedure follows the listing below).

1. First we compute the scaling factor and the $\alpha$ variable for the first observation:
$$\text{scale}_1 = \sum_{s \in S} P(z_1 \mid s)\, P(s_1 = s)$$
$$\alpha_1(s) = P(z_1 \mid s)\, P(s_1 = s) / \text{scale}_1 \quad \forall s \in S$$

2. Next we compute the forward variable and the scaling factor for all $t = 1$ to $T-1$ in a procedure called "forward propagation":
$$\text{temp}(s) = \Big[ \sum_{s' \in S} P(s \mid s', a_t)\, \alpha_t(s') \Big]\, P(z_{t+1} \mid s) \quad \forall s \in S$$
$$\text{scale}_{t+1} = \sum_{s \in S} \text{temp}(s)$$
$$\alpha_{t+1}(s) = \text{temp}(s) / \text{scale}_{t+1} \quad \forall s \in S$$

3. Next we initialize the backward variable $\beta$ for time $T$:
$$\beta_T(s) = 1 / \text{scale}_T \quad \forall s \in S$$

4. Then we estimate the backward variable starting from time $t = T-1$ down to $t = 1$ in a procedure called "backward propagation":
$$\beta_t(s) = \Big[ \sum_{s' \in S} P(s' \mid s, a_t)\, P(z_{t+1} \mid s')\, \beta_{t+1}(s') \Big] / \text{scale}_t \quad \forall s \in S$$

5. After the estimation of the $\alpha$ and $\beta$ variables we can now compute the $\xi$ and $\gamma$ variables:
$$\xi_t(s, s') = \alpha_t(s)\, P(s' \mid s, a_t)\, P(z_{t+1} \mid s')\, \beta_{t+1}(s') \quad \forall s, s' \in S, \; t = 1, \ldots, T-1$$
$$\gamma_t(s) = \text{scale}_t\, \alpha_t(s)\, \beta_t(s) \quad \forall s \in S, \; t = 1, \ldots, T$$

6. Using the $\alpha$s, $\beta$s, $\xi$s, and $\gamma$s we can now re-estimate the initial probability distribution, the transition matrices, and the observation models:
$$\pi(s) = \gamma_1(s) \quad \forall s \in S$$
$$P(s' \mid s, a) = \sum_{t=1 \ldots T-1 \mid a_t = a} \xi_t(s, s') \Big/ \sum_{t=1 \ldots T-1 \mid a_t = a} \gamma_t(s) \quad \forall s, s' \in S, \; \forall a \in A$$
$$P(z \mid s) = \sum_{t=1 \ldots T \mid z_t = z} \gamma_t(s) \Big/ \sum_{t=1 \ldots T} \gamma_t(s) \quad \forall s \in S, \; \forall z \in Z$$
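The following is a hedged sketch of the scaled, action-conditioned procedure just listed (steps 1-6). The array layout (Tr[a, s, s'] = P(s'|s, a), Ob[s, z] = P(z|s), pi[s] = P(s_1 = s)) and the variable names are our own transcription, so treat it as illustrative rather than as the dissertation's code; it also assumes that every action and observation appears at least once in the training trajectory.

```python
import numpy as np

def pomdp_baum_welch_step(Tr, Ob, pi, actions, obs):
    """One scaled EM step for a POMDP, given one trajectory a_1..a_{T-1}, z_1..z_T."""
    actions, obs = np.asarray(actions), np.asarray(obs)
    n, L = len(pi), len(obs)
    alpha = np.zeros((L, n)); beta = np.zeros((L, n)); scale = np.zeros(L)
    tmp = Ob[:, obs[0]] * pi                              # step 1
    scale[0] = tmp.sum(); alpha[0] = tmp / scale[0]
    for t in range(L - 1):                                # step 2: forward propagation
        tmp = (alpha[t] @ Tr[actions[t]]) * Ob[:, obs[t + 1]]
        scale[t + 1] = tmp.sum(); alpha[t + 1] = tmp / scale[t + 1]
    beta[L - 1] = 1.0 / scale[L - 1]                      # step 3
    for t in range(L - 2, -1, -1):                        # step 4: backward propagation
        beta[t] = Tr[actions[t]] @ (Ob[:, obs[t + 1]] * beta[t + 1]) / scale[t]
    gamma = scale[:, None] * alpha * beta                 # step 5
    xi = np.array([np.outer(alpha[t], Ob[:, obs[t + 1]] * beta[t + 1]) * Tr[actions[t]]
                   for t in range(L - 1)])
    pi_new = gamma[0]                                     # step 6: re-estimation
    Tr_new = np.array([xi[actions == a].sum(axis=0)
                       / gamma[:-1][actions == a].sum(axis=0)[:, None]
                       for a in range(Tr.shape[0])])
    Ob_new = np.array([gamma[obs == z].sum(axis=0) for z in range(Ob.shape[1])]).T
    Ob_new /= gamma.sum(axis=0)[:, None]
    return Tr_new, Ob_new, pi_new

# Usage sketch: actions and obs are integer arrays, e.g.
# actions = np.array([0, 1, 0]); obs = np.array([2, 0, 1, 1])
```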
2.3 Robot navigation

Robot navigation is a broad topic, covering a large spectrum of different technologies and applications [62]. It draws on some very ancient techniques, as well as on some of the most advanced space science and engineering. The general problem of robot navigation can be summarized by three questions: "Where am I?", "Where am I going?", and "How should I get there?", according to [63]. In general, we can classify robot navigation technologies into two categories: relative and absolute position measurements [62]. Relative position methods include dead-reckoning systems, such as systems based on odometric information. Absolute position systems include active beacon navigation, landmark-based navigation, and map-based navigation.

The dead-reckoning method (derived from "deduced reckoning" in sailing) is a simple mathematical procedure for determining the present location of a vehicle by advancing some previous position through known course and velocity information over a given length of time. The simplest form of dead-reckoning is often termed odometry. Odometry-based systems use encoders to measure wheel rotation and steering orientation. The advantage of odometry is that it is self-contained (no external feedback from the environment is needed). The disadvantage is that the position error grows without bounds. The unbounded position error is due to unequal wheel diameters, misalignment of wheels, travel over uneven floors, travel over unexpected obstacles, and wheel slippage. Due to the increasing error, odometric systems are rarely used by themselves but rather are combined with absolute positioning systems [64].

Active beacon based navigation is an ancient technique that ranges from star following to the Global Positioning System (GPS) [65]. Active beacon systems compute the absolute position of the robot by measuring the direction of incidence of three or more actively transmitted beacons mounted at known locations in the environment. There are two principal methods for determining the vehicle position: triangulation and trilateration. Triangulation measures the angles between the vehicle's heading and a number of beacons, and uses them to estimate its absolute position. Trilateration uses a measurement of the distance between a number of beacons and the vehicle [62].

In landmark-based navigation, the robot computes its absolute position by recognizing distinct features (landmarks) from its sensory input. If the positions of the landmarks in the environment are known in advance, then the robot will be able to localize correctly. Landmark-based approaches can incorporate topological maps which represent the location of each landmark in the environment (see Figure 2.8). Or, landmark based navigation could simply be a reactive system where each landmark defines the next robot direction or action [66], [67]. The advantage of having a topological map is that a robot can use it to construct multiple navigation tasks (planning). On the other hand, a topological map may be too general, because it represents multiple navigation tasks. Thus, a separate representation (a direct mapping from landmarks to actions) for each navigation task at hand may be more efficient for that particular task. Nonetheless, it is costly to construct, or learn, a separate representation for every possible task.

Figure 2.8: The figure shows a landmark based topological map representing a corridor environment. Each node in the topology represents a unique landmark that can be identified by the robot. Each node also contains information about the distance and orientation to the next node.

In general, landmark-based navigation systems are difficult to construct and maintain in large environments for complicated navigation tasks. The reason is that unique landmarks and their relationships need to be discovered. Also, recognition of landmarks may be ambiguous and sensitive to the point of view. If a robot cannot uniquely identify landmarks then it will execute inefficient navigation strategies that may not even be able to get it to its goal (if two places in the environment are recognized as the same landmark and yet they require a different action). In practice, researchers implement hybrid systems where odometry is used to disambiguate landmarks which are perceived the same.

In map-based systems the information acquired from the robot's onboard sensors is compared to a map or world model of the environment. If features from the sensor-based map and the world model map match, then the vehicle's absolute location can be estimated. Map-based positioning often includes improving global maps based on new sensory observations in a dynamic environment, and integrating local maps into the global map to cover previously unexplored areas. One type of map is the topological map used in landmark-based navigation. A classic example of landmark-based navigation using a topological map is presented in [18]. The other type of map is the geometric map.
Unlike topological maps, which represent the relationships between landmarks, geometric maps represent the world in a global coordinate system. A classic example of geometric map navigation is presented in [1], where the environment is represented with a uniform fine-grain resolution which can be as high as 1 square cm. Figure 2.9 shows an example occupancy grid map of an office environment, taken from [1]. Each grid cell of the occupancy map contains an occupancy value (the probability of an obstacle). Occupancy values are calculated from multiple sonar sensors which are integrated over time [68]. To build such a map, however, a robot needs to know its exact position in the environment. This is achieved by combining various approaches such as odometry, the fact that the robot operates in an office environment (90 degree angles), and mapping sensor readings to the current global grid map (map correlation).

Nonetheless, it is difficult to construct such maps for cyclic environments, where the odometric error renders them unusable for robot navigation. In such situations a better approach is the concurrent mapping and localization method [69], where an EM-type algorithm is used to improve the occupancy grid values. The advantage of the EM algorithm is that it does not rely on exact odometric information as to where landmark observations are generated, but rather estimates the probability that a landmark was generated at a particular environment location. Given a sequence of landmarks and motion control signals, the E-step estimates the robot position, and the M-step re-estimates the occupancy grid values. This approach has also been extended to on-line learning of spatial environments [70], and recently to learning of 3D maps of spatial environments [71].

Figure 2.9: The figure shows an occupancy grid map that models an indoor office environment and was taken from [1].

Even though it is easy to build and maintain large grid maps, which are also viewpoint independent, planning is inefficient due to the fine-grain resolution and unnecessarily detailed representation. Nonetheless, the fine-grain representation has the advantage of exact robot localization, and facilitates the disambiguation of different places. However, the large number of states imposes a significant computational burden, even for the simple procedure of belief state update. To overcome these limitations, researchers have proposed selective updating algorithms [72], tree-based representations that dynamically change their resolution, and Monte Carlo Localization (MCL) methods [73]. Monte Carlo localization simply represents the posteriors, or the belief state, by a random collection of weighted particles which approximates the desired distribution. In the literature this is referred to as the Sampling Importance Resampling (SIR) algorithm [74]. The Monte Carlo localization approach presented in [73], called Mixture-MCL, is considered to be the state of the art in robot localization, and has been shown to recover gracefully from global localization failures. It is able to recover from localization failures generated by the kidnapped robot problem [75], where a well localized robot is picked up and transferred to some random location.

2.4 Application of POMDPs to robot navigation

An autonomous mobile robot that has to navigate in an indoor office environment needs to know its location at each point in time so that it can navigate successfully.
Unfortunately, indoor office environments are perceptually aliased, which means that different places in the environment are perceived to be the same. Even if the robot had perfect sensors, which is usually not the case, it would not always be able to distinguish where it is in its environment simply by perceiving its surroundings. One way to know exactly where the robot is in the environment is to keep track of odometry; unfortunately, this requires noise-free dead-reckoning and knowledge of the starting position of the robot. One solution to the above problems, then, is to model the navigation task as a partially observable Markov decision process [23], [22]. The POMDP framework can be viewed as a method between landmark-based and odometry-based navigation, in that it seamlessly integrates both odometric and sensor reports. In addition, the integration is done in a probabilistic manner which takes into consideration the noisy sensors and the uncertainty involved with dead-reckoning.

In the POMDP framework for robot navigation, for each fixed region of the environment (e.g., for each square meter) there are four states (one for each robot orientation) modeling the position and orientation of the robot. The robot can take actions such as "go-forward", "turn-left", and "turn-right". In each state it can observe features such as wall and opening on the front, left, back and right side. Modeling a real office environment in this manner requires thousands of states, and none of the exact algorithms can solve such a problem. A simple heuristic solution is to assume that the states of the POMDP are completely observable, and solve it using the MDP framework. Then, one POMDP execution strategy is to choose, for each belief state, the action that corresponds to the state at which the belief state is peaked. Other strategies that use the MDP solution over the hidden states are described in [22], [76].

An important issue is how to learn the POMDP model. The obvious way is to use the standard Baum-Welch algorithm, which requires a "good" initial model. In [2] an initial model is created from a topological map, which is later fine-tuned with an extension of the Baum-Welch algorithm called GROW-BW. The GROW-BW algorithm examines multiple Markov chains along each hallway and adds longer ones as needed until it finds the length that fits the data best. Figure 2.10 shows a POMDP model with multiple chains for each hallway.

Figure 2.10: The figure shows a POMDP model that has multiple chains for each hallway; some states correspond to corridor junctions. The GROW-BW algorithm adds longer chains as needed so as to find the length that fits the data the best. The figure was taken from [2].

Another way of learning POMDP models for robot navigation is the algorithm described in [77] and [78]. This algorithm also extends the Baum-Welch algorithm by adding the concept of relation matrices. The relation matrices store the means and variances of the distances between states. These matrices are then incorporated in the calculation of the $\alpha$ and $\beta$ variables in the standard Baum-Welch algorithm. Moreover, odometric information collected with each observation is used to do an initial clustering of the observations into states, thereby producing an initial model.

Both of the above approaches use flat representations, where they model the environment with uniform resolution.
Since flat POMDP representations do not scale well in large scale environments, both in learning and planning, in the rest of this 54 dissertation we model robot navigation as a hierarchical POMDP. We present map learning experiments using a hierarchical EM algorithm as well as experiments using approximate hierarchical planning and execution algorithms. 55 Chapter 3 Hierarchical Partially Observable Markov Decision Processes A hierarchical problem representation is not only inspired from biology but it po- tentially has many advantages from an engineering point of view. As mentioned in Section 1.4, biological organisms (from insects to humans) seem to solve every day sequential decision tasks in a hierarchical fashion. From an engineering point of view, hierarchical modeling has advantages in ease of maintenance, interpretability and reuseability. Reusability could be a great advantage in learning the model parame- ters in that sub-hierarchies of the model can be trained separately and then reused in the overall hierarchical model. In planning, the advantage is that at higher levels of abstraction (both spatial and temporal) uncertainty is less. Reduced uncertainty implies more efficient planning algorithms (as we shall see in Chapter 6). In this chapter we first describe in Section 3.1 the hierarchical HMM model which has been the building block of our hierarchical approach. In Section 3.2 we introduce 56 the hierarchical POMDP model which is an extension of the HHMM with the addition of actions and rewards. In Section 3.2.1 we describe in detail the algorithm for learning HPOMDPS. In Section 3.2.2 we present a small example to demonstrate the correctness of the learning algorithm. The chapter continues with Section 3.2.3 where we describe the relation of hierarchical POMDPS to flat POMDPS. Finally, the chapter concludes with a discussion of related multi-resolution hidden Markov models in Section 3.3. 3.1 Hierarchical Hidden Markov Models The Hierarchical Hidden Markov Model (HHMM) [17] generalizes the standard hid- den Markov model (HMM) [10] by allowing hidden states to represent stochastic processes themselves. An HHMM is visualized as a tree structure (see Figure 3.1 in which there are three types of states; product states (leaves of the tree) which emit observations, internal states which are (unobservable) hidden states that represent entire stochastic processes, and end-states which are artificial states of the model that when entered exit the (parent) abstract state they are associated with. Each production state is associated with an observation vector which maintains distribution functions for each observation defined for the model. Each internal state is associated with a horizontal transition matrix, and a vertical transition vector. The horizontal transition matrix of an internal state defines the transition probabilities among its children. The vertical transition vectors define the probability of an internal state to activate any of its children. Each internal state is also associated with a single 57 end-state child. The end-states do not produce observations and cannot be activated through a vertical transition from their parent. The HHMM is formally defined as a 5 tuple (S, T,II, 2,0): 0 S denotes the set of states. The function p(s) denotes the parent of state 3. The function c(s, j) returns the 3"” child of state 3. The end-state child of an abstract state .9 is denoted by es. The set of children of a state 3 is denoted by C8 and the number of children by ICsl. 
There are three types of states. —— Product states — Abstract states - End-states o T 3 : {C3 — es} x CS —> [0, 1] denotes the horizontal transition functions, defined separately for each abstract state. A horizontal transition function maps each child state of 3 into a probability distribution over the children states of s. We write T3(c(s, i), C(S, j )) to denote the horizontal transition probability from the it" to the jth child of state 3. As an example, in Figure 3.1, T54(s7, 38) = 0.6. o H5 : {Cs — es} —2 [0, 1] denotes the vertical transition function for each abstract state 3. This function defines the initial distribution over the children states of state 8, except from the end-state child e3. For example, in Figure 3.1, 1131(32) = 0.5. 0 Z denotes the set of discrete observations. 58 o 03 : Csmdm ——> [0,1] denotes a function that maps every product state (child of s) to a distribution over the observation set. We write Opls)(s,z) for the probability of observing z in state 3. 03'0de is the set of all product states which are children of s. Figure 3.1 shows a graphical representation of an example HHMM. The HHMM produces observations as follows: 1. If the current node is the root, then it chooses to activate one of its children according to the vertical transition vector from the root to its children. 2. If the child activated is a product state, it produces an observation according to an observation probability output vector. It then transitions to another state within the same level. If the state reached after the transition is the end-state, then control is returned to the parent of the end-state. 3. If the child is an abstract state then it chooses to activate one of its children. The abstract state waits until control is returned to it from its child end-state. Then it transitions to another state within the same level. If the resulting transition is to the end-state then control is returned to the parent of the abstract state. Fine et a1. [17] describe a hierarchical Baum-Welch algorithm that is able to re-estimate the model parameters (including transitions matrices, vertical vectors, and observation vectors) of an HHMM from observation sequences. The hierarchical Baum-Welch algorithm operates in a recursive fashion similar to the standard Baum- Welch procedure. In the next section we describe the extension of the HHMM into a decision model, the hierarchical POMDP. 59 Figure 3.1: An example hierarchical HMM modeling two adjacent corridors. Only leaves (production) states (s4, s5, s6, s7, and s8) have associated observations. The end-states are e1, e2, and e3 3.2 Hierarchical Partially Observable Markov De- cision Process Models Robot navigation is an example of a planning problem that requires extending the HHMM model to include actions, as well as reward functions for specifying goals. We call this extended model Hierarchical Partially Observable Markov Decision Process model [79]. Unlike HHMMS, we extend the definition to include multiple entry points into abstract states which we call entry-states, and multiple exit-states (defined as end- states in HHMMS). Figure 3.2 shows a sample HPOMDP with multiple entry/exit states for corridor environments. The exit and entry states represent the spatial 60 borders of each (corridor) abstract state. If we only had single exit and entry states, the consequences of primitive actions would not be modeled correctly. 
For example, an exit from the east side of a corridor would have a non-zero transition probability to an adjacent state of the corridor at the west side. On the other hand, we could have two abstract states representing each corridor (one for each direction). However, having two abstract states for the same spatial lo- cations would mean that the abstract states share children. Even though the HHMM allows abstract states to share children, each abstract state induces its own set of parameters on the children (vertical vectors, horizontal transitions, and observation models). This would imply learning model parameters for each abstract state sepa- rately, which is redundant. Since we are interested in learning the dynamics of the world (e.g., effect of actions on states), it would be redundant to learn them more than once (e.g., once when going to the east, and once when going to the west side of the corridor). In effect, since abstract states do not share parameters the representation becomes equivalent to having each abstract state having its own exclusive set of children. If we allowed abstract states to share parameters then the spatial abstraction is no longer obvious and the model abstraction tends to become more temporal and task specific. For example, if two abstract states shared the same sub-structure (states and parameters) then we would not be able to distinguish the two abstract states based on sequences of observations emitted by the substructure. For example, in robot navigation, it is important that the robot knows its true location or true abstract state that it is at, so that it can take the appropriate action. If the appropriate 61 action for each of the abstract states is different, then the robot will not know which one to take. However, if both abstract states require the same action then sharing sub-structures may be possible. In our HPOMDP definition we assume that abstract states do not share any parameters or any children. In addition to the above extensions, we have also removed the end-state associated with the root abstract state. Instead, we consider the end of an episode of a sequence observations and actions to be the last child state of the root which has produced an observation (if it is a product state) or the last child state of the root exited (if it an abstract state). By removing the end-state we allow the model to be trained using training sequence that can terminate at any child of the root. This is justified, since we need to use our model for planning where we need to construct arbitrary plans, which can terminate anywhere, and therefore produce arbitrary training data as well. We formally describe the HPOMDP model below: 0 S denotes the set of states. Unlike HHMMS, we have an additional type of state called entry states. Product states which produce observations. Exit-states which return control to their parent abstract state when en- tered. Entry-states which belong to abstract states and when entered activate the children of the abstract state they are associated with. Abstract states which group together product, entry, exit, and other abstract states. C; denotes the product children of abstract state 3. C: denotes the 62 abstract state children of abstract state 3. C; denotes the set of exit states which belong to children abstract states of abstract state 8. C; denotes the set of entry states which belong to children abstract states of abstract state 8. 
X5 denotes the set of exit states that belong to s and N 3 denotes the set of entry-states that belong to abstract state 3. p(s) denotes the parent state of s. A denotes the set of primitive actions. For example, in robot navigation, the actions could be go-forward one meter, turn-left 90 degrees, and turn-right 90 degrees. Primitive actions are only defined over product states. T(s’|s,., a) denotes the horizontal transition probabilities for primitive actions. If s is an abstract state, 3,, denotes the current exit state. If s is a product state then 3,, refers to the product state itself since there are no exit states for product states. The resulting state 3’ could be the entry point of some abstract state, an exit state associated with the parent of s, or some product state. V(s;,]s,,) denotes the probability that entry state n which belongs to abstract I state 3 will activate child state 3’. 3,, is an entry state of state 3' (if s’ is an abstract state), and can be the state itself if s’ is a product state. Z denote the set of discrete observations. O(z|s, a) denotes the probability of observation 2 in product state 3 after action a has been taken. R(s, a) denotes an immediate reward function defined over the product states. 63 For the example HPOMDP in Figure 3.2, there are 16 distinct observations for each product state (the combination of wall and opening on the four sides of a state). The probability of each observation can be constructed from individual sensor models for each side of the state, which are shown in Table 3.1. Using the different sensor models for each side of the state we can construct a joint probability distribution that gives us the probability of each one of the possible 16 observations as described in Equation 3.1 (assuming independence of the observations, or independence of each side). O(zw, 25, .23, lea, s) = P(zwls, a)P(23|s, a)P(zE]s, a)P(zN|s, a) (3.1) Orientation W W S S E E N N State Action P(w) P(o) P(w) P(o) P(w) P(o) P(w) P(o) s4 go—left 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 84 go-right 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 35 go-left 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 35 go-right 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 36 go-left 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 35 go—right 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 89 go-left 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 39 go—right 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 310 go—left 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 s10 go—right 0.1 0.9 0.9 0.1 0.1 0.9 0.9 0.1 Table 3.1: The table describes the sensor models of the product states of the HPOMDP in Figure 3.2. For every product state and action there are four sensors, one for every side of the state (W,S,E,N). Each sensor model gives the probability of a wall P(wls, a) and opening P(ols, a) given the current state and last action taken. 64 Figure 3.2: An example hierarchical POMDP with two primitive actions, “go-left” indicated with the dotted arrows and and “go-right” indicated with the dashed arrows. This HPOMDP has two abstract states 3, and 32 and each abstract state has two entry and two exit states. The product state s4, s5, 36, 89, and 810 are associated with sensor models which are described in Table 3.1. 3.2.1 Hierarchical Baum-Welch for HPOMDPS A hierarchical Baum-Welch algorithm (shown in Figure 3.3) for learning the param- eters of a hierarchical POMDP can be defined by extending the hierarchical Baum- Welch algorithm for HHMMS. 
The Baum-Welch algorithm is an Expectation Maxi- mization algorithm which modifies the model parameters such that the log P(Z IA, x\) is increased after very training epoch, where Z is a sequence of observations, A is a sequence of actions, and /\ the current model parameters T,O, and V. For the Expectation step the Baum-Welch algorithm estimates two variables, the 65 horizontal variable g, and the vertical variable x. The 5 variable is an estimation of the probability of a horizontal transition from every state to every other state (both of which have the same parent), and for every time index in the training sequence. The X variable is an estimation of a vertical transition from every parent state and child, for every time index in the training sequence. For the Maximization step, the Baum-Welch algorithm reestimates the model pa- rameters by using the frequencies of occurrence of vertical and horizontal transitions (provided by 5 and x). For example, to reestimate the horizontal transition probabil- ity between two states 3 and 3’ under an action a, the algorithm divides the number of times a transition occurred from s to s’ under action a over the number of times a transition occurred from state 3 under action a. Next, we describe in detail the Expectation and Maximization steps. A major part of the expectation step is the computation of the forward a (similar to flat hidden Markov models) which is defined in Equation 3.2 and described in Figure 3.4. a(p(s),,, 3,, t, t + k) = P(z,, ...z,+,,, s exited from s, at t + k I (3.2) a,..1, a,...a,+k, p(s) started at t from entry state p(s)”, A) We say that a product state finishes at time t after the production of observation 2, and an abstract state finishes when its exit state 3,, is entered after observation 2, has been produced and action a, initiated. We also say that a product state 3 started at time t, if at time t it produced observation 2,. An abstract state 3 starts 66 Procedure HPOMDP-Baum-Welch 0 Input — Sequences of observations Z1, ...,Zk and actions Al,...,A),. * Z,: a sequence of observations 21, ...zT. * A,: a sequence of actions a0, ..., aT. — Model parameters A = T, O, V. * T: horizontal transition matrices. * O: observation models of product states. * V: vertical transition vectors from entry states to children. 0 Output — A new set of A’ parameters such that: logP(Z1,...,Zk|A1,...Ak,A’) 2 logP(Z1,...,ZklA1,...,Ak,A) Figure 3.3: This is the Baum-Welch algorithm for learning the parameters of an HPOMDP model. The algorithm increases the log likelihood of the observation se- quences given the actions sequences and current model parameters after every training Figure 3.4: The 0 variables gives the probability of the observation sequence z,...zt+k ending in state 3,, at time t+ k (denoted as a bold circle), given that actions at, mam, were taken and the parent of s, p(s) was started at time t from entry state p(s)" (also denoted as bold circle). 67 at time t, if at time t one of its children produced observation 2, but observation 2,_1 was generated before 3 was activated by its parent or any other horizontal transition. We can calculate the a variable recursively from the leaves of the tree to the root in ascending time order as follows: 0 For T = 1...N — 1 where N is the sequence length and for level = D...1 where D is the tree depth, we calculate the a terms. For all internal states we calculate the (1' terms for all sub-sequences t, t+ k and for the root only the sequence 1, T. 1. 
If s is a leaf state at depth : level, then for t = T we can calculate a as shown in Equation 3.3 and described in Figure 3.5. When p(s) 2 root we only use this equation for t = 1, because an observation sequence starts from the root and can only start at time t = 1. a(p(s)n, s, t,t) 2 V(s|p(s),,)0(2,|s,a,_1) (3.3) 0 Figure 3.5: Equation 3.3 computes the probability that at time t the parent state of product state 3, p(s) produces a single observation 2,. The observation is produced by a vertical activation into product state 3 from entry state p(s)". 2. If s is a leaf state at depth 2 level then for t = 1...T—- 1, and k = T —t we 68 can calculate a as shown in Equation 3.4 and described in Figure 3.6. The equation computes the probability of an observation sequence that started at at time t from entry state p(sn) and finished at a child state 3 at time t + k. The computation is done iteratively by considering the probability of the observation sequence which finished at some child s; a time step earlier (t + k — 1). When p(s) 2 root we only use this equation for t = 1, because an observation sequence starts from the root and can only start at time t = 1. ave)... s. t, t + k) = Z 0(P(S)na 3;: t: t + k — 1)T(s|s;, at+k—1) 0(Zt+kl3a at+k——1) sgecflslocg") (3.4) Figure 3.6: Equation 3.4 gives us the probability that the parent of the product state 3, p(s) produced observations from time t to time t+ k, (k 2 1) and that the last observation was produced by product state s. 3’, is either the exit state of some abstract state 3' (such as 3’1.) or a product state 3’ (such as 3’2). 3. If s is an abstract state at depth 2 level then for t = 1...T, and k = T—t we can calculate 01 as shown in Equation 3.5 which can be better understood with Figure 3.7. When p(s) 2 root we only use this equation for t = 1. 69 we)... 5., t, t + k) : Etienne.) Z a(sn,ix,t,t+k)T(sx,limam) ”ENS iIEC;UC; k——1 3.5 + 2 009(8)... 3;, t,t +1)T(sn(3;,at+l) ( ) ’:0 826059505”) 2 a(sflazlyt+l+ 17t+k)T(SI, '21:, at+k> iIECguC; Figure 3.7: The figure illustrates Equation 3.5 which is used for situations where the parent of state 3, p(s) starting from entry state p(s)n produced observations from t to t + k (where k 2 0), and at time t + k abstract state 3 was exited from 3,. We need to sum over all lengths of observation sequences that could have been produced by abstract state 3. That is, 3 could have produced all observations 2,...2,+k, or in the other extreme 2,, 2,+k_1 could have been produced by the adjacent states of s (such as 3’1 and 3’2), and only the last observation 2,4,), produced by 3. Overall, for each sublevel (or internal state) of the model we need to compute approximately T2 / 2 Oz terms, where T is the length of the observation sequence (see Figure 3.8). In addition, to compute an a term we might need in the worst case to look at all the states of the model N and all related 02 variables calculated in the past. Thus the time complexity is 0(N T 3). 70 -—_——-——4—-———1_>\. f l l 2 3 4 Figure 3.8: Each square is a new a computation for a subsequence t + k for some internal state. The dashed arrows show dependencies on previous computations. Once the a variable is calculated it can also be used to estimate the probability of an observation sequence Z = 2,, ...2T given an action sequence A = a0...aT and the model parameters A as shown in Equation 3.6. 
P(ZIA, A) = Z a(s, i, 1,T), where s 2 root entry (3.6) iEC;UCg The next important variable for the expectation step is the backward variable 6 (similar to HMMS), which is defined in Equation 3.7 and described in Figure 3.9. 501(8)“ 8n, t, t + k) = P(2,, ...2,+k|a,, ...a,+k, sn started at t, p(s) generated (3.7) 2,...2,+k and exited from p(s), at t + k, A) The beta variable can be calculated recursively from leaves to root. The time variable is considered in descending order as follows: 0 For level = D...1 where D is the tree depth. 1. If s is a leaf state at depth 2 level, for t = N — 1...1 (where N is the 71 Figure 3.9: The ,8 variable (Equation 3.7) denotes the probability that a state 3 was entered at time t from entry state 3,, and that the observations 2,...2,+k were produced by the parent of s, which is p(s), and actions at, ...a,+k were taken, and p(s) terminated at time t + k from exit state p(s),,. sequence length and T = N — 1), Equation 3.8 computes the probability of a single observation and is described in Figure 3.10. When p(s) 2 root we only use these equations for t = T. [3(p(s),,s,t, t) = O(2,|s,a,_1)T(p(s)$|s,a,), p(s) aé root (3.8) 16(p(8)73atat) : 0(ztl8vat—l)7 p(S) : TOOt (39) em 0 Figure 3.10: The figure illustrates Equation 3.8 which computes the probability that at time t the product state 3 produced observation 2, and then the system exited the abstract state p(s) from exit state p(s), 72 2. If s is an abstract state at depth 2 level, for t = N — 1...1, Equation 3.10 computes the probability of observation 2, and is described in Figure 3.11. When p(s) 2 root we only use these equations for t = T. 6(1)(3)rv 3m ta t) z Z Z V(inISn)/8(Sxfiifl3tit) T(P(3)xlsxvat)v p(8)#7‘00t 3.1-EA", 17,605ch (3.10) (3(1)(3),«5‘netat) = Z Z V(inlsn)fi(sx,in,t,t) , p(s) 2 root SIEX" ineC;‘,uC;; (3.11) Figure 3.11: Equation 3.10 computes the probability that at time t that any of the children of abstract state 3 (such as i, and i2) has produced observation 2, and then the parent of s p(s) was exited from exit state p(s), 3. For t = N — 2...1 73 (a) If s is a. leaf state at depth 2 level, then for k = 1...N — 1 — t, we can calculate 6 as shown in Equation 3.12 and described in Figure 3.12. When p(s) 2 root we only use this equation for t + k = T. [3(p(s),, s, t, t + k) = O(2,|s,a,_1) Z T(s;(s,a,)e(p(s),,s;,,t+1,t+ k) s',,eC£(‘)uC,’,’(’) (3.12) Figure 3.12: The figure illustrates Equation 3.12 which defines the probability that at time t the product state 3 produced observation 2, and its parent p(s) continued to produce observations 2,+1...2,+k and exited at time t+ k from exit state p(s),“ where k 2 1 (or at least two observations are produced). (b) If s is an abstract state at depth 2 level, then for k : 1...N — 1 — t, we can calculate 6 as shown in Equation 3.13 and described in Figure 3.13. When p(s) 2 root we only use these equations for t + k = T. 74 r3(1)(8)13811)t3t + k) = Z Z V(z‘..|s..)fi(s..z‘...t.t+k) T(p(s).ls.,a.+k) 3.6.x: agave; k— 1 + Z V(i,,|s,,),’3 (3,, imt, t+ l) (3.13) [:0 inECgLJCf, T(s’n|s,,,a,+,)fl(p(s),,s;,,t+l+ 1,t+k) , s’ nECp(8’UCpm p(s) 75 root ),,smt t+ k) E V(i,,|s,,)fl (s,,i,,,t,t+k) SIEX’ inecgUC; H + Z (linISn)l3(Sx,l7,,t,t+ l) (314) 1:0 innguC; T(s;,|s,., a,+,),8(p(s), 3’", t + l + 1,t + k) , s’, ,ECP(” UCPU) p(s) — —r00t Calculating the [3 is also an 0(NT3) computation. The computation dependencies are shown in Figure 3.14. 
The next variable is g, which is one of the results of the expectation steps, is defined in Equation 3.15, and described in Figure 3.15. 75 Figure 3.13: The figure describes Equation 3.13 which is the probability that at time t the abstract state 3, starting from entry state 5,, produced some observations, and then the parent of s, p(s) continued to produce the rest of the observations and exited at time t+ h from exit state 19(3),. Since 3 is an abstract state we have to consider all possible lengths of observations that could have been produced by abstract state 3. So 3 could have produced a single observation 2,, up to all the observations 2,...2,+,,, where k 2 1. I 2 3 4 Figure 3.14: Each square is a new 6 computation for a subsequence t + k for some internal state. The dashed arrows show dependencies on previous computations. 76 {(15, 3x, 5;?) : P(s finished at time t from s, (3.15) s’ started at time t + 1 from 3;, | a0, ...aT, 2,, ...2T, A) Figure 3.15: The 6 variable denotes the probability of making a horizontal transition from exit state 3,, to entry state 3;, at time t, given a sequence of observations and actions and the model parameters. To simplify calculation of the 5 variable we define two auxiliary variables called 7)," and 770m. The 77,,, is defined in Equation 3.16 and described in Figure 3.16. 7),,,(t, sn) 2 P(2,, ...2,_1, s was entered at t from 3,, a0, ...,a,_1, A) (3.16) Figure 3.16: The 7)," variable defines the probability that a state sis entered at time t either vertically or horizontally given a sequence of actions and observations up to time t — 1. We can estimate 7)," recursively from the root to the leaves of the tree. 1. If s is a child of the root and t = 1 we can calculate "in as shown in Equation 77 3.17, and for t > 1 as shown in Equation 3.18 1),-"(1, sn) 2 V(s,,|p(s),,), where p(s)” 2 root (3.17) "in“: 5n) : Z 0(p(8)n38’$319t_1)T(8nI8:rJat— 1) (318) sgecgl’hicfl” 2. If s is not a child of the root 77in“: 3n) 2 Z Zmdkmwh) Z a(l’(s)n73’ka3t—1)T(3nls’xvat—1) P(s)nENP(5) k=0 sgecgulucgul + 77in (t: p(3)n)V(3an(3)n) (3.19) which, is the probability that state 3, which is not a root child, was entered at time t from entry state 3". To calculate this probability we have to consider all the possible times 0...t — 1 and all possible entry states p(s)” that the parent of state 8, p(s) was entered. The 770,, variable is defined in Equation 3.20 and described in Figure 3.17. It can also be estimated recursively from root to leaves. 770,,_,(t, 3,) = P(s finished at t, from 3,, 2,“, ...2T a,, ...,aT, A) (3.20) 78 Figure 3.17: The 710,, variable defines the probability that a state s exited at time t and observations 2,+1, ...2T were perceived and actions a,, ..., aT taken. 1. If s is a child of the root. 7),,u,(t, 3,) = Z T(s;,|sx,a,)fi(p(s),s;,,t+1,T), t < T,p(s) 2 root sgecg’")ucgm (3.21) 770”,(T, 3,) = 1.0 (3.22) 2. If s is not a child of the root and for t S T, where T = N — 1. vaudt, 8;) = Z Z Z T (stlsx,at)fl(p(s)x,s;,,t+1,k) n,,,(k,p(s),) P3)(1:EXP(5) kzt'l‘l SQEC£(8),UC£(5) + T(p(3)x]3x, at)770ut(ta p(s)?) (3.23) which is the probability that state 3, which is not a root child, was exited at time t from exit state 3,, and then the observations 2,+1...2T were produced. 
To calculate this probability we need to consider all the lengths of observation sequences from t+1...T, including a zero length observation sequence that could have been produced by the parent of s, p(s) before p(s) exits from exit state 79 19(3):"- Now we can calculate 6 in Equation 3.15 recursively from the root to the leaves of the tree. 1. If s and s’ are children of the root p(s) = p(s), = p(s),, , aps n,s,,1,th;,sx,a flps ,,s;,,t+1,T {(T, 33:, 3],) Z 0 2. If s and s’ are not root children {(t, 31: 3:1) = ’- t 1 I ———p(Z|A, ,) Z Z n,n(k,p(s)..) a(p(s),,,s,,,k,t) T(sn|3,,a,) _k:1P(3)n€Np(‘) (326) Z Z: fl(p(8)$’8h’t+1ak)nout(kap(3)x) , t< T bk=t+1p(3)IEXP(’-') €(T9 3:1." 3:1) Z 0 which calculates the 6 variable by considering all the possible times 1..t and entry states p(s)” that the parent state of s, p(s) was entered and then at time t a transition was made from child exit state 3,, to child entry state 3;, and then 80 consider all the possible times from t + 1...T and all possible exit states p(s), that p(s) was exited from. The equation below shows the probability of going from the non-root child state 3,, to the exit state p(s), at time t. 77in(kap(8)n) a(p(3)na Smykat) kZ— I. p(s)nENP(5) TU) Slxlsxaat) )Uoutlta P(s )1 ) th at.s.,p.>= ——Z—P(| , , (3.28) The second result. of the expectation step is the X(t, p(s)", 3") variable, shown in Equation 3.29 and described in Figure 3.18. X(t, p(s ),,,s n) = P(p(s),, started at time t, 3,, started at time t | (3.29) a0, ...aT, 2,, ...2T, A) Figure 3.18: The X defines the probability that entry state 5,, of child s was activated by the entry state p(s)" of its parent p(s) at time t given the action and observation sequence. We can calculate the X variable as follows: 1. If s is a child of the root p(s) = p(s),, = p(s), x(1, p(8)n, 3..) = V(S"’p(3)"1),(fiz“l’§):$ 8’“ I’T’ S) (3.30) 81 2. Otherwise for 1 Z t S T X“ I)(5)71‘_971) _ 71....(t p(s) )V ' (3,,|p(s) l)" 2 Z 3( () tk) (k P(Zl4, A) / p s 1,3", , 770m ,P(3)x) k——tp(3)1€XP(8) (3.31) Based on these. variables we can reestimate the model parameters. The vertical vectors are reestimated in Equations 3.32 and 3.33. V(s,,|p(s),,) : X(1,p(s),,, 3"), if p(s) 2 root, 5,, E Cg”) U Cg“) (3.32) 23:1 X(t1 p(5)n9 8n) 234,603")UC§(‘) Zf=1X(taP(3)mS’n) (3.33) if 8757‘00t Equation 3.33 finds the vertical transition to any non-root child by calculating the V(3nlp(3)n) : number of times the child was vertically entered over the total number of time the child and its siblings were entered. The horizontal transition matrices are reestimated in Equation 3.34. The reestimation calculates the average number of times the process went from state 3 to state 3’ over the number of times the process exited state 3 under action a. T Zt=1|at=a €(t, 32:, 3;) T(s’ lsx, a) = T n Zt=1|a,=a 28:60£(5)UC7,’:(3’ €(t’ 81" Sill (3.34) The observation vectors are reestimated in Equation 3.35. The reestimation of the 82 observation model calculates the average number of times the process was in state 3 and perceived observation 2 over the number of times the process was in state 3. T T Zt=1,z,:2,a,_1=a X(t’p(8)n’ S) + Zl=2,2r=z.at—1=a 71.71“, S) T T 2t:1X(t,p(3)n,S) + 2‘22 7in(t, 3) where (3.35) 7371078) = Z €(t—1,S;,S) SQEC‘5(S)UC£(8) O(2]s,a) :- All of the above formulas of the hierarchical Baum-Welch are used to estimate the model parameters, given a single observation and action sequence. 
However, in order to have sufficient data to make reliable estimates of all model parameters, one has to use multiple observation and action sequences. If we had K sequences of data then for each iteration the algorithm needs to adjust the model parameters such that the probability of all sequences is maximized. Assuming the sequences are independent of each other (iid), then the probability of all sequences can be calculated as the product of the individual sequences as shown in Equation 3.36 K P(Z,,...,ZK|A1, ...AK,A) = HP(Z,,|A,,, A) (3.36) k=1 Since reestimation is based on frequencies of occurrence of various events, the new reestimation formulas can be constructed by adding together the individual frequen- cies of occurrence for each data sequence. The vertical transition are reestimated in Equations 3.37 and 3.38, the horizontal in Equation 3.39, and the observation models 83 in Equation 3.40. K V(s,,,|p(s),,) = Z X,(1, p(s)”, 3,), 2' f p(s) : root, 3,, 6 03(3) 0 05(8) (337) kzl V(s,,|])(s)n )_ 2k: 12::- 1Xk(t P(S ),,,Sn) 1 2k: 12.9,, 60”“)003“) Elf: 1Xk( (t P(3 3n) 8’”) (3.38) if s 7'5 root 2k: 12?: 1|a,=a£k( (t 31:231.) Ek212t=1|atza ngecgl’)uc£(’) £k(tJ8-’EJ‘SZ) (3.39) T(8;iIS-FJ 0:) K T), kzl t=1,2,:2,_a, 1: =an(t’p( 8)" 3)+th —2,:z, 2,.a, (za 71711:“? 8) Zk=12t1Xk(p(3)n:5)+ Et_ —2 ”link“, 3) (3.40) O(2|s,a) 2 3.2.2 Learning a small corridor environment To test the HPOMDP learning algorithm we created a training set of 100 action and observation sequences by sampling the original HPOMDP in Figure 3.2. We then used these sequences to train different models, which were initialized with random transi- tions (horizontal and vertical) and random observation models. To create sequences of actions and observations, we simulate the model and choose transitions (vertical and horizontal ) using the inverse transform technique [80]. For each sequence we first randomly choose an action a0 and then start from the root and vertically activate submodels until a product state is reached and observation 2, produced. We then 84 randomly choose the next action and transition to the next state. If the next state is a product state we produce the next observation. If the next state is an exit state we transition to an adjacent state of the parent. If the adjacent state is an abstract state, we transition vertically to submodels until a product state is reached. If the adjacent state is a product state we produce an observation. Figure 3.19 shows a model with random transition probabilities and Table 3.2 shows the initial sensor model proba- bilities. Figure 3.20, and Table 3.2 show the model parameters after training. It is obvious that the algorithm learns a model which is close to the original. Figure 3.19: This is an HPOMDP model where the original transitions have been replaced by random values for training. We trained different random model whose fit to the training data versus the train- ing epochs is shown in Figure 3.21. 
Since we have more than one sequence during training the log—likelihood log P(Z IA, A) represents all the data sequence (is the sum 85 Orientation W W S S E E N N State Action P(w) P(o) P(w) P(o) P(w) P(o) P(w) P(o) s4 go-left 0.27 0.73 0.5 0.5 0.64 0.36 0.29 0.71 34 go-right 0.52 0.48 0.63 0.37 0.1 0.9 0.38 0.62 35 gO-left 0.14 0.86 0.37 0.63 0.19 0.81 0.67 0.33 35 go-right 0.57 0.43 0.19 0.81 0.31 0.69 0.7 0.3 36 go-left 0.57 0.43 0.3 0.7 0.78 0.22 0.4 0.6 86 go-right 0.1 0.9 0.2 0.8 0.79 0.21 0.34 0.66 39 go-left 0.51 0.49 0.85 0.15 0.53 0.47 0.19 0.81 39 go-right 0.5 0.5 0.34 0.66 0.9 0.1 0.57 0.43 310 go-left 0.87 0.13 0.05 0.95 0.36 0.64 0.45 0.55 310 go-right 0.55 0.45 0.65 0.35 0.8 0.2 0.36 0.64 Table 3.2: The table shows an initial sensor model which has been created randomly. 0.99 § 0.7] Figure 3.20: The random HPOMDP model after training, is more similar to the I . 7 "I: . . I ’4 0.29 54 . I 0.09 original model (Figure 3.2) after training. 86 0734 0.66 , I ‘00: Orientation W W S S E E N N State Action P(w) P(o) P(w) P(o) P(w) P(o) P(w) P(o) s4 go—left 0.10 0.90 0.92 0.08 0.20 0.80 0.92 0.08 s, go-right 0.0 1.0 1.0 0.0 0.06 0.94 0.99 0.01 55 go-left 0.01 0.99 0.77 0.23 0.0 1.0 0.48 0.52 .95 go-right 0.0 1.0 1.0 0.0 0.05 0.95 1.0 0.0 36 go-left 0.0 1.0 0.83 0.17 0.05 0.95 0.95 0.05 86 go-right 0.0 1.0 0.81 0.19 0.15 0.85 0.87 0.13 39 go-left 0.15 0.85 0.94 0.06 0.13 0.87 0.98 0.02 39 go-right 0.12 0.88 0.89 0.11 0.14 0.86 0.94 0.06 310 go-left 0.26 0.74 0.92 0.08 0.0 1.0 1.0 0.0 310 go-right 0.17 0.83 0.94 0.06 0.13 0.87 0.91 0.09 Table 3.3: The table shows the sensor model after training. The sensor models are almost the same as the original sensor parameters in Table 3.1. of the individual log-likelihoods). It is obvious from the plot that the different random models converge fast to the original model. 3.2.3 Converting HPOMDPS to POMDPS Every HPOMDP can be converted to a flat POMDP. The states of the flat POMDP are the product states of the HPOMDP, and are associated with a global transition matrix that is calculated from the vertical and horizontal transition matrices of the HPOMDP. To construct this global transition matrix of the equivalent flat POMDP, for every pair of states (3,, 82), we need to sum up the probabilities of all the paths that will transition the system from 31 to 32 under some action a. For example in Figure 3.2 the transition from 89 back to itself under the go-left action is not simply 0.1, but rather the summation of two paths. On path is 89, 39 which gives probability of 0.1, and another path is 89, 82, s9 which gives probability 0.9 x 0.1.0 2 0.09. Therefore the total probability is 0.19. The equivalent flat POMDP is shown here in Figure 3.22. 87 -600 . . , 1 . . j , Original —_.__ a a random1------.x -800 l T 7 random 2 ----- 4.-.. - random 3 ......... E, ........ -1000 - g . 3 -1200 - - o : :5. 5. S. '1400 j." '1’; _ N, 0' (l ES -1600 a; 1 o .‘s -1800 _ -2000 g _ f -2200 [P l 1 1 1 1 1 1 Training epochs Figure 3.21: Training runs of three different random models shows that the log like- lihood converges fast to the original model. The original model does not improve as expected since it was used to create the training data. In the log likelihood function Z is the set of observation sequences, A is the set of action sequences, and Model is the set of parameters that are improved every training epoch (horizontal matrices, vertical vectors, and observation models). 
Figure 3.22: This is the equivalent fiat POMDP of the hierarchical POMDP shown in Figure 3.2. 88 Converting an HPOMDP to flat model, however, is not advantageous since we lose the abstract states and the vertical transitions which we can exploit later in learning and planning. For example in model learning, the fact that we have abstract states allows us to initially train submodels independently, and as a result start training of the model from a good initial model. This results in better trained models since the hierarchical Baum-Welch is an expectation maximization algorithm and converges to local log likelihood maxima depending on the initial model. In planning, we can use the abstract states to construct abstract state (belief state) representations and also to construct macro-actions that execute under the abstract states. As a result we can construct faster and more successful plans. The vertical vectors help to compute hierarchical plans in that they give us the initial relationship between abstract states and their children. Nonetheless, the equivalent flat POMDP will provide different likelihood to the data than the hierarchical. The reason is that the a term for hierarchical models, which defines the probability that an an observation sequence starts from some ab- stract state S and finishes at the exit state of some child state 33,, will be zero unless the child state is actually able to produce part of the observation sequence. This concept is better understood in Figure 3.23. This property of the HHMM/HPOMDP models, in that the next abstract state cannot be entered until the current abstract state has exited, allows us to capture relationships between states that are far apart. 89 ( .Q/«Q OQ/‘O HIERARCHICAL EOU IVALENT FLAT Figure 3.23: For an observation sequence of length 2 the hierarchical likelihood will be zero, since we need at least three observations to exit the abstracts state S. While the likelihood in the equivalent flat model will not be zero. 3.3 Related multi-resolution Hidden Markov Models HHMMS can also be viewed as a specific case of Stochastic Context Free Grammars (SCFGs). The generalized hierarchical Baum-Welch presented in [17] is inspired by the Inside-Outside algorithm which is an EM type algorithm for learning the prob- abilities of SCFG rules [81], [82]. The Inside-Outside algorithm is less efficient than the EM algorithm for HHMMS because it also runs in cubic time in terms of the length of the training sequence, but also runs in cubic time in terms of the number of non-terminal symbols. Additionally, the likelihood of observed sequences induced by a SCFG varies dramatically with small changes in the parameters of the model. HHMMS differ from SCFGS in that they don’t share subtrees, and the whole model is of bounded depth. However, it would not be difficult to extend the generalized Baum- 90 Welch in [17] to handle HHMMS whose subtrees share structure. Since the models of each submodel (vertical vectors, horizontal transitions, and observation models) are defined separately based on the parent abstract state of the submodel, then one way of learning shared substructures models is to average all model parameters that are being shared by different abstract states. Another way of representing HHMMS is with a special case of Dynamic Bayesian Networks (DBNs) [83]. The transformation of HHMMS to DBNs allows us to train them by combining standard inference techniques in Bayesian nets such as the junction-tree algorithm [84]. 
The junction-tree algorithm replaces the expectation step in the hierarchical Baum-Welch algorithm. The major advantage is that with this combination HHMMS can be trained in linear time while the conventional hier- archical Baum-Welch algorithm requires cubic time [17]. The disadvantage is that the DBN model grows linearly with the size of the training data. In addition, it still remains to be seen how well such representations scale to larger models and how the learned models compare with models learned by the standard hierarchical Baum-Welch algorithm. Another closely related model to HHMMS is the Abstract hidden Markov model (AHMM) [85]. This is a DBN representation closely related with the options frame- work [25]. The states however are not fully observable and the model becomes similar to an HHMM. In this framework at each level of abstraction there is a set of abstract actions available. The states for which these abstract actions are applicable form a decomposition of the state space. Plan execution starts from t0p of the hierarchy where the most abstract level issues an abstract action. The abstract action may issue 91 other abstract actions at the next lower level of abstraction. At the lowest level only primitive actions are executed which terminate after every time step. Termination of intermediate level abstract actions depends on a special set of destination states available for each level of abstraction. Upon termination of an abstract action in some abstract level the level above chooses the next abstract action. The process ter- minates when the abstract action at the most abstract level terminates and all other called abstract actions at lower levels terminated as well. The DBN representation models the relationships between the different policies at all levels of abstraction, the termination conditions, the hidden states, and the primitive actions. The AHMIVI framework can be used to infer the probability of each possible ab- stract action at each level given a sequence of observations. The AHMM differs from HMMlVI in that it does not model the horizontal relationships between abstract states (or spatial decompositions). However, it is more similar to our extension of the HHMM model with actions and rewards, in that it also models multiple actions. This representation models the effects of abstract actions, while as we shall see later in our HPOMDP model usage in Chapter 6, the effect of abstract actions are computed based on the model of the environment. Embedded hidden Markov models are another special case of the HHMM model, which were applied to face recognition [86]. The basic difference is that they are always of depth 2 and the number of observations produced by each abstract state is known in advance. This leads to efficient algorithms for estimating the model parameters. Another well studied class of multi-resolution models are the segment models [87]. 92 In the simplest segment model the self-transitions of abstract states are eliminated and state duration is modeled explicitly either in a parametric or non-parametric fash- ion. The reason is that self transitions induce an exponential duration while explicit durations can be arbitrary. These simple models are also called variable duration HMl\~’Is [10] [88], or hidden semi-Markov models [89]. These models can be thought as special cases of 2 level HHMMS. Unfortunately, the computational complexity of training durational HHMMS is usually proportional to D2 where D is the maxi- mum allowed duration [10]. 
Nonetheless, there has been work that demonstrates that training durational models can be done in time proportional to D rather than D2 [90]. Finally, another related model, which is also called hierarchical POMDP, is pre- sented in [91]. Even though the underlying dynamics of the world are still represented as a flat POMDP, the model becomes hierarchical based on the solution the authors propose. The POMDP solution is based on first solving smaller POMDPS. The smaller POMDP problems are created based on decompositions of the action space. This chapter has described the HPOMDP model, its learning algorithm, and a small experiment that showed how the algorithm converges successfully. It has also shown the relation of HPOMDPS to flat POMDPS and concluded with a discussion on related multi-resolution models. In the next chapter we describe the robot navi- gation domain and show how HPOMDP models are used to model robot navigation environments. 93 Chapter 4 The robot navigation domain In this chapter we describe the robot navigation domain. In Section 4.1 we describe the different robot platforms used in the learning and planning experiments of this dissertation. In Section 4.2 we describe the navigation environments used, and how they are represented as hierarchical POMDPS. In Section 4.3 we describe the features that the robot can observed and how it is done through the use of a neural network. In the final Section 4.4, we describe in detail the overall robot navigation architecture. 4. 1 Robot platforms We have conducted experiments using two robot platforms. The first is a real robot platform, which is a Nomad 200 robot called Pavlov shown in Figure 4.1 [92]. The nomad 200 is an integrated mobile robot platform with four sensory modules including tactile, infrared, ultrasonic, and vision. The platform has an onboard computer for sensor and motor control, and host computer communication. The mobile base keeps 94 track of its position and orientation (integrated over time) [92]. Figure 4.1: This is PAVLOV, a Nomad 200. Pavlov is equipped with 16 sonar sensors, 16 infrared-red sensors, two rings of bumper sensors (with a total of 16 switches), an arm, and a camera. Pavlov has an 133 MHz Pentium onboard processor and an external Pentium II 400MHz laptop. The simulated robot platform simulates the real Nomad 200 robot. In the simu— lation environment the robot can move in a two dimensional world populated with convex polygonal structures ( see Figure 4.2). Sensor modules simulate each of the sensors of the real robot platform and report sensor values according to the current simulation world [93]. 4.2 The environment In our experiment we used two navigation environments. The first one, whose topol- ogy is shown in Figure 4.3, is an approximate model of the 3rd floor of the MSU 95 Figure 4.2: This figure illustrates the Nomad 200 simulator. The main window shows a section of the environment and the simulated robot. The two windows on the right show infrared and sonar sensor values at the current location. engineering building. The second one, whose topology is shown in Figure 4.4, has the same dimensions as the 3rd floor of the MSU engineering building. Such topological maps can either be learned or provided. The topology of the environment can be learned either by identifying landmark locations and their rela- tionships [18], or by learning a metric representation of the environment and then extracting the topology [1]. 
In our experiments we assume that we start with topo- logical map which we compile into an HPOMDP representations (see Figure 4.6). Figure 4.5 shows the equivalent flat of the hierarchical model. When we compile the topological map in Figure 4.3, the result is an initial HPOMDP model with 575 states. When this initial HPOMDP model is converted to a flat POMDP the num- ber of states is 465. Compilation of the topological map in Figure 4.4 results in an HPOMDP model with 1385 states. When this HPOMDP model is converted to a flat 96 2° 0 8 111 \J 113 10 .— 4 19 44 20 10 10 ‘4 r Figure 4.3: The figure shows an approximate model of the 3rd floor of the engineering building that is used with the Nomad 200 simulator. The number next to the edges is the distance between the nodes in meters. POMDP the number of states is 1285. To learn the parameters of the HPOMDP models first we need to collect sequences of observations and actions. To collect data we run the robot in the environment and save every action taken and every observation perceived. There are three actions available, “go-forward” which takes the robot forward approximately two meters, “turn-left”, which makes the robot turn left 90 degrees, and “turn-right” which makes the robot turn right 90 degrees. An observation consists of 4 components, the ob- servation of a wall or opening, on the front left, back, and right side of the robot. 97 56 20 IO [— “it's O ,Oj- O 59—22—09 A $3 Figure 4.4: The figure shows the real 3rd floor of the engineering building. This environment is used for experiments both, in the Nomad 200 simulator, and with the real robot Pavlov in the actual physical floor. The meters next to the edges denote the distance in meters between the nodes. The nodes represent junction locations in the environment. 98 Figure 4.5: This is the equivalent flat model of the HPOMDP in Figure 4.6 for the forward action. The observations are produces by a neural-net (Figure4.7). In the next section we describe the feature detectors. 4.3 Learning feature detectors Features such as wall and Opening on the four sides of the robot can be detected using the 16 sonar sensors around the robot. However, because the walls of our environment are fairly smooth, sonars can produce specular reflections. These reflections make it difficult to create hard-coded feature detectors for recognizing sonar signatures. Instead, we use sonar and odometric information values to build local occupancy grids which depict a local metric representation of the environment [68]. The local occupancy grids are labeled and then used as training data for a neural net feature detector which was studied earlier in [24] and [94]. Figure 4.7 shows the neural net used in feature detection. The net was trained using the quickprOp method [95], an optimized variant of the backpropagation algorithm. 99 Figure 4.6: This is a sample three-dimensional view of the HPOMD model for the forward action. The solid bold arrows represent the vertical transition matrices. The dashed arrows represent the horizontal transitions. The bold circles inside the abstract states represent entry—states and the empty circle represent exit-states. In this representation all the children of the root are abstract states. Product states are the leafs of the model and represent two meter locations in the environment. The arrows inside a product state represent the orientation of the product state. For each product state there is an observation model which was previously described in Chapter 3. 
100 Front local ._. occupancy “En grid (32x32) E Left (512) (201) (2) Input Layer Hidden Layer Output Layer Figure 4.7: A local occupancy grid map, which is decomposed into four overlapping quadrants (left, right, top, bottom), each of which is input to a neural net feature detector. The output of the net is a multi-class label estimating the likelihood of each possible observation (opening, wall). The net is trained on manually labeled real data. Learning the features could have also be done using any other non-parametric classifier such as decision trees, or a parametric classifier such as Bayesian [96]. Addi- tionally, we can extend the HPOMDP model to handle continues features, or extend it such that neural nets and the parameters of the HPOMDP are trained simultaneously [97]. Sample local occupancy grids were collected by running the robot through the hallways. Each local occupancy grid was then used to produce 4 training patterns. The neural net was trained and tested with 736 hand labeled examples using the m- fold cross—validation method [96]. Since all sensors predict the same set of features, it was only necessary to learn one set of weights. Using the 736 examples we created 8 sets of hand-labeled data, where each time we used 7 for training and 1 for testing. Figure 4.8 shows the learning curves for multiple training data sets using a batch update. The training error shown in the graph is formulated as the sum of the 101 Error curves for neural net feature detectors 700 I r I I I '. I fir . g . 2 net1 ——+—- net2 "urn- " net3 ----- i~~ r net4 "Mam- ne‘ 5 _._.‘..-.- net6 "~0~~ ‘ net? ~~¢~~ - net8 ~~¢~u Total Error 0 2 4 6 8 10 12 14 16 18 Training Epochs Figure 4.8: Learning curves for training the neural net to recognize features. The nets are trained with 736 examples in an 8-fold cross-validation fashion using quickprop. square difference between the actual and the desired output over all training patterns, Training error = 23.11 training patterns(deSired — actual)? For each training run we stopped training when the classification error on the testing data set reached a minimum. A testing pattern was classified correctly if the largest value of the actual output was also the largest value in the desired output. The classification error over all test data sets had a mean value of ,u = 7.33 and a standard deviation value of a = 3.7. For actual use with the robot we chose the neural net that had the smallest classification error. Figure 4.9 illustrates the variation in observation data generated during an actual run on Pavlov. In these occupancy grids, free space is represented by white, while black represents occupied space. Gray areas indicate that occupancy information is unknown. The figures are labeled with the virtual sensors and corresponding features, as predicted by the neural net. Despite noise due to specular reflections, and noise due 102 FRONT OPENING o o E a E 3 E] Z n. u: 0 3 B O. o E E E E E —l t2: '3 BACK WALL BACK OPENING FRONT OPENING FRONT WALL . ...s.ulwx..n4.:u.;crie.ku;ns-&I§a .Elgl o O s :1 E E ..l <1 5 E < B n- o 3 E C E E o E o —l E El 3 BACK OPENING BACK OPENING FRONT OPENING FRONT OPENING ..1 o 3 2‘ E 3 a z 3 LI.) 3 E D. a g a a ...l o ..i E BACK OPENING BACK OPENING Figure 4.9: Sample local occupancy grids generated over an actual run, with obser- vations output by the trained neural net. Despite significant sensor noise, the net is able to produce fairly reliable observations. 
103 to robot orientation with respect to the features, the neural note is able to accurately predict the observations. 4.4 The navigation architecture Learning and planning are integrated in an overall navigation architecture shown in Figure 4.10. The topological map construction module produces an initial HPOMDP model. In our case the we haven’t implemented any automatic construction of topo- logical maps but we rather provide it. Automatic topological map construction is not difficult to be done and has been done by previous researchers such as in [18], and [1]. The planning layer is activated at the beginning of a navigation task and produces a policy which is a mapping of states to actions. The input to the planning system is the initial or learned learned HPOMDP model and an external reward signal that defines the goal. The execution layer is responsible for updating the robot’s belief after every action and observations and issuing the next action according to the plan. The Planning and execution layer are presented in Chapter 6. The neural net forward pass module is responsible for classifying local occupancy grids into walls and openings. The neural net is trained ofi-line using labeled local occupancy grids. The occupancy grid module is responsible for constructing local occupancy grids based on robot motion and sonar sensors [68]. The behavior based layer is responsible for executing abstract actions such as move forward two meters, turn left, and turn right, while at the same time avoiding 104 obstacles and keeping the robot aligned with the walls. The behavior-based framework is a widely adopted and successful approach for programming mobile robots [98], [99]. The environment learning layer takes as input an initial HPOMDP and sequences of observations and actions (produced by the run-time system) and produces a learned HPOMDP model. Learning of the model is done-off line according to the EM learning algorithm presented in Section 3.2.1. Next, in Chapter 5 we will present a detail experimental study for learning HPOMDPS in our navigation environments. TOPOLOOICAL INmAL HPOMDP MODEL J ENVIRONMENT OBSERVATION MAP LEARNING AND ACTION SE UENCES CONSTRUCTION l LAYER Q LEARNED HPOMDP MODEL 5 RUN-TIME SYSTEM 5 PLANNING GOAL 5 LAYER REWARD ; HPOMDP 5 5 POLICY ‘ ' OBSERVATIONS NEURAL NEURAL (Wall. Oppening) [ EXECUTION NET NET ' LAYER TRAINING FORWAD PASS LOCAL ACTION : . ( go_forward) ACTION I LABELED OCCUPANCY 5 occéilfiggd “um-'6“) COMPLETION GRID DATA 3 (tumJight) 5 : I : MOTOR 5 OCCUPANCY BEHAVIOR 5 COMMANDS : GRID BASED . T . UPDATE LAYER RAW SENSOR DATA (sonar. infra—red. bumper. odometry) Figure 4.10: The figure shows the overall navigation architecture. The modules inside the dashed box are used during actual navigation while the models outside are used off-line 105 Chapter 5 Learning hierarchical spatial models for robot navigation In this chapter we describe detailed learning experiments where we train HPOMDP models to represent hierarchical spatial maps of indoor office environments. Before any learning experiments, in Section 5.1, we present two approximate learning meth- ods that enable fast training of large scale environments. In Section 5.2.1 we present a detail experimental study where we lean the parameters of an HPOMDP model for the spatial environment presented in Figure 4.3 in Chapter 4. For these set of experi- ments data is collected by sampling some “good” initial HPOMDP model. In Section 5.2.2 we present learning experiments with our robot platforms. 
We train both the small environment (Figure 4.3) and the larger environment (Figure 4.4) using robot experience. Next, in Section 5.3 we present experiments for learning the structure of the environment using our learning algorithms. In structure learning we allow for semi-ergodic models at abstract levels of the hierarchy and prune away incorrect connections using our learning algorithms. Finally, in Section 5.4, we conclude with a discussion about the results and significance of the experiments.

5.1 Approximate learning methods

An interesting question is how long the data sequences should be in order to learn HPOMDP models. Since the time complexity of our algorithm is $O(NT^3)$, long data sequences will require unrealistic training times. Fortunately, in order to learn relationships between states which are children of the root (and the submodels below them), the training data only needs to go through both of those states. This intuition can be better understood by close examination of the calculation of the probability of going from one state $s_x$ at the abstract level to another state $s'_x$ at time $t$ (or $\xi(t, s_x, s'_x)$) in Equation 3.24. However, with short training sequences there is a danger that we will not learn the correct model parameters unless we provide the initial vertical vector from the root, or in other words the first state from which the observation sequence was initiated. With longer sequences, it is more likely that the starting position of the observation sequence will be discovered by itself (see Figure 5.1). In our experiments we used both short training sequences accompanied by their starting position and longer training sequences without their starting position. Obtaining the starting position of an observation sequence is easy, since for every robot run for which data is to be collected, we know the starting position of the robot. If we always need to collect data (whenever the robot is acting in the environment), and don't have an exact initial position, then we will have long training times.

Figure 5.1: When a short training sequence is collected in state S5, the probability that this sequence occurred in S1 will be the same as the probability that it occurred in S5, since S1 and S5 are similar. When a long training sequence is obtained from S5 to S3, then the probability that this sequence was started from S1 will be much smaller than the probability that it started in S5, since the intermediate junction nodes S4 and S2 have different observations. Therefore, for short sequences we provide the starting state.

The above realization leads us to create two approximate learning methods that take advantage of the minimum length of the training data. In the first method, which we call reuse-training, we collect data for each abstract state separately and train each abstract state separately. If the horizontal transition model at the abstract level is deterministic, then we are done. If the horizontal transition model at the abstract level is not deterministic, then we can collect data for the entire model and use the submodels learned with the reuse-training method as initial submodels. In general, this method allows us to come up with better initial models. Since the hierarchical Baum-Welch algorithm is an expectation-maximization algorithm, it is highly sensitive to the initial model.
In addition, if we have a large environment which has similar corridors (abstract states), we can just train one corridor and use the learned model parameters for all other similar corridors. Moreover, once we have trained local abstract states of some environment separately, we can use them as part of any arbitrary global model. Figure 5.2 shows an example HPOMDP model and Figure 5.3 shows how we created the different submodels to be trained.

Figure 5.2: An example HPOMDP model for which we illustrate our approximate training methods. For example, in the reuse-training method we train separately each abstract state child of the root (S1, S2, S3), as shown in Figure 5.3.

Figure 5.3: Here the submodels of the HPOMDP in Figure 5.2 are trained as separate HPOMDP models. This is the reuse-training method, since the submodels can be used in the complete model in Figure 5.2. In this particular example, we don't need to retrain the whole model in Figure 5.2 (after training the various submodels separately), since the connections between the abstract states are deterministic.

For the second method, which we call selective-training, we only compute the estimation step of the EM algorithm for selected parts of the model. More precisely, we only compute the $\alpha$ and $\beta$ variables for the abstract states and their children which are "valid". The validity of a state depends on the current training sequence. For each training sequence the valid states are the abstract states (and the states in their submodels) from which the data sequence was initiated, together with their adjacent states to which there is a non-zero transition probability. In addition to computing the $\alpha$ and $\beta$ of valid states, we zero all vertical transitions from the root except the vertical transition to the valid state from which the training sequence was initiated. For each valid abstract state and its submodel we estimate the new updates of the parameters. A combined update for all model parameters from all training sequences is done at the end of every epoch (after all training sequences) according to Equations 3.37, 3.38, 3.39, and 3.40. Figure 5.4 shows a graphical description of the selective-training approach.

Figure 5.4: In the "selective-training" method only selected parts of the model are trained for each training sequence. Here two data sequences define two valid selections of the whole model, which are shown in bold lines.

5.2 Maximum likelihood parameter estimation experiments

5.2.1 Simulated learning experiments

To investigate the feasibility of the hierarchical Baum-Welch algorithm and the approximate training methods in learning large scale corridor environments, we performed experiments using the simulation environment in Figure 4.3. First we created artificial data using a good hand-coded model (which we refer to as the original model), for which all transitions to the next state were set to 0.9 and all self transitions were set to 0.1. For the sensor models we initialized the probabilities of perceiving the correct observation to be 0.9 and all probabilities of perceiving an incorrect observation to be 0.1. Using different sets of initial models and different lengths of data sequences, some of which were labeled with the starting state and some of which were not, we trained both POMDP and HPOMDP models for the simulated environment.
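The artificial data described above is produced by sampling trajectories from the original model. A minimal sketch of such a sampler for a flat POMDP is shown below; the data layout (transition tensor T[a, s, s'] and observation matrix O[s, z]) and the function name are assumptions made for illustration, not the dissertation's implementation, and in our setting the observation model may also condition on the action.

```python
import numpy as np

def sample_sequence(T, O, start_state, actions, rng=None):
    """Sample a (state, observation) trajectory from a flat POMDP.
    T[a, s, s'] are transition probabilities, O[s, z] observation
    probabilities, and `actions` is the action sequence to execute."""
    rng = rng or np.random.default_rng()
    states, observations = [], []
    s = start_state
    for a in actions:
        s = rng.choice(T.shape[1], p=T[a, s])   # sample the next state
        z = rng.choice(O.shape[1], p=O[s])      # sample an observation there
        states.append(s)
        observations.append(z)
    return states, observations
```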
We used "short" data sequences, where data was collected across two abstract states, and "long" data sequences, where data was collected across three abstract states. We used "uniform" initial models, where all matrices in the "original" model were changed to uniform (e.g., 0.1, 0.9 was changed to 0.5, 0.5), and "good" initial models, where all values of 0.9 in the original model were changed to 0.6 and all values of 0.1 were changed to 0.4. We collected 738 short training sequences, 246 long sequences, and 500 test sequences of length 40. We also collected 232 training sequences for training the submodels of the abstract states separately with the "reuse-training" method. All testing and training data was collected by sampling the good original model.

We trained uniform and good initial models using 5 training methods. In the first method, which we name "submodels-only", we simply train the submodels of each abstract state and do not train the model at the abstract level. The second is the "reuse-training" method, where we train the submodels separately and then retrain the whole model. The third method is "selective-training", the fourth is the standard EM learning algorithm for HPOMDPs, and the fifth is the standard EM algorithm for learning POMDPs.

We evaluated and compared the learned models based on robot localization, log likelihood of the training data, distance of the learned model from the original model, and computation time. Log likelihood of the training data was measured using the log likelihood function $\sum_k \log P(Z_k \mid A_k, \lambda_f)$ defined in Equation 3.36 in Chapter 3, where $Z_k$ is the observation sequence for training data set $k$, and $A_k$ is action sequence $k$. The term $\lambda_f$ refers to the parameters of a flat POMDP ($T$, $O$, and initial probability distribution $\pi$). The reason $\lambda_f$ refers to the parameters of a flat POMDP is that we need to be able to compare hierarchical and flat POMDPs. Therefore, before any comparisons are made we convert the learned HPOMDPs to flat POMDPs. Since a flat POMDP can also be viewed as a hierarchical POMDP with a single abstract state, the root, we can compute the log likelihood using Equation 3.36. We used this measure to show both convergence of the training algorithms and how the learned HPOMDPs compare in terms of fit to the training data. Plots of the log likelihood for the different experiments are shown in Figures 5.5, 5.6, 5.7, 5.8, 5.9, and 5.10.

Distance was measured according to Equation 5.1 [10]. Equation 5.1 is a method for measuring the similarity between two POMDP models $\lambda_1$ and $\lambda_2$. It is a measure of how well model $\lambda_1$ matches observations generated by model $\lambda_2$. The sign of the distance indicates which model fits the data better: a negative sign means $\lambda_2$ is better, while a positive sign means that $\lambda_1$ is better. The equation can also be interpreted in other ways, including cross entropy, divergence, and discrimination information [100].

$$D(\lambda_1, \lambda_2) = \frac{\log P(Z_2 \mid A_2, \lambda_1) - \log P(Z_2 \mid A_2, \lambda_2)}{T} \qquad (5.1)$$

$Z_2$ and $A_2$ refer to a sequence of observations and actions that were produced by model $\lambda_2$, and $T$ refers to the length of the sequence. In our experiments $\lambda_2$ refers to the good "original" model, which we used to create 500 test sequences. Before any distance measurement was made, all the hierarchical models were converted to flat. In Table 5.1 we report the means and standard deviations of the distances of the learned models from the original model.
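A sketch of the distance measure of Equation 5.1 is shown below. It assumes a helper log_likelihood(model, Z, A) that computes $\log P(Z \mid A, \lambda)$ for a flat POMDP (Equation 3.36); that helper and the model representation are assumptions made for illustration only.

```python
def model_distance(model1, model2, test_Z, test_A, log_likelihood):
    """Equation 5.1: average per-step log-likelihood difference between
    model1 and model2 on sequences generated by model2.  Negative values
    mean model2 (the generating model) fits the test data better."""
    distances = []
    for Z, A in zip(test_Z, test_A):
        d = (log_likelihood(model1, Z, A) - log_likelihood(model2, Z, A)) / len(Z)
        distances.append(d)
    mean = sum(distances) / len(distances)
    return mean, distances   # the list allows a standard deviation to be computed
```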
Robot localization was done using the most likely state heuristic [22]. To compute the most likely state we first convert the learned HPOMDP to a flat POMDP, as was shown in Section 3.2.3. Then, for a test data sequence (of actions and observations), we compute the probability distribution over the states of the model after every action and observation, as was shown in Equation 2.14 in Section 2.2.1 and summarized below:

$$b'(s') = \frac{O(s', a, z) \sum_{s \in S} T(s, a, s')\, b(s)}{P(z \mid a, b)}$$

The most likely state for each belief state $b$ is then $s_{MLS} = \arg\max_s b(s)$. Since our test data is created by sampling a good HPOMDP model, we know the true state sequence. Thus, we estimate the error in robot localization for a particular test sequence by counting the number of times the true state was different from the most likely state. For multiple test sequences we average over all the errors. Table 5.2 shows robot localization results for the different models trained.

In passing, we should note that the most likely state heuristic is a naive way of decoding the underlying state sequence in an HMM. A better approach is the Viterbi algorithm, which computes the most likely state sequence rather than the most likely state at every time index [101], [10]. However, the Viterbi algorithm requires the whole observation sequence beforehand, which is not possible in the navigation domain.
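The localization error used in these experiments can be estimated as sketched below: the belief is propagated with the update just summarized, the most likely state is compared against the known true state at every step, and mismatches are counted. The model arrays and names are illustrative assumptions (and the observation model is taken to be action-independent here for brevity).

```python
import numpy as np

def localization_error(T, O, b0, actions, observations, true_states):
    """Percentage of steps where the most likely state under the belief
    differs from the true (sampled) state.  T[a, s, s'] and O[s, z] are
    flat POMDP transition and observation probabilities."""
    b = b0.copy()
    errors = 0
    for a, z, s_true in zip(actions, observations, true_states):
        b = O[:, z] * (T[a].T @ b)   # unnormalized belief update (Eq. 2.14)
        b /= b.sum()                 # normalization corresponds to P(z | a, b)
        if np.argmax(b) != s_true:   # most likely state heuristic
            errors += 1
    return 100.0 * errors / len(actions)
```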
Computation time was measured by simply recording computer time. Computation times are used to compare training times for the approximate learning algorithms with the original EM learning algorithm for HPOMDPs and the original EM learning algorithm for POMDPs. Such results are shown in Table 5.3.

Figure 5.5: The graph describes the fit to the training data during experiment 1 (short sequences with starting positions and uniform initial models). The vertical axis represents the log likelihood over all 738 training sequences. The "reuse-training" method converges faster than all other methods. The EM learning algorithm for POMDPs fits the training data slightly better than the hierarchical methods.

Experiment (figure) | Initial model | D(initial, original) mu (sigma) | Training data: # seq., abstract states traversed, start label | Training method | D(learned, original) mu (sigma)
original model | | 0 | | | 0.0 (0.0)
Experiment 1 (Fig. 5.5) | uniform | -1.3 (0.2) | 232, 1, yes and 738, 2, yes | submodels-only | -0.40 (0.21)
 | | | | reuse-training | -0.29 (0.18)
 | | | | selective-training | -0.34 (0.20)
 | | | | EM for HPOMDPs | -0.35 (0.21)
 | | | | EM for POMDPs | -0.80 (0.35)
Experiment 2 (Fig. 5.6) | good | -0.95 (0.18) | 232, 1, yes and 738, 2, yes | submodels-only | -0.38 (0.20)
 | | | | reuse-training | -0.30 (0.19)
 | | | | selective-training | -0.30 (0.19)
 | | | | EM for HPOMDPs | -0.30 (0.19)
 | | | | EM for POMDPs | -0.49 (0.25)
Experiment 3 (Fig. 5.7) | uniform | -1.3 (0.2) | 246, 3, no | reuse-training | -0.72 (0.31)
 | | | | EM for HPOMDPs | -1.93 (0.83)
 | | | | EM for POMDPs | -2.15 (0.99)
Experiment 4 (Fig. 5.8) | good | -0.95 (0.18) | 246, 3, no | reuse-training | -0.72 (0.32)
 | | | | EM for HPOMDPs | -0.82 (0.35)
 | | | | EM for POMDPs | -0.85 (0.33)
Experiment 5 (Fig. 5.9) | uniform | -1.3 (0.2) | 246, 3, yes | reuse-training | -0.48 (0.26)
 | | | | EM for HPOMDPs | -0.76 (0.30)
 | | | | EM for POMDPs | -1.01 (0.41)
Experiment 6 (Fig. 5.10) | good | -0.95 (0.18) | 246, 3, yes | reuse-training | -0.45 (0.24)
 | | | | EM for HPOMDPs | -0.5 (0.26)
 | | | | EM for POMDPs | -0.61 (0.21)

Table 5.1: The table shows the mean and standard deviation of the distance of the learned model from the original model. The distance was measured using 500 test sequences of length 40, which were produced by sampling the original model in the small environment in Figure 4.3. The "reuse-training" method produces the best (shortest) distance for each experiment. The EM learning algorithm for HPOMDPs always produces better models than the EM learning algorithm for POMDPs.

Experiment | Initial model | Localization error before training mu % (sigma) | Training data: # seq., abstract states traversed, start label | Training method | Localization error after training mu % (sigma)
original model | | | | | 17.7 (14.0)
Experiment 1 | uniform | 89.2 (10.0) | 232, 1, yes and 738, 2, yes | submodels-only | 38.01 (21.47)
 | | | | reuse-training | 33.04 (19.60)
 | | | | selective-training | 37.36 (21.50)
 | | | | EM for HPOMDPs | 37.30 (21.62)
 | | | | EM for POMDPs | 66.00 (21.61)
Experiment 2 | good | 55.38 (23.88) | 232, 1, yes and 738, 2, yes | submodels-only | 36.6 (21.75)
 | | | | reuse-training | 33.16 (19.13)
 | | | | selective-training | 32.84 (19.61)
 | | | | EM for HPOMDPs | 33.2 (19.95)
 | | | | EM for POMDPs | 49.17 (22.5)
Experiment 3 | uniform | 89.2 (10.0) | 246, 3, no | reuse-training | 36.69 (20.83)
 | | | | EM for HPOMDPs | 86.02 (10.59)
 | | | | EM for POMDPs | 87.81 (8.85)
Experiment 4 | good | 55.38 (23.88) | 246, 3, no | reuse-training | 36.97 (20.26)
 | | | | EM for HPOMDPs | 43.89 (23.07)
 | | | | EM for POMDPs | 48.62 (23.48)
Experiment 5 | uniform | 89.2 (10.0) | 246, 3, yes | reuse-training | 34.76 (20.95)
 | | | | EM for HPOMDPs | 60.45 (23.06)
 | | | | EM for POMDPs | 69.55 (20.99)
Experiment 6 | good | 55.38 (23.88) | 246, 3, yes | reuse-training | 35.17 (20.29)
 | | | | EM for HPOMDPs | 36.60 (20.28)
 | | | | EM for POMDPs | 45.47 (22.25)

Table 5.2: The table shows the localization error, which was computed using the 500 test sequences of length 40. The "reuse-training" method produces the best models, with an exception in experiment 2 where "selective-training" is slightly better. The EM learning algorithm for HPOMDPs always produces better models than the EM learning algorithm for POMDPs.

Figure 5.6: The graph describes the fit to the training data during experiment 2 (short sequences with starting positions and good initial models). The fit to the training data is similar to experiment 1, even though training started with a better initial model.
Figure 5.7: The graph describes the fit to the training data during experiment 3 (long sequences with no starting positions and uniform initial models). The "reuse-training" method clearly converges faster than the other training methods.

Figure 5.8: The graph describes the fit to the training data during experiment 4 (long sequences with no starting positions and good initial models). The fit to the data is better than in experiment 6 (see Figure 5.10). This is because data is allowed to start from all states. Nevertheless, navigation performance is worse than in experiment 6 according to Table 5.2.

Figure 5.9: The graph describes the fit to the training data during experiment 5 (long sequences with starting positions and uniform initial models). The POMDP model fits the training data the best, while the "reuse-training" method converges faster than all other methods.

Figure 5.10: The graph describes the fit to the training data during experiment 6 (long sequences with starting positions and good initial models). Even though the fit to the data is not as good as in experiment 4 (see Figure 5.8), navigation performance was better. The EM learning algorithm for POMDPs fits the training data the best.

From Tables 5.1 and 5.2, and from Figures 5.5, 5.6, 5.7, 5.8, 5.9, and 5.10, we draw the following conclusions:

- In almost all of the experiments described, the "reuse-training" method performs the best. Even when we start with uniform initial models and do not provide the starting position, the "reuse-training" method produces good results both in robot localization and in distance of the learned models from the original (Experiments 1, 3, and 5).

- From Tables 5.1 and 5.2 we can conclude that the shorter the distance of the learned models from the original model, the better the robot localization.

- Almost all of the plots of the log-likelihoods during training show that the EM learning algorithm for POMDPs fits the training data better. Nonetheless, during testing the models learned with the EM learning algorithm for POMDPs performed the worst. This observation shows that the EM learning algorithm for POMDPs over-trains the learned models (compared to the hierarchical algorithms) and does not generalize well to unseen test sequences.

- The "selective-training" method works almost as well as the "reuse-training" method (Experiments 1 and 2). This is a pleasing result, since the "selective-training" method trains much faster than the "reuse-training" method and the EM learning algorithm for HPOMDPs (see Table 5.3).
- It is obvious from Tables 5.1 and 5.2 that the learned models never reach the performance of the original model. This is due to the fact that EM-type algorithms converge to local maxima and are highly dependent on the initial model. If we compare the results of the EM algorithms in experiments 1 and 2, in experiments 3 and 4, and in experiments 5 and 6, we see that training that started from good initial models produces better learned models than training that started from uniform initial models.

- Providing the initial state from which the data was collected produces better learned models. This is obvious if we compare experiment 3 with experiment 5, and experiment 4 with experiment 6.

- The worst results were produced in experiment 3, where we started with uniform initial models and did not provide the initial starting position.

- As the amount of training data increases, so does the performance. Despite the fact that experiments 1 and 2 use shorter sequences than experiments 3, 4, 5, and 6, they produce better learned models, mainly due to the significantly larger number of training sequences provided.

We also measured the training times for "short" sequences (sequences that were collected through two abstract states) and the different learning methods. Table 5.3 summarizes the results. For short training sequences the "selective-training" method is much faster than both the "flat" and "hierarchical" training methods. Even though the time complexity of the "hierarchical" training methods is $O(NT^3)$, for short training sequences they are faster than flat models, whose time complexity is $O(NT)$. This is due to the hierarchical representation of the state space, which makes the N term in the hierarchical methods smaller than the N term in the flat methods.

Training method | Time (sec) / epoch
selective-training | 449
flat (EM for POMDPs) | 3722
hierarchical (EM for HPOMDPs) | 1995

Table 5.3: The table shows the training times for short training sequences (the experiments in Figures 5.5 and 5.6) and different training methods. All training times were calculated on a 900 MHz Athlon processor running Red Hat Linux 7.1.

5.2.2 Learning experiments using the robot platforms

In this set of experiments we trained hierarchical POMDP models for the small floor (Figure 4.3), using robot experience from the Nomad 200 simulator, and experience with the real robot Pavlov from the actual 3rd floor in Figure 4.4. The purpose of these experiments was to investigate whether learned models can actually be used for robot navigation. For both environments we simply trained all submodels of the abstract states separately. To collect data we ran the robot separately in each abstract state. In corridors we would collect data by simply running the robot from every entry state of a corridor to the opposite exit state. In junction states, we would run the robot from every entry state to each one of the other exit states.

To evaluate the learned models we used a known "good" original model, which we compare to the "learned" model. For the "good" original model, we set all sensor probabilities of perceiving the correct observation to 0.9 and the probability of perceiving the wrong observation to 0.1. All self transitions were set to 0.05 and transitions to the correct state were set to 0.9. In these experiments, for all corridor states we also added an overshoot transition, set to 0.05. Learning started from a "bad" model, which was created from the "good" original model by increasing the self transitions to 0.47 and decreasing the transitions to the correct state to 0.47. We evaluated the learned models based on robot localization.
Localization performance was computed differently this time. Since it is hard to compute the true location of the robot, we compared the most likely state given by the "bad" and "learned" models with the most likely state given by the "good" original model. Since the "good" model is simply based on our insight of what the parameters should be, we relaxed the MLS comparison. The error increases only when the distance between the two states is longer than a single transition. That is, we don't increase the error if the most likely state given by the "good" model is the same as, or adjacent to, the most likely state given by the "learned" model. Localization results are summarized in Table 5.4. It is obvious that after training, the learned models produce low localization error. Also note that, unlike Table 5.2 in the previous section, the localization error here is quite low. The reason is that the training and testing data came from actual robot runs. Actual robot navigation tasks produce similar training and testing sequences of observations and actions (e.g., traversal of corridors). In the previous section the data was collected through sampling, which could have produced training sequences quite different from the testing sequences. Additionally, the low localization error is due to the fact that the initial observation models provided were fairly decent (the same as in the "good" original model).

Training method | Training data | Test data | Localization error before training mu % (sigma) | Localization error after training mu % (sigma)
submodels-only | 30 seq. from Nomad 200 simulator (Fig. 4.3) | 14 seq. from Nomad 200 simulator (Fig. 4.3) | 41.8 (21.4) | 11.8 (18.0)
submodels-only | 43 seq. from Pavlov in the real floor (Fig. 4.4) | 7 seq. from Pavlov in the real floor (Fig. 4.4) | 64.9 (20.9) | 2.8 (3.8)

Table 5.4: The table shows the localization error with respect to the true state of the robot before and after training, in the Nomad 200 simulator and in the real environment with the actual robot Pavlov.
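The relaxed most-likely-state comparison described above can be written as in the sketch below: an error is counted only when the most likely state under the learned (or bad) model is neither identical nor adjacent to the most likely state under the good model. The adjacency structure and variable names are illustrative assumptions.

```python
import numpy as np

def relaxed_mls_error(beliefs_good, beliefs_learned, adjacent):
    """Percentage of steps where the MLS of the learned model is neither
    the same as nor adjacent to the MLS of the good model.
    `adjacent[s]` is the set of states one transition away from s."""
    errors = 0
    for b_good, b_learned in zip(beliefs_good, beliefs_learned):
        s_good, s_learned = np.argmax(b_good), np.argmax(b_learned)
        if s_learned != s_good and s_learned not in adjacent[s_good]:
            errors += 1
    return 100.0 * errors / len(beliefs_good)
```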
In the flat model the topology will not be learned correctly since the transition probability 81 —> 53 will not be zero. The hierarchical algorithm is able to better discover the relationships between groups of states and has the potential for inferring the structure at the abstract level. 5.3.1 Learning the structure of a three level hierarchy For the simulated environment in Figure 4.3 we trained an HPOMDP model using 245 short training sequences (collected by sampling through two abstract states) labeled with their starting positions. We used three type of training methods, the 125 “selective-training” method, the standard EM learning algorithm for HPOMDPS, and the standard flat EM learning method for POMDPS. We trained the model starting from different sets of initial parameters. Each set of parameters defined a different type of semi—ergodic model. Below we describe the different initial models: 1. A hierarchical model “model-1”, where every exit point of every junction ab- stract state is connected to the correct corridor and entry-state, and all entry points of other corridors whose size is different than the correct corridor. Every corridor is connected to a single junction. A partial example of this type of initial model is shown in Figure 5.12. 2. A hierarchical model, “model-2”, where every exit point of every junction ab- stract state is connected to the correct corridor and entry-state, and all entry points of other corridors whose size is different than the correct corridor. Every corridor abstract state is connected to a every entry point of every junction abstract state. To evaluate each model we examine the percentage of the total transition proba- bility that is associated with the correct structure of the environment at the abstract level. The arrows in Figure 5.13 define the correct structure. In an ergodic model there are more connections among states (at level 2) than the arrows shown in the figure. Correct structure was measured according to Equation 5.2. 126 ll U r JBL ............ _ I [:1 —l 4.:1 .................. m ........ I _ T f A _ ._ 1 E Figure 5.12: The bold arrow indicates the correct structure for the particular south exit point of the junction abstract state. The dotted arrows indicate possible adjacent nodes. We chose adjacent nodes to be corridors of different size from the correct corridor. 127 2365,06/1 Zs’eS’ T(S’|3’ 0) 2365,06/1 2.9”68” T(S”l8’ 0‘) correct structure 2 (5.2) Where 5’ is the set of states that belong to abstract states different from the parent of s, p(s), such that s’ and s belong to topologically adjacent abstract states. S” is the set of states that belong to abstract states different from the parent of s, p(s) (s and s”do not. necessarily belong to topologically adjacent abstract states). In order to be able to compare flat and hierarchical learning algorithm the transition functions T(s’|s, a) and T(s”|s, a) refer to the transition probabilities when the model is represented as a flat POMDP. Overall, our evaluation criterion finds the percentage of transition probability as- sociated with the arrows in Figure 5.13 over the total transition probability associated with all the arrows (all arrows starting from the same positions as the arrows shown in Figure 5.13). Table 5.5 summarizes the results of all the models, and shows that the hierarchical models improve the structure much more than the flat models. 
Figures 5.13 and 5.14 show the transition probabilities of the correct structure before and after learning for initial "model-1". We also plotted the log likelihood ($\log P(Z \mid A, \lambda)$) as a function of the training epochs, as shown in Figures 5.15 and 5.16. The graphs show that the flat models fit the data better. It is also obvious that, even though "model-1" is less ergodic, "model-2" fits the data better. This suggests that for short data sequences a better fit to the data does not necessarily mean a better navigation model. In addition, even though the flat models fit the training data better than the hierarchical models, the hierarchical models produce less error in the correct structure in Table 5.5.

Figure 5.13: The figure shows the transition probabilities that belong to the correct structure before training for "model-1". The probabilities shown are the ones corresponding to the HPOMDP model when converted to flat.

Figure 5.14: The figure shows the transition probabilities that belong to the correct structure after training for "model-1", using the "selective-training" method. The probabilities shown are the ones corresponding to the HPOMDP model when converted to flat.

Model | Correct structure error before training | Training method | Correct structure error after training
model-1 | 46.91 % | selective-training | 15.47 %
 | | EM for HPOMDPs | 15.63 %
 | | EM for POMDPs | 41.79 %
model-2 | 96.02 % | selective-training | 59.36 %
 | | EM for HPOMDPs | 60.71 %
 | | EM for POMDPs | 87.20 %

Table 5.5: The table shows the percentage of transition probability that does not belong to the correct structure before and after learning, for the three level hierarchical model.

Figure 5.15: The figure shows how "model-1" fits the data over every training epoch, for the different training methods. The POMDP model fits the training data better than the HPOMDP models.
Figure 5.16: The figure shows how "model-2" fits the data over every training epoch, for the different training methods. The POMDP model fits the data better than the HPOMDP models.

5.3.2 Learning the structure of a four level hierarchy

We also did structure learning experiments on a four level hierarchical model. The second and third levels are shown in Figure 5.17. In this set of experiments we started with two types of initial models:

1. In the first model, "4level-1", only the structure at level 2 was semi-ergodic, where every state at level 2 was connected to every other state at level 2.

2. In the second model, "4level-2", all states at the second level were connected the same as in the first model, "4level-1". At the third level (for all submodels of the abstract states at level 2), corridor states were connected to all junction states, and junction states were connected to all corridor states whose size was different from the correct corridor (similar to "model-2" in Section 5.3.1).

Table 5.6 summarizes the results for the different initial models and different training methods, while Figures 5.20 and 5.21 show how the different methods and different models fit the data during training. In this experiment, 17 long training sequences were used.

For model "4level-1", both the EM learning algorithm for HPOMDPs and the EM learning algorithm for POMDPs learn the structure of the model equally well. The reason the EM learning algorithm for POMDPs performs this well is that we have used long training sequences across the abstract states at level 2, and the abstract states at level 2 look sufficiently different, thus enabling the EM learning algorithm for POMDPs to discover the correct state sequence.

For model "4level-2", however, the EM algorithm for learning HPOMDPs outperforms the EM algorithm for learning POMDPs. The reason is that by adding ergodicity at level 3 the abstract states at level 2 do not look substantially different, and as a result the EM algorithm for learning POMDPs is unable to discover the correct sequence of states. Figures 5.18 and 5.19 show the correct structure learned at level 2 for model "4level-2", using the EM algorithm for learning POMDPs and the EM learning algorithm for HPOMDPs, respectively.

Model | Correct structure error before training (level 2 / level 3) | Training method | Correct structure error after training (level 2 / level 3)
4level-1 | 77.27 % / 4.31 % | EM for HPOMDPs | 13.25 % / 0.22 %
 | | EM for POMDPs | 12.31 % / 0.20 %
4level-2 | 82.46 % / 83.61 % | EM for HPOMDPs | 32.50 % / 49.90 %
 | | EM for POMDPs | 43.85 % / 55.34 %

Table 5.6: The table shows the percentage of transition probability that does not belong to the correct structure before and after learning, for the four level hierarchical models.

Figure 5.17: This is a four level representation of the environment, where entire sections of the building were grouped into abstract states at a higher level. The three large dashed polygons represent the abstraction at level 2, while the smaller rectangles inside the polygons represent the spatial abstraction at level 3. The bold arrows indicate the correct structure at level 2 and the numbers next to the arrows indicate the transition probabilities before any learning for model "4level-2". The probabilities shown are the ones that belong to the correct structure when the model is converted to flat.
Figure 5.18: The arrows show the transition probabilities of the correct structure after training model "4level-2" with the EM learning algorithm for POMDPs.

Figure 5.19: The arrows show the transition probabilities of the correct structure after training model "4level-2" with the EM learning algorithm for HPOMDPs.

Figure 5.20: The graph shows the fit to the data during training of model "4level-1".

Figure 5.21: The graph shows the fit to the training data during training of model "4level-2". In the graph, the EM algorithm for learning HPOMDPs and the EM algorithm for learning POMDPs fit the data better than the plots in Figure 5.20, despite the fact that the models here are more ergodic.

5.4 Conclusions

This chapter has described learning experiments in which we used the HPOMDP model to represent spatial maps of indoor office environments. Taking advantage of short training sequences, we have created two approximate training methods: the "reuse-training" method, which starts with better initial models and as a result converges faster, and the "selective-training" method, which takes far less time per iteration than any of the other methods. The two approximate training methods concentrate on smaller parts of the whole model, which results in much better models for robot navigation. The advantages of training with smaller training data sequences are that, first, it is easier to collect more complete training data sets and, second, training is done much faster. A disadvantage of the approximate training methods is that they require knowledge of where the data was collected. Additionally, the training sequences need to terminate at some root child.

The experimental results show that our hierarchically learned HPOMDP models are closer to the generative model, from which the training data is produced, than the models learned through the flat EM procedure for POMDPs. Additionally, the hierarchically learned models produce better robot localization than the models learned through the flat EM algorithm for POMDPs.

Overall, the hierarchical Baum-Welch algorithm is better at capturing the relationships between states which are far apart, and as a result more meaningful models are learned for use in robot localization. Additionally, the hierarchical Baum-Welch algorithm is better at acquiring the initial structure. The algorithm is well suited for domains which emit hierarchical data, such as data sets that start from the root and finish at a child state of the root.

In the following chapter we will show how to use the hierarchical POMDP model to perform navigation tasks. We will present hierarchical planning algorithms for approximating the optimal policy in a given HPOMDP and their comparison to flat POMDP methods in robot navigation.
Chapter 6

Hierarchical planning for robot navigation

The HPOMDP model inspires a hierarchical framework for planning, which has two major advantages over flat POMDP solutions. The first advantage is that plans can be constructed in a hierarchical fashion, which allows us to reuse low level plans in different high level plans. The second advantage is that at higher levels of abstraction, both spatial and temporal, uncertainty is lower, and therefore plan execution can be done more successfully than in flat POMDPs.

In this chapter we describe a general hierarchical planning algorithm for HPOMDPs and apply the framework to corridor navigation to produce and execute hierarchical plans [102]. At an abstract level we construct abstract plans by computing mappings from abstract states (e.g., whole corridors) to macro-actions (e.g., go down the corridor), and within abstract states we compute the macro-action plans, which are mappings from product states to primitive actions. Examples of macro-actions within abstract states include "exit a corridor" at one of its exit states, "exit a junction" at one of its exit states, or reach a particular location within some abstract state.

Our experiments show that there is less uncertainty associated with the belief states defined over higher levels of the HPOMDP hierarchy. The combination of low entropy at the abstract level, macro-actions, and heuristic POMDP solutions (e.g., choose the action associated with the most likely state) allows the robot to reach locations of its environment starting from no knowledge of its original starting location.

In Section 6.1 we motivate our planning algorithm. In Section 6.2 we review related literature, which forms the building blocks of our planning algorithm. In Section 6.3 we present our planning algorithm and two execution algorithms. In Section 6.4 we present robot navigation results, both in the Nomad 200 simulator and in the real physical world. Finally, in Section 6.5, we present our conclusions.

6.1 Robot navigation in corridor environments

It was obvious from Figure 1.4 in Chapter 1 that corridors in the MSU engineering building are quite long (more than 1000 ft.) and locations within the corridors do not have distinguishing characteristics for a robot to localize. However, this problem can be alleviated if we treat whole corridors as single states and thus abstract away information that may be unnecessary for most navigation tasks. The reason is that for most navigation tasks a robot just needs to know which corridor it is in, rather than the precise corridor location, and therefore it can often ignore location uncertainty that arises within corridors.

Figure 6.1 shows an HPOMDP model for robot navigation and Figure 6.2 shows the equivalent flat model. Both models are used to represent corridor environments. The environment is divided into equal fixed regions (e.g., two meter regions), where there are four product states (one for each direction). In the HPOMDP model whole corridors are represented with a single abstract state, which in addition to the product states contains two entry-states and two exit-states for the entry and exit points of the corridor. Junction nodes are represented with four product states, as shown in Figure 6.1. In the HPOMDP model for corridor environments, we use the term global product level to denote the set of all product states, and abstract level to denote all the states which are children of the root.

A multi-resolution representation is intuitive for many reasons. First of all, it allows us to scale up to larger environments, since we do not need to represent all relationships among product states. Second, for many navigation tasks the robot does not need to know where it is in a corridor, but rather that it is in some corridor which it has to traverse. Third, it offers us the capability of learning local policies (or macro-actions, as shown in Figure 6.7) for the different abstract states, which we can reuse for any global plan at the abstract level. Finally, as shown in Figure 6.4, the robot's belief (see Equations 6.20 and 6.21) as to where it is at the abstract level, represented as a probability distribution over the states at the abstract level, is less uncertain than its belief as to where it is at the global product level. The uncertainty of the robot's belief about its true location at the abstract and global product levels was measured using the normalized entropy.

Figure 6.1: This HPOMDP model is used to represent corridor environments. The dotted arrows from product states show non-zero transition probabilities for primitive actions. The black vertical arrows show non-zero vertical activations of product states from the entry states. The dashed arrows from exit states show non-zero transition probabilities from exit-states to adjacent states of the abstract state they are associated with.

Figure 6.2: This is the equivalent flat POMDP model of the hierarchical POMDP model shown in Figure 6.1. In this model we only represent the product states of the hierarchical model. Global transition matrices for each action are automatically computed from the hierarchical model by considering all the possible paths between the product states under the primitive actions.
In the HPOMDP model for corridor environments, we use the term global product level to denote the set of all product states, and abstract level to denote all the states which are children of the root. A multi-resolution representation is intuitive for many reasons. First of all, it allows us to scale up to larger environments since we do not need to represent all relationships among product states. Second, for many navigation tasks the robot does not need to know where it is in a corridor but rather that it is in some corridor which it has to traverse. Third, it offers us the capability of learning local policies (or macro-actions as shown in Figure 6.7) for the different abstract states, which we can reuse for any global plan at the abstract level. Finally, as shown in Figure 6.4, the robot’s belief (see Equations 6.20 and 6.21 ) as to where it is at the abstract level, represented as a probability distribution over the states at the abstract level, is less uncertain than its belief as to where it is at a global product level. The uncertainty of the robots belief about its true location at the abstract and global product level was measured using the normalized entropy. The normalized entropy of an IS I dimensional discrete probability distribution can be computed as 142 Figure 6.1: This HPOMDP model is used to represent corridor environments. The dotted arrows from product states show non-zero transition probabilities for primi- tive actions. The black vertical arrows show non—zero vertical activations of product states from the entry states. The dashed arrows from exit states show non—zero tran- sition probabilities from exit-states to adjacent states of the abstract state they are associated with. .............. nnnnnn ‘ -"‘-~: -"-',‘. ....... ......... ~ "‘ - - - ........ Figure 6.2: This is the equivalent fiat POMDP model of the hierarchical POMDP model shown in Figure 6.1. In this model we only represent the product states of the hierarchical model. Global transition matrices for each action are automatically computed from the hierarchical model by considering all the possible paths between the product states under the primitive actions. 143 shown in Equation 6.1. entropy normalized entropy = . maxrmum entropy — 23.1. P(s) log P(s) z —zf=.1/ISIlog(1/ISI> (6-1) : ‘Zf=1P(8)105P(8) = ‘Zf=1P(3)10gP(s) -log(1/ISI) log(|S|) In fact, the entropy plot in Figure 6.4 reveals a number of characteristics that motivate our planning algorithm: 1. It is Obvious from the plot that uncertainty is less at an abstract level. 2. Uncertainty reduces dramatically at junctions both at the global and abstract levels (see J1, J2, J3 and J4 in Figures 6.3 and 6.4). 3. At the global product level uncertainty increases as the robot traverses corridors where all observations are usually the same, but at an abstract level it does not increase. 4. Overall uncertainty decreases as the robot traverses a series of corridors. Since uncertainty is less at an abstract level, an approximate algorithm that treats the states at the abstract level as completely observable should be more successful than an algorithm which treats all product states as completely observable. In addi- tion, a successful algorithm would be one that is able to execute macro-actions (e.g., move from one entry-state of a corridor to the exit-state at the other exit point of the corridor) so that overall uncertainty is gradually reduced. 
In the next section we 144 1 START 14 GOAL .liL— J1 J3 12 Figure 6.3: This is a robot run on the Nomad 200 simulator, which was used to collect observations for the entropy plot in Figures 6.4. Abstract and global level entropies (uniform starting belief) 1 l l Global entrop'y —+—— o,9 ‘ Abstract entropy ------x- 1 Normalized entropies 0 10 20 11 30 12 40 J3 50 J4 60 Number of steps START GOAL Figure 6.4: The plot shows the normalized entrOpies of the belief Of the robot’s location at the global product level and at the abstract level. The initial belief was uniform and observations were collected along the trace shown in Figure 6.3. 145 present related work from semi Markov decision processes (SMDPS), and heuristic POMDP solutions that formulate our hierarchical planning and execution algorithms for HPOMDPS. 6.2 Related work 6.2.1 Hierarchical planning in MDPS Hierarchical planning in completely observable Markov decision processes has been explored by many researchers. Most notable are the hierarchical Abstract Machines framework (HAM) [103], [104], the MAXQ framework [105], and the options frame- work [25]. Here we briefly review the options framework on which we base much of the formulation of our hierarchical planning algorithm for HPOMDPS. An Option is either a temporally extended course of action (macro-action) or a primitive action. If an option is a macro-action, then it is a policy over an underlying MDP. Options consist of three components: a policy 7r : 8 x A —> [0, 1], a termination condition 13 : 8+ —> [0,1], and an initiation set I g S. An option < I,7r, 6 > can be taken from any state s E I at which point actions are selected according to policy 7r until the option terminates stochastically according to 6. An abstract policy a is a policy over options. However, before a policy among options is computed we need to define its model, which specifies for every state 3 where an option 0 may be started, and when it will terminate (see Equation 6.3). The total reward that will be received for executing option 0 in state 3 is: 146 r: 2‘ E{7‘I+1 + 7Tt+2 ' ° ' 'l' ’Yk-lft+kl£(0a 3: fl} (6'2) where t + k is the random time at which 0 terminates, and 8(0, 3, t) is the event that Option 0 started at time t. The state prediction model is defined as: 7933' = 21905316)?” (6.3) k=1 which is a combination of the probability value p(s’, k) that o terminates in state 3’ in k steps, together with a measure of how delayed the outcome is according to the discount factor 7. The reward and transition model for options allows us to write the Bellman equation for the value of executing an abstract policy a at some state 3 as shown in Equation 6.4. The Bellman equation then, allows us to compute the values of executing each option from every state that belong in its initiation state I, and thus be able to construct policies over options. Note that this equation is similar to Equation 2.4 in Section 2.1. l/“(Sl = EITHI + 77‘:+2 ' ° ° + 7k_17"t+k + TkV“(3t+k)|5(Ha 3, t)} 6.4 = r: + Ema/“(3') ( l The transition and reward models of options can be learned using an intra-option learning method [25]. We can express the Option models as Bellman equations (see Equations 6.6 and 6.5) and then solve the system of linear equations either explicitly, or through dynamic programming as was shown in Section 2.1. 147 r: =7”) +ZP:(,3’(1—fl(s')r:,) (6.5) =2 7’3.“’I(1(s’))7>a + 0506...] (6.6) where 63,, = 1 if s’ : :r and is 0 otherwise. 
Note that in the above Bellman equations (6.4, 6.5, and 6.6) the discount term $\gamma$ is included within the transition model. In order to be able to solve these equations using the standard value-iteration algorithm described in Chapter 2 (Figure 2.2), we can modify the definition of the state prediction model as shown in Equation 6.7. Equation 6.4 then becomes Equation 6.8, and Equations 6.5 and 6.6 become Equations 6.9 and 6.10. This redefinition of the transition model implies that the transition models for primitive actions (or options made of a single action) are the same as the transition probabilities themselves. Even though the end result (Equations 6.8 and 6.4) is the same for both definitions, this redefinition is easier to interpret, since it uses the same form of value function regardless of whether we compute a policy over options or a policy over primitive actions. Additionally, with the new definition it is easier to understand how the options framework is adapted to our macro-action definition, which we present in Section 6.3.

$$p_{ss'}^o = \sum_{k=1}^{\infty} p(s', k)\, \gamma^{k-1} \qquad (6.7)$$

$$V^{\mu}(s) = E\{r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} V^{\mu}(s_{t+k}) \mid \mathcal{E}(\mu, s, t)\} = r_s^{\mu(s)} + \gamma \sum_{s'} p_{ss'}^{\mu(s)} V^{\mu}(s') \qquad (6.8)$$

$$r_s^o = r_s^{\pi(s)} + \gamma \sum_{s'} p_{ss'}^{\pi(s)}\left(1 - \beta(s')\right) r_{s'}^o \qquad (6.9)$$

$$p_{sx}^o = \sum_{s'} p_{ss'}^{\pi(s)}\left[\gamma\left(1 - \beta(s')\right) p_{s'x}^o + \beta(s')\,\delta_{s'x}\right] \qquad (6.10)$$
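The redefined option models of Equations 6.9 and 6.10 can also be computed by simple successive approximation instead of solving the linear system explicitly, as in the sketch below. The array layout (one-step models indexed by the primitive action the option's deterministic policy selects in each state) is an assumption of this illustration.

```python
import numpy as np

def option_models(P, R, policy, beta, gamma=0.95, iters=500):
    """Iteratively evaluate the option reward model r^o (Eq. 6.9) and the
    state prediction model p^o (Eq. 6.10) for an option with deterministic
    policy `policy`, termination probabilities `beta`, one-step
    transitions P[a, s, s'], and one-step rewards R[s, a]."""
    n = P.shape[1]
    P_pi = P[policy, np.arange(n)]          # P_pi[s, s'] = P[policy[s], s, s']
    r_pi = R[np.arange(n), policy]          # r_pi[s]     = R[s, policy[s]]
    cont = 1.0 - beta                       # probability of not terminating in s'
    r_o = np.zeros(n)
    p_o = np.zeros((n, n))                  # p_o[s, x]: discounted prob. of ending in x
    for _ in range(iters):
        r_o = r_pi + gamma * P_pi @ (cont * r_o)
        p_o = P_pi @ (gamma * cont[:, None] * p_o + np.diag(beta))
    return r_o, p_o
```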
In the H-QMDP heuristic we choose the best macro-action to 150 execute according to the QMDP heuristic in Equation 6.11, where primitive actions are replaced with macro-actions. Next, we describe the planning and execution algo- rithms in detail which are generalized to arbitrary tree depths. 6.3.1 Planning To find a completely observable plan at an abstract level, we treat the product states and the entry states (of the abstract states) as the states of an MDP. The set of actions is the set of primitive actions applicable on product states and the set of macro-actions for each abstract state s, which are responsible for taking the agent out of the abstract state .9, or to some child state of the abstract state. An exit from an abstract state can occur under any of the primitive actions, which result in one of the exit states. We design HPOMDPS such that only one type of primitive action can enter an exit state. In this manner the policy of each exit state can only be the type of primitive action that leads into it. Thus, the action associated with the exit-state can be used to connect the macro-action with states outside the subtree, in which the macro-action is defined. This idea is better understood in Figure 6.5. We should also node that in the case where an exit state S;c is entered from another exit state 8;, the policy of exit state 5x is the same as the policy of exit state S; (see Figure 6.6). In our navigation HPOMDP model all exits from abstract states occur under the “go-forward” action. Thus we need one exit-state for each spatial border of each abstract state. Macro-actions for exiting the abstract states can be computed using the MDP framework, where the robot is rewarded for taking the forward action in 151 Figure 6.5: The figure shows two macro-actions for abstract state S. The letter F means go-forward, L turn-left, and R turn-right. The solid arrows define the transition matrix of the forward action, the dashed arrows define the transition matrix of the turn-left action, and the dotted arrows define the transition matrix of the turn-right action. The policy of exit-states is always the same since exit-states can only be entered with a single type of primitive action. The two macro-actions differ only on the policy of state .94. Knowing the policy of exit-states allows us to know the outcome of the macro-actions at the level above. Figure 6.6: When an exit state S, is entered through another exit state 5;, the policy of S,E is the same as the policy of S; 152 the particular exit-state. Macro-actions for reaching a goal state in the middle of an abstract state can be computed by simply rewarding the robot for any action that it takes in the goal state. Figure 6.7 shows different macro-actions for corridors. ‘ '. . cut—Stale ' “. ’ cxII-sulc cur-Marc ' ‘.‘ ,' exit-state :—-- v"..‘ ‘ti'. " '. I. .. . '.‘ ~\‘ : ‘ ‘,' ' ‘ r \‘ I '.' | , ‘ ’ CXIl-SIML A m0 3 I .- ’,' 0‘ . CXII'SINC ‘ . M u _' E «I ‘ .' . ‘. ' a “ ." ‘ n .' ‘ _ 5 g _‘ ~ . l 1‘ . I -_ I . o . , . . I ‘. (I I- 'l I '1' ' I .1 K " Figure 6.7: F means go forward, R means turn right and L means turn left. The top figure is a macro-action for exiting the East side of the corridor under the “go-forward” action, the middle figure is a macro-action for exiting the West side of the corridor with the go-forward action, and the bottom figure is a macro-action for reaching a particular location within the corridor. 
Overall our macro-action representation is a special case of the options framework and is similar to the work presented in [106]. Each of our macro-actions can be initiated at some entry state of an abstract state and terminates at the exit states. More precisely, given a macro-action μ under an abstract state s, which is a policy over all the children (the set of product states defined as C^s_p, the set of entry states of other abstract states defined as C^s_n, and the set of exit states of abstract state s defined as X^s), we can formulate one option for every entry state s_n ∈ N^s (where N^s is the set of entry states of abstract state s). This option has initiation set I = {i ∈ C^s_p ∪ C^s_n : V(i|s_n) ≠ 0}, policy μ, and termination condition β, which is 0 for all children of s except the exit states s_x ∈ X^s, where it is 1.

Given the set of macro-actions available for each abstract state s (which we define to be M^s, also explained in Appendix A), we need to evaluate them for each entry state s_n. Equation 6.12 computes the transition probability from entry state s_n ∈ N^s to an adjacent state s' of the abstract state s under some macro-action μ (s' could be a product or entry state); Figure 6.8 illustrates the equation. An exit from state s can occur from any exit state s_x ∈ X^s under the primitive action π^s_μ(s_x), where π^s_μ defines the policy of macro-action μ on all states i that are in the C^s_p, C^s_n, or X^s sets. If i is a product or exit state, then π^s_μ(i) is a primitive action. If i is an entry state, then π^s_μ(i) is a macro-action that belongs to M^j (where i ∈ N^j).

$$T(s'|s_n, \mu) = \sum_{\forall s_x \in X^{s}}\;\sum_{\forall i \in C^{s}_{p} \cup C^{s}_{n}} V(i|s_n)\, TD(s_x|i,\mu)\, T\bigl(s'|s_x, \pi^{s}_{\mu}(s_x)\bigr) \tag{6.12}$$

The term TD(s_x|i,μ) is the expected discounted transition probability for exiting the parent of state i starting from i, and can be calculated by solving the set of linear equations defined in Equation 6.13. Note how this equation can be derived from the option Bellman equation 6.10, where the termination condition β is 0 for all product and entry children of state s and 1 for all exit states s_x ∈ X^s.

$$TD(s_x|i,\mu) = T\bigl(s_x|i, \pi^{s}_{\mu}(i)\bigr) + \gamma \sum_{j \in C^{s}_{p} \cup C^{s}_{n}} T\bigl(j|i, \pi^{s}_{\mu}(i)\bigr)\, TD(s_x|j,\mu) \tag{6.13}$$

Figure 6.8: The figure shows the operation of Equation 6.12. The dashed arrow is the discounted transition probability for reaching exit state s_x from state i. The solid arrows are simply the transition probabilities for the forward action.

In a similar manner we can calculate the total reward R that will be gained when an abstract action μ is executed at an entry state s_n, as shown in Equation 6.14,

$$R(s_n, \mu) = \sum_{i \in C^{s}_{p} \cup C^{s}_{n}} V(i|s_n)\, RD(i,\mu) \tag{6.14}$$

where the term RD(i,μ) can be calculated by solving the set of linear equations defined in Equation 6.15. Again, this equation is a direct derivation of the option Bellman equation 6.9, where β is 0 for all product and entry children of state s and 1 for all exit states s_x ∈ X^s.

$$RD(i,\mu) = R\bigl(i, \pi^{s}_{\mu}(i)\bigr) + \gamma \sum_{j \in C^{s}_{p} \cup C^{s}_{n}} T\bigl(j|i, \pi^{s}_{\mu}(i)\bigr)\, RD(j,\mu) \tag{6.15}$$

Given any arbitrary HPOMDP and all the macro-actions M^s of every abstract state, we can calculate all the reward and transition models at the highest level by starting from the lowest levels and moving toward the top.
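Since β is 1 only at the exit states, Equations 6.13 and 6.15 reduce to finite linear systems that can be solved directly rather than iteratively. The sketch below is a minimal illustration of this, assuming the transition and reward arrays restricted to the macro-action's policy have already been extracted; the array names and the toy numbers are assumptions, not structures from the thesis.

```python
import numpy as np

def macro_action_models(P_int, P_exit, r, gamma=0.95):
    """Solve the linear systems of Equations 6.13 and 6.15 for one
    macro-action under one abstract state.

    P_int[i, j]  = T(j  | i, pi(i)) for internal children i, j (product/entry)
    P_exit[i, x] = T(sx | i, pi(i)) for exit states sx
    r[i]         = R(i, pi(i))

    Returns TD[i, x] (discounted exit probabilities) and RD[i]
    (discounted reward accumulated before exiting)."""
    n = P_int.shape[0]
    A = np.eye(n) - gamma * P_int       # (I - gamma * P_pi)
    TD = np.linalg.solve(A, P_exit)     # TD = P_exit + gamma * P_int @ TD
    RD = np.linalg.solve(A, r)          # RD = r + gamma * P_int @ RD
    return TD, RD

# Toy corridor: two internal states, one exit state, -1 reward per step.
P_int  = np.array([[0.1, 0.8],
                   [0.0, 0.2]])
P_exit = np.array([[0.1],
                   [0.8]])
r = np.array([-1.0, -1.0])
TD, RD = macro_action_models(P_int, P_exit, r)
print(TD, RD)
```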
Once the transition and reward models are calculated, we can construct a plan among product and entry states at the most abstract level (level 2). To find a plan, we first compute the optimal values of states by solving the set of equations defined by the Bellman equation shown below,

$$V^{*}(s_1) = \max_{a \in A \cup M^{root}} \Bigl[ R(s_1, a) + \gamma \sum_{s_2} T(s_2|s_1, a)\, V^{*}(s_2) \Bigr] \tag{6.16}$$

where a is either a primitive action or a macro-action. We can then associate the best action with each state (the optimal policy π*) greedily, as shown in Equation 6.17.

$$\pi^{*}(s_1) = \operatorname{argmax}_{a \in A \cup M^{root}} \Bigl[ R(s_1, a) + \gamma \sum_{s_2} T(s_2|s_1, a)\, V^{*}(s_2) \Bigr] \tag{6.17}$$

Executing an abstract plan means that the robot can decide the next macro-action only when it enters an entry state at the abstract level. If we know at which entry state the robot is, we can issue the appropriate macro-action, which in turn may issue another macro-action at a lower level, until a primitive action is issued. In reality, we never know whether the robot is exactly at an entry state at the abstract level, and therefore the robot may have to decide the next macro-action when it is in the middle of some abstract state. Figure 6.9 illustrates this problem.

Figure 6.9: S1, S2, S3, and S4 cannot be activated vertically from states at the abstract level. Therefore, if the agent were to start in any of these states we would not know which macro-action is best to initiate.

In other words, we need to know which macro-action to issue next if the current macro-action is interrupted before it terminates at one of the exit states. The best macro-action for intermediate states (states that are not direct descendants of the abstract level) needs to be computed by taking into consideration the plan at the level above and the consequences that each macro-action available at that intermediate state will have on the long-term reward. Formally, for any HPOMDP, we compute the values of all states which are not at the abstract level in an iterative fashion from top to bottom, as explained in Figure 6.10 and described in Equation 6.18. The best macro-action for each state which is not a child of the root is then computed greedily according to Equation 6.19.

Figure 6.10: The value of a state s under a macro-action μ is computed by taking into consideration the total reward and discounted transition probability for exiting the parent of s, p(s), from all exit states s_x, and the values of the resulting states adjacent to p(s), such as s'. Equation 6.18 defines the value of a state as the maximum value among all macro-actions μ that belong to the set of macro-actions M^{p(s)} available for the parent of s, p(s).

$$V(s) = \max_{\mu \in M^{p(s)}} \Bigl[ RD(s,\mu) + \gamma \sum_{s' \in C_{p}^{p(p(s))} \cup\, C_{n}^{p(p(s))}}\;\sum_{\forall s_x \in X^{p(s)}} TD(s_x|s,\mu)\, T\bigl(s'|s_x, \pi^{p(s)}_{\mu}(s_x)\bigr)\, V(s') \Bigr] \tag{6.18}$$

$$\pi(s) = \operatorname{argmax}_{\mu \in M^{p(s)}} \Bigl[ RD(s,\mu) + \gamma \sum_{s' \in C_{p}^{p(p(s))} \cup\, C_{n}^{p(p(s))}}\;\sum_{\forall s_x \in X^{p(s)}} TD(s_x|s,\mu)\, T\bigl(s'|s_x, \pi^{p(s)}_{\mu}(s_x)\bigr)\, V(s') \Bigr] \tag{6.19}$$

6.3.2 Execution

To execute the abstract plan we keep track of a probability distribution for each level of abstraction (at each tree depth). An efficient way to calculate the belief state for each level of the HPOMDP is to first calculate a global transition matrix among all product states under the primitive actions, and to use that transition matrix to update a global belief state among the product states after every action and observation, as shown in Equation 6.20,

$$b'(s) = \frac{1}{P(z|a,b)}\, P(z|s,a) \sum_{s'=1}^{|S|} P(s|a,s')\, b(s') \tag{6.20}$$

where b' is the next belief state, P(z|s,a) is the probability of perceiving observation z when action a was taken in state s, P(s|a,s') is the probability of going from state s' to state s under action a, b is the previous belief state, and P(z|a,b) is a normalizing constant.
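The following minimal sketch implements the belief update of Equation 6.20 with tabular numpy arrays; the array shapes and the toy model are illustrative assumptions rather than the thesis's data structures.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes filter over the product states (Equation 6.20).

    T[a, s_prev, s] = P(s | a, s_prev)    global transition matrix
    O[a, s, z]      = P(z | s, a)         observation model
    b               = current belief over product states."""
    predicted = b @ T[a]                  # sum_{s'} P(s | a, s') b(s')
    unnorm = O[a][:, z] * predicted       # weight by P(z | s, a)
    return unnorm / unnorm.sum()          # normalize by P(z | a, b)

# Tiny example: 2 states, 1 action, 2 observations.
T = np.array([[[0.7, 0.3],
               [0.2, 0.8]]])
O = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, z=1, T=T, O=O))
```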
In the robot navigation domain there are 16 observations, which indicate the presence of a wall or an opening on each of the four sides of the robot. The observations are extracted from trained neural networks whose inputs are local occupancy grids constructed from sonar sensors and whose outputs are probabilities of walls and openings [94]. Using the flat belief state b we can calculate the belief of the abstract states in an iterative fashion from the leaves to the root, where the belief of an abstract state is equal to the sum of the beliefs of its children, as shown in Equation 6.21.

$$b(s) = \sum_{c(s,i) \in C^{s}_{p} \cup\, C^{s}_{a}} b\bigl(c(s,i)\bigr) \quad \text{if } s \text{ is an abstract state} \tag{6.21}$$

After every action and observation we update the belief state b, and then choose actions according to one of two strategies. The first strategy is a hierarchical most likely state strategy (H-MLS), and the second is a hierarchical QMDP strategy (H-QMDP), which we describe next.

H-MLS

For the H-MLS strategy we start from the root and iteratively find the most likely child state until a product state s is reached. To avoid situations where the robot gets stuck in loop behavior, we do not always choose the state at the peak of the belief, but rather choose randomly among a set of most likely states. For each abstract state we maintain a separate set of most likely states, which are states that share the same parent abstract state (or, in other words, states that belong to the same sublevel). A sublevel is a set of states which all have the same parent. The states that belong to the most likely state set at the current sublevel are chosen such that their belief is closer than 1/|S| to the maximum belief in the current sublevel (where |S| is the total number of states at the same tree depth as the current sublevel). Once a product state is reached we choose the best macro-action to execute according to Equation 6.19. The macro-action terminates when one of the ancestors is no longer among the most likely state set of its sublevel. We then jump to the sublevel where the change occurred and restart the process.

To execute a macro-action under some abstract state S, we create a new separate local belief state b', which we initialize according to the most likely child product state of S. The macro-action is then executed using the flat MLS strategy [22]. A goal is reached when the most likely product state is the same as the goal state and the probability of being at its ancestor at the most abstract level is larger than 0.9. Figure 6.11 describes our H-MLS execution algorithm in detail, and Figure 6.12 shows an example H-MLS execution.

H-QMDP

For the H-QMDP strategy we find the best macro-action μ to execute under every abstract state S according to Equation 6.22.

$$\mu = \operatorname{argmax}_{\mu \in M^{S}} \sum_{s \in C^{S}_{p}} b(s)\Bigl[ RD(s,\mu) + \gamma \sum_{s' \in C_{p}^{p(S)} \cup\, C_{n}^{p(S)}}\;\sum_{\forall s_x \in X^{S}} TD(s_x|s,\mu)\, T\bigl(s'|s_x, \pi^{S}_{\mu}(s_x)\bigr)\, V(s') \Bigr] \tag{6.22}$$

The overall best macro-action μ and abstract state S is the pair which is best over all abstract-state and macro-action pairs. The macro-action is executed under the abstract state S until another <μ, S> pair is maximized according to the above equation. Executing a macro-action is done in the same manner as for the H-MLS heuristic. Figure 6.13 explains the H-QMDP heuristic graphically.
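A sketch of the H-QMDP selection rule of Equation 6.22 is shown below, assuming the macro-action models (RD, TD, and the exit-state transitions) and the values of Equation 6.18 have already been precomputed and stored in plain dictionaries; all container names and numbers here are illustrative assumptions rather than the thesis's data structures.

```python
import numpy as np

def h_qmdp_choice(models, belief, values, gamma=0.95):
    """Score every (abstract state, macro-action) pair as in Equation 6.22.

    `models[S][mu]` is a hypothetical dict with precomputed arrays:
        "children": indices of the product children of S in the global belief
        "RD":   RD(s, mu) for those children              shape (n,)
        "TD":   TD(s_x | s, mu)                           shape (n, n_exit)
        "Tout": T(s' | s_x, pi(s_x)) to adjacent states   shape (n_exit, n_adj)
        "adj":  indices of those adjacent states in `values`
    Returns the (S, mu) pair with the highest belief-weighted value."""
    best, best_score = None, -np.inf
    for S, macros in models.items():
        for mu, m in macros.items():
            q = m["RD"] + gamma * m["TD"] @ (m["Tout"] @ values[m["adj"]])
            score = belief[m["children"]] @ q
            if score > best_score:
                best, best_score = (S, mu), score
    return best

# Minimal illustrative call with one abstract state and one macro-action.
models = {"S1": {"exit-east": {
    "children": np.array([0, 1]),
    "RD": np.array([-3.0, -1.5]),
    "TD": np.array([[0.8], [0.9]]),
    "Tout": np.array([[1.0, 0.0]]),
    "adj": np.array([2, 3])}}}
belief = np.array([0.4, 0.3, 0.2, 0.1])
values = np.array([0.0, 0.0, 10.0, 4.0])
print(h_qmdp_choice(models, belief, values))
```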
1. Set the current sublevel equal to 2 (the children of the root), initialize the belief b, and create the most likely state sets for each sublevel. The states that belong to the most likely state set of each sublevel are chosen such that their belief is closer than 1/|S| to the maximum belief in the sublevel (where |S| is the total number of states at the same tree depth as the sublevel).

2. Recursively choose a random state from the most likely state set at the current sublevel until a product state i is reached.

3. Execute the appropriate macro-action μ from state i. This macro-action is one of the available macro-actions that belong to the parent of i, that is, μ ∈ M^{p(i)}.

(a) To execute the macro-action μ, create a new belief state b' which is defined only for the subtree rooted at the parent of i. Initialize b' such that b'(i) = 1 and the belief of all other product states that belong to the subtree is 0. Set parent = p(i).

(b) Choose the most likely state j among the children of parent according to b'.
- If the most likely state is an abstract state (which may happen in unbalanced HPOMDPs), then we find the most likely entry state j_n. The belief of being at an entry state j_n is calculated by summing the beliefs of the children of j (according to b') which have a non-zero vertical transition from the entry state j_n. The macro-action μ' to be executed under abstract state j is then defined by the macro-action μ that was initiated under parent, μ' = π^{parent}_μ(j_n). We set parent = j and μ = μ' and repeat step 3(b).
- If the most likely state j is a product state, then the primitive action a to be executed is defined by the macro-action μ that was initiated under parent: a = π^{parent}_μ(j).

(c) Execute the primitive action, obtain the resulting observation, update both b and b', and re-estimate the most likely state sets for each sublevel.
- If the parent of i (or an ancestor above it) is not in the most likely state set of its sublevel, terminate the macro-action, jump to that sublevel, and go to step 2.
- Otherwise, go to step 3(b).

Figure 6.11: The figure describes our hierarchical MLS execution algorithm in detail. The algorithm initiates macro-actions by recursively finding the most likely state from the root to the leaves. Once a macro-action is initiated it runs until one of the ancestors of the state from which the macro-action was initiated is no longer the most likely state.

Figure 6.12: The figure shows an example H-MLS execution. The bold solid arrows indicate the macro-actions, while the bold dashed arrows indicate a new macro-action initiation. The beginning of each bold dashed arrow indicates the state at which we recursively start choosing the most likely state until a product state is reached, at which point we can decide the appropriate next macro-action. Starting from the root we recursively choose the most likely state, which turns out to be S3. We then begin to execute the appropriate macro-action from S3 until one of its ancestors is no longer the most likely state (among its adjacent states). This happens at time T5, when S5 becomes the most likely state among the children of S1. Again, we recursively find the most likely product state starting from S5, which turns out to be S6. We begin to execute the appropriate macro-action until time T7, at which point the most likely state among the children of the root has changed to S8. We then again recursively find the most likely state starting from state S8, which turns out to be state S10. The appropriate macro-action executes until time T9, at which point the most likely state among the children of S8 changes to state S12. Again, we recursively find the most likely state starting from S12, which is S13, and execute the appropriate macro-action.
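The recursive descent of steps 1 and 2 can be sketched as follows (an illustration, not the thesis's code); the tree containers used here are hypothetical stand-ins for the HPOMDP structure.

```python
import random

def most_likely_set(beliefs):
    """States whose belief is within 1/|S| of the sublevel's maximum,
    where |S| is the number of states in the sublevel (step 1 / 3(c))."""
    n = len(beliefs)
    peak = max(beliefs.values())
    return [s for s, b in beliefs.items() if peak - b < 1.0 / n]

def descend_to_product_state(state, children, sublevel_beliefs, is_product):
    """Recursively pick a random member of the most likely state set at each
    sublevel until a product state is reached (step 2 of Figure 6.11).

    `children[s]` lists the children of s, `sublevel_beliefs[s]` maps each
    child of s to its belief, and `is_product[s]` flags product states."""
    while not is_product[state]:
        beliefs = {c: sublevel_beliefs[state][c] for c in children[state]}
        state = random.choice(most_likely_set(beliefs))
    return state

# Toy two-level tree: root -> {S1, S2}, S1 -> {p1, p2}, S2 -> {p3}.
children = {"root": ["S1", "S2"], "S1": ["p1", "p2"], "S2": ["p3"]}
sublevel_beliefs = {"root": {"S1": 0.7, "S2": 0.3},
                    "S1": {"p1": 0.5, "p2": 0.2},
                    "S2": {"p3": 0.3}}
is_product = {"root": False, "S1": False, "S2": False,
              "p1": True, "p2": True, "p3": True}
print(descend_to_product_state("root", children, sublevel_beliefs, is_product))
```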
Figure 6.13: Each type of bold arrow represents a different macro-action that can be initiated from a product state. The H-QMDP heuristic multiplies the value of executing every macro-action from a product state by the current belief of the product state. The best macro-action to execute is the one for which the sum of all the products of values and beliefs is maximal. Note that choosing a macro-action also determines the abstract state under which it is executed, since every macro-action is unique to a particular abstract state. In this example we have assumed that there is only one macro-action for each abstract state, but in our navigation models we have multiple macro-actions for each abstract state.

6.4 Planning experiments using the robot platforms

6.4.1 Experiments with hand-coded models

We applied the above algorithms to two robot navigation environments, which were shown in Figures 4.3 and 4.4 in Chapter 4. Both environments were represented as HPOMDPs using hand-coded parameters. The reason we used hand-coded parameters was to separate the evaluation of the planning algorithms from that of the learning algorithms; otherwise it would be difficult to evaluate the planning algorithms, since the learned parameters affect robot performance, as we shall see in Section 6.4.2.

In the small navigation environment we set up two navigation tasks. One task was to get to the four-way junction in the middle of the environment (node I7), and the other was to get to the middle of the largest corridor (between nodes I2 and I12). The reason we set up these two navigation tasks is that they form two opposite and extreme situations: the four-way junction emits a unique observation in the environment, while the middle of the largest corridor is the most perceptually aliased location. Experiments in the small navigation environment were done using the Nomad 200 simulator.

We also performed experiments in the Nomad 200 simulator using the larger environment. In this set of experiments we again set up two navigation tasks, where one was to reach the four-way junction (node I9) and the other was to reach the middle of the largest corridor (between nodes I2 and I16). Additionally, we performed experiments using the real robot Pavlov in the real environment. In this set of experiments we formulated 5 different goals.

For each navigation task, we created macro-actions for every abstract state. For every abstract state, we created one macro-action for each of its exit states (a macro-action that would lead the robot out of the abstract state through the particular exit state). For the abstract state that contained the goal location we created an additional macro-action that could take the robot to the goal state. We then defined a global reward function, where the robot is penalized by -1 for every primitive action executed, and rewarded by +100 for primitive actions leading to the goal state. In this way the robot learns to choose shorter routes so as to minimize the cost. Based on the reward function we computed the reward and transition models of the macro-actions, and then computed a plan using macro-actions over the children of the root.

For the small simulated environment we compared the H-MLS algorithm with the flat most likely state strategy (F-MLS), and the H-QMDP algorithm with the flat QMDP (F-QMDP).
For each navigation task we started the robot 3 times from 6 different locations and recorded the number of steps it took to reach the goal. We also recorded the number of times the robot was successful in reaching its goal over the total number of runs. The results are summarized in Table 6.1. All experiments were started both from a uniform initial belief state, where the starting position was not known, and from a localized initial belief state, where the starting position was known.

Envir.       Initial position   Planning method   Steps    Macro steps   Success
small sim.   unknown            H-MLS             56.1     15.5          100 %
                                F-MLS             73.15    -             100 %
                                H-QMDP            47       9.6           100 %
                                F-QMDP            47       -             66.6 %
             known              All               29       5.5           100 %

Table 6.1: The table shows the average number of primitive steps and the average number of decisions at the abstract level. The results are for the small environment in the Nomad 200 simulator.

It is evident from Table 6.1 that the hierarchical approaches perform better than the flat ones. The MLS strategies produce 100% success due to the randomness in the selection of the peak of the belief state. The QMDP approaches perform better than the MLS approaches; however, the flat QMDP approach has a low success rate. Starting from known initial positions, all approaches perform the same.

For the large simulated environment we tested all the algorithms by starting the robot 3 times from 7 different locations and measured the steps it took to reach the goal. The results are summarized in Table 6.2. We also ran the robot in the real environment using the HPOMDP model. In this experiment we started the robot from 11 different locations and used 5 different goal destinations. The results are also summarized in Table 6.2.

Envir.       Initial position   Planning method   Steps    Macro steps   Success
large sim.   unknown            H-MLS             90.5     15.8          100 %
                                F-MLS             163      -             93 %
                                H-QMDP            74       12.5          85 %
                                F-QMDP            -        -             28 %
             known              All               46       6.7           100 %
large real   unknown            H-MLS             130      22.3          100 %
             known              H-MLS             53.8     7.2           100 %

Table 6.2: The table shows the average number of primitive steps and the average number of decisions at the abstract level. The results are for the large environment in the Nomad 200 simulator and in the real world.

It is clear from Table 6.2 that the hierarchical approaches outperform the flat ones when the starting belief state is uniform. Due to the low success rate of the flat F-QMDP method (starting from an unknown position) we did not report its number of steps. Failures to reach the goal occur when the robot gets stuck in loop behavior. The MLS strategies produce high success rates due to the randomness in choosing the peak of the belief state. When the initial position is known, all the methods produce the same result. For the real environment we ran the most successful hierarchical approach, H-MLS, which proved to work successfully. Comparing Table 6.2 with Table 6.1, we can conclude that the hierarchical approaches scale more gracefully to large environments than the flat approaches.

We also measured the time it took for the robot to construct a plan, including the time to construct the macro-action that leads to the goal state. The results are summarized in Table 6.3, where the HPOMDP model takes significantly less time to compute a plan. The time complexity of the value-iteration algorithm is O(S²T), where S is the number of states and T is the planning horizon.
If we construct perfect HPOMDP trees, where all leaves are at the same level and each abstract state has the maximum number of children, then the time complexity of value iteration over the children of the root would be O(S^{2/d}NT), where d is the depth of the tree and N is the maximum number of entry states of an abstract state. Usually we do not construct perfect trees, and therefore the time complexity lies between O(S^{2/d}NT) and O(S²T).

It is apparent from Table 6.3 that hierarchical plans are constructed faster than flat ones, due to the small number of states at the abstract level and due to the fact that we can reuse precomputed macro-actions. For each navigation task only one new macro-action needs to be computed: the one for the abstract state which contains the goal location.

Environment   Goal location   Model     Planning time (sec)
small         I7              HPOMDP    2.11
                              POMDP     15.34
              I2-I12          HPOMDP    5.7
                              POMDP     14.67
large         I9              HPOMDP    5.05
                              POMDP     88.02
              I2-I16          HPOMDP    26.12
                              POMDP     83.57

Table 6.3: The table shows the planning times for the different environments and models. The planning times were computed on a 900 MHz Athlon processor.

6.4.2 Experiments with learned models

We also performed experiments with learned HPOMDP models to investigate the effect of learning HPOMDP models on the navigation system as a whole. The experiments were done in the Nomad 200 simulator using the smaller navigation environment. The overall training procedure was as follows:

1. First we started with an initial HPOMDP model. In this model we set all probabilities of observing the correct observation to 0.6. In all corridor states we set the overshoot transition to 0.6, the self-transition to 0.2, and the transition to the correct state to 0.2. In all junction states we set the self-transition to 0.5 and the transition to the correct state to 0.5. We trained this model by simply training all the submodels of the abstract states separately (the "submodels-only" strategy). Training data was collected by running the robot separately in each abstract state. In corridors, we collected data by simply running the robot from every entry state of a corridor to the opposite exit state. We collected 30 training sequences in this manner, two for each corridor.

2. Given the trained model we were able to run different navigation tasks. For each navigation task we collected data for the "selective-training" strategy. A data sequence was recorded if the robot entered an abstract state and was 99% sure that it was at the particular entry state. Recording stopped when two abstract states had been traversed. We collected 38 training sequences in this manner. We then retrained the model learned in the first stage, using the new training data, with the "selective-training" method.

To evaluate the different stages of learning we started the robot from 12 different locations and provided the initial starting position. The goal was to get to the four-way junction in the middle of the environment (node I7). Using our hierarchical planning and execution algorithm H-MLS, we measured the average number of steps, the success rate of reaching the goal, and the mean of the average entropy per run. We measured the average entropy over all runs as H = (1/T) Σ_t H_t, where T is the total number of steps over all runs and H_t is the normalized entropy of the probability distribution over all product states, which is computed as was shown earlier in Equation 6.1.
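For reference, a normalized belief-entropy computation of the kind described here might look as follows; the normalization by log|S| is an assumption based on the phrase "normalized entropy" (Equation 6.1 itself appears in an earlier section and is not reproduced here).

```python
import numpy as np

def normalized_entropy(belief):
    """Entropy of a belief over product states, normalized by log|S| so the
    result lies in [0, 1] (0 = fully localized, 1 = uniform).  The log|S|
    normalization is an assumption, not necessarily the thesis's Eq. 6.1."""
    b = np.asarray(belief, dtype=float)
    nz = b[b > 0]                                   # treat 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum() / np.log(len(b)))

print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))   # 1.0 (uniform)
print(normalized_entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0 (localized)
```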
A run was considered a failure if the robot traveled more than 70 steps and was still unable to reach the goal. We measured the average number of steps to reach the goal over successful trials only, but we measured the average entropy over both successful and unsuccessful trials. The results are summarized in Table 6.4.

Training stage       Average number of steps   Average entropy   Success
before training      51                        0.29              16.6 %
submodels-only       27                        0.14              100 %
selective-training   27                        0.10              100 %

Table 6.4: The table shows the average number of steps to reach the goal and the average entropy per run after every training stage.

Simply training the submodels provides an HPOMDP model that can be used successfully for robot navigation tasks. Retraining the already learned model with additional data from the navigation tasks further improves robot localization.

6.5 Conclusions

We have seen in this chapter that our hierarchical approach to robot navigation using the HPOMDP model outperforms previous flat POMDP-based representations. The algorithms presented take fewer steps than the flat POMDP approaches in guiding the robot to the goal starting from no positional knowledge. Also, the hierarchical approaches scale more gracefully to large environments than the flat approaches, as can be seen from Tables 6.1 and 6.2. Additionally, abstract plans are constructed much faster than flat plan representations, as can be seen from Table 6.3.

The better performance in taking the robot to its goal is due to two key reasons. First, the robot makes decisions based on a hierarchical representation of its belief about its true location. A hierarchical belief representation is advantageous in that at higher levels of abstraction uncertainty is lower than at lower levels, due to the significantly smaller number of states. Second, macro-action termination depends on the abstract belief state, which changes more slowly than the global product belief state. As a result, macro-actions execute for multiple steps, which results in entropy reduction.

The disadvantage of our algorithms is that they are still approximate solutions based on a completely observable hierarchical MDP solution. Even though they perform better than flat POMDP approximations, they are still far from the optimal number of steps for reaching the goal, which can be achieved when the initial position is known. A future direction would be the study of this model with respect to exact POMDP solutions. A hierarchical approach for planning and execution in POMDPs that is closer to exact POMDP solutions is described in [91]. In the next chapter we present in detail the main conclusions and contributions of this dissertation, along with directions for future research.

Chapter 7

Conclusions and future work

In this final chapter we summarize the main contributions of the thesis and describe directions for future research. Section 7.1 summarizes the dissertation, Section 7.2 presents the main contributions, and Section 7.3 presents a number of concrete directions for future research.

7.1 Summary

In this dissertation we have described a framework for learning and planning using hierarchical models of partially observable environments, and applied it to indoor robot navigation. We introduced a new multi-resolution model called HPOMDP, and presented hierarchical learning and planning algorithms for this model.

In Chapter 3 we defined the HPOMDP model and described an EM-type learning algorithm for learning its parameters.
In Chapter 4 we described the robot platforms, the extraction of features from the sensor values, the representation of spatial environments as HPOMDPs, and the overall navigation architecture.

In Chapter 5 we introduced two approximate learning strategies, "reuse-training" and "selective-training", and presented detailed experimental learning studies for partially observable indoor corridor environments. The experiments were done on a small simulated environment as well as with robot platforms in the Nomad 200 simulator and in the real world. We compared the hierarchically learned models with models learned using the standard EM algorithm for flat POMDPs. The comparisons were made using different metrics, such as distance to the generative model, error in robot localization, log likelihood of the training data, and training time. Additionally, we used the hierarchical learning algorithms to prune away false transition models at higher levels of abstraction, both for a three-level and a four-level hierarchical model. The results show that the hierarchical learning algorithms produce learned models which are statistically closer to the generative model according to our distance metric. The hierarchically learned models also produce better robot localization. Additionally, the hierarchical learning algorithms are better at inferring the structure of the environment at the abstract level than the EM algorithm for learning flat POMDPs. The approximate training strategies, which take advantage of short data sequences, have their own advantages: while the "reuse-training" method converges quickly (it takes a small number of training epochs) and usually produces better models than all the other learning algorithms, the "selective-training" method in practice trains faster (it takes less time per training epoch) than any of the other learning algorithms.

In general, the advantage of the hierarchical learning methods is their ability to better recognize the relationship between states which are spatially far apart (due to the way transition models are learned at the abstract levels). A disadvantage is that in our implementation the algorithm's time complexity is O(NT³). Nonetheless, a recently developed training algorithm for HHMMs runs in linear time [83] and could be adapted to overcome this limitation in our implementation. In addition, training data sequences need to start at some child of the root and finish at some child of the root. This implies that training sequences are in some respect labeled. For example, in a corridor environment training data needs to start at an entry state of an abstract state (which may be a corridor or a junction) and finish at an exit state of an abstract state.

In Chapter 6 we introduced our hierarchical planning and execution algorithms and used them in robot navigation experiments. Experiments were performed with good hand-coded models, both for the real environment and for simulation environments in the Nomad 200 simulator. To evaluate the planning algorithms we measured the average number of steps it took for the robot to reach locations in its environment starting from uniform initial belief states. We also performed experiments with learned models in the Nomad 200 simulator, and showed how the whole learning and planning system is combined in practice. We showed how navigation performance improves after every learning stage. The learning stages were compared based on the average number of steps to reach the goal and the average uncertainty per run as the model parameters improved.
Our planning algorithms use both spatial and temporal abstraction, and combine approaches from hierarchical MDP solutions with approximate POMDP strategies. Using these algorithms the robot is able to execute navigation tasks starting from no initial positional knowledge, in fewer steps than with flat POMDP strategies. The advantages of these algorithms became even more apparent as the environment model got larger. Another advantage of hierarchical planning is that macro-action models can be computed and evaluated once, and then reused for multiple navigation tasks. Additionally, the fact that abstract plans are mappings from entry states (which are a small part of the total state space) to macro-actions reduces planning time. The ability to successfully execute navigation tasks starting from an unknown initial position makes this navigation system comparable to the current state of the art in robot navigation systems, such as the Monte Carlo localization approach described in [73].

The disadvantage of our planning algorithm is that it assumes abstract and product states are completely observable. The assumption that we have different sets of macro-actions for each abstract state may not be valid in an exact hierarchical POMDP solution. In general, in an exact HPOMDP solution we would have to create POMDP-type macro-actions (or POMDP policies as was shown in Section 2.2, which are policy graphs mapping observations to actions) and evaluate their transition and reward models for each abstract state. Nonetheless, computing exact POMDP-type macro-actions even for small state spaces is intractable.

7.2 Contributions

We summarize the main contributions of this dissertation below:

- We have formally introduced and described the hierarchical partially observable Markov decision process model. This model is derived from the hierarchical hidden Markov model with the addition of actions and rewards. Unlike flat partially observable Markov decision process models, it provides both spatial and temporal abstraction. Spatial abstraction is achieved by low-resolution abstract states, which group together finer-resolution states in a tree-like fashion. In temporal abstraction, a transition from an abstract state cannot occur unless the process under the abstract state has finished. This model was presented in Chapter 3.

- We have derived an EM algorithm for learning the parameters of a given HPOMDP. We empirically showed that the algorithm converges by increasing the log likelihood of the data after every training epoch. The algorithm was presented in Chapter 3, and the empirical convergence results were described in Chapter 5.

- We have introduced two approximate training methods. For the first method, which we named "selective-training", only selected parts of an overall HPOMDP are allowed to be trained for a given training sequence. This results in faster training. For the second method, which we named "reuse-training", submodels of an HPOMDP model are first trained separately and then reused in the overall model. This results in better learned models. Both algorithms and results were presented in Chapter 5.

- We have derived two planning and execution algorithms for approximating the optimal policy in a given HPOMDP. The planning and execution algorithms combine hierarchical solutions for solving Markov decision processes, such as the options framework [25], with approximate flat POMDP solutions [22] (Chapter 6).

- We conducted a detailed experimental study of the learning algorithms for indoor robot navigation. Experiments were presented for various large-scale models, both in simulation and on a real robot platform. The experiments demonstrate that our learning algorithms can efficiently improve initial HPOMDP models, which results in better robot localization (Chapters 5 and 6). Additionally, we compared our learned models with equivalent flat POMDP models that were trained with the same data. We have developed approximate training methods to learn HPOMDP models that converge much more rapidly than standard EM procedures for flat POMDP models. When the learned HPOMDP models were converted to flat models, they gave better robot localization than the flat POMDP models that were trained with the standard flat EM algorithm for POMDPs (Chapter 5).

- We applied the planning algorithms to indoor robot navigation, both in simulation and in the real world. Our algorithms are highly successful in taking the robot to any environment state starting from no positional knowledge (a uniform initial belief state). In comparison with flat POMDP heuristics, our algorithms compute plans much faster and use a considerably smaller number of steps in executing navigation tasks (Chapter 6).

7.3 Future Directions

We now describe some directions for future research, including faster training algorithms, structure learning, extension of the model to sharing substructure, and exact HPOMDP solutions.

7.3.1 Faster training algorithms

It would be useful to consider other approaches to implementing the same learning algorithm where the time complexity is less than O(NT³). Such an approach, which uses the dynamic Bayesian network formalism, has recently been developed for HHMMs [83]. The transformation of HHMMs to DBNs allows us to train them by applying standard inference techniques for Bayesian networks, such as the junction-tree algorithm [84]. The junction-tree algorithm replaces the expectation step in the hierarchical Baum-Welch algorithm. The major advantage is that with this combination HHMMs can be trained in linear time, while the conventional hierarchical Baum-Welch algorithm requires cubic time [17]. We could adapt such an approach to learning HPOMDPs, and at the same time verify its correctness by comparing it to the original HHMM training algorithm.

We could also explore approximate inference algorithms for computing the expectation step, such as Markov chain Monte Carlo (MCMC) methods [107]. These methods can be thought of as simulations of Markov processes in which we estimate the number of times a state has been visited (similar to the ξ variable in Baum-Welch). Such approaches are used in situations where exact computation of the expectation step is intractable. They have been applied to training factored HMMs [108], as well as HMMs with continuous action and observation spaces [109]. In our case, computation of the expectation step through sampling may reduce our training time, since it eliminates unnecessary summations (especially when the initial starting state is provided).

Finally, the investigation of an incremental EM procedure for training HPOMDPs may eliminate the burden of long training times. Incremental versions of EM for HMMs have been studied in [110] and [111]. In incremental EM algorithms the E-step computes sufficient statistics and can be updated after every new observation; the M-step is the same as in the batch EM algorithm.
7.3.2 Structure learning

Another future direction would be to learn the initial model structure (the number of states and the allowed vertical and horizontal transitions) independent of any specific application domain. In the past, researchers have developed such approaches for flat HMMs. One approach is top-down state splitting, where cross-validation is used to determine whether a local change to the model is beneficial [112]. Another, bottom-up, approach, which starts with a specific model consistent with the training data and generalizes the model by merging states together, is presented in [113]. We would like to extend such approaches to learning the structure of HPOMDPs. Moreover, since HHMMs can be converted to dynamic Bayesian networks [83], we can use structure-learning methods from the Bayesian network literature. Structure learning in Bayesian networks has been extensively studied by many researchers [114], [115], [116], [117].

7.3.3 Sharing substructure

Another future direction would be to extend the HPOMDP model so that abstract states can share substructure. With this extension we would be able to scale up to even larger domains, since a single submodel would be able to represent multiple abstract states. In such a representation not only would we have smaller models, but we could also modify the Baum-Welch calculation of the forward and backward variables so that it is done only once for abstract states that share the same substructure, and as a result speed up the computation time as well. We could, for example, have at an abstract level a representation of the spatial environment and at lower levels a representation of the notion of a corridor. This idea is illustrated in Figure 7.1. Such an extension, however, would impose certain requirements, such as that the policies required for each entry point of every abstract state that shares the same substructure be the same. Otherwise the robot would not know which action to execute, since its belief of being at any of the abstract states that share the same substructure would be the same. Unfortunately, knowledge of the required action means that a policy has already been computed, which means that the model is task specific. One possibility is to have a representation where some abstract states always represent fixed policies (e.g., always traverse the corridor). Unlike the approach in [85], where the effects of abstract actions at abstract levels are an integral part of the AHMM specification, we could take a bottom-up approach where low-level abstractions represent task-specific representations and high-level spatial abstraction is task independent.

7.3.4 Exact HPOMDP solutions

Finally, a key question that this dissertation may raise is how spatial and temporal abstraction can aid in the exact solution of POMDPs. If a macro-action is a policy tree as described in Chapter 2, then when do we initiate and when do we terminate each macro-action? What is an abstract state in a POMDP? Is it a set of physical states or a set of belief states? Do we need to compute macro-actions over a set of belief states or over a set of physical states? Are macro-actions applicable everywhere or only locally? In POMDPs primitive actions need to be applicable everywhere because the true state is never known; does the same condition hold for macro-actions? If we were to compute macro-actions over the entire POMDP, then the computational complexity would be the same as solving the whole POMDP, and therefore temporal abstraction may not be as useful.
If we assume, however, that we are given a set of macro-actions over the entire POMDP, then how does the computation of a global plan based on those macro-actions compare with a flat approach?

Figure 7.1: The top part shows a corridor environment and its flat POMDP representation. The bottom part shows an extended HPOMDP representation where abstract states share submodels. This future direction has the potential of scaling the algorithm to even larger domains, and of applying it to problems that have repetitive structure, such as the task of reading.

APPENDICES

Appendix A

A.1 Table of HPOMDP symbols

Symbol            Definition
C^s_p             the set of product children of abstract state s
C^s_a             the set of abstract children of state s
C^s_n             the set of entry states that belong to the abstract children of s
X^s               the set of exit states that belong to abstract state s
N^s               the set of entry states that belong to abstract state s
p(s)              the parent of state s
s_x               an exit state that belongs to abstract state s
s_n               an entry state that belongs to abstract state s
T(s'_n|s_x, a)    horizontal transition probability
V(s'_n|s_n)       vertical transition probability
O(z|s, a)         observation probability
R(s, a)           immediate reward function
b(s)              belief of state s
TD(s_x|i, μ)      discounted transition model
RD(i, μ)          discounted reward model
M^s               the set of macro-actions available under abstract state s
π^s_μ             policy of macro-action μ under abstract state s

Table A.1: Table of HPOMDP symbols

BIBLIOGRAPHY

Bibliography

[1] S. Thrun, "Learning metric-topological maps for indoor mobile robot navigation," Artificial Intelligence, vol. 99, no. 1, pp. 21-71, 1998.

[2] S. Koenig and R. Simmons, "Passive distance learning for robot navigation," in Proceedings of the Thirteenth International Conference on Machine Learning (ICML), pp. 266-274, 1996.

[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, 1998.

[4] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, 1996.

[5] R. A. Howard, Dynamic Programming and Markov Processes. The MIT Press, Cambridge, Massachusetts, 1960.

[6] K. J. Astrom, "Optimal control of Markov decision processes with incomplete state estimation," Journal of Mathematical Analysis and Applications, no. 10, 1965.

[7] C. Papadimitriou and J. Tsitsiklis, "The complexity of Markov decision processes," Mathematics of Operations Research, vol. 12(3), 1987.

[8] O. Madani, S. Hanks, and A. Condon, "On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision processes," in Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 409-416, 1999.

[9] L. E. Baum and G. R. Sell, "Growth functions for transformations on manifolds," Pac. J. Math., vol. 27, no. 2, pp. 211-227, 1968.

[10] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Proceedings of the IEEE, vol. 77, February 1989.

[11] S. Koenig and R. Simmons, "Unsupervised learning of probabilistic models for robot navigation," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 2301-2308, 1996.

[12] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977.

[13] F. C. Dyer, "Spatial cognition: lessons from central-place foraging insects," in A Synthetic Approach to Studying Animal Cognition (I.
Pepperberg, A. Kamil, and R. Balda, eds.), pp. 119—154, London: Academic Press, 1998. [14] H. L. Pick, M. R. Heinrichs, D. R. Montello, K. Smith, C. N. Sullivan, and W. B. Thompson, “Topographic map reading,” The Ecology of Human-Machine Systems, vol. 2, 1995. [15] J. Piaget and A. Szeminska, The Child’s Conception of Geometry. New York: Basic Books, 1960. [16] A. W. Siegel and S. H. White, “The development of spatial representations of large-scale environments,” in Advances in Child Development and Behaviour (H. W. Reese, ed.), Academic Press, 1975. [17] S. Fine, Y. Singer, and N. Tishby, “The Hierarchical Hidden Markov Model: Analysis and Applications,” Machine Learning, vol. 32, pp. 41—62, July 1998. [18] M. Mataric, “A distributed model for mobile robot environment-learning and navigation,” Tech. Rep. TR-1228, MIT, 1990. [19] G. Chartrand and O. R. Oellerman, Applied and Algorithmic Graph Theory. Mc Graw Hill, 1993. [20] R. Simmons, “Becoming increasingly reliable,” in Proceedings of the Interna- tional Conference on Artificial Intelligence Planning Systems (AIRS), pp. 152— 157, 1994. [21] M. Collett and T. S. Collett, “How do insects use path integration for their navigation,” Biol. Cybern, no. 83, pp. 245—259, 2000. [22] S. Koenig and R. Simmons, “A robot navigation architecture based on partially observable Markov decision process models,” in Artificial Intelligence Based Mobile Robotics:Case Studies of Successful Robot Systems (D. Kortenkamp, R. Bonasso, and R. Murphy, eds), pp. 91—122, MIT press, 1998. [23] I. Nourbakhsh, R. Powers, and S. Birchfield, “Dervish: An office-navigation robot,” AI Magazine, 53-60, vol. 16, no. 2, pp. 53—60, 1995. [24] N. Khaleeli, “A robust robot navigation architecture using partially observable semi-Markov decision processes,” Master’s thesis, University of South Florida, 1997. [25] R. S. Sutton, D. Precup, and S. Singh, “Between MDPS and semi-MDPS: A framework for temporal abstraction in reinforcement learning,” Artificial Intel- ligence, pp. 112:181—211, 1999. 188 [26] B. Kuipers and Y.-T. Byun, “A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations,” Journal of Robotics and Autonomous Systems, vol. 8, 1991. [27] H. Choset, Sensor based motion planning: The hierarchical generalized voronoi graph. PhD thesis, California Institute of Technology, 1996. [28] M. Puterman, Markov Decision Processes: Discrete Dynamic Stochastic Pro- gramming. John Wiley, 1994. [29] R. Bellman, Dynamic Programming. Princeton University Press, 1957. [30] D. P. Bersekas, Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, 1987. [31] C. J. C. H. Watkins, Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989. [32] J. N. Tsitsiklis and B. V. Roy, “Asynchronous stochastic approximation and Q-learning,” Machine Learning, vol. 16, September 1994. [33] K. J. Astrom, “Optimal control of Markov decision processes with incomplete state estimation,” Journal of Mathematical Analysis and Applications, vol. 10, pp. 174—205, 1965. [34] E. Sondik, The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971. [35] R. D. Smallwood and E. J. Sondik, “The optimal control of partially observable Markov processes over a finite horizon,” Operations Research, no. 21, pp. 1071— 1088, 1973. [36] C. White and D. Harrington, “Applications of Jensen’s inequality for adap- tive suboptimal design,” Journal of Optimaization Theory and Applications, pp. 32(1):89—99, 1980. [37] E. J. 
Sondik, “The optimal control of partially observable Markov processes over the infinite horizon,” Operations Research, pp. 26(2):282-304, 1978. [38] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial Intelligence, vol. 101, pp. 99— 134, 1998. [39] G. E. Monahan, “A survey of partially observable Markov decision processes: Theory, models, and algorithms,” Management Science, no. 28(1), pp. 1—16, 1982. [40] H. T. Cheng, Algorithms for Partially Observable Markov Decision Processes. PhD thesis, University of British Columbia, 1988. 189 [41] [45] [46] [48] [49] [50] [51] [52] A. Cassandra, M. L. Littman, and N. L. Zhang, “Incremental pruning: A sim- ple, fast, exact method for partially observable Markov decision processes,” in Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelli- gence (UAI), 1997. M. Hauskrecht, “Value-function approximations for partially observable Markov decision processes,” Journal of Artificial Intelligence Research, vol. 13, pp. 33— 94, 2000. R. Parr and S. Russell, “Approximating optimal policies for partially observ- able stochastic domains,” in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1088—1094, 1995. R. I. Brafman, “A heuristic variable grid solution method for POMDPS,” in Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI), 1997. E. Hansen, “Solving POMDPS by searching in policy space,” in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1998. N. Meuleau, L. Peshkin, K.-E. Kim, and L. Kaelbling, “Learning finite-state controllers for partially observable environments,” in Proceedings of the Fif- teenth Conference on Uncertainty in Artificial Intelligence (UAI), 1999. N. Meuleau, K.-E. Kim, L. Kaelbling, and A. Cassandra, “Solving POMDPS by searching the space of finite policies,” in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1999. L. Peshkin, N. Meuleau, and L. Kaelbling, “Learning policies with external memory,” in Interanational Conference on Machine learning (ICML), 1999. A. Kosaka and A. Kak, “Fast vision-guided mobile robot navigation using model-based reasoning and prediction of uncertainties,” in Proceedings of the International Conference on Intelligent Robots and Systems (IROS), pp. 2177— 2186, 1992. R. Smith and P. Cheeseman, “On the representation and estimation of spatial uncertainty,” The International Journal of Robotics Research, pp. 5:56—68, 1986. T. Dean, K. Basye, R. Chekaluk, S. Hyun, M. Lejter, and M. Randazza, “Cop- ing with uncertainty in a control system for navigation and exploration.,” in Proceedings of the the Eighth National Conference on Artificial Intelligence (AAAI), pp. 1010—1015, 1990. R. H. Battin and G. M. Levine, “Application of Kalman filtering techniques to the Apollo program,” tech. rep., MIT Instrumentation Laboratory, April 1969. 133-2401. 190 [53] I. J. Cox and J. Leonrad, “Modeling a dynamic environment using a Bayesian multiple hypothesis approach,” Artificial Intelligence, pp. 311—344, 1994. [54] S. Roumeliotis and G. Bekey, “Bayesian estimation and Kalman filtering: A unified framework for mobile robot localization,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), (San Francisco CA), pp. 2985—2992, 2000. [55] K. Basye, T. Dean, J. Kirman, and M. 
Lejter, “A decision-theoretic approach to planning, perception, and control,” IEEE Expert, vol. 7, no. 4, pp. 58-65, 1992. [56] J. K. Baker, “The dragon system-an overview,” IEEE Trans. Acoust. Speech Signal Processing, vol. ASSP-23, pp. 24—29, Feb 1975. [57] J. A. Bilmes, “A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian Mixture and Hidden Markov Models,” tech. rep., International Computer Science Institute, Berkley CA, 94704, April 1998. TR-97-021. [58] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995. [59] M. Jordan and R. Jacobs, “Hierachical mixtures of experts and the EM algo- ritm,” Neural Computaion, pp. 6:181-214, 1994. [60] R. Redner and H. Walker, “Mixture densities, maximum likelihood and the EM algorithm,” SIAM Review, 26(2), 1984. [61] G. J. McLachlan and T. Krishnan, The EM algorithm and eztentions. Wiley, New York, 1997. [62] J. Borenstein, H. R. Everett, and L. Feng, Navigating Mobile Robots: Sensors and Techniques. Wellesley, MA: A. K. Peters, Ltd, 1996. [63] J. Leonard and H. F. Durrant-Whyte, “Mobile robot localizatin by tracking geometric beacons,” in IEEE Transactions Robotics and Automation, vol. 7(3), pp. 376-382, 1991. [64] I. J. Cox, “Blanche- an experiment in guidance and navigation of an autonomous mobile rrobot,” in IEEE Transactions Robotics and Automation, vol. 7(3), 1991. [65] I. A. Getting, “The global positioning system,” IEEE Spectrum, pp. 36—47, December 1993. [66] P. Gaussier, C. Joulain, S. Zrehen, and J .-P. Banquet, “Visual navigation in an open environment without a map,” in Proceedings of the International Confer- ence on Intelligent Robots and Systems (IROS), (Grenoble), 1997. 191 [67] S. Chen and J. Weng, “State-based SHOSLIF for indoor visual navigation,” IEEE Trans. Neural Networks, vol. 11, no. 6, pp. 1300-1314, 2000. [68] H. Moravec and A. Elfes, “High resolution maps from wide angle sonar,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 116—121, 1985. [69] S. Thrun, W. Burgard, and D. Fox, “A probabilistic approach to concur- rent mapping and localization for mobile robots,” Machine Learning and Au- tonomous Robots (joint issue), pp. 1-25, 1998. [70] S. Thrun, W. Burgard, and D. Fox, “A real-time algorithm for mobile robot mapping with applications to multi-robot and 3D mapping,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), (San Fransisco, CA), 2000. [71] Y. Liu, R. Emery, D. Chakrabarti, W. Burgard, and S. Thrun, “Using EM to learn 3D models with mobile robots,” in Proceedings of the International Conference on Machine Learning (ICML), 2001. [72] D. Fox, W. Burgard, and S. Thrun, “Markov localization for mobile robots in dynamic environments,” Journal of Artificial Intelligence Research, vol. 11, 1999. [73] S. Thrun, D. Fox, W. Burgard, and F. Dellaert, “Robust monte carlo localiza- tion for mobile robots,” Artificial Intelligence, pp. 99—141, 2000. [74] D. B. Rubin, “Using the SIR algorithm to simulate posterior distributions,” in Bayesian Statistics 3 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, eds), pp. 395-402, Oxford, UK: University Press, 1988. [75] S. Engelson, Passive Map Learning and Visual Place Recognition. PhD thesis, Yale University, 1994. [76] A. Cassandra, Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Brown University, Department of Com- puter Science, Providence, RI, 1998. [77] H. 
Shatkay and L. Kaelbling, “Learning topological maps with weak local odo- metric information,” in Proceedings of the Fifteenth International Joint Con- ference on Artificial Intelligence (IJCAI), 1997. [78] H. Shatkay, Learning Models for Robot Navigation. PhD thesis, Brown, 1998. [79] G. Theocharous, K. Rohanimanesh, and S. Mahadevan, “Learning hierarchical partially observable Markov decision processes for robot navigation,” in Pro- ceedings of the IEEE International Conference on Robotics and Automation (ICRA), (Seoul, Korea), 2001. 192 [80] J. Banks, J. Carson, and B. Nelson, Discrete-Event System Simulation. Prentice Hall, 1996. [81] K. Lari and S. Young, “The estimation of stochastic context-free grammars using the Insight-Outside algorithm,” Computer Speech and Language, 1990. [82] D. Prescher, “Inside-Outside estimation meets dynamic EM,” in Proceedings of the 7th International Workshop on Parsing Technologies (IWPT—OI), (Beijing, China), 2001. [83] K. Murphy and M. Paskin, “Linear time inference in hierarchical HMMS,” in Neural Information Processing Systems (NIPS), 2001. [84] K. Murphy, “Applying the junction tree alogirthm to variable—length DBNs,” tech. rep., U.C Berkley, Computer Science Divison, 2001. [85] H. Bui, S. Venkatesh, and G. West, “On the recognition of abstract Markov policies,” in Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI), 2000. [86] A. Nefian and M. Hayes, “An embedded HMM-based approach for face detec- tion and recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 1999. [87] M. Ostendorf, V. Digalakis, and O. Kimball, “From HMM’s to segment models: a unified view of stochastic modeling for speech recognition,” IEEE Trans. on Speach and Audio Processing, pp. 4(5):360-378, 1996. [88] S. Levinson, “Continuosly variable duration hidden Markov models for auto- matic speech recognition,” Computer Speech and Language, 1986. [89] M. Russell and R. Moore, “Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition,” International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5—8, 1985. [90] C. Mitchell and L. Jamieson, “On the complexity of explicit duration HMMS,” IEEE Transactions on Speech and Audio Processing, vol. 3, May 1995. [91] J. Pineau, N. Roy, and S. Thrun, “A hierarchical approach to POMDP plan- ning and execution,” in Workshop on Hierarchy and Memory in Reinforcement Learning (ICML 2001), (Williams College, MA), June 2001. [92] Nomadic technolgies, Inc., Nomad 200 Hardware Manual, 1997. [93] Nomadic technolgies, Inc., User’s Manual, 1997. [94] S. Mahadevan, G. Theocharous, and N. Khaleeli, “Fast concept learning for mo- bile robots,” Machine Learning and Autonomous Robots Journal (joint issue), vol. 31/5, pp. 239—251, 1998. 193 [95] S. Fahlman, “Faster-learning variations on back-propagation: An empirical study,” in Proceedings of 1988 Connectionist Models Summer School, Morgan Kaufmann, 1988. [96] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley Inter- science, 2000. [97] P. Baldi and S. Brunak, Bioinformatics The Machine Learning Approach, ch. 9. MIT Press, 1998. [98] J. H. Connell, A Colony Architecture for an Artificial Creature. PhD thesis, MIT, 1989. [99] R. A. Brooks, “Intelligence without representation,” Artificial Intelligence Jour- nal, vol. 47, pp. 139-159, 1991. [100] B. H. Juang and L. R. Rabiner, “A probabilistic distance measure for hidden Markov models,” ATE T Tech J., vol. 64, pp. 
391—408, February 1985. [101] A. J. Viterbi, “Error bounds for convlutional codes and an aymptotically opti- mal decoding algorithm,” in IEEE Transactions Information Theory, pp. 260— 269, 1967. [102] G. Theocharous and S. Mahadevan, “Approximate planning with hierarchical partially observable Markov decision processes for robot navigation,” in Pro- ceesings of the IEEE International Conference on Robotics and Automation {ICRA}, (Washington, DC), 2002. [103] R. Parr and S. Russell, “Reinforcemnet learning with hierarchies of machines,” in Advances in Neural Information Processing Systems, vol. 10, (Cambridge, MA), pp. 1043—1049, MIT Press, 1997. [104] R. Parr, Hierarchical control and learning for Markov decision processes. PhD thesis, University of California at Berkley, 1998. [105] T. G. Dietterich, “Hierachical reinforcement learning with the MAXQ value function decomposition,” Journal of Artificial Intelligence Research, 2000. [106] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier, “Hi— erachical solution of Markov decision processes using macro-actions,” in Pro- ceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence {UAI), 1998. [107] R. M. Neal, “Probabilistic inference using Markov chain monte carlo methods,” Tech. Rep. CRG-TR-93—1, University of Toronto, 1993. [108] Z. Ghahramani and M. I. Jordan, “Factorial hidden Markov models,” Machine Learning, pp. 245-275, 1997. 194 [109] S. Thrun, J. Langford, and D. Fox, “Monte carlo hidden Markov models: Learning non-parametric models of partially observable stochastic processes,” in Proc. of the International Conference on Machine Learning, 1999. [110] Y. Gotoh and H. F. Silverman, “Incremental ML estimation of HMM param— eters for efficient training,” in Proceedings of the International Conference on Acoustics Speech and Signal Processing, pp. 585—588, 1996. [111] V. Digalakis, “On-line adaptation of hidden Markov models using incremental estimation algorithms,” in Proceedings of the European Conference on Speech Communication and Technology, September 1997. [112] D. Freitag and A. McCallum, “Information extraction with HMM structures learned by stochastic optimization,” in AAAI/IAAI, pp. 584—589, 2000. [113] A. Stolcke and S. Omohundro, “Hidden Markov model induction by Bayesian model merging,” in Neural Information Processing Systems (NIPS), 1992. [114] D. Heckerman, “A tutorial on learning with Bayesian networks,” Tech. Rep. MSR-TR—95-06, Microsoft Research, 1996. [115] N. Friedman, “The Bayesian structural EM algorithm,” in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1998. [116] M. Brad, “Structure learning in conditional probability models via an entropic prior and parameter extinction,” Neural computation, pp. 11:1155—1182, 1999. [117] N. Friedman and D. Koller, “Being Bayesian about network structure,” in Pro— ceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2000. 195