"73's llllllllllllllllllllllllllllllllllllHllllllllllllllllllll 31293020486 LIBRARY Michigan State University This is to certify that the thesis entitled HIERARCHICAL MULTI AGENT REINFORCEMENT LEARNING presented by Rajbala Makar has been accepted towards fulfillment of the requirements for Master ' 5 degree in Computer Science & Engineering Media, Major professor Date Ewe Elmo ERMA/L MAHWW 0-7639 MS U is an Affirmative Action/Equal Opportunity Institution PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE J "5 I ’ ‘5': fififlhwd moo W14 HIERARCHICAL MULTI AGENT REINFORCEMENT LEARNING By Rajbala M akar A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Computer Science & Engineering 2000 ABSTRACT HIERARCHICAL MULTI AGENT REINFORCEMENT LEARNING By Rajbala Makar Reinforcement Learning has been extensively studied as a generalized approach for sequential decision making problems involving actions of a single agent. Hence, Single agent learning and the challenges and issues involved are very well understood. Recent focus has been on co-operative multi agent learning as that opens up a whole new dimension to machine learning and the type of problems it can be used for. The complexity introduced as a result of having a number of agents is a major deterrent for formulation of machine learning problems with multiple agents. Hierarchical learning makes it easier to solve large problems by taking advantage of abstraction. This work focuses on applying a hierarchical reinforcement learning approach to multi agent learning. The multi agent learning algorithm has been built upon the MAXQ frame- work for hierarchical reinforcement learning, as it naturally extends to the multi agent case. Each agent uses the same MAXQ hierarchy to decompose a task into sub-tasks. Coordination skills among agents are learned by configuring the Q nodes at the high- est level of the hierarchy to represent the joint action space among multiple agents. The algorithm is tested on a complex AGV (Automated Guided Vehicle) scheduling task, where four AGVS are available to carry material from a warehouse to one of four assembly stations, and to carry the finished assemblies back to the warehouse. Experiments Show that the co-operative multi agent reinforcement learning algorithm performs considerably better than the case where the agents act selfishly and try to maximize their own rewards. The multi agent algorithm is also shown to perform better than heuristics usually used for AGV scheduling. To My Mother (Smt. Narbada Devi) iv ACKNOWLEDGMENTS I would like to take this opportunity to thank all the people around me who have made this thesis possible. First of all, I express my heartfelt gratitude for my ad- visor, Dr. Sridhar Mahadevan, who guided me throughout my time here with his constant support and encouragement, and Spent a lot of his valuable time with me. I would also like to thank Dr. John Weng and Dr. R. L. Tummala for being in my thesis committee. I would like to thank Dr. James Bean, Dr. Yavuz Bozer and Dr. Abhijeet Gosavi for their help in getting literature on AGV scheduling. I would like to acknowledge the help of everyone in the Autonomous Agents lab at MSU, their helpful comments and the use of their machines. 
I would like to dedicate this thesis to my family who have constantly encouraged me and taught me how to dream. I do not know where I would be without their love, patience, understanding, and support. Finally, I am grateful to all my friends for providing a nice environment around me, and encouraging me in bad times.

TABLE OF CONTENTS

LIST OF FIGURES

1 Introduction
  1.1 Reinforcement Learning
  1.2 Hierarchical Reinforcement Learning
  1.3 Multi Agent Reinforcement Learning
  1.4 Thesis Outline

2 Theoretical Foundations and Related Work
  2.1 Markov Decision Processes
    2.1.1 Value of a Policy
    2.1.2 Dynamic Programming
    2.1.3 Reinforcement Learning
    2.1.4 Discrete Event Model
    2.1.5 Semi Markov Decision Processes
  2.2 Hierarchical Reinforcement Learning
    2.2.1 Approaches to Hierarchical Reinforcement Learning
    2.2.2 The MAXQ Method for Hierarchical Reinforcement Learning
  2.3 Multi Agent Learning
  2.4 Discussion

3 Co-operative MAXQ: A Hierarchical Multi Agent Reinforcement Learning Method
  3.1 Revisiting the Termination Predicate
  3.2 State Abstraction
  3.3 The Multi Agent MAXQ Algorithm

4 A Multi Agent Robot Task
  4.1 The Task
  4.2 Results
  4.3 Conclusion

5 Experiments with AGV Scheduling
  5.1 Scheduling in Flexible Manufacturing Systems
  5.2 Assumptions Used
  5.3 The Problem Description
  5.4 Implementation Details
  5.5 State Abstraction
  5.6 Experimental Results

6 Conclusions and Future Work
  6.1 Conclusion
  6.2 Future Work
    6.2.1 Adaptive Hierarchies
    6.2.2 Other Hierarchical Learning Approaches
    6.2.3 Theoretical Foundations
    6.2.4 Other Extensions

BIBLIOGRAPHY

LIST OF FIGURES

1.1 Dynamics of interaction in a multi agent system. The state of the environment might be changed by the actions of other agents while one agent is performing a task.
1.2 Example of a multi agent task.
2.1 A simple two state Markov decision problem. The transition probability associated with the actions, and the reward for performing the action in a particular state, is shown in brackets.
2.2 The taxi-cab domain.
2.3 The task graph for the taxi-cab domain.
2.4 The MAXQ graph for the taxi-cab domain.
2.5 The MAXQ learning algorithm.
3.1 The multi agent MAXQ algorithm.
3.2 The algorithm for updating the joint and individual completion function values at abstract levels for the multi agent MAXQ algorithm.
4.1 The environment used for the multi agent trash depositing task.
4.2 The task graph used for the multi agent trash depositing task.
4.3 The MAXQ graph used for the multi agent trash depositing task.
4.4 The average reward accumulated from start state to the goal state.
4.5 The number of primitive actions required to get to the goal state.
4.6 Learned policy for a single agent performing the trash collecting task using the MAXQ hierarchical decomposition.
4.7 The top and bottom boxes show the learned policy for agent number 1 and 2 respectively, when the two agents are collectively performing the trash collecting task, using the co-operative multi agent MAXQ algorithm.
4.8 The number of primitive actions required to complete the task.
5.1 A multiple automated guided vehicle (AGV) optimization task.
5.2 The MAXQ graph for an automated guided vehicle (AGV) optimization task.
5.3 The program flow in the implementation of the AGV scheduling task.
5.4 This figure compares the performance of single agent, multi-agent selfish MAXQ, multi agent co-operative MAXQ and flat methods for the AGV scheduling task. It shows the throughput of the system when the AGV travel time is very much less compared to the assembly time.
5.5 This figure compares the performance of single vs. multi-agent MAXQ methods on the AGV scheduling task. It shows the results when the AGV travel time and load/unload time is 1/10th that of the average assembly time.
5.6 A flat implementation of a single agent reinforcement learner is shown here.
5.7 The performance of the proposed cooperative multi-agent MAXQ is compared with a well known "first come first serve" heuristic for AGV dispatching.
5.8 The performance of the cooperative multi-agent MAXQ algorithm is compared with a composite of the nearest station and stay in same station heuristic, and the highest queue first heuristic for AGV dispatching.
5.9 The performance of the cooperative multi-agent MAXQ algorithm when one agent breaks down at 50000 sec, another at 70000 sec and the third at 90000 sec.

Chapter 1 Introduction

The recent focus on machine learning [1] as an approach for solving a wide variety of extremely diverse tasks shows the great promise of this approach. The reward/punishment framework of reinforcement learning is general enough to be applicable in virtually every domain, from robotics [10], to backgammon [15], to elevator scheduling [14], to channel allocation in cellular phones [17]. An agent learns incrementally by trial and error in the reinforcement learning framework, as opposed to the supervised learning framework, which assumes that there is an external teacher that provides good and bad examples to learn from. Supervised learning is an important form of learning, but is not adequate if an agent needs to learn from interaction with the environment. In interactive problems it might be impractical to obtain examples of desired behavior that are correct and represent all possible situations the agent might find itself in.
In reinforcement learning, an agent is not told what actions to perform, but must discover what actions yield the maximum amount of reward, which is measured in terms of a scalar numerical signal that is generated according to the current state and provides feedback to the agent.

1.1 Reinforcement Learning

Reinforcement learning considers the problem of a goal directed agent interacting with a possibly unknown and uncertain environment [2]. In many challenging cases, actions might affect future situations in which the agent might find itself, in addition to providing immediate reward. This characteristic of reinforcement learning is called delayed reward. Thus, an agent is not explicitly told what are good actions or bad actions in reinforcement learning, but has to learn on the basis of positive or negative feedback from the environment at some point in time after taking a series of actions. This approach has solid foundations in human and animal psychology. It is a well known fact that pets can be taught to obey their master if they are scolded after doing something undesirable and given reward after behaving themselves. Another natural process which can be compared to the reinforcement learning approach is how babies learn to walk. They fall a number of times and get hurt before learning the correct way of walking. The most important distinguishing feature of reinforcement learning is that it uses training information to evaluate an action executed rather than instruct by giving correct actions.

Thus, the goal of the agent is to maximize its long term reward, or expected return. The expected return might be a simple sum of rewards, but that would create problems in a continuing task, where the final time step might be infinity. Another way of defining the long term reward is the sum of discounted rewards the agent receives over the future. Thus the expected discounted return,

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \qquad (1.1)$$

where $0 \le \gamma \le 1$ is a parameter called the discount rate, is what the agent tries to maximize in the reinforcement learning framework.

1.2 Hierarchical Reinforcement Learning

The key to solving large problems which are intractable using the traditional reinforcement learning methods (also called flat reinforcement learning methods) is multi level decomposition of the problem and abstraction. Reinforcement learning in terms of abstract states or actions is termed hierarchical reinforcement learning [43]. Abstracting the state space provides advantages in terms of modeling only those components of the state space which are relevant to a particular task. The action space can also be decomposed into the overall task, then the subtasks, down to the level of primitive actions. This provides the advantage that knowledge about certain subtasks which are common to different tasks can be reused once they are learned. Also, the structure of the overall task is captured in the decomposition. The structure can thus be used to limit the possible number of attainable states when performing a particular action. Abstraction leads to constraining the policy space, which speeds up learning considerably.

Figure 1.1: Dynamics of interaction in a multi agent system. The state of the environment might be changed by the actions of other agents while one agent is performing a task.
1.3 Multi Agent Reinforcement Learning

Many problems are complex and require the power of many independent agents working together towards a common goal. For example, any search task, such as foraging, is parallel in nature and would benefit from using multiple agents. One can think of other tasks where it might be essential to have multiple agents, like a predator prey task in which the prey might be stronger than one predator but could be captured by two predators [13], or robot soccer [20].

The complexity introduced as a result of having a number of agents is a major deterrent for the formulation of machine learning problems with multiple agents, as the interactions among the agents have to be modeled in addition to the interaction between each agent and the environment. Hence, in a multi agent scenario, the interaction dynamics takes on the form shown in figure 1.1. Changes in the environment can be brought about by actions of any of the agents. The state of the environment might be changed by actions of other agents even while one agent is performing an action.

Figure 1.2: Example of a multi agent task.

For example, in figure 1.2, where the two agents are navigating to two different goal locations, agent 1 might reach its goal location while agent 2 is learning to avoid obstacles. In a flat framework, the agents would have to consider the state of both agents and the actions being performed by both agents to define the state of the system completely. Modeling interactions between agents would make the state and action space exponential in the number of agents if the simple flat reinforcement learning approach were used. This is a huge burden and slows down learning. Even for problems where this kind of model is feasible, many agents performing random actions and trying to learn the best overall policy poses challenges in terms of learning. For example, again consider that two agents are jointly trying to get to opposite corners of a room. Note that the agents need to learn to reach the goal locations while avoiding obstacles. This kind of learning might succeed in a static environment with fixed obstacles, which is rarely the case. If the obstacles are moving, the low level learning of obstacle avoidance might give the agent rewards which interfere with learning the overall task of reaching the goal (refer to figure 1.2). Thus, low level interactions among agents might override the learning process and produce a kind of chaos. These difficulties arise because the agent has no knowledge of the fact that its goal is to reach a particular location, and that obstacle avoidance is a behavior it has to turn on and off in order to achieve the goal without harming itself or anyone else. It is possible to give this knowledge to the system in terms of the hierarchy of the task. Thus, in hierarchical learning, the agent is always aware of the high level task even though it might currently be performing some low level behavior. In this new hierarchical multi agent learning method, the agents learn to co-ordinate their actions at the highest level of abstraction, hiding away the low level details, which leads to a very organized emergent behavior for the system as a whole. Joint action values need to be considered only at the abstract level in this framework, which ensures a tremendous savings in terms of memory requirements.
We also face a decision about whether to model the joint state space of the agents for multi agent hierarchical learning. Modeling the joint state space might lead to the agents learning a very good combined policy, but in real problems the memory and computational requirements are very high. If we consider a fairly complex task with, say, 10^5 states, having four agents would mean 10^20 states, which would be intractable. Thus, in our approach, each agent only models its own state space. The only information it has about the other agents is the high level action being executed by them. Hence, each agent bases its decision about the next action it should execute on its own state in the environment and the actions being executed by the other agents. There is less communication overhead here, as agents only need to communicate their high level actions to each other.

1.4 Thesis Outline

The outline of the thesis is as follows. Chapter 2 briefly reviews relevant literature on hierarchical reinforcement learning methods and explains the theory behind it. Chapter 3 describes the new hierarchical multi agent learning algorithm. A simple office environment task is described, and results for the multi agent hierarchical learning approach on this task are presented, in Chapter 4. Chapter 5 deals with the AGV scheduling task; it describes the task and the issues involved in solving it, and presents experimental results. Chapter 6 presents the conclusions and a number of interesting ways this approach can be modified, along with its future research scope.

Chapter 2 Theoretical Foundations and Related Work

The theory behind reinforcement learning needs to be understood before we present our approach for multi agent reinforcement learning. Thus, we briefly describe the theory behind approaches to reinforcement learning, hierarchical reinforcement learning, and multi agent reinforcement learning.

2.1 Markov Decision Processes

Sequential decision making problems occur when an agent's performance depends not on a single decision, but on a series of complex decisions taken one after another. A stochastic sequential decision making problem can be modeled as a Markov decision problem. Figure 2.1 is a simple example of a two state MDP, where one action is possible in each state. Execution of the single action is highly likely to bring the agent to the other state. But in many real life situations, there is a possibility that the action might fail. This is modeled with the help of the transition function, which says that the action taken in a state will take the agent to the other state, i.e. will succeed, with probability p, and the agent will remain in the same state, i.e. the action will fail, with probability (1 − p). A transition model gives the resulting distribution of states if action a was executed in state s, for all state-action pairs. The Markov property is said to hold if the transition probabilities from any given state depend only on the state and not on previous history. Thus, MDPs are a model for sequential decision making under uncertainty, taking into account the current and future decision outcomes.

Figure 2.1: A simple two state Markov decision problem. The transition probability associated with each action, and the reward for performing the action in a particular state, are shown in brackets (edge labels: action1 (0.8, +2), action1 (0.2, −1), action2 (0.9, −1), action2 (0.1, 0)).

Formally, an MDP M can be defined as a four tuple (S, A, P, R), where

• S is a set of environment states.
• A is a finite set of actions.
• P is a set of action dependent transition probabilities.
• R is a reward function.

MDPs can be defined over continuous states and actions, but we are only considering the discrete case here. For each sequence of observed states and actions $(x_0, a_0, x_1, a_1, \ldots, x_t, a_t)$ in an MDP, the probability of being in state y at the next decision epoch is given by

$$P(X_{t+1} = y \mid X_0 = x_0, a_0, X_1 = x_1, a_1, \ldots, X_t = x_t, a_t = a) = P(X_{t+1} = y \mid X_t = x_t, a_t = a) \qquad (2.1)$$

where t = 0, 1, ... is the decision epoch. This essentially says that the current state and action provide all the information for predicting the next state. Thus, if the Markov property is satisfied, knowledge of the current state is all that is required for making a decision.

2.1.1 Value of a Policy

An agent observes the state of the environment at decision epochs, which are discrete times in process evolution, and takes an action based on the state. The policy π is defined as the complete mapping from states to actions, which tells an agent what action to perform in each state. In order to identify a good policy, we should be able to distinguish between policies. This can be done with the help of the value function of a policy. The value function for a particular policy is defined as a mapping V^π : S → R which determines the expected utility of each state s if actions are chosen according to the policy π. The reward function assigns a numeric value to a state or an action based on a short term view, while the value function assigns a real valued number based on a long term view. The expected total discounted reward of a policy π starting from state x is

$$V^\pi(x) = E^\pi_x \left( \sum_{t=0}^{\infty} \gamma^t r_t \right) \qquad (2.2)$$

where $E^\pi_x$ is the expected value of the return when starting from state x and choosing actions according to policy π, and $0 \le \gamma < 1$. The discount factor γ measures the present value of one unit of reward received one epoch in the future. Discounting the reward accounts for the value of the reward as time passes. The aim is to maximize the value function over all states by identifying a policy π* such that V^{π*}(x) ≥ V^π(x) for all states x and all policies π. This policy π*, whose value function is greater than or equal to that of all other policies, is called the optimal policy.

$$V^*(x) = \max_\pi V^\pi(x) \quad \forall x \in S \qquad (2.3)$$

V* refers to the optimal value function and V^π is the value function under policy π. Therefore,

$$V^*(x) = V^{\pi^*}(x) \quad \forall x \in S \qquad (2.4)$$

where π* is the discounted optimal policy. More than one optimal policy might exist, but all such policies define the same value function. The main algorithms which are used for solving a Markov decision problem are discussed next.

2.1.2 Dynamic Programming

Dynamic Programming (DP) is the traditional analytical approach for solving MDPs [5]. The complete model of the system, i.e. the transition probabilities and the reward function, must be known precisely to apply DP techniques. DP is based on the fact that the value function of a policy can be written recursively using the value function prediction for the next state. Thus the value function can be rewritten as:

$$V^\pi(x) = r(x, \pi(x)) + \gamma \sum_{y \in S} P_{x,y}(\pi(x)) \, V^\pi(y) \qquad (2.5)$$

Thus the value of a state under a particular policy is the sum of the immediate reward of choosing the action dictated by the policy and the expected discounted value of the resulting state. The value iteration and policy iteration algorithms can be used for computing the optimal value function and policy. In the value iteration algorithm [6], the optimal value function is computed and the optimal policy derived directly from it. This is based on the Bellman optimality equation [45]:

$$V^*(x) = \max_a \left( r(x, a) + \gamma \sum_{y} P_{x,y}(a) \, V^*(y) \right) \qquad (2.6)$$

Thus, if the optimal value function is known, the optimal action in state x is the action that maximizes the right hand side of this equation. The value iteration algorithm begins with an arbitrary value function and iteratively improves it until the change in the value function between steps is arbitrarily small. The policy iteration algorithm evaluates a policy and chooses a greedy policy with respect to the corresponding value function, thus obtaining an improvement over the original policy. Policy evaluation and improvement are carried out until the policy improvement step can no longer produce a better policy.
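To make the dynamic programming machinery concrete, the following is a minimal Python sketch of value iteration on a two state model. The transition probabilities and rewards are only one possible reading of the edge labels in figure 2.1, and should be treated as illustrative placeholders rather than the thesis's exact model.

GAMMA = 0.9          # discount factor
THETA = 1e-9         # convergence threshold

# model[state][action] = list of (probability, next_state, reward) outcomes
model = {
    0: {"action1": [(0.8, 1, 2.0), (0.2, 0, -1.0)]},
    1: {"action2": [(0.9, 0, -1.0), (0.1, 1, 0.0)]},
}

V = {s: 0.0 for s in model}
while True:
    new_V = {}
    for s, actions in model.items():
        # Bellman optimality backup (Equation 2.6): maximize over available actions
        new_V[s] = max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
    delta = max(abs(new_V[s] - V[s]) for s in V)
    V = new_V
    if delta < THETA:
        break

print(V)  # approximately optimal value of each state under this toy model

Policy iteration would replace the inner maximization with an evaluation of the current policy followed by a greedy improvement step, repeating until the policy no longer changes.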
2.1.3 Reinforcement Learning

Reinforcement learning methods are online methods which require direct interaction with the environment in order to approximate the value function. The optimal value function is arrived at incrementally. This is in contrast to the DP methods, where updates across all states are performed simultaneously.

Temporal Difference Learning

The method of temporal differences (TD) is a general reinforcement learning framework for solving MDPs [16]. TD provides a method for updating the evaluation of a state based on the difference between the current evaluation of the state and the evaluation of the successor state. Thus, updates occur only when an agent encounters a state.

Q Learning

Watkins' Q-learning method [7] utilizes the TD method to ensure that no knowledge of a system model is required to compute the optimal policy. The transition model need not be initially specified in this case, as the agent is embedded in the system and can observe the actual state transitions. Thus, the action value of a state is updated each time the state is encountered, according to the equation:

$$Q_{k+1}(x, a) = Q_k(x, a) + \alpha \left( r_{imm}(x, a) + \gamma \max_b Q_k(y, b) - Q_k(x, a) \right) \qquad (2.7)$$

where $0 \le \alpha \le 1$ is the learning rate, $r_{imm}$ is the immediate reward, $\max_b Q_k(y, b) = V_k(y)$, and $Q_k(x, a)$ is the action-value of state x and action a before the update. Actions are chosen according to a greedy policy (actions with higher values are chosen) with a certain probability. Random actions are also chosen with some probability to allow the agent to explore the state space. Watkins showed that the Q values will asymptotically converge to the optimal Q* values if each action is executed an infinite number of times in each state and α is decayed as the learning progresses.
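The tabular update in Equation 2.7 can be written in a few lines of Python. The sketch below is illustrative only: the state and action representations, the exploration constant, and the fixed learning rate are hypothetical choices, and a real implementation would decay both the learning rate and the exploration probability as noted above.

import random
from collections import defaultdict

Q = defaultdict(float)           # Q[(state, action)], implicitly initialized to zero
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def choose_action(state, actions):
    # epsilon-greedy exploration: mostly greedy, occasionally random
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, actions):
    # One application of the update in Equation 2.7
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

A learning loop would repeatedly call choose_action, apply the chosen action to the environment, observe the reward and next state, and then call q_update.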
2.1.4 Discrete Event Model

Discrete event models have been used as the basis for simulation studies in a multitude of domains, including manufacturing, queuing, networking, and transportation [41]. For the purpose of simulation optimization, the discrete time reinforcement learning framework has to be generalized to a discrete event framework. Time is explicitly modeled as a continuous variable, but the agent observes the environment and makes decisions only at certain discrete points (or decision epochs). In between these epochs, the state of the system can be changing in some complex way, but these changes may not provide the agent with any additional information. Furthermore, actions take non-constant time periods, modeled by some arbitrary time distribution. It is not possible to model the optimization of factory processes as a discrete-time MDP without drastic loss of information (for example, Poisson demand processes or failure distributions require using real-valued stochastic time distributions). Formally, the evolution of the environment at decision epochs can be modeled as a semi-Markov decision process (SMDP) [45].

2.1.5 Semi Markov Decision Processes

MDP models assume discrete time. Decisions occur at equally spaced points in time, after every state transition. For problems such as navigation and AGV scheduling, where the time taken to complete an action varies, a more natural model would be one which takes time into account. This leads us to the notion of Semi Markov Decision Processes (SMDPs) [47]. In SMDPs, actions are allowed only at discrete points in time, but the state can change between actions. This is in contrast to an MDP, where state changes are solely due to actions being performed by the agent.

Formally, a semi Markov model of continuous time decision problems can be defined as a five tuple (S, A, P, R, F), where

• S is a set of environment states.
• A is a finite set of actions.
• P is a set of action dependent transition probabilities.
• R is a reward function.
• F is a function giving the probability of transition times for each state action pair.

P is a function mapping the elements of S x A into discrete probability distributions over S. P(s'|s, a) denotes the probability that the environment will make a transition from state s ∈ S to state s' ∈ S under action a ∈ A. P describes the transitions at decision epochs only. F is a function where F(t|s, a) is the probability that the next decision epoch occurs within t time units after the agent chooses action a in state s at a decision epoch. Let Q denote the joint probability that the system will be in state s' at the next decision epoch, at or before t time units after choosing action a in state s. The expected transition time between decision epochs can be calculated from Q, which can be computed from F and P by

$$Q(t, s' \mid s, a) = P(s' \mid s, a) \cdot F(t \mid s, a) \qquad (2.8)$$

The reward function for an SMDP is usually more complex than in the MDP model. This is because, in addition to the fixed reward k(s, a) accrued due to an action performed at a decision epoch, an additional reward may be accumulated at a rate depending on (s', s, a) for the time the natural process remains in state s' between decision epochs. The natural process may change state several times between decision epochs. Hence, the rate at which rewards are accumulated between decision epochs may vary.

2.2 Hierarchical Reinforcement Learning

The basic principle behind hierarchical learning is to constrain the policy space by using abstraction to increase learning speed and to reduce the memory and computational requirements [43]. There are two fundamental ways in which abstraction can be built into reinforcement learning. Temporal abstraction, or abstraction in terms of actions, means that a set of primitive actions is considered a single abstract action. In other words, an abstract action is a policy over primitive actions. Similarly, a number of states can be grouped together to form a single abstract state (for example, a room) in state abstraction. State or spatial abstraction, or function approximation, techniques develop mappings which are more compact than the general lookup table. Nearest-neighbor methods, neural networks, fuzzy logic, and many other approaches have been used to achieve such mappings.

2.2.1 Approaches to Hierarchical Reinforcement Learning

A variety of different approaches have been developed to achieve hierarchical reinforcement learning objectives. A brief overview of some of them is presented here.
Feudal Reinforcement Learning

The feudal reinforcement learning method of Dayan and Hinton [32] creates a managerial hierarchy in which high level managers learn how to assign tasks to their sub-managers, who learn how to satisfy them. The manager receives reinforcement from the external environment. Its actions consist of commands that it can give to the low-level learner. When the manager generates a particular command to the sub-manager, it must reward the sub-manager for taking actions that satisfy the command, even if they do not result in external reinforcement. The manager learns a mapping from states to commands. The sub-manager learns a mapping from commands and states to external actions.

Options

Another approach to hierarchical reinforcement learning is the Options model described by Precup, Sutton, and Singh [9]. Their principal abstraction, essentially a closed loop sequence of actions, is termed a behavior. Behaviors can operate at different time scales, and can be composed of other behaviors. They incorporate a general closed-loop policy, instead of being simply a fixed sequence of actions, and can have a wide variety of completion criteria. Formally, an option is defined by the set of states in which it applies, the decision rule specifying actions to be performed, and a completion function. The decision rule is similar to a reinforcement learning policy, except that it can choose from among different behaviors as well as different primitive actions. Decisions can be based not only on the current state but on all states and actions from the start of the behavior. A completion function can specify completion after a fixed number of steps or after attainment of some subgoal.

Feudal reinforcement learning and the Options method are approaches that take advantage of temporal abstraction. The Hierarchical Distance to Goal, HAM, and MAXQ methods take advantage of both spatial and temporal abstraction. These are described next.

Hierarchical Distance to Goal

The Hierarchical Distance to Goal (HDG) algorithm by Kaelbling [3] imposes a set of landmark states, and the agent reaches the goal state via this series of landmarks. Navigation through the environment consists of high-level navigation between landmarks until the appropriate region is reached, at which point lower level actions take over to locate the goal.

Hierarchical Abstract Machines

Hierarchical abstract machines (HAMs) provide another mechanism for utilizing abstraction in reinforcement learning. The HAM model [33] uses nondeterministic finite state machines to constrain policies. For example, a simple machine could specify the actions "repeatedly go up or right" and, when applied, would constrain the overall policy in this manner. There are four types of HAM states: Action states execute an action in the environment. Call states execute another HAM as a subroutine. Choice states nondeterministically select the next machine state. Stop states halt execution and return control to the previous level. Application of a HAM to an MDP yields an induced MDP. It is shown that the optimal policy for this induced MDP is equivalent to an optimal policy for the original MDP that is consistent with the HAM. An algorithm, HAMQ-learning, is developed which is shown to converge to the optimal action at every choice point in the induced MDP. The HAMQ-learning algorithm operates by keeping track of the accumulated reward and accumulated discount since the previous choice state.
An extended Q-table is maintained, where Q-values are given for each environment state/machine state pair and choice point action. Upon transition to a choice state, the Q-value is updated using the cumulative reward and discount values.

2.2.2 The MAXQ Method for Hierarchical Reinforcement Learning

The MAXQ method for hierarchical reinforcement learning by Dietterich is the basis for the multi agent hierarchical reinforcement learning algorithm developed here, hence we provide a detailed overview of the theory behind it [4].

The MAXQ method involves the use of a graph to store a distributed value function. The overall task is first decomposed into subtasks up to the desired level of detail, and the task graph constructed. A simple example is the taxi-cab domain (figure 2.2) analyzed by Dietterich in his work. There are four special locations (denoted by R, G, Y, B) in a 5 x 5 grid world. A passenger needs to be transported from one of these four locations to another. Thus, the goal of the taxi agent is to go to the passenger location, pick him up, go to the destination location, and put him down. The task graph and MAXQ graph for this problem are shown in figures 2.3 and 2.4. This problem is considered to be episodic. The taxi is placed at a random location in each episode. Each episode starts with the passenger at the start location and ends with the passenger being deposited at the destination location.

Figure 2.2: The taxi-cab domain.

Figure 2.3: The task graph for the taxi-cab domain (Root; Get, Put, Navigate(t); Pickup, Putdown, North, South, East, West).

Figure 2.4: The MAXQ graph for the taxi-cab domain.

Formally, the MAXQ method decomposes an MDP M into a set of subtasks M_0, M_1, ..., M_n. Each subtask is a three tuple (T_i, A_i, R̃_i) defined as follows:

T_i(s_i) is a termination predicate which partitions the state space S into a set of active states S_i and a set of terminal states T_i. The policy for subtask M_i can only be executed if the current state s is in S_i.

A_i is a set of actions that can be performed to achieve subtask M_i. These actions can either be primitive actions from A, the set of primitive actions for the MDP, or they can be other subtasks, denoted by their indexes i.

R̃_i(s') is the pseudo reward function, which specifies a pseudo reward for each transition from a state s ∈ S_i to a terminal state s' ∈ T_i. This pseudo reward tells how desirable each of the terminal states is for this particular subtask. It is typically employed to give goal terminal states a pseudo reward of 0 and non goal terminal states a negative reward.

Each primitive action a from M is a primitive subtask in the MAXQ decomposition, such that action a is always executable, it terminates immediately after execution, and its pseudo reward function is uniformly zero.

The projected value function V^π is the value of executing hierarchical policy π starting in state s and at the root of the hierarchy. The completion function C^π(i, s, a) is the expected cumulative reward, discounted back to the point where action a begins execution, of completing subtask M_i after invoking the subroutine for subtask M_a in state s.
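Before presenting the learning algorithm, a small sketch may help make the decomposition concrete. The Python fragment below shows one possible way to represent the taxi hierarchy of figure 2.3 and its distributed value function; the class layout, attribute names, and termination tests are hypothetical illustrations, not the representation used in the thesis.

from collections import defaultdict

class Subtask:
    # One node of a MAXQ graph: a primitive action if it has no children,
    # otherwise a composite task with child subtasks.
    def __init__(self, name, children=None, is_terminated=None):
        self.name = name
        self.children = children or []
        self.is_terminated = is_terminated   # termination predicate T_i(s)
        self.V = defaultdict(float)          # V(i, s), stored at primitive nodes
        self.C = defaultdict(float)          # C(i, s, a), stored at composite nodes

    def primitive(self):
        return not self.children

def value(node, s):
    # Projected value V(i, s), computed by the recursive MAXQ decomposition:
    # V(i, s) = max_a [ V(a, s) + C(i, s, a) ] for composite nodes.
    if node.primitive():
        return node.V[s]
    return max(value(a, s) + node.C[(s, a.name)] for a in node.children)

# Hypothetical sketch of the taxi hierarchy (navigation targets omitted).
north, south, east, west = (Subtask(d) for d in ("North", "South", "East", "West"))
pickup, putdown = Subtask("Pickup"), Subtask("Putdown")
navigate = Subtask("Navigate", [north, south, east, west])
get = Subtask("Get", [navigate, pickup], is_terminated=lambda s: s.passenger_in_taxi)
put = Subtask("Put", [navigate, putdown], is_terminated=lambda s: s.passenger_delivered)
root = Subtask("Root", [get, put], is_terminated=lambda s: s.passenger_delivered)

Evaluating value(root, s) for some (hashable) state s recursively combines the stored leaf values and completion values, which is exactly the sense in which the value function is distributed over the graph.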
Figure 2.5 lists the MAXQ learning algorithm.

function MAXQ(MaxNode i, State s)
  let seq = () be the sequence of states visited while executing i
  if i is a primitive MaxNode
    execute i, receive r, and observe the result state s'
    V_{t+1}(i, s) = (1 − α_t(i)) · V_t(i, s) + α_t(i) · r
    append s onto the front of seq
  else
    let count = 0
    while T_i(s) is false do
      choose an action a according to the current exploration policy π(i, s)
      let childseq = MAXQ(a, s), where childseq is the sequence of states visited while executing action a
      observe the result state s'
      let a* = argmax_{a'} [ C̃_t(i, s', a') + V_t(a', s') ]
      let N = length(childseq)
      for each s in childseq do
        C̃_{t+1}(i, s, a) = (1 − α_t(i)) · C̃_t(i, s, a) + α_t(i) · γ^N [ R̃_i(s') + C̃_t(i, s', a*) + V_t(a*, s') ]
        C_{t+1}(i, s, a) = (1 − α_t(i)) · C_t(i, s, a) + α_t(i) · γ^N [ C_t(i, s', a*) + V_t(a*, s') ]
        N = N − 1
      end
      append childseq onto the front of seq
      s = s'
    end
  end
  return seq
end MAXQ

Figure 2.5: The MAXQ learning algorithm.

The value function V_t(i, s) in the algorithm is calculated with the decomposition equations:

$$V_t(i, s) = \begin{cases} \max_a Q_t(i, s, a) & \text{if } i \text{ is composite} \\ V_t(i, s) & \text{if } i \text{ is primitive} \end{cases}$$

$$Q_t(i, s, a) = V_t(a, s) + C_t(i, s, a) \qquad (2.9)$$

The definition of policy and value function with respect to the hierarchical case is now presented. A hierarchical policy π is a set containing a policy for each of the subtasks in the problem: π = {π_0 ... π_n}. Similarly, the projected value function in the hierarchical case, denoted by V^π(s), is the value of executing hierarchical policy π starting in state s and starting at the root of the task hierarchy. A recursively optimal policy for MDP M with MAXQ decomposition {M_0 ... M_k} is a hierarchical policy π = {π_0 ... π_k} such that for each subtask M_i the corresponding policy π_i is optimal for the SMDP defined by the set of states S_i, the set of actions A_i, the state transition probability function P^π(s', N | s, a), and the reward function given by the sum of the original reward function R(s' | s, a) and the pseudo-reward function R̃_i(s'). The MAXQ learning algorithm has been proven to converge to π*_r, the unique recursively optimal policy for MDP M and MAXQ graph H, where M = (S, A, P, R, P_0) is a discounted infinite horizon MDP with discount factor γ, and H is a MAXQ graph defined over subtasks {M_0 ... M_k}.

2.3 Multi Agent Learning

Multi-agent reinforcement learning is an extremely challenging problem. Complexity is introduced as a result of having a number of agents, as the interactions among the agents have to be modeled in addition to the interaction between each agent and the environment. Various approaches have been tried to tackle this problem. Most of these approaches extend "flat" algorithms, such as Q-learning, to the multi-agent case. A review of such algorithms is presented next.

Littman [12], and Hu and Wellman [18], have studied the framework of Markov games for multi agent learning. Markov games are an extension of game theory to MDP-like environments. Littman studies a minimax-Q learning algorithm using a simple two player game where both players have opposing goals. Thus, there is a single reward function which one agent tries to minimize and the other tries to maximize. In this framework, the max operator in the update step of a standard Q learning algorithm is replaced by a minimax operator that is evaluated by linear programming. Hu and Wellman [18] extend this work to a broader framework, and prove the convergence of their method to a Nash equilibrium under specified conditions. Tan [13] studied the extension of flat reinforcement learning to the multi agent case.
He studies the impact of sharing observations, policies or episodes, and of having joint tasks, for a multi agent predator prey problem. The agents are shown to perform better when knowledge and information are communicated between agents. However, these approaches are extensions of "flat" Q learning and are not likely to scale, as the state-action value space quickly becomes intractable.

Crites and Barto [14] apply reinforcement learning to group elevator control by using a global reinforcement signal. This is based on the idea that if each member of a team of agents employs a reinforcement learning algorithm, a new collective learning algorithm emerges for the team as a whole. Each agent is responsible for controlling one elevator car in this case. Accumulated cost is updated incrementally after every passenger arrival, passenger transfer and car arrival event. The amount of cost accumulated between events is the same for all cars since they share the same objective function, but the amount accumulated between decisions is different as the cars make decisions asynchronously. Each car has a storage location where the total discounted cost it has incurred since its last decision is accumulated. The joint state space is modeled here too.

Stone and Veloso [20] propose a method for scaling multi-agent learning by learning value functions over a small set of action based features instead of states. Thus, the complexity of the learning task is reduced as a small feature space is constructed, which partitions the state into regions. Each agent learns to act only within its own partition. The feature set here is problem dependent.

Behavior based multi agent robotics has been studied by Balch [22] and Mataric [11]. In Balch's behavior based formation control, each robot maintains its position in the formation depending on the locations of the other robots. Behaviors such as avoid-obstacle, avoid-robot, move-to-goal and maintain-formation are implemented to enable the agents to perform the desired task. Mataric has proposed a method for minimizing the multi agent learning space through the use of behaviors and conditions, and for shaping reinforcement with heterogeneous reinforcement functions and progress estimators, which take advantage of implicit domain knowledge in order to accelerate learning. States are clustered into conditions, which are the necessary and sufficient subsets of state required for triggering the behavior sets. These approaches are dependent on the behaviors selected initially.

2.4 Discussion

A wide variety of methods have been tried to solve the problem of multi agent learning, with some significant successes. However, hierarchical reinforcement learning has not yet been explored as a vehicle for utilizing the hierarchy to help with structuring the problem or to improve scalability, which are the key benefits of hierarchical learning. The MAXQ method for hierarchical reinforcement learning is extensible to the multi agent case. Each agent can use the MAXQ hierarchy to decompose a task into sub-tasks. Coordination skills among agents can be learned by configuring the Q nodes at the highest level of the hierarchy to represent the joint action space among multiple agents. The value function is propagated upwards from the lower level nodes whenever a high level node needs to be evaluated. This enables the agent to simultaneously learn subtasks and high level tasks.
Thus, by using this method, the agent learns the co-ordination skills and the individual low level tasks and subtasks all at once. These ideas have been used in the multi agent hierarchical learning algorithm which is the topic of this thesis. The next chapter explains the new hierarchical multi agent learning approach proposed here.

Chapter 3 Co-operative MAXQ: A Hierarchical Multi Agent Reinforcement Learning Method

The MAXQ method stores the value function in a distributed way in all the nodes of the subtask graph. The value function is propagated upwards from the lower level nodes whenever a high level node needs to be evaluated. This enables the agent to simultaneously learn subtasks and high level tasks. Thus, by using this method, the agent learns co-ordination skills and individual low level tasks and subtasks all at once. However, it is necessary to generalize the MAXQ framework to make it more applicable to multi agent learning. A broad class of multi agent optimization tasks, such as AGV scheduling, can be viewed as discrete-event dynamic systems. For such tasks, the termination predicate used in MAXQ has to be redefined to account for the fact that the completion of certain subtasks might depend on the occurrence of an event rather than just a state of the environment.

3.1 Revisiting the Termination Predicate

In the original MAXQ formulation, the termination predicate is defined to be true or false depending on whether the particular state s is in the set of active states S_i or the set of terminal states T_i. For some types of problems, when a task terminates might depend on factors in the environment other than just the state. In many cases, the exact same state of the environment may result when a subtask is completed by the agent as when it was started. For example, consider Dietterich's taxi-cab domain, which was explained in chapter 2. Now assume that the task is not episodic and that passengers can appear at one of the four locations R, G, B, Y randomly. Also, a passenger disappears after being dropped at the destination location. Now consider that the MAXQ hierarchy contains the subtask deliver passenger from source location to destination location (there might be other subtasks, like refuel the taxi), and the agent chooses to perform it. Suppose the passenger source location is R, the destination location is G, and the taxi is currently at the passenger destination location G. Thus, the state of the environment at the onset of the subtask deliver passenger from R to G is: taxi at G, passenger at R, and no passenger in the taxi. While the taxi is performing the chosen subtask, another passenger might arrive at R, waiting to be taken to some destination. Thus, when the taxi completes the subtask, the state is still: taxi at G, passenger at R, and no passenger in the taxi. Without special care, these kinds of conditions might result in the agent not exiting from a subtask to be able to choose another high level task at the correct time.

One way to avoid this situation might be to construct the task graph in such a way that these conditions do not occur. If we separate the subtask deliver passenger from source location to destination location into two subtasks, namely go to source location and pick up passenger and go to destination location and put down passenger, it would again become possible to unambiguously define termination predicates with respect to states for these subtasks.
But this approach would obviously require a lot of care on the part of the person constructing the hierarchy. It might also mean more subtasks than necessary, which would increase the memory requirements and the complexity of the algorithm. Another way to deal with this problem is to separate the termination predicate into two parts, one for starting a particular subtask and another for ending it. Thus the start termination predicate T_start(i)(s) is the regular termination predicate as defined in the MAXQ formulation, which divides the states into active and non-active states, where active states mean that the subtask can be attempted by an agent, and non-active states mean that the subtask cannot be attempted in the current state. Once an agent starts a subtask after seeing that the state was present in the active state set, the end of this task depends on the end termination predicate T_end(i), which is governed by the occurrence of an event. Thus, returning to the earlier example, once an agent j has started the task deliver passenger from source to destination, it will end only when the event putdown passenger at destination by agent j occurs. This kind of termination predicate is not significantly more difficult to incorporate in the MAXQ algorithm, and it helps in making sure that inconsistencies are eliminated.

Thus the start termination predicate T_start(i)(s) is a predicate that partitions the set of states S into a set of active states S_a and a set of non-active states S_n. The policy for subtask M_i can be executed only if the current state s is in S_a. The end termination predicate T_end(i) is a predicate that signals when a subtask i is completed. It is initialized to 0 but changes to 1 when the subtask completion event occurs. T_end(i) is reinitialized to 0 once the agent observes its value.

3.2 State Abstraction

The MAXQ state abstraction is used, which gives us a compact way of representing the completion functions and speeds up the algorithm. This abstraction is based on five premises, as defined in [8].

• Max Node Irrelevance: The set of state variables which are irrelevant to a max node are ignored. Let M_i be a Max node in a MAXQ graph H for MDP M. A set of state variables Y is irrelevant to node i if the state variables of M can be partitioned into two sets X and Y such that for any abstract hierarchical policy π executed by the descendents of i, the state transition probability distribution at node i can be factored into the product of two distributions, one for the states in X and another for the states in Y, and for any pair of states s_1 and s_2 differing only in the variables in set Y, and any child action a, V^π(a, s_1) = V^π(a, s_2) and R̃_i(s_1) = R̃_i(s_2).

• Leaf Irrelevance: The set of state variables Y which are irrelevant to a primitive action (or leaf node) are ignored. This occurs when the expected value of the reward function does not depend on the values of the state variables in Y.

• Result Distribution Irrelevance: A set of state variables Y_j is irrelevant for the result distribution of action j if, for all abstract policies π executed by node j and its descendents in the MAXQ hierarchy, the following holds: for all pairs of states s_1 and s_2 that differ only in their values for the state variables in Y_j, P^π(s', N | s_1, j) = P^π(s', N | s_2, j) for all s' and N.

• Termination: This occurs when a subtask is guaranteed to cause its parent task to terminate in the goal state.
Thus, if M_i is a task in the MAXQ graph such that for all states s where the goal termination predicate is true the pseudo reward function R̃_i = 0, then for any policy executed at node i the completion cost is zero and does not need to be explicitly represented.

• Shielding: Let M_i be a task in a MAXQ graph, and s be a state such that for all paths from the root of the graph down to node M_i there exists a subtask j (possibly equal to i) whose termination predicate T_j(s) is true. Then the Q nodes of M_i do not need to represent completion function values for state s.

If these conditions are examined with reference to the start termination predicate under the new definition of the termination predicate, they hold true without any changes. This is because the definition of the start termination predicate T_start(i)(s) is the same as before. Thus, all five kinds of state abstraction can be taken advantage of in the multi agent MAXQ algorithm, even with the new definition of the termination predicate.

3.3 The Multi Agent MAXQ Algorithm

The most salient feature of this algorithm is that the top level (the level immediately below the root) of the hierarchy stores the completion function (C) values for the joint (abstract) actions of all agents. This completion function C^j(i, s, a^1, a^2, ..., a^j, ..., a^n), where there are n agents, is defined as the expected discounted reward of completion of subtask a^j by agent j, after invoking the subroutine for subtask i, in the context of the other agents performing subtasks a^1 ... a^n.

The recursive MAXQ algorithm is used for learning the C values. Thus, an agent starts at the root task and successively chooses subtasks until it gets to a primitive action. The primitive action is executed, the reward observed, and the leaf V values updated. When a subtask terminates, the C(i, s, a) values are updated for all states visited during the execution of that subtask. Similarly, when one of the tasks at the level just below the root max node terminates, the C(i, s, a^1, a^2, ..., a^j, ..., a^n) values are updated according to the MAXQ learning algorithm. The new multi agent MAXQ learning algorithm for each agent (say agent number j) is shown in figure 3.1, where a^j is the action being performed by the current agent. We assume here that the high level actions a^1, a^2, ..., a^n being performed by all the agents, and the agent number of the current agent, are available globally. The decomposition equations used for calculating the projected value function V now take on the form:

$$V_t(i, s) = \begin{cases} \max_{a^1 \ldots a^n} Q_t(i, s, a^1 \ldots a^j \ldots a^n) & \text{if } i \text{ is composite} \\ V_t(i, s) & \text{if } i \text{ is primitive} \end{cases}$$

$$Q_t(i, s, a^1 \ldots a^j \ldots a^n) = V_t(a^j, s) + C_t(i, s, a^1 \ldots a^j \ldots a^n) \qquad (3.1)$$

for the highest level of abstraction, where the joint action values are considered. The decomposition equations for the lower levels in the hierarchy do not change.
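Before turning to the pseudocode, the following Python sketch illustrates how an agent could evaluate Equation 3.1 at the root, choosing its own high level action given the high level actions announced by the other agents. The table layout and function signatures are hypothetical; V stands for the projected value computed recursively by the lower levels of the hierarchy.

from collections import defaultdict

# C_joint[(s, own_action, others)] approximates C(root, s, a^1 ... a^n), where
# `others` is a tuple of the high level actions currently executed by the other agents.
C_joint = defaultdict(float)

def q_root(s, own_action, others, V):
    # Q(root, s, a^1 ... a^j ... a^n) = V(a^j, s) + C(root, s, a^1 ... a^n)   (Equation 3.1)
    return V(own_action, s) + C_joint[(s, own_action, tuple(others))]

def greedy_root_action(s, own_actions, others, V):
    # The agent's greedy choice for its own slot in the joint action, holding
    # the other agents' announced high level actions fixed.
    return max(own_actions, key=lambda a: q_root(s, a, others, V))

Because the joint action appears only at this top level, the memory cost of coordination grows with the number of joint abstract actions rather than with the full joint state-action space. The full per-agent learning procedure and its completion function updates are given in figures 3.1 and 3.2.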
function MAXQ(MaxNode i, State s)
  let seq = () be the sequence of states visited while executing i
  if i is a primitive MaxNode
    execute i, receive r, and observe the result state s'
    V_{t+1}(i, s) = (1 − α_t(i)) · V_t(i, s) + α_t(i) · r
    append s onto the front of seq
  else
    let count = 0
    if T_start(i)(s) is false do
      while T_end(i)(s) is false do
        choose an action a for the current agent according to the current exploration policy
        update the global action vector by replacing the action being performed by agent j with action a, if action a is a highest level abstract action
        let childseq = MAXQ(a, s), where childseq is the sequence of states visited while executing action a
        observe the result state s'
        if i = root task do
          UpdateJointC(i, s, s', a, childseq)
        else do
          UpdateC(i, s, s', a, childseq)
        append childseq onto the front of seq
        s = s'
      end \\ while
    end \\ if
  end \\ else
  return seq
end MAXQ

Figure 3.1: The multi agent MAXQ algorithm.

function UpdateJointC(i, s, s', a, childseq)
  let a* = argmax_{a'} [ C̃_t(i, s', a^1 ... a' ... a^n) + V_t(a', s') ], where a^j is the subtask being executed by the current agent j
  let N = length(childseq)
  for each s in childseq do
    C̃_{t+1}(i, s, a^1 ... a^j ... a^n) = (1 − α_t(i)) · C̃_t(i, s, a^1 ... a^j ... a^n) + α_t(i) · γ^N [ R̃_i(s') + C̃_t(i, s', a^1 ... a* ... a^n) + V_t(a*, s') ]
    C_{t+1}(i, s, a^1 ... a^j ... a^n) = (1 − α_t(i)) · C_t(i, s, a^1 ... a^j ... a^n) + α_t(i) · γ^N [ C_t(i, s', a^1 ... a* ... a^n) + V_t(a*, s') ]
    N = N − 1
  end \\ for
end UpdateJointC

function UpdateC(i, s, s', a, childseq)
  let a* = argmax_{a'} [ C̃_t(i, s', a') + V_t(a', s') ]
  let N = length(childseq)
  for each s in childseq do
    C̃_{t+1}(i, s, a) = (1 − α_t(i)) · C̃_t(i, s, a) + α_t(i) · γ^N [ R̃_i(s') + C̃_t(i, s', a*) + V_t(a*, s') ]
    C_{t+1}(i, s, a) = (1 − α_t(i)) · C_t(i, s, a) + α_t(i) · γ^N [ C_t(i, s', a*) + V_t(a*, s') ]
    N = N − 1
  end \\ for
end UpdateC

Figure 3.2: The algorithm for updating the joint and individual completion function values at abstract levels for the multi agent MAXQ algorithm.
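As a compact, hedged re-expression of the UpdateJointC routine of figure 3.2, the Python sketch below updates a joint completion table once a child subtask has finished and returned the list of states it visited. The γ^N discounting and the two-table structure mirror the pseudocode; the table keys, the fixed learning rate, the assumed ordering of the visited states, and the function signature are illustrative choices rather than the thesis's implementation.

from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
C_tilde = defaultdict(float)   # completion values including the pseudo reward (used for the greedy choice)
C = defaultdict(float)         # completion values without the pseudo reward

def update_joint_c(s_next, child_action, others, childseq, own_actions, V, pseudo_reward=0.0):
    # childseq: states visited while the child subtask executed, assumed earliest first.
    # others:   tuple of the other agents' current high level actions.
    # V:        function (action, state) -> projected value from the lower levels.
    a_star = max(own_actions,
                 key=lambda a: C_tilde[(s_next, a, others)] + V(a, s_next))
    N = len(childseq)
    for s in childseq:
        discount = GAMMA ** N
        target_tilde = discount * (pseudo_reward
                                   + C_tilde[(s_next, a_star, others)]
                                   + V(a_star, s_next))
        target = discount * (C[(s_next, a_star, others)] + V(a_star, s_next))
        key = (s, child_action, others)
        C_tilde[key] = (1 - ALPHA) * C_tilde[key] + ALPHA * target_tilde
        C[key] = (1 - ALPHA) * C[key] + ALPHA * target
        N -= 1

The UpdateC routine for lower level, non-joint nodes is analogous, except that the completion tables are keyed only by the agent's own action.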
The two robots model the joint action space at the highest level in both cases, but in the first scenario the robots model the joint state space of both agents, while in the second scenario each agent ignores the state of the other agent.

The state space here consists of the orientation of the robot (N, S, E, W) and another component based on its percept. We assume that a ring of 16 sonars would enable the robot to find out whether it is in a corner (with two walls perpendicular to each other on two sides of the robot), near a wall (with a wall on only one side), near a door (a wall on either side of an opening), in a corridor (parallel walls on either side), or in an open area (the middle of the room). Thus each room is divided into 9 states, and the corridor into 4 states, so we have ((9 × 3) + 4) × 4, or 124, locations for a robot. Also, trash 1 can be at T1, with the robot, or at the Dump, and trash 2 can be at T2, with the robot, or at the Dump. Thus the total number of environment states is 124 × 3 × 3, or 1116, for the single agent case. Going to the two agent case means that each piece of trash can be at T1/T2, at the Dump, or with one of the two robots, so the state space grows to 124 × 124 × 4 × 4, or about 2.4 × 10^5 states. The environment is fully observable with this state decomposition, as the direction the robot is facing, in combination with its percept and the assumption that the agent knows which room it is in, gives a unique value for each state. The primitive actions considered here are behaviors to find a wall in one of four directions, align with the wall on the left or right side, follow a wall, enter or exit a door, align south or north in the corridor, and move in the corridor.

4.2 Results

Figure 4.4: The average reward accumulated from the start state to the goal state, averaged over 10 runs, for the single agent case and the two co-operating agent cases (own state space and product state space).

Figure 4.5: The number of primitive actions required to get to the goal state, averaged over 10 runs.

Experiments were conducted and the learning curves and performance of the agents compared for the single agent case, the multi agent case with joint state space, and the multi agent case with own state space. The average reward accumulated while traveling from the start state to the goal state is shown in figure 4.4, and the number of primitive actions required to reach the goal state in figure 4.5. As seen from figures 4.4 and 4.5, all three methods converge in a similar manner, and are able to finish the task of depositing both trash 1 and trash 2 from the individual cans into the trash dump. The difference is seen when the learned policies are examined. At each step, the actions and subtasks performed by the agents are stored, in order to look at the final policy learned by the agents. In the single agent case, the task is performed in a sequential manner, and the final abstract policy is shown in figure 4.6.
Learned policy for the single agent:

    root
      navigate to trash 1
        go to location of trash 1 in room 1
      pick trash 1
      navigate to bin
        exit room 1
        enter room 3
        go to location of dump in room 3
      put trash 1 in dump
      navigate to trash 2
        exit room 3
        enter room 1
        go to location of trash 2 in room 1
      pick trash 2
      navigate to bin
        exit room 1
        enter room 3
        go to location of dump in room 3
      put trash 2 in dump
    end

Figure 4.6: Learned policy for a single agent performing the trash collecting task using the MAXQ hierarchical decomposition.

On examining the policy in the case where two agents are acting in parallel (see figure 4.7), it is seen that the total number of actions required by each robot to finish the task is half of that in the single agent case; thus the time taken to finish the task is reduced by half.

Learned policy for agent 1:

    root
      navigate to trash 1
        go to location of trash 1 in room 1
      pick trash 1
      navigate to bin
        exit room 1
        enter room 3
        go to location of dump in room 3
      put trash 1 in dump
    end

Learned policy for agent 2:

    root
      navigate to trash 2
        go to location of trash 2 in room 1
      pick trash 2
      navigate to bin
        exit room 1
        enter room 3
        go to location of dump in room 3
      put trash 2 in dump
    end

Figure 4.7: The top and bottom boxes show the learned policies for agents 1 and 2 respectively, when the two agents are collectively performing the trash collecting task using the co-operative multi agent MAXQ algorithm.

Figure 4.8 clearly shows the difference in performance between the co-operative multi agent case and the single agent case. This plot shows the number of steps required to complete the task of depositing the trash, measured in terms of the number of primitive actions required in both cases.

Figure 4.8: The number of primitive actions required to complete the task, averaged over 10 runs.

Parallel execution of two primitive actions in the multi agent case almost halves the number of steps required.

4.3 Conclusion

Looking at the encouraging results presented above for the simple navigation problem, we decided to test the hierarchical multi agent learning algorithm on the complex AGV scheduling task. The following chapters discuss the details of the AGV scheduling task and the experimental results obtained with the co-operative multi agent algorithm.

Chapter 5

Experiments with AGV Scheduling

An Automated Guided Vehicle (AGV) system is a set of co-operating robotic vehicles used for fetching and delivering material and finished assemblies between a warehouse and the machines processing them. Automated Guided Vehicles (AGVs) are routinely used in industrial flexible manufacturing systems (FMS) for material handling. Any FMS using AGVs has to deal with the problem of optimally scheduling the paths of the AGVs in the system. AGV scheduling can be defined as the optimization process which selects and sequences activities for AGVs, given a set of constraints that reflect the temporal relationships among activities and the capacity limitations of a set of shared resources [24]. Assemblies may be queued at any station, awaiting delivery to their destinations. In addition, more parts might arrive into the system even as deliveries are being made.
The problem of sequencing an AGV's pickups and deliveries so that parts reach their destinations as quickly as possible is known as the AGV scheduling problem. The complex nature of this problem makes it very interesting, as the states keep changing not only due to the agent's actions but also on their own. This is what prompted us to use this problem as a testbed for the multi agent reinforcement learning algorithm developed here. The uncertain and ever changing nature of the job shop makes it virtually impossible to plan moves ahead of time. Hence, AGV scheduling requires dynamic dispatching rules.

5.1 Scheduling in Flexible Manufacturing Systems

Planning is the process of selecting and sequencing activities such that they achieve one or more goals and satisfy a set of domain constraints. Scheduling is the process of selecting among alternative plans and assigning resources and times to the set of activities in the plan. These assignments must obey a set of rules or constraints that reflect the temporal relationships between activities and the capacity limitations of a set of shared resources. The assignments also affect the optimality of a schedule with respect to criteria such as cost, tardiness, or throughput. In summary, scheduling is an optimization process where limited resources are allocated over time among both parallel and sequential activities [35].

A flexible manufacturing system consists of a set of machine tools and a material handling system linked by a network of computers controlling and interfacing them [27]. Unlike the traditional material handling system, where a human element is involved in the transportation of materials between various locations, most of the control in a flexible manufacturing system is done with the help of computers. This has been made possible by developments in guided vehicle technology and computer controlled systems and machines [26]. With the elimination of human intervention, the AGV control system and the software governing it become complex and critical, especially in the case of multi vehicle systems. Issues in material handling systems include the number of vehicles required, the track layout, the traffic pattern and traffic control, but the focus of this thesis is on the traffic control problem.

AGV scheduling requires dynamic dispatching rules, which depend on the state of the system, such as the number of parts in each buffer, the state of the AGVs, and the processing going on at the workstations. Since this problem is analytically intractable, various heuristics and their combinations are generally used to schedule AGVs [23] [25] [28]. Some commonly used heuristics are:

FCFS: First Come First Served. The assignment which enters the queue first is the one which is serviced first.

SS: Stay in Same Station. After dropping off a part at the dropoff station of one machine, the robot goes to the pickup station of the same machine to pick up the finished assembly.

NS: Nearest Station. After finishing an assignment, the AGV scans through a list, acquires the nearest assignment, and proceeds towards it.

HQ: Highest Queue First. The AGV scans through a list of assignments and goes to the pickup station of the machine with the highest number of in-process jobs.

HOQ: Highest Output Queue First. The AGV first services the station which has the highest number of jobs waiting to be serviced in its output queue.

LWKR: Least Work Remaining First.
The part which has the least amount of work remaining to be done on it gets priority in the AGV selection process.

RAN: Random Selection.

CYC: Cyclic Selection.

Various combinations of these heuristics are also used for scheduling purposes. It has been shown that a combination of these heuristics (i.e., a composite heuristic) works better than any one heuristic alone [23] [25]. Other studies have looked at the advantages and disadvantages of using multiple vehicle rules or tandem layouts with many single vehicle loops [36], [37]. The tandem layout is based on partitioning all stations into non-overlapping, single vehicle closed loops, with additional pickup/dropoff points provided as an interface between adjacent loops. This kind of layout works well to avoid congestion and supports distributed control. However, it has less routing flexibility in case of breakdowns, and has a more difficult load routing problem to solve.

5.2 Assumptions Used

• Local buffer capacity is limited and the number of AGVs is fixed.
• Input buffer capacity and output buffer capacity are the same for all stations.
• AGV travel distances from station to station are known.
• The inter-arrival distribution of parts is normally distributed.
• Once an action (primitive or high level) is started, it has to run to completion.
• Processing times are uniformly distributed.
• The sequence of operations to be performed on the parts is predetermined.

Traditionally, various heuristics such as FCFS (First Come First Served), Nearest Station first, Stay in Same Station, or combinations of these heuristics have been used for AGV scheduling in such systems. The heuristics have the advantage that they are simple to implement, but in most systems performance takes much higher precedence over simplicity.

5.3 The Problem Description

Figure 5.1 shows the layout of the system used for experimental purposes. Parts of type i have to be carried to the dropoff station at workstation i, and the assembled parts brought back to the warehouse. The AGV travel is unidirectional (as the arrows show). Figure 5.2 shows the MAXQ graph for this problem.

Figure 5.1: A multiple automatic guided vehicle (AGV) optimization task (P: pickup station; D: dropoff station; M: machine; the warehouse has load and unload decks, and the labeled station-to-station distances range from 30 m to 60 m).

Figure 5.2: The MAXQ graph for the automatic guided vehicle (AGV) optimization task (DMi: deliver material to station i; DAi: deliver assembly to station i; NavLoad: navigate to the loading deck; NavPuti: navigate to dropoff station i).

5.4 Implementation Details

Synchronization and concurrency issues arise while implementing a multi agent learning algorithm; these are discussed in this section. For the AGV scheduling task, the parent process first spawns five processes, one of which models the part arrival, and the other four model the machine activity at each workstation (picking up parts from the dropoff buffer, servicing them, and putting the finished assembly in the pickup buffer).

Figure 5.3: The program flow in the implementation of the AGV scheduling task. The main process initializes the parameters, reads the hierarchy from a file, and constructs the hierarchy tree; separate processes model the part arrival, the four assembly machines, and the four robots.
Three processes are then spawned for modeling three of the agents, and the parent process models the fourth agent. This is shown diagrammatically in figure 5.3, where the circles denote the processes spawned by the main process to perform the various activities listed. The joint action value function is stored in a file on the hard disk, which is accessible to all four AGV processes. The state of the environment and the actions currently being performed by each agent are stored in shared memory, so that each agent has access to them. Synchronization issues while accessing shared resources are taken care of with the help of semaphores.

We present the experimental results obtained by using the multi agent MAXQ framework developed here on the AGV scheduling task. Results were obtained for the flat approach, the single agent MAXQ approach, the selfish multi agent MAXQ approach (where multiple agents with the same hierarchy try to learn in a common environment without sharing any information), and the new co-operative multi agent approach.

5.5 State Abstraction

The state of the environment consists of the number of parts in the pickup station and in the dropoff station of each machine, and whether the warehouse contains parts of each of the four types. In addition, each agent keeps track of its own location and state as part of the state space. Thus, in the flat case, the state space consists of about 100 locations, up to 3 parts in each buffer, 9 possible states of the AGV (carrying Part1, ..., Empty), and 2 values for each part type in the warehouse, i.e. 100 × 4^8 × 9 × 2^4 ≈ 2^30 states, which is enormous, and would blow up even more to (100 × 9)^4 × 4^8 × 2^4, or about 2^60, in the multi agent case with four agents.

The MAXQ state abstraction helps in reducing the state space considerably, since for each action, i.e. each node in the task graph, only the relevant state variables are used while storing the completion functions. For example, for the Navigate subtasks only the location state variable is relevant, and this subtask can be learned with 100 values. Hence, for the highest level with 8 actions, i.e. DM1, ..., DM4 and DA1, ..., DA4, the relevant state variables amount to 100 × 9 × 4 × 2 ≈ 2^13 values. For the lower levels, the action with the largest state space is Navigate, with 100 values. This state abstraction gives us a compact way of representing the C functions, and speeds up the algorithm.

5.6 Experimental Results

We present experimental results on the AGV scheduling task comparing several learning agents, including a single agent using MAXQ, selfish multiple agents using MAXQ (where each agent acts independently and learns its own optimal policy), and the new co-operative multi agent MAXQ approach. The experimental results were generated with the following model parameters. The inter-arrival time for parts at the warehouse is uniformly distributed with a mean of 4.0 sec. The percentages of Part1, Part2, Part3 and Part4 in the part arrival process are 20, 28, 22 and 30 respectively. The time required for assembling the various parts is normally distributed with means of 15, 24, 24 and 30 sec for Part1, Part2, Part3 and Part4 respectively. Each experiment was conducted five times and the results averaged.
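As an illustration only, the part arrival and assembly model just described could be sampled along the following lines. The uniform inter-arrival range and the assembly-time standard deviation are not stated in the text and are assumed here, as are all the names in the sketch.

    # Illustrative sketch of the part arrival and assembly-time model described
    # above. Only the means and the part mix are given in the text; the uniform
    # range and the standard deviation below are assumptions.
    import random

    PART_MIX = {"Part1": 0.20, "Part2": 0.28, "Part3": 0.22, "Part4": 0.30}
    MEAN_INTERARRIVAL = 4.0                                   # sec (given)
    ASSEMBLY_MEANS = {"Part1": 15.0, "Part2": 24.0, "Part3": 24.0, "Part4": 30.0}
    ASSEMBLY_STD = 2.0                                        # sec (assumed)

    def next_interarrival():
        # Uniformly distributed with mean 4.0 sec; the range 0..8 sec is assumed.
        return random.uniform(0.0, 2.0 * MEAN_INTERARRIVAL)

    def next_part_type():
        # Draw a part type according to the stated 20/28/22/30 percent mix.
        return random.choices(list(PART_MIX), weights=list(PART_MIX.values()))[0]

    def assembly_time(part_type):
        # Normally distributed assembly time with the stated mean, truncated at zero.
        return max(0.0, random.gauss(ASSEMBLY_MEANS[part_type], ASSEMBLY_STD))

An event-driven simulator would repeatedly draw from these distributions to schedule part arrivals at the warehouse and job completions at the four workstations.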
Figure 5.4: Comparison of the single agent, selfish multi agent, co-operative multi agent, and flat methods for the AGV scheduling task, showing the throughput of the system when the AGV travel time is very small compared to the assembly time.

Figure 5.5: Comparison of the single agent and multi agent MAXQ methods on the AGV scheduling task, showing the results when the AGV travel time and load/unload time are one tenth of the average assembly time.

Figures 5.4 and 5.5 show the throughput of the system for the three learning agents (measured in terms of the number of assemblies delivered per minute). Four homogeneous agents are considered in the multi agent cases. As seen in figure 5.4, the agents learn a little faster initially with the selfish multi agent method, but after some time undulations appear in the graph, showing not only that the algorithm does not stabilize, but also that it results in sub-optimal performance. This is due to the fact that two or more agents select the same action; once the first agent completes the task, the other agents might have to wait for a long time to complete it, due to the constraints on the number of parts that can be stored at a particular place. The system throughput achieved using the new co-operative multi agent MAXQ method is significantly higher than in the single agent or selfish multi agent case. This difference is even more significant in figure 5.5, since when the agents have a longer travel time, the cost of making a mistake is greater.

Figure 5.6 shows results from an implementation of a single flat Q-learning agent with the buffer capacity at each station set to 1. As can be seen from the plot, the flat algorithm converges extremely slowly. The throughput at 70,000 sec has gone up to only 0.07, compared with 2.6 for the hierarchical single agent case.

Figure 5.7 compares the co-operative multi-agent MAXQ algorithm with a well-known AGV dispatching rule, showing clearly the improved performance of the reinforcement learning method. Figure 5.8 compares the co-operative multi-agent MAXQ algorithm with two AGV dispatching rules, namely a composite of the "nearest station first" and "stay in same station" heuristics, and the "highest queue first" heuristic.

Figure 5.6: A flat implementation of a single agent reinforcement learner.

Figure 5.7: The performance of the proposed cooperative multi-agent MAXQ compared with the well known "first come first served" heuristic for AGV dispatching.
Figure 5.8: The performance of the cooperative multi-agent MAXQ algorithm compared with a composite of the nearest station and stay in same station heuristics, and with the highest queue first heuristic, for AGV dispatching.

Figure 5.9: The performance of the cooperative multi-agent MAXQ algorithm when one agent breaks down at 50000 sec, another at 70000 sec, and a third at 90000 sec.

The way the "highest queue first" heuristic works is that when a vehicle finishes its task, it scans through the list of pending jobs and chooses to serve the machine with the highest number of in-process jobs. In the "nearest station first / stay in same station" heuristic, the AGV selects the assignment at the location nearest to it; when there are no jobs in its job list, the AGV goes to the pickup station of the same machine where it dropped the part, and waits for the job on that part to be finished. The better performance of the hierarchical reinforcement learning algorithm can be clearly seen from this figure.

Figure 5.9 shows the throughput achieved by the cooperative multi-agent MAXQ algorithm when one agent breaks down at 50000 sec, another at 70000 sec, and a third at 90000 sec. The final performance, when there is only one agent left, is comparable to the single agent case. When an agent breaks down here, the other agents assume that it is continuously performing the idle action.

These experiments show that the new co-operative multi agent MAXQ method performs better than the heuristics commonly used for AGV dispatching, the flat reinforcement learning method, and the single agent and selfish multi agent MAXQ methods, and is thus a promising way to approach multi agent hierarchical reinforcement learning problems. This method can be extended in a number of different ways, which are discussed in the next chapter.

Chapter 6

Conclusions and Future Work

This thesis studied the problem of multi agent reinforcement learning and proposed a novel approach for dealing with it. In particular, we extended the MAXQ approach for hierarchical reinforcement learning to the multi agent domain. This chapter summarizes the new multi-agent hierarchical reinforcement learning approach, and provides several interesting ideas for extending this work in the future.

6.1 Conclusion

We proposed a new algorithm which takes advantage of the abstraction provided by hierarchical reinforcement learning in order for multiple agents to cooperatively learn tasks which might require coordination among agents. Flat methods do not scale well to multi agent tasks. Also, interaction among agents at the lower levels sometimes makes it very difficult for co-operation skills to be learned. Using hierarchy in combination with multi agent learning speeds up learning enormously. Also, a good policy is arrived at, as co-operation is learned only at the higher level and does not have any effect on the low level policy.
The only information an agent has about the other agents is the high level action being performed by each of them, which requires very little communication and is also an approximate indicator of the states the other agents might possibly be in. This approach for scaling multi agent reinforcement learning uses the MAXQ hierarchical learning framework. The basic MAXQ learning algorithm is adapted to the multi agent case by having joint action values at the top level of the hierarchy. Events are used to signal the completion of a subtask instead of state information, as that is more feasible in certain kinds of tasks which require discrete event modeling. The proposed method assumes a distributed architecture in which each agent makes its own decision about the action it should perform, depending on the state of the environment and the actions being performed by the other agents, rather than having a centralized controller dictate its actions. Detailed experimental results from a complex AGV scheduling task were presented which show that the hierarchical co-operative multi agent MAXQ approach performed better than either the single agent or the selfish multi agent MAXQ method.

This novel approach of utilizing hierarchy for learning co-operation skills shows considerable promise as an approach that can be applied to other complex multi-agent domains. We primarily explored the use of the MAXQ hierarchical framework in our study, but we believe that other hierarchical methods could also be used to speed up multi-agent learning. The success of this approach depends on providing the system with a good initial hierarchy.

6.2 Future Work

This novel approach of utilizing hierarchy for learning co-operation skills suggests many interesting possibilities for future research, as a lot still remains to be explored in both hierarchical and multi agent learning. We present a few ideas for future work below.

6.2.1 Adaptive Hierarchies

Since a large part of the success of this algorithm depends on a good initial hierarchy, changing the hierarchy would dramatically change the results and the learning speed. Thus, an area worth exploring is how one can arrive at a good hierarchy for decomposing the overall task into subtasks. The potential for state abstraction is an important factor to be considered here, as some hierarchies might permit more savings in terms of state abstraction. Another factor in the multi agent case is that the top level decomposition must split the task into essentially parallel subtasks, as these are the subtasks where the co-operation skills are learned. For example, in the AGV scheduling task, the subtasks load material and drop material at dropoff station are sequential, as these two have to be performed in a certain order to achieve the goal of delivering material to a station, and it makes more sense to group them together under another subtask. This situation also arises because our approach considers co-operation only at the highest level of abstraction; hence the subtask decomposition has to be devised in such a way that it is possible to learn co-operation skills at that level alone. It would be helpful to devise a method by which agents could learn co-operation at the lower levels of abstraction too if desired, but the overhead of doing so might possibly overshadow the benefits. Also, not having co-operation at the lower levels provides opportunities for re-using the lower level policy once it has been learned.
Coming back to the problem of devising hierarchies, using heterogeneous agents with different initial hierarchies might be a step in this direction. Agents could even try different variations in parallel by moving around the nodes in the hierarchy, finally choosing the solution which maximizes the overall system performance.

6.2.2 Other Hierarchical Learning Approaches

We primarily explored the use of the MAXQ hierarchical framework in our study, but we believe that other hierarchical methods could also be used to speed up multi-agent learning. Thus, it would be worth exploring multi agent learning in combination with the other hierarchical reinforcement learning frameworks and seeing how well they perform. For example, trying to extend the HAM (Hierarchies of Abstract Machines) or Options framework to the multi agent case might pose new challenges, and might help us better understand the hierarchical multi agent reinforcement learning paradigm.

6.2.3 Theoretical Foundations

We have not investigated the theory behind the new hierarchical reinforcement learning method. Q-learning has been proven to converge to an optimal policy, whereas MAXQ learning has been proven to converge to a recursively optimal policy under certain conditions. These theoretical foundations need to be examined in the multi agent case as well, as that will provide a better understanding of the multi agent reinforcement learning problem, which in turn might lead to better solutions. It would also help to have a more formal and unified definition of multi agent and hierarchical multi agent frameworks.

6.2.4 Other Extensions

Collective reward sharing might speed up learning even more, as all the agents would receive reinforcement more often. This would mean that when one agent receives a reward, the other agents would also be notified of it, and they would accumulate this reward too between decision epochs. Real life systems are very stochastic in nature, and conditions (e.g., requirements for the number of assemblies) are constantly changing. Hence, if the algorithm is modified to allow rewards which change dynamically, it would be more useful in real world situations. This approach also needs to be tested on more complex problems.

BIBLIOGRAPHY

[1] Mitchell, T. (1997) Machine Learning. McGraw-Hill.

[2] Sutton, R. S. and Barto, A. G. (1998) Reinforcement Learning: An Introduction. The MIT Press.

[3] Kaelbling, L. P. (1993) Hierarchical learning in stochastic domains: Preliminary results. Proceedings of the Tenth International Conference on Machine Learning.

[4] Dietterich, T. G. (1998) The MAXQ method for hierarchical reinforcement learning. International Conference on Machine Learning.

[5] Bertsekas, D. (1995) Dynamic Programming and Optimal Control. Athena Scientific.

[6] White, D. (1963) Dynamic programming, Markov chains, and the method of successive approximation. Journal of Mathematical Analysis and Applications, vol. 6, pp. 373-376.

[7] Watkins, C. (1989) Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.

[8] Dietterich, T. G. (2000) State abstraction in MAXQ hierarchical reinforcement learning. Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. R. Muller (eds.), pp. 994-1000, MIT Press.

[9] Dietterich, T. G. (2000) Intra-option learning about temporally abstract actions. Proceedings of the 15th International Conference on Machine Learning, pp. 556-564.

[10] Mahadevan, S. and Connell, J. (1992)
Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, vol. 55, nos. 2-3, pp. 311-365.

[11] Mataric, M. (1997) Reinforcement learning in the multi-robot domain. Autonomous Robots, 4(1), pp. 73-83.

[12] Littman, M. (1994) Markov games as a framework for multi-agent reinforcement learning. Proceedings of the Eleventh International Conference on Machine Learning, pp. 157-163.

[13] Tan, M. (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337, Amherst, MA.

[14] Crites, R. H. and Barto, A. G. (1998) Elevator group control using multiple reinforcement learning agents. Machine Learning, 33, pp. 235-262.

[15] Tesauro, G. (1992) Practical issues in temporal difference learning. Machine Learning, vol. 8, no. 3.

[16] Sutton, R. (1988) Learning to predict by the method of temporal differences. Machine Learning, vol. 3, pp. 9-44.

[17] Singh, S. and Bertsekas, D. (1997) Reinforcement learning for dynamic channel allocation in cellular telephone systems. NIPS 10 proceedings.

[18] Hu, J. and Wellman, M. (1998) Multiagent reinforcement learning: Theoretical framework and an algorithm. Fifteenth International Conference on Machine Learning, pp. 242-250.

[19] Sutton, R. S., Precup, D. and Singh, S. (1999) Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, pp. 181-211.

[20] Stone, P. and Veloso, M. (1998) Team-partitioned, opaque-transition reinforcement learning. Agents, pp. 86-91.

[21] Parr, R. E. (1998) Hierarchical Control and Learning for Markov Decision Processes. PhD thesis, University of California, Berkeley.

[22] Balch, T. and Arkin, R. C. (1998) Behavior-based formation control for multi-robot teams. IEEE Transactions on Robotics and Automation, vol. 14, no. 6, pp. 1-15.

[23] Lee, J. (1996) Composite dispatching rules for multiple-vehicle AGV systems. SIMULATION, 66(2), pp. 121-130.

[24] Askin, R. G. and Standridge, C. R. (1993) Modeling and Analysis of Manufacturing Systems. John Wiley and Sons.

[25] Klein, C. M. and Kim, J. (1996) AGV dispatching. International Journal of Production Research, 34(1), pp. 95-110.

[26] Schwind, G. F. (1988) AGVs: creative solutions go looking for problems. Material Handling Engineering, 43, pp. 44-47.

[27] Mahadevan, B. and Narendran, T. T. (1990) Design of an automated guided vehicle-based material handling system for a flexible manufacturing system. International Journal of Production Research, 28(9), pp. 1611-1622.

[28] Sabuncuoglu, I. (1997) A study of scheduling rules of flexible manufacturing systems: a simulation approach. International Journal of Production Research, 36(2), pp. 527-546.

[29] Tadepalli, P. and Ok, D. (1996) Scaling up average reward reinforcement learning by approximating the domain models and the value function. Proceedings of the International Machine Learning Conference.

[30] Wang, G. and Mahadevan, S. (1999) Hierarchical optimization of policy-coupled semi-Markov decision processes. International Conference on Machine Learning.

[31] Hauskrecht, M., Meuleau, N., Boutilier, C., Kaelbling, L. P. and Dean, T. (1998) Hierarchical solution of Markov decision processes using macro-actions. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence.

[32] Dayan, P. and Hinton, G. (1993) Feudal reinforcement learning.
Advances in Neural Information Processing Systems, volume 5, Cambridge, MA.

[33] Parr, R. and Russell, S. (1998) Reinforcement learning with hierarchies of abstract machines. Advances in Neural Information Processing Systems, volume 10, Cambridge, MA.

[34] Precup, D., Sutton, R. and Singh, S. (1997) Theoretical results on reinforcement learning with temporally abstract behaviors.

[35] Zweben, M. and Fox, M. S. (eds.) (1994) Intelligent Scheduling. Morgan Kaufmann Publishers, Inc.

[36] Choi, H. G., Kwon, H. J. and Lee, J. (1994) Traditional and tandem AGV system layouts: A simulation study. SIMULATION, pp. 85-93.

[37] Bozer, Y. A. and Srinivasan, M. M. (1991) Tandem configurations for automated guided vehicles and the analysis of single vehicle loops. IIE Transactions, vol. 23, no. 1, pp. 72-82.

[38] Sheppard, J. W. (1998) Co-learning in differential games. Machine Learning, special issue on multi-agent learning, vol. 33, no. 2/3, pp. 201-233.

[39] Sugawara, T. and Lesser, V. (1993) Learning coordination plans in distributed problem-solving environments. Computer Science Technical Report 93-27, University of Massachusetts, Amherst.

[40] Kaelbling, L., Littman, M. and Moore, A. (1996) Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4.

[41] Law, A. and Kelton, W. (1991) Simulation Modeling and Analysis. New York, USA: McGraw-Hill.

[42] Marchalleck, N. J. (1996) Improving Simulation Technology Through Reinforcement Learning. Master's thesis, University of South Florida.

[43] Dietterich, T. G. (1999) Hierarchical Reinforcement Learning Tutorial. Given at the Sixteenth International Conference on Machine Learning, Bled, Slovenia.

[44] Mahadevan, S. and Kaelbling, L. P. (1996) The NSF workshop on reinforcement learning: Summary and observations. AI Magazine, 1(12), pp. 1-37.

[45] Puterman, M. L. (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Mathematical Statistics, John Wiley and Sons, Inc.

[46] Russell, S. and Norvig, P. (1995) Artificial Intelligence: A Modern Approach. Prentice Hall.

[47] Puterman, M. (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley.