THESIS

}

lllllllll‘llllllllllllllllllllllllllll‘Ul‘lllllllllllllllll

3 1293 01776 668

 

LIBRARY
Michigan State
University

 

 

 

This is to certify that the

dissertation entitled

Applications of Play Against Past
Strategies in Repetitions of a Game

presented by

Lei Chen

has been accepted towards fulfillment
of the requirements for ‘

Ph.D. degree in StatiStiCS

doe/xv

Major professor

Dennis Giiiiiand

Date December 1, 1997

MS U is an Afﬁrmative Action/Eq ual Opportunity Institution 0- 12771

 

~_’-\

»—-r

_.’ _" ‘ "—‘h ....

V wﬁ -_.— ~_

APPLICATIONS OF PLAY AGAINST PAST STRATEGIES IN
REPETITIONS OF A GAME

By

Lei Chen

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Statistics and Probability

1997

ABSTRACT

APPLICATIONS OF PLAY AGAINST PAST STRATEGIES IN
REPETITIONS OF A GAME

By

Lei Chen

Hannan [17] introduced and studied certain recursive “play against past” strategies
in the repeated play of a game. His results include bounds on the excess of total risk
to player II over an envelope risk, namely, that which results from use of the best
simple strategy based on knowledge of player I’s empirical distribution of past choices.

This thesis investigates the Harman results for ﬁnite games and extends their
applications to certain inﬁnite games. This is accomplished in Chapter 2 by showing
that the Hannan bounds hold for play against random perturbations of the empirical
distribution of player I’s past choices of randomized strategies and by identifying pure
strategies in the inﬁnite games with randomized strategies in a companion ﬁnite game.
Chapter 3 uses this idea to show that the Harman recursive strategies and resulting
bounds have applications in the repeated play of certain allocation and nonadversarial
multi—arm bandit problems.

Another application concerns the repeated play of an expert selection problem.
Chapter 4 contains a review of some of the literature for this problem. In this com-
ponent game, player II selects an expert from a set of experts for a prediction and
suffers loss equal to the prediction error. In repeated play, the enve10pe risk is the
total of the prediction errors by the best individual expert. For a set of n = 2 ex-
perts, Foster and Vohra [13] proposed a recursive strategy and showed that it results
in average prediction error close to that of the best individual expert in the limit.
Chapter 4 shows that their strategy is an example of a Harman strategy and that the
adaptation of Hannan to inﬁnite games establishes a slightly stronger version of their
asymptotic result. Moreover, this approach provides an immediate solution for the

case of n experts.

To J ingjing - my dear husband

iii

ACKNOWLEDGEMENT

I would like to express my sincerest gratitude to my advisor Professor Dennis
Gilliland for his guidance and continuous encouragement in the preparation of this
dissertation. He taught me the art of consulting as well as the art of teaching and
research. I appreciate his patience and willingness to discuss problems at any time.
I could not have ﬁnished this dissertation without his great help.

Special thanks go to Professor James Hannan for the many things I learned from
him, for the conversations we had all over the campus - classroom, ofﬁce, hallway, IM
East, - - -, and of course, for his fundamental research in game theory from which my
work derives. It was amazing that he almost persuaded me to pursue an academic
career even though the chance of getting it is so slim.

I could not imagine the consequence if I forgot to thank Professor James Stapleton.
It is he who called me about my admission in July 1993. I am grateful for the good
advice he gave me throughout my graduate study, and for his encouraging words
about my English which sometimes were so good that it gave me the delusion that
my English is as good as his. A

I would also like to thank Professors J. Hannan, R. Lepage and H. Salehi for
serving on my thesis committee, for carefully reading my thesis, and for their helpful
comments.

I would like to thank all the faculty and staff of the Department of Statistics and
Probability who treated me so well that I am alleged to be the only one who cried
when leaving the Wells Hall.

I am very grateful to my parents - Weiyou Chen and Xinshun Zhao from whom I
learned to work hard and be responsible. Their wish of me getting advanced degrees
has been my inspiration over the years. I am also grateful to my parents-in-law
- Shizhong Wang and Liqiong Peng who kept me out of the kitchen when I was
preparing my dissertation. I owe special thanks to my parents-like host family - Fred
and Charlotte Poston. Their loving and caring always made me feel warm even in
the coolest days of Michigan’s winter.

Words are not enough for me to express how much I owe to my best friend, walking

iv

dictionary, personal trainer, counselor, soul mate - my husband Jing Wang. He is the
one who comforted me and cheered me up when I was unhappy. He is the one who
made me relax when I felt stress. He is the one who made me believe in myself when I
lost conﬁdence. Although I always complained about his frank criticisms, I know from
my heart that he is the one who loves me the most, and he is the one who changed
my world, and he is the one who made me become who I am. This dissertation is

dedicated to him with love.

Contents

1 Introduction

1.1 Introduction ................................

2 Strategies for repeated play of a game

2.1 Preliminaries ...............................
2.1.1 The ﬁnite component game ...................
2.1.2 The repeated game ........................

2.1.3 The k-extended game .......................
2.2 Lemmas ..................................
2.3 Bounds for the modiﬁed regret of Hannan recursive strategies

2.4 An extension to an inﬁnite game ....................

3 On-line allocation model and the multi-armed bandit problem
3.1 Introduction to on-line allocation model ................
3.1.1 Component allocation game ...................
3.1.2 On-line allocation model .....................
3.2 Application to the on—line allocation model ...............
3.3 Introduction to the multi-armed bandit problem ............
3.3.1 Component multi-armed bandit game ..............
3.3.2 The multi-armed bandit problem ...............
3.4 Application to the multi-armed bandit problem ............

vi

00

ONUCﬂU‘

11
16
22

25

25
25
27
32
32
32
34

4 Prediction using expert advice 38

4.1 Introduction ................................ 38
4.2 Literature review ............................. 38
4.3 A proof of Theorem 1 of Foster and Vohra ............... 48
4.4 A generalization to more than two experts ............... 51
4.5 The k-extended prediction strategies .................. 54
4.5.1 Introduction ............................ 54

4.5.2 The k-extended prediction strategies .............. 58
Bibliography 62

vii

List of Figures

3.1
3.2

4.1
4.2

Algorithm H ............................... 28
Algorithm I? ............................... 35
Strategy 0 ................................. 52
Strategy 0" ................................ 59

viii

Chapter 1

Introduction

1.1 Introduction

There is current interest in a variety of problems that involve repetitions of a decision
problem. Examples are on-line allocation problems, multi-armed bandit problems
and expert selection problems. Hannan [17] developed recursive strategies for player
11 in the repeated play of a game. Our concern is the adaptation of the Harman [17]
ﬁnite game results for certain inﬁnite games, including the aforementioned.

Prediction by combining expert advice has extensive applications. According to
Clemen [8]’s review, it has been applied not only to meteorology and economics, but
also to the prediction of social and technological events, football game outcomes,
electrical demand, tourism, political risk and population, etc.. Clemen [8] gave an
extensive survey of methods for combining forecasts. Most of these conventional
methods for combining forecasts involve taking a weighted average of individual fore-
casts and can be viewed within a regression framework. Regression techniques, such
as weighted least squares, robust-weighting techniques, ridge regression, latent root
regression, have been used in combining forecast. Bayesian techniques for includ-
ing prior information in a forecast combination have been studied by Clemen and
Winkler [10], Clemen and Guerard [9]and Anadalingam and Chen [1]. Most of these
approaches require that the probability distribution of the event being forecasted be
speciﬁed.

In computer science, predicting a binary sequence by combining the predictions
of a set of experts has been studied by Cesa-Bianchi et al.[7], Foster [12], Freund [14],
Littlestone and Warmuth [23], Vovk [27] [28] etc.. Haussler, Kivinen and Warmuth [18]
also provided prediction strategies for continuous-valued outcomes using expert advice
for certain classes of loss functions. Their strategies are based on the exponential
weight algorithm introduced by Littlestone and Warmuth [23] and by Vovk [27]. None
of their strategies requires statistical assumptions be made about the decision-maker’s
subjective beliefs regarding the distribution of outcomes. They proved bounds on the
difference between the average loss of their strategies and the average loss of the best
expert. The average loss of their strategies will approach that of the average loss of
the best expert at a certain rate as the length of outcome sequence T goes to 00.

Foster and Vohra [13] studied the problem of choosing a prediction from two
experts. Assume that the prediction loss is bounded and that the decision-maker has
the knowledge of the past losses incurred by the two experts. They constructed a
randomized strategy for the decision-maker and indicated that the difference between
the average loss of the decision-maker and the minimum average loss of the two
experts goes to 0 in probability as the number of trials T goes to 00.

By connecting the Foster and Vohra [13] randomized strategy with game theory
results studied by Harman [17], we verify that the Foster and Vohra [13] randomized
strategy is a special case of Hannan’s strategy for player II in a ﬁnite two-person game,
that is, it plays Bayes versus a randomized perturbation of the arithmetic mean of
player I’s past randomized strategies. The minimum average loss of the two experts
is the Bayes envelope, denoted by ¢(%XT), where X T is the sum of player I’s past
randomized strategies.

We also consider the prediction problem in which there are n experts. Based on
game theory results, we construct randomized strategies for the decision-maker such
that the difference between the average loss of the decision-maker and the average
loss of the best expert goes to 0 in probability as the number of trials T goes to 00.

In the compound decision problem, the extended envelopes, introduced by Johns

[21], is a lower envelope than the simple Bayes envelope. The extended sequence

compound decision problem has been studied by Swain [25], Gilliland and Hannan
[16], Ballard [3], Ballard, Gilliland and Hannan [5] and Ballard and Gilliland [4].
Vardeman [26] studied the k-extended problem in a game theoretic situation. He
constructed randomized strategies with risk approaching the k-extended envelope at
the rate of O(T"1/2).

If the loss of the experts have some dependency, then we can use Vardeman [26]’s
idea to construct k-extended randomized strategies for the decision-maker such that

the average loss of the decision-maker approaches a lower enve10pe.

1.2 Summary

This dissertation connects game theory, the on-line allocation model, the multi-armed
bandit problem and the on-line prediction problem using expert advice. It is organized
as follows.

In Chapter 2, following Hannan [17], we present notations, useful lemmas and the-
orems for a ﬁnite two-person game. We also extend the results of recursive strategies
to an inﬁnite game.

In Chapter 3, we introduce two algorithms, Algorithm H and Algorithm H, based
on Theorem 2.3.2. In the on-line allocation model, if the allocation agent uses Al-
gorithm H, then the difference between the expected average loss of the allocation
agent and the average loss of the best strategy converges to 0 at the rate of 0(T“1/2).
In the multi-armed bandit problem if the player applies Algorithm H, then the differ-
ence between the expected average reward of the player and the expected maximum
average reward of any arm in a sequence of T trials will go to 0 at the rate of 0(T‘1/2).

In Chapter 4, Sections 4.2 and 4.5.1 contain a general review of the most relevant
literature for the on-line prediction problem using expert advice and the k-extended
idea. In Section 4.3, we present a proof of Theorem 1 of Foster and Vohra [13] using
Theorem 2.3.3. In Section 4.4, Theorem 2.3.4 is used to generalize the problem studied
by Foster and Vohra [13] to the case in which the choice is among n experts. Without
any statistical assumption about the distribution of the sequence being predicted,

if the prediction loss is bounded, then we can construct a randomized strategy for
the decision-maker such that the difference between the average loss of the decision-
maker and the average loss of the best expert converges to 0 in probability at the
rate of 0(T'1/2). In Section 4.5, we use the Vardeman [26] technique to construct
k-extended prediction strategies when the decision-maker has the knowledge of the
predictions made by n experts. If the loss of the experts takes values in a ﬁnite set
with cardinality q, then using the k-extended prediction strategy the average loss of
the decision-maker will approach the k-extended envelope in probability at the rate

of 0(q"("‘1)T'1/2) uniformly in outcome sequences.

Chapter 2

Strategies for repeated play of a

game

2.1 Preliminaries

2.1.1 The ﬁnite component game

Consider a ﬁnite two-person game in which players I and II have, respectively, m and

11. pure strategies. Their spaces of randomized strategies are denoted by X and Y,

m
X = {x = (1:1,x2,...,:rm)|:r,- 2 0,223; = 1},
1

Y= {y= (y1.y2.--.,yn)ly.- 2 0,2!“ = 1},
I

and their pure strategies are represented by base vectors 6 and 6 in X and Y, that

is, the degenerate probability distributions.
Notations. For m-vectors, we use juxtaposition to indicate inner product, that is,

ab = 2’," a,b;, and we deﬁne the norms

m
Ial = mgxlaal and Hall = Slat-l-
1

Player II’s inutility is denoted by a loss matrix A,

011 “113 A1
A: g 5 = ; =(A1,...,An),
aml amn Am

It is assumed throughout the thesis that A has no dominant column, i.e., for each j
meax(eAj — mrin eA') > 0, (2.1)

and no duplicate and dominated columns.
Notations. Let
A"'=A"—A',1Sq<r5n.

The expectation of the loss when player I or player II uses a randomized strategy
is R(:r, y) = :rAy, and we will refer to it as risk. From the risk point of view, this
game is identical with the s-game in which player II’s pure strategies are m-vectors
in the set of columns of A, {A1, . . . , A"}, and player II’s randomized strategies 3 are

m-vectors in the convex hull of the columns of A,
s E S = {Ayly 6 Y}.

For each a: E X, the vector inner product ms is the Bayes risk of 3 against 2:. The

Bayes envelope is deﬁned as

¢(x)= minws=jn11in :cA’, xEX.
" ’n

A pure strategy valued Bayes response is a function a(-) on X into {A1, . . . , A"}

where for each x,
“(93) = 43(93)-

Note that ¢(-) is uniquely deﬁned and 0(o) is not unless, for each 2:, 2:0 is minimized
by a unique 0. It is useful to extend the domain of ¢(-) and any given a(-) to the

nonnegative orthant of m-space as follows. For w e [0, 00)”, let

¢(w)= minws= min wAj,
J'= —--1. .n

and 0(0) = arbitrary and
0(w) = 0(W/llwll) ifw 9* 0-

It is follows that a(-) is positive homogeneous of order zero, that is, 0(kw) = 0(w)
for all w e [0, oo)m and constants k > 0.

Notations. In all of the following, a(-) will denote a pure strategy valued, positive,
homogeneous Bayes response deﬁned on [0, 00)”; also, following Hannan [17, p. 102],

we let

IBI = sup |0($) - 0($’)I = Hgax lAq’l, (2-2)
z,z’ ,r

where the last equality follows from the fact that A has no dominated columns.
A strategy 3 e S may be evaluated for each 2: E X in terms of the additional risk
above the Bayes envelope risk ¢(:r), that is 2:3 — ¢(a:). This excess is called the regret

ofs at 3:.

2.1.2 The repeated game

This section considers repeated play of the component game and the evaluation of
recursive strategies in terms of their excess total risk over 42 evaluated at the (non-
normalized) empirical distribution of player I’s choices.

Suppose the component game is played repeatedly with e‘ and 0‘ denoting the
choices of pure strategies by player I and II, respectively, at time t = l, 2, ..., T. We

suppress the display of dependence on T in writing

1

g=(e ,...,€T) and g=(al,...,aT).

The total loss across the ﬁrst T plays of the game is
T

RT(§, g_) = Z e‘a‘.
1

If player II uses the same component strategy a at each time t, that is, g_ = (a,

. . . , a), then we write 9; = UT and note that

T
RT(§, 0T) = 2 6‘0 = ETa
1

where ET = 2:," e‘. (Harman [17, p. 107] calls such g power strategies.) Note that
12%. o") 2 ¢(ET)

with equality if a = 0(ET). We will sometimes refer to the sum ET as the empirical
distribution of player I’s choices of pure strategies through time T.

The envelope risk ¢(ET) is the minimum total risk to player II resulting from the
use of a power strategy across the ﬁrst T plays of the game. If player II knows ET in
advance, then the power strategy with kernel 0(ET) achieves the envelope risk.

Hannan [17] refers to the excess risk
RT(§,9_') - MET)

as the modiﬁed regret of g at 3:. He develops bounds for the modiﬁed regret of
recursive strategies g, with 0‘ a function of E"'1 and possible randomization. These
bounds are 0(T1/2) uniformly in g for ﬁnite component games.

Since we make extensive use of the Hannan [17] notations and results, it is conve-
nient to state some of these for future reference. We let H ‘, t = 0, 1,. . . be a sequence

of numbers satisfying,

0<H°5H1$H25m z 2 2w. (2.3)

H1 H 2 H 3
T T '5'
Let p be a probability measure on the unit m-cube [0, 1]”. z ~ p provides the arti-
ﬁcial randomization in the Harman recursive strategies and is such that the induced
distribution on

qu" = 2(A" — A')

satisﬁes a Lipschitz condition. Speciﬁcally, there exists a constant L > 0 such that
Miz i t1 S qur/lAqr] S t2} S “‘2 - t1) f0? all Q < 7', t1 S t2- (2.4)

Also 9 := Eullzll. When u is the uniform distribution on [0,1]”, 0 = m/2 and we
take L = 1.

Hannan [17, Theorem 3] showed that with the component game deﬁned earlier, if

(2.3) and (2.4) hold, then for each T and g,

T
—HT9|B| g :c‘Epa(E“1+H"lz)—¢(ET) (2.5)
l
2 T 2 T
< T I‘— ___ .
_ H 0lB|+L4(Zl: H, HT)IBI, (26)

where [B] is deﬁned in (2.2).
Denote the right hand side of (2.6) by UT. For ﬁxed p, Harman [17, Section 7]
showed that when T is known in advance to player II, then

H‘ = (Luz/40)1/2T1/2, t = 1,. . .,T,

minimizes supT T‘1/2UT. When T is unknown to player 11 and u is the uniform

distribution, then Hannan [17, Theorem 4] proved that for all T and _e_,
Ht = (3n2/2m)1/2t1/2, t = 1,2,. ..,

minimizes supT T‘WUT, and that

T 2 _
—(/3n2mT/8|B[ S gc‘Epa(Et‘l+(]§E-%n—llz)—¢(ET) S ‘/3n2mT/2|B|, (2.7)

Hannan [17, Theorem 6] produced bounds which are a uniform improvement of those

in (2.7). He showed that

—\/1.5mT|B| g ie‘EudEt'l + ([6—(t—ﬂE—llz) - (MET) S V6mT|B|. (2.8)

2.1.3 The k-extended game

Consider the above two-person game. Player I has m pure strategies and player II
has n pure strategies and the loss matrix A satisﬁes (2.1). Let X and Y be the set
of player I and 11’s randomized strategies, respectively.

Denote S = {Ay I y 6 Y}. For a: e X and s 6 S, ms is the risk of player II
choosing strategy 3 against 2:.

Suppose that this game occurs repeatedly. At each time t, let 1“ represent player

I’s move, and 2:1, ..., 1:“1 are known to player II before he makes his move at time t.

10

Let s = (31, 32, . . . , 3T) be such that s, is a function from X"1 into S. For x“1 =

(2:1, . . . ,T“1), the risk of sequence strategy 8 is
T
Z :r‘st(x“l). (2.9)
i=1
When 8 = (s, s, . . . , s) for a ﬁxed 3 6 S, the risk (2.9) becomes
T T
222‘s = (Z T‘)s = XTs,
t=l t=1

where X T is the sum of the randomized strategies of player I through time T. The

deﬁnitions of Bayes response and Bayes envelope imply that
¢(XT) = XT0(XT).
Let S" be a set of functions from X"‘1 into S. Let
tt-l t—k-i-l , xt—l).

x =(a: ,...

If we consider 8 = (s‘, s‘, . . . , s‘) for a ﬁxed 3" 6 S’, then the risk (2.9) for such an 3

reduces to 23;, :r‘s‘(x“'l). We term

a k-extended envelope.

We use Vardeman [26] ’5 Pk notation to deﬁne set S C [0, 00)“. of the form
S = {59' 6 R"‘k|(:r""°+1 <8) 32"“2 ® ~-® x‘)§ = T‘s*(x‘t'1),for some 3" E S“ },
where mk-vector TH‘“ ® Tt‘k” ® -- - ® 3‘ is given by
((a.-“"+1®a:“"+2 a ~~-®x‘),-,,,,,-,, 31-6 {1,...,m}, i: 1, . ..,k),
and

i:
t—k+l t—k-i-2 t _ t-k-i-t'
(x m 69 mm )1.....-. — I122... -
i=1

Let 6(a)) denote a positive homogeneous minimizer of ms. Denote

T
X; = Ext-k'i'l ® xt—k+2 ® _ . . ® xi.
t=1

11

Then

T
min T‘s‘(x
s‘ES‘ t=1

“-1)

T
= mip 2(xt-k'i'l ® $t_k+2 ® . . . ® $t)§
565 i=1

= mip X5
565

= X;5(X;.).
For an E Rmk, deﬁne ¢"(w) = w6(w). Thus the Ire-extended envelope is

¢’°(X%) = X%5(X4‘~)-

2.2 Lemmas

For later application, we determine the behavior of the Harman recursive strategy
when player I’s choices are randomized strategies :1.“ and 0(E"1 + H t‘lz) is replaced

by a(X"l + Ht'lz), where X0 = 0 and

For this purpose we reinterpret lemmas from Hannan [17] with 6‘ replaced by 2‘.

Remark 2.2.1 Lemma 2.2.1 below will be an important tool in the proof of Theorems
2.3.1 and 2.3.3. Hannan [17] used it in the proof of (2.5) and (2.6).

Lemma 2.2.1 Let v‘,t = 1,. . . ,T be any sequence of m-vectors. It follows that
T T T
23:”: = XTvT-i—l + Ext—1(1): _ 11‘“) + 21,10} _ 01H).
1 1 1
Proof.
T T
XTvT+l + Ext-1h} _ vt+l) + thh)‘ _ vt-i-l)
1 1
T

T T T
= XTUT+1 + Ext—lvt _ Ext-lvH-l + Zztvt _ thvH-l
l 1 1 l

12

T T T

Xtvt-i-l _ ZXt—IUHI + Zztvt _ Extvt+l
1 1 1

T

T
+ Z xtvt _ Z xtvt+l
1

l

n

“M“! “M“! “M”
H
s:
i

a
('5
e6

Cl

Remark 2.2.2 Lemma 2. 2.2 below will be used in the proof of Theorems 2. 3.1, 2.3.2
and 2. 3.3.

Lemma 2.2.2 If (2.3) holds, then
Xt'l Xt t - l t — 2

 

 

“HEY-‘HTHSF‘HT
Proof.
XH Xt m x:-1 X,.H+x:
"—m-‘zll = XI t-1——t——[
H H g, H H
m 1 1 T‘-
_ t-l i
- 2"“ (F'H’tl'iﬁ
1 1 1
S (t-1)('ﬁﬁ-§:)+F
< t—l_t—2
— Ht-l Ht

E]

Remark 2.2.3 The proofs of Lemmas 2.2.3 and 2.2.4 that follow are reinterpreta-
tions of (6.8)-(6.10) of Harman [17]. He proved the results when player I uses pure

strategies g under the assumption (without loss of generality) that for any 6

¢(e) = jeiiliiin} eAj = 0. (2.10)

Lemma 2.2.3 Suppose that the sequence H t is positive and nondecreasing and that

0‘ is the recursive strategy

0‘ := (7(X‘"1 + Ht'lz), t = 1,2, . . .. (2.11)

13
It follows that
T
XXV - at“) 2 -HTllzlllBl.
1
where IE] is deﬁned in (2.2).
Proof. From the deﬁnition of a‘, we have for each t
(X‘ + H‘z)(o‘ — at“) 2 0.
Then
xt(0.t _ Ut+l) 2 _th(0.t _ 0“”).

So

ixt(at_ 0.)t+1
l
T

T
2: H‘zo‘+1 — Z H‘zot
1

I

IV

T-1 T
HTZUT+1 + z: that-i-l __ thZUt
1 1
T T
= HTon+1 + 2Ht—lzo‘ — ZH‘zo‘
2 1
T
= HTZUT+1 _ H0201 _ 2(Ht _ Ht—l)za,t
1
= HTz(0T+l — a0) — H0210}l — a0) —ZT:(Ht— Ht 1) 2(0 — a0)

where a0 = (a9, . . . ,a?,,), and for each i

a9: min a,,-, (2.12)

j€{l.~ ..n}

where ajj is the element of loss matrix A.

Since for any t, 0‘ - a0 2 0, H ‘ 2 0 and H ‘ is non-decreasing, then we have
HTz(aT+1 — a0) 2 0.

0 S Hot/.(o1 - a0) 5 H°||z|||B|.
0 S (Ht - H"1)Z(0' - 0o) S (H‘ - Ht-INIZIIIBI-

14

Therefore

T T
ZX‘(0‘-0‘“) Z -H°IIZIIIBI-Z(H‘-H"1)IIZIIIB|
l 1

= —HT||z|||B|.
:1

Lemma 2.2.4 Suppose that the sequence H t is positive and nondecreasing and (2.11)

holds. It follows that
xTaT“ — ¢<XT> + )T:X‘-1(o*— W) s HTllzlllBl.
1
where [B] is deﬁned in (2.2).
Proof. Since (X T + HTZ)[0T+1 — 0(XT)] S 0, then
XTaT“ — ¢(XT) 5 —HT z[aT+1 — 0(XT)]. (2.13)
It follows from (2.11) that
(X‘-1 + Ht-lz)(ot — at“) S 0.

Then

Xt—l(0.t _ 0"”) S _Ht—lz(at _ at-i-l).

So

T
HTon+1 — Hozo1 — EXHt — H“‘)za‘+l. (2.14)
1

15

It follows from (2.13) and (2.14) that

7‘
XTUT+1 _ ¢(XT) + Ext—1(at _ (TH-l)
l

T
S ,HTzo(XT) — H0201 — 2(H‘ — H‘_l)zot+l
1

= HTz[o(XT) - a0] — H0207l — a0) Z—(H‘— Ht 1 .z)(a‘+1 — a0)
3 HTIIzlllBI-

where a° is deﬁned in (2.12). The second last inequality follows from the deﬁnitions
of H‘, a0 and |B|.

El

Remark 2.2.4 Lemma 2.2.5 below is a slight modiﬁcation of Lemma 1 of Hannan
[17]. It will be used to prove Theorem 2.3.3.

Lemma 2.2.5 If w and w' are m-vectors and (2.4) holds, then
Eu|o(w + z) -— 0(w' + 2)] _<_ L-gilBlllw' —- w||,
where [B] is deﬁned in (2.2).
Proof. Let T1), = {zlo(w + z) = Aj,o(w’ + z) = A"}, and note that

Eulo(w+z) -a(w’+z)| = ZIA""In(T

#1:
= Z [Ajkill‘(7:ik) + #(Tkjll-
j<k
Note that
Z IA”‘|_ < WI 21 s ﬁlB (2.15)
j<k j<k 2|
Hannan [17, p. 118] showed that
Hanna.) 5 Lllw-w’ll- (2.16)

Therefore, it follows from (2.15) and (2.16) that
2
E”|o(w + z) — 0(w' + 2)] s Ll‘é-lsmw — w'l].

This completes the proof of the lemma.

16
El

Remark 2.2.5 Vardeman [26] proved the following lemma in the case in which player
I only takes pure strategies. Lemma 2.2.6 below shows that Vardeman’s lemma holds
for any ﬁnite set of player I ’s randomized strategies. This lemma gives a decompo-
sition of (13"(X4‘i). By Lemma 2.2.6, we can modify the solutions of the uneztended
problem to produce solutions of the k-eztended problem.

Lemma 2. 2. 6 Let 36"“: --(.7:t ",+1...,T"1), ¢"(Xf~) be the k-eztended Bayes enve-
lope. If the cardinality of X is ﬁnite, then
A T

¢"(Xt~)= 23 ¢( 2 x‘).

:rEX‘"l t:x“"’1=z

Proof.
T T
Zzts( "( x“ 1) =2 ( 2 x‘)s’(:c). (2.17)
t=l TEX" —1 t:x""=::

Equation (2.17) is minimal if 3"(12) = 0(2):}..-1” x‘) for each x E X""1.

Therefore T
=2 ¢( 2 96‘)-
—"26X“1t:x“‘l=x

This completes the proof of the lemma.

2.3 Bounds for the modiﬁed regret of Hannan re-
cursive strategies

In this section, we prove results for the recursive strategy (2.11) that are extensions
of those of Hannan summarized in (2.5) - (2.8). The extensions are from (1 sequences

to 3; sequences.

Remark 2.3.1 Theorem 2.3.1 below extends the results of Hannan [17, Theorem 3]

as were summarized in (2. 5) and (2.6).

17

Theorem 2.3.1 Suppose (2.3) and (2.4) hold. Then for all T and T,

T 2 T
—HT0|B| s E, Escort-1+ HHz) — ¢(xT) g HT0|B| + Lift—(Z % — 32w-
1 l

where 0 = Eyllz“, and [BI is deﬁned in (2.2).

Proof. By Lemma 2.2.1, we have that

T T T
222:0" _ ¢(XT) = [XTO’T+1 _ ¢(XT)] + Ext—1(at_ at-i-l) + Extojt _ Ot-i-l)
l l l
:= 51 + $2 + 53.

Lemma 2.2.3 implies that
$2 + 53 Z -HT||2|||BI- (2-18)

(2.18) and X7707”1 — ¢(XT)) Z 0 imply that
51+ 52 + $3 2 —HT||z|||B|. (2.19)

By Lemma 2.2.4, we have
51+ 52 S HTllzlllBlo (220)

So it follows from (2.19) and (2.20) that
—HT||z|||B| 3 51+ 52 + 53 S HTIIZIIIBI + ix‘w‘ — at“). (2.21)
1
Taking expectation with respect to u on (2.21), we have
—HT0|B| g Ep(Sl + 5'2 + 53) _<_ HT0|B| + E“ iaﬂo‘ — at“). (2.22)
1

where 0 = Eu||z||.
Applying Lemma 1 of Hannan [17, p. 131] with w = )(“1/H“l and w’ = X‘/H‘,

 

we have
n2 xt—l xi
[EACH-0‘44” S leBlll'ﬁﬁ-Eﬁ”
n2 t—l t-2
S LT'BKHe-I‘F'

The last inequality follows from Lemma 2.2.2.

18

Thus
T
191121370” 0”1) S 2 ||$‘|||E(0 -<7t+1)|
l
— t — 2
S L—IBI 23?: - 717)
T
g L—|B|(Z;, — — HT‘ . (2.23)

Therefore, (2.22) and (2.23) imply that

T

—HT0|B| g EuZT‘o(X‘"l+H“lz) —¢(XT)
n2 T

< HT6|B|+LnT (T127: 7HT)|B|.

1
Cl

Remark 2.3.2 Theorem 2.3.2 below extends results of Hannan [17, Theorem 6] as
were summarized in (2.8). From the proof of Theorem 2.3.2, we see that the results
of Hannan [17, Theorem 4] as summarized in (2. 7) hold with 5‘ replaced by x‘.

Theorem 2.3.2 If p is the uniform distribution, then, for all T and g,
T —
—\/1.5mT|B| S E” Zx‘dXt-l +‘]§-(-§-1-7-1—-1-)-z)- ¢(XT) S VﬁmTIBI.
t=1
where [B] is deﬁned in (2.2).

Proof. Applying Lemma 2 of Hannan [17] with w = X “1 /H "‘1 and w’ = X t/ H ‘,
we have
X ‘-

 

xt
7;”
t—ll t—2
s IBI(H._1- H. 1.

where the last inequality follows from Lemma 2.2.2. Thus

T T
E,. Z z‘(o‘ - 0"") Z llx‘lllEu(0‘ - 0‘“)l
1

IE»(0‘-0‘“)| S IBIIIH

 

|/\

-2

 

s IBIZ(H—.-_i H.)
T

s um; 73—,- - my (2.24)

19

Therefore, (2.22) and (2.24) imply that

T
—HT6|B| 5 Eu Za."a(X"l + Ht'lz) — ¢(XT)
1

T

2

< H T0 B —— — — .
_ | 1+ (21: H, H.118!
Hannan [17, Section 7] showed how to minimize

T 2 T
—l/2 T _ _ _
sng [H 9|B|+(§1:H, HT)IB|]

by choice of the H ‘ satisfying (2.3). Hannan [17, (7.7)] gives
supT 1’2IHT0lBI + (2H, — — —)IB|]_ > (73181

with the lower bound obtained for

H‘= g1=1,2,....
m

[3

Remark 2.3.3 Theorem 2. 3.3 below is a modiﬁcation of Theorem 3 of Hannan [17].
In Theorem 2.3.3 the Eu expectation is outside the absolute value rather than inside

it.

Theorem 2.3.3 Suppose that H ‘ satisﬁes (2.3) and (2.4) holds. Then for all T and

£1
T t t 1 t 1 T T n2 T 2
- _ _ < _ _ _ _ .
EHIZI:TU(X +H 2) ¢(X )| _ H 0|B| +L 2 (g H‘ HT)|B|
where 0 = EHIIzll, and [BI is deﬁned in (2.2).
Proof. By Lemma 2.2.1, we start with
T T
ZTW —¢(XT) = [XTUTJ'l — ¢(XT)] + ZX"I(U‘ — 0“”) + 219(0‘ — at“)
1 1

I: 31 + $2 + 53.
It follows from (2.19) and (2.20) that

T
[51+ 52 + S3] S HT||z|||B| + IZT‘(0‘ — ot+1)|. (2.25)
1

20

Taking expectation on both sides of (2.25), we get
T
EH|31+ S; + Sgl g HT0|B| + E,,|Z::1:’(at — a‘+1)|. (2.26)
1

Applying Lemma 2.2.5 with w = )("1/2H"l and w’ = X’/H’,

xt-l xt

Elliot_ at+li < Ln— :iBiilHt_1 ...-IF”.

It follows from Lemma 2.2.2 that

T T
Eulzz’(0’-0’“)l S Ellz‘llEul0’-0’+ll
l l

n2 T t—l t—2
S L-H-IBIZQﬁZT—TIT)

TT)
5 L—IBI(ZT:— H, — — (2.27)
Therefore, Theorem 2.3.3 follows from (2.26) and (2.27).

El

Remark 2.3.4 Theorem 2. 3.4 below is a modiﬁcation of Theorem 4 of Hannan [17].
Similarly we can show that for a ﬁxed u, if T is known to player II, then

Ht = (n2/20)’/2T’/2, t = 1, . . . ,T,
minimizes supT T'1/"’UT (H).
Theorem 2.3.4 If p is the uniform distribution, then, for all T and T,
Es firm/X” + «mfg—1):) — ¢(XT)| 5 «W181.
1
where [B] is deﬁned in (2.2).

Proof. Since [1 is the uniform distribution, the result of Theorem 2.3.3 becomes

 

T
EHI ZT’0(X"’ + H"’z) — ¢(XT)|<
1

Let T
U =+HTmIBI 712(22
T — _

 

21

where E = (H 1, . . . , HT). Suppose T is unknown to player II, we want to obtain an
11 minimizing
sup T’l/zUTLIi)
T
Harman [17, Section 7] showed that

sup T’l/zUT(_I_I_) 2 (3n2m)‘/"’|B| (2.28)
T

and this lower bound is obtained for £11 with

2
H‘=‘/§—"—3 t=1,2,--- (2.29)
m

Therefore, (2.28) and (2.29) imply the result of Theorem 2.3.4.

C]

Remark 2.3.5 Theorem 2.3.5 below will be used to construct k-eztended predicting
strategies in Chapter 4.

Theorem 2.3.5 Let p be the uniform distribution. Assume that the cardinality of X

is d, and d < oo. Denote

 

3n2(T(x*“1) —- 1)

 

at = 0(X,"_1|x"‘—l +(/ z),

where
t—l

th...1|x“_l = 2 $1"

and

t-l
T(x‘t—l) = Z [{xtj-lzxot—l},

i=1

Then for all T and g,

 

T
Eu] Zz‘a, — ¢’=(X;.)| g 1/3n2mdk-1TIBI,
l

where IE] is deﬁned in (2.2).

22

Proof. Lemma 2.2.6 implies that

21? 0. <15" XT) = Z [( 2 x‘a¢)—¢(X§‘~|x)]

zexl‘“ t:x""l=z

:= 2 11(3).

:1:€X“"l

Denote the indices t for which x“‘1 = 2: by t; < t2 < . .. < tux). Then

= 2 z“a(X,':_1|a: +

i=1

T(x) 3 2 . -l
3&4» - «man.

By Theorem 2.3.4, we have that

E#|A(2:)| 5 ‘/3n2mT(:L')|B|.

Applying Schwarz inequality,

Eﬂlzxat- QS"( (XT)| S 2 ‘/3n2mT(2:)|B|

3610'"1

g \/3n2md"-1T|B|.

Hence the desired result follows.

2.4 An extension to an inﬁnite game

In this section we adapt the Harman recursive strategies to produce strategies for

the repeated play of an inﬁnite component game. The inﬁnite component game

is general enough to cover the on—line allocation, multi-armed bandit and expert

selection problems as will be shown in Chapters 3 and 4.

Consider an inﬁnite component game where player I chooses a pure strategy :3:

from

= [0,1]"

and player II chooses a pure strategy 37 from

Y={e1,...,e,,},

23

the set of standard base vectors in n-space. Suppose that player II’s inutility is given
by the loss function

L(2,g) = 23;, 2 e x, g e 1?, (2.30)
where we have used the juxtaposition of vectors in n-space to indicate the ordinary

inner product.

Theorem 2.4.1 There exists a ﬁnite 2”Xn game with loss matrix A and a one to one
mapping f from 2? into X, the (2" — 1)-dimensional simplex of randomized strategies
for player I in the ﬁnite game, such that

f(2)AJ' = 2e,- = 2,, 2 e x, j = 1, . . ,.n (231)

Proof. We see that
f(5;)AJ' = 2,, j =1,...,n

if the f (:2) mixture of the rows of A, f (5:)A, is equal to :3. Thus, if the convex hull
of the rows of A is equal to X = [0, 1]", then (2.31) is satisﬁed with f, any mapping
where f (5:) is a mixing distribution of the rows that gives mixture 5:. The minimum
generating set for [0,1]" is the set of 2” vertices of this cube. We take A to be any
matrix whose row vectors are the 2" vertices and f as described above. This A has
no dominant columns since the left hand side of (2.1) is 1. It has no duplicate or

dominated columns.

C]

The inﬁnite game it = [0, 1]“, Y 2 {e1, . . . , en} and L(§:, g) = 237 has randomized
strategies

5? = set of all probability distributions on X

and
i” = Y = (n — 1) — dimensional simplex of probability distribution on n points.

Since the loss function is linear in 5: and X is convex, the risks from randomized

strategies 23“ are the same as those from the means. Thus, the extension of the

24

inﬁnite game through its randomized strategies remains ”isomorphic” to the restricted
extension of the ﬁnite game X, Y = Y‘, A, the restriction being X replaced by f [X]
For any if E X‘ and gr 6 ?‘ there exists an a: 6 X, namely, :1: = f (mean of cit‘) and
a y E Y, namely, y = 37' such that L(a‘:‘, 37‘ = :rAy. Moreover, the Bayes envelope

risk in the ﬁnite game at a: = f (it) is given by
45(23) = min 571',
J

the Bayes envelope in the inﬁnite game at :i‘.
Suppose the inﬁnite component game is played repeatedly with :E‘ and gt denoting
the choices of strategies by player I and II, respectively, at time t = 1, ..., T. We

suppress the display of dependence on T in writing

=(21,...,2T) and g=(gl,...,*").

IR)

Remark 2.4.1 In the repeated play of the inﬁnite component game, suppose that

player II uses a Hannan recursive strategy
g‘ = e, if 0(X“1 +Ht‘lz) = A’, j = 1,2,...,

where a is a positive homogeneous pure strategy valued Bayes response in the ﬁnite

game described in Theorem 2.4.1, and
t
X0 :0, X‘ = Zx‘, 2‘ =f(:i:‘) t= 1,2,....
1
Then the modiﬁed regret in the repeated play of the inﬁnite game at i is
T T T
zs‘g‘ — mjin: ‘3. = Zx‘dXt’l + Ht‘lz) — ¢(XT),
t=l t=l t=l

that is, it is the same as the modiﬁed regret in the ﬁnite game at 3;. Hence the results
of Theorem 2. 3.1-2. 3.5 cover repeated play of the inﬁnite component game. Of course

m = 2" and |B| = 1 in this adaptation.

Chapter 3

On-line allocation model and the

multi-armed bandit problem

3.1 Introduction to on-line allocation model

3.1.1 Component allocation game

We consider the following component allocation game in which player I selects a loss
vector l 6 [0,1]" and player II selects a probability distribution p on n points with
loss L, = 2'; l,-p,-. We recognize this as an example where player I’s pure strategies

are in the cube
x = [0,1]"

and player II’s pure strategies are in the ﬁnite set
Y = {e1,...,e,,}.

By Theorem 2.4.1, there is a ﬁnite game isomorphic to the component allocation

game.

3.1.2 On—line allocation model

Suppose the component allocation game is played repeatedly with It and p‘ denoting
the choices of strategies by player I and II, respectively, at time t = 1, ..., T. This

25

26

repeated game was studied by Freund and Schapire [15]. They called this game the
on-line allocation model.

Freund and Schapire [15] formalized the on—line allocation model as follows. The
allocation agent A has n strategies to choose from. At each time t = 1, 2, .. ., T, the
allocator A choose a probability distribution

19‘ = ‘1.p3,---.p$.)

over the 12 strategies, where pf is the probability that strategy 2' will be chosen, for

each i and t. Each strategy 2' suffers some loss I]. Denote loss vector It by

= (1], (2,“ 1;).

The loss suffered by A at time t is deﬁned as

217‘”—

that is, the expected loss of the strategies with respect to A’s chosen allocation rule.

Denote the expected total loss of A across the ﬁrst T trials by
T
E [L5] = Zp‘l‘.
t=l
The goal of A is to minimize
E[L§ ] — lmi<n "21‘.
—- ~" t: 1
So A will try to perform as well as the best strategy among these it strategies.
Assume that the loss suffered by any strategy is bounded, so without loss of
generality, let If 6 [0,1], for each i and t. Also there is no statistical assumption
made about the loss vector 1‘. Reund and Schapire [15] showed that Littlestone and
Warmuth [23]’s Weighted Majority algorithm can be generalized to handle the on-line

allocation problem. They constructed an algorithm, called H edge(ﬂ), such that at
each time t, H edge(ﬂ) chooses the probability distribution vector

p‘ — -—..—wt
_ v
i=1 wi

27

where w‘ = (w{, wé, . . . , wf,) such that for each i and parameter ﬂ 6 [0, 1] ,
1 1 1+1 t 19
wi = E, “I" = wiﬂ ' .

So at each time t, after the loss vector l“1 is received, the H edge(ﬂ) will choose the
probability distribution vector p‘ by updating w‘.

If ,6 is chosen as a function of L and n, Freund and Schapire [15, Lemma 4] showed

that T
T - "
ElLHedge(,6)] S lrglsnngl: + 2Llnn + lnn (3.1)
holds for all sequence of loss vectors l1, ..., IT which satisﬁes

T
. t "
m1 . <
lgignngl' " L

Dividing both sides of (3.1) by T, we obtain an upper bound for the difference
between the expected average loss of H edgew) and the average loss suffered by the
best strategy.

 

 

ElLerdge<ﬁ)]< m2+ l‘+ W + .1112,
T "<T1Eli<<nn, T
That is, the average loss of H edge(ﬂ) approaches the average loss suffered by the best
strategy as T —) 00.
Note that, to obtain (3.1), the parameter ,6 in H edge(ﬂ) depends on L. In the

next section, we will give a solution of the on-line allocation model based on game

theory results when L is not available,

3.2 Application to the on-line allocation model
Our algorithm H for the on-line allocation model is described in Figure 3.1.

Lemma 3.2.1 For any x 6 [0,1]", let f (x) (f (x) ,f(x)2n), where

n

f(ﬂa' =Hll$jav + (1 - 11W" 02)],

and a;,- is the element of matrix A deﬁned in Figure 3.1. Then f is a one to one

mapping onto its range satisfying Theorem 2.4.1.

28

Figure 3.1: Algorithm H

 

Algorithm H
Choose initial probability vector p1

Repeat for t = 2, 3,...
1. Choose allocation p‘, such that

p;- = viz—fa;- - 1:) _<_ [99,3222 — A1), W},

where z = (21,” .,Zzn), 21, ..., 22.. are i.i.d U(O, 1) under 11. And A1, . . .,A"

are the columns of matrix A.

011 "° aln A1
_ . . _ . _ l n
A _— z o o o : - z — (A ,0 I o , A ),
02M °'° 02% A2»

and A1, . . . , A2,. are the distinct sequences from W.

W = {w" I w" = (w1,. . .,wn), w.- E {0, 1},Vi}.

2. Receive loss vector 1‘ E [0, 1]".

 

Proof. For ﬁxed j, let A = {i : a,-J- = 1}, then

2”
f (5)241 = 221:0.)

= in: ﬁn”... + (1 - 5:1.)(1 - aikllaij
i=l h=l
= “j Z Hlikaik + (1 - i‘k)(1 - Gael]
ieA k¢j

)

j 2 II gm. (3.2)

wn-le{o,1}"'1 #1

where

2 {L‘k if w, = 1
mm. = A I
1 - 3k If wk = O,

and (3.2) follows from the deﬁnition of a,,,.

Cl

Theorem 3.2.1 The modiﬁed regret for algorithm H in the on-line allocation prob-

lem satisﬁes

-\/2_'"—1\/3T_<_E[LE ]-lminn Zl;<\/2“—+1\/—,
for all T and all loss sequences 1, where E[L§ ]— - 2T p‘l‘.

Proof. Consider the allocation game described in Section 3.1. By Theorem 2.4.1,
there is a ﬁnite game isomorphic to this allocation game. Take f deﬁned in Lemma
3.2.1 to be the one to one mapping in Theorem 2.4.1.

In the allocation game, at each time t, suppose player I chooses a strategy it“ such

that
it = (li,...,l,‘,),

and player II’s strategy is a random vector 9‘ taking value from a set

Y = {81,...,€n}
such that
I? = e]
if
o,(xt—1+6__(___t2:1)z)__ Aj

in the corresponding ﬁnite game, where z and Al are deﬁned in Figure 3.1,

t
X°=O, X‘=Zx’, xt=f(x‘).
1

30

It follows from Lemma 3.2.1 and Theorem 2.4.1 that the modiﬁed regret of the

allocation game is the same as that of the ﬁnite game, i.e.,

[:]~3
EMS

6
II

n
El‘q’l“! ﬂ

6(t— 1)

Mt(xtl
2"

Z)- ¢(XT)
It follows from Theorem 2.3.2 that

T —
—\/2"-13T g E, :x‘aw-l + «Egan—1)» — ¢(XT) g \/2n+13T.
t=1

By the deﬁnition of pt and 37‘, we have for any 3' = 1, ..., n,

”{3}: = 81'}
= p{c7(X"'l + mz) = Aj}
= an: 1(a) ﬂit—2:31 zIAJ' < [Zf(x’) +———1—,:’21A‘ Vi}

It follows from

 

that
#{gt — 81}
‘4 , , 6t—1 . . _
= ”{zaj — l,) S (2“ )z(A' - .41), V2}
=1
= 1;;
Thus
T T T
EHZ§J‘Q‘—min2“; = ilk) g—mingl;
t=1 J t=1 j=1

= E[L§ ]—min2l‘-

and the result follows.

31

Remark 3.2.1 Compared with Hedgem), Algorithm H not only gives an upper bound
for Lg — minl§$n 23;, l: with the rate of 0(T1/2), but also gives a lower bound with
the same rate. At each time t, algorithm H determines the probability distribution

(allocation) without using either L or T.

The on-line allocation model described above is quite general. One of the appli-
cations is the problem of prediction using the advice of a team of experts. Suppose
at each time t = 1, ..., T, the decision-maker must make a prediction and has the
knowledge of the predictions made by each of the n experts. Each prediction and
the outcome, which is disclosed after the decision-maker has made his prediction,
determine the incurred loss. I

Let the decision-maker and the n experts select their predictions from a convex

set D, and let 6 be the outcome space. Suppose loss function is a function L,
L : ‘DXG —> [0,1].

At each time t, for each i, denote the prediction of expert B,- by B}, and denote the
outcome by 3],. Then L(B§, y) is the loss of expert B,- at time t. If at each time t,
the decision-maker chooses B: to predict y, with probability pg, where p: is deﬁned in
Algorithm H with If = L(B,?, yt). Then by Theorem 3.2.1, we have that the expected
average loss of the decision-maker approaches the average loss of the best expert at
the rate of 0(T‘1/2).

Note that if the loss function L is convex with respect to d 6 ’D, then at time
t, let the decision-maker predict with a nonrandomized prediction 2;, pﬁBf. The

convexity implies that
n n
14210332, ye) S 21931:.
i=1

i=1

It follows from Theorem 3.2.1 that

T n T
‘. F ' i ‘/ +1,/
1L(Zp,B,,yt) _<_ 1131511"; I, + 2n 3T.

t: i=1

We will continue the discussion of prediction using expert advice in Chapter 4.

32

3.3 Introduction to the multi-armed bandit prob-

lem

3.3.1 Component multi-armed bandit game

We consider the following component multi-armed bandit game in which player I
selects a reward vector b 6 [0,1]” and player II selects a probability distribution p on
n points with gain LP = Z? b,p,-. We recognize this as an example where player I’s
pure strategies are in the cube

x = [0,1]"

and player II’s pure strategies are in the ﬁnite set
Y = {e1,...,e,,}.

By Theorem 2.4.1, there is a ﬁnite game isomorphic to the component multi-armed

bandit game.

3.3.2 The multi-armed bandit problem

We are going to study the repeated play of the component multi-armed bandit game,
that is, the multi—armed bandit problem. In the multi-armed bandit problem, origi-
nally proposed by Robbins [24], a gambler must decide which arm of n non-identical
slot machines to play. At each trial, he plays one arm and receives a reward (maybe
nonpositive). The goal of the gambler is to maximize his total reward over in a
sequence of plays.

Lai and Robbins [22] studied this problem using statistical assumptions about
the rewards of the slot machines. They assumed that the distribution of rewards
associated with each arm is ﬁxed and does not depend on the number of trials T.
They bounded the difference between the expected total rewards of the player and
the maximum of the expected total rewards of any arm with 0(logT).

Auer, Cesa-Bianchi, Freund and Schapire [2] presented a variant of the bandit

problem in which no statistical assumptions are made about the generation of rewards.

33

They only assume that the rewards are bounded.

They formalized the multi-armed bandit problem as a game between a player
choosing actions and an adversary with knowledge of past plays choosing the rewards
associated with each action. Assume action space is { 1, ..., n} and all the rewards
belong to the interval [0,1].

They deﬁned the full information game and the partial information game. In the

full information game, at each trial t = 1,2,. . .,T:

1. The adversary selects a vector of the current rewards
b‘=(b‘1,...,bf,),
where for each i, bf is the reward associated with action i at trial t.

2. Without knowing b‘, the player chooses an action it 6 {1, 2,. . . , n} and get the

corresponding reward bi.

3. The player observes bt after he makes the action it.

The partial information game also consists of three steps. All the steps are the same
as that in the full information game except step 3 is replaced by: The player only
observes b}, after he makes the action it.

They presented an algorithm, called Hedge, for the full information game. Al-
gorithm Hedge is a slight variant of Algorithm H edge(,6) described in Section 3.1.2
The idea of Hedge is to choose action i at time t with probability

p‘- : (1 + a)’:
’ ELI“ + “)8: ,

 

where a > O is a parameter and

t
t+l _ E : t
Si — bio
t=1

Note that each reward b: is deﬁned as a random variable on the set of player’s actions
up to trial t— 1. The measure of the performance of any algorithm, say A, is E (G A) —

Gk“, where

T
E(GA) = Emu-JAE big],
t=l

34

and

T
Gbest = mag; Ei1,...,iq~[z b;]- (3-3)
' i=1

15:
Auer, Cesa-Bianchi, Freund and Schapire [2, Theorem 3.2] showed that in the full

information game for a > 0,

01 lnn
E(GHed9¢) Z Gbcst — 5056“ - :—

For an appropriate choice of a, which is depends on Gbeu, the difference between
E (G Hedge) and Gm, is at least —\/2Tlnn in the full information game.
Auer, Cesa—Bianchi, Freund and Schapire [2, Section 4] also gave an algorithm for

the partial information game based on Hedge.

3.4 Application to the multi-armed bandit prob-
lem

In this section, we investigate an algorithm for player in the full information game un-
der the assumption that each reward bf does not depend on past play. Our algorithm

is described in Figure 3.2.

Theorem 3.4.1 If 13‘ and b‘ are deﬁned in Algorithm H, then the expected gain of
Algorithm H satisﬁes

T
-x/2n+1‘/3T _<_ E(GH) — 11:113.} 2b;- 3 \/2'H\/3T,
' 'nt=l
where E(GH) = 9:1133b;.

Proof. The proof is similar to that of Theorem 3.2.1 with loss replaced by gain.
By Theorem 2.4.1, there is a ﬁnite game isomorphic to this component game. Let f
be a one to one mapping in Theorem 2.4.1 associated with matrxi A deﬁned in Figure
3.2.

At each time t, player I chooses a strategy it such that

:t‘ = (b§,...,bf,),

35

Figure 3.2: Algorithm H

 

Algorithm H
Choose initial probability vector 131
Repeat for t = 2,3,...

1. Choose action it according to the distribution 13‘, where

- "1 , , 6t—1 . . .
p;- = Mite.- — b.) s —(—2,—)2(AJ — A“).Vz}.
8:1

and z = (Zl,...,22n), 21, ..., 22.. are i.i.d U(0, 1) under u, and A1, . . .,A" are

the columns of matrix A. A is deﬁned in Algorithm H.

2. Receive the reward vector b‘.

 

and player II’s strategy is a random vector 37‘ taking value from a set

Y = {e1,...,e,,}
such that
lit = e)
if
0(X“l + 99%z) —Aj

in the corresponding ﬁnite game, where z and Al are deﬁned in Figure 3.2,
t
X0 = 0, X‘ = 212’, 12‘: f(fitt).
1
It follows from Theorem 2.4.1 that

T
5%? + max 2; 5:;
J
t=l

:1:‘¢7(X"'l + 6(t2—lez) -— ¢(XT).

Ms EM“!

H
II
y—s

36

It follows from Theorem 2.3.2 that
T

—\/2"—13T g E, Zita-(X‘ 1+‘/6L2:—)z—)¢()3XT 2n+13T.
t=1

By the deﬁnition of p~t and 37‘, we have for any j = l, .. ., n

#{9' = 31'}
= p{g(Xt-l + 6_(%:_1)Z) = _A1}

=u{[Zf(i ’)+ ‘”;‘,,21).zJ,’(—Aj)s[:-fw)+6“ 1) z](-— -A") V}

 

 

It follows from

.. = (b§,...,bf,), and f(a‘r‘M" = “$-

 

that
1437:6311
=u{§(b:—b;)s 6“; ,lz’w—A A‘), W}
_ﬁgm

Thus

Therefore, Theorem 3.4.1 follows.

[3

Remark 3.4.1 Algorithm H gives both an upper bound and a lower bound of the
modiﬁed regret at the rate of 0(T1/2) without knowledge of T or sz.

Suppose the player has access to the Opinion of a set of K experts. At each trial t,
before choosing an action, the player is provided with a set of K probability vectors

(9(1), . . . ,£‘(K)). For each j, £‘(j) is the advice of expert j on trial t, that is

37

and £f( j) is the recommended probability of choosing action 2' by expert j. After
receiving the reward vector b‘, the expected reward for expert j is 5‘( j )b‘.

At each time t, we apply Algorithm H with new reward vector

(€‘(llb‘, - - - .€‘(K)b‘)-
It follows from Theorem 3.4.1 that

T
-\/2K+1\/3T _<_ E(G'H) — ltggZE‘UW S V2K‘1\/3T.
- - t=l

Therefore, using Algorithm H, the expected average reward of the player will ap-
proach to
T
t - t
1131?; g E (2)1).

which is the average reward of the best expert, at the rate of 0(T‘1/2).

Chapter 4

Prediction using expert advice

4.1 Introduction

In this chapter we review some of the literature dealing with prediction using expert
advice. Section 4.2 is a self-contained review. However, the main thrust of this
chapter is to show how the Harman recursive strategies apply in a prediction problem
considered by Foster and Vohra [13]. In particular, in Section 4.3 we prove that a
strategy for n = 2 experts investigated by Foster and Vohra [13] is an exmaple of
a Harman strategy and we deduce a slightly stronger result than was established in
their Theorem 1. Section 4.4 gives a solution for the n expert case. Finally, Section

4.5 applies the k-extended approach to the prediction problem.

4.2 Literature review

Consider the following prediction problem. Suppose no statistical assumptions made
about the actual sequence

g=(y1....,yr) (4.1)

of outcomes that is observed. At each time t = 1, ..., the decision-maker must
predict the value of y,. Before making the prediction, the decision-maker is given the

Predictions of n experts.

38

39

This decision problem can be viewed as the following game between two players,

the decision-maker and nature. At each time t = l, 2, . . .,

1. Each expert B,, i = 1,...,n, makes a prediction B: 6 D, where D is the

prediction space.

2. The decision-maker, who has the knowledge of all Bf, i = 1, ..., n, s g t, and

past outcomes yl, ..., y¢_1, makes his prediction 3), 6 D.
3. The nature chooses some outcome y; E G, where 6 is the outcome space.

4. Each expert B,-, i = 1,..., n, incurs loss L(B,?, 3],) and the decision-maker incurs

loss L(3)¢, 3],), where L : Dxe -> [0, 00) is the loss function.

Suppose at each time t, the decision-maker uses a prediction algorithm A to make
prediction 3),, then loss of the decision-maker is equal to the loss of the algorithm
A. Deﬁne the total loss of the algorithm A on a sequence of trials with respect to a

sequence of outcomes g to be

T
ylzg Myra yr) (4-2)

where g is deﬁned in (4.1). Similarly the total loss of the expert B,- with respect to g
is deﬁned to be T
Lg. (g) = 2: L035. ye)-
t=1

The goal of the decision-maker is to ﬁnd an algorithm A to minimize

Lia) - min _L£,(y) (4.3)

The goal of the nature is to maximize (4.3). So the min/ max strategy for the decision-
maker is the algorithm that minimizes the maximum of (4.3) over all outcome se-
quences.

Vovk [27] introduced a general on-line prediction algorithm when the outcomes
are binary. For the case with continuous-valued predictions, Vovk proved for a large

class of loss functions bounds of the form

Lag) - 112531"ng (g) g cLlnn,

40

where cl, is a positive constant determined by the loss function L.
In the rest of this section, we give a literature review according to the following

four cases.
Case 1. D = [0,1], 6 = {0, 1} and [31,...,B,,} is the ﬂute set of experts.

Haussler, Kivinen and Warmuth [18] studied the Generic Algorithm ﬁrst introduced
by Vovk. At each time t, the Generic Algorithm predicts with any value 9. that

satisﬁes for y = 0 and y = 1 the condition

A n w .e—ﬁL(B::Ul
[IQ/my) S ‘61" t"

 

, (4.4)

n
i=1 i=1 wtii

where c and 17 are any two positive constants, for any i,
— L 39,
1013 > 0, wt+l,i = ware " ( ' m).

Haussler, Kivinen and Warmuth [18, Theorem 3.11] showed that if L is a loss

function such that

CL == sup 3(zlL’1(z>2-La(z)La(z)2
L6<2>L¥<z> - Leonie)

where Lo(z) = L(0,z) and L1(z) = L(1,z), then it follows from by applying the

< 00, (4.5)

 

Generic Algorithm with ml... = 1, c = CI, and r) = l/cL that for any T,

811p [Lgeneric(y) - 1127.21" LE,- (3)] S CLlnn! (46)
y E {0, 1}" - ..
B‘ 6 [0,1]"

where Bt = (Bf, . . .,Bg).
o For logarithmic loss, that is deﬁned by L(§, y) = ylng + (1 — y)ln]—:%, (4.6)

holds with cL = 1.

o For squared loss, that is deﬁned by L(g), y) = (ii—y)2, (4.6) holds with c1, = l / 2.

Foster [12] also proposed a prediction method with the loss function deﬁned as

squared loss. At each time t, the decision-maker predicts with

n
3}: = Z 10253:,

i=1

41

where (wt,1,. . . , wm) minimizes

332": «1).-B: - y.)2 + 5:10?

9:1 i=1 i=1
over all probability vectors w. It follows from Foster [12, Theorem 1] that for

any outcome sequence g and any probability vector w,

T T n
2(3): - yt)2 - 2(2101'3: - 11:)2 _<_ 2 + nlnn(T + 1).

i=1 t=l i=1

Therefore using this strategy, the decision-maker can perform as well as any
convex combination of the n experts, namely, the difference between the average
loss of the decision-maker and the average loss of any convex combination of

the n experts will converge to 0 as T goes to 00.

For absolute loss, that is deﬁned by L(37, y) = [37 — y], c1, = 00. Then the
Generic Algorithm can not be applies directly. Cesa-Bianchi et a1 [7] studies

some variants of the Generic Algorithm for the absolute loss.

Cesa-Bianchi et a1 [7] constructed a min/max strategy, called Algorithm MM,
for the decision-maker when the loss function is absolute loss. At each time
t = 1,. . . ,T, Algorithm MM predicts with

3,7 v(M'+Z",r—1)-—v(M'+1—r,r—1)+1
t: 1
2

 

where

r =T+l—t, Z” = (B[,...,B,‘,)
2-1
Mr = (M{,...,M,',), M]? = o M; = ZIB; —y.l, VJ‘.
8:1
and v is deﬁned inductively as
v(M,0) = lrsnjign Mj,

— M 1 — , —1
”(M:T)= min v(M+z,r l)+v( + zr )
z€[0,l] 2

 

Cesa-Bianchi et a1 [7 , Theorem 2] showed that for any set of n experts and for

any outcome sequence g,

T
T _ - T _ _
LMMQI) 115111.13" LB,- (y) S 2 ”(O’Tla

42

and Algorithm MM achieves the value of the game, €- — v(0, T). So Algorithm

MM is a min/ max strategy.

The disadvantage of the Algorithm MM is that T must be known in the be-
ginning of the game. And the algorithm is computational expensive since the
calculation of v(M, r) involves minimizing a recursively deﬁned function over all
choices of z 6 [0,1]". Therefore simple algorithms, Algorithm P, Algorithm P’

and Algorithm P‘, were introduced so that they can be implemented efﬁciently.

Algorithm P works as following. At each time t, Algorithm P makes a prediction
:03 = F16 (7'3),

where ﬂ 6 [0,1),
22:1 wt,iB:

Tt =
n
{:1 wt,‘

1

w...- = 1. w¢+1,.- = w1,.-Up(|Bf - ytl). Vi.

and Fp(7‘) and Uﬁ(Q) be any functions such that

ln((1— r)ﬂ + r) —ln(l — r + rﬁ)

SF 1‘ S .
2214,37,) ”( ) zznﬁﬁ)

 

1+

 

forOSr_<_1,and
3" S U301) S 1-(1-ﬁlq,
forogqgl.

The performance of Algorithm P depends on the parameter ﬂ. Cesa-Bianchi et
a1 [7] showed how to choose ﬂ in according to the type of knowledge available

to the decision-maker.

* If the decision-maker knows an upper bound on the total loss of the best

expert, Cesa-Bianchi et a1 [7, Theorem 15] showed that for any K _>_ 0,
taking ﬂ = g(‘/‘"7"-), where

(4.7)

 

43

for any set of 11 experts and any outcome sequence 3) such that

min Lg (y)__ < K,

l<j<n

we have
loggn

 

LP(3/) — 1min _nLB (y)_ < VK Klnn +

* If T is known to the decision-maker in advance, then use a slight variant of
Algorithm P by adding a new expert Bu“, where at time t, B:, +1 = 1- Bf.
Denote the algorithm that uses the expanded n+1 experts by Algorithm P’.
Cesa-Bianchi et al [7, Theorem 16] showed that taking ﬂ = “Mg—HT),
where g is deﬁned in (4.7), for any set of 11 experts and any outcome

sequence g of length T, we have

 

. Tln n+1 lo n+1
LT’(y)_ min Ila-(E) S\/ (2 )+ 92(2 )

 

* If there is no prior knowledge about the upper bound on the total loss of the
best expert or the length T of the sequence, the following procedure, called
Algorithm P‘, can be used. For z = O to co, Algorithm P“ repeatedly runs
Algorithm P(g(\/%)) until the total loss exceeds b,, where g is deﬁned in

(4-7).

k,= 4(1—-— +2‘/_)2=1nn,

and
loggn

2
Cesa-Bianchi et al [7 , Corollary 22] showed that if n 2 7, then for any

bz = k, + kzlnn +

 

outcome sequence jg,

 

LT. (g) — lrsnjiéin La, (g) S 4\/115nji£n Lg), (g)lnn + 2.8lnn.
Case 2. D = [0,1], 9 = [0, 1] and {81, . . . , Bu} is the ﬁnite set of experts.

Haussler, Kivinen and Warmuth [18, Theorem 4.2] showed that for loss function such

that 62
9(yi “1 b) 69“,: a" b) 2
—————By. +(———ay ) 2 o (4.8)

44

holds for all y, a, b 6 [0,1], where

L(31.2) _ L(y, b)

ya: b =
9(y ) CL CL

 

CI, is deﬁned in (4.5), and C], < co, the Generic Algorithm satisﬁes

sup [LGeneric(y) _ 115111.21" ng (Ell S cLlnn, (4'9)
a 6 {0,1}"
Bt 6 [0,1]"

where 3;: (8],. ., Bf.)-

o For logarithmic loss, that is deﬁned by L(g}, y) = ylng + (1 - y)ln]—:%, (4.8)
holds and cL = 1. Then (4.9) follows.

0 squared loss, that is deﬁned by L(;17,y)= (y— y)”, (4. 8) holds and CL: 1/2.
Then (4.9) follows.

0 For absolute loss, that is deﬁned by L(g}, y) = [3) -- yl, we have at, = 00. Then
we can not apply the Generic Algorithm. Haussler, Kivinen and Warmuth [18]
constructed an new algorithm, the Vee Algorithm, that predicts with any value

fit that satisﬁes the condition

wtﬁe "HIE:

 

 

 

max{y+ w: ._1 . '22:)1/1 n, +,..,1}
_<_ 37: — ’

wt ﬁe-" "IB:_yl
< min{y — UM; ’ ":1 u)”- )l/[2l nl + 2e""]}

where for any i,
-"L(B: 1y!)
3

101,1 > 0, wt+l,i = were
Y: {0, 1, B[,.. .,B‘}.

Haussler, Kivinen and Warmuth [18, Theorem 4.7] showed that for any outcome

sequence 31 and any i, the Vee algorithm satisﬁes

 

 

Luvs—mm. J+n_m.inn_1T;L (ml/12ml,2

i=lw

-,1.

45

Case 3. D = [0, 1], 6 = {0,1}, and there are unaccountably inﬁnite set of experts.

Heund [14] generalized the Weighted Majority algorithm of Littlestone and Warmuth
[23] to the case in which there are unaccountably inﬁnite set of experts. The algorithm
he used is called the exponential weights (EW) algorithm. The EW Algorithm gives
a prediction 3’], that is any value in [0, 1], such that for y E {0, l},

1 1
may.) 3 “If. e-%L<M>dn(p>1/1 /0 am». (4.10)
where for each t, c is a positive parameter, and the measure p¢(A) is deﬁned as
#t+1(A) = Le—%L(del‘t(1’)a
and 111 is a probability measure on [0, 1].
Assumption 4.2.1 Suppose the loss function satisﬁes the following properties
0 3 c and r) = l/c, such that (4.4) holds.
0 V y, L(p, g) has a continuous second derivative with respect to p.
o 315 : [0,1] —> [0,1], 13(6) is the unique minimizer of
T
z L(pi yt)
t=l
over all outcome sequence g whose empirical distribution is 9.

Choose the initial probability measure to be

p1(A) = [A w(:1:)d:c,

where 82
1
1003) = ‘3 ja'p—zghipllwﬂx):
1 82
z = [o lwgamnansdx, (4.11)
and

gum) = Elmo, 1) - Lo.1))+ (1 — x)(L<p, 0) - unem-

46

Freund [14, Theorem 1] showed that for all outcome sequence y whose empirical
distribution is 6,
max 6[L3W( y(_) — min ZlL(p, y¢)]< _ gln-T— — —an+ 0(— —),
1' 4, :1 m4 pe,[o 1]_ 2
where Z is deﬁned in (4.11).

o For log-loss, that is deﬁned by L(d,0) = -log(1 — |d — 0|), Assumption 4.2.1
holds with c = l. Freund [14, Theorem 3] showed that

max[LEW() — min 231L(p, yt)]_ %In(T+ 1) + 1

p6[0,1]t_

o For square loss, that is deﬁned by L(d,0) = (d -— 0)2, Assumption 4.2.1 holds
with c = 1 / 2. Freund [14, Theorem 4] showed that

max[LEW(y) — min éﬂp, y, )]< _ 4lnT+ ;ln— —f—_r(2\/2)— ilng,

p.6[0 1]_

where er f (p) = % f: e“ dz.

0 For absolute loss, that is deﬁned by L(d, 0) = Id — 0], we have c = 00. Since

Iggglélp— yt|= gggéﬁlp- yrl

this prediction problem can be treated as a prediction problem using two expert

advice.
Case 4. L(d,0) _<_ K, for all d6 D and 0 E 6.

Foster and Vohra [13] studied the problem of choosing between two expert forecasts.
Suppose that B1 and 32 are two experts with bounded forecasting errors or called
loss. At each time t = 1, ..., T, the decision-maker has the knowledge of the past
loss incurred by Experts BI and 82. Foster and Vohra [13] constructed a randomized
strategy 0 from BI and 32 based on their past average loss. They indicated that
the difference between the average loss of C' and the minimum average loss of BI and
B; will converge to O in probability at the rate of 0(T‘1/2). We will give a detail

description of the strategy 0 and a proof in Section 4.2 using game theory results.

47

One of the applications of the on-line allocation problem studied by Freund and
Schapire [15] is predicting using expert advice when the loss function is bounded. By
the H edge(ﬂ) Algorithm, at each time t, the decision-maker predicts with g. = B:
with probability pg, where B: is the prediction of Expert B,- at time t,

t
t w

pi = m
where the parameter ﬂ 6 [0,1], for any i,
wil =1/1’l wH-l = wtﬂMBf ,yt).
Freund and Schapire [15, Lemma 4] showed for all outcome sequence of y such that

min LT J(_y) < L,

l<j<n

choosing B as a function of L,

E[Lgedge(m(y) ] — {(1121n LT (y)__ < 2Llnn + lnn,

where
ElLHedgeW) (E)l_ _ Z Z p:L( 3:1) yt)1

t=l i=1
is the expected total loss of the H edge(,B) Algorithm over y.

In Chapter 3, we use game theory results to construct an algorithm H for the
decision-maker when the loss function is bounded. Without loss of generality, for any
deD,andae 9, let

0 S L(d, 0) S 1.

At each time t, the decision-maker predicts g. = B: with probability pg, where p: is
deﬁned 1n Algorithm H with l‘= L(B ,‘,y¢). Theorem 3.2.1 shows that for all outcome

sequence of y,
T
_r—2.-.¢—3T s Evian - mm 2L1... (y) s War—2T.
_I_ﬂ i=1 -

where
t=l i=lp
is the expected total loss of Algorithm H over y.

48

4.3 A proof of Theorem 1 of Foster and Vohra

Suppose BI and 32 are two experts. At each time t, let B] and B; be their predictions

of outcome yt, respectively. Suppose for i = 1, 2,
B: 6 D1 yt 6 61

and the loss function L is bounded. Without loss of generality, for any d e D, and 0
E 9, let
0 5 L(d, 0) S 1.

For i = 1, 2, denote

In their paper, Foster and Vohra [13] constructed a randomized strategy C for the

decision-maker. At each time t, the decision-maker predicts with

. B‘ with probability min(max[0, —J-;—l—D"“: ‘11 '], 1)
yt = 1 2(1 1

B; otherwise,
where s is a constant that satisﬁes 0.5 _<_ s < 1,

t t
D. := Z b’,c — Z b’f. (4.12)
k=1 k=l

Now we will write the randomized strategy C in the following equivalent way. At

each time t, the decision-maker predicts with

. B; if 1)..1 > 2(1 — 1)‘(z - 1/2)
y =
B; if DH 5 2(t — 1)‘(z — 1/2),

where z is U (0, 1) random variable under 11.

Foster and Vohra indicated that if the decision-maker uses the randomized strat-
egy C, then the difference between the average loss of the decision-maker and the
minimum average loss of the two experts will be bounded by a nonnegative random
variable 67, and 67' goes to 0 in probability as T goes to 00. On its face, this random-

ized strategy C is similar to the Harman [17]’s recursive strategies for player II in a

49

two-person ﬁnite game, that is player II plays Bayes versus a randomized perturba-
tion of the player I ’s empirical distribution. We will prove Theorem 1 of Foster and

Vohra [13] using the game theory results of Chapter 2.

Theorem 1 of Foster and Vohra [13] If Bl and 32 are two experts with bounded
loss bi, b; for all t, and the strategy C is deﬁned as above, then

1 T 1 T T
TZT S Tm1n(2b[,2b§) +61,
1 1 1
and
0 < 5T 1’1 0,

where E, is the loss of strategy C at time t.
Proof. Without loss of generality, for all t, let
0311531 and 0311.331.

Consider a ﬁnite two-person game in which player I has 4 pure strategies, player 11

has 2 pure strategies, and the loss matrix A is deﬁned as following:

(00)
1o
01

[11}

Let 2 = (21,22,23,z4). Under a probability measure )1, 21, 22, 23 and 24 are

 

 

independent. 2,- is U (0, 1) for i = 1, 2 and 4. 23 is degenerate at 1/2.

( o )
1
Since A" = A” - A' = , for l S q < r g 2, (2.4) becomes
—1

\0/

II{Z|t1322—23$t2}St2-t1-

 

 

So (2.4) holds with L = 1. a = Eullzll = m/2 = 2.

50

Consider that player I chooses randomized strategies T from the class it that

consists of all 3‘ such that for each t,
It = {(1_ btl)(1 — b2):bt1(1_ b2): (1 - btl)b21btlb2}
For any t and z' = 1, 2,
:IJ‘Ai = b:.

Then the Bayes envelope ¢(XT) becomes

T
T_ - T _ - t
¢(X )— mm X a—mirgz bi.

UE{A1,A2} 1: , t=l
If H ‘ = 2t’ and z is deﬁned as above, we have
t—l
(Xt'l + lit-120Al = Z b’f + 2(t—1)’(z2 + 24),
k=l

and

t-l
(X‘“1 + H“"1z)A2 = 2 b; + 2(t — 1)‘(z3 + 24).
k=l

Then the Bayes response 0(X"1 + H "'12) becomes

.41 if 1),.l > 2(t — 1)‘(22 — 1/2)

0(X“l + Ht'lz) =
A2 if D¢_1 _<_ 2(t — 1)‘(22 - 1/2),

where D¢_1 is deﬁned in (4.12).

Since

$t0.(Xt-l + Ht-lz) = btl if Dt-l > 2(t " l)‘(22 " 1/2)
b; if Dt-1 S 2(t-1)‘(22 - 1/2),

the deﬁnition of the strategy C implies that
6, = :z:‘0(Xt'l + Ht'lz).

Then applying Theorem 2.3.3 with m = 4, n = 2, [HI = 1, L = 1, 6 = 2, H‘ = 2t‘

and for any player I’s strategy 55‘ in i, we have

T T T
Eul 25¢ “ min(21:btla 2%“
1 1

51

T
= Eﬂl 2::1:‘(I(Xt"1 + Ht"lz) — ¢(XT)|
1

2

__ l—s
2,, T )IBI

22 T
< " —
_ 2T 2|B|+ 2 (g

3 4T‘ + if".
1 3

By Markov inequality, we have that
1 T ~ . T t T P
Tl th — m1n(z b1,2b§)| —) 0.
1 1 1
Hence the desired result follows.

C]

Remark 4.3.1 71-4 2? E, — min(2fb‘l, 2,7123” 5) 0 is slightly stronger than the result
of Foster and Vohra [13, Theorem 1]. The optimal choice of s is 1/2, and the optimal

convergence rate is 0(T‘1/2).

4.4 A generalization to more than two experts

In the last section, we have proved that given the predictions of two experts, we can
construct a randomized strategy for the decision-maker such that the decision-maker
performs as well as the best experts in the sense that the difference between the
average loss of the decision-maker and the average loss of the best expert converges
to 0 in probability. In this section we consider how to construct a randomized strategy
for the decision-maker when there are more than two experts.

Suppose at time t, Bf, . . . , Bf, are the predictions of n experts, respectively. For
any t and i, let outcome y, 6 9, B: e ’D. Suppose the loss function L is bounded.
Denote

b‘. = L(B:,y.), w, t.

8
Without loss of generality, we assume that 0 5 bf S l for all i and t.
Suppose at each time t, for all i, the decision-maker has the knowledge of b}, .. .,
bf“. We deﬁne a randomized strategy 0 for the decision-maker, which is described

in Figure 4.1.

52

Figure 4.1: Strategy C'

 

Strategy G

Choose initial prediction from the set of {B}, i = 1, . . . , n}
Repeat for t = 2, 3,...

predict with B; if for any i,

t-l 2 _ ‘ '
2(1);- — bi) s \/§3—(§,,—1)z(A‘— .42),
t=l

where z = (21, . . . , 22»), and 21, ..., 22.. are i.i.d. U(0, 1) random variables under 11,

A1, . . ., A" are the columns of matrix A.

011 "' aln A1

02"1 aznn A2»

And A1, . . ., A2,. are the distinct sequences from W.

W = {112" | w" = (1121,. . . ,wn), 112,6 {0,1},Vi}.

 

In the next theorem we will show that using strategy C, the decision-maker will

perform as well as the best experts among the n experts.

Theorem 4.4.1 Let c. be the loss of the strategy 0 at time t, b: be the loss of the

expert B.- at time t with
ogﬁgIVLt

Then as T goes to co,
1 T T t P
Tl)?“ - £1,151!”ng -+ 0.
and the convergence rate is 0(T‘1/2).

Proof. We consider the two-person game inﬁnite game described in Section 2.4.

By Theorem 2.4.1, there is a ﬁnite game isomorphic to this inﬁnite game. Let f be

53

a one to one mapping in Theorem 2.4.1 associated with the loss matrix A deﬁned in
Figure 4.1..
In the inﬁnite game, at each time t, suppose player I chooses a strategy 52‘ such
that
it = (b§,...,bf,),

and player II’s strategy is a random vector 9‘ taking value from a set
Y = {e1,...,e,,}

such that

if
0(X"l + H‘"lz) = Aj

in the corresponding ﬁnite game, where z and Al are deﬁned in Figure 4.1,

t
X°=0, X‘sz‘, T‘=f(:i;‘),
1

2
H‘=‘/3;t t=1,2,...

It follows from Theorem 2.4.1 that the modiﬁed regret of the allocation game is

and

 

the same as that of the ﬁnite game, i.e.,

T
= 22:1:‘0(X"l + Ht‘lz) — ¢(XT)
1—1

By the deﬁnition of 32‘, we have
T T
2 .. - 252.21%
T
= ZT‘UUC’I + Ht’lz) — ¢(XT)
t=l
It follows from Theorem 2.3.4 that

T
Eu| Z z‘a(X“l + H“1z) — ¢(XT)| g \/3n22“T.
i=1

54

Thus T
T
_ ' t, 2
Eulgq lglgngb’l S V311 2"T.

By Markov inequality, we have that

15an

1 T T t P
Tlgq— m1n21:bj|—>O.

Hence the desired result follows.

4.5 The k-extended prediction strategies

4.5.1 Introduction

In the on-line prediction model described in Section 4.2, at each time t, the decision-
maker predicts the outcome y, 6 6. The decision-maker has the knowledge of the
predictions made by each of 11. experts, and makes a prediction based on the past and
current expert predictions and the past outcome sequence. The goal of the decision-
maker is to ﬁnd a prediction strategy such that the total loss is as small as possible.
Since no statistical assumptions are made about the distribution of the outcome
sequence, a reasonable goal for the decision-maker is to perform as well as the best
expert. In Chapter 3 and the ﬁrst four sections of Chapter 4, different strategies have
been introduced to give upper bounds on the difference between the total loss of the
decision-maker and the total loss incurred by the best expert such that the average
loss of the decision-maker approaches the average loss incurred by the best expert as
T goes to 00. We observe that the average loss incurred by the best expert is equal
to the Bayes enve10pe in a ﬁnite two-person game. It is of interest to ﬁnd a strategy
for the decision-maker such that the average loss of the decision-maker approaches a
lower envelope as T goes to co.

Herbster and Warmuth [20] considered the prediction model where 'D = [0,1],
8 = [0, 1] and loss function is L. They studied the case in which the outcome
sequence is divided into at most 11: + 1 arbitrary segments. Each segment has a best

55

expert. The sequences of segments and its associated sequence of best experts is
called a partitioning. Now the goal of the decision-maker is to perform well relative
to the best partition.

Let y = (y1,...,yT) be any outcome sequence. For any (t1,...,tk) such that
1<tiSTandtiSti+n

[inyt1):[yt11 ytz): ' ° ° ) [ytu yT-i-l)

is called a k-partition of g, denoted by Pug), where

e: (60,...,6k),

such that 1 S e,- S n and e,- 75 e,-+1. Expert 8,, is the best expert associated with the
ith segment [th 3],, +1)- Deﬁne the total loss of Pk,¢(g) to be
I:
11%,,(3) = Z ngl-ti(lyt.1yt.+1))a
i=1

where Egg-“([3)“, y¢,+,)) is deﬁned in (4.2). Since

LP” (g) can be considered as a lower envelope.

Herbster and Warmuth [20] modiﬁed Vovk’s Generic Algorithm, which is de-
scribed in Section 4.2, by adding an additional update to obtain two algorithms: the
Fixed -— share Algorithm and the Variable - share Algorithm. Each algorithm has
a parameter a 6 [0,1].

Herbster and Warmuth [20, Theorem 4.4] showed that for any positive integers h
and T, by setting a = $5, for any outcome sequence 31 with T 5 T, and any Pug!)
with k S h, the Fixed — share Algorithm satisﬁes

A

T ..
Lama) g clnn + an?” (g) + clean: + ln(n — 1)) + 2011:,

where c and n are determined by the loss function, and Lame!) is deﬁned in (4.2).
Herbster and Warmuth [20, Theorem 5.8] showed that for any positive integers h
and L, by setting a = 575%, for any Pk,e(g) with k S h, LRJg) 5 L, the Variable -

56

share Algorithm satisﬁes
T T I" *
LVariable(g) 5 an" + GULF,” (31) + 004171;) + 00“):

where c and 17 are constants determined by the loss function, and Lam-we (g) is deﬁned
in (4.2).

For certain loss functions, such as square loss, c and n can be chosen such that
on = 1. Then using either Fixed — share Algorithm or Variable — share Algorithm,
we have bounds for the difference between the average loss of the decision-maker
and La'Jg). As an example, Herbster and Warmuth [20, Section 6] considered a
sequence of 800 trials with four distinct segments. They compared Vovk’s Generic
Algorithm to the Variable-share Algorithm. The simulation results showed that the
Variable-share Algorithm performs better than Vovk’s Generic Algorithm for this
sequence.

Cover and Shenhar [11] introduced a prediction strategy whose average loss ap-
proaches to the k-th order Bayes envelope in the situation of sequential prediction
of binary sequences with apparent Markov structure. In their paper, ‘D = [0, 1],
9 = {0, l}, and they use a score function instead of a loss function. Without con-
sidering the Markov structure, Cover and Shenhar [Section 4][11] gave a random pre-
dictor that predicts 37; = 1 with probability p; at each time t, where p, is constructed
based on Blackwell [6] ’s procedure.

For any outcome sequence g, denote the expected average score by

T
MST) = itgmy. + (1 - mu - yo],

and the Bayes envelope by

1 T 1 T
(MST) = mad; 21:11:.1 - 5,: 21:11:)-
t= t:

It follows from Hannan [17, p.139] that for any g,

3
¢(ST) " E(ST) S 7—73-

57

Considering the k-th Markov structure, Cover and Shenhar [11, Section 5] gave a

k-th order Markov predictor. At time T, T = k, k + 1, ..., let

2 =(yT—k+1,~-,yT)-

Denote T’ (2) = T(z, 1) + T(z, 0), where T(z, 1) and T(z, O) are the number of times
the sequence 2, 1 and z, 0 were observed in 34. For each .2, the k-th order Markov

predictor uses the Blackwell procedure to make predictions. It follows that

,, 2*3 k
45 (ST) — E(ST) S 7—1: + '71,

where

1

¢“(S )= —— T’(2)¢(S ,,)
T T4133], T"

is the k-th order Bayes envelope.

In the compound decision problem, a lower enve10pe than the simple Bayes en-
velope is the extended envelope introduced by Johns [21]. The idea of the extended
version is to take advantage of higher order empirical dependencies in the parameter
sequence. Johns [21] proposed extended compound rules whose risks achieve these
envelopes in the limit. Gilliland and Hannan [16] generalized and strengthened some
results previously reported by Swain [25] and Johns [21]. Theorem 3 of Gilliland
and Hannan [16] implies that the extended rules should compare favorably with un-
extended rules relative to the parameter sequence generated by a strictly stationary
process.

Vardeman [26] treated a sequence version of the ﬁnite state compound decision
problem. He also studied the k-extended problem in a game theoretic situation. After
proving a simple game theoretic decomposition of k-extended envelope, he constructed
randomized strategies with risk approximating the k-extended envelope at the rate
of 0(T1/2).

Suppose that the loss sequence of the experts has some dependencies. For instance,
in the case of two experts, assume that the loss sequence {(b],b§), (b§,b§), (b‘f, b3),
(bf, b3), ...} is {(0, 1), (1, 0), (0, l), (1, 0), ....} Taking advantage of the dependencies,

we can use the idea of the k-extended problem in decision theory to construct a

58

strategy, named k-extended strategy, such that the average prediction error of the
k-extended strategy approximates the Ic-extended envelope, which is a lower envelope

than the simple Bayes enve10pe.

4.5.2 The k-extended prediction strategies

Consider the prediction problem discussed in Section 4.4. Suppose at time t, Bf, .. .,
Bf, are the predictions of n experts, respectively. For any t and i, let outcome y; E 9,

Bf E D. Suppose the loss function L is bounded. Denote
bf = L(Bf,y¢), Vi, t.
Without loss of generality, we assume that 0 5 bf g l for all i and t. Let
Q = {b5 Vi. t}

and suppose that Q is a ﬁnite set with cardinality q.

Assume at each time t, for all i, the decision-maker has the knowledge of bf, .. .,
bf‘l. Using Vardeman’s technique, we deﬁne a k-extended randomized strategy 0"
for the decision-maker, which is described in Figure 4.2. So at each time t, strategy
6'" only uses the past stages, at which the It previous predicting errors are the same

as b“‘1, to determine the prediction.

Theorem 4.5.1 Let of be the loss of strategy 0" at each time t, bf be the loss of the
expert B,- at time t with
oswslvtt

Suppose Q = {bf Vi, t} is a ﬁnite set with cardinality q. Then as T —) oo,

1 T . T
Tlgcf- Z {2133 Z: b§)|5>0.

y€Qn(h-l) - — “bot-1:”

and the convergence rate is 0(T‘1/2).

Proof. Denote

X = |{x‘ = {xf, . . . ,xfn}, xf = H[bfa,-¢ + (1 — bf)(1 - au)], Vi}
l=1

59

Figure 4.2: Strategy 0"

 

Strategy 0"

Choose initial predictions 3),, t < h, from the set of {B}, Vi}
Repeat fort: k,k+ 1,...

predict with

 

 

2 “-1 __ . .
2;. = B;- if 2 (b; — b?) 5 1/3" (”b ,, l ”204' — A’Wi.
s:b“"1=b"'l 2
where
[ft-1: {bi—k-H bt—k+2 bt-l}
b“1={bf‘l bH}
t-l
T(btt-l) = 21{boj-l:bot—l}.
i=1
and z = (21, . . . ,Zzn), and 21, ..., 22.. are i.i.d. U(0, 1) random variables, A1, ..., A"

are the columns of matrix A.

02H 02M; A2»

And A1, . . ., A2» are the distinct sequences from W.

W = {MI | w" = (w1,. . . ,wn), w,- E {0, 1},Vi}.

 

60

Since the loss of experts take values in a ﬁnite set Q, which has cardinality q, the

cardinality of 5C is Q”. There exits a one to one mapping between Q" and 5f.
Consider a ﬁnite two-person game in which player I has 2" pure strategies and

player II has n pure strategies. The loss matrix is the matrix A in Figure 4.2. Suppose

player I only takes strategies from i. Denote

 

 

, _ 3n2 T x“"1 — 1
0t=0(th—1lxt 1+\/ ( (2,, ) )2):
where
t—l .
X'tk-llx‘lt—1 = 2: 1J1
jzxej-l=xo¢-l
and

X‘t-l)=21{x.j-1=xu-1}.

Then the deﬁnition of the strategy 0" implies that

T
212190;: 2 ( Z 23%,)

xeib-l t:x“-1=x

2(24)

y€Qn(b-l) “bu—1:”

It follows from Lemma 2.2.6 that

¢*(Xl‘~) = Z Milo

361""1

= 2: (min( 2 bf)).

yeQn(h- l) t:'b""‘1=y

Thus, applying Theorem 2.3.5 with [B] = 1, m = 2", and x‘ in i, we have

T T
alga: - 0% ”my”; b.,.)>|

— "ME-Tat” ¢k(XT)l

_<. \/3n22nqn(k—1)T.

 

Therefore, by Markov inequality, Theorem 4.4.1 follows.

61

[3
Remark 4.5.1
2: < '( i b >> )3}
min ,3, 5 min f.
y€Qn(k—l) ' “be ‘-1:y 151$" t=l

Example 4.5.1 Suppose there are two experts, and their loss sequence is as following:
{(bl’ 1);), (bf, b3), (bg’ b3), (bf, b3), ' ° °} = {(0, 1), (1’ 0), (0’ 1), (1’ 0), ° ’ '}

Then the simple envelope is

¢(XT)={ % ifT>landTisoddO

if T is even

For the simple case of k = 2, the k-extended envelope is
920%) = 0-

Therefore, we see that the 2-extended envelope is indeed a lower envelope.
Let T = 5000. To compare the average loss of strategy 02 with the average loss
of the strategy C deﬁned in Section 4.3, we used S-PLUS to determine:

— c. = 0.5196078,
4998 ,=,
1 ”2°“ .2
— = 0.
4998 ,=, ‘

Thus, the 2-extended prediction strategy performs better for this sequence.

Bibliography

[1] G. Anandalingam and L. Chen. Linear combination of forecasts: a general Baye-

[2]

l3]

[4]

l5]

l6]

[7]

[8]

sain Model. Journal of Forecasting, V8, 199-214, 1989.

P. Auer, N. Cesa—Bianchi, Y. Freund and R. E. Schapire. Gambling in a rigged
casino: The adversarial muti-armed bandit problem, In Proceedings, 36th Annual

Symposium on Foundations of Computer Science, November, 1995.

R. J. Ballard. Extended rules for the sequence compound decision problem with

m x n component, Doctoral Thesis at Michigan State University, 1974.

R. J. Ballard and D. Gilliland. On the risk performance of extended sequence
compound rules for classiﬁcation between N(0, 1) and N (1, 1), J. Statist. Com-
put. Simul., V6, 265-280, 1978.

R. J. Ballard, D. Gilliland and J. Hannan. 0(N’1/2) convergence to k-extended
Bayes risk in the sequence compound decision problem with m x n component,

RM-333, Statistics and Probability, MSU, 1974.

D. Blackwell. Controlled random walks. J. Proc. Int. Congr. Math. 3, 336-338,
1956.

N. Cesa-Bianch, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire and M.
K. Warmuth. How to use expert advice. Unpublished manuscript, 1995.

R. T. Clemen. Combining forecasts: a review and annotated bibliography. Intl.

J. Forecast 5, 559-583, 1989.

62

63

[9] R. T. Clemen and J. B. Guerard, Jr. Econometric GNP forecasts: incremental

information relative to nave extrapolation. Intl. J. Forecast 5, 417-426, 1989

[10] R. T. Clemen, and R. L. Winkler. Combining economic forecasts. J. Bus. And
Econ. Statist. 4, 39-46, 1986.

[11] T. M. Cover and A. Shenhar. Compound Bayes predictors for sequences with ap-
parent Markov Structure, IEEE Transactions on Systems, Man and Cybernetics,

Vol. SMC-7, No.6, 1977.

[12] D. P. Foster. Prediction in the worst case. The annals of Statistics, V19, No. 2,
1084-1090, 1991.

[13] D. P. Foster and R. Vohra. A randomization rule for selecting forecasts. Opera-

tions Research, 41(4):704-709, July-August 1993.

[14] Y. Freund. Predicting a binary sequence almost as well as the optimal biased
coin. In Proceedings of the Ninth Annual Conference on Computational Learning

Theory, 1996.

[15] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learn-
ing and an application to boosting . In Computational Learning Theory: Second

European Conference, EuroCOLT ’95, 23-37, Springer-Verlag, 1995.

[16] D. Gilliland and J. Hannan. On the extended compound decision problem. Ann
Math. Statist. 40: 1536-1541, 1969.

[17] J. Hannan. Approximation to Bayes risk in repeated play. In Contribution to the

theory of games, V3, 97-139, Princeton University Press, 1957.

[18] D. Haussler, J. Kivinen, and M. K. Warmuth. Tight worst-case loss bounds
for predicting with expert advice. In Computational Learning Theory: Second
European Conference, EuroCOLT ’95, 69-83, Springer-Verlag, 1995.

[19] D. P. Helmbold, R. E. Schapire, Y. Singer and M. K. Warmuth. On-line portfolio

selection using multiplicative updates. Unpublished manuscript, 1996.

64

[20] M. Herbster and M. K. Warmuth. Tracking the best expert. In Proceedings of
the Twelfth International Conference on Machine Learning, 286-294, 1995.

[21] M. V. Jr., Johns. Two-action compound decision problems. Proc. Fifth Berkeley
Symp. Math. Statist. Prob, 1 pages 463-478, University of California Press, 1967.

[22] T. L. Lai and H. Robbins. Asymptotically Efﬁcient Adaptive Allocation Rules.
Advances in Applied Mathematics, 6:4-22, 1985.

[23] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Informa-
tion and Computation, 108:212-261, 1994.

[24] H. Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin
American Mathematical Society, 55:527—535, 1952.

[25] D. D. Swain. Bounds and rates of convergence for the extended compound es-
timation problem in the sequence case, Tech. Report, No. 81, Department of
Statistics, Stanford University, 1965.

[26] S. B. Vardeman. Approximation to minimum k-extended Bayes risk in sequences
of ﬁnite state decision problems and games, Bulletin of the Institute of Mathe-

matics Academia Sinica, V10, No. 1, 1982.

[27] V. G. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop
on Computational Learning Theory, 371-383, 1990.

[28] V. G. Vovk. A game of prediction with expert advice. Unpublished manuscript,
1996.

HICHIGRN STQT

HI] 1]_] illi]li|[[l[[l[i[i

312 301776