This is to certify that the dissertation entitled "Applications of Play Against Past Strategies in Repetitions of a Game" presented by Lei Chen has been accepted towards fulfillment of the requirements for the Ph.D. degree in Statistics.

Major professor: Dennis Gilliland
Date: December 1, 1997

APPLICATIONS OF PLAY AGAINST PAST STRATEGIES IN REPETITIONS OF A GAME

By

Lei Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Statistics and Probability

1997

ABSTRACT

APPLICATIONS OF PLAY AGAINST PAST STRATEGIES IN REPETITIONS OF A GAME

By Lei Chen

Hannan [17] introduced and studied certain recursive "play against past" strategies in the repeated play of a game. His results include bounds on the excess of total risk to player II over an envelope risk, namely, that which results from use of the best simple strategy based on knowledge of player I's empirical distribution of past choices. This thesis investigates the Hannan results for finite games and extends their applications to certain infinite games. This is accomplished in Chapter 2 by showing that the Hannan bounds hold for play against random perturbations of the empirical distribution of player I's past choices of randomized strategies and by identifying pure strategies in the infinite games with randomized strategies in a companion finite game. Chapter 3 uses this idea to show that the Hannan recursive strategies and resulting bounds have applications in the repeated play of certain allocation and nonadversarial multi-armed bandit problems.

Another application concerns the repeated play of an expert selection problem. Chapter 4 contains a review of some of the literature for this problem. In this component game, player II selects an expert from a set of experts for a prediction and suffers loss equal to the prediction error. In repeated play, the envelope risk is the total of the prediction errors by the best individual expert. For a set of n = 2 experts, Foster and Vohra [13] proposed a recursive strategy and showed that it results in average prediction error close to that of the best individual expert in the limit. Chapter 4 shows that their strategy is an example of a Hannan strategy and that the adaptation of Hannan to infinite games establishes a slightly stronger version of their asymptotic result. Moreover, this approach provides an immediate solution for the case of n experts.

To Jingjing - my dear husband

ACKNOWLEDGEMENT

I would like to express my sincerest gratitude to my advisor Professor Dennis Gilliland for his guidance and continuous encouragement in the preparation of this dissertation. He taught me the art of consulting as well as the art of teaching and research. I appreciate his patience and willingness to discuss problems at any time. I could not have finished this dissertation without his great help.

Special thanks go to Professor James Hannan for the many things I learned from him, for the conversations we had all over the campus - classroom, office, hallway, IM East, ... - and, of course, for his fundamental research in game theory from which my work derives. It was amazing that he almost persuaded me to pursue an academic career even though the chance of getting one is so slim.
I could not imagine the consequence if I forgot to thank Professor James Stapleton. It is he who called me about my admission in July 1993. I am grateful for the good advice he gave me throughout my graduate study, and for his encouraging words about my English which sometimes were so good that it gave me the delusion that my English is as good as his. A I would also like to thank Professors J. Hannan, R. Lepage and H. Salehi for serving on my thesis committee, for carefully reading my thesis, and for their helpful comments. I would like to thank all the faculty and staff of the Department of Statistics and Probability who treated me so well that I am alleged to be the only one who cried when leaving the Wells Hall. I am very grateful to my parents - Weiyou Chen and Xinshun Zhao from whom I learned to work hard and be responsible. Their wish of me getting advanced degrees has been my inspiration over the years. I am also grateful to my parents-in-law - Shizhong Wang and Liqiong Peng who kept me out of the kitchen when I was preparing my dissertation. I owe special thanks to my parents-like host family - Fred and Charlotte Poston. Their loving and caring always made me feel warm even in the coolest days of Michigan’s winter. Words are not enough for me to express how much I owe to my best friend, walking iv dictionary, personal trainer, counselor, soul mate - my husband Jing Wang. He is the one who comforted me and cheered me up when I was unhappy. He is the one who made me relax when I felt stress. He is the one who made me believe in myself when I lost confidence. Although I always complained about his frank criticisms, I know from my heart that he is the one who loves me the most, and he is the one who changed my world, and he is the one who made me become who I am. This dissertation is dedicated to him with love. Contents 1 Introduction 1.1 Introduction ................................ 2 Strategies for repeated play of a game 2.1 Preliminaries ............................... 2.1.1 The finite component game ................... 2.1.2 The repeated game ........................ 2.1.3 The k-extended game ....................... 2.2 Lemmas .................................. 2.3 Bounds for the modified regret of Hannan recursive strategies 2.4 An extension to an infinite game .................... 3 On-line allocation model and the multi-armed bandit problem 3.1 Introduction to on-line allocation model ................ 3.1.1 Component allocation game ................... 3.1.2 On-line allocation model ..................... 3.2 Application to the on—line allocation model ............... 3.3 Introduction to the multi-armed bandit problem ............ 3.3.1 Component multi-armed bandit game .............. 3.3.2 The multi-armed bandit problem ............... 3.4 Application to the multi-armed bandit problem ............ vi 00 ONUCflU‘ 11 16 22 25 25 25 27 32 32 32 34 4 Prediction using expert advice 38 4.1 Introduction ................................ 38 4.2 Literature review ............................. 38 4.3 A proof of Theorem 1 of Foster and Vohra ............... 48 4.4 A generalization to more than two experts ............... 51 4.5 The k-extended prediction strategies .................. 54 4.5.1 Introduction ............................ 54 4.5.2 The k-extended prediction strategies .............. 58 Bibliography 62 vii List of Figures 3.1 3.2 4.1 4.2 Algorithm H ............................... 28 Algorithm I? ............................... 35 Strategy 0 ................................. 
Strategy C^k ................................

Chapter 1

Introduction

1.1 Introduction

There is current interest in a variety of problems that involve repetitions of a decision problem. Examples are on-line allocation problems, multi-armed bandit problems and expert selection problems. Hannan [17] developed recursive strategies for player II in the repeated play of a game. Our concern is the adaptation of the Hannan [17] finite game results for certain infinite games, including the aforementioned.

Prediction by combining expert advice has extensive applications. According to the review of Clemen [8], it has been applied not only to meteorology and economics, but also to the prediction of social and technological events, football game outcomes, electrical demand, tourism, political risk, population, etc. Clemen [8] gave an extensive survey of methods for combining forecasts. Most of these conventional methods involve taking a weighted average of individual forecasts and can be viewed within a regression framework. Regression techniques, such as weighted least squares, robust-weighting techniques, ridge regression and latent root regression, have been used in combining forecasts. Bayesian techniques for including prior information in a forecast combination have been studied by Clemen and Winkler [10], Clemen and Guerard [9] and Anandalingam and Chen [1]. Most of these approaches require that the probability distribution of the event being forecasted be specified.

In computer science, predicting a binary sequence by combining the predictions of a set of experts has been studied by Cesa-Bianchi et al. [7], Foster [12], Freund [14], Littlestone and Warmuth [23], Vovk [27], [28], etc. Haussler, Kivinen and Warmuth [18] also provided prediction strategies for continuous-valued outcomes using expert advice for certain classes of loss functions. Their strategies are based on the exponential weight algorithm introduced by Littlestone and Warmuth [23] and by Vovk [27]. None of their strategies requires that statistical assumptions be made about the decision-maker's subjective beliefs regarding the distribution of outcomes. They proved bounds on the difference between the average loss of their strategies and the average loss of the best expert; the average loss of their strategies approaches that of the best expert at a certain rate as the length T of the outcome sequence goes to infinity.

Foster and Vohra [13] studied the problem of choosing a prediction from two experts. Assume that the prediction loss is bounded and that the decision-maker has knowledge of the past losses incurred by the two experts. They constructed a randomized strategy for the decision-maker and showed that the difference between the average loss of the decision-maker and the minimum average loss of the two experts goes to 0 in probability as the number of trials T goes to infinity.

By connecting the Foster and Vohra [13] randomized strategy with the game theory results of Hannan [17], we verify that the Foster and Vohra [13] randomized strategy is a special case of Hannan's strategy for player II in a finite two-person game; that is, it plays Bayes versus a randomized perturbation of the arithmetic mean of player I's past randomized strategies. The minimum average loss of the two experts is the Bayes envelope, denoted by φ((1/T)X^T), where X^T is the sum of player I's past randomized strategies. We also consider the prediction problem in which there are n experts.
Based on game theory results, we construct randomized strategies for the decision-maker such that the difference between the average loss of the decision-maker and the average loss of the best expert goes to 0 in probability as the number of trials T goes to infinity.

In the compound decision problem, the extended envelope, introduced by Johns [21], is lower than the simple Bayes envelope. The extended sequence compound decision problem has been studied by Swain [25], Gilliland and Hannan [16], Ballard [3], Ballard, Gilliland and Hannan [5] and Ballard and Gilliland [4]. Vardeman [26] studied the k-extended problem in a game theoretic setting. He constructed randomized strategies with risk approaching the k-extended envelope at the rate of O(T^{-1/2}). If the losses of the experts have some dependency, then we can use the idea of Vardeman [26] to construct k-extended randomized strategies for the decision-maker such that the average loss of the decision-maker approaches a lower envelope.

1.2 Summary

This dissertation connects game theory, the on-line allocation model, the multi-armed bandit problem and the on-line prediction problem using expert advice. It is organized as follows.

In Chapter 2, following Hannan [17], we present notation, useful lemmas and theorems for a finite two-person game. We also extend the results on recursive strategies to an infinite game.

In Chapter 3, we introduce two algorithms, Algorithm H and Algorithm H̃, based on Theorem 2.3.2. In the on-line allocation model, if the allocation agent uses Algorithm H, then the difference between the expected average loss of the allocation agent and the average loss of the best strategy converges to 0 at the rate of O(T^{-1/2}). In the multi-armed bandit problem, if the player applies Algorithm H̃, then the difference between the expected average reward of the player and the expected maximum average reward of any arm in a sequence of T trials goes to 0 at the rate of O(T^{-1/2}).

In Chapter 4, Sections 4.2 and 4.5.1 contain a general review of the most relevant literature for the on-line prediction problem using expert advice and the k-extended idea. In Section 4.3, we present a proof of Theorem 1 of Foster and Vohra [13] using Theorem 2.3.3. In Section 4.4, Theorem 2.3.4 is used to generalize the problem studied by Foster and Vohra [13] to the case in which the choice is among n experts. Without any statistical assumption about the distribution of the sequence being predicted, if the prediction loss is bounded, then we can construct a randomized strategy for the decision-maker such that the difference between the average loss of the decision-maker and the average loss of the best expert converges to 0 in probability at the rate of O(T^{-1/2}). In Section 4.5, we use the Vardeman [26] technique to construct k-extended prediction strategies when the decision-maker has knowledge of the predictions made by n experts. If the losses of the experts take values in a finite set with cardinality q, then using the k-extended prediction strategy the average loss of the decision-maker approaches the k-extended envelope in probability at the rate of O(q^{n(k-1)/2} T^{-1/2}), uniformly in outcome sequences.

Chapter 2

Strategies for repeated play of a game

2.1 Preliminaries

2.1.1 The finite component game

Consider a finite two-person game in which players I and II have, respectively, m and n pure strategies.
Their spaces of randomized strategies are denoted by X and Y, m X = {x = (1:1,x2,...,:rm)|:r,- 2 0,223; = 1}, 1 Y= {y= (y1.y2.--.,yn)ly.- 2 0,2!“ = 1}, I and their pure strategies are represented by base vectors 6 and 6 in X and Y, that is, the degenerate probability distributions. Notations. For m-vectors, we use juxtaposition to indicate inner product, that is, ab = 2’," a,b;, and we define the norms m Ial = mgxlaal and Hall = Slat-l- 1 Player II’s inutility is denoted by a loss matrix A, 011 “113 A1 A: g 5 = ; =(A1,...,An), aml amn Am It is assumed throughout the thesis that A has no dominant column, i.e., for each j meax(eAj — mrin eA') > 0, (2.1) and no duplicate and dominated columns. Notations. Let A"'=A"—A',1Sq 0. Notations. In all of the following, a(-) will denote a pure strategy valued, positive, homogeneous Bayes response defined on [0, 00)”; also, following Hannan [17, p. 102], we let IBI = sup |0($) - 0($’)I = Hgax lAq’l, (2-2) z,z’ ,r where the last equality follows from the fact that A has no dominated columns. A strategy 3 e S may be evaluated for each 2: E X in terms of the additional risk above the Bayes envelope risk ¢(:r), that is 2:3 — ¢(a:). This excess is called the regret ofs at 3:. 2.1.2 The repeated game This section considers repeated play of the component game and the evaluation of recursive strategies in terms of their excess total risk over 42 evaluated at the (non- normalized) empirical distribution of player I’s choices. Suppose the component game is played repeatedly with e‘ and 0‘ denoting the choices of pure strategies by player I and II, respectively, at time t = l, 2, ..., T. We suppress the display of dependence on T in writing 1 g=(e ,...,€T) and g=(al,...,aT). The total loss across the first T plays of the game is T RT(§, g_) = Z e‘a‘. 1 If player II uses the same component strategy a at each time t, that is, g_ = (a, . . . , a), then we write 9; = UT and note that T RT(§, 0T) = 2 6‘0 = ETa 1 where ET = 2:," e‘. (Harman [17, p. 107] calls such g power strategies.) Note that 12%. o") 2 ¢(ET) with equality if a = 0(ET). We will sometimes refer to the sum ET as the empirical distribution of player I’s choices of pure strategies through time T. The envelope risk ¢(ET) is the minimum total risk to player II resulting from the use of a power strategy across the first T plays of the game. If player II knows ET in advance, then the power strategy with kernel 0(ET) achieves the envelope risk. Hannan [17] refers to the excess risk RT(§,9_') - MET) as the modified regret of g at 3:. He develops bounds for the modified regret of recursive strategies g, with 0‘ a function of E"'1 and possible randomization. These bounds are 0(T1/2) uniformly in g for finite component games. Since we make extensive use of the Hannan [17] notations and results, it is conve- nient to state some of these for future reference. We let H ‘, t = 0, 1,. . . be a sequence of numbers satisfying, 0 0 such that Miz i t1 S qur/lAqr] S t2} S “‘2 - t1) f0? all Q < 7', t1 S t2- (2.4) Also 9 := Eullzll. When u is the uniform distribution on [0,1]”, 0 = m/2 and we take L = 1. Hannan [17, Theorem 3] showed that with the component game defined earlier, if (2.3) and (2.4) hold, then for each T and g, T —HT9|B| g :c‘Epa(E“1+H"lz)—¢(ET) (2.5) l 2 T 2 T < T I‘— ___ . _ H 0lB|+L4(Zl: H, HT)IBI, (26) where [B] is defined in (2.2). Denote the right hand side of (2.6) by UT. For fixed p, Harman [17, Section 7] showed that when T is known in advance to player II, then H‘ = (Luz/40)1/2T1/2, t = 1,. . 
.,T, minimizes supT T‘1/2UT. When T is unknown to player 11 and u is the uniform distribution, then Hannan [17, Theorem 4] proved that for all T and _e_, Ht = (3n2/2m)1/2t1/2, t = 1,2,. .., minimizes supT T‘WUT, and that T 2 _ —(/3n2mT/8|B[ S gc‘Epa(Et‘l+(]§E-%n—llz)—¢(ET) S ‘/3n2mT/2|B|, (2.7) Hannan [17, Theorem 6] produced bounds which are a uniform improvement of those in (2.7). He showed that —\/1.5mT|B| g ie‘EudEt'l + ([6—(t—flE—llz) - (MET) S V6mT|B|. (2.8) 2.1.3 The k-extended game Consider the above two-person game. Player I has m pure strategies and player II has n pure strategies and the loss matrix A satisfies (2.1). Let X and Y be the set of player I and 11’s randomized strategies, respectively. Denote S = {Ay I y 6 Y}. For a: e X and s 6 S, ms is the risk of player II choosing strategy 3 against 2:. Suppose that this game occurs repeatedly. At each time t, let 1“ represent player I’s move, and 2:1, ..., 1:“1 are known to player II before he makes his move at time t. 10 Let s = (31, 32, . . . , 3T) be such that s, is a function from X"1 into S. For x“1 = (2:1, . . . ,T“1), the risk of sequence strategy 8 is T Z :r‘st(x“l). (2.9) i=1 When 8 = (s, s, . . . , s) for a fixed 3 6 S, the risk (2.9) becomes T T 222‘s = (Z T‘)s = XTs, t=l t=1 where X T is the sum of the randomized strategies of player I through time T. The definitions of Bayes response and Bayes envelope imply that ¢(XT) = XT0(XT). Let S" be a set of functions from X"‘1 into S. Let tt-l t—k-i-l , xt—l). x =(a: ,... If we consider 8 = (s‘, s‘, . . . , s‘) for a fixed 3" 6 S’, then the risk (2.9) for such an 3 reduces to 23;, :r‘s‘(x“'l). We term a k-extended envelope. We use Vardeman [26] ’5 Pk notation to define set S C [0, 00)“. of the form S = {59' 6 R"‘k|(:r""°+1 <8) 32"“2 ® ~-® x‘)§ = T‘s*(x‘t'1),for some 3" E S“ }, where mk-vector TH‘“ ® Tt‘k” ® -- - ® 3‘ is given by ((a.-“"+1®a:“"+2 a ~~-®x‘),-,,,,,-,, 31-6 {1,...,m}, i: 1, . ..,k), and i: t—k+l t—k-i-2 t _ t-k-i-t' (x m 69 mm )1.....-. — I122... - i=1 Let 6(a)) denote a positive homogeneous minimizer of ms. Denote T X; = Ext-k'i'l ® xt—k+2 ® _ . . ® xi. t=1 11 Then T min T‘s‘(x s‘ES‘ t=1 “-1) T = mip 2(xt-k'i'l ® $t_k+2 ® . . . ® $t)§ 565 i=1 = mip X5 565 = X;5(X;.). For an E Rmk, define ¢"(w) = w6(w). Thus the Ire-extended envelope is ¢’°(X%) = X%5(X4‘~)- 2.2 Lemmas For later application, we determine the behavior of the Harman recursive strategy when player I’s choices are randomized strategies :1.“ and 0(E"1 + H t‘lz) is replaced by a(X"l + Ht'lz), where X0 = 0 and For this purpose we reinterpret lemmas from Hannan [17] with 6‘ replaced by 2‘. Remark 2.2.1 Lemma 2.2.1 below will be an important tool in the proof of Theorems 2.3.1 and 2.3.3. Hannan [17] used it in the proof of (2.5) and (2.6). Lemma 2.2.1 Let v‘,t = 1,. . . ,T be any sequence of m-vectors. It follows that T T T 23:”: = XTvT-i—l + Ext—1(1): _ 11‘“) + 21,10} _ 01H). 1 1 1 Proof. T T XTvT+l + Ext-1h} _ vt+l) + thh)‘ _ vt-i-l) 1 1 T T T T = XTUT+1 + Ext—lvt _ Ext-lvH-l + Zztvt _ thvH-l l 1 1 l 12 T T T Xtvt-i-l _ ZXt—IUHI + Zztvt _ Extvt+l 1 1 1 T T + Z xtvt _ Z xtvt+l 1 l n “M“! “M“! “M” H s: i a ('5 e6 Cl Remark 2.2.2 Lemma 2. 2.2 below will be used in the proof of Theorems 2. 3.1, 2.3.2 and 2. 3.3. Lemma 2.2.2 If (2.3) holds, then Xt'l Xt t - l t — 2 “HEY-‘HTHSF‘HT Proof. 
XH Xt m x:-1 X,.H+x: "—m-‘zll = XI t-1——t——[ H H g, H H m 1 1 T‘- _ t-l i - 2"“ (F'H’tl'ifi 1 1 1 S (t-1)('fifi-§:)+F < t—l_t—2 — Ht-l Ht E] Remark 2.2.3 The proofs of Lemmas 2.2.3 and 2.2.4 that follow are reinterpreta- tions of (6.8)-(6.10) of Harman [17]. He proved the results when player I uses pure strategies g under the assumption (without loss of generality) that for any 6 ¢(e) = jeiiliiin} eAj = 0. (2.10) Lemma 2.2.3 Suppose that the sequence H t is positive and nondecreasing and that 0‘ is the recursive strategy 0‘ := (7(X‘"1 + Ht'lz), t = 1,2, . . .. (2.11) 13 It follows that T XXV - at“) 2 -HTllzlllBl. 1 where IE] is defined in (2.2). Proof. From the definition of a‘, we have for each t (X‘ + H‘z)(o‘ — at“) 2 0. Then xt(0.t _ Ut+l) 2 _th(0.t _ 0“”). So ixt(at_ 0.)t+1 l T T 2: H‘zo‘+1 — Z H‘zot 1 I IV T-1 T HTZUT+1 + z: that-i-l __ thZUt 1 1 T T = HTon+1 + 2Ht—lzo‘ — ZH‘zo‘ 2 1 T = HTZUT+1 _ H0201 _ 2(Ht _ Ht—l)za,t 1 = HTz(0T+l — a0) — H0210}l — a0) —ZT:(Ht— Ht 1) 2(0 — a0) where a0 = (a9, . . . ,a?,,), and for each i a9: min a,,-, (2.12) j€{l.~ ..n} where ajj is the element of loss matrix A. Since for any t, 0‘ - a0 2 0, H ‘ 2 0 and H ‘ is non-decreasing, then we have HTz(aT+1 — a0) 2 0. 0 S Hot/.(o1 - a0) 5 H°||z|||B|. 0 S (Ht - H"1)Z(0' - 0o) S (H‘ - Ht-INIZIIIBI- 14 Therefore T T ZX‘(0‘-0‘“) Z -H°IIZIIIBI-Z(H‘-H"1)IIZIIIB| l 1 = —HT||z|||B|. :1 Lemma 2.2.4 Suppose that the sequence H t is positive and nondecreasing and (2.11) holds. It follows that xTaT“ — ¢ + )T:X‘-1(o*— W) s HTllzlllBl. 1 where [B] is defined in (2.2). Proof. Since (X T + HTZ)[0T+1 — 0(XT)] S 0, then XTaT“ — ¢(XT) 5 —HT z[aT+1 — 0(XT)]. (2.13) It follows from (2.11) that (X‘-1 + Ht-lz)(ot — at“) S 0. Then Xt—l(0.t _ 0"”) S _Ht—lz(at _ at-i-l). So T HTon+1 — Hozo1 — EXHt — H“‘)za‘+l. (2.14) 1 15 It follows from (2.13) and (2.14) that 7‘ XTUT+1 _ ¢(XT) + Ext—1(at _ (TH-l) l T S ,HTzo(XT) — H0201 — 2(H‘ — H‘_l)zot+l 1 = HTz[o(XT) - a0] — H0207l — a0) Z—(H‘— Ht 1 .z)(a‘+1 — a0) 3 HTIIzlllBI- where a° is defined in (2.12). The second last inequality follows from the definitions of H‘, a0 and |B|. El Remark 2.2.4 Lemma 2.2.5 below is a slight modification of Lemma 1 of Hannan [17]. It will be used to prove Theorem 2.3.3. Lemma 2.2.5 If w and w' are m-vectors and (2.4) holds, then Eu|o(w + z) -— 0(w' + 2)] _<_ L-gilBlllw' —- w||, where [B] is defined in (2.2). Proof. Let T1), = {zlo(w + z) = Aj,o(w’ + z) = A"}, and note that Eulo(w+z) -a(w’+z)| = ZIA""In(T #1: = Z [Ajkill‘(7:ik) + #(Tkjll- j (73181 with the lower bound obtained for H‘= g1=1,2,.... m [3 Remark 2.3.3 Theorem 2. 3.3 below is a modification of Theorem 3 of Hannan [17]. In Theorem 2.3.3 the Eu expectation is outside the absolute value rather than inside it. Theorem 2.3.3 Suppose that H ‘ satisfies (2.3) and (2.4) holds. Then for all T and £1 T t t 1 t 1 T T n2 T 2 - _ _ < _ _ _ _ . EHIZI:TU(X +H 2) ¢(X )| _ H 0|B| +L 2 (g H‘ HT)|B| where 0 = EHIIzll, and [BI is defined in (2.2). Proof. By Lemma 2.2.1, we start with T T ZTW —¢(XT) = [XTUTJ'l — ¢(XT)] + ZX"I(U‘ — 0“”) + 219(0‘ — at“) 1 1 I: 31 + $2 + 53. It follows from (2.19) and (2.20) that T [51+ 52 + S3] S HT||z|||B| + IZT‘(0‘ — ot+1)|. (2.25) 1 20 Taking expectation on both sides of (2.25), we get T EH|31+ S; + Sgl g HT0|B| + E,,|Z::1:’(at — a‘+1)|. (2.26) 1 Applying Lemma 2.2.5 with w = )("1/2H"l and w’ = X’/H’, xt-l xt Elliot_ at+li < Ln— :iBiilHt_1 ...-IF”. 
It follows from Lemma 2.2.2 that T T Eulzz’(0’-0’“)l S Ellz‘llEul0’-0’+ll l l n2 T t—l t—2 S L-H-IBIZQfiZT—TIT) TT) 5 L—IBI(ZT:— H, — — (2.27) Therefore, Theorem 2.3.3 follows from (2.26) and (2.27). El Remark 2.3.4 Theorem 2. 3.4 below is a modification of Theorem 4 of Hannan [17]. Similarly we can show that for a fixed u, if T is known to player II, then Ht = (n2/20)’/2T’/2, t = 1, . . . ,T, minimizes supT T'1/"’UT (H). Theorem 2.3.4 If p is the uniform distribution, then, for all T and T, Es firm/X” + «mfg—1):) — ¢(XT)| 5 «W181. 1 where [B] is defined in (2.2). Proof. Since [1 is the uniform distribution, the result of Theorem 2.3.3 becomes T EHI ZT’0(X"’ + H"’z) — ¢(XT)|< 1 Let T U =+HTmIBI 712(22 T — _ 21 where E = (H 1, . . . , HT). Suppose T is unknown to player II, we want to obtain an 11 minimizing sup T’l/zUTLIi) T Harman [17, Section 7] showed that sup T’l/zUT(_I_I_) 2 (3n2m)‘/"’|B| (2.28) T and this lower bound is obtained for £11 with 2 H‘=‘/§—"—3 t=1,2,--- (2.29) m Therefore, (2.28) and (2.29) imply the result of Theorem 2.3.4. C] Remark 2.3.5 Theorem 2.3.5 below will be used to construct k-eztended predicting strategies in Chapter 4. Theorem 2.3.5 Let p be the uniform distribution. Assume that the cardinality of X is d, and d < oo. Denote 3n2(T(x*“1) —- 1) at = 0(X,"_1|x"‘—l +(/ z), where t—l th...1|x“_l = 2 $1" and t-l T(x‘t—l) = Z [{xtj-lzxot—l}, i=1 Then for all T and g, T Eu] Zz‘a, — ¢’=(X;.)| g 1/3n2mdk-1TIBI, l where IE] is defined in (2.2). 22 Proof. Lemma 2.2.6 implies that 21? 0. <15" XT) = Z [( 2 x‘a¢)—¢(X§‘~|x)] zexl‘“ t:x""l=z := 2 11(3). :1:€X“"l Denote the indices t for which x“‘1 = 2: by t; < t2 < . .. < tux). Then = 2 z“a(X,':_1|a: + i=1 T(x) 3 2 . -l 3&4» - «man. By Theorem 2.3.4, we have that E#|A(2:)| 5 ‘/3n2mT(:L')|B|. Applying Schwarz inequality, Efllzxat- QS"( (XT)| S 2 ‘/3n2mT(2:)|B| 3610'"1 g \/3n2md"-1T|B|. Hence the desired result follows. 2.4 An extension to an infinite game In this section we adapt the Harman recursive strategies to produce strategies for the repeated play of an infinite component game. The infinite component game is general enough to cover the on—line allocation, multi-armed bandit and expert selection problems as will be shown in Chapters 3 and 4. Consider an infinite component game where player I chooses a pure strategy :3: from = [0,1]" and player II chooses a pure strategy 37 from Y={e1,...,e,,}, 23 the set of standard base vectors in n-space. Suppose that player II’s inutility is given by the loss function L(2,g) = 23;, 2 e x, g e 1?, (2.30) where we have used the juxtaposition of vectors in n-space to indicate the ordinary inner product. Theorem 2.4.1 There exists a finite 2”Xn game with loss matrix A and a one to one mapping f from 2? into X, the (2" — 1)-dimensional simplex of randomized strategies for player I in the finite game, such that f(2)AJ' = 2e,- = 2,, 2 e x, j = 1, . . ,.n (231) Proof. We see that f(5;)AJ' = 2,, j =1,...,n if the f (:2) mixture of the rows of A, f (5:)A, is equal to :3. Thus, if the convex hull of the rows of A is equal to X = [0, 1]", then (2.31) is satisfied with f, any mapping where f (5:) is a mixing distribution of the rows that gives mixture 5:. The minimum generating set for [0,1]" is the set of 2” vertices of this cube. We take A to be any matrix whose row vectors are the 2" vertices and f as described above. This A has no dominant columns since the left hand side of (2.1) is 1. It has no duplicate or dominated columns. C] The infinite game it = [0, 1]“, Y 2 {e1, . . . 
, en} and L(§:, g) = 237 has randomized strategies 5? = set of all probability distributions on X and i” = Y = (n — 1) — dimensional simplex of probability distribution on n points. Since the loss function is linear in 5: and X is convex, the risks from randomized strategies 23“ are the same as those from the means. Thus, the extension of the 24 infinite game through its randomized strategies remains ”isomorphic” to the restricted extension of the finite game X, Y = Y‘, A, the restriction being X replaced by f [X] For any if E X‘ and gr 6 ?‘ there exists an a: 6 X, namely, :1: = f (mean of cit‘) and a y E Y, namely, y = 37' such that L(a‘:‘, 37‘ = :rAy. Moreover, the Bayes envelope risk in the finite game at a: = f (it) is given by 45(23) = min 571', J the Bayes envelope in the infinite game at :i‘. Suppose the infinite component game is played repeatedly with :E‘ and gt denoting the choices of strategies by player I and II, respectively, at time t = 1, ..., T. We suppress the display of dependence on T in writing =(21,...,2T) and g=(gl,...,*"). IR) Remark 2.4.1 In the repeated play of the infinite component game, suppose that player II uses a Hannan recursive strategy g‘ = e, if 0(X“1 +Ht‘lz) = A’, j = 1,2,..., where a is a positive homogeneous pure strategy valued Bayes response in the finite game described in Theorem 2.4.1, and t X0 :0, X‘ = Zx‘, 2‘ =f(:i:‘) t= 1,2,.... 1 Then the modified regret in the repeated play of the infinite game at i is T T T zs‘g‘ — mjin: ‘3. = Zx‘dXt’l + Ht‘lz) — ¢(XT), t=l t=l t=l that is, it is the same as the modified regret in the finite game at 3;. Hence the results of Theorem 2. 3.1-2. 3.5 cover repeated play of the infinite component game. Of course m = 2" and |B| = 1 in this adaptation. Chapter 3 On-line allocation model and the multi-armed bandit problem 3.1 Introduction to on-line allocation model 3.1.1 Component allocation game We consider the following component allocation game in which player I selects a loss vector l 6 [0,1]" and player II selects a probability distribution p on n points with loss L, = 2'; l,-p,-. We recognize this as an example where player I’s pure strategies are in the cube x = [0,1]" and player II’s pure strategies are in the finite set Y = {e1,...,e,,}. By Theorem 2.4.1, there is a finite game isomorphic to the component allocation game. 3.1.2 On—line allocation model Suppose the component allocation game is played repeatedly with It and p‘ denoting the choices of strategies by player I and II, respectively, at time t = 1, ..., T. This 25 26 repeated game was studied by Freund and Schapire [15]. They called this game the on-line allocation model. Freund and Schapire [15] formalized the on—line allocation model as follows. The allocation agent A has n strategies to choose from. At each time t = 1, 2, .. ., T, the allocator A choose a probability distribution 19‘ = ‘1.p3,---.p$.) over the 12 strategies, where pf is the probability that strategy 2' will be chosen, for each i and t. Each strategy 2' suffers some loss I]. Denote loss vector It by = (1], (2,“ 1;). The loss suffered by A at time t is defined as 217‘”— that is, the expected loss of the strategies with respect to A’s chosen allocation rule. Denote the expected total loss of A across the first T trials by T E [L5] = Zp‘l‘. t=l The goal of A is to minimize E[L§ ] — lmi [0,1]. At each time t, for each i, denote the prediction of expert B,- by B}, and denote the outcome by 3],. Then L(B§, y) is the loss of expert B,- at time t. 
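Before continuing with this expert-selection application, a minimal simulation sketch of the Hannan-type choice rule described above may help fix ideas. It is illustrative only: the function and variable names are ours, not the thesis's notation, and the perturbation scale and the 2^n-vertex construction follow our reading of Theorem 2.4.1 and the (garbled) displays around Theorem 2.3.2, so the constants should not be taken as definitive.

```python
import itertools
import numpy as np

def hannan_allocation(loss_vectors, seed=0):
    """Illustrative sketch: pick one of n strategies each round by playing
    Bayes against a randomly perturbed vector of cumulative losses.

    loss_vectors: array of shape (T, n) with entries in [0, 1] (the l^t).
    In the companion finite game of Theorem 2.4.1 the perturbation added to
    strategy j is H^{t-1} z A^j, i.e. H^{t-1} times the sum of the uniforms
    attached to the cube vertices whose j-th coordinate is 1.
    """
    rng = np.random.default_rng(seed)
    loss_vectors = np.asarray(loss_vectors, dtype=float)
    T, n = loss_vectors.shape
    vertices = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)  # rows of A
    cum = np.zeros(n)                        # cumulative loss of each strategy so far
    total = 0.0
    for t in range(1, T + 1):
        H = np.sqrt(6.0 * (t - 1) / 2 ** n)  # scale as we read the displays; not definitive
        z = rng.uniform(size=2 ** n)         # one uniform per vertex of [0, 1]^n
        j = int(np.argmin(cum + H * (z @ vertices)))   # Bayes response to the perturbed past
        total += loss_vectors[t - 1, j]
        cum += loss_vectors[t - 1]
    return total, float(cum.min())           # realized total loss and the envelope min_i sum_t l_i^t
```

Averaging the excess total - cum.min() over repeated runs approximates the expected modified regret, which the bounds of this chapter control at the order of the square root of 2^n T.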
If at each time t, the decision-maker chooses B: to predict y, with probability pg, where p: is defined in Algorithm H with If = L(B,?, yt). Then by Theorem 3.2.1, we have that the expected average loss of the decision-maker approaches the average loss of the best expert at the rate of 0(T‘1/2). Note that if the loss function L is convex with respect to d 6 ’D, then at time t, let the decision-maker predict with a nonrandomized prediction 2;, pfiBf. The convexity implies that n n 14210332, ye) S 21931:. i=1 i=1 It follows from Theorem 3.2.1 that T n T ‘. F ' i ‘/ +1,/ 1L(Zp,B,,yt) _<_ 1131511"; I, + 2n 3T. t: i=1 We will continue the discussion of prediction using expert advice in Chapter 4. 32 3.3 Introduction to the multi-armed bandit prob- lem 3.3.1 Component multi-armed bandit game We consider the following component multi-armed bandit game in which player I selects a reward vector b 6 [0,1]” and player II selects a probability distribution p on n points with gain LP = Z? b,p,-. We recognize this as an example where player I’s pure strategies are in the cube x = [0,1]" and player II’s pure strategies are in the finite set Y = {e1,...,e,,}. By Theorem 2.4.1, there is a finite game isomorphic to the component multi-armed bandit game. 3.3.2 The multi-armed bandit problem We are going to study the repeated play of the component multi-armed bandit game, that is, the multi—armed bandit problem. In the multi-armed bandit problem, origi- nally proposed by Robbins [24], a gambler must decide which arm of n non-identical slot machines to play. At each trial, he plays one arm and receives a reward (maybe nonpositive). The goal of the gambler is to maximize his total reward over in a sequence of plays. Lai and Robbins [22] studied this problem using statistical assumptions about the rewards of the slot machines. They assumed that the distribution of rewards associated with each arm is fixed and does not depend on the number of trials T. They bounded the difference between the expected total rewards of the player and the maximum of the expected total rewards of any arm with 0(logT). Auer, Cesa-Bianchi, Freund and Schapire [2] presented a variant of the bandit problem in which no statistical assumptions are made about the generation of rewards. 33 They only assume that the rewards are bounded. They formalized the multi-armed bandit problem as a game between a player choosing actions and an adversary with knowledge of past plays choosing the rewards associated with each action. Assume action space is { 1, ..., n} and all the rewards belong to the interval [0,1]. They defined the full information game and the partial information game. In the full information game, at each trial t = 1,2,. . .,T: 1. The adversary selects a vector of the current rewards b‘=(b‘1,...,bf,), where for each i, bf is the reward associated with action i at trial t. 2. Without knowing b‘, the player chooses an action it 6 {1, 2,. . . , n} and get the corresponding reward bi. 3. The player observes bt after he makes the action it. The partial information game also consists of three steps. All the steps are the same as that in the full information game except step 3 is replaced by: The player only observes b}, after he makes the action it. They presented an algorithm, called Hedge, for the full information game. 
Al- gorithm Hedge is a slight variant of Algorithm H edge(,6) described in Section 3.1.2 The idea of Hedge is to choose action i at time t with probability p‘- : (1 + a)’: ’ ELI“ + “)8: , where a > O is a parameter and t t+l _ E : t Si — bio t=1 Note that each reward b: is defined as a random variable on the set of player’s actions up to trial t— 1. The measure of the performance of any algorithm, say A, is E (G A) — Gk“, where T E(GA) = Emu-JAE big], t=l 34 and T Gbest = mag; Ei1,...,iq~[z b;]- (3-3) ' i=1 15: Auer, Cesa-Bianchi, Freund and Schapire [2, Theorem 3.2] showed that in the full information game for a > 0, 01 lnn E(GHed9¢) Z Gbcst — 5056“ - :— For an appropriate choice of a, which is depends on Gbeu, the difference between E (G Hedge) and Gm, is at least —\/2Tlnn in the full information game. Auer, Cesa—Bianchi, Freund and Schapire [2, Section 4] also gave an algorithm for the partial information game based on Hedge. 3.4 Application to the multi-armed bandit prob- lem In this section, we investigate an algorithm for player in the full information game un- der the assumption that each reward bf does not depend on past play. Our algorithm is described in Figure 3.2. Theorem 3.4.1 If 13‘ and b‘ are defined in Algorithm H, then the expected gain of Algorithm H satisfies T -x/2n+1‘/3T _<_ E(GH) — 11:113.} 2b;- 3 \/2'H\/3T, ' 'nt=l where E(GH) = 9:1133b;. Proof. The proof is similar to that of Theorem 3.2.1 with loss replaced by gain. By Theorem 2.4.1, there is a finite game isomorphic to this component game. Let f be a one to one mapping in Theorem 2.4.1 associated with matrxi A defined in Figure 3.2. At each time t, player I chooses a strategy it such that :t‘ = (b§,...,bf,), 35 Figure 3.2: Algorithm H Algorithm H Choose initial probability vector 131 Repeat for t = 2,3,... 1. Choose action it according to the distribution 13‘, where - "1 , , 6t—1 . . . p;- = Mite.- — b.) s —(—2,—)2(AJ — A“).Vz}. 8:1 and z = (Zl,...,22n), 21, ..., 22.. are i.i.d U(0, 1) under u, and A1, . . .,A" are the columns of matrix A. A is defined in Algorithm H. 2. Receive the reward vector b‘. and player II’s strategy is a random vector 37‘ taking value from a set Y = {e1,...,e,,} such that lit = e) if 0(X“l + 99%z) —Aj in the corresponding finite game, where z and Al are defined in Figure 3.2, t X0 = 0, X‘ = 212’, 12‘: f(fitt). 1 It follows from Theorem 2.4.1 that T 5%? + max 2; 5:; J t=l :1:‘¢7(X"'l + 6(t2—lez) -— ¢(XT). Ms EM“! H II y—s 36 It follows from Theorem 2.3.2 that T —\/2"—13T g E, Zita-(X‘ 1+‘/6L2:—)z—)¢()3XT 2n+13T. t=1 By the definition of p~t and 37‘, we have for any j = l, .. ., n #{9' = 31'} = p{g(Xt-l + 6_(%:_1)Z) = _A1} =u{[Zf(i ’)+ ‘”;‘,,21).zJ,’(—Aj)s[:-fw)+6“ 1) z](-— -A") V} It follows from .. = (b§,...,bf,), and f(a‘r‘M" = “$- that 1437:6311 =u{§(b:—b;)s 6“; ,lz’w—A A‘), W} _figm Thus Therefore, Theorem 3.4.1 follows. [3 Remark 3.4.1 Algorithm H gives both an upper bound and a lower bound of the modified regret at the rate of 0(T1/2) without knowledge of T or sz. Suppose the player has access to the Opinion of a set of K experts. At each trial t, before choosing an action, the player is provided with a set of K probability vectors (9(1), . . . ,£‘(K)). For each j, £‘(j) is the advice of expert j on trial t, that is 37 and £f( j) is the recommended probability of choosing action 2' by expert j. After receiving the reward vector b‘, the expected reward for expert j is 5‘( j )b‘. 
At each time t, we apply Algorithm H with new reward vector (€‘(llb‘, - - - .€‘(K)b‘)- It follows from Theorem 3.4.1 that T -\/2K+1\/3T _<_ E(G'H) — ltggZE‘UW S V2K‘1\/3T. - - t=l Therefore, using Algorithm H, the expected average reward of the player will ap- proach to T t - t 1131?; g E (2)1). which is the average reward of the best expert, at the rate of 0(T‘1/2). Chapter 4 Prediction using expert advice 4.1 Introduction In this chapter we review some of the literature dealing with prediction using expert advice. Section 4.2 is a self-contained review. However, the main thrust of this chapter is to show how the Harman recursive strategies apply in a prediction problem considered by Foster and Vohra [13]. In particular, in Section 4.3 we prove that a strategy for n = 2 experts investigated by Foster and Vohra [13] is an exmaple of a Harman strategy and we deduce a slightly stronger result than was established in their Theorem 1. Section 4.4 gives a solution for the n expert case. Finally, Section 4.5 applies the k-extended approach to the prediction problem. 4.2 Literature review Consider the following prediction problem. Suppose no statistical assumptions made about the actual sequence g=(y1....,yr) (4.1) of outcomes that is observed. At each time t = 1, ..., the decision-maker must predict the value of y,. Before making the prediction, the decision-maker is given the Predictions of n experts. 38 39 This decision problem can be viewed as the following game between two players, the decision-maker and nature. At each time t = l, 2, . . ., 1. Each expert B,, i = 1,...,n, makes a prediction B: 6 D, where D is the prediction space. 2. The decision-maker, who has the knowledge of all Bf, i = 1, ..., n, s g t, and past outcomes yl, ..., y¢_1, makes his prediction 3), 6 D. 3. The nature chooses some outcome y; E G, where 6 is the outcome space. 4. Each expert B,-, i = 1,..., n, incurs loss L(B,?, 3],) and the decision-maker incurs loss L(3)¢, 3],), where L : Dxe -> [0, 00) is the loss function. Suppose at each time t, the decision-maker uses a prediction algorithm A to make prediction 3),, then loss of the decision-maker is equal to the loss of the algorithm A. Define the total loss of the algorithm A on a sequence of trials with respect to a sequence of outcomes g to be T ylzg Myra yr) (4-2) where g is defined in (4.1). Similarly the total loss of the expert B,- with respect to g is defined to be T Lg. (g) = 2: L035. ye)- t=1 The goal of the decision-maker is to find an algorithm A to minimize Lia) - min _L£,(y) (4.3) The goal of the nature is to maximize (4.3). So the min/ max strategy for the decision- maker is the algorithm that minimizes the maximum of (4.3) over all outcome se- quences. Vovk [27] introduced a general on-line prediction algorithm when the outcomes are binary. For the case with continuous-valued predictions, Vovk proved for a large class of loss functions bounds of the form Lag) - 112531"ng (g) g cLlnn, 40 where cl, is a positive constant determined by the loss function L. In the rest of this section, we give a literature review according to the following four cases. Case 1. D = [0,1], 6 = {0, 1} and [31,...,B,,} is the flute set of experts. Haussler, Kivinen and Warmuth [18] studied the Generic Algorithm first introduced by Vovk. At each time t, the Generic Algorithm predicts with any value 9. 
that satisfies for y = 0 and y = 1 the condition A n w .e—fiL(B::Ul [IQ/my) S ‘61" t" , (4.4) n i=1 i=1 wtii where c and 17 are any two positive constants, for any i, — L 39, 1013 > 0, wt+l,i = ware " ( ' m). Haussler, Kivinen and Warmuth [18, Theorem 3.11] showed that if L is a loss function such that CL == sup 3(zlL’1(z>2-La(z)La(z)2 L6<2>L¥ - Leonie) where Lo(z) = L(0,z) and L1(z) = L(1,z), then it follows from by applying the < 00, (4.5) Generic Algorithm with ml... = 1, c = CI, and r) = l/cL that for any T, 811p [Lgeneric(y) - 1127.21" LE,- (3)] S CLlnn! (46) y E {0, 1}" - .. B‘ 6 [0,1]" where Bt = (Bf, . . .,Bg). o For logarithmic loss, that is defined by L(§, y) = ylng + (1 — y)ln]—:%, (4.6) holds with cL = 1. o For squared loss, that is defined by L(g), y) = (ii—y)2, (4.6) holds with c1, = l / 2. Foster [12] also proposed a prediction method with the loss function defined as squared loss. At each time t, the decision-maker predicts with n 3}: = Z 10253:, i=1 41 where (wt,1,. . . , wm) minimizes 332": «1).-B: - y.)2 + 5:10? 9:1 i=1 i=1 over all probability vectors w. It follows from Foster [12, Theorem 1] that for any outcome sequence g and any probability vector w, T T n 2(3): - yt)2 - 2(2101'3: - 11:)2 _<_ 2 + nlnn(T + 1). i=1 t=l i=1 Therefore using this strategy, the decision-maker can perform as well as any convex combination of the n experts, namely, the difference between the average loss of the decision-maker and the average loss of any convex combination of the n experts will converge to 0 as T goes to 00. For absolute loss, that is defined by L(37, y) = [37 — y], c1, = 00. Then the Generic Algorithm can not be applies directly. Cesa-Bianchi et a1 [7] studies some variants of the Generic Algorithm for the absolute loss. Cesa-Bianchi et a1 [7] constructed a min/max strategy, called Algorithm MM, for the decision-maker when the loss function is absolute loss. At each time t = 1,. . . ,T, Algorithm MM predicts with 3,7 v(M'+Z",r—1)-—v(M'+1—r,r—1)+1 t: 1 2 where r =T+l—t, Z” = (B[,...,B,‘,) 2-1 Mr = (M{,...,M,',), M]? = o M; = ZIB; —y.l, VJ‘. 8:1 and v is defined inductively as v(M,0) = lrsnjign Mj, — M 1 — , —1 ”(M:T)= min v(M+z,r l)+v( + zr ) z€[0,l] 2 Cesa-Bianchi et a1 [7 , Theorem 2] showed that for any set of n experts and for any outcome sequence g, T T _ - T _ _ LMMQI) 115111.13" LB,- (y) S 2 ”(O’Tla 42 and Algorithm MM achieves the value of the game, €- — v(0, T). So Algorithm MM is a min/ max strategy. The disadvantage of the Algorithm MM is that T must be known in the be- ginning of the game. And the algorithm is computational expensive since the calculation of v(M, r) involves minimizing a recursively defined function over all choices of z 6 [0,1]". Therefore simple algorithms, Algorithm P, Algorithm P’ and Algorithm P‘, were introduced so that they can be implemented efficiently. Algorithm P works as following. At each time t, Algorithm P makes a prediction :03 = F16 (7'3), where fl 6 [0,1), 22:1 wt,iB: Tt = n {:1 wt,‘ 1 w...- = 1. w¢+1,.- = w1,.-Up(|Bf - ytl). Vi. and Fp(7‘) and Ufi(Q) be any functions such that ln((1— r)fl + r) —ln(l — r + rfi) SF 1‘ S . 2214,37,) ”( ) zznfifi) 1+ forOSr_<_1,and 3" S U301) S 1-(1-filq, forogqgl. The performance of Algorithm P depends on the parameter fl. Cesa-Bianchi et a1 [7] showed how to choose fl in according to the type of knowledge available to the decision-maker. 
* If the decision-maker knows an upper bound on the total loss of the best expert, Cesa-Bianchi et a1 [7, Theorem 15] showed that for any K _>_ 0, taking fl = g(‘/‘"7"-), where (4.7) 43 for any set of 11 experts and any outcome sequence 3) such that min Lg (y)__ < K, l 0, wt+l,i = were Y: {0, 1, B[,.. .,B‘}. Haussler, Kivinen and Warmuth [18, Theorem 4.7] showed that for any outcome sequence 31 and any i, the Vee algorithm satisfies Luvs—mm. J+n_m.inn_1T;L (ml/12ml,2 i=lw -,1. 45 Case 3. D = [0, 1], 6 = {0,1}, and there are unaccountably infinite set of experts. Heund [14] generalized the Weighted Majority algorithm of Littlestone and Warmuth [23] to the case in which there are unaccountably infinite set of experts. The algorithm he used is called the exponential weights (EW) algorithm. The EW Algorithm gives a prediction 3’], that is any value in [0, 1], such that for y E {0, l}, 1 1 may.) 3 “If. e-%Ldn(p>1/1 /0 am». (4.10) where for each t, c is a positive parameter, and the measure p¢(A) is defined as #t+1(A) = Le—%L(del‘t(1’)a and 111 is a probability measure on [0, 1]. Assumption 4.2.1 Suppose the loss function satisfies the following properties 0 3 c and r) = l/c, such that (4.4) holds. 0 V y, L(p, g) has a continuous second derivative with respect to p. o 315 : [0,1] —> [0,1], 13(6) is the unique minimizer of T z L(pi yt) t=l over all outcome sequence g whose empirical distribution is 9. Choose the initial probability measure to be p1(A) = [A w(:1:)d:c, where 82 1 1003) = ‘3 ja'p—zghipllwflx): 1 82 z = [o lwgamnansdx, (4.11) and gum) = Elmo, 1) - Lo.1))+ (1 — x)(L 2(1 — 1)‘(z - 1/2) y = B; if DH 5 2(t — 1)‘(z — 1/2), where z is U (0, 1) random variable under 11. Foster and Vohra indicated that if the decision-maker uses the randomized strat- egy C, then the difference between the average loss of the decision-maker and the minimum average loss of the two experts will be bounded by a nonnegative random variable 67, and 67' goes to 0 in probability as T goes to 00. On its face, this random- ized strategy C is similar to the Harman [17]’s recursive strategies for player II in a 49 two-person finite game, that is player II plays Bayes versus a randomized perturba- tion of the player I ’s empirical distribution. We will prove Theorem 1 of Foster and Vohra [13] using the game theory results of Chapter 2. Theorem 1 of Foster and Vohra [13] If Bl and 32 are two experts with bounded loss bi, b; for all t, and the strategy C is defined as above, then 1 T 1 T T TZT S Tm1n(2b[,2b§) +61, 1 1 1 and 0 < 5T 1’1 0, where E, is the loss of strategy C at time t. Proof. Without loss of generality, for all t, let 0311531 and 0311.331. Consider a finite two-person game in which player I has 4 pure strategies, player 11 has 2 pure strategies, and the loss matrix A is defined as following: (00) 1o 01 [11} Let 2 = (21,22,23,z4). Under a probability measure )1, 21, 22, 23 and 24 are independent. 2,- is U (0, 1) for i = 1, 2 and 4. 23 is degenerate at 1/2. ( o ) 1 Since A" = A” - A' = , for l S q < r g 2, (2.4) becomes —1 \0/ II{Z|t1322—23$t2}St2-t1- So (2.4) holds with L = 1. a = Eullzll = m/2 = 2. 50 Consider that player I chooses randomized strategies T from the class it that consists of all 3‘ such that for each t, It = {(1_ btl)(1 — b2):bt1(1_ b2): (1 - btl)b21btlb2} For any t and z' = 1, 2, :IJ‘Ai = b:. Then the Bayes envelope ¢(XT) becomes T T_ - T _ - t ¢(X )— mm X a—mirgz bi. 
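Before turning to Case 3, a small sketch of the multiplicative-weight forecaster underlying Algorithm P (related to the Weighted Majority and aggregating algorithms above) may be useful. It is a sketch only: the admissible interval for the prediction is our reading of the garbled display for F_β given earlier in this case, the midpoint of that interval is just one admissible choice, and the helper names are ours rather than anything from the papers cited.

```python
import math

def algorithm_p_sketch(expert_preds, outcomes, beta=0.8):
    """Multiplicative-weight forecaster in the style of Algorithm P.

    expert_preds: per-trial lists of n expert predictions, each in [0, 1]
    outcomes:     per-trial outcomes in {0, 1}
    Weights are updated by U_beta(q) = beta ** q, the smallest allowed
    update, and the prediction is the midpoint of the interval in which
    F_beta(r_t) is required to lie, clamped back to [0, 1].
    """
    n = len(expert_preds[0])
    w = [1.0] * n                                             # w_{1,i} = 1
    c = 2.0 * math.log(2.0 / (1.0 + beta))
    total_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        r = sum(wi * p for wi, p in zip(w, preds)) / sum(w)   # weighted mean r_t
        lower = 1.0 + math.log((1.0 - r) * beta + r) / c
        upper = -math.log(1.0 - r + r * beta) / c
        pred = min(1.0, max(0.0, 0.5 * (lower + upper)))      # one admissible F_beta(r_t)
        total_loss += abs(pred - y)                           # absolute loss
        w = [wi * beta ** abs(p - y) for wi, p in zip(w, preds)]
    return total_loss
```

For example, with two experts that always predict 0 and 1 respectively, the weight of whichever expert tracks the actual outcomes grows relative to the other, so the forecaster's predictions move toward that expert.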
UE{A1,A2} 1: , t=l If H ‘ = 2t’ and z is defined as above, we have t—l (Xt'l + lit-120Al = Z b’f + 2(t—1)’(z2 + 24), k=l and t-l (X‘“1 + H“"1z)A2 = 2 b; + 2(t — 1)‘(z3 + 24). k=l Then the Bayes response 0(X"1 + H "'12) becomes .41 if 1),.l > 2(t — 1)‘(22 — 1/2) 0(X“l + Ht'lz) = A2 if D¢_1 _<_ 2(t — 1)‘(22 - 1/2), where D¢_1 is defined in (4.12). Since $t0.(Xt-l + Ht-lz) = btl if Dt-l > 2(t " l)‘(22 " 1/2) b; if Dt-1 S 2(t-1)‘(22 - 1/2), the definition of the strategy C implies that 6, = :z:‘0(Xt'l + Ht'lz). Then applying Theorem 2.3.3 with m = 4, n = 2, [HI = 1, L = 1, 6 = 2, H‘ = 2t‘ and for any player I’s strategy 55‘ in i, we have T T T Eul 25¢ “ min(21:btla 2%“ 1 1 51 T = Efll 2::1:‘(I(Xt"1 + Ht"lz) — ¢(XT)| 1 2 __ l—s 2,, T )IBI 22 T < " — _ 2T 2|B|+ 2 (g 3 4T‘ + if". 1 3 By Markov inequality, we have that 1 T ~ . T t T P Tl th — m1n(z b1,2b§)| —) 0. 1 1 1 Hence the desired result follows. C] Remark 4.3.1 71-4 2? E, — min(2fb‘l, 2,7123” 5) 0 is slightly stronger than the result of Foster and Vohra [13, Theorem 1]. The optimal choice of s is 1/2, and the optimal convergence rate is 0(T‘1/2). 4.4 A generalization to more than two experts In the last section, we have proved that given the predictions of two experts, we can construct a randomized strategy for the decision-maker such that the decision-maker performs as well as the best experts in the sense that the difference between the average loss of the decision-maker and the average loss of the best expert converges to 0 in probability. In this section we consider how to construct a randomized strategy for the decision-maker when there are more than two experts. Suppose at time t, Bf, . . . , Bf, are the predictions of n experts, respectively. For any t and i, let outcome y, 6 9, B: e ’D. Suppose the loss function L is bounded. Denote b‘. = L(B:,y.), w, t. 8 Without loss of generality, we assume that 0 5 bf S l for all i and t. Suppose at each time t, for all i, the decision-maker has the knowledge of b}, .. ., bf“. We define a randomized strategy 0 for the decision-maker, which is described in Figure 4.1. 52 Figure 4.1: Strategy C' Strategy G Choose initial prediction from the set of {B}, i = 1, . . . , n} Repeat for t = 2, 3,... predict with B; if for any i, t-l 2 _ ‘ ' 2(1);- — bi) s \/§3—(§,,—1)z(A‘— .42), t=l where z = (21, . . . , 22»), and 21, ..., 22.. are i.i.d. U(0, 1) random variables under 11, A1, . . ., A" are the columns of matrix A. 011 "' aln A1 02"1 aznn A2» And A1, . . ., A2,. are the distinct sequences from W. W = {112" | w" = (1121,. . . ,wn), 112,6 {0,1},Vi}. In the next theorem we will show that using strategy C, the decision-maker will perform as well as the best experts among the n experts. Theorem 4.4.1 Let c. be the loss of the strategy 0 at time t, b: be the loss of the expert B.- at time t with ogfigIVLt Then as T goes to co, 1 T T t P Tl)?“ - £1,151!”ng -+ 0. and the convergence rate is 0(T‘1/2). Proof. We consider the two-person game infinite game described in Section 2.4. By Theorem 2.4.1, there is a finite game isomorphic to this infinite game. Let f be 53 a one to one mapping in Theorem 2.4.1 associated with the loss matrix A defined in Figure 4.1.. In the infinite game, at each time t, suppose player I chooses a strategy 52‘ such that it = (b§,...,bf,), and player II’s strategy is a random vector 9‘ taking value from a set Y = {e1,...,e,,} such that if 0(X"l + H‘"lz) = Aj in the corresponding finite game, where z and Al are defined in Figure 4.1, t X°=0, X‘sz‘, T‘=f(:i;‘), 1 2 H‘=‘/3;t t=1,2,... 
It follows from Theorem 2.4.1 that the modified regret of the allocation game is and the same as that of the finite game, i.e., T = 22:1:‘0(X"l + Ht‘lz) — ¢(XT) 1—1 By the definition of 32‘, we have T T 2 .. - 252.21% T = ZT‘UUC’I + Ht’lz) — ¢(XT) t=l It follows from Theorem 2.3.4 that T Eu| Z z‘a(X“l + H“1z) — ¢(XT)| g \/3n22“T. i=1 54 Thus T T _ ' t, 2 Eulgq lglgngb’l S V311 2"T. By Markov inequality, we have that 15an 1 T T t P Tlgq— m1n21:bj|—>O. Hence the desired result follows. 4.5 The k-extended prediction strategies 4.5.1 Introduction In the on-line prediction model described in Section 4.2, at each time t, the decision- maker predicts the outcome y, 6 6. The decision-maker has the knowledge of the predictions made by each of 11. experts, and makes a prediction based on the past and current expert predictions and the past outcome sequence. The goal of the decision- maker is to find a prediction strategy such that the total loss is as small as possible. Since no statistical assumptions are made about the distribution of the outcome sequence, a reasonable goal for the decision-maker is to perform as well as the best expert. In Chapter 3 and the first four sections of Chapter 4, different strategies have been introduced to give upper bounds on the difference between the total loss of the decision-maker and the total loss incurred by the best expert such that the average loss of the decision-maker approaches the average loss incurred by the best expert as T goes to 00. We observe that the average loss incurred by the best expert is equal to the Bayes enve10pe in a finite two-person game. It is of interest to find a strategy for the decision-maker such that the average loss of the decision-maker approaches a lower envelope as T goes to co. Herbster and Warmuth [20] considered the prediction model where 'D = [0,1], 8 = [0, 1] and loss function is L. They studied the case in which the outcome sequence is divided into at most 11: + 1 arbitrary segments. Each segment has a best 55 expert. The sequences of segments and its associated sequence of best experts is called a partitioning. Now the goal of the decision-maker is to perform well relative to the best partition. Let y = (y1,...,yT) be any outcome sequence. For any (t1,...,tk) such that 10. y€Qn(h-l) - — “bot-1:” and the convergence rate is 0(T‘1/2). Proof. Denote X = |{x‘ = {xf, . . . ,xfn}, xf = H[bfa,-¢ + (1 — bf)(1 - au)], Vi} l=1 59 Figure 4.2: Strategy 0" Strategy 0" Choose initial predictions 3),, t < h, from the set of {B}, Vi} Repeat fort: k,k+ 1,... predict with 2 “-1 __ . . 2;. = B;- if 2 (b; — b?) 5 1/3" (”b ,, l ”204' — A’Wi. s:b“"1=b"'l 2 where [ft-1: {bi—k-H bt—k+2 bt-l} b“1={bf‘l bH} t-l T(btt-l) = 21{boj-l:bot—l}. i=1 and z = (21, . . . ,Zzn), and 21, ..., 22.. are i.i.d. U(0, 1) random variables, A1, ..., A" are the columns of matrix A. 02H 02M; A2» And A1, . . ., A2» are the distinct sequences from W. W = {MI | w" = (w1,. . . ,wn), w,- E {0, 1},Vi}. 60 Since the loss of experts take values in a finite set Q, which has cardinality q, the cardinality of 5C is Q”. There exits a one to one mapping between Q" and 5f. Consider a finite two-person game in which player I has 2" pure strategies and player II has n pure strategies. The loss matrix is the matrix A in Figure 4.2. Suppose player I only takes strategies from i. Denote , _ 3n2 T x“"1 — 1 0t=0(th—1lxt 1+\/ ( (2,, ) )2): where t—l . X'tk-llx‘lt—1 = 2: 1J1 jzxej-l=xo¢-l and X‘t-l)=21{x.j-1=xu-1}. 
Then the definition of the strategy 0" implies that T 212190;: 2 ( Z 23%,) xeib-l t:x“-1=x 2(24) y€Qn(b-l) “bu—1:” It follows from Lemma 2.2.6 that ¢*(Xl‘~) = Z Milo 361""1 = 2: (min( 2 bf)). yeQn(h- l) t:'b""‘1=y Thus, applying Theorem 2.3.5 with [B] = 1, m = 2", and x‘ in i, we have T T alga: - 0% ”my”; b.,.)>| — "ME-Tat” ¢k(XT)l _<. \/3n22nqn(k—1)T. Therefore, by Markov inequality, Theorem 4.4.1 follows. 61 [3 Remark 4.5.1 2: < '( i b >> )3} min ,3, 5 min f. y€Qn(k—l) ' “be ‘-1:y 151$" t=l Example 4.5.1 Suppose there are two experts, and their loss sequence is as following: {(bl’ 1);), (bf, b3), (bg’ b3), (bf, b3), ' ° °} = {(0, 1), (1’ 0), (0’ 1), (1’ 0), ° ’ '} Then the simple envelope is ¢(XT)={ % ifT>landTisoddO if T is even For the simple case of k = 2, the k-extended envelope is 920%) = 0- Therefore, we see that the 2-extended envelope is indeed a lower envelope. Let T = 5000. To compare the average loss of strategy 02 with the average loss of the strategy C defined in Section 4.3, we used S-PLUS to determine: — c. = 0.5196078, 4998 ,=, 1 ”2°“ .2 — = 0. 4998 ,=, ‘ Thus, the 2-extended prediction strategy performs better for this sequence. Bibliography [1] G. Anandalingam and L. Chen. Linear combination of forecasts: a general Baye- [2] l3] [4] l5] l6] [7] [8] sain Model. Journal of Forecasting, V8, 199-214, 1989. P. Auer, N. Cesa—Bianchi, Y. Freund and R. E. Schapire. Gambling in a rigged casino: The adversarial muti-armed bandit problem, In Proceedings, 36th Annual Symposium on Foundations of Computer Science, November, 1995. R. J. Ballard. Extended rules for the sequence compound decision problem with m x n component, Doctoral Thesis at Michigan State University, 1974. R. J. Ballard and D. Gilliland. On the risk performance of extended sequence compound rules for classification between N(0, 1) and N (1, 1), J. Statist. Com- put. Simul., V6, 265-280, 1978. R. J. Ballard, D. Gilliland and J. Hannan. 0(N’1/2) convergence to k-extended Bayes risk in the sequence compound decision problem with m x n component, RM-333, Statistics and Probability, MSU, 1974. D. Blackwell. Controlled random walks. J. Proc. Int. Congr. Math. 3, 336-338, 1956. N. Cesa-Bianch, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire and M. K. Warmuth. How to use expert advice. Unpublished manuscript, 1995. R. T. Clemen. Combining forecasts: a review and annotated bibliography. Intl. J. Forecast 5, 559-583, 1989. 62 63 [9] R. T. Clemen and J. B. Guerard, Jr. Econometric GNP forecasts: incremental information relative to nave extrapolation. Intl. J. Forecast 5, 417-426, 1989 [10] R. T. Clemen, and R. L. Winkler. Combining economic forecasts. J. Bus. And Econ. Statist. 4, 39-46, 1986. [11] T. M. Cover and A. Shenhar. Compound Bayes predictors for sequences with ap- parent Markov Structure, IEEE Transactions on Systems, Man and Cybernetics, Vol. SMC-7, No.6, 1977. [12] D. P. Foster. Prediction in the worst case. The annals of Statistics, V19, No. 2, 1084-1090, 1991. [13] D. P. Foster and R. Vohra. A randomization rule for selecting forecasts. Opera- tions Research, 41(4):704-709, July-August 1993. [14] Y. Freund. Predicting a binary sequence almost as well as the optimal biased coin. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, 1996. [15] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learn- ing and an application to boosting . In Computational Learning Theory: Second European Conference, EuroCOLT ’95, 23-37, Springer-Verlag, 1995. [16] D. 
Gilliland and J. Hannan. On the extended compound decision problem. Ann. Math. Statist. 40: 1536-1541, 1969.

[17] J. Hannan. Approximation to Bayes risk in repeated play. In Contributions to the Theory of Games, V3, 97-139, Princeton University Press, 1957.

[18] D. Haussler, J. Kivinen, and M. K. Warmuth. Tight worst-case loss bounds for predicting with expert advice. In Computational Learning Theory: Second European Conference, EuroCOLT '95, 69-83, Springer-Verlag, 1995.

[19] D. P. Helmbold, R. E. Schapire, Y. Singer and M. K. Warmuth. On-line portfolio selection using multiplicative updates. Unpublished manuscript, 1996.

[20] M. Herbster and M. K. Warmuth. Tracking the best expert. In Proceedings of the Twelfth International Conference on Machine Learning, 286-294, 1995.

[21] M. V. Johns, Jr. Two-action compound decision problems. Proc. Fifth Berkeley Symp. Math. Statist. Prob., 1, 463-478, University of California Press, 1967.

[22] T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6:4-22, 1985.

[23] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.

[24] H. Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin American Mathematical Society, 55:527-535, 1952.

[25] D. D. Swain. Bounds and rates of convergence for the extended compound estimation problem in the sequence case. Tech. Report No. 81, Department of Statistics, Stanford University, 1965.

[26] S. B. Vardeman. Approximation to minimum k-extended Bayes risk in sequences of finite state decision problems and games. Bulletin of the Institute of Mathematics Academia Sinica, V10, No. 1, 1982.

[27] V. G. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, 371-383, 1990.

[28] V. G. Vovk. A game of prediction with expert advice. Unpublished manuscript, 1996.