This is to certify that the dissertation entitled

Asymptotically optimal and admissible estimators in compound compact Gaussian shift experiments

presented by Suman Majumdar has been accepted towards fulfillment of the requirements for the Ph.D. degree in Statistics.

Major professor
Date: August 5, 1992

MSU is an Affirmative Action/Equal Opportunity Institution

ASYMPTOTICALLY OPTIMAL AND ADMISSIBLE ESTIMATORS IN COMPOUND COMPACT GAUSSIAN SHIFT EXPERIMENTS

By

Suman Majumdar

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Statistics and Probability

1992

ABSTRACT

ASYMPTOTICALLY OPTIMAL AND ADMISSIBLE ESTIMATORS IN COMPOUND COMPACT GAUSSIAN SHIFT EXPERIMENTS

By

Suman Majumdar

The problem of finding admissible and asymptotically optimal compound and empirical Bayes rules is investigated in the context of decision about an infinite dimensional parameter. The component experiment considered is a homogeneous experiment $\{P_\theta : \theta \in H\}$ on some measurable space $(\mathcal{X}, \mathcal{F})$, where $H$ is a real separable Hilbert space, such that the map $\theta \mapsto \langle\theta,\cdot\rangle_0 := \ln p_\theta(\cdot) + \|\theta\|^2/2$ is linear from $H$ into the real-valued measurable functions on $(\mathcal{X}, \mathcal{F})$, where $p_\theta$ is a density of $P_\theta$ wrt $\mu = P_0$. This experiment is a Gaussian shift experiment in the sense of LeCam (1986) and $\{\langle\theta,\cdot\rangle_0 : \theta \in H\}$ is the isonormal process on $(\mathcal{X}, \mathcal{F}, \mu)$ in the sense of Dudley (1967). The component problem estimates the shift parameter $\theta$ restricted to a compact subset of $H$ under squared error loss.
We consider the compound and empirical Bayes formulations of the above component problem and show that all Bayes estimators in the various formulations are admissible. Our main result: Any Bayes compound estimator versus a mixture of iid priors on the compound parameter is asymptotically optimal if the mixing hyperprior has full support. Analogously, any Bayes empirical Bayes estimator is asymptotically optimal if the empirical Bayes prior has full support. Using the (weak) conditional expectation representation of the Bayes estimator in the component problem and weak compactness of the unit ball, along with the fact that $\{\langle\theta,\cdot\rangle_0 : \theta \in H\}$ is the isonormal process and consequences thereof, we reduce the question of asymptotic optimality to that of an $L_1$ consistency of posterior mixtures. We prove the consistency result, which complements Datta (1991a), by assembling some previously known results and repeatedly using the Gaussian shift structure.

The dissertation also characterizes the support of a Dirichlet hyperprior on the set of all probability measures on a separable metric space to be those probability measures whose supports are contained in that of the parameter measure (of the Dirichlet hyperprior), proving a result stated in Ferguson (1973) for the line and providing examples of a full support hyperprior.

To my father and to the memory of my late mother

ACKNOWLEDGEMENT

I would like to express my sincere gratitude to my advisor Professor James F. Hannan whose erudition, vision of and dedication to Statistics always guided and inspired me. His caring personality provided strong support in some rather trying circumstances. I would also like to thank Professors Dennis Gilliland, Hira L. Koul, V.
Mandrekar and Habib Salehi for serving on my guidance committee, Professor Gilliland for some very useful comments on an earlier draft, Professor Mandrekar for alerting me to the (now realized) possibility of generalizations of the results initially presented and Professor R.V. Ramamoorthi for an important reference. The opportunity provided by Professors Habib Salehi and James Stapleton on behalf of the Department of Statistics during the last two years to enhance my professional experience by teaching a wide variety of independent courses and working in the Statistical Consulting Service is gratefully acknowledged. Professors Gilliland and Stapleton very kindly taught me different aspects of the role of a Statistical Consultant. The support provided by my sisters during my difficult student days in India and their encouragement to pursue graduate studies, as well as the support and encouragement provided by my wife during the preparation of this dissertation, deserve special mention.

TABLE OF CONTENTS

Chapter
0. INTRODUCTION
   0.1. The component and the compound problem
   0.2. Literature review and a summary
   0.3. Notational conventions
1. THE COMPOUND ESTIMATION
   1.1. The Gaussian shift component
   1.2. Estimators induced by hyperpriors
      1.2.1. Bayes versus mixture of iid priors
      1.2.2. A useful inequality on the modified regret
   1.3. A bound on the $L_1(P_\theta)$ distance between two component Bayes estimators
   1.4. Asymptotic optimality
   1.5. Admissibility
2. CONSISTENCY OF THE POSTERIOR MIXTURES
3. THE EMPIRICAL BAYES ESTIMATION
   3.1. Bayes empirical Bayes
   3.2. Asymptotic optimality
APPENDIX
   A.1. On measurability
   A.2. On topological support of Dirichlet prior
BIBLIOGRAPHY

CHAPTER 0

INTRODUCTION

In Section 1 we describe the idea of compounding a decision problem (called the component problem) first espoused by Robbins (1951). In Section 2 we review that part of the existing literature on compound decision theory which can be considered to be a forerunner to our work and present a summary of it. In Section 3 we state the notational conventions to be followed throughout the dissertation (some of these conventions will be used informally in Sections 1 and 2).

1. The component and the compound problem.

The component problem is a usual decision theory problem, consisting of a parameter set $\Theta$, a family of probability measures $\{P_\theta : \theta \in \Theta\}$ on some measurable space $(\mathcal{X}, \mathcal{F})$, an observable $\mathcal{X}$-valued random element $X \sim P_\theta$ under $\theta$, an action space $\mathcal{A}$, a loss function $L : \mathcal{A} \times \Theta \to [0,\infty)$ and decision rules $t$, $t : \mathcal{X} \to \mathcal{A}$ such that $L(t,\theta)$ is measurable $\forall\,\theta$ with risk $R(t,\theta) := P_\theta L(t,\theta)$. For consideration of Bayes solutions, we fix a $\sigma$-algebra of subsets of $\Theta$ such that each of the maps $(x,\theta) \mapsto L(t(x),\theta)$ is jointly measurable. Let $\Omega = \{\omega : \omega \text{ is a probability on } \Theta\}$. For $\omega \in \Omega$, let $r(\omega)$ and $\tau_\omega$ respectively denote the minimum Bayes risk and a Bayes rule versus $\omega$ in the component problem (we assume existence of $\tau_\omega$ for every $\omega$). That is,

$r(\omega) = \inf_t \int_\Theta R(t,\theta)\,d\omega(\theta) = \int_\Theta R(\tau_\omega,\theta)\,d\omega(\theta).$

The compound problem simultaneously considers a number, say $n$, of independent decision problems, each of which is structurally identical to the above component problem. The compound loss is taken to be the average of the component losses. In the set compound version a decision about each component parameter is reached by using data from all the component problems, while in the sequence compound version only $\underline{X}_\alpha$, data up to stage $\alpha$, is used in making the $\alpha$-th decision.
Thus for each $n \ge 1$, the compound problem is also a decision problem, with parameter set $\Theta^n$, family of probability measures $\{P_{\underline\theta} := \times_{\alpha=1}^n P_{\theta_\alpha} : \underline\theta := (\theta_1,\dots,\theta_n) \in \Theta^n\}$ on the measurable space $\mathcal{X}^n$, observations $\underline{X} = (X_1,\dots,X_n) \sim P_{\underline\theta}$ under $\underline\theta$, action space $\mathcal{A}^n$, decision rules $\underline{t} : \mathcal{X}^n \to \mathcal{A}^n$ such that each $L(t_\alpha,\theta_\alpha)$ is measurable, loss

$L_n(\underline{t},\underline\theta) = n^{-1} \sum_{\alpha=1}^n L(t_\alpha,\theta_\alpha)$

and corresponding risk

(1.1) $R_n(\underline{t},\underline\theta) := P_{\underline\theta} L_n(\underline{t},\underline\theta).$

If we were going to use only the data from a particular component problem to decide about that component parameter, then, the component problems being structurally identical, there is an intuitive reason to use the same procedure (with different data) in the different problems. Formally, that amounts to using a compound procedure $\underline{t}$ for which $t_\alpha(\underline{x}) = t(x_\alpha)$ $\forall\,\alpha = 1,\dots,n$, where $t$ is a component procedure; such a compound procedure is called simple symmetric. Let $G_n$ denote the empirical distribution of $(\theta_1,\dots,\theta_n)$. The compound risk at $\underline\theta$ of a simple symmetric $\underline{t}$ reduces to the component Bayes risk of $t$ versus $G_n$, where $t_\alpha(\underline{x}) = t(x_\alpha)$ $\forall\,\alpha = 1,\dots,n$; as such it is at least $r(G_n)$, the minimum Bayes risk versus $G_n$, which is referred to as the simple envelope at $\underline\theta$. For a compound rule $\underline{t}$, the difference $D_n(\underline{t},\underline\theta) := R_n(\underline{t},\underline\theta) - r(G_n)$ is called the modified regret of $\underline{t}$ at $\underline\theta$ and a sequence of compound rules $\{\underline{t} : n \ge 1\}$ is said to be asymptotically optimal (a.o.) if

(1.2) $\sup_{\underline\theta} D_n(\underline{t},\underline\theta) \to 0$ as $n \to \infty$.

However, it has long been recognized (Hannan and Robbins (1955)) that the compound problem is invariant under the group of $n!$ permutations of coordinates; also, almost all the compound rules in the literature are equivariant under the permutation group. Hence a more appropriate yardstick to judge the performance of a compound rule should be the equivariant envelope, the minimum compound risk of equivariant rules (see Gilliland and Hannan (1986) for a discussion of equivariance in compound decision problems).
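The identity between the compound risk of a simple symmetric rule and the component Bayes risk versus $G_n$, and the resulting modified regret, can be illustrated numerically. The sketch below is only an illustration under assumptions that are not part of the dissertation: it takes the one-dimensional case $H = \mathbb{R}$ with $P_\theta = N(\theta,1)$, approximates risks by a midpoint Riemann sum, and uses hypothetical function names throughout.

```python
import math

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x (illustrative one-dimensional component)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def component_risk(rule, theta, lo=-12.0, hi=12.0, steps=4800):
    """R(t, theta) = P_theta |t(X) - theta|^2, approximated by the midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        total += (rule(x) - theta) ** 2 * normal_pdf(x, theta) * width
    return total

def bayes_rule(thetas):
    """Posterior mean versus the empirical distribution G_n of thetas."""
    def tau(x):
        weights = [normal_pdf(x, th) for th in thetas]
        return sum(w * th for w, th in zip(weights, thetas)) / sum(weights)
    return tau

def compound_risk_simple_symmetric(rule, thetas):
    """Average of the component risks, i.e. the Bayes risk of rule versus G_n."""
    return sum(component_risk(rule, th) for th in thetas) / len(thetas)

def modified_regret(rule, thetas):
    """Compound risk of the simple symmetric rule minus the simple envelope r(G_n)."""
    envelope = compound_risk_simple_symmetric(bayes_rule(thetas), thetas)
    return compound_risk_simple_symmetric(rule, thetas) - envelope

thetas = [-1.0, -1.0, 1.0, 0.5]                      # an arbitrary parameter sequence
identity_regret = modified_regret(lambda x: x, thetas)   # rule t(x) = x
bayes_regret = modified_regret(bayes_rule(thetas), thetas)
```

Here the rule $\tau_{G_n}$ attains the simple envelope by construction, while $t(x) = x$ has strictly positive modified regret. Note that $\tau_{G_n}$ depends on the unknown $\underline\theta$, which is why it serves only as a benchmark.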
Mashayekhi (1990) has shown that if the component problem involves a compact (in total variation norm) class of mutually absolutely continuous probability measures, then the excess of the simple envelope over the equivariant envelope goes to zero uniformly in the measures. We shall use that result to extend our optimality result against the simple envelope to that against the equivariant envelope.

A sequence of compound rules $\{\underline{t} : n \ge 1\}$ is said to be admissible if for every $n$ it is admissible in the usual sense.

2. Literature review and a summary.

The problem of exhibiting compound rules which are a.o. as well as admissible has been an interesting and challenging question ever since it was put forward by Robbins (1951) in his pioneering paper of compound decision theory. He considered the problem of decision between $N(-1,1)$ and $N(1,1)$, exhibited an a.o. compound procedure and conjectured that the Bayes compound rule versus the symmetric prior uniform on proportions might have better risk behavior, exactly or asymptotically, than his a.o. rule. [That it will not be exactly superior to the bootstrap rule of Robbins was shown by Huang (1972).]

A.o. compound rules whose components are typically Bayes versus some estimates of the unknown $G_n$ or direct estimates of the Bayes rule versus $G_n$ have been worked out for many different component problems. In particular, when the component problem is an estimation problem under squared error loss, Gilliland (1968) and Singh (1974) obtained a.o. sequence compound rules with rates (we say $\underline{t}$ is a.o. with rate $a_n$ if $\sup_{\underline\theta} D_n(\underline{t},\underline\theta) = O(a_n)$) for discrete and Lebesgue exponential components respectively. But these rules are inadmissible in the sense of the previous section.

Making use of results from Gilliland and Hannan (1974), which was later published in 1986, Gilliland, Hannan and Huang (1976) obtained admissible and a.o. rules with rate $n^{-1/2}$ where the component problem was a two-state restricted risk problem.
[They did not specify admissibility of their rules. But they considered full support hyperprior mixing of independent identically distributed (iid hereafter) priors on the compound state space to generate full support priors on it and looked at the resulting Bayes rules. Since the risk in their problem is trivially continuous (the state space is discrete), the resulting Bayes rules are admissible.]

The first solution to the problem of exhibiting compound rules which are a.o. as well as admissible when the component problem involves decision among infinitely many probability measures has been provided by Datta (1988/91b). The component problem there is the squared error loss estimation of an arbitrary continuous transform of the natural parameter of a large compact subclass of a one parameter exponential family. Since then, Mashayekhi (1990) proposed a class of admissible and a.o. procedures in the restricted risk compact component compound decision problem. This was extended by Zhu (1992), who successfully exploited Datta's (1991a) result about consistency of the posterior mixtures to obtain admissible and a.o. rules when the component problem involves equi(in actions) continuous loss functions in a multiparameter exponential family with parameter set restricted to a polytope inside the natural parameter set.

The present work seems to be the first to accomplish asymptotic optimality when the component problem is the estimation of an infinite dimensional parameter. In fact, it accomplishes admissibility and asymptotic optimality simultaneously. Our component distributions, indexed by a real separable Hilbert space, form a Gaussian shift experiment. We consider the component problem of squared error loss estimation of the Hilbert-valued shift parameter restricted to a compact subset of the Hilbert space. We note that all Bayes estimators in our compound problem are admissible.
Our main result is that a Bayes compound estimator versus a mixture of iid priors on the compound parameter is a.o. if the mixing hyperprior has full support.

The dissertation is organized as follows. Chapter 1 treats the compound estimation problem. Section 1 formally introduces the component distributions as satisfying an assumption (A). That assumption immediately identifies the experiment to be a Gaussian shift experiment. Section 2 describes the Bayes estimator versus the above mentioned mixture of iid priors and establishes a bound on the modified regret of such an estimator. Section 3 establishes an upper bound on the distance between two component Bayes estimators in terms of the $L_1$ distance between the corresponding mixtures. Section 4 combines the results in Sections 2 and 3 to establish asymptotic optimality, first against the simple envelope and then against the equivariant envelope, assuming posterior mixtures are $L_1$ consistent for the empirical mixture. In this section we provide a closed form expression of our estimator, and examples of a full support hyperprior. In Section 5, we show that every Bayes estimator in our compound problem, in particular a Bayes estimator versus a mixture of iid priors, is admissible.

In Chapter 2 we establish the consistency of the posterior mixtures assumed in proving asymptotic optimality in Chapter 1. In the process we get that the very general sufficient conditions given by Datta (1991a) for this kind of consistency of the posterior mixtures are by no means necessary.

Chapter 3 looks at the empirical Bayes problem of Robbins (1951, 1956) with the component problem described above. Admissibility and asymptotic optimality (defined in that chapter) follow from the compound results.
Finally, in Section 1 of the Appendix we prove two measurability lemmas that are used in the main body of the dissertation; in Section 2, we characterize the topological support of a Dirichlet prior on a separable metric space, which is used in Section 4, Chapter 1 to give examples of a full support hyperprior.

3. Notational conventions.

Given any $n$-tuple $\underline{x} = (x_1,\dots,x_n)$ of elements from a set, for each $1 \le \alpha \le n$, $\underline{x}_\alpha$ denotes the $\alpha$-tuple $(x_1,\dots,x_\alpha)$. For probabilities $P_1,\dots,P_n$, $\times_{i=1}^n P_i$ denotes their measure theoretic product; when $P_i = P$ $\forall\,i$, $\times_{i=1}^n P_i$ is denoted by $P^n$. For sets $\{A_i : 1 \le i \le n\}$, $\times_{i=1}^n A_i$ denotes their set theoretic product; when $A_i = A$ $\forall\,i$, $\times_{i=1}^n A_i$ is denoted by $A^n$.

To denote the integral of a function $f$ with respect to (wrt hereafter) a measure $\mu$, we will interchangeably use the standard integral notation $\int f\,d\mu$ and the left operator notation $\mu(f)$, or even $\mu f$; depending on typographical convenience and the emphasis to be conveyed, the dummy variable of integration in the integral notation will be sometimes displayed, sometimes only partially displayed and sometimes hidden altogether. Sets are always identified with their indicator functions. The same is true for probabilities and their induced expectations.

$\mathbb{R}$ stands for the real line. If $X$ is a random element on a probability space $(\cdot,\cdot,P)$, then $PX^{-1}$ denotes the $P$-induced distribution of $X$ on the range space. The notation $a := b$ will mean that $a$ is defined to be $b$. The set theoretic complement of a set A will be denoted by A, except in Section 2 of the Appendix, where the more traditional $A^c$ will be used.

The following numbering convention will be used throughout: All numberings of displays and statements are local within a chapter. For chapters with multiple sections, (2.1) will refer to the first display in the second section; for chapters with a single section, (3) will refer to the third display. On occasions when we have to refer to numberings in other chapters, the reference will be explicit, e.g.
Theorem 1 of Chapter 2 or Lemma 1.1 of Chapter 1.

CHAPTER 1

THE COMPOUND ESTIMATION

In this chapter we consider the compound problem as described in Chapter 0 corresponding to the Gaussian shift component problem to be introduced below. We prove asymptotic optimality of Bayes rules versus (full support hyperprior) mixture of iid priors [Theorem 4.1], which is the main result of the dissertation, using the consistency of the posterior (distribution of $\theta_n$ under the mixed compound prior given $\underline{x}_{n-1}$) mixtures, a result of independent interest stated and proved in Chapter 2.

Section 1 describes the component problem to be investigated and assembles some pertinent facts about it. In Section 2 we calculate a Bayes estimator in the compound problem versus a mixture of iid priors on the compound parameter and obtain a useful upper bound on its absolute modified regret. In Section 3, we obtain an upper bound on the distance between two component Bayes rules in terms of the $L_1$ distance between the corresponding mixtures, which is used in Section 4 in conjunction with the bound on the absolute modified regret obtained in Section 2 to prove the main result. In Section 5 we show that every Bayes estimator in the compound problem, in particular a Bayes estimator versus a mixture of iid priors, is admissible.

1. The Gaussian shift component.

We consider the squared error loss estimation problem in a Hilbert indexed Gaussian shift experiment. Let $H$ be a real separable Hilbert space (with $\|f\|$ denoting the norm of an element $f$ in $H$ and $\langle\cdot,\cdot\rangle$ the inner product) and $\{P_\theta : \theta \in H\}$ be a family of probabilities on a measurable space $(\mathcal{X}, \mathcal{F})$ specified by (strictly positive) densities $\{p_\theta : \theta \in H\}$ wrt $\mu = P_0$, such that

(A) the map $\theta \mapsto \langle\theta,\cdot\rangle_0 := \ln p_\theta(\cdot) + \|\theta\|^2/2$ is linear from $H$ into the linear space of all real-valued measurable functions on $(\mathcal{X}, \mathcal{F})$.

We consider the component problem with $\Theta$ a compact subset of $H$, $H \supset \mathcal{A} \supset \Theta$ and $L(a,\theta) = \|a - \theta\|^2$.
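A finite-dimensional instance may help fix ideas about assumption (A). The sketch below assumes, purely for illustration, $H = \mathbb{R}^d$ with $\mu = N(0, I_d)$ and $P_\theta = N(\theta, I_d)$, for which $p_\theta(x) = \exp(\langle\theta,x\rangle - \|\theta\|^2/2)$ and hence $\langle\theta,x\rangle_0 = \langle\theta,x\rangle$. The function names are hypothetical, and the code only checks the linearity in (A) at one point.

```python
def log_density(theta, x):
    """ln p_theta(x) for P_theta = N(theta, I_d) relative to mu = N(0, I_d)."""
    inner = sum(t * xi for t, xi in zip(theta, x))
    norm_sq = sum(t * t for t in theta)
    return inner - 0.5 * norm_sq

def bracket(theta, x):
    """<theta, x>_0 := ln p_theta(x) + ||theta||^2 / 2, the map in assumption (A)."""
    return log_density(theta, x) + 0.5 * sum(t * t for t in theta)

def combine(a, theta, b, eta):
    """The linear combination a * theta + b * eta in R^d."""
    return [a * t + b * e for t, e in zip(theta, eta)]

# Arbitrary illustrative points in R^3.
theta, eta, x = [0.5, -1.0, 2.0], [1.5, 0.25, -0.75], [0.3, -0.2, 1.1]
a, b = 2.0, -3.0

lhs = bracket(combine(a, theta, b, eta), x)          # <a*theta + b*eta, x>_0
rhs = a * bracket(theta, x) + b * bracket(eta, x)    # a<theta, x>_0 + b<eta, x>_0
```

In this instance the two sides agree up to floating point rounding. The dissertation's setting is more general, since $H$ may be infinite dimensional and $\mathcal{X}$ need not be $H$ itself.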
The contents of the remainder of this section are as described below: We show that $\{\langle\theta,\cdot\rangle_0 : \theta \in H\}$ is the isonormal process on $(\mathcal{X}, \mathcal{F}, \mu)$ in the sense of Dudley (1967) [Remark 1.1], which in turn identifies the experiment under investigation to be a Gaussian shift experiment in the sense of LeCam (1986) [Remark 1.2]. We show that $(\theta,x) \mapsto p_\theta(x)$ and $(\omega,x) \mapsto p_\omega(x) := \int p_\theta(x)\,d\omega(\theta)$ are jointly measurable when $\Omega$ (the set of all probabilities on $\Theta$) is endowed with the topology of weak convergence and the corresponding Borel $\sigma$-field [Remark 1.3]. We then show that a Bayes estimator in the component problem must be the (weak) posterior expectation [Lemma 1.1]. We close the section by proving two lemmas [Lemma 1.2 and Lemma 1.3] describing certain features of the component problem that are used in the sequel.

Remark 1.1 (The isonormal process). Since by (A) $p_\theta = \exp(-\|\theta\|^2/2 + \langle\theta,\cdot\rangle_0)$ $\forall\,\theta \in H$, by representing the lhs below as a $\mu$ integral, using the linearity of the map in (A) to treat the integrand and representing the resulting integral as a $P_{\eta+t\theta}$ integral, we get $\forall\,t \in \mathbb{R}$ and $\theta,\eta \in H$,

$P_\eta[\exp\{t\langle\theta,\cdot\rangle_0\}] = \exp\{t\langle\theta,\eta\rangle + t^2\|\theta\|^2/2\},$

which by uniqueness of moment generating function proves

(1.1a) $P_\eta\langle\theta,\cdot\rangle_0^{-1} = N(\langle\theta,\eta\rangle, \|\theta\|^2)$ $\forall\,\theta,\eta \in H$.

The linearity assumption in (A) then shows

(1.1b) $\{\langle\theta,\cdot\rangle_0 : \theta \in H\}$ is a centered Gaussian process on $(\mathcal{X}, \mathcal{F}, \mu)$.

By the (polar) representation of the product of two numbers in terms of the square of their sum and the individual squares, using the linearity of the map in (A) and (1.1a) with $\eta = 0$, we get

(1.2) $\mu(\langle\theta,\cdot\rangle_0\langle\eta,\cdot\rangle_0) = \langle\theta,\eta\rangle$ $\forall\,\theta,\eta \in H$.

Now, the assertions in (1.1b) and (1.2) show that the process $\{\langle\theta,\cdot\rangle_0 : \theta \in H\}$ is isonormal in the sense of Dudley (1967). //

Remark 1.2 (Gaussian shift experiment). Note that by (1.1b), the experiment under investigation is a Gaussian shift experiment in the sense of Definition 2 of Chapter 9 of LeCam (1986).
Even though the definition in LeCam does not require the indexing set to be a Hilbert space, discussions following it show that it suffices to restrict attention to that case. //

Remark 1.3 (Joint measurability of densities). Let $\{e_j : j \ge 1\}$ be an orthonormal basis of $H$. By (1.1b) and (1.2), we get that $\{\langle e_j,\cdot\rangle_0 : j \ge 1\}$ are independent random variables on $(\mathcal{X}, \mathcal{F}, \mu)$. Let $\theta_n := \sum_{j=1}^n \langle\theta,e_j\rangle e_j$. By linearity of the map in (A), $\langle\theta_n,\cdot\rangle_0 = \sum_{j=1}^n \langle\theta,e_j\rangle\langle e_j,\cdot\rangle_0$. Since $\theta_n \to \theta$ in $H$, $\langle\theta_n,\cdot\rangle_0$ converges to $\langle\theta,\cdot\rangle_0$ in $L_2(\mu)$ by (1.2); since the $\langle\theta_n,\cdot\rangle_0$ are the partial sums of a sequence of independent random variables, by Levy's Theorem [Theorem 3.3.1, Chow and Teicher (1988)], the convergence is $\mu$-a.s. as well. Since $\langle\theta_n,\cdot\rangle_0$ is continuous in $\theta$ and a measurable function on $\mathcal{X}$, it is jointly (in $\theta$ and $x$) measurable by Doob's Theorem. That implies the joint measurability of its $\mu$-a.s. limit and hence that of $p_\theta$.

Let $\Omega$, the set of all probabilities on the Borel $\sigma$-field of $\Theta$, be endowed with the topology of weak convergence and the corresponding Borel $\sigma$-field. For $\omega \in \Omega$, let $p_\omega(x) := \int p_\theta(x)\,d\omega$; this is clearly a density of the mixture $P_\omega := \int P_\theta\,d\omega$ wrt $\mu$. The map $(\omega,x) \mapsto p_\omega(x)$ is jointly measurable by the joint measurability of $(\theta,x) \mapsto p_\theta(x)$ and Lemma 1.2 of the Appendix. //

The next lemma characterizes a Bayes estimator in our component problem. Specializing the notation introduced in Chapter 0 we shall denote a Bayes estimator (versus $\omega$) in the component problem by $\tau_\omega$. Throughout the remainder of the dissertation, let

(1.3) $M = \sup\{\|\theta\| : \theta \in \Theta\}$.

Lemma 1.1. On the common support of $\{P_\nu : \nu \in \Theta\}$, $\tau_\omega$ is the unique mapping into $H$ satisfying

$\langle\tau_\omega,h\rangle = \int_\Theta \langle\eta,h\rangle (p_\eta/p_\omega)\,d\omega(\eta)$ $\forall\,h \in H$.

Proof. We first show that for any probability measure $\pi$ on $\Theta$, $\exists$ a unique element $v(\pi)$ in $H$ satisfying

(1.4) $\langle v(\pi),h\rangle = \int_\Theta \langle\eta,h\rangle\,d\pi(\eta)$ $\forall\,h \in H$.

Since the map $h \mapsto \int_\Theta \langle\eta,h\rangle\,d\pi(\eta)$ is a linear functional on $H$ whose norm is bounded by $M$, the assertion of (1.4) follows from the Riesz-Fréchet Theorem [Theorem 5.5.1, Dudley (1989)].
Note that if $p_\omega(x)$ is positive, the map $\theta \mapsto p_\theta(x)/p_\omega(x)$ is a density (wrt $\omega$) of a probability measure $\hat\omega_x$ on $\Theta$. By (1.4), it is enough to show that $\tau_\omega = v(\hat\omega_x)$ on the common support of $\{P_\nu : \nu \in \Theta\}$. Now, by Fubini's Theorem, the Bayes risk (versus $\omega$) of an estimator $t$ is equal to

(1.5) $\int_{\mathcal{X}} \Big[\int_\Theta \|t(x) - \theta\|^2\,d\hat\omega_x(\theta)\Big] p_\omega(x)\,d\mu(x).$

Triangulating around $v(\hat\omega_x)$ and expanding the norm square of the sum, the inner integral in (1.5) is

$\|t(x) - v(\hat\omega_x)\|^2 + \int_\Theta \|v(\hat\omega_x) - \theta\|^2\,d\hat\omega_x(\theta),$

which is minimized iff $t(x) = v(\hat\omega_x)$, completing the proof. //

Lemma 1.2. For every finite sequence $\{\theta_i : 1 \le i \le k\} \subset H$ and $\{a_i : 1 \le i \le k\} \subset \mathbb{R}$,

$2\log\mu\Big(\prod_{i=1}^k p_{\theta_i}^{a_i}\Big) = \Big\|\sum_{i=1}^k a_i\theta_i\Big\|^2 - \sum_{i=1}^k a_i\|\theta_i\|^2.$

Proof. Starting with the functional form of $p_{\theta_i}$ implicit in (A), the assertion follows by using the linearity of the map in (A) and the functional form of $p_\xi$, where $\xi = \sum_{i=1}^k a_i\theta_i$. //

Throughout the remainder of the dissertation, let $\|f\|_q$ denote the $L_q(\mu)$ norm of a function $f$ in $L_q(\mu)$.

Lemma 1.3. For every $\omega \in \Omega$ and every integer $q \ge 1$, $p_\omega \in L_q(\mu)$ and $\|p_\omega\|_q \le e^{(q-1)M^2/2}$.

Proof. Writing $p_\omega^q$ as a $q$-fold iterated integral, interchanging the order of integration on $\mathcal{X}$ and $\Theta^q$, applying Lemma 1.2 (with $k = q$ and $a_i = 1$ $\forall\,i$), and using (1.3), we get

(1.6) $\mu(p_\omega^q) \le \exp\{q(q-1)M^2/2\},$

completing the proof. //

2. Estimators induced by hyperpriors.

In Subsection 2.1 we show that the $\alpha$-th component of a Bayes estimator in the compound problem versus a mixture of iid priors on the compound parameter is the Bayes estimator in the component problem versus the posterior mean under the mixing hyperprior given the data from the other problems; in Subsection 2.2 we obtain an upper bound on the absolute modified regret of such an estimator in terms of the distance between its $\alpha$-th component and a component Bayes rule versus the empirical state distribution.

2.1. Bayes versus mixture of iid priors.
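Since $\Theta$ is a compact metric space, by Theorem II.6.4 of Parthasarathy (1967), $\Omega$ with the topology of weak convergence is also a compact metric space; let $\mathcal{B}(\Omega)$ denote its Borel $\sigma$-field. Let $\Lambda$ be a probability measure on $(\Omega, \mathcal{B}(\Omega))$. We take $\Lambda$-mixture of iid priors on $\Theta^n$ (for each $n$) and denote that prior by $\bar\omega_{\Lambda,n}$. [The measure $\bar\omega_{\Lambda,n}$ is defined on the class of measurable rectangles by

(2.1) $\bar\omega_{\Lambda,n}(B_1 \times B_2 \times \dots \times B_n) = \int_\Omega \prod_{i=1}^n \omega(B_i)\,d\Lambda,$

and then extended to the product $\sigma$-field. Note that by Lemma 1.1 of the Appendix the above integrand is measurable.]

Let $\underline{t} = (t_1,\dots,t_n)$, where $t_\alpha : \mathcal{X}^n \to \mathcal{A}$ is a measurable function, be an estimator in the set compound problem. The $\alpha$-th component Bayes risk of $\underline{t}$ versus $\bar\omega_{\Lambda,n}$ is

(2.2) $R(t_\alpha, \bar\omega_{\Lambda,n}) = \int_\Omega \int_{\mathcal{X}^{n-1}} \Big[\int_\Theta \int_{\mathcal{X}} \|t_\alpha - \theta_\alpha\|^2\,dP_{\theta_\alpha}\,d\omega\Big]\,dP_\omega^{n-1}\,d\Lambda.$

Disintegrating the joint probability on $\mathcal{X}^{n-1} \times \Omega$ determined by $(dP_\omega^{n-1}\,d\Lambda)$ as $(d\Lambda_{\alpha,n}\,dP_{\bar\omega_{\Lambda,n-1}})$, where $\Lambda_{\alpha,n}$ is the posterior distribution of $\omega$ (under $\Lambda$) given $(x_1,\dots,x_{\alpha-1},x_{\alpha+1},\dots,x_n)$ [since $\Omega$ is a Polish (in fact compact metric) space, by Theorem 10.2.2 of Dudley (1989), such a disintegration exists], we get

(2.3) lhs(2.2) $= \int_{\mathcal{X}^{n-1}} \Big[\int_\Theta \int_{\mathcal{X}} \|t_\alpha - \theta_\alpha\|^2\,dP_{\theta_\alpha}\,d\omega_{\alpha,n}\Big]\,dP_{\bar\omega_{\Lambda,n-1}},$

where $\omega_{\alpha,n}$ denotes the $\Lambda_{\alpha,n}$ mix of $\omega$'s. Clearly, rhs(2.3) is minimized by choosing $t_\alpha(\underline{x}) = \tau_{\omega_{\alpha,n}}(x_\alpha)$. Since the compound risk is the average of the component risks, the Bayes estimator in the set compound problem versus the prior $\bar\omega_{\Lambda,n}$ is given by $\hat{\underline{t}}$, where

(2.4a) $\hat t_\alpha(\underline{x}_n) = \tau_{\omega_{\alpha,n}}(x_\alpha).$

A similar argument shows that the Bayes estimator in the sequence compound problem versus the prior $\bar\omega_{\Lambda,n}$ is given by $\underline{t}'$, where

(2.4b) $t'_\alpha(\underline{x}_n) = \tau_{\omega_{\alpha,\alpha}}(x_\alpha).$

2.2. A useful inequality on the modified regret.

Recall from Chapter 0 that $G_n$ stands for the empirical distribution of $\theta_1,\dots,\theta_n$. For every $\underline\theta \in \Theta^n$, by definition,

$D_n(\hat{\underline{t}},\underline\theta) = n^{-1} \sum_{\alpha=1}^n P_{\underline\theta}\big[\|\hat t_\alpha - \theta_\alpha\|^2 - \|\tilde t_\alpha - \theta_\alpha\|^2\big],$

where $\tilde t_\alpha(\underline{x}) = \tau_{G_n}(x_\alpha)$.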
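The set compound estimator in (2.4a) can be made concrete in a toy case. The sketch below is an illustration under assumptions not made in the dissertation: $H = \mathbb{R}$ with $p_\theta(x) = \exp(\theta x - \theta^2/2)$, a finite parameter set, and a hyperprior $\Lambda$ supported on finitely many priors (such a $\Lambda$ does not have full support, so Theorem 4.1 does not apply to it). All function names are hypothetical.

```python
import math

def density(theta, x):
    """p_theta(x) for N(theta, 1) relative to mu = N(0, 1)."""
    return math.exp(theta * x - 0.5 * theta * theta)

def mixture_density(prior, x):
    """p_omega(x) for a finitely supported prior given as {theta: mass}."""
    return sum(mass * density(theta, x) for theta, mass in prior.items())

def component_bayes(prior, x):
    """tau_omega(x): posterior mean of theta given x under the prior omega."""
    num = sum(theta * mass * density(theta, x) for theta, mass in prior.items())
    return num / mixture_density(prior, x)

def posterior_mix(hyperprior, others):
    """omega_{alpha,n}: mix of the priors under the posterior of Lambda given others.

    hyperprior is a list of (weight, prior) pairs; others holds the observations
    from the problems other than the alpha-th one.
    """
    weights = []
    for weight, prior in hyperprior:
        likelihood = 1.0
        for x in others:
            likelihood *= mixture_density(prior, x)
        weights.append(weight * likelihood)
    total = sum(weights)
    mixed = {}
    for w, (_, prior) in zip(weights, hyperprior):
        for theta, mass in prior.items():
            mixed[theta] = mixed.get(theta, 0.0) + (w / total) * mass
    return mixed

def set_compound_bayes(hyperprior, xs):
    """Components tau_{omega_{alpha,n}}(x_alpha), alpha = 1..n, following (2.4a)."""
    return [component_bayes(posterior_mix(hyperprior, xs[:i] + xs[i + 1:]), x)
            for i, x in enumerate(xs)]

# Two candidate priors on {-1, 1}, equally weighted by the toy hyperprior.
hyperprior = [(0.5, {-1.0: 0.8, 1.0: 0.2}), (0.5, {-1.0: 0.2, 1.0: 0.8})]
xs = [1.3, 0.7, -0.4, 1.9]
estimates = set_compound_bayes(hyperprior, xs)
```

Each component estimate is a posterior mean and therefore stays in the convex hull of the parameter set. When the hyperprior is degenerate at a single prior $\omega$, the estimator reduces to the simple symmetric rule based on $\tau_\omega$.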
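Using the Cauchy-Schwarz inequality to bound the absolute difference between $\|d\|^2$ and $\|b\|^2$ by $\|d+b\|$ times $\|d-b\|$, the triangle inequality in $H$ and (1.3), we get

(2.5) $|D_n(\hat{\underline{t}},\underline\theta)| \le 4Mn^{-1} \sum_{\alpha=1}^n P_{\underline\theta}\|\hat t_\alpha - \tilde t_\alpha\|.$

Since $\hat t_\alpha(\underline{x}) = \tau_{\omega_{\alpha,n}}(x_\alpha)$,

(2.6) $P_{\underline\theta}\|\hat t_\alpha - \tilde t_\alpha\| = P_{\underline\theta}P_{\theta_\alpha}\|\tau_{\omega_{\alpha,n}} - \tau_{G_n}\|;$

to investigate the bound on the absolute modified regret given by (2.5), we therefore consider $P_\theta\|\tau_\omega - \tau_\pi\|$, where $\theta \in \Theta$ and $\omega,\pi \in \Omega$.

3. A bound on the $L_1(P_\theta)$ distance between two component Bayes rules.

In Proposition 3.1 we derive a bound on $P_\theta\|\tau_\omega - \tau_\pi\|$ essentially in terms of the total variation distance between the corresponding mixtures. Abusing notation we shall use $\|\sigma\|$ to denote the total variation norm of a signed measure $\sigma$ on $(\mathcal{X}, \mathcal{F})$ as well. The next three lemmas are used to prove Proposition 3.1.

Lemma 3.1. Let $\langle\omega,x\rangle_0 := \int \langle\theta,x\rangle_0\,d\omega(\theta)$. Then $\mu\langle\omega,\cdot\rangle_0^{-1} = N\big(0, \iint \langle\eta,\xi\rangle\,d\omega(\eta)\,d\omega(\xi)\big)$.

Proof. By (1.1b), $\langle\omega,\cdot\rangle_0$ is normally distributed if $\omega$ is finitely supported. Since $(\theta,\eta) \mapsto \langle\theta,\eta\rangle$ is continuous and bounded (on compact $\Theta^2$), the map taking $(\omega,\pi)$ to the $L_2(\mu)$ inner product of $\langle\omega,\cdot\rangle_0$ and $\langle\pi,\cdot\rangle_0$ (which by interchanging the order of integration and using (1.2) is seen to be $(\omega \times \pi)\langle\cdot,\cdot\rangle$) is continuous. Continuity of $\omega \mapsto \langle\omega,\cdot\rangle_0$ in $L_2(\mu)$ follows. Since $\Omega$ has a dense subset consisting of finitely supported measures [Theorem II.6.3, Parthasarathy (1967)], and a family of normally distributed random variables is closed under $L_2$ convergence, we get that $\langle\omega,\cdot\rangle_0$ is normally distributed. The expression for the mean and the variance follows by using Fubini's Theorem. //

The following lemma is Lemma A.1 of Datta (1988).

Lemma (Datta-Singh). For $(y,z,Y,Z,L) \in \mathbb{R}^5$ such that $z \ne 0$ and $L \ge 0$,

$|z|\Big\{\Big|\frac{y}{z} - \frac{Y}{Z}\Big| \wedge L\Big\} \le |y - Y| + \Big(\Big|\frac{y}{z}\Big| + L\Big)|z - Z|.$

Lemma 3.2. Given $\delta > 0$, $\exists$ $\{h_1,\dots,h_I\} \subset \mathcal{W} := \{h \in H : \|h\| \le 1\}$ such that, for all real numbers $a$ and $b$,

(3.1) $\exp(-M^2/2 - a + b)\,\mu\big(p_\theta\|\tau_\omega - \tau_\pi\|[\langle\theta,\cdot\rangle_0 \le a][\langle\omega,\cdot\rangle_0 > b]\big) \le 2\delta + \sum_{i=1}^I \mu\Big|\int \langle\eta,h_i\rangle p_\eta\,d(\omega-\pi)(\eta)\Big| + 3M\|P_\omega - P_\pi\|.$

Proof.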
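Starting with the definition of $p_\omega$ ($= \int p_\theta\,d\omega$), recalling the functional form of $p_\theta$ implicit in (A), using (1.3) to bound $p_\theta$ below, applying Jensen's inequality to the exponential function, and noting that $p_\theta[\langle\theta,\cdot\rangle_0 \le a]e^{-a} \le 1$ and $e^{\langle\omega,\cdot\rangle_0} \ge e^b[\langle\omega,\cdot\rangle_0 > b]$, we get

$\exp(-M^2/2 - a + b)\,p_\theta[\langle\theta,\cdot\rangle_0 \le a][\langle\omega,\cdot\rangle_0 > b] \le p_\omega.$

In view of the above it suffices to show that $\mu(p_\omega\|\tau_\omega - \tau_\pi\|)$ can be bounded by rhs(3.1). By Lemma 1.1,

(3.2) $\|\tau_\omega - \tau_\pi\| = \sup_{h \in \mathcal{W}}\Big\{\Big|\int_\Theta \langle\cdot,h\rangle\,d\hat\omega - \int_\Theta \langle\cdot,h\rangle\,d\hat\pi\Big|\Big\},$

where $\hat\omega$ and $\hat\pi$ are as in the proof of Lemma 1.1. Applying the Datta-Singh Lemma with $z = p_\omega$, $y = \int_\Theta \langle\eta,h\rangle p_\eta\,d\omega(\eta)$, $Z = p_\pi$, $Y = \int_\Theta \langle\eta,h\rangle p_\eta\,d\pi(\eta)$ and $L = 2M$,

(3.3) $p_\omega\Big|\Big(\int_\Theta \langle\eta,h\rangle p_\eta\,d\omega(\eta)/p_\omega\Big) - \Big(\int_\Theta \langle\eta,h\rangle p_\eta\,d\pi(\eta)/p_\pi\Big)\Big| \le \Big|\int_\Theta \langle\eta,h\rangle p_\eta\,d(\omega-\pi)(\eta)\Big| + 3M|p_\omega - p_\pi|.$

Since $\Theta$ is compact by assumption and $\mathcal{W}$ is weakly compact by the Banach-Alaoglu Theorem, $\Theta \times \mathcal{W}_w$ is compact. Since $H$ is separable, $\mathcal{W}_w$ and hence $\Theta \times \mathcal{W}_w$ is metrizable. Since $(\theta,h) \mapsto \langle\theta,h\rangle$ is a continuous function on $\Theta \times \mathcal{W}_w$, it is uniformly continuous. That implies $\{h \mapsto \langle\theta,h\rangle : \theta \in \Theta\}$ is an equi(in $\theta$) uniformly continuous family of functions on $\mathcal{W}_w$, so that for every $\omega \in \Omega$

(3.4) $\mu\Big|\int (\langle\theta,h\rangle - \langle\theta,h'\rangle)p_\theta\,d\omega\Big| \le \sup_\theta|\langle\theta,h\rangle - \langle\theta,h'\rangle| \le \delta,$

if the distance between $h$ and $h'$, in a metric metrizing $\mathcal{W}_w$, is less than $\epsilon = \epsilon(\delta)$. If weak-balls of radius $\epsilon$ around $\{h_1,\dots,h_I\}$ cover $\mathcal{W}$, then triangulating around appropriate $h_i$, using (3.4) and dominating the maximum of $I$ non-negative terms by their sum, we get

(3.5) $\mu\Big(\sup_{h \in \mathcal{W}}\Big|\int \langle\eta,h\rangle p_\eta\,d(\omega-\pi)(\eta)\Big|\Big) \le 2\delta + \sum_{i=1}^I \mu\Big|\int \langle\eta,h_i\rangle p_\eta\,d(\omega-\pi)(\eta)\Big|.$

The lemma follows from (3.2), (3.3) and (3.5). //

Proposition 3.1. Let $\gamma > 0$ be fixed arbitrarily. Then, $\exists$ a number $\mathcal{S} < \infty$ such that

$P_\theta\|\tau_\omega - \tau_\pi\| \le 5\gamma + \mathcal{S}\|P_\omega - P_\pi\|.$

Proof. For arbitrary real numbers $a$ and $b$, partitioning $\mathcal{X}$ into the sets $[\langle\theta,\cdot\rangle_0 \le a][\langle\omega,\cdot\rangle_0 > b]$, $[\langle\theta,\cdot\rangle_0 \le a][\langle\omega,\cdot\rangle_0 \le b]$ and $[\langle\theta,\cdot\rangle_0 > a]$, using the bound $\|\tau_\omega - \tau_\pi\| \le 2M$ on the last two sets and the Cauchy-Schwarz inequality in $L_2(\mu)$ on the remaining 2 factors, and bounding $\|p_\theta\|_2$ by $e^{M^2/2}$ (see Lemma 1.3), we get

(3.6) $P_\theta\|\tau_\omega - \tau_\pi\| \le 2Me^{M^2/2}\big\{(\mu[\langle\theta,\cdot\rangle_0 > a])^{1/2} + (\mu[\langle\omega,\cdot\rangle_0 \le b])^{1/2}\big\} + \mu\big(p_\theta\|\tau_\omega - \tau_\pi\|[\langle\theta,\cdot\rangle_0 \le a][\langle\omega,\cdot\rangle_0 > b]\big).$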
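By (1.1a), using the familiar bound on the upper tail of a normal distribution and (1.3), we get, for $a > 0$

(3.7) $\mu[\langle\theta,\cdot\rangle_0 > a] \le (2\pi)^{-1/2}Ma^{-1}\exp(-a^2/(2M^2)).$

Similarly, using Lemma 3.1, for $b < 0$

(3.8) $\mu[\langle\omega,\cdot\rangle_0 \le b] \le (2\pi)^{-1/2}M(-b)^{-1}\exp(-b^2/(2M^2)).$

In view of (3.7) and (3.8), the first term in rhs(3.6) can be made arbitrarily small by appropriate choice (to be made later) of $a$ and $b$. To treat the second term, we shall use Lemma 3.2 and concentrate on the term $\mu|\int \langle\theta,h_i\rangle p_\theta\,d(\omega-\pi)|$ in the bound (3.1).

Expanding the function $\lambda \mapsto e^{\lambda\langle\theta,h\rangle}$ in a Taylor series around $\lambda = 0$ up to 2nd order, collecting the terms in lhs(3.9) on one side of the equality, and using the Cauchy-Schwarz inequality in $H$ and (1.3) to bound the other side, we get for $\lambda > 0$ and $h \in \mathcal{W}$,

(3.9) $\Big|\langle\theta,h\rangle - \frac{1}{\lambda}\big(e^{\lambda\langle\theta,h\rangle} - 1\big)\Big| \le \lambda M^2e^{\lambda M}/2.$

By (3.9) and the triangle inequality, with $\sigma$ abbreviating $\omega - \pi$,

(3.10) $\mu\Big|\int \langle\theta,h\rangle p_\theta\,d\sigma\Big| \le \lambda M^2e^{\lambda M} + \frac{1}{\lambda}\Big[\|P_\omega - P_\pi\| + \mu\Big|\int e^{\lambda\langle\theta,h\rangle}p_\theta\,d\sigma\Big|\Big].$

We now show

(3.11) $\mu\Big|\int e^{\lambda\langle\theta,h\rangle}p_\theta\,d(\omega-\pi)\Big| = P_{\lambda h}|p_\omega - p_\pi|$

as a consequence of

(3.11a) $\mu\Big(\int e^{\lambda\langle\theta,h\rangle}p_\theta\,d\omega, \int e^{\lambda\langle\theta,h\rangle}p_\theta\,d\pi\Big)^{-1} = P_{\lambda h}(p_\omega, p_\pi)^{-1}.$

By (1.1a), linearity of inner product and the map in (A), we get, $\forall\,m \ge 1$ and $\forall\,(\theta_1,\dots,\theta_m) \in \Theta^m$,

$\mu\big(\{\lambda\langle\theta_i,h\rangle + \langle\theta_i,\cdot\rangle_0\}_{i=1}^m\big)^{-1} = P_{\lambda h}\big(\{\langle\theta_i,\cdot\rangle_0\}_{i=1}^m\big)^{-1},$

or equivalently,

$\mu\big(\{e^{\lambda\langle\theta_i,h\rangle}p_{\theta_i}\}_{i=1}^m\big)^{-1} = P_{\lambda h}\big(\{p_{\theta_i}\}_{i=1}^m\big)^{-1}.$

Hence, if $\omega$ and $\pi$ are finitely supported, (3.11a) holds. Since by Theorem II.6.3 of Parthasarathy (1967) $\Omega$ has a dense subset consisting of finitely supported measures, to prove (3.11a) for general $\omega$ and $\pi$ it will suffice to show that for every $\nu$ in $\Omega$, as $\nu_k \to \nu$, $\int e^{\lambda\langle\theta,h\rangle}p_\theta\,d\nu_k(\theta)$ [$p_{\nu_k}$] goes to $\int e^{\lambda\langle\theta,h\rangle}p_\theta\,d\nu(\theta)$ [$p_\nu$] along a subsequence $\mu$ [$P_{\lambda h}$] a.s. Actually we shall show the continuity of the map taking $(\nu,\nu')$ to the $L_2(\mu)$ [$L_2(P_{\lambda h})$] inner product of $\int e^{\lambda\langle\theta,h\rangle}p_\theta\,d\nu(\theta)$ and $\int e^{\lambda\langle\theta,h\rangle}p_\theta\,d\nu'(\theta)$ [$p_\nu$ and $p_{\nu'}$].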
We do that by interchanging the order of integration on $\mathcal X$ and $\Theta^2$, using Lemma 1.2 (with $k=2$, $a_1=a_2=1$) to evaluate the $\mu$ integral (which is continuous on $\Theta^2$, by continuities of vector addition and inner product and the exponential function, and bounded on $\Theta^2$ by (1.3)) and Lemma III.1.1 of Parthasarathy (1967). The bracket alternative is shown by representing the $L^2(P_{\lambda h})$ inner product as a $\mu$ integral, again interchanging the order of integration on $\mathcal X$ and $\Theta^2$, using Lemma 1.2 (this time with $k=3$, $a_i=1\ \forall\,i$) to evaluate the integral (which is bounded continuous on $\Theta^2$ by the same reasons as above) and Lemma III.1.1 of Parthasarathy (1967) again.

Combining (3.10) and (3.11), we get

$$\text{lhs}(3.10)\le\lambda M^2e^{\lambda M}+\lambda^{-1}\big[\|P_\omega-P_\pi\|+\mu(|p_\omega-p_\pi|\,p_{\lambda h})\big].\tag{3.12}$$

By partitioning $\mathcal X$ into $[p_{\lambda h}>c]$ and $[p_{\lambda h}\le c]$, and applying the Cauchy-Schwarz inequality in $L^2(\mu)$, we get

$$\mu(|p_\omega-p_\pi|\,p_{\lambda h})\le c\,\|P_\omega-P_\pi\|+\|p_\omega-p_\pi\|_2\{\mu\,p_{\lambda h}^2[p_{\lambda h}>c]\}^{1/2}.\tag{3.13}$$

Since the family $\{p_{\lambda h}^2:\lambda\in[0,K],\ h\in\mathcal W\}$ is uniformly $\mu$-integrable (it has uniformly bounded higher moments) for every $K>0$, $\{\mu\,p_{\lambda h}^2[p_{\lambda h}>c]\}^{1/2}$ can be made arbitrarily small, uniformly in $\lambda$ and $h$, by choosing $c$ large enough.

Now choose $a$ in (3.7) and $b$ in (3.8) so that, uniformly in $\omega$ and $\theta$, $2Me^{M^2/2}\{(\mu[\langle\theta,\cdot\rangle_0>a])^{1/2}+(\mu[\langle\bar\omega,\cdot\rangle_0\le b])^{1/2}\}<\gamma$. Then choose $\delta$ small enough so that $\exp(M^2/2+a-b)<\gamma/\delta$. Let $I$ correspond to this $\delta$ as in Lemma 3.2. Now choose $\lambda$ small enough so that $\lambda M^2e^{\lambda M}<\delta/I$. Then choose $c$ large enough so that, uniformly in $\omega$ and $\pi$ as well as in $h\in\mathcal W$, $(1/\lambda)\|p_\omega-p_\pi\|_2\{\mu\,p_{\lambda h}^2[p_{\lambda h}>c]\}^{1/2}\le\delta/I$ (possible since by Lemma 1.3 and the triangle inequality in $L^2(\mu)$, $\|p_\omega-p_\pi\|_2\le2e^{M^2/2}$). With these choices, by (3.12) and (3.13),

$$\text{lhs}(3.10)\le2\delta/I+(c+1)\|P_\omega-P_\pi\|/\lambda.\tag{3.14}$$

The proof of the proposition is now completed [with $\mathcal B=\{3M+\lambda^{-1}I(c+1)\}\exp(M^2/2+a-b)$] by (3.6), choice of $a$ and $b$, use of Lemma 3.2 with the above mentioned choice of $\delta$ and substitution of the bound from (3.14) in Lemma 3.2. //

4. Asymptotic optimality.
In view of the bound obtained in Proposition 3.1, (2.5) and (2.6), the question of convergence of the modified regret to 0 reduces to the question, loosely speaking, whether $P_{\omega_{\alpha,n}}$ is $L^1$ consistent for $P_{G_n}$. More specifically, it suffices to show

$$\bigvee_{\alpha=1}^{n}P_{\underline\theta}\|P_{\omega_{\alpha,n}}-P_{G_n}\|\to0,\ \text{uniformly in }\underline\theta,\ \text{as }n\to\infty.\tag{4.1}$$

In Theorem 1 in Chapter 2 we establish such a consistency result for the non-delete version for sufficiently diffuse $\Lambda$. The result involving the delete versions will follow as a corollary (i.e. Corollary 1 in Chapter 2).

Now we are in a position to prove our main result. For a finite measure $m$ on the Borel $\sigma$-field of a second countable topological space $\mathcal Y$, let $S_m$ denote the topological support of $m$. [For the definition of the topological support of a finite measure on a second countable topological space see Section 2 of the Appendix.]

Theorem 4.1 (Main Result). If $S_\Lambda=\Omega$ and $\hat t$ is the Bayes estimator in the set compound problem given in (2.4a), then

$$\bigvee_{\alpha=1}^{n}P_{\underline\theta}\|\hat t_\alpha-\tilde t_\alpha\|\to0,\ \text{uniformly in }\underline\theta,\ \text{as }n\to\infty.\tag{4.2}$$

Consequently, $\hat t$ is a.o.

Proof. The second part of the assertion follows from the first part and the bound (2.5). For the first part recall from (2.6) the representation $P_{\underline\theta}\|\hat t_\alpha-\tilde t_\alpha\|=P_{\underline\theta}P_{\theta_\alpha}\|\tau_{\omega_{\alpha,n}}-\tau_{G_n}\|$; since $\gamma$ in the statement of Proposition 3.1 is arbitrary, the assertion follows from that proposition and the $L^1$ consistency (4.1). //

Remark 4.1 (Asymptotic optimality against the equivariant envelope). As indicated in the introduction we now extend our optimality result against the simple envelope to that against the equivariant envelope. If the component problem involves a compact (in total variation norm) class of mutually absolutely continuous probability measures, then the excess of the simple envelope over the equivariant envelope goes to zero uniformly in the measures (Remark 4 in Mashayekhi (1990)). Recall that by assumption the measures $\{P_\theta:\theta\in\Theta\}$ are mutually absolutely continuous.
Since $\Theta$ is topologically embedded in $\Omega$, by Lemma 2 of Chapter 2 the map $\theta\mapsto p_\theta$ is continuous in $L^4(\mu)$. That implies continuity of $\theta\mapsto P_\theta$ in total variation norm by the moment inequality. Since $\Theta$ is compact, $\{P_\theta:\theta\in\Theta\}$ is compact in the total variation norm. By triangulation around the simple envelope, the asymptotic optimality against the equivariant envelope follows from Theorem 4.1. //

Remark 4.2 (Asymptotic optimality of Bayes sequence compound estimators). We now prove the asymptotic optimality of the Bayes sequence compound estimator $t'$ given in (2.4b). For $1\le\alpha\le n$ [...] Using subadditivity of supremum and the fact that the limit of a convergent sequence equals its Cesàro limit, we get that rhs(4.4) $\to0$ uniformly in $\underline\theta$. If we can show that $\bigvee_{\underline\theta}D_n(t',\underline\theta)$ is positive, the asymptotic optimality of $t'$ will follow by (4.3) and convergence (uniform in $\underline\theta$) of rhs(4.4) to 0. We shall show that $\bigvee_{\underline\theta}D_n(t,\underline\theta)$ is positive for every compound $t$. Since $\int R_n(t,\underline\theta)\,d\omega^n\ge r(\omega)$ for every $\omega$, in particular for $\omega=G_n$, we get that $\bigvee R_n(t,\underline\theta)\ge\bigvee r(G_n)$. That, by definition of $\bigvee D_n(t,\underline\theta)$ and subadditivity of supremum, implies the positivity of $\bigvee D_n(t,\underline\theta)$. //

Remark 4.3 (Calculation of the a.o. Bayes compound estimator). From (2.4a), Lemma 1.1 and the definition of $\omega_{\alpha,n}$, it follows by a successive deconditioning argument that

$$\hat t_\alpha=\frac{\int\!\cdots\!\int\theta_\alpha\prod_{i=1}^{n}p_{\theta_i}(x_i)\prod_{i=1}^{n}d\omega^{\underline\theta_{i-1}}(\theta_i)}{\int\!\cdots\!\int\prod_{i=1}^{n}p_{\theta_i}(x_i)\prod_{i=1}^{n}d\omega^{\underline\theta_{i-1}}(\theta_i)},\tag{4.5}$$

where $\omega^{\underline\theta_i}$ is the posterior mean of $\omega$ given $\underline\theta_i$ and $\omega^{\underline\theta_0}=\int\omega\,d\Lambda$; for details see Section 3 in Chapter 4 of Datta (1988).

To use (4.5) to calculate our Bayes compound estimator, we need to choose a hyperprior $\Lambda$ such that the posterior mean $\omega^{\underline\theta_i}$ has a nice form for all $i$. With that end in mind, we settle for the Dirichlet priors described below. Let $\alpha$ be a non-null finite Borel measure on $\Theta$, where $\Theta$ is an arbitrary separable metric space.
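To make the "nice form" requirement concrete: for a Dirichlet prior with parameter $\alpha$, Theorem 1 of Ferguson (1973) gives the posterior mean of $\omega$ given $\theta_1,\dots,\theta_n$ as $(\alpha(\Theta)+n)^{-1}(\alpha+\sum_{i=1}^{n}\delta_{\theta_i})$. The sketch below computes this measure under the simplifying assumption, made only for illustration, that $\alpha$ is finitely supported and stored as a dictionary of point masses; `posterior_mean` is a hypothetical helper name, not part of the dissertation.

```python
def posterior_mean(alpha, thetas):
    """Posterior mean of omega under a Dirichlet prior with parameter alpha.

    alpha  : dict mapping support points to positive masses
             (finite support is an assumption made for this sketch)
    thetas : list of observed parameter values theta_1, ..., theta_n
    Returns a dict representing the probability measure
    (alpha + sum_i delta_{theta_i}) / (alpha(Theta) + n).
    """
    total = sum(alpha.values()) + len(thetas)  # alpha(Theta) + n
    mass = dict(alpha)                         # start from the prior masses
    for t in thetas:
        mass[t] = mass.get(t, 0.0) + 1.0       # add a unit point mass at theta_i
    return {point: m / total for point, m in mass.items()}


# Hypothetical two-point parameter measure and three observed parameters.
prior = {0.0: 1.0, 1.0: 1.0}
post = posterior_mean(prior, [1.0, 1.0, 0.0])
print(post)
```

With no observations the function returns the normalized parameter measure $\alpha/\alpha(\Theta)$, which is the $n=0$ case of the formula.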
In Section 2 of the Appendix we show (compiling some results from Section 4 in Ferguson (1973)) that there exists a probability measure $\mathcal D(\alpha)$ on $(\Omega,\mathcal B(\Omega))$ with the following property: for every finite measurable partition $\{B_1,\dots,B_m\}$ of $\Theta$, the distribution of $(\omega(B_1),\dots,\omega(B_m))$ under $\mathcal D(\alpha)$ is Dirichlet with parameters $(\alpha(B_1),\dots,\alpha(B_m))$. We call $\mathcal D(\alpha)$ the Dirichlet prior with parameter $\alpha$. By Theorem 2.1 of the Appendix, the topological support of $\mathcal D(\alpha)$ is $\Omega$ if that of $\alpha$ is $\Theta$. An example of a finite Borel measure $\alpha$ on $\Theta$ with full support is obtained by choosing a countable dense subset $\{\theta_n:n\ge1\}$ of $\Theta$ and selecting $\alpha=\sum_{n=1}^{\infty}c_n\delta_{\theta_n}$, where $c_n>0\ \forall\,n$ and $\sum_{n=1}^{\infty}c_n\in(0,\infty)$. By Theorem 1 in Ferguson (1973), $\omega^{\underline\theta_n}=(\alpha(\Theta)+n)^{-1}(\alpha+\sum_{i=1}^{n}\delta_{\theta_i})$, $n\ge0$. When $\Theta$ is a subset of the line, a Monte Carlo method for calculation of rhs(4.5) has been given by Kuo (1986). The problem of numerical evaluation of our estimator remains and is worth investigating. //

5. Admissibility.

The argument we use to prove admissibility of Bayes compound estimators is fairly standard in decision theory: a unique Bayes rule is admissible (see Theorem 1 in Section 2.3 of Ferguson (1967) for a precise statement). Let $\xi$ be a prior on the compound parameter $\underline\theta$. $Q$ will denote the joint distribution $\xi\circ P_{\underline\theta}$ on $(\underline x,\underline\theta)$. Note that $n^{-1}\sum_{\alpha=1}^{n}Q\|t_\alpha-\theta_\alpha\|^2$, the Bayes (versus $\xi$) compound risk of an estimator $t$, is minimal iff $Q\|t_\alpha-\theta_\alpha\|^2$ is minimal for every $\alpha$. Now $Q\|t_\alpha-\theta_\alpha\|^2$ can be represented as [...] Since the expression inside parenthesis in the previous line has, by Lemma 1.1, a unique minimizer, there exists a unique Bayes compound estimator versus every prior $\xi$. That implies the admissibility of every Bayes compound estimator.

CHAPTER 2

CONSISTENCY OF THE POSTERIOR MIXTURES

In this chapter we show [Theorem 1] that $P_{\omega_{n,n}}$ (the non-delete version of the discussion at the beginning of Section 4 in Chapter 1) is $L^1$ consistent for $P_{G_{n-1}}$ in the sense of (4.1) of Chapter 1.
We actually prove the result with $n$ replaced by $(n+1)$ and obtain (4.1) of Chapter 1 as a corollary [Corollary 1]. For the rest of the chapter let $\hat\omega$ and $\hat\Lambda$ abbreviate $\omega_{n+1,n+1}$ and $\Lambda_{n+1,n+1}$ respectively. Before proceeding further we note that $\hat\omega$ can be interpreted as the posterior distribution of $\theta_{n+1}$ given $(X_1,\dots,X_n)=(x_1,\dots,x_n)$ in the Bayes compound model with $(n+1)$ components.

Consider the following Bayes model on $\Omega\times\Theta^n\times\mathcal X^n$:

(i) Bayes model: $\omega$ is distributed as $\Lambda$ and given $\omega$, $\underline\theta$ is distributed as $\omega^n=\omega\times\dots\times\omega$ and given $\underline\theta$ and $\omega$, $\underline X$ is distributed as $P_{\underline\theta}=\times_{\alpha=1}^{n}P_{\theta_\alpha}$.

The above model gives rise to the following marginal model:

(ii) Bayes compound model: $\underline\theta=(\theta_1,\dots,\theta_n)$ is distributed as $\bar\omega_{\Lambda,n}$ and given $\underline\theta$, $\underline X=(X_1,\dots,X_n)$ is distributed as $P_{\underline\theta}$, where $\bar\omega_{\Lambda,n}$ is the $\Lambda$ mixture of $\omega^n$.

Since $\Theta$ and hence $\Omega$ (with the weak convergence topology) is a Polish (in fact compact metric) space, all conditional distributions are regular by Theorem 10.2.2 of Dudley (1989). Datta (1991a) shows [see his Proposition 2.1] that under model (ii), with $n$ replaced by $n+1$, $\hat\omega$ is the posterior distribution of $\theta_{n+1}$ given $(X_1,\dots,X_n)=(x_1,\dots,x_n)$.

We now develop the machinery needed to prove Theorem 1. There are four propositions leading to the proof of Theorem 1. Four auxiliary lemmas are needed to prove the propositions. The key to the proof of Theorem 1 is the inequality (17) proved in Proposition 3. The force of Proposition 1 is used in part in the proof of Proposition 3 and later in full in the proof of Theorem 1 to treat the denominator of the second term in the bound (17); it is the only link in the proof where the assumption $S_\Lambda=\Omega$ is used. Proposition 2 is used to treat the numerator of the second term in the bound (17). Proposition 4 disposes of the third term in the bound (17).

Lemma 1. For every $\{\omega,\pi\}\subset\Omega$, $\log(p_\omega/p_\pi)\in L^2(\mu)$ and

$$\|\log(p_\omega/p_\pi)\|_2\le e^{5M^2/2}\|p_\omega-p_\pi\|_4 .$$

Proof.
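Since the reciprocal function is convex on $(0,\infty)$, the area under the reciprocal curve between $a$ and $b$, where $0<a<b$, is no greater than the area of the trapezoid over the same base, that is, $\log b-\log a\le2^{-1}(b-a)(a^{-1}+b^{-1})$. Applying this pointwise to $p_\omega$ and $p_\pi$ and then the Cauchy-Schwarz inequality in $L^2(\mu)$,

$$\|\log(p_\omega/p_\pi)\|_2\le2^{-1}\|p_\omega-p_\pi\|_4\,\|p_\omega^{-1}+p_\pi^{-1}\|_4 .\tag{1}$$

By Jensen's inequality for integers $j\ge1$ and (trivially) for $j=0$,

$$p_\omega^{-j}\le\int p_\theta^{-j}\,d\omega\quad\forall\,\omega\in\Omega\text{ and }\forall\,j\text{ described above}.\tag{2}$$

Applying (2) with $j=i$ on $p_\omega$ and $j=4-i$ on $p_\pi$, interchanging the order of integration on $\Theta^2$ and $\mathcal X$, and using Lemma 1.2 of Chapter 1 (with $k=2$, $a_1=-i$, $a_2=i-4$), we get

$$\mu(p_\omega^{-i}p_\pi^{i-4})\le\int\!\!\int\exp\{2^{-1}[\|-i\theta+(i-4)\eta\|^2+i\|\theta\|^2+(4-i)\|\eta\|^2]\}\,d\omega(\theta)\,d\pi(\eta).\tag{3}$$

The exponent in rhs(3) simplifies to $2^{-1}[(i^2+i)\|\theta\|^2+(i^2-9i+20)\|\eta\|^2+2i\langle\theta,\eta\rangle(4-i)]$. For all $i=0,1,\dots,4$, the coefficients of $\|\theta\|^2$, $\|\eta\|^2$ and $\langle\theta,\eta\rangle$ are all non-negative and hence, by (1.3) of Chapter 1, the exponent in rhs(3) is bounded by $10M^2$. Since $\sum_{i=0}^{4}\binom{4}{i}=2^4$, using (3) with $10M^2$ bounding the exponent, we get

$$\text{second factor in rhs}(1)\le2e^{5M^2/2}.\tag{4}$$

The lemma follows from (1) and (4). //

In what follows, any reference to a topology of $\Omega$ will be to the topology of weak convergence.

Lemma 2. $\omega\mapsto p_\omega$ is uniformly continuous from $\Omega$ into $L^4(\mu)$.

Proof. For $j=0,1,2,3,4$, writing $p_\omega^{j}$ (and $p_\pi^{4-j}$) as a $j$ (and $4-j$) fold iterated integral, interchanging the order of integration on $\mathcal X$ and $\Theta^4$ and using Lemma 1.2 of Chapter 1 (with $k=4$, $a_1=\dots=a_4=1$),

$$\int p_\omega^{j}p_\pi^{4-j}\,d\mu=\omega^{j}\times\pi^{4-j}\Big(\exp\Big\{2^{-1}\Big[\Big\|\sum_{i=1}^{4}\theta_i\Big\|^2-\sum_{i=1}^{4}\|\theta_i\|^2\Big]\Big\}\Big).\tag{5}$$

By repeated application of Lemma III.1.1 of Parthasarathy (1967), if $\omega_n\to\omega$ then $\omega_n^{j}\times\omega^{4-j}\to\omega^4$ weakly on $\Theta^4$. Since the integrand in rhs(5) is a bounded continuous function on $\Theta^4$, using (5) twice we get $\int p_{\omega_n}^{j}p_\omega^{4-j}\,d\mu\to\int p_\omega^4\,d\mu$ as $\omega_n\to\omega$. Expanding $(p_{\omega_n}-p_\omega)^4$ and applying the above to the integral of each term, $\int(p_{\omega_n}-p_\omega)^4\,d\mu\to0$ as $\omega_n\to\omega$; that establishes the asserted uniform continuity because the weak topology of $\Omega$ is metrizable as a compact metric space. //

Let

$$A_\pi(\omega):=\int\log(p_\pi/p_\omega)\,dP_\pi .\tag{I}$$

Lemma 3. $\pi\mapsto A_\pi(\omega)$ is equi(in $\omega$) continuous.

Proof.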
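For $\pi$ and $\nu$ in $\Omega$, triangulating around $\int\log(p_\pi/p_\omega)\,dP_\nu$,

$$|A_\pi(\omega)-A_\nu(\omega)|\le\Big|\int\log(p_\pi/p_\omega)\,d(P_\pi-P_\nu)\Big|+|A_\nu(\pi)| .\tag{6}$$

By the Cauchy-Schwarz inequality in $L^2(\mu)$,

$$\text{1st term in rhs}(6)\le\|\log(p_\pi/p_\omega)\|_2\,\|p_\pi-p_\nu\|_2 ;\tag{7}$$

by Lemma 1, the triangle inequality in $L^4(\mu)$ and Lemma 1.3 of Chapter 1 with $q=4$,

$$\text{rhs}(7)\le2e^{4M^2}\|p_\pi-p_\nu\|_2 .\tag{8}$$

By the Cauchy-Schwarz inequality in $L^2(\mu)$ again,

$$\text{2nd term in rhs}(6)\le\|\log(p_\nu/p_\pi)\|_2\,\|p_\nu\|_2 ;\tag{9}$$

by Lemma 1.3 of Chapter 1 with $q=2$ and Lemma 1,

$$\text{rhs}(9)\le e^{3M^2}\|p_\pi-p_\nu\|_4 .\tag{10}$$

Since $\|p_\pi-p_\nu\|_2\le\|p_\pi-p_\nu\|_4$, the proof is completed by combining (6)-(10) and applying Lemma 2. //

Lemma 4. $\omega\mapsto A_\pi(\omega)$ is equi(in $\pi$) continuous.

Proof. For $\omega$ and $\nu$ in $\Omega$, by the Cauchy-Schwarz inequality in $L^2(\mu)$,

$$|A_\pi(\omega)-A_\pi(\nu)|\le\|\log(p_\nu/p_\omega)\|_2\,\|p_\pi\|_2\le e^{3M^2}\|p_\omega-p_\nu\|_4 ,$$

by Lemma 1.3 of Chapter 1 with $q=2$ and Lemma 1; the lemma follows by Lemma 2. //

Proposition 1. If $S_\Lambda=\Omega$, then $\bigwedge_{\pi}\Lambda\{A_\pi<\delta\}>0\ \forall\,\delta>0$.

Proof. [Taken from Lemma 6.6 of Datta (1991a)] By Lemma 4 $\{A_\pi<\delta\}$ is open; since it is non-empty (it contains $\pi$) and $S_\Lambda=\Omega$, $\Lambda\{A_\pi<\delta\}>0$. By Lemma 3, if $\pi_n\to\pi$ then $A_{\pi_n}\to A_\pi$ pointwise on $\Omega$, hence in $\Lambda$-distribution. Therefore, by Theorem II.6.1(d) of Parthasarathy (1967), $\liminf\Lambda\{A_{\pi_n}<\delta\}\ge\Lambda\{A_\pi<\delta\}$ as $\pi_n\to\pi$; in other words, $\pi\mapsto\Lambda\{A_\pi<\delta\}$ is lower semi-continuous. Hence the infimum is attained over compact $\Omega$ and is positive. //

Let

$$\mathcal V_n(\omega):=n^{-1}\sum_{\alpha=1}^{n}\log p_\omega(x_\alpha)-\int\log p_\omega\,dP_{G_n}.\tag{II}$$

Proposition 2. Let $\rho$ be a metric on $\Omega$ for the topology of weak convergence. For every $\delta>0$, $\exists$ an $\varepsilon>0$, such that $\rho(\omega,\pi)<\varepsilon$ implies

$$P_{\underline\theta}\exp\{2n[\mathcal V_n(\omega)-\mathcal V_n(\pi)]\}\le e^{n\delta}.\tag{11}$$

Proof. Using the Cauchy-Schwarz inequality in $L^2(\mu)$, Lemma 1.3 of Chapter 1 with $q=2$ to bound $\|p_{\theta_\alpha}\|_2$ and Lemma 1,

$$\Big|\int\log(p_\omega/p_\pi)\,dP_{\theta_\alpha}\Big|\le e^{3M^2}\|p_\omega-p_\pi\|_4 .\tag{12}$$

Since $2n[\mathcal V_n(\omega)-\mathcal V_n(\pi)]=2\sum_{\alpha=1}^{n}\log(p_\omega/p_\pi)(x_\alpha)-2\sum_{\alpha=1}^{n}\int\log(p_\omega/p_\pi)\,dP_{\theta_\alpha}$, by isotonicity of the exponential function and the bound in (12),

$$\text{lhs}(11)\le\Big[P_{\underline\theta}\Big(\prod_{\alpha=1}^{n}(p_\omega/p_\pi)^2(x_\alpha)\Big)\Big]\exp\{2n\|p_\omega-p_\pi\|_4\,e^{3M^2}\}.\tag{13}$$

We shall now show that (14) [...]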
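Let $A_\delta=\bigcap_{i=1}^{r}[\,|\mathcal V_n(\omega_i)|<\delta/2\,]$, where $\rho$-balls of radius $\varepsilon$ (corresponding to $\delta$ as in Proposition 2) around $\{\omega_1,\dots,\omega_r\}$ cover $\Omega$. Then

$$\tfrac12\|P_{\hat\omega}-P_{G_n}\|\le2\sqrt\delta+\frac{e^{-3n\delta}}{\Lambda^2(\mathcal U_\delta)}\Big[\int e^{-n\mathcal V_n}\,d\Lambda\int e^{n\mathcal V_n}\,d\Lambda\Big]A_\delta+\tilde A_\delta .\tag{17}$$

Proof. Since $\tfrac12\|P_\omega-P_\pi\|\le1$ for all $\{\omega,\pi\}\subset\Omega$,

$$\text{lhs}(17)\le\tfrac12\|P_{\hat\omega}-P_{G_n}\|A_\delta+\tilde A_\delta .\tag{18}$$

By definition of $\hat\omega$, using the inequality $|\int f|\le\int|f|$ with $f=p_\omega-p_{G_n}$ and interchanging the order of $\mu$ and $\hat\Lambda$ integration, we get

$$\|P_{\hat\omega}-P_{G_n}\|\le\int\|P_\omega-P_{G_n}\|\,d\hat\Lambda .\tag{19}$$

Since by inequality (3.6) of Hannan (1960) $\|P_\omega-P_{G_n}\|\le2\sqrt{A_{G_n}(\omega)}$, bounding the integrand in rhs(19) by $4\sqrt\delta$ on $\mathcal U_{4\delta}$ and by 2 on the complement, and $\hat\Lambda(\mathcal U_{4\delta})A_\delta$ by 1,

$$\tfrac12\|P_{\hat\omega}-P_{G_n}\|A_\delta\le2\sqrt\delta+\hat\Lambda(\tilde{\mathcal U}_{4\delta})A_\delta .\tag{20}$$

In view of (18) and (20), it remains to show that the second term in rhs(20) can be bounded by the second term in rhs(17). By definition of $\hat\Lambda$ [it is the posterior distribution of $\omega$ given $(X_1,\dots,X_n)=(x_1,\dots,x_n)$, when given $\omega$, $X_1,\dots,X_n$ are iid $\sim P_\omega$ and $\omega\sim\Lambda$],

$$\hat\Lambda(\tilde{\mathcal U}_{4\delta})=\frac{\int_{\tilde{\mathcal U}_{4\delta}}e^{g}\,d\Lambda}{\int e^{g}\,d\Lambda},\tag{21}$$

where $g(\omega)=\sum_{\alpha=1}^{n}\log p_\omega(x_\alpha)$. Using the identity

$$g(\omega)=n\mathcal V_n(\omega)-nA_{G_n}(\omega)+n\int\log p_{G_n}\,dP_{G_n},$$

bounding $A_{G_n}$ below by $4\delta$ on $\tilde{\mathcal U}_{4\delta}$ in the numerator of rhs(21) and above by $\delta$ on $\mathcal U_\delta$ in the denominator of rhs(21), we get

$$\text{rhs}(21)\le e^{-3n\delta}\,\frac{\int_{\tilde{\mathcal U}_{4\delta}}e^{n\mathcal V_n}\,d\Lambda}{\int_{\mathcal U_\delta}e^{n\mathcal V_n}\,d\Lambda}.\tag{22}$$

Normalizing $\Lambda$ on $\mathcal U_\delta$ ($\Lambda(\mathcal U_\delta)>0$ by Proposition 1) and applying Jensen's inequality to the reciprocal function, which is convex on $(0,\infty)$,

$$\frac{1}{\int_{\mathcal U_\delta}e^{n\mathcal V_n}\,d\Lambda}\le\frac{1}{\Lambda^2(\mathcal U_\delta)}\int_{\mathcal U_\delta}e^{-n\mathcal V_n}\,d\Lambda .\tag{23}$$

Substituting (23) in rhs(22) and weakening the resulting bound by enlarging the ranges of integration, the proposition follows from the remark following (20). //

Proposition 4. Fix a $\delta>0$. Let $A_\delta$ be as in Proposition 3. Then $\bigvee_{\underline\theta}P_{\underline\theta}\tilde A_\delta=O(n^{-1})$ as $n\to\infty$.

Proof. By the definition of $A_\delta$ and subadditivity of measures,

$$P_{\underline\theta}\tilde A_\delta\le\sum_{i=1}^{r}P_{\underline\theta}[\,|\mathcal V_n(\omega_i)|\ge\delta/2\,],$$

which, by applying Chebychev's inequality to each of the terms and bounding the sum of $r$ nonnegative terms by the maximum of the terms times $r$, is bounded by $(4r/\delta^2)\bigvee_{\omega}P_{\underline\theta}(\mathcal V_n(\omega))^2$.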
Since $\mathcal V_n(\omega)$ is the centered average of $n$ independent random variables under $P_{\underline\theta}$,

$$P_{\underline\theta}(\mathcal V_n(\omega))^2\le n^{-1}\bigvee_{\theta}\operatorname{var}_\theta(\log p_\omega).$$

Since the variance is smaller than the second moment, it suffices to show that $\bigvee_{\theta}\bigvee_{\omega}P_\theta(\log p_\omega)^2<\infty$. By the elementary log inequality used in the proof of Lemma 1,

$$4P_\theta(\log p_\omega)^2\le P_\theta(p_\omega-1)^2(1+p_\omega^{-1})^2 .\tag{24}$$

By the Cauchy-Schwarz inequality in $L^2(\mu)$,

$$\text{rhs}(24)\le\|p_\theta(1+p_\omega^{-1})^2\|_2\,\|p_\omega-1\|_4^2 .\tag{25}$$

By the triangle inequality in $L^4(\mu)$ and Lemma 1.3 of Chapter 1 with $q=4$,

$$\text{2nd factor in rhs}(25)\le(e^{3M^2/2}+1)^2 .\tag{26}$$

By the Cauchy-Schwarz inequality in $L^2(\mu)$ and Lemma 1.3 of Chapter 1 with $q=4$,

$$\text{1st factor in rhs}(25)\le e^{3M^2/2}\|1+p_\omega^{-1}\|_8^2 .\tag{27}$$

By the triangle inequality in $L^8(\mu)$,

$$\text{2nd factor in rhs}(27)\le2(1+\|p_\omega^{-1}\|_8^2).\tag{28}$$

Applying the bound on the inverse eighth power of a mixed density obtained in (2), interchanging the order of $\mu$ and $\omega$ integration, using Lemma 1.2 of Chapter 1 (with $k=1$, $a_1=-8$), and using (1.3) of Chapter 1 to bound the resulting exponent, we get that $\|p_\omega^{-1}\|_8^2$ is bounded by $e^{18M^2}$; substituting that bound in rhs(28) and combining the result with (27) and (26), we complete the proof using (25) and the remark preceding (24). //

Theorem 1 [Consistency of the posterior mixtures]. If $S_\Lambda=\Omega$, then $P_{\underline\theta}\|P_{\hat\omega}-P_{G_n}\|\to0$, uniformly in $\underline\theta$, as $n\to\infty$.

Proof. We shall show that for every $\delta>0$,

$$P_{\underline\theta}\Big(\int e^{-n\mathcal V_n}\,d\Lambda\int e^{n\mathcal V_n}\,d\Lambda\Big)A_\delta\le e^{2n\delta},\tag{29}$$

where $A_\delta$ is as in Proposition 3. Taking $P_{\underline\theta}$ expectation of both sides of (17) and using (29),

$$P_{\underline\theta}\|P_{\hat\omega}-P_{G_n}\|\le4\sqrt\delta+\frac{2e^{-n\delta}}{\Lambda^2(\mathcal U_\delta)}+2P_{\underline\theta}\tilde A_\delta .\tag{30}$$

Since the above inequality holds for every $\delta>0$, we complete the proof of Theorem 1 by taking supremum over $\underline\theta$ and lim sup as $n\to\infty$ (in that order) of both sides of (30), using subadditivity of these operations, and applying Propositions 1 and 4.

Applying the Cauchy-Schwarz inequality in $L^2(P_{\underline\theta})$ and then the moment inequality to the $\Lambda$ integrals in both the factors in the Cauchy-Schwarz bound,

$$\text{lhs}(29)\le\Big[P_{\underline\theta}\Big(\int e^{-2n\mathcal V_n}\,d\Lambda\Big)A_\delta\Big]^{1/2}\Big[P_{\underline\theta}\Big(\int e^{2n\mathcal V_n}\,d\Lambda\Big)A_\delta\Big]^{1/2}.\tag{31}$$

Consider the finite cover of $\Omega$ described in Proposition 3.
Clearly, for every $\omega\in\Omega$, choosing an $\omega_i$ such that $\rho(\omega,\omega_i)<\varepsilon$ [...] $\to0$, uniformly in $\underline\theta$, as $n\to\infty$.

Proof. As in the proof of Lemma 4.3 of Datta (1991b), we observe that

$$\bigvee_{\underline\theta}\bigvee_{\alpha=1}^{n}P_{\underline\theta}\|P_{\omega_{\alpha,n}}-P_{G_n^{\alpha}}\|\le\bigvee_{\underline\theta}P_{\underline\theta}\|P_{\omega_{n,n}}-P_{G_{n-1}}\|,\tag{33}$$

where $G_n^{\alpha}$ is the empirical distribution based on $(\theta_1,\dots,\theta_{\alpha-1},\theta_{\alpha+1},\dots,\theta_n)$. Since $G_n-G_n^{\alpha}=n^{-1}(\delta_{\theta_\alpha}-G_n^{\alpha})$ with $\delta_{\theta_\alpha}$ the unit mass at $\theta_\alpha$, the definition of $p_\omega$ gives

$$\|P_{G_n}-P_{G_n^{\alpha}}\|=n^{-1}\mu(|p_{\theta_\alpha}-p_{G_n^{\alpha}}|)\le2n^{-1}.\tag{34}$$

By the triangle inequality, (33) and (34), the corollary follows if rhs(33) goes to 0. But that, with $n$ replacing $n-1$, is the assertion (with some notational changes) of Theorem 1. //

CHAPTER 3

THE EMPIRICAL BAYES ESTIMATION

In this chapter we look at the empirical Bayes [Robbins (1951, 1956)] formulation of our component problem. Consider a Bayes decision problem involving $\{P_\theta:\theta\in\Theta\}$ and a Bayes prior $\omega$, where $\omega$ is unknown. Suppose we have iid pairs $(\theta_1,X_1),\dots,(\theta_n,X_n),\dots$, where $\theta_1$ is distributed as $\omega$ and given $\theta_1$, $X_1$ is distributed as $P_{\theta_1}$. At stage $n$, a decision $t_n=t_n(\underline X_n)$ about $\theta_n$ is taken incurring loss $\|t_n-\theta_n\|^2$ and risk $\int\!\!\int\|t_n-\theta_n\|^2\,dP_{\underline\theta}\,d\omega^n$. The sequence $\{t_n:n\ge1\}$ is called an empirical Bayes rule. An empirical Bayes rule $\{t_n\}$ is called asymptotically optimal (a.o.) if $\int\!\!\int\|t_n-\theta_n\|^2\,dP_{\underline\theta}\,d\omega^n\to r(\omega)$, for each $\omega\in\Omega$, as $n\to\infty$. The notion of admissibility in the class of empirical Bayes rules is the same as the corresponding notion in the case of compound rules, with the understanding that the risk now is a function of $\omega$.

Let $\Lambda$ be a hyperprior on $\Omega$. We will prove that any sequence of Bayes (versus $\Lambda$ at each stage) empirical Bayes rules is admissible; if $S_\Lambda=\Omega$, a sequence of Bayes empirical Bayes rules versus $\Lambda$ is asymptotically optimal as well.

1. Bayes empirical Bayes.
For any given $n$, the stage $n$ Bayes risk versus $\Lambda$ in the empirical Bayes problem is

$$\int\!\!\int\!\!\int\|t_n-\theta_n\|^2\,dP_{\underline\theta}\,d\omega^n\,d\Lambda(\omega)=\int\!\!\int\|t_n-\theta_n\|^2\,dP_{\underline\theta}\,d\bar\omega_{\Lambda,n},$$

which is the $n$-th component Bayes risk versus the prior $\bar\omega_{\Lambda,n}$ on the compound parameter $\underline\theta$ in the set compound problem with $n$ components. Hence a Bayes empirical Bayes estimator is $\hat t_n$ given in (2.4a), with $\alpha$ replaced by $n$.

Admissibility. Since a Bayes empirical Bayes estimator is $\hat t_n$ given in (2.4a), and as observed in Section 2.5 of Chapter 1 every Bayes compound estimator is unique up to $\mu^n$ equivalence, the admissibility again follows from the uniqueness of Bayes rule argument.

2. Asymptotic optimality.

Theorem 2.1. If $S_\Lambda=\Omega$ then the Bayes empirical Bayes estimator $\{\hat t_n:n\ge1\}$ is asymptotically optimal.

Proof. Let $\tau_{\omega,n}$ be a component Bayes estimator versus $\omega$ based on $X_n$. Then, as in (2.5) of Chapter 1,

$$\Big|\int\!\!\int\|\hat t_n-\theta_n\|^2\,dP_{\underline\theta}\,d\omega^n-r(\omega)\Big|\le4M\,P_{\omega^n}\|\tau_{\omega_{n,n}}-\tau_{\omega,n}\| ;\tag{2.1}$$

by Proposition 3.1 of Chapter 1 it is enough to show that

$$P_{\omega^{n-1}}\|P_{\omega_{n,n}}-P_\omega\|\to0\ \text{as }n\to\infty.\tag{2.2}$$

The uniform (in $\omega$) version of (2.2), with $n$ replacing $n-1$, is the assertion (with some changes in notation) of the following corollary [Corollary 2.1] to Theorem 1 in Chapter 2; applying that corollary we complete the proof. //

Corollary 2.1. Let $\hat\omega$ be as in Chapter 2. Under the assumption $S_\Lambda=\Omega$, $P_{\omega^n}\|P_{\hat\omega}-P_\omega\|\to0$, uniformly in $\omega$, as $n\to\infty$.

Proof. Noting that $P_{\omega^n}$ is the marginal on $\mathcal X^n$ of the joint distribution on $\mathcal X^n\times\Theta^n$ obtained from $P_{\underline\theta}$ and $\omega^n$, and triangulating around $P_{G_n}$,

$$P_{\omega^n}\|P_{\hat\omega}-P_\omega\|\le\int P_{\underline\theta}\|P_{\hat\omega}-P_{G_n}\|\,d\omega^n+\omega^n(\|P_{G_n}-P_\omega\|).\tag{2.3}$$

Now the first term in rhs(2.3) goes to zero uniformly in $\omega$ by Theorem 1 in Chapter 2. By the moment inequality, applied first to the $\omega^n$ integral and then to the $\mu$ integral,

$$\{\omega^n(\|P_{G_n}-P_\omega\|)\}^2\le\omega^n(\|p_{G_n}-p_\omega\|_2^2).\tag{2.4}$$

Now by interchanging the order of $\mu$ and $\omega^n$ integration, noting that $\omega^n(p_{G_n}-p_\omega)^2$ is the variance of the average of $n$ iid random variables and bounding the variance of a random variable by its second moment, we get

$$\text{rhs}(2.4)\le n^{-1}\int\!\!\int p_\theta^2\,d\omega(\theta)\,d\mu .\tag{2.5}$$

Interchanging the order of $\mu$ and $\omega$ integration, and applying Lemma 1.3 of Chapter 1 with $q=2$ to bound $\mu(p_\theta^2)$,

$$\text{rhs}(2.5)\le n^{-1}e^{M^2}.\tag{2.6}$$

The corollary follows by (2.4)-(2.6). //

Remark 2.1. By (2.1), Proposition 3.1 of Chapter 1 and Corollary 2.1, we get that lhs(2.1) goes to 0 uniformly in $\omega$, which is a stronger form of asymptotic optimality.

APPENDIX

1. On measurability.

In this section, we prove two lemmas concerning measurability of two maps which have been used in the main body of the dissertation.

Lemma 1.1. Let $\Theta$ be a separable metric space endowed with its Borel $\sigma$-field. Let $\Omega$ be the set of all probabilities on $\Theta$, endowed with the topology of weak convergence and the corresponding Borel $\sigma$-field. Then, for every bounded real-valued measurable function $h$ on $\Theta$, the map $\omega\mapsto\omega(h)$ is measurable.

Proof. We shall use the following theorem (T I.20) from Meyer (1966): Let $\mathcal H$ be a vector space of bounded real-valued functions defined on $\Gamma$, which contains the constant 1, is closed under uniform convergence, and is such that for every increasing, uniformly bounded sequence of non-negative functions $g_n\in\mathcal H$, the function $g=\lim_{n\to\infty}g_n$ belongs to $\mathcal H$. Let $C$ be a subset of $\mathcal H$, closed under multiplication. Then the space $\mathcal H$ contains all the bounded functions measurable with respect to the $\sigma$-field generated by the elements of $C$.

Let $\Gamma=\Theta$, $\mathcal H=\{h:h$ is a bounded real-valued function on $\Theta$ and $\omega\mapsto\omega(h)$ is measurable$\}$; clearly $\mathcal H$ satisfies all the conditions of T I.20, Meyer (1966). Let $C=\{B\subseteq\Theta:B$ is closed$\}$, each set being identified with its indicator function. Clearly $C$ is closed under multiplication. Since, by the portmanteau Theorem [Theorem II.6.1(c), Parthasarathy (1967)], for every $B\in C$ and every $k\in\mathbb R$ the set $\{\omega:\omega(B)\ge k\}$ is closed in the topology of weak convergence, we get that $C$ is a subset of $\mathcal H$.
Therefore $\mathcal H$ contains all the bounded real-valued measurable functions on $\Theta$. //

Lemma 1.2. Let $(\mathcal X,\mathcal F)$ be a measurable space. Let $\Theta$ be a separable metric space endowed with its Borel $\sigma$-field. Let $f:\Theta\times\mathcal X\to[0,\infty)$ be a measurable function. Let $\Omega$ be the set of all probabilities on $\Theta$, endowed with the topology of weak convergence and its Borel $\sigma$-field. For $\omega\in\Omega$, let $\bar f(\omega,x):=\int f(\cdot,x)\,d\omega$; then $\bar f:\Omega\times\mathcal X\to[0,\infty]$ is measurable.

Proof. We shall again use T I.20, Meyer (1966). Let $\Gamma=\Theta\times\mathcal X$, $\mathcal H=\{h:h$ is a bounded real-valued function on $\Theta\times\mathcal X$ and $(\omega,x)\mapsto\int h(\cdot,x)\,d\omega$ is measurable$\}$; clearly $\mathcal H$ satisfies all the conditions of T I.20, Meyer (1966). Let $C=\{A\times B:A$ is a measurable subset of $\Theta$, $B$ is a measurable subset of $\mathcal X\}$. $C$ is clearly closed under multiplication. That $C$ is a subset of $\mathcal H$ follows from Lemma 1.1. Therefore $\mathcal H$ contains all the bounded real-valued measurable functions on $\Theta\times\mathcal X$, in particular, $f\wedge M$ for every positive integer $M$. Since $\{f\wedge M:M\ge1\}$ is an increasing sequence of non-negative functions with pointwise limit $f$, by the monotone convergence theorem $\int(f\wedge M)(\cdot,x)\,d\omega$ increases to $\bar f(\omega,x)$, so $\bar f$ is measurable as a pointwise limit of measurable functions. That completes the proof of the lemma. //

2. On topological support of Dirichlet prior.

In this section we present a result of independent interest characterizing the topological support of a Dirichlet prior on an arbitrary separable metric space, which is used in Remark 4.3 of Chapter 1 to give examples of $\Lambda$ with full support. Ferguson (1973) states that the topological support of a Dirichlet prior on the Borel $\sigma$-field (corresponding to the weak convergence topology) of the set of all probability measures on the line is the set of all probability measures with their topological supports contained in that of the parameter measure of the Dirichlet prior. We prove that statement with the line replaced by an arbitrary separable metric space.

Let $\mathcal X$ be a separable metric space and $\mathcal A$ be the Borel $\sigma$-field of $\mathcal X$. Let $\Omega$ be the set of all probability measures on $(\mathcal X,\mathcal A)$.
The topology of weak convergence on $\Omega$ is metrizable as a separable metric space [Theorem II.6.2, Parthasarathy (1967)]; let $\mathcal B(\Omega)$ denote its Borel $\sigma$-field.

We consider the random probability measure $P$ defined in (4.7) of Ferguson (1973). Let $\{V_n:n\ge1\}$ be a sequence of iid random elements taking values in $(\mathcal X,\mathcal A)$ with common distribution $Q$, where $Q(A)=\alpha(A)/\alpha(\mathcal X)$ and $\alpha$ is a finite non-null measure on $(\mathcal X,\mathcal A)$. Let $\{J_n:n\ge1\}$ be a sequence of non-negative random variables independent of $\{V_n:n\ge1\}$. For $j\ge2$, let the conditional distribution of $J_j$ given $J_{j-1},\dots,J_1$ be equal to the distribution of $J_1$ truncated above at $J_{j-1}$; let the distribution function of $J_1$ be $\exp(N(\cdot))$, where $N(x)=-\alpha(\mathcal X)\int_x^{\infty}e^{-y}y^{-1}\,dy$ for $x>0$. In Theorem 4.1 of Ferguson (1973) it is proved that $\sum_{n=1}^{\infty}J_n$ converges w.p. 1. For $A\in\mathcal A$, define

$$P(A)=\sum_{j=1}^{\infty}P_j\,\chi_{V_j}(A),$$

where $P_n=J_n/\sum_{j=1}^{\infty}J_j$ and $\chi_v(A)=1$ if $v\in A$, $=0$ otherwise. Clearly, for every point in the set (in the probability space underlying the random sequences $\{P_n\}$ and $\{V_n\}$) on which $\sum_{n=1}^{\infty}J_n$ converges, $P$ is a probability measure on $\mathcal A$. Therefore without loss of generality we can assume $P$ to be $\Omega$ valued.

Let $\phi_A$ be the map on $\Omega$ taking $\omega$ to $\omega(A)$. Since the real-valued map $P(A)$ is Borel measurable, $P$ is measurable with respect to $\sigma\{\phi\}$, the $\sigma$-field generated by the family $\{\phi_A:A\in\mathcal A\}$. Note that by Lemma 1.1, $\sigma\{\phi\}$ is a sub $\sigma$-field of $\mathcal B(\Omega)$. We shall show that $\mathcal B(\Omega)$ is a sub $\sigma$-field of $\sigma\{\phi\}$. We shall denote the induced distribution of $P$ on $(\Omega,\mathcal B(\Omega))$ by $\mathcal P$ and refer to it as the Dirichlet prior on $\mathcal B(\Omega)$ with parameter $\alpha$. By Theorem 4.2 of Ferguson (1973), for every $k=1,2,\dots$, and measurable partition $(B_1,\dots,B_k)$ of $\mathcal X$, the distribution of $(P(B_1),\dots,P(B_k))$ is Dirichlet with parameters $(\alpha(B_1),\dots,\alpha(B_k))$. Note that in the sense of Ferguson (1973) if the $j$-th parameter of a Dirichlet distribution is equal to 0, then the $j$-th coordinate is degenerate at 0.

To prove $\mathcal B(\Omega)\subset\sigma\{\phi\}$, recall [Theorem 3, Appendix III, Billingsley (1968)] that $\mathcal U:=\{N(\mu;A_1,\dots,A_k;\varepsilon_1,\dots,\varepsilon_k):\mu\in\Omega,\ \varepsilon_i>0,\ A_i\ \mu\text{-continuity subset of }\mathcal X,\ i=1,2,\dots,k,\ k=1,2,\dots\}$ is an open base for the topology of weak convergence on $\Omega$, where

$$N(\mu;A_1,\dots,A_k;\varepsilon_1,\dots,\varepsilon_k):=\bigcap_{i=1}^{k}\{\nu\in\Omega:|\nu(A_i)-\mu(A_i)|<\varepsilon_i\}.$$

Each set in $\mathcal U$ belongs to $\sigma\{\phi\}$ and, $\Omega$ being a separable metric space, every open subset of $\Omega$ is a countable union of sets in $\mathcal U$; hence $\mathcal B(\Omega)\subset\sigma\{\phi\}$.

For a finite Borel measure $m$ on a second countable topological space $\mathcal Y$, the topological support of $m$ is defined to be the set $S_m=\bigcap\{F:F$ is closed and $m(F^c)=0\}$. Note that $s\in S_m$ iff for any open set $O$ containing $s$, we have $m(O)>0$. Since $\mathcal Y$ is second countable, by Lindelöf's Theorem $S_m^c$ can be expressed as a countable union of $F^c$ sets. Therefore

$$m(S_m^c)=0;\tag{2.1}$$

hence

$$\text{if }B\text{ is closed and }m(B)=m(S_m),\text{ then }S_m\subset B.\tag{2.2}$$

In the sequel, the set $\{(x_1,\dots,x_m)\in\mathbb R^m:0\le x_i\ \forall\,i=1,2,\dots,m\text{ and }\sum_{i=1}^{m}x_i\le1\}$ will be referred to as the sub-simplex in $m$-dimension.

We now state the main result of this section characterizing the topological support of a Dirichlet prior $\mathcal P$ with parameter $\alpha$.

Theorem 2.1. $S_{\mathcal P}=\{\mu\in\Omega:S_\mu\subset S_\alpha\}$.

Proof. We first show $S_{\mathcal P}\supset\{\mu\in\Omega:S_\mu\subset S_\alpha\}$. Let $S_\mu\subset S_\alpha$; we shall show that every basic open set in $\mathcal U$ containing $\mu$ has positive $\mathcal P$-probability to conclude $\mu\in S_{\mathcal P}$. Now for arbitrary positive integer $k$, $\mu$-continuity subsets $A_1,\dots,A_k$ of $\mathcal X$ and positive numbers $\varepsilon_1,\dots,\varepsilon_k$,

$$\mathcal P\Big(\bigcap_{i=1}^{k}\{\nu\in\Omega:|\nu(A_i)-\mu(A_i)|<\varepsilon_i\}\Big)=\Pr\Big(\bigcap_{i=1}^{k}\{|P(A_i)-\mu(A_i)|<\varepsilon_i\}\Big),\tag{2.3}$$

where $\Pr$ is the probability measure on the domain of $P$. Let $\{F_{\nu_1\dots\nu_k}:\nu_i=0\text{ or }1\ \forall\,i=1,\dots,k\}$ denote the measurable partition generated by $A_1,\dots,A_k$; i.e. $F_{\nu_1\dots\nu_k}=\bigcap_{j=1}^{k}A_j^{\nu_j}$, where $A^{\nu_j}=A$ if $\nu_j=1$, $=A^c$ otherwise. Then, noting that $A_i=\bigcup_{\nu_i=1}F_{\nu_1\dots\nu_k}$ and using subadditivity of distance, we get (2.4) [...] where $\varepsilon=\bigwedge_{i=1}^{k}\varepsilon_i$.

Now $\{P(F_{\nu_1\dots\nu_k}):\alpha(F_{\nu_1\dots\nu_k})>0\}$ has a Dirichlet distribution with all parameters positive; since, by (2.5) and (2.6), $\sum\{P(F_{\nu_1\dots\nu_k}):\alpha(F_{\nu_1\dots\nu_k})>0\}=1$, temporarily abbreviating $(\nu_1,\dots,\nu_k)$ to $\underline\nu$ and fixing a $\tilde{\underline\nu}$ for which $\alpha(F_{\tilde{\underline\nu}})>0$, we get

$$\text{rhs}(2.7)\supset\bigcap\big\{[\,|P(F_{\underline\nu})-\mu(F_{\underline\nu})|<2^{-2k}\varepsilon\,]:\underline\nu\ne\tilde{\underline\nu}\text{ and }\alpha(F_{\underline\nu})>0\big\}.\tag{2.8}$$
Since $\{P(F_{\nu_1\dots\nu_k}):\alpha(F_{\nu_1\dots\nu_k})>0\}$ has a Dirichlet distribution with all parameters positive, the induced distribution (over the sub-simplex in appropriate dimension) of a one-component-deleted subvector of $\{P(F_{\nu_1\dots\nu_k}):\alpha(F_{\nu_1\dots\nu_k})>0\}$ puts positive mass on every subset of the sub-simplex with non-empty interior. By (2.4), (2.7) and (2.8), lhs(2.3) is positive. That completes the proof of $\mu\in S_{\mathcal P}$.

Conversely, suppose $\mu\in S_{\mathcal P}$; to show $S_\mu\subset S_\alpha$ it is enough (by the observation in (2.2)) to show $\mu(S_\alpha^c)=0$. Since by Theorem II.6.1(d) of Parthasarathy (1967) $\liminf\omega_n(A)\ge\omega(A)$ whenever $\omega_n\to\omega$ and $A\subset\mathcal X$ is open, the set $\{\nu\in\Omega:\mu(A)\ge\nu(A)+\varepsilon\}$ is closed (in the topology of weak convergence) for every open set $A\subset\mathcal X$ and every $\varepsilon>0$. Since $S_\alpha^c$ is open, $\{\nu\in\Omega:\mu(S_\alpha^c)<\nu(S_\alpha^c)+\varepsilon\}$ is an open set containing $\mu$ for every $\varepsilon>0$; since $\mu\in S_{\mathcal P}$, $\mathcal P\{\nu\in\Omega:\mu(S_\alpha^c)<\nu(S_\alpha^c)+\varepsilon\}>0$ for every $\varepsilon>0$. Now $\nu(S_\alpha^c)=0$ a.s. ($\mathcal P$), because $(P(S_\alpha^c),P(S_\alpha))$ has a Dirichlet distribution with parameters $(\alpha(S_\alpha^c),\alpha(S_\alpha))$ and $\alpha(S_\alpha^c)=0$ by (2.1). Therefore, $\mu(S_\alpha^c)<\varepsilon$ for every $\varepsilon>0$; that is, $\mu(S_\alpha^c)=0$. That completes the proof. //

BIBLIOGRAPHY

BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley, New York.

CHOW, Y.S. and TEICHER, H. (1988). Probability Theory, Independence, Interchangeability, Martingales. Springer Verlag, New York.

DATTA, S. (1988). Asymptotically optimal Bayes compound and empirical Bayes estimators in exponential families with compact parameter space. Ph.D. dissertation, Dept. Statist. and Probab., Michigan State Univ.

DATTA, S. (1991a). On the consistency of posterior mixtures and its applications. Ann. Statist. 19 338-353.

DATTA, S. (1991b). Asymptotic optimality of Bayes compound estimators in compact exponential families. Ann. Statist. 19 354-365.

DUDLEY, R.M. (1967). The sizes of compact subsets of Hilbert Space and continuity of Gaussian processes. J. Functional Analysis 1 290-330.

DUDLEY, R.M. (1989). Real Analysis and Probability. Wadsworth & Brooks/Cole, Pacific Grove, California.
FERGUSON, T.S. (1967). Mathematical Statistics, A Decision Theoretic Approach. Academic Press, New York.

FERGUSON, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209-230.

GILLILAND, D.C. (1968). Sequential compound estimation. Ann. Math. Statist. 39 1890-1904.

GILLILAND, D.C. and HANNAN, J. (1974/86). The finite state compound decision problem, equivariance and restricted risk components. In Adaptive Statistical Procedures and Related Topics (J. Van Ryzin, ed.) IMS Lecture Notes-Monograph Series 8 129-145. IMS, Hayward, Calif.

GILLILAND, D.C., HANNAN, J. and HUANG, J.S. (1976). Asymptotic solutions to the two state component compound decision problem, Bayes versus diffuse priors on proportions. Ann. Statist. 4 1101-1112.

HANNAN, J.F. and ROBBINS, H. (1955). Asymptotic solutions of the compound decision problem for two completely specified distributions. Ann. Math. Statist. 36 1743-1752.

HANNAN, J.F. (1957). Approximation to Bayes risk in repeated play. The Theory of Games 3. Ann. Math. Studies 39 97-139. Princeton Univ. Press.

HANNAN, J. (1960). Consistency of maximum likelihood estimation of discrete distributions. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling (I. Olkin et al., eds.) 244-257. Stanford University Press.

HUANG, J.S. (1972). A note on Robbins' compound decision procedure. Ann. Math. Statist. 43 348-350.

KUO, L. (1986). A note on Bayes empirical Bayes estimation by means of Dirichlet processes. Statist. Probab. Lett. 4 145-150.

LECAM, L.M. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.

MASHAYEKHI, M. (1990). Stability of symmetrized probabilities and compact compound equivariant decisions. Ph.D. dissertation, Dept. Statist. and Probab., Michigan State Univ.

MEYER, P.A. (1966). Probability and Potential. Blaisdell Publishing Company, Waltham, Massachusetts.

PARTHASARATHY, K.R. (1967). Probability Measures on Metric Spaces. Academic, New York.

ROBBINS, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. Proc. Second Berkeley Symp. Math. Statist. Probab. 131-148. Univ. California Press, Berkeley.

ROBBINS, H. (1956). An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. Math. Statist. Probab. 1 157-164. Univ. California Press, Berkeley.

SINGH, R.S. (1974). Estimation of derivatives of average p-densities and sequence-compound estimation in exponential families. Ph.D. dissertation, Dept. Statist. and Probab., Michigan State Univ.

ZHU, J. (1992). Asymptotic behavior of compound rules in compact regular and nonregular families. Ph.D. dissertation, Dept. Statist. and Probab., Michigan State Univ.