This is to certify that the dissertation entitled "Some aspects of Polya tree and Dykstra-Laud priors," presented by Liliana Draghici, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Statistics.

Major professor: R.V. Ramamoorthi
Date: August 7, 2000

SOME ASPECTS OF POLYA TREE AND DYKSTRA-LAUD PRIORS

By Liliana Draghici

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Statistics and Probability

2000

Abstract

SOME ASPECTS OF POLYA TREE AND DYKSTRA-LAUD PRIORS

By Liliana Draghici

In this dissertation we develop some properties of tailfree processes and Dykstra-Laud processes, used as prior probability measures in some Bayesian nonparametric problems. The first chapter of the dissertation contains a characterization of tailfree processes based on DeFinetti's theorem for a sequence of exchangeable random variables. Special cases of the tailfree processes are the Polya tree processes. In the second chapter, in the context of Polya tree processes, we obtain conditions for the prior and posterior to be mutually absolutely continuous, as well as conditions for the prior and posterior to be mutually singular. Chapter 3 deals with prior probability measures introduced by Dykstra and Laud (1981). First, the L1-support for these priors is established. The later parts of the chapter are devoted to the consistency of the posterior distribution: weak consistency and strong consistency.

To my parents, my sister and my husband.

ACKNOWLEDGEMENTS

I am deeply grateful to my dissertation adviser, Professor R.V. Ramamoorthi, for his continuous help, advice and guidance over these years. His care, patience, kindness and good sense of humor were of great value to me. I would like to thank Professors Roy Erickson, Michael Frazier and James Hannan for serving on my guidance committee, Professor Erickson for many useful suggestions, and Professor Hannan for his encouragement, helpful suggestions and conversations over the years. I am also very thankful to Professor Levental for his guidance and help in my first years here, and to Professor Gilliland for all I learned from him, especially about teaching. My deep regard goes to my Master's thesis adviser, Professor Cabiria Andreian, at the University of Bucharest. Her encouragement, care and great personality, as well as some other wonderful mathematics teachers I had in Romania, inspired in me a love for mathematics. I am very grateful to my husband for his patience and encouragement, for always being close to me in hard moments, and for the countless trips he has made to come and see me in the last three years. I am very thankful to my parents for all their support and understanding during my student years.

Contents

1 Characterization of tailfree processes
  1.1 Prior and Posterior
  1.2 Tailfree processes
  1.3 Characterization

2 Absolute continuity and singularity of Polya tree prior and posterior
  2.1 Introduction
  2.2 Main Theorem
3 Dykstra-Laud prior for hazard rates
  3.1 Introduction
  3.2 The extended Gamma process
  3.3 L1-support of the prior
  3.4 Weak consistency
  3.5 Strong consistency

Bibliography

Introduction

A common statistical model consists of real valued observations X1, X2, ... which are independent with common distribution P, where P is unknown. The goal is to make inference on P based on the observations. Typically P is assumed to lie in a subset M of M(R), the set of all probability measures on R. In this thesis we look at situations when: (1) M is all of M(R), (2) M is the set of densities on R, or (3) M is the set of all densities with nondecreasing hazard rate.

Our approach is Bayesian. That is, we assume there is prior knowledge about P, represented by a probability measure on M called the prior distribution. Inference is then based on the conditional distribution given the observations, the posterior distribution. One of the main issues in this approach is the construction of priors on the set of all probability measures. Ferguson [11] notes that such priors should

• have large support with respect to some suitable topology;
• yield a tractable posterior distribution given a sample.

The Dirichlet processes constructed by Ferguson [11] fulfill these requirements. Even though the Dirichlet process has many appealing properties, it has one major drawback: it gives mass 1 to the set of discrete distributions. Ferguson [11] also introduced Polya tree processes. These include Dirichlet priors and depend on a large family of parameters which can be chosen to ensure that the Polya tree prior sits on continuous distributions and even on densities. Tailfree processes further generalize Polya tree priors.

In Chapter 1, after introducing and briefly describing some known properties of tailfree priors, we give a characterization of tailfree priors. It is known (Doksum, [4]) that if for all Borel sets B the posterior distribution of the random variable P ↦ P(B) depends only on the number of observations that fall in B (and not on the exact values of the observations), then the prior must be a Dirichlet process. We give a similar characterization of tailfree priors. We also obtain necessary and sufficient conditions that a sequence of exchangeable random variables should satisfy so that the prior arising from DeFinetti's theorem is tailfree.

A Dirichlet process has the disturbing property that, except for trivial situations, the prior and posterior are mutually singular. In Chapter 2 we obtain sufficient conditions for the Polya tree prior and the resulting posterior to be mutually absolutely continuous, as well as conditions for them to be mutually singular. It turns out that the conditions which ensure that the prior is concentrated on densities also ensure that the prior and posterior are mutually absolutely continuous.

Another class of priors on densities was introduced by Dykstra and Laud [7]. These priors give mass 1 to the set of densities with nondecreasing hazard rates. In Chapter 3 we consider a special case of the Dykstra-Laud prior which is induced by the Gamma process. After introducing the prior we study the (topological) support of the prior. The later parts of Chapter 3 deal with consistency of these priors. A prior is consistent at P if the posterior probability of any neighborhood U of P goes to 1 with P-probability 1.
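A toy parametric sketch may help fix this definition. It is not part of the thesis: the Beta(1,1)-Bernoulli model, the true value p0 = 0.3, and the neighborhood half-width 0.05 below are purely illustrative choices. The posterior mass of a fixed neighborhood of the true value tends to 1 as the sample size grows.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

# Illustrative parametric example of posterior consistency (not the
# nonparametric setting of the thesis): prior p ~ Beta(1, 1), data i.i.d.
# Bernoulli(p0), neighborhood U = {p : |p - p0| < 0.05}.
p0, half_width = 0.3, 0.05
for n in [10, 100, 1000, 10000]:
    x = rng.binomial(1, p0, size=n)
    post = beta(1 + x.sum(), 1 + n - x.sum())   # conjugate posterior of p
    mass_of_U = post.cdf(p0 + half_width) - post.cdf(p0 - half_width)
    print(n, mass_of_U)                          # increases towards 1
```

In this conjugate case the posterior concentrates at p0 for every true p0; the nonparametric results of Chapter 3 identify the distributions at which the analogous statement holds for the Dykstra-Laud prior.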
Posterior consistency is a kind of frequentist validation of the Bayesian method and has received much attention in recent times. Doob [5], Freedman [12], and later Freedman and Diaconis [13], showed that even simple priors can be inconsistent at some P’s. It is then important to describe those P’s for which consistency holds. Consistency depends on the kind of neighborhoods under consideration. If U is a weak neighborhood then it is called weak consistency and if U is a total variation neighborhood then it is called strong or L1 consistency. Consistency prOperties for tailfree priors and Polya tree priors have been studied (Barron, Schervish and Wasserman [1], Ghosh and Ramamoorthi, [18]). In Chapter 3 we describe a class of distributions under which the prior considered is weakly consistent. The main tool for this is a theorem of Schwartz [24]. However, since the Dykstra-Laud prior sits on densities, L1 consistency is more appropriate. In the final part of this dissertation we investigate strong consistency for Dykstra-Laud priors using a result of Ghosh, Ghosal and Ramamoorthi [16]. Parts of this thesis are published in [6] and [3]. Chapter 1 Characterization of tailfree processes 1.1 Prior and Posterior Let IR be the real line and let M(lR) be the set of all probability measures on R. The o-algebra considered on IR is the Borel o-algebra BUR). Denote by BM the smallest o-algebra on M(lR) which makes the functions P H P(B), P E M(lR), measurable for any Borel set B C R. The Bayesian setup requires a prior distribution on (M(lR), BM). After observing the data X1, . . . , Xn the prior is updated and the result is called posterior distribution given X1, . . . , Xn. That is the conditional distribution of P given the data X1, . . . ,Xn. We will think of the observations X1, X2, as being the coordinate random variables defined on Q = IR°°, endowed with the Borel o-algebra B(IR°°). Let II be a prior distribution on M(lR) and given P E M(lR), let X1, X2, ,Xn 4 be i.i.d. P. Let Pfi denote the joint distribution of P and the data X1, . . . , X". Then [MC X {)(1 E A1,...,Xn E An}) = / HP(A1)dI_I(P), Ci=l where C is a set in BM and A1, ..., An are Borel sets of IR. The marginal distribution P" of X1, . . . , Xn is given by P"({X1 6 A1, ...,X, e An}) = HP(A,-)dH(P), M(lR) i=1 where, as before, A1, ..., An are Borel sets. The posterior given n observations X1, . . . , Xn, denoted by II] X1,” X", is the condi- tional distribution of P’fi given the o-algebra 8* generated by M(lR) x 0(X1, ..., X"). A formal definition is given in the next lines. Definition 1.1. A function II]X,an(- I) : BM x S2 —) [0,1] is called a posterior distribution given X1, . . .,X,, if 1. For each to E Q, HIXI...Xn( - Ice) is a probability measure on (M(lR), BM). 2. For every C E BM, H|X1...X,,(C[) is a version of Ep'rll(].CxQ[B*). That is, for every set A E 0(X1, . . . , X"), / “1(Clw)dP"(w) = Puc x A). A The posterior is unique only up to P" null sets. However, in most situations of interest there will be some natural candidate for the posterior distribution and we 5 HE r“, k\ (I will refer to it as the ”posterior”. 1 .2 Tailfree processes Dirichlet processes were introduced in 1973 by Ferguson who presented many of their basic properties and applied them to some nonparametric estimation problems. Dirichlet processes form a class of prior distributions on the space of probability mea- sures on the real line. For every finite measure a on the real line, one can define a Dirichlet process denoted by Da. 
We say that D0 is a Dirichlet process of pa- rameter a if for every finite Borel measurable partition Bl, ..., B, of IR, the random vector P —-> (P(Bl), ..., P(Bn)) has under Do a Dirichlet distribution with parameter (01(81), . . .,a(Bm)). If a(B) = 0 then P(B) : 0 almost surely under DO. Dirichlet processes could be seen as an infinite dimensional analogue of the finite dimensional Dirichlet distribution, which in turn, is a multivariate generalization of the Beta distribution. Dirichlet priors form a conjugate family of priors, in the sense that when the prior is a Dirichlet process D0,, the posterior distribution given a sample of observations X1, . . .,X,, is also a Dirichlet process and its parameter is a + 2'; 6X, (Ferguson, [11]). Here 6,, is the measure giving mass one to x. A disadvantage of the Dirichlet priors is that they choose discrete distributions with probability one. Therefore more general priors, that could give mass one to continuous distribution or even to densities, are introduced. These are the Tailfree processes, (Freedman, [12], Fabius, [8]) which we describe next. Let (rk)k31 be a sequence of nested partitions of R where 7'1: B07 Bl) BOHBI : ¢i BOUBI : R T21 Boo, Bet, Boo, BOI where B00, BOI is a partition of B0, B10, B11 is a partition of BI, and so on. For each i, let E,- : {0,1}i be the set of all sequences of 0’s and 1’s of length i and let E” 2 SEE,» We can then conveniently write the partition 73- as {Bg : g E E,}. In general BED, Bgl is a partition of Bg- We assume that the sets Be are nonempty intervals and they generate the Borel o-algebra on IR. Definition 1.2. A prior distribution II on (M(lR), BM) is said to be tailfree with respect to the nested sequence of partitions (rk)k21 if the random vectors {P(BO)}7{P(BOO [B0)aP(B10 I Blllv'“ ’{P(B§0 IBE) :EE Ei}v' ' ' are independent for alli Z 1. When P(Bg) : 0, we make the convention that P(Bgo | Br) 2 1. Some properties of tailfree processes 1. As we will see in the next chapter, there are examples of tailfree priors that give probability 1 to the set of continuous distributions or even to the set of absolutely continuous distributions. 2. Tailfree processes form a class of conjugate family of priors for M(lR), that is, if the prior is tailfree with respect to some sequence of nested partitions of R, the posterior given some observations will be tailfree with respect to the same sequence of partitions (Ferguson, [11]). 3. Except for some trivial types of processes, Dirichlet processes are the only ones that are tailfree with respect to any sequence of nested partitions (Doksum, [4]). In other words, for these processes the subdivision points chosen to form the nested partitions do not play any essential role in the behavior of the process. 1.3 Characterization Let II be a prior on M(lR). Let X1, X2, ..., be a sequence of random variables which are, given P E M(lR), independent with common distribution P. For g 6 Bi, let N}:£5 be the number of observations out of X1, . . . , Xn which fall in Bg. Formally N:g 2 22:1 I Be. (X j). Denote by N," the vector (N3g : g E E,). For the prior II, H|x1...xn will stand for the posterior given X1, . . . , X, and IIIN? for the posterior given N,". For a function g on M(lR), we will write £(g(P) | TI] x1...an to denote the ‘law’ or distribution of g(P) under Hlxl...x,.- Similar notation will be used for the measures “IN,"- Theorem 1.1. Suppose II{0 < P(Bg) < 1} = 1 for alle E E,,i Z 1. Then the following are equivalent: 1. II is Tailfree; 2. 
For all n and alli Z 1, £({P(B.) = 6 E Ei}|Hlx1...xn)= £({P(Bg) : t E E,} l Hwy)- Proof. Fix n and i 2 1. For proving that (1) :> (2), first note that under the posterior “my, {P(Bg) : g E E,} has the density N" H.615. P(Bs.) "g N}; . fM(IR)1—I§EE, P(Bé) dH(P) (1.1) Also P(Bgo) = P(B§)P(B§0 | Be)» P(Bgl) = P(B§)(1— P(Bgo | B§)), P(Bg) is independent of P(Bgo | Bg), and Ngléo + Nana 2 M72. Let C E o{P(B§) : g E E,}. Then 10 is as well independent of {P(Bgo | Bg),§ E E,}. The above observations and the assumption of the theorem give that '11 [VI-r14 HgEEiH P(B)?) Ni" Lg C fMUR) H968,“ P(Bg) + (”W”) N'nlt'o Ninlcl Ninlc Ninlcl Hams) ‘* mama.) +“°(1—P(B§o|B§)) we (I fM(R)1—I§EE,‘ P(BE) N" l e 1V!" lei ‘Nin It [Vin c '+ '-"+ + ’- P(Bgo | 3,) + '-°(1— P(B,0 | B,» +“(11103) : HN,"(C)9 everywhere. Hence C({P(B§l 3 § 6 E,} l HINT“) = £({P(B§) 3 E E E1} l HIM"), and therefore £<{P = g e a} | Hm) = mew.) = g e E} I 11w for any j _>_ i. Since the sets Bg generate the Borel o-algebra, we have that o-algebra generated by N},, 0(Nf,), increases to 0(X1, . . . , X”). Using the Martingale Convergence Theorem, we obtain relation (2). For (2) :> (1) we first prove a lemma. Lemma 1.1. For any i Z 1, under (2), (a) {P(Bg) : e E E,} and {P(Bgo I Be) : g 6 13,-} are independent. (b) {P(Bgo | Bg) : g E U3;[,Ej} and {P(Bgo | B) : (f E E,} are independent. Here E0 is the empty set and P(Bw I 8,), g E E0 stands for P(BO). Proof of Lemma. Since {P(Bg) : g E E,} determines {P(Bg) : g E E] for anyj S i, quantities like P(Bgo | Bgl for g E Ej,j < i are functions of {P(Bg) : e 6 Bi}. Hence (b) is an immediate consequence of (a). 10 To prove (a) first note that (2) gives the conditional independence of {P(Bg) : g E E,} and (X1, . .. ,Xn) given N,", which we write as {P(Bg) :15 E E,} 51L" (X1,... ,Xn) from where it follows that {P(Bg) : g E E1} RL NEH and hence that [’({P(B§) 1€E Ei} | BIN?) = 3({P(Bg) : § 6 Ei}|H|N,-"+1)- (1-2) To establish (a) it is enough to show that for any collection {ng : g E E,} of nonnegative integers En(H[P(B§0 | B§)]n" | {P(Bg) : g E Ei})= constant a.e. II. §€E§ Fix a set {ng : e E El} and let n = 2158.”? Consider the posterior density of {P(Bg) : g 6 Bi} given N,’g 2 ng and its posterior density given Nam, = ng, Ngm, = 0, as in (1.1). 11 ’HE Z Since by (1.2) the corresponding distributions are equal, we have nE ngE,IP(B§)l [P(Bg0 I lelns Irvine)HgEEJPIBgllngIPIBgo I B§)]n‘dfl( En P) I {P(B§) 3 E E E1} HE [P(B.>1"‘-‘ _ fm, 11.63,. {P(Baf‘de) which yields .. fat, H.6E,[P(B.o)1"‘dH(P) E"(Q.[P(B§° ' Bi” {P(Bi) ‘ i E m): fM(p)H.E.[P(B.>J"‘dH(P)' [I] Returning to the proof of the Theorem 1.1, (2) => (1 ) now follows by applying the lemma successively for i = 1, 2,. . .. CI Towards the next characterization we recall DeFinetti’s Theorem for an exchange- able sequence of random variables. Definition 1.3. Let X1, X2, ..., Xn be a sequence of real valued random variables de- fined on Q = R00 and let u be a probability measure on Q. The sequence X1,X2, is said to be exchangeable with respect to ,u if for any n and any permutation g of (1,2, ..n.), (X1, ..., X") and (X90), ..., Xg(n)) have the same joint distribution under )1. Theorem 1.2. (DeFinetti’s Representation Theorem) A sequence of random vari- ables X1, X2, . . . is exchangeable if and only if there is a random probability measure IP’ 12 defined on (IR, B(IR)) with values in (M(lR), BM) 30 that given any P = P(w), w 6 IR, X1,X2, . . . , are independent with distribution P. Furthermore, the distribution ofIP’ is unique. 
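A concrete finite-dimensional instance of the theorem may be helpful; the Beta(a, b) prior and the Bernoulli likelihood below are illustrative choices, not taken from the thesis. If, given p, the X_i are i.i.d. Bernoulli(p) and p has a Beta(a, b) distribution, then the joint law of (X_1, ..., X_n) depends on the sequence only through the number of ones, so the sequence is exchangeable, and the mixing measure in DeFinetti's representation is the Beta prior, viewed as a distribution on the probability measures concentrated on {0, 1}.

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def seq_prob(xs, a=1.0, b=1.0):
    """P(X_1 = x_1, ..., X_n = x_n) when, given p, the X_i are i.i.d.
    Bernoulli(p) and p ~ Beta(a, b): integrating p^s (1-p)^(n-s) against
    the prior gives B(a + s, b + n - s) / B(a, b), s = number of ones."""
    n, s = len(xs), sum(xs)
    return exp(log_beta(a + s, b + n - s) - log_beta(a, b))

# The probability depends only on the number of ones, so any permutation
# of a sequence has the same probability: the sequence is exchangeable.
print(seq_prob([1, 1, 0, 0, 1]), seq_prob([0, 1, 1, 1, 0]))
```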
The distribution of P is a probability measure on (M(lR), BM) and we denote it by IT. For any n and any Borel sets Bl, . . . , B", the following relation holds: u{X1 6 Bl, . . .,x,, e B,,} = Hear.- 6 B,- | P)dH(P) 2 Mali) i=1 PBidIIP. (mg () () The question we address here is under what conditions the resulting prior in DeFinetti’s theorem is tailfree. The last theorem can be used to provide an answer to this. For each i, let T,(X) be the vector (IB§(X) : g E E,). Let uxhxn be the predictive distribution of X,,+1,X,,+2, . . . given X1, . . . ,Xn. Theorem 1.3. Let X1, X2, . . . be an exchangeable sequence under )1, and let II be the corresponding prior obtained from DeFinetti’s Theorem. If, for every 89 g E UE,, (i) lim,,_,oO u{X1 6 BE, . . . ,Xn E Bf} = 0, then the following are equivalent 1. II is tailfree; 2. For all n and alli Z 1, £(n(‘¥n+l) I M/Yl-e-Xn) : £(E(X11+l) I ltTi(/\'l)'-~Ti(l\’n))' 13 Proof. Observe first that condition (i) of the theorem ensures that II{0 < P(Bg) < 1} = 1 for all g E E,,i Z 1, and therefore, Theorem 1.1 can be applied. Indeed, using DeFinetti’s theorem, condition (i) is equivalent to liman(B§C)" dII(P) : O which implies limn fP( P(Bg)" dlI(P) : 0 and thus 33:1 l'I(P(B,_f) = 1) = 0, so II(P(B§) = O) = 0. As well, B: is a finite union of sets from U217} and therefore H(P(B§) = 0) = 0,80 II(P(B§) = 1) = 0. (1) => (2) Fix it and i Z 1. If II is tailfree, from Theorem 1.1 we have £({P(Bg) 3 t E E2} I HI.\'1...x..) = £({P(B§) I E E E,} I HIT.'(X1).-.T.(Xn))' (1-3) For any B,E E r,, u..-....X.(X..+1 e B.) = / P(B.)dn.x....x.(P) M(lR) P(B,)dn.T.(x.)...T.-(x,.)(m(P) (K) II 3\ = um.)...T.(/\'..)(Xn+1 E B,) where the second identity follows from (1.3). To show (2) => (1), by Theorem 1.1, it is enough to show that £({P(B§) : ‘5 E E.) I 11mm.) = 3({P(B;) : s E E,} | HIT.(X1)-.-T.(Xn))a 14 or equivalently, for any collection {ng : e E E,} of positive integers / H [P(Bgllngdnlxi....\'n (P) = f H [P(Bgll"!dnT.(.\'1)...r.~(X..)(P)- M(lR) M(lR) §€E§ EEEI Since for every n, by (2) of the theorem, for fixed i, /,x..,.x 1 T,X,, ,...,T,-X,,,,,. 1.4 1 m,” ,,,,, m2.) ( +1) ( +) ( ) Now let m = 2,6,3! ng and, given 71(X1),...,T,~(X,,), consider the conditional probability [1']; ( x1)...“ X”) that out of the next m observations ng fall in Bg for e E E, This is given by f H [P(Bgllngdl—ITAXI)...T,~(X,.)(P) M(lR) {EEI' and, by equation (1.4) above, is also equal to the conditional probability a X1... x, that out of the next m observations ng fall in BE for e E E,- and therefore it further equals P B. "‘dnx,,,,xn P. [WEI ( )1 < ) (GE, 15 Chapter 2 Absolute continuity and singularity of Polya tree prior and Posterior 2.1 Introduction Polya tree processes are a generalization of the Dirichlet processes and they are in- cluded in the family of tailfree processes. Unlike the Dirichlet processes, Polya tree processes are determined by a large collection of parameters and therefore they could incorporate a much wider range of beliefs. Polya tree priors were explicitly constructed by Ferguson in 1974 as a special case of tailfree processes. A formal way of constructing Polya tree priors via DeFinetti’s theorem can be found in Mauldin, Sudderth and Williams ( [22]). A detailed devel- Opment for these processes, including construction and discussion on the components needed in construction, is given in Lavine ( [20], [21]). 
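Before the formal definition in the next section (Definition 2.1), a minimal simulation sketch may help fix ideas. It generates one draw of the cell probabilities of a Polya tree on the dyadic partitions of [0, 1), splitting each cell with an independent Beta variable; the choice alpha_k = k^2 used below is Ferguson's choice, recalled among the properties listed in the next section as one that puts probability one on absolutely continuous distributions. The truncation depth is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def polya_tree_cell_masses(depth, alpha=lambda k: k ** 2):
    """One draw of the cell probabilities P(B_eps), eps in {0,1}^depth, for a
    Polya tree on the dyadic partitions of [0, 1): the conditional splits
    P(B_eps0 | B_eps) are independent Beta(alpha(k), alpha(k)) variables,
    where k is the level of the child cells (alpha_k = k^2 here)."""
    masses = np.array([1.0])                    # level 0: the whole interval
    for k in range(1, depth + 1):
        y = rng.beta(alpha(k), alpha(k), size=masses.size)  # left-child shares
        masses = np.column_stack([masses * y, masses * (1 - y)]).ravel()
    return masses   # masses[j] = P([j / 2^depth, (j + 1) / 2^depth))

m = polya_tree_cell_masses(depth=10)
print(m.sum())              # total mass is 1
print(m.max() * 2 ** 10)    # rough sup of the level-10 histogram density
```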
16 Defii pmi‘r , (IHIHI Let r = (rk)k21, 7",, = {B, : e E Ek}, k 2 1, be a nested sequence of partitions that generate BUR), as described in Section 1.2, and let (_1 = {a, : e E E’} be a family of nonnegative numbers. Let Y = P(BO), and Y, = P(B,0 I B,), e E E“. Definition 2.1. A prior probability measure on M(lR) is said to be a Polya tree process with respect to the sequence of partitions 7' and with parameters (3:, and we denote it by PT((rk)k21 ,q), if: 1. {Y, Y, : e E E‘} is a set of mutually independent random variables 2. Y has a Beta(ao, 01) distribution and Y, has a Beta(a,o, a,1), distribution for any «5 E E‘. Some properties of Polya tree priors 1. (Ghosh, Ramamoorthi, [18]) A Polya tree process with parameters a 2: {0, : g E E‘} exists if for any (5 E E” ago agOO agOOO . . . .. . ._— 0 01gb + agl 01,00 + 01,01 0,000 + 0,001 0110 01110 01110 _ O 010 + 0111 01110 + 0111 0‘1110 + 011111 2. Connection with Dirichlet processes (Ferguson [11] , Lavine [21]) o A Dirichlet process DO is a Polya tree with respect to any sequence of nested partitions (n.),,ZI, with parameters a, = a(B,), B, E nglTk. o A Polya tree PT(T, or) process is a Dirichlet process if a, = 0,0 + (1,1 for all 17 k3 g E E". The parameter a of the corresponding Dirichlet process is given by a(B,) = 0,. . If it is a Polya tree PT((r,,)k21 ,q) and given P, X1, . . . ,Xn are i.i.d. P, then the posterior distribution 7r] X1,“ X" is again a Polya tree with respect to the same sequence of nested partitions, with parameters {0, + 2;, 63,(X,-) : e E E“ } (Lavine, [21]). . The weak support of a PT ((rk)k?_1 , oz) prior (the smallest closed set under the weak topology of prior probability 1) is M(lR) iff a, > O. (Ghosh and Ramamoorthi, [18]). . Consider the Polya tree on the set of all dyadic intervals of length 1/2’", m 2 0, i.e. rm 2 {((i—1)/2"‘,i/2’"] : i = 1,...,2’"}. Take and”, 2 m2. Then the set of absolutely continuous distributions with respect to Lebesgue measure will have probability one under the resulting Polya tree. (Ferguson, [11]) 2.2 Main Theorem It is known that if a is a continuous measure, then the Dirichlet prior 0,, and the posterior Da+5x are mutually singular. (Ghosh and Ramamoorthi, [18]). In the next section we will see that this disturbing phenomena does not always occur when the prior is a Polya tree process. By construction a Polya tree prior 7r is an infinite product measure whose compo- nents have Beta distributions. 18 To formalize, let (2,, = [0, I]? and Q = “1:19P A Polya tree prior 7r with parameters {01, : e E E“ } is just a product measure “1:1“ on Q, where 7n, itself is again a product measure on (21,, whose components are Beta(a,0, (1,1) with e E Ek. The posterior 77le ...Xn being a Polya tree it too can be thought of as a product measure on Q. A natural way to establish mutual absolute continuity or singularity is the well known theorem of Kakutani (1948). Theorem 2.1. (Kakutani) Let uk and V), be two mutually absolutely continuous prob- ability measm'es on Q}, and let u = 11:, u), and V 2 [12:1 14,. Let A be a measure with respect to which both uk and V}, are absolutely continuous. Set duk de p.(u, V) = f —— - —d)\. 2.1) * a, V d). dA ( 00 If HPkIHa V) = 0, then u and V are singular. k=l (X) If HPkIH» V) > 0, then u and V are mutually absolutely continuous. k=1 Note that since 0 S pk(u, V) g 1, ”:1pkU‘v V) converges if and only if 21.0 - PkI/taVll < 00- For the next theorem we will assume for the family or of parameters that a, does not depend on the length of the vector g, i.e. 
a, = ak for any g E Ek, k 2 1. Denote a Polya tree prior with such parameters 7r(a), a = (a1, a2, . . . ). ’e do not make any specific choice for the sequence of partitions (77,)k21. I9 Theorem 2.2. Suppose 77 = 7r(a) is a Polya tree prior on [M(lR) and given P, let X have distribution P. 00 1 If E —— < 00, then 7t and the posterior NIX are mutually absolutely continuous. ak L1 00 1 . . If E — = 00, then 7r and the posterior NIX are szngular. k-l 0" The proof of this theorem uses the inequalities on Gamma functions given below. Sophisticated versions of these inequalities can be found in Laforgia [19], Bustoz and Ismail [2]. Lemma 2.1. For x > 0, Proof. For x > 0 and c > 0 set 1 ,I‘(x+1) \/x+c F(x+%), f0($) = By Sterling’s approximation of Gamma function, limxhoo f,(x) = 1. We will show that for c = 1/4, f,(x) > f,(x + 1) for all x > 0. Then since lim,H00 f,(x) = l we will have fc(x) > 1 for any x > 0 and thus the left hand side of the inequality is obtained. For the right hand side, take c = (\/3 — 1)/2. In this case it can be shown that fc(x) < fc(x + 1) and again because lim,Moo fc(x) = 1, we will obtain f,(x) < 1 for any x > 0 and the right hand side of the inequality is obtained. 20 Now look at the monotonicity of the function f,(x) _ \/x+c+1 x+-% 9C($):fc(17+1)_ \fx-I-—c 'x+1' Then , x4c—1 +262+26—1 9.(:17)= ( ) 4(x +1)2(x + c)3/2(x + c +1)1/2' Therefore if c = 1/4, g;(x) < 0 for any x > 0, so g, is decreasing and since limH00 gc(x) = 1, we have gc(x) > 1, or fc(x) > fc(x +1). Also ifc = (x/3 —1)/2, then g;(x) < 0 for any x > 0, so gC is increasing and again since lim,MOO gc(x) = 1, we will have gc(x) < 1, or f,(x) < f,(x +1). C] Proof of Theorem 2.2. For each k, X belongs to exactly one element of {B, : g E E, }. Consequently, under the posterior 70x, exactly one of P(B,0 I B,), g E Ek_1 will have distribution Beta(a;c +1, ak) and the remaining P(B,0 I B,) will be Beta(ak, ak). We recall here that the density function of the Beta(a, [3) distribution is Karla, fl) = $°“‘(1 - $)"“F(a,fl)/(F(a)1“(fl)), for It 6 (0, 1)- An easy computation shows that the quantity in (2.1) is _ I‘(2ak) P(ak + %) plea). we) — fi- ma, + ,) - PM) (2.3) For simplicity denote pk(7r(a), 7rIX(a)) = pk. The product we will have to consider when applying Kakutani’s theorem is 111:1 HeEEk pg, 2 Hi; pk. For technical reasons it is useful to first consider the case when a = lim inf]HOG ak < 21 00. Let an, be a subsequence converging to a. Then using Lemma 2.1, it follows that \/2(1. + fi_l a, pm: < \/2_ ' n2“ 2 ' nk “M an + i which converges to 2a + —‘/§—1 1 x/2 - 2 2 - < 1. a + % Therefore Hi, pk = 0. Now suppose ak —> oo. Rewriting the inequalities (2.2) of Lemma 2.1 with x — 1/2 in place of x, we obtain forx > %, 1 I“(.2:+-;-) 2—\/§ x—E 0. 22 On the other hand, by (2.2) and (2.4), 1 1—p,,>1—\/§- . a" 201,—},- ak+§ 1 1 = 50k 5 ./ 00, then Zk21(1/ak) < 00 implies 23,210 — pk) = 00 and ZkZIH/ak) < 00 implies 2,210 — pk) < 00. Also if a], ( or a subsequence) converges to some a < 00, then limk pk < 1 so HQ, pk = O. For the case of odd number of equal observations, say 2n + 1 observations, com- puting pk we obtain I‘(2a;c + 2n +1)I‘(2a,,)]1/2 P(a,c + n +1/2) pk : I P(ak + 2n +1)P(ak) “2“" + n +1/2) _ P(2ak)1"(ak +1/2) [2(2a,c +1)---(20k + 2n)]1/2 (W: + i + n —1)...(ak + i) ( _ I‘(2a;c +1/2)F(ak) (a;c +1)...(a;c + 2n) 2a;C + % + n —1)...(2a;c + %) Using Lemma 2.1, we have 20]: + 3% a), [2(2ak + 1)...(2ak + 2n)]1/2 Pk < ' 2a,, ’ak+% (ak+1)...(ak+2n) (a;c + i— + n -1)...(ak + %) (20.), + % + TL —1)...(2(lk + %) . 
The limit of the right hand side for a}, —> a E [0, oo) (eventually using a subsequence) t.\ is easily seen to be less than 1. For this one can use that 2a+i , 2a+i+1 (2a+i/2)2 a+i a+i+1 a+i/2 ’ and thus in this case [1,2ka = 0. When lim,C oh 2 00, using inequalities (2.2) and (2.4) we obtain that 1 — pk is between P(ak)/Q(ak) and R(ak)/T(ak), where P, R are polynomials in ak of degree n2+2n+1 and Q, T are polynomials in ak of degree n2+2n+2. Therefore 2,21 (1 -—pk) is of the same nature as 2,), 1/ak. 26 Chapter 3 Dykstra - Laud prior for hazard rates 3. 1 Introduction In survival analysis the variable of interest is the time to the occurrence of an event. It could be time to death of a biological unit(patient, animal) or time to failure of a mechanical component, or time to relapse(remission) of some disease under a certain treatment. Denote by X the time until some specified event. One basic function that char- acterizes the distribution of X is the survival function S, whose value at x is the probability of experiencing the event after time x. Another function to characterize the distribution of X is the hazard rate whose value at x is the chance that an in- dividual of age x experiences the event immediately after time x. This function is also known as the conditional failure rate in reliability, the age-specific failure rate in 27 I-\\ epidemiology, the force of mortality in demography. In the discrete case, the hazard rate r is defined by , P(xSXSx—tAxIXZx) r(x)=Allm0 A I—') I If X is a continuous random variable with density f then _ f(:r) _ f(:r) _ d TILE) - T1?) - P(X 2 2:) - “351M501?” A related quantity is the cumulative hazard function R(x) defined by R(x) = /0xr(u)du = — ln[S(x)] Thus, in the continuous case, Se) = epr—Ren = exp [— fauna] One may believe that the hazard rate for the occurrence of a specific event has some particular characteristics, for example it is increasing, or decreasing, or it is constant. Models with increasing hazard rates may appear when there is a natural aging or wear. Decreasing hazard rates are characteristic to events that have a very early possibility for failure, as in transplants. Constant hazard rates correspond to exponential distributions. Dykstra and Land [7] suggest a nonparametric Bayesian approach for problems 28 in reliability context. They provide a prior over the nondecreasing hazard rates by defining a stochastic process whose sample paths are nondecreasing hazard rates. The posterior distribution of the hazard rate, given right censored observations or given exact observations, is derived. Bayes estimates are found under the squared error loss. In the second section of this chapter the prior probability defined by Dykstra and Laud is introduced. In the third section the Ll-support for a particular case of the prior is established. In the fourth section, weak consistency is discussed and, in the last section, strong consistency is obtained. 3.2 The extended Gamma process Let G (a, 5) denote the Gamma distribution with density g(x | a, ,8) = x"-1 epr—x/BIIw,oo)(x)/(F(a)fi“), for a, B > 0. G (0, fl) denotes the distribution degenerate at 0. Let a be a nondecreasing, left continuous function on [0, 00), such that (1(0) 2 0 and let 5 be a positive right continuous function on [0, oo), bounded away from 0. Let (Z (t)) tZO defined on some probability space (0, 7-", P), be a Gamma process with independent increments corresponding to a. 
That means Z (0) E 0, for every n and any 0 = to < t1 < < tn, the family {Z(t,) — Z(t,-_1)}I‘:1 is independent, and for any t > s, Z(t) — Z(s) has a G(o(t) — 0(9), 1) distribution. 29 It is well known that such a process exists (Ferguson, [10]). We can assume without loss of generality that this process has nondecreasing left continuous sample paths. A new stochastic process is defined by integrating ,8 with respect to the sample paths of the (Z(t)),zo process. That is This process is called the Extended Gamma process. Any nonnegative, nondecreasing function r so that fl0 00) r(u)du : 00 corresponds to a cumulative distribution function given by F,(t) = 1 — exp [—/[0 t) r(u)du] . It is easy to prove that F, is absolutely continuous on [0, 00). Therefore, Mt) = gm) = r(t) exp [— j”) r(u)du] (3.1) is the corresponding density function. The distribution of the process (r(t))t20 thus corresponds to a prior probability over the set of nondecreasing hazard rates. This in turn induces a prior over the absolutely continuous distributions whose hazard rates are nondecreasing. We confine our studies to a particular case of this prior. We will assume that [3 is a constant function equal to 1. In other words, r(t) : Z (t). 30 I.\ In the following sections we will denote by an the prior distribution on nonde- creasing hazard rates induced by the Gamma process with independent increments corresponding to a. 3.3 Ll-support of no, Topological support of a prior 7r is the smallest closed set in the chosen topology for the parameter space of n-probability 1. If P0 is not in the support of 7r, then there exists a neighborhood of P0 that has probability 0 under 7r. Then for almost all sequences of observations X1, X2, . . . the posterior distribution given X1, X2, . . . , X" will assign mass 0 to that neighborhood. Therefore it is not reasonable to expect consistency outside the support of the prior. Before developing the Ll-support, a few lemmas that will be needed in this chapter are presented. Let ’R denote the set of nondecreasing hazard rates, i.e. ’R = {r 2 0 : r nondecreasing on [0, V), f[0.V) r(t)dt z 00, V E (0, 00]}. If r E R, we denote by f, the corresponding density function as described in (3.1). As the next lemma shows, if the L1 distance on a compact interval between two nondecreasing hazard rates is very small, then so is the L1 distance on the same interval between the corresponding density functions. Lemma3.1. Let60>OsothatifOSx<60, ex-lgfi. LetT>0andletr, r0 be two nondecreasing hazard rates. Then [[0 T) Ir( —r0(t)Idt < (50 implies I[0,T) If,(t) — fr0(t)Idt < 260 + \/6_0. 31 Proof. Using elementary inequalities for any t > 0 we have mm — new = Ir(ve‘Io-wr‘s’“ — ro(t>e'f[°.t>’°‘s’“| s W) — To(t)I€_f[°"’r(S)ds + T0(t)€— [[0'0 ro(s)ds]1 _ €_ f[0’,)(r(s)—ro(s))d3|. After integration on [0, T), clearly the first term of the above sum is at most 60. Next observe that I1 — e‘yI g 1 — 6"?” + e'yI — 1 S IyI + \/IyI g 60 + J33 when IyI < 60, and therefore the second term integrated on [0, T) is at most 60 + V36. [:1 Lemma 3.2. Let r0 be a nondecreasing hazard rate. For any T finite such that r0(T) < oo , and for any 6 > 0 , there exists a continuous, nondecreasing hazard rate to that satisfies: a) r0(t) 2 r0(t) for any t E [0, T] b) .[[0,T) [T0(t) — T0(L)Idt < 6. Proof. It is enough to prove the above lemma for r0 nondecreasing, left continuous hazard rate. Indeed, since it is nondecreasing, r0 has at most countably many points of discontinuity and the left-hand limit exists everywhere. 
Therefore if we set at any t, f0(t) = r0(t——), where r0(t-) denotes the left-hand limit of re at t, then 7‘0 will be left continuous and will differ from r0 for at most countably many points. Then f,0 = ffo almost everywhere. Fix 6 > 0 such that 6(T + 2r0(T)) < e. 32 Set to = 0 t1=sup{0 0, 60 < min{t,- — t,-_1 : 1 S i S n} Let s, = t,- - 60, i = 1,. . .,n — 1, and 3,, = t,,. Define r0 by r0(t) = r0(t,-) if t,_1 S x S s,-, then extend it linearly between 3,- and t,-, i = 1,. . .,n. Take f0(t) = r0(T) if t Z T. Note that r0(t ) > r0(t ) for t S T. Also Anew) -r0(t ))dt- 2:] 13)] t)t—ro( )Idt+:/ (roe) —r0(t))dt [Shirl n—l < 2/ ' — 7‘0(t,’_1+))dt "I' Z(T0(ti+1) — T0(ti_1))(ti — Si)dt (v-1 51(r) i=1 3 52(3 s,— t._1) + 2mm = 6(T + 270(7)) < e. [:1 Lemma 3.1 and Lemma 3.2 enable us to approximate in L1 any density fro, with r0 nondecreasing, by a density f;0 with f0 continuous, nondecreasing. DenOte by IIfFo — froII : [[0,00) Iff'o(t) — fro(t)Idt‘ 33 1" Lemma 3.3. Let r0 be a nondecreasing hazard rate. Then for any 6 > 0 , there exists a continuous, nondecreasing hazard rate 7:0, finite on [0, 00), so that IIf;o — frOII < 6. Proof. Let 6 > 0 and choose T > 0 such that the following relations hold / f,0(t)dt < 6 (3.2) [Tool r0(T) < 00. From Lemma 3.2 there exists 7‘0 continuous, nondecreasing on [0, 00) such that fl0.T) If0(t) — r0(t)Idt < 6. Consequently, by Lemma 3.1, for 6 small enough, / lfro(t) — new < 26 + n. (3.3) [0,70 By construction, f0(t) 2 r0(t) for t in [0, T]. Therefore I new=e—Iom*°“"”d3 s e”I°‘T)’°‘”‘“ = I f..(t)dt<6, (3.4) [T,oo) [T,oo) where the last inequality follows from (3.2). Hence by (3.2), (3.3), and (3.4), £0.00) If,,,(t) — f;0(t)|dt < 26 + J3 + 5 + 5 = 45 + «3. Choosing 6 such that 46 + J5 < e, we obtain IIf,~.0 — frOII < 6. Lemma 3.4. Suppose that a is strictly increasing on [0, T] and a(0+) > 0. If r0 is a continuous, nondecreasing hazard rate, and r0(T) < 00, then for any 6 > 0, 7ra{r E R : supIr(t) — r0(t)I < 6} > 0. (0»Tl 34 Proof. Since r0 is continuous there exist 0 = to < t1 < - -- < tn 2 T so that r0(t,-) — r0(t,-_1) < 6/2, i = 1,. ..,n. Denote by B,(c) the ball of radius 6 and center 0. Let R0(r0) = {r E R : r(0+) E B5/(2,,)(r0(0)) }, R1(r0) = {r E R : r(t1)- r(0+) E Bg/(gn)(r0(t1) — r0(0)) }, R,(ro) = {r E R : r(t,) — r(t,-_1) E Bg/(gn)(r0(t,) - r0(t,-_1)) }, i = 2, . . .,n. The set R(r0) = 03 R,(ro) has an positive measure. To see this, first note that under 7r,,, r(0+), r(tl) — r(0+), ..., r(tn) -— r(tn_1) are independent, r(0+) has distribution G(a(0+), 1), and r(t,) — r(t,_1) has distribution G(a(t,-) — a(t,-_1), 1), i = 1,...,n. Since a(0+) > 0 and a is strictly increasing, each set R,(r0), i = 0,. . . ,n, has no, positive measure. The independence property mentioned above implies that 7r,,(R(r0)) > 0. To complete the proof, we will show that if r E R(r0), then (soupllrU) — r0(t)I < 6. Observe that Ir(t,) —— r0(t,-)I < i6/(2n) < 6/2, for i = 1,. . .,n. Moreover, if t,_1 < t < t,-, i = 1,...,n, then r0(t) — r(t) S r0(t,-) — r(t,_1) = r0(t,-) — ro(t,-_1) + r0(t,-_1) — r(t,_1) < 6/2 + 6/2 < 6. Similarly r(t) — r0(t) < 6. E] The Ll-support of the prior measure 77,, will depend on the function a used in defining the Gamma process (Zt)t20. Two cases will be considered: 0 a is strictly increasing on [0, oo) 0 o is strictly increasing up to a point, then it stays constant. 35 Denote by .7: the set of density functions on [0, 00). Theorem 3.1. Suppose that a is strictly increasing on [0, 00) and a(0+) > 0. Then the Ll-support of no, is .77: = {fr E .7 : r E R}. 
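Before the proof, the event appearing in Lemma 3.4 can be visualized by simulation. The sketch below is only an illustration under stated assumptions: it uses the particular case studied here in which r(t) = Z(t), an illustrative parameter function with alpha(0+) = 0.5 and alpha(t) - alpha(0+) = t, an illustrative target r0(t) = 0.5 + t, and it replaces the supremum over (0, T] by a maximum over a finite grid.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_hazard_path(grid, a0=0.5, c=1.0):
    """One draw of r(t) = Z(t) on the grid, where Z is a Gamma process with
    independent increments corresponding to alpha(0) = 0, alpha(t) = a0 + c*t
    for t > 0 (an illustrative choice with alpha(0+) = a0 > 0):
    r(0+) ~ G(a0, 1) and r(t) - r(s) ~ G(c*(t - s), 1) independently."""
    increments = rng.gamma(shape=c * np.diff(grid, prepend=0.0), scale=1.0)
    return rng.gamma(a0) + np.cumsum(increments)

def prob_sup_close(r0, T=1.0, eps=1.0, n_grid=200, n_mc=5000):
    """Monte Carlo estimate of pi_alpha{ sup over (0, T] of |r(t) - r0(t)| < eps },
    with the supremum approximated on a finite grid."""
    grid = np.linspace(T / n_grid, T, n_grid)
    hits = 0
    for _ in range(n_mc):
        r = sample_hazard_path(grid)
        hits += np.max(np.abs(r - r0(grid))) < eps
    return hits / n_mc

# Illustrative target hazard r0(t) = 0.5 + t on (0, 1]; the estimated
# probability is strictly positive, in line with Lemma 3.4.
print(prob_sup_close(lambda t: 0.5 + t))
```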
Proof. The set TR has no probability one. The proof of the Theorem 3.1 will be completed if we also Show: (1) TR is a closed set in the L1 topology. (2) any L1 neighborhood of any density function in .77; has 770, positive measure. For proving (1), take a sequence (rn)n21 of nondecreasing hazard rates such that f," —> f“ in L1. Set r*(t) = f*(t)/(1 — F*(t)), which is the corresponding hazard rate of f *. We will show that r“ is nondecreasing. To see this, first note that fr, ——> f * in L1 implies that f,” —> f "‘ in measure, which further implies that there is a subsequence (frnklk which converges to f * almost everywhere. Consequently, Fr",c —-> F " almost everywhere and thus Tm, —) r“ a.e. Let A = {t : 7‘,,,_(t) —> r*(t)}. Let T“ = inf{t: F*(t) =1}. Then for any t1, t2 E A, t1 < t2 < T“, r*(t1) S r*(t2). Set f‘(t) = r‘(t), ift E A and ift E A, set f*(t) = lim,, r*(t,,), where tn —> t, tn E A for any n. In this case, r“ is well defined, nondecreasing and ff‘ 2 fr. a.e.. If T“ < 00, then IimHT- r*(t) = 00. Thus r“ E R. By Lemma 3.3 the set of density functions that correspond to continuous, nonde- creasing hazard rates, that are finite on [0, 00), is dense in the set of density functions whose hazard rates are nondecreasing. Therefore it is enough to show (2) for f,.0 with r0 nondecreasing, continuous, and r0(t) < 00 for any t > 0. Lete >0andletu,(fro)= {fE.7: IIf—fmII <6}. 36 Let 6 > 0 so that r0(t*) > 6 for some t‘ > 0. Choose T > 1 so that / fr.(t)dt < 6, (3.5) the) e-(ro(t')-5)(T-t‘) < 6. (3.6) Consider the set W = {r E R : supIr(t) — r0(t)I < 6/T}. We will show that for 6 (0,7'l small enough, for every r E IV, IIf, — fTOII < 6. Further, by Lemma 3.4, 7rO(IV) > 0 and hence 770(Ll,(fro)) > 0. Let r E W. Then IIOIT) Ir0(t) — r(t)Idt < (6/T)T = 6. As a consequence of Lemma 3.1, for 6 small enough, we have / Irma) — tuna < 25 + a (3.7) [0,T) Also, using (3.6), / fr(t)dt = e-IIO.T)T(3)dS S e—r(t')(T—t‘) < e—(ro(t‘)-6/T)(T—t°) Woo) < e-(ro(t‘)-5)(T-t') < 6. (3.8) Inequalities (3.5), (3.7) and (3.8) imply that [[0,00) If,0(t) —f,(t)Idt < 26+ \/6+6+6. Choosing 6 so that we also have 46 + \/6 < e, we obtain IIf, — frOII < e. 37 I.\ Lemma 3.5. Assume that a(0+) > 0, a is strictly increasing on [0, To] and constant after To. If r0 is a continuous, nondecreasing hazard rate, constant after To, then given 6 > 0 and T > 0, 7rO({r E R : supIr(t) — r0(t)I < 6}) > 0. (0»TI Proof. If T S To, then a is strictly increasing on [0, T] and the argument in Lemma 3.4 follows exactly. IfT > To, then, with an probability 1, supIr(t)—r0(t)I < 6 if and only if sup [r(t)— (0, T] (03 T0] r0(t)| < 6 and again, the result follows from Lemma 3.4. E] Theorem 3.2. Assume that a(0+) > 0, a is strictly increasing on [0, To] and con— stant after To. Then the Ll-support of no is the set fRTo of density functions for which the corresponding hazard rate is nondecreasing and either constant after T0, or it converges to 00 at T0 or at some point before To. Proof. Denote the set of hazard rates described above by RTO. For any tn > t,,_1 2 To, under no, r(tn) — r(tn_1) has distribution G(a(t,,) —- a(tn_1), 1) = G(0, 1), and so r is constant after T0 with probability 1. Therefore 77270 has no, probability 1. Next we will show that fRTo is a closed set in the L1 topology. Let (rn),,21 be a sequence of nondecreasing hazard rates in RTO such that f," —+ f * in L1. Following the same lines as in the Theorem 3.1, we have that there exists a subsequence r", —> r* a.e., where r" is the corresponding hazard rate to f *. 
Also, as before, if T“ = inf {t : F *(t) = 1}, then r* is nondecreasing on [0, T*]. Furthermore, T“ is either less than T0 or is 00. In other words, if it is greater than T 0, then it is 00. 38 But if we have lim,,_,00 rok(t) = r‘(t) < 00 for some t > T0, then the functions rok(t) are constant after To, and so will be r*(t), which means T“ = 00. Observe that by Lemma 3.3, for e > 0, there exists a finite, continuous, nonde- creasing hazard rate 70, which can also be chosen to be constant after To, so that IIf,0 — ffOII < 6. Hence, to complete the proof, it is enough to show that any L1 neighborhood of fro, with r0 nondecreasing, continuous, constant after To, has no positive measure. Let 6. > 0 and let Ll,(f,0) = {f E f": [If — froII < e}. Let 6 > 0, t“ > 0, T > 1 chosen as in the proof of the Theorem 3.1 so that (3.5) and (3.6) hold. Set W = {r E R : supIr(t) — r0(t)I < 6/T}. (0.TI By Lemma 3.5, no(W) > 0. Again as in the proof of the Theorem 3.1, for a suitable choice of 6, if r E W, then Ilfr0 — fFoII < 6. Therefore no(U,(f,.0)) > O. [:1 Remark 3.1. In general, if a is constant on some intervals 11 = [a1,b1], 12 = [a2, b2], . . . , 1,, 2 Ian, bo], then the Ll-support of no will consist of those densities f,, r E R so that either r is constant on the some intervals as a or limHTr(t) = 00 for some T outside of ULII, and constant on each I,- that is included in [0, T). 39 3.4 Weak consistency Consistency of the posterior distribution roughly means that if X1, X2, . . . have dis- tribution P0, then the posterior given X1, . . .,Xo converges, as it gets large, to the degenerate probability 6P0 almost surely P0. Let .7: denote as before, the set of all densities on [0, 00) with respect to Lebesgue measure. There are two natural topologies on 7-": weak topology and L1 topology. These lead to corresponding notions of consistency. A weak neighborhood of f0 is a set containing a set of the form Ufo = {f E .77: |f[0,oo)((b,-f—(b,~f0)l < 6,, i=1,...,k}, where k 2 1, (1523 are bounded and continuous on IR. A L1 neighborhood of f0 is a set containing a set of the form U7. = {f e r: 7,0,0, |f(t) — fo(t)| dt < . }. Let n be a prior on .7: and given P, let X1, . . . , Xo be i.i.d. P. Let 7I|X1...Xn( ) be the posterior distribution of P given X1, . . . , Xo. Definition 3.1. The sequence {FIXI...X..( - )}o21 is said to be weakly consistent at P0, if with P0 probability one, as n —> oo, 7T|X,...,\',.(U) —> 1 for any weak neighborhood U 0fP0. When the prior gives mass 1 to the set of densities, a more appropriate form of consistency is strong consistency, that is, consistency for L1 neighborhoods. Definition 3.2. The sequence {7T|.\'1...X,.( - )},,_>_1 is said to be strongly consistent at P0, if with P0 probability one, as n ——> 00, n|,\r,___xn(U) —> 1 for any L1 neighborhood U 0f P0. 40 A sufficient condition for having weak consistency at f0 is implied by the fol- lowing theorem due to Schwartz (1965). For any f0, f E .7, denote by K(f0, f) = [(0.00) f0(t) log(f0(t)/f(t)) dt and by K5(f0) a Kullback - Leibler neighborhood of f0, K5(f0) = {f E .77 : K(f0, f) < 6}. Say that f0 is in the K - L support of n if n(K,;(f0)) > 0 for any 6 > 0. Theorem 3.3. (Schwartz) If f0 is in the K - L support of n, then the posterior is weakly consistent at f0. The next two theorems will establish weak consistency based on Schwartz theorem. Theorem 3.4. Suppose that a(0+) > 0, (t is bounded and strictly increasing on [0, 00). 
If r0(0+) > 0, 7‘0 is bounded, nondecreasing hazard rate, then fro is in the Kullback-Leibler support of no. Therefore weak consistency holds at fro. Proof. Let 6 > 0 and let 8,; = {f E .7 : K(f,0, f) < 6}. Let T > 0, e > 0, and 7‘0 be a continuous, nondecreasing hazard rate as in Lemma 3.2 so that f0 2 r on [0, T] and f{0,T1Ir0(t) — r0(t)Idt < 6. Define U = {r E R : (soup]|r(t) — f0(t)I < r} and V = {r E R : r(t) S r(T) +6 for any t 2 T}. We will show that for a suitable choice of T and e, r E U H V implies K(f,.o, f,) < 6. Then the proof will be completed by showing no(U 0 V) > 0. First note that £0100) tf,0(t)dt < IN 00) t(z‘R0(‘)dt S £0.00) te_’°(0+)‘dt < 00. Here R0(t) 2 flat) r0(s)ds. For any r, define R similarly. 41 Since r0 is bounded, a = sup r0(t) < 00. Choose T such that :30 / t f,o(t)dt < e (3.9) [7.00) r0(t) > a — e for any t Z T. (3.10) Ifr E UflV, then for anyt 2 T, a—e < r0(T) < r(T)+e < r(t)+( and r(t) S r(T) + e < r0(T) + 26 S (1+ 26. Thus a — 26 < r(t) < a + 26 for any t Z T, which implies (a — 2()e“(“+2‘) < (a — 2e)e'"(‘) < f,(t) < r(t) < a + 26. (3.11) By (3.10) we also have for t 2 T, (a — 2e)e_“‘ < f,0(t) < a. (3.12) Combining relations (3.11) and (3.12), we obtain for t 2 T a - 26 fro(t) < a —at < . a + 266 fr(t) (a — 2€)e—I(0+2€) It follows that, when t 2 T, — t l 0 a+26 a < 0g f,(t) log < log + t(a + 26), (a — 2e) 42 which implies along with (3.9) fr0(l)| / ( a+2e a ) .0 t 10 dt < .0 t 10 +10 dt [Mr ()I g m Mr () 8.4. ga_2. + / f,,(t)t(2a + 2()dt < 55-, (3.13) moo) 3 when 6 is chosen small enough so that log[a(a + 2()/(a — 2()2] < 6/6 and ((2a + 26) < 6/6. On the other hand, fro”) __ To“) /[0.T)fr0(t)10g f(t) dt — ALT) fr (U108 r(t) dt + [0,,” fro(t)(R(t) " R0(t))dt. (3.14) If r E U 0 V, since 770(0) 2 r0(0+) > 0, choosing e < r0(0+), for any 0 < t < T, ro(0+)+e r(t) “K ro(0+)—e' (3'15) Relation (3.15) and 7‘0 2 r0 > 0 0n [0, T] imply that 7‘0“) f0”) f,.0(t)log dt S f,0 t log dt . (0,7) r(t) [0,T) ( ) r(t) ( 6 < , t1 1 —— dt —, .1 _ [0,T)f.()0s( +.,(o+)_.) <3 (3 6) when 6 is so that log(1+ e/(r0(0+) — 6)) < 6/3. 43 Also muse) — R. 0, first observe that U and V are independent. Note that V = no Vo, where I}, = {r E R : r(T + n) — r(T) < 6}. Since each r(T + n) — r(T) is independent of {7(t) : t S T}, so is V. Further V is independent of U. The set U has no positive probability by Lemma 3.4. The assumption that a is bounded assures that no(V) > 0. Indeed, let (to = sup (r(t). Then z>o l' l' V . f 5 $(0(T+n)-O(T))-1 ‘Id ”“I l‘ 15““ ")“12/0 F(a(T+n)-a(T))e ‘7’ Since 0 < o(T +1) — o(T) < o(T + n) — a(T) < an — a(T), there exists c > 0 such that for any n, (F((I‘(T + n) — (r(T)))‘l > c. Assume that e < 1. If (10 — a(T) S 1, then xIaIT+")‘a(T))" > 1 for any n and any x E (0, 6), so no(Vo) 2 c(1 — e“) > 0. If 00 — a(T) > 1, then 917(“(T+")“"(T))“1 > $(00—OIT))_1 for any n and any x E (0, 6). Hence, for any n, 6 no(Vn) > c/ xIaO_0(T)l_le‘xdx > 0. o 44 2 Therefore no(V) > 0. El Theorem 3.5. Assume that a(0+) > 0, a is strictly increasing on [0, To] and con— stant after To. If r0 is nondecreasing hazard rate, constant after To and r0(0+) > 0, then f,0 is in the Kullback-Leibler support of no. Therefore weak consistency holds at fro- Proof. Let 6 > 0 and let 8,; = {f E .77 : K(fr0, f) < 6}. Let T > 0, e > 0, and 7‘0 be a continuous, nondecreasing hazard rate as in Lemma 3.2 so that f0 2 r on [0, T] and fI0,T] Ir0(t) — r'O(t)Idt < 6. Since r0 is constant after To, we can choose f0 to be constant after To, as well. 
As in the previous theorem, define U = {r E R : (.soup]Ir(t) — f0(t)| < e} and V : {r E R : r(t) S r(T) + e for any t Z T}. , As shown in the proof of Theorem 3.4, for a suitable choice of T and e, r E U 0 V implies K(f,.0, f,) < 6. The only place 0 plays a role in the proof is on showing no(U n V) > 0. Since a is constant after T0, with no probability 1, r is constant after To. Choosing T > To, we have no(V) = 1. Also Lemma 3.5 implies no(U) > 0. Therefore no(Bo) > 0. CI The following result is known from Ghosh and Ramamoorthi [17]. Although some changes are necessary because of a different presentation of the prior probability, the core of the proof is the same with the one found in Ghosh and Ramamoorthi’s paper [17]. 45 Theorem 3.6. Suppose that o(0+) > 0 and a is strictly increasing on [0, 00). If f0 is a bounded density with compact support and its corresponding hazard rate r0 is nondecreasing with r0(0+) > 0, then f0 is in the Kullback-Leibler support of no. Therefore weak consistency holds at fro. Proof. For 6 > 0 set Ba 2 {f E .7: : K(f0, f) < 6}. Let [0, T] be the support of f0. First note that because f0 is bounded and lirn) ylogy = 0, we have y—) fI0,T)f0(t)I10gf0(t)Idt < 00. Let c > 0 and choose To < T so that / fo(t)|10sfo(t)ldt < e, (3.18) [To/1") supf0(t)(T — To) < e, (3.19) ”(710) > 1+ 6, (3.20) —(1 — F0(T0))10g(1— FOITOII < 6, (3-21) where the last two relations are possible since sup r0(t) = 00. t 0, define U = {r E R: sup Ir(t) — r0(t)I < 6}. (0,To] V = {r E R : r(T) — r(To) S e}. The sets U and V are independent, no(U) > 0 from Lemma 3.4 and also no(V) > 0. Hence no(U n V) > 0. Moreover, for a suitable choice of e, if r E U H V, then K(f0, f,) < 6. 46 We have that we. f) = (...,, fo(t)108(fo(t)/f(t))dt+f[To,T,fo(t)1os(fo(t)/f(t))dt- Imitating the argument in Theorem 3.4 for the first term above, when r E U (I V, /[O,To) foIt) log fIt) dt S /[0,To) fo(t) log r(t) dt S ALTO) f0(t) log 1+ __—_r0(0+) _ 6 dt + 5 / foItldt < 5/2, (3,22) I0,T0) if e is chosen small enough. We also have for ( small I f0“) 108(fo(t)/f(t))dt = / [T ,T) [T0 i T) 70(7) losfo(t)dt — / fo(t)10s7‘(t)dt [T09 T) + [mm fo(t) (10")r(s)ds) dt < 5:— (3.23) The first term above is less than c by (3.18). Because r(T) < r(To) + e < ro(To) + 26 < 2r0(T0) and r(To) > r0(T0) — e > 1 by the choice in (3.20), we have 0 S f[To,T) fo(t)10g7(t)dt < 00. The last term equals /[T0,T) f0(t)dt /[0,T0)T(S)ds + /[T0,T) f0(t) (/[‘T0,T)T(S)ds) dt 3 [”01“f0(t)dt‘/[O,To)r(s)ds+r(T)(T— To) / fo(t)dt [T0, T) The second term above is at most (1 — F0(T0))2r0(T0)(T — To) which in turn equals 2f0(T0)(T — To) and this is less than 26 by 3.19. 47 For the other term we have /[7‘0,T)f0(t)dt/U),T,)T(S)ds = /[TO,T)fo(t)dt/U),To)(r(s) _ f0(3))d3 + [M new I.T.)(’~‘°(S) — mews + [m sent In.) news S 6T0] f0(t)dt + E — (1 — F0(To)) 10g(1 — F0(T0)) [T0,T) < ((T +1)— (1— F0(T0))IOg(1— F0(T0)) < C(T + 2). where the last inequality is obtained by (3.21). Therefore (3.23) holds when r(T + 5) < 6/2. Relations (3.22) and (3.23) conclude the proof. CI The following theorem is similar to Theorem 3.6. Even though the proof for the two theorems is the same, the later one is mentioned as a separate result because it will be referred to in the next section. Theorem 3.7. Suppose that a is strictly increasing on [0, T‘], constant after T‘, and a(0+) > 0. 
If f0 is bounded, has compact support in [0, T] C [0, T‘] and its corresponding hazard rate r0 is nondecreasing with r0(0+) > 0, then f0 is in the K ullback-Leibler support of no, and therefore weak consistency holds at f,,,. Since the support of f0 is included in [0, T‘], a is strictly increasing on [0, T] and therefore the same proof as in the preceding theorem works. 48 3.5 Strong consistency Chosal, Ghosh and Ramamoorthi [16] give the following theorem to establish strong consistency. This theorem involves the L, metric entropy which we define next. Definition 3.3. Let g C .7 and let 6 > 0. Then the L1 metric entropy denoted by J(6, Q) is the logarithm of the minimum n such that there exist f1, f2, ..., fo in .7 with the property 9 c U? {f = If — flll < 6}- Theorem 3.8. (Ghosal, Ghosh, Ramamoorthi) Let n be a prior on .7. Suppose f0 E .7 is in the Kullback-Leibler support ofn. Iffor each 6 > 0 there is a 6 < 6, 01,02 > 0, [3 < 62/2 and 7,, C .7 such that, for all n large, 1. n(.7,‘,') < C1€_n62, 2. J(6,.7,,) < 71.16, then the posterior is strongly consistent at f0. The following two lemmas will be used to establish strong consistency for the Dykstra - Laud prior on the nondecreasing hazard rates. Assume that the function a is constant after 1. With very little modifications, the same argument for the two lemmas will hold for a constant after T, where T > 0. For 6' > 0, S“ > 0, define Rn = {r E R : e‘mg' < r(l) < n6‘} and let .7"={frE.7:rERo}. 49 Lemma 3.6. There exist cl, c2 > 0 so that no(.7f,) S cle‘“2 , for any 77. large. Proof. First observe that if r E Rn, then . -715. e_"3 $a(1)—-1 —x —na(l),{3 no({r E R. r(1) < e }) = [0 I‘(a(1))e dx S F(a(1)+1)° (3.24) If (2(1) S 1, then ... _ 00 $a(l)—1 -1, 1 —n . no({r E R: r(1) > n6 }) — l“. PICYIllle dx S I‘(a(1))e ‘5. (3.25) If (1(1) > 1, taking k 2 [02(1)], where [01(1)] is the integer part of (1(1), we have * oo $a(1)-—1 -:r 1 00 k —x no({rER: r(1)>n6 })=/1;6.me dx S FIJI—”Lax e dx Denote I), = f 1 :60. xkefl'dx. For k = 1, 11 = n6‘e‘m5' + e‘ms'. Then for 61 < 6* there exists N1 > 0 so that for any n 2 N1, 11 < e“"51. It is easily shown that for any positive integer k, I), = (n6‘)’°e‘"5’ + k1k_1. By induction, it can be shown that for any k, there exists 6,, > 0 and N), > 0 such that I), < 8—an for any n 2 Nk. Hence, for n large, no({r E R: r(1) > n6*}) S ———e‘"6" (3.26) Inequalities (3.24), (3.25), and (3.26) imply the above lemma. 50 Lemma 3.7. Let 6 > 0, 13 > 0. Then there exist (’3‘ > 0, 6* > 0 and N > 0, such that J(6,.7,.) S ad for any n 2 N. Proof. The idea of this proof is to find for each r E Rn a nondecreasing step function f , constant after 1, such that II f, — ff” < 6. The logarithm of the minimum number of functions f needed will be an upper bound for J (6, .7"). Let r E Rn. We begin by defining f on [0, 1], then we will extend it on (1, 00). Let 7 > 0 so that 27 + \/7 < 6/4. Let t0 = 0,t1= 7/77, t2 = 27/77, ...,tk =1, where k = [n/7] + 1, [77/7] is the integer part of n/7. Construct r“ constant on (t,, t,+1I by setting r*(t) = r(t,+1). Then when 6" < 1 k 2 I . 1|r(t) — “MW 3 i=0 k 2],, t, ]IT(ti+1) — r(t)Idt i=1 / Ire) — r‘(t)ldt —-—- I0~ 1] k — - _. . 1 = Z *1 = n _ EVIL“) r(t,))n r(1)n < n6 n 6 7 < 7, By Lemma 3.1, [[0,1] If,(t) — f,.(t)I(lt < 27 + fl < 6/4. Divide the interval [0, 776‘] into intervals of lengths 7. Denote the division points by y,. For t E (t,, t,+1I, define f(t) = yj+1 if r(t,+1) E (yj, yJ-HI. Then 0 S f(t) — r"(t) < y,“ — y, < 7 for any t E [0, 1]. 
Therefore f[o,1]I7~‘(t) —— r"(t)|dt < 7 and again Lemma 3.1 implies f[0,1]If"(t) — f,.(t)Idt < 6/4. Hence fIO, 1] If,-.(t) — f,(t)Idt < 6/2. As constructed, the nondecreasing hazard rate 7”: is constant on each (ti, t,+1I and all its values are multiples of 7, as 7, 2 7, ..., up to [n6‘/7I 7. Denote by R,, the set of all such functions. Moreover, with no probability one, any hazard rate is constant after 1. Thus, on 51 the interval (1, 00), n6“ > r(t) = r(1) > e'w'. Divide the interval (e‘nd', n6“) into intervals of length (6 / 8)e“"fl‘. Call the divi- sion points A,. We have that A,- > e“"5' for any i. Take i such that A,- 2 r(1) and IA,- — r(1)I < (6/8)e"”5'. For t > 1, define f(t) = /\,-. Denote by N72" the number of functions in Rn and denote by No the number of division points /\,. If we prove that [Ifr — f,=II < 6, then an upper bound for J (6, 7,,) is given by log(Nfin * No) 2 log N7,“ + log Nd. , By Lemma A1 in the Appendix, H; l n 1 1 1+ - <— *1 1 —— * — 6‘ 1. logNRn _ 7I6 og( + 6“) +log(1 +6 )I+ 2log ”/7 + Since limo-..,OI6“ log(1 + 31:) + log(1 + 6*)] = 0, there exists 6* > 0 such that £[6‘ log(1 + %) +log(1+ 6*)I < 8/4. Fix 6*. There is some N > 0 so that log[(1+ 1/6*)/(n/7)I+2 N. Then log N72” S n8/2 for any n > N. On the other hand, No 2 enfii7i6*/(6/8) and then log No 2 n8“ + log 5% < n8 / 2 for some 8* < ,8 and n large. Therefore there exist 6’, 8“ and N such that J(6, .7") < n8 for any n > N. Now look at IIf, — ffII. We already know that [[0 1] If,:(t) — f,.(t)Idt < 6/2. 52 On the interval (1, 00) we have Ito—ten = lee—Io»*‘S’dSe-W” — rue—Io»)"S’dSe-“IIU-“I S [Aie’ f[0, 1)1‘(5)dse—At(t—1) _ 7.(1)e--r(l)(t—l)e- fIO.1)F(3)dSI + Ir(1)e"(1)(‘—1)e-f[0,1)F(s)ds _ “1),—flowr(s)dse_r(1)(t—1)| S |/\,-e“""(“1) __ r(1)e—r(1)(t—l)| + 7.(1)e—r(1)(t—1)|e- fio.1)"(3)d3 _ e7 f[0,1)r(s)ds]. Integrating on (1, 00), then making a change of variable and using that Ie" - e‘yI S Ix — yI, we obtain / Ito—sewn: / vim—tennis (1,00) (0,00) + (In...) fr(1)(t)dt) ( I ,1) 17(3) — r(s)lds) =f If..(t)—f.(.)e)ldt+(/ 17 0, a strictly increasing on [0, 1), then constant after 1, a direct consequence of Theorem 3.8, Lemma 3.6 and Lemma 3.7, is the following Theorem 3.9. If f0 is in the K — L support of no, then the posterior is strongly consistent at f0. In Theorem 3.5 and Theorem 3.7 we have already pointed out some density func- tions that are in the K - L support of no. Hence at these densities strong consistency also holds. 54 Appendix Lemma A1. a) The number of all nondecreasing functions defined on k disjoint intervals, which can take one of the N possible distinct values {c1, c2, ..., cN} on N+k-1) . each interval, is ( k b) When k = [n/7I and N = [n6‘/7I, l N+k—I n I 1 1+—. <-—— 6'1 — l 1 6* —l ‘5 . log( k )‘7[ og(1+6*)+0g(+ )I+20g ”/7 +1 Proof. a) A similar argument for part a) could be found in Feller [9] (Application to occupancy problems). Let [1, 12, ..., I), be the k intervals. For a step function f denote by c7 its value on 1,. Choose k items out of the N + k — 1 elements {11, 12, ..., Ik_1,c1, c2, ..., cN}. Suppose we chose 7n intervals and k — m values, c1, ..., ck_m. To define a nonde- creasing step function f, put the values c1, . . . , ck_m in increasing order on the k — m remaining intervals, say 11,, 1,2, ..., 11km. Formally, f1, : c1, f), 2 c2, f)k_m = ck_m and ifi S 1,, take f, 2 C], if l1 < i S l2, take f,- 2 c2, and so on. Conversely, fix a nondecreasing step function f as described in the lemma above. Let c1, c2, ..., cm be the 7n distinct values of f, m S N. 
Let i1 : min{i : f, = f,+1} i2 = min{i > i1 : f,- = f,+1}, and so forth. Observe that ik-m S k — 1. Thus {i1,...,ik_m, c1,...,cm} are the k items out of {11, ..., Ik_1,c1, ..., cN} corresponding to f. Therefore because there are (N +1571) ways to choose k elements out of {11, ..., Ik_1, c1, .. ., (:N}, the number of nondecreasing step functions as described N+k—I). in the lemma is ( k b) (N 1if") can be evaluated using Stirling’s formula 1.] : fl£I+I/2e—x+fl/(l2r), 0 < 9 < 1 N+k—1 __ (N+k)! N k _ klN! N+k \/2n (N + k)N+k+1/2 exp{—(N + k) + ngJr—M} N \/2_n_k’°+1/2 exp{—k + 737?} \/2nNN+1/2 epr—N + T27} N + k’ and therefore (N+k — 1) (N+k-)"’+‘/2 (N+k-V log _— k = log +log kk+1/2 NN+l/2 + 6 = (N+ 1/2) log(1+ N) +k log (1+ 11:1) — 1/2logk+(, where e— lo 1 + 6 0 __€_<1 '_ gN+k M277 12(N+k) 12N 12k When k = [n/7I and N 2 [n6'/7I, we obtain n/7 n n6“/7 1 n +12)lo (1+ )+—lo (1+ )——10 —+1 / g ms“/7 7 g 77/7 2 g7 (N+k—1)<(n6* k " '7 n 1 1+5;- '7 1 * 1 T- It _ [6 log( +6“) +log(1+6 )I +210g n/7 +1. [:1 Lemma A2. If A1 > a and A2 > a, a > 0, then [(0,00)I/\1€_>‘lx — Age—A2IIdI S 2 [A1 — AQI /(1. Proof. If A1 > A2, then [Ale—Alf — Age-Mil S (A1 — Age—A”: + A2(e"‘” — e‘A‘I). Thus _ _, A—A A A A—A [Al—A2I A "lx—A "fld < 1 2 —2——2 = 2—‘——3<2——. ./(o,oo)| 18 26 I x — A1 + /\2 /\1 /\1 a As well, if A2 > A1, then A — A A — / [Ale—‘2” — Age—Aledx S 2—2———1 < 2I—1——i2—l. (0,00) 2 a Bibliography [1] Barron, A. R., Schervish M., and Wasserman, L. (1999). The consistency of posterior distribution in nonparametric problems. Ann. Statist. 27 , 536-561. [2] Bustoz, J. and Ismail, M.E.H. (1986). On gamma function inequalities. Math. Comp. 47, 659-667. [3] Dey, J ., Draghici, L. and Ramamoorthi, R.V. Characterization of tailfree and neutral to the right priors, Proc. [ISA Conf. Hamilton, Canada, to appear. [4] Doksum, K. (1974). Tailfree and neutral random probabilities and their posterior distributions. Ann. Probab. 2, 183-201. [5] Doob, J .L. (1948). Application of the theory of martingales. Coll. Int. du CNRS, Paris, 22-28. [6] Draghici, L. and Ramamoorthi, R.V. (2000). A note on the absolute continuity and singularity of Polya tree priors and posteriors, Scand. J. of Statist. 27, 299- 303. [7] Dykstra, R.L. and Laud, P. (1981). A Bayesian nonparametric approach to reli- ability. Ann. Statist. 9, 356-367 [8] Fabius, J. (1964). Asymptotic behavior of Bayes’ estimates. Ann. Math. Statist. 35, 846-856. [9] Feller, W. (1957). An introduction to probability theory and its applications, Vol 1, Second edition, Wiley Publications in Statistics. [10] Ferguson, TS. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209-230. [11] Ferguson, TS. (1974). Prior distributions on spaces of probability measures. Ann. Statist. 2, 615-629. [12] Freedman, DA. (1963). On the asymptotic behaviour of Bayes’ estimates in the discrete case. Ann. Math. Statist. 34, 1386-1403. [13] Freedman, DA. and Diaconis, P. (1983). On inconsistent Bayes estimates in the discrete case. Ann. Statist. 11, 1109-1118. 58 [14] Korwar, RM. and Hollander, M. (1973). Contributions to the theory of Dirichlet processes. Ann. Probability 4, 705-711. [15] Kraft, CH. (1964). A class of distribution function processes which have deriva- tives. J. Appl. Probab. 1, 385—388. [16] Ghosal, S., Ghosh, J.K. and Ramamoorthi, R.V. (1999). Consistency issues in Bayesian nonparametrics. Asymptotics, nonparametrics, and time series, Statist. Textbooks Monogr., 158, Dekker, New York, 639-667. [17] Ghosh, J .K. and Ramamoorthi, R.V. (1995). 
Consistency of Bayesian inference for survival analysis with or without censoring. IMS Lecture Notes - Monograph Series, 27, 95-103.

[18] Ghosh, J.K. and Ramamoorthi, R.V. (2000). Manuscript.

[19] Laforgia, A. (1984). Further inequalities for the gamma function. Math. Comp. 42, 597-600.

[20] Lavine, M. (1992). Some aspects of Polya tree distributions for statistical modeling. Ann. Statist. 20, 1222-1235.

[21] Lavine, M. (1994). More aspects of Polya tree distributions for statistical modeling. Ann. Statist. 22, 1161-1176.

[22] Mauldin, R.D., Sudderth, W.D. and Williams, S.C. (1992). Polya trees and random distributions. Ann. Statist. 20, 1203-1221.

[23] Schervish, M.J. (1995). Theory of Statistics. Springer Series in Statistics, Springer-Verlag, New York.

[24] Schwartz, L. (1965). On Bayes procedures. Z. Wahrsch. Verw. Gebiete 4, 10-26.