LATENT CLASS PROFILE ANALYSIS: INFERENCE, ESTIMATION AND ITS APPLICATIONS

By

Hsiu-Ching Chang

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Statistics
2011

ABSTRACT

LATENT CLASS PROFILE ANALYSIS: INFERENCE, ESTIMATION AND ITS APPLICATIONS

By Hsiu-Ching Chang

Recently, a great deal of attention has been paid to stage-sequential processes for longitudinal data, and a number of methods for analyzing stage-sequential processes have been derived from the family of finite mixture modeling. However, research on the sequential process is rendered difficult by the fact that the number of latent components is not known a priori. To address this problem, we propose two solutions, reversible jump MCMC and a Bayesian non-parametric approach, so as to provide a set of principles for systematic model selection for the stage-sequential process. The reversible jump MCMC sampler can explore the parameter space and automatically learn the model. Nevertheless, we have found that reversible jump Markov chain Monte Carlo requires an efficient design of the proposal mechanisms used as jumping rules. To reduce the technical and computational burdens, we propose a Bayesian non-parametric approach to select the number of latent components. Using latent class-profile analysis, we test both algorithms on synthesized data sets to evaluate their performance on model selection problems.

Once a model is selected, the model parameters need to be estimated. The expectation-maximization algorithm (Dempster et al., 1977) and data augmentation using MCMC (Hastings, 1970; Tanner and Wong, 1987a) are widely used techniques for drawing statistical inferences about the parameters of the LCPA model. As the number of measurement occasions increases in the LCPA model, however, the computational cost of expectation-maximization or MCMC becomes exponentially intensive. If, instead, one adopts a recursive scheme in the update steps, the calculations are simplified and generalize to more time points. In light of this, we formulate each update step with recursive terms that are directly analogous to the forward-backward algorithm (Chib, 1996; MacKay, 1997). The parameter estimation for the LCPA model benefits from the recursive formula, but the recursive algorithm still requires careful examination for the existence of multiple local modes of the objective function (i.e., the log-likelihood). Applying the recursive formula, we implement the deterministic annealing EM algorithm (Ueda and Nakano, 1998) and the deterministic annealing variant of variational Bayes (Katahira et al., 2008) in order to find parameter estimates at the global mode of the objective function. Both methods are based on the deterministic annealing framework, in which ω is included as an annealing parameter to control the annealing rate. By adjusting the value of ω, the annealing process tracks multiple local modes and, as a result, identifies the global optimum.

Finally, we are interested in analyzing early-onset drinking behaviors among young people. We apply latent class-profile analysis to alcohol drinking behaviors as manifest in self-reported items drawn from the National Longitudinal Survey of Youth 1997, a survey that explores the transition from school to work and from adolescence to adulthood in the USA.
To unveil the stage-sequential behavioral progressions, we adopt a dynamic Dirichlet learning process to characterize the probable progressions in a discrete manner and then identify patterns in which similar progressions are grouped. For the parameter estimation, we conduct deterministic annealing approaches with a predetermined annealing schedule.

Key Words: Dirichlet process; Expectation-Maximization algorithm; Latent class profile analysis; Latent stage-sequential process; Longitudinal data; Multiple local modes; Recursive formula; Reversible jump MCMC; Deterministic annealing; Variational Bayes

Table of Contents

List of Figures
List of Tables
1 Introduction
2 Latent Class Profile Analysis
  2.1 Latent Class Profile Model with Covariates
  2.2 Missing Items
3 Model Selection
  3.1 Introduction
  3.2 Reversible Jump MCMC
    3.2.1 Within-Model Move
      3.2.1.1 Prior Distribution Specification
      3.2.1.2 Posterior Distribution
    3.2.2 Across-Model Move
      3.2.2.1 Jumping Rules for C
      3.2.2.2 Jumping Rules for S
  3.3 Dirichlet Process
    3.3.1 Preliminary
      3.3.1.1 Sethuraman Stick-Breaking Representation
      3.3.1.2 Ferguson Gamma Process Representation
      3.3.1.3 Almost Sure Truncation of DP(G0, λ) Measures
    3.3.2 Dirichlet Process Mixture Model
    3.3.3 Dynamic Dirichlet Process Mixture Model
4 Simulation
  4.1 Simulation with Reversible Jump MCMC
  4.2 Simulation with Dirichlet Process
  4.3 Extended Cases
  4.4 Discussion
5 Parameter Estimation
  5.1 Preliminary
  5.2 Expectation-Maximization
    5.2.1 E-step
    5.2.2 M-step
  5.3 Recursive Formula
  5.4 Local Modality
    5.4.1 Split-and-Merge EM
    5.4.2 Deterministic Annealing EM Algorithm
    5.4.3 Deterministic Annealing Variational Bayes Algorithm
6 Model Diagnosis
7 Data Analysis on National Longitudinal Survey of Youth 1997 (NLSY97)
  7.1 Data
    7.1.1 Information Criteria versus Dirichlet Process
    7.1.2 Parameter Estimates
      7.1.2.1 Discussion
8 Discussion
  8.1 Contributions
  8.2 Direction for Future Research
APPENDICES
  A Acceptance Rate of C → C + 1
  B Acceptance Rate of S → S + 1
  C Hessian Matrix
    C.1 Diagonal Entries
    C.2 Off-Diagonal Entries
BIBLIOGRAPHY

List of Figures

4.1 The percentage of the most frequently selected profiles over simulated samples by reversible jump MCMC: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.
4.2 The histogram of the most frequently selected profiles over simulated samples from Dirichlet process: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.
4.3 The histogram of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 1%: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.
4.4 The histogram of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 5%: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.
7.1 (a)-(c) Tracking plots for classes smoothed by LOWESS with two smoother spans (red: .67 and green: .15) from time 1 to time 3; (d) histogram of class progression (yellow bars) and the number of profiles (blue bars).
7.2 Histogram of loglikelihood values derived from (a) the standard EM and (b) the DAEM algorithms with 100 different sets of starting values.

List of Tables

4.1 The percentage of the most frequently selected profiles over simulated samples by reversible jump MCMC.
4.2 The percentage of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 5%.
4.3 The percentage of selecting the correct number of profiles (S = 2) over simulated samples from RJMCMC and Dirichlet process.
7.1 Goodness-of-fit statistics for a series of LCPA models under various numbers of classes and profiles.
7.2 Standard deviation of the ρ-parameter estimates derived from DAVB with 100 different sets of starting values.
7.3 Standard deviation of the η-parameter estimates derived from DAVB with 100 different sets of starting values.
7.4 Standard deviation of the β-parameter estimates derived from DAVB with 100 different sets of starting values (Profile 1 is the baseline).
7.5 Estimated probabilities of responding 'any use' to the drinking items for each class (ρ-parameters).
7.6 Estimated probabilities of belonging to a class sequence for each profile (η-parameters) and estimated profile prevalence (γ-parameters).
7.7 Estimated logistic regression coefficients (β-parameters) for the prevalence of profiles (Profile 1, non-drinking stayers, is the baseline).
Chapter 1
Introduction

Psychological research is increasingly turning to the idea of the stage-sequential process. A common theme of stage-sequential processes is that, at any moment, individuals are placed into distinct qualitative stages, and they can change their stage memberships over time. Latent class analysis (LCA) is perhaps the most straightforward mixture model now being used to identify mutually exclusive subgroups of individuals based on their responses to measured variables (Clogg and Goodman, 1984; Goodman, 1974). LCA models explain the relationships among categorical variables in a cross-classified contingency table by assuming the existence of an unobserved or latent classification. The first detailed statistical treatment of the LC model appeared in the textbook of Lazarsfeld and Henry (1968). In their terminology, the LC model is a special case of latent-structure analysis in which the measurement scales and the latent variables are both categorical. General overviews of LC modeling are provided by Goodman (1974), Haberman (1979), McCutcheon (1987), Heinen (1993, Chap. 2), Clogg (1995), Bartholomew and Knott (1999, Chap. 6), Hagenaars and McCutcheon (2002), and others.

Recently, a number of new methods for analyzing stage-sequential processes have been derived from the family of LCA. For example, latent transition analysis (LTA) (Collins and Wugalter, 1992; Chung et al., 2008) and general growth mixture models (GGMM) (Muthén and Shedden, 1999; Muthén and Muthén, 2004) have been used widely to identify patterns of the progression of adolescent substance use, such as alcohol (Lanza and Collins, 2006) or tobacco (Velicer et al., 2007), or to investigate the initiation and progression of drug-taking behaviors for a number of different substances (Dierker et al., 2007). Chung et al. (2011) proposed another type of LCA approach, latent class-profile analysis (LCPA), to identify subtypes of the stage-sequential patterns of early-onset drinking behaviors, where drinking items are treated as fallible indicators of unseen states of drinking behaviors.

In LCPA, the identification process is divided into two steps. In the first step, LCPA identifies discrete subgroups of individuals who have similar responses to items at each measurement occasion. The subgroups identified in the first step are referred to as classes. In the second step, LCPA examines individuals' class membership over the entire set of time points so as to classify the population into two or more subgroups based on their class sequencing. The subgroups identified in the second step are referred to as class profiles or simply profiles. By applying an LCPA to a longitudinal study of adolescent drinking, for example, all drinkers in a class at a certain time point are expected to be homogeneous in terms of their drinking behaviors, and the individuals in a given profile will have similar sequential patterns of class membership over time.

Like any other finite mixture, the first and most crucial step in LCPA is to choose an appropriate number of classes and profiles, since model selection has important ramifications for the analyses performed with the model. This study works on the issues regarding the selection of the number of classes and profiles in LCPA. Two Bayesian approaches for selecting the number of classes and profiles have been proposed in this study: reversible jump MCMC (RJMCMC) and the Dirichlet process.
RJMCMC has been proposed to select the number of latent components (e.g., classes and profiles) in finite mixture models (Green, 1995), where the number of components is considered an unknown parameter to be estimated. In every step of the RJMCMC algorithm, the current mixture model can be proposed to jump across dimensions, or the model can simply be proposed to update the parameters within the current model. Any proposal is accepted with a probability that preserves reversibility with respect to the target posterior. As a result of the RJMCMC, we can estimate the relative frequencies of the number of classes and profiles given the data. RJMCMC has been widely applied, including to finite mixture models (Richardson and Green, 1997) and linear mixed models (Ho and Hu, 2008). For the LCPA model, however, a new RJMCMC procedure has to be developed to deal with the stage-sequential process for the latent component membership. In this study, we design a set of split-and-merge formulas tailored to multivariate categorical LCPA models.

As a class of non-parametric Bayesian techniques, the Dirichlet process has been utilized in Dirichlet process mixture models (also known as infinite mixture models). The involvement of the non-parametric prior allows us to identify different distributions over the observed data. However, there is little literature regarding the Dirichlet process on model selection problems for stage-sequential processes with longitudinal data. To perform the Dirichlet process in the longitudinal framework, we develop a dynamic approach that is able to explore the stage-sequential process by elaborating on the stage transitions. It can be seen as a special hidden Markov model with no constraints placed on it. The dynamic approach has the space of all distributions as its support and can easily be applied to compute posteriors and draw inferences. Since the technique based on the Dirichlet process prior enables dynamic model learning, we name it the dynamic Dirichlet learning process and integrate it into the study to determine the optimal number of latent components.

Once a model is selected, the parameters of the model need to be estimated. There are several available parameter estimation methods, and each has long been valued for its own theoretical and practical worth. The expectation-maximization (EM) algorithm (Dempster et al., 1977) and data augmentation using Markov chain Monte Carlo (MCMC) (Hastings, 1970; Tanner and Wong, 1987a) are widely used techniques for drawing statistical inferences about the parameters. However, either the E-step of the EM algorithm or the I-step of MCMC requires the computation of marginal distributions, which appears to be too expensive to be of practical use in more generalized applications. To alleviate this problem, we formulate each update step with recursive formulas that are directly analogous to the forward-backward algorithm (Chib, 1996; MacKay, 1997). The recursive algorithm will therefore have lower computational complexity and storage demands.

Existing parameter estimation algorithms for these models suffer from local maxima problems because the resulting estimates depend strongly on the starting values. To relax the dependence on the initializations, split-and-merge EM (SMEM) (Ueda et al., 2000), the deterministic annealing EM (DAEM) algorithm (Ueda and Nakano, 1998) and the deterministic annealing variant of variational Bayes (DAVB) (Katahira et al., 2008) will be introduced for this purpose.
SMEM has some defects that limit how widely it can be practiced. The major implementational difficulties of SMEM lie in designing suitable split and merge operations and in choosing the right components to which the proposed mechanisms should be applied. On the other hand, both the DAEM and the DAVB algorithms are based on the deterministic annealing framework, in which ω is included as an annealing parameter to control the annealing rate. By adjusting the value of ω, the annealing process tracks the local optima and, as a result, identifies the global optimum.

We organize the rest as follows: Chapter 2 introduces the mathematical structure underlying the LCPA models. Chapter 3 presents the procedures of two Bayesian model selection algorithms: RJMCMC and the Dirichlet process. Chapter 4 continues the previous chapter and compares their performance in discovering the true model given different sets of conditions on the model parameters. Chapter 5 provides three parameter estimation methods: the original EM algorithm with split-and-merge formulas, and deterministic annealing designs in the application of EM and variational Bayes. Chapter 6 discusses the model identifiability issues. In Chapter 7, we apply the model selection and parameter estimation techniques to alcohol drinking items drawn from the National Longitudinal Survey of Youth 1997 (NLSY97) and summarize their performance. In Chapter 8, we summarize the contributions and suggest directions for future research.

Chapter 2
Latent Class Profile Analysis

Suppose we construct an S-profile LCPA model with C classes from a set of M items over T time periods. Let C = (C_1, ..., C_T) denote the class membership variables from the initial time t = 1 to time T, where C_t = 1, ..., C, and let U denote the profile membership variable with S nominal categories. We begin with the profile identification procedure. The basic idea of LCPA in this identification procedure is that associations among class memberships across the T time points arise from the assumption that the population is composed of S profiles. If the ith individual's class membership c = (c_1, ..., c_T) could be observed, the joint probability that he or she belongs to the sequence c and the profile s is

\[
P(U = s, \mathbf{C} = \mathbf{c}) = P(U = s)\, P(\mathbf{C} = \mathbf{c} \mid U = s)
= P(U = s) \prod_{t=1}^{T} P(C_t = c_t \mid U = s)
= \gamma_s \prod_{t=1}^{T} \prod_{c=1}^{C} \left[ \eta^{(t)}_{c|s} \right]^{I(c_t = c)},
\qquad (2.1)
\]

where γ_s = P(U = s) denotes the marginal prevalence of the sth profile membership; η^{(t)}_{c|s} = P(C_t = c | U = s) represents the probability of the cth class membership at time t given the profile membership s; and I(A) is the indicator function such that I(A) = 1 if A is satisfied and I(A) = 0 otherwise. Here, we assume that the class memberships C = (C_1, ..., C_T) are conditionally independent or unrelated within each profile s. This assumption, called local independence (Lazarsfeld and Henry, 1968), is the crucial feature of LCPA that allows us to draw inferences about the unseen profile variable. In (2.1), the associations among class memberships are explained by latent class theory, which posits that profiles can be identified by class sequencing over time. If individuals' class memberships over the T time points could be observed, we could simply apply an LCA to identify a number of profiles. However, since an individual's class membership is not directly observed, another procedure should be conducted to identify class membership based on their responses to items.
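As a concrete illustration of the structure of (2.1), the short sketch below (an added example, not part of the original text) evaluates the joint probability of one profile and one class sequence under the local-independence assumption; the array layout and the toy parameter values are assumptions made only for this example.

```python
import numpy as np

# Toy LCPA setting: S = 2 profiles, C = 2 classes, T = 3 time points.
gamma = np.array([0.6, 0.4])                      # gamma_s = P(U = s)
# eta[t, c, s] = P(C_t = c | U = s); for each (t, s) the entries over c sum to one.
eta = np.array([[[0.9, 0.2], [0.1, 0.8]],
                [[0.8, 0.3], [0.2, 0.7]],
                [[0.7, 0.4], [0.3, 0.6]]])

def joint_profile_class_prob(s, c_seq, gamma, eta):
    """P(U = s, C = c_seq) as in (2.1): gamma_s * prod_t eta^{(t)}_{c_t|s}."""
    prob = gamma[s]
    for t, c_t in enumerate(c_seq):
        prob *= eta[t, c_t, s]
    return prob

# Probability of staying in class 0 at all three occasions under profile 0.
print(joint_profile_class_prob(0, [0, 0, 0], gamma, eta))
```

In the full model, this profile-level term is multiplied by the item-response probabilities introduced next and then summed over profiles and class sequences, which is exactly the structure of the likelihood in (2.4).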
The class identification explains associations among item responses based on the assumption that the population is composed of C classes at each time point. Let Y_t = (Y_{1t}, ..., Y_{Mt}) represent a vector of M discrete variables measuring latent class membership at time t, and let y_{it} = (y_{i1t}, ..., y_{iMt})′ be the observed values of Y_t for the ith individual, where each response y_{imt} can take values from 1 to r_m for m = 1, ..., M and t = 1, ..., T. The joint probability that the individual belongs to class c_t at time t and provides responses y_t would be

\[
P(C_t = c, \mathbf{Y}_t = \mathbf{y}_{it} \mid U = s)
= P(C_t = c \mid U = s)\, P(\mathbf{Y}_t = \mathbf{y}_{it} \mid C_t = c)
= P(C_t = c \mid U = s) \prod_{m=1}^{M} P(Y_{mt} = y_{imt} \mid C_t = c)
= \eta^{(t)}_{c|s} \prod_{m=1}^{M} \prod_{k=1}^{r_m} \left[ \rho_{mkt|c} \right]^{I(y_{imt} = k)},
\qquad (2.2)
\]

where ρ_{mkt|c_t} = P(Y_{mt} = k | C_t = c_t) represents the probability of response k to the mth item for a given class c_t at time t. In order to investigate the relationship among items, we assume the following: (a) Y_t = (Y_{1t}, ..., Y_{Mt}) are conditionally independent given c for t = 1, ..., T; and (b) the profile membership U is related to the items Y_t only through the class membership C_t for t = 1, ..., T. Assumption (b) implies that U depends on the class memberships (C_1, ..., C_T) but not on the items (Y_1, ..., Y_T). The joint probability that the individual belongs to the class sequence c = (c_1, ..., c_T) and the profile s and provides responses y_i = (y_{i1}, ..., y_{iT}) would be

\[
L^{*}_{i} = P(U = s, \mathbf{C} = \mathbf{c}, \mathbf{Y} = \mathbf{y}_i)
= P(U = s) \prod_{t=1}^{T} \left\{ P(C_t = c_t \mid U = s) \prod_{m=1}^{M} P(Y_{mt} = y_{imt} \mid C_t = c_t) \right\}
= \gamma_s \prod_{t=1}^{T} \left\{ \eta^{(t)}_{c_t|s} \prod_{m=1}^{M} \prod_{k=1}^{r_m} \left[ \rho_{mkt|c_t} \right]^{I(y_{imt} = k)} \right\}.
\qquad (2.3)
\]

Therefore, the ith subject's contribution to the likelihood function of (Y_1, ..., Y_T), without regard for the latent class and profile, is given by

\[
L_i = P(\mathbf{Y} = \mathbf{y}_i)
= \sum_{s=1}^{S} \sum_{c_1=1}^{C} \cdots \sum_{c_T=1}^{C} L^{*}_{i}
= \sum_{s=1}^{S} \gamma_s \prod_{t=1}^{T} \left\{ \sum_{c_t=1}^{C} \eta^{(t)}_{c_t|s} \prod_{m=1}^{M} \prod_{k=1}^{r_m} \left[ \rho_{mkt|c_t} \right]^{I(y_{imt} = k)} \right\}.
\qquad (2.4)
\]

In (2.4), the following three sets of parameters are estimated:

1. ρ_{mkt|c_t} = P(Y_{mt} = k | C_t = c_t) represents the probability of the response k to the mth item for a given class c_t at time t;
2. η^{(t)}_{c_t|s} = P(C_t = c_t | U = s) represents the conditional probability of belonging to class c_t at time t for a given class profile s; and
3. γ_s = P(U = s) represents the probability of belonging to the class profile s.

We refer to the ρ-parameter as the primary measurement parameter because it describes how individuals in each class tend to respond to the mth item at each occasion for m = 1, ..., M. The η-parameter, referred to as the secondary measurement parameter, describes the relation between a class c_t at time t and a class-profile s. The primary measurement parameters are usually constrained to be equal across measurement occasions (i.e., ρ_{mk1|c} = ··· = ρ_{mkT|c}), so that the meaning of the classes will not change over time. In practice, this invariance assumption should be carefully checked by comparing the fit of the model with and without the constraints.
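To make the likelihood (2.4) concrete, the following is a minimal numerical sketch (added here for illustration, not part of the original text); the array layouts, the function name, and the toy parameter values are assumptions. It computes L_i for one individual by summing over profiles and, within each time point, over classes.

```python
import numpy as np

def lcpa_likelihood(y_i, gamma, eta, rho):
    """Observed-data likelihood L_i of (2.4) for one fully observed individual.

    y_i   : (T, M) integer responses coded 0..r_m-1
    gamma : (S,)          profile prevalences gamma_s
    eta   : (T, C, S)     eta[t, c, s] = P(C_t = c | U = s)
    rho   : (M, K, T, C)  rho[m, k, t, c] = P(Y_mt = k | C_t = c)
    """
    T, M = y_i.shape
    S, C = gamma.shape[0], eta.shape[1]
    L_i = 0.0
    for s in range(S):
        prod_t = 1.0
        for t in range(T):
            inner = 0.0                      # sum over the class at time t
            for c in range(C):
                p_items = np.prod([rho[m, y_i[t, m], t, c] for m in range(M)])
                inner += eta[t, c, s] * p_items
            prod_t *= inner
        L_i += gamma[s] * prod_t
    return L_i

# Toy example: S = 2, C = 2, T = 2, M = 2 binary items, random valid parameters.
rng = np.random.default_rng(0)
gamma = np.array([0.5, 0.5])
eta = rng.dirichlet(np.ones(2), size=(2, 2)).transpose(0, 2, 1)          # (T, C, S)
rho = rng.dirichlet(np.ones(2), size=(2, 2, 2)).transpose(0, 3, 1, 2)    # (M, K, T, C)
y_i = np.array([[0, 1], [1, 1]])
print(lcpa_likelihood(y_i, gamma, eta, rho))
```

Under the missing-at-random treatment discussed below in Section 2.2, the same computation applies with the inner product over items restricted to the items actually observed at each time point.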
2.1 Latent Class Profile Model with Covariates

When we consider stage-sequential patterns, it is reasonable to take external causes into consideration. The natural way to extend the LCPA model in such a manner is to include covariates and examine whether the prevalence of the latent profiles varies under their influence. Let x_i = (x_{i1}, ..., x_{ip})^T denote a vector of time-invariant covariates for individual i that may influence the probability with which he or she falls into the latent class-profile U_i = 1, ..., S. That is to say, the marginal probability of U_i is affected by the covariates and is therefore denoted by P(U_i = s | x_i). Even though the covariates come into play in shaping the profile sizes, the influence of x_i on the data is completely mediated by U_i. In most cases, the first covariate is fixed as a constant (x_{i1} = 1), so that the model with p = 1 reduces to a traditional LCPA model without covariates. The dependence of U_i on x_i is specified by

\[
\gamma_s(\mathbf{x}_i) = P(U_i = s \mid \mathbf{x}_i)
= \frac{\exp(\mathbf{x}_i^T \boldsymbol{\beta}_s)}{1 + \sum_{j=1}^{S-1} \exp(\mathbf{x}_i^T \boldsymbol{\beta}_j)},
\qquad s = 1, 2, \ldots, S-1,
\qquad (2.5)
\]

with Σ_{s=1}^{S} γ_s = 1. In equation (2.5), β_j = (β_{j1}, ..., β_{jp})^T is a p × 1 vector of logistic-regression coefficients influencing the log-odds that an individual falls into profile j relative to profile S, which serves as a baseline,

\[
\log \frac{\gamma_s(\mathbf{x}_i)}{\gamma_S(\mathbf{x}_i)} = \mathbf{x}_i^T \boldsymbol{\beta}_s.
\qquad (2.6)
\]

Equation (2.6) appears to be a baseline-category logit model for a polytomous response (Agresti, 2002), except that in this case the response is latent. The likelihood contribution for the ith individual can be written as

\[
P(\mathbf{Y} = \mathbf{y}_i) = \sum_{s=1}^{S} \sum_{c_1=1}^{C} \cdots \sum_{c_T=1}^{C} L^{*}_{i}
= \sum_{s=1}^{S} \gamma_s(\mathbf{x}_i) \prod_{t=1}^{T} \left\{ \sum_{c_t=1}^{C} R^{(t)}_{c_t|s} \right\},
\qquad (2.7)
\]

where R^{(t)}_{c_t|s} = η^{(t)}_{c_t|s} ∏_{m=1}^{M} ∏_{k=1}^{r_m} [ρ_{mkt|c_t}]^{I(y_{imt}=k)}.

The LCPA model with covariates (2.7) has the attractive property that the distribution of Y_i marginalized over the covariates reduces to that of a traditional LCPA model (Bandeen-Roche et al., 1997). Letting F denote the probability distribution of x_i, the marginal distribution of the item variables for the ith individual is

\[
P(\mathbf{Y}_i = \mathbf{y}_i)
= \int \sum_{s=1}^{S} \gamma_s(\mathbf{x}_i) \prod_{t=1}^{T} R^{(t)}_{s} \, dF(\mathbf{x}_i)
= \sum_{s=1}^{S} \left\{ \int \gamma_s(\mathbf{x}_i) \, dF(\mathbf{x}_i) \right\} \prod_{t=1}^{T} R^{(t)}_{s}
= \sum_{s=1}^{S} \gamma^{*}_{s} \prod_{t=1}^{T} R^{(t)}_{s},
\qquad (2.8)
\]

where R^{(t)}_{s} = Σ_{c_t=1}^{C} η^{(t)}_{c_t|s} ∏_{m=1}^{M} ∏_{k=1}^{r_m} [ρ_{mkt|c_t}]^{I(y_{imt}=k)}. That is, the observed log-likelihood function can be reduced to that of an LCPA model with γ*_s representing the marginal class-profile membership probabilities averaged over the distribution of the covariates in the population. This simplified form suggests that a researcher can analyze the data with covariates by a two-stage process: first, fit a conventional LCPA model to the data without covariates to determine the nature of the latent variables U_i and C_i; then introduce the covariates to assess their influence on the class-profile variable U_i.

2.2 Missing Items

In real applications, it is inevitable to have missing data, since some responses to one or more questionnaire items are missing. A common way to handle missing data is to delete the incomplete cases. However, the consequences can be far-reaching if the removal of the incomplete individuals causes major changes in the representation of the data's characteristics. In fact, case deletion may be unnecessary because the incomplete information can still provide substantial true knowledge of the data, and estimates of parameters that are meant to represent the full population may be biased if part of the information has been discarded (Little and Rubin, 1987). A more principled method to deal with missing data is to apply a likelihood function defined by the marginal distribution of the observed items only. Maximizing this likelihood function, which eliminates the missing responses, is appropriate when the missing items are missing at random (MAR) in the sense defined by Rubin (1976) and Little and Rubin (1987).
The contribution of individual i to this function, which we call the observed-data likelihood, is

\[
P(\mathbf{Y}_{i,\mathrm{obs}} = \mathbf{y}_{i,\mathrm{obs}})
= \sum_{s=1}^{S} \gamma_s(\mathbf{x}_i) \prod_{t=1}^{T} \left\{ \sum_{c_t=1}^{C} R^{(t)}_{c_t|s} \right\},
\qquad (2.9)
\]

where Y_{i,obs} and y_{i,obs} are the vectors of observed item variables and their realized values, the product over items in R^{(t)}_{c_t|s} is restricted to the observed items, and obs_i denotes the set of items observed for individual i. We assume that the covariates x_i are completely observed. Missing values among the covariates would require us to introduce additional assumptions about the distribution of x_i. The concept of extending the model to include missing values in the covariates is straightforward, but it can wreak havoc on the implementation because the dimension of the observed item vectors Y_{i,obs} varies from one subject to another. Missing items also introduce complications in assessing goodness of fit and checking for departures from the modeling assumptions (e.g., local independence).

Chapter 3
Model Selection

3.1 Introduction

In finite mixture modeling, the problem to be solved is to estimate the unknown number of components. This question arises in many traditional and novel situations such as variable selection, signal processing, and Bayesian non-parametric statistics. There are many model selectors in popular use, such as Akaike's information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978). As an estimated Kullback-Leibler (KL) distance, AIC aims at finding the best approximating model to the unknown true data-generating process and penalizes with twice the number of parameters to achieve parsimony; BIC operates in a similar way but penalizes more, with the logarithm of the sample size. In addition, the penalized least squares approach with the smoothly clipped absolute deviation (SCAD) penalty term has been demonstrated to be an attractive selection approach which not only selects important variables consistently but also produces oracle estimates (Fan and Li, 2001).

Unfortunately, there is no single superior model selection tool that can be applied to all types of data sets. For example, BIC is consistent while AIC is not: if the true model is among the candidates, BIC can identify the true model almost surely as the sample size grows; however, AIC has been proved to be minimax-rate optimal for estimating the regression function in both parametric and nonparametric cases (Yang, 2003). Even though AIC and BIC are commonly used for comparing different models, computing the relevant AIC and BIC values for each possible model can be very time consuming when the number of competing models is large. In LCPA models, the structure is intricate, and we need to develop methods by which we can fully incorporate the data information to improve the decision-making process.

3.2 Reversible Jump MCMC

In this section, we explore the RJMCMC algorithm for the LCPA model to select the appropriate number of classes and profiles. The algorithm aims to compute the joint posterior distribution of the parameters and the number of latent components. The RJMCMC samplers can travel between different dimensions by constructing a reversible Markov chain on the general state space with a specified limiting distribution. In other words, the RJMCMC can jump to another LCPA model with different dimensions (across-model), or it can simply update the parameters within the LCPA model with the same dimensions (within-model).
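The overall structure of such a sampler can be sketched as follows; this is an illustrative skeleton added here (not the author's implementation), and the two callables are hypothetical placeholders for the within-model update and the across-model proposals developed in Sections 3.2.1 and 3.2.2.

```python
import numpy as np

rng = np.random.default_rng(1)

def rjmcmc_skeleton(n_iter, init_state, within_model_update, propose_across_model):
    """Generic RJMCMC loop: at each iteration, either refresh the parameters of
    the current (C, S)-dimensional LCPA model or propose a jump to a model of a
    different dimension and accept it with the appropriate probability."""
    state = init_state
    visited = []
    for _ in range(n_iter):
        if rng.uniform() < 0.5:
            # Within-model move: data augmentation / Gibbs update of (gamma, eta, rho).
            state = within_model_update(state)
        else:
            # Across-model move: split or merge a class or a profile.
            proposal, log_accept_ratio = propose_across_model(state)
            if np.log(rng.uniform()) < min(0.0, log_accept_ratio):
                state = proposal
        visited.append((state["C"], state["S"]))
    # The relative frequencies of (C, S) in `visited` (after burn-in) estimate
    # the posterior over the number of classes and profiles.
    return visited
```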
3.2.1 Within-Model Move

In Bayesian analysis, our goal is to compute the posterior distribution P(Θ_{(C,S)} | y), where y represents the vectorized observed items from the sample and Θ_{(C,S)} denotes the parameter set corresponding to the C-class/S-profile LCPA model. The posterior distribution, however, is difficult to portray in the LCPA model. If the class and profile memberships for each individual were known, the augmented posterior P(Θ_{(C,S)} | y, z) would be easy to simulate. A within-model MCMC algorithm for the LCPA model is implemented as an iterative two-step procedure which can be regarded as a form of data augmentation (Tanner and Wong, 1987b) or Gibbs sampling (Gelfand and Smith, 1990).

Let z = (z_1, ..., z_n) indicate the individuals' class and profile memberships, where z_i is a (T+1)-dimensional array for the ith individual such that z_{i(s,c_1,...,c_T)} ∈ {0, 1} and Σ_{s=1}^{S} Σ_{c_1=1}^{C} ··· Σ_{c_T=1}^{C} z_{i(s,c_1,...,c_T)} = 1. That is, if individual i belongs to the profile s and has the class membership c = (c_1, ..., c_T) from the initial time t = 1 to time T, then z_{i(s,c_1,...,c_T)} equals 1, and 0 otherwise. In the first step of the MCMC procedure (the Imputation or I-step), we assume that at the (j+1)th cycle the value z_i^{(j+1)} is drawn from the conditional distribution P(U = s, C = c | y_i, γ^{(j)}, η^{(j)}, ρ^{(j)}) given the observed data and the previous parameter estimates. We then calculate the marginal indicators

\[
z^{(j+1)}_{is} = \sum_{c_1=1}^{C} \cdots \sum_{c_T=1}^{C} z^{(j+1)}_{i(s,c_1,\ldots,c_T)}, \qquad
z^{(t)(j+1)}_{i(s,c_t)} = \sum_{\{c_j:\, j \neq t\}} z^{(j+1)}_{i(s,c_1,\ldots,c_T)}, \qquad
z^{(t)(j+1)}_{ic_t} = \sum_{s=1}^{S} z^{(t)(j+1)}_{i(s,c_t)}
\]

for i = 1, ..., n. In the second step (the Posterior or P-step), we draw new random values for the parameters from the augmented posterior distribution

\[
\gamma^{(j+1)}, \eta^{(j+1)}, \rho^{(j+1)} \sim P(\gamma, \eta, \rho \mid \mathbf{y}, \mathbf{z}^{(j+1)}),
\qquad (3.1)
\]

which regards the memberships of class and profile as known. Repeating this two-step procedure creates a sequence of iterates,

\[
\{(\gamma^{(1)}, \eta^{(1)}, \rho^{(1)}; \mathbf{z}^{(1)}), (\gamma^{(2)}, \eta^{(2)}, \rho^{(2)}; \mathbf{z}^{(2)}), \ldots\},
\qquad (3.2)
\]

which converges to the stationary distribution P(Θ_{(C,S)} | y). This stream of parameter values (after a suitable burn-in period) is summarized in various ways to produce approximate Bayesian estimates, intervals, tests, etc.

It is convenient to choose priors that cause γ, η and ρ to be a posteriori independent given z. One way to achieve this is to make the priors independent of each other. In situations where the priors are independent, the joint posterior distribution for γ, η and ρ given z can be expressed as

\[
P(\gamma, \eta, \rho \mid \mathbf{y}, \mathbf{z}) \propto P(\gamma, \eta, \rho)\, P(\mathbf{y}, \mathbf{z} \mid \gamma, \eta, \rho)
\propto \left\{ P(\gamma) \prod_{i=1}^{n} \prod_{s=1}^{S} \gamma_s^{\,z_{is}} \right\}
\times \left\{ P(\eta) \prod_{i=1}^{n} \prod_{s=1}^{S} \prod_{t=1}^{T} \prod_{c=1}^{C} \left[ \eta^{(t)}_{c|s} \right]^{z^{(t)}_{i(s,c)}} \right\}
\times \left\{ P(\rho) \prod_{i=1}^{n} \prod_{t=1}^{T} \prod_{c=1}^{C} \prod_{m=1}^{M} \prod_{k=1}^{r_m} \left[ \rho_{mkt|c} \right]^{z^{(t)}_{ic} I(y_{imt}=k)} \right\},
\qquad (3.3)
\]

where P(γ), P(η) and P(ρ) are the prior distributions for γ, η and ρ; the expression is composed of three parts, one pertaining to the γ-parameters, one pertaining to the η-parameters, and the other pertaining to the ρ-parameters.

3.2.1.1 Prior Distribution Specification

In equation (3.3), the posteriors for the measurement parameters consist of the prior distribution and the multinomial distribution. The most straightforward choice for the prior distribution is the Dirichlet, since it is conjugate to the multinomial distribution. A random vector γ = (γ_1, ..., γ_S) is said to have a Dirichlet prior with hyperparameters α = (α_1, ..., α_S)
if the density of γ is

\[
P(\gamma \mid \alpha) = \frac{\Gamma\!\left(\sum_{l=1}^{S} \alpha_l\right)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_S)} \prod_{l=1}^{S} \gamma_l^{\alpha_l - 1}
\qquad (3.4)
\]

over the simplex γ_l ≥ 0, l = 1, ..., S, with Σ_{l=1}^{S} γ_l = 1, where Γ(·) denotes the gamma function. We prefer using noninformative priors to make sure that the data-driven inferences will not be drawn under the influence of an improper prior specification. One approach to choosing a noninformative prior is to adopt Jeffreys' invariance principle (Box and Tiao, 1992). Jeffreys' prior density is proportional to ∏_{l=1}^{S} γ_l^{-1/2}, which corresponds to the Dirichlet with α = (1/2, ..., 1/2). Similarly, we apply the Jeffreys priors to ρ_{mt|c} = (ρ_{m1t|c}, ..., ρ_{m r_m t|c}) and η^{(t)}_s = (η^{(t)}_{1|s}, ..., η^{(t)}_{C|s}).

3.2.1.2 Posterior Distribution

Adopting a Dirichlet prior for γ, η and ρ, the complete-data posterior distribution is also Dirichlet because of conjugacy. Thus, in the P-step, new random values for the parameters are drawn from the posterior distributions

\[
\rho_{m|c} \sim \mathrm{Dirichlet}\!\left(n_{m1|c} + \tfrac{1}{2}, \ldots, n_{m r_m|c} + \tfrac{1}{2}\right), \quad
\eta^{(t)}_{s} \sim \mathrm{Dirichlet}\!\left(n^{(t)}_{1|s} + \tfrac{1}{2}, \ldots, n^{(t)}_{C|s} + \tfrac{1}{2}\right), \quad
\gamma \sim \mathrm{Dirichlet}\!\left(n_1 + \tfrac{1}{2}, \ldots, n_S + \tfrac{1}{2}\right)
\qquad (3.5)
\]

for c_t = 1, ..., C, t = 1, ..., T, m = 1, ..., M, and s = 1, ..., S, where n^{(t)}_{c|s} = Σ_{i=1}^{n} z^{(t)}_{i(s,c)}, n_{mk|c} = Σ_{i=1}^{n} Σ_{t=1}^{T} z^{(t)}_{ic} I(y_{imt} = k), and n_s = Σ_{i=1}^{n} z_{is}. We repeat this two-step procedure to create a sequence of iterates converging to the stationary posterior distribution. This stream of parameter values (after a suitable burn-in period) is summarized in various ways to produce Bayesian inference and estimation.

3.2.2 Across-Model Move

For the across-model update, we assume that the sampling starts from the state Θ_{(C,S)} and then transits to Θ*_{(C*,S*)}. To match the dimensions, we generate a vector of continuous random variables u from some known distribution g, and the new proposed state is then constructed by using an invertible deterministic function h such that (Θ*_{(C*,S*)}, u*) = h(Θ_{(C,S)}, u), where u* are the suitable random variables, generated from some known distribution g*, that are required for the reverse move using the inverse function h* of h. The new state updates the current one with probability α(k, k*), where k = (C, S) and k* = (C*, S*). The probability α(·, ·) is usually termed the acceptance rate. Assuming that the probability of choosing the move from k to k* is denoted by q(Θ_k, Θ*_{k*}), a valid choice for the acceptance rate is

\[
\alpha(k, k^*) = \min\left\{ 1,\;
\frac{P(\Theta^{*}_{k^*} \mid \mathbf{y})\, q(\Theta^{*}_{k^*}, \Theta_k)\, g^{*}(\mathbf{u}^{*})}
     {P(\Theta_k \mid \mathbf{y})\, q(\Theta_k, \Theta^{*}_{k^*})\, g(\mathbf{u})}
\left| \frac{\partial(\Theta^{*}_{k^*}, \mathbf{u}^{*})}{\partial(\Theta_k, \mathbf{u})} \right| \right\};
\qquad (3.6)
\]

note that the last factor in (3.6) is the Jacobian arising from the transformation from (Θ_k, u) to (Θ*_{k*}, u*).

To keep the expressions simple, we consider the LCPA model with binary items (i.e., r_m = 2 for m = 1, 2, ..., M) with the invariance constraint on the ρ-parameters (i.e., ρ_{mk1|c} = ... = ρ_{mkT|c} = ρ_{mk|c}), although an extension to the model defined in (2.4) is straightforward. First, we select the number of classes C, and then we explore the number of profiles S with the number of classes fixed. We assume that the random variables C and S follow Poisson distributions with hyperparameter ς, with maximum values truncated at C0 and S0, respectively. Based on the simulation results, the choices of S0 and C0 have little impact on the performance of the RJMCMC and can be chosen as any reasonable values. In this study, we choose ς = 5, S0 = 10, and C0 = 10.
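The acceptance probability in (3.6) is usually evaluated on the log scale for numerical stability. The helper below is an added illustration (not the author's code); the caller is assumed to supply the log posterior values, the log move-choice probabilities q, the log densities of the auxiliary variables, and the log absolute Jacobian of the transformation h.

```python
import numpy as np

def log_acceptance_rate(log_post_new, log_post_cur,
                        log_q_reverse, log_q_forward,
                        log_g_star_u_star, log_g_u,
                        log_abs_jacobian):
    """Log of the acceptance probability alpha(k, k*) in (3.6)."""
    log_ratio = (log_post_new + log_q_reverse + log_g_star_u_star
                 - log_post_cur - log_q_forward - log_g_u
                 + log_abs_jacobian)
    return min(0.0, log_ratio)        # log alpha = min(0, log ratio)

# Example with made-up numbers: accept the move with probability exp(log_alpha).
log_alpha = log_acceptance_rate(-1520.3, -1523.8, np.log(0.5), np.log(0.5),
                                -2.1, -1.7, 0.35)
accept = np.log(np.random.default_rng(2).uniform()) < log_alpha
```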
In RJMCMC, there are two possible ways to accommodate dimensional changes in the latent classes: split (e.g., the C-class LCPA moves to the (C+1)-class LCPA) and merge (e.g., the C-class LCPA moves to the (C−1)-class LCPA). Usually, at each stage of dimensional change, the split and merge moves are equally preferred (i.e., b_c = P(split | C-class LCPA) = 0.5).

3.2.2.1 Jumping Rules for C

To update the number of classes, we first generate U from Uniform(0, 1). If U < b_c, the split move is executed. Conditional on the current number of profiles, we randomly pick a class c* from (1, ..., C) and split it into classes c*_1 and c*_2. To accommodate these classes, we draw u^{(t)}_s from Uniform(0, 1) and update the secondary measurement parameters as

\[
\eta^{(t)}_{c^{*}_1|s} = \eta^{(t)}_{c^{*}|s}\, u^{(t)}_{s}, \qquad
\eta^{(t)}_{c^{*}_2|s} = \eta^{(t)}_{c^{*}|s}\, (1 - u^{(t)}_{s})
\]

for s = 1, ..., S and t = 1, ..., T. For the primary measurement parameters, we generate u_m from Uniform(0, 1) and update them as

\[
\rho_{m1|c^{*}_1} = \rho_{m1|c^{*}} - u_m \kappa \sqrt{\frac{1-\bar{u}}{\bar{u}}}, \qquad
\rho_{m1|c^{*}_2} = \rho_{m1|c^{*}} + u_m \kappa \sqrt{\frac{\bar{u}}{1-\bar{u}}}
\]

for m = 1, ..., M, where \bar{u} = Σ_{s=1}^{S} Σ_{t=1}^{T} u^{(t)}_s / (ST) and κ is an adjustment term added to enhance the mixing performance. For the simulation study, we choose κ = 0.2, since all ρ_{mk|c} parameters are bounded and a large κ (e.g., κ ≥ 0.4) would easily violate the constraint and make the proposals invalid, whereas a small κ (e.g., κ ≤ 0.1) will lead to a low acceptance rate and slow down the chain. After the dimension-changing move is made, we need to reallocate the individuals whose class memberships were c* to the new classes. The class label for each individual is not known a priori, and we therefore follow an ad hoc rule to assign them to either c*_1 or c*_2 according to the proposed formula.

On the other hand, if the merge move is chosen, we randomly select two classes c*_1 and c*_2 from (1, ..., C) and merge them into c*. To preserve reversibility, we update the parameters as

\[
\eta^{(t)}_{c^{*}|s} = \eta^{(t)}_{c^{*}_1|s} + \eta^{(t)}_{c^{*}_2|s}, \qquad
\rho_{m1|c^{*}} = \bar{u}\, \rho_{m1|c^{*}_1} + (1 - \bar{u})\, \rho_{m1|c^{*}_2}
\]

for s = 1, ..., S, t = 1, ..., T, and m = 1, ..., M. In this case, the reallocation can be done simply by re-setting the memberships of those who were in class c*_1 or c*_2 to c*. With the updated parameters, we then calculate the acceptance rate α(C, C′) of the proposed C′-class LCPA model (C′ = C − 1 or C + 1) as in (3.6) by replacing k with (C, S) and k* with (C′, S).

3.2.2.2 Jumping Rules for S

Based on the selected number of classes C, we update the number of profiles S using the following procedure. At each stage of dimensional change, the split and merge moves are equally preferred (i.e., b_s = P(split | S-profile LCPA) = 0.5). We generate V from Uniform(0, 1). If the split move is chosen (i.e., V < b_s), we randomly select one profile s* from (1, ..., S), split it into s*_1 and s*_2, draw w from Uniform(0, 1), and set

\[
\gamma_{s^{*}_1} = \gamma_{s^{*}}\, w, \qquad \gamma_{s^{*}_2} = \gamma_{s^{*}}\, (1 - w).
\]

For the secondary measurement parameters, we update the parameters by setting them to be proportional to the odds ratio for s*:

\[
\log\!\left(\frac{\eta^{(t)}_{c|s^{*}_1}}{\eta^{(t)}_{C|s^{*}_1}}\right)
= \log\!\left(\frac{\eta^{(t)}_{c|s^{*}}}{\eta^{(t)}_{C|s^{*}}}\right) + \frac{\beta_{ct}}{w}, \qquad
\log\!\left(\frac{\eta^{(t)}_{c|s^{*}_2}}{\eta^{(t)}_{C|s^{*}_2}}\right)
= \log\!\left(\frac{\eta^{(t)}_{c|s^{*}}}{\eta^{(t)}_{C|s^{*}}}\right) - \frac{\beta_{ct}}{1 - w},
\]

where the β_{ct} are generated independently from N(0, σ²) for c = 1, ..., C−1 and t = 1, ..., T. In the simulation study, we choose σ = 1, and for those individuals whose profile memberships were s*, we relabel each of them as either s*_1 or s*_2 with probabilities based on the proposed formula.
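The profile split just described can be sketched as follows; this is an added illustration under assumed array layouts (eta indexed as [t, c, s], the last class C serving as the log-odds baseline exactly as in the formulas above), not the author's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def split_profile(gamma, eta, s_star, sigma=1.0):
    """Split profile s_star into two new profiles: gamma_{s*} is divided by
    w ~ Uniform(0,1) and the baseline-category log-odds of eta are shifted by
    +beta_{ct}/w and -beta_{ct}/(1-w) with beta_{ct} ~ N(0, sigma^2)."""
    T, C, S = eta.shape
    w = rng.uniform()
    gamma_new = np.concatenate([np.delete(gamma, s_star),
                                [gamma[s_star] * w, gamma[s_star] * (1.0 - w)]])
    beta = rng.normal(0.0, sigma, size=(C - 1, T))                 # beta_{ct}
    logit = np.log(eta[:, :C - 1, s_star] / eta[:, C - 1:, s_star])  # (T, C-1)
    logit1 = logit + beta.T / w
    logit2 = logit - beta.T / (1.0 - w)
    def to_probs(lo):
        # Invert the baseline-category logits back to probabilities over classes.
        e = np.exp(np.concatenate([lo, np.zeros((T, 1))], axis=1))
        return e / e.sum(axis=1, keepdims=True)
    eta_new = np.concatenate([np.delete(eta, s_star, axis=2),
                              to_probs(logit1)[:, :, None],
                              to_probs(logit2)[:, :, None]], axis=2)
    return gamma_new, eta_new, w, beta

# Toy usage: T = 4 time points, C = 3 classes, S = 3 profiles; split profile 1.
gamma = np.array([0.5, 0.3, 0.2])
eta = rng.dirichlet(np.ones(3), size=(4, 3)).transpose(0, 2, 1)    # (T, C, S)
gamma_new, eta_new, w, beta = split_profile(gamma, eta, s_star=1)
```

The returned (w, β) would then enter the Jacobian and proposal densities of the acceptance rate (3.6).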
If the merge move is selected, we randomly choose two profiles s*_1 and s*_2 and merge them into s* by setting

\[
\gamma_{s^{*}} = \gamma_{s^{*}_1} + \gamma_{s^{*}_2}, \qquad
\log\!\left(\frac{\eta^{(t)}_{c|s^{*}}}{\eta^{(t)}_{C|s^{*}}}\right)
= w \log\!\left(\frac{\eta^{(t)}_{c|s^{*}_1}}{\eta^{(t)}_{C|s^{*}_1}}\right)
+ (1 - w) \log\!\left(\frac{\eta^{(t)}_{c|s^{*}_2}}{\eta^{(t)}_{C|s^{*}_2}}\right)
\]

for c = 1, 2, ..., C−1 and t = 1, 2, ..., T. The reallocation can easily be done by relabeling those who were in profile s*_1 or s*_2 as s*. With the updated parameters, we then calculate the acceptance rate α(S, S′) of the proposed S′-profile LCPA model (S′ = S − 1 or S + 1) as in (3.6) by replacing k with (C, S) and k* with (C, S′). The details of the calculations of the acceptance rates for classes and profiles are summarized in the appendix.

3.3 Dirichlet Process

3.3.1 Preliminary

Typically, we assume that the data are drawn from an unknown distribution which we wish to estimate through the posterior distribution by specifying a prior. In most cases, the prior distribution is parametrically specified, and Bayesian computing helps to infer the estimate easily. However, assuming that the prior has a parametric form may not be suitable for the data. In the case of the Dirichlet process, the prior is a random distribution over probability measures. The Dirichlet process is currently one of the most popular nonparametric Bayesian models, since it can be used as a modern way to learn the dominating components economically.

By the definition of Ferguson (1973), G is said to be a Dirichlet process on a measurable space (B, A) with concentration parameter λ and base measure G0 if, for any finite measurable partition (B_1, B_2, ..., B_k) of B, the distribution of (G(B_1), ..., G(B_k)) is Dirichlet with parameters (λG0(B_1), ..., λG0(B_k)); we write G ∼ DP(G0, λ). The parameters G0 and λ play intuitive roles in the definition of the Dirichlet process. For any measurable set B ⊂ B, we have

\[
E(G(B)) = G_0(B), \qquad
\mathrm{Var}(G(B)) = \frac{G_0(B)\,(1 - G_0(B))}{\lambda + 1}.
\qquad (3.7)
\]

The concentration parameter λ is a positive scalar, with larger values of λ leading to a smaller variance of G. In other words, the Dirichlet process concentrates more around its mean, and as λ grows, G(B) converges weakly to G0(B) for any measurable set B in B. On the contrary, a Dirichlet process with a small value of λ favors extreme distributions, but the belief in the prior is easily overwritten by the data.

Since G is a prior distribution over the measure space, we can draw random samples from G itself. Let θ_1, ..., θ_n be a sequence of independent draws from G. We can derive the posterior distribution of G given the observed values of θ_1, ..., θ_n. Let A_1, ..., A_r be a finite measurable partition of B, and let m_k = #{i : θ_i ∈ A_k} be the number of observed values in A_k. Since the Dirichlet is the conjugate prior of the multinomial, we have

\[
(G(A_1), \ldots, G(A_r)) \mid \theta_1, \ldots, \theta_n \sim
\mathrm{Dirichlet}\!\left(\lambda G_0(A_1) + m_1, \ldots, \lambda G_0(A_r) + m_r\right).
\]

The above expression holds for all finite measurable partitions A_1, ..., A_r of B, and in light of Ferguson's definition, we know that the posterior of G must again be a Dirichlet process. Generally put, the posterior distribution of G can be written down as

\[
G \mid \theta_1, \ldots, \theta_n \sim
DP\!\left( \frac{\lambda}{\lambda + n}\, G_0 + \frac{n}{\lambda + n}\, \frac{\sum_{i=1}^{n} \delta_{\theta_i}}{n},\; \lambda + n \right).
\]

The posterior mean of G is the weighted average between the prior base distribution G0 and the empirical distribution (1/n) Σ_{i=1}^{n} δ_{θ_i}.
As λ → 0, the prior becomes noninformative, since the predictive distribution is controlled solely by the empirical average. That is to say, as the number of observations outnumbers the concentration parameter (n ≫ λ), the posterior is constructed by the observations themselves, which can be regarded as a data-driven approximation to the true underlying model. This explains why the Dirichlet process is consistent in recovering the model as the sample size increases.

Consider again drawing G ∼ DP(G0, λ) and drawing an i.i.d. sequence θ_1, ..., θ_n ∼ G. The predictive distribution for θ_{n+1}, conditioned on θ_1, ..., θ_n with G marginalized out, has the following formulation:

\[
\theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim
\frac{1}{\lambda + n}\left( \lambda G_0 + \sum_{i=1}^{n} \delta_{\theta_i} \right).
\qquad (3.8)
\]

The sequence of predictive distributions is called the Blackwell-MacQueen urn scheme (Blackwell and MacQueen, 1973). Regardless of the question about the existence of the Dirichlet process, we can calculate the joint probability of the sequence (θ_1, ..., θ_n) by the conditional rule

\[
P(\theta_1, \ldots, \theta_n) = \prod_{i=1}^{n} P(\theta_i \mid \theta_1, \ldots, \theta_{i-1}),
\qquad (3.9)
\]

where the conditional probability is defined as in (3.8). It is obvious that the order of the sampling will not change the probability, and we therefore ensure that the sequence of drawn samples is infinitely exchangeable. More precisely, we say that (θ_1, θ_2, ...) is an infinitely exchangeable sequence of random variables if, for any n, the joint probability P(θ_1, ..., θ_n) is invariant to permutations of the indices. That is, for any permutation σ,

\[
P(\theta_1, \ldots, \theta_n) = P(\theta_{\sigma(1)}, \ldots, \theta_{\sigma(n)}).
\qquad (3.10)
\]

According to de Finetti's theorem (de Finetti, 1931) and the generalized version of his theorem (Hewitt and Savage, 1955), for any infinitely exchangeable sequence (θ_1, θ_2, ...) there exists a probability measure µ on the set of probability measures G(·) such that

\[
P(\theta_1, \ldots, \theta_n) = \int \prod_{i=1}^{n} G(\theta_i) \, d\mu(G).
\qquad (3.11)
\]

In our study, the prior over the random distribution is precisely the Dirichlet process DP(G0, λ), thus establishing existence.

An important aspect of the Dirichlet process is its clustering property. Based on the predictive form formulated as (3.8), the first sample is drawn from the distribution G0, and once θ_1, ..., θ_n are observed, the next sample θ_{n+1} is either drawn from G0 with probability λ/(λ+n) or assigned the same value as θ_i for some i = 1, ..., n with probability 1/(λ+n). After the sequence of draws is long enough, we note that G is a weighted sum of point masses and that draws with the same value can be grouped together.

3.3.1.1 Sethuraman Stick-Breaking Representation

We have already mentioned that the Dirichlet process is an almost surely discrete random probability measure which is composed of a weighted sum of point masses. Sethuraman made this precise by providing a constructive definition of the Dirichlet process called the stick-breaking construction (Sethuraman, 1994). If G ∼ DP(G0, λ), it can be constructed as follows:

\[
\beta_k \sim \mathrm{Beta}(1, \lambda), \qquad
\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \qquad
G(\cdot) = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta^{*}_{k}}(\cdot), \quad \text{where } \theta^{*}_{k} \sim G_0.
\qquad (3.12)
\]

The construction of π can be understood through breaking a stick as follows. We imagine a stick of length 1; we break it at β_1 and assign the weight π_1 to the piece we just broke off. We then repetitively break the remaining portion to obtain π_2, π_3 and so forth. The stick-breaking distribution over π is sometimes written π ∼ GEM(λ) (GEM stands for Griffiths, Engen and McCloskey).
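A minimal sketch of the construction in (3.12), truncated at a finite number of sticks, is given below; this is an added illustration (the function names and the normal base measure are assumptions), not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(4)

def stick_breaking_weights(lam, truncation):
    """First `truncation` stick-breaking weights of (3.12):
    beta_k ~ Beta(1, lambda), pi_k = beta_k * prod_{l<k} (1 - beta_l)."""
    betas = rng.beta(1.0, lam, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

def draw_truncated_dp(lam, base_sampler, truncation=100):
    """Approximate draw G = sum_k pi_k * delta_{theta*_k} with theta*_k ~ G0;
    `base_sampler` stands in for the base measure G0."""
    weights = stick_breaking_weights(lam, truncation)
    atoms = np.array([base_sampler() for _ in range(truncation)])
    return weights, atoms

# Example with a standard normal base measure and lambda = 2.
weights, atoms = draw_truncated_dp(2.0, lambda: rng.normal(), truncation=50)
print(weights[:5], weights.sum())   # weights decay quickly and sum to just under one
```

Keeping only the first M weights and renormalizing is exactly the almost sure truncation discussed in Section 3.3.1.3.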
To summarize, the Sethuraman representation indicates that the unknown distribution G can be recovered if the infinite sequence of the θ*_k along with their corresponding weights were available. Because of its simple representation, Sethuraman's construction has led to a variety of extensions as well as novel inference techniques for the Dirichlet process.

3.3.1.2 Ferguson Gamma Process Representation

Before Sethuraman's representation, Ferguson provided a representation for the gamma process based on the arrival times of a homogeneous Poisson process (Ferguson and Klass, 1972). Let the E_k be independent and identically distributed Exp(1) random variables and Γ_k = E_1 + ··· + E_k. Let the Z_k be i.i.d. elements, independent of the Γ_k, with probability distribution G0 over (B, A). Then Ferguson showed that the Dirichlet process with parameters (G0, λ) can be described as the random probability measure

\[
G(\cdot) = \sum_{k=1}^{\infty} N^{-1}(\Gamma_k)\, \delta_{Z_k}(\cdot) \Big/ \sum_{l=1}^{\infty} N^{-1}(\Gamma_l),
\qquad (3.13)
\]

where

\[
N(x) = \lambda \int_{x}^{\infty} \frac{e^{-u}}{u}\, du, \qquad x > 0,
\qquad (3.14)
\]

is the Lévy measure of a gamma random variable with shape parameter λ > 0.

In Sethuraman's stick-breaking construction (3.12), the GEM weights are defined as

\[
\pi_1 = \beta_1 \quad \text{and} \quad \pi_k = (1 - \beta_1) \cdots (1 - \beta_{k-1})\, \beta_k \quad \text{for } k \ge 2,
\qquad (3.15)
\]

and they are related to the Poisson process. By ordering them such that π_(1) ≥ π_(2) ≥ ···, the two sets of weights are also related by the following form (Patil and Taillie, 1977; Perman and Pitman, 1992; Pitman and Yor, 1997):

\[
(\pi_{(1)}, \pi_{(2)}, \cdots) =
\left( \frac{N^{-1}(\Gamma_1)}{\sum_{l=1}^{\infty} N^{-1}(\Gamma_l)},\;
\frac{N^{-1}(\Gamma_2)}{\sum_{l=1}^{\infty} N^{-1}(\Gamma_l)},\; \cdots \right),
\qquad (3.16)
\]

where N is the Lévy measure defined in (3.14). Since there is no closed-form solution for the inverse Lévy measure, and since each π_(i) requires the calculation of an infinite sum, the Ferguson representation is not widely practicable.

3.3.1.3 Almost Sure Truncation of DP(G0, λ) Measures

We already know that a Dirichlet process can be formulated as a weighted sum of point masses, and it is natural to ask whether a truncated form of the Dirichlet process can replace the original Dirichlet process without losing accuracy. Let DP_M(G0, λ) be the approximate form of DP(G0, λ) obtained by discarding the terms M+1, M+2, ..., and denote it by P_M,

\[
P_M(\cdot) = \sum_{k=1}^{M} \pi_k\, \delta_{\theta^{*}_{k}}(\cdot), \qquad \theta^{*}_{k} \sim G_0,
\qquad (3.17)
\]

with π_M = 1 − π_1 − ··· − π_{M−1}. This corresponds to setting β_M = 1 so that Σ_{k=1}^{M} π_k = 1. As shown in Ishwaran and Zarepour (2000), the DP_M(G0, λ) random measure can be used to approximate integrable functionals of the Dirichlet process:

\[
DP_M(G_0, \lambda)(g) \rightarrow DP(G_0, \lambda)(g)
\qquad (3.18)
\]

for any arbitrary bounded and continuous real-valued function g. The key property (3.18) can be exploited to describe an efficient Gibbs sampler for Bayesian nonparametric problems in which P_M is used as an approximating prior to the Dirichlet process. Consider the following hierarchical setup, where the prior for the measure G follows the truncated Dirichlet process:

\[
Y_i \mid \theta_i \sim f(y_i \mid \theta_i), \qquad
\theta_i \mid G \sim G, \qquad
G \sim P_M.
\qquad (3.19)
\]

The marginal density of the observed data Y is therefore written as

\[
m_M(\mathbf{Y}) = \int \prod_{i=1}^{n} \left\{ \int f(y_i \mid \theta_i)\, dG(\theta_i) \right\} dP_M(G).
\qquad (3.20)
\]
If m_∞ is the marginal density of Y with the Dirichlet process as the prior for G, then, as Ishwaran showed (Ishwaran and James, 2001, 2002),

\[
\int \left| m_M(\mathbf{Y}) - m_\infty(\mathbf{Y}) \right| d\mathbf{Y}
\le 4 \left[ 1 - E\!\left( \sum_{k=1}^{M-1} \pi_k \right)^{\!n} \right]
\approx 4n \exp\!\left(-(M-1)/\lambda\right).
\qquad (3.21)
\]

The difference between the two marginal densities, based on the exact and the approximate sum representations respectively, has been bounded by Ishwaran as follows:

\[
\int \left| m_M(\mathbf{Y}) - m_\infty(\mathbf{Y}) \right| d\mathbf{Y}
= \int \left| \int \prod_{i=1}^{n} f(y_i \mid \theta_i) \left( \pi_M(d\boldsymbol{\theta}) - \pi_\infty(d\boldsymbol{\theta}) \right) \right| d\mathbf{Y}
\le \int \int \prod_{i=1}^{n} f(y_i \mid \theta_i)\, d\mathbf{Y} \left| \pi_M(d\boldsymbol{\theta}) - \pi_\infty(d\boldsymbol{\theta}) \right|
= 2 D(\pi_M, \pi_\infty),
\qquad (3.22)
\]

where D is the total variation distance between the two probability measures π_M and π_∞. Let k_i be the indicator of the group membership of θ_i, that is, θ_i = θ*_{k_i}. The sampled values of θ under π_M and π_∞ are identical if each k_i is sampled from a term no larger than the Mth term. Thus,

\[
D(\pi_M, \pi_\infty) \le 2 \left( 1 - \pi_M\{ k_i \le M,\ i = 1, 2, \ldots, n \} \right)
= 2 \left[ 1 - E\!\left( \sum_{k=1}^{M-1} \pi_k \right)^{\!n} \right]
\approx 2n \exp\!\left(-(M-1)/\lambda\right),
\qquad (3.23)
\]

where the rightmost approximation follows by observing that

\[
\sum_{k=1}^{M-1} \pi_k \overset{D}{=} 1 - \exp(-E_1/\lambda)\exp(-E_2/\lambda)\cdots\exp(-E_{M-1}/\lambda)
\approx 1 - \exp\!\left(-(M-1)/\lambda\right),
\qquad (3.24)
\]

where E_1, ..., E_{M−1} are i.i.d. Exp(1) random variables.

Another method of comparing P_M and DP(G0, λ) is to compare their clustering behavior under sampling. Let ϕ = {ϕ_1, ..., ϕ_k} denote the set of distinct θ_i's, where k ≤ n is the number of distinct elements in the vector θ = (θ_1, ..., θ_n), and let s_i be the indicator defined by s_i = j if θ_i = ϕ_j for i = 1, ..., n. Let n_j be the number of s_i = j, and let D_M and D_∞ equal the number of distinct values in Y when sampled under P_M and DP(G0, λ). Ishwaran proved that

\[
\frac{M!}{M^{k}\,(M-k)!} \;\le\; \frac{P(D_M = k)}{P(D_\infty = k)} \;\le\; n\lambda^{k}/M, \qquad k = 1, \ldots, \min(n, M).
\qquad (3.25)
\]

Since both sides of inequality (3.25) converge to one for each k as M → ∞, the two distributions agree in the limit by the squeeze theorem. Therefore, the M-truncation prior P_M has been shown to have clustering behavior similar to that of DP(G0, λ).

As shown in Corollary 20 of Pitman (1996), the posterior distribution P_M(· | θ) is the random probability measure

\[
P_M(\cdot \mid \boldsymbol{\theta}) = \sum_{j=1}^{M} \pi^{*}_{j}\, \delta_{\theta^{*}_{j}}(\cdot) + \pi^{*}_{M+1}\, P_M(\cdot),
\qquad (3.26)
\]

where θ*_1, ..., θ*_M are the distinct values in the full sequence θ_1, ..., θ_n, occurring with frequencies n*_j, and

\[
(\pi^{*}_{1}, \cdots, \pi^{*}_{M}, \pi^{*}_{M+1}) \sim
\mathrm{Dirichlet}\!\left( n^{*}_{1} + \lambda/n, \cdots, n^{*}_{M} + \lambda/n, \lambda(1 - M/n) \right).
\qquad (3.27)
\]

Let φ be a nonnegative or integrable function; the posterior mean of φ under P_M(· | Y) is characterized by

\[
\int \varphi(G)\, P_M(dG \mid \mathbf{Y}) =
\int \int \varphi(G)\, P_M(dG \mid \theta_1, \ldots, \theta_n)\, \mu(d\theta_1, \ldots, d\theta_n \mid \mathbf{Y}).
\qquad (3.28)
\]

The interior integral on the right-hand side of (3.28) can be expressed by averaging over the values of θ_1, ..., θ_n drawn from the Gibbs sampler,

\[
\int \varphi(G)\, P_M(dG \mid \theta_1, \ldots, \theta_n) =
\sum_{j=1}^{M} \frac{n^{*}_{j} + \lambda/n}{\lambda + n}\, \varphi(\delta_{\theta^{*}_{j}})
+ \frac{\lambda(1 - M/n)}{\lambda + n}\, \varphi(G_0).
\qquad (3.29)
\]

We can implement the preceding scheme (3.28) to estimate various posterior mean functionals by simplifying (3.28) with (3.29). There are many computational and theoretical advantages to using the approximate sum representation, since the simplified mathematical representations can be computed far more efficiently, especially in cases where a roundabout working strategy would otherwise be needed.

3.3.2 Dirichlet Process Mixture Model

To be consistent with the organization of Chapter 2, we begin with the profile identification defined in (2.1). Let c_it and s_i denote the ith individual's class membership at time t and his or her profile membership, respectively.
The class memberships over time, c_i = (c_{i1}, ..., c_{iT}), are then distributed according to the product of multinomial probability densities with the form

\[
f(\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i}) = P(\mathbf{C} = \mathbf{c}_i \mid U = s_i)
= \prod_{t=1}^{T} \prod_{c_t=1}^{C} \left[ \eta^{(t)}_{c_t|s_i} \right]^{I(c_{it} = c_t)},
\]

where η_{s_i} = (η^{(1)}_{1|s_i}, ..., η^{(T)}_{C|s_i}) indicates the vector of the secondary measurement parameters associated with profile s_i. Sampling from a Dirichlet process mixture (DPM) can be schemed by forming G with a countably infinite number of point masses from G0 and drawing the parameters η_{s_i} from G:

\[
\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i} \sim f(\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i}), \qquad
\boldsymbol{\eta}_{s_i} \mid G \sim G, \qquad
G \sim DP(G_0, \lambda).
\]

By marginalizing over the prior for G, the conditional distribution of η_{s_i} given the others is

\[
\boldsymbol{\eta}_{s_i} \mid \boldsymbol{\eta}_{s_1}, \ldots, \boldsymbol{\eta}_{s_{i-1}} \sim
\frac{1}{i - 1 + \lambda} \sum_{j=1}^{i-1} \delta(\boldsymbol{\eta}_{s_j})
+ \frac{\lambda}{i - 1 + \lambda}\, G_0,
\qquad (3.30)
\]

where δ(η_{s_j}) is the distribution concentrated at the point η_{s_j}. Equation (3.30) is associated with the Pólya-urn representation (Blackwell and MacQueen, 1973). In LCPA, this formulation only requires the marginal distribution because equal values of η_{s_i} can be grouped together to form a profile. The evolution of η_{s_1}, ..., η_{s_n} in (3.30) can also be formed by taking the number of components S in the finite mixture model to infinity:

\[
\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i} \sim f(\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i}), \qquad
\boldsymbol{\eta}_{s_i} \sim G_0, \qquad
s_i \mid \boldsymbol{\gamma} \sim \mathrm{Multinomial}(\gamma_1, \ldots, \gamma_S), \qquad
\boldsymbol{\gamma} \sim \mathrm{Dirichlet}(\lambda/S, \ldots, \lambda/S).
\qquad (3.31)
\]

By integrating over the mixing proportions γ = (γ_1, ..., γ_S), the marginal probability of s_i given the profile memberships preceding the ith individual has the following form:

\[
P(s_i = s \mid s_1, \ldots, s_{i-1})
= \frac{P(s_1, \ldots, s_{i-1}, s_i = s)}{P(s_1, \ldots, s_{i-1})}
= \frac{n_{i,s} + \lambda/S}{i - 1 + \lambda},
\]

where n_{i,s} is the number of s_j for j < i that equal s. Let us imagine that the ith individual is the last of the n observations; then the conditional probability for s_i given s_{−i} = (s_1, ..., s_{i−1}, s_{i+1}, ..., s_n), c_i and η = (η_1, ..., η_S) can be obtained by multiplying in the likelihood f(c_i | η_s) as follows:

\[
P(s_i = s \mid \mathbf{s}_{-i}, \mathbf{c}_i, \boldsymbol{\eta})
= b\, \frac{n_{-i,s} + \lambda/S}{n - 1 + \lambda}\, f(\mathbf{c}_i \mid \boldsymbol{\eta}_s),
\]

where n_{−i,s} is the number of s_j for j ≠ i that are equal to s, and b is the appropriate normalizing constant. If S goes to infinity,

\[
P(s_i = s \mid \mathbf{s}_{-i}, \mathbf{c}_i, \boldsymbol{\eta}) \rightarrow
\begin{cases}
b\, \dfrac{n_{-i,s}}{n - 1 + \lambda}\, f(\mathbf{c}_i \mid \boldsymbol{\eta}_s) & \text{if } s = s_j \text{ for some } j \neq i, \\[8pt]
b\, \dfrac{\lambda}{n - 1 + \lambda} \displaystyle\int f(\mathbf{c}_i \mid \boldsymbol{\eta}_s)\, dG_0(\boldsymbol{\eta}_s) & \text{if } s \neq s_j \text{ for all } j \neq i.
\end{cases}
\qquad (3.32)
\]

Here, η is the set of η_s currently associated with some observations, because we cannot explicitly represent the infinite number of η_s as S goes to infinity.

Next, we specify how to perform Gibbs sampling on DPM models to select the number of profiles in the LCPA model. Let the current profile state consist of s = (s_1, ..., s_n). We can repeatedly sample with the following three-step procedure:

1. For i = 1, ..., n, remove the profile s_i from the current state if s_i is not associated with any other observation (i.e., n_{−i,s_i} = 0).
2. Generate a new profile membership s_i from the equation defined in (3.32). If the new s_i is not associated with any other observation, generate a set of values for η^{(t)}_{s_i} for all t from the posterior distribution defined in (3.5). Repeat this step for i = 1, ..., n.
3. For all s ∈ (s_1, ..., s_n), generate a set of new values for η^{(t)}_{s} for all t from the posterior distribution.

The procedure described above points out that the ith individual explores a new profile with probability λ/(i − 1 + λ), and thus the average number of profiles is expected to grow logarithmically with the sample size n, as O(λ log n).
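Step 2 of the procedure above can be sketched as follows; this is an added illustration under assumed argument names (the likelihood terms are supplied by the caller), not the author's implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_profile(i, s, lam, log_lik_existing, log_marglik_new):
    """Draw a new profile label for individual i following (3.32).

    s                 : current integer profile labels for all n individuals
    log_lik_existing  : dict {existing label: log f(c_i | eta_label)}
    log_marglik_new   : log of  integral f(c_i | eta) dG0(eta)  (or a Monte
                        Carlo estimate) for a brand-new profile
    Step 1 is assumed to have already removed any profile occupied only by i,
    so every count below is positive.
    """
    n = len(s)
    labels = list(log_lik_existing.keys())
    counts = np.array([np.sum(np.delete(np.asarray(s), i) == lab) for lab in labels],
                      dtype=float)
    log_w = np.log(counts) - np.log(n - 1 + lam) + \
            np.array([log_lik_existing[lab] for lab in labels])
    log_w_new = np.log(lam) - np.log(n - 1 + lam) + log_marglik_new
    log_w_all = np.append(log_w, log_w_new)
    probs = np.exp(log_w_all - log_w_all.max())     # normalize in log space
    probs /= probs.sum()
    choice = rng.choice(len(probs), p=probs)
    return labels[choice] if choice < len(labels) else max(labels) + 1
```

Working on the log scale avoids underflow, since f(c_i | η_s) is a product of T·C probabilities.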
It has also been proved that the number of profiles converges to infinity almost surely as $n$ increases (Korwar and Hollander, 1973). In other words, if $\lambda$ is fixed at a constant with a large sample size, an overfitting problem might occur. In Liu (1996), $\lambda$ is estimated by its ML estimate using sequential imputation, but this demands many conditional probability calculations that rarely have closed parametric forms. Therefore, we consider a noninformative gamma prior, which assigns nearly equal probabilities to all possibilities, and update $\lambda$ with the procedures suggested by Escobar and West (1995). Here we briefly describe how to incorporate the concentration parameter into the Gibbs sampling update procedure. According to Antoniak (1974), given the concentration parameter $\lambda$ and sample size $n$, the uncertainty about the number of components $k$ can be represented with the probability density

\[
P(k \mid \lambda, n) = C_n(k)\, n!\, \lambda^{k}\, \frac{\Gamma(\lambda)}{\Gamma(\lambda + n)},
\tag{3.33}
\]

where $C_n(k)$ does not involve $\lambda$. Assume we have already sampled $\theta_i$ for $i = 1, \ldots, n$ by the DPM algorithm and classified them into groups; the value of $k$ is then the number of groups in the current configuration of the data. Suppose $\lambda \sim G(a, b)$, a gamma prior with shape $a$ and rate $b$. For $\lambda \ge 0$, we have

\[
\frac{\Gamma(\lambda)}{\Gamma(\lambda + n)} = \frac{(\lambda + n)\, B(\lambda + 1, n)}{\lambda\, \Gamma(n)},
\tag{3.34}
\]

where $B(\cdot, \cdot)$ is the Beta function. Then the posterior of $\lambda$ given $k$ has the following form:

\[
P(\lambda \mid k) \;\propto\; P(\lambda)\, \lambda^{k-1} (\lambda + n)\, B(\lambda + 1, n)
\;\propto\; P(\lambda)\, \lambda^{k-1} (\lambda + n) \int_0^1 x^{\lambda} (1 - x)^{n-1}\, dx.
\tag{3.35}
\]

This implies that $P(\lambda \mid k)$ is the marginal distribution from the joint distribution of $\lambda$ and a continuous variable $\nu$ such that

\[
P(\lambda, \nu \mid k) \;\propto\; P(\lambda)\, \lambda^{k-1} (\lambda + n)\, \nu^{\lambda} (1 - \nu)^{n-1}.
\tag{3.36}
\]

Since we assume $\lambda$ has $G(a, b)$ as its prior, we can write down the conditional probabilities for $\lambda$ and $\nu$ as follows:

\[
P(\nu \mid \lambda, k) \;\propto\; \nu^{\lambda} (1 - \nu)^{n-1},
\qquad
P(\lambda \mid \nu, k) \;\propto\; \lambda^{a+k-1} (e^{-b} \nu)^{\lambda} + n\, \lambda^{a+k-2} (e^{-b} \nu)^{\lambda}.
\tag{3.37}
\]

Given the value of $\lambda$, the value of $\nu$ is therefore updated by drawing from $B(\lambda + 1, n)$, and the conditional distribution of $\lambda$ given $\nu$ reduces to a mixture of two gamma densities,

\[
P(\lambda \mid \nu, k) \;\sim\; \pi_\nu\, G\bigl(a + k,\, b - \log(\nu)\bigr) + (1 - \pi_\nu)\, G\bigl(a + k - 1,\, b - \log(\nu)\bigr),
\tag{3.38}
\]

with $\pi_\nu$ satisfying $\pi_\nu / (1 - \pi_\nu) = (a + k - 1)/\{n(b - \log(\nu))\}$. At each iteration of Gibbs sampling, we can therefore sample $\lambda$ from this mixture of gamma distributions given the currently sampled $\nu$ and $k$.

3.33 Dynamic Dirichlet Process Mixture Model

In scenarios where the data are thought to be produced by a stage-sequential process, the hidden Markov model is certainly the most suitable method for describing the data. However, the task of estimating the number of stages (i.e., classes) is not covered in the hidden Markov model setting. We therefore extend the hidden Markov model by assuming that each row of the transition matrix follows a Dirichlet process, so as to explore the stage-sequential progression and infer the number of classes afterwards. To be precise, we treat the data observed at each time point as a group and assume the observations are exchangeable within that time point. While each mixture model has mixing proportions specific to its group, we require that the different groups share the same set of mixture components. The idea is that while different groups have different characteristics given by different combinations of mixing proportions, using the same set of mixture components allows statistical strength to be shared across groups and allows generalization to new groups. This construction is generally known as the hierarchical Dirichlet process (HDP).
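The concentration-parameter update of Section 3.32 is reused in Steps 2 and 4 of the dynamic sampler described below. A minimal sketch of one such draw, following (3.37)-(3.38) and assuming the current number of occupied components k and a Gamma(a, b) prior, is given here; the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_concentration(lam, k, n, a=1.5, b=1.0):
    """One Escobar-West style update of the DP concentration parameter.

    lam : current value of the concentration parameter
    k   : current number of occupied components (e.g., profiles)
    n   : sample size
    a,b : parameters of the Gamma(a, b) prior, e.g. the diffuse G(1.5, 1)
    """
    # Auxiliary variable nu | lam, k ~ Beta(lam + 1, n), cf. (3.37).
    nu = rng.beta(lam + 1.0, n)
    # Mixture weight for the two gamma components, cf. (3.38).
    odds = (a + k - 1.0) / (n * (b - np.log(nu)))
    pi_nu = odds / (1.0 + odds)
    # Sample lam from the gamma mixture (numpy's gamma uses scale = 1/rate).
    rate = b - np.log(nu)
    if rng.random() < pi_nu:
        return rng.gamma(a + k, 1.0 / rate)
    return rng.gamma(a + k - 1.0, 1.0 / rate)
```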
Overall speaking, The HDP allows the data at a specific time point to have similar structure by providing global layer of hierarchy. The applications of HDP are presented in literatures (Ahmed and Xing, 2008; Xu et al., 2008). In LCPA models, we adopt the concept of HDP to identify the number of classes. Since the number of classes is not necessarily constant over time, we call this process as dynamic Dirichlet process mixture (DDPM) model. If each individual’s class membership over time (i.e., cit ) could be observed, we would like a joint model for class membership across T time points, ∏n i=1 P (C1 = ci1 , . . . , CT = ciT ), where all possible sequences of class membership can be re-expressed as frequencies in a contingency table with C T cells for n individuals. This table can produce a reasonable inference about the important aspects of the common progression of class membership over time. In LCPA, we assume there is an extra level of latent variable (i.e., class profile) by which the dependency among classes can be explained. Under this assumption, the classes are conditionally independent given the profile membership. The class memberships, however, are unobserved directly, and it is not possible to learn the class profile before the class memberships over time are known. In addition, it is not possible to know the true structure of dependency among class memberships over time. The LCPA can summarize 39 the dependency among classes by a small number of class profiles, without regard for the structure of dependency. Therefore, we implement DDPM by generating dependency among classes from the first-order Markov chain and summarizing the dependency through the class profiles. Let yi,t be the observed values for M items of the ith individual at time t. If his or her class membership at time t, ci,t , could be observed, the distribution of yi,t is the product of multinomial probability densities with the form f (yi,t | ρt|c ) = P (Yt = yi,t | Ct = ci,t ) i,t ]I(y M ∏ ∏ rm [ imt =k) = , ρmkt|c i,t m=1 k=1 where ρt|c represents the vector of all ρ-parameters associated with time t and class i,t (t) = P (Ct = ct | Ct−1 = ct−1 ) represent the transition ct |ct−1 probability of class membership from time t − 1 to t. Suppose we have C classes over T membership ci,t . Let τ time periods, there is a C × C transition probability matrix τ (t) with all elements of each row of τ (t) summed to one for t = 2, . . . , T . Thus, given the previous class membership (t) ct−1 , the row vector τ c , can be used as the mixing proportions for the current class t−1 membership. As described before, sampling from a Dirichlet process mixture can be schemed by taking 40 the number of classes C to infinity: yi,t | ρt|c ∼ f (yi,t | ρt|c ) i,t i,t (t) (t) (t) ci,t | τ c ∼ Multinomial(τ ,...,τ ) t−1 1|ct−1 C|ct−1 ρt|c ∼ G0 i,t (t) (t−1) (t−1) τc ∼ Dirichlet(n1 + α/C, . . . , nC + α/C), t−1 (t−1) where nc (3.39) denotes the number of individuals who were assigned class c at time t − 1, (t) α is the concentration parameter. By integrating over the mixing proportions τ c , the t−1 prior for ci,t , as conditional probability, has the following form (t−1) (t) nc + ni,c + α/C t t P (ci,t = ct | c1,t−1 , . . . , cn,t−1 , c1,t , . . . , ci−1,t ) = , n+i−1+α (t−1) (t) nc + ni,c + α/C t t P (ci,t = ct | c1,t−1 , . . . , cn,t−1 , c1,t , . . . , ci−1,t ) = , n+i−1+α (t) i,ct is the number of cj,t for j < i that equals to ct at time t. 
Let us imagine that the ith individual is the last of the n observations, then the conditional probabilities for ci,t where n given ct−1 = (c1,t−1 , . . . , cn,t−1 ), c−i,t = (c1,t , . . . , ci−1,t , ci+1,t , . . . , cn,t−1 ), and ρt = (ρt|1 , . . . , ρt|C ) can be obtained by multiplying the likelihood, f (yi,t | 41 ρt|c ), as follows: t (t−1) (t) nc + n−i,c + α/C t t f (yi,t | ρt|c ), P (ci,t = ct | ct−1 , c−i,t , yi,t , ρt ) = b t 2n − 1 + α (t) −i,ct is the number of cj,t for j ̸= i that are equal to ct at time t, and b is the appropriate normalizing constant. If C goes to infinity, where n P (ci,t = ct | ct−1 , c−i,t , yit , ρt ) →    n(t−1) +n(t)   c −i,ct   b t f (yit | ρt|c )   2n−1+α  t   if ct is equal to cj,t−1 or cj,t for some j ̸= i (3.40)      if ct is not equal to cj,t−1  n(t−1) +n(t)   −i,ct ∫  b ct  f (yit | ρt|c )dG0 (ρt|c )  2n−1+α t t and cj,t for all j ̸= i Here, ρt is the set of ρt|c currently associated with some observations because we cant not explicitly represent the infinite number of ρt|c as C goes to infinity. Therefore, we t can update the current class membership by sampling ci,t over time from the conditional probability P (ci,t | ct−1 , c−i,t , ct+1 , yi,t , ρt ) = P (cit = ct | ct−1 , c−i,t , yit , ρt )P (ct+1 | ct ). (3.41) Note that the first factor in (3.41) was presented in Equation (3.40). The second factor in (3.41) can be computed by integrating over the mixture weights τ (t+1) which depends 42 on the number of individuals who were assigned to each of identified classes at time t (i.e., (t) (t) n1 , . . . , nC ). Here, C is the number of existing classes at time t and t + 1. Then, it is straightforward to show that: ∑ ∏C (t) (t) (t+1) Γ( C nc + α/C) Γ(nc + nc + α/C) P (ct+1 | ct ) = ∏ c=1 , × c=1 ∑C (t) (t+1) (t) C Γ(n + α/C) Γ( + α/C) c c=1 nc + nc c=1 where Γ is a typical gamma function. We consider Gamma distribution with shape a and rate b, G(a, b), as the prior for the concentration parameters α and λ used in class and profile learning, respectively. Given the previous constructions, we specify the Gibbs sampling by following four-step procedure. 1. For t = 1, . . . , T , generate a new class membership ci,t from the equation (3.41). If the new ci,t is not associated with any other observation (i.e., n (t) (t−1) −i,ci,t = n−i,ci,t = 0), generate a set of values for ρt|c from the posterior distribution defined in (3.5). i,t Repeat this step for i = 1, . . . , n. 2. To update concentration parameter, α, we sample x and u from Beta(α + 1) and Uniform(0, 1), respectively. We then calculate pα = (a + µα − 1)/(a + µα − 1 + n(b − log(x))), where µα is the average of numbers of classes identified at each time. If u < pα , we update α by drawing a sample from G(a + µα , b − log(x)) or draw a sample from G(a + µα − 1, b − log(x)), otherwise. 3. With the class membership identified in Step 1, ci = (ci,1 , . . . , ci,T ) for i = 1, . . . , n, employ DPM to identify the number of profiles by following the three-step procedure described above. 43 4. We sample x and u from Beta(λ + 1) and Uniform(0, 1), respectively and calculate pλ = (a + S − 1)/(a + S − 1 + n(b − log(x))), where S is the number of identified profiles in Step 3. If u < pλ , we update λ by drawing a sample from G(a + S, b − log(x)) or draw a sample from G(a + S − 1, b − log(x)), otherwise. 44 Chapter 4 Simulation In this section, we evaluate the performance of RJMCMC and Dirichlet process over repeated samples. 
In our simulation, we draw over 200 samples from data sets generated by a pre-specified LCPA model with two classes (i.e., $C = 2$) and two profiles (i.e., $S = 2$), measured by four binary items (i.e., $M = 4$) over four time periods (i.e., $T = 4$). To simplify the presentation, we constrained the $\rho$-parameters to be invariant over time (i.e., $\rho_{mk1|c} = \cdots = \rho_{mkT|c} = \rho_{mk|c}$), but imposed no constraint on the $\eta$-parameters. Note that the $\eta$-parameters describe the class progression over time. If the $\eta$-parameters were time invariant (i.e., $\eta_s^{(1)} = \cdots = \eta_s^{(T)}$, where $\eta_s^{(t)} = (\eta_{1|s}^{(t)}, \ldots, \eta_{C|s}^{(t)})$), then, given a profile, the probabilities of belonging to a specific class could not change over time, leading to difficulties in interpreting the identified profiles. We used balanced probabilities of profile membership (i.e., $\gamma_1 = \gamma_2 = .5$), but the following three factors were varied for the simulation study: the relationship between items and classes (i.e., the $\rho$-parameters), the relationship between classes and profiles (i.e., the $\eta$-parameters), and the sample size ($n = 250$ or $n = 500$). The $\rho$-parameters and the $\eta$-parameters each have two levels: the strong relationship (i.e., $\rho = .9$ or $.1$ and $\eta = .9$ or $.1$) and the mixed relationship (i.e., $.1 < \rho < .4$ or $.6 < \rho < .9$, and $.1 < \eta < .4$ or $.6 < \eta < .9$). For each sample, we estimate the number of classes and profiles by RJMCMC and the Dirichlet process. Our purpose is to compare their performances through the relative frequency of selecting the true model (i.e., the two-class and two-profile LCPA) over repeated samples. For the prior specifications, we choose $\varsigma = 5$, $\kappa = 0.2$, and $\sigma = 1$ and apply a rather diffuse prior, $G(1.5, 1)$, on both concentration parameters $\alpha$ and $\lambda$ (i.e., $a = 1.5$ and $b = 1$).

4.1 Simulation with Reversible Jump MCMC

For each data set, we run our sampler for 10,000 iterations with an additional 1,000 iterations as a burn-in period. Regarding the performance in recovering the correct number of classes, RJMCMC never fails to select the true number of classes under all combinations of factors considered. We find that the RJMCMC samplers converge to the desired posterior distribution of the number of classes quickly. Once converged, however, the trans-dimensional moves are not easily made. In fact, the difficulty in accepting proposed parameters of a different dimensional space may lead to some biases in the number of classes. To improve the low acceptance rates, we modify the across-model scheme by adding the birth and death of an empty class. The rate of accepting the birth move of an empty class is controlled by the quantity $\prod_{s=1}^{S}\prod_{t=1}^{T}(1 - u_s^{(t)})^{n_s^{(t)}}$, where $u_s^{(t)}$ is a random variable from Uniform(0, 1) and $n_s^{(t)}$ is the size of profile $s$ at time $t$, for all $s = 1, \ldots, S$ and $t = 1, \ldots, T$. The problem of poor mixing remains, however, since the acceptance rate can diminish exponentially fast even for profiles of moderate size. Table 4.1 shows the percentage of the most frequently selected profiles over the simulated samples.
Table 4.1: The percentage of the most frequently selected profiles over simulated samples by reversible jump MCMC

                                       Number of profiles
  Sample size   ρ        η          1      2      3      4      5      6      7
  250           Strong   Strong   0.0   78.4   19.6    2.0    0.0    0.0    0.0
                Mixed    Strong   0.0   74.8   20.2    4.6    0.4    0.0    0.0
                Strong   Mixed    0.0   45.2   48.6    5.4    0.8    0.0    0.0
                Mixed    Mixed    5.2   34.8   58.8    1.2    0.0    0.0    0.0
  500           Strong   Strong   0.0   82.2   17.2    0.6    0.0    0.0    0.0
                Mixed    Strong   0.0   52.2   45.8    1.6    0.3    0.1    0.0
                Strong   Mixed    5.4   19.6   70.8    3.0    0.8    0.3    0.1
                Mixed    Mixed   10.4   14.4   70.2    4.6    0.4    0.0    0.0

For the number of profiles, RJMCMC is able to identify the correct number of profiles when both measurements are strong (78.4% for the small sample and 82.2% for the large sample). However, if the secondary measurement is mixed, accurately discriminating among the competing models becomes difficult, especially when RJMCMC works with large samples. We find that, under weak measurement setups, some of the resulting profiles are formed with small sizes, and richly structured samples exacerbate the problem even further. We generalize the cases in a later section to explain how this intrinsic difficulty of RJMCMC impacts the performance.

[Figure 4.1: The percentage of the most frequently selected profiles over simulated samples by reversible jump MCMC, for the small (n = 250) and large (n = 500) sample sizes: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.]

4.2 Simulation with Dirichlet Process

Even though the Dirichlet process chooses the correct number of classes throughout the simulations, the method is not free from problems. A troubling aspect of the Dirichlet process is that a complex model may be falsely preferred because, with some probability, profiles are generated but rarely visited thereafter, which results in a sparse decomposition of the LCPA model. In addition to presenting results from the Dirichlet process without applying any inclusion threshold, we will adapt our models to work with two cutoffs (e.g., 1% and 5%) and investigate the changes. The results of the Dirichlet process for the number of profiles are presented in Figure 4.2. The Dirichlet process is known to be consistent in recovering the true model as the sample size increases, and Figure 4.2 corroborates this attribute, since the correct rates increase when the sample size grows. It is rather interesting that the Dirichlet process favors models with an excessive number of profiles, especially when working with mixed primary measurements. We believe the poor performance can be ascribed to the weak relationship between the classes and the items, since a well-identified sequence of class memberships is the key ingredient for successful classification of the profiles. When the primary measurement is mixed, we see poor performance, especially with small samples. Generally, the Dirichlet process is poorly suited to identifying the correct number of profiles when none of the measurements is strong and the sample size is small, but the method proves to be of good practical use when the sample size is large. In fact, during many stages of profile learning the Dirichlet process develops some profiles whose sizes never exceed 1%. To avoid profile redundancy, we need to impose a threshold as an inclusion criterion.
We use 1% and 5% as the cutoffs, by which any profile whose size is less than the given cutoff is considered too sparse to be included; the corresponding distributions for the number of profiles are given in Figure 4.3 and Figure 4.4. Clearly, when the 5% cutoff is applied, the predictive accuracy increases dramatically, especially when strong-strong pairs and large sample sizes are the working scenario. Determining an optimal cutoff value is largely a matter of discovering the right dominating profiles.

[Figure 4.2: The histogram of the most frequently selected profiles over simulated samples from Dirichlet process, for the small (n = 250) and large (n = 500) sample sizes: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.]

[Figure 4.3: The histogram of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 1%: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.]

[Figure 4.4: The histogram of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 5%: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.]

Table 4.2: The percentage of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 5%

                                       Number of profiles
  Sample size   ρ        η          1      2      3      4      5      6      7
  250           Strong   Strong   0.0   69.0   29.0    1.0    1.0    0.0    0.0
                Mixed    Strong   0.0   35.0   38.0   21.0    5.0    1.0    0.0
                Strong   Mixed    0.0   51.0   34.0   12.0    3.0    0.0    0.0
                Mixed    Mixed    6.0   41.0   33.0   17.0    5.0    0.0    0.0
  500           Strong   Strong   0.0   81.0   17.0    1.0    1.0    0.0    0.0
                Mixed    Strong   0.0   50.0   20.0   20.0    5.0    2.0    0.0
                Strong   Mixed    0.0   65.0   30.0    4.0    1.0    0.0    0.0
                Mixed    Mixed    3.0   53.0   30.0   14.0    0.0    0.0    0.0

Since the 5% cutoff is recognized as more effective in producing satisfying results without being unreasonably liberal in screening profiles, we consider 5% as the appropriate threshold for determining the number of profiles; the percentages of the most frequently selected profiles are reported in Table 4.2. Cases in which more than seven profiles are generated are not shown in the table, since their sizes are negligibly small.

4.3 Extended Cases

In the previous sections, we provided empirical results showing that both presented techniques are able to learn the class structure correctly. However, to investigate the performance and validate their effectiveness in selecting the right profiles, we consider applications more generally. We draw samples from LCPA models with two classes, two profiles, and four items over a varying number of time measurements (i.e., T = 3, 4, and 5). Table 4.3 indicates that, with strong-strong pairs, RJMCMC can optimize the performance when more time measurements are involved.
Table 4.3: The percentage of selecting the correct number of profiles (S = 2) over simulated samples from RJMCMC and Dirichlet process

                                         RJMCMC                 DP with 5% cutoff
  Sample size   ρ        η        T = 3   T = 4   T = 5     T = 3   T = 4   T = 5
  250           Strong   Strong    74.6    78.4    80.5      51.0    69.0    91.0
                Mixed    Strong    76.2    74.8    73.0      34.0    35.0    60.0
                Strong   Mixed     53.6    45.2    58.3      50.0    51.0    65.0
                Mixed    Mixed     54.3    34.8    32.6      38.0    41.0    55.0
  500           Strong   Strong    77.9    82.2    86.2      57.0    81.0    92.0
                Mixed    Strong    64.7    52.2    46.9      45.0    51.0    66.0
                Strong   Mixed     27.4    19.6    22.2      51.0    65.0    82.0
                Mixed    Mixed     33.1    14.4    17.7      48.0    53.0    65.0

However, performance degradation is noted when weak measurements are given, which we believe is because weak measurements only vaguely distinguish the competing models, and most of the time RJMCMC then prefers a complex model over a simple one. On the other hand, we found that the Dirichlet process can learn the profiles more accurately when more time measurements are involved. The results also demonstrate that working with large data sets is always accompanied by significant improvement, although enlarging the sample size may not be a feasible way to handle model complexity.

4.4 Discussion

The latent stage-sequential process is an attractive tool for many areas of substantive research. Like other mixture models, however, it can be difficult to estimate the number of latent components, such as classes and profiles. We have illustrated two Bayesian approaches, RJMCMC and the Dirichlet process, for latent class profile analysis. Although this model is certainly not representative of all latent stage-sequential processes, it nevertheless lends several important insights that we believe have general relevance.

First, both RJMCMC and the Dirichlet process can learn the model without assuming the number of latent components in advance. Compared with the technical challenges that RJMCMC brings, the Dirichlet process may be preferred in terms of flexibility and consistency. The Dirichlet process technique relaxes the constraint on the number of classes placed at each measurement occasion, and it performs well with large sample sizes. The Dirichlet process is conceptually simple and does not require intensive computation, since there is no need to design jumping proposals or compute acceptance rates. Technically, the Dirichlet process is easily employed on an intuitive basis and readily extended to study longitudinal data. In Bayesian mixture modeling, label switching is one of the most common issues and can lead to poor parameter estimation. However, it does no harm in finding the number of components for the LCPA model. In the simulation study, we ran the algorithms in the presence of label switching, but the conclusions drawn here remain unaffected. To capture structural change in the classes over the course of the study, we may allow RJMCMC to vary the number of classes at different measurement occasions, but this leads to insurmountable technical hurdles. Allowing a varying number of classes over time is, however, possible in DDPM, which is an appealing aspect of the Dirichlet process approach compared to RJMCMC. In this study, we present a Bayesian model selection analysis for LCPA without any covariates to predict the prevalence of the profiles; however, the predictor-dependent kernel stick-breaking process has already attracted interest. It has been utilized in choosing priors for an unknown probability measure (Dunson and Park, 2008) and in variable selection problems (Chung and Dunson, 2009).
Adding predictors in the prior consideration gives different insights into how the classes are formed under the influence of predictors and it is understood as a dependent Dirichlet Process. Future work should explore the model selection for the LCPA regression 55 model. Some specific limitation to this study include the choice of jumping rules in RJMCMC. The simulation study can only serve as a preliminary exploration and it is worth noting that a comparison between RJMCMC and Dirichlet process cannot be generalized unless the ingenuity of the proposal mechanism we proposed in the RJMCMC is ensured. Besides, we note that the results of the Dirichlet process may not be promising when mixed-mixed pairs are considered since it did not gain much accuracy when sample size was increased from 250 to 500. We believe that the appropriate sample size should be investigated for Dirichlet process to produce accurate results under different scenarios on the measurement parameters for future study. We provided a limited demonstration to help elucidate the the two potential methodologies for the model selection in LCPA. Our hope is that substantive practitioners will be able to utilize this demonstration in their research and that it may provide helpful guidance for the implementation of these two methods. In summary, this presentation provided the coverage of the implementation of RJMCMC and the Dirichlet process for the LCPA model and showed the performance of those methods in selecting the number of classes and profiles over repeated samples. Future work should investigate more tailored RJMCMC and the Dirichlet process specialized for LCPA and compare the performance formally via simulation study. We expect that each method will display its own unique strengths and weaknesses under different conditions. 56 Chapter 5 Parameter Estimation 5.1 Preliminary Parameters for latent class profile analysis (LCPA) are easily estimated by maximum likelihood via EM algorithm or Bayesian method via Markov chain Monte Carlo. However, the local maximum problem is a long-standing issue in any hill-climbing optimization technique for the LCPA model. In this study, we propose to apply two probabilistic optimization techniques using the deterministic annealing framework in order to deal with multiple local modalities in the LCPA model. The deterministic annealing approaches are implemented with an efficient recursive formula in the step for the parameter update. 5.2 Expectation-Maximization The EM algorithm is an iterative procedure which is widely employed for finding maximum likelihood and solving missing value problems (Dempster et al., 1977). Estimation of parameters in the LCPA model may be regarded as an estimation problem with missing data: 57 the realized values of the latent-profile variable Ui and latent-class memberships Cit for all t are missing for individuals i = 1, . . . , n. The observed data loglikelihood given the covariates xi is n ∑ ( ) log P r(Yi = yi ) | xi i=1   n S T ∑ ∑ ∏ (t) = log  γs (xi ) Rs  s=1 t=1 i=1 l(θ) = (5.1) Direct maximization of the observed-data loglikelihood is complicated because we need to solve optimization problem with sum of logarithm functions; however, ML estimates could be calculated directly in closed form if there is no latent variables. EM is computationally tractable to maximizing the loglikelihood function by repeatedly solving the complete-data problem. EM is an iterative procedure in which each iteration consists of two steps, the E-step and M-step. 
In the E-step, we compute the expected values of the unobservable cell counts for the cross-classification by $Y_{i1}, \ldots, Y_{iM}$ and $U_i, C_{i1}, \ldots, C_{iT}$, given the observed data and the previous parameter estimates. In the M-step, we maximize the complete-data loglikelihood function under the assumption that the missing data are known, with the expected values from the E-step substituted for the actual missing data. For the LCPA model, the EM algorithm has some important advantages over other methods for finding ML estimates: it converges gradually but reliably; it guarantees that the resulting estimates lie within the parameter space; it does not require inversion of the Hessian matrix at each iteration, so it demands less computational time than the Newton-Raphson method; and it does not require a carefully chosen set of initial values to start the iterative process in order to converge to the final solution. For all these reasons, the EM algorithm has been the preferred method for ML-based estimation in the LCPA model. If covariates are included, the expected complete-data loglikelihood can no longer be maximized using closed-form expressions; the M-step requires an iterative procedure equivalent to a routine for fitting a standard baseline-category multinomial logistic-regression model. Van der Heijden et al. (1996) and Bandeen-Roche et al. (1997) utilized an EM algorithm that incorporates Newton-Raphson steps for the multinomial logit model into the standard latent class estimating procedure devised by Goodman (1974). The E- and M-steps of this procedure are given below.

5.21 E-step

In the E-step, the posterior probabilities of latent class and profile memberships for the $i$th individual are calculated under the provisional parameter estimates from the previous iteration. By Bayes' theorem, these posterior probabilities are

\[
\theta_{i(s,\mathbf{c})} = \Pr(U_i = s, \mathbf{C}_i = \mathbf{c}_i \mid \mathbf{Y}_i = \mathbf{y}_i, \mathbf{x}_i)
= \frac{\gamma_s(\mathbf{x}_i)\prod_{t=1}^{T} \eta^{(t)}_{c_t|s} R^{(t)}_{c_t|s}}
{\sum_{s=1}^{S} \gamma_s(\mathbf{x}_i)\prod_{t=1}^{T}\Bigl\{\sum_{c_t=1}^{C} \eta^{(t)}_{c_t|s} R^{(t)}_{c_t|s}\Bigr\}}.
\]

5.22 M-step

In the M-step, updated parameter estimates are obtained by maximizing the expected complete-data loglikelihood, treating the latent variables as if they were observed. If $U_i$ and $\mathbf{c}_i$ were known, let the contribution of the $i$th individual to the complete-data likelihood be denoted $L^*_i(\theta)$; the logarithm of this contribution, $\ell^*_i(\theta) = \log L^*_i(\theta)$, can be written as

\[
\ell^*_i(\theta) = \sum_{s=1}^{S} \theta_{is} \log \gamma_s(\mathbf{x}_i)
+ \sum_{s=1}^{S}\sum_{t=1}^{T}\sum_{c_t=1}^{C} \theta^{(t)}_{i(s,c_t)} \log \eta^{(t)}_{c_t|s}
+ \sum_{t=1}^{T}\sum_{c_t=1}^{C} \theta^{(t)}_{ic_t} \sum_{m=1}^{M}\sum_{k=1}^{r_m} I(y_{imt}=k) \log \rho_{mkt|c_t},
\tag{5.2}
\]

where $\theta_{is} = \sum_{c_1}\cdots\sum_{c_T} \theta_{i(s,\mathbf{c})}$, $\theta^{(t)}_{i(s,c_t)} = \sum_{c_j,\, j\neq t} \theta_{i(s,\mathbf{c})}$, and $\theta^{(t)}_{ic_t} = \sum_{s} \theta^{(t)}_{i(s,c_t)}$. According to (2.6), for each $s = 1, \ldots, S-1$ the maximizing vector $\beta_s$ can be derived by setting

\[
\frac{\partial}{\partial \beta_{sr}} \sum_{i=1}^{n} \theta_{is} \log \frac{\exp(\mathbf{x}'_i \beta_s)}{\sum_{j=1}^{S} \exp(\mathbf{x}'_i \beta_j)} = 0
\quad \text{for } r = 1, \ldots, p.
\tag{5.3}
\]

To approach the roots of these non-linear equations, we use the Newton-Raphson (NR) algorithm to expedite the calculation. In the initial stages of EM, the parameter estimates may be far from the desired values, and Newton's method might then fail to converge. To guard against this, we execute only one iteration of the NR method within each M-step. In the NR method, an estimate $\beta^{\mathrm{new}}_s$ is updated by

\[
\beta^{\mathrm{new}}_s = \beta^{\mathrm{old}}_s - \Bigl(\frac{\partial^2 \ell}{\partial \beta\, \partial \beta^{T}}\Bigr)^{-1} \Bigl(\frac{\partial \ell}{\partial \beta}\Bigr)\Big|_{\beta = \beta^{\mathrm{old}}},
\tag{5.4}
\]

where $\beta$ is the vectorized parameter containing all parameter elements.
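A minimal sketch of this single Newton-Raphson update on the $\beta$-parameters is given below, assuming the baseline-category parameterization with $\beta_S$ fixed at zero and fractional responsibilities $\theta_{is}$ supplied by the E-step; the function and variable names are illustrative rather than taken from the dissertation.

```python
import numpy as np

def newton_step_beta(beta, X, theta):
    """One Newton-Raphson step for the baseline-category logit M-step, cf. (5.4).

    beta  : (S-1, p) current coefficients; profile S is the baseline (beta_S = 0)
    X     : (n, p) covariate matrix (an intercept column may be included)
    theta : (n, S) fractional responsibilities theta_is from the E-step; rows sum to one
    """
    n, p = X.shape
    S = theta.shape[1]
    # gamma_s(x_i) under the current beta, computed with a stabilizing shift.
    logits = np.column_stack([X @ beta.T, np.zeros(n)])          # (n, S)
    logits -= logits.max(axis=1, keepdims=True)
    gamma = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Score: sum_i (theta_is - gamma_s(x_i)) x_i, stacked over s = 1, ..., S-1.
    resid = theta[:, :S - 1] - gamma[:, :S - 1]                  # (n, S-1)
    score = (resid.T @ X).reshape(-1)                            # ((S-1) * p,)
    # Negative Hessian blocks: sum_i gamma_s (delta_sr - gamma_r) x_i x_i'.
    H = np.zeros(((S - 1) * p, (S - 1) * p))
    for s in range(S - 1):
        for r in range(S - 1):
            w = gamma[:, s] * ((s == r) - gamma[:, r])
            H[s * p:(s + 1) * p, r * p:(r + 1) * p] = (X * w[:, None]).T @ X
    # beta_new = beta_old + (negative Hessian)^{-1} * score, one step only.
    step = np.linalg.solve(H, score).reshape(S - 1, p)
    return beta + step
```

Executing only this single step per M-step keeps the overall algorithm a generalized EM, which is enough to increase the expected complete-data loglikelihood at each iteration.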
5.3 Recursive Formula

Although direct maximization of the observed-data loglikelihood defined in (5.1) is complicated, ML estimates could easily be calculated if the latent memberships (i.e., $\mathbf{c}$ and $s$) were known. In the E-step, the conditional probabilities of class and profile memberships for the $i$th individual are calculated under the provisional parameter estimates from the previous iteration. By Bayes' theorem, these conditional probabilities are given by

\[
\theta_{i(s,c_1,\ldots,c_T)} = P(U = s, \mathbf{C} = \mathbf{c} \mid \mathbf{y}_i, \mathbf{x}_i)
= \frac{L^*_i}{\sum_{s=1}^{S}\sum_{c_1=1}^{C}\cdots\sum_{c_T=1}^{C} L^*_i},
\tag{5.5}
\]

where $L^*_i$ is defined in (2.3). In the E-step, however, the complexity of computation and the memory demand grow as the number of time periods $T$ increases. Instead of computing the joint posterior probabilities given in (5.5), we apply a recursive formula to the E-step by adopting the forward-backward algorithm (Chib, 1996; MacKay, 1997). Let $\alpha$ and $\lambda$ represent the forward and backward probabilities, respectively:

\[
\alpha_{it}(c_t, s) = P(\mathbf{Y}_1 = \mathbf{y}_{i1}, \ldots, \mathbf{Y}_t = \mathbf{y}_{it}, C_t = c_t \mid U = s, \mathbf{x}_i),
\qquad
\lambda_{it}(c_t, s) = P(\mathbf{Y}_{t+1} = \mathbf{y}_{i(t+1)}, \ldots, \mathbf{Y}_T = \mathbf{y}_{iT} \mid C_t = c_t, U = s, \mathbf{x}_i).
\]

Both the $\alpha$ and $\lambda$ functions can then be computed by the following recursive representations using the first-order Markov structure:

\[
\alpha_{it}(c_t, s) = \Bigl[\sum_{c_{t-1}=1}^{C} \alpha_{i(t-1)}(c_{t-1}, s)\Bigr]\, \eta^{(t)}_{c_t|s} \prod_{m=1}^{M}\prod_{k=1}^{r_m} \bigl[\rho_{mkt|c_t}\bigr]^{I(y_{imt}=k)},
\]
\[
\lambda_{it}(c_t, s) = \sum_{c_{t+1}=1}^{C} \lambda_{i(t+1)}(c_{t+1}, s)\, \eta^{(t+1)}_{c_{t+1}|s} \prod_{m=1}^{M}\prod_{k=1}^{r_m} \bigl[\rho_{mk(t+1)|c_{t+1}}\bigr]^{I(y_{im(t+1)}=k)}.
\]

With the forward and backward functions, the marginalized conditional probability of the latent component membership at time $t$ can be computed by

\[
\theta^{(t)}_{i(s,c_t)} = P(U = s, C_t = c_t \mid \mathbf{y}_i, \mathbf{x}_i)
= \frac{\gamma_s(\mathbf{x}_i)\, \alpha_{it}(c_t, s)\, \lambda_{it}(c_t, s)}
{\sum_{s=1}^{S} \gamma_s(\mathbf{x}_i) \sum_{c_T=1}^{C} \alpha_{iT}(c_T, s)},
\tag{5.6}
\]

for $s = 1, \ldots, S$, $c_t = 1, \ldots, C$, and $t = 1, \ldots, T$.

In the M-step, updated parameter estimates are obtained by maximizing the expected complete-data loglikelihood, treating the latent variables as if they were observed. The contribution of the $i$th individual to the complete-data loglikelihood, $\ell^*_i = \log L^*_i$, can be written as in (5.2). The first sum in (5.2), which relates to the regression coefficients (i.e., the $\beta$-parameters), is the loglikelihood function for the multinomial logit model, except that the unobserved counts for $s$ are replaced by the fractional expectations $\sum_{i=1}^{n} \theta_{is}$. Updated estimates for the regression coefficients can be calculated with the standard Newton-Raphson method for multinomial logistic regression, provided that the computational routines allow fractional responses rather than integer counts. The other model parameters can be obtained by

\[
\hat{\eta}^{(t)}_{c_t|s} = \frac{\sum_{i=1}^{n} \theta^{(t)}_{i(s,c_t)}}{\sum_{i=1}^{n} \theta_{is}},
\qquad
\hat{\rho}_{mkt|c_t} = \frac{\sum_{i \in \mathrm{obs}^{(t)}_m} \theta^{(t)}_{ic_t}\, I(y_{imt} = k) + \sum_{i \in \mathrm{mis}^{(t)}_m} \theta^{(t)}_{ic_t}\, \rho^*_{mkt|c_t}}{\sum_{i=1}^{n} \theta^{(t)}_{ic_t}},
\]

for $m = 1, \ldots, M$, $k = 1, \ldots, r_m$, $s = 1, \ldots, S$, $c_t = 1, \ldots, C$, and $t = 1, \ldots, T$. Here $\mathrm{obs}^{(t)}_m$ denotes the set of individuals who respond to the $m$th item at time $t$, $\mathrm{mis}^{(t)}_m$ denotes the set of individuals who fail to respond to the $m$th item at time $t$, and $\rho^*_{mkt|c_t}$ is the provisional parameter estimate.

5.4 Local Modality

We understand that the local maximum problem could be mitigated by starting from multiple initial values and then keeping the solution that achieves the highest likelihood. In a high-dimensional parameter space, however, the EM algorithm must be run a large number of times, once for each initialization, which becomes computationally intractable.
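To make the recursive E-step of Section 5.3 concrete, the sketch below evaluates the forward and backward functions and the marginal posteriors in (5.6) for a single individual, assuming the measurement products $\prod_m \prod_k \rho_{mkt|c_t}^{I(y_{imt}=k)}$ have already been collected into an array; all names are illustrative.

```python
import numpy as np

def estep_forward_backward(gamma_i, eta, emis_i):
    """Recursive E-step quantities in (5.6) for one individual.

    gamma_i : (S,) profile probabilities gamma_s(x_i)
    eta     : (S, T, C) class probabilities eta^{(t)}_{c|s}
    emis_i  : (T, C) measurement terms prod_m prod_k rho_{mkt|c}^{I(y_imt = k)}
    """
    S, T, C = eta.shape
    alpha = np.zeros((S, T, C))              # forward probabilities alpha_it(c_t, s)
    lam = np.ones((S, T, C))                 # backward probabilities lambda_it(c_t, s)
    for s in range(S):
        alpha[s, 0] = eta[s, 0] * emis_i[0]
        for t in range(1, T):
            alpha[s, t] = alpha[s, t - 1].sum() * eta[s, t] * emis_i[t]
        for t in range(T - 2, -1, -1):
            lam[s, t] = np.sum(lam[s, t + 1] * eta[s, t + 1] * emis_i[t + 1])
    # Marginalized conditional probability of (U = s, C_t = c_t), eq. (5.6).
    denom = np.sum(gamma_i * alpha[:, T - 1, :].sum(axis=1))
    theta = gamma_i[:, None, None] * alpha * lam / denom
    return theta                              # theta[s, t, c] = theta^{(t)}_{i(s,c)}
```

Because only the T-by-C marginal arrays are stored, the memory and computation grow linearly in T rather than as C^T, which is the point of the recursive formulation.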
In order to relax the initialization dependence for the LCPA model, we introduce two reformulated versions of the standard EM algorithm, namely split-and-merge EM (SMEM) (Ueda et al., 2000), deterministic annealing EM (DAEM) (Ueda and Nakano, 1998) and a variational Bayes learning algorithm, referred as and deterministic annealing variational Bayes (DAVB) (Katahiral et al., 2008). 63 The general idea of SMEM aims at increasing the log likelihood value gradually by tactically choosing three components as candidates for split and merge. A number of merge and split candidates are selected, usually 5 candidate sets are recommended for each iteration and then using partial expectation step to expedite the expectation computation. The idea of performing split and merge operations has been successfully applied to the EM in the Gaussian mixture models through considering local Kullback divergence as a split criteria and posterior probabilities as a merge index. Here we roughly sketch the steps for partial expectation step on classes and profiles in LCPA models. Ueda and Nakano (1998) proposed a deterministic annealing version of the EM (DAEM) algorithm to reduce the impact that inappropriate initial values could cause. Even though there is no evidence to prove it is grounded in theory, it has been successfully applied in the realm of local maximum issues and such an annealing framework has proven effective in improving the performance of the standard approaches (Itaya et al., 2004; Park et al., 2005). 5.41 Split-and-Merge EM Let JC = (c1 , c2 , c3 ) denote the split-and-merge candidates where c1 and c2 are chosen by the pre-specified rule to form a new c∗ and c3 is chosen to be divided into two new classes c1 and c2 . The partial update step for newly-generated classes l = c∗ , c1 , c2 are ˜ ˜ ˜ ˜ re-estimated by the following:   αt (l, s)λt (l, s) (t)  ∑ (t)  θ θ = × ∑ i(s,l) i(s,m) l=c∗ ,c˜ ,c˜ αt (l, s)λt (l, s) m∈JC 1 2 64 Similarly, for profiles, we set JS = (s1 , s2 , s3 ) as the split-and-merge candidates where s1 and s2 are merged to form s∗ and s3 is split into two new profiles s1 and s2 . In ˜ ˜ the partial update step, the posterior probability of profile components l = s∗ , s1 , s2 are ˜ ˜ re-estimated by   γl αt (ct , l)λt (ct , l) (t)  ∑ (t)  θ = θ × ∑ i(l,ct ) i(m,ct ) k=s∗ ,s1 ,s2 γk αt (ct , k)λt (ct , k) ˜ ˜ m∈JS SMEM algorithm is expected to be abler to travel low log-likelihood area and solve the problem of initialization dependence only if a clear set of principles for split and merge has been instituted. In the case of LCPA applications, the correspondence between adjacent measurement occasions is inextricably bridged and there is no easy heuristical method to relocate the mixture components in the data space. Generally speaking, SMEM intensifies the extent of the technical difficulties. 5.42 Deterministic Annealing EM Algorithm Ueda and Nakano (1998) proposed a deterministic annealing version of the EM algorithm (DAEM) to find a set of parameter estimates on the global mode of the loglikelihood function. In the DAEM algorithm for LCPA, the goal to maximize the loglikelihood function is reformulated as the problem of maximizing the function F (ω) = ∑n i=1 Fi (ω), where S C C ∑ ∑ ∑ 1 log ··· (L∗ )ω . Fi (ω) = i ω s=1 c1 =1 cT =1 65 Note that the function F (ω) equals to the observed-data loglikelihood ℓ in (5.1) when ω = 1. 
To maximize the function F (ω), we introduce a random density function q(U = s, C = c | yi ) and find the lower bound of Fi (ω) by Jensen’s inequality (Cover and Thomas, 1991): [ ] (L∗ )ω 1∑ i Fi (ω) ≥ q(U = s, C = c | yi ) log ω s,c q(U = s, C = c | yi ) ∑ = q(U = s, C = c | yi ) log L∗ i s,c 1∑ − q(U = s, C = c | yi ) log q(U = s, C = c | yi ). ω s,c (5.7) In the first step of DAEM, we find the optimal choice for q(U = s, C = c | yi ) by taking functional derivatives with respect to q(U = s, C = c | yi ) and setting it as zero under the constraint ∑ s,c q(U = s, C = c | yi ) = 1. Then the optimal choice for q(U = s, C = c | yi , ) would be (L∗ )ω i θi(s,c ,...,c ) = ∑ . ∑ S ∑C 1 T · · · C =1 (L∗ )ω s=1 c1 =1 cT i (5.8) It is worth noting that with values of ω close to zero, the function θ i(s,c ,...,c ) (ω) is 1 T uniformly distributed across all the latent components (i.e., classes and profiles). As the value of ω increases toward one, however, θ i(s,c ,...,c ) (ω) agrees with the conditional 1 T probabilities θi(s,c ,...,c ) given in (5.5). 1 T In the second step of DAEM, the model parameters are updated by maximizing the ∑n modified expected complete-data loglikelihood ℓ∗ (ω) = ∗ i=1 ℓi (ω). The contribution 66 of the ith individual to the ℓ∗ (ω) can be written as i ℓ∗ (ω) = i S ∑ S T C ∑ ∑ ∑ θis (ω) log γs (xi ) + θ (t) (t) (ω) log η i(s,ct ) ct |s s=1 s=1 t=1 ct =1 T C M rm ∑ ∑ (t) ∑ ∑ + θic (ω) I(yimt=k ) log ρmkt|c , t t t=1 ct =1 m=1 k=1 ( )ω (t) where θ is (ω) ∝ θis , θ (ω) ∝ i(s,ct ) ( (5.9) ( ) )ω (t) (t) (t) ω θ , and θ (ω) ∝ θ ict ict . i(s,ct ) These quantities can be calculated by the recursive formula given in (5.6), and the joint conditional probabilities are not necessary in the second step of DAEM. As the standard EM algorithm, updated estimates for the regression coefficients can be calculated with the Newton-Raphson algorithm for the first sum in (5.9). The other model parameters can be obtained by ∑n (t) θ i=1 i(s,ct ) (ω) (t) ∑n η ˆ = ct |s i=1 θis (ω) ∑n ∑ (t) (t) θic (ω)I(yimt = k) + n θic (ω)ρ∗ (t) t (t) t mkt|ct i∈obsm i∈mism ρmkt|c = ˆ ∑n (t) t θic (ω) i=1 t for m = 1, . . . , M , k = 1, . . . , rm , s = 1, . . . , S , ct = 1, . . . , C , and t = 1, . . . , T . We adopt ω =(.001, .01, .1, .2, .3, .4, .48, .58, .69, .83, 1) as an annealing schedule in DAEM. The algorithm is initialized with the value of ω close to zero (i.e., .001) and its converged parameters will be used as the starting values for the next one (i.e., .01). By repeating this procedure until ω reaches one, the DAEM algorithm will find the parameter 67 estimates on the global mode of the loglikelihood function. 5.43 Deterministic Annealing Variational Bayes Algorithm Let Θ denote the vectorized model parameters for the LCPA model. In the deterministic annealing variational Bayes (DAVB) algorithm, we can consider Θ, a set of unknown parameters, as the random quantities and incorporate the prior information for Θ. Let φ(Θ) denote a prior distribution of Θ. Further, let z = (z1 , . . . , zn ) indicate the individuals’ class and profile memberships, where zi is a T + 1 dimensional array for the ith individual such ∑ ∑ ∑ S C C that zi(s,c ,...,c ) ∈ {0, 1} and s=1 c1 =1 · · · cT =1 zi(s,c1 ,...,cT ) = 1. 1 T That is, if individual i belongs to the profile s and the class membership c = (c1 , . . . , cT ) from initial time t = 1 to time T , then zi(s,c ,...,c ) equals 1 and 0 otherwise. 
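Both deterministic annealing algorithms iterate over the same ω schedule, so before completing the DAVB construction it may help to see the outer loop of Section 5.42 spelled out. The sketch below is a simplified DAEM driver; for clarity it enumerates the S·C^T latent configurations directly rather than using the recursive formula, and the callback names and generic structure are assumptions, not the dissertation's implementation.

```python
import numpy as np

OMEGA_SCHEDULE = (.001, .01, .1, .2, .3, .4, .48, .58, .69, .83, 1.0)

def daem(complete_data_loglik, mstep, params, schedule=OMEGA_SCHEDULE,
         max_iter=500, tol=1e-6):
    """Deterministic annealing EM driver over the fixed omega schedule.

    complete_data_loglik(params) : (n, K) array of log L*_i, one column per latent
                                   configuration (profile-by-class-sequence)
    mstep(weights, params)       : returns updated params from the tempered
                                   responsibilities theta(omega), an (n, K) array
    """
    for omega in schedule:
        prev = -np.inf
        for _ in range(max_iter):
            logL = complete_data_loglik(params)
            # Tempered E-step: theta(omega) proportional to (L*_i)^omega, cf. (5.8).
            scaled = omega * logL
            m = scaled.max(axis=1, keepdims=True)
            weights = np.exp(scaled - m)
            norm = weights.sum(axis=1, keepdims=True)
            weights /= norm
            # Annealed objective F(omega); at omega = 1 it is the loglikelihood (5.1).
            objective = float(np.sum(m + np.log(norm)) / omega)
            # M-step on the tempered expected complete-data loglikelihood, cf. (5.9).
            params = mstep(weights, params)
            if objective - prev < tol:
                break
            prev = objective
        # The converged parameters at this omega seed the next, larger omega.
    return params
```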
1 T Constructed with the similar logic of the DAEM, the goal of DAVB to approximate the distribution over latent variables and model parameters can be rephrased by the problem of maximizing F (ω) = ∑n i=1 Fi (ω), where S C C ∑ ∑ ∑ ∫ 1 log ··· P (Y = yi , Z = zi , Θ)ω dΘ. Fi (ω) = ω s=1 c1 =1 cT =1 By introducing a random distribution q(Z = zi , Θ), Fi (ω) can be lower bounded by 68 Jensen’s inequality (Cover and Thomas, 1991): Fi (ω) ≥ S C ∑ ∑ ··· C ∑ ∫ q(Z = zi , Θ) log P (Y = yi , Z = zi , Θ)dΘ s=1 c1 =1 cT =1 S C C ∑ ∫ 1 ∑ ∑ − ··· q(Z = zi , Θ) log q(Z = zi , Θ)dΘ. ω s=1 c1 =1 cT =1 If q(Z = zi , Θ) has a factored form (i.e., q(Z = zi , Θ) = Q(Z = zi )r(Θ)), the function Fi (ω) is then lower bounded as Fi (ω) ≥ S C ∑ ∑ ··· C ∑ s=1 c1 =1 cT =1 ∫ Q(Z = zi )r(Θ) log P (Y = yi , Z = zi | Θ)φ(Θ)dΘ   S C C ∑ 1 ∑ ∑ ··· Q(Z = zi ) log Q(Z = zi ) (5.10) − ω s=1 c =1 cT =1 1 } ∫ + r(Θ) log r(Θ)dΘ . In the first step of DAVB, we iteratively maximizes the lower bound of Fi (ω) with respective to Q(Z = zi ) by taking functional derivative and setting it equal to zero under the constraint ∑S ∑C ∑ · · · C =1 Q(Z = zi ) = 1. The optimal choice for s=1 c1 =1 cT Q(Z = zi ) would be θi(s,c ,...,c ) (ω), which is the optimal choice for q(U = s, C = 1 T c | yi ) given in (5.8). In the second step, the lower bound of Fi (ω) is maximized with respective to r(Θ) and the model parameters are updated based on the variational posteriors. We consider 69 multivariate normal distribution Np×(S−1) (0, I) as a prior for the β -parameters. At each cycle of Gibbs sampler, the vectorized coefficients β are updated by Metropolis algorithm (Robert and Casella, 2004). A candidate for the next coefficient vector β c is generated ˆ ˆ from Np×(S−1) (β, δΣ) at each iteration, where β is the generated sample from the previous iteration. In this paper, we adjust the value of δ in order to control the acceptance rate within a recommended range from .18 to .30 (Gelman et al., 1997). The variance Σ is the negative inverse of the β -submatrix in the Hessian from the complete-data loglikelihood ℓ∗ (ω) evaluated at the DAEM estimates. The candidate vector for β c is accepted with ˆ probability α(β, β c ) = min (1, exp (ωA)), where n S ) ∑ ∑ 1( 2 ˆ2 + ˆ θis (ω)x′ (β c − β s ) A = − |β| − |β| i s 2 i=1 s=1      n S S  ∑ ∑ ∑ ′ β c ) − log  ′ β ) . ˆ  − log exp(xi s exp(xi s   s=1 s=1 i=1 (t) (t) (t) = (η , . . . , η ) 1|s C|s (t) and ρmt|c = (ρm1t|c , . . . , ρmr t|c ), new parameters can be drawn from η s ∼ m (t) (t) Dirichlet(τ ,...,τ ) and ρmt|c ∼ Dirichlet(νm1t|c , . . . , νmr t|c ), where m 1|s C|s Applying the Jeffreys’ priors to the measurement parameters η s  n ∑ (t) 1 (t) τ = ω θ (ω) −  + 1 c|s i(s,c) 2 i=1     νmkt|c = ω   ∑ (t) I(yimt = k)θic (ω) + (t) i∈obsm  ∑ (t) i∈mism 70 1 (t)  ∗ θic (ω)ρmkt|c −  + 1 2 for c = 1, . . . , C , s = 1, . . . , S , t = 1, . . . , T , k = 1, . . . , rm , and m = 1, . . . , M . In DAVB, we adopt the same annealing schedule and proceed with the same procedure as in DAEM. 71 Chapter 6 Model Diagnosis For the inferences drawn from a model being meaningful, we need to establish a correctly specified model and identifiability is fundamental for having valid parameter estimates. According to the definition by Goodman (1974), the parameters of an LC model without covariates are said to be locally identifiable at a particular value θ ∗ if for some open neighborhood of it, the loglikelihood function has a unique optimum value at θ ∗ . 
This definition extends naturally to LCPA models. The number of response patterns, that is, particular combinations of responses to the manifest items $Y_{1t}, \ldots, Y_{Mt}$ for all $t = 1, \ldots, T$, is $(\prod_{m=1}^{M} r_m)^{T}$. A saturated model for $Y_{1t}, \ldots, Y_{Mt}$ for all $t = 1, \ldots, T$ would therefore have $(\prod_{m=1}^{M} r_m)^{T} - 1$ nonredundant parameters, because the probabilities for the response patterns must sum to 1. A necessary but not sufficient condition for the LCPA model to be locally identifiable is that the number of nonredundant parameters in the saturated model must be greater than or equal to the number of free parameters in $\gamma$, $\eta$, and $\rho$ (Goodman, 1974; Clogg and Goodman, 1984). The saturated model has $(\prod_{m=1}^{M} r_m)^{T} - 1$ nonredundant parameters, and the LCPA model has $S - 1 + ST(C - 1) + C(\sum_{m=1}^{M} r_m - M)$ free parameters when we assume the number of classes is fixed at $C$ over time. When

\[
\Bigl(\prod_{m=1}^{M} r_m\Bigr)^{T} - 1 \;>\; S - 1 + ST(C - 1) + C\Bigl(\sum_{m=1}^{M} r_m - M\Bigr),
\tag{6.1}
\]

the LCPA model might be identifiable, but not necessarily so. Traditionally, identifiability is diagnosed by examining the Hessian matrix evaluated at the ML estimates $\theta^*$. If it has full rank, that is, if its inverse exists, the parameter estimates are at least locally identifiable. In practice, however, it is difficult to distinguish between situations where the Hessian is nearly singular and where it is exactly singular, because of the imprecision of floating-point computations. As pointed out by Formann (2003), in situations where some estimated parameters lie on the boundary (i.e., estimates are close to a boundary value), it is better to fix those values in order to make the remaining parameters identifiable. In fact, a covariance matrix with large variances indicates boundary problems. The Hessian matrix for an LCPA model with covariates may be of large dimension, and computing the derivatives therefore becomes unappealing. As shown in (2.8), the marginalization implies that an LCPA with logistic regression is locally identifiable if the following three conditions are satisfied (Bandeen-Roche et al., 1997):

1. an LCPA marginalized over covariates is locally identifiable;
2. the design matrix $(x_1, \ldots, x_n)'$ has full column rank;
3. at least one individual has nonzero $\gamma_s(x_i)$ for all $s = 1, 2, \ldots, S$.

Some new monitoring methods for identifiability have been proposed. For example, Kim and Lindsay (2009) define a new concept, the degree of identifiability of the parameters in the likelihood. This data-dependent identifiability is measured through modal regions, that is, the maximal connected subsets containing the ML estimate $\theta^*$, determined by a significance level $\alpha$. The model is said to be identifiable at confidence level $100(1 - \alpha)\%$ at $\theta^*$ if the modal region containing $\theta^*$ is disjoint from any of its permuted counterparts. Not surprisingly, as $\alpha$ decreases the modal regions become difficult to label well, because overlapping may occur. The level of identifiability can be quantified by finding the infimum of the values $\alpha$ such that the modal region is well separated. In addition, Yao and Lindsay (2009) developed marginal discriminant plots to measure the degree of modal separation.
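As a quick illustration of the counting rule in (6.1), the sketch below compares the number of nonredundant cells in the saturated model with the number of free LCPA parameters. The example values (four classes, three profiles, and items with 3, 2, and 2 categories over three waves) merely echo the setting of the next chapter for illustration; the function name is hypothetical.

```python
def lcpa_identifiability_check(C, S, T, r):
    """Necessary (but not sufficient) condition (6.1) for local identifiability.

    C : number of classes, assumed constant over time
    S : number of profiles
    T : number of measurement occasions
    r : list of category counts r_m for the M manifest items
    """
    M = len(r)
    cells = 1
    for rm in r:
        cells *= rm
    saturated = cells ** T - 1                            # nonredundant response-pattern cells
    free = (S - 1) + S * T * (C - 1) + C * (sum(r) - M)   # free gamma, eta and rho parameters
    return saturated, free, saturated > free

# Illustration: four classes and three profiles, items with 3, 2, and 2 categories, T = 3.
print(lcpa_identifiability_check(C=4, S=3, T=3, r=[3, 2, 2]))
# -> (1727, 45, True): the necessary condition holds, so the model *may* be identifiable.
```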
74 Chapter 7 Data Analysis on National Longitudinal Survey of Youth 1997 (NLSY97) 7.1 Data The proposed model selection (i.e., Dirichlet process) and parameter estimation (i.e., deterministic annealing) methods are applied to the drinking items drawn from the National Longitudinal Survey of Youth 1997 (NLSY97), a survey that explores the transition from school to work and from adolescence to adulthood in the United State. Since the detailed information on the data appears in Chung et al. (2011), we briefly describe the data structure here. Complete data for the analysis were available on 1416 adolescents who were identified as the early onset drinkers. The early onset drinkers under this study had started drink by the time when they were 14 years old, at least 7 years before the legal drinking age. There 75 are three self-report items measuring adolescent drinking behaviors: how many days they had one or more drinks of an alcoholic beverage during the last 30 days (Recent Drinking); how many days they had five or more drinks on the same occasion during the last 30 days (Binge Drinking); and how many days they had drinks immediately before or during school or work hours in the last 30 days (Drinking at School). The responses for Recent Drinking were reduced to a three-category indicator, non-drinker (0 days of drinking), occasional drinker (1-5 days of drinking) and regular drinker (6 or more days of drinking). For Binge Drinking, respondents who had consumed five or more drinks on the same occasion at least one time were characterized as binge drinkers. The same rule was applied for Drinking at School. These three drinking items were tracked over the three survey waves in 1997 (Wave 1), 2000 (Wave 4), and 2003 (Wave 7), corresponding to early adolescence (ages 12–14), middle adolescence (ages 15–17), and late adolescence (ages 18–20), respectively. In addition to these three items, we consider gender and race as covariates in the model. 7.11 Information Criteria versus Dirichlet Process Under the data set described above, Chung et al. (2011) started by fitting a series of twoclass LCPA models where the pathways of the class membership were mapped onto between two and four profiles. This procedure was repeated until it reached a model with six classes and six profiles. Table 7.1 shows a series of LCPA models with evaluations based on the bootstrap p-value for goodness of fit and AIC. The four-class-three-profile LCPA model and five-class-three-profile LCPA models are favored in terms of AIC, bootstrap p-value and the principle of parsimony. However, due to the unclear interpretation of the classes in the five-class-three-profile LCPA model, fourclass-three-profile LCPA model is selected. Even though the traditional model selectors work 76 Table 7.1: Goodness-of-fit statistics for a series of LCPA models under various numbers of classes and profiles Number of classes 2 3 4 5 6 Number of profiles 2 3 4 2 3 4 5 2 3 4 5 2 3 4 5 2 3 4 5 6 LRT 812.81 812.81 812.81 492.84 452.99 447.16 445.64 449.80 404.58 392.61 387.90 429.69 375.52 357.65 347.64 421.65 367.31 338.81 325.88 306.07 77 Bootstrapping p-value 0.000 0.000 0.000 0.000 0.025 0.015 0.020 0.005 0.305 0.440 0.315 0.025 0.375 0.585 0.595 0.015 0.410 0.735 0.810 0.795 AIC 12103 12111 12119 11803 11777 11785 11798 11780 11755 11763 11778 11780 11752 11760 11776 11792 11770 11773 11792 11804 as convenient means for comparison of each competing model, the process would be a lengthy undertaking. 
Since dynamic Dirichlet learning process has been proved to work well with less computation demands, we are interested in implementing Dirichlet process to see if it comes to the same conclusions without comparing each model one at a time. Fig. 7.1 (a) - (c) shows scatter diagrams derived by applying DDPM and LOWESS algorithm proposed by Cleveland (1979) with two smoothing curves. LOWESS is a locally weighted scatterplot smoothing used here to summarize the appropriate number of classes over time. The smoothing span gives the proportion of points in the plot that influence the smoothness at each value and larger spans give smoother lines. Each plot is smoothed with two smoother spans (.1 and .5) to locally fit the diagram. It is obvious to see all the wavy lines are around four, which makes the preferences fairly self-explanatory to be four classes over three time measurements. The result is in line with the histogram exploration as shown in Fig. 7.1 (d) where a four-class-over-three-time model has dominant frequency. Chung et al. (2011) selected a four-class LCPA model based on the Table 7.1 and it seems Dirichlet process has equal learning capability of selecting suitable number of classes. Based on the class transiting routes, we implement DPM to learn the number of profiles. Fig. 7.1 (d) indicates models with three or four profiles are favored by the DPM algorithm because of their higher occurrences, 27% and 29%, respectively. Chung et al. (2011) conclude that three-profile model fit better based on goodness-of-fit index and according to the result DPM presents, we have consistent conclusions. Since DDPM and DPM lead to successful applications without requiring prior knowledge on the unknown system, we believe Dirichlet process can effectively constructs the model by integrating the sequential information. 78 7.12 Parameter Estimates Given the picked four-class-three-profile LCPA model, Chung et al. (2011) used 100 different sets of initial parameters to avoid local optimum entrapment. They selected a set of estimates providing the highest loglikelihood value among those from 100 different initializations by using standard EM algorithm. We will focus on the same model structure to compare the performances of two deterministic annealing approaches in finding the global maximum estimates with the pre-determined annealing schedule. Note that we consider the LCPA model where the ρ-parameters are constrained to be equal across time points and both deterministic annealing update the marginalized conditional probability of the latent component membership recursively by using forward and backward functions. To ensure the comparability of the standard EM algorithm and two deterministic annealing approaches, we implement DAEM with 100 initializations in order to better understand how different starting values induce variations in the loglikelihood values. For the standard EM algorithm, the distribution of the converged loglikelihood values corresponding to 100 sets of initializations are presented in Figure 7.2 (a). Exploring the histogram, we can see that EM solutions are trapped in four local modes based on their initializations. The largest loglikelihood value is −5726.28 presented in the last bar in Figure 7.2 (a), but the most frequent loglikelihood values are observed in the range of −5740 and −5739. On the contrary, Figure 7.2 (b) illustrates that DAEM reaches a single point convergence (loglikelihood value is −5727.14) irrespective of the initial values. 
Although DAEM does not achieve the optimum value given by the standard EM algorithm, the difference in the loglikelihood values is negligible based on the distributions presented in Figure 7.2. In addition, only 29% of the initial sets are converged to the global maximum with the standard EM algorithm. 79 Therefore, the DAEM algorithm with recursive formula can be considered as a robust tool to avoid many local entrapments and reduce the computation cost. For DAVB, we also apply 100 different initializations and calculate the standard deviation for each of the model parameters from the 100 sets of DAVB estimates. Generally speaking, DAVB exhibits the work of satisfaction because the extent to which the variations are incurred due to the numerous initializations can be regarded insignificant. Here we briefly sketch the performance of DAVB in terms of their standard deviations and related concerns. The standard deviation of the ρ-parameter estimates derived from DAVB with 100 sets of initializations are presented in Table 7.2. The values under the Recent Drinking column provide the standard deviations of the estimates for probabilities of having reported occasional drinking (one to five days of drinking in the last 30 days) and regular drinking (six or more days of drinking in the last 30 days) for a given class membership. The other two columns show the standard deviations of the estimates for the probabilities of having consumed five or more drinks on the same occasion at least one time in the last 30 days for Binge Drinking and consumed alcoholic beverage right before or during school or work hours at least once in the last 30 days for Drinking at School. Table 7.2 indicates that Class 3 seems more likely to produce largest deviations among the identified classes. We believe the cause is more related to the small prevalence of Class 3: the prevalence of Class 3 is 7.5%, 7.0%, and 0.2% in 1997, 2000, and 2003, respectively, and the average prevalence over time is 4.9%. The secondary measurement parameters (i.e., η -parameters) identify common sequential patterns of drinking behaviors. The standard deviation of the η -parameter estimates from the 100 sets of DAVB estimates are presented in Table 7.3. The η -parameter estimates from DAVB are shown to be fairly stable in terms of insignificant degree of spread. We can see 80 Table 7.2: Standard deviation of the ρ-parameter estimates derived from DAVB with 100 different sets of starting values Recent Drinking Binge Drinking Class Occasional Regular Drinking at School 1 0.024 0.000 0.000 0.000 2 0.001 0.009 0.026 0.022 3 0.003 0.062 0.078 0.084 4 0.001 0.022 0.010 0.017 Table 7.3: Standard deviation of the η -parameter estimates derived from DAVB with 100 different sets of starting values Year Profile Class 1997 2000 2003 1 1 0.025 0.031 0.027 2 0.024 0.039 0.030 3 0.018 0.030 0.020 4 0.006 0.008 0.026 2 1 2 3 4 0.043 0.040 0.019 0.012 0.044 0.056 0.026 0.035 0.040 0.068 0.016 0.061 3 1 2 3 4 0.030 0.039 0.040 0.020 0.030 0.045 0.051 0.043 0.024 0.018 0.010 0.034 that the estimates corresponding to Profile 2 have more fluctuations comparing to those of other profiles. Although estimates regarding to some classes or profiles might not have ideally small deviations, DAVB can still be considered as a more consistent implementation than the traditional multi-start methods because all the standard deviations are below an acceptable threshold. 
The standard deviations of the β-parameter estimates from the 100 sets of DAVB estimates are presented in Table 7.4. As inspected, the standard deviations for Profile 2 relative to the Profile 1 baseline are slightly larger than those for Profile 3, but none of them reveal serious instability among the estimates resulting from the 100 sets of starting values with DAVB. Profile 2 tends to produce larger standard deviations, as previously shown in Table 7.3, and we believe the small prevalence of Profile 2 affects the results to some degree.

Table 7.4: Standard deviation of β-parameter estimates derived from DAVB with 100 different sets of starting values (Profile 1 is the baseline)

Covariate            Category     Profile 2   Profile 3
Gender versus Male   Female       0.122       0.101
Race versus White    Black        0.242       0.163
                     Hispanic     0.260       0.204
                     Other race   0.272       0.283

The estimated primary measurement parameters (i.e., ρ-parameters) obtained with the three estimation algorithms are presented in Table 7.5. The values under the Recent Drinking columns provide the estimated probabilities of having reported occasional drinking (one to five days of drinking in the last 30 days) and regular drinking (six or more days of drinking in the last 30 days) for a given class membership. The other two columns show the probabilities of having consumed five or more drinks on the same occasion at least once in the last 30 days (Binge Drinking) and of having consumed an alcoholic beverage right before or during school or work hours at least once in the last 30 days (Drinking at School).

Table 7.5: Estimated probabilities of responding 'any use' to the drinking items for each class (ρ-parameters)

Method   Class   Recent Drinking (Occasional)   Recent Drinking (Regular)   Binge Drinking   Drinking at School
EM       1       0.001                          0.002                       0.000            0.000
         2       0.960                          0.040                       0.303            0.119
         3       0.675                          0.325                       0.848            0.606
         4       0.280                          0.720                       0.960            0.173
DAEM     1       0.102                          0.000                       0.000            0.000
         2       0.929                          0.071                       0.378            0.056
         3       0.899                          0.101                       0.546            0.396
         4       0.260                          0.740                       0.965            0.205
DAVB     1       0.096                          0.000                       0.000            0.000
         2       0.922                          0.076                       0.407            0.044
         3       0.894                          0.105                       0.541            0.469
         4       0.238                          0.761                       0.967            0.207

We can see that all three items combined support a meaningful interpretation for each class. An inspection of these values leads to the adoption of common class names across estimation methods. Adolescents in Class 1 have not been involved in any drinking in the previous 30 days; therefore we label Class 1 as 'non-current drinkers.' Class 2 represents adolescents who drink occasionally but have no history of regular drinking or drinking at work or school; we accordingly identify adolescents classified in Class 2 as 'light drinkers.' On the same grounds, Class 4 is labeled 'regular binge drinkers,' as these adolescents both drink regularly and engage in binge drinking. The estimates for Class 3, however, show the largest deviations between EM and the two deterministic annealing methods, leading to difficulty in labeling Class 3 with a common class name across estimation algorithms. The maximal differences in the estimates between EM and the two annealing methods are .302 and .307 in Binge Drinking for DAEM and DAVB, respectively. We believe the cause is related to the small prevalence of Class 3: using the EM algorithm, the prevalence of Class 3 is 7.5%, 7.0%, and 0.2% in 1997, 2000, and 2003, respectively, and the average prevalence over time is 4.9%. The other two deterministic alternatives produce class prevalences similar to those of the EM algorithm.
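The time-specific class prevalences quoted above can be obtained, for example, by averaging each respondent's posterior class-membership probabilities at a given occasion; the sketch below illustrates that computation and is an assumption about the bookkeeping rather than a description of the exact routine used in the study.

```python
import numpy as np

# post: array of shape (n_subjects, n_times, n_classes) holding
# P(C_it = c | Y_i = y_i) for each respondent, occasion, and class
def class_prevalence(post):
    by_time = post.mean(axis=0)        # (n_times, n_classes): prevalence per occasion
    over_time = by_time.mean(axis=0)   # (n_classes,): average prevalence across occasions
    return by_time, over_time

# e.g., for Class 3 the per-year values would correspond to 7.5%, 7.0%, and 0.2%,
# giving an average of roughly 4.9% as reported above.
```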
Although the three parameter estimation methods produce distinct estimates for Class 3, differences in the estimates of such a 'small' class may not affect the description of the major classes or the loglikelihood value.

The secondary measurement parameters (i.e., η-parameters) identify common sequential patterns of drinking behaviors. The estimated η-parameters obtained with the three estimation methods are presented in Table 7.6.

Table 7.6: Estimated probabilities of belonging to a class sequence for each profile (η-parameters) and estimated profile prevalence (γ-parameters)

Method   Profile (prevalence, %)   Class   1997    2000    2003
EM       1 (45.0)                  1       0.694   0.746   0.626
                                   2       0.250   0.219   0.348
                                   3       0.055   0.034   0.004
                                   4       0.001   0.000   0.023
         2 (18.6)                  1       0.538   0.058   0.100
                                   2       0.432   0.766   0.610
                                   3       0.000   0.000   0.000
                                   4       0.031   0.175   0.290
         3 (36.4)                  1       0.614   0.252   0.133
                                   2       0.227   0.122   0.015
                                   3       0.139   0.150   0.000
                                   4       0.020   0.476   0.853
DAEM     1 (46.1)                  1       0.782   0.806   0.688
                                   2       0.046   0.027   0.203
                                   3       0.163   0.167   0.078
                                   4       0.009   0.000   0.031
         2 (19.0)                  1       0.610   0.112   0.121
                                   2       0.239   0.766   0.666
                                   3       0.130   0.000   0.001
                                   4       0.021   0.122   0.212
         3 (34.9)                  1       0.672   0.253   0.135
                                   2       0.019   0.034   0.002
                                   3       0.255   0.158   0.001
                                   4       0.054   0.555   0.862
DAVB     1 (47.0)                  1       0.785   0.780   0.688
                                   2       0.013   0.052   0.207
                                   3       0.189   0.165   0.065
                                   4       0.013   0.003   0.040
         2 (18.4)                  1       0.695   0.159   0.094
                                   2       0.102   0.711   0.749
                                   3       0.186   0.030   0.014
                                   4       0.017   0.100   0.143
         3 (34.6)                  1       0.665   0.284   0.139
                                   2       0.033   0.050   0.015
                                   3       0.258   0.136   0.001
                                   4       0.044   0.530   0.845

As shown in Table 7.6, in Profile 1 the probabilities of belonging to Class 1, the class of 'non-current drinkers,' are consistently higher than the probabilities of belonging to the other classes (see the first rows in Table 7.6 for all estimation methods), implying that adolescents in this profile are likely to remain 'non-current drinkers' over time. The LCPA identified another two profiles of adolescents who tended to intensify their drinking habits over time. Adolescents in Profile 2 were more likely to belong to Class 1 (non-current drinkers) in 1997 and to advance to Class 2 (light drinkers) in 2000. By 2003, some of them had advanced further to become members of Class 4 (regular binge drinkers), although many others remained in Class 2 (light drinkers). Profile 3 consists of early drinkers who moved toward Class 4 (regular binge drinkers) by 2000 and stayed in that class through 2003. The three parameter estimation methods yield similar estimates for the η-parameters, so the identified profiles can be labeled consistently across the three methods. The most prevalent profile is Profile 1 for all estimation methods. The largest differences between EM and the two annealing methods are .208 for DAEM in Profile 3, Class 2, in 1997 and .330 for DAVB in Profile 2, Class 2, in 1997.

Point estimates for the logistic regression coefficients (β-parameters) pertaining to each covariate are reported in Table 7.7. Broadly speaking, whites are more likely to be in the profiles involving intensified drinking habits (i.e., Profiles 2 and 3) than in Profile 1 (non-drinking stayers). Interestingly, the odds of belonging to Profile 2 (light drinking advancers) versus Profile 1 are higher for females, whereas the odds of belonging to Profile 3 (regular binge drinking advancers) are higher for males than for their counterparts. The three parameter estimation methods come to the same conclusions in interpreting the logistic coefficients even though the estimated values are not identical.
Table 7.7: Estimated logistic regression coefficients (β-parameters) for the prevalence of profiles (Profile 1, non-drinking stayers, is the baseline)

Method   Covariate                        Profile 2   Profile 3
EM       Intercept                        −1.144       0.511
         Female versus male                1.244      −1.011
         Race versus White: Black         −1.640      −1.510
         Race versus White: Hispanic      −1.600      −0.198
         Race versus White: Other race    −1.133      −0.481
DAEM     Intercept                        −1.161       0.410
         Female versus male                1.349      −0.882
         Race versus White: Black         −1.884      −1.546
         Race versus White: Hispanic      −1.897      −0.240
         Race versus White: Other race    −0.877      −0.459
DAVB     Intercept                        −1.146       0.415
         Female versus male                1.197      −0.823
         Race versus White: Black         −1.457      −1.523
         Race versus White: Hispanic      −1.217      −0.261
         Race versus White: Other race    −0.473      −0.301

7.12.1 Discussion

In the standard EM algorithm, the computation of the joint posterior distribution that involves all time measurements is fairly demanding. Instead of computing the joint posterior probabilities given in (5.5), the recursive formula adopts the forward-backward algorithm to calculate the marginal posterior probabilities given in (5.6) directly. The recursive formula enables us to predict latent membership faster than the standard EM algorithm because the forward and backward quantities can be easily calculated.

In this study, we apply the two annealing methods, DAEM and DAVB, to estimate the model parameters lying on the global maximum of the objective functions for the LCPA models, using the recursive formula at each iteration. Deterministic annealing is an optimization technique aiming to find a global maximum by gradually increasing the value of ω. However, the performance of the annealing methods is contingent upon the choice of annealing schedule. Geman and Geman (1984) have shown that, if the annealing schedule follows ω ∝ log i, where i is the current iteration number, the global solution is theoretically achievable, even though such a schedule might not be realistic in practice. Chang and Lin (2002) proposed a method to adaptively learn the annealing parameter based on the results of the previous iteration; however, it requires matrix calculations that limit it to special cases. In this study, we adopt a very tight annealing schedule, but more research is required to select an appropriate annealing schedule for the LCPA models. Nevertheless, DAEM and DAVB have their own strengths and are therefore attractive alternatives to the standard EM algorithm for discovering an optimal solution.

Roughly speaking, when ω is close to zero, the corresponding θ_{i(s, c_1, ..., c_T)} for both DAEM and DAVB generates random moves with uniform weights and picks the class and profile memberships arbitrarily. As the annealing parameter is gradually adjusted, the localized influences become more discriminating for classes and profiles and enable the deterministic annealing approaches to avoid arbitrary local maxima and to search in the right direction. The DAEM algorithm is more convenient to operate and typically converges more quickly than DAVB. However, when the annealing rate is too sparse or the prevalences of the latent components are not large enough, DAEM may be entrapped in a local mode of the likelihood function. On the contrary, the DAVB algorithm is less influenced by the annealing schedule or the size of the latent components, but it may exhibit slow convergence in the search for the global maximum; therefore, more computational burden should be expected as the trade-off.
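To make the role of ω concrete, the sketch below shows one way a tempered E-step and a pre-determined schedule could be written; the geometric schedule, the array layout, and the helper names are illustrative assumptions, not the exact implementation used in this study.

```python
import numpy as np

def tempered_responsibilities(log_joint, omega):
    """DAEM-style E-step: raise the joint density of each latent configuration
    to the power omega before normalizing.  log_joint has one row per subject
    and one column per latent (profile, class-sequence) configuration."""
    z = omega * log_joint
    z -= z.max(axis=1, keepdims=True)          # stabilize the exponentials
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)    # omega -> 0: near-uniform weights
                                               # omega = 1: ordinary posterior

def annealing_schedule(omega0=0.1, rate=1.2, n_steps=30):
    """Illustrative 'tight' geometric schedule, capped at omega = 1."""
    return [min(1.0, omega0 * rate**k) for k in range(n_steps)]
```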
The most troubling aspect of DAVB is deciding on a well-functioning proposal distribution for its MCMC step. In the beginning, we applied a Metropolis algorithm in which a candidate for the next logistic coefficient vector was sampled from a multivariate t distribution with ν degrees of freedom, β^c ∼ t_ν(β̂, δΣ), where δ was a scalar value used to control the acceptance rate. Using the t distribution, however, the acceptance rate drops as ω approaches one, which causes slow mixing in the Markov chain. Although decreasing the variation in the proposal distribution by setting δ to a small value may alleviate the problem, it does not make sense to sample candidate values only from the regions close to the mean β̂. Therefore, we substituted the t distribution with a multivariate normal to expedite the mixing by keeping many undesired extreme values from being selected.

Although many alternatives to the standard estimation algorithm could be pursued in light of the LCPA model, we mentioned only a few in this presentation. Note that our exploration was intended to demonstrate a possible solution to the difficulties in finding a set of estimates on the global maximum of the objective function. We provided a limited demonstration using real data to elucidate the estimation methods. Our hope is that substantive researchers will be able to identify possible difficulties in estimation for the LCPA model and consider using the proposed solutions in their research.

Figure 7.1: (a) - (c) Tracking plots for classes smoothed by LOWESS with two smoother spans (red: .67 and green: .15) from time 1 to time 3; (d) histogram of class progression (yellow bars) and the number of profiles (blue bars).

Figure 7.2: Histogram of loglikelihood values derived from (a) the standard EM and (b) the DAEM algorithms with 100 different sets of starting values.

Chapter 8
Discussion

8.1 Contributions

The latent stage-sequential process is an attractive tool for many areas of substantive research. We have designed several computational algorithms in order to develop practical understanding and concrete advice for those who may apply the LCPA model to their own data. The unique contributions of this thesis are summarized as follows.

Determining the number of latent components such as classes and profiles can be difficult, especially when prior knowledge is not readily available. We adopt two Bayesian approaches to learn the model without presuming the number of latent components in advance. The RJMCMC turns out to be less computationally feasible because there is no general way to design best-suited jumping rules: when the proposed moves tend to land in regions of low probability density, rapid mixing toward the stationary distribution cannot be achieved, and designing an appropriate jumping principle for dimension changes is not easy. On the contrary, the Dirichlet process is easily carried out to predict the underlying sequential structure thanks to its non-parametric nature. By relaxing the constraints on the number of classes placed at each measurement occasion, the Dirichlet process helps to enhance the visibility of future patterns. However, in the profile-learning phase, the Dirichlet process has difficulty delivering satisfactory results because weak measurements make competing models indistinguishable.
To estimate the unknown parameters, a widely known technique is hill-climbing optimization such as the EM algorithm. This gradient-type optimization searches the local neighborhood of the initial values, but the solution it reaches may be very poor compared with the global optimum. In this study, we first compute the forward probability of being in a certain state (i.e., class) at time point t and the backward probability of having a specific type of future observations after t given the current state. Both are advantageously represented in recursive forms to expedite each iteration. The forward-backward algorithm proceeds on the basis of parameter updates, but the initialization dependence problem remains unsolved. There is a plethora of literature aiming at finding numerically robust parameter estimation tools to deal with local maximum problems (Biernacki et al., 2003; Reddy and Rajaratnam, 2010). For example, Reddy and Rajaratnam (2010) convolve the objective function with kernel functions to flatten the surface and reduce the number of local modes; they demonstrate that optimal solutions can be reached effectively by adjusting the smoothing factor at each iteration. Instead, we implement deterministic annealing EM and a deterministic annealing variant of variational Bayes in order to find parameter estimates on the global mode of the objective function. The two deterministic annealing approaches are presented here to overcome the local maximum issues. Both DAEM and DAVB can be viewed as generic gradient methods that start from a unimodal, well-shaped function of the unknown parameters; by gradually increasing the annealing parameter, the distribution becomes more discriminating, and a good approximation to the global optimum can be located. While we have demonstrated the effectiveness of the deterministic annealing algorithms in combating the local modality problems, a global optimum is not always guaranteed. The most important reason is that there is no universally accepted annealing schedule; one must be very careful when adjusting the increments. Generally, the schedule should be determined so that the annealing rate is neither too passive nor too aggressive.

8.2 Direction for Future Research

In the model selection problem, we adopt reversible jump MCMC because it is the most commonly used MCMC tool for exploring statistical models of varying dimension. We have already discussed the difficulties in proposing efficient jumping rules, especially for the complex LCPA models. Fahimah (2004) suggested using a secondary Markov chain (adding a few fixed-dimension MCMC steps) to modify proposed moves before calculating the acceptance rates. They showed that the acceptance probabilities rise even when the proposals are poorly matched to the true target distribution, but increased programming costs are expected as a trade-off. We present the dynamic Dirichlet learning process analysis of model selection problems without any covariates to predict the prevalence of the profiles; however, the predictor-dependent kernel stick-breaking process has already attracted interest. It has been used for choosing priors for an unknown probability measure (Dunson and Park, 2008) and for variable selection problems (Chung and Dunson, 2009). Adding predictors to the prior gives different insights into how the profiles are formed under the influence of predictors, and the resulting construction is understood as a dependent Dirichlet process; all of these priors build on the stick-breaking representation sketched below.
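For reference, the following is a minimal sketch of the truncated (Sethuraman-type) stick-breaking weights on which such priors rest; the truncation level K and concentration λ shown here are illustrative choices, not values used in the analysis.

```python
import numpy as np

def stick_breaking_weights(lam, K, rng=None):
    """Draw K mixture weights from a truncated stick-breaking construction:
    v_k ~ Beta(1, lam) and pi_k = v_k * prod_{j<k} (1 - v_j).
    Setting v_K = 1 forces the weights to sum to one (almost-sure truncation)."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, lam, size=K)
    v[-1] = 1.0
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

# usage (illustrative): pi = stick_breaking_weights(lam=1.0, K=20, rng=2011)
```

A predictor-dependent (kernel stick-breaking) version would let the Beta draws, and hence the weights, vary with covariates, which is the extension discussed above.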
Future work should explore model selection for the LCPA regression model.

APPENDICES

Appendix A
Acceptance Rate of C → C + 1

For the split move on latent classes (i.e., C′ = C + 1), the acceptance rate α(C, C′) is min(1, A), where

\[
A = \frac{P(\Theta_{(C',S)} \mid y)}{P(\Theta_{(C,S)} \mid y)}
    \times \frac{q(\Theta_{(C',S)}, \Theta_{(C,S)})}{q(\Theta_{(C,S)}, \Theta_{(C',S)})}
    \times \frac{g^*(u^*)}{g(u)}
    \times \left| \frac{\partial(\Theta_{(S,C')}, u^*)}{\partial(\Theta_{(S,C)}, u)} \right|
\]
\[
\phantom{A} = (\text{likelihood ratio}) \times \frac{\pi_0(\Theta_{(C',S)})}{\pi_0(\Theta_{(C,S)})}
    \times \frac{d_{C+1}}{b_C} \times \frac{1}{P_{\mathrm{alloc}}}
    \times \frac{1}{\prod_{s=1}^{S}\prod_{t=1}^{T} g\big(u_s^{(t)}\big) \times \prod_{m=1}^{M} g(u_m)}
    \times \frac{\prod_{s=1}^{S}\prod_{t=1}^{T} \eta^{(t)}_{c^*|s}\,(1+\bar u)^{(1-\bar u)\kappa M}}{\big[(1-\bar u)/\bar u\big]^{M/2}},
\]

where the likelihood ratio is the ratio of the product of the complete likelihood values for the new parameter sets, when the new classes are formed, to that for the old. The function π0 is the product of the prior distributions for the model parameters, the number of latent classes, and the latent allocation variables. To be specific, writing D_v(δ) to denote a Dirichlet density evaluated at v and parametrized by a vector δ,

\[
\frac{\pi_0(\Theta_{(C',S)})}{\pi_0(\Theta_{(C,S)})}
\propto \prod_{s,t}\left\{ \frac{\big(\eta^{(t)}_{c^*_1|s}\big)^{\delta-1}\big(\eta^{(t)}_{c^*_2|s}\big)^{\delta-1}}
  {\big(\eta^{(t)}_{c^*|s}\big)^{\delta-1} B(\delta, C\delta)} \right\}
  \times \prod_{m=1}^{M}\frac{D_{\rho_{m|c^*_1}}(\delta)\, D_{\rho_{m|c^*_2}}(\delta)}{D_{\rho_{m|c^*}}(\delta)}.
\]

Besides, P_alloc is the probability that the specific reallocation of those who were in the class chosen to be split is made. The g function is the Uniform(0, 1) density. For the corresponding merge move, the acceptance probability α(C, C − 1) is min(1, A⁻¹), using the same expression for A but with some corrections.

Appendix B
Acceptance Rate of S → S + 1

For the split move on latent profiles (i.e., S′ = S + 1), the acceptance rate α(S, S′) is min(1, B), where

\[
B = \frac{P(\Theta_{(C,S')} \mid y)}{P(\Theta_{(C,S)} \mid y)}
    \times \frac{q(\Theta_{(C,S')}, \Theta_{(C,S)})}{q(\Theta_{(C,S)}, \Theta_{(C,S')})}
    \times \frac{g^*(u^*)}{g(u)}
    \times \left| \frac{\partial(\Theta_{(S',C)}, u^*)}{\partial(\Theta_{(S,C)}, u)} \right|
\]
\[
\phantom{B} = (\text{likelihood ratio}) \times \frac{\pi_0(\Theta_{(C,S')})}{\pi_0(\Theta_{(C,S)})}
    \times \frac{d_{S+1}}{b_S} \times \frac{1}{P_{\mathrm{alloc}}}
    \times \frac{1}{g_1(w)\prod_{c=1}^{C-1}\prod_{t=1}^{T} g_2(\beta_{ct})}
    \times \frac{\prod_{c=1}^{C}\prod_{t=1}^{T}\eta^{(t)}_{c|s_1} \times \prod_{c=1}^{C}\prod_{t=1}^{T}\eta^{(t)}_{c|s_2}}
           {\prod_{c=1}^{C}\prod_{t=1}^{T}\eta^{(t)}_{c|s^*}}
    \times \frac{1}{\big[w(1-w)\big]^{T(C-1)}},
\]

where the likelihood ratio is the ratio of the product of the complete likelihood values for the new parameter sets, when the new profiles are formed, to that for the old. The function π0 is the product of the prior distributions for the model parameters, the number of latent profiles, and the latent allocation variables. To be specific,

\[
\frac{\pi_0(\Theta_{(C,S')})}{\pi_0(\Theta_{(C,S)})}
\propto \frac{(\gamma_{s^*} w_1)^{\delta-1}(\gamma_{s^*} w_2)^{\delta-1}}{\gamma_{s^*}^{\delta-1} B(S\delta, \delta)}
  \times \prod_{t=1}^{T}\frac{D_{\eta^{(t)}_{s_1}}(\delta)\, D_{\eta^{(t)}_{s_2}}(\delta)}{D_{\eta^{(t)}_{s^*}}(\delta)}.
\]

Besides, P_alloc is the probability that the specific reallocation of those who were in the profile chosen to be split is made. The g_1 function is the Uniform(0, 1) density and g_2 is the N(0, 1) density. For the merge move, the acceptance probability α(S, S − 1) is min(1, B⁻¹), using the same expression for B but with some corrections.

Appendix C
Hessian Matrix

C.1 Diagonal Entries

First, we introduce several quantities for future use:

1. $\theta_{i(s,\mathbf{c})} = \Pr(L_i = s, \mathbf{C}_i = \mathbf{c}_i \mid \mathbf{Y}_i = \mathbf{y}_i)$
2. $\theta_{is} = \sum_{c_1}\cdots\sum_{c_T} \theta_{i(s,\mathbf{c})}$
3. $\theta^{(t)}_{i(s,c_t)} = \prod_{j \ne t}\sum_{c_j} \theta_{i(s,\mathbf{c})}$
4. $\theta^{(t)}_{i c_t} = \sum_{s} \theta^{(t)}_{i(s,c_t)}$
5. $\theta^{(t,t^*)}_{i(s,c,c^*)} = P(L_i = s, C_{it} = c, C_{it^*} = c^* \mid \mathbf{Y}_i = \mathbf{y}_i)$
6. $\theta^{(t,t^*)}_{i(c,c^*)} = \sum_{s} \theta^{(t,t^*)}_{i(s,c,c^*)}$
7. $g(i,m,c,t) = \rho_{m|c_t=c}^{\,I(y_{imt}=1)} \big(1 - \rho_{m|c_t=c}\big)^{I(y_{imt}=0)}$
8. $\delta(i,c,t) = 1 - \prod_{m=1}^{r_m} \dfrac{g(i,m,t,C)}{g(i,m,t,c)}$

Let $f_i = P(Y_{i1} = y_{i1}, \ldots, Y_{iT} = y_{iT})$ and $l_i = \log f_i$. Here $s_{\mathrm{ref}}$ and $c_{\mathrm{ref}}$ denote the reference categories for profile and class, respectively.
The elements of the diagonal block of the Hessian corresponding to the profile probabilities γ are

\[
\frac{\partial^2 l_i}{\partial\gamma_s\,\partial\gamma_{s^*}}
= -\left(\frac{\theta_{is}}{\gamma_s} - \frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}}\right)
   \left(\frac{\theta_{is^*}}{\gamma_{s^*}} - \frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}}\right),
\]

where s and s* = 1, 2, ..., S − 1. To get the diagonal block of the Hessian with respect to η, we consider the following cases.

1. When both derivatives involve the same profile s and the same time point t, the following holds whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t)}_{c_1|s}\,\partial\eta^{(t)}_{c_2|s}}
= -\left(\frac{\theta^{(t)}_{i(s,c_1)}}{\eta^{(t)}_{c_1|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right)
   \left(\frac{\theta^{(t)}_{i(s,c_2)}}{\eta^{(t)}_{c_2|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right).
\]

2. When the profile is the same but t_1 ≠ t_2, the following holds whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t_1)}_{c_1|s}\,\partial\eta^{(t_2)}_{c_2|s}}
= \delta(i, t_1, c_1)\,\delta(i, t_2, c_2)\,
  \frac{\theta^{(t_1,t_2)}_{i(s,c_1,c_2)}}{\eta^{(t_1)}_{c_1|s}\,\eta^{(t_2)}_{c_2|s}}
  - \left(\frac{\theta^{(t_1)}_{i(s,c_1)}}{\eta^{(t_1)}_{c_1|s}} - \frac{\theta^{(t_1)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t_1)}_{c_{\mathrm{ref}}|s}}\right)
    \left(\frac{\theta^{(t_2)}_{i(s,c_2)}}{\eta^{(t_2)}_{c_2|s}} - \frac{\theta^{(t_2)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t_2)}_{c_{\mathrm{ref}}|s}}\right).
\]

3. When the time point is the same but s_1 ≠ s_2, the following holds whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t)}_{c_1|s_1}\,\partial\eta^{(t)}_{c_2|s_2}}
= -\left(\frac{\theta^{(t)}_{i(s_1,c_1)}}{\eta^{(t)}_{c_1|s_1}} - \frac{\theta^{(t)}_{i(s_1,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s_1}}\right)
   \left(\frac{\theta^{(t)}_{i(s_2,c_2)}}{\eta^{(t)}_{c_2|s_2}} - \frac{\theta^{(t)}_{i(s_2,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s_2}}\right).
\]

4. When t_1 ≠ t_2 and s_1 ≠ s_2, the following holds whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t_1)}_{c_1|s_1}\,\partial\eta^{(t_2)}_{c_2|s_2}}
= -\left(\frac{\theta^{(t_1)}_{i(s_1,c_1)}}{\eta^{(t_1)}_{c_1|s_1}} - \frac{\theta^{(t_1)}_{i(s_1,c_{\mathrm{ref}})}}{\eta^{(t_1)}_{c_{\mathrm{ref}}|s_1}}\right)
   \left(\frac{\theta^{(t_2)}_{i(s_2,c_2)}}{\eta^{(t_2)}_{c_2|s_2}} - \frac{\theta^{(t_2)}_{i(s_2,c_{\mathrm{ref}})}}{\eta^{(t_2)}_{c_{\mathrm{ref}}|s_2}}\right).
\]

Next, for the diagonal block of the Hessian with respect to the response measurement parameters ρ, several cases must be considered.

1. When t_1 ≠ t_2, the following holds irrespective of the relationship between k and l or between c_1 and c_2:
\[
\frac{\partial^2 l_i}{\partial\rho_{k|c_{t_1}=c_1}\,\partial\rho_{l|c_{t_2}=c_2}}
= \frac{(2y_{ikt_1}-1)(2y_{ilt_2}-1)}{g(i,k,t_1,c_1)\,g(i,l,c_2,t_2)}
  \left(\theta^{(t_1,t_2)}_{i(c_1,c_2)} - \theta^{(t_1)}_{i c_1}\,\theta^{(t_2)}_{i c_2}\right).
\]

2. When t_1 = t_2 = t and c_1 = c_2 = c but k ≠ l,
\[
\frac{\partial^2 l_i}{\partial\rho_{k|c_t=c}\,\partial\rho_{l|c_t=c}}
= -\frac{(2y_{ikt}-1)(2y_{ilt}-1)}{g(i,k,c,t)\,g(i,l,c,t)}\,
  \theta^{(t)}_{ic}\big(1 - \theta^{(t)}_{ic}\big).
\]

3. When the time point is the same but c_1 ≠ c_2, the following holds whether or not k equals l:
\[
\frac{\partial^2 l_i}{\partial\rho_{k|c_t=c_1}\,\partial\rho_{l|c_t=c_2}}
= -\frac{\theta^{(t)}_{ic_1}(2y_{ikt}-1)}{g(i,k,c_1,t)}\,
   \frac{\theta^{(t)}_{ic_2}(2y_{ilt}-1)}{g(i,l,c_2,t)}.
\]

4. When all subscripts are the same,
\[
\frac{\partial^2 l_i}{\partial\rho_{k|c_t=c}\,\partial\rho_{k|c_t=c}}
= -\frac{\big(\theta^{(t)}_{i c}\big)^2 (2y_{ikt}-1)^2}{g(i,k,t,c)^2}.
\]

C.2 Off-Diagonal Entries

To derive the off-diagonal elements, we start with the block of the Hessian that involves both γ and η. Again, a couple of cases must be considered.

1. When the two parameters share the same profile s,
\[
\frac{\partial^2 l_i}{\partial\gamma_s\,\partial\eta^{(t)}_{c|s}}
= \left(\frac{\theta^{(t)}_{i(s,c)}}{\eta^{(t)}_{c|s}\,\gamma_s} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}\,\gamma_s}\right)
  - \left(\frac{\theta_{is}}{\gamma_s} - \frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}}\right)
    \left(\frac{\theta^{(t)}_{i(s,c)}}{\eta^{(t)}_{c|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right).
\]

2. When s_1 ≠ s_2,
\[
\frac{\partial^2 l_i}{\partial\gamma_{s_1}\,\partial\eta^{(t)}_{c|s_2}}
= -\,\delta(i,c,t)\,\frac{\theta^{(t)}_{i(s_{\mathrm{ref}},c)}}{\gamma_{s_{\mathrm{ref}}}\,\eta^{(t)}_{c|s_{\mathrm{ref}}}}
  - \left(\frac{\theta_{i s_1}}{\gamma_{s_1}} - \frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}}\right)
    \left(\frac{\theta^{(t)}_{i(s_2,c)}}{\eta^{(t)}_{c|s_2}} - \frac{\theta^{(t)}_{i(s_2,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s_2}}\right).
\]

We then derive the off-diagonal elements of the Hessian that involve γ and ρ:
\[
\frac{\partial^2 l_i}{\partial\gamma_{s_1}\,\partial\rho_{k|c_t=c}}
= -\frac{2y_{ikt}-1}{g(i,k,c,t)}
  \left\{\left(\frac{\theta^{(t)}_{i(s_{\mathrm{ref}},c)}}{\gamma_{s_{\mathrm{ref}}}} - \frac{\theta^{(t)}_{i(s_1,c)}}{\gamma_{s_1}}\right)
  - \theta^{(t)}_{ic}\left(\frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}} - \frac{\theta_{i s_1}}{\gamma_{s_1}}\right)\right\}.
\]

Finally, we derive the off-diagonal elements of the Hessian that involve η and ρ. Several cases need to be taken into consideration.
1. When t_1 ≠ t_2, the following expression is applicable whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t_1)}_{c_1|s}\,\partial\rho_{k|c_{t_2}=c_2}}
= \frac{2y_{ikt_2}-1}{g(i,k,c_2,t_2)}\, R,
\tag{C.1}
\]
where
\[
R = \delta(i,t_1,c_1)\,\frac{\theta^{(t_2,t_1)}_{i(s,c_2,c_1)}}{\eta^{(t_1)}_{c_1|s}}
  - \theta^{(t_2)}_{i c_2}\left(\frac{\theta^{(t_1)}_{i(s,c_1)}}{\eta^{(t_1)}_{c_1|s}} - \frac{\theta^{(t_1)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t_1)}_{c_{\mathrm{ref}}|s}}\right).
\]

2. When the time point is the same but c_1 ≠ c_2,
\[
\frac{\partial^2 l_i}{\partial\eta^{(t)}_{c_1|s}\,\partial\rho_{k|c_t=c_2}}
= -\frac{\theta^{(t)}_{i c_2}(2y_{ikt}-1)}{g(i,k,t,c_2)}
   \left(\frac{\theta^{(t)}_{i(s,c_1)}}{\eta^{(t)}_{c_1|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right).
\]

3. When all subscripts are the same,
\[
\frac{\partial^2 l_i}{\partial\eta^{(t)}_{c|s}\,\partial\rho_{k|c_t=c}}
= \frac{2y_{ikt}-1}{g(i,k,c,t)}
  \left\{\frac{\theta^{(t)}_{i(s,c)}}{\eta^{(t)}_{c|s}}
  - \theta^{(t)}_{ic}\left(\frac{\theta^{(t)}_{i(s,c)}}{\eta^{(t)}_{c|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right)\right\}.
\]

BIBLIOGRAPHY

A. Agresti. Categorical Data Analysis. Wiley, Hoboken, New Jersey, second edition, 2002.

A. Ahmed and E. Xing. Dynamic non-parametric mixture models and the recurrent chinese restaurant process: with applications to evolutionary clustering. 2008.

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 6:716–723, 1974.

C. E. Antoniak. Mixtures of dirichlet processes with applications to non-parametric problems. The Annals of Statistics, 2:1152–1174, 1974.

K. Bandeen-Roche, D. L. Miglioretti, S. L. Zeger, and P. J. Rathouz. Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92:1375–1386, 1997.

D. J. Bartholomew and M. Knott. Latent variable models and factor analysis. Arnold, London, second edition, 1999, Chap. 6.

C. Biernacki, G. Celeux, and G. Govaert. Choosing starting values for the em algorithm for getting the highest likelihood in multivariate gaussian mixture models. Computational Statistics and Data Analysis, 41:561–575, 2003.

D. Blackwell and J. B. Macqueen. Ferguson distributions via polya urn schemes. The Annals of Statistics, 1:353–355, 1973.

G. E. P. Box and G. C. Tiao. Bayesian Inference in Statistical Analysis. Wiley & Sons, New York, 1992.

M.-W. Chang and C.-J. Lin. Adaptive deterministic annealing for two applications: competing svr of switching dynamics and travelling salesman problems. The 9th International Conference on Neural Information Processing, Singapore, 2002.

S. Chib. Calculating posterior distributions and modal estimates in markov mixture models. Journal of Econometrics, 75:79–97, 1996.

H. Chung, J. C. Anthony, and J. L. Schafer. Latent class profile analysis: an application to stage-sequential process of under-age drinking behaviours. Journal of the Royal Statistical Society, Series A, 2011.

H. Chung, S. T. Lanza, and E. Loken. Latent transition analysis: inference and estimation. Statistics in Medicine, 27:1834–1854, 2008.

Y. Chung and D. B. Dunson. Nonparametric bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association, 104:1646–1660, 2009.

W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829–836, 1979.

C. C. Clogg. Latent class models. In G. Arminger, C. C. Clogg, and M. E. Sobel, editors, Handbook of Statistical Modeling for the Social and Behavioral Sciences, pages 351–359. Plenum, New York, 1995.

C. C. Clogg and L. A. Goodman. Latent structure analysis of a set of multidimensional contingency tables. Journal of the American Statistical Association, 79:762–771, 1984.

L. M. Collins and S. E. Wugalter.
Latent class models for stage-sequential dynamic latent variables. Multivariate Behavioral Research, 27:131–157, 1992.

T. Cover and J. Thomas. Elements of Information Theory. Wiley, New York, 1991.

B. de Finetti. Funzione caratteristica di un fenomeno aleatorio. Atti della R. Academia Nazionale dei Lincei, Serie 6, 4:251–299, 1931.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via em algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

L. C. Dierker, F. Vesela, E. M. Sledjeskia, D. Costelloa, and N. Perrine. Testing the dual pathway hypothesis to substance use in adolescence and young adulthood. Drug and Alcohol Dependence, 87:83–93, 2007.

D. B. Dunson and J.-H. Park. Kernel stick-breaking processes. Biometrika, 95:307–323, 2008.

M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.

T. S. Ferguson. A bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.

T. S. Ferguson and M. J. Klass. A representation of independent increment processes without gaussian components. The Annals of Mathematical Statistics, 43:1634–1643, 1972.

A. K. Formann. Latent class model diagnosis from a frequentist point of view. Biometrics, 59:189–196, 2003.

A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409, 1990.

A. Gelman, W. R. Gilks, and G. O. Roberts. Weak convergence and optimal scaling of random walk metropolis algorithms. Annals of Applied Probability, 7:110–120, 1997.

S. Geman and D. Geman. Stochastic relaxation, gibbs distribution, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:401–412, 1984.

L. A. Goodman. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61:215–231, 1974.

P. J. Green. Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika, 82:711–732, 1995.

S. J. Haberman. Analysis of Qualitative Data. Academic Press, New Developments, New York, 1979.

J. A. Hagenaars and A. L. McCutcheon. Applied latent class analysis. Cambridge University Press, Cambridge, 2002.

W. Hastings. Monte carlo sampling methods using markov chains and their application. Biometrika, 57:97–109, 1970.

T. Heinen. Discrete latent variable models. Tilburg University Press, The Netherlands, 1993, Chap. 2.

E. Hewitt and L. J. Savage. Symmetric measures on cartesian products. Transactions of the American Mathematical Society, 80:470–501, 1955.

K. W. Ho and I. Hu. Flexible modelling of random effects in linear mixed models-a bayesian approach. Computational Statistics and Data Analysis, 52:1347–1361, 2008.

H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161–173, 2001.

H. Ishwaran and L. F. James. Approximate dirichlet process computing for finite normal mixtures: smoothing and prior information. Journal of Computational and Graphical Statistics, In press, 2002.

H. Ishwaran and M. Zarepour. Markov chain monte carlo in approximate dirichlet and beta two-parameter process hierarchical models. Biometrika, 87:371–390, 2000.
Y. Itaya, H. Zen, Y. Nankaku, C. Miyajima, K. Tokuda, and T. Kitamura. Deterministic annealing em algorithm in parameter estimation for acoustic model. Interspeech, pages 433–436, 2004.

K. Katahiral, K. Watanabe, and M. Okada. Deterministic annealing variant of variational bayes method. Journal of Physics: Conference Series, 95, 2008.

D. Kim and B. G. Lindsay. Using confidence distribution sampling to visualize confidence sets. Statistica Sinica, 2009.

R. M. Korwar and M. Hollander. Contributions to the theory of dirichlet processes. The Annals of Statistics, 1:705–711, 1973.

S. T. Lanza and L. M. Collins. A mixture model of discontinuous development in heavy drinking from ages 18 to 30: The role of college enrollment. Journal of Studies on Alcohol, 67:552–561, 2006.

P. F. Lazarsfeld and N. W. Henry. Latent structure analysis. Houghton Mifflin, Boston, 1968.

R. J. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 1987.

J. S. Liu. Nonparametric hierarchical bayes via sequential imputations. The Annals of Statistics, 42:911–930, 1996.

D. J. C. MacKay. Ensemble learning for hidden markov models. 1997.

A. L. McCutcheon. Sexual morality, pro-life values, and attitudes toward abortion: A simultaneous latent structure analysis for 1978–1983. Sociological Methods and Research, 16:256–275, 1987.

B. O. Muthén and K. Shedden. Finite mixture modeling with mixture outcomes using the em algorithm. Biometrics, 55:463–469, 1999.

L. K. Muthén and B. O. Muthén. Mplus user's guide. Muthén & Muthén, Los Angeles, 3rd edition, 2004.

J. Park, W. Cho, and S. Park. Deterministic annealing em and its application in natural image segmentation. Computational and Information Science, 3314:639–644, 2005.

G. P. Patil and C. Taillie. Diversity as a concept and its implications for random communities. Bulletin of the International Statistical Institute, 47:497–515, 1977.

M. Perman and J. Pitman. Size-biased sampling of poisson point processes and excursions. Probability Theory and Related Fields, 92:21–39, 1992.

J. Pitman. Some developments of the Blackwell-MacQueen urn scheme. In Statistics, Probability and Game Theory (Edited by T. S. Ferguson, L. S. Shapley and J. B. MacQueen), volume 30. 1996.

J. Pitman and M. Yor. The two-parameter poisson-dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25:855–900, 1997.

C. K. Reddy and B. Rajaratnam. Learning mixture models via component-wise parameter smoothing. Computational Statistics and Data Analysis, 54:732–749, 2010.

S. Richardson and P. J. Green. On bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B, 59:731–792, 1997.

C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, New York, 2nd edition, 2004.

D. B. Rubin. Inference and missing data. Biometrika, 63:581–592, 1976.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.

J. Sethuraman. A constructive definition of dirichlet priors. Statistica Sinica, 4:639–650, 1994.

M. Tanner and W. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82:528–550, 1987a.

W. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82:528–550, 1987b.

N. Ueda and R. Nakano. Deterministic annealing em algorithm. Neural Networks, 11:271–282, 1998.
N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. Split and merge em algorithm for improving gaussian mixture density estimates. Journal of VLSI Signal Processing Systems, 26:133–140, 2000.

W. F. Velicer, C. A. Redding, M. D. Anatchkova, J. L. Fava, and J. O. Prochaska. Identifying cluster subtypes for the prevention of adolescent smoking acquisition. Addictive Behaviors, 32:228–247, 2007.

T. Xu, Z. Zhang, P. S. Yu, and B. Long. Evolutionary clustering by hierarchical dirichlet process with hidden markov state. 2008.

Y. Yang. Can the strengths of aic and bic be shared? Biometrika, 92, 2003.

W. Yao and B. G. Lindsay. Bayesian mixture labeling by highest posterior density. Journal of the American Statistical Association, 104:758–767, 2009.