LATENT CLASS PROFILE ANALYSIS: INFERENCE, ESTIMATION AND ITS APPLICATIONS

By

Hsiu-Ching Chang

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Statistics
2011

ABSTRACT

LATENT CLASS PROFILE ANALYSIS: INFERENCE, ESTIMATION AND ITS APPLICATIONS

By Hsiu-Ching Chang

Recently, a great deal of attention has been paid to stage-sequential processes for longitudinal data, and a number of methods for analyzing stage-sequential processes have been derived from the family of finite mixture modeling. However, research on the sequential process is rendered difficult by the fact that the number of latent components is not known a priori. To address this problem, we propose two solutions, reversible jump MCMC and a Bayesian non-parametric approach, so as to provide a set of principles for systematic model selection for the stage-sequential process. The reversible jump MCMC sampler can explore the parameter space and automatically learn the model. Nevertheless, we have found that reversible jump Markov chain Monte Carlo requires an efficient design of the proposal mechanisms used as jumping rules. To reduce the technical and computational burdens, we propose a Bayesian non-parametric approach to select the number of latent components. Using latent class-profile analysis, we test both algorithms on synthesized data sets to evaluate their performance on model selection problems.

Once a model is selected, the model parameters need to be estimated. The expectation-maximization algorithm (Dempster et al., 1977) and data augmentation using MCMC (Hastings, 1970; Tanner and Wong, 1987a) are widely used techniques for drawing statistical inferences about the parameters of the LCPA model. As the number of measurement occasions increases in the LCPA model, however, the computational cost of expectation-maximization or MCMC becomes exponentially intensive. If, instead, one adopts a recursive scheme in the update steps, the calculations are simplified and generalize to more time points. In light of this, we formulate each update step with recursive terms that are directly analogous to the forward-backward algorithm (Chib, 1996; MacKay, 1997). The parameter estimation for the LCPA model benefits from the recursive formula, but the recursive algorithm still requires careful examination for the existence of multiple local modes of the objective function (i.e., the log-likelihood). Applying the recursive formula, we implement the deterministic annealing EM algorithm (Ueda and Nakano, 1998) and the deterministic annealing variant of variational Bayes (Katahira et al., 2008) in order to find parameter estimates at the global mode of the objective function. Both methods are based on the deterministic annealing framework, in which ω is included as an annealing parameter to control the annealing rate. By adjusting the value of ω, the annealing process tracks multiple local modes and, as a result, identifies the global optimum.

Finally, we are interested in analyzing early-onset drinking behaviors among young people. We apply latent class-profile analysis to alcohol drinking behaviors as manifest in self-reported items drawn from the National Longitudinal Survey of Youth 1997, a survey that explores the transition from school to work and from adolescence to adulthood in the USA.
To unveil the stage-sequential behavioral progressions, we adopt a dynamic Dirichlet learning process to characterize the probable progressions in a discrete manner and then identify patterns in which similar progressions are grouped. For the parameter estimation, we conduct deterministic annealing approaches with a predetermined annealing schedule.

Key Words: Dirichlet process; Expectation-Maximization algorithm; Latent class profile analysis; Latent stage-sequential process; Longitudinal data; Multiple local modes; Recursive formula; Reversible jump MCMC; Deterministic annealing; Variational Bayes

Table of Contents

List of Figures
List of Tables
1 Introduction
2 Latent Class Profile Analysis
  2.1 Latent Class Profile Model with Covariates
  2.2 Missing Items
3 Model Selection
  3.1 Introduction
  3.2 Reversible Jump MCMC
    3.2.1 Within-Model Move
      3.2.1.1 Prior Distribution Specification
      3.2.1.2 Posterior Distribution
    3.2.2 Across-Model Move
      3.2.2.1 Jumping Rules for C
      3.2.2.2 Jumping Rules for S
  3.3 Dirichlet Process
    3.3.1 Preliminary
      3.3.1.1 Sethuraman Stick-Breaking Representation
      3.3.1.2 Ferguson Gamma Process Representation
      3.3.1.3 Almost Sure Truncation of DP(G0, λ) Measures
    3.3.2 Dirichlet Process Mixture Model
    3.3.3 Dynamic Dirichlet Process Mixture Model
4 Simulation
  4.1 Simulation with Reversible Jump MCMC
  4.2 Simulation with Dirichlet Process
  4.3 Extended Cases
  4.4 Discussion
5 Parameter Estimation
  5.1 Preliminary
  5.2 Expectation-Maximization
    5.2.1 E-step
    5.2.2 M-step
  5.3 Recursive Formula
  5.4 Local Modality
    5.4.1 Split-and-Merge EM
    5.4.2 Deterministic Annealing EM Algorithm
    5.4.3 Deterministic Annealing Variational Bayes Algorithm
6 Model Diagnosis
7 Data Analysis on National Longitudinal Survey of Youth 1997 (NLSY97)
  7.1 Data
    7.1.1 Information Criteria versus Dirichlet Process
    7.1.2 Parameter Estimates
      7.1.2.1 Discussion
8 Discussion
  8.1 Contributions
  8.2 Direction for Future Research
APPENDICES
  A Acceptance Rate of C → C + 1
  B Acceptance Rate of S → S + 1
  C Hessian Matrix
    C.1 Diagonal Entries
    C.2 Off-Diagonal Entries
BIBLIOGRAPHY

List of Figures

4.1 The percentage of the most frequently selected profiles over simulated samples by reversible jump MCMC: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.
4.2 The histogram of the most frequently selected profiles over simulated samples from Dirichlet process: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.
4.3 The histogram of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 1%: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.
4.4 The histogram of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 5%: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.
7.1 (a)-(c) Tracking plots for classes smoothed by LOWESS with two smoother spans (red: .67 and green: .15) from time 1 to time 3; (d) histogram of class progression (yellow bars) and the number of profiles (blue bars).
7.2 Histogram of loglikelihood values derived from (a) the standard EM and (b) the DAEM algorithms with 100 different sets of starting values.

List of Tables

4.1 The percentage of the most frequently selected profiles over simulated samples by reversible jump MCMC.
4.2 The percentage of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 5%.
4.3 The percentage of selecting the correct number of profiles (S = 2) over simulated samples from RJMCMC and Dirichlet process.
7.1 Goodness-of-fit statistics for a series of LCPA models under various numbers of classes and profiles.
7.2 Standard deviation of the ρ-parameter estimates derived from DAVB with 100 different sets of starting values.
7.3 Standard deviation of the η-parameter estimates derived from DAVB with 100 different sets of starting values.
7.4 Standard deviation of the β-parameter estimates derived from DAVB with 100 different sets of starting values (Profile 1 is the baseline).
7.5 Estimated probabilities of responding 'any use' to the drinking items for each class (ρ-parameters).
7.6 Estimated probabilities of belonging to a class sequence for each profile (η-parameters) and estimated profile prevalence (γ-parameters).
7.7 Estimated logistic regression coefficients (β-parameters) for the prevalence of profiles (Profile 1, non-drinking stayers, is the baseline).
Chapter 1
Introduction

Psychological research is increasingly turning to the idea of the stage-sequential process. A common theme of stage-sequential processes is that, at any moment, individuals are placed into distinct qualitative stages, and they can change their stage memberships over time. Latent class analysis (LCA) is perhaps the most straightforward mixture model now being used to identify mutually exclusive subgroups of individuals based on their responses to measured variables (Clogg and Goodman, 1984; Goodman, 1974). LCA models explain the relationships among categorical variables in a cross-classified contingency table by assuming the existence of an unobserved or latent classification. The first detailed statistical treatment of the LC model appeared in the textbook of Lazarsfeld and Henry (1968). In their terminology, the LC model is a special case of latent-structure analysis in which the measurement scales and the latent variables are both categorical. General overviews of LC modeling are provided by Goodman (1974), Haberman (1979), McCutcheon (1987), Heinen (1993, Chap. 2), Clogg (1995), Bartholomew and Knott (1999, Chap. 6), Hagenaars and McCutcheon (2002), and others.

Recently, a number of new methods for analyzing stage-sequential processes have been derived from the family of LCA. For example, latent transition analysis (LTA) (Collins and Wugalter, 1992; Chung et al., 2008) and general growth mixture models (GGMM) (Muthén and Shedden, 1999; Muthén and Muthén, 2004) have been used widely to identify patterns of the progression of adolescent substance use, such as alcohol (Lanza and Collins, 2006) or tobacco (Velicer et al., 2007), or to investigate the initiation and progression of drug-taking behaviors for a number of different substances (Dierker et al., 2007). Chung et al. (2011) proposed another type of LCA approach, latent class-profile analysis (LCPA), to identify subtypes of the stage-sequential patterns of early-onset drinking behaviors, where drinking items are treated as fallible indicators of unseen states of drinking behaviors.

In LCPA, the identification process is divided into two steps. In the first step, LCPA identifies discrete subgroups of individuals who have similar responses to items at each measurement occasion. The subgroups identified in the first step are referred to as classes. In the second step, LCPA examines individuals' class membership over the entire set of time points so as to classify the population into two or more subgroups based on their class sequencing. The subgroups identified in the second step are referred to as class profiles or simply profiles. By applying an LCPA to a longitudinal study of adolescent drinking, for example, all drinkers in a class at a certain time point are expected to be homogeneous in terms of their drinking behaviors, and the individuals in a given profile will have similar sequential patterns of class membership over time.

Like any other finite mixture, the first and most crucial step in LCPA is to choose an appropriate number of classes and profiles, since model selection has important ramifications for the analyses performed with the model. This study works on the issues regarding the selection of the number of classes and profiles in LCPA. Two Bayesian approaches for selecting the number of classes and profiles have been proposed in this study: reversible jump MCMC (RJMCMC) and the Dirichlet process.
RJMCMC has been proposed to select the number of latent components (e.g., classes and profiles) in finite mixture models (Green, 1995), where the number of components is considered an unknown parameter to be estimated. In every step of the RJMCMC algorithm, the current mixture model can be proposed to jump across dimensions, or the model can simply be proposed to update the parameters within the current model. Any proposal is accepted with a probability that preserves reversibility with respect to the target posterior. As a result of the RJMCMC, we can estimate the relative frequencies of the number of classes and profiles given the data. RJMCMC has been widely applied, including to finite mixture models (Richardson and Green, 1997) and linear mixed models (Ho and Hu, 2008). For the LCPA model, however, a new RJMCMC procedure has to be developed to deal with the stage-sequential process for the latent component membership. In this study, we design a set of split-and-merge formulas tailored to multivariate categorical LCPA models.

As a class of non-parametric Bayesian techniques, the Dirichlet process has been utilized in Dirichlet process mixture models (also known as infinite mixture models). The involvement of the non-parametric prior allows us to identify different distributions over the observed data. However, there is little literature regarding the Dirichlet process on model selection problems for stage-sequential processes with longitudinal data. To perform the Dirichlet process in the longitudinal framework, we develop a dynamic approach that is able to explore the stage-sequential process by elaborating on the stage transitions. It can be seen as a special hidden Markov model with no constraints placed on it. The dynamic approach has the space of all distributions as its support and can easily be applied to compute posteriors and draw inferences. Since the technique based on the Dirichlet process prior enables dynamic model learning, we name it the dynamic Dirichlet learning process and integrate it into the study to determine the optimal number of latent components.

Once a model is selected, the parameters of the model need to be estimated. There are several available parameter estimation methods, and each has long been valued for its own theoretical and practical worth. The expectation-maximization (EM) algorithm (Dempster et al., 1977) and data augmentation using Markov chain Monte Carlo (MCMC) (Hastings, 1970; Tanner and Wong, 1987a) are widely used techniques for drawing statistical inferences about the parameters. However, either the E-step of the EM algorithm or the I-step of MCMC requires the computation of marginal distributions, which appears to be too expensive to be of practical use in more generalized applications. To alleviate this problem, we formulate each update step with recursive formulas that are directly analogous to the forward-backward algorithm (Chib, 1996; MacKay, 1997). The recursive algorithm will therefore have lower computational complexity and storage demands.

Existing parameter estimation algorithms for these models suffer from local maxima problems because the resulting estimates depend strongly on the starting values. To relax the dependence on the initializations, split-and-merge EM (SMEM) (Ueda et al., 2000), the deterministic annealing EM (DAEM) algorithm (Ueda and Nakano, 1998) and the deterministic annealing variant of variational Bayes (DAVB) (Katahira et al., 2008) will be introduced for this purpose.
SMEM has some defects that limit how widely it can be practiced. The major implementational difficulties of SMEM lie in designing suitable split and merge operations and in choosing the right components to which the proposed mechanisms should be applied. On the other hand, both the DAEM and the DAVB algorithms are based on the deterministic annealing framework, in which ω is included as an annealing parameter to control the annealing rate. By adjusting the value of ω, the annealing process tracks the local optima and, as a result, identifies the global optimum.

We organize the rest as follows: Chapter 2 introduces the mathematical structure underlying the LCPA models. Chapter 3 presents the procedures of two Bayesian model selection algorithms: RJMCMC and the Dirichlet process. Chapter 4 continues the previous chapter and compares their performance in discovering the true model given different sets of conditions on the model parameters. Chapter 5 provides three parameter estimation methods: the original EM algorithm with split-and-merge formulas, and deterministic annealing designs in the application of EM and variational Bayes. Chapter 6 discusses the model identifiability issues. In Chapter 7, we apply the model selection and parameter estimation techniques to alcohol drinking items drawn from the National Longitudinal Survey of Youth 1997 (NLSY97) and summarize their performance. In Chapter 8, we summarize the contributions and suggest directions for future research.

Chapter 2
Latent Class Profile Analysis

Suppose we construct an S-profile LCPA model with C classes from a set of M items over T time periods. Let C = (C_1, ..., C_T) denote the class membership variables from the initial time t = 1 to time T, where C_t = 1, ..., C, and let U denote the profile membership variable with S nominal categories. We begin with the profile identification procedure. The basic idea of LCPA in this identification procedure is that associations among class memberships across the T time points arise from the assumption that the population is composed of S profiles. If the ith individual's class membership c = (c_1, ..., c_T) could be observed, the joint probability that he or she belongs to the sequence c and the profile s is

\[
P(U = s, \mathbf{C} = \mathbf{c}) = P(U = s)\, P(\mathbf{C} = \mathbf{c} \mid U = s)
= P(U = s) \prod_{t=1}^{T} P(C_t = c_t \mid U = s)
= \gamma_s \prod_{t=1}^{T} \prod_{c=1}^{C} \left[ \eta^{(t)}_{c|s} \right]^{I(c_t = c)},
\qquad (2.1)
\]

where γ_s = P(U = s) denotes the marginal prevalence of the sth profile membership; η^{(t)}_{c|s} = P(C_t = c | U = s) represents the probability of the cth class membership at time t given the profile membership s; and I(A) is the indicator function such that I(A) = 1 if A is satisfied and I(A) = 0 otherwise. Here, we assume that the class memberships C = (C_1, ..., C_T) are conditionally independent or unrelated within each profile s. This assumption, called local independence (Lazarsfeld and Henry, 1968), is the crucial feature of LCPA that allows us to draw inferences about the unseen profile variable. In (2.1), the associations among class memberships are explained by latent class theory, which posits that profiles can be identified by class sequencing over time. If individuals' class memberships over the T time points could be observed, we could simply apply an LCA to identify a number of profiles. However, since an individual's class membership is not directly observed, another procedure should be conducted to identify class membership based on their responses to items.
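As a concrete illustration of the structure of (2.1), the short sketch below (an added example, not part of the original text) evaluates the joint probability of one profile and one class sequence under the local-independence assumption; the array layout and the toy parameter values are assumptions made only for this example.

```python
import numpy as np

# Toy LCPA setting: S = 2 profiles, C = 2 classes, T = 3 time points.
gamma = np.array([0.6, 0.4])                      # gamma_s = P(U = s)
# eta[t, c, s] = P(C_t = c | U = s); for each (t, s) the entries over c sum to one.
eta = np.array([[[0.9, 0.2], [0.1, 0.8]],
                [[0.8, 0.3], [0.2, 0.7]],
                [[0.7, 0.4], [0.3, 0.6]]])

def joint_profile_class_prob(s, c_seq, gamma, eta):
    """P(U = s, C = c_seq) as in (2.1): gamma_s * prod_t eta^{(t)}_{c_t|s}."""
    prob = gamma[s]
    for t, c_t in enumerate(c_seq):
        prob *= eta[t, c_t, s]
    return prob

# Probability of staying in class 0 at all three occasions under profile 0.
print(joint_profile_class_prob(0, [0, 0, 0], gamma, eta))
```

In the full model, this profile-level term is multiplied by the item-response probabilities introduced next and then summed over profiles and class sequences, which is exactly the structure of the likelihood in (2.4).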
The class identification explains associations among item responses based on the assumption that the population is composed of C classes at each time point. Let Y_t = (Y_{1t}, ..., Y_{Mt}) represent a vector of M discrete variables measuring latent class membership at time t, and let y_{it} = (y_{i1t}, ..., y_{iMt})′ be the observed values of Y_t for the ith individual, where each response y_{imt} can take values from 1 to r_m for m = 1, ..., M and t = 1, ..., T. The joint probability that the individual belongs to class c_t at time t and provides responses y_t would be

\[
P(C_t = c, \mathbf{Y}_t = \mathbf{y}_{it} \mid U = s)
= P(C_t = c \mid U = s)\, P(\mathbf{Y}_t = \mathbf{y}_{it} \mid C_t = c)
= P(C_t = c \mid U = s) \prod_{m=1}^{M} P(Y_{mt} = y_{imt} \mid C_t = c)
= \eta^{(t)}_{c|s} \prod_{m=1}^{M} \prod_{k=1}^{r_m} \left[ \rho_{mkt|c} \right]^{I(y_{imt} = k)},
\qquad (2.2)
\]

where ρ_{mkt|c_t} = P(Y_{mt} = k | C_t = c_t) represents the probability of response k to the mth item for a given class c_t at time t. In order to investigate the relationship among items, we assume the following: (a) Y_t = (Y_{1t}, ..., Y_{Mt}) are conditionally independent given c for t = 1, ..., T; and (b) the profile membership U is related to the items Y_t only through the class membership C_t for t = 1, ..., T. Assumption (b) implies that U depends on the class memberships (C_1, ..., C_T) but not on the items (Y_1, ..., Y_T). The joint probability that the individual belongs to the class sequence c = (c_1, ..., c_T) and the profile s and provides responses y_i = (y_{i1}, ..., y_{iT}) would be

\[
L^{*}_{i} = P(U = s, \mathbf{C} = \mathbf{c}, \mathbf{Y} = \mathbf{y}_i)
= P(U = s) \prod_{t=1}^{T} \left\{ P(C_t = c_t \mid U = s) \prod_{m=1}^{M} P(Y_{mt} = y_{imt} \mid C_t = c_t) \right\}
= \gamma_s \prod_{t=1}^{T} \left\{ \eta^{(t)}_{c_t|s} \prod_{m=1}^{M} \prod_{k=1}^{r_m} \left[ \rho_{mkt|c_t} \right]^{I(y_{imt} = k)} \right\}.
\qquad (2.3)
\]

Therefore, the ith subject's contribution to the likelihood function of (Y_1, ..., Y_T), without regard for the latent class and profile, is given by

\[
L_i = P(\mathbf{Y} = \mathbf{y}_i)
= \sum_{s=1}^{S} \sum_{c_1=1}^{C} \cdots \sum_{c_T=1}^{C} L^{*}_{i}
= \sum_{s=1}^{S} \gamma_s \prod_{t=1}^{T} \left\{ \sum_{c_t=1}^{C} \eta^{(t)}_{c_t|s} \prod_{m=1}^{M} \prod_{k=1}^{r_m} \left[ \rho_{mkt|c_t} \right]^{I(y_{imt} = k)} \right\}.
\qquad (2.4)
\]

In (2.4), the following three sets of parameters are estimated:

1. ρ_{mkt|c_t} = P(Y_{mt} = k | C_t = c_t) represents the probability of the response k to the mth item for a given class c_t at time t;
2. η^{(t)}_{c_t|s} = P(C_t = c_t | U = s) represents the conditional probability of belonging to class c_t at time t for a given class profile s; and
3. γ_s = P(U = s) represents the probability of belonging to the class profile s.

We refer to the ρ-parameter as the primary measurement parameter because it describes how individuals in each class tend to respond to the mth item at each occasion for m = 1, ..., M. The η-parameter, referred to as the secondary measurement parameter, describes the relation between a class c_t at time t and a class-profile s. The primary measurement parameters are usually constrained to be equal across measurement occasions (i.e., ρ_{mk1|c} = ··· = ρ_{mkT|c}), so that the meaning of the classes will not change over time. In practice, this invariance assumption should be carefully checked by comparing the fit of the model with and without the constraints.
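To make the likelihood (2.4) concrete, the following is a minimal numerical sketch (added here for illustration, not part of the original text); the array layouts, the function name, and the toy parameter values are assumptions. It computes L_i for one individual by summing over profiles and, within each time point, over classes.

```python
import numpy as np

def lcpa_likelihood(y_i, gamma, eta, rho):
    """Observed-data likelihood L_i of (2.4) for one fully observed individual.

    y_i   : (T, M) integer responses coded 0..r_m-1
    gamma : (S,)          profile prevalences gamma_s
    eta   : (T, C, S)     eta[t, c, s] = P(C_t = c | U = s)
    rho   : (M, K, T, C)  rho[m, k, t, c] = P(Y_mt = k | C_t = c)
    """
    T, M = y_i.shape
    S, C = gamma.shape[0], eta.shape[1]
    L_i = 0.0
    for s in range(S):
        prod_t = 1.0
        for t in range(T):
            inner = 0.0                      # sum over the class at time t
            for c in range(C):
                p_items = np.prod([rho[m, y_i[t, m], t, c] for m in range(M)])
                inner += eta[t, c, s] * p_items
            prod_t *= inner
        L_i += gamma[s] * prod_t
    return L_i

# Toy example: S = 2, C = 2, T = 2, M = 2 binary items, random valid parameters.
rng = np.random.default_rng(0)
gamma = np.array([0.5, 0.5])
eta = rng.dirichlet(np.ones(2), size=(2, 2)).transpose(0, 2, 1)          # (T, C, S)
rho = rng.dirichlet(np.ones(2), size=(2, 2, 2)).transpose(0, 3, 1, 2)    # (M, K, T, C)
y_i = np.array([[0, 1], [1, 1]])
print(lcpa_likelihood(y_i, gamma, eta, rho))
```

Under the missing-at-random treatment discussed below in Section 2.2, the same computation applies with the inner product over items restricted to the items actually observed at each time point.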
2.1 Latent Class Profile Model with Covariates

When we consider stage-sequential patterns, it is reasonable to take external causes into consideration. The natural way to extend the LCPA model in such a manner is to include covariates and examine whether the prevalence of the latent profiles varies under their influence. Let x_i = (x_{i1}, ..., x_{ip})^T denote a vector of time-invariant covariates for individual i that may influence the probability with which he or she falls into the latent class-profile U_i = 1, ..., S. That is to say, the marginal probability of U_i is affected by the covariates and is therefore denoted by P(U_i = s | x_i). Even though the covariates come into play in shaping the profile sizes, the influence of x_i on the data is completely mediated by U_i. In most cases, the first covariate is fixed as a constant (x_{i1} = 1), so that the model with p = 1 reduces to a traditional LCPA model without covariates. The dependence of U_i on x_i is specified by

\[
\gamma_s(\mathbf{x}_i) = P(U_i = s \mid \mathbf{x}_i)
= \frac{\exp(\mathbf{x}_i^T \boldsymbol{\beta}_s)}{1 + \sum_{j=1}^{S-1} \exp(\mathbf{x}_i^T \boldsymbol{\beta}_j)},
\qquad s = 1, 2, \ldots, S-1,
\qquad (2.5)
\]

with Σ_{s=1}^{S} γ_s = 1. In equation (2.5), β_j = (β_{j1}, ..., β_{jp})^T is a p × 1 vector of logistic-regression coefficients influencing the log-odds that an individual falls into profile j relative to profile S, which serves as a baseline,

\[
\log \frac{\gamma_s(\mathbf{x}_i)}{\gamma_S(\mathbf{x}_i)} = \mathbf{x}_i^T \boldsymbol{\beta}_s.
\qquad (2.6)
\]

Equation (2.6) appears to be a baseline-category logit model for a polytomous response (Agresti, 2002), except that in this case the response is latent. The likelihood contribution for the ith individual can be written as

\[
P(\mathbf{Y} = \mathbf{y}_i) = \sum_{s=1}^{S} \sum_{c_1=1}^{C} \cdots \sum_{c_T=1}^{C} L^{*}_{i}
= \sum_{s=1}^{S} \gamma_s(\mathbf{x}_i) \prod_{t=1}^{T} \left\{ \sum_{c_t=1}^{C} R^{(t)}_{c_t|s} \right\},
\qquad (2.7)
\]

where R^{(t)}_{c_t|s} = η^{(t)}_{c_t|s} ∏_{m=1}^{M} ∏_{k=1}^{r_m} [ρ_{mkt|c_t}]^{I(y_{imt}=k)}.

The LCPA model with covariates (2.7) has the attractive property that the distribution of Y_i marginalized over the covariates reduces to that of a traditional LCPA model (Bandeen-Roche et al., 1997). Letting F denote the probability distribution of x_i, the marginal distribution of the item variables for the ith individual is

\[
P(\mathbf{Y}_i = \mathbf{y}_i)
= \int \sum_{s=1}^{S} \gamma_s(\mathbf{x}_i) \prod_{t=1}^{T} R^{(t)}_{s} \, dF(\mathbf{x}_i)
= \sum_{s=1}^{S} \left\{ \int \gamma_s(\mathbf{x}_i) \, dF(\mathbf{x}_i) \right\} \prod_{t=1}^{T} R^{(t)}_{s}
= \sum_{s=1}^{S} \gamma^{*}_{s} \prod_{t=1}^{T} R^{(t)}_{s},
\qquad (2.8)
\]

where R^{(t)}_{s} = Σ_{c_t=1}^{C} η^{(t)}_{c_t|s} ∏_{m=1}^{M} ∏_{k=1}^{r_m} [ρ_{mkt|c_t}]^{I(y_{imt}=k)}. That is, the observed log-likelihood function can be reduced to that of an LCPA model with γ*_s representing the marginal class-profile membership probabilities averaged over the distribution of the covariates in the population. This simplified form suggests that a researcher can analyze the data with covariates by a two-stage process: first, fit a conventional LCPA model to the data without covariates to determine the nature of the latent variables U_i and C_i; then introduce the covariates to assess their influence on the class-profile variable U_i.

2.2 Missing Items

In real applications, it is inevitable to have missing data, since some responses to one or more questionnaire items are missing. A common way to handle missing data is to delete the incomplete cases. However, the consequences can be far-reaching if the removal of the incomplete individuals causes major changes in the representation of the data's characteristics. In fact, case deletion may be unnecessary because the incomplete information can still provide substantial true knowledge of the data, and estimates of parameters that are meant to represent the full population may be biased if part of the information has been discarded (Little and Rubin, 1987). A more principled method to deal with missing data is to apply a likelihood function defined by the marginal distribution of the observed items only. Maximizing this likelihood function, which eliminates the missing responses, is appropriate when the missing items are missing at random (MAR) in the sense defined by Rubin (1976) and Little and Rubin (1987).
The contribution of individual i to this function, which we call the observed-data likelihood, is

\[
P(\mathbf{Y}_{i,\mathrm{obs}} = \mathbf{y}_{i,\mathrm{obs}})
= \sum_{s=1}^{S} \gamma_s(\mathbf{x}_i) \prod_{t=1}^{T} \left\{ \sum_{c_t=1}^{C} R^{(t)}_{c_t|s} \right\},
\qquad (2.9)
\]

where Y_{i,obs} and y_{i,obs} are the vectors of observed item variables and their realized values, the product over items in R^{(t)}_{c_t|s} is restricted to the observed items, and obs_i denotes the set of items observed for individual i. We assume that the covariates x_i are completely observed. Missing values among the covariates would require us to introduce additional assumptions about the distribution of x_i. The concept of extending the model to include missing values in the covariates is straightforward, but it can wreak havoc on the implementation because the dimension of the observed item vectors Y_{i,obs} varies from one subject to another. Missing items also introduce complications in assessing goodness of fit and checking for departures from the modeling assumptions (e.g., local independence).

Chapter 3
Model Selection

3.1 Introduction

In finite mixture modeling, the problem to be solved is to estimate the unknown number of components. This question arises in many traditional and novel situations such as variable selection, signal processing, and Bayesian non-parametric statistics. There are many model selectors in popular use, such as Akaike's information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978). As an estimated Kullback-Leibler (KL) distance, AIC aims at finding the best approximating model to the unknown true data-generating process and penalizes with twice the number of parameters to achieve parsimony; BIC operates in a similar way but penalizes more, with the logarithm of the sample size. In addition, the penalized least squares approach with the smoothly clipped absolute deviation (SCAD) penalty term has been demonstrated to be an attractive selection approach which not only selects important variables consistently but also produces oracle estimates (Fan and Li, 2001).

Unfortunately, there is no single superior model selection tool that can be applied to all types of data sets. For example, BIC is consistent while AIC is not: if the true model is among the candidates, BIC can identify the true model almost surely as the sample size grows; however, AIC has been proved to be minimax-rate optimal for estimating the regression function in both parametric and nonparametric cases (Yang, 2003). Even though AIC and BIC are commonly used for comparing different models, computing the relevant AIC and BIC values for each possible model can be very time consuming when the number of competing models is large. In LCPA models, the structure is intricate, and we need to develop methods by which we can fully incorporate the data information to improve the decision-making process.

3.2 Reversible Jump MCMC

In this section, we explore the RJMCMC algorithm for the LCPA model to select the appropriate number of classes and profiles. The algorithm aims to compute the joint posterior distribution of the parameters and the number of latent components. The RJMCMC samplers can travel between different dimensions by constructing a reversible Markov chain on the general state space with a specified limiting distribution. In other words, the RJMCMC can jump to another LCPA model with different dimensions (across-model), or it can simply update the parameters within the LCPA model with the same dimensions (within-model).
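The overall structure of such a sampler can be sketched as follows; this is an illustrative skeleton added here (not the author's implementation), and the two callables are hypothetical placeholders for the within-model update and the across-model proposals developed in Sections 3.2.1 and 3.2.2.

```python
import numpy as np

rng = np.random.default_rng(1)

def rjmcmc_skeleton(n_iter, init_state, within_model_update, propose_across_model):
    """Generic RJMCMC loop: at each iteration, either refresh the parameters of
    the current (C, S)-dimensional LCPA model or propose a jump to a model of a
    different dimension and accept it with the appropriate probability."""
    state = init_state
    visited = []
    for _ in range(n_iter):
        if rng.uniform() < 0.5:
            # Within-model move: data augmentation / Gibbs update of (gamma, eta, rho).
            state = within_model_update(state)
        else:
            # Across-model move: split or merge a class or a profile.
            proposal, log_accept_ratio = propose_across_model(state)
            if np.log(rng.uniform()) < min(0.0, log_accept_ratio):
                state = proposal
        visited.append((state["C"], state["S"]))
    # The relative frequencies of (C, S) in `visited` (after burn-in) estimate
    # the posterior over the number of classes and profiles.
    return visited
```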
3.2.1 Within-Model Move

In Bayesian analysis, our goal is to compute the posterior distribution P(Θ_{(C,S)} | y), where y represents the vectorized observed items from the sample and Θ_{(C,S)} denotes the parameter set corresponding to the C-class/S-profile LCPA model. The posterior distribution, however, is difficult to portray in the LCPA model. If the class and profile memberships for each individual were known, the augmented posterior P(Θ_{(C,S)} | y, z) would be easy to simulate. A within-model MCMC algorithm for the LCPA model is implemented as an iterative two-step procedure which can be regarded as a form of data augmentation (Tanner and Wong, 1987b) or Gibbs sampling (Gelfand and Smith, 1990).

Let z = (z_1, ..., z_n) indicate the individuals' class and profile memberships, where z_i is a (T+1)-dimensional array for the ith individual such that z_{i(s,c_1,...,c_T)} ∈ {0, 1} and Σ_{s=1}^{S} Σ_{c_1=1}^{C} ··· Σ_{c_T=1}^{C} z_{i(s,c_1,...,c_T)} = 1. That is, if individual i belongs to the profile s and has the class membership c = (c_1, ..., c_T) from the initial time t = 1 to time T, then z_{i(s,c_1,...,c_T)} equals 1, and 0 otherwise. In the first step of the MCMC procedure (the Imputation or I-step), we assume that at the (j+1)th cycle the value z_i^{(j+1)} is drawn from the conditional distribution P(U = s, C = c | y_i, γ^{(j)}, η^{(j)}, ρ^{(j)}) given the observed data and the previous parameter estimates. We then calculate the marginal indicators

\[
z^{(j+1)}_{is} = \sum_{c_1=1}^{C} \cdots \sum_{c_T=1}^{C} z^{(j+1)}_{i(s,c_1,\ldots,c_T)}, \qquad
z^{(t)(j+1)}_{i(s,c_t)} = \sum_{\{c_j:\, j \neq t\}} z^{(j+1)}_{i(s,c_1,\ldots,c_T)}, \qquad
z^{(t)(j+1)}_{ic_t} = \sum_{s=1}^{S} z^{(t)(j+1)}_{i(s,c_t)}
\]

for i = 1, ..., n. In the second step (the Posterior or P-step), we draw new random values for the parameters from the augmented posterior distribution

\[
\gamma^{(j+1)}, \eta^{(j+1)}, \rho^{(j+1)} \sim P(\gamma, \eta, \rho \mid \mathbf{y}, \mathbf{z}^{(j+1)}),
\qquad (3.1)
\]

which regards the memberships of class and profile as known. Repeating this two-step procedure creates a sequence of iterates,

\[
\{(\gamma^{(1)}, \eta^{(1)}, \rho^{(1)}; \mathbf{z}^{(1)}), (\gamma^{(2)}, \eta^{(2)}, \rho^{(2)}; \mathbf{z}^{(2)}), \ldots\},
\qquad (3.2)
\]

which converges to the stationary distribution P(Θ_{(C,S)} | y). This stream of parameter values (after a suitable burn-in period) is summarized in various ways to produce approximate Bayesian estimates, intervals, tests, etc.

It is convenient to choose priors that cause γ, η and ρ to be a posteriori independent given z. One way to achieve this is to make the priors independent of each other. In situations where the priors are independent, the joint posterior distribution for γ, η and ρ given z can be expressed as

\[
P(\gamma, \eta, \rho \mid \mathbf{y}, \mathbf{z}) \propto P(\gamma, \eta, \rho)\, P(\mathbf{y}, \mathbf{z} \mid \gamma, \eta, \rho)
\propto \left\{ P(\gamma) \prod_{i=1}^{n} \prod_{s=1}^{S} \gamma_s^{\,z_{is}} \right\}
\times \left\{ P(\eta) \prod_{i=1}^{n} \prod_{s=1}^{S} \prod_{t=1}^{T} \prod_{c=1}^{C} \left[ \eta^{(t)}_{c|s} \right]^{z^{(t)}_{i(s,c)}} \right\}
\times \left\{ P(\rho) \prod_{i=1}^{n} \prod_{t=1}^{T} \prod_{c=1}^{C} \prod_{m=1}^{M} \prod_{k=1}^{r_m} \left[ \rho_{mkt|c} \right]^{z^{(t)}_{ic} I(y_{imt}=k)} \right\},
\qquad (3.3)
\]

where P(γ), P(η) and P(ρ) are the prior distributions for γ, η and ρ; the expression is composed of three parts, one pertaining to the γ-parameters, one pertaining to the η-parameters, and the other pertaining to the ρ-parameters.

3.2.1.1 Prior Distribution Specification

In equation (3.3), the posteriors for the measurement parameters consist of the prior distribution and the multinomial distribution. The most straightforward choice for the prior distribution is the Dirichlet, since it is conjugate to the multinomial distribution. A random vector γ = (γ_1, ..., γ_S) is said to have a Dirichlet prior with hyperparameters α = (α_1, ..., α_S)
if the density of γ is

\[
P(\gamma \mid \alpha) = \frac{\Gamma\!\left(\sum_{l=1}^{S} \alpha_l\right)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_S)} \prod_{l=1}^{S} \gamma_l^{\alpha_l - 1}
\qquad (3.4)
\]

over the simplex γ_l ≥ 0, l = 1, ..., S, with Σ_{l=1}^{S} γ_l = 1, where Γ(·) denotes the gamma function. We prefer using noninformative priors to make sure that the data-driven inferences will not be drawn under the influence of an improper prior specification. One approach to choosing a noninformative prior is to adopt Jeffreys' invariance principle (Box and Tiao, 1992). Jeffreys' prior density is proportional to ∏_{l=1}^{S} γ_l^{-1/2}, which corresponds to the Dirichlet with α = (1/2, ..., 1/2). Similarly, we apply the Jeffreys priors to ρ_{mt|c} = (ρ_{m1t|c}, ..., ρ_{m r_m t|c}) and η^{(t)}_s = (η^{(t)}_{1|s}, ..., η^{(t)}_{C|s}).

3.2.1.2 Posterior Distribution

Adopting a Dirichlet prior for γ, η and ρ, the complete-data posterior distribution is also Dirichlet because of conjugacy. Thus, in the P-step, new random values for the parameters are drawn from the posterior distributions

\[
\rho_{m|c} \sim \mathrm{Dirichlet}\!\left(n_{m1|c} + \tfrac{1}{2}, \ldots, n_{m r_m|c} + \tfrac{1}{2}\right), \quad
\eta^{(t)}_{s} \sim \mathrm{Dirichlet}\!\left(n^{(t)}_{1|s} + \tfrac{1}{2}, \ldots, n^{(t)}_{C|s} + \tfrac{1}{2}\right), \quad
\gamma \sim \mathrm{Dirichlet}\!\left(n_1 + \tfrac{1}{2}, \ldots, n_S + \tfrac{1}{2}\right)
\qquad (3.5)
\]

for c_t = 1, ..., C, t = 1, ..., T, m = 1, ..., M, and s = 1, ..., S, where n^{(t)}_{c|s} = Σ_{i=1}^{n} z^{(t)}_{i(s,c)}, n_{mk|c} = Σ_{i=1}^{n} Σ_{t=1}^{T} z^{(t)}_{ic} I(y_{imt} = k), and n_s = Σ_{i=1}^{n} z_{is}. We repeat this two-step procedure to create a sequence of iterates converging to the stationary posterior distribution. This stream of parameter values (after a suitable burn-in period) is summarized in various ways to produce Bayesian inference and estimation.

3.2.2 Across-Model Move

For the across-model update, we assume that the sampling starts from the state Θ_{(C,S)} and then transits to Θ*_{(C*,S*)}. To match the dimensions, we generate a vector of continuous random variables u from some known distribution g, and the new proposed state is then constructed by using an invertible deterministic function h such that (Θ*_{(C*,S*)}, u*) = h(Θ_{(C,S)}, u), where u* are the suitable random variables, generated from some known distribution g*, that are required for the reverse move using the inverse function h* of h. The new state updates the current one with probability α(k, k*), where k = (C, S) and k* = (C*, S*). The probability α(·, ·) is usually termed the acceptance rate. Assuming that the probability of choosing the move from k to k* is denoted by q(Θ_k, Θ*_{k*}), a valid choice for the acceptance rate is

\[
\alpha(k, k^*) = \min\left\{ 1,\;
\frac{P(\Theta^{*}_{k^*} \mid \mathbf{y})\, q(\Theta^{*}_{k^*}, \Theta_k)\, g^{*}(\mathbf{u}^{*})}
     {P(\Theta_k \mid \mathbf{y})\, q(\Theta_k, \Theta^{*}_{k^*})\, g(\mathbf{u})}
\left| \frac{\partial(\Theta^{*}_{k^*}, \mathbf{u}^{*})}{\partial(\Theta_k, \mathbf{u})} \right| \right\};
\qquad (3.6)
\]

note that the last factor in (3.6) is the Jacobian arising from the transformation from (Θ_k, u) to (Θ*_{k*}, u*).

To keep the expressions simple, we consider the LCPA model with binary items (i.e., r_m = 2 for m = 1, 2, ..., M) with the invariance constraint on the ρ-parameters (i.e., ρ_{mk1|c} = ... = ρ_{mkT|c} = ρ_{mk|c}), although an extension to the model defined in (2.4) is straightforward. First, we select the number of classes C, and then we explore the number of profiles S with the number of classes fixed. We assume that the random variables C and S follow Poisson distributions with hyperparameter ς, with maximum values truncated at C0 and S0, respectively. Based on the simulation results, the choices of S0 and C0 have little impact on the performance of the RJMCMC and can be chosen as any reasonable values. In this study, we choose ς = 5, S0 = 10, and C0 = 10.
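The acceptance probability in (3.6) is usually evaluated on the log scale for numerical stability. The helper below is an added illustration (not the author's code); the caller is assumed to supply the log posterior values, the log move-choice probabilities q, the log densities of the auxiliary variables, and the log absolute Jacobian of the transformation h.

```python
import numpy as np

def log_acceptance_rate(log_post_new, log_post_cur,
                        log_q_reverse, log_q_forward,
                        log_g_star_u_star, log_g_u,
                        log_abs_jacobian):
    """Log of the acceptance probability alpha(k, k*) in (3.6)."""
    log_ratio = (log_post_new + log_q_reverse + log_g_star_u_star
                 - log_post_cur - log_q_forward - log_g_u
                 + log_abs_jacobian)
    return min(0.0, log_ratio)        # log alpha = min(0, log ratio)

# Example with made-up numbers: accept the move with probability exp(log_alpha).
log_alpha = log_acceptance_rate(-1520.3, -1523.8, np.log(0.5), np.log(0.5),
                                -2.1, -1.7, 0.35)
accept = np.log(np.random.default_rng(2).uniform()) < log_alpha
```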
In RJMCMC, there are two possible ways to accommodate dimensional changes in the latent classes: split (e.g., the C-class LCPA moves to the (C+1)-class LCPA) and merge (e.g., the C-class LCPA moves to the (C−1)-class LCPA). Usually, at each stage of dimensional change, the split and merge moves are equally preferred (i.e., b_c = P(split | C-class LCPA) = 0.5).

3.2.2.1 Jumping Rules for C

To update the number of classes, we first generate U from Uniform(0, 1). If U < b_c, the split move is executed. Conditional on the current number of profiles, we randomly pick a class c* from (1, ..., C) and split it into classes c*_1 and c*_2. To accommodate these classes, we draw u^{(t)}_s from Uniform(0, 1) and update the secondary measurement parameters as

\[
\eta^{(t)}_{c^{*}_1|s} = \eta^{(t)}_{c^{*}|s}\, u^{(t)}_{s}, \qquad
\eta^{(t)}_{c^{*}_2|s} = \eta^{(t)}_{c^{*}|s}\, (1 - u^{(t)}_{s})
\]

for s = 1, ..., S and t = 1, ..., T. For the primary measurement parameters, we generate u_m from Uniform(0, 1) and update them as

\[
\rho_{m1|c^{*}_1} = \rho_{m1|c^{*}} - u_m \kappa \sqrt{\frac{1-\bar{u}}{\bar{u}}}, \qquad
\rho_{m1|c^{*}_2} = \rho_{m1|c^{*}} + u_m \kappa \sqrt{\frac{\bar{u}}{1-\bar{u}}}
\]

for m = 1, ..., M, where \bar{u} = Σ_{s=1}^{S} Σ_{t=1}^{T} u^{(t)}_s / (ST) and κ is an adjustment term added to enhance the mixing performance. For the simulation study, we choose κ = 0.2, since all ρ_{mk|c} parameters are bounded and a large κ (e.g., κ ≥ 0.4) would easily violate the constraint and make the proposals invalid, whereas a small κ (e.g., κ ≤ 0.1) will lead to a low acceptance rate and slow down the chain. After the dimension-changing move is made, we need to reallocate the individuals whose class memberships were c* to the new classes. The class label for each individual is not known a priori, and we therefore follow an ad hoc rule to assign them to either c*_1 or c*_2 according to the proposed formula.

On the other hand, if the merge move is chosen, we randomly select two classes c*_1 and c*_2 from (1, ..., C) and merge them into c*. To preserve reversibility, we update the parameters as

\[
\eta^{(t)}_{c^{*}|s} = \eta^{(t)}_{c^{*}_1|s} + \eta^{(t)}_{c^{*}_2|s}, \qquad
\rho_{m1|c^{*}} = \bar{u}\, \rho_{m1|c^{*}_1} + (1 - \bar{u})\, \rho_{m1|c^{*}_2}
\]

for s = 1, ..., S, t = 1, ..., T, and m = 1, ..., M. In this case, the reallocation can be done simply by re-setting the memberships of those who were in class c*_1 or c*_2 to c*. With the updated parameters, we then calculate the acceptance rate α(C, C′) of the proposed C′-class LCPA model (C′ = C − 1 or C + 1) as in (3.6) by replacing k with (C, S) and k* with (C′, S).

3.2.2.2 Jumping Rules for S

Based on the selected number of classes C, we update the number of profiles S using the following procedure. At each stage of dimensional change, the split and merge moves are equally preferred (i.e., b_s = P(split | S-profile LCPA) = 0.5). We generate V from Uniform(0, 1). If the split move is chosen (i.e., V < b_s), we randomly select one profile s* from (1, ..., S), split it into s*_1 and s*_2, draw w from Uniform(0, 1), and set

\[
\gamma_{s^{*}_1} = \gamma_{s^{*}}\, w, \qquad \gamma_{s^{*}_2} = \gamma_{s^{*}}\, (1 - w).
\]

For the secondary measurement parameters, we update the parameters by setting them to be proportional to the odds ratio for s*:

\[
\log\!\left(\frac{\eta^{(t)}_{c|s^{*}_1}}{\eta^{(t)}_{C|s^{*}_1}}\right)
= \log\!\left(\frac{\eta^{(t)}_{c|s^{*}}}{\eta^{(t)}_{C|s^{*}}}\right) + \frac{\beta_{ct}}{w}, \qquad
\log\!\left(\frac{\eta^{(t)}_{c|s^{*}_2}}{\eta^{(t)}_{C|s^{*}_2}}\right)
= \log\!\left(\frac{\eta^{(t)}_{c|s^{*}}}{\eta^{(t)}_{C|s^{*}}}\right) - \frac{\beta_{ct}}{1 - w},
\]

where the β_{ct} are generated independently from N(0, σ²) for c = 1, ..., C−1 and t = 1, ..., T. In the simulation study, we choose σ = 1, and for those individuals whose profile memberships were s*, we relabel each of them as either s*_1 or s*_2 with probabilities based on the proposed formula.
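The profile split just described can be sketched as follows; this is an added illustration under assumed array layouts (eta indexed as [t, c, s], the last class C serving as the log-odds baseline exactly as in the formulas above), not the author's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def split_profile(gamma, eta, s_star, sigma=1.0):
    """Split profile s_star into two new profiles: gamma_{s*} is divided by
    w ~ Uniform(0,1) and the baseline-category log-odds of eta are shifted by
    +beta_{ct}/w and -beta_{ct}/(1-w) with beta_{ct} ~ N(0, sigma^2)."""
    T, C, S = eta.shape
    w = rng.uniform()
    gamma_new = np.concatenate([np.delete(gamma, s_star),
                                [gamma[s_star] * w, gamma[s_star] * (1.0 - w)]])
    beta = rng.normal(0.0, sigma, size=(C - 1, T))                 # beta_{ct}
    logit = np.log(eta[:, :C - 1, s_star] / eta[:, C - 1:, s_star])  # (T, C-1)
    logit1 = logit + beta.T / w
    logit2 = logit - beta.T / (1.0 - w)
    def to_probs(lo):
        # Invert the baseline-category logits back to probabilities over classes.
        e = np.exp(np.concatenate([lo, np.zeros((T, 1))], axis=1))
        return e / e.sum(axis=1, keepdims=True)
    eta_new = np.concatenate([np.delete(eta, s_star, axis=2),
                              to_probs(logit1)[:, :, None],
                              to_probs(logit2)[:, :, None]], axis=2)
    return gamma_new, eta_new, w, beta

# Toy usage: T = 4 time points, C = 3 classes, S = 3 profiles; split profile 1.
gamma = np.array([0.5, 0.3, 0.2])
eta = rng.dirichlet(np.ones(3), size=(4, 3)).transpose(0, 2, 1)    # (T, C, S)
gamma_new, eta_new, w, beta = split_profile(gamma, eta, s_star=1)
```

The returned (w, β) would then enter the Jacobian and proposal densities of the acceptance rate (3.6).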
If the merge move is selected, we randomly choose two profiles s*_1 and s*_2 and merge them into s* by setting

\[
\gamma_{s^{*}} = \gamma_{s^{*}_1} + \gamma_{s^{*}_2}, \qquad
\log\!\left(\frac{\eta^{(t)}_{c|s^{*}}}{\eta^{(t)}_{C|s^{*}}}\right)
= w \log\!\left(\frac{\eta^{(t)}_{c|s^{*}_1}}{\eta^{(t)}_{C|s^{*}_1}}\right)
+ (1 - w) \log\!\left(\frac{\eta^{(t)}_{c|s^{*}_2}}{\eta^{(t)}_{C|s^{*}_2}}\right)
\]

for c = 1, 2, ..., C−1 and t = 1, 2, ..., T. The reallocation can easily be done by relabeling those who were in profile s*_1 or s*_2 as s*. With the updated parameters, we then calculate the acceptance rate α(S, S′) of the proposed S′-profile LCPA model (S′ = S − 1 or S + 1) as in (3.6) by replacing k with (C, S) and k* with (C, S′). The details of the calculations of the acceptance rates for classes and profiles are summarized in the appendix.

3.3 Dirichlet Process

3.3.1 Preliminary

Typically, we assume that the data are drawn from an unknown distribution which we wish to estimate through the posterior distribution by specifying a prior. In most cases, the prior distribution is parametrically specified, and Bayesian computing helps to infer the estimate easily. However, assuming that the prior has a parametric form may not be suitable for the data. In the case of the Dirichlet process, the prior is a random distribution over probability measures. The Dirichlet process is currently one of the most popular nonparametric Bayesian models, since it can be used as a modern way to learn the dominating components economically.

By the definition of Ferguson (1973), G is said to be a Dirichlet process on a measurable space (B, A) with concentration parameter λ and base measure G0 if, for any finite measurable partition (B_1, B_2, ..., B_k) of B, the distribution of (G(B_1), ..., G(B_k)) is Dirichlet with parameters (λG0(B_1), ..., λG0(B_k)); we write G ∼ DP(G0, λ). The parameters G0 and λ play intuitive roles in the definition of the Dirichlet process. For any measurable set B ⊂ B, we have

\[
E(G(B)) = G_0(B), \qquad
\mathrm{Var}(G(B)) = \frac{G_0(B)\,(1 - G_0(B))}{\lambda + 1}.
\qquad (3.7)
\]

The concentration parameter λ is a positive scalar, with larger values of λ leading to a smaller variance of G. In other words, the Dirichlet process concentrates more around its mean, and as λ grows, G(B) converges weakly to G0(B) for any measurable set B in B. On the contrary, a Dirichlet process with a small value of λ favors extreme distributions, but the belief in the prior is easily overwritten by the data.

Since G is a prior distribution over the measure space, we can draw random samples from G itself. Let θ_1, ..., θ_n be a sequence of independent draws from G. We can derive the posterior distribution of G given the observed values of θ_1, ..., θ_n. Let A_1, ..., A_r be a finite measurable partition of B, and let m_k = #{i : θ_i ∈ A_k} be the number of observed values in A_k. Since the Dirichlet is the conjugate prior of the multinomial, we have

\[
(G(A_1), \ldots, G(A_r)) \mid \theta_1, \ldots, \theta_n \sim
\mathrm{Dirichlet}\!\left(\lambda G_0(A_1) + m_1, \ldots, \lambda G_0(A_r) + m_r\right).
\]

The above expression holds for all finite measurable partitions A_1, ..., A_r of B, and in light of Ferguson's definition, we know that the posterior of G must again be a Dirichlet process. Generally put, the posterior distribution of G can be written down as

\[
G \mid \theta_1, \ldots, \theta_n \sim
DP\!\left( \frac{\lambda}{\lambda + n}\, G_0 + \frac{n}{\lambda + n}\, \frac{\sum_{i=1}^{n} \delta_{\theta_i}}{n},\; \lambda + n \right).
\]

The posterior mean of G is the weighted average between the prior base distribution G0 and the empirical distribution (1/n) Σ_{i=1}^{n} δ_{θ_i}.
As λ → 0, the prior becomes noninformative, since the predictive distribution is controlled solely by the empirical average. That is to say, as the number of observations outnumbers the concentration parameter (n ≫ λ), the posterior is constructed by the observations themselves, which can be regarded as a data-driven approximation to the true underlying model. This explains why the Dirichlet process is consistent in recovering the model as the sample size increases.

Consider again drawing G ∼ DP(G0, λ) and drawing an i.i.d. sequence θ_1, ..., θ_n ∼ G. The predictive distribution for θ_{n+1}, conditioned on θ_1, ..., θ_n with G marginalized out, has the following formulation:

\[
\theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim
\frac{1}{\lambda + n}\left( \lambda G_0 + \sum_{i=1}^{n} \delta_{\theta_i} \right).
\qquad (3.8)
\]

The sequence of predictive distributions is called the Blackwell-MacQueen urn scheme (Blackwell and MacQueen, 1973). Regardless of the question about the existence of the Dirichlet process, we can calculate the joint probability of the sequence (θ_1, ..., θ_n) by the conditional rule

\[
P(\theta_1, \ldots, \theta_n) = \prod_{i=1}^{n} P(\theta_i \mid \theta_1, \ldots, \theta_{i-1}),
\qquad (3.9)
\]

where the conditional probability is defined as in (3.8). It is obvious that the order of the sampling will not change the probability, and we therefore ensure that the sequence of drawn samples is infinitely exchangeable. More precisely, we say that (θ_1, θ_2, ...) is an infinitely exchangeable sequence of random variables if, for any n, the joint probability P(θ_1, ..., θ_n) is invariant to permutations of the indices. That is, for any permutation σ,

\[
P(\theta_1, \ldots, \theta_n) = P(\theta_{\sigma(1)}, \ldots, \theta_{\sigma(n)}).
\qquad (3.10)
\]

According to de Finetti's theorem (de Finetti, 1931) and the generalized version of his theorem (Hewitt and Savage, 1955), for any infinitely exchangeable sequence (θ_1, θ_2, ...) there exists a probability measure µ on the set of probability measures G(·) such that

\[
P(\theta_1, \ldots, \theta_n) = \int \prod_{i=1}^{n} G(\theta_i) \, d\mu(G).
\qquad (3.11)
\]

In our study, the prior over the random distribution is precisely the Dirichlet process DP(G0, λ), thus establishing existence.

An important aspect of the Dirichlet process is its clustering property. Based on the predictive form formulated as (3.8), the first sample is drawn from the distribution G0, and once θ_1, ..., θ_n are observed, the next sample θ_{n+1} is either drawn from G0 with probability λ/(λ+n) or assigned the same value as θ_i for some i = 1, ..., n with probability 1/(λ+n). After the sequence of draws is long enough, we note that G is a weighted sum of point masses and that draws with the same value can be grouped together.

3.3.1.1 Sethuraman Stick-Breaking Representation

We have already mentioned that the Dirichlet process is an almost surely discrete random probability measure which is composed of a weighted sum of point masses. Sethuraman made this precise by providing a constructive definition of the Dirichlet process called the stick-breaking construction (Sethuraman, 1994). If G ∼ DP(G0, λ), it can be constructed as follows:

\[
\beta_k \sim \mathrm{Beta}(1, \lambda), \qquad
\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l), \qquad
G(\cdot) = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta^{*}_{k}}(\cdot), \quad \text{where } \theta^{*}_{k} \sim G_0.
\qquad (3.12)
\]

The construction of π can be understood through breaking a stick as follows. We imagine a stick of length 1; we break it at β_1 and assign the weight π_1 to the piece we just broke off. We then repetitively break the remaining portion to obtain π_2, π_3 and so forth. The stick-breaking distribution over π is sometimes written π ∼ GEM(λ) (GEM stands for Griffiths, Engen and McCloskey).
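A minimal sketch of the construction in (3.12), truncated at a finite number of sticks, is given below; this is an added illustration (the function names and the normal base measure are assumptions), not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(4)

def stick_breaking_weights(lam, truncation):
    """First `truncation` stick-breaking weights of (3.12):
    beta_k ~ Beta(1, lambda), pi_k = beta_k * prod_{l<k} (1 - beta_l)."""
    betas = rng.beta(1.0, lam, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

def draw_truncated_dp(lam, base_sampler, truncation=100):
    """Approximate draw G = sum_k pi_k * delta_{theta*_k} with theta*_k ~ G0;
    `base_sampler` stands in for the base measure G0."""
    weights = stick_breaking_weights(lam, truncation)
    atoms = np.array([base_sampler() for _ in range(truncation)])
    return weights, atoms

# Example with a standard normal base measure and lambda = 2.
weights, atoms = draw_truncated_dp(2.0, lambda: rng.normal(), truncation=50)
print(weights[:5], weights.sum())   # weights decay quickly and sum to just under one
```

Keeping only the first M weights and renormalizing is exactly the almost sure truncation discussed in Section 3.3.1.3.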
To summarize, the Sethuraman representation indicates that the unknown distribution G can be recovered if the infinite sequence of the θ*_k along with their corresponding weights were available. Because of its simple representation, Sethuraman's construction has led to a variety of extensions as well as novel inference techniques for the Dirichlet process.

3.3.1.2 Ferguson Gamma Process Representation

Before Sethuraman's representation, Ferguson provided a representation for the gamma process based on the arrival times of a homogeneous Poisson process (Ferguson and Klass, 1972). Let the E_k be independent and identically distributed Exp(1) random variables and Γ_k = E_1 + ··· + E_k. Let the Z_k be i.i.d. elements, independent of the Γ_k, with probability distribution G0 over (B, A). Then Ferguson showed that the Dirichlet process with parameters (G0, λ) can be described as the random probability measure

\[
G(\cdot) = \sum_{k=1}^{\infty} N^{-1}(\Gamma_k)\, \delta_{Z_k}(\cdot) \Big/ \sum_{l=1}^{\infty} N^{-1}(\Gamma_l),
\qquad (3.13)
\]

where

\[
N(x) = \lambda \int_{x}^{\infty} \frac{e^{-u}}{u}\, du, \qquad x > 0,
\qquad (3.14)
\]

is the Lévy measure of a gamma random variable with shape parameter λ > 0.

In Sethuraman's stick-breaking construction (3.12), the GEM weights are defined as

\[
\pi_1 = \beta_1 \quad \text{and} \quad \pi_k = (1 - \beta_1) \cdots (1 - \beta_{k-1})\, \beta_k \quad \text{for } k \ge 2,
\qquad (3.15)
\]

and they are related to the Poisson process. By ordering them such that π_(1) ≥ π_(2) ≥ ···, the two sets of weights are also related by the following form (Patil and Taillie, 1977; Perman and Pitman, 1992; Pitman and Yor, 1997):

\[
(\pi_{(1)}, \pi_{(2)}, \cdots) =
\left( \frac{N^{-1}(\Gamma_1)}{\sum_{l=1}^{\infty} N^{-1}(\Gamma_l)},\;
\frac{N^{-1}(\Gamma_2)}{\sum_{l=1}^{\infty} N^{-1}(\Gamma_l)},\; \cdots \right),
\qquad (3.16)
\]

where N is the Lévy measure defined in (3.14). Since there is no closed-form solution for the inverse Lévy measure, and since each π_(i) requires the calculation of an infinite sum, the Ferguson representation is not widely practicable.

3.3.1.3 Almost Sure Truncation of DP(G0, λ) Measures

We already know that a Dirichlet process can be formulated as a weighted sum of point masses, and it is natural to ask whether a truncated form of the Dirichlet process can replace the original Dirichlet process without losing accuracy. Let DP_M(G0, λ) be the approximate form of DP(G0, λ) obtained by discarding the terms M+1, M+2, ..., and denote it by P_M,

\[
P_M(\cdot) = \sum_{k=1}^{M} \pi_k\, \delta_{\theta^{*}_{k}}(\cdot), \qquad \theta^{*}_{k} \sim G_0,
\qquad (3.17)
\]

with π_M = 1 − π_1 − ··· − π_{M−1}. This corresponds to setting β_M = 1 so that Σ_{k=1}^{M} π_k = 1. As shown in Ishwaran and Zarepour (2000), the DP_M(G0, λ) random measure can be used to approximate integrable functionals of the Dirichlet process:

\[
DP_M(G_0, \lambda)(g) \rightarrow DP(G_0, \lambda)(g)
\qquad (3.18)
\]

for any arbitrary bounded and continuous real-valued function g. The key property (3.18) can be exploited to describe an efficient Gibbs sampler for Bayesian nonparametric problems in which P_M is used as an approximating prior to the Dirichlet process. Consider the following hierarchical setup, where the prior for the measure G follows the truncated Dirichlet process:

\[
Y_i \mid \theta_i \sim f(y_i \mid \theta_i), \qquad
\theta_i \mid G \sim G, \qquad
G \sim P_M.
\qquad (3.19)
\]

The marginal density of the observed data Y is therefore written as

\[
m_M(\mathbf{Y}) = \int \prod_{i=1}^{n} \left\{ \int f(y_i \mid \theta_i)\, dG(\theta_i) \right\} dP_M(G).
\qquad (3.20)
\]
If m_∞ is the marginal density of Y with the Dirichlet process as the prior for G, then, as Ishwaran showed (Ishwaran and James, 2001, 2002),

\[
\int \left| m_M(\mathbf{Y}) - m_\infty(\mathbf{Y}) \right| d\mathbf{Y}
\le 4 \left[ 1 - E\!\left( \sum_{k=1}^{M-1} \pi_k \right)^{\!n} \right]
\approx 4n \exp\!\left(-(M-1)/\lambda\right).
\qquad (3.21)
\]

The difference between the two marginal densities, based on the exact and the approximate sum representations respectively, has been bounded by Ishwaran as follows:

\[
\int \left| m_M(\mathbf{Y}) - m_\infty(\mathbf{Y}) \right| d\mathbf{Y}
= \int \left| \int \prod_{i=1}^{n} f(y_i \mid \theta_i) \left( \pi_M(d\boldsymbol{\theta}) - \pi_\infty(d\boldsymbol{\theta}) \right) \right| d\mathbf{Y}
\le \int \int \prod_{i=1}^{n} f(y_i \mid \theta_i)\, d\mathbf{Y} \left| \pi_M(d\boldsymbol{\theta}) - \pi_\infty(d\boldsymbol{\theta}) \right|
= 2 D(\pi_M, \pi_\infty),
\qquad (3.22)
\]

where D is the total variation distance between the two probability measures π_M and π_∞. Let k_i be the indicator of the group membership of θ_i, that is, θ_i = θ*_{k_i}. The sampled values of θ under π_M and π_∞ are identical if each k_i is sampled from a term no larger than the Mth term. Thus,

\[
D(\pi_M, \pi_\infty) \le 2 \left( 1 - \pi_M\{ k_i \le M,\ i = 1, 2, \ldots, n \} \right)
= 2 \left[ 1 - E\!\left( \sum_{k=1}^{M-1} \pi_k \right)^{\!n} \right]
\approx 2n \exp\!\left(-(M-1)/\lambda\right),
\qquad (3.23)
\]

where the rightmost approximation follows by observing that

\[
\sum_{k=1}^{M-1} \pi_k \overset{D}{=} 1 - \exp(-E_1/\lambda)\exp(-E_2/\lambda)\cdots\exp(-E_{M-1}/\lambda)
\approx 1 - \exp\!\left(-(M-1)/\lambda\right),
\qquad (3.24)
\]

where E_1, ..., E_{M−1} are i.i.d. Exp(1) random variables.

Another method of comparing P_M and DP(G0, λ) is to compare their clustering behavior under sampling. Let ϕ = {ϕ_1, ..., ϕ_k} denote the set of distinct θ_i's, where k ≤ n is the number of distinct elements in the vector θ = (θ_1, ..., θ_n), and let s_i be the indicator defined by s_i = j if θ_i = ϕ_j for i = 1, ..., n. Let n_j be the number of s_i = j, and let D_M and D_∞ equal the number of distinct values in Y when sampled under P_M and DP(G0, λ). Ishwaran proved that

\[
\frac{M!}{M^{k}\,(M-k)!} \;\le\; \frac{P(D_M = k)}{P(D_\infty = k)} \;\le\; n\lambda^{k}/M, \qquad k = 1, \ldots, \min(n, M).
\qquad (3.25)
\]

Since both sides of inequality (3.25) converge to one for each k as M → ∞, the two distributions agree in the limit by the squeeze theorem. Therefore, the M-truncation prior P_M has been shown to have clustering behavior similar to that of DP(G0, λ).

As shown in Corollary 20 of Pitman (1996), the posterior distribution P_M(· | θ) is the random probability measure

\[
P_M(\cdot \mid \boldsymbol{\theta}) = \sum_{j=1}^{M} \pi^{*}_{j}\, \delta_{\theta^{*}_{j}}(\cdot) + \pi^{*}_{M+1}\, P_M(\cdot),
\qquad (3.26)
\]

where θ*_1, ..., θ*_M are the distinct values in the full sequence θ_1, ..., θ_n, occurring with frequencies n*_j, and

\[
(\pi^{*}_{1}, \cdots, \pi^{*}_{M}, \pi^{*}_{M+1}) \sim
\mathrm{Dirichlet}\!\left( n^{*}_{1} + \lambda/n, \cdots, n^{*}_{M} + \lambda/n, \lambda(1 - M/n) \right).
\qquad (3.27)
\]

Let φ be a nonnegative or integrable function; the posterior mean of φ under P_M(· | Y) is characterized by

\[
\int \varphi(G)\, P_M(dG \mid \mathbf{Y}) =
\int \int \varphi(G)\, P_M(dG \mid \theta_1, \ldots, \theta_n)\, \mu(d\theta_1, \ldots, d\theta_n \mid \mathbf{Y}).
\qquad (3.28)
\]

The interior integral on the right-hand side of (3.28) can be expressed by averaging over the values of θ_1, ..., θ_n drawn from the Gibbs sampler,

\[
\int \varphi(G)\, P_M(dG \mid \theta_1, \ldots, \theta_n) =
\sum_{j=1}^{M} \frac{n^{*}_{j} + \lambda/n}{\lambda + n}\, \varphi(\delta_{\theta^{*}_{j}})
+ \frac{\lambda(1 - M/n)}{\lambda + n}\, \varphi(G_0).
\qquad (3.29)
\]

We can implement the preceding scheme (3.28) to estimate various posterior mean functionals by simplifying (3.28) with (3.29). There are many computational and theoretical advantages to using the approximate sum representation, since the simplified mathematical representations can be computed far more efficiently, especially in cases where a roundabout working strategy would otherwise be needed.

3.3.2 Dirichlet Process Mixture Model

To be consistent with the organization of Chapter 2, we begin with the profile identification defined in (2.1). Let c_it and s_i denote the ith individual's class membership at time t and his or her profile membership, respectively.
The class memberships over time, c_i = (c_{i1}, ..., c_{iT}), are then distributed according to the product of multinomial probability densities with the form

\[
f(\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i}) = P(\mathbf{C} = \mathbf{c}_i \mid U = s_i)
= \prod_{t=1}^{T} \prod_{c_t=1}^{C} \left[ \eta^{(t)}_{c_t|s_i} \right]^{I(c_{it} = c_t)},
\]

where η_{s_i} = (η^{(1)}_{1|s_i}, ..., η^{(T)}_{C|s_i}) indicates the vector of the secondary measurement parameters associated with profile s_i. Sampling from a Dirichlet process mixture (DPM) can be schemed by forming G with a countably infinite number of point masses from G0 and drawing the parameters η_{s_i} from G:

\[
\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i} \sim f(\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i}), \qquad
\boldsymbol{\eta}_{s_i} \mid G \sim G, \qquad
G \sim DP(G_0, \lambda).
\]

By marginalizing over the prior for G, the conditional distribution of η_{s_i} given the others is

\[
\boldsymbol{\eta}_{s_i} \mid \boldsymbol{\eta}_{s_1}, \ldots, \boldsymbol{\eta}_{s_{i-1}} \sim
\frac{1}{i - 1 + \lambda} \sum_{j=1}^{i-1} \delta(\boldsymbol{\eta}_{s_j})
+ \frac{\lambda}{i - 1 + \lambda}\, G_0,
\qquad (3.30)
\]

where δ(η_{s_j}) is the distribution concentrated at the point η_{s_j}. Equation (3.30) is associated with the Pólya-urn representation (Blackwell and MacQueen, 1973). In LCPA, this formulation only requires the marginal distribution because equal values of η_{s_i} can be grouped together to form a profile. The evolution of η_{s_1}, ..., η_{s_n} in (3.30) can also be formed by taking the number of components S in the finite mixture model to infinity:

\[
\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i} \sim f(\mathbf{c}_i \mid \boldsymbol{\eta}_{s_i}), \qquad
\boldsymbol{\eta}_{s_i} \sim G_0, \qquad
s_i \mid \boldsymbol{\gamma} \sim \mathrm{Multinomial}(\gamma_1, \ldots, \gamma_S), \qquad
\boldsymbol{\gamma} \sim \mathrm{Dirichlet}(\lambda/S, \ldots, \lambda/S).
\qquad (3.31)
\]

By integrating over the mixing proportions γ = (γ_1, ..., γ_S), the marginal probability of s_i given the profile memberships preceding the ith individual has the following form:

\[
P(s_i = s \mid s_1, \ldots, s_{i-1})
= \frac{P(s_1, \ldots, s_{i-1}, s_i = s)}{P(s_1, \ldots, s_{i-1})}
= \frac{n_{i,s} + \lambda/S}{i - 1 + \lambda},
\]

where n_{i,s} is the number of s_j for j < i that equal s. Let us imagine that the ith individual is the last of the n observations; then the conditional probability for s_i given s_{−i} = (s_1, ..., s_{i−1}, s_{i+1}, ..., s_n), c_i and η = (η_1, ..., η_S) can be obtained by multiplying in the likelihood f(c_i | η_s) as follows:

\[
P(s_i = s \mid \mathbf{s}_{-i}, \mathbf{c}_i, \boldsymbol{\eta})
= b\, \frac{n_{-i,s} + \lambda/S}{n - 1 + \lambda}\, f(\mathbf{c}_i \mid \boldsymbol{\eta}_s),
\]

where n_{−i,s} is the number of s_j for j ≠ i that are equal to s, and b is the appropriate normalizing constant. If S goes to infinity,

\[
P(s_i = s \mid \mathbf{s}_{-i}, \mathbf{c}_i, \boldsymbol{\eta}) \rightarrow
\begin{cases}
b\, \dfrac{n_{-i,s}}{n - 1 + \lambda}\, f(\mathbf{c}_i \mid \boldsymbol{\eta}_s) & \text{if } s = s_j \text{ for some } j \neq i, \\[8pt]
b\, \dfrac{\lambda}{n - 1 + \lambda} \displaystyle\int f(\mathbf{c}_i \mid \boldsymbol{\eta}_s)\, dG_0(\boldsymbol{\eta}_s) & \text{if } s \neq s_j \text{ for all } j \neq i.
\end{cases}
\qquad (3.32)
\]

Here, η is the set of η_s currently associated with some observations, because we cannot explicitly represent the infinite number of η_s as S goes to infinity.

Next, we specify how to perform Gibbs sampling on DPM models to select the number of profiles in the LCPA model. Let the current profile state consist of s = (s_1, ..., s_n). We can repeatedly sample with the following three-step procedure:

1. For i = 1, ..., n, remove the profile s_i from the current state if s_i is not associated with any other observation (i.e., n_{−i,s_i} = 0).
2. Generate a new profile membership s_i from the equation defined in (3.32). If the new s_i is not associated with any other observation, generate a set of values for η^{(t)}_{s_i} for all t from the posterior distribution defined in (3.5). Repeat this step for i = 1, ..., n.
3. For all s ∈ (s_1, ..., s_n), generate a set of new values for η^{(t)}_{s} for all t from the posterior distribution.

The procedure described above points out that the ith individual explores a new profile with probability λ/(i − 1 + λ), and thus the average number of profiles is expected to grow logarithmically with the sample size n, as O(λ log n).
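Step 2 of the procedure above can be sketched as follows; this is an added illustration under assumed argument names (the likelihood terms are supplied by the caller), not the author's implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_profile(i, s, lam, log_lik_existing, log_marglik_new):
    """Draw a new profile label for individual i following (3.32).

    s                 : current integer profile labels for all n individuals
    log_lik_existing  : dict {existing label: log f(c_i | eta_label)}
    log_marglik_new   : log of  integral f(c_i | eta) dG0(eta)  (or a Monte
                        Carlo estimate) for a brand-new profile
    Step 1 is assumed to have already removed any profile occupied only by i,
    so every count below is positive.
    """
    n = len(s)
    labels = list(log_lik_existing.keys())
    counts = np.array([np.sum(np.delete(np.asarray(s), i) == lab) for lab in labels],
                      dtype=float)
    log_w = np.log(counts) - np.log(n - 1 + lam) + \
            np.array([log_lik_existing[lab] for lab in labels])
    log_w_new = np.log(lam) - np.log(n - 1 + lam) + log_marglik_new
    log_w_all = np.append(log_w, log_w_new)
    probs = np.exp(log_w_all - log_w_all.max())     # normalize in log space
    probs /= probs.sum()
    choice = rng.choice(len(probs), p=probs)
    return labels[choice] if choice < len(labels) else max(labels) + 1
```

Working on the log scale avoids underflow, since f(c_i | η_s) is a product of T·C probabilities.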
It has also been proved that the number of profiles converges to infinity almost surely as $n$ increases (Korwar and Hollander, 1973). In other words, if $\lambda$ is fixed at a constant with a large sample size, an overfitting problem might occur. In Liu (1996), $\lambda$ is estimated by its ML estimate using sequential imputation, but this demands many conditional probability calculations that rarely have closed parametric forms. Therefore, we consider a noninformative gamma prior, which assigns nearly equal probabilities to all possibilities, and update $\lambda$ with the procedures suggested by Escobar and West (1995). Here we briefly describe how to incorporate the concentration parameter into the Gibbs sampling update procedure. According to Antoniak (1974), given the concentration parameter $\lambda$ and sample size $n$, the uncertainty about the number of components $k$ can be represented with the probability density

\[
P(k \mid \lambda, n) = C_n(k)\, n!\, \lambda^{k}\, \frac{\Gamma(\lambda)}{\Gamma(\lambda + n)},
\tag{3.33}
\]

where $C_n(k)$ does not involve $\lambda$. Assume we have already sampled $\theta_i$ for $i = 1, \ldots, n$ by the DPM algorithm and classified them into groups; the value of $k$ is then the number of groups in the current configuration of the data. Suppose $\lambda \sim G(a, b)$, a gamma prior with shape $a$ and rate $b$. For $\lambda \ge 0$, we have

\[
\frac{\Gamma(\lambda)}{\Gamma(\lambda + n)} = \frac{(\lambda + n)\, B(\lambda + 1, n)}{\lambda\, \Gamma(n)},
\tag{3.34}
\]

where $B(\cdot, \cdot)$ is the Beta function. Then the posterior of $\lambda$ given $k$ has the following form:

\[
P(\lambda \mid k) \;\propto\; P(\lambda)\, \lambda^{k-1} (\lambda + n)\, B(\lambda + 1, n)
\;\propto\; P(\lambda)\, \lambda^{k-1} (\lambda + n) \int_0^1 x^{\lambda} (1 - x)^{n-1}\, dx.
\tag{3.35}
\]

This implies that $P(\lambda \mid k)$ is the marginal distribution from the joint distribution of $\lambda$ and a continuous variable $\nu$ such that

\[
P(\lambda, \nu \mid k) \;\propto\; P(\lambda)\, \lambda^{k-1} (\lambda + n)\, \nu^{\lambda} (1 - \nu)^{n-1}.
\tag{3.36}
\]

Since we assume $\lambda$ has $G(a, b)$ as its prior, we can write down the conditional probabilities for $\lambda$ and $\nu$ as follows:

\[
P(\nu \mid \lambda, k) \;\propto\; \nu^{\lambda} (1 - \nu)^{n-1},
\qquad
P(\lambda \mid \nu, k) \;\propto\; \lambda^{a+k-1} (e^{-b} \nu)^{\lambda} + n\, \lambda^{a+k-2} (e^{-b} \nu)^{\lambda}.
\tag{3.37}
\]

Given the value of $\lambda$, the value of $\nu$ is therefore updated by drawing from $B(\lambda + 1, n)$, and the conditional distribution of $\lambda$ given $\nu$ reduces to a mixture of two gamma densities,

\[
P(\lambda \mid \nu, k) \;\sim\; \pi_\nu\, G\bigl(a + k,\, b - \log(\nu)\bigr) + (1 - \pi_\nu)\, G\bigl(a + k - 1,\, b - \log(\nu)\bigr),
\tag{3.38}
\]

with $\pi_\nu$ satisfying $\pi_\nu / (1 - \pi_\nu) = (a + k - 1)/\{n(b - \log(\nu))\}$. At each iteration of Gibbs sampling, we can therefore sample $\lambda$ from this mixture of gamma distributions given the currently sampled $\nu$ and $k$.

3.33 Dynamic Dirichlet Process Mixture Model

In scenarios where the data are thought to be produced by a stage-sequential process, the hidden Markov model is certainly the most suitable method for describing the data. However, the task of estimating the number of stages (i.e., classes) is not covered in the hidden Markov model setting. We therefore extend the hidden Markov model by assuming that each row of the transition matrix follows a Dirichlet process, so as to explore the stage-sequential progression and infer the number of classes afterwards. To be precise, we treat the data observed at each time point as a group and assume the observations are exchangeable within that time point. While each mixture model has mixing proportions specific to its group, we require that the different groups share the same set of mixture components. The idea is that while different groups have different characteristics given by different combinations of mixing proportions, using the same set of mixture components allows statistical strength to be shared across groups and allows generalization to new groups. This construction is generally known as the hierarchical Dirichlet process (HDP).
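The concentration-parameter update of Section 3.32 is reused in Steps 2 and 4 of the dynamic sampler described below. A minimal sketch of one such draw, following (3.37)-(3.38) and assuming the current number of occupied components k and a Gamma(a, b) prior, is given here; the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_concentration(lam, k, n, a=1.5, b=1.0):
    """One Escobar-West style update of the DP concentration parameter.

    lam : current value of the concentration parameter
    k   : current number of occupied components (e.g., profiles)
    n   : sample size
    a,b : parameters of the Gamma(a, b) prior, e.g. the diffuse G(1.5, 1)
    """
    # Auxiliary variable nu | lam, k ~ Beta(lam + 1, n), cf. (3.37).
    nu = rng.beta(lam + 1.0, n)
    # Mixture weight for the two gamma components, cf. (3.38).
    odds = (a + k - 1.0) / (n * (b - np.log(nu)))
    pi_nu = odds / (1.0 + odds)
    # Sample lam from the gamma mixture (numpy's gamma uses scale = 1/rate).
    rate = b - np.log(nu)
    if rng.random() < pi_nu:
        return rng.gamma(a + k, 1.0 / rate)
    return rng.gamma(a + k - 1.0, 1.0 / rate)
```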
Overall speaking, The HDP allows the data at a specific time point to have similar structure by providing global layer of hierarchy. The applications of HDP are presented in literatures (Ahmed and Xing, 2008; Xu et al., 2008). In LCPA models, we adopt the concept of HDP to identify the number of classes. Since the number of classes is not necessarily constant over time, we call this process as dynamic Dirichlet process mixture (DDPM) model. If each individual’s class membership over time (i.e., cit ) could be observed, we would like a joint model for class membership across T time points, ∏n i=1 P (C1 = ci1 , . . . , CT = ciT ), where all possible sequences of class membership can be re-expressed as frequencies in a contingency table with C T cells for n individuals. This table can produce a reasonable inference about the important aspects of the common progression of class membership over time. In LCPA, we assume there is an extra level of latent variable (i.e., class profile) by which the dependency among classes can be explained. Under this assumption, the classes are conditionally independent given the profile membership. The class memberships, however, are unobserved directly, and it is not possible to learn the class profile before the class memberships over time are known. In addition, it is not possible to know the true structure of dependency among class memberships over time. The LCPA can summarize 39 the dependency among classes by a small number of class profiles, without regard for the structure of dependency. Therefore, we implement DDPM by generating dependency among classes from the first-order Markov chain and summarizing the dependency through the class profiles. Let yi,t be the observed values for M items of the ith individual at time t. If his or her class membership at time t, ci,t , could be observed, the distribution of yi,t is the product of multinomial probability densities with the form f (yi,t | ρt|c ) = P (Yt = yi,t | Ct = ci,t ) i,t ]I(y M ∏ ∏ rm [ imt =k) = , ρmkt|c i,t m=1 k=1 where ρt|c represents the vector of all ρ-parameters associated with time t and class i,t (t) = P (Ct = ct | Ct−1 = ct−1 ) represent the transition ct |ct−1 probability of class membership from time t − 1 to t. Suppose we have C classes over T membership ci,t . Let τ time periods, there is a C × C transition probability matrix τ (t) with all elements of each row of τ (t) summed to one for t = 2, . . . , T . Thus, given the previous class membership (t) ct−1 , the row vector τ c , can be used as the mixing proportions for the current class t−1 membership. As described before, sampling from a Dirichlet process mixture can be schemed by taking 40 the number of classes C to infinity: yi,t | ρt|c ∼ f (yi,t | ρt|c ) i,t i,t (t) (t) (t) ci,t | τ c ∼ Multinomial(τ ,...,τ ) t−1 1|ct−1 C|ct−1 ρt|c ∼ G0 i,t (t) (t−1) (t−1) τc ∼ Dirichlet(n1 + α/C, . . . , nC + α/C), t−1 (t−1) where nc (3.39) denotes the number of individuals who were assigned class c at time t − 1, (t) α is the concentration parameter. By integrating over the mixing proportions τ c , the t−1 prior for ci,t , as conditional probability, has the following form (t−1) (t) nc + ni,c + α/C t t P (ci,t = ct | c1,t−1 , . . . , cn,t−1 , c1,t , . . . , ci−1,t ) = , n+i−1+α (t−1) (t) nc + ni,c + α/C t t P (ci,t = ct | c1,t−1 , . . . , cn,t−1 , c1,t , . . . , ci−1,t ) = , n+i−1+α (t) i,ct is the number of cj,t for j < i that equals to ct at time t. 
Let us imagine that the ith individual is the last of the n observations, then the conditional probabilities for ci,t where n given ct−1 = (c1,t−1 , . . . , cn,t−1 ), c−i,t = (c1,t , . . . , ci−1,t , ci+1,t , . . . , cn,t−1 ), and ρt = (ρt|1 , . . . , ρt|C ) can be obtained by multiplying the likelihood, f (yi,t | 41 ρt|c ), as follows: t (t−1) (t) nc + n−i,c + α/C t t f (yi,t | ρt|c ), P (ci,t = ct | ct−1 , c−i,t , yi,t , ρt ) = b t 2n − 1 + α (t) −i,ct is the number of cj,t for j ̸= i that are equal to ct at time t, and b is the appropriate normalizing constant. If C goes to infinity, where n P (ci,t = ct | ct−1 , c−i,t , yit , ρt ) →    n(t−1) +n(t)   c −i,ct   b t f (yit | ρt|c )   2n−1+α  t   if ct is equal to cj,t−1 or cj,t for some j ̸= i (3.40)      if ct is not equal to cj,t−1  n(t−1) +n(t)   −i,ct ∫  b ct  f (yit | ρt|c )dG0 (ρt|c )  2n−1+α t t and cj,t for all j ̸= i Here, ρt is the set of ρt|c currently associated with some observations because we cant not explicitly represent the infinite number of ρt|c as C goes to infinity. Therefore, we t can update the current class membership by sampling ci,t over time from the conditional probability P (ci,t | ct−1 , c−i,t , ct+1 , yi,t , ρt ) = P (cit = ct | ct−1 , c−i,t , yit , ρt )P (ct+1 | ct ). (3.41) Note that the first factor in (3.41) was presented in Equation (3.40). The second factor in (3.41) can be computed by integrating over the mixture weights τ (t+1) which depends 42 on the number of individuals who were assigned to each of identified classes at time t (i.e., (t) (t) n1 , . . . , nC ). Here, C is the number of existing classes at time t and t + 1. Then, it is straightforward to show that: ∑ ∏C (t) (t) (t+1) Γ( C nc + α/C) Γ(nc + nc + α/C) P (ct+1 | ct ) = ∏ c=1 , × c=1 ∑C (t) (t+1) (t) C Γ(n + α/C) Γ( + α/C) c c=1 nc + nc c=1 where Γ is a typical gamma function. We consider Gamma distribution with shape a and rate b, G(a, b), as the prior for the concentration parameters α and λ used in class and profile learning, respectively. Given the previous constructions, we specify the Gibbs sampling by following four-step procedure. 1. For t = 1, . . . , T , generate a new class membership ci,t from the equation (3.41). If the new ci,t is not associated with any other observation (i.e., n (t) (t−1) −i,ci,t = n−i,ci,t = 0), generate a set of values for ρt|c from the posterior distribution defined in (3.5). i,t Repeat this step for i = 1, . . . , n. 2. To update concentration parameter, α, we sample x and u from Beta(α + 1) and Uniform(0, 1), respectively. We then calculate pα = (a + µα − 1)/(a + µα − 1 + n(b − log(x))), where µα is the average of numbers of classes identified at each time. If u < pα , we update α by drawing a sample from G(a + µα , b − log(x)) or draw a sample from G(a + µα − 1, b − log(x)), otherwise. 3. With the class membership identified in Step 1, ci = (ci,1 , . . . , ci,T ) for i = 1, . . . , n, employ DPM to identify the number of profiles by following the three-step procedure described above. 43 4. We sample x and u from Beta(λ + 1) and Uniform(0, 1), respectively and calculate pλ = (a + S − 1)/(a + S − 1 + n(b − log(x))), where S is the number of identified profiles in Step 3. If u < pλ , we update λ by drawing a sample from G(a + S, b − log(x)) or draw a sample from G(a + S − 1, b − log(x)), otherwise. 44 Chapter 4 Simulation In this section, we evaluate the performance of RJMCMC and Dirichlet process over repeated samples. 
In our simulation, we draw over 200 samples from data sets generated by a pre-specified LCPA model with two classes (i.e., $C = 2$) and two profiles (i.e., $S = 2$), measured by four binary items (i.e., $M = 4$) over four time periods (i.e., $T = 4$). To simplify the presentation, we constrained the $\rho$-parameters to be invariant over time (i.e., $\rho_{mk1|c} = \cdots = \rho_{mkT|c} = \rho_{mk|c}$), but imposed no constraint on the $\eta$-parameters. Note that the $\eta$-parameters describe the class progression over time. If the $\eta$-parameters were time invariant (i.e., $\eta_s^{(1)} = \cdots = \eta_s^{(T)}$, where $\eta_s^{(t)} = (\eta_{1|s}^{(t)}, \ldots, \eta_{C|s}^{(t)})$), then, given a profile, the probabilities of belonging to a specific class could not change over time, leading to difficulties in interpreting the identified profiles. We used balanced probabilities of profile membership (i.e., $\gamma_1 = \gamma_2 = .5$), but the following three factors were varied for the simulation study: the relationship between items and classes (i.e., the $\rho$-parameters), the relationship between classes and profiles (i.e., the $\eta$-parameters), and the sample size ($n = 250$ or $n = 500$). The $\rho$-parameters and the $\eta$-parameters each have two levels: the strong relationship (i.e., $\rho = .9$ or $.1$ and $\eta = .9$ or $.1$) and the mixed relationship (i.e., $.1 < \rho < .4$ or $.6 < \rho < .9$, and $.1 < \eta < .4$ or $.6 < \eta < .9$). For each sample, we estimate the number of classes and profiles by RJMCMC and the Dirichlet process. Our purpose is to compare their performances through the relative frequency of selecting the true model (i.e., the two-class and two-profile LCPA) over repeated samples. For the prior specifications, we choose $\varsigma = 5$, $\kappa = 0.2$, and $\sigma = 1$ and apply a rather diffuse prior, $G(1.5, 1)$, on both concentration parameters $\alpha$ and $\lambda$ (i.e., $a = 1.5$ and $b = 1$).

4.1 Simulation with Reversible Jump MCMC

For each data set, we run our sampler for 10,000 iterations with an additional 1,000 iterations as a burn-in period. Regarding the performance in recovering the correct number of classes, RJMCMC never fails to select the true number of classes under all combinations of factors considered. We find that the RJMCMC samplers converge to the desired posterior distribution of the number of classes quickly. Once converged, however, the trans-dimensional moves are not easily made. In fact, the difficulty in accepting proposed parameters of a different dimensional space may lead to some biases in the number of classes. To improve the low acceptance rates, we modify the across-model scheme by adding the birth and death of an empty class. The rate of accepting the birth move of an empty class is controlled by the quantity $\prod_{s=1}^{S}\prod_{t=1}^{T}(1 - u_s^{(t)})^{n_s^{(t)}}$, where $u_s^{(t)}$ is a random variable from Uniform(0, 1) and $n_s^{(t)}$ is the size of profile $s$ at time $t$, for all $s = 1, \ldots, S$ and $t = 1, \ldots, T$. The problem of poor mixing remains, however, since the acceptance rate can diminish exponentially fast even for profiles of moderate size. Table 4.1 shows the percentage of the most frequently selected profiles over the simulated samples.
Table 4.1: The percentage of the most frequently selected profiles over simulated samples by reversible jump MCMC

                                       Number of profiles
  Sample size   ρ        η          1      2      3      4      5      6      7
  250           Strong   Strong   0.0   78.4   19.6    2.0    0.0    0.0    0.0
                Mixed    Strong   0.0   74.8   20.2    4.6    0.4    0.0    0.0
                Strong   Mixed    0.0   45.2   48.6    5.4    0.8    0.0    0.0
                Mixed    Mixed    5.2   34.8   58.8    1.2    0.0    0.0    0.0
  500           Strong   Strong   0.0   82.2   17.2    0.6    0.0    0.0    0.0
                Mixed    Strong   0.0   52.2   45.8    1.6    0.3    0.1    0.0
                Strong   Mixed    5.4   19.6   70.8    3.0    0.8    0.3    0.1
                Mixed    Mixed   10.4   14.4   70.2    4.6    0.4    0.0    0.0

For the number of profiles, RJMCMC is able to identify the correct number of profiles when both measurements are strong (78.4% for the small sample and 82.2% for the large sample). However, if the secondary measurement is mixed, accurately discriminating among the competing models becomes difficult, especially when RJMCMC works with large samples. We find that, under weak measurement setups, some of the resulting profiles are formed with small sizes, and richly structured samples exacerbate the problem even further. We generalize the cases in a later section to explain how this intrinsic difficulty of RJMCMC impacts the performance.

[Figure 4.1: The percentage of the most frequently selected profiles over simulated samples by reversible jump MCMC, for the small (n = 250) and large (n = 500) sample sizes: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.]

4.2 Simulation with Dirichlet Process

Even though the Dirichlet process chooses the correct number of classes throughout the simulations, the method is not free from problems. A troubling aspect of the Dirichlet process is that a complex model may be falsely preferred because, with some probability, profiles are generated but rarely visited thereafter, which results in a sparse decomposition of the LCPA model. In addition to presenting results from the Dirichlet process without applying any inclusion threshold, we will adapt our models to work with two cutoffs (e.g., 1% and 5%) and investigate the changes. The results of the Dirichlet process for the number of profiles are presented in Figure 4.2. The Dirichlet process is known to be consistent in recovering the true model as the sample size increases, and Figure 4.2 corroborates this attribute, since the correct rates increase when the sample size grows. It is rather interesting that the Dirichlet process favors models with an excessive number of profiles, especially when working with mixed primary measurements. We believe the poor performance can be ascribed to the weak relationship between the classes and the items, since a well-identified sequence of class memberships is the key ingredient for successful classification of the profiles. When the primary measurement is mixed, we see poor performance, especially with small samples. Generally, the Dirichlet process is poorly suited to identifying the correct number of profiles when none of the measurements is strong and the sample size is small, but the method proves to be of good practical use when the sample size is large. In fact, during many stages of profile learning the Dirichlet process develops some profiles whose sizes never exceed 1%. To avoid profile redundancy, we need to impose a threshold as an inclusion criterion.
We use 1% and 5% as the cutoffs, by which any profile whose size is less than the given cutoff is considered too sparse to be included; the corresponding distributions for the number of profiles are given in Figure 4.3 and Figure 4.4. Clearly, when the 5% cutoff is applied, the predictive accuracy increases dramatically, especially when strong-strong pairs and large sample sizes are the working scenario. Determining an optimal cutoff value is largely a matter of discovering the right dominating profiles.

[Figure 4.2: The histogram of the most frequently selected profiles over simulated samples from Dirichlet process, for the small (n = 250) and large (n = 500) sample sizes: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.]

[Figure 4.3: The histogram of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 1%: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.]

[Figure 4.4: The histogram of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 5%: (a) strong ρ and strong η, (b) mixed ρ and strong η, (c) strong ρ and mixed η, and (d) mixed ρ and mixed η.]

Table 4.2: The percentage of the most frequently selected profiles over simulated samples from Dirichlet process with cutoff 5%

                                       Number of profiles
  Sample size   ρ        η          1      2      3      4      5      6      7
  250           Strong   Strong   0.0   69.0   29.0    1.0    1.0    0.0    0.0
                Mixed    Strong   0.0   35.0   38.0   21.0    5.0    1.0    0.0
                Strong   Mixed    0.0   51.0   34.0   12.0    3.0    0.0    0.0
                Mixed    Mixed    6.0   41.0   33.0   17.0    5.0    0.0    0.0
  500           Strong   Strong   0.0   81.0   17.0    1.0    1.0    0.0    0.0
                Mixed    Strong   0.0   50.0   20.0   20.0    5.0    2.0    0.0
                Strong   Mixed    0.0   65.0   30.0    4.0    1.0    0.0    0.0
                Mixed    Mixed    3.0   53.0   30.0   14.0    0.0    0.0    0.0

Since the 5% cutoff is recognized as more effective in producing satisfying results without being unreasonably liberal in screening profiles, we consider 5% as the appropriate threshold for determining the number of profiles; the percentages of the most frequently selected profiles are reported in Table 4.2. Cases in which more than seven profiles are generated are not shown in the table, since their sizes are negligibly small.

4.3 Extended Cases

In the previous sections, we provided empirical results showing that both presented techniques are able to learn the class structure correctly. However, to investigate the performance and validate their effectiveness in selecting the right profiles, we consider applications more generally. We draw samples from LCPA models with two classes, two profiles, and four items over a varying number of time measurements (i.e., T = 3, 4, and 5). Table 4.3 indicates that, with strong-strong pairs, RJMCMC can optimize the performance when more time measurements are involved.
Table 4.3: The percentage of selecting the correct number of profiles (S = 2) over simulated samples from RJMCMC and Dirichlet process

                                         RJMCMC                 DP with 5% cutoff
  Sample size   ρ        η        T = 3   T = 4   T = 5     T = 3   T = 4   T = 5
  250           Strong   Strong    74.6    78.4    80.5      51.0    69.0    91.0
                Mixed    Strong    76.2    74.8    73.0      34.0    35.0    60.0
                Strong   Mixed     53.6    45.2    58.3      50.0    51.0    65.0
                Mixed    Mixed     54.3    34.8    32.6      38.0    41.0    55.0
  500           Strong   Strong    77.9    82.2    86.2      57.0    81.0    92.0
                Mixed    Strong    64.7    52.2    46.9      45.0    51.0    66.0
                Strong   Mixed     27.4    19.6    22.2      51.0    65.0    82.0
                Mixed    Mixed     33.1    14.4    17.7      48.0    53.0    65.0

However, performance degradation is noted when weak measurements are given, which we believe is because weak measurements only vaguely distinguish the competing models, and most of the time RJMCMC then prefers a complex model over a simple one. On the other hand, we found that the Dirichlet process can learn the profiles more accurately when more time measurements are involved. The results also demonstrate that working with large data sets is always accompanied by significant improvement, although enlarging the sample size may not be a feasible way to handle model complexity.

4.4 Discussion

The latent stage-sequential process is an attractive tool for many areas of substantive research. Like other mixture models, however, it can be difficult to estimate the number of latent components, such as classes and profiles. We have illustrated two Bayesian approaches, RJMCMC and the Dirichlet process, for latent class profile analysis. Although this model is certainly not representative of all latent stage-sequential processes, it nevertheless lends several important insights that we believe have general relevance.

First, both RJMCMC and the Dirichlet process can learn the model without assuming the number of latent components in advance. Compared with the technical challenges that RJMCMC brings, the Dirichlet process may be preferred in terms of flexibility and consistency. The Dirichlet process technique relaxes the constraint on the number of classes placed at each measurement occasion, and it performs well with large sample sizes. The Dirichlet process is conceptually simple and does not require intensive computation, since there is no need to design jumping proposals or compute acceptance rates. Technically, the Dirichlet process is easily employed on an intuitive basis and readily extended to study longitudinal data. In Bayesian mixture modeling, label switching is one of the most common issues and can lead to poor parameter estimation. However, it does no harm in finding the number of components for the LCPA model. In the simulation study, we ran the algorithms in the presence of label switching, but the conclusions drawn here remain unaffected. To capture structural change in the classes over the course of the study, we may allow RJMCMC to vary the number of classes at different measurement occasions, but this leads to insurmountable technical hurdles. Allowing a varying number of classes over time is, however, possible in DDPM, which is an appealing aspect of the Dirichlet process approach compared to RJMCMC. In this study, we present a Bayesian model selection analysis for LCPA without any covariates to predict the prevalence of the profiles; however, the predictor-dependent kernel stick-breaking process has already attracted interest. It has been utilized in choosing priors for an unknown probability measure (Dunson and Park, 2008) and in variable selection problems (Chung and Dunson, 2009).
Adding predictors in the prior consideration gives different insights into how the classes are formed under the influence of predictors and it is understood as a dependent Dirichlet Process. Future work should explore the model selection for the LCPA regression 55 model. Some specific limitation to this study include the choice of jumping rules in RJMCMC. The simulation study can only serve as a preliminary exploration and it is worth noting that a comparison between RJMCMC and Dirichlet process cannot be generalized unless the ingenuity of the proposal mechanism we proposed in the RJMCMC is ensured. Besides, we note that the results of the Dirichlet process may not be promising when mixed-mixed pairs are considered since it did not gain much accuracy when sample size was increased from 250 to 500. We believe that the appropriate sample size should be investigated for Dirichlet process to produce accurate results under different scenarios on the measurement parameters for future study. We provided a limited demonstration to help elucidate the the two potential methodologies for the model selection in LCPA. Our hope is that substantive practitioners will be able to utilize this demonstration in their research and that it may provide helpful guidance for the implementation of these two methods. In summary, this presentation provided the coverage of the implementation of RJMCMC and the Dirichlet process for the LCPA model and showed the performance of those methods in selecting the number of classes and profiles over repeated samples. Future work should investigate more tailored RJMCMC and the Dirichlet process specialized for LCPA and compare the performance formally via simulation study. We expect that each method will display its own unique strengths and weaknesses under different conditions. 56 Chapter 5 Parameter Estimation 5.1 Preliminary Parameters for latent class profile analysis (LCPA) are easily estimated by maximum likelihood via EM algorithm or Bayesian method via Markov chain Monte Carlo. However, the local maximum problem is a long-standing issue in any hill-climbing optimization technique for the LCPA model. In this study, we propose to apply two probabilistic optimization techniques using the deterministic annealing framework in order to deal with multiple local modalities in the LCPA model. The deterministic annealing approaches are implemented with an efficient recursive formula in the step for the parameter update. 5.2 Expectation-Maximization The EM algorithm is an iterative procedure which is widely employed for finding maximum likelihood and solving missing value problems (Dempster et al., 1977). Estimation of parameters in the LCPA model may be regarded as an estimation problem with missing data: 57 the realized values of the latent-profile variable Ui and latent-class memberships Cit for all t are missing for individuals i = 1, . . . , n. The observed data loglikelihood given the covariates xi is n ∑ ( ) log P r(Yi = yi ) | xi i=1   n S T ∑ ∑ ∏ (t) = log  γs (xi ) Rs  s=1 t=1 i=1 l(θ) = (5.1) Direct maximization of the observed-data loglikelihood is complicated because we need to solve optimization problem with sum of logarithm functions; however, ML estimates could be calculated directly in closed form if there is no latent variables. EM is computationally tractable to maximizing the loglikelihood function by repeatedly solving the complete-data problem. EM is an iterative procedure in which each iteration consists of two steps, the E-step and M-step. 
In the E-step, we compute the expected values of the unobservable cell counts for the cross-classification by $Y_{i1}, \ldots, Y_{iM}$ and $U_i, C_{i1}, \ldots, C_{iT}$, given the observed data and the previous parameter estimates. In the M-step, we maximize the complete-data loglikelihood function under the assumption that the missing data are known, with the expected values from the E-step substituted for the actual missing data. For the LCPA model, the EM algorithm has some important advantages over other methods for finding ML estimates: it converges gradually but reliably; it guarantees that the resulting estimates lie within the parameter space; it does not require inversion of the Hessian matrix at each iteration, so it demands less computational time than the Newton-Raphson method; and it does not require a carefully chosen set of initial values to start the iterative process in order to converge to the final solution. For all these reasons, the EM algorithm has been the preferred method for ML-based estimation in the LCPA model. If covariates are included, the expected complete-data loglikelihood can no longer be maximized using closed-form expressions; the M-step requires an iterative procedure equivalent to a routine for fitting a standard baseline-category multinomial logistic-regression model. Van der Heijden et al. (1996) and Bandeen-Roche et al. (1997) utilized an EM algorithm that incorporates Newton-Raphson steps for the multinomial logit model into the standard latent class estimating procedure devised by Goodman (1974). The E- and M-steps of this procedure are given below.

5.21 E-step

In the E-step, the posterior probabilities of latent class and profile memberships for the $i$th individual are calculated under the provisional parameter estimates from the previous iteration. By Bayes' theorem, these posterior probabilities are

\[
\theta_{i(s,\mathbf{c})} = \Pr(U_i = s, \mathbf{C}_i = \mathbf{c}_i \mid \mathbf{Y}_i = \mathbf{y}_i, \mathbf{x}_i)
= \frac{\gamma_s(\mathbf{x}_i)\prod_{t=1}^{T} \eta^{(t)}_{c_t|s} R^{(t)}_{c_t|s}}
{\sum_{s=1}^{S} \gamma_s(\mathbf{x}_i)\prod_{t=1}^{T}\Bigl\{\sum_{c_t=1}^{C} \eta^{(t)}_{c_t|s} R^{(t)}_{c_t|s}\Bigr\}}.
\]

5.22 M-step

In the M-step, updated parameter estimates are obtained by maximizing the expected complete-data loglikelihood, treating the latent variables as if they were observed. If $U_i$ and $\mathbf{c}_i$ were known, let the contribution of the $i$th individual to the complete-data likelihood be denoted $L^*_i(\theta)$; the logarithm of this contribution, $\ell^*_i(\theta) = \log L^*_i(\theta)$, can be written as

\[
\ell^*_i(\theta) = \sum_{s=1}^{S} \theta_{is} \log \gamma_s(\mathbf{x}_i)
+ \sum_{s=1}^{S}\sum_{t=1}^{T}\sum_{c_t=1}^{C} \theta^{(t)}_{i(s,c_t)} \log \eta^{(t)}_{c_t|s}
+ \sum_{t=1}^{T}\sum_{c_t=1}^{C} \theta^{(t)}_{ic_t} \sum_{m=1}^{M}\sum_{k=1}^{r_m} I(y_{imt}=k) \log \rho_{mkt|c_t},
\tag{5.2}
\]

where $\theta_{is} = \sum_{c_1}\cdots\sum_{c_T} \theta_{i(s,\mathbf{c})}$, $\theta^{(t)}_{i(s,c_t)} = \sum_{c_j,\, j\neq t} \theta_{i(s,\mathbf{c})}$, and $\theta^{(t)}_{ic_t} = \sum_{s} \theta^{(t)}_{i(s,c_t)}$. According to (2.6), for each $s = 1, \ldots, S-1$ the maximizing vector $\beta_s$ can be derived by setting

\[
\frac{\partial}{\partial \beta_{sr}} \sum_{i=1}^{n} \theta_{is} \log \frac{\exp(\mathbf{x}'_i \beta_s)}{\sum_{j=1}^{S} \exp(\mathbf{x}'_i \beta_j)} = 0
\quad \text{for } r = 1, \ldots, p.
\tag{5.3}
\]

To approach the roots of these non-linear equations, we use the Newton-Raphson (NR) algorithm to expedite the calculation. In the initial stages of EM, the parameter estimates may be far from the desired values, and Newton's method might then fail to converge. To guard against this, we execute only one iteration of the NR method within each M-step. In the NR method, an estimate $\beta^{\mathrm{new}}_s$ is updated by

\[
\beta^{\mathrm{new}}_s = \beta^{\mathrm{old}}_s - \Bigl(\frac{\partial^2 \ell}{\partial \beta\, \partial \beta^{T}}\Bigr)^{-1} \Bigl(\frac{\partial \ell}{\partial \beta}\Bigr)\Big|_{\beta = \beta^{\mathrm{old}}},
\tag{5.4}
\]

where $\beta$ is the vectorized parameter containing all parameter elements.
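A minimal sketch of this single Newton-Raphson update on the $\beta$-parameters is given below, assuming the baseline-category parameterization with $\beta_S$ fixed at zero and fractional responsibilities $\theta_{is}$ supplied by the E-step; the function and variable names are illustrative rather than taken from the dissertation.

```python
import numpy as np

def newton_step_beta(beta, X, theta):
    """One Newton-Raphson step for the baseline-category logit M-step, cf. (5.4).

    beta  : (S-1, p) current coefficients; profile S is the baseline (beta_S = 0)
    X     : (n, p) covariate matrix (an intercept column may be included)
    theta : (n, S) fractional responsibilities theta_is from the E-step; rows sum to one
    """
    n, p = X.shape
    S = theta.shape[1]
    # gamma_s(x_i) under the current beta, computed with a stabilizing shift.
    logits = np.column_stack([X @ beta.T, np.zeros(n)])          # (n, S)
    logits -= logits.max(axis=1, keepdims=True)
    gamma = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Score: sum_i (theta_is - gamma_s(x_i)) x_i, stacked over s = 1, ..., S-1.
    resid = theta[:, :S - 1] - gamma[:, :S - 1]                  # (n, S-1)
    score = (resid.T @ X).reshape(-1)                            # ((S-1) * p,)
    # Negative Hessian blocks: sum_i gamma_s (delta_sr - gamma_r) x_i x_i'.
    H = np.zeros(((S - 1) * p, (S - 1) * p))
    for s in range(S - 1):
        for r in range(S - 1):
            w = gamma[:, s] * ((s == r) - gamma[:, r])
            H[s * p:(s + 1) * p, r * p:(r + 1) * p] = (X * w[:, None]).T @ X
    # beta_new = beta_old + (negative Hessian)^{-1} * score, one step only.
    step = np.linalg.solve(H, score).reshape(S - 1, p)
    return beta + step
```

Executing only this single step per M-step keeps the overall algorithm a generalized EM, which is enough to increase the expected complete-data loglikelihood at each iteration.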
5.3 Recursive Formula

Although direct maximization of the observed-data loglikelihood defined in (5.1) is complicated, ML estimates could easily be calculated if the latent memberships (i.e., $\mathbf{c}$ and $s$) were known. In the E-step, the conditional probabilities of class and profile memberships for the $i$th individual are calculated under the provisional parameter estimates from the previous iteration. By Bayes' theorem, these conditional probabilities are given by

\[
\theta_{i(s,c_1,\ldots,c_T)} = P(U = s, \mathbf{C} = \mathbf{c} \mid \mathbf{y}_i, \mathbf{x}_i)
= \frac{L^*_i}{\sum_{s=1}^{S}\sum_{c_1=1}^{C}\cdots\sum_{c_T=1}^{C} L^*_i},
\tag{5.5}
\]

where $L^*_i$ is defined in (2.3). In the E-step, however, the complexity of computation and the memory demand grow as the number of time periods $T$ increases. Instead of computing the joint posterior probabilities given in (5.5), we apply a recursive formula to the E-step by adopting the forward-backward algorithm (Chib, 1996; MacKay, 1997). Let $\alpha$ and $\lambda$ represent the forward and backward probabilities, respectively:

\[
\alpha_{it}(c_t, s) = P(\mathbf{Y}_1 = \mathbf{y}_{i1}, \ldots, \mathbf{Y}_t = \mathbf{y}_{it}, C_t = c_t \mid U = s, \mathbf{x}_i),
\qquad
\lambda_{it}(c_t, s) = P(\mathbf{Y}_{t+1} = \mathbf{y}_{i(t+1)}, \ldots, \mathbf{Y}_T = \mathbf{y}_{iT} \mid C_t = c_t, U = s, \mathbf{x}_i).
\]

Both the $\alpha$ and $\lambda$ functions can then be computed by the following recursive representations using the first-order Markov structure:

\[
\alpha_{it}(c_t, s) = \Bigl[\sum_{c_{t-1}=1}^{C} \alpha_{i(t-1)}(c_{t-1}, s)\Bigr]\, \eta^{(t)}_{c_t|s} \prod_{m=1}^{M}\prod_{k=1}^{r_m} \bigl[\rho_{mkt|c_t}\bigr]^{I(y_{imt}=k)},
\]
\[
\lambda_{it}(c_t, s) = \sum_{c_{t+1}=1}^{C} \lambda_{i(t+1)}(c_{t+1}, s)\, \eta^{(t+1)}_{c_{t+1}|s} \prod_{m=1}^{M}\prod_{k=1}^{r_m} \bigl[\rho_{mk(t+1)|c_{t+1}}\bigr]^{I(y_{im(t+1)}=k)}.
\]

With the forward and backward functions, the marginalized conditional probability of the latent component membership at time $t$ can be computed by

\[
\theta^{(t)}_{i(s,c_t)} = P(U = s, C_t = c_t \mid \mathbf{y}_i, \mathbf{x}_i)
= \frac{\gamma_s(\mathbf{x}_i)\, \alpha_{it}(c_t, s)\, \lambda_{it}(c_t, s)}
{\sum_{s=1}^{S} \gamma_s(\mathbf{x}_i) \sum_{c_T=1}^{C} \alpha_{iT}(c_T, s)},
\tag{5.6}
\]

for $s = 1, \ldots, S$, $c_t = 1, \ldots, C$, and $t = 1, \ldots, T$.

In the M-step, updated parameter estimates are obtained by maximizing the expected complete-data loglikelihood, treating the latent variables as if they were observed. The contribution of the $i$th individual to the complete-data loglikelihood, $\ell^*_i = \log L^*_i$, can be written as in (5.2). The first sum in (5.2), which relates to the regression coefficients (i.e., the $\beta$-parameters), is the loglikelihood function for the multinomial logit model, except that the unobserved counts for $s$ are replaced by the fractional expectations $\sum_{i=1}^{n} \theta_{is}$. Updated estimates for the regression coefficients can be calculated with the standard Newton-Raphson method for multinomial logistic regression, provided that the computational routines allow fractional responses rather than integer counts. The other model parameters can be obtained by

\[
\hat{\eta}^{(t)}_{c_t|s} = \frac{\sum_{i=1}^{n} \theta^{(t)}_{i(s,c_t)}}{\sum_{i=1}^{n} \theta_{is}},
\qquad
\hat{\rho}_{mkt|c_t} = \frac{\sum_{i \in \mathrm{obs}^{(t)}_m} \theta^{(t)}_{ic_t}\, I(y_{imt} = k) + \sum_{i \in \mathrm{mis}^{(t)}_m} \theta^{(t)}_{ic_t}\, \rho^*_{mkt|c_t}}{\sum_{i=1}^{n} \theta^{(t)}_{ic_t}},
\]

for $m = 1, \ldots, M$, $k = 1, \ldots, r_m$, $s = 1, \ldots, S$, $c_t = 1, \ldots, C$, and $t = 1, \ldots, T$. Here $\mathrm{obs}^{(t)}_m$ denotes the set of individuals who respond to the $m$th item at time $t$, $\mathrm{mis}^{(t)}_m$ denotes the set of individuals who fail to respond to the $m$th item at time $t$, and $\rho^*_{mkt|c_t}$ is the provisional parameter estimate.

5.4 Local Modality

We understand that the local maximum problem could be mitigated by starting from multiple initial values and then keeping the solution that achieves the highest likelihood. In a high-dimensional parameter space, however, the EM algorithm must be run a large number of times, once for each initialization, which becomes computationally intractable.
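To make the recursive E-step of Section 5.3 concrete, the sketch below evaluates the forward and backward functions and the marginal posteriors in (5.6) for a single individual, assuming the measurement products $\prod_m \prod_k \rho_{mkt|c_t}^{I(y_{imt}=k)}$ have already been collected into an array; all names are illustrative.

```python
import numpy as np

def estep_forward_backward(gamma_i, eta, emis_i):
    """Recursive E-step quantities in (5.6) for one individual.

    gamma_i : (S,) profile probabilities gamma_s(x_i)
    eta     : (S, T, C) class probabilities eta^{(t)}_{c|s}
    emis_i  : (T, C) measurement terms prod_m prod_k rho_{mkt|c}^{I(y_imt = k)}
    """
    S, T, C = eta.shape
    alpha = np.zeros((S, T, C))              # forward probabilities alpha_it(c_t, s)
    lam = np.ones((S, T, C))                 # backward probabilities lambda_it(c_t, s)
    for s in range(S):
        alpha[s, 0] = eta[s, 0] * emis_i[0]
        for t in range(1, T):
            alpha[s, t] = alpha[s, t - 1].sum() * eta[s, t] * emis_i[t]
        for t in range(T - 2, -1, -1):
            lam[s, t] = np.sum(lam[s, t + 1] * eta[s, t + 1] * emis_i[t + 1])
    # Marginalized conditional probability of (U = s, C_t = c_t), eq. (5.6).
    denom = np.sum(gamma_i * alpha[:, T - 1, :].sum(axis=1))
    theta = gamma_i[:, None, None] * alpha * lam / denom
    return theta                              # theta[s, t, c] = theta^{(t)}_{i(s,c)}
```

Because only the T-by-C marginal arrays are stored, the memory and computation grow linearly in T rather than as C^T, which is the point of the recursive formulation.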
In order to relax the initialization dependence for the LCPA model, we introduce two reformulated versions of the standard EM algorithm, namely split-and-merge EM (SMEM) (Ueda et al., 2000), deterministic annealing EM (DAEM) (Ueda and Nakano, 1998) and a variational Bayes learning algorithm, referred as and deterministic annealing variational Bayes (DAVB) (Katahiral et al., 2008). 63 The general idea of SMEM aims at increasing the log likelihood value gradually by tactically choosing three components as candidates for split and merge. A number of merge and split candidates are selected, usually 5 candidate sets are recommended for each iteration and then using partial expectation step to expedite the expectation computation. The idea of performing split and merge operations has been successfully applied to the EM in the Gaussian mixture models through considering local Kullback divergence as a split criteria and posterior probabilities as a merge index. Here we roughly sketch the steps for partial expectation step on classes and profiles in LCPA models. Ueda and Nakano (1998) proposed a deterministic annealing version of the EM (DAEM) algorithm to reduce the impact that inappropriate initial values could cause. Even though there is no evidence to prove it is grounded in theory, it has been successfully applied in the realm of local maximum issues and such an annealing framework has proven effective in improving the performance of the standard approaches (Itaya et al., 2004; Park et al., 2005). 5.41 Split-and-Merge EM Let JC = (c1 , c2 , c3 ) denote the split-and-merge candidates where c1 and c2 are chosen by the pre-specified rule to form a new c∗ and c3 is chosen to be divided into two new classes c1 and c2 . The partial update step for newly-generated classes l = c∗ , c1 , c2 are ˜ ˜ ˜ ˜ re-estimated by the following:   αt (l, s)λt (l, s) (t)  ∑ (t)  θ θ = × ∑ i(s,l) i(s,m) l=c∗ ,c˜ ,c˜ αt (l, s)λt (l, s) m∈JC 1 2 64 Similarly, for profiles, we set JS = (s1 , s2 , s3 ) as the split-and-merge candidates where s1 and s2 are merged to form s∗ and s3 is split into two new profiles s1 and s2 . In ˜ ˜ the partial update step, the posterior probability of profile components l = s∗ , s1 , s2 are ˜ ˜ re-estimated by   γl αt (ct , l)λt (ct , l) (t)  ∑ (t)  θ = θ × ∑ i(l,ct ) i(m,ct ) k=s∗ ,s1 ,s2 γk αt (ct , k)λt (ct , k) ˜ ˜ m∈JS SMEM algorithm is expected to be abler to travel low log-likelihood area and solve the problem of initialization dependence only if a clear set of principles for split and merge has been instituted. In the case of LCPA applications, the correspondence between adjacent measurement occasions is inextricably bridged and there is no easy heuristical method to relocate the mixture components in the data space. Generally speaking, SMEM intensifies the extent of the technical difficulties. 5.42 Deterministic Annealing EM Algorithm Ueda and Nakano (1998) proposed a deterministic annealing version of the EM algorithm (DAEM) to find a set of parameter estimates on the global mode of the loglikelihood function. In the DAEM algorithm for LCPA, the goal to maximize the loglikelihood function is reformulated as the problem of maximizing the function F (ω) = ∑n i=1 Fi (ω), where S C C ∑ ∑ ∑ 1 log ··· (L∗ )ω . Fi (ω) = i ω s=1 c1 =1 cT =1 65 Note that the function F (ω) equals to the observed-data loglikelihood ℓ in (5.1) when ω = 1. 
To maximize the function F (ω), we introduce a random density function q(U = s, C = c | yi ) and find the lower bound of Fi (ω) by Jensen’s inequality (Cover and Thomas, 1991): [ ] (L∗ )ω 1∑ i Fi (ω) ≥ q(U = s, C = c | yi ) log ω s,c q(U = s, C = c | yi ) ∑ = q(U = s, C = c | yi ) log L∗ i s,c 1∑ − q(U = s, C = c | yi ) log q(U = s, C = c | yi ). ω s,c (5.7) In the first step of DAEM, we find the optimal choice for q(U = s, C = c | yi ) by taking functional derivatives with respect to q(U = s, C = c | yi ) and setting it as zero under the constraint ∑ s,c q(U = s, C = c | yi ) = 1. Then the optimal choice for q(U = s, C = c | yi , ) would be (L∗ )ω i θi(s,c ,...,c ) = ∑ . ∑ S ∑C 1 T · · · C =1 (L∗ )ω s=1 c1 =1 cT i (5.8) It is worth noting that with values of ω close to zero, the function θ i(s,c ,...,c ) (ω) is 1 T uniformly distributed across all the latent components (i.e., classes and profiles). As the value of ω increases toward one, however, θ i(s,c ,...,c ) (ω) agrees with the conditional 1 T probabilities θi(s,c ,...,c ) given in (5.5). 1 T In the second step of DAEM, the model parameters are updated by maximizing the ∑n modified expected complete-data loglikelihood ℓ∗ (ω) = ∗ i=1 ℓi (ω). The contribution 66 of the ith individual to the ℓ∗ (ω) can be written as i ℓ∗ (ω) = i S ∑ S T C ∑ ∑ ∑ θis (ω) log γs (xi ) + θ (t) (t) (ω) log η i(s,ct ) ct |s s=1 s=1 t=1 ct =1 T C M rm ∑ ∑ (t) ∑ ∑ + θic (ω) I(yimt=k ) log ρmkt|c , t t t=1 ct =1 m=1 k=1 ( )ω (t) where θ is (ω) ∝ θis , θ (ω) ∝ i(s,ct ) ( (5.9) ( ) )ω (t) (t) (t) ω θ , and θ (ω) ∝ θ ict ict . i(s,ct ) These quantities can be calculated by the recursive formula given in (5.6), and the joint conditional probabilities are not necessary in the second step of DAEM. As the standard EM algorithm, updated estimates for the regression coefficients can be calculated with the Newton-Raphson algorithm for the first sum in (5.9). The other model parameters can be obtained by ∑n (t) θ i=1 i(s,ct ) (ω) (t) ∑n η ˆ = ct |s i=1 θis (ω) ∑n ∑ (t) (t) θic (ω)I(yimt = k) + n θic (ω)ρ∗ (t) t (t) t mkt|ct i∈obsm i∈mism ρmkt|c = ˆ ∑n (t) t θic (ω) i=1 t for m = 1, . . . , M , k = 1, . . . , rm , s = 1, . . . , S , ct = 1, . . . , C , and t = 1, . . . , T . We adopt ω =(.001, .01, .1, .2, .3, .4, .48, .58, .69, .83, 1) as an annealing schedule in DAEM. The algorithm is initialized with the value of ω close to zero (i.e., .001) and its converged parameters will be used as the starting values for the next one (i.e., .01). By repeating this procedure until ω reaches one, the DAEM algorithm will find the parameter 67 estimates on the global mode of the loglikelihood function. 5.43 Deterministic Annealing Variational Bayes Algorithm Let Θ denote the vectorized model parameters for the LCPA model. In the deterministic annealing variational Bayes (DAVB) algorithm, we can consider Θ, a set of unknown parameters, as the random quantities and incorporate the prior information for Θ. Let φ(Θ) denote a prior distribution of Θ. Further, let z = (z1 , . . . , zn ) indicate the individuals’ class and profile memberships, where zi is a T + 1 dimensional array for the ith individual such ∑ ∑ ∑ S C C that zi(s,c ,...,c ) ∈ {0, 1} and s=1 c1 =1 · · · cT =1 zi(s,c1 ,...,cT ) = 1. 1 T That is, if individual i belongs to the profile s and the class membership c = (c1 , . . . , cT ) from initial time t = 1 to time T , then zi(s,c ,...,c ) equals 1 and 0 otherwise. 
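Both deterministic annealing algorithms iterate over the same ω schedule, so before completing the DAVB construction it may help to see the outer loop of Section 5.42 spelled out. The sketch below is a simplified DAEM driver; for clarity it enumerates the S·C^T latent configurations directly rather than using the recursive formula, and the callback names and generic structure are assumptions, not the dissertation's implementation.

```python
import numpy as np

OMEGA_SCHEDULE = (.001, .01, .1, .2, .3, .4, .48, .58, .69, .83, 1.0)

def daem(complete_data_loglik, mstep, params, schedule=OMEGA_SCHEDULE,
         max_iter=500, tol=1e-6):
    """Deterministic annealing EM driver over the fixed omega schedule.

    complete_data_loglik(params) : (n, K) array of log L*_i, one column per latent
                                   configuration (profile-by-class-sequence)
    mstep(weights, params)       : returns updated params from the tempered
                                   responsibilities theta(omega), an (n, K) array
    """
    for omega in schedule:
        prev = -np.inf
        for _ in range(max_iter):
            logL = complete_data_loglik(params)
            # Tempered E-step: theta(omega) proportional to (L*_i)^omega, cf. (5.8).
            scaled = omega * logL
            m = scaled.max(axis=1, keepdims=True)
            weights = np.exp(scaled - m)
            norm = weights.sum(axis=1, keepdims=True)
            weights /= norm
            # Annealed objective F(omega); at omega = 1 it is the loglikelihood (5.1).
            objective = float(np.sum(m + np.log(norm)) / omega)
            # M-step on the tempered expected complete-data loglikelihood, cf. (5.9).
            params = mstep(weights, params)
            if objective - prev < tol:
                break
            prev = objective
        # The converged parameters at this omega seed the next, larger omega.
    return params
```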
1 T Constructed with the similar logic of the DAEM, the goal of DAVB to approximate the distribution over latent variables and model parameters can be rephrased by the problem of maximizing F (ω) = ∑n i=1 Fi (ω), where S C C ∑ ∑ ∑ ∫ 1 log ··· P (Y = yi , Z = zi , Θ)ω dΘ. Fi (ω) = ω s=1 c1 =1 cT =1 By introducing a random distribution q(Z = zi , Θ), Fi (ω) can be lower bounded by 68 Jensen’s inequality (Cover and Thomas, 1991): Fi (ω) ≥ S C ∑ ∑ ··· C ∑ ∫ q(Z = zi , Θ) log P (Y = yi , Z = zi , Θ)dΘ s=1 c1 =1 cT =1 S C C ∑ ∫ 1 ∑ ∑ − ··· q(Z = zi , Θ) log q(Z = zi , Θ)dΘ. ω s=1 c1 =1 cT =1 If q(Z = zi , Θ) has a factored form (i.e., q(Z = zi , Θ) = Q(Z = zi )r(Θ)), the function Fi (ω) is then lower bounded as Fi (ω) ≥ S C ∑ ∑ ··· C ∑ s=1 c1 =1 cT =1 ∫ Q(Z = zi )r(Θ) log P (Y = yi , Z = zi | Θ)φ(Θ)dΘ   S C C ∑ 1 ∑ ∑ ··· Q(Z = zi ) log Q(Z = zi ) (5.10) − ω s=1 c =1 cT =1 1 } ∫ + r(Θ) log r(Θ)dΘ . In the first step of DAVB, we iteratively maximizes the lower bound of Fi (ω) with respective to Q(Z = zi ) by taking functional derivative and setting it equal to zero under the constraint ∑S ∑C ∑ · · · C =1 Q(Z = zi ) = 1. The optimal choice for s=1 c1 =1 cT Q(Z = zi ) would be θi(s,c ,...,c ) (ω), which is the optimal choice for q(U = s, C = 1 T c | yi ) given in (5.8). In the second step, the lower bound of Fi (ω) is maximized with respective to r(Θ) and the model parameters are updated based on the variational posteriors. We consider 69 multivariate normal distribution Np×(S−1) (0, I) as a prior for the β -parameters. At each cycle of Gibbs sampler, the vectorized coefficients β are updated by Metropolis algorithm (Robert and Casella, 2004). A candidate for the next coefficient vector β c is generated ˆ ˆ from Np×(S−1) (β, δΣ) at each iteration, where β is the generated sample from the previous iteration. In this paper, we adjust the value of δ in order to control the acceptance rate within a recommended range from .18 to .30 (Gelman et al., 1997). The variance Σ is the negative inverse of the β -submatrix in the Hessian from the complete-data loglikelihood ℓ∗ (ω) evaluated at the DAEM estimates. The candidate vector for β c is accepted with ˆ probability α(β, β c ) = min (1, exp (ωA)), where n S ) ∑ ∑ 1( 2 ˆ2 + ˆ θis (ω)x′ (β c − β s ) A = − |β| − |β| i s 2 i=1 s=1      n S S  ∑ ∑ ∑ ′ β c ) − log  ′ β ) . ˆ  − log exp(xi s exp(xi s   s=1 s=1 i=1 (t) (t) (t) = (η , . . . , η ) 1|s C|s (t) and ρmt|c = (ρm1t|c , . . . , ρmr t|c ), new parameters can be drawn from η s ∼ m (t) (t) Dirichlet(τ ,...,τ ) and ρmt|c ∼ Dirichlet(νm1t|c , . . . , νmr t|c ), where m 1|s C|s Applying the Jeffreys’ priors to the measurement parameters η s  n ∑ (t) 1 (t) τ = ω θ (ω) −  + 1 c|s i(s,c) 2 i=1     νmkt|c = ω   ∑ (t) I(yimt = k)θic (ω) + (t) i∈obsm  ∑ (t) i∈mism 70 1 (t)  ∗ θic (ω)ρmkt|c −  + 1 2 for c = 1, . . . , C , s = 1, . . . , S , t = 1, . . . , T , k = 1, . . . , rm , and m = 1, . . . , M . In DAVB, we adopt the same annealing schedule and proceed with the same procedure as in DAEM. 71 Chapter 6 Model Diagnosis For the inferences drawn from a model being meaningful, we need to establish a correctly specified model and identifiability is fundamental for having valid parameter estimates. According to the definition by Goodman (1974), the parameters of an LC model without covariates are said to be locally identifiable at a particular value θ ∗ if for some open neighborhood of it, the loglikelihood function has a unique optimum value at θ ∗ . 
This definition extends naturally to LCPA models. The number of response patterns, that is, particular combinations of responses to the manifest items $Y_{1t}, \ldots, Y_{Mt}$ for all $t = 1, \ldots, T$, is $(\prod_{m=1}^{M} r_m)^{T}$. A saturated model for $Y_{1t}, \ldots, Y_{Mt}$ for all $t = 1, \ldots, T$ would therefore have $(\prod_{m=1}^{M} r_m)^{T} - 1$ nonredundant parameters, because the probabilities for the response patterns must sum to 1. A necessary but not sufficient condition for the LCPA model to be locally identifiable is that the number of nonredundant parameters in the saturated model must be greater than or equal to the number of free parameters in $\gamma$, $\eta$, and $\rho$ (Goodman, 1974; Clogg and Goodman, 1984). The saturated model has $(\prod_{m=1}^{M} r_m)^{T} - 1$ nonredundant parameters, and the LCPA model has $S - 1 + ST(C - 1) + C(\sum_{m=1}^{M} r_m - M)$ free parameters when we assume the number of classes is fixed at $C$ over time. When

\[
\Bigl(\prod_{m=1}^{M} r_m\Bigr)^{T} - 1 \;>\; S - 1 + ST(C - 1) + C\Bigl(\sum_{m=1}^{M} r_m - M\Bigr),
\tag{6.1}
\]

the LCPA model might be identifiable, but not necessarily so. Traditionally, identifiability is diagnosed by examining the Hessian matrix evaluated at the ML estimates $\theta^*$. If it has full rank, that is, if its inverse exists, the parameter estimates are at least locally identifiable. In practice, however, it is difficult to distinguish between situations where the Hessian is nearly singular and where it is exactly singular, because of the imprecision of floating-point computations. As pointed out by Formann (2003), in situations where some estimated parameters lie on the boundary (i.e., estimates are close to a boundary value), it is better to fix those values in order to make the remaining parameters identifiable. In fact, a covariance matrix with large variances indicates boundary problems. The Hessian matrix for an LCPA model with covariates may be of large dimension, and computing the derivatives therefore becomes unappealing. As shown in (2.8), the marginalization implies that an LCPA with logistic regression is locally identifiable if the following three conditions are satisfied (Bandeen-Roche et al., 1997):

1. an LCPA marginalized over covariates is locally identifiable;
2. the design matrix $(x_1, \ldots, x_n)'$ has full column rank;
3. at least one individual has nonzero $\gamma_s(x_i)$ for all $s = 1, 2, \ldots, S$.

Some new monitoring methods for identifiability have been proposed. For example, Kim and Lindsay (2009) define a new concept, the degree of identifiability of the parameters in the likelihood. This data-dependent identifiability is measured through modal regions, that is, the maximal connected subsets containing the ML estimate $\theta^*$, determined by a significance level $\alpha$. The model is said to be identifiable at confidence level $100(1 - \alpha)\%$ at $\theta^*$ if the modal region containing $\theta^*$ is disjoint from any of its permuted counterparts. Not surprisingly, as $\alpha$ decreases the modal regions become difficult to label well, because overlapping may occur. The level of identifiability can be quantified by finding the infimum of the values $\alpha$ such that the modal region is well separated. In addition, Yao and Lindsay (2009) developed marginal discriminant plots to measure the degree of modal separation.
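As a quick illustration of the counting rule in (6.1), the sketch below compares the number of nonredundant cells in the saturated model with the number of free LCPA parameters. The example values (four classes, three profiles, and items with 3, 2, and 2 categories over three waves) merely echo the setting of the next chapter for illustration; the function name is hypothetical.

```python
def lcpa_identifiability_check(C, S, T, r):
    """Necessary (but not sufficient) condition (6.1) for local identifiability.

    C : number of classes, assumed constant over time
    S : number of profiles
    T : number of measurement occasions
    r : list of category counts r_m for the M manifest items
    """
    M = len(r)
    cells = 1
    for rm in r:
        cells *= rm
    saturated = cells ** T - 1                            # nonredundant response-pattern cells
    free = (S - 1) + S * T * (C - 1) + C * (sum(r) - M)   # free gamma, eta and rho parameters
    return saturated, free, saturated > free

# Illustration: four classes and three profiles, items with 3, 2, and 2 categories, T = 3.
print(lcpa_identifiability_check(C=4, S=3, T=3, r=[3, 2, 2]))
# -> (1727, 45, True): the necessary condition holds, so the model *may* be identifiable.
```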
74 Chapter 7 Data Analysis on National Longitudinal Survey of Youth 1997 (NLSY97) 7.1 Data The proposed model selection (i.e., Dirichlet process) and parameter estimation (i.e., deterministic annealing) methods are applied to the drinking items drawn from the National Longitudinal Survey of Youth 1997 (NLSY97), a survey that explores the transition from school to work and from adolescence to adulthood in the United State. Since the detailed information on the data appears in Chung et al. (2011), we briefly describe the data structure here. Complete data for the analysis were available on 1416 adolescents who were identified as the early onset drinkers. The early onset drinkers under this study had started drink by the time when they were 14 years old, at least 7 years before the legal drinking age. There 75 are three self-report items measuring adolescent drinking behaviors: how many days they had one or more drinks of an alcoholic beverage during the last 30 days (Recent Drinking); how many days they had five or more drinks on the same occasion during the last 30 days (Binge Drinking); and how many days they had drinks immediately before or during school or work hours in the last 30 days (Drinking at School). The responses for Recent Drinking were reduced to a three-category indicator, non-drinker (0 days of drinking), occasional drinker (1-5 days of drinking) and regular drinker (6 or more days of drinking). For Binge Drinking, respondents who had consumed five or more drinks on the same occasion at least one time were characterized as binge drinkers. The same rule was applied for Drinking at School. These three drinking items were tracked over the three survey waves in 1997 (Wave 1), 2000 (Wave 4), and 2003 (Wave 7), corresponding to early adolescence (ages 12–14), middle adolescence (ages 15–17), and late adolescence (ages 18–20), respectively. In addition to these three items, we consider gender and race as covariates in the model. 7.11 Information Criteria versus Dirichlet Process Under the data set described above, Chung et al. (2011) started by fitting a series of twoclass LCPA models where the pathways of the class membership were mapped onto between two and four profiles. This procedure was repeated until it reached a model with six classes and six profiles. Table 7.1 shows a series of LCPA models with evaluations based on the bootstrap p-value for goodness of fit and AIC. The four-class-three-profile LCPA model and five-class-three-profile LCPA models are favored in terms of AIC, bootstrap p-value and the principle of parsimony. However, due to the unclear interpretation of the classes in the five-class-three-profile LCPA model, fourclass-three-profile LCPA model is selected. Even though the traditional model selectors work 76 Table 7.1: Goodness-of-fit statistics for a series of LCPA models under various numbers of classes and profiles Number of classes 2 3 4 5 6 Number of profiles 2 3 4 2 3 4 5 2 3 4 5 2 3 4 5 2 3 4 5 6 LRT 812.81 812.81 812.81 492.84 452.99 447.16 445.64 449.80 404.58 392.61 387.90 429.69 375.52 357.65 347.64 421.65 367.31 338.81 325.88 306.07 77 Bootstrapping p-value 0.000 0.000 0.000 0.000 0.025 0.015 0.020 0.005 0.305 0.440 0.315 0.025 0.375 0.585 0.595 0.015 0.410 0.735 0.810 0.795 AIC 12103 12111 12119 11803 11777 11785 11798 11780 11755 11763 11778 11780 11752 11760 11776 11792 11770 11773 11792 11804 as convenient means for comparison of each competing model, the process would be a lengthy undertaking. 
Since dynamic Dirichlet learning process has been proved to work well with less computation demands, we are interested in implementing Dirichlet process to see if it comes to the same conclusions without comparing each model one at a time. Fig. 7.1 (a) - (c) shows scatter diagrams derived by applying DDPM and LOWESS algorithm proposed by Cleveland (1979) with two smoothing curves. LOWESS is a locally weighted scatterplot smoothing used here to summarize the appropriate number of classes over time. The smoothing span gives the proportion of points in the plot that influence the smoothness at each value and larger spans give smoother lines. Each plot is smoothed with two smoother spans (.1 and .5) to locally fit the diagram. It is obvious to see all the wavy lines are around four, which makes the preferences fairly self-explanatory to be four classes over three time measurements. The result is in line with the histogram exploration as shown in Fig. 7.1 (d) where a four-class-over-three-time model has dominant frequency. Chung et al. (2011) selected a four-class LCPA model based on the Table 7.1 and it seems Dirichlet process has equal learning capability of selecting suitable number of classes. Based on the class transiting routes, we implement DPM to learn the number of profiles. Fig. 7.1 (d) indicates models with three or four profiles are favored by the DPM algorithm because of their higher occurrences, 27% and 29%, respectively. Chung et al. (2011) conclude that three-profile model fit better based on goodness-of-fit index and according to the result DPM presents, we have consistent conclusions. Since DDPM and DPM lead to successful applications without requiring prior knowledge on the unknown system, we believe Dirichlet process can effectively constructs the model by integrating the sequential information. 78 7.12 Parameter Estimates Given the picked four-class-three-profile LCPA model, Chung et al. (2011) used 100 different sets of initial parameters to avoid local optimum entrapment. They selected a set of estimates providing the highest loglikelihood value among those from 100 different initializations by using standard EM algorithm. We will focus on the same model structure to compare the performances of two deterministic annealing approaches in finding the global maximum estimates with the pre-determined annealing schedule. Note that we consider the LCPA model where the ρ-parameters are constrained to be equal across time points and both deterministic annealing update the marginalized conditional probability of the latent component membership recursively by using forward and backward functions. To ensure the comparability of the standard EM algorithm and two deterministic annealing approaches, we implement DAEM with 100 initializations in order to better understand how different starting values induce variations in the loglikelihood values. For the standard EM algorithm, the distribution of the converged loglikelihood values corresponding to 100 sets of initializations are presented in Figure 7.2 (a). Exploring the histogram, we can see that EM solutions are trapped in four local modes based on their initializations. The largest loglikelihood value is −5726.28 presented in the last bar in Figure 7.2 (a), but the most frequent loglikelihood values are observed in the range of −5740 and −5739. On the contrary, Figure 7.2 (b) illustrates that DAEM reaches a single point convergence (loglikelihood value is −5727.14) irrespective of the initial values. 
Although DAEM does not achieve the optimum value given by the standard EM algorithm, the difference in the loglikelihood values is negligible based on the distributions presented in Figure 7.2. In addition, only 29% of the initial sets are converged to the global maximum with the standard EM algorithm. 79 Therefore, the DAEM algorithm with recursive formula can be considered as a robust tool to avoid many local entrapments and reduce the computation cost. For DAVB, we also apply 100 different initializations and calculate the standard deviation for each of the model parameters from the 100 sets of DAVB estimates. Generally speaking, DAVB exhibits the work of satisfaction because the extent to which the variations are incurred due to the numerous initializations can be regarded insignificant. Here we briefly sketch the performance of DAVB in terms of their standard deviations and related concerns. The standard deviation of the ρ-parameter estimates derived from DAVB with 100 sets of initializations are presented in Table 7.2. The values under the Recent Drinking column provide the standard deviations of the estimates for probabilities of having reported occasional drinking (one to five days of drinking in the last 30 days) and regular drinking (six or more days of drinking in the last 30 days) for a given class membership. The other two columns show the standard deviations of the estimates for the probabilities of having consumed five or more drinks on the same occasion at least one time in the last 30 days for Binge Drinking and consumed alcoholic beverage right before or during school or work hours at least once in the last 30 days for Drinking at School. Table 7.2 indicates that Class 3 seems more likely to produce largest deviations among the identified classes. We believe the cause is more related to the small prevalence of Class 3: the prevalence of Class 3 is 7.5%, 7.0%, and 0.2% in 1997, 2000, and 2003, respectively, and the average prevalence over time is 4.9%. The secondary measurement parameters (i.e., η -parameters) identify common sequential patterns of drinking behaviors. The standard deviation of the η -parameter estimates from the 100 sets of DAVB estimates are presented in Table 7.3. The η -parameter estimates from DAVB are shown to be fairly stable in terms of insignificant degree of spread. We can see 80 Table 7.2: Standard deviation of the ρ-parameter estimates derived from DAVB with 100 different sets of starting values Recent Drinking Binge Drinking Class Occasional Regular Drinking at School 1 0.024 0.000 0.000 0.000 2 0.001 0.009 0.026 0.022 3 0.003 0.062 0.078 0.084 4 0.001 0.022 0.010 0.017 Table 7.3: Standard deviation of the η -parameter estimates derived from DAVB with 100 different sets of starting values Year Profile Class 1997 2000 2003 1 1 0.025 0.031 0.027 2 0.024 0.039 0.030 3 0.018 0.030 0.020 4 0.006 0.008 0.026 2 1 2 3 4 0.043 0.040 0.019 0.012 0.044 0.056 0.026 0.035 0.040 0.068 0.016 0.061 3 1 2 3 4 0.030 0.039 0.040 0.020 0.030 0.045 0.051 0.043 0.024 0.018 0.010 0.034 that the estimates corresponding to Profile 2 have more fluctuations comparing to those of other profiles. Although estimates regarding to some classes or profiles might not have ideally small deviations, DAVB can still be considered as a more consistent implementation than the traditional multi-start methods because all the standard deviations are below an acceptable threshold. 
The standard deviations of the β-parameter estimates from the 100 sets of DAVB estimates are presented in Table 7.4. As inspected, the standard deviations for Profile 2 relative to the Profile 1 baseline are slightly larger than those for Profile 3, but none of them reveal serious instability among the estimates resulting from the 100 sets of starting values with DAVB. Profile 2 tends to produce larger standard deviations, as previously shown in Table 7.3, and we believe the small prevalence of Profile 2 affects the results to some degree.

Table 7.4: Standard deviation of β-parameter estimates derived from DAVB with 100 different sets of starting values (Profile 1 is the baseline)

Covariate            Category     Profile 2   Profile 3
Gender versus Male   Female       0.122       0.101
Race versus White    Black        0.242       0.163
                     Hispanic     0.260       0.204
                     Other race   0.272       0.283

The estimated primary measurement parameters (i.e., ρ-parameters) obtained with the three estimation algorithms are presented in Table 7.5. The values under the Recent Drinking columns provide the estimated probabilities of having reported occasional drinking (one to five days of drinking in the last 30 days) and regular drinking (six or more days of drinking in the last 30 days) for a given class membership. The other two columns show the probabilities of having consumed five or more drinks on the same occasion at least once in the last 30 days (Binge Drinking) and of having consumed an alcoholic beverage right before or during school or work hours at least once in the last 30 days (Drinking at School).

Table 7.5: Estimated probabilities of responding 'any use' to the drinking items for each class (ρ-parameters)

Method   Class   Recent Drinking (Occasional)   Recent Drinking (Regular)   Binge Drinking   Drinking at School
EM       1       0.001                          0.002                       0.000            0.000
         2       0.960                          0.040                       0.303            0.119
         3       0.675                          0.325                       0.848            0.606
         4       0.280                          0.720                       0.960            0.173
DAEM     1       0.102                          0.000                       0.000            0.000
         2       0.929                          0.071                       0.378            0.056
         3       0.899                          0.101                       0.546            0.396
         4       0.260                          0.740                       0.965            0.205
DAVB     1       0.096                          0.000                       0.000            0.000
         2       0.922                          0.076                       0.407            0.044
         3       0.894                          0.105                       0.541            0.469
         4       0.238                          0.761                       0.967            0.207

We can see that all three items combined support a meaningful interpretation for each class. An inspection of these values leads to the adoption of common class names across estimation methods. Adolescents in Class 1 have not been involved in any drinking in the previous 30 days; therefore we label Class 1 as 'non-current drinkers.' Class 2 represents adolescents who drink occasionally but have no history of regular drinking or drinking at work or school; we accordingly identify adolescents classified in Class 2 as 'light drinkers.' On the same grounds, Class 4 is labeled 'regular binge drinkers,' as these adolescents both drink regularly and engage in binge drinking. The estimates for Class 3, however, show the largest deviations between EM and the two deterministic annealing methods, leading to difficulty in labeling Class 3 with a common class name across estimation algorithms. The maximal differences in the estimates between EM and the two annealing methods are .302 and .307 in Binge Drinking for DAEM and DAVB, respectively. We believe the cause is related to the small prevalence of Class 3: using the EM algorithm, the prevalence of Class 3 is 7.5%, 7.0%, and 0.2% in 1997, 2000, and 2003, respectively, and the average prevalence over time is 4.9%. The other two deterministic alternatives produce class prevalences similar to those of the EM algorithm.
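The time-specific class prevalences quoted above can be obtained, for example, by averaging each respondent's posterior class-membership probabilities at a given occasion; the sketch below illustrates that computation and is an assumption about the bookkeeping rather than a description of the exact routine used in the study.

```python
import numpy as np

# post: array of shape (n_subjects, n_times, n_classes) holding
# P(C_it = c | Y_i = y_i) for each respondent, occasion, and class
def class_prevalence(post):
    by_time = post.mean(axis=0)        # (n_times, n_classes): prevalence per occasion
    over_time = by_time.mean(axis=0)   # (n_classes,): average prevalence across occasions
    return by_time, over_time

# e.g., for Class 3 the per-year values would correspond to 7.5%, 7.0%, and 0.2%,
# giving an average of roughly 4.9% as reported above.
```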
Although the three parameter estimation methods produce distinct estimates for Class 3, differences in the estimates of such a 'small' class may not affect the description of the major classes or the loglikelihood value.

The secondary measurement parameters (i.e., η-parameters) identify common sequential patterns of drinking behaviors. The estimated η-parameters obtained with the three estimation methods are presented in Table 7.6.

Table 7.6: Estimated probabilities of belonging to a class sequence for each profile (η-parameters) and estimated profile prevalence (γ-parameters)

Method   Profile (prevalence, %)   Class   1997    2000    2003
EM       1 (45.0)                  1       0.694   0.746   0.626
                                   2       0.250   0.219   0.348
                                   3       0.055   0.034   0.004
                                   4       0.001   0.000   0.023
         2 (18.6)                  1       0.538   0.058   0.100
                                   2       0.432   0.766   0.610
                                   3       0.000   0.000   0.000
                                   4       0.031   0.175   0.290
         3 (36.4)                  1       0.614   0.252   0.133
                                   2       0.227   0.122   0.015
                                   3       0.139   0.150   0.000
                                   4       0.020   0.476   0.853
DAEM     1 (46.1)                  1       0.782   0.806   0.688
                                   2       0.046   0.027   0.203
                                   3       0.163   0.167   0.078
                                   4       0.009   0.000   0.031
         2 (19.0)                  1       0.610   0.112   0.121
                                   2       0.239   0.766   0.666
                                   3       0.130   0.000   0.001
                                   4       0.021   0.122   0.212
         3 (34.9)                  1       0.672   0.253   0.135
                                   2       0.019   0.034   0.002
                                   3       0.255   0.158   0.001
                                   4       0.054   0.555   0.862
DAVB     1 (47.0)                  1       0.785   0.780   0.688
                                   2       0.013   0.052   0.207
                                   3       0.189   0.165   0.065
                                   4       0.013   0.003   0.040
         2 (18.4)                  1       0.695   0.159   0.094
                                   2       0.102   0.711   0.749
                                   3       0.186   0.030   0.014
                                   4       0.017   0.100   0.143
         3 (34.6)                  1       0.665   0.284   0.139
                                   2       0.033   0.050   0.015
                                   3       0.258   0.136   0.001
                                   4       0.044   0.530   0.845

As shown in Table 7.6, in Profile 1 the probabilities of belonging to Class 1, the class of 'non-current drinkers,' are consistently higher than the probabilities of belonging to the other classes (see the first rows in Table 7.6 for all estimation methods), implying that adolescents in this profile are likely to remain 'non-current drinkers' over time. The LCPA identified another two profiles of adolescents who tended to intensify their drinking habits over time. Adolescents in Profile 2 were more likely to belong to Class 1 (non-current drinkers) in 1997 and to advance to Class 2 (light drinkers) in 2000. By 2003, some of them had advanced further to become members of Class 4 (regular binge drinkers), although many others remained in Class 2 (light drinkers). Profile 3 consists of early drinkers who moved toward Class 4 (regular binge drinkers) by 2000 and stayed in that class through 2003. The three parameter estimation methods yield similar estimates for the η-parameters, so the identified profiles can be labeled consistently across the three methods. The most prevalent profile is Profile 1 for all estimation methods. The largest differences between EM and the two annealing methods are .208 for DAEM in Profile 3, Class 2, in 1997 and .330 for DAVB in Profile 2, Class 2, in 1997.

Point estimates for the logistic regression coefficients (β-parameters) pertaining to each covariate are reported in Table 7.7. Broadly speaking, whites are more likely to be in the profiles involving intensified drinking habits (i.e., Profiles 2 and 3) than in Profile 1 (non-drinking stayers). Interestingly, the odds of belonging to Profile 2 (light drinking advancers) versus Profile 1 are higher for females, whereas the odds of belonging to Profile 3 (regular binge drinking advancers) are higher for males than for their counterparts. The three parameter estimation methods come to the same conclusions in interpreting the logistic coefficients even though the estimated values are not identical.
Table 7.7: Estimated logistic regression coefficients (β-parameters) for the prevalence of profiles (Profile 1, non-drinking stayers, is the baseline)

Method   Covariate                        Profile 2   Profile 3
EM       Intercept                        −1.144       0.511
         Female versus male                1.244      −1.011
         Race versus White: Black         −1.640      −1.510
         Race versus White: Hispanic      −1.600      −0.198
         Race versus White: Other race    −1.133      −0.481
DAEM     Intercept                        −1.161       0.410
         Female versus male                1.349      −0.882
         Race versus White: Black         −1.884      −1.546
         Race versus White: Hispanic      −1.897      −0.240
         Race versus White: Other race    −0.877      −0.459
DAVB     Intercept                        −1.146       0.415
         Female versus male                1.197      −0.823
         Race versus White: Black         −1.457      −1.523
         Race versus White: Hispanic      −1.217      −0.261
         Race versus White: Other race    −0.473      −0.301

7.12.1 Discussion

In the standard EM algorithm, the computation of the joint posterior distribution that involves all time measurements is fairly demanding. Instead of computing the joint posterior probabilities given in (5.5), the recursive formula adopts the forward-backward algorithm to calculate the marginal posterior probabilities given in (5.6) directly. The recursive formula enables us to predict latent membership faster than the standard EM algorithm because the forward and backward quantities can be easily calculated.

In this study, we apply the two annealing methods, DAEM and DAVB, to estimate the model parameters lying on the global maximum of the objective functions for the LCPA models, using the recursive formula at each iteration. Deterministic annealing is an optimization technique aiming to find a global maximum by gradually increasing the value of ω. However, the performance of the annealing methods is contingent upon the choice of annealing schedule. Geman and Geman (1984) have shown that, if the annealing schedule follows ω ∝ log i, where i is the current iteration number, the global solution is theoretically achievable, even though such a schedule might not be realistic in practice. Chang and Lin (2002) proposed a method to adaptively learn the annealing parameter based on the results of the previous iteration; however, it requires matrix calculations that limit it to special cases. In this study, we adopt a very tight annealing schedule, but more research is required to select an appropriate annealing schedule for the LCPA models. Nevertheless, DAEM and DAVB have their own strengths and are therefore attractive alternatives to the standard EM algorithm for discovering an optimal solution.

Roughly speaking, when ω is close to zero, the corresponding θ_{i(s, c_1, ..., c_T)} for both DAEM and DAVB generates random moves with uniform weights and picks the class and profile memberships arbitrarily. As the annealing parameter is gradually adjusted, the localized influences become more discriminating for classes and profiles and enable the deterministic annealing approaches to avoid arbitrary local maxima and to search in the right direction. The DAEM algorithm is more convenient to operate and typically converges more quickly than DAVB. However, when the annealing rate is too sparse or the prevalences of the latent components are not large enough, DAEM may be entrapped in a local mode of the likelihood function. On the contrary, the DAVB algorithm is less influenced by the annealing schedule or the size of the latent components, but it may exhibit slow convergence in the search for the global maximum; therefore, more computational burden should be expected as the trade-off.
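To make the role of ω concrete, the sketch below shows one way a tempered E-step and a pre-determined schedule could be written; the geometric schedule, the array layout, and the helper names are illustrative assumptions, not the exact implementation used in this study.

```python
import numpy as np

def tempered_responsibilities(log_joint, omega):
    """DAEM-style E-step: raise the joint density of each latent configuration
    to the power omega before normalizing.  log_joint has one row per subject
    and one column per latent (profile, class-sequence) configuration."""
    z = omega * log_joint
    z -= z.max(axis=1, keepdims=True)          # stabilize the exponentials
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)    # omega -> 0: near-uniform weights
                                               # omega = 1: ordinary posterior

def annealing_schedule(omega0=0.1, rate=1.2, n_steps=30):
    """Illustrative 'tight' geometric schedule, capped at omega = 1."""
    return [min(1.0, omega0 * rate**k) for k in range(n_steps)]
```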
The most troubling aspect of DAVB is deciding on a well-functioning proposal distribution for its MCMC step. In the beginning, we applied a Metropolis algorithm in which a candidate for the next logistic coefficient vector was sampled from a multivariate t distribution with ν degrees of freedom, β^c ∼ t_ν(β̂, δΣ), where δ was a scalar value used to control the acceptance rate. Using the t distribution, however, the acceptance rate drops as ω approaches one, which causes slow mixing in the Markov chain. Although decreasing the variation in the proposal distribution by setting δ to a small value may alleviate the problem, it does not make sense to sample candidate values only from the regions close to the mean β̂. Therefore, we substituted the t distribution with a multivariate normal to expedite the mixing by keeping many undesired extreme values from being selected.

Although many alternatives to the standard estimation algorithm could be pursued in light of the LCPA model, we mentioned only a few in this presentation. Note that our exploration was intended to demonstrate a possible solution to the difficulties in finding a set of estimates on the global maximum of the objective function. We provided a limited demonstration using real data to elucidate the estimation methods. Our hope is that substantive researchers will be able to identify possible difficulties in estimation for the LCPA model and consider using the proposed solutions in their research.

Figure 7.1: (a) - (c) Tracking plots for classes smoothed by LOWESS with two smoother spans (red: .67 and green: .15) from time 1 to time 3; (d) histogram of class progression (yellow bars) and the number of profiles (blue bars).

Figure 7.2: Histogram of loglikelihood values derived from (a) the standard EM and (b) the DAEM algorithms with 100 different sets of starting values.

Chapter 8
Discussion

8.1 Contributions

The latent stage-sequential process is an attractive tool for many areas of substantive research. We have designed several computational algorithms in order to develop practical understanding and concrete advice for those who may apply the LCPA model to their own data. The unique contributions of this thesis are summarized as follows.

Determining the number of latent components such as classes and profiles can be difficult, especially when prior knowledge is not readily available. We adopt two Bayesian approaches to learn the model without presuming the number of latent components in advance. The RJMCMC turns out to be less computationally feasible because there is no general way to design best-suited jumping rules: when the proposed moves tend to land in regions of low probability density, rapid mixing toward the stationary distribution cannot be achieved, and designing an appropriate jumping principle for dimension changes is not easy. On the contrary, the Dirichlet process is easily carried out to predict the underlying sequential structure thanks to its non-parametric nature. By relaxing the constraints on the number of classes placed at each measurement occasion, the Dirichlet process helps to enhance the visibility of future patterns. However, in the profile-learning phase, the Dirichlet process has difficulty delivering satisfactory results because weak measurements make competing models indistinguishable.
To estimate the unknown parameters, a widely known technique is hill-climbing optimization such as the EM algorithm. This gradient-type optimization searches the local neighborhood of the initial values, but the solution it reaches may be very poor compared with the global optimum. In this study, we first compute the forward probability of being in a certain state (i.e., class) at time point t and the backward probability of having a specific type of future observations after t given the current state. Both are advantageously represented in recursive forms to expedite each iteration. The forward-backward algorithm proceeds on the basis of parameter updates, but the initialization dependence problem remains unsolved. There is a plethora of literature aiming at finding numerically robust parameter estimation tools to deal with local maximum problems (Biernacki et al., 2003; Reddy and Rajaratnam, 2010). For example, Reddy and Rajaratnam (2010) convolve the objective function with kernel functions to flatten the surface and reduce the number of local modes; they demonstrate that optimal solutions can be reached effectively by adjusting the smoothing factor at each iteration. Instead, we implement deterministic annealing EM and a deterministic annealing variant of variational Bayes in order to find parameter estimates on the global mode of the objective function. The two deterministic annealing approaches are presented here to overcome the local maximum issues. Both DAEM and DAVB can be viewed as generic gradient methods that start from a unimodal, well-shaped function of the unknown parameters; by gradually increasing the annealing parameter, the distribution becomes more discriminating, and a good approximation to the global optimum can be located. While we have demonstrated the effectiveness of the deterministic annealing algorithms in combating the local modality problems, a global optimum is not always guaranteed. The most important reason is that there is no universally accepted annealing schedule; one must be very careful when adjusting the increments. Generally, the schedule should be determined so that the annealing rate is neither too passive nor too aggressive.

8.2 Direction for Future Research

In the model selection problem, we adopt reversible jump MCMC because it is the most commonly used MCMC tool for exploring statistical models of varying dimension. We have already discussed the difficulties in proposing efficient jumping rules, especially for the complex LCPA models. Fahimah (2004) suggested using a secondary Markov chain (adding a few fixed-dimension MCMC steps) to modify proposed moves before calculating the acceptance rates. They showed that the acceptance probabilities rise even when the proposals are poorly matched to the true target distribution, but increased programming costs are expected as a trade-off. We present the dynamic Dirichlet learning process analysis of model selection problems without any covariates to predict the prevalence of the profiles; however, the predictor-dependent kernel stick-breaking process has already attracted interest. It has been used for choosing priors for an unknown probability measure (Dunson and Park, 2008) and for variable selection problems (Chung and Dunson, 2009). Adding predictors to the prior gives different insights into how the profiles are formed under the influence of predictors, and the resulting construction is understood as a dependent Dirichlet process; all of these priors build on the stick-breaking representation sketched below.
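For reference, the following is a minimal sketch of the truncated (Sethuraman-type) stick-breaking weights on which such priors rest; the truncation level K and concentration λ shown here are illustrative choices, not values used in the analysis.

```python
import numpy as np

def stick_breaking_weights(lam, K, rng=None):
    """Draw K mixture weights from a truncated stick-breaking construction:
    v_k ~ Beta(1, lam) and pi_k = v_k * prod_{j<k} (1 - v_j).
    Setting v_K = 1 forces the weights to sum to one (almost-sure truncation)."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, lam, size=K)
    v[-1] = 1.0
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

# usage (illustrative): pi = stick_breaking_weights(lam=1.0, K=20, rng=2011)
```

A predictor-dependent (kernel stick-breaking) version would let the Beta draws, and hence the weights, vary with covariates, which is the extension discussed above.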
Future work should explore model selection for the LCPA regression model.

APPENDICES

Appendix A
Acceptance Rate of C → C + 1

For the split move on latent classes (i.e., C′ = C + 1), the acceptance rate α(C, C′) is min(1, A), where

\[
A = \frac{P(\Theta_{(C',S)} \mid y)}{P(\Theta_{(C,S)} \mid y)}
    \times \frac{q(\Theta_{(C',S)}, \Theta_{(C,S)})}{q(\Theta_{(C,S)}, \Theta_{(C',S)})}
    \times \frac{g^*(u^*)}{g(u)}
    \times \left| \frac{\partial(\Theta_{(S,C')}, u^*)}{\partial(\Theta_{(S,C)}, u)} \right|
\]
\[
\phantom{A} = (\text{likelihood ratio}) \times \frac{\pi_0(\Theta_{(C',S)})}{\pi_0(\Theta_{(C,S)})}
    \times \frac{d_{C+1}}{b_C} \times \frac{1}{P_{\mathrm{alloc}}}
    \times \frac{1}{\prod_{s=1}^{S}\prod_{t=1}^{T} g\big(u_s^{(t)}\big) \times \prod_{m=1}^{M} g(u_m)}
    \times \frac{\prod_{s=1}^{S}\prod_{t=1}^{T} \eta^{(t)}_{c^*|s}\,(1+\bar u)^{(1-\bar u)\kappa M}}{\big[(1-\bar u)/\bar u\big]^{M/2}},
\]

where the likelihood ratio is the ratio of the product of the complete likelihood values for the new parameter sets, when the new classes are formed, to that for the old. The function π0 is the product of the prior distributions for the model parameters, the number of latent classes, and the latent allocation variables. To be specific, writing D_v(δ) to denote a Dirichlet density evaluated at v and parametrized by a vector δ,

\[
\frac{\pi_0(\Theta_{(C',S)})}{\pi_0(\Theta_{(C,S)})}
\propto \prod_{s,t}\left\{ \frac{\big(\eta^{(t)}_{c^*_1|s}\big)^{\delta-1}\big(\eta^{(t)}_{c^*_2|s}\big)^{\delta-1}}
  {\big(\eta^{(t)}_{c^*|s}\big)^{\delta-1} B(\delta, C\delta)} \right\}
  \times \prod_{m=1}^{M}\frac{D_{\rho_{m|c^*_1}}(\delta)\, D_{\rho_{m|c^*_2}}(\delta)}{D_{\rho_{m|c^*}}(\delta)}.
\]

Besides, P_alloc is the probability that the specific reallocation of those who were in the class chosen to be split is made. The g function is the Uniform(0, 1) density. For the corresponding merge move, the acceptance probability α(C, C − 1) is min(1, A⁻¹), using the same expression for A but with some corrections.

Appendix B
Acceptance Rate of S → S + 1

For the split move on latent profiles (i.e., S′ = S + 1), the acceptance rate α(S, S′) is min(1, B), where

\[
B = \frac{P(\Theta_{(C,S')} \mid y)}{P(\Theta_{(C,S)} \mid y)}
    \times \frac{q(\Theta_{(C,S')}, \Theta_{(C,S)})}{q(\Theta_{(C,S)}, \Theta_{(C,S')})}
    \times \frac{g^*(u^*)}{g(u)}
    \times \left| \frac{\partial(\Theta_{(S',C)}, u^*)}{\partial(\Theta_{(S,C)}, u)} \right|
\]
\[
\phantom{B} = (\text{likelihood ratio}) \times \frac{\pi_0(\Theta_{(C,S')})}{\pi_0(\Theta_{(C,S)})}
    \times \frac{d_{S+1}}{b_S} \times \frac{1}{P_{\mathrm{alloc}}}
    \times \frac{1}{g_1(w)\prod_{c=1}^{C-1}\prod_{t=1}^{T} g_2(\beta_{ct})}
    \times \frac{\prod_{c=1}^{C}\prod_{t=1}^{T}\eta^{(t)}_{c|s_1} \times \prod_{c=1}^{C}\prod_{t=1}^{T}\eta^{(t)}_{c|s_2}}
           {\prod_{c=1}^{C}\prod_{t=1}^{T}\eta^{(t)}_{c|s^*}}
    \times \frac{1}{\big[w(1-w)\big]^{T(C-1)}},
\]

where the likelihood ratio is the ratio of the product of the complete likelihood values for the new parameter sets, when the new profiles are formed, to that for the old. The function π0 is the product of the prior distributions for the model parameters, the number of latent profiles, and the latent allocation variables. To be specific,

\[
\frac{\pi_0(\Theta_{(C,S')})}{\pi_0(\Theta_{(C,S)})}
\propto \frac{(\gamma_{s^*} w_1)^{\delta-1}(\gamma_{s^*} w_2)^{\delta-1}}{\gamma_{s^*}^{\delta-1} B(S\delta, \delta)}
  \times \prod_{t=1}^{T}\frac{D_{\eta^{(t)}_{s_1}}(\delta)\, D_{\eta^{(t)}_{s_2}}(\delta)}{D_{\eta^{(t)}_{s^*}}(\delta)}.
\]

Besides, P_alloc is the probability that the specific reallocation of those who were in the profile chosen to be split is made. The g_1 function is the Uniform(0, 1) density and g_2 is the N(0, 1) density. For the merge move, the acceptance probability α(S, S − 1) is min(1, B⁻¹), using the same expression for B but with some corrections.

Appendix C
Hessian Matrix

C.1 Diagonal Entries

First, we introduce several quantities for future use:

1. $\theta_{i(s,\mathbf{c})} = \Pr(L_i = s, \mathbf{C}_i = \mathbf{c}_i \mid \mathbf{Y}_i = \mathbf{y}_i)$
2. $\theta_{is} = \sum_{c_1}\cdots\sum_{c_T} \theta_{i(s,\mathbf{c})}$
3. $\theta^{(t)}_{i(s,c_t)} = \prod_{j \ne t}\sum_{c_j} \theta_{i(s,\mathbf{c})}$
4. $\theta^{(t)}_{i c_t} = \sum_{s} \theta^{(t)}_{i(s,c_t)}$
5. $\theta^{(t,t^*)}_{i(s,c,c^*)} = P(L_i = s, C_{it} = c, C_{it^*} = c^* \mid \mathbf{Y}_i = \mathbf{y}_i)$
6. $\theta^{(t,t^*)}_{i(c,c^*)} = \sum_{s} \theta^{(t,t^*)}_{i(s,c,c^*)}$
7. $g(i,m,c,t) = \rho_{m|c_t=c}^{\,I(y_{imt}=1)} \big(1 - \rho_{m|c_t=c}\big)^{I(y_{imt}=0)}$
8. $\delta(i,c,t) = 1 - \prod_{m=1}^{r_m} \dfrac{g(i,m,t,C)}{g(i,m,t,c)}$

Let $f_i = P(Y_{i1} = y_{i1}, \ldots, Y_{iT} = y_{iT})$ and $l_i = \log f_i$. Here $s_{\mathrm{ref}}$ and $c_{\mathrm{ref}}$ denote the reference categories for profile and class, respectively.
The elements of the diagonal block of the Hessian corresponding to the profile probabilities γ are

\[
\frac{\partial^2 l_i}{\partial\gamma_s\,\partial\gamma_{s^*}}
= -\left(\frac{\theta_{is}}{\gamma_s} - \frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}}\right)
   \left(\frac{\theta_{is^*}}{\gamma_{s^*}} - \frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}}\right),
\]

where s and s* = 1, 2, ..., S − 1. To get the diagonal block of the Hessian with respect to η, we consider the following cases.

1. When both derivatives involve the same profile s and the same time point t, the following holds whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t)}_{c_1|s}\,\partial\eta^{(t)}_{c_2|s}}
= -\left(\frac{\theta^{(t)}_{i(s,c_1)}}{\eta^{(t)}_{c_1|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right)
   \left(\frac{\theta^{(t)}_{i(s,c_2)}}{\eta^{(t)}_{c_2|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right).
\]

2. When the profile is the same but t_1 ≠ t_2, the following holds whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t_1)}_{c_1|s}\,\partial\eta^{(t_2)}_{c_2|s}}
= \delta(i, t_1, c_1)\,\delta(i, t_2, c_2)\,
  \frac{\theta^{(t_1,t_2)}_{i(s,c_1,c_2)}}{\eta^{(t_1)}_{c_1|s}\,\eta^{(t_2)}_{c_2|s}}
  - \left(\frac{\theta^{(t_1)}_{i(s,c_1)}}{\eta^{(t_1)}_{c_1|s}} - \frac{\theta^{(t_1)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t_1)}_{c_{\mathrm{ref}}|s}}\right)
    \left(\frac{\theta^{(t_2)}_{i(s,c_2)}}{\eta^{(t_2)}_{c_2|s}} - \frac{\theta^{(t_2)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t_2)}_{c_{\mathrm{ref}}|s}}\right).
\]

3. When the time point is the same but s_1 ≠ s_2, the following holds whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t)}_{c_1|s_1}\,\partial\eta^{(t)}_{c_2|s_2}}
= -\left(\frac{\theta^{(t)}_{i(s_1,c_1)}}{\eta^{(t)}_{c_1|s_1}} - \frac{\theta^{(t)}_{i(s_1,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s_1}}\right)
   \left(\frac{\theta^{(t)}_{i(s_2,c_2)}}{\eta^{(t)}_{c_2|s_2}} - \frac{\theta^{(t)}_{i(s_2,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s_2}}\right).
\]

4. When t_1 ≠ t_2 and s_1 ≠ s_2, the following holds whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t_1)}_{c_1|s_1}\,\partial\eta^{(t_2)}_{c_2|s_2}}
= -\left(\frac{\theta^{(t_1)}_{i(s_1,c_1)}}{\eta^{(t_1)}_{c_1|s_1}} - \frac{\theta^{(t_1)}_{i(s_1,c_{\mathrm{ref}})}}{\eta^{(t_1)}_{c_{\mathrm{ref}}|s_1}}\right)
   \left(\frac{\theta^{(t_2)}_{i(s_2,c_2)}}{\eta^{(t_2)}_{c_2|s_2}} - \frac{\theta^{(t_2)}_{i(s_2,c_{\mathrm{ref}})}}{\eta^{(t_2)}_{c_{\mathrm{ref}}|s_2}}\right).
\]

Next, for the diagonal block of the Hessian with respect to the response measurement parameters ρ, several cases must be considered.

1. When t_1 ≠ t_2, the following holds irrespective of the relationship between k and l or between c_1 and c_2:
\[
\frac{\partial^2 l_i}{\partial\rho_{k|c_{t_1}=c_1}\,\partial\rho_{l|c_{t_2}=c_2}}
= \frac{(2y_{ikt_1}-1)(2y_{ilt_2}-1)}{g(i,k,t_1,c_1)\,g(i,l,c_2,t_2)}
  \left(\theta^{(t_1,t_2)}_{i(c_1,c_2)} - \theta^{(t_1)}_{i c_1}\,\theta^{(t_2)}_{i c_2}\right).
\]

2. When t_1 = t_2 = t and c_1 = c_2 = c but k ≠ l,
\[
\frac{\partial^2 l_i}{\partial\rho_{k|c_t=c}\,\partial\rho_{l|c_t=c}}
= -\frac{(2y_{ikt}-1)(2y_{ilt}-1)}{g(i,k,c,t)\,g(i,l,c,t)}\,
  \theta^{(t)}_{ic}\big(1 - \theta^{(t)}_{ic}\big).
\]

3. When the time point is the same but c_1 ≠ c_2, the following holds whether or not k equals l:
\[
\frac{\partial^2 l_i}{\partial\rho_{k|c_t=c_1}\,\partial\rho_{l|c_t=c_2}}
= -\frac{\theta^{(t)}_{ic_1}(2y_{ikt}-1)}{g(i,k,c_1,t)}\,
   \frac{\theta^{(t)}_{ic_2}(2y_{ilt}-1)}{g(i,l,c_2,t)}.
\]

4. When all subscripts are the same,
\[
\frac{\partial^2 l_i}{\partial\rho_{k|c_t=c}\,\partial\rho_{k|c_t=c}}
= -\frac{\big(\theta^{(t)}_{i c}\big)^2 (2y_{ikt}-1)^2}{g(i,k,t,c)^2}.
\]

C.2 Off-Diagonal Entries

To derive the off-diagonal elements, we start with the block of the Hessian that involves both γ and η. Again, a couple of cases must be considered.

1. When the two parameters share the same profile s,
\[
\frac{\partial^2 l_i}{\partial\gamma_s\,\partial\eta^{(t)}_{c|s}}
= \left(\frac{\theta^{(t)}_{i(s,c)}}{\eta^{(t)}_{c|s}\,\gamma_s} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}\,\gamma_s}\right)
  - \left(\frac{\theta_{is}}{\gamma_s} - \frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}}\right)
    \left(\frac{\theta^{(t)}_{i(s,c)}}{\eta^{(t)}_{c|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right).
\]

2. When s_1 ≠ s_2,
\[
\frac{\partial^2 l_i}{\partial\gamma_{s_1}\,\partial\eta^{(t)}_{c|s_2}}
= -\,\delta(i,c,t)\,\frac{\theta^{(t)}_{i(s_{\mathrm{ref}},c)}}{\gamma_{s_{\mathrm{ref}}}\,\eta^{(t)}_{c|s_{\mathrm{ref}}}}
  - \left(\frac{\theta_{i s_1}}{\gamma_{s_1}} - \frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}}\right)
    \left(\frac{\theta^{(t)}_{i(s_2,c)}}{\eta^{(t)}_{c|s_2}} - \frac{\theta^{(t)}_{i(s_2,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s_2}}\right).
\]

We then derive the off-diagonal elements of the Hessian that involve γ and ρ:
\[
\frac{\partial^2 l_i}{\partial\gamma_{s_1}\,\partial\rho_{k|c_t=c}}
= -\frac{2y_{ikt}-1}{g(i,k,c,t)}
  \left\{\left(\frac{\theta^{(t)}_{i(s_{\mathrm{ref}},c)}}{\gamma_{s_{\mathrm{ref}}}} - \frac{\theta^{(t)}_{i(s_1,c)}}{\gamma_{s_1}}\right)
  - \theta^{(t)}_{ic}\left(\frac{\theta_{i s_{\mathrm{ref}}}}{\gamma_{s_{\mathrm{ref}}}} - \frac{\theta_{i s_1}}{\gamma_{s_1}}\right)\right\}.
\]

Finally, we derive the off-diagonal elements of the Hessian that involve η and ρ. Several cases need to be taken into consideration.
1. When t_1 ≠ t_2, the following expression is applicable whether or not c_1 equals c_2:
\[
\frac{\partial^2 l_i}{\partial\eta^{(t_1)}_{c_1|s}\,\partial\rho_{k|c_{t_2}=c_2}}
= \frac{2y_{ikt_2}-1}{g(i,k,c_2,t_2)}\, R,
\tag{C.1}
\]
where
\[
R = \delta(i,t_1,c_1)\,\frac{\theta^{(t_2,t_1)}_{i(s,c_2,c_1)}}{\eta^{(t_1)}_{c_1|s}}
  - \theta^{(t_2)}_{i c_2}\left(\frac{\theta^{(t_1)}_{i(s,c_1)}}{\eta^{(t_1)}_{c_1|s}} - \frac{\theta^{(t_1)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t_1)}_{c_{\mathrm{ref}}|s}}\right).
\]

2. When the time point is the same but c_1 ≠ c_2,
\[
\frac{\partial^2 l_i}{\partial\eta^{(t)}_{c_1|s}\,\partial\rho_{k|c_t=c_2}}
= -\frac{\theta^{(t)}_{i c_2}(2y_{ikt}-1)}{g(i,k,t,c_2)}
   \left(\frac{\theta^{(t)}_{i(s,c_1)}}{\eta^{(t)}_{c_1|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right).
\]

3. When all subscripts are the same,
\[
\frac{\partial^2 l_i}{\partial\eta^{(t)}_{c|s}\,\partial\rho_{k|c_t=c}}
= \frac{2y_{ikt}-1}{g(i,k,c,t)}
  \left\{\frac{\theta^{(t)}_{i(s,c)}}{\eta^{(t)}_{c|s}}
  - \theta^{(t)}_{ic}\left(\frac{\theta^{(t)}_{i(s,c)}}{\eta^{(t)}_{c|s}} - \frac{\theta^{(t)}_{i(s,c_{\mathrm{ref}})}}{\eta^{(t)}_{c_{\mathrm{ref}}|s}}\right)\right\}.
\]

BIBLIOGRAPHY

A. Agresti. Categorical Data Analysis. Wiley, Hoboken, New Jersey, second edition, 2002.

A. Ahmed and E. Xing. Dynamic non-parametric mixture models and the recurrent chinese restaurant process: with applications to evolutionary clustering. 2008.

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 6:716–723, 1974.

C. E. Antoniak. Mixtures of dirichlet processes with applications to non-parametric problems. The Annals of Statistics, 2:1152–1174, 1974.

K. Bandeen-Roche, D. L. Miglioretti, S. L. Zeger, and P. J. Rathouz. Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92:1375–1386, 1997.

D. J. Bartholomew and M. Knott. Latent variable models and factor analysis. Arnold, London, second edition, 1999, Chap. 6.

C. Biernacki, G. Celeux, and G. Govaert. Choosing starting values for the em algorithm for getting the highest likelihood in multivariate gaussian mixture models. Computational Statistics and Data Analysis, 41:561–575, 2003.

D. Blackwell and J. B. Macqueen. Ferguson distributions via polya urn schemes. The Annals of Statistics, 1:353–355, 1973.

G. E. P. Box and G. C. Tiao. Bayesian Inference in Statistical Analysis. Wiley & Sons, New York, 1992.

M.-W. Chang and C.-J. Lin. Adaptive deterministic annealing for two applications: competing svr of switching dynamics and travelling salesman problems. The 9th International Conference on Neural Information Processing, Singapore, 2002.

S. Chib. Calculating posterior distributions and modal estimates in markov mixture models. Journal of Econometrics, 75:79–97, 1996.

H. Chung, J. C. Anthony, and J. L. Schafer. Latent class profile analysis: an application to stage-sequential process of under-age drinking behaviours. Journal of the Royal Statistical Society, Series A, 2011.

H. Chung, S. T. Lanza, and E. Loken. Latent transition analysis: inference and estimation. Statistics in Medicine, 27:1834–1854, 2008.

Y. Chung and D. B. Dunson. Nonparametric bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association, 104:1646–1660, 2009.

W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829–836, 1979.

C. C. Clogg. Latent class models. In G. Arminger, C. C. Clogg, and M. E. Sobel, editors, Handbook of Statistical Modeling for the Social and Behavioral Sciences, pages 351–359. Plenum, New York, 1995.

C. C. Clogg and L. A. Goodman. Latent structure analysis of a set of multidimensional contingency tables. Journal of the American Statistical Association, 79:762–771, 1984.

L. M. Collins and S. E. Wugalter.
Latent class models for stage-sequential dynamic latent variables. Multivariate Behavioral Research, 27:131–157, 1992.

T. Cover and J. Thomas. Elements of Information Theory. Wiley, New York, 1991.

B. de Finetti. Funzione caratteristica di un fenomeno aleatorio. Atti della R. Academia Nazionale dei Lincei, Serie 6, 4:251–299, 1931.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via em algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

L. C. Dierker, F. Vesela, E. M. Sledjeskia, D. Costelloa, and N. Perrine. Testing the dual pathway hypothesis to substance use in adolescence and young adulthood. Drug and Alcohol Dependence, 87:83–93, 2007.

D. B. Dunson and J.-H. Park. Kernel stick-breaking processes. Biometrika, 95:307–323, 2008.

M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.

T. S. Ferguson. A bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.

T. S. Ferguson and M. J. Klass. A representation of independent increment processes without gaussian components. The Annals of Mathematical Statistics, 43:1634–1643, 1972.

A. K. Formann. Latent class model diagnosis from a frequentist point of view. Biometrics, 59:189–196, 2003.

A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409, 1990.

A. Gelman, W. R. Gilks, and G. O. Roberts. Weak convergence and optimal scaling of random walk metropolis algorithms. Annals of Applied Probability, 7:110–120, 1997.

S. Geman and D. Geman. Stochastic relaxation, gibbs distribution, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:401–412, 1984.

L. A. Goodman. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61:215–231, 1974.

P. J. Green. Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika, 82:711–732, 1995.

S. J. Haberman. Analysis of Qualitative Data. Academic Press, New Developments, New York, 1979.

J. A. Hagenaars and A. L. McCutcheon. Applied latent class analysis. Cambridge University Press, Cambridge, 2002.

W. Hastings. Monte carlo sampling methods using markov chains and their application. Biometrika, 57:97–109, 1970.

T. Heinen. Discrete latent variable models. Tilburg University Press, The Netherlands, 1993, Chap. 2.

E. Hewitt and L. J. Savage. Symmetric measures on cartesian products. Transactions of the American Mathematical Society, 80:470–501, 1955.

K. W. Ho and I. Hu. Flexible modelling of random effects in linear mixed models-a bayesian approach. Computational Statistics and Data Analysis, 52:1347–1361, 2008.

H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161–173, 2001.

H. Ishwaran and L. F. James. Approximate dirichlet process computing for finite normal mixtures: smoothing and prior information. Journal of Computational and Graphical Statistics, In press, 2002.

H. Ishwaran and M. Zarepour. Markov chain monte carlo in approximate dirichlet and beta two-parameter process hierarchical models. Biometrika, 87:371–390, 2000.
Y. Itaya, H. Zen, Y. Nankaku, C. Miyajima, K. Tokuda, and T. Kitamura. Deterministic annealing em algorithm in parameter estimation for acoustic model. Interspeech, pages 433–436, 2004.

K. Katahiral, K. Watanabe, and M. Okada. Deterministic annealing variant of variational bayes method. Journal of Physics: Conference Series, 95, 2008.

D. Kim and B. G. Lindsay. Using confidence distribution sampling to visualize confidence sets. Statistica Sinica, 2009.

R. M. Korwar and M. Hollander. Contributions to the theory of dirichlet processes. The Annals of Statistics, 1:705–711, 1973.

S. T. Lanza and L. M. Collins. A mixture model of discontinuous development in heavy drinking from ages 18 to 30: The role of college enrollment. Journal of Studies on Alcohol, 67:552–561, 2006.

P. F. Lazarsfeld and N. W. Henry. Latent structure analysis. Houghton Mifflin, Boston, 1968.

R. J. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 1987.

J. S. Liu. Nonparametric hierarchical bayes via sequential imputations. The Annals of Statistics, 42:911–930, 1996.

D. J. C. MacKay. Ensemble learning for hidden markov models. 1997.

A. L. McCutcheon. Sexual morality, pro-life values, and attitudes toward abortion: A simultaneous latent structure analysis for 1978–1983. Sociological Methods and Research, 16:256–275, 1987.

B. O. Muthén and K. Shedden. Finite mixture modeling with mixture outcomes using the em algorithm. Biometrics, 55:463–469, 1999.

L. K. Muthén and B. O. Muthén. Mplus user's guide. Muthén & Muthén, Los Angeles, 3rd edition, 2004.

J. Park, W. Cho, and S. Park. Deterministic annealing em and its application in natural image segmentation. Computational and Information Science, 3314:639–644, 2005.

G. P. Patil and C. Taillie. Diversity as a concept and its implications for random communities. Bulletin of the International Statistical Institute, 47:497–515, 1977.

M. Perman and J. Pitman. Size-biased sampling of poisson point processes and excursions. Probability Theory and Related Fields, 92:21–39, 1992.

J. Pitman. Some developments of the Blackwell-MacQueen urn scheme. In Statistics, Probability and Game Theory (Edited by T. S. Ferguson, L. S. Shapley and J. B. MacQueen), volume 30. 1996.

J. Pitman and M. Yor. The two-parameter poisson-dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25:855–900, 1997.

C. K. Reddy and B. Rajaratnam. Learning mixture models via component-wise parameter smoothing. Computational Statistics and Data Analysis, 54:732–749, 2010.

S. Richardson and P. J. Green. On bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B, 59:731–792, 1997.

C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, New York, 2nd edition, 2004.

D. B. Rubin. Inference and missing data. Biometrika, 63:581–592, 1976.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.

J. Sethuraman. A constructive definition of dirichlet priors. Statistica Sinica, 4:639–650, 1994.

M. Tanner and W. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82:528–550, 1987a.

W. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82:528–550, 1987b.

N. Ueda and R. Nakano. Deterministic annealing em algorithm. Neural Networks, 11:271–282, 1998.
N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. Split and merge em algorithm for improving gaussian mixture density estimates. Journal of VLSI Signal Processing Systems, 26:133–140, 2000.

W. F. Velicer, C. A. Redding, M. D. Anatchkova, J. L. Fava, and J. O. Prochaska. Identifying cluster subtypes for the prevention of adolescent smoking acquisition. Addictive Behaviors, 32:228–247, 2007.

T. Xu, Z. Zhang, P. S. Yu, and B. Long. Evolutionary clustering by hierarchical dirichlet process with hidden markov state. 2008.

Y. Yang. Can the strengths of aic and bic be shared? Biometrika, 92, 2003.

W. Yao and B. G. Lindsay. Bayesian mixture labeling by highest posterior density. Journal of the American Statistical Association, 104:758–767, 2009.