FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS

By

Xin Qi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics - Doctor of Philosophy

2014

ABSTRACT

FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS

By Xin Qi

As a branch of statistics, functional data analysis studies data in the form of curves, functions, surfaces, or other objects varying over a continuum such as time or spatial location. The topic has drawn growing attention in recent years because, in many real problems, samples are collected as curves or other functional observations, which calls for statistical models and methodologies that perform inference on such data appropriately. In this dissertation, I introduce several functional data analysis problems arising from the real world and develop methodologies that provide powerful tools to address the corresponding statistical questions.

TABLE OF CONTENTS

LIST OF TABLES ......... v
LIST OF FIGURES ......... vi
Chapter 1 Bayesian Inference for Covariance Ordering in Multivariate Functional Data ......... 1
1.1 Introduction ......... 1
1.2 Spline Mixed Model ......... 4
1.3 Bayesian Hypothesis Testing on Covariance Ordering ......... 6
1.3.1 Model Selection and Hypothesis Testing ......... 6
1.3.2 New Prior on Σ0 with Ordering Representation ......... 8
1.3.3 Bayesian Estimation ......... 9
1.4 Analysis of Audio Frequency Data of Wildlife Species in Michigan ......... 10
1.4.1 Bayesian Inference Results ......... 12
1.4.2 Discussion ......... 15
Chapter 2 Joinpoint Detection using L1-Penalized Spline Method ......... 17
2.1 Introduction ......... 17
2.2 L1-Penalized Regression Spline Model ......... 19
2.3 Consistency of the LASSO Type Estimator ......... 21
2.4 Simulation Study ......... 25
2.5 A Case Study: National Cancer Incidence Rate Analysis ......... 33
2.6 Conclusion and Discussion ......... 37
Chapter 3 Nonparametric Bayesian Clustering of Functional Data ......... 38
3.1 Introduction ......... 38
3.2 Spline Mixed Model for Clustering Multiple Curves ......... 41
3.2.1 Model Specification ......... 41
3.2.2 Derivation of Functional Dirichlet Process and Other Priors ......... 43
3.3 Theory of Bayesian Inference ......... 46
3.3.1 Propriety of the Posterior Distribution ......... 46
3.3.2 Gibbs Updating Steps ......... 46
3.4 Bayesian Inference: A Measure for Comparing a Pair of Clusterings of Sites ......... 52
3.5 Numerical Analysis: Simulation Study and Real Data Example ......... 54
3.5.1 An Alternative Less-Informative Prior Choice ......... 55
3.5.2 Simulations and Comparison with an Existing Approach ......... 56
3.5.3 A Real Data Example: Lung Cancer Rates in the U.S. ......... 57
3.6 Conclusion and Discussion ......... 59
APPENDICES ......... 64
Appendix A Proof of Theorems in Chapter 1 ......... 65
Appendix B MCMC Algorithm via Gibbs Sampling for Bayesian Inference in Chapter 1 ......... 74
Appendix C Maximum Likelihood Estimation in Chapter 1 ......... 76
Appendix D Proof of Theorems and Lemmas in Chapter 2 ......... 78
Appendix E Proof of Theorems in Chapter 3 ......... 83
BIBLIOGRAPHY ......... 86

LIST OF TABLES

Table 1.1: Bayes Factor for Significant Orderings at L00 ......... 15
Table 1.2: Bayes Factor for Significant Orderings at L02 ......... 15
Table 2.1: A Summary of Four Methods to Compare with the PTB ......... 28
Table 2.2: Simulation Result for n = 27, (τ1, τ2) = (8, 18) ......... 29
Table 2.3: Simulation Result for n = 27, (τ1, τ2) = (18, 23) ......... 30
Table 2.4: Simulation Result for n = 54, (τ1, τ2) = (16, 36) ......... 31
Table 2.5: Simulation Result for n = 54, (τ1, τ2) = (36, 46) ......... 33
Table 2.6: Simulation Result for n = 135, (τ1, τ2) = (40, 90) ......... 34
Table 2.7: Simulation Result for n = 135, (τ1, τ2) = (90, 115) ......... 36
Table 2.8: Joinpoint Detection for Real Data Analysis ......... 36
Table 3.1: Cluster Configuration Diagnostics for Simulations: Bayesian DP Approach ......... 56
Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar's
Approach .....56 Table 3.3: The Distribution of Number of Clusters ..................................................................58 Table 3.4: Estimated Cluster Configuration .................................................................................59 v LIST OF FIGURES Figure 1.1: An Example of Biodiversity Data ................................................................................3 Figure 2.1: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = , .................................................................................................................................32 Figure 2.2: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = , .................................................................................................................................32 Figure 2.3: Incidence Rates of Cancer for overall U.S. (1973 - 1999) .........................................35 Figure 2.4: Fitted Incidence Rates (1973 - 1999) .........................................................................35 Figure 3.1: Dendrogram for the Clustering Estimation ................................................................61 Figure 3.2: Estimated Cluster Configuration ................................................................................62 Figure 3.3: Overall Trend by Clusters ..........................................................................................63 vi Chapter 1 Bayesian Inference for Covariance Ordering in Multivariate Functional Data 1.1 Introduction Biodiversity, often defined as the variety of life at all levels of organization [9], has been studied at species level for most of the time. It is closely related to human lives and human activities such as agriculture, health, business and industry, etc. In recent years, unsustainable ecological practices such as habitat destruction, agricultural over-harvesting, and pollution have been damaging species diversity. According to International Union for the Conservation of Nature (IUCN) [19], the world’s main authority on the conservation status of species, through 2012, 19,817 out of 63,837 assessed species were threatened with extinction. The Convention on Biological Diversity (CBD) was recognized to help conservations of biodiversity and sustainable use of its components. Statistical analysis of biodiversity measurements are essential and valuable to conservationists to understand the nature of biodiversity. With species indices and habitat variability assessments, conservationists can identify areas of ecological importance that should be protected. 1 The audio measurements on suburban or rural areas can be used to study impact of human activities on the other wildlife animals. The audio measurements taken at a regular time intervals are converted to energy readings on the frequency domain and energy readings are then divided into several frequency bands. Each frequency band corresponds to a different category of wildlife animals according to the sound they create. By investigating how energy readings in each band are changing over time, we can study interaction between different groups of wildlife animals. For example, Figure 1.1 (a) shows the energy readings of several frequency bands at one suburban location in the upper peninsular of Michigan on June 15 2010 while Figure 1.1 (b) gives a similar plot for a rural location. 
Note that one of the frequency bands (shown in black) corresponds to sounds created by human activity, such as the noise from automobiles on highways. We can see that the energy readings of the other frequency bands (in red, green and blue) behave differently depending on the level of the energy readings generated by human activity. Also, the patterns of such interaction vary across locations, as seen in (a) and (b).

This type of data can be viewed as a set of functional curves, in which each curve corresponds to the energy readings over time for one frequency band. To investigate the interaction between curves (species), we introduce spline mixed models. The off-diagonal entries of the covariance matrix of the random effects can be interpreted as correlations between different frequency bands (i.e., different species), so we can compare the off-diagonal entries to study the sensitivity of species in one frequency band to species in another frequency band. This can be done by Bayesian hypothesis testing on the covariance ordering. However, the entries of the covariance matrix lie in a constrained parameter space due to the positive definiteness criterion, and different orderings of the covariances may have unbalanced prior probabilities, which is not desirable.

In this chapter, we investigate the relationship and interaction between species by conducting Bayesian hypothesis testing on the covariance ordering of the random effects in a spline mixed model for multivariate functional data. We develop a default prior for the covariance matrix which yields balanced prior probabilities on the covariance ordering after an appropriate re-parametrization. With the proposed default prior, we fit a spline mixed model to energy readings data from two locations in the Upper Peninsula of Michigan. Our data analysis finds a significant difference in the relationship among groups of species between the two locations. We state our spline mixed model in section 1.2 and introduce the Bayesian hypothesis test as well as the construction of a default prior for the covariance matrix of the random effects in section 1.3. In section 1.4, we apply our results to the energy reading data in Michigan to compare and discuss the activities of different species.

Figure 1.1: An Example of Biodiversity Data

1.2 Spline Mixed Model

To investigate the interaction between multiple curves (functions) over time, which is of interest in the biodiversity application, we consider a spline mixed model. Suppose that there are d curves, {f_j(τ)}_{j=1}^d, on τ ∈ [0, T]. We observe these curves at n discrete time points, 0 < τ1 < τ2 < · · · < τn < T, during the time period [0, T]. In the audio measurement application, each curve corresponds to one frequency band and the time period is one day. The data were collected over several days, and we treat the days as repeated measurements since it is reasonable to assume that the daily activity patterns of animals, including humans, do not differ much from day to day. Let y_{jt}^(i) be the observed value of the j-th curve at time τ_t on the i-th day, for j = 1, ..., d, t = 1, ..., n and i = 1, ..., M.

For functional data, it is common to approximate a function using basis functions. In this regard, we adopt an approximation based on spline basis functions. That is,

f(τ) ≈ Σ_{m=0}^{q} (τ^m / m!) β_m + Σ_{m=1}^{k} ((τ − ξ_m)_+^q / q!) γ_m,

where a_+ = max{a, 0}. This spline approximation uses order-q spline polynomial functions with knot points {ξ1, ..., ξk}.
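For illustration, the basis functions in this approximation can be evaluated numerically in a few lines. The following Python sketch (the function and variable names are ours; the choices q = 1, k = 8 equally spaced interior knots and n = 48 time points simply anticipate the data analysis in section 1.4) returns the polynomial part and the truncated power part of the basis evaluated at the observation times.

```python
import numpy as np
from math import factorial

def spline_basis(tau, knots, q):
    """Order-q truncated power spline basis evaluated at the time points tau.

    Returns the polynomial part P with columns tau^m / m!, m = 0, ..., q, and
    the truncated part T with one column (tau - xi)_+^q / q! per knot xi, so
    that f(tau) is approximated by P @ beta + T @ gamma.
    """
    tau = np.asarray(tau, dtype=float)
    P = np.column_stack([tau**m / factorial(m) for m in range(q + 1)])
    T = np.column_stack([np.maximum(tau - xi, 0.0)**q / factorial(q) for xi in knots])
    return P, T

# q = 1 (linear splines) with k = 8 equally spaced interior knots on one day,
# observed at n = 48 half-hour time points rescaled to [0, 1].
tau = np.linspace(0, 1, 48)
knots = np.linspace(0, 1, 10)[1:-1]
P, T = spline_basis(tau, knots, q=1)
print(P.shape, T.shape)   # (48, 2) (48, 8)
```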
With the data structure described above, we introduce the following spline mixed model in matrix form for the i-th day. Let

y^(i) = (y_{11}^(i), y_{12}^(i), ..., y_{1n}^(i), ..., y_{d1}^(i), y_{d2}^(i), ..., y_{dn}^(i))^T

be the set of observed curves for the i-th day, i = 1, ..., M. Then we have

y^(i) = (I_d ⊗ X) β^(i) + (I_d ⊗ Z) γ^(i) + ε^(i),    (1.1)

where β^(i) = (β_{10}^(i), ..., β_{1q}^(i), ..., β_{d0}^(i), ..., β_{dq}^(i))^T are fixed effects, γ^(i) = (γ_{11}^(i), ..., γ_{1k}^(i), ..., γ_{d1}^(i), ..., γ_{dk}^(i))^T are random effects, and ε^(i) = (ε_{11}^(i), ε_{12}^(i), ..., ε_{1n}^(i), ..., ε_{d1}^(i), ε_{d2}^(i), ..., ε_{dn}^(i))^T are error terms. The corresponding design matrices for the fixed effects and random effects are

X = [ 1   τ1   · · ·   τ1^q / q! ]
    [ 1   τ2   · · ·   τ2^q / q! ]
    [ ⋮                    ⋮     ]
    [ 1   τn   · · ·   τn^q / q! ] ,

Z = [ (τ1 − ξ1)_+^q   (τ1 − ξ2)_+^q   · · ·   (τ1 − ξk)_+^q ]
    [ (τ2 − ξ1)_+^q   (τ2 − ξ2)_+^q   · · ·   (τ2 − ξk)_+^q ]
    [ ⋮                                              ⋮      ]
    [ (τn − ξ1)_+^q   (τn − ξ2)_+^q   · · ·   (τn − ξk)_+^q ] .

We also assume that ε^(i) ∼ N(0, Σ), where Σ = D ⊗ Σϕ with D = diag(σ_1^2, σ_2^2, ..., σ_d^2) and Σϕ = (ϕ_{uv})_{n×n}, where ϕ_{uv} = 1 if u = v, ϕ_{uv} = ϕ if |u − v| = 1, and ϕ_{uv} = 0 otherwise. The parameters σ_j^2 allow each curve to have its own variability, and Σϕ is a temporal covariance matrix that captures possible temporal dependence within each curve. For the covariance structure of the random effects, we assume γ^(i) ∼ N(0, Σ0 ⊗ I_k) with Σ0 = (σ_{jl})_{d×d}, so that cov(γ_j^(i), γ_l^(i)) = σ_{jl} I_k for j, l = 1, ..., d, and σ_{jl} models the interaction between the j-th and l-th curves.

Let y = (y^(1), y^(2), ..., y^(M))^T be the collection of all data points. Then the matrix form of the entire model is

y = X̄ β + Z̄ γ + ε,    (1.2)

where X̄ = I_M ⊗ I_d ⊗ X and Z̄ = I_M ⊗ I_d ⊗ Z. Note that γ ∼ N(0, G) with G = I_M ⊗ Σ0 ⊗ I_k.

1.3 Bayesian Hypothesis Testing on Covariance Ordering

To investigate the interaction between curves using the spline mixed model introduced in the previous section, we consider the ordering of the covariance matrix of the random effects. In particular, we are interested in a Bayesian test of a specific ordering within a row, e.g., σ12 > σ13 > · · · > σ1d > 0.

1.3.1 Model Selection and Hypothesis Testing

Let θ = Σ0 be the parameter of interest. We assume that the parameter space Θ is partitioned as Θ = ∪_{i=1}^{d!} Θ_i, where {Θ_i} is the partition of Θ induced by the d! possible ordering restrictions of {σ12, σ13, ..., σ1d, 0}. Note that each subspace Θ_i represents a model M_i with the corresponding ordering. To perform model selection, we consider testing the hypotheses

H_i : θ ∈ Θ_i   versus   H_j : θ ∈ Θ_j,   i, j = 1, ..., d!, i ≠ j.

Given the priors π_i = π(Θ_i) and π_j = π(Θ_j), suppose the posterior probabilities are ν_i = π(Θ_i | y) and ν_j = π(Θ_j | y). Then the Bayes factor is defined as the posterior odds ratio divided by the prior odds ratio,

B = (ν_i / ν_j) / (π_i / π_j) = ( ∫_{Θ_i} π(θ_i) f(y | θ_i) dθ_i / ∫_{Θ_j} π(θ_j) f(y | θ_j) dθ_j ) / (π_i / π_j).    (1.3)

When the Bayes factor is large enough, we favor H_i. To compute the Bayes factor, we need to place an appropriate prior on Σ0. The prior distribution should lead to a proper posterior, and the computation of π_i/π_j should be easy. However, common priors for a covariance matrix, such as Jeffreys' prior and the inverse Wishart prior, do not satisfy these criteria. This leads us to develop a default prior.
First, we show that Jeffreys' reference prior leads to an improper posterior, so that it is not suitable for our test.

Theorem 1. Assume d ≥ 4 and Mk > 2. The improper priors π(Σ0) ∝ |Σ0|^{−p} with (1) p = (d + 1)/2, (2) p = d/2 − 1 and (3) p = d give rise to improper posteriors.

The proof of the theorem is provided in Appendix A. By Theorem 1, we cannot calculate ν_i = π(Θ_i | y) and ν_j = π(Θ_j | y), which indicates that Jeffreys' reference prior is inappropriate for the proposed spline mixed model. When an inverse Wishart prior is considered, it is clear that the posterior distribution is proper. However, the inverse Wishart prior is not feasible for Σ0 in our testing problem, since it has no explicit ordering representation. For instance, assume d = 4 and consider testing the hypotheses H1: σ12 > σ13 > σ14 > 0 vs. H2: σ12 > 0 > σ13 > σ14. Note that in this case, Θ1 = {σ12 > σ13 > σ14 > 0} and Θ2 = {σ12 > 0 > σ13 > σ14}. Then, for the inverse Wishart prior Σ0 ∼ IW(S0, m), where S0 is a d × d positive definite matrix and m > d − 1 is a constant, it is difficult to calculate π1 = P(Σ0 ∈ Θ1) and π2 = P(Σ0 ∈ Θ2). Also, there is no guarantee of equal prior probabilities, so prior subjectivity may affect the Bayes factor. In the next section, we introduce a new prior on the covariance matrix which solves both issues. That is, the new prior guarantees propriety of the posterior distribution and equal prior probabilities, so that we do not need to calculate π_i and π_j in the Bayes factor, owing to the cancellation in π_i/π_j.

1.3.2 New Prior on Σ0 with Ordering Representation

We propose a partially improper prior on the covariance matrix via the Cholesky decomposition. For simplicity, we assume Σϕ = I_n. As we see later in section 1.4, the real data support this simple assumption.

Let Σ0 = LL^T be the Cholesky decomposition of Σ0, where L = ( ℓ11  0 ; ℓ1  L11 ) is a lower triangular matrix with ℓ11 > 0, ℓ1 ∈ R^{d−1} and L11 a (d − 1) × (d − 1) lower triangular matrix. Then we have the following representation for Σ0:

Σ0 = ( σ11  σ1∗ ; σ∗1  Σ11 ) = LL^T = ( ℓ11^2  ℓ11 ℓ1^T ; ℓ11 ℓ1  ℓ1 ℓ1^T + L11 L11^T ),

where σ∗1 = σ1∗^T. This re-parametrization separates the constrained vector of parameters (ℓ1) from the unconstrained parameters (ℓ11 and L11). Note that σ1∗ = (σ12, σ13, ..., σ1d) = ℓ11 ℓ1^T. Since ℓ11 > 0, each subset of the parameter space, Θ_i, corresponds to an ordering of the entries of ℓ1 for i = 1, 2, ..., d!. The following theorem guarantees a proper posterior distribution when a flat prior on ℓ1 is used.

Theorem 2. Assume n > q + 1 + k and Mk ≥ d − 1. Let Σ0 = LL^T, where L = ( ℓ11  0 ; ℓ1  L11 ). Suppose that the priors satisfy (1) π(β) ∝ 1, (2) π(D) = Π_{j=1}^{d} π(σ_j^2), where π(σ_j^2) ∝ 1/(σ_j^2)^{α_j + 1}, α_j > 0 for all j = 1, ..., d, and (3) π(L) = π(ℓ11) π(ℓ1) π(L11), where π(ℓ1) ∝ 1 and π(ℓ11) and π(L11) are proper. Then the posterior is proper.

The proof of the theorem is provided in Appendix A.

Now we reconsider the testing example H1: σ12 > σ13 > σ14 > 0 vs. H2: σ12 > 0 > σ13 > σ14. Theorem 2 guarantees proper posteriors, so that ν1 = π(Θ1 | y) and ν2 = π(Θ2 | y) are well defined. Furthermore, when the flat prior π(ℓ1) ∝ 1 is used, the relationship between σ1∗ and ℓ1 implies π(Θ1) = π(Θ2) = 1/d! once 0 is included in the ordering of the covariance parameters. Hence, in this situation, the priors cancel out and the Bayes factor reduces to the posterior ratio B = ν_i/ν_j, which can be calculated from posterior samples straightforwardly.
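As a computational note, once posterior draws of ℓ1 are available (from the Gibbs sampler described in the next section), each ν_i is estimated by the fraction of draws falling in the corresponding ordering region, and the Bayes factor is the ratio of two such fractions. A minimal Python sketch is given below; the array of placeholder draws and the encoding of an ordering are illustrative, not part of the model.

```python
import numpy as np

def ordering_probability(ell1_draws, order):
    """Estimate P(Theta | data) for an ordering of (sigma_12, ..., sigma_1d, 0).

    ell1_draws : (N, d-1) array of posterior draws of ell_1; since ell_11 > 0,
                 the ordering of sigma_1* = ell_11 * ell_1 equals that of ell_1.
    order      : column indices listed from largest to smallest, with None
                 standing for the value 0; e.g. (0, 1, 2, None) encodes
                 sigma_12 > sigma_13 > sigma_14 > 0.
    """
    vals = [np.zeros(len(ell1_draws)) if j is None else ell1_draws[:, j] for j in order]
    keep = np.ones(len(ell1_draws), dtype=bool)
    for a, b in zip(vals[:-1], vals[1:]):
        keep &= a > b
    return keep.mean()

# Bayes factor for H1: s12 > s13 > s14 > 0 against H2: s12 > 0 > s13 > s14;
# with equal prior probabilities it is simply B = nu_1 / nu_2.
ell1_draws = np.random.default_rng(0).normal(size=(1500, 3))   # placeholder draws
nu1 = ordering_probability(ell1_draws, (0, 1, 2, None))
nu2 = ordering_probability(ell1_draws, (0, None, 1, 2))
print(nu1 / nu2)
```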
1.3.3 Bayesian Estimation To obtain the posterior samples of parameters to calculate the Bayes factor, we use a Markov Chain Monte Carlo (MCMC) algorithm, in particular, Gibbs sampling algorithm. To derive the posterior distribution, we consider the following priors for the parameters, β, σj2 and Σ0 . We assume π(β) ∝ 1, π(σj2 ) ∼ IG(aj , bj ), where IG(a, b) denotes the inverse Gamma distribution with the density function π(x) ∝ x−(a+1) e−1/bx . For the prior of Σ0 , we consider the proposed prior of L developed in the previous section. That is, π(L) = π( 11 )π( 1 )π(L11 ) with π( 1 ) ∝ 1, π( 11 ) ∼ IG(a0 , b0 ) and π(L11 LT11 ) ∼ IW (Ω11 , d − 1). Then, by Theorem 2, the posterior distribution is proper and the MCMC algorithm is valid with the proposed prior. In this section, we derive the conditional posterior distributions of 11 , 1 and L11 . Conditional posterior distribution of other parameters are given in Appendix B. First, we introduce the matrix S which is quadratic forms of random effects. That is, S= M i=1 (i) T (i) k gl l=1 gl (i) with gl (i) (i) (i) = γ1l , γ2l , . . . , γdl , for l = 1, . . . , k and i = 1, . . . , M . 9 Then, we can obtain the conditional posterior distributions of 1 , 211 and L11 LT11 : ∼ N ( ∗1 , V ∗ ), | γ, 1 , L11 , . . . ∼ IG L11 LT11 | γ, 11 , 1 ∼ IW (M k − d − 1, S ∗ + Id−1 ) 1 | γ, 11 , L11 , . . . 2 11 1 Mk − d + 1 s 1 −1 + α0 , 11 + , 2 2 β0  sT1  2 s11  where ∗1 = s11 s1 , V ∗ = s11 (L11 LT11 ) and S ∗ = S11 − s 1 s1 sT1 for S =  . 11 11 11 1 s1 S11 With above conditional posterior distributions, we can obtain posterior samples by the Gibbs sampling algorithm. 1.4 Analysis of Audio Frequency Data of Wildlife Species in Michigan In this section, we apply our methodology to audio frequency data of wildlife species in Michigan. The wildlife sounds from two locations, L00 and L02, in Upper Peninsula of Michigan, U.S. were collected in 1 minute interval from June 1 2010 to June 15 2010 at every 30 minutes from 12 a.m. to 11:30 p.m. on each day. L00 is located at Crawford Bay Lane area, Cheboygan, MI which is close to main city roads while L02 is at Crane Nest Narrows, Cheboygan, MI and it is relatively pristine area. Then, the audio data were converted into the frequency data and divided into ten frequency bands which correspond to different groups of species (human, bird type 1, bird type 2, etc.). More relevant information about the source of data can be found in the Remote Environmental Assessment Laboratory at 10 Michigan State University (real.msu.edu). The behavior of energy readings in each frequency band represents activities of different types of species and we can observe how other species interact with human in their activities by how the energy reading curves are changing over time of day. (i) Let {Pj (t) ≥ 0 : τt ∈ [0, T ]; j = 1, . . . , D; i = 1, . . . , M } be the energy readings for D frequency bands obtained from recorded audio measurements. Since the energy readings are (i) scaled, Pj (t) is the proportion for the jth frequency band at time τt on the ith day. Thus, (i) a natural constraint on Pj (t) is (i) D j=1 Pj (t) = 1 for t = 1, · · · , n and i = 1, . . . , M . For (i) fixed i and t, {Pj (t)}D j=1 can be viewed as compositional data. A typical additive log-ratio (i) transformation is widely applied to compositional data, which we adopt for Pj (t) as well. We define the additive log-ratio transformation as (i) (i) yjt = log Pj (t) (i) , j = 1, . . . , D − 1. PD (t) Assume that the set of time points, T = {τ1 , τ2 , . . . 
, τn } with 0 < τ1 < τ2 < · · · < τn < T (i) and d = D − 1. Let yjt : t = 1, . . . , n; j = 1, . . . , d; i = 1, . . . , M be the response variables for the proposed spline mixed model. Since the last four frequency bands contribute on audio intensity relatively small through the period, we consider first six frequency bands and combined the last four frequency bands into one component. Thus, d = 6. The number of time points, n is 48 and the number of days, M is 15. Then, we apply the spline mixed model introduced in section 1.2. An intuitive choice of the order of polynomial for fixed effects is q = 1. We set up k = 8 equally spaced time nodes as knots for the spline functions. For the model (1.2) in section 1.2, we start with some level of time-dependence in Σϕ 11 with the parameter ϕ. To investigate the time-dependence, we obtain maximum likelihood estimator (MLE) of β, γ and ϕ by assuming γ as fixed effect coefficients for simplicity. The derivation of optimal equations for MLE is provided in Appendix C. The MLE of ϕ is ϕ = −0.038 for location L00 and ϕ = 0.024 for location L02. We see in both locations, the correlation between time points are weak, which suggests us to assume ϕ = 0 for the rest of the analysis. 1.4.1 Bayesian Inference Results We run the MCMC algorithm via Gibbs sampling for N = 5000 iterations with three chains. To check the convergence of chains, we carry out Gelman-Rubin diagnostics by calculating the potential scale reduction factor [10]. All three chains converged after 2000 iterations. Our estimation is based on the posterior mean of last 1500 posterior samples, i.e. 500 posterior samples from each chain. The upper 4 × 4 submatrices of estimated covariance matrices ˜ 0, Σ ˜ 2 are the Bayesian of random effects are shown below. Σ0 , Σ2 are the MLEs and Σ estimates for locations L00 and L02, respectively. For location L00,   0.2307 0.1384 0.1140    0.1384 0.1872 0.1546  Σ0 =    0.1140 0.1546 0.1703   0.0611 0.0826 0.1056   0.0611   0.2646 0.1334      0.1334 0.2119 0.0826   ˜   , Σ0 =     0.1079 0.1599 0.1056      0.1114 0.0605 0.0830 12  0.1079 0.0605    0.1599 0.0830   .  0.1906 0.1143    0.1143 0.1481 For location L02,   0.0768    0.0075  Σ2 =   −0.0326   −0.0365    0.0075 −0.0326 −0.0365   0.1118 0.0287 −0.0201 −0.0271         0.0287 0.1073  0.0758 0.0686 0.0400  0.0795 0.0361  ˜   , Σ =   . 2    −0.0201 0.0795  0.1657 0.1173 0.0686 0.1500 0.1209        0.0400 0.1209 0.1671 −0.0271 0.0361 0.1173 0.1865 ˜ 0 and Σ ˜ 2 are very We can see that, overall, the Bayesian estimation for both locations, Σ close to the corresponding MLE, Σ0 and Σ0 , respectively. They have same signs for all the entries of covariance matrices, which justifies our Bayesian method from the likelihood per˜ 2 is slightly different from Σ2 in the sense that spective. Moreover, unlike for location L00, Σ it gives larger variance for each component σ ˜j2 and lower magnitude of negative correlations, i.e. σ ˜13 and σ ˜14 than the MLE. Finally, we suspect a difference in the overall relationship (over time points) between band 1 and either band 3 or 4 for the two locations due to the opposite sign of corresponding entries in two different locations. However, to further support our finding, we have to perform more detailed posterior sample analysis. Bayesian method allows us to obtain posterior samples of σjl from which we can investigate interaction between different types of species (i.e. 
different frequency bands) as well as compare the results from two locations. In particular, we are interested in the relationship between frequency band 1 and frequency bands 2 to 4. The frequency band 1 corresponds to the sounds from human activity while frequency bands 2, 3 and 4 represent three different categories of wildlife animals including birds. So, we compute posterior probabilities of all orderings of {σ12 , σ13 , σ14 , 0}. 0 is included to see whether σjl is positive or negative. There are 4! = 24 possible orderings. We list the first three highest posterior probabilities 13 among 24 possible orderings for each location. For location L00, (0) P (Θ1 = {σ12 > σ13 > σ14 > 0} | data) = P ( 12 > 13 > 14 > 0 | data) = 0.702, (0) P (Θ2 = {σ13 > σ12 > σ14 > 0} | data) = P ( 13 > 12 > 14 > 0 | data) = 0.133, (0) P (Θ3 = {σ12 > σ14 > σ13 > 0} | data) = P ( 12 > 14 > 13 > 0 | data) = 0.083. For location L02, (2) P (Θ1 = {σ12 > 0 > σ13 > σ14 } | data) = P ( 12 > 0 > 13 > 14 | data) = 0.493, (2) P (Θ2 = {σ12 > σ13 > 0 > σ14 } | data) = P ( 12 > 13 > 0 > 14 | data) = 0.222, (2) P (Θ3 = {σ12 > 0 > σ14 > σ13 } | data) = P ( 12 > 0 > 14 > 13 | data) = 0.149. Bayes factors computed from the posterior probabilities can be used to select a particular ordering for each location. Thus, we consider testing the following hypotheses (L) Hi : θ(L) ∈ Θi (L) vs. Hj : θ(L) ∈ Θj , where i, j = 1, 2, 3 represent those particular ordering with high posterior probabilities L = (L) (L) 0, 2 represents the location of the data collected. Since π(Θi ) = π(Θj ) for all i, j = 1, 2, 3 (L) (L) and L = 0, 2, the Bayes factor is just a ratio of posterior probabilities, i.e., Bij = νi (L) /νj . (0) Table 1.1 and 1.2 show the Bayes factors Bij at locations L00 and L02, respectively. From the above analysis, we found that there are significant differences in the patterns of audio frequency bands over time in two locations, L00 and L02. For L00, all three bands 2, 14 Denom. \ Num. σ12 > σ13 > σ14 > 0 σ12 > σ13 > σ14 > 0 1 5.28 σ13 > σ12 > σ14 > 0 σ12 > σ14 > σ13 > 0 8.42 σ13 > σ12 > σ14 > 0 σ12 > σ14 > σ13 > 0 0.19 0.12 1 0.63 1.60 1 Table 1.1: Bayes Factor for Significant Orderings at L00 Denom. \ Num. σ12 > 0 > σ13 > σ14 σ12 > σ13 > 0 > σ14 σ12 > 0 > σ14 > σ13 σ12 > 0 > σ13 > σ14 1 2.22 3.32 σ12 > σ13 > 0 > σ14 0.45 1 1.50 σ12 > 0 > σ14 > σ13 0.30 0.67 1 Table 1.2: Bayes Factor for Significant Orderings at L02 3 and 4 were synchronous with human activity, with band 4 in the lowest level and band 2 in the highest level. On the other hand, band 4 was most sensitive to human activity for the pristine location L02, following by band 3. However, band 2 was synchronous with human activity in L02. 1.4.2 Discussion In this chapter, we investigated audio measurements data from the real ecological world. Our primary objective is to extract the information that reflects the relationship and interaction between the variables of interest and then quantify the synchronicity or discrepancy of such interaction among locations. By classifying the observations as the compositional data varying over time points, we are able to view them as functional curves. We proposed a spline mixed model and interpret the random effect coefficients of spline polynomials as a measure of interaction between curves over time. By specifying the covariance components of random effects, we successfully transformed our concerns to Bayesian inference for the orderings of covariance matrices. 
To solve our problem, we developed a new non-informative prior on the covariance matrices with ordering which possesses posterior propriety as well as 15 simple implementation for testing. Finally we applied our model with the new prior to the biodiversity data from Michigan and our Bayesian analysis successfully identified the main characteristics of interactions between species. Some further extension of our current work may be worthwhile. For instance, it can be challenging but interesting if one consider generalizing the type of prior we proposed to covariance matrices with a wider class of orderings. 16 Chapter 2 Joinpoint Detection using L1-Penalized Spline Method 2.1 Introduction Cancer is one of the major causes of death in the United States. According to a global and regional mortality study from 235 causes of death for 20 age groups presented on The Lancet, it claimed 8.0 million lives in 2010, 15.1% of all deaths worldwide, with large increases in deaths from trachea, bronchus, and lung cancers [15]. Statistical analyses of cancer incidence have gained increasing attention in recent years given such challenging public health issue. Many of them are conducted to investigate a single cancer rate curve varying over time. An major interest is detecting the changes of the trend of annual cancer incident rates. Such concern can be viewed as a joinpoint identification problem based on a certain structure of regression models in statistics. Several methods have been developed for the joinpoint problem recently, including [14], [21], [11] and [4]. In general, the joinpoint regression model can be defined in the following way. Let y be an observed response at time point x ∈ S = {x1 , x2 , · · · , xn } with x1 ≤ x2 ≤ · · · ≤ xn . Then we assume y = β0 + β1 x + δ1 (x − τ1 )+ + · · · + δk (x − τk )+ + , 17 (2.1) where τ1 < τ2 · · · < τk are unknown joinpoints, represents the error and a+ = a✶{a > 0}. The main objective of joinpoint detection is to discover the correct number of joinpoints (k) as well as the location of them (τ s). One classical permutation test based (PTB) approach, proposed by Kim, et al. [14] in 2000, has been adopted by the National Cancer Institute (NCI) as an analytical tool (Joinpoint Software) in their Surveillance Research Program. It generally applies a grid search technique to fit a regression model under homoscedastic and uncorrelated errors. The identification of significant joinpoints is conducted by multiple pairwise permutation tests which compare two best regression models fitted with different sets of presumed joinpoints. From a statistical perspective, it only involves basic estimation and testing knowledge for piecewise linear regression models thus is well motivated and easy to understand. However, due to the nature of permutation test and grid search technique, the PTB approach requires heavy computation if either maximum possible number of joinpoints (k) or the sample size (n) is relatively large, for example, k > 5 or n > 50 usually. Such limitation makes joinpoint detection with big dataset infeasible using the PTB. In this paper we propose a new method to detect the joinpoints on a discrete time grid by introducing an 1 penalized regression spline model. Our model specifies spline base at time knots as the covariates thus will be able to capture potential features of joinpoints as a subset of statistically significant slopes. 
Meanwhile, by adding a penalty term, we allow for shrinkage of the spline model so that insignificant slope changes are eliminated.

The rest of this chapter is organized as follows. We define the penalized regression spline model in section 2.2 and establish some theoretical properties of our estimator in section 2.3. In section 2.4, we carry out a simulation study of our method and compare its performance with the PTB method. Section 2.5 provides a case study of real data, applying our approach to the cancer incidence rates for the overall U.S. from 1973 to 1999; a comparison with the results from the Joinpoint Software is also presented. We conclude with a discussion of possible extensions and future work in section 2.6.

2.2 L1-Penalized Regression Spline Model

In this section we introduce the penalized regression spline model. Assume we observe data {y_i}_{i=1}^n independently at time points {x_i}_{i=1}^n with x_1 ≤ x_2 ≤ · · · ≤ x_n. We define a penalized regression model with first-order spline bases as follows:

y = β_0 + β_1 x + β_2 (x − τ_2)_+ + · · · + β_{p−1} (x − τ_{p−1})_+ + ε,    (2.2)

where ε ∼ N(0, σ^2) and τ_2 < · · · < τ_{p−1} are (p − 2) time knots in the time interval, so that the model specifies p covariates, as in a linear model. For the annual cancer incidence rate problem, it is appropriate and straightforward to take the set of candidate joinpoints to be the set of observed covariates (years), that is, p = n and τ_j = x_j for j = 2, ..., n − 1. Without loss of generality, we assume they are equally spaced without ties. Furthermore, to guarantee the conditions needed for the theoretical justification of our model, we rescale the observed time points, and with them the spline bases, so that x_j = j/n, j = 1, 2, ..., n. In matrix form, the model is

y = Xβ + ε,   ε ∼ N(0, σ^2 I_n),    (2.3)

where y = (y_1, y_2, ..., y_n)^T, β = (β_0, β_1, ..., β_{n−1})^T, ε = (ε_1, ε_2, ..., ε_n)^T and

X = [ 1   x_1   (x_1 − x_2)_+   · · ·   (x_1 − x_{n−1})_+ ]
    [ ⋮                                          ⋮        ]
    [ 1   x_n   (x_n − x_2)_+   · · ·   (x_n − x_{n−1})_+ ] .    (2.4)

Note that the PTB method works only on a small subset of time points as the candidate set for joinpoints each time it performs estimation or testing via grid search. Our penalized method, on the other hand, allows the entire set of candidates and is motivated in a natural, data-driven way. Note also that the model excludes the first year and the last year, since it is not reasonable to view them as possible joinpoints without further information.

When the sample size n is large, the usual least squares estimate of β may contain a large number of nonzero components and fail to be a consistent estimator. This suggests applying one of the most commonly used penalty structures, the Least Absolute Shrinkage and Selection Operator (LASSO), introduced by Tibshirani [20], which amounts to minimizing the L1-penalized negative log-likelihood. Because of the singularity of the L1 norm at the origin, the LASSO is a powerful tool for model selection: we keep only the candidates with nonzero estimated coefficients, which can be interpreted directly as significant slope changes and thus as evidence of joinpoints. In our study, we treat the intercept (β_0) and the first slope (β_1) as the baseline trend of the curve. It follows that they are not supposed to characterize joinpoints and thus are not penalized, unlike the truncated spline coefficients; a small construction sketch for the design matrix in (2.4) is given below.
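The design matrix in (2.4) is straightforward to build. A minimal Python sketch (the function name and layout are ours) under the rescaling x_j = j/n is:

```python
import numpy as np

def joinpoint_design(n):
    """Design matrix X of model (2.4) with rescaled times x_j = j/n.

    Columns: intercept, baseline slope x, and one truncated line (x - x_j)_+
    for each interior candidate joinpoint x_2, ..., x_{n-1}.
    """
    x = np.arange(1, n + 1) / n
    trunc = np.maximum(x[:, None] - x[None, 1:n-1], 0.0)   # (x_i - x_j)_+, j = 2..n-1
    return np.column_stack([np.ones(n), x, trunc])

X = joinpoint_design(27)
print(X.shape)   # (27, 27): p = n columns for beta_0, beta_1, beta_2, ..., beta_{n-1}
```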
It is still feasible, however, to restrict the absolute magnitude of β_0 and β_1 by a pre-specified constant in practice. So we consider a constrained parameter space B = {β : |β_0| + |β_1| ≤ M}, where M is a positive constant independent of n. Our LASSO type estimator is given by

β̂ = argmin_{β ∈ B} { ||y − Xβ||_2^2 / n + λ Σ_{j=2}^{n−1} |β_j| }.    (2.5)

The parameter λ above is called the tuning (shrinkage) parameter; it controls how strongly the β_j are penalized. In practice, λ is often pre-specified or obtained by a data-driven validation procedure. Several algorithms for the LASSO, including the well-known Least Angle Regression (LAR) [5], have been developed in recent years. Most of them automatically assume penalization of the full set of coefficients. However, since the baseline slope β_1 is not penalized in our model, we need an implementation that allows penalization of a partial set of parameters. The coordinate descent algorithm proposed by Friedman, Hastie and Tibshirani [8] meets this demand by solving the regularization paths as we desire.

2.3 Consistency of the LASSO Type Estimator

As mentioned in the previous section, the LASSO method is computationally convenient. Meanwhile, under certain assumptions on the regression model, it also provides statistical accuracy: estimators of LASSO type have been proved to be consistent in a series of notable works, with some basic results available in [3]. We provide a proof of consistency of the LASSO type estimator in (2.5) following their approach.

From now on, we define Σ = X^T X / n and denote its j-th diagonal entry by σ_j^2, i.e., σ_j^2 = Σ_jj. Given the random error ε, we define the event

J_{λ0} = { ω : max_{1≤j≤n} 2 |ε^T X^(j)| / n ≤ λ0 },

where X^(j) denotes the j-th column of the design matrix X. First we give a lemma indicating that the event J_{λ0} has high probability.

Lemma 1. Assume σ_j^2 ≤ K for j = 1, 2, ..., n and some K > 0. Then for all t > 0 and λ0 = 2Kσ sqrt((t^2 + 2 log n)/n), we have

P(J_{λ0}) ≥ 1 − 2 exp(−t^2/2).    (2.6)

The proof of Lemma 1, and of the other propositions, lemmas and theorems, is provided in the Appendix. In particular, taking t^2 ≥ 4 log n, we obtain

P( max_{1≤j≤n} 2 |ε^T X^(j)| / n > λ0 ) ≤ 2 e^{−2 log n} = 2/n^2    (2.7)

for λ0 ≥ 2 sqrt(6) Kσ sqrt(log n / n). In practice we may need to replace the unknown variance parameter σ^2 by a reasonable estimator σ̂^2. The following lemma shows that we can choose the sample second raw moment as a candidate for σ̂^2.

Lemma 2. Let r = ||Xβ||_2^2 / σ^2 be the signal-to-noise ratio of the model (2.2). Consider the estimator σ̂^2 = Y^T Y / n for σ^2. Given any 0 < α < 1, we have

P( σ̂^2 < (1 − α) σ^2 ) ≤ min{ 2(n + 2r) / (α^2 n^2), (12(n + 2r)^2 + 48(n + 4r)) / (α^4 n^4), 1 }.    (2.8)

The lemma says that the proposed σ̂^2 cannot be too small relative to σ^2. Combining the two lemmas above, we conclude the consistency of our estimator β̂.

Theorem 3. (Consistency of the LASSO Type Estimator) Suppose σ_j^2 ≤ K for j = 1, 2, ..., n and some K > 0. Let σ̂^2 = Y^T Y / n and λ = 12Kσ̂ sqrt(log n / n). Consider the model (2.2) with the estimator β̂ in (2.5). If

lim_{n→∞} ||β||_1 / sqrt(n / log n) = 0   and   r_n = ||Xβ||_2^2 / σ^2 ≤ R < ∞ for some R > 0,

then ||X(β̂ − β)||_2^2 / n → 0 almost surely as n → ∞, i.e.,

P( lim_{n→∞} ||X(β̂ − β)||_2^2 / n = 0 ) = 1.    (2.9)

It is easy to verify that the matrix Σ in model (2.2) satisfies σ_j^2 ≤ K for all j with K = 1.
Furthermore, note that the proof of Theorem 3 remains valid when the number of covariates p is greater than the number of time points n although they are identical under our design. That is, our model (2.2) can be extended to allow more intermediate time knots as joinpoint candidates. The main theorem above claims the consistency of our LASSO estimator in terms of prediction error. A stronger conclusion on such type of estimators has also been studied in [3]. In their publication, a more refined 1 consistency result for β is provided under some additional compatibility conditions for the design matrix X. An alternative restricted eigenvalue condition is given [2]. Given an index set S ⊂ {1, 2, · · · , n}, denote β S = (β0,S , β1,S , · · · , βn−1,S )T where βj,S = βj ✶{j ∈ S} for j = 1, 2, · · · , n − 1. The compatibility condition is stated as follows. Assumption 1. (Compatibility Condition) Let Σ = X T X/n be the Gram matrix. Given the true index set S0 with cardinality s0 = |S0 |, there exists φn > 0 such that for all β 23 satisfying β S c 1 ≤ 3 β S0 1 , it holds that 0 s β S0 21 ≤ 02 β T Σβ φn and s 0 λ2 = o(1) φ2n Unfortunately, we point out that under our design with λ = O( (2.10) log n/n), the matrix Σ violates the compatibility condition. Proposition 1. The eigenvalues of the design matrix in (2.4) are √ √ 1 n + 2 + n2 + 4 n + 2 − n2 + 4 , ν2 = ν3 = · · · = νn−1 = , νn = ν1 = 2n n 2n (2.11) From the claim above we see that at least (n − 1) of the eigenvalues of X are no greater than n−1 . Therefore, by the properties of eigenvalues, we conclude that the smallest eigenvalue of Σ = X T X/n, denoted by ν1 (Σ), is at most n−3 . Given the simple fact that β S0 21 ≤ s0 β 22 and ν1 (Σ) β 22 ≤ β T Σβ for all β, it follows that φ2n ≤ n−3 hence λ2 s0 /φ2n ≥ O(n2 log(n)), which contradicts the assumption that s0 λ2 /φ2n = o(1) in (2.10). In other words, unless we choose a λ extremely close to 0, we will not be able to meet the compatibility condition. As discussed above, the violation of compatibility for Σ implies that we are not able to use a similar technique as before to prove the 1 consistency result for β. This is not surprising since the design matrix X consists of rescaled time nodes xj = nj , j = 1, 2, · · · , n so that we may need conditions such as β 2 = O(n) to have r = Xβ 22 /σ 2 < ∞. Also, it is natural to assume s0 < ∞. Then, given S0 = S, β 1 = O(n). Hence the usual 1 consistency for β may not be a reasonable criterion under our settings. On the other hand, if we define S = {2 ≤ j ≤ n − 1 : βj = 0} as our selected set of 24 joinpoints, one may desire the following oracle properties P (S = S0 ) → 1 or P (S ⊃ S0 ) → 1 as n → ∞ (2.12) It indicates that the estimated effective joinpoints will at least cover the true set S0 when n is large. In our case, we may seek a weaker oracle property. Note that X β = (X 1 β, X 2 β, · · · , X n β)T , where X j s are row vectors of X. Let S∗ = {2 ≤ j ≤ n − 1 : X j β = 0} and S∗ = {2 ≤ j ≤ n − 1 : X j β = 0}. Then one may show that P (S ∗ = S∗ ) → 1 or P (S ∗ ⊃ S∗ ) → 1 as n → ∞ (2.13) Note that here X j (j ≥ 2) has nonzero elements only in its first j entries. After all, it is possible that although as a sufficient condition, the compatibility assumption fails under design X, the numerical result still shows excellent performance in variable selections. 
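Computationally, the estimator in (2.5) only requires a LASSO solver that accepts per-coefficient penalty weights, with zero weights on β_0 and β_1. The following Python sketch is a plain coordinate descent with soft-thresholding updates for this weighted problem; it is an illustration under our stated objective, not the exact implementation behind the reported results, and for simplicity it drops the box constraint |β_0| + |β_1| ≤ M. The toy data and the choice of λ (which just mimics the rate in Theorem 3 with K = 1 and a rough noise scale) are ours.

```python
import numpy as np

def soft_threshold(rho, thresh):
    return np.sign(rho) * max(abs(rho) - thresh, 0.0)

def weighted_lasso(X, y, lam, weights, n_iter=1000, tol=1e-8):
    """Coordinate descent for  ||y - X b||_2^2 / n + lam * sum_j weights[j] * |b_j|.

    A weight of 0 leaves the corresponding coefficient unpenalized.
    """
    n, p = X.shape
    beta = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n             # z_j = ||X^(j)||^2 / n
    resid = y.astype(float).copy()           # residual for beta = 0
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            resid += X[:, j] * beta[j]       # add back the j-th contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam * weights[j] / 2.0) / z[j]
            resid -= X[:, j] * beta[j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# Toy run: n = 27 years, one true slope change at x_18, design matrix as in (2.4).
n = 27
x = np.arange(1, n + 1) / n
X = np.column_stack([np.ones(n), x, np.maximum(x[:, None] - x[None, 1:-1], 0.0)])
rng = np.random.default_rng(1)
y = 5.0 + 0.3 * x + 3.0 * np.maximum(x - 18 / n, 0.0) + rng.normal(scale=0.01, size=n)
weights = np.ones(n)
weights[:2] = 0.0                            # beta_0 and beta_1 are not penalized
lam = 12 * 0.01 * np.sqrt(np.log(n) / n)
beta_hat = weighted_lasso(X, y, lam, weights)
print([j for j in range(2, n) if abs(beta_hat[j]) > 1e-8])   # selected joinpoint indices
```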
2.4 Simulation Study In this section, we will examine the performance of joinpoint detection based on the 1 penalized regression (LASSO) method and the PTB method realized by the Joinpoint Software via several simulations. To be consistent with the real data analysis in the next section, we first consider n = 27 and x ∈ S = {1, 2, · · · , 27}. The true models with two joinpoints can be written as follows. y = β0 + β1 x + δ1 (x − τ1 )+ + δ2 (x − τ2 )+ + , 25 (2.14) where the errors are still assumed to be independent Gaussian N (0, σ 2 ). To further specify different models based on quantity of slope changes, assignment of joinpoint positions and the noise level, we let β0 = 5, (τ1 , τ2 ) = (8, 18) or (18, 23) and σ 2 = 0.0001 or 0.001. Meanwhile, as given in [14], the slopes, β1 , δ1 and δ2 are determined by the annual percentage change APC = (APC1 , APC2 , APC3 ) in the following way: β1 = log(1 + 0.01APC1 ), δ1 = log(1 + 0.01APC2 ) − log(1 + 0.01APC1 ) and δ2 = log(1 + 0.01APC3 ) − log(1 + 0.01APC2 ). In our simulation we set APC = (1, 3, 1) or (4, −1, 2), which leads to different trends of data. Hence all the combinations of true parameters give us eight different scenarios. In the LASSO spline model, we clarify the design matrix X as the form of (2.4) with xj = j/27, j = 1, 2, · · · , 27. That is, we rescale observed time points without losing information. One of the major concern when fitting the generated dataset to our model is to select appropriate tuning (penalty) parameters (λs) since it directly controls the sparsity, or the effective number of joinpoints. Some popular rules, such as cross validation (CV) [12] and Bayesian information criteria (BIC) [22], are widely used to handle this issue. As a model validation technique, cross validation partitions a sample of data into a training set and a testing set, performs the analysis on the training set, and validates the analysis on the testing set. It indicates better performance in terms of accuracy of prediction in many literatures. We will apply a 5-fold cross validation in the simulations. For instance, the original observations are randomly partitioned into five almost equal size subsamples. A single subsample is retained as the validation data for testing the model while the remaining four are used as training data. The cross validation process is then repeated five times, with each of the five subsamples used exactly once as the validation data. We simply select the tuning parameter λ with minimum mean squared error (MSE). However, sometimes we may prefer a more simplified model with fewer covariates, or, in our case, not too many estimated 26 joinpoints. Hastie, Tibshirani and Friedman suggest a ‘one standard error rule’ [8], which chooses the largest λ value within one standard error of the original minimizer λ given by ˜ = sup{λ : MSE(λ) ≤ MSE(λ) + SE(λ)}, where MSE(λ) and SE(λ) cross validation, i.e. λ denote the mean squared error and standard error under λ. We will provide the results under both original cross validation approach (L-CV) and one standard error rule (L-SE). As an alternative model for comparison purpose, we also apply the PTB method implemented in the Joinpoint Software. Since the PTB approach is based on the grid search when fitting the model, it is inevitably time consuming even if one requires a moderate amount of candidate joinpoints given a relatively large sample size of data (usually when n ≥ 40). 
Especially, when the maximum number of joinpoint is greater than 4, the computation will be quite slow. As advised by the user manual of the software, we set the maximum number of joinpoint as 4. Since our penalized spline model does not have such limitation and process the same data in a much faster way as described in section 2.2, we claim that it demonstrates more flexibility in modeling various types of real data effectively. Moreover, one of the settings for the PTB implementation imposes a minimum distance (3 as default) between two consecutive estimated joinpoints. Therefore, to make a fair comparison, we also consider a modification for both L-CV and L-SE methods: First we call a sequentially ordered subset of originally selected components a ‘cluster’ if it is the biggest set consisting of two or more consecutive nonzero coefficients that preserve the same sign, either all positive or all negative, i.e. a set either of the form B+ (s, t) := {(βs , βs+1 , · · · , βt−1 , βt ) : βs−1 ≤ 0; βs , · · · , βt > 0; βt+1 ≤ 0} or of the form B− (s, t) := {(βs , βs+1 , · · · , βt−1 , βt ) : βs−1 ≥ 0; βs , · · · , βt < 0; βt+1 ≥ 0} for some 2 ≤ s < t ≤ n − 2. Then for all other selected components that are not from a ‘cluster’, we call each of them a ‘singleton’. Our modification eventually pick only one component 27 from each ‘cluster’ along with all the ‘singletons’ which minimizes the mean squared error as the joinpoints. We denote such modification for L-CV and L-SE methods as L-CV-M and L-SE-M, respectively. To give a summary of all the methods we use to compare with the PTB, we present the following 2 × 2 chart. Tuning Parameter \ Refinement Original LASSO λ1 via Cross-validation L-CV λ2 by the ‘One Standard Error’ Rule L-SE Modified LASSO L-CV-M L-SE-M Table 2.1: A Summary of Four Methods to Compare with the PTB We simulate the data 1000 times for each of eight scenarios and fit them by L-CV, LCV-M, L-SE, L-SE-M and PTB methods. The performance of joinpoint detection can be characterized by three quantities: (1) the average number of correctly identified joinpoints, #CJ; (2) the average number of incorrectly identified joinpoints, #IJ and (3) the average of mean squared error of predictions, PMSE = X(β − β) 2 /n. The results are listed in Table 2.2 and 2.3. From the result we see that in general, the 1 penalized regression spline model with original cross validation procedure (L-CV) provides better detection of true joinpoints compared to PTB or L-SE. However, a relatively high #IJ value indicates it usually introduces more joinpoints than the other two methods. The penalized model with cross validation under one standard error rule (L-SE) performs equally or better than PTB in detecting the true joinpoints for most of the time. And it brings down the #IJ value compared to L-CV model due to a more penalization of parameters. Meanwhile, the modified approach L-CV-M usually leads to more #CJ values than L-SE and less #IJ values than L-CV thus gives a more balanced and robust estimation between L-CV and L-SE. 
The L-SE-M reduces the number of effective joinpoints dramatically but is more likely to miss the true joinpoints 28 σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.377 1.648 1.436 1.592 1.392 1.897 1.942 1.909 1.933 1.900 #IJ PMSE(×10−4 ) 1.284 0.266 3.852 0.940 1.659 0.277 3.831 1.093 1.691 0.419 0.236 0.200 2.982 0.993 0.694 0.208 2.996 1.019 0.714 0.211 APC (1, 3, 1) #CJ 0.306 0.696 0.448 0.283 0.215 0.960 1.391 1.032 1.171 0.966 #IJ PMSE(×10−4 ) 3.047 3.956 4.956 3.679 3.648 3.711 3.375 5.673 2.754 5.663 2.106 2.754 5.121 3.351 3.367 2.913 4.485 5.523 2.952 3.247 Method PTB L-CV L-CV-M L-SE L-SE-M (4, −1, 2) PTB L-CV L-CV-M L-SE L-SE-M Method PTB L-CV L-CV-M L-SE L-SE-M (4, −1, 2) PTB L-CV L-CV-M L-SE L-SE-M Table 2.2: Simulation Result for n = 27, (τ1 , τ2 ) = (8, 18) compared to other methods. As for the computational efficiency, as we expected, LASSO models showed notable superiority to the PTB by equipping with the coordinate decent algorithm mentioned in section 2.2. To investigate the performance of our LASSO method further under larger sample size, we design other groups of simulations with n = 54 and 135, corresponding to 2 and 5 times the original sample size of responses. We keep the settings for APC values and σ 2 same as the previous simulation settings. However, as the time line extends, it is reasonable to simply maintain the relative location of true joinpoints, for example, (τ1 , τ2 ) = (16, 36) or (36, 46) respectively for n = 54. Also note that since the PTB method will be significantly time- 29 σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.118 1.290 1.093 1.010 0.749 1.748 1.714 1.575 1.435 1.259 #IJ PMSE(×10−4 ) 1.667 0.339 3.309 0.444 1.913 0.343 3.113 0.766 2.190 0.571 0.561 0.232 3.314 0.551 1.023 0.353 3.429 0.945 1.470 0.585 APC (1, 3, 1) #CJ 0.242 0.623 0.510 0.272 0.231 0.448 1.134 0.774 0.899 0.604 #IJ PMSE(×10−4 ) 2.637 2.498 3.703 2.533 2.793 2.659 2.586 4.112 2.308 3.867 2.368 3.275 3.897 2.856 2.646 2.981 2.608 5.467 2.098 3.258 Method PTB L-CV L-CV-M L-SE L-SE-M (4, −1, 2) PTB L-CV L-CV-M L-SE L-SE-M Method PTB L-CV L-CV-M L-SE L-SE-M (4, −1, 2) PTB L-CV L-CV-M L-SE L-SE-M Table 2.3: Simulation Result for n = 27, (τ1 , τ2 ) = (18, 23) consuming under large sample size, we decide not to include it in the comparison. The results are shown on Table 2.4 through 2.7. It is clear that as the sample size n increases, the LASSO model tends to select more candidate components as joinpoints, which will be more likely to cover the true joinpoints. Meanwhile, the modified LASSO, L-CV-M and L-SE-M reduce #IJ dramatically by eliminating the redundant information within ‘clusters’. In general, both our methods L-CV-M and L-SE-M give acceptable estimates. On the other hand, note that as the sample size n increases, it is more worthwhile to study the LASSO model with asymptotic tuning parameter λ ∼ O(σ 30 log(n)/n) than the σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.729 1.445 1.729 1.443 1.883 1.696 1.883 1.696 #IJ PMSE(×10−4 ) 6.621 0.954 2.250 0.163 6.626 0.959 2.256 0.163 5.227 1.004 1.434 0.274 5.220 1.010 1.433 0.274 APC (1, 3, 1) #CJ 1.458 0.685 1.329 0.680 1.615 1.065 1.546 1.097 #IJ PMSE(×10−4 ) 8.487 2.089 4.663 1.612 7.406 3.417 4.107 1.645 6.827 2.250 4.041 1.878 6.136 3.386 3.578 1.775 Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Table 2.4: Simulation Result for n = 54, (τ1 , τ2 ) = (16, 36) √ usual one selected from cross-validation techniques. 
Thus we also use λ0 = 4 6σ log(n)/n as specified in section 2.3 to verify the asymptotic behavior of the predicted mean squared error (PMSE). Some results under σ 2 = 0.0010 are shown in Figure 2.1 and 2.2. To save computation time, we only take n = 27 × 5k for k = 3, · · · , 15 for demonstration. We can see a clear decreasing trend for the PMSE, which meets our expectation since we have shown that the prediction error should be close to 0 once the sample size is large enough. 31 Figure 2.1: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = (1, 3, 1), σ 2 = 0.0010 Figure 2.2: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = (4, −1, 2), σ 2 = 0.0010 32 σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.593 1.391 1.518 1.324 1.833 0.828 1.763 0.810 #IJ PMSE(×10−4 ) 5.239 0.416 1.830 0.151 5.307 0.466 1.945 0.163 6.497 0.532 2.876 1.537 6.479 0.618 2.896 1.511 APC (1, 3, 1) #CJ 1.011 0.635 0.669 0.412 1.545 0.748 1.185 0.652 #IJ PMSE(×10−4 ) 6.176 1.650 3.951 1.544 4.262 3.489 3.186 2.329 7.744 1.739 3.894 2.035 5.753 3.488 3.240 2.460 Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Table 2.5: Simulation Result for n = 54, (τ1 , τ2 ) = (36, 46) 2.5 A Case Study: National Cancer Incidence Rate Analysis As a real application of the penalized regression model, throughout this section we will employ our method to analyze the cancer rates data. First we provide Figure 2.3 showing the age-adjusted incidence rates of colon and rectal cancer for the overall population in the U.S. from 1973 to 1999. These data reported in the SEER Cancer Statistics Review [17] utilizing the Joinpoint Software. It is easy to see some clear trends and changes over the time. We again apply PTB and LASSO to the cancer rate data. As indicated in the previous section, when applying the PTB method via the Joinpoint Software, we set the maximum number of potential joinpoints to be four. We use the modified LASSO methods L-CV-M 33 σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.300 0.661 1.300 0.661 1.518 0.611 1.518 0.611 #IJ PMSE(×10−4 ) 16.08 1.584 5.663 0.137 16.08 1.584 5.663 0.137 13.79 1.438 4.736 6.280 13.79 1.441 4.736 6.280 APC (1, 3, 1) #CJ 1.479 0.688 1.483 0.692 1.495 0.598 1.492 0.578 #IJ PMSE(×10−4 ) 18.35 1.783 6.461 0.827 17.75 2.083 6.257 0.815 15.92 1.939 6.120 6.211 15.43 2.231 6.01 5.760 Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-CV-M Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Table 2.6: Simulation Result for n = 135, (τ1 , τ2 ) = (40, 90) and L-SE-M for comparison purpose. The result of fitting is given in Table 2.8. In Table 2.8, PTB (#2) denotes the original unrestricted PTB method while PTB (#4) is implemented under a pre-specified number of joinpoints as four only for comparison purpose. From the results we can see that the LASSO type penalized regression methods tend to be more aggressive when selecting joinpoints than PTB. We have no strong reasons to suspect an over-fitting issue for this model since the tuning parameter λs are well chosen by cross validation procedures. Moreover, we see that L-SE-M, the most penalized method among all of our approaches, gives the same result as the original PTB (#2) while L-CV-M selects more joinpoints than any others. 
So it is clear that our method provides more flexibility in determining the total number of joinpoints, because we can always control this important quantity through an appropriate choice of tuning parameters. Finally, we show the fitted models in Figure 2.4, where we see that both approaches successfully pick out the peak value around year 1985 as well as the bottom value at year 1995, and that our approach offers a quite close fit to the incidence rate curve.

Figure 2.3: Incidence Rates of Cancer for overall U.S. (1973 - 1999)

Figure 2.4: Fitted Incidence Rates (1973 - 1999)

σ² = 0.0001
          APC = (1, 3, 1)               APC = (4, −1, 2)
Method    #CJ    #IJ    PMSE(×10⁻⁴)     #CJ    #IJ    PMSE(×10⁻⁴)
L-CV      1.897  11.89  0.551           1.550  16.72  0.905
L-CV-M    0.676  3.848  0.211           0.271  4.815  15.05
L-SE      1.895  11.88  0.554           1.553  16.69  0.925
L-SE-M    0.675  3.850  0.211           0.272  4.804  15.07

σ² = 0.0010
Method    #CJ    #IJ    PMSE(×10⁻⁴)     #CJ    #IJ    PMSE(×10⁻⁴)
L-CV      1.736  15.41  1.060           1.575  20.53  1.411
L-CV-M    0.465  5.504  0.831           0.247  5.937  13.68
L-SE      1.518  13.85  1.903           1.513  17.93  2.266
L-SE-M    0.468  4.790  0.816           0.237  5.333  14.26

Table 2.7: Simulation Result for n = 135, (τ1, τ2) = (90, 115)

Method      # of Joinpoints   Joinpoint Details (Year)
PTB (#2)    2                 1986, 1995
PTB (#4)    4                 1977, 1982, 1985, 1995
L-CV-M      7                 1976, 1980, 1982, 1985, 1990, 1993, 1995
L-SE-M      2                 1985, 1996

Table 2.8: Joinpoint Detection for Real Data Analysis

Overall, we are satisfied with the performance of our penalized regression model regarding both coverage and accuracy of the joinpoint detection. However, one should note that we assumed homoscedasticity and independence of the errors in our model for simplicity. Further approaches involving more complicated but realistic error structures are therefore worth exploring.

2.6 Conclusion and Discussion

As a summary of this chapter, we proposed a new method to solve the joinpoint detection problem on discrete time points using an ℓ1-type penalized regression spline model (LASSO). We proved a basic version of consistency of the joinpoint estimator for the LASSO and compared it with another commonly used approach, the PTB, via simulations and real data analysis. Based on the results, we conclude that our method performs well: it showed superiority over the PTB in terms of both efficiency and accuracy, and it is easy to understand as well as to implement. In practice, one may obtain good estimates of the joinpoints by carefully selecting a set of appropriate tuning parameters. Moreover, the problem of establishing stronger consistency properties of our estimator remains open and is worthwhile for further investigation.

Chapter 3 Nonparametric Bayesian Clustering of Functional Data

3.1 Introduction

We start this chapter as a continuation of the study of cancer mortality rates. However, unlike in the previous chapter, we first consider a more general scenario in which multiple curves are available for comparison and analysis at the same time. For example, if cancer data can now be obtained at the state level in the U.S., one may overlook potential relationships or interactions among curves from different states by simply fitting them individually. In fact, even for the same joinpoint detection question as in Chapter 2, based on a study of the multiple-curve simulation data appearing in section 3.5.2, our L1-penalized method of Chapter 2 gives #CJ = 0.531 and #IJ = 1.449, while the model proposed by Dass et al.
(2014), which is similar to the one we present in section 3.2, leads to #CJ = 0.816 and #IJ = 0.551, respectively. Thus it is expected that with more information about the curves, such as their group memberships, we are likely to obtain better estimates of their trends. Motivated by this comparison, we now focus on grouping together the curves that have similar patterns. That is, our main concern is whether the curves form groups with similar overall trends within a group but significant variation between groups. For instance, one may suspect that the mortality rates for all 48 states of the U.S. indicate certain geographical similarities. Hence an appropriate model is required to capture the characteristics of the patterns of the curves and to classify them into groups, or clusters, as the primary objective.

In general, the clustering problem briefly introduced above can be conveniently formulated in the following longitudinal way. Let fs(t) be the true function at time t ∈ [0, T] for the s-th individual, s = 1, ···, N. It is usual and natural to assume each function fs to be smooth, so that it can be approximated by polynomial functions of suitable degree.

A natural and heuristic way to study the clustering problem comes from the classification perspective: MacQueen (1967) and Hartigan and Wong (1978) introduced the famous K-means algorithm to partition data into groups. Each curve is first viewed as a discretely observed vector. Then the algorithm, which aims to minimize the within-group sum of squares, is applied directly to the collection of vectors, and the cluster memberships are determined as a result of the classification. Overall, the entire process is straightforward and easy to implement. But there are also clear disadvantages to such an approach. For example, it requires the number of groups or clusters to be specified at the beginning and fixed throughout the procedure, although this number is unknown in most real studies. Moreover, this clustering method treats each individual curve as an independent outcome and thus overlooks potential correlations or interactions between curves, which may not be a negligible factor when recovering the true clusters.

Later, more sophisticated model-based methods were developed to address these issues. James and Sugar (2003) proposed a mixed effects model which is more powerful for clustering sparsely observed functional data. In their approach, B-splines are used to capture individual curves, and random spline coefficients with a normal mixture distribution are used to cluster them. Their method handles missing values well and provides reasonable estimates and confidence intervals for them. However, as with the K-means algorithm, the number of clusters in the model also has to be pre-determined by independent criteria. Meanwhile, the knot points are the same across all clusters, which reduces the flexibility of the model as well as its ability to handle a group of curves with varying connecting points. Abraham et al. (2003) combined a B-spline basis with the K-means procedure on fixed effects to study this problem. Some other model-based approaches include Banfield and Raftery (1993), Serban and Wasserman (2005), Heard et al. (2006) and Ray and Mallick (2006).

We develop our model as a tool for classification by applying a classical nonparametric Bayesian approach, the Dirichlet process (DP) methodology [7], in an innovative way to cluster the trends of different curves represented through polynomial components.
Our method will utilize this representation of smooth curves to extract relevant information such as connection points and fluctuation. In general, our proposed approach does not require an initial estimate of the number of clusters, unlike most of the other methods mentioned in the previous paragraph. Instead, it provides a reasonable estimate along with an empirical distribution. It also allows curves to be characterized by different numbers and locations of knot points, which gives it more capability to fit individual curves well. Furthermore, it performs the clustering based on a good summary of the shape and trend of the curves, and thus demonstrates the similarity of patterns between curves within the same cluster.

The rest of this chapter is organized as follows. We present our functional spline mixed model and prior specifications for Bayesian inference in section 3.2. Some prior components are taken to be improper, and therefore propriety of the posterior becomes a concern. Section 3.3 establishes the propriety of the posterior distribution under some mild conditions. Section 3.4 provides details of the Bayesian inference, in particular, performance measures for comparing clustering configurations. At the same time, we also explore a new, less informative choice of prior on the concentration parameter. Section 3.5 presents results from numerical studies including simulated data as well as the real lung cancer data. Finally, we summarize our work and discuss the proposed model and future research.

3.2 Spline Mixed Model for Clustering Multiple Curves

3.2.1 Model Specification

Suppose that we observe functions {fs(t)}_{s=1}^N at t = t1, t2, ···, tn. Let ξ = (ξ1, ···, ξK) be the knot points for the spline basis functions, where K is the number of knots, ξ1 > t1 and ξK < tn. Let ξ0 = t1 and ξ_{K+1} = tn. The spline basis functions of order q are then given by

    hm(x) = x^{m−1} for m = 1, ···, q   and   h_{q+l}(x) = (x − ξl)_+^{q−1} for l = 1, ···, K.

In practice, the most widely used orders are q = 1, 2, 4, where q = 1 corresponds to a constant function, q = 2 corresponds to a linear function and q = 4 corresponds to a cubic spline function. Then, we assume fs(t) can be approximated, for s = 1, 2, ···, N, as

    fs(t) ≈ Σ_{m=1}^{q+Ks} βm^(s) hm^(s)(t),

where hm^(s)(t) depends on the s-th individual through its time knots ξ^(s) = (ξ1^(s), ···, ξ_{Ks}^(s)).

Let Yst be the observed value of fs(t) at time t, for t ∈ {t1, ···, tn}. Then, the observed values are modeled using the spline representation as

    Yst = Σ_{m=1}^{q+Ks} βm^(s) hm^(s)(t) + e_{st},

where e_{st} is an error that represents variability from various sources. We decompose the error as e_{st} = us + ε_{st}, and consider us as a random effect capturing variability between curves and ε_{st} as an error reflecting variability across time for the s-th function. Thus the model can be written as

    Yst = Σ_{m=1}^{q+Ks} βm^(s) hm^(s)(t) + us + ε_{st}

with us ∼ N(0, τ²) and ε_{st} ∼ N(0, σs²). Let Y_s = (Y_{st1}, ···, Y_{stn})^T, Y = (Y_1^T, ···, Y_N^T)^T, ε_s = (ε_{st1}, ···, ε_{stn})^T, ε = (ε_1^T, ···, ε_N^T)^T and u = (u1, ···, uN)^T. Then, we can write the model in matrix form for the s-th individual as

    Y_s = X^(s) β^(s) + us 1_n + ε_s,

where X^(s) = (x_{ij})_{n×(q+Ks)} with x_{ij} = hj^(s)(ti) for i = 1, ···, n and j = 1, ···, q + Ks, and β^(s) = (β1^(s), ···, β_{q+Ks}^(s))^T. For all s = 1, ···, N, we have

    Y = Xβ + u ⊗ 1_n + ε,                                                   (3.1)

with X = diag(X^(1), ···, X^(N)) and β = (β^(1)T, ···, β^(N)T)^T, where ⊗ denotes the Kronecker product.
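As a concrete illustration of this specification, the following minimal sketch builds the order-q truncated power basis X^(s) for one site and simulates Y_s from model (3.1); the knot locations, coefficients and variance values below are illustrative placeholders, not values used later in the chapter.

```python
# A minimal sketch of model (3.1): build the order-q truncated power basis for
# one site and simulate Y_s = X^(s) beta^(s) + u_s 1_n + eps_s.
# All numerical values are illustrative, not the dissertation's settings.
import numpy as np

def spline_basis(t, knots, q=2):
    """Columns h_1,...,h_q are t^{m-1}; h_{q+l} = (t - xi_l)_+^{q-1}."""
    cols = [t ** m for m in range(q)]
    cols += [np.clip(t - xi, 0.0, None) ** (q - 1) for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n, q = 38, 2
t = np.arange(1, n + 1, dtype=float)
knots = np.array([8.0, 30.0])                  # xi_1, xi_2 for this site (illustrative)
X_s = spline_basis(t, knots, q)                # n x (q + K_s) design matrix
beta_s = np.array([0.5, 0.05, 0.10, -0.20])    # (beta_1, ..., beta_{q+K_s}), illustrative
tau2, sigma2_s = 1 / 30, 1 / 30
u_s = rng.normal(scale=np.sqrt(tau2))          # site-level random effect
Y_s = X_s @ beta_s + u_s + rng.normal(scale=np.sqrt(sigma2_s), size=n)
```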
3.2.2 Derivation of Functional Dirichlet Process and Other Prior

We are interested in clustering functions based on their shapes. Clustering based on the coefficients of basis functions has been proposed by other authors, in particular using the coefficients of spline basis functions, as in [1] and [16]. However, in those approaches the locations and the number of knots are pre-determined before clustering is considered. Different groups of functions may need different numbers of knots and different knot locations to capture their shapes, so models should take such flexibility into account. To model the clustering of N sites with respect to their knot-point locations and corresponding magnitudes, we apply Dirichlet process (DP) priors to the functions fs(t) with an appropriate centering measure, so that the number of knots and the locations of the knots are random and are estimated during the inferential procedure.

To utilize the DP prior, we consider a function θs(t) = fs(t) and let Θ be the space of all possible θs(t), defined below. A functional DP on the space of all distributions on Θ has two hyper-parameters that characterize it, DP ≡ DP(α0 G0): α0 is called the precision parameter and G0 is called the baseline (or centering) distribution on Θ. It is well known that a randomly generated distribution F from DP(α0 G0) is almost surely discrete and admits the representation [18]

    F = Σ_{i=1}^∞ ωi δ_{θi},                                                  (3.2)

where δz denotes a point mass at z, ω1 = η1 and ωi = ηi ∏_{k=1}^{i−1} (1 − ηk) for i = 2, 3, ···, with the ηi i.i.d. Beta(1, α0) and θ1, θ2, ··· i.i.d. from the distribution G0.

To specify G0 on Θ, it is convenient to consider a hierarchical structure. The randomness of G0 is specified by assuming randomness on K, the number of knots, ξ, the locations of the knots, β, the spline coefficients, and u, the random effect component.

(1) First, we specify the distribution of K. We assume that each time interval between knot points has at least w units and consider a truncated Poisson distribution for K. That is,

    p(k) = P(K = k) = (e^{−λ} λ^k / k!) / (Σ_{l=0}^{k*} e^{−λ} λ^l / l!) = (λ^k / k!) / (Σ_{l=0}^{k*} λ^l / l!)

for k = 0, 1, ···, k*, where k* = [(n − 1)/w − 1] ensures that n0 ≡ n − 1 − (K + 1)w > 0.

(2) Given K = k, we assume that

    (n1, n2, ···, n_{k+1}) ∼ Multinomial(n0; 1/(k+1), ···, 1/(k+1)),

where nl ≥ 0 and Σ_{l=1}^{k+1} nl = n0. Then nl + w becomes the length of the time interval between knot points for l = 1, ···, k + 1. That is, ξl is defined recursively as ξ0 = 1 and ξl = nl + ξ_{l−1} + w for l = 1, ···, k.

(3) Given K = k and ξ = (ξ1, ···, ξk), generate β1, ···, β_{q+k} i.i.d. from a density π0 on R¹.

By steps (1)-(3), we set θ(t) = f(t) = Σ_{m=1}^{q+k} βm hm(t) and θ = (f(t1), ···, f(tn))^T = Xβ. From the hierarchical specification above, it follows that the infinitesimal measure for G0 is given by

    G0(dθ) = p(k) [Γ(n0 + 1) / ∏_{l=1}^{k+1} Γ(nl + 1)] (1/(k+1))^{n0} ∏_{m=1}^{q+k} π0(βm) dβm.          (3.3)

In our analysis, we take π0(β) ∝ 1.

Finally, to complete the hierarchical model, we specify priors on α0, λ, τ² and σ² = (σ1², ···, σN²): π(λ) = Ga(aλ, bλ), π(τ²) = IG(aτ, bτ) and π(σs²) = IG(aσ, bσ), where Ga(a, b) denotes the Gamma distribution with density function π(x) ∝ x^{a−1} e^{−x/b} and IG(a, b) has already been defined in Chapter 1. For the precision parameter α0, we also choose a Ga(aα, bα) prior in order to guarantee a simple form for the posterior sampling distributions.
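For intuition, the following hedged sketch draws atoms from the centering measure G0 through steps (1)-(3) and forms a truncated stick-breaking approximation of F ∼ DP(α0 G0) as in (3.2). Because π0 is the improper flat prior, the coefficient draw below substitutes a diffuse normal purely for illustration, and the values of n, w, q, λ and α0 are placeholders.

```python
# A hedged sketch: one draw from the centering measure G_0 (steps (1)-(3)) and a
# truncated stick-breaking approximation of F ~ DP(alpha_0 G_0) as in (3.2).
# The diffuse N(0, 100) coefficient draw stands in for the flat pi_0.
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
n, w, q, lam, alpha_0 = 38, 7, 2, 1.0, 0.5     # illustrative values only

def draw_theta():
    k_star = int((n - 1) / w - 1)
    probs = np.array([lam ** k / factorial(k) for k in range(k_star + 1)])
    k = rng.choice(k_star + 1, p=probs / probs.sum())          # truncated Poisson K
    n0 = n - 1 - (k + 1) * w
    gaps = rng.multinomial(n0, np.full(k + 1, 1.0 / (k + 1)))  # (n_1,...,n_{k+1})
    knots = np.cumsum(gaps[:k] + w) + 1                        # xi_l = n_l + xi_{l-1} + w
    beta = rng.normal(scale=10.0, size=q + k)                  # stand-in for flat pi_0
    return k, knots, beta

def stick_breaking(num_atoms=50):
    eta = rng.beta(1.0, alpha_0, size=num_atoms)               # Sethuraman construction
    weights = eta * np.concatenate([[1.0], np.cumprod(1 - eta[:-1])])
    return weights, [draw_theta() for _ in range(num_atoms)]

weights, atoms = stick_breaking()
print("largest stick-breaking weight:", weights.max())
print("first atom: K =", atoms[0][0], ", knots =", atoms[0][1])
```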
However, as we show in section 3.5 when performing the numerical analysis, such a prior specification may be subjective under some circumstances. Thus we will also discuss an alternative choice of prior distribution for α0. In general, the hyper-parameters of the Gamma and inverse Gamma distributions are chosen to have large variance so that the impact of the prior input is minimal. The specific choices of the hyper-parameters for the various Gamma and inverse Gamma distributions are given in section 3.5.

As mentioned in section 3.1, it is notable that Dass et al. (2014) proposed a mixed effects model with a similar DP prior to identify change-points of multiple curves. Although both our model and theirs feature a nonparametric Bayesian approach, we feel it is worthwhile to point out the major differences between the models and their objectives. In the approach of Dass et al. (2014), piecewise disconnected linear segments, appearing in the model as a set of slopes and intercepts, were used to model the change-points of functional data. Such a design is not intended for clustering. Our model, on the other hand, features a B-spline basis designed to capture and cluster the shape and trend of continuous curves. Even if the order of the B-splines is chosen to be 2, which corresponds to connected linear functions, the interpretation of the coefficients and other estimated parameters will clearly differ because of the difference in the objective and nature of the two models.

3.3 Theory of Bayesian Inference

This section presents some fundamental theory of the nonparametric Bayesian inference for our clustering spline mixed model, including a validation of the propriety of the posterior distribution and the Gibbs sampling scheme for generating posterior samples.

3.3.1 Propriety of the Posterior Distribution

Let S ≡ {1, 2, ···, N} be the set of all N sites. Denote a partition of S by c = ∪_{r=1}^d Cr, and define P to be the set of all partitions of S. We consider probability distributions π(c) on P:

    π(c) = [Γ(α0) / Γ(α0 + N)] α0^d ∏_{r=1}^d (|Cr| − 1)!,                        (3.4)

where |Cr| is the number of elements in Cr.

Theorem 4. Let σ² = (σ1², σ2², ···, σN²), let θ = (θ1, θ2, ···, θN) be the collection of functions on the N sites with θs = (β^(s), Ks, Ts), and let φ = (u, τ²) with support Φ = R^N × (0, ∞). Given model (3.1) and the specification of priors in section 3.2.2, if Ks < min{aσ + (n − 1)/2, [(n − 1)/w]} for all s and some integer w > 0, then the posterior distribution is proper.

The proof of the theorem is provided in the Appendix.

3.3.2 Gibbs Updating Steps

Let S ≡ {1, 2, ···, N} be the set of all N individuals. Denote a partition of S by c = ∪_{r=1}^d Cr, and define P to be the set of all partitions of S. Note that a randomly generated distribution F from DP(α0 G0) admits the Sethuraman representation (Sethuraman, 1994). By integrating out the random measure F, the equivalence between the DP prior (for any general space Θ) and the distribution it induces on P is well known. Moreover, the probability distribution on P has the explicit form

    π(c) = [Γ(α0) / Γ(α0 + N)] α0^d ∏_{r=1}^d (|Cr| − 1)!,                        (3.5)

where |Cr| is the number of elements in Cr. Define θ = (θ1, ···, θN) and θ_{−s} = (θ1, θ2, ···, θ_{s−1}, θ_{s+1}, ···, θN) to be the collection of all θ-components except θs. Let c_{−s} be the partition of S \ {s} determined by θ_{−s}; the identical components of θ_{−s} uniquely determine the partition c_{−s}. Suppose that c_{−s} = ∪_{r=1}^{N*} Er.
The conditional distribution π(c | c_{−s}) based on (3.5) is π(c | c_{−s}) = α0/(α0 + N − 1) if c = {s} ∪ c_{−s}, and π(c | c_{−s}) = Nr/(α0 + N − 1) if c = (E1, ···, Er ∪ {s}, ···, E_{N*}), where Nr denotes the number of elements in Er and c_{−s} = ∪_{j=1}^{N*} Ej.

(1) Update θs: The posterior of θs conditional on θ_{−s} is given by

    π(θs | θ_{−s}, Y, u) = [qs,0 G0*(dθs) + Σ_{r=1}^{N*} qs,r δ_{θ^(r)}] / [qs,0 + Σ_{r=1}^{N*} qs,r],          (3.6)

where

    qs,0 = ∫_Θ π(Y_s | θs, us) G0(dθs) · π(c | c_{−s})   and   qs,r = π(Y_s | θ^(r), us) · π(c | c_{−s}),          (3.7)

respectively, for c = {s} ∪ c_{−s} and c = (E1, E2, ···, Er ∪ {s}, ···, E_{N*}), r = 1, 2, ···, N*. Here

    G0*(dθs) = π(Y_s | θs, us) G0(dθs) / ∫_Θ π(Y_s | θs, us) G0(dθs)                        (3.8)

is the posterior distribution of θs given that a new cluster is formed by the s-th individual. The explicit expression for qs,0, after canceling the factor 1/(α0 + N − 1), is

    qs,0 = α0 Σ_{k=0}^{k*} Σ_{(n1,···,n_{k+1})} [n0! / (n1! ··· n_{k+1}!)] exp[H(n1, ···, n_{k+1})] (1/(k+1))^{n0} p(k),

where

    H(n1, ···, n_{k+1}) = log ∫_{R^{q+Ks}} ∫_0^∞ π(Y_s | β^(s), us, σs²) π(σs²) dσs² dβ^(s)
                       = −[(n − q − Ks)/2] log(2π) − log Γ(aσ) − aσ log bσ + log Γ((2aσ + n − q − Ks)/2)
                         − [(2aσ + n − q − Ks)/2] log(bσ^{−1} + (1/2) Y_{s*}^T P^(s) Y_{s*}) − (1/2) log |X^{(s)T} X^{(s)}|,

with Y_{s*} = Y_s − us 1_n and P^(s) = I − X^(s) (X^{(s)T} X^{(s)})^{−1} X^{(s)T}. Similarly, qs,r is given by

    qs,r = Nr [Γ(aσ + n/2) / (Γ(aσ) bσ^{aσ} (2π)^{n/2})] (b*)^{aσ + n/2},

where (b*)^{−1} = bσ^{−1} + 0.5 (Y_{s*} − X^r β^r)^T (Y_{s*} − X^r β^r), and X^r β^r ≡ θ^(r) for the cluster Cr. Note that θs ≡ θ^(r) if s ∈ Cr.

Expression (3.6) explicitly demonstrates the clustering capability of the functional DP prior. The current value of θs can be set equal to one of the distinct θ^(r) functions with probability Σ_{r=1}^{N*} qs,r / (qs,0 + Σ_{r=1}^{N*} qs,r); this positive probability is the reason for the possible clustering of sites in terms of θs, as mentioned. Expression (3.6) also allows a new θs to be generated from the posterior distribution G0*.

(2) Update σ: The update of σs² is carried out once θs is obtained via (3.6). Regardless of whether θs is a new value or an existing θ^(r), for each individual s = 1, ···, N the conditional posterior distribution of σs² given the other parameters is

    π(σs² | ···) = IG(a, b),

with a = aσ + n/2 and b^{−1} = bσ^{−1} + 0.5 (Y_{s*} − X^(s) β^(s))^T (Y_{s*} − X^(s) β^(s)).

Note that θ uniquely determines the collection of parameters (β, K, T), where β is the set of spline coefficients, K = (K1, ···, KN) is the set of numbers of knots and T = (ξ^(1), ···, ξ^(N)) is the set of knot locations. Indeed, since θ contains several identical components, it follows that the corresponding components of (β, K, T) are also identical to each other. In what follows, we present the updating steps for the d distinct components of (β, K, T), namely (β^r, Kr, T^r) for r = 1, 2, ···, d. Let ∪_{r=1}^d Cr be the partition of {1, 2, ···, N} at the current update of the Gibbs sampler (thus, d is the number of distinct clusters).

(3) Update (β^r, Kr, T^r): We first update Kr from the posterior marginal of Kr, then update T^r | Kr, and finally β^r | T^r, Kr from their respective conditional distributions.
The posterior marginal probability of Kr = k is proportional to

    p(k) Σ_{(n1,···,n_{k+1})} ν(n1, ···, n_{k+1}),                                            (3.9)

with

    ν(n1, ···, n_{k+1}) = exp{H̃(n1, ···, n_{k+1})} [Γ(n0 + 1) / ∏_{l=1}^{k+1} Γ(nl + 1)] (1/(k+1))^{n0},          (3.10)

where

    H̃(n1, ···, n_{k+1}) = log ∫_{R^{q+k}} ∏_{s∈Cr} π(Y_s | X^(s), β, σs², us) dβ
                        = −[(n|Cr| − q − k)/2] log(2π) − (n/2) Σ_{s∈Cr} log σs² − (1/2) log |Σ_{s∈Cr} σs^{−2} X^{(s)T} X^{(s)}|
                          − (1/2) Σ_{s∈Cr} σs^{−2} Y_{s*}^T Y_{s*}
                          + (1/2) (Σ_{s∈Cr} σs^{−2} X^{(s)T} Y_{s*})^T (Σ_{s∈Cr} σs^{−2} X^{(s)T} X^{(s)})^{−1} (Σ_{s∈Cr} σs^{−2} X^{(s)T} Y_{s*}).

Note that X^(s) ≡ X^r for all s ∈ Cr. The summation in (3.9) is over all non-negative integers n1, n2, ···, n_{k+1} such that Σ_{l=1}^{k+1} nl = n0 ≡ n − 1 − (k + 1)w. Obtaining the posterior probability of Kr = k requires evaluation of (3.10) for each value of k ≥ 0. This could potentially require a significant amount of computational time and drastically reduce the efficiency of the Gibbs chain, but this did not occur in our application because of the closed-form expressions of H̃ under the flat prior π0.

To update T^r given Kr = k, note that this is equivalent to updating (n1, ···, n_{k+1}) with probabilities p(n1, ···, n_{k+1}) ∝ ν(n1, n2, ···, n_{k+1}). This is carried out by exhaustively listing all such combinations and numerically computing the corresponding probabilities. The update of β^r given T^r and Kr = k is based on the conditional distribution N(µ_{β^r}, Σ_{β^r}) with

    Σ_{β^r} = (Σ_{s∈Cr} σs^{−2} X^{(s)T} X^{(s)})^{−1} = (X^{rT} X^r Σ_{s∈Cr} σs^{−2})^{−1}                        (3.11)

and

    µ_{β^r} = (X^{rT} X^r Σ_{s∈Cr} σs^{−2})^{−1} X^{rT} (Σ_{s∈Cr} σs^{−2} Y_{s*}) = Σ_{β^r} X^{rT} (Σ_{s∈Cr} σs^{−2} Y_{s*}),          (3.12)

with the q + k components of β^r generated from their respective component densities.

(4) Update λ: λ is updated using

    π(λ | ···) ∝ π3(λ) ∏_{s=1}^N p(Ks) ∝ λ^{a*λ − 1} e^{−λ/bλ} / (Σ_{l=0}^{k*} λ^l / l!)^N,                        (3.13)

with a*λ = aλ + Σ_{r=1}^d |Cr| kr, where kr is the number of knot-points corresponding to θ^(r) in cluster Cr.

(5) Update α0: For updating α0 with π2(α0) ∝ α0^{aα − 1} e^{−α0/bα}, we utilize the two-step procedure of Escobar and West [6]. At the b-th iteration: (1) draw κ from the Beta distribution Be(α0^{(b−1)} + 1, N); and (2) draw α0^{(b)} from the mixture of two Gamma distributions

    πκ Ga(aα + N*, (1/bα − log(κ))^{−1}) + (1 − πκ) Ga(aα + N* − 1, (1/bα − log(κ))^{−1}),

where N* is the latest number of clusters and the membership probability is πκ = (aα + N* − 1) / [N (1/bα − log(κ))].

(6) Update u: We have

    π(u | ···) ∝ π(Y | u, ···) π(u | τ²)
              ∝ exp{ −(1/2) [ Σ_{s=1}^N (us² 1_n^T 1_n − 2 us 1_n^T W_s)/σs² + (u − u0)^T (u − u0)/τ² ] },

where W_s = Y_s − X^(s) β^(s). Notice that 1_n^T 1_n = n. The conditional posterior distribution of u is N(µ_u, Σ_u) with Σ_u = [n Diag(σ1^{−2}, ···, σN^{−2}) + τ^{−2} I_N]^{−1} and µ_u = Σ_u (W + τ^{−2} I_N u0), where W = (1_n^T W_1/σ1², ···, 1_n^T W_N/σN²)^T.

(7) Update τ²: The posterior distribution of τ² is given by

    π(τ² | ···) ∝ π(u | τ²) π(τ²) ∼ IG(a*τ, b*τ),

with a*τ = N/2 + aτ and b*τ = (1/bτ + (1/2) u^T u)^{−1}.

3.4 Bayesian Inference: A Measure for Comparing a Pair of Clustering of Sites

Given the validity and details of the Gibbs updating scheme for all the parameters based on the elicited improper prior components from section 3.2.2, we are able to perform Bayesian inference on their posterior samples, especially the estimated cluster information θ = (θ1, ···, θN). Suppose that we have M Gibbs samples for θ, {θ^(m)}_{m=1}^M, after convergence is established. It may be challenging to obtain results for the clustering configuration directly from these samples: since such information (i.e.
the number of clusters and the specific assignment of each site) changes at each Gibbs iteration, classical posterior analysis based on collecting summary statistics for the clustering information is not straightforward and is not easy to interpret.

We introduce a 2 × 2 cross-classification table to measure the deviation between two cluster configurations, say C and C*. The table has entries {nij}, 1 ≤ i, j ≤ 2, where n11 is the number of pairs of sites, out of all n++ = N(N − 1)/2 pairs, that are in the same cluster in C as well as in C*, while n22 is the number of pairs of sites that are in the same cluster in neither C nor C*. n12 is the number of pairs that are in the same cluster in C but not in C*. Similarly, n21 is the number of pairs that are not in the same cluster in C but are in the same cluster in C*. If the two clusterings C and C* have identical configurations, we have n11 = n+1 and n22 = n+2. The interpretation of n+1 is that it is the number of pairs that are in the same cluster in C*, whereas n+2 is the number of pairs that are not in the same cluster in C*. Thus, the quantities n+1 and n+2 depend only on C* and not on C.

Then, we consider two measures, the sensitivity S1 = n11/n+1 and the specificity S2 = n22/n+2. The interpretation of S1 is that it is the proportion of pairs that are also in the same cluster in C given that they are in the same cluster in C*. Similarly, S2 is the proportion of pairs that are also not in the same cluster in C given that they are not in the same cluster in C*. It is clear that S1 and S2 take values between 0 and 1, with the ideal case (S1, S2) = (1, 1). Also note that when C ⊂ C*, that is, when every cluster of C is contained in some cluster of C*, the number of clusters in C is at least as large as that of C*; in this case n12 = 0, which yields S2 = 1 but S1 ≤ 1. On the other hand, when C* ⊂ C, we have S1 = 1 and S2 ≤ 1. Thus, deviations from the point (1, 1), or from the lines y = 1 and x = 1, give an idea about the nature of the deviations of the clustering C from the clustering C*.

For simulation studies, (S1, S2) can be used to measure deviations between a Gibbs cluster configuration and the true one. Suppose that from the Gibbs output we have cluster configurations {Cm}_{m=1}^M. For the m-th Gibbs cluster configuration, we can calculate (S1(m), S2(m)) with C = Cm and C* = C0 specified by the true cluster information. A plot of (S1(m), S2(m)) will indicate how dispersed the current Gibbs cluster is with respect to the true configuration: if (S1(m), S2(m)) is concentrated close to (1, 1), it indicates that the deviation of Cm from C0 is small. Points (S1(m), S2(m)) far from (1, 1) indicate two different types of deviations, depending on their proximity to the lines x = 1 or y = 1. Proximity to the line x = 1 indicates that among the pairs that are in the same cluster in C0, more pairs are also in the same cluster in Cm, whereas proximity to y = 1 indicates that among the pairs not in the same cluster in C0, more pairs are also not in the same cluster in Cm. For simulation studies, C0 can be taken to be the true cluster configuration since the true clustering is known.

In the 2 × 2 cross-classification table, there are two degrees of freedom. The quantities n+1 and n+2 are fixed for the true cluster configuration, so we need two free parameters, say n11 and n22, to determine all entries in the table completely.
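A minimal sketch of these quantities, computed directly from the pair counts, follows; the helper names and the small example partitions are ours, not part of the dissertation's implementation.

```python
# A minimal sketch of the 2x2 cross-classification measures: given a sampled
# partition C and a reference partition C*, compute S1 = n11/n+1, S2 = n22/n+2,
# and the summary SS = mean(2 - S1 - S2) over Gibbs draws.
import numpy as np
from itertools import combinations

def pair_scores(labels, labels_ref):
    n11 = n12 = n21 = n22 = 0
    for i, j in combinations(range(len(labels)), 2):
        in_c = labels[i] == labels[j]
        in_ref = labels_ref[i] == labels_ref[j]
        n11 += in_c and in_ref
        n12 += in_c and not in_ref
        n21 += (not in_c) and in_ref
        n22 += (not in_c) and not in_ref
    S1 = n11 / (n11 + n21)      # pairs together in C* that are also together in C
    S2 = n22 / (n12 + n22)      # pairs apart in C* that are also apart in C
    return S1, S2

# Toy example: SS over M Gibbs configurations against the true clustering C0.
C0 = np.array([0, 0, 1, 1, 2, 2])
gibbs = [np.array([0, 0, 1, 1, 2, 2]), np.array([0, 0, 0, 1, 2, 2])]
scores = [pair_scores(c, C0) for c in gibbs]
SS = np.mean([2 - s1 - s2 for s1, s2 in scores])
print(scores, SS)
```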
The scatter plot of {(S1(m), S2(m)) : m = 1, 2, ···, M} therefore gives a complete picture of the variability. As a numerical measure of variability, we consider

    SS = (1/M) Σ_{m=1}^M (2 − S1(m) − S2(m)),

with a smaller SS value indicating better performance.

3.5 Numerical Analysis: Simulation Study and Real Data Example

In this section, we apply the proposed method to numerical studies, including simulations under various settings as well as a real data analysis of the lung cancer data in the U.S. We simulate data at n = 38 time points for N = 49 sites based on the model. We partition the sites into 6 groups (clusters) Cr, r = 1, 2, ···, 6. Within each cluster, the curves share common information in the form of (β^(r), Kr, T^(r)). As stated in the previous section, we collect the overall posterior samples via an MCMC procedure with a Gibbs sampling scheme at each iteration. The true parameters are specified as follows.

r   |Cr|   Cr (Site #)                                  T^(r)         β^(r)
1   9      2, 4, 11, 25, 27, 36, 43, 46, 49             (8, 30)       (0.0667, 0.0500, −0.2000)
2   9      12, 13, 14, 21, 22, 26, 33, 40, 48           (7, 11, 20)   (−0.0667, 0.0300, 0.1667, −0.3333)
3   8      3, 5, 15, 17, 24, 30, 35, 42                 (13, 25)      (−0.1333, −0.1333, 0.1667)
4   7      1, 9, 10, 23, 32, 39, 41                     (17, 21)      (−0.0333, −0.0167, 0.1667)
5   7      6, 18, 20, 28, 31, 38, 44                    (18, 20)      (−0.0667, 0.0200, −0.1333)
6   9      7, 8, 16, 19, 29, 34, 37, 45, 47             (18, 20)      (0.0167, 0.0167, 0.3333)

As for the true values of the variability parameter between the curves, τ², and within the curves, σs², we consider the following settings: τ² = 1/30 or 2/30, and σs² ∼ U(1/30, 2/30) or U(2/30, 4/30). These specifications lead to 4 scenarios.

3.5.1 An Alternative Less-informative Prior Choice for α0

Recall that when specifying the priors for the parameters of our model in section 3.2.2, we raised a question about the choice of prior for the precision parameter α0, which directly controls the ability to create a new cluster, that is, to increase the number of clusters. We observed that under the Gamma prior for the precision parameter α0, the corresponding posterior distribution is obtained from a mixture of two Gamma distributions [6]. However, in practice, to guarantee a reasonable extent of splitting of existing clusters, we may need a very small value of α0, which leads to a highly subjective Gamma prior for α0. For instance, in our simulation, we require α0 ∼ Ga(aα, bα) with E(α0) = aα bα ≈ 10⁻⁸ and Var(α0) ≈ 10², and thus an extremely small aα and a large bα. To resolve this issue, we introduce another prior, α0 ∼ LN(µα, σα²), where LN(µ, σ²) denotes the log-Normal distribution with corresponding mean parameter µ and variance parameter σ². The intuition behind this re-parameterization is that we believe it is more appropriate to put a relatively noninformative Gaussian prior on log(α0): log(α0) ∼ N(µα, σα²). For our simulation we use µα = −40 and σα = 7. Note that the mean and variance of α0 are then E(α0) = e^{µα + σα²/2} ≈ 1.86 × 10⁻⁷ and Var(α0) = (e^{σα²} − 1) e^{2µα + σα²} ≈ 6.57 × 10⁷. Hence we are able to claim that the log-Normal type prior for α0 is much less subjective than the original Gamma prior. To complete the posterior sampling scheme for π(α0) = LN(µα, σα²), we simply apply a grid search over (µα − 3σα, µα + 3σα), which covers over 99% of the range of α0.
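For completeness, a hedged sketch of this grid-search update follows: by (3.4), the full conditional of α0 given a partition with d clusters is proportional to the log-Normal prior times α0^d Γ(α0)/Γ(α0 + N), which is evaluated on an equally spaced grid of log(α0) over (µα − 3σα, µα + 3σα). The grid size below is an arbitrary illustrative choice.

```python
# A hedged sketch of the grid-search update for alpha_0 under the log-Normal
# prior.  The full conditional is proportional to
#   LN(alpha_0; mu_a, sigma_a^2) * alpha_0^d * Gamma(alpha_0) / Gamma(alpha_0 + N),
# which follows from the partition prior (3.4).
import numpy as np
from scipy.special import gammaln

def sample_alpha0(d, N, mu_a=-40.0, sigma_a=7.0, grid_size=400,
                  rng=np.random.default_rng(0)):
    log_a = np.linspace(mu_a - 3 * sigma_a, mu_a + 3 * sigma_a, grid_size)
    a = np.exp(log_a)
    # log density of log(alpha_0) ~ N(mu_a, sigma_a^2) plus the alpha_0 terms of (3.4)
    logp = (-0.5 * ((log_a - mu_a) / sigma_a) ** 2
            + d * log_a + gammaln(a) - gammaln(a + N))
    p = np.exp(logp - logp.max())
    return rng.choice(a, p=p / p.sum())

print(sample_alpha0(d=6, N=49))
```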
3.5.2 Simulations and Comparison with an Existing Approach

We generate 10 sets of data under each scenario specified at the beginning of section 3.5 and run the MCMC for 5000 iterations with two chains to guarantee convergence. For comparison, we also apply the functional clustering methodology proposed by James and Sugar (2003) to the same simulated datasets. The results are listed in Tables 3.1 and 3.2.

τ²     σs²             S1 (Sensitivity)   S2 (Specificity)   SS (Overall)   Med. #   Mode #
1/30   U(1/30, 2/30)   0.8813             0.9844             0.1343         7        6
1/30   U(2/30, 4/30)   0.6466             0.9827             0.3707         6.5      6
2/30   U(1/30, 2/30)   0.8898             0.9843             0.1260         6        6
2/30   U(2/30, 4/30)   0.6443             0.9827             0.3730         6.5      6

Table 3.1: Cluster Configuration Diagnostics for Simulations: Bayesian DP Approach

τ²     σs²             S1 (Sensitivity)   S2 (Specificity)   SS (Overall)   True #
1/30   U(1/30, 2/30)   0.7107             0.9066             0.3827         6
1/30   U(2/30, 4/30)   0.7017             0.9028             0.3955         6
2/30   U(1/30, 2/30)   0.6792             0.9137             0.4071         6
2/30   U(2/30, 4/30)   0.6978             0.9152             0.3870         6

Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar's Approach

From the results for our approach in Table 3.1, we obtain excellent performance in the specificity (S2) as well as in the median and mode of the estimated number of clusters. Meanwhile, the sensitivity (S1) is mainly affected by a relatively large σs², the variability of the individual curves within a cluster. On the other hand, since James and Sugar's model requires a pre-specified number of clusters, as noted in section 3.1, we simply plug in the true number of clusters, 6, when applying their model, to avoid an extra determination step. We also point out that the diagnostic measures (sensitivity, specificity, etc.) are calculated from the predicted cluster memberships in their case, rather than by averaging over the posterior samples as in our case. It is clear from the comparison that our method leads to better specificity (S2) in general and higher sensitivity (S1) for smaller σs². However, it appears that their approach is slightly more robust under large individual variability.

3.5.3 A Real Data Example: Lung Cancer Rates in the U.S.

Besides the simulation study above, we apply our proposed model to the lung cancer mortality rate curves from 48 states and Washington, D.C. In the data analysis, the minimum number of time points in each time interval is set to w = 7, which leads to a maximum number of knot-points of k* = 4. All the hyper-parameters are then specified as follows. The shape and scale parameters of the inverse Gamma priors on σs² and τ² are aσ = aτ = 2 and bσ = bτ = 100, respectively, so that the prior variances are large enough to be considered infinite. The shape and scale parameters of the Gamma prior on λ are aλ = 0.004 and bλ = 500, respectively. For the prior on the precision parameter α0, as discussed in section 3.5.1, we use the log-Normal distribution with µα = −100 and σα = 11. Recall that the precision parameter α0 controls the number of clusters while λ controls the number of knot-points.

We run two chains of the MCMC algorithm with Gibbs sampling for 20000 iterations each. Convergence of the chains is attained after 15000 iterations according to the Gelman-Rubin diagnostics introduced in Chapter 1, section 1.4.1. We collect the posterior sample by choosing every fifth sample from the last 5000 iterations of both chains. This sampling procedure provides a better 'mixed' posterior sample with a total size of 2000.
The posterior probability of the number of clusters of states is shown in Table 3.3.

P(d = 2)   P(d = 3)   P(d = 4)   P(d = 5)   P(d = 6)   P(d = 7)   P(d = 8)
0.093      0.217      0.298      0.236      0.110      0.038      0.008

Table 3.3: The Distribution of the Number of Clusters d

It is clear that d = 4 gives the estimated number of clusters with the highest posterior probability. To provide a cluster configuration of all the states, we apply an agglomerative clustering algorithm with the dissimilarity distance defined as

    dist(s1, s2) = 1 − Σ_{m=1}^M dist_m(s1, s2)/M,

where dist_m(s1, s2) = 1 if s1 and s2 fall into the same cluster in the m-th iteration and dist_m(s1, s2) = 0 otherwise. It is natural to specify the threshold for the maximum number of clusters in the algorithm as d, the posterior mode of the number of clusters. The main algorithm can be summarized as follows.

1. Create an individual cluster for each site.
2. Among all current clusters, select the two clusters whose elements have the smallest dissimilarity distance.
3. If the two selected clusters from the previous step are distinct, merge them into a new cluster and replace the two old clusters.
4. Repeat steps 2 and 3 until the desired threshold for the number of clusters is attained.

The estimated cluster configuration is shown in Table 3.4, along with a dendrogram in Figure 3.1 that illustrates the process of the agglomerative clustering algorithm introduced above. Moreover, Figure 3.2 shows the data from the individual sites grouped by cluster configuration, while Figure 3.3 shows the averaged curve (in black) over all sites within each cluster.

Cluster   # of Sites   Cluster Details (State)
(a)       8            CO, LA, MA, ME, OH, OR, PA, WY
(b)       21           AL, AR, GA, IA, ID, IN, KS, KY, MN, MS, MO, MT, NC, ND, NE, OK, SC, SD, TN, WI, WV
(c)       7            CA, DC, FL, MD, NJ, NY, UT
(d)       13           AZ, CT, DE, IL, MI, NH, NM, NV, RI, TX, VA, VT, WA

Table 3.4: Estimated Cluster Configuration

Cluster (a) has one knot-point around year 1991; the log-scaled rates for these states maintained a steady trend after that. Cluster (b) also has one knot-point, but unlike cluster (a) it occurred a little earlier, around year 1989. Cluster (c) has the same knot-point at year 1991 as cluster (a) but demonstrates a clear decreasing trend afterwards. As for cluster (d), it represents states with two potential knot-points detected around years 1982 and 1992; for these states the log-rates increased more slowly after 1982 and dropped after 1992.

3.6 Conclusion and Discussion

In this chapter, we proposed a mixed spline model that performs concurrent clustering of multiple curves based on their patterns and trends. The clustering is induced by a Dirichlet process prior on the space of step functions over the time points. The model was verified through a series of simulations and then applied to age-adjusted cancer mortality rates for all the states in the U.S. in order to find state-wise clusters with similar trends. Given the results of the real data analysis, our model provides a reasonable and meaningful cluster configuration and captures the group-wise features of the curves. As a potential further investigation, one could extend the piecewise linear model (corresponding to q = 2) to higher-order polynomial models, such as cubic spline functions, in order to summarize more information about the patterns of the curves; the criteria and interpretation of the clustering may then have to be re-established.
60 Figure 3.1: Dendrogram for the Clustering Estimation 61 Figure 3.2: Estimated Cluster Configuration 62 Figure 3.3: Overall Trend by Clusters 63 APPENDICES 64 Appendix A Proof of Theorems in Chapter 1 Proof of Theorem 1. Note that the likelihood function given D, Σ0 , β, γ is l(y; D, Σ0 , β, γ) = 1 1 ¯ ¯ e− 2 (y−Xβ−Zγ) Σ 1 1 −1 (y−Xβ− ¯ ¯ Zγ) 1 1 (2π) 2 M dn |Σ| 2 − 12 γ G−1 γ e , 1 (2π) 2 M dk |G| 2 where Σ = IM ⊗ D ⊗ In and G = IM ⊗ Σ0 ⊗ Ik . Let l(y; D, Σ0 ) be the likelihood with β and γ integrated out. To show the posterior is improper, it is enough to show l(y; D, Σ0 )π(D)π(Σ0 ) dD dΣ0 = ∞. (A.1) Using the eigenvalue decomposition of Σ0 , Σ0 = Q ΨQ, we can write π(Σ0 ) dΣ0 = π(Ψ) dΨ · π(Q) dQ with π(Ψ) = 1 d ψp j=1 j i 0 and SΨ = {ψ : ψ1 > . . . ψd > 0}, where ψ = (ψ1 , . . . , ψd ) is the set of eigenvalues of Σ0 and S ∗ = SD × SΨ . To proceed further, we denote the ordering of two m × m symmetric matrices, A and B, with respect to positive definiteness by A B if for any v ∈ Rm , v Av ≤ v Bv. We also say B A if A B. Note that we have 1 1 − Md (2π)− 2 (n−q−1) |X X| 2 l(y; D, Σ0 ) = 1 − 12 y W y , e 1 1 |D| 2 M (n−q−1) |Σ0 | 2 M k |Z¯ P Z¯ + G−1 | 2 ¯ Z¯ P Z¯ + G−1 )−1 Z¯ P with P = IM ⊗ D−1 ⊗ (In − X(X X)−1 X ). For where W = P − P Z( simplicity, define PX = X(X X)−1 X , PH = In − PX so that P = IM ⊗ D−1 ⊗ PH . Let 0 < λ1 < λ2 < · · · < λd be the eigenvalues of Z PH Z. Then, we have 1 Mk |Σ0 |− 2 1 2 |IM ⊗ Σ−1 0 ⊗ Ik | = |Z¯ P Z¯ + G−1 | 2 1 2 |IM ⊗ D−1 ⊗ Z PH Z + IM ⊗ Σ−1 0 ⊗ Ik | 1 2 |IM ⊗ Σ−1 0 ⊗ Ik | ≥ 1 2 |IM ⊗ D−1 ⊗ λd Ik + IM ⊗ Σ−1 0 ⊗ Ik | d 1 ≥ = Mk |λd −1 Σ0 + I| 2 on SD , where the first inequality holds since λd Ik ¯ Z¯ P Z¯ + G−1 )−1 Z¯ P since P Z( 1 Mk j=1 (1 + ψj λd −1 ) 2 Z PH Z. Also, e−1/2y W y ≥ e−1/2y P y 0. Thus, on SD × SΨ , 1 1 − Md 1 (2π)− 2 (n−q−1) |X X| 2 e− 2 y P y l(y; D, Ψ, Q) π(Ψ) ≥ 1 M (n−q−1) |D| 2 66 i 0, we are going to show dΨ = ∞. The first term in the integrand in (A.2) corresponding to the finite expansion of (A.2) i 1 where x is the smallest integer greater than or equal to x. Then we have ∗ − j > 0 for 1 ≤ j ≤ j ∗ − 1 and d−1 2 − j ≤ 0 for j ≤ j ≤ d. Consider A = {ψ : ψ1 ≥ 1, ψ2 ≥ 1, . . . , ψj ∗ −1 ≥ 1} ∩ SΨ . Then 1 ψ0 ψ1d−1 ψ2d−2 . . . ψd−1 d SΨ ≥ ≥ dΨ Mk p d −1 ) 2 j=1 ψj (1 + ψj λd d d−1 −j 1 ψj 2 Mk A j=1 (1 + ψj λd −1 ) 2 d−1 −j 2 ψ ∗ j≥j j A ≥ A dΨ 1 dΨ Mk Mk −1 ) 2 −1 ) 2 (1 + ψ λ (1 + ψ λ ∗ ∗ j j d d j≥j j ψj ∗ +1 > · · · > ψd > 0. 67 dΨ (A.4) For p > 1 and b > 0, we have t 1 1 1 1− ds = p b(p − 1) (1 + bt)p−1 0 (1 + bs) (A.5) Using (A.5), we integrate out (A.4) with respect to ψd , · · · , ψj ∗ on A ∩ SΨ . First, ψd−1 1 1 1 1 − . dψ = d Mk M k −1 −1 ( M k − 1) λ −1 −1 d 2 (1 + ψd λd ) 2 (1 + ψd−1 λd ) 2 0 Thus, using (A.5) again, ψd−2 Mk (1 + ψd−1 λd −1 ) 2 ψd−2 1 0 = λd −1 ( M2k 0 = ψd−1 1 1 λ2d −2 ( M2k − 1)2 − − 1) 1− Mk 0 1− 1 (1 + ψd λd −1 ) 2 dψd dψd−1 1 1 Mk (1 + ψd−1 λd −1 ) 2 −1 Mk (1 + ψd−1 λd −1 ) 2 dψd−1 1 Mk (1 + ψd−2 λd −1 ) 2 −1 1 λ2d −2 ( M2k − 1)(M k − 2) 1− 1 . (1 + ψd−2 λd −1 )M k−2 After integrating (A.4) with respect to ψd , · · · , ψj ∗ +1 , we have ψj ∗ −1 0 P 1 1 + ψj ∗ λd −1 · 1 ψj ∗ d−1 j≥j ∗ | 2 −j| dψj ∗ , (A.6) where P (·) is a polynomial of degree (M k − 2)(d − j ∗ ) + M k. Note that we assume M2k > 1. However, the integrand above is not integrable at 0 since j≥j ∗ d−1 2 − j ≥ 1. Hence we showed that the posterior is improper. Similar to case (1), by setting j ∗ = d−2 2 ≥ 1, we can prove case (2). For case (3), consider the corresponding multiple integral 68 on B = {ψ : 1 > ψ1 > ψ2 > · · · > ψd > 0}. 
Then, it is easy to verify the first integral of (A.3) with respect to ψd is infinite. Proof of Theorem 2. Let l(y; D, L) be the likelihood with β and γ integrated out and Σ0 expressed in terms of L. To show that the posterior distribution is proper, we need to verify l(y; D, L)π(D)π(L) dD dL < ∞. Given the prior models for D and L, we can express l(y; D, L)π(D)π(L) dD dL d = =C l(y; D, L) π(σj2 ) π( 11 )π( 1 )π(L11 ) dσ12 · · · dσd2 d 11 d 1 dL11 j=1 1 Mk |Σ0 |− 2 e− 2 y W y 1 |Z¯ P Z¯ + G−1 | 2 d 1 σj2 j=1 M (n−q−1) +αj +1 2 π( 11 )π(L11 )dσ12 · · · dσd2 d 11 d 1 dL11 , where C is a constant independent of D and L. First, we find W0 that does not depend on L and it satisfies W0 1 W so that e− 2 y W y ≤ 1 ¯ Z¯ P Z) ¯ −1 Z¯ P . Since (Z¯ P Z¯ + G−1 )−1 e− 2 y W0 y . Let W0 = P − P Z( ¯ Z¯ P Z¯ + G−1 )−1 Z¯ P IM ⊗ D ⊗ (Z PH Z)−1 , W = P − P Z( ¯ −1 = (Z¯ P Z) ¯ Z¯ P Z) ¯ −1 Z¯ P = W0 . P − P Z( 1 1 Thus, e− 2 y W y ≤ e− 2 y W0 y . Let P0 = PH −PH Z (Z PH Z)−1 Z PH . Then, we can write W0 = IM ⊗D−1 ⊗P0 so that we can express y W0 y = cj d j=1 σ 2 , j where cj = (i) M i=1 yj (i) P 0 yj (i) (i) (i) with yj = (yj1 , · · · , yjn ) for j = 1, . . . , d. Now, we claim that cj > 0 for all j = 1, · · · , d with probability 1. Note that PX , PH and P0 are idempotent so that PH is orthogonal to PX = I − PH and P0 is orthogonal to PH Z (Z PH Z)−1 Z PH . Thus, rank(PH ) = rank(I) − rank(PX ) = n − (q + 1) since 69 rank(PX ) = rank(X) = q + 1. rank(P0 ) = rank(PH ) − rank(PH Z (Z PH Z)−1 Z PH ). Since we assume n > q + 1 + k, rank(Z PH Z) = min{rank(Z), rank(PH )} = k and rank(PH Z) = min{rank(Z), rank(PH )} = k so that rank(PH Z (Z PH Z)−1 Z PH ) = k. Therefore, rank(P0 ) = n − (q + 1) − k > 0. Since P0 is idempotent and non-degenerate, we prove the claim. Thus, we have 1 1 e− 2 y W y ≤ e− 2 y W0 y = d e −cj /2σj2 , (A.7) j=1 where cj > 0 for all j = 1, . . . , d. Now, we want to bound |Σ0 | − M2k 1 −1 | 2 ¯ |Z¯ P Z+G above by a simpler expression. 1 Mk |Σ0 |− 2 1 2 |IM ⊗ Σ−1 0 ⊗ Ik | = |Z¯ P Z¯ + G−1 | 2 1 2 |IM ⊗ D−1 ⊗ Z PH Z + IM ⊗ Σ−1 0 ⊗ Ik | 1 2 |IM ⊗ Σ−1 0 ⊗ Ik | ≤ 1 2 |IM ⊗ E −1 ⊗ Ik + IM ⊗ Σ−1 0 ⊗ Ik | Mk = 2 |Σ−1 0 | |E −1 Mk 2 + Σ−1 | 0 Mk |E| 2 = |E Mk + Σ0 | 2 , where E = λ−1 1 D and λ1 be the smallest eigenvalue of Z PH Z. The inequality follows from IM ⊗ D−1 ⊗ Z PH Z IM ⊗ D−1 ⊗ λ1 Ik = IM ⊗ E −1 ⊗ Ik .   e11 0  We further simplify the last expression. Let E =  . By the block-wise 0 E11 70 A B = |A||D − CA−1 B|, we have expansion for determinants, C D 2 e11 1 1 1 1 11 = (e11 + 211 ) +L11 L11 +E11 . |E+Σ0 | = (e11 + 211 ) 1 1 +L11 L11 +E11 − 2 e11 + 11 e11 + 211 By the identity |Ip + U V | = |Ir + V U | when U is a p × r matrix and V is an r × p matrix, we have e11 |E + Σ0 | = (e11 + 211 )|L11 L11 + E11 | 1 + 1 (L11 L11 + E11 )−1 1 e11 + 211 = (e11 + 211 )|L11 L11 + E11 |(1 + 1 F 1 ), where F = e11 e11 + 2 11 (L11 L11 +E11 )−1 . Note that |E| = e11 |E11 | = d j=1 ejj = λ−d 1 d 2 j=1 σj . . (A.8) Thus, Mk Mk |Σ0 |− 2 |Z¯ P Z¯ 1 + G−1 | 2 ≤ |E| 2 Mk |E + Σ0 | 2 Mk Mk e112 |E11 | 2 = (e11 + Mk 2 ) 2 |L L 11 11 11 Mk + E11 | 2 (1 + 1F Mk 1) 2 Let R = {( 11 , 1 , L11 , σ12 , . . . , σd2 ) : 11 > 0; 1 ∈ Rd−1 ; σj2 > 0, j = 1, . . . , d} and R0 = {( 11 , L11 , σ12 , . . . , σd2 ) : 11 > 0; σj2 > 0, j = 1, . . . , d}. 
Then, by (A.7) and (A.8), we 71 have 1 Mk |Σ0 |− 2 e− 2 y W y |Z¯ P Z¯ R 1 + G−1 | 2 d 1 σj2 j=1 1 = R0 Rd−1 (1 + 1 F × (e11 + π( 11 )π(L11 )dσ12 · · · dσd2 d 11 d 1 dL11 d 1 cj Mk Mk − 2 M (n−q−1) d 2 2 +α +1 e11 |E11 | 1 2σj j 2 e 2 Mk Mk σj 2 ) 2 |L L + E | 2 j=1 11 11 11 11 × π( 11 )π(L11 ) dσ12 · · · dσd2 d 11 dL11 . When M k > 1, Rd−1 becomes Mk 1) 2 M (n−q−1) +αj +1 2 1 Mk (1+ 1 F 1 ) 2 (A.9) 1 d 1 = C1 |F |− 2 , where C1 is a constant. Thus, (A.9) cj Mk Mk − 2 M (n−q−1) d 2 2 +α +1 |F | e11 |E11 | 1 2σj j 2 C1 e 2 Mk Mk σj R0 (e + 2 ) 2 |L L + E | 2 j=1 11 11 11 11 11 × π( 11 )π(L11 ) dσ12 · · · dσd2 d 11 dL11 cj M k−(d−1) Mk − 2 M (n−q−1) d +α +1 2 2 1 e11 |E11 | 2σj j 2 = C1 e 2 2 M k−1 σj R0 e11 + 11 j=1 |L11 L11 + E11 | 2 × π( 11 )π(L11 ) dσ12 · · · dσd2 d 11 dL11 . cj − 2 M (n−q−1) d +α +1 1 1 2σj j 2 e π( 11 )π(L11 ) dσ12 · · · dσd2 d 11 dL11 ≤ C1 |E11 | 2 2 σj R0 j=1 cj c d ∞ 1 M (n−q−1) +α +1 − 12 ∞ 1 M (n−q−1) +α − 2σ 2 1 j 2 2 j dσ 2 = C2 e 2σ1 dσ12 e j 2 2 σ σ 0 1 j j=2 0 ∞ × π( 11 ) d 11 π(L11 ) dL11 0 −1 2 < ∞, 72 · where the first inequality holds since e11 e11 + 211 M k−(d−1) 2 ≤ 1, |E11 | M k−1 2 |L11 L11 + E11 | M k−1 2 ≤1 for M k ≥ d − 1 and the last inequality holds since π( 11 ) and π(L11 ) are proper and ∞ −s −t/θ dθ 0 θ e < ∞ for s > 0, t > 0. C1 and C2 above are both finite constants. Hence we conclude that the posterior is proper. 73 Appendix B MCMC Algorithm via Gibbs Sampling for Bayesian Inference in Chapter 1 For the linear mixed model, ¯ + Zγ ¯ + y = Xβ γ ∼ N (0, G), ∼ N (0, Σ), where G = IM ⊗ Σ0 ⊗ Ik and Σ = IM ⊗ D ⊗ In . For the Bayesian inference, we put priors on the parameters as described in section 1.3.3 and the followings are the updating steps of β and σj2 for Gibbs sampling method. Updating β 1 Let r(i) = y (i) −(Id ⊗Z)γ (i) and r¯ = M M (i) i=1 r . Given π(β) ∝ 1, the posterior distribution of β is β | y, γ, D, Σ0 ∼ N (β, Vβ ), ¯ Σ−1 X) ¯ −1 X ¯ Σ−1 r = (Id ⊗ (X X)−1 X )¯ ¯ Σ−1 X) ¯ −1 = 1 (D ⊗ where β = (X r and Vβ = (X M (X X)−1 ). 74 Updating γ Let u(i) = y (i) − (Id ⊗ X)β, i = 1, . . . , M . We can update γ (i) one by one independently. Since γ (i) ∼ N (0, Σ0 ⊗ Ik ), the posterior distribution of γ (i) is γ (i) | y, β, D, Σ0 ∼ N (γ (i) , Vγ ), i = 1, . . . , M, −1 −1 −1 (i) −1 −1 where γ (i) = (D−1 ⊗ Z Z + Σ−1 0 ⊗ Ik ) (D ⊗ Z )u and Vγ = (D ⊗ Z Z + Σ0 ⊗ Ik ) . Updating D = diag(σ12 , . . . , σd2 ) We can update D by updating σj2 one by one independently. ¯ − Zγ. ¯ Given IG(aj , bj ) as the prior for π(σ 2 ), the posterior distribution Let r = y − Xβ j of σj2 is σj2 1 1 | y, β, γ, Σ0 ∼ IG M n + aj , 2 2 M n (i)2 rjt + i=1 t=1 1 −1 , j = 1, . . . , d bj Updating procedure of Σ0 via 1 1, 1 and L11 is provided in section 1.3.3. 75 Appendix C Maximum Likelihood Estimation in Chapter 1 This part serves as a preliminary analysis of the parameters based on the likelihood approach. Here instead of the original assumption that the covariance matrix for errors is Σ = IM ⊗D⊗ Σϕ , where D is diagonal matrix, we consider a more general relationship between individual covariates such that Σ = IM ⊗ Σ1 ⊗ Σϕ , where Σ1 can be any valid covariance matrix. Then the likelihood function for linear regression model in section 1.4 is f (y|β, γ, Σ1 ) = 1 1 ¯ ¯ e− 2 (y−Xβ−Zγ) Σ 1 −1 (y−Xβ− ¯ ¯ Zγ) 1 (2π) 2 ndM |Σ| 2 , where Σ = IM ⊗ Σ1 ⊗ Σϕ when we assume γ as fixed effects as well. The MLE equation for β and γ is then      −1 −1 −1 ¯ ¯ ¯ ¯ ¯ X Σ X X Σ Z  β  X Σ y     =  . −1 −1 −1 ¯ ¯ ¯ ¯ ¯ ZΣ X ZΣ Z γ ZΣ y (i) (i) ¯ β − Zγ, ¯ Ui = (e(i) , . . . 
, e(i) ) and Si = U Σ−1 Let β, γ denote the MLEs, ejt = yjt − X i ϕ Ui , 1 d where i = 1, . . . , M . Then, we have f (y|β, γ, Σ1 , Σϕ ) ∝ 76 −1 1 e− 2 tr(Σ1 S) nM dM |Σ1 | 2 |Σϕ | 2 , where S = M i=1 Si . S . Then D is maximized at Σ1 = nM The log-likelihood function is L(y|β, γ, Σ1 , Σϕ ) ∝ − dM M nM ¯ tr(Σ−1 log|Σ−1 |+ log|Σ−1 S(ϕ)) + ϕ |, 1 1 2 2 2 S. ¯ where S(ϕ) =M n It can be shown that |Σ−1 ϕ |=2 ϕ = arg max −1≤ϕ≤1 2jπ n−1 j=0 (1+ϕ cos( n )). Therefore the MLE for ϕ satisfies M dM ndM ¯ tr(Σ−1 S(ϕ)) + log|Σ−1 ϕ |+ 1 2 2 2 Following the procedure above we are able to obtain the MLEs by updating the parameters iteratively until convergence. 77 Appendix D Proof of Theorems and Lemmas in Chapter 2 √ Proof of Lemma 1. First note that Vj = T X (j) /( nσσj ) ∼ N (0, 1). Examining the tail probability of Gaussian random variables leads to P max |Vj | > 1≤j≤n t2 + 2 log n ≤ 2nP V1 > ≤ 2n exp − t2 + 2 log n t2 + 2 log n = 2 exp(−t2 /2). 2 (D.1) (D.2) Then it follows that P 2| T X (j) | > λ0 n 1≤j≤n max =P ≤P 2σ √ max σj |Vj | > λ0 n 1≤j≤n max |Vj | > 1≤j≤n ≤ 2 exp(−t2 /2). t2 + 2 log n (D.3) (D.4) (D.5) Proof of Lemma 2. Let W = nσ 2 /σ 2 = Y T Y /σ 2 . Since Y ∼ N (Xβ, σ 2 I), we claim that W ∼ χ2n (r), i.e. a noncentral chi-square distribution with noncentrality parameter r. 78 It is known that E(W ) = n + r (D.6) E[W − E(W )]2 = 2(n + 2r) (D.7) E[W − E(W )]4 = 12(n + 2r)2 + 48(n + 4r). (D.8) Given k > 0, by Chebyshev’s inequality P |σ 2 − (1 + r/n)σ 2 | > ασ 2 = P (|W − E(W )| > nα) ≤ E|W − E(W )|k αk nk (D.9) Therefore, P σ 2 < (1 − α)σ 2 ≤ P |σ 2 − (1 + r/n)σ 2 | > ασ 2 ≤ E|W − E(W )|k α k nk (D.10) Finally by taking k = 2 and k = 4 respectively, we complete our proof. Proof of Theorem 3. Note that under (2.5), n−1 Y − Xβ 2 /n + λ 2 n−1 |βj | ≤ Y − Xβ j=2 2 /n + λ 2 |βj | (D.11) j=2 By expanding the 2 terms, it simply gives n−1 X(β − β) 2 /n + λ 2 n−1 |βj | ≤ 2 j=2 T X(β − β)/n + λ |βj | (D.12) j=2 For any vector v ∈ Rp we define v 01 = (v0 , v1 , 0, · · · , 0)T and v 2+ = (0, 0, v3 , · · · , vn−1 )T . 79 Since v = v 01 + v 2+ , (D.12) becomes n−1 X(β − β) 2 /n 2 ≤2 T X(β − β) 01 /n + 2 T X(β − β) /n − λ 2+ n−1 |βj | + λ j=2 Set λ0 = 6Kσ log n n . |βj | (D.13) j=2 On Jλ0 = max1≤j≤n 2| T X (j) |/n ≤ λ0 , n−1 2 T X(β − β)2+ /n ≤ λ0 |βj − βj | (D.14) j=2 Meanwhile, note that β 01 1 = |β0 | + |β1 | ≤ M by assumption. We have 2 T X(β − β)01 /n ≤ λ0 β 01 1 + λ0 β 01 1 ≤ λ0 M + λ0 (|β0 | + |β1 |) (D.15) Now define Sn = {σ : σ ≥ σ/2}. Then on Sn we have λ ≥ λ0 . Combining (D.12) through (D.15) and noting the fact that limn→∞ λ = 0 and limn→∞ λ β 1 = 0, we claim that on Jλ0 ∩ Sn , X(β − β) 22 /n ≤ λM + 2λ β 1 → 0 as n → ∞. (D.16) It remains to show that Jλ0 ∩ Sn has high probability when n is large. In fact, we have P Jλc ∪ Snc ≤ P (Jλc ) + P (Snc ) 0 0 (D.17) By lemma 1, P (Jλc ) ≤ 2/n2 . Taking α = 3/4 in lemma 2 we get 0 P (Snc ) ≤ (27 + 12) · 44 27 ≤ 34 · n2 n2 80 (D.18) when n ≥ max(4r, 8). Hence we claim that P Jλc ∪ Snc ≤ 130/n2 when n ≥ max(4r, 8). 0 Since ∞ ∞ P n=1 Jλc 0 ∪ Snc ≤ (4r + 8) + 130 n=1 1 < ∞, n2 (D.19) by Borel-Cantelli lemma, P Jλc ∪ Snc i.o. = 0. 0 (D.20) P ∃N > 0 s.t. Jλ0 ∩ Sn holds when n > N = 1. (D.21) It follows that Therefore, by (D.16) P lim n→∞ X(β − β) 22 /n = 0 = 1. (D.22) Proof of Proposition 1. Recall that xj = j/n, j = 1, 2, · · · , n. According to (2.4), we have   1 0 0 ··· 1 − ν n   2  1 0 0 ···  n −ν   1 3  1 0 ··· n n −ν  X − νI =   .. .. .. ..  . . . . 
   n−1 n−3 n−2 · · ·  1 n n n   n−1 · · · n−2 1 1 n n 0 0    0 0     0 0    .. ..  . .     1 −ν  0 n   1 −ν 2 n n (D.23) It is easy to see that ν = 2/n is not the solution to det(X − νI) = 0. So by multiplying the second row of (D.23) with −1/(2 − nν) and add it to the first row we obtain a lower 81 triangular matrix   nν 2 −(n+2)ν+1  2−nν         R(X − νI) =          1 2 n 0 0 0 ··· −ν 0 0 ··· −ν 0 ··· 1 3 n 1 n .. . .. . .. . .. . 1 n−1 n n−3 n n−2 n ··· 1 1 n−2 n n−1 n ··· 0 0    0 0     0 0   , .. ..  . .     1 −ν  0 n   2 1 −ν n n (D.24) where R denotes the row operation described as above. Then it is clear that nν 2 − (n + 2)ν + 1 det(X − νI) = det(R(X − νI)) = n n−2 1 −ν n (D.25) It directly follows that √ √ n + 2 − n2 + 4 1 n + 2 + n2 + 4 ν1 = , ν2 = ν3 = · · · = νn−1 = , νn = 2n n 2n are the roots of det(X − νI) = 0. 82 (D.26) Appendix E Proof of Theorems in Chapter 3 Proof of Theorem 4. Given site s, the likelihood function is f (Y s |b, us , σs2 ) = 1 n (2πσs2 ) 2 exp − 1 (Y s − X (s) b(s) − us 1n )T (Y s − X (s) b(s) − us 1n ) 2σs2 (E.1) To validate the propriety of the posterior, we need to show  N c∈P (0,∞) R Θd (0,∞)N  N f (Y s |b, us , σs2 )π(us |τ 2 )π(σs2 ) π(c)π(τ 2 )dσ 2 G0 (dθ)dudτ 2 < ∞  s=1 (E.2) Integrating out σs2 , we have n Γ(aσ fs (Y s , b, us ) = (0,∞) f (Y s |b, us , σs2 )π(σs2 )dσs2 = (2π)− 2 n) 2 + Γ(aσ ) 1 eT e 2 s s + b1 σ aσ bσ −(aσ + n 2) , (E.3) where es = Y s − X (s) b(s) − us 1n . ¯ Then the integrand in (E.2) becomes   N fs (Y s , b, us )π(us |τ 2 ) π(c)π(τ 2 ),  s=1 83 (E.4) which is equivalent to the clustered form d Γ(aσ + n2 ) N Γ(aσ )baσσ − N2n (2π) r=1 s∈Cr n 1 −(aσ + 2 ) 1 T e er,s + , 2 r,s bσ (E.5) where er,s = Y s − Xr br − us 1n . ¯ −(aσ + n 2) 1 1 T . Integrating with respect to G0 (dθ), given cluster r we Let Ir,s = 2 er,s er,s + b σ need to evaluate  d q+kr r=1 R since Ir,s ≤ 1 T 2 er,r1 er,r1 aσ + n bσ 2  d Ir,s  dbr ≤  s∈Cr q+kr r=1 R (aσ + n 2 )(|Cr |−1) bσ Ir,r1 dbr (E.6) for any r and s, where r1 ∈ Cr is a fixed location. Note that Ir,r1 = −(aσ + n 2) + b1 as a function of br is proportional to a (q + kr ) dimensional σ multivariate t-distribution with parameters (µr , Σr , φr ), where µr = (XrT Xr )−1 XrT (Y r1 − ur1 1n ) = (XrT Xr )−1 XrT v r1 ¯ 2 + v T I − X (X T X )−1 X T v r r1 r r r r1 bσ (XrT Xr )−1 Σr = φr (E.7) φr = 2aσ + n − (q + kr ) (E.9) (E.8) Then the rth integral for the RHS of (E.6) can be shown as (aσ + n 2 )(|Cr |−1) Rq+kr bσ (aσ + n 2 )(|Cr |−1) Ir,r1 dbr = Ar bσ · 2 bσ + v Tr1 Pr v r1 1 bσ + v Tr1 Pr v r1 q+kr 1 (aσ + n 2 )|Cr |− 2 2 2 (q+kr ) , ≤ Ar bσ 84 q+kr 2 aσ + n 2 (E.10) (E.11) where Pr = I − Xr (XrT Xr )−1 XrT and n−(q+kr ) 2 n Γ(aσ + 2 ) 1 π 2 (q+kr ) Γ aσ + Ar = (E.12) 1 |XrT Xr | 2 Therefore it remains to verify that N RN Recall that π(us |τ 2 ) = √1 2πτ π(us |τ 2 )π(τ 2 )dτ du < ∞ (E.13) (0,∞) s=1 u2 − s2 e 2τ and π(τ 2 ) = τ b−a τ Γ(aτ ) 1 τ2 aτ +1 − 1 2 e bτ τ , integrating out τ 2 leads to a new integrand −N 2 (2π) baτ τ N) 2 Γ(aτ + Γ(aτ )  1 +1 bτ 2 N −(aτ + N ) 2 u2s  , (E.14) s=1 which, as a function of u, is proportional to the density of a N dimensional t-distribution with parameters (z N , (aτ bτ )−1 IN , 2aτ ). Hence the last integration with respect to u is finite. Note the final summation is over finite possible partitions of S. Therefore we complete our verification. 85 BIBLIOGRAPHY 86 BIBLIOGRAPHY [1] Abraham, C., Cornillon, P.A., Matzner-Løber, E. and Molinari, N. 
(2003), Unsupervised Curve Clustering using B-Splines. Scandinavian Journal of Statistics 30: 581-595.

[2] Bickel, P., Ritov, Y. and Tsybakov, A. (2009), Simultaneous Analysis of Lasso and Dantzig Selector. The Annals of Statistics 37(4): 1705-1732.

[3] Bühlmann, P. and van de Geer, S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer: New York, USA.

[4] Dass, S., Lim, C., Maiti, T. and Zhang, Z. (2014), Clustering Curves Based on Change Point Analysis: A Nonparametric Bayesian Approach. Accepted by Statistica Sinica.

[5] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), Least Angle Regression. The Annals of Statistics 32(2): 407-499.

[6] Escobar, M.D. and West, M. (1995), Bayesian Density Estimation and Inference using Mixtures. Journal of the American Statistical Association 90(430): 577-588.

[7] Ferguson, T.S. (1973), A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1(2): 209-230.

[8] Friedman, J., Hastie, T. and Tibshirani, R. (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1): 1-22.

[9] Gaston, K.J. (1996), Biodiversity: A Biology of Numbers and Difference. Oxford, UK.

[10] Gelman, A. and Rubin, D.B. (1992), Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 7: 457-472.

[11] Ghosh, P., Basu, S. and Tiwari, R.C. (2009), Bayesian Analysis of Cancer Rates from SEER Program using Parametric and Semiparametric Joinpoint Regression Models. Journal of the American Statistical Association 104(486): 439-452.

[12] Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning. Springer: New York, USA.

[13] Jeffreys, H. (1961), Theory of Probability. Oxford Univ. Press: Oxford, UK.

[14] Kim, H.-J., Fay, M.P., Feuer, E.J. and Midthune, D.N. (2000), Permutation Tests for Joinpoint Regression with Applications to Cancer Rates. Statistics in Medicine 19: 335-351.

[15] Lozano, R. et al. (2012), Global and Regional Mortality from 235 Causes of Death for 20 Age Groups in 1990 and 2010: A Systematic Analysis for the Global Burden of Disease Study 2010. The Lancet 380(9859): 2095-2128.

[16] Ramsay, J.O. and Silverman, B.W. (1997), Functional Data Analysis. Springer: New York, USA.

[17] Ries, L.A.G., Harkins, D., Krapcho, M., Mariotto, A., Miller, B.A., Feuer, E.J., Clegg, L., Eisner, M.P., Horner, M.J., Howlader, N., Hayat, M., Hankey, B.F. and Edwards, B.K. (eds.) (2006), SEER Cancer Statistics Review, 1975-2003. National Cancer Institute: Bethesda, MD, USA.

[18] Sethuraman, J. (1994), A Constructive Definition of Dirichlet Priors. Statistica Sinica 4: 639-650.

[19] The IUCN Red List of Threatened Species (2012). http://www.iucnredlist.org/.

[20] Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1): 267-288.

[21] Tiwari, R.C., Cronin, K.A., Davis, W., Feuer, E.J., Yu, B. and Chib, S. (2005), Bayesian Model Selection for Join Point Regression with Application to Age-adjusted Cancer Rates. Journal of the Royal Statistical Society: Series C (Applied Statistics) 5: 919-939.

[22] Wang, H., Li, B. and Leng, C. (2009), Shrinkage Tuning Parameter Selection with a Diverging Number of Parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71: 671-683.

[23] Yang, R. and Berger, J. (1994), Estimation of a Covariance Matrix Using the Reference Prior.
The Annals of Statistics 22(3): 1195-1211.