FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS

By

Xin Qi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics - Doctor of Philosophy

2014

ABSTRACT

FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS

By Xin Qi

As a branch of statistics, functional data analysis studies data in the form of curves, functions, surfaces, or other objects varying over a continuum such as time or spatial location. The topic has drawn growing attention in recent years because, in many real problems, samples are collected as curves or other functional observations, which calls for statistical models and methodologies that perform inference on such data appropriately. In this dissertation, I introduce several functional data analysis problems arising from the real world and develop methodologies that provide powerful tools to address the corresponding statistical questions.

TABLE OF CONTENTS

LIST OF TABLES ......... v
LIST OF FIGURES ......... vi
Chapter 1 Bayesian Inference for Covariance Ordering in Multivariate Functional Data ......... 1
1.1 Introduction ......... 1
1.2 Spline Mixed Model ......... 4
1.3 Bayesian Hypothesis Testing on Covariance Ordering ......... 6
1.3.1 Model Selection and Hypothesis Testing ......... 6
1.3.2 New Prior on Σ0 with Ordering Representation ......... 8
1.3.3 Bayesian Estimation ......... 9
1.4 Analysis of Audio Frequency Data of Wildlife Species in Michigan ......... 10
1.4.1 Bayesian Inference Results ......... 12
1.4.2 Discussion ......... 15
Chapter 2 Joinpoint Detection using L1-Penalized Spline Method ......... 17
2.1 Introduction ......... 17
2.2 L1-Penalized Regression Spline Model ......... 19
2.3 Consistency of the LASSO Type Estimator ......... 21
2.4 Simulation Study ......... 25
2.5 A Case Study: National Cancer Incidence Rate Analysis ......... 33
2.6 Conclusion and Discussion ......... 37
Chapter 3 Nonparametric Bayesian Clustering of Functional Data ......... 38
3.1 Introduction ......... 38
3.2 Spline Mixed Model for Clustering Multiple Curves ......... 41
3.2.1 Model Specification ......... 41
3.2.2 Derivation of Functional Dirichlet Process and Other Priors ......... 43
3.3 Theory of Bayesian Inference ......... 46
3.3.1 Propriety of the Posterior Distribution ......... 46
3.3.2 Gibbs Updating Steps ......... 46
3.4 Bayesian Inference: A Measure for Comparing a Pair of Clusterings of Sites ......... 52
3.5 Numerical Analysis: Simulation Study and Real Data Example ......... 54
3.5.1 An Alternative Less-Informative Prior Choice ......... 55
3.5.2 Simulations and Comparison with an Existing Approach ......... 56
3.5.3 A Real Data Example: Lung Cancer Rates in the U.S. ......... 57
3.6 Conclusion and Discussion ......... 59
APPENDICES ......... 64
Appendix A Proof of Theorems in Chapter 1 ......... 65
Appendix B MCMC Algorithm via Gibbs Sampling for Bayesian Inference in Chapter 1 ......... 74
Appendix C Maximum Likelihood Estimation in Chapter 1 ......... 76
Appendix D Proof of Theorems and Lemmas in Chapter 2 ......... 78
Appendix E Proof of Theorems in Chapter 3 ......... 83
BIBLIOGRAPHY ......... 86

LIST OF TABLES

Table 1.1: Bayes Factor for Significant Orderings at L00 ......... 15
Table 1.2: Bayes Factor for Significant Orderings at L02 ......... 15
Table 2.1: A Summary of Four Methods to Compare with the PTB ......... 28
Table 2.2: Simulation Result for n = 27, (τ1, τ2) = (8, 18) ......... 29
Table 2.3: Simulation Result for n = 27, (τ1, τ2) = (18, 23) ......... 30
Table 2.4: Simulation Result for n = 54, (τ1, τ2) = (16, 36) ......... 31
Table 2.5: Simulation Result for n = 54, (τ1, τ2) = (36, 46) ......... 33
Table 2.6: Simulation Result for n = 135, (τ1, τ2) = (40, 90) ......... 34
Table 2.7: Simulation Result for n = 135, (τ1, τ2) = (90, 115) ......... 36
Table 2.8: Joinpoint Detection for Real Data Analysis ......... 36
Table 3.1: Cluster Configuration Diagnostics for Simulations: Bayesian DP Approach ......... 56
Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar's
Approach .....56 Table 3.3: The Distribution of Number of Clusters ..................................................................58 Table 3.4: Estimated Cluster Configuration .................................................................................59 v LIST OF FIGURES Figure 1.1: An Example of Biodiversity Data ................................................................................3 Figure 2.1: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = , .................................................................................................................................32 Figure 2.2: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = , .................................................................................................................................32 Figure 2.3: Incidence Rates of Cancer for overall U.S. (1973 - 1999) .........................................35 Figure 2.4: Fitted Incidence Rates (1973 - 1999) .........................................................................35 Figure 3.1: Dendrogram for the Clustering Estimation ................................................................61 Figure 3.2: Estimated Cluster Configuration ................................................................................62 Figure 3.3: Overall Trend by Clusters ..........................................................................................63 vi Chapter 1 Bayesian Inference for Covariance Ordering in Multivariate Functional Data 1.1 Introduction Biodiversity, often defined as the variety of life at all levels of organization [9], has been studied at species level for most of the time. It is closely related to human lives and human activities such as agriculture, health, business and industry, etc. In recent years, unsustainable ecological practices such as habitat destruction, agricultural over-harvesting, and pollution have been damaging species diversity. According to International Union for the Conservation of Nature (IUCN) [19], the world’s main authority on the conservation status of species, through 2012, 19,817 out of 63,837 assessed species were threatened with extinction. The Convention on Biological Diversity (CBD) was recognized to help conservations of biodiversity and sustainable use of its components. Statistical analysis of biodiversity measurements are essential and valuable to conservationists to understand the nature of biodiversity. With species indices and habitat variability assessments, conservationists can identify areas of ecological importance that should be protected. 1 The audio measurements on suburban or rural areas can be used to study impact of human activities on the other wildlife animals. The audio measurements taken at a regular time intervals are converted to energy readings on the frequency domain and energy readings are then divided into several frequency bands. Each frequency band corresponds to a different category of wildlife animals according to the sound they create. By investigating how energy readings in each band are changing over time, we can study interaction between different groups of wildlife animals. For example, Figure 1.1 (a) shows the energy readings of several frequency bands at one suburban location in the upper peninsular of Michigan on June 15 2010 while Figure 1.1 (b) gives a similar plot for a rural location. 
Note that one of the frequency bands (shown in black) corresponds to sounds created by human activity, such as the noise from automobiles on highways. We can see that the energy readings of the other frequency bands (in red, green and blue) behave differently depending on the level of the energy readings generated by human activity. Also, the patterns of such interaction vary across locations, as seen in (a) and (b).

This type of data can be viewed as a set of functional curves, in which each curve corresponds to the energy readings over time for one frequency band. To investigate the interaction between curves (species), we introduce spline mixed models. The off-diagonal entries of the covariance matrix of the random effects can be interpreted as correlations between different frequency bands (i.e., different species), so we can compare the off-diagonal entries to study the sensitivity of species in one frequency band to species in another frequency band. This can be done by Bayesian hypothesis testing on the covariance ordering. However, the entries of the covariance matrix lie in a constrained parameter space due to the positive definiteness criterion, and different orderings of the covariances may have unbalanced prior probabilities, which is not desirable.

In this chapter, we investigate the relationship and interaction between species by conducting Bayesian hypothesis testing on the covariance ordering of the random effects in a spline mixed model for multivariate functional data. We develop a default prior for the covariance matrix which yields balanced prior probabilities on the covariance ordering after an appropriate re-parametrization. With the proposed default prior, we fit a spline mixed model to energy readings data from two locations in the Upper Peninsula of Michigan. Our data analysis finds a significant difference in the relationship among groups of species between the two locations. We state our spline mixed model in section 1.2 and introduce the Bayesian hypothesis test as well as the construction of a default prior for the covariance matrix of the random effects in section 1.3. In section 1.4, we apply our results to the energy reading data in Michigan to compare and discuss the activities of different species.

Figure 1.1: An Example of Biodiversity Data

1.2 Spline Mixed Model

To investigate the interaction between multiple curves (functions) over time, which is of interest in the biodiversity application, we consider a spline mixed model. Suppose that there are d curves, {f_j(τ)}_{j=1}^d, on τ ∈ [0, T]. We observe these curves at n discrete time points, 0 < τ1 < τ2 < · · · < τn < T, during the time period [0, T]. In the audio measurement application, each curve corresponds to one frequency band and the time period is one day. The data were collected over several days, and we treat the days as repeated measurements since it is reasonable to assume that the daily activity patterns of animals, including humans, do not differ much from day to day. Let y_{jt}^(i) be the observed value of the j-th curve at time τ_t on the i-th day, for j = 1, ..., d, t = 1, ..., n and i = 1, ..., M.

For functional data, it is common to approximate a function using basis functions. In this regard, we adopt an approximation based on spline basis functions. That is,

f(τ) ≈ Σ_{m=0}^{q} (τ^m / m!) β_m + Σ_{m=1}^{k} ((τ − ξ_m)_+^q / q!) γ_m,

where a_+ = max{a, 0}. This spline approximation uses order-q spline polynomial functions with knot points {ξ1, ..., ξk}.
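For illustration, the basis functions in this approximation can be evaluated numerically in a few lines. The following Python sketch (the function and variable names are ours; the choices q = 1, k = 8 equally spaced interior knots and n = 48 time points simply anticipate the data analysis in section 1.4) returns the polynomial part and the truncated power part of the basis evaluated at the observation times.

```python
import numpy as np
from math import factorial

def spline_basis(tau, knots, q):
    """Order-q truncated power spline basis evaluated at the time points tau.

    Returns the polynomial part P with columns tau^m / m!, m = 0, ..., q, and
    the truncated part T with one column (tau - xi)_+^q / q! per knot xi, so
    that f(tau) is approximated by P @ beta + T @ gamma.
    """
    tau = np.asarray(tau, dtype=float)
    P = np.column_stack([tau**m / factorial(m) for m in range(q + 1)])
    T = np.column_stack([np.maximum(tau - xi, 0.0)**q / factorial(q) for xi in knots])
    return P, T

# q = 1 (linear splines) with k = 8 equally spaced interior knots on one day,
# observed at n = 48 half-hour time points rescaled to [0, 1].
tau = np.linspace(0, 1, 48)
knots = np.linspace(0, 1, 10)[1:-1]
P, T = spline_basis(tau, knots, q=1)
print(P.shape, T.shape)   # (48, 2) (48, 8)
```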
With the data structure described above, we introduce the following spline mixed model in matrix form for the i-th day. Let

y^(i) = (y_{11}^(i), y_{12}^(i), ..., y_{1n}^(i), ..., y_{d1}^(i), y_{d2}^(i), ..., y_{dn}^(i))^T

be the set of observed curves for the i-th day, i = 1, ..., M. Then we have

y^(i) = (I_d ⊗ X) β^(i) + (I_d ⊗ Z) γ^(i) + ε^(i),    (1.1)

where β^(i) = (β_{10}^(i), ..., β_{1q}^(i), ..., β_{d0}^(i), ..., β_{dq}^(i))^T are fixed effects, γ^(i) = (γ_{11}^(i), ..., γ_{1k}^(i), ..., γ_{d1}^(i), ..., γ_{dk}^(i))^T are random effects, and ε^(i) = (ε_{11}^(i), ε_{12}^(i), ..., ε_{1n}^(i), ..., ε_{d1}^(i), ε_{d2}^(i), ..., ε_{dn}^(i))^T are error terms. The corresponding design matrices for the fixed effects and random effects are

X = [ 1   τ1   · · ·   τ1^q / q! ]
    [ 1   τ2   · · ·   τ2^q / q! ]
    [ ⋮                    ⋮     ]
    [ 1   τn   · · ·   τn^q / q! ] ,

Z = [ (τ1 − ξ1)_+^q   (τ1 − ξ2)_+^q   · · ·   (τ1 − ξk)_+^q ]
    [ (τ2 − ξ1)_+^q   (τ2 − ξ2)_+^q   · · ·   (τ2 − ξk)_+^q ]
    [ ⋮                                              ⋮      ]
    [ (τn − ξ1)_+^q   (τn − ξ2)_+^q   · · ·   (τn − ξk)_+^q ] .

We also assume that ε^(i) ∼ N(0, Σ), where Σ = D ⊗ Σϕ with D = diag(σ_1^2, σ_2^2, ..., σ_d^2) and Σϕ = (ϕ_{uv})_{n×n}, where ϕ_{uv} = 1 if u = v, ϕ_{uv} = ϕ if |u − v| = 1, and ϕ_{uv} = 0 otherwise. The parameters σ_j^2 allow each curve to have its own variability, and Σϕ is a temporal covariance matrix that captures possible temporal dependence within each curve. For the covariance structure of the random effects, we assume γ^(i) ∼ N(0, Σ0 ⊗ I_k) with Σ0 = (σ_{jl})_{d×d}, so that cov(γ_j^(i), γ_l^(i)) = σ_{jl} I_k for j, l = 1, ..., d, and σ_{jl} models the interaction between the j-th and l-th curves.

Let y = (y^(1), y^(2), ..., y^(M))^T be the collection of all data points. Then the matrix form of the entire model is

y = X̄ β + Z̄ γ + ε,    (1.2)

where X̄ = I_M ⊗ I_d ⊗ X and Z̄ = I_M ⊗ I_d ⊗ Z. Note that γ ∼ N(0, G) with G = I_M ⊗ Σ0 ⊗ I_k.

1.3 Bayesian Hypothesis Testing on Covariance Ordering

To investigate the interaction between curves using the spline mixed model introduced in the previous section, we consider the ordering of the covariance matrix of the random effects. In particular, we are interested in a Bayesian test of a specific ordering within a row, e.g., σ12 > σ13 > · · · > σ1d > 0.

1.3.1 Model Selection and Hypothesis Testing

Let θ = Σ0 be the parameter of interest. We assume that the parameter space Θ is partitioned as Θ = ∪_{i=1}^{d!} Θ_i, where {Θ_i} is the partition of Θ induced by the d! possible ordering restrictions of {σ12, σ13, ..., σ1d, 0}. Note that each subspace Θ_i represents a model M_i with the corresponding ordering. To perform model selection, we consider testing the hypotheses

H_i : θ ∈ Θ_i   versus   H_j : θ ∈ Θ_j,   i, j = 1, ..., d!, i ≠ j.

Given the priors π_i = π(Θ_i) and π_j = π(Θ_j), suppose the posterior probabilities are ν_i = π(Θ_i | y) and ν_j = π(Θ_j | y). Then the Bayes factor is defined as the posterior odds ratio divided by the prior odds ratio,

B = (ν_i / ν_j) / (π_i / π_j) = ( ∫_{Θ_i} π(θ_i) f(y | θ_i) dθ_i / ∫_{Θ_j} π(θ_j) f(y | θ_j) dθ_j ) / (π_i / π_j).    (1.3)

When the Bayes factor is large enough, we favor H_i. To compute the Bayes factor, we need to place an appropriate prior on Σ0. The prior distribution should lead to a proper posterior, and the computation of π_i/π_j should be easy. However, common priors for a covariance matrix, such as Jeffreys' prior and the inverse Wishart prior, do not satisfy these criteria. This leads us to develop a default prior.
First, we show that Jeffreys' reference prior leads to an improper posterior, so that it is not suitable for our test.

Theorem 1. Assume d ≥ 4 and Mk > 2. The improper priors π(Σ0) ∝ |Σ0|^{−p} with (1) p = (d + 1)/2, (2) p = d/2 − 1 and (3) p = d give rise to improper posteriors.

The proof of the theorem is provided in Appendix A. By Theorem 1, we cannot calculate ν_i = π(Θ_i | y) and ν_j = π(Θ_j | y), which indicates that Jeffreys' reference prior is inappropriate for the proposed spline mixed model. When an inverse Wishart prior is considered, it is clear that the posterior distribution is proper. However, the inverse Wishart prior is not feasible for Σ0 in our testing problem, since it has no explicit ordering representation. For instance, assume d = 4 and consider testing the hypotheses H1: σ12 > σ13 > σ14 > 0 vs. H2: σ12 > 0 > σ13 > σ14. Note that in this case, Θ1 = {σ12 > σ13 > σ14 > 0} and Θ2 = {σ12 > 0 > σ13 > σ14}. Then, for the inverse Wishart prior Σ0 ∼ IW(S0, m), where S0 is a d × d positive definite matrix and m > d − 1 is a constant, it is difficult to calculate π1 = P(Σ0 ∈ Θ1) and π2 = P(Σ0 ∈ Θ2). Also, there is no guarantee of equal prior probabilities, so prior subjectivity may affect the Bayes factor. In the next section, we introduce a new prior on the covariance matrix which solves both issues. That is, the new prior guarantees propriety of the posterior distribution and equal prior probabilities, so that we do not need to calculate π_i and π_j in the Bayes factor, owing to the cancellation in π_i/π_j.

1.3.2 New Prior on Σ0 with Ordering Representation

We propose a partially improper prior on the covariance matrix via the Cholesky decomposition. For simplicity, we assume Σϕ = I_n. As we see later in section 1.4, the real data support this simple assumption.

Let Σ0 = LL^T be the Cholesky decomposition of Σ0, where L = ( ℓ11  0 ; ℓ1  L11 ) is a lower triangular matrix with ℓ11 > 0, ℓ1 ∈ R^{d−1} and L11 a (d − 1) × (d − 1) lower triangular matrix. Then we have the following representation for Σ0:

Σ0 = ( σ11  σ1∗ ; σ∗1  Σ11 ) = LL^T = ( ℓ11^2  ℓ11 ℓ1^T ; ℓ11 ℓ1  ℓ1 ℓ1^T + L11 L11^T ),

where σ∗1 = σ1∗^T. This re-parametrization separates the constrained vector of parameters (ℓ1) from the unconstrained parameters (ℓ11 and L11). Note that σ1∗ = (σ12, σ13, ..., σ1d) = ℓ11 ℓ1^T. Since ℓ11 > 0, each subset of the parameter space, Θ_i, corresponds to an ordering of the entries of ℓ1 for i = 1, 2, ..., d!. The following theorem guarantees a proper posterior distribution when a flat prior on ℓ1 is used.

Theorem 2. Assume n > q + 1 + k and Mk ≥ d − 1. Let Σ0 = LL^T, where L = ( ℓ11  0 ; ℓ1  L11 ). Suppose that the priors satisfy (1) π(β) ∝ 1, (2) π(D) = Π_{j=1}^{d} π(σ_j^2), where π(σ_j^2) ∝ 1/(σ_j^2)^{α_j + 1}, α_j > 0 for all j = 1, ..., d, and (3) π(L) = π(ℓ11) π(ℓ1) π(L11), where π(ℓ1) ∝ 1 and π(ℓ11) and π(L11) are proper. Then the posterior is proper.

The proof of the theorem is provided in Appendix A.

Now we reconsider the testing example H1: σ12 > σ13 > σ14 > 0 vs. H2: σ12 > 0 > σ13 > σ14. Theorem 2 guarantees proper posteriors, so that ν1 = π(Θ1 | y) and ν2 = π(Θ2 | y) are well defined. Furthermore, when the flat prior π(ℓ1) ∝ 1 is used, the relationship between σ1∗ and ℓ1 implies π(Θ1) = π(Θ2) = 1/d! once 0 is included in the ordering of the covariance parameters. Hence, in this situation, the priors cancel out and the Bayes factor reduces to the posterior ratio B = ν_i/ν_j, which can be calculated from posterior samples straightforwardly.
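As a computational note, once posterior draws of ℓ1 are available (from the Gibbs sampler described in the next section), each ν_i is estimated by the fraction of draws falling in the corresponding ordering region, and the Bayes factor is the ratio of two such fractions. A minimal Python sketch is given below; the array of placeholder draws and the encoding of an ordering are illustrative, not part of the model.

```python
import numpy as np

def ordering_probability(ell1_draws, order):
    """Estimate P(Theta | data) for an ordering of (sigma_12, ..., sigma_1d, 0).

    ell1_draws : (N, d-1) array of posterior draws of ell_1; since ell_11 > 0,
                 the ordering of sigma_1* = ell_11 * ell_1 equals that of ell_1.
    order      : column indices listed from largest to smallest, with None
                 standing for the value 0; e.g. (0, 1, 2, None) encodes
                 sigma_12 > sigma_13 > sigma_14 > 0.
    """
    vals = [np.zeros(len(ell1_draws)) if j is None else ell1_draws[:, j] for j in order]
    keep = np.ones(len(ell1_draws), dtype=bool)
    for a, b in zip(vals[:-1], vals[1:]):
        keep &= a > b
    return keep.mean()

# Bayes factor for H1: s12 > s13 > s14 > 0 against H2: s12 > 0 > s13 > s14;
# with equal prior probabilities it is simply B = nu_1 / nu_2.
ell1_draws = np.random.default_rng(0).normal(size=(1500, 3))   # placeholder draws
nu1 = ordering_probability(ell1_draws, (0, 1, 2, None))
nu2 = ordering_probability(ell1_draws, (0, None, 1, 2))
print(nu1 / nu2)
```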
1.3.3 Bayesian Estimation To obtain the posterior samples of parameters to calculate the Bayes factor, we use a Markov Chain Monte Carlo (MCMC) algorithm, in particular, Gibbs sampling algorithm. To derive the posterior distribution, we consider the following priors for the parameters, β, σj2 and Σ0 . We assume π(β) ∝ 1, π(σj2 ) ∼ IG(aj , bj ), where IG(a, b) denotes the inverse Gamma distribution with the density function π(x) ∝ x−(a+1) e−1/bx . For the prior of Σ0 , we consider the proposed prior of L developed in the previous section. That is, π(L) = π( 11 )π( 1 )π(L11 ) with π( 1 ) ∝ 1, π( 11 ) ∼ IG(a0 , b0 ) and π(L11 LT11 ) ∼ IW (Ω11 , d − 1). Then, by Theorem 2, the posterior distribution is proper and the MCMC algorithm is valid with the proposed prior. In this section, we derive the conditional posterior distributions of 11 , 1 and L11 . Conditional posterior distribution of other parameters are given in Appendix B. First, we introduce the matrix S which is quadratic forms of random effects. That is, S= M i=1 (i) T (i) k gl l=1 gl (i) with gl (i) (i) (i) = γ1l , γ2l , . . . , γdl , for l = 1, . . . , k and i = 1, . . . , M . 9 Then, we can obtain the conditional posterior distributions of 1 , 211 and L11 LT11 : ∼ N ( ∗1 , V ∗ ), | γ, 1 , L11 , . . . ∼ IG L11 LT11 | γ, 11 , 1 ∼ IW (M k − d − 1, S ∗ + Id−1 ) 1 | γ, 11 , L11 , . . . 2 11 1 Mk − d + 1 s 1 −1 + α0 , 11 + , 2 2 β0  sT1  2 s11  where ∗1 = s11 s1 , V ∗ = s11 (L11 LT11 ) and S ∗ = S11 − s 1 s1 sT1 for S =  . 11 11 11 1 s1 S11 With above conditional posterior distributions, we can obtain posterior samples by the Gibbs sampling algorithm. 1.4 Analysis of Audio Frequency Data of Wildlife Species in Michigan In this section, we apply our methodology to audio frequency data of wildlife species in Michigan. The wildlife sounds from two locations, L00 and L02, in Upper Peninsula of Michigan, U.S. were collected in 1 minute interval from June 1 2010 to June 15 2010 at every 30 minutes from 12 a.m. to 11:30 p.m. on each day. L00 is located at Crawford Bay Lane area, Cheboygan, MI which is close to main city roads while L02 is at Crane Nest Narrows, Cheboygan, MI and it is relatively pristine area. Then, the audio data were converted into the frequency data and divided into ten frequency bands which correspond to different groups of species (human, bird type 1, bird type 2, etc.). More relevant information about the source of data can be found in the Remote Environmental Assessment Laboratory at 10 Michigan State University (real.msu.edu). The behavior of energy readings in each frequency band represents activities of different types of species and we can observe how other species interact with human in their activities by how the energy reading curves are changing over time of day. (i) Let {Pj (t) ≥ 0 : τt ∈ [0, T ]; j = 1, . . . , D; i = 1, . . . , M } be the energy readings for D frequency bands obtained from recorded audio measurements. Since the energy readings are (i) scaled, Pj (t) is the proportion for the jth frequency band at time τt on the ith day. Thus, (i) a natural constraint on Pj (t) is (i) D j=1 Pj (t) = 1 for t = 1, · · · , n and i = 1, . . . , M . For (i) fixed i and t, {Pj (t)}D j=1 can be viewed as compositional data. A typical additive log-ratio (i) transformation is widely applied to compositional data, which we adopt for Pj (t) as well. We define the additive log-ratio transformation as (i) (i) yjt = log Pj (t) (i) , j = 1, . . . , D − 1. PD (t) Assume that the set of time points, T = {τ1 , τ2 , . . . 
, τn } with 0 < τ1 < τ2 < · · · < τn < T (i) and d = D − 1. Let yjt : t = 1, . . . , n; j = 1, . . . , d; i = 1, . . . , M be the response variables for the proposed spline mixed model. Since the last four frequency bands contribute on audio intensity relatively small through the period, we consider first six frequency bands and combined the last four frequency bands into one component. Thus, d = 6. The number of time points, n is 48 and the number of days, M is 15. Then, we apply the spline mixed model introduced in section 1.2. An intuitive choice of the order of polynomial for fixed effects is q = 1. We set up k = 8 equally spaced time nodes as knots for the spline functions. For the model (1.2) in section 1.2, we start with some level of time-dependence in Σϕ 11 with the parameter ϕ. To investigate the time-dependence, we obtain maximum likelihood estimator (MLE) of β, γ and ϕ by assuming γ as fixed effect coefficients for simplicity. The derivation of optimal equations for MLE is provided in Appendix C. The MLE of ϕ is ϕ = −0.038 for location L00 and ϕ = 0.024 for location L02. We see in both locations, the correlation between time points are weak, which suggests us to assume ϕ = 0 for the rest of the analysis. 1.4.1 Bayesian Inference Results We run the MCMC algorithm via Gibbs sampling for N = 5000 iterations with three chains. To check the convergence of chains, we carry out Gelman-Rubin diagnostics by calculating the potential scale reduction factor [10]. All three chains converged after 2000 iterations. Our estimation is based on the posterior mean of last 1500 posterior samples, i.e. 500 posterior samples from each chain. The upper 4 × 4 submatrices of estimated covariance matrices ˜ 0, Σ ˜ 2 are the Bayesian of random effects are shown below. Σ0 , Σ2 are the MLEs and Σ estimates for locations L00 and L02, respectively. For location L00,   0.2307 0.1384 0.1140    0.1384 0.1872 0.1546  Σ0 =    0.1140 0.1546 0.1703   0.0611 0.0826 0.1056   0.0611   0.2646 0.1334      0.1334 0.2119 0.0826   ˜   , Σ0 =     0.1079 0.1599 0.1056      0.1114 0.0605 0.0830 12  0.1079 0.0605    0.1599 0.0830   .  0.1906 0.1143    0.1143 0.1481 For location L02,   0.0768    0.0075  Σ2 =   −0.0326   −0.0365    0.0075 −0.0326 −0.0365   0.1118 0.0287 −0.0201 −0.0271         0.0287 0.1073  0.0758 0.0686 0.0400  0.0795 0.0361  ˜   , Σ =   . 2    −0.0201 0.0795  0.1657 0.1173 0.0686 0.1500 0.1209        0.0400 0.1209 0.1671 −0.0271 0.0361 0.1173 0.1865 ˜ 0 and Σ ˜ 2 are very We can see that, overall, the Bayesian estimation for both locations, Σ close to the corresponding MLE, Σ0 and Σ0 , respectively. They have same signs for all the entries of covariance matrices, which justifies our Bayesian method from the likelihood per˜ 2 is slightly different from Σ2 in the sense that spective. Moreover, unlike for location L00, Σ it gives larger variance for each component σ ˜j2 and lower magnitude of negative correlations, i.e. σ ˜13 and σ ˜14 than the MLE. Finally, we suspect a difference in the overall relationship (over time points) between band 1 and either band 3 or 4 for the two locations due to the opposite sign of corresponding entries in two different locations. However, to further support our finding, we have to perform more detailed posterior sample analysis. Bayesian method allows us to obtain posterior samples of σjl from which we can investigate interaction between different types of species (i.e. 
different frequency bands) as well as compare the results from two locations. In particular, we are interested in the relationship between frequency band 1 and frequency bands 2 to 4. The frequency band 1 corresponds to the sounds from human activity while frequency bands 2, 3 and 4 represent three different categories of wildlife animals including birds. So, we compute posterior probabilities of all orderings of {σ12 , σ13 , σ14 , 0}. 0 is included to see whether σjl is positive or negative. There are 4! = 24 possible orderings. We list the first three highest posterior probabilities 13 among 24 possible orderings for each location. For location L00, (0) P (Θ1 = {σ12 > σ13 > σ14 > 0} | data) = P ( 12 > 13 > 14 > 0 | data) = 0.702, (0) P (Θ2 = {σ13 > σ12 > σ14 > 0} | data) = P ( 13 > 12 > 14 > 0 | data) = 0.133, (0) P (Θ3 = {σ12 > σ14 > σ13 > 0} | data) = P ( 12 > 14 > 13 > 0 | data) = 0.083. For location L02, (2) P (Θ1 = {σ12 > 0 > σ13 > σ14 } | data) = P ( 12 > 0 > 13 > 14 | data) = 0.493, (2) P (Θ2 = {σ12 > σ13 > 0 > σ14 } | data) = P ( 12 > 13 > 0 > 14 | data) = 0.222, (2) P (Θ3 = {σ12 > 0 > σ14 > σ13 } | data) = P ( 12 > 0 > 14 > 13 | data) = 0.149. Bayes factors computed from the posterior probabilities can be used to select a particular ordering for each location. Thus, we consider testing the following hypotheses (L) Hi : θ(L) ∈ Θi (L) vs. Hj : θ(L) ∈ Θj , where i, j = 1, 2, 3 represent those particular ordering with high posterior probabilities L = (L) (L) 0, 2 represents the location of the data collected. Since π(Θi ) = π(Θj ) for all i, j = 1, 2, 3 (L) (L) and L = 0, 2, the Bayes factor is just a ratio of posterior probabilities, i.e., Bij = νi (L) /νj . (0) Table 1.1 and 1.2 show the Bayes factors Bij at locations L00 and L02, respectively. From the above analysis, we found that there are significant differences in the patterns of audio frequency bands over time in two locations, L00 and L02. For L00, all three bands 2, 14 Denom. \ Num. σ12 > σ13 > σ14 > 0 σ12 > σ13 > σ14 > 0 1 5.28 σ13 > σ12 > σ14 > 0 σ12 > σ14 > σ13 > 0 8.42 σ13 > σ12 > σ14 > 0 σ12 > σ14 > σ13 > 0 0.19 0.12 1 0.63 1.60 1 Table 1.1: Bayes Factor for Significant Orderings at L00 Denom. \ Num. σ12 > 0 > σ13 > σ14 σ12 > σ13 > 0 > σ14 σ12 > 0 > σ14 > σ13 σ12 > 0 > σ13 > σ14 1 2.22 3.32 σ12 > σ13 > 0 > σ14 0.45 1 1.50 σ12 > 0 > σ14 > σ13 0.30 0.67 1 Table 1.2: Bayes Factor for Significant Orderings at L02 3 and 4 were synchronous with human activity, with band 4 in the lowest level and band 2 in the highest level. On the other hand, band 4 was most sensitive to human activity for the pristine location L02, following by band 3. However, band 2 was synchronous with human activity in L02. 1.4.2 Discussion In this chapter, we investigated audio measurements data from the real ecological world. Our primary objective is to extract the information that reflects the relationship and interaction between the variables of interest and then quantify the synchronicity or discrepancy of such interaction among locations. By classifying the observations as the compositional data varying over time points, we are able to view them as functional curves. We proposed a spline mixed model and interpret the random effect coefficients of spline polynomials as a measure of interaction between curves over time. By specifying the covariance components of random effects, we successfully transformed our concerns to Bayesian inference for the orderings of covariance matrices. 
To solve our problem, we developed a new non-informative prior on the covariance matrices with ordering which possesses posterior propriety as well as 15 simple implementation for testing. Finally we applied our model with the new prior to the biodiversity data from Michigan and our Bayesian analysis successfully identified the main characteristics of interactions between species. Some further extension of our current work may be worthwhile. For instance, it can be challenging but interesting if one consider generalizing the type of prior we proposed to covariance matrices with a wider class of orderings. 16 Chapter 2 Joinpoint Detection using L1-Penalized Spline Method 2.1 Introduction Cancer is one of the major causes of death in the United States. According to a global and regional mortality study from 235 causes of death for 20 age groups presented on The Lancet, it claimed 8.0 million lives in 2010, 15.1% of all deaths worldwide, with large increases in deaths from trachea, bronchus, and lung cancers [15]. Statistical analyses of cancer incidence have gained increasing attention in recent years given such challenging public health issue. Many of them are conducted to investigate a single cancer rate curve varying over time. An major interest is detecting the changes of the trend of annual cancer incident rates. Such concern can be viewed as a joinpoint identification problem based on a certain structure of regression models in statistics. Several methods have been developed for the joinpoint problem recently, including [14], [21], [11] and [4]. In general, the joinpoint regression model can be defined in the following way. Let y be an observed response at time point x ∈ S = {x1 , x2 , · · · , xn } with x1 ≤ x2 ≤ · · · ≤ xn . Then we assume y = β0 + β1 x + δ1 (x − τ1 )+ + · · · + δk (x − τk )+ + , 17 (2.1) where τ1 < τ2 · · · < τk are unknown joinpoints, represents the error and a+ = a✶{a > 0}. The main objective of joinpoint detection is to discover the correct number of joinpoints (k) as well as the location of them (τ s). One classical permutation test based (PTB) approach, proposed by Kim, et al. [14] in 2000, has been adopted by the National Cancer Institute (NCI) as an analytical tool (Joinpoint Software) in their Surveillance Research Program. It generally applies a grid search technique to fit a regression model under homoscedastic and uncorrelated errors. The identification of significant joinpoints is conducted by multiple pairwise permutation tests which compare two best regression models fitted with different sets of presumed joinpoints. From a statistical perspective, it only involves basic estimation and testing knowledge for piecewise linear regression models thus is well motivated and easy to understand. However, due to the nature of permutation test and grid search technique, the PTB approach requires heavy computation if either maximum possible number of joinpoints (k) or the sample size (n) is relatively large, for example, k > 5 or n > 50 usually. Such limitation makes joinpoint detection with big dataset infeasible using the PTB. In this paper we propose a new method to detect the joinpoints on a discrete time grid by introducing an 1 penalized regression spline model. Our model specifies spline base at time knots as the covariates thus will be able to capture potential features of joinpoints as a subset of statistically significant slopes. 
Meanwhile, by adding a penalty term, we allow for shrinkage of the spline model so that insignificant slope changes are eliminated.

The rest of this chapter is organized as follows. We define the penalized regression spline model in section 2.2 and establish some theoretical properties of our estimator in section 2.3. In section 2.4, we carry out a simulation study of our method and compare its performance with the PTB method. Section 2.5 provides a case study of real data, applying our approach to the cancer incidence rates for the overall U.S. from 1973 to 1999; a comparison with the results from the Joinpoint Software is also presented. We conclude with a discussion of possible extensions and future work in section 2.6.

2.2 L1-Penalized Regression Spline Model

In this section we introduce the penalized regression spline model. Assume we observe data {y_i}_{i=1}^n independently at time points {x_i}_{i=1}^n with x_1 ≤ x_2 ≤ · · · ≤ x_n. We define a penalized regression model with first-order spline bases as follows:

y = β_0 + β_1 x + β_2 (x − τ_2)_+ + · · · + β_{p−1} (x − τ_{p−1})_+ + ε,    (2.2)

where ε ∼ N(0, σ^2) and τ_2 < · · · < τ_{p−1} are (p − 2) time knots in the time interval, so that the model specifies p covariates, as in a linear model. For the annual cancer incidence rate problem, it is appropriate and straightforward to take the set of candidate joinpoints to be the set of observed covariates (years), that is, p = n and τ_j = x_j for j = 2, ..., n − 1. Without loss of generality, we assume they are equally spaced without ties. Furthermore, to guarantee the conditions needed for the theoretical justification of our model, we rescale the observed time points, and with them the spline bases, so that x_j = j/n, j = 1, 2, ..., n. In matrix form, the model is

y = Xβ + ε,   ε ∼ N(0, σ^2 I_n),    (2.3)

where y = (y_1, y_2, ..., y_n)^T, β = (β_0, β_1, ..., β_{n−1})^T, ε = (ε_1, ε_2, ..., ε_n)^T and

X = [ 1   x_1   (x_1 − x_2)_+   · · ·   (x_1 − x_{n−1})_+ ]
    [ ⋮                                          ⋮        ]
    [ 1   x_n   (x_n − x_2)_+   · · ·   (x_n − x_{n−1})_+ ] .    (2.4)

Note that the PTB method works only on a small subset of time points as the candidate set for joinpoints each time it performs estimation or testing via grid search. Our penalized method, on the other hand, allows the entire set of candidates and is motivated in a natural, data-driven way. Note also that the model excludes the first year and the last year, since it is not reasonable to view them as possible joinpoints without further information.

When the sample size n is large, the usual least squares estimate of β may contain a large number of nonzero components and fail to be a consistent estimator. This suggests applying one of the most commonly used penalty structures, the Least Absolute Shrinkage and Selection Operator (LASSO), introduced by Tibshirani [20], which amounts to minimizing the L1-penalized negative log-likelihood. Because of the singularity of the L1 norm at the origin, the LASSO is a powerful tool for model selection: we keep only the candidates with nonzero estimated coefficients, which can be interpreted directly as significant slope changes and thus as evidence of joinpoints. In our study, we treat the intercept (β_0) and the first slope (β_1) as the baseline trend of the curve. It follows that they are not supposed to characterize joinpoints and thus are not penalized, unlike the truncated spline coefficients; a small construction sketch for the design matrix in (2.4) is given below.
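The design matrix in (2.4) is straightforward to build. A minimal Python sketch (the function name and layout are ours) under the rescaling x_j = j/n is:

```python
import numpy as np

def joinpoint_design(n):
    """Design matrix X of model (2.4) with rescaled times x_j = j/n.

    Columns: intercept, baseline slope x, and one truncated line (x - x_j)_+
    for each interior candidate joinpoint x_2, ..., x_{n-1}.
    """
    x = np.arange(1, n + 1) / n
    trunc = np.maximum(x[:, None] - x[None, 1:n-1], 0.0)   # (x_i - x_j)_+, j = 2..n-1
    return np.column_stack([np.ones(n), x, trunc])

X = joinpoint_design(27)
print(X.shape)   # (27, 27): p = n columns for beta_0, beta_1, beta_2, ..., beta_{n-1}
```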
It is still feasible, however, to restrict the absolute magnitude of β_0 and β_1 by a pre-specified constant in practice. So we consider a constrained parameter space B = {β : |β_0| + |β_1| ≤ M}, where M is a positive constant independent of n. Our LASSO type estimator is given by

β̂ = argmin_{β ∈ B} { ||y − Xβ||_2^2 / n + λ Σ_{j=2}^{n−1} |β_j| }.    (2.5)

The parameter λ above is called the tuning (shrinkage) parameter; it controls how strongly the β_j are penalized. In practice, λ is often pre-specified or obtained by a data-driven validation procedure. Several algorithms for the LASSO, including the well-known Least Angle Regression (LAR) [5], have been developed in recent years. Most of them automatically assume penalization of the full set of coefficients. However, since the baseline slope β_1 is not penalized in our model, we need an implementation that allows penalization of a partial set of parameters. The coordinate descent algorithm proposed by Friedman, Hastie and Tibshirani [8] meets this demand by solving the regularization paths as we desire.

2.3 Consistency of the LASSO Type Estimator

As mentioned in the previous section, the LASSO method is computationally convenient. Meanwhile, under certain assumptions on the regression model, it also provides statistical accuracy: estimators of LASSO type have been proved to be consistent in a series of notable works, with some basic results available in [3]. We provide a proof of consistency of the LASSO type estimator in (2.5) following their approach.

From now on, we define Σ = X^T X / n and denote its j-th diagonal entry by σ_j^2, i.e., σ_j^2 = Σ_jj. Given the random error ε, we define the event

J_{λ0} = { ω : max_{1≤j≤n} 2 |ε^T X^(j)| / n ≤ λ0 },

where X^(j) denotes the j-th column of the design matrix X. First we give a lemma indicating that the event J_{λ0} has high probability.

Lemma 1. Assume σ_j^2 ≤ K for j = 1, 2, ..., n and some K > 0. Then for all t > 0 and λ0 = 2Kσ sqrt((t^2 + 2 log n)/n), we have

P(J_{λ0}) ≥ 1 − 2 exp(−t^2/2).    (2.6)

The proof of Lemma 1, and of the other propositions, lemmas and theorems, is provided in the Appendix. In particular, taking t^2 ≥ 4 log n, we obtain

P( max_{1≤j≤n} 2 |ε^T X^(j)| / n > λ0 ) ≤ 2 e^{−2 log n} = 2/n^2    (2.7)

for λ0 ≥ 2 sqrt(6) Kσ sqrt(log n / n). In practice we may need to replace the unknown variance parameter σ^2 by a reasonable estimator σ̂^2. The following lemma shows that we can choose the sample second raw moment as a candidate for σ̂^2.

Lemma 2. Let r = ||Xβ||_2^2 / σ^2 be the signal-to-noise ratio of the model (2.2). Consider the estimator σ̂^2 = Y^T Y / n for σ^2. Given any 0 < α < 1, we have

P( σ̂^2 < (1 − α) σ^2 ) ≤ min{ 2(n + 2r) / (α^2 n^2), (12(n + 2r)^2 + 48(n + 4r)) / (α^4 n^4), 1 }.    (2.8)

The lemma says that the proposed σ̂^2 cannot be too small relative to σ^2. Combining the two lemmas above, we conclude the consistency of our estimator β̂.

Theorem 3. (Consistency of the LASSO Type Estimator) Suppose σ_j^2 ≤ K for j = 1, 2, ..., n and some K > 0. Let σ̂^2 = Y^T Y / n and λ = 12Kσ̂ sqrt(log n / n). Consider the model (2.2) with the estimator β̂ in (2.5). If

lim_{n→∞} ||β||_1 / sqrt(n / log n) = 0   and   r_n = ||Xβ||_2^2 / σ^2 ≤ R < ∞ for some R > 0,

then ||X(β̂ − β)||_2^2 / n → 0 almost surely as n → ∞, i.e.,

P( lim_{n→∞} ||X(β̂ − β)||_2^2 / n = 0 ) = 1.    (2.9)

It is easy to verify that the matrix Σ in model (2.2) satisfies σ_j^2 ≤ K for all j with K = 1.
Furthermore, note that the proof of Theorem 3 remains valid when the number of covariates p is greater than the number of time points n although they are identical under our design. That is, our model (2.2) can be extended to allow more intermediate time knots as joinpoint candidates. The main theorem above claims the consistency of our LASSO estimator in terms of prediction error. A stronger conclusion on such type of estimators has also been studied in [3]. In their publication, a more refined 1 consistency result for β is provided under some additional compatibility conditions for the design matrix X. An alternative restricted eigenvalue condition is given [2]. Given an index set S ⊂ {1, 2, · · · , n}, denote β S = (β0,S , β1,S , · · · , βn−1,S )T where βj,S = βj ✶{j ∈ S} for j = 1, 2, · · · , n − 1. The compatibility condition is stated as follows. Assumption 1. (Compatibility Condition) Let Σ = X T X/n be the Gram matrix. Given the true index set S0 with cardinality s0 = |S0 |, there exists φn > 0 such that for all β 23 satisfying β S c 1 ≤ 3 β S0 1 , it holds that 0 s β S0 21 ≤ 02 β T Σβ φn and s 0 λ2 = o(1) φ2n Unfortunately, we point out that under our design with λ = O( (2.10) log n/n), the matrix Σ violates the compatibility condition. Proposition 1. The eigenvalues of the design matrix in (2.4) are √ √ 1 n + 2 + n2 + 4 n + 2 − n2 + 4 , ν2 = ν3 = · · · = νn−1 = , νn = ν1 = 2n n 2n (2.11) From the claim above we see that at least (n − 1) of the eigenvalues of X are no greater than n−1 . Therefore, by the properties of eigenvalues, we conclude that the smallest eigenvalue of Σ = X T X/n, denoted by ν1 (Σ), is at most n−3 . Given the simple fact that β S0 21 ≤ s0 β 22 and ν1 (Σ) β 22 ≤ β T Σβ for all β, it follows that φ2n ≤ n−3 hence λ2 s0 /φ2n ≥ O(n2 log(n)), which contradicts the assumption that s0 λ2 /φ2n = o(1) in (2.10). In other words, unless we choose a λ extremely close to 0, we will not be able to meet the compatibility condition. As discussed above, the violation of compatibility for Σ implies that we are not able to use a similar technique as before to prove the 1 consistency result for β. This is not surprising since the design matrix X consists of rescaled time nodes xj = nj , j = 1, 2, · · · , n so that we may need conditions such as β 2 = O(n) to have r = Xβ 22 /σ 2 < ∞. Also, it is natural to assume s0 < ∞. Then, given S0 = S, β 1 = O(n). Hence the usual 1 consistency for β may not be a reasonable criterion under our settings. On the other hand, if we define S = {2 ≤ j ≤ n − 1 : βj = 0} as our selected set of 24 joinpoints, one may desire the following oracle properties P (S = S0 ) → 1 or P (S ⊃ S0 ) → 1 as n → ∞ (2.12) It indicates that the estimated effective joinpoints will at least cover the true set S0 when n is large. In our case, we may seek a weaker oracle property. Note that X β = (X 1 β, X 2 β, · · · , X n β)T , where X j s are row vectors of X. Let S∗ = {2 ≤ j ≤ n − 1 : X j β = 0} and S∗ = {2 ≤ j ≤ n − 1 : X j β = 0}. Then one may show that P (S ∗ = S∗ ) → 1 or P (S ∗ ⊃ S∗ ) → 1 as n → ∞ (2.13) Note that here X j (j ≥ 2) has nonzero elements only in its first j entries. After all, it is possible that although as a sufficient condition, the compatibility assumption fails under design X, the numerical result still shows excellent performance in variable selections. 
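Computationally, the estimator in (2.5) only requires a LASSO solver that accepts per-coefficient penalty weights, with zero weights on β_0 and β_1. The following Python sketch is a plain coordinate descent with soft-thresholding updates for this weighted problem; it is an illustration under our stated objective, not the exact implementation behind the reported results, and for simplicity it drops the box constraint |β_0| + |β_1| ≤ M. The toy data and the choice of λ (which just mimics the rate in Theorem 3 with K = 1 and a rough noise scale) are ours.

```python
import numpy as np

def soft_threshold(rho, thresh):
    return np.sign(rho) * max(abs(rho) - thresh, 0.0)

def weighted_lasso(X, y, lam, weights, n_iter=1000, tol=1e-8):
    """Coordinate descent for  ||y - X b||_2^2 / n + lam * sum_j weights[j] * |b_j|.

    A weight of 0 leaves the corresponding coefficient unpenalized.
    """
    n, p = X.shape
    beta = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n             # z_j = ||X^(j)||^2 / n
    resid = y.astype(float).copy()           # residual for beta = 0
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            resid += X[:, j] * beta[j]       # add back the j-th contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam * weights[j] / 2.0) / z[j]
            resid -= X[:, j] * beta[j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# Toy run: n = 27 years, one true slope change at x_18, design matrix as in (2.4).
n = 27
x = np.arange(1, n + 1) / n
X = np.column_stack([np.ones(n), x, np.maximum(x[:, None] - x[None, 1:-1], 0.0)])
rng = np.random.default_rng(1)
y = 5.0 + 0.3 * x + 3.0 * np.maximum(x - 18 / n, 0.0) + rng.normal(scale=0.01, size=n)
weights = np.ones(n)
weights[:2] = 0.0                            # beta_0 and beta_1 are not penalized
lam = 12 * 0.01 * np.sqrt(np.log(n) / n)
beta_hat = weighted_lasso(X, y, lam, weights)
print([j for j in range(2, n) if abs(beta_hat[j]) > 1e-8])   # selected joinpoint indices
```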
2.4 Simulation Study In this section, we will examine the performance of joinpoint detection based on the 1 penalized regression (LASSO) method and the PTB method realized by the Joinpoint Software via several simulations. To be consistent with the real data analysis in the next section, we first consider n = 27 and x ∈ S = {1, 2, · · · , 27}. The true models with two joinpoints can be written as follows. y = β0 + β1 x + δ1 (x − τ1 )+ + δ2 (x − τ2 )+ + , 25 (2.14) where the errors are still assumed to be independent Gaussian N (0, σ 2 ). To further specify different models based on quantity of slope changes, assignment of joinpoint positions and the noise level, we let β0 = 5, (τ1 , τ2 ) = (8, 18) or (18, 23) and σ 2 = 0.0001 or 0.001. Meanwhile, as given in [14], the slopes, β1 , δ1 and δ2 are determined by the annual percentage change APC = (APC1 , APC2 , APC3 ) in the following way: β1 = log(1 + 0.01APC1 ), δ1 = log(1 + 0.01APC2 ) − log(1 + 0.01APC1 ) and δ2 = log(1 + 0.01APC3 ) − log(1 + 0.01APC2 ). In our simulation we set APC = (1, 3, 1) or (4, −1, 2), which leads to different trends of data. Hence all the combinations of true parameters give us eight different scenarios. In the LASSO spline model, we clarify the design matrix X as the form of (2.4) with xj = j/27, j = 1, 2, · · · , 27. That is, we rescale observed time points without losing information. One of the major concern when fitting the generated dataset to our model is to select appropriate tuning (penalty) parameters (λs) since it directly controls the sparsity, or the effective number of joinpoints. Some popular rules, such as cross validation (CV) [12] and Bayesian information criteria (BIC) [22], are widely used to handle this issue. As a model validation technique, cross validation partitions a sample of data into a training set and a testing set, performs the analysis on the training set, and validates the analysis on the testing set. It indicates better performance in terms of accuracy of prediction in many literatures. We will apply a 5-fold cross validation in the simulations. For instance, the original observations are randomly partitioned into five almost equal size subsamples. A single subsample is retained as the validation data for testing the model while the remaining four are used as training data. The cross validation process is then repeated five times, with each of the five subsamples used exactly once as the validation data. We simply select the tuning parameter λ with minimum mean squared error (MSE). However, sometimes we may prefer a more simplified model with fewer covariates, or, in our case, not too many estimated 26 joinpoints. Hastie, Tibshirani and Friedman suggest a ‘one standard error rule’ [8], which chooses the largest λ value within one standard error of the original minimizer λ given by ˜ = sup{λ : MSE(λ) ≤ MSE(λ) + SE(λ)}, where MSE(λ) and SE(λ) cross validation, i.e. λ denote the mean squared error and standard error under λ. We will provide the results under both original cross validation approach (L-CV) and one standard error rule (L-SE). As an alternative model for comparison purpose, we also apply the PTB method implemented in the Joinpoint Software. Since the PTB approach is based on the grid search when fitting the model, it is inevitably time consuming even if one requires a moderate amount of candidate joinpoints given a relatively large sample size of data (usually when n ≥ 40). 
Especially, when the maximum number of joinpoint is greater than 4, the computation will be quite slow. As advised by the user manual of the software, we set the maximum number of joinpoint as 4. Since our penalized spline model does not have such limitation and process the same data in a much faster way as described in section 2.2, we claim that it demonstrates more flexibility in modeling various types of real data effectively. Moreover, one of the settings for the PTB implementation imposes a minimum distance (3 as default) between two consecutive estimated joinpoints. Therefore, to make a fair comparison, we also consider a modification for both L-CV and L-SE methods: First we call a sequentially ordered subset of originally selected components a ‘cluster’ if it is the biggest set consisting of two or more consecutive nonzero coefficients that preserve the same sign, either all positive or all negative, i.e. a set either of the form B+ (s, t) := {(βs , βs+1 , · · · , βt−1 , βt ) : βs−1 ≤ 0; βs , · · · , βt > 0; βt+1 ≤ 0} or of the form B− (s, t) := {(βs , βs+1 , · · · , βt−1 , βt ) : βs−1 ≥ 0; βs , · · · , βt < 0; βt+1 ≥ 0} for some 2 ≤ s < t ≤ n − 2. Then for all other selected components that are not from a ‘cluster’, we call each of them a ‘singleton’. Our modification eventually pick only one component 27 from each ‘cluster’ along with all the ‘singletons’ which minimizes the mean squared error as the joinpoints. We denote such modification for L-CV and L-SE methods as L-CV-M and L-SE-M, respectively. To give a summary of all the methods we use to compare with the PTB, we present the following 2 × 2 chart. Tuning Parameter \ Refinement Original LASSO λ1 via Cross-validation L-CV λ2 by the ‘One Standard Error’ Rule L-SE Modified LASSO L-CV-M L-SE-M Table 2.1: A Summary of Four Methods to Compare with the PTB We simulate the data 1000 times for each of eight scenarios and fit them by L-CV, LCV-M, L-SE, L-SE-M and PTB methods. The performance of joinpoint detection can be characterized by three quantities: (1) the average number of correctly identified joinpoints, #CJ; (2) the average number of incorrectly identified joinpoints, #IJ and (3) the average of mean squared error of predictions, PMSE = X(β − β) 2 /n. The results are listed in Table 2.2 and 2.3. From the result we see that in general, the 1 penalized regression spline model with original cross validation procedure (L-CV) provides better detection of true joinpoints compared to PTB or L-SE. However, a relatively high #IJ value indicates it usually introduces more joinpoints than the other two methods. The penalized model with cross validation under one standard error rule (L-SE) performs equally or better than PTB in detecting the true joinpoints for most of the time. And it brings down the #IJ value compared to L-CV model due to a more penalization of parameters. Meanwhile, the modified approach L-CV-M usually leads to more #CJ values than L-SE and less #IJ values than L-CV thus gives a more balanced and robust estimation between L-CV and L-SE. 
The L-SE-M reduces the number of effective joinpoints dramatically but is more likely to miss the true joinpoints 28 σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.377 1.648 1.436 1.592 1.392 1.897 1.942 1.909 1.933 1.900 #IJ PMSE(×10−4 ) 1.284 0.266 3.852 0.940 1.659 0.277 3.831 1.093 1.691 0.419 0.236 0.200 2.982 0.993 0.694 0.208 2.996 1.019 0.714 0.211 APC (1, 3, 1) #CJ 0.306 0.696 0.448 0.283 0.215 0.960 1.391 1.032 1.171 0.966 #IJ PMSE(×10−4 ) 3.047 3.956 4.956 3.679 3.648 3.711 3.375 5.673 2.754 5.663 2.106 2.754 5.121 3.351 3.367 2.913 4.485 5.523 2.952 3.247 Method PTB L-CV L-CV-M L-SE L-SE-M (4, −1, 2) PTB L-CV L-CV-M L-SE L-SE-M Method PTB L-CV L-CV-M L-SE L-SE-M (4, −1, 2) PTB L-CV L-CV-M L-SE L-SE-M Table 2.2: Simulation Result for n = 27, (τ1 , τ2 ) = (8, 18) compared to other methods. As for the computational efficiency, as we expected, LASSO models showed notable superiority to the PTB by equipping with the coordinate decent algorithm mentioned in section 2.2. To investigate the performance of our LASSO method further under larger sample size, we design other groups of simulations with n = 54 and 135, corresponding to 2 and 5 times the original sample size of responses. We keep the settings for APC values and σ 2 same as the previous simulation settings. However, as the time line extends, it is reasonable to simply maintain the relative location of true joinpoints, for example, (τ1 , τ2 ) = (16, 36) or (36, 46) respectively for n = 54. Also note that since the PTB method will be significantly time- 29 σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.118 1.290 1.093 1.010 0.749 1.748 1.714 1.575 1.435 1.259 #IJ PMSE(×10−4 ) 1.667 0.339 3.309 0.444 1.913 0.343 3.113 0.766 2.190 0.571 0.561 0.232 3.314 0.551 1.023 0.353 3.429 0.945 1.470 0.585 APC (1, 3, 1) #CJ 0.242 0.623 0.510 0.272 0.231 0.448 1.134 0.774 0.899 0.604 #IJ PMSE(×10−4 ) 2.637 2.498 3.703 2.533 2.793 2.659 2.586 4.112 2.308 3.867 2.368 3.275 3.897 2.856 2.646 2.981 2.608 5.467 2.098 3.258 Method PTB L-CV L-CV-M L-SE L-SE-M (4, −1, 2) PTB L-CV L-CV-M L-SE L-SE-M Method PTB L-CV L-CV-M L-SE L-SE-M (4, −1, 2) PTB L-CV L-CV-M L-SE L-SE-M Table 2.3: Simulation Result for n = 27, (τ1 , τ2 ) = (18, 23) consuming under large sample size, we decide not to include it in the comparison. The results are shown on Table 2.4 through 2.7. It is clear that as the sample size n increases, the LASSO model tends to select more candidate components as joinpoints, which will be more likely to cover the true joinpoints. Meanwhile, the modified LASSO, L-CV-M and L-SE-M reduce #IJ dramatically by eliminating the redundant information within ‘clusters’. In general, both our methods L-CV-M and L-SE-M give acceptable estimates. On the other hand, note that as the sample size n increases, it is more worthwhile to study the LASSO model with asymptotic tuning parameter λ ∼ O(σ 30 log(n)/n) than the σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.729 1.445 1.729 1.443 1.883 1.696 1.883 1.696 #IJ PMSE(×10−4 ) 6.621 0.954 2.250 0.163 6.626 0.959 2.256 0.163 5.227 1.004 1.434 0.274 5.220 1.010 1.433 0.274 APC (1, 3, 1) #CJ 1.458 0.685 1.329 0.680 1.615 1.065 1.546 1.097 #IJ PMSE(×10−4 ) 8.487 2.089 4.663 1.612 7.406 3.417 4.107 1.645 6.827 2.250 4.041 1.878 6.136 3.386 3.578 1.775 Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Table 2.4: Simulation Result for n = 54, (τ1 , τ2 ) = (16, 36) √ usual one selected from cross-validation techniques. 
Thus we also use λ0 = 4 6σ log(n)/n as specified in section 2.3 to verify the asymptotic behavior of the predicted mean squared error (PMSE). Some results under σ 2 = 0.0010 are shown in Figure 2.1 and 2.2. To save computation time, we only take n = 27 × 5k for k = 3, · · · , 15 for demonstration. We can see a clear decreasing trend for the PMSE, which meets our expectation since we have shown that the prediction error should be close to 0 once the sample size is large enough. 31 Figure 2.1: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = (1, 3, 1), σ 2 = 0.0010 Figure 2.2: Asymptotic Pattern for the Predicted Mean Squared Error (PMSE): APC = (4, −1, 2), σ 2 = 0.0010 32 σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.593 1.391 1.518 1.324 1.833 0.828 1.763 0.810 #IJ PMSE(×10−4 ) 5.239 0.416 1.830 0.151 5.307 0.466 1.945 0.163 6.497 0.532 2.876 1.537 6.479 0.618 2.896 1.511 APC (1, 3, 1) #CJ 1.011 0.635 0.669 0.412 1.545 0.748 1.185 0.652 #IJ PMSE(×10−4 ) 6.176 1.650 3.951 1.544 4.262 3.489 3.186 2.329 7.744 1.739 3.894 2.035 5.753 3.488 3.240 2.460 Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Table 2.5: Simulation Result for n = 54, (τ1 , τ2 ) = (36, 46) 2.5 A Case Study: National Cancer Incidence Rate Analysis As a real application of the penalized regression model, throughout this section we will employ our method to analyze the cancer rates data. First we provide Figure 2.3 showing the age-adjusted incidence rates of colon and rectal cancer for the overall population in the U.S. from 1973 to 1999. These data reported in the SEER Cancer Statistics Review [17] utilizing the Joinpoint Software. It is easy to see some clear trends and changes over the time. We again apply PTB and LASSO to the cancer rate data. As indicated in the previous section, when applying the PTB method via the Joinpoint Software, we set the maximum number of potential joinpoints to be four. We use the modified LASSO methods L-CV-M 33 σ2 0.0001 σ2 0.0010 APC (1, 3, 1) #CJ 1.300 0.661 1.300 0.661 1.518 0.611 1.518 0.611 #IJ PMSE(×10−4 ) 16.08 1.584 5.663 0.137 16.08 1.584 5.663 0.137 13.79 1.438 4.736 6.280 13.79 1.441 4.736 6.280 APC (1, 3, 1) #CJ 1.479 0.688 1.483 0.692 1.495 0.598 1.492 0.578 #IJ PMSE(×10−4 ) 18.35 1.783 6.461 0.827 17.75 2.083 6.257 0.815 15.92 1.939 6.120 6.211 15.43 2.231 6.01 5.760 Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-CV-M Method L-CV L-CV-M L-SE L-SE-M (4, −1, 2) L-CV L-CV-M L-SE L-SE-M Table 2.6: Simulation Result for n = 135, (τ1 , τ2 ) = (40, 90) and L-SE-M for comparison purpose. The result of fitting is given in Table 2.8. In Table 2.8, PTB (#2) denotes the original unrestricted PTB method while PTB (#4) is implemented under a pre-specified number of joinpoints as four only for comparison purpose. From the results we can see that the LASSO type penalized regression methods tend to be more aggressive when selecting joinpoints than PTB. We have no strong reasons to suspect an over-fitting issue for this model since the tuning parameter λs are well chosen by cross validation procedures. Moreover, we see that L-SE-M, the most penalized method among all of our approaches, gives the same result as the original PTB (#2) while L-CV-M selects more joinpoints than any others. 
So it is clear that our method provides more flexibility in determining the total number of joinpoints, because we can always control this important quantity through an appropriate choice of tuning parameters. Finally, we show the fitted models in Figure 2.4, where we see that both approaches successfully pick out the peak value around year 1985 as well as the bottom value at year 1995, and that our approach offers a quite close fit to the incidence rate curve.

Figure 2.3: Incidence Rates of Cancer for overall U.S. (1973 - 1999)

Figure 2.4: Fitted Incidence Rates (1973 - 1999)

σ² = 0.0001
          APC = (1, 3, 1)               APC = (4, −1, 2)
Method    #CJ    #IJ    PMSE(×10⁻⁴)     #CJ    #IJ    PMSE(×10⁻⁴)
L-CV      1.897  11.89  0.551           1.550  16.72  0.905
L-CV-M    0.676  3.848  0.211           0.271  4.815  15.05
L-SE      1.895  11.88  0.554           1.553  16.69  0.925
L-SE-M    0.675  3.850  0.211           0.272  4.804  15.07

σ² = 0.0010
Method    #CJ    #IJ    PMSE(×10⁻⁴)     #CJ    #IJ    PMSE(×10⁻⁴)
L-CV      1.736  15.41  1.060           1.575  20.53  1.411
L-CV-M    0.465  5.504  0.831           0.247  5.937  13.68
L-SE      1.518  13.85  1.903           1.513  17.93  2.266
L-SE-M    0.468  4.790  0.816           0.237  5.333  14.26

Table 2.7: Simulation Result for n = 135, (τ1, τ2) = (90, 115)

Method      # of Joinpoints   Joinpoint Details (Year)
PTB (#2)    2                 1986, 1995
PTB (#4)    4                 1977, 1982, 1985, 1995
L-CV-M      7                 1976, 1980, 1982, 1985, 1990, 1993, 1995
L-SE-M      2                 1985, 1996

Table 2.8: Joinpoint Detection for Real Data Analysis

Overall, we are satisfied with the performance of our penalized regression model regarding both coverage and accuracy of the joinpoint detection. However, one should note that we assumed homoscedasticity and independence of the errors in our model for simplicity. Further approaches involving more complicated but realistic error structures are therefore worth exploring.

2.6 Conclusion and Discussion

As a summary of this chapter, we proposed a new method to solve the joinpoint detection problem on discrete time points using an ℓ1-type penalized regression spline model (LASSO). We proved a basic version of consistency of the joinpoint estimator for the LASSO and compared it with another commonly used approach, the PTB, via simulations and real data analysis. Based on the results, we conclude that our method performs well: it showed superiority over the PTB in terms of both efficiency and accuracy, and it is easy to understand as well as to implement. In practice, one may obtain good estimates of the joinpoints by carefully selecting a set of appropriate tuning parameters. Moreover, the problem of establishing stronger consistency properties of our estimator remains open and is worthwhile for further investigation.

Chapter 3 Nonparametric Bayesian Clustering of Functional Data

3.1 Introduction

We start this chapter as a continuation of the study of cancer mortality rates. However, unlike in the previous chapter, we first consider a more general scenario in which multiple curves are available for comparison and analysis at the same time. For example, if cancer data can now be obtained at the state level in the U.S., one may overlook potential relationships or interactions among curves from different states by simply fitting them individually. In fact, even for the same joinpoint detection question as in Chapter 2, based on a study of the multiple-curve simulation data appearing in section 3.5.2, our L1-penalized method of Chapter 2 gives #CJ = 0.531 and #IJ = 1.449, while the model proposed by Dass et al.
(2014), which is similar to the one we present in section 3.2, leads to #CJ = 0.816 and #IJ = 0.551, respectively. Thus it is expected that with more information about the curves, such as their group memberships, we are likely to obtain better estimates of their trends. Motivated by this comparison, we now focus on grouping together the curves that have similar patterns. That is, our main concern is whether the curves form groups with similar overall trends within a group but significant variation between groups. For instance, one may suspect that the mortality rates for all 48 states of the U.S. indicate certain geographical similarities. Hence an appropriate model is required to capture the characteristics of the patterns of the curves and to classify them into groups, or clusters, as the primary objective.

In general, the clustering problem briefly introduced above can be conveniently formulated in the following longitudinal way. Let fs(t) be the true function at time t ∈ [0, T] for the s-th individual, s = 1, ···, N. It is usual and natural to assume each function fs to be smooth, so that it can be approximated by polynomial functions of suitable degree.

A natural and heuristic way to study the clustering problem comes from the classification perspective: MacQueen (1967) and Hartigan and Wong (1978) introduced the famous K-means algorithm to partition data into groups. Each curve is first viewed as a discretely observed vector. Then the algorithm, which aims to minimize the within-group sum of squares, is applied directly to the collection of vectors, and the cluster memberships are determined as a result of the classification. Overall, the entire process is straightforward and easy to implement. But there are also clear disadvantages to such an approach. For example, it requires the number of groups or clusters to be specified at the beginning and fixed throughout the procedure, although this number is unknown in most real studies. Moreover, this clustering method treats each individual curve as an independent outcome and thus overlooks potential correlations or interactions between curves, which may not be a negligible factor when recovering the true clusters.

Later, more sophisticated model-based methods were developed to address these issues. James and Sugar (2003) proposed a mixed effects model which is more powerful for clustering sparsely observed functional data. In their approach, B-splines are used to capture individual curves, and random spline coefficients with a normal mixture distribution are used to cluster them. Their method handles missing values well and provides reasonable estimates and confidence intervals for them. However, as with the K-means algorithm, the number of clusters in the model also has to be pre-determined by independent criteria. Meanwhile, the knot points are the same across all clusters, which reduces the flexibility of the model as well as its ability to handle a group of curves with varying connecting points. Abraham et al. (2003) combined a B-spline basis with the K-means procedure on fixed effects to study this problem. Some other model-based approaches include Banfield and Raftery (1993), Serban and Wasserman (2005), Heard et al. (2006) and Ray and Mallick (2006).

We develop our model as a tool for classification by applying a classical nonparametric Bayesian approach, the Dirichlet process (DP) methodology [7], in an innovative way to cluster the trends of different curves represented through polynomial components.
Our method will utilize this representation of smooth curves to extract relevant information such as connection points and fluctuation. In general, our proposed approach does not require an initial estimate of the number of clusters, unlike most of the other methods mentioned in the previous paragraph. Instead, it provides a reasonable estimate along with an empirical distribution. It also allows curves to be characterized by different numbers and locations of knot points, which gives it more capability to fit individual curves well. Furthermore, it performs the clustering based on a good summary of the shape and trend of the curves, and thus demonstrates the similarity of patterns between curves within the same cluster.

The rest of this chapter is organized as follows. We present our functional spline mixed model and prior specifications for Bayesian inference in section 3.2. Some prior components are taken to be improper, and therefore propriety of the posterior becomes a concern. Section 3.3 establishes the propriety of the posterior distribution under some mild conditions. Section 3.4 provides details of the Bayesian inference, in particular, performance measures for comparing clustering configurations. At the same time, we also explore a new, less informative choice of prior on the concentration parameter. Section 3.5 presents results from numerical studies including simulated data as well as the real lung cancer data. Finally, we summarize our work and discuss the proposed model and future research.

3.2 Spline Mixed Model for Clustering Multiple Curves

3.2.1 Model Specification

Suppose that we observe functions {fs(t)}_{s=1}^N at t = t1, t2, ···, tn. Let ξ = (ξ1, ···, ξK) be the knot points for the spline basis functions, where K is the number of knots, ξ1 > t1 and ξK < tn. Let ξ0 = t1 and ξ_{K+1} = tn. The spline basis functions of order q are then given by

    hm(x) = x^{m−1} for m = 1, ···, q   and   h_{q+l}(x) = (x − ξl)_+^{q−1} for l = 1, ···, K.

In practice, the most widely used orders are q = 1, 2, 4, where q = 1 corresponds to a constant function, q = 2 corresponds to a linear function and q = 4 corresponds to a cubic spline function. Then, we assume fs(t) can be approximated, for s = 1, 2, ···, N, as

    fs(t) ≈ Σ_{m=1}^{q+Ks} βm^(s) hm^(s)(t),

where hm^(s)(t) depends on the s-th individual through its time knots ξ^(s) = (ξ1^(s), ···, ξ_{Ks}^(s)).

Let Yst be the observed value of fs(t) at time t, for t ∈ {t1, ···, tn}. Then, the observed values are modeled using the spline representation as

    Yst = Σ_{m=1}^{q+Ks} βm^(s) hm^(s)(t) + e_{st},

where e_{st} is an error that represents variability from various sources. We decompose the error as e_{st} = us + ε_{st}, and consider us as a random effect capturing variability between curves and ε_{st} as an error reflecting variability across time for the s-th function. Thus the model can be written as

    Yst = Σ_{m=1}^{q+Ks} βm^(s) hm^(s)(t) + us + ε_{st}

with us ∼ N(0, τ²) and ε_{st} ∼ N(0, σs²). Let Y_s = (Y_{st1}, ···, Y_{stn})^T, Y = (Y_1^T, ···, Y_N^T)^T, ε_s = (ε_{st1}, ···, ε_{stn})^T, ε = (ε_1^T, ···, ε_N^T)^T and u = (u1, ···, uN)^T. Then, we can write the model in matrix form for the s-th individual as

    Y_s = X^(s) β^(s) + us 1_n + ε_s,

where X^(s) = (x_{ij})_{n×(q+Ks)} with x_{ij} = hj^(s)(ti) for i = 1, ···, n and j = 1, ···, q + Ks, and β^(s) = (β1^(s), ···, β_{q+Ks}^(s))^T. For all s = 1, ···, N, we have

    Y = Xβ + u ⊗ 1_n + ε,                                                   (3.1)

with X = diag(X^(1), ···, X^(N)) and β = (β^(1)T, ···, β^(N)T)^T, where ⊗ denotes the Kronecker product.
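As a concrete illustration of this specification, the following minimal sketch builds the order-q truncated power basis X^(s) for one site and simulates Y_s from model (3.1); the knot locations, coefficients and variance values below are illustrative placeholders, not values used later in the chapter.

```python
# A minimal sketch of model (3.1): build the order-q truncated power basis for
# one site and simulate Y_s = X^(s) beta^(s) + u_s 1_n + eps_s.
# All numerical values are illustrative, not the dissertation's settings.
import numpy as np

def spline_basis(t, knots, q=2):
    """Columns h_1,...,h_q are t^{m-1}; h_{q+l} = (t - xi_l)_+^{q-1}."""
    cols = [t ** m for m in range(q)]
    cols += [np.clip(t - xi, 0.0, None) ** (q - 1) for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n, q = 38, 2
t = np.arange(1, n + 1, dtype=float)
knots = np.array([8.0, 30.0])                  # xi_1, xi_2 for this site (illustrative)
X_s = spline_basis(t, knots, q)                # n x (q + K_s) design matrix
beta_s = np.array([0.5, 0.05, 0.10, -0.20])    # (beta_1, ..., beta_{q+K_s}), illustrative
tau2, sigma2_s = 1 / 30, 1 / 30
u_s = rng.normal(scale=np.sqrt(tau2))          # site-level random effect
Y_s = X_s @ beta_s + u_s + rng.normal(scale=np.sqrt(sigma2_s), size=n)
```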
3.2.2 Derivation of Functional Dirichlet Process and Other Prior

We are interested in clustering functions based on their shapes. Clustering based on the coefficients of basis functions has been proposed by other authors, in particular using the coefficients of spline basis functions, as in [1] and [16]. However, in those approaches the locations and the number of knots are pre-determined before clustering is considered. Different groups of functions may need different numbers of knots and different knot locations to capture their shapes, so models should take such flexibility into account. To model the clustering of N sites with respect to their knot-point locations and corresponding magnitudes, we apply Dirichlet process (DP) priors to the functions fs(t) with an appropriate centering measure, so that the number of knots and the locations of the knots are random and are estimated during the inferential procedure.

To utilize the DP prior, we consider a function θs(t) = fs(t) and let Θ be the space of all possible θs(t), defined below. A functional DP on the space of all distributions on Θ has two hyper-parameters that characterize it, DP ≡ DP(α0 G0): α0 is called the precision parameter and G0 is called the baseline (or centering) distribution on Θ. It is well known that a randomly generated distribution F from DP(α0 G0) is almost surely discrete and admits the representation [18]

    F = Σ_{i=1}^∞ ωi δ_{θi},                                                  (3.2)

where δz denotes a point mass at z, ω1 = η1 and ωi = ηi ∏_{k=1}^{i−1} (1 − ηk) for i = 2, 3, ···, with the ηi i.i.d. Beta(1, α0) and θ1, θ2, ··· i.i.d. from the distribution G0.

To specify G0 on Θ, it is convenient to consider a hierarchical structure. The randomness of G0 is specified by assuming randomness on K, the number of knots, ξ, the locations of the knots, β, the spline coefficients, and u, the random effect component.

(1) First, we specify the distribution of K. We assume that each time interval between knot points has at least w units and consider a truncated Poisson distribution for K. That is,

    p(k) = P(K = k) = (e^{−λ} λ^k / k!) / (Σ_{l=0}^{k*} e^{−λ} λ^l / l!) = (λ^k / k!) / (Σ_{l=0}^{k*} λ^l / l!)

for k = 0, 1, ···, k*, where k* = [(n − 1)/w − 1] ensures that n0 ≡ n − 1 − (K + 1)w > 0.

(2) Given K = k, we assume that

    (n1, n2, ···, n_{k+1}) ∼ Multinomial(n0; 1/(k+1), ···, 1/(k+1)),

where nl ≥ 0 and Σ_{l=1}^{k+1} nl = n0. Then nl + w becomes the length of the time interval between knot points for l = 1, ···, k + 1. That is, ξl is defined recursively as ξ0 = 1 and ξl = nl + ξ_{l−1} + w for l = 1, ···, k.

(3) Given K = k and ξ = (ξ1, ···, ξk), generate β1, ···, β_{q+k} i.i.d. from a density π0 on R¹.

By steps (1)-(3), we set θ(t) = f(t) = Σ_{m=1}^{q+k} βm hm(t) and θ = (f(t1), ···, f(tn))^T = Xβ. From the hierarchical specification above, it follows that the infinitesimal measure for G0 is given by

    G0(dθ) = p(k) [Γ(n0 + 1) / ∏_{l=1}^{k+1} Γ(nl + 1)] (1/(k+1))^{n0} ∏_{m=1}^{q+k} π0(βm) dβm.          (3.3)

In our analysis, we take π0(β) ∝ 1.

Finally, to complete the hierarchical model, we specify priors on α0, λ, τ² and σ² = (σ1², ···, σN²): π(λ) = Ga(aλ, bλ), π(τ²) = IG(aτ, bτ) and π(σs²) = IG(aσ, bσ), where Ga(a, b) denotes the Gamma distribution with density function π(x) ∝ x^{a−1} e^{−x/b} and IG(a, b) has already been defined in Chapter 1. For the precision parameter α0, we also choose a Ga(aα, bα) prior in order to guarantee a simple form for the posterior sampling distributions.
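For intuition, the following hedged sketch draws atoms from the centering measure G0 through steps (1)-(3) and forms a truncated stick-breaking approximation of F ∼ DP(α0 G0) as in (3.2). Because π0 is the improper flat prior, the coefficient draw below substitutes a diffuse normal purely for illustration, and the values of n, w, q, λ and α0 are placeholders.

```python
# A hedged sketch: one draw from the centering measure G_0 (steps (1)-(3)) and a
# truncated stick-breaking approximation of F ~ DP(alpha_0 G_0) as in (3.2).
# The diffuse N(0, 100) coefficient draw stands in for the flat pi_0.
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
n, w, q, lam, alpha_0 = 38, 7, 2, 1.0, 0.5     # illustrative values only

def draw_theta():
    k_star = int((n - 1) / w - 1)
    probs = np.array([lam ** k / factorial(k) for k in range(k_star + 1)])
    k = rng.choice(k_star + 1, p=probs / probs.sum())          # truncated Poisson K
    n0 = n - 1 - (k + 1) * w
    gaps = rng.multinomial(n0, np.full(k + 1, 1.0 / (k + 1)))  # (n_1,...,n_{k+1})
    knots = np.cumsum(gaps[:k] + w) + 1                        # xi_l = n_l + xi_{l-1} + w
    beta = rng.normal(scale=10.0, size=q + k)                  # stand-in for flat pi_0
    return k, knots, beta

def stick_breaking(num_atoms=50):
    eta = rng.beta(1.0, alpha_0, size=num_atoms)               # Sethuraman construction
    weights = eta * np.concatenate([[1.0], np.cumprod(1 - eta[:-1])])
    return weights, [draw_theta() for _ in range(num_atoms)]

weights, atoms = stick_breaking()
print("largest stick-breaking weight:", weights.max())
print("first atom: K =", atoms[0][0], ", knots =", atoms[0][1])
```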
However, as we show in section 3.5 when performing the numerical analysis, such a prior specification may be subjective under some circumstances. Thus we will also discuss an alternative choice of prior distribution for α0. In general, the hyper-parameters of the Gamma and inverse Gamma distributions are chosen to have large variance so that the impact of the prior input is minimal. The specific choices of the hyper-parameters for the various Gamma and inverse Gamma distributions are given in section 3.5.

As mentioned in section 3.1, it is notable that Dass et al. (2014) proposed a mixed effects model with a similar DP prior to identify change-points of multiple curves. Although both our model and theirs feature a nonparametric Bayesian approach, we feel it is worthwhile to point out the major differences between the models and their objectives. In the approach of Dass et al. (2014), piecewise disconnected linear segments, appearing in the model as a set of slopes and intercepts, were used to model the change-points of functional data. Such a design is not intended for clustering. Our model, on the other hand, features a B-spline basis designed to capture and cluster the shape and trend of continuous curves. Even if the order of the B-splines is chosen to be 2, which corresponds to connected linear functions, the interpretation of the coefficients and other estimated parameters will clearly differ because of the difference in the objective and nature of the two models.

3.3 Theory of Bayesian Inference

This section presents some fundamental theory of the nonparametric Bayesian inference for our clustering spline mixed model, including a validation of the propriety of the posterior distribution and the Gibbs sampling scheme for generating posterior samples.

3.3.1 Propriety of the Posterior Distribution

Let S ≡ {1, 2, ···, N} be the set of all N sites. Denote a partition of S by c = ∪_{r=1}^d Cr, and define P to be the set of all partitions of S. We consider probability distributions π(c) on P:

    π(c) = [Γ(α0) / Γ(α0 + N)] α0^d ∏_{r=1}^d (|Cr| − 1)!,                        (3.4)

where |Cr| is the number of elements in Cr.

Theorem 4. Let σ² = (σ1², σ2², ···, σN²), let θ = (θ1, θ2, ···, θN) be the collection of functions on the N sites with θs = (β^(s), Ks, Ts), and let φ = (u, τ²) with support Φ = R^N × (0, ∞). Given model (3.1) and the specification of priors in section 3.2.2, if Ks < min{aσ + (n − 1)/2, [(n − 1)/w]} for all s and some integer w > 0, then the posterior distribution is proper.

The proof of the theorem is provided in the Appendix.

3.3.2 Gibbs Updating Steps

Let S ≡ {1, 2, ···, N} be the set of all N individuals. Denote a partition of S by c = ∪_{r=1}^d Cr, and define P to be the set of all partitions of S. Note that a randomly generated distribution F from DP(α0 G0) admits the Sethuraman representation (Sethuraman, 1994). By integrating out the random measure F, the equivalence between the DP prior (for any general space Θ) and the distribution it induces on P is well known. Moreover, the probability distribution on P has the explicit form

    π(c) = [Γ(α0) / Γ(α0 + N)] α0^d ∏_{r=1}^d (|Cr| − 1)!,                        (3.5)

where |Cr| is the number of elements in Cr. Define θ = (θ1, ···, θN) and θ_{−s} = (θ1, θ2, ···, θ_{s−1}, θ_{s+1}, ···, θN) to be the collection of all θ-components except θs. Let c_{−s} be the partition of S \ {s} determined by θ_{−s}; the identical components of θ_{−s} uniquely determine the partition c_{−s}. Suppose that c_{−s} = ∪_{r=1}^{N*} Er.
The conditional distribution π(c | c_{−s}) based on (3.5) is π(c | c_{−s}) = α0/(α0 + N − 1) if c = {s} ∪ c_{−s}, and π(c | c_{−s}) = Nr/(α0 + N − 1) if c = (E1, ···, Er ∪ {s}, ···, E_{N*}), where Nr denotes the number of elements in Er and c_{−s} = ∪_{j=1}^{N*} Ej.

(1) Update θs: The posterior of θs conditional on θ_{−s} is given by

    π(θs | θ_{−s}, Y, u) = [qs,0 G0*(dθs) + Σ_{r=1}^{N*} qs,r δ_{θ^(r)}] / [qs,0 + Σ_{r=1}^{N*} qs,r],          (3.6)

where

    qs,0 = ∫_Θ π(Y_s | θs, us) G0(dθs) · π(c | c_{−s})   and   qs,r = π(Y_s | θ^(r), us) · π(c | c_{−s}),          (3.7)

respectively, for c = {s} ∪ c_{−s} and c = (E1, E2, ···, Er ∪ {s}, ···, E_{N*}), r = 1, 2, ···, N*. Here

    G0*(dθs) = π(Y_s | θs, us) G0(dθs) / ∫_Θ π(Y_s | θs, us) G0(dθs)                        (3.8)

is the posterior distribution of θs given that a new cluster is formed by the s-th individual. The explicit expression for qs,0, after canceling the factor 1/(α0 + N − 1), is

    qs,0 = α0 Σ_{k=0}^{k*} Σ_{(n1,···,n_{k+1})} [n0! / (n1! ··· n_{k+1}!)] exp[H(n1, ···, n_{k+1})] (1/(k+1))^{n0} p(k),

where

    H(n1, ···, n_{k+1}) = log ∫_{R^{q+Ks}} ∫_0^∞ π(Y_s | β^(s), us, σs²) π(σs²) dσs² dβ^(s)
                       = −[(n − q − Ks)/2] log(2π) − log Γ(aσ) − aσ log bσ + log Γ((2aσ + n − q − Ks)/2)
                         − [(2aσ + n − q − Ks)/2] log(bσ^{−1} + (1/2) Y_{s*}^T P^(s) Y_{s*}) − (1/2) log |X^{(s)T} X^{(s)}|,

with Y_{s*} = Y_s − us 1_n and P^(s) = I − X^(s) (X^{(s)T} X^{(s)})^{−1} X^{(s)T}. Similarly, qs,r is given by

    qs,r = Nr [Γ(aσ + n/2) / (Γ(aσ) bσ^{aσ} (2π)^{n/2})] (b*)^{aσ + n/2},

where (b*)^{−1} = bσ^{−1} + 0.5 (Y_{s*} − X^r β^r)^T (Y_{s*} − X^r β^r), and X^r β^r ≡ θ^(r) for the cluster Cr. Note that θs ≡ θ^(r) if s ∈ Cr.

Expression (3.6) explicitly demonstrates the clustering capability of the functional DP prior. The current value of θs can be set equal to one of the distinct θ^(r) functions with probability Σ_{r=1}^{N*} qs,r / (qs,0 + Σ_{r=1}^{N*} qs,r); this positive probability is the reason for the possible clustering of sites in terms of θs, as mentioned. Expression (3.6) also allows a new θs to be generated from the posterior distribution G0*.

(2) Update σ: The update of σs² is carried out once θs is obtained via (3.6). Regardless of whether θs is a new value or an existing θ^(r), for each individual s = 1, ···, N the conditional posterior distribution of σs² given the other parameters is

    π(σs² | ···) = IG(a, b),

with a = aσ + n/2 and b^{−1} = bσ^{−1} + 0.5 (Y_{s*} − X^(s) β^(s))^T (Y_{s*} − X^(s) β^(s)).

Note that θ uniquely determines the collection of parameters (β, K, T), where β is the set of spline coefficients, K = (K1, ···, KN) is the set of numbers of knots and T = (ξ^(1), ···, ξ^(N)) is the set of knot locations. Indeed, since θ contains several identical components, it follows that the corresponding components of (β, K, T) are also identical to each other. In what follows, we present the updating steps for the d distinct components of (β, K, T), namely (β^r, Kr, T^r) for r = 1, 2, ···, d. Let ∪_{r=1}^d Cr be the partition of {1, 2, ···, N} at the current update of the Gibbs sampler (thus, d is the number of distinct clusters).

(3) Update (β^r, Kr, T^r): We first update Kr from the posterior marginal of Kr, then update T^r | Kr, and finally β^r | T^r, Kr from their respective conditional distributions.
The posterior marginal probability of Kr = k is proportional to

    p(k) Σ_{(n1,···,n_{k+1})} ν(n1, ···, n_{k+1}),                                            (3.9)

with

    ν(n1, ···, n_{k+1}) = exp{H̃(n1, ···, n_{k+1})} [Γ(n0 + 1) / ∏_{l=1}^{k+1} Γ(nl + 1)] (1/(k+1))^{n0},          (3.10)

where

    H̃(n1, ···, n_{k+1}) = log ∫_{R^{q+k}} ∏_{s∈Cr} π(Y_s | X^(s), β, σs², us) dβ
                        = −[(n|Cr| − q − k)/2] log(2π) − (n/2) Σ_{s∈Cr} log σs² − (1/2) log |Σ_{s∈Cr} σs^{−2} X^{(s)T} X^{(s)}|
                          − (1/2) Σ_{s∈Cr} σs^{−2} Y_{s*}^T Y_{s*}
                          + (1/2) (Σ_{s∈Cr} σs^{−2} X^{(s)T} Y_{s*})^T (Σ_{s∈Cr} σs^{−2} X^{(s)T} X^{(s)})^{−1} (Σ_{s∈Cr} σs^{−2} X^{(s)T} Y_{s*}).

Note that X^(s) ≡ X^r for all s ∈ Cr. The summation in (3.9) is over all non-negative integers n1, n2, ···, n_{k+1} such that Σ_{l=1}^{k+1} nl = n0 ≡ n − 1 − (k + 1)w. Obtaining the posterior probability of Kr = k requires evaluation of (3.10) for each value of k ≥ 0. This could potentially require a significant amount of computational time and drastically reduce the efficiency of the Gibbs chain, but this did not occur in our application because of the closed-form expressions of H̃ under the flat prior π0.

To update T^r given Kr = k, note that this is equivalent to updating (n1, ···, n_{k+1}) with probabilities p(n1, ···, n_{k+1}) ∝ ν(n1, n2, ···, n_{k+1}). This is carried out by exhaustively listing all such combinations and numerically computing the corresponding probabilities. The update of β^r given T^r and Kr = k is based on the conditional distribution N(µ_{β^r}, Σ_{β^r}) with

    Σ_{β^r} = (Σ_{s∈Cr} σs^{−2} X^{(s)T} X^{(s)})^{−1} = (X^{rT} X^r Σ_{s∈Cr} σs^{−2})^{−1}                        (3.11)

and

    µ_{β^r} = (X^{rT} X^r Σ_{s∈Cr} σs^{−2})^{−1} X^{rT} (Σ_{s∈Cr} σs^{−2} Y_{s*}) = Σ_{β^r} X^{rT} (Σ_{s∈Cr} σs^{−2} Y_{s*}),          (3.12)

with the q + k components of β^r generated from their respective component densities.

(4) Update λ: λ is updated using

    π(λ | ···) ∝ π3(λ) ∏_{s=1}^N p(Ks) ∝ λ^{a*λ − 1} e^{−λ/bλ} / (Σ_{l=0}^{k*} λ^l / l!)^N,                        (3.13)

with a*λ = aλ + Σ_{r=1}^d |Cr| kr, where kr is the number of knot-points corresponding to θ^(r) in cluster Cr.

(5) Update α0: For updating α0 with π2(α0) ∝ α0^{aα − 1} e^{−α0/bα}, we utilize the two-step procedure of Escobar and West [6]. At the b-th iteration: (1) draw κ from the Beta distribution Be(α0^{(b−1)} + 1, N); and (2) draw α0^{(b)} from the mixture of two Gamma distributions

    πκ Ga(aα + N*, (1/bα − log(κ))^{−1}) + (1 − πκ) Ga(aα + N* − 1, (1/bα − log(κ))^{−1}),

where N* is the latest number of clusters and the membership probability is πκ = (aα + N* − 1) / [N (1/bα − log(κ))].

(6) Update u: We have

    π(u | ···) ∝ π(Y | u, ···) π(u | τ²)
              ∝ exp{ −(1/2) [ Σ_{s=1}^N (us² 1_n^T 1_n − 2 us 1_n^T W_s)/σs² + (u − u0)^T (u − u0)/τ² ] },

where W_s = Y_s − X^(s) β^(s). Notice that 1_n^T 1_n = n. The conditional posterior distribution of u is N(µ_u, Σ_u) with Σ_u = [n Diag(σ1^{−2}, ···, σN^{−2}) + τ^{−2} I_N]^{−1} and µ_u = Σ_u (W + τ^{−2} I_N u0), where W = (1_n^T W_1/σ1², ···, 1_n^T W_N/σN²)^T.

(7) Update τ²: The posterior distribution of τ² is given by

    π(τ² | ···) ∝ π(u | τ²) π(τ²) ∼ IG(a*τ, b*τ),

with a*τ = N/2 + aτ and b*τ = (1/bτ + (1/2) u^T u)^{−1}.

3.4 Bayesian Inference: A Measure for Comparing a Pair of Clustering of Sites

Given the validity and details of the Gibbs updating scheme for all the parameters based on the elicited improper prior components from section 3.2.2, we are able to perform Bayesian inference on their posterior samples, especially the estimated cluster information θ = (θ1, ···, θN). Suppose that we have M Gibbs samples for θ, {θ^(m)}_{m=1}^M, after convergence is established. It may be challenging to obtain results for the clustering configuration directly from these samples: since such information (i.e.
the number of clusters and the specific assignment of each site) changes at each Gibbs iteration, classical posterior analysis based on collecting summary statistics for the clustering information is not straightforward and is not easy to interpret.

We introduce a 2 × 2 cross-classification table to measure the deviation between two cluster configurations, say C and C*. The table has entries {nij}, 1 ≤ i, j ≤ 2, where n11 is the number of pairs of sites, out of all n++ = N(N − 1)/2 pairs, that are in the same cluster in C as well as in C*, while n22 is the number of pairs of sites that are in the same cluster in neither C nor C*. n12 is the number of pairs that are in the same cluster in C but not in C*. Similarly, n21 is the number of pairs that are not in the same cluster in C but are in the same cluster in C*. If the two clusterings C and C* have identical configurations, we have n11 = n+1 and n22 = n+2. The interpretation of n+1 is that it is the number of pairs that are in the same cluster in C*, whereas n+2 is the number of pairs that are not in the same cluster in C*. Thus, the quantities n+1 and n+2 depend only on C* and not on C.

Then, we consider two measures, the sensitivity S1 = n11/n+1 and the specificity S2 = n22/n+2. The interpretation of S1 is that it is the proportion of pairs that are also in the same cluster in C given that they are in the same cluster in C*. Similarly, S2 is the proportion of pairs that are also not in the same cluster in C given that they are not in the same cluster in C*. It is clear that S1 and S2 take values between 0 and 1, with the ideal case (S1, S2) = (1, 1). Also note that when C ⊂ C*, that is, when every cluster of C is contained in some cluster of C*, the number of clusters in C is at least as large as that of C*; in this case n12 = 0, which yields S2 = 1 but S1 ≤ 1. On the other hand, when C* ⊂ C, we have S1 = 1 and S2 ≤ 1. Thus, deviations from the point (1, 1), or from the lines y = 1 and x = 1, give an idea about the nature of the deviations of the clustering C from the clustering C*.

For simulation studies, (S1, S2) can be used to measure deviations between a Gibbs cluster configuration and the true one. Suppose that from the Gibbs output we have cluster configurations {Cm}_{m=1}^M. For the m-th Gibbs cluster configuration, we can calculate (S1(m), S2(m)) with C = Cm and C* = C0 specified by the true cluster information. A plot of (S1(m), S2(m)) will indicate how dispersed the current Gibbs cluster is with respect to the true configuration: if (S1(m), S2(m)) is concentrated close to (1, 1), it indicates that the deviation of Cm from C0 is small. Points (S1(m), S2(m)) far from (1, 1) indicate two different types of deviations, depending on their proximity to the lines x = 1 or y = 1. Proximity to the line x = 1 indicates that among the pairs that are in the same cluster in C0, more pairs are also in the same cluster in Cm, whereas proximity to y = 1 indicates that among the pairs not in the same cluster in C0, more pairs are also not in the same cluster in Cm. For simulation studies, C0 can be taken to be the true cluster configuration since the true clustering is known.

In the 2 × 2 cross-classification table, there are two degrees of freedom. The quantities n+1 and n+2 are fixed for the true cluster configuration, so we need two free parameters, say n11 and n22, to determine all entries in the table completely.
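A minimal sketch of these quantities, computed directly from the pair counts, follows; the helper names and the small example partitions are ours, not part of the dissertation's implementation.

```python
# A minimal sketch of the 2x2 cross-classification measures: given a sampled
# partition C and a reference partition C*, compute S1 = n11/n+1, S2 = n22/n+2,
# and the summary SS = mean(2 - S1 - S2) over Gibbs draws.
import numpy as np
from itertools import combinations

def pair_scores(labels, labels_ref):
    n11 = n12 = n21 = n22 = 0
    for i, j in combinations(range(len(labels)), 2):
        in_c = labels[i] == labels[j]
        in_ref = labels_ref[i] == labels_ref[j]
        n11 += in_c and in_ref
        n12 += in_c and not in_ref
        n21 += (not in_c) and in_ref
        n22 += (not in_c) and not in_ref
    S1 = n11 / (n11 + n21)      # pairs together in C* that are also together in C
    S2 = n22 / (n12 + n22)      # pairs apart in C* that are also apart in C
    return S1, S2

# Toy example: SS over M Gibbs configurations against the true clustering C0.
C0 = np.array([0, 0, 1, 1, 2, 2])
gibbs = [np.array([0, 0, 1, 1, 2, 2]), np.array([0, 0, 0, 1, 2, 2])]
scores = [pair_scores(c, C0) for c in gibbs]
SS = np.mean([2 - s1 - s2 for s1, s2 in scores])
print(scores, SS)
```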
The scatter plot of {(S1(m), S2(m)) : m = 1, 2, ···, M} therefore gives a complete picture of the variability. As a numerical measure of variability, we consider

    SS = (1/M) Σ_{m=1}^M (2 − S1(m) − S2(m)),

with a smaller SS value indicating better performance.

3.5 Numerical Analysis: Simulation Study and Real Data Example

In this section, we apply the proposed method to numerical studies, including simulations under various settings as well as a real data analysis of the lung cancer data in the U.S. We simulate data at n = 38 time points for N = 49 sites based on the model. We partition the sites into 6 groups (clusters) Cr, r = 1, 2, ···, 6. Within each cluster, the curves share common information in the form of (β^(r), Kr, T^(r)). As stated in the previous section, we collect the overall posterior samples via an MCMC procedure with a Gibbs sampling scheme at each iteration. The true parameters are specified as follows.

r   |Cr|   Cr (Site #)                                  T^(r)         β^(r)
1   9      2, 4, 11, 25, 27, 36, 43, 46, 49             (8, 30)       (0.0667, 0.0500, −0.2000)
2   9      12, 13, 14, 21, 22, 26, 33, 40, 48           (7, 11, 20)   (−0.0667, 0.0300, 0.1667, −0.3333)
3   8      3, 5, 15, 17, 24, 30, 35, 42                 (13, 25)      (−0.1333, −0.1333, 0.1667)
4   7      1, 9, 10, 23, 32, 39, 41                     (17, 21)      (−0.0333, −0.0167, 0.1667)
5   7      6, 18, 20, 28, 31, 38, 44                    (18, 20)      (−0.0667, 0.0200, −0.1333)
6   9      7, 8, 16, 19, 29, 34, 37, 45, 47             (18, 20)      (0.0167, 0.0167, 0.3333)

As for the true values of the variability parameter between the curves, τ², and within the curves, σs², we consider the following settings: τ² = 1/30 or 2/30, and σs² ∼ U(1/30, 2/30) or U(2/30, 4/30). These specifications lead to 4 scenarios.

3.5.1 An Alternative Less-informative Prior Choice for α0

Recall that when specifying the priors for the parameters of our model in section 3.2.2, we raised a question about the choice of prior for the precision parameter α0, which directly controls the ability to create a new cluster, that is, to increase the number of clusters. We observed that under the Gamma prior for the precision parameter α0, the corresponding posterior distribution is obtained from a mixture of two Gamma distributions [6]. However, in practice, to guarantee a reasonable extent of splitting of existing clusters, we may need a very small value of α0, which leads to a highly subjective Gamma prior for α0. For instance, in our simulation, we require α0 ∼ Ga(aα, bα) with E(α0) = aα bα ≈ 10⁻⁸ and Var(α0) ≈ 10², and thus an extremely small aα and a large bα. To resolve this issue, we introduce another prior, α0 ∼ LN(µα, σα²), where LN(µ, σ²) denotes the log-Normal distribution with corresponding mean parameter µ and variance parameter σ². The intuition behind this re-parameterization is that we believe it is more appropriate to put a relatively noninformative Gaussian prior on log(α0): log(α0) ∼ N(µα, σα²). For our simulation we use µα = −40 and σα = 7. Note that the mean and variance of α0 are then E(α0) = e^{µα + σα²/2} ≈ 1.86 × 10⁻⁷ and Var(α0) = (e^{σα²} − 1) e^{2µα + σα²} ≈ 6.57 × 10⁷. Hence we are able to claim that the log-Normal type prior for α0 is much less subjective than the original Gamma prior. To complete the posterior sampling scheme for π(α0) = LN(µα, σα²), we simply apply a grid search over (µα − 3σα, µα + 3σα), which covers over 99% of the range of α0.
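For completeness, a hedged sketch of this grid-search update follows: by (3.4), the full conditional of α0 given a partition with d clusters is proportional to the log-Normal prior times α0^d Γ(α0)/Γ(α0 + N), which is evaluated on an equally spaced grid of log(α0) over (µα − 3σα, µα + 3σα). The grid size below is an arbitrary illustrative choice.

```python
# A hedged sketch of the grid-search update for alpha_0 under the log-Normal
# prior.  The full conditional is proportional to
#   LN(alpha_0; mu_a, sigma_a^2) * alpha_0^d * Gamma(alpha_0) / Gamma(alpha_0 + N),
# which follows from the partition prior (3.4).
import numpy as np
from scipy.special import gammaln

def sample_alpha0(d, N, mu_a=-40.0, sigma_a=7.0, grid_size=400,
                  rng=np.random.default_rng(0)):
    log_a = np.linspace(mu_a - 3 * sigma_a, mu_a + 3 * sigma_a, grid_size)
    a = np.exp(log_a)
    # log density of log(alpha_0) ~ N(mu_a, sigma_a^2) plus the alpha_0 terms of (3.4)
    logp = (-0.5 * ((log_a - mu_a) / sigma_a) ** 2
            + d * log_a + gammaln(a) - gammaln(a + N))
    p = np.exp(logp - logp.max())
    return rng.choice(a, p=p / p.sum())

print(sample_alpha0(d=6, N=49))
```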
3.5.2 Simulations and Comparison with an Existing Approach

We generate 10 sets of data under each scenario specified at the beginning of section 3.5 and run the MCMC for 5000 iterations with two chains to guarantee convergence. For comparison, we also apply the functional clustering methodology proposed by James and Sugar (2003) to the same simulated datasets. The results are listed in Tables 3.1 and 3.2.

τ²     σs²             S1 (Sensitivity)   S2 (Specificity)   SS (Overall)   Med. #   Mode #
1/30   U(1/30, 2/30)   0.8813             0.9844             0.1343         7        6
1/30   U(2/30, 4/30)   0.6466             0.9827             0.3707         6.5      6
2/30   U(1/30, 2/30)   0.8898             0.9843             0.1260         6        6
2/30   U(2/30, 4/30)   0.6443             0.9827             0.3730         6.5      6

Table 3.1: Cluster Configuration Diagnostics for Simulations: Bayesian DP Approach

τ²     σs²             S1 (Sensitivity)   S2 (Specificity)   SS (Overall)   True #
1/30   U(1/30, 2/30)   0.7107             0.9066             0.3827         6
1/30   U(2/30, 4/30)   0.7017             0.9028             0.3955         6
2/30   U(1/30, 2/30)   0.6792             0.9137             0.4071         6
2/30   U(2/30, 4/30)   0.6978             0.9152             0.3870         6

Table 3.2: Cluster Configuration Diagnostics for Simulations: James and Sugar's Approach

From the results for our approach in Table 3.1, we obtain excellent performance in the specificity (S2) as well as in the median and mode of the estimated number of clusters. Meanwhile, the sensitivity (S1) is mainly affected by a relatively large σs², the variability of the individual curves within a cluster. On the other hand, since James and Sugar's model requires a pre-specified number of clusters, as noted in section 3.1, we simply plug in the true number of clusters, 6, when applying their model, to avoid an extra determination step. We also point out that the diagnostic measures (sensitivity, specificity, etc.) are calculated from the predicted cluster memberships in their case, rather than by averaging over the posterior samples as in our case. It is clear from the comparison that our method leads to better specificity (S2) in general and higher sensitivity (S1) for smaller σs². However, it appears that their approach is slightly more robust under large individual variability.

3.5.3 A Real Data Example: Lung Cancer Rates in the U.S.

Besides the simulation study above, we apply our proposed model to the lung cancer mortality rate curves from 48 states and Washington, D.C. In the data analysis, the minimum number of time points in each time interval is set to w = 7, which leads to a maximum number of knot-points of k* = 4. All the hyper-parameters are then specified as follows. The shape and scale parameters of the inverse Gamma priors on σs² and τ² are aσ = aτ = 2 and bσ = bτ = 100, respectively, so that the prior variances are large enough to be considered infinite. The shape and scale parameters of the Gamma prior on λ are aλ = 0.004 and bλ = 500, respectively. For the prior on the precision parameter α0, as discussed in section 3.5.1, we use the log-Normal distribution with µα = −100 and σα = 11. Recall that the precision parameter α0 controls the number of clusters while λ controls the number of knot-points.

We run two chains of the MCMC algorithm with Gibbs sampling for 20000 iterations each. Convergence of the chains is attained after 15000 iterations according to the Gelman-Rubin diagnostics introduced in Chapter 1, section 1.4.1. We collect the posterior sample by choosing every fifth sample from the last 5000 iterations of both chains. This sampling procedure provides a better 'mixed' posterior sample with a total size of 2000.
The posterior probability of the number of clusters of states is shown in Table 3.3.

P(d = 2)   P(d = 3)   P(d = 4)   P(d = 5)   P(d = 6)   P(d = 7)   P(d = 8)
0.093      0.217      0.298      0.236      0.110      0.038      0.008

Table 3.3: The Distribution of the Number of Clusters d

It is clear that d = 4 gives the estimated number of clusters with the highest posterior probability. To provide a cluster configuration of all the states, we apply an agglomerative clustering algorithm with the dissimilarity distance defined as

    dist(s1, s2) = 1 − Σ_{m=1}^M dist_m(s1, s2)/M,

where dist_m(s1, s2) = 1 if s1 and s2 fall into the same cluster in the m-th iteration and dist_m(s1, s2) = 0 otherwise. It is natural to specify the threshold for the maximum number of clusters in the algorithm as d, the posterior mode of the number of clusters. The main algorithm can be summarized as follows.

1. Create an individual cluster for each site.
2. Among all current clusters, select the two clusters whose elements have the smallest dissimilarity distance.
3. If the two selected clusters from the previous step are distinct, merge them into a new cluster and replace the two old clusters.
4. Repeat steps 2 and 3 until the desired threshold for the number of clusters is attained.

The estimated cluster configuration is shown in Table 3.4, along with a dendrogram in Figure 3.1 that illustrates the process of the agglomerative clustering algorithm introduced above. Moreover, Figure 3.2 shows the data from the individual sites grouped by cluster configuration, while Figure 3.3 shows the averaged curve (in black) over all sites within each cluster.

Cluster   # of Sites   Cluster Details (State)
(a)       8            CO, LA, MA, ME, OH, OR, PA, WY
(b)       21           AL, AR, GA, IA, ID, IN, KS, KY, MN, MS, MO, MT, NC, ND, NE, OK, SC, SD, TN, WI, WV
(c)       7            CA, DC, FL, MD, NJ, NY, UT
(d)       13           AZ, CT, DE, IL, MI, NH, NM, NV, RI, TX, VA, VT, WA

Table 3.4: Estimated Cluster Configuration

Cluster (a) has one knot-point around year 1991; the log-scaled rates for these states maintained a steady trend after that. Cluster (b) also has one knot-point, but unlike cluster (a) it occurred a little earlier, around year 1989. Cluster (c) has the same knot-point at year 1991 as cluster (a) but demonstrates a clear decreasing trend afterwards. As for cluster (d), it represents states with two potential knot-points detected around years 1982 and 1992; for these states the log-rates increased more slowly after 1982 and dropped after 1992.

3.6 Conclusion and Discussion

In this chapter, we proposed a mixed spline model that performs concurrent clustering of multiple curves based on their patterns and trends. The clustering is induced by a Dirichlet process prior on the space of step functions over the time points. The model was verified through a series of simulations and then applied to age-adjusted cancer mortality rates for all the states in the U.S. in order to find state-wise clusters with similar trends. Given the results of the real data analysis, our model provides a reasonable and meaningful cluster configuration and captures the group-wise features of the curves. As a potential further investigation, one could extend the piecewise linear model (corresponding to q = 2) to higher-order polynomial models, such as cubic spline functions, in order to summarize more information about the patterns of the curves; the criteria and interpretation of the clustering may then have to be re-established.
60 Figure 3.1: Dendrogram for the Clustering Estimation 61 Figure 3.2: Estimated Cluster Configuration 62 Figure 3.3: Overall Trend by Clusters 63 APPENDICES 64 Appendix A Proof of Theorems in Chapter 1 Proof of Theorem 1. Note that the likelihood function given D, Σ0 , β, γ is l(y; D, Σ0 , β, γ) = 1 1 ¯ ¯ e− 2 (y−Xβ−Zγ) Σ 1 1 −1 (y−Xβ− ¯ ¯ Zγ) 1 1 (2π) 2 M dn |Σ| 2 − 12 γ G−1 γ e , 1 (2π) 2 M dk |G| 2 where Σ = IM ⊗ D ⊗ In and G = IM ⊗ Σ0 ⊗ Ik . Let l(y; D, Σ0 ) be the likelihood with β and γ integrated out. To show the posterior is improper, it is enough to show l(y; D, Σ0 )π(D)π(Σ0 ) dD dΣ0 = ∞. (A.1) Using the eigenvalue decomposition of Σ0 , Σ0 = Q ΨQ, we can write π(Σ0 ) dΣ0 = π(Ψ) dΨ · π(Q) dQ with π(Ψ) = 1 d ψp j=1 j i 0 and SΨ = {ψ : ψ1 > . . . ψd > 0}, where ψ = (ψ1 , . . . , ψd ) is the set of eigenvalues of Σ0 and S ∗ = SD × SΨ . To proceed further, we denote the ordering of two m × m symmetric matrices, A and B, with respect to positive definiteness by A B if for any v ∈ Rm , v Av ≤ v Bv. We also say B A if A B. Note that we have 1 1 − Md (2π)− 2 (n−q−1) |X X| 2 l(y; D, Σ0 ) = 1 − 12 y W y , e 1 1 |D| 2 M (n−q−1) |Σ0 | 2 M k |Z¯ P Z¯ + G−1 | 2 ¯ Z¯ P Z¯ + G−1 )−1 Z¯ P with P = IM ⊗ D−1 ⊗ (In − X(X X)−1 X ). For where W = P − P Z( simplicity, define PX = X(X X)−1 X , PH = In − PX so that P = IM ⊗ D−1 ⊗ PH . Let 0 < λ1 < λ2 < · · · < λd be the eigenvalues of Z PH Z. Then, we have 1 Mk |Σ0 |− 2 1 2 |IM ⊗ Σ−1 0 ⊗ Ik | = |Z¯ P Z¯ + G−1 | 2 1 2 |IM ⊗ D−1 ⊗ Z PH Z + IM ⊗ Σ−1 0 ⊗ Ik | 1 2 |IM ⊗ Σ−1 0 ⊗ Ik | ≥ 1 2 |IM ⊗ D−1 ⊗ λd Ik + IM ⊗ Σ−1 0 ⊗ Ik | d 1 ≥ = Mk |λd −1 Σ0 + I| 2 on SD , where the first inequality holds since λd Ik ¯ Z¯ P Z¯ + G−1 )−1 Z¯ P since P Z( 1 Mk j=1 (1 + ψj λd −1 ) 2 Z PH Z. Also, e−1/2y W y ≥ e−1/2y P y 0. Thus, on SD × SΨ , 1 1 − Md 1 (2π)− 2 (n−q−1) |X X| 2 e− 2 y P y l(y; D, Ψ, Q) π(Ψ) ≥ 1 M (n−q−1) |D| 2 66 i 0, we are going to show dΨ = ∞. The first term in the integrand in (A.2) corresponding to the finite expansion of (A.2) i 1 where x is the smallest integer greater than or equal to x. Then we have ∗ − j > 0 for 1 ≤ j ≤ j ∗ − 1 and d−1 2 − j ≤ 0 for j ≤ j ≤ d. Consider A = {ψ : ψ1 ≥ 1, ψ2 ≥ 1, . . . , ψj ∗ −1 ≥ 1} ∩ SΨ . Then 1 ψ0 ψ1d−1 ψ2d−2 . . . ψd−1 d SΨ ≥ ≥ dΨ Mk p d −1 ) 2 j=1 ψj (1 + ψj λd d d−1 −j 1 ψj 2 Mk A j=1 (1 + ψj λd −1 ) 2 d−1 −j 2 ψ ∗ j≥j j A ≥ A dΨ 1 dΨ Mk Mk −1 ) 2 −1 ) 2 (1 + ψ λ (1 + ψ λ ∗ ∗ j j d d j≥j j ψj ∗ +1 > · · · > ψd > 0. 67 dΨ (A.4) For p > 1 and b > 0, we have t 1 1 1 1− ds = p b(p − 1) (1 + bt)p−1 0 (1 + bs) (A.5) Using (A.5), we integrate out (A.4) with respect to ψd , · · · , ψj ∗ on A ∩ SΨ . First, ψd−1 1 1 1 1 − . dψ = d Mk M k −1 −1 ( M k − 1) λ −1 −1 d 2 (1 + ψd λd ) 2 (1 + ψd−1 λd ) 2 0 Thus, using (A.5) again, ψd−2 Mk (1 + ψd−1 λd −1 ) 2 ψd−2 1 0 = λd −1 ( M2k 0 = ψd−1 1 1 λ2d −2 ( M2k − 1)2 − − 1) 1− Mk 0 1− 1 (1 + ψd λd −1 ) 2 dψd dψd−1 1 1 Mk (1 + ψd−1 λd −1 ) 2 −1 Mk (1 + ψd−1 λd −1 ) 2 dψd−1 1 Mk (1 + ψd−2 λd −1 ) 2 −1 1 λ2d −2 ( M2k − 1)(M k − 2) 1− 1 . (1 + ψd−2 λd −1 )M k−2 After integrating (A.4) with respect to ψd , · · · , ψj ∗ +1 , we have ψj ∗ −1 0 P 1 1 + ψj ∗ λd −1 · 1 ψj ∗ d−1 j≥j ∗ | 2 −j| dψj ∗ , (A.6) where P (·) is a polynomial of degree (M k − 2)(d − j ∗ ) + M k. Note that we assume M2k > 1. However, the integrand above is not integrable at 0 since j≥j ∗ d−1 2 − j ≥ 1. Hence we showed that the posterior is improper. Similar to case (1), by setting j ∗ = d−2 2 ≥ 1, we can prove case (2). For case (3), consider the corresponding multiple integral 68 on B = {ψ : 1 > ψ1 > ψ2 > · · · > ψd > 0}. 
Then, it is easy to verify the first integral of (A.3) with respect to ψd is infinite. Proof of Theorem 2. Let l(y; D, L) be the likelihood with β and γ integrated out and Σ0 expressed in terms of L. To show that the posterior distribution is proper, we need to verify l(y; D, L)π(D)π(L) dD dL < ∞. Given the prior models for D and L, we can express l(y; D, L)π(D)π(L) dD dL d = =C l(y; D, L) π(σj2 ) π( 11 )π( 1 )π(L11 ) dσ12 · · · dσd2 d 11 d 1 dL11 j=1 1 Mk |Σ0 |− 2 e− 2 y W y 1 |Z¯ P Z¯ + G−1 | 2 d 1 σj2 j=1 M (n−q−1) +αj +1 2 π( 11 )π(L11 )dσ12 · · · dσd2 d 11 d 1 dL11 , where C is a constant independent of D and L. First, we find W0 that does not depend on L and it satisfies W0 1 W so that e− 2 y W y ≤ 1 ¯ Z¯ P Z) ¯ −1 Z¯ P . Since (Z¯ P Z¯ + G−1 )−1 e− 2 y W0 y . Let W0 = P − P Z( ¯ Z¯ P Z¯ + G−1 )−1 Z¯ P IM ⊗ D ⊗ (Z PH Z)−1 , W = P − P Z( ¯ −1 = (Z¯ P Z) ¯ Z¯ P Z) ¯ −1 Z¯ P = W0 . P − P Z( 1 1 Thus, e− 2 y W y ≤ e− 2 y W0 y . Let P0 = PH −PH Z (Z PH Z)−1 Z PH . Then, we can write W0 = IM ⊗D−1 ⊗P0 so that we can express y W0 y = cj d j=1 σ 2 , j where cj = (i) M i=1 yj (i) P 0 yj (i) (i) (i) with yj = (yj1 , · · · , yjn ) for j = 1, . . . , d. Now, we claim that cj > 0 for all j = 1, · · · , d with probability 1. Note that PX , PH and P0 are idempotent so that PH is orthogonal to PX = I − PH and P0 is orthogonal to PH Z (Z PH Z)−1 Z PH . Thus, rank(PH ) = rank(I) − rank(PX ) = n − (q + 1) since 69 rank(PX ) = rank(X) = q + 1. rank(P0 ) = rank(PH ) − rank(PH Z (Z PH Z)−1 Z PH ). Since we assume n > q + 1 + k, rank(Z PH Z) = min{rank(Z), rank(PH )} = k and rank(PH Z) = min{rank(Z), rank(PH )} = k so that rank(PH Z (Z PH Z)−1 Z PH ) = k. Therefore, rank(P0 ) = n − (q + 1) − k > 0. Since P0 is idempotent and non-degenerate, we prove the claim. Thus, we have 1 1 e− 2 y W y ≤ e− 2 y W0 y = d e −cj /2σj2 , (A.7) j=1 where cj > 0 for all j = 1, . . . , d. Now, we want to bound |Σ0 | − M2k 1 −1 | 2 ¯ |Z¯ P Z+G above by a simpler expression. 1 Mk |Σ0 |− 2 1 2 |IM ⊗ Σ−1 0 ⊗ Ik | = |Z¯ P Z¯ + G−1 | 2 1 2 |IM ⊗ D−1 ⊗ Z PH Z + IM ⊗ Σ−1 0 ⊗ Ik | 1 2 |IM ⊗ Σ−1 0 ⊗ Ik | ≤ 1 2 |IM ⊗ E −1 ⊗ Ik + IM ⊗ Σ−1 0 ⊗ Ik | Mk = 2 |Σ−1 0 | |E −1 Mk 2 + Σ−1 | 0 Mk |E| 2 = |E Mk + Σ0 | 2 , where E = λ−1 1 D and λ1 be the smallest eigenvalue of Z PH Z. The inequality follows from IM ⊗ D−1 ⊗ Z PH Z IM ⊗ D−1 ⊗ λ1 Ik = IM ⊗ E −1 ⊗ Ik .   e11 0  We further simplify the last expression. Let E =  . By the block-wise 0 E11 70 A B = |A||D − CA−1 B|, we have expansion for determinants, C D 2 e11 1 1 1 1 11 = (e11 + 211 ) +L11 L11 +E11 . |E+Σ0 | = (e11 + 211 ) 1 1 +L11 L11 +E11 − 2 e11 + 11 e11 + 211 By the identity |Ip + U V | = |Ir + V U | when U is a p × r matrix and V is an r × p matrix, we have e11 |E + Σ0 | = (e11 + 211 )|L11 L11 + E11 | 1 + 1 (L11 L11 + E11 )−1 1 e11 + 211 = (e11 + 211 )|L11 L11 + E11 |(1 + 1 F 1 ), where F = e11 e11 + 2 11 (L11 L11 +E11 )−1 . Note that |E| = e11 |E11 | = d j=1 ejj = λ−d 1 d 2 j=1 σj . . (A.8) Thus, Mk Mk |Σ0 |− 2 |Z¯ P Z¯ 1 + G−1 | 2 ≤ |E| 2 Mk |E + Σ0 | 2 Mk Mk e112 |E11 | 2 = (e11 + Mk 2 ) 2 |L L 11 11 11 Mk + E11 | 2 (1 + 1F Mk 1) 2 Let R = {( 11 , 1 , L11 , σ12 , . . . , σd2 ) : 11 > 0; 1 ∈ Rd−1 ; σj2 > 0, j = 1, . . . , d} and R0 = {( 11 , L11 , σ12 , . . . , σd2 ) : 11 > 0; σj2 > 0, j = 1, . . . , d}. 
Then, by (A.7) and (A.8), we 71 have 1 Mk |Σ0 |− 2 e− 2 y W y |Z¯ P Z¯ R 1 + G−1 | 2 d 1 σj2 j=1 1 = R0 Rd−1 (1 + 1 F × (e11 + π( 11 )π(L11 )dσ12 · · · dσd2 d 11 d 1 dL11 d 1 cj Mk Mk − 2 M (n−q−1) d 2 2 +α +1 e11 |E11 | 1 2σj j 2 e 2 Mk Mk σj 2 ) 2 |L L + E | 2 j=1 11 11 11 11 × π( 11 )π(L11 ) dσ12 · · · dσd2 d 11 dL11 . When M k > 1, Rd−1 becomes Mk 1) 2 M (n−q−1) +αj +1 2 1 Mk (1+ 1 F 1 ) 2 (A.9) 1 d 1 = C1 |F |− 2 , where C1 is a constant. Thus, (A.9) cj Mk Mk − 2 M (n−q−1) d 2 2 +α +1 |F | e11 |E11 | 1 2σj j 2 C1 e 2 Mk Mk σj R0 (e + 2 ) 2 |L L + E | 2 j=1 11 11 11 11 11 × π( 11 )π(L11 ) dσ12 · · · dσd2 d 11 dL11 cj M k−(d−1) Mk − 2 M (n−q−1) d +α +1 2 2 1 e11 |E11 | 2σj j 2 = C1 e 2 2 M k−1 σj R0 e11 + 11 j=1 |L11 L11 + E11 | 2 × π( 11 )π(L11 ) dσ12 · · · dσd2 d 11 dL11 . cj − 2 M (n−q−1) d +α +1 1 1 2σj j 2 e π( 11 )π(L11 ) dσ12 · · · dσd2 d 11 dL11 ≤ C1 |E11 | 2 2 σj R0 j=1 cj c d ∞ 1 M (n−q−1) +α +1 − 12 ∞ 1 M (n−q−1) +α − 2σ 2 1 j 2 2 j dσ 2 = C2 e 2σ1 dσ12 e j 2 2 σ σ 0 1 j j=2 0 ∞ × π( 11 ) d 11 π(L11 ) dL11 0 −1 2 < ∞, 72 · where the first inequality holds since e11 e11 + 211 M k−(d−1) 2 ≤ 1, |E11 | M k−1 2 |L11 L11 + E11 | M k−1 2 ≤1 for M k ≥ d − 1 and the last inequality holds since π( 11 ) and π(L11 ) are proper and ∞ −s −t/θ dθ 0 θ e < ∞ for s > 0, t > 0. C1 and C2 above are both finite constants. Hence we conclude that the posterior is proper. 73 Appendix B MCMC Algorithm via Gibbs Sampling for Bayesian Inference in Chapter 1 For the linear mixed model, ¯ + Zγ ¯ + y = Xβ γ ∼ N (0, G), ∼ N (0, Σ), where G = IM ⊗ Σ0 ⊗ Ik and Σ = IM ⊗ D ⊗ In . For the Bayesian inference, we put priors on the parameters as described in section 1.3.3 and the followings are the updating steps of β and σj2 for Gibbs sampling method. Updating β 1 Let r(i) = y (i) −(Id ⊗Z)γ (i) and r¯ = M M (i) i=1 r . Given π(β) ∝ 1, the posterior distribution of β is β | y, γ, D, Σ0 ∼ N (β, Vβ ), ¯ Σ−1 X) ¯ −1 X ¯ Σ−1 r = (Id ⊗ (X X)−1 X )¯ ¯ Σ−1 X) ¯ −1 = 1 (D ⊗ where β = (X r and Vβ = (X M (X X)−1 ). 74 Updating γ Let u(i) = y (i) − (Id ⊗ X)β, i = 1, . . . , M . We can update γ (i) one by one independently. Since γ (i) ∼ N (0, Σ0 ⊗ Ik ), the posterior distribution of γ (i) is γ (i) | y, β, D, Σ0 ∼ N (γ (i) , Vγ ), i = 1, . . . , M, −1 −1 −1 (i) −1 −1 where γ (i) = (D−1 ⊗ Z Z + Σ−1 0 ⊗ Ik ) (D ⊗ Z )u and Vγ = (D ⊗ Z Z + Σ0 ⊗ Ik ) . Updating D = diag(σ12 , . . . , σd2 ) We can update D by updating σj2 one by one independently. ¯ − Zγ. ¯ Given IG(aj , bj ) as the prior for π(σ 2 ), the posterior distribution Let r = y − Xβ j of σj2 is σj2 1 1 | y, β, γ, Σ0 ∼ IG M n + aj , 2 2 M n (i)2 rjt + i=1 t=1 1 −1 , j = 1, . . . , d bj Updating procedure of Σ0 via 1 1, 1 and L11 is provided in section 1.3.3. 75 Appendix C Maximum Likelihood Estimation in Chapter 1 This part serves as a preliminary analysis of the parameters based on the likelihood approach. Here instead of the original assumption that the covariance matrix for errors is Σ = IM ⊗D⊗ Σϕ , where D is diagonal matrix, we consider a more general relationship between individual covariates such that Σ = IM ⊗ Σ1 ⊗ Σϕ , where Σ1 can be any valid covariance matrix. Then the likelihood function for linear regression model in section 1.4 is f (y|β, γ, Σ1 ) = 1 1 ¯ ¯ e− 2 (y−Xβ−Zγ) Σ 1 −1 (y−Xβ− ¯ ¯ Zγ) 1 (2π) 2 ndM |Σ| 2 , where Σ = IM ⊗ Σ1 ⊗ Σϕ when we assume γ as fixed effects as well. The MLE equation for β and γ is then      −1 −1 −1 ¯ ¯ ¯ ¯ ¯ X Σ X X Σ Z  β  X Σ y     =  . −1 −1 −1 ¯ ¯ ¯ ¯ ¯ ZΣ X ZΣ Z γ ZΣ y (i) (i) ¯ β − Zγ, ¯ Ui = (e(i) , . . . 
, e(i) ) and Si = U Σ−1 Let β, γ denote the MLEs, ejt = yjt − X i ϕ Ui , 1 d where i = 1, . . . , M . Then, we have f (y|β, γ, Σ1 , Σϕ ) ∝ 76 −1 1 e− 2 tr(Σ1 S) nM dM |Σ1 | 2 |Σϕ | 2 , where S = M i=1 Si . S . Then D is maximized at Σ1 = nM The log-likelihood function is L(y|β, γ, Σ1 , Σϕ ) ∝ − dM M nM ¯ tr(Σ−1 log|Σ−1 |+ log|Σ−1 S(ϕ)) + ϕ |, 1 1 2 2 2 S. ¯ where S(ϕ) =M n It can be shown that |Σ−1 ϕ |=2 ϕ = arg max −1≤ϕ≤1 2jπ n−1 j=0 (1+ϕ cos( n )). Therefore the MLE for ϕ satisfies M dM ndM ¯ tr(Σ−1 S(ϕ)) + log|Σ−1 ϕ |+ 1 2 2 2 Following the procedure above we are able to obtain the MLEs by updating the parameters iteratively until convergence. 77 Appendix D Proof of Theorems and Lemmas in Chapter 2 √ Proof of Lemma 1. First note that Vj = T X (j) /( nσσj ) ∼ N (0, 1). Examining the tail probability of Gaussian random variables leads to P max |Vj | > 1≤j≤n t2 + 2 log n ≤ 2nP V1 > ≤ 2n exp − t2 + 2 log n t2 + 2 log n = 2 exp(−t2 /2). 2 (D.1) (D.2) Then it follows that P 2| T X (j) | > λ0 n 1≤j≤n max =P ≤P 2σ √ max σj |Vj | > λ0 n 1≤j≤n max |Vj | > 1≤j≤n ≤ 2 exp(−t2 /2). t2 + 2 log n (D.3) (D.4) (D.5) Proof of Lemma 2. Let W = nσ 2 /σ 2 = Y T Y /σ 2 . Since Y ∼ N (Xβ, σ 2 I), we claim that W ∼ χ2n (r), i.e. a noncentral chi-square distribution with noncentrality parameter r. 78 It is known that E(W ) = n + r (D.6) E[W − E(W )]2 = 2(n + 2r) (D.7) E[W − E(W )]4 = 12(n + 2r)2 + 48(n + 4r). (D.8) Given k > 0, by Chebyshev’s inequality P |σ 2 − (1 + r/n)σ 2 | > ασ 2 = P (|W − E(W )| > nα) ≤ E|W − E(W )|k αk nk (D.9) Therefore, P σ 2 < (1 − α)σ 2 ≤ P |σ 2 − (1 + r/n)σ 2 | > ασ 2 ≤ E|W − E(W )|k α k nk (D.10) Finally by taking k = 2 and k = 4 respectively, we complete our proof. Proof of Theorem 3. Note that under (2.5), n−1 Y − Xβ 2 /n + λ 2 n−1 |βj | ≤ Y − Xβ j=2 2 /n + λ 2 |βj | (D.11) j=2 By expanding the 2 terms, it simply gives n−1 X(β − β) 2 /n + λ 2 n−1 |βj | ≤ 2 j=2 T X(β − β)/n + λ |βj | (D.12) j=2 For any vector v ∈ Rp we define v 01 = (v0 , v1 , 0, · · · , 0)T and v 2+ = (0, 0, v3 , · · · , vn−1 )T . 79 Since v = v 01 + v 2+ , (D.12) becomes n−1 X(β − β) 2 /n 2 ≤2 T X(β − β) 01 /n + 2 T X(β − β) /n − λ 2+ n−1 |βj | + λ j=2 Set λ0 = 6Kσ log n n . |βj | (D.13) j=2 On Jλ0 = max1≤j≤n 2| T X (j) |/n ≤ λ0 , n−1 2 T X(β − β)2+ /n ≤ λ0 |βj − βj | (D.14) j=2 Meanwhile, note that β 01 1 = |β0 | + |β1 | ≤ M by assumption. We have 2 T X(β − β)01 /n ≤ λ0 β 01 1 + λ0 β 01 1 ≤ λ0 M + λ0 (|β0 | + |β1 |) (D.15) Now define Sn = {σ : σ ≥ σ/2}. Then on Sn we have λ ≥ λ0 . Combining (D.12) through (D.15) and noting the fact that limn→∞ λ = 0 and limn→∞ λ β 1 = 0, we claim that on Jλ0 ∩ Sn , X(β − β) 22 /n ≤ λM + 2λ β 1 → 0 as n → ∞. (D.16) It remains to show that Jλ0 ∩ Sn has high probability when n is large. In fact, we have P Jλc ∪ Snc ≤ P (Jλc ) + P (Snc ) 0 0 (D.17) By lemma 1, P (Jλc ) ≤ 2/n2 . Taking α = 3/4 in lemma 2 we get 0 P (Snc ) ≤ (27 + 12) · 44 27 ≤ 34 · n2 n2 80 (D.18) when n ≥ max(4r, 8). Hence we claim that P Jλc ∪ Snc ≤ 130/n2 when n ≥ max(4r, 8). 0 Since ∞ ∞ P n=1 Jλc 0 ∪ Snc ≤ (4r + 8) + 130 n=1 1 < ∞, n2 (D.19) by Borel-Cantelli lemma, P Jλc ∪ Snc i.o. = 0. 0 (D.20) P ∃N > 0 s.t. Jλ0 ∩ Sn holds when n > N = 1. (D.21) It follows that Therefore, by (D.16) P lim n→∞ X(β − β) 22 /n = 0 = 1. (D.22) Proof of Proposition 1. Recall that xj = j/n, j = 1, 2, · · · , n. According to (2.4), we have   1 0 0 ··· 1 − ν n   2  1 0 0 ···  n −ν   1 3  1 0 ··· n n −ν  X − νI =   .. .. .. ..  . . . . 
   n−1 n−3 n−2 · · ·  1 n n n   n−1 · · · n−2 1 1 n n 0 0    0 0     0 0    .. ..  . .     1 −ν  0 n   1 −ν 2 n n (D.23) It is easy to see that ν = 2/n is not the solution to det(X − νI) = 0. So by multiplying the second row of (D.23) with −1/(2 − nν) and add it to the first row we obtain a lower 81 triangular matrix   nν 2 −(n+2)ν+1  2−nν         R(X − νI) =          1 2 n 0 0 0 ··· −ν 0 0 ··· −ν 0 ··· 1 3 n 1 n .. . .. . .. . .. . 1 n−1 n n−3 n n−2 n ··· 1 1 n−2 n n−1 n ··· 0 0    0 0     0 0   , .. ..  . .     1 −ν  0 n   2 1 −ν n n (D.24) where R denotes the row operation described as above. Then it is clear that nν 2 − (n + 2)ν + 1 det(X − νI) = det(R(X − νI)) = n n−2 1 −ν n (D.25) It directly follows that √ √ n + 2 − n2 + 4 1 n + 2 + n2 + 4 ν1 = , ν2 = ν3 = · · · = νn−1 = , νn = 2n n 2n are the roots of det(X − νI) = 0. 82 (D.26) Appendix E Proof of Theorems in Chapter 3 Proof of Theorem 4. Given site s, the likelihood function is f (Y s |b, us , σs2 ) = 1 n (2πσs2 ) 2 exp − 1 (Y s − X (s) b(s) − us 1n )T (Y s − X (s) b(s) − us 1n ) 2σs2 (E.1) To validate the propriety of the posterior, we need to show  N c∈P (0,∞) R Θd (0,∞)N  N f (Y s |b, us , σs2 )π(us |τ 2 )π(σs2 ) π(c)π(τ 2 )dσ 2 G0 (dθ)dudτ 2 < ∞  s=1 (E.2) Integrating out σs2 , we have n Γ(aσ fs (Y s , b, us ) = (0,∞) f (Y s |b, us , σs2 )π(σs2 )dσs2 = (2π)− 2 n) 2 + Γ(aσ ) 1 eT e 2 s s + b1 σ aσ bσ −(aσ + n 2) , (E.3) where es = Y s − X (s) b(s) − us 1n . ¯ Then the integrand in (E.2) becomes   N fs (Y s , b, us )π(us |τ 2 ) π(c)π(τ 2 ),  s=1 83 (E.4) which is equivalent to the clustered form d Γ(aσ + n2 ) N Γ(aσ )baσσ − N2n (2π) r=1 s∈Cr n 1 −(aσ + 2 ) 1 T e er,s + , 2 r,s bσ (E.5) where er,s = Y s − Xr br − us 1n . ¯ −(aσ + n 2) 1 1 T . Integrating with respect to G0 (dθ), given cluster r we Let Ir,s = 2 er,s er,s + b σ need to evaluate  d q+kr r=1 R since Ir,s ≤ 1 T 2 er,r1 er,r1 aσ + n bσ 2  d Ir,s  dbr ≤  s∈Cr q+kr r=1 R (aσ + n 2 )(|Cr |−1) bσ Ir,r1 dbr (E.6) for any r and s, where r1 ∈ Cr is a fixed location. Note that Ir,r1 = −(aσ + n 2) + b1 as a function of br is proportional to a (q + kr ) dimensional σ multivariate t-distribution with parameters (µr , Σr , φr ), where µr = (XrT Xr )−1 XrT (Y r1 − ur1 1n ) = (XrT Xr )−1 XrT v r1 ¯ 2 + v T I − X (X T X )−1 X T v r r1 r r r r1 bσ (XrT Xr )−1 Σr = φr (E.7) φr = 2aσ + n − (q + kr ) (E.9) (E.8) Then the rth integral for the RHS of (E.6) can be shown as (aσ + n 2 )(|Cr |−1) Rq+kr bσ (aσ + n 2 )(|Cr |−1) Ir,r1 dbr = Ar bσ · 2 bσ + v Tr1 Pr v r1 1 bσ + v Tr1 Pr v r1 q+kr 1 (aσ + n 2 )|Cr |− 2 2 2 (q+kr ) , ≤ Ar bσ 84 q+kr 2 aσ + n 2 (E.10) (E.11) where Pr = I − Xr (XrT Xr )−1 XrT and n−(q+kr ) 2 n Γ(aσ + 2 ) 1 π 2 (q+kr ) Γ aσ + Ar = (E.12) 1 |XrT Xr | 2 Therefore it remains to verify that N RN Recall that π(us |τ 2 ) = √1 2πτ π(us |τ 2 )π(τ 2 )dτ du < ∞ (E.13) (0,∞) s=1 u2 − s2 e 2τ and π(τ 2 ) = τ b−a τ Γ(aτ ) 1 τ2 aτ +1 − 1 2 e bτ τ , integrating out τ 2 leads to a new integrand −N 2 (2π) baτ τ N) 2 Γ(aτ + Γ(aτ )  1 +1 bτ 2 N −(aτ + N ) 2 u2s  , (E.14) s=1 which, as a function of u, is proportional to the density of a N dimensional t-distribution with parameters (z N , (aτ bτ )−1 IN , 2aτ ). Hence the last integration with respect to u is finite. Note the final summation is over finite possible partitions of S. Therefore we complete our verification. 85 BIBLIOGRAPHY 86 BIBLIOGRAPHY [1] Abraham, C., Cornillon, P.A., Matzner-Løber, E. and Molinari, N. 
(2003), Unsupervised Curve Clustering using B-Splines. Scandinavian Journal of Statistics 30: 581-595.

[2] Bickel, P., Ritov, Y. and Tsybakov, A. (2009), Simultaneous Analysis of Lasso and Dantzig Selector. The Annals of Statistics 37(4): 1705-1732.

[3] Bühlmann, P. and van de Geer, S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer: New York, USA.

[4] Dass, S., Lim, C., Maiti, T. and Zhang, Z. (2014), Clustering Curves Based on Change Point Analysis: A Nonparametric Bayesian Approach. Accepted by Statistica Sinica.

[5] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), Least Angle Regression. The Annals of Statistics 32(2): 407-499.

[6] Escobar, M.D. and West, M. (1995), Bayesian Density Estimation and Inference using Mixtures. Journal of the American Statistical Association 90(430): 577-588.

[7] Ferguson, T.S. (1973), A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1(2): 209-230.

[8] Friedman, J., Hastie, T. and Tibshirani, R. (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1): 1-22.

[9] Gaston, K.J. (1996), Biodiversity: A Biology of Numbers and Difference. Oxford, UK.

[10] Gelman, A. and Rubin, D.B. (1992), Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 7: 457-472.

[11] Ghosh, P., Basu, S. and Tiwari, R.C. (2009), Bayesian Analysis of Cancer Rates from SEER Program using Parametric and Semiparametric Joinpoint Regression Models. Journal of the American Statistical Association 104(486): 439-452.

[12] Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning. Springer: New York, USA.

[13] Jeffreys, H. (1961), Theory of Probability. Oxford Univ. Press: Oxford, UK.

[14] Kim, H.-J., Fay, M.P., Feuer, E.J. and Midthune, D.N. (2000), Permutation Tests for Joinpoint Regression with Applications to Cancer Rates. Statistics in Medicine 19: 335-351.

[15] Lozano, R. et al. (2012), Global and Regional Mortality from 235 Causes of Death for 20 Age Groups in 1990 and 2010: A Systematic Analysis for the Global Burden of Disease Study 2010. The Lancet 380(9859): 2095-2128.

[16] Ramsay, J.O. and Silverman, B.W. (1997), Functional Data Analysis. Springer: New York, USA.

[17] Ries, L.A.G., Harkins, D., Krapcho, M., Mariotto, A., Miller, B.A., Feuer, E.J., Clegg, L., Eisner, M.P., Horner, M.J., Howlader, N., Hayat, M., Hankey, B.F. and Edwards, B.K. (eds.) (2006), SEER Cancer Statistics Review, 1975-2003. National Cancer Institute: Bethesda, MD, USA.

[18] Sethuraman, J. (1994), A Constructive Definition of Dirichlet Priors. Statistica Sinica 4: 639-650.

[19] The IUCN Red List of Threatened Species (2012). http://www.iucnredlist.org/.

[20] Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1): 267-288.

[21] Tiwari, R.C., Cronin, K.A., Davis, W., Feuer, E.J., Yu, B. and Chib, S. (2005), Bayesian Model Selection for Join Point Regression with Application to Age-adjusted Cancer Rates. Journal of the Royal Statistical Society: Series C (Applied Statistics) 5: 919-939.

[22] Wang, H., Li, B. and Leng, C. (2009), Shrinkage Tuning Parameter Selection with a Diverging Number of Parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71: 671-683.

[23] Yang, R. and Berger, J. (1994), Estimation of a Covariance Matrix Using the Reference Prior.
The Annals of Statistics 22(3): 1195-1211.