STATISTICAL METHODS FOR HIGH-DIMENSIONAL SEQUENCING STUDIES By Changshuai Wei A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Epidemiology - Doctor of Philosophy 2014       ABSTRACT STATISTICAL METHODS FOR HIGH-DIMENSIONAL SEQUENCING STUDIES By Changshuai Wei Background: With the advance of next generation sequencing technology, a massive amount of sequencing data are generated from sequencing studies, offering a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. The massive amount of data also poses a great challenge to the statistical analysis. Association analyses based on a single-locus test endure substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. A joint association test, on the other hand, has been shown to be more suitable for sequencing studies, as it jointly tests multiple variants to increase the power and reduces the dimensionality. However, current methods for joint association tests have limitations, such as dependency on distribution assumptions, poor performance for a small sample size, inability to adjust for covariates, inability to consider a family structure, inability to handle multiple traits, and computational inefficiency. Method and Result: I have developed a series of statistical methods to detect joint associations in sequencing studies based on weighted U statistics. 1) I developed a weighted U statistic, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on the non-parametric U statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be       applied to various types of phenotypes. 2) The simultaneously analysis of multiple related phenotypes has been great interest in recent human genetics research. In these studies, multiple phenotype measurements have been collected to study pleiotropic effects, or comorbidity effects. I have extended the WU-SEQ method and developed a Gene-Trait Similarity U test (GTSU) to detect sequencing variants associated with multiple phenotypes. 3) Both population-based and family-based designs have been commonly used in genetic association studies. There lacks, however, new statistical methods for family-based sequencing data analyses. Family-based sequencing studies have many advantages, including the ability to aggregate more disease-susceptibility rare variants within families. I have extended the GTSU to FGTSU for family-based sequencing association analyses.       ACKNOWLEDGEMENTS I want to first thank my PhD advisor, Dr. Qing Lu, for his support and mentoring. I also want to thank my PhD committee members, Dr. James Anthony, Dr. Wenjiang Fu, and Dr. Yuehua Cui, for their help with my dissertation and research project. My collaborators, Dr. Nigel Paneth and Dr. Robert Elston, also helped me a great deal with my research, and gave me very valuable advice. Finally, I want to thank my parents and all my friends for their support in my five-year PhD studies. iv      TABLE OF CONTENTS LIST OF TABLES ....................................................................................................vii LIST OF FIGURES ...................................................................................................ix INTRODUCTION .....................................................................................................1 CHAPTER 1 ..............................................................................................................9 A Weighted U for Sequencing Association Studies ..................................................9 1.1 A Weighted U Test ..................................................................................9 1.2 Asymptotic Distribution...........................................................................12 1.3 Covariates Adjustment .............................................................................14 1.4 Simulation ................................................................................................15 1.4.1 Simulation I.............................................................................16 1.4.2 Simulation II ...........................................................................19 1.4.3 Simulation III ..........................................................................21 1.5 Application to the DHS Sequencing Data ...............................................25 1.6 Discussion ................................................................................................28 CHAPTER 2 ..............................................................................................................33 Gene Trait Similarity U test .......................................................................................33 2.1 Weighted U statistic .................................................................................34 2.2 Similarity Measurement ...........................................................................36 2.3 Hypothesis Testing...................................................................................38 2.4 The Power and Sample Size Calculations ...............................................39 2.5 Computation and Implementations ..........................................................41 2.6 Covariates Adjustment .............................................................................42 2.7 Simulation Method...................................................................................43 2.7.1 Simulation I.............................................................................44 2.7.2 Simulation II ...........................................................................45 2.8 Simulation Results ...................................................................................45 2.8.1 Simulation I.............................................................................46 2.8.2 Simulation II ...........................................................................49 2.9 Computation and Software ......................................................................52 2.10 Application to Dallas Heart Study .........................................................53 CHAPTER 3 ..............................................................................................................55 Family Based Gene Trait Similarity U test ................................................................55 3.1 Family-based Gene Trait Similarity U test ..............................................55 3.2 Covariates adjusted Gene Trait Similarity U test ....................................57 3.3 Simulation ................................................................................................59 3.4 Whole Genome Sequencing Analysis of GAW18 ...................................63 v      CHAPTER 4 ..............................................................................................................67 Conclusion and Discussion ........................................................................................67 APPENDICES ...........................................................................................................71 Appendix A ....................................................................................................72 Appendix B ....................................................................................................77 Appendix C ....................................................................................................96 BIBILOGRAPHY ......................................................................................................99 vi      LIST OF TABLES Table 1: Type I error and power comparisons of WU-SEQ and SKAT under different direction and magnitude of effects ....................................................18 Table 2: Type I error and power comparisons of WU-SEQ and SKAT for different sample sizes .................................................................................................20 Table 3: Type I error and power comparisons of WU-SEQ and SKAT when the number of variants is much larger than the sample size ......................................21 Table 4: Type I error and power comparisons of WU-SEQ and SKAT for covariate adjustment ..................................................................................................23 Table 5: Type I error and power comparisons of WU-SEQ by using different c ...................................................................................................................24 Table 6: Type I error and power comparison of WU-SEQ by using normal quantile or rank ..........................................................................................................24 Table 7: The association of 4 candidate genes with 3 continuous phenotypes (i.e., BMI, Cholesterol, and VLDL) in the Dallas Heart Study ..............27 Table 8: Type I error for single trait analysis ............................................................47 Table 9: Type I error for multiple trait analysis .........................................................50 Table 10: Multiple trait analysis for DHS study ........................................................54 Table 11: Type 1 error and power comparison for discrete phenotype .....................61 Table 12: Type 1 error and power comparison for continuous phenotype ................62 Table 13: Top 10 SNV sets with smallest p-values for GAW18 ...............................66 Table A1: Simulation setting for models without confounding effect and fixed sample size ........................................................................................................73 Table A2: Simulation setting for model without confounding effect and varying sample size ....................................................................................................73 Table A3: Simulation setting for model with confounding effect and varying sample size ....................................................................................................74 vii      Table A4: The association of 4 candidate genes with 3 binary phenotypes in Dallas Heart Study .................................................................................................74 Table B1: Simulation setting for simulation I............................................................89 Table B2: Simulation setting for simulation II ..........................................................90 Table B3: Computational time for R-based GTSU and SKAT, SKATO, AdjSKAT for analyzing 1000 simulated data replicates ...........................................91 Table B4: Simulation setting for multiple trait analysis with covariates ...................92 Table B5: Type I error for multiple trait analysis with covariates .............................93 Table C1: Simulation setting for family based sequencing studies ...........................97 viii      LIST OF FIGURES Figure 1: Distribution of SNVs with MAF<0.03 for the 4 genes in the Dallas Heart Study ..................................................................................................................26 Figure 2: Distributions of the 3 phenotypes in the Dallas Heart Study .............................26 Figure 3: Power comparison for single trait analysis when half of the functional SNVs are protective and the other half of the functional SNVs are deleterious ............................................................................................................48 Figure 4: Power comparison for single trait analysis when majority of the functional SNVs are deleterious ................................................................................49 Figure 5: Power comparison for multiple trait analysis when the multiple traits follow the same type of distribution .................................................................50 Figure 6: Power comparison for multiple trait analysis when the multiple traits follow the different types of distribution ..........................................................52 Figure 7: QQ plot for the whole genome sequencing analysis ..................................65 Figure A1: Missing data distribution for Dallas Heart Study ....................................75 Figure A2: MAF for all SNPs ....................................................................................75 Figure A3: Distribution of test statistics ....................................................................76 Figure B1: Power comparison for multiple trait analysis with same set of covariates ...................................................................................................................94 Figure B2: Power comparison for multiple trait analysis with different set of covariates ...............................................................................................................95 Figure C1: distribution of sizes of SNV sets .............................................................98 ix      INTRODUCTION Genome-wide association studies (GWAS) have been used to uncover common genetic variants predisposing to common complex diseases. During the past decade, thousands of new genetic variants have been identified from GWAS; some with a compelling biological plausibility for a role in disease etiology [Hindorff, et al. 2009]. Despite such successes, the genetic variants identified to date for most complex diseases can only explain a small fraction of the total heritability. Evolutionary theory suggests that rare variants are likely to be recent mutations, and are also likely to be more deleterious because they are under less negative selective pressure [Boyko, et al. 2008; Fay, et al. 2001; Kryukov, et al. 2007; Pritchard 2001; Raychaudhuri 2011]. Converging evidence from genetic studies of complex diseases also suggests that complex diseases are highly heterogeneous, and that familial subtypes of complex diseases are frequently seen to be caused by rare variants with large effects [Ahituv, et al. 2007; Cohen, et al. 2004; Easton, et al. 2007; Ji, et al. 2008; Romeo, et al. 2009]. Nevertheless, current findings of rare variants predisposing to complex diseases are still limited. With advancements in highthroughput sequencing technology, researchers are now able to comprehensively study the role of a massive amount of sequencing variations, including both common and rare variants, in human diseases. While the ongoing exome and whole-genome sequencing studies hold great promise for identifying new genetic variants predisposing to complex disease, they also pose great challenges for the statistical analysis of massive amounts of sequencing data. 1      Although single-locus analysis has been widely used in GWAS, such analysis is underpowered for detecting rare variants. Rare variants are anticipated to have larger effects than common variants, but their low frequencies in the study population make them hard to detect. Moreover, the number of single-locus tests required for sequencing studies is significantly larger than those required for GWAS studies, which increases the multiple-testing burden. Therefore, as an essential alternative, a joint-association analysis of genetic variants has become popular in the association analysis of sequencing data [Barnett, et al. 2013; Chen, et al. 2012; Ladouceur, et al. 2012; Lee, et al. 2012b; Lin and Tang 2011b; Madsen and Browning 2009; McCarthy, et al. 2008; Morgenthaler and Thilly 2007a; Neale, et al. 2011; Wei and Lu 2011; Wu, et al. 2011; Zhu, et al. 2010]. By aggregating information from multiple variants, the association signal is enhanced and becomes more detectable. Moreover, this greatly reduces the number of tests, which alleviates the multiple-testing issue. Let Yi be the phenotype value for subject i, and Gi  (Gi1 ,..., GiL ) as the genotype vector for subject i. One of the traditional methods for joint association analyses is the regression model, Yi    Gi    i ,  i ~ N (0,  2 ), We can use F test to test the null hypothesis whether  equals 0. However, in a sequencing study, the dimension of the matrix Gi can be larger than the sample sizes. Additionally, G could be a sparse matrix and GT G might not be invertible. Therefore, the traditional methods are not suitable for high dimensional sequencing data. A common strategy used in current methods is to first aggregate the effect of rare variants in a region 2      (e.g., a gene) via collapsing or weighting, and then to assess the joint association of the variants with the phenotype of interest. Most of the existing methods for sequencing association analyses can be classified into two categories: burden test and non-burden test. Burden tests assume the effects of the multiple variants are homogenous, and use the following model to test the joint association, Yi    t (Gi )   i ,  i ~ N (0,  2 ), where t () is a pre-specified function to calculate a genetic score from the multiple genetic variants and  is a scalar parameter to measure the joint effect. Non-burden tests consider the heterogeneous effects of the multiple variants, and use the following model to test the joint association, Yi    r (Gi )   i ,  i ~ N (0,  2 ), where r () is a general function measuring the effects of multiple variants. For nonburden tests, we test H 0 : r  0 v.s. H a : r  0 . The commonly used methods for burden tests include the cohort allelic sum test (CAST), the combined multivariate and collapsing (CMC) method [McCarthy, et al. 2008], and the weighted sum test (WST) [Madsen and Browning 2009]. CAST collapses all variants into one variant and t (Gi ) is defined as, 1, if there are rare variants, t (Gi )   0, otherwise. Using the pseudo-variant, CAST then performs a simple two-sample t test between cases and controls to evaluate the joint association. By collapsing all the variants into one 3      variant, CAST greatly reduces the data dimensionality and computational burden. Yet, CAST assumes that all the variants have the same effect sizes, which might be hold in read data scenarios. CMC generalizes the CAST by collapsing the variants into several pseudo-variants based on predefined criteria (e.g., MAF). Then, a Hotelling’s T test is used to test the joint association, T  ( tcases  tcontrol )T S 1 ( tcases  tcontrol ), where t is the mean vector for the new pseudo-variants, and S is the corresponding covariance matrix. By collapsing all the variants into multiple pseudo-variants, CMC has the ability to consider different effect sizes of multiple genetic variants. Different from CAST and CMC, WST considers different effect sizes by assigning different weights to different variants. In WST, t (Gi ) is defined as a weighted sum of genetic scores, L t (Gi )   w j Gi , j , j 1 where w j is the weight for the j-th variant. A rank sum test is then used to compare the genetic scores between the cases and controls to make inference on the joint association. Note that the above three methods can only be applied to case-control data. In order to consider both binary and continuous phenotypes, SCORE-seq [Lin and Tang 2011a] is developed under the framework of generalized linear model, g ( E (Yi ))  X i  t (Gi )  , where X i represents the matrix for confounding covariates, and  represents the effects of the covariates. t (Gi ) is defined as a weighted sum of multiple variants, 4      L t (Gi )   w j Gi , j , and  is a scalar parameter measuring the joint association effect. j 1 Score test can then be used to test the null hypothesis whether  equals zero. For different choices of t (Gi ) , SCORE-seq defines a maximum statistic, Tmax  max | Tk |, 1 k  K where Tk is the score test statistic for the k-th choice of t (Gi ) . P-value can be calculated by using, P(Tmax  tmax )  1  P(| T1 | tmax ,...,| TK | tmax ). Burden tests implicitly assume the effects of different rare variants are in the same direction and/or same magnitude. If the assumption fails, it could lead to power loss. To address this limitation, different burden tests are proposed. The C-alpha test [Neale, et al. 2011] can consider the direction and magnitude of effects, by comparing the expected variance with observed variance in a case-control setting. The test statistic is equivalent to a simple variance component test, T  (Y  Y )T GGT (Y  Y ). Besides the variance component test, we can also use functional approach by considering the genotypes in a region as a function of genetic position[Luo, et al. 2012]. For the quantitative trait, functional linear model (FLM) can be built, Yi  X i   Gi (t )  (t ) dt   i ,  i ~ N (0,  2 ), 5      where t is the genetic position, Gi (t ) is the function for genotypes, and  (t ) is the function for the genetic effects. Using basis function to describe Gi (t ) and  (t ) , FLM can be transformed into a usual linear regression setting. Another semi-parametric method that is widely used in sequencing data analyses is the sequence kernel association test (SKAT) [Wu, et al. 2011], which is also robust for the direction and magnitude of genetic effects. Moreover, because it is built based on kernel machine regression, SKAT can adjust for covariates, and is applicable to both binary and Gaussian phenotypes, g ( E (Yi ))  X i  r (Gi ), where g () is the link function and r (Gi ) is considered as a random effect. A variance component score test can be used to test the null hypothesis whether the function r () equals 0, T  (Y  Yˆ )T K (Y  Yˆ ), where Yˆ is obtained from the null model (i.e., g ( E (Yi ))  X i ) and K is a kernel matrix calculated from multiple genetic variants. Most existing methods for sequencing data analyses are parametric-based or semiparametric based, which rely on certain assumptions. However, in practice, these assumptions may not be satisfied (e.g., the phenotype may not follow a normal distribution in a sequencing study based on an extreme phenotype design). When the assumptions are violated, the existing methods will likely have either decreased power or an inflated type I error. Non-parametric methods, such as U-statistic-based methods, have shown their robustness against underlying phenotypic distributions and/or underlying 6      modes of inheritance [Li 2012; Li, et al. 2011; Schaid, et al. 2005; Wei, et al. 2012; Wei, et al. 2013a; Wei, et al. 2008]. Various U-statistic-based methods have been proposed for identifying the common variants associated with binary or quantitative phenotypes. Most of these methods use U statistics to obtain multiple group-wise scores, and then form test statistics to compare scores among different groups. For case-control data analyses, U statistics are used to summarize the genetic information, and then to compare scores between cases and controls [Schaid, et al. 2005]. For quantitative data analyses, U statistics first summarize the phenotypic information, and then compare scores among different genotype groups or multi-locus genotype groups [Wei, et al. 2008]. It is challenging to provide a unified method under the traditional U-statistic framework because there are different ways of constructing U-statistics for binary and quantitative phenotypes [Li 2012]. The weighted U statistic is a more general form of the U statistic. It was developed in the 1980s-90s [Gregory 1977b; Serfling 1981; Shieh 1997], and has rarely been used in genetic data analyses. By using a weighted U statistic, genetic information and phenotypic information can be summarized separately into the weight function and the U kernel, thereby avoiding the issue of group score comparison. In the following chapters, we use the weighted U statistic to develop new statistical methods for the analyses of multiple sequencing variants associated with a single trait, multiple traits and correlated traits (i.e., traits from family studies). We use the new methods to investigate the genetic risk factors for cardiovascular disease and hypertension (HT) by using targeted sequencing data from Dallas Heart Study and whole genome sequencing study from GAW18. Cardiovascular disease (CVD) refers to any disease that affects the cardiovascular system (e.g. coronary heart disease and stroke). 7      It is the leading cause of death worldwide. By 2005, the total number of CVD death had reached 17.5 million. The projection of total CVD death in 2015 is about 20 millions, accounting for 30 percent of all dealths worldwide [Kelly and Fuster 2010]. Risk factors for CVD include family history, age, smoking, hypertension, high cholesterol, obesity and so on. Hypertension is the single most important risk factors for stroke and also plays a significant role in heart attack. 8      CHAPTER 1: A Weighted U for Sequencing Association Studies Based on the weighted U statistic, we have developed a unified method, referred to as WU-SEQ, for sequencing data analyses of various types of phenotypes (e.g., binary, ordinal, and continuous). Moreover, we used a projection method in WU-SEQ for covariate adjustments, and derived the asymptotic distribution of the test statistic for an efficient assessment of the significance of the association. The performance of WU-SEQ was evaluated though extensive simulation studies, and compared with a commonly used method, SKAT [Wu, et al. 2011]. Finally, we illustrated the proposed method by applying WU-SEQ to sequencing data from the Dallas Heart Study (DHS). 1.1 A Weighted U Test Assume a population-based sequencing study with N unrelated subjects and P single nucleotide variants (SNV) located in a gene or a genetic region. Let yi and Gi  ( gi1, gi 2 ,, giP ) denote, respectively, the phenotype and the genotypes of P variants of an individual i ( 1  i  n ), where gip ( 1  p  P ) is coded as 0, 1, or 2. We use si,i ' and wi ,i ' to denote the phenotypic similarity and genetic similarity between individuals i and i’, respectively. The phenotypic similarity can be measured by any 2-degree kernel function, si ,i '  h( yi , yi ' ) , which satisfies the finite second moment condition ( EF (h2 (Y1 , Y2 ))   ). While various kernel functions can be used to measure the phenotypic similarities, we used the cross product kernel, si ,i '  qi qi ' , where qi is the normal quantile of the rank of 9      yi , defined as qi  1 ((rank ( yi )  0.5) / n) , and  1 () is the inverse cumulative distribution function for standard normal distribution. An averaged rank is assigned in the presence of ties. For example, if there are n0 numbers of controls (i.e., yi  0 ) and n1 numbers of cases (i.e., yi  1 ), an average rank is assigned to the group with the same phenotype value (e.g., rank ( yi { yi , yi  0})  (n0  1) / 2 is assigned to the control group). Other than the quantile-transformed cross product kernel, distance-transformed phenotypic similarities such as si ,i '  exp( | yi  yi ' |) or si ,i '  exp(( yi  yi ' )2 ) can also be used. Nevertheless, as demonstrated below, the use of the quantile-transformed cross product kernel leads to nice asymptotic properties. Genetic similarity, wi ,i ' , can be calculated using a variety of similarity functions. One commonly used similarity function for sequencing data is the weighted IBS, which gives more weight to rare variants, P 2 | Gi , p  Gi ', p | p 1   p (1   p ) wi ,i '   where  p is the minor allele frequency for , the p-th rare variant, and P    2 /  p (1   p ) is used to standardize the weight function so that wi ,i ' [0,1] . In p 1 addition to the weighted IBS, distance-transformed similarity functions can also be used. For example, we could use wi ,i '  exp( Di ,i ' ) , where Di ,i ' is the distance function (e.g., a Euclidian distance). Given si,i ' and wi ,i ' , the weighted U is formed to measure the association of P genetic variants with the disease phenotype, 10      Uw  1  wi,i ' si,i ' , n(n  1) i i ' (1) where si,i ' is the 2 degree U kernel and wi ,i ' is the weight function for the weighted U. When wi ,i '  1 , we can construct an un-weighted U by using only the phenotype similarity, U uw  1  si,i ' . n(n  1) i i ' (2) In the weighted U, the summation is over all phenotypic similarities weighted by the genotypic similarity, whereas only the phenotypic similarity is used for the un-weighted U. When the genetic variants are associated with the phenotype, we would expect genetic similarities to be concordant with phenotypic similarities, where larger weights are given to larger values of phenotypic similarity. Therefore, the weighted U has a larger value than the un-weighted U. Based on this concept, we could build an association test by comparing the weighted U with the un-weighted U. The two U statistics, however, are based on weights of different scales (i.e., wi ,i ' vs. constant 1), therefore, a constant c is introduced to balance the two weight functions. The test statistic is then defined as, WU seq  U w  cU uw , (3) where the scaling constant c can be obtained by minimizing the L2 norm distance between the two weight metrics, i.e., c  arg min{ ( wi ,i '  c) 2 } . Alternatively, we could c 0 i i ' choose other types of c. For instance, we could obtain c by minimizing the L1 norm distance between the two weight metrics, i.e., c  arg min{ | wi ,i '  c |} . c0 11    i i '   1.2 Asymptotic Distribution When the weighted U is significantly larger than the rescaled un-weighted U, we reject the null hypothesis and conclude there is an association between P genetic variants and the phenotype. The p-value can be obtained by comparing the observed test statistic, obs obs WU seq ) . For a small sample size, a , with the null distribution, i.e., Pr(WU seq  WU seq permutation test can be used to calculate the p-value. However, for a large sample size and high-dimensional data, a permutation test could be computationally intense. Therefore, we derive the asymptotic distribution of WU seq to efficiently assess the significance level of the association. We first rewrite the test statistic, WU seq , of equation (3) as a weighted summation of si,i ' , WU seq  1  ki,i ' si,i ' , n(n  1) i i ' (4) where we define a new weight ki ,i '  wi ,i '  c . If the phenotypic similarity is measured by the cross product kernel, si ,i '  qi qi ' , WU seq is simplified to a quadratic form, with all the diagonal elements equal to 0 ( ki ,i  0 ). In such a case, it has a close connection with the variance component score test in the linear mixed model, except that WU seq does not use information from the diagonal terms ( ki ,i  0 ), and does not assume a Gaussian distribution of the phenotype. In U statistics theory, the asymptotic behavior of U depends on  1  Var ( E (h(Y1 , Y2 ) | Y2 )) [Serfling 1981]. If  1  0 , the U statistic is a non-degenerated U and can be approximated 12      by a normal distribution. If  1  0 , the U statistic is a degenerated U and can be approximated by a mixture of chi-squared distribution. Since E(q1q2 | q2 )  q2 E(q1 )  0 , we have  1  0 . Thus the statistic WU seq is, by definition, a degenerated weighted U statistic, and its limiting distribution can be obtained by taking the decomposition of the weight function ki,i ' and the kernel function si,i ' [Serfling 1981; Shieh, et al. 1994; Wet and Venter 1973]. WU seq can then be approximated by a linear combination of chisquared random variables,  n m 1 l 1 nWU seq    m  l (1,2ml  1) , (5) 2 are iid chi-squared random variables with 1 degree of freedom and l where 1,ml (l=1,…,n) are obtained from the eigen-values l of matrix K  {ki ,i ' }nn , with l  l / (n 1) . {m} are the eigenvalues of a general kernel function h (,) , obtained from the following decomposition,  h(q1 , q2 )    mm (q1 )m (q2 ) , m 1 where {m ()} are the ortho-normal eigenfunctions corresponding to m . For the cross product kernel, we can show that h(q1 , q2 )  q1q2  1 (q1 )1 (q2 ) , where 1 (q)  q . Thus,  m  1{m1} ,  n n m 1 l 1 l 1 where 1 is an indicator function, and  m  l   l  trace( K ) / (n  1)  0 . The limiting distribution of WUseq can be 13       D simplified to nWU seq   2, , where  2,   l 1,2l is mixture of chi-squared distributed l 1 with a 0 mean and a finite variance (Appendix A). Given the asymptotical distribution of WU seq , the p-value can then be calculated using the Davies method [Davies 1980]. In Appendices (Figure A3), we graphically showed the shape of the distribution of the mixture chi-squares and showed that it is very close to the permutation results when the sample size is large. 1.3 Covariates Adjustment Assume we have J covariates, X i  (1, xi1 ,, xiJ ) , i  1, 2,..., N . To adjust for the potential confounding effects, we will first fit the transformed value of the phenotype, Q  (q1 , q2 ,...qN )T , with covariates X  ( X1 , X 2 ,..., X N )T , and then use the residuals for the association test. Based on this idea, we project Q onto the space spanned by X , and obtain the prediction Qˆ , where Qˆ  X ( X T X ) 1 X T Q . By denoting H  I  X ( X T X )1 X T and ˆ 2  (Q  Qˆ )T (Q  Qˆ ) / ( N  J  1) , we can obtain the residuals, Qe  HQ / ˆ . It is easy to show that Qe  X , so that a new response vector is attained, which is perpendicular to the space spanned by the covariates. The test statistic WU seq can be reconstructed as WU seq   ki ,i ' qie qie' . i i ' Using the same argument as above, WU seq with covariate adjustments can also be approximated by a linear combination of chi-squared random variables, 14      n nWU seq ~  le 1,2l . l 1 le  le / (n  1) and {le } are the eigen-values of matrix K e , where K e  HKH . Similar to other methods, we should consider potential issues, such as nonlinearity of covariates and high correlations between genetic variables and covariates, before implementing the covariate-adjusting approach. The residual confounding is negligible when the model is correctly specified. Directly using the approach without considering these issues may lead to power loss. 1.4 Simulation We conducted simulation studies to evaluate the performance of WU-SEQ, comparing it with one of the most commonly used methods, SKAT [Wu, et al. 2011]. For all of the simulations, we used genetic data from the 1000 Genome Project [Abecasis, et al. 2010], to mimic a real sequencing data structure (e.g., LD pattern and allele frequency). Specifically, we used a 1Mb region from the genome (Chromosome 17: 73443288344327), and randomly chose a 30kb segment from that 1Mb region for each simulation replicate (if not specified otherwise). There are an average of 194 SNVs in the sampled 30kb segments. The minor allele frequency (MAF) of the SNV in the genome region ranged from 4.50×10-4 to 4.99×10-1, with a distribution highly skewed to rare variants (34.8% of the variants with an MAF<0.001, 69.1% of the variants with an MAF<0.01 and 80% of the variants with an MAF<0.03). Similar to the SKAT simulation studies [Wu, et al. 2011], we selected a portion of the genetic variants with an MAF<0.03 as functional SNVs. A number of individuals, ranging from 50 to 500, were randomly 15      chosen as the study samples from all of the available individuals in the 1000 Genome Project. For each disease model, we simulated 1000 data replicates and obtained the type 1 error and power for both WU-SEQ and SKAT. For SKAT, we used the link function according to the phenotype distributions (i.e., the logit link for a binary phenotype and the identity link for a continuous phenotype) [Wu, et al. 2011]; for WU-SEQ, we used the L2 norm to choose the constant c, and did not specify an assumption on the phenotype distribution. To be consistent, both methods used a weighted IBS to summarize genetic information. The type 1 error and power were obtained by calculating the percentage of p-values smaller than 0.05 from the 1000 data replicates. 1.4.1 Simulation I: We first simulated a series of disease models, without considering covariates, and investigated the influence of the direction and magnitude of effects on both methods. Each data replicate was comprised of 500 subjects. We considered 4 types of distributions for the phenotypes: binary, Gaussian, Student’s t with 2 degrees of freedom, and Cauchy. Binary and Gaussian phenotypes are typically observed in population-based studies. The Cauchy-distributed and t-distributed phenotypes represent continuous phenotypes with more extreme values (i.e., heavy-tailed). We used the logistic model to simulate the binary phenotype, logit( P ( yi  1))    Gi  , where Gi and yi were the genotype and phenotype of the i-th individual, respectively.  was a vector of regression parameters, measuring the effects of the genetic variants. For 16      each simulation replicate, we sampled an effect vector from a multivariate normal   distribution, MVN ( 1, 2 I ) , where 1 was the vector of 1 and I was the identity matrix. For Gaussian phenotypes, we simulated the model as, yi    Gi    i , where  i ~ N (0,  2 ) . For the t-distributed phenotype, we simulated the model as: yi    Gi    i ,  i ~ tdf 2 . For the Cauchy type of phenotype, we used yi ~ cauchy ( ai , b ) . ai and b were the location parameter and the scale parameter of the Cauchy distribution, respectively, where ai    Gi  and b was a fixed value. For all four types of phenotypes, we considered different directions of genetic effects. For the first scenario, we assumed   0 , whereby half of the functional SNVs were deleterious and half of the functional SNVs were protective. For the second scenario, we assumed   0 , whereby the majority of the functional SNVs were deleterious. For each scenario, we varied the percentage of functional SNVs from 5% to 50%. The details of the simulation setting are provided in Table A1. 17      Table 1: Type I error and power comparisons of WU-SEQ and SKAT under a different direction and magnitude of effects Effect* Pct** Type I Error/Power Binary Null A1 A2 Normal Student’s t Cauchy SKAT WU-seq SKAT WU-seq SKAT WU-seq SKAT WU-seq 0 0.037 0.037 0.037 0.043 0.110 0.038 0.194 0.041 5 0.281 0.284 0.174 0.185 0.164 0.193 0.167 0.116 10 0.417 0.423 0.315 0.329 0.212 0.331 0.160 0.201 30 0.828 0.841 0.675 0.688 0.369 0.695 0.192 0.451 50 0.935 0.944 0.856 0.869 0.525 0.880 0.177 0.635 5 0.065 0.073 0.085 0.093 0.140 0.134 0.190 0.095 10 0.166 0.173 0.155 0.163 0.192 0.288 0.175 0.179 30 0.539 0.547 0.524 0.527 0.471 0.784 0.199 0.578 50 0.773 0.780 0.790 0.798 0.733 0.960 0.187 0.818 * “Null” corresponds to the null model with no functional variant; “A1” corresponds to the setting where half of the functional rare variants have a deleterious effect and the other half of functional rare variants have a protective effect; “A2” corresponds to the setting where all of the functional rare variants have deleterious effects. ** Percentage of functional rare variants. We have summarized the results in Table 1. From Table 1, we can see that WU-SEQ had a well-controlled type 1 error rate under various phenotype distributions. In contrast, SKAT had an inflated type 1 error rate when the underlying distribution was the Cauchy distribution (0.194) or t distribution (0.110). Similar to SKAT, WU-SEQ allowed for different directions of genetic effects (i.e., both deleterious and protective effects). For both scenarios (i.e.,   0 and   0 ), WU-SEQ obtained a power that was comparative to or at a slightly higher power than SKAT. As the percentage of the functional SNVs increased, both WU-SEQ and SKAT gained improved power for binary and Gaussian 18      phenotypes. For the Cauchy phenotype and t-distributed phenotype, however, the power of WU-SEQ was significantly higher than that of SKAT. With the Cauchy phenotype, for example, SKAT’s power ranged from 0.167 to 0.192 for   0 and ranged from 0.175 to 0.199 for   0 , while the power of WU-SEQ increased from 0.116 to 0.635 for   0 and increased from 0.095 to 0.818 for   0 . 1.4.2 Simulation II: In simulation II, we investigated the influence of sample size on the performance of WUSEQ and SKAT. We used the same models as in simulation I to simulate binary, Gaussian, Student’s t and Cauchy phenotypes. For simplicity, we assumed   0 and a fixed  2 , and varied the sample size from 50 to 500 (Table A2). We assumed 0% of the genetic variants were functional for the null model, and assumed 50% of the genetic variants were functional for the disease models. The results are summarized in Table 2. Type 1 error rates of WU-SEQ were well controlled at the 0.05 level for different sample sizes (i.e., 50, 100, 200 and 500) and various phenotype distributions (i.e., binary, Gaussian, Student’s t and Cauchy), while the type 1 error of SKAT was inflated for the Cauchy phenotype (0.118~0.194) and the tdistributed phenotype (0.110~0.131). We also observed that both WU-SEQ and SKAT had conservative type I error rates when the study sample sizes were small. For instance, the type I error rates of SKAT and WU-SEQ were 0.001 and 0.005, respectively, when the sample size was 50 and the phenotype was binary. 19      Table 2: Type I error and power comparisons of WU-SEQ and SKAT for different sample sizes Effect* Sample size Type I Error/Power Binary Normal Student’s t Cauchy SKAT WU-seq SKAT WU-seq SKAT WU-seq SKAT WU-seq 50 0.001 0.005 0.027 0.047 0.117 0.044 0.118 0.041 100 0.007 0.013 0.035 0.050 0.131 0.051 0.134 0.052 200 0.022 0.029 0.029 0.039 0.121 0.043 0.165 0.056 500 0.037 0.037 0.037 0.043 0.110 0.038 0.194 0.041 50 0.016 0.035 0.145 0.177 0.160 0.164 0.126 0.086 100 0.119 0.181 0.279 0.318 0.220 0.296 0.160 0.164 200 0.494 0.536 0.521 0.545 0.317 0.557 0.178 0.291 500 0.935 0.944 0.856 0.869 0.525 0.880 0.177 0.635 Null A1 * “Null” corresponds to the null model with no functional variant, “A1” corresponds to the setting where half of the functional rare variants have a deleterious effect and the other half of the functional rare variants have a protective effect. The power of WU-SEQ increased as the sample size increased for all four types of phenotypes. The power of SKAT remained almost the same for the Cauchy phenotype (0.126~0.177) when the sample size increased. For both the binary and Gaussian phenotypes, the power of WU-SEQ was comparable or slightly higher than that of SKAT, while the power of WU-SEQ was significantly higher than that of SKAT for the t- and Cauchy-distributed phenotypes. 20      We performed additional simulations to evaluate the performance of WU-SEQ for highdimensional settings. In particular, we randomly chose a 500kb segment for each simulation replicate. The sample sizes were fixed at 100 and we simulated the phenotype using the settings in Table A2. There was an average of 1873 SNVs for each subject. The results show that WU-SEQ can control the type 1 error well, and maintain a good power performance when the number of SNVs is much larger than the sample size (Table 3). Table 3: Type I error and power comparisons of WU-SEQ and SKAT when the number of variants is much larger than the sample size* Effect Null A1 Method Distribution Binary Normal Student’s t Cauchy SKAT 0.007 0.021 0.130 0.196 WU-seq 0.012 0.059 0.042 0.059 SKAT 0.153 0.896 0.708 0.264 WU-seq 0.304 0.933 0.864 0.558 *The average number of SNVs is 1873 in these settings, while the sample size is set at 100. The effect size is set as the same as in Table A2 1.4.3 Simulation III: In genetic association analyses, we often need to adjust important covariates (e.g., gender) for potential confounding effects. Therefore, we conducted another set of simulations to investigate the performance of WU-SEQ considering covariate adjustments. In this simulation, we simulated two covariates, x1,i ~ Bernoulli(0.3) and x2,i ~ N (0,1) , for each subject. The binary phenotype was simulated using 21      logit( P ( yi  1))    X i  Gi  , where X i  ( x1,i , x2,i ) and   ( 1 ,  2 )T were the covariates and the effects of the covariates, respectively. Similarly, we used a linear model for the Gaussian phenotype, yi    X i  Gi    i ,  i ~ N (0,  2 ) , and the following model for the t-distributed phenotype, yi    X i  Gi    i ,  i ~ tdf 2 . The Cauchy phenotype was simulated using, yi ~ cauchy ( ai , b ), where ai    X i  Gi  . We varied the sample size from 50 to 500, and fixed the percentage of functional rare variants at 50% for the disease models (Table A3). The results are summarized in Table 4. We found that the projection method worked very well for WU-SEQ with regards to covariate adjustment, and the type I error was wellcontrolled under different phenotype distributions. SKAT, however, only had a wellcontrolled type 1 error when the underlying assumptions were satisfied and the link function was correctly specified. Similar to simulations I and II, we observed comparable power between WU-SEQ and SKAT for both the binary and Gaussian phenotypes, and WU-SEQ had significantly higher power than SKAT for the Cauchy and t-distributed phenotypes. We also found the type 1 error rates were less conservative for studies with a binary phenotype with covariate adjustment. This could be due to the fact that the increased levels of residuals after covariate adjustment lead to a better approximation of normal distribution. 22      Table 4: Type I error and power comparisons of WU-SEQ and SKAT for covariate adjustment Effect Sample size Type I Error/Power Binary Normal Student’s t Cauchy SKAT WU-seq SKAT WU-seq SKAT WU-seq SKAT WU-seq 50 0.027 0.027 0.040 0.059 0.084 0.069 0.128 0.063 100 0.030 0.034 0.042 0.054 0.106 0.064 0.131 0.065 200 0.028 0.028 0.036 0.050 0.110 0.060 0.173 0.061 500 0.036 0.036 0.043 0.048 0.115 0.065 0.194 0.047 50 0.060 0.071 0.128 0.165 0.297 0.400 0.140 0.112 100 0.156 0.210 0.306 0.342 0.483 0.664 0.164 0.159 200 0.522 0.545 0.500 0.527 0.682 0.882 0.163 0.262 500 0.929 0.934 0.850 0.855 0.884 0.997 0.180 0.554 Null A1 * “Null” corresponds to the null model with no functional variant; “A1” corresponds to the setting in which half of the functional rare variants have a deleterious effect, and the other half of the functional rare variants have a protective effect. The above three sets of simulations show the performance of WU-SEQ compared with SKAT under different scenarios. We also conducted additional simulations to investigate the performance of WU-SEQ for different choices of c (Table 5), and different choices of U kernel (Table 6). The tests in which c was chosen based on L1 norm and L2 norm showed that both can protect the type I error, but the L2 norm shows slightly higher 23      power (Table 5). We also compare the U kernel with a quantile transformation and without a quantile transformation (using the rank of the phenotype value). Both can control the type I error, but using a quantile transformation leads to higher power for normal and t-distributed phenotypes (Table 6). Table 5: Type I error and power comparisons of WU-SEQ by using different c* Effect Null A1 Method Distribution Binary Normal Student’s t Cauchy WU-SEQL1** 0.013 0.048 0.056 0.041 WU-SEQL2*** 0.013 0.049 0.058 0.043 WU-SEQL1 0.164 0.313 0.300 0.150 WU-SEQL2 0.17 0.322 0.306 0.159 *The sample size for this simulation is 100 and the effect size is set as the same as in Table A2. **In WU-SEQL1, we choose c using the L1 norm. ***In WU-SEQL2, we choose c using the L2 norm. Table 6: Type I error and power comparison of WU-SEQ by using normal quantile or rank* Effect Null A1 Method Distribution Binary Normal Student’s t Cauchy WU-SEQRK** 0.017 0.02 0.031 0.032 WU-SEQQT*** 0.017 0.047 0.057 0.050 WU-SEQRK 0.161 0.243 0.304 0.181 WU-SEQQT 0.161 0.314 0.322 0.169 *The sample size for this simulation is 100 and the effect size is set as the same as in Table A2. **In WU-SEQRK, we use a cross-product kernel based on the rank of the phenotype value without a quantile transformation. ***In WU-SEQQT, we use a cross-product kernel with a quantile transformation. 24      1.5 Application to the DHS Sequencing Data To further evaluate the performance of WU-SEQ and SKAT, we applied both methods to the sequencing data from the Dallas Heart Study (DHS) [Romeo, et al. 2009]. The DHS sequencing data is comprised of 4 genes, ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6. We were interested in evaluating the association of these genes with body mass index (BMI), cholesterol, and very low density lipoprotein cholesterol (VLDL). Following a previous study [Wu, et al. 2011], we considered age, gender, and race as covariates in the model. Prior to the association analysis, we re-assessed the quality of the genotype data. As a part of the quality assessment, we eliminated the SNVs and subjects that had a high missing rate. After the quality control, 230 rare variants (54 SNVs, 63 SNVs, 61 SNVs, and 52 SNVs, from ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6, respectively) and 2598 subjects remained for the analysis. In the analysis, a random imputation based on allele frequency was used to impute missing genotypes. The distribution of the SNVs in the DHS was heavily skewed to the rare variants (Figure S2), wherein 93.5%, 87.4%, and 70% of all SNVs had an MAF of less than 3%, 1% and 0.1%, respectively. Similar to previous studies [Wu, et al. 2011], we selected SNVs with an MAF<3% for the analysis in order to detect associations due to rare variants. The distributions of MAF for the 4 genes are given in Figure 1. From Figure 1, we can see that the distributions of MAF were highly skewed to the left, with the majority of the SNVs having an MAF<0.1%. The distributions of the three phenotypes, BMI, cholesterol and VLDL, can be viewed in Figure 2. Among the three phenotypes, the distribution of VLDL was heavily skewed to the left, which is unlikely to follow the Gaussian distribution. 25      Figure 1: Distribution of SNVs with an MAF<0.03 for the 4 genes in the Dallas Heart Study Figure 2: Distributions of the 3 phenotypes in the Dallas Heart Study       26      We applied both WU-SEQ and SKAT to the association analyses of 4 genes, with consideration of three covariates: gender, race and age (Table 7). To be consistent, we used a weighted IBS for both WU-SEQ and SKAT. Table 7: The association of 4 candidate genes with 3 continuous phenotypes (i.e., BMI, Cholesterol, and VLDL) in the Dallas Heart Study Gene BMI Cholesterol VLDL SKAT WU-SEQ SKAT WU-SEQ SKAT WU-SEQ ANGPTL3 0.633 0.752 0.559 0.662 0.471 0.198 ANGPTL 4 0.121 0.255 0.16 0.316 0.105 0.007 ANGPTL 5 0.633 0.607 0.95 0.926 0.683 0.664 ANGPTL 6 0.874 0.773 0.025 0.059 0.433 0.453 All 4 genes 0.503 0.641 0.117 0.373 0.316 0.032 * P-value from the association analysis, adjusting for age, gender, and race.  WU-SEQ detected a strong association of ANGPTL4 (p-value=0.007) with VLDL, while SKAT found only a marginal association (p-value=0.105). To further explore this finding, we combined all 4 genes into a gene set, and then tested its association with VLDL (Table 7). The association remained significant (p-value=0.032) using WU-SEQ, but not SKAT (p-value=0.316). None of the 4 genes was found to be associated with BMI using either WU-SEQ or SKAT. A marginal association was detected between ANGPTL6 and cholesterol, with a p-value of 0.059 from WU-SEQ and a p-value of 0.025 from SKAT. To evaluate the performance of both methods for the binary phenotype, we dichotomized the data into a highest quartile (coded as 1), and a lowest quartile (coded as 0) for each 27      phenotype (Table A4). Overall, WU-SEQ attained similar results to SKAT. For example, both methods found no association of all 4 genes with BMI, a marginal association of all 4 genes with cholesterol, and a significant association of all 4 genes with VLDL. 1.6 Discussion Targeted, exome and whole-genome sequencing studies are now underway for the discovery of new genetic variants, particularly rare variants, associated with complex diseases. With the emerging of a large amount of high-dimensional sequencing data, great challenges have been posed to statistical analysis of sequencing data. Conventional single-locus analyses have been shown to have low power for analyzing sequencing data, not only because of the low frequencies of rare variants, but also because of the use of a more stringent significance threshold. Joint association analyses of multiple genetic markers, as demonstrated in several studies, can greatly reduce the dimensionality and obtain powerful performance for sequencing data [Madsen and Browning 2009; McCarthy, et al. 2008; Neale, et al. 2011; Tzeng, et al. 2009; Wu, et al. 2011]. Non-parametric methods, such as U-statistic-based methods, have shown great promise for high-dimensional data analysis, especially when the underlying phenotype distributions and modes of inheritance are unknown. Several U-statistic-based methods were recently adopted in genetic association studies to detect common variants underlying complex human diseases [Li 2012; Li, et al. 2011; Schaid, et al. 2005; Wei, et al. 2012; Wei, et al. 2013a]. In this chapter, we propose a non-parametric method, WUSEQ, for testing the joint association of multiple SNVs with disease phenotypes. Built under the framework of the weighted U statistic, WU-SEQ is robust against different 28      underlying distributions of phenotypes. As demonstrated by the simulation study, WUSEQ had well-controlled type 1 errors and attained high power under binary, Gaussian, tdistributed and Cauchy phenotypes. In contrast, the performance of existing methods, such as SKAT, depends on satisfying of the underlying assumptions. If the assumptions were violated (e.g., the distribution followed a heavy-tailed distribution such as Cauchy), SKAT had an inflated type 1 error, and had low power to detect an association. Although SKAT can handle classic types of phenotypes (e.g., the exponential family) by using appropriate link functions, WU-SEQ can be applied to a wider range of phenotypes without any distribution assumptions. Here, we used the normal quantile to build the test statistic. Alternatively, rank can also be used to construct the test statistic. When the sample size is sufficiently large, we expect that using quantile and using rank would reveal similar results. Nevertheless, for small sample sizes, using a normal quantile could be more powerful and less conservative than using rank. Yet, when the distribution is heavily tailed (e.g., Cauchy), using rank could attain a slight advantage over using quantile (e.g., having a well-controlled type I error) for small sample sizes (Table 6). As a similarity-based method, WU-SEQ is flexible, to accommodate various types of data. The genetic variants used to construct genetic similarity are not constrained to categorical data (e.g. SNV); they can also be count data (e.g., CNV) and continuous data (e.g., expression data). Moreover, the genetic variants to be jointly tested can be of high dimension. In the simulation studies, when the number of variants is much larger than sample sizes, we showed that WU-SEQ can still control type I error and achieve satisfactory power. In high dimensional settings, it is also important to select meaningful genetic variants, as well as to select appropriate similarity kernels and optimize the computation. 29      In WU-SEQ, we first summarize the information from multiple markers into genetic similarities, and then evaluate the genetic similarities with the corresponding phenotypic similarities. If the genetic similarities are concordant with the trait similarities, we anticipate a large value for the test statistic, from which we could infer an association. Here, we used a weighted IBS to construct genetic similarity, which assigns more weights to rare variants. Other types of genetic similarity metrics can also be used, such as those that consider interactions. Prior knowledge can also be incorporated, by assigning a different weight for each variant according to its biological plausibility. To measure the phenotype similarity, we used the cross-product kernel, si ,i '  h(qi , qi ' )  qi qi ' , which led to nice asymptotic properties of the test statistic. Nevertheless, we can use other types of kernel functions for the phenotype similarity. If the kernel function satisfies two regularity conditions, EF (h2 (Y1 , Y2 ))   and Var ( E (h(Y1 , Y2 ) | Y2 ))  0 , we can calculate the asymptotic p-value by approximating the weighted U to a linear combination of chi-squared variables. Even if the two conditions are not satisfied, we can still use permutation to obtain the p-value of the association test. The choice of the genetic- or phenotype-similarity metrics depends on the underlying disease model, which could lead to different performance. For example, when the percentage of risk variants increases and/or when the effect sizes of risk variants follow one direction, the burden tests are expected to be more powerful than WU-SEQ. In this case, we can use genetic similarities accommodating the underlying disease models. One of the strategies is to first collapse all genetic variants by weighted sum and then calculate the genetic similarity. Although joint association analyses greatly reduce the dimensionality for sequencing data, the computation could be intense if a permutation test is used. We derived the asymptotic 30      distribution for WU-SEQ to facilitate the high-dimensional data analysis. When the sample size of a study is small, asymptotic properties may not hold and a permutation test can be used. We also use a projection method in WU-SEQ to take covariates into account. Through simulations and a real data analysis, we found the covariate-adjusted WU-SEQ was robust for different types of phenotype distributions. In addition, we found that WUSEQ had almost the same power as SKAT when the phenotype followed a Gaussian distribution and covariates were not considered. The reason for this is that SKAT uses a variance component score test, and WU-SEQ has a close connection with the variance component score test for the Gaussian-distributed phenotype. In fact, when covariates are not considered, SKAT can be viewed as a special case of WU-SEQ by using the crossproduct kernel without a quantile transformation. When covariates are considered, especially when the phenotype does not follow the Gaussian distribution (i.e. the link function in SKAT is not an identity link), WU-SEQ could attain more computational efficiency than SKAT. In SKAT, one needs to first fit the null model using a generalized linear model, which involves iterative estimation. Furthermore, the calculation of a projection matrix in SKAT involves additional matrix multiplication (i.e., calculating H  V  VX ( X TVX )1 X TV in SKAT vs. calculating H  I  X ( X T X )1 X T in WU-SEQ, where V is the covariance matrix estimated from the null model), and the calculation of the limiting distribution in SKAT involves an additional large matrix decomposition (i.e., calculating H 1/2 KH 1/2 , where H 1/2 needs to be calculated from an eigen-decomposition of the H matrix). The covariate-adjusted WU-SEQ does not require iterative estimation, and needs less matrix multiplication and decomposition, which offers a greater computational advantage over SKAT. 31      In the analysis of the DHS study, WU-SEQ detected a strong association of ANGPTL 4 with VLDL. By further studying the distribution of VLDL, we found the distribution was heavily skewed, which does not fit the normal assumption. The advantages of WU-SEQ are not limited to this specific case; the method could be useful for other cases (e.g., small sample sequencing studies and sequencing studies with extreme-phenotype design). A recent version of SKAT also considers the extreme phenotype by assuming a truncated Gaussian distribution [Barnett, et al. 2013]. While SKAT needs to make adjustments and certain assumptions for the truncated distribution of extreme phenotypes, WU-SEQ, as a non-parametric method, can be directly applied to studies with extreme-value phenotypes. 32      CHAPTER 2: Gene Trait Similarity U test Most existing joint association tests are parametric or semi-parametric, which often rely on certain assumptions (e.g., the normal distribution assumption). When the assumptions are violated, it may lead to power loss or inflated type I error. Moreover, there lacks new methods for sequencing association analyses of multiple traits, especially when the study considers different types of traits (e.g., some traits are binary while others are continuous). The study of multiple traits is important in biomedical research. Many studies collect multiple biochemical measurements to reflect the underlying disease status. These multiple measurements are usually more related to the biological mechanism of the disease, and are, thus, more likely reflects heritable components of the disease. It also becomes popular for human genetic studies to study multiple related phenotypes. For instance, multiple traits, such as nicotine dependence and alcohol dependence, are collected to study genetic variants contributed to co-morbidity of substance dependence. Because of the sequencing cost, most sequencing studies are based on a small number of subjects. It is, therefore, crucial to develop a statistical method that has good statistical properties (e.g., non-conservative type I error and good power performance) for a small sample size. For example, the original SKAT has been considered conservative for small sample size studies. An Adj-SKAT was then proposed to adjust the conservative p-value using the moment-matching method [Lee, et al. 2012a]. Finally, few analytic methods for sequencing studies have targeted the computational issue. Most methods have been implemented in R, without consideration of computational optimization. In the era of exome-sequencing and whole-genome-sequencing, we need more efficient software (e.g. 33      based on C or C++) to handle the massive amount of data and bring the computation time to a reasonable range. To achieve these goals, we propose a Gene Trait Similarity U test, referred to as GTSU. In GTSU, we use two different similarity functions to summarize genetic information and phenotypic information, and then form a test under the weighted U framework. We have derived the asymptotic distribution of test statistics and implemented the method in both R and C++. The proposed method has several remarkable features: 1) it is non-parametric and is thus distribution free; 2) it can handle multiple traits; 3) it performs well under small sample size without any adjustment; 4) it is computationally efficient for massive amount of data (i.e., using C++ version of GTSU) ; and 5) it is statistically powerful. We have conducted extensive simulation studies to evaluate the type 1 error rates and compare the power of our method with several other popular methods. Finally, we applied our method to the Dallas Heart Study to test the association of four candidate genes with multiple cholesterol related traits. 2.1 Weighted U Statistic Suppose that n subjects are sequenced in the study, whereby we are interested in the association of L traits ( yi ,l , 1  i  n , 1  l  L ) with M genetic variants ( g i , m , 1  i  n , 1  m  M ). For each subject i, we observe a vector of traits yi ( yi  ( yi ,1 , yi ,2 ,..., yi , L ) ) and a vector of genotypes gi ( gi  ( gi ,1 , gi ,2 ,..., gi , M ) ). For the case of L=1 (or M=1), it is simplified to a single-trait association analysis (or single-locus analysis). When L>1, we can perform either a multiple-trait association analysis or a single-trait association analysis of each trait. In addition, each trait can be of a different type (e.g., continuous or 34      categorical). The number of genetic variants M can be very large, for example, in the case of a sequencing study, M can be greater than n. We first define the trait similarity S i , j between subject i and subject j by: Si , j  h( yi , y j ), and define their genetic similarity Ki , j by: Ki , j  f ( gi , g j ). Here, the similarity measurements can be of a general form as long as they satisfy the finite second moment condition (i.e. E (h 2 (Y1 , Y2 ))   and E ( f 2 (G1 , G2 ))   ). We then center the trait similarity by: Si , j  h ( yi , y j )  h( yi , y j )  E (h( yi , Y j ))  E (h(Yi , y j ))  E (h(Yi , Y j )) . (1) We also center the genetic similarity ( K i , j  f ( gi , g j ) ) in a same manner. The gene-trait similarity weighted U (GTSU) is then defined as the summation of the trait similarity weighted by the genetic similarity: U 1  K i , j Si, j , n(n  1) i  j (2) where K i , j s are considered as the weight function and Si , j s are considered as the U kernel. In our formulation of the weighted U statistic, the role of genetic similarity and trait similarity are interchangeable. In other words, we can also consider Si , j s as the weight function and K i , j s as the U kernel. 35      2.2 Similarity Measurement The form of a trait similarity h (,) and a genetic similarity f (,) are flexible. According to different types of genetic variants and the purpose of the analysis, we can choose different forms of trait similarities and genetic similarities. For a joint association analysis of multiple SNVs, where the genotypic data are categorical, we can use either IBS or weighted-IBS to measure the genetic similarity [Lynch and Ritland 1999]. Assuming the genetic variants ( g i ,m , 1  i  n , 1  m  M ) are respectively coded as 0, 1 and 2 for AA, Aa and aa, we can define the IBS-based genetic similarity between the individuals i and j as: M K IBS i, j   2 | g m 1 i ,m  g j ,m | 2M . Alternatively, we can use a weighted-IBS (wIBS) to emphasize the effects of rare variants. For a sequencing study, we can define wIBS-based genetic similarity as: M KiwIBS  ,j wm (2 | gi ,m  g j ,m |)  m 1 , where wm represents the weight for the m-th SNV in the SNV set, and  is a scaling M constant, defined as   2 wm . wm is usually defined as a function of minor allele m 1 frequency (MAF,  m ). For example, we can define the weight wm as wm  1/  m (1   m ) . When the genetic data are count data or continuous data, we can choose other forms of f (,) to measure the genetic similarity (e.g., euclidean distance based similarity), which we will leave for further investigations. 36      For trait similarity, we define a unified measurement for both categorical and continuous traits based on a normal quantile. For each trait yl ( 1  l  L ), we first define the corresponding normal quantile by, qi ,l   1 ((rank ( yi ,l )  0.5) / n) , where rank () calculates the rank of the trait value yi ,l in the corresponding trait vector yl . When there are ties, we assign the averaged rank.  () is the inverse cumulative density function for a standard normal distribution. We can then calculate the Euclidian Distance (ED)-based trait similarity between subjects i and j: L 2 SiED , j  exp(   l ( qi ,l  q j ,l ) ) , l 1 where l represents the weight for the l-th traits, and can be defined based on the importance of the traits. If there is no prior knowledge, we can just set all weights equal, l  1/ L . The ED-based trait similarity can be easily modified to take the correlation among the traits into account:  1  (qi  q j )T  (qi  q j )  , SiED , j  exp    L  where qi  (qi1 ,..., qiL )T and  can be chosen to reflect the correlation structure among the traits. For example, we ca n define  as, ( 1 n qi qiT ) 1 .  n i 1 Other than the ED-based trait similarity, we could also define the trait similarity based on cross product [Tzeng, et al. 2009]. The types of the trait similarity or the genetic 37      similarity can be chosen based on different purposes, which may influence the power of the method. For simplicity, we use ED-based trait similarity. 2.3 Hypothesis Testing By the definition of the centered similarity, we can show that E (h (Yi , Y j ))  0 and E ( f (G1 , G2 ))  0 (Appendix A). Under the null hypothesis, when the genetic variants are not associated with multiple traits, E (U )  0 (Appendix A). Under the alternative, when the genetic variants are associated with multiple traits, we expect that the trait similarity is concordant with the genetic similarity. In other words, the positive trait similarities is weighted heavier and the negative trait similarities is weighted lighter, leading to a positive value of U statistic. A formal statistical test can be performed to determine its significance level. The p-value can be calculated by P(U  Uobs ) , where U obs is the observed value of U. To access the significance of the genetic association, we can use permutation or bootstrap. However, both are computationally expensive for high-dimensional data. Thus, we derive the asymptotic distribution of GTSU. By considering the genetic similarity as a weight function and the trait similarity as a U kernel, GTSU is in the form of a weighted U statistic [Gregory 1977a; Lindsay, et al. 2008; O'Neil and Redner 1993; Shapiro and Hubert 1979; Shieh 1997]. More specifically, because its kernel satisfied Var ( E (h (Y1 , Y2 ) | Y2 ))  0 (Appendix A), GTSU is actually a degenerated weighted U statistic. We can decompose the centered trait similarity by:  h ( y1 , y2 )   ss ( y1 )s ( y2 ) , s 1 38      where {s } and {s ()} are eigenvalues and eigenfunctions for the kernel h (,) and all the eigenfunctions are orthogonal: 0, if s  s . 1, if s  s  s ( y1 )s ( y1 )dF ( y1 )   Similarly, we can decompose the centered genetic similarity by:  f ( g1 , g 2 )   tt ( g1 ) t ( g 2 ) . t 1 Then, we can write the GTSU as (see Appendix B2): 1    1 n *  U sign   t (Gi )s* (Yi )  ( )   t s  n  1 t 1 s 1  n i 1    n 1 1 sign(t s )  (t* (Gi )t* (Yi )) 2 ,   n  1 t 1 s 1 n i 1 2 where t* (Gi ) | t |0.5 t (Gi ) and s* (Yi ) | s |0.5 s (Yi ) . Using the above form, we can show that the limiting distribution of GTSU is the weighted sum of independent chi-square random variables. This is the result of theorem 1 below, which is proved in Appendix B2. Theorem 1. Let h (Y1 , Y2 ) and f (G1 , G2 ) respectively be the centered trait similarity and the genetic similarity. Define U as U  1/ (n(n  1)) h (Yi , Y j ) f (Gi , G j ) . Assuming i j D   t 1 s 1 E (h 2 (Y1 , Y2 ))   , E ( f 2 (G1 , G2 ))   and Y  G , we have nU   t  s (  st2  1) , where { st2 } are independent chi-square random variables with 1 degree of freedom. 2.4 The Power and Sample Size Calculation 39      In this subsection, we study the properties of GTSU under the alternative hypothesis. We derive the asymptotic distribution of GTSU under the alternative, and provide an analytical method to calculate the power and sample sizes for sequencing association studies. Under the alternative hypothesis, we assume E (h (Y1 , Y2 ) f (G1 , G2 ))    0 and Var ( E (h (Y1 , Y2 ) f (G1 , G2 ) | (Y2 , G2 )))   1  0 , where  measures the strength of the association. It is easily to show that GTSU is an unbiased estimator: E (U )  1  E ( f (Gi , G j )h (Yi , Y j ))   . n( n  1) i  j Using the Hoeffding projection, we can show that GTSU asymptotically follows a normal distribution with a mean of  and a variance of 4 1 / n . This is the result of Theorem 2, which is proved in Appendix B3. Theorem 2. Let h (Y1 , Y2 ) and f (G1 , G2 ) respectively be the centered trait similarity and the genetic similarity. Define U as U  1/ (n(n  1)) h (Yi , Y j ) f (Gi , G j ) . Assuming i j E (h 2 (Y1 , Y2 ))   , E ( f 2 (G1 , G2 ))   and Y is associated with G we have D n (U   )  N (0, 4 1 ) . The power of the GST at significance level  can be calculated by, P nU  q1   n (U   ) q1  n   P   2 n 1   2  1 n  q1  ( ), 2 n 1 40      where q1 is the 1   quantile for   t 1 s 1 t  s (  st2  1) and () is the CDF for standard normal distribution. The sample size to achieve power  can be calculated by solving ( n  q1 )   for the smallest n. Denote Z  as the  quantile for a standard normal 2 n 1 distribution, and the sample size is given by,  ( Z   1  Z 2 1   q1 )2   n  min n  . 2 nN    2.5 Computation and Implementations We can rewrite the trait similarity and the genetic similarity in the matrix format, S  {Si , j }nn and K  {K i , j }nn . The centered similarity matrix can be obtained by, S  ( I  J ) S ( I  J ) , K  ( I  J ) K ( I  J ) , where I is an n-by-n identity matrix and J is an n-by-n matrix, with all elements being 1 / n (Appendix B4). GTSU can be rewritten as: U 1  K i, j Si, j . n(n  1) i  j In this form, U can be viewed as the sum of the element-wise product of the two matrices, K 0 and S0 , where K 0 and S0 are obtained by assigning 0 to diagonal elements of matrices K and S . We use a matrix eigen-decomposition to approximate the eigenvalues in the function decomposition. Denote {s } and {k } as the eigen-values for the 41      matrices S0 and K 0 , respectively (Appendix B5). The limiting distribution of U is given by: n nU ~  s 1 s n t  n ( n t 1 2 st  1) , where { st2 } are independent chi-square random variables with 1 degree of freedom. The p-value can be calculated using Davies’ method [Davies 1980], Liu’s method [Liu, et al. 2009a] or Kuonen’s method [Kuonen 1999]. 2.6 Covariates Adjustment Important covariates, such as sample’s ethnic background, need to be carefully considered in genetic association studies. Failure to accommodate covariates that have confounding effects could lead to false positive results. In this section, we illustrate how to adjust covariates in GTST. Suppose that there are P covariates ( xi , p , 1  i  n , 1  p  P ) that need to be adjusted in the association study of L traits and M genetic variants. Let X denote the n-by-P matrix for the covariates ( X  {xi , p }nP ).Our strategy is to performed adjustment on both the genetic similarity and the trait similarity, and then to form a U statistic based on the adjusted similarities, which reflects the true association between the genetic variants and multiple traits. In order to do that, we use a two-sided projection [Lindsay, et al. 2008] to calculate the covariate-adjusted trait similarity ( S cov ) and genetic similarity ( K cov ): S cov  ( I  X ( X T X )1 X T ) S0cen ( I  X ( X T X ) 1 X T ) , K cov  ( I  X ( X T X ) 1 X T ) K 0cen ( I  X ( X T X ) 1 X T ) . 42      Then the covariate-adjusted GTSU can be calculated as: U cov  1 n2 n n  K i 1 j 1 cov i, j Sicov ,j . We include the diagonal terms in the covariate-adjusted similarities because they also contain the similarity information after the adjustment. In fact, the covariate-adjusted GTSU is a weighted V statistic, and its asymptotic distribution can be attained using a similar method as the one for the weighted U statistic [Shieh 1997]. After eigendecomposition of the two similarity matrices, the asymptotic distribution is also the sum of independent chi-square random variables: nU cov ~ n n 1 cov cov 2   s  k  s , k , n( n  P  1) s 1 k 1 where { s2,k } are independent chi-square random variables with 1 degree of freedom, {scov } and {kcov } are the eigen-values for the matrices S cov and K cov , respectively. 2.7 Simulation Method In the simulation studies, we used genetic data from the 1000 Genome Project [2010] to mimic the real data scenario. Specifically, we used a 1Mb region of the genome (Chromosome 17: 7344328-8344327) from 1092 individuals in the 1000 Genome Project. For each simulation replicate, we randomly chose a 30kb segment from the 1Mb region as the SNV set to be tested. From the SNV set, we set a part of the SNVs as causal. A number of individuals were randomly chosen as a simulation sample to study the performance of the methods under different sample sizes. 43      To investigate the robustness against different phenotype distributions, we simulated three types of traits: a binary-distributed trait, a Gaussian-distributed trait and a Cauchydistributed trait. The binary-distributed phenotype was simulated by using a logistic regression model: logit( P( yi  1))    giT  , where yi is the phenotype value for the i-th individual, g i is the observed genotype vector (coded as 0, 1, and 2), and  are the effects of the phenotypes and genotypes. The effects  for the genotypes were sampled from a uniform distribution with a mean of  and a variance of  2 . The Gaussian-distributed phenotype was simulated by using a conventional linear regression model: yi    giT    i ,  i ~ N (0, 2 ) , The Cauchy-distributed phenotype represents a situation in which the continuous trait contains more extreme values (i.e., heavy-tailed), and was simulated by the following model: ai    giT  , yi ~ cauchy ( ai , b ) , where ai and b are the location parameter and scale parameter for the Cauchy distribution. Two sets of simulations were performed. In simulation I, we considered a single trait, and in simulation II we considered multiple-traits. 2.7.1 Simulation I: 44      The null models were simulated by setting   0 and  2  0 , and changing the sample size from 50 to 500 (i.e. 50, 100, 200 and 500). Two series of alternative models were simulated (details in Table B1): 1. Set   0 and  2  0 , so that half of the functional SNVs were deleterious and the other half were protective. 2. Set   0 and  2  0 , so that the majority of the causal SNVs were deleterious. 2.7.2 Simulation II: Two sets of models were simulated: 1. Assume the multiple traits follow the same distributions. In particular, we simulated 3 Binary-distributed traits (BBB), 3 Gaussian-distributed traits (GGG), and 3 Cauchy-distributed traits (CCC). 2. Allow that the multiple traits follow different distributions. In particular, we simulated 3 traits with 2 Binary and 1 Gaussian (BBG), 3 traits with 1 Binary and 2 Gaussian (BGG), and 3 traits with 1 Binary, 1 Gaussian and 1 Cauchy (BGC). For the null model, we set   0 and  2  0 (details in Table B2). For the alternative models, we set   0 ,  2  0 and allowed the multiple traits to be influenced by the different sets of causal SNVs, by repeatedly choosing 30% SNV of the SNV-set on the 30kb segment as functional for each trait. 2.8 Simulation Results We evaluated the performance of GTSU by comparing it with three existing methods: SKAT, AdjSKAT and SKATO [Lee, et al. 2012a; Wu, et al. 2011]. For each simulation, 45      we simulated 1000 data replicates and calculated type 1 error or power. In GTSU, we used a wIBS-based genetic similarity and ED-based trait similarity to construct the statistic. To be consistent, we used the same weighted IBS to construct the kernel for SKAT [Wu, et al. 2011]. Because AdjSKAT and SKATO are not implemented with a weighted IBS kernel, we used the default kernel, the weighted linear kernel, when applying these two methods. SKAT, AdjSKAT and SKATO are designed for single trait analysis. In order to analyze data with multiple traits, we perform single trait analysis and obtained a p-value for each trait. A Bonferroni correction was then used to take the multiple tests into account. 2.8.1 Simulation I: We summarized the type I errors of the 4 methods in Table 8. GTSU had well-controlled type I error regardless of the phenotype distributions (i.e., binary, Gaussian or Cauchy) and the sample size (i.e., ranged from 50 to 500). Nevertheless, SKAT, AdjSKAT and SKATO had an inflated type I error (ranged from 0.101 to 0.19) for a Cauchy-distributed trait. When the sample size was small (e.g. 50 or 100), SKAT and SKATO also had a conservative type I error. The power comparisons were summarized in Figures 3 and 4. For the first disease model, in which half of the causal SNVs were deleterious (Figure 1), GTSU consistently had higher power than the other methods for different sample sizes ranged from 50 to 500. For the binary trait and a sample size of 50, the power of AdjSKAT (power=0.114) was marginally larger than SKAT (power=0.024) and SKATO (power=0.057), while the power of GTSU (power=0.346) was significantly larger than all three SKAT-based methods. 46      Table 8: type I error for single trait analysis Sample size 50 100 200 500 Method Trait distribution Binary Gaussian Cauchy SKAT 0.001 0.03 0.122 SKATO 0.021 0.041 0.101 AdjSKAT 0.054 0.028 0.123 GTSU 0.058 0.046 0.051 SKAT 0.014 0.028 0.149 SKATO 0.035 0.038 0.12 AdjSKAT 0.063 0.027 0.139 GTSU 0.046 0.05 0.053 SKAT 0.023 0.04 0.155 SKATO 0.024 0.042 0.14 AdjSKAT 0.043 0.042 0.156 GTSU 0.057 0.048 0.045 SKAT 0.039 0.055 0.19 SKATO 0.052 0.056 0.158 AdjSKAT 0.049 0.053 0.18 GTSU 0.038 0.046 0.043 47      Figure 3: Power comparison for a single trait analysis when half of the causal SNVs are protective and the other half of the causal SNVs are deleterious For the Gaussian trait, the three SKAT-based methods had similar performance, while GTSU has much higher power than the SKAT-based methods. For the Cauchy trait, where the distribution assumption is not satisfied, we could hardly see power improvement for the SKAT-based methods with the increased sample size. On the other hand, GTSU does not rely on the assumption, and had power improvement when the sample size increased. For the second disease model where majority of the phenotype is deleterious (Figure 4), we observed consistently higher power of GTSU over the other methods regardless of the sample sizes and trait distributions. 48      Figure 4: Power comparison for a single trait analysis when the majority of the causal SNVs are deleterious 2.8.2 Simulation II: We summarized the type I error for simulation II in Table 9. Similar to the results of the single trait analysis, GTSU can correctly control the type I errors at the 0.05 level for all simulation scenarios (Table 9). All of the other three parametric methods (SKAT, SKATO and AdjSKAT) had an inflated type I error when one or more than one trait did not satisfy the parametric distribution assumption (CCC and BGC). When the distribution assumptions were satisfied, AdjSKAT was able to correctly control the type 1 error. We found, however, SKAT and SKATO had conservative type I errors for small sample sizes (50 and 100), especially when one or more than one trait were binary-distributed traits (BBB, BBG and BGG). 49      Table 9: type I error for multiple trait analysis Sample size 50 100 200 500 Method Traits distribution* BBG CCC BGG BGC 0.008 0.021 0.097 0.207 0.016 0.024 0.085 0.028 0.237 0.049 0.04 0.113 0.057 0.047 0.045 0.034 0.043 0.056 SKAT 0.001 0.023 0.295 0.011 0.013 0.122 SKATO 0.025 0.039 0.26 0.025 0.036 0.117 AdjSKAT 0.055 0.025 0.29 0.059 0.042 0.135 GTSU 0.048 0.047 0.044 0.051 0.059 0.058 SKAT 0.02 0.048 0.325 0.033 0.02 0.134 SKATO 0.03 0.058 0.288 0.035 0.036 0.118 AdjSKAT 0.059 0.046 0.319 0.064 0.038 0.136 GTSU 0.05 0.047 0.049 0.058 0.056 0.043 SKAT 0.044 0.035 0.364 0.035 0.04 0.166 SKATO 0.054 0.036 0.315 0.038 0.047 0.156 AdjSKAT 0.06 0.038 0.342 0.041 0.049 0.172 GTSU 0.045 0.054 0.044 0.046 0.039 0.062 BBB GGG SKAT 0.002 0.028 0.232 SKATO 0.019 0.027 AdjSKAT 0.046 GTSU *In trait distribution, “B” represents a binary-distributed trait, “G” represents a Gaussian-distributed trait, and “C” represents a Cauchy-distributed trait. Thus, “BBB” means three binary-distributed traits, “GGG” means three Gaussian-distributed traits, “CCC” means three Cauchy-distributed traits, “BBG” means two binary-distributed traits and one Gaussian-distributed trait, “BGG” means one binary-distributed trait and two Gaussian-distributed traits, and “BGC” means three traits with a binary, Gaussian and Cauchy distribution for each. We summarized the power comparison for the two disease models in Figures 5 and 6. For the first disease model, in which the multiple traits follow the same distribution, GTSU had higher power than SKAT, SKATO and AdjSKAT, for all three simulation scenarios (Figure 4). In the simulation scenarios of BBB and GGG, when the sample size was 50, GTSU obtained higher power than the other three SKAT methods. As the sample size 50      increased, GTSU could obtain substantially power improvement than the other three methods. Figure 5: Power comparison for a multiple trait analysis when the multiple traits follow the same type of distribution *“BBB”, “GGG”, “CCC” represent three binary-distributed traits, three Gaussian-distributed traits, and three Cauchy-distributed traits, respectively. For CCC, we clearly observed the increased power of GTSU as the sample size increased. Compared to GTSU, SKAT, SKATO and AdjSKAT had little gain in power as the sample size increased. We also observed higher power of three SKAT methods over GTSU when the sample size was small. This is due to the fact that the SKAT methods had an inflated type 1 error (i.e., > 0.20) (Table 9). Similarly, for the second disease model, GTSU had higher power than SKAT, SKATO and AdjSKAT, regardless of the 51      sample size, the distributions of the multiple traits, and the configuration of the causal SNVs on the multiple traits (Figure 6). Figure 6: Power comparison for a multiple trait analysis when the multiple traits follow different types of distribution * “BBG” denotes two binary-distributed traits and one Gaussian-distributed trait; “BGG” denotes one binary-distributed trait and two Gaussian-distributed traits; and “BGC” denotes one binary-distributed trait, one Gaussian-distributed trait and one Cauchy-distribution trait. 2.9 Computation and Software We implemented GTSU in both R and C++. Because SKAT is written in R, to be comparable, we compared the computational efficiency of GTSU in R with that of SKAT, SKATO and AdjSKAT. To analyze a 30kb sequence with a sample sizes of 50 for 1000 data replicates on a high-performance personal computer with 2.3GHz CPU and 4G memory, SKAT, SKATO and AdjSKAT spent 10 times (95.606s), 217 times (2155.145s) 52      and 532 times (5286.115s) longer, respectively, of the time that GTSU spent (9.925s). We listed more detailed computational times for analyzing data with various distributed traits and different sample sizes in Table B3. To handle high-dimensional data and further improve computational efficiency, we have developed a C++ based package for GTSU (https://www.msu.edu/~changs18/software.html#GTSU), including a windows version and a unix version. The current version of the C++ based GTSU only implement a weighted IBS for genetic similarity and an ED-based similarity for trait similarity. More features will be added to the package in the future. 2.10 Application to Dallas Heart Study To further evaluate the performance of GTSU, we applied GTSU to the Dallas Heart Study (DHS) sequence data [Ahituv, et al. 2007], and compared the results of GTSU with three other SKAT-based methods. The DHS sequencing data is comprised of 4 genes, ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6. After the quality control (e.g., removing SNVs with high missing value rate), 230 rare variants (54 SNVs, 63 SNVs, 61 SNVs, 52 SNVs are from ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6, respectively), and 2598 subjects remained for the analysis. We were interested in testing the association of the rare variants in these genes with multiple traits that are related to cholesterol level, including obesity (dichotomized from BMI using a cut-off of 35), cholesterol, high density lipoprotein cholesterol (HDL), low density lipoprotein cholesterol (LDL) and very low density lipoprotein cholesterol (VLDL). 53      Table 10: multiple trait analysis for the DHS study rare variants p-values for multiple traits* GTSU SKAT AdjSKAT SKATO ANGPTL3 54 0.300 0.415 0.355 0.560 ANGPTL4 63 0.057 0.395 0.430 0.775 ANGPTL5 61 0.075 0.230 0.365 0.535 ANGPTL6 52 0.297 0.295 0.325 0.505 ALL 230 0.028 1 1 1 *Traits considered in the analysis include BMI, cholesterol, HDL, LDL and VLDL. For SKAT-based methods, the p-values listed in the table are the Bonferroni-corrected p-value attained from the multiple pvalues. To consider potential confounding effects from age, gender and race in GTSU, we used the residuals to build the trait similarity. The three SKAT-based methods cannot directly analyze multiple-trait data. Thus, we obtained multiple p-values from single trait analyses, and then used a Bonferroni correction to decide whether there was a significant association. The results were summarized in Table 10. Among the 4 genes, GTSU found a marginal association (p-value=0.057) of ANGPTL4 with the multiple traits, while there is no significant association found by SKAT, SKATO and AdjSKAT. To further explore, we combined the 4 genes into a gene set, and found a significant association of the gene set with multiple traits by using GTSU (p-value=0.028). 54      CHAPTER 3: Family-based Gene-Trait Similarity U test Rare variants are often recent mutation, and are only carried by few individuals in the population. Although they are rare in the population, if the mutations are inheritable, they could be clustered within families. Sequencing studies that utilize a family design are, therefore, more suitable to detect rare variants. However, directly using population-based association method without considering the family correlation structure could lead to inflated type I error. A family-based Gene Trait Similarity U test is thus proposed to analyze sequencing association data from family samples or a mixture of family samples and unrelated samples. 3.1 Family-based Gene Trait Similarity U test Suppose that n subjects are genotyped in the study, where we are interested in the association of the phenotype ( y i , 1  i  n ) with M genetic variants ( g i , m , 1  i  n , 1  m  M ). In order to test the joint association, we first define the trait similarity by h( yi , y j ) and the genetic similarity by f ( g i , g j ) , where the function h(,) and f (,) satisfy the finite second moment conditions, (i.e. E (h 2 (Y1 , Y2 ))   and E ( f 2 (G1 , G2 ))   ). We further center trait similarities by, h ( yi , y j )  h( yi , y j )  E (h( yi , Y j ))  E (h(Yi , y j ))  E (h(Yi , Y j )), and center the genetic similarities, f ( gi , g j ) , in the same manner. We can then build a gene-trait similarity U test by calculating the summation of the centered trait similarity weighted by the centered genetic similarity, as follows: 55      U 1  h ( yi , y j ) f ( gi , g j ) . n(n  1) i  j The above form can be considered as the sum of the element-wise product of two matrices. When the phenotype is associated with the genetic variants, the phenotype similarity is concordant with the genetic similarity. Thus, the positive trait similarity is positively weighted and the negative trait similarity is negatively weighted, and a larger value is expected for the test statistics under the alternative hypothesis. In order to take the correlation structure into account, we first calculate the kinship matrix, K, 1   1 2 12  2 1  2 K   21  ... ...   2  n1 2  n 2 2 1n  2  2 n  , ... ...   ... 1   n  ... ... where  i represents the inbreeding coefficient for the subject, and ij represents the kinship coefficient between subjects i and j. We then apply a projection approach to obtain a family-based genetic similarity and a trait similarity. As we shown in Appendix C1, the technique originates from the linear mixed model. The kinship matrix can be decomposed into K  U U T , where the columns of U are eigenvectors, and the diagonal of  are eigenvalues. Denote the trait similarity S  {Si , j }nn , where Si , j  h( yi , y j ) , the family-based trait similarity matrix can be calculated by S  U T SU . Also denote the genetic similarity W  {Wi , j }nn , where Wi , j  f ( gi , g j ) , the family-based genetic similarity can be calculated by W   0.5U T WU  0.5 . Let S0 be a similar matrix as 56      S with the diagonal elements of 0, and W0 be a similar matrix as W with the diagonal elements of 0, I be the identity matrix and J be the matrix in which all elements equal 1 / n . We can calculate the centered similarity matrix by,  S cen  ( I  J ) S0 ( I  J ) .  cen W  ( I  J )W0 ( I  J ) The family-based GTSU is defined similarly, U 1 n2 n n W cen i, j i 1 j 1 Sicen ,j . Under the null hypothesis, when the genetic variants are not associated with the phenotype, the limiting distribution of the family-based GTSU is a mixture of chisquared random variables, n nU ~  s 1 s n t n n t 1 2 st , where {s } and {t } are the eigenvalues of the matrix S cen and W cen , and { st2 } are i.i.d chi-squared random variables with a d.f. of 1. The p-value of the test, P(U obs  U ) can then be calculated using the Davies’ method [Davies 1980], Liu’s method [Liu, et al. 2009a] or Kuonen’s method [Kuonen 1999]. 3.2 Covariate-adjusted Gene Trait Similarity U test We can also consider the family structure as the confounding effect that needs to be corrected. We derive a covariate-adjusted GTSU, and use the principle components of the kinship matrix as the covariate adjustment for GTSU. Suppose that there are P covariates 57      ( xi , p , 1  i  n , 1  p  P ) that need to be adjusted, where we choose P leading principle components of the kinship matrix in a family study. By denoting S0cen a matrix by assigning 0 to the diagonal element of matrix ( I  J ) S ( I  J ) , and W0cen a matrix by assigning 0 to the diagonal element of matrix ( I  J )W ( I  J ) , we can use a two-sided projection [Lindsay, et al. 2008] to calculate the covariate-adjusted trait similarity ( S cov ) and genetic similarity ( K cov ): S cov  ( I  X ( X T X )1 X T ) S0cen ( I  X ( X T X ) 1 X T ) , W cov  ( I  X ( X T X ) 1 X T )W0cen ( I  X ( X T X ) 1 X T ) . Then the covariate-adjusted GTSU can be calculated as: U cov  1 n2 n n W i 1 j 1 cov i, j Sicov ,j . Its asymptotic distribution can be attained using a method similar to the weighted U statistic [Shieh 1997]. After an eigen-decomposition of the two similarity matrices, the asymptotic distribution of the covariate-adjusted GTSU is the sum of the independent chi-square random variables: nU cov ~ n n 1 scov  kcov  st2 ,  n( n  P  1) s 1 t 1 where { st2 } are independent chi-square random variables with 1 degree of freedom, {scov } and {tcov } are the eigen-values for the matrices S cov and W cov , respectively. 58      3.3 Simulation We used the genetic data from the 1000 Genome Project [2010] to mimic real scenarios. In particular, we used a 1Mb region of the genome (Chromosome 17: 7344328-8344327) from 1092 individuals in the 1000 Genome Project. For each simulation replicate, we randomly chose a 30kb segment from the 1Mb region as the SNV set to be tested. In the SNV set, only a part of the SNVs was set as causal. A number of individuals were randomly chosen as parents in the family, from which we generated 2 offspring for each family. We varied the numbers of families to study the performance of GTSU under different sample sizes. To investigate the robustness against phenotype distributions, we simulated different types of traits, i.e., a binary-distributed trait, a Gaussian-distributed trait, a t-distributed trait, or a Poisson-distributed trait. The binary-distributed phenotype was simulated by a logistic regression model: logit( P( yi  1))    giT  + i , where yi was the phenotype value for the i-th individual, g i was the observed genotype vector (coded as 0, 1, and 2),  were the effects of the genotypes, and  i was the effect of the family correlation. The effects  for the genotypes were sampled from a normal distribution with a mean of  and a variance of  2 .   ( 1 , 2 ,..., n )T was sampled from a multivariate normal distribution,  ~ N (0, K2 K ) , where K is the kinship matrix calculated based on the family structure. The Poisson-distributed phenotype is simulated by a Poisson regression model: log(ai )    giT  + i , yi ~ Pois (ai ) , where yi was sampled from a Poisson distribution with a mean of ai . 59      The Gaussian-distributed phenotype is simulated by a conventional linear regression model: yi    giT    i   i ,  i ~ N (0, 2 ) . The t-distributed phenotype represents a situation in which the continuous trait has more extreme values (heavy-tailed), and was simulated by the following model: yi    giT    i   i ,  i ~ tdf 1 , where  i were sampled from a t distribution with a d.f. of 1. We compared the proposed family-based GTSU (FGTSU) with 3 other methods, including the original GTSU without any adjustment (GTSU), covariate-adjusted GTSU using the 20 leading principle components (PC-GTSU), and fam-SKAT [Chen, et al. 2013], which is based on a variance-component score test in the linear mixed model. The results were summarized in Table 1 and Table 2. In general, the original GTSU had the highest power among the 4 methods, but type 1 errors for GTSU were inflated. FGTSU had a controlled type 1 error, and had a reasonable power to detect the joint association. When the phenotypes followed a binary distribution (Table 11), PC-GTSU, FGTSU and fam-SKAT had controlled the type 1 error rates. However, when the sample size was small (e.g., 80), fam-SKAT and PC-GTSU had conservative type 1 error rates (0.005 for fam-SKAT and 0.016 for PC-GTSU), and they had very low power (0.033 for fam-SKAT and 0.016 for PC-GTSU). PC-GTSU was the least powerful approach among the three, and FGTSU had higher power than both PC-GTSU and fam-SKAT. 60      Table 11: Type 1 error and power comparison for a discrete phenotype Distribution Effect Null Binary Disease Null Poisson Disease Sample size Type 1 error/Power GTSU PC-GTSU FGTSU fam-SKAT 80 0.153 0.016 0.053 0.005 200 0.164 0.03 0.047 0.023 400 0.174 0.069 0.070 0.032 80 0.39 0.016 0.175 0.033 200 0.715 0.1 0.489 0.298 400 0.922 0.537 0.823 0.736 80 0.317 0.01 0.054 0.122 200 0.357 0.045 0.06 0.149 400 0.403 0.117 0.056 0.142 80 0.5 0.019 0.199 0.315 200 0.767 0.151 0.425 0.446 400 0.915 0.539 0.679 0.605 When the phenotypes followed a Poisson distribution (Table 1), only FGTSU had the controlled type 1 error. PC-GTSU controlled the type 1 error when the sample size was 80 or 200, but not when sample size was 400. fam-SKAT could not control the type 1 error for different sample sizes. Similar to the binary cases, PC-GTSU had a conservative type 1 error and low power for small sample sizes, while FGTSU had the largest power, among those that had controlled the type 1 error. 61      Table 12: Type 1 error and power comparison for a continuous phenotype Distribution Effect Null Gaussian Disease Null Student’s t Disease Sample size Type 1 error/Power GTSU PC-GTSU FGTSU fam-SKAT 80 0.299 0.013 0.064 0.039 200 0.357 0.041 0.073 0.035 400 0.385 0.107 0.056 0.037 80 0.513 0.023 0.236 0.201 200 0.742 0.12 0.429 0.452 400 0.885 0.491 0.65 0.761 80 0.148 0.015 0.054 0.145 200 0.175 0.036 0.046 0.176 400 0.158 0.058 0.051 0.183 80 0.486 0.017 0.285 0.171 200 0.784 0.146 0.63 0.197 400 0.938 0.582 0.867 0.221 When the phenotypes followed a Gaussian distribution (Table 12), both FGTSU and famSKAT controlled the type 1 errors at around the 0.05 level. Fam-SKAT had comparable or even higher power than FGTSU, especially when the sample size increased. Both methods attained higher power than PC-GTSU. Similar to before, PC-GTSU had an inflated type 1 error when the sample size was 400 and low power when the sample size was 80. 62      When the phenotypes followed the Student’s t distribution (Table 2), both FGTSU and PC-GTSU controlled the type 1 errors at around the 0.05 level. Fam-SKAT had an inflated type 1 error for different sample sizes. FGTSU had the highest power performance among those that could control the type 1 error. PC-GTSU had low power when the sample size was small. To summarize, compared to the other three methods, FGTSU was the most robust to phenotype distributions. FGTSU was also a powerful method among those that could control type 1 error. Fam-SKAT had controlled type 1 error for both the Gaussian and binary phenotypes, but not for Poisson- or t-distributed phenotypes. Fam-SKAT attained highest power when the phenotype followed the Gaussian distribution because famSKAT is developed based on a linear mixed model. We also found unstable performance of PC-GTSU. Although the types of phenotypes did not seem to influence its robustness, PC-GTSU tended to have an inflated type 1 error for a large sample size and low power for a small sample size. The reason might be the number of principle components used for PC-GTSU. A larger number of PC can protect the type 1 error but leads to low power. 3.4 Whole Genome Sequencing Analysis of GAW18 We developed a C++ package for GTSU and family-based GTSU (https://www.msu.edu/~changs18/software.html#GTSU). Using the C++ package, we performed a whole genome sequencing analysis using the GAW 18 data. The GAW18 had only genetic data from odd numbers of autosomes. The GAW18 was comprised of 959 family members, among which 464 were directly sequenced and the others were imputed, based on 20 large pedigree structures and existing genetic data. The status of hypertension for the first visit was chosen as the phenotype in the analysis. After 63      eliminating the subjects with missing phenotypes, we had 804 subjects, and 8,348,674 SNPs remained. The SNV sets were then needed to be defined for the whole genome. For simplicity, we used the 75% quantile of the length of all genes, which corresponds to the size of 58861bp, to divide the genome sequence into 22887 SNV sets. The number of SNVs in each SNV set ranged from 1 to 1488, with a median size of 361 (Figure C1). The kinship matrix for the 804 family members was calculated using pedigree information from 1389 subjects. An ED-based similarity was used to calculate the phenotype similarity, and wIBS was used to calculate the genetic similarity. The analysis of the whole genome took 10 hours on a personal computer with 8GB memory and a 2.4 GHz CPU. The QQ plot of the 22887 SNV sets showed that there is no evidence of systematically inflated p-values (e.g., due to population stratification) in the genome-wide analysis (Figure 1). The SNV sets with the top p-values were listed in Table 13. By taking multiple testing into account, the significance level should be at 10-6 level. Therefore, the first 2 SNV sets on chromosomes 7 and 9 were significantly associated with hypertension. 64      Figure 7: QQ plot for the whole genome sequencing analysis 65      Table 13: Top 10 SNV sets with smallest p-values for GAW18 Chr Start position End position nSNP P_value 7 61098237 61157098 39 1.14×10-06 9 45676524 45735385 1 2.85×10-06 9 30961149 31020010 440 1.22×10-05 15 54564610 54623472 362 2.57×10-05 9 31078872 31137733 496 2.93×10-05 17 8122887 8181748 325 5.51×10-05 21 35258038 35316900 306 6.14×10-05 11 38024529 38083390 391 0.000111 3 190711260 190770121 400 0.000148 1 1118368 1177230 568 0.000149 66      CHAPTER 4: Conclusion and Discussion Driven by the sequencing technologies and the “common diseases and rare-variants hypothesis”, sequencing studies are now emerging as a major study design in populationbased human genetics research. Yet, the characteristics of sequencing data, including low MAF and high dimensionality, are posing daunting challenges for the statistical analyses. The conventional analytical approaches, such as a single-locus analysis, are not suitable for a sequencing data analysis. Instead, a joint association analysis or a SNV-set analysis are becoming popular, for their ability to combine multiple rare variants to increase power, as well as to reduce the dimensionality [Lee, et al. 2012a; Madsen and Browning 2009; McCarthy, et al. 2008; Morgenthaler and Thilly 2007b; Neale, et al. 2011; Wu, et al. 2011]. Although most of the existing methods have nice features and are easy to use, they also have certain limitations. For example, most of burden tests assume the homogeneity of the direction and magnitude of the effects within the SNV set, which may not hold in real disease scenarios. Besides burden tests, most of the existing methods are parametric or semi-parametric, which often assumes phenotype distribution and mode of inheritance. When the assumptions are violated, the results could be unreliable. To overcome these limitations, we have proposed WU-seq, GTSU and FSTSU, which is computationally effcient and is distribution-free. We conducted extensive simulation studies using data from the 1000 Genome Project, and compared their performance with three popular methods: SKAT, AdjSKAT, and SKATO. Across all of the simulation scenarios, including single traits with various distributions, and multiple traits with various combinations of trait distribution, the proposed methods outperformed existing 67      methods in terms of controlled type I error and power improvement. Although the simulation results depend on the simulation setting, and should always be interpreted in the context of the simulation setting, we believe they reflect the advantage of our methods over the other methods in a broader sense, because 1) the simulated genetic data is from the 1000 Genome Project, which should be able to mimic the LD pattern and allelic frequency in the general population; and 2) we simulated a wide range of disease models to reflect common disease scenarios. In recent years, methods based on U statistics have been increasingly applied to genetic studies, and have shown their robustness and flexibility [Schaid, et al. 2005; Wei, et al. 2013b; Zhang, et al. 2010]. GTSU is a general framework for sequencing associating analysis, which is based on similarity measurements. WU-seq and FGTSU can be viewed as two special cases of GTSU. Although we used the weighted-IBS to calculate genetic similarity based on a SNV set in this chapter, other forms of genetic similarity can also be used. For a weighted-IBS similarity, a simple alternative would be to modify the weights of the weighted-IBS (originally wm  1/  m (1   m ) in weighted-IBS) to reflect a different influence from each SNV. In place of the IBS-based similarity, we could also use distance-transformed similarities to model the effect nonlinearly. For example, we M 2 can use the Euclidian-Distance based method, K iED , j  exp(   wm ( g i , m  g j , m ) ) . In this m 1 chapter, we have focused on the analysis of categorical sequencing data (i.e., SNVs). By using an appropriate genetic similarity measurement, GTSU can easily be extended to analyze other types of genetic data, such as count data (i.e., CNVs) and continuous data (i.e., gene expression). 68      The flexibility of GTSU also comes from the construction of trait similarity. By using trait similarity, GTSU can analyze not only a single trait, but also multiple traits with various distributions. Many genetic studies collect multiple traits to study a human complex disease, or use intermediate biomarkers to measure the complex disease. Using multiple traits has been shown to improve power by accounting for the underlying structure of the complex disease [Liu, et al. 2009b; Maity, et al. 2012; Zhang, et al. 2010]. However, most methods for analyzing sequencing data have focused on single trait analysis. SKAT has recently been extended to multiple continuous trait analysis, by using a score test under the multivariate linear mixed effect model [Maity, et al. 2012]. The method, however, relies on a normal assumption for the trait distribution, and cannot analyze multiple traits with other types of phenotypes (e.g. a binary phenotype). In contrast, GTSU is distribution-free and can analyze multiple traits with different types of phenotypes. By using an appropriate similarity measurement, GTSU can be extended to analyze longitudinal data, survival data and even a high-dimensional phenotype such as image data. Current sequencing studies are often of small sample size. Thus, it is important for a statistical method to have enough power for a small sample size study. Methods, such as AdjSKAT, was proposed for small sample size study [Lee, et al. 2012a]. In our simulation studies, we find both AdjSKAT and GTSU can correctly control the type I error for the binary trait at 0.05 level. But GTSU has higher power than AdjSKAT for detecting a genetic association. GTSU, as well as WU-seq and FGTSU, are not only statistically powerful, but also computationally efficient. We derived the asymptotic distribution of three methods so as 69      to efficiently calculate the significance level of the association test. As we showed in the result section, GTSU and Wu-seq had the advantage over the SKAT methods. We also developed a C++ version of GTSU. With the fast-moving sequencing technology and reduced sequencing cost, large-scale whole-genome sequencing will soon become popular. We hope the proposed methods could be computationally efficient tools to handle the massively sequencing data. 70      APPENDICES 71      Appendix A We first introduce a regularity condition to the weight function ki ,i ' . Since 0  wi ,i '  1 and ki ,i '  wi ,i '  c , we have 0 | ki ,i ' | c ' , where c ' is a positive constant. It is, thus, reasonable to assume lim n  1 ki2,i '  C , where C is another positive constant.  n(n  1) i i ' Alternatively, if we consider ki ,i ' as random, this is similar to assuming a finite second 2 )  C  . moment, i.e. E (k1,2 Then we can show,   l 1 l 1 E (  2, )   l E ( 1,2l )   l  0 , and,  Var (  2, )   l2Var ( 1,2l ) l 1 n 2 l2  2 n  ( n  1) l 1 2  lim trace( KK ) n  ( n  1) 2 2  lim ki2,i '  n  ( n  1) 2 i i '  2C.  lim Using this result, we can further check that the usual p ( nWU seq  0 ). 72    nWU seq is degenerated   Table A1: simulation setting for models without confounding effects and fixed sample sizes Effect Pct* Setting* Binary Null Normal Student’s t Cauchy         0 0 0 0 0 0 0 0 0 1 0 0.3 0 0.5 0 0.5 0.25 0.25 0.1 0.1 0.25 0.25 0.25 0.25 0 5 A1 10 30 50 5 A2 10 30 50 * percentage of functional rare variants Table A2: simulation setting for model without confounding effects and varying sample sizes Effect Sample size Setting Binary Normal Student’s t Cauchy         0 0 0 0 0 0 0 0 0 1 0 0.3 0 0.5 0 0.5 50 Null 100 200 50 * A1 100 200 * 50% of the rare variants are set as functional. 73      Table A3: simulation setting for model with confounding effects and varying sample sizes Effect Sample size Setting Binary Normal Student’s t Cauchy         0 0 0 0 0 0 0 0 0 1 0 0.3 0 1 0 0.5 50 Null 100 200 50 * A1 100 200 * 50% of the rare variants are set as functional. Table A4: The association of 4 candidate genes with 3 binary phenotypes* in the Dallas Heart Study   Gene BMI Cholesterol VLDL SKAT WUSEQ SKAT WUSEQ SKAT WUSEQ ANGPTL3 0.805** 0.791 0.734 0.655 0.024 0.051 ANGPTL 4 0.340 0.365 0.242 0.263 0.012 0.020 ANGPTL 5 0.750 0.585 0.752 0.648 0.310 0.413 ANGPTL 6 0.698 0.49 0.006 0.006 0.286 0.375 All 4 genes 0.765 0.632 0.047 0.059 0.010 0.022 *For each phenotype, we use the highest quartiles and lowest quartiles to form the binary phenotype. ** P-value from the association analysis by adjusting for age, gender, and race.  74      Figure A1: Missing data distribution for the Dallas Heart Study Figure A2: MAF for all SNPs 75      Figure A3: Distribution of test statistic by using Permutation v.s. Asymptotic under null hypothesis 76      Appendix B Section B1: By the definition of h ( y1 , y2 )  h( y1 , y2 )  E (h( y1 , Y2 ))  E (h(Y1 , y2 ))  E (h(Y1 , Y2 )) , we can obtain the conditional expectation of centered trait similarity, E (h (Y1 , Y2 ) | Y1 )  E (h(Y1 , Y2 ) | Y1 )  E (h(Y1 , Y2 ) | Y1 )  E (h(Y1 , Y2 ))  E (h(Y1 , Y2 ))  0 . Thus, we have E (h (Y1 , Y2 ))  0 and Var ( E (h (Y1 , Y2 ) | Y1 ))  0 . The same argument apply for the centered genetic similarity. Under the null hypothesis, when the genetic similarities are independent of trait similarities, we have: E (U )   1 E ( f (Gi , G j )h (Yi , Y j )) n(n  1) i  j 1  E ( f (Gi , G j )) E (h(Yi , Y j )) . n(n  1) i  j  0. Section B2: We can decompose the centered trait similarity by:  h ( y1 , y2 )   ss ( y1 )s ( y2 ) , s 1 where {s } and {s ()} are the eigenvalue and eigenfunction of the kernel h (,) . Because the orthogonality of {s ()} , we have 77      E (h (Y1 , Y2 )s (Y2 ) | Y1 )   h (Y1 , y2 )s ( y2 )dF ( y2 )    ss (Y1 )  s ( y2 )s ( y2 )dF ( y2 ) s 1  ss (Y1 ) We have also showed E (h (Y1 , Y2 ) 1| Y1 )  0 1 in Appendix B1, which forces 1 ()  1 and 1  0 to be an eigenfunction-eigenvalue pair in the decomposition of h (,) . Again, because 1 () is orthogonal with {s ()}s 1 , for s>1, we have Es (Y1 )   s ( y1 ) 1dF ( y1 )   s ( y1 )1 ( y1 )dF ( y1 )  0. Following same steps, we can decompose the centered genetic similarity:  f (G1 , G2 )   tt (G1 )t (G2 ) . t 1 Then,  Es (Y )  0,  s  1   Et (G)  0,  t  1. Using the function decomposition, we can write GTSU as, 78    (B1).   U 1  f (Gi , G j )h (Yi , Y j ) n(n  1) i  j    1 G G    ss (Yi )s (Y j ) ( ) ( )  t t i t j  n(n  1) i  j t 1 s 1    1 t  s  t (Gi )t (G j )s (Yi )s (Y j )  n(n  1) t 1 s 1 i  j  1   1 n    t (Gi )s (Yi )    t s  n  1 t  2 s  2  n i 1  n   1 1  t  s  t2 (Gi )t2 (Yi )  n  1 t  2 s  2 n i 1 2  1    1 n *  sign   t (Gi )s* (Yi )   ( )   t s  n  1 t 2 s2  n i 1  n   1 1 sign(t s )  (t* (Gi )t* (Yi )) 2 ,   n  1 t 2 s 2 n i 1 . 2 where t* (Gi ) | t |0.5 t (Gi ) and s* (Yi ) | s |0.5 s (Yi ) . Under the null hypothesis, genotypes ( Gi ) are independent of traits ( Yi ). Therefore, for s>1 and t>1, we have E (t* (G1 )s* (Y1 )) | st |0.5 E (t (G1 )) E (s (Y1 )) (B2)  0, , and E (t* (G1 )s* (Y1 )t* (G1 )s* (Y1 )) | st st  |0.5 E (t (G1 )t  (G1 )) E (s (Y1 )s (Y1 )) |   | , if s  s and t  t   s t otherwise. 0, 79    (B3)    1 n *  For any finite subset  of {st}s 1,t 1 ,  converges to a t (Gi )s* (Yi )    n i 1 st multivariate normal distribution by using the results from appendix B2, B3 and multivariate CLT. Additionally, we can show that:  s 1,t 1 E (t* (G1 )s* (Y1 )) 2   | s | | t | s   2 s s t  2 t t  E ( h (Y1 , Y2 )) 2 E ( f (G1 , G2 )) 2  Under the condition  s 1,t 1 E 2 (t* (G1 )s* (Y1 ))   , the countable sequence of function {t* ()s* ()}s 1,t 1 is a Donsker class [van der Vaart and Wellner 2000]. Thus, the empirical process, 1 n n  i 1 * t (Gi )s* (Yi ) , converges weakly to the Gaussian process Z s ,t with a mean zero and a covariance function: |   | , if s  s and t  t  cov( Z s ,t , Z s,t  )  E (t* (G1 )s* (Y1 )t* (G1 )s* (Y1 ))   s t . otherwise 0, With this uniform convergence (for all s>1 and t>1), we can show that: 80        nU   sign(t s )  Z s ,t  D 2 t 2 s 2     sign(t s ) | st | , t 2 s 2 D     s 1 t 1 s 1  t  s  st2  t  s t 1 where { st2 } are the i.i.d chi-square random variables with a d.f of 1. Section B3: To simplify notations, we denote u ( X1 , X 2 )  f (G1 , G2 )h (Y1 , Y2 ) . Then GTSU can be rewritten as: U 1  u( X i , X j ) . n( n  1) i  j Define a centered kernel u( x1 , x2 ) by: u( x1 , x2 )  u( x1 , x2 )  u1 ( x1 )  u1 ( x2 )  , where u1 ( x)  E(u( X1 , X 2 ) | X1  x) . We can decompose the GTSU as follows: U 1  u( X i , X j ) n(n  1) i  j  1   u ( X i , X j )  u1 ( X i )  u1 ( X j )      n(n  1) i  j  1 2 n  u X X  ( , )  u1 ( X i )     .  i j n n(n  1) i  j i 1 Thus, 81    X  (Y , G ) and   2 n n  u1 ( X i )       u ( X i , X j ). n(n  1) i  j n i 1 n (U   )  Since E (u1 ( X ))   and Var (u1 ( X ))   1 , the first term converges to a normal distribution by applying CLT: 2 n n D   u ( X )     N (0, 4 i 1 1 i 1 ). We then need to show: R p n  ( , )  u X X  i j 0. n(n  1) i  j This can be done by providing E( R2 )  0 , using the fact that E(u( X1 , X 2 ))  0 , Var (u( X1 , X 2 ))   and E(u( X1 , X 2 ) | X1 )  0 . In fact, by using a similar technique to Appendix B2, we can show that nR asymptotically follows the distribution of a weighted sum of independent chi-square random variables. Section B4: In the study sample, we can get the centered trait similarity by: 1 n 1 n 1 h ( yi , y j )  h( yi , y j )   h( yi , y j )   h( yi , y j )  2 n j 1 n i 1 n  h( y , y ) . i j i, j If we denote Si , j  h ( yi , y j ) and Si , j  h( yi , y j ) , the above equation becomes: 1 n 1 n 1 Si , j  Si , j   Si , j   Si , j  2 n j 1 n i 1 n The equation can be written in a matrix form, 82    S i, j i, j .   S  S  JS  SJ  JSJ ,  (I  J )S (I  J ) 1 where S  {h ( yi , y j )}nn , S  {h( yi , y j )}nn , I  {1{i  j}}nn , and J  { }nn . n Similarly, the centered genetic similarity can also be written in matrix form: K  ( I  J ) K ( I  J ) , where, K  { f ( g i , g j )}nn , and K  { f ( g i , g j )}nn . Section B5: In the actual computation, we will use a matrix eigen-decomposition to obtain the eigenvalue and finite-dimension approximation of the eigenfunction. For matrix eigen-decomposition, computer algorithm usually gives eigenvalues s with eigenvectors s that satisfy n  s2,i  1 instead of i 1 1 n 2  s ,i  1 . Thus, using the eigenvalues n i 1 {s } and {t } calculated from the matrix eigen-decomposition, the limiting distribution of GTSU is: n nU ~  t 1 t n s  n ( n s 1 2 st  1) . Section B6: The purpose is to obtain a covariate-centered similarity, so that the centered similarity is perpendicular to the space spanned by the covariates. Suppose we have P covariates that need to be adjusted. We first consider the trait similarity and assume the 83      covariate to be a P+1 dimensional function of traits: xi  x( yi ) . We define a new kernel by: u ( y1 , y2 )  x ( y1 )  E ( x()T x())  x( y2 )T . 1 From this definition, we have: 1 E (u ( y1 , Y2 ) x(Y2 ))  x( y1 ) E  E ( x(Y2 )T x(Y2 ))  x(Y2 )T x(Y2 )   x( y1 ) .   The covariate-centered kernel is then defined as:  h ( y1 , y2 )  h( y1 , y2 )  E (h( y1 , Y3 )u (Y3 , y2 ))  E (u ( y1 , Y3 )h(Y3 , y2 ))  E (u ( y1 , Y3 )h(Y3 , Y4 )u (Y4 , y2 )) So we have: 84    .    E (h ( y1 , Y2 ) x(Y2 ))   h( y1 , y2 ) x( y2 )dF ( y2 )    h( y1 , y3 )u ( y3 , y2 ) x( y2 )dF ( y3 )dF ( y2 )    u ( y1 , y3 )h( y3 , y2 )x( y2 )dF ( y3 )dF ( y2 )    u ( y1 , y3 )h( y3 , y4 )u ( y4 , y2 ) x( y2 )dF ( y4 )dF ( y3 )dF ( y2 )   h( y1 , y2 ) x( y2 )dF ( y2 )    h( y1 , y3 )  E (u ( y3 , Y2 ) x(Y2 ))  dF ( y3 )    u ( y1 , y3 )h( y3 , y2 )x( y2 )dF ( y3 )dF ( y2 )    u ( y1 , y3 )h( y3 , y4 )  E (u ( y4 , Y2 ) x(Y2 ))  dF ( y4 )dF ( y3 )   h( y1 , y2 ) x( y2 )dF ( y2 )    h( y1 , y3 )x( y3 )dF ( y3 )    u ( y1 , y3 )h( y3 , y2 )x( y2 )dF ( y3 )dF ( y2 )    u ( y1 , y3 )h( y3 , y4 )x( y4 )dF ( y4 )dF ( y3 ) 0  Consider the above expectation as an inner product  h (, y2 ) x( y2 )  defined by:    h (, y2 ) x( y2 )   h (, y2 ) x( y2 )dF ( y2 ) ,  In this sense, we have h  x for our covariate-centered trait similarity. In fact, this has  forced {1, x1 ,..., xP } to be a subset of the eigenfunctions of h (,) with eigenvalues  {s  0}1 s  P 1 . By the definition of the covariate-centered trait similarity, we can estimate it by replacing the expectation with a sample average: 85      n n  h ( yi , y j )  h( yi , y j )   xi ( xlT xl ) 1 xkT h( yk , y j ) k 1 l 1 n n k 1 l 1   h( yi , yk ) xk ( xlT xl ) 1 xTj n n . n n l 1 l 1   xi ( xlT xl ) 1 xmT h( ym , yk ) xk ( xlT xl ) 1 xTj m 1 k 1 As in Appendix B4, we can write it in matrix form:  S  ( I  X ( X T X ) 1 X T ) S ( I  X ( X T X ) 1 X T ) Similarly, we can define the covariate-centered genetic similarity by:  f ( g1 , g 2 )  f ( g1 , g 2 )  E ( f ( g1 , G3 )v(G3 , g 2 ))  E (v( g1 , G3 ) f (G3 , g 2 ))  E (v( g1 , G3 ) f (G3 , G4 )v(G4 , g 2 )) , where, v(,) is constructed in the same way as u (,) , except for letting xi  x( gi ) . The  covariate-centered genetic similarity has the perpendicular property f  x . In the study sample, we can write the covariate-centered genetic similarity in the following matrix form:  K  ( I  X ( X T X )1 X T ) K ( I  X ( X T X )1 X T ) . To derive the asymptotic distribution of a covariate-adjusted GTSU, we can decompose  the h (,) by:   h ( y1 , y2 )   ss ( y1 )s ( y2 ) , s 1  and decompose the f ( g1 , g 2 ) by:   f ( g1 , g 2 )  tt ( g1 )t ( g 2 ) . t 1 86      Then using the same technique in Appendix B2, we can show that: U 1 n2 1  2 n    f (G , G )h (Y , Y ) i j i j i, j  n     t  s   t (Gi )s (Yi )  t P2 sP2  i 1    2 and its limiting distribution is also an infinite weighted sum of the iid chi-square variables.   Then, by using a matrix eigen-decomposition (to get eigenvalues {t } and {s } ), taking  T 1 T into account the projection operator ( H  I  X ( X X ) X ) in a finite sample, U can be approximated as follows: nU ~ 1  n  n  2 t  s  st  , n(n  P  1)  t 1 s 1 where, { st2 } are i.i.d chi-square random variables with a d.f. of 1. Section B7: We allowed the multiple traits to follow different distributions (BBG, BGG, or BGC). Two sets of Null models (   0 and  2  0 ) were simulated (details in Table B4): 1. We generated two covariates ( x1,i ~ Bernoulli (0.3) and x2,i ~ N (0,1) ) for all three traits, so that the three traits were influenced by the same set of covariates. 2. Repeatedly generated two covariates for each trait, so that the three traits were influenced by different sets of covariates. Two sets of disease models (   0 and  2  0 ) were also simulated, each related to one of the Null models. 87      We summarized the type I error of the 4 methods for models with the same covariates, and models with different covariates in Table 3. For all of the simulation scenarios, GTSU could control the type I error correctly around 0.05. The other three parametric methods either had an inflated type I error or a conservative type I error, depending on the sample size and the trait distribution. AdjSKAT and SKATO were able to correctly control the type I error around 0.05 for BBG and BGG traits, but suffered an inflated type I error for BGC traits. SKAT also suffered an inflated type I error when one or more than one of the multiple traits are Cauchy-distributed traits. Moreover, SKAT suffered a conservative type I error for small sample sizes with BBG and BGG traits. An interesting observation is that the conservatism of the type I error for the SKAT-based methods are less severe than the scenario in which no covariates are involved. The reason might be that the SKAT-based methods use residuals to build the test statistics after adjusting for the covariates. The residuals have more levels than the original trait value (2 levels for the binary trait), and are closer to the normal distribution. We summarized the power comparison for the multiple-traits disease model with common covariates and distinct covariates, respectively, in Figure 6 and Figure 7. We observed that GTSU obtained a much higher power than SKAT, SKATO and AdjSKAT across all of the sample sizes, trait distributions and covariate effect configurations (a shared or distinct set of covariates). For example, when the sample size was 50 for BBG traits with common covariates, SKAT, SKATO and AdjSKAT were only 11% (power=0.026), 19% (power=0.043), and 25% (power=0.057) of the power of GTSU (power=0.222). 88      Table B1: simulation setting for simulation I Effect Sample size Pct* Setting for effect Binary Normal Cauchy       0 0 0 0 0 0 0 50 0 0.4 0 0.2 0 0.3 50 0.1 0.2 0.05 0.1 0.1 0.2 0 0.4 0 0.2 0 0.3 50 Null 100 200 500 50   0 100   0 200 500 50   0 100   0 200 500 5   0   0 500 10 30 50 *percentage of caudal SNVs 89      Table B2: simulation setting for simulation II Effect Sample size Pct* Setting for effect** BBB Null-same-traits GGG CCC       0 0 0 0 0 0 50 100 200 0 500 BBG Null-diff-traits BGG BGC       0 0 0 0 0 0 50 100 200 0 500 BBB same-traits diff-SNPs GGG CCC       0 0.2 0 0.1 0 0.2 50 100 200 30 500 BBG diff-traits diff-SNPs BGG BGC       0 0.15 0 0.15 0 0.15 50 100 200 30 500 *percentage of causal SNVs for each trait **“BBB”, “GGG”, and “CCC” denote three binary-distributed traits, three Gaussian-distributed traits, and three Cauchy-distributed traits, respectively. “BBG” denotes two binary-distributed traits and one Gaussian-distributed trait; “BGG” denotes one binary-distributed trait and two Gaussian-distributed traits; and “BGC” denotes one binary-distributed trait, one Gaussiandistributed trait and one Cauchy-distribution trait. 90      Table B3: computational time for R-based GTSU and SKAT, SKATO, AdjSKAT for analyzing 1000 simulated data replicates Traits* BBG BGG BGC sample size Computational time (seconds) SKAT SKATO AdjSKAT GTSU 50 95.606 2155.145 5286.115 9.925 100 184.013 3921.727 7613.602 34.822 200 438.496 5424.594 9181.626 102.419 500 4319.94 13686.37 20890.24 849.143 50 63.993 1479.988 1996.783 7.084 100 106.755 2588.816 2702.345 22.783 200 392.034 6966.902 5523.014 128.326 500 3665.337 18607.69 13910.52 1140.583 50 68.707 1814.798 2147.614 8.242 100 105.197 2553.035 2683.744 22.551 200 436.481 7337.539 6788.79 144.654 500 3852.689 19392.38 16327.11 1239.392 91      Table B4: simulation setting for multiple trait analysis with covariates Effect Sample size Pct* N of Cov Setting for effect** BBG Null-same-cov BGG BGC       0 0 0 0 0 0 50 100 200 0 2 500 BBG Null-diff-cov BGG BGC       0 0 0 0 0 0 50 100 200 0 6 500 BBG same-cov BGG BGC       0 0.15 0 0.15 0 0.15 50 100 200 30 2 500 BBG diff-cov BGC       0 0.15 0 0.15 0 0.15 50 100 200 30 6 500 92    BGG   Table B5: type I error for multiple trait analysis with covariates Sample size 50 100 200 500 Method Traits distribution* same covariates different covariates BBG BGG BGC BBG BGG BGC SKAT 0.019 0.021 0.114 0.031 0.03 0.096 SKATO 0.033 0.038 0.097 0.042 0.042 0.086 AdjSKAT 0.039 0.038 0.128 0.062 0.046 0.108 GTSU 0.053 0.062 0.062 0.061 0.048 0.05 SKAT 0.025 0.024 0.127 0.039 0.025 0.115 SKATO 0.042 0.043 0.123 0.055 0.031 0.114 AdjSKAT 0.057 0.04 0.137 0.068 0.042 0.123 GTSU 0.052 0.036 0.049 0.049 0.059 0.045 SKAT 0.034 0.03 0.135 0.034 0.034 0.122 SKATO 0.049 0.036 0.126 0.05 0.042 0.119 AdjSKAT 0.051 0.036 0.144 0.056 0.048 0.133 GTSU 0.042 0.056 0.045 0.048 0.048 0.056 SKAT 0.045 0.054 0.161 0.046 0.042 0.157 SKATO 0.056 0.055 0.154 0.055 0.061 0.137 AdjSKAT 0.048 0.059 0.155 0.056 0.045 0.148 GTSU 0.041 0.04 0.047 0.042 0.045 0.046 * “BBG” denotes two binary-distributed traits and one Gaussian-distributed trait; “BGG” denotes one binary-distributed trait and two Gaussian-distributed traits; and “BGC” denotes one binarydistributed trait, one Gaussian-distributed trait and one Cauchy-distribution trait. 93      Figure B1: Power comparison for multiple trait analysis with the same set of covariates * “BBG” denotes two binary-distributed traits and one Gaussian-distributed trait; “BGG” denotes one binary-distributed trait and two Gaussian-distributed traits; and “BGC” denotes one binarydistributed trait, one Gaussian-distributed trait and one Cauchy-distribution trait. 94      Figure B2: Power comparison for multiple trait analysis with different sets of covariates * “BBG” denotes two binary-distributed traits and one Gaussian-distributed trait; “BGG” denotes one binary-distributed trait and two Gaussian-distributed traits; and “BGC” denotes one binarydistributed trait, one Gaussian-distributed trait and one Cauchy-distribution trait. 95      Appendix C Using a traditional linear mixed model, we can build the following model, Y    G    ,  ~ N (0, ),    2 I   K2 K , where Y is the vector for the phenotype, G is the matrix for the genotypes,  K2 is the variance component for the family structure, and  2 is the variance component for the noise. Let U represent the eigenvectors for the matrix K . We can decompose the covariance matrix as K  U U T , where  is the diagonal matrix with eigenvalues of K on the diagonal. We can get transformed data by letting Y   U T Y , thus Var (Y )  Var (U T Y )  U T U   2 I   K2  . Since  is the diagonal matrix, the transformed phenotype values are uncorrelated and the equivalent model is, Y   U T   U T G    ,   ~ N (0,  2 I   K2  ) . 96      Table C1: Simulation setting for family-based sequencing studies Effect Sample size Setting for effect Binary Poisson Gaussian Student’s t         0 0 0 0 0 0 0 0 0 1 0 0.5 0 0.5 0 1 80 Null 200 400   0   0 80 200 400 97      Figure C1: distribution of sizes of SNV sets 98      BIBLIOGRAPHY 99      BIBLIOGRAPHY 2010. A map of human genome variation from population-scale sequencing. Nature 467(7319):1061-1073. Abecasis G, Altshuler D, Auton A, Brooks L, Durbin R, Gibbs R, Hurles M, McVean G. 2010. A map of human genome variation from population-scale sequencing. Nature 467(7319):1061-1073. Ahituv N, Kavaslar N, Schackwitz W, Ustaszewska A, Martin J, Hebert S, Doelle H, Ersoy B, Kryukov G, Schmidt S and others. 2007. Medical sequencing at the extremes of human body mass. American Journal of Human Genetics 80(4):779791. Barnett IJ, Lee S, Lin XH. 2013. Detecting Rare Variant Effects Using Extreme Phenotype Sampling in Sequencing Association Studies. Genetic Epidemiology 37(2):142-151. Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR and others. 2008. Assessing the evolutionary impact of amino acid mutations in the human genome. Plos Genetics 4(5). Chen H, Meigs JB, Dupuis J. 2013. Sequence Kernel Association Test for Quantitative Traits in Family Samples. Genetic Epidemiology 37(2):196-204. Chen LS, Hsu L, Gamazon ER, Cox NJ, Nicolae DL. 2012. An Exponential Combination Procedure for Set-Based Association Tests in Sequencing Studies. American Journal of Human Genetics 91(6):977-986. Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH. 2004. Multiple rare Alleles contribute to low plasma levels of HDL cholesterol. Science 305(5685):869-872. Davies RB. 1980. Algorithm AS 155: The Distribution of a Linear Combination of x2 Random Variables. Journal of the Royal Statistical Society. Series C (Applied Statistics) 29(3):323-333. Easton DF, Deffenbaugh AM, Pruss D, Frye C, Wenstrup RJ, Allen-Brady K, Tavtigian SV, Monteiro ANA, Iversen ES, Couch FJ and others. 2007. A Systematic Genetic Assessment of 1,433 Sequence Variants of Unknown Clinical Significance in the BRCA1 and BRCA2 Breast Cancer-predisposition Genes. The American Journal of Human Genetics 81(5):873-883. 100      Fay JC, Wyckoff GJ, Wu CI. 2001. Positive and negative selection on the human genome. Genetics 158(3):1227-1234. Gregory GG. 1977a. Large Sample Theory for U-Statistics and Tests of Fit. Annals of Statistics 5(1):110-123. Gregory GG. 1977b. Large Sample Theory for U-Statistics and Tests of Fit. The Annals of Statistics 5(1):110-123. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. 2009. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences 106(23):9362-9367. Ji WZ, Foo JN, O'Roak BJ, Zhao H, Larson MG, Simon DB, Newton-Cheh C, State MW, Levy D, Lifton RP. 2008. Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nature Genetics 40(5):592-599. Kelly BB, Fuster V. 2010. Promoting Cardiovascular Health in the Developing World:: A Critical Challenge to Achieve Global Health: National Academies Press. Kryukov GV, Pennacchio LA, Sunyaev SR. 2007. Most rare missense alleles are deleterious in humans: Implications for complex disease and association studies. American Journal of Human Genetics 80(4):727-739. Kuonen D. 1999. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika 86(4):929-935. Ladouceur M, Dastani Z, Aulchenko YS, Greenwood CMT, Richards JB. 2012. The Empirical Power of Rare Variant Association Methods: Results from Sanger Sequencing in 1,998 Individuals. Plos Genetics 8(2). Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani DC, Wurfel MM, Lin XH. 2012a. Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control WholeExome Sequencing Studies. American Journal of Human Genetics 91(2):224-237. Lee S, Wu MC, Lin XH. 2012b. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13(4):762-775. Li HZ. 2012. U-statistics in genetic association studies. Human Genetics 131(9):13951401. Li M, Ye CY, Fu WJ, Elston RC, Lu Q. 2011. Detecting Genetic Interactions for Quantitative Traits With U-Statistics. Genetic Epidemiology 35(6):457-468. 101      Lin D-Y, Tang Z-Z. 2011a. A General Framework for Detecting Disease Associations with Rare Variants in Sequencing Studies. The American Journal of Human Genetics 89(3):354-367. Lin DY, Tang ZZ. 2011b. A General Framework for Detecting Disease Associations with Rare Variants in Sequencing Studies. American Journal of Human Genetics 89(3):354-367. Lindsay BG, Markatou M, Ray S, Yang K, Chen SC. 2008. Quadratic distances on probabilities: A unified foundation. Annals of Statistics 36(2):983-1006. Liu H, Tang YQ, Zhang HH. 2009a. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Computational Statistics & Data Analysis 53(4):853-856. Liu J, Pei Y, Papasian CJ, Deng H-W. 2009b. Bivariate association analyses for the mixture of continuous and binary traits with the use of extended generalized estimating equations. Genetic Epidemiology 33(3):217-227. Luo L, Zhu Y, Xiong M. 2012. Quantitative trait locus analysis for next-generation sequencing with the functional linear models. Journal of Medical Genetics 49(8):513-524. Lynch M, Ritland K. 1999. Estimation of Pairwise Relatedness With Molecular Markers. Genetics 152(4):1753-1766. Madsen BE, Browning SR. 2009. A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. Plos Genetics 5(2). Maity A, Sullivan PE, Tzeng JY. 2012. Multivariate Phenotype Association Analysis by Marker-Set Kernel Machine Regression. Genetic Epidemiology 36(7):686-695. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN. 2008. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics 9(5):356-369. Morgenthaler S, Thilly WG. 2007a. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis 615(1-2):28-56. Morgenthaler S, Thilly WG. 2007b. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutation Research-Fundamental and Molecular Mechanisms of Mutagenesis 615(1-2):28-56. 102      Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. 2011. Testing for an Unusual Distribution of Rare Variants. Plos Genetics 7(3). O'Neil KA, Redner RA. 1993. Asymptotic Distributions of Weighted U-Statistics of Degree 2. The Annals of Probability 21(2):1159-1169. Pritchard JK. 2001. Are rare variants responsible for susceptibility to complex diseases? American Journal of Human Genetics 69(1):124-137. Raychaudhuri S. 2011. Mapping Rare and Common Causal Alleles for Complex Human Diseases. Cell 147(1):57-69. Romeo S, Yin W, Kozlitina J, Pennacchio LA, Boerwinkle E, Hobbs HH, Cohen JC. 2009. Rare loss-of-function mutations in ANGPTL family members contribute to plasma triglyceride levels in humans. Journal of Clinical Investigation 119(1):7079. Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN. 2005. Nonparametric tests of association of multiple genes with human disease. American Journal of Human Genetics 76(5):780-793. Serfling R. 1981. Approximation Theorems of Mathematical Statistics (Wiley Series in Probability and Statistics): Wiley-Interscience. Shapiro CP, Hubert L. 1979. Asymptotic Normality of Permutation Statistics Derived from Weighted Sums of Bivariate Functions. The Annals of Statistics 7(4):788794. Shieh GS. 1997. Weighted degenerate U- and V-statistics with estimated parameters. Statistica Sinica 7(4):1021-1038. Shieh GS, Johnson RA, Frees EW. 1994. Testing Independence of Bivariate Circular Data and Weighted Degenerate U-Statistics. Statistica Sinica 4(2):729-747. Tzeng J-Y, Zhang D, Chang S-M, Thomas DC, Davidian M. 2009. Gene-Trait Similarity Regression for Multimarker-Based Association Analysis. Biometrics 65(3):822832. van der Vaart A, Wellner JA. 2000. weak convergence and empirical processes: Springer. Wei C, Anthony JC, Lu Q. 2012. Genome-Environmental Risk Assessment of Cocaine Dependence. Frontiers in Genetics 3. Wei C, Lu Q. 2011. Collapsing ROC approach for risk prediction research on both common and rare variants. BMC Proceedings 5(Suppl 9):S42. 103      Wei C, Schaid DJ, Lu Q. 2013a. Trees Assembling Mann-Whitney Approach for Detecting Genome-Wide Joint Association Among Low-Marginal-Effect Loci. Genetic Epidemiology 37(1):84-91. Wei CS, Schaid DJ, Lu Q. 2013b. Trees Assembling Mann-Whitney Approach for Detecting Genome-Wide Joint Association Among Low-Marginal-Effect Loci. Genetic Epidemiology 37(1):84-91. Wei Z, Li M, Rebbeck T, Li H. 2008. U-Statistics-based Tests for Multiple Genes in Genetic Association Studies. Annals of Human Genetics 72(6):821-833. Wet TD, Venter JH. 1973. Asymptotic Distributions for Quadratic Forms with Applications to Tests of Fit. Annals of Statistics 1(2):380-387. Wu MC, Lee S, Cai TX, Li Y, Boehnke M, Lin XH. 2011. Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test. American Journal of Human Genetics 89(1):82-93. Zhang HP, Liu CT, Wang XQ. 2010. An Association Test for Multiple Traits Based on the Generalized Kendall's Tau. Journal of the American Statistical Association 105(490):473-481. Zhu X, Feng T, Li Y, Lu Q, Elston RC. 2010. Detecting rare variants for complex traits using family and unrelated data. Genetic Epidemiology 34(2):171-187. 104