STATISTICAL METHODS FOR HIGH-DIMENSIONAL SEQUENCING STUDIES
By
Changshuai Wei

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Epidemiology - Doctor of Philosophy
2014

 
 

 

ABSTRACT
STATISTICAL METHODS FOR HIGH-DIMENSIONAL SEQUENCING STUDIES
By
Changshuai Wei
Background: With the advance of next generation sequencing technology, a massive
amount of sequencing data are generated from sequencing studies, offering a great
opportunity to comprehensively investigate the role of rare variants in the genetic
etiology of complex diseases. The massive amount of data also poses a great challenge to
the statistical analysis. Association analyses based on a single-locus test endure
substantial power loss because of the low frequency of genetic variants and the extremely
high dimensionality of the data. A joint association test, on the other hand, has been
shown to be more suitable for sequencing studies, as it jointly tests multiple variants to
increase the power and reduces the dimensionality. However, current methods for joint
association tests have limitations, such as dependency on distribution assumptions, poor
performance for a small sample size, inability to adjust for covariates, inability to
consider a family structure, inability to handle multiple traits, and computational
inefficiency.
Method and Result: I have developed a series of statistical methods to detect joint
associations in sequencing studies based on weighted U statistics. 1) I developed a
weighted U statistic, referred to as WU-SEQ, for the high-dimensional association
analysis of sequencing data. Based on the non-parametric U statistic, WU-SEQ makes no
assumption of the underlying disease model and phenotype distribution, and can be
 
 

 

applied to various types of phenotypes. 2) The simultaneously analysis of multiple related
phenotypes has been great interest in recent human genetics research. In these studies,
multiple phenotype measurements have been collected to study pleiotropic effects, or
comorbidity effects. I have extended the WU-SEQ method and developed a Gene-Trait
Similarity U test (GTSU) to detect sequencing variants associated with multiple
phenotypes. 3) Both population-based and family-based designs have been commonly
used in genetic association studies. There lacks, however, new statistical methods for
family-based sequencing data analyses. Family-based sequencing studies have many
advantages, including the ability to aggregate more disease-susceptibility rare variants
within families. I have extended the GTSU to FGTSU for family-based sequencing
association analyses.

 
 

 

ACKNOWLEDGEMENTS

I want to first thank my PhD advisor, Dr. Qing Lu, for his support and mentoring. I also
want to thank my PhD committee members, Dr. James Anthony, Dr. Wenjiang Fu, and
Dr. Yuehua Cui, for their help with my dissertation and research project. My
collaborators, Dr. Nigel Paneth and Dr. Robert Elston, also helped me a great deal with
my research, and gave me very valuable advice. Finally, I want to thank my parents and
all my friends for their support in my five-year PhD studies.

iv 
 

 

TABLE OF CONTENTS

LIST OF TABLES ....................................................................................................vii
LIST OF FIGURES ...................................................................................................ix
INTRODUCTION .....................................................................................................1
CHAPTER 1 ..............................................................................................................9
A Weighted U for Sequencing Association Studies ..................................................9
1.1 A Weighted U Test ..................................................................................9
1.2 Asymptotic Distribution...........................................................................12
1.3 Covariates Adjustment .............................................................................14
1.4 Simulation ................................................................................................15
1.4.1 Simulation I.............................................................................16
1.4.2 Simulation II ...........................................................................19
1.4.3 Simulation III ..........................................................................21
1.5 Application to the DHS Sequencing Data ...............................................25
1.6 Discussion ................................................................................................28
CHAPTER 2 ..............................................................................................................33
Gene Trait Similarity U test .......................................................................................33
2.1 Weighted U statistic .................................................................................34
2.2 Similarity Measurement ...........................................................................36
2.3 Hypothesis Testing...................................................................................38
2.4 The Power and Sample Size Calculations ...............................................39
2.5 Computation and Implementations ..........................................................41
2.6 Covariates Adjustment .............................................................................42
2.7 Simulation Method...................................................................................43
2.7.1 Simulation I.............................................................................44
2.7.2 Simulation II ...........................................................................45
2.8 Simulation Results ...................................................................................45
2.8.1 Simulation I.............................................................................46
2.8.2 Simulation II ...........................................................................49
2.9 Computation and Software ......................................................................52
2.10 Application to Dallas Heart Study .........................................................53
CHAPTER 3 ..............................................................................................................55
Family Based Gene Trait Similarity U test ................................................................55
3.1 Family-based Gene Trait Similarity U test ..............................................55
3.2 Covariates adjusted Gene Trait Similarity U test ....................................57
3.3 Simulation ................................................................................................59
3.4 Whole Genome Sequencing Analysis of GAW18 ...................................63
v 
 

 

CHAPTER 4 ..............................................................................................................67
Conclusion and Discussion ........................................................................................67
APPENDICES ...........................................................................................................71
Appendix A ....................................................................................................72
Appendix B ....................................................................................................77
Appendix C ....................................................................................................96
BIBILOGRAPHY ......................................................................................................99

vi 
 

 

LIST OF TABLES

Table 1: Type I error and power comparisons of WU-SEQ and SKAT
under different direction and magnitude of effects ....................................................18
Table 2: Type I error and power comparisons of WU-SEQ and SKAT for
different sample sizes .................................................................................................20
Table 3: Type I error and power comparisons of WU-SEQ and SKAT when
the number of variants is much larger than the sample size ......................................21
Table 4: Type I error and power comparisons of WU-SEQ and SKAT for
covariate adjustment ..................................................................................................23
Table 5: Type I error and power comparisons of WU-SEQ by using
different c ...................................................................................................................24
Table 6: Type I error and power comparison of WU-SEQ by using normal
quantile or rank ..........................................................................................................24
Table 7: The association of 4 candidate genes with 3 continuous
phenotypes (i.e., BMI, Cholesterol, and VLDL) in the Dallas Heart Study ..............27
Table 8: Type I error for single trait analysis ............................................................47
Table 9: Type I error for multiple trait analysis .........................................................50
Table 10: Multiple trait analysis for DHS study ........................................................54
Table 11: Type 1 error and power comparison for discrete phenotype .....................61
Table 12: Type 1 error and power comparison for continuous phenotype ................62
Table 13: Top 10 SNV sets with smallest p-values for GAW18 ...............................66
Table A1: Simulation setting for models without confounding effect and
fixed sample size ........................................................................................................73
Table A2: Simulation setting for model without confounding effect and
varying sample size ....................................................................................................73
Table A3: Simulation setting for model with confounding effect and
varying sample size ....................................................................................................74

vii 
 

 

Table A4: The association of 4 candidate genes with 3 binary phenotypes
in Dallas Heart Study .................................................................................................74
Table B1: Simulation setting for simulation I............................................................89
Table B2: Simulation setting for simulation II ..........................................................90
Table B3: Computational time for R-based GTSU and SKAT, SKATO,
AdjSKAT for analyzing 1000 simulated data replicates ...........................................91
Table B4: Simulation setting for multiple trait analysis with covariates ...................92
Table B5: Type I error for multiple trait analysis with covariates .............................93
Table C1: Simulation setting for family based sequencing studies ...........................97

viii 
 

 

LIST OF FIGURES

Figure 1: Distribution of SNVs with MAF<0.03 for the 4 genes in the Dallas
Heart Study ..................................................................................................................26
Figure 2: Distributions of the 3 phenotypes in the Dallas Heart Study .............................26
Figure 3: Power comparison for single trait analysis when half of the
functional SNVs are protective and the other half of the functional SNVs
are deleterious ............................................................................................................48
Figure 4: Power comparison for single trait analysis when majority of the
functional SNVs are deleterious ................................................................................49
Figure 5: Power comparison for multiple trait analysis when the multiple
traits follow the same type of distribution .................................................................50
Figure 6: Power comparison for multiple trait analysis when the multiple
traits follow the different types of distribution ..........................................................52
Figure 7: QQ plot for the whole genome sequencing analysis ..................................65
Figure A1: Missing data distribution for Dallas Heart Study ....................................75
Figure A2: MAF for all SNPs ....................................................................................75
Figure A3: Distribution of test statistics ....................................................................76
Figure B1: Power comparison for multiple trait analysis with same set of
covariates ...................................................................................................................94
Figure B2: Power comparison for multiple trait analysis with different set
of covariates ...............................................................................................................95
Figure C1: distribution of sizes of SNV sets .............................................................98

ix 
 

 

INTRODUCTION

Genome-wide association studies (GWAS) have been used to uncover common genetic
variants predisposing to common complex diseases. During the past decade, thousands of
new genetic variants have been identified from GWAS; some with a compelling
biological plausibility for a role in disease etiology [Hindorff, et al. 2009]. Despite such
successes, the genetic variants identified to date for most complex diseases can only
explain a small fraction of the total heritability. Evolutionary theory suggests that rare
variants are likely to be recent mutations, and are also likely to be more deleterious
because they are under less negative selective pressure [Boyko, et al. 2008; Fay, et al.
2001; Kryukov, et al. 2007; Pritchard 2001; Raychaudhuri 2011]. Converging evidence
from genetic studies of complex diseases also suggests that complex diseases are highly
heterogeneous, and that familial subtypes of complex diseases are frequently seen to be
caused by rare variants with large effects [Ahituv, et al. 2007; Cohen, et al. 2004; Easton,
et al. 2007; Ji, et al. 2008; Romeo, et al. 2009]. Nevertheless, current findings of rare
variants predisposing to complex diseases are still limited. With advancements in highthroughput sequencing technology, researchers are now able to comprehensively study
the role of a massive amount of sequencing variations, including both common and rare
variants, in human diseases. While the ongoing exome and whole-genome sequencing
studies hold great promise for identifying new genetic variants predisposing to complex
disease, they also pose great challenges for the statistical analysis of massive amounts of
sequencing data.

1 
 

 

Although single-locus analysis has been widely used in GWAS, such analysis is
underpowered for detecting rare variants. Rare variants are anticipated to have larger
effects than common variants, but their low frequencies in the study population make
them hard to detect. Moreover, the number of single-locus tests required for sequencing
studies is significantly larger than those required for GWAS studies, which increases the
multiple-testing burden. Therefore, as an essential alternative, a joint-association analysis
of genetic variants has become popular in the association analysis of sequencing data
[Barnett, et al. 2013; Chen, et al. 2012; Ladouceur, et al. 2012; Lee, et al. 2012b; Lin and
Tang 2011b; Madsen and Browning 2009; McCarthy, et al. 2008; Morgenthaler and
Thilly 2007a; Neale, et al. 2011; Wei and Lu 2011; Wu, et al. 2011; Zhu, et al. 2010]. By
aggregating information from multiple variants, the association signal is enhanced and
becomes more detectable. Moreover, this greatly reduces the number of tests, which
alleviates the multiple-testing issue.
Let Yi be the phenotype value for subject i, and Gi  (Gi1 ,..., GiL ) as the genotype vector
for subject i. One of the traditional methods for joint association analyses is the
regression model,

Yi    Gi    i ,  i ~ N (0,  2 ),
We can use F test to test the null hypothesis whether  equals 0. However, in a
sequencing study, the dimension of the matrix Gi can be larger than the sample sizes.
Additionally, G could be a sparse matrix and GT G might not be invertible. Therefore,
the traditional methods are not suitable for high dimensional sequencing data. A common
strategy used in current methods is to first aggregate the effect of rare variants in a region
2 
 

 

(e.g., a gene) via collapsing or weighting, and then to assess the joint association of the
variants with the phenotype of interest. Most of the existing methods for sequencing
association analyses can be classified into two categories: burden test and non-burden test.
Burden tests assume the effects of the multiple variants are homogenous, and use the
following model to test the joint association,

Yi    t (Gi )   i ,  i ~ N (0,  2 ),
where t () is a pre-specified function to calculate a genetic score from the multiple
genetic variants and  is a scalar parameter to measure the joint effect. Non-burden tests
consider the heterogeneous effects of the multiple variants, and use the following model
to test the joint association,

Yi    r (Gi )   i ,  i ~ N (0,  2 ),
where r () is a general function measuring the effects of multiple variants. For nonburden tests, we test H 0 : r  0 v.s. H a : r  0 .
The commonly used methods for burden tests include the cohort allelic sum test (CAST),
the combined multivariate and collapsing (CMC) method [McCarthy, et al. 2008], and
the weighted sum test (WST) [Madsen and Browning 2009]. CAST collapses all variants
into one variant and t (Gi ) is defined as,
1, if there are rare variants,
t (Gi )  
0, otherwise.
Using the pseudo-variant, CAST then performs a simple two-sample t test between cases
and controls to evaluate the joint association. By collapsing all the variants into one
3 
 

 

variant, CAST greatly reduces the data dimensionality and computational burden. Yet,
CAST assumes that all the variants have the same effect sizes, which might be hold in
read data scenarios. CMC generalizes the CAST by collapsing the variants into several
pseudo-variants based on predefined criteria (e.g., MAF). Then, a Hotelling’s T test is
used to test the joint association,

T  ( tcases  tcontrol )T S 1 ( tcases  tcontrol ),
where t is the mean vector for the new pseudo-variants, and S is the corresponding
covariance matrix. By collapsing all the variants into multiple pseudo-variants, CMC has
the ability to consider different effect sizes of multiple genetic variants. Different from
CAST and CMC, WST considers different effect sizes by assigning different weights to
different variants. In WST, t (Gi ) is defined as a weighted sum of genetic scores,
L

t (Gi )   w j Gi , j ,
j 1

where w j is the weight for the j-th variant. A rank sum test is then used to compare the
genetic scores between the cases and controls to make inference on the joint association.
Note that the above three methods can only be applied to case-control data. In order to
consider both binary and continuous phenotypes, SCORE-seq [Lin and Tang 2011a] is
developed under the framework of generalized linear model,

g ( E (Yi ))  X i  t (Gi )  ,
where X i represents the matrix for confounding covariates, and  represents the effects
of the covariates. t (Gi ) is defined as a weighted sum of multiple variants,
4 
 

 
L

t (Gi )   w j Gi , j , and  is a scalar parameter measuring the joint association effect.
j 1

Score test can then be used to test the null hypothesis whether  equals zero. For
different choices of t (Gi ) , SCORE-seq defines a maximum statistic,
Tmax  max | Tk |,
1 k  K

where Tk is the score test statistic for the k-th choice of t (Gi ) . P-value can be calculated
by using,

P(Tmax  tmax )  1  P(| T1 | tmax ,...,| TK | tmax ).
Burden tests implicitly assume the effects of different rare variants are in the same
direction and/or same magnitude. If the assumption fails, it could lead to power loss. To
address this limitation, different burden tests are proposed. The C-alpha test [Neale, et al.
2011] can consider the direction and magnitude of effects, by comparing the expected
variance with observed variance in a case-control setting. The test statistic is equivalent
to a simple variance component test,

T  (Y  Y )T GGT (Y  Y ).
Besides the variance component test, we can also use functional approach by considering
the genotypes in a region as a function of genetic position[Luo, et al. 2012]. For the
quantitative trait, functional linear model (FLM) can be built,
Yi  X i   Gi (t )  (t ) dt   i ,  i ~ N (0,  2 ),

5 
 

 

where t is the genetic position, Gi (t ) is the function for genotypes, and  (t ) is the
function for the genetic effects. Using basis function to describe Gi (t ) and  (t ) , FLM
can be transformed into a usual linear regression setting. Another semi-parametric
method that is widely used in sequencing data analyses is the sequence kernel association
test (SKAT) [Wu, et al. 2011], which is also robust for the direction and magnitude of
genetic effects. Moreover, because it is built based on kernel machine regression, SKAT
can adjust for covariates, and is applicable to both binary and Gaussian phenotypes,

g ( E (Yi ))  X i  r (Gi ),
where g () is the link function and r (Gi ) is considered as a random effect. A variance
component score test can be used to test the null hypothesis whether the function r ()
equals 0,

T  (Y  Yˆ )T K (Y  Yˆ ),
where Yˆ is obtained from the null model (i.e., g ( E (Yi ))  X i ) and K is a kernel matrix
calculated from multiple genetic variants.
Most existing methods for sequencing data analyses are parametric-based or semiparametric based, which rely on certain assumptions. However, in practice, these
assumptions may not be satisfied (e.g., the phenotype may not follow a normal
distribution in a sequencing study based on an extreme phenotype design). When the
assumptions are violated, the existing methods will likely have either decreased power or
an inflated type I error. Non-parametric methods, such as U-statistic-based methods, have
shown their robustness against underlying phenotypic distributions and/or underlying
6 
 

 

modes of inheritance [Li 2012; Li, et al. 2011; Schaid, et al. 2005; Wei, et al. 2012; Wei,
et al. 2013a; Wei, et al. 2008]. Various U-statistic-based methods have been proposed for
identifying the common variants associated with binary or quantitative phenotypes. Most
of these methods use U statistics to obtain multiple group-wise scores, and then form test
statistics to compare scores among different groups. For case-control data analyses, U
statistics are used to summarize the genetic information, and then to compare scores
between cases and controls [Schaid, et al. 2005]. For quantitative data analyses, U
statistics first summarize the phenotypic information, and then compare scores among
different genotype groups or multi-locus genotype groups [Wei, et al. 2008]. It is
challenging to provide a unified method under the traditional U-statistic framework
because there are different ways of constructing U-statistics for binary and quantitative
phenotypes [Li 2012]. The weighted U statistic is a more general form of the U statistic.
It was developed in the 1980s-90s [Gregory 1977b; Serfling 1981; Shieh 1997], and has
rarely been used in genetic data analyses. By using a weighted U statistic, genetic
information and phenotypic information can be summarized separately into the weight
function and the U kernel, thereby avoiding the issue of group score comparison. In the
following chapters, we use the weighted U statistic to develop new statistical methods for
the analyses of multiple sequencing variants associated with a single trait, multiple traits
and correlated traits (i.e., traits from family studies).
We use the new methods to investigate the genetic risk factors for cardiovascular disease
and hypertension (HT) by using targeted sequencing data from Dallas Heart Study and
whole genome sequencing study from GAW18. Cardiovascular disease (CVD) refers to
any disease that affects the cardiovascular system (e.g. coronary heart disease and stroke).
7 
 

 

It is the leading cause of death worldwide. By 2005, the total number of CVD death had
reached 17.5 million. The projection of total CVD death in 2015 is about 20 millions,
accounting for 30 percent of all dealths worldwide [Kelly and Fuster 2010]. Risk factors
for CVD include family history, age, smoking, hypertension, high cholesterol, obesity
and so on. Hypertension is the single most important risk factors for stroke and also plays
a significant role in heart attack.

8 
 

 

CHAPTER 1: A Weighted U for Sequencing Association Studies

Based on the weighted U statistic, we have developed a unified method, referred to as
WU-SEQ, for sequencing data analyses of various types of phenotypes (e.g., binary,
ordinal, and continuous). Moreover, we used a projection method in WU-SEQ for
covariate adjustments, and derived the asymptotic distribution of the test statistic for an
efficient assessment of the significance of the association. The performance of WU-SEQ
was evaluated though extensive simulation studies, and compared with a commonly used
method, SKAT [Wu, et al. 2011]. Finally, we illustrated the proposed method by
applying WU-SEQ to sequencing data from the Dallas Heart Study (DHS).
1.1 A Weighted U Test
Assume a population-based sequencing study with N unrelated subjects and P single
nucleotide variants (SNV) located in a gene or a genetic region. Let yi and

Gi  ( gi1, gi 2 ,, giP ) denote, respectively, the phenotype and the genotypes of P variants
of an individual i ( 1  i  n ), where gip ( 1  p  P ) is coded as 0, 1, or 2. We use si,i ' and

wi ,i ' to denote the phenotypic similarity and genetic similarity between individuals i and i’,
respectively. The phenotypic similarity can be measured by any 2-degree kernel function,

si ,i '  h( yi , yi ' ) , which satisfies the finite second moment condition ( EF (h2 (Y1 , Y2 ))   ).
While various kernel functions can be used to measure the phenotypic similarities, we
used the cross product kernel, si ,i '  qi qi ' , where qi is the normal quantile of the rank of

9 
 

 

yi , defined as qi  1 ((rank ( yi )  0.5) / n) , and  1 () is the inverse cumulative
distribution function for standard normal distribution. An averaged rank is assigned in the
presence of ties. For example, if there are n0 numbers of controls (i.e., yi  0 ) and n1
numbers of cases (i.e., yi  1 ), an average rank is assigned to the group with the same
phenotype value (e.g., rank ( yi { yi , yi  0})  (n0  1) / 2 is assigned to the control group).
Other than the quantile-transformed cross product kernel, distance-transformed
phenotypic similarities such as si ,i '  exp( | yi  yi ' |) or si ,i '  exp(( yi  yi ' )2 ) can also
be used. Nevertheless, as demonstrated below, the use of the quantile-transformed cross
product kernel leads to nice asymptotic properties. Genetic similarity, wi ,i ' , can be
calculated using a variety of similarity functions. One commonly used similarity function
for sequencing data is the weighted IBS, which gives more weight to rare variants,
P

2 | Gi , p  Gi ', p |

p 1

  p (1   p )

wi ,i '  

where  p

is

the

minor

allele

frequency

for

,

the

p-th

rare

variant,

and

P

   2 /  p (1   p ) is used to standardize the weight function so that wi ,i ' [0,1] . In
p 1

addition to the weighted IBS, distance-transformed similarity functions can also be used.
For example, we could use wi ,i '  exp( Di ,i ' ) , where Di ,i ' is the distance function (e.g., a
Euclidian distance).
Given si,i ' and wi ,i ' , the weighted U is formed to measure the association of P genetic
variants with the disease phenotype,
10 
 

 

Uw 

1
 wi,i ' si,i ' ,
n(n  1) i i '

(1)

where si,i ' is the 2 degree U kernel and wi ,i ' is the weight function for the weighted U.
When wi ,i '  1 , we can construct an un-weighted U by using only the phenotype similarity,

U uw 

1
 si,i ' .
n(n  1) i i '

(2)

In the weighted U, the summation is over all phenotypic similarities weighted by the
genotypic similarity, whereas only the phenotypic similarity is used for the un-weighted
U. When the genetic variants are associated with the phenotype, we would expect genetic
similarities to be concordant with phenotypic similarities, where larger weights are given
to larger values of phenotypic similarity. Therefore, the weighted U has a larger value
than the un-weighted U. Based on this concept, we could build an association test by
comparing the weighted U with the un-weighted U. The two U statistics, however, are
based on weights of different scales (i.e., wi ,i ' vs. constant 1), therefore, a constant c is
introduced to balance the two weight functions. The test statistic is then defined as,

WU seq  U w  cU uw , (3)
where the scaling constant c can be obtained by minimizing the L2 norm distance
between the two weight metrics, i.e., c  arg min{ ( wi ,i '  c) 2 } . Alternatively, we could
c 0

i i '

choose other types of c. For instance, we could obtain c by minimizing the L1 norm
distance between the two weight metrics, i.e., c  arg min{ | wi ,i '  c |} .
c0

11 
 

i i '

 

1.2 Asymptotic Distribution
When the weighted U is significantly larger than the rescaled un-weighted U, we reject
the null hypothesis and conclude there is an association between P genetic variants and
the phenotype. The p-value can be obtained by comparing the observed test statistic,
obs
obs
WU seq
) . For a small sample size, a
, with the null distribution, i.e., Pr(WU seq  WU seq

permutation test can be used to calculate the p-value. However, for a large sample size
and high-dimensional data, a permutation test could be computationally intense.
Therefore, we derive the asymptotic distribution of WU seq to efficiently assess the
significance level of the association.
We first rewrite the test statistic, WU seq , of equation (3) as a weighted summation of si,i ' ,

WU seq 

1
 ki,i ' si,i ' ,
n(n  1) i i '

(4)

where we define a new weight ki ,i '  wi ,i '  c . If the phenotypic similarity is measured by
the cross product kernel, si ,i '  qi qi ' , WU seq is simplified to a quadratic form, with all the
diagonal elements equal to 0 ( ki ,i  0 ). In such a case, it has a close connection with the
variance component score test in the linear mixed model, except that WU seq does not use
information from the diagonal terms ( ki ,i  0 ), and does not assume a Gaussian
distribution of the phenotype.
In U statistics theory, the asymptotic behavior of U depends on  1  Var ( E (h(Y1 , Y2 ) | Y2 ))
[Serfling 1981]. If  1  0 , the U statistic is a non-degenerated U and can be approximated
12 
 

 

by a normal distribution. If  1  0 , the U statistic is a degenerated U and can be
approximated by a mixture of chi-squared distribution. Since E(q1q2 | q2 )  q2 E(q1 )  0 ,
we have  1  0 . Thus the statistic WU seq is, by definition, a degenerated weighted U
statistic, and its limiting distribution can be obtained by taking the decomposition of the
weight function ki,i ' and the kernel function si,i ' [Serfling 1981; Shieh, et al. 1994; Wet
and Venter 1973]. WU seq can then be approximated by a linear combination of chisquared random variables,


n

m 1

l 1

nWU seq    m  l (1,2ml  1) , (5)
2
are iid chi-squared random variables with 1 degree of freedom and l
where 1,ml

(l=1,…,n) are obtained from the eigen-values l of matrix K  {ki ,i ' }nn , with

l  l / (n 1) . {m} are the eigenvalues of a general kernel function h (,) , obtained
from the following decomposition,


h(q1 , q2 )    mm (q1 )m (q2 ) ,
m 1

where {m ()} are the ortho-normal eigenfunctions corresponding to m . For the cross
product kernel, we can show that h(q1 , q2 )  q1q2  1 (q1 )1 (q2 ) , where 1 (q)  q . Thus,

 m  1{m1}

,



n

n

m 1

l 1

l 1

where

1

is

an

indicator

function,

and

 m  l   l  trace( K ) / (n  1)  0 . The limiting distribution of WUseq can be

13 
 

 


D

simplified to nWU seq   2, , where  2,   l 1,2l is mixture of chi-squared distributed
l 1

with a 0 mean and a finite variance (Appendix A). Given the asymptotical distribution of

WU seq , the p-value can then be calculated using the Davies method [Davies 1980]. In
Appendices (Figure A3), we graphically showed the shape of the distribution of the
mixture chi-squares and showed that it is very close to the permutation results when the
sample size is large.
1.3 Covariates Adjustment
Assume we have J covariates, X i  (1, xi1 ,, xiJ ) , i  1, 2,..., N . To adjust for the
potential confounding effects, we will first fit the transformed value of the phenotype,

Q  (q1 , q2 ,...qN )T , with covariates X  ( X1 , X 2 ,..., X N )T , and then use the residuals for
the association test. Based on this idea, we project Q onto the space spanned by X , and
obtain

the

prediction

Qˆ

,

where

Qˆ  X ( X T X ) 1 X T Q

.

By

denoting

H  I  X ( X T X )1 X T and ˆ 2  (Q  Qˆ )T (Q  Qˆ ) / ( N  J  1) , we can obtain the
residuals, Qe  HQ / ˆ . It is easy to show that Qe  X , so that a new response vector is
attained, which is perpendicular to the space spanned by the covariates. The test statistic

WU seq can be reconstructed as
WU seq   ki ,i ' qie qie' .
i i '

Using the same argument as above, WU seq with covariate adjustments can also be
approximated by a linear combination of chi-squared random variables,
14 
 

 
n

nWU seq ~  le 1,2l .
l 1

le  le / (n  1) and {le } are the eigen-values of matrix K e , where K e  HKH . Similar
to other methods, we should consider potential issues, such as nonlinearity of covariates
and high correlations between genetic variables and covariates, before implementing the
covariate-adjusting approach. The residual confounding is negligible when the model is
correctly specified. Directly using the approach without considering these issues may
lead to power loss.
1.4 Simulation
We conducted simulation studies to evaluate the performance of WU-SEQ, comparing it
with one of the most commonly used methods, SKAT [Wu, et al. 2011]. For all of the
simulations, we used genetic data from the 1000 Genome Project [Abecasis, et al. 2010],
to mimic a real sequencing data structure (e.g., LD pattern and allele frequency).
Specifically, we used a 1Mb region from the genome (Chromosome 17: 73443288344327), and randomly chose a 30kb segment from that 1Mb region for each simulation
replicate (if not specified otherwise). There are an average of 194 SNVs in the sampled
30kb segments. The minor allele frequency (MAF) of the SNV in the genome region
ranged from 4.50×10-4 to 4.99×10-1, with a distribution highly skewed to rare variants
(34.8% of the variants with an MAF<0.001, 69.1% of the variants with an MAF<0.01
and 80% of the variants with an MAF<0.03). Similar to the SKAT simulation studies
[Wu, et al. 2011], we selected a portion of the genetic variants with an MAF<0.03 as
functional SNVs. A number of individuals, ranging from 50 to 500, were randomly
15 
 

 

chosen as the study samples from all of the available individuals in the 1000 Genome
Project.
For each disease model, we simulated 1000 data replicates and obtained the type 1 error
and power for both WU-SEQ and SKAT. For SKAT, we used the link function according
to the phenotype distributions (i.e., the logit link for a binary phenotype and the identity
link for a continuous phenotype) [Wu, et al. 2011]; for WU-SEQ, we used the L2 norm to
choose the constant c, and did not specify an assumption on the phenotype distribution.
To be consistent, both methods used a weighted IBS to summarize genetic information.
The type 1 error and power were obtained by calculating the percentage of p-values
smaller than 0.05 from the 1000 data replicates.
1.4.1 Simulation I:
We first simulated a series of disease models, without considering covariates, and
investigated the influence of the direction and magnitude of effects on both methods.
Each data replicate was comprised of 500 subjects. We considered 4 types of
distributions for the phenotypes: binary, Gaussian, Student’s t with 2 degrees of freedom,
and Cauchy. Binary and Gaussian phenotypes are typically observed in population-based
studies. The Cauchy-distributed and t-distributed phenotypes represent continuous
phenotypes with more extreme values (i.e., heavy-tailed). We used the logistic model to
simulate the binary phenotype,
logit( P ( yi  1))    Gi  ,

where Gi and yi were the genotype and phenotype of the i-th individual, respectively. 
was a vector of regression parameters, measuring the effects of the genetic variants. For
16 
 

 

each simulation replicate, we sampled an effect vector from a multivariate normal


distribution, MVN ( 1, 2 I ) , where 1 was the vector of 1 and I was the identity matrix.

For Gaussian phenotypes, we simulated the model as,
yi    Gi    i ,

where  i ~ N (0,  2 ) . For the t-distributed phenotype, we simulated the model as:
yi    Gi    i ,  i ~ tdf 2 .

For the Cauchy type of phenotype, we used
yi ~ cauchy ( ai , b ) .

ai and b were the location parameter and the scale parameter of the Cauchy distribution,
respectively, where ai    Gi  and b was a fixed value. For all four types of phenotypes,
we considered different directions of genetic effects. For the first scenario, we assumed

  0 , whereby half of the functional SNVs were deleterious and half of the functional
SNVs were protective. For the second scenario, we assumed   0 , whereby the
majority of the functional SNVs were deleterious. For each scenario, we varied the
percentage of functional SNVs from 5% to 50%. The details of the simulation setting are
provided in Table A1.

17 
 

 

Table 1: Type I error and power comparisons of WU-SEQ and SKAT under a different
direction and magnitude of effects
Effect*

Pct**

Type I Error/Power
Binary

Null

A1

A2

Normal

Student’s t

Cauchy

SKAT

WU-seq

SKAT

WU-seq

SKAT

WU-seq

SKAT

WU-seq

0

0.037

0.037

0.037

0.043

0.110

0.038

0.194

0.041

5

0.281

0.284

0.174

0.185

0.164

0.193

0.167

0.116

10

0.417

0.423

0.315

0.329

0.212

0.331

0.160

0.201

30

0.828

0.841

0.675

0.688

0.369

0.695

0.192

0.451

50

0.935

0.944

0.856

0.869

0.525

0.880

0.177

0.635

5

0.065

0.073

0.085

0.093

0.140

0.134

0.190

0.095

10

0.166

0.173

0.155

0.163

0.192

0.288

0.175

0.179

30

0.539

0.547

0.524

0.527

0.471

0.784

0.199

0.578

50

0.773

0.780

0.790

0.798

0.733

0.960

0.187

0.818

* “Null” corresponds to the null model with no functional variant; “A1” corresponds to the setting where
half of the functional rare variants have a deleterious effect and the other half of functional rare variants
have a protective effect; “A2” corresponds to the setting where all of the functional rare variants have
deleterious effects.
** Percentage of functional rare variants.

We have summarized the results in Table 1. From Table 1, we can see that WU-SEQ had
a well-controlled type 1 error rate under various phenotype distributions. In contrast,
SKAT had an inflated type 1 error rate when the underlying distribution was the Cauchy
distribution (0.194) or t distribution (0.110). Similar to SKAT, WU-SEQ allowed for
different directions of genetic effects (i.e., both deleterious and protective effects). For
both scenarios (i.e.,   0 and   0 ), WU-SEQ obtained a power that was comparative
to or at a slightly higher power than SKAT. As the percentage of the functional SNVs
increased, both WU-SEQ and SKAT gained improved power for binary and Gaussian
18 
 

 

phenotypes. For the Cauchy phenotype and t-distributed phenotype, however, the power
of WU-SEQ was significantly higher than that of SKAT. With the Cauchy phenotype, for
example, SKAT’s power ranged from 0.167 to 0.192 for   0 and ranged from 0.175 to
0.199 for   0 , while the power of WU-SEQ increased from 0.116 to 0.635 for   0
and increased from 0.095 to 0.818 for   0 .
1.4.2 Simulation II:
In simulation II, we investigated the influence of sample size on the performance of WUSEQ and SKAT. We used the same models as in simulation I to simulate binary,
Gaussian, Student’s t and Cauchy phenotypes. For simplicity, we assumed   0 and a
fixed  2 , and varied the sample size from 50 to 500 (Table A2). We assumed 0% of the
genetic variants were functional for the null model, and assumed 50% of the genetic
variants were functional for the disease models.
The results are summarized in Table 2. Type 1 error rates of WU-SEQ were well
controlled at the 0.05 level for different sample sizes (i.e., 50, 100, 200 and 500) and
various phenotype distributions (i.e., binary, Gaussian, Student’s t and Cauchy), while the
type 1 error of SKAT was inflated for the Cauchy phenotype (0.118~0.194) and the tdistributed phenotype (0.110~0.131). We also observed that both WU-SEQ and SKAT
had conservative type I error rates when the study sample sizes were small. For instance,
the type I error rates of SKAT and WU-SEQ were 0.001 and 0.005, respectively, when
the sample size was 50 and the phenotype was binary.

19 
 

 

Table 2: Type I error and power comparisons of WU-SEQ and SKAT for different sample sizes
Effect*

Sample
size

Type I Error/Power
Binary

Normal

Student’s t

Cauchy

SKAT

WU-seq

SKAT

WU-seq

SKAT

WU-seq

SKAT

WU-seq

50

0.001

0.005

0.027

0.047

0.117

0.044

0.118

0.041

100

0.007

0.013

0.035

0.050

0.131

0.051

0.134

0.052

200

0.022

0.029

0.029

0.039

0.121

0.043

0.165

0.056

500

0.037

0.037

0.037

0.043

0.110

0.038

0.194

0.041

50

0.016

0.035

0.145

0.177

0.160

0.164

0.126

0.086

100

0.119

0.181

0.279

0.318

0.220

0.296

0.160

0.164

200

0.494

0.536

0.521

0.545

0.317

0.557

0.178

0.291

500

0.935

0.944

0.856

0.869

0.525

0.880

0.177

0.635

Null

A1

* “Null” corresponds to the null model with no functional variant, “A1” corresponds to the setting where
half of the functional rare variants have a deleterious effect and the other half of the functional rare variants
have a protective effect.

The power of WU-SEQ increased as the sample size increased for all four types of
phenotypes. The power of SKAT remained almost the same for the Cauchy phenotype
(0.126~0.177) when the sample size increased. For both the binary and Gaussian
phenotypes, the power of WU-SEQ was comparable or slightly higher than that of SKAT,
while the power of WU-SEQ was significantly higher than that of SKAT for the t- and
Cauchy-distributed phenotypes.

20 
 

 

We performed additional simulations to evaluate the performance of WU-SEQ for highdimensional settings. In particular, we randomly chose a 500kb segment for each
simulation replicate. The sample sizes were fixed at 100 and we simulated the phenotype
using the settings in Table A2. There was an average of 1873 SNVs for each subject. The
results show that WU-SEQ can control the type 1 error well, and maintain a good power
performance when the number of SNVs is much larger than the sample size (Table 3).

Table 3: Type I error and power comparisons of WU-SEQ and SKAT when the number
of variants is much larger than the sample size*
Effect

Null

A1

Method

Distribution
Binary

Normal

Student’s t

Cauchy

SKAT

0.007

0.021

0.130

0.196

WU-seq

0.012

0.059

0.042

0.059

SKAT

0.153

0.896

0.708

0.264

WU-seq

0.304

0.933

0.864

0.558

*The average number of SNVs is 1873 in these settings, while the sample size is set at 100. The effect size
is set as the same as in Table A2

1.4.3 Simulation III:
In genetic association analyses, we often need to adjust important covariates (e.g., gender)
for potential confounding effects. Therefore, we conducted another set of simulations to
investigate the performance of WU-SEQ considering covariate adjustments. In this
simulation, we simulated two covariates, x1,i ~ Bernoulli(0.3) and x2,i ~ N (0,1) , for each
subject. The binary phenotype was simulated using
21 
 

 
logit( P ( yi  1))    X i  Gi  ,

where X i  ( x1,i , x2,i ) and   ( 1 ,  2 )T were the covariates and the effects of the covariates,
respectively. Similarly, we used a linear model for the Gaussian phenotype,
yi    X i  Gi    i ,  i ~ N (0,  2 ) ,

and the following model for the t-distributed phenotype,
yi    X i  Gi    i ,  i ~ tdf 2 .

The Cauchy phenotype was simulated using,
yi ~ cauchy ( ai , b ), where ai    X i  Gi  .

We varied the sample size from 50 to 500, and fixed the percentage of functional rare
variants at 50% for the disease models (Table A3).
The results are summarized in Table 4. We found that the projection method worked very
well for WU-SEQ with regards to covariate adjustment, and the type I error was wellcontrolled under different phenotype distributions. SKAT, however, only had a wellcontrolled type 1 error when the underlying assumptions were satisfied and the link
function was correctly specified. Similar to simulations I and II, we observed comparable
power between WU-SEQ and SKAT for both the binary and Gaussian phenotypes, and
WU-SEQ had significantly higher power than SKAT for the Cauchy and t-distributed
phenotypes. We also found the type 1 error rates were less conservative for studies with a
binary phenotype with covariate adjustment. This could be due to the fact that the
increased levels of residuals after covariate adjustment lead to a better approximation of
normal distribution.
22 
 

 

Table 4: Type I error and power comparisons of WU-SEQ and SKAT for covariate adjustment
Effect

Sample
size

Type I Error/Power
Binary

Normal

Student’s t

Cauchy

SKAT

WU-seq

SKAT

WU-seq

SKAT

WU-seq

SKAT

WU-seq

50

0.027

0.027

0.040

0.059

0.084

0.069

0.128

0.063

100

0.030

0.034

0.042

0.054

0.106

0.064

0.131

0.065

200

0.028

0.028

0.036

0.050

0.110

0.060

0.173

0.061

500

0.036

0.036

0.043

0.048

0.115

0.065

0.194

0.047

50

0.060

0.071

0.128

0.165

0.297

0.400

0.140

0.112

100

0.156

0.210

0.306

0.342

0.483

0.664

0.164

0.159

200

0.522

0.545

0.500

0.527

0.682

0.882

0.163

0.262

500

0.929

0.934

0.850

0.855

0.884

0.997

0.180

0.554

Null

A1

* “Null” corresponds to the null model with no functional variant; “A1” corresponds to the setting in which
half of the functional rare variants have a deleterious effect, and the other half of the functional rare
variants have a protective effect.

The above three sets of simulations show the performance of WU-SEQ compared with
SKAT under different scenarios. We also conducted additional simulations to investigate
the performance of WU-SEQ for different choices of c (Table 5), and different choices of
U kernel (Table 6). The tests in which c was chosen based on L1 norm and L2 norm
showed that both can protect the type I error, but the L2 norm shows slightly higher

23 
 

 

power (Table 5). We also compare the U kernel with a quantile transformation and
without a quantile transformation (using the rank of the phenotype value). Both can
control the type I error, but using a quantile transformation leads to higher power for
normal and t-distributed phenotypes (Table 6).
Table 5: Type I error and power comparisons of WU-SEQ by using different c*
Effect

Null

A1

Method

Distribution
Binary

Normal

Student’s t

Cauchy

WU-SEQL1**

0.013

0.048

0.056

0.041

WU-SEQL2***

0.013

0.049

0.058

0.043

WU-SEQL1

0.164

0.313

0.300

0.150

WU-SEQL2

0.17

0.322

0.306

0.159

*The sample size for this simulation is 100 and the effect size is set as the same as in Table A2.
**In WU-SEQL1, we choose c using the L1 norm.
***In WU-SEQL2, we choose c using the L2 norm.

Table 6: Type I error and power comparison of WU-SEQ by using normal quantile or rank*
Effect

Null

A1

Method

Distribution
Binary

Normal

Student’s t

Cauchy

WU-SEQRK**

0.017

0.02

0.031

0.032

WU-SEQQT***

0.017

0.047

0.057

0.050

WU-SEQRK

0.161

0.243

0.304

0.181

WU-SEQQT

0.161

0.314

0.322

0.169

*The sample size for this simulation is 100 and the effect size is set as the same as in Table A2.
**In WU-SEQRK, we use a cross-product kernel based on the rank of the phenotype value without a
quantile transformation.
***In WU-SEQQT, we use a cross-product kernel with a quantile transformation.

24 
 

 

1.5 Application to the DHS Sequencing Data
To further evaluate the performance of WU-SEQ and SKAT, we applied both methods to
the sequencing data from the Dallas Heart Study (DHS) [Romeo, et al. 2009]. The DHS
sequencing data is comprised of 4 genes, ANGPTL3, ANGPTL4, ANGPTL5 and
ANGPTL6. We were interested in evaluating the association of these genes with body
mass index (BMI), cholesterol, and very low density lipoprotein cholesterol (VLDL).
Following a previous study [Wu, et al. 2011], we considered age, gender, and race as
covariates in the model. Prior to the association analysis, we re-assessed the quality of the
genotype data. As a part of the quality assessment, we eliminated the SNVs and subjects
that had a high missing rate. After the quality control, 230 rare variants (54 SNVs, 63
SNVs, 61 SNVs, and 52 SNVs, from ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6,
respectively) and 2598 subjects remained for the analysis. In the analysis, a random
imputation based on allele frequency was used to impute missing genotypes.
The distribution of the SNVs in the DHS was heavily skewed to the rare variants (Figure
S2), wherein 93.5%, 87.4%, and 70% of all SNVs had an MAF of less than 3%, 1% and
0.1%, respectively. Similar to previous studies [Wu, et al. 2011], we selected SNVs with
an MAF<3% for the analysis in order to detect associations due to rare variants. The
distributions of MAF for the 4 genes are given in Figure 1. From Figure 1, we can see
that the distributions of MAF were highly skewed to the left, with the majority of the
SNVs having an MAF<0.1%. The distributions of the three phenotypes, BMI, cholesterol
and VLDL, can be viewed in Figure 2. Among the three phenotypes, the distribution of
VLDL was heavily skewed to the left, which is unlikely to follow the Gaussian
distribution.
25 
 

 
Figure 1: Distribution of SNVs with an MAF<0.03 for the 4 genes in the Dallas Heart Study

Figure 2: Distributions of the 3 phenotypes in the Dallas Heart Study
 

 
 

26 
 

 

We applied both WU-SEQ and SKAT to the association analyses of 4 genes, with
consideration of three covariates: gender, race and age (Table 7). To be consistent, we
used a weighted IBS for both WU-SEQ and SKAT.

Table 7: The association of 4 candidate genes with 3 continuous phenotypes (i.e., BMI,
Cholesterol, and VLDL) in the Dallas Heart Study
Gene

BMI

Cholesterol

VLDL

SKAT

WU-SEQ

SKAT

WU-SEQ

SKAT

WU-SEQ

ANGPTL3

0.633

0.752

0.559

0.662

0.471

0.198

ANGPTL 4

0.121

0.255

0.16

0.316

0.105

0.007

ANGPTL 5

0.633

0.607

0.95

0.926

0.683

0.664

ANGPTL 6

0.874

0.773

0.025

0.059

0.433

0.453

All 4 genes

0.503

0.641

0.117

0.373

0.316

0.032

* P-value from the association analysis, adjusting for age, gender, and race. 

WU-SEQ detected a strong association of ANGPTL4 (p-value=0.007) with VLDL, while
SKAT found only a marginal association (p-value=0.105). To further explore this finding,
we combined all 4 genes into a gene set, and then tested its association with VLDL
(Table 7). The association remained significant (p-value=0.032) using WU-SEQ, but not
SKAT (p-value=0.316). None of the 4 genes was found to be associated with BMI using
either WU-SEQ or SKAT. A marginal association was detected between ANGPTL6 and
cholesterol, with a p-value of 0.059 from WU-SEQ and a p-value of 0.025 from SKAT.
To evaluate the performance of both methods for the binary phenotype, we dichotomized
the data into a highest quartile (coded as 1), and a lowest quartile (coded as 0) for each
27 
 

 

phenotype (Table A4). Overall, WU-SEQ attained similar results to SKAT. For example,
both methods found no association of all 4 genes with BMI, a marginal association of all
4 genes with cholesterol, and a significant association of all 4 genes with VLDL.
1.6 Discussion
Targeted, exome and whole-genome sequencing studies are now underway for the
discovery of new genetic variants, particularly rare variants, associated with complex
diseases. With the emerging of a large amount of high-dimensional sequencing data,
great challenges have been posed to statistical analysis of sequencing data. Conventional
single-locus analyses have been shown to have low power for analyzing sequencing data,
not only because of the low frequencies of rare variants, but also because of the use of a
more stringent significance threshold. Joint association analyses of multiple genetic
markers, as demonstrated in several studies, can greatly reduce the dimensionality and
obtain powerful performance for sequencing data [Madsen and Browning 2009;
McCarthy, et al. 2008; Neale, et al. 2011; Tzeng, et al. 2009; Wu, et al. 2011].
Non-parametric methods, such as U-statistic-based methods, have shown great promise
for high-dimensional data analysis, especially when the underlying phenotype
distributions and modes of inheritance are unknown. Several U-statistic-based methods
were recently adopted in genetic association studies to detect common variants
underlying complex human diseases [Li 2012; Li, et al. 2011; Schaid, et al. 2005; Wei, et
al. 2012; Wei, et al. 2013a]. In this chapter, we propose a non-parametric method, WUSEQ, for testing the joint association of multiple SNVs with disease phenotypes. Built
under the framework of the weighted U statistic, WU-SEQ is robust against different
28 
 

 

underlying distributions of phenotypes. As demonstrated by the simulation study, WUSEQ had well-controlled type 1 errors and attained high power under binary, Gaussian, tdistributed and Cauchy phenotypes. In contrast, the performance of existing methods,
such as SKAT, depends on satisfying of the underlying assumptions. If the assumptions
were violated (e.g., the distribution followed a heavy-tailed distribution such as Cauchy),
SKAT had an inflated type 1 error, and had low power to detect an association. Although
SKAT can handle classic types of phenotypes (e.g., the exponential family) by using
appropriate link functions, WU-SEQ can be applied to a wider range of phenotypes
without any distribution assumptions. Here, we used the normal quantile to build the test
statistic. Alternatively, rank can also be used to construct the test statistic. When the
sample size is sufficiently large, we expect that using quantile and using rank would
reveal similar results. Nevertheless, for small sample sizes, using a normal quantile could
be more powerful and less conservative than using rank. Yet, when the distribution is
heavily tailed (e.g., Cauchy), using rank could attain a slight advantage over using
quantile (e.g., having a well-controlled type I error) for small sample sizes (Table 6). As a
similarity-based method, WU-SEQ is flexible, to accommodate various types of data. The
genetic variants used to construct genetic similarity are not constrained to categorical data
(e.g. SNV); they can also be count data (e.g., CNV) and continuous data (e.g., expression
data). Moreover, the genetic variants to be jointly tested can be of high dimension. In the
simulation studies, when the number of variants is much larger than sample sizes, we
showed that WU-SEQ can still control type I error and achieve satisfactory power. In
high dimensional settings, it is also important to select meaningful genetic variants, as
well as to select appropriate similarity kernels and optimize the computation.
29 
 

 

In WU-SEQ, we first summarize the information from multiple markers into genetic
similarities, and then evaluate the genetic similarities with the corresponding phenotypic
similarities. If the genetic similarities are concordant with the trait similarities, we
anticipate a large value for the test statistic, from which we could infer an association.
Here, we used a weighted IBS to construct genetic similarity, which assigns more
weights to rare variants. Other types of genetic similarity metrics can also be used, such
as those that consider interactions. Prior knowledge can also be incorporated, by
assigning a different weight for each variant according to its biological plausibility. To
measure the phenotype similarity, we used the cross-product kernel, si ,i '  h(qi , qi ' )  qi qi ' ,
which led to nice asymptotic properties of the test statistic. Nevertheless, we can use
other types of kernel functions for the phenotype similarity. If the kernel function
satisfies two regularity conditions, EF (h2 (Y1 , Y2 ))   and Var ( E (h(Y1 , Y2 ) | Y2 ))  0 , we
can calculate the asymptotic p-value by approximating the weighted U to a linear
combination of chi-squared variables. Even if the two conditions are not satisfied, we can
still use permutation to obtain the p-value of the association test. The choice of the
genetic- or phenotype-similarity metrics depends on the underlying disease model, which
could lead to different performance. For example, when the percentage of risk variants
increases and/or when the effect sizes of risk variants follow one direction, the burden
tests are expected to be more powerful than WU-SEQ. In this case, we can use genetic
similarities accommodating the underlying disease models. One of the strategies is to first
collapse all genetic variants by weighted sum and then calculate the genetic similarity.
Although joint association analyses greatly reduce the dimensionality for sequencing data,
the computation could be intense if a permutation test is used. We derived the asymptotic
30 
 

 

distribution for WU-SEQ to facilitate the high-dimensional data analysis. When the
sample size of a study is small, asymptotic properties may not hold and a permutation test
can be used. We also use a projection method in WU-SEQ to take covariates into account.
Through simulations and a real data analysis, we found the covariate-adjusted WU-SEQ
was robust for different types of phenotype distributions. In addition, we found that WUSEQ had almost the same power as SKAT when the phenotype followed a Gaussian
distribution and covariates were not considered. The reason for this is that SKAT uses a
variance component score test, and WU-SEQ has a close connection with the variance
component score test for the Gaussian-distributed phenotype. In fact, when covariates are
not considered, SKAT can be viewed as a special case of WU-SEQ by using the crossproduct kernel without a quantile transformation. When covariates are considered,
especially when the phenotype does not follow the Gaussian distribution (i.e. the link
function in SKAT is not an identity link), WU-SEQ could attain more computational
efficiency than SKAT. In SKAT, one needs to first fit the null model using a generalized
linear model, which involves iterative estimation. Furthermore, the calculation of a
projection matrix in SKAT involves additional matrix multiplication (i.e., calculating

H  V  VX ( X TVX )1 X TV in SKAT vs. calculating H  I  X ( X T X )1 X T in WU-SEQ,
where V is the covariance matrix estimated from the null model), and the calculation of
the limiting distribution in SKAT involves an additional large matrix decomposition (i.e.,
calculating H 1/2 KH 1/2 , where H 1/2 needs to be calculated from an eigen-decomposition
of the H matrix). The covariate-adjusted WU-SEQ does not require iterative estimation,
and needs less matrix multiplication and decomposition, which offers a greater
computational advantage over SKAT.
31 
 

 

In the analysis of the DHS study, WU-SEQ detected a strong association of ANGPTL 4
with VLDL. By further studying the distribution of VLDL, we found the distribution was
heavily skewed, which does not fit the normal assumption. The advantages of WU-SEQ
are not limited to this specific case; the method could be useful for other cases (e.g.,
small sample sequencing studies and sequencing studies with extreme-phenotype design).
A recent version of SKAT also considers the extreme phenotype by assuming a truncated
Gaussian distribution [Barnett, et al. 2013]. While SKAT needs to make adjustments and
certain assumptions for the truncated distribution of extreme phenotypes, WU-SEQ, as a
non-parametric method, can be directly applied to studies with extreme-value phenotypes.

32 
 

 

CHAPTER 2: Gene Trait Similarity U test

Most existing joint association tests are parametric or semi-parametric, which often rely
on certain assumptions (e.g., the normal distribution assumption). When the assumptions
are violated, it may lead to power loss or inflated type I error. Moreover, there lacks new
methods for sequencing association analyses of multiple traits, especially when the study
considers different types of traits (e.g., some traits are binary while others are continuous).
The study of multiple traits is important in biomedical research. Many studies collect
multiple biochemical measurements to reflect the underlying disease status. These
multiple measurements are usually more related to the biological mechanism of the
disease, and are, thus, more likely reflects heritable components of the disease. It also
becomes popular for human genetic studies to study multiple related phenotypes. For
instance, multiple traits, such as nicotine dependence and alcohol dependence, are
collected to study genetic variants contributed to co-morbidity of substance dependence.
Because of the sequencing cost, most sequencing studies are based on a small number of
subjects. It is, therefore, crucial to develop a statistical method that has good statistical
properties (e.g., non-conservative type I error and good power performance) for a small
sample size. For example, the original SKAT has been considered conservative for small
sample size studies. An Adj-SKAT was then proposed to adjust the conservative p-value
using the moment-matching method [Lee, et al. 2012a]. Finally, few analytic methods for
sequencing studies have targeted the computational issue. Most methods have been
implemented in R, without consideration of computational optimization. In the era of
exome-sequencing and whole-genome-sequencing, we need more efficient software (e.g.
33 
 

 

based on C or C++) to handle the massive amount of data and bring the computation time
to a reasonable range.
To achieve these goals, we propose a Gene Trait Similarity U test, referred to as GTSU.
In GTSU, we use two different similarity functions to summarize genetic information and
phenotypic information, and then form a test under the weighted U framework. We have
derived the asymptotic distribution of test statistics and implemented the method in both
R and C++. The proposed method has several remarkable features: 1) it is non-parametric
and is thus distribution free; 2) it can handle multiple traits; 3) it performs well under
small sample size without any adjustment; 4) it is computationally efficient for massive
amount of data (i.e., using C++ version of GTSU) ; and 5) it is statistically powerful. We
have conducted extensive simulation studies to evaluate the type 1 error rates and
compare the power of our method with several other popular methods. Finally, we
applied our method to the Dallas Heart Study to test the association of four candidate
genes with multiple cholesterol related traits.
2.1 Weighted U Statistic
Suppose that n subjects are sequenced in the study, whereby we are interested in the
association of L traits ( yi ,l , 1  i  n , 1  l  L ) with M genetic variants ( g i , m , 1  i  n ,
1  m  M ). For each subject i, we observe a vector of traits yi ( yi  ( yi ,1 , yi ,2 ,..., yi , L ) )

and a vector of genotypes gi ( gi  ( gi ,1 , gi ,2 ,..., gi , M ) ). For the case of L=1 (or M=1), it is
simplified to a single-trait association analysis (or single-locus analysis). When L>1, we
can perform either a multiple-trait association analysis or a single-trait association
analysis of each trait. In addition, each trait can be of a different type (e.g., continuous or
34 
 

 

categorical). The number of genetic variants M can be very large, for example, in the case
of a sequencing study, M can be greater than n.
We first define the trait similarity S i , j between subject i and subject j by:
Si , j  h( yi , y j ),
and define their genetic similarity Ki , j by:
Ki , j  f ( gi , g j ).
Here, the similarity measurements can be of a general form as long as they satisfy the
finite second moment condition (i.e. E (h 2 (Y1 , Y2 ))   and E ( f 2 (G1 , G2 ))   ). We then
center the trait similarity by:
Si , j  h ( yi , y j )
 h( yi , y j )  E (h( yi , Y j ))  E (h(Yi , y j ))  E (h(Yi , Y j ))

.

(1)

We also center the genetic similarity ( K i , j  f ( gi , g j ) ) in a same manner. The gene-trait
similarity weighted U (GTSU) is then defined as the summation of the trait similarity
weighted by the genetic similarity:
U

1
 K i , j Si, j ,
n(n  1) i  j

(2)

where K i , j s are considered as the weight function and Si , j s are considered as the U
kernel. In our formulation of the weighted U statistic, the role of genetic similarity and
trait similarity are interchangeable. In other words, we can also consider Si , j s as the
weight function and K i , j s as the U kernel.
35 
 

 

2.2 Similarity Measurement
The form of a trait similarity h (,) and a genetic similarity f (,) are flexible. According
to different types of genetic variants and the purpose of the analysis, we can choose
different forms of trait similarities and genetic similarities.
For a joint association analysis of multiple SNVs, where the genotypic data are
categorical, we can use either IBS or weighted-IBS to measure the genetic similarity
[Lynch and Ritland 1999]. Assuming the genetic variants ( g i ,m , 1  i  n , 1  m  M ) are
respectively coded as 0, 1 and 2 for AA, Aa and aa, we can define the IBS-based genetic
similarity between the individuals i and j as:
M

K

IBS
i, j



 2 | g
m 1

i ,m

 g j ,m |

2M

.

Alternatively, we can use a weighted-IBS (wIBS) to emphasize the effects of rare
variants. For a sequencing study, we can define wIBS-based genetic similarity as:
M

KiwIBS

,j

wm (2 | gi ,m  g j ,m |)


m 1

,

where wm represents the weight for the m-th SNV in the SNV set, and  is a scaling
M

constant, defined as   2 wm . wm is usually defined as a function of minor allele
m 1

frequency (MAF,  m ). For example, we can define the weight

wm as

wm  1/  m (1   m ) . When the genetic data are count data or continuous data, we can
choose other forms of f (,) to measure the genetic similarity (e.g., euclidean distance
based similarity), which we will leave for further investigations.
36 
 

 

For trait similarity, we define a unified measurement for both categorical and continuous
traits based on a normal quantile. For each trait yl ( 1  l  L ), we first define the
corresponding normal quantile by,

qi ,l   1 ((rank ( yi ,l )  0.5) / n) ,
where rank () calculates the rank of the trait value yi ,l in the corresponding trait vector

yl . When there are ties, we assign the averaged rank.  () is the inverse cumulative
density function for a standard normal distribution. We can then calculate the Euclidian
Distance (ED)-based trait similarity between subjects i and j:
L

2
SiED
, j  exp(   l ( qi ,l  q j ,l ) ) ,
l 1

where l represents the weight for the l-th traits, and can be defined based on the
importance of the traits. If there is no prior knowledge, we can just set all weights equal,

l  1/ L . The ED-based trait similarity can be easily modified to take the correlation
among the traits into account:
 1

(qi  q j )T  (qi  q j )  ,
SiED
, j  exp  
 L


where qi  (qi1 ,..., qiL )T and  can be chosen to reflect the correlation structure among
the traits. For example, we ca n define  as,
(

1 n
qi qiT ) 1 .

n i 1

Other than the ED-based trait similarity, we could also define the trait similarity based on
cross product [Tzeng, et al. 2009]. The types of the trait similarity or the genetic
37 
 

 

similarity can be chosen based on different purposes, which may influence the power of
the method. For simplicity, we use ED-based trait similarity.
2.3 Hypothesis Testing
By the definition of the centered similarity, we can show that E (h (Yi , Y j ))  0 and

E ( f (G1 , G2 ))  0 (Appendix A). Under the null hypothesis, when the genetic variants are
not associated with multiple traits, E (U )  0 (Appendix A). Under the alternative, when
the genetic variants are associated with multiple traits, we expect that the trait similarity
is concordant with the genetic similarity. In other words, the positive trait similarities is
weighted heavier and the negative trait similarities is weighted lighter, leading to a
positive value of U statistic. A formal statistical test can be performed to determine its
significance level. The p-value can be calculated by P(U  Uobs ) , where U obs is the
observed value of U.
To access the significance of the genetic association, we can use permutation or bootstrap.
However, both are computationally expensive for high-dimensional data. Thus, we derive
the asymptotic distribution of GTSU.
By considering the genetic similarity as a weight function and the trait similarity as a U
kernel, GTSU is in the form of a weighted U statistic [Gregory 1977a; Lindsay, et al.
2008; O'Neil and Redner 1993; Shapiro and Hubert 1979; Shieh 1997]. More specifically,
because its kernel satisfied Var ( E (h (Y1 , Y2 ) | Y2 ))  0 (Appendix A), GTSU is actually a
degenerated weighted U statistic. We can decompose the centered trait similarity by:


h ( y1 , y2 )   ss ( y1 )s ( y2 ) ,
s 1

38 
 

 

where {s } and {s ()} are eigenvalues and eigenfunctions for the kernel h (,) and all
the eigenfunctions are orthogonal:
0, if s  s
.
1, if s  s

 s ( y1 )s ( y1 )dF ( y1 )  

Similarly, we can decompose the centered genetic similarity by:


f ( g1 , g 2 )   tt ( g1 ) t ( g 2 ) .
t 1

Then, we can write the GTSU as (see Appendix B2):
1  
 1 n *

U
sign


t (Gi )s* (Yi ) 
(
)


t s 
n  1 t 1 s 1
 n i 1

 
n
1
1
sign(t s )  (t* (Gi )t* (Yi )) 2 ,


n  1 t 1 s 1
n i 1

2

where t* (Gi ) | t |0.5 t (Gi ) and s* (Yi ) | s |0.5 s (Yi ) .
Using the above form, we can show that the limiting distribution of GTSU is the
weighted sum of independent chi-square random variables. This is the result of theorem 1
below, which is proved in Appendix B2.
Theorem 1. Let h (Y1 , Y2 ) and f (G1 , G2 ) respectively be the centered trait similarity
and the genetic similarity. Define U as U  1/ (n(n  1)) h (Yi , Y j ) f (Gi , G j ) . Assuming
i j

D





t 1

s 1

E (h 2 (Y1 , Y2 ))   , E ( f 2 (G1 , G2 ))   and Y  G , we have nU   t  s (  st2  1) ,
where { st2 } are independent chi-square random variables with 1 degree of freedom.
2.4 The Power and Sample Size Calculation

39 
 

 

In this subsection, we study the properties of GTSU under the alternative hypothesis. We
derive the asymptotic distribution of GTSU under the alternative, and provide an
analytical method to calculate the power and sample sizes for sequencing association
studies.
Under the alternative hypothesis, we assume E (h (Y1 , Y2 ) f (G1 , G2 ))    0 and

Var ( E (h (Y1 , Y2 ) f (G1 , G2 ) | (Y2 , G2 )))   1  0 , where  measures the strength of the
association. It is easily to show that GTSU is an unbiased estimator:
E (U ) 

1
 E ( f (Gi , G j )h (Yi , Y j ))   .
n( n  1) i  j

Using the Hoeffding projection, we can show that GTSU asymptotically follows a normal
distribution with a mean of  and a variance of 4 1 / n . This is the result of Theorem 2,
which is proved in Appendix B3.
Theorem 2. Let h (Y1 , Y2 ) and f (G1 , G2 ) respectively be the centered trait similarity
and the genetic similarity. Define U as U  1/ (n(n  1)) h (Yi , Y j ) f (Gi , G j ) . Assuming
i j

E (h 2 (Y1 , Y2 ))   , E ( f 2 (G1 , G2 ))   and Y is associated with G we have
D

n (U   )  N (0, 4 1 ) .

The power of the GST at significance level  can be calculated by,

P nU  q1 
 n (U   ) q1  n 
 P


2 n 1 
 2  1
n  q1
 (
),
2 n 1
40 
 

 

where q1 is the 1   quantile for





t 1

s 1

t  s (  st2  1) and () is the CDF for standard

normal distribution. The sample size to achieve power  can be calculated by solving
(

n  q1
)   for the smallest n. Denote Z  as the  quantile for a standard normal
2 n 1

distribution, and the sample size is given by,


( Z   1  Z 2 1   q1 )2 

n  min n 
.
2
nN



2.5 Computation and Implementations
We can rewrite the trait similarity and the genetic similarity in the matrix format,
S  {Si , j }nn and K  {K i , j }nn . The centered similarity matrix can be obtained by,
S  ( I  J ) S ( I  J ) ,

K  ( I  J ) K ( I  J ) ,
where I is an n-by-n identity matrix and J is an n-by-n matrix, with all elements being
1 / n (Appendix B4). GTSU can be rewritten as:

U

1
 K i, j Si, j .
n(n  1) i  j

In this form, U can be viewed as the sum of the element-wise product of the two
matrices, K 0 and S0 , where K 0 and S0 are obtained by assigning 0 to diagonal elements
of matrices K and S . We use a matrix eigen-decomposition to approximate the eigenvalues in the function decomposition. Denote {s } and {k } as the eigen-values for the

41 
 

 

matrices S0 and K 0 , respectively (Appendix B5). The limiting distribution of U is given
by:
n

nU ~ 
s 1

s

n

t

 n (
n
t 1

2
st

 1) ,

where { st2 } are independent chi-square random variables with 1 degree of freedom. The
p-value can be calculated using Davies’ method [Davies 1980], Liu’s method [Liu, et al.
2009a] or Kuonen’s method [Kuonen 1999].
2.6 Covariates Adjustment
Important covariates, such as sample’s ethnic background, need to be carefully
considered in genetic association studies. Failure to accommodate covariates that have
confounding effects could lead to false positive results. In this section, we illustrate how
to adjust covariates in GTST.
Suppose that there are P covariates ( xi , p , 1  i  n , 1  p  P ) that need to be adjusted
in the association study of L traits and M genetic variants. Let X denote the n-by-P matrix
for the covariates ( X  {xi , p }nP ).Our strategy is to performed adjustment on both the
genetic similarity and the trait similarity, and then to form a U statistic based on the
adjusted similarities, which reflects the true association between the genetic variants and
multiple traits. In order to do that, we use a two-sided projection [Lindsay, et al. 2008] to
calculate the covariate-adjusted trait similarity ( S cov ) and genetic similarity ( K cov ):
S cov  ( I  X ( X T X )1 X T ) S0cen ( I  X ( X T X ) 1 X T ) ,
K cov  ( I  X ( X T X ) 1 X T ) K 0cen ( I  X ( X T X ) 1 X T ) .
42 
 

 

Then the covariate-adjusted GTSU can be calculated as:

U cov 

1
n2

n

n

 K
i 1 j 1

cov
i, j

Sicov
,j .

We include the diagonal terms in the covariate-adjusted similarities because they also
contain the similarity information after the adjustment. In fact, the covariate-adjusted
GTSU is a weighted V statistic, and its asymptotic distribution can be attained using a
similar method as the one for the weighted U statistic [Shieh 1997]. After eigendecomposition of the two similarity matrices, the asymptotic distribution is also the sum
of independent chi-square random variables:
nU cov ~

n
n
1
cov
cov 2


s  k  s , k ,
n( n  P  1) s 1
k 1

where { s2,k } are independent chi-square random variables with 1 degree of freedom,
{scov } and {kcov } are the eigen-values for the matrices S cov and K cov , respectively.
2.7 Simulation Method
In the simulation studies, we used genetic data from the 1000 Genome Project [2010] to
mimic the real data scenario. Specifically, we used a 1Mb region of the genome
(Chromosome 17: 7344328-8344327) from 1092 individuals in the 1000 Genome Project.
For each simulation replicate, we randomly chose a 30kb segment from the 1Mb region
as the SNV set to be tested. From the SNV set, we set a part of the SNVs as causal. A
number of individuals were randomly chosen as a simulation sample to study the
performance of the methods under different sample sizes.

43 
 

 

To investigate the robustness against different phenotype distributions, we simulated
three types of traits: a binary-distributed trait, a Gaussian-distributed trait and a Cauchydistributed trait. The binary-distributed phenotype was simulated by using a logistic
regression model:
logit( P( yi  1))    giT  ,

where yi is the phenotype value for the i-th individual, g i is the observed genotype
vector (coded as 0, 1, and 2), and  are the effects of the phenotypes and genotypes. The
effects  for the genotypes were sampled from a uniform distribution with a mean of 
and a variance of  2 . The Gaussian-distributed phenotype was simulated by using a
conventional linear regression model:
yi    giT    i ,  i ~ N (0, 2 ) ,

The Cauchy-distributed phenotype represents a situation in which the continuous trait
contains more extreme values (i.e., heavy-tailed), and was simulated by the following
model:
ai    giT  , yi ~ cauchy ( ai , b ) ,

where ai and b are the location parameter and scale parameter for the Cauchy distribution.
Two sets of simulations were performed. In simulation I, we considered a single trait, and
in simulation II we considered multiple-traits.
2.7.1 Simulation I:

44 
 

 

The null models were simulated by setting   0 and  2  0 , and changing the sample
size from 50 to 500 (i.e. 50, 100, 200 and 500). Two series of alternative models were
simulated (details in Table B1):
1. Set   0 and  2  0 , so that half of the functional SNVs were deleterious and the
other half were protective.
2. Set   0 and  2  0 , so that the majority of the causal SNVs were deleterious.
2.7.2 Simulation II:
Two sets of models were simulated:
1. Assume the multiple traits follow the same distributions. In particular, we
simulated 3 Binary-distributed traits (BBB), 3 Gaussian-distributed traits (GGG),
and 3 Cauchy-distributed traits (CCC).
2. Allow that the multiple traits follow different distributions. In particular, we
simulated 3 traits with 2 Binary and 1 Gaussian (BBG), 3 traits with 1 Binary and 2
Gaussian (BGG), and 3 traits with 1 Binary, 1 Gaussian and 1 Cauchy (BGC).
For the null model, we set   0 and  2  0 (details in Table B2). For the alternative
models, we set   0 ,  2  0 and allowed the multiple traits to be influenced by the
different sets of causal SNVs, by repeatedly choosing 30% SNV of the SNV-set on the
30kb segment as functional for each trait.
2.8 Simulation Results
We evaluated the performance of GTSU by comparing it with three existing methods:
SKAT, AdjSKAT and SKATO [Lee, et al. 2012a; Wu, et al. 2011]. For each simulation,
45 
 

 

we simulated 1000 data replicates and calculated type 1 error or power. In GTSU, we
used a wIBS-based genetic similarity and ED-based trait similarity to construct the
statistic. To be consistent, we used the same weighted IBS to construct the kernel for
SKAT [Wu, et al. 2011]. Because AdjSKAT and SKATO are not implemented with a
weighted IBS kernel, we used the default kernel, the weighted linear kernel, when
applying these two methods. SKAT, AdjSKAT and SKATO are designed for single trait
analysis. In order to analyze data with multiple traits, we perform single trait analysis and
obtained a p-value for each trait. A Bonferroni correction was then used to take the
multiple tests into account.
2.8.1 Simulation I:
We summarized the type I errors of the 4 methods in Table 8. GTSU had well-controlled
type I error regardless of the phenotype distributions (i.e., binary, Gaussian or Cauchy)
and the sample size (i.e., ranged from 50 to 500). Nevertheless, SKAT, AdjSKAT and
SKATO had an inflated type I error (ranged from 0.101 to 0.19) for a Cauchy-distributed
trait. When the sample size was small (e.g. 50 or 100), SKAT and SKATO also had a
conservative type I error.
The power comparisons were summarized in Figures 3 and 4. For the first disease model,
in which half of the causal SNVs were deleterious (Figure 1), GTSU consistently had
higher power than the other methods for different sample sizes ranged from 50 to 500.
For the binary trait and a sample size of 50, the power of AdjSKAT (power=0.114) was
marginally larger than SKAT (power=0.024) and SKATO (power=0.057), while the
power of GTSU (power=0.346) was significantly larger than all three SKAT-based
methods.
46 
 

 

Table 8: type I error for single trait analysis
Sample size
50

100

200

500

Method

Trait distribution
Binary

Gaussian

Cauchy

SKAT

0.001

0.03

0.122

SKATO

0.021

0.041

0.101

AdjSKAT

0.054

0.028

0.123

GTSU

0.058

0.046

0.051

SKAT

0.014

0.028

0.149

SKATO

0.035

0.038

0.12

AdjSKAT

0.063

0.027

0.139

GTSU

0.046

0.05

0.053

SKAT

0.023

0.04

0.155

SKATO

0.024

0.042

0.14

AdjSKAT

0.043

0.042

0.156

GTSU

0.057

0.048

0.045

SKAT

0.039

0.055

0.19

SKATO

0.052

0.056

0.158

AdjSKAT

0.049

0.053

0.18

GTSU

0.038

0.046

0.043

47 
 

 

Figure 3: Power comparison for a single trait analysis when half of the causal SNVs are
protective and the other half of the causal SNVs are deleterious

For the Gaussian trait, the three SKAT-based methods had similar performance, while
GTSU has much higher power than the SKAT-based methods. For the Cauchy trait,
where the distribution assumption is not satisfied, we could hardly see power
improvement for the SKAT-based methods with the increased sample size. On the other
hand, GTSU does not rely on the assumption, and had power improvement when the
sample size increased. For the second disease model where majority of the phenotype is
deleterious (Figure 4), we observed consistently higher power of GTSU over the other
methods regardless of the sample sizes and trait distributions.

48 
 

 

Figure 4: Power comparison for a single trait analysis when the majority of the causal SNVs are
deleterious

2.8.2 Simulation II:
We summarized the type I error for simulation II in Table 9. Similar to the results of the
single trait analysis, GTSU can correctly control the type I errors at the 0.05 level for all
simulation scenarios (Table 9).
All of the other three parametric methods (SKAT, SKATO and AdjSKAT) had an
inflated type I error when one or more than one trait did not satisfy the parametric
distribution assumption (CCC and BGC). When the distribution assumptions were
satisfied, AdjSKAT was able to correctly control the type 1 error. We found, however,
SKAT and SKATO had conservative type I errors for small sample sizes (50 and 100),
especially when one or more than one trait were binary-distributed traits (BBB, BBG and
BGG).
49 
 

 

Table 9: type I error for multiple trait analysis
Sample size

50

100

200

500

Method

Traits distribution*
BBG
CCC

BGG

BGC

0.008

0.021

0.097

0.207

0.016

0.024

0.085

0.028

0.237

0.049

0.04

0.113

0.057

0.047

0.045

0.034

0.043

0.056

SKAT

0.001

0.023

0.295

0.011

0.013

0.122

SKATO

0.025

0.039

0.26

0.025

0.036

0.117

AdjSKAT

0.055

0.025

0.29

0.059

0.042

0.135

GTSU

0.048

0.047

0.044

0.051

0.059

0.058

SKAT

0.02

0.048

0.325

0.033

0.02

0.134

SKATO

0.03

0.058

0.288

0.035

0.036

0.118

AdjSKAT

0.059

0.046

0.319

0.064

0.038

0.136

GTSU

0.05

0.047

0.049

0.058

0.056

0.043

SKAT

0.044

0.035

0.364

0.035

0.04

0.166

SKATO

0.054

0.036

0.315

0.038

0.047

0.156

AdjSKAT

0.06

0.038

0.342

0.041

0.049

0.172

GTSU

0.045

0.054

0.044

0.046

0.039

0.062

BBB

GGG

SKAT

0.002

0.028

0.232

SKATO

0.019

0.027

AdjSKAT

0.046

GTSU

*In trait distribution, “B” represents a binary-distributed trait, “G” represents a Gaussian-distributed trait,
and “C” represents a Cauchy-distributed trait. Thus, “BBB” means three binary-distributed traits, “GGG”
means three Gaussian-distributed traits, “CCC” means three Cauchy-distributed traits, “BBG” means two
binary-distributed traits and one Gaussian-distributed trait, “BGG” means one binary-distributed trait and
two Gaussian-distributed traits, and “BGC” means three traits with a binary, Gaussian and Cauchy
distribution for each.

We summarized the power comparison for the two disease models in Figures 5 and 6. For
the first disease model, in which the multiple traits follow the same distribution, GTSU
had higher power than SKAT, SKATO and AdjSKAT, for all three simulation scenarios
(Figure 4). In the simulation scenarios of BBB and GGG, when the sample size was 50,
GTSU obtained higher power than the other three SKAT methods. As the sample size

50 
 

 

increased, GTSU could obtain substantially power improvement than the other three
methods.

Figure 5: Power comparison for a multiple trait analysis when the multiple traits follow
the same type of distribution

*“BBB”, “GGG”, “CCC” represent three binary-distributed traits, three Gaussian-distributed traits, and
three Cauchy-distributed traits, respectively.

For CCC, we clearly observed the increased power of GTSU as the sample size increased.
Compared to GTSU, SKAT, SKATO and AdjSKAT had little gain in power as the
sample size increased. We also observed higher power of three SKAT methods over
GTSU when the sample size was small. This is due to the fact that the SKAT methods
had an inflated type 1 error (i.e., > 0.20) (Table 9). Similarly, for the second disease
model, GTSU had higher power than SKAT, SKATO and AdjSKAT, regardless of the

51 
 

 

sample size, the distributions of the multiple traits, and the configuration of the causal
SNVs on the multiple traits (Figure 6).
Figure 6: Power comparison for a multiple trait analysis when the multiple traits follow
different types of distribution

* “BBG” denotes two binary-distributed traits and one Gaussian-distributed trait; “BGG” denotes one
binary-distributed trait and two Gaussian-distributed traits; and “BGC” denotes one binary-distributed trait,
one Gaussian-distributed trait and one Cauchy-distribution trait.

2.9 Computation and Software
We implemented GTSU in both R and C++. Because SKAT is written in R, to be
comparable, we compared the computational efficiency of GTSU in R with that of SKAT,
SKATO and AdjSKAT. To analyze a 30kb sequence with a sample sizes of 50 for 1000
data replicates on a high-performance personal computer with 2.3GHz CPU and 4G
memory, SKAT, SKATO and AdjSKAT spent 10 times (95.606s), 217 times (2155.145s)

52 
 

 

and 532 times (5286.115s) longer, respectively, of the time that GTSU spent (9.925s).
We listed more detailed computational times for analyzing data with various distributed
traits and different sample sizes in Table B3. To handle high-dimensional data and further
improve computational efficiency, we have developed a C++ based package for GTSU
(https://www.msu.edu/~changs18/software.html#GTSU), including a windows version
and a unix version. The current version of the C++ based GTSU only implement a
weighted IBS for genetic similarity and an ED-based similarity for trait similarity. More
features will be added to the package in the future.
2.10 Application to Dallas Heart Study
To further evaluate the performance of GTSU, we applied GTSU to the Dallas Heart
Study (DHS) sequence data [Ahituv, et al. 2007], and compared the results of GTSU with
three other SKAT-based methods. The DHS sequencing data is comprised of 4 genes,
ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6. After the quality control (e.g.,
removing SNVs with high missing value rate), 230 rare variants (54 SNVs, 63 SNVs, 61
SNVs, 52 SNVs are from ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6, respectively),
and 2598 subjects remained for the analysis. We were interested in testing the association
of the rare variants in these genes with multiple traits that are related to cholesterol level,
including obesity (dichotomized from BMI using a cut-off of 35), cholesterol, high
density lipoprotein cholesterol (HDL), low density lipoprotein cholesterol (LDL) and
very low density lipoprotein cholesterol (VLDL).

53 
 

 

Table 10: multiple trait analysis for the DHS study
rare variants

p-values for multiple traits*
GTSU

SKAT

AdjSKAT

SKATO

ANGPTL3

54

0.300

0.415

0.355

0.560

ANGPTL4

63

0.057

0.395

0.430

0.775

ANGPTL5

61

0.075

0.230

0.365

0.535

ANGPTL6

52

0.297

0.295

0.325

0.505

ALL

230

0.028

1

1

1

*Traits considered in the analysis include BMI, cholesterol, HDL, LDL and VLDL. For SKAT-based
methods, the p-values listed in the table are the Bonferroni-corrected p-value attained from the multiple pvalues.

To consider potential confounding effects from age, gender and race in GTSU, we used
the residuals to build the trait similarity. The three SKAT-based methods cannot directly
analyze multiple-trait data. Thus, we obtained multiple p-values from single trait analyses,
and then used a Bonferroni correction to decide whether there was a significant
association. The results were summarized in Table 10. Among the 4 genes, GTSU found
a marginal association (p-value=0.057) of ANGPTL4 with the multiple traits, while there
is no significant association found by SKAT, SKATO and AdjSKAT. To further explore,
we combined the 4 genes into a gene set, and found a significant association of the gene
set with multiple traits by using GTSU (p-value=0.028).

54 
 

 

CHAPTER 3: Family-based Gene-Trait Similarity U test

Rare variants are often recent mutation, and are only carried by few individuals in the
population. Although they are rare in the population, if the mutations are inheritable, they
could be clustered within families. Sequencing studies that utilize a family design are,
therefore, more suitable to detect rare variants. However, directly using population-based
association method without considering the family correlation structure could lead to
inflated type I error. A family-based Gene Trait Similarity U test is thus proposed to
analyze sequencing association data from family samples or a mixture of family samples
and unrelated samples.
3.1 Family-based Gene Trait Similarity U test
Suppose that n subjects are genotyped in the study, where we are interested in the
association of the phenotype ( y i , 1  i  n ) with M genetic variants ( g i , m , 1  i  n ,
1  m  M ). In order to test the joint association, we first define the trait similarity by

h( yi , y j ) and the genetic similarity by f ( g i , g j ) , where the function h(,) and f (,)
satisfy

the

finite

second

moment

conditions,

(i.e.

E (h 2 (Y1 , Y2 ))  

and

E ( f 2 (G1 , G2 ))   ). We further center trait similarities by,

h ( yi , y j )  h( yi , y j )  E (h( yi , Y j ))  E (h(Yi , y j ))  E (h(Yi , Y j )),
and center the genetic similarities, f ( gi , g j ) , in the same manner. We can then build a
gene-trait similarity U test by calculating the summation of the centered trait similarity
weighted by the centered genetic similarity, as follows:
55 
 

 

U

1
 h ( yi , y j ) f ( gi , g j ) .
n(n  1) i  j

The above form can be considered as the sum of the element-wise product of two
matrices. When the phenotype is associated with the genetic variants, the phenotype
similarity is concordant with the genetic similarity. Thus, the positive trait similarity is
positively weighted and the negative trait similarity is negatively weighted, and a larger
value is expected for the test statistics under the alternative hypothesis.
In order to take the correlation structure into account, we first calculate the kinship matrix,

K,
1   1 2 12
 2
1  2
K   21
 ...
...

 2  n1 2  n 2

2 1n 
2  2 n 
,
...
... 

... 1   n 

...
...

where  i represents the inbreeding coefficient for the subject, and ij represents the
kinship coefficient between subjects i and j. We then apply a projection approach to
obtain a family-based genetic similarity and a trait similarity. As we shown in Appendix
C1, the technique originates from the linear mixed model. The kinship matrix can be
decomposed into K  U U T , where the columns of U are eigenvectors, and the
diagonal of

 are eigenvalues. Denote the trait similarity S  {Si , j }nn , where

Si , j  h( yi , y j ) , the family-based trait similarity matrix can be calculated by S  U T SU .
Also denote the genetic similarity W  {Wi , j }nn , where Wi , j  f ( gi , g j ) , the family-based
genetic similarity can be calculated by W   0.5U T WU  0.5 . Let S0 be a similar matrix as

56 
 

 

S with the diagonal elements of 0, and W0 be a similar matrix as W with the diagonal

elements of 0, I be the identity matrix and J be the matrix in which all elements equal
1 / n . We can calculate the centered similarity matrix by,

 S cen  ( I  J ) S0 ( I  J )
.
 cen
W  ( I  J )W0 ( I  J )
The family-based GTSU is defined similarly,

U

1
n2

n

n

W

cen
i, j

i 1 j 1

Sicen
,j .

Under the null hypothesis, when the genetic variants are not associated with the
phenotype, the limiting distribution of the family-based GTSU is a mixture of chisquared random variables,
n

nU ~ 
s 1

s

n

t

n
n
t 1

2
st

,

where {s } and {t } are the eigenvalues of the matrix S cen and W cen , and { st2 } are i.i.d
chi-squared random variables with a d.f. of 1. The p-value of the test, P(U obs  U ) can
then be calculated using the Davies’ method [Davies 1980], Liu’s method [Liu, et al.
2009a] or Kuonen’s method [Kuonen 1999].
3.2 Covariate-adjusted Gene Trait Similarity U test
We can also consider the family structure as the confounding effect that needs to be
corrected. We derive a covariate-adjusted GTSU, and use the principle components of the
kinship matrix as the covariate adjustment for GTSU. Suppose that there are P covariates
57 
 

 

( xi , p , 1  i  n , 1  p  P ) that need to be adjusted, where we choose P leading
principle components of the kinship matrix in a family study. By denoting S0cen a matrix
by assigning 0 to the diagonal element of matrix ( I  J ) S ( I  J ) , and W0cen a matrix by
assigning 0 to the diagonal element of matrix ( I  J )W ( I  J ) , we can use a two-sided
projection [Lindsay, et al. 2008] to calculate the covariate-adjusted trait similarity ( S cov )
and genetic similarity ( K cov ):
S cov  ( I  X ( X T X )1 X T ) S0cen ( I  X ( X T X ) 1 X T ) ,
W cov  ( I  X ( X T X ) 1 X T )W0cen ( I  X ( X T X ) 1 X T ) .
Then the covariate-adjusted GTSU can be calculated as:

U cov 

1
n2

n

n

W
i 1 j 1

cov
i, j

Sicov
,j .

Its asymptotic distribution can be attained using a method similar to the weighted U
statistic [Shieh 1997]. After an eigen-decomposition of the two similarity matrices, the
asymptotic distribution of the covariate-adjusted GTSU is the sum of the independent
chi-square random variables:
nU cov ~

n
n
1
scov  kcov  st2 ,

n( n  P  1) s 1
t 1

where { st2 } are independent chi-square random variables with 1 degree of freedom,
{scov } and {tcov } are the eigen-values for the matrices S cov and W cov , respectively.

58 
 

 

3.3 Simulation
We used the genetic data from the 1000 Genome Project [2010] to mimic real scenarios.
In particular, we used a 1Mb region of the genome (Chromosome 17: 7344328-8344327)
from 1092 individuals in the 1000 Genome Project. For each simulation replicate, we
randomly chose a 30kb segment from the 1Mb region as the SNV set to be tested. In the
SNV set, only a part of the SNVs was set as causal.
A number of individuals were randomly chosen as parents in the family, from which we
generated 2 offspring for each family. We varied the numbers of families to study the
performance of GTSU under different sample sizes. To investigate the robustness against
phenotype distributions, we simulated different types of traits, i.e., a binary-distributed
trait, a Gaussian-distributed trait, a t-distributed trait, or a Poisson-distributed trait. The
binary-distributed phenotype was simulated by a logistic regression model:
logit( P( yi  1))    giT  + i ,

where yi was the phenotype value for the i-th individual, g i was the observed genotype
vector (coded as 0, 1, and 2),  were the effects of the genotypes, and  i was the effect
of the family correlation. The effects  for the genotypes were sampled from a normal
distribution with a mean of  and a variance of  2 .   ( 1 , 2 ,..., n )T was sampled from
a multivariate normal distribution,  ~ N (0, K2 K ) , where K is the kinship matrix
calculated based on the family structure. The Poisson-distributed phenotype is simulated
by a Poisson regression model:
log(ai )    giT  + i , yi ~ Pois (ai ) ,

where yi was sampled from a Poisson distribution with a mean of ai .
59 
 

 

The Gaussian-distributed phenotype is simulated by a conventional linear regression
model:
yi    giT    i   i ,  i ~ N (0, 2 ) .

The t-distributed phenotype represents a situation in which the continuous trait has more
extreme values (heavy-tailed), and was simulated by the following model:
yi    giT    i   i ,  i ~ tdf 1 ,

where  i were sampled from a t distribution with a d.f. of 1.
We compared the proposed family-based GTSU (FGTSU) with 3 other methods,
including the original GTSU without any adjustment (GTSU), covariate-adjusted GTSU
using the 20 leading principle components (PC-GTSU), and fam-SKAT [Chen, et al.
2013], which is based on a variance-component score test in the linear mixed model. The
results were summarized in Table 1 and Table 2. In general, the original GTSU had the
highest power among the 4 methods, but type 1 errors for GTSU were inflated. FGTSU
had a controlled type 1 error, and had a reasonable power to detect the joint association.
When the phenotypes followed a binary distribution (Table 11), PC-GTSU, FGTSU and
fam-SKAT had controlled the type 1 error rates. However, when the sample size was
small (e.g., 80), fam-SKAT and PC-GTSU had conservative type 1 error rates (0.005 for
fam-SKAT and 0.016 for PC-GTSU), and they had very low power (0.033 for fam-SKAT
and 0.016 for PC-GTSU). PC-GTSU was the least powerful approach among the three,
and FGTSU had higher power than both PC-GTSU and fam-SKAT.

60 
 

 

Table 11: Type 1 error and power comparison for a discrete phenotype
Distribution

Effect

Null
Binary
Disease

Null
Poisson
Disease

Sample
size

Type 1 error/Power
GTSU

PC-GTSU

FGTSU

fam-SKAT

80

0.153

0.016

0.053

0.005

200

0.164

0.03

0.047

0.023

400

0.174

0.069

0.070

0.032

80

0.39

0.016

0.175

0.033

200

0.715

0.1

0.489

0.298

400

0.922

0.537

0.823

0.736

80

0.317

0.01

0.054

0.122

200

0.357

0.045

0.06

0.149

400

0.403

0.117

0.056

0.142

80

0.5

0.019

0.199

0.315

200

0.767

0.151

0.425

0.446

400

0.915

0.539

0.679

0.605

When the phenotypes followed a Poisson distribution (Table 1), only FGTSU had the
controlled type 1 error. PC-GTSU controlled the type 1 error when the sample size was
80 or 200, but not when sample size was 400. fam-SKAT could not control the type 1
error for different sample sizes. Similar to the binary cases, PC-GTSU had a conservative
type 1 error and low power for small sample sizes, while FGTSU had the largest power,
among those that had controlled the type 1 error.

61 
 

 

Table 12: Type 1 error and power comparison for a continuous phenotype
Distribution

Effect

Null
Gaussian
Disease

Null
Student’s t
Disease

Sample
size

Type 1 error/Power
GTSU

PC-GTSU

FGTSU

fam-SKAT

80

0.299

0.013

0.064

0.039

200

0.357

0.041

0.073

0.035

400

0.385

0.107

0.056

0.037

80

0.513

0.023

0.236

0.201

200

0.742

0.12

0.429

0.452

400

0.885

0.491

0.65

0.761

80

0.148

0.015

0.054

0.145

200

0.175

0.036

0.046

0.176

400

0.158

0.058

0.051

0.183

80

0.486

0.017

0.285

0.171

200

0.784

0.146

0.63

0.197

400

0.938

0.582

0.867

0.221

When the phenotypes followed a Gaussian distribution (Table 12), both FGTSU and famSKAT controlled the type 1 errors at around the 0.05 level. Fam-SKAT had comparable
or even higher power than FGTSU, especially when the sample size increased. Both
methods attained higher power than PC-GTSU. Similar to before, PC-GTSU had an
inflated type 1 error when the sample size was 400 and low power when the sample size
was 80.

62 
 

 

When the phenotypes followed the Student’s t distribution (Table 2), both FGTSU and
PC-GTSU controlled the type 1 errors at around the 0.05 level. Fam-SKAT had an
inflated type 1 error for different sample sizes. FGTSU had the highest power
performance among those that could control the type 1 error. PC-GTSU had low power
when the sample size was small.
To summarize, compared to the other three methods, FGTSU was the most robust to
phenotype distributions. FGTSU was also a powerful method among those that could
control type 1 error. Fam-SKAT had controlled type 1 error for both the Gaussian and
binary phenotypes, but not for Poisson- or t-distributed phenotypes. Fam-SKAT attained
highest power when the phenotype followed the Gaussian distribution because famSKAT is developed based on a linear mixed model. We also found unstable performance
of PC-GTSU. Although the types of phenotypes did not seem to influence its robustness,
PC-GTSU tended to have an inflated type 1 error for a large sample size and low power
for a small sample size. The reason might be the number of principle components used
for PC-GTSU. A larger number of PC can protect the type 1 error but leads to low power.
3.4 Whole Genome Sequencing Analysis of GAW18
We

developed

a

C++

package

for

GTSU

and

family-based

GTSU

(https://www.msu.edu/~changs18/software.html#GTSU). Using the C++ package, we
performed a whole genome sequencing analysis using the GAW 18 data. The GAW18
had only genetic data from odd numbers of autosomes. The GAW18 was comprised of
959 family members, among which 464 were directly sequenced and the others were
imputed, based on 20 large pedigree structures and existing genetic data. The status of
hypertension for the first visit was chosen as the phenotype in the analysis. After
63 
 

 

eliminating the subjects with missing phenotypes, we had 804 subjects, and 8,348,674
SNPs remained.
The SNV sets were then needed to be defined for the whole genome. For simplicity, we
used the 75% quantile of the length of all genes, which corresponds to the size of
58861bp, to divide the genome sequence into 22887 SNV sets. The number of SNVs in
each SNV set ranged from 1 to 1488, with a median size of 361 (Figure C1). The kinship
matrix for the 804 family members was calculated using pedigree information from 1389
subjects. An ED-based similarity was used to calculate the phenotype similarity, and
wIBS was used to calculate the genetic similarity. The analysis of the whole genome took
10 hours on a personal computer with 8GB memory and a 2.4 GHz CPU. The QQ plot of
the 22887 SNV sets showed that there is no evidence of systematically inflated p-values
(e.g., due to population stratification) in the genome-wide analysis (Figure 1). The SNV
sets with the top p-values were listed in Table 13. By taking multiple testing into account,
the significance level should be at 10-6 level. Therefore, the first 2 SNV sets on
chromosomes 7 and 9 were significantly associated with hypertension.

64 
 

 

Figure 7: QQ plot for the whole genome sequencing analysis

65 
 

 

Table 13: Top 10 SNV sets with smallest p-values for GAW18
Chr

Start position

End position

nSNP

P_value

7

61098237

61157098

39

1.14×10-06

9

45676524

45735385

1

2.85×10-06

9

30961149

31020010

440

1.22×10-05

15

54564610

54623472

362

2.57×10-05

9

31078872

31137733

496

2.93×10-05

17

8122887

8181748

325

5.51×10-05

21

35258038

35316900

306

6.14×10-05

11

38024529

38083390

391

0.000111

3

190711260

190770121

400

0.000148

1

1118368

1177230

568

0.000149

66 
 

 

CHAPTER 4: Conclusion and Discussion

Driven by the sequencing technologies and the “common diseases and rare-variants
hypothesis”, sequencing studies are now emerging as a major study design in populationbased human genetics research. Yet, the characteristics of sequencing data, including low
MAF and high dimensionality, are posing daunting challenges for the statistical analyses.
The conventional analytical approaches, such as a single-locus analysis, are not suitable
for a sequencing data analysis. Instead, a joint association analysis or a SNV-set analysis
are becoming popular, for their ability to combine multiple rare variants to increase
power, as well as to reduce the dimensionality [Lee, et al. 2012a; Madsen and Browning
2009; McCarthy, et al. 2008; Morgenthaler and Thilly 2007b; Neale, et al. 2011; Wu, et
al. 2011]. Although most of the existing methods have nice features and are easy to use,
they also have certain limitations. For example, most of burden tests assume the
homogeneity of the direction and magnitude of the effects within the SNV set, which
may not hold in real disease scenarios. Besides burden tests, most of the existing methods
are parametric or semi-parametric, which often assumes phenotype distribution and
mode of inheritance. When the assumptions are violated, the results could be unreliable.
To overcome these limitations, we have proposed WU-seq, GTSU and FSTSU, which is
computationally effcient and is distribution-free. We conducted extensive simulation
studies using data from the 1000 Genome Project, and compared their performance with
three popular methods: SKAT, AdjSKAT, and SKATO. Across all of the simulation
scenarios, including single traits with various distributions, and multiple traits with
various combinations of trait distribution, the proposed methods outperformed existing
67 
 

 

methods in terms of controlled type I error and power improvement. Although the
simulation results depend on the simulation setting, and should always be interpreted in
the context of the simulation setting, we believe they reflect the advantage of our
methods over the other methods in a broader sense, because 1) the simulated genetic data
is from the 1000 Genome Project, which should be able to mimic the LD pattern and
allelic frequency in the general population; and 2) we simulated a wide range of disease
models to reflect common disease scenarios.
In recent years, methods based on U statistics have been increasingly applied to genetic
studies, and have shown their robustness and flexibility [Schaid, et al. 2005; Wei, et al.
2013b; Zhang, et al. 2010]. GTSU is a general framework for sequencing associating
analysis, which is based on similarity measurements. WU-seq and FGTSU can be viewed
as two special cases of GTSU. Although we used the weighted-IBS to calculate genetic
similarity based on a SNV set in this chapter, other forms of genetic similarity can also be
used. For a weighted-IBS similarity, a simple alternative would be to modify the weights
of the weighted-IBS (originally wm  1/  m (1   m ) in weighted-IBS) to reflect a
different influence from each SNV. In place of the IBS-based similarity, we could also
use distance-transformed similarities to model the effect nonlinearly. For example, we
M

2
can use the Euclidian-Distance based method, K iED
, j  exp(   wm ( g i , m  g j , m ) ) . In this
m 1

chapter, we have focused on the analysis of categorical sequencing data (i.e., SNVs). By
using an appropriate genetic similarity measurement, GTSU can easily be extended to
analyze other types of genetic data, such as count data (i.e., CNVs) and continuous data
(i.e., gene expression).
68 
 

 

The flexibility of GTSU also comes from the construction of trait similarity. By using
trait similarity, GTSU can analyze not only a single trait, but also multiple traits with
various distributions. Many genetic studies collect multiple traits to study a human
complex disease, or use intermediate biomarkers to measure the complex disease. Using
multiple traits has been shown to improve power by accounting for the underlying
structure of the complex disease [Liu, et al. 2009b; Maity, et al. 2012; Zhang, et al. 2010].
However, most methods for analyzing sequencing data have focused on single trait
analysis. SKAT has recently been extended to multiple continuous trait analysis, by using
a score test under the multivariate linear mixed effect model [Maity, et al. 2012]. The
method, however, relies on a normal assumption for the trait distribution, and cannot
analyze multiple traits with other types of phenotypes (e.g. a binary phenotype). In
contrast, GTSU is distribution-free and can analyze multiple traits with different types of
phenotypes. By using an appropriate similarity measurement, GTSU can be extended to
analyze longitudinal data, survival data and even a high-dimensional phenotype such as
image data.
Current sequencing studies are often of small sample size. Thus, it is important for a
statistical method to have enough power for a small sample size study. Methods, such as
AdjSKAT, was proposed for

small sample size study [Lee, et al. 2012a]. In our

simulation studies, we find both AdjSKAT and GTSU can correctly control the type I
error for the binary trait at 0.05 level. But GTSU has higher power than AdjSKAT for
detecting a genetic association.
GTSU, as well as WU-seq and FGTSU, are not only statistically powerful, but also
computationally efficient. We derived the asymptotic distribution of three methods so as
69 
 

 

to efficiently calculate the significance level of the association test. As we showed in the
result section, GTSU and Wu-seq had the advantage over the SKAT methods. We also
developed a C++ version of GTSU. With the fast-moving sequencing technology and
reduced sequencing cost, large-scale whole-genome sequencing will soon become
popular. We hope the proposed methods could be computationally efficient tools to
handle the massively sequencing data.

70 
 

 

APPENDICES

71 
 

 

Appendix A

We first introduce a regularity condition to the weight function ki ,i ' . Since 0  wi ,i '  1
and ki ,i '  wi ,i '  c , we have 0 | ki ,i ' | c ' , where c ' is a positive constant. It is, thus,
reasonable to assume lim

n 

1
ki2,i '  C , where C is another positive constant.

n(n  1) i i '

Alternatively, if we consider ki ,i ' as random, this is similar to assuming a finite second
2
)  C  .
moment, i.e. E (k1,2

Then we can show,




l 1

l 1

E (  2, )   l E ( 1,2l )   l  0 ,

and,


Var (  2, )   l2Var ( 1,2l )
l 1

n
2
l2

2
n  ( n  1)
l 1
2
 lim
trace( KK )
n  ( n  1) 2
2
 lim
ki2,i '

n  ( n  1) 2
i i '
 2C.

 lim

Using this result, we can further check that the usual
p

( nWU seq  0 ).

72 
 

nWU seq is degenerated

 

Table A1: simulation setting for models without confounding effects and fixed sample sizes
Effect

Pct*

Setting*
Binary

Null

Normal

Student’s t

Cauchy

















0

0

0

0

0

0

0

0

0

1

0

0.3

0

0.5

0

0.5

0.25

0.25

0.1

0.1

0.25

0.25

0.25

0.25

0
5

A1

10
30
50
5

A2

10
30
50

* percentage of functional rare variants

Table A2: simulation setting for model without confounding effects and varying sample sizes

Effect

Sample size

Setting
Binary

Normal

Student’s t

Cauchy

















0

0

0

0

0

0

0

0

0

1

0

0.3

0

0.5

0

0.5

50
Null

100
200
50

*
A1

100
200

* 50% of the rare variants are set as functional.

73 
 

 

Table A3: simulation setting for model with confounding effects and varying sample sizes

Effect

Sample size

Setting
Binary

Normal

Student’s t

Cauchy

















0

0

0

0

0

0

0

0

0

1

0

0.3

0

1

0

0.5

50
Null

100
200
50

*
A1

100
200

* 50% of the rare variants are set as functional.

Table A4: The association of 4 candidate genes with 3 binary phenotypes* in the Dallas Heart
Study
 
Gene

BMI

Cholesterol

VLDL

SKAT

WUSEQ

SKAT

WUSEQ

SKAT

WUSEQ

ANGPTL3

0.805**

0.791

0.734

0.655

0.024

0.051

ANGPTL 4

0.340

0.365

0.242

0.263

0.012

0.020

ANGPTL 5

0.750

0.585

0.752

0.648

0.310

0.413

ANGPTL 6

0.698

0.49

0.006

0.006

0.286

0.375

All 4 genes

0.765

0.632

0.047

0.059

0.010

0.022

*For each phenotype, we use the highest quartiles and lowest quartiles to form the binary
phenotype.
** P-value from the association analysis by adjusting for age, gender, and race. 

74 
 

 

Figure A1: Missing data distribution for the Dallas Heart Study

Figure A2: MAF for all SNPs

75 
 

 

Figure A3: Distribution of test statistic by using Permutation v.s. Asymptotic under null
hypothesis

76 
 

 

Appendix B

Section B1: By the definition of

h ( y1 , y2 )  h( y1 , y2 )  E (h( y1 , Y2 ))  E (h(Y1 , y2 ))  E (h(Y1 , Y2 )) ,
we can obtain the conditional expectation of centered trait similarity,

E (h (Y1 , Y2 ) | Y1 )  E (h(Y1 , Y2 ) | Y1 )  E (h(Y1 , Y2 ) | Y1 )  E (h(Y1 , Y2 ))  E (h(Y1 , Y2 ))  0 .
Thus, we have E (h (Y1 , Y2 ))  0 and Var ( E (h (Y1 , Y2 ) | Y1 ))  0 . The same argument apply
for the centered genetic similarity.
Under the null hypothesis, when the genetic similarities are independent of trait
similarities, we have:
E (U ) 


1
E ( f (Gi , G j )h (Yi , Y j ))
n(n  1) i  j
1
 E ( f (Gi , G j )) E (h(Yi , Y j )) .
n(n  1) i  j

 0.

Section B2: We can decompose the centered trait similarity by:


h ( y1 , y2 )   ss ( y1 )s ( y2 ) ,
s 1

where {s } and {s ()} are the eigenvalue and eigenfunction of the kernel h (,) .
Because the orthogonality of {s ()} , we have

77 
 

 

E (h (Y1 , Y2 )s (Y2 ) | Y1 )
  h (Y1 , y2 )s ( y2 )dF ( y2 )


  ss (Y1 )  s ( y2 )s ( y2 )dF ( y2 )
s 1

 ss (Y1 )

We have also showed E (h (Y1 , Y2 ) 1| Y1 )  0 1 in Appendix B1, which forces 1 ()  1
and 1  0 to be an eigenfunction-eigenvalue pair in the decomposition of h (,) . Again,
because 1 () is orthogonal with {s ()}s 1 , for s>1, we have

Es (Y1 )   s ( y1 ) 1dF ( y1 )
  s ( y1 )1 ( y1 )dF ( y1 )
 0.
Following same steps, we can decompose the centered genetic similarity:


f (G1 , G2 )   tt (G1 )t (G2 ) .
t 1

Then,

 Es (Y )  0,  s  1

 Et (G)  0,  t  1.
Using the function decomposition, we can write GTSU as,

78 
 

(B1).

 

U

1
 f (Gi , G j )h (Yi , Y j )
n(n  1) i  j





1
G
G



ss (Yi )s (Y j )
(
)
(
)
 t t i t j 
n(n  1) i  j t 1
s 1





1
t  s  t (Gi )t (G j )s (Yi )s (Y j )

n(n  1) t 1 s 1 i  j


1 
 1 n



t (Gi )s (Yi ) 


t s 
n  1 t  2 s  2  n i 1

n


1
1

t  s  t2 (Gi )t2 (Yi )

n  1 t  2 s  2 n i 1

2



1  
 1 n *

sign


t (Gi )s* (Yi ) 

(
)


t s 
n  1 t 2 s2
 n i 1

n
 
1
1
sign(t s )  (t* (Gi )t* (Yi )) 2 ,


n  1 t 2 s 2
n i 1

.

2

where t* (Gi ) | t |0.5 t (Gi ) and s* (Yi ) | s |0.5 s (Yi ) .
Under the null hypothesis, genotypes ( Gi ) are independent of traits ( Yi ). Therefore, for
s>1 and t>1, we have

E (t* (G1 )s* (Y1 ))
| st |0.5 E (t (G1 )) E (s (Y1 ))

(B2)

 0,
, and

E (t* (G1 )s* (Y1 )t* (G1 )s* (Y1 ))
| st st  |0.5 E (t (G1 )t  (G1 )) E (s (Y1 )s (Y1 ))
|   | , if s  s and t  t 
 s t
otherwise.
0,

79 
 

(B3)

 

 1 n *

For any finite subset  of {st}s 1,t 1 , 
converges to a
t (Gi )s* (Yi ) 

 n i 1
st

multivariate normal distribution by using the results from appendix B2, B3 and
multivariate CLT.
Additionally, we can show that:



s 1,t 1

E (t* (G1 )s* (Y1 )) 2

  | s | | t |
s

 

2
s

s

t



2
t

t

 E ( h (Y1 , Y2 )) 2 E ( f (G1 , G2 )) 2


Under the condition



s 1,t 1

E 2 (t* (G1 )s* (Y1 ))   , the countable sequence of function

{t* ()s* ()}s 1,t 1 is a Donsker class [van der Vaart and Wellner 2000]. Thus, the
empirical process,

1
n

n


i 1

*
t

(Gi )s* (Yi ) , converges weakly to the Gaussian process Z s ,t

with a mean zero and a covariance function:

|   | , if s  s and t  t 
cov( Z s ,t , Z s,t  )  E (t* (G1 )s* (Y1 )t* (G1 )s* (Y1 ))   s t
.
otherwise
0,
With this uniform convergence (for all s>1 and t>1), we can show that:

80 
 

 




nU   sign(t s )  Z s ,t 
D

2

t 2 s 2




  sign(t s ) | st | ,
t 2 s 2

D 







s 1

t 1

s 1

 t  s  st2  t  s
t 1

where { st2 } are the i.i.d chi-square random variables with a d.f of 1.

Section

B3:

To

simplify

notations,

we

denote

u ( X1 , X 2 )  f (G1 , G2 )h (Y1 , Y2 ) . Then GTSU can be rewritten as:
U

1
 u( X i , X j ) .
n( n  1) i  j

Define a centered kernel u( x1 , x2 ) by:

u( x1 , x2 )  u( x1 , x2 )  u1 ( x1 )  u1 ( x2 )  ,
where u1 ( x)  E(u( X1 , X 2 ) | X1  x) .
We can decompose the GTSU as follows:
U

1
 u( X i , X j )
n(n  1) i  j



1
  u ( X i , X j )  u1 ( X i )  u1 ( X j )     
n(n  1) i  j



1
2 n

u
X
X

(
,
)
 u1 ( X i )     .
 i j n
n(n  1) i  j
i 1

Thus,

81 
 

X  (Y , G )

and

 

2 n
n
 u1 ( X i )    

 u ( X i , X j ).
n(n  1) i  j
n i 1

n (U   ) 

Since E (u1 ( X ))   and Var (u1 ( X ))   1 , the first term converges to a normal
distribution by applying CLT:
2
n

n

D

  u ( X )     N (0, 4
i 1

1

i

1

).

We then need to show:
R

p
n

(
,
)

u
X
X
 i j 0.
n(n  1) i  j

This can be done by providing E( R2 )  0 , using the fact that E(u( X1 , X 2 ))  0 ,

Var (u( X1 , X 2 ))   and E(u( X1 , X 2 ) | X1 )  0 . In fact, by using a similar technique to
Appendix B2, we can show that

nR asymptotically follows the distribution of a

weighted sum of independent chi-square random variables.

Section B4: In the study sample, we can get the centered trait similarity by:

1 n
1 n
1
h ( yi , y j )  h( yi , y j )   h( yi , y j )   h( yi , y j )  2
n j 1
n i 1
n

 h( y , y ) .
i

j

i, j

If we denote Si , j  h ( yi , y j ) and Si , j  h( yi , y j ) , the above equation becomes:

1 n
1 n
1
Si , j  Si , j   Si , j   Si , j  2
n j 1
n i 1
n
The equation can be written in a matrix form,

82 
 

S
i, j

i, j

.

 

S  S  JS  SJ  JSJ
,
 (I  J )S (I  J )
1
where S  {h ( yi , y j )}nn , S  {h( yi , y j )}nn , I  {1{i  j}}nn , and J  { }nn .
n

Similarly, the centered genetic similarity can also be written in matrix form:

K  ( I  J ) K ( I  J ) ,
where, K  { f ( g i , g j )}nn , and K  { f ( g i , g j )}nn .

Section B5: In the actual computation, we will use a matrix eigen-decomposition to
obtain the eigenvalue and finite-dimension approximation of the eigenfunction. For
matrix eigen-decomposition, computer algorithm usually gives eigenvalues s with
eigenvectors s that satisfy

n

 s2,i  1 instead of
i 1

1 n 2
 s ,i  1 . Thus, using the eigenvalues
n i 1

{s } and {t } calculated from the matrix eigen-decomposition, the limiting distribution
of GTSU is:
n

nU ~ 
t 1

t

n

s

 n (
n
s 1

2
st

 1) .

Section B6: The purpose is to obtain a covariate-centered similarity, so that the centered
similarity is perpendicular to the space spanned by the covariates. Suppose we have P
covariates that need to be adjusted. We first consider the trait similarity and assume the

83 
 

 

covariate to be a P+1 dimensional function of traits: xi  x( yi ) . We define a new kernel
by:
u ( y1 , y2 )  x ( y1 )  E ( x()T x())  x( y2 )T .
1

From this definition, we have:
1
E (u ( y1 , Y2 ) x(Y2 ))  x( y1 ) E  E ( x(Y2 )T x(Y2 ))  x(Y2 )T x(Y2 )   x( y1 ) .



The covariate-centered kernel is then defined as:

h ( y1 , y2 )  h( y1 , y2 )  E (h( y1 , Y3 )u (Y3 , y2 ))
 E (u ( y1 , Y3 )h(Y3 , y2 ))
 E (u ( y1 , Y3 )h(Y3 , Y4 )u (Y4 , y2 ))

So we have:

84 
 

.

 


E (h ( y1 , Y2 ) x(Y2 ))
  h( y1 , y2 ) x( y2 )dF ( y2 )
   h( y1 , y3 )u ( y3 , y2 ) x( y2 )dF ( y3 )dF ( y2 )
   u ( y1 , y3 )h( y3 , y2 )x( y2 )dF ( y3 )dF ( y2 )
   u ( y1 , y3 )h( y3 , y4 )u ( y4 , y2 ) x( y2 )dF ( y4 )dF ( y3 )dF ( y2 )
  h( y1 , y2 ) x( y2 )dF ( y2 )
   h( y1 , y3 )  E (u ( y3 , Y2 ) x(Y2 ))  dF ( y3 )
   u ( y1 , y3 )h( y3 , y2 )x( y2 )dF ( y3 )dF ( y2 )
   u ( y1 , y3 )h( y3 , y4 )  E (u ( y4 , Y2 ) x(Y2 ))  dF ( y4 )dF ( y3 )
  h( y1 , y2 ) x( y2 )dF ( y2 )
   h( y1 , y3 )x( y3 )dF ( y3 )
   u ( y1 , y3 )h( y3 , y2 )x( y2 )dF ( y3 )dF ( y2 )
   u ( y1 , y3 )h( y3 , y4 )x( y4 )dF ( y4 )dF ( y3 )
0


Consider the above expectation as an inner product  h (, y2 ) x( y2 )  defined by:


 h (, y2 ) x( y2 )   h (, y2 ) x( y2 )dF ( y2 ) ,

In this sense, we have h  x for our covariate-centered trait similarity. In fact, this has

forced {1, x1 ,..., xP } to be a subset of the eigenfunctions of h (,) with eigenvalues

{s  0}1 s  P 1 .

By the definition of the covariate-centered trait similarity, we can estimate it by replacing
the expectation with a sample average:

85 
 

 
n
n

h ( yi , y j )  h( yi , y j )   xi ( xlT xl ) 1 xkT h( yk , y j )
k 1

l 1

n

n

k 1

l 1

  h( yi , yk ) xk ( xlT xl ) 1 xTj
n

n

.

n

n

l 1

l 1

  xi ( xlT xl ) 1 xmT h( ym , yk ) xk ( xlT xl ) 1 xTj
m 1 k 1

As in Appendix B4, we can write it in matrix form:

S  ( I  X ( X T X ) 1 X T ) S ( I  X ( X T X ) 1 X T )
Similarly, we can define the covariate-centered genetic similarity by:

f ( g1 , g 2 )  f ( g1 , g 2 )  E ( f ( g1 , G3 )v(G3 , g 2 ))
 E (v( g1 , G3 ) f (G3 , g 2 ))
 E (v( g1 , G3 ) f (G3 , G4 )v(G4 , g 2 )) ,

where, v(,) is constructed in the same way as u (,) , except for letting xi  x( gi ) . The

covariate-centered genetic similarity has the perpendicular property f  x . In the study
sample, we can write the covariate-centered genetic similarity in the following matrix
form:


K  ( I  X ( X T X )1 X T ) K ( I  X ( X T X )1 X T ) .
To derive the asymptotic distribution of a covariate-adjusted GTSU, we can decompose

the h (,) by:


h ( y1 , y2 )   ss ( y1 )s ( y2 ) ,
s 1


and decompose the f ( g1 , g 2 ) by:


f ( g1 , g 2 )  tt ( g1 )t ( g 2 ) .
t 1

86 
 

 

Then using the same technique in Appendix B2, we can show that:
U

1
n2

1
 2
n





 f (G , G )h (Y , Y )
i

j

i

j

i, j

 n




t 
s   t (Gi )s (Yi ) 
t P2
sP2
 i 1





2

and its limiting distribution is also an infinite weighted sum of the iid chi-square variables.


Then, by using a matrix eigen-decomposition (to get eigenvalues {t } and {s } ), taking



T
1 T
into account the projection operator ( H  I  X ( X X ) X ) in a finite sample, U can be

approximated as follows:
nU ~

1
 n  n  2
t  s  st  ,
n(n  P  1)  t 1 s 1

where, { st2 } are i.i.d chi-square random variables with a d.f. of 1.

Section B7: We allowed the multiple traits to follow different distributions (BBG, BGG,
or BGC). Two sets of Null models (   0 and  2  0 ) were simulated (details in Table
B4):
1. We generated two covariates ( x1,i ~ Bernoulli (0.3) and x2,i ~ N (0,1) ) for all three
traits, so that the three traits were influenced by the same set of covariates.
2. Repeatedly generated two covariates for each trait, so that the three traits were
influenced by different sets of covariates.
Two sets of disease models (   0 and  2  0 ) were also simulated, each related to one
of the Null models.

87 
 

 

We summarized the type I error of the 4 methods for models with the same covariates,
and models with different covariates in Table 3. For all of the simulation scenarios,
GTSU could control the type I error correctly around 0.05. The other three parametric
methods either had an inflated type I error or a conservative type I error, depending on
the sample size and the trait distribution. AdjSKAT and SKATO were able to correctly
control the type I error around 0.05 for BBG and BGG traits, but suffered an inflated type
I error for BGC traits. SKAT also suffered an inflated type I error when one or more than
one of the multiple traits are Cauchy-distributed traits. Moreover, SKAT suffered a
conservative type I error for small sample sizes with BBG and BGG traits. An interesting
observation is that the conservatism of the type I error for the SKAT-based methods are
less severe than the scenario in which no covariates are involved. The reason might be
that the SKAT-based methods use residuals to build the test statistics after adjusting for
the covariates. The residuals have more levels than the original trait value (2 levels for
the binary trait), and are closer to the normal distribution. We summarized the power
comparison for the multiple-traits disease model with common covariates and distinct
covariates, respectively, in Figure 6 and Figure 7. We observed that GTSU obtained a
much higher power than SKAT, SKATO and AdjSKAT across all of the sample sizes,
trait distributions and covariate effect configurations (a shared or distinct set of
covariates). For example, when the sample size was 50 for BBG traits with common
covariates, SKAT, SKATO and AdjSKAT were only 11% (power=0.026), 19%
(power=0.043), and 25% (power=0.057) of the power of GTSU (power=0.222).

88 
 

 

Table B1: simulation setting for simulation I

Effect

Sample size

Pct*

Setting for effect
Binary

Normal

Cauchy













0

0

0

0

0

0

0

50

0

0.4

0

0.2

0

0.3

50

0.1

0.2

0.05

0.1

0.1

0.2

0

0.4

0

0.2

0

0.3

50
Null

100
200
500
50

  0

100

  0

200
500
50

  0

100

  0

200
500

5

  0
  0

500

10
30
50

*percentage of caudal SNVs

89 
 

 

Table B2: simulation setting for simulation II
Effect

Sample size

Pct*

Setting for effect**
BBB

Null-same-traits

GGG

CCC













0

0

0

0

0

0

50
100
200

0

500
BBG

Null-diff-traits

BGG

BGC













0

0

0

0

0

0

50
100
200

0

500
BBB

same-traits
diff-SNPs

GGG

CCC













0

0.2

0

0.1

0

0.2

50
100
200

30

500
BBG

diff-traits
diff-SNPs

BGG

BGC













0

0.15

0

0.15

0

0.15

50
100
200

30

500
*percentage of causal SNVs for each trait
**“BBB”, “GGG”, and “CCC” denote three binary-distributed traits, three Gaussian-distributed
traits, and three Cauchy-distributed traits, respectively. “BBG” denotes two binary-distributed
traits and one Gaussian-distributed trait; “BGG” denotes one binary-distributed trait and two
Gaussian-distributed traits; and “BGC” denotes one binary-distributed trait, one Gaussiandistributed trait and one Cauchy-distribution trait.
90 
 

 

Table B3: computational time for R-based GTSU and SKAT, SKATO, AdjSKAT for
analyzing 1000 simulated data replicates

Traits*

BBG

BGG

BGC

sample size

Computational time (seconds)
SKAT

SKATO

AdjSKAT

GTSU

50

95.606

2155.145

5286.115

9.925

100

184.013

3921.727

7613.602

34.822

200

438.496

5424.594

9181.626

102.419

500

4319.94

13686.37

20890.24

849.143

50

63.993

1479.988

1996.783

7.084

100

106.755

2588.816

2702.345

22.783

200

392.034

6966.902

5523.014

128.326

500

3665.337

18607.69

13910.52

1140.583

50

68.707

1814.798

2147.614

8.242

100

105.197

2553.035

2683.744

22.551

200

436.481

7337.539

6788.79

144.654

500

3852.689

19392.38

16327.11

1239.392

91 
 

 

Table B4: simulation setting for multiple trait analysis with covariates

Effect

Sample size

Pct*

N of Cov

Setting for effect**
BBG

Null-same-cov

BGG

BGC













0

0

0

0

0

0

50
100
200

0

2

500
BBG

Null-diff-cov

BGG

BGC













0

0

0

0

0

0

50
100
200

0

6

500
BBG

same-cov

BGG

BGC













0

0.15

0

0.15

0

0.15

50
100
200

30

2

500
BBG

diff-cov

BGC













0

0.15

0

0.15

0

0.15

50
100
200

30

6

500

92 
 

BGG

 

Table B5: type I error for multiple trait analysis with covariates

Sample size

50

100

200

500

Method

Traits distribution*
same covariates
different covariates
BBG

BGG

BGC

BBG

BGG

BGC

SKAT

0.019

0.021

0.114

0.031

0.03

0.096

SKATO

0.033

0.038

0.097

0.042

0.042

0.086

AdjSKAT

0.039

0.038

0.128

0.062

0.046

0.108

GTSU

0.053

0.062

0.062

0.061

0.048

0.05

SKAT

0.025

0.024

0.127

0.039

0.025

0.115

SKATO

0.042

0.043

0.123

0.055

0.031

0.114

AdjSKAT

0.057

0.04

0.137

0.068

0.042

0.123

GTSU

0.052

0.036

0.049

0.049

0.059

0.045

SKAT

0.034

0.03

0.135

0.034

0.034

0.122

SKATO

0.049

0.036

0.126

0.05

0.042

0.119

AdjSKAT

0.051

0.036

0.144

0.056

0.048

0.133

GTSU

0.042

0.056

0.045

0.048

0.048

0.056

SKAT

0.045

0.054

0.161

0.046

0.042

0.157

SKATO

0.056

0.055

0.154

0.055

0.061

0.137

AdjSKAT

0.048

0.059

0.155

0.056

0.045

0.148

GTSU

0.041

0.04

0.047

0.042

0.045

0.046

* “BBG” denotes two binary-distributed traits and one Gaussian-distributed trait; “BGG” denotes
one binary-distributed trait and two Gaussian-distributed traits; and “BGC” denotes one binarydistributed trait, one Gaussian-distributed trait and one Cauchy-distribution trait.

93 
 

 

Figure B1: Power comparison for multiple trait analysis with the same set of covariates

* “BBG” denotes two binary-distributed traits and one Gaussian-distributed trait; “BGG” denotes
one binary-distributed trait and two Gaussian-distributed traits; and “BGC” denotes one binarydistributed trait, one Gaussian-distributed trait and one Cauchy-distribution trait.

94 
 

 

Figure B2: Power comparison for multiple trait analysis with different sets of covariates

* “BBG” denotes two binary-distributed traits and one Gaussian-distributed trait; “BGG” denotes
one binary-distributed trait and two Gaussian-distributed traits; and “BGC” denotes one binarydistributed trait, one Gaussian-distributed trait and one Cauchy-distribution trait.

95 
 

 

Appendix C

Using a traditional linear mixed model, we can build the following model,
Y    G    ,  ~ N (0, ),    2 I   K2 K ,
where Y is the vector for the phenotype, G is the matrix for the genotypes,  K2 is the
variance component for the family structure, and  2 is the variance component for the
noise. Let U represent the eigenvectors for the matrix K . We can decompose the
covariance matrix as K  U U T , where  is the diagonal matrix with eigenvalues of

K on the diagonal. We can get transformed data by letting Y   U T Y , thus
Var (Y )  Var (U T Y )  U T U   2 I   K2  . Since  is the diagonal matrix, the
transformed phenotype values are uncorrelated and the equivalent model is,
Y   U T   U T G    ,   ~ N (0,  2 I   K2  ) .

96 
 

 

Table C1: Simulation setting for family-based sequencing studies
Effect

Sample size

Setting for effect
Binary

Poisson

Gaussian

Student’s t

















0

0

0

0

0

0

0

0

0

1

0

0.5

0

0.5

0

1

80
Null

200
400

  0
  0

80
200
400

97 
 

 

Figure C1: distribution of sizes of SNV sets

98 
 

 

BIBLIOGRAPHY

99 
 

 

BIBLIOGRAPHY

2010. A map of human genome variation from population-scale sequencing. Nature
467(7319):1061-1073.
Abecasis G, Altshuler D, Auton A, Brooks L, Durbin R, Gibbs R, Hurles M, McVean G.
2010. A map of human genome variation from population-scale sequencing.
Nature 467(7319):1061-1073.
Ahituv N, Kavaslar N, Schackwitz W, Ustaszewska A, Martin J, Hebert S, Doelle H,
Ersoy B, Kryukov G, Schmidt S and others. 2007. Medical sequencing at the
extremes of human body mass. American Journal of Human Genetics 80(4):779791.
Barnett IJ, Lee S, Lin XH. 2013. Detecting Rare Variant Effects Using Extreme
Phenotype Sampling in Sequencing Association Studies. Genetic Epidemiology
37(2):142-151.
Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE,
Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR and others. 2008. Assessing the
evolutionary impact of amino acid mutations in the human genome. Plos Genetics
4(5).
Chen H, Meigs JB, Dupuis J. 2013. Sequence Kernel Association Test for Quantitative
Traits in Family Samples. Genetic Epidemiology 37(2):196-204.
Chen LS, Hsu L, Gamazon ER, Cox NJ, Nicolae DL. 2012. An Exponential Combination
Procedure for Set-Based Association Tests in Sequencing Studies. American
Journal of Human Genetics 91(6):977-986.
Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH. 2004.
Multiple rare Alleles contribute to low plasma levels of HDL cholesterol. Science
305(5685):869-872.
Davies RB. 1980. Algorithm AS 155: The Distribution of a Linear Combination of x2
Random Variables. Journal of the Royal Statistical Society. Series C (Applied
Statistics) 29(3):323-333.
Easton DF, Deffenbaugh AM, Pruss D, Frye C, Wenstrup RJ, Allen-Brady K, Tavtigian
SV, Monteiro ANA, Iversen ES, Couch FJ and others. 2007. A Systematic
Genetic Assessment of 1,433 Sequence Variants of Unknown Clinical
Significance in the BRCA1 and BRCA2 Breast Cancer-predisposition Genes. The
American Journal of Human Genetics 81(5):873-883.
100 
 

 

Fay JC, Wyckoff GJ, Wu CI. 2001. Positive and negative selection on the human genome.
Genetics 158(3):1227-1234.
Gregory GG. 1977a. Large Sample Theory for U-Statistics and Tests of Fit. Annals of
Statistics 5(1):110-123.
Gregory GG. 1977b. Large Sample Theory for U-Statistics and Tests of Fit. The Annals
of Statistics 5(1):110-123.
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA.
2009. Potential etiologic and functional implications of genome-wide association
loci for human diseases and traits. Proceedings of the National Academy of
Sciences 106(23):9362-9367.
Ji WZ, Foo JN, O'Roak BJ, Zhao H, Larson MG, Simon DB, Newton-Cheh C, State MW,
Levy D, Lifton RP. 2008. Rare independent mutations in renal salt handling genes
contribute to blood pressure variation. Nature Genetics 40(5):592-599.
Kelly BB, Fuster V. 2010. Promoting Cardiovascular Health in the Developing World:: A
Critical Challenge to Achieve Global Health: National Academies Press.
Kryukov GV, Pennacchio LA, Sunyaev SR. 2007. Most rare missense alleles are
deleterious in humans: Implications for complex disease and association studies.
American Journal of Human Genetics 80(4):727-739.
Kuonen D. 1999. Saddlepoint approximations for distributions of quadratic forms in
normal variables. Biometrika 86(4):929-935.
Ladouceur M, Dastani Z, Aulchenko YS, Greenwood CMT, Richards JB. 2012. The
Empirical Power of Rare Variant Association Methods: Results from Sanger
Sequencing in 1,998 Individuals. Plos Genetics 8(2).
Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani DC,
Wurfel MM, Lin XH. 2012a. Optimal Unified Approach for Rare-Variant
Association Testing with Application to Small-Sample Case-Control WholeExome Sequencing Studies. American Journal of Human Genetics 91(2):224-237.
Lee S, Wu MC, Lin XH. 2012b. Optimal tests for rare variant effects in sequencing
association studies. Biostatistics 13(4):762-775.
Li HZ. 2012. U-statistics in genetic association studies. Human Genetics 131(9):13951401.
Li M, Ye CY, Fu WJ, Elston RC, Lu Q. 2011. Detecting Genetic Interactions for
Quantitative Traits With U-Statistics. Genetic Epidemiology 35(6):457-468.

101 
 

 

Lin D-Y, Tang Z-Z. 2011a. A General Framework for Detecting Disease Associations
with Rare Variants in Sequencing Studies. The American Journal of Human
Genetics 89(3):354-367.
Lin DY, Tang ZZ. 2011b. A General Framework for Detecting Disease Associations with
Rare Variants in Sequencing Studies. American Journal of Human Genetics
89(3):354-367.
Lindsay BG, Markatou M, Ray S, Yang K, Chen SC. 2008. Quadratic distances on
probabilities: A unified foundation. Annals of Statistics 36(2):983-1006.
Liu H, Tang YQ, Zhang HH. 2009a. A new chi-square approximation to the distribution
of non-negative definite quadratic forms in non-central normal variables.
Computational Statistics & Data Analysis 53(4):853-856.
Liu J, Pei Y, Papasian CJ, Deng H-W. 2009b. Bivariate association analyses for the
mixture of continuous and binary traits with the use of extended generalized
estimating equations. Genetic Epidemiology 33(3):217-227.
Luo L, Zhu Y, Xiong M. 2012. Quantitative trait locus analysis for next-generation
sequencing with the functional linear models. Journal of Medical Genetics
49(8):513-524.
Lynch M, Ritland K. 1999. Estimation of Pairwise Relatedness With Molecular Markers.
Genetics 152(4):1753-1766.
Madsen BE, Browning SR. 2009. A Groupwise Association Test for Rare Mutations
Using a Weighted Sum Statistic. Plos Genetics 5(2).
Maity A, Sullivan PE, Tzeng JY. 2012. Multivariate Phenotype Association Analysis by
Marker-Set Kernel Machine Regression. Genetic Epidemiology 36(7):686-695.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA,
Hirschhorn JN. 2008. Genome-wide association studies for complex traits:
consensus, uncertainty and challenges. Nature Reviews Genetics 9(5):356-369.
Morgenthaler S, Thilly WG. 2007a. A strategy to discover genes that carry multi-allelic
or mono-allelic risk for common diseases: A cohort allelic sums test (CAST).
Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis
615(1-2):28-56.
Morgenthaler S, Thilly WG. 2007b. A strategy to discover genes that carry multi-allelic
or mono-allelic risk for common diseases: A cohort allelic sums test (CAST).
Mutation Research-Fundamental and Molecular Mechanisms of Mutagenesis
615(1-2):28-56.

102 
 

 

Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan
S, Purcell SM, Roeder K, Daly MJ. 2011. Testing for an Unusual Distribution of
Rare Variants. Plos Genetics 7(3).
O'Neil KA, Redner RA. 1993. Asymptotic Distributions of Weighted U-Statistics of
Degree 2. The Annals of Probability 21(2):1159-1169.
Pritchard JK. 2001. Are rare variants responsible for susceptibility to complex diseases?
American Journal of Human Genetics 69(1):124-137.
Raychaudhuri S. 2011. Mapping Rare and Common Causal Alleles for Complex Human
Diseases. Cell 147(1):57-69.
Romeo S, Yin W, Kozlitina J, Pennacchio LA, Boerwinkle E, Hobbs HH, Cohen JC.
2009. Rare loss-of-function mutations in ANGPTL family members contribute to
plasma triglyceride levels in humans. Journal of Clinical Investigation 119(1):7079.
Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN. 2005.
Nonparametric tests of association of multiple genes with human disease.
American Journal of Human Genetics 76(5):780-793.
Serfling R. 1981. Approximation Theorems of Mathematical Statistics (Wiley Series in
Probability and Statistics): Wiley-Interscience.
Shapiro CP, Hubert L. 1979. Asymptotic Normality of Permutation Statistics Derived
from Weighted Sums of Bivariate Functions. The Annals of Statistics 7(4):788794.
Shieh GS. 1997. Weighted degenerate U- and V-statistics with estimated parameters.
Statistica Sinica 7(4):1021-1038.
Shieh GS, Johnson RA, Frees EW. 1994. Testing Independence of Bivariate Circular
Data and Weighted Degenerate U-Statistics. Statistica Sinica 4(2):729-747.
Tzeng J-Y, Zhang D, Chang S-M, Thomas DC, Davidian M. 2009. Gene-Trait Similarity
Regression for Multimarker-Based Association Analysis. Biometrics 65(3):822832.
van der Vaart A, Wellner JA. 2000. weak convergence and empirical processes: Springer.
Wei C, Anthony JC, Lu Q. 2012. Genome-Environmental Risk Assessment of Cocaine
Dependence. Frontiers in Genetics 3.
Wei C, Lu Q. 2011. Collapsing ROC approach for risk prediction research on both
common and rare variants. BMC Proceedings 5(Suppl 9):S42.

103 
 

 

Wei C, Schaid DJ, Lu Q. 2013a. Trees Assembling Mann-Whitney Approach for
Detecting Genome-Wide Joint Association Among Low-Marginal-Effect Loci.
Genetic Epidemiology 37(1):84-91.
Wei CS, Schaid DJ, Lu Q. 2013b. Trees Assembling Mann-Whitney Approach for
Detecting Genome-Wide Joint Association Among Low-Marginal-Effect Loci.
Genetic Epidemiology 37(1):84-91.
Wei Z, Li M, Rebbeck T, Li H. 2008. U-Statistics-based Tests for Multiple Genes in
Genetic Association Studies. Annals of Human Genetics 72(6):821-833.
Wet TD, Venter JH. 1973. Asymptotic Distributions for Quadratic Forms with
Applications to Tests of Fit. Annals of Statistics 1(2):380-387.
Wu MC, Lee S, Cai TX, Li Y, Boehnke M, Lin XH. 2011. Rare-Variant Association
Testing for Sequencing Data with the Sequence Kernel Association Test.
American Journal of Human Genetics 89(1):82-93.
Zhang HP, Liu CT, Wang XQ. 2010. An Association Test for Multiple Traits Based on
the Generalized Kendall's Tau. Journal of the American Statistical Association
105(490):473-481.
Zhu X, Feng T, Li Y, Lu Q, Elston RC. 2010. Detecting rare variants for complex traits
using family and unrelated data. Genetic Epidemiology 34(2):171-187.

104