HUMAN GENETIC VARIATIONS AND THEIR EFFECT ON
COMMON COMPLEX DISEASES
By
Ming Li

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Epidemiology
2012

ABSTRACT
HUMAN GENETIC VARIATIONS AND THEIR EFFECTS
ON COMMON COMPLEX DISEASES
By
Ming Li
BACKGROUND: The genetic etiology of common complex diseases has been extensively
studied during the past few years. Though many causal genetic variants have been identified,
they account for only a small percentage of the estimated heritability of complex diseases, such
as breast cancer and cigarette smoking. It remains an open question about where the unexplained
heritability lies and how to find it. The objective of this dissertation research is to examine three
possible sources of such unexplained heritability: 1) the association between copy number
variants and breast cancer, 2) the association between gene-gene interactions and cigarette
smoking, and 3) the association between functional rare variants and a simulated quantitative
trait. METHODS: To detect copy number variants in breast cancer, we examine a breast cancer
dataset from the National Cancer Institute and apply a hidden Markov model. To detect genegene interactions that are associated with cigarette smoking, we examine a genome-wide dataset
from the Study of Addiction: Genetics and Environment and apply a forward U-test. To detect
functional rare variants, we examine a dataset from Genetic Analysis Workshop 17 and apply an
aggregating U-test. RESULTS: In the breast cancer study, we detect five genomic regions on
chromosome 2, 4, 6, 12, and 13. In the cigarette smoking study, we detect two single nucleotide
polymorphisms (SNPs) with potential interactions. These two SNPs are located in genes
CHRNA5 and NTRK2. In the quantitative trait study, we show that the aggregating U-test has a
greater power to detect functional rare variants than a commonly used approach, QuTie.
CONCLUSIONS: Our findings from the breast cancer study suggest that structural changes of

these genomic regions may contribute to the development of breast cancer. Our findings from the
cigarette smoking study indicate that the joint action between genes CHRNA5 and NTRK2 may
contribute to the development of cigarette smoking behavior. These proposed methods provide
useful tools to detect various types of human genetic variations underlying complex diseases.

ACKNOWLEDGEMENTS

I am indebted to many people who have contributed to my doctoral education. I would
like to start by acknowledging my Ph.D advisors, Dr. Wenjiang Fu and Dr. Qing Lu, who
inspired me with challenging ideas, encouraged me to do independent research, and aided me in
every piece of my work. I want to thank Dr. Joseph Gardiner, who gave me helpful suggestions
when I had difficulties and doubts. I want to thank Dr. Ellen Velie, who gave her time and
critical comments that assisted me to organize my dissertation from an epidemiological
perspective. I want to thank Dr. Yuhua Cui, who helped me in achieving a rudimentary
understanding of quantitative genetics. I have benefited from all the members of my dissertation
committee: Dr. Wenjiang Fu, Dr. Qing Lu, Dr. Joseph Gardiner, Dr. Ellen Velie and Dr. Yuehua
Cui. I also want to thank Dr. Bert Gold from the National Cancer Institute for kindly providing
the breast cancer datasets, which made part of this dissertation research feasible.
It has been a privilege to study at the Department of Epidemiology, College of Human
Medicine, Michigan State University. The department provided me with a very friendly
environment for scientific research. Many faculty members in the department have been
excellent teachers for me. I would especially like to thank Dr. Claudia Holzman, Dr. David
Todem and Dr. Zhehui Luo, who offered the kindest help when I needed.
Finally, I want to thank my wife Kate, Zhongnan Zhang, who has always believed in my
abilities and has been supportive through all the trying times. Thank you for maintaining some
normal home life for us while I focused on the research. I am sure that I could not have achieved
any of these without you by my side.

iv

TABLE OF CONTENTS

LIST OF TABLES……………………………………………………………………………….vii
LIST OF FIGURES……………………………………………………………………………..viii
LIST OF KEY ABBREVIATIONS...…………………………………………………………....ix
CHAPTER 1: INTRODUCTION AND AIMS…………………………………………………...1
1. Introduction……………………………………………………………………………1
2. Genetic Basis of Common Complex Diseases..……………….………………………2
3. Human Genetic Variation……………………………………………………………..4
4. Models of Genetic Origin of Complex Diseases..…………………………………….6
5. Methods of Studying the Genetic Etiology of Common Complex Diseases………….8
6. Rationale of this Research…………………………………………………………….9
7. Specific Aims………………………………………………………………………...10
8. Organization of Dissertation…………………………………………………………11
CHAPTER 2: GENETIC RISK FACTORS FOR BREAST CANCER AND CIGARETTE
SMOKING: A REVIEW OF THE LITERATURE…………………………………….………..12
1. Descriptive Epidemiology of Breast Cancer..………………………………………..12
2. Genetic Risk Factors of Breast Cancer………………………………………………13
3. Descriptive Epidemiology of Cigarette Smoking……………………………………15
4. Genetic Risk Factors of Cigarette Smoking………………………………………….16
5. Limitation of the Existing Investigation…...………………………………………...17
6. Significance of the Research…………………………………………………………18
CHAPTER 3: COPY NUMBER VARIATION AND BREAST CANCER…………………….20
1. Introduction..…………………………………………………………………………20
2. Methods………………………………………………………………………………22
3. Results………………………………………………………………………………..32
4. Discussion……………………………………………………………………………37
CHAPTER 4: GENE-GENE INTERACTION AND CIGARETTE SMOKING....…………….45
1. Introduction..…………………………………………………………………………45
2. Methods………………………………………………………………………………48
3. Results………………………………………………………………………………..51
4. Discussion……………………………………………………………………………69
CHAPTER 5: RARE VARIANTS AND QUANTITATIVE TRAITS......……………………...75
1. Introduction..…………………………………………………………………………75
2. Methods………………………………………………………………………………76
v

3. Results and Discussion………………………………………………………………80
4. Conclusion…………………………………………………………………………...82
CHAPTER 6: SUMMARY AND FUTURE DEVELOPMENT………………………………...84
1. Summary of the Findings…………………………………………………………….84
2. Future Development......……………………………………………………………...85
REFERENCES…………………………………………………………………………………..87

vi

LIST OF TABLES

Table 3.1.Configuration of all possible copy number states …………………………………….39
Table 3.2.Error rate for inference of copy number states with correctly and incorrectly specified
expected length of copy number states…………………………………………………………..40
Table 3.3.Regions showing significant copy number changes in phase III and their significance
levels in phase I………………………………………………………………………………….41
Table 4.1.Average trait values for two-locus joint action models……………………………….59
Table 4.2.Comparison between the forward U-test and GMDR………………………………...60
Table 4.3.Comparison between the forward U-test and GMDR when the quantitative traits are
simulated from the distribution of number of cigarette smoked per day………………………...62
Table 4.4.Comparison between the forward U-test and GMDR when the quantitative traits are
simulated from the distribution of life-time FTND scores………………………………………63
Table 4.5.Summary of two SNPs identified in FSCD and replicated in COGA and COGEND..64
Table 4.6.Analysis result of GMDR in FSCD and replication in COGA and COGEND……….65
Table 4.7.Goodness of Fit with the SNPs identified by forward U-test and GMDR……………66
Table 5.1.Power comparison between the aggregating U-test and QuTie……………………….83

vii

LIST OF FIGURES

Figure 3.1.Distribution of the size of CNVs..……………………………………………………44
Figure 4.1.Trait distributions in Simulation II. A: the distribution of the number of cigarette
smoked per day; B: the distribution of Participants’ life-time score of FTND………………….58
Figure 4.2.Trait distributions in FSCD, COGA and COGEND.A: the distribution of trait in
FSCD; B: the distribution of trait in COGA; C: the distribution of trait in COGEND………….67
Figure 4.3.Joint effect of two SNPs showing potential statistical interaction. A: average trait by
genotype groups in FSCD; B: average trait by genotype groups in COGA; C: average trait by
genotype groups in COGEND.......................................................................................................68

viii

LIST OF KEY ABBREVIATIONS

BRCA

Breast Cancer

CS

Cigarette Smoking

GWAS

Genome Wide Association Study

CNV

Copy Number Variant

SNP

Single Nucleotide Polymorphism

CDCV

Common Disease – Common Variant

CDRV

Common Disease – Rare Variant

ix

CHAPTER 1.
INTRODUCTION AND AIMS
1.1. Introduction
The past decades have witnessed accelerated progression in the field of human genetics, with
a large and rapid expansion in the understanding of the inherited genetic etiology of Mendelian
diseases, but limited advancement in our understanding of the inherited genetic etiology of
common complex diseases, such as breast cancer and cigarette smoking. The causal genetic
variants identified so far confer relatively small increments in risk, and explain only a small
percentage of the heritability of breast cancer (12.5%) and cigarette smoking (~10%) [1,2]. Many
possible explanations of this issue of “missing” heritability have been suggested, such as various
types of human genetic variations, their complicated interactions and the limited power of
current statistical methods. The objective of this dissertation research is to develop statistical
methods applicable to the study of the inherited genetic etiology of common complex diseases.
Specifically, I will examine three possible sources of the unexplained heritability, copy number
variants, gene-gene interactions, and functional rare variants [3], aiming to understand the
genetic etiology of breast cancer and cigarette smoking.
In this dissertation, novel statistical methods will be proposed and applied for investigating
the genetic etiology of two common complex diseases, breast cancer and cigarette smoking.
Breast cancer is the most common malignancy in women. In the United States, it accounts for
about 2% of deaths from all causes in the general population [4]. It was estimated that in 2010,
207,090 women were diagnosed and 39,840 women died from breast cancer [5]. Smoking is also
a well known risk factor for many complex human diseases, such as cardiovascular diseases and
lung cancer. It was estimated that smoking caused approximately 435,000 deaths annually in the

1

United States, which was 18.1% of all deaths [6]. Understanding the genetic etiology of breast
cancer and cigarette smoking will have a profound impact on reducing the burden of diseases in
the population. The focus of this research is to study the effect of inherited genetics variations.
Somatic mutations that only occur in somatic cells after conception are not considered.
1.2. Genetic Basis of Common Complex Diseases
Until now, Mendelian disorders are the most well understood human genetic disorders in
regard to their causes and mechanisms. In such cases, the disorders are caused by single gene
defects through either dominant or recessive patterns [7]. Though Mendelian diseases may vary
in severity, their risks usually can be predicted accurately by the genotypes. For example, two
deleterious mutations in -hemoglobin gene can predict sickle cell anemia accurately [8]. So far,
many genes underlying Mendelian diseases have been mapped, and successfully cloned [9].
These successes have led to a significant improvement for early diagnosis and treatment of
Mendelian disorders. However, Mendelian disorders are typically rare in the population. Their
total incidence was estimated to be less than 5% [10]. Most diseases have multi-factorial
etiologies and are referred to as complex diseases. The development of complex diseases is
usually associated with the joint effect of multiple genes and the environmental factors.
Significant public health concerns are now focused on the complex disorders [11], such as
cardiovascular diseases, diabetes and cancers. These complex diseases are already common in
developed countries and are becoming prevalent in developing countries [12]. Based on the
disease prevalence in the population, the complex diseases can be differentiated as either
‘common disease’ or ‘rare disease’. There is no universal cut-off value between the prevalence
of common and rare diseases. In the United States, a common disease usually is referred to as a
disease that develops in more than 200,000 persons or about 1 in 1,500 people [13].

2

Understanding the genetic etiology of common complex diseases is crucial for reducing the
morbidity as well as mortality in the population.
The genetic contribution of common complex diseases is supported by the clustering of cases
within families. For example, parental history of cardiovascular diseases (CVD) is a well
accepted predictor for CVD risk in offspring. Evidence from the Framingham Heart Study
showed that the risk of CVD before age of 55 was significantly higher among those with parental
history than among those without parental history. The estimated age-adjusted odds ratios were
2.6 in men and 2.3 in women [14]. Though shared environmental factors might also contribute to
the excessive familial risk, the corresponding odds ratios were 2.0 in men and 1.7 in women after
adjusting for other known environmental risk factors, indicating that the excessive familial risk
was largely due to genetics factors [14].
The inherited genetic contribution to complex diseases or traits can be measured by their
heritability, which is defined as the proportion of the trait variation in a population that can be
attributed to genetic variability [15]. Two types of heritability are commonly used in the
literature: the broad sense or the narrow sense. The broad sense of heritability (H2) is estimated
by partitioning the variance of the trait into a genetic variance component and an environmental
variance component. In his seminal work in quantitative genetics, R.A Fisher proposed to further
partition the genetic variance into additive, dominant and epistatic components [16]. The additive
and dominant components measure the genetic variation that can be explained by a generalized
linear model through either an additive or dominant pattern, while the epistatic component
measures the genetic variation that deviates from a generalized linear model. The narrow sense
of heritability (h2) can be estimated by the ratio between the additive component of genetic
variation and the total variation of trait. Both the broad and narrow sense of heritability is widely

3

used to describe the relative contribution of genetic or environmental factors to the trait in a
particular population. To avoid any confusion in this dissertation, heritability refers to the broad
sense definition unless specified otherwise.
1.3. Human Genetic Variation
Human genetic variation may occur on many different scales, ranging from single nucleotide
polymorphisms (SNPs) to large structural alterations (e.g. chromosome duplication or deletion)
that can affect thousands or millions of nucleotides [17]. Each genetic variant may have multiple
forms, referred to as alleles. SNP is the most common type of sequence variation and occurs
when a single nucleotide varies at a particular site between individual genomes. On the average,
two individual genomes differ from one another by approximately 0.1% of DNA nucleotide sites
[18]. The majority of SNPs are bi-allelic, and can form three possible genotypes by the
combination of two nucleotides [19]. The two nucleotides are differentiated by their frequencies
in the population as minor allele or major allele. In the past, only those polymorphisms with
minor allele frequencies (MAFs) greater than 1% were defined as SNPs. With the advent of
genomic era, this frequency requirement is no longer necessary. Instead, the SNPs with MAFs
greater or less than 1% are now referred to as common or rare variants. The total number of
SNPs in the human genome is about 10 million, which constitutes about 90% of the genetic
variation in human [20,21,22]. Because of the genotyping convenience and the dense coverage of
human genome, SNPs are predominantly used as genetic markers to study the genetic etiology of
complex human diseases.
A human genome consists of twenty-three pairs of chromosomes, including twenty-two pairs
of autosomal chromosomes and one pair of sex chromosomes. Therefore, each human ordinarily
has two copies of each autosomal region. During the past few years, solid evidence has shown

4

that structural alterations, due to insertions, deletions and inversions of the DNA, also contribute
considerably to the variability of the human genomes [23,24,25]. These structural changes may
cause copy number differences in particular genomic regions, ranging from one kilobase to a
complete chromosome arm. A copy number variant (CNV) is defined as a genomic region where
the DNA copy number differs between two or more individuals. CNVs are far more complex
than SNPs. The human genomes differ by only 0.1% with respect to SNPs, but by 1.2% with
respect to CNVs. Aside from that, about one quarter of the CNVs occur without any SNPs in the
region [26]. CNVs also pervasively exist in the population. Even for monozygotic twins who are
presumed to be genetically identical in terms of sequence variations, their genomes can differ by
CNVs [27]. One recent study among monozygotic twins estimated that ~10% of the CNVs were
not observed in any of the parents, which were referred to as De Novo CNVs. Further, 35% of
these De Novo CNVs differed between the monozygotic twins [28]. Therefore, copy number
variation can be viewed as an important form of human genetic variation.
Human genetic variations also exist in other forms besides SNPs and CNVs. One example is
the microsatellite, also known as short tandem repeat (STR) or simple sequence repeat (SSR).
Microsatellites are usually formed by repeating sequences of 1-6 base pairs of DNA at particular
genomic positions, and may vary among individuals by the number of repeats [29]. Compared to
SNPs and CNVs, microsatellites are less stable across populations or generations, and they also
have a low coverage of the human genome [30]. Because of these reasons, microsatellites have
been less frequently used as genetic markers than SNPs or CNVs.
Penetrance measures the individual effect of each genetic variant on a disease, which is
defined as the probability to develop a disease for an individual carrying a particular genotype.
Penetrance is often expressed as an age-related cumulative frequency. The penetrance may vary

5

greatly across diseases or genetic variants. For example, the penetrance of breast cancer by age
70 was estimated to be 65% and 45% for BRCA1 and BRCA2 mutations, respectively [31], and it
was generally lower than 10% for most of the other disease-susceptibility variants. The highest
value of penetrance is 100%, which is also called complete penetrance. One example is the
familial hyper-cholesterolemia (FH). FH is caused by deleterious mutation in the LDL receptor
(LDLR) gene. The mutation has a dominant effect, and carrying a single copy of the mutant gene
will lead to a 2-fold increase in LDL production in blood. Nearly 100% individuals with the
mutation in gene LDLR will develop FH [32]. According to their disease penetrance level, the
disease-susceptibility variants may fall into three categories: high-penetrance variants, moderatepenetrance variants and low-penetrance variants [33]. However, the classification is usually
empirically determined and depends on the disease of interest.
1.4. Models of the Genetic Origin of Common Complex Diseases
There has been a long debate regarding how genetic variants contribute to the development
of common complex diseases. It now seems clear that the genetic variants may influence the
susceptibility of common complex diseases in at least two ways: Common Disease-Common
Variant (CDCV) model and Common Disease – Rare Variant (CDRV) model [34]. The CDCV
hypothesis asserts that complex diseases are caused by multiple genetic variants with appreciable
frequencies in the population at large, but each confers a small or moderate effect [35]. The
CDRV hypothesis, on the other hand, argues that the complex diseases are mostly caused by
multiple genetic variants with low frequencies in the population, but each confers a relatively
large effect [36]. Each hypothesis has been perused by a substantial number of researchers and is
supported with a large amount of evidence.

6

The CDCV hypothesis is predominant in the literature and has provided a theoretical basis
for the extensive genome-wide association studies (GWASs) [37]. The GWASs are typically
population based. Each subject in a GWAS is genotyped with a large number of SNPs (e.g. over
100K or 500K) that are dense enough to cover the whole genome. The rationale of GWASs is
that most of the human genome falls into highly correlated segments, within which genetic
variants are in strong linkage disequilibrium (LD) with one other [38]. In order to be successful
in a genetic association study, the SNPs being tested can either be the causal SNPs or those SNPs
in strong LD with the causal SNPs. Consequently, a number of representative SNPs can be
selected in each genomic region as genetic markers for the association test. These representative
SNPs are referred to as tagSNPs, which are selected according to the high-density maps of the
human genome. To date, hundreds of GWASs have been conducted and have detected a large
number of genetic variants that are associated with over 40 complex diseases [39,40,41,42,43].
Many findings have also been replicated in diverse populations [44]. Though most of the
associated SNPs are found to be located in non-coding regions and are not directly involved in
protein productions, they may have important regulatory functions that control gene behaviors,
such as gene expression levels [45]. These findings provide compelling evidence that many
complex diseases are caused by the collective effect of multiple common genetic variants.
The CDCV hypothesis is also challenged by a number of investigators. They argue that the
rare variants may have weak correlation with the higher-frequency tagSNPs used in GWASs.
Therefore, the indirect association mapping via tagSNPs may have a low power to detect the
causal rare variants [46,47]. Many investigators have surveyed rare sequence variations and have
detected multiple rare variants that are involved in the etiology of complex diseases or complex
traits [48,49,50,51]. These rare variants are more likely to cause a disease individually rather

7

than jointly, which is different from the mechanism of common causal variants [34]. These facts
suggest that the genetic etiology of complex diseases is highly heterogeneous [52]. New
strategies are in great need to detect the rare variants underlying complex diseases.
1.5. Methods of Studying the Genetic Etiology of Complex Diseases
Understanding the genetic architecture of common complex diseases includes three major
aspects: detecting the genetic variants involved in the disease, uncovering their distributions in
the population and estimating the magnitude of their effect [53]. Understanding the genetic
architecture of common complex diseases is crucial to identify the high risk population, to
facilitate disease prevention, and also to promote personalized medicine. However, connecting
genotypes and disease phenotypes is no simple task. It was not until the 1980’s that a general
method, linkage analysis, was first proposed [54]. In the 1990’s, the emerging understanding of
molecular biology and the development of the Human Genome Project offered insight into
possible candidate genes that might be functionally related to genetic disorders [55,56,57] . As a
result, genetic association studies using candidate genes became popular, providing powerful
alternatives for genetic linkage analysis. Genetic association studies compare the frequency of
specific genetic variants between cases and controls, and do not require samples from family
pedigrees as linkage analysis does. Therefore, they are more suitable for detecting genetic risk
factors that commonly present in the population. During the past few years, genetic association
studies have searched for genetic risk factors across the entire human genome, which is made
possible by the comprehensive high-density maps of the human genome and the advancement of
genotyping technologies [37,58,59,60,61]. These GWASs advance the field of human genetics
dramatically by the identification of many novel genetic risk factors.

8

At present, GWASs are commonly adopted for revealing the genetic etiology of complex
diseases. Though a substantial number of disease-susceptibility variants have been identified, the
genetic etiology of complex diseases remains elusive. The identified genetic variants only
account for a small percentage of the estimated heritability of complex diseases, such as type 2
diabetes (6%) or Crohn’s diseases (20%) [44]. It is also unclear how many genetic variants in the
human genome are associated with diseases, and how the genetic variants interact with one
another to cause diseases. It is no surprise that additional genetic variants with lower effect sizes
may exist and can be discovered by increasing the sample sizes of GWASs [62,63]. However,
searching through larger GWASs does not seem likely to uncover all the remaining genetic risk
factors [64]. The challenge arises as to where the unexplained heritability lies and how to find it.
Though this important issue is still under debate, researchers have suggested that it is partly, if
not mostly, due to the following reasons: 1) copy number variations have not been well
understood; 2) gene-gene interactions pervasively exist in biological pathways; 3) rare variants
have been scarcely addressed in genetic association studies.
1.6. Rationale of this Research
Extensive studies have been conducted to investigate the genetic etiology of breast cancer
and cigarette smoking. However, the heritability of these traits remains largely unexplained.
Many possible explanations have been suggested for this issue of “missing” heritability. First,
besides sequence variations, large structural alterations, such as copy number variants, may also
contribute to disease development. Relatively few studies have investigated functional CNVs,
especially for CNVs with a small-to-intermediate size. Second, the effect of one genetic variant
may be suppressed or enhanced by the other variants through complex interactions, which is also
termed epistasis. Therefore, the association test may have a low power if the loci are examined

9

separately without considering potential interactions. Third, rare variants may also play a major
role in the development of complex human diseases. Analysis of rare variants holds great
promise to detect novel disease-susceptibility loci. However, challenge still remains for
statistical modeling because of the low allele frequencies. Sophisticated statistical tools are in
great needs to address these limitations.
1.7. Specific Aims
The objective of this dissertation research is to develop novel statistical methods for the
identification of various types of human genetic variations associated with complex diseases,
including breast cancer (BRCA) and cigarette smoking (CS). The methods will address three
possible sources of the “missing” heritability. The specific aims are:
AIM1. Detecting Copy Number Variants that are Associated with Breast Cancer
I will apply a copy number estimation method, referred to as the Probe Intensity Composite
Representation (PICR) [65], to detect copy number variants that are associated with breast cancer.
The newly established PICR model will be extended with a hidden Markov model for CNV
identification. Data from a recent GWAS of breast cancer will be used for analyses [66]. The
original study was a three-phase case-control study with subjects from a genetically isolated
population, Ashkenazi Jews. I hypothesize that the proposed method can detect small to
intermediate size CNVs that are associated with breast cancer.
AIM2. Detecting Gene-gene Interactions that are Associated with Cigarette Smoking
I will apply a forward U-test to detect gene-gene interactions associated with cigarette
smoking. The method will use U-Statistics to measure the variation of quantitative traits. Data
from the Study of Addiction: Genetics and Environment (SAGE) will be used for analyses. The
cigarette smoking trait will be defined as the number of cigarettes smoked per day: 0 (10

10

cigarettes or less), 1 (11-20 cigarettes), 2 (21-30 cigarettes) and 3 (31 cigarettes or more). I
hypothesize that the proposed method can detect gene-gene interactions among known smokingassociated loci.
AIM3. Detecting Functional Rare Variants that are Associated with Quantitative Traits
I will apply an aggregating U-test to detect rare variants associated with quantitative traits.
Data from Genetic Analysis Workshop (GAW) 17 will be used for analyses. The method will be
an extension of the forward U-test described in aim 2 with the consideration of both common and
rare variants. I hypothesize that the proposed method can have a higher power to detect the
association with rare variants than a commonly used approach, QuTie.
1.8. Organization of the Dissertation
The dissertation is organized as follows: In Chapter 2, I summarize the descriptive
epidemiology of two common complex diseases, breast cancer and cigarette smoking, and
review the inherited genetic risk factors associated with each outcome. In Chapter 3, I propose a
hidden Markov model for detecting copy number variants, and then illustrate the proposed
method by a study of breast cancer. In Chapter 4, I propose a forward U-test for detecting genegene interactions, and then illustrate the proposed method by a study of cigarette smoking. In
Chapter 5, I propose an aggregating U-test for detecting functional rare variants, and then
illustrate the proposed method by a study of quantitative traits. In Chapter 6, I summarize the
findings in these studies and discuss challenges and directions for future development.

11

CHAPTER 2.
GENETIC RISK FACTORS FOR BREAST CANCER AND CIGARETTE SMOKING:
A REVIEW OF THE LITERATURE
2.1. Descriptive Epidemiology of Breast Cancer
Breast cancer usually develops in women, but it can also be found in men. The risk of breast
cancer is about 100 times greater among women than men [67]. The incidence of breast cancer
varies greatly around the world. It is generally higher in developed countries than less-developed
countries. The US has the highest breast cancer incidence around the world. Based on the
number of cases diagnosed in 2003-2007 from 17 Surveillance Epidemiology and End Result
(SEER) centers, the age-adjusted incidence rate was 122.9 per 100,000 women per year. In the
US, the life time risk of developing breast cancer among women is about 1 in 8 (12.15%). It is
the most common malignancy in women, and accounts for about 2% of deaths from all causes in
the general population. It was estimated that 207,090 women were diagnosed and 39,840 women
died from breast cancer in 2010 [5].
According to the National Cancer Institute, breast cancer incidence is highest among
white, non-Hispanic women and lowest among Korean American women. The African American
women have a slightly lower incidence rate of breast cancer than the white women. The death
rate of breast cancer also varies across racial groups, and is highest among the African American
women and lowest among the Chinese American women [5]. The etiology of breast cancer is
largely unknown. Well established risk factors for breast cancer include age, family history,
hormone replacement therapy, radiation, alcohol consumption, and obesity [68,69]. Studies have
suggested that breast cancer risk can be reduced by enhancing physical activities and maintaining
a healthy weight [70]. Currently, the standard and most commonly used method for breast cancer

12

risk prediction is the Breast Cancer Risk Assessment Tool (BCRAT), also known as Gail model.
The Gail model predicts the breast cancer risk for a woman by a number of risk factors that she
is exposed to, including family history of breast cancer, current age, age at menarche, age at first
birth of a child, and race/ethnicity. Other clinical measures, such as medical history of ductal
carcinoma in situ and breast biopsy, can also be incorporated if available [71,72,73,74].
2.2. Genetic Risk Factors of Breast Cancer
Breast cancer is caused by DNA damage due to either germline mutations or somatic
mutations in the process of aging [48,75]. Overall, breast cancer is twice as common among
women with an affected first-degree relative [76]. Although shared environmental factors may
also contribute to the elevated risk, twin studies have indicated that the excessive familial risk is
mainly due to genetic factors [77]. During the past few years, many causal variants have been
successfully identified through linkage mapping and genetic association studies [78]. These
identified breast cancer susceptibility alleles appear to fall into three categories according to their
risk levels and prevalence in the population, including rare high-penetrance alleles, rare
moderate-penetrance alleles and common low-penetrance alleles [79].
The high-penetrance alleles may increase the risk of developing breast cancer by over ten
folds. In the 1990s, two major predisposition genes, BRCA1 and BRCA2, were identified through
linkage mapping [80,81,82]. It was estimated that the average cumulative risk by age 70 was
65% for BRCA1-mutation carriers and 45% for BRCA2-mutation carriers [31]. On the other hand,
because of their low allele frequencies in the general population, these two genes can account for
at most 5% of breast cancer cases [83]. In addition, the high-penetrance alleles also include
germline mutations in a few other tumor suppressors, including TP53, PTEN, STK11 and CDH1.
However, the mutations are very rare and account for a much smaller fraction of the breast

13

cancer cases [84]. Overall, these inherited high-penetrance gene mutations account for less than
7% of all breast cancer cases [85]. At present, it is commonly accepted that no other highpenetrance genes may exist to account for a large proportion of the breast cancer cases [86].
The moderate-penetrance alleles may increase the risk of developing breast cancer by 2-4
folds [78,79]. The susceptibility genes are usually in the same biological pathways with BRCA1
and BRCA2, and are identified through direct interrogation for disease-causing mutations in the
genes. To date, at least four genes are identified by this strategy, including ATM, BRIP1, CHEK2
and PALB2. Compared to BRCA1 and BRCA2, these genes confer less elevated risks of
developing breast cancer. The moderate-penetrance allele carriers have approximately 6-10%
risk of developing breast cancer by age 60 [79]. Similar to the high-penetrance alleles, the
moderate-penetrance alleles also have low frequencies in the population and each makes a
relatively small contribution to the breast cancer incidence.
It is hypothesized that a large proportion of the breast cancer cases are due to the common
alleles that confer very small increases of the risk. These low-penetrance alleles may commonly
present in the general population with an increased risk of less than two folds. They are usually
identified by genetic association studies, either on the basis of candidate genes or through
genome-wide search. During the past few years, many genome-wide association studies
(GWASs) have been conducted to identify common risk loci for breast cancer. For example,
gene fibroblast growth factor receptor 2 (FGFR2) was first identified to be associated with
invasive breast cancer in women less than 60 years old with European ancestry [40]. FGFR2 is
located in the chromosome region of 10q26.23. It was estimated that the mutation in FGFR2
conferred an increased breast cancer risk of 1.26 [1.23-1.30]. This association has been
replicated in a series of studies among different populations [87,88,89,90]. The estimated relative

14

risk ranged from 1.17 to 1.43. Several other genes or genetic variants were also identified to be
associated with BRCA susceptibility, such as TNRC9, MAP3K1, and LSP1 [78,79]. For most of
the genetic variants identified so far, the biological mechanisms still remain unknown.
Despite all the progress in the past two decades, the current findings can only explain a small
fraction of the breast cancer cases, highlighting the need of searching for additional genetic
variants for BRCA susceptibility. Recently, copy number variants have been recognized as a
novel form of genetic variation that can contribute considerably to disease development. One
study showed that the copy number change in gene MTUS1 was associated with breast cancer
[91]. MTUS1 is located on chromosome 8p, a region frequently undergoing deletion. This
deletion variant was found to be associated with a decreased risk of breast cancer with an odds
ratio of 0.58 [91]. Other evidence also suggests that CNVs in other genes or chromosome regions,
such as PIK3CA, 16p12.1 and 16q22.1, may play an important role for breast cancer
development [92,93,94].
2.3. Descriptive Epidemiology of Cigarette Smoking
Smoking was originally used by many civilizations for burnt incense during religious rituals,
and was later adopted for pleasure or as a social tool [95]. As early as the 1950s, the British
Doctors Study provided solid epidemiological evidence of the association between smoking and
lung cancer [96]. Since then, the association has been consistently replicated and the causal
relationship between smoking and lung cancer has been well established. Smoking is also a well
known risk factor for many other complex human diseases, such as cardiovascular diseases. It
was estimated that smoking caused approximately 435,000 deaths annually in the United States,
which was 18.1% of all deaths [97]. Smoking also imposes a great burden to the economy, and is
responsible for about 7% of the total US healthcare costs, or an estimated 157.7 billion dollars

15

each year [97]. Despite the public awareness of its hazard to human health, the prevalence of
smoking in the United States has been barely reduced. In 2002, the estimated number of current
smokers in the US was 45.8 million [98]. Men are five times more likely to smoke than women.
However, the gender difference is diminishing, due to the decline of male smokers and increase
of female smokers. Most smokers are addicted to cigarette smoking due to their dependence on
nicotine. Relatively few smokers can achieve sustained abstinence without medicine or other
help. It was estimated that the success rate for unaided smoking cessation was about 7% after an
average of 10 months of follow-up [99]. On the other hand, cigarette smoking is one of the most
preventable causes of deaths [100]. Understanding the etiology of cigarette smoking can have a
profound impact on the prevention of many complex diseases.
2.4. Genetic Risk Factors of Cigarette Smoking
The quantification of smoking is a major unsolved issue in tobacco-related research. Most of
the available measurements are defined according to various aspects of cigarette smoking, such
as age of smoking initiation and number of cigarettes smoked per day [101]. Smoking is a
complex behavior involving both genetic and environmental factors and their interactions. Peer
and family influences are the strongest environmental factors for the time of smoking initiation.
Genetic factors also play an important role in determining smoking initiation and dependence.
The early evidence comes from twin studies. It was estimated that the average heritability of
nicotine dependence was 56% [102,103]. In the 1990s, the linkage analysis for cigarette smoking
identified a genomic region on chromosome 5q, very close to the D1 dopamine receptor gene
[104]. After that, a number of other regions on chromosome 3, 5, 17 and 18 were also identified
independently by multiple studies using linkage analysis of smoking behavior [105,106,107,108].
Subsequent association studies also identified a number of genes that were associated with

16

cigarette smoking, such as CHRNA5, GABAB2, DDC, BDNF, and COMT. For instance, studies
showed that mutation in gene CHRNA5 was associated with a two-fold risk of developing
nicotine dependence [109]. Meanwhile, the biological mechanisms for these identified genes
remain largely unknown. The etiology of cigarette smoking may involve many genetic variants
through complex biological pathways. It has been suggested that many genes are functionally
related to cigarette smoking through nicotine metabolism and dopaminergic reward system [110].
Detecting the complex interactions among genes and environmental factors is crucial to
understand the biological pathways for disease development.
2.5. Limitation of the Existing Investigations
Extensive studies have been conducted to investigate the genetic etiology of breast cancer and
cigarette smoking. However, the genetic heritability of these diseases remains largely
unexplained. Many possible explanations have been suggested. First, besides the genotypic
variations, large structural alterations, such as copy number variants, may also influence the
disease development. Relatively few studies have focused on detecting functional CNVs,
especially at a small-to-intermediate size. Second, the effect of one genetic variant may be
suppressed or enhanced by the other variants through complex interactions, which is also termed
epistasis [111]. Therefore, the association test may have a low power if the loci are examined
separately without considering potential interactions. Third, rare variants may also play a major
role in the development of complex human diseases. Analysis of rare variants holds great
promise to detect novel disease susceptibility loci. Meanwhile, challenge still remains for
statistical modeling due to the low allele frequencies. In this dissertation research, I am going to
propose novel statistical methods addressing these limitations.
2.6. Significance of the Research

17

BRCA GWAS (AIM 1)
The proposed study will be among the very first ones to study small-to-intermediate size
CNVs in BRCA. Although a number of BRCA GWASs have reported significant findings of
SNP genotypes and CNVs, the reported CNVs are of large size (≥ 50 kb or even several mega
bases). To our knowledge, no small-to-intermediate size CNVs has been reported yet. Compared
to large size CNVs, small-to-intermediate size CNVs can lead to more precise genomic regions
that are functionally related to a disease. In this study, the identification of small-to-intermediate
size CNVs will be achieved with the novel idea of estimating allelic copy numbers at each single
SNP locus by PICR method, and further extending to multiple SNPs by applying a Hidden
Markov Model (HMM).
Gene-gene Interactions Associated with Cigarette Smoking (AIM 2)
The available data suggests that gene-gene interactions are likely to be a major source of the
unexplained heritability of complex diseases. Intuitively, the interactions among genes can be
examined by exploring all possible combinations of the genetic variants [112,113]. However, an
exhaustive search is often not feasible because of the rapidly increasing computational time.
Moreover, when the number of genetic variants is large, an irrelevant combination may
outperform the real disease model simply due to sample randomness. Exhaustive search may
increase the likelihood of finding such irrelevant combinations. To address these limitations, we
propose a novel method that searches for potential gene-gene interactions sequentially, which is
computationally efficient and is applicable to high-dimensional data. In addition, it is also a nonparametric method without any assumption of the trait distribution.
Aggregation of Multiple Rare Variants (AIM 3)

18

Most of the available statistical methods are proposed for common variants analysis. For
rare variants, the number of subjects carrying the rare alleles is usually small. Therefore, the
available methods usually have a low power to detect the association between rare variants and
traits. Right now, the next generation sequencing technology has become popular, which yields a
large amount of genetic data, including both common and rare variants. At present, the most
commonly used approach for detecting phenotypic associations with rare variants is to group
multiple rare variants into a single ‘super’ variant and then combine it with other common
variants for a multivariate analysis, [46,114,115]. Different from existing methods, our method
adaptively collapses a subset of potential disease-susceptibility rare variants. I expect this
method to have a greater power than the existing methods.

19

CHAPTER 3.
COPY NUMBER VARIANTS AND BREAST CANCER
3.1. Introduction
During the past few years, genome-wide association studies (GWASs) have been commonly
adopted for detecting genetic variants underlying common complex diseases, such as breast
cancer and Type II diabetes [40,116]. Though the findings from GWASs have provided valuable
insight into the genetic etiology of complex diseases, they account for a small percentage of the
heritability [117]. The GWASs typically test the genotype frequency of each SNP between a
group of cases and controls. However, it was estimated that two individual genomes were on the
average 99.9% identical with respect to DNA sequence variations [118,119]. Solid evidence
suggests that sequence variations may not be the only source for the heritability of diseases
[3,117]. Alternatively, structural alterations, such as copy number variants (CNVs), can occur
without any sequence variations. It was estimated that the structural alterations accounted for up
to 7.3% of the genetic variability among human genomes [120]. These CNVs may contribute
considerably to the development of many complex diseases, such as cancers [91,121]. However,
until now, the association between CNVs and disease development remains largely unexplored.
Copy number variants were first identified in the early 2000s [122,123], and were found to
exist pervasively in human genomes [124,125]. In the past decade, the rapid advancement of
biotechnology allowed us to characterize human genomes with copy number variations. At
present, two platforms, including Affymetrix high density SNP arrays and Illumina Bead arrays,
are commonly adopted for copy number inference [126,127]. Both platforms provide data in the
form of experimentally determined intensities as surrogates for DNA quantities in the biological

20

samples. Therefore, sophisticated statistical models are in great need to infer the underlying copy
number levels accurately. Smoothing methods are used among the very first studies for copy
number inference [128,129]. These methods usually assume that the underlying copy numbers
may have three levels: normal, copy number gain and copy number loss. After fitting a
smoothing curve along the genomic regions, certain threshold is used to infer copy number levels.
These methods have been applied in many studies for detecting copy number changes. However,
as discussed by Lai et al., the smoothing methods have two major limitations: 1) it is difficult for
the smoothing methods to locate the boundaries of copy number changes; 2) it is difficult for the
smoothing methods to test the significance of copy number changes [130]. Another group of
methods adopt certain change-point models to infer the underlying copy number levels [131,132].
These change-point models usually assume that the SNPs are uniformly distributed in human
genomes, and the underlying copy number levels are piecewise constants with a series of jumps.
By maximizing the likelihood function, the parameters and the change-points can be estimated
for copy number inference. Such models were further extended by various formations of hidden
Markov models (HMMs) [133,134,135,136]. The HMMs usually assume the observed intensities
of SNPs are emitted by an underlying Markov chain, and they explicitly specify the distribution
of the waiting time of copy number changes and the jumping probabilities between copy number
states. These methods have emerged as promising tools for copy number inference.
However, the available methods are commonly proposed to handle the intensity values of
SNPs, which are subjected to large experimental noise. As a result, the quality control issue has
raised considerable concerns regarding the result interpretation and decision making [137]. In a
recent study, Wan et al. proposed a novel approach to estimate copy number abundance on a
single SNP- single array basis, referred to as the Probe Intensity Composite Representation

21

(PICR) [65]. This method models the cross-hybridization between DNA sequences via their
physical binding affinities. It has shown great potential for differentiating copy number signal
from background noise. In this chapter, I propose to extend PICR method with a hidden Markov
model for copy number inference. The copy number abundance is first estimated at each SNP
locus by PICR, and then standardized to achieve equal scaling between multiple samples. A
hidden Markov model is further applied for copy number inference. Compared to the available
HMM-based methods, our method has two major advantages: 1) by estimating the copy number
abundance using PICR, a large proportion of the noise is removed from the intensity values to
improve the performance of HMM; 2) through a novel standardization of the copy number
abundance, our method does not require between array normalizations for multiple samples,
which ensures the data integrity. This proposed method is suitable for detecting copy number
variants with Affymetrix high density SNP arrays.
3.2. Methods
In this section, I first describe the design of Affymetrix SNP array, and then explain the
proposed method step by step. Suppose we have a study population of N subjects, each
genotyped with a large number of K SNPs. Our method first estimates the copy number
abundance at a single SNP locus for each subject by using the newly established PICR model
[65]. It then standardizes the copy number abundance to achieve equal scaling between subjects.
Finally, a hidden Markov model is applied to integrate multiple SNPs for copy number inference.
3.2.1. Design of Affymetrix 500K SNP Array
In an experiment with an oligonucleotide microarray, the array is attached with millions of
short immobilized nucleic acid sequences, known as probes. These probes are designed
complementary to the DNA sequences in biological samples, referred to as targets. These targets

22

are labeled with fluorescent dyes and their abundance can be quantified by the fluorescent
intensities yielded through their hybridization with the probes [138,139,140]. Affymetrix SNP
array uses multiple probe-sets to capture the properties of each SNP. In a 500K SNP array, six
quartets are adopted to interrogate a single dimorphic SNP site with its possible alleles
commonly denoted as A and B. Each quartet consists of 4 types of probes that are 25 base pairs
in length. These probes are designed either perfectly matched to the targets or mismatched at a
particular nucleotide site for each allele: perfect match A (PA), mismatch A (MA), perfect match
B (PB) and mismatch B (MB). The probe-sets are also designed to hybridize with either sense
strands (s=1) or antisense strands (s=-1). The quartets have different shifts (k) for the nucleotide
on the probes (k may take the values -4, -3, -2, -1, 0, 1, 2, 3, 4) from the center nucleotide of the
probes (k = 0 at position 13 of the 25 base pairs) [141].
3.2.2. Estimation for copy number abundance by PICR
The PICR method takes into account the cross-hybridization of the DNA sequences via a
positional-dependent nearest neighbor (PDNN) model [65]. In PICR, the florescent intensity of a
particular probe-set is decomposed into multiple components: the baseline intensity (b), the
products of allelic copy numbers abundance ( Rm ) and the binding affinity between the target
and the probe with respect to different alleles ( Rm ), and a measurement error (ε) (Equation (1)).
The binding affinities ( f A , f B ) are determined by the physical property of the DNA sequences.
Based on the PICR model, the allelic copy number abundance can be estimated via a linear
regression between the intensities and binding affinities.

23


PA
PA
I PA  b  N A f A  N B f B   PA
PB
PB
I PB  b  N A f A  N B f B   PB
MA
MA
I MA  b  N A f A  N B f B   MA

Equation (1)

MB
MB
I MB  b  N A f A  N B f B   MB


3.2.3. Multi-array Equal Scaling by Standardization

The allelic copy number abundance is estimated by PICR on a single array- single SNP basis.
All the fluorescence intensities are subject to experimental scales which may vary among arrays.
It is thus essential to achieve equal scaling for multiple arrays before any further analyses. We
propose a novel standardization approach as:
SCN i , j 

Ni, j , A  N i, j, B
se( N i , j , A  Ni , j , B )

; i  1, 2,......, N ; j  1, 2,......, K

Equation (2)

where N i , j , A ( N i , j , B ) denotes the allelic copy number abundance for SNP j of subject i ;
se( N i , j , A  N i , j , B ) denotes its estimated standard error of N i , j , A  N i , j , B via the linear

regression model of Equation (1). Assuming the random errors in Equation (1) are normally
distributed, the standardized copy numbers follow t distributions identically for
i  1,......, N ; j  1,......, K , and hence, are expected to be on the same scale.

3.2.4. Hidden Markov Modeling for Multi-SNP Copy Number Inference
Modeling Strategy and Copy Number States
As illustrated by Equation (2), our objective is to detect total copy number changes
among subjects. Similar to the existing HMM-based methods [133,135,136], we assume a
particular SNP locus may have 5 possible copy numbers states, with its total copy number
ranging from 0 to 4, (Table 3.1). Such copy number states are not observed directly, and hence,
are hidden. The inference of these hidden states is based on two types of observations, log R
24

ratios (LRR) and B allele frequencies (BAF), which can be calculated by the estimated allelic
copy number abundance. We first estimate the standardized copy number abundance for
the j th SNP of subject i . Because the standardized copy number abundance is a relative measure
with unknown reference, we further define its LRR as:
Ri , j  LRRi , j  log 2 (

SCN i , j
SCN j , reference

);

where SCN j , reference  median ( SCN i , j )
i{Control}

The estimated SCNs among controls are expected to represent normal levels of copy numbers
and can be used to determine the reference level for each particular SNP locus. We further define
the BAF as:

0

Bi , j  BAFi , j  (i , j  a j ) / (b j  a j )

1


where i, j  arctan

scni, j, B / scni, j, A

 /2

i, j  a j
a j  i , j  b j ;

i, j  b j

and a j , b j are the corresponding thresholds for accurate

genotyping of SNP j with PICR. The B allele frequencies provide a normalized measure of
relative signal ratio between allele B and allele A. Similar to a few previous studies, a HMM is
adopted to integrate LRR and BAF for copy number inference [133,135,136]. One novelty of our
method is that we use standardized copy number abundance rather than probe intensities to
calculate corresponding LRR and BAF.
Transition Probability of copy number states
Our proposed model is a time-dependent continuous Markov Chain, using genomic
positions of SNPs as ‘time’. Therefore, the transition probabilities depend on the distance

25

between SNPs. Let zi, j be the underlying copy number state for the j th SNP of subject i and
d j , j  be the physical distance between SNPs j and j in the genome. We define the transition

probability between the copy number states of SNPs j and j as:
exp( d j, j / s )

if s  s

ps , s  (d j , j  )  p ( zi, j   s | zi, j  s )  
;
(1  exp( d j , j  / s )) ps , s  if s  s 

where 1  s, s  5 ; and



pss   1

s s

Here, ps , s (d j , j  ) is the probability for a hidden state s at SNP j to stay at the same state at SNP

j over a distance of d j , j  , which is modeled by an exponential distribution with parameter
1 / s . Therefore, s has the interpretation of the expected length for the copy number to stay at a
particular state s. It is worthwhile to note that ps , s (d j , j  ) may be close to zero when d j , j  is
large. In practice, when two consecutive SNPs are far apart, the Markov chain will be restarted to
avoid the probability of zero for the underlying copy number to stay at the same state.
Emission Probability of Observations
Similar to a few previous studies, we use LRR and BAF as observations, which are
modeled by mixture distributions [133,135,136]. Denote zi, j as the underlying copy number
state for the j th SNP of subject i. Assume the LRR and BAF at a particular SNP locus are
conditionally independent given the underlying copy number state, we have
p( Ri, j , Bi, j | zi , j )  p ( Ri , j | zi , j ) p( Bi , j | zi , j )

We model the emission probability of LRR (R) with the mixture of a uniform distribution and a
normal distribution as:

26

p( Ri, j | zi, j  s ) 

R
 (1   R ) f ( Ri, j ,  R, s ,  R, s ) ;
RM  Rm

1  i  N ;1  j  K ;1  s  5 ;
where f(.) denotes the normal probability density function. Here, we assume the genotyping may
fail with a small probability of  R . Under such a circumstance, the LRR is observed as
background noise, which follows a uniform distribution between its possible minimum ( Rm ) and
maximum values ( RM ). Otherwise, it follows a normal distribution with a mean (  R, s ) and a
standard deviation (  R, s ) with respect to the underlying copy number state. As illustrated by
Table 3.1, the mean and standard deviation of the normal distributions vary by the underlying
copy number states.
As illustrated in Table 3.1, the expected values of BAFs (B) vary by the underlying copy
number states as well as the underlying genotypes. Let Gs be the set of all possible genotypes
for copy number state s. We further denote  s, g and  s, g as the mean and standard deviation of
BAF for a SNP with copy number state s and genotype g, where g  Gs . Let  s , g denotes the
prior probability of BAF for copy number state s and genotype g, which can be calculated by a
binomial distribution based on the B allele frequency in the population (bpf) [133,135,136]. For
example, a SNP with genotype AAB has a copy number of 3, and expected BAF of 1/3. The
prior probability of the BAF can be calculated as:
 3
p(Gi, j  AAB | zi, j  3)    (bpf j )1 (1  bpf j )2 .
1

27

Because the B allele frequencies are only observed within the range of 0 and 1, we model
the emission probability of BAF at a particular SNP locus with the mixture of a uniform
distribution and a few normal and truncated normal distributions:

  B  (1   B )   s , g f ( Bi, j ,  s , g ,  s , g )
for 0  Bi , j  1

gGs


;
p( Bi, j | zi, j  s )    B  (1   B )   s , g  ( Bi, j ,  s, g ,  s , g )
for Bi , j  0

gGs

 B  (1   B )   s, g (1   ( Bi, j ,  s, g ,  s , g )) for Bi , j  1

gGs


1  i  N ;1  j  K ;1  s  5
where f (.) and  (.) are the normal probability density function and cumulative density function
respectively.
Parameter Estimation and Copy Number Inference
In practice, we assume  R   B  0.01 as the empirical error rate for genotyping, and s ,
1  s  5 , are pre-determined to account for the size of copy number variants. The set of

parameters that need to be estimated includes:
  { ( s )  p ( z  s ) as starting probability; s  1, 2,3, 4,5
P  ( pss  ) as transition probability;1  s, s  5

 R, s ; as mean of R; s  1, 2,3, 4,5
 R, s ; as srandard deviation of R; s  1, 2,3, 4,5
 B, s, g ; as mean of B; s  1, 2,3, 4,5; g  1, 2,.....Gs
 B, s, g ; as s tan dard deviation of B; s  1, 2,3, 4,5; g  1, 2,......Gs }
We use the forward-backward algorithm, also known as Baum-Welch algorithm, to optimize the
parameters in Ω [142]. After the parameter estimation, the inference of copy number states is
carried out by Viterbi algorithm [143].
28

Baum-Welch Algorithm
The Baum–Welch algorithm is a particular case of the generalized expectationmaximization (GEM) algorithm, and is commonly used to estimate parameters of a hidden
Markov model. The estimation is achieved by updating parameters interactively to maximize the
likelihood function of the observations from a HMM. The likelihood can be calculated efficiently
via a forward algorithm and a backward algorithm. To describe the Baum-Welch algorithm, we
denote Oi, j  ( Ri, j , Bi, j ) as the observations at j th SNP of subject i, and e(Oi, j , s ) as the
emission probability of Oi, j given the underlying copy number state s at SNP locus j. We have
e(Oi, j , s )  p (Oi, j | zi , j  s )  p ( Ri , j | zi , j  s ) p ( Bi , j | zi , j  s )

The forward algorithm computes the likelihood of the first k observations in a HMM with a
particular ending state s recursively by:

 ( s )e(Oi,1, s )

k 1

 (i, k , s )  p (Oi,1,......, Oi, k , z k  s)    (k  1, s) p (d )e(O , s )

ss s s
i,k
1 k  K
 s

Specifically, the overall likelihood of a Markov chain with length of K can be calculated by:
Li  p(Oi,1,......, Oi, K )   p(Oi,1,......, Oi, K , zi, K  s )    (i, K , s )
s

s

On the other hand, given the current underlying state of s at the k th SNP, the backward algorithm
computes the likelihood of the future observations starting at the (k  1)th SNP.
1

kK

 (i, k , s )  p(Ok 1,......, OK | zi, k  s )   p  (d
)e(Oi, k 1, s )  (i, k  1, s ) 1  k  K
ss k , k 1
 s

Therefore, the posterior distribution for the underlying state of the k th SNP in subject i can be
calculated as:
29

 (i, k , s)  p( zi, k  s | Oi,1,......, Oi, K ) 

p ( zi,k  s , Oi,1,......, Oi, K )
p (Oi,1,......, Oi, K )



p ( zi,k  s, Oi,1,......Oi, k ) p (Oi,k 1,......, Oi, N | zi ,k  s )
p (Oi,1,......, Oi, K )



 (i, k , s)  (i, k , s)
Li

In each iteration step by Baum-Welch Algorithm, the parameters are updated by maximizing the
posterior probability. Denote the overall forward probability, backward probability and
likelihood for N subjects as:
N

 (k , s)   (i, k , s);
i 1
N

 (k , s)   (i, k , s);
i 1

N

L   Li .
i 1

Therefore, the overall posterior probability for the underlying state of each SNP can be given as:

 (k , s )  p( zk  s | O.,. ) 

 (k , s)  (k , s)
,
L

where O.,. is all the observations for Oi, k , 1  i  N ; 1  k  K .
The starting probability  ( s) can be updated by its posterior distribution as:
ˆ
 ( s )  p ( z1  s | O.,. ) 

 (1, s ) (1, s )
;1  s  5 .
L

The transition probability ps , s can be updated as:

30

K

  (k  1, s) ps, s (dk 1,k )e(O.,k , s)  (k , s)
ˆ
ps , s  

k 2
K

:.

K

    (k  1, s) ps,l (dk 1, k )e(O., k , l )  (k , l )
k 2 l  s k  2

The mean and variance of the LRR (R) with respect to the copy number states can be updated as:


ˆ
 R, s 

Ri, j p( Ri, j | zi, j  s )

1i  N ;1 j  K



p( Ri, j | zi, j  s )

;

1i  N ;1 j  K

ˆ
( Ri , j   R, s )2 p ( Ri , j | zi, j  s )


ˆ2
 R, s 

1i  N ;1 j  K



p ( Ri, j | zi, j  s )

.

1i  N ;1 j  K

To estimate the mean and variance of the BAF (B) with respect to the copy number states and
genotypes, we denote | g |B as the allelic copy number of B for genotype g (i.e. | AAB |B  1 ), and
bpf as the allele frequency of B in the population. We update the mean and variance of BAF as:


ˆ
 B, s , g 

Bi, j p ( Bi, j | zi, j  s ) p(Gi, j  g | Zi, j  s )

1 i  N ;1 j  K



p( Bi, j | zi, j  s ) p(Gi, j  g | zi, j  s )

;

1i  N ;1 j  K

 zi, j 
zi , j | g |B
| g |B
where p(Gi, j  g | zi, j  s )  
;
 | g |  (bpf j ) (1  bpf j )

 B
ˆ
( Bi, j   B, s , g )2 p( Bi, j | zi, j  s ) p (Gi, j  g | zi, j  s )


ˆ2
 R, s 

1i  N ;1 j  K



p( Bi, j | zi, j  s ) p(Gi, j  g | zi, j  s )

;

1i  N ;1 j  K

Viterbi Algorithm
Given all the model parameters, the Viterbi algorithm is used to infer the most likely path
for the underlying states. The following steps are implemented:
31

i)

Calculate v(i,1, s )   (s )e(Oi,1, s ); 1  s  5;1  i  N ; as the probability to
produce the first observation for a hidden Markov chain staring with underlying
state s.

ii)

Calculate v(i, j, s )  max(v(i, j  1, s) ps s (d j 1, j )e(Oi, j , s )); 2  j  K as the
s

largest probability to produce first j th observations for a hidden Markov chain to
end at state s.
iii)

Infer the most probable underlying state for the K th SNP of a hidden Markov
chain as: zi, K  arg max s v(i, K , s ) .

iv)

Recursively infer the most probable underlying state for the j th SNP of a hidden
Markov chain as: path( zi, j | zi, j 1  s )  arg max s  [v(i, j, s) pss (d j , j 1)] for

1  j  K 1.
3.3. Results
3.3.1. Simulation Study
In the simulation study, we assume that the length of genome is 106 base pairs. We first
simulate 10K SNPs with their physical positions uniformly distributed in the genome. The
expected lengths of the copy number states are set at 3  50K and l  5K ; l  1, 2, 4,5 . The
transition probabilities between copy number states are set as:
 0
 0.01

( pss )   0.25

 0.01
 0.01


0.01 0.97 0.01 0.01 
0
0.97 0.01 0.01 

0.25
0
0.25 0.25 

0.01 0.97
0
0.01 
0.01 0.97 0.01
0 


32

The parameters for the emission probability of LRR (R) are set as:
State

1

2

3

4

5

 R, s

log 2 (1/ 10)

log 2 (1 / 2)

log 2 (1)

log 2 (3 / 2)

log 2 (2)

 R, s

1

0.15

0.15

0.15

0.2

The parameters for emission probability of BAF (B) are set as:
 0.5

 0.25
 0

 0.05
1



 0 0.5
 and ( B, s, g )   0.05
(  B , s, g ) 
1



 0 1/ 3 2 / 3 1

 0.05
 0 1/ 4 2 / 4 3 / 4 1
 0.05



s  1 g  
s  2 g  A
gB

s  3  g  AA
g  AB

s  4  g  AAA
g  AAB
 g  AAAA g  AAAB
s  5

g  BB
g  ABB
g  AABB

0.05
0.05
0.05
0.05




 for
0.05

0.05 0.05

0.05 0.05 0.05 


g  BBB
g  ABBB




,


g  BBBB 


where g   denotes a copy number loss on both chromosome. The observations of B are further
truncated at 0 and 1.
We simulate 100 subjects by using the above model parameters. For each subject, the
underlying copy number states and genotypes of 10K SNPs are first simulated in a sequential
order according to the transition probabilities. The frequencies of allele B in the population
follow a uniform distribution between [0.1, 0.9]. For each SNP, the observations of LRR (R) and
BAF (B) are then simulated by using the emission probability according to the underlying copy
number state and genotype. Two subjects are randomly selected to estimate the parameters by
using the Baum-Welch algorithm. The estimated parameters are then used to infer the underlying
copy number states for all subjects by using the Viterbi algorithm. For computational precision

33

reason, the convergence criterion is met when the summation of the absolute change of all
parameters is less than 103 . We calculate the error rates for inferring the copy number states of
all SNPs in all subjects. Because the expected lengths of the copy number variants ( s ) are predetermined and may have an impact on the performance of the inference, we also examine the
error rates when they are incorrectly specified. The results are listed in Table 3.2. The
simulation results show that the proposed method is highly accurate to infer the underlying copy
number states when s is correctly specified. The overall error rate for all SNPs is estimated to
be 1.34e-04. When s is incorrectly specified, the error rate increases with the extent of misspecification, but remains at a low level. In our simulation, we find that the error rate is not
inflated seriously with an up-to 10 folds over-specification of s . It is also noted that the error
rate for SNPs with a normal state of two copies decreases by the extent of over-specification of

s . This is because the majority of SNPs belong to a normal state of two copies. The normal
state also has the largest expected length, and a SNP is more likely to be inferred as two copies
when s is large. On the other hand, the error rate for SNPs with a normal state of two copies
increases when s is under-specified. Overall, the error rate is properly controlled when s is
incorrectly specified.
3.3.2. Application to Breast Cancer Data
We also apply the proposed method to detect copy number variants that are associated with
breast cancer development, using a recent GWAS study among Ashkenazi Jews (AJ) [66]. The
original study had three phases. The first phase included 249 breast cancer cases without BRCA1
and BRCA2 mutations, and 299 cancer-free AJ women as controls. The second phase was a
replicate study with 343 candidate SNPs among 950 AJ cases and 979 AJ controls. The third

34

phase was also a GWAS study that included 243 AJ cases and 187 controls. The participants
from phase I and phase III were genotyped with Affymetrix 500K SNP array, while those from
phase II were genotyped by Illumina GoldenGate assay. Because our method is proposed for
Affymetrix SNP arrays, we focus our analysis on the phase I and phase III data.
We use phase III as an initial study for the analyses. The proposed method is first applied to
ten randomly selected controls for parameter training. The parameters are then used to infer copy
number states among all participants. Because the inferred copy numbers are not normally
distributed and their distributions are not straightforward to determine, we further conduct a
Kolmogorov-Smirnov (KS) Test for each SNP to compare the inferred copy numbers between
cases and controls. The KS test is a non-parametric test and does not rely on the distribution of
copy numbers. The significant regions are selected if three consecutive SNPs show significant
copy number differences at a level of 1e-07. After the region is selected, a global p-value is
further calculated by conducting a KS test using the average copy number of the SNPs within the
region. The results are summarized in Table 3.3. The findings include 34 genomic regions from
16 chromosomes. The region with the largest number of significant SNPs is 4q31.23. This region
has 10 SNPs showing significant copy number difference between cases and controls. Besides
region 4q31.23, two regions, 1p21.1 and 10q21.1, both have 7 significant SNPs. Three regions
have 5 SNPs with significant copy number differences, including 6q22.33, 6q27 and 11p12.
These results indicate that copy number alterations on chromosome 4, 6, 1 and 11 may have a
significant impact on the development of breast cancer.
Most of these identified regions were reported in literature for potential involvement with the
development of breast cancer. One SNP in the region 4q31.23 was recently reported to be
significantly associated with breast cancer progression [144]. A gene ARHGAP10-NR3C2,

35

located in this region, was also suggested to be related to carcinogenesis through structural
alteration [145]. In addition, possible copy number changes of region 4q31.23 were observed
from cancer cell line data [146]. Regions 1p21.1 and 10q21.1 were reported repeatedly with
potential association with breast cancer. Chromosome arm 1p was suggested to contain multiple
tumor suppressor genes [147]. Structural alterations of 1p21.1 were observed from many studies
of cancers [147,148,149,150]. Region 10q21.1 also contained multiple candidate tumor
suppressors, such as ANX7 and CDC2 [151,152]. Interestingly for region 6q22.33, it was
identified by the original GWAS as a novel locus for breast cancer development. Our study
confirms this finding and also suggests that the copy number changes in the region may play an
important role.
We further apply the same procedure to the phase I data for replication. The results are
summarized in Table 3.3. Among the regions identified by using phase III data, the copy number
changes remain significant at five regions by using phase I data, including 4q31.23, 6q13,
12q23.1, 13q14.3 and 2p21. These five regions contain 10, 5, 4, 4, and 3 SNPs respectively. The
association between the CNVs within these regions and BRCA susceptibility is supported by the
literature. The long arm of chromosome 6 was reported to be frequently rearranged in human
cancers [153,154,155]. The region of 6q13 was among the regions that showed frequent copy
number alterations [156,157]. In region 12q23.1, a gene SLC5A8 was identified by a previous
study to be affected frequently by structural changes, such as DNA methylation [158,159]. This
gene was actively involved in the gene pathway related to the development of primary human
tumors [160,161]. The region 13q14.3 was reported for copy number changes in various cancers,
such as prostate cancer and breast cancer [162,163,164,165]. The structural changes of region
2p21 were well studied, such as 2p21 deletion syndrome [166]. It was caused by the deletion of a

36

larger portion of genetic material from chromosome 2p21, characterized by infant seizures,
reduced muscle tone, developmental delay, lactic acidosis and unusual facial appearance [167].
The structural changes of 2p12 were also suggested to be involved in cancer development [168].
To validate the results, we also examine the distribution of the sizes of identified CNVs and
compare it with findings from literature. This distribution has a similar shape with one recent
study (Figure 1 of [169]). Further, the density function in our study has the peak value at around
50K. Compared to the study by Li et al (peak at 200K), the identified CNVs by our study have
smaller sizes. This is expected since we start with single SNPs and focus on the small to intermediate size CNVs.
3.4. Discussion
Though genome-wide association studies have identified hundreds of novel diseasesusceptibility loci [170], the genetic architecture of complex disease remains elusive [171]. A
large percentage of the heritability of diseases is still unexplained, highlighting the need to
consider all types of heritable variations besides the sequence variations. As promising
candidates, CNVs are ubiquitous throughout the human genome. It was estimated that about 5%16% of the human genome might undergo copy number changes [126,172,173]. Meanwhile, the
knowledge of CNVs is still limited in regard to their contribution to the disease development
among human populations. In addition, statistical tools are still lacking to infer the CNVs
accurately and efficiently. In this research, we propose a HMM-based approach for copy number
inference, illustrated with an application to breast cancer datasets. The method can be viewed as
an extension of PICR model with an implementation of HMM. Our approach differs from the
other HMM-based methods by: 1) our method first estimates the copy number abundance using
PICR, which removes noises from probe-intensity values; 2) our method achieves equal scaling

37

among multiple subjects by a novel approach of copy number standardization. By doing so, no
between-array normalization is required, which keeps the data integrity.
In the simulation study, we show that the proposed method is highly accurate for copy
number inference. The error rates remain at a low level when the pre-determined model
parameter is mis-specified. The application to the phase III data of the BRCA GWAS identifies a
few genomic regions with significant associations. The associations of five regions, including
4q31.23, 12q23.1, 13q14.3 and 2p21, are replicated by using the phase I data of the BRCA
GWAS. All these genomic regions have been reported in the literature as candidate regions for
the development of primary tumors or other complex disorders. Whereas it is biologically
plausible that the structural changes of these regions may play an important role in the
development of breast cancer, further studies are needed to replicate the association and
investigate the biological mechanisms.
We are also aware that our method may have a few limitations. First, our copy number
estimation method is based on the design of Affymetrix SNP arrays. It currently cannot be
directly applied to the Illumina platform. Second, our method currently focuses on detecting the
total copy number changes, and does not differentiate the paternal/maternal specific copy
numbers. Third, our method is currently suitable for population-based association studies with
unrelated samples. Further extension is needed for its application to studies with samples from
family pedigrees.

38

Table 3.1. Configuration of all possible copy number states
State (z)

Copy Number

Expected LRR

Expected BAF

Possible Genotypes

1

0

log(0)= 

2

1

log2(1/2)=-1

3

2

log2(1)=0

4

3

log2(3/2)=0.585

5

4

log2(2)=1

0
0
1
0
0.5
1
0
0.33
0.67
1
0
0.25
0.5
0.75
1

- (deletion)
A;
B
AA;
AB;
BB
AAA;
AAB;
ABB;
BBB
AAAA;
AAAB;
AABB;
ABBB;
BBBB

39

Table 3.2 Error rate for inference of copy number states with correctly and incorrectly
specified expected length of copy number states

Ave. # of SNP with copy number state in each subject
HMM State

1

2

3

4

5

557

163

8875

185

220

 used in HMM

10,000

Error Rates by copy number state

true

5.92e-04

1.53e-04

2.37e-05

1.40e-04

1.32e-03

2true a

3.97e-03

4.91e-04

1.69e-05

7.01e-04

4.46e-03

5true

4.18e-03

6.13e-4

1.80e-05

7.01e-04

4.51e-03

10true

4.38e-03

9.20e-04

1.80e-05

1.08e-03

4.87e-03

0.5true

9.69e-04

1.53e-04

3.27e-05

1.56e-04

1.32e-03

a

Total

the model specified  is 2 times greater than the true  .

40

1.34e-04
3.55e-04
3.71e-04
4.02e-04
1.66e-04

Table 3.3 Regions showing significant copy number changes in Phase III and their
significance levels in Phase I.

Chromosome

Cytoband

Location

# of
SNP

p-val
(Phase III)

p-val
(Phase I)

1

p21.1

102622376
- 102640646

7

2.62e-13

0.954

1

p12

120292824
- 120312909

3

7.62e-14

0.999

1

q22

154077091
- 154106555

3

2.453e-11

0.999

2

p21

45759616
- 45760637

3

1.106e-08

0.014

2

p12

81196767
- 81197522

3

7.232e-09

0.977

2

q21.1

131925407
-131955270

3

4.872e-13

0.999

3

p14.3

57706175
-57839689

3

1.228e-09

0.116

4

q26

117544365
-117576957

3

4.577e-11

0.138

4

q31.23

148668320
-148697327

10

9.43e-15

7.56e-05

4

q32.3

166885930
-166957371

5

6.664e-11

0.189

5

q14.3

84350898
- 84398999

5

4.330e-14

0.720

5

q22.3

115145252
- 115178424

4

2.220e-16

0.893

41

Table 3.3 . (cont’d)

Chromosome Cytoband

Location

# of
SNP

p-val
(Phase III)

p-val
(Phase I)

6

q13

75247853
-75311831

5

5.218e-15

0.034

6

q22.33

128476625
-128533696

6

2.409e-13

0.806

6

q23.2

134651674
-134672863

5

3.722e-10

0.999

6

q27

165234976
-165247908

6

1.752e-09

0.996

7

q22.1

98318717
-98361309

4

4.727e-11

0.103

7

q31.31

118754169
-118754169

5

1e-17

0.524

8

q11.22

52786953
-52796842

3

4.550e-10

0.840

8

q21.3

90963387
-90964181

3

2.862e-08

0.772

8

q24.13

125649171
- 139914783

3

2.30e-08

0.973

8

q24.3

145891814
- 145948840

4

3.220e-15

7.96e-04

9

p21.3

22270796
-22294230

5

6.249r-09

3.33e-03

10

q21.1

56853055
-74432554

7

1.084e-09

0.998

42

Table 3.3 . (cont’d)

Chromosome Cytoband

Location

# of
SNP

p-val
(Phase III)

p-val
(Phase I)

11

p13

36306019
- 36366302

3

8.95e-11

0.223

11

p12

37905557
- 37916354

6

2.627e-09

0.968

11

q22.3

104741435
- 104806689

5

4.152e-14

0.999

12

q23.1

94977527
- 95052366

4

1.11e-16

1.07e-04

13

q13.3

34828145
- 34846106

4

8.975e-10

0.428

13

q14.3

51036156
- 51071687

4

5.268e-12

6.83e-09

13

q33.1

103334252
- 103344370

5

1.589e-09

0.964

14

q23.1

60136001
- 60140123

5

1.843e-12

0.996

18

p11.31

3597746
- 3635894

3

4.268e-10

0.417

X

q27.3

146596395
- 146646974

4

5.873e-14

0.086

43

Figure 3.1. Distribution of the size of CNVs

30000

20000

10000

0
50K
0e+00

2e+05

4e+05

6e+05
Size of CNV

44

8e+05

1e+06

CHAPTER 4.
GENE-GENE INTERACTION AND CIGARETTE SMOKING
4.1. Introduction
The genetic etiology of common complex diseases is of tremendous interest to clinical and
basic science researchers as well as to the general public. During the past few years, the radical
breakthrough of biotechnologies has enabled us to generate a large amount of genotypic data
with great accuracy [61]. Testing the association between these genetic variants and complex
traits provides an unprecedented opportunity to unravel the mystery of gene functions, which is
crucial for a better understanding of the disease etiology. Meanwhile, the rapid growth of the
data dimensionality also presents great challenges to statistical modeling and hypothesis testing.
Most of the first generation genome wide association studies test the association between
genetic variants and disease outcomes on a single-locus basis [40,116,174]. Though a substantial
number of genetic variants have been identified to be associated with many complex diseases,
such as diabetes and Crohn’s disease, they account for only a small percentage of the heritability
[117]. One possible explanation of the issue of ‘missing’ heritability is that most of the complex
diseases are polygenic in nature. Multiple genetic variants, each conferring a small or moderate
effect, may contribute to the disease development [175,176]. In addition, the effect of one
genetic variant can be suppressed or enhanced by the other variants, which is termed epistasis
[111]. Whereas epistasis per se cannot account for the missing additive heritability, it may often
lead to the lack of power to detect association when loci are examined separately without
considering their potential interactions [177].
Considering the polygenic nature of many complex diseases, statistical approaches for multilocus association analysis have been recently developed. Lin et al. proposed a sequence

45

interaction model in a multivariate regression framework for quantitative traits [178]. Several
studies modeled multi-locus interactions through haplotype analysis [179,180,181]. Schaid et al.
proposed a U-statistic-based score test that could simultaneously examine the association
between multiple genetic variants and dichotomous traits [182]. Wei et al. further extended this
approach for quantitative traits by using data-adaptive weights for different variants [183]. These
approaches comprise the commonly used single-locus approaches, providing powerful
alternatives for genetic association analysis. However, they are less suitable for handling a large
number of genetic variants and for considering interactions, especially high order interactions.
Another group of methods uses a different strategy. They first select a subset of genetic
variants from the totality of the genotyped variants, and then conduct an association test to assess
the joint effect of the selected loci. The genetic variants are usually selected to best describe the
risk of a binary disease outcome or the variation of a quantitative trait. For example, Ritchie et al.
proposed a Multifactor Dimensionality Reduction (MDR) method for balanced case-control
studies [112]. It pools multi-locus genotypes into high-risk and low-risk groups, and hence
reduces the data dimension to one. This method has been further extended in a series of articles.
Martin et al. extended the MDR method for studies with family-based designs [184]. Lou et al.
derived a generalized MDR (GMDR) method that could be applied to both dichotomous and
quantitative traits [113]. The GMDR method is not limited to studies with a balanced design and
has the advantage of allowing for covariate adjustment. It maps the phenotypic traits into
residual scores through certain link functions under a generalized linear model framework, and
then conducts SNP selection and association test based on the residual scores. The extension of
GMDR can also be applied to studies with family-based designs, referred to as pedigree-based

46

GMDR (PGMDR) [185].These approaches have now been commonly used to search for genegene/gene-environment interactions. They are generally non-parametric and model free.
However, the above methods commonly search for SNP combinations exhaustively.
When the number of genetic variants is large, the chances are that an irrelevant combination may
outperform the real disease model simply due to sample randomness. Therefore, when hundreds
of thousands of genetic variants and environmental factors are examined, an exhaustive search
may suffer from loss of power due to the substantial increase in the feature space [186]. In
addition, an exhaustive search may not be computationally feasible for high order interactions,
especially at a genome-wide scale. As discussed by Cordell and Marchini et al., searching for
high order epistasis beyond pair-wise interactions is not computationally affordable and can be
pursued only after single-locus-based filtering [187,188].
As an alternative approach, the forward or sequential selection algorithm has received
growing attention for its computational efficiency [189,190,191]. The algorithm starts with a
null feature set and sequentially adds the best feature that satisfies certain criteria. Real data
applications and simulation studies have also suggested that the forward search may have a
greater power than the exhaustive search [186,189]. In this chapter, I propose a U-statistic-based
multi-locus testing approach for quantitative traits. This method searches a large number of SNPs
for joint gene-gene actions through a forward selection. Compared to the available methods, our
method has the following advantages: 1) it tests the joint association for multiple genetic variants
with the consideration of gene-gene interactions, including high-order interactions; 2) it is a nonparametric method that makes no assumptions of the trait distribution; and 3) it is
computationally efficient and can be applied to high-dimensional data.

47

4.2. Methods
We first introduce notations and the hypothesis of interest. Suppose the study has N subjects.
Let Yi denote the quantitative trait for the ith subject, i  1, 2,......, N ; and
let X i  ( X i1, X i 2 ,......, X iK ) denote K independent SNP genotypes, each taking a value from one
of the three possible genotypes X ij  { AA, Aa, aa}; j  1, 2,......, K . The hypothesis is that these K
SNPs, or a subset of them, are associated with the quantitative trait Y. To test this hypothesis, we
first select k SNPs that best describe the variation of Y, where k  K , and then test whether these
selected SNPs are jointly associated with the trait Y.
U-Statistics
Since the foundational work of Hoeffding, U-Statistics have been widely used in both
theoretical and applied statistical research [192]. They were recently used to build test statistics
for multiple genetic variants [182,183]. However, while considering multiple genetic variants
simultaneously, these approaches calculate the global U-Statistic by assuming an additive effect
across multiple genetic variants, and thus do not consider the gene-gene interactions. We here
introduce a new U-Statistic to test the joint association of multiple genetic variants with the
consideration of gene-gene interactions. In this new method, we measure the difference of the
quantitative traits between two subjects i and j as:

 (Yi , Y j )  Yi  Y j ;

1  i, j  N .

Suppose we have k selected SNPs, which comprise L multi-SNP genotypes, denoted by
G1, G2 ,......, GL . A multi-SNP genotype, Gl , is defined here as a vector of k single-SNP
genotypes that an individual carries (e.g., {g1, g 2 ,......, g k } ). The k SNPs and L multi-SNP
genotypes are selected sequentially out of a total number of K genotyped SNPs (See Section
48

below for details). We denote by Sl  {i, X i  Gl } the group of subjects carrying multi-SNP
genotype Gl , l  1, 2,......, L and ml | Sl | is the number of subjects in group Sl . We define the
between-group U-statistic for group l and group l  as:

U l ,l     (Yi , Y j ); i  Sl , j  Sl  .
i, j

U l , l  is the summation of all possible pair-wise trait comparisons for any two subjects from Sl

and Sl  . In the presence of an association, we expect individuals carrying different multi-SNP
genotypes have different trait values (e.g., those carrying high risk multi-SNP genotypes have
higher trait values than those carrying low risk multi-SNP genotypes). We assume that the
expected quantitative trait values of L multi-SNP genotypes decreases with l (i.e.,
E (YS1 )  E (YS 2 )  ......  E (YSL ) . Practically, we can sort the multi-SNP genotypes according to

their average trait values (i.e., YS1  YS2  ......  YS L ). Based on U l , l  , we further define a global
U-statistic for L groups as:



l ,l U l ,l 

U  1 l  l  L



U l ,l 



ml  ml 
L( L  1)
; l ,l  
.
2
ml ml 

1 l  l  L

Here, the weight parameter  is chosen to account for the number of subjects in each genotype
group. This global U-Statistic measures the overall trait differences among a total number of L
multi-SNP genotype groups.
The global U-Statistic described above is expected to have a zero mean under the null hypothesis
of no association and to follow a normal distribution asymptotically. For simplicity, we denote
U



l ,l U l ,l  , and the variance can be estimated as:

1l l  L

49

Var (U ) 



l2l  (ml2 ml  ml2ml  ) 2 

,

1l  l  L





1l1,l2 ,l  L
l1  l2


1 l1 , l ,l2  L
l1  l2

l ,l l ,l  ml1 ml2 ml 2 
1

2

 l ,l  l ,l  ml ml1 ml2 2 
1


1 l1 ,l ,l2  L
l1  l2

2


1l1 , l2 ,l  L
l1  l2

 l ,l  l ,l  ml1 ml2 ml  2
1
2

l ,l l ,l ml1ml2 ml 2
1
2

where Var (Yi )   2 for any 1  i  N . The derivation is described in Appendix.
U-Statistic-Based Forward Selection Algorithm
When a large number of SNPs are examined, it is likely that a significant proportion of
the SNPs are not disease-related, and thus conducting a model selection will be necessary. We
here introduce a computationally efficient U-Statistic-based forward selection algorithm that
searches among a large number of SNPs for disease-susceptibility loci. A subset of loci is
selected to best describe the variation of the traits. We start by taking all individuals as a single
genotype group. In the first step, each SNP j can form two single-SNP genotypes, {g1j , g 2j } , in
j
j
three possible ways, denoted as {g1j  { AA}, g 2  { Aa, aa}} , {g1j  { Aa}, g 2  { AA, aa}} and

{g1j  {aa}, g 2j  { Aa, AA}} . This leads to a total number of 3K possible partitions that can be
(1)
(1)
j
represented by {G1  g1j , G2  g 2 } , where Gl( s ) denotes the l th multi-SNP genotype at step s.
(1) (1)
We calculate the U-Statistic for each partition {G1 , G2 } . The SNP with the largest value of the

U-statistic is selected, and the corresponding partition is recorded. In the second step, based on
the first selected SNP, a second SNP j  is chosen to form four two-SNP genotypes, denoted








(2)
(1)
(2)
(1)
(2)
(1)
(2)
(1)
by {G1  G1 & g1j , G2  G1 & g 2j , G3  G2 & g1j , G4  G2 & g 2j } . It should be noted

that, if the same SNP from step one is chosen in step two, only three single-SNP genotypes will
(2)
(2)
(2)
be formed, denoted by {G1  { AA}, G2  { Aa}, G3  {aa}} . We screen all SNPs and

50

calculate the U-statistic for each of these partitions. The SNP that increases the U-statistic the
most is chosen, together with its corresponding partition. As the algorithm moves forward, the
global U-Statistic is expected to increase until all the genotype groups are separated. The largest
number of possible genotype groups will be 3K .
We use a 10-fold cross-validation (CV) procedure to decide when the selection algorithm
should be stopped. In this procedure, all the subjects are randomly divided into 10 subgroups.
Nine of the ten subgroups are used as training set, while the remaining one is used as testing set.
The process is repeated ten times to make sure every subgroup has served as a testing set. A
multi-SNP model is determined from each training set, and a U-statistic is calculated for the
corresponding testing set. The selection algorithm is stopped when the U-statistics averaged over
ten testing sets ceases to increase. After the number of forwarding steps is determined, a global
U-statistic is calculated on the whole dataset including all subjects. The nominal significance
level of the association can be tested by using the asymptotic distribution of the global UStatistic. An empirical p-value, which accounts for the inflated Type I error due to the model
selection, can be obtained by the permutation test.
Note that, although the illustration above is specified for joint gene-gene actions, the
same procedure is also valid for joint gene-environment actions. Similar to genetic variables,
environmental factors with categorical or ordinal levels can be directly analyzed. For continuous
environmental factors, however, we need to first cluster them into different levels and then put
them into the model as discrete variables.
4.3. Results
Simulation Results

51

We conduct two sets of simulations to evaluate the performance of the proposed method,
and compare it with a commonly used approach, GMDR. The first set of simulations compares
the performance of two approaches under various underlying disease models. The second set of
simulations evaluates the performance of two approaches when the trait distribution is unknown.
The quantitative traits for the second set of simulations are simulated according to the
distributions of two traits from the Study of Addiction: Genetics and Environment (SAGE)
dataset. The two traits are ‘number of cigarettes smoked per day’ and ‘lifetime Fagerström Test
for Nicotine Dependence (FTND) score’. The trait distributions in SAGE are illustrated in Figure
4.1.
Simulation I
In the first set of simulations, we consider a variety of underlying disease models, starting
with three types of two-locus SNP models (Table 4.1.) introduced by Marchini et al (i.e.,
multiplicative-effect model, additive-effect model and threshold-effect model) [188]. We are
here assuming only one SNP in each locus. To mimic more complex disease scenarios, we also
simulate two three-locus models and two four-locus models. The two three-locus models, which
are extensions of the two-locus models to three loci, are simulated with multiplicative and
additive effects, respectively. Each of the four-locus models comprises two two-locus models
(i.e., two two-way joint actions). We simulate the two-locus models of the first and second fourlocus models with multiplicative and addictive effects, respectively. We further assume the
effects between the two two-locus models for the first and second four-locus models follow an
addictive model and a multiplicative model, respectively. The multi-SNP genotypes are
simulated under the assumption of joint Hardy-Weinberg Equilibrium (HWE). For the two-locus
models, the minor allele frequencies for the risk loci are set at 0.4 and 0.3. For the three-locus

52

models, they are set at 0.4, 0.5 and 0.3. For the four-locus models, they are set at (0.4, 0.3) and
(0.3, 0.4) for each of the two-locus models, respectively. The allele frequencies remain fixed in
this study unless specified otherwise. Noise loci are also introduced to mimic real data
application. The minor allele frequencies of the SNPs at the noise loci are simulated from a
uniform distribution ranging from 0.1 to 0.9. The number of noise loci is adjusted to ensure the
total number of SNPs is always ten. A total of L multi-locus SNP genotypes are formed from the
simulated SNPs at the ten loci, {G1, G2 ,......, GL } , corresponding to different levels of the
quantitative trait. Assuming multi-locus group l has an expected trait value of l , calculated
based on the simulated setting (e.g., additive-effect model), we simulate quantitative traits for a
reference population of one million subjects as:
L

yi   l I{ X i Gl }   i ;
l 1

where  i ~ N (0,1) and I{.} is an indicator function. The forward U-test and GMDR are applied
to 1000 subjects randomly selected from the reference population. For each underlying disease
model, the simulation is repeated 1000 times with 1000 permutations. For both methods, the
association is significant if the test statistic exceeds the 95-th percentile of the corresponding
permutation distribution. The power is then calculated as the probability to detect the joint
association based on 1000 replicates. In a similar manner, we calculate the type I error by only
considering non-causal loci in the model.
The simulation results are summarized in Table 4.2. We report power, Type I error,
sensitivity and specificity. The sensitivity (specificity) is calculated as the probability of
selecting (not selecting) a causal (non-causal) SNP. U-statistics and Testing Balanced Accuracy
(default in GMDR) are used as test statistics to examine the significance level of the two
53

methods. For GMDR, since the quantitative traits are simulated under a normal distribution, an
identity link is used to calculate the score statistics. The simulation results show that, compared
to GMDR, the forward U-test significantly increases the test power under multiplicative and
additive models, while properly controlling the Type I error. For the threshold effect model,
GMDR and the forward U-test have comparable power. In terms of selection accuracy, the
sensitivity of GMDR tends to be higher than that of the forward U-test, with a few exceptions
when both of the causal SNPs have large marginal effects. However, the specificity of the
forward U-test is consistently higher than that of GMDR, and is greater than 0.95 in all scenarios.
On the other hand, the specificity of GMDR is significantly reduced when the effect size
decreases or the complexity of disease model increases. This result indicates that the forward Utest has a low false positive rate for SNP selection, which can partially explain the increase rather
than loss of testing power over GMDR despite of the relatively lower sensitivity, because less
noise loci is selected into the final model. The increase of power in most scenarios can also be
explained by allowing for more than two risk groups in the model. The results in Table 4.2 show
that: 1) for the additive and multiplicative effect models which contain more than two risk
groups, the power of forward U-test is significantly higher than that of GMDR; 2) for the
threshold effect models which contain only two risk groups, the power of forward U-test is
comparable to that of GMDR.
Simulation II
We conduct a second set of simulations to compare the performance of two methods
when the underlying trait distribution is unknown. Two quantitative traits are simulated
according to the distributions of two variables in SAGE, ‘number of cigarettes smoked per day’
and ‘lifetime Fagerström Test for Nicotine Dependence (FTND) score’. For each trait, two-SNP

54

disease models with three types of joint action effects, multiplicative, additive and threshold, are
used for the comparison. Because of the unknown trait distribution, various link functions are
used to calculate the residual scores for GMDR, including zero inflated Poisson, Poisson,
negative binomial, and Gamma. The residual score for zero inflated Poisson is calculated with
the package ‘pscl’ in R [193].
The simulation results are summarized in Table 4.3 and Table 4.4. For both traits,
forward U-test attains a greater power than GMDR, especially under two-SNP models with
additive or multiplicative effect. Among the trait distributions, GMDR has its best performance
by assuming zero-inflated Poisson. When the underlying disease model is the threshold model,
GMDR with a zero-inflated Poisson link can reach the same power as the forward U-test.
However, the power of GMDR is significantly reduced if an inappropriate link function is used.
In all scenarios, the specificity of forward U-test is greater than 0.95 and is consistently higher
than that of GMDR. In terms of sensitivity, the performance largely varies, depending on the
underlying disease models, effect sizes and link functions.
Application to Nicotine Dependence
We apply the proposed method to the Study of Addiction: Genetics and Environment
(SAGE) GWAS dataset, searching for potential joint gene-gene actions among 155 known CSassociated SNPs. The participants of SAGE are unrelated individuals selected from three large,
complementary studies: the Family Study of Cocaine Dependence (FSCD), the Collaborative
Study on the Genetics of Alcoholism (COGA), and the Collaborative Genetic Study of Nicotine
Dependence (COGEND). In our study, the trait of primary interest is the level of addiction to
cigarettes, assessed by the answer to the question ‘How many cigarettes do you smoke per day?’.
It has four ordinal levels: 0 (10 cigarettes or less), 1 (11-20 cigarettes), 2 (21-30 cigarettes) and 3

55

(31 cigarettes or more). The sample sizes of FSCD, COGA and COGEND are 760, 799 and 1356,
respectively. The distributions of traits in three studies are shown in Figure 4.2. From the
literature, we select 155 SNPs across 67 candidate genes that have been reported for potential
association with CS. In the SAGE dataset, genotypes for 128 SNPs are available, and genotypes
for the remaining 27 SNPs are imputed by using PLINK [37,194]. The HapMap phase III
founders of the CEU and ASW populations are used in the imputation as the reference panels for
the white and black subjects [60].
We apply the forward U-test to FSCD for an initial association test and then replicate the
initial findings in COGA and COGEND. Two SNPs, rs16969968 (A/G) and rs1122530 (C/T),
are identified to be significantly jointly associated with the trait with a nominal p-value of 5.31e7 in FSCD. Permutation test is also conducted and the empirical p-value is p<0.001. The two
SNPs are located in genes CHRNA5 and NTRK2. Evaluation of the finding in COGA (pvalue=1.08e-5) and COGEND (p-value=0.02) shows that the association remains significant at
0.05 level (Table 4.5). The two SNPs together form four two-SNP genotypes:
G1  {{ AA or AG} &{CC or CT }}, G2  {{ AA or AG} &{TT }},
G3  {{GG}&{CC or CT }}, G4  {{GG} &{TT }} . In order to study any potential interaction
between the two SNPs, we calculate the average trait values in each genotype group. From
FSCD, we find that the effect of rs16969968 is modified differently by the genotypes of
rs1122530, indicating a potential interaction between the two SNPs (Figure 4.3). A similar trend
is observed in COGA and COGEND (Figure 4.3). In particular, it should be noted that this
interaction is “essential” and not completely removable by a monotonic transformation of the
data [195].

56

We also apply GMDR on the same datasets. For the initial association study on FSCD,
the disease models are searched with up to 3-way joint actions, and a zero inflated Poisson link is
assumed. The results show that the model with two SNPs performs the best in terms of Testing
Balance Accuracy, CV consistency, and sign test p-value (Table 4.6). Whereas GMDR identifies
rs16969968 (A/G) that overlaps with the result of forward U-test, it also picks up a different SNP,
rs573400 (A/G), which is located in gene GABRA2. Examination of the two SNPs in the other
two datasets shows that the association remains significant in COGA (p-value=0.0001), but is
not significant in COGEND (p-value=0.6230). We use linear regression models to fit the trait
values with the grouping strategies identified by both methods and examine the goodness of fit
with R-Squares (Table 4.7). The results show that the SNPs identified by GMDR have a better fit
than the SNPs identified by forward U-test in FSCD, but not in COGA and COGEND. Both
methods may indicate plausible joint gene-gene actions. Although the findings of both methods
cannot be directly compared, the results from the association and goodness-of-fit analyses
suggest that the finding of forward U-test may be more robust across different studies.

57

Figure 4.1.Trait distributions in Simulation II. A: the distribution of the number of cigarette smoked per day; B: the
distribution of Participants’ life-time score of FTND.

58

Table 4.1. Average trait values for two-locus joint action models
Two-locus joint action with
multiplicative effects
bb

Two-locus joint action with
additive effects
BB
α(1+  21 )(1+  22 )

Aa

α
α(1+ 11 )

Bb
α(1+  21 )
α(1+ 11 )(1+  21 )

AA

α(1+ 11 )(1+ 12 )

α(1+ 12 )(1+  21 )

aa

bb

BB
α+  22

α(1+ 11 )(1+  22 )

α
α+ 11

Bb
α+  21
α+ 11 +  21

α(1+ 12 )(1+  22 )

α+ 12

α+ 12 +  21

59

Two-locus joint action with
a threshold effect
bb

Bb

BB

α

α

α

α+ 11 +  22

α

α (1+)

α (1+)

α+ 12 +  22

α

α (1+)

α (1+)

Table 4.2. Comparison between forward U-test and GMDR

Power
Type I Err.
Sensitivity
Specificity
Power
Type I Err.
Sensitivity
Specificity
Power
Type I Err.
Sensitivity
Specificity

Forward
U-test
0.947
0.026
0.564
0.968
0.878
0.042
0.794
0.970
0.623
0.059
0.589
0.961

Power
Type I Err.
Sensitivity
Specificity
Power
Type I Err.
Sensitivity
Specificity
Power
Type I Err.
Sensitivity
Specificity

0.991
0.038
0.915
0.978
0.762
0.074
0.749
0.964
0.474
0.06
0.562
0.954

0.824
0.045
0.817
0.878
0.267
0.061
0.758
0.803
0.098
0.044
0.672
0.746

Power
Type I Err.
Sensitivity
Specificity

0.808
0.049
0.609
0.984

0.871
0.058
0.992
0.935

Power
Type I Err.
Sensitivity
Specificity

0.412
0.057
0.554
0.972

0.385
0.031
0.897
0.838

Disease Model
Two-locus Multiplicative
a
1
1.1
Average trait 1

1.2

Average trait 2

1.3

1.4

Two-locus Multiplicative
Average trait 1
1
1.2

1.3

Average trait 2

1.2

1.3

Two-locus Multiplicative
Average trait 1
1
1.1

1.2

Average trait 2

1.3

1

1

1

1.2

Two-locus Additive
Average trait 1
1
1.3
Average trait 2
1
1.3
Two-locus Additive
Average trait 1
1
1.2

1.4
1.4

Average trait 2
1
Two-locus Additive

1.2

1.3

Average trait 1

1

1.1

1.2

Average trait 2

1

1.2

1.3

1.3

Two-locus Threshold Model
b
0.4
0.4
RAF
Average trait

1.5

Two-locus Threshold Model
RAF
Average trait
a
b

0.5

0.5
1.3
st

Average trait value for genotype AA, Aa, aa of 1 causal SNP.
Risk allele frequency for causal SNPs.

60

GMDR
0.781
0.038
0.604
0.921
0.462
0.051
0.780
0.843
0.342
0.048
0.695
0.789

Table 4.2. (cont’d)
Forward
U-test

Disease Model

GMDR

Three-locus Multiplicative
Average trait 1
1
1.1
Average trait 2
1
1.1
Average trait 3
1
1.2

1.1
1.2
1.2

Power
Type I Err.
Sensitivity
Specificity

0.612
0.042
0.408
0.960

0.198
0.051
0.609
0.741

Three-locus Additive
Average trait 1
1
1.1
Average trait 2
1
1.2
Average trait 3
1
1.2

1.2
1.2
1.3

Power
Type I Err.
Sensitivity
Specificity

0.672
0.054
0.454
0.967

0.228
0.045
0.610
0.799

Two-locus × Two-locus
Multiplicative/Additive
Average trait 1-1
1
1.1
Average trait 1-2
1
1.2
Average trait 2-1
1
1.2
Average trait 2-2
1
1.1

1.2
1.3
1.3
1.2

Power
Type I Err.
Sensitivity
Specificity

0.871
0.046
0.474
0.978

0.438
0.045
0.568
0.840

Power
Type I Err.
Sensitivity
Specificity

0.838
0.059
0.423
0.976

0.483
0.050
0.533
0.840

Two-locus × Two-locus
Additive/Multiplicative
Average trait 1-1
1
1.1
Average trait 1-2
1
1.2
Average trait 2-1
1
1.1
Average trait 2-2
1
1.1

1.2
1.3
1.3
1.2

61

Table 4.3. Comparison between forward U-test and GMDR when the quantitative traits are simulated from the distribution of
number of cigarette smoked per day
Forward
U-test

GMDR
(Zero Infl.
Poisson)

GMDR
(Poisson)

GMDR
(Negative
Binomial)

GMDR
(Gamma)

Power
Type I Err.
Sensitivity
Specificity

0.930
0.056
0.562
0.957

0.540
0.056
0.634
0.870

0.289
0.079
0.546
0.910

0.217
0.061
0.511
0.938

0.355
0.052
0.659
0.841

Power
Type I Err.
Sensitivity
Specificity
Power
Type I Err.
Sensitivity
Specificity

0.952
0.050
0.754
0.967
0.948
0.056
0.749
0.966

0.924
0.054
0.982
0.944
0.694
0.069
0.789
0.847

0.526
0.063
0.781
0.918
0.343
0.066
0.646
0.920

0.291
0.074
0.623
0.943
0.247
0.086
0.570
0.932

0.843
0.064
0.952
0.931
0.579
0.063
0.764
0.870

Disease Model
Two-locus Multiplicative
Relative Risk 1

1

1.1

1.2

Relative Risk 2

1

1.3

1.4

Two-locus Threshold
RAF

0.6

Relative Risk

0.6
1.5

Two-locus Additive
Relative Risk 1
Relative Risk 2

1
1

1.3
1.3

1.4
1.4

62

Table 4.4. Comparison between forward U-test and GMDR when the quantitative traits are simulated from the distribution of
life-time FTND scores
Forward
U-test

Disease Model
Two-locus Multiplicative
Relative Risk 1

1

1.1

1.2

Relative Risk 2

1

1.3

1.4

Two-locus Threshold
MAF
Relative Risk

0.4

0.4
1.5

Two-locus Additive
Relative Risk 1
Relative Risk 2

1
1

1.3
1.3

1.4
1.4

Power
Type I Err.
Sensitivity
Specificity
Power
Type I Err.
Sensitivity
Specificity
Power
Type I Err.
Sensitivity
Specificity

0.875
0.039
0.547
0.951
0.779
0.048
0.609
0.981
0.971
0.036
0.853
0.980

63

GMDR
(Zero Infl.
Poisson)
0.624
0.064
0.611
0.885
0.780
0.055
0.984
0.904
0.657
0.060
0.813
0.853

GMDR
(Poisson)
0.421
0.046
0.648
0.860
0.107
0.061
0.496
0.907
0.583
0.073
0.803
0.871

GMDR
(Negative
Binomial)
0.126
0.048
0.530
0.963
0.450
0.057
0.465
0.972
0.160
0.063
0.556
0.964

GMDR
(Gamma)
0.297
0.047
0.560
0.929
0.064
0.068
0.462
0.955
0.369
0.081
0.664
0.916

Table 4.5. Summary of two SNPs identified in FSCD and replicated in COGA and
COGEND

SNP

Allele Chro

Position

Gene

Grouping

rs16969968

A/G

15

78882925

CHRNA5

{AA,AG},{GG}

rs1122530

C/T

9

87464352

NTRK2

{CC,CT},{TT}

64

p-values
FSCD : 5.31e-7
COGA : 1.08e-5
COGEND :0.02

Table 4.6. Analysis result of GMDR in FSCD and replication in COGA and COGEND
Study

Model

Allel
e

Training
Bal. Acc

Testing
Bal. Acc

0.5944

Gene

Sign Test
(p)
7
(0.1719)

0.5511

0.6448

0.6369

10
(0.001)

10/10

CV

1

rs2836823

2

rs16969968
rs573400

3

rs16969968
rs573400
rs9321013

0.6764

0.5803

10
(0.001)

3/10

COGA

rs16969968
rs573400

0.6093

0.6107

10
(0.001)

10/10

COGEND

rs16969968
rs573400

0.5511

0.4840

5
(0.6230)

10/10

FSCD

A/G
A/G

CHRNA5
GABRA2

65

7/10

Table 4.7. Goodness of fit with the SNPs identified by forward U-test and GMDR
R-Squares
Study
forward U-test

GMDR

FSCD

0.0567

0.0656

COGA

0.0348

0.0165

COGEND

0.0051

0.0033

66

Figure 4.2. Trait distributions in FSCD, COGA and COGEND A: the distribution of trait in FSCD; B: the distribution of trait
in COGA; C: the distribution of trait in COGEND.

67

Figure 4.3. Joint effect of two SNPs showing potential statistical interaction A: average trait by genotype groups in FSCD; B:
average trait by genotype groups in COGA; C: average trait by genotype groups in COGEND

68

4.4. Discussion
Complex traits are expected to be caused by the interplay of multiple genes and
environmental factors through complicated mechanisms. If two genes are jointly involved in
producing the variability of a phenotype whether additively or not, biological interaction
between them or their products must be involved [196]. In addition, there may be statistical
interaction that may or may not be removable by a transformation of the data [195]. Thus,
statistical approaches that consider gene-gene/gene-environment interactions, including high
order interactions, are more likely to take this complexity into account and can improve the
discovery process of identifying important genetic variants. In this chapter, I have proposed a
forward U-test for detecting joint association of multiple genetic variants, with the consideration
of gene-gene interactions. Through simulations, we have shown that our method has a better
performance than GMDR under various scenarios, whether statistical interaction exists or not.
The improvement can be explained by several reasons. First, our method is an entirely nonparametric approach and makes no assumption about the trait distribution, while GMDR is based
on a generalized linear model and implicitly specifies the link function with an assumption of the
trait distribution. Second, similar to MDR, GMDR assumes two levels of the quantitative traits
by clustering multi-locus genotypes into a high-risk group and a low-risk group. Our method
measures the differences of traits between genotype groups without constraining the genotype
groups to two levels. Therefore, our method may gain more strength from the quantitative
variation of the trait. Third, unlike MDR and GMDR, which select a set of candidate models for
each model size, the forward U-test uses a cross-validation procedure to choose the most
parsimonious model, making it easier to interpret the results and replicate the findings. Forth, our
method uses a forward selection, instead of an exhaustive selection, and substantially reduces the

69

search space of the SNP combinations. When the number of loci increases, the computational
time and memory use for the exhaustive search increase exponentially, while those increase only
quadratic for the forward search. This makes it computationally feasible for testing high-order
interactions on high-dimensional data (e.g., whole genome-wide data). As discussed by Wu et al
[186], the performance of the selection strategies depends on the underlying disease models. Our
results indicate that, under additive and multiplicative models, forward selection outperforms
exhaustive selection. However, we expect the power to decrease for forward selection if none of
the genetic variants have any marginal effect. In this specific case, exhaustive selection will
perform better than forward selection.
The forward U-test also differs from the other U-Statistic-based methods: 1) It calculates
the global U-Statistic by a summation over the U-Statistics of multi-SNP genotype groups
instead of each single SNP, which implicitly considers the joint gene-gene action that is additive
or not; 2) It searches for the multi-SNP genotypes by a forward selection algorithm, which is
important for high-dimensional data with a large number of non-functional SNPs. The size of the
model selected by forward selection algorithm may depend on the study sample size. The larger
the sample size is, the more likely a high-order interaction can be found. In addition, the choice
of the weight parameter  can also have an impact on the performance of the approach.
Different weights can be used in the proposed method (e.g. l ,l  1 for all l ,l  ). We choose

l ,l  

ml  ml 
ml ml 

in our study because it appears to have the best testing power.

In the real data application, we identify two SNPs, located in CHRNA5 and NTRK2, to be
jointly associated with CS. Both CHRNA5 and NTRK2 have been suggested to be functionally
related to CS. SNP rs16969968, a non-synonymous coding SNP in exon 5 of CHRNA5, was first

70

reported to be CS-related with a significance level of 0.00064 [109]. The association was also
replicated in several other studies [197,198,199,200,201,202]. Studies also suggested that
CHRNA5 might interact with CHRNA3 and CHRNB4 to affect CS and lung cancer
[203,204,205,206]. SNP rs1122530, a non-coding SNP in NTRK2, was found to be associated
with CS in a haplotype analysis with two other SNPs (rs1659400 and rs1187272) in NTRK2
[207]. A previous study found evidence of joint actions between NTRK2 and multiple functional
genes for CS, such as CHRNA4, CHRNB2, and BDNF [208]. However, to our knowledge, no
joint action has been reported for CHRNA5 and NTRK2. Although the joint association of
CHRNA5 and NTRK2 with CS, involving a statistical interaction, reaches a statistically
significant level and is replicated in independent studies, further studies will be necessary to
replicate and investigate the statistical interaction.

71

Appendix
Estimation of the variance of the U-Statistic under the null hypothesis
Suppose we have a study sample of N subjects. We assume their quantitative traits are
independent and have the same variance, denoted as Var (Yi )   2 for i  1, 2,......, N . Further, we
assume that we have L multi-locus genotypes determined by the forward selection procedure,
listed as G1, G2 ,......, GL . Following the same notation above, let Sl  {i, X i  Gl } and
ml , l  1, 2,......, L , be the number of subjects in Sl . The corresponding U-statistic is calculated as



l ,l U l ,l 

U  1 l l   L



l ,l 



ml  ml 
L( L  1)
, where l ,l  
2
ml ml 

1l l  L

For simplicity, we denote
U



l ,l U l ,l 

1l l  L

The variance of the U-statistic can be expressed as
Var (U )  Var (
 l ,l U l ,l  ) 
l2l Var (U l ,l  ) 
l1 ,l1 l2 ,l2 Cov (U l1 ,l1 ,U l2 ,l2 )
,
 L
 L
 L
1 l  l
1l  l
1 l1 l1

1 l2 l2  L

(l1 ,l1)  (l2  l2 )







For all 1  l  l   L , we estimate the group-wise variance for the U-Statistic as:
Var (U l ,l  )  Var (



 (Yi , Y j ))  Var (

iSl , jSl

 Var (ml   Yi  ml
iSl



(Yi  Y j ))

iSl , jSl

 Y j )  ml2  Var (Yi )  ml2  Var (Y j )
jSl

iSl

jSl

 (ml2 ml  ml2 ml  ) 2


The covariance between group-wise U-Statistics is estimated according to different scenarios:
1) l1  l1  l2  l2 ,
72

Cov(U

l1 ,l1

,U

l2 ,l2

)0

2) l1  l2  l ,

Cov(U

,U
)  Cov(U  ,U  )  Cov(
l1 , l1 l2 , l2
l , l1
l ,l2



 Cov(


iSl , j1 Sl

(Yi  Y j2 ))

iSl , j2 Sl

 Cov(ml   Yi , ml 
1

 (Yi , Y j2 ))

iSl , j2 Sl



(Yi  Y j1 ),

iSl , j1 Sl



 (Yi , Y j1 ),

2

iSl

 Yi )
iSl

 ml ml Var (  Yi )
1

2

iSl

 ml ml  ml  2
1

2

 
3) l1  l2  l ,

Cov(Ul l  ,Ul l  )  Cov(Ul1l ,Ul2l )  Cov(
1 1

2 2



i1 Sl1 , jSl

 Cov(

 (Yi1 , Y j ),



i1 Sl1 , jSl

 Cov(ml1



i2 Sl2 , jSl

(Yi Y j ),

2

 ml1 ml2 Var (  Y j )
jSl

 ml1 ml2 ml σ 2

4) l1  l2  l ,

73

jSl



i2 Sl2 , jSl

 Y j , ml  Y j )

jSl

 (Yi2 , Y j ))
(Yi Y j ))

Cov(U l l  , U l l  )  Cov (U l1l ,U ll  )  Cov (
1 1

2 2

2



i Sl1 , j Sl

 (Yi , Y j ),



 Cov(



j  S l , t  Sl2

(Yi  Y j ),

i Sl1 , j Sl

 Cov(m 
l1
2

j  Sl

 ml  ml2 ml σ 2
1


5) l1  l2  l

is equivalent to 4)

74



j  S l , t  S l2


 Y j , ml   Y j )

j Sl

 ml  ml Var (  Y j )
1

 (Y j , Yt ))

2

j  Sl

(Y j  Yt ))

CHAPTER 5.
RARE VARIANTS AND QUANTITATIVE TRAITS
5.1. Introduction
The common disease-common variant (CDCV) model used to be well accepted for the
genetic origin of complex human diseases. It asserts that the common complex diseases are
caused by multiple common genetic variants, each conferring a small or moderate effect [209].
Based upon such a hypothesis, extensive genome-wide association studies (GWASs) have been
conducted, bring into light many common variants underlying common complex diseases, such
as breast cancer [79]. However, the genetic variants identified so far account for only a small
percentage of the heritability of complex diseases [34]. It seems now clear that the genetic
etiology of complex diseases is highly heterogeneous. Some genetic mutations, though
individually rare, may impose a very high risk to disease development [52]. For example,
germline mutations in more than ten genes were found to be associated with elevated risks of
developing breast cancer [48]. Rare variants may play major roles in the development of
complex diseases, and have received growing attention by investigators. With the fast
development of biotechnologies, it is now feasible to genotype rare sequence variations in the
general population rapidly and accurately [61]. Meanwhile, statistical methods are in great need
to detect the association between these genetic variants and the common complex diseases.
The most commonly used approach for detecting the association between rare variants and a
disease outcome is to group multiple rare variants into a single ‘super’ variant, which is further
tested as a common variant. Based upon this idea, Li and Leal developed a Combined
Multivariate and Collapsing (CMC) method for rare variants analysis [46]. It collapses the
genetic variants with minor allele frequencies (MAFs) under certain threshold (e.g., 1%). Such a
75

strategy is extended in several other studies. Morris et al. developed a minor allele proportion
method that calculated the accumulated minor allele frequency for multiple rare variants [210].
Han et al. proposed a data adaptive sum method by considering the opposite direction of genetic
effect [211]. Madsen et al. used a weighted-sum method to group the rare variants according to
their minor allele frequencies and biological functions (e.g., those within the same gene) [114].
Price et al. proposed to use data-determined thresholds of MAFs to differentiate common and
rare variants, and incorporated the functional effect of amino acid changes [115]. Compared to
the multivariate analysis of multiple rare variants, these collapsing methods can reduce the
degree of freedom by creating a single ‘super’ variant comprised of multiple individual rare
variants, and thus improve the testing power. In addition, testing on a single ‘super’ variant can
reduce the burden of multiple testing. However, the existing methods also have a few limitations
that may affect their performance. Collapsing all the rare variants in the same gene or genomic
region, although biologically meaningful, can also introduce non-functional variants into the
‘super’ variant, which may diminish the signal that the functional variants carry. Intuitively, this
limitation can be addressed by only collapsing a subset of disease-susceptibility rare variants. In
what follows, we refer to the collapsing process using trait information as aggregation.
In this chapter, I propose an aggregating U-Test to examine the association between
quantitative traits and multiple genetic variants, including both rare and common variants. The
method first adaptively collapses the disease-susceptibility rare variants into a ‘super’ variant; it
then searches the ‘super’ variant and the remaining common variants for the best multi-SNP
combination, via a forward selection. We apply our method to GAW 17 mini-exome data, and
compare its performance with a commonly used method, QuTie [46].
5.2. Methods

76

U-statistics were previously adopted to examine the joint association of multiple genetic
variants with complex traits [182,183]. In Chapter 4, we have developed a forward U-test to
detect gene-gene interactions [212]. In this chapter, we briefly describe our method and extend it
with the consideration of both common and rare variants. Suppose we have a study population of
N subjects. Let Yi denote the observed value of the quantitative trait for the ith subject,
i=1,2,……N; let X i  ( X i1, X i 2 ,......, X iK ) denote the genotypes of K SNPs for the ith individual,
each taking a value from one of the three possible genotypes, X ij  { AA, Aa, aa} j=1,2,……K.
Without loss of generality, we assume A is the minor allele, and the first r SNPs,
( X i1, X i 2 ,......, X ir ) , are rare variants.
U-Statistics
Suppose we have L multi-SNP genotypes formed by k SNPs of interest, denoted by
G1, G2 ,......, GL . A multi-SNP genotype, Gl , is defined here as a vector of k genotypes that a
subject carries (e.g., {g1, g 2 ,......, g k } ). The k SNPs and L multi-SNP genotypes are selected
sequentially out of a total number of K SNPs (See Section ‘forward U-test’ for details).
Let Sl  {i : X i  Gl }, l  1, 2,......, L , be the group of subjects carrying multi-SNP genotype Gl
and ml | Sl | be the number of subjects in Sl . We measure the trait difference between two
multi-SNP genotype groups Sl and Sl  as:
U l , l     (Yi , Y j ) ,

i  Sl , j  S l  , l  l  ;

i, j

where the kernel function is chosen as  (Yi , Y j )  Yi  Y j . U l , l  is the summation of all possible
pair-wise trait comparisons for any two subjects from Sl and Sl  . In the presence of an
association, we expect subjects carrying different multi-SNP genotypes have different trait
77

values (e.g., those carrying high risk multi-SNP genotypes have higher trait values than those
carrying low risk multi-SNP genotypes). Based on the U l , l  , we can form a global U-statistic.
We assume that the expected quantitative trait values of L multi-SNP genotypes decrease with l
(i.e., E (YS1 )  E (YS 2 )  ......  E (YSL ) ). Practically, we can sort the multi-SNP genotypes
according to their average trait values (i.e., YS1  YS2  ......  YS L ). We define a global U-statistic
for L multi-SNP genotypes as



l ,l U l ,l 

U  1 l l   L



l ,l 



L( L  1)
2

; where l ,l  

ml  ml 
ml ml 

…….. Eq. (1)

1l l  L

Here, the weight parameter l ,l is chosen to account for the number of subjects in different
genotype groups. This global U-Statistic measures the overall trait differences among subjects
from a total number of L multi-SNP genotype groups.
Aggregation of the Rare Variants
When a large number of rare variants are examined, it is likely that a significant
proportion of these rare variants are not associated with the trait. Therefore, collapsing on a
selected subset of rare variants will be necessary. Each rare variant can form two single-SNP
genotypes, {{AA,Aa} and {aa}}. We first calculate the U-statistics between two genotype groups
for each rare variant using Equation (1), and then rank the U-statistics in a decreasing order as
U (1) , U (2) ,......, U ( r ) . Assume V(1) , V(2) ,......, V(r ) are the corresponding rare variants in a

candidate gene, and X i (1) , X i (2) ,......, X i ( r ) are their observed genotypes for subject i. We start
from variant V(1) , and define a ‘super’ variant as

78

1 X i (1)  AA | Aa
Ri1  
0 X i (1)  aa
At each step of the aggregation process, we choose a rare variant with the largest marginal Ustatistics and add it to the ‘super’ variant. Accordingly, we re-define the ‘super’ variant as
1 Ri ( j 1)  1 or X i ( j )  AA | Aa
Rij  
0
otherwise

2 jr

During the collapsing process, the ‘super’ variant always forms two genotype groups, for which
a corresponding U-statistic can be calculated. The aggregation procedure stops at step t, where
the U-statistic starts to decrease, (i.e., U R1  U Rr  ......  U Rt  U Rt 1 ).
Forward U-test
A forward U-test [212] is then used to evaluate the ‘super’ variant and the other common
variants for their joint association with the trait. We start the process by treating all individuals as
a single group. In the first step, each common SNP j can form two single-SNP genotypes,
j
{g1j , g 2j } , in three possible ways, denoted as {g1j  { AA}, g 2  { Aa, aa}} ,
j
{g1j  { Aa}, g 2  { AA, aa}} , and {g1j  {aa}, g 2j  { Aa, AA}} . As a special case, the ‘super’

variant can only form two single-SNP genotypes, {g1j  1, g 2j  0} . This leads to a total number
(1)
(1)
j
of 3( K  r )  1 possible grouping strategies that can be represented by {G1  g1j , G2  g 2 } ,

where Gl( s ) denotes the l th multi-SNP genotype at step s. We calculate the U-Statistic for each
(1) (1)
grouping {G1 , G2 } . The SNP with the largest value of U-statistics is selected, and the

corresponding grouping is recorded. In the second step, based on the first selected SNP, a second
SNP j  is chosen to form four two-SNP genotypes, denoted by

79









(2)
(1)
(2)
(1)
(2)
(1)
(2)
(1)
{G1  G1 & g1j , G2  G1 & g 2j , G3  G3 & g1j , G4  G2 & g 2j } . We calculate the U-

statistics for each of these grouping strategies, and choose the one with the largest U-statistic. It
should be noted that, if the same SNP from step one is chosen in step two, only three single-SNP
(2)
(2)
(2)
genotypes are formed, denoted by {G1  { AA}, G2  { Aa}, G3  {aa}} . As the algorithm

moves forward, the U-Statistics are expected to increase until groups cannot be further split. This
results a series of models with different numbers of groups. The best model with an appropriate
number of groups can be determined by using a 10-fold cross-validation. The U-statistic of the
best model is calculated using the whole dataset. The significance of association can be
evaluated by a permutation test. For each permutation replicate, the same procedure, including
the aggregation process, the model selection and the cross-validation, is applied to calculate the
U-Statistics. By repeating the process 1000 times, we can have a null distribution of U-statistics
and calculate the empirical p-value,

p   I (U perm  U ) /1000
5.3 Results and Discussion
We apply the proposed method to analyze the quantitative trait Q1 from GAW 17 mini-exome
data. Thirty-nine SNPs located in 9 genes are associated with trait Q1. The minor alleles of these
SNPs are associated with higher mean of Q1 and their frequencies range from 0.07% to 16.5%.
We first adjust the trait by age using a linear regression model. The residual scores are used for
our association studies. Based on two hundred replicates, we conduct a gene-based association
study for each of the nine causal genes. For each gene, the traits are permutated for 1000 times to
generate an empirical null distribution of U-statistics. The association is significant if the Ustatistics exceed the 95th percentile of the null distribution. Similar analysis is also conducted

80

using QuTie-0.2. The threshold for rare variants is chosen as MAF<0.01. The results are
summarized in Table 5.1.
Based on the analysis of 9 causal genes, we compare the performance of two methods.
1) Both methods have a high power to detect the association, when there are causal common
variants with large effect sizes. This can be illustrated by the analysis of FLT1 gene. There is
one common variant in gene FLT1 with a large effect size (0.65). The power to detect the
association is 0.84 and 1 for QuTie and aggregating U-test, respectively.
2) The aggregating U-test has a significant power improvement over QuTie, when there are a
large number of rare variants within a gene and only a small number of these rare variants
are causal. This can be illustrated by the analysis of genes ELAVL4, FLT4, HIF1A, and
VEGFA. For example, there are 7 rare variants and 3 common variants within gene ELAVL4,
and only 2 rare variants are causal. The effect sizes of these variants, though relatively large
(0.769 and 0.304), are subsided by collapsing with other SNPs, which reduces the power of
QuTie. Same argument can also be applied to the other three genes, FLT4, HIF1A and
VEGFA.
3) Both methods have a high power to detect the association, when the majority of the rare
variants are causal. This can be illustrated by the analysis of gene KDR. For gene KDR,
QuTie attains a higher power than the aggregating U-test. Since most of the rare variants in
KDR are causal and their effect sizes are relatively large, it will be ideal to collapse all the
rare variants. In such a case, the aggregation has less advantage because the selection
process introduces additional variation. However, we believe this scenario is not common in
real data applications.

81

4) Both methods have a low power to detect the association, when only a small proportion of
the rare variants are causal, each having a small effect size. This can be illustrated by the
analysis of genes ARNT, and HIF3A. For both genes, the selection of rare variants does not
show any advantage because of the low effect size of each functional rare variant. In such a
case, the power of both methods has no significant difference from Type I errors.
5) Both methods have a high power to detect the association when the majority of genetic
variants under examination are causal, each having a large effect size. As an extreme case,
gene VEGFC only has one rare variant, and therefore no selection is necessary. Due to its
large effect size, both methods are able to detect the association. Interestingly, this variant
has a MAF of 0.0717%, which is equivalent to 1 rare allele carrier out of 697 subjects. In
such a case, we expect a low power of the association test if binary outcomes are used
instead of quantitative traits.
In order to examine the Type I errors for the proposed method, we use the same genetic data and
simulate the quantitative traits by assuming a standard normal distribution. The aggregating Utest is applied to 500 Monte Carlo simulated replicates to evaluate Type I errors. The results
show that the Type I errors are well controlled (Table 5.1).
5.4

Conclusion
The performance of statistical methods to detect the association between rare variants and

phenotypic traits may be affected by many factors, such as MAFs, the number of rare variants
under examination, the number of functional rare variants and their effect sizes. Compared to the
commonly used method QuTie, our method has two major advantages: 1) it can substantially
improve the testing power when only a small number of rare variants are functional with

82

relatively large effect sizes; 2) it only collapses a subset of rare variants which are potentially
trait-related. Therefore, it can also identify those disease-susceptibility rare variants.

83

Table 5.1. Power comparison between the aggregating U-test and QuTie

Group

Gene

1

FLT1
ELAVL4
FLT4

2
HIF1A
VEGFA
3

KDR
HIF3A

4
ARNT
5

VEGFC

# of Causal SNP/# of Total SNP in
Gene
Rare Variants
Common Variants
11/35
8/25
3/10
2/10
2/7
0/3
2/10
2/8
0/2
4/8
3/7
1/1
1/6
1/5
0/1
10/16
8/14
2/2
3/21
3/15
0/6
5/18
4/15
1/3
1/1
1/1
0/0

84

Power
(QuTie)

Power
(Agg. U)

Type I
(Agg. U)

0.86

1.000

0.036

0.025

0.585

0.050

0.58

0.715

0.060

0.215

0.915

0.048

0

0.265

0.060

0.99

0.840

0.038

0.055

0.05

0.040

0.05

0.07

0.038

0.745

0.785

0.054

CHAPTER 6.
SUMMARY AND FUTURE DEVELOPMENT
6.1 Summary
In this dissertation, I have conducted three studies that explore three possible sources of the
“missing” heritability of complex diseases. First, the human genetic variations underlying
complex diseases include both sequence variations and structural alterations. Copy number
variants may encompass multiple genes as well as non-coding DNAs, accounting for a large
proportion of the variability among human genomes. These copy number variants may serve as
promising candidates of functional units that are associated with the development of common
complex diseases [213]. Second, complex diseases are usually caused by multiple genes through
various complicated biological pathways. Ignoring the complex interactions between genetic
variants will likely reduce the power of detecting novel risk factors underlying complex diseases
[214]. Third, the genetic etiology of complex diseases is highly heterogeneous [52]. Rare
variants may also play an important role in the development of complex diseases.
The three aspects discussed above are investigated in Chapters 3-5. In Chapter 3, a hidden
Markov model has been proposed for detecting copy number variants, with an application to a
breast cancer study. While applying the method to the phase III data of the study, we detect a
number of genomic regions that are associated with breast cancer. The associations of five
regions, on chromosome 2, 4, 6, 12, and 13, remain significant in the phase I data of the study.
The findings suggest that the structural changes of these genomic regions may contribute to the
genetic susceptibility of breast cancer. These findings are consistent with the literature. In
Chapter 4, a forward U-test has been proposed for detecting gene-gene interactions, with an
application to a cigarette smoking study. While applying the method to SAGE GWAS datasets,
85

we identify two SNPs with a statistical interaction. The two SNPs are located in gene CHRNA5
and NTRK2. Both genes have been reported for association with cigarette smoking. In Chapter 5,
an aggregating U-test has been proposed for detecting functional rare variants, with an
application to a quantitative trait study. While applying the method to GAW17 mini-exome data,
I have shown this method attains a higher power to detect the association than a commonly used
method, QuTie. Overall, these proposed methods have provided powerful tools for genetic
association studies. Whereas the findings of this dissertation research are biologically plausible,
further research will be necessary to replicate the results.
6.2 Future Development
Statistical methods for genetic association studies focus on detecting the association
between genetic variants and the phenotypic traits, measured by either binary disease outcomes
or quantitative clinical features. Though a large number of GWASs have been conducted, the
genetic etiology of complex diseases remains largely unknown. There are many possible
explanations for this challenge. First, complex human diseases usually manifest with multiple
sub-phenotypes, representing specific physiological or biochemical processes from various gene
pathways. Taking these sub-phenotypes into account is necessary to address disease
heterogeneity and to provide novel insights into the genetic etiology of complex diseases. At
present, however, our understanding is still limited to describing these sub-phenotypes, and our
statistical tools are not powerful enough to detect the functional variants while accounting for the
genetic heterogeneity [11,215]. Second, this dissertation research has focused on investigating
the genetic effects. However, the effects of genes or genetic variants are commonly modified by
environmental risk factors [216]. It is important for future studies to take into account the
complex gene-environmental interactions. Third, the large dimensionality of genomic data has

86

remained a major obstacle for genetic association studies. The power of the association test is
usually reduced due to the multiple testing adjustments [217,218]. We are still awaiting
sophisticated statistical tools that may have the following properties: 1) able to consider the
genetic heterogeneity; 2) able to account for gene-environmental interactions; 3) feasible on a
genome-wide scale.

87

REFERENCES

88

References
[1]

So HC, Gui AH, Cherny SS, Sham PC (2011) Evaluating the heritability explained by
known susceptibility variants: a survey of ten complex diseases. Genet Epidemiol.

[2]

Frazer KA, Murray SS, Schork NJ, Topol EJ (2009) Human genetic variation and its
contribution to complex traits. Nat Rev Genet 10: 241-251.

[3]

Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH (2010) Missing
heritability and strategies for finding the underlying causes of complex disease. Nat Rev
Genet 11: 446-450.

[4]

Howlader N, AM N, M K, Neyman N, R A, W W, SF A, Kosary CL, Ruhl J, Tatalovich
Z, et al. (2011) SEER Cancer Statistic Review 1975-2008. SEER web site.

[5]

Altekruse SF, Kosary CL, Krapcho M, Neyman N, Aminou R, Waldron W, Ruhl J,
Howlader N, Tatalovich Z, Cho H, et al. (2010) SEER Cancer Statistics Review, 19752007. SEER web site.

[6]

Mokdad AH, Marks JS, Stroup DF, Gerberding JL (2004) Actual causes of death in the
United States, 2000. JAMA 291: 1238-1245.

[7]

Mendel G (1965) Experiments in Plant Hybridization. British Medical Journal 1: 370-&.

[8]

Pauling L, Itano HA, et al. (1949) Sickle cell anemia, a molecular disease. Science 109:
443.

[9]

Conneally PM (2003) The complexity of complex diseases. Am J Hum Genet 72: 229232.

[10]

Chakravarti A (2011) Genomic contributions to Mendelian disease. Genome Res 21: 643644.

[11]

Gibson G (2009) Decanalization and the origin of complex disease. Nat Rev Genet 10:
134-140.

[12]

Abegunde DO, Mathers CD, Adam T, Ortegon M, Strong K (2007) The burden and costs
of chronic diseases in low-income and middle-income countries. Lancet 370: 1929-1938.

[13]

Rare_Disease_Act (2002) http://frwebgateaccessgpogov/cgibin/getdoccgi?dbname=107_cong_public_laws&docid=f:publ280107.

[14]

Lloyd-Jones DM, Nam BH, D'Agostino RB, Sr., Levy D, Murabito JM, Wang TJ, Wilson
PW, O'Donnell CJ (2004) Parental cardiovascular disease as a risk factor for
cardiovascular disease in middle-aged adults: a prospective study of parents and offspring.
JAMA 291: 2204-2211.

89

[15]

Visscher PM, Hill WG, Wray NR (2008) Heritability in the genomics era--concepts and
misconceptions. Nat Rev Genet 9: 255-266.

[16]

Fisher R (1918) The correlations between relatives on the supposition of Mendelian
inheritance. Trans R Soc 52.

[17]

Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N,
Teague B, Alkan C, Antonacci F, et al. (2008) Mapping and sequencing of structural
variation from eight human genomes. Nature 453: 56-64.

[18]

International-HapMap-Consortium (2003) The International HapMap Project. Nature 426:
789-796.

[19]

Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S,
Mullikin JC, Mortimore BJ, Willey DL, et al. (2001) A map of human genome sequence
variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933.

[20]

Kruglyak L, Nickerson DA (2001) Variation is the spice of life. Nat Genet 27: 234-236.

[21]

Reich DE, Gabriel SB, Altshuler D (2003) Quality and completeness of SNP databases.
Nat Genet 33: 457-458.

[22]

Collins FS, Brooks LD, Chakravarti A (1998) A DNA polymorphism discovery resource
for research on human genetic variation. Genome Res 8: 1229-1231.

[23]

Bhangale TR, Stephens M, Nickerson DA (2006) Automating resequencing-based
detection of insertion-deletion polymorphisms. Nat Genet 38: 1457-1462.

[24]

McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC, Dallaire S,
Gabriel SB, Lee C, Daly MJ, et al. (2006) Common deletion polymorphisms in the
human genome. Nat Genet 38: 86-92.

[25]

Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK (2006) A high-resolution
survey of deletion polymorphism in the human genome. Nat Genet 38: 75-81.

[26]

Nature-Publishing-Group (2010) Genomics: The tough new variants. Nature 467: 1136.

[27]

Bruder CE, Piotrowski A, Gijsbers AA, Andersson R, Erickson S, Diaz de Stahl T,
Menzel U, Sandgren J, von Tell D, Poplawski A, et al. (2008) Phenotypically concordant
and discordant monozygotic twins display different DNA copy-number-variation profiles.
Am J Hum Genet 82: 763-771.

[28]

Maiti S, Kumar KH, Castellani CA, O'Reilly R, Singh SM (2011) Ontogenetic de novo
copy number variations (CNVs) as a source of genetic individuality: studies on two
families with MZD twins for schizophrenia. PLoS One 6: e17125.

90

[29]

Bidichandani SI, Ashizawa T, Patel PI (1998) The GAA triplet-repeat expansion in
Friedreich ataxia interferes with transcription and may be associated with an unusual
DNA structure. Am J Hum Genet 62: 111-121.

[30]

Hammock EA, Young LJ (2005) Microsatellite instability generates diversity in brain and
sociobehavioral traits. Science 308: 1630-1634.

[31]

Antoniou A, Pharoah PD, Narod S, Risch HA, Eyfjord JE, Hopper JL, Loman N, Olsson
H, Johannsson O, Borg A, et al. (2003) Average risks of breast and ovarian cancer
associated with BRCA1 or BRCA2 mutations detected in case Series unselected for
family history: a combined analysis of 22 studies. Am J Hum Genet 72: 1117-1130.

[32]

Civeira F (2004) Guidelines for the diagnosis and management of heterozygous familial
hypercholesterolemia. Atherosclerosis 173: 55-68.

[33]

te Meerman GJ, de Vries EG (2001) Relevance of high and low penetrance. Lancet 358:
331-332.

[34]

Schork NJ, Murray SS, Frazer KA, Topol EJ (2009) Common vs. rare allele hypotheses
for complex diseases. Curr Opin Genet Dev 19: 212-219.

[35]

Reich DE, Lander ES (2001) On the allelic spectrum of human disease. Trends Genet 17:
502-510.

[36]

Pritchard JK (2001) Are rare variants responsible for susceptibility to complex diseases?
Am J Hum Genet 69: 124-137.

[37]

Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW,
Boudreau A, Hardenbol P, Leal SM, et al. (2007) A second generation human haplotype
map of over 3.1 million SNPs. Nature 449: 851-861.

[38]

Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES (2001) High-resolution
haplotype structure in the human genome. Nat Genet 29: 229-232.

[39]

WTCCC (2007) Genome-wide association study of 14,000 cases of seven common
diseases and 3,000 shared controls. Nature 447: 661-678.

[40]

Easton DF, Pooley KA, Dunning AM, Pharoah PD, Thompson D, Ballinger DG,
Struewing JP, Morrison J, Field H, Luben R, et al. (2007) Genome-wide association
study identifies novel breast cancer susceptibility loci. Nature 447: 1087-1093.

[41]

Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, Lindgren CM, Perry
JR, Elliott KS, Lango H, Rayner NW, et al. (2007) A common variant in the FTO gene is
associated with body mass index and predisposes to childhood and adult obesity. Science
316: 889-894.

91

[42]

Fellay J, Shianna KV, Ge D, Colombo S, Ledergerber B, Weale M, Zhang K, Gumbs C,
Castagna A, Cossarizza A, et al. (2007) A whole-genome association study of major
determinants for host control of HIV-1. Science 317: 944-947.

[43]

Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H,
Chandrasekaran A, Davies LR, et al. (2007) TRAF1-C5 as a risk locus for rheumatoid
arthritis--a genomewide study. N Engl J Med 357: 1199-1209.

[44]

Manolio TA, Brooks LD, Collins FS (2008) A HapMap harvest of insights into the
genetics of common disease. J Clin Invest 118: 1590-1605.

[45]

Pennacchio LA, Rubin EM (2001) Genomic strategies to identify mammalian regulatory
sequences. Nat Rev Genet 2: 100-109.

[46]

Li B, Leal SM (2008) Methods for detecting associations with rare variants for common
diseases: application to analysis of sequence data. Am J Hum Genet 83: 311-321.

[47]

Liu DJ, Leal SM (2010) Replication strategies for rare variant complex trait association
studies via next-generation sequencing. Am J Hum Genet 87: 790-801.

[48]

Walsh T, King MC (2007) Ten genes for inherited breast cancer. Cancer Cell 11: 103105.

[49]

Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH (2004)
Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305:
869-872.

[50]

Ji W, Foo JN, O'Roak BJ, Zhao H, Larson MG, Simon DB, Newton-Cheh C, State MW,
Levy D, Lifton RP (2008) Rare independent mutations in renal salt handling genes
contribute to blood pressure variation. Nat Genet 40: 592-599.

[51]

Romeo S, Pennacchio LA, Fu Y, Boerwinkle E, Tybjaerg-Hansen A, Hobbs HH, Cohen
JC (2007) Population-based resequencing of ANGPTL4 uncovers variations that reduce
triglycerides and increase HDL. Nat Genet 39: 513-516.

[52]

McClellan J, King MC (2010) Genetic heterogeneity in human disease. Cell 141: 210217.

[53]

Moore JH, Williams SM (2009) Epistasis and its implications for personal genetics. Am J
Hum Genet 85: 309-320.

[54]

Botstein D, White RL, Skolnick M, Davis RW (1980) Construction of a genetic linkage
map in man using restriction fragment length polymorphisms. Am J Hum Genet 32: 314331.

[55]

Jasny BR, Kennedy D (2001) The human genome. Science 291: 1153.

92

[56]

Peltonen L, McKusick VA (2001) Genomics and medicine. Dissecting human disease in
the postgenomic era. Science 291: 1224-1229.

[57]

Jimenez-Sanchez G, Childs B, Valle D (2001) Human disease genes. Nature 409: 853855.

[58]

The_International_HapMap_Consortium (2003) The International HapMap Project.
Nature 426: 789-796.

[59]

The_International_HapMap_Consortium (2005) A haplotype map of the human genome.
Nature 437: 1299-1320.

[60]

Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, de
Bakker PI, Deloukas P, Gabriel SB, et al. (2010) Integrating common and rare genetic
variation in diverse human populations. Nature 467: 52-58.

[61]

Schuster SC (2008) Next-generation sequencing transforms today's biology. Nat Methods
5: 16-18.

[62]

Park JH, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, Chatterjee N (2010)
Estimation of effect size distribution from genome-wide association studies and
implications for future discoveries. Nat Genet 42: 570-575.

[63]

Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA,
Heath AC, Martin NG, Montgomery GW, et al. (2010) Common SNPs explain a large
proportion of the heritability for human height. Nat Genet 42: 565-569.

[64]

Gibson G (2010) Hints of hidden heritability in GWAS. Nat Genet 42: 558-560.

[65]

Wan L, Sun K, Ding Q, Cui Y, Li M, Wen Y, Elston RC, Qian M, Fu WJ (2009)
Hybridization modeling of oligonucleotide SNP arrays for accurate DNA copy number
estimation. Nucleic Acids Res 37: e117.

[66]

Gold B, Kirchhoff T, Stefanov S, Lautenberger J, Viale A, Garber J, Friedman E, Narod
S, Olshen AB, Gregersen P, et al. (2008) Genome-wide association study provides
evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci U S A 105: 43404345.

[67]

Fentiman IS, Fourquet A, Hortobagyi GN (2006) Male breast cancer. Lancet 367: 595604.

[68]

Rossouw JE, Anderson GL, Prentice RL, LaCroix AZ, Kooperberg C, Stefanick ML,
Jackson RD, Beresford SA, Howard BV, Johnson KC, et al. (2002) Risks and benefits of
estrogen plus progestin in healthy postmenopausal women: principal results From the
Women's Health Initiative randomized controlled trial. JAMA 288: 321-333.

93

[69]

London SJ, Colditz GA, Stampfer MJ, Willett WC, Rosner B, Speizer FE (1989)
Prospective study of relative weight, height, and risk of breast cancer. JAMA 262: 28532858.

[70]

Friedenreich CM (2011) Physical activity and breast cancer: review of the epidemiologic
evidence and biologic mechanisms. Recent Results Cancer Res 188: 125-139.

[71]

Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, Mulvihill JJ (1989)
Projecting individualized probabilities of developing breast cancer for white females who
are being examined annually. J Natl Cancer Inst 81: 1879-1886.

[72]

Costantino JP, Gail MH, Pee D, Anderson S, Redmond CK, Benichou J, Wieand HS
(1999) Validation studies for models projecting the risk of invasive and total breast
cancer incidence. J Natl Cancer Inst 91: 1541-1548.

[73]

Gail MH, Costantino JP, Bryant J, Croyle R, Freedman L, Helzlsouer K, Vogel V (1999)
Weighing the risks and benefits of tamoxifen treatment for preventing breast cancer. J
Natl Cancer Inst 91: 1829-1846.

[74]

Gail MH, Costantino JP, Pee D, Bondy M, Newman L, Selvan M, Anderson GL, Malone
KE, Marchbanks PA, McCaskill-Stevens W, et al. (2007) Projecting individualized
absolute invasive breast cancer risk in African American women. J Natl Cancer Inst 99:
1782-1792.

[75]

Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H,
Teague J, Butler A, Stevens C, et al. (2007) Patterns of somatic mutation in human
cancer genomes. Nature 446: 153-158.

[76]

Collaborative-Group-on-Hormonal-Factors-in-Breast-Cancer (2002) Breast cancer and
breastfeeding: collaborative reanalysis of individual data from 47 epidemiological studies
in 30 countries, including 50302 women with breast cancer and 96973 women without
the disease. Lancet 360: 187-195.

[77]

Peto J, Mack TM (2000) High constant incidence in twins and other relatives of women
with breast cancer. Nat Genet 26: 411-414.

[78]

Turnbull C, Rahman N (2008) Genetic predisposition to breast cancer: past, present, and
future. Annu Rev Genomics Hum Genet 9: 321-345.

[79]

Stratton MR, Rahman N (2008) The emerging landscape of breast cancer susceptibility.
Nat Genet 40: 17-22.

[80]

Hall JM, Lee MK, Newman B, Morrow JE, Anderson LA, Huey B, King MC (1990)
Linkage of early-onset familial breast cancer to chromosome 17q21. Science 250: 16841689.

94

[81]

Miki Y, Swensen J, Shattuck-Eidens D, Futreal PA, Harshman K, Tavtigian S, Liu Q,
Cochran C, Bennett LM, Ding W, et al. (1994) A strong candidate for the breast and
ovarian cancer susceptibility gene BRCA1. Science 266: 66-71.

[82]

Wooster R, Neuhausen SL, Mangion J, Quirk Y, Ford D, Collins N, Nguyen K, Seal S,
Tran T, Averill D, et al. (1994) Localization of a breast cancer susceptibility gene,
BRCA2, to chromosome 13q12-13. Science 265: 2088-2090.

[83]

Wooster R, Weber BL (2003) Breast and ovarian cancer. N Engl J Med 348: 2339-2347.

[84]

Kriege M, Brekelmans CT, Boetes C, Besnard PE, Zonderland HM, Obdeijn IM,
Manoliu RA, Kok T, Peterse H, Tilanus-Linthorst MM, et al. (2004) Efficacy of MRI and
mammography for breast-cancer screening in women with a familial or genetic
predisposition. N Engl J Med 351: 427-437.

[85]

Pasche B (2008) Recent advances in breast cancer genetics. Cancer Treat Res 141: 1-10.

[86]

Ripperger T, Gadzicki D, Meindl A, Schlegelberger B (2009) Breast cancer susceptibility:
current knowledge and implications for genetic counselling. Eur J Hum Genet 17: 722731.

[87]

Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, Hankinson SE, Wacholder S, Wang
Z, Welch R, Hutchinson A, et al. (2007) A genome-wide association study identifies
alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat
Genet 39: 870-874.

[88]

Thomas G, Jacobs KB, Kraft P, Yeager M, Wacholder S, Cox DG, Hankinson SE,
Hutchinson A, Wang Z, Yu K, et al. (2009) A multistage genome-wide association study
in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat
Genet 41: 579-584.

[89]

Turnbull C, Ahmed S, Morrison J, Pernet D, Renwick A, Maranian M, Seal S,
Ghoussaini M, Hines S, Healey CS, et al. (2010) Genome-wide association study
identifies five new breast cancer susceptibility loci. Nat Genet 42: 504-507.

[90]

Li J, Humphreys K, Heikkinen T, Aittomaki K, Blomqvist C, Pharoah PD, Dunning AM,
Ahmed S, Hooning MJ, Martens JW, et al. (2010) A combined analysis of genome-wide
association studies in breast cancer. Breast Cancer Res Treat.

[91]

Frank B, Bermejo JL, Hemminki K, Sutter C, Wappenschmidt B, Meindl A, KiechleBahat M, Bugert P, Schmutzler RK, Bartram CR, et al. (2007) Copy number variant in
the candidate tumor suppressor gene MTUS1 and familial breast cancer risk.
Carcinogenesis 28: 1442-1445.

[92]

Tchatchou S, Burwinkel B (2008) Chromosome copy number variation and breast cancer
risk. Cytogenet Genome Res 123: 183-187.

95

[93]

Wu G, Xing M, Mambo E, Huang X, Liu J, Guo Z, Chatterjee A, Goldenberg D, Gollin
SM, Sukumar S, et al. (2005) Somatic mutation and gain of copy number of PIK3CA in
human breast cancer. Breast Cancer Res 7: R609-616.

[94]

Downing TE, Oktay MH, Fazzari MJ, Montagna C (2010) Prognostic and predictive
value of 16p12.1 and 16q22.1 copy number changes in human breast cancer. Cancer
Genet Cytogenet 198: 52-61.

[95]

Robicsek F (1979) The Smoking Gods: Tobacco in Maya Art, History and Religion.
University of Okelahoma Press: 30.

[96]

Doll R, Hill AB (1956) Lung cancer and other causes of death in relation to smoking; a
second report on the mortality of British doctors. Br Med J 2: 1071-1081.

[97]

Mokdad AH, Serdula MK, Dietz WH, Bowman BA, Marks JS, Koplan JP (2000) The
continuing epidemic of obesity in the United States. JAMA 284: 1650-1651.

[98]

Centers-for-Disease-Control-and-Prevention (2004) Cigarette smoking among adultsUnited States. Morb Mortal Wkly Rep 53: 427-431.

[99]

Baillie AJ, Mattick RP, Hall W (1995) Quitting smoking: estimation by meta-analysis of
the rate of unaided smoking cessation. Aust J Public Health 19: 129-131.

[100] Danaei G, Ding EL, Mozaffarian D, Taylor B, Rehm J, Murray CJ, Ezzati M (2009) The
preventable causes of death in the United States: comparative risk assessment of dietary,
lifestyle, and metabolic risk factors. PLoS Med 6: e1000058.
[101] Kandel D, Schaffran C, Griesler P, Samuolis J, Davies M, Galanti R (2005) On the
measurement of nicotine dependence in adolescence: comparisons of the mFTQ and a
DSM-IV-based scale. J Pediatr Psychol 30: 319-332.
[102] Sullivan PF, Kendler KS (1999) The genetic epidemiology of smoking. Nicotine Tob Res
1 Suppl 2: S51-57; discussion S69-70.
[103] Li MD, Cheng R, Ma JZ, Swan GE (2003) A meta-analysis of estimated genetic and
environmental effects on smoking behavior in male and female adult twins. Addiction 98:
23-31.
[104] Duggirala R, Almasy L, Blangero J (1999) Smoking behavior is under the influence of a
major quantitative trait locus on human chromosome 5q. Genet Epidemiol 17 Suppl 1:
S139-144.
[105] Bergen AW, Korczak JF, Weissbecker KA, Goldstein AM (1999) A genome-wide search
for loci contributing to smoking and alcoholism. Genet Epidemiol 17 Suppl 1: S55-60.
[106] Li MD, Ma JZ, Cheng R, Dupont RT, Williams NJ, Crews KM, Payne TJ, Elston RC
(2003) A genome-wide scan to identify loci for smoking rate in the Framingham Heart
Study population. BMC Genet 4 Suppl 1: S103.
96

[107] Straub RE, Sullivan PF, Ma Y, Myakishev MV, Harris-Kerr C, Wormley B, Kadambi B,
Sadek H, Silverman MA, Webb BT, et al. (1999) Susceptibility genes for nicotine
dependence: a genome scan and followup in an independent sample suggest that regions
on chromosomes 2, 4, 10, 16, 17 and 18 merit further study. Mol Psychiatry 4: 129-144.
[108] Gelernter J, Liu X, Hesselbrock V, Page GP, Goddard A, Zhang H (2004) Results of a
genomewide linkage scan: support for chromosomes 9 and 11 loci increasing risk for
cigarette smoking. Am J Med Genet B Neuropsychiatr Genet 128B: 94-101.
[109] Saccone SF, Hinrichs AL, Saccone NL, Chase GA, Konvicka K, Madden PA, Breslau N,
Johnson EO, Hatsukami D, Pomerleau O, et al. (2007) Cholinergic nicotinic receptor
genes implicated in a nicotine dependence association study targeting 348 candidate
genes with 3713 SNPs. Hum Mol Genet 16: 36-49.
[110] Li MD (2006) The genetics of nicotine dependence. Curr Psychiatry Rep 8: 158-164.
[111] Bateson W (1909) Mendel's Principles of Heredity. Cambridge: Cambridge Press.
[112] Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH (2001)
Multifactor-dimensionality reduction reveals high-order interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet 69: 138-147.
[113] Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, Elston RC, Li MD (2007) A generalized
combinatorial approach for detecting gene-by-gene and gene-by-environment interactions
with application to nicotine dependence. Am J Hum Genet 80: 1125-1137.
[114] Madsen BE, Browning SR (2009) A groupwise association test for rare mutations using a
weighted sum statistic. PLoS Genet 5: e1000384.
[115] Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR (2010)
Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet
86: 832-838.
[116] Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, Hu T, de Bakker PI, Abecasis
GR, Almgren P, Andersen G, et al. (2008) Meta-analysis of genome-wide association
data and large-scale replication identifies additional susceptibility loci for type 2 diabetes.
Nat Genet 40: 638-645.
[117] Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI,
Ramos EM, Cardon LR, Chakravarti A, et al. (2009) Finding the missing heritability of
complex diseases. Nature 461: 747-753.
[118] Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M,
Evans CA, Holt RA, et al. (2001) The sequence of the human genome. Science 291:
1304-1351.
[119] International-Human-Genome-Sequencing-Consortium (2004) Finishing the euchromatic
sequence of the human genome. Nature 431: 931-945.
97

[120] Choy KW, Setlur SR, Lee C, Lau TK (2010) The impact of human copy number
variation on a new era of genetic testing. BJOG 117: 391-398.
[121] Diskin SJ, Hou C, Glessner JT, Attiyeh EF, Laudenslager M, Bosse K, Cole K, Mosse
YP, Wood A, Lynch JE, et al. (2009) Copy number variation at 1q21.1 associated with
neuroblastoma. Nature 459: 987-991.
[122] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C
(2004) Detection of large-scale variation in the human genome. Nat Genet 36: 949-951.
[123] Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H,
Walker M, Chi M, et al. (2004) Large-scale copy number polymorphism in the human
genome. Science 305: 525-528.
[124] Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH,
Carson AR, Chen W, et al. (2006) Global variation in copy number in the human genome.
Nature 444: 444-454.
[125] Perry GH, Ben-Dor A, Tsalenko A, Sampas N, Rodriguez-Revenga L, Tran CW,
Scheffer A, Steinfeld I, Tsang P, Yamada NA, et al. (2008) The fine-scale and complex
architecture of human copy-number variation. Am J Hum Genet 82: 685-695.
[126] McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH,
de Bakker PI, Maller JB, Kirby A, et al. (2008) Integrated detection and populationgenetic analysis of SNPs and copy number variation. Nat Genet 40: 1166-1174.
[127] Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, Haden K, Li J, Shaw
CA, Belmont J, et al. (2006) High-resolution genomic profiling of chromosomal
aberrations using Infinium whole-genome genotyping. Genome Res 16: 1136-1148.
[128] Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R,
Botstein D, Borresen-Dale AL, Brown PO (2002) Microarray analysis reveals a major
direct role of DNA copy number alteration in the transcriptional program of human breast
tumors. Proc Natl Acad Sci U S A 99: 12963-12968.
[129] Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E (2004) Analysis of array CGH
data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20: 3413-3422.
[130] Lai WR, Johnson MD, Kucherlapati R, Park PJ (2005) Comparative analysis of
algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics
21: 3763-3770.
[131] Venkatraman ES, Olshen AB (2007) A faster circular binary segmentation algorithm for
the analysis of array CGH data. Bioinformatics 23: 657-663.
[132] Olshen AB, Venkatraman ES, Lucito R, Wigler M (2004) Circular binary segmentation
for the analysis of array-based DNA copy number data. Biostatistics 5: 557-572.

98

[133] Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, Bassett AS, Seller A,
Holmes CC, Ragoussis J (2007) QuantiSNP: an Objective Bayes Hidden-Markov Model
to detect and accurately map copy number variation using SNP genotyping data. Nucleic
Acids Res 35: 2013-2025.
[134] Scharpf RB, Parmigiani G, Pevsner J, Ruczinski I (2008) Hidden Markov models for the
assessment of chromosomal alterations using high-throughput SNP arrays. Ann Appl Stat
2: 687-713.
[135] Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M (2007)
PennCNV: an integrated hidden Markov model designed for high-resolution copy
number variation detection in whole-genome SNP genotyping data. Genome Res 17:
1665-1674.
[136] Sun W, Wright FA, Tang Z, Nordgard SH, Van Loo P, Yu T, Kristensen VN, Perou CM
(2009) Integrated study of copy number states and genotype calls using high-density SNP
arrays. Nucleic Acids Res 37: 5365-5377.
[137] Frueh FW (2006) Impact of microarray data quality on genomic data submissions to the
FDA. Nat Biotechnol 24: 1105-1107.
[138] Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene
expression patterns with a complementary DNA microarray. Science 270: 467-470.
[139] Brown CK, Madauss K, Lian W, Beck MR, Tolbert WD, Rodgers DW (2001) Structure
of neurolysin reveals a deep channel that limits substrate access. Proc Natl Acad Sci U S
A 98: 3127-3132.
[140] Jain AN, Tokuyasu TA, Snijders AM, Segraves R, Albertson DG, Pinkel D (2002) Fully
automatic quantification of microarray image data. Genome Res 12: 325-332.
[141] Matsuzaki H, Loi H, Dong S, Tsai YY, Fang J, Law J, Di X, Liu WM, Yang G, Liu G, et
al. (2004) Parallel genotyping of over 10,000 SNPs using a one-primer assay on a highdensity oligonucleotide array. Genome Res 14: 414-425.
[142] Baum LE, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the
statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical
Statistics 41: 164-171.
[143] Viterbi AJ (1967) Error Bounds for Convolutional Codes and an Asymptotically
Optimum Decoding Algorithm. Ieee Transactions on Information Theory It13: 260-+.
[144] Azzato EM, Pharoah PD, Harrington P, Easton DF, Greenberg D, Caporaso NE, Chanock
SJ, Hoover RN, Thomas G, Hunter DJ, et al. (2010) A genome-wide association study of
prognosis in breast cancer. Cancer Epidemiol Biomarkers Prev 19: 1140-1143.
[145] Katoh M (2004) Characterization of human ARHGAP10 gene in silico. Int J Oncol 25:
1201-1206.
99

[146] Park JJ, Kang JK, Hong S, Ryu ES, Kim JI, Lee JH, Seo JS (2008) Genome-wide
combination profiling of copy number and methylation offers an approach for
deciphering misregulation and development in cancer cells. Gene 407: 139-147.
[147] Aarts M, Dannenberg H, deLeeuw RJ, van Nederveen FH, Verhofstad AA, Lenders JW,
Dinjens WN, Speel EJ, Lam WL, de Krijger RR (2006) Microarray-based CGH of
sporadic and syndrome-related pheochromocytomas using a 0.1-0.2 Mb bacterial
artificial chromosome array spanning chromosome arm 1p. Genes Chromosomes Cancer
45: 83-93.
[148] Shadeo A, Lam WL (2006) Comprehensive copy number profiles of breast cancer cell
model genomes. Breast Cancer Res 8: R9.
[149] Varma G, Varma R, Huang H, Pryshchepava A, Groth J, Fleming D, Nowak NJ,
McQuaid D, Conroy J, Mahoney M, et al. (2005) Array comparative genomic
hybridisation (aCGH) analysis of premenopausal breast cancers from a nuclear fallout
area and matched cases from Western New York. Br J Cancer 93: 699-708.
[150] Bello MJ, de Campos JM, Vaquero J, Kusak ME, Sarasa JL, Rey JA (2000) Highresolution analysis of chromosome arm 1p alterations in meningioma. Cancer Genet
Cytogenet 120: 30-36.
[151] Srivastava M, Bubendorf L, Srikantan V, Fossom L, Nolan L, Glasman M, Leighton X,
Fehrle W, Pittaluga S, Raffeld M, et al. (2001) ANX7, a candidate tumor suppressor gene
for prostate cancer. Proc Natl Acad Sci U S A 98: 4575-4580.
[152] Ohta T, Okamoto K, Isohashi F, Shibata K, Fukuda M, Yamaguchi S, Xiong Y (1998) Tloop deletion of CDC2 from breast cancer tissues eliminates binding to cyclin B1 and
cyclin-dependent kinase inhibitor p21. Cancer Res 58: 1095-1098.
[153] Dutrillaux B, Gerbault-Seureau M, Zafrani B (1990) Characterization of chromosomal
anomalies in human breast cancer. A comparison of 30 paradiploid cases with few
chromosome changes. Cancer Genet Cytogenet 49: 203-217.
[154] Sheng ZM, Marchetti A, Buttitta F, Champeme MH, Campani D, Bistocchi M, Lidereau
R, Callahan R (1996) Multiple regions of chromosome 6q affected by loss of
heterozygosity in primary human breast carcinomas. Br J Cancer 73: 144-147.
[155] Chappell SA, Walsh T, Walker RA, Shaw JA (1997) Loss of heterozygosity at
chromosome 6q in preinvasive and early invasive breast carcinomas. Br J Cancer 75:
1324-1329.
[156] Rodriguez C, Causse A, Ursule E, Theillet C (2000) At least five regions of imbalance on
6q in breast tumors, combining losses and gains. Genes Chromosomes Cancer 27: 76-84.
[157] Noviello C, Courjal F, Theillet C (1996) Loss of heterozygosity on the long arm of
chromosome 6 in breast cancer: possibly four regions of deletion. Clinical cancer

100

research : an official journal of the American Association for Cancer Research 2: 16011606.
[158] Hong C, Maunakea A, Jun P, Bollen AW, Hodgson JG, Goldenberg DD, Weiss WA,
Costello JF (2005) Shared epigenetic mechanisms in human and mouse gliomas
inactivate expression of the growth suppressor SLC5A8. Cancer Res 65: 3617-3623.
[159] Yamanaka S, Sunamura M, Furukawa T, Sun L, Lefter LP, Abe T, Yatsuoka T, Fujimura
H, Shibuya E, Kotobuki N, et al. (2004) Chromosome 12, frequently deleted in human
pancreatic cancer, may encode a tumor-suppressor gene that suppresses angiogenesis.
Lab Invest 84: 1339-1351.
[160] Gupta N, Martin PM, Prasad PD, Ganapathy V (2006) SLC5A8 (SMCT1)-mediated
transport of butyrate forms the basis for the tumor suppressive function of the transporter.
Life Sci 78: 2419-2425.
[161] Paroder V, Spencer SR, Paroder M, Arango D, Schwartz S, Jr., Mariadason JM,
Augenlicht LH, Eskandari S, Carrasco N (2006) Na(+)/monocarboxylate transport
(SMCT) protein expression correlates with survival in colon cancer: molecular
characterization of SMCT. Proc Natl Acad Sci U S A 103: 7270-7275.
[162] Lerebours F, Bertheau P, Bieche I, Driouch K, De The H, Hacene K, Espie M, Marty M,
Lidereau R (2002) Evidence of chromosome regions and gene involvement in
inflammatory breast cancer. Int J Cancer 102: 618-622.
[163] Bullrich F, Fujii H, Calin G, Mabuchi H, Negrini M, Pekarsky Y, Rassenti L, Alder H,
Reed JC, Keating MJ, et al. (2001) Characterization of the 13q14 tumor suppressor locus
in CLL: identification of ALT1, an alternative splice variant of the LEU2 gene. Cancer
Res 61: 6640-6648.
[164] Yin Z, Spitz MR, Babaian RJ, Strom SS, Troncoso P, Kagan J (1999) Limiting the
location of a putative human prostate cancer tumor suppressor gene at chromosome
13q14.3. Oncogene 18: 7576-7583.
[165] Frank B, Klaes R, Burwinkel B (2005) Familial cancer and ARLTS1. N Engl J Med 353:
313-314; author reply 313-314.
[166] Phelan MC, Rogers RC, Saul RA, Stapleton GA, Sweet K, McDermid H, Shaw SR,
Claytor J, Willis J, Kelly DP (2001) 22q13 deletion syndrome. Am J Med Genet 101: 9199.
[167] Manning MA, Cassidy SB, Clericuzio C, Cherry AM, Schwartz S, Hudgins L, Enns GM,
Hoyme HE (2004) Terminal 22q deletion syndrome: a newly recognized cause of speech
and language disability in the autism spectrum. Pediatrics 114: 451-457.
[168] Rippe V, Drieschner N, Meiboom M, Murua Escobar H, Bonk U, Belge G, Bullerdiek J
(2003) Identification of a gene rearranged by 2p21 aberrations in thyroid adenomas.
Oncogene 22: 6111-6114.
101

[169] Li J, Yang T, Wang L, Yan H, Zhang Y, Guo Y, Pan F, Zhang Z, Peng Y, Zhou Q, et al.
(2009) Whole genome distribution and ethnic differentiation of copy number variation in
Caucasian and Asian populations. PLoS One 4: e7958.
[170] Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA
(2009) Potential etiologic and functional implications of genome-wide association loci
for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362-9367.
[171] Donnelly P (2008) Progress and challenges in genome-wide association studies in
humans. Nature 456: 728-731.
[172] Itsara A, Cooper GM, Baker C, Girirajan S, Li J, Absher D, Krauss RM, Myers RM,
Ridker PM, Chasman DI, et al. (2009) Population analysis of large copy number variants
and hotspots of human genetic disease. Am J Hum Genet 84: 148-161.
[173] McCarroll SA, Huett A, Kuballa P, Chilewski SD, Landry A, Goyette P, Zody MC, Hall
JL, Brant SR, Cho JH, et al. (2008) Deletion polymorphism upstream of IRGM
associated with altered IRGM expression and Crohn's disease. Nat Genet.
[174] Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg
MS, Taylor KD, Barmada MM, et al. (2008) Genome-wide association defines more than
30 distinct susceptibility loci for Crohn's disease. Nat Genet 40: 955-962.
[175] Moore JH (2003) The ubiquitous nature of epistasis in determining susceptibility to
common human diseases. Hum Hered 56: 73-82.
[176] Nagel RL (2005) Epistasis and the genetics of human diseases. C R Biol 328: 606-615.
[177] Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S (2006) Powerful
multilocus tests of genetic association in the presence of gene-gene and geneenvironment interactions. Am J Hum Genet 79: 1002-1016.
[178] Lin M, Wu RL (2006) Detecting sequence-sequence interactions for complex diseases.
Current Genomics 7: 59-72.
[179] Zhang J, Liang F, Dassen WR, Veldman BA, Doevendans PA, De Gunst M (2003)
Search for haplotype interactions that influence susceptibility to type 1 diabetes, through
use of unphased genotype data. Am J Hum Genet 73: 1385-1401.
[180] Tzeng JY, Wang CH, Kao JT, Hsiao CK (2006) Regression-based association analysis
with clustered haplotypes through use of genotypes. Am J Hum Genet 78: 231-242.
[181] Li M, Romero R, Fu WJ, Cui Y (2010) Mapping haplotype-haplotype interactions with
adaptive LASSO. BMC Genet 11: 79.
[182] Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN (2005)
Nonparametric tests of association of multiple genes with human disease. Am J Hum
Genet 76: 780-793.
102

[183] Wei Z, Li M, Rebbeck T, Li H (2008) U-statistics-based tests for multiple genes in
genetic association studies. Ann Hum Genet 72: 821-833.
[184] Martin ER, Ritchie MD, Hahn L, Kang S, Moore JH (2006) A novel method to identify
gene-gene effects in nuclear families: the MDR-PDT. Genet Epidemiol 30: 111-123.
[185] Lou XY, Chen GB, Yan L, Ma JZ, Mangold JE, Zhu J, Elston RC, Li MD (2008) A
combinatorial approach to detecting gene-gene and gene-environment interactions in
family studies. Am J Hum Genet 83: 457-467.
[186] Wu Z, Zhao H (2009) Statistical power of model selection strategies for genome-wide
association studies. PLoS Genet 5: e1000582.
[187] Cordell HJ (2009) Detecting gene-gene interactions that underlie human diseases. Nat
Rev Genet 10: 392-404.
[188] Marchini J, Donnelly P, Cardon LR (2005) Genome-wide strategies for detecting
multiple loci that influence complex diseases. Nat Genet 37: 413-417.
[189] Storey JD, Akey JM, Kruglyak L (2005) Multiple locus linkage analysis of genomewide
expression in yeast. PLoS Biol 3: e267.
[190] Brem RB, Storey JD, Whittle J, Kruglyak L (2005) Genetic interactions between
polymorphisms that affect gene expression in yeast. Nature 436: 701-703.
[191] Lu Q, Elston RC (2008) Using the optimal receiver operating characteristic curve to
design a predictive genetic test, exemplified with type 2 diabetes. Am J Hum Genet 82:
641-651.
[192] Hoeffding W (1948) A Class of Statistics with Asymptotically Normal Distribution.
Annals of Mathematical Statistics 19: 293-325.
[193] Zeileis A, Kleiber C, Jackman S (2008) Regression models for count data in R. Journal of
Statistical Software 27: 1-25.
[194] Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method
for genome-wide association studies by imputation of genotypes. Nature Genetics 39:
906-913.
[195] Wu C, Zhang H, Liu X, Dewan A, Dubrow R, Ying Z, Yang Y, Hoh J (2009) Detecting
essential and removable interactions in genome-wide association studies. Stat Interface 2:
161-170.
[196] Wang X, Elston RC, Zhu X (2010) The Meaning of Interaction. Hum Hered 70: 269-277.
[197] Berrettini W, Yuan X, Tozzi F, Song K, Francks C, Chilcoat H, Waterworth D, Muglia P,
Mooser V (2008) Alpha-5/alpha-3 nicotinic receptor subunit alleles increase risk for
heavy smoking. Mol Psychiatry 13: 368-373.
103

[198] Schuckit MA, Danko GP, Smith TL, Bierut LJ, Bucholz KK, Edenberg HJ, Hesselbrock
V, Kramer J, Nurnberger JI, Jr., Trim R, et al. (2008) The prognostic implications of
DSM-IV abuse criteria in drinking adolescents. Drug Alcohol Depend 97: 94-104.
[199] Stevens VL, Bierut LJ, Talbot JT, Wang JC, Sun J, Hinrichs AL, Thun MJ, Goate A,
Calle EE (2008) Nicotinic receptor gene variants influence susceptibility to heavy
smoking. Cancer Epidemiol Biomarkers Prev 17: 3517-3525.
[200] Grucza RA, Wang JC, Stitzel JA, Hinrichs AL, Saccone SF, Saccone NL, Bucholz KK,
Cloninger CR, Neuman RJ, Budde JP, et al. (2008) A risk allele for nicotine dependence
in CHRNA5 is a protective allele for cocaine dependence. Biol Psychiatry 64: 922-929.
[201] Spitz MR, Amos CI, Dong Q, Lin J, Wu X (2008) The CHRNA5-A3 region on
chromosome 15q24-25.1 is a risk factor both for nicotine dependence and for lung cancer.
J Natl Cancer Inst 100: 1552-1556.
[202] Caporaso N, Gu F, Chatterjee N, Sheng-Chih J, Yu K, Yeager M, Chen C, Jacobs K,
Wheeler W, Landi MT, et al. (2009) Genome-wide and candidate gene association study
of cigarette smoking behaviors. PLoS One 4: e4653.
[203] Schlaepfer IR, Hoft NR, Collins AC, Corley RP, Hewitt JK, Hopfer CJ, Lessem JM,
McQueen MB, Rhee SH, Ehringer MA (2008) The CHRNA5/A3/B4 gene cluster
variability as an important determinant of early alcohol and tobacco initiation in young
adults. Biol Psychiatry 63: 1039-1046.
[204] Weiss RB, Baker TB, Cannon DS, von Niederhausern A, Dunn DM, Matsunami N,
Singh NA, Baird L, Coon H, McMahon WM, et al. (2008) A candidate gene approach
identifies the CHRNA5-A3-B4 region as a risk factor for age-dependent nicotine
addiction. PLoS Genet 4: e1000125.
[205] Li MD, Xu Q, Lou XY, Payne TJ, Niu T, Ma JZ (2010) Association and interaction
analysis of variants in CHRNA5/CHRNA3/CHRNB4 gene cluster with nicotine
dependence in African and European Americans. Am J Med Genet B Neuropsychiatr
Genet 153B: 745-756.
[206] Li MD, Yoon D, Lee JY, Han BG, Niu T, Payne TJ, Ma JZ, Park T (2010) Associations
of variants in CHRNA5/A3/B4 gene cluster with smoking behaviors in a Korean
population. PLoS One 5: e12183.
[207] Beuten J, Ma JZ, Payne TJ, Dupont RT, Lou XY, Crews KM, Elston RC, Li MD (2007)
Association of specific haplotypes of neurotrophic tyrosine kinase receptor 2 gene
(NTRK2) with vulnerability to nicotine dependence in African-Americans and EuropeanAmericans. Biol Psychiatry 61: 48-55.
[208] Li MD, Lou XY, Chen G, Ma JZ, Elston RC (2008) Gene-gene interactions among
CHRNA4, CHRNB2, BDNF, and NTRK2 in nicotine dependence. Biol Psychiatry 64:
951-957.
104

[209] Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases.
Science 273: 1516-1517.
[210] Morris AP, Zeggini E (2010) An evaluation of statistical approaches to rare variant
analysis in genetic association studies. Genet Epidemiol 34: 188-193.
[211] Han F, Pan W (2010) A data-adaptive sum test for disease association with multiple
common or rare variants. Hum Hered 70: 42-54.
[212] Li M, Ye C, Fu W, Elston RC, Lu Q (2010) Detecting gene-gene/gene-environment
interactions for quantitative traits with U-Statistics. International Genetic Epidemiology
Society (IGES).
[213] McCarroll SA, Altshuler DM (2007) Copy-number variation and association studies of
human disease. Nat Genet 39: S37-42.
[214] Khoury MJ, Wacholder S (2009) Invited commentary: from genome-wide association
studies to gene-environment-wide interaction studies--challenges and opportunities. Am J
Epidemiol 169: 227-230; discussion 234-225.
[215] Kent JW, Jr. (2009) Analysis of multiple phenotypes. Genet Epidemiol 33 Suppl 1: S3339.
[216] Ottman R (1990) An epidemiologic approach to gene-environment interaction. Genet
Epidemiol 7: 177-185.
[217] Curtis D (2007) Comparison of artificial neural network analysis with other multimarker
methods for detecting genetic association. BMC Genet 8: 49.
[218] Gayan J, Gonzalez-Perez A, Bermudo F, Saez ME, Royo JL, Quintas A, Galan JJ, Moron
FJ, Ramirez-Lorca R, Real LM, et al. (2008) A method for detecting epistasis in genomewide studies using case-control multi-locus association analysis. BMC Genomics 9: 360.

105