“THS

This is to certify that the
thesis entitled

STATISTICAL METHODS FOR IDENTIFYING
GENETIC ASSOCIATIONS

presented by

LAN TONG

has been accepted towards fulﬁllment
of the requirements for the

M. S. degree in Applied Statistics

 

 

"3

 

Major Professor’s Signature
I! .. 2 9 -— 0 S"

Date

 

MSU is an Afﬁrmative Action/Equal Opportunity Institution

LIBRARIES
MICHIGAN STATE UNIVERSITY
EAST LANSING, MICH 48824-1048

PLACE IN RETURN Box to remove this checkout from your record.
TO AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE

DATE DUE

DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2/06 c:/CIRC/Date0ue.indd-p.15

 

STATISTICAL METHODS FOR IDENTIFYING GENETIC ASSOCIATIONS
By

Lan Tong

A THESIS

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

MASTER OF SCIENCE
Department of Statistics

2005

ABSTRACT
STATISTICAL METHODS FOR IDENTIFYING GENETIC ASSOCIATIONS
By
Lan Tong

This thesis introduces three bio-statistical concepts: Hardy-Weinberg equilibrium,
linkage disequilibrium, and haplotype reconstruction and haplotype frequency estimation.
Three statistical methods have been discussed and utilized to test these concepts for a
population genetics study.

Hardy-Weinberg equilibrium is a basis for genetic inference. It is tested through
the exact test implemented by the Arlequin software package. Linkage disequilibrium is
an important tool for mapping disease genes. It is tested with a likelihood-ratio test,
whose key procedure is the Expectation-Maximum algorithm, implemented by the
Arlequin software package. Haplotype information is essential for mapping disease. The
haplotype frequencies are estimated through the Bayesian estimation method
implemented by the PHASE software package. '

The above concepts and tests have been applied to the Isle of Wight cohort study.
It has been found that all the loci of interest (hC V893205 6, hCV15862 743, hCV8932053,
and hCV8932052 on the 11.13 gene) are in Hardy-Weinberg equilibrium, and that all pair-
wise loci are in linkage disequilibrium. The haplotypes of the most informative SNP
pair, hC V893205 6 and hCV8932052, have been reconstructed; their frequencies are
estimated for eight phenotypes of interests. The contingency tests suggest that there is no
association between the haplotype patterns CA/CG/T A/T G and CA/TA/T G and allergic

asthma.

ACKNOWLEDGMENTS

I would like to express my appreciation to the several people who have been
generous with their encouragement and support for this thesis. Most of all, I am grateful
to Dr. Marianne Huebner, Department of Statistics, Michigan State University, and Dr.
Susan Ewart, Department of Large Animal Clinical Sciences, Michigan State University,
for providing me the opportunity of participating the Isle of Wight birth cohort study. Dr.
Huebner has mentored me throughout the project and the writing of the thesis. I have had
many helpful discussions with Dr. Ewart and Dr. Dennis Gilliland, Department of
Statistics, Michigan State University, both of whom provided me invaluable suggestions
for the improvement of my thesis. Dr. James Stapleton, Department of Statistics,
Michigan State University, has served on my committee and offered me constructive
advice about my thesis.

I also want to thank my parents for taking care of my newborn daughter so that I
was able to concentrate on my research and thesis writing. I am deeply indebted to my
husband whose love, support, and knowledge have made the completion of the project

and the thesis possible.

iii

TABLE OF CONTENTS

 

 

 

 

 

 

 

 

List of Tables - - - - - - - - - _ v
List of Figures- vi
Chapter 1 Introduction- - - - - - - - - - -- - 1
Section I Background ...................................................................................................... 1
Section 2 Genetic Concepts ............................................................................................. 4
Section 3 Case Study ..................................................................................................... 10
Chapter 2 Hardy-Weinberg Equilibrium _ -- - -- -- - -- - 12
Section 1 Introduction ................................................................................................... 12
Section 2 Exact Test of Hardy- Weinberg Equilibrium .................................................. 14
Section 3 Arlequin Implementation of Exact Test and Case Study ................................ 19
Chapter3 Linkage Disequilibrium ........... - - -- - _- - - -- - - 22
Section I Introduction ................................................................................................... 22
Genetic Linkage ......................................................................................................... 22
Allelic Association ..................................................................................................... 22
Maintenance of Allelic Associations: Linkage Disequilibrium ................................. 23
Section 2 Likelihood Ratio Test of Linkage Disequilibrium .......................................... 25
Computation of L0 ...................................................................................................... 25
Computation of L1 ...................................................................................................... 26
Likelihood-ratio test ................................................................................................... 30
Section 3 Arlequin Implementation of Likelihood Ratio Test and Case Study .............. 31
Chapter 4 Haplotype Frequency Estimation - - - - - -- -- 33
Section 1 Introduction ................................................................................................... 33
Section 2 Bayesian Estimation Method for Haplotype Frequency ................................ 35
Section 3 PHASE Implementation of Bayesian Estimation and Case Study ................. 39
Chapter 5 Conclusions . -- - - - - - ...... - - -- - -- 49
Appendices ....... . .................. -- .......... - - 50
Bibliography .................................................................................................................... 58

iv

LIST OF TABLES

Table 1 Sample of Data Set ....... -- - ll

 

Table 2 Sample of Data Set with Genotype Information on Four SN Ps 20

Table 3 Haplotype Frequency Estimation for SNPs hCV8932056 and hCV8932052

 

 

 

 

 

 

 

 

by Phenotypes 42
Table 4 Two-way Contingency Table of Asthma 1 or 2 vs. Control- -- 46
Table 5 Two-way Contingency Table of Asthma 4 vs. Control - -_ -46
Table 6 Two-way Contingency Table of CDAlO vs. Control - 46
Table 7 Two-way Contingency Table of Wheeze 1 or 2 vs. Control - -- - -- 46
Table 8 Two-way Contingency Table of Wheeze 4 vs. Control - ........ 46
Table 9 Two-way Contingency Table of Wheeze 10 vs. Control _- ...... 46
Table 10 Two-way Contingency Table of Chronic vs. Control - ....... 46
Table 11 Summary of Contingency Tests (Haplotype CA/CG/TA/T G) .................... 48

Table 12 Summary of Contingency Tests (Haplotype CA/TA/TG) _ 48

 

LIST OF FIGURES

 

 

 

Figure 1 Levels of Information Transfer -- - 4
Figure 2 DNA _- - - - 5
Figure 3 Allele ........ - - - - - - 7
Figure 4 SNP - - - - - - 8

 

Figure 5 Haplotype Frequency Estimation for SNPs hCV8932056 and hCV8932052

by Phenotypes - - - _ - - 43

 

Figure 6 Relationship between Haplotypes and Phenotypes 45

 

vi

CHAPTER 1 INTRODUCTION
Section 1 Background

The purpose of this thesis is to discuss three bio-statistical topics: Hardy-
Weinberg equilibrium, linkage disequilibrium, and haplotype reconstruction and
haplotype frequency estimation, and their applications in population genetics.

The Hardy-Weinberg law plays a very important role in the ﬁeld of population
genetics and often serves as a basis for genetic inference (Crow, 1988). This law says
that in a large random-mating population with no selection, mutation, or migration, the
allele frequencies and the genotype frequencies are constant from generation to
generation and that, furthermore, if the alleles A 1, A2, Am have relative frequencies f1,
f2, fm respectively, then the relative frequency of homozygous genotypes such as AA is
gt,- =ﬁ2, and the relative frequency of a heterozygous genotype such as A A ( i ¢ j ) is g.)-
= 2f”? Because of its importance, a lot of effort has been made to test if a population
exhibits Hardy-Weinberg equilibrium. This paper discusses, in detail, one of the
statistical methods, exact test, used to test Hardy-Weinberg equilibrium.

Linkage disequilibrium is an important tool for mapping disease genes. It
describes the nonrandom association of alleles at linked loci. When only genotype and
not haplotype frequencies are available, linkage disequilibrium between a pair of loci is
tested with a likelihood-ratio test. This paper discusses the likelihood ratio test, with
highlight on its key procedure, the Expectation-Maximum algorithm which resolves
double heterozygous genotypes into haplotypes when not assuming linkage equilibrium.

Haplotype information is essential for mapping disease. When individuals are

homozygous at every locus, haplotypes can be easily determined; however, when

individuals are heterozygous at more than one locus, haplotypes cannot be deducted from
genotype information. Statistical approaches can be used to reconstruct haplotypes and
estimate the relative frequencies of all possible haplotypes. This paper introduces the
Bayesian estimation method of reconstructing haplotypes and estimating haplotype
frequencies.

All the above three concepts have been applied to the Isle of Wight birth cohort
study. This study is committed to identify the genetic susceptibility loci responsible for
asthma and allergy by studying a group of children born on the Isle of Wight, United
Kingdom. The Hardy-Weinberg equilibrium and linkage disequilibrium for the four
SNPs: hC V893205 6, hCVI5862 743, hCV8932053, and hCV8932052 on the 1L13 gene
have been tested using the Arlequin software package. Haplotype reconstruction and
frequency estimation for SNPs hC V893205 6 and hCV8932052 have been performed
using the PHASE software package.

Both the Arlequin and PHASE software packages are designed for population
genetics data analysis. Arlequin implements exact test and likelihood-ratio test to
examine the Hardy-Weinberg equilibrium and linkage disequilibrium, respectively. It is
available for free at http://lgbunige.ch/arlequin. PHASE implements the Bayesian
algorithm to approximate the posterior distribution of haplotype conﬁgurations. It is also
free and can be downloaded at http://www.stat.washington.edu/stephens/soﬁwarehtml.

Chapter 1 of this thesis introduces the background, the basic genetic concepts
involved in the study, and the data of Isle of Wight birth cohort. Chapter 2 describes
Hardy-Weinberg equilibrium, its signiﬁcance, exact test of Hardy-Weinberg equilibrium,

and testing the Isle of Wight birth cohort for Hardy-Weinberg equilibrium using

Arlequin. Chapter 3 discusses linkage disequilibrium, its signiﬁcance, likelihood ratio
test of linkage disequilibrium, and testing the Isle of Wight birth cohort for linkage
disequilibrium using Arlequin. Chapter 4 explains the motif of haplotype reconstruction
and frequency estimation, Bayesian estimation method, and estimating the Isle of Wight
birth cohort for haplotype frequencies using PHASE. Finally, in Chapter 5, the test

results in the previous chapters are summarized.

Section 2 Genetic Concepts

To better understand genetic research and the Isle of Wight birth cohort study, it

is helpful to explore some basic genetic deﬁnitions and concepts.

 

  
         
 
  

I... ‘
lmnlm
"Ilium!

WI

 

lllllllllgh

“l ll‘llilgi
6. Population

Figure 1 Levels of Information Transferl

Genetics is the study of traits passed on from parent to child and variation of those

traits within and between individuals. It is about the transfer of information among many

different levels (Figure 1). The foundation level is the molecule called DNA (1). The

information in DNA is organized into genes (2). Genes, in turn, make up chromosomes

(3), which when taken all together form an organism's genome (4). Every nucleus in an

individual (5) contains the genome.

Instructions that provide almost all of the information necessary for a living
organism to grow and function are in the nucleus of every cell. The instructions are in
the form of a molecule called deoxyribonucleic acid, or DNA (Figure 2).

In humans, the DNA molecule consists of two ribbon—like strands that wrap
around each other, resembling a twisted ladder. The rungs of the ladder are nucleotide
base pairs. The bases are called adenine (“ ”), cytosine (“C"), guanine (“G”) and

thymine (“I“). These bases always pair up as A+T and C +G.

i“vii

. “WWW ".1:Iit:“t.‘ii H
‘ Ill ,ii III‘lullllillllllmlmlll‘ll -”‘., ‘ v ' " "ill
it'll‘llhulliiﬁim lpﬂirii’lillhﬂmlllL W * T

l v ‘I i

‘ 'l'nlirrttuilll [ll lilini. u.
' ‘ |vii' WWI“ I ll
. M l M

 

Figure 2 DNA:

 

' This ﬁgure was originally created by GlaxoSmithKline. It is available in the online article ‘Genetics at

GlaxoSmithKline’ at http://genetics.gsk.com/link.htm, 2004.
2 This ﬁgure was originally created by GlaxoSmithKline. It is available in the online article ‘Genetics at

GlaxoSmithKline’ at http://genetics.gsk.com/link.htm, 2004.

Before DNA was discovered, gene was deﬁned as a discrete unit which is
inherited from parent to offspring and which exerts control on a single character. After
DNA was discovered, gene was redeﬁned as a segment of DNA that codes for a protein
subunit. The word gene may also be used to refer to a functional DNA segment or a class
of functional DNA segments that have the same position, structure and function (Sham,
1998)

DNA is contained in tightly coiled packets called chromosomes. Chromosomes
consist of the double helix of DNA wrapped around proteins. Each human cell nucleus
contains 23 pairs of chromosomes.

A locus is a speciﬁc position in the genome (the complete set of genes). The DNA of
most people hare highly similar. The presence of different DNA sequences at the same
locus in a population is known as a polymorphism. The alternative DNA sequences at a
locus on a chromosome pair are known as alleles (Figure 3). People can have two
identical or two different alleles for a particular gene. A person who has two identical
alleles for a gene is said to be homozygous for that gene. A person with two different
alleles is said to be heterozygous. The pair of alleles a person has at a speciﬁc location
in the genome is called genotype. Genotype affects phenotype, which is the observable
effect of the allele, such as eye color. A combination of alleles is termed haplotype.
Haplotypic phase refers to whether a gametic (a gamete is a sperm or an egg that fuse

during reproduction) haplotype changes during recombination (GlaxoSmithKline, 2004).

   

W’ U" l
"WI gull“; I“ {I
copy I of chromosome A
copy 2 of chromosome A

 
 

Genotypes

PP
Homozygous for
the dominant allele

aa
Homozygous for
the recessive allele

Bb
Heterozygous
Recessive allele

- Dominant allele

Figure 3 Allele’

Many diseases are related in some way to genes. Many common diseases result
from a change in one or a few susceptibility genes.

To ﬁnd a gene that is involved in a speciﬁc disease, scientists must search for
DNA changes that are present more often in people who have a particular disease
compared to people who do not have the disease (GlaxoSmithKline, 2004).

A contemporary kind of genetic map, called a high-density single nucleotide
polymorphism (“SNP”) (Figure 4) map, has the potential to promote this research. SNPs

are single-base differences in the DNA sequence that can be observed between different

 

3 This ﬁgure was originally created by GlaxoSmithKline. It is available in the online article ‘Genetics at
GlaxoSmithKline’ at http://genetics.gsk.com/link.htm, 2004.

individuals in the population. For example, a SNP might change the DNA sequence
AAGGC T AA to AT GGC T AA. SNPS are the simplest and most common forms of DNA
polymorphism. They are present throughout the human genome. Groups of neighboring
SNPS may have alleles that show distinctive patterns of linkage disequilibrium and as
such may create a haplotypic diversity that can be exploited in both genetic linkage and
direct association studies. The simple structure of SNPS also allows rapid and efﬁcient
genotyping. In addition, SNPS are also evolutionarily stable - not changing much from
generation to generation - making them easier to follow in population studies. These

features of SNPS in the genome make them particularly valuable as genetic markers
(Schork et a1, 2000).

Using SNPS 'rbi' Lbiaiing
Disease Genes .,

2. SNP proﬁle of people
without the disease

“-1-?" Him __ ..
“llllTll‘ Ill IgiLIII-
_, mi .

1. SNP proﬁle of people with
the disease

[[llll l’i““ii """Ilijilini
ll 3- ,[Jl ". Ii
4. Compare the two proﬁles
Till I ll. . ill]lffllllllfjll f
f .IllL— ' "ll: fflll :fllllllfjﬂ Ill 'L
3. Differences in the proﬁles

identify locations that may
contain susceptibility genes

3121*”: «I?

M

Figure 4 SNP“

 

4 This ﬁgure was originally created by GlaxoSmithKline. It is available in the online article ‘Genetics at
GlaxoSmithKline’ at http://genetics.gsk.com/link.htm, 2004.

Using the information that SNPS provide, it may be possible to predict people’s
genetic risk of developing a certain disease, to diagnose a disease more accurately, or to

predict how people most likely will respond to a medicine.

Section 3 Case Study

The Isle of Wight birth cohort study represents an unselected whole population
birth cohort based on the Isle of Wight, United Kingdom. The Isle of Wight is a small
island (13 x 23 miles) just off the South coast of England with a resident population Of
133,000.

The ethnic background of the island residents is mainly Caucasian. While the Isle
of Wight population is not genetically homogeneous, it is stable to the extent that the
majority of children in the cohort has not moved away and has thus been available for
follow up.

Enrollment took place at birth. Of the 1,536 children born on the Isle of Wight
between January 1, 1989 and February 28, 1990, informed consent was obtained from the
parents of 1,456 children. These children have Since been seen at the ages of l (n = 1,167;
80.2%), 2 (n = 1,174; 80.6%), 4 (n = 1,218; 83.7%) and 10-years (n = 1,373; 94.3%)
(Kurukulaaratchy et al, 2003).

At birth, information on family history of allergy, household pets, parental
smoking, socioeconomic status and birth weight were recorded. At every follow-up (1, 2,
4 and IO-years), detailed questionnaires were completed with the parents for each child
regarding asthma and allergy prevalence. For each child, the following phenotype
information has been recorded: asthma at 1 or 2 years, asthma 4 years, currently
diagnosed asthma (CDA) at 10 years, wheezing at 1 or 2 years, wheezing at 4 years,
wheezing at 10 years, chronic asthma, and no symptoms. With high cohort retention, this
prospectively followed population provides a uniquely characterized resource for

ongoing studies (Kurukulaaratchy et al, 2003).

10

A segment of the data ﬁle is presented in Table 1.

Table 1 Sample of Data Set

asthma asthma
32056 1 15862743 1 32053 1 32052 1 1

CC CC GG

CC CC GG

 

Among the total of 1,536 children, 625 children have complete information about
SNPS hCV8932056-10, hC V 1 5 862 743-1 0, hCV8932053-10, and hCV8932052-10. There
are two indices under each phenotype with “1” indicating positive result (case) and “0”
negative result (control). The population that has complete information about the four
SNPS, including cases and controls, is used to test Hardy-Weinberg equilibrium and
linkage disequilibrium. The population that has complete information about the most
informative SNP pair hCV8932056-10 and hCV8932052-10, but including cases only, is

estimated for its haplotype frequency on the basis of phenotypes.

ll

CHAPTER 2 HARDY-WEINBERG EQUILIBRIUM
Section 1 Introduction

In a large population, in the absence of natural selection, mutation, or migration,
when the mating type frequencies arise from random mating, the ratios of the different
genotypes follow a mathematical result established independently by the English
mathematician Hardy and the German physician Weinberg. This phenomenon is named
“Hardy-Weinberg Equilibrium”. We need to examine all polymorphisms for Hardy-
Weinberg equilibrium in order to produce valid and signiﬁcant results in allelic
association studies.

Consider a biallelic locus with alleles A 1 and A 2. Let the relative frequencies of
the three genotypes A [A ,, AA; and A 2A 2 in a large population be g”, 2g”, and g2; such
that g” + 2g]; + g2; = 1. Hardy’s result was that, if individuals in the population mated
with each other at random, these relative frequencies would be such that

8122 = 811822.

Moreover, if these offspring are mated at random, then the relative frequencies of
the genotypes will remain unchanged after a second generation. It follows that these
relative frequencies will continue to be in the Hardy-Weinberg ratio as long as mating in
the population are random with respect to the locus.

The Hardy-Weinberg equilibrium enables us to relate genotype frequencies to
allele frequencies. If the relative frequencies of alleles A, and A 2 are f; and f2,
respectively, then under normal conditions, the relative frequencies of gametes with
alleles A 1 and A 2 will be f, and f2. According to the Hardy-Weinberg law, under random

mating, the relative frequencies of genotype A ,A 1, A [A2 and A 2A 2 are g,,=f,2, g12=2ftf2,

12

and g2; =f22, regardless of the genotype frequencies in the parental generation. Since the
allele frequencies corresponding to these genotype frequencies remain unchanged at f,
and f2, the same genotype frequencies will be maintained in subsequent generations as
long as random mating applies in the population.

The same principle can be applied to a locus with more than two alleles. Let the
alleles be A}, A 2, A,,,, with relative frequencies 1”,, f2, fm. Under Hardy—Weinberg
equilibrium, the relative frequency of a homozygous genotypes such as AA is g,,- = ﬂ)
and the relative frequency of a heterozygous genotype such as A ,A,- (i ¢ j ) is gy- = Zﬁf]
(Sham, 1998).

Hardy-Weinberg equilibrium has many important applications. The

demonstration of the ratio provides strong evidence for a genetic basis for a trait.

13

Section 2 Exact Test of Hardy-Weinberg Equilibrium

Due to the importance of the Hardy-Weinberg law in the development of
population genetics, testing of the null hypothesis that a population exhibits Hardy-
Weinberg equilibrium has drawn a lot of attention during the past decades.

The methods proposed to test Hardy-Weinberg equilibrium can be categorized
into two groups. One consists of large-sample goodness-Of-ﬁt tests that lean heavily on
asymptotic results. However, it has been recognized that such tests can sometimes lead
to false rejection or acceptance of Hardy-Weinberg equilibrium when the sample sizes
are small and/or some cell frequencies are small or zero. The other approach involves the
exact test, which is preferred when the sample size is small and/or some cell frequencies
are small or zero (Guo et al, 1992). The exact test is named so because it does not rely on
approximations. The Arlequin software package that we used to test Hardy-Weinberg
equilibrium for the Isle of Wight birth cohort, implements the exact test. The exact test
for Hardy-Weinberg equilibrium for multiple alleles is discussed below.

Consider an autosomal locus that has m alleles A 1, A2, Am. A sample of size n

is sampled from a population of interest. Let Cij (1 S j S i S m) denote the observed

count of genotype A ,-A j. Then the data can be presented as the contingency table

 

 

 

 

 

 

 

 

 

 

Al Cu

A2 02/ 622

Am Cm] Cm2 ... Cmm
A] A2 Am

14

Use c = (c11,021,c22,...,cmm) to designate this table. Let c,- denote the number
ofA, alleles 1n the sample. Then cl- = Cii +Zj=1cij (where cij =cj,- 1f j>l).

The table of the count of genotype AjAj is a random variable. Let C denote this
random variable. Then under random mating and the constraint that the number of allele
A,- remains unchanged from generation to generation, the distribution of C satisﬁes
multivariate hypergeometric distribution. Thus, the probability of obtaining the sample c
is:

m
n!Hc,-! 291'
Pr(C = c) = my” ,
j>i

where 26,] is the number of heterozygous individuals.
j >i

The exact test for the Hardy-Weinberg equilibrium given observed sample c has

to evaluate

P: ZPr(c'),

c'eT
where T = {c'z Pr(c') S Pr(c),and{c;-} = {0,}, wherec; =# alleleAi} (Guo et al, 1992).

In order words, the P—value of the test is the sum of the probabilities of the tables
that have a probability smaller than or equal to the observed contingency table c and have
the same allele counts as does c. Rejection or acceptance of the null hypothesis depends
on whether P is smaller than a pre-speciﬁed signiﬁcance level or. A large P-value means

that the probability of obtaining a sample as extreme or more extreme than the actually

observed is large, thus it suggests Hardy-Weinberg equilibrium (Schneider, 2004).

15

Since there may be innumerable contingency tables having identical marginal
counts, simply enumerating such tables is unrealistic. The Arlequin software modiﬁes
the Metropolis algorithm (Guo et a1, 1992) to construct a Markov chain of contingency
tables that have the same allelic counts (the same marginal counts) as the observed table,
and a limiting distribution matching Pr(C). This approach starts with the observed
contingency table. In order to create a new contingency table from an existing one, we
randomly select two distinct rows i1, i2 and two distinct columns j 1, j 2. Neither the two
rows nor the two columns have to be next to each other. The new table is obtained by
decreasing the counts of the cells (i 1. j I) (i 2, j 2) and increasing the counts of the cells (i I,
j 2) (i 2, j 2) by one unit. This leaves the alleles counts {c,-} unchanged. For example, by
decreasing the counts of the cells (C31, C43) and increasing the counts of the cells (C3 3, C41)
by one unit in the original sample table c, we obtain a candidate for the new table. Note
that the marginal counts of rows c3 and c4 and of columns c, and C3 remain the same as

those in table c.

 

A1 0/1

 

A2 6'21 022

 

A3 6'31“] C32 033'”

 

A4 641+] C42 643-1 C44

 

 

Am Cm] CmZ cm3 Cm4 - - - Cmm

 

 

 

 

 

 

 

 

A! A2 A3 A4 Am

Whether to accept this candidate table or not depends on a probability R equal to:

16

_ Pr(c(k+l)) _ ci1j1ci21'2 (1 +6111} )(1 +512}? )

 

 

 

 

 

1.R — vifi1¢110fi2¢jz
Pr(c(k)) (Ciljz + 1)(Ci2j1 +1) (1+ 6i1j2 )(1+ 6i2j1 )
P “(+1) C' . c. . 4
2, R: ‘(C k l: ‘1“ ‘212 —,ifi1 =j1 and i2=j2
Pr(c( )) (Ciljz +1)(Ci2j1 +2) 1
P (k't'l) c. . (c. . —1)
3. R: 1‘(C k )2 1111 1212 1,1fl1 = j2 and 12 =j1
Pr(c( )) (Ciljz + 1)(ci2j1 + 1) 4

where k is the created table number, and 6,-1- = 0 if i at j, 1 if i = j. R is the ratio of the
probabilities of the two tables.

The switch to the new table is accepted if R is larger than 1. The resulted Markov
chain has a limiting probability distribution that is the same as the distribution of the

m
nlnci! Z Cij

i=ll 2j>l

(2n)!l—Icij!

j>i

contingency tables, Pr(C = c) =

In practice, the Markov chain starting from the observed contingency table is
biased. In order for the chain to be unbiased, it is ideal to start the chain from a random
table chosen from the distribution Pr(C). One solution is to start the chain from the
observed table and run for a long time so that the initial table is “forgotten”. This process
is referred to as “dememorization”.

After the dememorization, the Markov chain is constructed for a long time, which
results in a large number of selected tables. The limiting probability of the Markov chain
in a particular table can be interpreted as the long-run proportion of time that the chain

stays in that table. Therefore, following the computation of P—value in the exact test, the

17

P-value of the test is the proportion of the selected tables that have a probability smaller

than or equal to the observed contingency table.

18

Section 3 Arlequin Implementation of Exact Test and Case Study

The Arlequin software package is employed to test Hardy-Weinberg equilibrium
for the Isle of Wight birth cohort Study. The goal of Arlequin is to provide the users in
population genetics with a large set of methods and statistical tests, in order to extract
information on genetic and demographic features of a collection of population samples
(Schneider, 2004). The software is available on line at http://1@.unige.ch/arlequin.

Arlequin input ﬁle with extension .arp contains both “Proﬁle”, descriptions of the
properties of the data, as well as “Data”, the raw data themselves. The Proﬁle section
speciﬁes the title of the project, the number of samples present in the project, the type of
data to be analyzed, if the project deals with haplotypic or genotypic data, if the gametic
phase of genotypes is known and so on. The sample of our project is deﬁned as DNA
multi-locus data with unknown gametic phase. The Data section includes a name of the
sample, the size of sample, and the data itself.

A SAS program (Appendix A) has been written to obtain the information required
by Data section for our project. The children with complete genotype information at
SNPS hCV8932056-10, hCV15862743-10, hCV8932053-10, and hCV8932052-10 are ﬁrst
selected. There are a total of 625 such children (Table 2). Next, the individual genotypes
for the four loci are output on two separate lines according to Arlequin’s instruction. For
instance, sample #1532 has genotype pattern TT// C C//C T // AG. A combination of the
ﬁrst allele at each locus, namely T C CA, forms the ﬁrst line for this individual; likely,

T C TG forms the second line. It should be noted that the sequence of the loci must be

kept in the order of their physical positions.

19

Table 2 Sample of Data Set with Genotype Information on Four SN Ps

 

 

 

 

 

 

 

asthma asthma no

ID C8932056_10C15862743_10C8932053_jOC8932052_10 1,2 4 chronic symptom
2 CC CC CC GG 1 1 0 1
3 CC CC CC GG 1 1 0 0
1530 CC CC CC GG 1 1 0 1
1532 "IT CC CT AG 2 1 O 0

 

 

 

 

 

 

 

 

 

 

The input ﬁle (with partial dataset) of our project is displayed as follows:

[Proﬁle]

Title="Isle of Wight Asthma HWE test"
NbSamples= 1

DataType= DNA
GenotypicData= 1
LocusSeparator= WHITESPACE
GameticPhase= O
RecessiveData= 0
RecessiveAllele= null
MissingData= '?'

Frequency= ABS
CompDistMatrix= 0
FrequencyThreshold= 1.0e-5
EpsilonValue= 1.0e-7

[Data]

[[SamPICSl]

SampleName="Isle of Wight Asthma HWE test"
SampleSize=625
SampleData= {

21 CCCG
CCCG
15321 TCCA
TCTG

Once the project ﬁle is built, “Calculation Settings” of Arlequin interface is
decided. In our project, Hardy-Weinberg equilibrium test with 100,000 steps in Markov
chain, 1,000 dememorisation steps, and a signiﬁcance level of 5% has been set. The

number of steps in Markov chain sets the maximum number of alternative tables to

20

 

explore. A ﬁgure of 100,000 is in order. The number of dememorization steps sets the
number of steps to perform before beginning to compare the alternative table
probabilities to that of the observed table. 1,000 steps are necessary to reach a random
starting point corresponding to a table independent from the observed table.

The output of the test is presented below:

Hardy-Weinberg equilibrium: (Isle of Wight Asthma HWE test)

 

Exact test using a Markov chain (for all Loci):
Forecasted chain length :100000
Dememorization steps 21000
Locus #Genot Obs.Heter. Exp.Heter. P. value s.d. Steps done
1 625 0.30880 0.58841 1.00000 0.00000 100172
625 0.16640 0.49639 0.80856 0.00122 100172

2
3 625 0.30720 0.58422 0.79268 0.00123 100172
4 625 0.28480 0.58663 0.08930 0.00084 100172

As shown above, the P-values of the test at all loci are larger than 0.05,
suggesting acceptance of null hypothesis; namely, all the loci are in Hardy-Weinberg
equilibrium. Once Hardy-Weinberg equilibrium is established, we can proceed to test

whether two loci are in linkage disequilibrium.

21

CHAPTER 3 LINKAGE DISEQUILIBRIUM
Section 1 Introduction
Genetic Linkage

The aim of genetic linkage analysis is to infer the relative positions of two or
more loci by examining the patterns of allele-transmission from parent to offspring, or the
patterns of allele-sharing by relatives.

The genotype of an individual at two loci is formed by the haplotypes of two
gametes, each inherited from one parent. A gamete may contain two alleles ﬁ‘om the
same parental gamete or one allele from each parental gamete. In the ﬁrst case, the
haplotype of the gamete is the same as the haplotype of one of the parental gametes.

Such gametes are deﬁned as non-recombinants. In the second case, however, the
haplotype of the gamete constitutes a new combination of alleles different from either
parental haplotype. Such gametes are deﬁned as recombinants. The recombination
fraction, usually denoted as 6 , between the two loci on the same chromosome is deﬁned
as the probability that a gamete is recombinant. Two loci with a recombination fraction
of less than 1/2 are said to be in linkage. The closer the two loci, the smaller the
recombination fraction is, and the more tightly linked are the two loci (Sham, 1998).
Allelic Association

As a result of linkage, some combinations of alleles, i.e. haplotypes, on short
chromosomal segments may be preserved over a large number of generations and become
quite frequent in the population. The excessive co-occurrence of certain combinations of

alleles in the same gamete because of tight linkage is known as allelic association.

22

Consider two loci A and B, with alleles A1,A2,...,Am,and Bl,Bz,...,Bn. By

deﬁnition, if the occurrence of allele A,- and allele B j in a haplotype are independent
events, then the relative frequency of the haplotype AiBJ- is equal to the product of the
allele frequencies of A,- and Bj, i.e. hij- = f,-(A)fJ-(B). If hij ¢ fi(A)fj(B), then the
occurrences of A,- and B j are not independent, and the two alleles are said to be

associated (Sham, 1998).
Maintenance of A llelic Associations: Linkage Disequilibrium

In a large, closed, randomly mating population, let the relative frequency of the

haplotype AiBj in the current generation be hl-J-O. In the next generation, if the
haplotype is a recombinant, then the probability that it is AiBJ- is
Pr(AiBj) = fi(A)fj(B);
If the haplotype is a non-recombinant, then the probability that it is A,- B j is simply
Pr(Al-Bj) = hijo.
The total probability that a haplotype transmitted to the next generation is AiB j is

therefore
hm = Pr(Aij) = 0%(A)f,-(B)+ (1— 9)hzj‘0-
The change in haplotype frequency from generation 0 to generation 1 is
hijl ' hijo = 9(fi(A)fj(B) " hyo).
When fi(A)f]-(B) = hijo for all i,j at the two loci, i.e. ifthere is no allelic

association, there will be no change in haplotype frequencies from generation to

23

generation, and we say that the two loci are in linkage equilibrium. Otherwise the two
loci are said to be in linkage disequilibrium.

For most human populations and for most regions of the genome, substantial
linkage disequilibrium is only likely to occur between loci with a recombination fraction
of less than 1%. This is the rationale behind the use of association analysis as an

important tool for mapping susceptibility loci (Sham, 1998).

24

Section 2 Likelihood Ratio Test of Linkage Disequilibrium

For genotypic data where the haplotypic phase is unknown, linkage
disequilibrium between a pair of loci is tested for genotypic data using a likelihood-ratio
test. The likelihood ratio is between the likelihood of the data assuming linkage
equilibrium (denoted L0) and the likelihood of the data not assuming linkage equilibrium
(denoted L 1)- L0 is computed by using the fact that, under the hypothesis of linkage
equilibrium, the haplotype frequencies are obtained as the product of the allele
frequencies. L, is obtained by applying the Expectation-Maximum algorithm to estimate
haplotype frequencies. Both L0 and L I assume Hardy-Weinberg equilibrium (random
mating). The ratio of L0 and L 1 suggests the degree of deviation from linkage equilibrium

(Schneider et al, 2000).

Suppose that two loci A and B, with distinctive alleles A1,A2,...,Am and
BI , 82 ,...,B,, , have been genotyped in a random sample of the population. Each

individual has a genotype of the form A,- A j // BkBl. There are m(m + 1)/ 2 possible

genotypes at locus A , and n(n + 1) / 2 possible genotypes at locus B , so that the total
number of joint genotypes is (m(m + 1) / 2)(n(n + l) / 2) . The aim is to test the null

hypothesis of linkage equilibrium between A and B.
Computation of L0
Under the assumption Of linkage equilibrium, the relative frequency of a
haplotype is equal to the products of the allele frequencies:
hik=ﬁ(A)ﬁc(B)
htl=ﬁ(A)fl(B)
hjk=ﬁ(A)ﬁ(B)

25

hﬂ=ﬁ(A)fz(B).
where the allele frequencies can be obtained by simple counting on the basis of given
genotype information.
Under the assumption of random mating, the genotype frequencies can be

expressed as follows:
g... = 11.12.. = 12.2 aim/nor
g,“ = 11,11, + huh”, = 211,12, = 2f,-(A)2f,(B)ﬁ,(B)
g,“ = 12,72, + 111,12, = 2h,,h,, = 2ﬁ(A)ﬁ(A)f,,(B)2
g,“ = h,,hj, + h j,h,, + huh], + 121,11, = 4 f,(A) fj(A) f, (B) f, (B) .

The observed genotypic counts CH”. c”12 , c follow a multinomial

III/"II"

distribution with parameters cm (the sample size) and the genotypic frequencies 81111,

glnz, gmmnn. The log-likelihood of observing cijkl is therefore

ln L0 = Z cijkl ln(g,-J-k1),
', j l,...,m

l l n
where the summation is taken over all (m(m +1)/2)(n(n + 1) / 2) possible genotypes.
The maximum likelihood estimate of a population genotype frequency is simply the
sample genotype frequency (Sham, 1998).
Computation of L 1

When not assuming linkage equilibrium, in the case that an individual is

heterozygous for both loci, the haplotypes cannot be deduced from the genotype. For

example, the genotype A,- A j // Bk B, can be made up of the haplotype pairs Ain /A j B,

or A j Bk / AiBl. In order to estimate the haplotype frequencies, we use the criterion of

26

maximizing the likelihood of observing genotype data, L ,. From the discussions in the

preceding subsection, we know

1nL1: 2 can 1n(gijk1),
',j l,...,m
l l ,n

where Cijkl are the observed genotypic counts, gijkl are the estimated genotype

frequencies, which are calculated from the estimated haplotype frequencies under the
assumption of random mating as follows:
giikk = hikhik = hik 2
g iikl = hik hi1 + hilhik = 2hilhik
gijkk = hikhjk + hjkhik = Zhikhjk
8in = hikhjl + hjlhik + hilhjk + hjkhil = 2(hikhjl + hjkhil)-
In the following, we use an iterative method of counting based on the
Expectation-Maximum algorithm to obtain the maximum likelihood estimation of
haplotype frequencies.

The EM algorithm is a numerical method of ﬁnding maximum likelihood

estimates for parameters given incomplete data. It begins by setting the initial haplotype

frequencies as haw , u = l,...,m,v = l,...,n. It is reasonable to set these initial haplotype

frequencies as the products of the allele frequencies, just as the haplotype frequencies

under the assumption of Hardy-Weinberg equilibrium and linkage equilibrium.

The count of genotype A, A J- // Bk B, can be considered as the sum of 2
unobserved counts: Cijkl = Cik,jl +c,,,jk , where Cijkl is the count of genotype

A, A j // Bk 8,, Cik.jl is the number of individuals with haplotype pair A,Bk / A j B, , and

27

Cil,jk is the number of individuals with haplotype pair AiB, / A jBk . For any
heterozygous genotype AiA j // Bk 8,, the initial expected values of the unobserved

counts of haplotype pairs are calculated as follows:

Under Hardy-Weinberg equilibrium, gik,jl = hik h jl and gm], = h,,hjk , where
gm], and g i,‘ jk are the relative frequencies of the genotypes obtained from the
haplotype pairs Ain / A 1B, and AB, / A jBk respectively. Therefore, the fraction of
genotype A, A j // Bk B, that is obtained by the haplotype pair Ain / A 1B, is

hik,0 th,0

 

, and the count ofhaplotype pair A-Bk / A -B, is
hik,0h 21,0 + hil,0hjk,0 I J

Cijkl (hik,0hjl,0)
hik,0hjl,0 + hil,0hjk,0

 

Cik,jl,0 =
Similarly, the count of haplotype pair AiB, / A jBk is

Cijkl (hi1,0hjk,0)

 

Cil,jk,0 = -
hik,0h 11,0 +hil,0hjk,0

This step of computing the expected counts of haplotype pairs is called expectation step.
Using the initial expected values as real data, we can compute the number of

heterozygous haplotypes and get a set of revised haplotype counts as follows:

6qu = Zcpqsno ’

s=l,...,m
t=l,...,n

where p = 1,...,m,q = 1,...,n . The revised haplotype frequencies are, therefore,

28

cuv,l

h 1 =———,
W, Zcuvd
u=l,...,,m
v=l,...,n

where u =1,...,m,v = l,...,n.

This step of reevaluating the haplotype frequencies from the relative probabilities of the
possible haplotype pairs is called maximization step. The maximum likelihood estimate
of population haplotype frequency is simply the sample haplotype ﬁ'equency.

In the next iteration, this set of revised haplotype frequencies are used to obtain a

set of revised expected values of the unobserved counts, which are denoted as Cik, 1,1 and

 

 

Cll,jk,l , CIC.
C cijkl (hik,l h 21,1 )
ik,jl,l :
hik,l hjl,1 + hil,1hjk,l
C ijkl (hil,l ’1 jk,1 )
Cil, jk,1 =

hik,l hjl,1 + hil,1hjk,1

The cycle of revising the haplotype frequencies, revising the expected values of
the unobserved counts, and counting the haplotypes is repeated until the changes in
haplotype frequencies from one iteration to the next become negligible, i.e. convergence
is reached. These are then the local maximum likelihood estimates of the haplotype
frequencies (Sham, 1998).

While the EM algorithm guarantees convergence, it is not guaranteed to converge
to the global maximum when there are multiple local maximums. To increase the chance
of obtaining the global maximum, it is best to try numerous initial values for the

haplotype frequencies (Long et al, 1995). In Arlequin, initial values of 100 and more are

29

in order (Schneider et a1, 2004). The set of haplotype frequencies with the highest local
maximum likelihood is then used as the ﬁnal estimation.

After the haplotype frequencies are estimated, the genotype frequencies and lnL,
are computed in the same way as in the computation ofL,,.

Likelihood-ratio test

In L0 and In LI have (m — l) + (n — l) and mu —1 estimated parameters,
respectively. The likelihood ratio statistic given by
S = —2(lnLO — 1n L1)
has an asymptotical Chi-square distribution with (mn — 1) — ((m - 1) + (n - 1)) =
(m — l)(n - 1) degrees of freedom. A statistically signiﬁcant P -value suggests rejection

of null hypothesis.

In the case of small samples with large number of alleles per locus, the Chi-square
distribution does not apply to the likelihood ratio distribution. In order to better
approximate the underlying distribution of the likelihood ratio statistic, we perform a
randomization test. Such test is non-parametric, not based on asymptotic approximation,
and applicable to context with few data sets. The procedure is as follows:

1. Permute the alleles between individuals at one locus only.

2. Re-estimate the likelihood of the data L, by the EM algorithm. L0 is unaffected by
the permutation procedure.

3. Repeat steps 1-2 a large number of times to get the null distribution of L ,, and
therefore the null distribution of S .

The P -value is calculated by

30

Z"(Si)

”(809101)
2 "(SH ,

where S, is the log likelihood ratio of the observed data (Schneider, 2004).

P — value =

Section 3 Arlequin Implementation of Likelihood Ratio Test and Case Study

The Arlequin software package is used to test linkage disequilibrium for the Isle
of Wight birth cohort. Exactly the same project ﬁle is prepared as the test of Hardy-
Weinberg equilibrium. Under “Calculation Settings”, pair-wise linkage disequilibrium
test with 16,000 permutations, 100 initial conditions, and a signiﬁcance level of 5% has
been set. Number of permutations sets the number of random permuted samples to
generate. 16,000 permutations guarantee to have less than 1% difference with the exact
probability in 99% of the cases. Number of initial conditions sets the number of random
initial conditions from which the EM is started to repeatedly estimate the sample
likelihood. The haplotype frequencies globally maximizing the sample likelihood will be
eventually kept. In our project, 100 initial conditions are used.

The output of the test is presented below:

Pairwise linkage disequilibrium: (Isle of Wight Asthma LD test)

 

Permutation test using the EM algorithm
Number of permutations: 16000
Number of initial conditions for EM: 100

Pair(0, 1)
LnLHood LD : -788.23438 LnLHood LE : -790.29395
Exact P= 0.04306 +- 0.00149 (16002 permutations done)
Chi-square test value= 4.11914 (P = 0.04240, 1 d.f.)
Pair(0, 2)
LnLHood LD : -839.46460 LnLHood LE : -942.81079
Exact P= 0.00000 +- 0.00000 (16002 permutations done)
Chi-square test value=206.69238 (P = 0.00000, 1 d.f.)

31

Pair(1, 2)
LnLHood LD : -761.75385 LnLHood LE : -780.76373
Exact P= 0.00000 +- 0.00000 (16002 permutations done)
Chi-square test value=38.01978 (P = 0.00000, 1 d.f.)
Pair(O, 3)
LnLHood LD : -861.22888 LnLHood LE : -958.38812
Exact P= 0.00000 +- 0.00000 (16002 permutations done)
Chi-square test value=194.31848 (P = 0.00000, 1 d.f.)
Pair(1, 3)
LnLHood LD : -777.89398 LnLHood LE : -796.34106
Exact P= 0.00000 +- 0.00000 (16002 permutations done)
Chi-square test value=36.89417 (P = 0.00000, 1 d.f.)

Pair(2, 3)
LnLHood LD : -564.50177 LnLHood LE : -948.85785
Exact P: 0.00000 +- 0.00000 (16002 permutations done)
Chi-square test value=768.71216 (P = 0.00000, 1 d.f.)

Table of signiﬁcant linkage disequilibrium (signiﬁcance level=0.0500):

Locus #IOI 1| 2| 3|
0|*+++
1|+*+

2|++*

3|+++

*++

Loci 0, 1, 2, and 3 refer to SNPs hCV8932056-10, hCV15862743-10,
hCV8932053-10, and hC V8932052-1 0 respectively.

As shown above, the P -values of all the pair-wise tests are less than 0.05,
suggesting rejection of null hypothesis; namely, all the pairs of loci are in linkage

disequilibrium.

32

CHAPTER 4 HAPLOTYPE FREQUENCY ESTIMATION
Section 1 Introduction

Haplotype information is an essential ingredient in many analyses of ﬁne-scale
molecular genetics data. For example, haplotype analysis is an important tool for linkage
disequilibrium assessment, disease-gene discovery, genetic demography, and
chromosomal-evolution studies. However, many haplotype analysis methods rely on
phase information from the individuals under study. As mentioned in Section 2 of
Chapter 3, for autosomal loci, when only the multi-locus genotypes for each individual
are provided, the phase information for those individuals with multiple heterozygous
phenotypes is inherently ambiguous. Phase can be established by genotyping family
members of each study subject to infer parental chromosomes, but this requires
recruitment and genotyping of relatives, which is expensive and may not be realistic.
Laboratory techniques have also been used. to determine haplotypes, but these approaches
are technologically demanding and often cost-prohibitive. Alternatively, statistical
methods can be used to infer phase at linked loci from genotypes and to estimate
frequency of all possible haplotypes (Fallin et al, 2000).

The problem of unknown phase and one of its possible solutions can be explained
in the example of Clark’s algorithm. In Clark’s algorithm, when a homozygote is found,
a haplotype is unambiguously identiﬁed. When a single-locus heterozygote is found, a
possible haplotype pair is inferred. For each of the remaining multi-locus heterozygotes,
we need to determine whether it can produce a haplotype that has been established. If it
can, identify the complementary haplotype by using the established haplotype as one of

the actual haplotypes that it implies. Continue this process until the phase information

33

for all individuals is either resolved or identiﬁed as unresolved. This algorithm is
intuitively appealing and effective in resolving haplotypes when the dataset contains a
sufﬁcient number of homozygous individuals. It also performs well for relatively small
sample sizes. However, three problems can arise with this procedure. It may not be
possible to start the iterative algorithm if there is no unambiguous or single-locus
heterozygous individuals in the sample. There may be unresolved haplotypes left at the
end. In addition, haplotypes may be erroneously inferred if a crossover product of two
actual haplotypes is identical to another true haplotype (Clark, 1990).

Compared with Clark’s algorithm, the Bayesian method is more accurate in
inferring haplotype information and can handle more loci. For the Isle of Wight birth
cohort, we estimate the haplotype frequencies using the PHASE software package that
implements the Bayesian algorithm. The Bayesian algorithm will be introduced in the

following section.

34

Section 2 Bayesian Estimation Method for Haplotype Frequency

Suppose there is a sample of n diploid individuals from a population. Let

G = (G1,...,G,,) denote the (known) genotypes for the individuals, and let
H = (H 1 ,...,H n) denote the (unknown) corresponding haplotype pairs.

The Bayesian algorithm regards the unknown haplotypes as unobserved random
quantities and aims to evaluate their conditional distribution given the genotype data.
Gibbs sampling, a type of Bayesian approach, is used to obtain an approximate sample
from the posterior distribution of H given G, Pr(H | G) . Inforrnally, the algorithm starts
with an initial guess 11“” for H, repeatedly chooses an individual at random, and estimates
that individual’s haplotypes under the assumption that all the other haplotypes are
correctly reconstructed. Repeating this process enough times results in an approximate
sample from Pr(H | G) . Formally, this method involves constructing a Markov chain
14“” , H”) , Hm, ..., with stationary distribution Pr(H | G), on the space of possible
haplotype reconstructions, using an algorithm of the following form.

Start with some initial haplotype reconstruction 11“”. Fort = 0, 1, 2, obtain
H ("1’ from H m using the following three steps:

1. Choose an individual, i, uniformly and at random from all ambiguous individuals

(i.e., individuals who are heterozygous at more than one loci).

2. Sample Hy“) from Pr(H, |G,H£ti)), where H_-

l

is the set of haplotypes excluding

individual i. The conditional probability Pr(H i IG, H S?) depends on the genetic
and demographic models.

3. Set H§t+1)=H§-t) forj =1, n,j :r-‘i.

35

It has been proved that this process produces a Markov chain (Gilks et al, 1996).
To help illustrate Gibbs sampling, let us consider a simple example. Suppose that
a random sample of two individuals from a population have been genotyped for two loci

A and B. We have G = {G1,G2} , where G, ={A1A2 //Ble} and

G2 = {A3A4 // B3 B4}. The corresponding haplotype pair H, is either {A181 “1232} or
{A182 / A231 } . Similarly, the corresponding haplotype pair H 2 is either

{A383 /A4B4} or {A384 /A4B3}. The purpose is to ﬁnd the distribution of H1 and H2

conditional on G , i.e.

Pr(Hl ={A1B1/A282}|G)and Pr(Hl ={A132/A231HG)

and

Pr(H2 = {A383 /A4B4} | G) and Pr(H2 = {A384 /A4B3} I G).
From a certain genetic model we know the conditional probabilities Pr(H1 | G, H _1) and
Pr(H2 I G, H_2), in this case, i.e. Pr(HI |G,H2)and Pr(H2 IG,H1).

Starting with an initial guess H (O) for H , say
0 0 0
H‘ ’=(H,‘ ’,H§ ))=({AiBz/A281}. {A383/A4B4I),
we obtain H (I) from 11(0) as follows:

1. Choose an individual uniformly and at random from G . Say we get individual #1,

i.e. 1111(0) .

36

2. Since the genetic model gives Pr(H1 IG, H gm) and we have

H :30) = {A3 B3 / A484 } , thus we know the distribution of H1 conditional on H 50) ,
say
Pr(Hl = {A132 /AzBl } I 0,115") = {A333 /A4B4}) = 0.2
and
Pr(Hl .—. {A181 /A2132} | G, 1,5,0) = {A383 /A4B4}) = 0.8.
Randomly drawing H 1(1) on the basis of this distribution, we obtain, say,
H1“) = A181 /AZBZ.
3. Set Hg) = H50).
So far we obtain a new sample Ha) =(H1(1), Hg”) =({A131 /AZBZ}, {A383 /A4B4 }).
Next, by repeating the above steps, we obtain H (2) from H (1) :

1. Choose an individual uniformly and at random from G . Say we get individual #2 this

time.

2. Since the genetic model gives Pr(H2 | G, H11), we know the distribution of H 2

conditioned on H 1(1) , say

Pr(H2 .-. {A384 /A4B3} I 0,111“) = {/11132 /AZB, }) = 0.4
and

Pr(I-I2 = {A383 /A4B4} l G,H1(1) = {.11132 /A;,191 }) = 0.6.

37

Again, we randomly draw H $2) on the basis of this distribution, say we Obtain
2
Hg ) = A384 /A4B3.
3. Set 111(2) = H1“).

Now we Obtain another sample HQ) =(r11‘2),Ir§2))=({A,B2 /AZB, }, {A384 /A4B3 }).
Let this process continue for a large number of times, say 10,000 times. We
obtain a sample of 10,000 H 's , from which the distribution of H l and H 2 conditional

on G, Pr(H, | G) and Pr(H2 | G) can be calculated.

As shown by Stephens et al., the Bayesian method has three major advantages
over Clark’s method and the EM algorithm: increased accuracy, wider applicability (for
instance, it can handle a large number of loci), and the facility to assess accurately the

uncertainty associated with each phase call (Stephens et al, 2001).

38

Section 3 PHASE Implementation of Bayesian Estimation and Case Study

The PHASE software package is used to estimate the haplotype frequencies for
the children of Isle of Wight. PHASE implements the Bayesian statistical method for
reconstructing haplotypes and estimating haplotype frequencies from genotype data, and
for estimating recombination rates and identifying recombination hotspots (Stephens et
a1, 2003). The program is available on line at http://www.stat.washington.edu/stephens/
softwarehtml.

We want to choose the SNP pairs that are in linkage equilibrium to conduct
haplotype analysis. Although all of the SNP pairs are in linkage disequilibrium, since the
combination of SNPS hC V893205 6 and hCV8932052 is found to be most informative by
the preliminary study, we choose these two SNPS to perform haplotype frequency
estimation on the basis of eight phenotypes: asthma at 1 or 2 years, asthma at 4 years,
currently diagnosed asthma at 10 years, wheezing at 1 or 2 years, wheeze at 4 years,
wheeze at 10 years, chronic asthma, and no symptom at all ages.

PHASE input ﬁle speciﬁes how many individuals there are to be analyzed, how
many loci each individual has been typed at, what physical positions the loci are, what
sort of loci these are (SNP or microsatellite), and the ID and the genotypes for each
individual.

A SAS program (Appendix B) has been written to obtain the information required
by PHASE input ﬁle. The children who have complete information on SNPs
hC V893205 6 and hCV8932052 and are positive (i.e. “2” if the phenotype has indices “1”
and “2”, and “1 ” if the phenotype has indices “0” and “1”) within each phenotype are

ﬁrst selected. There are a total of 164 such children in the group “asthma at 1 or 2

39

years”, 99 in “asthma at 4 years”, 92 in “currently diagnosed asthma at 10 years”, 48 in
“wheeze at 1 or 2 years”, 145 in “wheeze at 4 years”, 228 in “wheeze at 10 years”, 34 in
“chronic asthma”, and 308 in “no symptoms at all age”. Next, the individual genotypes
for the two loci are output on two separate lines according to PHASE’S instruction. In
fact, it is the same way as Arlequin. It should also be noted that the sequence of the loci
must be kept in the order of their physical positions.

As an example, the input ﬁle (with partial raw data) for the phenotype group

“asthma at 1 or 2 years” of our project is displayed as follows:

164
2

P 128122939 128126575
SS

8

CG
CG
16
CA
TG
1510
CG
CG
1524
CG
TG

This presentation says that there are 164 children (who have asthma at 1 or 2
years) typed at two loci (hC V893205 6 and hCV8932052) whose relative positions along
the chromosome are 128122939 and 1281265 75, and which are bi-allelic. The genotype
information then follows, with the ID and two lines for each child. For example, the
individual #8 has the genotype pattern CC//GG.

PHASE produces a number of output ﬁles: a summary report and additional

40

reports whose sufﬁxes indicate the contents of the ﬁle. “_freqs” estimates the sample
haplotype frequencies; “_pairs” lists the most likely pairs of haplotypes for each

6

individual, together with their probability; ‘ recom” contains estimates of recombination

parameters across the region; and “_monitor” measures the goodness of ﬁt of the
estimated haplotypes to the underlying model. Among them the output ﬁle “_freq”
directly pertains to the research purpose of the Isle of Wight Birth Cohort Study. The
output ﬁles “_freq” of the eight groups of children are consolidated and presented in
Table 3.

Figure 5 produced by a Matlab program (Appendix C) visualizes the haplotype
frequency estimation in Table 3. In the graph, a12 stands for the phenotype of “asthma at
1 or 2 years”, a4 for “asthma at 4 years”, cdalO for “currently diagnosed asthma at 10
years”, w12 for “wheeze at 1 or 2 years”, w4 for “wheeze at 4 years”, w10 for “wheeze
at 10 years”, chronic for “chronic asthma”, and control for “control”.

It can be clearly seen that haplotype CG is found in all of the eight phenotypes with high
probabilities ranging from 0.663015 to 0.790994. The other haplotypes are also found in
all the other phenotypes, but with very low probabilities.

In order to view better the relationship between each haplotype and phenotype,

graphs of the relative frequencies of each haplotype vs. each phenotype are shown in

Figure 6, which is produced by Matlab (Appendix C).

41

Table 3 Haplotype Frequency Estimation for SNPs hCV8932056 and hCV8932052 by Phenotypes

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Phenotypes Haplotype E (freq) SE
Asthma at l or 2 years CA 0.080212 0.004531
(n=164) CG 0.706373 0.004531
TA 0.124056 0.004531
TG 0.089359 0.004531
Asthma at 4 years CA 0.070573 0.005364
(n=99) CG 0.742559 0.005364
TA 0.126397 0.005364
TG 0.060472 0.005364
Currently diagnosed asthma at 10 years CA 0.040528 0.003686
(n=92) CG 0.790994 0.003686
TA 0.117081 0.003686
TG 0.051398 0.003686
Wheeze at 1 or 2 years CA 0.060431 0.009539
(n=48) CG 0.679152 0.009539
TA 0.158319 0.009539
TG 0.102098 0.009539
Wheeze at 4 years CA 0.074331 0.004659
(n=145) CG 0.763600 0.004659
TA 0.104979 0.004659
TG 0.057090 0.004659
Wheeze at 10 years CA 0.066354 0.003643
(n=228) CG 0.753822 0.003643
TA 0.117857 0.003643
TG 0.061968 0.003643
Chronic asthma CA 0.116397 0.014663
(n=34) CG 0.663015 0.014663
TA 0.148309 0.014663
TG 0.072279 0.014663
Control CA 0.071836 0.003412
(n=308) CG 0.7301 12 0.003412
TA 0.124593 0.003412
TG 0.073459 0.003412

 

 

 

 

42

 

Distribution

 

 

 

 

 

 

 

0.9 — I I a
"—1

0.8 _ a
_ : I

0.7 _ l -

0.6 ~

Frequency
0
U!
I

04~ —

0.3 '- ’I
CA TG TG

02”TG ‘ CA CA “
TA TA I TA

01— _
CG I CG CG

 

 

 

   

 

 

 

 

 

a1 2 a4 w1 2 w4 M 0 chronic control

Figure 5 Haplotype Frequency Estimation for SNPs hCV8932056 and hCV8932052 by Phenotypes
As shown in Figure 6, haplotype CG is prominently related with the phenotype of
“currently diagnosed asthma at 10 years” with relative frequency of 0.791. It has the
least relationship with the phenotype of “chronic asthma” with relative frequency of
0.663. Haplotype TA is found in “wheeze at l or 2 years” with the highest relative
frequency of 0.158 and in “wheeze at 4 years” with the lowest relative frequency of
0.106. Haplotype CA is found in “chronic asthma” with the highest relative frequency of
0.116 and in “currently diagnosed asthma at 10 years” with the lowest of relative
frequency of 0.04. Haplotype T G is found in “wheeze at 1 or 2 years” with the highest
relative frequency of 0.102 and in “currently diagnosed asthma at 10 years” with the

lowest relative frequency of 0.05.

43

0.14

0.12

0.1

Frequency
.0
O
Cb

0.06

0.04

0.02

0.82

0.8

0.78

0.76

Frequency
.0
N
b

.0
N
N

0.7

0.68

0.66

0.64

CA

 

 

H—l

 

l

 

1 L
cda 1 0 W1 2 W4

CG

I 1
W10 chronic control

 

 

1 I

 

cda10 w12

44

1
MO chronic control

 

 

TA

 

 

 

 

 

 

0.17 I T I I I I l I
T
0.16— ’ ~
1t
0.15~ __ a
It
>0.‘I4* _.
U
C
O
8 .l-
8
L‘-o.13~ a

, , -

0.11 ~ -

 

 

 

01 l l l 1 l I 1
a12 a4 cda10 w12 W4 w10 chronic control

TG

0.12 I T I I If I I l

 

0.11- a

0.09 e 2

Frequency
0
O
a)
I
J

. l
0.07 - ..

0.06 — l j, -
0.05 e E _

I 1 L 1 1 T 1 1
812 84 cdaio w12 w4 w10 chronic control

 

 

 

0.04

 

Figure 6 Relationship between Haplotypes and Phenotypes

45

In order to examine whether there exist any statistically signiﬁcant associations
between the haplotypes and the phenotypes, we conducted a contingency test. The
contingency tables of the haplotype counts of each infected group vs. the haplotype

counts of the control group are ﬁrst formed, which are shown in Table 4-Table 10.

Table 4 Two-way Contingency Table of Asthma 1 or 2 vs. Control

 

 

 

 

 

 

 

 

 

CA CG TA TG
A12 13 1 16 20 15
control 22 225 38 23

 

Table 5 Two-way Contingency Table of Asthma 4 vs. Control

 

 

 

CA CG TA TG
a4 7 74 13 6
control 22 225 38 23

 

 

 

 

 

 

 

Table 6 Two-way Contingency Table of CDA10 vs. Control

 

 

 

CA CG TA TG
CDA10 4 73 1 1 5
control 22 225 38 23

 

 

 

 

 

 

 

Table 7 Two-way Contingency Table of Wheeze 1 or 2 vs. Control

 

 

 

 

 

 

 

 

 

CA CG TA TG
w12 3 33 8 5
control 22 225 38 23

 

Table 8 Two-way Contingency Table of Wheeze 4 vs. Control

 

 

 

 

 

 

 

 

 

CA CG TA TG
W4 1 1 1 1 l 15 8
control 22 225 38 23

 

Table 9 Two-way Contingency Table of Wheeze 10 vs. Control

 

 

 

 

 

 

 

 

 

CA CG TA TG
w10 15 172 27 14
control 22 225 38 23

 

Table 10 Two-way Contingency Table of Chronic vs. Control

 

 

 

 

 

 

 

 

 

CA CG TA TG
chronic 4 23 5 2
control 22 225 38 23

 

46

A SAS program has been written to perform the contingency tests. The SAS code
has performed the contingency test between the haplotype CA/CG/T A/T G of asthma at 1
or 2 and control is shown in Appendix D. Below is the test report for the group of asthma

at 1 or 2 years vs. the control group:

Contingency Test between Asthma at 1 or 2 and Control (Haplotypes: CA/CG/T A/T G)

Statistics for Table of group by haplotype

 

Statistic DF Value Prob
Chi-Square 3 0.5449 0.9089
Likelihood Ratio Chi-Square 3 0.5371 0.9107
Mantel-Haenszel Chi-Square 1 0.1329 0.7155
Phi Coefﬁcient 0.0340
Contingency Coefﬁcient 0.0340
Cramer's V 0.0340

Fisher's Exact Test

Table Probability (P) 0.0018
Pr <= P 0.8957

Sample Size = 472

Although SAS gives the P- values of several kinds of Chi-square tests, it should
be noted that since some of the cells have expected counts less than 5, Chi—square may
not be a valid test. Instead, Fisher’s exact test’s result is used when deciding if there is
any association between the haplotypes and the phenotypes. As shown above, the P-
value of 0.8957 suggests that the probabilities for the group of asthma at 1 or 2 and the
control group are independent of the haplotypes CA/C G/TA/T G.

Table 11 summarizes the seven contingency tests. As we can see, the P-values
range from 0.1599 to 0.9914, suggesting that haplotypes CA/C G/TA/T G are not

associated with asthma and wheeze symptoms.

47

Table 11 Summary of Contingency Tests (Haplotype CA/CG/TA/TG)

Fisher's Exact Test

 

 

 

 

 

 

 

 

 

Comparison Groups P-value Sample Size
a12 vs. control 0.8947 472
a4 vs. control 0.9914 408
CDA10 vs. control 0.7222 401
w12 vs. control 0.6792 357
w4 vs. control 0.8143 453
w10 vs. control 0.9200 536
chronic vs. control 0.6813 342

 

 

 

 

Since CG is the most predominant haplotype, it is interesting to know if the less
common haplotypes affect asthma and allergy. Thus, similar contingency tests are
performed to examine the association between each infected group and the control group
for the haplotypes CA/T A/T G only. The test results shown in Table 12 suggest that

haplotypes CA/T A/T G are not associated with asthma and wheeze symptoms either.

Table 12 Summary of Contingency Tests (Haplotype CA/TA/TG)

Fisher's Exact Test

 

 

 

 

 

 

 

 

 

 

Cormarison Groups P-value Sample Size
a12 vs. control 0.8953 131
a4 vs. control 0.9574 109
CDA10 vs. control 0.8092 103
w12 vs. control 0.8814 99
W4 vs. control 0.8109 117
w10 vs. control 0.9744 139
chronic vs. control 0.7817 94

 

 

 

 

48

CHAPTER 5 CONCLUSIONS

This thesis has discussed three bio-statistical concepts: Hardy-Weinberg
equilibrium, linkage disequilibrium, and haplotype reconstruction and haplotype
frequency estimation, and their applications in the Isle of Wight cohort study which aims
to identify genetic susceptibility loci for allergic asthma.

Hardy-Weinberg equilibrium has been tested through the exact test implemented
by the Arlequin software package. It is found that all the loci of interest (h C V893205 6,
hC V1 5 862 743, hCV8932053, and hCV8932052 on the IL13 gene) are in Hardy-Weinberg
equilibrium. This test result provides a valid assumption for the test of linkage
disequilibrium.

Linkage disequilibrium between pair-wise loci is tested with a likelihood-ratio
test, whose key procedure is the Expectation-Maximum algorithm, implemented by the
Arlequin software package. It is found that all the pair-wise loci are in linkage
disequilibrium.

The PHASE software package, which implements the Bayesian estimation
method, is used to reconstruct haplotypes and estimate haplotype frequencies of the most
informative SNP pair, hC V893205 6 and hCV893205. The subsequent contingency test
suggests that there is no association between the haplotype patterns CA/CG/T A/T G and

CA/T A/T G and allergic asthma.

49

APPENDICES

50

APPENDIX A
SAS CODE TO CREATE INPUT FILES FOR ARLEQUIN

data thesisl .hweldl ;

set thesis.asthmagroups;

ct_dash=0;

if C8932056_lO="-" then ct_dash=ct_dash+1;

if C8932056_10="-" then C8932056_10="??";

if C 1 5862 743_1 0="-" then ct_dash=ct__dash+1;

if C15862743_10="-" then C15862743_10="??";

if C8932053_10="-" then ct_dash=ct_dash+1;

if C8932053_10="-" then C8932053_10="??";

if C8932052_10="-" then ct_dash=ct_dash+1;

if C8932052_10="-" then C8932052_10="??";

if ct_dash>=l then delete;

freq=1;

if
CDA10="0"|CDA10="1"|asthm312="1"lasthmal2="2"|asthma4="1"|asthma4="2"|wheez
e12="1"|wheezel2="2"|wheeze4="1 "|wheeze4="2"|wheeze10="1"|wheeze10="2"|nosym
ptoms="0"|nosymptoms="1" then output;

run;

data thesis.hweld2;

set thesis.hweldl;

c11=substr(C8932056_10,1,1);
c21=substr(C15862743_10,1,1);
c31=substr(C8932053_10,1,1);
c41=substr(C8932052__10,1,1);
c12=substr(C8932056_10,2,1);

c22=substr(C] 5862743__10,2,1);
c32=substr(C8932053_10,2,1);
c42=substr(C8932052_10,2,1);

SNP1=trim(c11) || trim (c21) || trim (031) || trim(c41);
SNP2=trim(c12) || trim (c22) || trim (C32) || trim(c42);
run;

data thesis.hweld3;

set thesis.hweld2 nobs=nobs;

ﬁlename fn 'C:\ hweld.txt';

ﬁle fn;
put@20IDfreq@30cll"c21"C31”c41/@30012"c22"c32"c42;
run;

51

APPENDIX B

SAS CODE TO CREATE INPUT FILES FOR PHASE

data thesi52.a12_1;

set thesisZasthmagroups;

ct_dash=0;

if C8932056_10="-" then ct_dash=ct_dash+1;
if C8932056_10="-" then C8932056_10="??";
if C8932052_10="-" then ct_dash=ct__dash+1;
if C8932052_10="-" then C8932052_10="??";
if ct_dash>=1 then delete;

if asthma12="2" then output;

run;

data thesis2.a12_2;

set thesis2.a12_1;

cl 1=substr(C8932056_10,1,1);
c21=substr(C8932052_10,1,1);
c12=substr(C8932056_10,2,1);
c22=substr(C8932052_10,2,1);
SNP1=trim(c11) || trim(c21);
SNP2=trim(c12) || trim(c22);
run;

data thesi52.a12_3;

set thesi52.a12_2 nobs=nobs;

ﬁlename fn 'C:\snp56_52_al2_2_no_mis.txt';
ﬁle fn;

put ID / SNPl /SNP2;

run;

52

APPENDIX C

MATLAB CODE TO CREATE FIGURES OF HAPLOTYPE FREQUENCY

% For each haplotype combination, plot freq vs. subject group

clear all;
close all;

freq_ﬁi_label= {
'ast_grp_al 2_no__mis.out__freqs', 'a12';
'ast_grp_a4_no_mis.out_freqs', 'a4';
‘ast_grp_CDA 1 0_no_mis.out_freqs', 'CDA10';
'ast_grp_wl 0_no_mis.out_freqs', 'WI 0';
'ast_grp_w 1 2_no_mis.out__freqs', 'w12';
'ast_grp_w4_no_mis.out_freqs', 'w4';
'ast_grp_chron_no_mis.out_freqs', 'chron';
'ast_grp_control_no_mis.out__freqs', 'control'

};

[no_fn tmp]=size(freq_fn_label);

for fn_ct=1 : no_ﬁ1
%check the number of lines in the ﬁle
ﬁd=fopen(char(freq_fn_label(fn_ct,1)));
data_ct=0;
while ~feof(ﬁd)
data_line=fgetl(ﬁd);
data_ct=data_ct+1 ;
end
fclose(ﬁd);
%Do not count the ﬁrst head line
freq_data( ﬁr_ct).no_data=data_ct—l ;

ﬁd=fopen(char( freq_fn_label(ﬁ1_ct, 1 )));

% Get rid of the ﬁrst line

fgetl(ﬁd);

% Read data

for data_ct=1 : freq_data(ﬁi_ct).no_data;
freq_data(fn_ct).idx(data_ct)=fscanf(ﬁd, '%i',1);
freq_data(fn_ct).haplotype {data_ct} =fscanf(ﬁd, '%l Os',1 );
freq_data( ﬁi_ct).freq_ave(data_ct)=fscanf(ﬁd, '%f’,1);
freq_data( ﬁi_ct).freq_se(data_ct)=fscanf(ﬁd, '%f‘, 1 );

end

53

fclose(ﬁd);

%Sort the data w.r.t. freq ave
[sort_data sort_idx]=sort(-freq_data(fn_ct).freq_ave);
freq_data(fn_ct).idx=freq_data(fn_ct).idx(sort_idx);
freq_data(fn_ct).haplotype=freq_data(fn_ct).haplotype(sort_idx);
freq_data(ﬁr_ct).freq_ave=freq_data(fn__ct).freq_ave(sort_idx);
freq_data(ﬁi_ct).freq_se=freq_data(fn_ct).freq_se(sort_idx);

end

%Enumerate all haplotypes
haplotype_set=freq_data( 1 ).haplotype;
no_haplotype=freq_data( 1 ).no_data;
for i=2 : no_fn
for j=1 : freq_data(i).no__data
b_ﬁnd=0;
for k=l : no_haplotype
if strcmp( char(haplotype_set(k)), char(freq_data(i)haplotype(j)) )
b_ﬁnd=1;
break;
end
end
% Add to haplotype_set if the same haplotype is not found in haplotype_set
if ~ b_ﬁnd
no_haplotype=no_haplotype+1 ;
haplotype_set(no_haplotype)=freq_data(i).haplotype(j);
end
end
end

%Construct the structure to hold the freq_ave and freq_se for each haplotype in each ﬁle
for fn_ct=1 : no_ﬁi
haplotype(ﬁ1_ct). freq_ave=zeros(1 ,no_haplotype);
haplotype( fn_ct). freq_se=zeros( 1 ,no_haplotype);
for i=1 : freq_data(ﬁi_ct).no_data
for j=1 :no_haplotype
if strcmp( char(haplotype_set(j», char(freq_data(fn__ct).haplotype(i)) )
haplotype(fn_ct).freq_ave(j)=freq_data(ﬁi_ct).freq_ave(i);
haplotype(fn_ct).freq_se(j)=freq_data(fn_ct).freq_se(i);
break;
end
end
end
end

%plot 1

54

bar_freq=zeros(no_fn,no_haplotype);
for fn_ct=1 : no_fn
for j=1 : freq_data(ﬁr_ct).no_data
bar_freq(fn_ct,j)=freq_data(ﬁ1_ct).freq_ave(i);
bar_label_pos(fn_ct,j)=sum(freq_data(fn_ct).freq_ave(1 : j ));
end
end
ﬁgure(1);
%Draw the bars
bar(bar_freq,.3,'stack');
%Draw the labels
for fn_ct=l : no_fn
for j=1 : freq_data(ﬁ1_ct).no_data
text(ﬁi_ct-.8,j/ 16,char(freq_data(ﬁ1_ct)haplotype(j)));
end
end

%hold on;

title('Distribution');

ylabel('Frequency (%)');

fn_ﬁg=['distrib.jpg'];

set(gca,'XTick',[ 1 :no_ﬁi],'XTickLabel',freq_ﬁ1_label(:,2)');
print(ﬁi_ﬁg,'-djpeg');

%plot 2
for i=1 :no_haplotype
ﬁgure(2);
for j=1 :no_fn
y(j)=haplotype(j).freq_ave(i);
e(j)=haplotype(j).freq_se(i);
end
errorbar([1 :no_fn],y,e,'X');
title(char(haplotype_set(i)));
ylabel('Frequency (%)');
set(gca,'XTick',[] :no_fn],'XTickLabel',freq_ﬁi_label( :,2)');

ﬁi_ﬁg=[char(haplotype_set(i)) '.jpg'];

print(ﬁ1_ﬁg,'-djpeg');
end

55

APPENDIX D
SAS CODE TO CONDUCT CONTINGENCY TESTS

data thesis3.contingency;
do group=1 to 2;
do haplotype=1 to 4;
input count @@;
output;
end;
end;
cmdg
13 116 20 15
22 225 38 23

9

proc freq data=thesisl .contingency;

weight count;

tables group*haplotype / exact;

title 'Contingency Test between Asthma at 1 or 2 and Control (CA/CG/TA/T G)’;
run;

56

Symbols

A, B,

A,, i=1, m
, 1 =1, m
c,, i=1, m

gjj,i,j:1, ...,m

gut-1. if = 1. m,

k,j = 1, n,

gik ._,'1

Ci}, l,j =1, ..., m

60],], l,j :1, ..., m,
k,j = I, n,

Cik./’I

hi1“, l,j=1, ...,m

APPENDIX E

NOTATIONS

Notations

Loci

Alleles at locus A with m alleles

Relative frequencies of alleles at a locus with m alleles
Counts of alleles at a locus with m alleles

Relative Frequencies of genotypes at a locus with m alleles

Relative frequencies of genotypes at two loci which have m alleles
and n alleles respectively.

Relative frequencies of genotypes obtained from haplotype pair
Ain /AJ-B, at loci A and B.

Counts of genotypes at a locus with m alleles

Counts of genotypes at two loci which have m alleles and n alleles
respectively.

Counts of genotypes obtained from haplotype pair Ain / A j B, at
loci A and B.

Relative frequencies of haplotypes at a locus with m alleles

57

BIBLIOGRAPHY

58

BIBLIOGRAPHY

Clark, A. G. Inference of haplotypes from PCR-arnpliﬁed samples of diploid populations.
Molecular Biology Evolution. 1990: 7(2): 111-122.

Crow, J. F. Eighty years ago: the beginnings of population genetics. Genetics 1988 Jul:
119 (3): 473-6.

Excofﬁer, L. & Slatkin, M. Maximum-likelihood estimation of molecular haplotype
frequencies in a diploid population. Molecular Biology Evolution. 1995: 12(5): 921-927.

Fallin, D. & Schork, J. Accuracy of haplotype frequency estimation for biallelic loci, via
the expectation-maximization algorithm for unphased diploid genotype data. American
Journal of Human Genetics 2000: 67: 947-959.

Gilks, W. R., Richardson S., & Spiegelhalter, D. J. Markov Chain Monte Carlo in
Practice. London: Chapman & Hall, 1996.

GlaxoSmithKline. Genetics at GlaxoSmithKline. [On-Line]. Available:
http://genetics.gsk.com/link.htm, 2004.

Guo, S. W., & Thompson, E. Performing the exact test of Hardy-Weinberg proportion for
multiple alleles. Biometrics 1992: 48: 361-372.

Kurukulaaratchy, R. J ., Fenn, M. H., Waterhouse L. M., Matthews, S. M., Holgate, S. T.,
Arshad, S. H. Characterization of wheezing phenotypes in the ﬁrst 10 years of life.
Clinical and Experimental Allergy 2003: 33: 573-578.

Kurukulaaratchy, R. J ., Matthews, S., Holgate, S. T., Arshad, S. H. Predicting persistent

disease among children who wheeze during early life. European Journal of Respiratory
Diseases 2003: 22: 719-720.

Kurukulaaratchy, R. J ., Matthews, 8., Waterhouse, L., Arshad, S. H. Factors inﬂuencing
symptom expression in children with bronchial hyperresponsiveness at 10 years of age.
The Journal of Allergy and Clinical Immunology 2003: 112: 311-316.

Long, J. C., Williams, R. C. & Urbanek, M. An E-M algorithm and testing strategy for
multiple-locus haplotypes. American Journal of Human Genetics 1995: 56:799-810.

Schneider, S., Roessli, D., & Excofﬁer, L. Arlequin ver. 2.000: A software for population
genetics data analysis. Genetics and Biometry Laboratory, University of Geneva,
Switzerland, 2000.

Schork, N. J ., Fallin, D., & Lanchbury, S. Single nucleotide polymorphisms and the
future of genetic epidemiology. Clinical Genetics 2000: 58:250-264.

59

Sham, P. Statistics in Human Genetics. New York: Oxford University Press, 1998.
Stephens, M., Smith, N. J ., & Donnelly, P. A new statistical method for haplotype
reconstruction from population data. American Journal of Human Genetics 2001: 68:
978-989.

Stephens, M., Smith, N. J ., & Donnelly, P. Documentatin for PHASE ver.2.02.
Department of Statistics, University of Washington, Seattle, WA, 2003.

60

 

 

Ilflll‘lljllllllzlll’lll[llljlfl

 

A. ‘ '