“THS This is to certify that the thesis entitled STATISTICAL METHODS FOR IDENTIFYING GENETIC ASSOCIATIONS presented by LAN TONG has been accepted towards fulfillment of the requirements for the M. S. degree in Applied Statistics "3 Major Professor’s Signature I! .. 2 9 -— 0 S" Date MSU is an Affirmative Action/Equal Opportunity Institution LIBRARIES MICHIGAN STATE UNIVERSITY EAST LANSING, MICH 48824-1048 PLACE IN RETURN Box to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 2/06 c:/CIRC/Date0ue.indd-p.15 STATISTICAL METHODS FOR IDENTIFYING GENETIC ASSOCIATIONS By Lan Tong A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Statistics 2005 ABSTRACT STATISTICAL METHODS FOR IDENTIFYING GENETIC ASSOCIATIONS By Lan Tong This thesis introduces three bio-statistical concepts: Hardy-Weinberg equilibrium, linkage disequilibrium, and haplotype reconstruction and haplotype frequency estimation. Three statistical methods have been discussed and utilized to test these concepts for a population genetics study. Hardy-Weinberg equilibrium is a basis for genetic inference. It is tested through the exact test implemented by the Arlequin software package. Linkage disequilibrium is an important tool for mapping disease genes. It is tested with a likelihood-ratio test, whose key procedure is the Expectation-Maximum algorithm, implemented by the Arlequin software package. Haplotype information is essential for mapping disease. The haplotype frequencies are estimated through the Bayesian estimation method implemented by the PHASE software package. ' The above concepts and tests have been applied to the Isle of Wight cohort study. It has been found that all the loci of interest (hC V893205 6, hCV15862 743, hCV8932053, and hCV8932052 on the 11.13 gene) are in Hardy-Weinberg equilibrium, and that all pair- wise loci are in linkage disequilibrium. The haplotypes of the most informative SNP pair, hC V893205 6 and hCV8932052, have been reconstructed; their frequencies are estimated for eight phenotypes of interests. The contingency tests suggest that there is no association between the haplotype patterns CA/CG/T A/T G and CA/TA/T G and allergic asthma. ACKNOWLEDGMENTS I would like to express my appreciation to the several people who have been generous with their encouragement and support for this thesis. Most of all, I am grateful to Dr. Marianne Huebner, Department of Statistics, Michigan State University, and Dr. Susan Ewart, Department of Large Animal Clinical Sciences, Michigan State University, for providing me the opportunity of participating the Isle of Wight birth cohort study. Dr. Huebner has mentored me throughout the project and the writing of the thesis. I have had many helpful discussions with Dr. Ewart and Dr. Dennis Gilliland, Department of Statistics, Michigan State University, both of whom provided me invaluable suggestions for the improvement of my thesis. Dr. James Stapleton, Department of Statistics, Michigan State University, has served on my committee and offered me constructive advice about my thesis. I also want to thank my parents for taking care of my newborn daughter so that I was able to concentrate on my research and thesis writing. I am deeply indebted to my husband whose love, support, and knowledge have made the completion of the project and the thesis possible. iii TABLE OF CONTENTS List of Tables - - - - - - - - - _ v List of Figures- vi Chapter 1 Introduction- - - - - - - - - - -- - 1 Section I Background ...................................................................................................... 1 Section 2 Genetic Concepts ............................................................................................. 4 Section 3 Case Study ..................................................................................................... 10 Chapter 2 Hardy-Weinberg Equilibrium _ -- - -- -- - -- - 12 Section 1 Introduction ................................................................................................... 12 Section 2 Exact Test of Hardy- Weinberg Equilibrium .................................................. 14 Section 3 Arlequin Implementation of Exact Test and Case Study ................................ 19 Chapter3 Linkage Disequilibrium ........... - - -- - _- - - -- - - 22 Section I Introduction ................................................................................................... 22 Genetic Linkage ......................................................................................................... 22 Allelic Association ..................................................................................................... 22 Maintenance of Allelic Associations: Linkage Disequilibrium ................................. 23 Section 2 Likelihood Ratio Test of Linkage Disequilibrium .......................................... 25 Computation of L0 ...................................................................................................... 25 Computation of L1 ...................................................................................................... 26 Likelihood-ratio test ................................................................................................... 30 Section 3 Arlequin Implementation of Likelihood Ratio Test and Case Study .............. 31 Chapter 4 Haplotype Frequency Estimation - - - - - -- -- 33 Section 1 Introduction ................................................................................................... 33 Section 2 Bayesian Estimation Method for Haplotype Frequency ................................ 35 Section 3 PHASE Implementation of Bayesian Estimation and Case Study ................. 39 Chapter 5 Conclusions . -- - - - - - ...... - - -- - -- 49 Appendices ....... . .................. -- .......... - - 50 Bibliography .................................................................................................................... 58 iv LIST OF TABLES Table 1 Sample of Data Set ....... -- - ll Table 2 Sample of Data Set with Genotype Information on Four SN Ps 20 Table 3 Haplotype Frequency Estimation for SNPs hCV8932056 and hCV8932052 by Phenotypes 42 Table 4 Two-way Contingency Table of Asthma 1 or 2 vs. Control- -- 46 Table 5 Two-way Contingency Table of Asthma 4 vs. Control - -_ -46 Table 6 Two-way Contingency Table of CDAlO vs. Control - 46 Table 7 Two-way Contingency Table of Wheeze 1 or 2 vs. Control - -- - -- 46 Table 8 Two-way Contingency Table of Wheeze 4 vs. Control - ........ 46 Table 9 Two-way Contingency Table of Wheeze 10 vs. Control _- ...... 46 Table 10 Two-way Contingency Table of Chronic vs. Control - ....... 46 Table 11 Summary of Contingency Tests (Haplotype CA/CG/TA/T G) .................... 48 Table 12 Summary of Contingency Tests (Haplotype CA/TA/TG) _ 48 LIST OF FIGURES Figure 1 Levels of Information Transfer -- - 4 Figure 2 DNA _- - - - 5 Figure 3 Allele ........ - - - - - - 7 Figure 4 SNP - - - - - - 8 Figure 5 Haplotype Frequency Estimation for SNPs hCV8932056 and hCV8932052 by Phenotypes - - - _ - - 43 Figure 6 Relationship between Haplotypes and Phenotypes 45 vi CHAPTER 1 INTRODUCTION Section 1 Background The purpose of this thesis is to discuss three bio-statistical topics: Hardy- Weinberg equilibrium, linkage disequilibrium, and haplotype reconstruction and haplotype frequency estimation, and their applications in population genetics. The Hardy-Weinberg law plays a very important role in the field of population genetics and often serves as a basis for genetic inference (Crow, 1988). This law says that in a large random-mating population with no selection, mutation, or migration, the allele frequencies and the genotype frequencies are constant from generation to generation and that, furthermore, if the alleles A 1, A2, Am have relative frequencies f1, f2, fm respectively, then the relative frequency of homozygous genotypes such as AA is gt,- =fi2, and the relative frequency of a heterozygous genotype such as A A ( i ¢ j ) is g.)- = 2f”? Because of its importance, a lot of effort has been made to test if a population exhibits Hardy-Weinberg equilibrium. This paper discusses, in detail, one of the statistical methods, exact test, used to test Hardy-Weinberg equilibrium. Linkage disequilibrium is an important tool for mapping disease genes. It describes the nonrandom association of alleles at linked loci. When only genotype and not haplotype frequencies are available, linkage disequilibrium between a pair of loci is tested with a likelihood-ratio test. This paper discusses the likelihood ratio test, with highlight on its key procedure, the Expectation-Maximum algorithm which resolves double heterozygous genotypes into haplotypes when not assuming linkage equilibrium. Haplotype information is essential for mapping disease. When individuals are homozygous at every locus, haplotypes can be easily determined; however, when individuals are heterozygous at more than one locus, haplotypes cannot be deducted from genotype information. Statistical approaches can be used to reconstruct haplotypes and estimate the relative frequencies of all possible haplotypes. This paper introduces the Bayesian estimation method of reconstructing haplotypes and estimating haplotype frequencies. All the above three concepts have been applied to the Isle of Wight birth cohort study. This study is committed to identify the genetic susceptibility loci responsible for asthma and allergy by studying a group of children born on the Isle of Wight, United Kingdom. The Hardy-Weinberg equilibrium and linkage disequilibrium for the four SNPs: hC V893205 6, hCVI5862 743, hCV8932053, and hCV8932052 on the 1L13 gene have been tested using the Arlequin software package. Haplotype reconstruction and frequency estimation for SNPs hC V893205 6 and hCV8932052 have been performed using the PHASE software package. Both the Arlequin and PHASE software packages are designed for population genetics data analysis. Arlequin implements exact test and likelihood-ratio test to examine the Hardy-Weinberg equilibrium and linkage disequilibrium, respectively. It is available for free at http://lgbunige.ch/arlequin. PHASE implements the Bayesian algorithm to approximate the posterior distribution of haplotype configurations. It is also free and can be downloaded at http://www.stat.washington.edu/stephens/sofiwarehtml. Chapter 1 of this thesis introduces the background, the basic genetic concepts involved in the study, and the data of Isle of Wight birth cohort. Chapter 2 describes Hardy-Weinberg equilibrium, its significance, exact test of Hardy-Weinberg equilibrium, and testing the Isle of Wight birth cohort for Hardy-Weinberg equilibrium using Arlequin. Chapter 3 discusses linkage disequilibrium, its significance, likelihood ratio test of linkage disequilibrium, and testing the Isle of Wight birth cohort for linkage disequilibrium using Arlequin. Chapter 4 explains the motif of haplotype reconstruction and frequency estimation, Bayesian estimation method, and estimating the Isle of Wight birth cohort for haplotype frequencies using PHASE. Finally, in Chapter 5, the test results in the previous chapters are summarized. Section 2 Genetic Concepts To better understand genetic research and the Isle of Wight birth cohort study, it is helpful to explore some basic genetic definitions and concepts. I... ‘ lmnlm "Ilium! WI lllllllllgh “l ll‘llilgi 6. Population Figure 1 Levels of Information Transferl Genetics is the study of traits passed on from parent to child and variation of those traits within and between individuals. It is about the transfer of information among many different levels (Figure 1). The foundation level is the molecule called DNA (1). The information in DNA is organized into genes (2). Genes, in turn, make up chromosomes (3), which when taken all together form an organism's genome (4). Every nucleus in an individual (5) contains the genome. Instructions that provide almost all of the information necessary for a living organism to grow and function are in the nucleus of every cell. The instructions are in the form of a molecule called deoxyribonucleic acid, or DNA (Figure 2). In humans, the DNA molecule consists of two ribbon—like strands that wrap around each other, resembling a twisted ladder. The rungs of the ladder are nucleotide base pairs. The bases are called adenine (“ ”), cytosine (“C"), guanine (“G”) and thymine (“I“). These bases always pair up as A+T and C +G. i“vii . “WWW ".1:Iit:“t.‘ii H ‘ Ill ,ii III‘lullllillllllmlmlll‘ll -”‘., ‘ v ' " "ill it'll‘llhulliifiim lpflirii’lillhflmlllL W * T l v ‘I i ‘ 'l'nlirrttuilll [ll lilini. u. ' ‘ |vii' WWI“ I ll . M l M Figure 2 DNA: ' This figure was originally created by GlaxoSmithKline. It is available in the online article ‘Genetics at GlaxoSmithKline’ at http://genetics.gsk.com/link.htm, 2004. 2 This figure was originally created by GlaxoSmithKline. It is available in the online article ‘Genetics at GlaxoSmithKline’ at http://genetics.gsk.com/link.htm, 2004. Before DNA was discovered, gene was defined as a discrete unit which is inherited from parent to offspring and which exerts control on a single character. After DNA was discovered, gene was redefined as a segment of DNA that codes for a protein subunit. The word gene may also be used to refer to a functional DNA segment or a class of functional DNA segments that have the same position, structure and function (Sham, 1998) DNA is contained in tightly coiled packets called chromosomes. Chromosomes consist of the double helix of DNA wrapped around proteins. Each human cell nucleus contains 23 pairs of chromosomes. A locus is a specific position in the genome (the complete set of genes). The DNA of most people hare highly similar. The presence of different DNA sequences at the same locus in a population is known as a polymorphism. The alternative DNA sequences at a locus on a chromosome pair are known as alleles (Figure 3). People can have two identical or two different alleles for a particular gene. A person who has two identical alleles for a gene is said to be homozygous for that gene. A person with two different alleles is said to be heterozygous. The pair of alleles a person has at a specific location in the genome is called genotype. Genotype affects phenotype, which is the observable effect of the allele, such as eye color. A combination of alleles is termed haplotype. Haplotypic phase refers to whether a gametic (a gamete is a sperm or an egg that fuse during reproduction) haplotype changes during recombination (GlaxoSmithKline, 2004). W’ U" l "WI gull“; I“ {I copy I of chromosome A copy 2 of chromosome A Genotypes PP Homozygous for the dominant allele aa Homozygous for the recessive allele Bb Heterozygous Recessive allele - Dominant allele Figure 3 Allele’ Many diseases are related in some way to genes. Many common diseases result from a change in one or a few susceptibility genes. To find a gene that is involved in a specific disease, scientists must search for DNA changes that are present more often in people who have a particular disease compared to people who do not have the disease (GlaxoSmithKline, 2004). A contemporary kind of genetic map, called a high-density single nucleotide polymorphism (“SNP”) (Figure 4) map, has the potential to promote this research. SNPs are single-base differences in the DNA sequence that can be observed between different 3 This figure was originally created by GlaxoSmithKline. It is available in the online article ‘Genetics at GlaxoSmithKline’ at http://genetics.gsk.com/link.htm, 2004. individuals in the population. For example, a SNP might change the DNA sequence AAGGC T AA to AT GGC T AA. SNPS are the simplest and most common forms of DNA polymorphism. They are present throughout the human genome. Groups of neighboring SNPS may have alleles that show distinctive patterns of linkage disequilibrium and as such may create a haplotypic diversity that can be exploited in both genetic linkage and direct association studies. The simple structure of SNPS also allows rapid and efficient genotyping. In addition, SNPS are also evolutionarily stable - not changing much from generation to generation - making them easier to follow in population studies. These features of SNPS in the genome make them particularly valuable as genetic markers (Schork et a1, 2000). Using SNPS 'rbi' Lbiaiing Disease Genes ., 2. SNP profile of people without the disease “-1-?" Him __ .. “llllTll‘ Ill IgiLIII- _, mi . 1. SNP profile of people with the disease [[llll l’i““ii """Ilijilini ll 3- ,[Jl ". Ii 4. Compare the two profiles Till I ll. . ill]lffllllllfjll f f .IllL— ' "ll: fflll :fllllllfjfl Ill 'L 3. Differences in the profiles identify locations that may contain susceptibility genes 3121*”: «I? M Figure 4 SNP“ 4 This figure was originally created by GlaxoSmithKline. It is available in the online article ‘Genetics at GlaxoSmithKline’ at http://genetics.gsk.com/link.htm, 2004. Using the information that SNPS provide, it may be possible to predict people’s genetic risk of developing a certain disease, to diagnose a disease more accurately, or to predict how people most likely will respond to a medicine. Section 3 Case Study The Isle of Wight birth cohort study represents an unselected whole population birth cohort based on the Isle of Wight, United Kingdom. The Isle of Wight is a small island (13 x 23 miles) just off the South coast of England with a resident population Of 133,000. The ethnic background of the island residents is mainly Caucasian. While the Isle of Wight population is not genetically homogeneous, it is stable to the extent that the majority of children in the cohort has not moved away and has thus been available for follow up. Enrollment took place at birth. Of the 1,536 children born on the Isle of Wight between January 1, 1989 and February 28, 1990, informed consent was obtained from the parents of 1,456 children. These children have Since been seen at the ages of l (n = 1,167; 80.2%), 2 (n = 1,174; 80.6%), 4 (n = 1,218; 83.7%) and 10-years (n = 1,373; 94.3%) (Kurukulaaratchy et al, 2003). At birth, information on family history of allergy, household pets, parental smoking, socioeconomic status and birth weight were recorded. At every follow-up (1, 2, 4 and IO-years), detailed questionnaires were completed with the parents for each child regarding asthma and allergy prevalence. For each child, the following phenotype information has been recorded: asthma at 1 or 2 years, asthma 4 years, currently diagnosed asthma (CDA) at 10 years, wheezing at 1 or 2 years, wheezing at 4 years, wheezing at 10 years, chronic asthma, and no symptoms. With high cohort retention, this prospectively followed population provides a uniquely characterized resource for ongoing studies (Kurukulaaratchy et al, 2003). 10 A segment of the data file is presented in Table 1. Table 1 Sample of Data Set asthma asthma 32056 1 15862743 1 32053 1 32052 1 1 CC CC GG CC CC GG Among the total of 1,536 children, 625 children have complete information about SNPS hCV8932056-10, hC V 1 5 862 743-1 0, hCV8932053-10, and hCV8932052-10. There are two indices under each phenotype with “1” indicating positive result (case) and “0” negative result (control). The population that has complete information about the four SNPS, including cases and controls, is used to test Hardy-Weinberg equilibrium and linkage disequilibrium. The population that has complete information about the most informative SNP pair hCV8932056-10 and hCV8932052-10, but including cases only, is estimated for its haplotype frequency on the basis of phenotypes. ll CHAPTER 2 HARDY-WEINBERG EQUILIBRIUM Section 1 Introduction In a large population, in the absence of natural selection, mutation, or migration, when the mating type frequencies arise from random mating, the ratios of the different genotypes follow a mathematical result established independently by the English mathematician Hardy and the German physician Weinberg. This phenomenon is named “Hardy-Weinberg Equilibrium”. We need to examine all polymorphisms for Hardy- Weinberg equilibrium in order to produce valid and significant results in allelic association studies. Consider a biallelic locus with alleles A 1 and A 2. Let the relative frequencies of the three genotypes A [A ,, AA; and A 2A 2 in a large population be g”, 2g”, and g2; such that g” + 2g]; + g2; = 1. Hardy’s result was that, if individuals in the population mated with each other at random, these relative frequencies would be such that 8122 = 811822. Moreover, if these offspring are mated at random, then the relative frequencies of the genotypes will remain unchanged after a second generation. It follows that these relative frequencies will continue to be in the Hardy-Weinberg ratio as long as mating in the population are random with respect to the locus. The Hardy-Weinberg equilibrium enables us to relate genotype frequencies to allele frequencies. If the relative frequencies of alleles A, and A 2 are f; and f2, respectively, then under normal conditions, the relative frequencies of gametes with alleles A 1 and A 2 will be f, and f2. According to the Hardy-Weinberg law, under random mating, the relative frequencies of genotype A ,A 1, A [A2 and A 2A 2 are g,,=f,2, g12=2ftf2, 12 and g2; =f22, regardless of the genotype frequencies in the parental generation. Since the allele frequencies corresponding to these genotype frequencies remain unchanged at f, and f2, the same genotype frequencies will be maintained in subsequent generations as long as random mating applies in the population. The same principle can be applied to a locus with more than two alleles. Let the alleles be A}, A 2, A,,,, with relative frequencies 1”,, f2, fm. Under Hardy—Weinberg equilibrium, the relative frequency of a homozygous genotypes such as AA is g,,- = fl) and the relative frequency of a heterozygous genotype such as A ,A,- (i ¢ j ) is gy- = Zfif] (Sham, 1998). Hardy-Weinberg equilibrium has many important applications. The demonstration of the ratio provides strong evidence for a genetic basis for a trait. 13 Section 2 Exact Test of Hardy-Weinberg Equilibrium Due to the importance of the Hardy-Weinberg law in the development of population genetics, testing of the null hypothesis that a population exhibits Hardy- Weinberg equilibrium has drawn a lot of attention during the past decades. The methods proposed to test Hardy-Weinberg equilibrium can be categorized into two groups. One consists of large-sample goodness-Of-fit tests that lean heavily on asymptotic results. However, it has been recognized that such tests can sometimes lead to false rejection or acceptance of Hardy-Weinberg equilibrium when the sample sizes are small and/or some cell frequencies are small or zero. The other approach involves the exact test, which is preferred when the sample size is small and/or some cell frequencies are small or zero (Guo et al, 1992). The exact test is named so because it does not rely on approximations. The Arlequin software package that we used to test Hardy-Weinberg equilibrium for the Isle of Wight birth cohort, implements the exact test. The exact test for Hardy-Weinberg equilibrium for multiple alleles is discussed below. Consider an autosomal locus that has m alleles A 1, A2, Am. A sample of size n is sampled from a population of interest. Let Cij (1 S j S i S m) denote the observed count of genotype A ,-A j. Then the data can be presented as the contingency table Al Cu A2 02/ 622 Am Cm] Cm2 ... Cmm A] A2 Am 14 Use c = (c11,021,c22,...,cmm) to designate this table. Let c,- denote the number ofA, alleles 1n the sample. Then cl- = Cii +Zj=1cij (where cij =cj,- 1f j>l). The table of the count of genotype AjAj is a random variable. Let C denote this random variable. Then under random mating and the constraint that the number of allele A,- remains unchanged from generation to generation, the distribution of C satisfies multivariate hypergeometric distribution. Thus, the probability of obtaining the sample c is: m n!Hc,-! 291' Pr(C = c) = my” , j>i where 26,] is the number of heterozygous individuals. j >i The exact test for the Hardy-Weinberg equilibrium given observed sample c has to evaluate P: ZPr(c'), c'eT where T = {c'z Pr(c') S Pr(c),and{c;-} = {0,}, wherec; =# alleleAi} (Guo et al, 1992). In order words, the P—value of the test is the sum of the probabilities of the tables that have a probability smaller than or equal to the observed contingency table c and have the same allele counts as does c. Rejection or acceptance of the null hypothesis depends on whether P is smaller than a pre-specified significance level or. A large P-value means that the probability of obtaining a sample as extreme or more extreme than the actually observed is large, thus it suggests Hardy-Weinberg equilibrium (Schneider, 2004). 15 Since there may be innumerable contingency tables having identical marginal counts, simply enumerating such tables is unrealistic. The Arlequin software modifies the Metropolis algorithm (Guo et a1, 1992) to construct a Markov chain of contingency tables that have the same allelic counts (the same marginal counts) as the observed table, and a limiting distribution matching Pr(C). This approach starts with the observed contingency table. In order to create a new contingency table from an existing one, we randomly select two distinct rows i1, i2 and two distinct columns j 1, j 2. Neither the two rows nor the two columns have to be next to each other. The new table is obtained by decreasing the counts of the cells (i 1. j I) (i 2, j 2) and increasing the counts of the cells (i I, j 2) (i 2, j 2) by one unit. This leaves the alleles counts {c,-} unchanged. For example, by decreasing the counts of the cells (C31, C43) and increasing the counts of the cells (C3 3, C41) by one unit in the original sample table c, we obtain a candidate for the new table. Note that the marginal counts of rows c3 and c4 and of columns c, and C3 remain the same as those in table c. A1 0/1 A2 6'21 022 A3 6'31“] C32 033'” A4 641+] C42 643-1 C44 Am Cm] CmZ cm3 Cm4 - - - Cmm A! A2 A3 A4 Am Whether to accept this candidate table or not depends on a probability R equal to: 16 _ Pr(c(k+l)) _ ci1j1ci21'2 (1 +6111} )(1 +512}? ) 1.R — vifi1¢110fi2¢jz Pr(c(k)) (Ciljz + 1)(Ci2j1 +1) (1+ 6i1j2 )(1+ 6i2j1 ) P “(+1) C' . c. . 4 2, R: ‘(C k l: ‘1“ ‘212 —,ifi1 =j1 and i2=j2 Pr(c( )) (Ciljz +1)(Ci2j1 +2) 1 P (k't'l) c. . (c. . —1) 3. R: 1‘(C k )2 1111 1212 1,1fl1 = j2 and 12 =j1 Pr(c( )) (Ciljz + 1)(ci2j1 + 1) 4 where k is the created table number, and 6,-1- = 0 if i at j, 1 if i = j. R is the ratio of the probabilities of the two tables. The switch to the new table is accepted if R is larger than 1. The resulted Markov chain has a limiting probability distribution that is the same as the distribution of the m nlnci! Z Cij i=ll 2j>l (2n)!l—Icij! j>i contingency tables, Pr(C = c) = In practice, the Markov chain starting from the observed contingency table is biased. In order for the chain to be unbiased, it is ideal to start the chain from a random table chosen from the distribution Pr(C). One solution is to start the chain from the observed table and run for a long time so that the initial table is “forgotten”. This process is referred to as “dememorization”. After the dememorization, the Markov chain is constructed for a long time, which results in a large number of selected tables. The limiting probability of the Markov chain in a particular table can be interpreted as the long-run proportion of time that the chain stays in that table. Therefore, following the computation of P—value in the exact test, the 17 P-value of the test is the proportion of the selected tables that have a probability smaller than or equal to the observed contingency table. 18 Section 3 Arlequin Implementation of Exact Test and Case Study The Arlequin software package is employed to test Hardy-Weinberg equilibrium for the Isle of Wight birth cohort Study. The goal of Arlequin is to provide the users in population genetics with a large set of methods and statistical tests, in order to extract information on genetic and demographic features of a collection of population samples (Schneider, 2004). The software is available on line at http://1@.unige.ch/arlequin. Arlequin input file with extension .arp contains both “Profile”, descriptions of the properties of the data, as well as “Data”, the raw data themselves. The Profile section specifies the title of the project, the number of samples present in the project, the type of data to be analyzed, if the project deals with haplotypic or genotypic data, if the gametic phase of genotypes is known and so on. The sample of our project is defined as DNA multi-locus data with unknown gametic phase. The Data section includes a name of the sample, the size of sample, and the data itself. A SAS program (Appendix A) has been written to obtain the information required by Data section for our project. The children with complete genotype information at SNPS hCV8932056-10, hCV15862743-10, hCV8932053-10, and hCV8932052-10 are first selected. There are a total of 625 such children (Table 2). Next, the individual genotypes for the four loci are output on two separate lines according to Arlequin’s instruction. For instance, sample #1532 has genotype pattern TT// C C//C T // AG. A combination of the first allele at each locus, namely T C CA, forms the first line for this individual; likely, T C TG forms the second line. It should be noted that the sequence of the loci must be kept in the order of their physical positions. 19 Table 2 Sample of Data Set with Genotype Information on Four SN Ps asthma asthma no ID C8932056_10C15862743_10C8932053_jOC8932052_10 1,2 4 chronic symptom 2 CC CC CC GG 1 1 0 1 3 CC CC CC GG 1 1 0 0 1530 CC CC CC GG 1 1 0 1 1532 "IT CC CT AG 2 1 O 0 The input file (with partial dataset) of our project is displayed as follows: [Profile] Title="Isle of Wight Asthma HWE test" NbSamples= 1 DataType= DNA GenotypicData= 1 LocusSeparator= WHITESPACE GameticPhase= O RecessiveData= 0 RecessiveAllele= null MissingData= '?' Frequency= ABS CompDistMatrix= 0 FrequencyThreshold= 1.0e-5 EpsilonValue= 1.0e-7 [Data] [[SamPICSl] SampleName="Isle of Wight Asthma HWE test" SampleSize=625 SampleData= { 21 CCCG CCCG 15321 TCCA TCTG Once the project file is built, “Calculation Settings” of Arlequin interface is decided. In our project, Hardy-Weinberg equilibrium test with 100,000 steps in Markov chain, 1,000 dememorisation steps, and a significance level of 5% has been set. The number of steps in Markov chain sets the maximum number of alternative tables to 20 explore. A figure of 100,000 is in order. The number of dememorization steps sets the number of steps to perform before beginning to compare the alternative table probabilities to that of the observed table. 1,000 steps are necessary to reach a random starting point corresponding to a table independent from the observed table. The output of the test is presented below: Hardy-Weinberg equilibrium: (Isle of Wight Asthma HWE test) Exact test using a Markov chain (for all Loci): Forecasted chain length :100000 Dememorization steps 21000 Locus #Genot Obs.Heter. Exp.Heter. P. value s.d. Steps done 1 625 0.30880 0.58841 1.00000 0.00000 100172 625 0.16640 0.49639 0.80856 0.00122 100172 2 3 625 0.30720 0.58422 0.79268 0.00123 100172 4 625 0.28480 0.58663 0.08930 0.00084 100172 As shown above, the P-values of the test at all loci are larger than 0.05, suggesting acceptance of null hypothesis; namely, all the loci are in Hardy-Weinberg equilibrium. Once Hardy-Weinberg equilibrium is established, we can proceed to test whether two loci are in linkage disequilibrium. 21 CHAPTER 3 LINKAGE DISEQUILIBRIUM Section 1 Introduction Genetic Linkage The aim of genetic linkage analysis is to infer the relative positions of two or more loci by examining the patterns of allele-transmission from parent to offspring, or the patterns of allele-sharing by relatives. The genotype of an individual at two loci is formed by the haplotypes of two gametes, each inherited from one parent. A gamete may contain two alleles fi‘om the same parental gamete or one allele from each parental gamete. In the first case, the haplotype of the gamete is the same as the haplotype of one of the parental gametes. Such gametes are defined as non-recombinants. In the second case, however, the haplotype of the gamete constitutes a new combination of alleles different from either parental haplotype. Such gametes are defined as recombinants. The recombination fraction, usually denoted as 6 , between the two loci on the same chromosome is defined as the probability that a gamete is recombinant. Two loci with a recombination fraction of less than 1/2 are said to be in linkage. The closer the two loci, the smaller the recombination fraction is, and the more tightly linked are the two loci (Sham, 1998). Allelic Association As a result of linkage, some combinations of alleles, i.e. haplotypes, on short chromosomal segments may be preserved over a large number of generations and become quite frequent in the population. The excessive co-occurrence of certain combinations of alleles in the same gamete because of tight linkage is known as allelic association. 22 Consider two loci A and B, with alleles A1,A2,...,Am,and Bl,Bz,...,Bn. By definition, if the occurrence of allele A,- and allele B j in a haplotype are independent events, then the relative frequency of the haplotype AiBJ- is equal to the product of the allele frequencies of A,- and Bj, i.e. hij- = f,-(A)fJ-(B). If hij ¢ fi(A)fj(B), then the occurrences of A,- and B j are not independent, and the two alleles are said to be associated (Sham, 1998). Maintenance of A llelic Associations: Linkage Disequilibrium In a large, closed, randomly mating population, let the relative frequency of the haplotype AiBj in the current generation be hl-J-O. In the next generation, if the haplotype is a recombinant, then the probability that it is AiBJ- is Pr(AiBj) = fi(A)fj(B); If the haplotype is a non-recombinant, then the probability that it is A,- B j is simply Pr(Al-Bj) = hijo. The total probability that a haplotype transmitted to the next generation is AiB j is therefore hm = Pr(Aij) = 0%(A)f,-(B)+ (1— 9)hzj‘0- The change in haplotype frequency from generation 0 to generation 1 is hijl ' hijo = 9(fi(A)fj(B) " hyo). When fi(A)f]-(B) = hijo for all i,j at the two loci, i.e. ifthere is no allelic association, there will be no change in haplotype frequencies from generation to 23 generation, and we say that the two loci are in linkage equilibrium. Otherwise the two loci are said to be in linkage disequilibrium. For most human populations and for most regions of the genome, substantial linkage disequilibrium is only likely to occur between loci with a recombination fraction of less than 1%. This is the rationale behind the use of association analysis as an important tool for mapping susceptibility loci (Sham, 1998). 24 Section 2 Likelihood Ratio Test of Linkage Disequilibrium For genotypic data where the haplotypic phase is unknown, linkage disequilibrium between a pair of loci is tested for genotypic data using a likelihood-ratio test. The likelihood ratio is between the likelihood of the data assuming linkage equilibrium (denoted L0) and the likelihood of the data not assuming linkage equilibrium (denoted L 1)- L0 is computed by using the fact that, under the hypothesis of linkage equilibrium, the haplotype frequencies are obtained as the product of the allele frequencies. L, is obtained by applying the Expectation-Maximum algorithm to estimate haplotype frequencies. Both L0 and L I assume Hardy-Weinberg equilibrium (random mating). The ratio of L0 and L 1 suggests the degree of deviation from linkage equilibrium (Schneider et al, 2000). Suppose that two loci A and B, with distinctive alleles A1,A2,...,Am and BI , 82 ,...,B,, , have been genotyped in a random sample of the population. Each individual has a genotype of the form A,- A j // BkBl. There are m(m + 1)/ 2 possible genotypes at locus A , and n(n + 1) / 2 possible genotypes at locus B , so that the total number of joint genotypes is (m(m + 1) / 2)(n(n + l) / 2) . The aim is to test the null hypothesis of linkage equilibrium between A and B. Computation of L0 Under the assumption Of linkage equilibrium, the relative frequency of a haplotype is equal to the products of the allele frequencies: hik=fi(A)fic(B) htl=fi(A)fl(B) hjk=fi(A)fi(B) 25 hfl=fi(A)fz(B). where the allele frequencies can be obtained by simple counting on the basis of given genotype information. Under the assumption of random mating, the genotype frequencies can be expressed as follows: g... = 11.12.. = 12.2 aim/nor g,“ = 11,11, + huh”, = 211,12, = 2f,-(A)2f,(B)fi,(B) g,“ = 12,72, + 111,12, = 2h,,h,, = 2fi(A)fi(A)f,,(B)2 g,“ = h,,hj, + h j,h,, + huh], + 121,11, = 4 f,(A) fj(A) f, (B) f, (B) . The observed genotypic counts CH”. c”12 , c follow a multinomial III/"II" distribution with parameters cm (the sample size) and the genotypic frequencies 81111, glnz, gmmnn. The log-likelihood of observing cijkl is therefore ln L0 = Z cijkl ln(g,-J-k1), ', j l,...,m l l n where the summation is taken over all (m(m +1)/2)(n(n + 1) / 2) possible genotypes. The maximum likelihood estimate of a population genotype frequency is simply the sample genotype frequency (Sham, 1998). Computation of L 1 When not assuming linkage equilibrium, in the case that an individual is heterozygous for both loci, the haplotypes cannot be deduced from the genotype. For example, the genotype A,- A j // Bk B, can be made up of the haplotype pairs Ain /A j B, or A j Bk / AiBl. In order to estimate the haplotype frequencies, we use the criterion of 26 maximizing the likelihood of observing genotype data, L ,. From the discussions in the preceding subsection, we know 1nL1: 2 can 1n(gijk1), ',j l,...,m l l ,n where Cijkl are the observed genotypic counts, gijkl are the estimated genotype frequencies, which are calculated from the estimated haplotype frequencies under the assumption of random mating as follows: giikk = hikhik = hik 2 g iikl = hik hi1 + hilhik = 2hilhik gijkk = hikhjk + hjkhik = Zhikhjk 8in = hikhjl + hjlhik + hilhjk + hjkhil = 2(hikhjl + hjkhil)- In the following, we use an iterative method of counting based on the Expectation-Maximum algorithm to obtain the maximum likelihood estimation of haplotype frequencies. The EM algorithm is a numerical method of finding maximum likelihood estimates for parameters given incomplete data. It begins by setting the initial haplotype frequencies as haw , u = l,...,m,v = l,...,n. It is reasonable to set these initial haplotype frequencies as the products of the allele frequencies, just as the haplotype frequencies under the assumption of Hardy-Weinberg equilibrium and linkage equilibrium. The count of genotype A, A J- // Bk B, can be considered as the sum of 2 unobserved counts: Cijkl = Cik,jl +c,,,jk , where Cijkl is the count of genotype A, A j // Bk 8,, Cik.jl is the number of individuals with haplotype pair A,Bk / A j B, , and 27 Cil,jk is the number of individuals with haplotype pair AiB, / A jBk . For any heterozygous genotype AiA j // Bk 8,, the initial expected values of the unobserved counts of haplotype pairs are calculated as follows: Under Hardy-Weinberg equilibrium, gik,jl = hik h jl and gm], = h,,hjk , where gm], and g i,‘ jk are the relative frequencies of the genotypes obtained from the haplotype pairs Ain / A 1B, and AB, / A jBk respectively. Therefore, the fraction of genotype A, A j // Bk B, that is obtained by the haplotype pair Ain / A 1B, is hik,0 th,0 , and the count ofhaplotype pair A-Bk / A -B, is hik,0h 21,0 + hil,0hjk,0 I J Cijkl (hik,0hjl,0) hik,0hjl,0 + hil,0hjk,0 Cik,jl,0 = Similarly, the count of haplotype pair AiB, / A jBk is Cijkl (hi1,0hjk,0) Cil,jk,0 = - hik,0h 11,0 +hil,0hjk,0 This step of computing the expected counts of haplotype pairs is called expectation step. Using the initial expected values as real data, we can compute the number of heterozygous haplotypes and get a set of revised haplotype counts as follows: 6qu = Zcpqsno ’ s=l,...,m t=l,...,n where p = 1,...,m,q = 1,...,n . The revised haplotype frequencies are, therefore, 28 cuv,l h 1 =———, W, Zcuvd u=l,...,,m v=l,...,n where u =1,...,m,v = l,...,n. This step of reevaluating the haplotype frequencies from the relative probabilities of the possible haplotype pairs is called maximization step. The maximum likelihood estimate of population haplotype frequency is simply the sample haplotype fi'equency. In the next iteration, this set of revised haplotype frequencies are used to obtain a set of revised expected values of the unobserved counts, which are denoted as Cik, 1,1 and Cll,jk,l , CIC. C cijkl (hik,l h 21,1 ) ik,jl,l : hik,l hjl,1 + hil,1hjk,l C ijkl (hil,l ’1 jk,1 ) Cil, jk,1 = hik,l hjl,1 + hil,1hjk,1 The cycle of revising the haplotype frequencies, revising the expected values of the unobserved counts, and counting the haplotypes is repeated until the changes in haplotype frequencies from one iteration to the next become negligible, i.e. convergence is reached. These are then the local maximum likelihood estimates of the haplotype frequencies (Sham, 1998). While the EM algorithm guarantees convergence, it is not guaranteed to converge to the global maximum when there are multiple local maximums. To increase the chance of obtaining the global maximum, it is best to try numerous initial values for the haplotype frequencies (Long et al, 1995). In Arlequin, initial values of 100 and more are 29 in order (Schneider et a1, 2004). The set of haplotype frequencies with the highest local maximum likelihood is then used as the final estimation. After the haplotype frequencies are estimated, the genotype frequencies and lnL, are computed in the same way as in the computation ofL,,. Likelihood-ratio test In L0 and In LI have (m — l) + (n — l) and mu —1 estimated parameters, respectively. The likelihood ratio statistic given by S = —2(lnLO — 1n L1) has an asymptotical Chi-square distribution with (mn — 1) — ((m - 1) + (n - 1)) = (m — l)(n - 1) degrees of freedom. A statistically significant P -value suggests rejection of null hypothesis. In the case of small samples with large number of alleles per locus, the Chi-square distribution does not apply to the likelihood ratio distribution. In order to better approximate the underlying distribution of the likelihood ratio statistic, we perform a randomization test. Such test is non-parametric, not based on asymptotic approximation, and applicable to context with few data sets. The procedure is as follows: 1. Permute the alleles between individuals at one locus only. 2. Re-estimate the likelihood of the data L, by the EM algorithm. L0 is unaffected by the permutation procedure. 3. Repeat steps 1-2 a large number of times to get the null distribution of L ,, and therefore the null distribution of S . The P -value is calculated by 30 Z"(Si) ”(809101) 2 "(SH , where S, is the log likelihood ratio of the observed data (Schneider, 2004). P — value = Section 3 Arlequin Implementation of Likelihood Ratio Test and Case Study The Arlequin software package is used to test linkage disequilibrium for the Isle of Wight birth cohort. Exactly the same project file is prepared as the test of Hardy- Weinberg equilibrium. Under “Calculation Settings”, pair-wise linkage disequilibrium test with 16,000 permutations, 100 initial conditions, and a significance level of 5% has been set. Number of permutations sets the number of random permuted samples to generate. 16,000 permutations guarantee to have less than 1% difference with the exact probability in 99% of the cases. Number of initial conditions sets the number of random initial conditions from which the EM is started to repeatedly estimate the sample likelihood. The haplotype frequencies globally maximizing the sample likelihood will be eventually kept. In our project, 100 initial conditions are used. The output of the test is presented below: Pairwise linkage disequilibrium: (Isle of Wight Asthma LD test) Permutation test using the EM algorithm Number of permutations: 16000 Number of initial conditions for EM: 100 Pair(0, 1) LnLHood LD : -788.23438 LnLHood LE : -790.29395 Exact P= 0.04306 +- 0.00149 (16002 permutations done) Chi-square test value= 4.11914 (P = 0.04240, 1 d.f.) Pair(0, 2) LnLHood LD : -839.46460 LnLHood LE : -942.81079 Exact P= 0.00000 +- 0.00000 (16002 permutations done) Chi-square test value=206.69238 (P = 0.00000, 1 d.f.) 31 Pair(1, 2) LnLHood LD : -761.75385 LnLHood LE : -780.76373 Exact P= 0.00000 +- 0.00000 (16002 permutations done) Chi-square test value=38.01978 (P = 0.00000, 1 d.f.) Pair(O, 3) LnLHood LD : -861.22888 LnLHood LE : -958.38812 Exact P= 0.00000 +- 0.00000 (16002 permutations done) Chi-square test value=194.31848 (P = 0.00000, 1 d.f.) Pair(1, 3) LnLHood LD : -777.89398 LnLHood LE : -796.34106 Exact P= 0.00000 +- 0.00000 (16002 permutations done) Chi-square test value=36.89417 (P = 0.00000, 1 d.f.) Pair(2, 3) LnLHood LD : -564.50177 LnLHood LE : -948.85785 Exact P: 0.00000 +- 0.00000 (16002 permutations done) Chi-square test value=768.71216 (P = 0.00000, 1 d.f.) Table of significant linkage disequilibrium (significance level=0.0500): Locus #IOI 1| 2| 3| 0|*+++ 1|+*+ 2|++* 3|+++ *++ Loci 0, 1, 2, and 3 refer to SNPs hCV8932056-10, hCV15862743-10, hCV8932053-10, and hC V8932052-1 0 respectively. As shown above, the P -values of all the pair-wise tests are less than 0.05, suggesting rejection of null hypothesis; namely, all the pairs of loci are in linkage disequilibrium. 32 CHAPTER 4 HAPLOTYPE FREQUENCY ESTIMATION Section 1 Introduction Haplotype information is an essential ingredient in many analyses of fine-scale molecular genetics data. For example, haplotype analysis is an important tool for linkage disequilibrium assessment, disease-gene discovery, genetic demography, and chromosomal-evolution studies. However, many haplotype analysis methods rely on phase information from the individuals under study. As mentioned in Section 2 of Chapter 3, for autosomal loci, when only the multi-locus genotypes for each individual are provided, the phase information for those individuals with multiple heterozygous phenotypes is inherently ambiguous. Phase can be established by genotyping family members of each study subject to infer parental chromosomes, but this requires recruitment and genotyping of relatives, which is expensive and may not be realistic. Laboratory techniques have also been used. to determine haplotypes, but these approaches are technologically demanding and often cost-prohibitive. Alternatively, statistical methods can be used to infer phase at linked loci from genotypes and to estimate frequency of all possible haplotypes (Fallin et al, 2000). The problem of unknown phase and one of its possible solutions can be explained in the example of Clark’s algorithm. In Clark’s algorithm, when a homozygote is found, a haplotype is unambiguously identified. When a single-locus heterozygote is found, a possible haplotype pair is inferred. For each of the remaining multi-locus heterozygotes, we need to determine whether it can produce a haplotype that has been established. If it can, identify the complementary haplotype by using the established haplotype as one of the actual haplotypes that it implies. Continue this process until the phase information 33 for all individuals is either resolved or identified as unresolved. This algorithm is intuitively appealing and effective in resolving haplotypes when the dataset contains a sufficient number of homozygous individuals. It also performs well for relatively small sample sizes. However, three problems can arise with this procedure. It may not be possible to start the iterative algorithm if there is no unambiguous or single-locus heterozygous individuals in the sample. There may be unresolved haplotypes left at the end. In addition, haplotypes may be erroneously inferred if a crossover product of two actual haplotypes is identical to another true haplotype (Clark, 1990). Compared with Clark’s algorithm, the Bayesian method is more accurate in inferring haplotype information and can handle more loci. For the Isle of Wight birth cohort, we estimate the haplotype frequencies using the PHASE software package that implements the Bayesian algorithm. The Bayesian algorithm will be introduced in the following section. 34 Section 2 Bayesian Estimation Method for Haplotype Frequency Suppose there is a sample of n diploid individuals from a population. Let G = (G1,...,G,,) denote the (known) genotypes for the individuals, and let H = (H 1 ,...,H n) denote the (unknown) corresponding haplotype pairs. The Bayesian algorithm regards the unknown haplotypes as unobserved random quantities and aims to evaluate their conditional distribution given the genotype data. Gibbs sampling, a type of Bayesian approach, is used to obtain an approximate sample from the posterior distribution of H given G, Pr(H | G) . Inforrnally, the algorithm starts with an initial guess 11“” for H, repeatedly chooses an individual at random, and estimates that individual’s haplotypes under the assumption that all the other haplotypes are correctly reconstructed. Repeating this process enough times results in an approximate sample from Pr(H | G) . Formally, this method involves constructing a Markov chain 14“” , H”) , Hm, ..., with stationary distribution Pr(H | G), on the space of possible haplotype reconstructions, using an algorithm of the following form. Start with some initial haplotype reconstruction 11“”. Fort = 0, 1, 2, obtain H ("1’ from H m using the following three steps: 1. Choose an individual, i, uniformly and at random from all ambiguous individuals (i.e., individuals who are heterozygous at more than one loci). 2. Sample Hy“) from Pr(H, |G,H£ti)), where H_- l is the set of haplotypes excluding individual i. The conditional probability Pr(H i IG, H S?) depends on the genetic and demographic models. 3. Set H§t+1)=H§-t) forj =1, n,j :r-‘i. 35 It has been proved that this process produces a Markov chain (Gilks et al, 1996). To help illustrate Gibbs sampling, let us consider a simple example. Suppose that a random sample of two individuals from a population have been genotyped for two loci A and B. We have G = {G1,G2} , where G, ={A1A2 //Ble} and G2 = {A3A4 // B3 B4}. The corresponding haplotype pair H, is either {A181 “1232} or {A182 / A231 } . Similarly, the corresponding haplotype pair H 2 is either {A383 /A4B4} or {A384 /A4B3}. The purpose is to find the distribution of H1 and H2 conditional on G , i.e. Pr(Hl ={A1B1/A282}|G)and Pr(Hl ={A132/A231HG) and Pr(H2 = {A383 /A4B4} | G) and Pr(H2 = {A384 /A4B3} I G). From a certain genetic model we know the conditional probabilities Pr(H1 | G, H _1) and Pr(H2 I G, H_2), in this case, i.e. Pr(HI |G,H2)and Pr(H2 IG,H1). Starting with an initial guess H (O) for H , say 0 0 0 H‘ ’=(H,‘ ’,H§ ))=({AiBz/A281}. {A383/A4B4I), we obtain H (I) from 11(0) as follows: 1. Choose an individual uniformly and at random from G . Say we get individual #1, i.e. 1111(0) . 36 2. Since the genetic model gives Pr(H1 IG, H gm) and we have H :30) = {A3 B3 / A484 } , thus we know the distribution of H1 conditional on H 50) , say Pr(Hl = {A132 /AzBl } I 0,115") = {A333 /A4B4}) = 0.2 and Pr(Hl .—. {A181 /A2132} | G, 1,5,0) = {A383 /A4B4}) = 0.8. Randomly drawing H 1(1) on the basis of this distribution, we obtain, say, H1“) = A181 /AZBZ. 3. Set Hg) = H50). So far we obtain a new sample Ha) =(H1(1), Hg”) =({A131 /AZBZ}, {A383 /A4B4 }). Next, by repeating the above steps, we obtain H (2) from H (1) : 1. Choose an individual uniformly and at random from G . Say we get individual #2 this time. 2. Since the genetic model gives Pr(H2 | G, H11), we know the distribution of H 2 conditioned on H 1(1) , say Pr(H2 .-. {A384 /A4B3} I 0,111“) = {/11132 /AZB, }) = 0.4 and Pr(I-I2 = {A383 /A4B4} l G,H1(1) = {.11132 /A;,191 }) = 0.6. 37 Again, we randomly draw H $2) on the basis of this distribution, say we Obtain 2 Hg ) = A384 /A4B3. 3. Set 111(2) = H1“). Now we Obtain another sample HQ) =(r11‘2),Ir§2))=({A,B2 /AZB, }, {A384 /A4B3 }). Let this process continue for a large number of times, say 10,000 times. We obtain a sample of 10,000 H 's , from which the distribution of H l and H 2 conditional on G, Pr(H, | G) and Pr(H2 | G) can be calculated. As shown by Stephens et al., the Bayesian method has three major advantages over Clark’s method and the EM algorithm: increased accuracy, wider applicability (for instance, it can handle a large number of loci), and the facility to assess accurately the uncertainty associated with each phase call (Stephens et al, 2001). 38 Section 3 PHASE Implementation of Bayesian Estimation and Case Study The PHASE software package is used to estimate the haplotype frequencies for the children of Isle of Wight. PHASE implements the Bayesian statistical method for reconstructing haplotypes and estimating haplotype frequencies from genotype data, and for estimating recombination rates and identifying recombination hotspots (Stephens et a1, 2003). The program is available on line at http://www.stat.washington.edu/stephens/ softwarehtml. We want to choose the SNP pairs that are in linkage equilibrium to conduct haplotype analysis. Although all of the SNP pairs are in linkage disequilibrium, since the combination of SNPS hC V893205 6 and hCV8932052 is found to be most informative by the preliminary study, we choose these two SNPS to perform haplotype frequency estimation on the basis of eight phenotypes: asthma at 1 or 2 years, asthma at 4 years, currently diagnosed asthma at 10 years, wheezing at 1 or 2 years, wheeze at 4 years, wheeze at 10 years, chronic asthma, and no symptom at all ages. PHASE input file specifies how many individuals there are to be analyzed, how many loci each individual has been typed at, what physical positions the loci are, what sort of loci these are (SNP or microsatellite), and the ID and the genotypes for each individual. A SAS program (Appendix B) has been written to obtain the information required by PHASE input file. The children who have complete information on SNPs hC V893205 6 and hCV8932052 and are positive (i.e. “2” if the phenotype has indices “1” and “2”, and “1 ” if the phenotype has indices “0” and “1”) within each phenotype are first selected. There are a total of 164 such children in the group “asthma at 1 or 2 39 years”, 99 in “asthma at 4 years”, 92 in “currently diagnosed asthma at 10 years”, 48 in “wheeze at 1 or 2 years”, 145 in “wheeze at 4 years”, 228 in “wheeze at 10 years”, 34 in “chronic asthma”, and 308 in “no symptoms at all age”. Next, the individual genotypes for the two loci are output on two separate lines according to PHASE’S instruction. In fact, it is the same way as Arlequin. It should also be noted that the sequence of the loci must be kept in the order of their physical positions. As an example, the input file (with partial raw data) for the phenotype group “asthma at 1 or 2 years” of our project is displayed as follows: 164 2 P 128122939 128126575 SS 8 CG CG 16 CA TG 1510 CG CG 1524 CG TG This presentation says that there are 164 children (who have asthma at 1 or 2 years) typed at two loci (hC V893205 6 and hCV8932052) whose relative positions along the chromosome are 128122939 and 1281265 75, and which are bi-allelic. The genotype information then follows, with the ID and two lines for each child. For example, the individual #8 has the genotype pattern CC//GG. PHASE produces a number of output files: a summary report and additional 40 reports whose suffixes indicate the contents of the file. “_freqs” estimates the sample haplotype frequencies; “_pairs” lists the most likely pairs of haplotypes for each 6 individual, together with their probability; ‘ recom” contains estimates of recombination parameters across the region; and “_monitor” measures the goodness of fit of the estimated haplotypes to the underlying model. Among them the output file “_freq” directly pertains to the research purpose of the Isle of Wight Birth Cohort Study. The output files “_freq” of the eight groups of children are consolidated and presented in Table 3. Figure 5 produced by a Matlab program (Appendix C) visualizes the haplotype frequency estimation in Table 3. In the graph, a12 stands for the phenotype of “asthma at 1 or 2 years”, a4 for “asthma at 4 years”, cdalO for “currently diagnosed asthma at 10 years”, w12 for “wheeze at 1 or 2 years”, w4 for “wheeze at 4 years”, w10 for “wheeze at 10 years”, chronic for “chronic asthma”, and control for “control”. It can be clearly seen that haplotype CG is found in all of the eight phenotypes with high probabilities ranging from 0.663015 to 0.790994. The other haplotypes are also found in all the other phenotypes, but with very low probabilities. In order to view better the relationship between each haplotype and phenotype, graphs of the relative frequencies of each haplotype vs. each phenotype are shown in Figure 6, which is produced by Matlab (Appendix C). 41 Table 3 Haplotype Frequency Estimation for SNPs hCV8932056 and hCV8932052 by Phenotypes Phenotypes Haplotype E (freq) SE Asthma at l or 2 years CA 0.080212 0.004531 (n=164) CG 0.706373 0.004531 TA 0.124056 0.004531 TG 0.089359 0.004531 Asthma at 4 years CA 0.070573 0.005364 (n=99) CG 0.742559 0.005364 TA 0.126397 0.005364 TG 0.060472 0.005364 Currently diagnosed asthma at 10 years CA 0.040528 0.003686 (n=92) CG 0.790994 0.003686 TA 0.117081 0.003686 TG 0.051398 0.003686 Wheeze at 1 or 2 years CA 0.060431 0.009539 (n=48) CG 0.679152 0.009539 TA 0.158319 0.009539 TG 0.102098 0.009539 Wheeze at 4 years CA 0.074331 0.004659 (n=145) CG 0.763600 0.004659 TA 0.104979 0.004659 TG 0.057090 0.004659 Wheeze at 10 years CA 0.066354 0.003643 (n=228) CG 0.753822 0.003643 TA 0.117857 0.003643 TG 0.061968 0.003643 Chronic asthma CA 0.116397 0.014663 (n=34) CG 0.663015 0.014663 TA 0.148309 0.014663 TG 0.072279 0.014663 Control CA 0.071836 0.003412 (n=308) CG 0.7301 12 0.003412 TA 0.124593 0.003412 TG 0.073459 0.003412 42 Distribution 0.9 — I I a "—1 0.8 _ a _ : I 0.7 _ l - 0.6 ~ Frequency 0 U! I 04~ — 0.3 '- ’I CA TG TG 02”TG ‘ CA CA “ TA TA I TA 01— _ CG I CG CG a1 2 a4 w1 2 w4 M 0 chronic control Figure 5 Haplotype Frequency Estimation for SNPs hCV8932056 and hCV8932052 by Phenotypes As shown in Figure 6, haplotype CG is prominently related with the phenotype of “currently diagnosed asthma at 10 years” with relative frequency of 0.791. It has the least relationship with the phenotype of “chronic asthma” with relative frequency of 0.663. Haplotype TA is found in “wheeze at l or 2 years” with the highest relative frequency of 0.158 and in “wheeze at 4 years” with the lowest relative frequency of 0.106. Haplotype CA is found in “chronic asthma” with the highest relative frequency of 0.116 and in “currently diagnosed asthma at 10 years” with the lowest of relative frequency of 0.04. Haplotype T G is found in “wheeze at 1 or 2 years” with the highest relative frequency of 0.102 and in “currently diagnosed asthma at 10 years” with the lowest relative frequency of 0.05. 43 0.14 0.12 0.1 Frequency .0 O Cb 0.06 0.04 0.02 0.82 0.8 0.78 0.76 Frequency .0 N b .0 N N 0.7 0.68 0.66 0.64 CA H—l l 1 L cda 1 0 W1 2 W4 CG I 1 W10 chronic control 1 I cda10 w12 44 1 MO chronic control TA 0.17 I T I I I I l I T 0.16— ’ ~ 1t 0.15~ __ a It >0.‘I4* _. U C O 8 .l- 8 L‘-o.13~ a , , - 0.11 ~ - 01 l l l 1 l I 1 a12 a4 cda10 w12 W4 w10 chronic control TG 0.12 I T I I If I I l 0.11- a 0.09 e 2 Frequency 0 O a) I J . l 0.07 - .. 0.06 — l j, - 0.05 e E _ I 1 L 1 1 T 1 1 812 84 cdaio w12 w4 w10 chronic control 0.04 Figure 6 Relationship between Haplotypes and Phenotypes 45 In order to examine whether there exist any statistically significant associations between the haplotypes and the phenotypes, we conducted a contingency test. The contingency tables of the haplotype counts of each infected group vs. the haplotype counts of the control group are first formed, which are shown in Table 4-Table 10. Table 4 Two-way Contingency Table of Asthma 1 or 2 vs. Control CA CG TA TG A12 13 1 16 20 15 control 22 225 38 23 Table 5 Two-way Contingency Table of Asthma 4 vs. Control CA CG TA TG a4 7 74 13 6 control 22 225 38 23 Table 6 Two-way Contingency Table of CDA10 vs. Control CA CG TA TG CDA10 4 73 1 1 5 control 22 225 38 23 Table 7 Two-way Contingency Table of Wheeze 1 or 2 vs. Control CA CG TA TG w12 3 33 8 5 control 22 225 38 23 Table 8 Two-way Contingency Table of Wheeze 4 vs. Control CA CG TA TG W4 1 1 1 1 l 15 8 control 22 225 38 23 Table 9 Two-way Contingency Table of Wheeze 10 vs. Control CA CG TA TG w10 15 172 27 14 control 22 225 38 23 Table 10 Two-way Contingency Table of Chronic vs. Control CA CG TA TG chronic 4 23 5 2 control 22 225 38 23 46 A SAS program has been written to perform the contingency tests. The SAS code has performed the contingency test between the haplotype CA/CG/T A/T G of asthma at 1 or 2 and control is shown in Appendix D. Below is the test report for the group of asthma at 1 or 2 years vs. the control group: Contingency Test between Asthma at 1 or 2 and Control (Haplotypes: CA/CG/T A/T G) Statistics for Table of group by haplotype Statistic DF Value Prob Chi-Square 3 0.5449 0.9089 Likelihood Ratio Chi-Square 3 0.5371 0.9107 Mantel-Haenszel Chi-Square 1 0.1329 0.7155 Phi Coefficient 0.0340 Contingency Coefficient 0.0340 Cramer's V 0.0340 Fisher's Exact Test Table Probability (P) 0.0018 Pr <= P 0.8957 Sample Size = 472 Although SAS gives the P- values of several kinds of Chi-square tests, it should be noted that since some of the cells have expected counts less than 5, Chi—square may not be a valid test. Instead, Fisher’s exact test’s result is used when deciding if there is any association between the haplotypes and the phenotypes. As shown above, the P- value of 0.8957 suggests that the probabilities for the group of asthma at 1 or 2 and the control group are independent of the haplotypes CA/C G/TA/T G. Table 11 summarizes the seven contingency tests. As we can see, the P-values range from 0.1599 to 0.9914, suggesting that haplotypes CA/C G/TA/T G are not associated with asthma and wheeze symptoms. 47 Table 11 Summary of Contingency Tests (Haplotype CA/CG/TA/TG) Fisher's Exact Test Comparison Groups P-value Sample Size a12 vs. control 0.8947 472 a4 vs. control 0.9914 408 CDA10 vs. control 0.7222 401 w12 vs. control 0.6792 357 w4 vs. control 0.8143 453 w10 vs. control 0.9200 536 chronic vs. control 0.6813 342 Since CG is the most predominant haplotype, it is interesting to know if the less common haplotypes affect asthma and allergy. Thus, similar contingency tests are performed to examine the association between each infected group and the control group for the haplotypes CA/T A/T G only. The test results shown in Table 12 suggest that haplotypes CA/T A/T G are not associated with asthma and wheeze symptoms either. Table 12 Summary of Contingency Tests (Haplotype CA/TA/TG) Fisher's Exact Test Cormarison Groups P-value Sample Size a12 vs. control 0.8953 131 a4 vs. control 0.9574 109 CDA10 vs. control 0.8092 103 w12 vs. control 0.8814 99 W4 vs. control 0.8109 117 w10 vs. control 0.9744 139 chronic vs. control 0.7817 94 48 CHAPTER 5 CONCLUSIONS This thesis has discussed three bio-statistical concepts: Hardy-Weinberg equilibrium, linkage disequilibrium, and haplotype reconstruction and haplotype frequency estimation, and their applications in the Isle of Wight cohort study which aims to identify genetic susceptibility loci for allergic asthma. Hardy-Weinberg equilibrium has been tested through the exact test implemented by the Arlequin software package. It is found that all the loci of interest (h C V893205 6, hC V1 5 862 743, hCV8932053, and hCV8932052 on the IL13 gene) are in Hardy-Weinberg equilibrium. This test result provides a valid assumption for the test of linkage disequilibrium. Linkage disequilibrium between pair-wise loci is tested with a likelihood-ratio test, whose key procedure is the Expectation-Maximum algorithm, implemented by the Arlequin software package. It is found that all the pair-wise loci are in linkage disequilibrium. The PHASE software package, which implements the Bayesian estimation method, is used to reconstruct haplotypes and estimate haplotype frequencies of the most informative SNP pair, hC V893205 6 and hCV893205. The subsequent contingency test suggests that there is no association between the haplotype patterns CA/CG/T A/T G and CA/T A/T G and allergic asthma. 49 APPENDICES 50 APPENDIX A SAS CODE TO CREATE INPUT FILES FOR ARLEQUIN data thesisl .hweldl ; set thesis.asthmagroups; ct_dash=0; if C8932056_lO="-" then ct_dash=ct_dash+1; if C8932056_10="-" then C8932056_10="??"; if C 1 5862 743_1 0="-" then ct_dash=ct__dash+1; if C15862743_10="-" then C15862743_10="??"; if C8932053_10="-" then ct_dash=ct_dash+1; if C8932053_10="-" then C8932053_10="??"; if C8932052_10="-" then ct_dash=ct_dash+1; if C8932052_10="-" then C8932052_10="??"; if ct_dash>=l then delete; freq=1; if CDA10="0"|CDA10="1"|asthm312="1"lasthmal2="2"|asthma4="1"|asthma4="2"|wheez e12="1"|wheezel2="2"|wheeze4="1 "|wheeze4="2"|wheeze10="1"|wheeze10="2"|nosym ptoms="0"|nosymptoms="1" then output; run; data thesis.hweld2; set thesis.hweldl; c11=substr(C8932056_10,1,1); c21=substr(C15862743_10,1,1); c31=substr(C8932053_10,1,1); c41=substr(C8932052__10,1,1); c12=substr(C8932056_10,2,1); c22=substr(C] 5862743__10,2,1); c32=substr(C8932053_10,2,1); c42=substr(C8932052_10,2,1); SNP1=trim(c11) || trim (c21) || trim (031) || trim(c41); SNP2=trim(c12) || trim (c22) || trim (C32) || trim(c42); run; data thesis.hweld3; set thesis.hweld2 nobs=nobs; filename fn 'C:\ hweld.txt'; file fn; put@20IDfreq@30cll"c21"C31”c41/@30012"c22"c32"c42; run; 51 APPENDIX B SAS CODE TO CREATE INPUT FILES FOR PHASE data thesi52.a12_1; set thesisZasthmagroups; ct_dash=0; if C8932056_10="-" then ct_dash=ct_dash+1; if C8932056_10="-" then C8932056_10="??"; if C8932052_10="-" then ct_dash=ct__dash+1; if C8932052_10="-" then C8932052_10="??"; if ct_dash>=1 then delete; if asthma12="2" then output; run; data thesis2.a12_2; set thesis2.a12_1; cl 1=substr(C8932056_10,1,1); c21=substr(C8932052_10,1,1); c12=substr(C8932056_10,2,1); c22=substr(C8932052_10,2,1); SNP1=trim(c11) || trim(c21); SNP2=trim(c12) || trim(c22); run; data thesi52.a12_3; set thesi52.a12_2 nobs=nobs; filename fn 'C:\snp56_52_al2_2_no_mis.txt'; file fn; put ID / SNPl /SNP2; run; 52 APPENDIX C MATLAB CODE TO CREATE FIGURES OF HAPLOTYPE FREQUENCY % For each haplotype combination, plot freq vs. subject group clear all; close all; freq_fii_label= { 'ast_grp_al 2_no__mis.out__freqs', 'a12'; 'ast_grp_a4_no_mis.out_freqs', 'a4'; ‘ast_grp_CDA 1 0_no_mis.out_freqs', 'CDA10'; 'ast_grp_wl 0_no_mis.out_freqs', 'WI 0'; 'ast_grp_w 1 2_no_mis.out__freqs', 'w12'; 'ast_grp_w4_no_mis.out_freqs', 'w4'; 'ast_grp_chron_no_mis.out_freqs', 'chron'; 'ast_grp_control_no_mis.out__freqs', 'control' }; [no_fn tmp]=size(freq_fn_label); for fn_ct=1 : no_fi1 %check the number of lines in the file fid=fopen(char(freq_fn_label(fn_ct,1))); data_ct=0; while ~feof(fid) data_line=fgetl(fid); data_ct=data_ct+1 ; end fclose(fid); %Do not count the first head line freq_data( fir_ct).no_data=data_ct—l ; fid=fopen(char( freq_fn_label(fi1_ct, 1 ))); % Get rid of the first line fgetl(fid); % Read data for data_ct=1 : freq_data(fii_ct).no_data; freq_data(fn_ct).idx(data_ct)=fscanf(fid, '%i',1); freq_data(fn_ct).haplotype {data_ct} =fscanf(fid, '%l Os',1 ); freq_data( fii_ct).freq_ave(data_ct)=fscanf(fid, '%f’,1); freq_data( fii_ct).freq_se(data_ct)=fscanf(fid, '%f‘, 1 ); end 53 fclose(fid); %Sort the data w.r.t. freq ave [sort_data sort_idx]=sort(-freq_data(fn_ct).freq_ave); freq_data(fn_ct).idx=freq_data(fn_ct).idx(sort_idx); freq_data(fn_ct).haplotype=freq_data(fn_ct).haplotype(sort_idx); freq_data(fir_ct).freq_ave=freq_data(fn__ct).freq_ave(sort_idx); freq_data(fii_ct).freq_se=freq_data(fn_ct).freq_se(sort_idx); end %Enumerate all haplotypes haplotype_set=freq_data( 1 ).haplotype; no_haplotype=freq_data( 1 ).no_data; for i=2 : no_fn for j=1 : freq_data(i).no__data b_find=0; for k=l : no_haplotype if strcmp( char(haplotype_set(k)), char(freq_data(i)haplotype(j)) ) b_find=1; break; end end % Add to haplotype_set if the same haplotype is not found in haplotype_set if ~ b_find no_haplotype=no_haplotype+1 ; haplotype_set(no_haplotype)=freq_data(i).haplotype(j); end end end %Construct the structure to hold the freq_ave and freq_se for each haplotype in each file for fn_ct=1 : no_fii haplotype(fi1_ct). freq_ave=zeros(1 ,no_haplotype); haplotype( fn_ct). freq_se=zeros( 1 ,no_haplotype); for i=1 : freq_data(fii_ct).no_data for j=1 :no_haplotype if strcmp( char(haplotype_set(j», char(freq_data(fn__ct).haplotype(i)) ) haplotype(fn_ct).freq_ave(j)=freq_data(fii_ct).freq_ave(i); haplotype(fn_ct).freq_se(j)=freq_data(fn_ct).freq_se(i); break; end end end end %plot 1 54 bar_freq=zeros(no_fn,no_haplotype); for fn_ct=1 : no_fn for j=1 : freq_data(fir_ct).no_data bar_freq(fn_ct,j)=freq_data(fi1_ct).freq_ave(i); bar_label_pos(fn_ct,j)=sum(freq_data(fn_ct).freq_ave(1 : j )); end end figure(1); %Draw the bars bar(bar_freq,.3,'stack'); %Draw the labels for fn_ct=l : no_fn for j=1 : freq_data(fi1_ct).no_data text(fii_ct-.8,j/ 16,char(freq_data(fi1_ct)haplotype(j))); end end %hold on; title('Distribution'); ylabel('Frequency (%)'); fn_fig=['distrib.jpg']; set(gca,'XTick',[ 1 :no_fii],'XTickLabel',freq_fi1_label(:,2)'); print(fii_fig,'-djpeg'); %plot 2 for i=1 :no_haplotype figure(2); for j=1 :no_fn y(j)=haplotype(j).freq_ave(i); e(j)=haplotype(j).freq_se(i); end errorbar([1 :no_fn],y,e,'X'); title(char(haplotype_set(i))); ylabel('Frequency (%)'); set(gca,'XTick',[] :no_fn],'XTickLabel',freq_fii_label( :,2)'); fii_fig=[char(haplotype_set(i)) '.jpg']; print(fi1_fig,'-djpeg'); end 55 APPENDIX D SAS CODE TO CONDUCT CONTINGENCY TESTS data thesis3.contingency; do group=1 to 2; do haplotype=1 to 4; input count @@; output; end; end; cmdg 13 116 20 15 22 225 38 23 9 proc freq data=thesisl .contingency; weight count; tables group*haplotype / exact; title 'Contingency Test between Asthma at 1 or 2 and Control (CA/CG/TA/T G)’; run; 56 Symbols A, B, A,, i=1, m , 1 =1, m c,, i=1, m gjj,i,j:1, ...,m gut-1. if = 1. m, k,j = 1, n, gik ._,'1 Ci}, l,j =1, ..., m 60],], l,j :1, ..., m, k,j = I, n, Cik./’I hi1“, l,j=1, ...,m APPENDIX E NOTATIONS Notations Loci Alleles at locus A with m alleles Relative frequencies of alleles at a locus with m alleles Counts of alleles at a locus with m alleles Relative Frequencies of genotypes at a locus with m alleles Relative frequencies of genotypes at two loci which have m alleles and n alleles respectively. Relative frequencies of genotypes obtained from haplotype pair Ain /AJ-B, at loci A and B. Counts of genotypes at a locus with m alleles Counts of genotypes at two loci which have m alleles and n alleles respectively. Counts of genotypes obtained from haplotype pair Ain / A j B, at loci A and B. Relative frequencies of haplotypes at a locus with m alleles 57 BIBLIOGRAPHY 58 BIBLIOGRAPHY Clark, A. G. Inference of haplotypes from PCR-arnplified samples of diploid populations. Molecular Biology Evolution. 1990: 7(2): 111-122. Crow, J. F. Eighty years ago: the beginnings of population genetics. Genetics 1988 Jul: 119 (3): 473-6. Excoffier, L. & Slatkin, M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology Evolution. 1995: 12(5): 921-927. Fallin, D. & Schork, J. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. American Journal of Human Genetics 2000: 67: 947-959. Gilks, W. R., Richardson S., & Spiegelhalter, D. J. Markov Chain Monte Carlo in Practice. London: Chapman & Hall, 1996. GlaxoSmithKline. Genetics at GlaxoSmithKline. [On-Line]. Available: http://genetics.gsk.com/link.htm, 2004. Guo, S. W., & Thompson, E. Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 1992: 48: 361-372. Kurukulaaratchy, R. J ., Fenn, M. H., Waterhouse L. M., Matthews, S. M., Holgate, S. T., Arshad, S. H. Characterization of wheezing phenotypes in the first 10 years of life. Clinical and Experimental Allergy 2003: 33: 573-578. Kurukulaaratchy, R. J ., Matthews, S., Holgate, S. T., Arshad, S. H. Predicting persistent disease among children who wheeze during early life. European Journal of Respiratory Diseases 2003: 22: 719-720. Kurukulaaratchy, R. J ., Matthews, 8., Waterhouse, L., Arshad, S. H. Factors influencing symptom expression in children with bronchial hyperresponsiveness at 10 years of age. The Journal of Allergy and Clinical Immunology 2003: 112: 311-316. Long, J. C., Williams, R. C. & Urbanek, M. An E-M algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics 1995: 56:799-810. Schneider, S., Roessli, D., & Excoffier, L. Arlequin ver. 2.000: A software for population genetics data analysis. Genetics and Biometry Laboratory, University of Geneva, Switzerland, 2000. Schork, N. J ., Fallin, D., & Lanchbury, S. Single nucleotide polymorphisms and the future of genetic epidemiology. Clinical Genetics 2000: 58:250-264. 59 Sham, P. Statistics in Human Genetics. New York: Oxford University Press, 1998. Stephens, M., Smith, N. J ., & Donnelly, P. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 2001: 68: 978-989. Stephens, M., Smith, N. J ., & Donnelly, P. Documentatin for PHASE ver.2.02. Department of Statistics, University of Washington, Seattle, WA, 2003. 60 Ilflll‘lljllllllzlll’lll[llljlfl A. ‘ '