This is to certify that the dissertation entitled

IDENTIFICATION OF GENE-SPECIFIC SINGLE NUCLEOTIDE POLYMORPHISMS WITHIN THE CANINE GENOME AND THEIR USE TO DETERMINE NUCLEOTIDE DIVERSITY AND INBREEDING COEFFICIENTS WITHIN THE CANINE GENOME

presented by JAMES A. BROUILLETTE, MD has been accepted towards fulfillment of the requirements for the Ph.D. degree in Genetics.

IDENTIFICATION OF GENE-SPECIFIC SINGLE NUCLEOTIDE POLYMORPHISMS WITHIN THE CANINE GENOME AND THEIR USE TO DETERMINE NUCLEOTIDE DIVERSITY AND INBREEDING COEFFICIENTS WITHIN THE CANINE GENOME

By

James A. Brouillette, MD

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Genetics

2010

ABSTRACT

The domestic dog, Canis familiaris, has lived in close relationship with man for thousands of years, working as a hunter, herder, and loyal companion. As a consequence of selective breeding, many dog breeds have a high prevalence of genetic diseases. Genetic studies are ongoing to elucidate the nature of disease-causing mutations within the various dog breeds. In the work presented here, I describe a method to identify single nucleotide polymorphisms (SNPs) within canine genes of interest by pooling and sequencing DNA from across ten breeds of dog.
The SNP markers generated are used to estimate heterozygosity within the canine genome, to demonstrate that markers generated by across-breed SNP identification will be heterozygous within breeds, and to estimate the coefficient of inbreeding within three breeds of dog.

DEDICATION

This manuscript is dedicated to my sons Jacob and Andrew, who made as many sacrifices as I did in order to complete this degree. I love you and appreciate your patience during the long course of my graduate education. I also dedicate this work to my wife Tammie, without whom this work would not have been completed. Thank you for your tireless support.

ACKNOWLEDGEMENTS

I would like to thank my advisors Patrick Venta and Vilma Yuzbasiyan-Gurkan, whose enthusiasm and support made this work possible. Your knowledge, inquisitiveness, and energy were greatly appreciated. I would also like to thank Donna Housley, David Entz, Tracy Hammer, Sarah Colombini, Margo Machen, Tiffy Zachos, Susan Ewart, Martha Mulks, Betty Werner, Cheryl Swensen, John Kruger, and Simon Petersen-Jones for the many discussions and problem-solving sessions along the way. Finally, I would like to thank Kathy Lovell for shepherding this manuscript to completion. This project would not have been completed without you.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1
INTRODUCTION
  Finding Canine Disease Genes
  Evolution of Canine Genetic Markers
  The Canine Genetic Map
  The SNP Markers
  Searching for SNPs in the Human Genome
  SNPs and Gene Mapping
  Methods of SNP Identification
  Methods of SNP Detection
  Canine Breed History
  Appendix 1-2 Breed History

CHAPTER 2
GENE-SPECIFIC UNIVERSAL MAMMALIAN SEQUENCE TAGGED SITES: APPLICATION TO THE CANINE GENOME
  Abstract
  Introduction
  Materials and Methods
  Results
  Discussion
  Acknowledgements

CHAPTER 3
ESTIMATE OF NUCLEOTIDE DIVERSITY IN DOGS USING A POOL-AND-SEQUENCE METHOD
  Abstract
  Introduction
  Materials and Methods
  Results
  Discussion
  Acknowledgements

CHAPTER 4
WITHIN-BREED HETEROZYGOSITY OF CANINE SINGLE NUCLEOTIDE POLYMORPHISMS IDENTIFIED BY ACROSS-BREED COMPARISON
  Summary
  Introduction
  Materials and Methods
  Results and Discussion
  Acknowledgements

CHAPTER 5
SUMMARY
  Appendix 5-1 Brief Explanation of the Wahlund Effect

BIBLIOGRAPHY

LIST OF TABLES

Appendix 1-1 Summary of Physical Characteristics of Dog Breeds
Appendix 2-5 Amplification Conditions for Canine UM-STSs
Appendix 2-6 Summary of Amplification Results for UM-STSs for Several Mammalian DNAs
Appendix 2-7 Primer Sets for 11 Universal Mammalian Sequence-Tagged Sites
Appendix 2-8 Eighty-Six Universal Mammalian Sequence-Tagged Sites: Human Chromosomal Locations and Names
Appendix 2-9 Eighty-Six Universal Mammalian Sequence-Tagged Sites: Sequences and Sizes
Appendix 3-1 Gene Segments With Amount of DNA Sequenced
Appendix 3-2 Location of SNPs, Types of Base Changes, and Diagnostic Tests
Appendix 3-3 Allele Counts and Heterozygosity for SNPs Found Among Dogs Used in the Sequencing Pool
Appendix 3-4 Genotypes for Four SNPs Within Canine Breeds
Appendix 3-5 Summary of SNPs Identified vs. Nucleotide and Amino Acid Conservation in Loci
Appendix 3-9 UM-STS Primers Used for PCR Amplification and Sequencing
Appendix 3-10 Diagnostic Primers for Testing SNPs
Appendix 4-1 Diagnostic Primers, Restriction Enzymes, and SNP Locations
Appendix 4-2 Genotypes for Dogs from Three Breeds Using Four SNPs

LIST OF FIGURES

Appendix 2-1 Amplification of Several Canine Genes Using UM-STSs
Appendix 2-2 Lineups of Several Canine Gene Sequences With Homologous Mammalian Genes
Appendix 2-3 Sequence of a Portion of the FES Proto-Oncogene from Several Mammalian DNAs
Appendix 2-4 Amplification of a Portion of the FES Proto-Oncogene From Several Mammalian DNAs Using UM-STS Primers
Appendix 3-6 Identification of Four SNPs in the Canine TS Gene by Pool-and-Sequence
Appendix 3-7 Identification of Four SNPs in the Canine CFTR Gene
Appendix 3-8 Location of Four SNPs Identified in the Canine TS Gene

KEY TO ABBREVIATIONS

Term                          Abbreviation
Base pair(s)                  bp
centiMorgans                  cM
Deoxyribonucleic acid         DNA
Kilobase(s) (pairs)           kb
Megabase(s) (pairs)           Mb
Micrograms                    µg
Microliters                   µl
Micromolar                    µM
Milliliters                   ml
Millimolar                    mM
Nanograms                     ng
Picograms                     pg
Picomoles                     pmol
Polymerase Chain Reaction     PCR
Ribonucleic acid              RNA
Units                         U

Note: Abbreviations for specific gene names are found in the text.

Chapter 1

Introduction

Domestic dogs and humans have lived in close association for centuries. During that time, man has selectively bred dogs to the point where today there are over three hundred different dog breeds in existence worldwide (Ostrander et al., 2000). Man's understanding of the genetic diseases that occur in domestic dogs has steadily grown over the last few decades. As recently as 1979, only 13 canine disorders had been established as congenital or inherited (Pearson, 1979). That number has rapidly increased. By 1988, more than 200 genetic disorders had been identified in domestic dogs, more than 70% of which were inherited as autosomal recessive disorders (Patterson et al., 1988). With the widespread use of antibiotics, anthelmintics, vaccination against viral diseases, and improved and standardized diets, genetic disease has grown in clinical importance among veterinarians, dog breeders, and dog owners (Patterson, 2000).

At latest count, there are 370 different canine genetic disorders, with 50% of these having breed-specific aggregations (Patterson, 2000; Ostrander et al., 2000). Of these 370 diseases, 215 (58%) have clinical and laboratory abnormalities that resemble a human genetic disease, and more than 70% of these are inherited as autosomal recessive traits, X-linked recessive traits, or have a complex pattern of inheritance (Ostrander et al., 2000).
An additional 5-10 diseases are added to the growing list each year. The price of this genetic disease manifests itself as pain and suffering for the affected dogs, emotional suffering for owners and breeders, and an estimated $500 million per year in costs to diagnose and treat affected animals (Padgett, 1998). Currently, 50 of the 370 canine genetic diseases have been defined at the molecular level (Giger et al., 2006).

Finding Canine Disease Genes

The disease genes that have been identified to date fall into three different categories. The first category comprises those diseases that have an identifiable protein that is absent or not functioning. The search for the mutation in these cases becomes a matter of searching the gene encoding the faulty protein for mutations in the coding or control regions. Factor IX deficiency causing hemophilia B in Cairn Terriers (Evans et al., 1989), von Willebrand factor deficiency causing von Willebrand's disease in Scottish Terriers (Venta et al., 2000), and C3 deficiency causing complement third component deficiency in Brittany Spaniels (Ameratunga et al., 1998) are examples of diseases in this category.

A second category of disease genes comprises those that have been identified by examining candidate genes for the presence of a mutation. A candidate gene is any gene that is suspected to harbor the mutation causing the disease phenotype by virtue of its role in a similar disease in another species or based on some prior knowledge of its biochemical properties. Mutations in the dystrophin gene causing Duchenne's muscular dystrophy in Golden Retrievers (Sharp et al., 1992), the phosphodiesterase 6 beta gene causing rod-cone dysplasia in Irish Setters (Suber et al., 1993), and the phosphodiesterase alpha gene causing rod-cone dysplasia in Cardigan Welsh Corgis (Petersen-Jones et al., 1999) fall into this category.
The final category of genetic diseases contains those diseases that have no obvious defective protein or candidate genes to test as causative agents. In these cases, purely genetic analysis can be performed using linkage analysis in families in which the disease is present or by association analysis in affected individuals. Yuzbasiyan-Gurkan et al. (1997) were able to identify a marker closely linked to the mutation causing copper toxicosis in Bedlington Terriers using linkage analysis. The COMMD1 gene was subsequently cloned, leading to new insights into copper metabolism in mammals and the discovery of a whole new family of related proteins (van de Sluis et al., 2002; Burstein et al., 2005). Similarly, Lin et al. (1999) used linkage analysis to identify a mutation in the hypocretin receptor that causes narcolepsy in Doberman Pinschers and Labrador Retrievers, opening a whole new field of investigation for sleep and sleep-related disorders (Zeiter et al., 2006).

Ostrander and Kruglyak have demonstrated the feasibility of whole-genome association analysis in dogs using computer modeling (Ostrander and Kruglyak, 2000). More recent work has moved association mapping in dogs from a theoretical possibility to a future certainty, as all the necessary groundwork has been done to enable geneticists to undertake association studies in dogs (Kirkness et al., 2003; Lindblad-Toh et al., 2005; Sutter et al., 2004; Ostrander and Kruglyak, 2000; Clark et al., 2004). A recent use of association analysis in dogs resulted in the discovery that the merle coat color is due to a mutation in the SILV gene (Clarke et al., 2006).

Evolution of Canine Genetic Markers

The first genetic markers in canines were identified as protein polymorphisms observed as variants in electrophoretic mobility.
The vast majority of these were structural proteins or enzymes of the various components of the blood, such as isocitrate dehydrogenase, albumin, and hemoglobin (Meera Khan et al., 1973; Weiden et al., 1974; Simonsen, 1976). By today's standards, these markers exhibited little polymorphism and were limited in scope. These markers were soon replaced by RFLP markers. During the 1980s, several RFLP markers were discovered in genes such as DLA-D and DLA-A (Sarmiento and Storb, 1988, 1989). These markers, while more abundant and polymorphic than the protein polymorphisms, were labor and time intensive and required relatively large quantities of DNA.

A major leap forward in canine genetics occurred in the 1990s. Microsatellite markers (simple sequence length polymorphisms, or SSLPs), simple sequence motifs of two to six nucleotides repeated in arrays of varying lengths, were found to be highly abundant throughout mammalian genomes, highly polymorphic, and rapidly typable using PCR. They have the advantage of being present at about 20 kb intervals throughout the canine genome (Yuzbasiyan-Gurkan and Venta, pers. comm.). Their one drawback is that they are only occasionally found associated with coding regions of the canine genome. Our collaborator, Dr. Vilma Yuzbasiyan-Gurkan, previously developed several hundred anonymous SSLP markers throughout the canine genome and used this resource to identify a marker that is tightly linked to the mutation causing copper toxicosis in Bedlington Terriers (Yuzbasiyan-Gurkan et al., 1997). Many other SSLP-type markers have been discovered and developed in other labs (Francisco et al., 1996; Ostrander et al., 1993, 1995).

This work centers on the identification and development of single nucleotide polymorphisms as genetic markers. At the time this work was undertaken, nothing was known about the frequency of occurrence of SNPs in the canine genome (Chapter 3).
Based on data from the human genome, it was likely that SNPs would occur more frequently in the canine genome than SSLP markers (Collins et al., 1997; Nickerson et al., 1998). In addition, the introduction of DNA arrays made high-throughput genotyping of SNP markers feasible (Wang et al., 1998; Chee et al., 1996; Landegren et al., 1998; see below), which would allow for more rapid whole-genome scans of SNP markers.

The Canine Genetic Map

The first maps of the canine genome were published in 1997 (Lingaas et al., 1997; Mellersh et al., 1997; Langston et al., 1997). Lingaas et al. established 16 linkage groups and assigned a total of 43 markers to those 16 groups. Using 17 three-generation families, Mellersh et al. (1997) assigned 139 microsatellite markers to 30 linkage groups, each marker being linked to at least one other marker with a lod score of 3 or greater. The linkage groups ranged in size from 2.3 to 106.1 cM. This map covered an estimated 884.2 cM of the canine genome with an average marker spacing of 14.03 cM. An additional 11 polymorphic markers were not linked to any other marker. Of the 150 markers, 47 were dinucleotide repeats, 102 were tetranucleotide repeats, and one was a hexanucleotide repeat.

In a companion paper, Langston et al. (1997) developed the first canine-rodent somatic cell hybrid panel. This panel contains a total of 43 microcell hybrid clones that each display unique canine chromosome retention patterns and three whole-cell hybrids that contain the X chromosome. They assigned 181 microsatellite markers and 27 canine genes to 31 syntenic groups consisting of two or more markers and/or genes. Many of these markers were also used by Mellersh et al. (1997). Each of the syntenic groups had between 2 and 11 markers. Since the canine karyotype consists of 38 pairs of autosomes plus the X and Y chromosomes (Langston et al., 1997), this does not represent full coverage of the canine genome.
It does, however, represent a substantial portion of the canine genome.

Priat et al. (1998) followed up these mapping projects with a radiation hybrid map of the canine genome. This map contains a total of 400 markers, consisting of 218 gene markers and 182 microsatellite markers. The map contains 347 markers assigned to 57 groups, with an additional 53 markers being unlinked in the current map. The groups contain between 2 and 11 linked markers. The radiation hybrid panel consists of 126 cell lines, and the map is thought to cover about 80% of the canine genome. The work by Priat et al. began the process of integrating the linkage maps of Lingaas et al. (1997) and Mellersh et al. (1997) into a radiation hybrid (RH) map. It also indicates areas of synteny shared by the dog, human, and pig genomes.

The next generation of linkage map was produced by Neff et al. (1999). This map extends the original map of Mellersh et al. (1997) from 150 microsatellite markers to 276 markers divided into 40 linkage groups. Average marker spacing in this map dropped from 14 cM to 9.3 cM. This map is estimated to cover 90% of the canine genome (Neff et al., 1999).

The canine genetic map was integrated into a single map by the assignment of the linkage groups from the RH map (Priat et al., 1998) and linkage map (Lingaas et al., 1997; Mellersh et al., 1997; Neff et al., 1999) to specific canine chromosomes using chromosome painting (Yang et al., 1999). Of the 44 published RH groups and 40 published linkage groups, 39 and 33 groups were assigned to specific chromosomes, respectively. In addition, Yang et al. were able to align chromosomal regions of the canine karyotype with syntenic regions of both the human and red fox karyotypes. Once linkage to a canine marker is established, these syntenic alignments should enable researchers to use cross-species comparison to identify genes in the human comparative map that are candidates for canine genetic diseases (Yang et al., 1999).
A combined RH map of CFA1 at 3 Mb resolution that incorporated SNP markers was published in 2004 (Housley et al., 2004).

Several refinements have been made in the canine genome map since the above reports were published (Werner et al., 1999; Sargan et al., 2000; Mellersh et al., 2000; Richman et al., 2001; Lingaas et al., 2001). At the time, the expanded and integrated RH and linkage map consisted of approximately 800 markers, more than 300 of which were genes (Mellersh et al., 2000; Lingaas et al., 2001; Sargan et al., 2000). The average spacing between markers was 9 cM. In addition, each of the synteny groups had been assigned to specific canine chromosomes (Sargan et al., 2001).

One additional refinement was the characterization of a set of 172 markers for genome-wide screens of the canine genome (Minimal Screening Set 1 [MSS-1]; Richman et al., 2001). This set of markers, all of which were microsatellites, was chosen because the markers provided as complete coverage of the canine genome as was possible, were highly informative, and had been ordered in linkage groups with high statistical certainty (Richman et al., 2001). It had been estimated that 42% of the canine genome was within 5 cM of at least one of these markers, and 77% of the genome was within 10 cM (Richman et al., 2001). While there were some gaps within the canine genetic map, the map taken in total was thought to be sufficient for whole-genome linkage analysis (Richman et al., 2001; Ostrander and Kruglyak, 2000). An extended and improved marker set (MSS-2) was published in 2004 (Clark et al., 2004).

By 2004, Breen et al. had created an integrated FISH and radiation hybrid map of the canine genome (Breen et al., 2004). That map contained a total of 4250 markers, 4100 of which were assigned to linkage groups and to canine chromosomes. The genes were assigned to 60 different linkage groups that could be assigned to the 38 canine autosomes and two sex chromosomes (Breen et al., 2004).
In 2005, the entire canine genome was sequenced from a female Boxer (Lindblad-Toh et al., 2005). This sequence represents 7.5-fold redundancy and is thought to cover 99% of the canine genome. In addition, the sequence data revealed 2.5 million SNP markers within the canine genome. Commercially available SNP arrays have since been developed (Lindblad-Toh et al., 2006).

The SNP Markers

The canine genetic markers that have been developed in this series of studies are single nucleotide polymorphisms, or SNPs. These markers are even more abundant than SSLP markers and are also amenable to typing by PCR amplification. In addition, it has been possible to develop SNPs as type I, or gene-associated, markers (Werner et al., 1999; Sargan et al., 2000). This has the effect of anchoring them in the canine genome and furthers comparative genetics across mammalian species. Early reports of SNP markers in the canine genome began appearing in the mid-1990s in such genes as erythroid aminolevulinate synthase, γ-D-crystallin, and opsin (Boyer et al., 1995; Shibuya et al., 1995; Ray et al., 1996).

SNPs are defined as single base pair substitutions in genomic DNA at which different sequence alternatives exist in normal individuals in some population. The frequency of the most common allele must also be 99% or less (Brookes, 1999). The frequencies of transitions, transversions, and indels are not equal. Two thirds of all SNPs are C → T transitions, while the other third is made up of all the other possible changes (Wang et al., 1998; Brookes, 1999). It has been speculated (Halliday and Grigg, 1993) that the reason for the propensity of C → T changes is that 3-5% of cytosine residues in mammalian genomes are presumed to be methylated. These residues can undergo spontaneous deamination to yield thymine (Halliday and Grigg, 1993). Thus, a methylated cytosine residue gives rise to a thymidine residue. The result is the conversion from a C-G base pair to a T-A base pair.
In contrast, an unmethylated C residue that undergoes deamination yields uracil, which is recognized as foreign to DNA and readily repaired back to the original C residue.

Before this work, the frequency of occurrence of SNPs in the canine genome was unknown. It had been established that in humans, if one randomly analyzes two chromosomes, a SNP will typically be observed to occur once per 1000 bp of DNA (Brookes, 1999). This means that there is a 0.1% chance that any base will be heterozygous in a given individual. Within gene coding regions, the frequency of occurrence of a SNP drops to around 1 in 4000 bp, with half of these changes resulting in non-synonymous changes (Brookes, 1999). These numbers indicated that there would be several million nucleotide differences between any two individuals and around 100,000 differences in their proteomes (Brookes, 1999). This estimate would later be borne out by sequence analysis of the canine genome (Lindblad-Toh et al., 2005).

Before this work, there had not been any publications on a systematic search for SNPs in the canine genome. The human genome project had resulted in a huge amount of human DNA sequence data being available to the scientific community. Several groups have taken advantage of this resource to locate SNPs in the human genome. Two groups (Buetow et al., 1999; Picoult-Newberg et al., 1999) searched the expressed sequence tag (EST) database for SNPs. They scanned the database for multiply sequenced ESTs and examined them for sequence differences. They then went back to the DNA and confirmed that the SNPs did indeed exist in the DNA among the population. Taillon-Miller et al. (1998) used a similar strategy on genomic DNA. They analyzed overlapping clones of genomic DNA and looked for sequence differences. They then analyzed the sequence differences to determine whether they represented sequencing errors or true SNPs.
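As a rough check on the figures quoted from Brookes (1999) earlier in this section, the arithmetic can be written out explicitly. The haploid genome size G of about 3 × 10^9 bp is an assumption supplied here for illustration; it is not a value stated in the text.

```latex
\begin{align*}
P(\text{a given site differs between two chromosomes}) &\approx \frac{1}{1000} = 0.1\% \\
\text{expected nucleotide differences} &\approx G \times \frac{1}{1000}
  = \left(3 \times 10^{9}\right) \times 10^{-3} = 3 \times 10^{6}
\end{align*}
```

Under that assumed genome size, the result is on the order of the "several million" differences cited above.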
Searching for SNPs in the Human Genome

Several large-scale SNP identification projects have been undertaken to establish estimates of the frequency of occurrence of SNPs and to develop methods of identification and genotyping of SNPs once they had been located. Wang et al. (1998) examined 2.3 Mb of DNA from three individuals and a pool of 10 individuals using gel-based sequencing and high-density variation-detection DNA chips. For this study, they selected 1,139 STS sequences for analysis by both sequencing and DNA chip hybridization. They found a total of 279 candidate SNPs distributed across 239 of the STS sequences, yielding roughly one SNP per 1000 bp of DNA screened. Among the SNPs identified, the ratio of transitions to transversions was 2:1. In addition, 25% of changes occurred in CpG dinucleotides even though these made up only 2% of the sequence surveyed. Almost all of the changes were C → T transitions.

This project also involved using DNA arrays to survey STS sequences for SNPs. This was done by establishing 25-bp oligomers in groups of 4, with position 13 of each oligomer representing one of the four bases. By knowing the nucleotide occurring at this position based on the known reference sequence, variability could be detected as a change in the expected hybridization pattern. They identified 2748 SNPs in this manner, with a SNP occurring once every 721 nucleotides. Among these SNPs, the mean heterozygosity was 33% and the mean frequency of the minor allele was 25%.

In addition to the identification of SNPs, Wang et al. used chip hybridization to genotype individuals for the collection of SNPs they had previously identified. They were easily able to simultaneously test for 558 SNPs on one chip. They established two tiles for each SNP, one for each allele. The oligonucleotide arrays again consisted of 25-mers that were complementary to one of the two alleles at position 13.
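The tiling design described above can be sketched as follows. This is a minimal illustration, not the array design software used by Wang et al.: the function names, the example reference sequence, and the signal intensities are all hypothetical, and for simplicity the probes are written in the same orientation as the reference rather than as its complement.

```python
BASES = "ACGT"
PROBE_LEN = 25
CENTER = PROBE_LEN // 2  # 0-based index 12, i.e. position 13 of the 25-mer


def tiling_probes(reference: str, site: int) -> dict[str, str]:
    """Return four 25-mers centred on 0-based `site`, one per central base.

    Simplification (assumption): probes are given in reference orientation;
    a real chip would carry the complementary oligonucleotides.
    """
    start = site - CENTER
    end = site + CENTER + 1
    if start < 0 or end > len(reference):
        raise ValueError("site is too close to the end of the reference")
    window = reference[start:end]
    return {b: window[:CENTER] + b + window[CENTER + 1:] for b in BASES}


def call_base(intensities: dict[str, float]) -> str:
    """Pick the base whose probe gave the strongest (hypothetical) signal."""
    return max(intensities, key=intensities.get)


# Hypothetical 42-bp reference; interrogate the base at 0-based index 20.
reference = "ACGTTGCAAGGCTTACGATCGGATCCATGCAAGTCTAGGCTA"
site = 20
probes = tiling_probes(reference, site)

# Made-up hybridization signals: strongest on the probe matching the reference.
signals = {"A": 0.1, "C": 0.2, "G": 0.9, "T": 0.1}
called = call_base(signals)
print(called, "matches reference" if called == reference[site] else "candidate variant")
```

A sample whose strongest signal falls on a probe other than the reference-matching one would be flagged as a candidate variant at that site, which is the "change in the expected hybridization pattern" referred to in the text.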
The individual's DNA to be hybridized was synthesized using specific PCR primers with uniform sequence on each end to allow batch labeling of all PCR products. They were able to perform multiplex PCR on all 558 loci in a single reaction and make allele determinations for each of 3 individuals tested at 50% of the loci tested. When dividing the loci into 24 sets of 23 primer pairs each, they were able to make allele determinations for all three individuals tested at 92% of the loci tested. They have thus demonstrated the feasibility of using chip hybridization to perform large-scale genotyping of hundreds of SNPs simultaneously.

Another group (Lai et al., 1998) examined the region around the human APO B gene for the presence of SNPs, with results similar to those of the other studies listed. However, they analyzed a contiguous stretch of DNA and confirmed that the development of a high-density SNP map (with SNP markers spaced every 30 kb) was feasible given current technology.

Similarly, Cargill et al. (1999) used DNA chip hybridization along with denaturing HPLC to identify SNPs occurring in the coding regions and adjacent sequences of 106 candidate genes for cardiovascular disease, endocrine disease, and neuropsychiatric disease. They searched a total of 196.2 kb of DNA and identified 392 cSNPs and an additional 168 SNPs in the adjacent noncoding sequence. They found SNPs at a frequency of one per 346 bp in the coding region and one per 354 bp in the noncoding region. They calculated nucleotide diversity to be 0.0005 in coding regions and 0.00052 in noncoding regions. In addition, they were able to examine the cSNPs for occurrence of synonymous vs. non-synonymous nucleotide changes. They found that roughly half of the cSNPs were of each type, with 207 cSNPs being synonymous and 185 cSNPs being non-synonymous.
Since roughly two thirds of all random nucleotide mutations would be expected to alter the amino acid sequence of the encoded protein, they argue that there is strong selection against non-conservative DNA mutations. In fact, they calculate that non-synonymous nucleotide changes survive at only 38% of the rate of synonymous nucleotide changes (Cargill et al., 1999). Based on their data, they conclude that the average gene contains approximately 4 SNPs in its coding region, each of which occurs at a frequency of at least a few percent in the human population. By extrapolating these data, they estimate the number of cSNPs in the human genome to be between 240,000 and 400,000. More recent estimates of the number of genes in the human genome would push this number down to between 120,000 and 160,000 (Venter et al., 2001).

In a companion study, Halushka et al. (1999) examined the coding sequences and adjacent sequences for SNPs in 75 candidate genes for essential hypertension by chip hybridization and gel-based sequencing. They surveyed a total of 28 Mb of DNA (190 kb in each of 148 alleles). They identified a total of 874 SNPs, of which 387 were cSNPs. The nucleotide diversity estimates from the data of Halushka et al. are very close to those of Cargill et al. (1999), with the nucleotide diversities reported by Halushka et al. being 0.00045 for coding regions and 0.00054 for noncoding regions.

In another series of experiments, an area of either 9.7 kb or 24 kb of contiguous DNA was sequenced around the lipoprotein lipase or angiotensin converting enzyme genes, respectively (Nickerson et al., 1998; Clark et al., 1998; Rieder et al., 1999). In the first set of experiments (Nickerson et al., 1998; Clark et al., 1998), researchers sequenced 9.7 kb of DNA within the lipoprotein lipase gene in a total of 71 individuals. The individuals were African-American (24 individuals), European (24 individuals), and European-American (23 individuals).
They found a total of 79 SNPs, of which 47 were transversions. They also found 9 insertion/deletion variations. There were 7 variable sites in the coding region, a stretch of 998 bp of DNA, with the remaining 81 variable sites in the 8,736 bp of noncoding DNA. This gave a nucleotide diversity of 0.002 in the entire sample and 0.0005 in the coding region.

In the second study (Rieder et al., 1999), the investigators sequenced 24 kb of DNA around the DCP1 gene, which encodes angiotensin converting enzyme. They did this in six individuals of European descent and five individuals of African descent. They identified a total of 78 varying sites on 22 chromosomes. They found the nucleotide diversity to be 0.00093 overall. Using a combination of techniques, they were able to determine that there were 13 distinct haplotypes among the individuals tested.

Taken together, these studies support one another and likely provide a reasonable estimate for the nucleotide diversity across the human genome. They also validate the chip hybridization approach as both a method of SNP screening and genotyping.

The work below follows these projects in several respects. A method was developed to systematically scan coding and noncoding regions of various canine genes for the presence of SNPs in a pool of ten dogs of different breeds. From these results, an estimate for nucleotide diversity was calculated for the canine genome (Chapter 3). At the time of this work, a limited amount of canine nucleotide sequence data was available, and the sequencing data that resulted from the SNP search also represented newly cataloged sequence data for the canine genome (GenBank accession numbers in Chapter 3). Since that time, the complete nucleotide sequence of the canine genome has become available (Kirkness et al., 2003; Lindblad-Toh et al., 2005). The sequencing of the canine genome led to the identification of 2.5 million SNPs within the canine genome (Lindblad-Toh et al., 2005).
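The per-site diversity figures quoted for the human studies above can be related to raw counts of variable sites. The following is a minimal sketch, assuming Watterson's estimator (variable sites divided by sequence length and by the harmonic sum over the number of chromosomes sampled). That choice of estimator is an assumption made here for illustration; the cited studies may have used a different statistic, such as the average number of pairwise differences, so the output should not be expected to reproduce their published values exactly. The function names are hypothetical.

```python
def harmonic_sum(n_chromosomes: int) -> float:
    """Return a_n = 1 + 1/2 + ... + 1/(n-1) for a sample of n chromosomes."""
    if n_chromosomes < 2:
        raise ValueError("at least two chromosomes are required")
    return sum(1.0 / i for i in range(1, n_chromosomes))


def watterson_theta(variable_sites: int, length_bp: int, n_chromosomes: int) -> float:
    """Per-site diversity estimate: S / (a_n * L).

    variable_sites -- number of variable (segregating) sites observed, S
    length_bp      -- number of base pairs surveyed, L
    n_chromosomes  -- number of chromosomes sampled (2 per diploid individual)
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    return variable_sites / (harmonic_sum(n_chromosomes) * length_bp)


# Illustrative input using counts quoted in the text for the lipoprotein
# lipase study: 88 variable sites in 998 + 8,736 bp, 71 diploid individuals.
theta_lpl = watterson_theta(88, 998 + 8736, 2 * 71)
print(f"theta per site: {theta_lpl:.5f}")
```

With these inputs the estimate is of the same order of magnitude as the overall value of 0.002 quoted above, which is all that this sketch is meant to show.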
SNPs and Gene Mapping

SNPs are the most abundant form of polymorphism known to exist in the genome. As with any type of genetic marker, family-based linkage studies can be performed using SNP-based markers. One disadvantage of SNP markers compared to the more commonly used SSLP markers is that they are less informative, because SNP markers are biallelic, whereas SSLPs generally have several alleles. With only two alleles, the maximum heterozygosity is 0.50. In contrast, SSLP markers have a heterozygosity that typically ranges from 0.65 to 0.80 (Kruglyak, 1997). However, the greater abundance of SNP markers easily makes up for this shortfall in heterozygosity because several can be combined to increase informativeness.

Kruglyak (1997) set out to test the feasibility of performing whole-genome linkage searches using SNP markers. In his computer modeling, he reached several key conclusions. First, a map of biallelic markers with a density of 2.25-2.5 times that of a microsatellite map provides comparable information content. Thus, a 4 cM map of biallelic markers is comparable to a 10 cM map of microsatellites. Next, the frequencies of the two alleles do not have a great effect on the information content of the map of biallelics as long as the frequency of the rare allele is 0.2 or greater. Thus, perfect “50/50” alleles are not required for an informative map. Finally, the abundance of the SNPs in the genome makes development of large numbers of markers to create a very dense map of the genome (1 cM or less) theoretically and technically feasible. In fact, if current estimates hold, there should be on the order of 10 million SNPs in the human genome. While testing such large numbers of markers in family-based linkage studies is technically daunting, methods are being developed to increase throughput to make such genotypings feasible (Wang et al., 1998; see above).
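The heterozygosity limits discussed in this section follow from the usual expression for expected heterozygosity, one minus the sum of squared allele frequencies. The sketch below is illustrative only: the function name and the example allele frequencies are assumptions chosen to show why a biallelic marker cannot exceed 0.50 while a marker with several alleles can.

```python
def expected_heterozygosity(allele_freqs):
    """Return H = 1 - sum(p_i^2) for allele frequencies that sum to 1."""
    if any(p < 0 for p in allele_freqs):
        raise ValueError("allele frequencies must be non-negative")
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Biallelic SNP with equally frequent alleles: the maximum possible value.
h_snp_even = expected_heterozygosity([0.5, 0.5])

# Biallelic SNP whose rare allele has a frequency of 0.2.
h_snp_skewed = expected_heterozygosity([0.8, 0.2])

# Hypothetical SSLP with four equally frequent alleles.
h_sslp = expected_heterozygosity([0.25, 0.25, 0.25, 0.25])

print(h_snp_even, round(h_snp_skewed, 2), h_sslp)
```

The four-allele example falls inside the 0.65 to 0.80 range cited for SSLP markers, while neither biallelic example exceeds 0.50.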
It has been suggested that one of the true breakthroughs in genetics that SNPs will make possible is the mapping of genes conferring risk for complex diseases (Risch and Merikangas, 1996; Collins et al., 1997; Kruglyak, 1999). Risch and Merikangas examined the possibility of detecting genes conferring a genotypic relative risk (GRR) between 1.5 and 4. (GRR is defined as the increased chance that an individual with a particular genotype has the disease.) They conclude that disease susceptibility alleles with moderate frequency in the population (p is 0.1 to 0.5) that confer a GRR of 4 or greater will be detectable by family-based linkage analysis. However, for disease susceptibility loci with GRR of 2 or less, the number of families needed to detect linkage would exceed 2500 and thus be practically unachievable. They suggest that association analysis is a much better approach in this case. Instead of family-based linkage analysis, association analysis would be performed using affected sib-pairs or single affected individuals and their parents. Association analysis would then be performed based on inheritance of a given allele or associated marker in affected individuals as compared to appropriately selected controls. A significant deviation from random inheritance based on allele frequencies would be suggestive of association between the marker under consideration and the disease susceptibility allele. Similar calculations could be performed based on inheritance of a given allele or marker from unaffected parents to affected offspring. An inheritance of a given allele or marker that was significantly greater than 50% would be suggestive of association between the marker and the disease susceptibility allele.

Two approaches to whole-genome association analysis have been suggested (Collins et al., 1997). The direct method involves characterizing the approximately 25,000 genes in the human genome to identify SNPs in the coding regions (cSNPs) of these genes.
It is assumed that the SNPs resulting in an amino acid change in the encoded protein will be directly responsible for disease susceptibility. The tests would directly examine these coding changes for association with disease susceptibility (Collins et al., 1997). In fact, many investigators have begun identifying these cSNPs within the human genome (Picoult-Newburg et al., 1999; Cargill et al., 1999; Halushka et al., 1999).

Kruglyak (1999) has done computer modeling to assess the feasibility of whole-genome association analysis using an indirect approach. The indirect approach would rely on linkage disequilibrium (LD) between the variable site which confers the disease susceptibility and tightly linked markers. However, it has not been established what levels of linkage disequilibrium can be generally expected across the human genome. Based on his modeling, Kruglyak suggests that useful levels of LD extend only on the order of a few kilobases in the outbred human population. This implies that it would take 500,000 SNPs to undertake whole-genome association studies in outbred human populations. He also suggests that similar numbers of SNPs would be required in isolated populations unless the founding population is very small (effective size of 10-100 unrelated individuals).

The assertions of Kruglyak have been controversial. Collins et al. (1999) examined linkage disequilibria between 1000 pairs of loci and found that LD was on the order of 300 kb throughout the human genome. They assert that unlike the computer models of Kruglyak, which simulated the human population as steadily expanding to its current size, the human population has gone through a series of expansions and contractions over its existence. The contractions, due to events such as epidemics, famines, massacres, and pressure from technologically more advanced or more aggressive neighbors, would result in greater LD than the model suggested by Kruglyak. Collins et al.
conclude that as few as 30,000 SNP markers, 1 per 100 kb of DNA, may be sufficient to perform whole-genome association analysis in the human genome.

More recent work by The International HapMap Consortium (2005) indicates that there is much more linkage disequilibrium in the human genome than simple modeling studies would indicate. The HapMap Consortium obtained complete DNA sequences from 269 individuals from four different human populations, including ten 500 kb regions in which essentially all common DNA variation was determined. This study, in addition to identifying more than 1 million SNPs, found that

Ostrander and Kruglyak (2000) performed computer modeling to evaluate the feasibility of association analysis in the various dog breeds. They concluded that LD mapping is practical given the current state of the canine linkage map, with microsatellite markers spaced an average of 8.86 cM apart (Ostrander and Kruglyak, 2000; Werner et al., 1999). Indeed, they herald some characteristics of purebred dogs that make them intriguing for LD mapping. First, gene flow between breeds is limited by the pedigree structure. (Registration of a dog as a member of a given breed requires that both his parents be registered members of the same breed.) The modern dog breeds are relatively young, with most being developed in the last 300 years (Wilcox and Walkowicz, 1995; Wayne and Ostrander, 1999; Ostrander and Kruglyak, 2000). Many breeds have a small founding population. Popular sires have decreased the effective population size of the breeds. Finally, for many breeds, the breed’s natural history has been such that severe population bottlenecks have occurred in the recent past (Ostrander and Kruglyak, 2000). All of these factors combine to increase the area of linkage disequilibrium in the various dog breeds. For example, Ostrander and Kruglyak (2000) performed computer modeling on the Rottweiler breed.
Based on pedigree data provided by the American Kennel Club and breed history (Wilcox and Walkowicz, 1995), they estimated that there will be high levels of LD extending 5-10 cM around a disease mutation (Ostrander and Kruglyak, 2000). They further propose that screening a sample of 40 affected dogs for identity by descent will be sufficient for gene localization. While the above analysis is specific to Rottweilers, further modeling indicates that similar areas of LD will exist even in breeds that haven’t suffered the types of severe population bottlenecks as those of the Rottweiler (Ostrander and Kruglyak, 2000). Similar results have been demonstrated by Lindblad-Toh et al. (2005) for the Boxer and in five different breeds by Sutter et al. (2004).

Methods of SNP Identification

Since the development of RFLP markers (Botstein et al., 1980) it has been known that there was nucleotide variation within mammalian genomes. When the search for markers was first undertaken, the only method available was to isolate a cloned gene fragment for use as a probe and perform restriction digestion with as many different restriction enzymes as were necessary to locate an RFLP marker. One of the drawbacks of this method is that even performing endonuclease digestion with all the restriction enzymes available today, only about 50% of the SNPs would be identified as RFLPs. In fact, Nickerson et al. (1998) report that if they had performed restriction digestion on their target DNA with all of the restriction enzymes with either five- or six-base specificities (Roberts and Macelis, 1997), only 34 of their 88 variable sites would have been discovered.

With the advent of high throughput DNA sequencing techniques, methods of SNP identification have been developed using DNA sequencing. Direct sequencing has the advantage of examining all nucleotides in a sequencing run for the presence of SNPs.
Its other advantage is that only DNA sequencing will precisely define both the location and the exact nature of the DNA variation detected (Kwok et al., 1994). The major disadvantage has been the high cost associated with sequencing the DNA of several individuals within a population under study in order to locate variable nucleotide sequences.

We and others, most notably Kwok’s research group at Washington University, have developed a method of identifying SNPs by pooling DNA for sequencing (Chapter 3; Taillon-Miller et al., 1999). This has the advantage of simultaneously surveying several copies of DNA sequence for the presence of nucleotide variability while reducing the cost to that of just two sequencing reactions.

One early effort to identify SNPs in the human genome by Kwok et al. sought to utilize the large overlapping clones already available from the human genome project and inspect these sequences for nucleotide variability (Taillon-Miller et al., 1998). Where no nucleotide sequence information is available, one must develop STSs and then sequence the DNA from several individuals in order to identify SNPs found in that area of the genome. Kwok et al. (1994, 1996) performed automated sequencing of the DNA from 4 individuals plus a pooled DNA sample for allele frequency estimates. This strategy enabled them to identify with > 85% probability all the SNPs that occurred in the regions sequenced at a frequency of greater than 20% (Kwok et al., 1994).

Kwok et al. (1996) then applied this technique on a larger scale by scanning a series of STS markers for the presence of SNPs. They obtained primers for 194 STSs from the Whitehead Institute’s collection of 838 STSs (as of July, 1994). They were able to amplify DNAs from 154 of the primer sets in four individuals and a pool of 80 individuals, and examine the amplified DNA for SNPs as given above (Kwok et al., 1994, 1996).
They found 39 SNPs among the 154 STSs tested and estimated that a polymorphism occurred at a frequency of once per 791 bp, similar to the SNP frequencies reported above.

Taillon-Miller et al. (1997, 1999) further refined this method. First, they used a complete hydatidiform mole (CHM) to serve as a sequencing control (Taillon-Miller et al., 1997). A CHM is the product of an abnormal conception. It is generally the product of the union of an enucleated ovum with a single sperm cell that later duplicates its genome to give a diploid tumor (Taillon-Miller et al., 1997; Grimes, 1984; Kajii and Ohama, 1977). Since the genome of the mole is from a single haploid sperm cell that has undergone a duplication event, every nucleotide position should be homozygous in the CHM. This serves as a control reaction in that it allows false positive SNPs resulting from amplification of duplicated sequences in the genome to be distinguished from true SNPs. It is estimated that the worldwide incidence of hydatidiform moles in humans is one per one thousand pregnancies (Taillon-Miller et al., 1999; Grimes, 1984). Thus, they argue that sample material should be available for all populations of interest.

With the improvement in dye-labeled dideoxy chain terminators, Taillon-Miller et al. (1999) now recommend sequencing only two DNA samples in parallel. These are the CHM DNA as a control and a pool of 80 individuals. They found that they could cut the number of sequencing reactions by 60%, from five parallel sequencing reactions to just two, and still identify SNPs with the same sensitivity as separately sequencing the four individuals’ DNA as was done previously.

At the time of this work, several other methods of SNP identification had been developed. These methods have been reviewed by Kwok and Chen (1998) and are briefly outlined below.
Since the completion of this research, the availability of automated sequencing has virtually eliminated the use of these techniques to identify SNPs. They are included here to place this work in the context of the time at which it was completed.

SSCP: SSCP is single strand conformational polymorphism. The technique is based on the fact that single stranded DNA will form a unique tertiary structure based on its DNA sequence (Kwok and Chen, 1998). Any changes in nucleotide sequence will change the tertiary structure of the molecule. When these single stranded molecules are electrophoresed on a native gel, molecules with sufficient differences in conformation will migrate at different rates and can be distinguished on the gel. The advantage of this technique is its technical simplicity. The disadvantages are that target molecules in which a polymorphism is to be identified must be smaller than 300 bp for single nucleotide differences to influence conformation sufficiently to be resolvable on the gel, and that multiple buffer conditions are needed to achieve 90% sensitivity.

DGGE: DGGE is denaturing gradient gel electrophoresis. It is based on the fact that denaturation of double stranded DNA is sequence dependent. A difference in a single nucleotide between two DNA molecules often causes a great enough difference in their denaturation temperatures to distinguish between the two molecules. When a partially open DNA molecule is migrating through a gel, it is for all intents and purposes immobilized at the site where one end first denatures. Thus, DNA molecules with different low-melting domains will have different final positions in the gel. When heteroduplex DNA is run on a denaturing gradient gel, the heteroduplex DNA will denature at a concentration of denaturant that is much lower than its homoduplex counterpart. This forms the basis of the detection procedure.
The advantage of this procedure is its ability, with some modification of the basic procedure above, to locate polymorphisms in DNA fragments as large as 1000 bp. Its disadvantage is the need to use specialized equipment to perform the analysis.

Methods of SNP Detection

Once a SNP has been identified, the next step is to develop a means to genotype individuals at the marker. The “gold standard” of SNP detection is to use allele-specific restriction digestion and gel electrophoresis to genotype individuals for a given SNP. This method has the advantage of being highly accurate, technically reliable, and inexpensive. The disadvantages are that the method is labor intensive and has a low throughput.

Several other methods to speed throughput have been developed (Landegren et al., 1998). The current methods all use amplification by PCR followed by allele determination by allele-specific hybridization or allele-specific restriction digestion, determination of mismatched DNA substrates by polymerases or ligases, or by template specific incorporation of nucleotides by polymerases.

There is a great deal of overlap between methods of SNP identification and detection. Certainly, given unlimited budgets, DNA sequencing could be used for SNP detection. Other methods, such as SSCP, DGGE, and heteroduplex analysis, could be used to determine if polymorphism existed in a given individual. They may even be used to determine which alleles were present in an individual, with the inclusion of appropriate control reactions.

DNA chip hybridization is best suited for high throughput SNP detection. It has the advantage of being able to genotype an individual at thousands of polymorphic sites simultaneously (Wang et al., 1998; Chee et al., 1996; Landegren et al., 1998). Its main disadvantage at the time this work was undertaken was its high cost.
Since the completion of this work, the cost of DNA chip hybridization technology has decreased and the technology has become more widely available and more reliable, putting it within the budget of most laboratories.

Canine Breed History

In the series of experiments detailed below, DNA from ten different dog breeds was used to form a working pool of DNA. The breeds making up this pool were chosen because they differ in size, behavior, and temperament. Presumably, genetic variation among this pool of DNA will reflect such variation. A summary of breed characteristics is given in Appendix 1-1 following this chapter. Breed histories are also included in Appendix 1-2 at the end of this chapter.

I have published the following papers during the course of this work. Chapter 2 was previously published as Venta et al., “Gene-specific Universal Mammalian Sequence Tagged Sites: Application to the Canine Genome”, Biochemical Genetics 34: 321-341 (1996). In this work, I designed approximately 20% of the primer pairs, performed all of the DNA amplifications, and performed all of the sequencing reactions within the paper.

Chapter 3 was previously published as Brouillette et al., “Estimate of Nucleotide Diversity in Dogs with a Pool-and-Sequence Method”, Mammalian Genome 11: 1079-1086 (2000). In this work, I performed approximately 90% of the experiments.

Chapter 4 was previously published as Brouillette and Venta, “Within-breed Heterozygosity of Canine Single Nucleotide Polymorphisms Identified by Across-Breed Comparison”, Animal Genetics 33: 464-467 (2002). In this work, I performed all of the experiments.

In addition, I have coauthored four other papers. They have been previously published as follows: 1. Brouillette et al., “le I PCR/RFLP Marker in the Canine Connexin 40 Gene”, Animal Genetics 30: 229 (1999), in which I performed about 75% of the experiments; 2.
Brouillette and Venta, “T th 1 PCR/RFLP Marker in the Canine Rod Transducin Alpha Gene”, Animal Genetics 31: 68 (2000), in which I performed all of the experiments; 3. Lingaas et al., “A Canine Linkage Map: 39 Linkage Groups”, J. Animal Breeding and Genetics 118: 3-19 (2001), in which I performed segregation analysis for 7 of the 222 markers, as part of the DogMap consortium; and 4. Ernst et al., “Mapping of FES and FURIN Genes to Porcine Chromosome 7”, Animal Genetics 35: 142-167 (2004), for which I provided PCR primers and amplification conditions for the mapping of the FES gene.

Appendix 1-1

Summary of physical characteristics of dog breeds (a)

Breed               Height (inches)  Weight (pounds)  Coat color                               Fur style              Class (b)
Am. Cocker Spaniel  15               24-28            Black, tan, chocolate, cream, tricolor   Silky, long            Sporting
Greyhound           26-28            65-70            Cinnamon, chestnut, red, black, brindle  Short, smooth          Hound
Doberman Pinscher   26-28            66-88            Black, red, blue, fawn                   Short, smooth          Working
Siberian Husky      21-23            45-60            Gray, black, red                         Thick, dense           Working
Labrador Retriever  21-24            55-75            Black, chocolate, yellow                 Moderately short       Sporting
Collie              24-26            60-75            Sable and white, tricolor, blue merle    Short, smooth, double  Herding
Scottish Terrier    10-11            19-23            Black, brindle, wheat, gray              Wiry                   Terrier
German Shepherd     22-26            75-95            Black and tan, black, sable              Short, dense           Herding
Beagle              13-15            55-75            Any color                                Short, dense, smooth   Hound
Pointer             25-28            55-75            Liver, lemon, orange, white              Short, dense, smooth   Sporting

a. Information is from The Complete Dog Book, 1997 and Wilcox and Walkowicz, 1995. Height and weight are for male dogs of each breed. Where differences existed between the references cited, data were taken from the first reference above. In all cases, the female dog was slightly shorter and lighter than the male dog.
b. “Class” refers to the grouping used by the American Kennel Club in The Complete Dog Book, 1997.
Appendix 1-2

Breed History

The history of each of the breeds used in this study is outlined briefly below.

American Cocker Spaniel: The American Cocker Spaniel can trace its roots back as far as the 14th century. In 1368, the Spanyell was first mentioned in the literature. Through the years the spaniel family was divided into two groups, the land spaniels and the water spaniels. As time passed, the land spaniels were divided into the smaller cocker spaniels and the larger varieties. Later, the toy spaniels were divided from the cocker spaniels. The first registry of the Cocker Spaniel breed was in England in 1892. It was brought to the United States in the 1880s and went through a change in breed standard such that by the 1930s it came to be considered a separate breed from the English Cocker Spaniel from which it originated. It is considered to be a sporting dog, and is reputed to be an excellent hunter. The breed is known for being handsome, happy, eager to please, trusting, and intelligent. These traits have made it one of the most popular dog breeds in the United States (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

Greyhound: The greyhound can trace its lineage back to ancient times. The first known record of the greyhound dates back to the hieroglyphs of ancient Egypt, around 3000 B.C. The greyhound has long been a favorite of the aristocracy. Documents from 9th century England indicate that it was a favorite hunting dog of the Duke of Mercia. The earliest accounts of the greyhound in America date back to Spanish explorers in the 1500s. Known as hunters, greyhounds are reported to have run down deer, stags, and foxes. Yet, the breed is probably best known for its ability to hunt rabbits and hares. Known today for being gentle, well-behaved, and graceful pets, greyhounds are elegant show dogs and thrilling competitors (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

Doberman Pinscher: The origin of this breed is well established.
The breed was begun in Thueringen, Germany, in 1890 by Louis Dobermann. Dobermann, a tax collector by trade, needed a dog to protect him from bandits. The breed mixed the hardiness and intelligence of the German Shepherd, the reaction and fire of the German Pinscher, and the hunting ability of the pointer. Further outbreeding added the Rottweiler’s strength, courage, and guarding instinct and the Greyhound’s foot speed. In only ten years, the breed standard had been established. The breed is known today for its intelligence, its ability to absorb and retain training, and its loyalty. It is these qualities that put the breed in demand as a police and military dog (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

Siberian Husky: The Siberian Husky traces its roots to the dogs of the ancient Chukchi people of northeastern Asia. The dog was bred to be a sled dog. Its primary mission was as a dog that would travel great distances at moderate speed while carrying a light load and wouldn’t flinch at the subzero temperatures of the Arctic region. The reputation of this breed of sled dog was made in the United States in 1925. A diphtheria epidemic was sweeping through Nome, Alaska, and dogsleds were used to take antiserum from Anchorage to Nome. This serum run was the forerunner of the famous Iditarod dogsled race, and it focused the spotlight on the Siberian Husky. The Siberian Husky is naturally friendly and gentle. He is an exceptional family dog, and is still the favorite of dog mushers across the United States (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

Labrador Retriever: The Labrador Retriever was originally seen in the early 1800s in Canada as a hunting dog and was particularly useful at retrieving water fowl. The breed was transported to England on fishing boats and soon became a popular sport dog there as well. Later in that century, the breed was all but eliminated in Canada, due to a heavy dog tax.
The breed’s development into its current form occurred largely in England. The breed was first recognized in England in 1903 and in the United States in 1917. This breed is known for its eagerness to please its master and still possesses its sensibility, even temper, intelligence, and strong marking and retrieving skills. The breed is renowned as a bird flusher, companion, drug-detector, and as a guide dog for the blind. These traits consistently put the breed in the top five in popularity in both England and the United States (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

Collie: The breed known as the collie is thought to have its origin in the dogs that were brought to Scotland with the Roman invaders of 50 BC. These ancient dogs interbred with other Scottish herding dogs to yield the breed known today. This breed of dog has been used to herd sheep for centuries in Scotland. In 1860, Queen Victoria became a fancier of the breed after a trip to Scotland. With her blessing, the breed became a favorite of the aristocracy and affluent, as well as maintaining its traditional role as a herding dog. The two types, rough and smooth (referring to the length of the coat), were fixed enough in characteristics by 1886 that little has changed with regard to the breed standard since that time. By 1877, the breed had become established in the United States, though the breed was first introduced in this country with the early settlers over a century earlier. The breed is consistently popular today as a family pet. It is known for its loyalty and affection and as a self-appointed guardian of the entire family, but particularly of small children. In recent years, the dog has maintained its popularity, due in part to the Lad stories of writer Albert Payson Terhune and the “Lassie” movies and television series (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

Scottish Terrier: The Scottish Terrier has been in existence for centuries.
There are those that will argue that descriptions of the Skye Terrier written in the 1570s are not the Skye Terrier that is known today but the Scottish Terrier of antiquity. At the very least, the modern breed can trace its lineage to Scotland in the 1860s. The first standard was established in England in 1880. It has remained the standard, with minor changes, up to the present day. The Scotty was first introduced into the United States in 1883. Since this time, there have been thousands of Scotties imported. The terrier temperament is taken to the extreme in Scotties. He is alert, quick, and feisty. These qualities make the breed well suited to being a watchdog and varmint killer. This breed of dog requires discipline to prevent him from becoming a bully (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

German Shepherd Dog: This breed was founded in 1899 by Max von Stephanitz. It has always been a working dog, originally as a herder, and in a variety of roles today. This breed grew steadily in popularity around the world up to World War I, but the popularity of the breed suffered due to the anti-German backlash in Europe and America following World War I. The breed is known today for its loyalty, courage, and ability to assimilate and retain training for a number of specific purposes. German Shepherds are often used as guide dogs for the blind, and as police dogs, military dogs, and as a key component of search-and-rescue units. Considered by some to be aloof, the German Shepherd Dog doesn’t give affection freely. However, once the dog warms to a person, he is loyal and dedicated, even to the point of giving his life for his master (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

Beagle: The history of the Beagle breed is cloudy. Some reports indicate that the origin of the Beagle dates as far back as ancient Rome. Other accounts note that the Beagle has been used to hunt hares in Wales for centuries.
Modern records of the Beagle date at least to the middle 1700s. Their keen sense of smell and compact size have made them a favorite to hunt rabbits, hares, and foxes, either individually or in packs. In the United States, the Beagle has been in existence since colonial times. However, these dogs had the look of a Basset Hound rather than that of the Beagle of today. Imports of Beagles from the kennels of Great Britain in the 1880s and 1890s gave rise to the Beagle that is recognized today. The Beagle is known as a capable hunting dog as well as a playmate for adults and children alike. The breed’s inquisitiveness and happy-go-lucky nature have made it a consistent member of the top ten dog breeds in the United States (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

Pointer: The Pointer breed got its start in England around 1650. These are excellent hunting dogs and were considered to be the first true pointing dogs. They were originally used to hunt hares. The Pointer was sent out to locate a hare, at which time greyhounds were brought in to chase the hare. During the early 1700s, the pointer’s hunting ability was more thoroughly exploited due to the increased popularity of wing-hunting. Legends abound about this breed’s pointing ability. One example is the story of a sportsman who lost his dog in the moors of England. He returned a year later to find the skeleton of the dog still pointing at the skeleton of a bird. The pointer of today is a hunting specialist. He is muscular, courageous, speedy and has great endurance. His ability to concentrate on his job and ability to work with people other than his master keep him as a favorite among hunting dogs (American Kennel Club, 1997; Wilcox and Walkowicz, 1995).

Chapter 2

Gene-Specific Universal Mammalian Sequence-Tagged Sites: Application to the Canine Genome

1 2,3 Patrick J. Venta ’ 1’3 , James A.
Brouillette , Vilma Yuzbasiyan-Gurkanz, and George J. Brewer4 Departments of Microbiology1 and Small Animal Clinical Sciencesz, College of Veterinary Medicine, and the Genetics Program3, Michigan State University, East Lansing, MI 48824-1314 and the Department of Human Genetics4, The University of Michigan Medical School, Ann Arbor, MI 48109-0618 Key Words: Genome mapping; evolution; homology; polymerase chain reaction Corresponding Author: Patrick J. Venta, Ph.D. Phone: 517-432-2515 FAX: 517-432-2514 e-mail: venta@cvm.msu.edu 35 Abstract We are developing a genetic map of the dog based partly upon markers contained within known genes. In order to facilitate the development of these markers, we have used PCR primers designed to conserved regions of genes that have been sequenced in at least two species. We have refined the method for designing primers to maximize the number that produce successful amplifications across as many mammalian species as possible. We report the development of primer sets for eleven loci in detail: CF TR, COL10A1, CSFIR, CYP1A1, DCNI, FES, GHR. GLBl, PKLR, PVALB, and RBI. We also report an additional 75 primer sets in the appendices. The PCR products were sequenced to show that the primers amplify the expected canine genes. These primer sets thus define a class of gene-specific sequence-tagged sites (STSs). There are a number of uses for these STSs, including the rapid development of various linkage tools and the rapid testing of genomic and cDNA libraries for the presence of their corresponding genes. Six of the eleven gene targets reported in detail have been proposed to serve as “anchored reference loci” for the development of mammalian genetic maps [O’Brien et al., Nat. Genet. 3:103- 112, 1993]. The primer sets should cover a significant portion of the canine genome for the development of a linkage map. 
In order to determine how useful these primer sets would be for other genome projects, we tested the eleven primer sets on DNA from species representing five mammalian orders. Eighty-four percent of the gene-species combinations amplified successfully. We have named these primer sets “universal mammalian sequence-tagged sites” (UM-STSs) because they should be useful for many mammalian genome projects.

Introduction

Efforts have intensified in recent years to develop comprehensive genomic maps for many eukaryotic species using molecular techniques. Many of these efforts have focused on mammalian species, including human, mouse, rat, ox, sheep, pig, horse, cat, and dog (e.g., Buchanan et al., 1994; Dietrich et al., 1992; Ellegren et al., 1992; O’Brien, 1986; Serikawa et al., 1992; Weissbach et al., 1992; Winterø et al., 1991; Barendse et al., 1994; and the present report). For the non-human species, these projects should lead to more successful breeding strategies, both for selecting desirable characteristics and for removing genes that lead to various genetic diseases. Comparisons made between these genome maps should also lead to new insights on the mechanisms of chromosomal evolution (e.g., see O’Brien et al., 1993).

We are developing a comprehensive map of the canine genome, with our ultimate aim being to reduce the incidence of canine genetic diseases. In addition to developing random, highly polymorphic genetic markers (Type 2 markers; Yuzbasiyan-Gurkan et al., submitted), we are also developing markers for specific genes (Type 1 markers). An appropriate mix of these two types of markers should maximize our ability to map disease genes.

The traditional method for developing gene-specific markers, Southern blotting and cross-species hybridization, is very time consuming, labor intensive, and limited in flexibility. This method has been the mainstay for developing gene-specific markers in most animal genome projects.
There is a need to develop more efficient methods. This is particularly important for animal genome projects, where scientific resources are more limited. One method that has excellent potential is cross-species polymerase chain reaction (PCR). This method has been used successfully for the study of a number of individual genes but has not been applied on a genome-wide basis for the purpose of map development. When a single gene is studied, the cost associated with the failure of a few primer sets to amplify the correct target is negligible, and new primer sets can easily be redesigned and synthesized. However, when primer sets are being designed for many genes, the cost of failed primers can become substantial, both in terms of time and other resources, so we have refined the design method to minimize this problem.

We describe here, in detail, eleven primer sets that can amplify gene-specific targets of dogs and other mammalian species. Seventy-five additional primer sets are listed in the Appendices. Markers based on PCR primers are called sequence-tagged sites (STSs; Olsen et al., 1989); we call these primer sets universal mammalian STSs (UM-STSs) because they should be useful for many mammalian genome projects.

Materials and methods

DNA Isolation

DNA from dog, human, pigtail macaque, horse, pig, rat, and mouse was isolated from various tissues by standard phenol-chloroform extraction methods (Sambrook et al., 1989). Goat DNA was kindly supplied by Dr. Karen Friderici, Michigan State University. DNA was purified by standard methods from a canine liver cDNA library (Clontech) and from a canine genomic DNA library (Clontech) after growing 1 x 10^6 phage in E.
coli strain LE392 (Murray et al., 1977) in liquid culture (Sambrook et al., 1989).

Design of PCR Primers

The method of primer design detailed here was used throughout this series of experiments, unless the primer was designed from available sequence data for the species being studied. Primers were designed to genes where the intron-exon structure was known in at least one species and where the nucleotide sequence was known in at least two species (the “index species”) that are not closely related. Tandemly duplicated genes known to have undergone gene conversion in any species were avoided. Primers were generally designed so that the amplified product contained an intron. Since the canine sequence was unknown in most cases, the sizes of the canine introns were not known prior to amplification. We have since determined that the vast majority of canine introns will be between 50% and 150% of the size of the corresponding human intron (Venta, unpublished observations).

We have followed the human gene nomenclature system (ISGN, 1987) for naming the canine genes. The eleven loci described in detail in this chapter, and their protein products, are: CFTR, cystic fibrosis transmembrane regulator; COL10A1, type X collagen, alpha 1 chain; CSF1R, colony stimulating factor 1 receptor; CYP1A1, cytochrome P-450 1, alpha 1; DCN1, decorin; FES, c-fes (feline sarcoma) proto-oncogene; GHR, growth hormone receptor; GLB1, beta galactosidase; PKLR, pyruvate kinase, liver and RBC form; PVALB, parvalbumin; and RB1, retinoblastoma protein. The Genbank accession numbers or references for the sequences of the two index species for each locus are as follows: CFTR, M55129, M60493; COL10A1, X65120, X65121; CSF1R, X14720, K01643; CYP1A1, Uchida et al., 1990, X04300; DCN1, L01125, Z12298; FES, X06292, J02088; GHR, Z11802, J04811; GLB1, S59584, M57734; PVALB, X63578, M15452; PKLR, S59798, M17088; and RB1, L11910, M26391.
Primers were designed to highly conserved nucleotide sequences contained within coding regions. It is presumed that, in the absence of parallel evolution, regions that are conserved among distantly related mammalian species represent the nucleotide sequence of the most recent common ancestor of the two modern mammals. Thus, any mismatches to the primers should result only from evolution of the canine nucleotide sequence since the divergence from the most recent common ancestor.

Within the areas that were conserved between the two index species, an attempt was made to place the primers in the nucleotide sequence corresponding to the least mutable amino acids (Collins and Jukes, 1994; Jones et al., 1992). While there are slight differences among studies, there is a consensus that Gly, Phe, Tyr, Trp, and Cys are among the least mutable amino acids (Collins and Jukes, 1994; Jones et al., 1992; Dayhoff et al., 1978). In addition, an attempt was made to choose primers overlying codons with the fewest degenerate codon positions (Li and Graur, 1987). With only a single codon each, Met and Trp are excellent amino acids over which to design a primer. Others that are both rarely mutable and have few codons are Phe, Cys, Lys, and Gln. The final step was to attempt to place the 3' end of the primer in the second position of the codon for all codons except glycine, which has greater conservation at the first position than at the second position (Venta, unpublished observations). An attempt was also made to follow general principles such as the avoidance of primer-dimer formation. Conservation of amino acids within multigene families was taken into account when possible. Where unavoidable nucleotide mismatches occurred between the two index species, the primer sequence was designed to exactly match one of the two, which we then call the “primary” index species.
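The codon-degeneracy criterion described above can be illustrated with a short Python sketch. It is not the procedure used in this study, which was applied by inspection; it only scores windows of two aligned peptides by the number of nucleotide sequences that could encode them, using the codon counts of the standard genetic code. The function names, the seven-residue window (roughly the span of a 20-mer), and the example peptides are assumptions made for illustration, and amino acid mutability is not modeled.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code.
CODON_COUNT = {
    "M": 1, "W": 1,
    "F": 2, "Y": 2, "C": 2, "H": 2, "Q": 2, "N": 2, "K": 2, "D": 2, "E": 2,
    "I": 3,
    "V": 4, "P": 4, "T": 4, "A": 4, "G": 4,
    "L": 6, "S": 6, "R": 6,
}


def conserved_runs(pep_a, pep_b, min_len):
    """Yield (start, end) spans where two aligned peptides are identical."""
    start = None
    for i, (a, b) in enumerate(zip(pep_a, pep_b)):
        if a == b and a != "-":
            if start is None:
                start = i
            continue
        if start is not None and i - start >= min_len:
            yield (start, i)
        start = None
    n = min(len(pep_a), len(pep_b))
    if start is not None and n - start >= min_len:
        yield (start, n)


def window_degeneracy(peptide):
    """Number of distinct nucleotide sequences that can encode the peptide."""
    return prod(CODON_COUNT[aa] for aa in peptide)


def best_window(pep_a, pep_b, width=7):
    """Return (start, peptide, degeneracy) for the least degenerate
    fully conserved window, or None if no conserved window exists."""
    best = None
    for start, end in conserved_runs(pep_a, pep_b, width):
        for i in range(start, end - width + 1):
            window = pep_a[i:i + width]
            score = window_degeneracy(window)
            if best is None or score < best[2]:
                best = (i, window, score)
    return best


# Hypothetical aligned peptides from two index species.
print(best_window("LSRMWFCKQYLSL", "LSAMWFCKQYLSV"))
```

In this toy example the window rich in Met, Trp, Phe, Cys, Lys, and Gln wins, which mirrors the preference stated in the text; a fuller treatment would also weight each residue by its mutability.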
GC-rich genes were generally avoided due to the amplification difficulties that can occur, even with exactly matching primers. Primers were twenty bp in length on average. Each primer in a pair was adjusted to be of approximately the same annealing temperature (Breslauer et al., 1986). All sets of primer pairs were designed to have approximately the same annealing temperature as well, in anticipation of performing multiplex amplifications. It was not always possible to follow every rule for every gene; however, the majority of the rules were generally applicable. Primers were synthesized by either the Michigan State University Macromolecular Structure Facility or the University of Michigan DNA Synthesis Facility.

PCR Amplifications

Correct design and synthesis of the primers were checked by amplifying the DNA from the primary index species. Standard buffer, nucleotide, and primer concentrations were 50 mM Tris-HCl (pH 8.3 at room temperature), 50 mM KCl, 1.5 mM MgCl2, 200 µM dNTPs, 0.1 µg of each primer, and 0.5-1.0 µg of target DNA in a 25 µl reaction. Reactions were routinely boiled for three min prior to the addition of 2.0 U of Taq DNA polymerase. Optimal cycling conditions for the amplification of canine genomic DNA were usually found by testing one of several sets of conditions in general use in the lab (see Appendix 2-5). Occasionally it was necessary to use “hot-start” conditions (Bassam and Caetano-Anolles, 1993) in order to get stronger, cleaner amplifications. The presence of an amplification product was determined by electrophoresis of a portion of the reaction on a 1% agarose TBE gel (TBE = 90 mM Tris, pH 8.3, 90 mM sodium borate, 2.5 mM EDTA) followed by staining with ethidium bromide.

DNA Sequence Analysis

The identity of each amplified canine gene was confirmed by “single pass” direct sequencing of PCR products using Sequenase or Taq cycle sequencing kits (United States Biochemical Corp., Cleveland).
The PCR products were gel purified with Qiaex (Qiagen Corp., Chatsworth, CA) or by elution from polyacrylamide gel slices (Bergenhem et al., 1992) prior to their use in the sequencing reactions. The canine sequences were visually aligned with the sequences of the other species used to design the PCR primers in order to verify the degree of sequence identity.

Results

The primer sets for the various UM-STSs reported here are given in Appendix 2-7, and efficient amplification conditions for the canine genes are given in Appendix 2-5. It is probable that these conditions could be optimized further (e.g., by reducing the time in each cycle). However, the conditions reported here were found to work effectively while minimizing the number of conditions that had to be examined. A representative gel showing amplification of the canine target DNA along with the human target DNA is shown in Appendix 2-1. The human target serves as a positive control for the amplification system because these primers were designed to exactly match the human sequence.

The ability to quickly screen genomic and cDNA libraries for the presence of sequences is also demonstrated in Appendix 2-1. The genomic clones for GHR, COL10A1, and DCN1 (a very faint signal, stronger on other gels [data not shown]) are present in this particular canine genomic library. The presence of a decorin cDNA clone (encoded by the DCN1 locus) in the canine liver cDNA library is shown by the presence of the 122 bp band; cDNA clones for GHR and COL10A1 are not present. The DCN1 PCR product from the cDNA library was sequenced and its identity confirmed (see Appendix 2-2). The human and canine genomic bands have different sizes for GHR and DCN1 because of intron size differences. The size of the COL10A1 PCR product is the same in both species because this UM-STS does not span an intron.
Although the PCR product bands in Appendix 2-1 are unique, a few UM-STS-species combinations sometimes contained one to several non-specific amplification products. This is a minor problem with unique sequence primers, because it is almost always possible to deduce the correct band from its staining intensity and from its similarity in size to the band of the primary index species.

The amplified products for all of the canine loci were sequenced to confirm their identity, and the results are shown in Appendix 2-2. The degree of identity between the canine and index species sequences for each locus is within the range generally accepted (roughly 70 to 100%) as demonstrating homology between the genes of mammalian species (Li and Graur, 1987). These results support the hypothesis that the canine PCR products are homologous to the respective index species’ genes. The canine COL10A1 sequence matched the human and mouse sequences to a similar extent (data not shown). The sequences for PKLR and CYP1A1 exactly matched previously published canine coding sequences (Whitney et al., 1994; Uchida et al., 1990); the sequence for canine FES is given in Appendix 2-3. Although the majority of the canine sequence for PVALB is from an intron, we believe the degree of sequence identity from this region is sufficient evidence to confirm that the PCR product is from the correct canine locus. As expected, the canine sequences tend to show greater identity with the human sequences than with the rodent sequences because of the faster evolutionary rate of the rodent genome (Gu and Li, 1993). A microsatellite repeat was found within the amplified product itself for RB1. Preliminary results show that the RB1 repeat, (GA)12 (avg), has moderate genetic variability within several canine breeds.
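The identity comparisons described above can be computed directly from alignments written in the dot notation used in the appendices, where a dot marks a base identical to the reference row. The following Python sketch shows one way to do this under stated assumptions: the function names are hypothetical, the example rows are invented rather than taken from the appendices, and columns containing a gap or an undetermined base (N) are simply skipped. The alignments in this study were made visually, so this is an illustration, not the method used.

```python
def expand_dots(reference, row):
    """Replace each '.' in an alignment row with the reference base
    found in the same column."""
    if len(reference) != len(row):
        raise ValueError("aligned rows must have equal length")
    return "".join(r if c == "." else c for r, c in zip(reference, row))


def percent_identity(seq_a, seq_b):
    """Percent of compared columns that are identical.

    Columns where either sequence has a gap ('-') or an
    undetermined base ('N') are not counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    same = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a == b:
            same += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * same / compared


# Hypothetical reference row and dot-notation comparison row.
reference = "ATGGCCAAGT"
row = "..A....G.."
other = expand_dots(reference, row)
print(other, percent_identity(reference, other))
```

A value computed this way can be compared against the roughly 70 to 100% range cited in the text, keeping in mind that intron and exon columns evolve at different rates and are better scored separately.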
We hypothesized that each primer set should work for many different mammals, given the evolutionary rate at which nucleotide substitutions occur (Li and Graur, 1987) and the number of primer nucleotide mismatches that can be tolerated by PCR. We tested the ‘universal’ utility of these primers on the DNAs from mammals representing several different orders. We used the same reaction conditions that were found to amplify the canine sequences. We have termed these reactions “Zoo PCRs.” Appendix 2-4 shows a representative experiment. The FES proto-oncogene was amplified from all of the DNAs examined. These PCR products were purified and sequenced directly without subcloning (see Materials and methods). The sequences are tabulated in Appendix 2-3. The degree of sequence identity makes it highly likely that the canine PCR products are all homologous to the corresponding index species’ genes. The pattern of nucleotide interchange is also what would be expected for homologous genes: members of the same mammalian order share more sequence similarity with one another than with those of other orders.

The data for the Zoo PCRs for the other UM-STS primer sets reported in this paper are given in Appendix 2-6. Greater than eighty-four percent of the targets, excluding the index and canine species, amplified under the single condition used to amplify the canine sequence. These species represent five different mammalian orders: primates (human and macaque), carnivores (dog), artiodactyls (goat and pig), perissodactyls (horse), and rodents (mouse and rat). Limited experiments on other members of these orders (e.g., cat and ox) produced similar results (data not shown). Lack of amplification for DCN1 in one of the artiodactyls (goat) would be predicted because there are four mismatches between the UM-STS primers and the sequence of the closely related bovine DCN1 (Day et al., 1987).
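Counting primer-target mismatches of the kind discussed for DCN1 is straightforward to sketch in Python. The code below is an illustration only: the function names and sequences are hypothetical, the target site is assumed to be given on the same strand and in the same 5' to 3' orientation as the primer, and the default tolerance of two mismatches is a parameter reflecting the authors' reported experience with 20-mers rather than a general rule.

```python
def primer_mismatches(primer, target_site):
    """Return mismatch positions counted from the primer's 3' end.

    Position 0 is the 3'-terminal base. Both sequences are written
    5' to 3' on the same strand and must have equal length.
    """
    if len(primer) != len(target_site):
        raise ValueError("primer and target site must have equal length")
    p = primer.upper()
    t = target_site.upper()
    n = len(p)
    positions = [n - 1 - i for i, (a, b) in enumerate(zip(p, t)) if a != b]
    return sorted(positions)


def likely_to_amplify(primer, target_site, max_mismatches=2):
    """Crude screen: accept the site if the mismatch count is tolerable."""
    return len(primer_mismatches(primer, target_site)) <= max_mismatches


# Hypothetical 20-mer and two hypothetical target sites.
primer = "ACGTACGTACGTACGTACGT"
close_site = "ACGAACGTACGTACGTACCT"    # two mismatches
distant_site = "TCGAACGTACGAACGTACCT"  # four mismatches

print(primer_mismatches(primer, close_site))
print(likely_to_amplify(primer, close_site))
print(likely_to_amplify(primer, distant_site))
```

Reporting positions from the 3' end is a deliberate choice, since the design rules in this chapter place particular weight on the base at the 3' terminus; a count alone would hide where the mismatches fall.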
We have found it difficult (although not impossible) to amplify DNA using primers that contain more than two mismatches with the target, when using 20-mers (P.V., unpublished results). It is likely that the homologous gene from at least some of the non-amplifying species would amplify with these primer sets if other PCR conditions were examined.

Discussion

This study has shown the feasibility of generating a series of UM-STSs useful for studies of many genomes, and has addressed methodological considerations for their development. UM-STSs should serve as useful tools for amplifying regions of interest from genomes, for isolating clones from genomic and cDNA libraries, and for cross-species comparisons. The data reported in this paper indicate that approximately 85% of all carefully designed UM-STSs will be useful for any given mammalian species. We believe that this method is far more efficient, less costly, and considerably less labor intensive than traditional hybridization and Southern blotting-based methods. An additional important benefit is that the information for the necessary reagents (i.e., the primer sequences) is transmitted much more easily and quickly than the clones that are necessary for Southern blotting.

UM-STSs will also be useful for developing genetic markers within various genomes. We have found a microsatellite within one of the eleven loci reported here (RB1) and have found other microsatellite repeats associated with genomic clones isolated through the use of UM-STSs (unpublished results). Single site variability should also be found directly in at least some of the amplified products by using one of a number of techniques developed for scanning for variability, such as the single-strand conformation polymorphism technique. For example, this method has been used to find two polymorphic sites in a study of the canine ALAS2 gene in a PCR product of a size similar to those reported here (Boyer et al., 1995).
If the frequency of single site polymorphic variability for other mammals is as high as that estimated for humans (roughly one in 200 to 400 nucleotides), then a significant portion of UM-STSs will have these sites. We are currently screening for this variability in the canine genome to estimate the frequency of such variation in the dog. It will be necessary to screen each species individually for genetic variability. However, the availability of previously designed UM-STS primer sets, such as those reported here, should make this work proceed more rapidly than with the traditional method.

An example of the utility of cross-species comparisons is given by the case of Waardenburg syndrome. The clue to the location of one of the human Waardenburg syndrome genes (well known for causing a syndromic hearing loss) was first gleaned from comparative mapping with the mouse (Asher and Friedman, 1990). The map locations in the mouse suggested possible locations of the human disease gene, one of which eventually was proven correct (e.g., Morell et al., 1992). Because the identity of the gene in the mouse was not known at the time, this approach might more properly be called a 'positional candidate' approach. UM-STSs will be useful for rapidly producing mammalian genetic maps so that the positional candidate approach can be applied to more species.

Very little is known about the location of genes within the canine genome. Indeed, except for genes located on the X-chromosome (Meera-Khan, 1984; Deschenes et al., 1994) and a few small unassigned linkage groups (Meera-Khan, 1984), the genome has remained unexplored. New linkage groups are being developed by us and others (Holmes et al., 1992; Ostrander et al., 1993; Rothuizen et al., 1994; Yuzbasiyan-Gurkan et al., submitted) based primarily on simple sequence repeats. The development of UM-STSs should help to rapidly identify the location of linkage groups on specific canine chromosomes.
The identification of conserved syntenies will allow candidate linkages to be tested in the canine genome. The assignment of the proposed anchor loci (O’Brien et al., 1993) as defined by UM-STSs to specific chromosomes can be accomplished by the somatic cell hybrid, flow sorted chromosome, and fluorescent in situ hybridization (FISH) methodologies. Other methods, such as assignment by use of linkage to previously mapped loci, are also possible. We have already assigned several genes by FISH to canine chromosomes using cosmids isolated with UM-STSs (Fujita et al., in press). Using the methods described in this paper, we have developed a much greater number of UM-STSs that should cover, for linkage mapping purposes, a substantial portion of the canine and other mammalian genomes (see Appendices 2-8 and 2-9).

Acknowledgements

We thank Ya Shiou Yu and Murat Gurkan for their valuable technical assistance. We also thank Tracy Hammer, Neal Dittmer, Jessica Nadler, Elizabeth Tullett, Marc Crotteau, and Kristen Penner for their contributions to the development of a number of the primer sets. This work was supported by the American Kennel Club, the Orthopedic Foundation for Animals, and the Morris Animal Foundation. We also thank the Washington Regional Primate Facility for supplying the pigtail macaque tissues to Dr. Richard E. Tashian, with whom one of us (P.V.) originally isolated the DNA.

Appendix 2-1

[Figure: agarose gel with lanes 1-12 and marker lane M; size scale in kb.]

Amplification of several canine gene segments using UM-STSs. The following lanes were amplified with the gene-specific primer sets (see table 1): lanes 1-4, GHR; lanes 5-8, COL10A1; and lanes 9-12, DCN1. Lane 13 contains a mixture of DNA size markers: λ bacteriophage DNA cut with the restriction endonuclease BstEII and the plasmid pSK- (Stratagene) cut with MspI. Lanes 1, 5, and 9 contain PCR products amplified from human genomic DNA. Lanes 2, 6, and 10 contain PCR products amplified from canine genomic DNA.
Lanes 3, 7, and 11 contain PCR products amplified from DNA purified from a canine genomic library contained in a λ bacteriophage vector. Lanes 4, 8, and 12 contain PCR products amplified from a canine liver cDNA library.

Appendix 2-2

[Identity rows are partly illegible in the scan and are reproduced as scanned.]

CFTR    A.A. 1346 | intron 22
Dog     - - - - - - -| I - - - -
Mouse   - - - - - - - -| I - - - V - - - V - - - -
Human   E P S A H L D P| V T Y Q I I R R T L K Q A F A
Human   GAACCCAGTGCTCATTTGGATCC|AGTAACATACCAAATAATTAGAAGAACTCTAAAACAAGCATTTGCT
Mouse   ..Gm.....CmC.A..C..ICAm........G.CmC..C..GTm.........C..Cm
Dog     G .................... |.A- ............ C

COL10A1
Dog     - - - - - - - - K - - - - - - - - - - - - - H
Mouse   - I Y E - - - - - - - - - - - - S - - - - - K
Human   P F D K I L Y N R Q Q H Y D P R T G I F T C Q
Human   CCATTTGATAAAATTTTGTATAACAGGCAACAGCATTATGACCCAAGGACTGGAATCTTTACTTGTCAG
Mouse   ..CA..T..G.G C .C..Tm..Gm.....Cm.....ATm.Tm.....CmA..
Dog     m........G..Cm.......A ..................... Am........C..C..C..C
dog     ccatttgataagatcttgtataacaagcaacagcattatgacccaagaactggaatcttcacctgccag

CSF1R   intron 3 |
Dog     |- - V - - - Q - - - - - - - - V - G - - - -
FeLV    |- - A - - - Q - - - - - - T _ L _ G _ _ _ _
Human   |D P A R P W N V L A Q E V V V F E D Q D A L L
Human   |ACCCTGCCCGGCCCTGGAACGTGCTAGCACAGGAGGTGGTCGTGTTCGAGGACCAGGACGCACTACTGC
FeLV    |m....Tm..T ..G ..G..Cm..AMACGM..G..A.GTW..T..GT.Gm.
Dog     |m...TTm..Tm..Gm..G..Gm.....Cm...Gm...GGm..T..G..Gm.
dog     accctgttcggccttggaaggtgctggcgcaggaggtcgtcgtggtcgaggggcaggatgcgctgctgc

DCN1    | intron 6
Dog     - - - - - - |- - - - - - - -
Human   V D A A S L K G L N N L A |K L G L S F N S I S
Human   GTTGATGCAGCTAGCCTGAAAGGACTGAATAATTTGGCTA|AGTTGGGATTGAGTTTCAACAGCATCTCT
Rat     m........Cm.........A..TCm....Tml..Cm.Tm..Cm..Tm...A.C
Dog     ................... lm.....Cm....Tm........N
dog     ggactgaataatttggcta agttgggactgagttttaacagcatctc

GHR     A.A. 333 |
Dog     - D L - - - - - G - - - - - - - - N _ _ _
Rat     - D A - - - - - - - - - - - - - _ D _ Q _
Human   D E P D E K T E E S D T D R L L S S D H E K S
Human   TGATGAGCCAGATGAAAAGACTGAGGAATCAGACACAGACAGACTTCTAAGCAGTGACCATGAGAAATCA
Rat     m...TG.G ..G .....A..G .....C ............... GAM...Gm......
Dog     m...C.Tm.........C..A.G .......................... AC ...............
dog     tgatgacctagatgaaaagaccgaaggatcagacacagacagacttctaagcaacgaccatgagaaatca

GLB1    A.A. 268 |
Dog     - - - - - - - - - V - - - V -
Mouse   - - - - - - - - - - - K - - - - V - - K T L - T
Human   E F Y T G W L D H W G Q P H S T I K T E A V A S
Human   GAATTCTATACTGGCTGGCTAGATCACTGGGGCCAACCTCACTCCACAATCAAGACCGAAGCAGTGGCTTCC
Mouse   ..G .................... C .....TAm.C..T ..GG.G..A..TA..A..C ..A..
Dog     m..T .....G..A ..A 6.6 ..T .TC ...
dog     gatcattggggccagccacactcaacagtgaagactgaagtcgtggcttcc

PVALB   A.A. 59 |
Dog     - intron 2 - -
Rat     - - - - - - - (bp) S -
Human   I E E D E L G F I L K
Human   ATCGAGGAGGATGAGCTGGthaagctggagg - 1300 - tttctcctccagATTCATCCTAAAAG
Rat     ..T ..................... a. - 1500 - m....- .G.C T..G..G.
Dog     m....agactcc. - 1300 - m....-m........
dog     ggggtaaagactccg tttctcc ccagattcatcc

RB1     A.A. 890 | intron 22
Dog     A - - - - - - - - - - L - - - - - - - - - - --|
Mouse   - G - - - - - - - N V - - - - - - A - - - -
Human   G S N P P K P L K K L R F D I E G S D E A D
Human   GGAAGCAACCCTCCTAAACCACTGAAAAAACTACGCTTTGATATTGAAGGATCAGATGAAGCAGATGGAAG|
Mouse   ..CG ....c..c .............. CG.G .....C..C..GmG.C .............. G..|
Dog     .c ................... Tm.........TGm.....C .......................... |
dog     gcaagcaaccctcctaaaccattgaaaaaactactgtttgatatcgaaggatcagatgaagcagatggaag

Lineups of several canine gene sequences with homologous mammalian genes. The nucleotide and amino acid sequences are compared for each of several anchor loci between dog and two other species. The locations of PCR primers are underlined, although not all PCR primer sites are shown.
Some of the lineups show intron sequence whereas others simply identify the location of the introns. Genbank accession numbers for canine sequences are as follows: CFTR, L77683 and L77689; COL10A1, L77672; CSF1R, L77670; DCN1, L77648; GHR, L77673; GLB1, L77671; PVALB, L77685 and L77686; and RB1, L77669.

Appendix 2-3

Sequence of a portion of the FES proto-oncogene from several mammalian DNAs. Sequences are from exon 15 and intron 15. Notations for the sequence lineups are as follows: HUM, human; MAC, macaque; CAT, domestic cat; FES, feline sarcoma virus; DOG, dog; COW, ox; GOA, goat; HOR, horse; PIG, pig; RAT, rat; and MOU, mouse. The upper two lines for each block of text represent amino acid sequences and the lower lines represent nucleotide sequences. Dots indicate nucleotides in the various species that are identical to those of the human sequence. The human and cat sequences determined here exactly match the published sequences (Alcalay et al., 1990; Roebroek et al., 1987). The feline sarcoma virus sequence was not determined in this study but is included for comparative purposes. Only a single amino acid interchange was found among these sequences: isoleucine (I) for macaque, cat, and feline sarcoma virus, and leucine (L) in all others. Sequence alignments for the intron were done visually and may not be optimal. Genbank accession numbers for these sequences are as follows: MACFES, L77678; DOGFES, L77674; CATFES, L77675; COWFES, L77677; GOAFES, L77681; PIGFES, L77679; HORFES, L77676; RATFES, L77680; and MOUFES, L77682.

[Identity rows are partly illegible in the scan and are reproduced as scanned.]

MAC,CAT,FES  I
HUM  A D N T L V A V K S C R E T L P P D L K
HUM  GCCGACAACACCCTGGTGGCGGTGAAGTCTTGTAGAGAGACGCTCCCACCTGACCTCAAG
MAC  m.........Tm....A ................................. Am..
CAT  m.........Tm....Cm..Am...C.Cm..A. ........... Am..
FES  m........Tm.....Cm..Am...C.Cm..Am..Am...Am..
DOG  m.....T..T .............. Am..CCm....C ..................
COW  ..A ....................... Am...C.Cm..A ..................
GOA  ..A ....................... Am...Cm....A..G..Cm.........
PIG  ..A..T .................... Am..CCm.A .....................
HOR  ..T ....................... Am..CCm..........C..Gm......
RAT  ..Am.......Cm...Tm.........C ................... NW...
MOU  ................. Tm.........Cm...NNN .................

HUM  A K F L Q E A R
HUM  GCCAAGTTTCTACAGGAAGCGAG GTGGGTGATAAACTAATGATCACCACGGGTCCCGCAT
DOG  ....................... m....Cm.GmCm--..CA.A.CT..Am
CAT  T ..A.A ..Am.ACmAG..Cm--..CATAA.TW..C
FES  m........Tm.....A.A
COW  m........Cm........ m.......G.AC.CCCMA.TGTA..C CATA
GOA  G .A.. m.......G.AC.CCmA.TGTA..CmT.C.C
PIG  m........Gm........ m.........AG..CCM.TGTGATAAAAGA.CC
HOR  ................. G..A.. m.....CmAmCCm.TGGTAT.CTAA.G..
RAT  m........G..NNNNm.. m...Cm..A.GGGA.CAGT..A..T TTGTG
MOU  .................... A.. m.........Am.AT

Appendix 2-4

[Figure: agarose gel with lanes 1-8 and marker lane M; size scale in kb with marks at 1.5, 0.6, and 0.1.]

Amplification of a portion of the FES proto-oncogene from several mammalian DNAs using UM-STS primers. Target DNAs for each lane are as follows: 1, human; 2, pigtailed macaque; 3, dog; 4, goat; 5, pig; 6, horse; 7, mouse; and 8, rat. The mouse DNA here was degraded; strong amplification was obtained with another lot (sequence shown in Appendix 2-3). The DNA marker lane (M) contains a 100-bp ladder.

Appendix 2-5

Amplification Conditions for Canine UM-STSs

                                                     Size of PCR product (bp)
Locus      Temperatures (°C)      Times (min)        Human      Dog
CFTR       95, 57, 72             0.5, 1.5, 4        700        1000
COL10A1    94, 57, 72 (hs)a       1, 2, 3            384        384
CSF1R      94, 59, 72             1, 2, 3            730        730
CYP1A1     95, 57, 72             0.5, 1.5, 4        700        600
DCN1       94, 57, 72             1, 2, 3            1422       2000
FES        94, 57, 72             0.5, 1, 1.5        484        500
GHR        94, 57, 72             1, 2, 3            765        800
GLB1       94, 57, 72             1, 2, 3            238        240
PKLR       94, 59, 72             1, 2, 3            600        630
PVALB      94, 57, 72 (hs)        0.5, 1.5, 4        1400       1300
RB1        94, 59, 72             1, 2, 3            695        1300

a. hs indicates “hot start” used.
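The shared annealing temperatures in Appendix 2-5 reflect the design goal of matching primers within and between pairs. The primers in this study were matched using the nearest-neighbor data of Breslauer et al. (1986); as a much simpler stand-in, the Python sketch below uses the Wallace rule, Tm ≈ 2(A+T) + 4(G+C) °C, which is only a rough approximation for short oligonucleotides. The function names, the two-degree tolerance, and the example sequences are assumptions made for illustration.

```python
def wallace_tm(primer):
    """Approximate melting temperature (°C) by the Wallace rule:
    2 degrees per A or T plus 4 degrees per G or C."""
    seq = primer.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("primer must contain only A, C, G, T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def is_matched_pair(forward, reverse, tolerance=2):
    """True if the two primers' estimated Tm values differ by at most
    `tolerance` degrees."""
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= tolerance


# Hypothetical 20-mers with equal GC content.
forward = "ACGTACGTACGTACGTACGT"
reverse = "ATATATATATGCGCGCGCGC"
print(wallace_tm(forward), wallace_tm(reverse), is_matched_pair(forward, reverse))
```

A primer that fails the check can usually be brought into range by lengthening or shortening it by a base or two, which is the kind of adjustment the design section describes.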
Appendix 2-6

Summary of Amplification Results for UM-STSs for Several Mammalian DNAs (a)

Locus      Human   Macaque   Dog   Goat   Pig   Horse   Mouse   Rat
CFTR       +b      +         +     -      +     +       +       +
COL10A1    +       +         +     -      +     +       +       +
CSF1R      +       +         +     +      -     -       +       +
CYP1A1     +       +         +     +      +     +       +       +
DCN1       +       +         +     -      +     +       +       +
FES        +       +         +     +      +     +       +       +
GHR        +       +         +     +      +     +       +       +
GLB1       +       +         +     +      +     +       +       +
PKLR       +       +         +     +      +     -       +       +
PVALB      +       +         +     +      +     +       -       -
RB1        +       +         +     -      +     +       +       +

a. +, Amplification; -, no amplification.
b. Boldface symbols indicate index species (boldface is not preserved in this transcription).

[Rotated table following Appendix 2-6, apparently the UM-STS primer sets referenced in the text as Appendix 2-7, together with its footnotes on cycling conditions; illegible in the scan.]