APPLICATIONS OF WHOLE GENOME SEQUENCE DATA IN BEEF CATTLE GENOMICS By Muhammad Yasir Nawaz A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Genetics and Genome Sciences – Doctor of Philosophy 2023 ABSTRACT Over the last decade, genome sequencing technologies to obtain whole genome sequence (WGS) have improved a lot as they become more and more accessible and affordable to researchers in academia and industry. WGS provides an exciting opportunity to make use of rare variants and previously unidentified quantitative trait loci for various applications including GWAS and genomic prediction. However, WGS also comes with its own limitations beside the high cost of sequencing, which include high storage and memory requirements, and statistical and computational challenges related to high dimensional data. Therefore, there is a need to explore how WGS can be effectively utilized in beef cattle genomics. In this dissertation, I have analyzed simulated and real genomic data from various beef cattle breeds obtained using SNP chip technology as well as sequencing platforms to answer various questions on this end. First, I simulated matings of three beef cattle breeds (Angus, Hereford and Brahman) and their crosses over 20 generations to study the genomic prediction trends over time using dense SNP information either with or without known QTLs in the dataset. I demonstrate that while WGS may result in a minimal gain in prediction accuracy at a certain snapshot in time, collection of WGS data over the years will increase the accuracy of genomic prediction in beef cattle for distantly related animals. Second, I demonstrate that WGS imputation with a small reference population closely related to the target population is more accurate than a large multi-breed reference with distantly related animals. This implies that even though a large variety of dairy and beef breeds from around the world have already been sequenced, there is still a need to increase the representation of less European-centric breeds in sequencing efforts to improve the imputation of these breeds. Third, I observed that association analysis based on WGS resulted in stronger signals of association as compared to medium and high-density SNP chip genotypes, and identified novel loci regulating carcass traits. However, we observed no gain in prediction accuracy using imputed WGS as compared to medium and high-density genotype for genomic prediction in Korean beef cattle population using GBLUP methodology. We also observed that addition of SNPs pre-selected based on association analysis to medium and high-density genotypes may result in a slight increase in accuracy of prediction for traits regulated by large effect loci. Moreover, modelling epistatic effects explained a small proportion of genetic variance but failed to increase the prediction accuracy for additive genetic value and total genetic value. Lastly, I identified genomic selection signals in Angus and Hanwoo beef cattle by four distinct methods primarily based on allele frequency and haplotype structures surrounding SNP variants. We also combined these selection signals to obtain strong evidence of selection by composite selection signature score. Finally, we identified and confirmed various positional candidate genes located in selected regions related to beef production and quality. In conclusion, I have explored various strategies for effective utilization of WGS and identified novel genomic variants regulating meat traits in Hanwoo and Angus beef cattle. These analyses will help in effective application of WGS in beef cattle genomics. Copyright By MUHAMMAD YASIR NAWAZ 2023 This thesis is dedicated to my parents, who worked so hard to provide me quality education and motivated me to pursue excellence in my academic career. To my wife Fakiha, who has loved and supported me through difficult phases of my PhD and made sure I stay the course. To my daughter Amal, who was born during my PhD and brought me so much joy and happiness. To my siblings, who continue to remind me of our fond memories of childhood and keep making silly jokes whenever we talk. v ACKNOWLEDGMENTS First of all, I would like to thank my adviser, Dr. Cedric Gondro for his guidance and mentorship throughout my PhD. Cedric always allowed me to work independently and learn on my pace. It was due to his efforts and support that I have had several professional development opportunities over the last five years. I would forever be grateful for his kindness and unending support throughout my PhD research. I would also thank my PhD guidance committee members Dr. Juan Steibel, Dr. Ana Vazquez and Dr. Wolfgang Banzhaf for their technical feedback on my research throughout my PhD training. They are the best committee members I could have asked for, both in terms of technical skills and mentorship. I would also like to thank Alaina Burghardt who coordinated on administrative matters related to my graduate program, Dr. Cathy Ernst and Dr. Claire Veille who served as the directors of my PhD program. My work would have been impossible without the support of various post-doctoral researchers in our lab that helped me learn technical skills and trouble shoot when I am stuck. These include Beatriz Cuyabano, Gabriel Rovere, Rodrigo Savegnango and Kenneth Reid. I would like to thank fellow PhD students in our research group which include Hanna Ostrovski and Salman Ali, for their friendliness and emotional support during this journey. Lastly, I would like to thank Pakistani graduate students at MSU and a vibrant muslim community at large, who created an environment like home away from home. They provided me and my wife Fakiha, a support system that allowed us to enjoy our time in East Lansing. vi TABLE OF CONTENTS Chapter 1 INTRODUCTION……………………………………………………….…………………………………………………1 1.1. Genetic evaluation of livestock ......................................................................................1 1.2. Genomic revolution in cattle breeding ...........................................................................2 1.3. Adoption of genomic prediction in Beef cattle...............................................................5 1.4. Why do we need full sequence data?.............................................................................6 1.5. Imputation of genotypes ................................................................................................7 1.6. Hanwoo and Angus beef cattle.......................................................................................9 1.7. Project goals .................................................................................................................10 1.8. Publications ..................................................................................................................11 1.9. Conference presentations ............................................................................................12 REFERENCES…………………………..…………………………………………………………………………………………….14 Chapter 2 IMPROVING ACCURACY OF GENOMIC PREDICTION IN DISTANT POPULATIONS BY COLLECTING SEQUENCE DATA OVER GENERATIONS ...................................................................18 2.1. Abstract ........................................................................................................................18 2.2. Introduction ..................................................................................................................18 2.3. Methods .......................................................................................................................20 2.4. Results ..........................................................................................................................24 2.5. Discussion .....................................................................................................................28 2.6. Conclusions ...................................................................................................................30 REFERENCES………………………………………………………………………………………………………………………...32 Chapter 3 EVALUATION OF WHOLE-GENOME SEQUENCE IMPUTATION STRATEGIES IN KOREAN HANWOO CATTLE .........................................................................................................................34 3.1. Abstract ........................................................................................................................34 3.2. Introduction ..................................................................................................................35 3.3. Materials and Methods ................................................................................................37 3.4. Results and Discussion ..................................................................................................43 3.5. Conclusions ...................................................................................................................59 REFERENCES…………………………………………………………………………………………………………………………61 Chapter 4 VALUE OF SEQUENCE DATA IN GENOMIC PREDICTION AND ASSOCIATION ANALYSIS OF HANWOO BEEF CATTLE CARCASS TRAITS ...............................................................................65 4.1. Abstract .......................................................................................................................65 4.2. Introduction .................................................................................................................66 4.3. Materials and Methods ...............................................................................................69 4.4. Results and Discussion .................................................................................................74 4.5. Conclusions ..................................................................................................................86 REFERENCES…………………………………………………………………………………………………………………………88 Chapter 5 FOOTPRINTS OF SELECTION IN ANGUS AND HANWOO BEEF CATTLE USING IMPUTED WHOLE GENOME SEQUENCE DATA .............................................................................................92 vii 5.1. Abstract ........................................................................................................................92 5.2. Introduction ..................................................................................................................93 5.3. Materials And Methods ................................................................................................95 5.4. Results ..........................................................................................................................99 5.5. Discussion ...................................................................................................................109 5.6. Conclusions .................................................................................................................113 REFERENCES…………………………………………………………………………………………………………………….…115 Chapter 6 SUMMARY AND FUTURE DIRECTIONS .......................................................................119 APPENDIX ...................................................................................................................................124 viii Chapter 1 INTRODUCTION 1.1. Genetic evaluation of livestock Cows have been a major source of milk and meat and selective breeding of high performing cattle has been practiced for centuries using phenotypic selection. This refers to selection of high producing cows for breeding to produce succeeding generations. With advances in statistics and population genetics in 20th century, well defined breeding goals, refined genetic evaluation methods and selection strategies were developed. Phenotypic performance was divided into genetic and non-genetic components utilizing genetic relationships between animals (Fisher, 1918). Development of linear mixed model theory and simultaneous estimation of fixed effects and random effects by mixed model equations was introduced (Henderson, 1949). It was also realized that selecting only female animals was difficult because the stock numbers needed to be maintained at a certain level while much more intensive selection decisions could be made by selecting bulls. To achieve that, the genetic value of bulls were based on the performance records of their female descendants. Extensive data was recorded in breed specific herd books to keep track of performance and pedigree relationships of cows. This information was used to evaluate the genetic value of bulls for various traits of interest. This process is known as pedigree-based selection and is still partly in use in cattle breeding. Once a bull with a high genetic merit has been identified, its genetics can be propagated by the use of artificial insemination (which was initially developed to control for venereal diseases). However, pedigree- information can be unreliable and inaccurate in many livestock species, for example, in sheep and beef cattle where multiple 1 sire mating is common. It was observed that even minor errors in pedigrees could reduce the accuracy of breeding values (Sanders et al., 2006). Also due to large generation interval in cattle, genetic evaluation of bulls by progeny testing programs required years of data recording before a bull could be “proven” to be of a high genetic value. 1.2. Genomic revolution in cattle breeding The idea that genes determine physical features or performance has been around for centuries but the molecular knowledge about genes was minimal until the 2nd half of 20th century. After the discovery of DNA structure and function (Watson and Crick, 1953), it was widely anticipated to use specific DNA markers for improvement of genetic gain in cattle also known as marker assisted selection (Smith, 1967; Soller and Beckmann, 1983). This involved identification of genes controlling traits of interest and then using them in genetic evaluation (Fernando and Grossman, 1989). The adoption of this theoretical idea was limited mainly because of three reasons. Firstly, attempting to infer genetic relationships based on small number of markers led to poor estimates of genetic relationships (Wilson et al., 2003; Hayes and Goddard, 2008). Secondly, most production and health traits were regulated by a large number of loci with small effects (Hayes and Goddard, 2001). Lastly, the cost of genotyping a large number of markers simultaneously was simply unaffordable in the beginning. In the last two decades there have been three major breakthroughs that revolutionized cattle breeding industry. 1) In 2001, genomic selection methodology was introduced which assumed that thousands of genes regulate a trait and that all genetic markers may have an effect on the phenotype due to linkage with the quantitative trait loci (QTL) (Meuwissen et al., 2001). So, instead of testing the effects of individual markers, focus shifted on simultaneous estimation of 2 marker effects that are located across the genome. 2) Worldwide collaboration efforts like 1000 Bulls Genome Project between researchers to sequence cattle genome resulted in discovery of millions of genetic variants (also known as single nucleotide polymorphisms or SNPs) in cattle (Hayes and Daetwyler, 2019). 3) Decreased cost of genotyping thousands of SNPs mainly due to SNP-Chip genotyping technology that relies on DNA hybridization of sample DNA with DNA probes. More than 3 million dairy cows (Wiggans et al., 2017) and several millions of pigs and poultry have been genotyped using these arrays. These advancements helped in a wide adoption of genomic selection not only in cattle but other livestock species with varying degrees of success (van Eenennaam et al., 2014). One obvious problem with genomic selection was that not all animals are genotyped in a population. Phenotypes of ungenotyped animals need to be included in genomic selection to make full use of available data. The initial solution proposed was to do a multi-step genomic evaluation i.e traditional pedigree based breeding values (EBV) were calculated for ungenotyped individuals and genomic breeding values (GEBV) were calculated for genotyped and phenotyped individuals. Total EBVs were calculated by combining the GEBVs and EBVs (VanRaden, 2008). A more accurate solution is to calculate a relationship matrix for all animals (H) by combining information from genotypes and pedigree and performing genomic evaluation at once, thus known as single step genomic prediction, a widely used method in genomic prediction reviewed by Legarra et al., (2014). Overcoming technical and practical challenges discussed above, genomic selection is widely adopted in cattle today (starting in ~2008) as it presents several advantages over traditional pedigree-based selection. The biggest advantage of GS is that accurate genomic evaluation of 3 candidate calves can be done at birth simply by genotyping using a hair sample and prediction equations developed in a reference population. Hayes et al (B.J. Hayes et al. 2009) analyzed data from Australia, NewZealand, USA and Netherlands and concluded that accuracy of genomic estimated breeding values (GEBV) achieved were significantly higher than parental average breeding values. This helped avoid the long process of progeny testing which involves evaluation of daughter performance requiring at least 5 years. Genomic selection enables breeding from selected bulls can be done at 2 year of age rather 5 year or later. This not only led to a huge reduction in cost of breeding programs but also a much faster genetic gain in cattle as compared to traditional evaluation. Garcia-Ruiz et al (García-Ruiz et al. 2016) analyzed data after only 7 years of adoption of genomic selection in dairy cattle and found that the rate of genetic gain doubled for most yield traits and increased from three to four fold for lowly heritable traits like fertility and longevity. Another major advantage of genomic selection is that it enables selection for novel traits that are expensive or difficult to measure or measured later in life like methane emission, feed efficiency, longevity etc. This can be achieved by gathering phenotypic measurements on a suitable reference population to generate prediction equations. Future selection candidates or commercial animals can be evaluated by simply genotyping. This means that while pedigree-based selection focused mainly on yield traits, genomic selection enabled a more balanced selection across traits. Moreover, genomic selection can help manage inbreeding. Genotyping heifer calves for mate allocation such that inbreeding is minimized without compromising selection. Heifer genotypes can also be used for market allocation or culling decisions on the farm for beef cattle. In short, several applications of genomics have already been explored in cattle. 4 1.3. Adoption of genomic prediction in Beef cattle Genomic prediction methodology has being applied in beef cattle industry worldwide and selection programs are largely regulated by breed specific societies. In USA alone, more than 750,000 Angus cattle have been genotyped (personal communication with Angus Genetics Inc.). There are various other breed societies that have their own selection programs. Farmers register their herds with the society, and provide samples and data for genetic evaluation of their cattle. These genetic evaluation services are available to farmers at a nominal cost (~40$ per animal) which makes it a profitable strategy. Still, beef cattle industry faces two main challenges to a wide adoption of genomic selection which are discussed below. Firstly, the accuracies of genomic estimated breeding values are generally lower in beef cattle due to poor quality of reference populations (van Eenennaam et al., 2014). Unlike dairy, beef industry is highly segmented that consists of a large number of beef breeds and admixed cattle. A lot of beef farms are also free range with minimal interventions from farmers and poor adoption of artificial insemination. Therefore, there is a lot of genetic variability within and between breeds. This makes it difficult to maintain a high-quality reference population of a large enough size for most breeds. One way to cope up with this challenge is to use a multi breed reference data set because biologically it makes sense to assume that the same genes regulate a trait in different breeds. However, as most traits are polygenic and true quantitative trait loci (QTLs) are unknown we only make use of linkage (disequilibrium) between SNPs and QTLs for genomic selection. This linkage or correlation does not persist well between animals that are distantly related or belong to a different breed leading to poor accuracies of across breed and multibreed prediction. 5 Secondly, the economic prospects of genomic selection are not as great as in dairy industry (partly due to low accuracies but) mainly because there is no one beneficiary of a genetic improvement program who is willing to pay for genotyping and phenotyping of a large reference population. Beef cattle frequently change ownership in the production chain moving from farms to feedlots to processors. Different traits are recorded at different sectors of the industry and they are never communicated back in the production chain. Consequently, it becomes difficult to make genetic improvement in those traits. 1.4. Why do we need full sequence data? One of the widely discussed topics in genomic prediction is the fact that not all of the genetic variance is captured by common SNP markers. In fact, if we focus on SNPs identified by GWAS and use them for heritability estimation or genomic prediction, results are even more discouraging. For example, a classic trait to study this effect has been human height. Twin cohort studies revealed that adult human height heritability is around 0.8 (Jelenkovic et al., 2016) but significant markers from GWAS could only capture a small fraction of heritability (h=0.05 (Maher, 2008). These results attracted wide-spread attention of all stake holders in genomic revolution including farmers, biotechnology companies and government institutions. The same was observed for most other quantitative traits in multiple studies (Yang et al., 2010; Visscher et al., 2017). There could be two possible explanations for this disparity: 1) imperfect tagging of causal markers by SNP panels i.e. the actual QTLs were not included in the data and the SNP markers are unable to pick up QTL effects for being poorly correlated with the QTLs 2) overestimation of pedigree-based heritability. 6 A decade after the case of missing heritability was first pointed out, it has now become clear that rare genomic variants largely account for the missing heritability in complex traits (Wainschtein et al., 2022). Scientists were able to recover almost all of the heritability using whole genome information of more than 25,000 individuals and demonstrated that rare variants in regions of low linkage disequilibrium accounted for missing heritability of complex traits. Therefore, efforts to genotype animals on more genomic variants have resulted in an enormous increase in density of SNP information. Dense SNP marker panels with up to 770,000 SNPs are now readily available and utilized in cattle breeding. The number of animals with whole genome sequence are also increasing rapidly especially for cattle breeds of European origin. 1.5. Imputation of genotypes Genotype imputation is one of the most commonly used tools in genomics which refers to estimating unknown genotypes in a sample (also referred to as target) that contains missing genotypes, based on a set of reference individuals that have been genotyped on a higher density chip array, or sequenced. There are various methods for imputation, the common theme between the methods is identifying short DNA segments shared between the target and reference samples to represent the target as an assortment of short reference haplotypes. Normally a target haplotype can be represented by a number of possible combinations of short reference segments. Therefore, imputation step requires a framework based on probabilities of imputed alleles which makes it computationally intensive. Pre-phasing or haplotype estimation before imputation helps to reduce the complexity of imputation because it allows comparison against phased haplotypes rather than all pairs of haplotypes (Howie et al., 2012). Most commonly used imputation software today (Beagle, Minimac and IMPUTE) rely on prephasing 7 for faster imputation. The computational cost of imputation has decreased considerably since the first imputation software was released (Scheet and Stephens, 2006). Strategies to decrease the computational time and resources include the use of a custom subset of reference panel closely related to the target, prephasing, parallelizing and multithreading and development of specialized compression formats for reference data (M3VCF, bref) (Das et al., 2018). Accuracy of imputed genotypes relies on a number of factors. Firstly, the sample size and genomic diversity of reference haplotypes as well as their relationship with the target haplotypes is important. A large reference panel closely related to the target population results in higher imputation accuracy. Worldwide collaborative efforts in the last decade have resulted in large reference panels for imputation like the 1000 Bulls Genome Project which currently includes more than 6000 sequenced animals from a number of breeds. In biomedical research, web imputation servers have been developed to allow researchers to remotely submit their target genotypes for imputation. The server carries out automated imputation and quality control and provides the link to download imputed data once the imputation process is complete. Such efforts need to be replicated in livestock so that researchers can spend more time on interpreting their data rather than learning the nitty gritty details of imputation. Using imputed genotypes may or may not help improve genomic prediction based on the population structure. In cattle, the gain in additive genomic variance was negligible because of high LD (Druet et al., 2014; van Binsbergen et al., 2015). However, in humans it has been shown that imputed genotypes recover some of the missing heritability (Yang et al., 2015), but more importantly not all of it. This is due to rare variants that account for most of the missing heritability in quantitative traits (Wainschtein et al., 2022). Rare variants are difficult to impute 8 with high accuracy simply because they are rare. Larger reference sets with high coverage might help improve the imputation accuracy of rare variants. 1.6. Hanwoo and Angus beef cattle In this research thesis, we used genomic data from various cattle breeds, but majority of projects involved Korean Hanwoo cattle. This is because Hanwoo cattle population have unique characteristics that provide an opportunity to learn more about genomic imputation, genomic prediction and genomic basis of beef quality. Hanwoo are genetically distinct from Western taurine cattle and closely related to Eastern taurine like Wagyu and Akaushi cattle (Lee et al., 2014). They are raised under special care and production system in small pens that allow little movement and a long fattening period with an average age at slaughter of 31 months (Jo et al., 2012)- a year longer than most beef cattle raised in USA. Inherently high intramuscular fat content coupled with a long fattening period results in highly marbled Hanwoo beef with a unique flavor that is popular in Korea despite its high price tag. Although Korea is one of the biggest importers of beef in the world, people are willing to pay a high price to eat Hanwoo beef. Therefore, a specialized breeding program has been going on for Hanwoo that is highly focused on meat marbling. Hanwoo beef is graded into 5 groups i.e. 1++, 1+, 1, 2, 3 where 1++ is considered as the premium quality and contains the highest percentage of intramuscular fat 21.48% as compared to less than 6% in grade 3 (Jo et al., 2012). Other factors effecting meat quality include meat color, fat color, firmness, texture, and maturity of Longissimus dorsi muscle at the 13th rib interface. Several studies have reported that beef with higher marbling score has greater tenderness and juiciness simply due to dilution of dense muscle matrix in beef and 9 influences meat flavor, lipid oxidation and beef color (Bonnet et al., 2010; Hocquette et al., 2010; Gagaoua et al., 2018). The second major breed in our study is Angus and it served as a perfect contrast to Hanwoo in terms of physical attributes, meat quality, genome architecture, demographic history and breeding program. Originating from Europe, Angus cattle were first imported in United States in late 19th century and became popular for being naturally polled, higher growth rate and yielding well marbled flavorful beef. Today, American Angus association is the nation’s largest beef breed organization indicating the success and popularity of Angus cattle. 1.7. Project goals There is a need to evaluate how whole genome sequence data can be useful for genomic improvement of beef cattle in a wide range of applications like genotype imputation, genomic prediction, and GWAS. We first conducted a simulation study evaluating genomic prediction scenarios where whole genome sequence data may or may not be useful. Secondly, we evaluated imputation strategies to obtain correctly imputed data for Hanwoo cattle which is a popular Korean breed of cattle distantly related to commonly used beef breeds in the world. Thirdly, we evaluated the value of imputed whole genome sequence data for genomic prediction and variance component estimation of meat traits. Finally, we compared imputed whole genome sequence of Hanwoo and Angus beef cattle to identify regions of genome under strong selection pressure. Specifically, our aims included: 10 1. Analyze the effect of having true QTLs in the SNP set on prediction accuracy as the genetic distance between reference and test population increased over generations and across breeds. 2. Evaluate the imputation accuracies obtained when using single-breed (Hanwoo) and multi-breed reference panels to impute Hanwoo genotypes up to whole genome sequence 3. Perform GWAS based on imputed sequence data for carcass traits in Hanwoo, evaluate accuracy of genomic prediction between commonly used genotyping platforms and imputed sequence data, and compare the proportion of phenotypic variance captured by additive and epistatic genetics components and their effect on predictive ability 4. Identify and compare genome wide signals of selection in Angus and Hanwoo beef cattle using imputed whole genome sequence (WGS) data 1.8. Publications 1. Findings from Chapter 2 of this thesis were published as a short four page article in proceedings of World Congress on Genetics Applied to Livestock Production. Nawaz, M. Y., and C. Gondro. "Improving accuracy of genomic prediction in distant populations by collecting sequence data over generations." World Congress on Genetics Applied to Livestock Production, Rotterdam, The Netherlands (2022) https://www.wageningenacademic.com/pb-assets/wagen/WCGALP2022/02_003.pdf 2. Chapter 3 of this thesis has been published as a research article in Animals as referenced below: Nawaz, M.Y.; Bernardes, P.A.; Savegnago, R.P.; Lim, D.; Lee, S.H.; Gondro, C. “Evaluation of 11 Whole-Genome Sequence Imputation Strategies in Korean Hanwoo Cattle”. Animals 2022, 12, 2265. https://doi.org/10.3390/ani12172265 3. Nascimento, A. v., Romero, A. R. S., Nawaz, M. Y., Cardoso, D. F., Santos, D. J. A., Gondro, C., et al. (2021). An updated Axiom buffalo genotyping array map and mapping of cattle quantitative trait loci to the new water buffalo reference genome assembly. Anim Genet 52. doi: 10.1111/age.13103. Under preparation: Chapter 4 and 5 of this thesis are currently under preparation to be submitted for publication in Genes and Frontiers of Genetics respectively. These are being prepared with the help of coauthors as mentioned below: 1. Nawaz, M.Y., Cuyabano, B.C.D., Savegnago, R.P., Lim, D., Lee, S.H., and Gondro, C., “Value of Sequence Data in Genomic Prediction and Association Analysis of Hanwoo Beef Cattle Carcass Traits” 2. Nawaz, M.Y., Savegnago, R.P., Lim, D., Lee, S.H., and Gondro, C. Footprints of selection in Angus and Hanwoo beef cattle using imputed whole genome sequence data 1.9. Conference presentations 1. Nawaz, M. Y., Ostrovski, H., Savegnango, R. P., Ackerson, L. K., & Gondro, C. (January 2023, January) “Breed identification by mobile sequencing” Plant and Animal Genome XXX Conference, San Diego, CA. https://pag.confex.com/pag/xxvii/meetingapp.cgi/Paper/35811 2. Nawaz, M. Y., Savegnago, R. P., & Gondro, C. (2021, November). Footprints of Selection in Angus and Hanwoo Beef Cattle Using Imputed Whole Genome Sequence Data. In Journal Of Animal Science (Vol. 99, Issue 1, pp. 25-25) 12 3. Nawaz, M.Y., Lee, S.H., Lim, D., Park, B., Kim, S.D., Gondro, C. (2019, January) “Genetic Analysis of Carcass Quality Traits in Korean Hanwoo Cattle” Plant and Animal Genome XXVII Conference, San Diego, CA. https://pag.confex.com/pag/xxvii/meetingapp.cgi/Paper/35811 13 REFERENCES Bonnet, M., Cassar-Malek, I., Chilliard, Y., and Picard, B. (2010). Ontogenesis of muscle and adipose tissues and their interactions in ruminants and other species. Animal 4. doi: 10.1017/S1751731110000601. Das, S., Abecasis, G. R., and Browning, B. L. (2018). Genotype imputation from large reference panels. Annu Rev Genomics Hum Genet 19, 73–96. doi: 10.1146/annurev-genom-083117- 021602. Druet, T., Macleod, I. M., and Hayes, B. J. (2014). Toward genomic prediction from whole-genome sequence data: Impact of sequencing design on genotype imputation and accuracy of predictions. Heredity (Edinb) 112, 39–47. doi: 10.1038/hdy.2013.13. Fernando, R., and Grossman, M. (1989). Marker assisted selection using best linear unbiased prediction. Genetics Selection Evolution 21. doi: 10.1186/1297-9686-21-4-467. Fisher, R. A. (1918). The Correlation between Relatives on the Supposition of Mendelian Inheritance. Transactions of Royal Society of Edinburgh 52, 399–433. Gagaoua, M., Picard, B., and Monteils, V. (2018). Associations among animal, carcass, muscle characteristics, and fresh meat color traits in Charolais cattle. Meat Sci 140. doi: 10.1016/j.meatsci.2018.03.004. García-Ruiz, A., Cole, J. B., VanRaden, P. M., Wiggans, G. R., Ruiz-López, F. J., and van Tassell, C. P. (2016). Changes in genetic selection differentials and generation intervals in US Holstein dairy cattle as a result of genomic selection. Proceedings of the National Academy of Sciences 113, E3995–E4004. doi: 10.1073/pnas.1519061113. Hayes, B., and Goddard, M. E. (2001). The distribution of the effects of genes affecting quantitative traits in livestock. Genetics Selection Evolution 33. doi: 10.1051/gse:2001117. Hayes, B. J., Bowman, P. J., Chamberlain, A. J., and Goddard, M. E. (2009). Invited review: Genomic selection in dairy cattle: Progress and challenges. J Dairy Sci 92, 433–443. doi: 10.3168/jds.2008-1646. Hayes, B. J., and Daetwyler, H. D. (2019). 1000 Bull Genomes Project to Map Simple and Complex Genetic Traits in Cattle: Applications and Outcomes. Annu Rev Anim Biosci 7, 89–102. doi: 10.1146/annurev-animal-020518-115024. Hayes, B. J., and Goddard, M. E. (2008). Technical note: Prediction of breeding values using marker-derived relationship matrices. J Anim Sci 86. doi: 10.2527/jas.2007-0733. Henderson, C. R. (1949). Estimation of changes in herd environment. J Dairy Sci 32, 706. Available at: http://acteon.webs.upv.es/ARTICULOS/HENDERSON 1949.pdf. 14 Hocquette, J. F., Gondret, F., Baza, E., Mdale, F., Jurie, C., and Pethick, D. W. (2010). Intramuscular fat content in meat-producing animals: Development, genetic and nutritional control, and identification of putative markers. Animal 4. doi: 10.1017/S1751731109991091. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J., and Abecasis, G. R. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44. doi: 10.1038/ng.2354. Jelenkovic, A., Sund, R., Hur, Y. M., Yokoyama, Y., Hjelmborg, J. V. B., Möller, S., et al. (2016). Genetic and environmental influences on height from infancy to early adulthood: An individual- based pooled analysis of 45 twin cohorts. Sci Rep 6. doi: 10.1038/srep28496. Jo, C., Cho, S. H., Chang, J., and Nam, K. C. (2012). Keys to production and processing of Hanwoo beef: A perspective of tradition and science. Animal Frontiers 2. doi: 10.2527/af.2012-0060. Lee, S.-H., Park, B.-H., Sharma, A., Dang, C.-G., Lee, S.-S., Choi, T.-J., et al. (2014). Hanwoo cattle: origin, domestication, breeding strategies and genomic selection. J Anim Sci Technol 56, 2. doi: 10.1186/2055-0391-56-2. Legarra, A., Christensen, O. F., Aguilar, I., and Misztal, I. (2014). Single Step, a general approach for genomic selection. Livest Sci 166. doi: 10.1016/j.livsci.2014.04.029. Maher, B. (2008). Personal genomes: The case of the missing heritability. Nature 456. doi: 10.1038/456018a. Meuwissen, T. H. E., Hayes, B. J., and Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. doi: 11290733. Nascimento, A. v., Romero, A. R. S., Nawaz, M. Y., Cardoso, D. F., Santos, D. J. A., Gondro, C., et al. (2021). An updated Axiom buffalo genotyping array map and mapping of cattle quantitative trait loci to the new water buffalo reference genome assembly. Anim Genet 52. doi: 10.1111/age.13103. Nawaz, M. Y., Bernardes, P. A., Savegnago, R. P., Lim, D., Lee, S. H., and Gondro, C. (2022). Evaluation of Whole-Genome Sequence Imputation Strategies in Korean Hanwoo Cattle. Animals 12, 2265. doi: 10.3390/ani12172265. Nawaz, M. Y., and Gondro, C. (2022). Improving accuracy of genomic prediction in distant populations by collecting sequence data over generations. in World Congress on Genetics Applied to Livestock Production. Sanders, K., Bennewitz, J., and Kalm, E. (2006). Wrong and missing sire information affects genetic gain in the Angeln dairy cattle population. J Dairy Sci 89. doi: 10.3168/jds.S0022-0302(06)72096- 3. 15 Scheet, P., and Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78. doi: 10.1086/502802. Smith, C. (1967). Improvement of metric traits through specific genetic loci. Anim Prod 9. doi: 10.1017/S0003356100038642. Soller, M., and Beckmann, J. S. (1983). Genetic polymorphism in varietal identification and genetic improvement. Theoretical and Applied Genetics 67. doi: 10.1007/BF00303917. van Binsbergen, R., Calus, M. P. L., Bink, M. C. A. M., van Eeuwijk, F. A., Schrooten, C., and Veerkamp, R. F. (2015). Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle. Genetics Selection Evolution 47, 1–13. doi: 10.1186/s12711-015-0149-x. van Eenennaam, A. L., Weigel, K. A., Young, A. E., Cleveland, M. A., and Dekkers, J. C. M. (2014). Applied Animal Genomics: Results from the Field. Annu Rev Anim Biosci 2, 105–139. doi: 10.1146/annurev-animal-022513-114119. VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. J Dairy Sci 91, 4414– 23. doi: 10.3168/jds.2007-0980. Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., et al. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet 101, 5–22. doi: 10.1016/j.ajhg.2017.06.005. Wainschtein, P., Jain, D., Zheng, Z., Aslibekyan, S., Becker, D., Bi, W., et al. (2022). Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nat Genet 54, 263–273. doi: 10.1038/s41588-021-00997-7. Watson, J. D., and Crick, F. H. C. (1953). Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature 171, 737–738. doi: 10.1038/171737a0. Wiggans, G. R., Cole, J. B., Hubbard, S. M., and Sonstegard, T. S. (2017). Genomic Selection in Dairy Cattle: The USDA Experience∗. Annu Rev Anim Biosci 5. doi: 10.1146/annurev-animal- 021815-111422. Wilson, A. J., McDonald, G., Moghadam, H. K., Herbinger, C. M., and Ferguson, M. M. (2003). Marker-assisted estimation of quantitative genetic parameters in rainbow trout, Oncorhynchus mykiss. Genet Res 81. doi: 10.1017/S0016672302006055. Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A. A. E., Lee, S. H., et al. (2015). Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet 47. doi: 10.1038/ng.3390. 16 Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42. doi: 10.1038/ng.608. 17 Chapter 2 IMPROVING ACCURACY OF GENOMIC PREDICTION IN DISTANT POPULATIONS BY COLLECTING SEQUENCE DATA OVER GENERATIONS 2.1. Abstract Genomic prediction based on single nucleotide polymorphism (SNP) panels is highly dependent on the linkage (LD) between SNP markers and true QTLs regulating the trait being predicted. In this study we tested the hypothesis that if target populations drift away from the reference population due to successive breeding or crossbreeding, accuracy of genomic prediction can be significantly improved by including QTLs in genomic evaluation. Genotypes of 2500 animals for each of three breeds were available for analysis i.e. Angus, Hereford and Brahman. Genotypes were phased and subset to 29632 SNPs common among breeds. 20 generations of matings were simulated for each breed and their crosses. Three different scenarios were created for genomic prediction i.e. within breed, crossbred and across-breed for three sets of SNP panels namely SNP, QTL, and SNP and QTL. Genomic prediction was performed by GBLUP methodology. When QTL were added in the dataset, the decline in prediction accuracy in generations was lower and the accumulation of data over generations led to a higher increase in accuracy as compared to SNPs, especially for crossbred and across breed scenarios. Our results demonstrate that although including QTL information may appear to result in a limited increase at a certain point in time, collecting such data over generations will be valuable in beef cattle breeding. 2.2. Introduction Genetic improvement of farm animals involves artificial selection based on genetic ranking of animals, also known as estimated breeding values (EBVs). Accuracy of EBVs is a key element for 18 success of any genetic improvement program. Utilizing genomic prediction methodology (Meuwissen et al., 2001; VanRaden, 2008), EBVs can be reliably estimated at a young age which has led to a drastic improvement in cattle breeding in the past decade by accelerating the pace of genetic improvement. Genomic prediction based on single nucleotide polymorphism (SNP) panels is highly dependent on the linkage (LD) between SNP markers and true QTLs regulating the trait being predicted. As reproduction involves random recombination of genetic material in each generation, the linkage between SNPs and QTLs typically does not persist in subsequent generations. Similarly, animals from a different part of the world, or those from a different breed of the same species, also exhibit unique genetic structure and LD pattern between markers (de Roos et al., 2008). This resulted in reduced prediction accuracy across generations (Muir, 2007), across countries (Saatchi et al., 2013) and across breeds as demonstrated by various studies in cattle (Hayes et al., 2009a; Mason et al., 2012; Kachman et al., 2013). Therefore, design of a reference population played an important role in accurate EBV estimation. It is already well documented in the literature that a large reference population that consists of animals distantly related to each other but closely related to the target population is best suited for genomic prediction (Hayes et al., 2009b; Clark et al., 2012; Pszczola et al., 2012; Pszczola and Calus, 2016). It was predicted that whole genome sequence information will increase the accuracy of prediction (Meuwissen and Goddard, 2010) as it will include most (if not all) variants in a population. A small fraction of those millions of variants can be expected to be quantitative trait loci (QTL) of traits. Since QTL of most quantitative traits have very small effects, it requires very large datasets to identify them and directly use them for prediction. Still, we can use them in 19 genomic prediction by assuming an infinitesimal model i.e. SNP effects come from a normal distribution, as in the case of genomic best linear unbiased prediction (GBLUP) methodology. Yet, studies have reported minimal gain in prediction accuracy using WGS information as compared to dense SNP genotypes of medium and high density in cattle (van Binsbergen et al., 2015; Calus et al., 2016). These results were attributed to high LD in cattle genome, and inability to estimate accurate SNP effects due to small sample sizes with millions of SNP markers. In this study we test the hypothesis that if target populations drift away from the reference population due to successive breeding or crossbreeding, accuracy of genomic prediction can be improved by including QTLs in genomic evaluation. We test this hypothesis by simulating matings of three baseline populations of cattle i.e. Angus, Hereford and Brahman, and their crosses for 20 generations using dense marker information. Main objectives of this study were to quantify the decrease in prediction accuracy as the genetic distance between reference and test population increased over generations in beef cattle, analyze the effect of having true QTLs in the SNP set on prediction accuracy over generations, and demonstrate the impact of accumulating generations in reference population over time. We examined all these research questions in the contexts of within-breed, crossbred and across-breed predictions using three scenarios i.e. when SNPs (dense SNP marker information without causal QTL), SNPs and QTLs (SNP markers and causal QTL together), and QTLs (only causal QTLs to represent an ideal world scenario) were used for prediction. 2.3. Methods Genotypes of three breeds were available for analysis i.e. Angus, Hereford and Brahman. 2500 animals were used for each breed. These genotypes were phased and imputed separately for 20 each breed using eagle software, converted to vcf format using shapeit software and finally into a numerical matrix using package splitstackshape in R version 4.0.2. The three datasets were subset down to 29632 SNPs that were common between them. Matings were simulated using R package epinetR. First, a moderately heritable phenotype regulated by 100 QTL was simulated under an additive model using a baseline population of Angus cattle. The SNP effects were extracted and used to create baselines of Hereford and Brahman cattle. Crossbred genotypes were then simulated using the same software with equal contribution from parental breeds (Figure 2.1). Figure 2.1: Flow diagram of simulation of breeding for 20 generations under selection pressure. Once we had the baseline populations, we simulated matings under selection for 20 generations for each of purebred and crossbred base populations. For this step we used function runSim in R package epinetR with following parameters: selection =random, truncSire=0.3 (top 30% of males 21 in the population with highest phenotypic value were selected during each generation), truncDam=0.8 (top 80% of females in the population with highest phenotypic value were selected during each generation), breedSire=50 (maximum number of offsprings per sire). These parameters were chosen to represent the current structure of beef industry in the best possible way. The genomic structure of populations was analyzed using principal component analysis (Figure 2.2). Three different scenarios were created for genomic prediction i.e. within breed, cross-breed and across breed for three sets of SNP panels namely SNP, QTL, and SNP&QTL. For every scenario we evaluated trends in genomic prediction accuracy in three ways: a) against an increase in the number of generations between training and testing population, b) by accumulating training populations backward from the testing population, c) accumulating training populations towards the testing population. Genomic prediction was performed by genomic best linear unbiased prediction methodology (GBLUP) using genomic relationship matrix (G= ZZ’ where Z is a standardized matrix of genotypes) to account for genomic relationships between animals (VanRaden, 2008). Breeding values were estimated using mix model equations: 0) # = 𝑮(𝒕𝒆𝒔𝒕%𝒕𝒓𝒂𝒊𝒏) ×𝒕𝒆𝒔𝒕 [𝑮𝒕𝒆𝒔𝒕×𝒕𝒆𝒔𝒕 + 𝑰𝝀]-𝟏 (𝒚 − 𝑿𝒃 𝒂 Where 𝑎3 represents the estimated breeding values, 𝑮(𝒕𝒆𝒔𝒕%𝒕𝒓𝒂𝒊𝒏) ×𝒕𝒆𝒔𝒕 indicates the genomic relationships of animals in the testing population with all other animals, 𝑮𝒕𝒆𝒔𝒕×𝒕𝒆𝒔𝒕 marks the genomic relationship matrix of test population only, 𝑰 is an identity matrix, and 𝝀 denotes a scale 𝝈𝟐𝒆5 parameter defined as the ratio between error variance and genetic variance . 𝝈𝟐𝒖 22 A B C D E F Figure 2.2: A) Plot of first two principal components of baseline populations of all breeds. B-F) Plots of first two principal components of simulated genotypes of 20 generations for each breed. 23 The accuracy of prediction was defined as the Pearson’s correlation between estimated breeding values and phenotypes of test population. Finally, simulations of matings of all the breeds and subsequent genomic prediction accuracy was analyzed for 100 iterations to report the average accuracies for all scenarios. 2.4. Results 2.4.1. Within breed prediction scenarios For genomic prediction within a breed, accuracy of prediction was the highest when reference population was one generation apart from the test population. As the distance between reference and test population increased, the accuracy of prediction decreased for the first 10 generations until it hit a baseline level (Figure 2.3A). When QTL were added to the SNP set, it was observed that the prediction accuracy was consistently higher (1.46 times) then the SNPs, and the decrease in prediction accuracy was smaller (0.77 times) over generations (Table 2.1). In an ideal scenario, where we knew the actual QTL and used them for prediction, the accuracy of prediction remained the highest, and did not decrease with an increase in genetic distance between the reference and training populations (Figure 2.3A). When reference populations were accumulated over generations, we saw an increase in prediction accuracy for first 2-3 generations (Figure 2.3B). From then on, the accuracy of prediction did not increase even though the size of reference population consistently increased due to addition of more generations. If QTL were included in the SNP set, it was observed that the accuracy of prediction continued to increase with generations in the reference dataset, this increase was 4.14 times more than using SNP information only. Again, in an ideal scenario where we only used the real QTL markers, the accuracy of prediction remained consistently high. 24 Table 2.1: Summary of accuracies of prediction for various scenarios. Ratio of SNPs accuracy Metric Population SNPs & (SNPs & QTLs QTLS/ SNPs) Average accuracies of prediction Within-breed 0.147 0.214 1.46 when distance between reference and validation animals ranged Crossbred 0.100 0.178 1.78 between 1 to 19 generations Across-breed 0.082 0.164 2.05 Average decrease in prediction Within-breed 0.226 0.174 0.77 accuracy when reference and Crossbred 0.118 0.083 0.70 validation animals are 19 generations apart Across-breed 0.008 0.005 0.625 Within-breed 0.037 0.153 4.14 Average gain in prediction accuracy by collecting 19 generations in Crossbred 0.099 0.255 2.58 reference population Across-breed 0.033 0.229 7.63 Figure 2.3: A) Accuracy of genomic prediction within a breed, across scenarios, as a function of distance in terms of number of generations between reference and test populations. B) Accuracy of genomic prediction, across scenarios against cumulative generations backwards in the reference population. 25 2.4.2. Crossbred prediction scenarios Similar trends in prediction accuracy were observed for prediction of breeding values in crossbred animals using either of the parental pure breeds or a combination of parental breeds while keeping the size of reference population constant. The accuracy of prediction decreased consistently over generations when only SNP data was used. Having QTL in the SNP set improved the accuracy of prediction and led to a smaller decrease over first five generations. After that, the prediction accuracy remained constant over distant generations. Out of purebred reference populations, Angus reference population yielded higher prediction accuracies as compared to Hereford or Brahman for their respective crosses. Even a combination of Angus and Hereford, or Angus and Brahman performed worse than pure Angus reference population. Figure 2.4: A) Accuracy of genomic prediction for crossbred target populations, across scenarios, as a function of genetic distance in terms of number of generations between reference and test populations. B) Accuracy of genomic prediction, across scenarios against cumulative generations backwards in the reference population. 26 This difference became less evident if QTLs were included in the SNP set. However, when using QTL only, Angus reference population performed worse than either of Hereford and Brahman, or their combination. Finally, A small increase in prediction accuracy was observed by accumulating generations when only SNPs information was used (Fig 3b). When both SNPs and QTLs were used, the prediction accuracy increased consistently over generations and the increase was 2.58 times the corresponding increase when only SNPs were used. 2.4.3. Across breed prediction scenarios For across breed prediction, the accuracy of prediction remained constant over all generations. Like all previous scenarios, including QTL in the SNP set improved the prediction accuracy and using only QTL for prediction yielded accuracies of prediction comparable to corresponding scenarios in within breed prediction or prediction of crossbreds. It was also observed that taurine breeds (Angus and Hereford) predicted each other better as compared to predicting Brahman or using Brahman to predict taurine breeds. This was true in scenarios which used SNPs or SNPs and QTL both in the model. However, using QTL only to predict the other breeds, Angus reference population performed the worst. Its prediction accuracy was even lower than Hereford predicting Brahman cattle and vice versa, which were genetically far away from each other. This result is significant as it shows that even if we know the true QTL, their effects may not be accurately estimated in a reference population that is highly inbred. When generations were gathered to predict across breeds, the prediction accuracy improved slightly, but remained very low (Fig 4b). When QTLs were added with SNPs, the prediction accuracy increased consistently, this increase was 7.63 times the corresponding increase when 27 only SNPs were used (Table 1). Finally, when only QTLs were used in the model, the prediction accuracy was high but constant over generations. Figure 2.5: A) Accuracy of genomic prediction across different breeds and scenarios, as a function of genetic distance in terms of number of generations between reference and test populations. B) Accuracy of genomic prediction, across scenarios against cumulative generations backwards in the reference population. 2.5. Discussion In this study our simulations demonstrated that if the actual QTLs are part of the SNP panel: 1) the prediction accuracy improved, 2) the decay in prediction accuracy over generations was smaller and 3) the advantage of collecting genotype data over generations was higher. There was clear evidence that the prediction accuracy increased when the actual QTLs were part of the SNP data. The same is true for the value of accumulating genotypes over generations which is significantly higher when the QTLs are genotyped, although the gains are less pronounced in within breed predictions. 28 Accumulating generations of phenotyped and genotyped animals was expected to increase the prediction accuracy as it also meant larger reference populations in each generation. However, the magnitude of this increase was still low in within breed predictions by using SNPs without any QTL information. In fact, a small increase was only observed for 2-3 most recent generations, adding further generations did not result in any increase in prediction accuracy. It was only by addition of true QTLs in the SNP data that the real benefit of accumulating data over generations was observed. This result has important implications for breeding companies and breed organizations that have been collecting genotype information for more than a decade now which could be imputed up to WGS to improve prediction accuracy – a strategy that could be especially relevant for across breed or multibreed evaluation programs since our results demonstrated that the benefit of including QTLs was more evident when there was a larger genetic distance between the reference and target populations. This can not only yield higher prediction accuracies, but also result in powerful studies to identify highly predictive SNPs (Mason et al., 2012; Mulder et al., 2012) . Alternatively, strategies can be devised to drop the information of older animals from the model which would result in avoiding the computational challenge of genomic evaluation of thousands of animals. In our study, we also observed that within-breed prediction performed the best while across breed prediction yielded lowest accuracies. Gathering SNP data (without QTLs) for generations in crossbred and across breed prediction did not help to increase the accuracy of prediction. However, addition of causal QTLs to SNP set yielded a larger increase in prediction for crossbred and across-breed prediction and gathering information over generations was led to an increase in accuracy of prediction (Figures 3,4 , Table 1). 29 The three SNP sets used in this study serve as a proxy for three models. Firstly, SNPs without QTL data represented widely used SNP chip data which does not include most of causal QTLs for traits. SNPs and QTLs panel represents whole genome sequence data which in principle should include most of causal variants and other correlated SNPs. Finally, QTL only panel represented an ideal world situation where we can identify and utilize only those SNPs that were highly predictive for a phenotype. Such SNPs can only be identified by powerful studies based on enormous sample sizes which is not possible at the moment but may be possible in near future by pooling large datasets collected over the years. In fact, some studies have already shown benefits of selecting highly predictive features to increase prediction in distant populations (Raymond et al., 2018a)(Wientjes et al., 2015; van den Berg et al., 2016). One possible limitation of the study is that it does not address the n << p problem - frequently discussed when it comes to practical use of WGS data for cattle breeding. Imputation of millions of already genotyped animals up to WGS is one possible solution that should be investigated in future studies. It is also noteworthy that our study was based on real genotypes of three breeds in the base populations to represent their real genetic architecture, but the phenotypes were simulated under a strictly additive model. We recognize that non-additive gene actions like dominance also play an important role in regulating phenotypes, especially in crossbred cattle. Since dominance is not heritable (as only a single allele is passed on to next generations), we focused on additive gene action only. 2.6. Conclusions While various studies have reported minimal gain in prediction accuracy using sequence data in cattle, collection of WGS data over the years will be better to increase the accuracy of genomic 30 prediction in beef cattle especially for distantly related animals. We simulated matings of three beef cattle breeds and their crosses over 20 generations to study the genomic prediction trends over time using dense SNP information either with or without known QTLs in the dataset. We demonstrated that the benefit of including QTLs was more evident when there was a large genetic distance between reference and target populations i.e populations spread out genetically due to genetic recombination during breeding or crossbred and across-breed prediction scenarios. We also predict that WGS information of highly inbred cattle breeds will be of limited use in across-breed prediction possibly due to inaccurate estimation of SNP effects. Future research and resources should be spent on generation and efficient utilization of large datasets of genomic sequence in beef cattle. 31 REFERENCES Calus, M. P. L., Bouwman, A. C., Schrooten, C., and Veerkamp, R. F. (2016). Efficient genomic prediction based on whole-genome sequence data using split-and-merge Bayesian variable selection. Genetics Selection Evolution 48, 1–19. doi: 10.1186/s12711-016-0225-x. Clark, S. A., Hickey, J. M., Daetwyler, H. D., and van der Werf, J. H. J. (2012). The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet Sel Evol 44, 4. doi: 10.1186/1297-9686-44-4. de Roos, A. P. W., Hayes, B. J., Spelman, R. J., and Goddard, M. E. (2008). Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics 179, 1503–1512. doi: 10.1534/genetics.107.084301. Hayes, B. J., Bowman, P. J., Chamberlain, A. C., Verbyla, K., and Goddard, M. E. (2009a). Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genetics Selection Evolution 41, 1–9. doi: 10.1186/1297-9686-41-51. Hayes, B. J., Bowman, P. J., Chamberlain, A. J., and Goddard, M. E. (2009b). Invited review: Genomic selection in dairy cattle: Progress and challenges. J Dairy Sci 92, 433–443. doi: 10.3168/jds.2008-1646. Kachman, S. D., Spangler, M. L., Bennett, G. L., Hanford, K. J., Kuehn, L. A., Snelling, W. M., et al. (2013). Comparison of molecular breeding values based on within- and across-breed training in beef cattle. Genetics Selection Evolution 45, 1. doi: 10.1186/1297-9686-45-30. Mason, B. A., Goswami, S., Hayes, B. J., Goddard, M. E., Bowman, P. J., Reich, C. M., et al. (2012). Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci 95, 4114–4129. doi: 10.3168/jds.2011-5019. Meuwissen, T., and Goddard, M. (2010). Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics 185, 623–631. doi: 10.1534/genetics.110.116590. Meuwissen, T. H. E., Hayes, B. J., and Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. doi: 11290733. Muir, W. M. (2007). Comparison of genomic and traditional BLUP-estimated breeding value accuracy and selection response under alternative trait and genomic parameters. Journal of Animal Breeding and Genetics 124, 342–355. doi: 10.1111/j.1439-0388.2007.00700.x. Mulder, H. A., Calus, M. P. L., Druet, T., and Schrooten, C. (2012). Imputation of genotypes with low-density chips and its effect on reliability of direct genomic values in Dutch Holstein cattle. J Dairy Sci 95, 876–889. doi: 10.3168/jds.2011-4490. 32 Pszczola, M., and Calus, M. P. L. (2016). Updating the reference population to achieve constant genomic prediction reliability across generations. Animal 10, 1018–1024. doi: 10.1017/S1751731115002785. Pszczola, M., Strabel, T., Mulder, H. A., and Calus, M. P. L. (2012). Reliability of direct genomic values for animals with different relationships within and to the reference population. J Dairy Sci 95, 389–400. doi: 10.3168/jds.2011-4338. Raymond, B., Bouwman, A. C., Schrooten, C., Houwing-Duistermaat, J., and Veerkamp, R. F. (2018). Utility of whole-genome sequence data for across-breed genomic prediction. Genetics Selection Evolution 50. doi: 10.1186/s12711-018-0396-8. Saatchi, M., Ward, J., and Garrick, D. J. (2013). Accuracies of direct genomic breeding values in Hereford beef cattle using national or international training populations. J Anim Sci 91, 1538– 1551. doi: 10.2527/jas.2012-5593. van Binsbergen, R., Calus, M. P. L., Bink, M. C. A. M., van Eeuwijk, F. A., Schrooten, C., and Veerkamp, R. F. (2015). Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle. Genetics Selection Evolution 47, 1–13. doi: 10.1186/s12711-015-0149-x. van den Berg, I., Boichard, D., Guldbrandtsen, B., and Lund, M. S. (2016). Using Sequence Variants in Linkage Disequilibrium with Causative Mutations to Improve Across-Breed Prediction in Dairy Cattle: A Simulation Study. G3&#58; Genes|Genomes|Genetics 6, 2553–2561. doi: 10.1534/g3.116.027730. VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. J Dairy Sci 91, 4414– 23. doi: 10.3168/jds.2007-0980. Wientjes, Y. C. J., Veerkamp, R. F., and Calus, M. P. L. (2015). Using selection index theory to estimate consistency of multi-locus linkage disequilibrium across populations. BMC Genet 16, 1– 15. doi: 10.1186/s12863-015-0252-6. 33 Chapter 3 EVALUATION OF WHOLE-GENOME SEQUENCE IMPUTATION STRATEGIES IN KOREAN HANWOO CATTLE 3.1. Abstract This study evaluated the accuracy of sequence imputation in Hanwoo beef cattle using different reference panels: a large multibreed reference with no Hanwoo (n=6,269), a much smaller Hanwoo purebred reference (n=88), and both datasets combined (n=6,357). The target animals were 136 cattle both sequenced and genotyped with the Illumina BovineSNP50 v2 (50k). The average imputation accuracy measured by Pearson correlation (R) was 0.695 with the multibreed reference, 0.876 with the purebred Hanwoo and 0.887 with the combined data; the average concordance rates (CR) were 88.16%, 94.49% and 94.84% respectively. The accuracy gains from adding a large multibreed reference of 6,269 samples to only 88 Hanwoo was marginal, however the concordance rate for the heterozygotes decreased from 85% to 82% and the concordance rate for fixed SNPs in Hanwoo also decreased from 99.98% to 98.73%. Although the multi-breed panel was large it was not sufficiently representative of the breed for accurate imputation without the Hanwoo animals. Additionally, we evaluated the value of high-density 700k genotypes (n=991) as an intermediary step in the imputation process. The imputation accuracy differences were negligible between a single-step imputation strategy from 50k directly to sequence and a two-step imputation approach (50k-700k-sequence). We also observed that imputed sequence data can be used as a reference panel for imputation (mean R=0.9650, mean CR=98.35%). Finally, we identified 31 poorly imputed genomic regions in the Hanwoo genome and demonstrated that imputation accuracies were particularly lower at the chromosomal ends. 34 3.2. Introduction Whole genome DNA sequence (WGS) data in livestock may lead to an increase in prediction accuracy, especially in less related populations (van den Berg et al., 2016; Raymond et al., 2018a) and a better resolution to identify causal loci for traits of interest (Frischknecht et al., 2017). Advances in next generation sequencing (NGS) technologies and a rapid decrease of sequencing costs have made WGS available for livestock species. Still, sequencing the large number of animals needed for genomic prediction is not yet economically feasible. The animal breeding industry and researchers are currently relying mostly on medium and high density SNP genotypes for association studies (GWAS) and for genomic prediction of traits (VanRaden et al., 2017). An alternative strategy to directly sequencing animals is through genotype imputation which is a cost-effective approach to acquire WGS data for a large number of animals. It is the process of inferring unknown genotypes (in silico) for animals genotyped at a lower density (e.g. 50k) using pedigree information and/or a set of reference animals genotyped at higher density (e.g. 700k, WGS, etc.). Imputation can help improve genomic coverage, facilitate comparison and combination of studies that use different marker panels, increase the power to detect genetic associations by combining datasets from different studies, and guide fine-mapping of quantitative trait loci. Several studies have shown that imputed genotypes can lead to accurate genomic predictions and help identify quantitative trait loci (QTLs) (Mason et al., 2012; Mulder et al., 2012; Frischknecht et al., 2017). Effective use of imputed genotypes in genomic selection requires that single nucleotide polymorphisms (SNPs) are imputed with high accuracy (Badke et al., 2014). Several factors affect imputation accuracy, for example the number of animals in the reference population, the 35 number of SNPs in the panel(s) used for the lower density target population, minor allele frequency (MAF) of the SNPs to be imputed, the extent of linkage disequilibrium (LD) and the genetic structure of the population. Other factors that also affect the accuracy of imputation are the software used and its underlying imputation method, and somewhat more concerningly, even the metric used to evaluate imputation accuracy can lead to different interpretations (Khatkar et al., 2012). To illustrate the latter, a study by Rowan et al. (Rowan et al., 2019a) observed that using a composite breed reference panel for Gelbvieh cattle resulted in higher accuracies for rare variants when measured by the quality score metric produced by the imputation software Minimac3, but no such gains were observed for concordance rate and Pearson’s correlation between observed and imputed genotypes. Previous research has shown that combining reference populations from different breeds to increase the size of the reference population may or may not be a good strategy to help increase imputation accuracy. For example, a study showed that a reference population combining dairy and beef cattle breeds actually led to a decrease in the imputation accuracy from low density SNPs to HD and suggested that it might be due to differences in LD phase and haplotype dissimilarities between breeds (Berry et al., 2014). However in another study, a combined reference population of three closely related dairy breeds helped increase the sequence imputation accuracy when compared to a within breed reference (Brøndum et al., 2014). Similarly, a large multibreed reference population yielded higher imputation accuracies in Fleckvieh and Holstein target populations using the FImpute software, but there were no or negligible gains when using Minimac software (Pausch et al., 2017). Another study reported modest gains in imputation accuracy for Gelbvieh cattle which is a mixed ancestry breed when 36 using a large multibreed reference (Rowan et al., 2019a); however the same study also observed a low imputation accuracy of individuals from breeds that were only sparsely represented and were distantly related to the reference population. Imputation accuracies for cattle have been reported in several studies (Druet et al., 2010; VanRaden et al., 2012; van Binsbergen et al., 2014) but they are mostly limited to dairy breeds of European origin. To our knowledge there have not been any studies that evaluated imputation strategies for less common breeds that are genetically distinct from the main European breeds and for which sequence information is rarely available to be used as a reference for imputation. In this study, we report the accuracy of imputation from lower density genotypes (50k) to WGS in the Korean Hanwoo beef cattle which are an East Asian taurine cattle breed more related to the Japanese Wagyu but very distinct from western taurine cattle breeds. We also evaluated the imputation accuracies obtained when using single breed (Hanwoo) and multibreed reference panels to impute the Hanwoo genotypes. We further compared imputation accuracies obtained from one step imputation (50k-sequence) and two step imputation (50k-700k-sequence). Finally, we explored how imputation accuracy varies between common and rare SNP variants, and across different genomic regions in the Hanwoo cattle. 3.3. Materials and Methods 3.3.1. Reference genotype data The reference WGS data consisted of 201 Hanwoo cattle selected from key sires widely used in the national breeding program in Korea and the genomes from Run 9 of the 1000 Bull Genomes Project (Hayes and Daetwyler, 2019) plus other public sources of sequence consisting of 6,292 samples from various breeds, including an additional 23 Hanwoo samples. The fastq files from 37 the 201 Hanwoo were processed as described below and then combined with the 23 Hanwoo sequences from the 1000 Bull Genomes Project for a total of 224 Hanwoo samples in the Hanwoo reference. The variants in the reference panel were called from the Hanwoo sequencing short read files using IVDP – Integrated Variant Discovery Pipeline (https://github.com/rodrigopsav/IVDP). IVDP follows the gold standard GATK pipeline for variant calling using whole genome sequencing data (Error! Reference source not found.) and is similar to the analysis pipeline used in the 1000 Bull Genomes Project. Standard read filtering and adapter removal were applied using trimmomatic (Bolger et al., 2014). The filtered reads were aligned onto the bovine ARS-UCD1.2 reference genome from Ensembl (GCA_002263795.2) with bwa-mem2 (Vasimuddin et al., 2019). After that, duplicated reads were marked with sambamba markdup (Tarasov et al., 2015) and base quality score recalibration was carried out with GATK BaseRecalibratorSpark and ApplyBQSRSpark. Variant calling was done using GATK (version 4.2.6.0) HaplotypeCaller with gVCF mode (HaplotypeCaller + GenomicsDBImport + GenotypeGVCFs commands). The variant calls were then filtered using the following criteria: must be biallelic across samples, must have variant and sample missingness £ 0.2, Phred-quality score (QUAL) ≥ 50; excluding single-nucleotide polymorphisms with QD<2.0, MQ<40.0, FS>100.0, MQRankSum < -8.0, ReadPosRankSum < -20.0, ExcessHet>54.69. No minor allele frequency filtering was performed at this stage as variants were called to align with variants in the 1000 Bulls run9 dataset. [20] 38 Figure 3.1: IVDP pipeline for variant calling from Illumina whole genome short sequencing reads. After filtering, the data was combined with the Hanwoo from the 1000 Bulls and the final reference panel ended up with 55,927,497 markers, however 18,560,332 of these were monomorphic in the breed. For consistency, the same markers were used across all reference panel sets (a negligible number of new SNP variants detected only in the Hanwoo data were excluded). Finally, the reference panel was then phased with Beagle 5.4 software (Loh et al., 2016). Only the 29 cattle autosomes were used in this study. Principal component analysis was performed to explore the genetic architecture of the multibreed reference. The first principal component separated the taurine and indicine breeds while the second principal component separated the dairy and beef cattle breeds. Since the data consisted of more than 200 cattle breeds from around the world, only a few major breeds were highlighted in the figure (Figure 2). 39 0.03 Hanwoo Holstein Brahman Angus Others 0.02 PC 2 ( 13.16 %) 0.01 0.00 −0.01 −0.01 0.00 0.01 0.02 0.03 0.04 PC 1 (72.416%) Figure 3.2: Plot of first two principal components of the large multibreed reference including Hanwoo animals. 3.3.2. Target genotype data After sample quality control (samples with more than 10% of genotypes missing were excluded), a total of 9,732 animals genotyped with the Illumina BovineSNP50 v2 (50k) and 991 animals genotyped with Illumina BovineHD chip (HD, 770k) were used as the target animals. Among these animals, there were 628 animals genotyped on both platforms and there were 136 animals that were both sequenced and genotyped on the 50k array. There was no overlap between the 700k and sequenced animals. The genotyped animals were from half-sib families and the average, minimum and maximum numbers of progeny per sire was 9, 1 and 35, respectively. Also, there were 1,783 animals in the target population that had at least one parent in the Hanwoo sequence reference population. 40 SNPs were realigned to the ARS reference assembly using a custom R script and the probe sequence information using blastn. SNPs that could not be mapped to the reference or if the REF/ALT alleles could not be unambiguously assigned were excluded. SNP genotypes with missing rates greater than 20%, or on non-autosomal chromosomes were also deleted. After SNP data filtering, 52,653 SNP from the 50k panel and 674,691 from the 700k were retained for imputation. From the 50k panel there were 42,296 SNPs in common with the 700k and 46,734 SNPs in common with the sequence data. The 700k data had 668,998 SNPs in common with the sequence. Sporadic missing SNP genotypes (GC<0.6) in the 50k and 700k datasets were then imputed and phased with Beagle 5.4 (Browning et al., 2018, 2021) using standard settings and the flags window=300 and overlap=100. After initial comparison of one step and two step imputation strategies, the target animals for all other imputation approaches (section 3.3. onwards) were only 136 animals whose sequence information was available for calculation of imputation accuracies. These 136 animals were previously phased to resolve haplotypes along with all other 50K animals as described above. 3.3.3. Imputation Imputation was carried out using Impute5 (Rubinacci et al., 2020) with default parameters and one chromosome at a time. Two imputation strategies were evaluated: imputation of the 50k genotypes (n=9,732) directly to sequence using the Hanwoo purebred reference with 224 animals and a two-step approach where the 50k genotypes were first imputed to the 700k panel with 991 animals, and then imputed to the same reference sequence panel. For the two-step approach, after the 50k genotypes were imputed to 700k, and before imputing up to sequence, the genotypes were re-phased with Beagle as recent work from Oget-Ebrad et. Al (Oget-Ebrad et 41 al., 2022) reported that Beagle is currently the best software to resolve haplotypes, which is critical for the accuracy of imputation. Additionally, the imputation accuracy of different sequence reference panels was compared: 1) only the purebred Hanwoo, a large multibreed reference without Hanwoo, and both reference sets combined. This was done only by imputing the 50k genotypes directly to sequence as the imputation accuracy differences between the one and two step approaches were negligible (further details in the results and discussion). Additionally, we increased the size of the purebred Hanwoo reference- the 136 Hanwoo with 50k and sequence data were randomly split into four groups of 34 target samples. For each of these groups, the target samples were excluded from the reference panel of 224 animals, leaving 190 for the reference which was then used for imputation. Imputed groups were then combined back into the 136 Hanwoo evaluation set and compared to the sequenced samples. Finally, we explored the possibility of using imputed sequence data as a reference for imputation. The imputed purebred Hanwoo data was used as a reference (n=9,596) after removing the target 136 animals from the imputed 50k dataset. 3.3.4. Evaluation of the imputation The 628 animals in common between the 50k and 700k datasets (674,691 SNPs) and the 136 animals in common between the 50k and sequence datasets (55,927,497 SNPs total but 18,560,332 monomorphic in Hanwoo) were used to calculate imputation accuracies. Accuracies were calculated as the correlation between the imputed and observed genotypes and as the percentage of correctly imputed genotypes (concordance rate). Accuracies per sample, per allele frequency, per genotype and per SNP were also calculated. The pattern of imputation accuracy according to allele frequency was evaluated using average values of imputation accuracies in bins 42 of 0.01. Differences between real and imputed datasets were estimated through a principal component analysis of the genomic relationship matrices (GRM). 3.4. Results and Discussion 3.4.1. Imputation from 50k to 700k in Hanwoo The average imputation accuracy measured by the correlation between the imputed 50k and the reference HD was 0.996. At the animal level, correlations varied between 0.969 and 0.998; by SNP correlations varied between -0.028 and 1 (2.2% of the SNPs had a correlation <0.9). The minimum, average and maximum concordance rates were 95.5%, 99.43% and 99.76% by animal; and by SNP they were 26.43%, 99.43% and 100%, only 0.95% of the SNPs had a concordance rate below 90%. The correlation between real and imputed allele frequencies was 0.9996. At the genotype level, concordances were also very high as shown in Table 1 and not surprisingly the highest error rates are in the imputation results of the heterozygotes (~1% wrong). Concordances in allele frequency bins ranged between 0.9868 and 0.9995. Table 3.1: Concordance rate in percentage between imputed and real genotypes for imputation from 50k to 700k (991 700K reference). AAimp ABimp BBimp AAref 99.57 0.41 0.02 ABref 0.48 99.03 0.48 BBref 0.03 0.33 99.65 It is important to highlight that these imputation accuracies were overly optimistic and reflect the best-case scenario where the target samples were already part of the reference population. These results should not be construed as reflective of accuracies in usual circumstances where the target is not a subset of the reference. Our main objective at this point was primarily to ensure that the imputation from 50k to 700k was adequate for imputing up to sequence. A minor 43 point of interest is that while most studies evaluate imputation by subsetting SNPs from a larger panel to a smaller one, here the same samples were independently genotyped on the two platforms. The discordance between observed genotypes was 0.115% and, of somewhat more concern, almost 10% of the missing genotypes in the 50k data, those that were imputed during the phasing step, differed from the genotypes in the 700k panel. The missing rate in the 628 animals was 8.37% and the overall discordance was 0.915%. In contrast, when the original 50k SNP panel is removed from the real and imputed 700k datasets, the discordance between them is lower at 0.542% which suggests a proportionally higher error rate of imputation of the missing genotypes during the phasing step with Beagle than of the imputation to higher density with Impute5. The haplotypes seem to have been well resolved by the phasing step, which is crucial for the imputation, but again, some caution is warranted in not overinterpreting these accuracies since the target and reference sets had the same animals and it was expected that the haplotypes would align well between the two sets. Nevertheless, this analysis suggests the need for some attention when using Beagle to fill in missing genotypes. 3.4.2. Imputation from 50k to sequence and imputation from 50k to 700k to sequence in Hanwoo The average imputation accuracy measured by the correlation between the one-step imputed 50k and the reference sequence was 0.9653. At the animal level, correlations varied between 0.9104 and 0.9827 (Figure 3A); by SNP correlations varied between -0.1 and 1 (18.74% of the SNPs had a correlation <0.9, 8.8% <0.8 and 4.5% < 0.7). The minimum, average and maximum concordance rates were 95.99%, 98.36% and 99.25% by animal (Figure 3A); and by SNP they were 44 0.0%, 98.36% and 100%, 2.94% of the SNPs had a concordance rate below 90%. The correlation between real and imputed allele frequencies was 0.9968. For the two-step imputation (50k -> 700k -> sequence), the average imputation accuracy correlation was 0.9564. At the animal level, correlations varied between 0.9071 and 0.9918 (Figure 3A); by SNP correlations varied between -0.1 and 1 (25.44% of the SNPs had a correlation <0.9, 10.8% <0.8 and 4.7% < 0.7). The minimum, average and maximum concordance rates were 95.82%, 97.91% and 99.63% by animal (Figure 3A); and by SNP they were 0.0%, 97.91% and 100%, 3.45% of the SNPs had a concordance rate below 90%. The correlation between real and imputed allele frequencies was 0.9978. At the genotype level, concordances were still high (Table 2) but noticeably lower than the 50k- 700k imputation (Table 1). The error rates of the heterozygotes were 5.15% for the one-step imputation and 8.18% for the two-step. It is our view that the error rates of the heterozygotes are a preferable measure of imputation accuracy because they are the hardest to impute correctly. The concordances in allele frequency bins were similar for one-step and two-step imputation and ranged between 0.9607 and 0.9969 respectively (Figure 3B). We also observed that concordances were higher for variants at the two ends of allele frequency spectrum i.e for variants with low minor allele frequency (MAF) which is commonly observed in imputation studies (Ramnarine et al., 2015). Table 3.2: Concordance rate in percentage between imputed sequence and real sequence of 136 animals for one-step (224 WGS reference) and two-step imputation (224 WGS and 991 700K reference). One-Step Two-Step AAimp ABimp BBimp AAimp ABimp BBimp AA ref 97.14 2.65 0.22 AA ref 95.69 3.98 0.33 AB ref 1.10 94.85 4.05 AB ref 1.4 91.82 6.78 BB ref 0.03 1.07 98.91 BB ref 0.02 1.09 98.89 45 When comparing one and two-step imputation, 98.24% of the genotypes were identical. Out of the wrongly imputed genotypes, around half of these were the same between methods (~76Million genotypes). Differences between the one and two-step approaches were minimal, with a small advantage for the one-step approach. This goes against the consensus that it is preferable to first impute to a higher density panel and then up to sequence (van Binsbergen et al., 2015). In other still unpublished work, we have noted that direct imputation certainly can yield high accuracies without first imputing to a high-density panel provided that there is a large reference population and good coverage of the target haplotypes. In this particular instance, however, the higher accuracies from the direct imputation were more likely due to phasing errors of the 136 target animals when they were imputed from 50k up to 700k since they were not genotyped on the high-density array (in relation to the sequence haplotypes, the haplotype concordance of the 50k was slightly higher than the 700k at ~0.11%). Keeping in mind that these were different animals and not the same sample sizes, the discordance between 50k genotypes and sequence was 0.445%, and higher than between the 50k and 700k (0.115%). The discordance between the missing 50k genotypes was similar to what was observed in relation to the 700k at 6.85%. The genomic relationship matrices of the sequence data and the two imputation approaches were very similar (Figure 4). The sums of squares of the deviation between the GRM of the sequence and the one-step was 0.3625 and for the two-step, slightly worse at 0.412. The effect of these deviations on the estimates of genomic breeding values should be negligible. 46 Figure 3.3: A) Concordance and correlation of samples for one-step and two-step imputation. B) Concordance by bins allele frequency (0.01 interval) for one-step and two-step imputation. 47 Figure 3.4: Plot of the first two principal components of the genomic relationship matrix of the sequence data and the imputed genotypes using the one and two-step approaches. 3.4.3. Evaluation of imputation accuracy with all available animals (Hanwoo and multibreed) To set a target for imputation accuracy and see how well the imputation pipeline can perform in a perfect world, the 224 Hanwoo animals plus another 6,269 cattle genotypes from a broad range of breeds were used as the reference imputation panel. This reference panel is as good as we can be. The accuracy results thus obtained will be used as a target to achieve in subsequent imputations in this study. The 136 animals with sequence data were imputed directly from 50k to sequence as there were no accuracy gains in using the two-step approach when the target animals were included in the reference panel. The average imputation accuracy measured by the correlation was 0.9711 (+0.0058 in relation to using only the Hanwoo samples as reference) and 48 varied between 0.9058 and 0.9889 across samples. The average concordance rate was 98.62% (+0.26%), varying between 95.78% and 99.50% across animals. 2.08% (-0.86%) of the SNPs had a concordance rate below 90%. The correlation between real and imputed allele frequencies was 0.9983 (+0.0015). The sums of squares of the deviation between the GRM of WGS and the imputed data was 0.3026. The augmented reference set used in this section is almost 30 times larger than the Hanwoo specific reference panel, but the imputation accuracy gains were only marginal. Of course, this is still a scenario where the target is a subset of the reference panel and the accuracies with the Hanwoo panel were already high anyway. It does however confirm that it is still beneficial, even if marginally, to increase the reference set by adding animals from other breeds and that the inclusion of very distantly related samples do not have a sizable negative effect on the accuracies; but it also confirms that relatedness between target and reference is important for imputation accuracy, as discussed in more detail in the next section. Table 3.3: Concordance rate in percentage between imputed and real genotypes for 224 Hanwoo and 6,269 multibreed animals in reference. AAimp ABimp BBimp AA ref 97.75 2.06 0.19 AB ref 0.88 94.8 4.33 BB ref 0.01 0.78 99.2 3.4.4. Evaluation of imputation accuracy with independent reference sets For a more realistic evaluation of the imputation accuracy in routine settings, the 136 Hanwoo with sequence data were excluded from the reference panels. These animals were imputed directly up to sequence from their 50k original data. The accuracy with two reference panels was evaluated. One panel used the multibreed genotypes but without any Hanwoo (n=6,269), the 49 other used the same panel plus the remaining 88 Hanwoo after exclusion of the 136 samples. The average imputation accuracy measured by the correlation was 0.6950 without the Hanwoo and 0.8873 with the Hanwoo. With reference sets in the same order, the correlations varied between 0.6670 - 0.9096 and 0.7973 - 0.9798 across samples. The average concordance rates were 88.16% and 94.84% varying between 86.42% - 95.89% and 91.55% - 99.03%, across animals. 41.21% and 20.05% of the SNPs had a concordance rate below 90%. The correlation between real and imputed allele frequencies was 0.9322 and 0.9911. In total 33.19% (18,560,332) of the 55,927,497 were fixed in the Hanwoo population. The concordance for the SNPs that should have been fixed in these individuals was 98.94% and 99.73%. Aside from the computational burden, the large number of SNPs that are not segregating in Hanwoo but are in other breeds did not introduce an excessive number of spuriously segregating genotypes within the breed; although it should be noted that from the 18.5 million fixed SNPs, 31.48% and 16.60% had an imputed MAF>0. The small effect on the concordance is due to the low frequencies of these incorrectly imputed SNPs (averages of 0.0118 and 0.0079). The bins concordances ranged between 0.6485 - 0.9794 and 0.8530 - 0.9936, the averages were 0.7259 and 0.8917. Table 3.4: Concordance rate in percentage between imputed and real genotypes for multibreed reference panels with (88 Hanwoo, 6269 multibreed) and without Hanwoo (6269 multibreed) samples. Multi-Breed without Hanwoo Multi-Breed with Hanwoo AAimp ABimp BBimp AAimp ABimp BBimp AA ref 76.03 18.97 5.01 AA ref 90.45 8.84 0.71 AB ref 12.29 56.95 30.76 AB ref 5.15 82.15 12.7 BB ref 0.71 8.09 91.2 BB ref 0.1 3.23 96.67 At the genotype level, concordances were substantially lower (Table 4), especially for heterozygotes (56.95%) when there were no Hanwoo in the reference. With Hanwoo included in 50 the reference, concordances were much better but still around 82.15% for the heterozygotes (Table 4). The sums of squares of the deviation between the GRM of the sequence and the imputed data was 1.0552, and slightly lower (better) when Hanwoo were included in the reference at 0.9641. Inclusion of Hanwoo samples in the imputation reference panel provided a substantial increase in the imputation accuracy, the gain in the correlation was 0.1923 and in the concordance was 6.68%. Even with a large reference population, when there were no samples of the targeted breed in the reference population, the imputation accuracies were rather low. Here there were only 88 independent Hanwoo for the reference but even this small number (1.38% of the total reference) already provided a substantial improvement in the accuracy. Hanwoo is an East Asian taurine breed quite distinct from other European breeds but more closely related to other Asian breeds such as the Wagyu and Akaushi. The reference had a limited number of these breeds to assist with the imputation (e.g., 174 Wagyu and a single Akaushi). There is still a need to increase the representation of less European centric breeds in sequencing efforts to improve imputation results of these breeds. To better separate the contribution of within and between breed sequence data to the accuracy of imputation, we also used only the 88 Hanwoo as a reference. The average imputation accuracy measured by the correlation was 0.8759, varying between 0.8205 and 0.9421 across samples. The average imputation accuracy measured by the concordance was 94.49%, varying between 92.60% and 97.43%. Similar to the results discussed in the previous section, the accuracy gain from adding all the other 6,269 samples was marginal at +0.0114 for the correlation and +0.35% for the concordance. The correlation between real and imputed allele frequencies was 0.9807 (- 51 0.0105) but the concordance of the fixed SNPs correctly imputed increased to 99.98% without the noise introduced by the variants segregating in the other breeds. 21.91% (1.86% worse) of the SNPs had a concordance rate below 90%. The allele frequency bins concordances ranged between 0.8717 - 0.9653, the average was 0.8922. The genotype concordances (Table 5) were similar to those obtained with the multibreed reference including Hanwoo, even slightly better as the heterozygote error rates were lower at 14.18% (but ~1% loss of concordance for each of the homozygotes). Table 3.5: Concordance rate in percentage between imputed and real genotypes with a reference panel of 88 purebred Hanwoo. AAimp ABimp BBimp AA ref 89.21 9.93 0.86 AB ref 4.63 85.82 9.54 BB ref 0.18 4.14 95.68 3.4.5. Evaluation of imputation accuracy with a larger and independent Hanwoo only reference set Since there was only a marginal overall improvement in the imputation accuracy using the multibreed reference, we opted to use only the purebred Hanwoo data (n=190 in four independent chunks) as the final reference panel for this study. The average imputation accuracy measured by the correlation was 0.9336 and varied between 0.8459 and 0.9741 across samples. The average concordance rate was 96.88%, varying between 93.50% and 98.78% across animals. 8.35% of the SNPs had a concordance rate below 90%. The correlation between real and imputed allele frequencies was 0.9960. The sums of squares of the deviation between the GRM of the sequence and the imputed data was 0.4923. At the genotype level, concordances were high (Table 6) and the error rates of the heterozygotes was 10.47%. The allele frequency bins 52 concordances ranged between 0.9178 and 0.9983, the average was 0.9431. Overall, this reference set seems adequate for accurate sequence imputation in Hanwoo, apart from some genomic regions of lower accuracy which are discussed below (section 3.7.). Table 3.6: Concordance rate in percentage between imputed and real genotypes with a reference panel of 190 purebred Hanwoo. AAimp ABimp BBimp AA ref 93.69 5.91 0.40 AB ref 2.47 89.54 8.00 BB ref 0.04 1.93 98.03 3.4.6. Evaluation of imputation accuracy with an imputed Hanwoo reference set The real imputation accuracy in the samples without sequence data is unknown. As an empirical evaluation, we used the imputed sequence data of the 50k samples (one-step imputation) as a reference panel (n= 9596) to impute the 136 animals with sequence data. These samples were removed from the imputed sequence data and their original 50k genotypes were used as the target, after phasing with Beagle using the other phased 50k genotypes (the phasing reference panel). There is of course still some circularity to this argument as the 136 target samples accounted for 61% of the initial reference dataset used to impute the 50k genotypes, i.e., even though none of the samples in the reference overlaps with the target, the latter was initially used to impute them – this should lead to some upwards bias in the imputation accuracies. The average imputation accuracy measured by the correlation was 0.9650 and varied between 0.9103 and 0.9829 across samples. The average concordance rate was 98.35%, varying between 95.98% and 99.25% across animals and only 2.99% of the SNPs had a concordance rate below 90%. The correlation between real and imputed allele frequencies was 0.9967. The sums of squares of the deviation between the GRM of the sequence and the imputed data was only 53 0.3786. At the genotype level, concordances were high, and the error rates of the heterozygotes was only 5.21% (Table 7). The frequency bins concordances ranged between 0.9605 and 0.9967. Table 3.7: Concordance rate in percentage between imputed and real genotypes of 136 target animals with a reference panel of 9596 purebred Hanwoo that were previously imputed from 50K to sequence. AAimp ABimp BBimp AA ref 97.11 2.67 0.22 AB ref 1.11 94.79 4.10 BB ref 0.03 1.07 98.90 This was effectively the best imputation accuracy, almost on par with having the animals in the reference set itself along with a large multibreed reference set (table 3, table S1) . Again, accounting for the lack of independence in how the imputed reference samples were imputed at the first place, this does suggest the possibility of using imputed data along with real data to improve imputation accuracies. How well this would work in practice will be a function of the value of a larger number of haplotypes in the reference, offset by the errors in the imputed data – but that’s a topic for future work. 3.4.7. Evaluation of imputation accuracy across chromosomal regions We computed the running median of imputation accuracies in windows of 1001 SNPs to detect poorly imputed chromosomal regions across the Hanwoo genome. SNPs falling in 0.1 percentile (CR cutoff value 0.853, n=48,438) were used to identify genomic regions with poor concordance rates that contained 2 or more SNPs. CHR 4 contained the highest number of poorly imputed SNPs (n=15339) followed by CHR 17 (n=9177), 10 (5,326) and 7 (n=4,747). Poorly imputed SNPs located more than 1 Mb apart were considered as a separate genomic region. The 5 longest regions were located on CHR 17, 10, 15, 23 and 4. The longest one being on CHR 17 and spanning 54 the region between 37,810,468 and 39,363,461 BP. Although imputation accuracies were generally high, there were some noticeable regions with poor imputation accuracies especially close to the ends of chromosomes due to higher recombination rates (Sun et al., 2012) (Figure 5 and 6). Concordance I I I 0.5 0.6 0.7 0.8 0.9 1.0 I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 27 29 Figure 3.5: Smoothed concordance rate across 29 autosomes in Hanwoo cattle. Additionally, highly polymorphic genomic regions like the major histocompatibility complex (MHC) are intrinsically difficult to impute due to the high number of repetitive elements, a greater diversity of haplotypes, and a complex LD structure. For example, in this study we did identify one such region on CHR 23 that overlaps with the bovine leukocyte antigen (BoLA) class II that encodes genes that perform similar functions across species but are structurally different. Other poorly imputed genomic regions and the number of SNPs in those regions are presented in Table 8. Such intrinsically difficult to impute genomic regions have also been reported in Flekvieh, Holstein and Nelore cattle [21–23] and should be taken into account in studies that use impute sequence data. Additional reasons for some regions having low imputation accuracy can be due to a low coverage by SNP panels and a high sequence variant density in a region, high GC content, and assembly errors. 55 Figure 3.6: Plot of imputation accuracy against distance of SNPs from nearest end of chromosome (Mb). Table 3.8: Poorly imputed genomic regions in Hanwoo. CHR Start End Length No. of SNPs 2 81496310 81511148 14838 273 3 11716513 11776690 60177 452 4 105573863 105589511 15648 1422 4 112830880 113341416 510536 13917 5 92617024 92629684 12660 236 6 55737025 55758674 21649 508 7 10758277 11039089 280812 4584 7 95736166 95742201 6035 163 8 29160922 29183317 22395 559 9 10593065 10596144 3079 59 9 104358861 104380205 21344 671 10 459493 521352 61859 846 10 22908467 23406397 497930 2334 10 100209890 101216542 1006652 2146 11 40903028 40905284 2256 72 12 70443588 70455007 11419 713 12 71728153 71735710 7557 719 56 Table 3.8 (cont’d) 14 82178440 82180992 2552 60 15 45854158 46106343 252185 3506 15 78379540 79226472 846932 398 16 372932 418633 45701 1473 16 5775482 5778367 2885 53 17 14049911 14159496 109585 4297 17 37810468 39363461 1552993 4880 18 62688950 62690650 1700 5 19 927489 945870 18381 537 20 49746665 49760276 13611 432 22 897334 922574 25240 738 23 15964918 15976090 11172 138 23 25875394 26511965 636571 2223 29 51092637 51093495 858 23 Previous studies have suggested the use of a multi breed reference population for whole genome imputation but the target populations in those studies consisted of either admixed populations or cattle breeds closely related to the reference breeds (Brøndum et al., 2014; Pausch et al., 2017; Rowan et al., 2019b). Our results collectively indicate that the imputation accuracy in Hanwoo was largely driven by within breed imputation. The large multibreed reference panel added very little to the imputation accuracy beyond what was obtained by simply using the closely related samples from the breed itself. This was mainly because the 1000 Bulls sequence data mostly contained cattle breeds of European origin while haplotype diversity and genetic architecture of breeds originating from Asia and Africa are poorly represented. On the upside however, there was no clear indication of the diverse reference having a negative effect on the overall imputation accuracy, although it did increase the imputation error rates of heterozygotes and falsely detected some rare variants that did not actually exist in the target population. Additionally, the ratio of within breed to multibreed animals in the reference is also important. Adding a small number of animals from various breeds to a small reference panel can adversely affect accuracy. 57 However, a large number of multibreed animals in the reference can help increase accuracy (Korkuć et al., 2019). The large and diverse multibreed reference in our study also added considerably to computational burden considering the fact that it was ~40 times greater than the within breed reference. These findings suggest that in the absence of an adequately sized breed specific panel, a large multibreed reference can be used but the error rates will be high. This should not overly affect genomic prediction (although there’s limited value in using sequence data to estimate GEBVs anyway (van Binsbergen et al., 2015; Calus et al., 2016; Raymond et al., 2018b)) but could be more consequential for association studies. One aspect we have not yet considered, are the Impute5 imputation info scores. These are estimates of the ratio between the observed and expected statistical information (Howie et al., 2011). This measure aims to serve as a guide as to the reliability of the imputed genotypes, but it is quite dependent on the estimated allele frequency of the imputed genotypes. Hard filters such as e.g. R<0.6 are commonly used but it does not seem to align well with observed concordance values. With this dataset the correlation between the info score and SNP concordance is only 0.2598. The correlation with allele frequencies is even lower at 0.0908. If the 0.6 filter cutoff was used on this data, the average concordance for the filtered-out data would be 0.9574 and for the selected data it would be 0.9715 (and would remove ~20% of the genotypes). The percentage of SNPs with a concordance below 0.9 would only improve from 8.35% to 7.20%. Within allele frequency bins (in 0.01 intervals), the minimum concordance was 0.9311 and the highest 0.9983, however the correlation of these concordance means in the bins and allele frequencies is also low at 0.0603 and not associated with allele frequencies. The metric does seem more accurate with more stringent cutoffs though; for example, on the extreme, if 58 the data is filtered for an R=1 then the concordance mean in the subset is 99.99% and only 0.015% of the SNPs will have a concordance <0.9 but only 30.38% of the genotypes will be kept and much more problematically, only a fraction of these will have a MAF>0.01. In this work we could not find a meaningful way to use the info scores that would help to filter out SNPs with low imputation accuracy. 3.5. Conclusions In conclusion, we have demonstrated that imputation with a small reference population closely related to the target population is more accurate than a large multi breed reference with distantly related animals. Without any representation of the target breed in the reference population the imputation accuracies for Hanwoo were substantially lower, which is due to the high genetic distances between the Korean breed and the other breeds. The addition of a relatively small number of Hanwoo animals to the large multi breed reference panel considerably improved the imputation accuracy, although the accuracies of the rare variants were still slightly lower than those obtained when using only a purebred Hanwoo reference. Even though a large variety of dairy and beef breeds from around the world have already been sequenced, there is still a need to increase the representation of less European centric breeds in sequencing efforts to improve the imputation of these breeds. Furthermore, while a two-step imputation strategy is usually the suggested approach, in this work the differences in imputation accuracy when compared to single-step imputation was negligible. We also observed that imputed genotypes can be used as the reference panel which suggests that, for example, HD genotypes could be imputed to sequence level and then included along with the original sequence reference panel for use in imputation applications. This could reduce the computational burden of the two-step 59 approach and potentially improve imputation accuracies when the number of informative animals in the reference panel is small. Lastly, we also identified poorly imputed genomic regions in Hanwoo cattle that should be accounted for when the imputed data is used in other projects – particularly in genome-wide association studies. 60 REFERENCES Badke, Y. M., Bates, R. O., Ernst, C. W., Fix, J., & Steibel, J. P. (2014). Accuracy of Estimation of Genomic Breeding Values in Pigs Using Low-Density Genotypes and Imputation. G3, 4. https://doi.org/10.1534/g3.114.010504 Berry, D. P., Mcclure, M. C., & Mullen, M. P. (2014). Within- and across-breed imputation of high- density genotypes in dairy and beef cattle from medium- and low-density genotypes. Journal of Animal Breeding and Genetics, 131(3), 165–172. https://doi.org/10.1111/jbg.12067 Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics (Oxford, England), 30(15), 2114–2120. https://doi.org/10.1093/bioinformatics/btu170 Brøndum, R. F., Guldbrandtsen, B., Sahana, G., Lund, M. S., & Su, G. (2014). Strategies for imputation to whole genome sequence using a single or multi-breed reference population in cattle. BMC Genomics. https://doi.org/10.1186/1471-2164-15-728 Browning, B. L., Tian, X., Zhou, Y., & Browning, S. R. (2021). Fast two-stage phasing of large-scale sequence data. American Journal of Human Genetics, 108(10), 1880–1890. https://doi.org/10.1016/j.ajhg.2021.08.005 Browning, B. L., Zhou, Y., & Browning, S. R. (2018). A One-Penny Imputed Genome from Next- Generation Reference Panels. American Journal of Human Genetics, 103(3), 338–348. https://doi.org/10.1016/j.ajhg.2018.07.015 Calus, M. P. L., Bouwman, A. C., Schrooten, C., & Veerkamp, R. F. (2016). Efficient genomic prediction based on whole-genome sequence data using split-and-merge Bayesian variable selection. Genetics Selection Evolution, 48(1), 1–19. https://doi.org/10.1186/s12711-016-0225-x Druet, T., Schrooten, C., & de Roos, A. P. W. (2010). Imputation of genotypes from different single nucleotide polymorphism panels in dairy cattle. Journal of Dairy Science, 93(11), 5443–5454. https://doi.org/10.3168/jds.2010-3255 Frischknecht, M., Pausch, H., Bapst, B., Signer-Hasler, H., Flury, C., Garrick, D., Stricker, C., Fries, R., & Gredler-Grandl, B. (2017). Highly accurate sequence imputation enables precise QTL mapping in Brown Swiss cattle. BMC Genomics, 18(1). https://doi.org/10.1186/s12864-017- 4390-2 Hayes, B. J., & Daetwyler, H. D. (2019). 1000 Bull Genomes Project to Map Simple and Complex Genetic Traits in Cattle: Applications and Outcomes. Annual Review of Animal Biosciences, 7, 89– 102. https://doi.org/10.1146/annurev-animal-020518-115024 Howie, B., Marchini, J., & Stephens, M. (2011). Genotype Imputation with Thousands of Genomes. G3 Genes|Genomes|Genetics, 1(6), 457–470. https://doi.org/10.1534/g3.111.001198 61 Khatkar, M. S., Moser, G., Hayes, B. J., & Raadsma, H. W. (2012). Strategies and utility of imputed SNP genotypes for genomic analysis in dairy cattle. BMC Genomics, 13(1). https://doi.org/10.1186/1471-2164-13-538 Korkuć, P., Arends, D., & Brockmann, G. A. (2019). Finding the optimal imputation strategy for small cattle populations. Frontiers in Genetics, 10(FEB). https://doi.org/10.3389/fgene.2019.00052 Loh, P.-R., Danecek, P., Palamara, P. F., Fuchsberger, C., Reshef, Y. A., Finucane, H. K., Schoenherr, S., Forer, L., Mccarthy, S., Abecasis, G. R., Durbin, R., & Price, A. L. (2016). Reference-based phasing using the Haplotype Reference Consortium panel. Nature Genetics, 48(11). https://doi.org/10.1038/ng.3679 Mason, B. A., Goswami, S., Hayes, B. J., Goddard, M. E., Bowman, P. J., Reich, C. M., Matukumalli, L. K., & Erbe, M. (2012). Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. Journal of Dairy Science, 95(7), 4114–4129. https://doi.org/10.3168/jds.2011-5019 Mulder, H. A., Calus, M. P. L., Druet, T., & Schrooten, C. (2012). Imputation of genotypes with low-density chips and its effect on reliability of direct genomic values in Dutch Holstein cattle. Journal of Dairy Science, 95(2), 876–889. https://doi.org/10.3168/jds.2011-4490 Oget-Ebrad, C., Kadri, N. K., Moreira, G. C. M., Karim, L., Coppieters, W., Georges, M., & Druet, T. (2022). Benchmarking phasing software with a whole-genome sequenced cattle pedigree. BMC Genomics, 23(1), 1–16. https://doi.org/10.1186/s12864-022-08354-6 Pausch, H., MacLeod, I. M., Fries, R., Emmerling, R., Bowman, P. J., Daetwyler, H. D., & Goddard, M. E. (2017). Evaluation of the accuracy of imputed sequence variant genotypes and their utility for causal variant detection in cattle. Genetics Selection Evolution, 49(1), 1–14. https://doi.org/10.1186/s12711-017-0301-x Ramnarine, S., Zhang, J., Chen, L. S., Culverhouse, R., Duan, W., Hancock, D. B., Hartz, S. M., Johnson, E. O., Olfson, E., Schwantes-An, T. H., & Saccone, N. L. (2015). When does choice of accuracy measure alter imputation accuracy assessments? PLoS ONE, 10(10). https://doi.org/10.1371/journal.pone.0137601 Raymond, B., Bouwman, A. C., Schrooten, C., Houwing-Duistermaat, J., & Veerkamp, R. F. (2018a). Utility of whole-genome sequence data for across-breed genomic prediction. Genetics Selection Evolution, 50(1). https://doi.org/10.1186/s12711-018-0396-8 Raymond, B., Bouwman, A. C., Schrooten, C., Houwing-Duistermaat, J., & Veerkamp, R. F. (2018b). Utility of whole-genome sequence data for across-breed genomic prediction. Genetics Selection Evolution, 50(1), 1–12. https://doi.org/10.1186/s12711-018-0396-8 62 Rowan, T. N., Hoff, J. L., Crum, T. E., Taylor, J. F., Schnabel, R. D., & Decker, J. E. (2019a). A multi- breed reference panel and additional rare variants maximize imputation accuracy in cattle. Genetics Selection Evolution, 51(1), 1–16. https://doi.org/10.1186/s12711-019-0519-x Rowan, T. N., Hoff, J. L., Crum, T. E., Taylor, J. F., Schnabel, R. D., & Decker, J. E. (2019b). A Multi- Breed Reference Panel and Additional Rare Variation Maximizes Imputation Accuracy in Cattle. BioRxiv. https://www.biorxiv.org/content/biorxiv/early/2019/01/10/517144.full.pdf?%3Fcollection= Rubinacci, S., Delaneau, O., & Marchini, J. (2020). Genotype imputation using the Positional Burrows Wheeler Transform. PLOS Genetics, 16(11), e1009049. Sun, C., Wu, X. L., Weigel, K. A., Rosa, G. J. M., Bauck, S., Woodward, B. W., Schnabel, R. D., Taylor, J. F., & Gianola, D. (2012). An ensemble-based approach to imputation of moderate-density genotypes for genomic selection with application to Angus cattle. Genetics Research, 94(3), 133– 150. https://doi.org/10.1017/S001667231200033X Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J., & Prins, P. (2015). Sambamba: fast processing of NGS alignment formats. Bioinformatics, 31(12), 2032–2034. https://doi.org/10.1093/bioinformatics/btv098 van Binsbergen, R., Bink, M. C. A. M., Calus, M. P. L., van Eeuwijk, F. A., Hayes, B. J., Hulsegge, I., & Veerkamp, R. F. (2014). Accuracy of imputation to whole-genome sequence data in Holstein Friesian cattle. Genetics Selection Evolution, 46(1), 1–13. https://doi.org/10.1186/1297-9686-46- 41 van Binsbergen, R., Calus, M. P. L., Bink, M. C. A. M., van Eeuwijk, F. A., Schrooten, C., & Veerkamp, R. F. (2015). Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle. Genetics Selection Evolution, 47(1), 1–13. https://doi.org/10.1186/s12711-015-0149-x van den Berg, I., Boichard, D., Guldbrandtsen, B., & Lund, M. S. (2016). Using Sequence Variants in Linkage Disequilibrium with Causative Mutations to Improve Across-Breed Prediction in Dairy Cattle: A Simulation Study. G3&#58; Genes|Genomes|Genetics, 6(8), 2553–2561. https://doi.org/10.1534/g3.116.027730 VanRaden, P. M., Null, D. J., Sargolzaei, M., Wiggans, G. R., Tooker, M. E., Cole, J. B., Sonstegard, T. S., Connor, E. E., Winters, M., van Kaam, J. B. C. H. M., Valentini, A., van Doormaal, B. J., Faust, M. A., & Doak, G. A. (2012). Genomic imputation and evaluation using high-density Holstein genotypes. Journal of Dairy Science, 96(1), 668–678. https://doi.org/10.3168/jds.2012-5702 VanRaden, P. M., Tooker, M. E., O’Connell, J. R., Cole, J. B., & Bickhart, D. M. (2017). Selecting sequence variants to improve genomic predictions for dairy cattle. Genetics Selection Evolution, 49(1), 1–12. https://doi.org/10.1186/s12711-017-0307-4 63 Vasimuddin, M., Misra, S., Li, H., & Aluru, S. (2019). Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 314–324. https://doi.org/10.1109/IPDPS.2019.00041 64 Chapter 4 VALUE OF SEQUENCE DATA IN GENOMIC PREDICTION AND ASSOCIATION ANALYSIS OF HANWOO BEEF CATTLE CARCASS TRAITS 4.1. Abstract In this study we compared the genomic heritability and accuracy of prediction for carcass traits in Korean Hanwoo beef cattle using whole genome sequence data against medium and high- density SNP genotypes. We also selected SNPs based on GWAS in a discovery population and augmented the medium and high-density genotypes with selected SNPs. Imputed genotypes of 7501 Hanwoo cattle and four carcass traits, - carcass weight (CWT), eye muscle area (EMA), back fat thickness (BFT) and marbling score (MS9) that had varying genetic architecture were analysed. While GWAS based on sequence information identified larger peaks and clear signals as compared to medium and high-density genotypes, prediction accuracy remained the same. Adding statistically significant SNPs from GWAS to medium and high-density genotypes resulted in a slight increase in prediction accuracy for CWT. For traits that were not regulated by loci with large effects (i.e. EMA, BFT and MS9) selecting SNPs to augment medium and high-density genotypes did not result in an increase in prediction accuracy. When only selected SNPs were used for prediction, accuracy was higher than equal number of random SNPs but lower than commonly used dense SNP panels. We also observed that epistatic variance accounts for a small proportion of total phenotypic variance but prediction accuracy still remained comparable to a strictly additive model. These results demonstrate that imputed whole genome sequence data may be useful for fine mapping but has a limited value for genomic prediction in a closely related population. 65 4.2. Introduction Genomic prediction based on single nucleotide polymorphism (SNPs) is routinely used as a tool for genetic improvement in cattle. It utilizes linkage between SNP markers and quantitative traits loci (QTLs) to estimate and predict the breeding values (BVs) of animals (Meuwissen et al., 2001; VanRaden, 2008). The general framework of genomic prediction includes estimation of SNP effects from a training population with phenotypes and genotypes, which are subsequently used to predict the genomic breeding values (PGBVs) of young selection candidates without their phenotype information. Selection candidates are then ranked by their PGBVs to make decisions based on certain breeding goals outlined by the breeders. The accuracy of genomic evaluation depends on a number of factors including the extent of linkage disequilibrium (LD) between SNP markers and QTLs, the size of reference population, genetic relationship between reference and validation populations, the heritability of the trait, and the distribution of QTL effects regulating the trait being predicted (Hayes et al., 2009b). Genomic prediction accuracies reported in beef cattle have generally been lower than dairy cattle (van Eenennaam et al., 2014). This is mainly because of the absence of large reference populations in beef cattle since a number of beef breeds are being used for beef production throughout the world each with its own small reference population. Secondly, the relationship between reference and selection candidates is also lower due to frequent crossbreeding and limited adoption of artificial insemination in beef operations due to challenges of reproductive management in free range production systems. Studies have shown that prediction equations developed in one breed did not perform well to predict phenotypes in another breed or crossbred cattle (Hayes et al., 2009a; Mason et al., 2012; Kachman et al., 2013). Even within the 66 same breed, prediction equations developed in one country did not perform well in another country (Saatchi et al., 2013). This is likely due to a breakdown in LD between SNPs and QTL in across-breed predictions. Similarly, genetic recombination during meiosis can break down the LD between SNPs and QTLs which makes it hard to predict genetically distant generations (Muir, 2007). Therefore, accuracy of genomic prediction gets compromised due to these challenges. In this context, whole genome sequence data provides a unique opportunity to increase the accuracy of prediction between distantly related animals. This type of data does not rely on the LD between SNPs and QTL, as the causal loci are expected to be in the data set. In theory, if we can estimate the true effects of causal QTLs underlying traits and include them in genomic prediction, the accuracy of prediction should increase. Moreover, this increase is expected to persist across multiple generations and across breeds. Identifying the causal QTLs and estimating their true effects requires large training datasets with full sequence information, which is very expensive, posing a potential constraint to the use of whole genome sequences in prediction models. Genotype imputation provides a cost-effective strategy to obtain sequence level information of substantially large reference populations. Imputation of existing genotype information of thousands of animals up to sequence level has been done with high accuracy due to large haplotype blocks in the cattle genome (Pausch et al., 2017). Such imputed data can be used to investigate the benefit of using sequence level information in prediction and association analyses. Besides higher cost with little benefit, another major challenge in the commercial use of sequence data is the computational complexity due to high dimensionality of sequence data. A widely suggested solution to this problem is to select variants that are located close to causal 67 QTLs and add them to SNP panels. Selection of genetic variants can be done either by biological significance as adopted by Neogen, Lincoln, NE to design Bovine-GGP-F250 SNP panel that contains more than 200,000 variants selected based on potential biological significance. Alternatively, selection of SNPs can also be done based on statistical significance for a trait in a discovery population. However, such a selection will need to be done for all the traits of interest that are included in the breeding objective. Still, this area needs further research as the success of this strategy highly depends on genetic architecture of the trait as well as the population under selection. The use of sequence data in beef cattle is still in its infancy. Most studies using sequence information for genomic prediction are either done on simulated data (Meuwissen and Goddard, 2010) or in dairy animals (van Binsbergen et al., 2015; Veerkamp et al., 2016; Raymond et al., 2018a). These studies have shown that accuracy gains by using whole genome sequence data are negligible in cattle especially if the training and testing was performed with-in the same population or breed and the number of individuals were small (Druet et al., 2014; van Binsbergen et al., 2015). However, GWAS studies based on whole genome sequence data are more powerful as compared to genotyping platforms as they capture more genetic variation and help in more precise detection of QTL (Veerkamp et al., 2016). In the context of animal breeding strictly additive models are used for genetic evaluation of animals and non-additive genetic components are generally ignored. It is evident from biology that genes interact and work together in various processes to express a phenotype (Mackay, 2014) and some research in plants and pigs has shown that predictive ability can be improved by using non-additive models (Crossa et al., 2010; Su et al., 2012). However, other studies have 68 warned against a biological interpretation of such effects pointing out that an apparent epistatic effect is due to imperfect tagging of QTLs by SNPs in lower density SNP panels (Schrauf et al., 2020). Epistatic models may capture an underlying strictly additive architecture but such an effect disappears by fine mapping of additive effects, thus known as phantom epistasis (Schrauf et al., 2020; Hemani et al., 2021). Therefore, the magnitude of epistatic effects and their predictive ability in livestock needs to be explored using different densities of SNP data. Korean breeding program for their popular beef cattle Hanwoo, focuses primarily on 4 traits i.e carcass weight (CWT), eye muscle area (EMA), back fat thickness (BFT) and marbling score (MS) with the most emphasis (economic weight) on MS. The objectives of this study were: 1) perform GWAS based on imputed sequence data for carcass traits in Hanwoo for a better resolution of QTL regions for these traits and compared the results with those obtained from genotyping platforms, 2) to investigate whether selecting SNPs based on association analysis in a discovery population adds to prediction accuracy of genomic evaluation for carcass traits in Hanwoo, 3) evaluate accuracy of genomic prediction between commonly used genotyping platforms and imputed sequence data 4) compare the proportion of phenotypic variance captured by additive and epistatic genetics components and their effect of predictive ability of carcass traits in Hanwoo cattle. 4.3. Materials and Methods 4.3.1. Phenotypes Phenotypes of 7501 Hanwoo cattle from four different sources were collected. Number of animals in each population were 4290, 600, 20 and 2591. The number of animals in each contemporary group were 1965 and 202 animals. The animals were slaughtered when they were 69 approximately 2 years old (844.62 103.48 days). Four carcass traits were recorded at the slaughter house: carcass weight (CWT), eye muscle area (EMA), back fat thickness (BFT) and marbling score (MS9) (Table 1). Table 4.1: Phenotypic averages of LWT, CWT, EMA, BFT and MS9. Trait Mean Standard deviation Carcass weight (CWT) 412.40 62.30 Eye muscle area (EMA) 90.56 13.07 Back fat thickness (BFT) 12.32 5.33 Marbling score (MS9) 5.23 2.24 4.3.2. Genotypes 9,160 animals were genotyped at the 50k level (Illumina Bovine SNP50 BeadChip; Illumina, San Diego, CA, 1,704 animals genotyped at high-density level (777k SNP, Illumina Bovine HD Beadchip, Illumina, San Diego, CA), and 203 reference animals were available with whole genome sequence (WGS) data. DNA sequence was aligned to ARS-UCD1.2 bos taurus assembly. Genotype imputation was done stage by stage with the help of a set of reference animals, i.e. starting from medium density SNP data (50k) to 700k, and then from 700k up to sequence level. The phasing and imputation were done one chromosome at a time. Eagle software version 2.3.2 and Minimac3 were used for the phasing and imputation processes respectively. Further, the imputed SNPs with an R2 score lower than 0.6 were removed to get a subset of sequence panel. In the next step, all SNPs having minor allele frequency lower than 0.05 were discarded. 13,526,184 SNPs passed filtering criteria. These filtering cutoffs were chosen arbitrarily but commonly used in animal breeding and genetics community. The remaining SNP positions were matched with Illumina Bovine HD 777K map and 50K map to get two subsets of 70 519,271 and 40,676 SNPs respectively. Finally, genomic relationships matrices (G) were constructed based on different SNP panels as G= ZZ’, where Z is a standardized matrix of genotypes, 3!" -45" 𝑍12 = # (1) 6∑$ 45" (8-5" ) where 𝑀12 indicates the marker matrix that includes the number of copies of reference allele on SNP k for animal j, 𝑝2 represents the allele frequency of the reference allele for SNP k, and m indicates the total number of SNPs in the marker matrix. 4.3.3. Genomic prediction and GWAS The phenotypes were pre-adjusted for contemporary group, age of the animals on the day of slaughter and test batch effects. Adjusted phenotypes were used as a response variable for an animal effect model of genomic prediction. 𝒚 = 1𝜇 + 𝑊𝒈 + 𝒆 (2) where 𝒚 are the animals’ adjusted phenotypes, 𝜇 is the overall mean, 𝒈 are the additive genetic effects, assumed to be normally distributed with a mean 0 and variance of 𝐆𝜎94 . 𝜎94 represents the genetic variance, W is the design matrix, 𝒆 represents the residuals, assumed to be normally distributed with a mean 0 and variance 𝑰: 𝜎;4 . Variance components (𝜎94 and 𝜎;4 ) were estimated using Restricted Maximum Likelihood (REML) with R package NAM (Xavier et al., 2015). Firstly, GWAS was performed using all the data for a better resolution of QTL regions. We first obtained the breeding values and then back-solved the breeding values to get the SNP effects and their variances (Gualdrón Duarte et al., 2014). Secondly, to estimate the prediction accuracy we adopted a 5-fold cross validation approach. In each iteration, half of the population was used to identify top GWAS SNPs while the other half was used for training and testing: 80% of 71 remaining animals were randomly selected as training data set and the 20% of animals were selected as testing dataset. GWAS was performed in the training data. The SNPs with a P-value of less than 0.00001 were selected for each trait and pruned for LD using indep-pairwise software in plink (https://www.cog-genomics.org/plink/1.9/) with following paramters: window size=100, step size = 10, r2 threshold = 0.6. The software takes 100 SNPs at a time, prune all the highly linked variants in that window and move to the next step by adding the next 10 variants. This process is repeated until no such pairs remain. The pruned set of SNPs were used for prediction in the validation set. The medium and high-density SNP panels were also augmented with these selected SNPs and compared with other scenarios. The Pearson correlation between estimated breeding values and the adjusted phenotypes in the validation data set represented the cross- validation prediction accuracy (R). Table 4.2: Number of SNPs in each scenario of genotypes relative to the trait. All-Seq refers to SNP matrix based on final imputed dataset without filtering any SNPs, Seq-QC was based on only filtered SNPs in imputed sequence data, GWAS SNPs refer to selected SNPs from GWAS using Seq-QC data, and finally, augmented 50K and HD panels consisted of GWAS SNPs combined with 50 K and HD SNP panels respectively. Trait Number of SNPs Imputed No. of GWAS No. of LD pruned HD Panel 50K panel Sequence SNPs (average) SNPs (average) CWT 13526184 519271 40676 15225 1839 EMA 13526184 519271 40676 7038 1155 BFT 13526184 519271 40676 7704 1191 MS9 13526184 519271 40676 6761 1209 4.3.4. Modeling epistasis To study the relationship between marker density and genetic variance components, an extended GBLUP model (EG-BLUP) method was used to model additive by additive epistatic 72 effect and additive genetic effect simultaneously (Martini et al., 2016), assuming that additive and epistatic effects are mutually orthogonal. 𝒚 = 𝟏𝝁 + 𝑾𝒂 𝒈𝒂 + 𝑾𝒆 𝒈𝒆𝒑 + 𝒆 (3) Where 𝒚, 𝝁, and 𝒆 are the same as in (2). 𝒈𝒂 is a n dimensional vector of additive genotypic values and 𝒈𝒆 is a n dimensional vector of additive by additive epistatic values where 𝒈𝒂 ~ 𝑵(𝟎, 𝑮𝝈𝟐𝒂 ), 𝒈𝒆𝒑 ~ 𝑵(𝟎, 𝑯𝝈𝟐𝒆𝒑 ). Here G is the standard genomic relationship matrix (VanRaden, 2008) while H is the epistatic relationship matrix obtained by Hadamard product of G by itself. Just like SNP- BLUP is equivalent to GBLUP, it has been shown that EG-BLUP model is equivalent to explicitly modeling epistatic effects of markers (Jiang and Reif, 2015; Martini et al., 2016). We also compared the full model presented in Eq. 3 to a reduced model that consisted of additive component only as shown in Eq. 2. We calculated the variance components by using regress function in R package regress (https://cran.r-project.org/web/packages/regress/index.html). Finally, we calculated the genetic effects of the test population by matrix multiplication using Equation 2 from VanRaden et al. (2008) (VanRaden, 2008). The accuracy of prediction was represented by the mean correlation between genetic effects and adjusted phenotypes of test population by 10-fold cross validation (80/20 split between reference and test). For model used in Eq. 2, we calculated the three different correlations: using only the additive values (given by Ra), only epistatic values (Rep), as well by using the sum of additive and epistatic values (Ra+ep) representing the total genetic value of animals. 73 4.4. Results and Discussion 4.4.1. Genomic Relationships Genomic relationships obtained from sequence data were compared to those obtained from high and medium density genotypes and pedigree information. The mean of upper triangle values of relationship matrices obtained from medium (G50) and high (G777) density genotypes and sequence data (Gseq) was the same (i.e. 0.00013). The correlations between the off-diagonal values of G50 and G777 with the off diagonals of Gseq were 0.98 and 0.99 respectively. This means that genetic relationships between animals based on medium and high-density genotypes and imputed sequence were comparable. Principal component analysis 0.02 0.01 0.00 PC2 (10.154%) −0.01 −0.02 −0.03 −0.04 −0.06 −0.04 −0.02 0.00 PC1 (11.776%) Figure 4.1: Plot of first two principal components by eigen value decomposition of genomic relationship matrix based on imputed sequence data. 74 We performed principal component analysis and plotted the first two principal components to visualize the genetic structure of the population. The first two principal components accounted for 11.78% and 10.15% of variance in the population respectively (Figure 1). 4.4.1. Association analyses Association analysis based on sequence data provided a much better resolution than medium and high density SNP data. For example, for CWT, the strongest signals were observed on CHR 4, 6 and 14 which were detected by medium, high density SNP data and imputed sequence. However, smaller peaks on CHR 1,7, 19, 22, and 27 which were not detected in 50K and 700K data, were identified when imputed sequence data was used. Similarly for marbling score, none of SNPs reached statistical significance using 50K and 700 k panels, but imputed sequence identified a bunch of peaks spread across the genome (CHR 4,7,10,12,14,18,19,22) that reached Bonferroni level of significance which is deemed to be the most conservative form of multiple test correction. Similarly, for EMA significant genomic regions were identified on CHR 1,4,6, 9,11,12,14,15,17,19,22,23,24 and 28 using imputed sequence information using Bonferroni cut off value. None of these peaks could be identified using medium and high density data. For BFT strongest peaks were observed at CHR 3,13,15,17,18,19,23, 24 and 25. Some of these peaks appeared to be approaching statistical significance in 700 SNP panel but none could be identified using 50K data. To identify the candidate genes, significant SNPs were mapped to ARS assembly and genes within 250 bp on either end of significant SNPs were reported in table 3. A total of 536 significant SNPs associated with CWT mapped to 62 genes in cattle genome. Similarly, 27,19 and 52 SNPs mapped to 10, 19 and 10 genes associated with EMA, BFT and MS respectively. 75 A B C Figure 4.2: Manhattan plots showing P values obtained by genome wide association studies on carcass weight based on (A) 50K, (B) 700K and (C) imputed sequence genotypes. 76 Our study not only confirmed the existing findings that most of the significant SNPs for CWT were centered around a region on CHR 14 (spanning from 14.6 to 32.3 Mb) containing PLAG1, TOX, CHCHD7 genes among others, but also identified some novel genes (like TRAPPC9, ST3GAL1, NDRG1, LRRC6, SLC26A7, COL14A1) located far away from the said region. Similarly, we also identified a novel region on CHR 19 spanning from 26.4 Mb to 28.8 Mb that contained genes (STX8, MINK1, U8, KIF1C, TMEM107, MYH10) associated with CWT. We also identified other genes on CHR 19 (NOG, MYO1D, BRIP1, RAB40B, RPTOR) associated with EMA, BFT and MS listed in table (3). For BFT, 5 genes were identified on chromosome 13 (MPP7, SVIL, PARD3, JCAD, CD93) but the genes that contained the most number of significant SNPs (n=15) was located on CHR 25 (ENSBTAG00000048566). Moreover, smaller regions each containing less than 5 genes were identified on a number of chromsomes (CHR 1, 3, 7, 10, 11, 12, 15, 17, 18, 22, 23, 24, 27) listed in table 3. Previous studies based on SNP array (Lee et al., 2013) and imputed sequence data (Bhuiyan et al., 2018) have identified candidate regions on CHR 4, 6 and 14 for CWT and EMA while no significant regions were identified for BFT and MS in Hanwoo. Table 4.3: Genes associated with carcass traits identified using imputed sequence data of all the animals. Bonferroni cutoff value of 3.7 x 10-09 was used as a cut off value for significance. SNP s - Ensemble ID Name ma Chr Gene Start Gene End Trait log10 ppe P d ENSBTAG00000049910 CHCHD7 1 14 23376061 23383133 CWT 14.3 ENSBTAG00000005287 CYP7A1 1 14 24664833 24675169 CWT 14.0 ENSBTAG00000019147 RPS20 1 14 23278316 23279689 CWT 13.6 ENSBTAG00000051748 2 14 24577095 24587122 CWT 13.5 ENSBTAG00000008958 NSMAF 1 14 24765312 24830983 CWT 12.2 ENSBTAG00000019910 SDCBP 2 14 24728895 24763264 CWT 11.7 ENSBTAG00000038286 PPDPFL 5 14 19981881 19986371 CWT 11.4 ENSBTAG00000017815 STX8 1 19 28658444 28866846 CWT 10.9 77 Table 4.3 (cont’d) ENSBTAG00000040351 1 14 19106847 19109491 CWT 10.7 ENSBTAG00000018644 PDZRN3 7 22 28178200 28387545 CWT 10.4 ENSBTAG00000017019 PRKDC 1 14 19424288 19551028 CWT 10.3 ENSBTAG00000015637 BPNT2 6 14 23867153 23883121 CWT 10.0 ENSBTAG00000000948 RAB2A 1 14 26181071 26253265 CWT 10.0 ENSBTAG00000000711 NDRG1 10 14 8073208 8129604 CWT 9.9 ENSBTAG00000005748 SOX17 1 14 22229596 22232109 CWT 9.9 ENSBTAG00000004907 MINK1 2 19 26489830 26537129 CWT 9.9 ENSBTAG00000010548 LRRC6 9 14 8601623 8672683 CWT 9.9 ENSBTAG00000043508 U8 1 19 27820782 27820917 CWT 9.9 ENSBTAG00000047743 KCNIP4 42 6 40250033 41576485 CWT 9.8 ENSBTAG00000004954 TOX 113 14 24946881 25258596 CWT 9.8 ENSBTAG00000004022 PLAG1 9 14 23330541 23375751 CWT 9.7 ENSBTAG00000000914 OPRK1 3 14 21713331 21737924 CWT 9.7 ENSBTAG00000001349 KIF1C 1 19 26390112 26412206 CWT 9.7 HEPACA ENSBTAG00000012456 5 4 10541020 10595196 CWT 9.6 M2 TMEM1 ENSBTAG00000001702 3 19 27821338 27823573 CWT 9.6 07 ENSBTAG00000044080 CPA6 1 14 31534789 31829014 CWT 9.6 ENSBTAG00000044050 XKR4 12 14 22640221 22953771 CWT 9.5 ENSBTAG00000020034 LYN 24 14 23134995 23244752 CWT 9.4 ENSBTAG00000054400 NKAIN3 1 14 27737786 27998050 CWT 9.4 ENSBTAG00000051221 2 14 14617954 14685406 CWT 9.4 ENSBTAG00000001156 ST3GAL1 5 14 7865552 7951420 CWT 9.4 ENSBTAG00000005560 ST18 35 14 21160085 21250282 CWT 9.4 ENSBTAG00000048664 CDK14 9 4 7937943 8616503 CWT 9.4 ENSBTAG00000019808 CCSER1 4 6 33504739 34501208 CWT 9.4 ENSBTAG00000014517 KLB 1 6 58528467 58563083 CWT 9.3 ENSBTAG00000020931 CHN2 4 4 66634935 66970909 CWT 9.3 ENSBTAG00000009138 UBXN2B 6 14 24587138 24624435 CWT 9.3 ENSBTAG00000013537 FER1L6 2 14 15865228 15995482 CWT 9.2 ENSBTAG00000022169 PREX2 1 14 31976746 32270936 CWT 9.2 ENSBTAG00000054303 7 14 20266077 20282980 CWT 9.1 ENSBTAG00000017458 CALCR 2 4 10781954 10893236 CWT 9.1 ENSBTAG00000016194 FBXO32 1 14 16328949 16361047 CWT 8.9 ENSBTAG00000002448 SNTG1 100 14 20398321 20699673 CWT 8.9 ENSBTAG00000017529 CA8 1 14 25956319 26038333 CWT 8.9 ENSBTAG00000051416 1 4 10203048 10206518 CWT 8.8 ENSBTAG00000044023 CDK6 8 4 9939646 10188536 CWT 8.8 14566894 14568721 ENSBTAG00000011802 COL6A1 11 1 CWT 8.79 9 0 ENSBTAG00000013579 SARAF 1 27 26248680 26273822 CWT 8.78 78 Table 4.3 (cont’d) ENSBTAG00000043978 CLVS1 1 14 26854867 27012590 CWT 8.78 PPP1R9 ENSBTAG00000024426 6 4 12191129 12534305 CWT 8.73 A ENSBTAG00000021151 MYH10 4 19 28063029 28183409 CWT 8.7 FAM110 ENSBTAG00000050550 39 14 24365744 24432635 CWT 8.69 B TRAPPC ENSBTAG00000013955 2 14 3200372 3587738 CWT 8.68 9 ENSBTAG00000026283 ASPH 1 14 27016060 27203495 CWT 8.68 ENSBTAG00000020407 MTSS1 2 14 15383762 15553664 CWT 8.59 ENSBTAG00000052277 1 14 22093129 22095768 CWT 8.58 ENSBTAG00000055199 7 14 22058467 22073263 CWT 8.58 ENSBTAG00000044106 SPIDR 1 14 19129765 19406228 CWT 8.58 ENSBTAG00000017086 GRB10 1 4 5106065 5327013 CWT 8.56 ENSBTAG00000021841 CHD7 1 14 26361178 26487977 CWT 8.52 ENSBTAG00000002428 PPA2 1 6 19708709 19805633 CWT 8.49 ENSBTAG00000013362 DNM2 1 7 15209049 15297760 CWT 8.46 OR51C1 ENSBTAG00000054011 1 15 50310273 50311217 EMA 9.85 P ENSBTAG00000015894 WWOX 1 18 5240349 6170333 EMA 9.8 ENSBTAG00000019404 IL10RB 1 1 2242523 2314142 EMA 9.33 ENSBTAG00000032301 SLC26A7 3 14 73063026 73212030 EMA 9.06 ENSBTAG00000014112 EXOC4 2 4 96940092 97744748 EMA 9.04 ENSBTAG00000040282 NOG 1 19 7389042 7389740 EMA 8.95 ENSBTAG00000019062 FCHSD2 11 15 52372213 52631806 EMA 8.91 ENSBTAG00000015527 MYO1D 4 19 17330801 17691005 EMA 8.89 ENSBTAG00000046413 GRID2 1 6 30780273 31540156 EMA 8.72 ENSBTAG00000004259 HPCAL1 2 11 87213288 87330722 EMA 8.67 10.2 ENSBTAG00000017354 MPP7 3 13 36304106 36527690 BFT 1 FAM222 ENSBTAG00000050459 1 17 63456481 63510511 BFT 9.88 A ENSBTAG00000027444 SVIL 1 13 34569531 34751826 BFT 9.82 TBC1D2 ENSBTAG00000014253 1 23 11148615 11223026 BFT 9.48 2B ENSBTAG00000038823 1 19 7286704 7311297 BFT 9.43 ENSBTAG00000014991 PARD3 2 13 18484975 19062784 BFT 9.22 ENSBTAG00000048566 15 25 2265790 2272283 BFT 8.98 ENSBTAG00000001204 JCAD 1 13 35028262 35105654 BFT 8.97 ENSBTAG00000011995 1 18 2564597 2567928 BFT 8.95 ENSBTAG00000012002 BCAR1 1 18 2573410 2610924 BFT 8.91 ENSBTAG00000004207 CD93 2 13 41878359 41882281 BFT 8.88 ENSBTAG00000019543 ELOB 3 25 2280925 2284966 BFT 8.73 79 Table 4.3 (cont’d) ENSBTAG00000019745 LMF1 7 25 714685 777085 BFT 8.67 ENSBTAG00000002823 MPZL1 1 3 915375 997233 BFT 8.63 ENSBTAG00000012870 BMP5 2 23 4470532 4614742 BFT 8.63 ENSBTAG00000018482 KBTBD4 1 15 77374325 77379012 BFT 8.56 10496505 10518860 ENSBTAG00000001629 SCMH1 5 3 BFT 8.52 3 8 ENSBTAG00000034368 PRSS33 3 25 2271693 2274504 BFT 8.47 ENSBTAG00000053860 1 24 2736362 2757313 BFT 8.46 ENSBTAG00000018947 SYT16 3 10 73936265 74230784 MS 9.12 ENSBTAG00000054537 4 18 50872943 50885021 MS 9.08 ENSBTAG00000012068 BRIP1 1 19 11232242 11410164 MS 9 ENSBTAG00000015387 RAB40B 1 19 49958836 49984090 MS 8.92 ENSBTAG00000019832 TGFBR2 2 22 5091037 5184321 MS 8.91 ENSBTAG00000012032 PDE4A 2 7 14946947 14990520 MS 8.88 ENSBTAG00000003490 ELMO1 3 4 59867412 60450510 MS 8.73 COL14A ENSBTAG00000013369 1 14 81512830 81729606 MS 8.71 1 ENSBTAG00000002883 RPTOR 1 19 51687441 52010821 MS 8.61 ENSBTAG00000001133 VWA8 1 12 11675264 12071822 MS 8.55 4.4.3. Heritability and Accuracy Genomic heritability estimates for all traits were moderate, ranging between 0.35 to 0.41 using sequence data (Seq-QC). We observed high-density genotypes yielded comparable heritability estimates to imputed sequence data. These results were consistent in all carcass traits with varying genetic architecture e.g. CWT is known to be regulated by some QTLs with large effect on Chr 4,6 and 14 while MS and BFT are known to be highly polygenic with thousands of small effect loci (Bhuiyan et al., 2018). Our results were similar to other studies in cattle that found minimal gain in the use of sequence data for genomic prediction (van Binsbergen et al., 2015; Calus et al., 2016; Raymond et al., 2018b) due to high LD in most cattle breeds. Adding preselected markers to 50K and 777K genotypes marginally increased the accuracy as compared to sequence data for CWT which is regulated by some large effect loci as demonstrated 80 in Figure 2 and 3. However, this strategy did not work for traits that were regulated by small effect loci like EMA, BFT and MS9. A small number of selected SNPs yielded higher accuracies than random SNP sets consisting of same number of SNPs but their accuracies were lower than commonly used SNP panels of 50K and 700 K SNPs. Veerkamp et al,. (Veerkamp et al., 2016) also reported that using only the selected markers from a GWAS using a subset of animals from the same population as a reference decreased the accuracy of prediction as compared to genotyping panels and addition of such markers to common SNP panels resulted in negligible gains in prediction accuracy. However, Raymond et al,. (Raymond et al., 2018b) were able to benefit from a powerful meta-GWAS study to select markers for across breed prediction in dairy cattle. They observed that while addition of selected markers to SNP panels was not beneficial, using only a small number of selected markers resulted in 3-4 times increase in across breed prediction accuracy as compared to common SNP panels. Brondum et al,. (Brøndum et al., 2015) also demonstrated that by selecting SNPs based on independent GWAS using sequence data from three different breeds and augmenting the 54K SNP data led to a 4% increase in prediction accuracy. Van den Berg et al,. (van den Berg et al., 2016) also observed in a simulation study that accuracy only increased when selected variants were very close to causative mutations and concluded that selection of the right markers is an essential step in exploiting the true potential of sequence data. These studies highlight that although selection of variants from the same population may be difficult due to long range LD, multi-breed meta GWAS and genomic prediction in distant populations can fully exploit the benefit of sequence data. While using sequence data for prediction may result in limited to no gains in closely related populations, 81 accumulation of sequence data over successive generations may lead to a higher gain in prediction accuracy for distant populations (Nawaz and Gondro, 2022). 0.5 CWT 0.4 Accuracy of prediction 0.3 0.2 0.1 Random SNPs Seq 700K 50K TopSNPs 700K+TopSNPs 50K+TopSNPs SNP panels 0.5 EMA 0.4 Accuracy of prediction 0.3 0.2 0.1 Random SNPs TopSNPs 50K Seq 700K 700K+TopSNPs 50K+TopSNPs SNP panels Figure 4.3: Boxplots of 5 fold cross validation prediction accuracy of carcass traits using different SNP panels. 82 Figure 4.3 (cont’d) 0.5 BFT 0.4 Accuracy of prediction 0.3 0.2 0.1 Random SNPs TopSNPs 50K 700K Seq 700K+TopSNPs 50K+TopSNPs SNP panels 0.5 MS 0.4 Accuracy of prediction 0.3 0.2 0.1 Random SNPs TopSNPs 50K 50K+TopSNPs 700K 700K+TopSNPs Seq SNP panels 4.4.4. Epistatic variance captured by WGS and SNP chip panels For all traits and models, epistatic genetic variance explained a small proportion of total phenotypic variance. Among all traits, MS had the smallest epistatic component ranging between 2.65% to 7.27% of total phenotypic variance while BFT had the highest epistatic variance and ranged between 17.76% to 30.12% depending on the density of SNP panel used (Table 4). For CWT and EMA, these numbers ranged between 8.09% and 17.68%. These estimates were within 83 the range of epistatic variances reported for quantitative traits in other studies (Su et al., 2012; Vitezica et al., 2018). Table 4.4: Additive genetic and epistatic variance components for carcass traits in Hanwoo for WGS, 700K and 50K SNP panels. Snp Traits Model 1 Model 2 panel Additive (% of Epistatic (% of phenotypic Residual Additive phenotypic Residual variance) variance) SEQ CWT 736.43 (37.88%) 1207.62 708.48 184.32 (9.48%) 1052.02 700K CWT 735.30 (37.66%) 1206.71 711.69 157.16 (8.09%) 1073.49 50K CWT 687.67 (35.63%) 1241.10 652.60 255.71 (13.24%) 1022.00 SEQ EMA 34.49 (31.35%) 75.51 32.16 14.88 (13.53%) 63.00 700K EMA 34.40 (31.32%) 75.45 32.61 11.66 (10.61%) 65.60 50K EMA 31.40 (28.72%) 77.94 28.68 19.34 (17.68%) 61.35 SEQ BFT 7.14 (33.17%) 14.38 6.50 3.82 (17.75%) 11.20 700K BFT 7.12 (33.10%) 14.39 6.45 4.08 (18.97%) 10.97 50K BFT 6.38 (29.81%) 15.02 5.39 6.44 (30.12%) 9.56 SEQ MS 1.05 (39.23%) 1.62 1.04 0.07 (2.65%) 1.56 700K MS 1.03 (38.76%) 1.63 1.02 0.10 (3.63%) 1.55 50K MS 0.95 (35.66%) 1.71 0.92 0.19 (7.27%) 1.54 Comparison of prediction accuracies for model 1 and 2 revealed that addition of epistatic component in the model did not result in an increase in prediction accuracy of breeding value and total genetic value (Table 5). Infact, additive and epistatic effects were highly correlated for all traits and models (𝑹(𝒈𝒂 , 𝒈𝒆𝒑 ) >= 0.80) (Table 5). Similar results were obtained by modelling non additive genetic components for litter size in pigs (Vitezica et al., 2018). They showed that although broad sense heritability almost doubled by including dominance and epistasis, there was no gain in prediction accuracy for breeding and total genetic values. However, Su et al (Su et al., 2012) found that modeling epistatic effect slightly improved unbiasedness of genomic predictions for average daily gain pigs in using 60K SNP panel. No comparable studies were found for carcass traits in beef cattle. 84 We also observed that epistatic variance captured by the WGS was lower than 50K data for all traits. For example, CWT epistatic variance decreased from 13.24% in 50K to 9.48% in WGS. Similarly, 17.68% to 13.53% for EMA, 30.12% to 17.75% for BFT, and 7.27% to only 2.65% for MS in case of 50K and sequence data respectively. This can be related to phantom epistasis i.e. the effect of missing QTLs in lower density panels captured by a pairwise epistatic effect. Table 4.5: Accuracy of prediction for carcass traits using additive and/OR non additive models for WGS, 700K and 50K SNP data. Snp Traits Model 1 Model 2 panel 𝑹(𝒚, 𝒈𝒂 ) 𝑹(𝒚, 𝒈𝒂 ) 𝑹(𝒚, 𝒈𝒆𝒑 ) 𝑹(𝒚, 𝒈𝒂 +𝒈𝒆𝒑 ) 𝑹(𝒈𝒂 , 𝒈𝒆𝒑 ) SEQ CWT 0.4058 0.4055 0.2913 0.3988 0.875 700K CWT 0.4067 0.4064 0.29 0.4005 0.868 50K CWT 0.4061 0.4053 0.2999 0.3992 0.876 SEQ EMA 0.3623 0.3623 0.2764 0.3572 0.876 700K EMA 0.3652 0.3652 0.2712 0.36 0.867 50K EMA 0.3591 0.3586 0.2843 0.3547 0.873 SEQ BFT 0.3551 0.3547 0.2882 0.3519 0.897 700K BFT 0.3557 0.3552 0.2892 0.3523 0.895 50K BFT 0.3414 0.3397 0.2988 0.3427 0.872 SEQ MS 0.388 0.388 0.2428 0.3866 0.796 700K MS 0.388 0.3879 0.2458 0.3862 0.807 50K MS 0.3828 0.3826 0.251 0.3799 0.821 Some limitations of our study include a small size of populations and high relationship between reference and validation animals as they originally come from the same population. Moreover, the number of SNP effects estimated were much larger than the number of animals in the dataset. This might lead to a lower accuracy in estimation of SNP effects that translated into the loss of advantage of using sequence data. Moreover, we used GBLUP approach in this study which assumes that all SNPs effects regulating the traits originate from a normal distribution. Bayesian approaches that assume apriori distribution of SNP effects were predicted to perform better with sequence data based on simulation studies (Meuwissen and Goddard, 2010; 85 MacLeod et al., 2013) and real data (van Binsbergen et al., 2015) but the advantage depended largely on the size and distribution of QTLs. Ober et al (Ober et al., 2012) demonstrated that GBLUP and Bayes B lead to comparable prediction accuracies with real phenotypes. Moreover, imputation errors can also be important factor in reduction of advantage of using sequence data. Binsbergen et al (van Binsbergen et al., 2015) demonstrated that imputed HD genotypes randomly selected from sequence data lead to a reduction of 1%-3% in accuracy as compared to real HD genotypes. Other studies in dairy cattle have also shown that when accuracy of imputation was low, the accuracy of prediction decreased (Khatkar et al., 2012; Mulder et al., 2012; Chen et al., 2014). More recently, Wainschtein et al (Wainschtein et al., 2022) recovered the missing heritability for human height and BMI by using real sequence information and demonstrated that low frequency variants account for a large proportion of heritability. As accuracy of imputation is lower for low frequency variants (Pausch et al., 2017) , it penalizes the prediction accuracy for imputed sequence data. 4.5. Conclusions In conclusion, association analysis based on sequence data resulted in stronger signals of association as compared to medium and high-density genotypes and identified novel loci regulating carcass traits. However, we observed no gain in prediction accuracy using imputed sequence data as compared to medium and high-density genotype for genomic prediction in Korean beef cattle using GBLUP methodology. Addition of SNPs pre-selected based on association analysis to medium and high-density genotypes may result in a slight increase in accuracy of prediction for traits regulated by large effect loci. Moreover, modelling epistatic effects explained a small proportion of genetic variance but failed to increase the prediction 86 accuracy for additive genetic value and total genetic value. Future research involving genomic prediction in distantly related populations (across breeds or across generations) of animals using sequence data should be explored to evaluate true potential of sequence data for genomic prediction. 87 REFERENCES Bhuiyan, M. S. A., Lim, D., Park, M., Lee, S., Kim, Y., Gondro, C., et al. (2018). Functional partitioning of genomic variance and genome-wide association study for carcass traits in Korean hanwoo cattle using imputed sequence level SNP data. Front Genet 9, 1–14. doi: 10.3389/fgene.2018.00217. Brøndum, R. F., Su, G., Janss, L., Sahana, G., Guldbrandtsen, B., Boichard, D., et al. (2015). Quantitative trait loci markers derived from whole genome sequence data increases the reliability of genomic prediction. J Dairy Sci 98, 4107–4116. doi: 10.3168/jds.2014-9005. Calus, M. P. L., Bouwman, A. C., Schrooten, C., and Veerkamp, R. F. (2016). Efficient genomic prediction based on whole-genome sequence data using split-and-merge Bayesian variable selection. Genetics Selection Evolution 48, 1–19. doi: 10.1186/s12711-016-0225-x. Chen, L., Li, C., Sargolzaei, M., and Schenkel, F. (2014). Impact of genotype imputation on the performance of GBLUP and Bayesian methods for genomic prediction. PLoS One 9, 1–7. doi: 10.1371/journal.pone.0101544. Crossa, J., de Los Campos, G., Pérez, P., Gianola, D., Burgueño, J., Araus, J. L., et al. (2010). Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186. doi: 10.1534/genetics.110.118521. Druet, T., Macleod, I. M., and Hayes, B. J. (2014). Toward genomic prediction from whole-genome sequence data: Impact of sequencing design on genotype imputation and accuracy of predictions. Heredity (Edinb) 112, 39–47. doi: 10.1038/hdy.2013.13. Gualdrón Duarte, J. L., Cantet, R. J. C., Bates, R. O., Ernst, C. W., Raney, N. E., and Steibel, J. P. (2014). Rapid screening for phenotype-genotype associations by linear transformations of genomic evaluations. BMC Bioinformatics 15, 246. doi: 10.1186/1471-2105-15-246. Hayes, B. J., Bowman, P. J., Chamberlain, A. C., Verbyla, K., and Goddard, M. E. (2009a). Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genetics Selection Evolution 41, 1–9. doi: 10.1186/1297-9686-41-51. Hayes, B. J., Bowman, P. J., Chamberlain, A. J., and Goddard, M. E. (2009b). Invited review: Genomic selection in dairy cattle: Progress and challenges. J Dairy Sci 92, 433–443. doi: 10.3168/jds.2008-1646. Hemani, G., Powell, J. E., Wang, H., Shakhbazov, K., Westra, H. J., Esko, T., et al. (2021). Phantom epistasis between unlinked loci. Nature 596, E1–E3. doi: 10.1038/s41586-021-03765-z. Jiang, Y., and Reif, J. C. (2015). Modeling epistasis in genomic selection. Genetics 201, 759–768. doi: 10.1534/genetics.115.177907. 88 Kachman, S. D., Spangler, M. L., Bennett, G. L., Hanford, K. J., Kuehn, L. A., Snelling, W. M., et al. (2013). Comparison of molecular breeding values based on within- and across-breed training in beef cattle. Genetics Selection Evolution 45, 1. doi: 10.1186/1297-9686-45-30. Khatkar, M. S., Moser, G., Hayes, B. J., and Raadsma, H. W. (2012). Strategies and utility of imputed SNP genotypes for genomic analysis in dairy cattle. BMC Genomics 13. doi: 10.1186/1471-2164-13-538. Lee, S. H., Choi, B. H., Lim, D., Gondro, C., Cho, Y. M., Dang, C. G., et al. (2013). Genome-Wide Association Study Identifies Major Loci for Carcass Weight on BTA14 in Hanwoo (Korean Cattle). PLoS One 8. doi: 10.1371/journal.pone.0074677. Mackay, T. F. C. (2014). Epistasis and quantitative traits: Using model organisms to study gene- gene interactions. Nat Rev Genet 15, 22–33. doi: 10.1038/nrg3627. MacLeod, I. M., Hayes, B. J., and Goddard, M. E. (2013). Will Sequence Snp Data Improve the Accuracy of Genomic Prediction in the Presence of Long Term Selection? Proc. Assoc. Advmt. Anim. Breed. Genet., 215–219. Martini, J. W. R., Wimmer, V., Erbe, M., and Simianer, H. (2016). Epistasis and covariance: how gene interaction translates into genomic relationship. Theoretical and Applied Genetics 129, 963– 976. doi: 10.1007/s00122-016-2675-5. Mason, B. A., Goswami, S., Hayes, B. J., Goddard, M. E., Bowman, P. J., Reich, C. M., et al. (2012). Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci 95, 4114–4129. doi: 10.3168/jds.2011-5019. Meuwissen, T., and Goddard, M. (2010). Accurate prediction of genetic values for complex traits by whole-genome resequencing. Genetics 185, 623–631. doi: 10.1534/genetics.110.116590. Meuwissen, T. H. E., Hayes, B. J., and Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. doi: 11290733. Muir, W. M. (2007). Comparison of genomic and traditional BLUP-estimated breeding value accuracy and selection response under alternative trait and genomic parameters. Journal of Animal Breeding and Genetics 124, 342–355. doi: 10.1111/j.1439-0388.2007.00700.x. Mulder, H. A., Calus, M. P. L., Druet, T., and Schrooten, C. (2012). Imputation of genotypes with low-density chips and its effect on reliability of direct genomic values in Dutch Holstein cattle. J Dairy Sci 95, 876–889. doi: 10.3168/jds.2011-4490. Nawaz, M. Y., and Gondro, C. (2022). Improving accuracy of genomic prediction in distant populations by collecting sequence data over generations. in World Congress on Genetics Applied to Livestock Production. 89 Ober, U., Ayroles, J. F., Stone, E. A., Richards, S., Zhu, D., Gibbs, R. A., et al. (2012). Using whole- genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet 8. doi: 10.1371/journal.pgen.1002685. Pausch, H., MacLeod, I. M., Fries, R., Emmerling, R., Bowman, P. J., Daetwyler, H. D., et al. (2017). Evaluation of the accuracy of imputed sequence variant genotypes and their utility for causal variant detection in cattle. Genetics Selection Evolution 49, 1–14. doi: 10.1186/s12711-017-0301- x. Raymond, B., Bouwman, A. C., Schrooten, C., Houwing-Duistermaat, J., and Veerkamp, R. F. (2018a). Utility of whole-genome sequence data for across-breed genomic prediction. Genetics Selection Evolution 50, 1–12. doi: 10.1186/s12711-018-0396-8. Raymond, B., Bouwman, A. C., Schrooten, C., Houwing-Duistermaat, J., and Veerkamp, R. F. (2018b). Utility of whole-genome sequence data for across-breed genomic prediction. Genetics Selection Evolution 50. doi: 10.1186/s12711-018-0396-8. Saatchi, M., Ward, J., and Garrick, D. J. (2013). Accuracies of direct genomic breeding values in Hereford beef cattle using national or international training populations. J Anim Sci 91, 1538– 1551. doi: 10.2527/jas.2012-5593. Schrauf, M. F., Martini, J. W. R., Simianer, H., de los Campos, G., Cantet, R., Freudenthal, J., et al. (2020). Phantom epistasis in genomic selection: On the predictive ability of epistatic models. G3: Genes, Genomes, Genetics 10, 3137–3145. doi: 10.1534/g3.120.401300. Su, G., Christensen, O. F., Ostersen, T., Henryon, M., and Lund, M. S. (2012). Estimating Additive and Non-Additive Genetic Variances and Predicting Genetic Merits Using Genome-Wide Dense Single Nucleotide Polymorphism Markers. PLoS One 7. doi: 10.1371/journal.pone.0045293. van Binsbergen, R., Calus, M. P. L., Bink, M. C. A. M., van Eeuwijk, F. A., Schrooten, C., and Veerkamp, R. F. (2015). Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle. Genetics Selection Evolution 47, 1–13. doi: 10.1186/s12711-015-0149-x. van den Berg, I., Boichard, D., Guldbrandtsen, B., and Lund, M. S. (2016). Using Sequence Variants in Linkage Disequilibrium with Causative Mutations to Improve Across-Breed Prediction in Dairy Cattle: A Simulation Study. G3&#58; Genes|Genomes|Genetics 6, 2553–2561. doi: 10.1534/g3.116.027730. van Eenennaam, A. L., Weigel, K. A., Young, A. E., Cleveland, M. A., and Dekkers, J. C. M. (2014). Applied Animal Genomics: Results from the Field. Annu Rev Anim Biosci 2, 105–139. doi: 10.1146/annurev-animal-022513-114119. VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. J Dairy Sci 91, 4414– 23. doi: 10.3168/jds.2007-0980. 90 Veerkamp, R. F., Bouwman, A. C., Schrooten, C., and Calus, M. P. L. (2016). Genomic prediction using preselected DNA variants from a GWAS with whole-genome sequence data in Holstein- Friesian cattle. Genetics Selection Evolution 48, 1–14. doi: 10.1186/s12711-016-0274-1. Vitezica, Z. G., Reverter, A., Herring, W., and Legarra, A. (2018). Dominance and epistatic genetic variances for litter size in pigs using genomic models. Genetics Selection Evolution 50, 1–8. doi: 10.1186/s12711-018-0437-3. Wainschtein, P., Jain, D., Zheng, Z., Cupples, L. A., Shadyab, A. H., McKnight, B., et al. (2022). Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nat Genet 54. doi: 10.1038/s41588-021-00997-7. Xavier, A., Xu, S., Muir, W. M., and Rainey, K. M. (2015). NAM: association studies in multiple populations. Bioinformatics 31, 3862–3864. doi: 10.1093/bioinformatics/btv448. 91 Chapter 5 FOOTPRINTS OF SELECTION IN ANGUS AND HANWOO BEEF CATTLE USING IMPUTED WHOLE GENOME SEQUENCE DATA 5.1. Abstract In this study, we detected signatures of selection in Hanwoo and Angus beef cattle using allele frequency and haplotype-based methods based on imputed whole genome sequence variants. Our dataset included 13,202 Angus animals with 10,057,633 imputed SNPs and 10,437 Hanwoo animals with 13,241,550 imputed SNPs. The dataset was subset down to 6,873,624 SNPs in common between the two populations to identify within population (runs of homozygosity, extended haplotype homozygosity) and between population signals of selection (allele fixation index, extended haplotype homozygosity). Assuming these selection signals were complementary to each other, they were combined into a composite selection signal to identify regions under selection for each of the breeds. 27 genomic regions spanning 25.15 Mb and harboring 360 genes were identified in Angus on chromosomes 1,3, 4, 5, 6, 7, 8, 12, 13, 14, 16, 20, 21 and 28. Similarly, in Hanwoo, 59 genes and 17 genomic regions spanning 5.21 Mb on chromosomes 2, 4, 5, 6, 7, 8, 9, 10, 13, 17, 20 and 24 were identified. There were no common signals between the two breeds reflecting their largely different selection histories, environmental challenges, breeding objectives and breed characteristics. Positional candidate genes identified in selected genomic regions in Angus have been previously associated with growth, immunity, reproductive development, feed efficiency and adaptation to environment while the candidate genes identified in Hanwoo included important genes regulating meat 92 quality, fat deposition, cholesterol metabolism, lipid synthesis, neuronal development and olfactory reception. 5.2. Introduction Natural selection is an adaptive response to the current conditions of the ecosystem that a population inhabits, which drives its evolutionary changes by favoring traits that are advantageous and increases their prevalence in the population. Very recently, at least on an evolutionary scale, human driven artificial selection has also become a primary driver of changes in populations by exerting selective pressure on traits of human interest. These selection processes change allele frequencies in populations and leave traceable marks across the genome. Genomic regions under selective pressure can be identified by their allele frequency distributions, measures of linkage disequilibrium between loci and the structure of their haplotypes. Identification of these genetic patterns or signatures of selection (SOS) help us understand the underlying biological processes of adaptation in different environments and provide insights into the domestication history of agricultural species. They can also help us identify genes or genomic regions that regulate the phenotypic expression of traits of economic importance. For example, studies of signatures of selection have been used to identify genes that regulate coat color and body size in dogs (Pollinger et al., 2005; Sutter et al., 2007), stature in horses (Makvandi-Nejad et al., 2012), and body temperature maintenance under cold stress in cattle (Igoshin et al., 2019). In cattle, Randhawa et al., (Randhawa et al., 2016) published a meta- assembly of selection signatures in cattle genome by combining results from various studies. Various methods have been proposed to identify genomic signatures of selection which can be broadly classified into two main categories: within population measures for a single population 93 (e.g., runs of homozygosity and integrated haplotype score) or between population measures that compare two or more populations (e.g., fixation index and cross-population extended haplotype homozygosity). Each of these test statistics explore unique facets of the genomic architecture of populations but they are not necessarily consistent with each other. Due to this, some studies take a more conservative approach and only focus on the regions that are common across different measures, albeit at the risk of not identifying a proportion of the relevant signals in the process. An alternative approach is to consider the selection signals from different methods as complimentary to each other (Ma et al., 2015) and combine them to get a composite score (Randhawa et al., 2014). Various methods to combine individual signals have already been proposed in the literature (Grossman et al., 2010; Utsunomiya et al., 2013b; Randhawa et al., 2014; Ma et al., 2015). Initial approaches to combine the signals did not account for the covariance structure between signals but Ma et al (2015) suggested a new approach to calculate a decorrelated composite of multiple signals (DCMS) that adjusted for correlations between signals and was more powerful to detect selected regions in the genome. This study focused on the identification of signatures of selection in Angus and Hanwoo cattle. Both are beef cattle breeds, but they have been subjected to very different selection pressures and have different genetic population structures, body characteristics, domestication history, beef quality and breeding program objectives. Hanwoo are Korean taurine cattle, more related to the Japanese Wagyu than to western taurine cattle breeds (Lee et al., 2014). Hanwoo have a smaller stature than Angus, but its beef is popular for its juiciness, high levels of marbling, and unique flavor (Cho et al., 2010); which, similarly to Wagyu, attracts a market premium. Hanwoo was historically a draft breed but a formal national breeding program for breed traits was 94 established in 1980s. The selection index of the Korean Proven Bulls program is mainly driven by 4 traits – marbling score (MS), carcass weight (CWT), eye muscle area (EMA) and back fat thickness (BF). Angus, on the other hand, are European taurine cattle characterized by their high muscularity, fast growth rate, medium height, and moderate levels of intramuscular fat (Albertí et al., 2008). Angus have been intensively selected for growth, stature and feed intake in the 20th century and have become the most common beef cattle in the world. In contrast to Hanwoo, different selection indices are used in Angus cattle breeding programs worldwide depending on the type of beef production operation and its breeding objectives. The objectives of this study were to identify genome wide signals of selection in Angus and Hanwoo beef cattle using imputed whole genome sequence (WGS) data. We also combined individual selection measures to obtain a decorrelated composite of multiple signals (DCMS) for identification of selected genomic regions. These signature of selection were then mapped to the ARS-UCD1.2 reference assembly to identify candidate genes located in these regions. We also highlight important genes related to meat production and quality. 5.3. Materials And Methods 5.3.1. Genotype data Imputed whole genome genotypes of 10,437 Hanwoo animals (13,241,550 SNPs) and 13,202 Angus animals (10,057,633 SNPs) were utilized for this analysis. Respectively, the Hanwoo and Angus data consisted of 9,160 and 11,632 animals genotyped on 50k arrays (Illumina Bovine SNP50 BeadChip; Illumina, San Diego, CA), 1,704 and 1,236 animals genotyped on 700k arrays (777k SNP, Illumina Bovine HD Beadchip, Illumina, San Diego, CA), and 203 and 334 reference animals with whole genome sequence (WGS) data. DNA sequence was aligned to ARS-UCD1.2 95 bos taurus assembly. Eagle software version 2.3.2 was used for phasing and Minimac3 for imputation. Details on quality control, WGS pipeline and imputation accuracies were previously reported in Nawaz et al., 2022. 5.3.2. Analysis Within Population measures: a. Runs of homozygosity: Runs of homozygosity (ROH) are defined as long continuous homozygous genomic segments that are assumed to be identical DNA segments inherited by descent from a common ancestor, and that serve as an indicator of genomic autozygosity, consanguinity, selection, and population size reduction. ROH detection was done using the homozyg function in plink using the default parameters except for the number of SNPs in a scanning window (homozyg-window-snp) which was increased to 100 instead of default 50 SNPs because of the high density of SNPs in the sequence data. The other default parameters were as follows: --homozyg-window-het = 1 (maximum number of heterozygotes in a scanning window) --homozyg-window-missing = 5 (maximum number of missing calls in a scanning window) --homozyg-window-threshold = 0.05 --homozyg-kb 1000 --homozyg-het 1 --homozyg-snp 100 --homozyg-density 50 --homozyg-gap 1000 96 To identify ROH islands, we calculated the autozygosity of each SNP by taking the proportion of individuals in which a SNP was identified within an ROH region. b. Integrated haplotype score (iHS): Integrated haplotype score (iHS) aims to identify genomic regions that were under recent positive selection based on the relationship between an allele’s frequency and the extent of linkage disequilibrium around it. iHS was calculated (Voight et al., 2006) based on extended haplotype homozygosity (EHH) values (Sabeti et al., 2002) calculated using the program hapbin (Maclean et al., 2015). Due to the high dimensionality of our data and computational limitations, the analysis was performed by dividing both Hanwoo and Angus datasets into seven and 14 bins containing 1491 and 943 animals per bin, respectively. Final values of iHS were calculated by taking the average of iHS values from the data bins. Absolute values of iHS were smoothed out in windows of 1,001 SNPs to identify regions under recent positive selection. Across Population measures: a. Fixation Index (FST): Fixation Index (FST) is a measure of population differentiation. It represents the proportion of total genetic variance that exists within a sub population. Allele frequencies of Angus and Hanwoo datasets were calculated using freq function in plink. Average of Angus and Hanwoo allele frequencies were used as the baseline allele frequency (𝑝) and genetic variances (𝑝 ∗ (1 − 𝑝)). Finally, FST was calculated for each SNP by taking squared deviation of allele frequency in a population from the baseline frequency divided by allelic variance (Weir and Cockerham, 1984): 𝝈𝟐 𝑭𝑺𝑻 = 𝒑 ∗ (𝟏 − 𝒑) 97 To identify prominent genomic regions, FST was smoothed in windows of 1,001 SNPs using runmed function in R. b. Across population extended haplotype homozygosity (XPEHH): Across population extended haplotype homozygosity (XPEHH) is another population differentiation-based test that is used to detect selective sweeps in which selected regions are close to fixation in one population but remain polymorphic in another population. For XPEHH, we compared the two breeds under study (Angus and Hanwoo) directly against each other to identify regions that were differentially selected between populations. For this analysis, imputed whole genome data was subset down to the 6,873,624 SNPs that were in common between the two populations. We used the hapbin software (Maclean et al., 2015) to perform this analysis with the xpehh function. De-correlated composite of multiple signals (DCMS): In order to combine the several test statistics, we used the method suggested by Ma et al (Ma et al., 2015) that takes into account correlations between signals to calculate a decorrelated composite of multiple signals (DCMS) based on their P values. Firstly, fractional ranks of absolute values of ROH, iHS, FST and XPEHH were used to calculate their P values using stat_to_pvalue function in R package MINOTAUR (with parameters two.tailed = FALSE, right.tailed = TRUE). Then, a pairwise correlation matrix was created between absolute values of the signals. This matrix was used as an input to DCMS function in MINOTAUR to calculate raw DCMS scores as follows (Ma et al., 2015): : 1 − 𝑝?@ Q𝑙og 𝑝?@ U 𝐷𝐶𝑀𝑆 = P ∑:AB8|𝑟A@ | @B8 98 𝑝?@ was the P value of individual selection measures in each position l and 𝑟A@ was the correlation coefficient between ith and tth measures. Finally, P values of raw DCMS scores were calculated by pnorm function using empirical mean and standard deviation. Multiple test correction was done by calculating false positive rate (FDR) using p.adjust function in base R with method= 'BH'. SNPs having adjusted p-values (q) less than 0.05 were deemed to be significant. Adjacent significant SNPs (located less than 1 MB apart) were combined to identify regions under selection by a custom script in R. Functional annotation of signatures of selection: A bos taurus gene annotation dataset which included positional information for all known bovine genes (n=27,900) mapped to the latest bovine assembly (ARS_UCD1.2) was downloaded from ensemble with BIOMART. Significant genomic regions were mapped to genes using the GenomicRanges package in R (Lawrence et al., 2013). 5.4. Results 5.4.1. With-in Population Measures: a. ROH: The mean number of ROH detected per animal was higher in Angus (88.7 ± 18.50) as compared to Hanwoo (12.5 ± 8.4) (Table 1). The median length of ROH regions was also higher in Angus (1565 BP) as compared to Hanwoo (1384 BP). However, the proportion of ROH regions longer than 5 MB was higher in Hanwoo (12.5%) than Angus (7.1 %). This suggested that Hanwoo had fewer ROH regions, but were longer in length as compared to Angus suggesting comparatively strong recent selection in Hanwoo. Mean genome wide autozygosity was higher in Angus (0.08) 99 as compared to Hanwoo (0.01). The highest peak for Hanwoo was observed at CHR 7 (BP 50280340) and smaller peaks were observed on CHR 2, 12, 23, 24 and 29 peaks. In Angus, the strongest signal was observed on CHR 13 (BP 63854457). Other significant peaks were also identified on CHR 8 and 14. Table 5.1: Summary of results from runs of homozygosity analysis for Hanwoo and Angus cattle. Parameter Hanwoo Angus Total SNPs 13241550 10057633 Total animals 10437 13202 Percentage of animals having ROH 99.7 99.9 Total number of ROH regions 129778 1169509 Mean number of ROH per animal 12.5 88.7 SD of number of ROH per animal 8.4 18.501 Minimum number of ROH per animal 1 1 Maximum number of ROH per animal 440 270 Mean length of ROH regions in KB 3024 2381.845 SD of length of ROH regions 5089.125 2965.864 Median length of ROH regions 1384 1565 Maximum length of ROH 123720 120023 Minimum length of ROH 1000 1000 No of ROH per animal 12.46 88.67 ROH 1-5 mb 113515 (87.5%) 1087433 (92.9%) ROH 5-10 mb 8369 (6.5%) 56661 (4.9%) ROH 10-15 mb 3616 (2.8%) 13443 (1.1%) ROH 15-20 mb 1718 (1.3%) 5491 (0.4%) ROH 20-25 mb 1017 (0.8%) 2828 (0.2%) ROH 25-30 mb 614 (0.5%) 1527 (0.1%) ROH >30 mb 929 (0.7%) 2126 (0.2%) b. iHS: Genome wide distribution of absolute iHS values was similar in Hanwoo and Angus with a mean of 0.31 and 0.30 respectively. Absolute value of iHS indicated genomic regions with unusually long haplotypes at chromosomes 1, 5, 6, 8, 10, 11, 13, 16, 17, 20, 23, 24 and 29 in Angus harboring 13009 significant SNPs. The strongest signal was detected on CHR 16 (rs208273139) at 40588657 BP. In Hanwoo, the strongest signal was observed on chromosome 2 at (rs207720085) 82874034 100 BP. Other peaks were observed at 1, 2, 3, 5, 6, 7, 8, 9, 14, 17, 20, 25, 26 harboring 13030 significant SNPs. Correlation between iHS values of Angus and Hanwoo was 0.016 indicating differences in the regions of selection sweeps between the two breeds. We also observed that ROH and iHS were significantly correlated (R=0.252, 95% confidence interval 0.251-0.253) in Hanwoo and Angus (R=0.286, 95% interval 0.286-0.287) 5.4.2. Across population measures: a. Fixation index FST: SNPs with an FST value in the top 0.2% were identified on 18 out of 29 autosomes indicating widespread allele frequency differences between breeds. CHR 4 contained the most number of significant SNPs (n=1,577) followed by CHR 8 (n=1,464) and CHR 5 (n=1,256). The most significant SNP (rs209900249) was observed on CHR 4 position 69,682,473. Other prominent FST hotspots were observed on CHR 1, 2, 3, 6, 7, 9, 10, 13, 14, 16, 18, 20, 21, 28, and 29. b. Across population extended haplotype homozygosity (XPEHH): 13,004 SNPs with top 0.2% XPEHH values were located on CHR 3 (n=2,216), 8 (n=4,826), 13 (n=4,115) and 14 (n=1,847). The most significant peak was observed on CHR 13 at position 62,594,885 (rs207508467). We also observed that the two measures of across population measures were significantly correlated, Pearson correlation R = 0.2956 and a 95% confidence interval 0.295-0.296. 5.4.3. De-correlated composite of multiple signals (DCMS): a. Angus: A total of 39,898 SNPs were identified with significant p-values. Genic SNPs accounted for 27.49% of all the significant SNPs. 27 significant genomic regions were identified using the DCMS adjusted 101 p-value (q value) cutoff of 0.05. The mean length of selected regions was 931.613 Kb (± 1,255.33) while their total length was 25.153 Mb. The significant genomic regions mapped to CHR 1,3, 4, 5, 6, 7, 8, 12, 13, 14, 16, 20, 21, and 28 that harbor 360 genes. The most significant genomic selection signal was observed on CHR 13 where 91 genes were found spread across 3 distinct regions. Some of the notable genes were associated with body size and stature (PLAG1, CHCHD7, RPS20, LYN), growth and feed intake (TMEM68, TGS1, LYN, XKR4), growth differentiation factor (GDF5), feed efficiency (OR6C76, PIK3CD), embryonic growth and reproductive development (NMNAT1), immunity related to tropical adaptation (SLC25A33, SPSB1), immune response and immune regulation (PIK3CD), pigmentation and adaptation to environment (ASIP). A complete list of all the regions and genes identified is shown in Table 5.2. Table 5.2: Genomic regions under selection in Angus cattle identified by DCMS q values <=0.05 and genes identified in those regions. Start End CHR No of Genes gene s 26683714 26745320 1 0 None 50460996 53793201 3 35 MTF2 DIPK1A RPL5 SNORA66 SNORD21 U6 EVI5 ENSBTAG00000055274 5S_rRNA GFI1 RPAP2 GLMN C3H1orf146 BTBD8 ENSBTAG00000040248 EPHX4 BRDT ENSBTAG00000054884 ENSBTAG00000047443 TGFBR3 CDC7 HFM1 ENSBTAG00000054082 ENSBTAG00000046077 ENSBTAG00000055150 ZNF644 bta-mir-2285b-2 BARHL2 ZNF326 LRRC8D bta-mir-2285k-5 ENSBTAG00000050182 LRRC8C LRRC8B ENSBTAG00000038625 55238880 55468638 3 3 PKN2 ENSBTAG00000051499 ENSBTAG00000051844 69475528 69894433 4 6 7SK SNX10 CBX3 HNRNPA2B1 NFE2L3 MIR148A 48288996 48752060 5 4 MSRB3 LEMD3 WIF1 U6 52011261 52117527 5 1 TAFA2 53692271 54424033 5 3 SLC16A7 ENSBTAG00000055198 ENSBTAG00000053531 58055499 59307814 5 42 U6 ENSBTAG00000047825 ENSBTAG00000052093 ENSBTAG00000049329 ENSBTAG00000051156 102 Table 5.2 (cont’d) ENSBTAG00000046778 ENSBTAG00000048295 ENSBTAG00000054507 ENSBTAG00000050480 ENSBTAG00000051165 ENSBTAG00000051462 ENSBTAG00000049219 ENSBTAG00000051274 ENSBTAG00000048779 OR6C76 OR6C75 ENSBTAG00000049581 ENSBTAG00000049184 ENSBTAG00000048408 ENSBTAG00000024691 ENSBTAG00000051265 ENSBTAG00000050381 ENSBTAG00000049213 ENSBTAG00000054097 ENSBTAG00000049016 ENSBTAG00000045922 ENSBTAG00000048168 ENSBTAG00000053702 ENSBTAG00000054733 ENSBTAG00000049913 ENSBTAG00000051198 ENSBTAG00000002913 ENSBTAG00000051990 ENSBTAG00000048864 ENSBTAG00000046446 ENSBTAG00000049753 ENSBTAG00000054193 OR10A7 ENSBTAG00000049751 ENSBTAG00000053229 ENSBTAG00000053772 ENSBTAG00000037629 1155763 1320935 6 1 ENSBTAG00000051456 8724140 8824920 6 0 None 78535547 78939028 6 0 None 38009575 38227430 7 8 FAF2 RNF44 CDHR2 GPRIN1 SNCB EIF4E1B TSPAN17 UNC5A 44026848 44466715 7 22 ENSBTAG00000012150 MEX3D MBD3 UQCR11 TCF3 ONECUT3 ATP8B3 REXO1 KLF16 ABHD17A ENSBTAG00000050118 SCAMP4 CSNK1G2 bta-mir- 6120 BTBD2 SOWAHA SHROOM1 GDF9 UQCRQ LEAP2 AFF4 U6 89565309 94976398 8 42 5S_rRNA ENSBTAG00000052296 NXNL2 SPIN1 ENSBTAG00000051928 ENSBTAG00000054632 CDK20 FBXW12 ENSBTAG00000021235 MSANTD3 TMEFF1 CAVIN4 PLPPR1 5S_rRNA ENSBTAG00000025760 MRPL50 ZNF189 ALDOB PGAP4 RNF20 GRIN3A ENSBTAG00000050971 ENSBTAG00000030953 CYLC2 U6 SMC2 ENSBTAG00000047350 ENSBTAG00000013445 ENSBTAG00000050829 ENSBTAG00000052864 ENSBTAG00000053491 ENSBTAG00000016173 ENSBTAG00000050464 ENSBTAG00000000145 ENSBTAG00000052019 OR13C3 ENSBTAG00000049256 OR13C8 ENSBTAG00000048409 NIPSNAP3A ABCA1 SLC44A1 103 Table 5.2 (cont’d) 12701319 12716083 12 1 TNFSF11 61100092 61130961 13 12 ENSBTAG00000052743 DEFB121 DEFB122A DEFB122 DEFB123 DEFB124 REM1 HM13 bta-mir-12010 ID1 COX4I2 BCL2L1 62482399 65519468 13 71 BPIFB4 BPIFA2A ENSBTAG00000031375 BPIFA2C ENSBTAG00000011704 BPIFA2B ENSBTAG00000031373 BPIFA3 BPIFA1 BPIFB1 BPIFB5 CDK5RAP1 ENSBTAG00000031354 SNTA1 ENSBTAG00000010131 ENSBTAG00000053051 ENSBTAG00000053797 NECAB3 C13H20orf144 E2F1 PXMP4 ZNF341 CHMP4B RALY EIF2S2 ASIP AHCY ENSBTAG00000050108 ENSBTAG00000046623 ITCH DYNLRB1 MAP1LC3A PIGU TP53INP2 NCOA6 GGT7 ACSS2 GSS MYH7B bta-mir-499 TRPC4AP EDEM2 PROCR MMP24 EIF6 FAM83C UQCC1 ENSBTAG00000053266 GDF5 ENSBTAG00000052250 CEP250 ENSBTAG00000030976 ERGIC3 ENSBTAG00000053187 SPAG4 CPNE1 RBM12 NFS1 ROMO1 RBM39 ENSBTAG00000053775 PHF20 5S_rRNA SCAND1 CNBD2 ENSBTAG00000052997 ENSBTAG00000053403 EPB41L1 ENSBTAG00000050801 AAR2 DLGAP4 67831405 69506639 13 8 FAM83D ENSBTAG00000044690 DHX35 U6 ENSBTAG00000049087 ENSBTAG00000050378 ENSBTAG00000048871 MAFB 22710076 24757731 14 24 XKR4 TMEM68 TGS1 LYN RPS20 ENSBTAG00000045097 U1 MOS PLAG1 CHCHD7 ENSBTAG00000054153 SDR16C5 SDR16C6 PENK U6 BPNT2 FAM110B ENSBTAG00000047136 ENSBTAG00000051748 UBXN2B CYP7A1 U1 SDCBP NSMAF 40378981 41205611 16 9 TNFSF18 ENSBTAG00000052047 ENSBTAG00000053302 TNFSF4 ENSBTAG00000020550 AADACL4 DHRS3 VPS13D SNORA59A 42527352 44175641 16 32 MTOR ANGPTL7 EXOSC10 SRM MASP2 TARDBP CASZ1 PEX14 DFFA ENSBTAG00000045105 CORT CENPS PGD ENSBTAG00000048790 ENSBTAG00000048747 ENSBTAG00000054239 U6 UBE4B RBP7 NMNAT1 LZIC CTNNBIP1 CLSTN1 104 Table 5.2 (cont’d) PIK3CD U6 5S_rRNA U6 TMEM201 SLC25A33 ENSBTAG00000049485 SPSB1 H6PD 45628162 46172668 16 3 ENSBTAG00000048839 ENSBTAG00000051176 CAMTA1 31142802 31450979 20 11 ENSBTAG00000033187 NNT PAIP1 ENSBTAG00000049623 C20H5orf34 TMEM267 CCL28 HMGCS1 ENSBTAG00000048672 NIM1K ENSBTAG00000042376 69997413 70839780 20 8 ENSBTAG00000050065 IRX2 U6 ENSBTAG00000054006 5S_rRNA IRX4 NDUFS6 ENSBTAG00000050317 2650859 3126131 21 2 ATP10A U6 62909681 63080951 21 3 5S_rRNA ENSBTAG00000049199 ENSBTAG00000052737 25323298 25520626 28 9 DDX21 KIFBP U6 SRGN ENSBTAG00000042264 ENSBTAG00000051145 VPS26A SUPV3L1 HKDC1 Figure 5.1: Manhattan plot of DCMS p-values in Angus cattle. 105 Figure 5.2: Circos plot of P values for genome wide selection signatures in Angus cattle. b. Hanwoo: A total of 10,162 SNPs were found in significant hotspots of selection using FDR cut off value of 0.05 on adjusted DCMS P values (q value). Out of these only 2,095 (20.6%) SNPs were located in genes. Significant SNPs were used to identify 17 significant genomic regions. The mean length of the selected regions was 306.27 kb (± 337.43) while their total length was 5.21 Mb. Significant genomic regions mapped to CHR 2, 4, 5, 6, 7, 8, 9, 10, 13, 17, 20, and 24 which harbor 59 genes. The most significant genomic region was on CHR 2 between BP 81860076 and 82963443 BP 106 where only 1 gene was identified (ENSBTAG00000048361). The most number of SNPs mapped to a gene on CHR 17 that plays important role in immunity (LRBA). An important region on CHR 24 (BP 43384983- 44317964) was identified that contained genes (e.g MC2R) regulating fat deposition and meat quality. Other genes identified were previously associated with important roles in brain development (CPLANE1), developmental regulation (NIPBL), mitosis (TTK - may cause tumor proliferation if in excess), breakdown of amino acids (BCKDHB), cholesterol metabolism (NBEAL1), drug metabolism and lipid synthesis (CYP20A1 that belongs to the cytochrome P450 superfamily of enzymes, cell/tissue structural organization (ABI2), neuronal development and olfactory reception (OR6F1). A complete list of all the regions and genes identified has been provided in Table 3. Interestingly, none of the significantly selected regions were common between Hanwoo and Angus. Table 5.3: Genomic regions under selection in Hanwoo cattle identified by DCMS q values <=0.05 and genes identified in those regions. Start No of End Chr Genes genes 81860076 82963443 2 1 ENSBTAG00000048361 69605816 69985802 4 5 SNX10 CBX3 HNRNPA2B1 NFE2L3 MIR148A SLC16A7 ENSBTAG00000055198 53668560 53873255 5 3 ENSBTAG00000053531 77883587 77916175 5 1 RESF1 80625444 80681767 5 0 None 1092999 1344929 6 1 ENSBTAG00000051456 9935965 10067004 6 0 None 78629307 79704653 6 3 ENSBTAG00000054580 5S_rRNA TECRL ENSBTAG00000039484 OR2G2 ENSBTAG00000030735 OR2G3 41006911 41016811 7 9 ENSBTAG00000039804 ENSBTAG00000052311 ENSBTAG00000054452 OR6F1 ENSBTAG00000045691 90427692 90752832 8 4 TMEFF1 CAVIN4 PLPPR1 5S_rRNA 107 Table 5.3 (cont’d) 1692932 1844248 9 0 None 20087216 20220978 9 1 BCKDHB ENSBTAG00000050516 ENSBTAG00000038188 ENSBTAG00000013255 ENSBTAG00000053839 27861580 27896822 10 10 ENSBTAG00000047465 ENSBTAG00000051986 ENSBTAG00000053279 ENSBTAG00000003549 OR4F15 ENSBTAG00000052056 ENSBTAG00000011704 BPIFA2B 62709174 62826868 13 10 ENSBTAG00000031373 BPIFA3 BPIFA1 BPIFB1 BPIFB5 CDK5RAP1 ENSBTAG00000031354 SNTA1 7070755 7207418 17 2 LRBA MAB21L2 bta-mir-2360 CPLANE1 NIPBL ENSBTAG00000050782 37138810 37552209 20 5 SLC1A3 MC2R ENSBTAG00000048673 ENSBTAG00000046153 43661886 44310133 24 4 SETBP1 Figure 5.3: Manhattan plot of DCMS p-values in Hanwoo cattle. 108 Figure 5.4: Circos plot of P values for genome wide selection signatures in Hanwoo cattle. 5.5. Discussion Signatures of selection can serve as a complementary method to genome-wide association studies for identification of functional variants in the genome and to provide new insights into the underlying biology of traits important for agricultural production. The main aim of this study was to identify genomic regions under selective pressure in Angus and Hanwoo cattle utilizing imputed whole genome information. We first identified individual selection signals by four distinct methods primarily based on allele frequency and haplotype patterns. We combined 109 individual signals to identify strong signals of selection. Finally, we identified various positional candidate genes related to beef production and quality. Overall, we observed more genomic regions and genes under selective pressure in Angus than in Hanwoo and none of the selected regions or genes were common between the breeds, which is consistent with differences in breed origin, environmental habitats, divergent selection histories, breeding program objectives and ultimately, the phenotypic differences between the breeds. Genes identified within selected genomic regions in Angus included previously known regulators of growth, body size, feed intake, reproductive performance, and immunity. For example, PLAG1 regulates cell proliferation and its association with carcass weight and stature has been reported in several cattle breeds (Utsunomiya et al., 2013a; Takasuga, 2016; Fink et al., 2017). Similarly, LYN, another regulator of cell proliferation and RPS20, a catalyst of protein synthesis, have been associated with body weight and preweaning daily gain in Nellore (Utsunomiya et al., 2013a; Fink et al., 2017). CHCHD7 was previously reported as significantly associated with height in Jersey and Holstein (Utsunomiya et al., 2013a; Fink et al., 2017) and with carcass weight in Wagyu cattle (Nishimura et al., 2012). Both PLAG1 and RPS20 have also been associated with fetal growth and calving ease (Takasuga, 2016). Several olfactory receptors were also found in significant genomic regions (e.g., OR6C76, OR6C75, OR10A7, OR13C3, OR13C8). The olfactory transduction pathway has been associated with feed intake as it affects the perception of odor and in turn influences food preference and consumption (Abo-Ismail et al., 2010). Olfactory receptor loci have also been identified in other selective sweep studies in cattle and there are indications of recent duplication events (Ramey et al., 2013); which suggests that olfactory receptors may be under strong selection. TMEM68 (a cyltransferase involved in glycerolipid metabolism) and XKR4 have been 110 associated with growth and feed intake in Nellore (Terakado et al., 2018). XKR4 has also been associated with subcutaneous fat in indicine and composite cattle (Porto Neto et al., 2012). TGS1 (trimethylguanosine synthase 1) has pleitropic effects in growth traits and feed efficiency (Terakado et al., 2018; Ghoreishifar et al., 2020). GDF5 (growth differentiation factor) is critical for normal skeletal development. Loss of GDF5 function results in developmental delay and a shortened appendicular skeleton (Buxton et al., 2001). PIK3CD (a component of the phosphatidylinositol-3-kinase pathway) is involved in lymphocyte signaling. Mutations in PIK3CD causes immune dysregulation and disease pathogenesis (Tangye et al., 2019). SPSB1 (splA/ryanodine receptor domain and SOCS box containing 1) is an important component of mammalian innate immune system regulation that recognizes foreign molecules derived from pathogens (Lewis et al., 2011). We also identified solute carrier genes (SLC44A1, SLC16A7, SLC25A33) which belong to a major class of transport proteins in the cell membrane and play an important role in response to metabolic states and environmental conditions (Pizzagalli et al., 2021). Various solute carrier genes were also identified in another study directly comparing zebu and taurine cattle using differential allele frequency and haplotype diversity methods (Chan et al., 2010). This strongly suggests their role in adaptation to tropical environments. ASIP (Agouti signaling protein) is a well-known gene associated with coat color pigmentation and environmental adaptation in several species (Bertolini et al., 2018). Considering the breed’s innate characteristics and the high focus of the Hanwoo breeding program to select for increased marbling, it was reasonable to expect that some genomic regions under selection would be related to marbling score. An important region on CHR 24 was identified which contained ENSBTAG00000046153 and SETBP1 genes. The same region was also 111 identified by composite signal in a multi breed study within a Hanwoo-specific signal (Gutiérrez- Gil et al., 2015). MC2R (adrencorticotropin receptor) and MC5R (melanocortin 5 receptor) genes belong to a family of melanocortin receptors (reviewed by Switonski et al., 2013) that are involved in fatty acid and lipid metabolism pathways and reproduction. These genes have been previously located within a QTL region for marbling and backfat thickness and meat quality in pigs (Kováčik et al., 2012; Switonski et al., 2013). MC5R is a functional candidate for fatness in domestic animals and obesity in humans (Switonski et al., 2013) because it regulates interleukin 6 (IL6) (Jun et al., 2010) and downregulates leptin secretion (Hoggard et al., 2004) respectively resulting in increased fat deposition and increased feed intake. Based on these findings, we conclude that this selected region on CHR 24 is an important functional region for meat quality and should be further investigated in future studies in Hanwoo and/or other beef cattle. Although it is common to focus on the genes identified in selected genomic regions, it should also be considered that much of the phenotypic diversity originates from differential regulation of gene expression by regulatory elements like promoters, enhancers, silencers etc. (van Laere et al., 2003; Salinas et al., 2016). In this study 29.67% and 27.3% of the significant SNPs found in Angus and Hanwoo were annotated to gene coding regions, while the majority of the significant variants were located elsewhere. Similarly, Vernot et al. (2012) reported that the number of regulatory variants under selection far exceeded the number of variants in protein coding regions although their effect sizes may be small. Therefore, apart from the genes highlighted above, there may also be important regulatory elements within these significant genomic regions that play an important role in determining the phenotypic diversity of these breeds. 112 Detection of signatures of selection in populations can be challenging as it may be confounded with various other events in the population’s history that can lead to false positive results, e.g., population bottlenecks, migration, and genetic drift. Ascertainment bias is also a common problem in SNP data (Vitti et al., 2013). This study utilized whole genome sequence information from thousands of animals which should, to some extent, mitigate these issues. However, our study did not account for variation in the rate of recombination which may mimic the characteristics of selection signals (Haasl and Payseur, 2016). We also did not focus on other types of structural variants under selection such as copy number variants and tandem repeats which can play important biological roles. Moreover, the cutoff values used to initially filter the raw selection sweep signals across methods is largely arbitrary. For example, studies analyzing genotype data tend to adopt more liberal cutoffs of top 1% or 5% while those based on sequence data typically use a more conservative cutoff value such as the top 0.1% or 0.01%. In this study we used 0.02 as a threshold of significance for individual signals while for the DCMS p-values we adopted the commonly used Bonferroni cutoff of 0.05. Candidate gene search was only performed based on the DCMS p-values. For discovery of important QTLs or therapeutic targets, these thresholds may have downstream implications. 5.6. Conclusions To date, this is the largest signatures of selection study in Angus and Hanwoo beef cattle, both in terms of the density of SNPs and the number of animals per breed. We detected more selected genomic regions in Angus than in Hanwoo and the total length of genomic regions with evidence of selection was also higher in Angus. Moreover, we observed that the signatures of selection in Hanwoo and Angus do not overlap, markedly reflecting large differences in their selection history 113 and breed characteristics. More specifically, in Angus, we identified genes associated with growth, body size, feed intake, reproductive development and immunity, while in Hanwoo important genes associated with immunity, fat deposition, cholesterol metabolism, neuronal development and meat quality were identified. Future studies may help independently validate key functional genes regulating traits associated with these breeds. 114 REFERENCES Abo-Ismail, M. K., Voort, G. vander, Squires, J. J., Swanson, K. C., Mandell, I. B., Liao, X., et al. (2010). Single nucleotide polymorphisms for feed efficiency and performance in crossbred beef cattle. J Anim Sci 88, 3749–3758. doi: 10.2527/jas.2010-2907. Albertí, P., Panea, B., Sañudo, C., Olleta, J. L., Ripoll, G., Ertbjerg, P., et al. (2008). Live weight, body size and carcass characteristics of young bulls of fifteen European breeds. Livest Sci 114, 19–30. doi: 10.1016/j.livsci.2007.04.010. Bertolini, F., Servin, B., Talenti, A., Rochat, E., Kim, E. S., Oget, C., et al. (2018). Signatures of selection and environmental adaptation across the goat genome post-domestication 06 Biological Sciences 0604 Genetics. Genetics Selection Evolution 50, 1–24. doi: 10.1186/s12711- 018-0421-y. Buxton, P., Edwards, C., Archer, C. W., and Francis-West, P. (2001). Growth / Differentiation Factor-5 ( GDF-5 ) and Skeletal Development. J Bone Joint Surg 83-A. Chan, E. K. F., Nagaraj, S. H., and Reverter, A. (2010). The evolution of tropical adaptation: Comparing taurine and zebu cattle. Anim Genet 41, 467–477. doi: 10.1111/j.1365- 2052.2010.02053.x. Cho, S. H., Kim, J., Park, B. Y., Seong, P. N., Kang, G. H., Kim, J. H., et al. (2010). Assessment of meat quality properties and development of a palatability prediction model for Korean Hanwoo steer beef. Meat Sci 86, 236–242. doi: 10.1016/j.meatsci.2010.05.011. Fink, T., Tiplady, K., Lopdell, T., Johnson, T., Snell, R. G., Spelman, R. J., et al. (2017). Functional confirmation of PLAG1 as the candidate causative gene underlying major pleiotropic effects on body weight and milk characteristics. Sci Rep 7, 1–8. doi: 10.1038/srep44793. Ghoreishifar, S. M., Eriksson, S., Johansson, A. M., Khansefid, M., Moghaddaszadeh-Ahrabi, S., Parna, N., et al. (2020). Signatures of selection reveal candidate genes involved in economic traits and cold acclimation in five Swedish cattle breeds. Genetics Selection Evolution 52, 1–15. doi: 10.1186/s12711-020-00571-5. Grossman, S. R., Shlyakhter, I., Karlsson, E. K., Byrne, E. H., Morales, S., Frieden, G., et al. (2010). A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection. Science (1979) 327, 883–886. doi: 10.1126/science.1183863. Gutiérrez-Gil, B., Arranz, J. J., and Wiener, P. (2015). An interpretive review of selective sweep studies in Bos taurus cattle populations: Identification of unique and shared selection signals across breeds. Front Genet 6. doi: 10.3389/fgene.2015.00167. Haasl, R. J., and Payseur, B. A. (2016). Fifteen years of genomewide scans for selection: Trends, lessons and unaddressed genetic sources of complication. Mol Ecol 25, 5–23. doi: 10.1111/mec.13339. 115 Hoggard, N., Hunter, L., Duncan, J. S., and Rayner, D. v. (2004). Regulation of adipose tissue leptin secretion by α-melanocyte-stimulating hormone and agouti-related protein: Further evidence of an interaction between leptin and the melanocortin signalling system. J Mol Endocrinol 32, 145– 153. doi: 10.1677/jme.0.0320145. Igoshin, A. v., Yurchenko, A. A., Belonogova, N. M., Petrovsky, D. v., Aitnazarov, R. B., Soloshenko, V. A., et al. (2019). Genome-wide association study and scan for signatures of selection point to candidate genes for body temperature maintenance under the cold stress in Siberian cattle populations. BMC Genet 20. doi: 10.1186/s12863-019-0725-0. Jun, D. J., Na, K. Y., Kim, W., Kwak, D., Kwon, E. J., Yoon, J. H., et al. (2010). Melanocortins induce interleukin 6 gene expression and secretion through melanocortin receptors 2 and 5 in 3T3-L1 adipocytes. J Mol Endocrinol 44, 225–236. doi: 10.1677/JME-09-0161. Kováčik, A., Bulla, J., Trakovická, A., Žitný, J., and Rafayová, A. (2012). the Effect of the Porcine Melanocortin-5 Receptor (Mc5R) Gene Associated With Feed Intake, Carcass and Physico- Chemical Characteristics. Journal of Microbiology 1, 498–506. Lawrence, M., Huber, W., Pagès, H., Aboyoun, P., Carlson, M., Gentleman, R., et al. (2013). Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9, 1–10. doi: 10.1371/journal.pcbi.1003118. Lee, S.-H., Park, B.-H., Sharma, A., Dang, C.-G., Lee, S.-S., Choi, T.-J., et al. (2014). Hanwoo cattle: origin, domestication, breeding strategies and genomic selection. J Anim Sci Technol 56, 2. doi: 10.1186/2055-0391-56-2. Lewis, R. S., Kolesnik, T. B., Kuang, Z., D’Cruz, A. A., Blewitt, M. E., Masters, S. L., et al. (2011). TLR Regulation of SPSB1 Controls Inducible Nitric Oxide Synthase Induction. The Journal of Immunology 187, 3798–3805. doi: 10.4049/jimmunol.1002993. Ma, Y., Ding, X., Qanbari, S., Weigend, S., Zhang, Q., and Simianer, H. (2015). Properties of different selection signature statistics and a new strategy for combining them. Heredity (Edinb) 115, 426–436. doi: 10.1038/hdy.2015.42. Maclean, C. A., Chue Hong, N. P., and Prendergast, J. G. D. (2015). Hapbin: An efficient program for performing haplotype-based scans for positive selection in large genomic datasets. Mol Biol Evol 32, 3027–3029. doi: 10.1093/molbev/msv172. Makvandi-Nejad, S., Hoffman, G. E., Allen, J. J., Chu, E., Gu, E., Chandler, A. M., et al. (2012). Four loci explain 83% of size variation in the horse. PLoS One 7, 1–6. doi: 10.1371/journal.pone.0039929. Nawaz, M. Y., Bernardes, P. A., Savegnago, R. P., Lim, D., Lee, S. H., and Gondro, C. (2022). Evaluation of Whole-Genome Sequence Imputation Strategies in Korean Hanwoo Cattle. Animals 12. doi: 10.3390/ani12172265. 116 Nishimura, S., Watanabe, T., Mizoshita, K., Tatsuda, K., Fujita, T., Watanabe, N., et al. (2012). Genome-wide association study identified three major QTL for carcass weight including the PLAG1-CHCHD7 QTN for stature in Japanese Black cattle. BMC Genet 13. doi: 10.1186/1471-2156- 13-40. Pizzagalli, M. D., Bensimon, A., and Superti-Furga, G. (2021). A guide to plasma membrane solute carrier proteins. FEBS Journal 288, 2784–2835. doi: 10.1111/febs.15531. Pollinger, J. P., Bustamante, C. D., Fledel-Alon, A., Schmutz, S., Gray, M. M., and Wayne, R. K. (2005). Selective sweep mapping of genes with large phenotypic effects. Genome Res 15, 1809– 1819. doi: 10.1101/gr.4374505. Porto Neto, L. R., Bunch, R. J., Harrison, B. E., and Barendse, W. (2012). Variation in the XKR4 gene was significantly associated with subcutaneous rump fat thickness in indicine and composite cattle. Anim Genet 43, 785–789. doi: 10.1111/j.1365-2052.2012.02330.x. Ramey, H. R., Decker, J. E., McKay, S. D., Rolf, M. M., Schnabel, R. D., and Taylor, J. F. (2013). Detection of selective sweeps in cattle using genome-wide SNP data. BMC Genomics 14. doi: 10.1186/1471-2164-14-382. Randhawa, I. A. S., Khatkar, M. S., Thomson, P. C., and Raadsma, H. W. (2014). Composite selection signals can localize the trait specific genomic regions in multi-breed populations of cattle and sheep. BMC Genet 15, 1–19. doi: 10.1186/1471-2156-15-34. Randhawa, I. A. S., Khatkar, M. S., Thomson, P. C., and Raadsma, H. W. (2016). A meta-assembly of selection signatures in cattle. PLoS One 11, 1–30. doi: 10.1371/journal.pone.0153013. Sabeti, P. C., Reich, D. E., Higgins, J. M., Levine, H. Z. P., Richter, D. J., Schaffner, S. F., et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837. doi: 10.1038/nature01140. Salinas, F., de Boer, C. G., Abarca, V., García, V., Cuevas, M., Araos, S., et al. (2016). Natural variation in non-coding regions underlying phenotypic diversity in budding yeast. Sci Rep 6, 1–13. doi: 10.1038/srep21849. Sutter, N. B., Bustamante, C. D., Chase, K., Gray, M. M., Zhao, K., Zhu, L., et al. (2007). A single IGF1 allele is a major determinant of small size in dogs. Science (1979) 316, 112–115. doi: 10.1126/science.1137045. Switonski, M., Mankowska, M., and Salamon, S. (2013). Family of melanocortin receptor (MCR) genes in mammals-mutations, polymorphisms and phenotypic effects. J Appl Genet 54, 461–472. doi: 10.1007/s13353-013-0163-z. Takasuga, A. (2016). PLAG1 and NCAPG-LCORL in livestock. Animal Science Journal 87, 159–167. doi: 10.1111/asj.12417. 117 Tangye, S. G., Bier, J., Lau, A., Nguyen, T., Uzel, G., and Deenick, E. K. (2019). Immune Dysregulation and Disease Pathogenesis due to Activating Mutations in PIK3CD—the Goldilocks’ Effect. J Clin Immunol 39, 148–158. doi: 10.1007/s10875-019-00612-9. Terakado, A. P. N., Costa, R. B., de Camargo, G. M. F., Irano, N., Bresolin, T., Takada, L., et al. (2018). Genome-wide association study for growth traits in Nelore cattle. Animal 12, 1358–1362. doi: 10.1017/S1751731117003068. Utsunomiya, Y. T., do Carmo, A. S., Carvalheiro, R., Neves, H. H. R., Matos, M. C., Zavarez, L. B., et al. (2013a). Genome-wide association study for birth weight in Nellore cattle points to previously described orthologous genes affecting human and bovine height. BMC Genet 14. doi: 10.1186/1471-2156-14-52. Utsunomiya, Y. T., Pérez O’Brien, A. M., Sonstegard, T. S., van Tassell, C. P., do Carmo, A. S., Mészáros, G., et al. (2013b). Detecting Loci under Recent Positive Selection in Dairy and Beef Cattle by Combining Different Genome-Wide Scan Methods. PLoS One 8, 1–11. doi: 10.1371/journal.pone.0064280. van Laere, A. S., Nguyen, M., Braunschweig, M., Nezer, C., Collette, C., Moreau, L., et al. (2003). A regulatory mutation in IGF2 causes a major QTL effect on muscle growth in the pig. Nature 425, 832–836. doi: 10.1038/nature02064. Vernot, B., Stergachis, A. B., Maurano, M. T., Vierstra, J., Neph, S., Thurman, R. E., et al. (2012). Personal and population genomics of human regulatory variation. Genome Res 22, 1689–1697. doi: 10.1101/gr.134890.111. Vitti, J. J., Grossman, S. R., and Sabeti, P. C. (2013). Detecting Natural Selection in Genomic Data. Annu Rev Genet. doi: 10.1146/annurev-genet-111212-133526. Voight, B. F., Kudaravalli, S., Wen, X., and Pritchard, J. K. (2006). A map of recent positive selection in the human genome. PLoS Biol 4, 0446–0458. doi: 10.1371/journal.pbio.0040072. Weir, B. S., and Cockerham, C. C. (1984). ESTIMATING F-STATISTICS FOR THE ANALYSIS OF POPULATION STRUCTURE. Evolution (N Y) 38, 1358–1370. 118 Chapter 6 SUMMARY AND FUTURE DIRECTIONS Genetic improvement of cattle involves frequent use of whole genome sequence data which is increasingly becoming popular as technological advancements make sequencing more and more accessible. Routinely used techniques in genomic analyses include sequence alignment, variant calling, genotype imputation, genomic prediction, QTL discovery by GWAS, study of genetic architecture by evaluating genomic variation and recombination across the genome, and studying evolutionary history as in population genetics. Chapter 2 to 5 in this dissertation describe various strategies for effective utilization of whole genome sequence data in beef cattle. I have provided a theoretic demonstration of the true potential of whole genome sequence data for genomic prediction in distant populations. I evaluated strategies for accurate genotype imputation in rare breeds with minimal reference data availability. I also explored various strategies for efficient use of WGS in genomic prediction and association analyses. Furthermore, I discovered conserved genomic regions in Hanwoo and Angus cattle breeds and identified underlying genomic regions that may play important regulatory functions in meat production. Results from these analyses provide important insights and suggestions for effective utilization of WGS and contribute in better understanding of the regulation of meat production in Hanwoo and Angus cattle. Specifically in Chapter 2, I conducted a simulation study to demonstrate that if target populations drift away from the reference population due to successive breeding and recombination, genomic prediction can be improved by using WGS data. Three different simulation scenarios were created for within breed, crossbred and across-breed genomic prediction of Angus, 119 Hereford and Brahman cattle with three sets of SNP panels consisting of only SNP, only QTL, or SNP and QTL. We found that when QTL were a part of the SNP panel, the decline in prediction accuracy over generations was lower and the accumulation of data over generations led to higher accuracy gains compared to using only non-causal SNPs, especially for crossbred and across breed prediction. The QTL only panel represented an ideal situation where we could identify and utilize only those SNPs that were directly involved in expressing a phenotype. Such exact SNP panels are still elusive and would require studies based on huge sample sizes at the WGS level. Nonetheless, some studies have already shown the benefits of selecting highly predictive features to increase prediction in distant populations. In summary, a wider adoption of sequence data for genomic prediction still seems a valid proposition to increase the accuracy of genomic prediction in cattle, especially for distantly related animals. In Chapter 3, we evaluated various WGS genotype imputation strategies for the Korean Hanwoo cattle. We observed that a large reference panel consisting of many cattle breeds did not improve the imputation accuracy when compared to a proportionally small purebred Hanwoo reference. The large multi-breed reference panel added very little to the imputation accuracy beyond what was obtained by simply using the closely related samples from the breed itself. This was because the multibreed reference did not contain animals sufficiently related to the Hanwoo to improve the accuracies and, although not detrimental in effect, only added to the computational burden of imputation. Despite the large multibreed reference, when the Hanwoo were removed from the reference, the imputation accuracies were low. These results suggest additional sequencing efforts are needed for underrepresented breeds, particularly those less genetically related to the common European breeds. Additionally, we evaluated the value of high-density 700K genotypes 120 as an intermediary step in the imputation process. The imputation accuracy differences were negligible between a single-step imputation strategy from 50K directly to sequence and a two- step imputation approach (50K-700K-sequence) in a closely related target population. We also identified poorly imputed genomic regions in the Hanwoo genome and demonstrated that imputation accuracies were particularly lower at the chromosomal ends. Moreover, we observed that imputed sequence data can be used as a reference panel for future imputation projects. This suggests that, for example, HD genotypes could be imputed to sequence level and then, included along with the original sequence reference panel for use in one step imputation approach. This could reduce the computational burden of the two-step approach and potentially improve imputation accuracies when the number of informative animals in the WGS reference panel is small. This may also come in handy when there are limitations on data sharing between research organizations. In Chapter 4, we performed genome-wide association studies (GWAS) based on imputed sequence data for carcass traits in Hanwoo cattle in order to better understand the genetic regions that influence these traits. We compared our results with those obtained using genotyping platforms, and investigated whether selecting SNPs based on association analysis in a discovery population improves the accuracy of genomic predictions for carcass traits in Hanwoo. I found that that association analysis based on sequence data resulted in stronger signals of association compared to medium and high-density genotypes, and identified novel loci that regulate carcass traits. However, we observed no gain in prediction accuracy using imputed sequence data compared to medium and high-density genotypes for genomic prediction in Korean beef cattle using GBLUP methodology. Adding SNPs pre-selected based on association 121 analysis to medium and high-density genotypes may result in a slight increase in prediction accuracy for traits regulated by large-effect loci. Finally, we compared the proportion of phenotypic variance explained by additive and epistatic genetic components and their effect on the predictive ability of carcass traits in Hanwoo cattle. I demonstrated that epistatic effects explained a small proportion of genetic variance but did not increase prediction accuracy for additive genetic value or total genetic value. Future research involving genomic prediction in distantly related populations (across breeds or across generations) of animals using sequence data should be explored to evaluate the true potential of sequence data for genomic prediction. In chapter 5, I identified genomic regions in Angus and Hanwoo cattle that were under selection pressure using imputed whole genome information. Four methods were used to identify individual selection signals based on allele frequency and haplotype structures surrounding SNP variants. These signals were combined to identify composite selection signatures. Finally, various positional candidate genes related to beef production and quality were identified. This was the largest study to date of Angus and Hanwoo beef cattle in terms of the number of SNPs and the number of animals per breed. The results showed that there were more selected genomic regions in Angus compared to Hanwoo, and the total length of these regions was also higher in Angus. The selection signatures in Hanwoo and Angus did not overlap, which may reflect differences in the selection history and characteristics of the two breeds. For example, in Angus, genes associated with growth, body size, feed intake, reproductive development, and immunity were identified, while in Hanwoo, genes associated with immunity, fat deposition, cholesterol metabolism, neuronal development, and meat quality were identified. These results were consistent with differences in breed origin, environmental habitats, divergent selection histories 122 and breeding programs and consequently differences in physical features of the breeds. Future research may help to validate the functional genes that regulate these traits. The field of animal breeding and genetics has seen the most transformative two decades of technological advancements. Development of genotyping platforms and adoption of genomic selection has been successful. Future of beef cattle genomics will frequently involve whole genome sequence data just like other fields of biological research. With better datasets, tools and resources at hand, there is a renewed interest in identification of genes by both forward and reverse genetics approaches. This will not only help in improved predictive models but also aid in developing therapeutics using genome editing technologies. However, our ability to make the best use of whole genome data in genomic analyses will depend heavily on the genetic diversity of samples and sample sizes. Detection of causal variants or SNP filtering remains a challenge in cattle due to high linkage in the genome but promising results in biomedical research indicate that worldwide collaborative efforts in the form of consortia may be able to provide the next breakthrough technology. 123 APPENDIX A B C Figure S1: Manhattan plots showing P values obtained by GWAS of EMA in Hanwoo using A) 50K, (B) 700K and (C) imputed sequence genotypes. 124 A B C Figure S2: Manhattan plots showing P values obtained by GWAS of BFT in Hanwoo using A) 50K, (B) 700K and (C) imputed sequence genotypes. 125 A B C Figure S3: Manhattan plots showing P values obtained by GWAS of MS in Hanwoo using A) 50K, (B) 700K and (C) imputed sequence genotypes. 126 Figure S4: Plot of -log10 P values obtained from ROH analysis in Hanwoo. Figure S5: Plot of -log10 P values obtained from iHS analysis in Hanwoo. 127 Figure S6: Plot of -log10 P values obtained from ROH analysis in Angus. Figure S7: Plot of -log10 P values obtained from iHS analysis in Angus. 128 Figure S8: Plot of -log10 P values obtained from FST analysis comparing Angus and Hanwoo. Figure S9: Plot of -log10 P values obtained from XPEHH analysis comparing Angus vs Hanwoo. 129