IDENTIFICATION OF LTR RETROTRANSPOSONS, EVALUATION OF GENOME ASSEMBLY, AND MODELING RICE DOMESTICATION By Shujun Ou A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Plant Breeding, Genetics and Biotechnology - Horticulture - Doctor of Philosophy Ecology, Evolutionary Biology, and Behavior - Dual Major 2018 ABSTRACT IDENTIFICATION OF LTR RETROTRANSPOSONS, EVALUATION OF GENOME ASSEMBLY, AND MODELING RICE DOMESTICATION By Shujun Ou The majority of fundamental theories in genetics and evolution were proposed prior to the discovery of DNA as the genetic material in 1952. Those include DarwinÕ s theory of evolution (1859), Mendelian genetics (1865), Wright and FisherÕs population genetics (1918), and McClintockÕs transposition of genetic elements (1951). Nevertheless, the underlining mechanisms of those theories were not fully elucidated till th e appearance of DNA sequencing technology. At present, technological advances have minimized the cost for sequencing genomes. The real bottleneck to establish genomic resources is the annotation of genomic sequences. Long Terminal Repeat (LTR) retrotranspo son is a major type of transposable genetic elements and dominating plant genomes. We developed a new method called LTR_retriever for accurate annotation of LTR retrotransposons. Further, we studied genome dynamics, genome size variation, and polyploidy or igin using LTR retrotransposons. The presence of LTR retrotransposons challenges current sequencing and assembly techniques due to their size and repetitiveness. We proposed an unbiased metric called LTR Assembly Index (LAI) which utilizes the assembled LT R retrotransposons to evaluate continuity of genome assembly. We revealed the massive gain of continuity for assembly sequenced based on long -read techniques over short -read methods, and further proposed a standardized classification system for genome qual ity based on LAI. With high -quality genomes, we can extend our knowledge about microevolution events using a population of genomes. The domestication history of rice is still unresolved due to its complicated demographic history. We collected, re -mapped, a nd re - analyzed 3,485 cultivated and wild rice resequencing accessions. With data imputation, a total of 17.7 million high -quality single -nucleotide polymorphisms (SNPs) were identified. Our dataset is highly accurate as verified by cross -platform Affymetri x Microarray data, with a pairwise concordance rate of 99%. Combining phylogeny, PCA, and ADMIXTURE analyses, we present profound diversification among rice ecotypes. iv ACKNOWLEDGMENTS When I was little, my father told me that erudite scholar s know everyt hing, and they can also comprehend novels. My parents did not receive decent education , but their admiration for knowledge planted a seed on my mind . Now I have pursued advanced education, I would like to thank my parents, as well as my sister, for their unlimited support and love. I am very lucky to have two great mentors. I would like to exp ress my thankfulness to Dr. Chengcai Chu, who encouraged and supported my study prior to my graduate education. He is also a great collaborator and good friend. I am grateful to have Dr. Ning Jiang as my Ph.D. advisor, who ha s provided a friendly, flexible, a nd supportive study environment for me to explore the possibilities in graduate school. I am also thankful for her rigorous altitude towards scientific research, which would not allow any compromise that could have dampen ed the quality of my studies. Both of them have show n me enjoyable and sustainable ways to take research as a career. I would also like to thank my graduate committee Drs. David Lowry, C. Robin Buell, and Patrick Edger, who provided helpful guidance with their expertise and experience. I a lso want to thank Drs. Robert VanBuren, Mitch McGrath, Marivi Colle , Jinfeng Chen , and Zixiang Wen for their professional and productive collaborations. The research findings presented in Chapter 3 has been greatly encouraged and supported by Pat, Bob, and Jinfeng, and criti ques from Ning have further advanced it into the next level. I would like to acknowledge my major graduate program in Plant Breeding, Genetics, and Biotechnology (PBGB), my minor graduate program in Ecology, Evolutionary Biology, and Behavior (EEBB), and my home department Horticulture for providing diverse, specialized, v professional, friendly, and supportive learning and working environments. I would also like to acknowledge two supercomputer centers, the Institute for Cyber -Enabled Re search (iCER) at MSU and the Extreme Science and Engineering Discovery Environment (XSEDE), for generously providing high -performance computation resources to support my studies. I also like to thank members in the JointLab group in the department of Horticulture , the Computational Plant B iology (ComPLB) group, and the EEBB colloquium for providing training and feedbacks to my research presentations. I want to thank my former and current labmates, Dr. Dongyan Zhao, Dongmei Yin, and Stefan Cerbin for their help with my studies and living in the United States. I also want to thank my former and current officemates, Dr. Lusheng Zhang, Fransiska Renita Anon Basundari , Kristen Andersen, Christopher Gottschalk , Robert Loepp , Laura Hillmann , Songwen Zhang, Joshua Vander Weide , and Kathleen Rhoades for creating an d maintaining a lovely office environment. I also want to express my appreciations for friends in the Horticultural Organization of Graduate Students (HOGS), the Association for Crop and Soil (ACRS) , and the EEBB Graduate Group (EGG) for the joyful memor ies we shared. Finally, I want to thank my girlfriend Xuejia Lai for her company and support. vi TABLE OF CONTENTS LIST OF TABLES .................................................................................................................. viii LIST OF FIGURES ................................................................................................................... ix CHAP TER 1 ............................................................................................................................... 1 Introduction of Rice Domestication Models, Current Theories, Gaps and Debates ...................... 1 Rice domestication was a long -lasting procedure ..................................................................... 2 Origins of key domestication traits and paradoxical observations ............................................. 3 Hypotheses of rice domestication ............................................................................................ 4 Genome -wide studies of rice domestication ............................................................................. 6 Whole -genome resequencing dissects the domestication of rice ............................................... 8 The role of introgression ........................................................................................................ 11 Working models for rice domestication ................................................................................. 12 APPENDIX .............................................................................................................................. 14 REFERENCES ......................................................................................................................... 17 CHAPTER 2 ............................................................................................................................. 21 LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons ........................................................................................................... 21 Abstract ................................................................................................................................. 22 CHAPTER 3 ............................................................................................................................. 23 Assessing Genome Assembly Quality Using the LTR Assembly Index (LAI) ........................... 23 Abstract ................................................................................................................................. 24 CHAPTER 4 ............................................................................................................................. 25 A Fine Map of Rice Domestication Revealed by 3,485 Cultivated and Wild Accessions ........... 25 Introduction ........................................................................................................................... 26 Results and Discussions ......................................................................................................... 27 Geno me mapping, imputation, and filtering ....................................................................... 27 Population structure and phylogeny ................................................................................... 28 Materials and Methods .......................................................................................................... 29 Sampling and DNA sequences ........................................................................................... 29 Read alignment and variant identification .......................................................................... 30 Imputation of missing data ................................................................................................. 31 Estimation of variant discordance ...................................................................................... 32 Population structure ........................................................................................................... 35 Phylogenetic reconstruction ............................................................................................... 37 Acknowledgments ................................................................................................................. 37 APPENDIX .............................................................................................................................. 38 REFERENCES ......................................................................................................................... 48 vii CHAPTER 5 ............................................................................................................................. 52 Concluding Remarks ................................................................................................................. 52 Annotation of LTR sequences is still a difficult task .............................................................. 53 LTR sequences challenge the assembly of genomes ............................................................... 54 The functional role of LTR sequences is largely unknown ..................................................... 55 The ambiguous role of SVs for domestication ....................................................................... 56 REFERENCES ......................................................................................................................... 58 viii LIST OF TABLES Table 4.1 SNP discordance rate between this study and other studies. ....................................... 39 Table 4.2 Grouping of 3,485 rice accessions. ............................................................................ 40 !ix LIST OF FIGURES Figure 1.1 Domestication models adapted from Sang and Ge (2007). ........................................ 15 Figure 1.2 Schematics of the single - (A) vs. double - (B) founder models adapted from Molina et al. (2011). ................................................................................................................................. 16 Figure 4.1 Truth sensitivity of tranches in Variant Quality Score Recalibration (VQSR) .... ....... 41 Figure 4.2 SNP filtering based on Variant Quality Score Recalibration (VQSR). ....................... 42 Figure 4.3 Distribution of imputation R 2 in minor allele frequency (MAF) bins (0.05 per bin). . 43 Figure 4.4 K -means clustering of 3,485 cultivated and wild rice accessions with (A) whole -genome 4D SNPs and (B) whole -genome random SNPs. .......................................................... 44 Figure 4.5 Determination of the number of populations. ............................................................ 45 Figure 4.6 The population structure of 3,485 rice accessions revealed by various K. ................. 46 Figure 4.7 Population structure of 3,485 rice accessions. ........................................................... 47 !1 CHAPTER 1 Introduction of Rice Domestication Models, Current Theories, Gaps and Debates !2 Rice domestication was a long -lasting procedure Asian rice ( Oryza sativa L.) is one of the most important staple crops, which has two main ecotypes, indica and japonica . The origin of rice is still in debate due to its complex domestication and demographic history. In the late 19 th century, India was considered to be the origi n of rice based on its enriched diversity in India. In 1928, Kato and colleagues proposed that rice was domesticated from India and Japan, and named the subspecies indica and japonica , respectively (Kato and Hara 1928) . Meanwhile, Ding (1929) reported a wild rice lineage discovered in Guangdong, China (Ding 1929) , unveiling the possibility that cultivated rice might be originated from China. In recent years, archaeological excavations f rom the Yangzi River valley in China show that at the late Pleistocene (12,000 BP Ð 11,000 BP), people had started collecting seeds from wild rice (Li et al. 2016). The Shangshan site (11,000 BP Ð 9,000 BP ) in the Yangzi basin unearthed charred rice husks and leaves in pottery shreds, as well as sickle -shaped stoneware, indicating that ancient human might have started to harvest rice as a crop (Li et al. 2016). One of the most apparent changes from wild rice to cultivated rice was the reduction of seed shattering. Non -shattering rice was featured by the existence of Ònon -brittle rachisÓ that occurred at the abscission layer of spikelet (Li et al. 2016). Analysis of abscission layers of ancient rice grains from the Kuahuqiao site (8,200 BP Ð 7,200 BP) and the Hemudu site (7,000 BP Ð 6,500 BP) in the Yangzi basin indicated the proportion of non -shattering rice increased from 42% to 51% in 800 years (Li et al. 2016). These data indicate that selection of the non -shattering trait could be a long -lasting procedure. Although japonica rice generally has more typical abscission layers, it is difficult to determine the rice subspecies bas ed on this. The change of grain size is also one of the key domestication syndromes. Indica has longer grains while !3 japonica generally has short grains. Thus, another attempt to differentiate indica , japonica , initial domesticates, and wild rice was to ana lyze the grain size (Li et al. 2016) . Liu et al. (2007) compared the length, width, length/width ratio, and length*width of rice grain remnants unearthed from the aforementioned historical sites. However, large variations among these measurements were found, revealing ambiguous or mixed rice types, indicating tha t rice domestication was a non -linear procedure (Li et al. 2016). Origins of key domestication traits and paradoxical observations Plant domestication involves in the transition of wild species into cultivars that have higher fitness under managed environment (Gepts 2004) . Key domestication traits denote the initial and critical change from a wild species to a domesticated species. In rice, such transitions were known as the reduction of shattering for higher harvest rate (Li et al. 2006), reduction of dormancy for more synchronized g ermination (Sugimoto et al. 2010) , change from prostrate growth to erectness growth for more compact cultivation (Jin et al. 2008), and change from loose panicle to compact panicle for less but larger grains (Bai et al. 2016). Interestingly, genes respo nsible for these key domestication syndromes, namely sh4 , Sdr4 -n, PROSTRATE GROWTH 1 (PROG1 ), and FRIZZLE PANICLE (FZP ), respectively, have identical genotypes in both indica and japonica (Li et al. 2006; Jin et al. 2008; Sugimoto et al. 2010; Bai et al. 2016) . Parsimoniously, these traits should be derived from a common ancestor, indicating a single domestication origin. However, many traits responsible for rice improvements were carrying different ge notypes between indica and japonica , indicating different origins. For example, the Big Grain 2 (BG2 ) gene encodes a cytochrome P450 which could influence grain size and yield (Xu et al. 2014) . The coding region of BG2 holds eight polymorphic sites between Nipponbare, a japonica variety, and 93 -11, an indica variety, which were seemingly inherited from different !4 wild rice ecotypes (Xu et al. 2014). For the NRT1.1B gene responsible for nitrate -use efficiency (NUE), a functional polymorphism (FNP) was found to be responsible for the NUE divergen ce between indica and japonica , which was also inherited from different wild rice ecotypes (Hu et al. 2015). Phylogeographic analyses of three genes in a large collection of cultivated and wild rice revealed that indica might be originated from eastern India, while japonica might be originated from southern China (Londo et al. 2006) . The parsimonious explanation for the abundance of subspecies -specific polymorphisms is consistent with a double - or multiple -origin hypothesis. Hypotheses of rice domestication To reconcile the paradoxical origins between key domestication traits and improvement traits, two evolutionary models were proposed, which are the snow -balling model and the combinati on model ( Figure 1.1 ) (Sang and Ge 2007) . In the snow -balling model, a lineage of wild rice wa s initially domesticated, then key domestication genes were transferred to other wild rice by hybridization ( Figure 1.1 A). Further domestication of the hybridized wild rice eventually generated the current cultivars that have the same set of key domesticat ion genes but apparently different genetic backgrounds ( Figure 1.1 A) (Sang and Ge 2007) . This model agrees with the fact that japonica has undergone a severe bottleneck which denotes the initial domestication, while the subsequent hybridization with wild rice could partially recover the effective population size ( Ne) (Sang and Ge 2007; Vaughan 2008) . The hybrid wild rice is proposed to be the progenitor of indica which had a weaker bottleneck (Sang and Ge 2007; Vaughan 2008) . In the combination model, domestication of rice started in parallel, resulting in two or more initial domesticates which carried different key domestication genes ( Figure 1.1 B). Further hybridization and pyramiding resulted in fixation of a similar set of key genes in different !5 lineages ( Figure 1.1 B). This model agrees with modern breeding principles such as gene pyramiding and also supported by phylogeographic studie s which revealed multiple gene pools for cultivated rice (Londo et al. 2006). Vaughan et al. (2008) further extended the snow -balling model to explain the existence of multiple rice groups (Vaughan 2008) . In this model, rice was proposed to be initially domesticated at 30¡N where the Yangzi River valley is located and was the northern limit of common wild rice ( O. rufipogon ) (Vaughan 2008) . The initial domesticate was transported by human activities and hybridized with local wild rice lineages (Vaughan 2008) . The resulting hybrids contained most of the key domestication genes and complex introgressions due to human management (Vaughan 2008) . Most admixture lines were extinct but some were passed on and became the contemporary divergent domesticated rice (Vaughan 2008) . The evolution model for the semi -dwarf 1 (sd1 ) gene proposed by Asano et al. (2011) was the extension of the combination model ( Figure 1.1 B). The sd1 gene encodes a gibberellin biosynt hetic enzyme and has two FNPs between indica and japonica (Asano et al. 2011). The japonica allele has decreased nucleotide diversity and is considered to be domesticated, while the indica allele remains the wild type (Asano et al. 2011) . In this model, indica and japonica were domesticated from different wild rice lineages and thus carrying different gene backgrounds. During further domestication of japonica prototypes, the sd1 mutant occurred and was quickly fixed in later generations (Asano et al. 2011). Similarly, so me rice genes like Wxb, qsw5 , and qsh1 were also fixed in japonica , but genes like rc, wx, gs3 , and badh2.1 were not only fixed in japonica but also introduced into indica , contributing to a range of shared domesticated gene alleles (Asano et al. 2011). !6 Genome -wide studies of rice domestication Studying the domestication history of rice through a few genes could be biased. Due to complex demography, such as introgression, genetic drift, and bottleneck, evolution history of a gene could be different from that of the species, known as incomplete lineage sorting. To preclude biases from using specific genes, it is preferred to utilize large -scale data and populations to dissect the domestication history of rice. One of the earliest attempts d ates back to 1985 from Second (Second 1985) , who studied isozyme polymorphism of 181 diverse rice and wild rice strains at 24 loci. Principal coordinate analysis indicated that isozymes of cultivated rice could be classified into two groups, indica and japonica (Second 1985) . Further, Second pointed out t hat most wild rice strains from Guangxi province (southwestern China) and Taiwan province (eastern China) were close to the japonica type, while strains from Guangdong province (southern China) were intermediate between japonica and indica types (Second 1985) . These observations seem to support the snow -balling model, with the intermediate -type wild rice hybridized from japonica prototypes and local indica -type wild rice. However, the relationship between japonica -type wild rice from Guangxi and Taiwan is not revealed. In 2005, Garris and colleagues genotyped 169 nuclear SSRs and two chloroplast loci from 234 diverse rice accessions and revealed the popula tion structure of rice (Garris et al. 2005) . The research presented five distinct groups with deep genetic structures, which are indica , aus, tropical japonica , temperate japonica , and aromatic . The first two groups are traditionally regarded as indica , while the latter three groups are japonica . The authors pointed out that such deep genetic structure might be caused by differentiated ancestral populations and the selfing behavior of rice (Garris et al. 2005). Gao and Innan (2008) genotyped 60 microsatellites for 92 !7 accessions of rice and wild rice for the measurement of genetic diversity (Gao and Innan 2008) . A number of low -diversity regions were found shared between indica and japonica , which could not be explained by the independent domestication hypothesis. Thus, a model wi th shared bottleneck was proposed, which was partly agreed with the snow -balling model. However, Gao and InnanÕs model didnÕt explain the much severe bottleneck observed in japonica or the much larger effective population size observed in indica . Caicedo et al. (2007) genotyped 111 randomly selected sequence tags for a diverse panel including 93 accessions of rice and wild rice and observed an excess of derived SNPs contributing to the U -shape site -frequency spectra (SFS) of cultivated rice (Caicedo et al. 2007) . Factors like neutrality, bottleneck, migration, and selective sweep were explored for the explanation of observed data. The combination of the bottl eneck and selection effects could better explain the observed SFS of indica and tropical japonica , while that of the O. rufipogon population could be well explained by the standard neutral model (Caicedo et al. 2007) . Similarly, Molina et al. (2011) genotyped 630 gene fragments from Chromosomes 8, 10, and 12 of 56 accessions containing wild rice, indica , and tropical japonica (Molina et al. 2011), and observed skewed SFS of cultivated rice. Further modeling using joint frequency spectrum favored the one founder hypothesis over double founders either with or without t he sweep effect (Molina et al. 2011). This research provides support to the snow -balling model ( Figure 1.1 A), howe ver, suffers from sampling bias such as the lack of temperate japonica , as well as other minor cultivated groups. The divergence time between indica and japonica genomes can be estimated using molecular clocks, such as long terminal repeat retrotransposons (LTR -RTs). Vitte et al. (2004) identified 110 LTR -RTs in indica and japonica , of which 28 are syntenic between the two !8 ecotypes that dates their differentiation ca. 0.9 MYA Ð ca. 2.1 MYA ( ! = 6.5 x 10 -9 per synonymous site per year) (Vitte et al. 2004). These times are likely overestimated due to th e usage of a lower mutation rate. In a newer study using the same method, the differentiation between indica and japonica is estimated ca. 0.6 MYA Ð ca. 0.25 MYA ( ! = 1.3 x 10 -8 per bp per year) (Roulin et al. 2010). These estimations seem to support the combination model ( Figure 1.1 B), which, however, should be revisited using whole -genome LTR -RTs as they are readily available (International Rice Genome Sequencing 2005; Zhang et al. 2016; Du et al. 2017; Nie et al. 2017). Whole -genome resequencing dissects the domestication of rice Benefit from the advance of sequencing techniques, genome -wide sequencing projects were carried out for large collections of rice accessions. Population genetic studies based on whole -genome markers had advanced our understanding of rice domestication. He et al. (2011) sequenced 66 rice accessions spanning indica , japonica , and O. rufipogon using a pooling strategy in each population for 30X coverages (He et al. 2011). Scans of genetic diversity ( "w) detected hundreds of low diversity regions (LDR) across the genome. Shared LDRs between indica and japonica showed very low genetic distances ( D statistics) and population differentiation ( FST), while the genome background showed the opposite trend (He et al. 2011). Further, coalescent simulations revealed that the overlapped LDRs follows the sequential domestication model, while the genome background follows the independent domestication model (He et al. 2011). These results reinforce the snow -balling model ( Figure 1.1 A), that the initially domesticated rice hybridized with differ ent ancestral wild rice ecotypes, which were then domesticated into diverse cultivated groups. He and colleagues also rediscovered several key domestication genes, i.e. sh4 and PROG1 , as well as 13 additional domestication gene !9 candidates (He et al. 2011). Xu et al. (2011) resequenced 40 varieties of cultivated rice, 15 accessions of perennial wild rice O. rufipogon , and 10 accessions of annual wild rice O. nivara for 1.5X -20X and identified 6.5 mill ion high -quality SNPs (Xu et al. 2011) . Identification of LDRs based on the SNP set has identified important biological features as well as regions with unknown functions (Xu et al. 2011) . In 2012, Huang and colleagues reported 8 million SNPs identified from the low -depth (~2X) resequencing of a large rice population including 446 accessions of wild rice ( O. rufipogon ) and 1,083 cultivated rice varieties (Huang et al. 2012). Phylogenetic reconst ruction based on the SNP set indicated that indica was derived from the ancestral Or -I wild rice ecotype, while japonica was derived from the ancestral Or -IIIa ecotype (Huang et al. 2012) . FST statistics between rice populations showed that the common wild rice might have been differentiated before the domestication of rice, probably due to geographic isolations (Huang et al. 2012) . A severe bottleneck during the domestication of japonica was obser ved, which might have contributed to the rapid fix of rare alleles that are responsible for phenotypic diversifications of rice subspecies (Huang et al. 2012) . Using nucleotide div ersity, a total of 55 selective sweeps were identified, which may share the most recent common ancestor with the Or -IIIa group located in the Pearl River basin of Guangxi (southwestern China) (Huang et al. 2012). This result agrees with the isozyme study cond ucted by Second (1985) (Second 1985) . It becomes quite clear that the snow -balling model is supported by most of the data ( Figure 1.1 A). To advance the model, Huang et al. (2012) conducted forward simulations with population parameters estimated from the SNP set. By comparing the simulated data to real data, only one model was found that could explain their observations (Huang et al. 2012) . In this model, jap onica prototype was initially domesticated from ancestral O. rufipogon IIIa, which was later hybridized with the !10 ancestral group of O. rufipogon I in the southeast and eastern Asia, forming the prototype indica (Huang et al. 2012). The japonica and indica prototypes were further domesticated into aromatic , tropical japonica , and temperate japonica , and aus and indica , respectively (Huang et al. 2012). Using the same SNP set from Huang et al. (2012) (Huang et al. 2012) , Civ⁄ ! et al. (2015) proposed another domestication scenario (Civan et al. 2015). First, Civ⁄ ! and colleagues identified LDRs by comparing nucleotide diversity of indica (520 accessions), japonica (489 accessions), and aus (30 accessions) to the whole wild rice population (446 accessions), respectively, recovering 31 LDR regions that are shared between cultivated rice ecotyp es (Civan et al. 2015). Principle component analysis (PCA) and neighbor -joining clustering of these overlapping LDRs indicated that the three cultivated rice populations might be originated from different ancestral wild rice populations (Civan et al. 2015). Phylogeographic analysis showed that japonica might be domesticated from ancestral wild rice populations in south China, indica might be domesticated in Indochina and the Brahmaputra valley, and aus might be domesticated in central India or Bangladesh (Civan et al. 2015). Contrasting results fr om Huang et al. (2012) and Civ⁄ ! et al. (2015) indicate the intriguing and complex domestication history of rice. Using the same resequencing data published by Huang et al. (2012) (Huang et al. 2012) and similar methodologies to Civ⁄ ! et al. (2015) (Civan et al. 2015) but different criteria, Choi et al. (2018) found supports to the snow -balling model based on neighbor -joining phylogenies of overlapping LDRs (Choi and Purugganan 2018) . Xie et al. (2015) resequenced a diversity panel containing 533 accessions of rice cultivars and combined with 950 accessions of rice cultivars from Huang et al. (2012) (Huang et al. 2012) for the identification of breeding signatures (Xie et al. 2015). Two putative heterotic groups were recognized from the indica subpopulation, which may have gone through independent breeding !11 efforts (Xie et al. 2015) . Cross -population likelihood ratio test (XP -CLR) among the two indica heterotic groups identified 200 regions spanning 7.8% of the rice genome that may be footprints of modern breeding efforts (Xie et al. 2015). Genes like sd1 which lead to the Ògreen revolutionÓ and Gn1a , LP, and OsBRI1 which involved in grain quality and yield were rediscovered as breeding footprints (Xie et al. 2015). Clearly, artificial selection and local adaptation also play an important role in shaping the divergence of cultivated rice. The role of introgression Modern rice breeding involves hybridization between cultivars and pyramiding of beneficial genes, resulting in mosaic genetic backgrounds in offspring. It is likely that the domestication of rice is associated with hybridization and introgression, since many cultivated rice ecotypes share the same habitat. Choi et al. (2017) utilized the genome sequences of three rice subspecies (japonica , indica , and aus) to estimate the divergence and introgressions of cultivated rice (Choi et al. 2017). Based on coalescence modeling, they estimated the japonica genome was split from O. rufipogon at ca. 13.1 Ð ca. 24.1 thousand -years ago (Ka), while the estimated split between indica and O. rufipogon was ca. 6.7 Ð ca. 17.7 Ka (Choi et al. 2017). Although the estimated split time of japonica is earlier than that of indica , these estimates are substantially overlapping (13.1 Ð 17.7 Ka), reveali ng the difficulty in distinguishing the domestication origin of rice. The estimated split between aus and O. nivara (a wild rice ecotype that is hypothesized to share the most recent common ancestor with indica and aus) is ca. 1.7 Ð ca. 9.1 Ka, indicating that aus is a more recently diversified rice subpopulation. Further, Choi and colleagues found evidence for significant gene flow from japonica to both indica (~17%) and aus (~15%) (Choi et al. 2017), supporting the snow -balling model. !12 Due to the selfing nature of cultivated rice, introgression among individual plants is relatively difficult. However, O. rufipogon has a higher rate of open pollination and also sympatric (share the same habitat) with cultivated rice, which is more likely to receive introgressions via pollens. Wang et al. (2017) identif ied heavy introgressions from cultivated rice to wild rice by reanalyzing the resequencing data of 446 wild rice accessions generated by Huang et al. (2012) (Huang et al. 2012) as well as 202 divers e cultivated accessions (Wang et al. 2017). Using supervised ancestry identification, Wang and colleagues found over 50% of wild rice accessions have more than 10% of their genome that are likely introduced from domesticated rice (Wang et al. 2017) . Further, many wild rice accessions prese nted extent origins (50 -100%) from cultivated rice, indicating recent introgressions or feralization from cultivated rice (Wang et al. 2017). This study shows that the contemporary wild rice may represent a hybrid swarm between domesticated rice and ancestral wild rice (Wang et al. 2017). Because of these introgressions, the actual cultivated rice ancestor may have gone extinct. As a result, analyses based on contemporary wild rice samples should be carefully interpreted. Working models for rice domestication In summary, rice domestication was likely a continuous and complicated process involving demographic changes, natural selection, artificial selection, hybridization, and migration. It is possible to detect the ÒfootprintsÓ of such activities from genetic variations but unfortunately no t all of them are recognizable, and some could be masked by others. Due to the complexity of the process, the exact path of rice domestication is still unclear despite tremendous efforts. Based on the current studies, the domestication of rice was likely f ollowing the snowballing model, but the possibility of the combination model is not excluded ( Figure 1.1 ). Such dilemma is likely created !13 by a mixture of evolutionary drivers other than positive selection, such as demography and introgression. In aware of other evolutionary drivers, we propose revised scenarios for both domestication models. For the single -founder model, the prototype of cultivated rice was domesticated from the ancestral population of O. rufipogon , which accumulate key domestic ation genes shared in cultivated rice ( Figure 1.2 A). The ancestral O. rufipogon is then diverged into different ecotypes (i.e., Or -I and Or -IIIa), likely due to genetic drift and geographic isolation ( Figure 1.2 A). Due to close distribution between wild ri ce and cultivated rice, different O. rufipogon ecotypes received introgressions from sympatric cultivated rice, resulting further divergence of these wild rice ecotypes and their genetic similarity to cultivated rice ( Figure 1.2 A). Introgressions between c ultivate groups are also likely due to breeding activities ( Figure 1.2 A). For the multi -founder model, the divergence of ancestral O. rufipogon predated the domestication of rice ( Figure 1.2 B). Independent domestications of different wild rice ecotypes (i. e., Or -I and Or -IIIa) created indica and japonica , with further divergent genetic backgrounds (Figure 1.2 B). Introgressions between early indica and japonica domesticates result in the shared key domestication genes as we observed from data ( Figure 1.2 B). Besides, introgressions from cultivated rice to sympatric wild rice ecotypes further shaped the divergence of contemporary wild rice ecotypes. Further studies will reveal the roles of these evolutionary drivers. With novel strategies and more comprehensive analyses, our understanding about rice domestication will be further advanced. !14 APPENDIX !15 Figure 1. 1 Domestication models adapted from Sang and Ge (2007). (A) snow -balling model; (B) combination model. Squares represent founder wild rice; Rounded rectangle represent initial domesticates; circles represent modern rice subspecies. Differences in shadings indicate genomic divergence. Triangle, diamond, and ellipse represent critical domestication alleles that are observed in modern cultivars. Double -headed arrows indicate hybridization. Single arrowheads point to the progress of domestication. !"!16 Figure 1. 2 Schematics of the single - (A) vs. double - (B) founder models adapted from Molina et al. (2011). (A) In the single domestication event, the O. rufipogon ancestral population was domesticated into a prototype of cultivated rice, which was further diversifie d into indica and japonica . Introgressions from cultivated rice into wild rice as well as geographic isolations facilitated the divergence of O. rufipogon into Or -I and Or -IIIa ecotypes. ( B) In the double -founder model, the O. rufipogon ancestral populatio n was diversified into Or -I and Or -IIIa ecotypes, which were domesticated independently and formed indica and japonica , respectively. Symmetric introgressions between the populations are indicated by arrows. Wild rice Or-IOr-IIIaInd Jap Wild rice Or-IOr-IIIaInd Jap AB!17 REFERENCES !18 REFERENCES Asano K, Yamasaki M, Takuno S, Miura K, Katagiri S, Ito T, Doi K, Wu J, Ebana K, Matsumoto T et al. 2011. Artificial selection for a green revolution gene during japonica rice domestication. Proc Natl Acad Sci U S A 108: 11034 -11039. Bai X, Huang Y, Mao D, Wen M, Zhang L, Xing Y. 2016. Regulatory role of FZP in the determination of panicle branching and spikelet formation in rice. Scientific Reports 6: 19022. Caicedo AL, Williamson SH, Hernandez RD, Boyko A, Fledel -Alon A, York TL, Polato NR, Olsen KM, Nielsen R, McCouch SR et al. 2007. Genome -wide patterns of nucleotide polymorphism in domesticated rice. PLoS Genet 3: 1745 -1756. Choi JY, Platts AE, Fuller DQ, Hsing Y -I, Wing RA, Purugganan MD. 2017. The Rice Paradox: Multiple Origins but Single Domestication in Asian Rice. Mol Biol Evol 34: 969 -979. Choi JY, Purugganan MD. 2018. Multiple Origin but Single Domestication Led to Oryz a sativa . G3: Genes|Genomes|Genetics 8: 797 -803. Civan P, Craig H, Cox CJ, Brown TA. 2015. Three geographically separate domestications of Asian rice. Nat Plants 1: 15164. Ding Y. 1929. Textual research of crop names [in Chinese]. Nongsheng : 5 -9. Du H, Yu Y, Ma Y, Gao Q, Cao Y, Chen Z, Ma B, Qi M, Li Y, Zhao X et al. 2017. Sequencing and de novo assembly of a near complete indica rice genome. Nat Commun 8: 15324. Gao LZ, Innan H. 2008. Nonindependent domestication of the two rice subspecies, Oryza sativa ssp. indica and ssp. japonica , demonstrated by multilocus microsatellites. Genetics 179: 965-976. Garris AJ, Tai TH, Coburn J, Kresovich S, McCouch S. 2005. Genetic structure and diversity in Oryza sativa L. Genetics 169: 1631 -1638. Gepts P. 2004. Crop domes tication as a longterm selection experiment. In Plant Breeding Reviews , Vol 24 (ed. J Janick). John Wiley & Sons, Inc. !19 He Z, Zhai W, Wen H, Tang T, Wang Y, Lu X, Greenberg AJ, Hudson RR, Wu CI, Shi S. 2011. Two evolutionary histories in the genome of rice: the roles of domestication genes. PLoS Genet 7: e1002100. Hu B, Wang W, Ou S, Tang J, Li H, Che R, Zhang Z, Chai X, Wang H, Wang Y et al. 2015. Variation in NRT1.1B contributes to nitrate -use divergence between rice subspecies. Nat Genet 47: 834 -838. Huan g X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W et al. 2012. A map of rice genome variation reveals the origin of cultivated rice. Nature 490: 497 -501. International Rice Genome Sequencing P. 2005. The map -based sequence of the rice genome. Nature 436: 793 -800. Jin J, Huang W, Gao JP, Yang J, Shi M, Zhu MZ, Luo D, Lin HX. 2008. Genetic control of rice plant architecture under domestication. Nat Genet 40: 1365-1369. Kato S, Hara HK. 1928. Classification of improved rice varieties. J of Kyushu Imperial University of Agriculture : 132 -147. Li C, Zhou A, Sang T. 2006. Rice domestication by reducing shattering. Science 311: 1936 -1939. Li L, Lee G -A, Leping J, Ju zhong Z. 2016. Evidence for the early beginning (c. 9000 cal. BP) of rice domestication in China: a response. The Holocene 17: 1059 -1068. Londo JP, Chiang YC, Hung KH, Chiang TY, Schaal BA. 2006. Phylogeography of Asian wild rice, Oryza rufipogon , reveals multiple independent domestications of cultivated rice, Oryza sativa . Proc Natl Acad Sci U S A 103: 9578 -9583. Molina J, Sikora M, Garud N, Flowers JM, Rubinstein S, Reynolds A, Huang P, Jackson S, Schaal BA, Bustamante CD et al. 2011. Molecular evidence f or a single evolutionary origin of domesticated rice. Proc Natl Acad Sci U S A 108: 8351 -8356. Nie SJ, Liu YQ, Wang CC, Gao SW, Xu TT, Liu Q, Chang HL, Chen YB, Yan PC, Peng W et al. 2017. Assembly of an early -matured japonica (Geng) rice genome, Suijing18, based on PacBio and Illumina sequencing. Sci Data 4: 170195. !20 Roulin A, Chaparro C, Piegu B, Jackson S, Panaud O. 2010. Paleogenomic analysis of the short arm of chromosome 3 reveals the history of the African and Asian proge nitors of cultivated rices. Genome Biol Evol 2: 132 -139. Sang T, Ge S. 2007. The puzzle of rice domestication. J Integr Plant Biol 49: 760 -768. Second G. 1985. Evolutionary relationships in the Sativa group of Oryza based on isozyme data. G”n”tique, S”lect ion, …volution 1: 89 -114. Sugimoto K, Takeuchi Y, Ebana K, Miyao A, Hirochika H, Hara N, Ishiyama K, Kobayashi M, Ban Y, Hattori T et al. 2010. Molecular cloning of Sdr4 , a regulator involved in seed dormancy and domestication of rice. Proc Natl Acad Sci U S A 107: 5792 -5797. Vaughan DA, Lu Bao -Rong, Tomooka, Norihiko. 2008. The evolving story of rice evolution. Plant Sci 174: 394 Ð408. Vitte C, Ishii T, Lamy F, Brar D, Panaud O. 2004. Genomic paleontology provides evidence for two distinct origins of Asian rice ( Oryza sativa L.). Mol Genet Genomics 272: 504 -511. Wang H, Vieira FG, Crawford JE, Chu C, Nielsen R. 2017. Asian wild rice is a hybrid swarm with extensive gene flow and feralization from domesticated rice. Genome Res doi:10.1101/gr.204800.116. Xie W , Wang G, Yuan M, Yao W, Lyu K, Zhao H, Yang M, Li P, Zhang X, Yuan J et al. 2015. Breeding signatures of rice improvement revealed by a genomic variation map from a large germplasm collection. Proc Natl Acad Sci U S A 112: E5411 ÐE5419. Xu F, Fang J, Ou S, Gao S, Zhang F, Du L, Xiao Y, Wang H, Sun X, Chu J et al. 2014. Variations in CYP78A13 coding region influence grain size and yield in rice. Plant Cell Environ 38: 800 -811. Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, Dong Y, Gutenkunst RN, Fang L, Huang L et al. 2011. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat Biotechnol 30: 105 -111. Zhang J, Chen LL, Sun S, Kudrna D, Copetti D, Li W, Mu T, Jiao WB, Xing F, Lee S et al. 2016. Build ing two indica rice reference genomes with PacBio long -read and Illumina paired -end sequencing data. Sci Data 3: 160076. !21 CHAPTER 2 LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons The work presented in this chapter is part of the final publication: Ou S, Jiang N. 2018. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol 176: 1410 -1422. !22 Abstract Long terminal repeat retrotransposons (LTR -RTs) are prevalent in plant genomes. The identification of LTR -RTs is critical for achieving high -quality gene annotation. Based on the well -conserved structure, multiple programs were developed for the de novo identification of LTR -RTs; however, these programs are associated with low specificity and high false discovery rates. Here, we report LTR_retriever, a multithreading -empowered Perl program that identifies LTR -RTs and generates high -quality LTR lib raries from genomic sequences. LTR_retriever demonstrated significant improvements by achieving high levels of sensitivity (91%), specificity (97%), accuracy (96%), and precision (90%) in rice (Oryza sativa). LTR_retriever is also compatible with long sequ encing reads. With 40k self -corrected PacBio reads equivalent to 4.5 " genome coverage in Arabidopsis ( Arabidopsis thaliana ), the constructed LTR library showed excellent sensitivity and specificity. In addition to canonical LTR -RTs with 5' -TGÉCA -3' termini , LTR_retriever also identifies noncanonical LTR -RTs (non -TGCA), which have been largely ignored in genome -wide studies. We identified seven types of noncanonical LTRs from 42 out of 50 plant genomes. The majority of noncanonical LTRs are Copia elements, w ith which the LTR is four times shorter than that of other Copia elements, which may be a result of their target specificity. Strikingly, non -TGCA Copia elements are often located in genic regions and preferentially insert nearby or within genes, indicatin g their impact on the evolution of genes and their potential as mutagenesis tools. For a full text of this work, please visit http://www.plantphysiol.org/content/176/2/1410.long . !23 CHAPTER 3 Assessing Genome Assembly Quality Using the L TR Assembly Index (L AI) The work presented in this chapter is part of the final publication: Ou S, Chen J, Jiang N. 2018. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res 46: e126. !24 Abstract Assembling a plant genome is challenging due to the abundance of repetitive sequences, yet no standard is available to evaluate the assembly of repeat space. LTR retrotransposons (LTR -RTs) are the predominant interspersed repeat that is poorly assembled in draft genomes. Here, we propose a reference -free genome metric called LTR Assembly Index (LAI) that evaluates assembly continuity using LTR -RTs. After correcting for LTR -RT amplification dynamics, we show that LAI is independent of genome size, genomic LTR -RT content, and gene space evaluation metrics (i.e., BUSCO and CEGMA). By comparing genomic sequences produced by various sequencing techniques, we reveal the significant gain of assembly continu ity by using long -read -based techniques over short -read -based methods. Moreover, LAI can facilitate iterative assembly improvement with assembler selection and identify low -quality genomic regions. To apply LAI, intact LTR -RTs and total LTR -RTs should cont ribute at least 0.1% and 5% to the genome size, respectively. The LAI program is freely available on GitHub: https://github.com/oushujun/LTR_retriever . For a full text of this work, please visit https://academic.oup.com/nar/article/46/21/e126/5068908 . !25 CHAPTER 4 A Fine Map of Rice Domestication Revealed by 3,485 Cultivated and Wild Accessions !26 Introduction Asian rice ( Oryza sativa L.) is a staple crop for over half of the human population (Juliano and Institute 1993) . The earliest archeology record for rice harvesting dates back to ca. 14,000 Ð ca. 18,000 years ago in southern China (Lu 2006; Boaretto et al. 2009) . However, the exact origin of cultivated rice is still debatable. Archaeological studies reveal that rice domestication may have begun ca. 9,400 years ago in the Lower Yangtze basin (Jiang and Liu 2006; Zuo et al. 2017), while genetic and phylogeographic studies suggest possible origins in southern China, eastern India, and Indochina (Londo et al. 2006; Huang et al. 2012; Civan et al. 2015; Choi and Purugganan 2018) . Rice has two main ecotypes or subspecies, indica and japonica , which have been highly diversified and adapted to different habitats (Garris et al. 2005; Huang et al. 2012; Wang et al. 2018). Two other regional ecotypes, aromatic and aus, are recognized by molecular genetic studies (Garris et al. 2005; Huang et al. 2012; Wang et al. 2018) and mainly colonized in India, Bangladesh, and Nepal with unique adaptations to harsh environments. Hypotheses for the origin of these diverse rice ecotypes include single -origin, dual -origin, and multiple -origin models, with conflicting observations for the monophyletic key domestication genes (e.g., sh4 , prog1 , and sd-1) and the polyphyletic genomic background. The common wild rice ( O. rufipogon Griff.) is known as the direct ancestor of cultivated rice. However, O. rufipogon has been diversified into multiple ecotypes, which have been proposed to be ancestors of different cultivated rice ecotypes (Huang et al. 2012; Civan et al. 2015) . Besides, intensive interspecific and intraspecific admixtures between cultivated and wild rice ecotyp es are also noted (Huang et al. 2012; Choi et al. 2017; Wang et al. 2017) , making the puzzle of rice domestication even more complicated. !27 Since the initial sequencing of the rice genome (Goff et al. 2002; Yu et al. 2002; International Rice Genome Sequencing 2005) , substantial amounts of resequencing resources and efforts have been devoted to solving this domestication puzzle. With the boom of new algorithms, r evisiting these questions by combining available resources may provide new insights. Here, we combined raw sequencing data of 3,024 cultivated rice accessions and 461 wild rice accessions (hereafter, the 3Kw dataset) from the National Center for Biotechnol ogy Information (NCBI). Remapping of these data allows unbiased evaluations of genetic diversity and demographic history among rice ecotypes. We reveal their complicated relationship by reconstruction of the phylogeny and the population structure , providing new insight into rice domestication . Results and Discussion s Genome mapping, imputation, and filtering Information about the 3,485 cultivated and wild rice accessions (3Kw) used in this study are summarized in Supplementary Table 4.1 . Raw reads were trimmed and mapped to the O. sativa ssp. japonica cv. Nipponbare MSU 7.0 reference genome (Kawahara et al. 2013). A total of 46 million raw variants were identified, which were filtered using a machine -learning algorithm with 99.0% truth sensitivity (Materials and Methods) ( Figure 4. 1 and Figure 4. 2). Insertion, deletion, and multiallelic variants were classified as low -confident variants and were not used in this study, retaining 21.9 million single -nucleotide polymorphisms (SNPs). To rec oncile data heterogeneity caused by variations in sequencing depth, missing genotypes were imputed and filtered with imputation R 2 # 0.95 ( Figure 4. 3), resulting in 17.7 million SNPs that are highly concordant (97.4% - 99.9%) with four other datasets ( Tabl e 4.1 ) (McCouch et al. 2016; Wang et !28 al. 2017; Wang et al. 2018) . All following analyses were based on the 17.7 million high -quality SNPs. Population structure and phylogeny Classifying the 3K w population into subpopulations is possible and highly consistent with previous studies ( Figure 4. 4) (Huang et al. 2012; Wang et al. 2018). With 14 principal components, about 93% of genetic variations in the 3Kw population are explained ( Figure 4. 5B). To st udy the ancestral components of the 3Kw population, we applied supervised estimation of individual ancestry from unlinked ( r2 $ 0.1) SNPs, with inferred ancestry ( K) varies from 1 to 50 (Figure 4. 6). The cross -validation error decreases as K increases, indicating a very complicated population structure ( Figure 4. 5A). To further dissect the relationship of 3Kw accessions and subpopulations, we used near -neutral SNPs that are located in the third position of fourfold degenerate codons (4DSNP) to reconstruc t the whole -genome phylogeny of the 3Kw population ( Figure 4. 7C), which shows high consistency with its population structure ( Figure 4. 7D). We thus regroup the 3Kw population into 14 subpopulations based on previous groupings of the 3,000 Rice Genome (3KRG ) population (Wang et al. 2018) and the wild rice population (Huang et al. 2012) in considerations of the population structure ( K = 14) and the phylogeny ( Figure 4.7), with group names consistent with previous studies ( Table 4.2 ) (Materials and Methods). Annotation of subpopulations reveals their strong connections to geographic origin and pedigree status ( Figure 4. 7A, B). Aromatic , aus, and indica -2 groups are ma inly from India, while indica -1A and temperate japonica are mainly found in China with substantial breeding improvements ( Figure 4. 7A, B). In Indonesia, indica -3B and tropical japonica are dominating, while the latter is also found in southeast Asia countr ies such as Laos and Philippines, and in the !29 United States ( Figure 4. 7A). The s ubtropical japonica group is mainly found in Laos and Thailand ( Figure 4. 7A). Indica -1B, admixed indica , and admixed japonica are distributed worldwide with extensive tracible pedigree records, which probably represent modern breeding efforts ( Figure 4. 7A, B). Indica -3A are mainly traditional lines located in the Indochinese Peninsula ( Figure 4. 7A, B). We rooted the whole -genom e phylogeny of the 3Kw population using a small group of wild rice relatives (Figure 4. 7C, Supplementary Table 4.1 ). After 1,000 bootstrap pseudo -replications, we found high supporting rate for major branches ( Figure 4. 7C). Further, major cultivated rice e cotypes, japonica , indica , and aus, are clustered in different branches, indicating parallel efforts for rice improvements ( Figure 4. 7C). Moreover, common wild rice ecotypes Or -3A and japonica share most recent common ancestor, so do Or -1 and indica , and aus (Figure 4.7C). While this may indicate independent domestication origin of the three rice ecotypes (Londo et al. 2006; Civan et al. 2015), other evolutionary processes such as population bottleneck and lateral gene transfer may also be the cause, which have been documented previously (see Xu et al. (2011) (Xu et al. 2011), Huang et al. (2012) (Huang et al. 2012), Wang et al. (2017) (Wang et al. 2017), and Civ⁄ ! and Brown (2018) (Civ⁄ ! and Brown 2018) ). Other O. rufipogon ecotypes, Or -2 and Or -3B, are clustered with Or -1 and placed between the indica and the japonica clades ( Figure 4. 7C). Materials and Methods Sampling and DNA sequences The sequences of 3,024 cultivated rice accessions were derived from the 3,000 Rice Genomes (3KRG) project with an average sequencing depth of 14 " (The Rice Genomes Project 2014) . The collection represents a core set of the International Rice Germplasm Collection (IRGC) at the !30 International Rice Research Institute (IRRI), and the China National Crop Gene Bank (CNCGB) at the Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS). The wild rice data contain 461 samples including 446 O. rufipogon accessions and 15 accessions of outgroup species, which were sequenced by Huang et al. (2012) with an average sequencing depth of 1.5 " (Huang et al. 2012) . The wild rice access ions were collected and preserved at the China National Rice Research Institute (CNRRI) in Hangzhou, China, and the National Institute of Genetics (NIG) in Mishima, Japan. The detailed information, including sequencing depth, country of origin, and ecotype grouping were adapted from relevant previous studies and summarized in Supplementary Table 4.1 . Raw sequencing data of 3,485 cultivated and wild rice (3Kw) were downloaded from the NCBI Sequence Read Archive (SRA) (www.ncbi.nlm.nih.gov/sra/ ) with project numbers PRJEB6180 and PRJEB2829, respectively, consisting a total of 17.4 Terabases, representing a comprehensive dataset of genetic diversity in cultivated and wild rice. Read alignment and variant identification All reads were trimmed to remove low -quality bases and adapters to improve mapping quality and variant call accuracy (Del Fabbro et al. 2013) . Pair -end reads were trimmed separately for each end using cutadapt (v1.11) ( -f fastq -u 10 -q 20,20 --trim -n -m 30 -n 3) (Martin 2011) , then re-paired using a custom PERL script (sync_PE_fastq_files.pl). Pair -end and single -end clean reads were aligned to the O. sativa ssp. japonica cv. Nipponbare (MSUv7) reference genome, separately, using BWA -MEM (v0.7.12.r1044) ( -t 4 -k 16 -w 10 -M) (Li 2013) . Multiple alignment files of an accession, either single -end or pair -end, or from different sequencing runs, were merged into one file with PCR duplicates removed using Picard tools (v1.113) (Mer geSamFiles.jar and MarkDuplicates.jar, respectively) (http://picardsourceforgenet ). To !31 improve call accuracy of insertion/deletion (Indel) va riants, candidate indel regions were locally re-aligned using the Genom e Analysis Toolkit (GATK v3.4.46) (McKenna et al. 2010), then the GATK HaplotypeCaller is applied to call SNPs and indels simultaneously via de novo assembly of haplotype regio ns ( -T HaplotypeCaller --genotyping_mode DISCOVERY -stand_call_conf 30 --emitRefConfidence GVCF). The GATK joint genotyping algorithm using the multi -sample aggregation of sequences was applied to call genotypes confidently. To filter out low -quality and false variant calls, a machine -learning based filter called Variant Quality Score Recalibration (VQSR) (McKenna et al. 2010) was applied by using the rice dbSNP dataset (https://www.ncbi .nlm.nih.gov/projects/SNP/ ) to train classifiers ( -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0). A total of 2,590,236 validated SNPs from the rice dbSNP dataset were used as true si tes training input ( -resource:hapmap), while the other 7,540,170 SNPs from the dbSNP database were used as known sites input ( -resource:dbsnp). Variants with the tranche truth sensitivity of 99% were marked as ÒPASSÓ, which were retained and subjected to h ard filtering. For hard filtering, biallelic SNPs with read coverage less than 2 (QD < 2.0), strand bias higher than 60.0 (FS > 60.0), mapping quality less than 40.0 (MQ < 40.0), mapping quality rank sum test score less then -12.5 (MQRankSum < -12.5), or r ead position rank sum test score less than -8.0 (ReadPosRankSum < -8.0) were filtered out as low -confident calls. Indels with QD < 2.0, FS > 200.0, or ReadPosRankSum < -20.0 were filtered out as low -confident calls. Imputation of missing data After hard filtering, ambiguous and missing genotypes were imputed using Beagle (v4.1) (Browning and Browning 2016) with default parameters. A high -density rice recombination map !32 from Huang et al. (2009) (Huang et al. 2009) was converted based on the physical distance of SNPs and provided as a genetic map for imputation. To control for imputation quality, the allelic R2 and dosage R 2 representing the squared correlation between estimated and true reference allele dosage was applied, with SNP R 2 < 0.95 being removed. A total of 17.7 million SNPs were retained and were used for all subsequent analyses unless indicated. Estimation of variant discordance Imputed and filtered SNPs were evaluated using multiple public resources. Th e IRRI -released 32 million SNP set of the 3K data (Wang et al. 2018) and the 700K high -density rice array (HDRA) data generated by McCouch et al. (2016) (McCouch et al. 2016) were used to evaluate our cultivated rice SNP calls. For wild rice SNPs, the initial SNP set was called based on the IRGSP v4.0 reference genome (Huang et al. 2012), which SNP coordinate was converted to the current reference (MSU v7) by Wang et al. (2017) (Wang et al. 2017) based on syntenic regions and alignment of flanking sequences. We also performed de novo SNP identification using clean reads of all wild rice accessions with a Bayesian approach called Freebayes (v0.9.21) (Garrison 2012). All four reference datasets were filtered with the same standard we applied on our dataset, which heterozygous and multi -allelic SNPs were converted to the ambiguous state with further filtering on missing data. To calculate the pair -wise discordance rate between two SNP sets, say, the reference dataset ( R) and our dataset ( T), assume M samples with N SNPs are present in both datasets. For sample i, if SNP j is unambiguously genotyped in both datasets, then this data point is counted as an informative site. For all informative SNPs in a chromosome, if Rij has the same genotype as Tij, then this SNP is counted as concordant, otherwise discordant. Based on these settings, we defined the chromosomal pair -wise SNP discordance rate D as follows . !33 !"#$%&'()&'$%&'()&'#*#$%&'+)&'#,#-../, where .0123 and .0425 For verification of cultivated rice S NPs, the IRRI -released 32 M SNP set of the 3K data (Wang et al. 2018) downloaded from the Rice SNP -Seek Database ( http://oryzasnp.org ) was used as the reference dataset. To control the quality of reference datasets, multi -allelic SNPs were converted to the ambiguous state as we did in our dataset, while SNPs with missing data rates higher than 10% were removed. As a result, a total of 12,255,063 SNPs were retaine d as informative sites ( Table 4.1 ). We also used a second dataset generated using the 700K high -density rice array (HDRA) downloaded from the Rice Diversity Database ( www.ricediversity.org ) (McCouch et al. 2016) , with 94 cultivated rice accessions overlapped with the 3K dataset, to estimate variant discordance. To control t he SNP quality of the shared accessions, we only retained SNP calls that were saturated with array signals (Signal = 33.0) and labeled others as ambiguous. Multi -allelic SNPs were also converted to the ambiguous state, while SNPs with missing data rates higher than 10% were removed. As a result, a total of 38,863 SNPs were retained as informative sites ( Table 4.1 ). For the verification of wild rice SNPs, there is no proper reference dataset available due to the update of the reference genome. Wang et al. (2017) (Wang et al. 2017) converted the wild rice SNP set released by Huang et al. (2012) (Huang et al. 2012), from genome version IRGSP 4.0 to MSU 7.0 (current version), based on genome synteny and alignment of flanking sequences. Such conversion may introduce errors due to the uncertainty of finding micro -syntenic regions. Also, the correction of the reference genome sequence is not reflected in the converted SNP set due to the lack of remapping raw reads. We used this converted SNP set with the same quality filter applied on the IRRI SNP set, resulting in 2,060,264 informative SNPs ( Table 4.1 ). !34 To provide further insight about the quality of our wild rice SNP set, we performed remapping of all wild rice clean reads to the current reference genome using a different method, called Freebayes (v0.9.21) (Garrison 2012) , which adopts a Bayesian statistical framework for variant detection. To accelerate the inference of variants, we split the reference genome into 50 -Kb regions to run Freebayes (Garrison 2012) (--theta 0.003 --ploidy 2 --use -best -n-alleles 8 --min-mapping -quality 1 --min-base -quality 3 --read -max -mismatch -fraction 0.10 --min-alternate -fraction 0.15 --min-alternate -count 1 --read -snp -limit 10 --read -indel -limit 3), then combine the results afterwards. Then GATK (v3.4.46) (McKenna et al. 2010) was used to perform hard -filtering with similar parameters as we used on GATK Joint Genotyping results. Further conversion and filtering of SNPs were applied the same as we did on the IRRI SNP set, ex cept allowing the missing data rate to reach up to 50% due to the low read depth (~1.5 " ). A total of 3,518,552 informative SNPs were obtained in this experiment. Discordance rates of our dataset to filtered datasets from IRRI, HDRA, Wang et al. (2017), and Freebayes remapping are presented in Table 4.1 , which shows extremely low discordance rate (0.10%) between our cultivated rice SNP calls and the IRRI -released 3K dataset, indicating high concordance (99.9%) of our data. Similarly, the pair -wise concordance rate is very high (99%) between our cultivar rice SNPs and the cross -platform array dataset (HDRA). A slightly higher discordance rate (2.65%) is observed between our wild rice SNP set and the coordinate -converted dataset originated from Huang et al. (2012) (Huang et al. 2012) (Table 4.1 ), which is probably due to errors created by reference coordinate conversions and the difference between mapping algorithms in dealing with low depth sequencing data. Note that the GATK Joint Genotyping algorithm was developed to utilize sequencing information of the entire mapping population comparing to previous algorithms that treat samples independently !35 (McKenna et al. 2010) , which is particularly fit for low -depth sequencing data. To rule out the discrepancies created by reference coordinate conversion, we further conducted remapping of the wild rice clean reads using a Bayesian method called Freebayes (Garrison 2012) . The Freebayes -derived wild rice SNP set shows a lower discordance rate ( 1.14%) to our dataset, indicating the improvement of mapping algorithms and reference genome do help with more accurate variant calls with low -depth data. In conclusion, both the cultivated and wild rice SNPs in our dataset demonstrate high concordance to other datasets. Population structure The whole -genome 4D SNPs were used to study the population composition of the 3Kw dataset. SNPs were screened based on linkage ( r2), which was estimated using PLINK (v1.90b4.4) (Purcell et al. 2007) in 50 -SNP windows with 10 SNPs per step, and markers with r2 > 0.1 were excluded ( --indep -pairwise 50 10 0.1). The rice genetic map converted from Huang et al. (2009) (Huang et al. 2009) was also supplied to PLINK (Purcell et al. 2007) for the procedure of linkage pruning. A total of 22,909 SNPs were retained for estimation of individual ancestry usi ng ADMIXTURE (v1.3.0) (Alexander and Lange 2011) with default parameters. We used K from 1 to 50 to perform the estimation ( Figure 4. 6), with results visualized using CLUMPAK (Kopelman et al. 2015) and Inkscape ( www.inkscape.org ). Principal component analysis (PCA) with the same data was performed using PLINK (Purcell et al. 2007) . The unsupervised k-means clustering were performed using the kmeans() function in R. To evaluate the randomness of the 4D SNP set, we extracted a random subset of the genome, with a total of 113, 432 SNPs, to perform side -by-side k-means clustering. Clustering of these two datasets are virtually the same (Figure 4. 4). In consideration of ADMIXTURE, PCA, whole -genome phylogeny, and k-means clustering results, we determined that K = 14 is ÒoptimumÓ, which can explain the majority of !36 population structures and genetic variations of the 3,485 rice accessions ( Figure 4. 5, Figure 4. 6, Figure 4. 7). The further grouping of cultivated rice samples was mainly based on three sources: the original grouping of Wang et al. (2018) (Wang et al. 2018), the whole -genome phylogeny, and the ADMIXTURE results. For samples consistently grouped in the three sources, their grouping remained unchanged as in Wang et al. (2018) (Wang et al. 2018), otherwise were grouped as Adx (admixture), Japadx ( japonica -type admixture), or Indadx ( indica -type admixture), whichever fit s better to their placements in the phylogeny. The Ind -3 group was further divided into Ind -3A and Ind -3B based on their split placement on the phylogeny ( Figure 4. 7C). Of the nine low -quality samples identified in Wang et al. (2018) (Wang et al. 2018) , one was removed due to its extremely low depth (~0.3 " ) and the rest were marked as Adx in our grouping results. These lines were not included in subpopulation specific analyses. For the grouping of wild rice samples, the whole -genome phylogeny and the admixture results were used. Samples placed unambiguously in both sources were grouped as Or -1, Or -2, and Or -3 in agreement with the group names used in Huang et al. (2012) (Huang et al. 2012), otherwise grouped as Oradx ( O. rufipogon -type admixture). The Or -3 group was further divided into Or -3A and Or -3B based on their split placement in the phylogeny and geographic origins (Figure 4. 7C). The 15 wild rice relatives, including one accession of O. barthii , one accession of O. glaberrima , two accessions of O. longistaminata , and 11 accessions of O. meridionalis were independently grouped as Wild_relatives. Of the five O. glaberrima samples identified in the 3K dataset, two of them are clustered with the Wild_relatives group, increasing the group size to 17, while others are placed in cultivated rice groups, indicating certain degree of admixture, thus !37 labeled as Adx. Summary of the new grou ping strategy for the 3,485 rice accessions was presented in Table 4.2 . Phylogenetic reconstruction We used S NPs located in 4 -fold degenerate (4D) sites of the whole -genome to reconstruct the phylogeny of the 3Kw population. After further screening out SNPs with minor allele frequency (MAF) < 1%, a total of 157,475 SNPs were retained. We first searched for the be st-observed maximum likelihood (ML) tree using RAxML (v8.2.9) (Stamatakis 2014) with 20 independent attempts searching over 2,500 temporary trees ( -j --no-bfgs -f d -N 5 -m ASC_GTRGAMMA --asc -corr=lewis -o W3101). The GTR GAMMA model wa s used with correction for ascertainment bias of SNP data by the Lewis model. Further, we conducted 10 independent searches to generate 1,000 rapid bootstrap trees using RAxML (Stamatakis 2014) . The transfer bootstrap supporting value o f each branch was summarized using BOOSTER (v0.1.1) (Lemoine et al. 2018). One sample (W3101) of O. longistaminata , which seems to be a distant relative to cultivated rice, was used as the outgroup to root all the trees. Similar methods were used to reconstruct the phylogeny of ancestral regions based on 1,345 4D SNPs that are located in shared non-introgression regions of Or -1 and Or -3A, with 1,000 rapid bootstrap pseudo -replications. FigTree (v1.4.4) ( http://tree.bio.ed.ac.uk/software/figtree/ ) was used to visualize the best -observed ML tree with supporting values of major branches manually annotated using Inkscape (v0.92) ( www.ink scape.org ). Acknowledgments We thank Drs. Xiaofan Zhou and Xing -Xing Shen for valuable discussions. We thank Drs. Patrick Edger, David Lowry, and C. Robin Buell for critical reading of the manuscript. !38 APPENDIX !39 Table 4. 1 SNP discordance rate between this study and other studies. Chromosome IRRI 1 HDRA 2 Wang et al. (2017) 3 Freebayes 4 Chr1 0.11% 1.09% NA5 1.09% Chr2 0.10% 0.90% 2.60% 1.07% Chr3 0.10% 0.83% 2.44% 1.06% Chr4 0.10% 1.23% 2.56% 1.16% Chr5 0.09% 0.92% 2.53% 1.09% Chr6 0.10% 1.15% 2.78% 1.13% Chr7 0.10% 0.97% 2.68% 1.16% Chr8 0.09% 0.97% 2.62% 1.13% Chr9 0.10% 0.89% 2.66% 1.11% Chr10 0.11% 0.89% 2.73% 1.20% Chr11 0.10% 0.84% 2.70% 1.25% Chr12 0.10% 1.42% 2.79% 1.23% Whole genome 0.10% 1.01% 2.65% 1.14% Informative SNP 12,255,063 38,863 2,060,264 3,518,552 1The IRRI -released 32 M SNP set for 3,024 cultivated rice ( http://oryzasnp.org ). 2The 700K high -density rice array (HDRA) from McCouch et al. (2016) with 94 shared cultivated rice. 3Coordinated converted from Huang et al. (2012) for 443 wild rice accessions. 4Remapping of 461 wild rice accessions using Freebayes. 5Only 47 SNPs were retained after filtering. !40 Table 4. 2 Grouping of 3, 485 rice accessions. Indica -clade Size Japonica -clade Size Or-clade Size Others Size Ind -1A 196 Tej 280 Or-1 86 Aus 192 Ind -1B 178 TrjA 334 Or-2 103 Adx 168 Ind -2 254 TrjB 87 Or-3A 67 Wild_relatives 17 Ind -3A 244 Aro 73 Or-3B 40 Ind -3B 128 Japadx 127 Oradx 150 Indadx 761 Sum 1761 901 446 377 !41 Figure 4. 1 Truth sensitivity of tranches in Variant Quality Score Recalibration (VQSR). (A) Truth sensitivity of four tranche categories, 0.90, 0.99, 0.999, and 1.00, were calculated in the VQSR procedure as shown in the x -axis. The tranche specificity is represented by the SNP transition -to-transversion ratio (Ti/Tv) as shown in the y -axis. (B) Proportion of true positive (TP) SNPs in the four VQSR trenches (Y -axis). The x -axis denotes the number of SNPs in a unit of 1000. !"!42 Figure 4. 2 SNP filtering based on Variant Quality Score Recalibration (VQSR). A total of eight factors, DP (coverage), QD (quality by depth), FS (fisher strand), SOR (strand odds ratio), MQ (RMS mapping quality), MQRankSum (mapping quality rank sum test), ReadPosRankSum (read posi tion rank sum test), and InbreedingCoeff (inbreeding coefficient), were used to train the multi -dimensional machine -learning classifier with 2 -dimention plots shown. The truth sensitivity cutoff was set to 0.99 for the classification of true (green) and fa lse (purple) SNPs. 1 dot = 1 SNP. !43 Figure 4. 3 Distribution of imputation R 2 in minor allele frequency (MAF) bins (0.05 per bin). Outliers were plotted in grey dots outside each box -plot. Each subplot represents a rice chromosome indicated in the subplot title. X - and Y -axes indicate MAF and imputation R 2, respectively. (A) raw imputation results. (B) after removal of SNPs with imputation R 2 < 0.95. !"!44 Figure 4. 4 K-means clustering of 3,485 cu ltivated and wild rice accessions with (A) whole -genome 4D SNPs and (B) whole -genome random SNPs. Clustering of subpopulations were indicated on the figure. Admix, admixture; aro, aromatic ; aus, aus; ind -I, indica group I; ind -II, indica group II, ind -III, indica group III, Or -I, O. rufipogon group I; Or -II, O. rufipogon group II; Or -III, O. rufipogon group III; tej, temperate japonica ; trj, tropical japonica . Labeling of subpopulations was based on published groupings from Huang et al. (2012) and Wang et al. (2018). !"#$!"#$$$!"#$$%&' %"' ()* $+, ("- .-/ !"#$!"#$$$!"#$$%&' %"' ()* $+, ("- .-/ 01!45 Figure 4. 5 Determination of the number of populations. (A) Cross validation error (y -axis) keeps decreasing when the number of populations (K, x -axis) increases. (B) Percentage of variation explained (y -axis) by e ach principal component (PC, x -axis). More than 93% of total genetic variation could be explained by the first 14 PCs. !"#! !"#$ !"%! !"%$ !"&! !"&$ !'!#!%!&!$!()*++,-./01.20*3,4))*) 5!6'!6#!6%!6&!6$!6!$'!'$#!6,7.)0.20*3 8)0390:./,(*;:*3<32 =%6*>,7.)0.20*3 ?@!46 Figure 4. 6 The population structure of 3,485 rice accessions revealed by various K. The x -axis represents cultivated or wild rice accessions in the order the same as the whole -genome phylogeny ( Figure 4.1 C). Different colors in a panel represent different ancestral compositions, with the height of color bars represents the whole -genome percentage of that compo sition. !47 Figure 4. 7 Population structure of 3,485 rice accessions. (A) Country origin of cultivated rice groups. (B) Germplasm status based on pedigree and knowledge (adapted from Wang et al. (2018)). (C) Maximum -likelihood phy logenetic reconstruction of 3,485 rice accessions with major branch support from 1,000 rapid bootstrap pseudo -replicates. The tree was rooted with O. longsitaminata . Branches were colored based on subpopulations with major groups indicated on the right, an d those without group labels are admixed. Scale indicates branch length of 0.2. Ind, indica ; Or, O. rufipogon ; Aro, aromatic ; TrjA, tropical japonica ; TrjB, subtropical japonica ; Tej, temperature japonica . (D) Population structure of all samples with K increases from 13 to 16. Each row represents a sample, which were ordered based on the phylogenetic tree in (C). Different colors in a panel represent different ancestral compositions. !"#$!"# $%&!"# $'&!"# $%(!"# $'(!"# $)&*+ ,-$',-$),-$%(,-$%&&-. /-0& /-0( /10 2..3 !"45'% !"45'6 !45'7 !45'8 !48 REFERENCES !49 REFERENCES Alexander DH, Lange K. 2011. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12: 246. Boaretto E, Wu X, Yuan J, Bar -Yosef O, Chu V, Pan Y, Liu K, Cohen D, Jiao T, Li S et al. 2009. Radiocarbon dating of charcoal and bone collagen associated with early pottery at Yuchanyan Cave, Hunan Province, China. Proc Natl Acad Sci U S A 106: 9595 -9600. Browning Brian L, Browning Sharon R. 2016. Genotype Imputation with Millions of Referen ce Samples. Am J Hum Genet 98: 116 -126. Choi JY, Platts AE, Fuller DQ, Hsing Y -I, Wing RA, Purugganan MD. 2017. The Rice Paradox: Multiple Origins but Single Domestication in Asian Rice. Mol Biol Evol 34: 969 -979. Choi JY, Purugganan MD. 2018. Multiple Ori gin but Single Domestication Led to Oryza sativa . G3: Genes|Genomes|Genetics 8: 797 -803. Civ⁄ ! P, Brown TA. 2018. Role of genetic introgression during the evolution of cultivated rice (Oryza sativa L.). BMC Evol Biol 18: 57. Civan P, Craig H, Cox CJ, Brown TA. 2015. Three geographically separate domestications of Asian rice. Nat Plants 1: 15164. Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. 2013. An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. PLoS ONE 8: e85024. Garris A J, Tai TH, Coburn J, Kresovich S, McCouch S. 2005. Genetic structure and diversity in Oryza sativa L. Genetics 169: 1631 -1638. Garrison EM, Gabor. 2012. Haplotype -based variant detection from short -read sequencing. In arXiv:[q -bioGN] , Vol 1207.3907. Goff S A, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H et al. 2002. A draft sequence of the rice genome ( Oryza sativa L. ssp. japonica ). Science 296: 92 -100. !50 Huang X, Feng Q, Qian Q, Zhao Q, Wang L, Wang A, Guan J, Fan D, Weng Q, Huang T et al. 2009. High -throughput genotyping by whole -genome resequencing. Genome Res 19: 1068-1076. Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W et al. 2012. A map of rice genome variation reveals the origin o f cultivated rice. Nature 490: 497 -501. International Rice Genome Sequencing P. 2005. The map -based sequence of the rice genome. Nature 436: 793 -800. Jiang LP, Liu L. 2006. New evidence for the origins of sedentism and rice domestication in the Lower Yangzi River, China. Antiquity 80: 355 -361. Juliano BO, Institute IRR. 1993. Rice in Human Nutrition . Food and Agriculture Organization of the United Natio ns. Kawahara Y, de la Bastide M, Hamilton J, Kanamori H, McCombie W, Ouyang S, Schwartz D, Tanaka T, Wu J, Zhou S et al. 2013. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice 6: 1 -10. Kopelman NM, Mayzel J, Jakobsson M, Rosenberg NA, Mayrose I. 2015. Clumpak: a program for identifying clustering modes and packaging population structure inferences across K. Molecular ecology resources 15: 1179 -1191. Lemoine F, Domelevo Entfellner JB, Wil kinson E, Correia D, D⁄vila Felipe M, De Oliveira T, Gascuel O. 2018. Renewing FelsensteinÕs phylogenetic bootstrap in the era of big data. Nature 556: 452 -456. Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA -MEM. In arXi v: [q -bioGN] , Vol 1303.3997v1. Londo JP, Chiang YC, Hung KH, Chiang TY, Schaal BA. 2006. Phylogeography of Asian wild rice, Oryza rufipogon , reveals multiple independent domestications of cultivated rice, Oryza sativa . Proc Natl Acad Sci U S A 103: 9578 -9583. Lu TLD. 2006. The Occurrence of Cereal Cultivation in China. Asian Perspectives 45: 129 -158. !51 Martin M. 2011. Cutadapt removes adapter sequences from high -throughput sequencing reads. 2011 17. McCouch SR, Wright MH, Tung C -W, Maron LG, McNally KL, Fitzg erald M, Singh N, DeClerck G, Agosto -Perez F, Korniliev P et al. 2016. Open access resources for genome -wide association mapping in rice. Nat Commun 7: 10532. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Ga briel S, Daly M et al. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next -generation DNA sequencing data. Genome Res 20: 1297 -1303. Purcell S, Neale B, Todd -Brown K, Thomas L, Ferreira Manuel A R, Bender D, Maller J, Sklar P, de Bakker Paul I W, Daly Mark J et al. 2007. PLINK: A tool set for whole -genome association and population -based linkage analyses. Am J Hum Genet 81: 559 -575. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post -analysis of large phylogenies. Bioinformatics 30: 1312 -1313. The Rice Genomes Project. 2014. The 3,000 Rice Genomes Project. GigaScience 3: 7. Wang H, Vieira FG, Crawford JE, Ch u C, Nielsen R. 2017. Asian wild rice is a hybrid swarm with extensive gene flow and feralization from domesticated rice. Genome Res doi:10.1101/gr.204800.116. Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, Li M, Zheng T, Fuentes RR, Zhang F et al. 2018. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557: 43 -49. Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, Dong Y, Gutenkunst RN, Fang L, Huang L et al. 2011. Resequencing 50 accessions of cultivated and wild rice yields marke rs for identifying agronomically important genes. Nat Biotechnol 30: 105 -111. Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X et al. 2002. A draft sequence of the rice genome ( Oryza sativa L. ssp. indica ). Science 296: 79 -92. Zuo X , Lu H, Jiang L, Zhang J, Yang X, Huan X, He K, Wang C, Wu N. 2017. Dating rice remains through phytolith carbon -14 study reveals domestication at the beginning of the Holocene. Proc Natl Acad Sci U S A 114: 6486 -6491. !52 CHAPTER 5 Concluding Remarks !53 Annotation of LTR sequences is still a difficult task It is known that LTR sequences are not conserved between species (Benachenhou et al. 2013) . For the genome of newly sequenced species, it is inefficient to annotate LTR sequences based on the LTR sequence library of other species. Currently, the standard LTR annotation process is (1) identify intact LTR retrotransposons (LTR -RTs) from the genome, (2) construct a non -redundant LTR sequence libra ry based on intact LTR -RTs, (3) homology -based annotation of LTR sequences using the non -redundant LTR library. In Chapter two, we developed a new computer algorithm, called LTR_retriever, based on the fine -scale structure of LTR -RTs, for identification of intact elements. Tests of prediction sensitivity, specificity, accuracy, and precision show that the program has achieved a very high level of all parameters (i.e., > 90%) in both monocotyledonous and dicotyledonous model species. The performance of LTR_r etriever has surpassed similar programs and represents the latest advancement of LTR -RT identification. However, since the construction of LTR library is solely based on the identifiable intact LTR -RTs, if the assembly of these elements is not complete, th e power of such LTR annotation scheme would be compromised. Especially, for inactive LTR -RT families, due to the constant removal of intact LTR -RTs by recombination, the copy number of these families is usually very low. Misassembling the intact element of these families has a much higher chance for incomplete LTR annotation. Furthermore, for extremely old LTR families (i.e., > 2 MYA), it is very likely that all of their intact versions have been destructed, resulting in very limited power to annotate the r emaining sequence fragments. To alleviate the shortcoming of current annotation process, one possible way is to identify LTR fragments independent of intact LTR -RTs. In Chapter two, I described the removal mechanism for LTR -RT insertions. One of the end pr oducts is the formation of solo LTR, which !54 harbors only one copy of the terminal repeat. Although the sequence of LTR is among the most diverse part of LTR -RTs, the solo LTR still features with a number of micro -structures that could be used for de novo identification. For example, during the integration of LTR elements, the target genome DNA would be cut open, leaving a 4 Ð 6 bp target -site duplications (TSDs) immediately flanking the LTR -RT. The solo LTR would also carry TSDs. Further, the LTR usually con tains a dinucleotide -inverted repeat, also known as the terminal motif 5' -TG..CA -3', which is conserved between LTR -RTs across kingdoms. For 6 Ð 8 bp random sequences, the chance for two being matched is 0.02% - 0.002%, which is precise enough for the iden tification of solo LTR candidates. Further features located in the LTR, such as the core promoter TATA -box, the integrase recognition signal (5' -TGTTRNR..YNYAACA -3'), and the polyadenylation signal (5' -AATAAA -3') could be used to discriminate true LTR and false predictions (Benachenhou et al. 2013) . LTR sequences challenge the as sembly of genomes Genome sequencing techniques have been advanced, and it is possible to obtain long reads (i.e., > 20 Kb) that can effectively span the majority of LTR sequences, resulting in more continuous assemblies. In Chapter three, we developed a ne w genome evaluation metric called LTR Assembly Index (LAI) based on the assembly of LTR -RTs. It is clear that long -read techniques can produce genome assemblies as continuous as those based on BAC -by-BAC or Sanger whole -genome shotgun techniques. However, we observed that the Arabidopsis genome, which is presumably the gold -standard for genomic studies, has LAI scores (LAI = 14.9) lower than some of the long -read -based genomes (i.e., LAI > 20). While this could be explained by the limited LTR sequences (~7% ) in the Arabidopsis genome that dampen the accuracy of LAI as well as !55 the existence of unclosed physical gaps that decrease the assembly continuity, there may be other factors responsible for the lower LAI of Arabidopsis, such as the length of LTR sequenc es. Virtually, the assembly continuity is affected by the length of LTR sequences, with longer elements more difficult to be spanned by sequencing reads. On the other hand, for genomes with similar continuity, shorter LTR -RTs would result in lower LAI sco res, and vice versa. We compared the length of LTR sequences between Arabidopsis and other genomes and found the Arabidopsis genome do harbor shorter internal regions in its LTR -RTs (~ 5,200 bp). Interestingly, the internal region length of the genome with highest observed LAI (LAI = 32.2), the Sorghum bicolor genome, is comparatively longer (~ 7,300 bp). It is likely that the LAI measurement is slightly biased to genomes with longer LTR -RTs. Further adjustments may be made based on more thorough understand ings between the sizes of LTRs and that of internal regions. The functional role of LTR sequences is largely unknown Transposable elements (TEs) are known genetic Òdark mattersÓ in eukaryotic genomes. Certainly, LTR -RTs are mutagens due to their transposi tion activities. However, LTR -RTs are still very mysterious in terms of molecular functions and evolutionary roles. A number of biological functions have been found altered by the insertion of LTR -RTs. Interestingly, many of these functions were created du e to the induced expression of downstream genes, including apical dominance of maize ( Zea mays ) ( tb1 ) (Studer et al. 2011; Tsiantis 2011) , anthocyanin accumulation of Sicilian blood oranges ( Citrus sinensis ) ( Ruby ) (Butelli et al. 2012), and enhanced aluminum tolerance ( OsFRDL4 ) of rice ( Oryza sativa ) (Yokosho et al. 2016). Considering the promoter function of the LTR region during the transcriptional activity of LTR elements (Benachenhou et al. 2013) , it is very likely that some LTR regions carry reg ulatory !56 motifs that can also induce the expression of other genes. To this regard, the activity of LTR -RTs may be one of the major drivers for the evolution of gene regulations and expressions. The evolution of genome structure is also affected by the amp lification and removal of LTR -RTs. Significant haplotype variations have been observed in the Selaginella lepidophylla genome (VanBuren et al. 2018) , which may also be the case in the genome of other species, or between g enomes of different individuals of the same species. Such variations are likely driven by the recombination activity around LTR sequences. In maize, double -strand breaks (DSBs) are predominately formed in Gypsy LTR elements (He et al. 2017), which may result in chromosomal crossovers (He et al. 2017). Both insertions of LTR -RTs and genetic recombination would result in the change of chromosomal structure, such as structural variations (SVs), including insertions and deletions (Indels), copy -numb er variations (CNVs), presence -and -absence variations (PAVs), and segmental inversions. However, it is still challenging, both technically and financially, for researchers to obtain high -quality SVs, especially large SVs (i.e., > 100 bp), within a species. Luckily, with further advancement of sequencing techniques, it is getting cheaper and easier to sequence multiple accessions of a species. The relationship between LTR -RT activity and SVs is one of the imminent questions to be answered in the near future. The ambiguous role of SVs for domestication Previous studies have been utilizing the abundant genetic resources to advance the understanding of crop domestication histories. Hundreds and thousands of crop accessions have been re -sequenced to generate hig h-density markers for such purpose. However, due to the limitation of sequencing read length ( $ 250 bp), the confidently identified genetic markers are usually shorter than 10 bp, with large SVs underrepresented. In Chapter four, I reanalyzed the resequenc ing data of 3,485 rice accessions and generated 17.7 million high -quality single -nucleotide !57 polymorphisms (SNPs). The domestication history of rice is then inferred based on this dataset. We revealed the important role of standing variations for selection of beneficial traits during domestication. However, SVs are apparently the missing piece from the current model. Using de novo assemblies based on the re -sequencing data of 3,010 rice accessions, Wang et al. (2018) identified more than 10,000 novel full -length protein -coding genes (Wang et al. 2018), providing important understandings of genetic variations between rice varieties. However, the role of TEs is not revealed due to the limitation of sequencing length. Incorporating genome -wide gene expression analyses, Hufford et al. (2012) showed that the magnitude of gene expression has been changed during the domestication of maize (Hufford et al. 2012). Together with the possible enhancer function of LTR sequences, it is likely that LTR -RTs have played a systemic role during crop domestication. Future studies incorporating large SVs, thorough LTR sequence annotation, and gene expression analyses, may help to piece the underlying connections. !58 REFERENCES !59 REFERENCES Benachenhou F, Sperber GO, Bongcam -Rudloff E, Andersson G, Boeke JD, Blomberg J. 2013. Conserved structure and inferred evolutionary history of long terminal repeats (LTRs). Mob DNA 4: 5. Butelli E, Licciardello C, Zhang Y, Liu J, Mackay S, Bailey P, Reforgiato -Recupero G, Martin C. 2012. Retrotransposons control fruit -specific, cold -dependent accumulation of anthocyanins in blood oranges. Plant Cell 24: 1242 -1255. He Y, Wang M, Dukowic -Schulze S, Zhou A, Tiang CL, Shilo S, Sidhu GK, Eichten S, Bradbury P, Springer NM et al. 2017. Genomic features shaping the landscape of meiotic double -strand -break hotspots in maize. Proc Natl Acad Sci U S A 114: 12231 -12236. Hufford MB, Xu X, va n Heerwaarden J, Pyhajarvi T, Chia J -M, Cartwright RA, Elshire RJ, Glaubitz JC, Guill KE, Kaeppler SM et al. 2012. Comparative population genomics of maize domestication and improvement. Nat Genet 44: 808 -811. Studer A, Zhao Q, Ross -Ibarra J, Doebley J. 20 11. Identification of a functional transposon insertion in the maize domestication gene tb1 . Nat Genet 43: 1160 -1163. Tsiantis M. 2011. A transposon in tb1 drove maize domestication. Nat Genet 43: 1048 -1050. VanBuren R, Wai CM, Ou S, Pardo J, Bryant D, Jia ng N, Mockler TC, Edger P, Michael TP. 2018. Extreme haplotype variation in the desiccation -tolerant clubmoss Selaginella lepidophylla . Nat Commun 9: 13. Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, Li M, Zheng T, Fuentes RR, Zhang F et al. 2018. Ge nomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557: 43 -49. Yokosho K, Yamaji N, Fujii -Kashino M, Ma JF. 2016. Retrotransposon -Mediated Aluminum Tolerance through Enhanced Expression of the Citrate Transporter OsFRDL4. Plant Physiol 172: 2327 -2336.