IDENTIFICATION OF LTR RETROTRANSPOSONS, EVALUATION OF GENOME 
ASSEMBLY, AND MODELING RICE DOMESTICATION
 By Shujun Ou
 A DISSERTATION
 Submitted to
 Michigan State University
 in partial 
fulfillment of the requirements
 for the degree of
 Plant Breeding, Genetics and Biotechnology 
- Horticulture 
- Doctor of Philosophy
 Ecology, Evolutionary Biology, and Behavior 
- Dual Major
 2018  ABSTRACT
 IDENTIFICATION 
OF LTR
 RETROTRANSPOSONS, EVALUATION 
OF GENOME 
ASSEMBLY, 
AND MODELING RICE DOMESTICATION
 By Shujun Ou
 The majority of fundamental theories in genetics and evolution were proposed prior to the 
discovery of DNA as the genetic material in 1952. Those include DarwinÕ
s theory of evolution 
(1859), Mendelian genetics (1865), Wright and FisherÕs population genetics (1918), and 
McClintockÕs transposition of genetic elements (1951). Nevertheless, the underlining 
mechanisms of those theories were not fully elucidated till th
e appearance of DNA sequencing 
technology. At present, technological advances have minimized the cost for sequencing genomes. 
The real bottleneck to establish genomic resources is the annotation of genomic sequences. Long 
Terminal Repeat (LTR) retrotranspo
son is a major type of transposable genetic elements and 
dominating plant genomes. We developed a new method called LTR_retriever for accurate 
annotation of LTR retrotransposons. Further, we studied genome dynamics, genome size 
variation, and polyploidy or
igin using LTR retrotransposons. The presence of LTR 
retrotransposons challenges current sequencing and assembly techniques due to their size and 
repetitiveness. We proposed an unbiased metric called LTR Assembly Index (LAI) which 
utilizes the assembled LT
R retrotransposons to evaluate continuity of genome assembly. We 
revealed the massive gain of continuity for assembly sequenced based on long
-read techniques 
over short
-read methods, and further proposed a standardized classification system for genome 
qual
ity based on LAI. With high
-quality genomes, we can extend our knowledge about 
microevolution events using a population of genomes. The domestication history of rice is still 
unresolved due to its complicated demographic history. We collected, re
-mapped, a
nd re
-  analyzed 3,485 cultivated and wild rice resequencing accessions. 
With data imputation, a total of 
17.7 million high
-quality single
-nucleotide polymorphisms (SNPs) were identified. Our dataset 
is highly accurate as verified by cross
-platform Affymetri
x Microarray data, with 
a pairwise 
concordance rate of 99%. 
Combining phylogeny, PCA, and ADMIXTURE analyses, we present 
profound diversification among rice ecotypes.
  iv  ACKNOWLEDGMENTS
 When I was little, my father told me 
that 
erudite scholar
s know everyt
hing, and they can also 
comprehend novels. My parents did not receive decent education
, but their 
admiration
 for
 knowledge planted a seed 
on my mind
. Now I have pursued advanced education, I would like to 
thank my parents, as well as my sister, for their unlimited support and love.
  I am
 very lucky 
to have
 two great mentors. I would like to exp
ress my thankfulness to Dr. 
Chengcai Chu, who encouraged and supported my study 
prior to my
 graduate education. He is 
also a great collaborator and good friend. I am grateful to have Dr. Ning Jiang as my Ph.D. 
advisor, who ha
s provided
 a friendly, flexible, a
nd supportive study environment for me to 
explore the possibilities in graduate school. I am also thankful for her rigorous altitude towards 
scientific research, which would not allow any compromise that 
could have dampen
ed the 
quality of my studies. Both 
of them have show
n me enjoyable and sustainable ways to take 
research as a career.
  I would also like to thank my graduate committee Drs. David Lowry, C. Robin Buell, and 
Patrick Edger, who provided 
helpful
 guidance with their expertise and experience. I a
lso want to 
thank Drs. Robert VanBuren, Mitch McGrath, 
Marivi Colle
, Jinfeng Chen
, and Zixiang Wen for 
their professional and productive collaborations. The research findings presented in Chapter 3 
has been greatly encouraged and supported by Pat, Bob, and
 Jinfeng, and criti
ques
 from Ning 
have further advanced it into the next level.
  I would like to acknowledge my major graduate program in Plant Breeding, Genetics, 
and Biotechnology (PBGB), my minor graduate program in Ecology, Evolutionary Biology, and 
Behavior (EEBB), and my home department Horticulture for providing diverse, specialized, 
v  professional, friendly, and supportive learning and working environments. I would also like to 
acknowledge two supercomputer centers, the 
Institute for Cyber
-Enabled Re
search
 (iCER)
 at 
MSU and the 
Extreme Science and Engineering Discovery Environment
 (XSEDE), for 
generously providing high
-performance computation resources to support my studies. I also like 
to thank members in the JointLab group
 in 
the department of Horticulture
, the Computational 
Plant B
iology (ComPLB) group, and the EEBB colloquium for providing training and feedbacks 
to my research presentations.
  I want to thank my former and current labmates, 
Dr. Dongyan Zhao, 
Dongmei Yin, and 
Stefan Cerbin for their help with my studies and living in
 the United States. I also want to thank 
my former and current officemates, Dr. Lusheng Zhang, 
Fransiska Renita Anon
 Basundari
, Kristen Andersen, 
Christopher Gottschalk
, Robert Loepp
, Laura Hillmann
, Songwen Zhang, 
Joshua Vander Weide
, and 
Kathleen Rhoades
 for creating an
d maintaining a lovely 
office
 environment. I also want to express my appreciations for friends in the Horticultural 
Organization of Graduate Students (HOGS), the 
Association for Crop and Soil (ACRS)
, and the 
EEBB Graduate Group (EGG) for the joyful memor
ies
 we shared.
 Finally, I want to thank my 
girlfriend Xuejia Lai for her company and support.
  vi  TABLE OF CONTENTS
 LIST OF TABLES
 .................................................................................................................. viii
 LIST OF FIGURES
 ................................................................................................................... ix CHAP
TER 1
 ............................................................................................................................... 1 Introduction of Rice Domestication Models, Current Theories, Gaps and Debates
 ...................... 1 Rice domestication was a long
-lasting procedure
 ..................................................................... 2 Origins of key domestication traits and paradoxical observations
............................................. 3 Hypotheses of rice domestication
 ............................................................................................ 4 Genome
-wide studies of rice domestication
 ............................................................................. 6 Whole
-genome resequencing dissects the domestication of rice
 ............................................... 8 The role of introgression
 ........................................................................................................ 11 Working models for rice domestication
 ................................................................................. 12 APPENDIX
 .............................................................................................................................. 14 REFERENCES
 ......................................................................................................................... 17 CHAPTER 2
 ............................................................................................................................. 21 LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal 
Repeat Retrotransposons
 ........................................................................................................... 21 Abstract
 ................................................................................................................................. 22 CHAPTER 3
 ............................................................................................................................. 23 Assessing Genome Assembly Quality Using the LTR Assembly Index (LAI)
 ........................... 23 Abstract
 ................................................................................................................................. 24 CHAPTER 4
 ............................................................................................................................. 25 A Fine Map of Rice
 Domestication Revealed by 3,485 Cultivated and Wild Accessions
 ........... 25 Introduction
 ........................................................................................................................... 26 Results and Discussions
 ......................................................................................................... 27 Geno
me mapping, imputation, and filtering
 ....................................................................... 27 Population structure and phylogeny
 ................................................................................... 28 Materials and Methods
 .......................................................................................................... 29 Sampling and DNA sequences
 ........................................................................................... 29 Read alignment and variant identification
 .......................................................................... 30 Imputation of missing data
 ................................................................................................. 31 Estimation of variant discordance
 ...................................................................................... 32 Population structure
 ........................................................................................................... 35 Phylogenetic reconstruction
 ............................................................................................... 37 Acknowledgments
 ................................................................................................................. 37 APPENDIX
 .............................................................................................................................. 38 REFERENCES
 ......................................................................................................................... 48 vii
  CHAPTER 5
 ............................................................................................................................. 52 Concluding Remarks
 ................................................................................................................. 52 Annotation of LTR sequences is still a difficult task
 .............................................................. 53 LTR sequences challenge the assembly of genomes
............................................................... 54 The functional role of LTR sequences is largely 
unknown
 ..................................................... 55 The ambiguous role of SVs for domestication
 ....................................................................... 56 REFERENCES
 ......................................................................................................................... 58 viii
  LIST OF 
TABLES
 Table 4.1 SNP discordance rate between this study and other studies.
 ....................................... 39 Table 4.2 Grouping of 3,485 rice accessions.
 ............................................................................ 40  !ix  LIST OF 
FIGURES
 Figure 1.1 Domestication models adapted from Sang and Ge (2007).
 ........................................ 15 Figure 1.2 Schematics of the single
- (A) vs. double
- (B) founder models adapted from Molina 
et al.
 (2011).
 ................................................................................................................................. 16 Figure 4.1 Truth sensitivity of tranches in Variant Quality Score Recalibration (VQSR)
....
....... 
41 Figure 4.2 SNP filtering based on Variant Quality Score Recalibration (VQSR).
....................... 42 Figure 4.3 Distribution of imputation R
2 in minor allele frequency (MAF) bins (0.05 per bin).
 . 43 Figure 4.4 K
-means clustering of 3,485 cultivated and wild rice accessions with (A) whole
-genome 4D SNPs and (B) whole
-genome random SNPs.
 .......................................................... 44 Figure 4.5 Determination of the number of populations.
 ............................................................ 45 Figure 4.6 The population structure of 3,485 rice accessions revealed by various K.
 ................. 46 Figure 4.7 Population structure of 3,485 rice accessions.
 ........................................................... 47  !1  CHAPTER 1
 Introduction 
of Rice Domestication Models, Current Theories, Gaps 
and Debates
  !2  Rice domestication was a long
-lasting procedure
 Asian rice (
Oryza sativa
 L.) is one of the most important staple crops, which has two main 
ecotypes, 
indica 
and 
japonica
. The origin of rice is still in debate due to its complex 
domestication and demographic history. In the late 19
th century, India was considered to be the 
origi
n of rice based on its enriched diversity in India. In 1928, Kato and colleagues proposed that 
rice was domesticated from India and Japan, and named the subspecies 
indica
 and 
japonica
, respectively 
(Kato and Hara 1928)
. Meanwhile, Ding (1929) reported a wild rice lineage 
discovered in Guangdong, China 
(Ding 1929)
, unveiling the possibility that cultivated rice might 
be originated from China. In recent years, archaeological excavations f
rom the Yangzi River 
valley in China show that at the late Pleistocene (12,000 BP 
Ð 11,000 BP), people had started 
collecting seeds from wild rice 
(Li
 et al.
 2016). The Shangshan site (11,000 BP 
Ð 9,000 BP
) in 
the Yangzi basin unearthed charred rice husks and leaves in pottery shreds, as well as sickle
-shaped stoneware, indicating that ancient human might have started to harvest rice as a crop 
(Li
 et al.
 2016). One of the most apparent changes from wild rice to cultivated rice was the reduction of 
seed shattering.
 Non
-shattering rice was featured by the existence of Ònon
-brittle rachisÓ that 
occurred at the abscission layer of spikelet 
(Li
 et al.
 2016). Analysis of abscission layers of 
ancient rice grains from the Kuahuqiao site (8,200 BP 
Ð 7,200 BP) and the Hemudu site (7,000 
BP 
Ð 6,500 BP) in the Yangzi basin indicated 
the proportion of non
-shattering rice increased 
from 42% to 51% in 800 years 
(Li
 et al.
 2016). These data indicate that selection of the non
-shattering trait could be a long
-lasting procedure. Although 
japonica
 rice generally has more 
typical abscission layers, it is difficult to determine the rice subspecies bas
ed on this. The change 
of grain size is also one of the key domestication syndromes. 
Indica
 has longer grains while 
!3  japonica
 generally has short grains. Thus, another attempt to differentiate 
indica
, japonica
, initial 
domesticates, and wild rice was to ana
lyze the grain size 
(Li
 et al.
 2016)
. Liu 
et al.
 (2007) 
compared the length, width, length/width ratio, and length*width of rice grain remnants 
unearthed from the aforementioned historical sites. However, large variations among these 
measurements were found, revealing ambiguous or mixed rice types, indicating tha
t rice 
domestication was a non
-linear procedure 
(Li
 et al.
 2016). Origins of key domestication traits and paradoxical observations
 Plant domestication involves in the transition of wild species into cultivars that have higher 
fitness under managed environment 
(Gepts 2004)
. Key domestication traits denote the initial and 
critical change from a wild species to a domesticated species. In rice, such transitions were 
known as the reduction of shattering for higher harvest rate 
(Li
 et al.
 2006), reduction of 
dormancy for more synchronized g
ermination 
(Sugimoto
 et 
al.
 2010)
, change from prostrate 
growth to erectness growth for more compact cultivation 
(Jin
 et al.
 2008), and change from loose 
panicle to 
compact panicle for less but larger grains 
(Bai
 et al.
 2016). Interestingly, genes 
respo
nsible for these key domestication syndromes, namely 
sh4
, Sdr4
-n, PROSTRATE 
GROWTH 1
 (PROG1
), and 
FRIZZLE PANICLE
 (FZP
), respectively, have identical genotypes in 
both 
indica
 and 
japonica
 (Li
 et al.
 2006; Jin
 et al.
 2008; Sugimoto
 et al.
 2010; Bai
 et al.
 2016)
. Parsimoniously, these traits should be derived from a common ancestor, indicating a single 
domestication origin. However, many traits responsible for rice improvements were carrying 
different ge
notypes between 
indica 
and 
japonica
, indicating different origins. For example, the 
Big Grain 2
 (BG2
) gene encodes a cytochrome P450 which could influence grain size and yield 
(Xu
 et al.
 2014)
. The coding region of 
BG2
 holds eight polymorphic sites between Nipponbare, a 
japonica
 variety, and 93
-11, an 
indica 
variety, which were seemingly inherited from different 
!4  wild rice ecotypes 
(Xu
 et al.
 2014). For the 
NRT1.1B
 gene responsible for nitrate
-use efficiency 
(NUE), a functional polymorphism (FNP) was found to be responsible for the NUE divergen
ce between 
indica
 and 
japonica
, which was also inherited from different wild rice ecotypes 
(Hu
 et 
al.
 2015). Phylogeographic analyses of three genes in a large collection of cultivated and wild 
rice revealed that 
indica
 might be originated from eastern India, while 
japonica
 might be 
originated from southern China 
(Londo
 et al.
 2006)
. The parsimonious explanation for the 
abundance of subspecies
-specific polymorphisms is consistent with a 
double
- or multiple
-origin 
hypothesis.
 Hypotheses of rice domestication
 To reconcile the paradoxical origins between key domestication traits and improvement traits, 
two evolutionary models were proposed, which are the snow
-balling model and the combinati
on model (
Figure 1.1
) (Sang and Ge 2007)
. In the snow
-balling model, a lineage of wild rice wa
s initially domesticated, then key domestication genes were transferred to other wild rice by 
hybridization (
Figure 1.1
A). Further domestication of the hybridized wild rice eventually 
generated the current cultivars that have the same set of key domesticat
ion genes but apparently 
different genetic backgrounds (
Figure 1.1
A) 
(Sang and Ge 2007)
. This 
model agrees with the 
fact that 
japonica
 has undergone a severe bottleneck which denotes the initial domestication, 
while the subsequent hybridization with wild rice could partially recover the effective population 
size (
Ne) (Sang and Ge 2007; Vaughan 2008)
. The hybrid wild rice is proposed to be the 
progenitor of 
indica
 which had a weaker bottleneck 
(Sang 
and Ge 2007; Vaughan 2008)
. In the 
combination model, domestication of rice started in parallel, resulting in two or more initial 
domesticates which carried different key domestication genes (
Figure 1.1
B). Further 
hybridization and pyramiding resulted in fixation of a similar set of key genes in different 
!5  lineages (
Figure 1.1
B).  This model agrees with modern breeding principles such as gene 
pyramiding and also supported by phylogeographic studie
s which revealed multiple gene pools 
for cultivated rice 
(Londo
 et al.
 2006). Vaughan 
et al.
 (2008) further extended the snow
-balling model to explain the existence 
of multiple rice groups 
(Vaughan 2008)
. In this model, rice was proposed to be initially 
domesticated at 30¡N where the Yangzi River valley
 is located and was the northern limit of 
common wild rice (
O. rufipogon
) (Vaughan 2008)
. The initial domesticate was transported by 
human activities and hybridized with local wild rice lineages 
(Vaughan 2008)
. The resulting 
hybrids contained most of the key domestication genes and complex introgressions due to human 
management 
(Vaughan 2008)
. Most admixture lines 
were extinct but some were passed on and 
became the contemporary divergent domesticated rice 
(Vaughan 2008)
. The evolution model for 
the 
semi
-dwarf 1
 (sd1
) gene proposed by Asano 
et al.
 (2011) was the extension of the 
combination model (
Figure 1.1
B). The 
sd1
 gene encodes a gibberellin biosynt
hetic enzyme and 
has two FNPs between 
indica 
and 
japonica
 (Asano
 et al.
 2011). The 
japonica
 allele has 
decreased nucleotide diversity and is considered to be domesticated, while the 
indica
 allele 
remains the wild type 
(Asano
 et al.
 2011)
. In this model, 
indica 
and 
japonica
 were domesticated 
from different wild rice lineages and thus carrying different gene backgrounds. During further 
domestication of 
japonica 
prototypes, the 
sd1 
mutant occurred and was quickly fixed in later 
generations 
(Asano
 et al.
 2011). Similarly, so
me rice genes like 
Wxb, qsw5
, and
 qsh1 
were also 
fixed in 
japonica
, but genes like
 rc, wx, gs3
, and
 badh2.1
 were not only fixed in 
japonica 
but 
also introduced into 
indica
, contributing to a range of shared domesticated gene alleles 
(Asano
 et 
al.
 2011).  !6  Genome
-wide studies of rice domestication
  Studying the domestication history of rice through a few genes could be biased. Due to 
complex demography, such as introgression, genetic drift, and bottleneck, evolution history of a 
gene could be 
different from that of the species, known as incomplete lineage sorting. To 
preclude biases from using specific genes, it is preferred to utilize large
-scale data and 
populations to dissect the domestication history of rice.
  One of the earliest attempts d
ates back to 1985 from Second 
(Second 1985)
, who studied 
isozyme polymorphism of 181 diverse rice and wild rice strains at 24 loci. Principal coordinate 
analysis indicated that isozymes of cultivated rice could be classified into two groups, 
indica
 and 
japonica 
(Second 1985)
. Further, Second pointed out t
hat most wild rice strains from Guangxi 
province (southwestern China) and Taiwan province (eastern China) were close to the 
japonica 
type, while strains from Guangdong province (southern China) were intermediate between 
japonica 
and 
indica 
types 
(Second 1985)
. These observations seem to support the snow
-balling 
model, with the intermediate
-type wild rice hybridized from 
japonica
 prototypes and local 
indica
-type wild rice. However, the relationship between 
japonica
-type wild rice from Guangxi 
and Taiwan is not revealed.
  In 2005, Garris and colleagues genotyped 169 nuclear SSRs and two chloroplast loci 
from 234 diverse rice accessions and revealed the popula
tion structure of rice 
(Garris
 et al.
 2005)
. The research presented five distinct groups with deep genetic structures, which are 
indica
, aus, tropical japonica
, temperate japonica
, and 
aromatic
. The 
first two groups are traditionally 
regarded as 
indica
, while the latter three groups are 
japonica
. The authors pointed out that such 
deep genetic structure might be caused by differentiated ancestral populations and the selfing 
behavior of rice 
(Garris
 et al.
 2005). Gao and Innan (2008) genotyped 60 microsatellites for 92 
!7  accessions of rice and wild rice for the measurement of genetic diversity 
(Gao and Innan 2008)
. A number of low
-diversity regions were found shared between 
indica 
and 
japonica
, which could 
not be explained by the independent domestication hypothesis. Thus, a model wi
th shared 
bottleneck was proposed, which was partly agreed with the snow
-balling model. However, Gao 
and InnanÕs model didnÕt explain the much severe bottleneck observed in 
japonica 
or the much 
larger effective population size observed in 
indica
. Caicedo 
et al.
 (2007) genotyped 111 randomly selected sequence tags for a diverse panel 
including 93 accessions of rice and wild rice and observed an excess of derived SNPs 
contributing to the U
-shape site
-frequency spectra (SFS) of cultivated rice 
(Caicedo
 et al.
 2007)
. Factors like neutrality, bottleneck, migration, and selective sweep were explored for the 
explanation of observed data. The combination of the bottl
eneck and selection effects could 
better explain the observed SFS of 
indica
 and 
tropical japonica
, while that of the 
O. rufipogon
 population could be well explained by the standard neutral model 
(Caicedo
 et al.
 2007)
. Similarly, Molina 
et al.
 (2011) genotyped 630 gene fragments from Chromosomes 8, 10, and 12 
of 56 accessions containing wild rice, 
indica
, and 
tropical japonica 
(Molina
 et al.
 2011), and 
observed skewed SFS of cultivated rice. Further modeling using joint frequency spectrum 
favored the one founder hypothesis over double founders either with or without t
he sweep effect 
(Molina
 et al.
 2011). This research provides support to the snow
-balling model (
Figure 1.1
A), 
howe
ver, suffers from sampling bias such as the lack of 
temperate japonica
, as well as other 
minor cultivated groups.
 The divergence time between 
indica 
and 
japonica 
genomes can be estimated using 
molecular clocks, such as long terminal repeat retrotransposons
 (LTR
-RTs). Vitte 
et al.
 (2004) 
identified 110 LTR
-RTs in 
indica
 and 
japonica
, of which 28 are syntenic between the two 
!8  ecotypes that dates their differentiation ca. 0.9 MYA 
Ð ca. 2.1 MYA (
! = 6.5 x 10
-9 per 
synonymous site per year) 
(Vitte
 et al.
 2004). These times are likely overestimated due to th
e usage of a lower mutation rate. In a newer study using the same method, the differentiation 
between 
indica 
and 
japonica 
is estimated ca. 0.6 MYA 
Ð ca. 0.25 MYA (
! = 1.3 x 10
-8 per bp 
per year) 
(Roulin
 et al.
 2010). These estimations seem to support the combination model (
Figure 
1.1
B), which, however, should be revisited using whole
-genome LTR
-RTs as they are readily 
available 
(International Rice Genome Sequencing 2005; Zhang
 et al.
 2016; Du
 et al.
 2017; Nie
 et 
al.
 2017). Whole
-genome resequencing dissects the domestication of rice
  Benefit from the advance of sequencing techniques, genome
-wide sequencing 
projects 
were 
carried out
 for large collections of rice accessions. Population genetic studies based on 
whole
-genome markers had advanced our understanding of rice domestication. He 
et al.
 (2011) 
sequenced 66 rice accessions spanning 
indica
, japonica
, and 
O. rufipogon
 using a pooling 
strategy in each population for 30X coverages 
(He
 et al.
 2011). Scans of genetic diversity (
"w) detected hundreds of low diversity regions (LDR) across the genome. Shared LDRs between 
indica
 and 
japonica
 showed very low genetic distances (
D statistics) and population 
differentiation (
FST), while the genome background showed the opposite trend 
(He
 et al.
 2011). Further, coalescent simulations revealed that the overlapped LDRs follows the sequential 
domestication model, while the genome background follows the independent domestication 
model 
(He
 et al.
 2011). These results reinforce the snow
-balling model (
Figure 1.1
A), that the 
initially domesticated rice hybridized with differ
ent ancestral wild rice ecotypes, which were 
then domesticated into diverse cultivated groups. He and colleagues also rediscovered several 
key domestication genes, i.e. 
sh4
 and 
PROG1
, as well as 13 additional domestication gene 
!9  candidates 
(He
 et al.
 2011). Xu 
et al.
 (2011) resequenced 40 varieties of cultivated rice, 15 
accessions of perennial wild rice 
O. rufipogon
, and 10 accessions of annual wild rice 
O. nivara 
for 1.5X
-20X and identified 6.5 mill
ion high
-quality SNPs 
(Xu
 et al.
 2011)
. Identification of 
LDRs based on the SNP set has identified important biological features as well as regions with 
unknown functions 
(Xu
 et al.
 2011)
.  In 2012, Huang and colleagues reported 8 million SNPs identified from the low
-depth 
(~2X) resequencing of a large rice population including 446 accessions of wild rice (
O. 
rufipogon
) and 1,083 cultivated rice varieties
 (Huang
 et al.
 2012). Phylogenetic reconst
ruction 
based on the SNP set indicated that 
indica
 was derived from the ancestral Or
-I wild rice ecotype, 
while 
japonica
 was derived from the ancestral Or
-IIIa ecotype 
(Huang
 et al.
 2012)
. FST statistics 
between rice populations showed that the common wild rice might have been differentiated 
before the domestication of rice, probably due to geographic isolations 
(Huang
 et al.
 2012)
. A 
severe bottleneck during the domestication of 
japonica
 was obser
ved, which might have 
contributed to the rapid fix of rare alleles that are responsible for phenotypic diversifications of 
rice subspecies 
(Huang
 et al.
 2012)
. Using nucleotide div
ersity, a total of 55 selective sweeps 
were identified, which may share the most recent common ancestor with the Or
-IIIa group 
located in the Pearl River basin of Guangxi (southwestern China) 
(Huang
 et al.
 2012). This result 
agrees with the isozyme study cond
ucted by Second (1985) 
(Second 1985)
. It becomes quite 
clear that the snow
-balling model is supported by most of the data (
Figure 1.1
A). To advance the 
model, Huang 
et al.
 (2012) conducted forward simulations with population parameters estimated 
from the SNP set. By comparing the simulated data to real data, only one model was found that 
could explain their observations 
(Huang
 et al.
 2012)
. In this model, 
jap
onica
 prototype was 
initially domesticated from ancestral 
O. rufipogon
 IIIa, which was later hybridized with the 
!10  ancestral group of 
O. rufipogon 
I in the southeast and eastern Asia, forming the prototype 
indica
 (Huang
 et al.
 2012). The 
japonica 
and 
indica 
prototypes were further domesticated into 
aromatic
, tropical japonica
, and 
temperate japonica
, and 
aus 
and 
indica
, respectively 
(Huang
 et al.
 2012).  Using the same SNP set from Huang 
et al. 
(2012) 
(Huang
 et al.
 2012)
, Civ⁄
! et al.
 (2015) 
proposed another domestication scenario 
(Civan
 et al.
 2015). First, Civ⁄
! and colleagues 
identified 
LDRs by comparing nucleotide diversity of 
indica 
(520 accessions), 
japonica 
(489 
accessions), and 
aus 
(30 accessions) to the whole wild rice population (446 accessions), 
respectively, recovering 31 LDR regions that are shared between cultivated rice ecotyp
es 
(Civan
 et al.
 2015). Principle component analysis (PCA) and neighbor
-joining clustering of these 
overlapping LDRs indicated that the three cultivated rice populations might be originated from 
different ancestral wild rice populations 
(Civan
 et al.
 2015). Phylogeographic analysis showed 
that 
japonica 
might be domesticated from ancestral wild rice populations in south China, 
indica 
might be domesticated in Indochina and the Brahmaputra valley, and 
aus 
might be domesticated 
in central India or Bangladesh 
(Civan
 et al.
 2015). Contrasting results fr
om Huang 
et al.
 (2012) 
and Civ⁄
! et al.
 (2015) indicate the intriguing and complex domestication history of rice. Using 
the same resequencing data published by Huang 
et al. 
(2012) 
(Huang
 et al.
 2012) and similar 
methodologies to Civ⁄
! et al.
 (2015) 
(Civan
 et al.
 2015) but different criteria, Choi 
et al.
 (2018) 
found supports to the snow
-balling model based on neighbor
-joining phylogenies of overlapping 
LDRs 
(Choi and Purugganan 2018)
.  Xie 
et al.
 (2015) resequenced a diversity
 panel containing 533 accessions of rice cultivars 
and combined with 950 accessions of rice cultivars from Huang 
et al. 
(2012)
 (Huang
 et al.
 2012) for the identification of breeding signatures 
(Xie
 et al.
 2015). Two putative heterotic groups were 
recognized from the 
indica 
subpopulation, which may have gone through independent breeding 
!11  efforts 
(Xie
 et al.
 2015)
. Cross
-population likelihood ratio test (XP
-CLR) among the two 
indica
 heterotic groups identified 200 regions spanning 7.8% of the rice genome that may be footprints 
of modern breeding efforts 
(Xie
 et al.
 2015). Genes like 
sd1
 which lead to the Ògreen revolutionÓ 
and 
Gn1a
, LP, and 
OsBRI1
 which involved in grain quality and yield were rediscovered as 
breeding footprints 
(Xie
 et al.
 2015). Clearly, artificial selection and local adaptation also play an 
important role in shaping the divergence of cultivated rice.
 The role of introgression
 Modern rice breeding involves hybridization between cultivars and pyramiding of 
beneficial 
genes, resulting in mosaic genetic backgrounds in offspring. It is likely that the domestication of 
rice is associated with hybridization and introgression, since many cultivated rice ecotypes share 
the same habitat. Choi 
et al. 
(2017) utilized 
the genome sequences of three rice subspecies 
(japonica
, indica
, and 
aus) to estimate the divergence and introgressions of cultivated rice 
(Choi
 et al.
 2017). Based on coalescence modeling, they estimated the 
japonica
 genome was split from 
O. rufipogon
 at ca. 13.1 
Ð ca. 24.1 thousand
-years ago (Ka), while the estimated split between 
indica
 and 
O. rufipogon
 was ca. 6.7 
Ð ca. 17.7 Ka 
(Choi
 et al.
 2017). Although the estimated 
split time of 
japonica 
is earlier than that of 
indica
, these estimates are substantially overlapping 
(13.1 
Ð 17.7 Ka), reveali
ng the difficulty in distinguishing the domestication origin of rice. The 
estimated split between 
aus and 
O. nivara
 (a wild rice ecotype that is hypothesized to share the 
most recent common ancestor with 
indica 
and 
aus) is ca. 1.7 
Ð ca. 9.1 Ka, indicating 
that 
aus is a 
more recently diversified rice subpopulation. Further, Choi and colleagues found evidence for 
significant gene flow from 
japonica 
to both 
indica 
(~17%) and 
aus 
(~15%) 
(Choi
 et al.
 2017), supporting the snow
-balling model.
 !12   Due to the selfing nature of cultivated rice, introgression among
 individual plants is 
relatively difficult. However, 
O. rufipogon
 has a higher rate of open pollination and also 
sympatric (share the same habitat) with cultivated rice, which is more likely to receive 
introgressions via pollens. Wang 
et al.
 (2017) identif
ied heavy introgressions from cultivated rice 
to wild rice by reanalyzing the resequencing data of 446 wild rice accessions generated by 
Huang 
et al.
 (2012) 
(Huang
 et al.
 2012) as well as 202 divers
e cultivated accessions 
(Wang
 et al.
 2017). Using supervised ancestry identification, Wang and colleagues found over 50% of wild 
rice accessions have more than 10% of their genome that are likely introduced from 
domesticated rice 
(Wang
 et al.
 2017)
. Further, many wild rice accessions prese
nted extent 
origins (50
-100%) from cultivated rice, indicating recent introgressions or feralization from 
cultivated rice 
(Wang
 et al.
 2017). This study shows that the contemporary wild rice may 
represent a hybrid swarm between domesticated rice and ancestral wild rice 
(Wang
 et al.
 2017). Because of these introgressions, the actual cultivated rice ancestor may have gone extinct. As a 
result, analyses based on contemporary wild rice samples should be carefully interpreted.
 Working models for rice domestication
 In summary, rice domestication
 was likely a continuous and complicated process involving 
demographic changes, natural selection, artificial selection, hybridization, and migration. It is 
possible to detect the ÒfootprintsÓ of such activities from genetic variations but unfortunately no
t all of them are recognizable, and some could be masked by others. Due to the complexity of the 
process, the exact path of rice domestication is still unclear despite tremendous efforts. Based on 
the current studies, the domestication of rice was likely f
ollowing the snowballing model, but the 
possibility of the combination model is not excluded (
Figure 1.1
). Such dilemma is likely created 
!13  by a mixture of evolutionary drivers other than positive selection, such as demography and 
introgression.
 In aware of other evolutionary drivers, we propose revised scenarios for both 
domestication models. For the single
-founder model, the prototype of cultivated rice was 
domesticated from the ancestral population of 
O. rufipogon
, which accumulate key 
domestic
ation genes shared in cultivated rice (
Figure 1.2
A). The ancestral 
O. rufipogon
 is then 
diverged into different ecotypes (i.e., Or
-I and Or
-IIIa), likely due to genetic drift and geographic 
isolation (
Figure 1.2
A). Due to close distribution between wild ri
ce and cultivated rice, different 
O. rufipogon
 ecotypes received introgressions from sympatric cultivated rice, resulting further 
divergence of these wild rice ecotypes and their genetic similarity to cultivated rice (
Figure 
1.2
A). Introgressions between c
ultivate groups are also likely due to breeding activities (
Figure 
1.2
A). For the multi
-founder model, the divergence of ancestral 
O. rufipogon 
predated the 
domestication of rice (
Figure 1.2
B). Independent domestications of different wild rice ecotypes 
(i.
e., Or
-I and Or
-IIIa) created 
indica 
and 
japonica
, with further divergent genetic backgrounds 
(Figure 1.2
B). Introgressions between early 
indica 
and 
japonica 
domesticates result in the shared 
key domestication genes as we observed from data (
Figure 1.2
B). 
Besides, introgressions from 
cultivated rice to sympatric wild rice ecotypes further shaped the divergence of contemporary 
wild rice ecotypes. Further studies will reveal the roles of these evolutionary drivers. With novel 
strategies and more comprehensive
 analyses, our understanding about rice domestication will be 
further advanced.
  !14  APPENDIX
!15   Figure 1.
1 Domestication models adapted from Sang and Ge (2007).
 (A) snow
-balling model; 
(B) combination model. Squares represent founder wild rice; Rounded 
rectangle represent initial domesticates; circles represent modern rice subspecies. Differences in 
shadings indicate genomic divergence. Triangle, diamond, and ellipse represent critical 
domestication alleles that are observed in modern cultivars. Double
-headed arrows indicate 
hybridization. Single arrowheads point to the progress of domestication.
   !"!16   Figure 1.
2 Schematics of the single
- (A) vs. double
- (B) founder models adapted from Molina 
et 
al.
 (2011).
 (A) In the single domestication event, the 
O. rufipogon
 ancestral population was domesticated 
into a prototype of cultivated rice, which was further diversifie
d into 
indica 
and 
japonica
. Introgressions from cultivated rice into wild rice as well as geographic isolations facilitated the 
divergence of 
O. rufipogon
 into Or
-I and Or
-IIIa ecotypes. (
B) In the double
-founder model, the 
O. rufipogon
 ancestral populatio
n was diversified into Or
-I and Or
-IIIa ecotypes, which were 
domesticated independently and formed 
indica 
and 
japonica
, respectively. Symmetric 
introgressions between the populations are indicated by arrows.
  Wild rice
Or-IOr-IIIaInd
Jap
Wild rice
Or-IOr-IIIaInd
Jap
AB!17  REFERENCES
 !18  REFERENCES
 Asano K, Yamasaki M, Takuno S, Miura K, Katagiri S, Ito T, Doi K, Wu J, Ebana K, 
Matsumoto T
 et al.
 2011. Artificial selection for a green revolution gene during 
japonica
 rice domestication. 
Proc Natl Acad Sci U S A
 108: 11034
-11039. Bai X, Huang Y, Mao D, Wen M, Zhang L, Xing Y. 2016. Regulatory role of 
FZP 
in the 
determination of panicle branching and spikelet formation in rice. 
Scientific Reports
 6: 19022. Caicedo AL, Williamson SH, Hernandez RD, Boyko A, Fledel
-Alon A, York TL,
 Polato NR, 
Olsen KM, Nielsen R, McCouch SR
 et al.
 2007. Genome
-wide patterns of nucleotide 
polymorphism in domesticated rice. 
PLoS Genet
 3: 1745
-1756. Choi JY, Platts AE, Fuller DQ, Hsing Y
-I, Wing RA, Purugganan MD. 2017. The Rice Paradox: 
Multiple Origins but Single Domestication in Asian Rice. 
Mol Biol Evol
 34: 969
-979. Choi JY, Purugganan MD. 2018. Multiple Origin but Single Domestication Led to 
Oryz
a sativa
. G3: Genes|Genomes|Genetics
 8: 797
-803. Civan P, Craig H, Cox CJ, Brown TA. 2015. Three geographically separate domestications of 
Asian rice. 
Nat Plants
 1: 15164.
 Ding Y. 1929. Textual research of crop names [in Chinese]. 
Nongsheng
: 5
-9. Du H, Yu 
Y, Ma Y, Gao Q, Cao Y, Chen Z, Ma B, Qi M, Li Y, Zhao X
 et al.
 2017. Sequencing 
and 
de novo
 assembly of a near complete 
indica 
rice genome. 
Nat Commun
 8: 15324.
 Gao LZ, Innan H. 2008. Nonindependent domestication of the two rice subspecies, 
Oryza sativa
 ssp. 
indica 
and ssp. 
japonica
, demonstrated by multilocus microsatellites. 
Genetics
 179: 965-976. Garris AJ, Tai TH, Coburn J, Kresovich S, McCouch S. 2005. Genetic structure and diversity in 
Oryza sativa
 L. 
Genetics
 169: 1631
-1638. Gepts P. 2004. Crop domes
tication as a longterm selection experiment. In 
Plant Breeding 
Reviews
, Vol 24 (ed. J Janick). John Wiley & Sons, Inc.
 !19  He Z, Zhai W, Wen H, Tang T, Wang Y, Lu X, Greenberg AJ, Hudson RR, Wu CI, Shi S. 2011. 
Two evolutionary histories in the genome of rice:
 the roles of domestication genes. 
PLoS 
Genet
 7: e1002100.
 Hu B, Wang W, Ou S, Tang J, Li H, Che R, Zhang Z, Chai X, Wang H, Wang Y
 et al.
 2015. 
Variation in 
NRT1.1B 
contributes to nitrate
-use divergence between rice subspecies. 
Nat 
Genet
 47: 834
-838. Huan
g X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W
 et al.
 2012. 
A map of rice genome variation reveals the origin of cultivated rice. 
Nature
 490: 497
-501. International Rice Genome Sequencing P. 2005. The map
-based sequence of the rice genome. 
Nature
 436: 793
-800. Jin J, Huang W, Gao JP, Yang J, Shi M, Zhu MZ, Luo D, Lin HX. 2008. Genetic control of rice 
plant architecture under domestication. 
Nat Genet
 40: 1365-1369. Kato S, Hara HK. 1928. Classification of improved rice varieties. 
J of Kyushu Imperial 
University of Agriculture
: 132
-147. Li C, Zhou A, Sang T. 2006. Rice domestication by reducing shattering. 
Science
 311: 1936
-1939. Li L, Lee G
-A, Leping J, Ju
zhong Z. 2016. Evidence for the early beginning (c. 9000 cal. BP) of 
rice domestication in China: a response. 
The Holocene
 17: 1059
-1068. Londo JP, Chiang YC, Hung KH, Chiang TY, Schaal BA. 2006. Phylogeography of Asian wild 
rice,
 Oryza rufipogon
, reveals 
multiple independent domestications of cultivated rice, 
Oryza sativa
. Proc Natl Acad Sci U S A
 103: 9578
-9583. Molina J, Sikora M, Garud N, Flowers JM, Rubinstein S, Reynolds A, Huang P, Jackson S, 
Schaal BA, Bustamante CD
 et al.
 2011. Molecular evidence f
or a single evolutionary 
origin of domesticated rice. 
Proc Natl Acad Sci U S A
 108: 8351
-8356. Nie SJ, Liu YQ, Wang CC, Gao SW, Xu TT, Liu Q, Chang HL, Chen YB, Yan PC, Peng W
 et al.
 2017. Assembly of an early
-matured 
japonica
 (Geng) rice genome, Suijing18, based on 
PacBio and Illumina sequencing. 
Sci Data
 4: 170195.
 !20  Roulin A, Chaparro C, Piegu B, Jackson S, Panaud O. 2010. Paleogenomic analysis of the short 
arm of chromosome 3 reveals the history of the African and Asian proge
nitors of 
cultivated rices. 
Genome Biol Evol
 2: 132
-139. Sang T, Ge S. 2007. The puzzle of rice domestication. 
J Integr Plant Biol
 49: 760
-768. Second G. 1985. Evolutionary relationships in the 
Sativa 
group of 
Oryza 
based on isozyme data. 
G”n”tique, S”lect
ion, …volution
 1: 89
-114. Sugimoto K, Takeuchi Y, Ebana K, Miyao A, Hirochika H, Hara N, Ishiyama K, Kobayashi M, 
Ban Y, Hattori T
 et al.
 2010. Molecular cloning of 
Sdr4
, a regulator involved in seed 
dormancy and domestication of rice. 
Proc Natl Acad Sci U
 S A
 107: 5792
-5797. Vaughan DA, Lu Bao
-Rong, Tomooka, Norihiko. 2008. The evolving story of rice evolution. 
Plant Sci
 174: 394
Ð408.
 Vitte C, Ishii T, Lamy F, Brar D, Panaud O. 2004. Genomic paleontology provides evidence for 
two distinct origins of Asian 
rice (
Oryza sativa
 L.). 
Mol Genet Genomics
 272: 504
-511. Wang H, Vieira FG, Crawford JE, Chu C, Nielsen R. 2017. Asian wild rice is a hybrid swarm 
with extensive gene flow and feralization from domesticated rice. 
Genome Res
 doi:10.1101/gr.204800.116.
 Xie W
, Wang G, Yuan M, Yao W, Lyu K, Zhao H, Yang M, Li P, Zhang X, Yuan J
 et al.
 2015. 
Breeding signatures of rice improvement revealed by a genomic variation map from a 
large germplasm collection. 
Proc Natl Acad Sci U S A
 112: E5411
ÐE5419.
 Xu F, Fang J, Ou S,
 Gao S, Zhang F, Du L, Xiao Y, Wang H, Sun X, Chu J
 et al.
 2014. 
Variations in 
CYP78A13 
coding region influence grain size and yield in rice. 
Plant Cell 
Environ
 38: 800
-811. Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, Dong Y, Gutenkunst RN, Fang L, Huang L
 et al.
 2011. Resequencing 50 accessions of cultivated and wild rice yields markers for 
identifying agronomically important genes. 
Nat Biotechnol
 30: 105
-111. Zhang J, Chen LL, Sun S, Kudrna D, Copetti D, Li W, Mu T, Jiao WB, Xing F, Lee S
 et al.
 2016. 
Build
ing two indica rice reference genomes with PacBio long
-read and Illumina paired
-end sequencing data. 
Sci Data
 3: 160076.
 !21  CHAPTER 2
 LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long 
Terminal Repeat Retrotransposons
   The work presented in this chapter is part of the final publication:
  Ou S, Jiang N. 2018. LTR_retriever: A Highly Accurate and Sensitive Program for Identification 
of Long Terminal Repeat Retrotransposons. 
Plant Physiol
 176: 1410
-1422.  !22  Abstract
 Long terminal repeat retrotransposons (LTR
-RTs) are prevalent in plant genomes. The 
identification of LTR
-RTs is critical for achieving high
-quality gene annotation. Based on the 
well
-conserved structure, multiple programs were developed for the 
de novo identification of 
LTR
-RTs; however, these programs are associated with low specificity and high false discovery 
rates. Here, we report LTR_retriever, a multithreading
-empowered Perl program that identifies 
LTR
-RTs and generates high
-quality LTR lib
raries from genomic sequences. LTR_retriever 
demonstrated significant improvements by achieving high levels of sensitivity (91%), specificity 
(97%), accuracy (96%), and precision (90%) in rice (Oryza sativa). LTR_retriever is also
 compatible with long sequ
encing reads. With 40k self
-corrected PacBio reads equivalent to 4.5
"
 genome coverage in Arabidopsis (
Arabidopsis thaliana
), the constructed LTR library showed 
excellent sensitivity and specificity. In addition to canonical LTR
-RTs with 5'
-TGÉCA
-3' termini
, LTR_retriever also identifies noncanonical LTR
-RTs (non
-TGCA), which have been 
largely ignored in genome
-wide studies. We identified seven types of noncanonical LTRs from 
42 out of 50 plant genomes. The majority of noncanonical LTRs are
 Copia
 elements, w
ith which 
the LTR is four times shorter than that of other
 Copia
 elements, which may be a result of their 
target specificity. Strikingly, non
-TGCA
 Copia
 elements are often located in genic regions and 
preferentially insert nearby or within genes, indicatin
g their impact on the evolution of genes and 
their potential as mutagenesis tools.
  For a full text of this work, please visit 
http://www.plantphysiol.org/content/176/2/1410.long
.  !23  CHAPTER 3
 Assessing Genome Assembly Quality Using 
the L
TR Assembly Index (L
AI)   The work presented in this chapter is part of the final publication:
  Ou S, Chen J, Jiang N. 2018. Assessing genome assembly quality using the LTR Assembly 
Index (LAI). 
Nucleic Acids Res
 46: e126.
  !24  Abstract
  Assembling a plant genome is challenging due to the abundance of repetitive sequences, 
yet no standard is available to evaluate the assembly of repeat space. LTR retrotransposons 
(LTR
-RTs) are the 
predominant interspersed repeat that is poorly assembled in draft genomes. 
Here, we propose a reference
-free genome metric called LTR Assembly Index (LAI) that 
evaluates assembly continuity using LTR
-RTs. After correcting for LTR
-RT amplification 
dynamics,
 we show that LAI is independent of genome size, genomic LTR
-RT content, and gene 
space evaluation metrics (i.e., BUSCO and CEGMA). By comparing genomic sequences 
produced by various sequencing techniques, we reveal the significant gain of assembly 
continu
ity by using long
-read
-based techniques over short
-read
-based methods. Moreover, LAI 
can facilitate iterative assembly improvement with assembler selection and identify low
-quality 
genomic regions. To apply LAI, intact LTR
-RTs and total LTR
-RTs should cont
ribute at least 
0.1% and 5% to the genome size, respectively. The LAI program is freely available on GitHub:
 https://github.com/oushujun/LTR_retriever
.  For a full text of this work, please visit
 https://academic.oup.com/nar/article/46/21/e126/5068908
.  !25  CHAPTER 4
 A Fine Map 
of Rice Domestication Revealed 
by 3,485 Cultivated 
and Wild Accessions
  !26  Introduction
 Asian 
rice (
Oryza sativa 
L.) is a staple crop for over half of the human population 
(Juliano and 
Institute 1993)
. The earliest archeology record for rice harvesting dates back to ca. 14,000 
Ð ca. 
18,000 years ago in southern China 
(Lu 2006; Boaretto
 et al.
 2009)
. However, the exact origin of 
cultivated rice is still debatable. Archaeological studies reveal that rice domestication may have 
begun ca. 9,400 years ago in the Lower Yangtze basin 
(Jiang and Liu 2006; Zuo
 et al.
 2017), while genetic and phylogeographic studies suggest possible origins in southern China, eastern 
India, and Indochina 
(Londo
 et al.
 2006; Huang
 et al.
 2012; Civan
 et al.
 2015; Choi and 
Purugganan 2018)
. Rice has two main ecotypes or subspecies, 
indica 
and 
japonica
, which have been highly 
diversified and adapted to different habitats 
(Garris
 et al.
 2005; Huang
 et al.
 2012; Wang
 et al.
 2018). Two other regional ecotypes, 
aromatic 
and 
aus, are recognized by molecular genetic
 studies
 (Garris
 et al.
 2005; Huang
 et al.
 2012; Wang
 et al.
 2018) and mainly colonized in India, 
Bangladesh, and Nepal with unique adaptations to harsh environments. Hypotheses for the origin 
of these diverse rice ecotypes include 
single
-origin, dual
-origin, and multiple
-origin models, with 
conflicting observations for the monophyletic key domestication genes (e.g., 
sh4
, prog1
, and 
sd-1) and the polyphyletic genomic background. The common wild rice (
O. rufipogon 
Griff.) is 
known as 
the direct ancestor of cultivated rice. However, 
O. rufipogon 
has been diversified into 
multiple ecotypes, which 
have been
 proposed to be ancestors of different cultivated rice ecotypes 
(Huang
 et al.
 2012; Civan
 et al.
 2015)
. Besides, intensive interspecific and intraspecific 
admixtures between cultivated and wild rice ecotyp
es are also noted 
(Huang
 et al.
 2012; Choi
 et 
al.
 2017; Wang
 et al.
 2017)
, making the puzzle of rice domestication even more complicated.
 !27   Since the initial sequencing of the rice genome 
(Goff
 et al.
 2002; Yu
 et al.
 2002; 
International Rice Genome Sequencing 2005)
, substantial amounts of resequencing resources 
and efforts have been devoted to solving this domestication puzzle. With the boom of new 
algorithms, r
evisiting these questions by combining available resources may provide new 
insights. Here, we combined raw sequencing data of 3,024 cultivated rice accessions and 461 
wild rice accessions (hereafter, the 3Kw dataset) from the National Center for Biotechnol
ogy Information (NCBI). Remapping of these data allows unbiased evaluations of genetic diversity 
and demographic history among rice ecotypes. We reveal their complicated relationship by 
reconstruction of the phylogeny
 and 
the 
population structure
, providing new insight into rice 
domestication
. Results and Discussion
s Genome mapping, imputation, and filtering
 Information about the 3,485 cultivated and wild rice accessions (3Kw) used in this study are 
summarized in 
Supplementary Table 
4.1
. Raw reads were trimmed and mapped to the 
O. sativa
 ssp. 
japonica
 cv. Nipponbare MSU 7.0 reference genome 
(Kawahara
 et al.
 2013). A total of 46 
million raw variants were identified, which were filtered using a machine
-learning algorithm 
with 99.0% truth 
sensitivity (Materials and Methods) (
Figure 4.
1 and 
Figure 4.
2). Insertion, 
deletion, and multiallelic variants were classified as low
-confident variants and were not used in 
this study, retaining 21.9 million single
-nucleotide polymorphisms (SNPs). To rec
oncile data 
heterogeneity caused by variations in sequencing depth, missing genotypes were imputed and 
filtered with imputation R
2 #
 0.95 (
Figure 4.
3), resulting in 17.7 million SNPs that are highly 
concordant (97.4% 
- 99.9%) with four other datasets (
Tabl
e 4.1
) (McCouch
 et al.
 2016; Wang
 et 
!28  al.
 2017; Wang
 et al.
 2018)
. All following analyses were based on the 17.7 million high
-quality 
SNPs.
 Population structure and phylogeny
 Classifying the 3K
w population into subpopulations is possible and highly consistent with 
previous studies (
Figure 4.
4) (Huang
 et al.
 2012; Wang
 et al.
 2018). With 14 principal 
components, about 93% of genetic variations in the 3Kw population are explained (
Figure 4.
5B). 
To st
udy the ancestral components of the 3Kw population, we applied supervised estimation of 
individual ancestry from unlinked (
r2 $
 0.1) SNPs, with inferred ancestry (
K) varies from 1 to 50 
(Figure 4.
6). The cross
-validation error decreases as 
K increases, 
indicating a very complicated 
population structure (
Figure 4.
5A).
 To further dissect the relationship of 3Kw accessions and subpopulations, we used near
-neutral SNPs that are located in the third position of fourfold degenerate codons (4DSNP) to 
reconstruc
t the whole
-genome phylogeny of the 3Kw population (
Figure 4.
7C), which shows 
high consistency with its population structure (
Figure 4.
7D). We thus regroup the 3Kw 
population into 14 subpopulations based on previous groupings of the 3,000 Rice Genome 
(3KRG
) population 
(Wang
 et al.
 2018)
 and the wild rice population 
(Huang
 et al.
 2012) in 
considerations of the population structure (
K = 14) and the phylogeny (
Figure 
4.7), with group 
names consistent with previous studies (
Table 4.2
) (Materials and Methods).
  Annotation of subpopulations reveals their strong connections to geographic origin and 
pedigree status (
Figure 4.
7A, B). 
Aromatic
, aus, and 
indica
-2 groups are ma
inly from India, 
while 
indica
-1A and 
temperate japonica
 are mainly found in China with substantial breeding 
improvements (
Figure 4.
7A, B). In Indonesia, 
indica
-3B and 
tropical japonica
 are dominating, 
while the latter is also found in southeast Asia countr
ies such as Laos and Philippines, and in the 
!29  United States (
Figure 4.
7A). The s
ubtropical japonica
 group is mainly found in Laos and 
Thailand (
Figure 4.
7A). 
Indica
-1B, admixed 
indica
, and admixed 
japonica
 are distributed 
worldwide with extensive tracible pedigree records, which probably represent modern breeding 
efforts (
Figure 4.
7A, B). 
Indica
-3A are mainly traditional lines located in the Indochinese 
Peninsula (
Figure 4.
7A, B).
  We rooted the whole
-genom
e phylogeny of the 3Kw population using a small group of 
wild rice relatives
 (Figure 4.
7C, 
Supplementary Table 4.1
). After 1,000 bootstrap pseudo
-replications, we found high supporting rate for major branches (
Figure 4.
7C). Further, major 
cultivated rice e
cotypes, 
japonica
, indica
, and 
aus, are clustered in different branches, indicating 
parallel efforts for rice improvements (
Figure 4.
7C). Moreover, common wild rice ecotypes Or
-3A and 
japonica
 share most recent common ancestor, so do Or
-1 and 
indica
, and 
aus 
(Figure 
4.7C). While this may indicate independent domestication origin of the three rice ecotypes 
(Londo
 et al.
 2006; Civan
 et al.
 2015), other evolutionary processes such as population 
bottleneck and lateral gene transfer may also be the cause, which have been documented 
previously (see Xu 
et al. 
(2011) 
(Xu
 et al.
 2011), Huang 
et al. 
(2012) 
(Huang
 et al.
 2012), Wang 
et al.
 (2017) 
(Wang
 et al.
 2017), and Civ⁄
!
 and Brown (2018) 
(Civ⁄
! and Brown 2018)
). Other 
O. rufipogon
 ecotypes, Or
-2 and Or
-3B, are clustered with Or
-1 and placed between the 
indica 
and the 
japonica
 clades (
Figure 4.
7C).
 Materials and Methods
 Sampling and DNA sequences
 The sequences of 3,024 cultivated rice accessions were derived from the 3,000 Rice Genomes 
(3KRG) project with an average sequencing depth of 14
"
 (The Rice Genomes Project 2014)
. The 
collection represents a core set
 of the International Rice Germplasm Collection (IRGC) at the 
!30  International Rice Research Institute (IRRI), and the China National Crop Gene Bank (CNCGB) 
at the Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS). The wild 
rice data
 contain
 461 samples including 446 
O. rufipogon
 accessions and 15 accessions of 
outgroup species, which were sequenced by Huang 
et al.
 (2012) with an average sequencing 
depth of 1.5
"
 (Huang
 et al.
 2012)
. The wild rice access
ions were collected and preserved at the 
China National Rice Research Institute (CNRRI) in Hangzhou, China, and the National Institute 
of Genetics (NIG) in Mishima, Japan. The detailed information, including sequencing depth, 
country of origin, and ecotype
 grouping were adapted from relevant previous studies and 
summarized in 
Supplementary Table 4.1
. Raw sequencing data of 3,485 cultivated and wild rice 
(3Kw) were downloaded from the NCBI Sequence Read Archive (SRA) 
(www.ncbi.nlm.nih.gov/sra/
) with project numbers PRJEB6180
 and PRJEB2829, respectively, 
consisting a total of 17.4 Terabases, representing a comprehensive dataset of genetic diversity in 
cultivated and wild rice.
 Read alignment and variant 
identification
 All reads were trimmed to remove low
-quality bases and adapters to improve mapping quality 
and variant call accuracy 
(Del Fabbro
 et al.
 2013)
. Pair
-end reads were trimmed separately for 
each end using cutadapt (v1.11) (
-f fastq 
-u 10 
-q 20,20 
--trim
-n -m 30 
-n 3) 
(Martin 2011)
, then 
re-paired using a custom PERL script (sync_PE_fastq_files.pl). Pair
-end and single
-end clean 
reads were aligned to the 
O. sativa
 ssp. 
japonica 
cv. Nipponbare (MSUv7) reference genome, 
separately, using BWA
-MEM (v0.7.12.r1044) (
-t 4 
-k 16 
-w 10 
-M) 
(Li 2013)
. Multiple 
alignment files of an accession, either single
-end or pair
-end, or from different sequencing runs, 
were merged into one file with PCR duplicates removed using Picard tools (v1.113) 
(Mer
geSamFiles.jar and MarkDuplicates.jar, 
respectively) 
(http://picardsourceforgenet
). To 
!31  improve call accuracy of insertion/deletion (Indel) va
riants, candidate indel regions were locally 
re-aligned using the Genom
e Analysis Toolkit (GATK v3.4.46) 
(McKenna
 et al.
 2010), then the 
GATK HaplotypeCaller is applied to call SNPs and indels simultaneously via 
de novo
 assembly 
of haplotype regio
ns (
-T HaplotypeCaller 
--genotyping_mode DISCOVERY 
-stand_call_conf 30 
--emitRefConfidence GVCF). The GATK joint genotyping algorithm using the multi
-sample 
aggregation of sequences was applied to call genotypes confidently. To filter out low
-quality and 
false variant calls, a machine
-learning based filter called Variant Quality Score Recalibration 
(VQSR) 
(McKenna
 et al.
 2010)
 was applied by using the rice dbSNP dataset 
(https://www.ncbi
.nlm.nih.gov/projects/SNP/
) to train classifiers (
-an DP 
-an QD 
-an FS 
-an 
SOR 
-an MQ 
-an MQRankSum 
-an ReadPosRankSum 
-an InbreedingCoeff 
-tranche 100.0 
-tranche 99.9 
-tranche 99.0 
-tranche 90.0). A total of 2,590,236 validated SNPs from the rice 
dbSNP dataset were used as true si
tes training input (
-resource:hapmap), while the other 
7,540,170 SNPs from the dbSNP database were used as known sites input (
-resource:dbsnp). 
Variants with the tranche truth sensitivity of 99% were marked as ÒPASSÓ, which were retained 
and subjected to h
ard filtering. For hard filtering, biallelic SNPs with read coverage less than 2 
(QD < 2.0), strand bias higher than 60.0 (FS > 60.0), mapping quality less than 40.0 (MQ < 40.0), 
mapping quality rank sum test score less then 
-12.5 (MQRankSum < 
-12.5), or r
ead position rank 
sum test score less than 
-8.0 (ReadPosRankSum < 
-8.0) were filtered out as low
-confident calls. 
Indels with QD < 2.0, FS > 200.0, or ReadPosRankSum < 
-20.0 were filtered out as low
-confident calls.
 Imputation of missing data
 After hard filtering, ambiguous and missing genotypes were imputed using Beagle (v4.1) 
(Browning and Browning 2016)
 with default parameters. A high
-density rice
 recombination map 
!32  from Huang 
et al.
 (2009) 
(Huang
 et al.
 2009)
 was converted based on the physical distance of 
SNPs and provided as a genetic map for imputation. To control for imputation quality, the allelic 
R2 and dosage R
2 representing the squared correlation between estimated and true reference 
allele dosage was 
applied, with SNP R
2 < 0.95 being removed. A total of 17.7 million SNPs were 
retained and were used for all subsequent analyses unless indicated.
 Estimation of variant discordance
 Imputed and filtered SNPs were evaluated using multiple public resources. Th
e IRRI
-released 32 
million SNP set of the 3K data 
(Wang
 et al.
 2018) and the 700K high
-density rice array (HDRA) 
data generated by McCouch 
et al.
 (2016) 
(McCouch
 et al.
 2016) were used to evaluate our 
cultivated rice SNP calls. For wild rice SNPs, the initial SNP set
 was called based on the IRGSP 
v4.0 reference genome 
(Huang
 et al.
 2012), which SNP coordinate was converted to the current 
reference (MSU v7) by Wang 
et al.
 (2017) 
(Wang
 et al.
 2017) based on syntenic regions and 
alignment of flanking sequences. We also performed 
de novo
 SNP identification using clean 
reads of all wild rice accessions with a Bayesian approach called Freebayes (v0.9.21) 
(Garrison 
2012). All four reference datasets were filtered with the same standard we applied on our dataset, 
which heterozygous and multi
-allelic SNPs were converted to the ambiguous state with further 
filtering on missing data.
 To calculate the pair
-wise discordance rate between two SNP sets, say, the reference 
dataset (
R) and our dataset (
T), assume 
M samples with 
N SNPs are present in both datasets. For 
sample 
i, if SNP 
j is unambiguously genotyped in both datasets, then this data point is counted as 
an informative site. For all informative SNPs in a chromosome, if 
Rij has the same genotype as 
Tij, then this SNP is 
counted as concordant, otherwise discordant.
 Based on these settings, 
we 
defined 
the chromosomal pair
-wise SNP discordance rate 
D as follows
. !33  !"#$%&'()&'$%&'()&'#*#$%&'+)&'#,#-../, where 
.0123 and 
.0425 For verification of cultivated rice S
NPs, the IRRI
-released 32 M SNP set of the 3K data 
(Wang
 et al.
 2018) downloaded from the Rice SNP
-Seek Database (
http://oryzasnp.org
) was 
used as the reference
 dataset. To control the quality of reference datasets, multi
-allelic SNPs 
were converted to the ambiguous state as we did in our dataset, while SNPs with missing data 
rates higher than 10% were removed. As a result, a total of 12,255,063 SNPs were retaine
d as 
informative sites (
Table 4.1
). We also used a second dataset generated using the 700K high
-density rice array (HDRA) 
downloaded from the Rice Diversity Database (
www.ricediversity.org
) (McCouch
 et al.
 2016)
, with 94 cultivated rice accessions overlapped with the 3K dataset, to estimate variant 
discordance. To control t
he SNP quality of the shared accessions, we only retained SNP calls 
that were saturated with array signals (Signal = 33.0) and labeled others as ambiguous. Multi
-allelic SNPs were also converted to the ambiguous state, while SNPs with missing data rates 
higher than 10% were removed. As a result, a total of 38,863 SNPs were retained as informative 
sites (
Table 4.1
). For the verification of wild rice SNPs, there is no proper reference dataset available due 
to the update of the reference genome. Wang 
et al.
 (2017) 
(Wang
 et al.
 2017) converted the wild 
rice SNP set released by Huang 
et al.
 (2012) 
(Huang
 et al.
 2012), from genome version IRGSP 
4.0 to MSU 7.0 (current version), based on genome synteny and alignment of flanking sequences. 
Such conversion may introduce errors due to
 the uncertainty of finding micro
-syntenic regions. 
Also, the correction of the reference genome sequence is not reflected in the converted SNP set 
due to the lack of remapping raw reads. We used this converted SNP set with the same quality 
filter applied 
on the IRRI SNP set, resulting in 2,060,264 informative SNPs (
Table 4.1
). !34  To provide further insight about the quality of our wild rice SNP set, we performed 
remapping of all wild rice clean reads to the current reference genome using a different method, 
called Freebayes (v0.9.21) 
(Garrison 2012)
, which adopts a Bayesian statistical framework 
for 
variant detection. To accelerate the inference of variants, we split the reference genome into 50
-Kb regions to run Freebayes 
(Garrison 2012)
 (--theta 0.003 
--ploidy 2 
--use
-best
-n-alleles 8 
--min-mapping
-quality 1 
--min-base
-quality 3 
--read
-max
-mismatch
-fraction 0.10 
--min-alternate
-fraction 0.15 
--min-alternate
-count 1 
--read
-snp
-limit 
10 --read
-indel
-limit 3), then combine the 
results afterwards. Then GATK (v3.4.46) 
(McKenna
 et al.
 2010) was used to perform hard
-filtering with similar parameters as we used on GATK Joint Genotyping results. Further 
conversion and filtering of SNPs were applied the same as we did on the IRRI SNP set, ex
cept 
allowing the missing data rate to reach up to 50% due to the low read depth (~1.5
"
). A total of 
3,518,552 informative SNPs were obtained in this experiment.
 Discordance rates of our dataset to filtered datasets from IRRI, HDRA, Wang 
et al.
 (2017), and
 Freebayes remapping are presented in 
Table 4.1
, which shows extremely low 
discordance rate (0.10%) between our cultivated rice SNP calls and the IRRI
-released 3K dataset, 
indicating high concordance (99.9%) of our data. Similarly, the pair
-wise concordance rate is 
very high (99%) between our cultivar 
rice SNPs and the cross
-platform array dataset (HDRA). A 
slightly higher discordance rate (2.65%) is observed between our wild rice SNP set and the 
coordinate
-converted dataset originated from Huang 
et al.
 (2012) 
(Huang
 et al.
 2012) (Table 4.1
), 
which is 
probably due to errors created
 by reference coordinate conversions and the difference 
between mapping algorithms in dealing with low depth sequencing data. Note that the GATK 
Joint Genotyping algorithm was developed to utilize sequencing information of the
 entire 
mapping population comparing to previous algorithms that treat samples independently 
!35  (McKenna
 et al.
 2010)
, which is particularly fit for low
-depth sequencing data. To rule out the 
discrepancies created by reference coordinate conversion, we further conducted remapping of the 
wild rice clean reads using a Bayesian method called Freebayes 
(Garrison 2012)
. The Freebayes
-derived wild rice SNP set shows a lower discordance rate (
1.14%) to our dataset, indicating the 
improvement of mapping algorithms and reference genome do help with more accurate variant 
calls with low
-depth data. In conclusion, both the cultivated and wild rice SNPs in our dataset 
demonstrate high concordance to 
other datasets.
 Population structure
 The whole
-genome 4D SNPs were used to study the population composition of the 3Kw dataset. 
SNPs were screened based on linkage (
r2), which was estimated using PLINK (v1.90b4.4) 
(Purcell
 et al.
 2007) in 50
-SNP windows with 10 SNPs per step, and markers with 
r2 > 0.1 were 
excluded (
--indep
-pairwise 50 10 0.1). The rice genetic map converted from Huang 
et al.
 (2009) 
(Huang
 et al.
 2009) was also supplied to PLINK 
(Purcell
 et al.
 2007)
 for the procedure of 
linkage pruning. A total of 22,909 SNPs were retained for estimation of individual ancestry usi
ng 
ADMIXTURE (v1.3.0) 
(Alexander and Lange 2011)
 with default parameters. We used 
K from 1 
to 50 to perform the estimation (
Figure 4.
6), with results visualized using CLUMPAK 
(Kopelman
 et al.
 2015)
 and Inkscape (
www.inkscape.org
). Principal component analysis (PCA) 
with the same data was performed using PLINK 
(Purcell
 et al.
 2007)
. The unsupervised 
k-means 
clustering were performed using the kmeans() function in R. To evaluate the randomness of the 
4D SNP set, we extracted a random subset of the genome, with a total of 113,
432 SNPs, to 
perform side
-by-side 
k-means clustering. Clustering of these two datasets are virtually the same 
(Figure 4.
4). In consideration of ADMIXTURE, PCA, whole
-genome phylogeny, and 
k-means 
clustering results, we determined that 
K = 14 is ÒoptimumÓ, 
which can explain the majority of 
!36  population structures and genetic variations of the 3,485 rice accessions (
Figure 4.
5, Figure 4.
6, Figure 4.
7). The further grouping of cultivated rice samples was mainly based on three sources: the 
original grouping of Wang 
et al.
 (2018) 
(Wang
 et al.
 2018), the whole
-genome phylogeny, and 
the ADMIXTURE results. For samples consistently grouped in the three sources, their grouping 
remained unchanged as in Wang 
et al.
 (2018) 
(Wang
 et al.
 2018), otherwise were grouped as 
Adx (admixture), Japadx (
japonica
-type admixture), or Indadx (
indica
-type admixture), 
whichever fit
s better to their placements in the phylogeny. The Ind
-3 group was further divided 
into Ind
-3A and Ind
-3B based on their split placement on the phylogeny (
Figure 4.
7C). Of the 
nine low
-quality samples identified in Wang 
et al.
 (2018) 
(Wang
 et al.
 2018)
, one was removed 
due to its extremely low depth (~0.3
"
) and the rest were marked as Adx in our grouping results. 
These lines were not included in subpopulation specific analyses.
  For the grouping of wild rice samples, the whole
-genome phylogeny and the admixture 
results were used. Samples placed unambiguously in both sources were grouped as Or
-1, Or
-2, 
and Or
-3 in agreement with the group names used in Huang 
et al.
 (2012) 
(Huang
 et al.
 2012), otherwise grouped as Oradx (
O. rufipogon
-type admixture). The Or
-3 group was further divided 
into Or
-3A and Or
-3B based on their split placement in the phylogeny and geographic origins 
(Figure 4.
7C). The 15 wild rice relatives, including one accession of 
O. barthii
, one accession of 
O. glaberrima
, two accessions of 
O. longistaminata
, and 11 accessions of 
O. meridionalis
 were 
independently grouped as Wild_relatives. Of the five 
O. glaberrima 
samples
 identified in the 3K 
dataset, two of them are clustered with the Wild_relatives group, increasing the group size to 17, 
while others are placed in cultivated rice groups, indicating certain degree of admixture, thus 
!37  labeled as Adx. Summary of the new grou
ping strategy for the 3,485 rice accessions was 
presented in 
Table 4.2
. Phylogenetic reconstruction
 We used S
NPs located in 4
-fold degenerate (4D) sites of the whole
-genome to reconstruct the 
phylogeny of the 3Kw population. After further screening out SNPs with minor allele frequency 
(MAF) < 1%, a total of 157,475 SNPs were retained. We first searched for the be
st-observed 
maximum likelihood (ML) tree using RAxML (v8.2.9) 
(Stamatakis 2014)
 with 20 independent 
attempts searching over 2,500 temporary trees (
-j --no-bfgs 
-f d 
-N 5 
-m ASC_GTRGAMMA 
--asc
-corr=lewis 
-o W3101). The GTR GAMMA model wa
s used with correction for 
ascertainment bias of SNP data by the Lewis model. Further, we conducted 10 independent 
searches to generate 1,000 rapid bootstrap trees using RAxML 
(Stamatakis 2014)
. The transfer 
bootstrap supporting value o
f each branch was summarized using BOOSTER (v0.1.1) 
(Lemoine
 et al.
 2018). One sample (W3101) of 
O. longistaminata
, which seems to be a distant relative to 
cultivated rice, was used as the outgroup to root all the trees. Similar
 methods were used to 
reconstruct the phylogeny
 of ancestral regions based on 1,345 4D SNPs that are located in shared 
non-introgression regions of Or
-1 and Or
-3A, with 1,000 rapid bootstrap pseudo
-replications. 
FigTree (v1.4.4) (
http://tree.bio.ed.ac.uk/software/figtree/
) was used to visualize the best
-observed ML tree with supporting values of major branches manually annotated using Inkscape 
(v0.92) (
www.ink
scape.org
). Acknowledgments
 We thank Drs. Xiaofan Zhou
 and 
Xing
-Xing Shen for valuable discussions. We thank Drs. 
Patrick Edger, David Lowry, and C. Robin Buell for critical reading of the manuscript.
  !38  APPENDIX
!39  Table 4.
1 SNP discordance rate between this study and other studies.
 Chromosome
 IRRI
1 HDRA
2 Wang 
et al.
 (2017)
3 Freebayes
4 Chr1
 0.11%
 1.09%
 NA5 1.09%
 Chr2
 0.10%
 0.90%
 2.60%
 1.07%
 Chr3
 0.10%
 0.83%
 2.44%
 1.06%
 Chr4
 0.10%
 1.23%
 2.56%
 1.16%
 Chr5
 0.09%
 0.92%
 2.53%
 1.09%
 Chr6
 0.10%
 1.15%
 2.78%
 1.13%
 Chr7
 0.10%
 0.97%
 2.68%
 1.16%
 Chr8
 0.09%
 0.97%
 2.62%
 1.13%
 Chr9
 0.10%
 0.89%
 2.66%
 1.11%
 Chr10
 0.11%
 0.89%
 2.73%
 1.20%
 Chr11
 0.10%
 0.84%
 2.70%
 1.25%
 Chr12
 0.10%
 1.42%
 2.79%
 1.23%
 Whole genome
 0.10%
 1.01%
 2.65%
 1.14%
 Informative SNP
 12,255,063
 38,863 
 2,060,264
 3,518,552
 1The IRRI
-released 32 M SNP set for 3,024 cultivated rice (
http://oryzasnp.org
). 2The 700K high
-density rice array (HDRA) from McCouch 
et al. 
(2016) with 
94 shared 
cultivated rice.
 3Coordinated converted from Huang 
et al. 
(2012) for 443 wild rice accessions.
 4Remapping of 461 wild rice accessions using Freebayes.
 5Only 47 SNPs were retained after filtering.
   !40  Table 4.
2 Grouping of 3,
485 rice accessions.
 Indica
-clade
 Size
 Japonica
-clade
 Size
 Or-clade
 Size
 Others
 Size
 Ind
-1A 196 Tej
 280 Or-1 86 Aus
 192 Ind
-1B 178 TrjA
 334 Or-2 103 Adx
 168 Ind
-2 254 TrjB
 87 Or-3A 67 Wild_relatives
 17 Ind
-3A 244 Aro
 73 Or-3B 40   Ind
-3B 128 Japadx
 127 Oradx
 150   Indadx
 761       Sum
 1761  901  446  377   !41   Figure 4.
1 Truth sensitivity of tranches in 
Variant Quality Score Recalibration (VQSR).
 (A) Truth sensitivity of four tranche categories, 0.90, 0.99, 0.999, and 1.00, were calculated in 
the VQSR procedure as shown in the x
-axis. The tranche specificity is represented by the SNP 
transition
-to-transversion ratio (Ti/Tv) as shown in the y
-axis. (B) Proportion of true positive (TP) 
SNPs in the four VQSR trenches (Y
-axis). The x
-axis denotes the number of SNPs in a unit of 
1000.   !"!42   Figure 4.
2 SNP filtering 
based on Variant Quality Score Recalibration (VQSR).
 A total of eight factors, DP (coverage), QD (quality by depth), FS (fisher strand), SOR (strand 
odds ratio), MQ (RMS mapping quality), MQRankSum (mapping quality rank sum test), 
ReadPosRankSum (read posi
tion rank sum test), and InbreedingCoeff (inbreeding coefficient), 
were used to train the multi
-dimensional machine
-learning classifier with 2
-dimention plots 
shown. The truth sensitivity cutoff was set to 0.99 for the classification of true (green) and fa
lse 
(purple) SNPs. 1 dot = 1 SNP.
   !43   Figure 4.
3 Distribution of imputation R
2 in minor allele frequency (MAF) bins (0.05 per bin).
 Outliers were plotted in grey dots outside each box
-plot. Each subplot represents a rice 
chromosome indicated in the subplot title. X
- and Y
-axes indicate MAF and imputation R
2, respectively. (A) raw imputation results. (B) after removal of SNPs with imputation R
2 < 0.95.
   !"!44   Figure 4.
4 K-means clustering of 3,485 cu
ltivated and wild rice accessions with (A) whole
-genome 4D SNPs and (B) whole
-genome random SNPs.
 Clustering of subpopulations were indicated on the figure. Admix, admixture; aro, 
aromatic
; aus, 
aus; ind
-I, 
indica
 group I; ind
-II, 
indica
 group II, ind
-III, indica
 group III, Or
-I, 
O. rufipogon
 group I; Or
-II, 
O. rufipogon
 group II; Or
-III, 
O. rufipogon
 group III; tej, 
temperate japonica
; trj, 
tropical japonica
. Labeling of subpopulations was based on published groupings from Huang et 
al. (2012) and Wang et 
al. (2018).
   !"#$!"#$$$!"#$$%&'
%"'
()*
$+,
("-
.-/
!"#$!"#$$$!"#$$%&'
%"'
()*
$+,
("-
.-/
01!45   Figure 4.
5 Determination of the number of populations.
 (A) Cross validation error (y
-axis) keeps decreasing when the number of populations (K, x
-axis) 
increases. (B) Percentage of variation explained (y
-axis) by e
ach principal component (PC, x
-axis). More than 93% of total genetic variation could be explained by the first 14 PCs.
   !"#!
!"#$
!"%!
!"%$
!"&!
!"&$
!'!#!%!&!$!()*++,-./01.20*3,4))*)
5!6'!6#!6%!6&!6$!6!$'!'$#!6,7.)0.20*3
8)0390:./,(*;:*3<32
=%6*>,7.)0.20*3
?@!46   Figure 4.
6 The population structure of 3,485 rice accessions revealed by various K.
 The x
-axis represents 
cultivated or wild rice accessions in the order the same as the whole
-genome phylogeny (
Figure 4.1
C). Different colors in a panel represent different ancestral 
compositions, with the height of color bars represents the whole
-genome percentage of that 
compo
sition.
  !47   Figure 4.
7 Population structure of 3,485 rice accessions.
 (A) Country origin of cultivated rice groups. (B) Germplasm status based on pedigree and 
knowledge (adapted from Wang et al. (2018)). (C) Maximum
-likelihood phy
logenetic 
reconstruction of 3,485 rice accessions with major branch support from 1,000 rapid bootstrap 
pseudo
-replicates. The tree was rooted with 
O. longsitaminata
. Branches were colored based on 
subpopulations with major groups indicated on the right, an
d those without group labels are 
admixed. Scale indicates branch length of 0.2. Ind, 
indica
; Or, 
O. rufipogon
; Aro, 
aromatic
; TrjA, 
tropical japonica
; TrjB, 
subtropical japonica
; Tej, 
temperature japonica
. (D) Population 
structure of all samples with K 
increases from 13 to 16. Each row represents a sample, which 
were ordered based on the phylogenetic tree in (C). Different colors in a panel represent different 
ancestral compositions.
  !"#$!"#
$%&!"#
$'&!"#
$%(!"#
$'(!"#
$)&*+
,-$',-$),-$%(,-$%&&-.
/-0&
/-0(
/10
2..3
!"45'%
!"45'6
!45'7
!45'8
!48  REFERENCES
 !49  REFERENCES
 Alexander DH, Lange K. 2011. Enhancements to the ADMIXTURE algorithm for individual 
ancestry estimation. 
BMC Bioinformatics
 12: 246.
 Boaretto E, Wu X, Yuan J, Bar
-Yosef O, Chu V, Pan Y, Liu K, Cohen D, Jiao T, Li S
 et al.
 2009. 
Radiocarbon dating of charcoal and bone collagen associated with early pottery at 
Yuchanyan Cave, Hunan Province, China. 
Proc Natl Acad Sci U S A
 106: 9595
-9600. Browning Brian
 L, Browning Sharon
 R. 2016. Genotype Imputation with Millions of Referen
ce Samples. 
Am J Hum Genet
 98: 116
-126. Choi JY, Platts AE, Fuller DQ, Hsing Y
-I, Wing RA, Purugganan MD. 2017. The Rice Paradox: 
Multiple Origins but Single Domestication in Asian Rice. 
Mol Biol Evol
 34: 969
-979. Choi JY, Purugganan MD. 2018. Multiple Ori
gin but Single Domestication Led to 
Oryza sativa
. G3: Genes|Genomes|Genetics
 8: 797
-803. Civ⁄
! P, Brown TA. 2018. Role of genetic introgression during the evolution of cultivated rice 
(Oryza sativa
 L.). 
BMC Evol Biol
 18: 57.
 Civan P, Craig H, Cox CJ, Brown
 TA. 2015. Three geographically separate domestications of 
Asian rice. 
Nat Plants
 1: 15164.
 Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. 2013. An Extensive Evaluation of Read 
Trimming Effects on Illumina NGS Data Analysis. 
PLoS ONE
 8: e85024.
 Garris A
J, Tai TH, Coburn J, Kresovich S, McCouch S. 2005. Genetic structure and diversity in 
Oryza sativa
 L. 
Genetics
 169: 1631
-1638. Garrison EM, Gabor. 2012. Haplotype
-based variant detection from short
-read sequencing. In 
arXiv:[q
-bioGN]
, Vol 1207.3907.
 Goff S
A, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, 
Varma H
 et al.
 2002. A draft sequence of the rice genome (
Oryza sativa 
L. ssp. 
japonica
). 
Science
 296: 92
-100. !50  Huang X, Feng Q, Qian Q, Zhao Q, Wang L, Wang A, Guan J, Fan 
D, Weng Q, Huang T
 et al.
 2009. High
-throughput genotyping by whole
-genome resequencing. 
Genome Res
 19: 1068-1076. Huang X, Kurata N, Wei X, Wang ZX, Wang A, Zhao Q, Zhao Y, Liu K, Lu H, Li W
 et al.
 2012. 
A map of rice genome variation reveals the origin o
f cultivated rice. 
Nature
 490: 497
-501. International Rice Genome Sequencing P. 2005. The map
-based sequence of the rice genome. 
Nature
 436: 793
-800. Jiang LP, Liu L. 2006. New evidence for the origins of sedentism and rice domestication in the 
Lower Yangzi River, China. 
Antiquity
 80: 355
-361. Juliano BO, Institute IRR. 1993. 
Rice in Human Nutrition
. Food and Agriculture Organization of 
the United Natio
ns.
 Kawahara Y, de la Bastide M, Hamilton J, Kanamori H, McCombie W, Ouyang S, Schwartz D, 
Tanaka T, Wu J, Zhou S
 et al.
 2013. Improvement of the 
Oryza sativa 
Nipponbare 
reference genome using next generation sequence and optical map data. 
Rice
 6: 1
-10. Kopelman NM, Mayzel J, Jakobsson M, Rosenberg NA, Mayrose I. 2015. Clumpak: a program 
for identifying clustering modes and packaging population structure inferences across K. 
Molecular ecology resources
 15: 1179
-1191. Lemoine F, Domelevo Entfellner JB, Wil
kinson E, Correia D, D⁄vila Felipe M, De Oliveira T, 
Gascuel O. 2018. Renewing FelsensteinÕs phylogenetic bootstrap in the era of big data. 
Nature
 556: 452
-456. Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA
-MEM. 
In 
arXi
v: [q
-bioGN]
, Vol 1303.3997v1.
 Londo JP, Chiang YC, Hung KH, Chiang TY, Schaal BA. 2006. Phylogeography of Asian wild 
rice,
 Oryza rufipogon
, reveals multiple independent domestications of cultivated rice, 
Oryza sativa
. Proc Natl Acad Sci U S A
 103: 9578
-9583. Lu TLD. 2006. The Occurrence of Cereal Cultivation in China. 
Asian Perspectives
 45: 129
-158. !51  Martin M. 2011. Cutadapt removes adapter sequences from high
-throughput sequencing reads. 
2011 17. McCouch SR, Wright MH, Tung C
-W, Maron LG, McNally KL, Fitzg
erald M, Singh N, 
DeClerck G, Agosto
-Perez F, Korniliev P
 et al.
 2016. Open access resources for genome
-wide association mapping in rice. 
Nat Commun
 7: 10532.
 McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, 
Altshuler D, Ga
briel S, Daly M
 et al.
 2010. The Genome Analysis Toolkit: a MapReduce 
framework for analyzing next
-generation DNA sequencing data. 
Genome Res
 20: 1297
-1303. Purcell S, Neale B, Todd
-Brown K, Thomas L, Ferreira Manuel
 A R, Bender D, Maller J, Sklar 
P, de
 Bakker Paul
 I W, Daly Mark
 J et al.
 2007. PLINK: A tool set for whole
-genome 
association and population
-based linkage analyses. 
Am J Hum Genet
 81: 559
-575. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post
-analysis of 
large phylogenies. 
Bioinformatics
 30: 1312
-1313. The Rice Genomes Project. 2014. The 3,000 Rice Genomes Project. 
GigaScience
 3: 7.
 Wang H, Vieira FG, Crawford JE, Ch
u C, Nielsen R. 2017. Asian wild rice is a hybrid swarm 
with extensive gene flow and feralization from domesticated rice. 
Genome Res
 doi:10.1101/gr.204800.116.
 Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, Li M, Zheng T, Fuentes RR, Zhang F
 et al.
 2018. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. 
Nature
 557: 43
-49. Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, Dong Y, Gutenkunst RN, Fang L, Huang L
 et al.
 2011. Resequencing 50 accessions of cultivated and wild rice yields marke
rs for 
identifying agronomically important genes. 
Nat Biotechnol
 30: 105
-111. Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X
 et al.
 2002. A 
draft sequence of the rice genome (
Oryza sativa
 L. ssp. 
indica
). 
Science
 296: 79
-92. Zuo X
, Lu H, Jiang L, Zhang J, Yang X, Huan X, He K, Wang C, Wu N. 2017. Dating rice 
remains through phytolith carbon
-14 study reveals domestication at the beginning of the 
Holocene. 
Proc Natl Acad Sci U S A
 114: 6486
-6491. !52  CHAPTER 5
 Concluding Remarks
  !53  Annotation of LTR sequences is still a difficult task
 It is known that LTR sequences are not conserved between species 
(Benachenhou
 et al.
 2013)
. For the genome of newly sequenced species, it is 
inefficient to annotate LTR sequences based on 
the LTR sequence library of other species. Currently, the standard LTR annotation process is (1) 
identify intact LTR retrotransposons (LTR
-RTs) from the genome, (2) construct a non
-redundant 
LTR sequence libra
ry based on intact LTR
-RTs, (3) homology
-based annotation of LTR 
sequences using the non
-redundant LTR library.
 In Chapter two, we developed a new computer algorithm, called LTR_retriever, based on 
the fine
-scale structure of LTR
-RTs, for identification of
 intact elements. Tests of prediction 
sensitivity, specificity, accuracy, and precision show that the program has achieved a very high 
level of all parameters (i.e., > 90%) in both monocotyledonous and dicotyledonous model 
species. The performance of LTR_r
etriever has surpassed similar programs and represents the 
latest advancement of LTR
-RT identification. However, since the construction of LTR library is 
solely based on the identifiable intact LTR
-RTs, if the assembly of these elements is not 
complete, th
e power of such LTR annotation scheme would be compromised. Especially, for 
inactive LTR
-RT families, due to the constant removal of intact LTR
-RTs by recombination, the 
copy number of these families is usually very low. Misassembling the intact element of
 these 
families has a much higher chance for incomplete LTR annotation. Furthermore, for extremely 
old LTR families (i.e., > 2 MYA), it is very likely that all of their intact versions have been 
destructed, resulting in very limited power to annotate the r
emaining sequence fragments.
 To alleviate the shortcoming of current annotation process, one possible way is to 
identify LTR fragments independent of intact LTR
-RTs. In Chapter two, I described the removal 
mechanism for LTR
-RT insertions. One of the end pr
oducts is the formation of solo LTR, which 
!54  harbors only one copy of the terminal repeat. Although the sequence of LTR is among the most 
diverse part of LTR
-RTs, the solo LTR still features with a number of micro
-structures that could 
be used for 
de novo
 identification. For example, during the integration of LTR elements, the 
target genome DNA would be cut open, leaving a 4 
Ð 6 bp target
-site duplications (TSDs) 
immediately flanking the LTR
-RT. The solo LTR would also carry TSDs. Further, the LTR 
usually con
tains a dinucleotide
-inverted repeat, also known as the terminal motif 5'
-TG..CA
-3', 
which is conserved between LTR
-RTs across kingdoms. For 6 
Ð 8 bp random sequences, the 
chance for two being matched is 0.02% 
- 0.002%, which is precise enough for the iden
tification 
of solo LTR candidates. Further features located in the LTR, such as the core promoter TATA
-box, the integrase recognition signal (5'
-TGTTRNR..YNYAACA
-3'), and the polyadenylation 
signal (5'
-AATAAA
-3') could be used to discriminate true LTR and 
false predictions 
(Benachenhou
 et al.
 2013)
. LTR sequences challenge the as
sembly of genomes
 Genome sequencing techniques have been advanced, and it is possible to obtain long reads (i.e., > 
20 Kb) that can effectively span the majority of LTR sequences, resulting in more continuous 
assemblies. In Chapter three, we developed a ne
w genome evaluation metric called LTR 
Assembly Index (LAI) based on the assembly of LTR
-RTs. It is clear that long
-read techniques 
can produce genome assemblies as continuous as those based on BAC
-by-BAC or Sanger whole
-genome shotgun techniques. However, 
we observed that the Arabidopsis genome, which is 
presumably the gold
-standard for genomic studies, has LAI scores (LAI = 14.9) lower than some 
of the long
-read
-based genomes (i.e., LAI > 20). While this could be explained by the limited 
LTR sequences (~7%
) in the Arabidopsis genome that dampen the accuracy of LAI as well as 
!55  the existence of unclosed physical gaps that decrease the assembly continuity, there may be other 
factors responsible for the lower LAI of Arabidopsis, such as the length of LTR sequenc
es.
  Virtually, the assembly continuity is affected by the length of LTR sequences, with 
longer elements more difficult to be spanned by sequencing reads. On the other hand, for 
genomes with similar continuity, shorter LTR
-RTs would result in lower LAI sco
res, and vice 
versa. We compared the length of LTR sequences between Arabidopsis and other genomes and 
found the Arabidopsis genome do harbor shorter internal regions in its LTR
-RTs (~ 5,200 bp). 
Interestingly, the internal region length of the genome with
 highest observed LAI (LAI = 32.2), 
the 
Sorghum bicolor
 genome, is comparatively longer (~ 7,300 bp). It is likely that the LAI 
measurement is slightly biased to genomes with longer LTR
-RTs. Further adjustments may be 
made based on more thorough understand
ings between the sizes of LTRs and that of internal 
regions.
 The functional role of LTR sequences is largely unknown
 Transposable elements (TEs) are known genetic Òdark mattersÓ in eukaryotic genomes. Certainly, 
LTR
-RTs are mutagens due to their transposi
tion activities. However, LTR
-RTs are still very 
mysterious in terms of molecular functions and evolutionary roles. A number of biological 
functions have been found altered by the insertion of LTR
-RTs. Interestingly, many of these 
functions were created du
e to the induced expression of downstream genes, including apical 
dominance of maize (
Zea mays
) (
tb1
) (Studer
 et al.
 2011; Tsiantis 2011)
, anthocyanin 
accumulation of Sicilian blood oranges (
Citrus sinensis
) (
Ruby
) (Butelli
 et al.
 2012), and 
enhanced aluminum tolerance (
OsFRDL4
) of rice (
Oryza sativa
) (Yokosho
 et al.
 2016). Considering the promoter function of the LTR region during the transcriptional activity of LTR 
elements 
(Benachenhou
 et al.
 2013)
, it is very likely that some LTR regions carry reg
ulatory 
!56  motifs that can also induce the expression of other genes. To this regard, the activity of LTR
-RTs 
may be one of the major drivers for the evolution of gene regulations and expressions.
  The evolution of genome structure is also affected by the amp
lification and removal of 
LTR
-RTs. Significant haplotype variations have been observed in the 
Selaginella lepidophylla
 genome 
(VanBuren
 et al.
 2018)
, which may also be the case in the genome of other species, or 
between g
enomes of different individuals of the same species. Such variations are likely driven 
by the recombination activity around LTR sequences. In maize, double
-strand breaks (DSBs) are 
predominately formed in 
Gypsy
 LTR elements 
(He
 et al.
 2017), which may result in 
chromosomal crossovers 
(He
 et al.
 2017). Both insertions of LTR
-RTs and genetic 
recombination would result in the change of chromosomal structure, such as structural variations 
(SVs), including insertions and deletions (Indels), copy
-numb
er variations (CNVs), presence
-and
-absence variations (PAVs), and segmental inversions. However, it is still challenging, both 
technically and financially, for researchers to obtain high
-quality SVs, especially large SVs (i.e., > 
100 bp), within a species.
 Luckily, with further advancement of sequencing techniques, it is 
getting cheaper and easier to sequence multiple accessions of a species. The relationship between 
LTR
-RT activity and SVs is one of the imminent questions to be answered in the near future.
 The ambiguous role of SVs for domestication
 Previous studies have been utilizing the abundant genetic resources to advance the understanding 
of crop domestication histories. Hundreds and thousands of crop accessions have been re
-sequenced to generate hig
h-density markers for such purpose. However, due to the limitation of 
sequencing read length (
$
 250 bp), the confidently identified genetic markers are usually shorter 
than 10 bp, with large SVs underrepresented. In Chapter four, I reanalyzed the resequenc
ing data 
of 3,485 rice accessions and generated 17.7 million high
-quality single
-nucleotide 
!57  polymorphisms (SNPs). The domestication history of rice is then inferred based on this dataset. 
We revealed the important role of standing variations for selection 
of beneficial traits during 
domestication. However, SVs are apparently the missing piece from the current model.
  Using 
de novo 
assemblies based on the re
-sequencing data of 3,010 rice accessions, 
Wang 
et al. 
(2018) identified more than 10,000 novel full
-length protein
-coding genes 
(Wang
 et 
al.
 2018), providing important understandings of genetic variations between rice varieties. 
However, the role of TEs is not revealed due
 to the limitation of sequencing length. 
Incorporating genome
-wide gene expression analyses, Hufford 
et al. 
(2012) showed that the 
magnitude of gene expression has been changed during the domestication of maize 
(Hufford
 et al.
 2012). Together with the possible enhancer function of LTR 
sequences, it is likely that LTR
-RTs 
have played a systemic role during crop domestication. Future studies incorporating large SVs, 
thorough LTR sequence annotation, and gene expression analyses, may help to piece the 
underlying connections.
  !58  REFERENCES
 !59  REFERENCES
 Benachenhou F, Sperber GO, Bongcam
-Rudloff E, Andersson G, Boeke JD, Blomberg J. 2013. 
Conserved structure and inferred evolutionary history of long terminal repeats (LTRs). 
Mob 
DNA
 4: 5.
 Butelli E, Licciardello C, Zhang Y, Liu J, Mackay S, Bailey P, Reforgiato
-Recupero G, Martin C. 
2012. Retrotransposons control fruit
-specific, cold
-dependent accumulation of 
anthocyanins in blood oranges. 
Plant Cell
 24: 1242
-1255. He Y, Wang M, 
Dukowic
-Schulze S, Zhou A, Tiang CL, Shilo S, Sidhu GK, Eichten S, 
Bradbury P, Springer NM
 et al.
 2017. Genomic features shaping the landscape of meiotic 
double
-strand
-break hotspots in maize. 
Proc Natl Acad Sci U S A
 114: 12231
-12236. Hufford MB, Xu X, va
n Heerwaarden J, Pyhajarvi T, Chia J
-M, Cartwright RA, Elshire RJ, 
Glaubitz JC, Guill KE, Kaeppler SM
 et al.
 2012. Comparative population genomics of 
maize domestication and improvement. 
Nat Genet
 44: 808
-811. Studer A, Zhao Q, Ross
-Ibarra J, Doebley J. 20
11. Identification of a functional transposon 
insertion in the maize domestication gene 
tb1
. Nat Genet
 43: 1160
-1163. Tsiantis M. 2011. A transposon in 
tb1 
drove maize domestication. 
Nat Genet
 43: 1048
-1050. VanBuren R, Wai CM, Ou S, Pardo J, Bryant D, Jia
ng N, Mockler TC, Edger P, Michael TP. 
2018. Extreme haplotype variation in the desiccation
-tolerant clubmoss 
Selaginella 
lepidophylla
. Nat Commun
 9: 13.
 Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, Li M, Zheng T, Fuentes RR, Zhang F
 et al.
 2018. Ge
nomic variation in 3,010 diverse accessions of Asian cultivated rice. 
Nature
 557: 43
-49. Yokosho K, Yamaji N, Fujii
-Kashino M, Ma JF. 2016. Retrotransposon
-Mediated Aluminum 
Tolerance through Enhanced Expression of the Citrate Transporter OsFRDL4. 
Plant 
Physiol
 172: 2327
-2336.