SEQUENCE ACQUISITION SPECIFICITY AND EVOLUTION OF TERMINAL SEQUENCES IN PLANT MUTATOR-LIKE ELEMENTS AND THE REPETITIVE SEQUENCE LANDSCAPE OF SACRED LOTUS (Nelumbo nucifera) By Ann Roselle Armenia Ferguson A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Plant Breeding, Genetics and Biotechnology – Horticulture – Doctor of Philosophy 2014 ABSTRACT SEQUENCE ACQUISITION SPECIFICITY AND EVOLUTION OF TERMINAL SEQUENCES IN PLANT MUTATOR-LIKE ELEMENTS AND THE REPETITIVE SEQUENCE LANDSCAPE OF SACRED LOTUS (Nelumbo nucifera) By Ann Roselle Armenia Ferguson Transposable elements (TE) are an integral part of eukaryotic genomes; therefore, their identification, characterization and analysis remain critical in genetic and evolutionary studies. The Mutator superfamily (MULEs) are DNA TEs characterized by long terminal inverted repeats (TIR) and includes a special group of elements, called Pack-MULEs, that carry genes/gene fragments. This study aims to understand the role of terminal sequences in transposition of MULEs and the factors involved in sequence acquisition by Pack-MULEs. In addition, the repetitive content and diversity of an ancient eudicot genome was characterized to better understand the evolutionary role of MULEs as well as other TEs in angiosperms. Analyses of MULEs in rice show that these elements also capture GC-rich intergenic sequences but at a much lower frequency than genes. The TIR-MULE type is predominantly involved in sequence acquisition and these elements are associated with long GC-rich TIRs which may be important in acquisition. Genes with known functions and genes with orthologs are overrepresented among parental genes of Pack-MULEs in rice, maize and Arabidopsis suggesting that Pack-MULE preferentially duplicate bona fide genes. Pack-MULEs selectively acquire and retain parental sequences through a combined effect of GC content and breadth of expression, with GC content playing a stronger role. Analysis of MULEs in maize, tomato, rice and Arabidopsis detected the formation of atypical MULEs and Pack-MULEs with multiple TIRs mostly located in tandem. The copy number of these atypical MULEs suggests their significant mobility while their tandem TIR sequences indicate sequence conservation. The successful amplification of the Pack-MULE, PM-ZIBP, demonstrates that MULEs with tandem TIRs are functional in transposition and duplication of gene sequences. Characterization of the repetitive sequence of sacred lotus (Nelumbo nucifera) shows that 50% of the genome is composed of recognizable transposable elements (TE). TE content and diversity show a comparable Copia and Gypsy LTR content, which is atypical among plants. Non-canonical LTR types comprise 15.6% of the total LTR content suggesting the need to consider other end types in annotation of LTR elements. Sacred lotus also contains the highest coverage and copy number of hAT elements among all sequenced genomes to date. The 1447 Pack-MULEs in the genome provide the first evidence for the GC acquisition preference by Pack-MULEs outside the grasses. ACKNOWLEDGEMENTS I extend special appreciation to my adviser, Dr. Ning Jiang, for her patience, mentoring, and for giving me an opportunity to learn the skills that I do now and to appreciate and be constantly in awe of those sequences in the genome many others consider as “junk”. I would like to thank all my committee members, Drs. James Hancock, Douglas Schemske and Shin-han Shiu for their support and guidance. Big thanks go to the past and current Jiang lab members (Veronica, Dongyan, Guozhu, Dongying, Dongmei and Stefan) for all the help in the lab and many wonderful discussions. Thanks to all the students, post-docs and faculty of “joint lab” and “rice group” for the insightful comments and discussions throughout the years, especially Robin Buell and Kevin Childs. I also thank the National Science Foundation and the PBGB program who have supported my research in the Jiang Lab. I would like to show gratitude to my wonderful family, from thousands of miles away, for the support and prayers throughout this long journey. I love you all! Thank you to my husband, Brett, for all the love, support and encouragement. I could not have done this all without you and Elliot!   iv   TABLE OF CONTENTS LIST OF TABLES …………………………………………………..………………………...viii LIST OF FIGURES ………………………………………………..………………….………..ix CHAPTER 1: Review of Literature …………...………………………………...…………..…1 Transposable Elements………….…………………………………………………….…………2 Class I TEs ……………………………………………………………………...…..……2 Class II TEs ……………………………………………………………………..….……3 Genome size variation…………….…………………..…………………...……….….…6 Regulation of transposable elements …………………………..………………….……7 Domestication of transposable elements ……………………………...…………..……8 Mutator Superfamily …………………………….……………………………….…………...…9 Transposition ………………………………………………………..….………………11 Target selection ………………………………………………….….……….…………11 Epigenetic regulation ……………………………………………......…………………12 Pack-MULEs ……………………………………………………………..…………….………13 Mechanism of acquisition………………………………………………………….…...14 Functional capacity ……………………………………………………...……………..16 Sacred Lotus …………………………………………………………………………..………..17 Outline for this dissertation …………………………………………………………..………18 REFERENCES ……………………………..……………….…………………………….……20 CHAPTER 2: Selective Acquisition and Retention of Genomic Sequences by Pack-MutatorLike Elements Based on GC Content and Breadth of Expression....……..…………34 Abstract ……………………………………………...………….………………………………35 Introduction……………..………………………………………………………………………36 Methods ………………...………………………………………………….……………………40 Identification of MULEs and Pack-MULEs ………………………………………….40 TIR and Sub-TIR Analyses ...…………………………………………………………41 Analysis of GC Content ………………………………………………………………. 42 Gene Functional and Expression Analyses ………………………………………….. 42 Age of Acquisition Events …………………………………………………………… 43 Results ………………………………………………...……...…………………………………44 Rice MULEs Preferentially Acquire Genic Sequences ………………………………44 Structural Differences between Pack-MULEs and Non-Pack-MULEs …….………47 Underrepresentation of Genes with Unknown Function among Parental Genes……………………………………….…………………54 The Effect of Gene Expression on Sequence Acquisition and Its Interaction with GC Content ……………………………………………………..…...……63 The Enrichment of GC-Rich Sequences inside Pack-MULEs Is Due to Selective Acquisition and Preferential Retention ……………………….……66 Discussion ……………………………………………………………………...………….……67   v   Conclusion …...…………………………………………………………………………………74 APPENDIX ………...…………………………………………………...………………………76 REFERENCES ……………………………...……………………….…………………………81 CHAPTER 3: Mutator-like Elements with Multiple Long Terminal Inverted Repeats in Plants ………………………………….…………….…….……….……….…..……87 Abstract ………………………...…………………………………….…………………………88 Introduction ……………………………..………………………...……………………………88 Methods …………………………...…………………………………………………….………90 Plant Genomic Sequences and Construction of the Tomato MULE TIR Library…90 Estimation of MULE Copy Number and Identification of Elements with Multiple TIRs from Plants ……………………………………………………………….91 Phylogenetic Analysis of TIRs and Internal Sequences ………………………..……93 Annotation of Pack-MULEs and Frequency of Element Sizes in Tomato ...……….93 TIR Sequence Analysis and Conservation Test ……………………………...………94 Results…………………………………………………………………………………………...94 Types of MULEs with Multiple TIRs ………………………….…….….……………94 A Pack-MULE Family with Tandem TIRs …………….…………………...………101 Sequence Features with Elements Carrying Multiple TIRs ………...………….…108 The Putative Role of the Tandem TIRs in Amplification of the Elements ……..…111 Discussion ……...…………………………………………………………………………...…115 The Formation and Amplification of MULEs with Additional TIR Sequence …...115 The Mechanism Involved in the Formation of Duplicated TIRs ………..…………118 Possible Competency Conferred by Tandemly Duplicated TIR …………….….…120 Conclusion ……………………………………...………………………………..……………122 REFERENCES …………...…………………………………………………...………………124 CHAPTER 4: Repetitive Sequence Landscape of the Ancient Eudicot Sacred Lotus (Nelumbo nucifera) ……………………….…………………………...………………128 Abstract ……...………………………………………………….…………………..…………129 Introduction ………..…………………………………..………….………………..…………130 Methods ………….……...………..……………………………………………………………132 Construction of repeat library ……………………………………………………….132 Estimation of copy number and genomic coverage …………………………...……134 Pack-MULEs ……………………….…………………………………………………134 Phylogenetic analysis ……….…………………….…………………….…………….135 Results ………...…………………………………………………….…………………………137 Repeat content and diversity ……………………………...…………………………137 LTR elements with non-canonical ends ……………………..………………………139 hAT elements …………………..……………………………...………………………152 Pack-MULEs ……………………………..……………...……………………………154 Discussion …………...……………………………………………………...…………………162 APPENDIX …………………………...…………………………..………...…………………169 REFERENCES …………………...……….………………..…………………………………171 CONCLUDING REMARKS …………………….…………………………………………185   vi   REFERENCES ……………………………………...………………………………………190   vii   LIST OF TABLES Table 2.1 Copy numbers and percentage of different classes of TIR MULEs in the rice genome …………………………………………………………….…………………48 Table 2.2 Copy numbers and percentage of different classes of non-TIR MULEs in the rice genome …………………………………………………………………….…………49 Table 2.3 GC content and expression information among Pack-MULE parental genes and nonTE genes in maize according to functional assignment ………...……………………60 Table 2.4 GC content and expression information among Pack-MULE parental genes and nonTE genes in Arabidopsis according to functional assignment ………….……………61 Table 2.5 GC content and expression information among Pack-MULE parental genes and nonTE genes in rice according to GOSlim assignment ………………….…….…………62 Table A1 Order of TIR MULE families found in Figure 2.1 A and B …………….……………77 Table 3.1 Distribution of elements among tomato MULE TIR families …..……………………98 Table 3.2 MULE TIR families involved in TIR duplication in tomato ………………………..100 Table 3.3 List of the multiple TIR elements in Arabidopsis, maize and rice …………...……..103 Table 3.4 Frequency of MULEs, Pack-MULEs and MULEs with multiple TIRs in four different plant genomes ………………………………………………………………………104 Table 3.5 List of the Pack-MULEs that captured a fragment of a putative zinc-ion binding protein ………………………………………………………………………………106 Table 4.1 Repeat content of the sacred lotus genome ……………………………………….…140 Table 4.2A RNA TE information of various sequenced plant genomes ………………………141 Table 4.2B Genome information and DNA TE content of various sequenced plant genomes...143 Table 4.3 Family, copy number and genome coverage of different LTR ends ……………..…146 Table 4.4 Pack-MULEs with EST evidence of expression ……………………………………158   viii   LIST OF FIGURES Figure 1.1 Classes of transposable elements (Feschotte et al., 2002) …………………....………4 Figure 1.2 Mechanisms of sequence acquisition by Pack-MULEs. (A) Capture by mobile soloTIRs model; (B) Ectopic gene conversion using cruciform structure; (C) Aberrant gap-repair model …………………………………………………………….………15 Figure 2.1. Partition of Pack-MULEs and nonPack-MULEs among TIR MULE and non-TIR MULE families in the rice genome. (A) Copy Number and Pack-MULE distribution in TIR-MULE families associated with gene acquisition. (B) Percent Pack-MULEs and non-Pack-MULEs of total copy number for TIR-MULE families associated with gene acquisition. (C) Copy Number and Pack-MULE distribution in non-TIR MULE families associated with gene acquisition. (D) Percent Pack-MULEs and non-PackMULEs of total copy number for non-TIR MULE families associated with gene acquisition. The corresponding names of MULE families for A and B are found on Appendix Table 1………………………..…………….……………………………. 50 Figure 2.2 GC content along Pack-MULEs and non-Pack-MULEs. The first 2 and last 2 bins represent TIR regions and the internal sequence was divided into 10 equal-sized bins prior to determination of GC content per bin …….…………….……………………55 Figure 2.3 Structural difference between Pack-MULEs and non-Pack-MULEs based on TIR and sub-terminal (sub-TIR) sequences. (A) Median TIR length. (B) Median TIR GC content. (C) Median sub-TIR GC content. (D) Median sub-TIR free energy. PM: Pack-MULE; nPM-PMTIR: non-Pack-MULEs with PM associated TIRs; nPMnPMTIRs: non-Pack-MULEs with non-Pack-MULE exclusive TIRs; PM100: using only first 100 bp sequence of Pack-MULE TIRs. Bars designated with different letters indicate their values are significantly different (α=0.008 for B and α=0.02 for A, C and D) by Wilcoxon Rank Sum Test (WRS) with Bonferroni correction………………………………………………………………….…………56 Figure 2.4 GC content of different genomic sequences in the rice genome. Genome average GC content is indicated by a dashed line. Bars designated with different letters indicate their values are significantly different (α= 0.002) by Wilcoxon Rank Sum Test (WRS) with Bonferroni correction…………………………….……….……………57 Figure 2.5 Percent of GOSlim categories of Pack-MULE parental genes and all non-TE genes of rice……………………………………………………………...…………………58 Figure 2.6 The effect of GC content and expression in gene acquisition frequency. (A) The relationship between expression breadth and ratio of parental genes among genes grouped on GC content range (low: 30-50% GC; moderate: 51-62% GC; high: 6381% GC); (B) The relationship between gene GC content and ratio of parental genes   ix   among genes grouped on number of tissue expression range (no/low: 0 to 1 libraries; moderate: 2 to 7 libraries; high: 8 to 10 libraries)……………………...……………68 Figure 2.7 Comparison of GC content and breadth of expression of parental genes between recent and old acquisition events estimated by transversion rate. (A) The effect of acquisition age on breadth of gene expression; (B) The effect of acquisition age on the GC content; (C) The relationship between GC content of parental genes and transversion rate. ………………………………………………..………………...…69 Figure 3.1 The structure of distinct types of MULEs with additional TIRs in tomato including their copy numbers and the number of TIR families involved in formation. Black horizontal arrows indicate target site duplication (TSD); solid colored triangles indicate Terminal Inverted Repeat (TIR); colored boxes indicate internal sequence and are labeled accordingly if sequences are annotated as genes or have similarity to MULE transposases. Objects with different colors indicate unrelated sequences. ZIBP – zinc ion binding protein…………………………………………………..……..…96 Figure 3.2 PM-ZIBP elements with single and tandem TIRs that contain a gene fragment. Solid triangles indicate TIRs and blue boxes indicate exons of gene SGN-U574419 and fragment acquired by the Pack-MULE. Introns are depicted as lines connecting exons………………………………………………………………...………….…..102 Figure 3.3 Phylogenetic analysis based on the acquired fragment in PM-ZIBP from SGNU574419 and related sequences (SGN-U273862, SGN-U20267, and SGNU506815). Sequences were aligned using ClustalW and phylogenetic reconstruction used the maximum likelihood method with Kimura-2 parameter distances implemented in the MEGA program. Bootstrap values are indicated as a percentage of 1000 replicates (40% majority rule consensus). Elements mapping to heterochromatic regions are indicated by a star symbol……………………… ...…107 Figure 3.4 Chromosomal distribution of tandem TIR PM-ZIBP in tomato. Dark blue blocks represent heterochromatin and light blue regions represent euchromatin. Individual elements are represented by dark red vertical bars, and the purple ovals indicate the location of the centromere………………………………………………..…...……109 Figure 3.5 Alignment of TIR sequences from a PM-ZIBP with tandem TIRs and that from PM-ZIBP-1 illustrating the location of the 3 repetitive motifs found in the TIRs…110 Figure 3.6 Phylogenetic analysis of sequences of the external, and internal TIRs of PM-ZIBP elements. Sequences were aligned and phylogeny was reconstructed as described for Figure 3.3 …….……………………………………………………………………112 Figure 3.7 Phylogenetic analysis of external and internal TIRs from type 2-46. TIRs from the 3TIR elements are indicated by a star symbol. Sequences were aligned and phylogeny was reconstructed as described for Figure 3.3……………………...………………113   x   Figure 3.8 The frequency of different element sizes. Elements that are less than 2 kb are plotted.…………………… ………………………………………...……...………116 Figure 3.9 Nucleotide conservation across the two tandem TIRs. (A) Tandem TIRs from PMZIBP. (B) Tandem TIRs from type 2-46. The nucleotide conservation scores are calculated as an average of 5 nucleotide position scores from the copies of the element. Colored or black triangles represent the TIR. In Figure 3.9A, the orange regions indicate the 3 repetitive motifs (see text). Colored box indicates part of the acquired gene fragment…………………………………………..…………………117 Figure 4.1 Sequence alignment of NN00206 illustrating the LTR sequence, TSD, primer binding site (PBS) and polypurine tract (PPT). Blue text indicates 10bp initial LTR sequence at the each terminal and red indicates 5 bp TSD. The shaded text indicates the sequence that matches a Gly-tRNA……………………………….……………147 Figure 4.2 Sequence alignment of LTR ends and TSD of the Copia family NN00206. Text in blue indicates 6 bp outermost LTR sequence and red indicates 5 bp TSD..…….…148 Figure 4.3 Phylogenetic analysis of TGCT LTR using the conserved integrase catalytic core domain. Bootstrap values are indicated as a percentage of 500 replicates. Shown are: sacred lotus sequences TGCT LTR (blue bolded), TGCA LTR (black bolded), other types of non-canonical LTR (red), Sireviruses (green), TGCT grape LTR (light blue), TGCA grape LTR (brown), other species LTR (black)…………………...……..…149 Figure 4.4 Phylogenetic analysis of the conserved domain 3 of hAT transposase. Bootstrap values are indicated as a percentage of 500 replicates. Names in red represent non-plant hAT transposase. ………………..…………..………………………….……..…………156 Figure 4.5 GC content along Pack-MULEs and non-Pack-MULEs. The first 2 and last 2 bins represent TIR regions and the internal sequence was divided into 12 equal-sized bins prior to determination of GC content per bin …………………………………159 Figure 4.6 GC content of different genomic sequences in the sacred lotus genome. Bars designated with different letters indicate their values are significantly different (α= 0.0025) by Wilcoxon Rank SumTest (WRS) with Bonferroni correction...………..160 Figure 4.7 Distribution of different types of genes based on GC gradient in sacred lotus, Lotus japonicus and Arabidopsis …………………………………………..………161 Figure A1 Phylogenetic analysis of conserved domain in the reverse transcriptase gene from Gypsy LTR. Bootstrap values are indicated as a percentage of 500 replicates …...170   xi                     CHAPTER 1: Review of Literature   1       Transposable Elements Transposable elements (TEs) are genetic sequences that are capable of moving from one genomic location into another and in the process can increase their copy numbers. These genetic elements were first discovered by Barbara McClintock over 50 years ago which she referred to as “controlling elements” responsible for the altered pigmentation in mutant maize kernels (McClintock, 1951). The continued sequencing of many eukaryotic genomes has demonstrated the extensive contribution of TEs in the genomic composition of diverse plant and animal species (Adams, 2000; Lander et al., 2001; Sequencing Project, 2005; Schnable et al., 2009; Howe et al., 2013; Middleton et al., 2013). According to the intermediate form of transposition used by the specific element, TEs are generally classified into two major groups: Class I and Class II (Figure 1.1; Feschotte et al., 2002; Wicker et al., 2007; Kapitonov and Jurka, 2008). In addition, the coding capacity of elements for proteins involved or comprising the transpositional machinery allows for further classification of elements into autonomous elements, which code for these proteins, or nonautonomous elements, which rely on the autonomous elements for movement within the genome (Figure 1.1). Furthermore, TEs may be classified into superfamilies and families, primarily based on specific sequence and/or structural features as well as replication strategy shared by the elements within the superfamily (Wicker et al., 2007). These features, among others, include the target site duplication (TSD) that is created at the flanking regions following transposition into a new genomic location and terminal repeats that specifically vary in size among different families. Class I TEs. Elements belonging to Class I are also referred to as RNA elements or retrotransposons, and transpose via an RNA intermediate using a copy-and-paste mechanism.   2       The elements are transcribed forming an RNA intermediate which is then reverse-transcribed by a reverse transcriptase encoded by autonomous members. The DNA copy is then integrated into a new genomic location (Flavell et al., 1994; Wicker et al., 2007). Retrotransposons account for the majority of repetitive sequences in many plant genomes and have been implicated for genome size expansion (Bennetzen and Kellogg, 1997; Kalendar et al., 2000; Bennetzen, 2005; Schulman and Kalendar, 2005). The presence or absence of long terminal repeats (LTRs) among Class I elements groups them into LTR retrotransposons and non-LTR retrotransposons (Figure 1.1; Feschotte et al., 2002; Kapitonov and Jurka, 2008). LTR retrotransposons are the major components of the TE fraction in plants while non-LTR retrotransposons are more predominant in animal genomes. Moreover, LTR retrotransposon insertions target specific sites in the genome (Bushman, 2003; Ammiraju et al., 2007; Linheiro and Bergman, 2012; Tsukahara et al., 2012). For instance, different LTR elements in yeast (Saccharomyces cerevisiae) exhibit variable targeting. Ty3 elements target locations close to the RNA PolIII transcription initiation sites (Chalker and Sandmeyer, 1992), while Ty5 preferentially integrates into heterochromatin of telomeres and silent mating locus HMR (Zhu, 2003). Class II TEs. Class II elements, or DNA elements, transpose via a cut-and-paste mechanism. The element itself excises from its original genomic location and moves into a new target location. DNA elements may increase their copy numbers by transposing during DNA replication into a locus prior to the formation of the replication fork and by exploiting the gap repair process which creates a new copy and restores the original copy through repair involving the copy in the sister chromatid (Nassif et al., 1994). An exception to the typical features found   3       Figure 1.1 Classes of transposable elements (Feschotte et al., 2002).   4       in Class II elements are Helitrons, which is hypothesized to transpose via a rolling-circle mechanism (Kapitonov and Jurka, 2007). An integral part of many DNA element families are the end sequences which are called Terminal Inverted Repeats (TIR) due to their reverse complementary nature. Some DNA TE superfamilies are characterized by short TIRs (<50bp) such as hATs and CACTA (Wicker et al., 2007). Meanwhile other superfamilies including Mutators, Tc1/Mariner, Polintons and Merlin are typically associated with long TIRs or include members with long TIRs (Li and Shaw, 1993; Lisch, 2002; Feschotte, 2004; Kapitonov, 2006; Ferguson and Jiang, 2012). This variation in TIR length partially allows for the classification of different superfamilies (Wicker et al., 2007; Kapitonov and Jurka, 2008). The TIR plays an important role during transposition. DNA binding domains in the transposase proteins bind the TIR sequence of related elements, and in certain cases sub-terminal sequence, and this is the first step in transposition (Ichikawa et al., 1987; Becker, 1997; Benito and Walbot, 1997; Zhou et al., 2004; Loot et al., 2006). In Osmar5, an active and autonomous Tc1/Mariner element, the N-terminal binding domain of the encoded transposase that contains the helix-turn-helix (HTH) motifs which specifically binds to two sequence motifs in the TIR (Feschotte, 2005). In the bacterial Tn3 element, the binding of transposase to the 38-bp motif sequence in the TIR facilitates the nicking at the end of Tn3 by DNase I and initializes the transposition process (Ichikawa et al., 1987). Binding of transposase to the TIR and to the target DNA mediates the synapsis of the transposon ends and the target DNA, allowing the insertion of the element into the target sequence (Craig, 2002). In the Osmar5 scenario, mutations in the conserved motifs in the TIR can prevent transposition (Yang et al., 2006).   5       Genome size variation. The “C-value” paradox refers to the apparent lack of correlation between an organism’s genome size and its biological complexity (Thomas, 1971). A huge variation in genome sizes exists among organisms with the “large” genomes of this spectrum dominated by transposable elements. Due to their capacity to multiply within a host genome and their prevalence among plant and animal genomes, TEs contribute significantly to increases in genome size (Bennetzen and Kellogg, 1997; Ammiraju et al., 2007; Bennetzen, 2007; Zuccolo et al., 2007) and this may explain the differences in genome sizes without an apparent functional correlation. Recent bursts of amplification of TEs was shown to account for a two-fold increase in genome size of wild rice, Oryza australiensis, in comparison to the asian domesticated rice Oryza sativa (Piegu et al., 2006). In addition, retrotransposon families, primarily Gorge3, have been implicated in genome size increases in two cotton (Gossypium) lineages (Hawkins et al., 2006). These findings suggest that TE can play a major influence in the genome size expansion of plants. TEs are ubiquitous among eukaryotic genomes and the Class I/retrotransposon content maintains to be the largest component of repeat content in many plants. Despite this, dramatic differences exist in the content of different TE types between organisms. For instance, some animal and insect genomes contain a higher proportion of non-LTR retrotransposons compared to LTR retrotransposons (Lander et al., 2001; Chinwalla et al., 2002; Nene et al., 2007). Meanwhile, in plants, the LTR retrotransposons coverage typically dominates the TE landscape (Rice Sequencing Project, 2005; Paterson et al., 2009; Schnable et al., 2009; Schmutz et al., 2010). Also, the ratio of RNA to DNA elements can vary. On one extreme, the RNA TE coverage in the rice genome is 1.5-fold more abundant than the DNA elements and DNA TEs are more abundant in terms of copy number (Rice Sequencing Project, 2005) while in the opposite   6       extreme is the papaya genome where 99% of annotated TEs are RNA elements (Ming et al., 2008), suggesting that although plants overall share similarities in TE component, the content and ratios of specific TE families varies between species. The precise epigenetic regulating mechanisms that resulted in this variable success and lack thereof among different TE superfamilies remains to be elucidated. Regulation of transposable elements. While TE proliferation can contribute to genome expansion, this uncontrolled activity can, in most cases, produce havoc to a genome when TE movement interrupts crucial gene function or regulation (Callinan and Batzer, 2006). Important epigenetic defense mechanisms regulate and silence TE activity, and thus TEs are largely inactive in all plant genomes (Lisch, 2009) and less than 0.05% of TEs in the human genome is able to transpose (Mills et al., 2007). TE activity is regulated through a combination of post-transcriptional and transcriptional silencing mechanisms (reviewed in Feschotte et al., 2002; Slotkin and Martienssen, 2007). Together, the two silencing strategies can result in a repressed TE activity through DNA methylation, chromatin modifications, small RNA and reduced expression. The RNA interference (RNAi) pathway is critical in the regulation of TE. It can be involved in the generation of small RNAs that directly target the degradation of TE RNAs or through RNAdirected DNA methylation (RdDM), both of which repress expression and, therefore, activity. DNA methylation is accomplished by DNA methyltransferases such as METHYLTRANSFERASE1 (MET1), CHROMOMETHYLASE3 (CMT3) and DOMAINS REARRANGED METHYLTRANSFERASE 2 (DRM2) (Lindroth, 2001; Cao and Jacobsen, 2002; Kinoshita, 2004; Law and Jacobsen, 2010). In Arabidopsis, it was shown that DDM1, a Snf2 nucleosome remodeling factor is important in TE regulation (Hirochika et al., 2000; Singer,   7       2001). In fact, ddm1 mutants show transcriptional activation of TEs (Lippman et al., 2004) and increased transposition (Tsukahara et al., 2009). TE regulations through DNA methylation and chromatin changes via DDM1 may involve the RNA-directed DNA methylation (RdDM), a Pol IV- and PolV-dependent process whereby small RNAs target homologous genomic DNA for cytosine methylation (Lippman et al., 2004; Law and Jacobsen, 2010). Recent study indicates that the synergistic TE silencing activity of both DDM1 and RdDM regulate nearly all TE in Arabidopsis (Zemach et al., 2013). This control occurs when the chromatin remodeling nature of DDM1 allows DNA methyltransferase access to otherwise inaccessible heterochromatin while RdDM silences TEs that are typically found in euchromatin. This study is a key step in understanding the variable silencing strategies that might explain the success of certain TE types or superfamilies in some organisms over others. Domestication of transposable elements. Despite the fact that majority of TE activity is deleterious and effectively regulated by various silencing strategies, some TEs have been implicated in adaptive evolution. TE domestication refers to the process whereby TE sequences evolve as new genes or sequences with beneficial roles in the host genome (Volff, 2006). Studies have shown that transposable elements have contributed not only to the creation of new genes but also to the evolution of regulatory networks (Jordan et al., 2003; Muotri et al., 2007). Studies show that several ancient TEs have been used as noncoding functional elements in vertebrates (Bejerano et al., 2006; Kamal, 2006; Kapusta et al., 2013). Sequencing of the shorttailed opossum (Monodelphis domestica) genome has revealed that 20% of TE-derived noncoding elements were conserved in the eutherian lineage suggesting that these elements might be functional (Mikkelsen et al., 2007). Fifty-five human micro RNA (miRNA) genes are apparently   8       derived from TEs (some L2 and MIR TE family related) and can potentially regulate human genes (Piriyapongsa et al., 2006). The catalytic core of the RAG1 and RAG2 proteins involved in the V(D)J recombination process is derived from a Transib transposase (DNA element) indicating that a very critical process in the jawed vertebrate immune system originally evolved from transposons (Kapitonov and Jurka, 2005). In Drosophila, the telomeres are not maintained by telomerase as in the case of most other organisms. In fact, Drosophila lacks telomerase and the conventional telomeric repeats; instead their telomeres are maintained by the Het-A and TART non-LTR retrotransposons that specifically transposose to the ends of the chromosome (Pardue et al., 2005). In plants, one example of TE domestication is DAYSLEEPER, a transposase similar to the hAT superfamily of transposases, which is essential for normal plant growth and development in Arabidopsis thaliana (Bundock and Hooykaas, 2005). Furthermore, a Mariner type transposase fused to a sequence encoding a SET domain (Metnase or SETMAR) found in primates is implicated in the increased resistance to ionizing radiation and non-homologous endjoining repair of double stranded breaks in DNA (Cordaux, 2006; Shaheen et al., 2010). Mutator Superfamily The Mutator superfamily is one of the most active DNA elements in plants. This superfamily was initially discovered by Don Robertson in 1978 in a maize stock that generated an unusually high frequency of genetically unstable recessive mutations (Robertson, 1978). The first non-autonomous Mutator element characterized was Mu1, a 1.4 kb sequence inserted in the Adh1 gene that resulted in an unstable mutant allele (Strommer et al., 1982; Bennetzen et al., 1984). Mutator maize stocks may contain between 10-50 copies of this element, whereas most other maize strains are devoid of this element (Alleman, M. and Freeling, M., 1986).   9       Furthermore, this copy number is maintained with a remarkably high transposition rate of 10-15 transpositions per gamete per generation. This extremely high transposition rate gives the Mutator superfamily an unparalleled activity among DNA transposons in plants. Since the original discovery of Mu1, other non-autonomous elements were later characterized from the maize genome, and subsequently in other organisms where they are referred to as Mutator-like elements (MULEs) (Yu et al., 2000; Chalvet, 2003; Jiang et al., 2004; Holligan et al., 2006; Marquez and Pritham, 2010; Ming et al., 2013). The contribution of MULEs to genome composition varies among plants. The model plant with a relatively small genome size, Arabidopsis thaliana, contains ~1500 MULEs. However, MULE copy number is reported to be over 28,000 and 30,000 in tomato and rice, respectively (Ferguson and Jiang, 2012). In addition, maize with a genome about 6 times the rice genome contains only ~13,000 MULEs (Schnable et al., 2009). This variation in success of MULE amplification among different plants implies differential silencing between different TE types and competition from the activity of other TEs in the genome. DNA elements belonging to the Mutator superfamily are generally characterized and differentiated from other classes of DNA transposons by distinct structural features such as TIR and target site duplication (TSD) (Kapitonov and Jurka, 2008). Mutator elements in maize share a ~220bp TIRs found on both ends of the element but the internal sequence between the TIRs varies among sub-families (Chomet et al., 1991). In addition, during transposition, these elements form a 8-11 bp (mostly 9 bp) TSD at the new location, a feature that is generally regarded as a hallmark of transposition activity. The long TIRs of MULEs may range from 100 to 500 bp and appear to be critical for element transposition and expression. TIRs of actively transposing elements in maize contain a   10       ~32bp conserved binding motif for the MuDR transposase protein A (MURA) (Benito and Walbot, 1997). In addition, two genes contained within the autonomous MuDR element, including the transposase MURA, are transcribed convergently from promoters located within the TIR (Hershberger et al., 1995). Furthermore, the promoters in the TIRs are also responsible for the expression of the internal regions of some coding non-autonomous elements (Jiang et al., 2004) suggesting the importance of the TIR sequence for transposition and retention of MULEs in the genome. Transposition. Due to the prevalence of Mutators and MULEs in plants, understanding how these elements amplify in the host genome is essential. To this end, studies within this superfamily are limited to the active autonomous member (MuDR) in maize. The MURA protein binds to a conserved ~32 bp motif in both methylated and unmethylated TIRs of known active Mutator elements (Benito and Walbot, 1997). It is presumed that the MURA binds the TIRs, consequently bringing them together and catalyzing a double stranded break (Lisch and Jiang, 2009). In maize the integration process appears to require the MURB (MuDR protein B) protein, encoded by mudrB (Lisch et al., 1999); however, the precise molecular role of mudrB in transposition remains to be elucidated. Furthermore, the exact molecular processes involved in the excision and reintegration of MULEs also awaits further analyses. Target selection. The first few detected Mutator elements already allowed identification of target preference. These elements were cloned due to its insertion nearby and disruption of three genes, Adh1, A1 and Bz2 (Bennetzen et al., 1984; O’Reilly et al., 1985; McLaughlin and Walbot, 1987) suggesting a preference for insertion near genic sequences. In a genome-wide mutagenesis study with RescueMu, a Mu1 element containing a pBluescript plasmid, a strong bias against insertion in repetitive DNA was found (Fernandes et al., 2004). Only ~6% of   11       flanking sequences were in repeats in a recent study of over 40,000 nonredundant Mu insertions using the 454 technology (Liu et al., 2009). These elements also show insertion hotspots for the 5’ end of genes (Dietrich et al., 2002) and particularly for the sequences directly surrounding the transcription start site and 5’ of the translational start site (Liu et al., 2009). Because insertion in or near genes can typically cause gene disruption, yet certain insertions are retained in the genome. This genic insertion preference of MULEs may suggest that certain insertions can lead to mutations with adaptive traits. Analysis of the nucleotide composition at the insertion sites has previously suggested a preference for a weak consensus insertion site (Dietrich et al., 2002). Studies with RescueMu insertions indicate a bias for high GC content in the TSD and a flanking dyad-symmetrical consensus: CCT-(TSD)-AGG (Fernandes et al., 2004). The weak consensus and the variation in insertion sequences have been proposed to suggest that this preference is for specific structural features rather than sequence (Lisch and Jiang, 2009). It appears that chromatin structure may play a role, and secondary structures such as transitions between high and low GC content, DNA bendability, B-DNA twist, α-philicity, protein-induced deformability (Dietrich et al., 2002) and recombination and epigenetic modifications (Liu et al., 2009) are likely important. Epigenetic regulation. Similar to other TEs in the genome, Mutator activity is regulated by silencing mechanisms. The typical occurrence when Mutator elements become inactive in maize is methylation (Chomet et al., 1991). Although methylation is a typical hallmark of epigenetic silencing, methylation in Mu1 and MuDR derivatives do not seem to result in transcriptional silencing (Barkan and Martienssen, 1991; Lisch et al., 1999). In Arabidopsis, mutation of the chromatin remodeling factor DDM1 (Decrease in DNA methylation) results in   12       progressive loss of heterochromatin DNA methylation, and transcriptional and transpositional activation of different MULE families (Singer, 2001). In maize, an element that reliably silences and leads to methylation of MuDR is Mu killer (Muk) (Slotkin et al., 2003). Later, Muk was described to be a derivative of the MuDR element that contained a deletion and an inverted duplication of the internal MuDR sequence (Slotkin et al., 2005). Muk transcription produces a hairpin transcript that is processed into small RNAs. The initiation of silencing by Muk appears to be distinct from the maintenance of silencing among Mu elements in maize, due to its different small interfering RNA (siRNA) expression pattern and its lack of dependence on the RNA-dependent RNA polymerase (MOP1) (Lisch and Jiang, 2009). Pack-MULEs Within the Mutator superfamily are a special group of non-autonomous elements, called Pack-MULEs that carry genes and gene fragments (Jiang et al., 2004). In fact, the first discovered Mutator element (Mu1) is a Pack-MULE which was later found to carry a sequence conserved in all maize lines as well as closely related species (Talbert and Chandler, 1989). This sequence is expressed and referred to as Mu-related sequence A (MRS-A); however, its function in maize is still unknown. To date, Pack-MULEs have been characterized in both monocots and dicots, including rice, maize, Lotus japonicus, Arabidopsis, sacred lotus, and tomato (Yu et al., 2000; Jiang et al., 2004; Juretic et al., 2005; Holligan et al., 2006; Schnable et al., 2009; Ferguson and Jiang, 2012; Ming et al., 2013) suggesting their prevalence among plants and its ancient formation. Consistent with Mutator and MULEs, Pack-MULEs also show insertion preference for regions flanking the 5’ end of genes (Jiang et al., 2011).   13       Mechanism of acquisition. To date, the mechanism of formation and acquisition of the genes and gene fragments by Pack-MULEs remains to be elucidated. Three models have been proposed. The first is a process similar to the acquisition of resistance genes by IS elements in bacteria (Talbert and Chandler, 1989). In this model, a mobile solo-TIR is suggested to encompass the Mu-related Sequence A (MRS-A) to form the Mu1 element. However, to date, a mobile solo-TIR has not been reported. Secondly, Bennetzen and Springer (1994) suggested a model using an ectopic gene conversion across a nicked-cruciform structure. Based on this model, a stem-loop structure is formed with the TIR serving as the stem and the internal region as the loop. A nick within the loop results from an endonucleolytic cleavage and the subsequent repair of the gap may proceed using a genomic sequence template with short homology to the gap ends. In this process a genomic sequence not previously associated with the elements becomes incorporated in the regions internal to the TIR. Finally, the third model for acquisition proposes an aberrant gap repair process that uses ectopic sequences as template. In this model, an excision event is necessary and the acquisition of sequences occurs upon the repair of the gap at the donor site (Yamashita et al., 1999). To date, none of the proposed mechanisms of acquisition have any empirical support including the amplification in the genome by solo TIRs. The long inverted nature of the TIR seems to support the likelihood of formation of a cruciform structure. However, it is unknown whether MULEs, in fact, form these structures in vivo or in vitro. The third model requires transposition of an element and can therefore be tested using the presence or absence of an active autonomous element or a transposase protein. Despite the lack of information to imply an acquisition mechanism, a recent study in rice showed that Pack-MULEs do not acquire fragments at random. Instead these elements preferentially acquire GC-rich sequences (Jiang et al., 2011). Plant genes can be defined by GC   14       A mobile solo TIRS gene gene B gene C gene Excision  site   Figure 1.2 Mechanisms of sequence acquisition. (A) Capture by mobile solo-TIRs model; (B) Ectopic gene conversion using cruciform structure; (C) Aberrant gap-repair model   15       gradients which refers to the variation in GC content along the direction of transcription. Despite the prevalence of genes with negative GC gradients, those where the 5’ half is more GC-rich than the 3’ half of the gene body, in grasses, it appears that this acqusition preference is influenced by the GC gradient and not the position within the gene itself. Sequence capture via GC preference may occur via the aberrant gap-repair or ectopic gene conversion. Further genetic, biochemical and bioinformatics analyses will be necessary to begin to unravel the mystery behind the acquisition process by Pack-MULEs. Functional capacity. Although Pack-MULEs have been discovered in many plant genomes, limited information in rice exists on the exact role and contribution of Pack-MULE encoded sequences in gene function and regulation. The first report of Pack-MULEs in rice show that they can carry gene fragments from multiple loci, which form new open reading frames (Jiang et al., 2004). In addition to coding sequences that are entirely derived from the PackMULE internal region, Pack-MULEs provide part of the ORF and/or UTR that fuses with downstream sequences/genes to form chimeric transcripts (Jiang et al., 2011). Both of these forms have evidence of expression based on full-length cDNA transcripts or protein expression data. Comprehensive analyses on rice Pack-MULEs showed that 22% of rice Pack-MULEs are transcribed with 28 elements having evidence of translation (Hanada et al., 2009). Interestingly, these chimeric Pack-MULEs appear to be expressed more frequently than those that acquired only a single gene. Pack-MULEs can be expressed in either orientation and with a small subset having bidirectional transcription, that is, both sense and antisense transcripts are formed (Hanada et al., 2009). The formation of antisense transcripts suggests a role for Pack-MULE transcript in regulating the expression of parental genes, which refer to the sequences where the gene or gene   16       fragments are derived. Moreover, a possible feedback regulation if transcripts are expressed in aberrant quantities has been suggested (Lisch, 2005). In fact, more than half of rice Pack-MULEs are generating small RNAs (Hanada et al., 2009). In addition, parental genes that have shared small RNAs with Pack-MULEs show lower expression compared to genes without shared small RNAs. Although a definite coding capacity of a Pack-MULE encoded gene remains to be demonstrated though biochemical and genetic analyses, a computational study has looked into the selection pressure on Pack-MULE encoded sequences. If Pack-MULE captured fragments exhibit sequence conservation, this may indicate that the sequences are conserved are likely functional. Ka/Ks ratios of Pack-MULE sequences in rice show that a considerable number of Pack-MULEs are under selective constraint and subjected to purifying selection. The selective constraints are particularly stronger for Pack-MULE transcripts in the sense orientation. These data suggest its potential in contributing new open reading frames with coding capacity (Hanada et al., 2009). Sacred Lotus To date, MULEs and Pack-MULEs are characterized in multiple monocots and dicots species (Yu et al., 2000; Jiang et al., 2004; Juretic et al., 2005; Holligan et al., 2006; Schnable et al., 2009; Ferguson and Jiang, 2012; Ming et al., 2013). Nevertheless, the acquisition preference for GC-rich sequences of Pack-MULEs is only evident in grasses, which are monocot plants. It raised the question whether such preference has been only evolved in monocots or it is of ancient origin. To this end, examination of the genomes of basal dicots would provide novel insight on this issue, and scared lotus (Nelumbo nucifera) is one of the basal dicot plants. Sacred lotus is a perennial aquatic plant that belongs to the Nelumbonaceae family and is found throughout Asia   17       and northern Australia (Han et al., 2007; Pan et al., 2009). It provides economic value as an ornamental and food crop in Asia and is also used for medicinal purposes (Shen-Miller, 2002; Guo, 2008). The sacred lotus has a century-old cultivation history and remarkable longevity whereby seeds are viable for up to 1300 years and the rhizomes remain healthy for more than 50 years. The genome of sacred lotus was recently sequenced using Illumina and 454 technologies (Ming et al., 2013). This provides an excellent biological resource particularly for evolutionary analysis of transposable element biology in eudicots since lotus phylogenetically lies outside the core eudicots making it currently the most basal lineage of angiosperm sequenced. The final genome assembly (804Mb) is 86.5% of the estimated 929 Mb lotus genome (Diao et al., 2006). Outline for this dissertation In this dissertation, the nature of sequence acquisition by MULEs and Pack-MULEs was explored to give insights into the mechanism of capture of parental fragments and the repetitive composition of the sacred lotus was annotated. Elucidating the mechanism of sequence acquisition will be a fundamental step in understanding Pack-MULE biology and evolution and can be used to exploit the mechanism process as a tool in generating novel coding sequences. The annotation of the sacred lotus repeats is essential in dissecting its TE content and diversity. This information may be used to understand TE evolution in a basal eudicot and provide a platform for comparative TE analysis, especially acquisition mechanism of Pack-MULEs, between dicots and monocots. First, the role of TIR and sub-TIR sequence and structure on the acquisition process and the acquired sequence context was evaluated in the rice genome. Results show that these two key features of the MULEs may be involved in the acquisition process. Also, data suggest that gene GC content and ubiquity of expression play a role in the acquisition   18       frequency of a parental gene; moreover, very high GC content influences the retention of these fragments in the rice genome. Second, an atypical class of MULEs that possess duplicated TIRs on one or both ends of the elements was characterized. These elements were found to be present in higher copy numbers than their single or three-TIR counterparts indicating the importance of the tandem TIR structure in amplification efficiency of specific families of MULEs in plant genomes. Third, the repetitive component of the sacred lotus (Nelumbo nucifera) genome was analyzed. Results indicate features that sets it distinctly from other plants in terms of its TE content and diversity. Because of its location in plant evolutionary history by being the most basal angiosperm genome sequenced to date makes N. nucifera a model in TE analysis and evolution.   19       REFERENCES   20       REFERENCES Adams MD (2000) The Genome Sequence of Drosophila melanogaster. Science 287: 2185–2195 Alleman, M., Freeling, M. (1986) The Mu transposable elements of maize: evidence for transposition and copy number regulation during development. Genetics 112: 107–119 Ammiraju JSS, Zuccolo A, Yu Y, Song X, Piegu B, Chevalier F, Walling JG, Ma J, Talag J, Brar DS, et al (2007) Evolutionary dynamics of an ancient retrotransposon family provides insights into evolution of genome size in the genus Oryza. The Plant Journal 52: 342–351 Argout X, Salse J, Aury J-M, Guiltinan MJ, Droc G, Gouzy J, Allegre M, Chaparro C, Legavre T, Maximova SN, et al (2010) The genome of Theobroma cacao. Nature Genetics 43: 101–108 Banks JA, Nishiyama T, Hasebe M, Bowman JL, Gribskov M, dePamphilis C, Albert VA, Aono N, Aoyama T, Ambrose BA, et al (2011) The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of Vascular Plants. Science 332: 960–963 Barkan A, Martienssen RA (1991) Inactivation of maize transposon Mu suppresses a mutant phenotype by activating an outward-reading promoter near the end of Mu1. Proceedings of the National Academy of Sciences 88: 3502–3506 Bartolomé C, Bello X, Maside X (2009) Widespread evidence for horizontal transfer of transposable elements across Drosophila genomes. Genome Biology 10: R22 Becker H-A (1997) Maize Activator transposase has a bipartite DNA binding domain that recognizes subterminal sequences and the terminal inverted repeats. Molecular and General Genetics MGG 254: 219–230 Bejerano G, Lowe CB, Ahituv N, King B, Siepel A, Salama SR, Rubin EM, James Kent W, Haussler D (2006) A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature 441: 87–90 Benito MI, Walbot V (1997) Characterization of the maize Mutator transposable element MURA transposase as a DNA-binding protein. Mol Cell Biol 17: 5165–75 Bennetzen J (2007) Patterns in grass genome evolution. Current Opinion in Plant Biology 10: 176–181 Bennetzen JL (2005) Mechanisms of Recent Genome Size Variation in Flowering Plants. Annals of Botany 95: 127–132 Bennetzen JL, Kellogg EA (1997) Do Plants Have a One-Way Ticket to Genomic Obesity? Plant Cell 9: 1509–1514   21       Bennetzen JL, Springer PS (1994) The generation of Mutator transposable element subfamilies in maize. Theoretical and Applied Genetics. doi: 10.1007/BF00222890 Bennetzen JL, Swanson J, Taylor WC, Freeling M (1984) DNA insertion in the first intron of maize Adh1 affects message levels: cloning of progenitor and mutant Adh1 alleles. Proceedings of the National Academy of Sciences 81: 4125–4128 Brenchley R, Spannagl M, Pfeifer M, Barker GLA, D’Amore R, Allen AM, McKenzie N, Kramer M, Kerhornou A, Bolser D, et al (2012) Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 491: 705–710 Bundock P, Hooykaas P (2005) An Arabidopsis hAT-like transposase is essential for plant development. Nature 436: 282–284 Bushman FD (2003) Targeting survival: integration site selection by retroviruses and LTRretrotransposons. Cell 115: 135–138 Callinan PA, Batzer MA (2006) Retrotransposable Elements and Human Disease. In J-N Volff, ed, Genome Dynamics. KARGER, Basel, pp 104–115 Cao X, Jacobsen SE (2002) Role of the Arabidopsis DRM Methyltransferases in De Novo DNA Methylation and Gene Silencing. Current Biology 12: 1138–1144 Chalker DL, Sandmeyer SB (1992) Ty3 integrates within the region of RNA polymerase III transcription initiation. Genes & Development 6: 117–128 Chalvet F (2003) Hop, an Active Mutator-like Element in the Genome of the Fungus Fusarium oxysporum. Molecular Biology and Evolution 20: 1362–1375 Chan AP, Crabtree J, Zhao Q, Lorenzi H, Orvis J, Puiu D, Melake-Berhan A, Jones KM, Redman J, Chen G, et al (2010) Draft genome sequence of the oilseed species Ricinus communis. Nature Biotechnology 28: 951–956 Chinwalla AT, Cook LL, Delehaunty KD, Fewell GA, Fulton LA, Fulton RS, Graves TA, Hillier LW, Mardis ER, McPherson JD, et al (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562 Chomet P, Lisch D, Hardeman KJ, Chandler VL, Freeling M (1991) Identification of a regulatory transposon that controls the Mutator transposable element system in maize. Genetics 129: 261–270 Cordaux R (2006) From the Cover: Birth of a chimeric primate gene by capture of the transposase gene from a mobile element. Proceedings of the National Academy of Sciences 103: 8101–8106 D’Hont A, Denoeud F, Aury J-M, Baurens F-C, Carreel F, Garsmeur O, Noel B, Bocs S, Droc G, Rouard M, et al (2012) The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488: 213–217   22       Dassanayake M, Oh D-H, Haas JS, Hernandez A, Hong H, Ali S, Yun D-J, Bressan RA, Zhu JK, Bohnert HJ, et al (2011) The genome of the extremophile crucifer Thellungiella parvula. Nature Genetics 43: 913–918 Diao Y, Chen L, Yang G, Zhou M, Song Y, Hu Z, Liu JY (2006) Nuclear DNA C-values in 12 species in Nymphales. Caryologia 59: 25–30 Dietrich CR, Cui F, Packila ML, Li J, Ashlock DA, Nikolau BJ, Schnable PS (2002) Maize Mu transposons are targeted to the 5’ untranslated region of the gl8 gene and sequences flanking Mu target-site duplications exhibit nonrandom nucleotide composition throughout the genome. Genetics 160: 697–716 Al-Dous EK, George B, Al-Mahmoud ME, Al-Jaber MY, Wang H, Salameh YM, Al-Azwani EK, Chaluvadi S, Pontaroli AC, DeBarry J, et al (2011) De novo genome sequencing and comparative genomics of date palm (Phoenix dactylifera). Nature Biotechnology 29: 521–527 Drinnan AN, Crane PR, Hoot SB (1994) Patterns of floral evolution in the early diversification of non-magnoliid dicotyledons (eudicots). In PK Endress, EM Friis, eds, Early Evolution of Flowers. Springer Vienna, Vienna, pp 93–122 Du J, Grant D, Tian Z, Nelson RT, Zhu L, Shoemaker RC, Ma J (2010) SoyTEdb: a comprehensive database of transposable elements in the soybean genome. BMC Genomics 11: 113 Duke JA, Duke (2002) Handbook of medicinal herbs. CRC Press, Boca Raton, FL Edgar RC, Myers EW (2005) PILER: identification and classification of genomic repeats. Bioinformatics 21: i152–i158 Ellinghaus D, Kurtz S, Willhoeft U (2008) LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9: 18 Ferguson AA, Jiang N (2012) Mutator-Like Elements with Multiple Long Terminal Inverted Repeats in Plants. Comparative and Functional Genomics 2012: 1–14 Fernandes J, Dong Q, Schneider B, Morrow DJ, Nan G-L, Brendel V, Walbot V (2004) Genome-wide mutagenesis of Zea mays L. using RescueMu transposons. Genome Biol 5: R82 Feschotte C (2004) Merlin, a New Superfamily of DNA Transposons Identified in Diverse Animal Genomes and Related to Bacterial IS1016 Insertion Sequences. Molecular Biology and Evolution 21: 1769–1780 Feschotte C (2005) DNA-binding specificity of rice mariner-like transposases and interactions with Stowaway MITEs. Nucleic Acids Research 33: 2153–2165   23       Feschotte C, Jiang N, Wessler SR (2002) PLANT TRANSPOSABLE ELEMENTS: WHERE GENETICS MEETS GENOMICS. Nature Reviews Genetics 3: 329–341 Flavell AJ, Pearce SR, Kumar A (1994) Plant transposable elements and the genome. Current Opinion in Genetics & Development 4: 838–844 Guo HB (2008) Cultivation of lotus (Nelumbo nucifera Gaertn. ssp. nucifera) and its utilization in China. Genetic Resources and Crop Evolution 56: 323–330 Han Y, Wessler SR (2010) MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Research 38: e199–e199 Han Y-C, Teng C-Z, Zhong S, Zhou M-Q, Hu Z-L, Song Y-C (2007) Genetic variation and clonal diversity in populations of Nelumbo nucifera (Nelumbonaceae) in central China detected by ISSR markers. Aquatic Botany 86: 69–75 Hanada K, Vallejo V, Nobuta K, Slotkin RK, Lisch D, Meyers BC, Shiu S-H, Jiang N (2009) The Functional Role of Pack-MULEs in Rice Inferred from Purifying Selection and Expression Profile. THE PLANT CELL ONLINE 21: 25–38 Hawkins JS, Kim H, Nason JD, Wing RA, Wendel JF (2006) Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium. Genome Research 16: 1252–1261 Hehl R, Nacken WKF, Krause A, Saedler H, Sommer H (1991) Structural analysis of Tam3, a transposable element from Antirrhinum majus, reveals homologies to the Ac element from maize. Plant Molecular Biology 16: 369–371 Hershberger RJ, Benito M-I, Hardeman KJ, Warren C, Chandler VL, Walbot V (1995) Characterization of the major transcripts encoded by the regulatory MuDR transposable element of maize. Genetics 140: 1087–1098 Hirochika H, Okamoto H, Kakutani T (2000) Silencing of retrotransposons in arabidopsis and reactivation by the ddm1 mutation. Plant Cell 12: 357–369 Hirochika H, Sugimoto K, Otsuki Y, Tsugawa H, Kanda M (1996) Retrotransposons of rice involved in mutations induced by tissue culture. Proceedings of the National Academy of Sciences 93: 7783–7788 Holligan D, Zhang X, Jiang N, Pritham EJ, Wessler SR (2006) The Transposable Element Landscape of the Model Legume Lotus japonicus. Genetics 174: 2215–2228 Hollister JD, Smith LM, Guo Y-L, Ott F, Weigel D, Gaut BS (2011) Transposable elements and small RNAs contribute to gene expression divergence between Arabidopsis thaliana and Arabidopsis lyrata. Proceedings of the National Academy of Sciences 108: 2322–2327   24       Howe K, Clark MD, Torroja CF, Torrance J, Berthelot C, Muffato M, Collins JE, Humphray S, McLaren K, Matthews L, et al (2013) The zebrafish reference genome sequence and its relationship to the human genome. Nature 496: 498–503 Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, Lucas WJ, Wang X, Xie B, Ni P, et al (2009) The genome of the cucumber, Cucumis sativus L. Nature Genetics 41: 1275–1281 Ichikawa H, Ikeda K, Wishart WL, Ohtsubo E (1987) Specific binding of transposase to terminal inverted repeats of transposable element Tn3. Proceedings of the National Academy of Sciences 84: 8220–8224 Jarvis CE, Linnean Society of London, Natural History Museum (London, England) (2007) Order out of chaos: Linnaean plant names and their types. Linnean Society of London in association with the Natural History Museum, London, London Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature 431: 569–573 Jiang N, Ferguson AA, Slotkin RK, Lisch D (2011) Pack-Mutator-like transposable elements (Pack-MULEs) induce directional modification of genes through biased insertion and DNA acquisition. Proceedings of the National Academy of Sciences 108: 1537–1542 Jiang N, Panaud O (2013) Transposable Element Dynamics in Rice and Its Wild Relatives. In Q Zhang, RA Wing, eds, Genetics and Genomics of Rice. Springer New York, New York, NY, pp 55–69 Jordan IK, Rogozin IB, Glazko GV, Koonin EV (2003) Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends in Genetics 19: 68–72 Juretic N, Hoen DR, Huynh ML, Harrison PM, Bureau TE (2005) The evolutionary fate of MULE-mediated duplications of host gene fragments in rice. Genome Res 15: 1292– 1297 Kalendar R, Tanskanen J, Immonen S, Nevo E, Schulman AH (2000) From the Cover: Genome evolution of wild barley (Hordeum spontaneum) by BARE-1 retrotransposon dynamics in response to sharp microclimatic divergence. Proceedings of the National Academy of Sciences 97: 6603–6607 Kamal M (2006) A large family of ancient repeat elements in the human genome is under strong selection. Proceedings of the National Academy of Sciences 103: 2740–2745 Kapitonov VV (2006) Self-synthesizing DNA transposons in eukaryotes. Proceedings of the National Academy of Sciences 103: 4540–4545 Kapitonov VV, Jurka J (2005) RAG1 core and V(D)J recombination signal sequences were derived from Transib transposons. PLoS Biol 3: e181   25       Kapitonov VV, Jurka J (2008) A universal classification of eukaryotic transposable elements implemented in Repbase. Nature Reviews Genetics 9: 411–412 Kapitonov VV, Jurka J (2007) Helitrons on a roll: eukaryotic rolling-circle transposons. Trends in Genetics 23: 521–529 Kapusta A, Kronenberg Z, Lynch VJ, Zhuo X, Ramsay L, Bourque G, Yandell M, Feschotte C (2013) Transposable Elements Are Major Contributors to the Origin, Diversification, and Regulation of Vertebrate Long Noncoding RNAs. PLoS Genetics 9: e1003470 Kawakami T, Strakosh SC, Zhen Y, Ungerer MC (2010) Different scales of Ty1/copia-like retrotransposon proliferation in the genomes of three diploid hybrid sunflower species. Heredity (Edinb) 104: 341–350 Kempken F, Windhofer F (2001) The hAT family: a versatile transposon group common to plants, fungi, animals, and man. Chromosoma 110: 1–9 Kinoshita T (2004) One-Way Control of FWA Imprinting in Arabidopsis Endosperm by DNA Methylation. Science 303: 521–523 Knip M, de Pater S, Hooykaas PJ (2012) The SLEEPER genes: a transposase-derived angiosperm-specific gene family. BMC Plant Biology 12: 192 Kumar A, Bennetzen JL (1999) Plant Retrotransposons. Annual Review of Genetics 33: 479–532 Kunze R, Weil CF (2002) The hAT and CACTA superfamilies of plant transposons. Mobile DNA II. pp 565–610 Kuwahara A, Kato A, Komeda Y (2000) Isolation and characterization of copia-type retrotransposons in Arabidopsis thaliana. Gene 244: 127–136 Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921 Law JA, Jacobsen SE (2010) Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nature Reviews Genetics 11: 204–220 Lazarow K, Du M-L, Weimer R, Kunze R (2012) A Hyperactive Transposase of the Maize Transposable Element Activator (Ac). Genetics 191: 747–756 Li W, Shaw JE (1993) A variant Tc4 transposable element in the nematode C.elegans could encode a novel protein. Nucleic Acids Research 21: 59–67 Lindroth AM (2001) Requirement of CHROMOMETHYLASE3 for Maintenance of CpXpG Methylation. Science 292: 2077–2080   26       Linheiro RS, Bergman CM (2012) Whole Genome Resequencing Reveals Natural Target Site Preferences of Transposable Elements in Drosophila melanogaster. PLoS ONE 7: e30008 Lippman Z, Gendrel A-V, Black M, Vaughn MW, Dedhia N, Richard McCombie W, Lavine K, Mittal V, May B, Kasschau KD, et al (2004) Role of transposable elements in heterochromatin and epigenetic control. Nature 430: 471–476 Lisch D (2002) Mutator transposons. Trends Plant Sci 7: 498–504 Lisch D (2009) Epigenetic Regulation of Transposable Elements in Plants. Annual Review of Plant Biology 60: 43–66 Lisch D (2005) Pack-MULEs: theft on a massive scale. BioEssays 27: 353–355 Lisch D, Girard L, Donlin M, Freeling M (1999) Functional analysis of deletion derivatives of the maize transposon MuDR delineates roles for MURA and MURB proteins. Genetics 151: 331–341 Lisch D, Jiang N (2009) Mutator and MULE transposons. In JL Bennetzen, S Hake, eds, Handbook of Maize. Springer New York, New York, NY, pp 277–306 Liu S, Yeh C-T, Ji T, Ying K, Wu H, Tang HM, Fu Y, Nettleton D, Schnable PS (2009) Mu Transposon Insertion Sites and Meiotic Recombination Events Co-Localize with Epigenetic Marks for Open Chromatin across the Maize Genome. PLoS Genetics 5: e1000733 Loot C, Santiago N, Sanz A, Casacuberta JM (2006) The proteins encoded by the pogo-like Lemi1 element bind the TIRs and subterminal repeated motifs of the Arabidopsis Emigrant MITE: consequences for the transposition mechanism of MITEs. Nucleic Acids Research 34: 5238–5246 Marquez CP, Pritham EJ (2010) Phantom, a New Subclass of Mutator DNA Transposons Found in Insect Viruses and Widely Distributed in Animals. Genetics 185: 1507–1517 Mayer KFX, Waugh R, Langridge P, Close TJ, Wise RP, Graner A, Matsumoto T, Sato K, Schulman A, Muehlbauer GJ, et al (2012) A physical, genetic and functional sequence assembly of the barley genome. Nature. doi: 10.1038/nature11543 McCarthy EM, McDonald JF (2003) LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19: 362–367 McClintock B (1951) CHROMOSOME ORGANIZATION AND GENIC EXPRESSION. Cold Spring Harbor Symposia on Quantitative Biology 16: 13–47 McLaughlin M, Walbot V (1987) Cloning of a mutable bz2 allele of maize by transposon tagging and differential hybridization. Genetics 117: 771–776   27       Middleton CP, Stein N, Keller B, Kilian B, Wicker T (2013) Comparative analysis of genome composition in Triticeae reveals strong variation in transposable element dynamics and nucleotide diversity. The Plant Journal 73: 347–356 Mikkelsen TS, Wakefield MJ, Aken B, Amemiya CT, Chang JL, Duke S, Garber M, Gentles AJ, Goodstadt L, Heger A, et al (2007) Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature 447: 167–177 Mills RE, Bennett EA, Iskow RC, Devine SE (2007) Which transposable elements are active in the human genome? Trends in Genetics 23: 183–191 Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, Senin P, Wang W, Ly BV, Lewis KLT, et al (2008) The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452: 991–996 Ming R, VanBuren R, Liu Y, Yang M, Han Y, Li L-T, Zhang Q, Kim M-J, Schatz MC, Campbell M, et al (2013) Genome of the long-living sacred lotus (Nelumbo nucifera Gaertn.). Genome Biology 14: R41 Morgenstern B (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15: 211–218 Muotri AR, Marchetto MCN, Coufal NG, Gage FH (2007) The necessary junk: new functions for transposable elements. Human Molecular Genetics 16: R159–R167 Nassif N, Penney J, Pal S, Engels WR, Gloor GB (1994) Efficient copying of nonhomologous sequences from ectopicsites via P-element-induced gap repair. Mol Cell Biol 14: 1613– 1625 Nene V, Wortman JR, Lawson D, Haas B, Kodira C, Tu Z, Loftus B, Xi Z, Megy K, Grabherr M, et al (2007) Genome Sequence of Aedes aegypti, a Major Arbovirus Vector. Science 316: 1718–1723 Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin Y-C, Scofield DG, Vezzi F, Delhomme N, Giacomello S, Alexeyenko A, et al (2013) The Norway spruce genome sequence and conifer genome evolution. Nature 497: 579–584 O’Reilly C, Shepherd NS, Pereira A, Schwarz-Sommer Z, Bertram I, Robertson DS, Peterson PA, Saedler H (1985) Molecular cloning of the a1 locus of Zea mays using the transposable elements En and Mu1. EMBO J 4: 877–882 Oliver KR, McComb JA, Greene WK (2013) Transposable Elements: Powerful Contributors to Angiosperm Evolution and Diversity. Genome Biology and Evolution 5: 1886–1901 Pan L, Xia Q, Quan Z, Liu H, Ke W, Ding Y (2009) Development of Novel EST-SSRs from Sacred Lotus (Nelumbo nucifera Gaertn) and Their Utilization for the Genetic Diversity Analysis of N. nucifera. Journal of Heredity 101: 71–82   28       Pardue M-L, Rashkova S, Casacuberta E, DeBaryshe PG, George JA, Traverse KL (2005) Two retrotransposons maintain telomeres in Drosophila. Chromosome Research 13: 443–453 Parisod C, Alix K, Just J, Petit M, Sarilar V, Mhiri C, Ainouche M, Chalhoub B, Grandbastien M-A (2010) Impact of transposable elements on the organization and function of allopolyploid genomes: Research review. New Phytologist 186: 37–45 Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, et al (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457: 551–556 Piegu B, Guyot R, Picault N, Roulin A, Saniyal A, Kim H, Collura K, Brar DS, Jackson S, Wing RA, et al (2006) Doubling genome size without polyploidization: Dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice. Genome Research 16: 1262–1269 Piriyapongsa J, Marino-Ramirez L, Jordan IK (2006) Origin and Evolution of Human microRNAs From Transposable Elements. Genetics 176: 1323–1337 Robertson DS (1978) Characterization of a mutator system in maize. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis 51: 21–28 Robertson HM (2002) Evolution of DNA transposons. Mobile DNA II. American Society for Microbiology Press, Washington, D.C, pp 1093–1110 Sanmiguel P (1998) Evidence that a Recent Increase in Maize Genome Size was Caused by the Massive Amplification of Intergene Retrotransposons. Annals of Botany 82: 37–44 Sato S, Hirakawa H, Isobe S, Fukai E, Watanabe A, Kato M, Kawashima K, Minami C, Muraki A, Nakazaki N, et al (2010) Sequence Analysis of the Genome of an Oil-Bearing Tree, Jatropha curcas L. DNA Research 18: 65–76 Sato S, Tabata S, Hirakawa H, Asamizu E, Shirasawa K, Isobe S, Kaneko T, Nakamura Y, Shibata D, Aoki K, et al (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485: 635–641 Schaack S, Gilbert C, Feschotte C (2010) Promiscuous DNA: horizontal transfer of transposable elements and why it matters for eukaryotic evolution. Trends in Ecology & Evolution 25: 537–546 Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463: 178–183 Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, et al (2009) The B73 Maize Genome: Complexity, Diversity, and Dynamics. Science 326: 1112–1115   29       Schulman AH, Kalendar R (2005) A movable feast: diverse retrotransposons and their contribution to barley genome dynamics. Cytogenetic and Genome Research 110: 598– 605 Sequencing Project IRG (2005) The map-based sequence of the rice genome. Nature 436: 793– 800 Shaheen M, Williamson E, Nickoloff J, Lee S-H, Hromas R (2010) Metnase/SETMAR: a domesticated primate transposase that enhances DNA repair, replication, and decatenation. Genetica 138: 559–566 Shen-Miller J (2002) Sacred lotus, the long-living fruits of China Antique. Seed Science Research 12: 131–143 Shen-Miller J, Aung LH, Turek J, Schopf JW, Tholandi M, Yang M, Czaja A (2013) CenturiesOld Viable Fruit of Sacred Lotus Nelumbo nucifera Gaertn var. China Antique. Tropical Plant Biology 6: 53–68 Shulaev V, Sargent DJ, Crowhurst RN, Mockler TC, Folkerts O, Delcher AL, Jaiswal P, Mockaitis K, Liston A, Mane SP, et al (2010) The genome of woodland strawberry (Fragaria vesca). Nature Genetics 43: 109–116 Singer T (2001) Robertson’s Mutator transposons in A. thaliana are regulated by the chromatinremodeling gene Decrease in DNA Methylation (DDM1). Genes & Development 15: 591–602 Singh R, Ming R, Yu Q (2013) Nucleotide Composition of the Nelumbo nucifera Genome. Tropical Plant Biology 6: 85–97 Slotkin RK, Freeling M, Lisch D (2003) Mu killer causes the heritable inactivation of the Mutator family of transposable elements in Zea mays. Genetics 165: 781–797 Slotkin RK, Freeling M, Lisch D (2005) Heritable transposon silencing initiated by a naturally occurring transposon inverted duplication. Nature Genetics 37: 641–644 Slotkin RK, Martienssen R (2007) Transposable elements and the epigenetic regulation of the genome. Nature Reviews Genetics 8: 272–285 Steinbiss S, Willhoeft U, Gremme G, Kurtz S (2009) Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Research 37: 7002–7013 Strommer JN, Hake S, Bennetzen J, Taylor WC, Freeling M (1982) Regulatory mutants of the maize Adh1 gene caused by DNA insertions. Nature 300: 542–544 Talbert LE, Chandler VL (1989) Characterization of a highly conserved sequence related to mutator transposable elements in maize. Mol Biol Evol 5: 519–529   30       Temin HM (1981) Structure, variation and synthesis of retrovirus long terminal repeat. Cell 27: 1–3 Tenaillon MI, Hollister JD, Gaut BS (2010) A triptych of the evolution of plant transposable elements. Trends in Plant Science 15: 471–478 Thomas CA (1971) The Genetic Organization of Chromosomes. Annual Review of Genetics 5: 237–256 Tsukahara S, Kawabe A, Kobayashi A, Ito T, Aizu T, Shin-i T, Toyoda A, Fujiyama A, Tarutani Y, Kakutani T (2012) Centromere-targeted de novo integrations of an LTR retrotransposon of Arabidopsis lyrata. Genes & Development 26: 705–713 Tsukahara S, Kobayashi A, Kawabe A, Mathieu O, Miura A, Kakutani T (2009) Bursts of retrotransposition reproduced in Arabidopsis. Nature 461: 423–426 Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, et al (2006) The Genome of Black Cottonwood, Populus trichocarpa (Torr. & Gray). Science 313: 1596–1604 Varshney RK, Chen W, Li Y, Bharti AK, Saxena RK, Schlueter JA, Donoghue MTA, Azam S, Fan G, Whaley AM, et al (2011) Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor farmers. Nature Biotechnology 30: 83–89 Velasco R, Zharkikh A, Affourtit J, Dhingra A, Cestaro A, Kalyanaraman A, Fontana P, Bhatnagar SK, Troggio M, Pruss D, et al (2010) The genome of the domesticated apple (Malus × domestica Borkh.). Nature Genetics 42: 833–839 Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Pruss D, Pindo M, FitzGerald LM, Vezzulli S, Reid J, et al (2007) A High Quality Draft Consensus Sequence of the Genome of a Heterozygous Grapevine Variety. PLoS ONE 2: e1326 Verde I, Abbott AG, Scalabrin S, Jung S, Shu S, Marroni F, Zhebentyayeva T, Dettori MT, Grimwood J, Cattonaro F, et al (2013) The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nature Genetics 45: 487–494 Vogel JP, Garvin DF, Mockler TC, Schmutz J, Rokhsar D, Bevan MW, Barry K, Lucas S, Harmon-Smith M, Lail K, et al (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463: 763–768 Volff J-N (2006) Turning junk into gold: domestication of transposable elements and the creation of new genes in eukaryotes. BioEssays 28: 913–922 Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun J-H, Bancroft I, Cheng F, et al (2011) The genome of the mesopolyploid crop species Brassica rapa. Nature Genetics 43: 1035–1039   31       Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, et al (2007) A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8: 973–982 Wikstrom N, Savolainen V, Chase MW (2001) Evolution of the angiosperms: calibrating the family tree. Proceedings of the Royal Society B: Biological Sciences 268: 2211–2220 Xu Q, Chen L-L, Ruan X, Chen D, Zhu A, Chen C, Bertrand D, Jiao W-B, Hao B-H, Lyon MP, et al (2012) The draft genome of sweet orange (Citrus sinensis). Nature Genetics 45: 59– 66 Xu X, Pan S, Cheng S, Zhang B, Mu D, Ni P, Zhang G, Yang S, Li R, Wang J, et al (2011) Genome sequence and analysis of the tuber crop potato. Nature 475: 189–195 Xu Z, Wang H (2007) LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35: W265–W268 Yamashita S, Takano-Shimizu T, Kitamura K, Mikami T, Kishima Y (1999) Resistance to gap repair of the transposon Tam3 in Antirrhinum majus: a role of the end regions. Genetics 153: 1899–1908 Yang G, Weil CF, Wessler SR (2006) A rice Tc1/mariner-like element transposes in yeast. Plant Cell 18: 2469–2478 Yin H, Liu J, Xu Y, Liu X, Zhang S, Ma J, Du J (2013) TARE1, a Mutated Copia-Like LTR Retrotransposon Followed by Recent Massive Amplification in Tomato. PLoS ONE 8: e68587 Young ND, Debellé F, Oldroyd GED, Geurts R, Cannon SB, Udvardi MK, Benedito VA, Mayer KFX, Gouzy J, Schoof H, et al (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature. doi: 10.1038/nature10625 Yu Z, Wright SI, Bureau TE (2000) Mutator-like elements in Arabidopsis thaliana. Structure, diversity and evolution. Genetics 156: 2019–2031 Zemach A, Kim MY, Hsieh P-H, Coleman-Derr D, Eshed-Williams L, Thao K, Harmer SL, Zilberman D (2013) The Arabidopsis nucleosome remodeler DDM1 allows DNA methyltransferases to access H1-containing heterochromatin. Cell 153: 193–205 Zhang G, Liu X, Quan Z, Cheng S, Xu X, Pan S, Xie M, Zeng P, Yue Z, Wang W, et al (2012) Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential. Nature Biotechnology 30: 549–554 Zhou L, Mitra R, Atkinson PW, Burgess Hickman A, Dyda F, Craig NL (2004) Transposition of hAT elements links transposable elements and V(D)J recombination. Nature 432: 995– 1001   32       Zhu Y (2003) From the Cover: Controlling integration specificity of a yeast retrotransposon. Proceedings of the National Academy of Sciences 100: 5891–5895 Zuccolo A, Sebastian A, Talag J, Yu Y, Kim H, Collura K, Kudrna D, Wing RA (2007) Transposable element distribution, abundance and role in genome size variation in the genus Oryza. BMC Evolutionary Biology 7: 152   33       CHAPTER 2: Selective Acquisition and Retention of Genomic Sequences by Pack-Mutator-Like Elements Based on GC Content and Breadth of Expression Copyright American Society of Plant Biologists www.plantphysiol.org Ferguson AA, Zhao D, Jiang N (2013) Selective Acquisition and Retention of Genomic Sequences by Pack-Mutator-Like Elements Based on GC Content and Breadth of Expression. Plant Physiology. pp.113.223271   34       Abstract The process of gene duplication followed by sequence and functional divergence is important for the generation of new genes. Pack-MULEs, nonautonomous Mutator-like elements (MULEs) that carry genic sequence(s), are potentially involved in generating new open reading frames and regulating parental gene expression. These elements are identified in many plant genomes and are most abundant in rice (Oryza sativa). Despite the abundance of Pack-MULEs, the mechanism by which parental genes are captured by Pack-MULEs remains largely unknown. In this study, we identified all MULEs in rice and examined factors likely important for sequence acquisition. Terminal inverted repeat MULEs are the predominant MULE type and account for the majority of the Pack-MULEs. In addition to genic sequences, rice MULEs capture GC-rich intergenic sequences, albeit at a much lower frequency. MULEs carrying nontransposon sequences have longer terminal inverted repeats and higher GC content in terminal and subterminal regions. An overrepresentation of genes with known functions and genes with orthologs among parental genes of Pack-MULEs is observed in rice, maize (Zea mays), and Arabidopsis (Arabidopsis thaliana), suggesting preferential acquisition for bona fide genes by these elements. Pack-MULEs selectively acquire/retain parental sequences through a combined effect of GC content and breadth of expression, with GC content playing a stronger role. Increased GC content and number of tissues with detectable expression result in higher chances of a gene being acquired by Pack-MULEs. Such selective acquisition/retention provides these elements greater chances of carrying functional sequences that may provide new genetic resources for the evolution of new genes or the modification of existing genes.   35       Introduction Transposable elements (TEs) are sequences in the genome that move from one location to another and in the process multiply in copy number. According to the transposition intermediate, TEs are classified into two major classes: class I or RNA elements, which transpose via an RNA intermediate using a copy-and-paste mechanism; and class II or DNA elements, which transpose via a DNA intermediate using a cut-and-paste mechanism. Based on their coding capacity for transposition machinery, both classes of TEs can be divided into autonomous and nonautonomous elements. Autonomous elements encode the protein products (transposase or reverse transcriptase) required for their transposition, whereas nonautonomous elements do not encode relevant proteins and rely on their cognate autonomous elements for transposition. TEs constitute over 50% of many plant genomes and as much as 85% of the maize (Zea mays) genome (Devos et al., 2005; Paterson et al., 2009; Schnable et al., 2009; Schmutz et al., 2010; Tomato Genome Consortium, 2012; Nystedt et al., 2013; Wu et al., 2013). In addition, computational and biological analyses of genomic information have revealed critical roles of transposons in gene expression, regulation, and genome evolution (Bennetzen and Kellogg, 1997; Lippman et al., 2004; Piegu et al., 2006; Ammiraju et al., 2007; Bennetzen, 2007; Slotkin and Martienssen, 2007; Zuccolo et al., 2007; Feschotte, 2008). The Mutator superfamily is a class II/DNA TE originally discovered in maize (Robertson, 1978). Since the initial discovery of Mu1 and MuDR in maize (Robertson,1978; Robertson et al., 1989), similar elements were later identified from the maize genome and subsequently in other organisms including plants, animals, and fungi, where they are referred to as Mutator-like elements (MULEs; Yu et al., 2000; Lisch et al., 2001; Chalvet et al., 2003; Jiang et al., 2004; Holligan et al., 2006; Marquez and Pritham, 2010). MULEs are typically   36       characterized by an 8- to 11-bp target site duplication (TSD) flanking the element, with 9-bp TSD as the most frequent form. In addition, the majority of these elements are known for the presence of long terminal inverted repeats (TIRs), which typically range from 100 to 500 bp, a feature that largely sets them apart from other major class II TEs such as En/Spm, Helitron, PIF/Pong, and Tc1/Mariner elements. MULEs associated with long TIRs are referred to as TIR MULEs. TIR sequences appear to be important for element transposition and expression (Benito and Walbot, 1997; Raizada et al., 2001; Jiang et al., 2004). Recently, however, non-TIR MULEs have been reported in Arabidopsis (Arabidopsis thaliana), Lotus japonicus, maize, and yeast (Yarrowia lipolytica; Yu et al., 2000; Neuvéglise et al., 2005; Holligan et al., 2006; Wang and Dooner, 2006). Non-TIR MULEs refer to the MULEs with exceptionally short TIRs (less than 50 bp) and low similarity between the inverted terminal sequences. The detection of non-TIR MULEs in multiple plants suggests that extended long TIRs are dispensable for the transposition of MULEs in plants. Although elements belonging to the same TIR MULE family share an overall sequence similarity in their TIRs, they vary in their internal region. The Mu family of maize that includes multiple elements (Mu1–Mu13) share a 220-bp sequence in their TIRs, but the internal region between the TIRs may contain unique and unrelated sequences (Chomet et al., 1991; Lisch, 2002; Lisch and Jiang, 2009). The Mu4 elements, for instance, have much longer TIRs compared with other Mu elements (530 bp long), and the TIR sequence includes a fragment from a BRASSINOSTEROID INSENSITIVE1 gene (Lisch, 2002). Thus, in addition to differences in the internal sequence, elements within a MULE family vary in their TIR lengths. Pack-MULEs are nonautonomous Mutator and MULEs that carry genes or gene fragments. Although the abundance of Pack-MULEs was not acknowledged until the availability of the entire rice (Oryza sativa) genomic sequence, the first Mutator element discovered (Mu1)   37       was, in fact, a Pack-MULE carrying a fragment of the MRS-A gene (Talbert and Chandler, 1988), as were the other nonautonomous Mutator elements (Lisch, 2002). To date, Pack-MULEs have been characterized in both monocots and dicots, including rice, maize, L. japonicus, Arabidopsis, tomato (Solanum lycopersicum), and sacred lotus (Nelumbo nucifera ; Yu et al., 2000; Jiang et al., 2004; Hoen et al., 2006; Holligan et al., 2006; Schnable et al., 2009; Ferguson and Jiang, 2012; Ming et al., 2013), suggesting their prevalence among plants. The genes from which gene sequences or fragments are captured are referred to as parental genes, and the captured fragment is referred to as the acquired sequence. Previous work identified 2,853 PackMULEs in rice that have transduced about 1,500 parental genes (Jiang et al., 2011). Comprehensive analyses showed that over 22% of rice Pack-MULEs are transcribed, with at least 28 elements having evidence of translation (Hanada et al., 2009). These elements often carry gene fragments from multiple loci, forming new open reading frames (ORFs). In addition to the formation of independent ORFs, Pack-MULEs can serve as part of the ORF and/or untranslated region that fuses with adjacent sequences/genes to form chimeric transcripts (Jiang et al., 2011). Pack-MULE transcripts are found in either orientation with regard to the transcription of the parental gene, with a small subset having bidirectional transcription. The formation of antisense transcripts suggests a critical role for Pack-MULE-derived transcripts in regulating the expression of parental genes through the activity of small RNAs (Hanada et al., 2009). In fact, over half of Pack-MULEs in rice are directly involved in the formation of small RNAs. Parental genes that have shared small RNAs with Pack-MULEs show lower expression levels compared with genes without an association with small RNAs (Hanada et al., 2009). Thus far, rice has remained exceptional in its Pack-MULE copy number load. Another advantage of studying Pack-MULEs in rice is the unparalleled quality of its reference genome sequence,   38       which was accomplished using the traditional bacterial artificial chromosome-by-bacterial artificial chromosome sequencing technology (International Rice Genome Sequencing Project, 2005). Despite progress in MULE and Pack-MULE identification in sequenced higher eukaryotes, the process by which parental genes are captured by these elements remains to be elucidated. Thus far, two probable mechanisms have been proposed. Bennetzen and Springer (1994) suggested a model (model 1) similar to an ectopic gene conversion across a nickedcruciform structure. Here, ectopic sequences are introduced into the internal region of the element during repair of the nick within the loop. According to this model, acquisition may or may not require the presence of transposase. The second model (model 2) proposes an aberrant gap-repair process that uses ectopic sequences as template during the repair of the empty site. In this model, an excision event is necessary, and the acquisition of new sequences occurs upon the repair of the gap at the donor site (Yamashita et al., 1999). As a result, the acquisition requires the presence of transposase but is not associated with transposition of the element. Both models predict the involvement of short stretches of homology between the broken ends and the new genomic sequence not previously associated with the element, which ultimately becomes incorporated in the internal region. Although neither of the two models has any empirical support at this time, computational analysis of Pack-MULEs in rice, maize, and Arabidopsis has shed some light on the acquisition process. A phenomenon that likely extends to all grass genomes, where significant GC islands and gradients persist, is the preferential acquisition of GC-rich sequences by Pack-MULEs (Jiang et al., 2011). In this study, a comprehensive analysis of all MULEs in the rice genome, including PackMULEs, was performed to further understand how Pack-MULEs select and acquire parental   39       gene sequences. The results from this study indicate that element TIR and sub-TIR properties differ between Pack-MULEs and non-Pack-MULEs and may be involved in target selection and acquisition. Analysis of the parental genes of Pack-MULEs in rice, maize, and Arabidopsis supports the role of GC content and ubiquity in the expression of the parental genes in sequence acquisition, which explains the significant preference of MULEs to duplicate genic sequences. Methods Identification of MULEs and Pack-MULEs The sequences for rice (Oryza sativa subsp. japonica ‘Nipponbare’) pseudomolecules and gene annotation information were downloaded from the Rice Genome Annotation Project at Michigan State University (http://rice.plantbiology.msu.edu/; release 7.0). The rice TIR library was built using repeats generated by RECON (Bao and Eddy, 2002). Prior to the identification of MULEs and Pack-MULEs, MULE TIRs were classified as TIR MULEs or non-TIR MULEs. MULE families whose TIRs are at least 50 bp in length with at least 60% similarity are considered as TIR MULEs. However, MULE families less than 150 bp in size (small elements) where TIR length is at least 40 bp and their terminal sequence is related to a TIR MULE were also classified as TIR MULEs, because the short TIR is due to deletion and not to phylogeny origin. All other families are considered non-TIR MULEs. The procedure for the annotation and identification of Pack-MULEs was similar to that described previously with some modifications (Hanada et al., 2009). Briefly, genomic sequence was masked with MULE TIR library and all possible TIR pairs within 20 kb of each other were examined. The annotation of other MULEs was similar to that of Pack-MULEs, except that there is no requirement for the internal region of MULEs to match proteins. Auto-MULEs are MULEs with matches to previously known MULE transposases. Elements with flanking 9- to 11-bp TSD with no more than two mismatches (or 1-   40       bp mismatch plus 1-bp insertion/deletion) were accepted for further classification and analysis. For elements with 8-bp TSD, only 1-bp mismatch or 1-bp insertion/deletion was accepted. The presence of TSD for non-Pack-MULEs was detected by custom perl scripts, and a maximum 10bp swing from the putative element ends was allowed. For all elements with parental copies, TSD was verified by manual examination of elements and flanking sequences. The identification of the parental origin of the sequences captured by MULEs and PackMULEs was conducted as described previously (Jiang et al., 2011). For an individual PackMULE, the sequence with the highest similarity score (BLASTN, E = 1e-10) that was not associated with a MULE TIR was considered as the parental copy of the internal sequence in a Pack-MULE. Elements without matching any proteins that did not contain a recognizable parental genomic sequence were classified as MULE-other or non-Pack-MULEs. Elements with recognizable nongenic parental sequences were classified as MULE-intergenic. Elements with hits only to hypothetical proteins and without parental sequences were classified as MULEHypProt. TIR and Sub-TIR Analyses Since the majority of Pack-MULEs belong to TIR-MULEs, TIR and sub-TIR analyses were performed only on the TIR-MULEs. To identify the TIR length of each individual element, the terminal 800-bp sequence (or half of the element if the element is shorter than 1,600 bp) of each element was aligned using DIALIGN2 (Morgenstern, 2004). A custom perl script was used to determine the length of the TIR on each side, whereby considerable sequence alignment falls off. The sub-TIR was defined as the 50-bp sequence immediately following the TIR, as determined previously. The GC content of each individual TIR and sub-TIR sequence was calculated using a custom perl script. Calculations of sub-TIR free energy were performed using   41       UNAFold (Markham and Zuker, 2008), available at http://mfold.rna.albany.edu. The statistical difference between each group was examined using the R package (http://www. r-project.org). A Bonferroni correction was applied to account for multiple comparisons. Analysis of GC Content To calculate the GC content of MULEs and Pack-MULEs, nested TE insertions were first curated and removed from the element sequence. Determination of the GC content of parental genes was conducted after masking with the rice repeat library that excluded Pack-MULEs. To calculate the GC gradient along MULE sequences, the TIR sequences (on both ends of the elements) and the internal region (the sequences between the TIRs) were divided into two and 10 equal-sized bins, respectively. A custom perl script was used to determine the GC content of each bin. Comparisons of GC content between groups were performed using the R package (http://www.r-project.org). Gene Functional and Expression Analyses The biological process GOSlim assignments and RNA-Seq expression data for rice genes were downloaded from the Rice Genome Annotation Project at Michigan State University (http://rice.plantbiology.msu.edu/). GOSlim categories were calculated such that a total count of 1 was generated from each gene; that is, genes with multiple GOSlim assignments were given an equal proportion totaling to 1. To classify expressed genes, only RNA-Seq libraries with calculated FPKM values were used, to avoid misclassifying background or noise reads from expression calls made using a single read to a gene. Genes were considered expressed if the FPKM values were 1 or greater in at least one expression library. The maize (Zea mays) filtered gene set sequence (release 5b) and functional annotation were downloaded from the maize sequencing project (http://www.maizesequence.org). The   42       Arabidopsis (Arabidopsis thaliana) gene sequences and functional annotation were downloaded from The Arabidopsis Information Resource 10 (http://www.arabidopsis.org). The genes were classified as “known” if a functional annotation is available; otherwise, the genes were classified as unknown. Maize RNA-Seq expression data were obtained from previously published work (Davidson et al., 2011). Evaluation of the expression of maize genes was similar to rice (FPKM ≥1 in at least one expression library) from RNA-Seq, which includes 13 different expression libraries. Since a similarly comprehensive RNA-Seq expression library is not readily available for Arabidopsis, the MPSS data set from eight different expression libraries was downloaded (http://mpss. udel.edu/at/mpss_index.php; Meyers et al., 2004a, 2004b). To determine the expression patterns of genes in Arabidopsis, eight libraries were used, and a gene was classified as expressed if the TPM values were 5 or greater in at least one expression library. Comparisons of various expression parameters among groups were performed using the R package (http://www.r-project.org). To determine the distribution of parental genes among non-TE genes with and without orthologs, Arabidopsis and rice gene orthologous data were downloaded from the Rice Genome Annotation Project at Michigan State University (http://rice.plantbiology.msu.edu/; Lin et al., 2010; Davidson et al., 2012). For maize genes, data were downloaded from the maize sequencing project (http://www.maizesequence.org; Schnable et al., 2009). Age of Acquisition Events To roughly estimate the age of the genic and intergenic acquisition events, Pack-MULE and MULE-intergenic sequences was aligned to parental sequences using BLASTN (M = 5, N = 211, Q = 22, R = 11, E = 1e-10, wordmask = dust, wordmask = seg, hspsepSmax = 100, hspsepQmax = 100) to determine the boundary of the alignable region. Subsequently, each pair   43       of alignable sequences were aligned using MUSCLE (Edgar, 2004), and the output was further processed by custom perl scripts to calculate the number of transversion events between aligned sequences as well as the transversion rate for each sequence pair. An average transversion rate was assigned for parental genes that were acquired by multiple Pack-MULEs. Results Rice MULEs Preferentially Acquire Genic Sequences To understand the mechanism of sequence acquisition by Pack-MULEs, we compared Pack-MULEs with MULEs that do not carry non-TE genomic sequences. To this end, we established a procedure to collect all MULEs in the rice genome, which resulted in a total of 13,857 MULEs with TSDs (Tables 2.1 and 2.2). MULEs were categorized into TIR MULE and non-TIR MULE according to a distinct similarity and length of TIRs (see “Materials and Methods” ). Among the MULE elements with TSDs, 87% were TIR MULEs, suggesting that this MULE type is more predominant than the non-TIR MULEs. If the internal region of a MULE has a non-TE genomic homolog, we call the genomic homolog the parental copy or parental gene (if it is from the genic region; see below). According to the internal sequence contained within the TIR, MULEs were further classified into five groups: (1) Pack-MULEs, as defined previously (Jiang et al., 2004), refers to elements containing genic sequence(s) (Supplemental Table S1); (2) MULE-intergenic refers to elements with a non-TE parental copy located in intergenic regions (Supplemental Table S2); (3) MULEother or non-Pack-MULEs are elements whose internal sequences have no identifiable parental origin/sequence (Supplemental Table S3); (4) Auto-MULEs are elements containing sequences with homology to known Mutator/MULE transposases (Supplemental Table S4); and (5) MULE-HypProt are elements containing annotated hypothetical genes or with homology to   44       hypothetical genes yet without a recognizable parental copy (Supplemental Table S5). MULEHypProt could represent ancient sequence acquisitions where the internal regions are too diverged or evolved to allow the identification of the parental copies. Alternatively, it is a result of misannotation from an automated gene annotation pipeline. The non-Pack-MULEs in each MULE type were subsequently categorized into two groups based on whether the TIR family is involved in sequence acquisition. PMTIR refers to TIR families that contain or include PackMULEs, while non-PMTIR refers to TIR families that contain exclusively non-Pack-MULEs. Among the 13,857 MULEs identified, 2,924 (21.1%) carry gene or gene fragments, suggesting that the majority of MULEs do not acquire genes (Tables 2.1 and 2.2). A total of 251 TIR families were identified in the rice genome, which included 186 TIR MULEs and 65 nonTIR MULEs. Among these TIRs, 122 were associated with sequence acquisition (referred to as PMTIR). The copy numbers of Pack-MULEs range from one to 1,002 elements/copies per TIR family (Fig. 2.1, A and C; Supplemental Table S1). The TIR family with the most family members, Os0037, has a total of 1,151 elements, with the majority being Pack-MULEs (87%). Pack-MULEs identified are predominantly of the TIR MULE type (96.2%), suggesting that MULEs with typical long TIRs are more likely to be associated with gene sequence acquisition. This is also true if the abundance of Pack-MULEs is corrected by the total copy number: 23% of the TIR MULEs are Pack-MULEs, while only 6% of the non-TIR MULEs carry gene fragments. Nevertheless, regardless of the MULE type, the composition of Pack-MULEs and non-PackMULEs across different MULE TIR families that vary in total copy numbers suggests that the abundance of Pack-MULEs is not correlated to the abundance of the family in the genome (Fig. 2.1, B and D). In other words, TIR families with high copy numbers are not more likely and frequently to acquire gene fragments than families with fewer copies. Meanwhile, 129 TIR   45       families were devoid of Pack-MULEs (non-PMTIR), comprising a total copy number of 4,953 elements (Supplemental Fig. S1, A and B; Supplemental Table S3). From the 2,924 Pack-MULEs, 1,557 unique parental genes were identified (Supplemental Table S6). Among the Pack-MULEs, 63 also contain intergenic sequences in addition to genic sequences. In addition, 22 MULE-intergenic elements were found (Supplemental Table S2), and all of them are associated with PMTIR. The intergenic components of the 63 Pack-MULEs and 22 MULE-intergenic elements are derived from a total of 60 intergenic parental sequences, suggesting that MULEs can acquire sequences other than genes, albeit at a much lower frequency. To test whether the dearth of intergenic sequence acquisition is a result of a lower proportion of the genome being the source of this type of parental sequences, we calculated the total genic and intergenic space of the rice genome. The intergenic space (79 Mb) is roughly 68% of the size of the genic space (116 Mb). However, there are about 26 times more genic parentals compared with intergenic parentals, and among Pack-MULEs, even more elements (45 times) have acquired only genes compared with those that acquired both genic and intergenic sequences. The underrepresentation of intergenic sequences among acquisitions by MULEs suggests that genic sequences are preferentially acquired. Alternatively, this may indicate that, compared with the genic components in Pack-MULEs, the intergenic fragments in Pack-MULEs or MULE-intergenics have less selective advantage, so their retention time is shorter. If the latter was the case, one would expect to see more intergenic sequences among newer acquisition events. To test this, the age of acquisition events was roughly estimated based on the transversion rate (the amount of transversion that has occurred between the alignable length of the acquired sequence and the parental sequence). Sequence transversion rate was chosen, as it is   46       a better indicator of age compared with either sequence similarity or transition rate. This is because the transition rate is correlated with GC content in addition to age. The median transversion rate of all acquired sequences (genic and nongenic) was used as the cutoff to classify relatively old (transversion rate > 2.75%) and recent (transversion rate ≤ 2.75%) events. The results show no significant difference in the number of intergenic acquisition events between recent and old acquisitions (2.91% versus 2.94%). Thus, a potential lack of selective advantage does not explain the dramatic underrepresentation of intergenic regions inside MULEs. Structural Differences between Pack-MULEs and Non-Pack-MULEs Since non-TIR MULEs do not have well-defined inverted terminal regions and only account for a minor portion of the Pack-MULEs, comparisons of structural differences were limited to elements classified as TIR MULEs. A variety of differences were observed when the sequences of Pack-MULEs were compared with those of non-Pack-MULEs. Overall, PackMULEs have a much higher GC content compared with non-Pack-MULEs (median, 58.2% versus 36.5%; P < 2.2 x 10-16,Wilcoxon rank-sum test [WRS]) and are much longer (1,445 versus 441 bp; P < 2.2 x 10-16, WRS). To evaluate the GC gradient along the elements, the TIR regions of each element were divided into two equal-sized bins, while the internal regions were divided into 10 equal-sized bins. As shown in Figure 2.2, both TIR and internal sequences of Pack- MULEs are more GC rich than those of non-Pack-MULEs. In addition, a steeper increase in GC content (15% increase) is observed from bin 1 to bin 3 of Pack-MULEs. Furthermore, properties among previously deemed critical regions for sequence acquisition, TIR and sub-TIR, were compared between Pack-MULEs and non-Pack-MULEs.   47       Table 2.1. Copy numbers and percentage of different classes of TIR MULEs in the rice genome Element type TIR type Internal Region Protein match Parental copy Pack-MULEs (PM) PM-genic PM-plusintergenic MULE-intergenic MULE-HypProt MULE-other (non-Pack-MULEs) AutoMULEs 2812 (23.21) PM TIR known protein genic sequence genic and intergenic sequence intergenic sequence   2755 PM TIR known protein PM TIR N/A PM TIR/nonPM TIR hypothetical protein N/A 1196 (9.87) PM TIR N/A N/A 3695 (30.50) nonPM TIR N/A N/A 3915 (32.32) PM TIR/nonPM TIR MULE transposase N/A 479 (3.95) Total a Copy a number 57 17 (0.14) 12114 Numbers in parenthesis represent percent of total copy number. 48       Table 2.2. Copy numbers and percentage of different classes of non-TIR MULEs in the rice genome Element type Internal Region TIR type Protein match Parental copy Pack-MULEs (PM) 112 (6.42) PM-genic PM TIR known protein PM-plusintergenic PM TIR known protein PM TIR N/A PM TIR/nonPM TIR hypothetical protein N/A 119 (6.88) PM TIR N/A N/A 428 (24.54) nonPM TIR N/A N/A 1038 (59.52) PM TIR/nonPM TIR MULE transposase N/A 41 (2.35) MULE-intergenic MULE-HypProt MULE-other (non-Pack-MULEs) AutoMULEs genic sequence genic and intergenic sequence intergenic sequence Total a   Copy a number 106 6 5 (0.29) 1743 Numbers in parenthesis represent percent of total copy number. 49       Figure 2.1. Partition of Pack-MULEs and nonPack-MULEs among TIR MULE and non-TIR MULE families in the rice genome. (A) Copy Number and Pack-MULE distribution in TIR-MULE families associated with gene acquisition. (B) Percent Pack-MULEs and non-Pack-MULEs of total copy number for TIR-MULE families associated with gene acquisition. (C) Copy Number and Pack-MULE distribution in non-TIR MULE families associated with gene acquisition. (D) Percent Pack-MULEs and non-Pack-MULEs of total copy number for non-TIR MULE families associated with gene acquisition. The corresponding names of MULE families for A and B are found on Appendix Table 1.   50       Figure 2.1 (cont’d)   51       Figure 2.1 (cont’d)   52       In our analysis, the sub-TIR was defined as the 50-bp sequence adjacent to the TIR. As shown in Figure 2.3, Pack-MULEs have significantly longer TIRs, higher TIR and sub-TIR GC content, and stronger sub-TIR free energy than non-Pack-MULEs (P < 2.2 x 10-16, WRS). If only elements within PMTIR families are considered, there are more non-Pack-MULEs than PackMULEs (Table 2.1), yet the TIRs of Pack-MULEs are still longer, with higher GC content in TIR and sub-TIR regions (Fig. 2.3; P < 2.2 x 10-16, WRS). This suggests that longer TIR and higher GC content are not required or favorable for transposition but may be important in sequence acquisition. Alternatively, these differences may also be a product of a positive feedback mechanism through the acquisition of GC-rich sequences that in some cases can be converted as part of the TIR (as in the case for the Mu4 element mentioned in the Introduction), therefore resulting in longer and more GC-rich TIRs and GC-rich sub-TIRs with stronger free energies. However, analysis of GC content using only the first 100-bp sequence of the PackMULE TIRs, a size more similar to the average TIR length of non-Pack-MULEs, shows that even the most terminal end of Pack-MULEs is more GC rich than TIRs of non-Pack-MULEs (P < 2.2 x 10-16, WRS; Fig. 2.3B). Since this region is distal to the internal region, it is unlikely that the higher GC content in this region in Pack-MULEs is a direct or an immediate consequence of acquisition. However, it could be an indirect consequence or result of selection if the higher GC content of TIRs promotes acquisition. Since previous work has reported the importance of GC content in the acquisition of genic sequences by Pack-MULEs (Jiang et al., 2011), we tested the role of GC content in the acquisition of intergenic sequences by MULEs. Intergenic parental sequences of MULEs are significantly more GC rich than the overall TE and intergenic sequence of the genome (P = 1.687 x 10-15 and P = 2.59 x 10-12, respectively, WRS; Fig. 2.4). Similarly, Pack-MULE parental   53       genes are significantly more GC rich than the overall genic sequence of the genome (P < 2.2 x 10-16, WRS; Fig. 2.4), suggesting that the preference for GC-rich sequences applies to both genic and nongenic regions. Underrepresentation of Genes with Unknown Function among Parental Genes Although Pack-MULEs preferentially acquire GC-rich genes, it is not known whether they also prefer certain classes of genes or if acquisition based on gene function is random. If acquisition is random, we would expect no differences in the ratio of non-TE genes and PackMULE parental genes for each functional category. To test this hypothesis, the ratio of non-TE genes and rice parental genes among different GOSlim assignments of biological processes was evaluated using functional assignments and annotations made by the Rice Genome Annotation Group at Michigan State University (Kawahara et al., 2013). A total of 32 biological process categories, which includes “ unknown” for genes without an assignment, were compared between Pack-MULE parental genes and non-TE genes. As shown in Figure 2.5, a slight overrepresentation of genes involved in biosynthetic and metabolic processes (X2 test, P < 2.2 x 10-16) and a strong bias against genes with unknown classification among parental genes (X2 test, P < 2.2 x 10-16) were observed. This slight preference for a few categories dissipates, however, when the unknown category is excluded from the analysis (Supplemental Fig. S2). A comparison was also conducted in maize and Arabidopsis to determine whether such a bias against genes with unknown function exists in other plant species where Pack-MULEs have been characterized. In both species, a significant underrepresentation of genes with unknown function among Pack-MULE parental genes was also found (X2 test, maize, P < 2.2 x 10-16, Arabidopsis, P = 0.03; Tables 2.3 and 2.4).   54       Pack-MULE non-Pack-MULE 70 GC content (%) 60 50 40 30 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 bin Figure 2.2. GC content along Pack-MULEs and non-Pack-MULEs. The first 2 and last 2 bins represent TIR regions and the internal sequence was divided into 10 equal-sized bins prior to determination of GC content per bin.   55       A B 50 200 e TIR GC content (%) TIR Length (bp) 250 f 150 g 100 50 0 f g 40 h 30 20 10 0 D 60 50 40 e sub-TIR free energy (kcal/ mol) sub-TIR GC content (%) C     e f g 30 20 10 0 0 -1 -2 -3 -4 -5 f e g Figure 2.3. Structural difference between Pack-MULEs and non-Pack-MULEs based on TIR and sub-terminal (sub-TIR) sequences. (A) Median TIR length. (B) Median TIR GC content. (C) Median sub-TIR GC content. (D) Median sub-TIR free energy. PM: PackMULE; nPM-PMTIR: non-Pack-MULEs with PM associated TIRs; nPM-nPMTIRs: nonPack-MULEs with non-Pack-MULE exclusive TIRs; PM100: using only first 100 bp sequence of Pack-MULE TIRs. Bars designated with different letters indicate their values are significantly different (α=0.008 for B and α=0.02 for A, C and D) by Wilcoxon Rank Sum Test (WRS) with Bonferroni correction.   56       70 GC content (%) 60 c 50 40 a cd def f e b 30 20 10 0 Figure 2.4. GC content of different genomic sequences in the rice genome. Genome average GC content is indicated by a dashed line. Bars designated with different letters indicate their values are significantly different (α= 0.002) by Wilcoxon Rank Sum Test (WRS) with Bonferroni correction.   57   Percent of total   40 35 30 25 20 15 10 5 0 unknown other biological process metabolic process response to stress cellular process biosynthetic process DNA metabolic process protein modification transport protein metabolic process reproductive development signal transduction cellular component multicellular organismal carbohydrate metabolic lipid metabolic process anatomical structure cell death generation of precursor photosynthesis cell cycle secondary metabolic cellular homeostasis cell differentiation cell growth cellular communication epigenetic processes growth tropism abscission behavior ripening     %nonTE genes % Parental genes Figure 2.5. Percent of GOSlim categories of Pack-MULE parental genes and all non-TE genes of rice. 58         To understand the mechanism underlying the apparent bias against genes without a known function, we compared the GC content among genes with and without a GOSlim assignment, since it is known that GC richness is favored in sequence acquisition/retention. Among non-TE genes, those without a GOSlim assignment have significantly higher GC content than counterparts with a GOSlim assignment both at the genomic and coding sequence (CDS) levels (genomic, P < 2.2 x 10-16; CDS, P < 2.2 x 10-16, WRS; Table 2.5). Among Pack-MULE parental genes, the GC content difference between GOSlim genes and unknown genes was detectable at the genomic sequence level (P = 0.001, WRS; Table 2.5) but not significant at the CDS level. In all four comparisons, genes with unknown category have higher or comparable GC content than those with assigned function(s). These results suggest that the gene GC content does not explain the underrepresentation of genes with unknown biological function among PackMULE parental genes in rice. Similarly, maize non-TE genes with unknown function have significantly higher GC content than those with known function (genomic, P < 2.2 x 10-16; CDS, P = 5.589 x 10-16, WRS; Table 2.3). In Arabidopsis, genes with unknown function had significantly higher GC content than genes with known function only at the genomic level (P = 0.01; Table 2.4). These data show that acquisition bias against genes with unknown function is not species specific and supports the notion that GC content does not explain this finding. Some of the genes with unknown function might be the result of misannotation. Thus, it is feasible that sequences misannotated as genes are overrepresented within the unknown group and that the apparent bias against them indicates a preference for bona fide genes. To test this, we surveyed the distribution of parental genes among non-TE genes with and without an ortholog (Schnable et al., 2009; Lin et al., 2010; Davidson et al., 2012), since genes with orthologs are more likely bona fide genes. For both rice and maize, genes with orthologs are   59       Table 2.3. GC content and expression information among Pack-MULE parental genes and non-TE genes in maize according to functional assignment. Gene Total %GCgenomic %GC-CDS FPKM* # of library* no expression % no expression nonTEgene-unknown 13057 50.10b 56.50b 39.60a 11.00a 4196 32.1d nonTEgene-known 26599 47.50a 55.90a 70.14b 12.00b 2873 10.8b PMPar- unknown 47 57.20c 63.60ac 45.43b 12.00b 2 4.3a PMPar- known 188 55.45c 63.50c 62.71b 12.00b 7 3.7a *Median determined only from genes that are expressed. PMPar – Pack-MULE parental gene; Numbers in each column followed by different letters are significantly different (α=0.008 with Bonferroni adjustment)   60       Table 2.4. GC content and expression information among Pack-MULE parental genes and non-TE genes in Arabidopsis according to functional assignment. Gene Total %GCgenomic %GC-CDS TPKM* # of library* no expression % no expression nonTEgene-unknown 10239 39.50b 43.50a 33.83a 5.00a 4584 44.8d nonTEgene-known 17147 39.30a 44.40b 54.33b 6.00b 3088 18.0c PMPar- unknown 7 41.80ab 43.20ab 46.92ab 2.50ab 1 14.3b PMPar- known 28 39.80ab 44.65ab 41.80ab 6.00ab 2 7.1a *Median determined only from genes that are expressed. PMPar – Pack-MULE parental gene Numbers in each column followed by different letters are significantly different (α=0.008 with Bonferroni adjustment)           61       Table 2.5. GC content and expression information among Pack-MULE parental genes and non-TE genes in rice according to GOSlim assignment Gene Total %GCgenomic %GCCDS FPKM* # of library* no expression % no expression nonTEgene-NoSLIM 12010 51.80b 59.50b 5.81a 6.0a 6541 54.46d nonTEgene-WithSLIM 23609 45.20a 55.00a 9.98b 8.0c 3671 15.55b PMPar-NoSLIM 250 59.00d 68.30c 6.79a 7.0ab 60 24.29c PMPar-WithSLIM 1334 56.40c 67.30c 9.41b 7.0b 145 11.10a *Median determined only from genes that are expressed. (PMPar – Pack-MULE parental gene; NoSlim – genes without a GOSlim assignment; WithSlim – genes with a GOSlim assignment Numbers in each column followed by different letters are significantly different (α=0.008 with Bonferroni adjustment)   62       significantly enriched among Pack-MULE parental genes (X2 test, rice, P < 2.2 x 10-16; maize, P = 4.152 x 10-14; Supplemental Table S7). For Arabidopsis, an enrichment is observed among parental genes (91% have orthologs; Supplemental Table S7), yet this overrepresentation is not statistically significant, most likely due to the low number of parental genes. The Effect of Gene Expression on Sequence Acquisition and Its Interaction with GC Content Since GC content does not explain the discrepancy in gene acquisition preference mentioned above, other factors that may influence sequence acquisition were explored. The role of gene expression in rice was tested using RNA-Seq data (Michigan State University Rice Genome Annotation Group) from 10 different rice developmental stages encompassing diverse vegetative and reproductive tissues (Davidson et al., 2012). A gene was considered expressed if the fragments per kilobase of exon per one million fragments mapped (FPKM) value was 1.0 or greater in at least one expression library; otherwise, the gene was categorized as not expressed. Over one-half (54%) of non-TE genes without a GOSlim assignment were not expressed, while only 15% of non-TE genes with known functions were not expressed (Table 2.5). Meanwhile only 24% of Pack-MULE parental genes without a GOSlim assignment and even fewer, 11%, of those with GOSlim assignments were not expressed, suggesting that gene expression may play a role in the preference for acquisition. The role of gene expression was also evaluated in maize and Arabidopsis. Maize RNA-Seq expression data were obtained from a previous study (Davidson et al., 2011), and expression was determined using parameters similar to rice. Since similarly comprehensive RNA-Seq expression data generated from a single experiment were not readily available in Arabidopsis, we utilized the massively parallel signature sequencing data set with expression levels of genes from multiple tissues (Meyers et al., 2004a, 2004b). Using only   63       uniquely mapping signatures, a gene was considered expressed if the transcripts per one million (TPM) value was 5.0 or greater in at least one expression library. In both species, significantly more genes with unknown function were not expressed compared with genes with known function (X2 test, rice, P < 2.2 x 10-16; Tables 2.3 and 2.4). However, the number of nonexpressed genes from either category was much lower among parental genes than the genomic average, suggesting that the underrepresentation of genes with unknown function among parental genes is connected to the lack of expression of unknown genes in all three species. To further assess the role of gene expression, the level of expression, determined by the FPKM/TPM value, and the number of tissues with expression were compared among different groups of genes. In all three species, expressed genes with unknown function have significantly lower expression levels (P < 2.2 x 10-16, WRS) and fewer tissues with detectable expression (P < 2.2 x 10-16, WRS) than those with a GOSlim assignment (Tables 2.3-2.5). The expression levels of parental genes of PackMULEs do not significantly differ from the genomic average, with the exception of the maize parental genes with unknown function, which showed a significantly higher expression level than the genomic average (P = 5.923 x 10-6, WRS). Interestingly, parental genes with or without a known function were expressed in similar numbers of tissues in all three species (Tables 2.32.5), suggesting that the breadth of expression is critical to sequence acquisition. Thus, the high percentage of genes with no expression and genes with less ubiquitous expression explains why genes without a GOSlim assignment are underrepresented in the parental genes of Pack-MULEs. To determine whether the roles of GC content and the breadth of gene expression on sequence acquisition are independent, we categorized rice genes into different GC content groups (low, moderate, and high) as well as different expression categories (no/low, moderate, and high) based on the number of tissues with detectable expression (FPKM ≥ 1.0) and   64       determined the proportion of Pack-MULE parental genes within each group. Although the results above suggest that expression plays a similar role in acquisition preference in maize and Arabidopsis, the analysis in this section was limited to the rice data, due to the much lower number of parental genes in the other two species. Our results in rice indicate that both GC content and the number of tissues with expression evidence play a role in sequence acquisition preference by Pack-MULEs. The ratio of parental genes among non-TE genes was used to reflect the acquisition frequency (how frequently a certain group of genes was acquired). As shown in Figure 2.6A, for low GC genes and moderate GC genes, a very minimal and modest increase in the proportion of parental genes, respectively, may be observed, with increase in the number of tissues with expression. In comparison, among high GC genes, a stronger increase in the ratio of parental genes occurs with more expression libraries. Meanwhile, when genes are categorized according to the number of tissues with expression, a more substantial increase in the ratio of parental genes is observed in all three expression groups as GC levels increase, and the increase is much greater among genes expressed in eight to 10 tissues (Fig. 2.6B). It is clear, however, that GC content plays a more dominant role than gene expression for sequence acquisition/retention. This is because variation of GC content may lead to as much as an 11-fold change in the percentage of parental genes, while that for gene expression is only 2- to 5-fold. In addition, the increase in GC content is accompanied by a boost in the percentage of parental genes, despite their expression patterns. In contrast, the effect of gene expression on the percentage of parental genes is only substantial when the genes have moderate or high GC content (Fig. 2.6A). It is also interesting that the effect of gene expression plateaued with expression in seven or more tissues (Fig. 2.6A). That explains why the median value of the number of tissues (seven tissues) with expression for parental genes with known function is   65       slightly lower than the genomic average (eight tissues; Table 2.5) in rice, because more ubiquitous expression (in more than seven tissues) does not confer additional advantage for acquisition. The Enrichment of GC-Rich Sequences inside Pack-MULEs Is Due to Selective Acquisition and Preferential Retention The apparent preference for higher GC content and relatively ubiquitous expression in sequence acquisition in rice, however, can be an artifact of selection, since sequences with higher GC content and more ubiquitous expression are more likely derived from coding regions and, thus, are more likely to be functional. If that is the case, one would expect the preference to be more dramatic among old than among recent acquisition events. Again, the age of acquisition events was roughly estimated through the transversion rate between the acquired sequence and the parental gene. Parental genes were separated into two groups: recent acquisitions, those with transversion rate of 2.75% or less; and old acquisitions, those with transversion rate of 2.75% or greater. As shown in Figure 2.7A, the two groups of parental genes show an overall similar percentage with increasing number of tissue expression, suggesting that the number of expressed tissues does not have a significant influence on the retention of their gene fragments. In contrast, there are significantly more parental genes in old acquisition events (transversion rate > 2.75%) compared with recent events among genes with a GC content of 69% to 82% (Fig. 2.7B), suggesting that selection may play a role in the apparent enrichment of parental genes with extremely high GC content. To further characterize the impact of gene GC content on the retention of the relevant gene fragments, we tested the relationship of GC content and transversion rate of all rice parental genes and found a low, albeit statistically significant, correlation (0.09; P = 0.0003, Spearman; Fig. 2.7C); that is, the GC content of parental genes   66       progressively increases with transversion rate between Pack-MULEs and the parental genes. Again, this indicates that selection plays a role in the retention of GC-rich genes. To obtain the best possible assessment of the GC content of parental genes upon acquisition, we calculated the GC content of the 14 parental genes with a 0% transversion rate. Theoretically, these sequences represent the most recent acquisition events and have been subjected to little selection. The GC content of all of them is higher than 50%, and the average value is 66.1%, which is dramatically higher than the genome average GC content (45.6%) of non-TE genes. This fact, together with the minor increment of GC content of parental genes over evolutionary time (Fig. 2.7C), suggests a strong preference for GC-rich genes upon acquisition. Taken together, our results suggest that selection may play a role in the retention of fragments from different parental genes, although it is insufficient to fully explain the enrichment of GCrich genes among parental genes of Pack-MULEs. Discussion The process of gene duplication followed by sequence and functional divergence (neofunctionalization) is one of the most important means for the generation of new genes (Flagel and Wendel, 2009). Studies have shown that all major families of TEs are involved in gene duplication in plants (Jiang et al., 2004; Kawasaki and Nitasaka, 2004; Morgante et al., 2005; Zabala and Vodkin, 2005; Wang et al., 2006; Schnable et al., 2009). In the rice genome, over 1,500 parental genes have been transduced by Pack-MULEs, which can generate independent or chimeric transcripts when fused with nearby sequences (Jiang et al., 2011). In addition, these transcripts may regulate parental gene expression, suggesting a very important role of Pack-MULEs in novel gene formation and evolution. It was shown previously that Pack-   67       A     B moderate no/low high 18 18 16 16 14 14 % PM Parental genes % PM Parental genes low 12 10 8 6 4 10 8 6 4 2 0 0 1-2 3-4 5-6 7-8 Number of libraries 9-10 high 12 2 0 moderate <46 46-50 51-55 56-62 63-68 69-82 Gene GC content                         Figure 2.6. The effect of GC content and expression in gene acquisition frequency. (A) The relationship between expression breadth and ratio of parental genes among genes grouped on GC content range (low: 30-50% GC; moderate: 51-62% GC; high: 63-81% GC); (B) The relationship between gene GC content and ratio of parental genes among genes grouped on number of tissue expression range (no/low: 0 to 1 libraries; moderate: 2 to 7 libraries; high: 8 to 10 libraries).   68       % Pack-MULE Parental to nonTE genes <=2.75 >2.75 <=2.75 14 14 12 12 10 10 8 8 6 6 4 4 2 2 0 0 0 >2.75 <46 46-50 51-55 56-62 63-68 69-82 Gene GC content (%) 1-2 3-4 5-6 7-8 9-10 Number of libraries Parental Genes 70 Genome Average GC content (%) 65 60 55 50 45 >1 1-2 2-3 3-4 >4 Transversion rate (%) Figure 2.7. Comparison of GC content and breadth of expression of parental genes between recent and old acquisition events estimated by transversion rate. (A) The effect of acquisition age on breadth of gene expression; (B) The effect of acquisition age on the GC content; (C) The relationship between GC content of parental genes and transversion rate.   69       MULEs preferentially acquire GC-rich sequences, a phenomenon only seen in grasses. Aside from that, the process by which these sequences are selected and captured by Pack-MULEs remains largely an enigma. Our findings in this study show that rice TIR MULEs have a higher propensity to acquire genomic sequences compared with non-TIR MULEs, and this bias may be related to differences in structural properties such as TIR length and TIR GC content. It remains unclear at this stage whether the capacity of sequence acquisition among TIR MULEs contributes to the overall success of TIR MULEs versus non-TIR MULEs, since 87% of the rice MULEs belong to TIR MULEs. MULEs and Pack-MULEs in rice are capable of acquiring both genic and intergenic sequences, although the acquisition preference for genic sequences is much more pronounced compared with intergenic sequences (Table 2.1). This may suggest that Pack-MULEs are either more competent to acquire genes or that genes are more readily acquired by Pack-MULEs over other sequences in the genome. Consistent with previous work, GC content was a factor in the preferential acquisition or retention of intergenic fragments, because the GC content of the intergenic sequences inside MULEs or Pack-MULEs is much higher than the genomic, TE, and intergenic GC contents. Interestingly, the GC-rich internal sequences of Pack-MULEs are accompanied by higher GC content of TIRs and sub-TIRs in Pack-MULEs compared with that of non-Pack-MULEs. One of the models for sequence acquisition suggests the formation of a cruciform structure during the process, with the TIRs forming the stem of the hairpin (Bennetzen and Springer, 1994). In this model, an endonucleolytic attack occurs in the single-stranded loop, aided by sequences containing homology to a parental sequence, and initiates repair through illegitimate recombination. Our results are consistent with this model in the following respects. On the one hand, the GC-rich internal regions and GC-rich sub-TIRs seem to imply that   70       sequence homology between the element and the acquisition target likely plays a role in acquisition. If this is the case, one would expect AT-rich sequences to be acquired as well if the sub-TIR sequence is also AT-rich. This was not observed, since non-Pack-MULEs have relatively AT-rich sub-TIRs (Fig. 2.2) but they do not carry any recognizable genomic sequences. This is possibly because the pairing of AT-rich sequences is not as stable as that of GC-rich sequences to initialize the repair process. On the other hand, a long GC-rich TIR would lead to a more stable cruciform that may facilitate the acquisition process. This hypothesis may also explain why MULEs are more frequently associated with sequence acquisitions than other “cut-and-paste” DNA transposons, in that most MULEs have extended long TIRs. Although a particular functional category of genes does not seem to be more preferentially acquired by Pack-MULEs, a bias against genes with unknown function is obvious, and this was not species specific. Hypothetical genes and genes with unknown function can often result from misannotation. These genes, in most cases, are generated by gene prediction programs and, therefore, may lack supporting expression data. As a result, it is conceivable that there are more false positive annotations within this group compared with genes with known function. Such bias may reflect the preference of Pack-MULEs for bona fide genes. In other words, Pack-MULEs might be better than gene annotation programs in distinguishing genuine genes from other sequences. The underrepresentation of genes without a known biological function prompted the analysis of expression among annotated genes, which showed that a relatively ubiquitous expression throughout development may play a role in sequence acquisition: Pack-MULEs preferentially acquire genes that are expressed in multiple tissues/developmental stages (Fig. 2.6A). Our analysis also shows an enrichment of genes with orthologs among parental genes (Supplemental Table S7). More importantly, our data explain the   71       strong propensity of Pack-MULEs to transduce genes over other nongenic sequences in the genome. Gene GC content and the ubiquity of expression show an additive effect on preferential selection and acquisition by Pack-MULEs. Interestingly, it appears that selection acts differentially on GC content and the ubiquity of expression. The preference for expression ubiquity seems to be largely at the acquisition level, since it is evenly distributed among old and recent acquisition events (Fig. 2.7A). In contrast, there is a detectable level of selection that favors the retention of fragments from highly GC-rich genes (Fig. 2.7, B and C). Such differentiation is understandable from a mechanistic point of view: once a gene fragment is acquired by a Pack-MULE, the characteristic of expression is no longer associated with the fragment. Since the acquired fragments may not be expressed in the same pattern as their parental copies, there is no basis for selection for or against the expression pattern of the parental genes. This is consistent with the fact that Pack-MULEs are often associated with different tissue specificity from their parental genes (Hanada et al., 2009). On the other hand, GC content is always associated with the fragment. High GC content could induce a series of genetic and epigenetic changes in the genome. Genetically, it may modify the 5’ end of the adjacent genes and intensify the negative GC gradient (Jiang et al., 2011). Epigenetically, GC-rich sequences offer more methylation targets that could influence the chromatin structure and expression of the nearby genes (Kalisz and Purugganan, 2004; Tatarinova et al., 2010; see below). All these features may form the basis for selection. Apparently, the high GC content is favored here, which may imply that it has provided certain benefits for the organism. Despite the possible selection for high GC content over evolutionary time, the degree of selection seems too moderate to explain the dramatic difference in the GC content between parental genes and all non-TE genes   72       (Fig. 2.7C). Accordingly, it is likely that the preferential acquisition by Pack-MULEs for GCrich genes is also responsible, or more important, for the enrichment of GC-rich genes among parental genes. In addition, we cannot rule out the possibility that variation in GC content over evolutionary time is due to a change in acquisition preference. This could occur, for example, when different MULE families have slightly different acquisition preferences and their amplification rate has not been constant in each time range. Future computational and biochemical analyses are required to test whether acquisition preference varies among different MULE families. To our knowledge, our study is the first to elucidate the direct involvement of gene expression in sequence duplication by DNA transposons. Furthermore, our data suggest that GC richness offers a more dominant effect in this process. This is because, as discussed above, high GC content is very likely favored by both acquisition and selection. Studies to determine the relationship between GC content and expression level and the breadth of expression are conflicting, with studies reporting a strong positive correlation (Lercher et al., 2003; Kudla et al., 2006) and those reporting weak or unclear correlation (Gilbert et al., 2004; Sémon et al., 2005). Nevertheless, these and other studies established the association of GC-rich sequences with open chromatin (Vinogradov, 2003). The open chromatin provided by GC-rich sequences potentially allows these sequences to be more accessible by host enzymes during interrupted gap repair (model 2; see the introduction) or during internal strand repair of cruciform structures (model 1; see the Introduction). Expression level, as measured by FPKM/TPM values, does not appear critical to the likelihood of a gene being transduced by a Pack-MULE (Tables 2.3-2.5). Nevertheless, we cannot rule out the possibility that parental genes were expressed at a level higher than average prior to the formation of relevant Pack-MULEs. This is because Pack-   73       MULEs have a negative regulatory effect on the expression level of parental genes, which may render the difference no longer detectable after the acquisition. In contrast, preferential acquisition is positively correlated with the breadth of expression when the genes are expressed in seven or fewer tissues (Fig. 2.6). The open chromatin configuration during active transcription may allow access to a sequence for duplication. Consequently, the greater number of tissues with detectable expression allows the gene greater chances of being transduplicated by Pack-MULEs. Conclusion The unprecedented copy number of Pack-MULEs, the massive duplication of thousands of genes in the rice genome, combined with their biased acquisition for GC-rich genes and insertion in 5’ regions of genes suggest an evolutionary importance of these elements in gene evolution and regulation. Our findings in rice show that sequence acquisition by Pack-MULEs relies on structural/sequential properties of the elements and the acquisition targets. TIR MULEs are the predominant MULE type in the rice genome and account for the majority of the PackMULEs. Although Pack-MULEs can duplicate both genic and intergenic sequences, a much stronger preference for genic sequences exists. Pack-MULEs exhibit a non-species-specific bias against genes with unknown function and enrichment of parental genes with orthologs, suggesting its preferential acquisition for bona fide genes. Structural properties of elements, GC content, and the breadth of expression of parental genes influence the selection and acquisition of sequences. Increased GC content and number of tissues with detectable expression results in a higher likelihood of a gene being acquired by a Pack-MULE. Moreover, GC-rich sequences acquired by Pack-MULEs are preferentially retained compared with sequences that are not so GC rich. Although the molecular mechanism for how Pack-MULEs locate and duplicate intergenic and genic sequences remains to be empirically evaluated, our study demonstrates that   74       the activity of Pack-MULEs leads to the selective duplication/retention of CDSs, because CDSs are more GC rich and have a wider breadth of tissue expression. Such selection enables them to carry the most likely functional sequences instead of “junk” and so provide new resources for the evolution of new genes or the modification of existing gene.   75       APPENDIX   76       Table A1. Order of TIR MULE families found in Figure 2.1 A and B. Bar number TIR name 1 Os0037 2 Os2580 3 Os0166 4 Os0086 5 Os0243 6 Os0205 7 Os0229 8 Os0949 9 Os0105 10 Os0284 11 Os0335 12 Os0301 13 Os2537 14 Os0312 15 Os0053 16 Os2088 17 Os1129 18 Os0297 19 Os0222 20 Os0372 21 Os0378 22 Os0138 23 Os0182 24 Os0230 25 Os0319 26 Os0115 27 Os2471 28 Os0385 29 Os0219 30 Os0202 31 Os3333 32 Os3372 33 Os1693 34 Os1617 35 Os0709 36 Os0513 37 Os2788 38 Os0547 39 Os2159   77       Table A1 (cont’d) 40 Os0208 41 Os1121 42 Os2701 43 Os3609 44 Os2269 45 Os0116 46 Os3617 47 Os0610 48 Os0853 49 Os1224 50 Os3284 51 Os1272 52 Os2766 53 Os3366 54 Os3625 55 Os3605 56 Os3621 57 Os0874 58 Os1886 59 Os1104 60 Os1308 61 Os0584 62 Os3608 63 Os0454 64 Os2835 65 Os0570 66 Os0892 67 Os3614 68 Os1015 69 Os1810 70 Os3624 71 Os3375 72 Os1057 73 Os3369 74 Os2033 75 Os3604 76 Os3602 77 Os3618 78 Os1455 79 Os3601   78       Table A1 (cont’d) 80 Os3613 81 Os3371 82 Os3622 83 Os1404 84 Os2398 85 Os3616 86 Os3337 87 Os3612 88 Os2623 89 Os1838 90 Os3355 91 Os3383 92 Os1228 93 Os3619 94 Os3238 95 Os3600 96 Os3606 97 Os3623 98 Os3626 99 Os3627 All supplemental tables and figures are available at: http://www.plantphysiol.org/content/early/2013/09/12/pp.113.223271.short Figure S1. Copy number of MULE TIR families not associated with gene acquisition. (A) TIRMULE families, (B) non-TIR MULE families Figure S2. Percent of GOSlim categories of Pack-MULE parental genes and non-TE genes of rice excluding genes without a functional assignment. Table S1. List of rice Pack-MULEs. Table S2. List of rice MULEs with intergenic sequence acquisition (MULE-intergenic). Table S3. List of rice non-Pack-MULEs. Table S4. List of rice autonomous MULEs. Table S5. List of rice MULE-HypProt.   79       Table S6. List of parental genes of Pack-MULEs. Table S7. Distribution of non-TE and parental genes among genes with and without orthologs Sequence-File S1. Terminal sequences of TIR-MULE families. Sequence-File. S2. Terminal sequences of non-TIR-MULE families. Sequence-File S3. Element sequences of small MULEs (<150 bp). Sequence-File S4. Sequences of intergenic parental copies.   80       REFERENCES   81       REFERENCES Ammiraju JS, Zuccolo A, Yu Y, Song X, Piegu B, Chevalier F, Walling JG, Ma J, Talag J, Brar DS, et al (2007) Evolutionary dynamics of an ancient retrotransposon family provides insights into evolution of genome size in the genus Oryza. Plant J 52: 342–351 Bao, Z and Eddy, SR (2002) Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12:1269-1276 Benito MI, Walbot V (1997) Characterization of the maize Mutator transposable element MURA transposase as a DNA-binding protein. Mol Cell Biol 17: 5165–5175 Bennetzen JL (2007) Patterns in grass genome evolution. Curr Opin Plant Biol 10: 176–181 Bennetzen JL, Kellogg EA (1997) Do plants have a one-way ticket to genomic obesity? Plant Cell 9: 1509–1514 Bennetzen JL, Springer PS (1994) The generation of Mutator transposable element subfamilies in maize. Theor Appl Genet 87: 657–667 Chalvet F, Grimaldi C, Kaper F, Langin T, Daboussi MJ (2003) Hop, an active Mutator-like element in the genome of the fungus Fusarium oxysporum. Mol Biol Evol 20: 1362–1375 Chomet P, Lisch D, Hardeman KJ, Chandler VL, Freeling M (1991) Identification of a regulatory transposon that controls the Mutator transposable element system in maize. Genetics 129: 261–270 Davidson RM, Gowda M, Moghe G, Lin H, Vaillancourt B, Shiu SH, Jiang N, Buell CR (2012) Comparative transcriptomics of three Poaceae species reveals patterns of gene expression evolution. Plant J 71: 492–502 Davidson RM, Hansey CN, Gowda M, Childs K, Lin H, Vaillancourt B, Sekhon RS, de Leon N, Kaeppler SM, Jiang N, et al (2011) Utility of RNA-seq for analysis of maize reproductive transcriptomes. Plant Genome 4: 191–203 Devos KM, Ma J, Pontaroli AC, Pratt LH, Bennetzen JL (2005) Analysis and mapping of randomly chosen bacterial artificial chromosome clones from hexaploid bread wheat. Proc Natl Acad Sci USA 102: 19243–19248 Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797 Ferguson AA, Jiang N (2012) Mutator-like elements with multiple long terminal inverted repeats in plants. Comp Funct Genomics 2012: 695827   82       Feschotte C (2008) Transposable elements and the evolution of regulatory networks. Nat Rev Genet 9: 397–405 Flagel LE, Wendel JF (2009) Gene duplication and evolutionary novelty in plants. New Phytol 183: 557–564 Gilbert N, Boyle S, Fiegler H, Woodfine K, Carter NP, Bickmore WA (2004) Chromatin architecture of the human genome: gene-rich domains are enriched in open chromatin fibers. Cell 118: 555–566 Hanada K, Vallejo V, Nobuta K, Slotkin RK, Lisch D, Meyers BC, Shiu SH, Jiang N (2009) The functional role of Pack-MULEs in rice inferred from purifying selection and expression profile. Plant Cell 21: 25–38 Hoen DR, Park KC, Elrouby N, Yu Z, Mohabir N, Cowan RK, Bureau TE (2006) Transposonmediated expansion and diversification of a family of ULP-like genes. Mol Biol Evol 23: 1254–1268 Holligan D, Zhang X, Jiang N, Pritham EJ, Wessler SR (2006) The transposable element landscape of the model legume Lotus japonicus. Genetics 174: 2215–2228 International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436: 793–800 Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature 431: 569–573 Jiang N, Ferguson AA, Slotkin RK, Lisch D (2011) Pack-Mutator-like transposable elements (Pack-MULEs) induce directional modification of genes through biased insertion and DNA acquisition. Proc Natl Acad Sci USA 108: 1537–1542 Kalisz S, Purugganan MD (2004) Epialleles via DNA methylation: consequences for plant evolution. Trends Ecol Evol 19: 309–314 Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, Ouyang S, Schwartz DC, Tanaka T, Wu J, Zhou S, et al (2013) Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice 6: 4 Kawasaki S, Nitasaka E (2004) Characterization of Tpn1 family in the Japanese morning glory: En/Spm-related transposable elements capturing host genes. Plant Cell Physiol 45: 933– 944 Kudla G, Lipinski L, Caffin F, Helwak A, Zylicz M (2006) High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biol 4: e180   83       Lercher MJ, Urrutia AO, Pavlícek A, Hurst LD (2003) A unification of mosaic structures in the human genome. Hum Mol Genet 12: 2411–2415 Lin H, Moghe G, Ouyang S, Iezzoni A, Shiu SH, Gu X, Buell CR (2010) Comparative analyses reveal distinct sets of lineage-specific genes within Arabidopsis thaliana. BMC Evol Biol 10: 41 Lippman Z, Gendrel AV, Black M, Vaughn MW, Dedhia N, McCombie WR, Lavine K, Mittal V, May B, Kasschau KD, et al (2004) Role of transposable elements in heterochromatin and epigenetic control. Nature 430: 471–476 Lisch D (2002) Mutator transposons. Trends Plant Sci 7: 498–504 Lisch D, Jiang N (2009) Mutator and MULE transposons. In JL Bennetzen, S Hake, eds, Handbook of Maize: Genetics and Genomics. Springer, New York, pp 277–306 Lisch DR, Freeling M, Langham RJ, Choy MY (2001) Mutator transposase is widespread in the grasses. Plant Physiol 125: 1293–1303 Markham NR, Zuker M (2008) UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 453: 3–31 Marquez CP, Pritham EJ (2010) Phantom, a new subclass of Mutator DNA transposons found in insect viruses and widely distributed in animals. Genetics 185: 1507–1517 Meyers BC, Lee DK, Vu TH, Tej SS, Edberg SB, Matvienko M, Tindell LD (2004a) Arabidopsis MPSS: an online resource for quantitative expression analysis. Plant Physiol 135: 801–813 Meyers BC, Tej SS, Vu TH, Haudenschild CD, Agrawal V, Edberg SB, Ghazal H, Decola S (2004b) The use of MPSS for whole-genome transcriptional analysis in Arabidopsis. Genome Res 14: 1641–1653 Ming R, Vanburen R, Liu Y, Yang M, Han Y, Li LT, Zhang Q, Kim MJ, Schatz MC, Campbell M, et al (2013) Genome of the long-living sacred lotus (Nelumbo nucifera Gaertn.). Genome Biol 14: R41 Morgante M, Brunner S, Pea G, Fengler K, Zuccolo A, Rafalski A (2005) Gene duplication and exon shuffling by Helitron-like transposons generate intraspecies diversity in maize. Nat Genet 37: 997–1002 Morgenstern B (2004) DIALIGN: multiple DNA and protein sequence alignment at BiBiServ. Nucleic Acids Res 32: W33–W36 Neuvéglise C, Chalvet F, Wincker P, Gaillardin C, Casaregola S (2005) Mutator-like element in the yeast Yarrowia lipolytica displays multiple alternative splicings. Eukaryot Cell 4:   84       615–624 Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin YC, Scofield DG, Vezzi F, Delhomme N, Giacomello S, Alexeyenko A, et al (2013) The Norway spruce genome sequence and conifer genome evolution. Nature 497: 579–584 Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, et al (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457: 551–556 Piegu B, Guyot R, Picault N, Roulin A, Sanyal A, Kim H, Collura K, Brar DS, Jackson S, Wing RA, et al (2006) Doubling genome size without polyploidization: dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice. Genome Res 16: 1262–1269 Raizada MN, Benito MI, Walbot V (2001) The MuDR transposon terminal inverted repeat contains a complex plant promoter directing distinct somatic and germinal programs. Plant J 25: 79–91 Robertson D, Woessner JP, Gillham NW, Boynton JE (1989) Molecular characterization of two point mutants in the chloroplast atpB gene of the green alga Chlamydomonas reinhardtii defective in assembly of the ATP synthase complex. J Biol Chem 264: 2331–2337 Robertson DS (1978) Characterization of a Mutator system in maize. Mutat Res 51: 21–28 Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463: 178–183 Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, et al (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326: 1112–1115 Sémon M, Mouchiroud D, Duret L (2005) Relationship between gene expression and GCcontent in mammals: statistical significance and biological relevance. Hum Mol Genet 14: 421–427 Slotkin RK, Martienssen R (2007) Transposable elements and the epigenetic regulation of the genome. Nat Rev Genet 8: 272–285 Talbert LE, Chandler VL (1988) Characterization of a highly conserved sequence related to Mutator transposable elements in maize. Mol Biol Evol 5: 519–529 Tatarinova TV, Alexandrov NN, Bouck JB, Feldmann KA (2010) GC3 biology in corn, rice, sorghum and other grasses. BMC Genomics 11: 308   85       Tomato Genome Consortium (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485: 635–641 Vinogradov AE (2003) DNA helix: the importance of being GC-rich. Nucleic Acids Res 31: 1838–1844 Wang Q, Dooner HK (2006) Remarkable variation in maize genome structure inferred from haplotype diversity at the bz locus. Proc Natl Acad Sci USA 103: 17644–17649 Wang W, Zheng H, Fan C, Li J, Shi J, Cai Z, Zhang G, Liu D, Zhang J, Vang S, et al (2006) High rate of chimeric gene origination by retroposition in plant genomes. Plant Cell 18: 1791–1802 Wu J, Wang Z, Shi Z, Zhang S, Ming R, Zhu S, Khan MA, Tao S, Korban SS, Wang H, et al (2013) The genome of the pear (Pyrus bretschneideri Rehd.). Genome Res 23: 396–408 Yamashita S, Takano-Shimizu T, Kitamura K, Mikami T, Kishima Y (1999) Resistance to gap repair of the transposon Tam3 in Antirrhinum majus: a role of the end regions. Genetics 153: 1899–1908 Yu Z, Wright SI, Bureau TE (2000) Mutator-like elements in Arabidopsis thaliana: structure, diversity and evolution. Genetics 156: 2019–2031 Zabala G, Vodkin LO (2005) The wp mutation of Glycine max carries a gene fragment-rich transposon of the CACTA superfamily. Plant Cell 17:2619–2632 Zuccolo A, Sebastian A, Talag J, Yu Y, Kim H, Collura K, Kudrna D, Wing RA (2007) Transposable element distribution, abundance and role in genome size variation in the genus Oryza. BMC Evol Biol 7: 152   86       CHAPTER 3: Mutator-like Elements with Multiple Long Terminal Inverted Repeats in Plants Ferguson AA, Jiang N (2012) Mutator-like elements with multiple long terminal inverted repeats in plants. Comp Funct Genomics 2012: 695827   87       Abstract Mutator-like transposable elements (MULEs) are widespread in plants and the majority have long terminal inverted repeats (TIRs), which distinguish them from other DNA transposons. It is known that the long TIRs of Mutator elements harbor transposase binding sites and promoters for transcription, indicating that the TIR sequence is critical for transposition and for expression of sequences between the TIRs. Here, we report the presence of MULEs with multiple TIRs mostly located in tandem. These elements are detected in the genomes of maize, tomato, rice, and Arabidopsis. Some of these elements are present in multiple copies, suggesting their mobility. For those elements that have amplified, sequence conservation was observed for both of the tandem TIRs. For one MULE family carrying a gene fragment, the elements with tandem TIRs are more prevalent than their counterparts with a single TIR. The successful amplification of this particular MULE demonstrates that MULEs with tandem TIRs are functional in both transposition and duplication of gene sequences. Introduction Transposable elements (TEs) are DNA fragments that are capable of moving from one genomic location to another and increasing their copy numbers. Based on their transposition mechanisms, TEs fall into two classes: (1) Class I elements, or retrotransposons, that use the element-encoded mRNA as the transposition intermediate and (2) Class II elements, or DNA transposons, that transpose through a DNA intermediate. Autonomous transposons encode transposases that are responsible for the transposition of themselves and their cognate nonautonomous elements that do not encode transposases. A common feature for DNA transposons, with a few exceptions, is the presence of a terminal inverted repeat (TIR) at each terminus of the element. As an essential structural   88       component of the element, TIR plays important roles in transposition. For example, the transposase encoded by the bacterial Tn3 element specifically binds to its TIR (38 bp in length), and this then facilitates the nicking at the end of Tn3 by DNase I and initializes the transposition process [1]. In eukaryotes, it was shown that the transposase of the Hermes element binds to its imperfect TIRs and excises the element. This process is accompanied by the formation of a hairpin structure in the flanking donor sequence, resembling the V(D)J recombination process [2]. Binding of transposase to the TIR and to the target DNA mediates the synapsis of the transposon ends and the target DNA, allowing the insertion of the element into the target sequence [3]. The presence of the TIR sequence also influences the target specificity of the element. For instance, the deletion of a 4 bp sequence within the binding region of the TIR in Tn3 abolishes its transposition immunity, that is, the phenomenon whereby Tn3 avoids insertion into another Tn3 element [4]. The TIRs of DNA transposons are usually less than 50 bp in length and can be as short as 8 bp [5]. Nevertheless, there are a few transposon families with exceptionally long TIRs and among these is the Mutator superfamily of transposable elements. First discovered in maize in 1978 [6], Mutator and Mutator-like elements (MULEs) appear to be prevalent among eukaryotes. Subsequent to the initial discovery in maize, MULEs have been found in other plant genomes such as Arabidopsis, rice, and Lotus japonicus [7–9], as well as in fungal and animal genomes [10, 11]. In addition to their unparalleled activity in maize [12], MULEs include a special subgroup of elements referred to as Pack-MULEs which are nonautonomous MULEs carrying genes or gene fragments. The sequence acquisition by Pack-MULEs may result in the formation of new open reading frames and the potential to regulate the expression of the parental genes from which the fragments are derived [8, 13, 14].   89       Mutator and MULEs are distinguished from other DNA transposable elements by having a 9–11 bp target site duplication (TSD) which flanks the element and are formed during transposition into a new genomic location. The TIRs of MULEs, typically ranging from 100 to 500 bp, appear to be critical for element transposition and expression. Mutator TIRs contain binding sites for the transposase where MURA protein was shown to bind a conserved ∼32 bp sequence motif in active Mutator elements in maize [15]. In addition, two convergent genes contained within the maize autonomous MuDR element, including the transposase MURA, are transcribed from promoters located within the TIRs [16]. Furthermore, the MuDR TIR contains plant cell cycle enhancer motifs which program a 20-fold upregulated expression in reproductive organs as compared with leaves [16]. The promoters in the TIRs are also responsible for the expression of the internal regions of Pack-MULEs [8], suggesting the importance of the TIR sequence for transposition and retention of MULEs in the genome. In this study, we report on the identification and characterization of MULEs and Pack-MULEs with multiple TIRs in plants and the possible role of tandem TIRs in element amplification. Methods Plant Genomic Sequences and Construction of the Tomato MULE TIR Library The sequence for the tomato (Solanum lycopersicum) genome was downloaded from the International Tomato Genome Sequencing Consortium Library. The sequence (http://www.solgenomics.net/organism/Solanumlycopersicum/genome/; release 2.40). The sequence for rice (Oryza sativa ssp. japonica cv. Nipponbare) pseudomolecules was downloaded from the rice annotation group at Michigan State University (http://rice.plantbiology.msu.edu/, release 6.0). Maize (Zea mays cv. B73) chromosome sequences (4a.53) were downloaded from the maize sequencing project (http://www.maizesequence.org/, B73 RefGen v1 [17]). The   90       sequence for potato (Solanum tuberosum cv. DM) was downloaded from the Potato Genome Sequencing Consortium (http://potatogenomics.plantbiology.msu.edu, release 3.0 [18]). The Arabidopsis genome sequence (TAIR10) was downloaded from the Arabidopsis Information Resource (http://www.arabidopsis.org/). For identifying elements with additional TIRs, MULE TIR libraries that were generated previously were used for rice, maize, and Arabidopsis genomes [19]. The tomato MULE TIR library was built with an iterative process that uses Pairwise Alignment of Long Sequences (PALS 1.0) [20] to identify long inverted repeats (minimum length = 100 bp; minimum similarity = 80%). A custom python script was used to identify pairs of inverted repeats from the output of PALS, extract flanking sequences, and identify a 9–11 bp TSD. Manual curation was done to verify terminal inverted pairs with overall high sequence similarity in at least 100 bp sequence and having intact TIR ends and presence of a 9–11 bp TSD immediately flanking the ends of the TIR. Pairs that passed the above criteria were added to a TIR sequence library of tomato which was later filtered for redundancy. If two TIR sequences share 80% or higher similarity in at least 80% of their length, the two sequences are considered redundant, and one of the sequences is excluded. Among a redundant group of TIR sequences, the TIR sequence from the element with the highest TIR identity is retained in the nonredundant library. Estimation of MULE Copy Number and Identification of Elements with Multiple TIRs from Plants To determine the abundance of MULEs related to different TIR families (Table 3.1), copy number was estimated by considering one pair of TIRs as one element. To estimate how many elements are associated with a TSD, the presence of a TSD was verified using a pipeline   91       consisting of perl scripts that search for 9-10 bp direct repeat with no more than 2 mismatches flanking the ends of the TIR sequences. The copy number of autonomous MULE elements was estimated from all elements retrieved from the previous step having significant match to known MULE transposases (E=10−5, BLASTX) after filtering for low complexity. To search for MULEs with multiple TIRs, elements from each genome sequence were identified using RepeatMasker (using default parameters; http://repeatmasker.genome.washington.edu/) with the rice, maize, tomato, and Arabidopsis MULE TIR libraries. A custom python script was used to identify elements flanked by 2 similar TIRs on one or both sides of the element. The following criteria were used to filter the results: (1) distance between the external TIRs is not larger than 20 kb and there is no sequencing gap between the TIRs, (2) TIRs must be at least 50 bp long, (3) truncations at the external ends of TIRs must be no more than 15 bp, (4) the two TIRs on one or both ends are less than 600 bp apart, and (5) presence of a 9–11 bp TSD with no more than 2 mismatches. Custom perl and python scripts were used in combination to extract the sequences of putative multiple TIR elements and their flanking sequences, and all elements were manually verified for the presence of a TSD. To define whether an element has multiple copies in a genome, the following criteria were applied: for individual elements, if the TIRs of two elements (with different TSDs) can be aligned (BLASTN, E=10-10), and if > 70% of the sequence between the TIRs can be aligned (BLASTN, E=10−10), then the two elements are defined as copies. To obtain an approximate location of the elements in the tomato chromosomes, their coordinates in the pseudomolecules (release 2.4) were used to find nearby flanking SGN-markers as indicated in the Tomato Genome Browser (http://www.solgenomics.net/). The tomato FISH map was used as the basis of chromosome structure indicating centromere, euchromatin, and   92       heterochromatin regions of each chromosome [21, 22]. The relative position of the flanking SGN-markers were identified in EXPEN-2000 physical map [23, 24] which have been linked to the FISH map by sequenced BACs. These maps are available through the Sol Genomics Network. Phylogenetic Analysis of TIRs and Internal Sequences To generate multiple sequence alignments, 220 bp of the external and internal TIRs at both ends (Figure 3.2) of the PM-ZIBP elements (the element containing a gene fragment from a gene encoding a zinc-ion binding protein, see Results) were used and resolved into lineages by generating phylogenetic trees. Multiple sequence alignment was performed by CLUSTALW (http://www.ebi.ac.uk/clustalw/) with default parameters. Phylogenetic trees were generated on the basis of the maximum likelihood method [25] with Kimura2 parameter distances [26] using the MEGA 4 program (http://www.megasoftware.net). Support for the internal branches of the phylogeny was assessed using 1000 bootstrap replicates. To compare the TIR of type 2-46 elements, 130 bp of the external and internal TIRs of type 2-46 elements were used to generate sequence alignment and a phylogenetic tree employing the same parameters and methods as the TIR comparisons for PM-ZIBP. Similar methods and parameters were used to generate sequence alignment and a phylogenetic tree for the acquired fragments in PM-ZIBP, the parental gene (SGN-U574419) in tomato, and gene sequences from potato, tobacco, and pepper. Annotation of Pack-MULEs and Frequency of Element Sizes in Tomato The procedure for the annotation of Pack-MULEs in the tomato genome was similar to that described previously [13]. Candidate Pack-MULEs and MULEs with multiple TIRs identified in this process were masked with all available tomato repeat sequences using   93       RepeatMasker (http://repeatmasker.genome.washington.edu/ version open-3.0) with default parameters. The repeat library was built by combining repeats collected previously [27] with sequences matching transposase sequences in the RepeatMasker package (version open-3.0). The masked outputs were queried against the Solanaceae Unigene database (http://solgenomics.net/, version 5, BLASTN E= 10−10) and the nonredundant protein database in NCBI (http://www.ncbi.nlm.nih.gov/Blast.cgi E=10−05) to identify gene fragments inside elements. To estimate element size, the elements were masked with all available tomato repeat sequences excluding MULE-related sequences using RepeatMasker to identify nested insertion of other transposons inside Pack-MULEs. The element size was then calculated using a custom perl script which excluded the masked sequence inside Pack-MULEs. Elements larger than 2 kb were not plotted due to their low abundance. TIR Sequence Analysis and Conservation Test The external and internal TIR sequences were compared using the “gap” program available from the GCG package (version 11.0, Accelrys Inc., San Diego, CA) to identify repeats found in the sequences. To determine the conservation of the TIR at nucleotide level, the first 700 bp sequence of each element was extracted and compared using DIALIGN2-2 [28] with default parameters. The normalized local sequence similarity scores as determined from DIALIGN2-2 were then used to determine an average similarity score for every 5 nucleotides and then plotted. Results Types of MULEs with Multiple TIRs A typical MULE contains one pair of TIRs, which refers to similar or identical terminal inverted sequences found on the opposite ends of the element (Figure 3.1, see above). In this   94       study, we detected some atypical MULEs with two pairs of TIRs (type 1 and type 2, Figure 3.1), which will be referred to as external TIR and internal TIR, respectively. The external and internal TIRs have extended sequence similarity in at least 100 bp of the TIR region. In some cases, there is only one additional terminal sequence instead of one pair of terminal sequences (type 3, 4, 5, and 6, Figure 3.1), and this additional terminal sequence will be called a solo TIR to distinguish it from paired TIRs. Analysis of the tomato genome sequence revealed the presence of 61 MULEs with at least one additional TIR. All these elements are associated with a distinguishable TSD, a hallmark of transposition, suggesting that these elements are derived from transposition and not recombination. Furthermore, there is no recognizable TSD flanking the internal TIR, indicating that these elements are not formed through nested insertion of the same type of elements. As shown in Figure 3.1, the majority of these elements (48 out of 61 elements, 79%) are type 1 and type 2 which contain external and internal TIRs located in tandem on both ends of the element (Figure 3.1). Some elements carry recognizable gene fragments (type 1 and type 5) and are therefore classified as Pack-MULEs. Thirteen elements (out of 61, 21%) are associated with one additional solo TIR only on one end of the element (type 3) or in the internal region (type 4, 5, and 6). Elements with tandem TIRs are more abundant than elements containing a solo TIR (48 versus 13). Three elements (out of 6) with tandem TIRs have multiple copies (elements are considered to be copies if they have similar TIR and similar internal region). The element with   95       Figure 3.1. The structure of distinct types of MULEs with additional TIRs in tomato including their copy numbers and the number of TIR families involved in formation. Black horizontal arrows indicate target site duplication (TSD); solid colored triangles indicate Terminal Inverted Repeat (TIR); colored boxes indicate internal sequence and are labeled accordingly if sequences are annotated as genes or have similarity to MULE transposases. Objects with different colors indicate unrelated sequences. ZIBP – zinc ion binding protein   96       the highest copy number is flanked by two SLMULE46 TIRs (29 copies, type 2, and will be referred to as type 2-46 element), followed by a second group, an element with 13 copies that makes up the type 1 class. In contrast, among the 12 distinct elements containing one additional solo TIR, only one has another copy in the genome, and this element belongs to type 3 with SLMULE46 TIR. In fact, this element is the 3-TIR version of type 2-46 because it has a similar internal region (see below). This suggests that elements with solo TIRs may be dramatically less competent in transposition than elements with tandem TIRs on both ends. In tomato, a total of 59 MULE TIR families have been identified with approximately 28,000 total copies of MULEs (see Table 3.1 for distribution of copies in each family). Among them, 10 TIR families (17%) are involved in the formation of elements with additional TIR sequences (Figure 3.1, Table 3.2). These 10 families have moderate copy numbers and a fraction of putative autonomous elements comparable to many other TIR families that are not associated with the formation of elements with multiple TIRs (Table 3.1). Eight families are represented in the formation of elements with solo TIRs, yet only three TIR families are involved in the formation of elements with tandem TIRs (type 1 and type 2, Table 3.2). In addition, most of the elements with this atypical TIR feature are non-autonomous elements in that they do not encode the proteins essential for transposition. The only exception is an element with an additional solo TIR, whose internal region was found to have sequence similarity to MULE transposases (type 6 in Figure 3.1). However, a close examination indicates the presence of numerous premature stop codons and frameshifts in the coding region, suggesting the element is unlikely a currently functional autonomous element.   97       Table 3.1. Distribution of elements among tomato MULE TIR families. Copy No. with % with Autonomous TIR Family number TSD TSD Elements MULETIR056 8 1 12.5 0 MULETIR022 11 3 27.3 0 MULETIR028 13 4 30.8 0 MULETIR007 25 5 20.0 3 MULETIR048 26 7 26.9 0 MULETIR050 34 2 5.9 0 MULETIR049 40 1 2.5 2 MULETIR034 42 1 2.4 0 MULETIR025 49 13 26.5 9 MULETIR052 56 11 19.6 0 MULETIR044 62 2 3.2 1 MULETIR038 68 22 32.4 10 MULETIR031 71 5 7.0 3 MULETIR039 88 26 29.5 3 MULETIR054 100 28 28.0 12 MULETIR043 106 63 59.4 35 MULETIR047 109 25 22.9 0 MULETIR017 112 4 3.6 5 MULETIR045 112 12 10.7 2 MULETIR020 114 36 31.6 0 MULETIR053 116 8 6.9 3 MULETIR010 118 3 2.5 35 MULETIR011 130 43 33.1 7 MULETIR059 144 49 34.0 2 MULETIR042 156 6 3.8 1 MULETIR027 164 72 43.9 19 MULETIR029 164 64 39.0 1 MULETIR016 167 40 24.0 4 MULETIR030 176 31 17.6 1 MULETIR051 210 64 30.5 1 MULETIR023 233 84 36.1 1 MULETIR004 250 4 1.6 2 MULETIR058 262 66 25.2 2 MULETIR026* 266 56 21.1 0 MULETIR013 273 73 26.7 94 MULETIR055 275 87 31.6 2 MULETIR057* 337 178 52.8 10 MULETIR040 361 173 47.9 20 MULETIR036 373 100 26.8 17 MULETIR024* 381 96 25.2 4 MULETIR041 413 126 30.5 3   98   % of Autonomous 0.0 0.0 0.0 12.0 0.0 0.0 5.0 0.0 18.4 0.0 1.6 14.7 4.2 3.4 12.0 33.0 0.0 4.5 1.8 0.0 2.6 29.7 5.4 1.4 0.6 11.6 0.6 2.4 0.6 0.5 0.4 0.8 0.8 0.0 34.4 0.7 3.0 5.5 4.6 1.0 0.7     Table 3.1 (cont’d) MULETIR046* 418 62 14.8 MULETIR032 444 199 44.8 MULETIR015 455 130 28.6 MULETIR018* 498 198 39.8 MULETIR003 546 193 35.3 MULETIR037* 635 361 56.9 MULETIR008* 644 133 20.7 MULETIR009* 760 425 55.9 MULETIR035 823 407 49.5 MULETIR014 1144 236 20.6 MULETIR021 1301 73 5.6 MULETIR033* 1310 548 41.8 MULETIR005* 1341 110 8.2 MULETIR006 1621 816 50.3 MULETIR002 1685 748 44.4 MULETIR012 1798 748 41.6 MULETIR019 2386 914 38.3 MULETIR001 4017 2609 64.9 total 28041 10604 38% * MULE TIR families involved in TIR duplication.   99   2 62 9 11 10 40 4 26 30 1 7 16 7 18 1 5 3 3 569 0.5 14.0 2.0 2.2 1.8 6.3 0.6 3.4 3.6 0.1 0.5 1.2 0.5 1.1 0.1 0.3 0.1 0.1 2%     Table 3.2. MULE TIR families involved in TIR duplication in tomato. Type MULE TIR Number of copies 1 SLMULE18 13 2 3 4 5 6   SLMULE18 1 SLMULE33 1 SLMULE46 33 SLMULE26 1 SLMULE46 4 SLMULE57 1 SLMULE05 1 SLMULE08 1 SLMULE26 1 SLMULE46 1 SLMULE09 1 SLMULE24 1 SLMULE37 1 100       A similar analysis of rice and maize genomes to detect elements with multiple TIRs identified fewer elements compared to the tomato genome (Tables 3.3 and 3.4). Two elements belonging to the type 3 and four elements (type 3 and 4) were uncovered in rice and maize genomes, respectively. In these two genomes, the presence of tandem TIR on both ends (type 1 or type 2) was not detected. In contrast, in Arabidopsis, a species that has previously been reported as having few MULEs and Pack-MULEs compared to rice and maize [17, 19, 29], eleven elements with tandem TIRs on both ends were found, including one Pack-MULE. This suggests that the formation of tandem TIR elements is not related to the abundance of MULEs and Pack-MULEs in the genome, and dicot genomes harbor more tandem TIR elements than genomes of monocots. A Pack-MULE Family with Tandem TIRs The elements comprising type 1 elements in tomato are copies of a Pack-MULE that harbors a fragment from a gene encoding a zinc-ion binding protein (Figure 3.2). This PackMULE family has 13 copies with tandem SLMULE18 TIRs located on both ends of the element and a single copy with one pair of TIRs (Figure 3.2), resembling that of a typical Pack-MULE. This family will be referred to as PM-ZIBP hereafter (Table 3.5). To dissect the relationship between the single TIR element and elements with tandem TIRs, a phylogenic tree was built using the acquired region, the parental gene in tomato, and the corresponding regions from other related plants including potato, pepper, and tobacco. The PM-ZIBP elements are grouped with the putative parental gene from tomato. Moreover, related elements are not present in the genome of potato suggesting that the acquisition of the gene fragment and the formation of the Pack-MULE may have occurred after the divergence of tomato and potato. Among the PM-ZIBP elements,   101       Figure 3.2. PM-ZIBP elements with single and tandem TIRs that contain a gene fragment. Solid triangles indicate TIRs and blue boxes indicate exons of gene SGN-U574419 and fragment acquired by the Pack-MULE. Introns are depicted as lines connecting exons.   102       Table 3.3. List of the multiple TIR elements in Arabidopsis, maize and rice. Element ID Genome Chromosome Position Start AtMULEdt-1 Arabidopsis 1 15918026 AtMULEdt-2 Arabidopsis 1 16260326 AtMULEdt-3 Arabidopsis 2 5670143 AtMULEdt-4 Arabidopsis 2 6446333 AtMULEdt-5 Arabidopsis 3 16141980 AtMULEdt-6 Arabidopsis 4 1230491 AtMULEdt-7 Arabidopsis 4 6700242 AtMULEdt-8 Arabidopsis 4 2341448 AtMULEdt-9 Arabidopsis 5 9709514 AtMULEdt-10 Arabidopsis 5 17479616 AtMULEdt-11 Arabidopsis 5 18980159 OsMULEdt-1 Rice 1 20126005 OsMULEdt-2 Rice 9 8210481 ZmMULEdt-1 Maize 1 239115417 ZmMULEdt-2 Maize 5 17121166 ZmMULEdt-3 Maize 8 94462286 ZmMULEdt-4 Maize 8 121663900 *Based on the same classification as Figure 3.1.   103   Position End TIR family Type* 15919043 16261097 5671281 6451005 16143074 1231738 6701357 2342412 9710845 17480806 18981488 20126814 8216333 239115828 17121599 94472999 121667746 At000317 At000824 At000317 At000317 At000317 At000800 At000800 At000317 At000800 At000317 At000800 Os0182 Os1455 Zm15155 Zm00411 Zm28610 Zm00411 2 2 2 1 2 2 2 2 2 2 2 3 3 3 4 4 4     Table 3.4. Frequency of MULEs, Pack-MULEs and MULEs with multiple TIRs in four different plant genomes. Element type Arabidopsis Tomato Rice Maize MULEs 1576 28041 30475 12900 a Pack-MULEs 46b (2.92) 220 (1.6) 2853 b (9.4) MULEs with multiple 11 (0.70) 61 (0.22) 4 (0.01) TIRS Number in parenthesis indicate percentage from total number of MULEs. a Schnable et al., 2009 [16] b Jiang et al., 2011 [18]   104   276 b (2.1) 2 (0.02)     the element with a single TIR (PMZIBP-1) forms a branch with the longest length (Figure 3.3). If the mutation rate is comparable among this group of elements, this implies that PM-ZIBP-1 is the most ancient element that acquired this gene fragment (or the ancestor of this element acquired the gene fragment), and elements with tandem TIRs are putative derivatives of PMZIBP-1. This is consistent with the fact that PM-ZIBP-1 is associated with the lowest TIR identity (Table 3.5), since young elements often have identical or highly similar TIRs. Our results show no evidence that elements with tandem TIRs are capable of acquiring gene sequence. The 14 copies of PM-ZIBP were mapped onto the tomato chromosomes (Table 3.5). All the elements with tandem TIRs are mapped to distinct chromosomal loci with most of them in euchromatic regions. However, the single-TIR copy mapped to Chromosome 0 which consists of sequenced fragments that cannot be physically mapped to any of the chromosomes. Sequence analysis of the contig containing this element showed that it was highly repetitive suggesting its location in heterochromatic regions of the genome. This raised the question as to whether the long branch length associated with PM-ZIBP-1 element is an artifact of accelerated mutation in heterochromatic regions. To test this notion, the five elements with tandem TIRs that are located in heterochromatic regions (Figure 3.4) were examined, and these elements were found to be associated with both long and short branches (Figure 3.3). Moreover, none of them has a branch that is longer than that of PM-ZIBP-1. As a result, the location of PM-ZIBP-1 does not fully explain its branch length, and it is likely the oldest PM-ZIBP element. However, we cannot rule out the possibility that there are other unknown factors responsible for the unusually high mutation rate (in both TIR and internal regions) in PM-ZIBP-1 that are not correlated with its   105       Table 3.5. List of the Pack-MULEs that captured a fragment of a putative zinc-ion binding protein. Element ID PM-ZIBP-1 PM-ZIBP -2 PM-ZIBP -3 PM-ZIBP -4 PM-ZIBP -5 PM-ZIBP -6 PM-ZIBP -7 PM-ZIBP -8 PM-ZIBP -9 PM-ZIBP -10 PM-ZIBP -11 PM-ZIBP -12 PM-ZIBP -13 PM-ZIBP -14   Chromosome Position Start Position End Size TSD sequence % outer TIR identity % inner TIR identity 0 2 2 3 4 4 5 6 6 6 6 8 8 12 12038207 29724087 33667391 50199015 4315606 58126752 10380074 29844899 41871484 42030690 7121877 14577081 4530978 9912832 12039150 29725458 33668737 50200383 4316970 58128120 10381444 29846272 41872853 42032062 7123250 14578450 4532348 9914208 944 1372 1347 1369 1365 1369 1371 1374 1370 1373 1374 1370 1371 1377 TTTTAAATT TAAATTATA TACATTTTAA TTAAAATTA TATTATAAA GTCAGGTTAA ATAAAAGAT CTTCGAGAC TTTATTTAC TTAAAAAAA TTAAAAGAA GAATAATAA TTTTGGGAA TATTTTTAT 87 94 92 91 95 93 93 91 90 92 90 93 93 92 N/A 92 90 93 90 91 92 92 89 92 90 91 89 90 106       Figure 3.3. Phylogenetic analysis based on the acquired fragment in PM-ZIBP from SGN-U574419 and related sequences (SGNU273862, SGN-U20267, and SGN-U506815). Sequences were aligned using ClustalW and phylogenetic reconstruction used the maximum likelihood method with Kimura-2 parameter distances implemented in the MEGA program. Bootstrap values are indicated as a percentage of 1000 replicates (40% majority rule consensus). Elements mapping to heterochromatic regions are indicated by a star symbol.   107       age. Despite this limitation, it is obvious that PM-ZIBP-1 has been present in the genome for a substantial amount of time, without being amplified. Sequence Features with Elements Carrying Multiple TIRs To understand the mechanism involved in the formation of PM-ZIBP elements in tomato, careful analysis, and comparison between the external and internal TIR sequences were performed for the single and tandem TIR copies. Three motifs with high sequence similarity were found in PMZIBP with tandem TIRs and two of these, motif-I and motif-II, were also found in the single TIR of PM-ZIBP-1 (Figure 3.5). The presence of the repetitive motifs in the TIR suggests that the additional TIR could be formed via a DNA replication slippage process involving a single TIR element. According to this model, when the replication proceeds to motifII, the DNA polymerase slips from the DNA template and subsequently reattaches at motif-I so that the sequence between motif-I and motif-II is duplicated. If this is the case, the external and internal TIR should originate from the same template. To test this notion, phylogenetic analysis of the internal and external TIRs from the tandem TIR PM-ZIBP and the TIR of the single TIR element from both 5’ end and 3’ end was performed using ClustalW and MEGA (Figure 3.6). This analysis demonstrates a separate grouping of the internal, external and single TIRs, which seems to contradict with the slippage hypothesis. However, this is not definitive evidence against the slippage hypothesis since the bootstrap values are relatively low and separation of the external and internal TIR could be due to their distinct role in transposition (also see below). The examination of another element with tandem TIRs (type 2-46) failed to identify the presence of similar motifs, suggesting that the presence of recognizable repetitive motif is not essential for the formation of tandem TIRs. Phylogenetic analysis between the external and internal TIRs of this family showed four fundamental groups (5’ internal, 5’ external, 3’ internal,   108       Figure 3.4. Chromosomal distribution of tandem TIR PM-ZIBP in tomato. Dark blue blocks represent heterochromatin and light blue regions represent euchromatin. Individual elements are represented by dark red vertical bars, and the purple ovals indicate the location of the centromere.   109       Figure 3.5. Alignment of TIR sequences from a PM-ZIBP with tandem TIRs and that from PMZIBP-1 illustrating the location of the 3 repetitive motifs found in the TIRs.   110       3’ external) (Figure 3.7) with distinct branch lengths, suggesting they have been amplifying for an extended period. However, there are several TIRs intermingled with other groups and the bootstrap values for the major branches are rather low. As a result, it is difficult to make a clear-cut interpretation about the origin of the TIR duplication in this family. Interestingly, the two 3-TIR elements do not form an independent group. Instead, their TIRs intermingled with different elements with tandem TIRs. The branch length of the 3-TIR elements is comparable to other type 2-46 elements so they might have formed within a similar time frame. An alternative mechanism for the formation of elements with tandem TIRs is through recombination between elements with related TIRs. If this is the case, the initial parental elements are not expected to harbor a TSD. To test this hypothesis, we screened all 3-TIR elements and tandem TIR elements that are not associated with a TSD. A closer examination of these candidates indicates all of them have a certain level of truncation at the very termini of the TIR. As a result, it is not clear whether the lack of TSD is due to recombination or due to truncation. Thus, it remains an open question whether recombination played a role in the formation of these elements. The Putative Role of the Tandem TIRs in Amplification of the Elements As mentioned above, the Pack-MULE element with single and tandem TIR PM-ZIBP copies share terminal and internal sequences, yet the elements with tandem TIRs have many more copies than the single TIR PM-ZIBP-1 (13 versus 1). For the type 2-46 element with SLMULE46 TIRs, we failed to identify a corresponding element with single TIR and exactly the same internal sequence. However, other non-autonomous MULEs with single SLMULE46 TIR and associated with a TSD were identified and the copy number of none of them is as high as that of type 2-46 (29 copies). The copy numbers of these elements range from 1 to 14, with an   111       Figure 3.6. Phylogenetic analysis of sequences of the external, and internal TIRs of PM-ZIBP elements. Sequences were aligned and phylogeny was reconstructed as described for Figure 3.3   112       Figure 3.7. Phylogenetic analysis of external and internal TIRs from type 2-46. TIRs from the 3TIR elements are indicated by a star symbol. Sequences were aligned and phylogeny was reconstructed as described for Figure 3.3   113       average of 3 copies. A parsimonious explanation for such phenomenon is that the presence of the second TIR confers some advantage for transposition. The internal TIR may simply function as a filler DNA that allows the element to achieve an optimum size. To evaluate the role of size in transposition efficiency of PM-ZIBP, the distribution of Pack-MULE sizes in tomato was examined to identify a range that would correspond to optimal sizes for successful movement and amplification in the genome. The Pack-MULEs in tomato were grouped according to size at 100 bp increments. As shown in Figure 3.8, Pack-MULEs are most abundant with size ranging from 1000–1200 bp which is very close to that of the single TIR PM-ZIBP-1 (944 bp). There are elements from 9 TIRs with 24 different types of internal regions so the presence of this maximum is not due to the amplification of one or two element families. Meanwhile, a minor maximum was observed at 1300–1400 bp (composed of seven TIR families with 10 different internal sequences), which coincides with the size of PM-ZIBP associated with tandem TIRs. Interestingly, the sizes of the elements with tandem SLMULE46 TIRs that predominantly compose the type 2 elements (type 2-46) fall into the same peak as the PM-ZIBP with tandem TIRs (Figure 3.8). The presence of tandem TIRs (over 1 kb in total) and internal sequence makes it highly unlikely for a single element to be within 1000–1200 bp in size. As a result, 1300–1400 bp could be the optimal size for elements with tandem TIRs, regardless of the presence or absence of gene fragments in the internal region. An additional explanation for the abundance of tandem TIR elements over their single TIR counterpart is an advantage conferred by the tandem TIR resulting in increased frequency of recognition by the transposase or enhanced interaction between the element and the transposition machinery. If this is the case, one would expect significant sequence conservation in both TIRs. Comparison of sequence identity between the external TIRs and the internal TIRs of PM-ZIBP   114       shows that the initial part of the internal TIR is slightly less conserved compared to that of the external TIR. However, the majority of the TIR sequence has a similar level of conservation, indicating that both TIRs may play functional roles (Figure 3.9A). The most conserved region is motif II and its adjacent region (orange and grey region, Figure 3.9A), which is not present in the external TIR, suggesting the importance of this region. This is in contrast to the low conservation level in regions between the TIR and the acquired gene fragment. Interestingly, the conservation level of the acquired gene fragment is comparable or slightly higher than that of TIRs, suggesting that the gene fragments might be functional. The divergence of the internal TIR around motif-M may be a result of selection to ensure that precise cleavage occurs in the external TIR instead of the internal TIR upon excision of the element. This is in concordance with the fact that no element with single TIR appears to be derived from elements containing tandem TIRs among PM-ZIBP elements. Compared to the internal region, the TIRs of the type 2-46 elements have considerable level of conservation, which is similar to that of PM-ZIBP (Figure 3.9B). However, unlike the PM-ZIBP elements, the most internal region of the internal TIR does not demonstrate an elevated level of conservation, suggesting the variation in location of important cis elements among different families of TIR sequence. Furthermore, a low level of conservation was observed in the internal region of this group of elements, which is consistent with the lack of gene fragments in its internal region. Discussion The Formation and Amplification of MULEs with Additional TIR Sequence The TIR sequences of DNA elements contain cis elements that are responsible for interaction with and recognition by the relevant transposases. It also contributes to the selection   115       Pack-MULEs PM-ZIBP only PM-ZIBP tandem TIR 50 all MULE with SL-MULE46 TIR 45 40 Frequency 35 30 25 20 15 10 5 0 Size (bp) Figure 3.8. The frequency of different element sizes. Elements that are less than 2 kb are plotted.   116       A                               B   Figure 3.9. Nucleotide conservation across the two tandem TIRs. (A) Tandem TIRs from PMZIBP. (B) Tandem TIRs from type 2-46. The nucleotide conservation scores are calculated as an average of 5 nucleotide position scores from the copies of the element. Colored or black triangles represent the TIR. In Figure 3.9A, the orange regions indicate the 3 repetitive motifs (see text). Colored box indicates part of the acquired gene fragment.   117       of insertion site as well as serving as the target for epigenetic regulation [4, 30]. As a result, the TIR sequences play a critical role in the successful amplification of relevant TEs. For many DNA transposons, short repetitive motifs are present in the TIR or subterminal regions, either in direct or inverted orientations. Nevertheless, the duplication of an entire TIR (or almost an entire TIR) is unusual and has not been studied previously. In this study, 59 MULE TIR families in tomato were examined and 10 of them are associated with TIR duplication. This indicates that certain MULE TIR families have a propensity to form duplicate TIRs over others, and the frequency is not correlated with the total copy number of the particular TIR family. Among the elements with multiple TIRs, some are only associated with one solo TIR (type 3, 4, 5, and 6, with 8 TIR families) and others are associated with duplicated TIRs on both ends (type 1 and 2, with 3 TIR families). Obviously, few TIR families are associated with the formation of duplicated TIRs on both ends, suggesting this is a less frequent event. However, only one element containing a solo TIR has an additional copy, and it is uncertain whether the two copies are derived from each other. This suggests the destiny of “death on birth” for elements with a solo TIR. It is possible that the presence of one additional TIR resulted in a lack of structural symmetry which interferes with transposition. In other words, the presence of one additional TIR sequence could have negative impact on transposition competency. In contrast, the elements with tandem TIRs on both ends are more successfully amplified, despite their low frequency of initial formation. The Mechanism Involved in the Formation of Duplicated TIRs At present it is not clear how the TIR sequence was duplicated in these atypical MULEs. DNA replication slippage is considered a common mechanism to cause deletion or duplication of sequences when repetitive motifs are present in adjacent regions. This seems to apply to the PM-   118       ZIBP elements due to the presence of repetitive motifs inside the TIR sequence. Nevertheless, this hypothesis is not unambiguously supported by phylogenetic analyses of internal and external TIR sequences. Furthermore, not all elements with additional TIR have significant repetitive motifs inside the TIR. As a result, there may be other mechanisms involved in the duplication of TIRs. This may include duplication by recombination or through nested insertion followed by loss of TSD for the internal element. If recombination is the main factor that drives the formation of elements with multiple TIRs, one would expect those elements to be overrepresented among the TIR families with highest copy numbers in the genome. Nevertheless, it does not seem to be the case (Table 3.1). Another question about the formation of elements with tandem TIR is whether the duplication of TIR at both ends is a single event or a step-wise process. Based on the fact that there are elements with 3 TIRs, it is possible that the duplication is a step-wise process. The coexistence of type 2-46 with two corresponding 3-TIR elements seems to suggest this is the case. However, the phylogenic analysis does not support the notion that the 3-TIR elements are older than all type 2-46 elements (Figure 3.7). Thus, it is unlikely that the 3-TIR elements are the direct progenitor of elements with tandem TIRs, and the true ancestor may have been lost from the genome. On the other hand, since the two 3-TIR elements are more closely related to type 246 elements than to each other, it seems to imply they are not derivatives of each other. In this case, an alternative scenario is that the two 3-TIR elements are derivatives of distinct type 2-46 elements through aberrant transposition. This may occur, for example, when one external TIR and one internal TIR in type 2-46 are recognized for transposition. This is consistent with the fact that none of the other 3-TIR elements has a duplicated copy, so the duplication of this particular 3-TIR element may have not arisen through the transposition of itself.   119       It is known that non-autonomous MULEs are capable of acquiring genomic sequences including genes. The frequency of acquisition of genes by MULEs seems to be higher than that of other DNA elements with shorter TIRs [17]. Moreover, the acquired sequences can be integrated into extended TIR sequences [13, 31]. Given this fact, it is conceivable that the additional TIR could also be introduced through acquisition. Unfortunately, the mechanism of sequence acquisition is yet to be understood. The comparison of copy number of elements with tandem TIRs in different genomes may provide additional insights into this question. Considering the abundance of MULEs and PackMULEs in the genomes of maize and rice, it is striking that only a few MULEs with additional TIRs are found in these genomes. However, if we assume that tandem TIRs are formed through sequence acquisition, the phenomenon can be readily explained. The genomes of maize and rice contain substantially more GC-rich sequences (or a more significant GC gradient) than that of Arabidopsis [32, 33]. Pack-MULEs in rice and maize demonstrate a strong preference for acquiring GC-rich sequences [19, 34]. Since the GC content of Pack-MULE TIR is similar or lower than the genomic average level [19], the acquisition of additional TIR sequence would be discriminated against in the genomes of rice and maize due to their relatively low GC content compared to gene sequences. In contrast, the GC content of sequences in dicot genomes is less variable [19, 35], such that TIR sequences are more likely to be acquired than their counterparts in the genomes of monocots. This might explain why there are more elements with tandem TIRs in tomato and Arabidopsis than in maize and rice. Possible Competency Conferred by Tandemly Duplicated TIR Among the elements with duplicated TIRs, two tomato elements have amplified to a certain degree. The PM-ZIBP elements have 13 copies with an additional copy that is associated   120       with only a single TIR. The type 2-46 elements have 29 copies without a corresponding copy with a single TIR, yet this particular TIR family is associated with single TIR elements harboring distinct internal regions with a lower copy number. Due to the coincidence of elements with tandem TIRs and single TIRs, it is clear that the presence of duplicated TIRs is not required for transposition, at least for these two TIR families. This raises the question whether the additional TIR has any role in transposition or successful amplification of these elements. There are several explanations for the overrepresentation of elements with tandemly duplicated TIRs among the PMZIBP elements. Our analysis excludes the possibility that the duplicated TIR is acting as a filler DNA to allow the element to achieve an optimal size for amplification. It is worth pointing out that the “optimum size” might be present due to reasons other than size. If that is case, it also implies that the failure of PM-ZIBP-1 to amplify is unlikely attributable to its size. An alternative possibility points to the role of the internal region of PMZIBP since the acquired region appears more conserved than other internal sequence. According to this model, the PM-ZIBP is amplified because of the functional role of the acquired fragment. The element with single TIR (PM-ZIBP-1) failed to amplify due to its genomic location that is likely in heterochromatic region and not accessible for the transposition machinery. Nevertheless, due to the excision activity of DNA transposons and insertion polymorphism in the population, only a small subset of the transposons formed by transposition will be retained in the genome. If this model is valid, this may imply that PM-ZIBP-1 is the sole copy with a single TIR that has ever been present in the genome and one of the elements with duplicated TIRs must have been directly derived from PM-ZIBP-1. If this was the case, this would require PM-ZIBP-1 to be accessible in a certain way, which contradicts the original assumption of this model. Alternatively, there were other copies of PM-ZIBP with single TIR in the genomic location with   121       more open chromatin that gave birth to the element with tandem TIRs. In either case, one or more of the elements with single TIR was in an accessible location but failed to amplify while their counterparts with tandem TIRs significantly increased their copy number. In addition, the type 2-46 element has amplified to 29 copies without an apparently functional internal region, suggesting that the presence of gene fragments is not required for the amplification of elements with duplicated TIRs. Taken together, the overrepresentation of PM-ZIBP and type 2-46 elements with tandem TIRs likely reflects an elevated competency for transposition for these two specific MULE families. This could be achieved by increased recognition of the element by the transposase and/or interaction with transposase. This is in accordance with the fact that sequence conservation was observed for both internal and external TIRs. Conclusion Transposable elements are the major components of plant genomes. MULEs play important roles in plant genome evolution due to their high activity and potential to acquire and amplify gene fragments. In this study, we uncovered that formation of duplicated TIRs might have contributed to the success of some specific MULE elements. The availability of genomic sequences from multiple plant genomes allows us to conduct a comprehensive analysis which led to the following conclusions: (1) the formation of elements with additional TIR is not a rare event but only elements with duplicated TIRs on both terminus have significant mobility; (2) the genome of dicots harbor more elements with duplicated TIRs than that of monocots, and such difference might be attributed to the presence of GC-rich sequences in the genomes of monocots; (3) distribution of size versus copy number of MULEs (or Pack-MULEs) is periodic, suggesting the distance between the TIRs or the relative spatial position of TIRs may have a role in transposition; (4) in the elements with tandem TIRs, both TIRs appear to be subject to certain   122       constraints, and the presence of duplicated TIRs may confer certain mechanistic advantages for transposition. Such features may be utilized to create elements with elevated transposition activity.   123       REFERENCES   124       REFERENCES [1] H. Ichikawa, K. Ikeda, W. L. Wishart, and E. Ohtsubo, “Specific binding of transposase to terminal inverted repeats of transposable element Tn3,” Proceedings of the National Academy of Sciences of the United States of America, vol. 84, no. 23, pp. 8220–8224, 1987. [2] L. Zhou, R. Mitra, P.W. Atkinson, A. B. Hickman, F. Dyda, and N. L. Craig, “Transposition of hAT elements links transposable elements and V(D)J recombination,” Nature, vol. 432, no. 7020, pp. 995–1001, 2004. [3] N. Craig, R. Craigie, M. Gellert, and A. M. Lambowitz, Mobile DNA II, ASM Press, Washington, DC, USA, 2nd edition, 2002. [4] C. J. Huang, F. Heffron, J. S. Twu, R. H. Schloemer, and C. H. Lee, “Analysis of Tn3 sequences required for transposition and immunity,” Gene, vol. 41, no. 1, pp. 23–31, 1986. [5] D. A. O’Brochta and P. W. Atkinson, “Transposable elements and gene transformation in non-drosophilid insects,” Insect Biochemistry and Molecular Biology, vol. 26, no. 8-9, pp. 739– 753, 1996. [6] D. S. Robertson, “Characterization of a Mutator system in maize,” Mutation Research, vol. 51, no. 1, pp. 21–28, 1978. [7] Z. Yu, S. I.Wright, and T. E. Bureau, “Mutator-like elements in Arabidopsis thaliana: structure, diversity and evolution,” Genetics, vol. 156, no. 4, pp. 2019–2031, 2000. [8] N. Jiang, Z. Bao, X. Zhang, S. R. Eddy, and S. R. Wessler, “Pack-MULE transposable elements mediate gene evolution in plants,” Nature, vol. 431, no. 7008, pp. 569–573, 2004. [9] D. Holligan, X. Zhang, N. Jiang, E. J. Pritham, and S. R. Wessler, “The transposable element landscape of the model legume Lotus japonicus,” Genetics, vol. 174, no. 4, pp. 2215–2228, 2006. [10] F. Chalvet, C. Grimaldi, F. Kaper, T. Langin, and M. J. Daboussi, “Hop, an active Mutatorlike element in the genome of the fungus Fusarium oxysporum,” Molecular Biology and Evolution, vol. 20, no. 8, pp. 1362–1375, 2003. [11] C. P. Marquez and E. J. Pritham, “Phantom, a new subclass of Mutator DNA transposons found in insect viruses and widely distributed in animals,” Genetics, vol. 185, no. 4, pp. 1507– 1517, 2010. [12] M. Alleman and M. Freeling, “The Mu transposable elements of maize: evidence for transposition and copy number regulation during development,” Genetics, vol. 112, no. 1, pp. 107–119, 1986.   125       [13] K. Hanada, V. Vallejo, K. Nobuta et al., “The functional role of Pack-MULEs in rice inferred from purifying selection and expression profile,” Plant Cell, vol. 21, no. 1, pp. 25–38, 2009. [14] N. Juretic, D. R. Hoen, M. L. Huynh, P. M. Harrison, and T. E. Bureau, “The evolutionary fate of MULE-mediated duplications of host gene fragments in rice,” Genome Research, vol. 15, no. 9, pp. 1292–1297, 2005. [15] M. I. Benito and V. Walbot, “Characterization of the maize Mutator transposable element MURA transposase as a DNA binding protein,” Molecular and Cellular Biology, vol. 17, no. 9, pp. 5165–5175, 1997. [16] M. N. Raizada, M. I. Benito, and V. Walbot, “The MuDR transposon terminal inverted repeat contains a complex plant promoter directing distinct somatic and germinal programs,” Plant Journal, vol. 25, no. 1, pp. 79–91, 2001. [17] P. S. Schnable, D. Ware, R. S. Fulton et al., “The B73 maize genome: complexity, diversity, and dynamics,” Science, vol. 326, no. 5956, pp. 1112–1115, 2009. [18] The Potato Genome Sequencing Consortium, “Genome sequence and analysis of the tuber crop potato,” Nature, vol. 475, no. 7355, pp. 189–195, 2011. [19] N. Jiang, A. A. Ferguson, R. K. Slotkin, and D. Lisch, “Pack-Mutator-like transposable elements (Pack-MULEs) induce directional modification of genes through biased insertion and DNA acquisition,” Proceedings of the National Academy of Sciences of the United States of America, vol. 108, no. 4, pp. 1537–1542, 2011. [20] R. C. Edgar and E. W. Myers, “PILER: identification and classification of genomic repeats,” Bioinformatics, vol. 21, no. 1, pp. i152–i158, 2005. [21] S. B. Chang, L. K. Anderson, J. D. Sherman, S. M. Royer, and S. M. Stack, “Predicting and testing physical locations of genetically mapped loci on tomato pachytene chromosome 1,” Genetics, vol. 176, no. 4, pp. 2131–2138, 2007. [22] D. Szinay, S. B. Chang, L. Khrustaleva et al., “High-resolution chromosome mapping of BACs using multi-colour FISH and pooled-BAC FISH as a backbone for sequencing tomato chromosome 6,” Plant Journal, vol. 56, no. 4, pp. 627–637, 2008. [23] S. D. Tanksley, M. W. Ganal, J. P. Prince et al., “High density molecular linkage maps of the tomato and potato genomes,” Genetics, vol. 132, no. 4, pp. 1141–1160, 1992. [24] T. M. Fulton, R. Van der Hoeven, N. T. Eannetta, and S. D. Tanksley, “Identification, analysis, and utilization of conserved ortholog set markers for comparative genomics in higher plants,” Plant Cell, vol. 14, no. 7, pp. 1457–1467, 2002.   126       [25] J. Felsenstein, “Evolutionary trees from DNA sequences: a maximum likelihood approach,” Journal of Molecular Evolution, vol. 17, no. 6, pp. 368–376, 1981. [26] M. Kimura, “A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences,” Journal of Molecular Evolution, vol. 16, no. 2, pp. 111–120, 1980. [27] N. Jiang, D. Gao, H. Xiao, and E. Van Der Knaap, “Genome organization of the tomato sun locus and characterization of the unusual retrotransposon Rider,” Plant Journal, vol. 60, no. 1, pp. 181–193, 2009. [28] B. Morgenstern, “DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment,” Bioinformatics, vol. 15, no. 3, pp. 211–218, 1999. [29] T. Sasaki, “The map-based sequence of the rice genome,” Nature, vol. 436, no. 7052, pp. 793–800, 2005. [30] D. Lisch and N. Jiang, “Mutator and MULE transposons,” in Handbook of Maize: Genetics and Genomics, pp. 277–306, Springer, New York, NY, USA, 2009. [31] D. Lisch, “Mutator transposons,” Trends in Plant Science, vol. 7, no. 11, pp. 498–504, 2002. [32] N. Carels and G. Bernardi, “Two classes of genes in plants,” Genetics, vol. 154, no. 4, pp. 1819–1825, 2000. [33] G. K. S.Wong, J.Wang, L. Tao et al., “Compositional gradients in Gramineae genes,” Genome Research, vol. 12, no. 6, pp. 851–856, 2002. [34] A. A. Ferguson and N. Jiang, “Pack-MULEs: recycling and reshaping genes through GCbiased acquisition,” Mobile Genetic Elements, vol. 1, no. 2, pp. 1–4, 2011. [35] P. F. Cavagnaro, D. A. Senalik, L. Yang et al., “Genomewide characterization of simple sequence repeats in cucumber (Cucumis sativus L.),” BMC Genomics, vol. 11, no. 1, article 569, 2010.   127       CHAPTER 4: Repetitive Sequence Landscape of the Basal Eudicot Genome Sacred Lotus (Nelumbo nucifera)   128       Abstract Transposable elements (TEs) are pervasive among eukaryotes and are often the largest component in these genomes. These sequences can introduce genetic variation that possesses adaptive and evolutionary potential; and therefore, their identification remains an integral part of genome studies. The sequencing of the genome of sacred lotus, a basal eudicot, has allowed us to characterize its repetitive content. Here, we report that 59% of the genome is composed of repetitive sequences, and the majority of them (50% of the genome) are identifiable TEs. Analysis of the TE content and diversity in sacred lotus revealed unique composition of TEs compared to other plant genomes characterized so far. Among LTR elements, Copia-like and Gypsy-like elements demonstrate comparable coverage and copy number, a distribution not typical of numerous plant genomes where Gypsy-like elements are the dominant type of LTR elements. The relative abundance of Copia-like elements is in part due to the presence of families with non-canonical LTR ends (15.6% of total LTR content), many of which have not been described previously. Sacred lotus also contains the highest coverage and copy number of hAT elements among all genomes sequenced to date. In addition, the genome contains 1447 Pack-MULEs and provides the first evidence for the acquisition preference of GC-rich sequences by Pack-MULEs outside the grasses.   129       Introduction Sacred lotus (Nelumbo nucifera) is a perennial aquatic plant that belongs to the Nelumbonacee family and is found throughout Asia and northern Australia. It provides economic value as an ornamental and food crop in Asia. In addition, parts of lotus such as the flowers, roots and rhizomes have been used for medicinal purposes (Duke, 2002; Shen-Miller, 2002; Shen-Miller et al., 2013). The genome of sacred lotus variety “China Antique” was recently sequenced (Ming et al., 2013) using Illumina and 454 technologies. The final genome assembly (804 Mb) is 86.5% of the estimated 929 Mb lotus genome (Diao et al., 2006). This provides an excellent resource for the evolutionary analysis of eudicots and comparative studies between dicots and monocots since sacred lotus phylogenetically lies outside the core eudicots. Phylogenetic comparisons between grape and sacred lotus suggests that it is a better model than the grape genome for inferences about the common ancestors of eudicots (Ming et al., 2013). Genomic analysis reveals that the sacred lotus genome lacks the γ triplication event seen in all core eudicots and shows a remarkably low substitution rate and a higher retention of duplicated genes compared with most other angiosperm genomes (Ming et al., 2013). Transposable elements (TEs) are genetic sequences first discovered 50 years ago by Barbara McClintock. TEs move from one genomic location into another and in the process increase their copy numbers. According to the intermediate form of transposition used by the specific element, TEs are generally classified into two major groups: Class I or RNA elements which transpose via an RNA intermediate using a copy-and-paste mechanism, and Class II or DNA elements that transpose via a DNA intermediate using a cut-and-paste mechanism (Wicker et al., 2007; Kapitonov and Jurka, 2008). In addition, the coding capacity of elements for proteins involved or comprising the transpositional machinery allows for further classification of   130       elements into autonomous elements, which code for these proteins, or non-autonomous elements, which rely on the cognate autonomous elements for movement within the genome. Due to their capacity to multiply within a host and their prevalence among plant and animal genomes, TEs have previously been implicated to contribute significantly to increases in genome size (Bennetzen and Kellogg, 1997; Ammiraju et al., 2007; Bennetzen, 2007; Zuccolo et al., 2007). In some instances, TE may constitute the largest part of the genome (Schnable et al., 2009; Brenchley et al., 2012; Mayer et al., 2012). Dramatic differences exist in the content and diversity of different TE types between organisms. While animal and insect genomes typically contain a higher proportion of non-LTR retrotransposons (Lander et al., 2001; Chinwalla et al., 2002; Nene et al., 2007), the LTR retrotransposons typically dominates the TE landscape in plants (Rice Sequencing Project, 2005; Paterson et al., 2009; Schnable et al., 2009; Schmutz et al., 2010). Various computational and biological analyses of genomic information have demonstrated the critical roles of transposons in many aspects of genome evolution, gene expression and regulation (Jordan et al., 2003; Muotri et al., 2007). Widespread in plant genomes is a special type of DNA element called Pack-MULE which carry genes or gene fragments (Jiang et al., 2004). These elements have been implicated in the generation of new open reading frames and the regulation of the expression of their parental genes (Hanada et al., 2009; Jiang et al., 2011). Taken together, these studies indicate that TEs are prevalent in plants and are actively interacting with other components in their host genomes. Prior to the sequencing of the sacred lotus genome, no information was available regarding any aspect of its repetitive sequence content. Due to its position in the eudicot phylogeny, sacred lotus may offer important biological contributions in terms of its TE content, structure and diversity. Here we report the results of a comprehensive computer-assisted   131       identification and analysis of the repetitive content in the available sequence of sacred lotus. Our results show the exceptional contribution of atypical Copia LTR families with non-canonical end sequences. In addition, the genome appears to have unprecedented amplification of the hAT superfamily and suggests that GC-preferential acquisition by Pack-MULEs occurs in eudicots. Methods Construction of repeat library The current assembly of 804 Mb scaffold contains 707 Mb of contig sequence and 97 Mb of sequencing gaps. The 707 Mb contig sequence was downloaded from Genbank at the NCBI database (http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=AQOG01) for further analysis. Repetitive sequences were mined using a variety of approaches. To identify LTR elements, the program LTRharvest (Ellinghaus et al., 2008) was used (parameters: -minlenltr 80 -maxlenltr 6000 -mindistltr 300 -mintsd 4 -maxtsd 6 -motif tgca -similar 90) and the resulting elements were further screened using LTRdigest (Steinbiss et al., 2009) to determine the presence of a poly purine tract (PPT) or primer binding site (PBS). Only elements that contain a PPT or PBS were retained for further analysis. To determine the precise boundary of LTR elements, 100 bp of flanking sequences (5’ and 3’ ends) were retrieved and aligned using DIALIGN2 (Morgenstern, 1999). Elements wherein ≥ 50 bp of the flanking sequences were alignable (with similarity score ≥ 60%) were excluded. This is because for most LTR elements, only the LTRs, not the flanking sequences, should be alignable. To reduce the redundancy, examplar elements were selected using the “examplar_maker.pl” script from the MITE-Hunter package (Han and Wessler, 2010). The above procedure was initially performed to recover only elements with terminal sequences 5’-TG..CA-3’. However, after the recovery of elements with non-canonical terminal sequences during manual curation (see below), the procedure was repeated without the   132       definition of terminal motif to consider other motifs and additional examplars of LTR elements with non-canonical ends were manually verified. To search for LTR elements with noncanonical ends in grape (Vitis vinifera), its genomic sequence was downloaded from Phytozome (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Vvinifera/). Non-autonomous DNA elements were mined using the MITE-Hunter package with parameters as recommended. Terminal inverted repeats (TIRs) of Mutator-like elements were identified using Pairwise Alignment of Long Sequences (PALS; Edgar and Myers, 2005) and manual curation as described previously (Ferguson and Jiang, 2012). The sequences of exemplars of LTR elements, non-autonomous DNA elements, and MULE TIRs were then used to mask the genomic sequence and the repetitive sequences in the unmasked portion of the genomic DNA were further identified in a second mining step using RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html). The output of RepeatModeler contains both known and unknown repeats. The resulting sequences were first filtered to remove putative gene families using BLASTX and sequences matching non-TE genes proteins (E < 10-5) were removed. The remaining sequences where the copy number is > 1000 or the genome coverage is ≥ 0.05% were manually curated to determine their identity and the 5’ and 3’ boundaries. This was done in a stepwise process. First, the relevant sequences were initially used to search and retrieve at least 10 hits (BLASTN, E < 10-10) with the corresponding 100 bp of 5’ and 3’ flanking sequences. Second, recovered sequences were then aligned using DIALIGN2 (Morgenstern, 1999), to determine the possible boundary between elements and their flanking sequences. In this case, a boundary was defined as the position to which sequence homology is conserved over more than half of the aligned sequences. Finally, sequences with defined boundaries were examined for the presence of target site duplication (TSD). To classify the relevant TEs, features   133       in the terminal ends and TSD were used. Each transposon family is associated with distinct features in their terminal sequences and TSD which can be used to identify an element (Wicker et al., 2007). To identify inverted or direct repeats in the terminal sequence, the putative terminal sequences were aligned using “gap” in the GCG package. Fragmented sequences identified by RepeatModeler were joined to derive a complete element. Manually curated sequences were compared to the unknown repeats using RepeatMasker. Sequences matching the curated sequences were excluded if they belong to the same family. Families were defined as follows: elements that share 80% or higher similarity across at least 90% of their length. If a repetitive sequence matches the curated sequences without reaching the above criteria, this sequence was retained and was considered to belong to a new family within the same superfamily. Estimation of copy number and genomic coverage The entire repeat library, containing elements with curated and non-curated ends, was used to mask the genomic sequence to determine TE coverage and copy number. If an element in the genomic sequence matched a sequence in the repeat library over the entire sequence, or if the truncation was less than 20 bp on each end, this copy was considered to be intact. Otherwise it was considered as a truncated sequence or half of a copy. Copy number is reported only for families that are curated, that is, the ends are known and verified manually. The genome coverage of TEs was estimated as the total sequence masked by each superfamily with overlapping regions between different entries only calculated once. Pack-MULEs Pack-MULE elements (Mutator-like elements carrying genes) were identified as described previously (Hanada et al., 2009). To determine which Pack-MULEs are expressed, the   134       element sequences were compared with EST database (Ming et al., 2013). A Pack-MULE is considered expressed if it matches an EST sequence with ≥97% similarity and the Pack-MULE coordinate is the best hit for the EST in the genome. The identification of the parental origin of the sequences captured by Pack-MULEs was conducted as described previously (Jiang et al., 2011). For an individual Pack-MULE, the sequence with the highest similarity score (BLASTN, E=1e-10), and was not associated with a MULE terminal inverted repeat (TIR), was considered as the parental copy of the internal sequence in a Pack-MULE. To calculate the GC content of MULEs and Pack-MULEs, nested TE insertions were first curated and removed from the element sequence. Determination of GC content of parental genes was conducted after masking with the repeat library that excluded Pack-MULEs. To calculate GC gradient along MULE sequences, the TIR sequences (on both ends of the elements) and the internal region (the sequence between the TIRs) were divided into two and 12 equal-sized bins, respectively. A custom perl script was used to determine the GC content of each bin. Comparisons of GC content between groups were performed using the R package (http://www.rproject.org). Arabidopsis (Arabidopsis thaliana) gene sequences were downloaded from The Arabidopsis Information Resource 10 (http://www.arabidopsis.org) while the Lotus japonicus gene sequences were downloaded from Lotus japonicus Genome Sequencing Project (http://www.kazusa.or.jp/lotus/). Rice, Arabidopsis and Lotus japonicus genes were classified as negative, positive or moderate genes based on GC content along the direction of transcription as described previously (Jiang et al., 2011). Phylogenetic analysis To search for autonomous hAT elements, both curated and non-curated autonomous hAT families were used to search the genome. Copies that are truncated by no more than 15 bp on   135       each end of the element or are over 2.5 Kb in length were retained and compared against a database containing known hAT transposase to identify elements containing the conserved motif 3 (Kempken and Windhofer, 2001; Lazarow et al., 2012). Intact LTR elements were mined using the sequences in repeat library and verified to contain the internal sequence, flanked by LTR and were associated with a 5-bp target site duplication. The conserved integrase core domain, which contains the RNase H fold catalytic motif, of representative LTR elements from each family was retrieved. Sequences of conserved integrase core domain or motif 3 from LTR elements and hAT transposase, respectively, were used to generate multiple alignments and resolved into lineages by generating phylogenetic trees. Multiple sequence alignment was performed by ClustalW (http://www.ebi.ac.uk/clustalw) with default parameters. Phylogenetic trees were generated using the maximum-likelihood method. Support for the internal branches of the phylogeny was assessed using 500 bootstrap replicates using MEGA (http://www.megasoftware.net).   136       Results Repeat content and diversity Identification of repeats was performed using the 707 Mb contig sequence of the sacred lotus genome that excluded the sequencing gaps contained in the 804 Mb final assembly. The contig sequences account for 76% of the estimated lotus genome (929 Mb) (Ming et al., 2013). Transposable elements (TEs) and other repetitive sequences were mined using a combination of structure-based and homology-based approaches (see methods). Approximately 59.3% of the genome is derived from repetitive sequences and 50% of it is composed of recognizable TEs (Table 4.1). The 353 Mb of repetitive genomic sequence included at least 1.26 million copies of TEs. This copy number is underestimated since families whose precise boundaries cannot be defined were not included in the copy number estimation which was done to avoid grossly overestimating the copy number from fragmented families whose ends are not known. The majority of recognizable TE is contributed by Class I/RNA elements (65.6%), a familiar phenomenon across the plant kingdom where the amplification of LTR elements has been suggested to contribute to genome size expansion. The majority of LTR retrotransposons are classified into two major superfamilies: Copia and Gyspy depending on the arrangement of the genes in the pol region. In sacred lotus, the LTR retrotransposon content (26.4%) is comprised by a comparable number and coverage of Copia and Gypsy elements (77951 and 72624 copies, respectively; Table 4.1). This is not typical among plant genomes wherein the Gypsy content usually outnumbers the Copia content (Table 4.2A). Among the 29 genomes with available Gypsy:Copia ratio, Gypsy elements occupy twice as much of the genomic fraction as that for Copia elements in 17 (59%) genomes. The only other plant genomes except lotus that seem to share a Gypsy:Copia ratio close to 1.0 are Brassica rapa, peach, strawberry and sweet   137       orange (1.10, 1.16, 1.20 and 1.25 respectively). However, the ratio is still closest to 1.0 in lotus. Genomes with the highest Gypsy:Copia ratio are Selaginella moellendorffii and papaya (7.81 and 5.05, respectively). Two plant genomes, banana and date palm, display the opposite extreme observed in most plants, where the Copia content outnumbers the Gypsy content (Table 4.2A). In addition, the scared lotus genome contains a high coverage of non-LTR retrotransposons (6.4%), which are predominantly contributed by LINEs (Table 4.2A). The two other plant genomes containing such a high non-LTR retrotransposon load are that of apple (7.95%) and banana (5.41%) (Velasco et al., 2010; D’Hont et al., 2012). Taken together, these results suggest a high activity and/or retention of non-LTR and Copia retrotransposons in the evolution of sacred lotus genome in comparison to most other plants. This may result from a massive amplification combined with a lack or low silencing from the genome for these TE types, overall conferring non-LTRs an efficient evasion strategy for various silencing mechanisms. Class II elements comprised about 17% of the genome. This level of DNA TE content is notable and only observed before in rice, wheat and soybean (Schmutz et al., 2010; Brenchley et al., 2012; Jiang and Panaud, 2013). The largest contributors to DNA element content are hATs (~7% of the genome), followed by Helitrons (3.8%). At least 170635 copies of the hAT elements were detected, making sacred lotus the most abundant in both genome hAT coverage and copy number among all plant genomes sequenced to date (Table 4.2B). Although most major DNA transposon families were identified, the Tc1/Mariner superfamily is absent. The absence of Tc1/Mariner elements has been reported in three other plant genomes: banana, grape, and Selaginella (Velasco et al., 2007; Banks et al., 2011; D’Hont et al., 2012), suggesting that the loss of this superfamily in plants is not unusual.   138       LTR elements with non-canonical ends In both major classes of LTR elements (Copia and Gypsy), the canonical LTR repeats found at the ends of these elements typically start with 5’-TG and ends in CA-3’ (Kumar and Bennetzen, 1999), which forms a short inverted repeat. In fact, many computer-assisted searches use this criterion in de novo searches for LTRs (McCarthy and McDonald, 2003; Steinbiss et al., 2009). However, in sacred lotus, 84 LTR families comprising eight different non-canonical LTR ends were found (Table 4.3). While the majority of these elements are Copia, two Gypsy and one TRIM families contained these atypical ends. Among all groups of LTR elements with noncanonical ends, four of them (TGCT, TGGA, TACA and TGTA) harbor mutations in one nucleotide, and the remainder (TGGT, TACT, TATA and TGTT) harbor mutations in two nucleotides. Obviously, the ones with a single mutation are more abundant than those with two mutations (Table 4.3). In addition, variations in constraints are observed in terms of the mutation allowed at different sites: a) no mutation was detected at the most 5’ nucleotide (always “T”) b) second nucleotide at 5’ end is a purine (G or A), c) the second nucleotide at 3’ end is the most flexible, and can be C, G, or T and d) the last nucleotide can be either “A” or “T”. The most abundant non-canonical end type is found in elements where the LTR starts with the canonical 5’-TG but ends in CT-3’ (referred to as TGCT LTR), where the most terminal nucleotide is not inverted. This LTR end type includes 25 Copia families making up an estimated 5658 copies or 2.3% of the genome. These elements are flanked by LTR sequences with the noncanonical ends that range from 199 to 408 bp in size, a 5 bp TSD at the insertion site, a primer binding site (PBS) and a polypurine tract (PPT) (Figure 4.1). The PBS, located immediately downstream of the 5’ LTR, and the PPT, located immediately upstream of the 3’ LTR, are conserved sequences motifs which are important for replication and amplification of the element.   139       Table 4.1. Repeat content of the sacred lotus genome. Copy Class Subclass Superfamily a number Copia 29928 LTR Gypsy Other LTR Class I non-LTR n/a Genome Coverage 12.551 72624 13.654 n/a 0.194 LINE 5621 42331 6.192 SINE 5717 8551 0.129 n/a 0.083 203734 32.803 1038 0.315 Other non-LTR Total Class I n/a 67828 CACTA Class II 26562 Copy b number 77951 44 hAT 83043 170635 7.315 MULE 51450 93733 2.634 PIF/Tourist 62126 103755 3.213 Helitron 10740 72489 3.757 n/a 0.082 DNA/unknown n/a Total Class II 207403 441650 17.317 Total TE 275231 645384 50.120 unknown 165921 324267 6.517 simple sequence 118452 118452 0.886 low complexity 174823 174823 1.791 Total repeats 734427 1262926 59.314 a  Copy number based on detected elements that contain terminal sequences. Elements with noncurated ends were not included in copy number estimation     b Copy number based on all detected fragments. Elements with non-curated ends were not included in copy number estimation       140       Table 4.2A. RNA TE information of various sequenced plant genomes. Common name DICOTS tomato Potato soybean Apple pigeon pea sacred lotus castor bean poplar papaya canola Peach Jatropha curcas barrel medic Grape cucumber strawberry sweet orange Cacao thale cress saltwater cress MONOCOTS date palma Maize   Scientific name Copia (%) Gypsy (%) Lycopersicum esculentum Solanum tuberosum Glycine max Malus domestica Cajanus cajan Nelumbo nucifera Ricinus communis Populus trichocarpa Carica papaya Brassica rapa Prunus persica Jatropha curcas Medicago trunculata Vitis vinifera Cucumis sativus Fragaria vesca Citrus sinensis Theobroma cacao Arabidopsis thaliana Thellungiella parvula 6.3 3.8 12.47 6.72 6.22 12.55 4.77 1.79 5.5 4.85 8.6 8.03 4.1 6.12 n/a 5.33 7.84 n/a 3.56 1.92 19.7 15.2 29.52 30.98 11.79 13.65 11.45 6.96 27.8 5.34 9.97 19.6 5.7 17.96 n/a 6.39 9.77 n/a 5.74 2.46 3.13 4.0 2.37 4.61 1.89 1.09 2.4 3.89 5.05 1.1 1.16 2.44 1.39 2.93 n/a 1.2 1.25 n/a 1.61 1.28 0.9 1.0 0.25 7.95 1.17 6.4 0.14 0.54 1.1 3.28 0.63 n/a 2.3 0.82 1.75 0.45 0.4 n/a 1.49 1.1 Phoenix dactylifera Zea mays 3.1 23.7 1.4 46.4 0.45 1.96 n/a 1.0 141   Gypsy:Copia ratio Non-LTR RT (%)     Table 4.2A (cont’d) Barley wheata sorghum Rice banana foxtail millet Brachypodium GYMNOSPERM Norway sprucea LYCOPHYTE S. moellendorfii a Hordeum vulgare Triticum aestivum Sorghum bicolor Oryza sativa Musa acuminata Setaria italica Brachypodium distachyon 8.46 17.39 5.18 3.6 25.58 7.18 4.86 17.96 44.03 19 15.5 11.45 22.14 16.05 2.12 2.53 3.67 4.31 0.45 3.03 3.3 0.53 0.82 0.04 1.9 5.41 1.98 1.94 Picea abies 16.0 35.0 2.19 1.0 Sellaginella moellendorfii 2.7 21.1 7.81 n/a Data is based on results from unassembled reads n/a – data not available for specific family   142       Table 4.2B. Genome information and DNA TE content of various sequenced plant genomes. Common name DICOTS tomato Potato soybean Apple pigeon pea sacred lotus castor bean poplar papaya canola Peach Jatropha curcas barrel medic Grape cucumber strawberry sweet orange Cacao thale cress saltwater cress MONOCOTS date palma   Scientific name Assembled size (Mb) Total TE (%) DNA TE (%) hAT (%) Lycopersicum esculentum Solanum tuberosum Glycine max Malus domestica Cajanus cajan Nelumbo nucifera Ricinus communis Populus trichocarpa Carica papaya Brassica rapa Prunus persica Jatropha curcas Medicago trunculata Vitis vinifera Cucumis sativus Fragaria vesca Citrus sinensis Theobroma cacao Arabidopsis thaliana Thellungiella parvula 760 727 973 604 606 804 350 485 370 284 227 285 262 477 244 210 320 327 119 137 63 62 59 52 52 50 50 44 43 40 37 37 31 29 24 23 20 16 16 7 0.9 3.9 16.5 6.6 4.5 17.3 0.91 2.4 0.21 3.2 9.1 2.0 3.4 1.8 1.2 6.4 2.3 n/a 5.25 1.2 0.1 0.2 0.04 0.35 n/a 7.32 n/a 0.02 n/a 2.87 n/a n/a 0.2 1.03 n/a 0.64 0.36 n/a 0.53 n/a (Chan et al., 2010) (Tuskan et al., 2006) (Ming et al., 2008) (Wang et al., 2011) (Verde et al., 2013) (Sato et al., 2010) (Young et al., 2011) (Velasco et al., 2007) (Huang et al., 2009) (Shulaev et al., 2010) (Xu et al., 2012) (Argout et al., 2010) (Hollister et al., 2011) (Dassanayake et al., 2011) Phoenix dactylifera 380 n/a n/a n/a (Al-Dous et al., 2011) 143   Reference (Sato et al., 2012) (Xu et al., 2011) (Schmutz et al., 2010) (Velasco et al., 2010) (Varshney et al., 2011)     Table 4.2B (cont’d) Maize Barley wheata sorghum Rice banana foxtail millet Brachypodium GYMNOSPERM Norway sprucea LYCOPHYTE spikemoss a Zea mays Hordeum vulgare Triticum aestivum Sorghum bicolor Oryza sativa Musa acuminata Setaria italica Brachypodium distachyon 2048 4560 n/a 730 374 523 423 272 84 84 79 62 45 44 40 28 8.6 5.0 14.9 7.5 20.4 1.24 9.4 4.8 1.1 0.2 0.04 0.02 1.6 n/a 0.59 0.24 (Schnable et al., 2009) (Mayer et al., 2012) (Brenchley et al., 2012) (Paterson et al., 2009) (Jiang and Panaud, 2013) (D’Hont et al., 2012) (Zhang et al., 2012) (Vogel et al., 2010) Picea abies 12000 70 1.0 n/a (Nystedt et al., 2013) Sellaginella moellendorfii 212 37 1.8 0.02 (Banks et al., 2011) Data is based on results from unassembled reads n/a – data not available   144       An example of this LTR end type is the Copia family NN00209 and is present in at least 424 copies (Figure 4.2 and Table 4.3). In total, the non-canonical Copia elements contribute to 3.74% of the genome, which account for 30% of all Copia elements. The two families of Gypsy elements with TACA terminal sequence) only contribute to 0.038% of the genome, so they are not significantly amplified. To determine the relationship between the non-canonical elements and other elements, a phylogenetic analysis was performed using the conserved integrase catalytic domain of TGCT LTR elements as well as other LTR elements in lotus and other characterized Copia elements in other species. Our analysis shows that the majority of the TGCT LTR families in sacred lotus are closely related to each other and form a single clade (Figure 4.3). Although the TGCT LTR families are predominant in this clade, it also contains a few families with the canonical TGCA end, four of the five non-canonical TGGT, and one TGGA LTR families. This suggests that mutations within this clade may give rise to other non-canonical end families and a reversion to the typical TGCA LTR type. It appears that this clade is distantly related to the Tos17 element from rice which was the first reported LTR element with a non-canonical end, TGGA (Hirochika et al., 1996). Meanwhile, four TGCT LTR families (NN00215, NN00219, NN00221 and NN00225) grouped separately in a clade that contained AtRE1 element (TATA end) from Arabidopsis (Kuwahara et al., 2000). In addition, this clade includes other sacred lotus LTR families with five non-canonical LTR types (TACA, TACT, TATA, TGTA and TGTT; Figure 4.3). This suggests a high frequency of mutation has given rise to multiple types of noncanonical LTR families within this clade. Since grape is the closest eudicot sequenced genome to sacred lotus, Copia elements from grape including five TGCT families identified in this study were also analyzed. Overall, the TGCT Copia families in grape grouped distinctly from those in   145       Table 4.3. Family, copy number and genome coverage of different LTR ends. # Copia # Gypsy # TRIM Total Total copy Total End families families families families number coverage TGCA 112 111 2 225 48009 20.75 TGCT 25 0 0 25 5658 2.29 TGGA 6 0 1 7 2210 0.31 TACA 13 2 0 15 697 0.34 TGTA 3 0 0 3 566 0.34 TGGT 5 0 0 5 400 0.22 TACT 8 0 0 8 359 0.19 TATA 7 0 0 7 186 0.13 TGTT 1 0 0 1 18 0.01 total 180 113 3 296 58103 24.58       146       gag   5’LTR   PBS   5’LTR 3’LTR 5’LTR 3’LTR 5’LTR 3’LTR 5’LTR 3’LTR 5’LTR 3’LTR 5’LTR 3’LTR 5’LTR 3’LTR 5’LTR 3’LTR     pol         3’LTR                    PPT   TCATGGTGTCTGTTGAAGATTAATTAGTTTTATGATGTTTTTCTGAGT | | || | |||||||||||||||||||||||||||||||||||||| CTTGAGGGGGAGTGTTGAAGATTAATTAGTTTTATGATGTTTTTCTGAGT AGATGTTTTATGTTTGTTGTTAGATAGGAGGTAGGGACCACAGTAGTTAT |||| ||||||||| ||||||||||||||||| ||||||| | ||||||| AGATATTTTATGTTAGTTGTTAGATAGGAGGTGGGGACCAAAATAGTTAT GTGAGTCCAAATTTTATTATGTTTTTTGAG...TTTATATGCTGTTTTAT ||||||||||||||||||||| |||||||| | |||||| |||||||| GTGAGTCCAAATTTTATTATGCTTTTTGAGTGTTATATATGTTGTTTTAT GTTTTGCCGCCCTATAAGGGCATAGTTGTCTTTGTACAATTCTATTATTT |||||||| | |||||||||||||||||||||||||| |||||||||| GTTTTGCCCCTCTATAAGGGCATAGTTGTCTTTGTACGGCTCTATTATTT TAGTATAAATAATAAGTGAAGGGGGCGGCCAGATCATTGCTCTCTTTTCT |||||||||||||||||||||||||||||||||||||||||||||||||| TAGTATAAATAATAAGTGAAGGGGGCGGCCAGATCATTGCTCTCTTTTCT TTCAACTTCTCTTCTTCTCTTCTTTTCCTTCTCTCTTCATTTCTTCTTCA |||||||||||||||||| ||||| ||||||||||||| || |||||| TTCAACTTCTCTTCTTCTTTTCTTCTCCTTCTCTCTTCCTTCTCTCTTCA AGAAACTACTTTCTTTCATATTTTTTAATTCAGCAGATTTCTACTTGGTA ||||||||||||||||||||||||||||||||||||||||||||| || AGAAACTACTTTCTTTCATATTTTTTAATTCAGCAGATTTCTACTGTGTC TCAAAGCATCGGCTAGGGTTTGTTGAAGCWTTCTCATGCAGCCCTAGATC ||| | TCAGA Figure 4.1. Sequence alignment of NN00206 illustrating the LTR sequence, TSD, primer binding site (PBS) and polypurine tract (PPT). Blue text indicates 10bp initial LTR sequence at the each terminal and red indicates 5 bp TSD. The shaded text indicates the sequence that matches a GlytRNA.   147       NN00206-7930 NN00206-9299 NN00206-10517 NN00206-10869 NN00206-14103 NN00206-14546 NN00206-18031 NN00206-20935 NN00206-21176 NN00206-24944 CACTAGAATCTGTTGA---------TCTACTGAATCTTTCA CAAAGAAAAATGTTGA---------TCTACTAAAAAAGTTG GTAATCATTTTGTTGA---------TCTACTCATTTTAATA GTTAGAAATGTGTTGA---------TCTACTAAATGCATCA AAAGAATGTATGTTGA---------CCTACTATGTATAAAT TCATGGTGTCTGTTGA---------TCTACTGTGTCTCAGA AAATGGAAGTTGTTGA---------TCTACTGAAGTGGTAT GATGGTTTAGTGTTGA---------CCTACTTTTAGAATAA CCTAGCACTCTGTTGA---------TCTACTCACTCTATTG GTATGCCTCTTGTTGA---------TCTACTCCTCTTGTTT Figure 4.2. Sequence alignment of LTR ends and TSD of the Copia family NN00206. Text in blue indicates 6 bp outermost LTR sequence and red indicates 5 bp TSD.   148   RIRE1-tgca 93 73     Copia-62 vv-tgca Copia-chr17-vv-tgct 67 NN00214-tgct NN00264-tggt 90 99 Hopscotch Os-tgta RETROFIT 49-tgca NN00263-tggt NN00080-tgca Copia-36 vv-tgca NN00025-tact NN00204-tgct 52 82 Copia-65 vv-tgca NN00097-tgca Copia-43 vv-tgca Copia-63 vv-tgca NN00265-tggt 72 ATRE1 73-tata NN00010-taca 58 62 NN00216-tgct NN00261-tggt NN00211-tgct 52 NN00028-tact NN00206-tgct NN00136-tgca NN00201-tgct NN00085-tgca NN00209-tgct 71 58 NN00210-tgct NN00218-tgct NN00047-tata NN00281-tgtt NN00205-tgct 80 54 NN00207-tgct 56 NN00212-tgct N NN00024-tact NN00202-tgct NN00221-tgct NN00217-tgct NN00023-tact NN00222-tgct NN00225-tgct NN00088-tgca NN00078-tgca NN00163-tgca 97 55 94 NN00244-tgga NN00086-tgca 76 NN00132-tgca Tvv1-tgca NN00215-tgct NN00027-tact NN00223-tgct NN00224-tgct 95 NN0008 NN00273-tgta 62 NN00213-tgct 72 BARE1-tgca Copia-45 vv-tgca Copia-chr15-vv-tg 60 55 OARE1-tgca 75 NN00026-tact 0.2 Figure 4.3. Phylogenetic analysis of TGCT LTR using the conserved integrase catalytic core Copia-chr10-vv-tgct domain. Bootstrap values are indicated as a percentage of 500 replicates. Shown are: sacred NN00096-tgca lotus sequences TGCT LTR (blue bolded), TGCA LTR (black bolded), other types of nonWintz vv-tgca canonical LTR (red), Sireviruses (green), TGCT grape LTR (light blue), TGCA grape LTR (brown), other species LTR (black). Copia23-tgca 52 Copia-73 vv-tgca NN00004-taca   56 56 149   Copia-55 vv-tgca Gentil vv-tgga NN00074-tgca NN00088-tgca   55   94 Figure 4.3 (cont’d) Edel vv-tgca NN00078-tgca NN00163-tgca 97 NN00244-tgga NN00086-tgca Brand-tgca 76 NN00132-tgca Tvv1-tgca 72 RIRE1-tgca 93 73 Copia-chr10-vv-tgct NN00096-tgca Copia23-tgca Copia-73 vv-tgca 90 99 Hopscotch Os-tgta RETROFIT 49-tgca Copia-55 vv-tgca Gentil vv-tgga 56 NN00074-tgca 56 Copia-36 vv-tgca Copia-56 vv-tgca NN00025-tact Copia-75 vv-tgca 81 Copia-65 vv-tgca Copia-chr1-vv-tgct Copia-chr18-vv-tgct 98 76 Copia-43 vv-tg Copia-63 vv-tgca NN00068-tgca NN00069-tgca NN00028-tact TOS17-tgga NN00136-tgca Rangen vv-tgca 57 87 94 81 NN00085-tgca Ji-tgca NN00047-tata Opie-tgca NN00281- Osr9-tgca SIRE-tgca 87 62 62 Osr8-tgca 56 Cremant vv-taca 95 77 84 75 Gans-vv-tgca NN00061-tgca NN00024-tact TNT1-tgca NN00064-tgca NN00221-tgct NN00023-tact NN00225-tgct 75 TONT1 LE-tgca Brand-tgca RIRE1-tgca 60   67 NN00215-tgct TTO1-tgca Ty1-tgca 93 73 NN00 NN00273-tg NN00027-tact Edel vv-tgca 89 ATRE1 73-tata NN00010-taca 58 62 NN00100-tgca NN00070-tgca 94 69 BARE1-tgca Copia-62 vv-tgca Copia-chr17-vv-tgct 67 NN00004-taca OARE1-tgca Copia-45 vv-tgca Copia-chr15-vv 60 Wintz vv-tgca 52 TONT1 LE-tgca 89 OARE1-tgca BARE1-tgca Copia-45 vv-tgca Copia-chr15-vv-tgct Copia-62 vv-tgca150   Copia-chr17-vv-tgct 90 99 Hopscotch Os-tgta 0.2 NN00026-tact Edel vv-tgca Ty1-tgca TONT1 LE-tgca   89 Brand-tgca   RIRE1-tgca Figure 4.3 (cont’d) 93 73 OARE1-tgca BARE1-tgca Copia-45 vv-tgca Copia-chr15-vv-tgct 60 Copia-62 vv-tgca Copia-chr17-vv-tgct 67 90 99 Hopscotch Os-tgta RETROFIT 49-tgca Copia-36 vv-tgca NN00025-tact Copia-65 vv-tgca ATRE1 73-tata NN00010-taca 58 62 Copia-43 vv-tgca Copia-63 vv-tgca NN00028-tact NN00136-tgca NN00085-tgca NN00047-tata NN00281-tgtt 62 56 NN00089-tgca NN00273-tgta NN00215-tgct NN00027-tact NN00219-tgct NN00024-tact NN00221-tgct NN00023-tact NN00225-tgct 75 NN00026-tact 0.2   151       sacred lotus. In contrast, some Copia families in grape with canonical ends do group together with their lotus counterparts (Figure 4.3), suggesting the TGCT elements in the two species may not have a common origin. To our knowledge, two copies of a curated LTR element with a similar non-canonical end (5’-TG.CT-3’) is present in the soybean genome (Du et al., 2010). However, these elements have been classified as Gypsy. Phylogenetic analysis of the conserved reverse transcriptase domain supports that the soybean elements are Gypsy type (Figure A1). These results suggest that formation of non-canonical LTR ends can occur in both Gypsy and Copia LTRs but it appears it occurs more often to Copia elements. Another non-canonical LTR type present in considerable copy number (2210 estimated copies, Table 4.3) are those that ends in GA-3’ (referred to as TGGA LTR). Further analysis of copy numbers among the families with TGGA LTR suggests that a single TRIM family, NN00293, which has an estimated copy number of 1264 elements, contributes the majority of the copies. Terminal-repeat retrotransposons in miniature (TRIM) elements are structurally classified as LTR retrotransposons despite their lack of coding sequences. Using the LTR and conserved internal region sequence of this element to mine for intact elements in the sacred lotus repeat database, 304 full copies (see Methods) were mined. NN00293 TRIM elements in sacred lotus are characterized by 73 bp LTR and 142 bp internal sequences, representing an LTR element with the smallest LTR characterized so far. hAT elements DNA transposable elements belonging to the hAT superfamily are widespread in plant and animal genomes have been widely used in gene tagging and functional genomics studies (Kunze and Weil, 2002). Although widespread in plants, its contribution to genomic repeat is typically low (≤1% of the genome, Table 4.2B). As stated above, the sacred lotus genome is   152       exceptional in its hAT content among plant genomes (>7%). The copy number and coverage of this superfamily indicates its successful amplification that can be a result of high transposition or retention rates, or both. Overall, 325 hAT families were identified. However, defined ends are only manually verified and defined for 75 families, therefore the estimated copy number for this superfamily is greatly underestimated. To evaluate whether specific families have expanded in the genome, the coverage of individual families was determined. Results indicate that overall, the high abundance of hAT elements in sacred lotus was not due to the massive amplification of a single family. However, the five most abundant hAT families make up to 1.52% of the sacred lotus genome, which is an exceptional success for specific DNA element family in plants. All the five most abundant families were non-autonomous elements that range in size from 293 to 658 bp. Since these elements do not encode the transposase protein, these must have relied on a cognate autonomous partner(s) in the genome for movement. To test whether levels of diversity in hAT transposase may reflect its successful amplification in the sacred lotus genome, phylogenetic analysis of the most conserved domain (motif 3 which contains the E region of the catalytic DD/E motif) of the hAT transposase (Kempken and Windhofer, 2001; Lazarow et al., 2012) among autonomous copies was performed. Our results show substantial diversity among the hAT transposases found in the genome. Among the 7 clades present in plants (Ac/Tam3 and Tag1), animals (hobo, hopper, Charlie and Tip100) and fungi (restless) (Kempken and Windhofer, 2001; Robertson, 2002), the sacred lotus genome contained autonomous hATs from the two clades in plants: Ac/Tam3 and Tag1 (Figure 4.4). Majority of the autonomous elements with a recognizable motif 3 that groups within the Ac/Tam3 clade shows a wide spectrum of diversity wherein various subgroups found   153       are more closely related to hAT proteins from other plant species than hAT transposase within the genome. The most expanded subgroup is closely related to the Tam3 transposase in Antirrhinum majus (Hehl et al., 1991). Overall, these results suggest that diversity within autonomous elements may have contributed to the success of the hAT superfamily in sacred lotus. Manual examination for the three conserved motifs that contain the transposases catalytic domain shows that out of 162 putative autonomous element copies, belonging to 15 autonomous hAT families, only 2 of these do not contain premature stop codons and are likely to code functional transposase. This data indicates that the majority of autonomous hATs and their corresponding non-autonomous partners can currently be considered “dead”. Alternatively, active autonomous copies that carry functional hAT transposase may be present in multiple and highly similar copies which may make them difficult to be assembled using the next generation sequencing technique. Pack-MULEs A total of 1447 Pack-MULEs were identified in N. nucifera. Out of 53 MULE TIR families found in the genome, 43 families were involved in gene sequence acquisition. Three of these TIRs constituted over 20% of the Pack-MULEs found. Analysis using the N. nucifera EST library generated by this study indicated that at least 10 Pack-MULEs are expressed (Table 4.4). To determine the source of the acquired genic sequences in Pack-MULEs, the internal sequences were used to search against the genomic sequence. This search identified 996 parental genes for the acquired fragments. Preferential acquisition of GC-rich sequences has previously been reported in the grasses, rice and maize (Jiang et al., 2011; Ferguson et al., 2013). Analysis in Arabidopsis was inconclusive due to the low copy number of Pack-MULEs (Jiang et al., 2011). To evaluate if the GC preference phenomenon previously observed in monocots also extends to   154       dicots, the GC gradient of Pack-MULEs in sacred lotus was analyzed. Our results indicate that internal regions of Pack-MULEs are overall more GC-rich than the genome average (Figure 4.5). Moreover, the acquired fragments are significantly more GC-rich than the genome average (P < 2.2 x 10-16; WRS). This trend is not consistent with the Arabidopsis Pack-MULEs wherein the internal sequences, TIRs and genomic average have a similar GC content (Jiang et al., 2011). Comparison of the GC content of Pack-MULE parental genes and all non-TE genes in sacred lotus show that Pack-MULE parental genes have significantly higher GC content than other nonTE genes (Figure 4.6). This results supports data from rice Pack-MULEs where parental genes are associated with significant GC-content (Jiang et al., 2011) suggesting that the preferential acquisition based on GC content also occurs in some eudicots. In grasses, a higher proportion of GC-rich genes are found in comparison to Arabidopsis (Jiang et al., 2011). In addition, the Pack-MULE acquisition preference for GC-rich sequences combined with their insertion specificity at 5’ end of genes is considered at least partly responsible for the presence of genes with negative GC-gradients (Jiang et al., 2011), where the 5’ end of genes are more GC-rich than their 3’ end. Again, this correlation is not obvious with Arabidopsis, possibly due to its low activity of Pack-MULEs. To test whether the abundance of Pack-MULEs in sacred lotus is accompanied by increased negative GC-gradient of genes, we compared the number of positive, moderate and negative genes between the Arabidopsis and sacred lotus. The proportions of positive genes, those where the 3’ half is more GC-rich than the 5’ half, in the two genomes are comparable (Arabidopsis: 7.3% and lotus: 5.9%). However, the proportion of negative genes in sacred lotus, those where the 5’ half is more GC-rich than the 3’ half, is more than double that found in Arabidopsis (25.4% vs. 11.7%; Figure 4.7). Lotus   155   NN00651-1 NN00651-2   NN00635-1   NN00643-1   97 NN01199-2 NN00643-2     NN01199-10 Charlie1   57 Mx tpZm   hAT-1 SM   METRAHAT   NN02365-2 Tip100   93Restless NN02365-21   TEMPINDAS   Hermes   NN01603-3 98   Hobo 85 NN01603-6   Tag2 HopperBd     NN02324-2 91 Tol2   NN02324-5   Folyt1 NN02692-1   NN01722-1 73   93 NN02692-2 Ac/Tam3   NN02346-5 NN01722-2   NN02278-1 NN01809-1   NN02278-2 61   NN01377-1   88 Daysleeper   NN01377-2 b-Gary 76   54 NN02346-27 OsaTag1     Ac NN01481-1   Tam3   TAG1 NN00649-1   NN00684-1   NN00640-1 76 80 65 0.2 59 NN01469-1 NN00635-10     NN00640-2 NN00651-1     NN00651-2     NN00635-1   NN00643-1 NN01469-2 NN00643-2 Charlie1 57 of the conserved domain 3 of hAT transposase. Figure 4.4 Phylogenetic analysis Bootstrap SMin red represent non-plant hAT values are indicated as a percentage of 500 replicates. hAT-1 Names transposase. Tip100 Restless   76 Hermes 156   Hobo HopperBd 80 57 NN00651-2 hAT-1 SM NN00635-1 Tip100 Figure 4.4 (cont’d) NN00643-1 Restless Hermes NN00643-2 NN00640-2 Charlie1 NN00651-1     76 NN00635-10 59 NN00643-2 Charlie1 57 hAT-1 SM Tip100 Hobo HopperBd Restless 76 Tol2 Hermes Folyt1 Hobo 73 73 76 76 HopperBd NN01722-1 Tol2 NN01722-2 Folyt1 NN01809-1 NN01722-1 NN01722-2 88 NN01809-1 88 54 54 NN01377-1 NN01377-2 NN01377-1 OsaTag1 NN01377-2 OsaTag1 NN01481-1 NN01481-1 TAG1 TAG1 NN00684-1 NN00684-1 NN01469-1 NN01469-1 65 65 NN01469-2 NN01469-2 0.2 0.2   157   Charlie       Tip100   Restless       Hobo     Hopper                     Tag1                       Table 4.4. Pack-MULEs with EST evidence of expression. Coordinate Megascaffold EST contig number Start End 1 48058897 48060093 contig04254 2 95879186 95881535 contig08224 3 3060656 3063250 contig00026 3 7141300 7143148 contig04546 6 30463674 30464922 contig13509 7 1373862 1375844 contig11719 7 1382403 1395660 contig06447 8 4585905 4587696 contig14084 10 8693634 8701482 contig10382 79 8508 10373 contig07837   158   Putative annotation rhamnosyltransferase 1 Kinase maturase K probable exonuclease V-like U-box domain-containing protein CCT/B-box zinc finger protein hypothetical protein hypothetical protein conserved hypothetical protein seed imbibition protein (Sip1)     GC Genome nonPM 50 GC content (%) 45 40 35 30 25 1 2 1 2 3 4 5 6 7 8 9 10 11 12 1 2 bin Figure 4.5. GC content along Pack-MULEs and non-Pack-MULEs. The first 2 and last 2 bins represent TIR regions and the internal sequence was divided into 12 equal-sized bins prior to determination of GC content per bin.   159       50 f f Median GC content (%) e c d 40 b a 30 20 Figure 4.6. GC content of different genomic sequences in the sacred lotus genome. Bars designated with different letters indicate their values are significantly different (α= 0.0025) by Wilcoxon Rank SumTest (WRS) with Bonferroni correction.   160       sacred lotus Lotus japonicus arabidopsis 100 % of total genes 80 60 40 20 0 negative positive Types of Genes moderate Figure 4.7. Distribution of different types of genes based on GC gradient in sacred lotus, Lotus japonicus and Arabidopsis.   161       japonicus, another dicot containing >1000 Pack-MULE (Holligan et al., 2006), has negative genes equally as frequent (23.6%) as sacred lotus (Figure 4.7). Taken together, our results suggest that Pack-MULEs may contribute to the negative GC-gradient of genes in dicots with high Pack-MULE activity. Discussion Angiosperm is the most dominant plant taxon containing as many as 400,000 species (Jarvis et al., 2007) and ranks second to insects in species richness. Two major groups within angiosperms are monocots and dicots whose split was dated 150-130 million years ago (MYA) (Wikstrom et al., 2001). At present, among the dicots, the eudicot clade represents ~75% of the species diversity in angiosperms (Drinnan et al., 1994) and lotus occupies a key position in studies of angiosperm evolution particularly in the monocot-dicot split. The divergence from its closest sister lineage was dated at 137-125 MYA (Wikstrom et al., 2001). Therefore, it is the most basal dicot sequence released, replacing the grape genome (Velasco et al., 2007; Ming et al., 2013), which diverged from its sister lineage about 118-108 MYA (Wikstrom et al., 2001) making it the only sequenced dicot genome that is closest to monocots. In addition, it lies outside the core eudicot clade and lacks a duplication event that is shared by all other sequenced eudicots (Ming et al., 2013). Transposable elements are an integral part of eukaryotic genomes and studies have demonstrated its significant contribution in the evolution of angiosperms (reviewed in Oliver et al., 2013). Notably, TEs have been implicated in the genome size expansion of plants (Sanmiguel, 1998; Hawkins et al., 2006; Piegu et al., 2006) and have contributed to the domestication of new genes (Piriyapongsa et al., 2006; Mikkelsen et al., 2007) and evolution of regulatory elements in many genomes (Jordan et al., 2003; Muotri et al., 2007). The TE-thrust   162       hypothesis proposes that both active and inactive TE can introduce genetic changes that may provides adaptive and evolutionary potential (Oliver et al., 2013). These changes include gene modifications, altered gene expression, gene duplication and creation of novel genes (as in the case of Pack-MULEs). Thus, their identification and characterization remains crucial in genome studies. The characterization of transposable elements in a basal eudicot genome such as lotus may be used as a tool in understanding the role of TE in the monocot-dicot split as well as in the diversification of eudicots. In this study, computer-assisted and manual approaches were used to mine and characterize the repeat content and diversity of sacred lotus (Nelumbo nucifera). Over half of the genome sequence is repetitive with the bulk of this composed by recognizable transposable elements. This is an underestimate because the unassembled region or sequencing gaps are usually more enriched in repetitive sequences. The total amount of TEs in sacred lotus is comparable to other plants with similar genome size, such as tomato, potato, sorghum, soybean, and apple (Table 4.2B). Like other plant genomes, majority of the sacred lotus genome repeat sequence is contributed by LTR retrotransposons. However, the content and diversity of some of the TE families contained in the genome show features unique to sacred lotus. Multiple factors are involved in the success or failure of a TE family in a host genome. Novel TEs may be introduced into a host through many pathways including but not limited to hybridization and polyploidy (Kawakami et al., 2010; Parisod et al., 2010), and horizontal transfer (Bartolomé et al., 2009; Schaack et al., 2010). Unless quickly lost, novel TEs may become prolific until they are recognized by the host and subsequently silenced. TEs can be lost or become undetectable when inactive TEs accumulate mutations over time until they are no longer recognizable as such, or are eliminated from the genome through drastic processes such as   163       recombination and random deletions (Tenaillon et al., 2010). The lotus TE content contains a notable gain and loss in DNA elements. The genome contains all major types of DNA TEs in plants except Tc1/Mariner family, a widespread DNA element family found in many plant genomes, making it the fourth reported genome devoid of these elements (D’Hont et al., 2012; Jaillion et al., 2007). The loss of this TE superfamily in sacred lotus suggests that specific silencing and/or insertion targeting mechanisms can be critical in the failure or success of TE families in a host. In this particular case, it is unknown which factor played a more important role in its demise. In contrast, the hAT superfamily seems to have massively proliferated, accounting for as much as 7.3% of the genome. This represents more than the total Class II/DNA element content of many dicot plant genomes with the exception of peach and soybean (Schmutz et al., 2010; Verde et al., 2013). In addition, this is a level of coverage typically seen only from retrotransposons among eukaryotes. This indicates an overall unparalleled activity of many hAT families in the evolution of sacred lotus, more than any other genome whose repeat content has been characterized. Since amplification and success of a specific TE family in a genome involves a struggle against many other TE types in the genome, this implies that hAT elements in sacred lotus may have possessed certain advantages. This can include a combination of higher transposition efficiency and more successful targeting preference that allows it to escape intense genomic regulation. The diversity in hAT transposase proteins may play a role in this regard. However, in this study, analysis of conserved domains from the hAT transposase encoded by autonomous elements suggests that many autonomous hAT elements found by this study are presently inactive, suggesting they are subject to significant silencing and might be proceeding to the end of their life cycle.   164       The hAT superfamily includes the Ac/Ds elements which were the very first transposons discovered (McClintock, 1951) and has since been paramount in the study of TE domestication. In a study of 65 instances of traits in angiosperms generated by TE, hAT elements account for 20% of these events (Oliver et al., 2013). These include the SLEEPER genes from a domesticated hAT transposase that functions as transcriptional regulators of plant development unique to angiosperms (Knip et al., 2012). Because of their prevalence in the angiosperm lineage and its role in plant development, these data suggest the potential impact of domesticated hAT genes in angiosperm evolution. The unparalleled activity of hATs in lotus compared with other organisms may prompt studies to evaluate its effect in gene regulation as well as TE domestication. LTR retrotransposons are the most abundant TEs in plants and are considered to be responsible for the expansion of plant genomes. Most LTR elements start with 5’-TG and end in CA-3’. Such terminal sequence was believed to be important for element integration (Temin, 1981). Prior to this study, four incidents of LTR elements with non-canonical ends were reported or annotated. These include three Copia-like elements: the Tos17 in rice, AtRE1 in Arabidopsis and TARE1 in tomato with 5’-TG.GA-3’, 5’-TA.TA-3’ and 5’-TA.CA-3’ LTR ends, respectively (Hirochika et al., 1996; Kuwahara et al., 2000; Yin et al., 2013). In addition, three copies of Gypsy-like elements with 5’-TG.CT-3’ are annotated in soybean (Du et al., 2010). However, all these elements are low copy number elements in natural populations, despite the fact that Tos17 can achieve high activity artificially through tissue culture (Hirochika et al., 1996). As a result, it is unclear whether they represent transient mutations or they can successfully amplify and be retained in natural populations. In sacred lotus, we found 8 different non-canonical ends that comprise at least 10000 copies and 3.8% of the genome (16% of total LTR content and 30% of   165       total Copia-like elements), 4 of these (TGTA, TACT, TGGT, TGTT) are reported for the first time. Our analysis indicates that among the two terminal nucleotides on each side of the LTR, only the first nucleotide at the 5’ end is not replaceable. All other nucleotides can be substituted without complete abolishment of the transposition activity. The second nucleotide at the 3’ end is the most flexible and can be C, G, or T. The differential constraint at the two ends suggests that they play distinct roles in integration. Thus far, the high copy TGCT LTR covers over 2% of the sacred lotus genome and is the first non-canonical LTR type reported with this level of coverage and copy number in plants. This end type is found in the soybean genome but is present in only 3 copies (one of which is a solo LTR), suggesting that it may not be highly competent for transposition. Similarly, the two families of Gypsy-like elements with non-canonical ends detected in the genome of sacred lotus demonstrated limited amplification. Therefore, it appears that non-canonical ends are more successful among Copia-like elements. The origin of the TGCT elements in sacred lotus could be explained in two ways. It is possible that there is an ancient lineage of Copia-like elements the integrase of which has higher affinity to TGCT than TGCA end. If that is the case, one would expect the TGCT elements group with the same elements in other species but that is not observed. Alternatively, the TGCT elements are derived from relatively recent mutations in sacred lotus and somehow have achieved significant success. Our phylogenetic analysis indicates this is more likely the case. Unlike the hAT elements (Figure 4.4), where most elements group with counterparts in other species, the majority of TGCT LTR families are closely related, and may represent a specific clade of LTR elements with frequent formation of non-canonical LTR ends. For the tomato TARE1 elements, it was postulated that the change in the LTR sequence was due to a   166       mutation in the 3’ LTR end from ‘G’ to ‘A’ prior to the transposition of the element (Yin et al., 2013). It is known that the reverse transcription reaction is error-prone so it is not surprising that mutations arise prior to transposition. The critical question is whether the mutation is competent for further transposition. If not, the mutation would rapidly disappear from the genome. Our analysis clearly demonstrated that 3 out of the 4 terminal nucleotides can be altered (a maximum of 2 can be mutated simultaneously) to retain the transposition activity. The presence of 8 different mutant ends suggests the high degree of flexibility of the integrase of Copia-like elements to cooperate with different element ends. On the other hand, the presence of few elements ending with TGCA end among the TGCT clade (Figure 4.3) suggests that the mutation might be reversible but the integrase of these elements might have been evolved specificity to this type of end (TGCT). If this is the case, the success of TGCT elements could be explained by the long-term co-evolution between the element and the transposition machinery in the lineage of sacred lotus. Computer-assisted techniques to mine LTR retrotransposons during de novo genomic searches usually involve structure-based algorithms. Two widely used programs are LTR-STRUC and LTR_FINDER which uses defining features of LTR elements (McCarthy and McDonald, 2003; Xu and Wang, 2007), one of which is the typical LTR which starts with 5’-TG and ends in CA-3’. Due to this, LTR elements with atypical ends can be inherently missed by automated searches. Our analysis provides guidance for future annotation of LTR elements in plant genomes. Pack-MULEs are Mutator-like elements (MULE) that carry gene fragment(s). These elements are particularly important as they may generate new open reading frames and/or regulate the expression of the parental genes from which the captured sequences are derived (Jiang et al., 2004; Hanada et al., 2009). Recent work shows that gene GC content and   167       expression play a combined role in the acquisition preference of Pack-MULEs (Ferguson et al., 2013). Our annotation also allowed additional analyses in regards to sequence acquisition preference by Pack-MULEs. To date, the sacred lotus is part of a very small subset of plants (along with rice and Lotus japonicus) that contain thousands of Pack-MULEs (Jiang et al., 2004; Holligan et al., 2006). Analysis of GC content of Pack-MULEs and their parental genes shows that a high GC content preference by Pack-MULEs also extends to dicots, a notion that was previously inconclusive using Arabidopsis data (Jiang et al., 2011). Recent work on the nucleotide composition of sacred lotus shows that its GC content is intermediate to that of eudicots and grasses and exhibits negative GC gradient in the GC3 content (Singh et al., 2013). In combination with the presence of a higher proportion of negative genes than that in the Arabidopsis genome, it may indicate that significant activity of Pack-MULEs may cause detectable variation of GC gradient of genes. Our results suggest that sacred lotus may serve as an additional model for Pack-MULE studies. In addition, it may serve in comparative studies in the role and impact of Pack-MULE formation in gene duplication and evolution for monocots and dicots.   168       APPENDIX   169       soybean   Figure A1. Phylogenetic analysis of conserved domain in the reverse transcriptase gene from Gypsy LTR. Bootstrap values are indicated as a percentage of 500 replicates.   170       REFERENCES   171       REFERENCES Adams MD (2000) The Genome Sequence of Drosophila melanogaster. Science 287: 2185–2195 Alleman, M., Freeling, M. (1986) The Mu transposable elements of maize: evidence for transposition and copy number regulation during development. Genetics 112: 107–119 Ammiraju JSS, Zuccolo A, Yu Y, Song X, Piegu B, Chevalier F, Walling JG, Ma J, Talag J, Brar DS, et al (2007) Evolutionary dynamics of an ancient retrotransposon family provides insights into evolution of genome size in the genus Oryza. The Plant Journal 52: 342–351 Argout X, Salse J, Aury J-M, Guiltinan MJ, Droc G, Gouzy J, Allegre M, Chaparro C, Legavre T, Maximova SN, et al (2010) The genome of Theobroma cacao. Nature Genetics 43: 101–108 Banks JA, Nishiyama T, Hasebe M, Bowman JL, Gribskov M, dePamphilis C, Albert VA, Aono N, Aoyama T, Ambrose BA, et al (2011) The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of Vascular Plants. Science 332: 960–963 Barkan A, Martienssen RA (1991) Inactivation of maize transposon Mu suppresses a mutant phenotype by activating an outward-reading promoter near the end of Mu1. Proceedings of the National Academy of Sciences 88: 3502–3506 Bartolomé C, Bello X, Maside X (2009) Widespread evidence for horizontal transfer of transposable elements across Drosophila genomes. Genome Biology 10: R22 Becker H-A (1997) Maize Activator transposase has a bipartite DNA binding domain that recognizes subterminal sequences and the terminal inverted repeats. Molecular and General Genetics MGG 254: 219–230 Bejerano G, Lowe CB, Ahituv N, King B, Siepel A, Salama SR, Rubin EM, James Kent W, Haussler D (2006) A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature 441: 87–90 Benito MI, Walbot V (1997) Characterization of the maize Mutator transposable element MURA transposase as a DNA-binding protein. Mol Cell Biol 17: 5165–75 Bennetzen J (2007) Patterns in grass genome evolution. Current Opinion in Plant Biology 10: 176–181 Bennetzen JL (2005) Mechanisms of Recent Genome Size Variation in Flowering Plants. Annals of Botany 95: 127–132 Bennetzen JL, Kellogg EA (1997) Do Plants Have a One-Way Ticket to Genomic Obesity? Plant Cell 9: 1509–1514   172       Bennetzen JL, Springer PS (1994) The generation of Mutator transposable element subfamilies in maize. Theoretical and Applied Genetics. doi: 10.1007/BF00222890 Bennetzen JL, Swanson J, Taylor WC, Freeling M (1984) DNA insertion in the first intron of maize Adh1 affects message levels: cloning of progenitor and mutant Adh1 alleles. Proceedings of the National Academy of Sciences 81: 4125–4128 Brenchley R, Spannagl M, Pfeifer M, Barker GLA, D’Amore R, Allen AM, McKenzie N, Kramer M, Kerhornou A, Bolser D, et al (2012) Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 491: 705–710 Bundock P, Hooykaas P (2005) An Arabidopsis hAT-like transposase is essential for plant development. Nature 436: 282–284 Bushman FD (2003) Targeting survival: integration site selection by retroviruses and LTRretrotransposons. Cell 115: 135–138 Callinan PA, Batzer MA (2006) Retrotransposable Elements and Human Disease. In J-N Volff, ed, Genome Dynamics. KARGER, Basel, pp 104–115 Cao X, Jacobsen SE (2002) Role of the Arabidopsis DRM Methyltransferases in De Novo DNA Methylation and Gene Silencing. Current Biology 12: 1138–1144 Chalker DL, Sandmeyer SB (1992) Ty3 integrates within the region of RNA polymerase III transcription initiation. Genes & Development 6: 117–128 Chalvet F (2003) Hop, an Active Mutator-like Element in the Genome of the Fungus Fusarium oxysporum. Molecular Biology and Evolution 20: 1362–1375 Chan AP, Crabtree J, Zhao Q, Lorenzi H, Orvis J, Puiu D, Melake-Berhan A, Jones KM, Redman J, Chen G, et al (2010) Draft genome sequence of the oilseed species Ricinus communis. Nature Biotechnology 28: 951–956 Chinwalla AT, Cook LL, Delehaunty KD, Fewell GA, Fulton LA, Fulton RS, Graves TA, Hillier LW, Mardis ER, McPherson JD, et al (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562 Chomet P, Lisch D, Hardeman KJ, Chandler VL, Freeling M (1991) Identification of a regulatory transposon that controls the Mutator transposable element system in maize. Genetics 129: 261–270 Cordaux R (2006) From the Cover: Birth of a chimeric primate gene by capture of the transposase gene from a mobile element. Proceedings of the National Academy of Sciences 103: 8101–8106   173       D’Hont A, Denoeud F, Aury J-M, Baurens F-C, Carreel F, Garsmeur O, Noel B, Bocs S, Droc G, Rouard M, et al (2012) The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488: 213–217 Dassanayake M, Oh D-H, Haas JS, Hernandez A, Hong H, Ali S, Yun D-J, Bressan RA, Zhu JK, Bohnert HJ, et al (2011) The genome of the extremophile crucifer Thellungiella parvula. Nature Genetics 43: 913–918 Diao Y, Chen L, Yang G, Zhou M, Song Y, Hu Z, Liu JY (2006) Nuclear DNA C-values in 12 species in Nymphales. Caryologia 59: 25–30 Dietrich CR, Cui F, Packila ML, Li J, Ashlock DA, Nikolau BJ, Schnable PS (2002) Maize Mu transposons are targeted to the 5’ untranslated region of the gl8 gene and sequences flanking Mu target-site duplications exhibit nonrandom nucleotide composition throughout the genome. Genetics 160: 697–716 Al-Dous EK, George B, Al-Mahmoud ME, Al-Jaber MY, Wang H, Salameh YM, Al-Azwani EK, Chaluvadi S, Pontaroli AC, DeBarry J, et al (2011) De novo genome sequencing and comparative genomics of date palm (Phoenix dactylifera). Nature Biotechnology 29: 521–527 Drinnan AN, Crane PR, Hoot SB (1994) Patterns of floral evolution in the early diversification of non-magnoliid dicotyledons (eudicots). In PK Endress, EM Friis, eds, Early Evolution of Flowers. Springer Vienna, Vienna, pp 93–122 Du J, Grant D, Tian Z, Nelson RT, Zhu L, Shoemaker RC, Ma J (2010) SoyTEdb: a comprehensive database of transposable elements in the soybean genome. BMC Genomics 11: 113 Duke JA, Duke (2002) Handbook of medicinal herbs. CRC Press, Boca Raton, FL Edgar RC, Myers EW (2005) PILER: identification and classification of genomic repeats. Bioinformatics 21: i152–i158 Ellinghaus D, Kurtz S, Willhoeft U (2008) LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9: 18 Ferguson AA, Jiang N (2012) Mutator-Like Elements with Multiple Long Terminal Inverted Repeats in Plants. Comparative and Functional Genomics 2012: 1–14 Fernandes J, Dong Q, Schneider B, Morrow DJ, Nan G-L, Brendel V, Walbot V (2004) Genome-wide mutagenesis of Zea mays L. using RescueMu transposons. Genome Biol 5: R82 Feschotte C (2004) Merlin, a New Superfamily of DNA Transposons Identified in Diverse Animal Genomes and Related to Bacterial IS1016 Insertion Sequences. Molecular   174       Biology and Evolution 21: 1769–1780 Feschotte C (2005) DNA-binding specificity of rice mariner-like transposases and interactions with Stowaway MITEs. Nucleic Acids Research 33: 2153–2165 Feschotte C, Jiang N, Wessler SR (2002) PLANT TRANSPOSABLE ELEMENTS: WHERE GENETICS MEETS GENOMICS. Nature Reviews Genetics 3: 329–341 Flavell AJ, Pearce SR, Kumar A (1994) Plant transposable elements and the genome. Current Opinion in Genetics & Development 4: 838–844 Guo HB (2008) Cultivation of lotus (Nelumbo nucifera Gaertn. ssp. nucifera) and its utilization in China. Genetic Resources and Crop Evolution 56: 323–330 Han Y, Wessler SR (2010) MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Research 38: e199–e199 Han Y-C, Teng C-Z, Zhong S, Zhou M-Q, Hu Z-L, Song Y-C (2007) Genetic variation and clonal diversity in populations of Nelumbo nucifera (Nelumbonaceae) in central China detected by ISSR markers. Aquatic Botany 86: 69–75 Hanada K, Vallejo V, Nobuta K, Slotkin RK, Lisch D, Meyers BC, Shiu S-H, Jiang N (2009) The Functional Role of Pack-MULEs in Rice Inferred from Purifying Selection and Expression Profile. THE PLANT CELL ONLINE 21: 25–38 Hawkins JS, Kim H, Nason JD, Wing RA, Wendel JF (2006) Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium. Genome Research 16: 1252–1261 Hehl R, Nacken WKF, Krause A, Saedler H, Sommer H (1991) Structural analysis of Tam3, a transposable element from Antirrhinum majus, reveals homologies to the Ac element from maize. Plant Molecular Biology 16: 369–371 Hershberger RJ, Benito M-I, Hardeman KJ, Warren C, Chandler VL, Walbot V (1995) Characterization of the major transcripts encoded by the regulatory MuDR transposable element of maize. Genetics 140: 1087–1098 Hirochika H, Okamoto H, Kakutani T (2000) Silencing of retrotransposons in arabidopsis and reactivation by the ddm1 mutation. Plant Cell 12: 357–369 Hirochika H, Sugimoto K, Otsuki Y, Tsugawa H, Kanda M (1996) Retrotransposons of rice involved in mutations induced by tissue culture. Proceedings of the National Academy of Sciences 93: 7783–7788 Holligan D, Zhang X, Jiang N, Pritham EJ, Wessler SR (2006) The Transposable Element Landscape of the Model Legume Lotus japonicus. Genetics 174: 2215–2228   175       Hollister JD, Smith LM, Guo Y-L, Ott F, Weigel D, Gaut BS (2011) Transposable elements and small RNAs contribute to gene expression divergence between Arabidopsis thaliana and Arabidopsis lyrata. Proceedings of the National Academy of Sciences 108: 2322–2327 Howe K, Clark MD, Torroja CF, Torrance J, Berthelot C, Muffato M, Collins JE, Humphray S, McLaren K, Matthews L, et al (2013) The zebrafish reference genome sequence and its relationship to the human genome. Nature 496: 498–503 Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, Lucas WJ, Wang X, Xie B, Ni P, et al (2009) The genome of the cucumber, Cucumis sativus L. Nature Genetics 41: 1275–1281 Ichikawa H, Ikeda K, Wishart WL, Ohtsubo E (1987) Specific binding of transposase to terminal inverted repeats of transposable element Tn3. Proceedings of the National Academy of Sciences 84: 8220–8224 Jarvis CE, Linnean Society of London, Natural History Museum (London, England) (2007) Order out of chaos: Linnaean plant names and their types. Linnean Society of London in association with the Natural History Museum, London, London Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature 431: 569–573 Jiang N, Ferguson AA, Slotkin RK, Lisch D (2011) Pack-Mutator-like transposable elements (Pack-MULEs) induce directional modification of genes through biased insertion and DNA acquisition. Proceedings of the National Academy of Sciences 108: 1537–1542 Jiang N, Panaud O (2013) Transposable Element Dynamics in Rice and Its Wild Relatives. In Q Zhang, RA Wing, eds, Genetics and Genomics of Rice. Springer New York, New York, NY, pp 55–69 Jordan IK, Rogozin IB, Glazko GV, Koonin EV (2003) Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends in Genetics 19: 68–72 Juretic N, Hoen DR, Huynh ML, Harrison PM, Bureau TE (2005) The evolutionary fate of MULE-mediated duplications of host gene fragments in rice. Genome Res 15: 1292– 1297 Kalendar R, Tanskanen J, Immonen S, Nevo E, Schulman AH (2000) From the Cover: Genome evolution of wild barley (Hordeum spontaneum) by BARE-1 retrotransposon dynamics in response to sharp microclimatic divergence. Proceedings of the National Academy of Sciences 97: 6603–6607 Kamal M (2006) A large family of ancient repeat elements in the human genome is under strong selection. Proceedings of the National Academy of Sciences 103: 2740–2745 Kapitonov VV (2006) Self-synthesizing DNA transposons in eukaryotes. Proceedings of the   176       National Academy of Sciences 103: 4540–4545 Kapitonov VV, Jurka J (2005) RAG1 core and V(D)J recombination signal sequences were derived from Transib transposons. PLoS Biol 3: e181 Kapitonov VV, Jurka J (2008) A universal classification of eukaryotic transposable elements implemented in Repbase. Nature Reviews Genetics 9: 411–412 Kapitonov VV, Jurka J (2007) Helitrons on a roll: eukaryotic rolling-circle transposons. Trends in Genetics 23: 521–529 Kapusta A, Kronenberg Z, Lynch VJ, Zhuo X, Ramsay L, Bourque G, Yandell M, Feschotte C (2013) Transposable Elements Are Major Contributors to the Origin, Diversification, and Regulation of Vertebrate Long Noncoding RNAs. PLoS Genetics 9: e1003470 Kawakami T, Strakosh SC, Zhen Y, Ungerer MC (2010) Different scales of Ty1/copia-like retrotransposon proliferation in the genomes of three diploid hybrid sunflower species. Heredity (Edinb) 104: 341–350 Kempken F, Windhofer F (2001) The hAT family: a versatile transposon group common to plants, fungi, animals, and man. Chromosoma 110: 1–9 Kinoshita T (2004) One-Way Control of FWA Imprinting in Arabidopsis Endosperm by DNA Methylation. Science 303: 521–523 Knip M, de Pater S, Hooykaas PJ (2012) The SLEEPER genes: a transposase-derived angiosperm-specific gene family. BMC Plant Biology 12: 192 Kumar A, Bennetzen JL (1999) Plant Retrotransposons. Annual Review of Genetics 33: 479–532 Kunze R, Weil CF (2002) The hAT and CACTA superfamilies of plant transposons. Mobile DNA II. pp 565–610 Kuwahara A, Kato A, Komeda Y (2000) Isolation and characterization of copia-type retrotransposons in Arabidopsis thaliana. Gene 244: 127–136 Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921 Law JA, Jacobsen SE (2010) Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nature Reviews Genetics 11: 204–220 Lazarow K, Du M-L, Weimer R, Kunze R (2012) A Hyperactive Transposase of the Maize Transposable Element Activator (Ac). Genetics 191: 747–756   177       Li W, Shaw JE (1993) A variant Tc4 transposable element in the nematode C.elegans could encode a novel protein. Nucleic Acids Research 21: 59–67 Lindroth AM (2001) Requirement of CHROMOMETHYLASE3 for Maintenance of CpXpG Methylation. Science 292: 2077–2080 Linheiro RS, Bergman CM (2012) Whole Genome Resequencing Reveals Natural Target Site Preferences of Transposable Elements in Drosophila melanogaster. PLoS ONE 7: e30008 Lippman Z, Gendrel A-V, Black M, Vaughn MW, Dedhia N, Richard McCombie W, Lavine K, Mittal V, May B, Kasschau KD, et al (2004) Role of transposable elements in heterochromatin and epigenetic control. Nature 430: 471–476 Lisch D (2002) Mutator transposons. Trends Plant Sci 7: 498–504 Lisch D (2009) Epigenetic Regulation of Transposable Elements in Plants. Annual Review of Plant Biology 60: 43–66 Lisch D (2005) Pack-MULEs: theft on a massive scale. BioEssays 27: 353–355 Lisch D, Girard L, Donlin M, Freeling M (1999) Functional analysis of deletion derivatives of the maize transposon MuDR delineates roles for MURA and MURB proteins. Genetics 151: 331–341 Lisch D, Jiang N (2009) Mutator and MULE transposons. In JL Bennetzen, S Hake, eds, Handbook of Maize. Springer New York, New York, NY, pp 277–306 Liu S, Yeh C-T, Ji T, Ying K, Wu H, Tang HM, Fu Y, Nettleton D, Schnable PS (2009) Mu Transposon Insertion Sites and Meiotic Recombination Events Co-Localize with Epigenetic Marks for Open Chromatin across the Maize Genome. PLoS Genetics 5: e1000733 Loot C, Santiago N, Sanz A, Casacuberta JM (2006) The proteins encoded by the pogo-like Lemi1 element bind the TIRs and subterminal repeated motifs of the Arabidopsis Emigrant MITE: consequences for the transposition mechanism of MITEs. Nucleic Acids Research 34: 5238–5246 Marquez CP, Pritham EJ (2010) Phantom, a New Subclass of Mutator DNA Transposons Found in Insect Viruses and Widely Distributed in Animals. Genetics 185: 1507–1517 Mayer KFX, Waugh R, Langridge P, Close TJ, Wise RP, Graner A, Matsumoto T, Sato K, Schulman A, Muehlbauer GJ, et al (2012) A physical, genetic and functional sequence assembly of the barley genome. Nature. doi: 10.1038/nature11543 McCarthy EM, McDonald JF (2003) LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19: 362–367   178       McClintock B (1951) CHROMOSOME ORGANIZATION AND GENIC EXPRESSION. Cold Spring Harbor Symposia on Quantitative Biology 16: 13–47 McLaughlin M, Walbot V (1987) Cloning of a mutable bz2 allele of maize by transposon tagging and differential hybridization. Genetics 117: 771–776 Middleton CP, Stein N, Keller B, Kilian B, Wicker T (2013) Comparative analysis of genome composition in Triticeae reveals strong variation in transposable element dynamics and nucleotide diversity. The Plant Journal 73: 347–356 Mikkelsen TS, Wakefield MJ, Aken B, Amemiya CT, Chang JL, Duke S, Garber M, Gentles AJ, Goodstadt L, Heger A, et al (2007) Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature 447: 167–177 Mills RE, Bennett EA, Iskow RC, Devine SE (2007) Which transposable elements are active in the human genome? Trends in Genetics 23: 183–191 Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, Senin P, Wang W, Ly BV, Lewis KLT, et al (2008) The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452: 991–996 Ming R, VanBuren R, Liu Y, Yang M, Han Y, Li L-T, Zhang Q, Kim M-J, Schatz MC, Campbell M, et al (2013) Genome of the long-living sacred lotus (Nelumbo nucifera Gaertn.). Genome Biology 14: R41 Morgenstern B (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15: 211–218 Muotri AR, Marchetto MCN, Coufal NG, Gage FH (2007) The necessary junk: new functions for transposable elements. Human Molecular Genetics 16: R159–R167 Nassif N, Penney J, Pal S, Engels WR, Gloor GB (1994) Efficient copying of nonhomologous sequences from ectopicsites via P-element-induced gap repair. Mol Cell Biol 14: 1613– 1625 Nene V, Wortman JR, Lawson D, Haas B, Kodira C, Tu Z, Loftus B, Xi Z, Megy K, Grabherr M, et al (2007) Genome Sequence of Aedes aegypti, a Major Arbovirus Vector. Science 316: 1718–1723 Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin Y-C, Scofield DG, Vezzi F, Delhomme N, Giacomello S, Alexeyenko A, et al (2013) The Norway spruce genome sequence and conifer genome evolution. Nature 497: 579–584 O’Reilly C, Shepherd NS, Pereira A, Schwarz-Sommer Z, Bertram I, Robertson DS, Peterson PA, Saedler H (1985) Molecular cloning of the a1 locus of Zea mays using the   179       transposable elements En and Mu1. EMBO J 4: 877–882 Oliver KR, McComb JA, Greene WK (2013) Transposable Elements: Powerful Contributors to Angiosperm Evolution and Diversity. Genome Biology and Evolution 5: 1886–1901 Pan L, Xia Q, Quan Z, Liu H, Ke W, Ding Y (2009) Development of Novel EST-SSRs from Sacred Lotus (Nelumbo nucifera Gaertn) and Their Utilization for the Genetic Diversity Analysis of N. nucifera. Journal of Heredity 101: 71–82 Pardue M-L, Rashkova S, Casacuberta E, DeBaryshe PG, George JA, Traverse KL (2005) Two retrotransposons maintain telomeres in Drosophila. Chromosome Research 13: 443–453 Parisod C, Alix K, Just J, Petit M, Sarilar V, Mhiri C, Ainouche M, Chalhoub B, Grandbastien M-A (2010) Impact of transposable elements on the organization and function of allopolyploid genomes: Research review. New Phytologist 186: 37–45 Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, et al (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457: 551–556 Piegu B, Guyot R, Picault N, Roulin A, Saniyal A, Kim H, Collura K, Brar DS, Jackson S, Wing RA, et al (2006) Doubling genome size without polyploidization: Dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice. Genome Research 16: 1262–1269 Piriyapongsa J, Marino-Ramirez L, Jordan IK (2006) Origin and Evolution of Human microRNAs From Transposable Elements. Genetics 176: 1323–1337 Robertson DS (1978) Characterization of a mutator system in maize. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis 51: 21–28 Robertson HM (2002) Evolution of DNA transposons. Mobile DNA II. American Society for Microbiology Press, Washington, D.C, pp 1093–1110 Sanmiguel P (1998) Evidence that a Recent Increase in Maize Genome Size was Caused by the Massive Amplification of Intergene Retrotransposons. Annals of Botany 82: 37–44 Sato S, Hirakawa H, Isobe S, Fukai E, Watanabe A, Kato M, Kawashima K, Minami C, Muraki A, Nakazaki N, et al (2010) Sequence Analysis of the Genome of an Oil-Bearing Tree, Jatropha curcas L. DNA Research 18: 65–76 Sato S, Tabata S, Hirakawa H, Asamizu E, Shirasawa K, Isobe S, Kaneko T, Nakamura Y, Shibata D, Aoki K, et al (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485: 635–641 Schaack S, Gilbert C, Feschotte C (2010) Promiscuous DNA: horizontal transfer of transposable   180       elements and why it matters for eukaryotic evolution. Trends in Ecology & Evolution 25: 537–546 Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463: 178–183 Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, et al (2009) The B73 Maize Genome: Complexity, Diversity, and Dynamics. Science 326: 1112–1115 Schulman AH, Kalendar R (2005) A movable feast: diverse retrotransposons and their contribution to barley genome dynamics. Cytogenetic and Genome Research 110: 598– 605 Sequencing Project IRG (2005) The map-based sequence of the rice genome. Nature 436: 793– 800 Shaheen M, Williamson E, Nickoloff J, Lee S-H, Hromas R (2010) Metnase/SETMAR: a domesticated primate transposase that enhances DNA repair, replication, and decatenation. Genetica 138: 559–566 Shen-Miller J (2002) Sacred lotus, the long-living fruits of China Antique. Seed Science Research 12: 131–143 Shen-Miller J, Aung LH, Turek J, Schopf JW, Tholandi M, Yang M, Czaja A (2013) CenturiesOld Viable Fruit of Sacred Lotus Nelumbo nucifera Gaertn var. China Antique. Tropical Plant Biology 6: 53–68 Shulaev V, Sargent DJ, Crowhurst RN, Mockler TC, Folkerts O, Delcher AL, Jaiswal P, Mockaitis K, Liston A, Mane SP, et al (2010) The genome of woodland strawberry (Fragaria vesca). Nature Genetics 43: 109–116 Singer T (2001) Robertson’s Mutator transposons in A. thaliana are regulated by the chromatinremodeling gene Decrease in DNA Methylation (DDM1). Genes & Development 15: 591–602 Singh R, Ming R, Yu Q (2013) Nucleotide Composition of the Nelumbo nucifera Genome. Tropical Plant Biology 6: 85–97 Slotkin RK, Freeling M, Lisch D (2003) Mu killer causes the heritable inactivation of the Mutator family of transposable elements in Zea mays. Genetics 165: 781–797 Slotkin RK, Freeling M, Lisch D (2005) Heritable transposon silencing initiated by a naturally occurring transposon inverted duplication. Nature Genetics 37: 641–644   181       Slotkin RK, Martienssen R (2007) Transposable elements and the epigenetic regulation of the genome. Nature Reviews Genetics 8: 272–285 Steinbiss S, Willhoeft U, Gremme G, Kurtz S (2009) Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Research 37: 7002–7013 Strommer JN, Hake S, Bennetzen J, Taylor WC, Freeling M (1982) Regulatory mutants of the maize Adh1 gene caused by DNA insertions. Nature 300: 542–544 Talbert LE, Chandler VL (1989) Characterization of a highly conserved sequence related to mutator transposable elements in maize. Mol Biol Evol 5: 519–529 Temin HM (1981) Structure, variation and synthesis of retrovirus long terminal repeat. Cell 27: 1–3 Tenaillon MI, Hollister JD, Gaut BS (2010) A triptych of the evolution of plant transposable elements. Trends in Plant Science 15: 471–478 Thomas CA (1971) The Genetic Organization of Chromosomes. Annual Review of Genetics 5: 237–256 Tsukahara S, Kawabe A, Kobayashi A, Ito T, Aizu T, Shin-i T, Toyoda A, Fujiyama A, Tarutani Y, Kakutani T (2012) Centromere-targeted de novo integrations of an LTR retrotransposon of Arabidopsis lyrata. Genes & Development 26: 705–713 Tsukahara S, Kobayashi A, Kawabe A, Mathieu O, Miura A, Kakutani T (2009) Bursts of retrotransposition reproduced in Arabidopsis. Nature 461: 423–426 Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, et al (2006) The Genome of Black Cottonwood, Populus trichocarpa (Torr. & Gray). Science 313: 1596–1604 Varshney RK, Chen W, Li Y, Bharti AK, Saxena RK, Schlueter JA, Donoghue MTA, Azam S, Fan G, Whaley AM, et al (2011) Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor farmers. Nature Biotechnology 30: 83–89 Velasco R, Zharkikh A, Affourtit J, Dhingra A, Cestaro A, Kalyanaraman A, Fontana P, Bhatnagar SK, Troggio M, Pruss D, et al (2010) The genome of the domesticated apple (Malus × domestica Borkh.). Nature Genetics 42: 833–839 Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Pruss D, Pindo M, FitzGerald LM, Vezzulli S, Reid J, et al (2007) A High Quality Draft Consensus Sequence of the Genome of a Heterozygous Grapevine Variety. PLoS ONE 2: e1326 Verde I, Abbott AG, Scalabrin S, Jung S, Shu S, Marroni F, Zhebentyayeva T, Dettori MT, Grimwood J, Cattonaro F, et al (2013) The high-quality draft genome of peach (Prunus   182       persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nature Genetics 45: 487–494 Vogel JP, Garvin DF, Mockler TC, Schmutz J, Rokhsar D, Bevan MW, Barry K, Lucas S, Harmon-Smith M, Lail K, et al (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463: 763–768 Volff J-N (2006) Turning junk into gold: domestication of transposable elements and the creation of new genes in eukaryotes. BioEssays 28: 913–922 Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun J-H, Bancroft I, Cheng F, et al (2011) The genome of the mesopolyploid crop species Brassica rapa. Nature Genetics 43: 1035–1039 Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, et al (2007) A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8: 973–982 Wikstrom N, Savolainen V, Chase MW (2001) Evolution of the angiosperms: calibrating the family tree. Proceedings of the Royal Society B: Biological Sciences 268: 2211–2220 Xu Q, Chen L-L, Ruan X, Chen D, Zhu A, Chen C, Bertrand D, Jiao W-B, Hao B-H, Lyon MP, et al (2012) The draft genome of sweet orange (Citrus sinensis). Nature Genetics 45: 59– 66 Xu X, Pan S, Cheng S, Zhang B, Mu D, Ni P, Zhang G, Yang S, Li R, Wang J, et al (2011) Genome sequence and analysis of the tuber crop potato. Nature 475: 189–195 Xu Z, Wang H (2007) LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35: W265–W268 Yamashita S, Takano-Shimizu T, Kitamura K, Mikami T, Kishima Y (1999) Resistance to gap repair of the transposon Tam3 in Antirrhinum majus: a role of the end regions. Genetics 153: 1899–1908 Yang G, Weil CF, Wessler SR (2006) A rice Tc1/mariner-like element transposes in yeast. Plant Cell 18: 2469–2478 Yin H, Liu J, Xu Y, Liu X, Zhang S, Ma J, Du J (2013) TARE1, a Mutated Copia-Like LTR Retrotransposon Followed by Recent Massive Amplification in Tomato. PLoS ONE 8: e68587 Young ND, Debellé F, Oldroyd GED, Geurts R, Cannon SB, Udvardi MK, Benedito VA, Mayer KFX, Gouzy J, Schoof H, et al (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature. doi: 10.1038/nature10625   183       Yu Z, Wright SI, Bureau TE (2000) Mutator-like elements in Arabidopsis thaliana. Structure, diversity and evolution. Genetics 156: 2019–2031 Zemach A, Kim MY, Hsieh P-H, Coleman-Derr D, Eshed-Williams L, Thao K, Harmer SL, Zilberman D (2013) The Arabidopsis nucleosome remodeler DDM1 allows DNA methyltransferases to access H1-containing heterochromatin. Cell 153: 193–205 Zhang G, Liu X, Quan Z, Cheng S, Xu X, Pan S, Xie M, Zeng P, Yue Z, Wang W, et al (2012) Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential. Nature Biotechnology 30: 549–554 Zhou L, Mitra R, Atkinson PW, Burgess Hickman A, Dyda F, Craig NL (2004) Transposition of hAT elements links transposable elements and V(D)J recombination. Nature 432: 995– 1001 Zhu Y (2003) From the Cover: Controlling integration specificity of a yeast retrotransposon. Proceedings of the National Academy of Sciences 100: 5891–5895 Zuccolo A, Sebastian A, Talag J, Yu Y, Kim H, Collura K, Kudrna D, Wing RA (2007) Transposable element distribution, abundance and role in genome size variation in the genus Oryza. BMC Evolutionary Biology 7: 152   184       CONCLUDING REMARKS Transposable elements (TE) are an integral part of eukaryotic genomes and in some organisms they occupy a substantial portion. TE classification can be complex as multiple levels of classes and subclasses exist. Based on the transposition intermediates used, TEs are categorized into two major groups (RNA and DNA TEs). In addition, they belong to different superfamilies distinguished by structural features, protein conservation and replication strategies, intrinsic to the group (Wicker et al., 2007). These features include the terminal sequences (TIR or LTR), target site duplication (TSD) and proteins required for mobilization. Further, within each superfamily are families that consist of phylogenetically related copies sharing high sequence similarity (Wicker et al., 2007). The activity of TE in a genome can be deleterious, neutral or beneficial. Thus, different silencing strategies exist to regulate its activity and many TEs in the host genome are currently inactive (Mills et al., 2007; Lisch, 2009). Regardless, numerous studies illustrate how TEs can introduce various forms of genetic variations with adaptive and evolutionary potentials (reviewed in Oliver and Greene, 2009; Oliver and Greene, 2009; Kejnovsky et al., 2012). The myriad of roles, ranging from structural to regulatory, played by transposable elements in a host makes their annotation an important component to every genome sequencing project. As more genomes with annotated repeat content are available, various types of comparative analysis may be performed to address important biological and evolutionary questions such as, time and mechanism of genome expansions and development of gene regulatory networks both of which can be influenced by TE activity. In this research, the repetitive content of a basal eudicot genome (sacred lotus) was annotated and characterized. Our results show that its genome shows typical similarities to most   185       previously sequenced angiosperms including the high TE contribution to the genome size. This makes the sacred lotus TE information useful not only in downstream gene annotation work but also in further understanding the role of TEs in gene and genome evolution. Being a perennial adaptive to aquatic environments, combined with the many beneficial adaptive roles that TEs play, it will not be surprising to find TEs involved in genome expansion, gene regulation and sequence exaptation that may have been critical to its growth habit. Moreover, results from sacred lotus may be used in comparative studies with other eudicots and monocots to identify the role TEs may play in the diversification within this lineage. Oliver et al. (2013) proposes that TE activity in concert with other factors including hybridization, polyploidization, horizontal transfer and stress may have promoted angiosperm diversity and dominance. Certain unique features of sacred lotus TE also may make it valuable in TE biology studies. For instance, sacred lotus possesses an unprecedented content of the hAT superfamily. Its hAT coverage is not only exceptional among DNA TE superfamilies but its success outperforms activity by all DNA TEs in many angiosperms sequenced to date. Since this study has shown that majority of the autonomous members identified in this superfamily are likely inactive and nonfunctional, a hAT transposase can be molecularly reconstituted similar to work with Sleeping Beauty (Ivics et al., 1997). This functional autonomous copy or transposase may then be used in transposition and silencing studies to determine how these aspects may have played a role in the success of this superfamily. This superfamily includes the Ac/Ds elements which were the first TEs discovered (McClintock, 1951) as well as the domesticated SLEEPER genes which are unique and conserved in angiosperms (Knip et al., 2012). Therefore, analysis of remnant hAT related sequences in the genome may be done to determine its role in gene domestication and gene regulation which might explain its retention despite experiencing silencing and inactivation.   186       Another successful TE group in sacred lotus are the non-canonical Copia-type LTR elements that have amplified despite the lack of the conserved “TG-CA” end sequence previously suggested as important for integration (Temin, 1981). These TEs are abundant in sacred lotus suggesting either a unique capacity for transposition and retention of these elements in sacred lotus or that most previous TE annotations may have missed these types of sequences due to their non-canonical end sequences which are not typically used in automated annotation programs. In either case, it will be necessary to elucidate the integration process for this specific group of Copia elements as well as the evolution of the catalytic domain in the integrase protein that accommodates for mutations in the end regions. Meanwhile, the Tc1/Mariner superfamily suffered the opposite outcome where it has been eliminated from the genome. The contrasting fates of these two superfamilies in scared lotus may reflect the effect of differential amplification, silencing and retention between different TE types. Since sacred lotus is the fourth genome reported to be devoid of these elements (Velasco et al., 2007; Banks et al., 2011; D’Hont et al., 2012) , it can be used in comparative studies to understand the apparent lack of success of Tc1/Mariner elements in these genomes which may be related to intense genome silencing and removal. A beneficial genetic variation that TEs provide to its host is the exaptation of coding sequences that can generate new adaptive functions. A few TE superfamilies reported to duplicate host genes are LINEs, Helitrons, CACTA and Pack-MULEs (Moran, 1999; Jiang et al., 2004; Morgante et al., 2005; Zabala, 2005). In rice, close to 3000 Pack-MULEs are found that transduplicated about 1500 parental genes. These elements show sequence conservation and regulation of expression of their parental genes (Hanada et al., 2009). This research shows that terminal and subterminal sequences may play a role in the sequence acquisition process. Also,   187       GC content and breadth of expression plays an additive role in the preferential acquisition by Pack-MULEs, with the GC effect being stronger. These results indicate that Pack-MULEs are “clever elements” in the genome that appear to distinguish and preferentially capture genes rather than other genomic sequences. This unique capability allows these elements increased chances of carrying functional sequences that may provide new genetic resources for the evolution of new genes or the modification of existing genes. In addition, this capability may have contributed to its survival and retention in the genome. However, these new data is just the beginning for understanding how Pack-MULEs form and the precise molecular mechanism of sequence capture remains to be elucidated. Nevertheless, the understanding of preferential acquisition is vital in manipulating future experiments to catch a Pack-MULE capture “in action”. The understanding of how the gene fragments are captured is also important in understanding its biology and will be useful in potential usage of the capture process as a useful mutational tool to generate novel coding sequences with new adaptive potentials. Furthermore, this study illustrates that certain MULE families are capable of duplicating Terminal Inverted Repeat (TIR) sequences to generate elements with tandem TIRs on both ends, although the mechanism for this duplication is still unknown. This atypical structure seems to play a role in enhanced transposition efficiency for these specific MULE families suggesting that TIR duplication can provide a benefit in terms of a TE’s successful colonization in a host. In vivo and in vitro studies to test transposition efficiency conferred by duplicated TIRs will be tested in the future. Taken together, this research provides novel and supporting information of how TEs play a role relevant to gene and genome evolution. The unique TE components in sacred lotus illustrate the need to investigate the reasons behind the differential success and failure of various   188       TE types in a host genome. While results from MULEs and Pack-MULEs analyses indicate the apparent extraordinary nature of TEs to duplicate host sequences and generate functional sequences.   189       REFERENCES   190       REFERENCES Banks JA, Nishiyama T, Hasebe M, Bowman JL, Gribskov M, dePamphilis C, Albert VA, Aono N, Aoyama T, Ambrose BA, et al (2011) The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of Vascular Plants. Science 332: 960–963 D’Hont A, Denoeud F, Aury J-M, Baurens F-C, Carreel F, Garsmeur O, Noel B, Bocs S, Droc G, Rouard M, et al (2012) The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488: 213–217 Hanada K, Vallejo V, Nobuta K, Slotkin RK, Lisch D, Meyers BC, Shiu S-H, Jiang N (2009) The Functional Role of Pack-MULEs in Rice Inferred from Purifying Selection and Expression Profile. THE PLANT CELL ONLINE 21: 25–38 Ivics Z, Hackett PB, Plasterk RH, Izsvák Z (1997) Molecular Reconstruction of Sleeping Beauty, a Tc1-like Transposon from Fish, and Its Transposition in Human Cells. Cell 91: 501– 510 Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature 431: 569–573 Kejnovsky E, Hawkins JS, Feschotte C (2012) Plant Transposable Elements: Biology and Evolution. In JF Wendel, J Greilhuber, J Dolezel, IJ Leitch, eds, Plant Genome Diversity Volume 1. Springer Vienna, Vienna, pp 17–34 Knip M, de Pater S, Hooykaas PJ (2012) The SLEEPER genes: a transposase-derived angiosperm-specific gene family. BMC Plant Biology 12: 192 Lisch D (2009) Epigenetic Regulation of Transposable Elements in Plants. Annual Review of Plant Biology 60: 43–66 McClintock B (1951) CHROMOSOME ORGANIZATION AND GENIC EXPRESSION. Cold Spring Harbor Symposia on Quantitative Biology 16: 13–47 Mills RE, Bennett EA, Iskow RC, Devine SE (2007) Which transposable elements are active in the human genome? Trends in Genetics 23: 183–191 Moran JV (1999) Exon Shuffling by L1 Retrotransposition. Science 283: 1530–1534 Morgante M, Brunner S, Pea G, Fengler K, Zuccolo A, Rafalski A (2005) Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nature Genetics 37: 997–1002 Oliver KR, Greene WK (2009) Transposable elements: powerful facilitators of evolution. BioEssays 31: 703–714   191       Oliver KR, McComb JA, Greene WK (2013) Transposable Elements: Powerful Contributors to Angiosperm Evolution and Diversity. Genome Biology and Evolution 5: 1886–1901 Temin HM (1981) Structure, variation and synthesis of retrovirus long terminal repeat. Cell 27: 1–3 Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Pruss D, Pindo M, FitzGerald LM, Vezzulli S, Reid J, et al (2007) A High Quality Draft Consensus Sequence of the Genome of a Heterozygous Grapevine Variety. PLoS ONE 2: e1326 Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, et al (2007) A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8: 973–982 Zabala G (2005) The wp Mutation of Glycine max Carries a Gene-Fragment-Rich Transposon of the CACTA Superfamily. THE PLANT CELL ONLINE 17: 2619–2632   192