PERVASIVE ALTERNATIVE RNA EDITING IN TRYPANOSOMA BRUCEI

By

Laura Elizabeth Kirby

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Microbiology and Molecular Genetics—Doctor of Philosophy

2019

ABSTRACT

PERVASIVE ALTERNATIVE RNA EDITING IN TRYPANOSOMA BRUCEI

By

Laura Elizabeth Kirby

Trypanosoma brucei is a single-celled eukaryote that utilizes a complex RNA editing system to render many of its mitochondrial genes translatable. Editing of these genes requires multiple small RNAs called guide RNAs (gRNAs) to direct the insertion and deletion of uridines. These gRNAs act sequentially, each generating the anchor binding site for the next gRNA. This sequential dependence should render the process quite fragile, and mutations in the gRNAs should not be tolerated. In the examination of the gRNA transcriptome of T. brucei, many gRNAs were identified that are capable of generating alternative mRNA sequences, and potentially disrupting the editing process. In this work, the effects of alternative editing are characterized. This analysis revealed the role of gRNAs in developmental regulation of gene expression, showing a correlation between the abundance of the initiating gRNAs across two different points in the life cycle of T. brucei and their expression. This study also revealed the existence of mitochondrial dual-coding genes, which provide protection for genetic material that is not under selection at all points of the life cycle of T. brucei. The examination of these dual-coding genes showed that RNA editing patterns can shift between cell lines and under different energetic conditions. Examining the gRNAs involved in these editing pathways revealed that editing tolerates a high number of mismatched base pairs, and that gRNA abundance is not a reliable predictor of editing preference.
Finally, a reexamination of the gRNA transcriptome revealed that many gRNAs are still unidentified and most likely are generating new alternatively edited sequences.

ACKNOWLEDGEMENTS

I would first like to acknowledge my family, without whose support I could not have completed the work I have done. I am forever grateful for their continuous encouragement and constant faith in me.

I gratefully acknowledge Dr. Donna Koslowsky, who has been an excellent mentor. She not only provided her extensive insights on the field of RNA editing, but also aided and advised me in my professional development. She has shaped me as a researcher and as a person, and I was very fortunate in being able to work with her.

I would also like to thank the small undergraduate army I have had the privilege to work with. They taught me a great deal about mentoring and leadership, and their many hours greatly aided me in the completion of this work.

For their assistance in my computational education, I would like to thank Dr. Yanni Sun and Dr. Arend Hintze. Without their assistance, my research would not have been possible.

For serving as my graduate committee, I would like to thank Dr. Shannon Manning, Dr. Charles Hoogstraten, Dr. Chris Adami and Dr. Yanni Sun. Their many contributions have shaped and refined my research.

I would also like to thank Dr. Cori Fata-Hartley, for serving as my teaching mentor and helping me design and conduct an education research project.

I would like to thank the friends I have gained during my time at Michigan State who have supported me and been wonderful colleagues: Alexis Weber, Sandy Olenic, Ahrom Kim, and Shreya Saha.

Finally, I would like to thank the Department of Microbiology and Molecular Genetics, the College of Natural Science, the Elenor L. Gilmore Endowment, the Frank Peabody Microbiology Student Research Fund, the Russell B. DuVall Endowment, the Berttina Wentworth Fellowship, and the Marvis A.
Richardson Endowed Fellowship for their support of my research.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: INTRODUCTION
    Kinetoplastids
    Kinetoplastid RNA Editing
    Trypanosoma brucei
    Trypanosoma vivax
    Trypanosoma cruzi
    Leishmania spp.
    Phytomonas spp.
    Procyclic gRNA Transcriptome
    Evolution and retention of RNA editing in kinetoplastids
    Dual-coding and dual-function genes
    Project Summary

CHAPTER 2: ANALYSIS OF THE TRYPANOSOMA BRUCEI EATRO 164 BLOODSTREAM GUIDE RNA TRANSCRIPTOME
    Abstract
    Author Summary
    Introduction
    Materials and Methods
    Results
    Discussion
    Accession Numbers
    Acknowledgments

CHAPTER 3: MITOCHONDRIAL DUAL-CODING GENES IN TRYPANOSOMA BRUCEI
    Abstract
    Author Summary
    Introduction
    Materials and Methods
    Results
    Discussion
    Acknowledgments

CHAPTER 4: ANALYSIS OF THREE PAN-EDITED MRNAS REVEALS DUAL-CODING GENES AND COMPLEX MULTIPATH EDITING
    Abstract
    Introduction
    Materials and Methods
    Results
    Discussion
    Acknowledgements

CHAPTER 5: CLUSTER CLASSIFICATION OF UNKNOWN GRNAS REVEALS THE ROBUSTNESS OF THE RNA EDITING SYSTEM
    Abstract
    Introduction
    Materials and Methods
    Results
    Discussion
    Acknowledgements

CHAPTER 6: SUMMARY AND DISCUSSION
    Introduction
    Summary of Chapter 2
    Summary of Chapter 3
    Summary of Chapter 4
    Summary of Chapter 5
    Genetic Integrity
    Developmental Regulation
    Protein Diversity
    Editing Efficiency
    Future Work
    Conclusion

APPENDICES
    APPENDIX A. Quantification of the number of identified bloodstream and procyclic gRNA transcripts that cover a respective nucleotide in the fully edited mRNA
    APPENDIX B. Alignment of the mitochondrial fully edited mRNAs and the most abundant gRNAs required for full coverage identified in the bloodstream and procyclic life cycle stages
    APPENDIX C. All gRNA major classes pulled for ATPase 6 in the EATRO 164 procyclic and bloodstream transcriptomes
    APPENDIX D. Identified CR3 mRNA and gRNA transcripts
    APPENDIX E. ND7 5'-most gRNA populations and the predicted mRNA sequences generated
    APPENDIX F. RPS12 5'-most gRNA populations and the predicted mRNA sequences generated
    APPENDIX G. Alignments of T. brucei and T. vivax edited mRNAs
    APPENDIX H. Alignments of protein sequences of pan-edited dual-coding genes in L. tarentolae, L. amazonensis, P. serpens, and Perkinsela CCAP1560/4 with T. brucei and T. vivax sequences
    APPENDIX I. RPS12 gRNA Alignments for TREU 667 SDM79 and EATRO 164 SDM79 cells, and all editing variants
    APPENDIX J. gRNAs identified to edit the RPS12 mRNAs found in both TREU 667 and EATRO 164 gRNA transcriptomes
    APPENDIX K. ND7 gRNA Alignments for TREU 667 SDM79 and EATRO 164 SDM79 cells, and all editing variants
    APPENDIX L. gRNAs identified to edit the ND7 5’ mRNAs found in both TREU 667 and EATRO 164 gRNA transcriptomes
    APPENDIX M. Predicted ND7 protein sequences
    APPENDIX N. CR3 gRNA Alignments for TREU 667 SDM79, and all editing variants
    APPENDIX O. CR3 gRNA Alignments for EATRO 164, and all editing variants
    APPENDIX P. gRNAs identified to edit the CR3 mRNAs found in both TREU 667 and EATRO 164 gRNA transcriptomes

REFERENCES

LIST OF TABLES

Table 1. Differences in mitochondrial transcript abundance, polyadenylation and the extent of RNA editing in two life cycle stages of T. brucei.
Table 2. Number of gRNA transcripts in procyclic and bloodstream major classes and ratio of procyclic transcripts to bloodstream transcripts for each gene.
Table 3. Summary of the gRNA data coverage for each gene.
Table 4. Most common gRNA transcription start sites in procyclic and bloodstream data.
Table 5. Identified gaps or weak overlaps (less than 6 nucleotides) between populations of gRNAs observed in both data sets.
Table 6. Summary of populations found in both data sets that have more reads in the bloodstream data set than in the procyclic data set.
Table 7. Editing efficiencies of RPS12, ND7, and CR3.
Table 8. Editing efficiency for each RPS12 gRNA population.
Table 9. Editing efficiency for each ND7 5’ domain gRNA population
Table 10. Editing efficiencies by block level of CR3
Table 11. Summary of ACORNS Results
Table 12. Cluster summary
Table 13. Cluster size summary
Table 14. gRNA population analyses for RPS12.
Table 15. gRNA population analysis for ND7 5’
Table 16. CR3 gRNA population analysis

LIST OF FIGURES

Figure 1. The abundance of the initiating gRNA of all edited mRNAs in each stage.
Figure 2. The frequency of nt variations versus nucleotide position in the gRNA.
Figure 3. Comparing the number of non-complementary nucleotides 5’ of the anchoring region or 3’ of the guiding region in procyclic and bloodstream gRNAs.
Figure 4. Length of gRNA complementarity to fully edited mRNAs for both bloodstream and procyclic gRNAs.
Figure 5. The percentage of different nucleotide overlaps found between adjacent gRNAs
Figure 6. Alignment of conventional ATPase 6 protein sequence to hypothetical proteins generated by the 11U alternative edited mRNA and the 4U alternatively edited mRNA.
Figure 7. Editing sites 420–489 of COIII aligned with the gRNAs identified for that region in the procyclic and bloodstream data sets.
Figure 8. Alternative editing of the 5' end of pan-edited genes results in access to different reading frames.
Figure 9. Positions of stop codons on all RFs of the edited genes in T. brucei.
Figure 10. Mutational frequencies in mitochondrially encoded genes categorized by effect on amino acid sequence.
Figure 11. Percent conservation of editing patterns between T. brucei and T. vivax.
Figure 12. Principal component analysis of frequency of amino acid mutation types and editing conservation between T. brucei and T. vivax pan-edited transcripts.
Figure 13. Amino acid sequences of ARFs of dual-coding genes.
Figure 14. Observed RPS12 editing pathways in the TREU 667 cell line and the EATRO 164 cell line grown in SDM79 and SDM80
Figure 15. Alignment of RPS12 proteins from T. brucei, T. vivax, Leishmania tarentolae, Leishmania donovani, and Leishmania amazonensis.
Figure 16. Regions with poor gRNA coverage and functionally conserved residues in RPS12
Figure 17. Observed ND7 5’ editing pathways in the TREU 667 cell line and the EATRO 164 cell line grown in SDM79 and SDM80
Figure 18. Regions with poor gRNA coverage and functionally conserved residues in ND7 5’
Figure 19. Observed CR3 editing pathways in the TREU 667 cell line
Figure 20. Four different 3’ end sequences found in the TREU 667 transcriptome for the CR3 transcript and CR3 protein sequences
Figure 21. Observed CR3 editing pathways in the EATRO 164 cell line grown in SDM79 and SDM80
Figure 22. Alignment of CR3 predicted protein variants from the EATRO 164 cell line.
Figure 23. Predicted secondary structures of most abundant CR3 predicted proteins
Figure 24. Frequencies of early total deletions of DNA encoded uridines in partially edited ND7 and RPS12 transcripts
Figure 25. Example clusters of related gRNAs generated by ACORNS from the EATRO 164 PC gRNA transcriptome
Figure 26. Observed RPS12 editing pathways in the TREU 667 cell line and the EATRO cell line
Figure 27. Analysis of functionality and abundance of productive gRNAs populations that edit RPS12 in TREU 667 cells
Figure 28. Analysis of functionality and abundance of productive gRNAs that edit RPS12 in EATRO 164 cells
Figure 29. Observed ND7 5’ editing pathways in TREU 667 cell line and the EATRO 164 cell line
Figure 30. Analysis of functionality and abundance of productive gRNAs that edit ND7 5’ in TREU 667 cells
Figure 31. Analysis of functionality and abundance of productive gRNAs that edit ND7 5’ in EATRO 164 cells
Figure 32. Observed CR3 editing pathways in TREU 667 and EATRO 164 cell lines
Figure 33. Analysis of functionality and abundance of gRNA subpopulations that edit CR3 in TREU 667 cells
Figure 34. Analysis of functionality and abundance of gRNA subpopulations that edit CR3 in EATRO 164 cells
CHAPTER 1: INTRODUCTION

Kinetoplastids

Trypanosoma brucei is a member of the Kinetoplastea, a group of protozoans characterized by a large network of DNA in their mitochondria known as the kinetoplast that is physically attached to the flagellum [1]. While not all kinetoplastids are parasites, the group encompasses some of the most successful parasites in existence, inhabiting an incredibly wide range of hosts from plants to invertebrates to vertebrates [2,3]. The dixenous members cycle between two distinct hosts and can encounter different environments with distinct metabolic constraints. These environmental shifts require rapid and extensive changes in gene expression. This is particularly interesting considering the kinetoplastids’ bizarre and complicated use of RNA editing for their mitochondrial gene expression.

Kinetoplastid RNA Editing

RNA editing is one of several unique genetic features found in the mitochondria of these parasites. RNA editing creates open reading frames in “cryptogenes” by insertion and deletion of uridylate residues at specific sites within the mRNA. The U-insertions/deletions are directed by small guide RNAs (gRNAs) and can repair frameshifts, generate start and stop codons and more than double the size of the transcript [4].

The kinetoplast DNA (kDNA) consists of two types of DNA molecules, maxicircles and minicircles. Maxicircles are large circular DNA molecules that contain the genes for two ribosomal RNAs, 12S and 9S, and the protein coding genes [5]. While some of the protein-coding genes do not require RNA editing prior to translation, most require extensive editing before they can be translated [6,7]. The sequence changes are guided by small complementary RNA molecules (the gRNAs) that are encoded on the minicircles [8]. Minicircles make up the bulk of the kinetoplastid network with each minicircle encoding 1–5 gRNAs.
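As an illustration of the bookkeeping implied by U-insertion/deletion editing, the following minimal Python sketch compares a pre-edited and an edited sequence. It assumes that editing changes only uridines, so that the two sequences share the same non-U backbone, and it reports the net number of uridines inserted and deleted summed over all sites. The function names and the toy sequences are hypothetical and are not taken from T. brucei.

```python
import re


def backbone(seq: str) -> str:
    """Return the sequence with every U removed (the non-U backbone)."""
    return seq.replace("U", "")


def u_runs(seq: str) -> list[int]:
    """Lengths of the U runs before each non-U base and after the last one."""
    return [len(run) for run in re.split(r"[^U]", seq)]


def count_u_edits(pre_edited: str, edited: str) -> tuple[int, int]:
    """Net uridines (inserted, deleted) needed to turn pre_edited into edited."""
    if backbone(pre_edited) != backbone(edited):
        raise ValueError("sequences differ at non-U positions")
    inserted = deleted = 0
    # Identical backbones guarantee the two run lists have the same length.
    for before, after in zip(u_runs(pre_edited), u_runs(edited)):
        if after > before:
            inserted += after - before
        else:
            deleted += before - after
    return inserted, deleted


# Toy sequences (hypothetical): one U is deleted and five are inserted.
print(count_u_edits("AUAGGACG", "AAUUGGUAUCUG"))  # (5, 1)
```

The same accounting can be applied to transcript lengths. Taking the A6 entry of Table 1 at face value (448 uridines added, 28 deleted, edited size 821 nt), and assuming the listed edited size covers only the edited transcript body, the pre-edited length would be 821 − 448 + 28 = 401 nt, so that editing roughly doubles the size of this transcript.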
With each minicircle encoding only a few gRNAs, the genetic information for the edited mitochondrial mRNAs is effectively dispersed between the mRNA cryptogenes on the maxicircles and as many as 10,000 gRNA-encoding minicircles. In T. brucei, the extensive editing of a single transcript can require more than 40 gRNAs and hundreds of editing events [9]. The gRNAs act as templates for the large multi-subunit protein complex known as the editosome [4,6]. The editosome cleaves the mRNA, inserts or deletes the correct number of uridines and then re-ligates the mRNA in an energy-intensive process. This cycle is repeated until the mRNA is complementary to the gRNA. The initiating gRNA interacts with the 3' end of the pre-edited transcript and generates the anchor binding region for the next gRNA. In fact, all subsequent gRNAs anchor to the edited sequence created by the preceding gRNA. Editing proceeds from the 3' end to the 5' end of the mRNA transcript, with the terminating gRNA either creating the start codon or bringing an existing start codon into frame. Because each gRNA directs editing that generates the anchor region for the next gRNA, the RNA editing process is sequentially dependent on correct editing by each gRNA. As a result, the process would be expected to be incredibly fragile.

Trypanosoma brucei

Trypanosoma brucei is the causative agent of Human African Trypanosomiasis (HAT) and one agent of Animal African Trypanosomiasis (AAT). Each year, 10,000 new cases of HAT are reported, and 3 million cattle are killed, severely impacting the lives and livelihood of those in infected areas [10,11]. The trypanosomes live in two distinct environments: the animal host and the insect vector, the tsetse fly. These environments differ in temperature and nutrient composition, providing a challenge to T. brucei as it cycles between hosts. While in the mammalian host, T. brucei lives entirely extracellularly in the bloodstream.
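The sequential dependence of gRNA-directed editing described in the Kinetoplastid RNA Editing section can be sketched as a toy simulation. In this minimal Python sketch, each guide is represented, as a simplifying assumption, by the mRNA-sense sequences it acts on (its anchor target, the pre-edited block immediately 5' of the anchor, and the edited block it produces) rather than by a complementary gRNA sequence. Base pairing, including any tolerated mismatches, is not modeled, and all names and sequences are hypothetical.

```python
def backbone(seq: str) -> str:
    """Return the sequence with every U removed (the non-U backbone)."""
    return seq.replace("U", "")


def apply_guides(mrna: str, guides: list[tuple[str, str, str]]) -> str:
    """Apply toy guides in order, from the 3' end towards the 5' end.

    Each guide is (anchor, pre_block, edited_block), written 5'->3' in
    mRNA sense, with the block lying immediately 5' of the anchor.
    """
    for anchor, pre_block, edited_block in guides:
        # U insertion/deletion may only change the number of uridines.
        if backbone(pre_block) != backbone(edited_block):
            raise ValueError("a guide may only insert or delete U")
        target = pre_block + anchor
        if target not in mrna:
            break  # the anchor is not available, so editing stalls here
        mrna = mrna.replace(target, edited_block + anchor, 1)
    return mrna


# Hypothetical pre-edited transcript and two guides. The anchor of
# GUIDE_2 is the block produced by GUIDE_1.
PRE_EDITED = "AUAGGACGCCAUGC"
GUIDE_1 = ("CCAUGC", "ACG", "UAUCUG")
GUIDE_2 = ("UAUCUG", "AUAGG", "AAUUGG")

fully_edited = apply_guides(PRE_EDITED, [GUIDE_1, GUIDE_2])
stalled = apply_guides(PRE_EDITED, [GUIDE_2])

print(fully_edited)           # AAUUGGUAUCUGCCAUGC
print(stalled == PRE_EDITED)  # True
```

Omitting the first guide leaves the transcript unedited, because the anchor for the second guide exists only after the first block has been edited.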
Within the mammalian host, the parasite is frequently subject to attacks by the host’s adaptive immune system, and the population evades these attacks through antigenic variation [12]. This part of the life cycle can be quite long, with the longest known infection lasting 29 years [13]. In the bloodstream, the bulk of the trypanosome population exists in the actively dividing slender form. The slender form is optimized to utilize its glucose-rich environment, using glycolysis to generate energy [14,15]. During this stage of the life cycle, the mitochondrion is down-regulated, lacking both Krebs cycle enzymes and a functional electron transport chain (ETC) [16]. While the activity of the mitochondrion is relatively low during the bloodstream stage (BS), expression of the mitochondrial genome is still essential [17,18]. Once the population reaches an optimum density, a small portion of the population transitions into stumpy form trypanosomes. The stumpy form is nondividing and appears to be transitional, activating mitochondrial genes in preparation for uptake in a blood meal by its tsetse fly vector and subsequent transfer to a harsher environment [14]. Successful transition to the fly vector requires activation of the ETC and ATP synthesis via oxidative phosphorylation. Once inside the tsetse fly, the parasite utilizes proline as its primary energy source while residing and actively dividing in the midgut [19–21]. This stage of the life cycle is followed by a dramatic bottleneck when the trypanosomes transition from the midgut to the salivary glands of the tsetse fly, with as few as 1–5 trypanosomes completing the transition [22,23]. From the salivary glands, trypanosomes are then refluxed into their next mammalian host during a blood meal. In order to adapt to these sudden changes in environment, T. brucei must vastly alter its gene expression, most notably in its mitochondrion.

The 22 kb maxicircle of T.
brucei encodes several genes involved in the mitochondrial ETC and oxidative phosphorylation: NADH dehydrogenase (ND) subunits 1-5 and 7-9, cytochrome oxidase (CO) subunits I-III, cytochrome b (CYb), and ATP synthase subunit 6 (A6). It also encodes the ribosomal protein small subunit 12 (RPS12), the 12S and 9S rRNAs, and some genes with unknown functions: C-rich regions (CR) 3 and 4, and Maxicircle unidentified reading frames (Murf) 2 and 5 [5]. Twelve of these genes require some amount of RNA editing to be translatable, with some requiring only one or two gRNAs (COII, CYb, Murf 2), and others requiring editing across the span of the transcript (ND3, ND7, ND8, ND9, COIII, A6, RPS12, CR3, and CR4) [4,6,7]. Distinct differences in mitochondrial transcript abundance, polyadenylation and the extent of RNA editing are observed during the complex life cycle (Table 1). The pattern of differential RNA editing observed is especially interesting. For example, the CYb and COII mRNAs are edited during the insect stage, but are primarily unedited in bloodstream forms [24,25]. In contrast, editing of the NADH dehydrogenase subunit transcripts (ND3, ND7, ND8 and ND9) and RPS12 appears to occur preferentially in bloodstream forms [5,26–30]. Other transcripts, COIII and A6, are edited equally in both life cycle stages [31,32].

Table 1. Differences in mitochondrial transcript abundance, polyadenylation and the extent of RNA editing in two life cycle stages of T. brucei.

Columns: Added and Deleted, number of uridines added and deleted; Size (nt), edited size; LS, SS and PC, relative level of mature RNA in long slender, short stumpy and procyclic forms; Classes, number of PC major classes(g).

Gene    Added        Deleted  Size (nt)  Stage edited      LS     SS       PC     Classes  References
12S     2-17 (tail)  0        1149       N.D.              0.04   1.3      1.0    N.A.(j)  [24,34,35]
9S      7 (tail)     0        611        N.D.              0.07   1.4      1.0    N.A.     [24,34,35]
CYb     34           0        1,151      P(a)              ~0     0.5      1.0    11       [24,25,36,37]
A6      448          28       821        P/BS(b)           1.0    N.D.(f)  1.0    81       [31,36]
COI     0            0        1,647      NE(c)             0.07   0.4      1.0    N.A.     [24,37,38]
COII    4            0        663        P                 ~0     0.5      1.0    N.A.(k)  [24,37,39]
COIII   547          41       969        P/BS              1.0    N.D.     1.0    151      [32]
ND1     0            0        960        NE                ~1     N.D.     ~1     N.A.     [38,40,41]
ND2     0            0        1,343      NE                >1.0   N.D.     1.0    N.A.     [37,42,43]
ND3     210          13       452        P/BS              >1.0   N.D.     1.0    34       [26]
ND4     0            0        1,314      NE                ~1.0   N.D.     ~1.0   N.A.     [37,38,44,45]
ND5     0            0        1,779      NE                0.5    0.8      1.0    N.A.     [24,38]
ND7     553          89       1,238      5’P/BS, 3’BS(d)   ~10    N.D.     1.0    129      [5,27]
ND8     259          46       574        BS(e)             ~20    N.D.     1.0    70       [5,28,37]
ND9     345          20       649        BS                >1.0   N.D.     1.0    39       [29]
RPS12   132          28       325        BS                >1.0   N.D.     1.0    50       [30]
Murf 2  26           4        1,111      P/BS              ~1     ~1       ~1     1        [40,46]
Murf 5  N.D.         N.D.     N.D.       N.D.              N.D.   N.D.     N.D.   N.A.     N.D.
CR3     148          13       299        BS                >1.0   N.D.     1.0    37       [47]
CR4     325          40       567        BS                1.0    N.D.     ~0     41       [48]

PolyA tail length (Bloodstream and Procyclic columns, in source order): N.A. N.A. N.A. N.A. Short (UE(h) & E(i)) Short (E) & Long (E) Short (E) & Long (E) Short (E) & Long (E) Short Short Short (E) Short & Long Short & Long Short (E) Short & Long N.D. Short(E) Short (E) & Long (E) Short (PE) & Long (E) Short & Long Short (UE) & Long (E) Short(E) & Long(E) Short & Long Short & Long Short (E) Short & Long N.D. Short(UE) Short (E) Short (PE) Short(UE & E) & Short(UE & E) & Long (E) Short & Long Long (E) Short & Long N.D. N.D. N.D. N.D. Short (E) & Long (E) Short (UE)

(a) P, transcript is edited only in the procyclic (insect) developmental stage.
(b) P/BS, transcript is edited in both bloodstream and procyclic stages.
(c) NE, never edited, editing of these transcripts has not been reported.
(d) The ND7 transcript is differentially edited in the procyclic and bloodstream stages.
(e) BS, these transcripts are only fully edited in the bloodstream developmental stage [49].
(f) N.D., values have not yet been determined.
(g) All data come from the previously published EATRO 164 procyclic gRNA transcriptome [9].
(h) UE, unedited, the transcripts which carried these tails were typically unedited.
(i) E, edited, the transcripts carrying these tails were typically edited.
(j) N.A., not applicable.
(k) COII is a cis-edited transcript.
Poly-A tails listed as short are between 10 and 50 nts long and tails listed as long are between 150 and 200 nts long.

Aside from the genes encoded on the maxicircle, there are the minicircle gRNAs. Each network contains 5,000–10,000 minicircles of ~1 kb in size, with each minicircle encoding 2–5 gRNAs. In T. brucei, there are more than 200 different minicircle sequence classes (~1200 gRNAs) [8,33]. While the minicircles make up the bulk of the mitochondrial DNA, early studies using both Northern blot and primer extension analyses on a limited number of gRNAs indicate that gRNAs are present in both insect and bloodstream forms, suggesting that the regulation of RNA editing is not at the level of gRNA availability [28,50,51].

Trypanosoma vivax

Like T. brucei, Trypanosoma vivax is a causative agent of AAT. Its kinetoplast DNA is also very similar. The maxicircles in T. vivax possess the same genes as the maxicircle of T. brucei, and the genes that are edited in T. brucei are also edited in T. vivax to the same extent [52]. The minicircles of T. vivax vary significantly in size, from 300–1,100 bp, encoding 1–3 gRNAs [52]. The life cycle of T. vivax is highly similar to that of T. brucei. In its mammalian host, it lives extracellularly in the bloodstream, primarily metabolizing glucose. When it is taken up by the tsetse fly in a blood meal, it initially resides in the proventriculus and foregut. From there, cells migrate to the proboscis and propagate, preparing to be deposited into the next mammalian host [53].

Trypanosoma cruzi

T. cruzi is known for causing Chagas disease in South and Central America and is carried between hosts by triatomine bugs. Its kDNA is highly similar to that of T. brucei and T. vivax, with the maxicircle containing the same genes in the same order. The edited genes of T. brucei and T. vivax are also edited in T. cruzi to the same extent [54]. Like T.
brucei, T. cruzi spends most of its insect stage in the nutrient-depleted midgut of its host, metabolizing amino acids for survival [55,56]. Once cells have propagated, they migrate to the hindgut and are excreted by the insect. Transmission to the mammalian host occurs by contact with a wound or mucous membrane. Once inside the mammalian host, T. cruzi invades many different types of nucleated cells by using the microtubule cytoskeleton of the host cell to recruit lysosomes to create vacuolar compartments where T. cruzi resides [57]. Once inside the vacuole, T. cruzi metabolizes glucose as its primary energy source [58,59].

Leishmania spp.

Leishmania spp. are found on almost every continent in the world, and infect 700,000 to 1.2 million people annually [60]. Leishmania spp. are transmitted by the phlebotomine sand fly. Unlike the trypanosomes, while in the midgut of the sand fly, Leishmania spp. primarily metabolize glucose, because of the fly’s frequent sap meals. Leishmania spp. are transmitted to their mammalian hosts by the bite of the sand fly. Once inside the host, they are phagocytosed by host cells. Inside macrophages, they replicate within lysosome-like compartments, and it is believed that these compartments are not nutrient restrictive [61,62]. The maxicircle of the Leishmania spp. parasites has the same genes as the trypanosomes, but their editing patterns vary significantly. The pan-edited genes in Leishmania spp. are ND3, ND8, ND9, RPS12, CR3, and CR4, and the partially edited genes are A6, COII, COIII, MurfII, and ND7 [63].

Phytomonas spp.

Phytomonas spp. parasitize plants, utilizing their sucrose and polysaccharides as energy sources. Their insect vectors maintain a highly sap-rich diet that allows the parasites to continually metabolize carbohydrates, unlike the Trypanosoma spp. Possibly as a result of this, several metabolic pathways are incomplete.
The pathways for beta oxidation of fatty acids and oxidation of amino acids are missing key enzymes, but the pathways for the synthesis of these metabolites are more complete [64]. The ETC of the mitochondria is also affected. Genes for all cytochromes are missing from nuclear and kDNA, and the cytochrome oxidase genes normally present on the maxicircle are also missing [64,65]. The other maxicircle genes are present, and ND3, ND8, ND9, RPS12, CR3 and CR4 are pan-edited, while ND7, A6 and MurfII are only partially edited, as with Leishmania spp.

Procyclic gRNA Transcriptome

The gRNA transcriptome of insect stage (procyclic) T. brucei was previously sequenced [50]. This library was generated from the EATRO 164 cell line, grown in SDM79 medium, the most commonly used medium when culturing procyclic trypanosomes. As no reference genome exists for the minicircles, gRNAs could only be identified based on their function. Using a longest common substring algorithm, gRNAs were identified based on their complementarity to previously determined fully edited mRNA sequences. The RNA editing system tolerates G:U base pairs, so this program allows these base pairs in alignments, but they contribute less to the overall alignment score than canonical Watson-Crick base pairs (1 point for each G:U pair and 2 points for each Watson-Crick pair). Guide RNAs with scores higher than 45 points were identified as editing gRNAs. Because trypanosomes post-transcriptionally add a poly-uridine (poly-U) tail to gRNAs, the sequences generated in this transcriptome possess the poly-U tail as well [66]. This program ignores the transcript’s poly-U tail in the alignment, and it does not contribute to the score. Using this program, full complements of gRNAs were found for A6, COIII, CR4, CYb, and RPS12, and near full complements were identified for the other edited genes.
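The scoring scheme described above can be illustrated with a minimal Python sketch. It is not the program used in the study: the function names, the `min_tail` cutoff for recognizing a poly-U tail, and the rule that any non-pairing position ends a match are assumptions made for illustration. Only the 2-point and 1-point pair scores and the 45-point threshold come from the text.

```python
"""Illustrative sketch: gapless gRNA/mRNA matching with G:U-aware scoring."""

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}
MIN_SCORE = 45  # threshold quoted in the text; a gRNA is kept if score > MIN_SCORE


def pair_score(g_base, m_base):
    """Return 2 for a Watson-Crick pair, 1 for a G:U pair, 0 otherwise."""
    if (g_base, m_base) in WATSON_CRICK:
        return 2
    if (g_base, m_base) in WOBBLE:
        return 1
    return 0


def strip_poly_u(grna, min_tail=4):
    """Drop a 3' run of U of at least min_tail nt (hypothetical cutoff)."""
    body = grna.rstrip("U")
    return body if len(grna) - len(body) >= min_tail else grna


def best_gapless_match(grna, mrna):
    """Return (score, mrna_start, length) of the best uninterrupted duplex.

    Both sequences are written 5'->3'. The guide is reversed so that it can
    be compared position by position with the antiparallel mRNA strand.
    """
    guide = strip_poly_u(grna.upper().replace("T", "U"))[::-1]
    target = mrna.upper().replace("T", "U")
    best = (0, -1, 0)
    # prev[j] holds (score, length) of the paired run ending at
    # guide[i-2] / target[j-1], as in longest-common-substring DP.
    prev = [(0, 0)] * (len(target) + 1)
    for i in range(1, len(guide) + 1):
        cur = [(0, 0)] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            s = pair_score(guide[i - 1], target[j - 1])
            if s:
                score = prev[j - 1][0] + s
                length = prev[j - 1][1] + 1
                cur[j] = (score, length)
                if score > best[0]:
                    best = (score, j - length, length)
        prev = cur
    return best
```

For a real read, the returned score would be compared with `MIN_SCORE`; the toy sequences in a quick check are far too short to reach it, so only the relative scores are meaningful there.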
This study found that multiple different sequence classes of gRNAs (major classes) edited the same region of an mRNA (this group of gRNAs is called a population). Major classes within a population had many transition mutations, primarily A-G mutations, which appeared to be due to the editing system’s toleration of G-U base pairs. Interestingly, populations of gRNAs varied widely in transcript abundance, from <10 to >350,000 reads. The gRNAs identified in this study possessed common characteristics. Of the transcripts, 64% had 38-48 nucleotides (nt) of complementarity to their target mRNA, 84% had 6 or fewer non-complementary nts at the 5’ end, and most had 0 non-complementary nts at the 3’ end prior to the poly-U tail. Conservation was observed in the gRNA transcription start site, with 74% of transcripts starting with 5’-ATATA-3’. Interestingly, a large proportion of transcripts had 5’-AAAAA-3’ transcription start sites as well. Beyond the identification of gRNAs directing the conventional mRNA edits, this study identified a number of gRNAs that could generate alternative edits. Most of these edits caused minor changes to the predicted mRNA and protein sequences, by either changing a single amino acid (ND8) or changing no amino acids at all (A6). However, some gRNAs were identified that dramatically altered the mRNA or protein sequence. One edit in the essential A6 gene caused a frameshift that would alter and shorten the C-terminus of the protein [17]. Another generated a dramatically different sequence at the 3’ end of CR3, and no gRNA was identified capable of editing that sequence. Interestingly, another study identified an alternative edit in COIII that linked an open reading frame in the unedited 5’ end of the transcript with the reading frame in the edited 3’ end of the transcript [67,68]. The gRNA required to generate this alternative edit was not identified in the procyclic gRNA transcriptome of T. brucei.
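The link between G:U tolerance and the transition mutations seen among major classes can be made concrete with a small Python sketch. It enumerates which single-base changes in a gRNA leave a position paired with the same mRNA base when G:U pairs are allowed. The names `PAIRS`, `partners` and `silent_substitutions` are hypothetical and used only for illustration.

```python
# Base pairs accepted by the editing system: Watson-Crick plus G:U.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
BASES = "ACGU"


def partners(mrna_base):
    """Return the gRNA bases able to pair with the given mRNA base."""
    return {g for g in BASES if (g, mrna_base) in PAIRS}


def silent_substitutions():
    """List (old, new, mrna_base) gRNA changes that keep the position paired."""
    out = []
    for m in BASES:
        ok = sorted(partners(m))
        out += [(a, b, m) for a in ok for b in ok if a != b]
    return out
```

Under these pairing rules the only changes that preserve pairing are A to G or G to A opposite a U, and C to U or U to C opposite a G. All four are transitions, which is consistent with the pattern described above.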
Evolution and retention of RNA editing in kinetoplastids

The kinetoplastid RNA editing system is energetically expensive and, due to the system’s sequential dependence, should be highly fragile. This means that even with high accuracy rates for each gRNA, the overall fidelity of the process is astonishingly low. Even a single point mutation could drastically change the editing pattern and stop the editing process, aborting expression of the protein. A major question in the field has been why this fragile and metabolically expensive system of RNA editing would evolve and persist. Initially, it was proposed that U-insertion/deletion editing (kRNA editing) was one of many RNA editing processes that were in fact relics of the RNA world. However, the very different mechanisms of the RNA editing systems in existence and their very limited distribution within specific groups of organisms indicate that they are more likely derived traits that arose later in evolution [69,70]. The sheer complexity of the kRNA editing process, with no obvious selective advantage, led to the proposal that insertion/deletion editing arose via a constructive neutral evolution (CNE) pathway [71]. RNA editing in trypanosomes is frequently cited in support of CNE as an example of how seemingly non-advantageous, complex processes can arise [72,73]. More recently, however, it has been hypothesized that RNA editing co-evolved with G-quadruplex structures found in the pre-edited mRNAs [74]. These structures can help regulate transcription in order to promote DNA replication and prevent kDNA loss, and thus provide an advantage to the organism. However, they must be removed by the RNA editing system prior to translation [74]. Another prominent hypothesis is that RNA editing is advantageous because it is a mechanism by which an organism can fragment and scatter essential genetic information throughout a genome [75,76].
Kinetoplast DNA is far less stable than chromosomal DNA, and loss of minicircles due to asymmetric division of the kDNA network has been frequently observed, particularly in laboratory cultures of Leishmania tarentolae [77,78]. Buhrman et al. [76] suggest that the scattering of essential gRNA genes throughout the DNA network would prevent fast growing deletion mutants from outcompeting more metabolically versatile parasites during growth in the mammalian host. Using a mathematical model of gene fragmentation in changing environments (absence of functional selection), they showed a distinct advantage for gene fragmentation. In their model, the number of tolerable generations under periods of relaxed selective pressure was increased by more than 40% before loss of the ability to move to the next life cycle stage. One mechanism for protecting small asexual populations is to increase the severity of the mutations that can occur. If mutations severely impact fitness, deleterious mutations are selected out, preventing their fixation [79]. This phenomenon increases the ‘drift robustness’ of a population. One study modeled the acquisition of drift robustness mathematically and computationally [80]. This study showed that in a simulated environment, small populations evolved a lower fitness than large populations, but when the most common genotypes from these populations were placed in a scenario with extremely high genetic drift, the genotypes evolved from small populations experienced a smaller decline in fitness than the genotypes evolved in a large population. Furthermore, they examined the types of mutations in the fitness landscape nearest to the peaks that each simulated population had fixed on and found that in smaller populations, there was an excess of mutations possible that were neutral, beneficial or strongly deleterious, whereas in larger populations, there were more small-effect deleterious mutations possible.
As the RNA editing process may be operating as a proofreading system to weed out mutations by making them lethal, these findings strongly support the hypothesis that RNA editing is beneficial to the trypanosomes by providing a level of drift robustness to the population as a whole. While these hypotheses do address the evolution and retention of the RNA editing system itself, they do not address another key issue with the system as a whole: maintaining genetic material used in RNA editing while that material is not under selection. During the life cycle of T. brucei, trypanosomes undergo a severe bottleneck as they transition through the tsetse fly and into the mammalian host, and then within the bloodstream, they undergo multiple bottlenecks at each antigenic switch, as they evade the host immune system [22]. Such bottlenecks create additional forces of genetic drift, where genes can be lost even if their deleterious fitness effect is considerable. This life cycle should make T. brucei particularly sensitive to genetic drift, especially for those genes which are not under selection (Krebs cycle and ETC) and should make them extremely vulnerable to Muller’s ratchet (the gradual increase of mutational load that eventually leads to extinction) [81–84]. During a reexamination of the EATRO 164 procyclic gRNA transcriptome, a number of gRNAs were identified capable of shifting the open reading frame of their respective transcripts. These gRNAs acted at the 5’ end of edited mRNAs and either shifted the position of an existing start codon or generated a new start codon that would allow that transcript to be translated in an alternative reading frame. Surprisingly, the alternative reading frames spanned the full or nearly full length of their transcripts, suggesting that these transcripts were capable of generating two distinctly different protein products.
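The idea that an alternative reading frame can span a transcript can be checked, in principle, by scanning each forward frame for its longest stretch free of stop codons. The Python sketch below is illustrative only and is not the analysis used in this work. It assumes that only UAA and UAG act as stops, with UGA read as tryptophan as in the protozoan mitochondrial genetic code, and the function names and toy sequences are hypothetical.

```python
# Assumption: UGA encodes Trp in the mitochondrial code, so only UAA and UAG stop.
STOP_CODONS = {"UAA", "UAG"}


def longest_open_stretch(mrna, frame):
    """Return (start, end) of the longest stop-free codon run in one frame.

    Coordinates are 0-based nucleotide positions; end is exclusive.
    """
    seq = mrna.upper().replace("T", "U")
    best = (frame, frame)
    run_start = frame
    pos = frame
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            if pos - run_start > best[1] - best[0]:
                best = (run_start, pos)
            run_start = pos + 3  # next run begins after the stop codon
        pos += 3
    if pos - run_start > best[1] - best[0]:
        best = (run_start, pos)
    return best


def frame_report(mrna):
    """Map each forward frame (0, 1, 2) to its longest open stretch."""
    return {frame: longest_open_stretch(mrna, frame) for frame in range(3)}
```

In the toy sequence `AUGGCUAAGCUU`, frame 0 is open along its whole length, whereas inserting a single U after the AUG (`AUGUGCUAAGCUU`) brings a UAA into frame 0. This mirrors, in miniature, how a single editing event at the 5’ end can change which frame is read.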
Based on these observations, we hypothesize that trypanosomes use dual-coding genes to protect genetic information by essentially hiding a gene not under selection (i.e. ETC genes) within one that remains under selection. Thus, the ability to access overlapping reading frames may be one explanation for how genetic material that is unused in one life cycle stage may be preserved while it is not under selection.

Dual-coding and dual-function genes

Dual-coding genes are defined as a stretch of DNA containing overlapping open reading frames (ORFs) [85,86]. Overlapping reading frames are common in viruses and are thought to persist due to strong genome size constraints [87,88]. More recently, however, overlapping genes have been identified in mammalian and bacterial genomes [89–92]. In these organisms, size is not an issue and the potential advantage of overlapping genes is less clear. Maintaining dual-coding genes is costly, as it constrains the flexibility of the amino acid composition of both proteins, constraining the ability of each protein to become optimally adapted [93]. As this constraint can be alleviated by gene duplication, it is thought that dual-coding regions can survive long evolutionary spans only if the overlap provides a selective advantage. In mammals, many of the identified dual-coding genes produce two proteins that bind and regulate each other [94,95]. For these proteins, dual-coding may be advantageous for the tight co-expression needed. An alternative model suggests that under high mutation rates, the overlapping of critical nucleotide residues is advantageous because it may reduce the target size for lethal mutations [96].

The use of genetic information with more than one function is not a new idea in T. brucei. The nuclear encoded α-ketoglutarate dehydrogenase E2 (α-KDE2) is known to be a dual-function protein, in that it plays important roles in both the Krebs cycle and in mitochondrial DNA inheritance [97].
RNAi knockdowns of this gene in bloodstream form (BF) trypanosomes also show a pronounced reduction in cell growth. Similarly, the Krebs cycle enzyme α-ketoglutarate decarboxylase (α-KDE1) is a dual-function protein with overlapping targeting signals that allow it to be localized to both the mitochondrion and glycosomes [98]. RNAi knockdown of α-KDE1 in BF trypanosomes is lethal, suggesting that, in addition to its enzymatic role in the Krebs cycle, it plays an essential role in glycosomal function in T. brucei [98]. Another example of this was identified in the RNA editing system. Alternative editing of COIII is reported to generate a novel DNA-binding protein, Alternatively Edited Protein-1 (AEP-1), that functions in mitochondrial DNA maintenance [67,68]. In this transcript, one alternative gRNA generates sequence changes at two sites that link an open reading frame (ORF) found in the pre-edited 5' end to the 3' transmembrane domains found in the COIII edited ORF. This was the first indication that one cryptogene could contain information for more than one protein. It has been previously suggested that both alternative editing and dual-function proteins are important mechanisms for expanding the functional diversity of proteins found in trypanosomes [67,97–99]. We hypothesize that because trypanosomes live exclusively extracellularly in their mammalian host, they are more sensitive to genetic drift, and an equally important role for these dual-coding/function genes may be the protection of genetic information.

Project Summary

The goal of this work is to examine the impact of alternative editing on the protein diversity, editing efficiency, developmental regulation, and genetic integrity of Trypanosoma brucei. This investigation began with the generation of the gRNA transcriptome of bloodstream form EATRO 164 T. brucei. This analysis identified near full complements of gRNAs for the edited genes, as was discovered in the procyclic transcriptome.
A detailed comparison of the gRNAs identified in both datasets revealed conserved characteristics, such as anchor length, length of complementarity, and transcription start site sequences, even though very few identical sequences existed between the two transcriptomes. Additionally, an interesting correlation was found that suggests a relationship between the relative abundance of initiating gRNAs between stages and the developmental pattern of mRNA editing. During this comparison of the two transcriptomes, a number of alternative editing gRNAs were identified. Notably, three of these gRNAs were capable of shifting translation of ND7, RPS12, and CR3 into alternative reading frames. This discovery prompted the analysis of the mitochondrially encoded transcripts to determine which of the genes had the capacity to be dual-coding. Using mutational bias analysis, we show that as many as six cryptogenes, in addition to the previously discovered COIII/AEP-1, encode more than one protein, and that RNA editing allows access to both reading frames. In order to determine if mRNA transcripts with access to multiple open reading frames exist within the mitochondrial transcriptome, we deep sequenced the transcript populations of three putative dual-coding genes: RPS12, the 5’ editing domain of ND7 (ND7 5’), and CR3. Using the previously generated gRNA transcriptomes, we constructed detailed editing pathways for each of these genes. We found evidence that CR3 and ND7 5’ are dual-coding genes, based on the identification of transcripts that would translate into different reading frames. This study indicates that RNA editing can be used to access multiple open reading frames using two different methods: in ND7 5’, different gRNAs bring alternate start codons into frame, and in CR3, different gRNAs can shift the reading frame of the existing start codon. In addition, CR3 showed incredible editing diversity.
In two different cell lines, highly divergent editing patterns were characterized, with the two cell lines using different sets of gRNAs to edit the CR3 cryptogene. This suggests that the use of a gRNA-guided editing system can also dramatically increase protein diversity in spite of an incredibly rigid and mutationally fragile system. With a more complete understanding of the existing edits found in the mRNA transcriptomes of RPS12, ND7 5’ and CR3, we used this knowledge to analyze the RNA editing system’s ability to tolerate noise. Reexamining the procyclic gRNA transcriptome revealed many previously unidentified gRNAs, potentially capable of generating alternative edits or disrupting the editing system. Using a new program called ACORNS (Assemble Clusters Of Related Nucleotide Sequence), the gRNAs were grouped into clusters based on sequence homology. This allowed us to determine which unidentified gRNAs were related to previously identified gRNAs. This analysis showed that more than half of the unidentified gRNAs were not related to any gRNA of known function, suggesting that many more alternative edits are waiting to be discovered. In order to analyze the impact of the gRNAs that were related to previously identified gRNAs, another new program, GUIDE (gRNA Uridine Insertion/Deletion Editor), was created. This program is able to analyze the functionality of gRNA clusters generated by ACORNS by simulating the editing process. Combining these data with the mRNA transcriptomes previously generated, we conducted a detailed analysis of each population of gRNAs capable of editing these three genes. This analysis revealed a surprisingly high tolerance for mismatches and gaps in mRNA/gRNA alignments in the editing system, most notably in the editing of the essential RPS12 [100]. This project found that not only is alternative editing present in T. brucei, but that it is pervasive, and the system, as a whole, is surprisingly robust.
We propose the hypothesis that the RNA editing system does in fact promote the genetic robustness of T. brucei through the facilitation of dual-coding genes, as well as the introduction of alternative edits that increase protein diversity and allow the editing system to continue to evolve.

CHAPTER 2: ANALYSIS OF THE TRYPANOSOMA BRUCEI EATRO 164 BLOODSTREAM GUIDE RNA TRANSCRIPTOME

Abstract

The mitochondrial genome of Trypanosoma brucei contains many cryptogenes that must be extensively edited following transcription. The RNA editing process is directed by guide RNAs (gRNAs) that encode the information for the specific insertion and deletion of uridylates required to generate translatable mRNAs. We have deep sequenced the gRNA transcriptome from the bloodstream form of the EATRO 164 cell line. Using conventionally accepted fully edited mRNA sequences, ~1 million gRNAs were identified. In contrast, over 3 million reads were identified in our insect stage gRNA transcriptome. A comparison of the two life cycle transcriptomes shows an overall ratio of procyclic to bloodstream gRNA reads of 3.5:1. This ratio varies significantly by gene and by gRNA populations within genes. The variation in the abundance of the initiating gRNAs for each gene, however, displays a trend that correlates with the developmental pattern of edited gene expression. A comparison of related major classes from each transcriptome revealed a median value of ten single nucleotide variations per gRNA. Nucleotide variations were much less likely to occur in the consecutive Watson-Crick anchor region, indicating a very strong bias against G:U base pairs in this region. This work indicates that gRNAs are expressed during both life cycle stages, and that differential editing patterns observed for the different mitochondrial mRNA transcripts are not due to the presence or absence of gRNAs. However, the abundance of certain gRNAs may be important in the developmental regulation of RNA editing.
Author Summary

Trypanosoma brucei is the causative agent of African sleeping sickness, a disease that threatens millions of people in sub-Saharan Africa. During its life cycle, Trypanosoma brucei lives in either its mammalian host or its insect vector. These environments are very different, and the transition between these environments is accompanied by changes in parasite energy metabolism, including distinct changes in mitochondrial gene expression. In trypanosomes, mitochondrial gene expression involves a unique RNA editing process, where U-residues are inserted or deleted to generate the mRNA’s protein code. The editing process is directed by a set of small RNAs called guide RNAs. Our lab has previously deep sequenced the gRNA transcriptome of the insect stage of T. brucei. In this paper, we present the gRNA transcriptome of the bloodstream stage. Our comparison of these two transcriptomes indicates that most gRNAs are present in both life cycle stages, even though utilization of the gRNAs differs greatly during the two life-cycle stages. These data provide unique insight into how RNA systems may allow for rapid adaptation to different environments and energy utilization requirements.

Introduction

The life cycle of Trypanosoma brucei involves two distinct environments, the animal host and the insect vector. These environments are distinct in temperature and nutrient composition, providing a unique challenge to T. brucei as it cycles between hosts. In the bloodstream, trypanosomes exist in two forms, the actively dividing slender form and the non-dividing stumpy form. The slender form is optimized to utilize its glucose rich environment, using glycolysis to generate energy [14]. The stumpy form appears to be transitional, activating mitochondrial genes in preparation for uptake in a blood meal by its tsetse fly vector and subsequent transfer to a harsher environment [14].
Once inside the tsetse fly, the parasite utilizes proline to drive oxidative phosphorylation and ATP production in the mitochondrion [19]. While the activity of the mitochondrion is relatively low during the bloodstream stage (BS), expression of the mitochondrial genome is still essential [17,18]. In T. brucei, the mitochondrial genome consists of two types of DNA molecules, maxicircles and minicircles. Maxicircles are 22 kb circular DNA molecules that contain the genes for two ribosomal RNAs, 12S and 9S, and eighteen mRNA genes [5]. While some of the protein-coding genes do not require RNA editing prior to translation, most require extensive editing before they can be translated [6,7]. This process involves the insertion of hundreds of uridylates (Us) and, less frequently, the deletion of Us, often doubling the size of the transcript. The sequence changes are guided by small complementary RNA molecules (the guide RNAs) that are encoded on the minicircles [8]. Minicircles make up the bulk of the kinetoplastid network (anywhere from 5,000–10,000 are present in each network), with each minicircle encoding 3–5 gRNAs. In T. brucei, there are more than 200 different minicircle sequence classes (~1200 gRNAs) [8]. Distinct differences in mitochondrial transcript abundance, polyadenylation and the extent of RNA editing are observed during the complex life cycle (Table 1). The pattern of differential RNA editing observed is especially interesting. For example, the cytochrome b (CYb) and cytochrome oxidase II (COII) mRNAs are edited during the insect stage, but are primarily unedited in bloodstream forms [24,25]. In contrast, editing of the NADH dehydrogenase subunit transcripts (ND3, ND7, ND8 and ND9) and editing of the ribosomal protein subunit 12 transcript (RPS12) appear to occur preferentially in bloodstream forms [5,26–30]. Other transcripts, cytochrome oxidase III (COIII) and ATPase subunit 6 (A6), are edited in both life cycle stages [31,32].
Early studies using both Northern blot and primer extension analyses on a limited number of gRNAs indicate that gRNAs are present in both insect and bloodstream forms, suggesting that the regulation of RNA editing is not at the level of gRNA availability [28,50,51]. Our lab has previously published deep sequencing results of the gRNA transcriptome of the T. brucei EATRO 164 procyclic form [9]. Here we present the deep sequencing data for the gRNA transcriptome of a bloodstream form of EATRO 164. A total of 211 populations of gRNAs were identified. We define a population as a group of gRNAs that may vary in sequence, but direct the editing of the same or nearly the same region of the mRNA. Because kinetoplastid RNA editing allows G:U base pairing, most populations contain multiple sequence classes that can guide the generation of the same mRNA sequence. While the number of populations identified was similar to the number identified in the procyclic gRNA transcriptome (214 populations), the total number of gRNAs identified was much reduced and the coverage was less complete; full complements of gRNAs were only identified for COIII and CYb. In spite of the reduced number of gRNAs, an interesting correlation was found that suggests a relationship between the relative abundance of initiating gRNAs between stages and the developmental pattern of mRNA editing.

Materials and Methods

Parasites, isolation of mitochondria and RNA extraction

T. brucei brucei clone IsTar from stock EATRO 164 was grown in rats and isolated as previously described [101]. Bloodstream forms were virtually all long-slender forms isolated after 4 days of infection. Parasites were used immediately for isolation of mitochondria using differential centrifugation as previously described or stored frozen at -80°C until RNA extraction [9]. Both total RNA from whole parasites and mitochondrial RNA (mtRNA) from purified mitochondria were isolated by the acid guanidinium-phenol-chloroform method [102].
Ethics statement

Rats were raised according to the animal husbandry guidelines established by Michigan State University. All vertebrate animal use procedures were approved by MSU’s Institutional Animal Care and Use Committee (Application 03/11-051-00). MSU has filed with the Office of Laboratory Animal Welfare (OLAW) an assurance document that commits the university to compliance with NIH policy and the Guide for the Care and Use of Laboratory Animals.

Library preparation and Illumina sequencing

Samples of mtRNA and total RNA were both treated with DNase RQ1 and size-fractionated on a polyacrylamide gel as previously described [20]. Guide RNAs were extracted from the gel and prepped for sequencing using the Illumina ‘Small RNA’ protocol as previously described [9]. Libraries from both mtRNA and total RNA samples were deep sequenced on an Illumina GAIIx. Reads were then processed and trimmed as previously described [9]. Reads with two or more Ns, shorter than 20 nts after trimming, or with an overall mean Q-score < 25 were discarded. Redundant reads were then collapsed while recording the number of copies of each, and reads containing fewer than 4 consecutive Ts were removed.

Identification of gRNAs

To identify gRNAs, each transcript read was aligned to the conventionally edited mRNAs based on known base pairing rules (canonical Watson-Crick base pairs and the G-U base pair). In the initial screen, no gaps were allowed in the alignment, allowing the formulation of the gRNA-mRNA alignment as an extended longest common substring (LCS) problem as previously described [25]. Matched gRNAs were then scored (two points for G:C and A:U base pairs and one point for G:U base pairs). Guide RNAs with scores >45 were identified as guiding a specific region based on the identified mRNA fully edited sequence. Additional searches with reduced stringency (scores >30) were performed on regions with low gRNA coverage.
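The read-level filters listed under library preparation can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the pipeline used in the study: reads are assumed to arrive already trimmed as `(sequence, quality_scores)` pairs with one Phred score per base, and the function and parameter names are hypothetical.

```python
from collections import Counter


def passes_qc(seq, quals, min_len=20, max_n=1, min_mean_q=25):
    """Apply the three discard rules: N content, length, mean quality."""
    if seq.count("N") > max_n:  # two or more Ns
        return False
    if len(seq) < min_len:  # shorter than 20 nt after trimming
        return False
    if sum(quals) / len(quals) < min_mean_q:  # overall mean Q-score below 25
        return False
    return True


def has_t_run(seq, run=4):
    """True if the read contains at least `run` consecutive Ts."""
    return "T" * run in seq


def collapse_reads(reads):
    """Collapse identical reads to one entry each, keeping the copy number.

    `reads` is an iterable of (sequence, quality_scores) pairs. Reads without
    a run of four Ts are dropped after collapsing.
    """
    counts = Counter(seq for seq, quals in reads if passes_qc(seq, quals))
    return {seq: n for seq, n in counts.items() if has_t_run(seq)}
```

Keeping the copy number alongside each unique sequence is what later allows abundance to be compared between populations and between life cycle stages.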
The matched gRNAs were sorted into populations based on their guiding positions, and the populations were analyzed and sorted into major sequence classes.

Results

Much of the initial characterization of RNA editing in T. brucei was done using the EATRO 164 strain. These experiments suggested that RNA editing was developmentally regulated in that certain genes were shown to be more fully edited in some stages than others (Table 1) [24–32,46]. It was also reported that the developmental regulation was not controlled by gRNA availability, as gRNAs were found in both life cycle stages [28,50,51]. In these early studies, however, only a small number of gRNAs were investigated. In this study, we used deep sequencing to compare the gRNA transcriptomes of a bloodstream form to a procyclic form of T. brucei EATRO 164. The EATRO 164 strain was isolated in 1960 from Alcelaphus lichtensteini and maintained in the lab of Dr. K. Vickerman until being obtained by Dr. Stuart in 1966 [103]. Dr. Stuart derived the procyclic form from the bloodstream culture in 1979 [103]. Both cell lines have been maintained in separate culture since that time. Trypanosomes from the EATRO 164 strain were grown in Wistar rats to a parasitemia of 1–2 × 10^9 trypanosomes per mL and isolated using DEAE cellulose columns. Mitochondria and gRNAs were purified as previously described [9]. Libraries were generated using gRNAs isolated from whole cell RNA and gRNAs isolated from mitochondrial RNA. Both bloodstream gRNA libraries were searched using conventionally accepted fully edited mRNA sequences, and a total of 1,024,604 gRNA reads were identified. Surprisingly, the library generated using gRNAs isolated from whole cell RNA had more than twice as many identified gRNA reads as the data generated using gRNAs isolated from mitochondrial RNA. To ensure sufficient abundance and gRNA coverage, the two data sets were combined for the analyses presented here.
In contrast, over 3 million gRNA reads were identified in our procyclic gRNA transcriptome generated from gRNAs isolated from mitochondrial RNA. Of the 1,024,604 reads identified from the bloodstream transcriptomes, 982,450 reads were sorted into major sequence classes. The overall ratio of identified procyclic gRNA reads to BS gRNA reads was 3.5:1. This ratio varies significantly by gene (Table 2) and by populations within genes (APPENDIX A) and, except for the initiating gRNA, no apparent trend relating gRNA abundance and developmental editing pattern was observed. Interestingly, for the initiating gRNA, mRNAs that are fully edited in the procyclic stage only, or are fully edited in both life cycle stages, had initiating gRNAs with more reads in the procyclic data set (Figure 1) [31,32,40,46]. In contrast, mRNAs that are fully edited only in the BS, or whose fully edited form is more abundant in the BS, had more initiating gRNA reads in the BS data set (Figure 1) [26–30]. Because the identified gRNAs from the BS cells were less abundant, the rule used to identify major gRNA sequence classes was relaxed. Instead of using a strict cut off for the minimum number of reads required, the cut off was assessed on a case-by-case basis. For example, if the total population only had 100 reads, a sequence class with only 10 reads would still be identified as a major sequence class. Once all major classes were identified, the 657 sequence classes could be sorted into 211 populations (Table 3). Although the overall gRNA numbers were down in comparison to the procyclic data set, most of the populations found in that stage (214 gRNA populations) were also identified in the BS transcriptome. However, there were a number of populations that were unique to either the procyclic or BS stage.

Table 2. Number of gRNA transcripts in procyclic and bloodstream major classes and ratio of procyclic transcripts to bloodstream transcripts for each gene.

Gene      Bloodstream gRNA Reads    Procyclic gRNA Reads    Ratio of PC to BS Reads
A6        41,628                    266,532                 6.40
COIII     371,139                   948,845                 2.56
CR3       13,316                    236,808                 17.78
CR4       25,753                    51,979                  2.02
CYb       11,022                    31,622                  2.87
MurfII    157                       2,605                   16.59
ND3       13,567                    75,739                  5.58
ND7       291,927                   702,061                 2.40
ND8       112,868                   584,639                 5.18
ND9       83,924                    72,027                  0.86
RPS12     17,191                    403,131                 23.45
Total     982,492                   3,375,988               3.44

Figure 1. The abundance of the initiating gRNA of all edited mRNAs in each stage. mRNAs to the left of the dashed line are constitutively edited or are edited only in the procyclic stage [31,32,40,46]. mRNAs to the right of the dashed line are fully edited only in the bloodstream stage, or their fully edited form is more abundant in that stage [26–30].

Table 3. Summary of the gRNA data coverage for each gene.

Gene: A6 COIII CR3 CR4 CYb MurfII ND3 ND7 ND8 ND9 RPS12 Total
Populations, BS: 29 42 9 16 2 1 12 45 20 24 11 211
Populations, PC: 28 39 9 18 2 1 12 48 21 23 13 213
Unique gRNA Populations (PC BS): 1 0 1 4 0 0 0 2 0 0 0 0 0 0 5 2 3 2 0 1 0 2 13 10
Average* Overlap (nts) (PC BS): 18 20 22 19 14 19 17 18 14 12 N.A. N.A. 15 15 21 17 21 17 16 16 17 21 19 17
Gaps, BS: 1 0 1 2 0 1 1 7 2 1 2 18
Gaps, PC: 0 0 0 0 0 1 1 2 1 0 0 5
Weak Overlaps (PC BS): 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 4 0 1 2 0 1 0 5 6

*The average gRNA overlaps were determined excluding any regions where neighboring gRNAs shared no overlap.

Surprisingly, when the bloodstream and procyclic data sets were compared, only 37 identical major sequence classes were found in both. However, distinctly related sequence classes could be identified when comparing the BS and procyclic populations. Comparing the related major classes from each transcriptome (BS vs procyclic) revealed a median value of ten single nucleotide variations per gRNA. Interestingly, nt variations were much less likely to occur in the consecutive Watson-Crick anchor region of the gRNA than in the rest of the gRNA, indicating a very strong bias against G:U base pairs in this region (Figure 2).

Figure 2.
The frequency of nt variations versus nucleotide position in the gRNA. Gray bars indicate the number of gRNA sequence classes with an identified nt difference between related procyclic and bloodstream gRNAs. Nucleotide numbering for each gRNA was normalized by setting the start of the Watson-Crick anchor region to zero. Black bars indicate the number of gRNA sequence classes whose contiguous Watson-Crick anchors end at that position (start of Watson-Crick = zero, so this is an indication of the length of the contiguous Watson-Crick region).

The Watson-Crick anchors (defined as the number of consecutive nts in the 5’ region with only G:C and A:U base pairs) had a median length of eleven nucleotides, and anchor length did not vary between the two forms. The vast majority of major classes of gRNAs had consecutive Watson-Crick anchors greater than seven nts long (92.5%). In addition, most gRNAs with Watson-Crick anchors shorter than eight nts were not an abundant major class for their respective populations. Consistent with observations made from the procyclic data set, most gRNAs had zero non-base pairing nucleotides 5’ to the poly-uridine tail and 4 to 6 non-base pairing nucleotides 5’ to the anchor region (Figure 3). Also consistent with procyclic data, most of the gRNAs (59%) had 38 to 48 nts of complementarity (including anchor regions) with their respective mRNAs (Figure 4). Transcription start sites also did not vary, as preference for an RYAYA start site was observed (Table 4).

Figure 3. Comparing the number of non-complementary nucleotides 5’ of the anchoring region (A) or 3’ of the guiding region (excluding the U-tail) (B) in procyclic and bloodstream gRNAs.

Figure 4. Length of gRNA complementarity (including anchors) to fully edited mRNAs for both bloodstream and procyclic gRNAs.

Table 4. Most common gRNA transcription start sites in procyclic and bloodstream data.
Initiating Sequence    Stage          Percentage of Sequence Classes    Percentage of Transcripts
ATATAT                 Bloodstream    32.20%                            33.60%
ATATAT                 Procyclic      35.20%                            37.40%
ATATAA                 Bloodstream    20.00%                            17.90%
ATATAA                 Procyclic      21.10%                            24.70%
AAAAAA                 Bloodstream    3.60%                             1.60%
AAAAAA                 Procyclic      4.60%                             1.30%
ATATAC                 Bloodstream    3.80%                             1.50%
ATATAC                 Procyclic      4.30%                             3.80%
ATACAA                 Bloodstream    2.40%                             1.30%
ATACAA                 Procyclic      2.80%                             1.70%
ATATTA                 Bloodstream    4.90%                             13.40%
ATATTA                 Procyclic      2.60%                             0.90%
ATATAG                 Bloodstream    2.20%                             0.70%
ATATAG                 Procyclic      2.60%                             7.60%
ATAAAT                 Bloodstream    2.50%                             1.00%
ATAAAT                 Procyclic      2.60%                             0.70%
ATACAT                 Bloodstream    2.70%                             3.20%
ATACAT                 Procyclic      2.20%                             2.20%
ATAAAA                 Bloodstream    1.70%                             0.20%
ATAAAA                 Procyclic      2.10%                             1.10%

Coverage and gaps

In order to determine if the BS gRNA transcriptome contained a full complement of guide RNAs, the gRNA populations were aligned to the fully edited mRNAs (APPENDIX B). We note that for an mRNA to be fully edited, not only must all editing sites on the mRNA be covered by a gRNA, but the downstream gRNA must also generate the anchor binding site for the subsequent gRNA. Therefore, adjacent gRNAs must overlap. Overall, there was an average of 17 nts of overlap between adjacent gRNAs, with the average overlap varying slightly by gene (Table 3). As the median Watson-Crick anchor is 11 nts, in most cases the overlap extends beyond the Watson-Crick anchor of the subsequent gRNA. However, we did observe a number of regions where the overlap is minimal. Currently, no data establish the minimum anchor needed for efficient editing. However, we postulate that, similar to microRNAs, for an anchoring sequence to be sufficiently specific, it should be at least six nucleotides [104]. Indeed, when examining the overlaps between gRNAs, there are only ten (four procyclic and six BS) that are less than six nucleotides (Figure 5).

Figure 5.
gRNAs were aligned to their fully edited mRNA sequence and the number of mRNA nts with complementarity to both adjacent gRNAs determined.

We therefore used six nucleotides as a cut off to identify regions with potential missing guide RNAs for both life cycle stage transcriptomes. In contrast to the procyclic data, where full complements of gRNAs were identified for five of the mRNA transcripts (A6, COIII, CR4, CYb, and RPS12), in the BS transcriptome a full complement of gRNAs was only identified for COIII and CYb. Overall, there are 12 edited regions where no gRNAs were identified, and five regions with weak gRNA overlaps in the BS data (Table 5). Of these 17 regions, seven belong to ND7 alone. Interestingly, nine of the 17 missing populations are in very low abundance in the procyclic data, having 100 or fewer reads. Because reads in the BS data are ~3.5-fold less abundant, this could account for some of these regions of poor coverage. There are six regions that lack gRNA coverage in both data sets. These are found in CR3, MurfII, ND3 and ND7 (Table 5). Interestingly, three of these regions are close to the 3’ end of their respective genes. Regions of weak overlap (ND9(238–242), ND9(609–612)) and regions without gRNA coverage (CR3(278–292), ND8(541–553)) that are unique to the procyclic transcriptome were also observed. Interestingly, the regions of poor procyclic coverage are found in CR3, ND8 and ND9, all transcripts that are preferentially edited in the BS form [5,28,29,47].

Table 5. Identified gaps or weak overlaps (less than 6 nucleotides) between populations of gRNAs observed in both data sets.
Gene      Stage          Missing Coverage Range    Gap or Overlap    Abundance of Equivalent gRNA
A6        Bloodstream    669-670                   2 nt Gap          39,063
CR3       BS/Pa          233/226-230               1 nt G/5 nt O     Missing in Both Stages
CR3       Procyclic      278-292                   15 nt Gap         125
CR4       Bloodstream    143-165                   23 nt Gap         7,175
CR4       Bloodstream    302-306                   5 nt Gap          643
MurfII    BS/P           80-85                     6 nt Gap          Missing in Both Stages
ND3       BS/P           389-401                   13 nt Gap         Missing in Both Stages
ND7       BS/P           92-94                     3 nt Gap          Missing in Both Stages
ND7       Bloodstream    95-120                    26 nt Gap         1
ND7       Bloodstream    292-293                   2 nt Overlap      3,259
ND7       Bloodstream    325-326                   2 nt Gap          888
ND7       BS/P           485-486                   0 nt Overlap      Missing in Both Stages
ND7       Bloodstream    1000-1000                 1 nt Overlap      101
ND7       Bloodstream    1079-1085                 7 nt Gap          44
ND7       BS/P           1086/1086-1088            1 nt G/3 nt G     Missing in Both Stages
ND7       Bloodstream    1225-1232                 8 nt Gap          123
ND7       Bloodstream    1269-1270                 2 nt Overlap      63
ND8       Bloodstream    54-56                     3 nt Overlap      1
ND8       Bloodstream    153-159                   7 nt Gap          4
ND8       Bloodstream    386-389                   4 nt Gap          2
ND8       Procyclic      541-553                   13 nt Gap         413
ND9       Procyclic      238-242                   5 nt Overlap      652
ND9       Bloodstream    340-342                   3 nt Gap          36
ND9       Procyclic      609-612                   4 nt Overlap      7
RPS12     Bloodstream    122-132                   11 nt Gap         3
RPS12     Bloodstream    156-158                   3 nt Overlap      62
RPS12     Bloodstream    337-349                   13 nt Gap         128

Table 6. Summary of populations found in both data sets that have more reads in the bloodstream data set than in the procyclic data set.

Gene      PC and BS Shared Populations    Populations more abundant in BS    Percentage of populations more abundant in BS
A6        28                              6                                  21%
COIII     38                              12                                 32%
CR3       9                               5                                  56%
CR4       16                              10                                 63%
CYb       2                               0                                  0%
MurfII    1                               0                                  0%
ND3       12                              6                                  50%
ND7       43                              20                                 47%
ND8       18                              8                                  44%
ND9       23                              16                                 70%
RPS12     11                              4                                  36%
Total     201                             87                                 43%

Gene specific gRNA characteristics

ATPase 6. In the BS gRNA transcriptome, a total of 29 gRNA populations containing 86 different major sequence classes were identified that could guide the editing of A6 (Table 3; APPENDIX C Part A). One population was identified that was unique to the BS transcriptome (gA6(281–329)).
The gRNAs bordering this population share extensive overlap, so its absence in the procyclic transcriptome would not impact the editing process (APPENDIX C Part A). We note that two of the gRNAs identified have single nucleotide mismatches. The bloodstream gA6(640–668) has an identified mismatch (C:U) that disrupts the complementarity of the gA6(640–668) population (APPENDIX B Part A). The second mismatched gRNA (gA6(520–553)) would introduce a frameshift. Excluding these two mismatched regions, there is complete coverage of ATPase 6. In contrast to the procyclic data, where the conventional initiating gRNA and the gRNA immediately following it were extremely rare, both of these gRNAs, gA6(773–822) (previously identified as gA6-14) and gA6(745–789), were fairly abundant, each having hundreds of reads. The alternative initiating gRNA identified in the procyclic data set was not found. This finding is similar to that reported for the T. brucei Lister strain 427, where the authors identified alternative initiating gRNAs not found in the EATRO 164 procyclic gRNA transcriptome [105]. Another disparity between the two life cycle data sets was found when comparing the abundance of gRNAs implicated in a potential alternative edit. In the procyclic gRNA transcriptome, a gRNA was identified that would guide the insertion of 11 U-residues instead of the needed 12 between G555 and A568 [9]. This gRNA (pA6(557–593)) was 25-fold more abundant than the conventional gRNA (pA6(549–593)). In the BS data set, however, more than 400 reads of the 12U gRNA were identified and only one read was found that would encode the alternative 11U edit. Surprisingly, while G555-A568 would be correctly edited (insertion of 12Us), the next editing site (A549-G555) is edited by bsA6(520–553), the gRNA that introduces the 1 nt frameshift. This frameshift would generate a predicted protein with nearly the same amino acid sequence as the procyclic 11U frameshift edit (two amino acid changes) (Figure 6).
Figure 6. Alignment of conventional ATPase 6 protein sequence to hypothetical proteins generated by the 11U alternatively edited mRNA and the 4U alternatively edited mRNA. Double underlined residues show where the alternative sequences differ from the conventional sequence. The shaded residues in the 4U sequence show where it differs from the 11U sequence.

Cytochrome oxidase subunit III. Forty-two gRNA populations guiding the editing of COIII were identified in the BS transcriptome, three more than in the procyclic data set (APPENDIX C Part B). This disparity is caused by the presence of several unique populations. While the procyclic data set contained one unique population, the BS data contained four gRNA populations not previously identified. Of these four unique populations, three are required for full overlapping coverage in the bloodstream. They are not, however, required for full coverage in the procyclic stage. These three unique gRNA populations all span relatively small regions of weak overlap (Figure 7, APPENDIX B Part B).

Figure 7. Editing sites 420–489 of COIII aligned with the gRNAs identified for that region in the procyclic (grey) and bloodstream (black) data sets. The gRNA covering 443–474 was only found in the bloodstream data set.

An alternative edit of COIII has been described, involving distinct edits at two adjacent sites that link the open reading frame of the edited 3’ end to an ORF found in the 5’ pre-edited sequence [67]. The previously identified alternative gRNA that can generate the needed editing events was not found in either the BS or procyclic transcriptomes.

C-rich regions 3 and 4. In the BS data set, nine populations and 34 major sequence classes were identified that direct the editing of the CR3 transcript. The coverage of edited CR3 is nearly complete in the bloodstream data set, with only a one nt gap in coverage (editing site 233) (APPENDIX B Part C).
This is in contrast to the procyclic transcriptome, where gRNAs that matched the published sequence downstream of nt 196 were very rare (<10 copies) and no gRNAs were identified that could direct editing near the 3’ end (nucleotides 275–292). A full consensus sequence for edited CR4 has only been found in BS T. brucei [48]. Using this sequence, 16 gRNA populations containing 62 major sequence classes were identified in the BS transcriptome (APPENDIX C Part D). In contrast to the procyclic data, where a full complement of gRNAs was identified, there are two gaps in the BS coverage (Table 5).

Cytochrome b and maxicircle unidentified reading frame II. RNA editing in the Cytochrome b (CYb) transcript is limited to the 5’ end, and two gRNA populations are sufficient to guide the small number of edits needed to render the CYb transcript functional. Both populations were observed in both data sets, with a total of 6 major classes. Interestingly, in both data sets, the initiating gRNA is significantly more abundant than the second gRNA, being approximately 30-fold more abundant in the procyclic data set and approximately 200-fold more abundant in the bloodstream data set (APPENDIX C Part E). This is in contrast to most of the other transcripts, where the initiating gRNAs are not very abundant. In addition, almost all of the CYb gRNA major classes have an A-run transcription start site, deviating from the common RYAYA initiation site pattern. Editing in MurfII is also limited to the 5’ end and requires only two gRNAs. One of these gRNAs (gMurfII(30–79)) is encoded on the maxicircle [106]. While this gRNA was observed in both data sets, the gRNAs identified were not identical. A purine-purine transition near the 3’ end of the gRNA differentiates the procyclic and BS forms (APPENDICES B and C Parts F). An initiating gRNA is needed to generate the 3’ most edits that create the anchor sequence for gMurfII(30–79).
This gRNA was not found in either data set, despite additional searches with reduced search stringency.

NADH dehydrogenase subunits 3, 7, 8, and 9. In the initial characterization of RNA editing in T. brucei EATRO 164, fully edited ND subunit transcripts were only found in RNA isolated from the BS stage. We were therefore surprised to find that fewer ND gRNA populations were identified in the BS transcriptome and that a full complement of gRNAs was not identified for any of the ND subunits. The most complete coverage was found for ND3 and ND9. For ND3, the BS data set contained twelve populations and 41 major classes of gRNAs. One gap in coverage was observed, from 389–401. This region overlaps a region that has no clear consensus sequence, 375–395 [26]. ND9 is the only gene in this study whose bloodstream gRNA reads outnumber the procyclic gRNA reads identified (Table 2). Twenty-four bloodstream gRNA populations were identified, with all edited nucleotides covered if gRNAs with a single base pair mismatch are taken into account (APPENDIX B Part J). While 45 gRNA populations were identified for ND7 in the BS data set, the gRNA coverage was significantly worse when compared to the identified procyclic gRNAs (Table 5). Despite the poor coverage, two unique gRNA populations (bs gRNA (772–816) and (1128–1182)) were identified (APPENDICES B and C Parts H). ND8 also had poor gRNA coverage (Table 5). Interestingly, there are several populations in ND8 that contain highly abundant gRNA sequence classes with mismatches that shorten the complementarity of the gRNA. These usually have a single mismatch in a gRNA that would otherwise guide conventional editing (APPENDIX C Part I).

Ribosomal protein S12. The BS data set contained 11 populations and 26 major sequence classes that direct editing of RPS12 (Table 3). While the procyclic transcriptome contained a full complement of gRNAs, the BS RPS12 data contain one gap in coverage and one region of poor overlap (Table 5).
This was surprising, as RPS12 has been shown to be essential in both life cycle stages [100,107]. The region of the mRNA with poor coverage has a high percentage of C residues, and gRNAs covering this region may utilize C:A base pairs. If this is the case, some classes of gRNAs may not have been detected, as the program used to search for gRNAs does not allow for C:A base pairs (APPENDIX B Part K).

Discussion

This is the first comprehensive characterization of the mitochondrial gRNA transcriptome from the bloodstream stage of Trypanosoma brucei brucei. As we have previously characterized the insect stage gRNA transcriptome, these data allow the comparison of gRNA characteristics across the two main life cycle stages [9]. In the EATRO 164 BS gRNA transcriptome, gRNAs for every edited gene were identified. Interestingly, while the number of populations identified in this data set was only slightly lower than that reported in the procyclic data set, the total number of gRNA transcript reads identified was considerably lower, despite the fact that multiple transcriptome libraries were combined. While this may be a reflection of the down regulation of mitochondrial transcription in the bloodstream stage (see Table 1), it is impossible to rule out technical problems in the generation and sequencing of the libraries. It has been previously reported that gRNA presence did not correlate with developmental RNA editing patterns in T. brucei, and our data do not challenge this [50,51]. The data did, however, show an interesting trend in the abundance of the initiating gRNAs as it relates to their developmental editing patterns. It may be that the abundance of the initiating gRNAs is regulated in order to control editing of their target mRNAs. However, we cannot rule out the possibility that not all of the populations of initiating gRNAs were identified. For the pan-edited mRNAs, the initiating gRNAs direct sequence changes that are often downstream of the stop codon.
Sequence changes in this region would be tolerated, as long as the anchor sequence for the next gRNA is maintained. This type of mutation was observed in the 3’ end of ATPase 6 [19]. In addition, characterization of the initiating gRNAs in the Lister 427 T. brucei cell line identified several gRNAs that would direct an alternative editing pattern, suggesting a high tolerance for sequence changes near the mRNA 3’ ends [105].

As expected, general gRNA characteristics are conserved across the two life-cycle stages. Populations retain the general location of their anchors, there is relatively little shift in the location of populations, and the lengths of complementarity are very similar. We did observe that considerable nucleotide variations were found in the guiding regions of the gRNAs from the different life cycle strains of the EATRO 164 cells. This particular cell line dates back to 1960, when the BS form was originally acquired [103]. Procyclic cells were derived from the BS stock in 1979 and the two cell lines maintained separately since that date [103]. Mixed trypanosome genotypes are detected frequently in field isolates from both tsetse flies and mammals, and it may be that separation into different culture conditions allowed different genotypes to predominate in each life cycle strain [22,108,109]. Because gRNAs utilize both canonical (Watson-Crick) and G:U base-pairing to direct the change in sequence, most transition mutations in the gRNA would not lead to changes in the mRNA sequence and would not be selected against [33]. We do note, however, that a very strong bias against A to G transitions is observed in the anchor regions of the gRNAs. This suggests that transition mutations in this region are not tolerated, and that the editing machinery recognizes and selects for a conventional base-paired double helix in the initial gRNA/mRNA pairing.
The ability to discriminate against G:U base-pairs in the initial interaction would greatly increase the accuracy of the gRNA targeting event. Considering the sequential nature of the overall editing process, this would be very advantageous.

Coverage

Surprisingly, complete gRNA coverage was observed only for the pan-edited COIII and for CYb, where editing is limited to the 5’ end. The identification of the CYb gRNAs was expected, as it has been previously reported that the gRNAs are present in both life cycle stages even though editing of CYb is limited to the procyclic stage [8,24]. The full coverage of COIII was also not surprising, as COIII was shown to be fully edited and equally abundant in both stages [32]. However, we expected to see complete coverage of ATPase 6 and RPS12, as both of these transcripts have been shown to be essential in both life cycle stages [17,100,107,110]. For ATPase 6, we did identify a total of 29 gRNA populations that do cover all of the editing sites. However, one of the gRNAs (bsA6(643–667)) has a single nucleotide mismatch (C:U) and one would introduce a frameshift (bsA6(520–553)). The C:U mismatch occurs near the middle of the gRNA, in a region that is unusually high in Gs and Cs (APPENDIX B Part A). It may be that the G:C base pairs immediately upstream of the mismatch stabilize the gRNA/mRNA interaction, allowing it to be tolerated. The frameshift gRNA is also interesting, as it occurs just upstream (1 editing site) of another site where we had previously observed a frameshift sequence anomaly. Both frameshifts (the BS 4U and the procyclic 11U) generate a predicted protein with nearly the same amino acid sequence. As the frameshifts occur downstream of the highly conserved amino acid region involved in proton translocation [31], it may be that this different carboxyl terminus is tolerated. Near full coverage is also observed for RPS12.
For this transcript, one BS identified gRNA (bsRPS12(96–121)) has an A-nt insertion that disrupts the gRNA complementarity. Surprisingly, the other mRNA transcript found with near complete coverage was ND9 (one gRNA has a single nt mismatch). All of the other mitochondrially encoded Complex I members did have substantial gaps in coverage. Currently, there is considerable debate on the necessity of Complex I subunits for either stage of the trypanosome life cycle. Studies using RNAi and knockout cell lines of nuclear-encoded members of Complex I have shown that the complex is unnecessary for survival in either life cycle stage [111,112]. However, the nuclear encoded Complex I member genes are maintained [42], and while we did not identify full coverage for the ND transcripts, a vast majority of the gRNAs were found in both life cycle stages.

This study used high-throughput sequencing to characterize the gRNA transcriptome during the bloodstream stage of the trypanosome life cycle. This work suggests that gRNAs are expressed during both life cycle stages, and that differential editing patterns observed for the different mitochondrial mRNA transcripts are not due to the presence or absence of gRNAs.

Accession Numbers

SAMN04302078, SAMN04302079, SAMN04302080, and SAMN04302081 in NCBI’s Sequence Read Archive.

Acknowledgments

The authors dedicate this work in memory of David Judah, MS, DVM. He was a wonderful colleague. We also acknowledge the work of Joshua Foster, Mark Johnson, James Rauschendorfer, Heather Tyler, Callie Vivian, and Alexis Weber, who were involved in sorting and identifying gRNAs; the MSU RTSF for their contribution in deep sequencing; and Ken Stuart and Jason Carnes at the Center for Infectious Disease Research for supplying the T. brucei strains used in this study.

CHAPTER 3: MITOCHONDRIAL DUAL-CODING GENES IN TRYPANOSOMA BRUCEI

Abstract

Trypanosoma brucei is transmitted between mammalian hosts by the tsetse fly.
In the mammal, the parasites are exclusively extracellular, continuously replicating within the bloodstream. During this stage, the mitochondrion lacks a functional electron transport chain (ETC). Successful transition to the fly requires activation of the ETC and ATP synthesis via oxidative phosphorylation. This life cycle leads to a major problem: in the bloodstream, the mitochondrial genes are not under selection and are subject to genetic drift that endangers their integrity. Exacerbating this, T. brucei undergoes repeated population bottlenecks as it evades the host immune system, which creates additional forces of genetic drift. These parasites possess several unique genetic features, including RNA editing of mitochondrial transcripts. RNA editing creates open reading frames by the guided insertion and deletion of U-residues within the mRNA. A major question in the field has been why this metabolically expensive system of RNA editing would evolve and persist. Here, we show that many of the edited mRNAs can alter the choice of start codon and the open reading frame by alternative editing of the 5' end. Analyses of mutational bias indicate that six of the mitochondrial genes may be dual-coding and that RNA editing allows access to both reading frames. We hypothesize that dual-coding genes can protect genetic information by essentially hiding a non-selected gene within one that remains under selection. Thus, the complex RNA editing system found in the mitochondria of trypanosomes provides a unique molecular strategy to combat genetic drift in non-selective conditions.

Author Summary

In African trypanosomes, many of the mitochondrial mRNAs require extensive RNA editing before they can be translated. During this process, each edited transcript can undergo hundreds of cleavage/ligation events as U-residues are inserted or deleted to generate a translatable open reading frame.
A major paradox has been why this incredibly metabolically expensive process would evolve and persist. In this work, we show that many of the mitochondrial genes in trypanosomes are dual-coding, utilizing different reading frames to potentially produce two very different proteins. Access to both reading frames is made possible by alternative editing of the 5' end of the transcript. We hypothesize that dual-coding genes may work to protect the mitochondrial genes from mutations during growth in the mammalian host, when many of the mitochondrial genes are not being used. Thus, the complex RNA editing system may be maintained because it provides a unique molecular strategy to combat genetic drift.

Introduction

Trypanosomes are among the most successful parasites in existence, inhabiting an incredibly wide range of hosts [2,3]. The dixenous members cycle between two distinct hosts and can encounter different environments with distinct metabolic constraints. These parasites are unique in that they all possess glycosomes (where glycolysis occurs) as well as mitochondria [16]. The salivarian trypanosomes (e.g. T. brucei, T. vivax) are especially interesting, because they are exclusively extracellular in their mammalian hosts, continuously replicating within the bloodstream over periods of months. During this stage of the life cycle, the mitochondrion is down-regulated, lacking both Krebs cycle enzymes and a functional electron transport chain (ETC) [21]. Successful transition to the fly vector requires activation of the ETC and ATP synthesis via oxidative phosphorylation. This unique life cycle leads to a major problem: when the mitochondrial genes are unused, they are not under selection, hence the integrity of these genes is threatened by genetic drift [75,113].
Exacerbating this, salivarian trypanosomes undergo a severe bottleneck as they transition through the tsetse fly and into the mammalian host, and then, within the bloodstream, they undergo multiple bottlenecks at each antigenic switch as they evade the host immune system [22]. Such bottlenecks create additional forces of genetic drift, where genes can be lost even if their deleterious fitness effect is considerable.

These parasites possess several unique genetic features, including RNA editing of the mitochondrial transcripts. RNA editing creates open reading frames in “cryptogenes” by insertion and deletion of uridylate residues at specific sites within the mRNA. The U-insertions/deletions are directed by small guide RNAs (gRNA) and can repair frameshifts, generate start and stop codons and more than double the size of the transcript (for review see [4]). While the mRNA cryptogenes are encoded on maxicircles (25–50 copies per DNA network), the guide RNAs are encoded on thousands of 1 kb minicircles, encoding 3–5 gRNA genes each [8]. This effectively means that the genetic information for the edited mitochondrial mRNAs is dispersed between the mRNA cryptogenes on the maxicircles and the thousands of gRNA coding minicircles. The extensive editing of a single transcript can require more than 40 gRNAs and hundreds of editing events [9]. While the initial gRNA can interact with the 3' end of the pre-edited transcript, all subsequent gRNAs anchor to edited sequence created by the preceding gRNA. Hence, editing proceeds from the 3' end to the 5' end of the mRNA transcript, with the terminal gRNA (last one in the cascade) often creating the start codon needed for translation. This sequential dependence means that even with high accuracy rates for each gRNA, the overall fidelity of the process is astonishingly low. A major question in the field has been why this fragile and metabolically expensive system of RNA editing would evolve and persist.
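The effect of this sequential dependence can be made concrete with a simple calculation. As a rough sketch, assume that each gRNA acts independently and directs its block of edits correctly with the same probability p; the probability that a cascade of n gRNAs yields a fully correct transcript is then p^n. The per-gRNA accuracies used below are hypothetical values chosen for illustration, not measurements, and the function name is likewise an assumption; the cascade length of 40 reflects the number of gRNAs that extensive editing of a single transcript can require [9].

```python
def overall_fidelity(per_grna_accuracy, n_grnas):
    """Probability that every gRNA in a sequential cascade acts correctly.

    Assumes independent steps that all share the same accuracy
    (a simplification made for illustration only).
    """
    if not 0.0 <= per_grna_accuracy <= 1.0:
        raise ValueError("accuracy must be a probability between 0 and 1")
    if n_grnas < 0:
        raise ValueError("number of gRNAs cannot be negative")
    return per_grna_accuracy ** n_grnas


# Hypothetical per-gRNA accuracies for a 40-gRNA cascade.
for accuracy in (0.99, 0.95, 0.90):
    fidelity = overall_fidelity(accuracy, 40)
    print(f"accuracy {accuracy:.2f} per gRNA -> overall fidelity {fidelity:.3f}")
```

Under these assumptions, a per-gRNA accuracy of 99% still leaves only about two thirds of transcripts correctly edited after 40 sequential steps, and the fraction falls steeply as the per-gRNA accuracy decreases.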
Another level of complexity in the kinetoplastid RNA editing process was revealed by the detection of an alternative editing event that leads to the production of a functionally discrete protein isoform. Alternative editing of Cytochrome Oxidase III (COIII) is reported to generate a novel DNA-binding protein, AEP-1, that functions in mitochondrial DNA maintenance [67,68]. In this transcript, one alternative gRNA generates sequence changes at two sites that link an open reading frame (ORF) found in the pre-edited 5' end to the 3' transmembrane domains found in the COIII edited ORF. This was the first indication that one cryptogene could contain information for more than one protein. Here, we show that as many as six additional cryptogenes also encode more than one protein. Analyses of the terminal gRNA populations indicate that gRNA sequence variants exist that can alter the choice of the start codon and the open reading frame by alternative editing of the 5' end of the mRNA. Mutational bias analyses indicate that six of the mitochondrial genes may be dual-coding, with RNA editing allowing access to both reading frames.

Dual-coding genes are defined as a stretch of DNA containing overlapping open reading frames (ORFs) [85,86]. Of particular interest are dual-coding genes that contain two ORFs read in the same direction: a canonical protein (normally annotated as protein coding in the literature) and an alternative ORF. Maintaining dual-coding genes is costly, as it constrains the flexibility of the amino acid composition of both proteins. Hence, it is thought that dual-coding genes can survive long evolutionary spans only if the overlap is advantageous to the organism [93]. We hypothesize that trypanosomes use dual-coding genes to protect genetic information by essentially hiding a non-selected (ETC) gene within one that remains under selection.
Thus, the ability to access overlapping reading frames may be added to a growing list of gene protective strategies made possible by the complex RNA editing process [74,75,113].

Materials and Methods

Trypanosome growth

T. brucei procyclic clones from IsTAR (EATRO 164), TREU 667 and TREU 927 cell lines were grown in SDM79 at 27°C and harvested at a cell density of 1–3 × 10^7. The TREU 667 cell line was originally isolated from a bovine host in 1966 in Uganda [114]. The TREU 927 cell line was originally isolated from Glossina pallidipes in 1970 in Kenya [115]. The EATRO 164 strain was isolated in 1960 from Alcelaphus lichtensteinii and maintained in the lab of Dr. K. Vickerman until being obtained by Dr. Ken Stuart in 1966 [103]. Dr. Stuart derived the procyclic form from the bloodstream form culture in 1979.

Guide RNA isolation, preparation, and sequencing

Mitochondrial mRNAs and gRNAs were isolated as previously described [9]. All RNAs were treated with Promega RQ1 DNase. In order to isolate gRNAs from TREU 667 and TREU 927 cells, RNAs were size fractionated on a polyacrylamide gel as previously described [9]. Guide RNAs were then extracted and prepared for sequencing using the Illumina Small RNA protocol [9]. Libraries from TREU 667 and TREU 927 were deep sequenced on the Illumina GAIIx; reads were processed and trimmed as previously described [9].

Messenger RNA preparation and sequencing

In order to isolate target mRNAs, isolated TREU 667 mitochondrial RNAs were reverse transcribed using the Applied Biosystems High Capacity cDNA Reverse Transcription Kit.
CR3 cDNAs were amplified via PCR using the following primers (underlined portions are gene specific and non-underlined portions are tag regions used in the deep sequencing reaction):

CR3DS5’NEV: ACACTGACGACATGGTTCTACAAGAAATATAAATATGTGTATG
CR3DS3’170: TACGGTAGCAGAGACTTGGTCTCAATAAACCCATATTAAATAAAAAACAAAAATCC

After amplification, the products were purified using the QIAquick PCR Purification Kit, and paired-end Illumina deep sequencing was performed on the Illumina MiSeq (2 × 250 bp paired-end run). Low-quality reads were removed using FaQCs, adapters were removed using Trimmomatic, and PEAR was used to merge paired-end reads. Finally, Fastx was used to collapse identical reads while retaining the count of redundant reads. CR3 edited transcripts were identified by comparing the sequence downstream of the 5' never-edited region to the edited CR3 sequence. Guide RNAs were identified by using the mRNA sequences as queries against our existing gRNA databases, as previously described [9].

Mutational frequency and editing conservation analysis

Mitochondrial pan-edited genes were categorized as potentially dual-coding based on identification of extended alternative reading frames and/or the presence of identified gRNAs that generate alternative 5' end sequences. These genes include CR3, CR4, ND3, the 5' editing domain of ND7, ND9 and RPS12. Nondual-coding pan-edited genes include ATPase 6, COIII, ND8 and the 3' editing domain of ND7. Partially edited genes include CYb, Murf II and COII. Never-edited genes include COI, ND1, ND2, ND4 and ND5. For all analyses, ND7 was considered as two separate coding regions: the 5' editing domain (ND7N) and the 3' editing domain (ND7C) [27]. As we hypothesize that only the 5' editing domain of ND7 is dual-coding, mutation calculations for ND7N were pooled with the dual-coding genes and those for ND7C were pooled with the nondual-coding pan-edited genes. T. brucei and T.
vivax mRNA sequences of mitochondrially encoded genes were aligned based on protein sequence using Clustal Omega [116]. Nucleotide sequence mutations were identified and their effects on the amino acid sequence were classified as silent, missense or nonsense mutations. Missense mutations were further divided into three groups based on the PAM 250 matrix, where conversions with a value <0 were considered not conserved, conversions with a value 0 ≤ x ≤ 0.5 were considered modestly conserved, and conversions with a value >0.5 were considered strongly conserved [117]. Mutation frequencies were normalized for each gene using nucleotide sequence length. Frequencies were compared using unpaired t-tests.

The extent of editing conservation between T. vivax and T. brucei was calculated by aligning the pan-edited genes based on ACG sequence. For each alignment, each location between an A, C or G nucleotide where a U-residue was inserted or deleted in either sequence was considered an editing site. Editing sites were classified as identical in both sequences, altered in insertion or deletion length, having switched from an insertion site to a deletion site, or only occurring in one of the sequences. Percent editing conservation was based on the total number of editing sites within each mRNA. Percentages were compared using unpaired t-tests.

A principal component analysis (PCA) was performed on all three reading frames of the pan-edited genes using the scikit-learn principal component analysis tool [118]. For this analysis, the predicted protein sequences for all three reading frames were aligned using Clustal Omega [116]. Missense, nonsense, and indel mutations were quantified. Missense mutations were further divided into three groups as described above. Each mutation type was quantified and the relative frequency of each mutation calculated based on protein amino acid length.
The variables used in the PCA include the protein mutation frequencies and the percentage of identical editing sites in each mRNA. The first reading frame of each gene is defined as the ORF published in the literature.

Data availability

Sequences are available from NCBI's Sequence Read Archive. CR3 sequence accession number: SAMN06318039. TREU 927 gRNA sequence accession number: SAMN06318154. TREU 667 gRNA sequence accession number: SAMN06318153.

Results

In T. brucei, analyses of the gRNA transcriptome for the pan-edited transcripts indicate that full editing involves a large number of gRNA populations [9,119]. In addition, most of the gRNA populations (a population being defined as the gRNAs guiding the same or nearly the same region of the mRNA) contain multiple sequence classes. The sequence classes most often differ by R to R or Y to Y mutations, and hence guide the generation of the same mRNA sequence (A:U and G:U base pairs allowed). During these analyses, we noted that the terminal gRNA population for Cytosine-rich Region 3 (CR3) (putative NADH dehydrogenase subunit 4L [120]) had 3' sequences that would extend editing beyond the previously identified translation start codon. In addition, this population had several sequence variants that would generate different edited sequences in this region. The most abundant terminal gRNA would introduce a stop codon in-frame with two alternative AUG start codons found near the 5' end (Figure 8A). Other sequence classes, however, would either bring the upstream AUGs into frame or shift the reading frame. Intriguingly, the alternative +1 reading frame (ARF) did not contain any premature termination codons. In order to determine if these gRNAs were utilized, we used Illumina deep sequencing to identify the most abundant forms of fully edited CR3 transcripts. Surprisingly, we identified multiple forms of the mRNA (Figure 8A, 8B and 8C and APPENDIX D). The first was the fully edited sequence predicted by the most abundant gRNA identified (Figure 8A).
The other transcripts, however, had unique editing patterns at the 5' end (Figure 8B and 8C and APPENDIX D). Use of these 5' CR3 sequences allowed us to identify novel gRNAs. Predicted translation of these mRNA sequences indicates that they use the +1 reading frame, and that the protein generated would be the same length as the ORF previously identified. This suggests that CR3 is dual-coding, and that selection of the terminal gRNA determines which reading frame will be used.

A re-examination of the terminal gRNAs for the pan-edited genes indicated that at least two other transcripts, NADH dehydrogenase subunit 7 (ND7) and ribosomal protein subunit 12 (RPS12), have identified gRNA sequence variants within the terminal gRNA population that allow access to alternative reading frames (Figure 8D and 8E and APPENDICES E and F). Interestingly, the alternative gRNA for ND7 generates a +2 frameshift with a 65 amino acid open reading frame. The ND7 transcript is differentially edited in two distinct domains separated by 59 nts that are not edited in the mature transcript (the HR3 region) [27]. Only the 5' domain is edited in both life cycle stages; full editing of the 3' domain was found only in bloodstream form (BF) parasites. The stop codon for the +2 frameshift is found within the HR3 region; therefore, this alternative protein would be generated by full editing of only the 5' domain. While the most abundant gRNA in the EATRO BF transcriptome (~50,000 reads) would generate a sequence utilizing the identified ND7 ORF (Figure 8D ORF), the most abundant gRNA (>100,000 reads) in the EATRO 164 procyclic library is the +2 ARF gRNA (Figure 8D ARF and APPENDIX E). In RPS12, the alternative gRNA deletes an additional U-residue downstream of the existing start codon, shifting the reading frame into the +1 ARF (Figure 8E). Interestingly, in Leishmania tarentolae, a gRNA, gRPS12VIIIa, has been identified that would also shift the frame of the existing start codon into the +1 ARF [121].
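The effect of a single alternative U deletion on the reading frame can be illustrated with a toy sequence. The sketch below is not the RPS12 sequence; the sequence, the function names and the choice of stop codons (UAA and UAG only, assuming UGA is read as tryptophan in the mitochondrial code) are illustrative assumptions.

```python
from typing import List, Optional

# Assumption: UGA is read as Trp in this mitochondrial code, so only
# UAA and UAG are treated as stop codons here.
STOP_CODONS = {"UAA", "UAG"}


def codons(mrna: str, frame: int = 0) -> List[str]:
    """Split an mRNA into complete codons, starting at offset `frame` (0, 1 or 2)."""
    seq = mrna.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def first_stop(mrna: str, frame: int = 0) -> Optional[int]:
    """Codon index of the first stop codon in the given frame, or None if there is none."""
    for index, codon in enumerate(codons(mrna, frame)):
        if codon in STOP_CODONS:
            return index
    return None


def delete_base(mrna: str, position: int) -> str:
    """Return the sequence with the residue at `position` removed."""
    return mrna[:position] + mrna[position + 1:]


# Hypothetical toy transcript: the canonical frame hits a stop at codon 3.
canonical = "AUGUUUGCAUAGCCA"
# Deleting one U just downstream of the AUG shifts everything after it by one.
shifted = delete_base(canonical, 3)

print(codons(canonical), first_stop(canonical))
print(codons(shifted), first_stop(shifted))
```

In this toy case the codons downstream of the deletion are exactly those of the original +1 frame, so the start codon is retained while the stop codon of the canonical frame is no longer read.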
The identification of gRNAs that could alter the reading frame led us to re-analyze the ORFs of the edited transcripts. In addition to CR3, we found extended ORFs in two different frames for Cytosine-rich Region 4 (CR4) and NADH dehydrogenase subunit 9 (ND9), while several others had shorter, but still significant, ORFs in alternative frames (Figure 9). We do note that the original sequence publications for both CR4 and RPS12 (CR6) had indicated that the fully edited sequence contained extended ORFs in two different frames [30,48]. Additionally, NADH dehydrogenase subunit 3 (ND3) was also considered to be potentially dual-coding, based on the mutational analysis described below.

Figure 8. Alternative editing of the 5' end of pan-edited genes results in access to different reading frames. CR3 (A, B, C): Sequenced mRNA variants are aligned with gRNAs and predicted protein sequences. Inserted U-residues are lowercase while deleted U-residues are shown as asterisks. Canonical Watson-Crick base pairs (|); G:U base pairs (:). Previously identified start codons are double underlined. Potential upstream AUG start codons are indicated by wave underlines. Alternatively edited nucleotides are shown in red. Common anchor regions are shown in blue. ND7 (D) and RPS12 (E): Predicted mRNA and protein sequences, based on identified gRNAs.

As we did find potential ARFs in the edited transcripts, we analyzed the predicted ORFs for biases in their mutational pattern. Dual-coding genes often display an atypical codon mutation bias due to constraints imposed by the need to maintain protein function in both genes. In single-coding genes, changes in the third nucleotide of a codon often give rise to synonymous codons, so this position (N3) is much less constrained. In contrast, in dual-coding genes, the N3 position in one frame is the N1 or N2 position in the alternative frame. Therefore, they have low rates of synonymous mutations [122].
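The overlap of codon positions between frames follows from simple modular arithmetic. As a minimal sketch (the function name and zero-based indexing are illustrative choices, not part of the original analysis):

```python
def codon_position(nt_index: int, frame: int) -> int:
    """Codon position (1, 2 or 3) of a nucleotide in a frame starting at offset `frame`.

    `nt_index` is zero-based; `frame` is 0 for the canonical ORF,
    1 for the +1 frame and 2 for the +2 frame.
    """
    return (nt_index - frame) % 3 + 1


# For each nucleotide, print its codon position in the three frames.
for i in range(6):
    print(i, [codon_position(i, frame) for frame in (0, 1, 2)])
```

Every nucleotide that sits at N3 in the canonical frame sits at N2 in the +1 frame and at N1 in the +2 frame, so no position is a wobble position in two frames at once.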
This codon bias has been used to develop algorithms to detect novel overlapping genes [123,124]. These algorithms, however, cannot be used in the analysis of our edited transcripts, as the two-component genetic system (mRNAs created by gRNA editing) introduces another layer of mutational constraint [125]. In addition, the edited sequence of the transcripts is known for only a limited number of kinetoplastids, and only the salivarian trypanosomes have the same general life cycle; other kinetoplastids, like Leishmania and T. cruzi, have evolved different infective cycles and are under very different selective pressures [113,126,127]. Fully edited sequences are known for T. vivax, the earliest branching salivarian trypanosome [52,128]. T. vivax differs from T. brucei in that it completes the insect phase of its life cycle entirely within the proboscis of the fly. This parasite has been described as an intermediate stage in the evolutionary pathway from mechanical transmission (ancestral) to full adaptation to the midgut and salivary glands of the tsetse fly [129].

Figure 9. Positions of stop codons on all RFs of the edited genes in T. brucei. For each gene, reading frame 1 (RF1) is designated as the protein ORF previously identified. Hypothetical dual-coding reading frames are shown in red. A6 = ATPase 6; CO = Cytochrome Oxidase; CYb = Cytochrome b; Murf = Maxicircle unidentified reading frame; ND3–ND9 = NADH Dehydrogenase subunits.

Using the T. vivax sequence, we analyzed mutation patterns in all of the mitochondrially encoded mRNAs (Figure 10). mRNA sequences were aligned by codons based on their protein alignments (Clustal Omega [116]). Mutated codons were identified and classified as silent, missense and nonsense mutations. Missense mutations were further divided into three groups based on the PAM 250 matrix [117].
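The classification of aligned codon pairs can be sketched as follows. This is an illustration rather than the pipeline used here: the codon table is the standard code with UGA reassigned to tryptophan (an assumption about the mitochondrial code), the thresholds follow the Materials and Methods, and the substitution scores in `TOY_SCORES` are placeholders, not values from the PAM 250 matrix.

```python
from typing import Dict, FrozenSet

BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard code, built in UCAG order for each codon position.
CODE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
CODE["UGA"] = "W"  # assumption: UGA encodes Trp in the mitochondrial code

# Placeholder scores for illustration only; real values would come from
# the substitution matrix.
TOY_SCORES: Dict[FrozenSet[str], float] = {
    frozenset("IV"): 1.0,
    frozenset("ST"): 0.3,
    frozenset("GW"): -1.0,
}


def classify(codon_a: str, codon_b: str, scores: Dict[FrozenSet[str], float]) -> str:
    """Classify the difference between two aligned codons."""
    if codon_a == codon_b:
        return "identical"
    aa_a, aa_b = CODE[codon_a], CODE[codon_b]
    if aa_a == aa_b:
        return "silent"
    if "*" in (aa_a, aa_b):
        return "nonsense"
    score = scores[frozenset((aa_a, aa_b))]
    if score < 0:
        return "not conserved"
    if score <= 0.5:
        return "modestly conserved"
    return "strongly conserved"


for pair in (("AUU", "AUC"), ("AUU", "GUU"), ("UCU", "ACU"), ("UGG", "GGG"), ("UAU", "UAA")):
    print(pair, classify(*pair, TOY_SCORES))
```

Counts of each class would then be divided by sequence length to give the normalized frequencies compared between gene categories.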
These data clearly show that the RNA editing process significantly constrains the types of mutations tolerated within the mitochondrial genome. In comparison to the genes that are not edited (ND1, ND2, ND4, ND5, COI) or have limited editing (CYb, Murf II and COII), a distinct suppression of silent mutations and strongly conserved missense mutations was observed for all of the pan-edited genes, consistent with previous observations (Figure 10) [125]. A suppression of mutations that lead to moderately conserved amino acid replacements was also observed, but this was not as striking due to the low frequency of this type of mutation. No significant difference was observed in the frequency of not conserved missense mutations, though a trend towards a lower frequency of these mutations in the putative dual-coding genes (CR3, CR4, ND3, ND9, 5' ND7 and RPS12) was noted. This was complemented by a significant increase in the frequency of strongly conserved missense mutations in the putative dual-coding genes in comparison to the other pan-edited genes (3' ND7, ND8, A6 and COIII).

Figure 10. Mutational frequencies in mitochondrially encoded genes categorized by effect on amino acid sequence. T. brucei and T. vivax pan-edited dual-coding (PanEd DC), pan-edited nondual-coding (PanEd NDC), partially edited (PartEd), and never edited (NevEd) mRNA sequences were aligned based on their amino acid alignment (reading frame 1, defined as the reading frame encoding the gene product previously annotated in the literature) [116]. Mutations were categorized as silent (Si), strongly conserved (SC), modestly conserved (MC), not conserved (NC) or nonsense (Non). The amount of conservation was determined using the PAM 250 matrix, where conversions with a value 0 < x ≤ 0.5 were considered modestly conserved, conversions with a value >0.5 were considered strongly conserved, and conversions with a value ≤0 were considered not conserved. Error bars depict standard error. * p<0.05, ** p<0.01 (unpaired t-test).
Surprisingly, while the overall mutational frequency of the fully edited pan-edited genes was similar, a comparison of the conservation of editing patterns did show a significant difference between the putative dual-coding and the other pan-edited genes (Figure 11). The dual-coding genes consistently had a lower conservation of their editing pattern. Upon further examination, we found that most changes in the editing pattern resulted from thymidine insertions and deletions within the maxicircle DNA sequence, which were then corrected by the editing machinery. These types of mutations do not result in a change to the final mRNA sequence once edited. The T. brucei (Tb) dual-coding genes appeared to consistently insert more U-residues, while T. vivax (Tv) had more U-residues encoded within the DNA sequence. Indeed, comparisons of the length of the coding regions of Tb and Tv cryptogenes (unedited sequence) show that the putative dual-coding genes are almost 10% shorter in Tb. In contrast, the nondual-coding cryptogenes are not significantly shorter (~2.5%). Some of the other changes in editing patterns did generate small internal frameshifts, as previously described by Landweber and Gilbert [125]. However, the high prevalence of internal frameshifts reported for COIII by Landweber and Gilbert is reflected in our analysis only for COIII and A6.

Figure 11. Percent conservation of editing patterns between T. brucei and T. vivax. Alignment of the fully edited mRNAs was based on ACG sequence (see APPENDIX G). Each editing site was defined as a site on at least one of the two aligned mRNAs where an editing event occurred. Sites were then classified as identical, only identified in one of the two sequences, type switched (one site is an insertion and the other is a deletion), or altered in length. Error bars depict standard error. * p<0.05, ** p<0.01 (unpaired t-test).
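The editing-site classification used for Figure 11 can be sketched in a few lines. The example assumes edited mRNAs annotated as in Figure 8 (lowercase u for inserted residues, asterisks for deleted residues, uppercase U for encoded uridines) and assumes the two sequences already share the same A/C/G skeleton; the toy sequences and function names are hypothetical.

```python
from typing import Dict, List


def editing_profile(annotated: str) -> List[int]:
    """Signed editing count for each gap between A/C/G residues.

    +n means n U-residues inserted, -n means n deleted, 0 means unedited.
    Encoded (uppercase) U-residues are not editing events and are skipped.
    """
    profile = [0]  # gap preceding the first A/C/G residue
    for ch in annotated:
        if ch in "ACG":
            profile.append(0)  # open the gap that follows this residue
        elif ch == "u":
            profile[-1] += 1
        elif ch == "*":
            profile[-1] -= 1
    return profile


def classify_sites(seq_a: str, seq_b: str) -> Dict[str, int]:
    """Count editing sites by conservation class for two ACG-aligned mRNAs."""
    a, b = editing_profile(seq_a), editing_profile(seq_b)
    if len(a) != len(b):
        raise ValueError("sequences do not share the same A/C/G skeleton")
    counts = {"identical": 0, "altered length": 0,
              "type switched": 0, "one sequence only": 0}
    for x, y in zip(a, b):
        if x == 0 and y == 0:
            continue  # no editing in either sequence: not an editing site
        if x == y:
            counts["identical"] += 1
        elif x == 0 or y == 0:
            counts["one sequence only"] += 1
        elif (x > 0) != (y > 0):
            counts["type switched"] += 1
        else:
            counts["altered length"] += 1
    return counts


# Hypothetical toy sequences sharing the skeleton A-G-C-A-G-C.
toy_a = "AuuGC*AUGuuC"
toy_b = "AuuuGCuAuGuuC"
result = classify_sites(toy_a, toy_b)
percent_conserved = 100 * result["identical"] / sum(result.values())
print(result, percent_conserved)
```

In the toy pair, the uridine between the second A and G is encoded in one sequence and inserted by editing in the other, so the site counts as occurring in only one sequence even though the final mRNA is unchanged at that position, which mirrors the dominant type of change described above.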
Since differences in the types of amino acid mutations were observed, we performed a principal component analysis on the frequency of mutation types for all three reading frames of the pan-edited genes (Figure 12). In addition, we included the percentage of editing site conservation as a variable. This analysis clearly clustered the putative +1 dual-coding transcripts (reading frame 2). The first component (z-axis) is strongly based on editing conservation, and separates the dual-coding genes from the other pan-edited genes as expected. While component 2 (x-axis) separated ORF1 and ORF3 from ORF2 of each gene, component 3 clearly separated the dual-coding ORF2s from the nondual-coding ORF2s. The ND7N ORF3 was the only exception, and the gRNA data suggest that it is a dual-coding gene using the +2 (ORF3) reading frame. This suggests that an additional layer of mutational constraint beyond that imposed by the RNA editing process can be detected for six of the extensively edited transcripts.

Because dual-coding genes are often conserved in multiple species, we analyzed the available sequences of other kinetoplastids (Leishmania tarentolae (Lt), Leishmania mexicana amazonensis (Lma), Phytomonas serpens (Ps), Perkinsela CCAP1560/4 (Pk)) to determine if they also contain multiple overlapping reading frames with homology to those found in T. brucei. Interestingly, many of the alternative reading frames did show some homology to the ARFs found in Tb. However, most of these ARFs are punctuated with stop codons (APPENDIX H). Extended alternative reading frames are found in CR3, 5' ND7 and RPS12 in Ps. However, the extended ARF in the Ps CR3 is in the +2 reading frame, and the ND7 and RPS12 ARFs show very little homology with the Tb/Tv ARFs (APPENDIX H Parts A, D and F) [130]. Interestingly, while Perkinsela has lost many of the genes in the mitochondria, RPS12 was retained [131].
The Pk RPS12 ARF possesses a near full open reading frame with one stop codon three codons after an in-frame start codon near the 5' end of the edited transcript. This pattern is reminiscent of the conventionally edited sequences of CR3 and ND7, and could suggest that an alternative edit may remove the stop codon, allowing access to the ARF. The L. tarentolae CR4 orthologue also has two extended ORFs. Interestingly, the published sequence for Lt CR4 appears to switch between the two ORFs (the switch appears to occur in a stretch of 13 inserted Us) [77]. This may explain why only the carboxyl half of the published Lt CR4 showed good homology with Tb and Lma [132]. Translation of the Lt ARF does generate a protein with the N-terminus showing high homology to the conventional Tb and Lma CR4, while translation of the published ORF shows some homology to the Tb CR4 ARF (APPENDIX H Part B). These data are intriguing enough that these sequences should be re-examined.

Figure 12. Principal component analysis of frequency of amino acid mutation types and editing conservation between T. brucei and T. vivax pan-edited transcripts. A. First factorial plan (z-axis: first component, x-axis: second component, y-axis: third component). ND7N = ND7 5' editing domain, ND7C = ND7 3' editing domain. ORF2 and ORF3 are defined as the +1 and +2 reading frames, respectively. B. Histogram of eigenvalues for the first six components. Eigenvalues represent the amount of the variance accounted for by each component. C. Absolute contribution of each analyzed mutation frequency to components 1, 2, and 3. Amount of conservation was determined using the PAM 250 matrix as described in Figure 10. Mutation type: SC = strongly conserved, MC = moderately conserved, NC = not conserved, Non = nonsense, InDel = insertion or deletion. Editing conservation (EdCon) was determined using alignments of edited mRNAs (APPENDIX G). Aligned editing sites were characterized as identical or altered.
While most of the other pan-edited transcripts had multiple stop codons in the +1 and +2 reading frames, many did show good homology to the Tb ARF sequences. Particularly intriguing are the ND3, ND8 and ND9 alignments. While internal stop codons are found in the Tv, Lt and Lma ND9 ARFs, they show strong homology to the Tb ND9 ARF throughout the protein (APPENDIX H Part E). In ND3, the amino ends of the ARFs show strong homology between all four of the Trypanosoma and Leishmania species (APPENDIX H Part C). This homology decreases after an internal stop codon found in the same position in 3 of the 4 species. As ND8 is the only other pan-edited gene in Lt, Lma and Ps, we also examined the conservation of the ND8 ORF and ARF, even though our mutational analyses did not tag the ND8 gene as dual-coding. While the ND8 ARFs were punctuated by multiple stop codons, they surprisingly also showed areas of strong homology between all 5 species, especially downstream of an internal methionine (APPENDIX H Part G). We do note that we cannot rule out the possibility that alternative editing can remove stop codons observed in the ARFs.

Analyses of the ARF predicted proteins suggest that they are all short transmembrane proteins with two or more predicted transmembrane alpha helices (Figure 13) [133,134]. While functional homologues are often difficult to detect in trypanosomes, searches using the predicted protein sequence of each ARF did identify small molecule transport proteins with limited confidence. Using Phyre2, the ND7 ARF was identified as a homolog of the bacterial sugar transporter SemiSWEET (61.5% confidence) [135,136]. SemiSWEET, which forms homodimeric structures, is also a distant homolog of the yeast mitochondrial pyruvate carrier 1 (MPC1). This protein has two transmembrane alpha helices and forms a heterodimer with either of the other two pyruvate carrier proteins [137,138].
While still very speculative, it is intriguing that the small ARF proteins might oligomerize to form small mitochondrial membrane transporters.

Figure 13. Amino acid sequences of ARFs of dual-coding genes. Predicted transmembrane regions are shaded in gray [133,134]. *No start codon was identified and the amino acid sequence shown begins at the 5' end or after the first stop codon at the 5' end. Exclamation point indicates premature termination codon.

Discussion

The work presented here suggests that as many as six of the extensively edited mRNAs in T. brucei are dual-coding and that it is alternative editing using different terminal gRNAs that allows access to the two different reading frames. Deep sequencing of the 5' end of CR3 indicates that fully edited transcripts with access to both reading frames are present in the mitochondrial transcriptome, and gRNA analyses indicate that three different cell lines contain gRNAs that can alternatively edit the 5' ends of CR3, RPS12 and ND7. In addition, analyses of the mutational bias in pan-edited genes suggest that an additional layer of mutational constraint is present in the putative dual-coding genes. While the overall mutational frequency observed for the fully edited mRNAs is similar for all pan-edited genes, the types of amino acid changes that appear to be tolerated are significantly different. This is consistent with these genes having to maintain functional proteins in two different reading frames.

Analyses of other trypanosomes do show that some of the ARFs have intriguing homology to the ARFs identified in T. brucei and T. vivax. However, most of the ARFs are punctuated with stop codons. These data are difficult to interpret because we cannot rule out the possibility that the stop codons are removed by alternative editing events. In addition, the other trypanosome species have evolved very different infective life cycles and are under different selective pressures. For example, P.
serpens is a pathogen that infects important crops and is transmitted by sap-feeding bugs. These parasites have glucose readily available in both life cycle stages and are unique in that they lack a fully functional respiratory electron transport chain [64,65]. For Leishmania, all life cycle stages possess an active Krebs cycle and ETC linked to the generation of ATP [61,62,139]. These unique adaptations to different hosts suggest that they may not be under the same evolutionary pressure to maintain dual-coding genes.

Overlapping reading frames are common in viruses, and are thought to persist due to strong genome size constraints [87,88]. More recently, however, overlapping genes have been identified in mammalian and bacterial genomes [89–92]. In these organisms, size is not an issue and the potential advantage of overlapping genes is less clear. For dual-coding genes, the need to maintain both ORFs constrains the ability of each protein to become optimally adapted [93]. As this constraint can be alleviated by gene duplication, it is thought that dual-coding regions can survive long evolutionary spans only if the overlap provides a selective advantage. In mammals, many of the identified dual-coding genes, like Gnas1 and XBP1, produce two proteins that bind and regulate each other [94,95]. For these proteins, dual-coding may be advantageous for the tight co-expression needed. An alternative model suggests that under high mutation rates, the overlapping of critical nucleotide residues is advantageous because it may reduce the target size for lethal mutations [96]. This may be particularly important for organisms that have evolved to exist in dual-metabolic environments (two hosts). We hypothesize that the trypanosome mitochondrial ARFs encode small metabolite transporters that provide a distinct growth advantage to bloodstream form parasites.
The complete overlap of these small transporter genes with electron transport chain (ETC) genes would protect the integrity of the ETC genes that are required only in the insect host. Thus, in trypanosomes, dual-coding genes may be a mechanism to combat genetic drift during extended periods of growth in non-selective environments.

In T. brucei, it is known that a number of bloodstream form essential proteins are functionally linked to Krebs cycle or ETC genes. While not a “classic” dual-coding gene, in that production of the alternative protein does not involve overlapping reading frames, the pan-edited COIII gene does contain the information for two distinct proteins, COIII and AEP-1. AEP-1 is important for kinetoplastid DNA maintenance, and overexpression of the DNA-binding domain results in a dominant negative phenotype including decreased cell growth and aberrant mitochondrial DNA structure [68]. The nuclear-encoded α-ketoglutarate dehydrogenase E2 (α-KDE2) is known to be a dual-function protein, in that it plays important roles in both the Krebs cycle and in mitochondrial DNA inheritance [97]. RNAi knockdowns of this gene in bloodstream form (BF) trypanosomes also show a pronounced reduction in cell growth. Similarly, the Krebs cycle enzyme α-ketoglutarate decarboxylase (α-KDE1) is also a dual-function protein, with overlapping targeting signals that allow it to be localized to both the mitochondrion and glycosomes [98]. RNAi knockdown of α-KDE1 in BF trypanosomes is lethal, suggesting that in addition to its enzymatic role in the Krebs cycle, it plays an essential role in glycosomal function in T. brucei [98]. It has been previously suggested that both alternative editing and dual-function proteins are important mechanisms for expanding the functional diversity of proteins found in trypanosomes [67,97–99].
We hypothesize that in salivarian trypanosomes, an equally important role for these dual-coding/function genes may be the protection of genetic information.

The ‘why’ of the unique RNA editing process in kinetoplastids has been a long-standing paradox. The complex machinery and the sheer number of gRNAs required to direct the thousands of U-insertions/deletions indicate that this process is metabolically very costly. Initially, it was proposed that U-insertion/deletion editing (kRNA editing) was one of many RNA editing processes that were in fact relics of the RNA world. However, the very different mechanisms of the RNA editing systems in existence, and their very limited distribution within specific groups of organisms, indicate that they are more likely derived traits that arose later in evolution [69,70]. The sheer complexity of the kRNA editing process, with no obvious selective advantage, led to the proposal that insertion/deletion editing arose via a constructive neutral evolution (CNE) pathway [71]. Indeed, RNA editing in trypanosomes is always mentioned in support of CNE as an example of how seemingly non-advantageous, complex processes can arise [72,73]. More recently, however, it has been hypothesized that RNA editing co-evolved with G-quadruplex structures found in the pre-edited mRNAs [74]. These structures are thought to be advantageous in that they can help regulate transcription in order to promote DNA replication and prevent kinetoplast DNA loss. However, they must be removed by the RNA editing system prior to translation [74]. Another prominent hypothesis is that RNA editing is advantageous because it is a mechanism by which an organism can fragment and scatter essential genetic information throughout a genome [75,76]. Kinetoplast DNA is far less stable than chromosomal DNA, and loss of minicircles due to asymmetric division of the kDNA network has been frequently observed, especially in laboratory cultures of Leishmania [76,77]. Buhrman et al.
[76] suggest that the scattering of essential guide RNA genes throughout the DNA network would prevent fast-growing deletion mutants from outcompeting more metabolically versatile parasites during growth in the mammalian host. Using a mathematical model of gene fragmentation in changing environments (absence of functional selection), they showed a distinct advantage for gene fragmentation. In their model, the number of tolerable generations under periods of relaxed selective pressure was increased by more than 40% before loss of the ability to move to the next life cycle stage. If the dual-coding ARFs give BF trypanosomes a selective growth advantage similar to that observed for the COIII alternative protein AEP1, then the number of ‘essential’ gRNA genes would increase greatly. Currently, only AEP1, A6 and RPS12 mitochondrial genes have been experimentally shown to be essential [68,100,110]. In addition, the presence of alternative editing and dual-coding genes would complement the protection provided by gene fragmentation by also shielding the genes from deleterious point mutations within critical ETC genes. This suggests that the complex RNA editing system found in the mitochondria may therefore provide multiple molecular strategies to increase genetic robustness. Protection of the mitochondrial genome during growth in the mammal would increase the capacity for successful transfer to an insect vector and maximize the parasite's long-term survival and spread.

Acknowledgments

We thank the Ken Stuart Lab for trypanosome cell lines and Chris Adami for helpful discussions. We would also like to acknowledge the Dr. Marvis A. Richardson Endowed Fellowship Fund for their recognition of LEK.

CHAPTER 4: ANALYSIS OF THREE PAN-EDITED MRNAS REVEALS DUAL-CODING GENES AND COMPLEX MULTIPATH EDITING

Abstract

Trypanosoma brucei is a single-celled eukaryote that possesses a highly complex RNA editing system.
In this system, a large set of small RNAs, called guide RNAs, direct the insertion and deletion of uridines in mitochondrial mRNAs. These changes extensively alter the target mRNAs, up to doubling them in length. Recently, mutational analysis showed that several of the edited genes possess the capacity to encode two different protein products. These overlapping reading frames could be accessed through alternative RNA editing that shifts the translated reading frame. In this study, we analyzed the editing patterns of three putative dual-coding genes, ribosomal protein S12 (RPS12), the 5’ editing domain of NADH dehydrogenase subunit 7 (ND7 5’), and C-rich region 3 (CR3). We found evidence that fully edited ND7 5’ and CR3 can be translated in more than one reading frame. Moreover, we found that CR3 has a complex set of editing pathways that vary substantially between cell lines, and that changing available energy sources also alters the editing preferences of CR3 and ND7 5’. These findings suggest that editing patterns can be influenced by the current environment, and that alternative editing may be utilized by the trypanosomes to introduce variation within this fragile editing system.

Introduction

Trypanosoma brucei is a member of the Kinetoplastea, a group of protozoans characterized by a large network of DNA in their mitochondria known as the kinetoplast [1]. The kinetoplast is composed of two types of concatenated circular DNA molecules: maxicircles and minicircles. The maxicircles all encode mitochondrial ribosomal RNAs as well as 18 protein-coding genes, most of which are components of the electron transport chain. The approximately 30-50 identical copies of the maxicircle make up a relatively small proportion of the kinetoplast [5]. Most of the DNA network is composed of between 5,000 and 10,000 1 kb minicircles, each of which encodes 2-5 small non-coding guide RNAs (gRNAs) [8,33]. These gRNAs are used in the process of RNA editing. In T.
brucei, RNA editing consists of specific uridine insertion and deletion events that render 12 of the 18 mitochondrially encoded mRNAs translatable [4]. The gRNAs act as templates for the large editosome complex, which cleaves the mRNA, inserts or deletes the correct number of uridines and then re-ligates the mRNA in an energy-intensive process. This is repeated until the mRNA is complementary to the small gRNA. Each gRNA directs edits that generate the anchor region for the next gRNA; thus the RNA editing process is sequentially dependent on correct editing by each gRNA. As editing of some of the extensively edited mRNAs can involve upwards of 40 gRNAs, this renders the process incredibly fragile [140]. We hypothesize that such an expensive and fragile process evolved in response to the unique life cycle of T. brucei.

T. brucei is a dixenous parasite, invading the bloodstream of a mammalian host and being transmitted between hosts by the bite of a tsetse fly. Once taken up in a blood meal by the tsetse fly, T. brucei transitions into the replicating procyclic state in the midgut, and the energy T. brucei requires for this replication is gained through metabolism of amino acids [19,20]. This is accomplished through use of a portion of the Krebs cycle and the electron transport chain (ETC); thus most of the ATP required is produced by the mitochondria [19–21]. This stage of the life cycle is followed by a dramatic bottleneck when the trypanosomes transition from the midgut to the salivary glands of the tsetse fly [22,23]. From the salivary glands, trypanosomes are then refluxed into their next mammalian host during a blood meal. Once the parasite is deposited into its mammalian host, it quickly transitions to utilizing glycolysis for its energy generation, removing the requirement for ATP production in the mitochondria [15]. While in the mammalian host, T. brucei lives entirely extracellularly.
It is frequently subject to attacks by the host’s adaptive immune system, and the population evades these attacks through antigenic variation [12]. This part of the life cycle can be quite long, with the longest known infection lasting 29 years [13]. This life cycle should make T. brucei particularly sensitive to genetic drift, especially for those genes which are not under selection (Krebs cycle and ETC), and should make them extremely vulnerable to Muller’s ratchet (the gradual increase of mutational load that eventually leads to extinction) [81–84]. One mechanism for protecting small asexual populations is by increasing the severity of the mutations that can occur. If mutations severely impact fitness, mutated individuals are selected out, preventing their fixation [79]. Recently, computer modeling studies suggested that small asexual populations can evolve this type of mechanism (termed “drift robustness”) in order to maintain fitness [80]. The sequential dependence of the kRNA editing process implies that the system is inherently fragile to mutations. Even a single point mutation can drastically change the editing pattern and stop the editing process, aborting expression of the protein. Hence, the RNA editing process may operate as a proof-reading system to weed out mutations by making them lethal. This is effective, however, only if the mitochondrial genes are under selection.

Previously, we showed that many of the mitochondrially pan-edited genes have a distinct mutational bias that is suggestive of dual-coding genes (coding two proteins by overlapping reading frames) [141]. The overlapping of ETC genes not under selection in the bloodstream stage with genes that are under selection during this stage of the life cycle would prevent the accumulation of mutations. As the extensively overlapped genes share most gRNAs, this strategy would ensure that almost all of the genetic material is protected.
Our analyses suggested that out of the twelve pan-edited genes in T. brucei, six are potentially dual-coding, and that the RNA editing system is used to determine which reading frame is accessed. In order to determine if mRNA transcripts with access to multiple open reading frames (ORFs) exist within the mitochondrial transcriptome, we deep sequenced the mRNA transcript populations of three putative dual-coding genes: ribosomal protein S12 (RPS12), the 5’ editing domain of NADH dehydrogenase subunit 7 (ND7 5’), and C-rich region 3 (CR3). Using the previously generated gRNA transcriptomes, we constructed detailed editing pathways for each of these genes. The editing pathway of RPS12 was primarily linear, reflecting the high degree of conservation required for a gene that is essential [100,107]. We found no evidence of utilization of the gRNA that provides access to the alternative reading frame [141]. In contrast, we did identify transcripts using different reading frames for both CR3 and ND7 5’. This study indicates that RNA editing can be used to access multiple open reading frames using two different methods: in ND7 5’, different gRNAs bring alternate start codons into frame, and in CR3, different gRNAs can shift the reading frame of the existing start codon. In addition, CR3 showed incredible editing diversity, in that two different cell lines showed very different editing patterns, using different sets of gRNAs to edit the CR3 cryptogene. This suggests that the use of a gRNA-guided editing system can also dramatically increase protein diversity in spite of an incredibly rigid and mutationally fragile system.

Materials and Methods

T. brucei culture and RNA Isolation

T. brucei clones from strains EATRO 164 and TREU 667 were grown in SDM79 and harvested as previously described [9]. EATRO 164 cells grown in SDM79 were then gradually transitioned to SDM80 using serial 1:3 dilutions when cells reached a density of at least 5 × 10^6 cells/mL.
SDM80 was prepared as described by Lamour et al., with the exception of using undialyzed FBS and reducing the amount of FBS added by half [142]. This results in the final concentration of glucose being 0.5 mM instead of 0.15 mM. This concentration is still well below that of SDM79, which has a glucose concentration of 6 mM. Once cells had been acclimated to SDM80, cells were harvested as previously described [9]. Mitochondrial vesicles were isolated using differential spins, and mitochondrial RNA was then isolated from vesicles as previously described [9].

Preparation, Sequencing, and Analysis of mRNAs

cDNAs were generated from isolated RNAs using the Applied Biosystems High Capacity cDNA Reverse Transcription Kit. CR3, RPS12, and ND7 5' editing domain cDNAs were amplified via PCR using the following primers (underlined sequences are gene specific and non-underlined sequences are tag regions used in deep sequencing reaction):

CR3 5': ACACTGACGACATGGTTCTACAAGAAATATAAATATGTG
CR3 3' Short: TACGGTAGCAGAGACTTGGTCTACAAAAATTATTTGCATACTT
CR3 3' Extended: TACGGTAGCAGAGACTTGGTCTACAAAAATTATTTGCATACTTTTTT
RPS12 5': ACACTGACGACATGGTTCTACACTAATACACTTTTG
RPS12 3': TACGGTAGCAGAGACTTGGTCTAAAAACATATCTTAT
ND7 5': ACACTGACGACATGGTTCTACAGATACAAAAAAACATGAC
ND7 3': TACGGTAGCAGAGACTTGGTCTCTTTTATATTCACATAACTTTTCTGTAC

Amplified cDNAs from EATRO 164 cells grown in SDM79, EATRO 164 cells grown in SDM80, and TREU 667 cells grown in SDM79 were individually barcoded and combined in equal molar amounts. Samples were sequenced in a 2x250bp paired end format (PE250) using an Illumina MiSeq Standard flow cell and 500 cycle reagent cartridge, version 2. Sequence data was preprocessed as previously described [141]. Sequence data was then separated by cell line, growth media and gene. Sequence data was then analyzed using a new pipeline and program called SKETCH (Segmentation of Kinetoplast Edited Transcripts to Characterize editing Heterogeneity).
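As a rough illustration of the kind of block-level classification that SKETCH performs (detailed in the text that follows), the sketch below filters reads by comparing their uridine-free backbones to the unedited template and then labels each editing block. It is not the SKETCH code, which is available from the authors on request. The function names, the toy sequences, and the rule that a block ends at its last non-U base are all assumptions made for this example.

```python
from collections import Counter

MAX_BACKBONE_MISMATCHES = 5  # threshold stated in the text


def backbone(seq: str) -> str:
    """Remove every U, leaving the A/C/G backbone that editing does not change."""
    return seq.replace("U", "")


def passes_quality_filter(read: str, unedited: str) -> bool:
    """Keep a read whose U-free backbone has at most 5 mismatches to the template."""
    a, b = backbone(read), backbone(unedited)
    # Assumption: a length difference counts as one mismatch per missing base.
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches <= MAX_BACKBONE_MISMATCHES


def split_blocks(seq: str, cuts: list[int]) -> list[str]:
    """Cut a sequence after its n-th non-U base for each n in `cuts` (ascending)."""
    blocks, start, seen = [], 0, 0
    pending = list(cuts)
    for i, base in enumerate(seq):
        if base == "U":
            continue
        seen += 1
        if pending and seen == pending[0]:
            blocks.append(seq[start:i + 1])
            start = i + 1
            pending.pop(0)
    blocks.append(seq[start:])  # everything after the last cut is the final block
    return blocks


def classify_read(read: str, references: dict[str, str], cuts: list[int]) -> list[str]:
    """Label each block with the first reference it matches, else 'unknown'."""
    ref_blocks = {name: split_blocks(seq, cuts) for name, seq in references.items()}
    calls = []
    for i, block in enumerate(split_blocks(read, cuts)):
        label = next((n for n, b in ref_blocks.items() if b[i] == block), "unknown")
        calls.append(label)
    return calls


def most_common_unknown(reads, references, cuts):
    """Most abundant (block index, sequence) pair that matches no reference."""
    counts = Counter()
    for read in reads:
        blocks = split_blocks(read, cuts)
        for i, call in enumerate(classify_read(read, references, cuts)):
            if call == "unknown":
                counts[(i, blocks[i])] += 1
    return counts.most_common(1)


# Toy templates (hypothetical, not real T. brucei sequences).
references = {"unedited": "AGUUCGAC", "edited": "AUGCUUGUAC"}
print(classify_read("AGUUCUUGUAC", references, cuts=[3]))
```

In an iterative scheme like the one described, the sequence returned by `most_common_unknown` would be added to `references` under a new label and the reads reclassified.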
This program allowed us to classify mRNAs at the block editing level and determine which editing patterns were most prominent. For each set of sequences, SKETCH removed low-quality sequences containing more than 5 mismatches to the unedited template, disregarding uridines. In order to classify the editing patterns observed in the mRNA transcripts, SKETCH requires a set of template sequences. Initially, the templates supplied to SKETCH were the conventional fully edited and unedited sequences for each of the three genes examined. These sequences were then segmented based on the editing blocks previously defined by the locations of gRNA populations [9]. Each transcript was then classified by editing block, with each block being classified as matching the unedited sequence, matching the fully edited sequence or being unknown. After the initial characterization of the transcripts, the most abundant unknown sequences for each editing block were then added to the reference pool. Sequences were then reclassified by SKETCH based on the newly added reference sequences. This process was repeated until the most abundant forms of editing were identified. SKETCH code is available upon request. To validate the newly identified editing patterns as true alternatives, the new sequences were screened against the gRNA transcriptome as previously described [9]. Sequences with a gRNA match were then considered valid alternative edits.

Uridine deletion analysis

For RPS12 and ND7 5’, once full editing pathways were characterized, editing sites with DNA-encoded Us were identified. Each encoded U site was then characterized based on the proximity of the preceding gRNA’s 3’ poly-U tail as well as whether all uridines at the site in question were deleted in the final fully edited sequence. For each site, a window was defined consisting of the 6 sites upstream and 6 sites downstream of the site in question.
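The window definition above can be sketched as follows. This is a minimal illustration, assuming that an editing site is the position before each non-U base (plus the 3' end) and that a transcript can be summarized by the number of Us at each site. The function names and toy sequences are hypothetical, and reads are assumed to share the template's non-U backbone.

```python
def u_counts(seq: str) -> list[int]:
    """Number of Us at each editing site (before every non-U base, plus the 3' end)."""
    counts, run = [], 0
    for base in seq:
        if base == "U":
            run += 1
        else:
            counts.append(run)
            run = 0
    counts.append(run)
    return counts


def site_window(site: int, n_sites: int, flank: int = 6) -> range:
    """Site indices from `flank` upstream to `flank` downstream, clipped to the transcript."""
    return range(max(0, site - flank), min(n_sites, site + flank + 1))


def classify_window(read: str, unedited: str, edited: str, site: int, flank: int = 6) -> str:
    """Compare a read to both templates within the window around one encoded-U site."""
    r, u, e = u_counts(read), u_counts(unedited), u_counts(edited)
    window = site_window(site, len(u), flank)
    r_w = [r[i] for i in window]
    if r_w == [u[i] for i in window]:
        return "unedited"
    if r_w == [e[i] for i in window]:
        return "fully edited"
    return "partially edited"


# Toy example: two encoded Us at site 2 are deleted and three Us are inserted at site 4.
unedited = "AGUUCGAC"
edited = "AGCGUUUAC"
read = "AGUUCGUUUAC"  # insertion done, deletion not yet done
print(classify_window(read, unedited, edited, site=2, flank=2))
```

A narrow window reports only the local state of the site, whereas a wider one also captures editing at neighboring sites, which is why the same read can be called differently at different flank sizes.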
Using these parameters, the mRNA transcripts of RPS12 and ND7 were analyzed at each deletion window. The window of each transcript for each encoded U was examined and classified as unedited, fully edited or partially edited. Partially edited sequences were then classified based on the editing state of the encoded U site. For sequences with total deletions, each editing sequence was then classified based on the states in the 3' end of the window as either matching the fully edited sequence or not. Code available upon request.

Results

In order to confirm that transcripts with access to two reading frames exist in vivo, we analyzed the mRNA transcriptomes for three of the putative dual-coding genes, RPS12, ND7 5’ and CR3. This mRNA deep sequencing data was then used in combination with the sequenced gRNA transcriptomes to generate precise editing pathway maps. In order to determine how robust the observed editing pathways were, we characterized editing in two different cell lines, TREU 667 and EATRO 164. In addition, we examined the effect of energy source on these editing pathways by using two different media, SDM79 and SDM80. SDM79 is the standard medium used to grow the procyclic stage parasite. However, it contains 6 mM glucose, and experiments have shown that under these levels of glucose, the procyclic stage can grow in the absence of electron transport chain (ETC) activity [112,142–147]. The SDM80 medium was developed to more closely resemble insect gut conditions and has very low glucose concentrations [142]. Trypanosome growth in this medium requires ATP production using the ETC [142].

RPS12 is an essential component of the mitochondrial ribosome [30,100,107]. RPS12 is extensively edited (pan-edited) with 132 Us inserted and 28 Us deleted. Full editing is directed by 12 populations of gRNAs (defined as a group of gRNAs that edit the same region of an mRNA) [9,30].
In this analysis, we identify 10 populations, with three of the previously identified populations being combined with other populations that shared a very high amount of overlap. One new population (F) was identified through a search of the gRNA transcriptome under reduced stringency. Analyses of the canonical editing pattern indicate that there are two long ORFs, and mutational bias analyses indicate that both ORFs may be selected for [141]. The longest ORF encodes the RPS12 protein and encompasses a second shorter ORF of unknown function [30]. Northern blots revealed that edited RPS12 mRNAs were found in both life cycle stages; however, edited mRNAs were more abundant in bloodstream form than procyclic form trypanosomes [30]. Because RPS12 is essential, we expected it to have a very robust editing pattern in both cell lines, as well as under both energy conditions.

In contrast, neither ND7 nor CR3 appears to be essential in the insect stage of the parasite [112]. The canonical ND7 has two separate editing domains that are edited independently [27]. Interestingly, while the 3’ editing domain is fully edited only in the bloodstream life cycle stage, the 5’ editing domain is edited in both life cycle stages [5,27]. In addition, the mutational bias analyses indicate that only the 5’ editing domain has characteristics indicative of a dual-coding gene. The canonically edited CR3 is also a putative Complex I member (ND4L) and is preferentially edited in the BS stage [47,120]. Complex I has been shown to be non-essential in both life cycle stages, and other mitochondrially encoded complex I subunits, ND3, ND8, and ND9, have been shown to be preferentially edited in the bloodstream stage [5,26,28,29,111,112].

RNA-seq data was generated by reverse transcribing all mtRNAs using random primers.
For both RPS12 and CR3, transcripts were then selectively amplified using sequence specific primers targeted to the terminal 5’ and 3’ never edited regions so as not to bias against any possible editing pattern. For ND7, the 5’ editing domain was selectively amplified using sequence specific primers targeted to the 5’ never edited region and the homology region 3 (HR3) that separates the 5’ and 3’ editing domains [27]. The HR3 is a span of 59 nts that is also never edited, and hence should not bias the analysis. The targeted transcriptome libraries were generated from TREU 667 cells grown in SDM79 and EATRO 164 cells grown in SDM79 and SDM80. Additionally, for CR3, we generated another library using TREU 667 cell line mRNA by selecting for transcripts of a larger size, instead of taking transcripts of all sizes (SDM79). This allowed us to enrich the library for transcripts that had initiated the editing process. Amplified cDNAs were then gel purified, barcoded and combined in equal molar amounts for sequencing.

While the number of total reads obtained did vary by cell line and media used, surprisingly few transcripts were fully edited (canonical AUG + ORF). For both RPS12 and CR3, the majority of reads (>80%) were completely unedited (Table 7). CR3, which has previously been shown to be preferentially edited in the BS stage, had the lowest percentage of fully edited transcripts, with only 0.1% – 0.2% translatable transcripts detected in both cell lines and under both growth conditions. In contrast, while RPS12 had similar levels of unedited transcripts, a larger percentage of translatable transcripts were found. For this essential transcript, the number of fully edited transcripts differed between the two different cell lines: 2.3% in TREU 667 and 0.9% in EATRO 164.
Growth of the EATRO cells in low glucose media (SDM-80) did result in a substantial jump in both the number of transcripts that initiated the editing process and the number of fully edited transcripts (4.16%). This suggests that energy source may influence editing efficiency. While the predominance of completely unedited transcripts found for both CR3 and RPS12 was surprising, these numbers are in line with those found in other studies [131,148,149].

The ND7 5’ transcriptome analyses differed substantially from both RPS12 and CR3 in that the majority of these transcripts had initiated the editing process. The TREU cell line showed the highest editing efficiency, with ~80% of transcripts having initiated editing and 9.7% of the transcripts fully edited and translatable. In contrast, in EATRO cells, only 53% of the transcripts had initiated editing, and a scant 0.2% had completed the editing process. As with RPS12, we did see an increase of efficiency in the cells grown in SDM80, with over 70% initiating editing. However, even with the large increase in initiation of the editing process, only a scant 0.5% of transcripts were fully edited (Table 7). The sharp drop in the ability to complete the editing process appears to be due to loss of an optimal gRNA for one region of this transcript (described below).

Table 7. Editing efficiencies of RPS12 (A), ND7 (B), and CR3 (C).

| Transcript | Cell line and Media | Total # Reads | % unedited | % partial edited | % fully edited |
|---|---|---|---|---|---|
| RPS12 | Treu 667, SDM79 | 787,584 | 89.8% | 7.9% | 2.3% |
| RPS12 | Eatro 164, SDM79 | 846,549 | 92.6% | 6.5% | 0.9% |
| RPS12 | Eatro 164, SDM80 | 1,381,092 | 81.3% | 14.5% | 4.2% |
| ND7 5’ | Treu 667, SDM79 | 1,141,322 | 20.3% | 70.0% | 9.7% |
| ND7 5’ | Eatro 164, SDM79 | 915,610 | 47.0% | 52.8% | 0.2% |
| ND7 5’ | Eatro 164, SDM80 | 313,657 | 27.4% | 72.1% | 0.5% |
| CR3 | Treu 667, SDM79 | 18,832 | 84.9% | 15.0% | 0.1% |
| CR3 | Treu 667 enriched | 50,589 | 18.1% | 73.1% | 8.8% |
| CR3 | Eatro 164, SDM79 | 348,210 | 93.2% | 6.6% | 0.2% |
| CR3 | Eatro 164, SDM80 | 53,000 | 90.6% | 9.3% | 0.1% |
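The initiation figures quoted above follow from Table 7 by simple arithmetic: transcripts that initiated editing are the partially edited plus the fully edited ones. The short sketch below recomputes them for the three ND7 5’ libraries. The dictionary layout and function names are illustrative choices, and the completion rate is an additional derived quantity that is not reported in the text.

```python
# (% unedited, % partially edited, % fully edited) for ND7 5' from Table 7.
ND7_5PRIME = {
    "TREU 667, SDM79": (20.3, 70.0, 9.7),
    "EATRO 164, SDM79": (47.0, 52.8, 0.2),
    "EATRO 164, SDM80": (27.4, 72.1, 0.5),
}


def percent_initiated(row: tuple[float, float, float]) -> float:
    """Partially edited plus fully edited transcripts, as a percent of all reads."""
    _, partial, full = row
    return round(partial + full, 1)


def completion_rate(row: tuple[float, float, float]) -> float:
    """Percent of initiated transcripts that reached the fully edited state."""
    _, partial, full = row
    return 100.0 * full / (partial + full)


for library, row in ND7_5PRIME.items():
    print(f"{library}: {percent_initiated(row):.1f}% initiated, "
          f"{completion_rate(row):.1f}% of those fully edited")
```

Expressed this way, the contrast between the cell lines is starker than the raw percentages suggest: roughly one in eight initiated TREU transcripts is fully edited, against well under one in a hundred in either EATRO library.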
Editing Cascade and Reading Frame Analyses

In order to determine if the low editing efficiencies were due to any one step in the editing cascades, a full analysis of each editing step was done. For these analyses, we developed a pipeline that used our gRNA database to distinguish true alternative edits from both mis-edited and partially edited transcripts. This pipeline uses two programs, Segmentation of Kinetoplast Edited Transcripts to Characterize Editing Heterogeneity (SKETCH), and the gRNA database search program previously described [9]. The SKETCH program analyzes segments of transcripts that are defined by the relative range of coverage of each gRNA population used in conventional editing patterns. Block sequences are compared to both the unedited sequence and the fully edited conventional sequence and then classified into unedited, fully edited and “unknown” blocks. Once the most abundant sequences of all segments are identified, transcripts containing each of the most abundant “unknown” sequences are used as queries against the gRNA database. If a gRNA is identified that can generate the edit, the sequence is considered a true alternative edit. If no plausible gRNA is identified, the edit is considered a misedit or a junction, depending on the sequence and the status of other segments on that transcript with this sequence. By examining segments of transcripts independently, we were able to identify both branching and converging editing pathways.

RPS12 Analysis

As expected, the essential RPS12 showed the most robust editing path. In all three analyses, the majority of transcripts used the same series of 10 gRNA populations (A – J) (for full gRNA sequences and alignments see APPENDICES I and J). Use of the final gRNA population (gJ) in the cascade led to only the RPS12 ORF, and we found no evidence of an alternative AUG or frameshift leading to utilization of the second ORF.
We do note that there is a downstream start codon that, if translated, would be read in the alternative reading frame (APPENDIX J). While the editing cascades were relatively straightforward, we did see some minor deviations (Figure 14). Editing of block B could utilize a number of different gRNAs, including several that were used in one cell line only (dashed arrows). gRNAs B1 and B1* are variants of the same gRNA, with gB1* introducing a single amino acid (aa) change (V/Y) (Figure 15). While editing using the TREU specific gB3t and gB4t gRNAs led to a distinct editing “dead end” (dead end = disruption of the next canonical anchor sequence, and no detection of any further editing), the EATRO specific gB2e did not disrupt the editing cascade. Use of this variant, however, did introduce a frameshift seven amino acids (aa) from the C-terminus (Figure 15). Because gB2e did not disrupt editing, a significant percentage (5.3% in SDM79 and 7.6% in SDM80) of translatable RPS12 transcripts did contain the alternative C-terminus (J2 transcripts). This alternative C-terminus was previously reported in the 29-13 strain [148]; however, it appears to be absent in the TREU cell line.

A drop in editing efficiency was seen at the D to E block transition due to the incorrect utilization of the gFp guide RNA (Dx) that disrupted the editing cascade (Table 8). While mis-editing by gFp was limited in the TREU 667 cell line (7.1% of D-block edited transcripts), its use was much more prominent in the EATRO cell line (17.5%), leading to a significant drop in transcripts that could continue past D-block editing. Interestingly, growth in SDM80 led to a significant increase in mis-editing by gFp, with over 32.5% of transcripts using gFp incorrectly, resulting in a significant portion of dead-end transcripts. The EATRO cell line had additional minor dead-end pathways at the D to E transition.
Misediting by an ND7 gRNA (gEep) again disrupted any further editing, and mis-anchoring by the gE guide RNA (marked with box m) also led to the generation of an anchor sequence that could be used by an ND8 gRNA (gFep), disrupting any further editing.

Interestingly, the editing efficiency did not drop as transcripts transitioned to the next block of editing (Table 8). In EATRO-SDM80 cells, the editing efficiency at level F is ~5.6%, and at level G it actually increases to 5.9%. Editing efficiency at the block level is calculated based on the number of transcripts that match any of the fully edited sequences in that block, regardless of the condition of earlier blocks. Analysis of editing intermediates suggests that this increase occurs due to the ability of the downstream gRNA (the gFp population) to overwrite transcripts that have been previously edited through the G-level. Because of the overwriting gFp population, mRNAs exist that are fully edited at the G editing block but are in a transition state in block F.

Figure 14. Observed RPS12 editing pathways in the TREU 667 cell line (A) and the EATRO 164 cell line grown in SDM79 (B) and SDM80 (C). U = unedited transcripts. Dot sizes are proportional to the percent of block level edited transcripts using the gRNA indicated. Colored arrows indicate the gRNA population used. Dashed arrows with closed heads represent gRNA populations used in only one cell line (superscript ‘e’ or ‘t’). gRNA names with superscript ‘p’ represent promiscuous gRNAs. Dots enclosed by a red box represent end point mRNAs with no AUG start codon. gFp is a promiscuous gRNA that edits both in the D and F editing block of RPS12. Arrows with a boxed ‘m’ represent a gRNA that has mis-anchored.

Figure 15. Alignment of RPS12 proteins from T. brucei, T. vivax, Leishmania tarentolae, Leishmania donovani, and Leishmania amazonensis. Asterisks indicate identical residues, colons indicate conserved residues.
Highlighted amino acids are changes introduced by alternative editing (S>P, gGe; V>Y, gB1*). Alignment was generated by Clustal Omega [116]. RPS12 signature sequence is shown in bold [150].

T. brucei B1 --MWFLYGCCLRFVLFVLCYYMSPRLPSSGNRRVLYAVFYLYNFVWMLRCFFCC-FIGLVMSLFIIEGGGFVDLPGVKYYTRIVS---------
T. brucei B2FSe --MWFLYGCCLRFVLFVLCYYMSPRLPSSGNRRVLYAVFYLYNFVWMLRCFFCC-FIGLVMSLFIIEGGGFVDLPGYKILFTYCKLDLDIRYVF
T. vivax --MWFLYGCCLRFVLFVLCYYMSPRLPSSGNRRVLYAVFYLYNFVWLLRCFFCCVFFGLHLSLFIIEGGGFVDLPGIKYYTRMFIN--------
L. tarentolae MRVLFLYGLCVRFLYFCLVLYLSPRLPSSGNRRCLYAICYMFNILWFFC-VFCCVCFL-NHLLFIVEGGGFIDLPGVKYFSRFFLNA-------
L. donovani VRVLYLYGLCVRFLFFSLVLYLSPQLPSSGNRRCLYAISIMFNILWIFL-VFCCVFFV-VHLLFIVEGGGFIDLPGVKYFSRFFCKS-------
L. amazonensis VRVLYLYGLCVRFLFLCLVLYLSPRLPSSGNRRCLYAISIMFNILWYFL-VFCCFVFV-IFQLFIVEGGGFIDLPGVKYFSRFCNVS-------
: :*** *:**: : * *:**:*** **** ***: ::*::* : .*** : ***:*****:**** *

Table 8. Editing efficiency for each RPS12 gRNA population. Percentages were calculated based on the number of transcripts that had completed each editing level out of the total number of RPS12 transcripts.

| Block | Percent complete editing of block: TREU 667 (SDM 79) | EATRO 164 (SDM 79) | EATRO 164 (SDM 80) |
|---|---|---|---|
| Initiated Editing | 10.2 | 7.4 | 18.7 |
| A | 8.9 | 4.6 | 15.9 |
| B | 6.9 | 4.4 | 14.6 |
| C | 6.0 | 4.2 | 13.5 |
| D | 4.9 | 3.1 | 11.7 |
| E | 4.5 | 2.2 | 6.9 |
| F | 3.8 | 1.4 | 5.6 |
| G | 3.7 | 1.4 | 5.9 |
| H | 3.7 | 1.3 | 5.7 |
| I | 3.4 | 1.2 | 5.4 |
| J | 2.3 | 0.9 | 4.2 |

The only other minor variation was the use of the EATRO specific gGe guide that occurs in a highly cytosine-rich region (Figure 16). Previous examinations of the gRNA coverage in this region identified only rare gRNAs with multiple C:A basepairs, alignment mismatches and with gaps between adjacent gRNAs [9,119]. While this analysis did extend the identified gRNA population and eliminated the gap region, we did not identify either mRNA sequences or gRNAs that improved the alignment mismatches (Figure 16). The use of alternative base pairs is not unheard of.
A study of in vitro deletions found that alternative base pairs such as C:A, C:U, and C:C were tolerated to varying extents [151]. Interestingly, this portion of RPS12 encodes the signature sequence, which is nearly universal [150]. Use of the gGe variant gRNA results in a single point mutation, substituting a proline in place of a serine within this important sequence.

Figure 16. Regions with poor gRNA coverage and functionally conserved residues in RPS12. Functionally important aa residues are underlined [150]. Pipes (|) indicate Watson/Crick base pairs and colons (:) indicate G/U base pairs. Red highlighted hashtags (#) indicate gaps or mismatches, green highlighted # indicate C:A basepairs. The substitution introduced by use of the gGe gRNA is highlighted in yellow (S>P).

ND7 5’ analysis

Analyses of the ND7 5’ targeted transcriptomes indicate that full editing of the 5’ domain requires five gRNA populations for both cell lines (Figure 17, APPENDIX K). Two variants of the terminal population (gE1 and gE2) were identified that resulted in different 5’ terminal editing patterns (APPENDIX L). Translation of these editing patterns yields two different protein products in two different reading frames, one (RF1) encoding the canonical ND7 protein (E1) and the other (RF3) encoding a putative metabolite transporter (E2, see below) [141].
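To make the notion of one transcript carrying ORFs in different reading frames concrete, the sketch below scans a sequence in each of the three frames and reports the longest run of codons from a start codon to the next stop. It is only an illustration under stated assumptions: the sequence is a made-up toy, the frames are numbered by 0-based offset (which may not correspond to the RF1 to RF3 labels used here), and the stop codon set assumes that UGA is read as tryptophan rather than stop in the trypanosome mitochondrion.

```python
# Assumption: UAA and UAG are the only stop codons; UGA is treated as a sense codon.
STOP_CODONS = {"UAA", "UAG"}


def longest_orf(rna: str, frame: int, start: str = "AUG") -> str:
    """Longest start-to-stop codon run in one frame (offset 0, 1, or 2).

    The stop codon is not included. An ORF still open at the 3' end of the
    fragment is reported as running to the end.
    """
    codons = [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]
    best, current = [], None
    for codon in codons:
        if current is None:
            if codon == start:
                current = [codon]  # open a new ORF at the first start codon
        elif codon in STOP_CODONS:
            if len(current) > len(best):
                best = current
            current = None  # close the ORF and look for the next start
        else:
            current.append(codon)
    if current is not None and len(current) > len(best):
        best = current
    return "".join(best)


toy = "AUGGCAUGAUUUUAGCC"  # hypothetical sequence, not a real edited mRNA
for frame in range(3):
    print(frame, longest_orf(toy, frame) or "(no ORF)")
```

In this toy, offsets 0 and 2 each contain an AUG-initiated ORF while offset 1 has none, so a one- or two-nucleotide shift in U content upstream would change which protein is read. Passing `start="UUG"` allows a non-canonical start codon to be explored in the same way.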
While transcripts for both open reading frames were found in both cell lines, there were notable differences in the populations. The TREU 667 cell line had the highest editing efficiency, with over 80% of the transcripts initiating the RNA editing process and ~9.7% of the transcripts fully edited through Block E. Use of the gE1 or gE2 gRNAs appeared to be equally efficient, resulting in nearly equal amounts of RF1 and RF3 fully edited transcripts. A small percentage of transcripts (4.9% of transcripts that completed block E editing) were observed that appeared to be mis-edited by a TREU specific gRNA (gE4t), leading to a dead-end product (no ORF). In addition, gE4t also appeared to be able to overwrite editing directed by gE2, to generate a small number of transcripts that could be translated in RF2 (pink E3t).

[Figure 16 alignment: RPS12 protein and edited mRNA sequence aligned with gRNAs gI, gFp, gH, gE, gG, and gGe.]
Protein: L C Y Y M S P R L P S S G N R R V L Y A V F Y L Y N F V W M
mRNA: uuAuGuuAuuAuAuGAGuCCG**CGAuuGCCCAGuuCCGGuAACCGACGuGuAuuGuAuGC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAu

Figure 17. Observed ND7 5’ editing pathways in the TREU 667 cell line (A) and the EATRO 164 cell line grown in SDM79 (B) and SDM80 (C). U = unedited transcripts. For arrow and gRNA naming descriptors see Figure 14. Dots enclosed by a red box represent end point mRNAs with no AUG start codon. + indicates that more than one mRNA form was condensed into this circle to simplify the figure (See APPENDIX M). Condensed forms encode largely the same amino acid sequence with only small variants.
Terminal dots are colored blue for reading frame 1, magenta for reading frame 2, or green for reading frame 3. Boxed green dots have no functional start codon, but are translatable into reading frame 3 with the use of an alternative start codon (UUG).

While the number of “dead-end” pathways was very limited in the TREU cell line, use of the gC guide RNA population appeared to be very inefficient, resulting in a large drop in the percent of Block C-edited transcripts (25.8% drop, Table 9). A mutant gC gRNA (gCFSt) did result in a small percentage of transcripts with a frameshift C-terminus. Interestingly, while 9.1% of C block transcripts used the gCFSt gRNA, only 2.4% of the transcripts that have completed D-block editing come from this minor branch. This suggests that this alternative edit decreases the efficiency of use of the subsequent gRNAs.

In contrast, full editing of the ND7 5’ domain in the EATRO 164 cell line was very inefficient. While transcripts were able to initiate the editing process relatively efficiently (~50 – 70%, dependent on growth medium used), less than 1% of ND7 transcripts were fully edited at level E (Table 7). This appears to be due to the use of several EATRO-specific gRNAs that disrupt further editing (Table 9). Again, the largest drop in editing efficiency occurred at the B to C-block transition. In addition, the EATRO-specific use of gBex, gC1ex and gC2ex all disrupted the editing cascade (Table 9). This compounded the editing efficiency problem, with a majority of C-block edited transcripts (47% in SDM79 and 73.4% in SDM80) no longer editing competent. The 5’ end of ND7 has multiple AUG sequences not created by the editing process. Translation predictions of these editing-blocked transcripts (C1ex, C2ex) indicate that they do have ORFs that extend through the HR3 region.
The protein product of Bex transcripts is in the ARF, but is ten amino acids shorter, while the C1ex and C2ex products, which translate in the canonical ND7 reading frame, produce proteins that are both three amino acids shorter. Further drops in efficiency occurred due to an anchor mismatch (A:A) found in the gD guide RNA population (Figure 18, APPENDIX L). While the gD mutation is also observed in TREU, this cell line contains a sizable population of non-mutated gD guide RNAs. Editing by the gE4e guide results in a transcript with no in-frame AUG. However, translation of this transcript (E4e) in RF3 has no stop codons and we cannot rule out the possibility of a non-canonical start codon.

Table 9. Editing efficiency for each ND7 5’ domain gRNA population. Percentages were calculated based on the number of transcripts that had completed each editing level out of the total number of ND7 transcripts.

Percent complete editing of block
Block               TREU 667 (SDM 79)   EATRO 164 (SDM 79)   EATRO 164 (SDM 80)
Initiated Editing   79.7                52.4                 72.6
A                   45.8                47.0                 68.5
B                   44.7                45.8                 66.7
C                   18.9                13.4                 16.9
D                   11.7                0.4                  1.1
E                   9.7                 0.2                  0.5

Figure 18. Regions with poor gRNA coverage and functionally conserved residues in ND7 5’. Functionally important residues are underlined [152]. Pipes indicate Watson/Crick base pairs and colons indicate G/U base pairs. Red highlighted hashtags (#) indicate gaps or mismatches; green highlighted # indicate C/A base pairs.

Similar to RPS12, ND7 5’ has a cytosine-rich region with poor gRNA coverage (Figure 18). This cytosine-rich region contains two conserved residues involved in ND7 function and coincides with the C level of editing in the editing pathways, where the largest drop in editing efficiency is observed (Table 9) [152]. The gC gRNA population is relatively rare (only 114 reads found in the TREU gRNA transcriptome, and 6185 reads in EATRO) and has five nucleotide mismatches with the conventional ND7 sequence (including C:A base pairs; Figure 18).
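The pairing notation used in the gRNA alignment figures (pipes for Watson/Crick pairs, colons for G/U pairs, # for gaps or mismatches) can be generated mechanically once an mRNA and a gRNA have been aligned. The sketch below is a minimal illustration under stated assumptions: the two strings are already aligned column by column (mRNA written 5' to 3', gRNA written 3' to 5'), gRNAs may be given in DNA letters, and the function names are hypothetical rather than part of any program described here.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def to_rna(seq):
    """Upper-case the sequence and convert DNA letters (T) to RNA (U)."""
    return seq.upper().replace("T", "U")


def annotate_pairs(mrna, grna):
    """Return one symbol per aligned column.

    '|' = Watson/Crick pair, ':' = G/U pair, '#' = gap or mismatch.
    Assumes both strings are pre-aligned and of equal length.
    """
    mrna, grna = to_rna(mrna), to_rna(grna)
    if len(mrna) != len(grna):
        raise ValueError("sequences must be aligned to the same length")
    symbols = []
    for m, g in zip(mrna, grna):
        if (m, g) in WATSON_CRICK:
            symbols.append("|")
        elif (m, g) in WOBBLE:
            symbols.append(":")
        else:
            symbols.append("#")
    return "".join(symbols)


def ca_positions(mrna, grna):
    """0-based columns where a C faces an A (the C:A pairs flagged in green)."""
    pairs = zip(to_rna(mrna), to_rna(grna))
    return [i for i, pair in enumerate(pairs) if pair in {("C", "A"), ("A", "C")}]


if __name__ == "__main__":
    # Toy example, not a sequence from this study.
    print(annotate_pairs("GUCCA", "CGAGT"))
    print(ca_positions("GUCCA", "CGAGT"))
```

Counting `#` symbols in the returned string gives the number of gaps and mismatches for a gRNA, and `ca_positions` separates out the C:A columns that the figures highlight differently.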
Figure 18 alignment (ND7 5’):
Protein: H L Y R F T F G P Q H P A A H G V L C C L L Y F C G E F I V
mRNA: ACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG***CAGCACAuG**GuGuuuuAuGuuGuuuAuuGuAuuuuuGuGGuGA*AuuuAuuG
gB: 15TAATAAGTGATATGAAGATGCCATT-TAGATAGC
gC: 19TAAT-TAGATGTT-TAGAGC----TCATGTAC--CGTGAAATACAACAAATAATATA
gD: TGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA

CR3 Analysis

Previous work indicated that C-Rich region 3 is a putative Complex I member and that it is preferentially edited in the bloodstream stage [47,120]. However, CR3 gRNAs are present in both life cycle gRNA transcriptomes, and PCR-amplified 5’ edited transcripts were successfully cloned and sequenced from the TREU 667 procyclic cell line [9,119,141]. These studies indicated that multiple forms of the mRNA exist that use different reading frames, suggesting that CR3 is dual-coding and that selection of the terminal gRNA determines which reading frame will be used [141]. In this study, we used primers flanking the editing domain in order to analyze the entire CR3 sequence. Interestingly, while 15% of the TREU CR3 transcripts had initiated the editing process, only 2.2% had completed editing by the initiating gA guide RNAs (Table 10). This suggests that the large drop in editing efficiency is due to incomplete editing by the block A guides. These gRNAs are fairly abundant, and we see no alignment issues, so it is unclear why editing of Block A is so inefficient (APPENDIX N).

Table 10. Editing efficiencies by block level of CR3. Transcripts whose gRNAs covered two blocks (DEs and FGs) were included in both blocks for this calculation.
Percentage Complete
Level               TREU 667   TREU 667 (Enriched)   EATRO 164 (SDM 79)   EATRO 164 (SDM 80)
Initiated editing   15.1       81.9                  6.8                  9.4
A                   2.2        68.3                  1.9                  2.3
B                   1.8        56.6                  1.2                  1.8
C                   1.7        54.6                  0.9                  1.5
D                   1.6        52.0                  0.6                  1.1
E                   1.2        39.1                  0.6                  1.0
F                   0.8        22.3                  0.2                  0.4
G/FG                0.4        8.8                   0.2                  0.3

While the percentage of fully edited transcripts was very low (0.2 – 0.4%, Table 10), we were again able to identify the major 5’ alternative editing patterns that direct translation to either the ORF or the +1 Alternative Reading Frame (RF2). To increase the robustness of the analyses, we also generated a biased CR3 transcriptome by size selecting for longer transcripts during the amplification process. Analyses of the TREU transcriptome indicate that the full CR3 editing pathway has multiple branches, resulting in a total of 12 major forms of fully edited CR3 (Figure 19). These 12 forms consist of three major 5’ editing patterns, paired with any of four different 3’ editing patterns. The two initiating gRNAs identified (gA1 and gA2) direct identical editing patterns, except that gA2 inserts an additional three U-residues (+1 phenylalanine). The gB guide RNAs all anchor in different areas (Figure 20A) and do introduce substantial AA changes near the 3’ end (Figure 20B). However, all gB guide RNAs generate the anchor binding site (ABS) that is recognized by gC, hence all 4 nodes merge to a common sequence guided by gC and gD (APPENDIX N).

The 5’ end editing patterns begin to diverge after Block D editing. FGtx transcripts are generated by the use of two subsequent gRNA populations, gEt and gFGtxp. gFGtxp is a promiscuous gRNA (previously identified as an ND7 gRNA) that spans both the F and G editing blocks. These transcripts were more abundant than both Gt and FGt; however, final editing using this gRNA does not generate an AUG start codon.
It has been proposed that trypanosomes can use UUG as an alternative start codon, thus we cannot rule out the possibility that FGtx transcripts can be translated (Figure 20B). Analyses of intermediates suggest that the gE guide (red arrow) can in fact “overwrite” gEt, indicating that a proportion of these may still be re-edited into other forms. Editing via the gE population required an additional gRNA to generate the anchor for either gFt or gFGtp. Generation of Gt transcripts (canonical CR3) requires 2 additional gRNAs, while FGt (+1 ORF) transcripts are generated by a single gRNA population (gFGtp), another promiscuous gRNA (CR4).

Figure 19. Observed CR3 editing pathways in the TREU 667 cell line. U = unedited transcripts. For arrow and gRNA naming descriptors see Figure 14. Dots enclosed by a red box represent end point mRNAs with no AUG start codon. Terminal dots are colored blue for reading frame 1 or magenta for reading frame 2. Boxed magenta dots have no functional start codon, but are translatable into reading frame 2 with the use of an alternative start codon (UUG).

Surprisingly, when we examined the editing pathways of CR3 in the EATRO 164 libraries, we discovered that while three of the four initial 3’ editing patterns were found in this library, editing beyond those patterns was completely divergent (Figure 21, APPENDIX O). A completely different set of gRNAs was used to generate fully edited CR3 transcripts (APPENDIX P). The divergent pathway did show some superficial similarities to the editing patterns observed in TREU 667 cells. While both the B1 and B2 transcripts were directly edited by gCe, the B4 transcripts required an additional gRNA to generate the ABS recognized by gCe. In EATRO cells, B4 transcripts could be edited by 3 different gRNAs (gB5e, gB6e and gB7e). While gB7e disrupted editing, both gB5e and gB6e generated the anchor that could be used by either gC or gCe. Surprisingly, while the conventional CR3 gC guide RNA was clearly used by B4 transcripts, we saw no evidence of its use in the B1/B2 pathways. Transcripts using gC could be further extended by both gD and gE guide RNAs; however, no evidence of editing beyond the gE guides was observed. In contrast, transcripts edited by the alternative gCe guide RNA population could be extended by a series of additional guide RNAs, generating transcripts with functional AUG start codons.

Figure 20. Four different 3’ end sequences found in the TREU 667 transcriptome for the CR3 transcript (A) and CR3 protein sequences (B). U-residues inserted by editing are indicated by lowercase; different sequences created by the different gRNAs are highlighted in red. Thick underlined sequence indicates the anchor binding site (ABS) for the initiating gRNAs (gA1 and gA2). Green = ABS for gB1B2; Blue = ABS for gB3; Purple = ABS for gB4. Bolded amino acids show sequence variants and shaded sequence shows position of predicted transmembrane domains [134].

A.
B1 AuUGuuGuGuuuuAuAuuACAGAuuuuuAGuGuuAuCA---uUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAA
B2 AuUGuuGuGuuuuAuAuuACAGAuuuuuAGuGuuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAA
B3t AuUGuuGuGuuuAuAuuAuuuCAGAuuuuAuGGuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAA
B4B4’ AuUGuuGuGuuuAuuuuuuuuuuuuAUUUUAuCAuuuGAuAuGuuGuuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAA

B.
ORF
G1t MFDCLVLLFFYCLFVHFFCFLFVCDLFLCLLFSFCFLLDFCFLFNMGLLLCFILQIFSVII-IIVYKFSLLD
G2t MFDCLVLLFFYCLFVHFFCFLFVCDLFLCLLFSFCFLLDFCFLFNMGLLLCFILQIFSVIIFIIVYKFSLLD
G3t MFDCLVLLFFYCLFVHFFCFLFVCDLFLCLLFSFCFLLDFCFLFNMGLLLCLYYFRFYGIIFIIVYKFSLLD
G4t MFDCLVLLFFYCLFVHFFCFLFVCDLFLCLLFSFCFLLDFCFLFNMGLLLCLFFFFILSFDMLLSFLLLYISFRY
ARF
FG1t MCMIYKNNVYVVVLFWFWLYIFFVFYLFVICFYVCYLVFVFYWIFVFYLIWVYCCVLYYRFLVLS-LLLYISFRY
FG2t MCMIYKNNVYVVVLFWFWLYIFFVFYLFVICFYVCYLVFVFYWIFVFYLIWVYCCVLYYRFLVLSFLLLYISFRY
FG3t MCMIYKNNVYVVVLFWFWLYIFFVFYLFVICFYVCYLVFVFYWIFVFYLIWVYCCVYIISDFMVSFLLLYISFRY
FG4t MCMIYKNNVYVVVLFWFWLYIFFVFYLFVICFYVCYLVFVFYWIFVFYLIWVYCCVYFFFLFYHLICCYHFYYCI
FG1tx LVVYCVYHCIFLWIFVYVCYLVFVFYWIFVFYLIWVYCCVLYYRFLVLS-LLLYISFRY
FG2tx LVVYCVYHCIFLWIFVYVCYLVFVFYWIFVFYLIWVYCCVLYYRFLVLSFLLLYISFRY
FG3tx LVVYCVYHCIFLWIFVYVCYLVFVFYWIFVFYLIWVYCCVYIISDFMVSFLLLYISFRY
FG4tx LVVYCVYHCIFLWIFVYVCYLVFVFYWIFVFYLIWVYCCVYFFFLFYHLICCYHFYYCI

However, many of the gRNAs used were promiscuous, in that they had been previously identified as gRNAs of other transcripts. As with the TREU editing pathway, we observe transcripts capable of being translated in two reading frames, with the FGe mRNAs translating in RF1 and the FGe* mRNAs translating in RF2 (Figure 21A, Figure 22). In addition, the Ge mRNAs, while not having a functional “AUG”, do translate into RF2 if the first “UUG” is used. As with ND7, we observed a shift in editing pattern preference when the EATRO 164 cells were changed from SDM79 medium to SDM80. Interestingly, a new fully edited form of CR3 appeared in the EATRO 164 SDM80 library only. The gRNA gCe80 is used in the EATRO SDM79 pathway, but editing appears to cease here. Cells grown in SDM80 continue this editing pathway with two additional gRNAs, gDEe80 and gFGe80 (Figure 21B). This mRNA is translatable, but produces a distinctly different and shorter protein product (Figure 21A). The protein products of the two different cell lines are highly dissimilar.
Using bioinformatics tools to predict the secondary structure of these proteins, we find that the difference is most noticeable in the RF1s of the two cell lines (Figure 23). Interestingly, the RF2s have a very similar predicted secondary structure. This evidence suggests that the two different cell lines are able to use the CR3 transcript to create distinctly different protein products.

Figure 21. Observed CR3 editing pathways in the EATRO 164 cell line grown in SDM79 (A) and SDM80 (B). U = unedited transcripts. For arrow and gRNA naming descriptors see Figure 14. Dots enclosed by a red box represent end point mRNAs with no AUG start codon. Terminal dots are colored blue for reading frame 1 or magenta for reading frame 2. Boxed magenta dots have no functional start codon, but are translatable into reading frame 2 with the use of an alternative start codon (UUG). + indicates that more than one mRNA form was condensed into this circle to simplify the figure (See Figure 22). Condensed forms encode largely the same amino acid sequence with only small variants.

Figure 22. Alignment of CR3 predicted protein variants from the EATRO 164 cell line. Bolded amino acids show sequence variants and shaded sequence shows position of predicted transmembrane domains [134].

Figure 23. Predicted secondary structures of most abundant CR3 predicted proteins. Secondary structure predictions were generated by RaptorX [153–155]. Shaded regions indicate transmembrane alpha helices predicted by Phobius [134].
ORF
FG1e MCMIYKYYHICVRWDFGDHCLFGCYELYFMFCYGYCFLFNMGLLLCF--ILQIFSVII-IIVYKFSLLD
FG2e MCMIYKYYHICVRWDFGDHCLFGCYELYFMFCYGYCFLFNMGLLLCF--ILQIFSVIIFIIVYKFSLLD
FG5e MCMIYKYYHICVRWDFGDHCLFGCYELYFMFCYGYCFLFNMGLLLCLYYLCIFIVVIIFIIVYKFSLLD
ARF
FG1e*v1 MCMIYKLTIVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVL-YYRFL-VLS-LLLYISFRY
FG2e*v1 MCMIYKLTIVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVL-YYRFL-VLSFLLLYISFRY
FG5e*v1 MCMIYKLTIVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVYITYVFLLLLSFLLLYISFRY
FG6e*v1 MCMIYKLTIVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVYIILCIFIVVIIFIIVYKFSLLD
FG1e*v2 MCMIYKLTYVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVL-YYRFL-VLS-LLLYISFRY
FG2e*v2 MCMIYKLTYVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVL-YYRFL-VLSFLLLYISFRY
FG5e*v2 MCMIYKLTYVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVYITYVFLLLLSFLLLYISFRY
FG6e*v2 MCMIYKLTYVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVYIILCIFIVVIIFIIVYKFSLLD
FG1e*v3 MCMIYKNIFVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVL-YYRFL-VLS-LLLYISFRY
FG2e*v3 MCMIYKNIFVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVL-YYRFL-VLSFLLLYISFRY
FG5e*v3 MCMIYKNIFVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVYITYVFLLLLSFLLLYISFRY
FG6e*v3 MCMIYKNIFVLGGILVIIVYLVVMSCILCFVMVIVFYLIWVYCCVYIILCIFIVVIIFIIVYKFSLLD
G1ex LLFGVLFLICFVYFIVYLVVMSCILCFVMVIVFYLIWVYCCVL-YYRFL-VLS-LLLYISFRY
G2ex LLFGVLFLICFVYFIVYLVVMSCILCFVMVIVFYLIWVYCCVL-YYRFL-VLSFLLLYISFRY
G5ex LLFGVLFLICFVYFIVYLVVMSCILCFVMVIVFYLIWVYCCVYITYVFLLLLSFLLLYISFRY
G6ex LLFGVLFLICFVYFIVYLVVMSCILCFVMVIVFYLIWVYCCVYIILCIFIVVIIFIIVYKFSLLD
EATRO SDM 80 Only
FGe80 MCMIYKNNGSCGFVGWFRLGYCYCECCSFCMIIL

Deletion of Encoded Uridines Directed by the Poly-U Tail

During the analyses of the mRNA deep sequencing data, we observed many partially edited transcripts where the deletion of encoded uridines appeared to occur early (prior to 3’ insertion events). We hypothesize that the poly-U tail of preceding gRNAs may direct the removal of these encoded uridines.
To test this hypothesis, we examined partially edited RPS12 and ND7 5’ transcripts with early deletion events, and determined their relative proximity to the preceding gRNA (Figure 24). (CR3 was excluded from this analysis due to the variability in gRNA location caused by the multiple branching editing pathways.) These analyses revealed that for both RPS12 and ND7 5’, encoded U sites that are near the poly-U tail of the preceding gRNA (U-Tail accessible sites) have a higher frequency of total deletions in the partially edited transcripts. This suggests that the proximity of the preceding gRNA’s poly-U tail does impact deletion of encoded uridines and supports our hypothesis that poly-U tails can guide the deletion of encoded uridines.

Figure 24. Frequencies of early total deletions of DNA encoded uridines in partially edited ND7 and RPS12 transcripts. Editing sites with DNA encoded Us were identified, and each encoded U site was then characterized based on the editing access of the preceding gRNAs’ poly-U tail (U-Tail Accessible or Not Accessible) as well as whether the site in question was totally deleted in the final fully edited sequence (Correct Total Deletion or Incorrect Total Deletion). For each site, a window was defined consisting of the 6 sites upstream and 6 sites downstream of the site in question. Each editing sequence was then classified based on the states in the 3' end of the window as either matching the fully edited sequence or not. Error bars depict standard error. (* = p < 0.05, ** = p < 0.01; unpaired t-test)

Discussion

In this work, we developed a new transcriptome analysis pipeline to fully characterize the editing pathways for three putative dual-coding genes, RPS12, ND7 5’ and CR3. The pipeline uses a new program, SKETCH, in combination with our database search program [9].
Combining these two programs allowed us to separate true alternative edits from partially edited transcripts and allowed the precise mapping of the full progression of the editing process. This characterization was done in two different cell lines (TREU 667 and EATRO 164) and under different energy conditions in order to determine the robustness of the editing process. Surprisingly, distinct differences in both editing progression and editing efficiency were observed in the two different cell lines. In addition, growth of parasites under different energy conditions also appeared to be able to influence the editing process. In both cell lines, the editing process appeared to be very inefficient, with most of the transcripts completely unedited. A comparison of the two cell lines grown in SDM79 did suggest that overall, the TREU 667 cells were more efficient in editing these three pan-edited transcripts. However, when the EATRO 164 cells were transferred from a high glucose medium (SDM79) to a glucose-restricted medium (SDM80), the number of transcripts that initiated the editing process more than doubled. For RPS12, the increase in editing initiation resulted in a 4-fold increase in the number of fully edited and translatable mRNAs.

Of the three transcripts characterized, the essential RPS12 showed the most robust editing progression. Editing of RPS12 is relatively linear, with only a few minor branching alternatives. For this mRNA, the first start codon found on the fully edited transcripts consistently translated into the canonical RPS12 open reading frame, and we found no evidence of transcripts that access the alternative reading frame. The most prominent alternatively edited branch was only observed in the EATRO 164 cell line and causes a frame-shift that extends the reading frame at the 3’ end (Figure 14 B2e). Interestingly, this same alternative edit was previously described by Simpson et al.
in the 29-13 strain, which shows that this edit is not an isolated occurrence in the EATRO 164 strain [148]. They observed the alternative at a low abundance, which agrees with our observations as well. Their analysis identified a large amount of variance at the 5’ end, with 5.7% of transcripts being translatable in the canonical ORF. We did observe sloppy editing at the 5’ end, but found two primary forms of 5’ end RPS12 editing. Interestingly, these data indicate that the 29-13 strain has a similar editing efficiency to the TREU 667 strain and the EATRO 164 strain grown in SDM80.

In contrast to RPS12, we found distinct evidence that ND7 5’ is dual-coding. In both cell lines, alternative editing by different terminal gRNA variants resulted in transcripts with either RF1 (the canonical ND7) or RF3 (a putative metabolite transporter) linked to the first AUG [141]. Interestingly, ND7 5’ has also been sequenced in 29-13 cells [148]. While that study did not directly state evidence of dual-coding, it did indicate that a large proportion of the fully edited ND7 5’ transcripts had a single nucleotide difference in the 5’ UTR. This difference could very well be the same difference we observe in E2 transcripts that links an upstream AUG to the ARF. This suggests that the ability to access the two different reading frames is maintained across a number of different trypanosome cell lines. While fully edited ND7 5’ transcripts were found in both cell lines, a major difference was observed in the efficiency of the editing process. In TREU 667 cells, over 79% of ND7 5’ transcripts had initiated the editing process and a full 9.7% were fully edited. In contrast, EATRO 164 cells grown under the same conditions (SDM79) had only 52.4% of transcripts that initiated editing and a scant 0.2% fully edited.
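The efficiency figures quoted here come from Table 9, and the block-to-block losses can be recomputed directly from its values. The following is a minimal sketch for that arithmetic; the function names are illustrative, and the "share of initiated transcripts that finish Block E" is a derived quantity introduced here for comparison, not a statistic reported in this work.

```python
BLOCKS = ["Initiated", "A", "B", "C", "D", "E"]

# Percent of all ND7 transcripts that completed each level (Table 9).
TABLE_9 = {
    "TREU 667 (SDM79)": [79.7, 45.8, 44.7, 18.9, 11.7, 9.7],
    "EATRO 164 (SDM79)": [52.4, 47.0, 45.8, 13.4, 0.4, 0.2],
    "EATRO 164 (SDM80)": [72.6, 68.5, 66.7, 16.9, 1.1, 0.5],
}


def block_drops(percentages):
    """Percentage-point loss between consecutive editing levels."""
    steps = zip(BLOCKS, BLOCKS[1:], percentages, percentages[1:])
    return {f"{a}->{b}": round(x - y, 1) for a, b, x, y in steps}


def completion_among_initiated(percentages):
    """Derived value: % of initiated transcripts edited through Block E."""
    return round(100 * percentages[-1] / percentages[0], 1)


if __name__ == "__main__":
    for library, values in TABLE_9.items():
        drops = block_drops(values)
        print(library, drops, completion_among_initiated(values))
```

With these inputs, the B->C step accounts for the 25.8 point drop cited for TREU 667 and is the largest single loss in both EATRO 164 libraries.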
Growth of EATRO 164 cells in SDM80 did substantially increase the number of transcripts that had initiated RNA editing (72.6%); however, no corresponding increase in fully edited transcripts was observed. The major differences in editing efficiency appear to be due to both the use of alternative gRNAs that could disrupt the editing cascade and a gRNA mutation that affected the ability of the guide RNA to anchor efficiently. Surprisingly, the gRNAs that disrupt editing in the EATRO cell line are also present in the TREU gRNA transcriptome. It is unclear why we see evidence of their use in only the EATRO cells. It may be that in the TREU cells, these gRNAs are more efficiently used in a different editing pathway. A full understanding of gRNA selection and use will require the characterization of the entire edited transcriptome. In addition to the large decrease in the efficiency of ND7 5’ editing observed in the EATRO cells, we also saw a distinct shift in the number of fully edited transcripts that translate in RF3, the alternative open reading frame. This alternative protein has been previously predicted to be a metabolite transporter, as it shares distant homology with a bacterial sugar transporter, SemiSWEET [141].

The most pronounced differences between the two cell lines were observed for the CR3 transcript. In both cell lines, CR3 utilizes a much more complicated editing pathway than either ND7 5’ or RPS12, and the overall efficiency of the editing process is very low. Surprisingly, the number of CR3 transcripts that initiate RNA editing is comparable to the percentage observed for RPS12. However, editing by the initiating gRNA appears to be very inefficient. In TREU cells, while 15.1% of the transcripts initiate editing, only 2.2% are fully edited through the first editing block. A similar drop is also observed in the EATRO cells.
The identified gRNA population that initiates editing does not contain any mismatched base pairs, and it is unclear why full editing by this gRNA is so inefficient. The canonical CR3 is a putative NADH dehydrogenase Complex I member (ND4L) [120]. Editing of Complex I members does appear to be developmentally regulated, with full editing only observed in the bloodstream stage [5,27–30]. It may be that editing is stalled right after initiation by a transcript-specific mechanism. However, a small percentage of transcripts are still edited. Transcripts edited to the canonical CR3 sequence were only observed in the TREU cell line. In this cell line, four different 3’ editing pathways converge to an internal consensus sequence, which then diverges again near the 5’ end, generating a variety of different proteins in two different reading frames. In the EATRO cell line, editing initiates with the same 3’ gRNAs, but diverges at the internal consensus sequence, where a different set of gRNAs is employed for full editing. Because their editing pathways are so different, the TREU and EATRO protein products cannot be directly compared. Searches were run on several databases in order to determine the putative functions of the many CR3 proteins. Unfortunately, these searches yielded few significant results, with most proteins only sharing homology with the transmembrane domains of many different proteins.

The very small percentage of CR3 transcripts that undergo full editing suggests that the protein products may not be made or utilized in this stage of the parasite life cycle. However, we hypothesize that the ability to alternatively edit transcripts may be an important evolutionary mechanism to maintain genetic plasticity. The dual host life cycle of T. brucei leaves it vulnerable to genetic drift, especially in regard to the mitochondrial ETC genes that are not always under selection.
Previously, we proposed a mechanism that would contribute to the drift robustness of these mitochondrial genes. By overlapping ETC genes not under selection in the bloodstream stage with genes that are under selection during this stage of the life cycle, the accumulation of mutations can be prevented [141]. These overlapped genes share most gRNAs, and this strategy ensures that almost all of the genetic material is protected. We also hypothesize that the sequential nature of gRNA use and the sensitivity of the RNA editing process to both mRNA and gRNA mutations can protect against genetic drift by increasing the deleterious effects of the mutations (LaBar and Adami 2017). Increasing the lethality of mutations would ensure that deleterious mutations are purged from the population during long periods of growth in the mammalian host. While the process of RNA editing may help weed out mutations by making them lethal, it would also prevent the population from generating beneficial mutations. This strategy leaves organisms no options for evolving. We suggest that alternative edits, such as those seen in the CR3 editing pathways and others previously observed, generate protein diversity without compromising the genetic information found within the genome [68].

Our analysis revealed a number of details about the larger mechanisms of gRNA selection in RNA editing. We found a surprising number of gRNAs that had been identified to edit two different mRNA sequences. While some of these promiscuous gRNAs appear to be unproductive, generating dead-end branches on their editing pathways, many appear to be productively used, as with the editing pathways of CR3 in the EATRO 164 cell line. The majority of these editing pathways are directed by promiscuous gRNAs, and these pathways generate translatable transcripts. Interestingly, most of these gRNAs were identified to edit members of Complex I, particularly ND8 and the 3’ editing domain of ND7.
It may be that this is another mechanism of increasing the drift robustness of T. brucei by giving these gRNAs multiple functions. Promiscuous gRNAs have been previously identified editing L. tarentolae RPS12 and ND3; however, they were not shown to produce translatable transcripts [156,157].

In addition to this, we determined that RNA editing is not strictly sequential. While overall, editing proceeds from 3’ to 5’ across the editing domain, we have found evidence that gRNAs can overwrite editing that has been previously generated. These observations show that the RNA editing system can tolerate some amount of abnormal editing, despite the fragility of the system as a whole. Interestingly, another study that examined the use of previously identified alternative RPS12 gRNAs found that the three gRNAs in question were being utilized, and that despite not generating the conventional editing patterns, a very small number of transcripts returned to the canonical editing pattern after the alternatives [148]. This last observation may be another instance of gRNA overwriting. Non-sequential editing is also suggested by our proposal that the poly-U tail can be used to direct editing. We showed that deletions in partially edited transcripts were more common in regions close to the U tail of the preceding gRNA (Figure 24). Interestingly, in some cases, an upstream deletion site would be completely deleted while a downstream site would not be deleted. This phenomenon was also observed in another study, in an in vitro editing system [158].

In this analysis we determined the editing pathways of RPS12, ND7 5’, and CR3 in EATRO 164 and TREU 667 cells, and we found evidence of dual-coding in ND7 and CR3. We also showed that editing patterns can vary quite significantly between cell lines and based on available energy sources. Using the gRNA transcriptomes to validate alternative edits was vital to completing this project.
In light of the extreme variation we observed in the CR3 editing pathway, we believe that in order to fully understand the dynamics of the editing pathways as a whole, we need to sequence all of the edited mRNAs and gRNAs in multiple cell lines under multiple conditions.

Acknowledgements

We thank the Ken Stuart Lab for the trypanosome cell lines. We would also like to thank Hanyou Pan for his contributions to the data analysis of ND7 5’.

CHAPTER 5: CLUSTER CLASSIFICATION OF UNKNOWN GRNAS REVEALS THE ROBUSTNESS OF THE RNA EDITING SYSTEM

Abstract

The RNA editing system of Trypanosoma brucei uses small RNAs called guide RNAs to direct the insertion and deletion of uridines in mRNAs. Thousands of gRNAs are used in this system to render twelve mitochondrial mRNAs translatable. These gRNAs edit sequentially, with each gRNA generating the anchor binding sequence for the next gRNA. This means that the system is inherently fragile. gRNA transcriptomes have been generated, and gRNAs were identified based on their complementarity to previously described edited mRNAs. This method was effective in identifying many gRNAs, but we found that many were still unidentified. To determine if these gRNAs were nonfunctional mutants or potentially had undescribed functions, we grouped all gRNAs into clusters, where each cluster had significant sequence conservation, using our new program ACORNS. This showed that most unidentified gRNAs were not related to any functionally known gRNAs and could be generating unobserved alternative editing patterns. Recently, the editing pathways of three genes, RPS12, ND7 5’ and CR3, were described in detail. Each gene had branches in its editing pathway, but it was not always clear based on gRNA abundance data alone why one branch of the pathway was more expressed than another.
Using the defined gRNA clusters and another new program, GUIDE, we screened all related members editing these three genes against their targets to determine what proportion of each family was able to productively edit. We found that most of these unidentified gRNAs were predicted to disrupt RNA editing. However, using this information in combination with the mRNA data for these three genes, we found that these mutations are highly tolerated by the editing system. We also determined that gRNA abundance does not correlate with the mRNA editing preference. We also observed many gRNA populations of high abundance that were apparently not used to edit. In most cases these populations had no issues in their mRNA alignments but were seen to be used in only one cell line and not the other. Currently, the complete mechanism of gRNA selection is unknown.

Introduction

The RNA editing system of the trypanosomes is a unique and complex system. It requires the use of two genetic components: the protein coding genes whose transcripts require editing, and the small RNA genes encoding the guide RNAs (gRNAs) that direct the specific edits [7]. The genes that require editing are all mitochondrially encoded and are either associated with the electron transport chain or the mitochondrial ribosome. In this RNA editing process, mRNA transcripts can be dramatically altered by the insertion and deletion of uridines [4,6]. These editing events are catalyzed by a multi-subunit editosome complex [140]. Currently, over 47 proteins have been identified that are involved in the cleavage, uridylyl addition or deletion and subsequent re-ligation events that are required for every nucleotide change [140]. Forming these specialized complexes and performing the hundreds of edits required is highly energetically expensive. The RNA editing process is also incredibly fragile to mutations, as it is sequentially dependent.
RNA editing starts at the 3’ end of the mRNA, and each gRNA generates the anchor for the next gRNA. A mutation in any gRNA could disrupt formation of the next anchor, halting the editing process and, with it, expression of the protein.

It has been hypothesized that this energy-intensive process evolved in response to the parasite’s complex life cycle [75,76,141]. Trypanosoma brucei, a dixenous kinetoplastid, undergoes substantial shifts in its energy sources over its life cycle, requiring extensive regulation of metabolic pathways. In the nutrient-deprived environment of the insect, it relies primarily on metabolizing amino acids to drive the Krebs cycle and oxidative phosphorylation. Once transferred to the glucose-rich bloodstream of its mammalian host, however, it shifts away from mitochondrial respiration and relies on glycolysis alone. In addition, the exclusively bloodstream nature of T. brucei in the mammalian host requires that the parasites replicate continuously, escaping the host’s immune response by periodically switching their surface glycoproteins [12]. These conditions should make the genes involved in mitochondrial respiration particularly susceptible to genetic drift. Recently, we showed that as many as six of the mitochondrial transcripts may be dual coding and that alternative editing near the 5’ end of the transcript can be used to access both open reading frames (ORFs). We hypothesized that the alternative ORF may give the parasites a selective growth advantage in the mammalian host. This would significantly increase the deleterious effects of mutations and protect against genetic drift. The linking of an electron transport chain (ETC) gene with an essential gene has been previously shown for cytochrome oxidase III (COIII). For this transcript, alternative editing by one gRNA can link an ORF found in the unedited 5’ sequence with the trans-membrane domains found in the fully edited carboxyl-end of the protein [67].
This alternative protein, AEP-1, is involved in mitochondrial DNA maintenance and appears to be essential during bloodstream growth [68]. The detection of such a large number of dual-coding genes (seven including COIII) in a genome that encodes only 17 proteins suggests that this is an important mechanism to protect the integrity of the ETC genes that are required only in the insect. Thus, the ability to overlay genetic information may be a protective strategy made possible by RNA editing [99,141].

The changes in metabolism are reflected in changes in the pattern of RNA editing found in the different life cycle stages of T. brucei [5,24–30]. For instance, many of the transcripts encoding members of NADH dehydrogenase (Complex I) are only fully edited in the bloodstream stage. In contrast, cytochrome oxidase II (COII, Complex IV) and cytochrome b (Complex III) are both preferentially edited in the insect stage. Recently, we even observed a shift in the editing pattern found in insect stage trypanosomes when the growth medium was switched from glucose-rich to glucose-depleted. This suggests that editing can respond to the environment (Chapter 4). In addition, multiple studies have shown that different editing patterns can be observed for transcripts of the same mRNA, with edits being directed by multiple alternative gRNAs. The most dramatic example of this is seen in the putative NADH dehydrogenase subunit 4L, known as C-rich region 3 (CR3) [120]. For transcripts of this gene, two different sets of gRNAs were used in two different cell lines of T. brucei. Both editing pathways were complex and possessed multiple branches that were fully translatable. Curiously, the gRNAs required to generate the editing pathways of both cell lines were present in each cell line, but for unknown reasons the alternative set was apparently not selected for use.

The gRNAs that direct the RNA editing process are encoded on 1 kb circular DNA molecules called minicircles.
These minicircles make up the bulk of the mitochondrial DNA of T. brucei, with up to 10,000 minicircles per cell, and there are an estimated 1200 different sequence classes of minicircle at varying abundances [8,33]. gRNA transcriptomes have been generated for the insect and bloodstream stage trypanosomes [9,119]. Analyses of these transcriptomes identified over 600 gRNA populations (gRNAs that edit the same region of the same gene) involved in editing the mitochondrial mRNA transcripts. These populations typically contained multiple different sequence classes of gRNAs (major classes). The sequence differences observed are all R to R or Y to Y (purine-to-purine or pyrimidine-to-pyrimidine) changes, and since both A:U and G:U base pairs occur in the editing process, the multiple sequence classes can all guide the generation of the same mRNA sequence. These data also show extreme quantitative differences between the different gRNA populations, with population sizes ranging from <10 to >300,000 reads. We had postulated that regions with very low gRNA coverage are areas with mRNA sequence variations and that the editing of these regions may be directed by gRNAs not identified due to internal mismatches with the conventional sequence. In addition, we also identified some abundant, mutated gRNAs that could potentially introduce frameshifts or create sequence that disrupts the upstream anchor binding site. This was surprising, as we had hypothesized that these mutations would be selected against due to the fragile nature of the editing process. Because of the high stringency of our initial screen, only gRNAs with mismatch mutations biased towards the 3’ or 5’ ends of the gRNAs were initially identified, and it was unclear how prevalent mutated gRNAs that could disrupt editing were. Analyses of our gRNA libraries did indicate that they contain millions of “unidentified” reads that had key guide RNA characteristics (characteristic transcription start sites, U-tail and ~ length).
In this manuscript, we describe the development of two new programs that allow us to begin to characterize these “unidentified” transcripts in order to determine what impact they might have on RNA editing efficiency. ACORNS (Assemble Clusters Of Related Nucleotide Sequences) allows us to identify and classify gRNA sequences. A second program, GUIDE (gRNA Uridine Insertion/Deletion Editor), simulates the RNA editing process and allows us to predict the effect of a gRNA mutation on the editing process. Using the CR3, RPS12 and ND7 5’ mRNA transcriptome data, we identified the gRNA populations that can edit or disrupt editing of these genes. We found a surprisingly high tolerance of mismatches and gaps in gRNA/mRNA alignments, as well as many instances where the most abundant gRNAs present were not those preferentially used. A deeper look at the gRNA transcriptomes also revealed a surprisingly large number of unidentified and functionally unknown gRNAs, suggesting that more work is needed to discover their roles in the RNA editing system.

Materials and Methods

Cluster analysis of related gRNAs by ACORNS

In order to determine the relationships between previously identified and unknown gRNAs, a new program called ACORNS (Assemble Clusters Of Related Nucleotide Sequences) was created. This program identifies putative gRNAs, determines relationships between gRNAs, and groups related gRNAs into clusters.

Identification of putative gRNAs. As described previously, identical sequences were collapsed, and sequences without four consecutive Ts, which indicates the lack of a poly-U tail, were filtered out [9].
ACORNS first selected putative gRNAs based on two criteria: having 40 nucleotides prior to the start of the poly-U tail, and having a transcription start site that either matches one of the twenty most common six-nucleotide gRNA transcription start sites or is one mutation away from one of these sites [9,119]. Maxicircle edited and unedited sequences as well as ribosomal RNAs were also filtered from the sequence file. Identical sequences were collapsed but retained their total read abundance. Additionally, sequences that were identical except for the start position of the poly-U tail or the location of the 5’ end of the gRNA were also consolidated, keeping the sequence of the most abundant transcript.

Identification of gRNA families. A pair of related gRNAs is defined as two gRNAs that differ by only a single substitution, insertion or deletion mutation. Once putative gRNAs were identified, ACORNS aligned each gRNA with every other gRNA to determine which gRNAs were related. Alignments were scored using the python-levenshtein package [159]. In order to prevent errors due to 5’ exonuclease activity or differences in the poly-U site, overhanging nucleotides on either end of the alignment were not counted as mismatches. ACORNS then grouped gRNAs into families based on their relationships. Each family was assembled by starting with the most abundant transcript and including any other gRNAs related to that transcript, as well as any gRNAs related to those gRNAs, and so on. Once gRNAs were grouped into families, sequences were trimmed to the same length and any new identical sequences were collapsed. Visualization of these related gRNA clusters was performed by a subprogram, and clusters were visualized in two different ways, with color coding based on either gRNA identity or gRNA abundance. Identity was defined as the gene known to be edited by previously identified gRNAs, with all other gRNAs labeled as “unknown”.
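The family-building step described above can be illustrated with a minimal Python sketch. It is not the ACORNS source code: the function names `within_one_edit` and `build_families` are hypothetical, a small pure-Python check for a single substitution, insertion or deletion stands in for the python-levenshtein scoring, and the handling of overhanging nucleotides and the later trimming step are omitted.

```python
from collections import deque


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one substitution, insertion or deletion."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) sequence
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the shared prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substitution
    return a[i:] == b[i + 1:]  # exactly one extra nucleotide in b


def build_families(abundance: dict) -> list:
    """Group sequences into families, seeding each with the most abundant unassigned read.

    abundance maps each collapsed sequence to its read count (assumed input format).
    """
    ordered = sorted(abundance, key=lambda s: (-abundance[s], s))
    unassigned = set(ordered)
    families = []
    for seed in ordered:
        if seed not in unassigned:
            continue  # already pulled into an earlier family
        unassigned.remove(seed)
        family, queue = [seed], deque([seed])
        while queue:
            current = queue.popleft()
            # relatives of the current member, then relatives of those, and so on
            relatives = sorted(s for s in unassigned if within_one_edit(current, s))
            for seq in relatives:
                unassigned.remove(seq)
                family.append(seq)
                queue.append(seq)
        families.append(family)
    return families
```

Under these assumptions, a sequence that is two edits away from the seed still joins the family as long as an intermediate relative links them, which mirrors the "and so on" rule in the text. The all-against-all comparison is quadratic in the number of sequences, so this sketch is only practical for small inputs.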
Color coding by abundance was based on a log scale, with a separate scale created for each cluster: the least abundant gRNA was colored purple, the most abundant gRNA was colored red, and all other gRNAs were scaled accordingly.

Prediction of RNA editing by GUIDE

A reference library of all previously described edited forms of RPS12, ND7 5’, and CR3 was collected and annotated with the editing states in each transcript (Chapter 4). For this analysis we generated a new piece of software known as the gRNA Uridine Insertion/Deletion Editor (GUIDE). GUIDE uses the outputs of ACORNS and, for each gene, pulls the families of gRNAs that have members previously identified to edit the given gene. Editing by each member of each family was then simulated. For each family, the template generated by the previously identified members of that family was used. If gRNAs generating different forms of the same mRNA were in the same family, all appropriate templates were tested on each family member, and the predicted edit with the longest alignment with the gRNA was saved.

In order to predict the edits for each gRNA, GUIDE determines the most likely anchor binding location on the template. Anchor sequences were required to use Watson-Crick base pairs only and to be located in the first twenty nucleotides (nt) of the gRNA. The first twenty nt of each gRNA were scanned along the template, and the longest consecutive stretch of Watson-Crick base pairs was identified. Additionally, a second anchor was identified using a weighted score, where each G:C base pair was worth two points and each A:U base pair was worth one point. Editing from both anchors was tested, and the anchor that generated the longest edited sequence was saved. The rules of editing used by the program are the standard uridine insertion/deletion rules observed in the editing system.
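The anchor search described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the GUIDE implementation: the name `find_anchors`, the `WC_SCORE` table, the returned dictionary layout, and the convention that both sequences are written 5’ to 3’ (so the gRNA window is reversed before scanning) are all hypothetical, and only ungapped, full-window placements along the template are considered.

```python
# Watson-Crick pairs only; G:U wobble pairs are deliberately excluded from anchors.
# Scores follow the weighting in the text: G:C = 2 points, A:U = 1 point.
WC_SCORE = {("G", "C"): 2, ("C", "G"): 2, ("A", "U"): 1, ("U", "A"): 1}


def find_anchors(grna, mrna, window=20):
    """Return (longest_run, best_weighted_run) for the 5' window of a gRNA.

    Each result is a dict with the mRNA start index, run length and weighted
    score of a consecutive Watson-Crick stretch, or None if nothing pairs.
    """
    grna = grna.upper().replace("T", "U")
    mrna = mrna.upper().replace("T", "U")
    # The gRNA pairs antiparallel to the mRNA, so reverse its first `window` nt.
    probe = grna[:window][::-1]
    longest = weighted = None
    for offset in range(len(mrna) - len(probe) + 1):
        run_len = run_score = 0
        for i, g in enumerate(probe):
            score = WC_SCORE.get((g, mrna[offset + i]))
            if score is None:
                run_len = run_score = 0  # any non-Watson-Crick pair ends the run
                continue
            run_len += 1
            run_score += score
            hit = {
                "mrna_start": offset + i - run_len + 1,
                "length": run_len,
                "score": run_score,
            }
            if longest is None or run_len > longest["length"]:
                longest = hit
            if weighted is None or run_score > weighted["score"]:
                weighted = hit
    return longest, weighted
```

The two candidates can differ: a long A:U stretch may win on length while a shorter G:C-rich stretch wins on weighted score, which is why the text describes simulating editing from both anchors and keeping the one that yields the longest edited sequence. That downstream simulation is not part of this sketch.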
If a nucleotide from the gRNA is aligned to a nucleotide from the mRNA in an “illegal” pairing (anything other than Watson-Crick or G:U base pairs), a uridine will be either inserted or deleted to resolve the mismatch. If this cannot resolve the mismatch, editing ends. Once edits were predicted, they were classified based on their effect on the editing process as a whole and their effect on the predicted protein.

Results

In our characterization of the EATRO 164 gRNA transcriptome, we identified over 3 million reads and ~64,000 unique gRNA sequences capable of generating conventional editing patterns [9]. These gRNAs were identified by finding the longest common substring to conventionally edited mRNAs and retaining only those scoring 45 or more (Watson-Crick base pairs = 2; G:U base pairs = 1). In these studies, we found that lowering the stringency and/or allowing for mismatches led to the identification of thousands of gRNAs with characteristics suggesting that they were misaligned. While this initial analysis did identify an almost full cohort of guide RNAs, there were still millions of reads in the transcriptome that could not be classified. To determine how many of these reads were unidentified gRNAs for possible alternative sequence or mutated conventional gRNAs, we developed a new pipeline. ACORNS (Assemble Clusters Of Related Nucleotide Sequences) has a number of features that allow it to identify and classify putative guide RNAs:

1. It filters out all contaminating nuclear and maxicircle sequences.
2. It identifies putative gRNAs based on three criteria: presence of a U-tail (defined as 4 consecutive U-residues), length (40 nts prior to last stretch of U-residues) and transcription start site (based on sequences defined in the previous analyses) [9,119].
3. It identifies related gRNAs by scoring the best alignment of each gRNA against all other putative gRNAs in the library. gRNAs with a single mismatch or gap are then classified as related.
4. It clusters related gRNAs by starting with the most abundant transcript, grouping all relatives of that transcript into the cluster, then adding all relatives of any relative to the cluster.
5. Using a subprogram, the clusters of related guide RNAs can be visualized, revealing the relationships between the previously identified gRNAs and unknown gRNAs.

ACORNS analyses were done for both the TREU 667 and EATRO 164 procyclic gRNA transcriptomes, leading to the identification of 1256 clusters in TREU and 1168 clusters in EATRO 164 (Tables 11 and 12). This allowed us to identify all of the gRNAs that were distinctly related to a conventional gRNA but had undergone a mutational event. In addition, we also identified a number of clusters that had no previously identified members. The clusters identified varied greatly in size, with the largest cluster containing over 3400 sequence members with a total of 1,377,190 transcript reads. These large clusters (1000+ members) were actually quite rare, with only 10 and 4 clusters of this size identified in the TREU and EATRO gRNA libraries, respectively (Table 13). Interestingly, the majority of clusters were quite small, containing fewer than 25 sequence members and an average transcript number of approximately 250 reads (Table 13). The majority of the small clusters were “unidentified”, in that they contained no previously identified gRNA member (Table 13). As cluster size increased, so did the probability that a cluster contained a previously identified gRNA, allowing its other members to be identified.

Table 11. Summary of ACORNS Results. gRNAs are classified as previously identified, related to a previously identified gRNA based on sequence similarity, or unrelated to any previously identified gRNAs.

| Library | Initial Reads | Final Reads after ACORNS Step 2 | Previously Identified Reads | Previously Unidentified tagged as Related | Previously Unidentified tagged as Unrelated |
|---|---|---|---|---|---|
| TREU 667 Procyclic | 11,387,683 | 9,049,005 | 4,413,142 | 2,072,099 | 2,563,764 |
| EATRO 164 Procyclic | 15,251,292 | 11,199,364 | 5,121,216 | 2,636,714 | 3,441,434 |

Table 12. Cluster summary. Clusters were defined as a group of 10 or more related gRNAs. Cluster characteristics describe the percent of the cluster members that had been previously identified.

| Cluster Characteristics | EATRO 164 Procyclic # of Clusters | EATRO 164 Procyclic Reads | TREU 667 Procyclic # of Clusters | TREU 667 Procyclic Reads |
|---|---|---|---|---|
| ≥95% Previously Identified | 238 | 4,232,188 | 191 | 3,281,190 |
| >0% Previously Identified | 267 | 3,488,118 | 215 | 3,180,019 |
| 0% Previously Identified | 751 | 3,289,359 | 762 | 2,401,148 |
| Unclustered | NA | 189,699 | NA | 186,648 |

Table 13. Cluster size summary. Cluster size is determined by the number of unique sequence classes in a cluster. % Unidentified clusters is the percentage of all clusters in each bin that are completely unidentified.

| Cluster Size | # of clusters identified, TREU 667 | # of clusters identified, EATRO 164 | Total # Reads, TREU 667 | Total # Reads, EATRO 164 | % Unidentified clusters, TREU 667 | % Unidentified clusters, EATRO 164 |
|---|---|---|---|---|---|---|
| 10 - 24 | 503 | 518 | 134,707 | 137,413 | 77% | 68% |
| 25 - 49 | 265 | 291 | 211,154 | 240,804 | 64% | 63% |
| 50 - 99 | 184 | 185 | 425,535 | 427,029 | 61% | 55% |
| 100 - 249 | 139 | 173 | 1,398,488 | 1,575,518 | 54% | 45% |
| 250 - 499 | 47 | 60 | 2,153,210 | 2,167,031 | 26% | 45% |
| 500 - 999 | 26 | 19 | 3,379,901 | 2,354,552 | 27% | 21% |
| 1000 + | 4 | 10 | 1,159,362 | 4,107,318 | 25% | 30% |

Figure 25 shows an example of three different clusters assembled by the ACORNS program. The visualization program represents each individual gRNA sequence as a dot, with lines connecting dots representing relationships between the individual sequences. Each member is numbered in order of abundance of transcript reads (0 = most abundant). gRNAs can be further characterized within the cluster by using color either to identify transcript-specific gRNAs (Figure 25 A, C and E) or to indicate log transcript abundance (Figure 25 B, D and F).
The first cluster shown illustrates a relatively small cluster with 104 different gRNA sequence members (Figure 25 A and B). In this visualization, it is clear that most of the members were previously identified (red dots), including the most abundant member (dot 0). Figure 25B shows the transcript abundance of each cluster member. This cluster is very characteristic of most of the clusters we identified, in that it has a central, very abundant gRNA (red dot in center, >5,000 reads), with most other cluster members having very few reads (purple = 1 read in the transcriptome). In total, 98.1% of the transcripts in this cluster were previously identified. Figure 25 C and D illustrate a large cluster with 1801 sequence members. In contrast to the first cluster, most of the sequence members were previously unidentified (gray dots); a few of the members, however, had been previously tagged as RPS12-specific gRNAs (red dots). The previously identified RPS12 gRNAs were tagged as editing the Block H region of RPS12. The originally identified transcripts were rare, with only 255 reads found in the EATRO 164 library. The ACORNS analysis allowed us to identify an additional 276,358 reads that cover this region. However, most of these newly identified gRNAs, including the most abundant sequence class, have mutations that affect their ability to correctly edit the mRNA transcript.

Figure 25. Example clusters of related gRNAs generated by ACORNS from the EATRO 164 PC gRNA transcriptome. Each dot represents an individual gRNA sequence and lines connecting dots represent relationships between gRNAs. Each cluster is shown with two different color schemes: editing gene identity (A, C and E) and gRNA read abundance (B, D and F). The first two clusters (A,B and C,D) both contain RPS12 identified gRNAs (red dots) and previously unidentified gRNAs (gray dots). The third cluster (E and F) contains a majority of previously unidentified gRNAs (gray), but also contains CR3 identified gRNAs (purple) and a few gRNAs that had been tagged for a dead-end ND7 5’ alternative editing pattern (green).

The final cluster shown is also large, with 1039 members (Figure 25 E and F). For this cluster, the most abundant member (member 0) contained over 400,000 reads and had been previously identified as an alternative CR3 gRNA (gCe80). The ACORNS cluster analysis allowed us to identify an additional 829 sequences and over 12,700 reads as being related to this alternative gRNA. Interestingly, in the visualization of this cluster, we noted that 4 of the members were previously tagged as involved in the generation of a disruptive (dead-end) alternative edit of ND7 5’. An examination of these transcripts indicates that they are in fact a specific mutant subclass of the gRNA cluster that is now capable of anchoring and creating a misedit in ND7. This cluster was the only example of a cluster containing related gRNAs with different targets (distinct from promiscuous gRNAs that all have multiple targets), and could be an example of how alternative editing originates.

These data analyses suggest that both procyclic libraries have large numbers of unidentified gRNAs of unknown function. It may be that these gRNAs are directing alternative editing events that have not yet been characterized. The full mRNA transcriptome has not been sequenced, and the limited amount of mRNA sequence available does suggest an abundance of alternative editing (Chapters 3 and 4) [9,67,68,99,141,148,160]. Surprisingly, both libraries also contained large numbers of mutated conventional gRNAs. This was unexpected, due to the fragile nature of the RNA editing process.
We had hypothesized that the sequential nature of the RNA editing process would decrease the tolerance for mutations in both the gRNA and the mRNA genes (Chapter 4) [9]. In order to determine how these mutated gRNAs might influence the RNA editing process, we developed a second program (GUIDE, gRNA Uridine Insertion/Deletion Editor) that simulates the RNA editing process. This program takes the fully edited mRNA templates for the identified gRNAs in each cluster, finds the best anchor for each gRNA and then simulates editing based on the conventional base-pairing rules (A:U and G:U pairing both allowed). For each gRNA, the anchor length, length of complementarity and the number of sites showing non-conventional editing are determined. This allows the classification of the gRNA into different bins: 1) low quality anchor; 2) does not fully edit; 3) conventionally edits; and 4) alternatively edits. Guide RNAs that generate alternatively edited sequence were then further classified as 1) disruptive edits (does not generate the anchor for the next gRNA); 2) frameshift edits (does not disrupt editing but generates a frameshift); or 3) missense edits (does not disrupt editing but generates a missense mutation). We note that in the GUIDE program, we initially classified C:A base pairs as disruptive (stopping the editing process). However, because C:A base pairs have been previously identified within known gRNA alignments, we sorted gRNAs containing a C:A base pair into their own bin [9,119]. These analyses were done for all gRNA clusters identified for RPS12, ND7 5’ and CR3. These three genes were chosen because the mRNA transcriptomes and full editing pathways had been previously characterized (Chapter 4). Of the unidentified gRNAs that are related to previously identified gRNAs, 942,397 and 958,140 reads were related to RPS12, ND7 5’ or CR3 editing gRNAs in TREU and EATRO cells, respectively.
Of these newly identified gRNAs, 92.6% in TREU and 95.5% in EATRO were predicted to be incapable of fully editing, to fail to create the anchor for the next gRNA, or to cause a frameshift. This suggests that a large number of gRNAs could be disruptive. However, because of the large differences in population sizes for the different guide RNAs, it is important to evaluate the percent of disruptive editing at the population level. These data could indicate how tolerant the RNA editing process is to mutations in the gRNA population.

RPS12

Editing of the essential RPS12 transcript was relatively straightforward, with a limited number of alternative edits (Chapter 4, Figure 26). The main editing pathway involved 10 gRNA populations (gA – gJ). Analyses of these populations still show a large variation in the abundance of the different populations even after the cluster analyses (Figure 26, Table 14). Surprisingly, we saw no correlation between the abundance of the gRNA populations and RNA editing efficiency. For example, in the TREU cell line, the B1 and B1* mRNA transcripts are nearly equally abundant (Figure 26A). However, the gB1* gRNA is almost 50-fold more abundant than gB1 (Figure 26B and Table 14). In contrast, in the EATRO cell line, the B1 mRNA transcript is significantly more abundant (5x), while the gB1 and gB1* guide RNAs are approximately equal in abundance. Interestingly, the only gRNA identified that can generate the B1* edits requires a gap in the alignment with the mRNA to create the correct sequence. Our GUIDE program places this gRNA into the “disrupts editing” bin (brown), and it is unclear why this gap is tolerated. We do note that there are three G:C base pairs surrounding the gap that would help stabilize this alignment. We hypothesize, however, that the gap in the alignment does in fact affect the ability of this gRNA to efficiently edit.
In the TREU cells, the abundance of gB1* may contribute to its editing efficiency, compensating for the gap and increasing the amount of B1* mRNA observed. This pattern appears to be repeated in EATRO, where the B1 mRNA is almost five times as abundant as the B1* mRNA, but the gRNAs are nearly equally abundant. Interestingly, in both transcriptomes the gRNAs responsible for generating the B1 edits require C:A base pairs to function correctly. This was the first of many instances observed where editing is impossible without the use of C:A base pairs.

Figure 26. Observed RPS12 editing pathways in the TREU 667 cell line (A and B) and the EATRO 164 cell line (C and D). U = unedited transcripts. Dot sizes are proportional to the percent of block level edited transcripts using the gRNA indicated. Colored arrows indicate the gRNA population used (A and C). Dashed arrows represent gRNA populations used in only one cell line (superscript ‘e’ or ‘t’). gRNA names with superscript ‘p’ represent promiscuous gRNAs. Dots enclosed by a red box represent end point mRNAs with no AUG start codon. Lines connecting dots (B and D) indicate gRNA population size and functionality. Disruptive gRNAs were considered to be those that were predicted by GUIDE to be unable to complete editing (excluding those that only required a C:A base pair to finish editing), unable to generate the anchor for the next productive gRNA, or to generate a frameshift mutation.

Table 14. gRNA population analyses for RPS12.

| Editing Block | Population | TREU 667 gRNA Reads | TREU 667 mRNA Reads | EATRO 164 gRNA Reads | EATRO 164 mRNA Reads |
|---|---|---|---|---|---|
| A | gA | 416 | 69,923 | 95 | 38,946 |
| B | gB1 | 1,901 | 27,805 | 19,878 | 28,494 |
| B | gB1* | 84,459 | 23,318 | 17,975 | 5,880 |
| B | gB3t | 1,692 | 2,463 | 0 | 0 |
| B | gB4t | 2,784 | 946 | 0 | 0 |
| B | gB2FSe | 1 | 0 | 15 | 2,820 |
| C | gC | 3,081 | 46,436 | 304 | 35,255 |
| C | gCt | 6 | 577 | 0 | 0 |
| D | gD | 387,246 | 35,563 | 460,928 | 20,067 |
| D | gFp (Dx edit) | 2,877 | 2,724 | 2,371 | 4,617 |
| E | gE | 11,593 | 35,046 | 5,269 | 17,723 |
| E | gEep | 1,162 | 0 | 178 | 879 |
| F | gFp | 2,940 | 29,838 | 2,371 | 10,818 |
| F | gFep | 0 | 0 | 1,843 | 955 |
| G | gG | 4,505 | 28,973 | 19,028 | 11,163 |
| G | gGe | 105,217 | 0 | 99,153 | 468 |
| H | gH | 41,824 | 29,331 | 287,393 | 10,843 |
| I | gI | 14,251 | 26,417 | 6,136 | 9,747 |
| J | gJ+ | 5,573 | 17,870 | 18,094 | 7,665 |

A significant number of the gRNA populations required for the editing of RPS12 did have large numbers of mutated gRNAs that were predicted to stop editing (mismatch cannot be resolved by the insertion or deletion of a U-residue) or disrupt editing (does not generate the correct anchor sequence for the next gRNA). For example, while the gC populations had a small number of gRNAs with perfect alignment to the canonical sequence (1.6% in TREU and 5.9% of the gC populations in EATRO), most of the gRNAs in these populations contain an illegal base pair (G:A or G:G). Similarly, more than 70% of the gD population in the TREU cell line have a U:U mismatch in the middle of the alignment, and both the gG and gH populations have a majority of gRNAs that contain a single mismatch in the best alignment with their editing block (Figures 27 and 28). Surprisingly, when the GUIDE program simulated the edits generated by the mutants of the gD population, it predicted that the mutants would generate an alternative sequence that lacks the anchor binding site for the gE population (Figures 27 and 28). However, this alternative sequence was not identified in the RPS12 mRNAs of the TREU or EATRO cells (Chapter 4). This suggests that the alignment error is tolerated in the generation of the canonical sequence. For most of the mismatches detected, the errors in the alignments are immediately flanked by multiple G:C base pairs. It may be that the presence of multiple stable G:C pairs allows the editing machinery to tolerate these mismatches. The EATRO gH gRNAs are the exception. This population contains a majority of transcripts with a single point mutation near the end of the gRNA that should prevent it from generating the full anchor sequence for gI.
There are no stabilizing G:C pairs that flank this mismatch. The other notable difference between the TREU and EATRO editing patterns was found in an alternative edit (Ge) observed in the EATRO cell line only. We note that the gGe guide RNA is very abundant in both transcriptomes (>100,000 reads in TREU cells and >90,000 reads in EATRO cells). However, we found evidence of its use in only the EATRO cell line. In summary, we identified multiple populations of gRNAs that were predicted to be nonfunctional or disruptive in the RPS12 editing pathway. However, these predictions are contradicted by the observed RPS12 mRNA data, suggesting that many of the mismatches that we had previously considered to render gRNAs nonfunctional or disruptive are tolerated by the RNA editing system.

Figure 27. Analysis of functionality and abundance of productive gRNA populations that edit RPS12 in TREU 667 cells. The functionality of each subpopulation is shown as a bar, with percentage shown on the left y-axis, and subpopulation abundance before and after identification of gRNA relatives is shown on the right y-axis. gRNAs labeled as ‘Disruptive Edit’ failed to generate the anchor for the subsequent gRNA, and gRNAs labeled as ‘C:A base pair’ required a C:A base pair to be tolerated for the editing to be completed correctly. gRNAs with shaded names were found in the TREU 667 gRNA transcriptome but were not used in editing the TREU 667 RPS12 mRNAs, despite being used in EATRO 164 cells.

Figure 28. Analysis of functionality and abundance of productive gRNAs that edit RPS12 in EATRO 164 cells. For description of axes and gRNA functionality labels, please see Figure 27.

ND7

As with RPS12, the ACORNS program identified gRNAs related to those previously identified to generate the ND7 5’ editing pathways (Figure 29).
The ND7 5’ editing domain contains five editing block levels, and the editing pathway is relatively straight forward until the final editing block, where alternative editing generates transcripts translatable in multiple reading frames (Chapter 4). In full editing of this transcript, two of the 5 gRNA populations appear to be problematic; gC and gD (Figure 29). In both cell lines, the gC population requires both C:A base pairs as well as multiple mismatches or gaps in the best alignments with the canonical sequence (Figure 30). These multiple alignment errors do not appear to be well tolerated by the editing system, as a severe decrease in editing efficiency was observed in both cell lines from the B to C block level (25.8% in TREU cells and 32.4% in EATRO cells) (Chapter 4). For the ND7 5’ gD guide RNAs, both cell lines have a majority of gRNA transcripts with a base pair mismatch in their anchor. In the EATRO cells, almost all gRNAs in the gD population either 120 require a C:A or A:A base pair in the anchor (Figure 31). While these mismatches are surrounded by G:C base pairs, editing efficiency does appear to suffer with a 13.0% drop in editing efficiency and only 0.4% of transcripts completing the D block level (Table 15). In contrast, in TREU cells, while most gRNAs have a A:A mismatch in the anchor, ~31% of the population has the ability to form a conventional Watson-Crick anchor (Figure 30). The drop in editing efficiency is less (7.2%) than observed in the EATRO cells, and a full 11.7% of transcripts complete D block level editing. In these two problematic gRNA populations of ND7 5’ editing, we begin to see what the limits of the RNA editing system are. We find that multiple mismatches as well as mismatches in the anchor binding region of a gRNA appear to have severe impacts on editing efficiency. Surprisingly, however, these aberrant gRNAs do not appear to halt editing altogether. 121 Figure 29. 
Observed ND7 5’ editing pathways in the TREU 667 cell line (A and B) and the EATRO 164 cell line (C and D). For descriptions of dots and arrows, please see Figure 26. Dashed arrows with open heads represent a hypothetical rewrite. + indicates that more than one mRNA form was condensed into this circle to simplify the figure. Condensed forms encode largely the same amino acid sequence with only small variants. Terminal dots are colored blue for reading frame 1, magenta for reading frame 2, or green for reading frame 3. Boxed green dots have no functional start codon but are translatable into reading frame 3 with the use of an alternative start codon (UUG). Lines connecting dots (B and D) indicate gRNA population size and functionality. Disruptive gRNAs were considered to be those that were predicted by GUIDE to be unable to complete editing (excluding those that only required a C:A base pair to finish editing), unable to generate the anchor for the next productive gRNA, or to generate a frameshift mutation.

Figure 30. Analysis of functionality and abundance of productive gRNAs that edit ND7 5’ in TREU 667 cells. For description of axes and gRNA functionality labels, please see Figure 27. gRNAs with shaded names were found in the TREU 667 gRNA transcriptome but were not found to be utilized in editing the TREU 667 ND7 5’ mRNAs, despite being utilized in EATRO 164 cells. The population labeled gE2v1,gE1v2 contained members that generated both sequence patterns and could not be separated.

Figure 31. Analysis of functionality and abundance of productive gRNAs that edit ND7 5’ in EATRO 164 cells. For description of axes and gRNA functionality labels, please see Figure 27. gRNAs with shaded names were found in the EATRO 164 gRNA transcriptome but were not found to be utilized in editing the EATRO 164 ND7 5’ mRNAs, despite being utilized in TREU 667 cells.
The population labeled gE2v1,gE1v2 contained members that generated both sequence patterns and could not be separated.

Table 15. gRNA population analysis for ND7 5’

Editing Block   Population   TREU 667 gRNA Reads   TREU 667 mRNA Reads   EATRO 164 gRNA Reads   EATRO 164 mRNA Reads
A               gA           6,819                 523,128               589                    430,266
B               gB           35,890                510,653               7,478                  395,621
B               gBex         113                   0                     21,775                 23,440
C               gC           92                    196,504               24,332                 67,080
C               gCFSt        66                    19,717                2                      0
C               gC1ex        161                   0                     94                     38,479
C               gC2ex        538,052               0                     434,196                21,471
D               gD           49,883                133,558               7,781                  3,552
E               gE1+/gE2+    14,804                106,379               163,481                942
E               gE4t         14,772                5,651                 98                     0
E               gE4e         53                    0                     246,486                1,301

CR3

The editing pathways of CR3 are significantly different from those of RPS12 and ND7 5’ (Figure 32, Chapter 4). The most obvious difference is that the pathways are highly branched, generating multiple distinctly different mRNA products. The other key difference is that most of the editing pathways defined in the TREU and EATRO cells are not shared. The two cell lines appear to use almost completely different sets of gRNAs to edit CR3, with the exception of some edits at the very 3’ end of the transcript. What was most surprising about these data was that most of the gRNAs necessary to generate both the TREU and EATRO CR3 editing pathways were found in both gRNA transcriptomes in similar relative abundances (Table 16, Chapter 4). In addition, similar to the RPS12 and ND7 5’ data, we saw very little correlation between the abundance of the gRNA population and the corresponding abundance of the mRNA transcript generated. For example, in the EATRO cell line, the gB1B2 population (38,663 reads) was significantly more abundant than the gB4 guide RNA population (100 reads). Nevertheless, the number of mRNAs with the B4 editing pattern was 5-fold higher than the number of B1B2 mRNAs. The EATRO B4 mRNAs can be further edited by three different gRNAs: gB5e, gB6e and gB7e. Again, while gB6e is the most abundant of the three, the predominant mRNA found is the B5e transcript (Table 16).
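As an illustration only, the comparisons above can be reproduced directly from the EATRO 164 read counts reported for the CR3 B block in Table 16. The short Python sketch below is not part of the original analysis; the dictionary name, the helper function and the use of simple read-count ratios are assumptions made here purely to show the arithmetic behind the statements about gB1B2, gB4 and the three gRNAs that further edit the B4 mRNAs.

```python
# EATRO 164 read counts for CR3 B-block populations, as listed in Table 16.
# Each entry maps a gRNA population to (gRNA reads, mRNA reads).
# The variable and function names are illustrative, not from the original work.
eatro_cr3_block_b = {
    "gB1B2": (38_663, 757),
    "gB4": (100, 3_720),
    "gB5e": (803, 2_025),
    "gB6e": (2_359, 689),
    "gB7e": (369, 744),
}


def fold_difference(numerator, denominator):
    """Return the fold difference between two read counts."""
    return numerator / denominator


# gB1B2 is far more abundant than gB4 at the gRNA level ...
grna_fold = fold_difference(
    eatro_cr3_block_b["gB1B2"][0], eatro_cr3_block_b["gB4"][0]
)
# ... yet the B4 editing pattern dominates at the mRNA level.
mrna_fold = fold_difference(
    eatro_cr3_block_b["gB4"][1], eatro_cr3_block_b["gB1B2"][1]
)

# The three gRNAs that can further edit the B4 mRNAs.
successors = ["gB5e", "gB6e", "gB7e"]
most_abundant_grna = max(successors, key=lambda p: eatro_cr3_block_b[p][0])
predominant_mrna = max(successors, key=lambda p: eatro_cr3_block_b[p][1])

print(f"gB1B2 vs gB4 gRNA reads: {grna_fold:.0f}-fold")
print(f"B4 vs B1B2 mRNA reads: {mrna_fold:.1f}-fold")
print(f"Most abundant gRNA: {most_abundant_grna}")
print(f"Predominant mRNA: {predominant_mrna}")
```

Under these counts, the gRNA population that is several hundred-fold more abundant yields roughly five-fold fewer mRNAs, which is the kind of mismatch between gRNA and mRNA abundance described in the text.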
In addition, while the gB5e guide RNA is also abundant in TREU cells, we found no corresponding B5e mRNAs in this cell line. The most significant divergence of the EATRO CR3 editing pattern occurs at the B to C editing block transition. In TREU, all of the different 3’ editing patterns converge and are further edited by the gC guide RNA population. In EATRO cells, the gC population is much less abundant (1,590 reads instead of >17,000). Correspondingly, very few mRNA transcripts were observed that used a gC guide. Instead, the bulk of the B-level transcripts were further edited by gCe (a rare gRNA with only 43 reads detected). The C-block gRNAs also contained the most abundant gRNA detected for CR3 editing, gCe80. This gRNA has close to 500,000 reads in both cell line transcriptomes. It can anchor to all of the B block mRNAs in both the TREU and EATRO pathways but was only observed editing the B5e template in the EATRO cells. Despite the overwhelming abundance of the gCe80 guide RNA population, the corresponding mRNA had the smallest number of detected reads.

Surprisingly, there were very few CR3 gRNA populations in either cell line that had sequence members that could disrupt editing (Figures 33 and 34). In the TREU cell line, the FGtx 5’ end pattern of mRNA editing is generated by a promiscuous gRNA, gFGtxp (originally identified as a CR4 gRNA). This gRNA population is predicted to cause a frameshift just upstream of the anchor region of gFGtxp. However, this frameshift is not observed in the mRNA population, and it appears that this gap is tolerated like those observed in RPS12.

Figure 32. Observed CR3 editing pathways in TREU 667 (A and B) and EATRO 164 (C and D) cell lines. For descriptions of dots and arrows, please see Figure 26. Dashed arrows with open heads represent a hypothetical rewrite. + indicates that more than one mRNA form was condensed into this circle to simplify the figure.
Condensed forms encode largely the same amino acid sequence with only small variants. Terminal dots are colored blue for reading frame 1, or magenta for reading frame 2. Boxed magenta dots have no functional start codon but are translatable into reading frame 2 with the use of an alternative start codon (UUG). Lines connecting dots (B and D) indicate gRNA population size and functionality. Usable gRNAs were considered to be those that correctly edit, utilize C:A base pairs to edit, or generate only small missense mutations.

Table 16. CR3 gRNA population analysis

Editing Block   Population   TREU 667 gRNA Reads   TREU 667 mRNA Reads   EATRO 164 gRNA Reads   EATRO 164 mRNA Reads
A               gA1          100                   5,907                 2,368                  534
A               gA2          1,729                 28,654                19,947                 5,934
B               gB1B2        13,278                10,776                38,663                 757
B               gB3t         1,598                 9,158                 0                      0
B               gB4          945                   9,536                 100                    3,720
B               gB4t'        12                    8,692                 0                      0
B               gB5e         3,800                 0                     803                    2,025
B               gB6e         221                   0                     2,359                  689
B               gB7e         1,376                 0                     369                    744
C               gC           17,045                27,646                1,590                  616
C               gCe          37                    0                     43                     2,079
C               gCe80        538,230               0                     434,196                270
D               gD           1,649                 26,287                30,125                 280
D               gDep         24                    0                     118                    1,981
DE              gDEe80       301                   0                     120                    0
E               gE           9,135                 15,016                15,242                 204
E               gEt'         567                   14,687                582                    0
E               gEt          34                    5,099                 0                      0
E               gEep         14,062                0                     30,982                 1,864
F               gFt          34,666                8,073                 189,828                0
F               gFexp        0                     0                     1,265                  112
FG              gFGtp        14                    1,035                 90                     0
FG              gFGtxp       49,892                2,172                 3,220                  0
FG              gFGep        219                   0                     1,912                  79
FG              gFGe*p       69,138                0                     104,747                691
FG              gFGe80       255                   0                     589                    0
G               gGt          466                   1,231                 21,788                 0
G               gGep         38,001                0                     17,318                 68

Figure 33. Analysis of functionality and abundance of gRNA subpopulations that edit CR3 in TREU 667 cells. For description of axes and gRNA functionality labels, please see Figure 27. gRNAs with shaded names were found in the TREU 667 gRNA transcriptome but were not found to be utilized in editing the TREU 667 CR3 mRNAs, despite being utilized in EATRO 164 cells.

Figure 34. Analysis of functionality and abundance of gRNA subpopulations that edit CR3 in EATRO 164 cells. For description of axes and gRNA functionality labels, please see Figure 27.
gRNAs with shaded names were found in the EATRO 164 gRNA transcriptome but were not found to be utilized in editing the EATRO 164 CR3 mRNAs, despite being utilized in TREU 667 cells.

In the EATRO cell line, only two terminal gRNA populations have predicted editing problems. About half of the gFGe*p subpopulation has a low-quality anchor (<5 nt); however, the subpopulation does have >53,000 reads that have higher quality anchors, so this mutation may be tolerated. The gGep population is also interesting, in that it is predicted to generate a frameshift when editing was simulated by our GUIDE program (Figure 34). This frameshift is not observed in the mRNA transcriptome data, however, because editing appears to stop before reaching the last two sites (Chapter 4).

Discussion

ACORNS and GUIDE are two new tools that can aid in the understanding of the complex dynamics of the kinetoplastid RNA editing system. The ability of ACORNS to cluster related gRNAs proved to be a powerful mechanism for the identification of gRNAs with mutations that disrupt their alignment to fully edited sequence. These analyses allowed us to identify and characterize nearly 2 million additional gRNA transcripts from our transcriptome libraries. In addition, these analyses identified a large cohort of gRNA clusters that are not involved in directing the sequence changes associated with the known canonical transcripts. More than 25% of the gRNA transcripts found in the EATRO and TREU gRNA transcriptomes have no known functional relatives. This strongly suggests that the coding capacity of the mitochondrial genome is much larger than previously thought.

Full characterization of the known gRNA population was also informative. In our initial gRNA characterization study, we had noted the extreme population differences found between the different identified gRNAs [9]. We had hypothesized that the low copy number gRNAs were an artifact of the high stringency of our initial screen.
The cluster analyses suggest that extreme population size differences do exist between the different gRNAs. In both the TREU and EATRO transcriptomes, ~30 clusters were identified with over 500 sequence members each. These clusters accounted for the bulk of the gRNA transcript reads (6,461,870 reads in TREU and 4,539,263 reads in EATRO). In contrast, over 500 clusters were found that contained fewer than 25 sequence members. These tiny clusters account for only ~135,000 transcript reads. The very large numbers of low copy number gRNAs are suggestive of high plasticity in the gRNA encoding minicircles. Studies in Leishmania have suggested that minicircle sequence class frequencies are extremely variable [63,78].

How these huge differences in gRNA abundance influence gRNA selection and use is unclear. In only ~50% of the editing branch points characterized for RPS12, ND7 5’ and CR3 did the abundance of the gRNAs involved correlate, even loosely, with the preferred editing path. Moreover, even when gRNA abundance did align with mRNA abundance, the two were often not proportional. In addition, block editing by an abundant gRNA is often followed by editing using a rare gRNA, with no equivalent drop in editing efficiency. Multiple instances were also observed where highly abundant gRNAs did not appear to be used in one cell line. In one example, the gRNAs responsible for generating the B5e, B6e and B7e mRNA forms of RPS12 are present in both cell lines, but apparently act only in the EATRO cell line, despite all being more than ten times more abundant than their only competitor. It may be that protein factors play a predominant role in gRNA selection. One study examined the role of the RNA editing mediator complex (REMC), which is heterogeneous and consists of one primary subunit, TbRGG2, that formed associations with either MRB8170 or MRB8180 [161].
They showed that depletion of MRB8180 caused global effects on RNA editing, but depletion of MRB8170 had transcript-specific effects, substantially increasing the amount of pre-edited RPS12 transcripts, but not significantly affecting the amount of pre-edited ND7 5’ transcripts. This specificity is intriguing, and it is possible that the REMC or other protein factors are involved in gRNA selection.

The identification of all relatives of conventional gRNAs, including those that had undergone a mutational event, also allowed us to more carefully characterize the effect of mutational noise on the RNA editing system. These analyses indicate that the RNA editing system can tolerate more mutational noise than we had originally hypothesized. Surprisingly, a large number of the gRNA populations we characterized had base pair mismatches with their best aligned mRNA transcripts. In this study, we saw that while most of the mismatches were specifically C:A base pairs, almost every other mismatch was also observed. Even gaps in the alignment of the gRNA to the mRNA appeared to be tolerated. In almost all cases, however, the mismatched base pair and/or gap appeared to be stabilized by multiple flanking G:C base pairs. While we cannot rule out the possibility that rare, perfect match gRNAs do exist for these regions, it may be that the incompletely base paired interaction is the most stable structure possible, and hence is generated by that gRNA [162]. While a minimal number of non-paired nucleotides does appear to be tolerated within the guiding region, mismatches or non-Watson-Crick base pairs within the anchor region do not appear to be tolerated, greatly affecting the efficiency of the RNA editing process [119].

In a large number of the gRNA populations containing alignment mismatches, our GUIDE program predicted that the mismatch would drive the generation of an alternative edit. However, these alternative edits were not observed in the mRNA transcriptome data.
These data contradict one of the most prominent models of RNA editing progression, known as the “mismatch recognition” model [33]. In this model, when the gRNA/mRNA duplex initially forms, the editosome proceeds to edit beginning at the first mismatch site closest to the anchor binding region. Once this mismatch is resolved either by the insertion or deletion of a uridine, the next mismatch site is edited, and no sites further along will be edited until the sites nearest the anchor binding region are resolved. When GUIDE predicts editing patterns, it follows this model, predicting each editing site, moving from the anchor binding region towards the poly-U tail. If editing did follow this strict mismatch recognition model, the alternative sequences predicted by GUIDE should have been observed in the mRNA transcriptome data.

An alternative model suggests that RNA editing occurs via a more “dynamic interaction” [162]. This model proposes that when a gRNA/mRNA duplex forms, the editosome targets regions of the duplex with low thermodynamic stability and edits those regions. As editing progresses, the duplex realigns, changing the targets of the editing system. These cycles of progressive realignment proceed until the gRNA/mRNA duplex reaches maximum stability. In this way, RNA editing does not necessarily proceed in a strict 3’ to 5’ directional manner. This model suggests that mismatches and gaps in alignments can be tolerated because they do not significantly impact the stability of the final gRNA/mRNA duplex. Supporting this are the frequent observations of neighboring G:C base pairs, which would substantially enhance the stability of these mismatches.

The “dynamic interaction” model is further supported by the existence of “junction regions” in partially edited mRNAs [161–165,156,166,148,160]. These regions adjoin the unedited and fully edited regions of a partially edited mRNA, but do not match either the unedited or fully edited sequence.
Junction regions vary significantly across partially edited transcripts, possessing no consensus sequence. These regions can vary in size, and depletion of different protein factors affects their occurrence and length, but the presence of junction regions remains ubiquitous across all partially edited mRNAs [161–165,156,166,148,160]. The mismatch recognition model reconciles the presence of junction regions as areas of mis-editing (utilization of the wrong gRNA or a misaligned gRNA); hence all junction regions would have a gRNA capable of generating the sequence. In our characterization of the editing pathways of RPS12, ND7 5’ and CR3, highly abundant mRNA sequences were screened against the gRNA transcriptomes at low stringency levels in order to identify true alternative edits. While we were able to identify a number of gRNAs that could direct alternative edits, a large number of “junction sequences” were identified that do not match any gRNA in our databases. If the “dynamic interaction” model is correct, these multiple variable junction sequences could be generated by the same gRNA during the editing process.

Alternative base pairs have also been shown to be tolerated to different extents in an in vitro gRNA-directed deletion assay [151]. In this study, substitutions were made of the nucleotides immediately upstream of a deletion site. This study found that when the base pair upstream of the deletion site was C:A, C:U or C:C, the site was still found deleted in the mRNAs. Deletions were not observed when the base pair was G:A or G:G.

Another facet of gRNA utilization is the existence of promiscuous gRNAs, that is, gRNAs that edit more than one target. In this analysis we showed many populations that edit more than one gene, and one population editing the same gene in two different locations (gFp of RPS12).
Interestingly, most of the promiscuous populations were found in the CR3 data set, and these promiscuous gRNAs were the only productive promiscuous gRNAs identified. No promiscuous populations were found editing ND7 5’, and the only promiscuous gRNAs found to edit RPS12 lead to editing pathway dead ends. Predictions made by GUIDE indicate that many of the promiscuous gRNAs generating the CR3 EATRO-specific pathways should also be able to productively edit in TREU. Because these are promiscuous gRNAs, it is possible that their availability is impacted by their use in editing other transcripts. A full mRNA transcriptome would allow a full analysis of the global impacts of editing on any particular gRNA cluster.

This study revealed the surprising amount of noise and errors that are tolerated in the RNA editing system of Trypanosoma brucei. Some questions still remain, such as the functions of the large proportion of unknown gRNAs, how gRNAs are selected, and the true extent of gRNA promiscuity. In order to answer these questions, we believe that full deep sequencing of all edited mRNAs, paired with gRNA transcriptomes of multiple cell lines, would shed more light on this complicated situation.

Acknowledgements

We would like to thank the Ken Stuart Lab for the trypanosome cell lines and Chris Adami for his assistance in the conceptualization of this work.

CHAPTER 6: SUMMARY AND DISCUSSION

Introduction

Trypanosoma brucei is one of the few organisms that utilizes the kinetoplastid RNA editing system. This system seems unnecessarily complex, using two genetic components to generate one fully functional product. Moreover, this system is prone to malfunction; each gRNA generates the anchoring region for the next gRNA, making the system sequentially dependent. This makes the mutation or loss of any gRNA along the editing pathway extremely detrimental, especially considering that two of the twelve edited genes are essential [17,100].
This problem is made worse by the fact that some gRNAs are incredibly rare, and during replication, the 5,000-10,000 minicircles encoding the gRNAs are divided asymmetrically, making minicircle loss not only possible, but routine [8,167]. This system should not work, but it does. Kinetoplastids are some of the most successful parasites on Earth, infecting insects, plants, mammals, fish, birds, and reptiles [168]. The study of this editing system inevitably brings up questions of how this system evolved and how it continues to be maintained.

The concept of drift robustness begins to explain this surprising amount of fragile complexity [80]. Drift robustness is a form of genetic robustness that allows an organism to be protected from extreme events of genetic drift by making mutations either neutral or lethal. T. brucei frequently undergoes population bottlenecks throughout its life cycle, and as its mitochondrion is completely asexual, it seems particularly prone to genetic drift [22]. The fragility of the RNA editing system seems to coincide remarkably well with the idea of drift robustness. By rendering mutations or loss of minicircles lethal, the system would prevent accumulation of slightly deleterious mutants in the population. But such a system should suffer from another disadvantage. In a system where mutations are lethal or neutral, how can the organisms continue to evolve?

In the first examination of the gRNA transcriptome, many gRNAs were identified that were capable of generating alternative edits [9]. These edits ranged from having no effect on the protein sequence to causing a frameshift and altering a large portion of the protein. These findings sparked this project, which sought to understand the impact of alternative editing on genetic integrity, developmental regulation, protein diversity and editing efficiency.
These goals were accomplished through the generation of the bloodstream gRNA transcriptome and comparative analysis of it with the insect stage gRNA transcriptome, analysis of dual-coding genes that utilize alternative edits to access multiple reading frames, the generation of libraries of putative dual-coding mRNAs at different states of editing, and analysis of gRNA population diversity and the impact of that diversity on the editing system as a whole.

Summary of Chapter 2

This chapter characterized the gRNA transcriptome of EATRO 164 bloodstream stage Trypanosoma brucei and compared it to the gRNA transcriptome of procyclic stage trypanosomes. As with the procyclic gRNA transcriptome, conventionally accepted fully edited mRNA sequences were used to identify the gRNAs, and a comparison of the two life cycle transcriptomes shows a 3.5:1 ratio of procyclic to bloodstream gRNA reads. This ratio varies significantly by gene and by gRNA populations within genes. The variation in the abundance of the initiating gRNAs for each gene, however, displays a trend that correlates with the developmental pattern of edited gene expression. Surprisingly, there were very few gRNAs found in both transcriptomes, but there were many gRNAs that appeared to be related between transcriptomes. Comparing these related major classes from each transcriptome revealed a median value of ten single nucleotide variations per major class. Nucleotide variations were much less likely to occur in the consecutive Watson-Crick anchor region, indicating a very strong bias against G:U base pairs in this region. In spite of the variation we saw between related gRNAs, we did find several conserved gRNA characteristics, such as transcription start site sequence, length of complementarity, and non-base pairing nucleotides at the 5’ and 3’ ends of gRNAs.
Overall, gRNA coverage of edited mRNAs as well as overlap between adjacent gRNAs was lower in the bloodstream gRNA transcriptome than in the procyclic gRNA transcriptome. This work indicates that gRNAs are expressed during both life cycle stages, and that the differences in the extent of editing previously reported for different mRNA transcripts are not due to the presence or absence of gRNAs. However, the abundance of the initiating gRNAs may be important in the developmental regulation of RNA editing.

Summary of Chapter 3

In this work, we show that many of the mitochondrially edited mRNAs in T. brucei can alter the choice of open reading frame by alternative editing of the 5’ end. Dual-coding genes have specific mutational biases, such as an increase in the ratio of nonsynonymous to synonymous mutations, and an increase in mutational frequency overall. Analyses of mutational bias of all mitochondrial genes indicate that six of the pan-edited genes may be dual-coding. These analyses include measuring the conservation of editing patterns between T. brucei and T. vivax transcripts and the frequencies of different types of mutations. These data were used in a principal component analysis, which showed a distinct difference between alternative reading frames of dual-coding genes and single-coding genes. Discovery of alternative gRNAs reveals that RNA editing can allow access to both reading frames. We predicted the functions of two of these alternative reading frames as small metabolite transmembrane transporters. We hypothesize that dual-coding genes can protect genetic information by overlapping genes that are under selection at different portions of the life cycle.

Summary of Chapter 4

In this study, we analyzed the editing patterns of three putative dual-coding genes, ribosomal protein S12, the 5’ editing domain of NADH dehydrogenase subunit 7, and C-rich region 3, and constructed detailed editing pathway maps using mRNA and gRNA transcriptome data.
While editing of RPS12 showed only transcripts that produce the canonical RPS12 protein, we did observe a second downstream start codon capable of producing the alternative protein, if selected by the ribosome. In ND7 5’ and CR3, we found evidence that both of these transcripts are edited to express protein products in more than one reading frame. Moreover, we found that CR3 has a very complex set of highly branched editing pathways that vary significantly between cell lines, with a different set of gRNAs being used in each cell line, despite both sets of gRNAs being present in both cell lines. We also found that changing the energy source available to cells alters the editing preferences of both CR3 and ND7 5’. In addition to this, we found evidence that the poly-U tail that is added post-transcriptionally to gRNAs may also be used in editing. These findings suggest that these reading frames can be alternatively selected based on the current environment, and that alternative editing may be a way for the trypanosomes to continue evolving this rigid editing system.

Summary of Chapter 5

In the analysis of the gRNA transcriptomes, gRNAs were identified based on complementarity to edited mRNAs. While millions of gRNAs were found using this method, we discovered that many gRNAs were still left unidentified. This high proportion of unidentified gRNAs was shocking, as we had predicted that the RNA editing system should be intolerant of mutations, and the presence of so many unidentified gRNAs meant that many editing pathways had not been characterized, or that many mutated gRNAs were also present. To determine the identity and function of these gRNAs, two new programs were created, ACORNS and GUIDE. The first program functions to group related gRNAs into clusters, where each cluster has significant sequence conservation.
This analysis showed that more than half of all unidentified gRNAs were not related to any functionally known gRNAs and could be generating uncharacterized alternative editing patterns. However, there were still many gRNAs that were related to previously identified gRNAs. These gRNAs could be capable of disrupting the editing process or generating small alternative edits that are tolerated. In order to investigate this, our second program, GUIDE, examined the gRNAs responsible for generating the editing pathways of RPS12, ND7 5’ and CR3, as well as their previously unidentified relatives. Using the defined gRNA clusters, GUIDE screened all members editing these three genes against their targets to determine what proportion of each family was able to productively edit. The initial analyses of these previously unidentified gRNAs revealed that nearly all were predicted to disrupt the editing system, but in our examination of the mRNA data, we found this not to be the case.

By combining the analyses of the GUIDE program and the mRNA transcriptome data, we learned more about the robustness of the RNA editing system. In examining the RPS12 gRNAs, we found that single mismatches or gaps were highly tolerated in gRNA alignments, with only very small drops in editing efficiency. In the examination of the ND7 5’ gRNAs, however, we identified the limit of this tolerance. Significant drops in editing efficiency were observed when gRNAs either possessed multiple mismatches or gaps, or possessed mismatches that disrupted the anchor binding region. Finally, in our analysis of the CR3 gRNAs, we observed many gRNA populations of high abundance that were apparently not used to edit. We found that RNA editing preference does not correlate positively or negatively with gRNA abundance, and in most cases, the apparently unused gRNA populations had no issues in their mRNA alignments.
We predict that these gRNAs may be in use elsewhere in the editing system, and to fully understand this system, we recommend that a full mRNA transcriptome be analyzed.

Genetic Integrity

The introduction and maintenance of kinetoplastid RNA editing

As previously mentioned, the sheer size and complexity of the RNA editing system have left many searching for answers as to how it evolved and how it has been maintained. Many hypotheses have been proposed, such as that it is a relic of the old RNA world, that it is a product of constructive neutral evolution, or that the system co-evolved with G-quadruplex structures that served to protect the genetic information [69–74]. As for how it has been maintained, one prominent theory is that RNA editing is advantageous because it is a mechanism by which an organism can fragment and scatter essential genetic information throughout a genome [75,76]. Because kinetoplast DNA is less stable than chromosomal DNA, and minicircles are frequently lost due to asymmetric division, this hypothesis suggests that scattering essential guide RNA genes throughout the DNA network would prevent fast growing deletion mutants from outcompeting more metabolically versatile parasites during growth in the mammalian host [76,77].

We propose that the RNA editing system and its inherent fragility operate as a mechanism to weed out deleterious mutations by making them lethal, as a form of drift robustness. Drift robustness is not adaptive, however, and prevents the population from generating beneficial mutations as well. This strategy leaves organisms no options for evolving. We suggest that alternative editing pathways, such as those seen in CR3 and others previously observed, enable this evolution without compromising the rigid conservation of other genes such as the essential RPS12 [68].
Supporting this is the fact that, of the kinetoplastids, Trypanosoma have some of the harshest life cycles in terms of maintaining genetic integrity, with the electron transport genes under very relaxed selection in the glucose-rich bloodstream stage, and very strict selection imposed in the insect stage. In conjunction with this, we observe that Trypanosoma brucei, Trypanosoma cruzi, and Trypanosoma vivax all maintain more edited genes and more extensive editing than their kinetoplastid counterparts with milder life cycles [4,16,19–21,52–56,58,59]. For example, Phytomonas serpens infects important crops and is transmitted by sap-feeding bugs. These parasites have glucose readily available in both life cycle stages and are unique in that they lack a fully functional respiratory electron transport chain. This species is also missing two edited genes entirely; it pan-edits six genes and partially edits three genes, compared to the nine pan-edited and three partially edited genes in the Trypanosoma spp. [64,65]. For Leishmania spp., all life cycle stages possess an active Krebs cycle and ETC linked to the generation of ATP, but these cells are never restricted from access to glucose [61,62,139]. Leishmania do not pan-edit ATPase 6 or ND7, but only partially edit them [63]. These observations suggest that RNA editing provides a larger advantage to organisms with a more complex life cycle.

Dual-coding and dual-function genes

One oversight in the hypotheses that attempt to explain the advantage of kinetoplastid RNA editing is that they do not account for how genetic material that is not under selection is maintained. There has been considerable debate on the necessity of Complex I subunits for either stage of the trypanosome life cycle.
Studies using RNAi and knockout cell lines of nuclear-encoded members of Complex I have shown that the complex is unnecessary for survival in either life cycle stage, and in this work we were unable to find complete gRNA coverage of the edited ND subunits in the bloodstream stage, despite the ND subunits generally being more fully edited or only fully edited in the bloodstream stage [5,26–29,111,112]. However, the nuclear-encoded Complex I member genes are maintained [42], and the vast majority of the gRNAs required to edit mitochondrially encoded ND subunits were found in both life cycle stages. Together, this evidence suggests that the Complex I proteins are vulnerable to genetic drift but have somehow been maintained. In this work, we propose that by overlapping Complex I genes not under strict selection with genes that are under selection, the accumulation of mutations can be prevented. Because these overlapped genes share most gRNAs, and alternative edits only occur in the terminal gRNAs, this strategy ensures that almost all of the genetic material is protected. The genes predicted to be dual-coding based on our mutational bias analysis include almost all of the edited ND transcripts, excluding only ND8. In the examination of procyclic ND7 5’ and CR3 (putative ND4L), we found that these two genes are alternatively edited to produce more than one protein product [120]. This evidence shows not only that dual-coding genes are being utilized in T. brucei, but also that RNA editing is facilitating the use of these dual-coding genes, demonstrating a concrete advantage of this type of RNA editing. Based on limited sequence homology, we hypothesize that the trypanosome mitochondrial alternative reading frames (ARFs) encode small metabolite transporters that provide a distinct growth advantage to bloodstream form parasites. These proteins would function differently from all other edited proteins, which do not function as transporters.
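The idea that an alternative edit can give access to a different reading frame can be illustrated with a minimal Python sketch. It is not the analysis pipeline used in this work. The function name `open_reading_frames` and both toy sequences are hypothetical, and the sketch assumes that only UAA and UAG act as stop codons, with UGA read as tryptophan as in NCBI translation table 4.

```python
# Stop codons assumed here: UAA and UAG only. UGA is treated as a sense
# codon (tryptophan), as in NCBI translation table 4.
STOP_CODONS = {"UAA", "UAG"}
START_CODON = "AUG"


def open_reading_frames(mrna):
    """List AUG-initiated ORFs in the three forward frames of an mRNA.

    Each ORF is a dict with the frame (0, 1 or 2), the 0-based start index,
    the end index (exclusive, including the stop codon when present), the
    number of sense codons, and whether an in-frame stop codon was found.
    """
    seq = mrna.upper().replace("T", "U")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append({"frame": frame, "start": start, "end": i + 3,
                             "codons": (i - start) // 3, "terminated": True})
                start = None
        if start is not None:
            # Ran off the 3' end without meeting an in-frame stop codon.
            n = (len(seq) - start) // 3
            orfs.append({"frame": frame, "start": start, "end": start + 3 * n,
                         "codons": n, "terminated": False})
    return orfs


# Hypothetical toy sequences (not taken from any T. brucei transcript).
# The lowercase u marks a single inserted uridine.
canonical = "AUGGUUUGUUAAGCAUAUAG"
alternative = "AUGGUUuUGUUAAGCAUAUAG"

for label, seq in (("canonical", canonical), ("alternative", alternative)):
    for orf in open_reading_frames(seq):
        print(label, orf)
```

In this toy case, the single inserted uridine moves the start codon out of frame with the original stop codon, so the same downstream nucleotides are read as a longer, different product.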
While it has been previously suggested that both alternative editing and dual-function proteins are important mechanisms for expanding the functional diversity of proteins found in trypanosomes, a duplication event could easily alleviate the evolutionary constraints imposed by dual-coding genes [67,97–99]. We maintain that in salivarian trypanosomes, these genes must provide the additional benefit of protecting genetic information in order to continue to be overlapped. Protection of the mitochondrial genome during growth in the mammal would increase the capacity for successful transfer to an insect vector and maximize the parasite’s long-term survival and spread. Analyses of other trypanosomes do show that some of the ARFs have intriguing homology to the ARFs identified in T. brucei and T. vivax. However, most of the ARFs are punctuated with stop codons. It is possible that these genes have since lost their function and are no longer required to be dual-coding due to the reduced selective pressures endured during their life cycles, or it is possible that these stop codons are removed by alternative editing events. In addition to the dual-coding genes we have described, another dual-coding gene, COIII, is accessed through alternative editing, by connecting an unedited 5’ reading frame with an edited 3’ reading frame, and the alternative protein produced, AEP-1, has been shown to be essential [67,68]. There are also known Krebs cycle proteins (α-ketoglutarate dehydrogenase E2 and α-ketoglutarate decarboxylase) that have two functions, rendering them protected while they are not under selection as well [97,98]. These data show that trypanosomes are utilizing dual-coding and dual-function genes to protect genetic integrity. Another set of dual-function genes appears among the gRNAs, as promiscuous gRNAs.
The editing pathways of CR3 are littered with gRNAs identified to edit other genes, such as ND8 and the 3’ editing domain of ND7, both of which were not predicted to be dual-coding and are under less protection. We believe that this is yet another mechanism of increasing the drift robustness of T. brucei through the use of alternative RNA editing.

Developmental Regulation

In the examination of the bloodstream gRNA transcriptome, we found that the abundance of the initiating gRNAs in the procyclic and bloodstream gRNA transcriptomes is correlated with the developmental editing patterns of the genes they edit. However, we cannot rule out the possibility that not all of the populations of initiating gRNAs were identified. We identified alternative initiating gRNAs for CR3, and without deep sequencing all pan-edited mRNAs, we cannot know that others do not exist. It has been previously reported that gRNA presence did not correlate with developmental RNA editing patterns in T. brucei [50,51]. That report, however, was based on the observation of a very limited number of gRNAs. We found that, for the most part, this held true, with gRNA populations having similar relative abundances across both transcriptomes, with the exception of the initiating gRNAs. In our examination of the editing pathways of RPS12, ND7 5’, and CR3, we found many editing patterns that were only observed or were more prominent in only one cell line. This was most prevalent in CR3, where the editing pathway of the TREU cells is almost completely different from that of the EATRO cells. Curiously, the gRNAs required for both pathways are present in the transcriptomes of both cell lines in relatively equal abundances. Additionally, another exclusive editing pathway was discovered when the EATRO cells were moved into glucose-depleted medium. This pathway produced a transcript that would make a unique protein, totally different from all other CR3 protein products.
These gRNAs appear functional in both cell lines but are perhaps being used to edit a different gene when they are not observed to be editing CR3. Indeed, many of the EATRO-specific CR3 gRNAs are promiscuous gRNAs known to edit other genes. To better understand the complexities of gRNA selection, we examined the editing branch points of the RPS12, ND7 5’ and CR3 pathways. Examination of editing branch points revealed that the abundance of the gRNAs involved did not correlate with the observed editing preference. This work has raised many questions about how gRNAs are selected and how editing pattern preferences are exerted, and more study is needed to understand this system.

Protein Diversity

In the examination of the RPS12 mRNAs, we found one major alternative editing event. This event was observed in the EATRO 164 cell line only and causes a frameshift that extends the reading frame at the 3’ end by nine amino acids. This same alternative edit was previously described in the 29-13 strain as well, which shows that this edit is not an isolated occurrence in the EATRO 164 strain [148]. Frameshifting gRNAs were also identified to edit ATPase 6 in both bloodstream and procyclic EATRO 164 cells. The predicted frameshifts also occur close to the 3’ end of the transcript and alter the C terminus of the protein. As the frameshifts occur downstream of the highly conserved amino acid region involved in proton translocation, it may be that this is also tolerated [31]. Edits of the ND7 5’ mRNAs generate transcripts that can translate into two reading frames in TREU 667 and EATRO 164 cells. Interestingly, this gene was also sequenced in 29-13 cells [148]. That study indicated that a large proportion of the fully edited ND7 5’ transcripts had a single nucleotide difference in the 5’ UTR. As the upstream start codon that allows the ARF to be translated is in what is known as the 5’ UTR, this difference could be the same alternative edit that generates the ARF transcripts.
This suggests that this alternative editing may be widespread. Like ND7 5’, we also detected evidence of dual-coding in CR3. The alternative edits in both cell lines produce multiple variations of the CR3 proteins. While the editing efficiencies of CR3 are very low, possibly preventing the generation of these CR3 proteins, we propose that this mechanism of branched editing pathways is a safe way for the trypanosomes to introduce variation into the rigid system and continue to evolve. In addition to the high level of alternative editing observed in CR3, in our re-examination of the procyclic EATRO 164 and TREU 667 gRNA transcriptomes, we found that many gRNAs in the existing transcriptomes still have no known function. These gRNAs may be generating alternative editing events, thus further increasing the protein diversity of T. brucei.

Editing Efficiency

The editing efficiencies of all three genes we examined were surprisingly low. The efficiencies of TREU cell line mRNAs were all less than ten percent, while those of the EATRO 164 cells grown in SDM79 were all less than one percent. With CR3 and RPS12, more than 80% of mRNAs were completely unedited, while that proportion was much lower in ND7 5’. We observed many factors that had the potential to affect the overall editing efficiency.

Mutations and non-canonical base pairs

Because gRNAs utilize both canonical (Watson-Crick) and G:U base-pairing to direct the change in sequence, most transition mutations in the gRNA would not lead to changes in the mRNA sequence and would not be selected against [33]. In our observations of the gRNA transcriptome, we found a very strong bias against A to G transitions in the anchor regions of the gRNAs, suggesting that G:U base-pairing is not well tolerated in this region. However, we also found many populations of gRNAs where non-canonical base pairs appeared to be tolerated, even in the essential genes ATPase 6 and RPS12 [17,100,107,110].
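The pairing categories discussed above can be made concrete with a short Python sketch. It classifies each position of a pre-aligned mRNA/gRNA duplex using the same symbols as the alignments in Appendix B: `|` for Watson-Crick, `:` for G:U, and `#` for a mismatch. The function names are hypothetical. Reading T as U, and leaving gap (`-`) and deletion-site (`*`) columns blank, are assumptions made here for illustration only.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}
GAP_CHARS = {"-", "*"}  # assumption: gaps and deletion sites are left blank


def pair_symbol(mrna_base, grna_base):
    """Classify one mRNA/gRNA position as '|', ':', '#', or ' ' for a gap."""
    if mrna_base in GAP_CHARS or grna_base in GAP_CHARS:
        return " "
    # Lowercase u (inserted uridine) and T (DNA-style gRNA) are both read as U.
    pair = (mrna_base.upper().replace("T", "U"),
            grna_base.upper().replace("T", "U"))
    if pair in WATSON_CRICK:
        return "|"
    if pair in WOBBLE:
        return ":"
    return "#"


def annotate_duplex(mrna_5to3, grna_3to5):
    """Return one symbol per column for two equal-length, pre-aligned strands.

    The gRNA is written 3' to 5' so that column i of each string is paired.
    """
    if len(mrna_5to3) != len(grna_3to5):
        raise ValueError("strands must be pre-aligned to equal length")
    return "".join(pair_symbol(m, g) for m, g in zip(mrna_5to3, grna_3to5))


def count_pairs(symbols):
    """Tally the pairing categories in an annotation string."""
    return {"watson_crick": symbols.count("|"),
            "wobble": symbols.count(":"),
            "mismatch": symbols.count("#")}


# Hypothetical five-column duplex, not taken from the alignments in this work.
symbols = annotate_duplex("GuuAC", "CAGTA")
print(symbols, count_pairs(symbols))
```

Under these rules, an A to G transition in a gRNA position opposite a U changes the column from `|` to `:` rather than to `#`, which is why such a mutation would not alter the edited mRNA sequence.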
In the bloodstream database, a gRNA population with a C:U base pair must be tolerated to be able to complete the editing pathway of ATPase 6, and in RPS12, we observed a gRNA population that requires toleration of a gap in the alignment to be capable of full editing. In addition to this, RPS12 and ND7 5’ possess two highly cytosine-rich regions that typically have poor gRNA coverage. These regions both encode conserved amino acids vital to the functions of the proteins. When we deep sequenced these regions of the mRNAs, we had hoped to find alternative sequences that allowed us to identify a more abundant population to carry out these edits, but we only found the previously described editing patterns. Using these sequences and performing a search of the gRNA databases at a low stringency yielded populations with imperfect anchors, requiring non-canonical base pairs to be tolerated. In RPS12, this did not appear to affect editing efficiency, but in ND7 5’, the effects were severe. However, the gRNA population identified to edit this region of ND7 5’ had multiple mismatches and gaps in both cell lines, whereas the RPS12 populations did not. These examples are far from isolated incidents. Many other populations were identified where non-canonical base pairs were required to generate fully edited mRNA sequences. In most cases, these alternative base pairs do not seem to affect the editing efficiency. The most prominent mismatch was the C:A base pair, but almost every other mismatch was observed. Even gaps in the alignment of the gRNA to the mRNA seem to be tolerated. Of note, we did observe that most mismatches were flanked by nearby G:C base pairs that may have assisted in stabilizing the alignments. The use of alternative base pairs has been previously shown to be tolerated to different extents, with the use of C:A, C:C and C:U base pairs not completely disrupting editing [151].
This evidence suggests that while the RNA editing system will tolerate some amount of illegal base pairing, there is a limit to what will be tolerated. The use of non-canonical base pairs in RNA editing supports the model of editing known as the “dynamic interaction” model [162]. In this model, editing of an mRNA in a gRNA duplex does not proceed in a site-by-site fashion; instead, the editosome targets regions of low stability first and continues to edit and re-edit the mRNA until a thermodynamically stable gRNA/mRNA duplex is achieved. In this model, non-canonical base pairs may be tolerated if they do not significantly disrupt the stability of the duplex.

Overwriting

In our exploration of partially edited mRNAs, we found that RNA editing is not always strictly sequential. While editing overall proceeds from 3’ to 5’ across the editing domain, we have found evidence that gRNAs can overwrite editing that has been previously generated. Another study, which examined the use of previously identified alternative RPS12 gRNAs, found that the three gRNAs in question were being utilized and that, although they did not generate the conventional editing patterns, a very small number of transcripts returned to the canonical editing pattern after the alternative edits [148]. This last observation may be another instance of gRNA overwriting. Overwriting is another source of noise that lowers overall editing efficiency.

Future Work

In this work, we proposed the hypothesis that T. brucei utilizes dual-coding genes to protect vulnerable genetic material from genetic drift and accesses the overlapped reading frames through alternative RNA editing. In order to test this hypothesis, we should confirm the presence of the alternative protein products in vivo and knock down these products to determine if they do provide a benefit to the cells. Knocking down edited transcripts has proved difficult, but not impossible.
One study was able to genetically engineer an artificial site-specific RNA endonuclease to target the ATPase 6 edited mRNA [110]. This engineering requires a specific eight-nucleotide target sequence. Using this technique would require finding a target sequence that is specific to the alternatively edited mRNAs and leaves intact the mRNAs expressing the canonical ETC proteins, which could easily prove difficult, as we have found that these sequences can be quite similar. Another approach to this problem would be to sequence the dual-coding transcripts from cells grown under a variety of energetic conditions. Varying the conditions and identifying those in which the alternative proteins are most prominently expressed could help elucidate the functions of these proteins. Once a baseline for each set of conditions was determined, the artificial site-specific RNA endonucleases could be engineered to target the unedited transcripts for each dual-coding gene, and the effects on cell growth in the varied conditions could be observed. This could give insight into the importance of the two different overlapped proteins at different points in the trypanosome life cycle. Another direction to pursue is understanding the mechanisms of gRNA selection. We observed many promiscuous gRNAs, and it is possible that even more promiscuity exists in the system. To better understand this, it would be necessary to deep sequence all of the edited mRNAs and have paired gRNA transcriptome data. Ideally, this would be done in multiple cell lines and under various energetic conditions, as we saw significant variation in our two cell lines and media conditions. This would allow us to see the full picture of alternative mRNA editing and gRNA usage. Then we could begin to determine how much gRNA promiscuity plays a role in gRNA selection, or if other factors are in play.
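The target-sequence search described earlier in this section, for an eight-nucleotide site present in an alternatively edited mRNA but absent from the canonical transcripts, can be sketched in Python as follows. This addresses only the sequence-uniqueness requirement and is not a design tool for the endonuclease itself. The function names and the toy sequences are hypothetical.

```python
def _normalize(seq):
    """Uppercase the sequence and read T as U."""
    return seq.upper().replace("T", "U")


def kmers(seq, k=8):
    """Return the set of all k-nucleotide windows in a sequence."""
    seq = _normalize(seq)
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_targets(alternative, canonical, k=8):
    """Find k-mers in `alternative` that occur in none of the `canonical` sequences.

    Returns (position, k-mer) tuples in order of first occurrence, with
    0-based positions relative to the 5' end of `alternative`.
    """
    excluded = set()
    for seq in canonical:
        excluded |= kmers(seq, k)
    alt = _normalize(alternative)
    seen = set()
    hits = []
    for i in range(len(alt) - k + 1):
        kmer = alt[i:i + k]
        if kmer not in excluded and kmer not in seen:
            seen.add(kmer)
            hits.append((i, kmer))
    return hits


# Hypothetical toy sequences differing by a single inserted uridine.
canonical_mrna = "AUGUUUGUAUUGUGA"
alternative_mrna = "AUGUUUGUuAUUGUGA"

for position, site in unique_targets(alternative_mrna, [canonical_mrna]):
    print(position, site)
```

Because a single inserted or deleted uridine changes only the windows that span it, closely related transcripts offer few candidate sites, which is consistent with the difficulty anticipated above.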
Conclusion

This work found that alternative editing is pervasive in Trypanosoma brucei and brought to light the use of overlapped reading frames, providing another strong reason for the utility and maintenance of RNA editing. This work also showed that the RNA editing system is surprisingly robust and tolerant of mutational noise.

APPENDICES

APPENDIX A. Quantification of the number of identified bloodstream and procyclic gRNA transcripts that cover a respective nucleotide in the fully edited mRNA. Bloodstream gRNAs are shown in dark gray and procyclic gRNAs are shown in light gray. Nucleotides and deletion sites were both numbered as edited positions in the mRNA transcripts starting from the 5’ end (+1=0). Boxes indicate the positions of identified populations of gRNAs (coverage ranges shown in parentheses). Boxes with dark gray or light gray diagonal stripes indicate populations identified only in the bloodstream or procyclic transcriptomes, respectively. A. ATPase subunit 6; B. Cytochrome oxidase III; C. C-rich region 3; D. C-rich region 4; E. NADH dehydrogenase subunit 3; F. NADH dehydrogenase subunit 7; G. NADH dehydrogenase subunit 8; H. NADH dehydrogenase subunit 9; I. Ribosomal Protein S12. All individual data points were designated with solid circles. Close overlapping of individual data points generates the observed solid lines.

APPENDIX B. Alignment of the mitochondrial fully edited mRNAs and the most abundant gRNAs required for full coverage identified in the bloodstream (blue) and procyclic (gray) life cycle stages. Conservative mutations between gRNAs are shown in green and mutations that disrupt alignment are shown in red.
Lowercase u’s indicate uridylates added by editing; asterisks indicate encoded uridylates deleted during editing. Nucleotides and deletion sites in the fully edited mRNA were numbered starting from the 5’ end (+1=0). Watson-Crick (|) and G:U (:) base pairs are indicated. Mismatches are indicated by the number sign (#). A) ATPase 6; B) Cytochrome Oxidase III; C) C-Rich Region 3; D) C-Rich Region 4; E) Cytochrome b; F) Maxicircle Unidentified Reading Frame II (Murf II); G) NADH Dehydrogenase Subunit 3; H) NADH Dehydrogenase Subunit 7; I) NADH Dehydrogenase Subunit 8; J) NADH Dehydrogenase Subunit 9; K) Ribosomal Protein S12.

A) ATPase subunit 6
[Alignment figure: fully edited ATPase subunit 6 mRNA with procyclic (pA6) and bloodstream (bsA6) gRNAs, followed by an alternate initiating gRNA (procyclic transcriptome only)]

B) Cytochrome Oxidase III
[Alignment figure: fully edited Cytochrome Oxidase III mRNA with procyclic (pCO3) and bloodstream (bsCO3) gRNAs]

C) C-rich region 3
[Alignment figure: fully edited C-rich region 3 mRNA with procyclic (pCR3) and bloodstream (bsCR3) gRNAs]

D) C-rich region 4. Resequencing of the mRNA indicated that there were 2 errors in the original sequence (yellow highlights).
0 10 20 30 40 50 60 70 80 90 UAAUUUAUUGUUAUCUUUGUGUAUUUAUUAuuAuuuuAuuuuAAuuuuGGuuGuGC***AuuuuuuuuuuuuuuAuuuG***GuG*UGuuuGuGuuuuA* |||:||:|:||||:||:||||:|:||||||| |||||| pCR4(25-64) 12TATATATAGTAGTGAAATGAAGTTAAGATCAACACG---TAAAAATA 5’ bsCR4(25-64) 14TATATATAGTAGTAAAATGAAGTTAAGATCAACACG---TAAAAATAATA 5’ ::|::|:| |::|:|:|||:|:|||:||: ||: |||:||||||||| pCR4(48-103) 12TTTAGTATG---TGGAGAGAAAGAGAATGAAT---CAT-ACAGACACAAAAT- bsCR4(48-103) 12TTTAGTATG---TAGAGAGAAAAAGAATGAAT---CAT-ACAGACACAAAAT- :|||:||:||:| pCR4(87-134) 10TTAAATACGAAGT- bsCR4(93-142) 05TATTAAAGT- 100 110 120 130 140 150 160 170 180 190 UGuA*C*A*GuuuAuGGuAuAuuuuAuuGuuGuuuuGuuuuuuGuuuuuGuuGUUUGuuUGuGuGGGuAuGuuuuAuuuGuuuuGuuAuAGuuGuuuGuu |:|| ||:|:|:|:|:||||:|||:|||:||::||||||||||| ATATATA 5’(48-103) pCR4(154-192) 14TAATAGATATATCCATGCAAGATAGACGGAACAATATCAAATTATA 5’ ACATATAACGAAAGACAAAACGAGAGACAAGAAC 5’(48-103) (166-196) 06TATGCGTGTAAGATAGATAAAACAATATCAACAAAATATA5’ |:|| | | ||:||||:||||:|:||||||||||| ||||::::|||::| ATAT-G-T-CAGATACTATATGAGATAACAACAAATATATA 5’(87-134) pCR4(186-232) 11TTATATTGGTAAATGA GTAT-G-T-TAGATGCTATATAAAGTGACAACAAAACAAAAAAA 5’(93-142) bsCR4(186-232) 14TTATATTGGTAAATGA |:|::|:||::|||:|:|:|:|||::|:|:||||||||:||||:| 14TATATAGTAGAATGAAAGATAGAGACAGTAGATAAACACACTCATATATATA 5’ pCR4(127-171) 200 210 220 230 240 250 260 270 280 290 uuuuuuuGuuGuUUUG*GGuuGuGAuuuuuuAuuG**GuGuuuuG***AuuGuAuAGuuuAuuuuuuuuGuGACGuuAuAAuuUUGuuuAuuuuuuuuuu :|:||:|:|||:||:|||:||||||:|:||||:||||||||||||||||| ||:|:|:||::||||: ||:||||||||||||| p(251-300) 11TATATTAGATAGAAGAAATACTGCAGTGTTAAGACAAATAAAAAAAAAAA 5’ AAGAGAGCAGTAAAAT-CCGACACTAAAAAATA 5’(186-232) 06TATATTAGATAGAAGAAATACTGCAGTGTTAAGACAAATAAAAAAAAAAAA 5’ AAGAGAGCAGTAAAAT-CCGACACTAAAAAATA 5’(186-232) ||:||:|||||:|:||||:| ||: ::|::::|::||:|||:: |||:||:: |||||||||||||| pCR4(280-320)10TGTAGAATAAATAGAGAAAAGA p(213-261) 12TAAT-TTAGTGTTGGAAGATAGT--CACGAAGT---TAACATATCAAATATATA 5’ 12TAAT-TCAGTATTAGAGAGTAAC--TACAGAGC---TGACATATCAA-TATATA 5’bsCR4(213-258) 168 300 310 320 330 340 350 
360 370 380 390 uuAuuuuGuuuuGuGuuuuuuGuAuuG*UUGuuuuuAuUUGGuuuGuuuGGuuuuuuuuuG***UAuuuuuuGUUGuGuuuuGuGuuAuuuuuuGAuuuA :|||:|:|::||||||||||| |:|:|:||:|||:|||:|:|||||:| GATAGAGCGGAACACAAAAAAAAAA 5’(280-320) pCR4(374-417) 11TATATAGAATACAGTAAGAGACTAAGT :||||:|:||:|:|:|||:: :||::|||||:||:|:|||||||| bsCR4(374-415) 17TATATAGAATATAATGGAGAACTAAGT 11TTAAAATATAAGAGATATAGT-GACGGAAATAGACTAGACAAACCATAAAATA 5’(307-351) 09TTAAAATATGGAAAATATAGT-GACAGAGATGAGCTGAACAAACCAATAAATTAGTTGGTTTGTT 5’ bsCR4(307-352) |::||:::|||:|:|:|: ||:|:|:||||:||||||||||||| pCR4(343-388) 15TAGTAAGTTAAAGAGAGAT---ATGAGAGACAATACAAAACACAATATATA 5’ bsCR4(340-390) 05TTTAAATGAGTTAAGAGAGAAT---ATAAAGAGCAGCATAAAACACAATAAATATA 5’ 400 410 420 430 440 450 460 470 480 490 uuuuuuAuGuUGuuuuuUGuuuuGGG***UG*GuuuuuuuGuuuuuGuuuuuuuuuuuuGuuuAuGuuuGuuuuuAuuuGuGGuuGuuGuuAuuuuGuuA :|||:||||||||||||| ||||:|:::|:|::|||:|:||:|| GAAAGATACAACAAAAAAAAAA 5’(374-417) pCR4(475-519) 12TTAAATATTGATAGTAATGAGACGAT AGAGAATACAACAAAATATATA 5’(374-415) bsCR4(478-524) 14TATATTAGTGATAATGAGATAAT ||||::::|:||:|:||:||:| || :|||:|||||:||||||||||||||| 14TAATATGGTAGAAGATAAGACTC---AC-TAAAGAAACAGAAACAAAAAAAAAAA 5’ pCR4(404-457) 08TAATATAGTAGAAGATAAGACTC---AC-TAAAGAAACAGAAACAAAAAAAAAAAAAA 5’ bsCR4(404-458) |||::|:|||:|:|:|::||||:|||::|||:|||:|||||||||||:| pCR4(442-489) 12TTAAAGTAGAAAGAGAGAGTAAATGCAAGTAAAGATAGACACCAACAATA 5’ bsCR4(442-487) 11TTAAAATAGAAAGAGAGAGTAAATACAGATAAAGATGAACACCAACAAAATA 5’ |||:|::|:||::|:|:||:|||:||:||:||||||||||||||| bsCR4(453-497) 11TAAAGAGTAGATGTAGATAAGAATGAATACTAACAACAATAAAACATA 5’ bsCR4(453-497) 15TAAGAGATAGATGTAAGTGAGAATAAACACTAGTAACAATAAAACATA 5’ 500 510 520 530 540 550 560 570 580 590 GuuuGGuuGuuGUUGuuAuuUGuGuAuA****GGUUUAuuUAuA*UGCGuuuuuuAuuuuAGAuAAuUAuG****G****UA**UUGGUUUUAUAAAAUG :|:||:||:||||||||||| TAGACTAATAACAACAATAA-TATATA 5’(475-519) CAGACTAGTAACAACAATAAACATA 5’ (478-524) ::||:|::|::||||:|:|||||| ||:|:|:||||| |||||||||| 03TTTAATAGTAGTAATAGATACATAT----CCGAGTGAATAT-ACGCAAAAAAAA 5’pCR4(504-554) 
12TATAGTAGTGATAGATACATGT----CTAGATGAGTAT-ACGCAAAAAAAAAAA 5’ bsCR4(507-554) || |:|::||:|:||:|:|||||||||:| | #| ||||||||||||| pCR4(542-575) 12TAAT-ATGTGAAGAGTAGAGTCTATTAATGC----C-----T--AACCAAAATATTTAA 5’ bsCR4(542-584) 19TAAT-GTGTGAAAGATGGAATCTATTAGTGT----C----AT---ACCAAAATATTTATA 5’ 600 UUUUUUCU polyA 169 E) Cytochrome b 0 10 20 30 40 50 60 70 80 90 GUUAAGAAUAAUGGUUAUAAAUUUUAUAUAAAuAuGuuuCGuuGuAGAuuuuuAuuAuuuuuuuuAuuAuuuAGAAAuuuGuGuuGUCUUUUAAUGUCAG :||:|:||:|::||:||:::||||||||||||| 12TGATAGGTGTCGTATAGAGTAGTATTTAGGGATAATAAAAAAAAAA 5’ pCYb(32-64) 06TATAGGTGTCGTATAAGGTAGTATTTGAGGATAATAAAAAAAAAA 5’ bsCYb(32-64) ||||||:::|||||||:|:||||||||::::|||||||| pCYb(53-91) 11TTAATAAGGGAAATAATGAGTCTTTAAGTGTGACAGAAAAAAAAAA 5’ bsCYb(54-91) 05TATCAATAGGAGGGGTAATGAGTCTTTAGATGTAACAGAAAAAAAA 5’ F) Maxicircle unidentified reading frame II 0 10 20 30 40 50 60 70 80 90 UUUUAUAUAGAAAGGUAUAUAAUCUAUAAUGAuuuuAAuGuuuGGuuGuuuuA****AuuuAGuuuuAuuuUUGuGCUUUGAUUGuAGUCGUGUUUUUGA :|||||||::|::::||::|:|| |||||:||||||||||||||||| pMURF2(30-79) 11TTTAAAATTGTAGGTTAATGAGAT----TAAATTAAAATAAAAACACGAAAGATA 5’ bsMURF2(30-79) 08TTTAAAATTGTAAGTTAATGAGAT----TAAATTAAAATAAAAACACGAAAGATA 5’ 170 G) NADH Dehydrogenase subunit 3 0 10 20 30 40 50 60 70 80 90 UCAAAAAAUCCUCGCCUUUUUACUUUAGUUUGUUAUCAuuAuuuuuAuAuuuGuuuuUG*A*UAuuGuGGuuuA**UUAuuuuAuuuAuAGGuuuuuuuu |||:||:||||:||:|::::||||:| | |||::||||||| | | pND3(33-76) 12TATAGTAGTAAAGATGTGGGTAAAAGC-T-ATAGTACCAAAT--ATTATATA 5’ (99-143) 11TA bsND3(30-73) 12TATAATAGTGATAAAGATGTGAATAAAGAC-T-ATAACACCAAAT--TATA 5’ (98-141) 12TAA |||:|:::||| |:|:|:||:|:|:||:|||||:|| pND3(63-113) 11TTAATATTGAAT--AGTGAGATGAGTGTCTAAAAAGAA bsND3(63-108) 09TTAATATTAGAT--AGTGAGATGAATATTCAAAGAAAA 100 110 120 130 140 150 160 170 180 190 uAuGuuuuuuAuGuuuuuuAuuGCAuuuuuuuGAuuGuuuuCGuuGuuGuuuGuGGuuuuCGuGuGGuUUGuAuGAuAuGAAuUCACGuuuG*GUGuuuu |||||||:|||||| |||:::|::|:|:||:||:|::||:||||:|||: :|||||| ATACAAAGAATACA 5’(63-113) pND3(158-205) 15TAAGTGTATTAGATGTGCTGTGTTTGAGTGTAAAT-TACAAAA ATACAAAAA-TACATA 5’(63-108) 
bsND3(158-205) 13TAAGTGTATTAGATATGCTGTGTTTGAGTGTAAAT-TACAAAA |||:::||||||::|||:|||::||:||:|||||||||||||:| || :|:|:|| ATATGGAAAATATGAAAGATAGTGTGAAGAAACTAACAAAAGTATA 5’ pND3(99-143) pND3(190-229) 13TAC-TATAGAA ATATAGAGAATATAGAAAATGGCGTAAAGAGACTAACAAAAGAATA 5’ bsND3(98-141) bsND3(190-233) 13TAC-TATAGAA ||:|:|:::|||:|:||::|:||||:|:||||||||:|||| 18TATAATTGATGGAAGTAGCAGTAGATACTAGAAGCACACTAAAC-TATATA 5’ pND3(130-170) 11TAATTAGTAAGAGTGACAATAAACACTAGAGGCACATCAAACATATATA 5’ bsND3(130-174) 200 210 220 230 240 250 260 270 280 290 AuACAuuGGAuuuAUGuuuuGuuAGuUGuUUGuuuuuuGuAuuGuuAAAuuCCAuUAuuuGuGuUUUGuuGuuuGuuuuuGUGAuA*GuGuuGuuuuAuu |||||| |||||:|:|:||:|:|:::|:|||:|::|:||| ||||||||||||: TATGTATATA 5’ (158-205) pND3(253-299) 14TCTTTAATAGATATAAGATAGTGAGCAAGAGTATTAT-CACAACAAAATAGTATA 5’ TATGTATATA 5’ (185-205) bsND3(253-299) 06TTTAATAGATATAGAGTGACAGATAGAAACATTAT-CACAACAAAATAATATA5’ ||||||::|||:|||:||:||||||||||| || :::||:||:|||: TATGTAGTCTAGATATAAGACAATCAACAATATATA 5’(190-229) pND3(284-329) 06TAAT-TGTAATAAGATAG TATGTAGCTTAGATACAGAGTAGTCAACAAACAATATA 5’(190-233) pND3(285-328) 10TT-TATAATAAAATAG |:|::||::|::|:|:|||:|||||||:|||||||||||:| pND3(223-263) 11TTTAGTAAGTAGGAGATATAGCAATTTAGGGTAATAAACATATATT 5’ bsND3(222-263) 16TATTAATAGATAGAGAGTGTAGCAATTTAGGGTAATAGACATATAAA 5’ 171 300 310 320 330 340 350 360 370 380 390 uuuGuuAUGGuuuuuuGuUUUUGuGGuuuuuGuuuuuuGuuGuAuGuAuAG****GAuuUGuGuGGuAuuuuuGGGAUCAC*GuAuAUUUGUGUGGUGUA |:|:||||:|:|||:|:||||||||||||| pND3(402-438) 13TATATTGC AGATAATATCGAAAGATAAAAACACCAAAATATA 5’(284-329) bsND3(402-435) 08TGC AGATAATACCGGAAGATAGAAACACCAAA-TA 5’(285-328) :|::||:|::|:||:::|:|||::||||: ||||::||||||||| pND3(322-369) 13TTATTAAGAGTAGAAGGTAGCATGTATATT----CTAAGTACACCATAA-TATATA 5’ bsND3(322-370) 12TTATTAAGAGTAGAAGGTAGCATGTATATT----CTAAGTACACCATAAATA 5’ |||| :|||::::|:||||:|:||::||||| ||||||| pND3(347-388) 18TATTTTATC----TTAAGTGTATCATAGAGACTTTAGTG-CATATAACAAATGTATA 5’ bsND3(355-388) 11TTT---ATTAAGTGTACTGTGAAAGCTTTGGTG-TATATAACAAACAAATATA 5’ 400 410 420 430 440 450 460 
AUUUUAuuuuGuuuAuGA**UGuuuUUUGUUGUAUUAUACAUAUUAUAUUAAUAAAUAUAUAAAA ||:||:|:::|||||| ::||||:|||||||||||| TTAAGTAGAGTGAATACT--GTAAAAGACAACATAATATTATA 5’ (402-438) TTAAATGAGATAAATGCT--GCAGAGAACAACATAAAAT-ATATA 5’ (402-435) 172 H) NADH Dehydrogenase subunit 7 0 10 20 30 40 50 60 70 80 90 UGAUACAAAAAAACAUGACUACAUGAUAAGUAuCAuuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCC :|||||::|:|:|:::||||:|:|||||||||:| |||#| pND7(36-69) 14TATTATAGT-GAATACGGTGAGAGTTATCAGAGAAATGTAAATAATATA5 (108-137)TTAGATTTTTAGAG bsND7(28-71) 12TATTATAGTAAGATGCAATGAAAGCCGTCAAGAGAATGTAAACATATAAA 5’ ||:||||:|:||||:|:|||||||:| ::||||#||||| pND7(59-91) 13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA5’ bsND7(58-91) 12TAAAGTGTAAATATAGCGAAGTGTAAAT-CAGGTGACATAAATATATA5’ 100 110 120 130 140 150 160 170 180 190 G***CAGCACAuG**GuGuuuuAuGuuGuuuAuuGuAuuuuuGuGGuGA*AuuuAuuGuuuA**UAUUGAuUGuAuuAuA***G*GuuAUUUGCAUCGUG | #||#||||| |:::|||||||||||||||:||| C----TCATGTAC--CGTGAAATACAACAAATAATATA 5’ pND7(108-137) :||:||:|:|:||:||:|::|||:| ||:|||:||||| ||||||| 14TTAATAAGTGATATGAAGATGCCATT-TAGATAGCAAAT--ATAACTACATA 5’ pND7(124-170) 16TATATAGTAAATGACATGGAAGTGCTACT-TAAATAACAAAT--ATATATA 5’ bsND7(121-166) ||||:::|:| |||::|:||||::||| | :||||| pND7(152-190) 12TAATAGTGAGT--ATAGTTGACATGGTAT---C-TAATAATACGTAGCATTAAA5’ bsND7(151-199) 12TAAAGTGATAGAT--GTGATTGATATGATGT---C-CAATAA-ATGTAGCATAAA5’ 200 210 220 230 240 250 260 270 280 290 GUACAGAAAAGUUAUGUGAAUAUAAAAGUGUAGAACAAUGUCUUCCGuAUUUCGACAGGUUAGAuuAuGuuA*GuGuuuGuuGuAAuGAGCAuuuGuuGu :||:|:|||||:||:||||||||| pND7(246-269) 14TAAATAAGGAAATCTATGAGGCTGTTCAGTCTAATACACAACTATA 5’ bsND7(246-269) 12TAAATAAGGAAATCTATGGGGCTGTTCAGTCTAATACACAACTATA 5’ |:|||||:|:| :||||:::||||||:||||:||:|||| pND7(261-311) 08TTTTAATATAGT-TACAAGTGACATTATTCGTGAATAACA bsND7(261-293) 10TTTTAATGTAGT-TATAAGTGACATTGCTCGTGA-CAACA |:| pND7(297-338) 13TACATA bsND7(292-324) 13TGAAATAGTG 173 300 310 320 330 340 350 360 370 380 390 
CuuuA***UGuuuuGAGuAuAuGuuGCGAuGuuGuuuGuCGuuACGuuGuGCAuuuAuGCGuuuAuuAAuuGuA****GAAuuuAC***CCGuAGuuuuA ||||| |||| |||||::|||:||:||||||:| |||||||| |#|||||||| GAAAT---ACAATATA 5’(261-311) pND7(352-398) 07TATAAATGTGCAGATGATTAACGT----CTTAAATG---GACATCAAAAG-ATATA5’ GAA-T---ATA-TATA 5’(261-293) bsND7(353-402) 13TAAAATGTGCAGATAATTAATGT----CTTAGATG---GGTATCAAAATTATATATA5’ ||:|| ::|||:||:||||:|||||||||||||||:| || #|:||::|:|| GAGAT---GTAAAGCTTATATGCAACGCTACAACAAATATA 5’(297-338) pND7(390-424)16TAATTG---AGTATTGAGAT GAAAT---ATAAGACTCATATACGA-GCTACAACAA-TATATA 5’(292-324) bsND7(391-424) 14TAATG---GATATTAAGAT :||||::||::||||:||::||::|||||||:||||||||||||:|| pND7(327-373) 13TATTATAGTAAGTAGCAGTGTGACGTGTAAATATGCAAATAATTAATATA 5’ bsND7(327-365) 12TAATTGTAGTAAGTAGCAATGTAACGCGTAGATATGCAAATATACATA 5’ 400 410 420 430 440 450 460 470 480 490 AuGGuuuGuuGuGuAuAuCAuGuAuGGuuuuGG*AuuuAGGuuGuuuGuCUCCGuuG*UUAuGAuCAuuuGAGGAA***CG*UGACAAAuuGAuGACAuu ||::|:|:|:|:||||||||||||| |||||:|::||||| TATTAGATAGCGCATATAGTACATAAATTATA 5’(390-424) pND7(486-530) 11TTTTAATTGTTGTAA TATCAAGTGACATATATAGTACATAACATTAAA 5’(391-424) bsND7(486-530) 11TTTTAATTGTTGTAA :||||||||:||::||:|:|: ||:|||:|::||||||||| 13TTATATAGTATATGTCAGAGCT-TAGATCTAGTAAACAGAGGAATATATA 5’ pND7(412-452) 13TATAGTGTATGTTGGAATC-TAAGTTCGATAGACAGAGGCAAT-A-TATA 5’ bsND7(414-458) :||: |:|::||||:|::|:||| || |:||#||||||||||||| pND7(453-485) 13TATAAT-AGTGTTAGTGAGTTTCTT---GC-ATTGATTAACTACTGTAA bsND7(453-485) 13TATAAT-AGTGTTAGTGAGTTTCTT---GC-ATTGATTAACTACAATAA 500 510 520 530 540 550 560 570 580 590 uuuuGAuuuAuG**UUGuGGuuGuCGuAuGCAuuuGGCUUUCAuGGuuuuAuuA*GGuAUUCUUGAUGAuuuuGuuuuuGGuuuuGuuGAuuuuuuGuuG || :||:|||:|:|:|:|::|||:|:::||:||:|:||| AATA 5’(452-501) pND7(564-615) 12TTTATTAAGATAGAGATTAAAGCGGTTAGAAGATAAC ATA 5’(453-485) bsND7(564-615) 10TTTATTAAGATAGAGATTAAAGCAGTTAGAAGATAAC ||:|||:|:||: |:||::||||||||||| |:|::||||:|:::|| GAGACTGAGTAT--AGCATTAACAGCATACGAATATA 5’(486-530) pND7(584-630) 15TATAGTTAAAGAGTGAC GAGACTGAGTAT--AGCATTAACAGCATACGAATATA 5’(486-530) |||: 
|:||:::||:|:||::||:|::|||||||||:||| ::|: 14TAATTGTATAT--AGCATTGACGGTATGTGTGAGTCGAAAGTACTAAATATA 5’(508-548) pND7(596-642) 18TAAATTTGAT 11TTATAT--AGTGCTAGTAGTATATGTAGGTTGAGAGTACCAAAATAATATA 5’bsND7(508-553) bsND7(596-640) 11TTAAT |||:|||||::||:||||:||:|:|||| |:||||||||||||| pND7(526-569) 12TAATATGTAAATTGAGAGTATCAGAGTAAT-CTATAAGAACTACTATATATA 5’ bsND7(526-569) 13TAATATGTAAATTGAGAGTATCAGAGTAAT-CTATAAGAACTACTATATATA 5’ ||||:::|:|||:| |:||:|||::|||||||||||| pND7(540-576) 05TACTGATAGTATTGAGATAGT-CTATGAGAGTTACTAAAACAAATAATAATA 5’ bsND7(540-571) 14TAATCGTTAGTGTCAAGATAGT-CTATAAGAACTACTAAATAATATA 5’ 174 600 610 620 630 640 650 660 670 680 690 uuGuuGA***UAAuAuCAuGuuuGuuuGuuAuGGAuuGuuAuGAuuuGuuAuuuGuGGGuAAUCGuuuAuuuUAuuuGCGuuuGC***GuGGuuuGuCAu ||||||| |||||| :::|||||::|||:|||||:|||::|||| ||:||||||||| AACAACT---ATTATATT 5’ pND7(564-615) pND7(656-699) 12TTTTATTAGTGAATGAAATAGACGTGAATG---CATCAAACAGTATATATA 5’ AACAACT---ATTATATA 5’ bsND7(564-615) bsND7(666-701) 12TACATAGTTAGTGA-TAGAATAAGTGCAGATG---TACCAAACAGTAGATATA 5’ |:::||| |||||||||:||||||||||| :|||:| :::|||::|||: AGTGACT---ATTATAGTATAAACAAACAATTAATA 5’(584-630) pND7(679-727) 13TTAAATG---TGTCAAGTAGTG |::|::| ||||||||:||:|:|:||:|||||||||||||: bsND7(679-725) 16TGATGTTAAGTG---TGTTAGACGGTA AGTAGTT---ATTATAGTGCAGATAGACGATACCTAACAATATATA 5’(596-642) AGTGACT---GTTATAGTGCGAGTGAACAATACCTAACAATTAAA 5’(596-640) |||::||:::|||||:|::||||:|||:|||||||||||||: 13TATATTTAGTGATACTGAGTAATAGACATCCATTAGCAAATAGTATA 5’ pND7(629-670) 11TTTTAGTAGTATTAGATAGTAAGTGTCCATTAGCAAATAAAATATATA 5’ bsND7(632-674) 700 710 720 730 740 750 760 770 780 790 uuuuuGAuuuAuAuGAuuuA**GuuuuuA**A**UAGuuuAAGuGGuGuuuuGuCuCGuuCGuuAGGuAuGGuGuGAGAuuGUCGuuuAuuuAGuuGuuA ::|||:|||||||||||||| |||||| ||::||:||:|:|||::||:|| GGAAATTAAATATACTAAAT--CAAAAAAAAAAA 5’(679-727) pND7(778-830) 12TATAGTAGTAAGTGAATTGACGAT GAAGGTTGAATATACTAAAT--TAAA 5’(679-725) bsND7(772-806) 12TAATTGATATGACACTCTAGTAGCAAATAGATCAACAAT bsND7(782-816) 16TAATTGATGTTATATTCTAAGGGCAGATAGATCAACAAT |||:|:|:| :|:|:|| | 
||:|:||||||:|:||||||||||| |:||:|:| 13TAATATTGAGT--TAGAGAT--T--ATTAGATTCACTATAAAACAGAGCATATATA 5’(711-758) (792-845) 10TTTAATAGT 13TATATATTGAGT--TAGAGAT--T--ATCAGATTAACCATAGAACAGAGCAAAA 5’ (709-741) (790-839) 11TAATTAGTGAT |:|| | ||||:|||:||:|::|:||:||||||||||| pND7(725-764) 14TAACTTAGAT--T--ATCAGATTTACTATGAGACGGAGCAAGCAATAATATA 5’ bsND7(722-765) 14TATAAGAGT--T--ATTGAGTTCGTTACAAGATAGAGCAAGCAATTATATA 5’ |:|||:|:||:||::||:|:||||:|||||||||||||| pND7(756-794) 14TAATTATAGTGTAAGTAGTCTATGTCATATTCTAGCAGCAAATAAATCATA 5’ bsND7(756-794) 14TTTATAGTGTAAGTAGTCTATGTCATATTCTAGCAGCAAATAAATCATA 5’ 175 800 810 820 830 840 850 860 870 880 890 ****UGA*****GuUGuAuuuuAuGuuuuGuuAuGAuuAuuGuuuuuGuuuuAuAGGuGAuGCAuuuGA*UCGuuuAuuuuuACGuuuGuuuGAUAuGCG ::| ||::|||||||||||||:| :|||||||:|||:|||::|:|||||:|: ----GTT-----CAGTATAAAATACAAAATATATA 5’(778-830) pND7(872-916) 09TTAAATAAAGATGTAAATGAGCTATATGT ----ACT----ATATA 5’(772-806) bsND7(872-919) 17TTAAGTGAAGATGTGGGTGAATTATATGC ----ATT-----CAATA 5’(782-816) ::| ::||:|:|||||:|:|::|:|||||||||||||| ----GTT-----TGACGTGAAATATAGAGTAGTACTAATAACAAAATATA 5’(792-845) ----ATT-----TAGCATGAGATATAGAACAATACTAATATATA 5’(790-839) :|||||::::|:|:::|||:|:|::||:||||||| :|||||||| 13TTTAATAGTGGAGATGGAATGTTCGTTATGTAAACT-GGCAAATATA 5’ pND7(834-877) 15TTTAGTAGTAAAAATAAGATATTCACTATGTAGATT-AGCAAATAAATATA 5’ bsND7(834-879) ||||:| ||::||||||:|||||:|:|:||||||||: pND7(863-902) 10TTAAATT-AGTGAATAAAGATGCAGATAGACTATACGT pND7(866-902) 10TATT-AGTGAATGAAAATGTGAGTAGATTATACGC 900 910 920 930 940 950 960 970 980 990 uAuGAGuuuGuuGAuuuGuAAGCAAuGuuuuuuuGuuGGuuuuuuuGuuuuuG*****GuuuuGuuuGuuuGuuuG**AuuAuuuAuAuuGuGAuAuuAC ||| |||::|:|:|:|:|::| |:|||:||:|:||||||||||| ATAATATA 5’(863-902) pND7(959-1000) 14TAAAAGTAGATAGATAGGC--TGATAGATGTGACACTATAATG ATAAATAATATA 5’(866-902) bsND7(959-1000) 11TAAAAGTAGATAAATAGGC--TGATAGATGTGACACTATAATG ||:||:||||||||||| |||||:|:::||||:|| ATGCTTAAACAACTAAAATATA 5’ gND7(872-916) pND7(983-1017) 12TAATATGATGTTATAGTG GTGCTTAAACAACTAAACATA 5’(872-919) 
||:|:|:|:|::||:|||||||||||::|||||:||:|| ||||| 13TATTTAGATAGTTAGACATTCGTTACGGAAAAATAATCATATATA 5’ pND7(901-939) pND7(1001-1032) 08TAATAGTTAATG 12TATTTAAGTGACTAGATATTCGTTATAGAAAAACAATCATATATA 5’ bSND7(901-939) bsND7(1000-1043) 13TTTAATA |::||:|:||::|||||||:||:||:|||||||||:|||:| ATGTTCTAGTAATTGAATGTTCGTTATAAGAAGACAACCAAAGAAATA 5’ pND7(907-947) 17TCTAGTAATTGAGTGTTCGTTATAAGAAGACAACCAAAGAAATAGAAAACGGAACC 5’ bsND7(907-951) ||:|:::|:||:|:||:|::| |||:|:|:|||||||||| | 13TAATAGTTAGAAGAGCAGAGGC-----CAAGATAGACAAACAAAC--TTATATA 5’ pND7(932-978) 12TTAATTAAAGAGATAGAGAT-----CAAAGTAAGCAGACAAAC--TAATAATATATA 5’ bsND7(934-983) 176 1000 1010 1020 1030 1040 1050 1060 1070 1080 1090 CAuuG****AGACCAuuAuuAuGuuAuuuuAuAGuuuGuGGuGuuGuuGuuuGCCGGGuAuA*UCAuuuGC*UUGUGuuGAACACCCCAAAGGuGA***G | :::|||| ||||:|:| :|:|:::|||||||## ::||| | GAAATA 5’(959-1000) pND7(1055-1085) 03TTTTATAT-AGTAGATG-GATATGGCTTGTGGAAAGATTACT C 5’ GAAATATA 5’ (959-1000) |||:: |||||||||#|||||||||| :||||||#|#||:||:| | GTAGT----TCTGGTAATTATACAATAAATATATA 5’ pND7(983-1017) (1089-1121) 14AAATGAGTGTTTTGTGGAG-TTTCATT---C #||:| |||||||:||:|:||||||||||| bsND7(1087-1113) 12TAAATG-AATATAGCTTATAGAGTTTCTGCT---T ATAGC----TCTGGTAGTAGTGCAATAAAATATATATA 5’ pND7(1001-1032) : GTGAC----TTTGGTGATAGTGCAGTAAGATGTCAAACACCACATATA 5’ bsND7(1000-1043) pND7(1099-1143) 14TAT ||:| |:|||#||||:||:|||:|||||||||||||:||| bsND7(1099-1143) 12TAT 13TAGC----TTTGGGAATAGTATAATGAAATATCAAACACTACATA 5’ pND7(1015-1043) 14TT----TTTAGTGATAGTGCAGTAAGATGTCAAACACCACATATA 5’ bsND7(1013-1043) |:||::::||||:::|:|||:|||:|||:| |||||#|| ||||: pND7(1032-1067) 11TTTAAGTGTCACAGTGATAAATGGCTCATGT-AGTAA-CG-AACATATATA 5’ bsND7(1032-1078) 10TTTAAATGCTATAGTGACGAGTGGCTTATGT-AGTAAACG-AATATAAA 5’ 1100 1110 1120 1130 1140 1150 1160 1170 1180 1190 uAuuGuuuGuuAuuA****UGuuuuuGuGuuGGuuuAuGuuCUCGuuuACGuuuGCGuuGuGCGGAuuuuuuGCA*UAUUUGuuuAuuGGAuGuuuGuuu |||||||||||||:| |:| :|||||::|||:|:|:||| ATAACAAACAATAGT----ATATA 5’(1089-1121) pND7(1181-1218) 12TTAAATAGTCTATAGATAAA ATAACAAACAATAA-----ATA 5’ 
(1087-1113) bsND7(1183-1224) 13TAATAGTGACTTATAAGTAGA |||:::||||||:|| ||::|:|:||:|:|||||||||||| ATAGTGAACAATGAT----ACGGAGATACGATCAAATACAAGAGAATATA 5’(1099-1143) ATAGTGAATAATGAT----ACGAAGGTACAGTTAAATACAAGAGAATATA 5’(1099-1143) |:|||:|| |:|:|:|:::|:::|:|||:|||||||||||:|||:|:| 13TATAATGAT----ATAGAGATGTAGTTAGATATAAGAGCAAATGTAAATGTATA 5’pND7(1107-1157) 14TATAATAGT----ATAGAGATACAGTCAAGTGTAAGAGCAA-TGCAAAT-TATA 5’bsND7(1107-1146) 09TCAATTGAGTATAAGAGCAGATGCAAGCGTAACATGCCTAATATATA 5’ bsND7(1128-1167) :|||:|:|::|:|::||:|:|::|| ||:|:||:||||||||||||:| pND7(1150-1197) 10TTAAATGTAGTATGTTTAGAGAGTGT-ATGAGCAGATAACCTACAAATATA 5’ bsND7(1150-1195) 10TATATAAGTGTAATATGCCTGGAAGATGT-ATAGACAAATAACCTATAAA 5’ 177 1200 1210 1220 1230 1240 1250 1260 1270 1280 1290 GCGuGGuuuuuuAuuGCAuGAuuuAGuuGC***C*GuuuuAGGuAAuAuuGAuGuuGuuuuuGGAuCCGUAGAUCGuuA*GuuuuAuAuGuG**A***** |||||:|:||:|||||||| ||:|||:|:| ||:|||||:::| | CGCACTAGAAGATAACGTAAATATATA 5’(1181-1218) pND7(1269-1320) 12TAATTTAGTAGT-CAGAATATGTGC--T----- TGTATCAGAAAATGACGTACTAAATATA 5’(1183-1224) bsND7(1269-1320) 17TAATTTAGTAGT-CAGAATATGTGC--T----- |||||:||::||:|||||:| | |||:||:|||||||||||||||| 14TAATAATGTGTTAGATCAATG---G-CAAGATTCATTATAACTACAACATATA 5’pND7(1210-1257) 09TAATGTGTTAGATCAATA---G-CAAGATTCATTATAACTACAACATATA 5’bsND7(1233-1257) |:::|||:|||::|:||:|||||||||| ||| pND7(1251-1282) 12TAAGTGTGACAGAAATTTGGGTATCTAGCAAT-CAA-TATATATA 5’ bsND7(1240-1270) 16TTTTATTGTAGTTATAGTGAAGACTTAGGCATA--GCAAT-CAAATATA 5’ 1300 1310 1320 1330 *GGUUAUUGuAGGAUUGUUUAAAAUUGAAUAAAAA |:|:|||||||||||||||| -CTAGTAACATCCTAACAAATATA 5’(1269-1320) -TTAGTAACATCCTAACAAATATA 5’(1269-1320) 178 I) NADH Dehydrogenase subunit 8 0 10 20 30 40 50 60 70 80 90 CAAUUUAAUAAUUUUAAGUUUUGGUUGAUUAuuAuuuuuuuAuuuuuuuAuuuuuGuAuGuuuuuuuuuGAuuuuuuGuuuuuuuUUUUUGuuuGuuuuu |||:|:|||:||||:|||:|||||:||||||||||||||| |||::|::||:|| pND8(29-68) 08TATTATATAGTGAAAGAATAGAAAGATAAAGACATACAAAAAAAAAAA 5’ (87-136) 14TAAATGAGTAAGAA bsND8(28-56) 04TAGGGAGATAGTAAAAGAGTAGGAGGATGAGGATAAAA 5’ bsND8(86-139) 
12TAAAATGAGTAAGAG :|||:|:|:|:||::||||:||||||:||:|||||||||||||| pND8(55-98) 11TTATATAGAGAGAAGTTAAAGAACAAAGAAGAAAAACAAACAAAATATA5’ bsND8(54-97) 16TATATATAGAAAGAGACTGAGAAACAAAAGAAAAAGACAAACAAA-TATA 5’ 100 110 120 130 140 150 160 170 180 190 AuAuGuGUuuuGuuuGuuGuGuuA****CuAUUU*GuuuA***CCCAuuGAGuuAACCAuuGuuAGuuuAuuGGuuCGuGGUAACCAuuuuuuGCGUUUU |||:||::|:||:|::|||||||| |||||| :| :|||:|:|||::||||:|:||||||||#|||||||||| TATGCATGAGACGAGTAACACAAT----GATAAA-TA 5’(87-136) (161-187) 11TTAATTAGATAGTCAAGTATCATTGGTATAAAACGCAAAT TATATGCAAGATAGGCAGTACAAT----GATAAA-CAAATATA 5’(86-139) 15TGTAATTAAGTAGTTAGGTATCATTGGTAAAAGACGCAAAA :||::|::::||| |||:|| :|::| |||||||||||#|||||| ||||:|::|::||| (111-153)13TTAAGTAGTGTAAT----GATGAA-TAGGT---GGGTAACTCAAATGGTAAATATATA5’ p(186-230) 13TTAAAGAGTGTGAAA 12TATTCAGATAGTATAGT----GATAGA-CAGAT---GGGTGACTTA-TTGGTAACATATA5’ pND8(187-228) 11TAAAAGAGCGTAAAA :|:||| |||:|: :||:| |#||:|:|:|:||||||||||||||||| pND8(117-170) 11TTTATAAT----GATGAG-TAAGT---GAGTGATTTAGTTGGTAACAATCAAATAAACA 5’ 200 210 220 230 240 250 260 270 280 290 uAUU***GGuGuGGuuuAGAGCGuuGuAuuGCuuGuCGuuuAuGuGAuuuAAuuuGCCCuA****GuuuAGCAuuGGAuG***UUCGuGuuGGGuGGAGu AATATA 5’ pND8(161-187) :|:: |:|:|:||:|||::|:| TGATATA 5’ bsND8(160-199) pND8(276-318) 13TAATTGT---AGGTATAATCCATTTTA |||: |:|:||:||:|||||||||||||| bsND8(289-318) 10TTTTGT---AAGTAGAGCTCATTTCA ATAG---CTATACTAAGTCTCGCAACATAACATA 5’(186-230) GTGA---CCATATTAAATCTCGCAACATATATTATA 5’(187-228) :||:|||:|:|::||||:|:|||||||||||| 06TAATAGCTATAGTAAGTCTTGTAGTATAATGGACAGCAAATACAA 5’ (213-244) 12TTAGATGTTGCAGTATAATGAGTAGCAAATACATATAAA 5’ bsND8(219-245) :||||::|:|:|:|||:|:||||| :|| :||||||||| pND8(237-267) 15TTAAATGTATTGAGTTAGATGGGAT----TAATATGTAACCTAC 5’ bsND8(246-271) 11TAATGTAGTGAGTTAAATGGGAT----TAGATTGATACCTAC---AAGCATA 5’ |||:|:|:|:|||:|:||||| |||:||#||:||||| |||:|: pND8(240-288) 10TATATATTGAGTTAGATGGGAT----CAAGTCATAGCCTAC---AAGTAT 5’ bsND8(259-285) 05TATAATGTTAGTATATTGAGTTAGATGGTAT----TAGATCGTAGCCTAC---AAGAATATA 5’ 179 300 310 320 330 340 350 360 370 380 
390 uuuGGuGGuCAU**C*GuuuuGCGGAuuGAuuuACAuuGAGuuAU*C**GU**CGuuGuAuuuAuuGuGGuuuuuGuAuGCAuGuuuGCCCGACAGAU** :|::|||:|||| | ||| |||:||::|||:|:|:|||||#|||| GAGTCACTAGTA--G-CAACATATATA 5’(276-318) pND8(372-417) 14TAAATATGTGTATAGATGGGCTTTCTA-- GAGTCATCAGTA--G-CAATATATA 5’(276-318) (390-414) 15TAT-TGTGTGTGAGCGAGTTGTTTG-- |||::::|||| | :|:|::|::||:|||:||||||||||||| ||||| TAACTGTTAGTA--G-TAGAGTGTTTAGCTAGATGTAACTCAATATATA 5’(301-344) (391-431) 14TAAAATTTAATTACGTGTTAGTCTA-- 11TAATCGTTAGTA--G-TAAGATGCCTAGCTAGATGTAACTCAATATA 5’(301-344) bsND8(407-441)15TATAT-- :|#|| | :||:|:|:|||::||:||||:|||||||| | :| | 13TATAATA--G-TAAGATGTCTAGTTAGATGTGACTCAATA-G--TA--GATTATA 5’ pND8(310-353) 14TTA--T-CAGAGTGTCTAATTAAGTGTAACTCAATA-A--CA--TATA 5’ bsND8(316-344) |:||||:||:|||| | :: |||||||||||| pND8(331-364) 14TAATGTCTATAGT--AGTGTAGCTTAATA-G--TG--GCAACATAAATATATA 5’ 17TATAGTAATAGAGTGTGTGATTGGATGTAGTTCGATA-G--CA--GCA 5’bsND8(325-355) :|:|||| | :| |::||||:|||:|:||:||:|||||||||| pND8(338-382) 09TTTTAATA-G--TA--GTGACATGAATGATACTAAGAACATACGTAAATATA 5’ bsND8(338-385) 04TATTTTTAATA-G--TA--GTGACATAGATAATATCAGAGACATACGTACAACATGTA 5’ 400 410 420 430 440 450 460 470 480 490 **GCCAuuACGCAUUCAuuGuuuGuuAuGuGuuuuuGuuGuuuAGCC**AU**GuAuuuAuuG*GCGC***C***CAAGuuuuuAuuGuuuGGuuGuuGu :||||||||||||#|| bsND8(465-503) TTTTATAAGTGTC-AGTG---G---GTTCGAAGATAGTAAACTAACAACA --TGGTAATGCGTAAATATATA 5’(372-417) |||| :|:| # # |:|||:||||:||::|||:|:|| --TGGTAATGCGTAA-TATATATGTTAAA 5’(390-414) pND8(477-512) 14TAAC-TGTG---A---A#TTAAAGATAATAAGTCAATAGCA :||||||#:||::|||||||||||||:|:| bsND8(482-510) 12TATAGTAATGAGCCAGTAACG --TGGTAATATGTGGGTAACAAACAATATATATA 5’ (391-431) ||::|::|::: ||::||:|:|:||:||||||||:||||||||| pND8(489-531) 14TAATTAGTAGTG 12TATATGCAGTGGGTGATAGACGATACACAAGAACAACAAAAAAAA 5’ (411-442) bsND8(494-539)18TTATAGTG --GGAATATGTGTAGGTAGTAAATAATATACAAAAACAACAATAAA 5’(407-441) ||:|:||:|::||:|:||:|| || :|||:||||: |||# | |||||||| pND8(426-466) 13TTATATAAGAGTAATAGATTGG--TA--TATAGATAAT-CGCA---G---GTTCAAAATA 5’ bsND8(426-480) 
02TTATATAGAAGTAGTGAATTGG--TA--TATGAGTAAC-TGCG---G---GTTCAATATATA 5’ 180 500 510 520 530 540 550 560 570 580 590 uuuAuGuuAuuuGAuuuuuAuuuGuGuuuuGuGuAGuuAuuuAuuuuGGGuGAuuuAuuGUGuuuAuGAuuuAA***AGAA**AuuCACGGUGAAAUUAA AAATTATA 5’ (465-503) ||||::|:|:||::||:||| |||| |||||||||||||||| :|||||||||||: pND8(554-598) 13TAATAGTATAGATGTTAGATT---TCTT--TAAGTGCCACTTTAAT 5’ GAATACAATAAATATATA 5’(477-512) bsND8(554-598) 05TAATAGTATAGATGTTAGATT---TCTT--TAGGTGCCACTTTAATATATA 5’ AGATACAATAATTATATA 5’ (482-510) |:|||:|:|:||||::||||||||||||||:| AGATATAGTGAACTGGAAATAAACACAAAATAGATA 5’ (489-531) ||||::|:|:||:||||:||||::|||:|:||||||||||| AAATGTAGTGAATTAAAGATAAGTACAGAGCACATCAATAATATATA 5’ pND8(500-540) AAATATGATAGACTGAGAATAGATACAAGACACATCAATA-TATA 5’(494-539) bsND8(523-567) 12TATAGAGTATATTGATAGATAGAGCTCACTAAATGACACAAATATAAATA 5’ 600 610 AUUUUGACUAAAU poly[A] 181 J) NADH Dehydrogenase subunit 9 0 10 20 30 40 50 60 70 80 90 UUAAUAUCAACUUAAUUUUUUUUAUAAACAuuAuAuuAUGuGuAuAuUUUUAuGuuuAuuuCGuuuAuGuuuuuGuuuAAuuUUAuuuuA**UUGuuuGu :||:||:::||||:||:||||:|:||||||||||||||| pND9(33-71) 14TATTATAGTAGTATGTATATGAAGATACGAGTAAAGCAAATACAAATATA 5’ bsND9(25-72) 11TTTTGTAATATAGTGTATATGTGAGAATATAAATAAGGCAAATACAAAATATA 5’ ||::|:||:||:|:::||||:||||||:|| |||||||| pND9(60-105) 11TATTTAGTGAGTATAAGAGTGAATTGAAATAAGAT--AACAAACA bsND9(60-101) 11TAAATGTAGTGAATATGGAGATAGATTAAGATAAAAT--AACAAACA ||| |:::||:| pND9(87-124) 11TTAAT--AGTGAATA bsND9(87-124) 10TATAAT--AGTGAATA 100 110 120 130 140 150 160 170 180 190 GuuGuAGAuGGuGuuUUGuuuGuuuuGuuGAuuGuAGuuuuuuGuuuuuuuAuuGuuuuGuuAGuuuuuuuuuGuuuuAUUGuAuGuuuuuAuuuuuuAA |||:|| ||||:::|::|:|:|||:::|||| CAATATA 5’(60-105) pND9(176-216) 14TAATAGTGTGTAGAGATAGGGAATT TA-TATA 5’(60-101) bsND9(176-216) 14TAATAGTATATAGAGATAGAAAATT :|::||||::|||:|:|:||||||| |||||| TAGTATCTGTCACGAGATAAACAAA-CAACTATA 5’ (87-124) TGACATCTACTACAGAGCAGACAAA-CAACTATAAA 5’ (87-124) :|||:|:|::||:|:|||||::||:|:|||:|||||||||||:| pND9(117-160) 13TTAAATAGAGTAATTGACATCGGAAGATAAAGAAATAACAAAATA-TATA 5’ bsND9(117-162) 
12TTAAATAGAATAGTTAGTGTCAAAGAGTAAAGAAATAACAAAACAATATA 5’ :|||::|:||:|||||||::||:|:|||||||||||||||||||| pND9(149-193) 15TTGATAGTAGAATAATCAAAGGAAGATAAAATAACATACAAAAATAATATA 5’ bsND9(147-187) 15TAGAATAGTGAGATAATCAAAGAGAAGCAGAATAACATACAATATATA 5’ 200 210 220 230 240 250 260 270 280 290 uuuGuGAuuuuuGuuuuuAuAuuGUUGuGAuUUGuuAuuGAuuGAuuuuuGuGGuuuuuGuuuuuGuCGuuuuAuGuuGuuGUAuAuuuuAuuuuGuuuG |:|||||:||||||||| |||||:::|:||:|:||||:|:||:||| AGACACTGAAAACAAAATATAACATA 5’(176-216) pND9(272-312) 06TATACAGTGATATGTGAAATGAGACGAAC AGATGCTAAAAACAAAATATAACATA 5’(176-216) bsND9(273-314) 12TTTATAGTAGTATATAAGATGGAACGAAT ||:|:|:||:|:|:|:|||||::|::||||:||||||||||| TAATATTGAAGATAGAGATATAGTAGTACTAGACAATAACTAAAATATA 5’ pND9(201-242) pND9(303-339) 13TAA- 11TAATTAAGAGTAGGGATATGGTGACACTAGATAGTAACTAACTAAAAAAAA 5’ bsND9(204-249) bsND9(305-339) 13TAAT :|||:||:|:|:||:::|||:|:||:|||||:||||||||| pND9(239-279) 12TTTAATTAGAGATACTGGAAATAGAAGCAGCAGAATACAACATATTAAA 5’ bsND9(239-286) 11TTTAATTGAAAGTACTAGAGATAAGAGTAGCAAGATACAACAATATATA 5’ 182 300 310 320 330 340 350 360 370 380 390 uuuuuGuGuGuuCGuuuGuGuuuuGuuuuGuGuuGUUUGuuuGUAuuuuuuGGAuuGuGuuuuA*GuuuuA**GuuGuuuuuGuUAuGCGuuuuuGuuGu |:|:||||||||| :||::|:|:|:| |||:|| :|::|||:|||||||||||| AGAGACACACAAGAATA 5’(272-312) pND9(352-392) 13TAATTAGTATAGAGT-CAAGAT--TAGTAAAGACAATACGCAAATTATA 5’ AAAGACACACAAGCATATATA 5’ (273-314) (352-392) 10TAATTAGTATAGAGT-CAAGAT--TAACAGAGACAATACGCAAATTATA 5’ :||||::|||:||||||:|:|:|:||||||||||||| :||||:|:|:|:|:|::| AGTGACACGTAAGTAAACACGAGATAGAACACAACAAACATATA 5’(303-339) pND9(382-421) 10TTAATATGTAGAGATAGTA AGAGTCACGTAAGTAAACACGAGATAGAACACAACAAACATATA 5’(305-339) bsND9(380-418) 12TAATAGTATGTAGAGATAGTA :||||:|:||:|::|||:|:||||||:|:||:||||||||||||| :| pND9(319-366) 11TTAAAATAGAATATGACAGATAAACATGAGAAGCCTAACACAAAAT-TATA 5’ bsND9 (343-368) 03TTAAAATAGAATATGACAGATAGAGATAAAGAATCTAACACAAAAT-TAAATA 5’ 400 410 420 430 440 450 460 470 480 490 uGGAACGC*GAAuGuuuUGAUUUGuuuGGuuuuUAuuuuGuuGGuAAuGAuAuuuuACAUCGuuuAuuuGuuGAuuG****GuuuuuuGuuGGuuuuuuu ::||||:| 
||||:|||||||| |:||:|:|| :|:|:|::|||||:|:|:| GTCTTGTG-CTTATAAAACTAA 5’(382-421) pND9(468-514) 13TATAATTGAC----TAGAGAGTAACCAGAGAGA ATCTTGCG-CTTACAAAATATATA 5’(380-418) bsND9(469-514) 11TTAATTGAC----TAAAGAGTGACCGAAGAAG :|||:|:||:|:||||:|:|:||:||||:|||||||||| 08TTTTATAGAATTGAACAGATCGAAGATAAGACAACCATTAAAATATA 5’ pND9(409-447) 05TATATTGTAGAATTAAGTGAATCAGAAATAAAGCAACCATTAAAATATA 5’ bs(410-447) :|||::|||:|||:|:||||||:||||:|:|||||||| |||| pND9(439-484) 13TTAACTGTTATTATGAGATGTAGTAAATGAGCAACTAAC----CAAATATA 5’ bsND9(438-483) 10TATAATTGTTATTGTAGAATGTAGCGAGTGAATAACTAAC----CAA-TATATA 5’ 500 510 520 530 540 550 560 570 580 590 uuGuuGAAGuGuuAUCCAuuAuuuGGuuuGuuuGuAuuGuuAuuuuGuGuGuuG**GuGGAGGAGAUAGuAuGuACGuuuACAAuGuuAuuuuuGuuGuu ::||||||||||||| ::||:|||:|:||||||:||||:|:||||||| GGCAACTTCACAATATATA 5’(468-514) pND9(568-604) 11TGCTATAGTGTATATGTAGATGTTATAATAGAGACAACAA AGCAACTTCACAATATATA 5’(469-514) bsND9(569-600) 04TGAGTAGTATATGCAAATGTTACAGTGAAAACAACAA :|||||:::|:|||||||||:||:||:::|||||||||||||||:|:| |||:|:|:||:|:||||: 12TTAACTTTGTAGTAGGTAATAGACTAAGTGAACATAACAATAAAATATA 5’(502-549) pND9(582-612) 09TTTATAGTGAAGATAACAG 14TAACAGTTTGATGATAGGTAATGAGTCAAGTAAACATAACAATATATA 5’(509-542) bsND9(582-611) 12TTTATAATGGAGATAGTAA bsND9(597-625) 12TTTATAGTAGAGATATTAA ||||:|:||:||:|:|::|: |::||:|||||#:||||||||||| pND9(534-566) 14TAATAATAGTAGAATATATGAT--CGTCTTCTCTAGTATACATGCAAA 5’ bsND9(533-566) 09TTATAATAGTAGAATATATGAT--CGTCTTCTCTAGTATACATGCAAATAAA 5’ |||:|:||:||:|:|||:: :|:|||:|||:|||||||||||| pND9(535-578) 13TTAATAGTAGAATATACAGT--TATCTCTTCTGTCATACATGCAATTATA 5’ bsND9(543-580) 09TAATAGTGAACATATAGT--CATCTCTTTTATTATACATGCAAATTATA 5’ 183 600 610 620 630 640 650 660 670 GCAuACC**AAuuUUUAuuuG*CAuuAuuuuAuuuA***AuA**UCACCGuUGUAAUUCUAAAUUUCUCACUUCC ||||| CGTATATATA 5’(568-604) TATATA 5’(569-600) :|||||| ||||##|||:|| |||||| TGTATGG--TTAATTATAGAC-GTAATATA 5’(582-612) CGTGTGG--TTATTAATAGAC-GTAA 5’(582-611) CGTGTGG--TTGAAGATAAAC-GTAA 5’(597-625) ||||:|:||||: |||||:||:|:||| ||| #||||:|||||||||| 
12TTTAAGAGTAAAT-GTAATGAAGTGAAT---TAT---GTGGTAACATTAAGAATATATA 5’ pND9(609-644) 12TTTAAAAGTGAAT-GTAATAAGATAGAT---TG----ATAG 5’ bsND9(609-640) :||||: |||:|:|:||||:| ||| |||||::||||||:|| 12TGTAAAT-GTAGTGAGATAAGT---TAT--AGTGGTGACATTAGGAAATATATA 5’ pND9(615-659) 12TTAAAGTTAGT-GTAATAAGATAAGT---TAT--AGTGGCAACATTAAGTATATA 5’ bsND9(618-658)

K) Ribosomal Protein S12

0 10 20 30 40 50 60 70 80 90
CUAAUACACUUUUGAUAACAAACUAAAGUAAAuAuAuuuuGuuuuuuuuGCGuAuGuGA*UUUUUGUAUG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGu
||:|::::||:|:|||:|||||:| :||:||||:| |||||| |::| pRPS12(35-76) 12TATTTAGAGTGGAAGAGACGTATACATT-GAAGACATGC-CAACAAATA (96-121)12TATTATAGTA bsRPS12(38-78) 18TAGTGAAGAGAGTGTATATGCT-AAAGACATAC-CAACAATATATA(96-121)11TAATAGTA |:||:|:||||::||| |:||::|||| |||||||| (96-131)10TATAGTA pRPS12(43-78) 14TATATAGTTAGAAGATGCATGTACT-AGAAGTATAC-CAACAACATATA 5’ bsRPS12(43-78) 12TATATAGTTAGAAGATGCATGTACT-AGAAGTATAC-CAACAACATATA 5’ |::||:: :||::|:|:||| ::|:|||:|||||||| pRPS12(63-109) 12TATAGTATGT-TAATGATAGATG-TGAGACAGAATAAACA bsRPS12(66-99) 07TAGTAGAGTGT-CAATAGTAAATG-CAAAACAAAATAAATATA5’ :|::|:||| |||:|:||:||||||| pRPS12(74-106) 12TAATATGTCATTAGTAGATG-CAAGATAAGATAAACA bsRPS12(73-115) 16TCTTTTATAGTAAATG-TAGAGCAAGATAGACA
100 110 120 130 140 150 160 170 180 190
uuuAuGuuAuuAuAuGAGuCCG**CGAuuGCCCAGuuCCGGuAACCGACGuGuAuuGuAuGC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAuGu
|||||:|||| :|||:|:|:|||||:|||:||::|:||||:| AAATATAATATA 5’(63-109) pRPS12(169-208) 12TATATAGAGTGAATATGTTAGAATGAGCCTATA |||||:| bsRPS12(156-207) 12TTATATG--G----TATAAGATAGATGTGTTAGAATAGACTTACA AAATATATATA 5’ (74-106) GAATACAATAATATATATA 5’(73-115) ||||:||||:|||||:|:|||| ::||:| AAATGCAATGATATATTTAGGC--ACTAA 5’ (96-121) pRPS12(194-235) 11TTTTATA GAATGTAGTGATATATTCAGGT--AGCTAACGTGTCAAATATA 5’(96-121) bsRPS12(194-235) 09TTTTATA ||||:||||:|||||:|:|||| ||||||||#|:||| AAATGCAATGATATATTTAGGC--GCTAACGGATTAAGATATA 5’(96-131) |: |:||::||#|::||||:|||||:|||||||||||| 10TATTCAGT--GTTAGTGGATTGAGGCTATTGGTTGCACATAACATTCA 5’ (119-158) :|||:|#||:|:|#:|||||||||||:||:|:|||||
| || TTTTAATGTGTTAGGATCATTGGCTGCATATGATATACG--G----CAATATA 5’ pRPS12(139-170) 11TTTAGTAGATTGAGGCTATTGGTTGCACATAACATTCATA 5’ bsRPS12(133-158)
200 210 220 230 240 250 260 270 280 290
uGCGuuGuuuuuuuuGuuGuuuuAuuGGuuuAGuuAuG**UCAuuAuuuAuuAuAGA***GGGUGGuGGuuuuGuuGAuuuACCC***G****GuG*UAA
||||||||| ::|||:::||||:||#|| : ||: ||| ACGCAACAATAAATA 5’ (169-208) pRPS12(267-322) 07TTTAAAGTGACTAGATAGG---T----CAT-ATT ATGCAACATATA 5’(156-207) bsRPS12(288-322) 12TTTAAAGTAACTAGATGGA---T----CAT-ATT bsRPS12(269-308) 10TTAGTACAAGAGCAGTTAAATGGG---C----TAC-ATT |||: ||||:|:|||:||:||| :::|||::|||||||||||:| 14TATAT--AGTAGTGAATGATGTCT---TTTACCGTCAAAACAACTAGAATATA 5’pRPS12(234-280) 15TATAT--AGTAGTGAGTGATATCT---TTTACCGTCAAAACAACTAGAATATA 5’bsRPS12(234-280) |:|:|::|:|:||:|::||:|||||::|||||||||#| ||||||| ATGTAGTAGAGAAGATGACGAAATAGTCAAATCAAT-C--AGTAATATATA 5’(194-235) ATGTAGCAGAGAAGATGACGAAATAGTCAAATCAAT-C--AGTAATATATA 5’(198-235) :||:|:|:|:|::|||::|:|||:|:|:|||||:| ||||||| 14TATAATAGAGAGAGTAACGGAGTAATCGAGTCAATGC--AGTAATATATATA 5’ pRPS12(203-246) 07TATAATAGAGAGAGTAATGGGATAGTTAAATCAATAC--AGTAATTAAAATA 5’ bsRPS12(203-245)
300 310 320 330 340 350
AGuAuuAuACA*CG**UAuuGuAAGuuAGA*UUUAGAuAUAAGAUAUGUUUUU[AAUA]POLYA
|:|||:||||| || ||||||| TTATAGTATGT-GC--ATAACATATA 5’ (267-322) TCATAATATATA 5’(269-308) TTATAATGTGT-GC--ATAACATATA 5’bsRPS12(288-322) || |: ||||::|||||||| ||:|||||||||||||||| 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTATATTCTATACAATATAAA 5’ pRPS12(309-349) 24TAAGT-GT--GTAATGTTCAATCT-AAATCT-TAGTCTATACAAAATAAA 5’bsRPS12(309-336)

APPENDIX C. All gRNA major classes pulled for ATPase 6 in the EATRO 164 procyclic (shaded gray) and bloodstream (white) transcriptomes. Populations of gRNAs are bordered boxes.
A) ATPase 6; B) Cytochrome Oxidase III; C) C-Rich Region 3; D) C-Rich Region 4; E) Cytochrome b; F) Maxicircle Unidentified Reading Frame II (Murf II); G) NADH Dehydrogenase Subunit 3; H) NADH Dehydrogenase Subunit 7; I) NADH Dehydrogenase Subunit 8; J) NADH Dehydrogenase Subunit 9; K) Ribosomal Protein S12.

A) ATPase 6

3' 75 72 75 75 75 75 62 75 62 75 75 102 100 102 102 100 127 129 118 118 118 118 129 152 152 148 152 148 183 177 175 177 208 210 208 208 208 243 248 243 243 243 246 269 269 262 262 262 267 5' 31 29 31 31 29 35 29 33 31 39 37 62 62 62 74 64 86 90 84 82 86 83 91 113 105 105 116 104 138 144 139 132 164 176 165 158 164 192 218 190 189 207 193 224 226 221 219 218 221 Reads 2,044 1,630 143 489 395 224 203 144 95 82 40 ATPase 6 gRNA Sequences AT ATAAACGTAACTGAAATGAATCACGAGAGAAAGATAAAGATATAT AT12 ATATAC AACGCAACCAGAGTAAATCATGAAGGGAAAGTGAAGGCATATTT T11 AT ATAAACGTAACTGAAATGAATCGCGAGAGAAAGATAAAGATATAT AT21 AT ATAAACGTAACCAAAATGGATCATGGAAGAGAAGTAAAGATATGT AT09 AT ATAAACGTAACCAAAATGGATCATGGAAGAGAAGTAAAGATATGTTT T11 AT ATAAACGTAACCAAAATGGATCATGGAAGAGAAGTAAAGAT T13 ATATACAACGCAACC AGATAAATCATGAAGAGAAAGTGAAGGTATATTT T09 AT ATAAACGTAACCAAAATGGATCATGGAAGAGAAGTAAAGATAT T12* AACGCAACC AGATAAATCATGAAGAGAAAGTGAAGGTATAT AT14 AT ATAAACGTAACCAAAATGGATCATGGAAGAGAAGTAA T16 AT ATAAACGTAACCAAAATGGATCATGGAAGAGAAGTAAAG TCAACTTAAT08* 1,435 154 92 10 9 ATACA ATCATACACAGTAGTACATATATAGTGATAGACGTGATTAA T11 ATAAAA CATACACAATGATATATACATAGTAATAGATGTGATTAA T23 ATATAA ATCATACACGATAATATATGCGTAGTAACAGATGTGATTAA T18 ATATAT ATCATACACGATAATGCATATGTAGTAAC T13 ATAAAA CATACACAATGATATATACATAGTAATAGATGTGATT T14* 2,158 177 85 38 28 22 21 743 54 128 84 16 430 210 3,468 205 54 36 24 14 218 172 147 52 10 7 5 864 808 149 105 57 32 ATAT AAATACACAGTAGAATATGATCTAGGTTATGTATGATGATAT T11 ATATA TAAAATACACGATAGAGCATAACTTAGATTGTATATGATA T20 ATAATAATACAC ATAGAACATGACCTAGATTGTACATAGTGATATAT T12 ATATAATAATACAC ATAGAACATGACCTAGATTGTACATAGTGATATATAT T13 ATATAATAATACAC ATAGAACATGACCTAGATTGTACATAGTGATAT T15
ATAATAATACAC ATAGAACATGACCTAGATTGTACATAGTGATATATA AT18 ATATA TAAAATACACGATAGAGCATAACTTAGATTGTATATGAT T18 ATAC ATCAAAAATCAACGTTAGACAGTTAAGATATGTGATAGAA GATAAT12 AC ATCAAAAATCGACATTAGATAATTGAGGTATGTGATAGAGTATAATTT T11 ATATC AAAATCAACATTGAGCAGTTAAAGTACGTGGTAAGATATAATTT T12 AC ATCAAAAATCAACATTGAGCAATTGAGGTACATGATA TGATATAAT09 ATATC AAAATCAACATTGAGCAGTTAAAGTACGTGGTAAGATATAATTTA T07* AT ATACAAATCAAACAGACAGAGTAATAGAAGGTTGAAGATTGATAT AGT11 ATATC ATCAAACAAACAGAATAATAGAGAATCAGAGGT GAATGTTAAGT15 ATAAA TAAACAAACAAAATGATAAAAGGTCAGAGATTGATG GTGAATAAT08* ATAT ATCAAACAAACAAAGTAATAGAAAGTCAGAGATTGATGTTAAATA T10 ATAT ATAAACACAAATCAACGAATAGATATAAGTCAGATAGATGG TGTATTAT12* AT AAACAAACACAAATCAGTAGACGAGTACAAGT GAGATGGACGTATAGAT07 ATAAT ACAAACACAAACTGATAGACGAATACGAGTTAGATGGACG TAT06 ATAT ACAAACACAAACTGACGAATAGATACAGATTAAGTGAATGAAATAAT T11 AT ATAAGCACAAACCAATAGACAGATATAAGTCAGATAGATGA TTAT14 ATATAAATTAAACAACATAGATTACAGTGATAGAAGTAAATGTGAATTA T04 ATC AGACTATGTGAGTTAGATGACGTGAATTATA CTGTATAT12 ATATAAATTAAACAACATGAACTATGATGATAAAGGTAAATGTGAATTAAT T19 ATATAAATTAAACAACATGAACTATGATGATAAAGGTAAATGTGAATTAATG TACT* ATATAAATTAAACAACATGAACTATGATGATAAAGGT T10 *ATTGTATAAATTAAACAACATGAACTATGATGATAAAGGTAAATGTGAATT T03 ACATAA TAATACAATAATACGAGATTAGACTATGTGAATTAAATGATATGA T11 ACATAA TAATACAATAATACGAGATTAGACTATGTGAATTAAATGATAT T13 ATATAT ATAATACAAAATTGAACTGTATAAGTTAGACAATGTGAATT TT ATATAT ATAATACAAAATTGAACTGTATAAGTTAGACAATGTGAATTAT T14 ATATAT ATAATACAAAATTGAACTGTATAAGTTAGACAATGTGAATTATA T21 AC ATACAATAATACAGAATTAAACTGTGTAAGTTAGATAGTGTAAATT T10* 187 5' 248 253 255 252 262 259 254 266 281 284 283 281 3' 292 298 292 292 304 304 298 313 313 310 310 312 291 293 301 300 301 301 331 332 331 331 332 331 332 331 329 329 345 346 335 345 375 378 371 375 378 374 378 371 360 349 352 360 354 362 349 361 407 389 401 407 401 407 389 407 387 387 387 387 390 398 397 424 427 421 424 424 427 424 5' 455 455 452 476 458 435 435 437 435 435 435 435 464 467 460 457 464 468 457 3' 491 497 477 500 500 Reads 25,157 5,405 244 134 4,881 395 
290 586 14 13 2 2 133 12 647 263 125 33 24,736 8,776 3,561 712 387 302 144 1,547 181 41 705 429 160 143 104 74 1,428 1,049 934 664 635 28 23 ATPase 6 gRNA Sequences cont. ATATA AAATACAAATTCGAGTAGGTAGTACAATGATATGAGATTA T13 ATATAT AACAACAAATATAGATTCAAGTAAGTGATGTAGTAATATGA T11 ATATA AAATACAAATTCGAGTAGGTAGTACAATGATAT TATTATTAAT15 ATATA AAATACAAATTCGAGTAGGTAGTACAATGATATAGA TTATTAAT07 ATAT ATACAAAACAACAGATATAGATTCGGATAGGTAATATGA GATCT13 AT ATATAAAACAACAGATATGAATTCAAGTGAGTGATACAGTA TAT15* ATATAT AACAACAAATACGAATTCAGATAGGTAGTATGATGATATA T15 AAAAAA AAAAAAACAATACAAGATGACAGGTATAAGTTTGGATGAGTAAT T12 AAA AAAAAAACAATACAGAATAGTAGGTATAGATT AGATATGTGAT09* ATAT AGAACAATACAAAATAACGAGTACAG T08 TAT AGAACAATACAAAATAACGAGTACAGG ATAAGTGATAT08 AT AGAAAACAATACAGAATAGTAGGTATAGATT AGATATGTGAT19 AAATATAT AAATGCAATATACGATAGAGAAATGATATAAGATGATAA T16* AAATATAT AAATGCAATATACGATAGAGAAATGATATAAGATGAT T14 ATAT AACAAAACAAAAGTAGAAGTGCAGTATATGATAGAAAAATGATGT CAAAT11 ATAT AAACAAAACAGAAATAGAAATGCAATATACGATAAGAAAATGGTATA T12 ATAT ACAAAACAT AAATAAAAGTGCAGTATATGATAAAGAGATAATAT T11 ATAT AACAAAACAAAAGTAAAAGTGCAGTGTATGATAGAAAAATGATGT CAAAT15 ATAT AATTATTAAACAAGAGAAAGTCACGTAAAAGGTAGAATGAAGATA T12 AT ATAAATTATTAAACAGAAAGAGATCATGTAGAAAGTGAGATAGAAAT T14 ATATAA ATTAAACAAAAAGAAATCACGTAGAAGACAGAATAGAGATA T13 ATAT AATTATTAAACAAGAGAAAGTCACGTAAAAAGTAGAATGAAGATA TTAT05 AT ATAAATTATTAAACAGAAAAGAGTCATATAGAAAATAAGATAGAAAT T12 ATAT ATTATTAAACAAAGAGAAATCATATAAGAGACAGAATGAGAATA T15 AT ATAAATTATTAAACAGAAAGAGATCATGTAGAAAGTGAGATAAAAAT TTT ATATAA ATTAAACAAAAAGAAATCACGTAGAAGACAGAATAGAGATA T15* ATATAT ACATCCATAAAATTATCATCAGTTAATAGATTGTTAAATGAAAA TTTT ATATAA ATCACCAACTAATAAGTTATTGAATGAGAGAAAGTTATATA T12 ATATA ATAAAACTATCACTAACTAATGGATTGTTAAGTAGAAGAGAATCAT T11 ATATAT ACATCCATAAAATTATCATCGGTTAATAGATTGTTAAATGAAAA T11 ATATA ATAAAACTATCACTAACTAATGGATTGTTAAGTAGAAGAGAATC CT07 ATATAT ACATCCATAAAATTATCATCGGTTAATAGATTGTTAAATGAA T12 ATATAA ATCACCAACTAATAAGTTATTGAATGAGAGAAAGTTATATA T09 ATATAT ACATCCATAAAATTATCATCGGTTAATAGATTGTTAAATGAAA 
TTTTTTTTTTTTTTT ATATAT AACACAACAAGAAACGAATGAGAGAAGTATCTATGAGATTATT T14* ATATAT AACACAACAAGAGACGAATAGAAAAGATATCTGTGAAATTATT T13 ATATAT AAAACACAATAGAAAACGGATAAGAGAGATATTCATAGAGTTATT T13 ATATAT AACACAACAAGAGACGAATAGAAAAGATATCTGTGAAATTATT T12 ATATAT AACACAACAAGAGACGAATAGAAAAGATATCTGTGAAATT T11 ATATAT AACACAACAAGAGACGAATAGAAAAGATATCTGTGA T05 ATATAT AACACAACAAGAGACGAATAGAAAAGATATCTGTGAA T13 25,624 2,307 1,864 368 6,879 1,781 232 ATAT ATGACACAACGAGGGAAGATACTCTAAAGGACACAGTGAAA T12 ATAT ATAACGACACAATAGAGAAAGATGCTCTGAGAGATGTAATA T13 ATAAAT TACAACAAAGAAAGATACTCTAGAAAGCACAGTGAGAAAT T16 AAATTAACGACA AACAAAGAGAAATACTCTGAGAAATATGATGAAA T12 ATAT ATGACACAACGAGGGAAGATACTCTAAAAGGTACAGCGAAA T13* ATAT AACAACGATACGACAGAGAAAGATATTCTAAGAGATATGACA T13* AAATTAACGACA AACAAAGAGAAATACTCTGAGAAATATGATGAAA T12 Reads 368 22 54 14 1 ATPase 6 gRNA Sequences cont. ATATATAATTAC AAACAAACGCAGAGATGTCGGTAAATAATGATATAAT T11 ATAT ATTACAAAACAGACGTAAAGATGTCGATGAATGGTGGTATAAT T14 AATT ACGTCGATAGATAACGATACAATGAG ATTAATTTT AAATT TAAATTACAAGACAAACGTAGAAGC T24 AAATT TAAATTACAAGACAAACGTAGAAGCGTCGATAGATAATGATAT15 487 487 487 521 520 521 526 528 526 567 553 553 8,723 1 635 232 85 11 ATACAA ATCAACAATAGAAGATGGGATGATAATAGATTGTGAGATA T17 ATAC ACATCAACAATAGAAGATGGGATGATAATAGATTGTGAGATA T27 ATACAA ATCAACAATAGAAGATGGGATGATAATAGATTGTGAGATA T16 AA AAAAAAAAAAAACAAAAATAGAATAAAGAAAGTCAGAGAATGTTAAT T05 AAAAAAAAAAAC AAAATAAAGTAAGAAGAATCAGAGAGTGTCAATA TATTTT AAAAAAAAAAC AAAATAAAGTAAGAAGAATCAGAGAGTGTCAAT T12 188 5' 557 549 549 546 546 549 3' 593 593 592 592 593 592 Reads 69,619 2,587 181 313 76 30 ATPase 6 gRNA Sequences cont. 
AATAAATCGATAACAAAGAACACTGTAAAAGAGAGAA TGAGAGTAAATAT09 AC AATAAATCAATAACAGAGAATATCATAGAGAGGAAAGATAGAAAT T12* ATAT ATAAATCAATGACAAGAAGCACTGTAGAAAAAGAGAGTGAAAAT T13 AT ATAAATCAATAACAGAAGATGCCATAGAGAGAGAAAGTGAGAGTAAA T11 AT AATAAATCAATAACAAAGAACATTGTAAAAGAGAAAAGTGAGAATAAA T13 ATAT ATAAATCAATGACAAGAAGCACTGTAGAAAAAGAGAGTGAAAAT T13 568 576 589 589 611 616 629 629 613 613 613 613 654 657 654 657 640 647 640 654 643 640 640 680 680 672 680 671 685 689 689 689 689 667 668 662 714 719 716 714 714 714 698 686 699 699 728 728 727 727 720 715 728 720 717 718 720 720 720 747 747 745 767 755 765 767 763 763 763 767 767 789 789 789 790 770 773 774 777 822 822 822 822 822 670 115 854 3,292 618 183 459 399 39,063 678 131 234 4,401 250 165 7,581 1,291 119 105 319 309 740 12 88 24 4,588 2,272 920 165 1,613 488 452 269 177 13 587 91 8,663 1 428 223 195 ATACT AAACACAAAAATGAATAAAATAAGTCAGTGATAGAAGATATTAT T12 AT AAACAAAACACAAAAATAAGTAAAGTAGGTCAGTGATAAGA TATACAT07* AT AAATAATAAACAGAAACGGAATACGAGAATAAGTAAAGTGA TTTAAT13 AT AAATAATAAACAGAAACGGAATACGAGAATAAGTAAAGTGA TTTAAT10 ATATAT AATCCAACAGATATAAGAGCATGTAAAATAGTAAGTGAAAAT T12 AT ATAAATCCAACAAGTATAAGAACATATAGAATAGTAGGTGAAAAT T12 ATAT AATCCAACAGATATAAGAGCATGTAAAATAGTAAGTGAAAAT T11 ATAT ATAAATCCAACAAGTATGAAGACACGTAAAATAGTAAATGAAAAT T14* ATAT ATAAATAACTGTAGTATGGTGGTAGATGAGTTTGATAGATATA T12 ATAT ATAAATAACTGTAGTATGGTGGTAGATGAGTTTGAT T11 ATAT ATAAATAACTGTAGTATGGCGGTAGATGAGTTTGATAGATATA T11 ATAT ATAAATAACTGTAGTATGGTGGTAGATGA TTTTGATAGATATAT12 ACATATATAAATAACTGTGATATTGCGGTAGATGGATCTGATGAAT T14† ATATAATAAATAACTATAATAAGGTGGTAAGTGAGTTCAGTGAATATA T14† AAAACTGTAATATGGAGGTAAGTGAATTTGATAGATGTA TTAATTTT† AAAATA CAACTGCAAGATCGTGTTATAGAGGATAAGTGATT TAAT13 ATATAA ATTATCAACTGTGAGATTATATTACAAGGAATAAGTGATT T13 ACACA ATCAACTGCAGAATTATATTACAGAGAGTGAGTAATTGTAA AAT12 AAATA CAACTGCAAGATCGTGTTGTAGAGGATAAGTGATT TAAT11 AGATA CAACTGCAAGATCATATTATAAGAGGTGAATGATTGTAAT 14T* ATATA TAACTGCAGAATCATATTATAAGGGATGAA CGATTGT13 ATT AAAATCCATTATCGATTGTAGAGTTATGT GATAGAGAATAAT11 ATATATT 
AAAATCCATTATCGATTGTAGAGTTATGTTATAGAGAATAA TAT21 AAATCCATTATCAGTTGCGAGATTGTA GTATAAAGAATAAT14 ATATAT AAATCCATTATTAACTGTAAGATTGTA GTATAGT17 AT AAATCAAATACAGAACTGAATAGACGATAAAGATAGTGAGAAATTT T11 ATATATAT AAACTAAACAAATAGCAGAGACAGTGAGAGATTCGTTAT AAT13 AT ATCAAATACAAAACTGAGCAGATGACAGAGATAGTAAA TGATTTAT12 AT AAATCAAATACAGAACTAGATGAACAATAGAGATAGTGAGAAATTT T12 ATATA TAAATACAAAACTAGATGAATGACAGAAACGATGAGAGATTTATT T14* ATA TAAATACAAAACTAGATGAATGACAGAAACGATGAGAGATTTAT AAT17 ATATA TAAATACAAAACTAGATGAATGACAGAAACGATGAGAGATTT T12 AT AAATCAAATACAGAACTAGATAGACGATAGAGATAGTGAGAAATTC TTTT* AT AAATCAAATACAGAACTAGATAGACGATAGAGATAGTGAGAAATTT T17 ATAAAT ACAACAATATAATAACTGTCGAAGGTTGAATATGAGATTAAAT T11 ATATAT ACAACAATATAATAGCTATCAGAGGTTGAATGTGAGATTAAAT T11 ATATAT ACAACAATATAATAGCTATCAGAGGTTGAATGTGAGATTAAATGA T11 ATA CTATAACTCCAATGACGAAATCAGTTTTA CAGTGATATGATAATT T12 GGA CTATAACTCCGATAACGAATCAGATTTTGACAGTGATATGATAATTATT* ATATA CTATAACTCCGATAACGAATCAGATTTTGACAGTGATATGATAATT T09 ATATA CTATAACTCCGATAACGAATCAGATTTTGACAGTGATATGATAAT ATATA CTATAACTCCGATAACGAATCAGATTTTGACAGTGATATGAT T14

B) Cytochrome Oxidase III

3' 70 73 70 70 101 99 99 101 92 95 112 116 131 115 115 115 112 132 156 155 150 185 185 188 185 185 185 185 185 188 185 185 185 187 185 185 188 188 185 203 211 195 216 247 247 243 247 244 243 247 247 247 244 243 244 243 243 244 Reads 1,179 826 112 14,200 COIII gRNA Sequences ATAATT AATATACAACGAGATAGAGACGTAAAAGAAT TGATGTAT12 AT ATAAATATACAACGAGATGAAGGCATAGAGAAA AGATGGTATATAAT14 ATATAC AATATACAACGGAATGAGAATATAAGAAAGTGATGATA TTAT11 ATATAT AATATACAACGAGATAAGAACATAGAGAAA AGATGGTATATAT13 1,386 721 364 229 122 78 834 550 1 443 117 96 64 6 104 465 36 ATAT AAAACAAAAACATCACTGATATTGACGGATATATGATGA TAAAT12 ATATAT AACAAAAACACTACTAGCGTTGACAGATATATGATGAAAT T12 ATAT AACAAAAACACTACTAGCATTGACAAATATATGATGAAAT T13* AT AAAACAAAAACACTGCTAATATCGACGAATATATGATGGAA AAT14* ATATAAAAT ACACCACTGATATCAACGAGTATATGATGAGATA T14 ATATAT AAAACACCACTGACATCGATAAGTATATAGTGAAGTGA TTAAT15 GTA GAGTGAAGATAGAGAAATAAAGATATCGTT T13
ATATATAATAACAATA GCAGGTAAAGGTGAGAAAGTGAAGATATCATT T10 TACATAATAACAGTGGCGGGTAGAGATAGAAGAATAAAGATACTATT T08 AACGATGGA TAGGTAGAGATAGAGAAATGAAGATATT TTAT05* AATAACAATGGA TAGGTAGAGATAAAGAAATGAAGATATC T06 AATAACGATGGA TAGGTAGAGATAGAGAAATGAAGATATC T07 ATACATAACAGTGGCAGA GTGAAGATAGAGAAATAAAGATATCATT T11* AT ATATACAATAACAGTGGTAGGTAGA T09 ATATATA TCCAACAAACAGAGTAACCGATACATAGTGATAGTG ATAT13 ACATATA CCAACAAACAGAATAACTAGTGCACAGTGATGATG ATAGT16 ATATA CAACAAACAAAATAATCGATGCACAGTGATAGT AGTAGT13 126,513 ATAT ATTACCAAACAATAGACGAGTAGATTCTAATAGATGA TTTAAT13 761 542 350 199 183 172 116 109 106 99 ATAT ATTACCAAACAATAGATGAGTAGATTCTAATAGATGA TTTAATTAAGTTTT* ATAT AAAACTACCAAACAGTAAATAGATAAGTTCTAATAAGTGAGATAATT T11 ATAT ATTACCAAACAATAGACGAGTAGATTCTAATA TATGATTTAAT12 ATAT ATTACCAAACAATAGACGAGTAGATTCTAATAGATA TTTAATTAT05 ATAT ATTACCAAACAATAGACAAGTAGATTCTAATAGATGA TTTAAT13 ATAT ATTACCAAACAATAGACGAGTAGATTCTAATAG CTGATTTAAT13 ATAT ATTACCAAACAATAGACGAGTAGATTCTAATAGAT CATTTAATTAT05 ATAT AAAACTACCAAACAGTAAATAGATAAGTTCTAATAAGTGAGATAATTAAT TGTTAT16 ATAT ATTACCAAACAATAGACGAGTAGATTCTAAT CGATGATTTAATTAAT17 ATAT ATTACCAAACAATAGACGAGTAGATTCTAATAAATGA TTTAATTAAT08 62,901 ATAT ATTACCAAACAATAGACGAGTAGATTCTAATAGATGA TTTAAT14 810 685 497 332 308 140 2,487 861 232 138 5,008 688 462 457 83,016 9,810 2,126 1,676 1,379 363 269 149 130 126 102 ATAT AAACTACCAAATAATGAACAGATAAATTTCAGTGAGTGA TTTAAT13* ATAT ATTACCAAACAATAGACGAGTAGATTCTAATAGAT T13 AT ACTACCAAACGATAAGCAGATAAGTCTCAGTGAATGA TGTAACT15 ATAT AAAACTACCAAACAGTAAATAGATAAGTTCTAATAAGTGAGATAATT T10 ATAT AAAACTACCAAACGATAGACGAATAAGTTCTGATAAGTGA TATAT12 ATAT ATTACCAAACAATAGACGAGTAGATTCTAATAGATG T15 AAA ATAATCAACAAATAGAGAACTGCTAGATGATAGGTGA TATAGAT13 ATATT AACCACAATCAATAAGTAAGAGACTACTAGATGATAGATAA T14 ATATATAAAACCACAATCAT CAGATAAGAGACTATTAAGTGATA T12*† ATATAT AATAAAACCACAATTAGCAAGTAAGAAG GTATCAGATGATAATTAT06† ATATAT ACAAACAAATACAGAGATCGACGAGAAAGAAAGTGAGATT TAT12 ATAAATAAATACAAAAATCAGCAAGAAAGAGAGTAAGATTGTGATTAAT T08 ATAT ACAAATACAAAAATCGATAGAAAAGAAAGTGAGATCATGATT TAT12 
ATAAATAAATACAAAAATCGACAGAGAGAAAAGTAGGATTGTGATTAAT T12 ATAT AACAAATACAAGAGCCGATGAAGAAAAAGGTAGAACTGTGATTAAT T12 ATATAT ACAAATACAGAAACTGACGAAAGAGAGAATGAAGTTATGAT CT17 ATATAT ACAAACAAATATGAGAACTAACAAGAGAGAAAGTGAGATTAT T12* ATATAT ACAAACAAATATGAGAACTAACAAGAGAGAAAGTGAGATTATA T17 ATATAT ACAAACAAATATGAGAACTAACAAGAGAGAAAGTGAGATT T13 ATATAT AACAAATACAGAAGCCAACGAGAGAAGGAATAAGATTGTAAT T10 ATATAT ACAAATACAGAAACTGACGAAAGAGAGAATGAAGTTAT T14 ATAT AACAAATACAAGAGCCGATGAAGAAAAAGGTAGAACT TTGATTAAT12 ATAT ATAAATACAAAAACTAACGAAAGAAAAGATGGAACTGTGGTTAAT T12 ATAT ACAAATACAGAAACTGACGAAAGAGAGAATGAAGTT T19 ATAT AACAAATACAAGAGCCGATGAAGAAAAAGGTAGAACTGT TATTAT14 190 5' 35 36 29 36 54 51 51 52 50 49 81 81 81 88 88 88 81 108 117 117 118 141 141 134 146 142 141 145 143 131 147 141 141 141 143 141 134 141 142 163 163 168 185 204 195 199 195 195 199 202 201 204 199 202 204 195 204 202 3' 274 279 270 268 279 264 279 299 308 310 300 306 299 300 307 300 300 289 307 300 320 321 335 332 321 365 365 365 365 360 391 391 389 389 389 389 391 389 388 390 406 406 418 422 418 418 422 426 436 449 438 449 449 452 453 452 467 469 460 474 3' 5' 229 236 229 238 238 234 234 258 265 265 258 267 261 258 261 262 261 258 263 257 293 284 293 291 293 323 330 323 323 323 345 345 345 349 347 343 357 353 352 354 362 362 376 378 384 376 378 397 397 411 409 413 410 413 418 413 418 437 427 443 5' Reads 289 575 559 120 37 33 29 COIII gRNA Sequences cont. 
ATA TACAAAACAAATCTAACAGTGATAGTAACAGATAGATATAGAGATT T10 ATAT AAAATCACAGAACAGATCTGATAGTAACAGTAATAAGTAAATAT T11 ATATAT AAACAAATCTAATGATAACGATGACGGATAGATATAGAGATT TAT16 ATATAAACCACAAT ACAGATCTGACAGTAATGATGATAGGTAAAT T05 ATAT AAAATCACAGAACAGATCTGATAGTAACAGTAATAAGTAAAT TTCT14 ACAAAACAT ATCTAGCAGTAACAGTGACGAATAGATACAA T07 ATAT AAAATCACAGAACAGATCTGATAGTAACAGTAATAAGTAAATATAA TTTT 18,740 4,452 4,418 814 435 181 347 328 244 195 182 71 39 ATATAT AAATCAAATAAACTATGTAGAAAGTTACGAGATAGATTTAATA T10 AAA AAAACACAAAAATCAAGTGAACTATGTAGAGGATTGTAAGATAA T11 ATAAAACACAAAAATCAAGTGAACTATGTAGAGGATTGTAAGATAA T13 ATATAT AAAATCAAATAAATTACGTAGAGAGTTACAGAATAAGTTTAAT T10 ATATAT AACACAAAAATCAGATAGACTATGTAGAAGATTGTGAAAT T11 ATATAT AAATCAAATAAACTATGTAGAAAGTTACGAGATAGATTT T08 ATATAT AAAATCAAATAGATCACGTGAAGAGTTATAGAATAGATTTAAT T14 ATAC AAACACAGAAATCAGATAGATCACGTAGAGAGTTATAAGATAAATTT T08 ATATATATT AAAATCAGATAAGCCACGTAGAAGATTGTAAAGTGAATT AT12 ATATATATT AAAATCAGATAAGCCACGTAGAAGATTGTAAAGTGAATTT T09 AATCACGTGAAAGATCGTAGAATGAGTTTAAT T13 ATAC AAACACAGAAATCAGATAGATCACGTAGAGAGTTATAAGATAAAT AT08 ATATAT AAAATCAAATAGATCACGTGAAGAGTTATAGAATAGATTTAATA T12 1,024 21 1 1,923 151 AATACTGT ATATGATGTAGTAAGATATAGAGATTAA T10 ATATAAA GATACAACGTAATAAGGCATAGAAGTTAAGTGAATTAT TGT12 AAAAAACAATACTGGATATGATGTAGTAAGATATAGAGATTAA TAACT06 ATATAT AAACAATACTGGGTACGATGTAATAGAATGTGAAAGTTAAAT T14 AATACTGA GATACGACGTGATAAGATATAGAAGTTAA T11 33,179 856 301 4,275 139 522 203 490 461 365 262 106 79 48 41 101 139 9,942 1,634 185 169 375 15 10 20,772 1,420 224 125 545 60 20 451 238 218 37 Reads ATAT ATAAAACAAACTCGCTATGTAAGAACTGTAAAAAGTGATATT AT12 ATAT ATAAAACAAACTCGCTATGTAAGAACTGTAAAAAA GTGATATT T09 ATAT ATAAAACAAACTCGCTATGTAAGAACTGTAAAAAGCGATATT AT14 ATATAT ATAAAACAAACTCACTGTGTAAAGATTGTAGAAAGTGATATT AT24 ATACAT ATAAACTCACTGCATAAGAATCATAGAGAGTGATATT AT11* ATATAT ATAATACAACAAGGAGCGTCATAAGTAAAGTGAATTCGTTATAT T12 ATATT ATAATACAACAGAAAATGTCATAAGTGAGATGAATTCGTTATAT T08 ATATAA AATACAACAAGAGACGTCGTAAATAGAGTAAATTCGTTATAT TTT ATATATAA 
AATACAACAAGAGACGTCGTAAATAGAGTAAATTCGTT T06 ATATATAA AATACAACAAGAGACGTCGTAAATAGAGTAAATTCGTTAT T11 ATATATAA AATACAACAAGAGACGTCGTAAATAGAGTAAATTCGTTATATAA T15 ATATT ATAATACAACAGAAAATGTCATAAGTGAGATGA TTCGTTAT06 ATATAA AATACAACAAGAGACGTCGTAAATAGAGTAAATTT TTT ACAAAT ATACAACAAAAGATGCCGTAGATAAGATAGATTTG GTATATTTT ATATAT TAATACAACAGAAGACGCTATAAGTGAGATAGATT GATTATAT27 ATATAT ATAAACATAAATCAGATAGTACAATGAAGAGTGTTATAGATAA T09 ATAT ATAAACATAAATCAGATAATACAGTGAAGAGTGTCATAGATAA T06 ACA TATAACACAAAAATAGACATAGACTGAATGATGCAGTGAAA T13 ATAT AACTTACAACACAGAGATAGACATAGATCAGATAATGTGATAA T13 ACA TATAACACAAAAATAGACATAGACTGAATGATGCA TTGAAAT11 ACA TATAACACAAAAATAGACATAGACTGAATGATGCAGTAAAA TTTTAACT06* ATAT AACTTACAACACAGAAATAAGCATAGATCAGATAGTGTGATAA TTAAT17 AAAACGAT AGCAAATTCATGACGTGAAAATAGATGTAA T11 AAAA AAAAAACGAAAGCAGATTCACGGTACAGAGATAGATATAG T10 ATATATA ATATAAGGTAAATGAGAGACGAGGGTAGACTTGTGATAC TAT12 ATAATAAGGTAT ACAGAGAACGGAAGCAGACTTATGATATAA T12 ATATATA ATATAAGGTAAATGAGAGACGAGGGTAGACTTGTGAT T10 ATATA ATATAAGGTAAATGAGAGACGAGGGTAGACTTGTGATATA T09 ATATAT AACATATAAGGTAAATAGAAGATGGAAGCGAATTTGTGAC T16 ATATATATAAACAAC AGACATATAAGGTAAGTAAGAGATGAAGGTAAATTT T09 ATATAT AACATATAAGGTAAATAGAAGATGGAAGCGAATTTGTGAT TCT05 ATAC TATAATAAACAACAAAATGTGTAAGGTAGATAAGAAGTGAAGGTAAATT ATATTTT A TACATAATAAACAATGAGATATATAAGGTGAAT CGAAAGTGAAATAT12 ATATTAT AACAACAAAACGTATAAGGTAAGTGAAAAATGGA TGTAAAT12 A ATAATCACATAATAAATGATAGAACGTATAAG ATAGATGAAAAT10 COIII gRNA Sequences cont. 
191 497 497 499 497 497 499 499 499 499 501 499 522 539 539 539 539 539 539 539 539 535 539 539 539 535 565 564 564 563 565 563 564 563 592 594 592 592 592 592 592 592 592 594 593 592 593 593 593 593 593 596 593 603 593 66,677 5,941 936 371 139 6,621 3,602 295 185 185 107 288 145,031 4,441 1,630 1,118 1,052 849 847 653 599 482 448 353 1,604 188 181 77 3,592 640 336 420 144 ATACAT AATACCAATAGAAGACAGAATTGTAGTCATGTGATA TTCAT14 ATACAT AATACCAATAGAAGACAGAATCGTAGTCATGTGATA TTCAT13 ATAT AAAATACCAATAAAGAACAGAATTATAGTTGTATGATAGATAA AT12 ATACAT AATACCAATAGAAGACAGAATTGTAGTCATGTGAT T11 ATACAT AATACCAATAGAAGACAAAATTGTAGTCATGTGATA TTCATAT11 ATATT AAAATACCAATAGAAAATGAGACTGTGATTATATGATGAATA T14* ATATT AAAATACCAATAGAAAATGAGACTGTGATTATATGATGAAT T14 ATAT AAAATACCAATAAAGAACAGAATTATAGTTACATGATAGATAATA T10 ATATT AAAATACCAATAGAAAATGAGACTGTGATTATATGAT T15 AAA AAAAAATACCAGTAGAAGATAAGACCATAATCATGTGATAA TTTT ATATT AAAATACCAATAGAAAATGAGACTGTGATTATATGATG T14 AAATA ATCAACAAATTAAATGAATCTAAAAGGTATCAGTGAAAA T14 ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAATTTAGAGAATATT T10 ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAATTTAGAGAATATTAAT TTCT08 ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAATTTAGAGAATATTA T15 ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAATTTAGAA TATTAATAG TTTT ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAATTTAGAGAAT T14* ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAGTTTAGAGAATATT T10 ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAATTTAGAGAATATTAATA T14 ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAATTTAGAGAATATTAATAG TTTT ATAT ATAAAATGTATTTGTCGACGAGTTAAATGGAT GTAGAAGAT12 ATA TAAATAAAATGTATTTGTCAATGGATTAGATAAATTTAGAGAATATT T11 ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAATTT TAGAATAT12 ATA TAAATAAAATGTATTTGTCAATGGATTAGATGAATTTAGAAAATATT T09 ATAT ATAAAATGTATTTGTCGACGAGTTAAATGGAT GTAGAAGAT15 ATATAT AAAATTAACAAGTGAATCACTAACAGATAGATAGAATG ATAT12* ATATT AAATTAACAGATAAGCCACTGACAAATAGATAGAGTG ATAT12 ATATAT AAATTAACAAATAGACTACTAATAAGTGAGTAAGATGTATT AATTTATATATTTT* ATATAT AATTAACAAGTAGATCACTGACAAATAGATGAGATGTAT AAT13* ATATAT AAAATTAACAGATAGATCATTAACGAGTAGATAAAGTG 
ATAT11 ATATAT AATTAACAAGTAGATCACTGACAAATAGATGAGATGTATTT T08 ATATT AAATTAACAAATAAACTATTAATGGATGAGTGAGATGTA ATTAT15 ATATAT AATTAACAAGTAGATCACTGACAAATAGATGAGATGT T16* 99,540 27,720 1,442 599 504 309 238 176 150 147 128 114 84,835 6,897 4,207 1,803 1,137 722 486 456 206 ATATAT AAACCTAAATCAAGAACATAGAACAGAGAGATTAGTGAGTAAATT T12 ACAT AAAAACCTAAACTGAGAATACGAGACAAAGAAATTAGTGA TTAAAT12 ATATAT AAACCTAAATCAAGAACATAGAACAGAGAGATTAGTGAGTAAATTA AT11 ATATAT AAACCTAAATCAAGAACATAGAACAGAGAGATTAGTGAGTAA T13 ATATAT AAACCTAAATCAAGAACATAGAACAGAGAGATTAGTGAG AAAAT13 ATATAT AAACCTAAATCAAGAACATAGAACAGAGAGATTAG GGAGTAAAT14 ATATAT AAACCTAAATCAAGAACATAGAACAGAGAGATTAGCGAGTAAATT T09 ATATAT AAACCTAAATCAAGAACATAGAACAGAGAGATTAGTGAGTAAA AT11 ATATAT AAACCTAAATCAAGAACATAGAACAGAGAGATTAGTGAGTAAATTATT T10 ACAT AAAAACCTAAACTGAGAATACGAGACAAAGAAATTAG GGATTAAAT11 ACATT AAAACCTAAACTGAGAATACAAGACAGAGAGATTAA GGATTAAAT13 ATATAT AAACCTAAATCAAGAACATAAAACAGAGAGATTAGTGAGTAAATT T10 ACAAT AAAACCTAAACCGAGAACATAGAGCAGAGAAGTTAGTAGATAA T13 ACATT AAAACCTAAACTGAGAATACAAGACAGAGAGATTAATGA TTAAAT14 ACAAT AAAACCTAAACCGAGAACATAGAGCAGAGAAGTTAGTAGATA T13 ATATAT AAAACCTAAATCAGAGACGCAGAATAGAGAGATTGATA TAT10* ACAAT AAAACCTAAACCGAGAACATAGAGCAGAGAAGTTAGTAGAT T12 A AAAAAAACCTAAACCGAGAACATAGAGCAGAGAAGTTAGTAGATAA T15 ACATT AAAACCTAAACTGAGAATACAAGACAGAGAGATTAAT T12 ACAAAAAAACCTAAACCGAGAACATAGAGCAGAGAAGTTAGTAGATAA T05 ATAT AAAACCTAAACCAAAGATATGAGACAGAGAGATTAGTGA TATGT13 3' 629 Reads 85,916 ATATA GAACTCAATCATAATATGAAGCAATAACAATGAAGAGATTTAA T12 COIII gRNA Sequences cont. 
192 461 461 456 462 461 457 458 454 462 460 461 483 491 488 490 498 495 491 487 486 504 491 502 491 504 528 528 524 525 528 523 526 527 548 555 547 551 554 558 548 550 545 558 558 548 551 555 552 556 553 551 557 551 555 5' 585 629 629 631 629 629 629 629 629 622 629 634 628 624 630 628 630 628 631 630 647 647 643 647 643 643 669 669 676 669 679 691 682 717 722 706 715 722 726 715 722 730 715 715 717 9,106 1,991 1,447 276 269 263 221 150 145 131 17,313 8,519 3,564 2,850 628 582 231 135 112 10,315 7,606 250 637 173 150 482 268 60 42 25 649 39 758 374 354 160 1,329 1,181 911 738 129 107 102 90 ATATA GAACTCAATCATAATATGAAGCAATAACAATGAAGAGATTT T11 ATATA GAACTCAATCATAATATGAAGCAATAACAATGAAGAGATTTA T12 ATAT ATAAACTCAATCATAGTATAAGATGACGACAATGAGAAGATTTAA T13 ATATA GAACTCAATCATAATATGAAGCAATAACAATGAAAAGATTTAA T12 ATATA GAACTCAATCATAATATGAAGCAATAACAATGAAGA TTTAAT14 ATATA GAACTCAATCATAATATGAAGCAATAACAATAAGAGA TTTAAT07 ATATA GAACTCAATCATAATATGAAGCAATAACAATGAAGAGATT AAT13 ATATA GAACTCAATCATAATATGAAGCAATAACAATGAA TAGATTTAA T06 ATATAT ATCATAATACAAGGCAATGACGACGAGAAGATTTAGATTAA T11 ATATA GAACTCAATCATAATATGAAGCAATAACAATGAAGAGA ATTAATTAT06 AT ATAACAAACTTAATCGTAATATGAAACAACGA GAATGAGAAAAT13 ATATAT AACTCAATCATAATACGAGATGATAATGACGAAGAGATTTAA T13 ATATA TAATCATAATACAGAACGATGGCAGTGAAGAGATTTAGATTAA T12 ATATATAT CAAACTCAATTGTAGTACGAGACAATAATGATGAGAAGATTT T10 ATATAT AACTCAATCATAATACGAGATGATAATGACGAAGAGATTT T08 ATATATAT CAAACTCAATTGTAGTACGAGACAATAATGATGAGAAGATTTAA T11 ATATAT AACTCAATCATAATACGAGATGATAATGACGAAGAGATTTA T11 ATATAT ACAAACTCAGTCATAGTATAAGACAGTGATAATGAGAGAATTT T12 ATATATAT CAAACTCAATTGTAGTACGAGACAATAATGATGAGAAG T13 ATACAAAAAC AAAAACCAAACGATGAACTTGATTGTAGTATAAGATAATA T13 ATACAAAAAC AAAAACCAAACGATGAACTTGATTGTAGTATAAGATAAT T13 ATACAAAAACAAAAT ACCGAGCGACAGATTTGATTGTAGTATAAGATAATA T11 ATACAAAAAC AAAAACCAAACGATGAACTTGATTGTAGTATAAGATA T21 ATACAAAAACAAAAT ACCGAGCGACAAGTTTGATTATAGTATAAGATAATA T13* ATACAAAAACAAAAT ACCGAGCGACAAGTTTGATTATAGTATAAGATAAT T15* ATATACAATGCAAACTTA 
CATGACTGGTTTTATAGAGATGAGAGATTAA T14 ATATACAATGCAAACTTA CATGACTGGTTTTATAGAGATGAGAGATTAA T15 ATATACAATGT ACTCTCATAATTGGTTTCATAGAGATAGAAGATTAA T15 ATATACAATGCAAACTTA CATGACTGGTTTTATAGAGATGAGAGATT T08* ATATA CAAACTCTTATAACTGGTTTTACGAGAATGAGAAATTAAAT T15 ATAT TAAATAACAATGCGAATTTTCATAGTTGGTT CATAGATACAAT12 AT ATGTGAACTCTTATAACTGGTTTTGTAG TGATGATAGAT15 ATAAT AGAACACCACAGCTTAATGTAGTAGATGGCAGTGTAAATTTTT T10 ATAT AATCAAAAACACCGTAACTTGATGTAGTAGATAGTAGTGTAAATTTTT T07 ATATAGAACCAAAAACAG TGCAATTTAGTGTGATAGATGATAGTGTAAATTTTT T07* ATATAGAACCAAT AACATCGCGACTTAGTGTGATAAGTAATAGTGTAAATTTTT T07 ATATAT AACCAAAAACACTGCGATTTGATGTAATAAGTGAC TGTGTAAAT11 ATAT ATAGAACCAAGAACACCATAGTTTGATGTGATAG TGATAGTGTAAAT14* ATATAGAACCAT AACACCATGATTTGATGTAGTAAATGATGATGTAAATTTTT T09 ATAT AACCAAAAACACTGTAACTTGATGTAGTAGATAGTAGTGTAAATTTTT T08* ATAT TAAAATAGAACTAAAGACACTGTAACTTAGTG AGTAAATGATATTAAT10* ATATAGAACCAT AACACCATGATTTGATGTAGTAAATGATGATGTAAAT AT08 ATATAGAACCAT AACACCATGATTTGATGTAGTAAATGATGATGTAA T11 ATAAATT AAAACACCATAATTTGATGTGATAAGTAATGATGTAAAT AT17 3' 753 753 Reads 31,331 8,037 ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATTGAGAATATT T10 ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATTGAGAATAT AT14* COIII gRNA Sequences cont. 
193 587 586 585 585 592 591 588 594 580 590 603 585 580 587 587 585 586 587 591 604 605 604 607 604 605 635 635 635 637 633 659 653 669 669 669 669 684 689 669 669 695 675 677 675 5' 706 707 753 753 748 753 753 748 752 753 753 753 753 753 750 748 746 753 765 767 765 765 767 781 781 781 779 778 790 790 815 815 815 815 815 815 815 814 815 814 814 829 829 829 842 842 842 854 855 855 854 890 891 882 889 889 891 3' 918 927 929 6,744 3,791 2,726 1,214 1,124 977 848 595 206 182 154 131 1,457 688 304 416 2,739 1,193 2,604 456 284 495 159 40 19 19 153 89 ATATAT AAATGTAATAGATCCAATGAAGGTAAGATAGAACTGAGAATATT T09 ATATAT AAATGTAATAGATCCAATGAAGGTAAGATAGAACTGAGAATAT AATTAT07 ATA TAATAAATCCAATGAAGATAAAGTAGAGTCAGAGATATTATGAT AT13 ATATAT AAATGTAATAGATTCAATGAAGGTAAGATAGAACTGAGAATATT T09 ATATAT AAATGTAATAGATTCAATGAAGGTAAGATAGAACTGAGAATAT AAT14 ATATA TAATAAATCCAATGAAGATAAAGTAGAGTCAGAGATATTATGATTT T10 ATATAT AATGTAATAAATCTAATAGAGATAAGATAGAACTGAGGATAT AT12 ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATTGAGAAT T09* ATATAT AAATGTAATAGATCTGATAAAAGTGAGGTAGAATTGAGAATAT AT13 ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATTGAGA TTTAT08 ATATAT AAATGTAATAGATCCAATGAAGGTAAGATAGAACTGAGAAT T14 ATATAT AAATGTAATAGATCTGATGAAAGTGAGGTAGAATT TAGAATATAT08 ATATAT TGTAATAAATCCGATAGAAGTAAGATAGAACTGAGA TAT15 ATATA TAATAAATCCAATAAGAATAAGATGGAACTGAAGA GATTATGAT11 AT ATAAATCCAATAAAAGTGAAATAGAATC TGAGTTTT ATAT AAATGTAATAAATTCAGTAGAAGTAAGATAGAATTGAAGATAT ATATAGT26 ATATAT AACATGCATAAGATGTAGTGAATCTAGTAAGAGTAAGATAG T14 ATATAT AAAACATGCATAGAGTGTAGTAAGTTCAGTGAAAGTGA TATAGT09 ATATAT AACATGCATAAGATGTAGTAGATTCAGTGAAGATAAGATA T14 ATATAT AACATGCATAAGATGTAGTAGATTCAGTGAAGATAAGAT T10 ATATAT AAAACATGCATAGAGTGTAGTAAGTTCAGTGAAAGTGA TATAGTTTT ATAT ACAAAACACCTAAGAAGATGTGCGTAGAATGTGATAGATTTAAT T16 ATAT ACAAAACACCTAAGAAGATGTGCGTAGAATGTGATAGATTTAA AT15 ATAT ACAAAACACCTAAGAAGATGTGCGTAGAATGTGATAGATTT T23 AAAACACCTAAGAAGATGTGCGTAGAATGTGATAGATTTAATA T06 AAACACCTAGAGAAACGTGCATGAGAT TGTGATAGATTAAT07 ATATAT TAAACAACAACAGAACGTCTAAGAGAGTATGTATA TGATGTAAT15 
ATATAT TAAACAACAGCAGAACATCTAAGAGAGTATGTATA TGATGTAAT12 111,116 25,628 ATAT AAATTAAACAGACGTATGAAGCAAGTAGATAGTGATAAGATACT AT14 ATAT AAATTAAACAGACGTATGAAGCAAGTAGATAGTGATAAGATACTT T10 323 315 281 230 188 138 136 1,459 527 24 5 92 116 386 156 66 5 1 1 9,675 138 109 2,508 202 127 Reads 1,822 600 222 ATAT AAATTAAACAGACGTATGAAGCAAGTAGATAGTGATAAGAT T14 ATAT AAATTAAACAGACGTATGAAGCAAGTAGATAGTGATAA T12 ATAT AAATTAAACAAACGTATGAAGCAAGTAGATAGTGATAAGATACT AT11 ATAT AAATTAAACAGACGTATGAAGCAAGTAGATAGCGATAAGATACT AT15 ATAT AAATTAAACAGACGTATGAAGCAAGTAGATAGTGATAAG TTACTAT05 ACATAT AATTAAACAAACGTATAGAGCAAGTAAGTAGCAGTGAAATATTT T11 ATAT AAATTAAACAGACGTATGAAGCAAGTAGATAGTGATAAGATAC AAT12 ATAT AATTAAACAAACGTATAAGACAAGTAGATGGCAGTGAAATAT AT11 ATAT AATTAAACAAACGTATAAGACAAGTAGATGGCAGTGAAATATTT T13 ATAACAAAACGTGA TATCCATATACAGAGAATTAGATAGATGTATAAA T06 ATATA TATCCATACACAGAAAATTAGATGAACGTGTAAAATGAGTAA TTAT13 ATATA TATCCATACACAGAAGATTAAATAGACGTGTAAAGTGAATAA T15 ATAT AAAACAAAACGTGTATTCATATATGAGAGATTAAGTAAATG AT15 AT AAAACAAAACGTGTATCTGTGTATAGGGAGTTAAATGGATGTATA TAT14 ATAT AAAACAAAACGTGTATCTGTGTATAGGGAGTTAAATGGATGTAT T20 ATAT ATACAATACAAAGAAGCGAAACGTGTATCTATATATGGAAA T06 ATATAT AACACAATACAGAGAGATGAAACGTGTA GATGTAT23 ATATAT AACACAATACAGAGAGATGAAACGTGT T22 AT ACATAATACAAAGAGACAGAACGTG ATATCT16 ATATAT AATCAAACTAAATTGACGAGATGTTGATGTAGATAGATATAAT AT08 ATAT AAATCAAACTAAATTAGCAAGGTGTCAGTGTAAATGAGCATGATAT T13 ATACAA TAAATCAACAAGATGTCGATATAGATGGATATGATATGAGAGAAT T15 AT ATTAAACTAAATCAATAAAGTGTCGATGTAGATGAATGTGATATA TAT11 AT ATTAAACTAAATCAATAAAGTGTCGATGTAGATGAATGTGATAT T11 ATAT AAATCAAACTAAATTAGCAAGGTGTCAGTGTAAATGAGCATGATAT TAT13 ATATAA ATCAGAATAAACAGATCGCAATAGAGAGAATTAAGTTAA TAT14 AA ATATAAAACATCAAGATAAATGGATTGTGATAGAGAAAGTTAAATT T11 ATATATAAAACATCAGAATAGACAAATCGTAATAGAGAAAGTTAAGTTAA T14 COIII gRNA Sequences cont. 
194 706 707 701 706 707 699 707 713 707 715 713 719 715 714 719 707 723 728 724 725 728 736 737 739 735 750 754 754 772 771 775 778 772 772 777 771 773 773 771 796 788 788 802 798 799 814 828 829 830 848 846 838 845 846 846 5' 880 882 880 880 882 881 907 909 913 905 905 920 935 940 935 939 942 951 951 963 965 921 929 929 947 950 946 944 944 952 977 977 977 977 983 981 981 1003 1003 209 159 28 787 126 558 3,472 1,321 162 ATAT AATATCAAAATAAACAGATCGTAGTAAAAGAAGTTAGATTAA T12 ATATATAAAACATCAGAATAGACAAATCGTAATAGAGAAAGTTAAGTT T13* ATATATAAAACATCAGAATAGACAAATCGTAATAGAGAAAGTTAAGTTA TTTT ATATAA TATACACACAGATACATAATACGTAGAATGTTAAGATAAGT T16 ATATA ATTACACACACAGATACGTGATATATAGAATGTTAAGGTAA TATAAT10 ATAC ATACACACAAATATATAACATATAGAGCATTGAG TTAGATAAT15 ATATAA ACACACAAATATATGGCATATAGAGCATTGAAGTAGATAA T13* ATATAA ACACACAAATATATGACATATAGAGCATTGAAGTAGATAA T15 ATAG AAAATTACACACATGAATACATAGTACATAGAA GATTGATATAT13 2,920 522 354 28 ATATA AATCAACAACTGAAAAGATATCAATGAGATTGTACATGTAAAT T15 ATATA AATCAACAACTAAGAAGACACTGATAGAGTTATATGTG ATTAAT14* ATATA AATCAACAACTGAAAAGATATCAATGAGATTGTACATGTAAAT T08 ATATA AATCAACAACTGAAAAGATATCAATGAGATTGTACATGT T09 1,160 127 25 111 68 ATAT AACTAATCAACAGCTAAGAGAACGTCAATGAGATTATGTG ATTAAT14 AT ACTAATCAACAACTAGAAGAATATCAGTGA TATACATGTAAT10 AT ACTAATCGACAACTAGAGGGACATCAGTGA TTTATACGTAT15 ATA TACAAACTACCAATATAAGTTAACTGATCGGTAATTAAGG TTATAT15 ATATA TACAAACTACCGATATAAGTTAACTGATTGATAATTAA TGTTCT14 195 C) C-Rich Region 3 3' 62 62 64 62 88 88 87 86 88 88 88 88 88 88 88 77 88 88 88 88 88 88 88 88 88 88 89 89 77 83 88 89 88 77 86 89 83 118 123 118 123 121 124 140 140 142 166 166 166 166 162 166 167 161 168 167 196 199 200 200 200 Reads 18 3 1 1 140,541 34,162 3,434 3,016 2,553 1,270 902 693 648 632 468 213 212 182 181 177 170 155 151 135 133 106 598 526 437 285 255 237 152 112 70 53 31 CR3 gRNA Sequences ATATGT ACAACAAAACCGAGCAATCAGATAT AGAGTGAAAT09 ATATAT ACAACAAAACTGAACAATCAAATGT AGTGTGAT09 ATCGCAAGGTCGT GGACAACAAAACTGAACAATCAAAT T11 ATATAT ACGACAAAACTGAACAATCAAATGT 
AGTGTGT08* ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATGAT AT14 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATGATT T12 ATATAATT AAATGTACAGACAAATGATAGAGAGACGATGAGATTAAGT TATAT12 AAAAATT AATGTACAAATAAACGATAGAGAGACAGTGAGATTA TGAT13 ATAT AAAATGTACAGACGAGCAGTGAAGAGACAGTGAGATTA TACAT11 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATG T11 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGAT T14 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTAAAATTAGATGAT AT12 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATT T09 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTA TATGATAATATTTT ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATGA AATAT14 ATATAT AAAATGTACATACGAACGATAAAAGGGCAGTGAAATTAGATAATT TT ATATAT AAAATGTACAAACGGACAATGAGAGAACAGCGAAATTAGATGAT ATCT07 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAG TTGTTAATAT11 ATATAT AAAATGTACAAACGGACAATGAGAAAACAGTGAAATTAGATGAT ATCT16 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGA AGATATAT11 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAA T19 ATATAT AAAATGTACAAACGAACAATGAGAGAACAGTGAAATTAGATGAT AT13 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTAAAATTAGATGATT T12 ATATAT AAAATGTACAAATGGACAATGAGAGAACAGTGAAATTAGATGAT AATAT07 ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGA T12 ATATAT AAAATGTACAAACAGACAATGAGAGAACAGTGAAATTAGATGAT AATAT08 ATAT AGAAATGTACAAACGAGCAATAAGGGAACAGTGAAATTAGATGATT TCT07 ATAT AGAAATGTACAAACGAGCAATAAGGGAACAGTGAAATTAGATGAT AAT10* ATATATAAAATGTACAT ACGAACGATAAAAGGGCAGTGAAATTAGATAATT T09* ATATAA GTACAAACAAACAGTGAGAAGATAACGAGACTGAGTA TATATTTT* ATAT AAAATGTACAAATAAGCAGTAGAAGAGCAGTGAAATTGA TGATAT09 ATAT AGAAATGTACAAACGAGCAATAAGGGAACAGTGAAATTAGATG TTTAAT09 ATAT AAAATGTACAGACAAGCAGTGAAGAGACAGTGAGATTA TACAGTTTT ATATATAAAATGTACAT ACGAACGATAAAAGGGCAGTGAAATTAGATAATTA T12 AAAATC AATGTACAAATAAGCGATAAAGAGACAGTGAGATCA TGAT15 ATAT AGAAATGTACAAACGAGCAATAAGGGAACAGTGAAATTAGAT T12 ATATAA GTACAAACAGACAGTGAGAAGATAACGAGACTGAGTA TATTTAT17 573 1,321 685 251 243 120 ATATAT AATCACAAACAAATAGAAAATGAGAGAGGTGTATGA TACTAT15 ATATAT AAACAAATCACGAACGAGTAGAAAAT TGGAAGAATGTAT12 ATATAT 
AATCACAAACAAATAGAAAATGAAAGAGGTGTATGA TACTATTTT ATATT AAACAAATCACAAATGAGTAGAGAACGAGA TTGATGTATAT16 ATAC ATAAATCACAAACGAATAAGGGGCAGAAGAG TTGTATGATAT05 ATATAT AAAACAAATCATAAACGAATAAAGAGTGAGGAAGGTGTATAAAT TTT 15,770 692 442 27,885 212 139 114 309 421 370 348 110 53 1,492 136 2,176 114 98 A TAGATAACAAACATAAGAGCAAGTCACGAG GTAGATATTGATAT14 A TAGATAACAAACATAAGAGCAAGTCACGAG ATAGATATTGAT12 AT ATTAAATAACAAACATAGAAATAGATCACAGATAGATGA TTGATGAAT06 ATAGAAATCCAATAAGAGACAGAAGCTAGATAATGAGTATAAGA TAT13 ATAGAAATCCAATAAGAGACAGAAGCTAGATAATGAGTATAA T12 ATAGAAATCCAATAAGAGACAGAAGCTAGATAATGAGTATA T17 ATAT ACAAAAATCCAATGAAAAATAAAGACTGAGTGATGGATG CAAT15 ATATAT AAATCCAATAAGAAATGAAAGCTAGATAGTGAGTATAAG T16 ATATAT ACAAAAATCCAATGAAAAATAAAGACTGAGTGATGGATGTAA TTATTTAT14 ATAT AACAAAAATCTAATAGAGAACAGAAACTAAGTAATGAG ATAT09 ATATTAT AATCCAATAGAAAGCAGAGACTAAATGATAGATGTAAA T09* ATATAT AAACAAAAATTCAATAGAAAACGAGAACTGAGTAAT TGATATGT05 ATATAT AACAAAAATCCGATAAAAGATGGAAACTAAGTGATAGATATA T13 ATAT ATACAACAATAAACTCGTATTAAGTGAGAGATGAAGATTTAAT T13 ATAT TAAACACAACGATAGATCTATATTAAGTAGAAGATAGAAATTTA T16 ATAT ATAAACACAACAGTAGACTCGTATTAGATAGAAGATAGA TATCAT15 ATAT ATAAACACGACAATAGATTCATATTAAATAGAAGATAGAGATTTA T12 ATAT ATAAACACGACAATAGATTCATATTAAATAGAAGATAGAGATTT T07* 196 5' 34 34 36 34 41 40 48 51 51 47 48 41 52 51 46 40 41 50 41 49 55 41 40 41 56 41 40 41 40 47 50 47 51 39 51 48 47 78 93 78 89 86 76 105 105 98 122 124 125 127 123 124 129 123 132 125 154 156 162 156 157 5' 190 190 192 226 231 237 241 234 293 268 3' 230 230 232 277 265 265 279 265 308 312 Reads 2 1,243 279 19 2 1 936 528 2 125 CR3 gRNA Sequences cont. 
ATATACA ACATATCAAGTGGTAAGATAAGAGAAGAAAGTAGATGTAAT T06 ATATACA ACATATCAAGTGATAAGATAAGAGAAGAAAGTAGATGTAAT T12 ATAGATA ACACATATCAGATGATAAGGTAAAGAGAGAGAATAGATATA T13* ATAT AATAACAATATAAACGAACTGGATGATGTATAGTACTTTGATATATA AT12 ATATAACAATAC AAATGAGCTAGATAATGGATGATAGTTTGATAT T12 TATAAAATAACAATAC AAGTGAATTAGATAATGGATGATG TTTCAAT15 AT AGAATAACAATATAAACGAACTAGATGATGGATAGTA TCT17 ATATAACAATAC AAATGAACTGAGTAATGGATAGTAGTTTGA CAAT09 ATATAT AAATTATTTGCATACT GAGATTAGTGATAGTAAAGTGATTAAT13 AT ATAAAAATTATTTGCATGCTTAAGTGAGTTATAGAGGTAATGATA AAT07 197 D) C-Rich Region 4 3' 64 64 64 62 103 103 98 103 103 103 134 134 134 139 134 142 138 138 171 171 192 200 196 200 196 196 232 225 228 228 232 227 232 227 261 261 261 261 258 259 259 258 258 258 300 284 295 295 295 298 295 297 301 295 320 320 3' 351 5' 25 25 25 25 48 48 65 51 59 52 87 89 88 93 90 93 90 91 127 128 154 174 166 171 169 171 186 186 186 186 186 186 189 189 213 213 216 222 213 213 213 213 215 223 251 242 251 250 253 253 244 250 251 255 280 279 5' 307 Reads 121 596 308 83 296 175 108 18 11 10 7,793 426 208 200 143 2,052 584 139 CR4 gRNA Sequences ATAAT AAAAATGCACAACTAGAATTGAAGTAAAGTGATGATA TATAT14 ATAAT AAAAATGCACAACTAGAATTGAAGTAAAATGATGATA TATAT14* ATATT AAAAATGCACAACTAGAATTGAAATAAAGTGATGGTA TATAT13 ATATAATT AAATGCACAGCCAAAGTTAAGGTAGAATAGTGATA TAT14 ATA TATATAAAACACAGACATACTAAGTAAGAGAAAGAGAGGTGTATGATT T12 *ATA TACATAAAACACAGACATACTAAGTAAGAAAAAGAGAGATGTATGATT T12* ATATAT TAAAACACAAATACATCAGATAGAAAGAGA TTGAGTGTATAAT17 ATA TACATAAAACACAGACATACTAAGTAAGAAAAAGAGAGATGTATG T21 ATA TACATAAAACACAGACATACTAAGTAAGAAAAAGAGAGAT T15 ATTATA TACATAAAACACAGACATACTAAGTAAGAAAAAGAGAGATGTAT T15 ATATAT AAACAACAATAGAGTATATCATAGACTGTATATGAAGCATAAAT T10 ATATATAT AAACAACAATAGAGTATATCATAGACTGTATATGAAGCATAA CT08 T ATATAT AAACAACAATAGAGTATATCATAGACTGTATATGAAGCATAAA AT13 AAACAAAACAACAGTGAAATATACCGTAGATTGTATGTGAAAT TATATAT06 ATATATAT AAACAACAATAGAGTATATCATAGACTGTATATGAAGCATA T13 A AAAAAACAAAACAACAGTGAAATATATCGTAGATTGTATGTGAAAT TAT05 ATAT 
AACAAAACAACGATAAGATGTATCATGAGCTGTATATGAGATATA T25 ATAT AACAAAACAACGATAAGATGTATCATGAGCTGTATATGAGATAT T13 6,971 204 3,603 6 5 3 1 1 31 22 7 4 166 152 25 12 14,358 264 114 112 5,062 1,370 1,074 490 340 183 23 9 77 64 25 22 20 17 16 11 643 93 Reads 725 ATAT ATATACTCACACAAATAGATGACAGAGATAGAAAGTAAGATGATA TAT15 ATAT ATATACTCACACAAATAGATGACAGAGATAGAAAGTAAGATGAT T13 ATATTA AACTATAACAAGGCAGATAGAACGTACCTATATAGATAA T14 AT AAATAAACAACTATAATAAGACAAGTG CGATGTACT10 ATATA AAACAACTATAACAAAATAGATAGAATGTGC GTAT06 AAATAAACAACTATAATGAAGTGAGTGAAG GTACGT12 ATATA AAACAACTATAACAAAATAGATAGAATG GGCGTAT06 A AAACAACTATAATGAAGCAGATAGAA GGTACGTATAT17 ATAAAAAATCACAGCCTAAAATGACGAGAGAAAGTAAATGGTTATA TAGAT05 ATATAT ATCACAACCTAAGATAACGAAAAAGAGCAGATAATTGTA TATTTT ATATAT AAAATCACAACTTAGAATGACGAAAGAGAATAGATAGTTGTA TATTGT22 ATATAT AAAATCACAACTTAGAATGACGAAAGAGAATAGATAGTT TTAAAAAAT16 ATAAAAAATCACAGCCTAAAATGACGAGAGAAAGTAAATGGTTATA T16 ATAT AAATCACAACCTGAAATAGCAGAGAAGAGTAAATGATTATA T13 ATAAAAAATCACAGCCTAAAATGACGAGAGAAAGTAAATGGTT TTT* ATATAT AAATCACAACCTGAAATAGCAGAGAAGAGTAAATGATT T09* ATAT ATAAACTATACAATTGAAGCACTGATAGAAGGTTGTGATTTAA T12 ATAT ATAAACTATACAGTCAAGACACTGATGAGAGATCGTGATTTAA T13 ATAT ATAAACTATACAATTGAAGCACTGATAGAAGGTTGTGATTT TTT ATAT ATAAACTATACAATTGAAGCACTGATAGAAGGTTG ATTTAATAT08 ATATAT AACTATACAGTCGAGACATCAATGAGAGATTATGACTTAA T13 CCCTCGGCCGCAGCGATATA AAACTATACAATTAGAGCATCAGTAGAAGATTGTGACTTAA T14 ATATA AAACTATACAATTAGAGCGTCAGTAGAAGATTGTGACTTAA T15* ATATAT AACTATACAATTGAGATATCAGTGAAGAATTGTGATTTAA T11 ATATAT AACTATACAGTCGAGACATCAATGAGAGATTATGACTT T14 ATATAT AACTATACAGTCGAGACATCAATGAGAGATT T05 AAAAAAAAAAATAAACAGAATTGTGACGTCATAAAGAAGATAGATTATAT T11 AAAAAAA AAAATTATAACGTCATAGAAGAGATAGACTATATGATTAA T09 ATAT AGAAAATAAACAGAATTGTGACGTCATAAAGAAGATAGATTATAT TTTT AT AGAAAATAAACAGAATTGTGACGTCATAAAGAAGATAGATTATATA T06 ATAT AGAAAATAAACAGAATTGTGACGTCATAAAGAAGATAGATTAT T16 AAAAAAAAATAAACAGAATTGTGACGTCATAAAGAAGATAGATTAT TTT AT AGAAAATAAACAGAATTGTGACGTCATAAAGAAGATAGATTATATAGTT T12 
AAAAAAAATAAACAGAATTGTGACGTCATAAAGAAGATAGATTATATA T12 AAAAAAAAAAAATAAACAGAATTGTGACGTCATAAAGAAGATAGATTATAT T05 ATAT AGAAAATAAACAGAATTGTGACGTCATAAAGAAGATAGATT TTTT AAAA AAAAAACACAAGGCGAGATAGAGAAAAGAGATAAATAAGAT GT10 AAAAA AAAAAACACAAGGCGAGATAGAGAAAAGAGATAAATAAGATT T05 CR4 gRNA Sequences cont. ATAAAAT ACCAAACAGATCAGATAAAGGCAGTGATATAGAGAATATAAAAT T11 198 353 351 351 354 352 353 352 352 352 388 390 388 417 417 415 457 456 457 457 457 458 456 443 457 458 489 489 487 486 487 487 497 494 497 494 519 524 524 524 524 554 547 547 554 556 554 575 584 596 568 675 259 214 140 485 374 134 42 38 863 2,308 255 411 447 423 4,211 620 392 406 185 227 186 97 34 25 ATAT AAACCAAACAAGCTGAATAAGAACAGTGATATAGAAGATATA TAT14 AAAAT ACCAAACAGATCAGATAAAGGCAGTGATATAGAGAATATAAAATA T12 ATAAAAT ACCAAACAGATCAGATAAAGGCAGTGATATAGAGAATATAAA T12 AT AAAACCAGACAAACTGAATGAAGACAGTAACGTAGAAGATAT T10 TTGTTTGGTTGATTAAAT AACCAAACAAGTCGAGTAGAGACAGTGATATAAAAGGTATAAAAT T09* ATAT AAACCAAACAGACCAAGTGAAGATGGCAGTATAAGAGATATGA TATAT11 AAAT AACCAAACAAGTCGAGTAGAGACAGTGATATAAAAGGTATAAA T16* AAAT AACCAAACAAGTCGAGTAGAGACAGTGATATAAAAGGTATAA T12 AAAT AACCAAACAAGTCGAGTAGAGACAGTGATATAAAAGGTAT T13 ATAT ATAACACAAAACATAACAGAGAGTATAGAGAGAAATTGAATGA T15 ATAT AAATAACACAAAATACGACGAGAAATATAAGAGAGAATTGAGTAAATT T05* ATAT ATAACACAAAACATAACAGAGAGTATAGAGAGAAATTGAATGA T15* AAAA AAAAAACAACATAGAAAGTGAATCAGAGAATGACATAAGATATA T11 A AAAAAACAACATAGAAAATAAGTCAGAGAGTAATATGAGAT TGTTATAAT05 ATATAT AAAACAACATAAGAGATGAATCAAGAGGTAATATAAGATATA T17 AAAAAAAAAAACAAAGACAAAGAAATCACTCAGAATAGAAGATGGTATAA T14 AAAAAAAAAACAAAGACAAAGAAATCACTCAGAATAGAAGATGGTATA T13 AAAAAAAAAAACAAAGACAAAGAAATCACTCAGAATAGAAGATGGTAT T12 AAAAAAAAAAACAAAAACAGAAAAACTGTCTAAGATAGAGAATGATATA T15 AAAAAAAAAAACAAAGACAAAGAAATCACTCAGAATAGAAGATGG GATAAAT08 AA AAAAAAAAAAAACAAAGACAAAGAAATCACTCAGAATAGAAGATGATATAA T08* AAAAAAAAAACAAAAACAAGAAGACTATCTGAGGTAGAAAATGATATA T14 ATATAAACAT AAACAAAGAGACCATCCGAAATAGAGAAT T15 AAAAAAAAAAACAAAGACAAAGAAATCACTCAGAATAGAAGATGATATA T09 A 
AAAAAAAAAAAACAAAGACAAAGAAATCACTCAGAATAGAAGATGATAT T09* 3,197 238 730 246 208 149 1,153 282 2,511 306 ATAACAACCACAGATAGAAATGAACGTAAATGAGAGAGAAAGATGAAA T13 ATAACAACCACAGATAGAAATGAACGTAAATGAGAGAGAAAGATGAAAGT T11 ATAA AACAACCACAAGTAGAAATAGACATAAATGAGAGAGAAAGATAAAA T12 ATAT ATAACCACAAATAGAGACAAATGTAAGTAAAGAGAGAAAGTAAAA T19 ATAA AACAACCACAAGTAGAAATAGACATAAATGAGAGAGAAAGATAAA T06* ATAA AACAACCACAAGTAGAAATAGACATAAATGAGAGAGAAAGAT T11 AT ACAAAATAACAACAATCATAAGTAAGAATAGATGTAGATGAGAAA T11 ATATAT AAATAACAACAATCACAGATGAAGGTAGATATAAGTGAGAAA T14 AT ACAAAATAACAATGATCACAAATAAGAGTGAATGTAGATAGAGAA T15* ATATAT AAATAACAACAATCACAGATGAAGGTAGATATAAGTGAGAAA T13* 766 390 1,913 105 101 146 53 50 98 84 37 511 997 352 62 ATATAT AATAACAACAATAATCAGATTAGCAGAGTAATGATAGTTATAAAT T13 ATACAAATAACAACAATAGCCAAGTTAATAGAGTGATGATGATTA AT14 ATACAAATAACAACAATGATCAGACTAATAGAGTAATAGTGATTATA T14 ATACAAATAACAACAATGATCAGACTAATAGAGTAATAGTGATT T12 ATACAAATAACAACAATGATCAGACTAATAGAGTAATAGTGATTAT T15 AA AAAAAACGCATATAAGTGAGCCTATACATAGATAATGATGATAATT T03 AAAAAAAAAAA GCATATAAATAGATCTATATATGAGTGATAGTGACAATTA T15 AAAAAAAAAA GCATATAAATAGATCTATATATGAGTGATAGTGACAATT T11 AAAAA AAAAAACGCATATGAGTAGATCTGTACATAGATAGTGATGATA T12 ATAGAAAACGCATATGAGTAGATCTGTACATAGATAGTGATGATA CT06 CAACTCGACTGCGTGAA AAAAAACGCATATGAGTAGATCTGTACATAGATAGTGATGAT TTT AAATTTATAAAACCAAT CCGTAATTATCTGAGATGAGAAGTGTATA AT12 ATATTTATAAAACC ATACTGTGATTATCTAAGGTAGAAAGTGTGTA AT21 ATA TTTATAAAACCAATATCATAGTTATCTGAGGTAGAGAGTGTATA ATTTAATGTAT18 CATACC TAATTATCTGAAGTAGAAGATGTATGTAAAT T14 199 311 306 309 324 307 310 309 310 312 343 340 343 374 377 374 404 405 406 405 409 404 405 411 405 406 442 440 442 442 443 446 453 453 453 453 475 480 478 481 479 504 503 504 507 507 508 542 542 542 537 E) Cytochrome b 5' 32 32 32 32 32 32 32 53 51 52 54 54 54 53 56 3' 59 60 61 64 62 64 64 91 91 91 91 91 91 91 91 Reads 327 255 190 176 65 35 17 CYb gRNA Sequences AAATAATAGGGATTTATGATGAGATATG CTGTGGATAT14 AAAATAATAGGGATTTATGATGAGATATG CTGTGGATAT12 AAAAATAATAGGGATTTATGATGAGATATG 
CTGTGGATAT13 AA AAAAAAAATAATAGGGATTTATGATGAGATATG CTGTGGATAGT09 GT AGAAAATAATAGGGATTTATGATGAGATATG CTGTGGATAGT12 AA AAAAAAAATAATAGGAGTTTATGATGGAATATG CTGTGGATATTTT* AA AAAAAAAATAATAGGAGTTTATGGTGAGATATG CTGAGAGTAT05 18,713 AAAAAA AAAAGACAGTGTGAATTTCTGAGTAATAAAGGGAATAAT T11 10,649 AAAAAA AAAAGACAATATAGATTTCTGGGTGATAAAAGGGATAATAA CT11 908 339 8,406 2,265 204 95 ATAA GAAAGACAATATAGGTTTCTGGGTAATGGAGAGAATAATA T16 AAAAAA AAAAGACAATGTAGATTTCTGAGTAATGGGGAGGATAA CTATTTATTTT* AAAA AAAAGACAATGTAGATTTCTGAGTAATGGGGAGGATAA CTAT05* AAAAAA AAAAGACAATGTAGATTTCTGAGTAATAGGGAGGATAA CTAT16 AAAAAAA AAAAGACAATGTAGATTTCTGAGTAATGGGGAGGATAAT T07 AAAAAAA AAAAGACAATGTAGATTTCTGAGTAATGGGGAGGAT T05* F) Maxicircle Unidentified Reading Frame II 5' 30 30 34 33 3' 79 79 79 79 Reads 2,605 125 17 15 Murf II gRNA Sequences ATAG AAAGCACAAAAATAAAATTAAATTAGAGTAATTGGATGTTAAAATT T11 ATAG AAAGCACAAAAATAAAATTAAATTAGAGTAATTGAATGTTAAAATT T08 ATAG AAAGCACAAAAATAAAATTAAATTAGAGTAATTGAATGTTAA CAT12 ATAG AAAGCACAAAAATAAAATTAAATTAGAGTAATTGAATGTTAAA T09 200 G) NADH Dehydrogenase 3 3' 76 79 73 79 79 79 113 113 113 108 108 108 108 143 141 141 141 141 141 141 141 141 141 170 170 174 174 205 189 205 229 229 233 234 233 263 263 264 264 263 263 263 263 299 299 298 299 299 299 299 329 320 329 333 329 329 328 5' 33 31 30 31 42 33 63 65 64 63 65 66 68 99 98 101 98 100 99 100 99 98 100 130 130 130 132 158 155 158 190 190 190 198 191 223 222 222 223 223 222 223 220 253 253 253 253 252 253 252 284 288 285 300 285 282 285 Reads 313 70 625 511 397 26 ND3 gRNA Sequences ATATATT ATAAACCATGATATCGAAAATGGGTGTAGAAATGATGATA T12 ATAT ATAATAAACCACAGTATCAGAGACAGATATAGAAGTGATGATAGT T13 ATAT TAAACCACAATATCAGAAATAAGTGTAGAAATAGTGATAATA T12 ATAT ATAATAAACCACAGTATCAGAGACAGATATAGAAGTGATGATAGT T09 ATAT ATAATAAATCACAGTATCAGAGACAGATATAGAA TGATGATAGT12 ATAT ATAATAAACCACAGTATCAGAGACAGATATAGAAGTGATGATA T16 1,006 411 84 59 33 27 14 826 542 501 240 199 188 168 52 24 21 507 413 2,397 133 ACATAAGAAACATAAAGAAAAATCTGTGAGTAGAGTGATAAGTTATAAT T11 
ACATAAGAAACATAAAGAAAAATCTGTGAGTAGAGTGATAAGTTATA T15 AT ACATAAGAAACATAAAAAGAAATCTGTAAGTAGAGTAGTAAGTTATAA GT14 ATACAT AAAAACATAAAAAGAAACTTATAAGTAGAGTGATAGATTATAAT T09 ATACAT AAAAACATAAAAAGAAACTTATAAGTAGAGTGATAGATTATA TTTT ATACAT AAAAACATAAAAAGAAACTTATAAGTAGAGTGATAGATTAT T12 ATACAT AAAAACATAAAAAGAAACTTATAAGTAGAGTGATAGATT T10 AT ATGAAAACAATCAAAGAAGTGTGATAGAAAGTATAAAAGGTATAA T11 ATAA GAAAACAATCAGAGAAATGCGGTAAAAGATATAAGAGATATAAA T12 ATATAA GAAAACAATCAGAGAAATGCGGTAAAAGATATAAGAGATAT T09 ATATAA GAAAACAATCAGAGAAATGCAGTAAAAGATATAAGAGATATAAA T13 ATATAA GAAAACAATCAGAGAAATGCGGTAAAAGATATAAGAGATATA T11 ATATAA GAAAACAATCAGAGAAATGCAGTAAAAGATATAAGAGATATAA TCT13 ATATAA GAAAACAATCAGAGAAATGCAGTAAAAGATATAAGAGATATA T18 ATATAA GAAAACAATCAGAGAAATGCGGTAAAAGATATAAGAGATATAA T10 ATATAA GAAAACAATCAGAGAAATGCGGTAAAAGATATAAGAGATATAAG T09 ATATAA GAAAACAATCAGAGAAATGCGGTAAAAGATATAAGAGATATG T20 ATATAT CAAATCACACGAAGATCATAGATGACGATGAAGGTAGTTAA TAT18 ATACAT CAAATCACACGGAAATCATAGATGGCAATGAAGATAGTTAA T11 ATA TATACAAACTACACGGAGATCACAAATAACAGTGAGAATGATTAA T12 ATA TATACAAACTACACGGAGATCACAAATAACAGTGAGAATGATT T15 412 236 1,756 T ATGTATAAAACATTAAATGTGAGTTTGTGTCGTGTAGATTATGTGAA T15 AAATAAAACACC AACGTGAATTTATATTGTATAGATCGTATGAGAAT AT15 ATAT ATGTATAAAACATTAAATGTGAGTTTGTGTCGTATAGATTATGTGAA T14 4,864 123 227 139 40 ATATAT AACAACTAACAGAATATAGATCTGATGTATAAGATATCA T14 ATATAT AACAACTAACAGAATATAGATCTGATGTATAAGATATCG T13 ATAT AACAAACAACTGATGAGACATAGATTCGATGTATAAGATATCA T13 ATAT AAACAAACAACTAATAAAATGTAAATCTGATGTGTGA TATACT08 ATAT AACAAACAACTGATGAGACATAGATTCGATGTATAAGATATC T11 3,225 3,165 384 164 106 232 177 32 260 106 83 133 89 57 22 786 435 282 268 168 64 3 TTAT ATACAAATAATGGGATTTAACGATATAGAGGATGAATGATT T11 ATAT ATACAAATAATGGGATTTAACGATATAGAGGATGAATGATTA T15 AAAAAT AACACAGATAATGGAATTTAATGATATGAGAAATGGATGATTA T05 AAAAT AACACAGATAATGGAATTTAATGATATGAGAAATGGATGATT TCT17 ATAT ATACAAATAATGGGATTTAACGATATAGAGAGTGAATAATT TTTT AAAT ATACAGATAATGGGATTTAACGATGTGAGAGATAGATAATTA T16 AAAAAT ATACAGATAATGGGATTTAACGATGTGAGAGATAGATAATT T05 
AAAT ATACAGATAATGGGATTTAACGATGTGAGAGATAGATAATTAAT T13 ATAT GATAAAACAACACTATTATGAGAACGAGTGATAGAATATAGATAAT TTCT14* ATAT AATAAAACAACACTATTACAAAGATAGACAGTGAGATATAGATAAT T13 ATACAT ATAAAACAACACTATCATAGAAGCAGACAGTGAGATATGAGTAAT T15 ATAT AATAAAACAACACTATTACAAAGATAGACAGTGAGATATAGATAAT T07 AT AATAAAACAACACTATTACAAAGATAGACAGTGAGATATAGATAATG AT06 ATAT AATAAAACAACACTATTACGAAGATAGACAGTGAGATATAGATAAT T12 AT AATAAAACAACACTATTACGAAGATAGACAGTGAGATATAGATAATG AT16 ATAT AAAACCACAAAAATAGAAAGCTATAATAGAGATAGAATAATGTTA A07 ATTTTAAGTT AGAGTGAGAAATTGTAGTGGAAATAAGATGATA AAAT11 ATAT AAAACCACAAAGGTAGAAGATCGTAATAGAGATAGAATAATATT T09 AT AACAAAAACCACAGAGATGAGAGATTGTAATAAG TATAGTGATAAT13 ATAT AAAACCACAAAAATAGAAAGCTATAATAGAGATAGAATAATGTT T12 ATAT AAAACCACAAAAATAGAAAGCTATAATAGAGATAGAATAATGTTATT CT09 AT AAACCACAAAGATAGAAGGCCATAATAGAGATAAAATAATATT T10 201 5' 322 324 321 322 322 321 320 324 347 345 355 350 349 355 402 402 402 403 3' 369 369 369 369 370 370 368 370 388 388 388 388 388 388 438 438 435 438 Reads 48,018 5,034 3,062 142 721 291 149 43 95 82 1,307 633 376 104 417 128 753 126 ND3 gRNA Sequences cont. 
ATAT AATACCACATGAATCTTATATGTACGATGGAAGATGAGAATTAT T13 ATAT AATACCACATGAATCTTATATGTACGATGGAAGATGAGAATT T11 ATAT AATACCACATGAATCTTATATGTACGATGGAAGATGAGAATTATG TTTCT06 ATAT AATACCACATGAATCTTATATGTACGATGGAAGATGAAAATTAT TCT10 AT AAATACCACATGAATCTTATATGTACGATGGAAGATGAGAATTAT T12 AAATACCACATGAATCTTATATGTACGATGGAAGATGAGAATTATG CAGT08 ATAT ATATCACACAAATTCTATACATATAATAGAGAATGAGAGTTACAA T06 AT AAATACCACATGAATCTTATATGTACGATGGAAGATGAGAATT T05* ATATGTAAAC AATATACGTGATTTCAGAGATACTATGTGAATTCTAT T22 ATATGTAAAC AATATACGTGATCTCAGAGATACTATGTGAATTCTATGT T07 ATATAAACAAAC AATATATGTGGTTTCGAAAGTGTCATGTGAATT AT12 ATAGAT AATATACGTGATCTTAGAGGTACCATGTGAGTCT GAGTTAT12 AGTATGCGTGATTTTAGAGATATTGTATGAATTTT T08 AGTATGCGTGATTTTAGAGATATTGTATGAATT ATAT15* ATATA TATAATACAACAAGGAGCGTCATAAGTAAAGTGAA TTCGTTATAT13 ATAT TATAATACAACAGAAAATGTCATAAGTGAGATGAA TTCGTTATAT09 ATATATAA AATACAACAAGAGACGTCGTAAATAGAGTAAA TTCGT08 ATAT TATAATACAACAGAAAATGTCATAAGTGAGATGA TTCGTTAT06 202 H) NADH Dehydrogenase Subunit 7 5' 36 35 36 36 38 28 27 31 34 36 24 24 28 59 58 108 95 124 124 139 121 139 121 122 152 147 150 151 147 152 246 246 261 261 261 260 261 297 295 292 292 292 293 327 327 327 327 330 329 352 353 352 354 352 390 391 390 391 5' 3' 69 69 71 69 71 71 71 71 71 69 71 71 71 91 91 137 132 170 168 170 166 170 166 166 190 199 199 199 179 190 269 269 311 293 310 293 293 338 338 334 324 324 324 373 365 365 373 378 378 398 402 385 402 402 424 427 424 424 3' Reads 100,761 765 240 121 113 35,079 20,487 546 402 354 327 298 194 4,376 1,242 13 1 3,944 3,640 1,507 396 384 178 26 ND7 gRNA Sequences ATATA ATAAATGTAAAGAGACTATTGAGAGTGGCATAAG TGATATTAT14 ATATA ATAAATGTAAAGAGACTATTGAGAGTGGCATAAGG GATATTAT11 AAAT ATACAAATGTAAAGAAGCTATCAGAGGTAATATAAG TGATATAAT13 ATATA ATAAATGTAAAGAGACTATTGAGAGTGACATAAG TGATATTAT13 ATAT ATACAAATGTAAAGAGACTATCGAGAGTGACATA TGTGATATTAT11 AAAT ATACAAATGTAAGAGAACTGCCGAAAGTAACGTAGAATGATATT AT12 AAAT ATACAAATGTAAGAGAACTGCCGAAAGTAACGTAGAATGATATTT T08 ATAAAT ATACAAATGTAAGAGAACTGCCGAAAGTAACGTAGAATGAT TTTT AAAT 
ATACAAATGTAAGAGAACTGCCGAAAGTAACGTAGAAT T15 ATATA ATAAATGTAAAGAGACTATTGAGAGTGGCATAAG TGATATTAT12 ATAAAT ATACAAATGTAAGAGAACTGCCGAAAGTAACGTAGAATGATATTTGTT TTTT AAAT ATACAAATGTAAGAGAACTGCCGAAAGTAACGTAGAATGATATTTATT T11 AAAT ATACAAATGTAAGAGAACTGCCAAAAGTAACGTAGAATGATATT ATCT09 ATATATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAA TGT13 ATATATAAATACA GTGGACTAAATGTGAAGCGATATAAATGTGAAA T12 ATATAATAAACAACATAAAGTGCCATGTACT#CGAGAT TTTTAG AT ATAAACAACATAAAATACTATGTGATGTAGGAT CTGTGAATTAAT09 ATAC ATCAATATAAACGATAGATTTACCGTAGAAGTATAGTGAATAAT T14 ATATA CAATATAAACAGTAGATTCACTGCAGAAGTATGATAGATAAT T11 AT ATCAATATAAACAATAAGTTCGTCATAGA TTTACAGTAGATAAT12 ATACAT ATATAAACAATGAATTCACTGTGAAGATACGATAGATGATATA T14 AT ATCAATATAAACAATGAGTTCGTCATAGA TTTACAGTAGATAAT12 ATAT ATATAAACAATAAATTCATCGTGAAGGTACAGTAAATGATATA T16 ATAT ATATAAACAATAAATTCATCGTGAAGGTACAGTAAATGATAT T15 988 563 200 2,089 295 143 AAATTACGATGCAT AATAATCTATGGTACAGTTGATATGAGTGATAA T12 ATATATA CACGATGCAGATAATCTATAGTATGATTGATATAAGTGATAAATTT T09 ATATATA CACGATGCAGATAATCTATAGTATGATTGATATAAGTGATAAAT AT12 AAA TACGATGTAAATAACCTGTAGTATAGTTAGTGTAGATAGTGAA AT12* ATATATAAATGCAAATAACG TGTAATACAGTCAATATAGATGATAAATTT T09 AAATTACGATGCAT AATAATCTATGGTACAGTTGATATGAGTGATAA T13 2,470 837 3,259 2,874 146 133 1,105 888 765 91 394 178 90 2,128 1,225 20,110 997 517 111 283 3,043 602 556 173 ATATCAAC ACATAATCTGACTTGTCGGAGTAT CTAAAGGAATAAAT14 ATATCAAC ACATAATCTGACTTGTCGGGGTAT CTAAAGGAATAAAT12† ATAT AACATAAAGACAATAAGTGCTTATTACAGTGAACATTGATATAATTT T08 ATATATAAGACAAC GATGCTCATTATGATAGATACTGATGTAATTT T10 ATAT ACGTAAAGACAATAGGTGTTCATTGCAGTAGATATTGATGTAATTT T10 ATATATAAGACAAC GATGCTCATTATGGTAGATACTGATGTAATTTA T07 ATATATATAAGACAAC AGTGCTCGTTACAGTGAATATTGATGTAATTT T11* AT ATAAATAACATCGCAACGTATATTCGAAATGTAGAGATA CAT13 ATATA ACAAACAACATCGTAATATGTGCTCGGAGTATAGAGATAAT TAAATAT13 ATACT ACAACATCGCGATATATACTTGGAATGTAAAGGTGATAAA GT11 ATATATAACAACATCG AGCATATACTCAGAATATAAAGGTGATAAA GT12 ATATATAACAACATCG AGTATATACTCAGAATATAAAGATGATAAA GT09 ATCGG AGCATATACTTGAGATATAAAGATGATAA T11 A 
TATAATTAATAAACGTATAAATGTGCAGTGTGACGATGAATGATATT AT13 ATACAT ATAAACGTATAGGTGCATAATGTAACGATGAATGATGTT AAT11 ATACAT ATAAACGTATAGATGCGCAATGTAACGATGAATGATGTT AAT12 A TATAATTAATAAACGTATAAATGTGCAGTGTGACGATGAATGATATT T13 AAA TTACAATTAATAGACGTATAAGTGCATAGTGTAGTGATGGATAAT T17 AAA TTACAATTAATAGACGTATAAGTGCATAGTGTAGTGATGGATAATG AT13 ATATA GAAAACTACAGGTAAATTCTGCAATTAGTAGACGTGTAAAT AT07 ATATA TATTAAAACTATGGGTAGATTCTGTAATTAATAGACGTGTAAA AT13* ATATATAAAACTACGA GTAGATTCTATGATTGATGAACGTGTAAAT T11* ATATA TATTAAAACTATGGGTAGATTCTGTAATTAATAGACGTGTAA T12 ATATA TATTAAAACTATGGGTAGATTCTGTAATTAATAGACGTGTAAAT TTCTTTT 2,283 200 162 298 ATATTAA ATACATGATATACGCGATAGATTATTAGAGTTATG AGTTAAT16 AAATT ACCATACATGATATATACAGTGAACTATTAGAATTAT AGGTAATGT06 ATATTAA ATACATGATATGCGCAGTAGACTATTAAAGTTATG AGTTAAT12 AAATTACA ATACATGATATATACAGTGAACTATTAGAATTAT AGGTAAT14 Reads ND7 gRNA Sequences cont. 203 452 451 452 464 450 458 450 485 485 530 526 530 528 548 544 553 569 569 569 576 574 571 571 615 615 603 596 611 615 611 615 611 611 603 630 642 640 648 642 670 674 674 669 671 671 699 701 727 711 725 727 758 741 741 764 759 765 765 765 3' 412 414 416 416 407 414 410 453 453 486 499 486 486 508 508 508 526 526 527 540 531 540 540 564 567 562 564 567 564 564 567 563 564 568 584 596 596 596 596 629 632 630 630 630 629 656 666 679 679 679 679 711 710 709 725 731 722 719 735 5' 15,183 223 199 5 3,126 2,332 132 3 207 2,523 2,002 1,136 344 ATATATAA GGAGACAAATGATCTAGATTCGAGACTGTATATGATATAT T13 ATATATAACAAC GAGACAGATAATCTAGATTTGAAGTTATATGTGATAT T12 ATATATAA GGAGACAAATGATCTAGATTCGAGACTGTATATGAT T13 AT ATTATAATAACGGAGATGAGCAATTTAGATTCAGAGTTATATGTGAT T13 ATATA AGACAAACAATCTAAATCTGAGACTGTATATGATATGTATAAT T10 ATAT ATAACGGAGACAGATAGCTTGAATCTAAGGTTGTATGTGATAT T12 ATATA AGACAAACAATCTAAATCTGAGACTGTATATGATATGTAT T13 ATAAAATGTCATCAATTA GTTACGTTCTTTGAGTGATTGTGATAAT AT13 ATAAATAACATCAATTA GTTACGTTCTTTGAGTGATTGTGATAAT AT13 ATATAA GCATACGACAATTACGATATGAGTCAGAGAATGTTGTTAATTT T11 ATATATAA ATACGACAATCATAACGTGAATCAGAGA CTGTGATTAAT11 ATATAA 
GCATACGACAATTACGATATGAGTCAGAGAATGTTGTTAATTT T11 ATATA ATACGACAATCACGATATAGATTAAAAGATGTTGTTAATTT T10* 1,129 116 1 3,050 4,850 351 544 291 8,794 1,694 ATAT AAATCATGAAAGCTGAGTGTGTATGGCAGTTACGATATA TGTTAAT14 ATATAAAA CATGGAAGCTAAGTGTGTATGATGATTATGATATA TGATTAAT06 ATA TAATAAAACCATGAGAGTTGGATGTATATGATGATCGTGATATA T17 ATATAT ATCATCAAGAATATCTAATGAGACTATGAGAGTTAAATGTATA AT12 ATATAT ATCATCAAGAATATCTAATGAGACTATGAGAGTTAAATGTATA AT13 ATATAT ATCATCAAGAATATCTAATGAGACTATGAGAGTTAAATGTAT T06 ATAATAAT AAACAAAATCATTGAGAGTATCTGATAGAGTTATGA TAGTCAT05 AAAT ACAAAATCATCAGGGATACTTGGTAAGATTGTGAAAGTTAAGT T16 ATATAAT AAATCATCAAGAATATCTGATAGAACTGTGA TTGCTAAT14* ATACAAT AAATCATCAAGAGTATCTAATAGAACTGTGA TTGCTAAT14 1,896 360 267 173 1,316 1,043 754 703 188 146 131 831 20 512 159 32 1,258 463 419 299 101 65 59 2,991 2 1 1,728 373 AT ATATTATCAACAACAATAGAAGATTGGCGAAATTAGAGATAGAATTATT T12 AT ATATTATCAACAACAATAGAAGATTGGCGAAATTAGAGATAGAATT T09 ATATT ACAACAACAAGAAATCAATGAAGTCAGAGATAAAGTTATTAA T15 TAGATATCAACAACAT CAGAGAATCAATGAAACTAGAGATAGAGTTATT T10 ATACA TATCAACAACGACAAGAGATCAGTGAAATTAGAAGTAAAGTT T13* AT ATATTATCAACAACAATAGAAGATTGACGAAATTAGAGATAGAATTATT T10 ATACA TATCAACAACGACAAGAGATCAGTGAAATTAGAAGTAAAGTTATT T09* AT ATATTATCAACAACAATAGAAGATTGACGAAATTAGAGATAGAATT T15 ATACA TATCAACAACGACAAGAGATCAGTGAAATTAGAAGTAAAGTTATCA T06 ATACA TATCAACAACGACAAGAGATCAGTGAAATTAGAAGTAAAGTTATC T12 ATATT ACAACAACAAGAAATCAATGAAGTCAGAGATAAAGT AT08 ATAAT TAACAAACAAATATGATATTATCAGTGACAGTGAGAAATTGATA T15 ATA TATAACAATCCATAGCAGATAGACGTGATATTATTGATGATAGT TTAAAT18 AAAT TAACAATCCATAACAAGTGAGCGTGATATTGTCAGTGATAAT T11 ATAAATCATAACAATCTATAATAGACGAGCGTGATATTGTCAATGATAAT T10 ATA TATAACAATCCATAGCAGATAGACGTGATATTATTGATGATAAT T12 ATAT GATAAACGATTACCTACAGATAATGAGTCATAGTGATTTATA T13 ATAT ATAAAATAAACGATTACCTGTGAATGATAGATTATGATGATTT T11 ATAT ATAAAATAAACGATTACCTGTGAATGATAGATTATGATGATTTAT T26* ATACAT ATAAACGATTACTCATAGATAGCAAGTCATAGTGATTTAT T08* ATTT AAATAAACGATTACTTACGAGTGACAGATTGTGATGATTTAT T14 ATATTT AAATAAACGATTACTTACGAGTGACAGATTGTGATGATTTATA 
T15 ATATAT ATGACAAACTACGTAAGTGCAGATAAAGTAAGTGATTATTT T12 ATAT AGATGACAAACCATGTAGACGTGAATAAGATAG TGATTGATACAT12* AAAAAA AAAAACTAAATCATATAAATTAAAGGGTGATGAACTGTGTAAAT T13 ATAAGTCAGAAAGTGATAGATCGTGTAAAT T15 AAATTAAATCATATAAGTTGGAAGATGGCAGATTGTGTGAAT TGTAGT16 AAAAA AAAAACTAAGTCATATAAGTCAGAAAGTGATAGATCGTGTAAAT TTT 3,229 22 2 55,757 2,867 8,175 552 115 ATATAT ACGAGACAAAATATCACTTAGATTATTAGAGATTGAGTTATA AT13 AA TTAGACTATTAGAGATTGAGTTATAT TTT AAAACGAGACAAGATACCAA TTAGACTATTAGAGATTGAGTTATATA T14 ATATAA TAACGAACGAGGCAGAGTATCATTTAGACTATTAGA TTCAAT14 ATATATAAT GACGAGATAAGACATCACTTAGACTGT AGAGAT14 ATATA TTAACGAACGAGATAGAACATTGCTTGAGTTATTGAGAAT AT14* ATATA TTAACGAACGAGATAGAACATTGCTTGAGTTATTGAGAATT T12 ATA TTAACGAACGAGATAGAACATTGCTTGAGTT TT Reads ND7 gRNA Sequences cont. 204 794 794 794 794 794 806 816 830 822 830 833 833 833 830 830 830 845 839 839 844 839 844 877 877 865 857 865 857 879 902 902 916 919 919 919 919 939 931 939 947 951 978 988 978 986 978 988 978 988 983 983 983 983 3' 1000 756 757 751 755 756 772 782 778 777 775 781 780 778 778 778 778 792 790 790 792 790 790 843 834 829 831 832 832 834 863 866 872 872 867 872 871 901 899 901 907 907 932 934 931 933 940 936 941 937 934 933 936 937 5' 959 48,263 318 174 167 2,829 367 77 25,793 5,729 2,644 2,373 2,079 903 450 331 134 98 52 343 190 99 89 26,437 1,279 188 121 23 20 1 AT ACTAAATAAACGACGATCTTATACTGTATCTGATGAATG TGATATTAAT14 AT ACTAAATAAACGACGATCTTATACTGTATCTGATGAAT ATGATATTAATTTT AT ACTAAATAAACGACGATCTTATACTGTATCTGATGAATGGGATA TTAAT14 AT ACTAAATAAACGACGATCTTATACTGTATCTGATGAATGA TATTAATTGT07 AT ACTAAATAAACGACGATCTTATACTGTATCTGATGAATG TGATAT14* ATATA TCATAACAACTAGATAAACGATGATCTCACA GTATAGTTAAT12 ATAACTTATAACAACTAGATAGACGG GAATCTTATATTGTAGTTAAT16 ATAT ATAAAACATAAAATATGACTTGTAGCAGTTAAGTGAATGATGAT AT12 ATATAT TAAAATACAACTTATGATGACTAAGTGAATGATGATT CAAT10 ATAT ATAAAACATAAAATATGACTTGTAGCAGTTAAGTGAATGATGATTTT T08 AAATA ATAACAAAACATGAGATATAACTTGTAGTGATTAGATGAATGAT T11 AAATA ATAACAAAACATGAGATATAACTTGTAGTGATTAGATGAATGATA T13 AAATA 
ATAACAAAACATGAGATATAACTTGTAGTGATTAGATGAATGATAGT AT13 ATAT ATAAAACATAAAATATGACTTGTAGCAGTTAGGTGAATGATGAT ATTAAAT13 ATAT ATAAAACATAAAATATGACTTGTAGCAGTTAAGTGAACGATGAT ATTAAAT14 ATAT ATAAAACATAAAATATGACTTGTAGCAGTTAAGTAAATGATGAT AT07 ATAT AAAACAATAATCATGATGAGATATAAAGTGCAGTTTGTGATAATT T10 ATAT ATAATCATAACAAGATGTAGAGTACGATTTATAGTGATTAA T12 ATAT ATAATCATAACAAGATATAGAGTACGATTTATAGTGATTAA T11 ATAT AAACAATAATCATGATGGGATATAAAGTGCAGTTTGTGATAATT T10 ATAT ATAATCATAACAGAGCATAGAATACAGTTTATAGTGATTAA TTGT05* ATAT AAACAATAATCATGATGGGATATAAAGTGCAGTTTGTGATAATTAA TAT05 ATAT ATAAACGGTCAAATGTATTACTTATAAGATAGAG TTGATAAT14 AT ATAAACGGTCAAATGTATTGCTTGTAAGGTAGAGGTGATAATT T13 ATATATAACGATC AATGCATTATCTATGAAGCAAGAACAGTGATTATAAT T15 AATGTGTG ACCTATAAAGTGAGAATAATGATTATA T12 ATATATAACGATC AATGCATTATCTATGAAGCAAGAACAGTGATTAT TTTT AATGTGTG ACCTATAAAGTGAGAATAATGATTAT T11 ATAT AAATAAACGATTAGATGTATCACTTATAGAATAAAAATGATGATT T15 2,491 126 3 3,227 528 295 142 ATATA ATATGCATATCAGATAGACGTAGAAATAAGTGATTAAAT T10 ATATAATAA ATACGCATATTAGATGAGTGTAAAAGTAAGTGATTA T10 ATATAA AATCAACAAATTCGTATGTATATCGAGTAAATGTAGAAATAAAT T09 A TACAAATCAACAAATTCGTGCGTATATTAAGTGGGTGTAGAAGTGAAT T17* A TACAAATCAACAAATTCGTGCGTATATTAAGTGGGTGTAGAAGTGAATGATT T15 A TACAAATCAACAAATTCGTGCGTATATTAAGTGAGTGTAGAAGTGAAT TAAAT19 A TACAAATCAACAAATTCGTGCGTATATTAAGTGGGTGTAGAAGTGAATG T15 1,365 292 249 24 64 44,852 11,706 3,513 2,467 182 175 137 97 18,961 1,729 472 397 Reads 173,548 ATATAT ACTAATAAAAAGGCATTGCTTACAGATTGATAGATTTAT T12 ATATATTACCAACAT AGAGACATTGCTTATGAGTTAACAGATTTATAT T12 ATATAT ACTAACAAAAAGATATTGCTTATAGATCAGTGAATTTAT T11 ATAAAGAAACCAACAGAAGAATATTGCTTGTAAGTTAATGA TCTTGTAT10 CCAAGGCA AAAGATAAAGAAACCAACAGAAGAATATTGCTTGTGAGTTAATGA TCT17 ATATAT TCAAACAAACAGATAGAACCGGAGACGAGAAGATTGATAA T13 ATATAAATAATCAAACAGACGAATGAAACTAGAGATAGAGAAATTAAT T12 ATATAT TCAAACAAACAGATAGAACCAGAGACGAGAAGATTGATGAA T12 ATAAATAATCAAACAGACGAATGAAACTAGAGATAGAGAAATTAATA T12 ATATAT TCAAACAAACAGATAGAACCGGAGACGAGAAG CTTGATAAT05 ATATAAATAATCAAACAGACGAATGAAACTAGAGATAGAGAAATTA TTAT15 
ATAT TCAAACAAACAGATAGAACCGGAGACGAGAA TATTGATAATTTAAT14 ATATAAATAATCAAACAGACGAATGAAACTAGAGATAGAGAAATT TAT13 ATATAT AATAATCAAACAGACGAATGAAACTAGAGATAGAGAAATTAAT T12* ATATAT AATAATCAAACAGACGAATGAAACTAGAGATAGAGAAATTAATA T05 ATATAT AATAATCAAACAGACGAATGAAACTAGAGATAGAGAAATTA T14 ATATAT AATAATCAAACAGACGAATGAAACTAGAGATAGAGAAATT T11 ND7 gRNA Sequences cont. ATAAA GGTAATATCACAGTGTAGATAGTCGGATAGATAGATGAAA AT14 205 1000 1000 1000 1000 1000 1000 1000 1000 1017 1032 1043 1043 1043 1067 1067 1078 1085 1085 1121 1113 1143 1143 1142 1131 1143 1143 1143 1143 1142 1143 1143 1143 1143 1143 1143 1143 1143 1145 1148 1157 1146 1146 1146 1146 1146 1154 1146 1146 1146 1152 1167 1182 1182 1167 1167 1197 1195 3' 1218 1224 952 960 966 964 963 961 962 959 983 1001 1000 1015 1013 1032 1030 1032 1057 1055 1089 1087 1099 1094 1099 1099 1101 1105 1106 1108 1094 1099 1107 1099 1099 1099 1099 1099 1095 1108 1094 1107 1107 1108 1111 1114 1110 1108 1107 1120 1113 1108 1128 1136 1138 1128 1131 1150 1150 5' 1181 1183 748 472 412 372 371 265 234 3,928 101 3 1 12 208 6,797 2,406 751 44 40 1 412 81,287 6,615 5,793 705 550 520 318 303 289 259 177 163 162 128 98 356 212 569 295 82 90,559 10,003 2,224 2,220 1,887 978 400 266 243 232 ATAAA GGTAATATCACAGTGTAGATAGTCGGATAGATAGATGAAATT T09 ATAAA GGTAATATCACAGTGTAGATAGTCGGATAGATAGATGAA T13 ATAAA GGTAATATCACAGTGTAGATAGTCGGATAGATA TATGAAAAT13 ATAAA GGTAATATCACAGTGTAGATAGTCGGATAGATAGA AAAAAT14 ATAAA GGTAATATCACAGTGTAGATAGTCGGATAGATAGAT T15 ATAAA GGTAATATCACAGTGTAGATAGTCGGATAGATAGATGA T12 ATAAA GGTAATATCACAGTGTAGATAGTCGGATAGATAGATG TAAATTTAT09 ATATAAA GGTAATATCACAGTGTAGATAGTCGGATAAATAGATGAAA AT11 ATATATAAATAACATAT TAATGGTCTTGATGGTGATATTGTAGTATAA T12 ATATA TATAAAATAACGTGATGATGGTCTCGAT AGTAATTGATAAT08 ATAT ACACCACAAACTGTAGAATGACGTGATAGTGGTTTCAGTG ATAAT15 AT ACATCACAAACTATAAAGTAATATGATAA GGGTTTCGAT15 ATAT ACACCACAAACTGTAGAATGACGTGATAGTG AT18 ATATATACAAGC AATGATGTACTCGGTAAATAGTGACACTGTGAATT T12 ATATATACAAGC AATGATGTACTCGGTAAATAGTGACACTGTGAATTAT T12 A 
AATATAAGCAAATGATGTATTCGGTGAGCAGTGATATCGTAAATT T10 ATTAGAAA GGTGTTCGGTATAGGTAGATGATATAT AT10 CTCATTAGAAA GGTGTTCGGTATAGGTAGATGATATATTT TTT AT ATATGATAACAAACAATACTTACTTT GAGGTGTTTTGTGAGTAAAT14 ATA AATAACAAACAATATTCGTCTTTG AGATATTCGATATAAGTAAAT12† ATATAA GAGAACATAAACTAGCATAGAGGCATAGTAACAAGTGATAT AT14 ATATAA GAGAACATAAACTAGCATAGAGGCATAGTAACAAGTGATATTT T10 ATATAAA AGAACATAAACTGACACGAGGGTATAGTGATGAATGATAT AT13 ATATAAGAGAACATAAA CAGCATAGAGGCATAGTAACAAGTGATAT AT14 ATATAA GAGAACATAAACTAGCATAGAGGCATAGTAACAAGTGAT T13 TATAA GAGAACATAAACTAGCATAGAGGCATAGTAACAAG GGATATAT13 ATATAA GAGAACATAAACTAGCATAGAGGCATAGTAACAA T11 ATATAA GAGAACATAAACTAGCATAGAGGCATAGTAAC T11 ATATAAA AGAACATAAACTGACACGAGGGTATAGTGATGAATGATATTT T10 ATATAA GAGAACATAAACTAGCATAGAGGCATAGTAACAAGCGATAT ACT12 ATATAA GAGAACATAAACTAGCATAGAGGCATAGTAACA T12 ATATAA GAGAACATAAACTAGCATAGAAGCATAGTAACAAGTGATAT AT05 ATATAA GAGAACATAAACTAGCATAGAGGCATAGTAACAAATGATAT AT05 ATATAA GAGAACATAAACTAACATAGAGGCATAGTAACAAGTGATAT AT14 ATATAA GAGAACATAAACTAGCATAGAGACATAGTAACAAGTGATAT AT14 ATATAA GAGAACATAAATTGACATGGAAGCATAGTAATAAGTGATAT AT12* ATATAA GAGAACATAAATCAGTGCAAAGGTATAGTAGTGAGTGATATT AAT09* ATATAAACGTAT ACGAGAGCATAGATCAGTGTGAGAATGTAGTAAT T14 ATATAAAA TAAACGAGAATATAAACTGATGTAGAGATATAGTGATAAGTAATATTT T08 AT ATGTAAATGTAAACGAGAATATAGATTGATGTAGAGATATAGTAATA T13 ATATTAAACGT AACGAGAATGTGAACTGACATAGAGATATGATAATA T14 ATATTAAACGT AACGAGAATGTGAACTGACATAGAGATATGATAAT T14 ATATTAAACGT AACGAGAATGTGAACTGACATAGAGATATGAT T10 ATATTAAACGT AACGAGAATGTGAACTGACATAGAGATAT T14 ATATTAAACGT AACGAGAATGTGAACTGACATAGAGATATGATA T16 ATATA TAAACGTAAATGAGAATATGAATCAGTGTGAAAATGTAATAAT T07* ATTAAACGT AACGAGAATGTGAACTGACATAGAGATATGATAATG T13 ATATTAAACGT AACGAGAATGTGAACTGACATAGAGAT T13 ATATTAAACGT AACGAGAATGTGAACTGACATAGAGATATG TTTT ATAT AACGTAAACGAGAGCATAAATTGATGTGAAGATGTGATAAT TTCTTTT 3,079 1,834 866 337 203 ATATAT AATCCGTACAATGCGAACGTAGACGAGAATATGAGTTAAC T14 ATAT ATAAATATGCAAGAAATCTGTATGATGTAGATGTGAATGAGAATAT T10 ATAT ATAAATATGCAAGAAATCTGTATGATGTAGATGTGAATGAGAAT T11 
ATATAT AATCCGTACAATGCGAACGTAGACGAGAATATGAGTTAAT TTT ATATAT AATCCGTACAATGCGAACGTAGACGAGAATATGAGTT T19 213 1,077 Reads 4,103 1,399 AT ATAAACATCCAATAGACGAGTATGTGAGAGATTTGTATGATGTAAAT T10 AAATATCCAATAAACAGATATGTAGAAGGTCCGTATAATGTGAAT ATAT13 ATATATAA ATGCAATAGAAGATCACGCAAATAGATATCTGATAAAT T13 ATA TAAATCATGCAGTAAAAGACTATGTAGATGGATATTCAGTGA TAAT14 ND7 gRNA Sequences cont. 206 1183 1183 1183 1210 1233 1233 1240 1251 1242 1242 1240 1240 1239 1240 1241 1269 1269 1223 1224 1224 1257 1257 1257 1268 1282 1283 1270 1268 1270 1268 1268 1268 1320 1320 1,017 138 2,561 123 167 83 95 63 6 296 216 54 25 23 12 6 AAATA AAATCATGCAGTAGAGAACCGTGTAAGTGAGTATCTGATAA TTAT11 ATA TAAATCATGCAGTGAAGAGCTACGTAAATGGATATTCAGTGA TAAT11 ATA TAAATCATGCAGTAAAAGACTATGTAGATGAATATTCAGTGA TAAT14* ATAT ACAACATCAATATTACTTAGAACGGTAACTAGATTGTGTAATAA T14 ATAT ACAACATCAATATTACTTAGAACG ATAACTAGATTGTGTAAT10† ATAT ATAACATCAATATTACCTAGAGTG TCAGTTAGATTATGTGATAAAT14† AAACTAACGATATT CGGATCTGAGAGTAACATTGATATTATTT T07 ATATATAT AACTAACGATCTATGGGTTTAAAGACAGTGT GAAT12 AC AAACTAACGATTTACGGATTTAGAGACAGTGTTAATGTTAT AT13 A TACGGATTCAGAAGTGATATTGATGTTAT AT15 AAACTAACGATATT CGGATCTGAGAGTAACATTGATATTATTT T ATATAAACTAACGA TACGGATTCAGAAGTGATATTGATGTTATTT T16 ATT CGGATCTGAGAGTAACATTGATATTATTTA T17 ATT CGGATCTGAGAGTAACATTGATATTGTTT TT GATATT CGGATCTGAGAGTAACATTGATATTATT AT15 ATA TAAACAATCCTACAATGATCTCGTGTATAAGACTGATGATTTA AT12 1,074 ATA TAAACAATCCTACAATGATTTCGTGTATAAGACTGATGATTTA AT17 207 I) NADH Dehydrogenase 8 3' 58 68 56 98 98 97 97 98 97 97 97 136 133 133 135 131 136 136 139 139 139 139 132 131 133 132 138 138 153 153 153 153 153 153 152 170 186 187 187 198 207 199 199 230 229 230 220 230 230 230 228 228 229 244 239 245 254 5' 34 29 28 55 55 54 55 57 57 54 59 87 84 87 87 84 87 96 86 85 86 87 88 84 92 92 86 85 111 111 111 111 115 113 111 117 158 161 161 161 161 161 160 186 186 186 186 194 195 196 186 187 189 213 209 219 219 Reads 2 1 9 2,577 274 236 100 98 65 57 29 ND8 gRNA Sequences GTGGG ATATGAAAGTAAGAGAATAAAAAAA ATTAAT13 AA 
AAAAAAAAACATACAGAAATAGAAAGATAAGAAAGTGATA TATTAT08 AAA ATAGGAGTAGGAGGATGAGAAAATGATAG AGGGATTTT* ATAT AAAACAAACAAAAAGAAGAAACAAGAAATTGAAGAGAGATATAT T13 ATAT AAAACAAACAAAAAGAAGAAGCGAGAAATTGAAGAGAGATATAT T12 ATAT AAACAAACAGAAAAAGAAAACAAAGAGTCAGAGAAAGATATATA T17 ATAT AAACAAACAGAAAAAGAAAACAAAGAGTCAGAGAAAGATATAT T10 ATAT AAAACAAACAAAAAGAAGAAGCGAGAAATTGAAGAGAGATAT T05 ATAT AAACAAACAGAAAAAGAAAACAAAGAGTCAGAGAAAGATAT T05 ATAT AAACAGACAGAAAAAGAAAACAAAGAGTCAGAGAAAGATATATA TGTAATTATTTT* ATAT AAACAAACAGAAAAAGAAAACAAAGAGTCAGAGAAAGAT T09 6,039 633 593 394 380 292 191 11,689 1,938 1,171 816 612 598 267 267 221 196 ATAAATAGTAACACAATGAGCAGAGTACGTATAAGAATGAGTAAA T14 A AAATAGTAATACAACAGACAGAGCATATATAGAAATAAGTGAGAAA T16 AAATAGTAACACAATGAGCAGAGTACGTATAAGAATGAGTAAA T15 TAAATAGTAACACAATGAGCAGAGTACGTATAAGAATGAGTAAA T12 AT ATAGTAATACAACAAACGAGATACGTATAGAAATAGATGAGAAA T13 GTAAATAGTAACACAATGAGCAGAGTACGTATAAGAATGAGTAAA T12 ATAAATAGTAACACAATAGACGAGATACGTGTAGAA TAAGTGATTTAAT11 ATA TAAACAAATAGTAACATGACGGATAGAACGTATATGAGAATGAGTAAAA T12* ATA TAAACAAATAGTAACATGACGGATAGAACGTATATGAGAATGAGTAAAAG T13 A TAAACAAATAGTAATATGACGAATGAAGCGTATATGAGAATAAGTAAAA TTTT* ATA TAAACAAATAGTAACATGACGGATAGAACGTATATGAGAATGAGTAAA TTTAT14 ATAT AATAGTAACACAACGAATAGAACATGTATAGAGATGAATGA TAT17 ATAT ATAGTAACACAATAAATGAGACATATATGAAGATGAATGAGAAA TTTT* ATATA GAATAGTAACACAGCAGATAAGATACATATAGAGATAA TGACAGT07 ATATAT AATAGTAACACAGCAGATAAGATACATATAGAGATAA TGACAGT14 AA AAACAAATAGTAATATGACGAATGAAGCGTATATGAGAATAAGTAAAA T06 AA AAACAAATAGTAATATGACGAATGAAGCGTATATGAGAATAAGTAAAAA T10 70,659 ATATATAAATGGTA AACTCAATGGGTGGATAAGTAGTAATGTGATGAAT T13 479 245 147 127 121 804 4 169,806 124,070 799 66 2,257 966 804 ATATATAAATGGTA AACTCAATGGGTGGATAAATAGTAATGTGATGAAT TTT ATATATAAATGGTA AACTCAATGGGTGGATAAGTAGTAATGTGATAAAT T12 ATATATAAATGGTA AACTCAATGGGTGGATAAGTAGTAATATGATGAAT TGTAATAT15 ATATATAAATGGTA AACTCAATGGGTGGATAAGTAGTAATGTGAT T14 ATATATAAATGGTA AACTCAATGGGTGGATAAGTAGTAATGTGATGA T15 A TATACAATGGTT ATTCAGTGGGTAGACAGATAGTGATATGATAGAC TTAT12 ACAT 
ATAAACTAACAATGGTTGATTTAGTGAGTGAATGAGTAGTAATATT T11 ATAT GAACGCAAAGATGGATTACCACGAGTTAGTAAATTGATGAT AT14 ATATAAT AAACGCAAAATATGGTTACTATGAACTGATAGATTAAT T12 AAACGCAAAATATGGTTACTATGAACTGATAGATTAAT T11 ATATAAT AAACGCAAAAAATGGTTACTATGAACTGATAGATTAAT T14 ATATA TAATAAAGACGCAAAAGATGGTTACTGTGAATTGATGAGTTAAT T12* ATATATAGT AAAACGCAGAAAATGGTTACTATGGATTGATGAATTAAT T16 ATATAGT AAAACGCAGAAAATGGTTACTATGGATTGATGAATTAATG T15* 24,763 ATA CAATACAACGCTCTGAATCATATCGATAAAAGTGTGAGAAAT T13 182 109 204 160 149 199 161 52 31 173 19 10 1 AATACAACGCTCTGAATCATATCGATAAAAGTGTGAGAAAT TTAAT12 ATA CAATACAACGCTCTGAATCATATCGATAAAAGCGTGAGAAAT T11 ATA CAATACAACACTCTGAATCATATCGATAAAAGTGTGAGAAAT TTAAT10 ATA CAATACAACGCTCTGAATCATATCGATAAAAGTG GGAGAAAT11 ATA CAATACAACGCTCTGAATCATATCGATAAAAGT TGAGAAATTTAAT11 ATA CAATACAACGCTCTGAATCATATCGATAAAAG AGTGAGAAATTTAAT05 T ATACAACGCTCTAAATTATACCAGTGAAAATGCGAGAAAT T14 ATATTAT ATACAACGCTCTAAATTATACCAGTGAAAATGCGAGAAA AT11* ATATA AATACAACGCTTTAGATCATATCAGTGAGAGTGTGAAA T13 A ACATAAACGACAGGTAATATGATGTTCTGAAT GATATCGATAAT06 ACAT AACGACAAGTGATATAACGTTTTAAGTTACA GTGATAAT19 AAATA TACATAAACGATGAGTAATATGACGTT GTAGAT13 AAATTAAATCACATAAATGACGAGCGATACAGTGCT GTAGATGATACT06 208 3' 288 267 271 285 291 318 319 317 318 318 319 314 311 321 319 319 318 318 318 353 344 350 344 344 364 358 355 382 386 382 373 385 385 385 417 414 431 434 442 443 441 466 466 466 466 466 482 462 462 462 462 462 464 480 462 503 512 510 512 3' 5' 240 237 246 259 270 276 276 277 279 275 279 276 276 284 287 275 289 276 275 310 301 310 316 301 331 325 325 338 350 339 341 338 342 343 372 390 391 405 411 411 407 426 425 423 428 423 426 423 426 425 428 430 426 426 427 465 477 482 477 5' Reads 3 2 340 325 268 51,820 28,492 1,829 594 384 334 215 186 140 139 109 4,761 1,264 175 ND8 gRNA Sequences cont. 
TATGAACATCCGATACTGAACTAGGGTAGATTGAGTTATATA T10 CATCCAATGTAT AATTAGGGTAGATTGAGTTATGTAAAT T15 ATACGAACATCCATA GTTAGATTAGGGTAAATTGAGT GATGTAAT12† ATATAA GAACATCCGATGCTAGATTA TGGTAGATTGAGTTATATGATTGTAATAT05† ATATA TAACACGAACATCTGATGT ATAAACTGAAGTAAATTGAT16† ATATAC AACGATGATCACTGAGATTTTACCTAATATGGATGTT AAT13* ATATAT AAACGATGACTACTAGAATTCTACTCAATGTGAATGTT AAT14 ATATAT ACGATGACTACCAAAATTCTATCTGATATGAATGT GATAATAT11 ATATAC AACGATGATCACTGAGATTTTACCTAATATGGAT T17 ATATAC AACGATGATCACTGAGATTTTACCTAATATGGATGTTT T09 ATATAT AAACGATGACTACTAGAATTCTACTCAATGTGAAT T15 GATGACTACTAGAATTCTACTCAATGTGAATGTT AATTTATGATAT14 ATATACAACT ATGATCACTGAGATTTTACCTAATATGGATGTT AAT11 ATATATA CAAAACGATGATTACCGAGATTTCATTTAATATGA TTGTCTAATTTT ATATAT AAACGATGACTACTAGAATTCTACTCAATG AATGTTAATTTT ATATAT AAACGATGACTACTAGAATTCTACTCAATGTGAATGTTT T12 ATAATAT AACGATGACTATCAGAACTTTACTCGA GATGAATGT13 ATATAT AACGATGACTACTGAGACTCTATCTGACGTGAATGTT AAT12 ATATAT AACGATGACTACTGAGACTCTATCTGACGTGAATGTTT T07* 13,915 636 312 266 4 1,709 137 21 ATATTA GATGATAACTCAGTGTAGATTGATCTGTAGAATGAT AATAT13 ATAT ATAACTCAATGTAGATCGATTTGTGAGATGATGATTGTCAA T11 ATGATAACTCAGTGTAGATTGATCTGTAGAATGAT AATATATAATGT05 ATATACA ATAACTCAATGTGAATTAATCTGTGAGAC TAT14 AT ATAACTCAATGTAGATCGATCCGTAGAATGATGATTGCTAA T11 ATAT ATAAATACAACGGTGATAATTCGATGTGA TGATATCTGTAAT16 ATAA ATAACGACGATGATTCAGTGTAAATTGGT ATGTGAAATGATGT12 ACGACGATAGCTTGATGTAGGTTAGT GTGTGAGATAATGATAT17 1,964 1,269 1,073 13,719 2,361 159 133 2 894 23 9 1,950 923 215 24,242 5,218 2,651 1,071 203 13 29,495 11,472 7,969 4,922 582 526 127 111 1 176 832 237 Reads ATATAA ATGCATACAAGAATCATAGTAAGTACAGTGATGATAATTT T09 ATATA AAACATGTATACAAAAATTACAGTAAATGCGACGA AAATAAT13 ATATAA ATGCATACAAGAATCATAGTAAGTACAGTGATGATAATT AT14 ATATATAAATGCATAC AAGGCCATGATAGATACAATGATGATAA AT11 ATGTAC AACATGCATACAGAGACTATAATAGATACAGTGATGATAATTT TTATTTT ATGTAC AACATGCATACAGAGACTATAATAGATACAGTGATGATA T19 ATGTAC AACATGCATACAGAGACTATAATAGATACAGTGATGAT T07 ATAT ATAAATGCGTAATGGTATCTTTCGGGTAGATATGTGTATAAA T14 AAATTGTATATAT 
AATGCGTAATGGTGTTTGTTG AGCGAGTGTGTGTTAT15† AT ATATATAACAAACAATGGGTGTATAATGGTATCTGAT TGTGCATTAATTTAAAAT14 AT AAAACACATAATAAATGATGAGTGCGTGAT ATAGGTATAAT13 AAAAA AAACAACAAGAACACATAGCAGATAGTGGGTG ACGTATAT12 A TAAACAACAAGAACACATAGCAGATAGTGGGTG ACGTATATGAT13 AAAT AACAACAAAAACATATAATAAATGATGGATGTGTA TAAGGTATAT15* ATAAAACTTGGA CGCTAATAGATATATGGTTAGATAATGAGAATATAT T13 ATAAAACTTGGA CGCTAATAGATATATGGTTAGATAATGAGAATATATA T14 ATAAAACTTGGA CGCTAATAGATATATGGTTAGATAATGAGAATATATAGT TAAT11 ATAAAACTTGGA CGCTAATAGATATATGGTTAGATAATGAGAATAT T10 TAAAACTTGGA CGCTAATAGATATATGGTTAGATAATGAGAATATATGGT TAAT14 AT AAAACTTGGGCGCTAATAGATATATGGTTAGATAATGAGAATATAT T12 ATACTTGGGCACA CAATAGATATATGGCTGAATAATAGAGATATATAAT T13 ATACTTGGGCACA CAATAGATATATGGCTGAATAATAGAGATATAT T18 ATACTTGGGCACA CAATAGATATATGGCTGAATAATAGAGATATATA T12 ATACTTGGGCACA CAATAGATATATGGCTGAATAATAGAGATAT TCTTTT ATACTTGGGCACA CAATAGATATATGGCTGAATAATAGAGAT T14 ATATATAACTTGGGCA TCAATGAGTATATGGTTAAGTGATGAAGATATAT T10 ATATAT AACTTGGGCGTCAATGAGTATATGGTTAAGTGATGAAGATATAT TTT ATACTTGGGCACA CAATAGATATATGGCTGAATAATAGAGATATA ATCTTTT ATAT TAAAACAACAATCAAATGATAGAAGCTTGGGTG ACTGTGAATATTTT ATATA TAAATAACATAAGACGATAACTGAATAATAGAAATT AAGTGTCAAT14 ATATATT AATAACATAGAGCAATGACCGAGTAATGA TAT12 ATATA TAAATAACATAAAACGATAACTGAATGATAGAGATT AAGTGTCAAT09 ND8 gRNA Sequences cont. 
209 489 489 500 494 495 495 494 495 500 495 523 554 554 554 554 531 528 540 536 539 536 539 539 541 539 567 598 598 598 598 21,576 5,835 2,861 1,242 884 151 1,101 405 198 127 413 20 7 5,311 386 ATAG ATAAAACACAAATAAAGGTCAAGTGATATAGAGTGATGATTAA T14 ATAAATAT AAACACAAATAAAGGTCAAGTAATGTAGAGTGATAATTAA T12 ATATAT AATAACTACACGAGACATGAATAGAAATTAAGTGATGTAAA T12 ATATAT ACTACATAAAACACAGATAAGAATCAGATAGTGTGAGATAATA T15 ATAT ATAACTACACAAAATATAGGTAAAAGTTAGATGATGTGAAATGAT T11 ATATAT ACTACATAAAACACAGATAAGAATCAGATAGTGTGAGATAAT T15 ATAT ATAACTACACAGAACATAGATAAGAGTCAGATAGTATAAAGTGATA T19 ATAT ATAACTACACAGAACATAGATAAGAGTCAGATAGTATAAAGTGAT T15* ATA AAATAACTACACAGAATACGAGTAAAGATTGAATGATGTAAA TAAT18 ATAT ATAACTACACAGAGCACAAATGAAAGTTAAGTAATGTGAAATAGT T23 ATAAA TATAAACACAGTAAATCACTCGAGATAGATAGTTATATGAGATAT T11 ATATA TAATTTCACCGTGAATTTCTTTAGATTGTAGATATGATAA T13 ATATA TAATTTCACCGTGGATTTCTTTAGATTATAGATATGATAA T06 ATATA TAATTTCACCGTGGATTTCTTTAGATTGTAGATATGATAA T05 ATATA TAATTTCACCGTGGATTTCTTTAGATTGTAGATATGGTAA T15* 210 J) NADH Dehydrogenase 9 3' 71 72 72 105 101 124 124 124 160 167 162 162 162 162 193 187 193 216 216 216 221 242 249 249 248 248 248 249 279 268 268 279 288 286 286 286 286 279 286 289 268 286 286 282 312 314 314 314 314 339 343 343 339 339 339 338 5' 33 29 25 60 60 87 87 87 117 130 117 119 116 121 149 147 150 176 177 176 176 201 204 205 202 203 205 203 239 239 240 238 239 239 239 238 243 239 242 247 239 246 243 236 272 273 271 275 272 303 298 298 297 304 305 301 Reads 18 93 32 540 472 1 536 453 1,145 137 906 65 55 30 ND9 gRNA Sequences ATAT AAACATAAACGAAATGAGCATAGAAGTATATGTATGATG ATATTAT14 ATAT AAAACATAAACGGAATAAATATAAGAGTGTATATGTGATATAAT T15 ATAT AAAACATAAACGGAATAAATATAAGAGTGTATATGTGATATAATGTTT T11 A TATAACACAAACAATAGAATAAAGTTAAGTGAGAATATGAGTGA TTTAT11 ATAT ATACAAACAATAAAATAGAATTAGATAGAGGTATAAGTGA TGTAAAT11 ATATCAAC AAACAAATAGAGCACTGTCTATGATATAAGTGATAA TTCT09 AAATATCAAC AAACAGACGAGACATCATCTACAGTATAAGTGATAA TAT10 AAATATCAAC AAACAAATAGAGCACTGTCTACGATATAAGTGATAA T05 ATAT 
ATAAAACAATAAAGAAATAGAAGGCTACAGTTAATGAGATAAAT T13 AAAATTAACAAAACAATAAAAGAACGAGAAATTACAGT GAATGAAGTAAAT07 ATA TAACAAAACAATAAAGAAATGAGAAACTGTGATTGATAAGATAAAT T12* ATATA TAACAAAACAATAAGAAGACAGAGAGTTACAGTTAATAAGATAA TTAT19* ATA TAACAAAACAATAAAGAAATGAGAAACTGTGATTGATAAGATAAATA T05 ATA TAACAAAACAATAAAGAAATGAGAAACTGTGATTGATAAGAT T11 1,196 5,984 463 ATAT AATAAAAACATACAATAAAATAGAAGGAAACTAATAAGATGATAG T15 ATATAT AACATACAATAAGACGAAGAGAAACTAATAGAGTGATAAGA T15* ATAT AATAAAAACATACAATAGAATGAAAGGAAACTAATAAGATGATA TAT11 3,869 357 620 63 87 652 485 101 94 77 58 569 561 346 86 23 26,097 7,153 1,828 1,812 553 499 451 300 262 236 197 43 5,869 250 157 93 ATACAATAT AAAACAAAAGTCACAGATTAAGGGATAGAGATGTGTGATAA T14 ATACAATAT AAAACAAAAGTCACAGATTAAGGGATAGAGATGTGTGATA T11 ATACAATAT AAAACAAAAATCGTAGATTAAAAGATAGAGATATATGATAA T14 ATAT ATATAAAAACAAAAGTCACGAATTAGAAAGTAAAGATATGTGATAA TTTT ATATAA AATCAATAACAGATCATGATGATATAGAGATAGAAGTTATAA T15 AAA AAAAATCAATCAATGATAGATCACAGTGGTATAGGGATGAGAATTA AT11 AAAA AAAAATCAATCAATGATAGATCACAGTGGTATAGGGATGAGAATT T11 ATAT AAAATCAATCAATAGCAAGTTACGATAATATAGAAGTGAGAATTATA T13 ATAT AAAATCAATCAATAGCAAGTTACGATAATATAGAAGTGAGAATTAT T13 ATAT AAAATCAATCAATAGCAAGTTACGATAATATAGAAGTGAGAATT T10* AAAAA AAAAATCAATCAATGATAGATCACAGTGGTATAGGGATGAGAATTAT T05 AAATTAT ACAACATAAGACGACGAAGATAAAGGTCATAGAGATTAATT T12 AACATAAAT CGGTGAAGATAGAGACTATGAGAATTAATT T07 ATAAAT CGGTGAAGATAGAGACTATGAGAATTAAT ATGT15 AAAATTAT ACAACATAAGACGACGAAGATAAAGGTCATAGAGATTAATTA T11 AT AAATATACAACAATATAGAACGATGAAAGTGAAGATTATAAGAGTTAATT T12 ATATATAACAACATAGAACGATGAGAATAGAGATCATGAAAGTTAATT T11* ATAC ATATACAACAATATAGAACGATGAAAGTGAAGATTATAAGAGTTAATT T11 ATATATAACAACATAGAACGATGAGAATAGAGATCATGAAAGTTAATTA T11 ATATATAACAACATAGAACGATGAGAATAGAGATCATGAAAGTT T10 AAATTAT ACAATATAAGACGACGAAGATAAAGGTCATAGAGATTAATT T10* ATAC ATATACAACAATATAGAACGATGAAAGTGAAGATTATAAGAGTTA T11 AAAATATACAACAATATGAAGCGACGAAAATAGAGACTATAGA TATTAT06 ATAAAT CGATGAAGATAGAGACTATGAGAATTAATT T13* ATATATAACAACATAGAACGATGAGAATAGAGATCATGAAA T11 ATAC 
ATATACAACAATATAGAACGATGAAAGTGAAGATTATAAGAGTT T17 ACAACAATATAGAACGATGAAAGTGAAGATTATAAGAGTTAATTAAT T13 ATAA GAACACACAGAGACAAGCAGAGTAAAGTGTATAGTGACATA T06 ATATAT ACGAACACACAGAAATAAGCAAGGTAGAATATATGATGATAT T13* ATAT ATGAACACACAGAAACAGATGAGATAGAGTATACAGTGATATAA T17* ATATAT ACGAACACACAGAAATAAGCAAGGTAGAATATATGATGAT T12 ATAT ATGAACACACAGAAACAGATGAGATAGAGTATACAGTGATATA T14 44,686 2,681 1,741 270 139 6,792 260 ATAT ACAAACAACACAAGATAGAGCACAAATGAATGCACAG TGATAAT13 ATAAACAAACAACACAGAGCAGAATACAAGTGAGTATATGAGAATA T11 GTAAACAAACAACACAGAGCAGAATACAAGTGAGTATATGAGAATA T13 ATAT ACAAACAACACAAGATAGAGCACAAATGAATGCACAGGGATAA TAT11 ATAT ACAAACAACACAAGATAGAGCACAAATGAATGCACA CTGATAAT13 ATAT ACAAACAACACAAGATAGAGCACAAATGAATGCAC TGAGATAAT15 ATATAT TAAACAACACGAAACGAAGCATAGATGGATATATGAGA TTCTTTT 211 3' 366 366 368 369 392 392 392 392 392 392 392 392 392 421 418 447 447 447 484 483 472 483 483 472 483 514 514 514 514 549 542 539 547 547 547 547 566 566 566 566 566 578 580 604 600 612 611 611 625 644 640 659 658 Reads 36 17 157 45 10,040 270 232 5,765 2,338 691 575 189 167 1 317 761 667 82 72 185 130 46 34 16 25 234 2,106 1,535 59 746 1,407 935 576 339 55 36 15 11 89 72 24 13 39 76 78 66 66 17 7 ND9 gRNA Sequences cont. 
AT ATTAAAACACAATCCGAAGAGTACAAATAGACAGTATAAGATAAAAT T11 AT ATTAAAACACAATCCGAAGAGTATAAATAGACAGTATAAGATAAAAT TAAT19 AT AAATTAAAACACAATCTAAGAAATA GAGATAGACAGTATAAGATAAAAT08† AAAATTAAAACACAATCTAAGAAATA GAGATAGACAGTATAAGATAAAAT08† ATATT AAACGCATAACAGAAATGATTAGAACTGAGATATGATT AAT14 ATATAT AGACGCATAATAAAAGCAACTGAGGCTAGAATATGATTTAA T19 TAT AAACGTATAACAAAAGCAGTTAAAGCTAGAATGTGATTTAA T14 ATATT AAACGCATAACAGAGACAATTAGAACTGAGATATGATT AAT10* ATATAT AAACGCATAATAAGGGCAACTGAAACTAGAGTATGATTTAA TTTCT12 ATATT AAACGCATAACAGAGACAATTAGAACTGAGATATGATTT T19 ATATT AAACGCATAACAGAGACAATTAGAACTGAGATAT TTT ATAT AAACGCATAATAAGGGCAACTGAAACTAGAGTATGATTT T16 ATATT AAACGCATAACAGAGACAATTAGAACTGAGAT T11 AATCAAAATATTCGTGTTCTGATGATAGAGATGTATAAT T10 ATATA TAAAACATTCGCGTTCTAATGATAGAGATGTATGATAA T12 ATATAAA ATTACCAACAGAATAGAAGCTAGACAAGTTAAGATATTT T08 ATATAAA ATTACCAACAGAATAGAAGCTAGACAAGTTAAGATATTC T12 ATATAAA ATTACCAACGAAATAAAGACTAAGTGAATTAAGATGTT ATAT05 ATAT AAACCAATCAACGAGTAAATGATGTAGAGTATTATTGTCAAT T13 ATATAT AACCAATCAATAAGTGAGCGATGTAAGATGTTATTGTTAATA T10 ATATA TAACAAATAAACGATGTGAGATATCATTACCAGTGA TTTAAT08 ATATAT AACCAATCAATAAGTGAGCGGTGTAAGATGTTATTGTTAATA T14 ATATAT AACCAATCAATAAGTGAGCGATGTAAGATGTTATTGTTAAT T13 ATATA TAACAAATAAACGGTGTAGAGTGTCA GTAT16 ATATAT AACCAATCAATAAGTGAGCGATGTAAGATGTTATTGTT T12 ATAT ATAACACTTCAACGGAGAGAGACCAATGAGAGATCAGTTAATA T13 ATAT ATAACACTTCAACGAGAAGAAGCCAGTGAGAAATCAGTTAAT T11 ATAT ATAACACTTCAATAGAAAGAAACTGACAGAGAATTGATTAAT T13 ATAT ATAACACTTCAACGAGAAGAAGCCAGTGAGAAATCAGTT TTTAATTGT14 ATATAAAATAACAATACAAGTGAATCAGATAATGGATGATGTTTCAAT T13 ATAT ATAACAATACAAATGAACTGAGTAATGGATAGTA GTTTGACAAT14* ATAT ATAATACAAACAAACTGAGTGATGGATAGTGCTTTAATGAGAA TAT14* ATAGAATAACAATATAAACGAACTAGATGATGGATAGTATTTTGATA TAT16* ATAGAATAACAATATAAACGAACTAGATGATGGATAGTATTTTGATA TATGCCACG GTAGAATAACAATATAAACGAACTAGATGATGGATAGTATTTTGATA TATTTT ATAGAATAACAATATAAACGAACTAGATGATGGATAGTAT CT17 AAACGTACATATG ATCTCTTCTGCTAGTATATAAGATGATAATA AT14 AACGTACATATG ATCTCTTCTGCTAGTATATAAGATGATAATAT T11 ATATG 
ATCTCTTCTGCTAGTATATAAGATGATAAT TTCT16 AAATAAACGTACATATG ATCTCTTCTGCTAGTATATAAGATGATAATAT T08 ATG ATCTCTTCTGCTAGTATATAAGATGATAATA AT15 ATATT AACGTACATACTGTCTTCTCTATTGACATATAAGATGATAAT T13 ATAT TAAACGTACATATTATTTTCTCTACTGATATACAAG TGATAAT09 ATATA TATGCAACAACAGAGATAATATTGTAGATGTATATGT GATATCGT11 ATATA TAACAACAAAAGTGACATTGTAAACGTATATG ATGAGTTTT TATT AATTGGTATGTGACAATAGAAGTGATATT T09 AATGCAGATAATT ATTGGTGTGCAATGATAGAGGTAATATT T12* AATGCAGATAATT ATTGGTGTGCAATGATAGAGGTAATATTGT T12* A AATGCAAATAGAAGTTGGTGTGCAAT TATAGAGATGATAT09 274 1 1 1,673 ATATATAAGAATTACAATGGT GTATTAAGTGAAGTAATGTAAATGAGAATT T12 GATA GTTAGATAGAATAATGTAAGTGAAAATT T12 ATATATAA AGGATTACAGTGGTGATATTGAATAGAGTGATGTAAATG T12 ATATAT GAATTACAACGGTGATATTGAATAGAATAATGTGA TTGAAAT14* 212 5' 319 319 343 343 352 349 349 352 349 351 356 351 358 382 380 409 409 410 439 438 437 438 439 447 442 468 469 469 472 502 509 497 501 501 501 508 534 533 535 533 534 535 543 568 569 582 582 580 597 609 609 615 618 K) Ribosomal Protein Subunit 12 3' 78 76 78 78 109 99 106 115 115 115 115 115 121 121 121 121 121 121 131 158 170 164 158 208 195 207 201 207 207 235 235 235 235 235 229 235 235 5' 43 35 43 38 63 66 74 73 73 74 77 79 96 92 96 96 93 94 96 119 139 132 133 169 164 158 164 156 158 198 194 196 198 194 198 200 196 Reads 12,531 2,218 1,663 423 RPS12 gRNA Sequences ATAT ACAACAACCATATGAAGATCATGTACGTAGAAGA TTGATATAT14 ATA AACAACCGTACAGAAGTTACATATGCAGAGAAGGTGAGAT TTAT12 ATAT ACAACAACCATATGAAGATCATGTACGTAGAAGA TTGATATAT12 ATAT ATAACAACCATACAGAAATCGTATATGTGAGAGAAGTGA TTTCT15 5,122 32 1,212 120 1,879 1,091 466 311 233 90 1,542 1,066 793 219 2 3 44 18 56 4,724 146 67 896 192 104 AT ATAATATAAAACAAATAAGACAGAGTGTAGATAGTAATTGTATGA TAT12 AT ATAAATAAAACAAAACGTAAATGATAACTGTG AGATGAT07 ATAT ATATAAAACAAATAGAATAGAACGTAGATGAT TACTGTATAAT12 A TATATAATAACATAAAACAAATAGAACGAGATGTAAATGATA TCTAT13 ATA TATATAATAACATAAGACAGATAGAACGAGATGTAAATGATA T21 ATA TATATAATAACATAAGACAGATAGAACGAGATGTAAATGAT T13 ATATA TATATAATAACATAAGACAGATAGAACGAGATGTAAAT T05* ATA 
TATATAATAACATAAGACAGATAGAACGAGATGTAA T14 AATCA CGGATTTATATAGTAACGTAAAATGA TATTAT12 AACTGGGC-ATCT CGGATTTGTATAGTGATATAAAGTGAATAA TTTT ATATAAACTGTGCAATCGA TGGACTTATATAGTGATGTAAGATGA TAAT12† ATATAGAACTAGGCAGTCA CGGATTTGTATAGTAATGTAGAATGA TAT14† ATATAGAACTGGCAATTT CGGATTTATATAGTGACATGAGATAGATA T15*† ATATAGAACTGGCAATTT CGGATTTATATAGTGACATGAGATAGAT T10† ATATAGAATTA GGCAATCGCGGATTTATATAGTAACGTAAAATGA TAT10 ACT TACAATACACGTTGGTTATCGGAGTTAGGTGATTGTG ACTTAT10 ATATA ACGGCATATAGTATACGTCGGTTACTAGGATTGTGTAATTTT CATATAAA GGCATATAGTATACGTCGGTTACTGGGATTGTGTAAT07 ATACT TACAATACACGTTGGTTATCGGAGTT AGATGAT13 ATAAAT AACAACGCAATATCCGAGTAAGATTGTATAAGTGAGATAT AT12 ATAACGCAACA TCAGATGAGATTATATAAGTGAGATATG ATATAT11 ATAT ATAACGCAACATTCGAATGAGATTATGTAGATGAAATATGGTAT TAT05 ATA TAACATCCAAACAAGATTATATAGGTAGAGTATG ATGTATAATTTTAT22 ATAT ACAACGTAACATTCAGATAAGATTGTGTAGATAGAATATGGTATAT T12 ATAT ACAACGTAACATTCAGATAAGATTGTGTAGATAGAATATGGTAT T06 4,950 3,025 222 444 338 36 25 20 ATATATAATGAC TAACTAAACTGATAAAGCAGTAGAAGAGATGATGTAAT T11 ATATATAATGAC TAACTAAACTGATAAAGCAGTAGAAGAGATGATGTAATATTT T11 TAATGAC TAACTAAACTGATAAAGCAGTAGAAGAGATGATGTAATAT AT14 ATATATAATGAC TAACTAAACTGATAAAGCAGTAGAAGAGACGATGTAAT T12* ATATATAATGAC TAACTAAACTGATAAAGCAGTAGAAGAGACGATGTAATATTT T10 ATAACTT GACTAATAGAGTAGTGAGAGAGACAGTGTAAT T08 ATATATAATGAC TAACTAAACTGATAAAGCAGTAGAAGAGACGATGTA T15 ATAATGAC TAACTAAACTGATAAAGCAGTAGAAGAGACGATGTAATAT AT12 213 5' 203 203 205 209 206 200 203 203 208 204 203 207 203 210 211 203 203 203 205 212 203 203 203 203 203 203 203 203 234 234 248 234 248 288 267 269 288 309 309 3' 246 246 246 246 246 246 246 246 246 246 246 246 246 246 246 250 246 246 246 246 246 246 246 246 246 246 246 245 264 280 281 280 282 322 322 308 322 349 336 Reads 341,382 1,324 1,091 964 922 609 544 530 484 430 370 370 312 282 268 267 196 193 177 176 173 157 141 130 119 117 116 56 35 24 14 434 270 16,349 10 3,731 909 128 195 RPS12 gRNA Sequences cont. 
ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAGATAAT AT14 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAAATAAT AT15 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAGATA T13 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGA TAATAT12 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAGAT T14 ATAT ATAATGACATAATTAGACTGATAAGATAACGAGAAAAGTGATGTA T12 ATATAT ATAATGACGTAACTGAGCTAATGAAGCAATGAGAGAGATAAT AT13 ATATAT ATAATGACGTAACTGAGCTAATGAGACAATGAGAGAGATAAT AT12 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAG TTAATAT11 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAGATAA AAT13 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGGGATAAT AT12 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAGA AATAT13 ATATAT ATAATGACGTAACTGAACTAATGAGGCAATGAGAGAGATAAT AT12 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAG CGATAATAT14 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGA T12 AT ATAAATAATGACATAACTAGGTTAGTAAAGTGACGAAGAAGATAAT ATTATTTT ATATAT ATAATGACGTAACTGAGTTAATGAGGCAATGAGAGAGATAAT AT13 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAACGAGAGAGATAAT AT21 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAAATA T15 ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAG CGAGATAATAT14 ATATAT ATAATGACGTAACTGAGCTAATAAGGCAATGAGAGAGATAAT AT14 ATATAT ATAATGACGTAACTGAGCTAATGGGGCAATGAGAGAGATAAT AT12 ATATAT ATAATGACGTAACTGAGCTAATGAGGCGATGAGAGAGATAAT AT19 ATATAT ATAATGACGTAACTAAGCTAATGAGGCAATGAGAGAGATAAT AT13 ATATAT ATAATGACGTAACTGAGCTAATGAGGTAATGAGAGAGATAAT AT12 ATATAT ATAATGACGTAACTGAGCCAATGAGGCAATGAGAGAGATAAT AT16 ATATAT ATAATGACGTAACTGGGCTAATGAGGCAATGAGAGAGATAAT AT13 ATAAAAT TAATGACATAACTAAATTGATAGGGTAATGAGAGAGATAAT AT09 TAG TGCCTTCTATAGTAGATGATGATATA TGAT14 ATATA AGATCAACAAAACTGCCATTTTCTGTAGTAAGTGATGATATA T14 ATA TAAATCAACAGAACTGCCATCTTTTGTAGTA TAGTGATATAAT13 ATATA AGATCAACAAAACTGCCATTTTCTATAGTGAGTGATGATATA T16 ATAT GTAAATCAACAGAACCGTCATCTTTTGTAGTA TAGTGATATAAT12 ATA TACAATACGTGTATGATATTTTATACT AGGTAGATCAGTGAAATT T12 ATA TACAATACGTGTATGATATTTTATACTGGGTAGATCAGTGAAATT T07 ATA TATAATACTTTACATCGGGTAAATTGACGAGA ACATGAT11 ATA TACAATACGTGTGTAATATTTTATACT AGGTAGATCAATGAAAT15 AAATAT 
AACATATCTTATATCTGAATCTAACTTGTAATATGTG AAT16 AAATAAAACATATCTGAT TCTAAATCTAACTTGTAATGTGTG AAT25†

*Indicates that the tail sequence was shortened where random nucleotides after the poly U tail had been indicated.
†Indicates that the gRNA was identified under conditions of reduced stringency.

APPENDIX D. Identified CR3 mRNA and gRNA transcripts.

A-C: Major CR3 mRNA and gRNA sequence classes. The CR3 mRNA transcriptome was generated using the TREU 667 cell line. Identified sequences were then used to search gRNA transcriptomes from four different cell lines: EATRO 164 Bloodstream (BS), EATRO 164 procyclic (PC), TREU 927 procyclic and TREU 667 procyclic. ORF = previously identified Open Reading Frame (purple protein sequence). ARF = newly identified Alternative Reading Frame (green protein sequence). Alternatively edited nucleotides are shown in red. Inserted U-residues are lowercase, while deleted U-residues are shown as asterisks. Canonical Watson-Crick base pairs are marked (|) and G:U base pairs are marked (:). Previously identified start codons are double underlined. Potential upstream AUG start codons are indicated by wave underlines. gRNAs were sorted based on guiding sequence class. Sequence variations observed in the 3'-U-tail were ignored in assigning class. Transcript copy numbers (Reads) were determined by adding all gRNAs of the same sequence class. Only major sequence classes are shown (defined as containing greater than 100 transcript copies). In the case of rare transcripts, the identified gRNAs are shown regardless of copy number. gRNA transcript numbers varied greatly between the different cell lines. Interestingly, the most abundant mRNA (CR3 Form C, 7147 reads) had the fewest identified gRNA reads.

A.
CR3 Form A

Reads   mRNA Sequence
2114    AUGUGUAUGAUAUAUAAuuAAuuAuuuuCAuuuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuGuA
1217    AUGUGUAUGAUAUAUAA--AAuuAuuuuCAuuuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuGuA
480     AUGUGUAUGAUAUAUAA--AA--AuuuuCAuuuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuGuA

Fully Edited Form
Reading Frame: ORF
M C M I Y N ST M F D C L V L L F F Y C L F V
AUGUGUAUGAUAUAUAAuuAAuuAuuuuCAuuuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuGuA
|||||||:|:|||||:|:|||:|:||| |||:|:|||||||||||
NUUAAUUAGUGAAAGUGAGAUAUAGACU----AACGAGCCAAAACAACAUGUAUA

Cell Line      Reads   gRNA Sequence
EATRO 164 PC   21297   ATATGTACAACAAAACCGAGCAATCAGATATAGAGTGAAAGTGATTAATN
TREU 927 PC    3398    ATATAACAACAAAACTGAACAATCAAATGTAGAGTGAAAGTGATTAATN
TREU 667 PC    452     ATATAACAACAAAACTGAACAATCAAATGTAGAGTGAAAGTGATTAATN

B. CR3 Form B

Reads   mRNA Sequence
413     AUGUGUAUGAUAUAUAuuAuuAuuuAuuuAuuuuCAuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuG
460     AUGUGUAUGAUAUAUA--AuuAuuuAuuuAuuuuCAuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuG

Fully Edited Form
Reading Frame: ARF +1
M C M I Y I I I Y L F S L C L I V W F C C F F I V C L
AUGUGUAUGAUAUAUAuuAuuAuuuAuuuAuuuuCAuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuG
|||:|||:|||:|:|:|::|:||||||:|::|| ||||||||||
NUAUGAUAGUAAGUGAGUGGAGGUAAUAUAGGCU----AACAAACCAAUAUAUA

Cell Line      Reads   gRNA Sequence
EATRO 164 PC   5       ATATATAACCAAACAATCGGATATAATGGAGGTGAGTGAATGATAGTATN
TREU 927 PC    14164   TATATAACCAAACAATCGGATATAATGGAGGTGAGTGAATGATAGTATATN
TREU 927 PC    642     ATATATAACCAAACAATCGGATATAATGGAGGTGAGTGAATGATAGTATATN
TREU 927 PC    114     ATATATAACCAAACAATCGGATATAATGGAGGTGAGTGAATGATAGTATN
TREU 667 PC    1936    ATATATAACCAAACAATCGGATATAATGGAGGTGAGTAAATGATAGTATN
TREU 667 PC    236     ATATATAACCAAACAATCGGATATAATGGAGGTGAGTGAATGATAGTATN

C.
CR3 Form C mRNA Sequence AUGUGUAUGAUAUAUAAAAACA--AuGuGuA*****UGuuGuuGuuuuGuuuuG***AuuuuGGuuGuACAuuuuuuuuG AUGUGUAUGAUAUAUAAAAACA--A-GuGuA*****UGuuGuuGuuuuGuuuuG***AuuuuGGuuGuACAuuuuuuuuG AUGUGUAUGAUAUAUAAAAACAuuA-GuGuA*****UGuuGuuGuuuuGuuuuG***AuuuuGGuuGuACAuuuuuuuuG AUGUGUAUGAUAUAUAAAAACA-uAuGuGuA*****UGuuGuuGuuuuGuuuuG***AuuuuGGuuGuACAuuuuuuuuG Fully Edited Form ATAATAAAAATGCACAACTAGAATTGAAGTAAAATGATGATATATATN ATATTAAAAATGCACAACTAGAATTGAAATAAAGTGATGGTATATATN ATATATAAAATGTACAACCAGAATTAAGATAAAGTGATGATGTATATATN gRNA Sequence ATAATAAAAATGCACAACTAGAATTGAAGTAAAGTGATGATATATATN AAAATGCACAACTAGAATTGAAGTAAAGTGATGATATATATATN M C M I Y K N N V Y V V V L F W F W L Y I F F V AUGUGUAUGAUAUAUAAAAACAAuGuGuA*****UGuuGuuGuuuuGuuuuG***AuuuuGGuuGuACAuuuuuuuuG |||:|:|| |:|:::|:||:|:::||: |:||:||:|||#||||| NUUAUAUAU-----AUAGUGAUAAGAUGGAAU---UGAAGCCGACACGUAAAUUAAUAUA Cell Line EATRO 164 PC EATRO 164 PC EATRO 164 PC EATRO 164 BS EATRO 164 BS EATRO 164 BS TREU 927 PC TREU 927 PC TREU 927 PC TREU 667 PC TREU 667 PC TREU 667 PC TREU 667 PC * no gRNAs identified. ATAATAAAAATGCACAACTAGAATTGAAATAAAGTGATGGTATATATN ATATAATTAAATGCACAGCCGAAGTTAAGGTAGAATAGTGATATATATN ATATAATTAAATGCACAGCCGAAGTTAAGGTAGAATAGTGATATATATATN ATATATAAAATGTACAACCAGAATTAAGATAAAGTGATGATGTATATATN ATATATAAAATGTACAACCAGAATTAAGATAAAGTGATGATGTATATATN TATAATTAAATGCACAGCCGAAGTTAAGGTAGAATAGTGATATATATATN ATAATAAAAATGCACAACTAGAATTGAAGTAAAATGATGATATATATATN ATAATAAAAATGCACAACTAGAATTGAAATAAAGTGATGGTATATATN Reads 7147 889 565* 505 Reading Frame ARF +1 Reads 122 3 1 453 224 2 166 117 86 53 15 14 5 216 APPENDIX E. ND7 5'-most gRNA populations and the predicted mRNA sequences generated. A-J: ND7 gRNA major classes and predicted editing patterns. ND7 terminal (5' most) gRNA populations and the predicted mRNA sequence generated. Predicted sequences presented are based on the most abundant gRNAs that generate each reading frame found in the four gRNA transcriptome databases. 
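The reading frames reported for each predicted sequence can be checked with a short translation routine. The following is a minimal Python sketch, not the pipeline used in this work. It assumes that UGA is read as tryptophan (consistent with the W shown at UGA codons in the translations listed here), that UAA and UAG are stops, that asterisks mark deleted U-residues and are skipped, and that lowercase u marks an inserted U. The names `CODON_TABLE`, `clean` and `translate` are illustrative only. The example sequence is the ND7 Form C predicted mRNA listed in this appendix.

```python
from itertools import product

# Codon table in UCAG order. Assumption: UGA = Trp (W); UAA and UAG are stops.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def clean(mrna):
    """Drop deletion (*) and gap (-) markers and upper-case inserted u's."""
    return mrna.replace("*", "").replace("-", "").upper()


def translate(mrna, start=0):
    """Translate from index `start` of the cleaned sequence to the first stop."""
    seq = clean(mrna)
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# ND7 Form C predicted mRNA, copied from the appendix listing.
ND7_FORM_C = (
    "AUGACUACAUGAUAAGUACAuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG"
)

# Position of the second AUG in the cleaned sequence.
second_aug = clean(ND7_FORM_C).find("AUG", 1)

print(translate(ND7_FORM_C, 0))           # MTTW
print(translate(ND7_FORM_C, second_aug))  # MISTFMLFLVVFLHLYRFTFGPQ
```

Translating from the first AUG stops after M T T W, while translating from the second AUG reproduces the M I S T F M L F L V V F L H L Y R F T F G P Q translation shown for Form C.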
Initial characterization of the ND7 transcript was done using the EATRO 164 cell line and is unusual in that it is edited in two distinct domains [20]. While the 5' domain was edited in both life cycle stages, complete editing of the 3' domain was only detected in bloodstream stage parasites. Interestingly, the most abundant EATRO 164 PC (procyclic or insect form) gRNA would generate a sequence that brings the 5' most AUG into a +2 frame. The ARF is 65 AA long and involves the entire 5' editing domain. In contrast, the most abundant gRNAs in the EATRO 164 Bloodstream stage library (EATRO 164 BS), would generate sequences that use the originally described ND7 ORF). While gRNA transcript numbers again varied greatly between the different cell lines, all three cells lines had gRNA sequence variants that allowed access to both reading frames. A. ND7 Form A Predicted mRNA Sequence M I S I I L C Y F W ST M T T W ST M L F L V V F L H L Y R F T F G P Q AUGACUACAUGAUAAGUAuCAuuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG :|||||||:||:||||:|||:||:||||:|:|||||||||||| NUAUAGUAAGAUGCAAUGAAAGCCGUCAAGAGAAUGUAAACAUAUAAA Cell Line EATRO 164 BS TREU 927 PC TREU 667 PC gRNA Sequence AAATATACAAATGTAAGAGAACTGCCGAAAGTAACGTAGAATGATATN AAATATACAAATGTAAGAAAACTATCGAGAGTGATGTAGAATGATATN AAATATACAAATGTAAGAAAACTATCGAGAGTGATGTAGAATGATATN B. ND7 Form B Predicted mRNA Sequence M T T W Y S I I L C Y F W ST M I ST M L F L V V F L H L Y R F T F G P Q AUGACUACAUGAUAuAGUAuCAuuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG |||:|||||||:||:||||:|||:||:||||:|:|||||||||||| NUAUUAUAGUAAGAUGCAAUGAAAGCCGUCAAGAGAAUGUAAACAUAUAAA Cell Line EATRO 164 BS TREU 927 PC TREU 667 PC gRNA Sequence AAATATACAAATGTAAGAGAACTGCCGAAAGTAACGTAGAATGATATTATN AAATATACAAATGTAAGAAAACTATCGAGAGTGATGTAGAATGATATTATN AAATATACAAATGTAAGAAAACTATCGAGAGTGATGTAGAATGATATTATN C. 
ND7 Form C Predicted mRNA Sequence M T T W ST M I S T F M L F L V V F L H L Y R F T F G P Q AUGACUACAUGAUAAGUACAuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG |||||::|:|:||:|:::||::|:|||||||||||||||| NUAAAUGUAGUGAAGAUUGUCGGAGAAAUGUAAACAUAGCAUAUACA Cell Line TREU 927 PC gRNA Sequence ACATATACGATACAAATGTAAAGAGGCTGTTAGAAGTGATGTAAATN D. ND7 Form D Predicted mRNA Sequence M T T W ST M I I V S F M L F L V V F L H L Y R F T F G P Q AUGACUACAUGAUAAuuGUAuCAuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG |||:|||||:||||:|||::|:||:|||:||:||||||||||||| NUAAUAUAGUGAAUAUAAUGGAGACUAUCGAAGAAAUGUAAACAUAUAAA Cell Line EATRO 164 PC TREU 927 PC TREU 667 PC gRNA Sequence AAATATACAAATGTAAAGAAGCTATCAGAGGTAATATAAGTGATATAATN AAATATACAAATGTAAAGAAGCTATCAGAGGTAATATAAGTGATATAATN ATATACACAAATGTAAAGAGACTATCGAGAGTGACATAAGTGATATAATN 217 Reading Frame ORF Reads 20487 10537 787 Reading Frame ORF Reads 35079 38432 4365 Reading Frame ORF Reads 75654 Reading Frame ORF Reads 240 477 1152 E. ND7 Form E Predicted mRNA Sequence M T T W ST M I M T F F M L F L V V F L H L Y R F T F G P Q AUGACUACAUGAUAAuG*ACAuuuuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG ||: |#||:::|||||::|:|:|:::||||:|:|||||||||:| NUAU-UAUAGGGAAUACGGUGAGAGUUAUCAGAGAAAUGUAAAUAAUAUA Cell Line EATRO 164 PC gRNA Sequence ATATAATAAATGTAAAGAGACTATTGAGAGTGGCATAAGGGATATTATN F. ND7 Form F Predicted mRNA Sequence M T T W ST M I S T F M L F L V V F L H L Y R F T F G P Q AUGACUACAUGAUAAGUACAuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG |||||::|:|:||:|:::||::|:|||||||||||||||| NUAAAUGUAGUGAAGAUUGUCGGAGAAAUGUAAACAUAGCAUAUACA Cell Line TREU 667 PC gRNA Sequence ACATATACGATACAAATGTAAAGAGGCTGTTAGAAGTGATGTAAATN G. 
ND7 Form G
Predicted mRNA Sequence
M T T W Y S I I Y V I F G S F F T F V S F Y I W S T A
AUGACUACAUGAUAuAGUAuCAuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG
|||:|||||:|||||::|:|:|:::||||:|:|||||||||:|
NUAUUAUAGUGAAUACGGUGAGAGUUAUCAGAGAAAUGUAAAUAAUAUA
Cell Line: gRNA Sequence
EATRO 164 PC: ATATAATAAATGTAAAGAGACTATTGAGAGTGGCATAAGTGATATTATN
EATRO 164 BS: ATATAATAAATGTAAAGAGACTATTGAGAGTGGCATAAGTGATATTATN

H. ND7 Form H
Predicted mRNA Sequence
M T T W ST M I S T F Y V I F G S F F T F V S F Y I W S T A
AUGACUACAUGAUAAGUACAuuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG
|||:||:||||:|||:||:||||:|:||||||||||||
NUAAGAUGCAAUGAAAGCCGUCAAGAGAAUGUAAACAUAUAAA
Cell Line: gRNA Sequence
EATRO 164 BS: AAATATACAAATGTAAGAGAACTGCCGAAAGTAACGTAGAATN

I. ND7 Form I
Predicted mRNA Sequence
M T T W ST M I S T M L F L V V F T F V S F Y I W S T A
AUGACUACAUGAUAAGUACAAuGuuAuuuuuGGuAGuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG
||:|:|||:|::|||::||:||||:|::||||:||||||||||
NUAUAGUAAGAGUCAUUGAAGAUGUGAGUAUAGUAAAAUGUAAAAUAUA
Cell Line: gRNA Sequence
TREU 667 PC: ATATAAAATGTAAAATGATATGAGTGTAGAAGTTACTGAGAATGATATN

J. ND7 Form J
Predicted mRNA Sequence
M T T W ST M I S T F I V I F G S F F T F V S F Y I W S T A
AUGACUACAUGAUAAGUACAuuuAuuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAG
||||||::|:|:||:|:::||::|:||||||||||||||||
NUAAAUAGUAGUGAAGAUUGUCGGAGAAAUGUAAACAUAGCAUAUACA
Cell Line: gRNA Sequence
TREU 927 PC: ACATATACGATACAAATGTAAAGAGGCTGTTAGAAGTGATGATAAATN

Reading Frame and Reads, Forms E-J (one read count per cell line, in the order listed in each panel)
E. ORF: 765
F. ORF: 6623
G. ARF +2: 100761, 354
H. ARF +2: 402
I. ARF +2: 12929
J. ARF +2: 251

APPENDIX F. RPS12 terminal (5' most) gRNA populations and the predicted mRNA sequences generated. A-E: RPS12 gRNA major classes and predicted editing patterns.
RPS12 differs from both CR3 and ND7 in that the alternative edit that shifts the reading frame occurs just downstream of the previously identified start codon (double-underlined). We do note that the identified alternative gRNAs are rare in all of the gRNA libraries except TREU 667.

A. RPS12 Form A
Predicted mRNA Sequence
M W F L Y G C C L R F V L F V
CAAACUAAAGUAAuAuAuuAGuuuuuuGCGuAuGuGA*UUUUUGUAUG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGu
||||||:|:|:||:|:||||::||| |:||::|||| ||||||||
NUAUAUAGUUAGAAGAUGCAUGUACU-AGAAGUAUAC-CAACAACAUAUA
Cell Line: gRNA Sequence
EATRO 164 PC: ATATACAACAACCATATGAAGATCATGTACGTAGAAGATTGATATATN
EATRO 164 BS: ATATACAACAACCATATGAAGATCATGTACGTAGAAGATTGATATATN
TREU 927 PC: ATATACAACAACCATATGAAGATCATGTACGTAGAAGATTGATATATN
TREU 667 PC: ATATACAACAACCATATGAAGATCGTGTACGTAGAAGATTGATATATN

B. RPS12 Form B
Predicted mRNA Sequence
M W F L Y G C C L R F V L F V
CAAACUAAAGUAAuAAAuuuuGuuuuuuuuGCGuAuGuGA*UUUUUGUAUG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGu
||||||:|::::||:|:|||:|||||:| :||:||||:| ||||||
NUAUUUAGAGUGGAAGAGACGUAUACAUU-GAAGACAUGC-CAACAAAUA
Cell Line: gRNA Sequence
EATRO 164 PC: ATAAACAACCGTACAGAAGTTACATATGCAGAGAAGGTGAGATTTATN
TREU 927 PC: ATAAACAACCATACAGAAGTTACATATGCAGAGAAGGTGAGATTTATN
TREU 667 PC: ATAAACAACCGTACAGAAGTTACATATGCAGAGAAGGTGAGATTTATN

C. RPS12 Form C
Predicted mRNA Sequence
M W F C M V V V Y V L F Y L F
CAAACUAAAGUAAAAAGuuuuuuuuuuuuGCGuAuGuGA**UUUUGUAUG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGu
|||:|:|:|:|:|||::|||||:||| ||:|||||| |||:|
NUUUUAGAGAGAGAAAGUGCAUAUACU--AAGACAUAC-CAAUAUAUA
Cell Line: gRNA Sequence
EATRO 164 BS: ATATATAACCATACAGAATCATATACGTGAAAGAGAGAGATN

D. RPS12 Form D
Predicted mRNA Sequence
M L F F F R M W F C M V V V Y V L F Y L F
CAAACUAAAGUAuuAuAuAAuGuuGuuuuuuuuuCGuAuGuGA**UUUUGUAUG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGu
|:||||||||:|::|||:|:|:||:|||::|| ||||||||| ||||||
NUGAUAUAUUAUAGUAAAGAGAGAGUAUAUGCU--AAAACAUAC-CAACAAGAUAUA
Cell Line: gRNA Sequence
TREU 927 PC: ATATAGAACAACCATACAGAAATCGTATATGCGAGAGAAATGATATTATATN

E.
RPS12 Form E
Predicted mRNA Sequence
M W F C M V V V Y V L F Y L F
CAAACUAAAGUAAuuuAAAuuuuGuuuuuuuuGCGuAuGuGA**UUUUGUAUG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGu
||||||||:|::|:|:|||::||||||||| ||:|||||| |||:|
NUAAAUUUAGAGUAGAGAAAGUGCAUAUACU--AAGACAUAC-CAAUAUAUA
Cell Line: gRNA Sequence
TREU 667 PC: ATATATAACCATACAGAATCATATACGTGAAAGAGATGAGATTTAAATN

Reading Frame and Reads, Forms A-E (one read count per cell line, in the order listed in each panel)
A. ORF: 12531, 1663, 505, 936
B. ORF: 2218, 300, 3834
C. ARF +1: 144
D. ARF +1: 22
E. ARF +1: 2664

APPENDIX G. Alignments of T. brucei and T. vivax edited mRNAs: ATPase 6 (A), COIII (B), CR3 (C), CR4 (D), ND3 (E), ND7 (F), ND8 (G), ND9 (H), and RPS12 (I). Uppercase letters indicate nucleotides originally encoded in the DNA, lowercase u's indicate uridines inserted during editing, and asterisks indicate uridines removed during editing.

A. ATPase 6 - Pan-edited non-dual coding
AAAAAUAAGUAUUUUGAUAUUAUUAAAGUAAAuAuGuuuuuAuuuuuuuuuuGuGAuuuA T. brucei
---------------------------------AuGuuuuuGuuuuuuuuuuGuGAuuuG T. vivax
UUUUG-GuuGCGuuuGuuA----uuAuGuAuGuAuuAuuGuGuAuGAuCuAGGuuAuGuu T. brucei
UUUUG*GuuGCGuuuGuuA****UUAuGUGuGuAuuAuuGuGuGuGAuCuAGGuuAuGuu T. vivax
uuAuuGuGuAuuuuAA---uUGuuuAAuGuuGAuuuuuG-AuuuuuuAuuAuuuuGuuuG T. brucei
uuGuuGuGuAuuuuAA***UUGuuuGAuGuuAAuuuuuG*AuuuuuuGuuGuuuuGuuuG T. vivax
*UUUGAuuuGuAuuuGuuuGuuGGuuuGuG***UUUGuuuuuAuuGuuGuGGuuuAuGuu T. brucei
-uuuGAuuuGuAuuuGuuuAuuGGuuuAuG---uuuAuuuuuGuuAuuGuGGuuuAuGuu T. vivax
GuuuA----AuuuAuAuAGuuuAAUUUUGuAuuA*UUGuAuuAC----uUAUUUG***AA T. brucei
GuuuA****AuuuGuAuAGuuuGAUUUuGuAuuA*UUGuAuuACC****UAuuuG--*AA T. vivax
uuuG*UAuuUGuuGuuuuGuAuuGuuuuuuuAuuGuA-----uAuuG-CAuuuuuAuuuu T. brucei
uuuG-uAUUUGuuGUUUuGuAuuGuuuuuuuAuuGuA*****UA-uGuCAuuuuuGuuuu T. vivax
uGuuuuGuuuuuuA-uGuGAuuuuuuuuuGuuuAAuAAuuuGuUA-GuuGGuGAuA**** T. brucei
uGuuuuGuuuuuuGuuG-GAuuuuuuuuuGUUUAAuAGuuuGuuGUG-uGGuGAuA---- T. vivax
GuuuuAuGGAuG--uuuuuuuuAUUC**GuuuuuuGuuGuGuuuuuuAGAGuGuuuuuCu T.
brucei GuuuuAuGGAuGuuuuuuuuuuG-*C--GuuuuuuGuuGuGuuuuuuAGAAuGuuuuuCu T. vivax uuGuuGuGuC----GuuGuuuGuCGACGuuuuuGCGuuuG-UUUUGuAAuuuAuuAuCAu T. brucei uuGuuAuGUC****GuuGuuuGuCAACAuuuuuACGuuuG*UUUUGuAAuuuAuuGuCAu T. vivax CCCAuUUUUUAuuGuuGAuGuuuuuuG-A-uuuuuuuUAuuuuA-uuuuuGuuuuuuuuu T. brucei CCCAUUUUuuAuuGuuAAuGuuuuuuGuAuuuuuuuuUAuuuuAuuuuuuG-uuuuuuuu T. vivax uuuA--------uG----GuGuuuuuuG--uuA-uuGAuuuAuuuuAuuuAuuuuuGuG- T. brucei uuuGuuuuuuuuuGuuuuGuG-----uAuuuuAuuuGGuuuAuuuuGuuuG--uuuGUGu T. vivax -uuuuGuuuuuGuuuAuuAuuuuAU--G-uGuuuuuAuAuUUGuuGGAuuuAUUuGCC-- T. brucei uuuuuGuuuuuGuuuGuuGuuuuAUuuGuuG---uuGuAuuuAuuGGAuuuAuuuGCC** T. vivax ***GC-CA-uAuuAC****AGuuAuuuAuuuuuuGuAAuAuGAuuuuGCAGuuGAuAAuG T. brucei ***GC**GuuAuuAC----AGuuGuuuAuuuuuuGuAAuAuGAuuuuGCAAuuGGuAAuG T. vivax --G**AuuuuuuGuuGuuuuuGuuG-uuuGuuuAGuuuuGuAuuuGAuuuuuGAuAGuuA T. brucei **G**AuuuuuuAuuGuuuuuGuuGuuuuG-uuAG------------------------- T. vivax uuAuAuuGuuGuuGAAAuuuG**GuuUGuuA**UUGGAGUUAUAGAAUAAGAUCAAAUAA T. brucei ------------------------------------------------------------ T. vivax GUUAAUAAUA T. brucei ---------- T. vivax 220 B. COIII - Pan-edited non-dual coding GGUUAUUGAGGAUUGUUUAAAAUUGAAUAAuuAuuAuuuuuuuAuGuuuuuGuuuC**** T. brucei ------------------------------------------------------------ T. vivax *GuuGuAuAuuuGuuGGuGuuA-****GuGGuGuuuuuGuuuuuuuAuCuuuACCuGCCA T. brucei ------------------------------------------------------------ T. vivax uuGuuAuuGuGuAuuGGuuAuuuuGuuuGuuG****GGAuuuAuuuGuuuAuuGUUUG** T. brucei ------------------------------------------------------------ T. vivax **GuAGuuuuuuAuuuGuuGAuuGuG****GuuuuAuuuuuuuuuuuGuuGGuuuuuGuA T. brucei ------------------------------------------------------------ T. vivax uuuGuuuGuuGuuGuuAuuGuuAGAuuuGuuuuGuGAuuuuuuACGuGGuuuAuuuGAuu T. brucei ------------------------------------------------------------ T. vivax uuuGuGuuuuAuuACGuuGuAuCCAGuAuuGuuuuuuAuGGuuuuuAuGuAG*UGAGuuu T. brucei ------------------------------------------------------------ T. 
vivax GuuuuAuuuAuGGCGuuuuuuG**UUGuAuuAuuuGGuuuAuGuuuAuuuuuGuGuuGuG T. brucei ------------------------------------------------------------ T. vivax AGuuuGCUUUCGuuuuuuGuuuACCuuAuAuGuuuuGuuGuuuAuuAuGuGAuuAuGGuu T. brucei ------------------------------------------------------------ T. vivax uuGuuuuuuAuuGG*UAuuuuuuAGAuuuAuuuAAuuuGuuGAuAAAuACAuuuuAUUUG T. brucei ------------------------------------------------------------ T. vivax uuUGuuAGuGGuuuAuuuGuuAAuuuuuuuGuuuuGuGUUUUUGGuuuAGGuuuuuuuGu T. brucei ------------------------------------------------------------ T. vivax uG**UUGuuGuuuuGuAuuAuGAuuGAGuuuGuuGuuuG****G--uuuuuuGuuuuuG- T. brucei --------------------------------uuG-uuG--uuGuuuuuuuuGuuuuuGu T. vivax uGAAACCA--GuuA---UGAGA**GUUUGCAuuGuuAuuuAuuACAuuAAGuuGuGG*** T. brucei uGAA-uCA**GuuG***UGGGA**AuuuACGuuGuuAuuuAuuACGuuGAGuuGuGG--- T. vivax *UG-uuuuuGGuuCuAuuuuAuuuuuAuuG---GAuuuAuUACAuuuuA**UGCAuGuuu T. brucei -uG*UUUUUGGuuCuAuuuuAuuuuuAuuG***GAuuuGuuGCAuuuuA--uGCAuGuuu T. vivax uuuuAGGuGuuuuGuuGuuG-uuuAuuuG-uuuuAuG--CGuuuGuuuAAuuuuuuGuGu T. brucei uuuuAGGuGuuuuAuuGuuGuuuuA--uGuuuUUAuG**CGuuuGuuuAGuuuuuuAuGu T. vivax AuGGAuACACGuuuuGuuuuuuuGuAuuGuGuuuGuuuAuAuuGACAuuuuGuuGA-UUU T. brucei AuGGAuACACGuuuuGuuuuuuuGuAuGuuGuuuGuuuGuAuuGACAuuuuGuuGAUUU* T. vivax AGuuuGAuuuuuuuuAuuGCGAuuuGuuuAuuuuGAuGuuuuAuG----uGuuAuGuAuu T. brucei GGuuuGGuuuuuuuuGuuGCGAuuuGuuuAuuuuGAuGuuuuAuG****UGuuAuGuAuu T. vivax uGuGuGuGuAAuuuuAuuGGuGuuuuUUUAGUUGuuGAuuA*GuuAAuuuGuAuuGGUAG T. brucei uGuGuGuGuAG------------------------------------------------- T. vivax UUUGUAGGAAG-- T. brucei ------------- T. vivax 221 C. CR3 - Pan-edited dual coding AGAAAUAUAAAUAUGUGUAUGAUAUAUAAAAACAAuGuuuGA****UUGuuuGGuuuuG- T. brucei ----------------------------------AuGuuuGA---*UUGuuuAGuuuuGU T. vivax --uuG-uuuuUUUAuuGuuuGuuuGuACAuuuuuuuuGuuuuuuAuuuGuuuGuG----- T. brucei U***GuuuuuuuuAuuG-uuGuuuGuACAuuuuuuuuGuUUUUUGuuuAuuuGuG***** T. vivax ***A**UUUGuuuuuAuGuuuGuuA-*UUUA------GuuuuuGuuuuuuAuuGGAuuuu T. 
brucei ***A--uuuGuuuuuAuGuuuGuuG--uuuGuuuuuuG--uuUA---uuuG-uGGAuuuu T. vivax uGuuuuuuAuuuAA----uA--uGGGuuuA---uUGuuGuGuuuAuuuuuuuuuuuuAuu T. brucei uGuuuuuuGuuuAA****UA**UGGGuuuG***UUGuuGUGuuuAuuuuuuuuuuuuGuu T. vivax uuAuCAuuuGAuAuGuuGuuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUU T. brucei uuAuCAuuuGAuAuGuuAuuAuCGuUUUUGuuAuuAuAuAuAAGuUUUCGUUAUUAA--- T. vivax AAAAAAGUAUGCAAAUAAUUUUUGU T. brucei ------------------------- T. vivax D. CR4 - Pan-edited dual coding UAAUUUAUUGUUAUCUUUGUGUAUUUAUUAuuAuuuuAuuuuAA---uuuuG---GuuGu T. brucei -------------------------------------------AUUuuuuuGuuuGuuGu T. vivax GC--***AuuuuuuuuuuuuuuAuuuG***GuG*UGuuuGuGuuuuA*UGuA*C*A---- T. brucei GC*****AuuuuuuuuuuuuuuAuuuG---GuG-uGuuuGuGuuuuA-uGuA--uA**** T. vivax *GuuuAuGGuAuAuuuuAuuGuuGuuuuGuuuuuuGuuuuuGuuGUUUG-uuUGuG--uG T. brucei *GuuuGuGGuAuAuuuuGUuGuuGuuuuuuuuuuuGuuuuuuuuGuuuG***UGuGuuuG T. vivax GGuA--uGuuuuAuuuGuuuuGuuAuA----GuuGuuuGuuuuuuuuuGuuGuUUUG*GG T. brucei GGuA**UGuuuuAuuuGuuuuGuuAuA****GuuGuuuGuuuuuuuuuuuuGuuuuG-GG T. vivax uuGuG----AuuuuuuAuuG**G--uGuuuuG***AuuGuAuA---GuuuAuuuuuuuuG T. brucei uuGuG----AUUUUuuGuuA--GuuuA--UuG---AuuGuAuA***GUuuGuuuuuuuuG T. vivax uGAC-GuuAuAAuuUUGuuuAuuuuuuuuuuuuAuuuuGuuuuGuGuuuuuuG---uAuu T. brucei uGAC*GuuAuAAUuuuGuuuAUUUUuuuuuuuuGUuuuGUUuuGuGuuuuuuGuuuuGuu T. vivax G*UUGuuuuuA-uUUGGuuuGuuuGGuuuuuuuuuG***UAuuuuuuGUUGuGuuuuGuG T. brucei GuuuuuuuuuG----GUUU*GuuuGGuuuuuuuuuG---uAuuuuuuGuuAuGUuuuGuG T. vivax uuAuuuuuuGAuuuAuuuuuuAuGuUGuuuuuUGuuuuG--GG***UG*G-uuuuuuuGu T. brucei uuAuuuuuuGAuuuGuuuuuuAuGuuGuuuuuuG---uAuuGG----GUGGuuuuuuuGu T. vivax uuuuGuuuuuuuuuuuuGuuuAuGuuuGuuuuuA---uuuGuGGuuGuuG--uuAuuuuG T. brucei uuuuGuuuuuuuuuuuuGuuuAUGuuuA---uuGuuuuuuGuAGuuAUUG**UUGuuuuA T. vivax uuAGuuuGGuuGuuGUUG-uuAuuUGuG---uA--uA****GGUUUAuuUAuA*UGCGuu T. brucei uuAGuuuGGuuGuuGuuGUU*AuuuGuGU***AUuuA--UUGGuuuAuuuAuA-uGCG-- T. vivax uuuuAuuuuA------GAuAAuUAuG****G****UA**UUGGUUUUAUAAAAUGUUUUU T. 
brucei uuuuG---uGUUUU**AA------------------------------------------ T. vivax UCU T. brucei --- T. vivax 222 E. ND3 - Pan-edited dual coding UCAAAAAAUCCUCGCCUUUUUACUUUA-GUUUGUUAUCAuuA--uuuuuAuAuuuGuuuu T. brucei --------------------------AUG----UUAuCA--AUuuuuuuGuAuuuG---- T. vivax UG---*A*UA-uuGuGGuuuA**UUAuuuuAuuuA-uAGG--uuuuuuuuuAuGuuuuuu T. brucei -GUUUuG-uGuuuGuGGuuuA--UuAuuuuA-uuGuuAGGuuuuuuuCU**AuGuuuuuu T. vivax AuGuuuuuuAuuGCAuuuuuuuG----AuuGuuuuCGuuGuuGuuuGuGGuuuuCGuGuG T. brucei AuGuuuuuuGuuACAuuuuuuuG****AuuGuuuuCGuuGuuGuuuAuGAuuuuCAuGuG T. vivax ---GuUUGuA---uGAuAuG--A-----AuUCACGuuuG*GUGuuuuA--uACAuuGGAu T. brucei ***GuuuGuA***UGAuAuG**A*****AuuCACGuuuG-GuGuuuuA**UACAuuAGAu T. vivax uuAUGuuuuGuuA-------------GuUGuUUGuuuuuuGuAuuGuu---AAAuuCCAu T. brucei uuAuGuuuuGuuA*************GuuGuuuGUuuuuuGuGuuAUU***GAAUuCuGu T. vivax UAuuuGu----GuUUUGuuGuuuGuuuuuG--UG----------A-uA*GuGuuGuuuuA T. brucei uAuuuGU****GuuuuGuuAUUUG----uAuuuGU*********GuuG-GuAuuGuuuuA T. vivax uuuuuGuuAU----GGuuuuuuG-uUUUUGuGGuuuuuGuuuuuuGuuG-uA-uGuA--- T. brucei uuuuuGUUAU****GGuuuuuuA*UUUUUGuGGuuuuuGuuuuuuG-uGuuGuuA--**U T. vivax uAG****GAuuUGuGuGGuAuuuuuGGGA-UCAC*GuAuAuuuGuGuGGuGUAAuuuuAu T. brucei UAG----GGuuuGuGUGGuAuuuuuGAGA*UCA-uGuAUAUU------------------ T. vivax uuuGuuuAuGA**UGuuuUUUGUUGUAUUAUACAUAUUAUAUUAAUAAAUAUAUAAAA T. brucei ---------------------------------------------------------- T. vivax F. ND7 - Pan-edited dual coding UGAUACAAAAAAACAUGACUACAUGAUAAGUAuCAuuuuAuG-uuAuuuuuG--GuAGuu T. brucei ---------------------------------------AuGuuuAuuuuuGuuGuAGuu T. vivax uuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG***CAGCACAuG**G- T. brucei uuuuuGCAuuuGuAUCGUuuuACAUuuG-GCCCACAGCAuCCCG---CAGCACAuG-*G* T. vivax uGuuuuAuGuuGuuuAuuGuAuuuuuGuGGuGA*AuuuAuuGuuuA**UA---UUGAuUG T. brucei UGuuuuAuGUuGuuuAUUGuAuuuuuGUGGuGA-AuuuAuuGuuuA--uA***UUGAuuG T. vivax uAuuAuA***G*GuuA--UUUGCAUCGUGGUACAGAAAAGUUAUGUGAAUAUAAAAGUGU T. brucei uAuuAuA---G-GuuA**UUUGCAUCGAGGUACAGAAAAGUUAUGUGAGUAUAAGAGCGU T. 
vivax AGAACAAUGUCUUCCGuAUUUCGA---CAGGUUAGAuuAuGuuA---*GuGuuuGuuGuA T. brucei AGAGCAGUGUCUUCCGuAUUUuGAU***AGAuuAGAuuAuGuuA****GuGuuuGuuGuA T. vivax AuGAGCAuuuGuuGuCuuuA***UGuuuuGAGuA--uAuGuuGCGAuGuuGuuuGuCGuu T. brucei AuGAACAuuuAuuGuCuuuA---uGuuuuGAGuA**UAuGuuACGGuGuuGuuuAuCAuu T. vivax ACGuuGuGCAuuuAuGCGuuuAuuAAuuGuA****GAAuuuAC***CCGuAGuuuuAAuG T. brucei GCGuGuuGCAuuuAuGCGuuuAuuGAuuGuA----GAGuuuACU**C*GUAGuuuuAAuG T. vivax GuuuGuuGuGuAuAuCAuGuAuGGuuuuGG*AuuuAGGuuGuuuGuCUCCGuuG*UUAuG T. brucei GuuuAuuGuGuGuGuCGuGuAuGAuuuuAG-AuuuAGGuuGuuuAuCCCCGuuA-UuAuG T. vivax AuCAuuuGAGGAA-***CG*UGA-CAAAuuGAuGACAuuuuuuGAuuuAuG**UUGuGGu T. brucei GuCAuuuGAGGAG****CG-uGAU*AAGuuAAuGACGuuuuuuGAuuuGuG--uuGuGGu T. vivax uGuCGuAuGCAuuuGGCUUUCAuGGuuuuAuuA-*GGuAUUCUUGAUGAuuuuGuuuuuG T. brucei uGuCGuAuGCAuuuGGCUUUCAuGGuuuuAuuG**GGuAuuCUUGAuGAuuuuGuuuuuG T. vivax GuuuuGuuGAuuuuuuGuuGuuGuuGA***UAAuAuCAuGuuuGuuuGuuAuGGAuuGuu T. brucei 223 GuuuuGuuGAUUUuuuGuuGuuAuuGA---uAAuAuCGuGuuuGuuuGuuAuGGAuuGuu T. vivax AuGAuuuGuuAuuuG--uGGGuAA---UCGuuuAuuuUAuuuGCGuuuGC***GuGGuuu T. brucei AuGAuuuAuuGuuuG**UGGGuAA***UCGuuuGuuuuAuuuGCGuuuGC---GuGGuuu T. vivax GuCAuuuuuuGAuuuA---uAuGAuuuA**GuuuuuA**A**UAGuuuAAGuGGuGuuuu T. brucei GuCAuuuuuuGAuuuG***UAuGAuuuG--GuuuuuA--A--uAGuuuGAGuGGuGuuuu T. vivax GuCuCGuuCGuuAGGuAuGGuGuGAGAuuGUCGuuuAuuuAGuuGuuA****UGA***** T. brucei GuCACGuuCAuuGGGuAuGGuAuGAGAuuGCCGuuuAuAuuGuuGuuA-***UGA----- T. vivax GuUG-uAuuuuA---uGuuuuGuuAuGAuuAuuGuuuuuGuuuuAuA-GGuGAuGCAuuu T. brucei GuuG*UAUUUUA***UGuuuuGuuAuGAuuAuuGuuuuuGuuuuAuA*GGuGAuGCAuuu T. vivax GA*UCGuuuAuuuuuACGuuuGuuuGAUAuGCGuAuGAGuuuGuuGAuuuGuAAGCAA-u T. brucei GAC*CGuuuGuuuuuGCGuuuGuuuGAuAuGCGuAuGAGuuuGuuGAuuuGuAAGCAA*U T. vivax GuuuuuuuGuuGGuuuuuuuGuuuuuG*****GuuuuGuuuGuuuGuuuG**AuuAuuuA T. brucei GuuuuuuuGuuGGuuuuuuuGuuuuuG-----GAuuuGuuuGuuuGuuuG--AuuAuuuG T. vivax uAuuGuGAuAuuACCAuuG****AGACCAuuAuuAuGuuAuuuuAuAGuuuG--uGGuGu T. brucei uAuuGuGAuGuuACCAuuG----AGACuAuuAuuAuGuuGuuuuAuAGuuuA**UGGuGu T. 
vivax uGuuGuuuGCCGGGuAuAU*-----CAuuuGC*UUGU-GuuGAACACCCCAAAG-----G T. brucei uGuuGuuuACCAGGuAuAU******CAUUUGC-UUGU*GuuGAGCAuCCCAAGG*****G T. vivax uGA***GuAuuGuuuGuuAuuAU****GuuuuuGuGuuGGuuuAuGuuCUCGuuuACGuu T. brucei uGA---GuAuuGuuuGuuAuuAU****GuuuuuGuGuuGGuuuGuGUUCCCGuuuGCGuu T. vivax uGCGuuGuGCGGAuuuuuuGCA--*UA--UUUGuuuAuuGGAuGuuuGuuuGCGuGGuuu T. brucei uGCGuuGuGCGGAuuuuuuACA***UA**UUUGuuuGuuGGAuGuuuGuuuACGuGGuuu T. vivax uuuAuuGCAuGAuuuAGuuGC--***C*GuuuuA--GGuAAuAuuGAuGuuGuuuuuGGA T. brucei uuuAuuGCAuGAuuuAGuuGC*****C*G--uuAuuGGuAAuAuuGAuGuuGuuuuuGGA T. vivax uCC--GUAGAUCGuuA*GuuuuAuAuGuG**A******GGUUAUUGuAGGAUUGUUUAAA T. brucei uCU**GuGGAUCGuuA*G------------------------------------------ T. vivax AUUGAAUAAAAA T. brucei ------------ T. vivax G. ND8 - Pan-edited non-dual coding -------CAAUUUAAUAAUUUUAAGUUUUGGUUGAUUAuuAuuuuuuuAuuuuuuuAuuu T. brucei ------------------------------------------------------------ T. vivax uuGuAuGuuuuuuuuuGAuuuuuuGuuuuuuuUUUUUGuuuGuuuuuAuAuGuGUuuuGu T. brucei ----AuGuuuuuuuuuGAuuuuuuGuuuuuuuuuuuuGuuuGuuuuuAuAuGuGuuuuGu T. vivax uuGuuGuGuuA****CuA*UUUGuuuA-***CCCAuuGAGuuAACCA--uuGuuAGuuuA T. brucei uuGuuGuGuuA---CC-A-UUuGuuuA****CCCAuuGAAuuAAC-AuuuuG-uAGuuuG T. vivax uuGGuuC--GuGGUAA---C-C---AuuuuuuGCGUUUUuA***UUGGuGuGGuuuAGAG T. brucei uuGA-*CCCGuGGuAA***C*C***AuuuuuuGCGuuuuuA--*UUGAuGuGGuuuAGAA T. vivax CGuuGuAuuGCuuGuCGuuuAuGuGAuuuAAuuuG-C-----CCuA****GuuuAGCAuu T. brucei CGUuGuAuuGCuuGuCGuuuAuGuGAuuuGAuuuGuC*****CC-A----GuuuAGCAuu T. vivax GGAuG***UUCGuGuuGGGuGG---AGuuuuGGuGGuCA**UC*GuuuuGCG--GAuuG- T. brucei AGAuG---uuCGuGuuGGGuGG***AGuuuuGGuGGuCA--uC-GuuuuGCA**GAuuG* T. vivax 224 -AuuuACAuuGAGuuA-*UC**GU-**C----GuuGuAuuuAuuGuGGuuuuuGuAuGCA T. brucei *AuuuACAuuGAGuuA***C-CG***AC****GuuGuAuuuAuuGuGGuuuuuGuAuGCA T. vivax uGuuuGCCCGACAGAU****G---CC----AuuA-----CGCA---UUCAuuGuuuGuuA T. brucei uGuuuGuCCAACAGAu----G***CC****AuuA*****CACA***UUCAuuGuuuGuuA T. vivax uGuGuuuuuGuuGuuuA-------GCC**A**UGuAuuuAuuG*GCGC***C***C---- T. 
brucei uGuGuuuuuGuuGuuuA*******GCC**A--UGuAuuuAuuG-GCGC---C---C**** T. vivax AAGuuuuuAuuGuuuGG---uuGuuGuuuuAuGuuAuuuGAuuuuuAuuuGuGuuuuGuG T. brucei AAGuuuuuGuuAuuuGG***UUGuuGuuuuAuGuuGuuuGAuuuuuAuuuGuGuuuuGuG T. vivax uAGuuAuuuAuuuuGGGuGAuuuAuuGUGuuuAuGAuuuAA***AGAA**AuuCACGGUG T. brucei uAG--------------------------------------------------------- T. vivax AAAUUAAAUUUUGACUAAAU T. brucei -------------------- T. vivax H. ND9 - Pan-edited dual coding UUAAUAUCAACUUAAUUUUUUUUAUAAACAuuAuAuuAUGuGuA--uAuUUUUAuGuuuA T. brucei -------------------------------------AuG-G-GuuuGuuGuuGuGuuuA T. vivax uuuCGuuuAuGuuuuuGuuuAAuuUUAuuuuA**UUGuuuGuGuuGuA----------GA T. brucei uuuCGUUuGuGUUUUUGuuuGAUuuuGuuuuA--UuGuuuGUGuuGuA**********GG T. vivax uGGuGuuUUGuuuGuuuuGuuGA-uuGuAGuuuuuuGuuuuuuuAuuGuuuuGuuA-Guu T. brucei UGGuGuuuuGuuuGuuuuGuuGA*UUGuAGuuuuuuGuuuuuuuAUUGUUUuGuuA*Guu T. vivax uuuuuuuGuuuuAUUGuAuGuuuuuAuuuuuuAAuuuGuG-AuuuuuGuuuuuAuAuuGU T. brucei uuuuuuuGuuuuAuuGuAuGuuuuuAUUuuuuAAuuuAuG*GuuuuuGuuuuuGuAuuGu T. vivax UGuG-AuUUGuuAuuGAuuGAuuuuuGuGGuuuuuGuuuuuGuCGuuuuAuGuuGuuGUA T. brucei uGuG*AuuuAuuGuuGAuuGAuuuuuGuGGUuuuuGuuuuuGuCGuuuuAuGuuAuuAuA T. vivax uAuuuuAuuuuGuuuGuuuuuGuGuGuuCGuuuGuGuuuuGuuuuGuGuuGUUUGuuuGU T. brucei uAuuuuGuuuuGuuuGUuuuuGuGuuuuCGuuuAuGuuuuGuuuuGuGUUGUUuGuuuuu T. vivax AuuuuuuGGAuuG----uGuuuuA-*GuuuuA**GuuGuuuuuGuUAuGC---GuuuuuG T. brucei GUUUUuuGGAuuG****UGuuuuA**GuuuuA--GuuGuuuuuGuuAuGC***GuuuuuA T. vivax uuGuuGGA----ACGC*GAAuGuuuUGAUUUGuuuGGuuuuUAuuuuG--uuGGuAAuGA T. brucei UUGuuAGA****ACG-uGAGuGuuuuGAuuuGuuuGGuuuuuAuuuuG**UUGGuAAuGA T. vivax uAuuuuACAUCGuuuAuuuGuuG--AuuG****GuuuuuuGuuG-GuuuuuuuuuGuuGA T. brucei uGuuuuACACCGuuuAUuuGuuG**AuuG----AUUuuuuGuuG*GuuuuuuuuuGuuGA T. vivax -AGuGuuAUCCA--uuAuuuGGuuuGuuuGuAuuGuuAuuuuGuG---uGuuG**GuG-G T. brucei *AGuGuuAuCCA**UUAuuuGGUuuGuuuGuAuuAuuGuuuuGuGuuuuA--GuuA-AuG T. vivax A--GGA-GAUA-GuAuGuACGuuuACAAuGuuA--uuuuuGuuGuuGCAuACC**AAuuU T. brucei A-**GAUGAuAuGuA----CGuuuACAAuG--GuuuuuuuGuuGuuGCAuACC**AAuuu T. 
vivax
UUAuuuG*CA-----uuAuuuuAuuuA***AuA**UCACCGuUGUAAUUCUAAAUUUCUC T. brucei
uuAuuuG-CAUUuuuuuG-uuuA--*G--------------------------------- T. vivax
ACUUCC T. brucei
------ T. vivax

I. RPS12 - Pan-edited dual coding
CUAAUACACUUUUGAUAACAAACUAAAGUAAAuAuAuuuuGuuuuuuuuGCGuAuGuGA* T. brucei
-----------------------------------------------------AuGuGA* T. vivax
UUUUUGUAUG*GuuGuuGuuuAC----*GuuuuGuuuuAuuuGuuuuAuGuuAuuAuAuG T. brucei
UUUUUGuAuG-GuuGuuGUUUGC*****GuuuuGuuuuGuuuGuuuuAuGuuAuuAuAuG T. vivax
AGuCC----G**CGAuuGCCCAGuuCCGGuAACCGACGuGuAuuGuAuGC**C****GuA T. brucei
AGUCC****C--CGAuuGCCCAGuuCCGGuAAuCGACGuGuGuuGuAuGC--C--**GuG T. vivax
uuuuAuuUAuAuAAuuuuGuuuG-GA-uGuuGCGuuGuuuuuuuuGuuGuuuuAuuG--- T. brucei
uuuuAuuuGuAuAAuuuuG--uGuGGuuGuuGCGuuGuuuuuuuuGuuG---uG-uGUUU T. vivax
---GuuuA---GuuA--uG**UCAuuAuuuAuuAuAGA--***G----GGUGGuGGuuuu T. brucei
UuuG---GuuuG-CAUUUG--UCGuuAuuuAuuAuAGA*****G****GGuGGuGGuuuu T. vivax
GuuGAuuuACCC--***G****GuG*UAAAGuAuuAuACA*CG**UAuuG--uA--AGuu T. brucei
GuuGAuuuACCC*****G--**GuA-UAAAGuAuuAuACA-CG--uA-uGuuuAuuAAuu T. vivax
AGA*UUUAGAuAUAAGAUAUGUUUUU T. brucei
AA------------------------ T. vivax

APPENDIX H. Alignments of protein sequences of pan-edited dual-coding genes in L. tarentolae, L. amazonensis, P. serpens, and Perkinsela CCAP1560/4 with T. brucei and T. vivax sequences. A: CR3, B: CR4, C: ND3, D: ND7 5' Editing Domain, E: ND9, F: RPS12, G: ND8 (nondual-coding). Absent sequences were unavailable. All ORF alignments show published protein sequences. All ARF alignments show +1 or +2 (ND7 only) reading frame translations of the full-length mRNA sequences. In the ARF alignments of CR3, ND7 and RPS12, translations were made using the alternative T. brucei mRNA sequences shown in Figure 8. Two alignments of CR3 ARFs are presented to display the P. serpens +2 reading frame, which has no stop codons. The published L. tarentolae CR4 protein sequence shows limited homology in the C terminus to all other CR4 protein sequences.
The edited mRNA has two editing sites where 13 U residues are inserted. If the second of these insertion sites is shortened to 12 U residues, the translation of this mRNA has much better homology to other CR4 proteins. Alignments with translations of the two different sequences (13U and 12U) are both shown, with the location of the altered site highlighted in red. While ND8 does not appear to be dual-coding, this alignment was included as well, for comparison of the conservation of a nondual-coding gene with that of the dual-coding genes. It should be noted that ND8 is the only nondual-coding gene that is pan-edited in L. tarentolae, L. amazonensis, and P. serpens, and ND7, A6 and COIII are only partially edited in these species. ! = Termination codon

A. CR3
CR3 ORF Alignment
CR3ORF T. brucei --MFDCLVLL-FFYCLFVHFFCFLFVCDLFLCLLFSFCFLLDFCFLFNMGLLLCLFFFFI
CR3ORF T. vivax --MFDCLVLL-FFLLLFVHFFCFLFICDLFLCLLFVFCLFVDFCFLFNMGLLLCLFFFFV
CR3ORF L. amazonensis --MFDFVIIMFL-FMSFVHFFCFLFIVDLLFCLMFFVFFLYDFCFVCNLGFCCCLFFFFL
CR3ORF P. serpens IFLFDFVLFLVLFLLFFVHFFCFLFIIDLFCCFLLLFFLVFDFCFCCCFGFVSCLFLFFV
:** :::: : *********: **: *::: . :. **** :*: ***:**:
CR3ORF T. brucei LSFDMLLSFLLLYISFRY!
CR3ORF T. vivax LSFDMLLSFLLLYISFRY!
CR3ORF L. amazonensis LSIDMILSFILLYVSFRY!
CR3ORF P. serpens FHFDMVLSFILLFVSFRY!
: :**:***:**::****

CR3 ARF Alignment with P. serpens +1 RF
CR3ARF T. brucei ---------RNINMCMIYKNNVYVVVLFWFWLYIFFVFYLFVICFYVCYLVFVFYWIFVF
CR3ARF T. vivax --------------CLI----V!FCCFFYCCLYIFFVFCLFVICFYVCCLFFVYLWIFVF
CR3ARF P. serpens +1 !!VYNI!KHNILYFCLI---LFCF!CYFYCFLCIFFVFYLLLICFVVFYYCFF!CLIFVF
CR3ARF L. amazonensis --K!NNMY!V!IYICLI----SLL!CFCLWVLYIFFVFYLLLICYFVWCFLFFFYMIFVL
*:* . * ***** *::**: * *. ***:
CR3ARF T. brucei YLIWVYCCVYFFFLFYHLICCYHFYYCI!VFVIRLKKYANNFC-
CR3ARF T. vivax CLIWVCCCVYFFFLFYHLICYYRFCYYI!VFVI-----------
CR3ARF P. serpens +1 VVVLVLWVVYFYFLFFILIWFYLSYYYL!VSVIKSI!KHKLIS-
CR3ARF L.
amazonensis CVI!VFVVVCFFFFCYPLIWFCRLFYYMLVSDIKII!LLFL!!K :: * * *:*: : ** * : * * CR3 ARF Alignment with P. serpens +2 RF CR3ARF T. brucei -------RNINMCMIYKNNVYVVVLFWFWLYIFFVFYLFVICFYVCYLVFVFYWIFVFYL CR3ARF T. vivax ------------CLI----V!FCCFFYCCLYIFFVFCLFVICFYVCCLFFVYLWIFVFCL CR3ARF P. serpens +2 ----------SKCIIYKNIIFYI-FVWFC----FVFSVIFIVFCAFFLFFIYYWFVLLFF CR3ARF L. amazonensis K!NNMY!V!IYICLI----SLL!CFCLWVLYIFFVFYLLLICYFVWCFLFFFYMIFVLCV *:* : *** ::.* : . :.*.: :.:: . CR3ARF T. brucei IWVYCCVYFFFLFYHLICCYHFYYCI!--------VFVIRLKKYA--NNFC------ CR3ARF T. vivax IWVCCCVYFFFLFYHLICYYRFCYYI!--------VFVI------------------ CR3ARF P. serpens +2 IIVFFSVWFL--FLLLFWFCELFIFIFCFSFWYGFIFHIIICKFPLLKAFKNISL!V CR3ARF L. amazonensis I!VFVVVCFFFFCYPLIWFCRLFYYML--------VSDIKII!LL--FL!!K----- * * * *: *: .: : : * 227 B. CR4 CR4 ORF Alignment with L. tarentolae 13U translation (Published Sequence) CR4ORF T. brucei ILILVVHFFFFYLVCLC-----FMYSLWYILLLFCFLFLLFVCVGMFYLFCYSCLFFFVV CR4ORF T. vivax IFLFVVHFFFFYLVCLC-----FMYSLWYILLLFFFLFFLFVCLGMFYLFCYSCLFFFFV CR4ORF L. tarentolae 13U -------------KCCCFWFFYVLFCVLYILFLFFFLFIWFVCYGLFYLYCICLFICFSL CR4ORF L. amazonensis ---ISNILLFLYIFIYICWLIF-MYSCWYILILFFFLFLLFVVYGLFYLYCIVCLFILCL ::. ***:** ***: ** *:***:* :: : : CR4ORF T. brucei LGCDFLLVFWLYSLFFLWRYNFVYFFFLFCFVFFVLLF-LFGLFGFFLYFLLCFVLFFDL CR4ORF T. vivax LGCDFLLVYWLYSLFFLWRYNFVYFFFLFCFVFFVLLF-FFGLFGFFLYFLLCFVLFFDL CR4ORF L. tarentolae 13U LCCDFVVVFWLYSVFFVYRYNYFFFFVYFLGVYFFVIILICIWFFIFFFLCLCFDFL--F CR4ORF L. amazonensis LCCDFVVVFWLYSVFFIYRYNFVFFFFFLWFVFIFLIIFIFGFGFLFFFLVLCLVFYFEF * ***::*:****:**::***:.:**. : *::.::: : :*::: **: : : CR4ORF T. brucei FFMLFFVLGGFFVFVFF---------FCLCLFLFVVVVILLVWLL-----LLFVYRFIYM CR4ORF T. vivax FFMLFFVLGGFFVFVFF---------FCLCLLFFVVIVVLLVWLL-----LLFVYRFIYM CR4ORF L. tarentolae 13U WIFVYVVFCFLWIFVVCDVYFIFYIIFCFNCVGVLLVVIYICVSIFLYDVLYFNFNWIIL CR4ORF L. amazonensis LFMLFFVFCGFLLFVMFILFFVSFF---------VLIVLLFCWMLF-----IFVFRFICM ::::.*: : :**. :::*: : : * :.:* : CR4ORF T. 
brucei RFLF! CR4ORF T. vivax RFVF! CR4ORF L. tarentolae 13U KF--- CR4ORF L. amazonensis RFVF! :* CR4 ORF Alignment with L. tarentolae 12U translation (Hypothetical Sequence) CR4ORF T. brucei ILILVVHFFFFYLVCLC-----FMYSLWYILLLFCFLFLLFVCVGMFYLFCYSCLFFFVV CR4ORF T. vivax IFLFVVHFFFFYLVCLC-----FMYSLWYILLLFFFLFFLFVCLGMFYLFCYSCLFFFFV CR4ORF L. tarentolae 12U -------------KCCCFWFFYVLFCVLYILFLFFFLFIWFVCYGLFYLYCICLFICFSL CR4ORF L. amazonensis ---ISNILLFLYIFIYICWLIF-MYSCWYILILFFFLFLLFVVYGLFYLYCIVCLFILCL ::. ***:** ***: ** *:***:* :: : : CR4ORF T. brucei LGCDFLLVFWLYSLFFLWRYNFVYFFFLFCFVFFVLLFL-FGLFGFFLYFLLCFVLFFDL CR4ORF T. vivax LGCDFLLVYWLYSLFFLWRYNFVYFFFLFCFVFFVLLFF-FGLFGFFLYFLLCFVLFFDL CR4ORF L. tarentolae 12U LCCDFVVVFWLYSVFFVYRYNYFFFFVYFLGVYFFVIILICIWFFIFFFYVCVLIFYFEF CR4ORF L. amazonensis LCCDFVVVFWLYSVFFIYRYNFVFFFFFLWFVFIFLIIFIFGFGFLFFFLVLCLVFYFEF * ***::*:****:**::***:.:**. : *::.:::: :*:: : ::::*:: CR4ORF T. brucei FFMLFFVLGGFFVFVFFFCLCLFLFVVVVILLVWLLLLFVYRFIYMRFLF!--------- CR4ORF T. vivax FFMLFFVLGGFFVFVFFFCLCLLFFVVIVVLLVWLLLLFVYRFIYMRFVF!--------- CR4ORF L. tarentolae 12U LFMLFFVFCGFLLFVMFILFFISFFVLIVLVFCWLLFIFVFRFFCMTFCILILIGLF!NL CR4ORF L. amazonensis LFMLFFVFCGFLLFVMFILFFVSFFVLIVLLFCWMLFIFVFRFICMRFVF!--------- :******: **::**:*: : : :**::*::: *:*::**:**: * * : CR4 ARF Alignment with L. tarentolae 13U translation (Published Sequence) CR4ARF T. brucei -----IYCYLCVFIIILF!FWLCIFFFFIWCVCVLCTVYGIFYCCFVFCFCCLFVWVCFI CR4ARF T. vivax -----------------FFCLLCIFFFFIWCVCVLCIVCGIFCCCFFFCFFCLCVWVCFI CR4ARF L. tarentolae 13U QIH!NTYMYNCKSVV-VFGFFMY---------YFVC---CIFYFCFFFCLFDLCVMVYFI CR4ARF L. amazonensis ------DIKNIK!VI-FYYFYIFL---FTFVGWFLCIVVGIFWFYFFFYFCYL!FTVYFI : : .:* ** *.* : * . * ** CR4ARF T. brucei CFVIVVCFFLLFWVVIFYWCFDCIVYFFCDVIILFIFFFYFVLCFLYCCFYLVCLVFFC- CR4ARF T. vivax CFVIVVCFFFLFWVVIFC!FIDCIVCFFCDVIILFIFFFCFVLCFLFCCFFLVCLVFFC- CR4ARF L. tarentolae 13U YIAFVCLFVLVCYVVILLLCFDCIVFFLFTVIIIFFFLFIFWVFIFLLLFWFVFGFLFFF CR4ARF L. 
amazonensis CIALFVCLFYVCYVVILLLCFDCIVFFLFIDIILFFFFFFYGLCLFF!LFLYLDLVFYFF :.:. :. : :***: :**** *: **:*:*:* : : :: * : .:: CR4ARF T. brucei IFCCVLCYFLIYFLCCFLFWVVFLFLFFFFVYVCFYLWLLLFC!FGCCCYLCIG------ CR4ARF T. vivax IFCYVLCYFLICFLCCFLYWVVFLFLFFFFVYVYCFL!LLLFY!FGCCCYLCIG------ CR4ARF L. tarentolae 13U FYVCVLIFYF------------------------EFLFMLFFVFCGFLLFVMFILFFISF CR4ARF L. amazonensis F!FCVWFFILNFCLCYFLFFADFCCLWCLFYFLCHFLF!LFYYFVGCYLYLYFV------ : * : : :* *:: * :: : CR4ARF T. brucei --LFICVFYFR!LWYWFYKMFF--------------- CR4ARF T. vivax --LFICVLCF--------------------------- CR4ARF L. tarentolae 13U FVLIVLVFCWLLFIFV-FRFFCMTFCILILIGLF!NL CR4ARF L. amazonensis --LFVCVLCFK---VG-YEFIFI-------------- *:: *: : 228 CR4 ARF Alignment with L. tarentolae 12U translation (Hypothetical Sequence) CR4ARF T. brucei -----IYCYLCVFIIILF!FWLCIFFFFIWCVCVLCTVYGIFYCCFVFCFCCLFVWVCFI CR4ARF T. vivax -----------------FFCLLCIFFFFIWCVCVLCIVCGIFCCCFFFCFFCLCVWVCFI CR4ARF L. tarentolae 12U QIH!NTYMYNCKSVV-VFGFFMY---------YFVC---CIFYFCFFFCLFDLCVMVYFI CR4ARF L. amazonensis ------DIKNIK!VI-FYYFYIFL---FTFVGWFLCIVVGIFWFYFFFYFCYL!FTVYFI : : .:* ** *.* : * . * ** CR4ARF T. brucei CFVIVVCFFLLFWVVIFYWCFDCIVYFFCDVIILFIFFFYFVLCFLYCCFYLVCLVFFCI CR4ARF T. vivax CFVIVVCFFFLFWVVIFC!FIDCIVCFFCDVIILFIFFFCFVLCFLFCCFFLVCLVFFCI CR4ARF L. tarentolae 12U YIAFVCLFVLVCYVVILLLCFDCIVFFLFTVIIIFFFLFIFWVFIFLLLFWFVFGFLFFF CR4ARF L. amazonensis CIALFVCLFYVCYVVILLLCFDCIVFFLFIDIILFFFFFFYGLCLFF!LFLYLDLVFYFF :.:. :. : :***: :**** *: **:*:*:* : : :: * : .:: : CR4ARF T. brucei FCCVL-CYFLIYFLCCFLFWVVFLFLFFFFVYVCFYLWLLLFC!FGCCCYLCIGLFICVF CR4ARF T. vivax FCYVL-CYFLICFLCCFLYWVVFLFLFFFFVYVYCFL!LLLFY!FGCCCYLCIGLFICVL CR4ARF L. tarentolae 12U FMFVFWFFILNFCLCCFLFFVDFCCLWCLFYFLYHFLF!LCWCFVGCYLYLCFDFFVWRF CR4ARF L. amazonensis F!FCVWFFILNFCLCYFLFFADFCCLWCLFYFLCHFLF!LFYYFVGCYLYLYFVLFVCVL * . ::* ** **::. * *: :* :: :* * : .** ** : :*: : CR4ARF T. brucei YFR!LWYWFYKMFF CR4ARF T. vivax CF------------ CR4ARF L. 
tarentolae 12U VF!F!LDYFKI--- CR4ARF L. amazonensis CFKVGYEFIFI--- * C. ND3 ND3 ORF Alignment ND3ORF T. brucei --LLSLFLYLFLILWFIILFIGFFLCFLCFLLHFFDCFRCCLWFSCGLYDMNSRLVFYTL ND3ORF T. vivax --MLSIFLYLVLCLWFIILLLGFFLCFLCFLLHFFDCFRCCLWFSCGLYDMNSRLVFYTL ND3ORF L. tarentolae ------------------------FGRREKVLHFFDCFRCCLWFSCGLYDMNSRFVYVSI ND3ORF L. amazonensis MFFCSFYFNFVLIFCMLILSIGVLFYIFMFLLHFFDCFRCCLWFSCGLYDMNSRLCYIFI : :***********************: : : ND3ORF T. brucei DLCFVSCLFFVLLNSIICVLLFVFVIVLFYFCYGFLFLWFLFFVVCIGFVWYFWDHVYLC ND3ORF T. vivax DLCFVSCLFFVLLNSVICVLLFVFVLVLFYFCYGFLFLWFLFFVLLLGFVWYFWDHVY-- ND3ORF L. tarentolae DLCFAVLLCFVMFYSIIGLILFLIVVVLYFMCKL-FFVWFCFVFLL------FWSIV--- ND3ORF L. amazonensis DLCFAILLCFVLFYSNFGLIIFVLVVLLYFMCKL-FFVWFLLLFFM------LWLMYLI- ****. * **:: * : :::*::*::*:::* :*:** :... :* ND3ORF T. brucei GVILFCLWCFLLYYTYYINKYIK---------- ND3ORF T. vivax --------------------------------- ND3ORF L. tarentolae FNIWFCV---LFVFIFFLLILINSVFSYLILK! ND3ORF L. amazonensis FDSVYCL---FLFFFY!---------------- ND3 ARF Alignment ND3ARF T. brucei -SKNPRLFTLVCYHYFYICF-W-YCGLLFYL!VFFYVFYVFYCIFLIVFVVVCGFRVVCM ND3ARF T. vivax -----------CYQFFCIWF-C-VCGLLFYC!VFFYVFYVFCYIFLIVFVVVYDFHVVCM ND3ARF L. tarentolae HSKNSSI-Y-----------------NLYRSIGISLGGGKKCCIFLIVFVVVYDFRVVCM ND3ARF L. amazonensis TQKNSRFKFKICFFVHFILILC!FFVCWF!VLVYYFIFLCFCCIFLIVFVVVYDFHVVCM : ********* .*:**** ND3ARF T. brucei IWIHVWCFIHWIYVLLVVCFLYC!IPLFVFCCLFLW!CCFIFVMVFCFCGFCFLLYV!DL ND3ARF T. vivax IWIHVWCFIH!IYVLLVVCFLCYWILLFVFCYLYLCWYCFIFVMVFYFCGFCFLCCY-GL ND3ARF L. tarentolae TWIHDLFMFL!IYVSQFYYVLLCFIPLLV!FYF!!LWCCILCVNC-FLCGFVLYFCYFGV ND3ARF L. amazonensis TWIHVYVIFL!IYVSQFYCVLYCFIPILVWLFLY!WCCYILCVNY-FLCDFYCYFLCYGW *** :: *** . .* * ::* : :: * :*.* . ND3ARF T. brucei CGIFGITYI--CVV!FYFVYDVFCCIIHIILINI!--- ND3ARF T. vivax CGIFEIMYI----------------------------- ND3ARF L. tarentolae L--YLIFDSVYCLFLFFF-Y!F!!IVYSLI!Y!NNK!I ND3ARF L. amazonensis CI!YLILCTV-CFC-FFF-INLINSVK--LSLD!NN-- : * 229 D. 
ND7 5' Editing Domain ND7 5' Editing Domain ORF Alignment ND7ORF T. brucei -----------MLFLVVFLHLYRFTFGPQHPAAHGVLCCLLYFCGEFIVYIDCIIGYLHR ND7ORF T. vivax ----------MFIFVVVFLHLYRFTFGPQHPAAHGVLCCLLYFCGEFIVYIDCIIGYLHR ND7ORF L. tarentolae ILFSRLHDNYILYLLIVFLHLYRFTFGPQHPAAHGVLCCLLYLSGEFITYIDVIIGYLHR ND7ORF P. serpens -------IIFIFFIFVVFLHLYRFTFGPQHPAAHGVLCCLLYFSGEYITYIDVIIGYLHR : :.:**************************:.**:*.*** ******* ND7ORF T. brucei GTEKLCE ND7ORF T. vivax GTEKLCE ND7ORF L. tarentolae GTEKLCE ND7ORF P. serpens GTEKLCE ******* ND7 5' Editing Domain ARF Alignment ND7ARF T. brucei -IQKNMTTWYSIIVIFGSFFTFVSFYIWSTASRSTWCFMLFIVFLWWIYCLYWLYYRLFA ND7ARF T. vivax ------------VYFCCSFFAFVSFYIWPTASRSTWCFMLFIVFLWWIYCLYWLYYRLFA ND7ARF L. tarentolae LNFI!PTTR!LYFIFINCFFTLV!IYFRTPASSSPWRIMLFIISFWRIYNVYRCNYWVFT ND7ARF P. serpens -------SNNFYFFYFCCFFAFVSFYVWSTASSRTWCVMLFVIFFRWVYNIYWRNYRLLT . .**::* :*. ** * .***:: : :* :* * ::: ND7ARF T. brucei SWYRKVMWI! ND7ARF T. vivax SRYRKVMWV! ND7ARF L. tarentolae SRYRKVMWI! ND7ARF P. serpens PWNGKIVWI! *::*:* 230 E. ND9 ND9 ORF Alignment ND9ORF T. brucei MCIFLCLFRLCFCLILFYCLCCRWCFVCFVDCSFLFFYCFVSFFLFYCMFLFFNLWFLFL ND9ORF T. vivax MGLLLCLFRLCFCLILFYCLCCRWCFVCFVDCSFLFFYCFVSFFLFYCMFLFFNLWFLFL ND9ORF L. tarentolae MFLFLIMFRCVFVLLLFFCLCCRWVFLCFVDCSFVFFYLFVCFFLFFVMFLFFNLWFFLL ND9ORF L. amazonensis MFLFLIMFRCVFVLCLFFCLCCRWVFLCFVDCSFVFFYLFVCFFLFFVMFLFFNLWFFLL * ::* :** * * **:****** *:*******:*** **.****: *********::* ND9ORF T. brucei YCCDLLLIDFCGFCFCRFMLLYILFCLFLCVRLCFVLCCLFVFFGLCFSFSCFCYAFLLL ND9ORF T. vivax YCCDLLLIDFCGFCFCRFMLLYILFCLFLCFRLCFVLCCLFLFFGLCFSFSCFCYAFLLL ND9ORF L. tarentolae YCLDLFCIDFCGFCFVRFILIYVLFCLLLCFRVSFVLICFFLFFGLVFSLFFCSYALCIF ND9ORF L. amazonensis YCLDLFCIDFCGFCFVRFVLLYVLFCLILCFRVSFVLICFFLFFGLVFSLFFCSYALCIF ** **: ******** **:*:*:****:**.*:.*** *:*:**** **: .**: :: ND9ORF T. brucei ERECFDLFGFYFVGNDILHRLFVDWFFVGFFLLKCYPLFGLFVLLFCVLVEEIVCTFTML ND9ORF T. 
vivax ERECFDLFGFYFVGNDVLHRLFVDWFFVGFFLLKCYPLFGLFVLLFCVLVNEMICTFTMV ND9ORF L. tarentolae EREVFDLFGFVFCGNDCLHRFYVDWFFVGFFLCKVYPLFGLFMLNFCMLCEDIVVIATSC ND9ORF L. amazonensis ERECFDLFGFVFCGNDCLHRFYVDWFFVGFFLCKVYPLFGLFVLNFCMLVEEIIVYATCC *** ****** * *** ***::********** * *******:* **:* :::: * ND9ORF T. brucei FLLLHTNFYLHYFI!--------- ND9ORF T. vivax FLLLHTNFYLHFFV!--------- ND9ORF L. tarentolae FVLCFSNFAI!------------- ND9ORF L. amazonensis FVLVFPILHLFNLIYDNADLNIN! *:* . : : ND9 ARF Alignment ND9ARF T. brucei NINLIFFINIILCVYFYVYFVYVFV!FYFIVCVVDGVLFVLLIVVFCFFIVLLVFFCFIV ND9ARF T. vivax ------------WVCCCVYFVCVFVWFCFIVCVVGGVLFVLLIVVFCFFIVLLVFFCFIV ND9ARF L. tarentolae -RYII!YLLSIICFYFWLCFVVCLCCCYFFVCVVDGFFYVLLIVVLFFFICLCVFFYFLW ND9ARF L. amazonensis IINCLY!NLIKLCFYF!LCFVVYLYYVYFFVYVVGEFFYVLLIVVLFFFICLCVFFYFLW . : ** : *:* **. .::******: *** * *** *: ND9ARF T. brucei CFYFLICDFCFYIVVICYWLIFVVFVFVVLCCCIFYFVCFCVFVCVLFCVVCLYFLDCVL ND9ARF T. vivax CFYFLIYGFCFCIVVIYCWLIFVVFVFVVLCYYIFCFVCFCVFVYVLFCVVCFCFLDCVL ND9ARF L. tarentolae CFYFLIYDFFYCIV!IYFV!IFAVFVLFVLFWYMFCFVYYYVFE!VLYWFVFFCFLVWFL ND9ARF L. amazonensis CFCFLIYDFFYCIVWICFV!IFAVFVLFDLFYYMFCFV!FCVFG!VLYWFVFFYFLVWFL ** *** .* : ** * **.***:. * :* ** : ** **: .* : ** .* ND9ARF T. brucei VLVVFVMRFCCWNANVLICLVFILLVMIFYIVYLLIGFLLVFFCWSVIHYLVCLYCYFVC ND9ARF T. vivax VLVVFVMRFYC!NVSVLICLVFILLVMMFYTVYLLIDFLLVFFCWSVIHYLVCLYYCFVF ND9ARF L. tarentolae VYFFVVMRYVFLNVKFLICLVLFFVVMIVYIVFMLIDFLLVFFCVKFIHCLVCLCWIFVC ND9ARF L. amazonensis VYFFVVMRYVFLSENVLICLVLFFVVMIVYIVFMLIDFLLVFFYVKFIHCLVYLCWIFVC * ...***: . ..*****::::**:.* *::**.****** ..** ** * ** ND9ARF T. brucei WWRR!YVRLQCYFCCCIPIFICIILFNITVVILNFSL---- ND9ARF T. vivax !LMRWYVRLQWFFCCCIPIFICIFLF--------------- ND9ARF L. tarentolae YVKILLWLPLVVLCCVFPILQYSFYFIFV---MNLSIKLIY ND9ARF L. amazonensis WLKRLLFMLHVVLC!FFQFCIYLI!FMIM---LT!TLTKF- :* : : : * 231 F. RPS12 RPS12 ORF Alignment RPS12ORF T. brucei ----------MWFLYGCCLRFVLFVLCYYMSPRLPSSGNRRVLYAVFYLYNFVWMLRCFF RPS12ORF T. 
vivax ----------MWFLYGCCLRFVLFVLCYYMSPRLPSSGNRRVLYAVFYLYNFVWLLRCFF RPS12ORF L. tarentolae --------MRVLFLYGLCVRFLYFCLVLYLSPRLPSSGNRRCLYAICYMFNILWFFCVF- RPS12ORF L. amazonensis TFLNLIYFVRVLYLYGLCVRFLFLCLVLYLSPRLPSSGNRRCLYAISIMFNILWYFLVF- RPS12ORF P. serpens -----MFFVRSYCLYGFCVRFCFVFLCIYVSPRLPSSGNRRVYVVCFNLYSFVIYCFLFG RPS12ORF Perkinsela ------------MLFGFLVRYGFIEFFFFVSPRLPSSGNRFCYELDMRFFFVCYDFVLLG *:* :*: . : ::********** :: . : RPS12ORF T. brucei CC-FIGLVMSLFIIEGGGF---VDLP-GVKYYTRIVS!-- RPS12ORF T. vivax CCVFFGLHLSLFIIEGGGF---VDLP-GIKYYTRMFIN!- RPS12ORF L. tarentolae CCVCF-LNHLLFIVEGGGF---IDLP-GVKYFSRFFLNA! RPS12ORF L. amazonensis CCFVF-VIFQLFIVEGGGF---IDLP-GVKYFSRFCNVS! RPS12ORF P. serpens CCVICYSQSFYFLCEGGGF---VDLP-CIKLYVRVPIA!- RPS12ORF Perkinsela FSV---LLSSLVFYEGFGFWLFMDVPFGLYYFSRG!---- . .: ** ** :*:* : : * RPS12 ARF Alignment RPS12ARF T. brucei NTLLITN!SK--------YILFFLRMWFCMVVVYVLFYLFYVIIWVRDCPVPVTDVYCMP RPS12ARF T. vivax -------------------------CDFCMVVVCVLFCLFYVIIWVPDCPVPVIDVCCMP RPS12ARF L. tarentolae ---NTYRPI!--------IIFILCVYYFCMVYVFVFYIFVWFYI!VHDYLVPVIDVVYMQ RPS12ARF L. amazonensis HK-YLFRPF!--------I!FILFVFYICMVYVFVFYFYVWFYI!VHDYQAPVIDVVYMQ RPS12ARF P. serpens -----LKPIF!LS!YFYLYLCFLFVVIVYMVFVYVFVLYFYVYMLVPVYPVQVIVVFMLF RPS12ARF Perkinsela ---------------------------CCLVFWFVMVL!SFFFLLALVCPVLVIGFVMSW :* *: :. : . . * . RPS12ARF T. brucei YF--IYIILFGCCVVFFVVL-LV!LCHYLL!RVVVLLIYPV!SIIHVL!VRFRYKICF-- RPS12ARF T. vivax CF--ICIILCGCCVVFFVVCFLVCICRYLL!RV-VLLIYPV!SIIHVCLLI--------- RPS12ARF L. tarentolae YV--ICLIFYDFFVFFVV-FVFWIIC-CL!LKVVVLLICQE!SIFHVFFWMRKQ!VIIKI RPS12ARF L. amazonensis LV--LCLIFYDIFWFFAV-LFLWFFS-CL!LKVVVLLICQE!SIFRVFVMCRKFNYLYFY RPS12ARF P. serpens VL--ICIVLLFIVFYLVVVLFVILRVFIFYVRVVVLLIYHV!SYMSVCQ!PK!IIAS--- RPS12ARF Perkinsela IWGFFLFVMILCCWVFPFCCQVWFFMKVLV---------------FGCLWMYRLDCIIFP : ::: : . . : RPS12ARF T. brucei ---- RPS12ARF T. vivax ---- RPS12ARF L. tarentolae ILFR RPS12ARF L. amazonensis KN-- RPS12ARF P. 
serpens ---- RPS12ARF Perkinsela VV-- 232 G. ND8 (Nondual-coding) ND8 ORF Alignment ND8ORF T. brucei MFFFDFLFFFFVCFYMCFVCCVTICLPIELTIVSLLVRGNHFLRFYWCGLERCIACRLCD ND8ORF T. vivax MFFFDFLFFFFVCFYMCFVCCVTICLPIELTFCSLLTRGNHFLRFYWCGLERCIACRLCD ND8ORF L. tarentolae MFVYDFCFSFFVCFYMCFLCCVTLVLPLELTIVSICVRGNHFLRFYWCGLERCIACRLCD ND8ORF L. amazonensis MFCYDFVFSFFVCFYMCFLCCVTLILPCEITIVSICARGHHFLRFYWCGLERCIACRLCD ND8ORF P. serpens MFFFDFFFCFFCFVYMCFCCCVTIVVPCEVSLCSFLVRGTHFLRFYWCGLERCIACRMCD ** :** * ** .**** ****: :* *::: *: .** *****************:** ND8ORF T. brucei LICPSLALDVRVGWSFGGHRFADWFTLSYRRCIYCGFCMHVCPTDAITHSLFVMCFCCLA ND8ORF T. vivax LICPSLALDVRVGWSFGGHRFADWFTLSYRRCIYCGFCMHVCPTDAITHSLFVMCFCCLA ND8ORF L. tarentolae FICPSLALDVRCVRSLCGYRFSDVFNISYRRCIYCGFCMHVCPTDAITHSCFLLFCCCIA ND8ORF L. amazonensis FICPSLAIDVRCIRSLCGYRYSDLFYISYRRCIYCGFCMHVCPTDAITHSCFLLFCCCIA ND8ORF P. serpens YICPSVAIDVRCGVSLIGHRFAHLFFISYRRCIYCGFCMHVCPTDAITHSFVVLFSVLLS ****:*:*** *: *:*::. * :*********************** .:: :: ND8ORF T. brucei MYLLAPKFLLFGCCFMLFDFYLCFV! ND8ORF T. vivax MYLLAPKFLLFGCCFMLFDFYLCFV! ND8ORF L. tarentolae MYLCAPKFVLFGCCFMLFDFYLCFV! ND8ORF L. amazonensis MYLCAPKFVLFGCCFMLFDFYLCFV! ND8ORF P. serpens SYLVAPKFILFGCCFMVFDLFLCFC! ** ****:*******:**::*** ND8 ARF Alignment ND8RF2 T. brucei NLIILSFGWLLFFYFFIFVCFFLIFCFFFLFVFICVLFVVLLFVYPLS!PLLVYWFVVTI ND8RF2 T. vivax -------------------CFFLIFCFFFLFVFICVLFVVLPFVYPLN!HFVVCWPVVTI ND8RF2 L. tarentolae ---------KHI!CIRLKECLFMIFVFLFLFVFICVFYVVLLWFYHWSWPLLVFVFVVTI ND8RF2 L. amazonensis ---------NIIRSILIKICFVMILFFLFLFVFICVFYVVLLWFYHVRLPLLVFVLVVII ND8RF2 P. serpens ----------!I!CVNVIKCFFLIFFFVFFVLFICVFVVVLPLLFHVKYHCVVFWFVVLI *:.:*: *.*:.:****: *** .: :* ** * ND8RF2 T. brucei FCVFIGVV!SVVLLVVYVI!FALV!HWMFVLGGVLVVIVLRIDLHWVIVVVFIVVFVCMF ND8RF2 T. vivax FCVFIDVV!NVVLLVVYVIWFVPV!H!MFVLGGVLVVIVLQIDLHWVTDVVFIVVFVCMF ND8RF2 L. tarentolae FCVFIDVV!NVVLPAVYVILYAQV!L!MFVVLEVYVVIGFPMCLILVIVVVFIVVFVCMF ND8RF2 L. 
amazonensis FCVFIDVV!NGVLPAVYVILYALVWPLMFVVLEVYVVIVIPIYFILVIVVVFIVVFVCMF ND8RF2 P. serpens FYVFIGVV!NVVLLVVCVIIFVLVLP!MFVVVLVWLVIVLHICFLLVIDVVFIVVFVCMF * ***.***. ** .* ** :. * ***: * :** : : : * *********** ND8RF2 T. brucei ARQMPLRIHCLLCVFVV!PCIYWRPSFYCLVVVLCYLIFICVLCSYLFWVI------YCV ND8RF2 T. vivax VQQMPLHIHCLLCVFVV!PCIYWRPSFCYLVVVLCCLIFICVLC---------------- ND8RF2 L. tarentolae VQQTPLRIHVFCYFVVVLPCIYAHLNLFYLVVVLCYLIFICVLFSLVYLEKY-IWLII!! ND8RF2 L. amazonensis VPPMLLRIHVFCYFVVVLPCIYAHLNLFYLVVVLCCLIFICVLFNL!FEYFFYIFYVVCS ND8RF2 P. serpens VQPMQLPIHLLFYLVCCYPVIWLHPNLFCLVVVLWFLIYFCVFVSCIV!FI!----LFVV . * ** : .. * *: : .: ***** **::**: ND8RF2 T. brucei YDLKKFTVKLNFD! ND8RF2 T. vivax -------------- ND8RF2 L. tarentolae INF----------- ND8RF2 L. amazonensis KNLVNV-------- ND8RF2 P. serpens !!!!K!RTK!Y!--

APPENDIX I. RPS12 gRNA Alignments for TREU 667 SDM79 (A) and EATRO 164 SDM79 cells (B), and all editing variants (C and D). Amino acid translations are shown above the mRNA sequences. The cDNA sequence of the most abundant gRNA in each sequence class is shown aligned beneath the fully edited mRNA. Lowercase u’s indicate uridines added by editing; asterisks indicate encoded uridines deleted during editing. Nucleotides and deletion sites in the fully edited mRNA are numbered starting from the 5’ end (+1=0). gRNAs are colored based on transcript abundance as follows: Blue<100; Green<1,000; Purple<10,000; Orange<100,000; Red>100,000; Black=not quantified. Watson-Crick (|) and G:U base pairs (:) are indicated. Mismatches are indicated by an octothorpe (#). Highlighted sequences mark regions where multiple CU configurations are possible.

A.
TREU SDM79 gRNAs - Jv2, I, H, G, F, E, D, C, B1, A M W F L Y G C C L R F V L F V M V V V Y V L F Y L F CUAAUACACUUUUGAUAACAAACUAAAG*AAuAAAuuuuGuuuuuuuuGCGuAuGuGAUUUUU*GuAuG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGuu : |||||||:|::::||:|:|||:|||||:|:||:| |||:| |||||| |:::| 11T-TTATTTAGAGTGGAAGAGACGTATACATTGAAGA-CATGC-CAACAAATA gJv2 gH 13TATAGTGA || :||:: ||:::|:|:||| ||||::||:||:||||| gI 14TAA-TATGT-CAGTGATAGATG-CAAAGTAAGATGAACAA L C Y Y M S P R L P S S G N R R V L Y A V F Y L Y N F V W M L Y V I I W V R D C P V P V T D V Y C M P Y F I Y I I L F G C C uuAuGuuAuuAuAuGAGuCCG**CGAuuGCCCAGuuCCGGuAACCGACGuGuAuuGuAuGC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAuGuu ||||:|||| |||:|||:| | :||||:||||:|:||||||:|||||||||||| AATATAATATA gI gF 18TCAATTAATATATG--G----TATAAGATAAGTGTATTAAGACAAACCTACAAAATATA |||::|:|:|||||||:||#: |:|||||#|||||| :|||:|:|:|||||:|||:||::|:||||:|| AATGTAGTGATATACTTAGAT--GTTAACGTGTCAAGATATA gH gE 12TATATAGAGTGAATATGTTAGAATGAGCCTATAA |: |:||::#|#||:||||:|||||:|||||||||||| gG 13TATTCAGT--GTTAGTAGATTGAGGCTATTGGTTGCACATAACATTCATA gG |||||:|#|||:|#||:|||||||||| | :| 12TAAATTTAGTGACCGAAGGCTAGTGGTT-CATATAACATACG--G----TAATATA gFp R C F F C C F I G L V M S L F I I E G G G F V D L P G V K V V F F V V L L V STOP GCGuuGuuuuuuuuGuuGuuuuAuuGGuuuAGuuAuG**UCAuuAuuuAuuAuAGA***GGGuGGuGGuuuuGuuGAuuuACCC***G****GuG*UAAA |||||||| ||: |||:|||:||:||:||| ||:::#::|||||||||||||| CGCAACAATAAATA gE 14TATTAT--AGTGATAGATGATGTCT---CCTGTAGTCAAAACAACTAAATATA gC :#:||:|||:|:|::|:|||:|||:::||:||||||| ||||||||||| ::|||:::||||:||||# : ||: |||| 11TATAATAAAGAGAGTAGCAAGATAGTTAAGTCAATAC--AGTAATAAATA gD gB1 11TTAAAGTGACTAGATGGA---T----CAT-ATTT Y Y T R I V S STOP GuAuuAuACA*CG**UAuuGuAAGuuAGA*UUUAGAuAUAAGAUAUGUUUUU :|||:||||| || ||||||| TATAGTATGT-GC--ATAACATATA gB1 || |: ||||::|||||||| ||:|||||||||||||||| 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTATATTCTATACAATATAAA gA 234 B. 
EATRO SDM79 gRNAs - Jv2, I, H, G, F, E, D, C, B1, A M W F L Y G C C L R F V L F V M V V V Y V L F Y L F CUAAUACACUUUUGAUAACAAACUAAAG*AAuAAAuuuuGuuuuuuuuGCGuAuGuGAUUUUU*GuAuG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGuu |: |||||||:|::::||:|:|||:|||||:|:||:| |||:| |||||| |:::| 10TT-TTATTTAGAGTGGAAGAGACGTATACATTGAAGA-CATGC-CAACAAATA gJv2 gH 13TATAGTGA |: :||:: :||::|:|:||| |:|:|||:||||||||| gI 12TATAG-TATGT-TAATGATAGATG-TGAGACAGAATAAACAA L C Y Y M S P R L P S S G N R R V L Y A V F Y L Y N F V W M L Y V I I W V R D C P V P V T D V Y C M P Y F I Y I I L F G C C uuAuGuuAuuAuAuGAGuCCG**CGAuuGCCCAGuuCCGGuAACCGACGuGuAuuGuAuGC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAuGuu |||:||| |||||||| 12TAATGTAGTGAGTTAAATGGGATTAGATTGAT--------ACCTACAAGCATATA gF |||||:|#|||||#||:|||||||||| | :| 14TAAATTTAGTGACCGAAGGCTAGTGGCT-CATATAACATACG--G----TAATATA gFp ||||:|||| AATATAATATA gI |||::|:|:|||||||:||#: |:|||||#|||||| ||:| | :|||||:||:||:|||||:|::||:|:||||| AATGTAGTGATATACTTAGAT--GTTAACGTGTCAAGATATA gH gE 05TATTATG--G----TATAAAGTAGATGTATTAGAGTAAGCTTACAA |: |:||::#|#||:||||:|||||:|||||||||||| gE 12TATATAGAGTGAATATGTTAGAATGAGCCTATAA 13TATTCAGT--GTTAGTAGATTGAGGCTATTGGTTGCACATAACATTCATA gG |||:|#||:|:|!:|||||||||||:||:|:||||| | 14TAATGTGTTAGGATCATTGGCTGCATATGATATACG--GAAATATA gGe R C F F C C F I G L V M S L F I I E G G G F V D L P G V K V V F F V V L L V STOP GCGuuGuuuuuuuuGuuGuuuuAuuGGuuuAGuuAuG**UCAuuAuuuAuuAuAGA***GGGuGGuGGuuuuGuuGAuuuACCC***G****GuG*UAAA |||||:| |||: ||||:|:|||:||:||| :::|||::|||||||||||:| CGCAATATATA gE 14TATAT--AGTAGTGAATGATGTCT---TTTACCGTCAAAACAACTAGAATATA gC CGCAACAATAAATA gE ::|||:::||||:||||# : ||: |||| :#:||:|:|:|:|::|||::|:|||:|:|:|||||:| ||||||| gB1 11TTAAAGTGACTAGATGGA---T----CAT-ATTT 14TATAATAGAGAGAGTAACGGAGTAATCGAGTCAATGC--AGTAATATATATA gD Y Y T R I V S STOP GuAuuAuACA*CG**UAuuGuAAGuuAGA*UUUAGAuAUAAGAUAUGUUUUU :|||:||||| || ||||||| TATAGTATGT-GC--ATAACATATA gB1 || |: ||||::|||||||| ||:|||||||||||||||| 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTATATTCTATACAATATAAA gA 235 C. 
TREU SDM79 Variants J Variants M W F L Y G C C L R F V L F V M V V V Y V L F Y L F CUAAUACACUUUUGAUAACAAACUAAAG*AAuAAAuuuuGuuuuuuuuGCGuAuGuGAUUUUU*GuAuG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGuu : |||||||:|::::||:|:|||:|||||:|:||:| |||:| |||||| 11T-TTATTTAGAGTGGAAGAGACGTATACATTGAAGA-CATGC-CAACAAATA gJv2 M W F L Y G C C L R F V L F V M V V V Y V L F Y L F CUAAUACACUUUUGAUAACAAACUAAAG*AAuAuAuuAGuuuuuuGCGuAuGuGAUUUUU*GuAuG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGuu : |||||||:|:|:||:|:||||:::|||:||: :|||| |||||||| 11T-TTATATAGTTAGAAGATGCATGTGCTAGAAG-TATAC-CAACAACATATA gJv3 M W F L Y G C C L R F V L F V M V V V Y V L F Y L F CUAAUACACUUUUGAUAACAAACUAAAG*AAuAuAuAuuuuGuuuuuuuuGCGuAuGuGAUUUUU*GuAuG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGuu ||||||:|::::||:|:|||:|||||:|:||:| |||:| |||||| 12TATATAGAGTGGAAGAGACGTATACATTGAAGA-CATGC-CAACAAATA gJv1 M W F L Y G C C L R F V L F V M V V V Y V L F Y L F CUAAUACACUUUUGAUAACAAACUAAAG*AAuAuAuuAuuGuuuuuuuuGCGuAuGuGAUUUUU*GuAuG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGuu |: ||||:|:|||:::||:|:|::|:|||||:||||:| ||||| ||| 12TT-TTATGTGATAGTGAAGAGAGTGTATACATTAAAGA-CATAC-CAAATATATA gJv4 D Variants D GC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAuGuuGCGuuGuuuuuuuuGuuGuuuuAuuGGuuuAGuuAuG**UCAuuAuuuAuu :#:||:|||:|:|::|:|||:|||:::||:||||||| ||||||||||| 11TATAATAAAGAGAGTAGCAAGATAGTTAAGTCAATAC--AGTAATAAATA gD :|||:|:|:|||||:|||:||::|:||||:|||||||||| 12TATATAGAGTGAATATGTTAGAATGAGCCTATAACGCAACAATAAATA gE Dx GCUUCUUUUGAAUAAAAuuuGGGuuAuuGGuuuuCGGuuGuuGAGuGuAuuGuAuG**UCAuuAuuuAuu |||||||:::|:|:|||:||:||:|:::::|||:||||||||| :|||||| 09TTTTAAATTTAGTGACCGAAGGCTAGTGGTTCATATAACATAC--GGTAATATA gFp 236 BC Variants B1, C1 V M S L F I I E G G G F V D L P G V K Y Y T R I V S STOP GuuAuG**UCAuuAuuuAuuAuAGA***GGGuGGuGGuuuuGuuGAuuuACCC***G****GuG*UAAAGuAuuAuACA*CG**UAuuGuAAGuuAGA*UUUAGAu ||: |||:|||:||:||:||| ||:::#::|||||||||||||| || |: ||||::|||||||| ||:|||| TATTAT--AGTGATAGATGATGTCT---CCTGTAGTCAAAACAACTAAATATA gC gA 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTA ::|||:::||||:||||# : ||: ||||:|||:||||| || ||||||| 
11TTAAAGTGACTAGATGGA---T----CAT-ATTTTATAGTATGT-GC--ATAACATATA gB1 B1*, C1 V M S L F I I E G G G F V D L P G Y K Y Y T R I V S STOP GuuAuG**UCAuuAuuuAuuAuAGA***GGGuGGuGGuuuuGuuGAuuuACCC***G****GG*UAuAAGuAuuAuACA*CG**UAuuGuAAGuuAGA*UUUAGAu ||: |||:|||:||:||:||| ||:::#::|||||||||||||| || |: ||||::|||||||| ||:|||| TATTAT--AGTGATAGATGATGTCT---CCTGTAGTCAAAACAACTAAATATA gC gA 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTA ::|||:::||||:||||# : #| |||||:|||:||||| || ||||||| 13TTAAAGTGACTAGATGGA---T-----C-ATATTTATAGTATGT-GC--ATAACATATA gB1* B3t, Ct AAGA***GuGuGuGuuGuuuuGuGuuuGGuuuuAuACCC***G**UUGuuuuG*UAuAuuuuuAuGAuuAuuuAuuCA*CG**UAuuGuAAGuuAGA***UAGAu :|:|:|::::|||::::|||::|:|||||||| | |||||||| ||||| 18TATATATGGTAAAGTGTAAATTAGAATATGGG---C--AACAAAAC-ATATATATA gCt : ||:|:|:: |||||:::|||:||||||:||||| || |||||| 12T--AATAGAGT-ATATAGGGATATTAATAAGTAAGT-GC--ATAACAATATA gB3t || |: ||||::|||||||| ||:|||| gA 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTA B4t, Ct AAGAUUUGuGuGuGuuGuuuuGuGuuuGGuuuuAuACCC***G**UUGuuuuG*UAuuuuuAuAuuuGuAuAuAuuuuCA*CG**UAuuGuAAGuuAGA***UAGAu :|:|:|::::|||::::|||::|:|||||||| | |||||||| ||||| 18TATATATGGTAAAGTGTAAATTAGAATATGGG---C--AACAAAAC-ATATATATA gCt : :|||:|:: |||:|:||:||:|:||||||||:||| || ||||||| 14T--GACAGAGT-ATAGAGATGTAGATATATATAAGAGT-GC--ATAACATATA gB4t || |: ||||::|||||||| ||:|||| gA 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTA 237 D. 
EATRO SDM79 Variants J Variants M W F L Y G C C L R F V L F V M V V V Y V L F Y L F CUAAUACACUUUUGAUAACAAACUAAAG*AAuAAAuuuuGuuuuuuuuGCGuAuGuGAUUUUU*GuAuG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGuu |: |||||||:|::::||:|:|||:|||||:|:||:| |||:| |||||| 10TT-TTATTTAGAGTGGAAGAGACGTATACATTGAAGA-CATGC-CAACAAATA gJv2 M W F L Y G C C L R F V L F V M V V V Y V L F Y L F CUAAUACACUUUUGAUAACAAACUAAAG*AAuAuAuuAGuuuuuuGCGuAuGuGAUUUUU*GuAuG*GuuGuuGuuuAC*GuuuuGuuuuAuuuGuu |: |||||||:|:|:||:|:||||::||||:||: :|||| |||||||| 11TT-TTATATAGTTAGAAGATGCATGTACTAGAAG-TATAC-CAACAACATATA gJv3 G Variants Multiple G sequences from gG L C Y Y M S P R L P S S G N R R V L Y A V F Y L Y N F V W M L Y V I I W V R D C P V P V T D V Y C M P Y F I Y I I L F G C C uuAuGuuAuuAuAuGAGuCCG**CGAuuGCCCAGuuCCGGuAACCGACGuGuAuuGuAuGC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAuGuu L C Y Y M S P R L P S F G N R R V L Y A V F Y L Y N F V W M L Y V I I W V R D C P A S V T D V Y C M P Y F I Y I I L F G C C uuAuGuuAuuAuAuGAGuCCG**CGAuuGCCCAGCuuCGGuAACCGACGuGuAuuGuAuGC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAuGuu L C Y Y M S P R L P S S G N R R V L Y A V F Y L Y N F V W M L Y V I I W V R D C P A L V T D V Y C M P Y F I Y I I L F G C C uuAuGuuAuuAuAuGAGuCCG**CGAuuGCCCAGCuCuGGuAACCGACGuGuAuuGuAuGC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAuGuu |: |:||::#|#||:||||:|||||:|||||||||||| gG1 13TATTCAGT--GTTAGTAGATTGAGGCTATTGGTTGCACATAACATTCATA gG |||::|:|:|||||||:||#: |:|||||#||| AATGTAGTGATATACTTAGAT--GTTAACGTGTCAAGATATA gH Ge L C Y Y M S P R L P S P G N R R V L Y A V F Y L Y N F V W M L Y V I I W V R D C P V L V T D V Y C M P Y F I Y I I L F G C C uuAuGuuAuuAuAuGAGuCCG**CGAuuGCCCAGuCCuGGuAACCGACGuGuAuuGuAuGC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAuGuu |||:|#||:||||:|||||||||||:||:|:||||| | 14TAATGTGTTAGGATCATTGGCTGCATATGATATACG--GAAATATA gGe |||::|:|:|||||||:||#: |:|||||#|||||| ||:| | :|||||:||:||:|||||:|::||:|:||||| AATGTAGTGATATACTTAGAT--GTTAACGTGTCAAGATATA gH gE 05TATTATG--G----TATAAAGTAGATGTATTAGAGTAAGCTTACAA 238 DEF Variants D, E, F 
GACGuGuAuuGuAuGC**C****GuAuuuuAuuUAuAuAAuuuuGuuuGGAuGuuGCGuuGuuuuuuuuGuuGuuuuAuuGGuuuAGuuAuG**UCAuuAuuuAuu |##||:|||||||||| | :| :#:||:|:|:|:|::|||::|:|||:|:|:|||||:| ||||||| C-TCATATAACATACG--G----TAATA gFp gD 14TATAATAGAGAGAGTAACGGAGTAATCGAGTCAATGC--AGTAATATATAT ||:| | :|||||:||:||:|||||:|::||:|:||||||||||:| gE 05TATTATG--G----TATAAAGTAGATGTATTAGAGTAAGCTTACAACGCAATATATA gE 12TATATAGAGTGAATATGTTAGAATGAGCCTATAACGCAACAATAAATA gE Dx AGCCGGAACCGACGGAGAGCUUCUUUUGAAUAAAAuuuGGGuuAuuGGuuuuCGGuuGuuGAGuGuAuuGuAuG**UCAuuAuuuAuu ||||:::|:|:|||:||:||:|::::||||:||||||||| :|||||| 14TAAATTTAGTGACCGAAGGCTAGTGGCTCATATAACATAC--GGTAATATA gFp Ee CCGGAACCGACGGAGAuGuC**C****GuAuA*AuuuuAAAuuuGGGuuAuuGGuuuCGuuGuuuuuuuuGuuGuuuuAuuGGuuuAGuuAuG**UCAuuAuuuAuu #:||:|:|:|:|::|||::|:|||:|:|:|||||:| ||||||| gD 14TATAATAGAGAGAGTAACGGAGTAATCGAGTCAATGC--AGTAATATATAT | :|||| |:||:||||:||:|||||:#||:|||||||| 12TATAGTG----TATAT-TGAAGTTTAGATCTAATAGACAGAGCAACAATATATA gEep D, E Misanchored, Fe CGuuACGuuGAGuuAuuGC**C****GuuuuAuuuAUAuAAuuuuAuuuGGGuAGuGCGuuGuuuuuuuuGuuGuuuuAuuGGuuuAGuuAuG**UCAuuAuuuAuu ||:|:|:|||||:|||:|||:|:||:||#||||||||| 12TATATAGAGTGAATATGTTAGAATGAGCCTATAACGCAACAATAAATA gE |:|:||:|:||:||||:|| | |||#|||||||||| :|:|:|:::||||:|:|||:|:||||::||:||:||||| ||||||| GTAGTGTAGCTTAATAGTG--G----CAACATAAATATATA gFep gD 12TATGTAGTGAAAAGAGCAATAGAATAGTCAGATTAATAC--AGTAATATATA 239 B Variants B1 L F I I E G G G F V D L P G V K Y Y T R I V S STOP uuAuuuAuuAuAGA***GGGuGGuGGuuuuGuuGAuuuACCC***G****GuG*UAAAGuAuuAuACA*CG**UAuuGuAAGuuAGA*UUUAGAuAUAAGAUAUGU |:|:|||:||:||| :::|||::|||||||||||:| || |: ||||::|||||||| ||:||||||||||||||| AGTGAATGATGTCT---TTTACCGTCAAAACAACTAGAATATA gC gA 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTATATTCTATACA ::|||:::||||:||||# : ||: ||||:|||:||||| || ||||||| 11TTAAAGTGACTAGATGGA---T----CAT-ATTTTATAGTATGT-GC--ATAACATATA gB1 B1’ L F I I E G G G F V D L P G Y K Y Y T R I V S STOP uuAuuuAuuAuAGA***GGGuGGuGGuuuuGuuGAuuuACCC***G****GG*UAuAAGuAuuAuACA*CG**UAuuGuAAGuuAGA*UUUAGAuAUAAGAUAUGU 
|:|:|||:||:||| :::|||::|||||||||||:| || |: ||||::|||||||| ||:||||||||||||||| AGTGAATGATGTCT---TTTACCGTCAAAACAACTAGAATATA gC gA 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTATATTCTATACA ::|||:::||||:||||# : #| |||||:|||:||||| || ||||||| 13TTAAAGTGACTAGATGGA---T-----C-ATATTTATAGTATGT-GC--ATAACATATA gB1* B2FSe L F I I E G G G F V D L P G Y K I L F T Y C K L D L D I R Y V F uuAuuuAuuAuAGA***GGGuGGuGGuuuuGuuGAuuuACCC***G****GG*UAuAAGAuAuuAuuCA*CG**UAuuGuAAGuuAGA*UUUAGAuAUAAGAUAUGU |:|:|||:||:||| :::|||::|||||||||||:| ||||| |: ||||::|||||||| ||:||||||||||||||| AGTGAATGATGTCT---TTTACCGTCAAAACAACTAGAATATA gC gA 16TAAGT-GT--ATAATGTTCAATCT-AAGTCTATATTCTATACA ::|||:|:||||:||||# : #| |||||:||||:||||| || ||||||| 08TTAAAGTGACTAGATGGA---T-----C-ATATTTTATAGTAAGT-GC--ATAACATATA gB2FSe 240 APPENDIX J. gRNAs identified to edit the RPS12 mRNAs of found in both TREU 667 and EATRO 164 gRNA transcriptomes. Population J variant 1 J variant 2 J variant 2 J variant 3 J variant 3 J variant 3 J variant 3 J variant 3 J variant 3 J variant 4 J variant 4 J variant 4 I I I I I I H H H H H H H H H H G Ge Ge Fp Fp Fp Fp Fp Fp Fp F Fep Editing Region J I H G F Sequence ATA AACAACCGTACAGAAGTTACATATGCAGAGAAGGTGAGATATATT TTTTTTATTT ATA AACAACCGTACAGAAGTTACATATGCAGAGAAGGTGAGATTTATTTTTT TTTTTTT* ATA AACAACCGTACAGAAGTTACATATGCAGAGAAGGTGAGATTTAT ATTTTTTTTTTT ATAT ACAACAACCATATGAAGATCATGTACGTAGAAGATTGATATAT TTTTTTTTTTTTT ATAT ACAACAACCATATGAAGATCATGTACGTAGAAGATTGATATATA TTTTTTTTTTTT ATAT ACAACAACCATATGAAGATCATGTACGTAGAAGATTGATATATAATTTTT TTTTTTTT ATAT ACAACAACCATATGAAGATCATGTACGTAGAAGATTGATAT TTTTTTTTTT TAT ACAACAACCATATGAAGATCATGTACGTAGAAGATTGATATATAAT ATTTTTT AT ACAACAACCATATGAAGATCATGTACGTAGAAGATTGATATATAATTT ATTTT ATATATA AACCATACAGAAATTACATATGTGAGAGAAGTGATAGTGTATTT TTTTTAATTTTT ATATATA AACCATACAGAAATTACATATGTGAGAGAAGTGATAGTGT TTTTTT ATATA AACCATACAGAAATTACATATGTGAGAGAAGTGATAGTGTAT ATTTTTTTT* AT ATAATATAAAACAAATAAGACAGAGTGTAGATAGTAATTGTATGA TATTTTTTTTTTTT ATAT ATATAAAACAAATAGAATAGAACGTAGATGAT TACTGTATAATTTTTTTTTTAT 
A TATATAATAACATAAAACAAATAGAACGAGATGTAAATGATA TCTATTTTTTTATTTTT AT ATAATATAAAACAAGTAGAATGAAACGTAGATAGTGACTGTATAA TTTTTTTTTTTTCT ATAATATAAAACAAATGAAACGAAGCGTAGACAGTAATTATATGA TATTTTTTTTCTTTT AT ATAATATAAAACAAGTAGAATGAAACGTAGATAG GGACTGTATAATTTTT ATATAGAACTGTGCAATTGTA GATTCATATAGTGATGTAAAGTGA TATTTTTTTTTTTTT* ATATAGAATTAGGCAATCA CGGATTTATATAGTAACGTAAAATGA TATTTTTTTTTTTTG ATATATAACTGGGCATCT CGGATTTGTATAGTGATATAAAGTGAATAA TTTTTTTTTTTTT ATATATAACTGGACAATCGTA GGCTTGTATGATGATATGAGATGAGTAAA TTTTTTTTTTTT ATATATAACTGGACAATCGTA GGCTTGTATGATGACATGAGATGAGTAAA TTTTTTTTTTTCTTT ATATATAACTGGACAATCGA TGGGCTTGTATAATGATATGAGATGAGTAAA TTTTTTTTTTTTTG ATATATAACTGGACAATCGA TGGGCTTGTATAATGATATGAGATGAGTAA TTTTTTTTTTTT ATATAAACTGTGCAATCGA TGGACTTATGTAGTGATATAAGATGA TAATTTTTTTTTTTT TATATAACTGGACAATCGA TGGGCTTGTATAATGATATGAGATGAG GAAATTTTTTTTTTTT ATATATAACTGGCAATAT CGGACTCATATAGTGATGTGAAGTAAATA TTTTTTTTTT ATACT TACAATACACGTTGGTTATCGGAGTT AGATGATTGTGACTTATTTTTTTTTTTTTC ATATAAA GGCATATAGTATACGTCGGTTACT AGGATTGTGTAATTTTTTTTTTTTTT ATATAAAGC CATATAGTATACGTCGGTTACT AGGATTGTGTAATTTTTTTTTTTT AT ATAATGGCATACAATATACTCGGTGATCGGAAGCCAGTGATTTAAATTTT TTTTTTTTTT AT ATAATGGCATACAATATACTCGGTGATCGGAAGCCAGTGATTT TTTTTTTTTTT AT ATAATGGCATACAATATACTCGGTGATCGGAAGCCAGTGATTTAA TTTTTATTTTTTTTTT T ATAATGGCATACAATATACTCGGTGATCGGAAGCCAGTGATTTAAATT AAATTTTTTT AT ATAATGGCATACAATATACTCGGTGATCGGAAGCCAGTGATTTA TTTTTTTTGTTTTTTTT* AT ATAATGGCATACAATATACTTGGTGATCGGAAGCCAGTGATTTAAATTT GTTTTTTTTTTTT AT ATAATGGCATACAATATACTTGGTGATCGGAAGCCAGTGATTTAAATTTTGTTT T ATATAA AACATCCAAACAGAATTATGTGAATAGAATATGGTATATAAT TAACTTTTTTTTTTTTTTTTTT ATATATAAATAC AACGGTGATAATTCGATGTGATG ATATCTGTAATTTTTTTTTTTTTTAT 241 Reads TREU 667 1 3,190 848 751 628 72 28 8 6 2 5,762 503 6,937 687 72 12,736 17,207 1,140 828 505 169 791 94,870 196 2,288 599 181 224 87 156 60 63 Reads EATRO 164 1,731 401 8,006 3,974 2,969 282 184 144 6,201 2,718 133 4,786 2,326 979 162 90 15,732 90,998 142 2,552 346 256 172 120 1,823 E E E Eep Eep D D D D D D D D D D D D D D D D D D D 
D D D D D D D D D D D D D D D D D D D E D ATAAAT AACAACGCAATATCCGAGTAAGATTGTATAAGTGAGATAT ATTTTTTTTTTTT ATAT ATAACGCAACATTCGAATGAGATTATGTAGATGAAATATGGTAT TATTTTT ACAAATAACGCAACA TCAGATGAGATTATATAAGTGAGATATG ATATATTTTTTTTTTT ATATATAACAACGAGACA GATAATCTAGATTTGAAGTTATATG TGATATTTTTTTTTTTT ATACATAACAACGAGACA GATAATCTAGATTTAGAGTTATATG TGATATTTTTT ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAGATAAT ATTTTTTTTTTTTTT ATATATAATGAC TAACTAAACTGATAAAGCAGTAGAAGAGATGATGTAAT TTTTTTTTTTT ATATATAATGAC TAACTAAACTGATAAAGCAGTAGAAGAGATGATGTAATATTT TTTTTTTTTTT ATATATATGACATAACTT GGCCAATGAGATAATGAAAGAGATGGTGTAAT TTTTTTTTTTTTCT ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAGATA TTTTTTTTTTTTT ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAGAT TTTTTTTTTTTTTG ATATATATAATGACT TAACTGAGCTAATGAGGCAATGAGAGAGATAAT ATTTTTTTTTTTT ATATATATAATGACGTAACTGAT CTAATGAGGCAATGAGAGAGATAAT ATTTTTTTTTTTTT ATAAATTAATGACATAACTT GACTAATAGGACAGTGAAAGAGGCAGTGTAAT TCTTTTTTTTTT ATATAT ATAATGACGTAACTGAGCTAATGAAGCAATGAGAGAGATAAT ATTTTTTTTTTTTT ATATAT ATAATGACGTAACTGAGCTAATGAGACAATGAGAGAGATAAT ATTTTTTTTTTTG ATATATAT ATGACGTAACTGAGCTAATGAGGCAATGAGAGAGATAAT ATTTTTTTTTT ATAT ATAATGACATAATTAGACTGATAAGATAACGAGAAAAGTGATGTATT TTTTTTTTTT ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGGGATAAT ATTTTTTTTTTTT ATATAT ATAATGACGTAACTGAACTAATGAGGCAATGAGAGAGATAAT ATTTTTTTTTTTT AT ATAAATAATGACATAACTAGGTTAGTAAAGTGACGAAGAAGATAAT ATTATTTT ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAAATAAT ATTTTTTTTTATTTTT ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGGGAGATAAT ATTTTTTTTTTT ATATAT ATAATGACGTAACTGAGTTAATGAGGCAATGAGAGAGATAAT ATTTTTTTTATTTT ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAGGTAAT ATTTTTTGTTTTTTT ATATAT ATAATGACGTAACTGAGCTAATGAGGCAATGAGAGAAATA TTTTTTTTTGTTTTT ATAAATAATGACATAACTGAATTGATAGAACGATGAGAGAAATAAT ATTTTTTTTTTTC ATAAAT TAATGACATAACTGAATCGATAGAATAATGAGAGAGATAAT ATTTTTTTTTTTG ATAAATA AATGACATAACTGGATTAGTAAAGTGGTGAAAAAGATAAT ATTTTTTTTTTT A AAATAATGACATAACTGAATTGATAGAACGATGAGAGAAATAAT TTTTTTTTTTCC ATATATAATA ACGTAATTGGATCAGTGAGATAACGA 
TAGAAATGATATATTTTTTTTTTTTTGTTT ATAT ATAATGACATAATTAGACTGATAAGATGACGAGAAAAGTGATGTA TTTTTTTTTTTC ATATATA AATGACATGACTAAACTAATAGGGCAGTGAAGAAGACAATG ATTTTTTTTTTTC ATATATATC AATGACATAACTGAACTAGTAAGATAATGAGAGAAGT TTTTTTTTTTTAT ATAT ATAAATAATGACGTAATTAAACTGGTAAGATGATAGAAAAAGT TTTTTTTTTTTTT ATATATA AATGACATGACTAAACTGATAGGACAGTAAAGAGGACAATG ATTTTTTTTTTGTTT ATATATATATC AATGACATAACTGAACTAGTAAGATAATGAGAGAA TTTTTTTTTCTTTT ATAT ATAATGACATAACTGAACT TGTAAGATAGCGAGATTTTTTTTTTTTTT ATATATATC AATGACATAACTGAACTAGTAAGATAATGAGAGAAGTGAT TTTTTTTTTTTT ATAAATAATGACATAACTGAATTGATAGAACGATGAGAGAAAT TTTTTTTTTTT T TAAATAATGACATAACTGAATTGATAGAACGATGAGAGAAATAAT ATTAATTTTTTTTTTTTT ATAAATAATGACATAACTAAATTGATAGAACGATGAGAGAAATAAT ATTTT ATAT ATAAATAATGACGTAATTAAACTGGTAAGATGATAGAAAAAGTGAT TTTAAATTTTTTTTTT 242 10,751 125 634 113 3,627 1,168 10,277 2,340 39,218 23,327 6,953 5,004 3,253 2,627 1,811 1,594 832 815 756 636 220 183 159 148 129 4,860 64 706 234 344,244 5,997 3,292 2,172 1,088 924 809 747 558 534 521 512 382 353 308 261 258 250 193 178 160 C C C C Ct Ct B1 B1 B1* B2FSe B2FSe B3t B4t A ATAC AAATCAACAAAACTACTACTCTCTATAGTGA TGATGATATGTATTTTTTGTTTTT ATATA AGATCAACAAAACTGCCATTTTCTGTAGTAAGTGATGATATAT TTTTTTTTATTTT ATATAAATCAACAAAACTGA TGTCCTCTGTAGTAGATAGTGATAT TATTTTTTTTTTCTTT ATATAAATCAACAAAACTAG TGCCTTCTATAGTAGATGATGATATATGAT TTTTTTTTTTT ATAT ATATACAAAACAACGGGTATAAGATTAAATGTGAAATGGTATATAT TTTTTTTTTTTTTTTTT ATAT ATATACAAAACAACGGGTATAAGATTAAATGTG TTTTTTTTT ATA TACAATACGTGTATGATATTTTATACTAGGTAGATCAGTGAAATTTTTTTTTTTT ATA TACAATACGTGTATGATATTTTATACTGGGTAGATCAGTGAAATT TTTTTTT ATA TACAATACGTGTATGATATTTATACT AGGTAGATCAGTGAAATTTTTTTTTTTTTT ATA TACAATACGTGAATGATATTTTATACT AGGTAGATCAGTGAAATTTTTTTTT ATA TACAATACGTGGATGATATTTTATACT AGGTAGATCAGTGAAATTTTTTTTTTTTTTT ATATA ACAATACGTGAATGAATAATTATAGGGATATATGAGATAAT TTTTTTTTTTT ATA TACAATACGTGAGAATATATATAGATGTAGAGATATGAGACAGT TTTTTTTTTTTTT AAATAT AACATATCTTATATCTGAATCTAACTTGTAATATGTG AATTTTTTTTTTTTTTTT 80 1,587 338 8 1 1,699 69,971 1 1,941 3,453 758 41 28 333 17,696 10 
15,488 4 2 158 C B A

APPENDIX K. ND7 gRNA Alignments for TREU 667 SDM79 (A) and EATRO 164 SDM79 cells (B), and all editing variants (C and D). Amino acid translations are shown above the mRNA sequences. The cDNA sequence of the most abundant gRNA in each sequence class is shown aligned beneath the fully edited mRNA. Lowercase u’s indicate uridines added by editing; asterisks indicate encoded uridines deleted during editing. Nucleotides and deletion sites in the fully edited mRNA are numbered starting from the 5’ end (+1=0). gRNAs are colored based on transcript abundance as follows: Blue<100; Green<1,000; Purple<10,000; Orange<100,000; Red>100,000; Black=not quantified. Watson-Crick (|) and G:U base pairs (:) are indicated. Mismatches are indicated by an octothorpe (#). Highlighted sequences mark regions where multiple CU configurations are possible.

A. TREU 667 ND7 E1v1, D, C, B, A Y K K T W L H D K Y H F M L F L V V F L H L Y R F T F G P Q H P A I Q K N M T T W ST V S F Y V I F G S F F T F V S F Y I W S T A S R GAUACAAAAAAACAUGACUACAUGAUAAGUAuCAuuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG :|#||||##|||#|: gC 17TAAGTGTATTAGAGT ||:||||:|:||||:|:|||||||:| ::||||#||||| gD 13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA |:|||||||:||::|:|:|:|:|:|||||||:|||||||||||| 09TATTATAGTAAGATGTAGTGAGAGCTATCAAAAGAATGTAAACATATAAA gE1v1 A H G V L C C L L Y F C G E F I V Y I D C I I G Y L H R G S T W C F M L F I V F L W W I Y C L Y W L Y Y R L F A S W ***CAGCACAuG**GuGuuuuAuGuuGuuuAuuGuAuuuuuGuGGuGA*AuuuAuuGuuuA**UAUUGAuUGuAuuAuA***G*GuuAUUUGCAUCGUGG ||||||::|:|| :||::||:||||:||| : ||||||||||||:|: gA 13TAAATAGTAGAT--GTAGTTAGCATAGTAT---T-CAATAAACGTAGTATATA ||||::||:|:|::::|||| |||:||::|||| |||||||| 16TAATTAATAGTATGAGAGTGTCACT-TAAGTAGTAAAT--ATAACTAAACATA gB |##|||||| :|:|||:|::||||||||||||| ---GAAGTGTAC--TATAAAGTGTAACAAATAACATATATA gC T E K L C E Y K Y R K V M W I ST K UACAGAAAAGUUAUGUGAAUAUAAAAG

B.
EATRO 164 ND7 E1v1, D, C, B, A Y K K T W L H D K Y H F M L F L V V F L H L Y R F T F G P Q H P A I Q K N M T T W ST V S F Y V I F G S F F T F V S F Y I W S T A S R GAUACAAAAAAACAUGACUACAUGAUAAGUAuCAuuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG ||: |||#|||:#|||#|| gC 19TAAT-TAGATGTT-TAGAGC ||:||||:|:||||:|:|||||||:| ::||||#||||| gD 13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA |:|||||||:||::|:|:|:|:|:|||||||:|||||||||||| 09TATTATAGTAAGATGTAGTGAGAGCTATCAAAAGAATGTAAACATATAAA gE1v1 A H G V L C C L L Y F C G E F I V Y I D C I I G Y L H R G S T W C F M L F I V F L W W I Y C L Y W L Y Y R L F A S W ***CAGCACAuG**GuGuuuuAuGuuGuuuAuuGuAuuuuuGuGGuGA*AuuuAuuGuuuA**UAUUGAuUGuAuuAuA***G*GuuAUUUGCAUCGUGG ||||||:::||| |||::||::||:|||| | :||||:||||||||| gA 12TAAATAGTGAAT--ATAGTTAGTATGATAT---C-TAATAGACGTAGCACATATATA :||:||:|:|:||:||:|::|||:| ||:|||:||||| ||||||| 15TAATAAGTGATATGAAGATGCCATT-TAGATAGCAAAT--ATAACTACATA gB #||#||||| |:::|||||||||||||||:||| ----TCATGTAC--CGTGAAATACAACAAATAATATA gC T E K L C E Y K Y R K V M W I ST K UACAGAAAAGUUAUGUGAAUAUAAAAG 245 C. 
TREU 667 ND7 Variants E Variants E1v2, D, C D T K K H D Y M I S T F M L F L V V F L H L Y R F T F G P Q H P A Y K K T W L H D K Y I Y V I F G S F F T F V S F Y I W S T A S R GAUACAAAAAAACAUGACUACAUGAUAAGUACAuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG** :|#||||##|||#|: gC 17TAAGTGTATTAGAGT-- ||:||||:|:||||:|:|||||||:| ::||||#|||||# gD 13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA |||||::|:|:||:|:::||::|:|||||||||||||||| 12TAAATGTAGTGAAGATTGTCGGAGAAATGTAAACATAGCATATACA gE1v2 E2v1, D, C I Q K N M T T W ST V S F M L F L V V F L H L Y R F T F G P Q H P A D T K K H D Y M I S I I Y V I F G S F F T F V S F Y I W S T A S R GAUACAAAAAAACAUGACUACAUGAUAAGUAuCAuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG* :|#||||##|||#|: gC 17TAAGTGTATTAGAGT- ||:||||:|:||||:|:|||||||:| ::||||#|||||# gD 13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA :|||||:||||:|||::|:||:|||:||:||||||||||||| 15TAATATAGTGAATATAATGGAGACTATCGAAGAAATGTAAACATATAAA gE2v1 E2v2, D, C I Q K N M T T W ST V Q L L L V V F L H L Y R F T F G P Q H P A A D T K K H D Y M I S T I V I G S F F T F V S F Y I W S T A S R S GAUACAAAAAAACAUGACUACAUGAUAAGUACAAuuGuuAuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG***CAGC :|#||||##|||#|: :##| gC 17TAAGTGTATTAGAGT---GAAG ||:||||:|:||||:|:|||||||:| ::||||#|||||# gD 13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA |||:|:|||:|||:|:||:||||||:|:||:||||||||||||: 14TAATAGTAATCATTAGAAGAATGTAGATATGGCAAAATGTAAATATA gE2v2 246 E3t, D, C I Q K N M T T W ST V S F M L F L V V F T F V S F Y I W S T A S R D T K K H D Y M I S I I Y V I F G S F Y I C I V L H L V H S I P GAUACAAAAAAACAUGACUACAUGAUAAGUAuCAuuuAuGuuAuuuuuGGuAGuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG* ||:|:|||:|::|||::||:||||:|::||||:|||||||||| 12TATAGTAAGAGTCATTGAAGATGTGAGTATAGTAAAATGTAAAATATA gE4t :|||||:||||:|||::|:||:|||:||:| 15TAATATAGTGAATATAATGGAGACTATCGAAGAAATGTAAACATATAAA gE2v1 :|#||||##|||#|: gC 17TAAGTGTATTAGAGT- ||:||||:|:||||:|:|||||||:| ::||||#||||| gD 
13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA E4t, D, C I Q K N M T T W ST D T K K H D Y M I S T S Y F W ST Y K K T W L H D K Y K L F L V V F T F V S F Y I W S T A S R GAUACAAAAAAACAUGACUACAUGAUAAGUACAAGuuAuuuuuGGuAGuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG*** :|:|||:|::|||::||:||||:|::||||:|||||||||| 12TATAGTAAGAGTCATTGAAGATGTGAGTATAGTAAAATGTAAAATATA gE4t :|#||||##|||#|: gC 17TAAGTGTATTAGAGT--- ||:||||:|:||||:|:|||||||:| ::||||#||||| gD 13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA C Variant CFSt, B D D I W S T A S R Y A H G V L C C L L Y F C G E F I V Y I D C R H L V H S I P L C T W C F M L F I V F L W W I Y C L Y W L T T F G P Q H P A M H M V F Y V V Y C I F V V N L L F I L I ACAGACGACAGUGUCCACAGCAuCCCG***CuAuGCACAuG**GuGuuuuAuGuuGuuuAuuGuAuuuuuGuGGuGA*AuuuAuuGuuuA**UAuuGAuU |:||||#|: ||||:|||||| :|:||:|||||:||||||||||| gCFSt 57 Reads 14TAATTGTAGAGT---GATATGTGTAC--TATAAGATACAGCAAATAACATATATA ||||::||:|:|::::|||| |||:||::|||| |||||||| gB 26624 Reads 16TAATTAATAGTATGAGAGTGTCACT-TAAGTAGTAAAT--ATAACTAAACATA 247 D. 
EATRO 164 ND7 Variants E Variants E2v1, D, C I Q K N M T T W ST V S F M L F L V V F L H L Y R F T F G P Q H P A D T K K H D Y M I S I I Y V I F G S F F T F V S F Y I W S T A S R GAUACAAAAAAACAUGACUACAUGAUAAGUAuCAuuuAuGuuAuuuuuGGuAGuuuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG* ||: |||#|||:#|||#|| gC 19TAAT-TAGATGTT-TAGAGC- ||:||||:|:||||:|:|||||||:| ::||||#||||| gD 13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA |:|||||:|||||::|:|:|:::||||:|:|||||||||:| 14TATTATAGTGAATACGGTGAGAGTTATCAGAGAAATGTAAATAATATA gE2v1 E4e, D M I S T Y C Y W STOP M T T W STOP Y K K T W L H D K Y I L L L V V F T F V S F Y I W S T A S R S T GAUACAAAAAAACAUGACUACAUGAUAAGUACAuAuuGuuAuuGGuAGuuuuuACAuuuGuAuCGuuuuACAuuuG*GUCCACAGCAuCCCG***CAGCA |||||:|:|||:||||:||:|||||:||:||||||||||||| 13TATAATAGTAATCATCGAAGATGTAGACGTAGCAAAATGTAACATATA gE4e ||:||||:|:||||:|:|||||||:| ::||||#||||| gD 13TGTAAGTGTAGATATAGTAGAATGTAAGC-TGGGTGACGTAGATATATA 248 BC Variants C1ex, B, A D T K K H D Y M I S T R G D R R Q C P Q H P F I V S F I G I C C L Y K K T W L H D K Y K R R Q T T V S T A P V H C F I H W D L L F GAUACAAAAAAACAUGACUACAUGAUAAGUACAAGAGGAGACAGACGACAGUGUCCACAGCACCCG*UUCAuuGuuuCAuuCAuuG**GGAuuuGuuGuu :||:|| gB 15TAATAA : ||#|:|:|:|||||||:|| ||||||::|||| gC1ex 11T-AATTGATAGAGTAAGTGAC--CCTAAATGACAA L Y F C G E F I V Y I D C I I G Y L H R G T E K L C E Y K I V F L W W I Y C L Y W L Y Y R L F A S W Y R K V M W I ST K uAuuGuAuuuuuGuGGuGA*AuuuAuuGuuuA**UAUUGAuUGuAuuAuA***G*GuuAUUUGCAUCGUGGUACAGAAAAGUUAUGUGAAUAUAAAAG ||||||:::||| |||::||::||:|||| | :||||:||||||||| 12TAAATAGTGAAT--ATAGTTAGTATGATAT---C-TAATAGACGTAGCACATATATA gA :|:|:||:||:|::|||:| ||:|||:||||| ||||||| GTGATATGAAGATGCCATT-TAGATAGCAAAT--ATAACTACATA gB |||||| ATAACAAAATATA gC1ex C2ex, B, A D T K K H D Y M I S T R G D R R Q C P Q H P S L S Y S V F Y C C L Y K K T W L H D K Y K R R Q T T V S T A P V I V L Q C V L L L F GAUACAAAAAAACAUGACUACAUGAUAAGUACAAGAGGAGACAGACGACAGUGUCCACAGCACCCG**UCAuuGuCuuACAG*UGuGuuuuAuuGuuGuu :||:|| gB 15TAATAA : |#|:|:|||||||| 
||||||:|||||#|||| gC2ex 07TAATATAGAGTAT--ATTGATAGAATGTC-ACACAAGATAAC-ACAA
L Y F C G E F I V Y I D C I I G Y L H R G T E K L C E Y K I V F L W W I Y C L Y W L Y Y R L F A S W Y R K V M W I ST K
uAuuGuAuuuuuGuGGuGA*AuuuAuuGuuuA**UAUUGAuUGuAuuAuA***G*GuuAUUUGCAUCGUGGUACAGAAAAGUUAUGUGAAUAUAAAAG
||||||:::||| |||::||::||:|||| | :||||:||||||||| 12TAAATAGTGAAT--ATAGTTAGTATGATAT---C-TAATAGACGTAGCACATATATA gA :|:|:||:||:|::|||:| ||:|||:||||| ||||||| GTGATATGAAGATGCCATT-TAGATAGCAAAT--ATAACTACATA gB ||| ATATATA gC2ex
Bex, A
I Q K N M T T W ST V Q E E T D D S V H S T R F S T V G Y L L ST I C D T K K H D Y M I S T R G D R R Q C P Q H P F Q H S W L F V V D L W
GAUACAAAAAAACAUGACUACAUGAUAAGUACAAGAGGAGACAGACGACAGUGUCCACAGCACCCGUUUCAGCACAGUUGGuuAuuuGuuGuAGAuuuGu
:||||:|::||||:||:|:| gBex 13TAATAGATGACATTTAGATA
G E F I V Y I D C I I G Y L H R G T E K L C E Y K W I Y C L Y W L Y Y R L F A S W Y R K V M W I ST K
GGuGA*AuuuAuuGuuuA**UAUUGAuUGuAuuAuA***G*GuuAUUUGCAUCGUGGUACAGAAAAGUUAUGUGAAUAUAAAAG
||||||:::||| |||::||::||:|||| | :||||:||||||||| 12TAAATAGTGAAT--ATAGTTAGTATGATAT---C-TAATAGACGTAGCACATATATA gA |::|| |:|||||||||| ||||||| CTGCT-TGAATAACAAAT--ATAACTATATA gBex
APPENDIX L. gRNAs identified to edit the ND7 5’ mRNAs found in both TREU 667 and EATRO 164 gRNA transcriptomes.
Editing Region E D Population E1 version 1 E1 version 1 E1 version 1 E1 version 2 E1 version 2 E1 version 2 E2 version 1 E2 version 1 E2 version 1 E2 version 1 E2 version 1 E2 version 1 E2 version 1 E2 version 1 E2 version 1 E2 version 2 E2 version 2 E4t E4t E4t E4t E4t E4e E4e E4e E4e E4e E4e E4e E4e E4e D D D D D D D D D Sequences AAAT ATACAAATGTAAGAAAACTATCGAGAGTGATGTAGAATGATATT ATTTTTTTTT AAAT ATACAAGTGTAAGAAAACTATCGAGAGTGATGTAGAATGATATT ATTTTTTTTTTT AAAT ATACAAATGTAAGAAAACTATCGAGAGTGATGTAGAATGATATTT TTTTTTTTTTAT ATATA ATAAATGTAAAGAGACTATTGAGAGTGGCATAAGTGT TTTTATTTTTTTTTTTTAT ACATAT ACGATACAAATGTAAAGAGGCTGTTAGAAGTGATGTAAAT TTTTTTTTTTTG ATACATAT ACGATACAGATGTGAAGAAACTATTAGAGATAATGTAAAT TTTTCTTTTTT ATATA ATAAATGTAAAGAGACTATTGAGAGTGGCATAAGTGATATT ATTTTTTTTTTTTTT ATATA ATAAATGTAAAGAGACTATTGAGAGTGGCATAAGTGATATTT TTTTTTTTTGT ATATA ATAAATGTAAAGAGACTATTGAGAGTGGCATAAGTGAT TTTTTTTTTTCT ATATA ATAAATGTAAAAAGACTATTGAGAGTGGCATAAGTGATATT ATTTTTTTTTGTTTTT ATATAATAAA TAAAGAGACTATTGAGAGTGGCATAAGTGATATT ATAATGATTTTTTTTTTTTTT AAAT ATACAAATGTAAAGAAGCTATCAGAGGTAATATAAGTGATAT AATTTTTTTGTTTTTTT ATATAAT AATGTAAAGAGACTATTGAGAGTGGCATAAGTGATATT ATAATGATATTTTTTTTT AT ATATAAATGTAAAGAGACTATTGAGAGTGGCATAAGTGATATT ATTTTTT ATATAC ACAAATGTAAAGAGACTATCGAGAGTGACATAAGTGATAT AATTTTTTTATTTT ATATA ATAAATGTAAAGAGACTATTGATAGTGG CATAAGTGATATTATAATGATTTTTTTTTTTTTTTT ATA TAAATGTAAAACGGTATAGATGTAAGAAGATTACTAATGATAATT TTTTTTTTTTTT ATATA ATAAATGTAAAGACTATTGAGAGTGGCATAAGTGATATT ATAATTTTTTTTTTTTT ATATA ATAAATGTAGAGACTATTGAGAGTGGCATAAGTGATATT ATAATTTTTTTTTTTTT ATATA AAATGTAAAATGATATGAGTGTAGAAGTTACTGAGAATGATAT TTTTTTTTTTTC ATATA AAATGTAAAATGATATGAGTGTAGAAGTTACTGAGAATGATATA TTTTTTTTTTTC ATATA AAATGTAAAATGATATGAGTGTAGAAGTTACTGAGAATGAT TTTTTTTTTTTTC ATATAC AATGTAAAACGATGCAGATGTAGAAGCTACTAATGATAATAT TTTTTTTTTTTTG ATATAC AATGTAAAACGATGCAGATGTAGAAGCTACTAATGATAAT TTTTTTTTTTTC ATATACAA TAAAACGATGCAGATGTAGAAGCTACTAATGATAATAT TTTTTTATTTTTT ATATACAATGTAAAACGATT CAGATGTAGAAGCTACTAATGATAATAT TTTCTTTTTTTTT ATATAC 
AATGTAAAACGATGCAGATGTAGAAGCTACTAATAATAATAT TTTTTTTAATTTTTTTTTTTTT ATATACAATGT AAACGATGCAGATGTAGAAGCTACTAATGATAATAT TTTTTTCTTTTTTTT ATATAC AATGTAAAACGATACAGATGTAGAAGCTACTAATGATAATAT TTTCTTTTTTT ATATA AAATGTAAAATGATATGAGTGTAGAAGTTACTGATAATGATAT ATTTTTTTTTTTTT ATATA AAATGTAAAATGATATGAGTGTAGAAGTTACTGATAATGAT TCTTTTTTTTTTTTT ATATATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAATGTTTTTTT TTTTTT ATATATAGATGCTGTA GATTAGATGTAGAGTGATATAAG CGTAAATTTTTTTTTTTTTG ATATATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAATTTTTTTT CTTTTTTT ATATATAA ATGCTGTGGATTAGATGTAGAATGATATGAGTGTGAAATTTTTTTT TTTTTTC ATATA AAATGTAAAATGATATGAGTGTAGAAG TTACTGAGAATGATATTTTTTTTTTTTC ATATATAAATGCA GTGGATCAGATGTAAGATGGTATAAGTGTGAATATTTTTTT TTTTT ATATATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAATGTTT GTTTTTTTTTT ATATATAA ATGCTGTGGATTAGATGTAGAATGATATGAGTGTGAATTTTTTTT TTTTTG GC GGGTCGAATGTAAGATGATATAGATGTGAA TGTATTTTTTTTTTT 251 Reads TREU 667 4,517 803 7,073 213 1,460 192 259 10,349 2,680 2,360 16 5 36,611 3,590 15,885 12,929 455 332 332 229 Reads EATRO 164 6 1 193 148,603 6,174 1,811 659 593 504 241 203 34 49 26 218,365 18,562 793 777 712 439 203 4,096 2,136 797 24 D D D D D D D D D D D D C C C1ex C2ex C2ex C2ex CFSt CFSt B B B B B B B B B B B Bex Bex Bex Bex Bex Bex Bex A A C B A AGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAATGTTCTTTT TTTTTTT ATATATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAATGTTTTTT GTTTTTTT TTATATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAATTTTTTTTGT TTT ATATATAGATGCA GTGGGTCAAATGTAAGATGATATAGATGTGAA TGTTTTTTATTTT ATATATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAATGTTTTTTTGT TTTTT ATATATAGATGCAGT GGTCGAATGTAAGATGATATAGATGTGAA TGTTTTTTTTT ATATATAGATGCAT TGGGTCGAATGTAAGATGATATAGATGTGAA TGTTTGTTTTTTTTTTTT GTGGGCCGAATGTAAGATGATATAGATGTGAA TGTTTTTTATTTTTTT ATATATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAATGTTTTT GTTTTTTT TATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTGAATTCTTTTT TTTTTTT ATATATAAATGC TGGATTAGATGTAGAATGATATGAGTGTGAAA TTTTTTTTTTT ATATATAGATGCA GTGGGTCGAATGTAAGATGATATAGATGTAAA TGTATTTTTTTTTTTTTT ATATAATAAACAACATAAAGTGCCATGT 
ACTCGAGATTTGTAGATTAATTTTTTTTTTTTTTTTTTT ATAT ATACAATAAACAATGTGAAATATCATGTG AAGTGAGATTATGTGAATTTCTTTTTTTTTTTTT ATATAAA ACAATAAACAGTAAATCCCAGTGAATGAGATAGT TAATTTTTTTTTTT ATATATAAAC ACAATAGAACACACTGTAAGATAGT TATATGAGATATAATTTTTTT ATATATAAAC ACAATAGAGCACACTGTAAGATAGT TATATGAGATATAATTTTTTTTTT ATATATATAAAC ACAATAGAACGCACTGTAAGATAGT TATATGAGATATTTTTTT ATATA ACAATAAACAACATAGAGCATTGTGTGTATAGTG AGATGTTACTTTTTCTTATTTTTTTGTATTTGTGTT ATAT ATACAATAAACGACATAGAATATCATGTGTATAGTG AGATGTTAATTTTTTTTTGTTTT ATAC ATCAATATAAACGATAGATTTACCGTAGAAGTATAGTGAATAAT TTTTTTTTTTATTT ATATA CAATATAAACAGTAGATTCACTGCAGAAGTATGATAGATAAT TTTTTTTTTTG ATACAT ATATAAACAATGAATTCACTGTGAAGATACGATAGATGATA TATTTTTTTTTGTTTT ATACA AATCAATATAAATGATGAATTCACTGTGAGAGTATGATAA TTAATTTTTTTTTTTTTTCT ATATA CAATATAAACAGTAGATTCACTGCAGAGATATGATAGATAATA TTTTTTTTTTT ATATA CAATATAAACAGTAGATTCACTGCAGAGATATGATAGATAATAA TTTTTTTTTTTCTA ATATA CAATATAAACAGTAGATTCACTGCAGAGATATGATAGATAATAAATTTT TTTTTTT ATACAT ATATAAACAATGAATTCACTGTGAAGATACAATAGATGATA TATTTTTTTTTTTTT ATATA CAATATAAACAGTAGATTCACTGCAGAGATATGATAGATAAT TTTTTTTTTT ATACA AATCAATATAAATGATGAATTCACTGTGAGAGTATGAT TGTTTTTTTTTTTT ATATAT AACAATAAATTCATCATAGAGATATAGTAAGTGATATGAGACATT TTTTTT ATAT ATCAATATAAACAATAAGTTCGTCATAGATTTACAGTAGATAATT TTTTTTTTTTT ATAT ATCAATATAAACAATAAGTTCGTCATAGATTTACAGTAGATAATTA TTTTTTTTTTT AT ATCAATATAAACAATAAGTTCGTCATAGATTTACAGTAGATAATTAA GTTTTTTTTTTTT AT ATCAATATAAACAATAAGTTCGTCATAGATTTACAGTAGATAATTAATT TTTTTTTTTT ATAT ATCAATATAAACGATGAATTTGTCATAGATTTACAGTAGATAATTA TTTTTTTTTC ATCAATATAAACGATGAATTTGTCATAGATTTACAGTAGATAATT TTTT ATAT ATCAATATAAACAATGAGTTCATCATAGATTTACAGTAGATAATTA TTTTTTTTT ATATATA CACGATGCAGATAATCTATAGTATGATTGATATAAGTGATAAATTT TTTTTTTTT ATA TATGATGCAAATAACTTATGATACGATTGATGTAGATGATAAATTT TTTTTTTTTTG 252 211 203 195 158 154 153 128 123 112 109 106 101 114 183 22 14 13 23 57 4,718 26,624 3,050 1,675 798 579 327 110 30 57 38 25 993 4,939 6,185 131 103 4 3,969 3,557 390 20,696 1,024 714 693 535 APPENDIX M. Predicted ND7 protein sequences. 
Translation began at the first start codon that was not followed by a premature termination codon. Blue sequences are found in TREU cells only, orange sequences are found in EATRO cells only, and black sequences are found in both cell lines.

RF1
E1v1 MLFLVVFLHLYRFTFGPQHPAAHGVLCCLLYFCGEFIVYIDCIIGYLHRGTEKLCEYK
E1v2 MISTFMLFLVVFLHLYRFTFGPQHPAAHGVLCCLLYFCGEFIVYIDCIIGYLHRGTEKLCEYK
E2v1FS MISIIYVIFGSFFTFVSFYIWSTASRYAHGVLCCLLYFCGEFIVYIDCIIGYLHRGTEKLCEYK
E2v2FS MISTIVIGSFFTFVSFYIWSTASRYAHGVLCCLLYFCGEFIVYIDCIIGYLHRGTEKLCEYK

RF3
E2v1 MISIIYVIFGSFFTFVSFYIWSTASRSTWCFMLFIVFLWWIYCLYWLYYRLFASWYRKVMWI!
E2v2 MISTIVIGSFFTFVSFYIWSTASRSTWCFMLFIVFLWWIYCLYWLYYRLFASWYRKVMWI!

Minor RF2 variants
E3t MISIIYVIFGSFYICIVLHLVHSIPQHMVFYVVYCIFVVNLLFILIVL!
E1v1FS MLFLVVFLHLYRFTFGPQHPAMHMVFYVVYCIFVVNLLFILIVL!
E1v2FS MISTFMLFLVVFLHLYRFTFGPQHPAMHMVFYVVYCIFVVNLLFILIVL!

No AUG
E4tRF3 LVVFTFVSFYIWSTASRSTWCFMLFIVFLWWIYCLYWLYYRLFASWYRKVMWI!
E4eRF3 LLLVVFTFVSFYIWSTASRSTWCFMLFIVFLWWIYCLYWLYYRLFASWYRKVMWI!

APPENDIX N. CR3 gRNA Alignments for TREU 667 SDM79 (A), and all editing variants (B). Amino acid translations are shown above the mRNA sequences. The cDNA sequence of the most abundant gRNA in its sequence class is shown aligned beneath the fully edited mRNA. Lowercase u’s indicate uridines added by editing; asterisks indicate encoded uridines deleted during editing. Nucleotides and deletion sites in the fully edited mRNA were numbered starting from the 5’ end (+1=0). gRNAs are colored based on transcript abundance as follows: Blue<100; Green<1,000; Purple<10,000; Orange<100,000; Red>100,000; Black=not quantified. Watson-Crick (|) and G:U base pairs (:) are indicated. Mismatches are indicated by an octothorpe (#). Highlighted sequences mark regions where multiple CU configurations are possible.
A.
TREU 667 CR3 gRNA alignment gGt, gFt, gEt’, gE, gD, gC, gB1B2, gA1 M F D C L V L L F F Y C L F V H F AGAAAUAUAAAUAUGUGUAUGAUAUAUAAuuAAuuAuuuuCAuuuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuGuACAuuu |||||||:|:|||||:|:||::||||| |||||::||||||||||| gGt 13TTAATTAGTGAAAGTGAGATGTAAACT----AACAAGTCAAAACAACAATATA |::|:|:::|||:|:||:|:||:|:||:||||||||| gFt 12TACATATTAGAGTGACAGAGAAGTGACGAGCAGACATGTAAA |::|||||:| gEt’ 11TATATAGTATGTAGA F C F L F V C D L F L C L L F S F C F L L D F C F L F N M G L uuuuuGuuuuuuAuuuGuuuGuG***A**UUUGuuuuuAuGuuuGuuA*UUUAGuuuuuGuuuuuuAuuGGAuuuuuGuuuuuuAuuuAAuAuGGGuuuA | |:||||::|::||| ::||||||:|||:|:|||||||||||:|:| ATATA gFt 11TAGAATATGAGTAAT-GGATCAAAGACAGAGAATAACCTAAAGATATA gD :||:||||::|||||:||||||| | |||:| |||:|:::|||:|||:|||||::||:||| GAAGACAAGGAATAAGCAAACAC---T--AAATACATA gEt’ gC 12TAAGAGTGAAAGATAGATTATGTCCGAAT :|||||:|||:|:::|:||| | :|||:|:|||||||||||| |:|| ::||| 13TATAGTTATAGATGGAGCAC---T--GAACGAGAATACAAACAAT-AGATA gE gB1B2 13TGAAT L L C F I L Q I F S V I I I I V Y K F S L L D STOP uUGuuGuGuuuuAuAuuACAGAuuuuuAGuGuuAuCAuUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAAAAAAGUAUGCAAAUAAUUUUUGU |::|||||||| AGTAACACAAATATAAA gC |::|::|:|:|:|:||||||||||:|:|||||||||||||| AGTAGTATAGAGTGTAATGTCTAAGAGTCACAATAGTAATATATA gB1B2 :|:|:|||||||:|||::||||:||||:|:||:||||||||||| gA1 04TATAGTAGTAATGATAGTATATGTTCAGAGGCGATAATCTAATTATATA 254 B. 
TREU 667 Cell line variants A Variants A1 L L C F I L Q I F S V I I I I V Y K F S L L D ST K S M Q I I F C C V L Y Y R F L V L S L L L Y I S F R Y ST I K K V C K ST F L I V V F Y I T D F ST C Y H Y Y C I ST V F V I R L K K Y A N N F C AuUGuuGuGuuuuAuAuuACAGAuuuuuAGuGuuAuCAuUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAAAAAAGUAUGCAAAUAAUUUUUGU ||::|::|:|:|:|:||||||||||:|:|||||||||||||| TAGTAGTATAGAGTGTAATGTCTAAGAGTCACAATAGTAATATATA gB1B2 :|:|:|||||||:|||::||||:||||:|:||:||||||||||| 04TATAGTAGTAATGATAGTATATGTTCAGAGGCGATAATCTAATTATATA gA1 A2 L L C L Y Y F R F Y G I I F I I V Y K F S L L D ST K S M Q I I F C C V Y I I S D F M V S F L L L Y I S F R Y ST I K K V C K ST F L I V V F I L F Q I L W Y H F Y Y C I ST V F V I R L K K Y A N N F C AuUGuuGuGuuuAuAuuAuuuCAGAuuuuAuGGuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAAAAAAGUAUGCAAAUAAUUUUUGU ||:|||:||:|:||:::||||:|||:||||||:|||||||| 14TAATGGTAGAAGTGATGGTATATGTTCGAAAGCAGTAATCTAAAATATA gA2 ||::|::|:|:||||:||:|:||:|||:||||||||||||||||| TAGTAGTATAGATATGATGAGGTTTAAGATACCATAGTAAAAATATATA gB3t 255 B Variants B1B2 D F C F L F N M G L L L C F I L Q I F S V I I I I V Y K F S I F V F Y L I W V Y C C V L Y Y R F L V L S L L L Y I S F R G F L F F I ST Y G F I V V F Y I T D F ST C Y H Y Y C I ST V F V GGAuuuuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuuAuAuuACAGAuuuuuAGuGuuAuCAuUAuuAuuGuAuAuAAGuUUUCG |||:|:::|||:|||:|||||::||:||||::|||||||| 12TAAGAGTGAAAGATAGATTATGTCCGAATAGTAACACAAATATAAA gC ::||||::|::|:|:|:|:||||||||||:|:|||||||||||||| 13TGAATAGTAGTATAGAGTGTAATGTCTAAGAGTCACAATAGTAATATATA gB1B2 :|:|:|||||||:|||::||||:||||:|:|| gA1 04TATAGTAGTAATGATAGTATATGTTCAGAGGC B4t D F C F L F N M G L L L C L F F F F I L S F D M L L S F L L L Y I S F R I F V F Y L I W V Y C C V Y F F F L F Y H L I C C Y H F Y Y C I ST V F G F L F F I ST Y G F I V V F I F F F Y F I I W Y V V I I F I I V Y K F S GGAuuuuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuAuuuuuuuuuuuuAuuuuAuCAuuuGAuAuGuuGuuAuCAuuuuUAuuAuuGuAuAuAAGuUUUC :||:::||:|:|:|:||||:|:|:||||||||||||||||||:| 
12TAATGTAAGTGAGAGAAAAGAGTGAAATAGTAAACTATACAATATATA gB4t’ |||:|:::|||:|||:|||||::||:||||::|||||||||| :|||:|||:||:|:||:::||||:|||:|||| 12TAAGAGTGAAAGATAGATTATGTCCGAATAGTAACACAAATATAAA gC gA2 14TAATGGTAGAAGTGATGGTATATGTTCGAAAG |||||:|::||||:|:::||||||:|:|||||:||||||||| gB4 08TAATATAGTGAGTTATATAGTGATAGTAGAGATAATGACATATATTATATA B3t D F C F L F N M G L L L C L Y Y F R F Y G I I F I I V Y K F S I F V F Y L I W V Y C C V Y I I S D F M V S F L L L Y I S F R G F L F F I ST Y G F I V V F I L F Q I L W Y H F Y Y C I ST V F V GGAuuuuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuAuAuuAuuuCAGAuuuuAuGGuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCG |||:|:::|||:|||:|||||::||:||||::||||||||||||| ||:|||:||:|:||:::||||:|||:||||| 12TAAGAGTGAAAGATAGATTATGTCCGAATAGTAACACAAATATAAA gC gA2 14TAATGGTAGAAGTGATGGTATATGTTCGAAAGC |||::|::|:|:||||:||:|:||:|||:||||||||||||||||| 12TATAGTAGTATAGATATGATGAGGTTTAAGATACCATAGTAAAAATATATA gB3t 256 C Variants C C L L F S F C F L L D F C F L F N M G L L L C F I L Q I F S V I V C Y L V F V F Y W I F V F Y L I W V Y C C V L Y Y R F L V L M F V I ST F L F F I G F L F F I ST Y G F I V V F Y I T D F ST C Y AuGuuuGuuA*UUUAGuuuuuGuuuuuuAuuGGAuuuuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuuAuAuuACAGAuuuuuAGuGuuA |||:|:::|||:|||:|||||::||:||||::|||||||| 12TAAGAGTGAAAGATAGATTATGTCCGAATAGTAACACAAATATAAA gC ||::|::||| ::||||||:|||:|:|||||||||||:|:| ::||||::|::|:|:|:|:||||||||||:|:||||||| TATGAGTAAT-GGATCAAAGACAGAGAATAACCTAAAGATATA gD gB1B2 13TGAATAGTAGTATAGAGTGTAATGTCTAAGAGTCACAAT D Variants D V S L Y F F V D F C L C L L F S F C F L L D F C F L F N M G L Y H C I F L W I F V Y V C Y L V F V F Y W I F V F Y L I W V Y I I V F F C G F L F M F V I ST F L F F I G F L F F I ST Y G F I GuAuCAuuGuAuuuuuuuGuGG***AUUUUUGuuuAuGuuuGuuA*UUUAGuuuuuGuuuuuuAuuGGAuuuuuGuuuuuuAuuuAAuAuGGGuuuA :|||||||:||:|:|:||:|:| ||||:|||:||||||||||| | |||:|:::|||:|||:|||||::||:||| TATAGTAATATGAGAGAATATC---TAAAGACAGATACAAACAAT-ATATA gEt gC 12TAAGAGTGAAAGATAGATTATGTCCGAAT :||||::|::||| ::||||||:|||:|:|||||||||||:|:| 
11TAGAATATGAGTAAT-GGATCAAAGACAGAGAATAACCTAAAGATATA gD 257 E Variants E C L V L L F F Y C L F V H F F C F L F V C D L F L C L L F S F C I V W F C C F F I V C L Y I F F V F Y L F V I C F Y V C Y L V F V L F G F V V F L L F V C T F F L F F I C L W F V F M F V I ST F L A****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuGuACAuuuuuuuuGuuuuuuAuuuGuuuGuG***A**UUUGuuuuuAuGuuuGuuA*UUUAGuuuuuG |::|:|:::|||:|:||:|:||:|:||:|||||||||| |:||||::|::||| ::||||||:|| 12TACATATTAGAGTGACAGAGAAGTGACGAGCAGACATGTAAAATATA gFt gD 11TAGAATATGAGTAAT-GGATCAAAGAC |::|||||:|:||:||||::|||||:||||||| | |||:| 11TATATAGTATGTAGAGAAGACAAGGAATAAGCAAACAC---T--AAATACATA gEt’ :|:||| | :|||:|:|||||||||||| |:|| 13TATAGTTATAGATGGAGCAC---T--GAACGAGAATACAAACAAT-AGATA gE Et S F G G L L C V S L Y F F V D F C L C L L F S F C F L L V L V V Y C V Y H C I F L W I F V Y V C Y L V F V F Y W F W W F I V C I I V F F C G F L F M F V I ST F L F F I G A******GuuuuGGuGGUUUAuuGuGuGuAuCAuuGuAuuuuuuuGuGG***AUUUUUGuuuAuGuuuGuuA*UUUAGuuuuuGuuuuuuAuuG | |:|:|::|::|:||::||||:||||||#|||||| T------CGAGATTATTAGATGGCACATATAGTA-CATAAATTATA gFGtxp :::::|||||||:||:|:|:||:|:| ||||:|||:||||||||||| | 13TGTGTATAGTAATATGAGAGAATATC---TAAAGACAGATACAAACAAT-ATATA gEt :||||::|::||| ::||||||:|||:|:|||||| gD 11TAGAATATGAGTAAT-GGATCAAAGACAGAGAATAAC Et’ C L V L L F F Y C L F V H F F C F L F V C D L F L C L L F S F C I V W F C C F F I V C L Y I F F V F Y L F V I C F Y V C Y L V F V L F G F V V F L L F V C T F F L F F I C L W F V F M F V I ST F L A****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuGuACAuuuuuuuuGuuuuuuAuuuGuuuGuG***A**UUUGuuuuuAuGuuuGuuA*UUUAGuuuuuG |::|:|:::|||:|:||:|:||:|:||:|||||||||| |:||||::|::||| ::||||||:|| 12TACATATTAGAGTGACAGAGAAGTGACGAGCAGACATGTAAAATATA gFt gD 11TAGAATATGAGTAAT-GGATCAAAGAC |::|||||:|:||:||||::|||||:||||||| | |||:| 11TATATAGTATGTAGAGAAGACAAGGAATAAGCAAACAC---T--AAATACATA gEt’ :|:||| | :|||:|:|||||||||||| |:|| 13TATAGTTATAGATGGAGCAC---T--GAACGAGAATACAAACAAT-AGATA gE 258 FG Variants FGtx K T L V C S F G G L L C V S L Y F F V D F C L K H ST F 
V V L V V Y C V Y H C I F L W I F V Y K N I S L ST F W W F I V C I I V F F C G F L F M AAAAACAuuAGuuuGuA******GuuuuGGuGGUUUAuuGuGuGuAuCAuuGuAuuuuuuuGuGG***AUUUUUGuuuA ||||::|::|| |:|:|::|::|:||::||||:||||||#|||||| 14TAATTGAGTAT------CGAGATTATTAGATGGCACATATAGTA-CATAAATTATA gFGtxp :::::|||||||:||:|:|:||:|:| ||||:|||:|| gEt 13TGTGTATAGTAATATGAGAGAATATC---TAAAGACAGAT Ft I N Y F H F M F D C L V L L F F Y C L F V H F F C F L F V C D L F L I I F I L C L I V W F C C F F I V C L Y I F F V F Y L F V I C N ST L F S F Y V W L F G F V V F L L F V C T F F L F F I C L W F V AAuuAAuuAuuuuCAuuuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuGuACAuuuuuuuuGuuuuuuAuuuGuuuGuG***A**UUUGu |||||||:|:|||||:|:||::||||| |||||::||||||||||| :|:||| | :|||: TTAATTAGTGAAAGTGAGATGTAAACT----AACAAGTCAAAACAACAATATA gGt gE 13TATAGTTATAGATGGAGCAC---T--GAACG |::|:|:::|||:|:||:|:||:|:||:|||||||||| 12TACATATTAGAGTGACAGAGAAGTGACGAGCAGACATGTAAAATATA gFt |::|||||:|:||:||||::|||||:||||||| | |||:| gEt’ 11TATATAGTATGTAGAGAAGACAAGGAATAAGCAAACAC---T--AAATA FGt K Q C V C C C F V L I L V V H F F C F L F V C D L F L K N N V Y V V V L F W F W L Y I F F V F Y L F V I C F Y K T M C M L L F C F D F G C T F F L F F I C L W F V F M AAAAACAAuGuGuA*****UGuuGuuGuuuuGuuuuG***AuuuuGGuuGuACAuuuuuuuuGuuuuuuAuuuGuuuGuG***A**UUUGuuuuuA ||:|:|| ::|:||:::|||:|:||: |||:|||||||||||||| 04TATATAT-----GTAGTAGTGAAATAGAAT---TAAGACCAACATGTAAAATATATA gFGtp :|::|||||:|:||:||||::|||||:||||||| | |||:| 11TATATAGTATGTAGAGAAGACAAGGAATAAGCAAACAC---T--AAATACATA gEt’ :|:||| | :|||:|:||| gE 13TATAGTTATAGATGGAGCAC---T--GAACGAGAAT 259 Gt E I ST I C V W Y I I N Y F H F M F D C L V L L F F Y C L F V K Y K Y V Y D I ST L I I F I L C L I V W F C C F F I V C L R N I N M C M I Y N ST L F S F Y V W L F G F V V F L L F V C AGAAAUAUAAAUAUGUGUAUGAUAUAUAAuuAAuuAuuuuCAuuuuAuGuuuGA****UUGuuuGGuuuuGuuGuuuuUUUAuuGuuuGuuuG |||||||:|:|||||:|:||::||||| |||||::||||||||||| 13TTAATTAGTGAAAGTGAGATGTAAACT----AACAAGTCAAAACAACAATATA gGt |::|:|:::|||:|:||:|:||:|:||:|| gFt 
12TACATATTAGAGTGACAGAGAAGTGACGAGCAGAC
APPENDIX O. CR3 gRNA Alignments for EATRO 164 (A), and all editing variants (B). Amino acid translations are shown above the mRNA sequences. The cDNA sequence of the most abundant gRNA in its sequence class is shown aligned beneath the fully edited mRNA. Lowercase u’s indicate uridines added by editing; asterisks indicate encoded uridines deleted during editing. Nucleotides and deletion sites in the fully edited mRNA were numbered starting from the 5’ end (+1=0). gRNAs are colored based on transcript abundance as follows: Blue<100; Green<1,000; Purple<10,000; Orange<100,000; Red>100,000; Black=not quantified. Watson-Crick (|) and G:U base pairs (:) are indicated. Mismatches are indicated by an octothorpe (#). Highlighted sequences mark regions where multiple CU configurations are possible.
A. EATRO 164 CR3 SDM79/SDM80 gFGep, gEep, gDep, gCe, gB5e, gB4, gA2
M C M I Y K L T I V L G G I L V I I V Y L V V M S
AGAAAUAUAAAUAUGUGUAUGAUAUAUAAAuuAACAAuuGuGuuA******GGuGGG***AuuuuGGuGAuCAuuGuuuAuuuGGuuG*UUA****UGAG
||||||||:::|:||| |||::: ||:|::||||||||:||| 13TAATTGTTGGTATAAT------CCATTT---TAGAGTCACTAGTAGCAACATATA gFGep |||||::||:|:||:::|| :|| ::|| gEep 12TATAGTAGTAAGTGAATTGAC-GAT----GTTC :||: |:| ::|: gDep 12TAAT-AGT----GTTT
C I L C F V M V I V F Y L I W V Y C C V Y I T Y V F L L L L S F L
uuGuAUUUUAuGuuuuGuuAuGGuuAuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuAuAuuACuuAuGuAuuuuuAuuGuuGuuAuCAuuuuUA
|::||||||||||||||| |||:|::::||||:||:||||||:|||:|:||:|||||||||||| AGTATAAAATACAAAATATATA gEep 12TAATAGTGTAAATGTAGTGAATATATAGAGATGACAACAATAGTATATA gB5e :||:|:|||||:|:|::|:|||:|||||||||| |:|:::||||||:|:|| GACGTGAAATATAGAGTAGTACTAATAACAAAATATA gDep gB4 11TAATATAGTGAGTTATATAGTGATAGTAGAGAT ::|||:::|||:|||:|||:||:|||||||::||||||||||||| :||||||::||:| 13TGATAGTGAAAGATAGATTGTATCCAAATAGTAACACAAATATAAA gCe gA2 13TAATAGTGGAAGT
L L Y I S F R Y STOP
uuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAAAAAAGUAUGCAAAUAAUUUUUGU
|||:||||||||| AATGACATATATTATATA gB4
|:||::||||:|||||:||||:|||||||| AGTAGTATATGTTCAAGAGCAGTAATCTAAAATATA gA2 261 B. EATRO 164 CR3 Variants A Variants A1 L L C F I L Q I F S V I I I I V Y K F S L L D STOP C C V L Y Y R F L V L S L L L Y I S F R Y STOP I V V F Y I T D F ST C Y H Y Y C I STOP AuUGuuGuGuuuuAuAuuACAGAuuuuuAGuGuuAuCAuUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAAAAAAGUAUGCAAAUAAUUUUUGU :|:|:|||||||:|||::||||:||||:|:|||||||||||||| 11TATAGTAGTAATGATAGTATATGTTCAGAGGCAATAATCTAATTATA gA1 ||::|::|:|:|||||:||||||||:|:|||||||||||| TAGTAGTATAGAATATGATGTCTAAGAGTCACAATAGTAAATATATA gB1B2 A2 L L C L Y Y L C I F I V V I I F I I V Y K F S L L D STOP C C V Y I T Y V F L L L L S F L L L Y I S F R Y STOP I V V F I L L M Y F Y C C Y H F Y Y C I STOP AuUGuuGuGuuuAuAuuACuuAuGuAuuuuuAuuGuuGuuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCGUUAUUAGAUUAAAAAAGUAUGCAAAUAAUUUUUGU :||||||::||:||:||::||||:|||||:||||:|||||||| 13TAATAGTGGAAGTAGTAGTATATGTTCAAGAGCAGTAATCTAAAATATA gA2 |:|:::||||||:|:|||||:||||||||| 11TAATATAGTGAGTTATATAGTGATAGTAGAGATAATGACATATATTATATA gB4 |||:|::::||||:||:||||||:|||:|:||:|||||||||||| TAATAGTGTAAATGTAGTGAATATATAGAGATGACAACAATAGTATATA gB5e 262 B Variants B1B2 G Y C F L F N M G L L L C F I L Q I F S V I I I I V Y K F S V I V F Y L I W V Y C C V L Y Y R F L V L S L L L Y I S F R L L F F I ST Y G F I V V F Y I T D F ST C Y H Y Y C I STOP GGuuAuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuuAuAuuACAGAuuuuuAGuGuuAuCAuUAuuAuuGuAuAuAAGuUUUCG ||||::|::|:|:|||||:||||||||:|:|||||||||||| gB1B2 14TAATAGTAGTATAGAATATGATGTCTAAGAGTCACAATAGTAAATATATA :|:|:|||||||:|||::||||:||||:|:|| gA1 11TATAGTAGTAATGATAGTATATGTTCAGAGGC ::|||:::|||:|||:|||:||:|||||||::||||||||||||| 13TGATAGTGAAAGATAGATTGTATCCAAATAGTAACACAAATATAAA gCe B5e G Y C F L F N M G L L L C L Y Y L C I F I V V I I F I I V Y K F S V I V F Y L I W V Y C C V Y I T Y V F L L L L S F L L L Y I S F R L L F F I ST Y G F I V V F I L L M Y F Y C C Y H F Y Y C I STOP GGuuAuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuAuAuuACuuAuGuAuuuuuAuuGuuGuuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCG 
|||:|::::||||:||:||||||:|||:|:||:|||||||||||| 12TAATAGTGTAAATGTAGTGAATATATAGAGATGACAACAATAGTATATA gB5e |:|:::||||||:|:|||||:||||||||| gB4 11TAATATAGTGAGTTATATAGTGATAGTAGAGATAATGACATATATTATATA ::|||:::|||:|||:|||:||:|||||||::||||||||||||| :||||||::||:||:||::||||:|||||:||| 13TGATAGTGAAAGATAGATTGTATCCAAATAGTAACACAAATATAAA gCe gA2 13TAATAGTGGAAGTAGTAGTATATGTTCAAGAGC B6e G Y C F L F N M G L L L C L Y Y L M Y F Y C C Y H F Y Y C I STOP V I V F Y L I W V Y C C V Y I I L C I F I V V I I F I I V Y K F S L L F F I ST Y G F I V V F I L S Y V F L L L L S F L L L Y I S F R GGuuAuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuAuAuuAuCuuAuGuAuuuuuAuuGuuGuuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCG ||::|::|:|:||||:|||||:|:|||||:||||||||||||| 12TAGTAGTATAGATATGATAGAGTGCATAAGAATAACAACAATATATA gB6e |:|:::||||||:|:|||||:||||||||| gB4 11TAATATAGTGAGTTATATAGTGATAGTAGAGATAATGACATATATTATATA ::|||:::|||:|||:|||:||:|||||||::||||||||||||| :||||||::||:||:||::||||:|||||:||| 13TGATAGTGAAAGATAGATTGTATCCAAATAGTAACACAAATATAAA gCe gA2 13TAATAGTGGAAGTAGTAGTATATGTTCAAGAGC 263 B7e S N L ST C C Y L F G Y I I W Y V V I I F I I V Y K F S G V I Y S V V I C L D I S F D M L L S F L L L Y I S F R E ST F I V L L F V W I Y H L I C C Y H F Y Y C I STOP GGAGuAAuuuAuAGuGuuGuuAuuUGuuuGGAuAuAuCAuuuGAuAuGuuGuuAuCAuuuuUAuuAuuGuAuAuAAGuUUUCG :|||||:|:|:|||::|:|:||::|::|||||||| 09TATTAAGTGTTACAGTAGTGAATGAGTCTATATAG gB7e |||:|:|:|||::|:|:||::|::||||||||||||| 14TAAGTGTTACAGTAGTGAATGAGTCTATATAGTAAACGAATATAAA gB7e |||||||:|::||||:|:::||||||:|:|||||:||||||||| gB4 11TAATATAGTGAGTTATATAGTGATAGTAGAGATAATGACATATATTATATA :||||||::||:||:||::||||:|||||:||| gA2 13TAATAGTGGAAGTAGTAGTATATGTTCAAGAGC 264 C Variants C C L L F S F C F L L D F C F L F N M G L L L C L Y Y L C I F I V V I V C Y L V F V F Y W I F V F Y L I W V Y C C V Y I T Y V F L L L L M F V I ST F L F F I G F L F F I ST Y G F I V V F I L L M Y F Y C C Y AuGuuuGuuA*UUUAGuuuuuGuuuuuuAuuGGAuuuuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuAuAuuACuuAuGuAuuuuuAuuGuuGuuA 
|||::||:||::|:|:|:|:||||||:|:||||||||||||:| gC 14TAATTTAGAAGTAGAGAGTGAATTATGCTCAAATAACAACATATATA ||::|::||| |:|||:||:|||:|:|||||||||||:|:| |||:|::::||||:||:||||||:|||:|:||:|||||||| TATGAGTAAT-AGATCGAAGACAGAGAATAACCTAAAGATA gD gB5e 12TAATAGTGTAAATGTAGTGAATATATAGAGATGACAACAAT Ce L Y F M F C Y G Y C F L F N M G L L L C L Y Y L C I F I V V I S C I L C F V M V I V F Y L I W V Y C C V Y I T Y V F L L L L V V F Y V L L W L L F F I ST Y G F I V V F I L L M Y F Y C C Y AGuuGuAUUUUAuGuuuuGuuAuGGuuAuuGuuuuuuAuuuAAuAuGGGuuuAuUGuuGuGuuuAuAuuACuuAuGuAuuuuuAuuGuuGuuA ::|||:::|||:|||:|||:||:|||||||::||||||||||||| 13TGATAGTGAAAGATAGATTGTATCCAAATAGTAACACAAATATAAA gCe |||:|::::||||:||:||||||:|||:|:||:|||||||| gB5e 12TAATAGTGTAAATGTAGTGAATATATAGAGATGACAACAAT |::||:|:|||||:|:|::|:|||:|||||||||| TTGACGTGAAATATAGAGTAGTACTAATAACAAAATATA gDep Ce80 S C I L C F V M F C M I I L ST C D L L C L Y Y L C I F I V V I V V F Y V L L C F V W L F Y S V I C C V Y I T Y V F L L L L L Y F M F C Y V L Y D Y F I V W F V V F I L L M Y F Y C C Y AGuuGuAUUUUAuGuuuuGuuAuGuuuuGuAuGAuuAuuuuAuAGuGuGAuuUGuuGuGuuuAuAuuACuuAuGuAuuuuuAuuGuuGuuA :||||:|:|::|||:|:|||:|||:||||||||:|||||||||||||| 14TAATATAGAGTATATTGATAGAATGTCACACTAGATAACACAAATATATATA gCe80 ||:|::::||||:||:||||||:|||:|:||:|||||||| gB5e 12TAATAGTGTAAATGTAGTGAATATATAGAGATGACAACAAT |::||:|:|||||:|:|::|:||| TTGACGTGAAATATAGAGTAGTACTAATAACAAAATATA gDep 265 D Variants D Y Q Y L F C D L F L C L L F S F C F L L D F C F L F N M G L I S I C F V I C F Y V C Y L V F V F Y W I F V F Y L I W V Y V S V F V L W F V F M F V I ST F L F F I G F L F F I ST Y G F I GuAuCAGuAuuuGuuuuGuG***A**UUUGuuuuuAuGuuuGuuA*UUUAGuuuuuGuuuuuuAuuGGAuuuuuGuuuuuuAuuuAAuAuGGGuuuA |:||||::|::||| |:|||:||:|||:|:|||||||||||:|:| gD 13TATAGAATATGAGTAAT-AGATCGAAGACAGAGAATAACCTAAAGATA :|||||:|||:|:::|:||| | :|||:|:|||||||||||| |:|| TATAGTTATAGATGGAGCAC---T--GAACGAGAATACAAACAAT-AGATA gE |||::||:||::|:|:|:|:||||||:|:||||| gC 14TAATTTAGAAGTAGAGAGTGAATTATGCTCAAAT De D H C L F G C Y E L Y F M F C Y G Y C F 
L F N M G L I I V Y L V V M S C I L C F V M V I V F Y L I W V Y S L F I W L L W V V F Y V L L W L L F F I ST Y G F I D C R L F S C Y E L Y F M F C Y D Y C F C F I G D A ND7 protein seq GAuuGUCGuuuAuuuAGuuG-uuA****UGA*****GuUGuAuuuuAuGuuuuGuuAuGAuuAuuGuuuuuGuuuuAuAGGuGAuGCAuuu ND7 777-866 GAuCAuuGuuuAuuuGGuuG*UUA****UGA-----GuuGuAUUUUAuGuuuuGuuAuGGuuAuuGuuuuuuAuuuAAuAuGGGuuuA :||: |:| ::|-----::||:|:|||||:|:|::|:|||:|||||||||| gDep 12TAAT-AGT----GTT-----TGACGTGAAATATAGAGTAGTACTAATAACAAAATATA |||||::||:|:||:::|| :|| ::|-----||::||||||||||||||| ATAGTAGTAAGTGAATTGAC-GAT----GTT-----CAGTATAAAATACAAAATATATA gEep ::|||:::|||:|||:|||:||:|||||| gCe 13TGATAGTGAAAGATAGATTGTATCCAAAT 266 E Variants Ee R W D F G D H C L F G C Y E L Y F M F C Y G G G I L V I I V Y L V V M S C I L C F V M V G F W W S L F I W L L W V V F Y V L L W A******GGuGGG***AuuuuGGuGAuCAuuGuuuAuuuGGuuG*UUA****UGAGuuGuAUUUUAuGuuuuGuuAuG | |||::: ||:|::||||||||:||| T------CCATTT---TAGAGTCACTAGTAGCAACATATA gFGep |||||::||:|:||:::|| :|| ::|||::||||||||||||||| gEep 12TATAGTAGTAAGTGAATTGAC-GAT----GTTCAGTATAAAATACAAAATATATA :||: |:| ::|::||:|:|||||:|:|::|:||| gDep 12TAAT-AGT----GTTTGACGTGAAATATAGAGTAGTAC E F F G G L G Y Q Y L F C D L F L C L L F S F C F L L F L G V ST G I S I C F V I C F Y V C Y L V F V F Y W I F W G F R V S V F V L W F V F M F V I ST F L F F I G AUUUUUUGGGGGUUUAGGGuAuCAGuAuuuGuuuuGuG***A**UUUGuuuuuAuGuuuGuuA*UUUAGuuuuuGuuuuuuAuuG |:||||::|::||| |:|||:||:|||:|:|||||| gD 13TATAGAATATGAGTAAT-AGATCGAAGACAGAGAATAAC :|||||:|||:|:::|:||| | :|||:|:|||||||||||| |:|| 12TATAGTTATAGATGGAGCAC---T--GAACGAGAATACAAACAAT-AGATA gE 267 FG Variants Fe K Y Y H I C V R W D F G D H C L F G C Y E N I I I F V L G G I L V I I V Y L V V M S I L S Y L C ST V G F W W S L F I W L L W AAAuAuuAuCAuAuuuGuGuuA******GGuGGG***AuuuuGGuGAuCAuuGuuuAuuuGGuuG*UUA****UGA |||||||||:|||::|:|:| |:|:|: ||||||||:#||||:|| 11TATAATAGTGTAAGTATAGT------CTATCT---TAAAACCATCAGTAGCATATATA gFGep |||||::||:|:||:::|| :|| ::| gEep 
12TATAGTAGTAAGTGAATTGAC-GAT----GTT Fe* K H I C V R W D F G D H C L F G C Y E K N I F V L G G I L V I I V Y L V V M S K T Y L C ST V G F W W S L F I W L L W AAAAACAuAuuuGuGuuA******GGuGGG***AuuuuGGuGAuCAuuGuuuAuuuGGuuG*UUA****UGA |:||:::|:||| |||::: ||:|::||||||||:||| 13TAATTGTAGGTATAAT------CCATTT---TAGAGTCACTAGTAGCAACATATA gFGe*p |||||::||:|:||:::|| :|| ::| gEep 12TATAGTAGTAAGTGAATTGAC-GAT----GTT Fex I I I K F V I W C F V F D L F C V F H C L F G C Y E I L ST S S L L F G V L F L I C F V Y F I V Y L V V M S Y Y N Q V C Y L V F C F W F V L C I S L F I W L L W AuAuuAuAAuCAAGuuuGuuA***UUUGGuGuuuuGuuuuuG***AuuuGuuuuGuGuAuuuCAuuGuuuAuuuGGuuG*UUA****UGA |||:||:|||||||||::|:# |||||| TATGATGTTAGTTCAAGTAGC---AAACCAAAAA gGep |:|:|||:| ||||||:||||||||||| TATAATAGTGACTTAGACAGT---AAACCATAAAACAAAAACATATA gFexp :|:|||:::|||:| |||::::||:|:|||:|||||||||||| gFexp 12TATAAAGTGAAAGC---TAAGTGGAATATATAGAGTAACAAATAATACATA ||||::||:|:||:::|| :|| ::| gEep 12TATAGTAGTAAGTGAATTGAC-GAT----GTT 268 Ge K Y K Y V Y D I Y I I I K F V I W C F V F D L F C V R N I N M C M I Y I L ST S S L L F G V L F L I C F V E I ST I C V W Y I Y Y N Q V C Y L V F C F W F V L C AGAAAUAUAAAUAUGUGUAUGAUAUAUAuAuuAuAAuCAAGuuuGuuA***UUUGGuGuuuuGuuuuuG***AuuuGuuuuGuG ||||:||:|||||||||::|:# |||||| 12TAATAGAATATGATGTTAGTTCAAGTAGC---AAACCAAAAA gGep |:|:|||:| ||||||:||||||||||| 04TAATAGAGTATAATAGTGACTTAGACAGT---AAACCATAAAACAAAAACATATA gFexp :|:|||:::|||:| |||::::||:|: gFexp 12TATAAAGTGAAAGC---TAAGTGGAATAT SDM80 only R N I N M C M I Y K N N G S C G F V G W F R L G Y C Y C E AGAAAUAUAAAUAUGUGUAUGAUAUAUAAAAACAAuGGuA******GuuGuGGuuuuG**UAGGuuGAuuCAGAuuGGG*UUA***UUGuuAuuGuGA** ||:||| :|::|:||:|:: :||:||:|||||#||||:| ||| |||| gFGe80 14TATCAT------TAGTATCAGAGT--GTCTAATTAAGTGTAACTC-AAT---AACATATA |||::: |:| |::|:|:|||:| 11TAATTT-AGT---AGTAGTGACATT-- C C S F C M I I L ST C D L L C L Y Y L C I F I V V I I F I I V Y K **AuGuuGuAGuuuuuGuAuGAuuAuuuuAuAGuGuGAuuUGuuGuGuuuAuAuuACuuAuGuAuuuuuAuuGuuGuuAuCAuuuuUAuuAuuGuAuAuA 
||:|||:||||||||||||#|||| --TATAACGTCAAAAACATACGAATATATA gDEe80 F S L L D ST AGuUUUCGUUAUUAGAUUAAAAAAGUAUGCAAAUAAUUUUUGU

APPENDIX P. gRNAs identified to edit the CR3 mRNAs found in both the TREU 667 and EATRO 164 gRNA transcriptomes.

Editing Region Population Sequence G FG F gGt gGt gGep gGep gFGtp gFGtxp gFGtxp gFGtxp gFGtxp gFGtxp gFGtxp gFGtxp gFGtxp gFGtxp gFGe*p gFGe*p gFGe*p gFGe*p gFGe*p gFGe*p gFGe*p gFGe*p gFGe*p gFGe*p gFGep gFGep gFGep gFGe80 gFGe80 gFt gFt gFt gFt gFt gFt gFt gFt gFt gFt ATATGT ACAACAAAACCGAGCAATCAGATATAGAGTGAAAGTGATTAATT TTTTTTTTTTTC ATAT AACAACAAAACTGAACAATCAAATGTAGAGTGAAAGTGATTAATT TTTTTTTTTTTT AAAAACCAAAC GATGAACTTGATTGTAGTATA AGATAATTTTTTTTTTTT AATACCGAGC GACAGATTTGATTGTAGTATA AGATAATATTTTTTTTTTTT ATATAT AAAATGTACAACCAGAATTAAGATAAAGTGATGATGTATATATT TT ATATTAAATAC ATGATATACGCGATAGATTATTAGAGTTATGAGTTAAT TTTTTTTTTTTGTTT ATAC ATGATATATACAGTGAACTATTAGAATTATAGGT AATGAGATTTATTTTTTTTTTTTTT ATATTAAATAC ATGATATGCGCAGTAGACTATTAAAGTTATGAGTTAAT TTTTTTTTTTT ATATTAAATAC ATGATATACACGGTAGATTATTAGAGCTATGAGTTAAT TTTTTTTTTTTTT ATATTAAATAC ATGATATACACGGTAGATTATTAGAGCTATGAGTTA TTTTTTTTTTTTCT ATATTAAATAC ATGATATACACGGTAGATTATTAGAGCTATGAGTT TTTTTTTTTTT TATTAAATAC ATGATATACACGGTAGATTATTAGAGCTATGAGTTAA CTTTTTTTTTTTTT ATATTAAATAC ATATACACGGTAGATTATTAGAGCTATGAGTTAAT TTTTT ATATTAAATAC ATGATATACACGGTAGATTATTAGAGCTATGA TTTAAAATTTTTTTTTTTTTT ATATAC AACGATGATCACTGAGATTTTACCTAATATGG ATGTTAATTTTTTTTTTTTTGGGGAACTGAA ATACA AAACGATGATTACCGAGATTTCA GTTAATATGATTGTCTAATTTTTTTTTTTTT ATATACAACT ATGATCACTGAGATTTTACCTAATATGG ATGTTAATTTTTTTTTTT ATATATACA AAACGATGATTACCGAGATTTCATTTAATATGATTGT CTAATTTT ATATAC AACGATGATCACTGAGATTTTACCTAATATG TATGTTAATTTTTTTCTTTTT ATATAC AACGATGATCACTGAGATTTTACCTAATATGGTTGTTAATTT TAGATTTTTT ATATAC AACAATGATCACTGAGATTTTACCTAATATGGTTGTTAATTT TTTTTTTTTTT ATATAGAAACGATGAC TGCTAGAATTCTGCTTGATATGGATGT TAATTTTTTTTTTTTTT ATATAC AACGATGATCACTGAGATTTTACCTAATACGA TGTTTAATTTTTTTTTTTTTTAT ATATAC AACGATGATCACTGAGATTTTACCTAATATGGAT
TTTTTTTTTTTTTTAAAGTGCGGCCATAGTGGGTG ATATATACGATGAC TACCAAAATTCTATCTGATATGAATGTGATAATATTT TTTTTTTT ATATATACGATGAC TGCCAAAATTCTATCTGATATGAATGTGATAATATTT TTTTTTTT ATATAC AACGATGATCACTGAGATTTTACCTAAT TTTTTTTTTTT ATATACAATAACTCAATG TGAATTAATCTGTGAGACTATGATTACT TTTTTTTGTTTTT ATATACAATAACTCAATG TGAATTAATCTGTGAGACTATGATTACTATT TTTTTTTATTTT ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATGAT ATTTTTTTTTTTTTC ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATGATT TTTTTTTTTTTT ATATAATT AAATGTACAGACAAATGATAGAGAGACGATGAGATTAAGT TATATTTTTTTTTTTT ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGAT TTTTTTTTGTTTTT ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATG TTTTTTTTTTT ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTAAAATTAGATGAT ATTTTTTTTTTTT ATATATAAAATA TACAAACGGACAATGAGAGAACAGTGAAATTAGATGAT AATATATTTTT ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATGA AATATTTTTTTTTTTTTT ATATAT AAAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATA TAATATTTTTTTTTTTA ATATAT AAATGTACAAACGGACAATGAGAGAACAGTGAAATTAGATGAT ATTTTTATATTTTTTTTTT 270 Reads TREU 667 493 8,171 21 44,119 2,106 936 404 133 115 53,389 133 42 12,250 1,418 534 173 140 146 112 16,892 Reads EATRO 164 21,297 2,736 94 2,772 354 159 53,459 717 166 143 97 23 1 1,854 349 239 138,710 33,909 3,738 941 932 679 603 459 263 234 gFt gFt gFt gFt gFt gFexp gFexp gFexp gFexp gE gE gE gE gEt' gEt' gEt' gEt' gEt gEep gEep gEep gEep gEep gEep gEep gDEe80 gDEe80 gD gD gD gD gD gD gD gD gD gD gD gD gD gDep gDep gDep gDep gDep E DE D ATATAT AAAATGTACAAACGGACAATGAGAGAACAGCGAAATTAGATGAT ATCTTTTTTT ATAT AAAATGTACAGACGAGCAGTGAAGAGACAGTGAGATTA TACATTTTTTTTTTTT ATATAA GTACAAACAGACAATGAAAAGATGGTGAGACTGAGTA TACATTTTTTTTT AAATT AATGTACAAATAAACGATAGAGAGACAGTGAGATTA TGATTGTAATTCTTTTTTTT ATAT AAAATGTACAGACAAGCAATGAAGAGACAGTGAGATTAGATAGTT TTTTTTTTTC ATACAT AATAAACAATGAGATATATAAGGTGAATCGAAAGTGAAATATT TTTTTTTTTT ATATA CAAAAACAAAATACCAAATGACAGATT CAGTGATAATATGAGATAATTTT ATATAC AAAACAAAATACCAAATGACAGATT CAGTGATAATATGAGATAATTTTTTTTTTTT ATACAT AATAAACAATGAGATATATAAGGTGAATCGAAAGTGAAATAT 
ATTTTTTTTT A TAGATAACAAACATAAGAGCAAGTCACGAGGTAGATATTGATATTTT TTTTTTTTTC A TAGATAACAAACATAAGAGCAAGTCACGAGGTAGATATTGATATTTTAA TTTTTTTATTTTT A TAGATAACAAACATAAGAGCAAGTCACGAGGTAGATATTGATATTT GTTTTTTTTTTTTT TTAGATAACAAACATAAGAGCAAGTCACGAGGTAGATATTGATATTTTAA TTTTTT ATATAT AATCACAAACAAATAGAAAATGAGAGAGGTGTATGA TACTATTTTTTTTTCTTTTT ATAC ATAAATCACAAACGAATAAGGAACAGAAGAGATGTATGA TATATTTTTTTTTTT AT AATAAATCACAAACAGATAAAGAGCAAGAAAGGTGTATGA TTATATTTTTATTTTTT ATAAAC AATCACAAACAGATAGAAGACAGAAGAGATGTATAGATA TTAAAATTTTTTTTT ATAT ATAACAAACATAGACAGAAATCTATAAGAGAGTATAATGATATGTGT TTTTTTTTTTTT ATAT ATAAAACATAAAATATGACTTGTAGCAGTTAAGTGAATGATGA TATTTTTTTTTTTT AAATA ATAACAAAACATGAGATATAACTTGTAGTGATTAGATGAATGAT TTTTTTTTTTT ATATAT TAAAATACAACTTATGATGACTAAGTGAATGATGATTGTCAA TTTTTTTTCTTTTT ATAT ATAAAACATAAAATATGACTTGTAGCAGTTAAGTAAATGATGA TATTTTTTT ATATAT TAAAATACAACTTATGATGGCTAAGTGAATGATGATTGTCAA TTTTTTTTTTTTC ATAT ATAAAACATAAAATATGACTTGTAGCAGTTAAGTGAATGATGATT TTTTTTTATTTTT ATATT ATAAAACATAAGATATAACTCATAGTGATTGAATAAGTGAT AATTTTTTTTTGTTTT ATATATAAG CATACAAAAACTGCAATATTTACAGTGATGATGATTTAATTT TTTTTTTT ATATATAAG CATACAAAAACTGCAGTATTTACAGTGATGATGATTTAATTT T ATAGAAATCCAATAAGAGACAGAAGCTAGATAATGAGTATAAG ATATTTTTTTTTTTTT ATATAT AAATCCAATAAGAAATGAAAGCTAGATAGTGAGTATAAGT TTTTTTTTTTTGTTT GTAGAAATCCAATAAGAGACAGAAGCTAGATAATGAGTATAAG ATATATTTTTTTTGTTTTTTTTTTTT ATAGAAATCCAATAAGAGACAGAAGCTAGATAATGAGTATAA TTTTTTTTTTTT ATAT ACAAAAATCCAATGAAAAATAAAGACTGAGTGATGGATG CAATTTTCTTTTTTTTTT ATAGAAATCCAATAAGAGACAGAAGCTAGATAATGAGTATA TTTTTTCTTTTTTTTTT ATAGAAATCCAATAAGAGACAGAAGCTAGATAATGAGTAT TAGATATATTTTTTTTTTTTT ATAGAAATCCAATAAAAGACAGAAGCTAGATAATGAGTATAAG ATATATTTT ATAGAAATCCAATAAGAGACAGAAGCTAGATAATCAGTATAAG ATATATTTTTTTAA ATAGAAATCCAATAAGAGACAGAAGCTAGATAATTAGTATAAG ATATATATTTTTTTTTT ATATAT AAACAAAAATTCAATAGAAAACGAGAACTGAGTAATTGATATAA TTTT AT ATAGAAATCCAATAAGAGACAGAAACTAGGTAATGAGTATAAG ATTTTTTTTTTT ATAT AACAAAAATCCAATGAGAAATAGAGACTGAGTAATTGATATA TATTTTTATTTTTT ATAT AAAACAATAATCATGATGAGATATAAAGTGCAGTTTGTGATAATT TTTTTGTTTT ATAT 
ATAATCATAACAAGATGTAGAGTACGATTTATAGTGATTAA TTTTTTTTTTTT ATAT AAAACAATAATCATGATGAGATATAAAGTGCAGTTTGTGATAATTAA TTTTT ATAT AAAACAATAATCATAATAAGATGTAAGGTACGATTTATGATAATT TTTTTTGTTTT ATAT ATAATCATAACAGAGCATAGAATACAGTTTATAGTGATTAA TTTTTT 271 19,078 592 243 210 10,598 867 100 388 86 80 29 2,493 12,105 2,491 428 157 525 535 718 217 12 2 15 8 208 360 235 197 179 15,900 1,342 199 102 584 27,823 5,923 5,559 131 149 10 27,320 331 315 206 142 137 90 74 23 19 17 68 53 31 gDep gDep gC gC gC gC gC gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe80 gCe gCe gB1B2 gB1B2 gB1B2 gB1B2 gB1B2 gB1B2 gB1B2 gB1B2 gB3t gB3t gB3t gB4 gB4 gB4 gB4 gB4 gB4 gB4t' gB5e gB6e gB6e C B ATAT AAAACAATAATCATAATAAGATGTAAGGTACGATTTATGAT TTTTTTTTTATGCAACGTTGGATACTGGAG ATAT ATAATCATAACAAGATATAGAATACGATTCATAGTGATTAA TTTTTTTTTTTTTT ATAT ATACAACAATAAACTCGTATTAAGTGAGAGATGAAGATTTAAT TTTTTTTTATTTT ATAT TAAACACAACGATAGATCTATATTAAGTAGAAGATAGAAATTTA TTTATTTTTTTTTTTT A AATATAAACACAATGATAAGCCTGTATTAGATAGAAAGTGAGAATTT TTTTTTTTTG TA AAACACAACAATAGATTCGTATTAAATAGAGAATAGAGATTTA TTTTTTTTTTT ATAT TAAACACAACAGTAGATCTATATTAAGTAGAAGATAGAAATTT TTTTTTTTT ATAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGAGATATA ATTTTTTTTTTTTTT ATAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGAGATAT TTTTTTTTTTTTN ATAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGAGATATATTTT TTTTTTTTTT ATAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGAGAT TTTTTTTTTTGTT ATAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGAGATATATTTTGT TTTTTT ATAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGAGATATATTT GTTTTTTTTT ATAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGAGATA ATTTTTTTTTATTTTTTTT ATAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGAGATATAT AATTTTTTTT ATAT ATATAAACACAATAGATCACACTGTAAAATAGTTATATGAGATATA ATTTTTCTTTTTTTT ATAT ATATAAACACAATAGATCGCACTGTAAGATAGTTATATGAGATAT TTTTTTTTTTTTG ATAT ATATAAACACAATAGATCGCACTGTAAGATAGTTATATGAGATATA ATTTTTTTTTTTTTC ATAT ATATAAACACAATAGATCGCACTGTAAGATAGTTATATGAGATATATTTT TTTTTTTTT ATAT ATATAAACACAATAGATCGCACTGTAAGATAGTTATATGAGAT 
TTTTTTTTTTT ATAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGGGATATA ATTTTTTTCTTTTTT NTAT ATATAAACACAATAGATCACACTGTAAGATAGTTATATGAGATATACTTT TTTTT A AATATAAACACAATGATAAACCTATGTTAGATAGAAAGTGATAGTT TTTTTTTTTTT AAAA AATATAAACACAATGATAAGCCTGTATTAGATAGAAAGTGATAATT TTTT ATATATA AATGATAACACTGAGAATCTGTAGTATAAGATATGATGATAA TTTTTTTTTTTTTT ATATATA AATGATAACACTGAGAATCTGTAGTATAAGATATGATGATA TTTTTTTATTTT ATATATAAATA ATAACACTGAGAATCTGTAGTATAAGATATGATGATAA TTTGTTTTTTTTTTTTTT ATATATA AATGATAACACTGAGAATCTGTAGTATAAGATATGATAAT TTTTTTTTTTT ATAT ATAATGATAACACTGAGAATCTGTAATGTGAGATATGATGATAAGTTT TTTTTTTTTT ATAT ATAATGATAACACTGAGAATCTGTAATGTGAGATATGATGATAA TTTTTTTTTTTTT ATATATA AATGATAACACTGAAGATCTGTAGTATAAGATATGATGATAA TTTTTTTTCTTTTTT ATAT ATAATGATAACACTGAGAATCTGTAATGTGAGATATGATGATA TTTTTTTTTATTTT ATAT ATAAAAATGATACCATAGAATTTGGAGTAGTATAGATATGATGATA TTTTTTTTTTTT ATAT ATAAAAATGATACCATAGAGTTTGAGGTGATGTAGATATGATGGTA TTTTT ATAT ATAAAAATGATACCATAGAATTTGGAGTAGTATAGATATGATGAT TTTTTTTTTTT ATATA TTATATACAGTAATAGAGATGATAGTGATATATTGAGTGATATA ATTTTTTTTTTT ATATACAGTAATAGAGATGATAGTGATATATTGAGTGATATAATTAATATGT GATTTTTAATTTTTTTTTTCTTT ATACAGTAATAGAGATGATAGTGATATATTGAGTGATATAATTAATATGTTTT T ATA TATATACAATAATAAGAGTGATAGCAGTG TATTAGTGATGATTAATTTTTTTTTTTTTTT ATA TATATACAATAATGAGAATGATGACAGTG TATTAGTGATGTAATATTTTTTTTTATTTT ATATA TTATATACAGTAATAGAGATGATAGTGATATATTGAGTGATATAATTAATATGT GATTTTTAATATAATA ATAT ATAACATATCAAATGATAAAGTGAGAAAAGAGAGTGAATGTAAT TTTTTTCTTTTCTATCTATTATTACAT ATAT ATGATAACAACAGTAGAGATATATAAGTGATGTAAATGTGATAAT TTTTTTTTTTT ATAT ATAACAACAATAAGAATACGTGAGATAGTATAGATATGATGAT TTTTTTTTTTTG ATATATAAT ACAATAAGAATACGTGAGATAGTATAGATATGATGAT TTTTTTTTTTA 272 1 1 1,509 88 187,465 170,393 57,583 2,157 932 871 829 250 242 32 37,769 330 180 98 154 8 6 5 859 2,300 10,402 209 170 214,918 180,589 67,000 2,828 553 621 1,014 330 358 33,434 14,634 3,887 828 387 374 3 10,346 2,337 139 124 1,183 474 383 260 367 18 13 3,708 222 A gB7e gB7e gB7e gB7e gB7e gB7e gB7e gA1 gA1 gA2 gA2 gA2 gA2 gA2 gA2 gA2 gA2 gA2 gA2 
GATATATCTGAGTAAGTGATGACATTGTGAATTATTTTTTTT T AAATATAAG CAAATGATATATCTGAGTAAGTGATGACATTGTGAATT TTTTTTTTTTTT ATGATATATCTGAGTAAGTGATGACATTGTGAATTAT GGTATATTTTTG TGATATATCTGAGTAAGTGATGACATTGTGAATTACTTTTTT T AATGATATATCTGAGTAAGTGATGACATTGTGAATTATTTTTTTT TTTTTT AATATAAG CAAATGATATATCTGAGTAAGTGATGACATTGTGAATTATTTTTTTT TTTTTTTAAAAAAAAA AAATATAAG CAAATGATATATCTGAGTAAGTGATGACATTGTGAATTAT GGTATATAAAGTTAAATAATTTTATC ATA TTAATCTAATAACGGAGACTTGTATATGATAGTAATGATGATATT TTTTTTTTT ATATA TTAATCTAATAGCGGAGACTTGTATATGATAGTAATGATGATATT TT ATATAA AATCTAATGACGAGAACTTGTATATGATGATGAAGGTGATA ATTTTTTTTTTTTTG ATA AATCTAATAACGAGAATTTATGTACGATAATGAAAGTGATAT ATTTTTTTTCTTTTT ATATAA AATCTAATGACGAAAGCTTGTATATGGTAATGAAGATGGTATT TTTTTTTTTT ATATAA AATCTAATGACGAAAGCTTGTATATGGTAATGAAGATGGTA ATTTTTTTTTCTTTTT ATATAA AATCTAATGACGAGAACTTGTATATGATGATGAAGGTGATATT TCTTTTTTTTT ATATA AATCTAATAACGAGAATTTATGTACGATAATGAAAGTGATATT TTTTTCTTTTTTTTT ATACAT ATCTAATAACGGAAGCTTATGTGTAGTAGTGAAGATGGTA TATTTTTTTTTTTTT ATATAA AATCTAATGACGAAAGCTTGTATATGGTAGTGAAGATGGTA ATTTTTTTTTTTTTT ATATAA AATCTAATGACGAAAGCTTGTATATGGTAGTGAAGATGGTATT TTTTTTTTTT ATATA AATCTAATAACGGAAATTTGTATATGATGATAGAAGTGATAGT TTTTTTTTTTTT 489 250 117 49 14 2,697 16,984 540 540 419 331 106 103 3,042 884 522 104 1,234 382 40 273 REFERENCES 274 REFERENCES 1. 2. 3. 4. 5. 6. 7. 8. 9. Shapiro TA, Englund PT. The structure and replication of kinetoplast DNA. Annu Rev Microbiol. 1995;49: 117+. Vickerman K. The evolutionary expansion of the trypanosomatid flagellates. Int J Parasitol. 1994;24: 1317–1331. doi:10.1016/0020-7519(94)90198-8 Simpson AGB, Stevens JR, Lukeš J. The evolution and diversity of kinetoplastid flagellates. Trends Parasitol. 2006;22: 168–174. doi:10.1016/j.pt.2006.02.006 Read LK, Lukeš J, Hashimi H. Trypanosome RNA editing: the complexity of getting U in and taking U out. Wiley Interdiscip Rev RNA. 2016;7: 33–51. doi:10.1002/wrna.1313 Priest JW, Hajduk SL. Developmental regulation of mitochondrial biogenesis inTrypanosoma brucei. 
J Bioenerg Biomembr. 1994;26: 179–191. doi:10.1007/BF00763067 Aphasizhev R, Aphasizheva I. Uridine insertion/deletion editing in trypanosomes: a playground for RNA-guided information transfer. Wiley Interdiscip Rev RNA. 2011;2: 669– 685. doi:10.1002/wrna.82 Aphasizhev R, Aphasizheva I. Mitochondrial RNA editing in trypanosomes: Small RNAs in control. Biochimie. 2014;100: 125–131. doi:10.1016/j.biochi.2014.01.003 Hong M i. n., Simpson L. Genomic Organization of Trypanosoma brucei Kinetoplast DNA Minicircles. Protist. 2003;154: 265–279. doi:10.1078/143446103322166554 Koslowsky D, Sun Y, Hindenach J, Theisen T, Lucas J. The insect-phase gRNA transcriptome in Trypanosoma brucei. Nucleic Acids Res. 2014;42: 1873–1886. doi:10.1093/nar/gkt973 10. CDC - African Trypanosomiasis [Internet]. 2 May 2017 [cited 25 Jan 2019]. Available: https://www.cdc.gov/parasites/sleepingsickness/index.html 11. Programme Against African Trypanosomosis (PAAT) | Food and Agriculture Organization of the United Nations [Internet]. [cited 25 Jan 2019]. Available: http://www.fao.org/paat/en/ 12. Horn D. Antigenic variation in African trypanosomes. Mol Biochem Parasitol. 2014;195: 123–129. doi:10.1016/j.molbiopara.2014.05.001 13. Sudarshi D, Lawrence S, Pickrell WO, Eligar V, Walters R, Quaderi S, et al. Human African Trypanosomiasis Presenting at Least 29 Years after Infection—What Can This Teach Us 275 about the Pathogenesis and Control of This Neglected Tropical Disease? PLOS Negl Trop Dis. 2014;8: e3349. doi:10.1371/journal.pntd.0003349 14. Matthews KR. Developments in the Differentiation of Trypanosoma brucei. Parasitol Today. 1999;15: 76–80. doi:10.1016/S0169-4758(98)01381-7 15. van Hellemond JJ, Bakker BM, Tielens AGM. Energy Metabolism and Its Compartmentation in Trypanosoma brucei. In: Poole RK, editor. Advances in Microbial Physiology. Academic Press; 2005. pp. 199–226. doi:10.1016/S0065-2911(05)50005-5 16. Hannaert V, Bringaud F, Opperdoes FR, Michels PA. 
Evolution of energy metabolism and its compartmentation in Kinetoplastida. Kinetoplastid Biol Dis. 2003;2: 11. doi:10.1186/1475-9292-2-11 17. Nolan DP, Voorheis HP. The mitochondrion in bloodstream forms of Trypanosoma brucei is energized by the electrogenic pumping of protons catalysed by the F1F0-ATPase. Eur J Biochem. 1992;209: 207–216. doi:10.1111/j.1432-1033.1992.tb17278.x 18. Vertommen D, Van Roy J, Szikora J-P, Rider MH, Michels PAM, Opperdoes FR. Differential expression of glycosomal and mitochondrial proteins in the two major life-cycle stages of Trypanosoma brucei. Mol Biochem Parasitol. 2008;158: 189–201. doi:10.1016/j.molbiopara.2007.12.008 19. Weelden SWH van, Fast B, Vogt A, Meer P van der, Saas J, Hellemond JJ van, et al. Procyclic Trypanosoma brucei Do Not Use Krebs Cycle Activity for Energy Generation. J Biol Chem. 2003;278: 12854–12863. doi:10.1074/jbc.M213190200 20. Weelden SWH van, Hellemond JJ van, Opperdoes FR, Tielens AGM. New Functions for Parts of the Krebs Cycle in Procyclic Trypanosoma brucei, a Cycle Not Operating as a Cycle. J Biol Chem. 2005;280: 12451–12460. doi:10.1074/jbc.M412447200 21. Bringaud F, Rivière L, Coustou V. Energy metabolism of trypanosomatids: Adaptation to available carbon sources. Mol Biochem Parasitol. 2006;149: 1–9. doi:10.1016/j.molbiopara.2006.03.017 22. Oberle M, Balmer O, Brun R, Roditi I. Bottlenecks and the Maintenance of Minor Genotypes during the Life Cycle of Trypanosoma brucei. PLOS Pathog. 2010;6: e1001023. doi:10.1371/journal.ppat.1001023 23. Abbeele JVD, Claes Y, Bockstaele DV, Ray DL, Coosemans M. Trypanosoma brucei spp. development in the tsetse fly: characterization of the post-mesocyclic stages in the foregut and proboscis. Parasitology. 1999;118: 469–478. 24. Michelotti EF, Hajduk SL. Developmental regulation of trypanosome mitochondrial gene expression. J Biol Chem. 1987;262: 927–932. 276 25. Feagin JE, Jasmer DP, Stuart K. 
Developmentally regulated addition of nucleotides within apocytochrome b transcripts in Trypanosoma brucei. Cell. 1987;49: 337–345. doi:10.1016/0092-8674(87)90286-8 26. Read LK, Wilson KD, Myler PJ, Stuart K. Editing of Trypanosoma brucei maxicircle CR5 mRNA generates variable carboxy terminal predicted protein sequences. Nucleic Acids Res. 1994;22: 1489–1495. doi:10.1093/nar/22.8.1489 27. Koslowsky DJ, Bhat GJ, Perrollaz AL, Feagin JE, Stuart K. The MURF3 gene of T. brucei contains multiple domains of extensive editing and is homologous to a subunit of NADH dehydrogenase. Cell. 1990;62: 901–911. doi:10.1016/0092-8674(90)90265-G 28. Souza AE, Myler PJ, Stuart K. Maxicircle CR1 transcripts of Trypanosoma brucei are edited and developmentally regulated and encode a putative iron-sulfur protein homologous to an NADH dehydrogenase subunit. Mol Cell Biol. 1992;12: 2100–2107. doi:10.1128/MCB.12.5.2100 29. Souza AE, Shu HH, Read LK, Myler PJ, Stuart KD. Extensive editing of CR2 maxicircle transcripts of Trypanosoma brucei predicts a protein with homology to a subunit of NADH dehydrogenase. Mol Cell Biol. 1993;13: 6832–6840. doi:10.1128/MCB.13.11.6832 30. Read LK, Myler PJ, Stuart K. Extensive editing of both processed and preprocessed maxicircle CR6 transcripts in Trypanosoma brucei. J Biol Chem. 1992;267: 1123–1128. 31. Bhat GJ, Koslowsky D, Feagin J, Smiley B, Stuart K. An Extensively Edited Mitochondrial Transcript in Kinetoplastids Encodes a Protein Homologous to ATPase Subunit 6. Cell. 1990;61: 885–894. 32. Feagin JE, Abraham JM, Stuart K. Extensive editing of the cytochrome c oxidase III transcript in Trypanosoma brucei. Cell. 1988;53: 413–422. doi:10.1016/0092- 8674(88)90161-4 33. Blum B, Bakalara N, Simpson L. A model for RNA editing in kinetoplastid mitochondria: “Guide” RNA molecules transcribed from maxicircle DNA provide the edited information. Cell. 1990;60: 189–198. doi:10.1016/0092-8674(90)90735-W 34. Eperon IC, Janssen JWG, Hoeijmakers JHJ, Borst P. 
The major transcripts of the kinetoplast DNA of Trypanosoma brucei are very small ribosomal RNAs. Nucleic Acids Res. 1983;11: 105–125. doi:10.1093/nar/11.1.105 35. Adler BK, Harris ME, Bertrand KI, Hajduk SL. Modification of Trypanosoma brucei mitochondrial rRNA by posttranscriptional 3’ polyuridine tail formation. Mol Cell Biol. 1991;11: 5878–5884. doi:10.1128/MCB.11.12.5878 277 36. Aphasizheva I, Maslov D, Wang X, Huang L, Aphasizhev R. Pentatricopeptide Repeat Proteins Stimulate mRNA Adenylation/Uridylation to Activate Mitochondrial Translation in Trypanosomes. Mol Cell. 2011;42: 106–117. doi:10.1016/j.molcel.2011.02.021 37. Bhat GJ, Souza AE, Feagin JE, Stuart K. Transcript-specific developmental regulation of polyadenylation in Trypanosoma brucei mitochondria. Mol Biochem Parasitol. 1992;52: 231–240. doi:10.1016/0166-6851(92)90055-O 38. Hensgens LA, Brakenhoff J, De Vries BF, Sloof P, Tromp MC, Van Boom JH, et al. The sequence of the gene for cytochrome c oxidase subunit I, a frameshift containing gene for cytochrome c oxidase subunit II and seven unassigned reading frames in Trypanosoma brucei mitochrondrial maxi-circle DNA. Nucleic Acids Res. 1984;12: 7327– 7344. 39. Benne R, Van Den Burg J, Brakenhoff JPJ, Sloof P, Van Boom JH, Tromp MC. Major transcript of the frameshifted coxll gene from trypanosome mitochondria contains four nucleotides that are not encoded in the DNA. Cell. 1986;46: 819–826. doi:10.1016/0092- 8674(86)90063-2 40. Feagin JE, Stuart K. Differential expression of mitochondrial genes between life cycle stages of Trypanosoma brucei. Proc Natl Acad Sci. 1985;82: 3380–3384. 41. Payne M, Rothwell V, Jasmer DP, Feagin JE, Stuart K. Identification of mitochondrial genes in Trypanosoma brucei and homology to cytochrome c oxidase II in two different reading frames. Mol Biochem Parasitol. 1985;15: 159–170. doi:10.1016/0166- 6851(85)90117-3 42. Kannan S, Burger G. Unassigned MURF1 of kinetoplastids codes for NADH dehydrogenase subunit 2. BMC Genomics. 
2008;9: 455. doi:10.1186/1471-2164-9-455 43. Feagin JE, Jasmer DP, Stuart K. Apocytochrome b and other mitochondrial DNA sequences are differentially expressed during the life cycle of Trypanosoma brucei. Nucleic Acids Res. 1985;13: 4577–4596. doi:10.1093/nar/13.12.4577 44. Stuart K, Feagin JE, Jasmer DP. Regulation of Mitochondrial Gene Expression in Trypanosoma brucei. Sequence Specificity in Transcription and Translation. Alan R. Liss; 1985. pp. 621–631. 45. Jasmer DP, Feagin JE, Stuart K. Diverse patterns of expression of the cytochrome c oxidase subunit I gene and unassigned reading frames 4 and 5 during the life cycle of Trypanosoma brucei. Mol Cell Biol. 1985;5: 3041–3047. doi:10.1128/MCB.5.11.3041 46. Feagin JE, Stuart K. Developmental aspects of uridine addition within mitochondrial transcripts of Trypanosoma brucei. Mol Cell Biol. 1988;8: 1259–1265. doi:10.1128/MCB.8.3.1259 278 47. Stuart K. The RNA editing process in Trypanosoma brucei. Semin Cell Biol. 1993;4: 251– 260. doi:10.1006/scel.1993.1030 48. Corell RA, Myler P, Stuart K. Trypanosoma brucei mitochondrial CR4 gene encodes an extensively edited mRNA with completely edited sequence only in bloodstream forms. Mol Biochem Parasitol. 1994;64: 65–74. doi:10.1016/0166-6851(94)90135-X 49. Hajduk SL, Adler BK, Madison S, McManus M, Sabatini R. Insertional and deletional RNA editing in trypanosome mitochondria. Nucleic Acids Symp Ser. 1996; 15–18. 50. Koslowsky DJ, Riley GR, Feagin JE, Stuart K. Guide RNAs for transcripts with developmentally regulated RNA editing are present in both life cycle stages of Trypanosoma brucei. Mol Cell Biol. 1992;12: 2043–2049. doi:10.1128/MCB.12.5.2043 51. Riley GR, Corell RA, Stuart K. Multiple guide RNAs for identical editing of Trypanosoma brucei apocytochrome b mRNA have an unusual minicircle location and are developmentally regulated. J Biol Chem. 1994;269: 6101–6108. 52. Greif G, Rodriguez M, Reyna-Bello A, Robello C, Alvarez-Valin F. 
Kinetoplast adaptations in American strains from Trypanosoma vivax. Mutat Res Mol Mech Mutagen. 2015;773: 69–82. doi:10.1016/j.mrfmmm.2015.01.008 53. Ooi C-P, Schuster S, Cren-Travaillé C, Bertiaux E, Cosson A, Goyard S, et al. The Cyclical Development of Trypanosoma vivax in the Tsetse Fly Involves an Asymmetric Division. Front Cell Infect Microbiol. 2016;6. doi:10.3389/fcimb.2016.00115 54. Ruvalcaba-Trejo LI, Sturm NR. The Trypanosoma cruzi Sylvio X10 strain maxicircle sequence: the third musketeer. BMC Genomics. 2011;12: 58. doi:10.1186/1471-2164-12- 58 55. Cazzulo JJ. Protein and amino acid catabolism in Trypanosoma cruzi. Comp Biochem Physiol Part B Comp Biochem. 1984;79: 309–320. doi:10.1016/0305-0491(84)90381-X 56. Cazzulo JJ. Intermediate metabolism inTrypanosoma cruzi. J Bioenerg Biomembr. 1994;26: 157–165. doi:10.1007/BF00763064 57. Tyler KM, Engman DM. The life cycle of Trypanosoma cruzi revisited. Int J Parasitol. 2001;31: 472–481. doi:10.1016/S0020-7519(01)00153-9 58. Cannata JJB, Cazzulo JJ. The aerobic fermentation of glucose by Trypanosoma cruzi. Comp Biochem Physiol Part B Comp Biochem. 1984;79: 297–308. doi:10.1016/0305- 0491(84)90380-8 59. Sanchez-moreno M, Fernandez-becerra MC, Castilla-calvente JJ, Osuna A. Metabolic studies by 1H NMR of different forms of Trypanosoma cruzi as obtained by “in vitro” 279 culture. FEMS Microbiol Lett. 1995;133: 119–125. doi:10.1111/j.1574- 6968.1995.tb07871.x 60. Prevention C-C for DC and. CDC - Leishmaniasis [Internet]. 16 Oct 2018 [cited 25 Jan 2019]. Available: https://www.cdc.gov/parasites/leishmaniasis/index.html 61. McConville MJ, Naderer T. Metabolic Pathways Required for the Intracellular Survival of Leishmania. Annu Rev Microbiol. 2011;65: 543–561. doi:10.1146/annurev-micro-090110- 102913 62. Saunders EC, Souza DPD, Naderer T, Sernee MF, Ralton JE, Doyle MA, et al. Central carbon metabolism of Leishmania parasites. Parasitology. 2010;137: 1303–1313. doi:10.1017/S0031182010000077 63. 
Simpson L, Thiemann OH, Savill NJ, Alfonzo JD, Maslov DA. Evolution of RNA editing in trypanosome mitochondria. Proc Natl Acad Sci. 2000;97: 6986–6993. doi:10.1073/pnas.97.13.6986 64. Porcel BM, Denoeud F, Opperdoes F, Noel B, Madoui M-A, Hammarton TC, et al. The Streamlined Genome of Phytomonas spp. Relative to Human Pathogenic Kinetoplastids Reveals a Parasite Tailored for Plants. PLOS Genet. 2014;10: e1004007. doi:10.1371/journal.pgen.1004007 65. Nawathean P, Maslov DA. The absence of genes for cytochrome c oxidase and reductase subunits in maxicircle kinetoplast DNA of the respiration-deficient plant trypanosomatid Phytomonas serpens. Curr Genet. 2000;38: 95–103. doi:10.1007/s002940000135 66. Blum B, Simpson L. Guide RNAs in kinetoplastid mitochondria have a nonencoded 3′ oligo(U) tail involved in recognition of the preedited region. Cell. 1990;62: 391–397. doi:10.1016/0092-8674(90)90375-O 67. Ochsenreiter T, Hajduk SL. Alternative editing of cytochrome c oxidase III mRNA in trypanosome mitochondria generates protein diversity. EMBO Rep. 2006;7: 1128–1133. doi:10.1038/sj.embor.7400817 68. Ochsenreiter T, Anderson S, Wood ZA, Hajduk SL. Alternative RNA Editing Produces a Novel Protein Involved in Mitochondrial DNA Maintenance in Trypanosomes. Mol Cell Biol. 2008;28: 5595–5604. doi:10.1128/MCB.00637-08 69. Covello P, Gray M. On the evolution of RNA editing. Trends Genet. 1993;9: 265–268. doi:10.1016/0168-9525(93)90011-6 70. Gray MW. Evolutionary Origin of RNA Editing. Biochemistry (Mosc). 2012;51: 5235–5242. doi:10.1021/bi300419r 280 71. Gray MW, Lukeš J, Archibald JM, Keeling PJ, Doolittle WF. Irremediable Complexity? Science. 2010;330: 920–921. doi:10.1126/science.1198594 72. Stoltzfus A. On the Possibility of Constructive Neutral Evolution. J Mol Evol. 1999;49: 169–181. doi:10.1007/PL00006540 73. Stoltzfus A. Constructive neutral evolution: exploring evolutionary theory’s curious disconnect. Biol Direct. 2012;7: 35. doi:10.1186/1745-6150-7-35 74. 
Leeder W-M, Hummel NFC, Göringer HU. Multiple G-quartet structures in pre-edited mRNAs suggest evolutionary driving force for RNA editing in trypanosomes. Sci Rep. 2016;6. doi:10.1038/srep29810 75. Speijer D. Evolutionary Aspects of RNA Editing. In: Göringer HU, editor. RNA Editing. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. pp. 199–227. doi:10.1007/978-3- 540-73787-2_10 76. Buhrman H, van der Gulik P, Severini S, Speijer D. A mathematical model of kinetoplastid mitochondrial gene scrambling advantage. ArXiv13071163 Q-Bio. 2013; Available: http://arxiv.org/abs/1307.1163 77. Thiemann OH, Maslov DA, Simpson L. Disruption of RNA editing in Leishmania tarentolae by the loss of minicircle-encoded guide RNA genes. EMBO J. 1994;13: 5689–5700. 78. Savill Nicholas J., Higgs Paul G. A theoretical study of random segregation of minicircles in trypanosomatids. Proc R Soc Lond B Biol Sci. 1999;266: 611–620. doi:10.1098/rspb.1999.0680 79. Lynch M, Bürger R, Butcher D, Gabriel W. The Mutational Meltdown in Asexual Populations. J Hered. 1993;84: 339–344. doi:10.1093/oxfordjournals.jhered.a111354 80. LaBar T, Adami C. Evolution of drift robustness in small populations. Nat Commun. 2017;8: 1012. doi:10.1038/s41467-017-01003-7 81. Muller HJ. The relation of recombination to mutational advance. Mutat Res Mol Mech Mutagen. 1964;1: 2–9. doi:10.1016/0027-5107(64)90047-8 82. Haigh J. The accumulation of deleterious genes in a population—Muller’s Ratchet. Theor Popul Biol. 1978;14: 251–267. doi:10.1016/0040-5809(78)90027-8 83. Poon A, Otto SP. Compensating for Our Load of Mutations: Freezing the Meltdown of Small Populations. Evolution. 2000;54: 1467–1479. doi:10.1111/j.0014- 3820.2000.tb00693.x 281 84. Whitlock MC. Fixation of New Alleles and the Extinction of Small Populations: Drift Load, Beneficial Alleles, and Sexual Selection. Evolution. 2000;54: 1855–1861. doi:10.1111/j.0014-3820.2000.tb01232.x 85. 
Normark S, Bergström S, Edlund T, Grundström T, Jaurin B, Lindberg FP, et al. Overlapping genes. Annu Rev Genet. 1983;17: 499–525. 86. Liang H. Decoding the dual-coding region: key factors influencing the translational potential of a two-ORF-containing transcript. Cell Res. 2010;20: 508–509. doi:10.1038/cr.2010.62 87. Belshaw R, Pybus OG, Rambaut A. The evolution of genome compression and genomic novelty in RNA viruses. Genome Res. 2007;17: 000–000. doi:10.1101/gr.6305707 88. Brandes N, Linial M. Gene overlapping and size constraints in the viral world. Biol Direct. 2016;11: 26. doi:10.1186/s13062-016-0128-3 89. Mouilleron H, Delcourt V, Roucou X. Death of a dogma: eukaryotic mRNAs can code for more than one protein. Nucleic Acids Res. 2016;44: 14–23. doi:10.1093/nar/gkv1218 90. Liang H, Landweber LF. A genome-wide study of dual coding regions in human alternatively spliced genes. Genome Res. 2006;16: 190–196. doi:10.1101/gr.4246506 91. Ribrioux S, Brüngger A, Baumgarten B, Seuwen K, John MR. Bioinformatics prediction of overlapping frameshifted translation products in mammalian transcripts. BMC Genomics. 2008;9: 122. doi:10.1186/1471-2164-9-122 92. Pallejà A, Harrington ED, Bork P. Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions? BMC Genomics. 2008;9: 335. doi:10.1186/1471- 2164-9-335 93. Chung W-Y, Wadhawan S, Szklarczyk R, Pond SK, Nekrutenko A. A First Look at ARFome: Dual-Coding Genes in Mammalian Genomes. PLOS Comput Biol. 2007;3: e91. doi:10.1371/journal.pcbi.0030091 94. Klemke M. Two overlapping reading frames in a single exon encode interacting proteins-- a novel way of gene usage. EMBO J. 2001;20: 3849–3860. doi:10.1093/emboj/20.14.3849 95. Yoshida H, Oku M, Suzuki M, Mori K. pXBP1(U) encoded in XBP1 pre-mRNA negatively regulates unfolded protein response activator pXBP1(S) in mammalian ER stress response. J Cell Biol. 2006;172: 565–575. doi:10.1083/jcb.200508145 96. 
Peleg O, Kirzhner V, Trifonov E, Bolshoy A. Overlapping Messages and Survivability. J Mol Evol. 2004;59: 520–527. doi:10.1007/s00239-004-2644-5 282 97. Sykes SE, Hajduk SL. Dual Functions of α-Ketoglutarate Dehydrogenase E2 in the Krebs Cycle and Mitochondrial DNA Inheritance in Trypanosoma brucei. Eukaryot Cell. 2013;12: 78–90. doi:10.1128/EC.00269-12 98. Sykes S, Szempruch A, Hajduk S. The Krebs Cycle Enzyme α-Ketoglutarate Decarboxylase Is an Essential Glycosomal Protein in Bloodstream African Trypanosomes. Eukaryot Cell. 2015;14: 206–215. doi:10.1128/EC.00214-14 99. Ochsenreiter T, Cipriano M, Hajduk SL. Alternative mRNA Editing in Trypanosomes Is Extensive and May Contribute to Mitochondrial Protein Diversity. PLoS ONE. 2008;3: e1566. doi:10.1371/journal.pone.0001566 100. Aphasizheva I, Maslov DA, Aphasizhev R. Kinetoplast DNA-encoded ribosomal protein S12. RNA Biol. 2013;10: 1679–1688. doi:10.4161/rna.26733 101. Stuart K, Gobright E, Jenni L, Milhausen M, Thomashow L, Agabian N. The Istar 1 Serodeme of Trypanosoma brucei: Development of a New Serodeme. J Parasitol. 1984;70: 747–754. doi:10.2307/3281757 102. Chomczynski P, Sacchi N. Single-step method of RNA isolation by acid guanidinium thiocyanate-phenol-chloroform extraction. Anal Biochem. 1987;162: 156–159. doi:10.1016/0003-2697(87)90021-2 103. Agabian N, Thomashow L, Milhausen M, Stuart K. Structural Analysis of Variant and Invariant Genes in Trypanosomes. Am J Trop Med Hyg. 1980;29: 1043–1049. 104. Friedman RC, Farh KK-H, Burge CB, Bartel DP. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 2009;19: 92–105. doi:10.1101/gr.082701.108 105. Madina BR, Kumar V, Metz R, Mooers BHM, Bundschuh R, Cruz-Reyes J. Native mitochondrial RNA-binding complexes in kinetoplastid RNA editing differ in guide RNA composition. RNA. 2014; doi:10.1261/rna.044495.114 106. Clement SL, Mingler MK, Koslowsky DJ. 
An Intragenic Guide RNA Location Suggests a Complex Mechanism for Mitochondrial Gene Expression in Trypanosoma brucei. Eukaryot Cell. 2004;3: 862–869. doi:10.1128/EC.3.4.862-869.2004 107. Cristodero M, Seebeck T, Schneider A. Mitochondrial translation is essential in bloodstream forms of Trypanosoma brucei. Mol Microbiol. 2010;78: 757–769. doi:10.1111/j.1365-2958.2010.07368.x 108. MacLeod A, Turner CMR, Tait A. A high level of mixed Trypanosoma brucei infections in tsetse flies detected by three hypervariable minisatellites. Mol Biochem Parasitol. 1999;102: 237–248. doi:10.1016/S0166-6851(99)00101-2 283 109. Balmer O, Caccone A. Multiple-strain infections of Trypanosoma brucei across Africa. Acta Trop. 2008;107: 275–279. doi:10.1016/j.actatropica.2008.06.006 110. Szempruch AJ, Choudhury R, Wang Z, Hajduk SL. In vivo analysis of trypanosome mitochondrial RNA function by artificial site-specific RNA endonuclease-mediated knockdown. RNA. 2015; doi:10.1261/rna.052084.115 111. Surve S, Heestand M, Panicucci B, Schnaufer A, Parsons M. Enigmatic Presence of Mitochondrial Complex I in Trypanosoma brucei Bloodstream Forms. Eukaryot Cell. 2012;11: 183–193. doi:10.1128/EC.05282-11 112. Verner Z, Čermáková P, Škodová I, Kriegová E, Horváth A, Lukeš J. Complex I (NADH:ubiquinone oxidoreductase) is active in but non-essential for procyclic Trypanosoma brucei. Mol Biochem Parasitol. 2011;175: 196–200. doi:10.1016/j.molbiopara.2010.11.003 113. Speijer D. Is kinetoplastid pan-editing the result of an evolutionary balancing act? IUBMB Life. 2006;58: 91–96. doi:10.1080/15216540600551355 114. Hudson K m., Taylor AE r., Elce B j. Antigenic changes in Trypanosoma brucei on transmission by tsetse fly. Parasite Immunol. 1980;2: 57–69. doi:10.1111/j.1365- 3024.1980.tb00043.x 115. Gibson W. The origins of the trypanosome genome strains Trypanosoma brucei brucei TREU 927, T. b. gambiense DAL 972, T. vivax Y486 and T. congolense IL3000. Parasit Vectors. 2012;5: 71. 116. 
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011;7: 539. doi:10.1038/msb.2011.75
117. Gonnet GH, Cohen MA, Benner SA. Exhaustive matching of the entire protein sequence database. Science. 1992;256: 1443–1445.
118. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830.
119. Kirby LE, Sun Y, Judah D, Nowak S, Koslowsky D. Analysis of the Trypanosoma brucei EATRO 164 Bloodstream Guide RNA Transcriptome. PLOS Negl Trop Dis. 2016;10: e0004793. doi:10.1371/journal.pntd.0004793
120. Duarte M, Tomás AM. The mitochondrial complex I of trypanosomatids - an overview of current knowledge. J Bioenerg Biomembr. 2014;46: 299–311. doi:10.1007/s10863-014-9556-x
121. Simpson L, Neckelmann N, Cruz VF de la, Simpson AM, Feagin JE, Jasmer DP, et al. Comparison of the maxicircle (mitochondrial) genomes of Leishmania tarentolae and Trypanosoma brucei at the level of nucleotide sequence. J Biol Chem. 1987;262: 6182–6196.
122. Hanada K, Shiu S-H, Li W-H. The Nonsynonymous/Synonymous Substitution Rate Ratio versus the Radical/Conservative Replacement Rate Ratio in the Evolution of Mammalian Genes. Mol Biol Evol. 2007;24: 2235–2241. doi:10.1093/molbev/msm152
123. Firth AE, Brown CM. Detecting overlapping coding sequences with pairwise alignments. Bioinformatics. 2005;21: 282–292. doi:10.1093/bioinformatics/bti007
124. Firth AE, Brown CM. Detecting overlapping coding sequences in virus genomes. BMC Bioinformatics. 2006;7: 75. doi:10.1186/1471-2105-7-75
125. Landweber LF, Gilbert W. RNA editing as a source of genetic variation. Nature. 1993;363: 179.
126. Tielens AGM, van Hellemond JJ. Surprising variety in energy metabolism within Trypanosomatidae. Trends Parasitol. 2009;25: 482–490. doi:10.1016/j.pt.2009.07.007
127.
Verner Z, Čermáková P, Škodová I, Kováčová B, Lukeš J, Horváth A. Comparative analysis of respiratory chain and oxidative phosphorylation in Leishmania tarentolae, Crithidia fasciculata, Phytomonas serpens and procyclic stage of Trypanosoma brucei. Mol Biochem Parasitol. 2014;193: 55–65. doi:10.1016/j.molbiopara.2014.02.003
128. Jackson AP, Berry A, Aslett M, Allison HC, Burton P, Vavrova-Anderson J, et al. Antigenic diversity is generated by distinct evolutionary mechanisms in African trypanosome species. Proc Natl Acad Sci U S A. 2012;109: 3416–3421. doi:10.1073/pnas.1117313109
129. Morrison LJ, Vezza L, Rowan T, Hope JC. Animal African Trypanosomiasis: Time to Increase Focus on Clinically Relevant Parasite and Host Species. Trends Parasitol. 2016;32: 599–607. doi:10.1016/j.pt.2016.04.012
130. Maslov DA, Hollar L, Haghighat P, Nawathean P. Demonstration of mRNA editing and localization of guide RNA genes in kinetoplast–mitochondria of the plant trypanosomatid Phytomonas serpens. Mol Biochem Parasitol. 1998;93: 225–236. doi:10.1016/S0166-6851(98)00028-0
131. David V, Flegontov P, Gerasimov E, Tanifuji G, Hashimi H, Logacheva MD, et al. Gene Loss and Error-Prone RNA Editing in the Mitochondrion of Perkinsela, an Endosymbiotic Kinetoplastid. mBio. 2015;6: e01498-15. doi:10.1128/mBio.01498-15
132. Maslov DA. Complete set of mitochondrial pan-edited mRNAs in Leishmania mexicana amazonensis LV78. Mol Biochem Parasitol. 2010;173: 107–114. doi:10.1016/j.molbiopara.2010.05.013
133. Käll L, Krogh A, Sonnhammer ELL. A Combined Transmembrane Topology and Signal Peptide Prediction Method. J Mol Biol. 2004;338: 1027–1036. doi:10.1016/j.jmb.2004.03.016
134.
Käll L, Krogh A, Sonnhammer ELL. Advantages of combined transmembrane topology and signal peptide prediction—the Phobius web server. Nucleic Acids Res. 2007;35: W429–W432. doi:10.1093/nar/gkm256
135. Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJE. The Phyre2 web portal for protein modeling, prediction and analysis. Nat Protoc. 2015;10: 845–858. doi:10.1038/nprot.2015.053
136. Xu Y, Tao Y, Cheung LS, Fan C, Chen L-Q, Xu S, et al. Structures of bacterial homologues of SWEET transporters in two distinct conformations. Nature. 2014;515: 448–452. doi:10.1038/nature13670
137. Bender T, Pena G, Martinou J-C. Regulation of mitochondrial pyruvate uptake by alternative pyruvate carrier complexes. EMBO J. 2015;34: 911–924. doi:10.15252/embj.201490197
138. Štáfková J, Mach J, Biran M, Verner Z, Bringaud F, Tachezy J. Mitochondrial pyruvate carrier in Trypanosoma brucei. Mol Microbiol. 2016;100: 442–456. doi:10.1111/mmi.13325
139. Saunders EC, Ng WW, Kloehn J, Chambers JM, Ng M, McConville MJ. Induction of a Stringent Metabolic Response in Intracellular Stages of Leishmania mexicana Leads to Increased Dependence on Mitochondrial Metabolism. PLOS Pathog. 2014;10: e1003888. doi:10.1371/journal.ppat.1003888
140. Aphasizheva I, Aphasizhev R. U-Insertion/Deletion mRNA-Editing Holoenzyme: Definition in Sight. Trends Parasitol. 2016;32: 144–156. doi:10.1016/j.pt.2015.10.004
141. Kirby LE, Koslowsky D. Mitochondrial dual-coding genes in Trypanosoma brucei. PLoS Negl Trop Dis. 2017;11: e0005989. doi:10.1371/journal.pntd.0005989
142. Lamour N, Rivière L, Coustou V, Coombs GH, Barrett MP, Bringaud F. Proline Metabolism in Procyclic Trypanosoma brucei Is Down-regulated in the Presence of Glucose. J Biol Chem. 2005;280: 11902–11910. doi:10.1074/jbc.M414274200
143. Coustou V, Biran M, Breton M, Guegan F, Rivière L, Plazolles N, et al. Glucose-induced Remodeling of Intermediary and Energy Metabolism in Procyclic Trypanosoma brucei. J Biol Chem. 2008;283: 16342–16354.
doi:10.1074/jbc.M709592200
144. Bochud-Allemann N, Schneider A. Mitochondrial Substrate Level Phosphorylation Is Essential for Growth of Procyclic Trypanosoma brucei. J Biol Chem. 2002;277: 32849–32854. doi:10.1074/jbc.M205776200
145. Horváth A, Horáková E, Dunajčíková P, Verner Z, Pravdová E, Šlapetová I, et al. Downregulation of the nuclear-encoded subunits of the complexes III and IV disrupts their respective complexes but not complex I in procyclic Trypanosoma brucei. Mol Microbiol. 2005;58: 116–130. doi:10.1111/j.1365-2958.2005.04813.x
146. Gnipová A, Panicucci B, Paris Z, Verner Z, Horváth A, Lukeš J, et al. Disparate phenotypic effects from the knockdown of various Trypanosoma brucei cytochrome c oxidase subunits. Mol Biochem Parasitol. 2012;184: 90–98. doi:10.1016/j.molbiopara.2012.04.013
147. Kuile BH ter. Adaptation of metabolic enzyme activities of Trypanosoma brucei promastigotes to growth rate and carbon regimen. J Bacteriol. 1997;179: 4699–4705. doi:10.1128/jb.179.15.4699-4705.1997
148. Simpson RM, Bruno AE, Bard JE, Buck MJ, Read LK. High-throughput sequencing of partially edited trypanosome mRNAs reveals barriers to editing progression and evidence for alternative editing. RNA. 2016;22: 677–695. doi:10.1261/rna.055160.115
149. Carnes J, McDermott S, Anupama A, Oliver BG, Sather DN, Stuart K. In vivo cleavage specificity of Trypanosoma brucei editosome endonucleases. Nucleic Acids Res. 2017;45: 4667–4686. doi:10.1093/nar/gkx116
150. Otaka E, Hashimoto T, Mizuta K. The ribosomal proteins. I: An introduction to a compilation of the protein species equivalents from various organisms by a universal code system. Protein Seq Data Anal. 1993;5: 285–300.
151. Lawson SD, Igo RP, Salavati R, Stuart KD. The specificity of nucleotide removal during RNA editing in Trypanosoma brucei. RNA. 2001;7: 1793–1802.
152. Baradaran R, Berrisford JM, Minhas GS, Sazanov LA. Crystal structure of the entire respiratory complex I. Nature. 2013;494: 443–448.
doi:10.1038/nature11871
153. Källberg M, Wang H, Wang S, Peng J, Wang Z, Lu H, et al. Template-based protein structure modeling using the RaptorX web server. Nat Protoc. 2012;7: 1511–1522. doi:10.1038/nprot.2012.085
154. Peng J, Xu J. A multiple-template approach to protein threading. Proteins Struct Funct Bioinforma. 2011;79: 1930–1939. doi:10.1002/prot.23016
155. Peng J, Xu J. RaptorX: Exploiting structure information for protein alignment by statistical inference. Proteins Struct Funct Bioinforma. 2011;79: 161–171. doi:10.1002/prot.23175
156. Sturm NR, Maslov DA, Blum B, Simpson L. Generation of unexpected editing patterns in Leishmania tarentolae mitochondrial mRNAs: Misediting produced by misguiding. Cell. 1992;70: 469–476. doi:10.1016/0092-8674(92)90171-8
157. Maslov DA, Thiemann O, Simpson L. Editing and misediting of transcripts of the kinetoplast maxicircle G5 (ND3) cryptogene in an old laboratory strain of Leishmania tarentolae. Mol Biochem Parasitol. 1994;68: 155–159. doi:10.1016/0166-6851(94)00160-X
158. Alatortsev VS, Cruz-Reyes J, Zhelonkina AG, Sollner-Webb B. Trypanosoma brucei RNA Editing: Coupled Cycles of U Deletion Reveal Processive Activity of the Editing Complex. Mol Cell Biol. 2008;28: 2437–2445. doi:10.1128/MCB.01886-07
159. Necas D, Ohtamaa M, Määttä E, Haapala A. python-Levenshtein 0.12.0.
160. Zimmer SL, Simpson RM, Read LK. High throughput sequencing revolution reveals conserved fundamentals of U-indel editing. Wiley Interdiscip Rev RNA. 2018;9: e1487. doi:10.1002/wrna.1487
161. Simpson RM, Bruno AE, Chen R, Lott K, Tylec BL, Bard JE, et al. Trypanosome RNA Editing Mediator Complex proteins have distinct functions in gRNA utilization. Nucleic Acids Res. 2017;45: 7965–7983. doi:10.1093/nar/gkx458
162. Koslowsky DJ, Jayarama Bhat G, Read LK, Stuart K. Cycles of progressive realignment of gRNA with mRNA in RNA editing. Cell. 1991;67: 537–546. doi:10.1016/0092-8674(91)90528-7
163. Abraham JM, Feagin JE, Stuart K.
Characterization of cytochrome c oxidase III transcripts that are edited only in the 3′ region. Cell. 1988;55: 267–272. doi:10.1016/0092-8674(88)90049-9
164. Sturm NR, Simpson L. Partially edited mRNAs for cytochrome b and subunit III of cytochrome oxidase from Leishmania tarentolae mitochondria: RNA editing intermediates. Cell. 1990;61: 871–878. doi:10.1016/0092-8674(90)90197-M
165. Decker CJ, Sollner-Webb B. RNA editing involves indiscriminate U changes throughout precisely defined editing domains. Cell. 1990;61: 1001–1011. doi:10.1016/0092-8674(90)90065-M
166. Ammerman ML, Presnyak V, Fisk JC, Foda BM, Read LK. TbRGG2 facilitates kinetoplastid RNA editing initiation and progression past intrinsic pause sites. RNA. 2010;16: 2239–2251. doi:10.1261/rna.2285510
167. Wang Z, Drew ME, Morris JC, Englund PT. Asymmetrical division of the kinetoplast DNA network of the trypanosome. EMBO J. 2002;21: 4998–5005. doi:10.1093/emboj/cdf482
168. Lukeš J, Skalický T, Týč J, Votýpka J, Yurchenko V. Evolution of parasitism in kinetoplastid flagellates. Mol Biochem Parasitol. 2014;195: 115–122. doi:10.1016/j.molbiopara.2014.05.007