NOVEL COMPUTATIONAL APPROACHES TO INVESTIGATE MICROBIAL DIVERSITY

By

Qingpeng Zhang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science — Doctor of Philosophy

2015

ABSTRACT

NOVEL COMPUTATIONAL APPROACHES TO INVESTIGATE MICROBIAL DIVERSITY

By

Qingpeng Zhang

Species diversity is an important measurement of ecological communities. Scientists believe that there is a strong relationship between species diversity and ecosystem processes. However, efforts to investigate microbial diversity using whole genome shotgun read data are still scarce. Through novel applications of data structures and the development of new algorithms, we first developed an efficient k-mer counting approach and methods that enable scalable streaming analysis of large and error-prone short-read shotgun data sets. Building on these efforts, we then developed a statistical framework allowing scalable diversity analysis of large, complex metagenomes without the need for assembly or reference sequences. This method is evaluated on multiple large metagenomes from different environments, such as seawater, the human microbiome, and soil. Given the rapid growth of sequencing data, this method is promising for analyzing highly diverse samples with relatively low computational requirements. Further, as the method does not depend on reference genomes, it also provides opportunities to tackle the large amounts of unknowns we find in metagenomic datasets.

To my parents, Mr. Hui Zhang and Mrs. Xiuxiang Zhao.

ACKNOWLEDGMENTS

First and foremost, I would like to sincerely thank my advisor, Dr. C. Titus Brown, for his insightful guidance, his generous support, his confidence in me, and especially his patience and encouragement when progress was slow. It's been an honor and a privilege working with you, Titus!

Next I would like to thank my committee members, Dr. James Cole, Dr. Richard Enbody, Dr. Yanni Sun and Dr. Eric Torng, for their valuable input to my research and their review of this dissertation.

I am also indebted to my collaborators, including Dr. Shana Goffredi, Yiseul Kim, Dr. Joan Rose, Dr. Tom Schmidt, Dr. Tracy Teal, Dr. James Tiedje, and Dr. Susannah Tringe. I learned a lot from their insights.

I would also like to thank the GED lab members, who are also my best friends. I thank Dr. Sherine Awad, Michael Crusoe, Jiarong Guo, Luiz Irber and Camille Scott for their excellent collaborative ideas and their generous help with my preparation of this dissertation and the defense. I should not forget several former members of the GED lab, including Rose Canino-Koning, Dr. Elijah Lowe, Eric McDonald, Dr. Kanchan Pavangadkar, Dr. Jason Pell, Dr. Likit Preeyanon, and Neem Serra. I really miss the time we worked together. Your work was very important to the success of my project. Thank you!

To all my friends, thank you for the support and encouragement.

A big thank you to Dr. Adina Howe, a colleague, mentor and friend, for everything.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter 1  Introduction
1.1 Overview
1.2 Next-generation sequencing
1.3 Metagenomics
1.4 Concept of diversity
1.5 Problem statement
1.6 Significance of research
1.7 Outline of dissertation
Chapter 2  Review of relevant literature
2.1 Challenges in counting k-mers accurately and efficiently
2.2 Tackling large and error-prone short-read shotgun data sets
2.3 Challenges in measuring diversity of metagenomics
    2.3.1 Diversity measurement in microbial ecology
        2.3.1.1 OTU identification using sequence markers
        2.3.1.2 Binning of metagenomic reads into OTUs
            2.3.1.2.1 Composition-based approach
            2.3.1.2.2 Similarity-based approach
        2.3.1.3 Statistics for diversity estimation

Chapter 3  Efficient online k-mer counting using a probabilistic data structure
3.1 Introduction
3.2 Count-Min Sketch and its application in k-mer counting
    3.2.1 Implementing a Count-Min Sketch for k-mers
    3.2.2 Choosing number and size of hash tables used for k-mer counting
3.3 khmer can count k-mers efficiently
    3.3.1 khmer is a generally useful k-mer counting approach
    3.3.2 khmer memory usage is fixed and low
    3.3.3 khmer accesses k-mer counts efficiently
3.4 False positive rates in k-mer counting are low and predictable
    3.4.1 The measured counting error is low on short-read data
    3.4.2 Real-world applications of khmer
3.5 Conclusion
3.6 Data
    3.6.1 Code and data set availability
    3.6.2 Sequence data

Chapter 4  A framework for streaming analysis of short DNA sequencing reads based on k-mer counting
4.1 Introduction
4.2 Estimating sequencing depth without a reference assembly
4.3 A streaming algorithm to digitally normalize the coverage distribution of data sets
    4.3.1 Eliminating redundant reads reduces variation in sequencing depth
    4.3.2 Digital normalization scales assembly of microbial genomes
    4.3.3 Digital normalization scales assembly of transcriptomes
    4.3.4 Lower bound on memory usage for effective digital normalization
    4.3.5 Digital normalization dramatically scales de novo assembly
4.4 A streaming algorithm to analyze and trim errors in short reads
    4.4.1 Two-pass non-streaming method to enable read error analysis
    4.4.2 A semi-streaming algorithm can be used for error analysis
    4.4.3 Semi-streaming error trimming on synthetic and real data
    4.4.4 Semi-streaming Illumina error rates and error profiles analysis
4.5 Time and space usage of the streaming algorithm for analyzing short DNA sequencing reads
4.6 Conclusion
4.7 Data
    4.7.1 Code availability
    4.7.2 Data sets used for digital normalization
    4.7.3 Synthetic data sets used for error analysis
    4.7.4 Real data sets used for error analysis

Chapter 5  A framework for diversity analysis of whole shotgun metagenomic reads data
5.1 Introduction
5.2 The concept of IGS (informative genomic segment)
    5.2.1 IGS (informative genomic segment) can represent the novel information of a genome
    5.2.2 Using IGS to analyze alpha diversity
    5.2.3 Using IGS to analyze beta diversity
5.3 Evaluating the IGS method using simulated data sets
    5.3.1 Using a simple simulated data set to evaluate the IGS method
    5.3.2 Improving the accuracy of this method in real world analysis
        5.3.2.1 The effect of sequencing error on the accuracy of analysis
        5.3.2.2 The effect of Bloom filter size on the accuracy of analysis
    5.3.3 The effect of sequencing depth on the accuracy of the IGS method
    5.3.4 Comparing the IGS method to COMMET in beta diversity analysis
    5.3.5 The IGS method can provide a whole framework for alpha or beta diversity analysis, with good versatility
5.4 Applying the IGS method to real metagenome data sets
    5.4.1 GOS data sets: Sorcerer II Global Ocean Sampling Expedition
    5.4.2 Human Microbiome Project (HMP) metagenomics data set
    5.4.3 GPGC - Great Prairie Soil Metagenome Grand Challenge
    5.4.4 More soil metagenomic samples
5.5 Data
    5.5.1 Code availability

Chapter 6  Conclusion
6.1 Summary
6.2 Future work

BIBLIOGRAPHY

LIST OF TABLES

Table 2.1   Description of k-mer counting packages.

Table 3.1   Benchmark soil metagenome data sets for k-mer counting performance, taken from [51].
Table 3.2   Data sets used for analyzing miscounts.

Table 4.1   Single-pass digital normalization to C=20 reduces computational requirements for transcriptome assembly.

Table 4.2   Digital normalization has assembler-specific effects on transcriptome assembly.

Table 4.3   Digital normalization to C=20 removes many erroneous k-mers from sequencing data sets. Numbers in parentheses indicate the number of true k-mers lost at each step, based on the reference.

Table 4.4   Three-pass digital normalization removes most erroneous k-mers. Numbers in parentheses indicate the number of true k-mers lost at each step, based on the known reference.

Table 4.5   Low-memory digital normalization. The results of digitally normalizing a 5m read E. coli data set (1.4 GB) to C=20 with k=20 under several memory usage/false positive rates. The false positive rate (column 1) is empirically determined. We measured reads remaining, number of "true" k-mers missing from the data at each step, and the number of total k-mers remaining. Note: at high false positive rates, reads are erroneously removed due to inflation of k-mer counts.

Table 4.6   E. coli genome assembly after low-memory digital normalization. A comparison of assembling reads digitally normalized with low memory/high false positive rates. The reads were digitally normalized to C=20 (see [10] for more information) and were assembled using Velvet. We measured total length of assembly, as well as percent of true MG1655 genome covered by the assembly using QUAST.

Table 4.7   Iterative low-memory k-mer trimming. The results of trimming reads at unique (erroneous) k-mers from a 5m read E. coli data set (1.4 GB) in under 30 MB of RAM. After each iteration, we measured the total number of distinct k-mers in the data set, the total number of unique (and likely erroneous) k-mers remaining, and the number of unique k-mers present at the 3' end of reads.

Table 4.8   A summary of trimming statistics for semi-streaming error trimming. Error rates before and after trimming were estimated by mapping. "High coverage" numbers refer to the subset of reads with C ≥ 20 that were subject to analysis.

Table 4.9   Results of streaming error trimming on complete data sets. Error rates before and after trimming were estimated by mapping.

Table 4.10  Data sets used for evaluation.

Table 5.1   Total number of IGSs in different simulated reads data sets.

Table 5.2   Reads coverage distribution.

Table 5.3   IGS abundance distribution.

Table 5.4   Listing IGSs with the corresponding abundances.

Table 5.5   Samples-by-OTU matrix.

Table 5.6   Samples-by-IGS matrix.

Table 5.7   Samples-by-IGS matrix from example.

Table 5.8   Six synthetic simple metagenomes.

Table 5.9   Alpha diversity analysis result of the simple simulated data using the IGS method.

Table 5.10  Dissimilarity matrix between synthetic samples using Bray-Curtis from species composition directly.
Table 5.11  Dissimilarity matrix between synthetic samples using Bray-Curtis from sequencing reads using the IGS method.

Table 5.12  GPGC data sets.

LIST OF FIGURES

Figure 3.1  Comparison of the time it takes for k-mer counting tools to calculate k-mer abundance histograms, with time (y axis, in seconds) against data set size (in number of reads, x axis). All programs executed in time approximately linear with the number of input reads.

Figure 3.2  Memory usage of k-mer counting tools when calculating k-mer abundance histograms, with maximum resident program size (y axis, in GB) plotted against the total number of distinct k-mers in the data set (x axis, billions of k-mers).

Figure 3.3  Disk storage usage of different k-mer counting tools to calculate k-mer abundance histograms in GB (y axis), plotted against the number of distinct k-mers in the data set (x axis). * Note that khmer does not use the disk during counting or retrieval, although its hash tables can be saved for reuse.

Figure 3.4  Time for several k-mer counting tools to retrieve the counts of 9.7m randomly chosen k-mers (y axis), plotted against the number of distinct k-mers in the data set being queried (x axis). BFCounter, DSK, Turtle, KAnalyze, and KMC do not support this functionality.

Figure 3.5  Relation between average miscount — amount by which the count for k-mers is incorrect — on the y axis, plotted against false positive rate (x axis), for five data sets. The five data sets were chosen to have the same total number of distinct k-mers: one metagenome data set; a set of randomly generated k-mers; a set of reads, chosen with 3x coverage and 1% error, from a randomly generated genome; a simulated set of error-free reads (3x) chosen from a randomly generated genome; and a set of E. coli reads.

Figure 3.6  Relation between percent miscount — amount by which the count for k-mers is incorrect relative to its true count — on the y axis, plotted against false positive rate (x axis), for five data sets. The five data sets are the same as in Figure 3.5.

Figure 4.1  Representative rank-abundance distributions for 20-mers from 100-base reads with no errors, a read with a single substitution error, and a read with multiple substitution errors.

Figure 4.2  Mapping and k-mer coverage measures correlate for simulated genome data and a real E. coli data set (5m reads). Simulated data r² = 0.79; E. coli r² = 0.80.

Figure 4.3  Mapping and k-mer coverage measures correlate for simulated transcriptome data as well as real mouse transcriptome data. Simulated data r² = 0.93; mouse transcriptome r² = 0.90.

Figure 4.4  Coverage distribution of three microbial genome samples, calculated from mapped reads (a) before and (b) after digital normalization (k=20, C=20).

Figure 4.5  Fraction of reads kept when normalizing the E. coli dataset to C=20 at k=20.

Figure 4.6  Number of unique k-mers (y axis) by starting position within read (x axis) in an untrimmed E. coli 100-bp Illumina shotgun data set, for k=17 and k=32.
The increasing numbers of unique k-mers are a sign of the increasing sequencing error towards the 3' end of reads. Note that there are only 69 starting positions for 32-mers in a 100 base read.

Figure 4.7  Diagram of semi-streaming error detection. In a first pass over the read data, reads are loaded in until the graph locus to which they belong is saturated. From that point on, reads are examined for errors and not loaded into the graph. In a second pass, only the subset of reads loaded into the graph are examined for errors.

Figure 4.8  Saturation curve of a real and a simulated E. coli read data set. Reads are collected when they have an estimated coverage of less than 20; in the early phase (< 1m reads), almost all reads are collected, but by 2m reads into the data set, the majority of reads come from loci with an estimated sequencing depth of > 20 and are rejected.

Figure 4.9  Error spectrum of reads in the E. coli data set. The sublinear k-mer spectrum analysis is calculated based on saturation of a fraction of the data set, while the two-pass spectral analysis uses all of the data. bowtie2 mismatches are based on all mapped reads. The y values for the k-mer spectral analyses are scaled by a factor of four for ease of comparison.

Figure 4.10  Error spectrum of reads in the mouse RNAseq data set. The sublinear k-mer spectrum analysis is calculated based on saturation of a fraction of the data set, while the two-pass spectral analysis uses all of the data, and bowtie2 mismatches are based on all mapped reads. The peak of errors at position 34 in the bowtie2 mapping reflects errors that in the first part of the data set are called as Ns, and hence are ignored by the sublinear error analysis; see text for details. Note, the bowtie2 mismatch rates are larger than the spectral rates, so for ease of comparison the y values for the k-mer spectral analyses are scaled by a factor of four.

Figure 5.1  Transforming reads into IGSs.

Figure 5.2  IGS (informative genomic segment) can represent the novel information of a genome.

Figure 5.3  Get the coverage of a read in samples. A read in sample A has the coverage of 5 in sample A and the coverage of 8 in sample B.

Figure 5.4  From read coverage profile to IGS. (A): Get the coverage profile of one read. (B): Get the coverage profiles of all the reads in 3 samples. (C): Group the reads with the same coverage profiles into "super bins". (D): Calculate the number of IGSs in each "super bin".

Figure 5.5  Ordination of the 6 synthetic samples using the IGS method.

Figure 5.6  Clustering of the 6 synthetic samples using the IGS method.

Figure 5.7  Richness estimation using the IGS method without adjustment.

Figure 5.8  Beta diversity analysis using the IGS method without adjustment.

Figure 5.9  Richness estimation using the IGS method without adjustment.

Figure 5.10  Richness estimation using the IGS method adjusted by sequencing error rate.

Figure 5.11  Richness estimation using the IGS method adjusted by sequencing error rate and false positive rate of the Bloom filter.
Figure 5.12  Correlation between calculated distance matrix and true matrix from different data sets with different sequencing depth.

Figure 5.13  Estimated genome size from data sets with variable coverage, without error.

Figure 5.14  Estimated genome size from data sets with variable coverage, with error rate of 0.5%.

Figure 5.15  Estimated genome size from data sets with variable coverage, with error rate of 1.0%.

Figure 5.16  Correlation between calculated distance matrix and true distance matrix from different data sets and using different methods.

Figure 5.17  Clustering of Global Ocean Sampling Expedition samples using the IGS method.

Figure 5.18  Rarefaction curve of IGSs of Global Ocean Sampling Expedition samples.

Figure 5.19  Estimated number of IGSs of Global Ocean Sampling Expedition samples.

Figure 5.20  Principal coordinates analysis of 12 Human Microbiome Project samples. Red: anterior nares (skin); green: throat (oral); blue: buccal mucosa (oral); orange: posterior fornix (vaginal).

Figure 5.21  Alpha diversity of 12 Human Microbiome Project samples: estimation of metagenome size of HMP samples. Red: anterior nares (skin); green: throat (oral); blue: buccal mucosa (oral); orange: posterior fornix (vaginal).

Figure 5.22  Principal coordinates analysis of 8 Great Prairie Soil Metagenome Grand Challenge (GPGC) samples.

Figure 5.23  Alpha diversity analysis of 8 GPGC samples. Upper left, rarefaction curve of IGSs. Upper right, estimated number of IGSs in different samples. Lower left, estimated number of IGSs in samples grouped by location (Iowa, Kansas and Wisconsin). Lower right, estimated number of IGSs in samples grouped by treatment (corn, prairie, restored, switchgrass).

Figure 5.24  Principal coordinates analysis of soil samples with different treatments collected from Kellogg Biological Station (KBS). Red, corn. Blue, miscanthus. Brown, switchgrass.

Figure 5.25  Principal coordinates analysis of soil samples collected from the Amazon rainforest. Red, forest samples. Blue, prairie samples.

Figure 5.26  Estimated number of IGSs in metagenomic data from soil samples collected from the Amazon rainforest, grouped by treatment. Red, forest samples. Blue, prairie samples.

Chapter 1

Introduction

1.1 Overview

Species diversity is an important measurement of ecological communities. Scientists believe that there is a relationship between species diversity and ecosystem processes [71]. Evaluating the species diversity in a community is a central research topic in macroorganism ecology. Many methods have been developed over the last few decades, aimed at answering questions such as "how many species of birds are in this habitat?". Nevertheless, until recently scientists had not begun to think seriously about larger-scale questions such as "How many species are there on Earth?" [81] or "How many species are there in the ocean?" [89]. Why?
The answer is straightforward: microorganisms represent the vast majority of the Earth's biodiversity, and the assessment of microbial diversity is quite difficult. It is believed that microbial diversity is the outermost frontier of the exploration of diversity [74]. Microorganisms are ubiquitous; there are more bacterial cells in our body than human cells [113]. There are several reasons why the assessment of microbial diversity is such a challenge. First, the concept of species is ambiguous. Morphological examination is impossible, and fewer than 1% of the microorganisms in the biosphere can be cultivated by traditional cultivation techniques [29]. To overcome this obstacle, metagenomics has emerged, driven by the progress of next-generation sequencing (NGS) technology. Many metagenomics projects have been performed on samples ranging from acid mine drainage channels to the human gut.

For complex environmental samples such as soil, the resulting data sets can be huge. A single gram of soil contains approximately a billion microbial cells, with about 4 petabase pairs of DNA (4 × 10^15 bp) [144]. Since we have limited sequencing power, the resulting metagenomics data sets from highly complex samples (e.g. soil) correspond to only a tiny fraction of the actual genomic content in the sample. The large size of the data sets and the low sequencing coverage make the assessment of microbial diversity of highly diverse samples even harder. Novel methods are needed.

1.2 Next-generation sequencing

Sequencing technology is changing quickly. Over the past decade, next-generation sequencing (NGS) has become the dominant technology and has almost replaced classic Sanger sequencing. Illumina and Roche 454 are the two most popular platforms. Illumina generates shorter reads, typically up to 150 base pairs (bp) for the HiSeq and 250 bp for the MiSeq platform [102, 80], but at a much lower cost than the Roche 454 technology, which generates reads of 500 to 1,000 bp. Because of this cost advantage, Illumina has come to dominate the sequencing market, which means that a developer designing any tool for metagenomics should take the relatively short length of Illumina reads into account.

1.3 Metagenomics

It is believed that the word "metagenomics" was coined in 1998 [48]; it can be translated as 'beyond the genome' [36]. At that time, it referred to the technique of cloning environmental DNA randomly and screening for genes of interest, especially 16S ribosomal RNA (rRNA) genes. This technique was first applied in practice by Schmidt et al. in 1991 [117]. It was a crucial step in expanding sequence-based investigation to the microbial world. Before that, the standard protocol was to culture and isolate microbes before analysis, which resulted in a much narrower picture of the diversity of an ecosystem, as only a small portion of the microbial species (5% or less) in the biosphere can be cultured with traditional cultivation techniques [128]. Metagenomics, with its concept of cloning DNA directly from a sample without cultivation, gave researchers the ability to explore the entire spectrum of organisms in an environment. The number of microbial species in some ecological communities is huge. In soil, it is estimated that there exist millions of species, most of them in low abundance [35].
The improvement of NGS technology, with ever higher throughput and ever lower costs, has been accelerating metagenomics research recently, since only a high-throughput NGS strategy can sample the contents of those populations deeply enough to examine rare species.

Currently, there are two approaches in metagenomics. One is amplicon metagenomics, in which genes of interest, such as 16S rRNA genes, are amplified and sequenced [128]. This is the traditional way, dating back to the 1991 work by Schmidt et al., and many microbial diversity studies have relied on this approach. The other approach is whole genome shotgun metagenomics, which sequences randomly isolated DNA fragments without targeting specific genes. Since the whole genomes of the organisms in a sample are available, and not just limited genes of interest like 16S/18S rRNA, whole genome shotgun sequencing can in theory provide better taxonomic resolution and more information to benefit other investigations [135, 102]. There are now thousands of metagenomic samples available in online databases such as MG-RAST [38].

There have been many metagenomics projects focusing on microbial samples from different kinds of habitat, from low-complexity extreme environments such as acid mine drainage channels [135], through medium-complexity samples like the human gut [102] and cow rumen [49], to high-complexity samples like seawater [137] and soil [37]. Metagenomics studies have revealed a great deal about the microbial communities in different habitats, and some have shed light on serious human diseases: studies have shown associations between human gut metagenomes and type II diabetes [104], obesity [134, 61], and Crohn's disease [90]. In almost all of these metagenomics projects, diversity analysis plays an important role in supplying knowledge about the richness of species and the abundance distribution of species in a sample, as well as the similarity and difference between samples, all of which are crucial for drawing insightful and reliable conclusions.

1.4 Concept of diversity

When we characterize an ecological community, diversity measurements are often the first step. It is always desirable to know how many species there are in a sample – its "richness" – and how abundant each species is relative to others in the same sample – its "evenness". These concepts are straightforward; in practice, however, a large number of quantities are used to measure species diversity, owing to the many different approaches to sampling individuals.

At a high level, three diversity indices are well established and used in ecology: α-diversity, β-diversity, and γ-diversity. The α-diversity is the diversity within one specific habitat or sample. The β-diversity is the comparison of species diversity between habitats or samples. The γ-diversity is the total diversity of a region containing multiple ecosystems [74].

The concept of diversity has two aspects, richness and evenness. Richness is the total number of species identified in a sample, which is the simplest descriptor of community structure. Evenness is a measure of how the abundance of each species compares to that of the other species in the community. If all the species in a community have the same abundance, the community has high evenness. However, almost all natural communities are highly uneven: the community is dominated by relatively few species, and there are a large number of species with low abundance.
This raises a question about the effectiveness of using richness alone to represent species diversity: is a community with 1 dominant species and 10 rare species more diverse than a community with 3 dominant species and 2 rare species? Thus, new metrics taking both richness and evenness into account have been suggested. The two most popular diversity indices are Shannon diversity [121], which is based on information theory and uses the information content of a community as an estimate of diversity, and Simpson diversity [122], which is based on the probability that two individuals picked randomly from a community belong to the same species. Besides these two, Hill [50] proposed a diversity index based on the species abundance distribution, which uses a weighted count of species to measure diversity. It can be considered a generalized diversity index, since the Shannon index, the Simpson index, and richness can all be seen as special cases of the Hill diversity index. It is necessary to note that no index is generically better than the others; the best choice depends on the characteristics of the community and the sampling process, as well as other factors. How to choose diversity indices that best represent the diversity of a community has been investigated and discussed extensively [5, 46, 91].

In microbial ecology, richness is simply the most popular index for measuring microbial diversity, partially because of the challenges raised by the distinct characteristics of microbial communities. Most microbial diversity studies concentrate on species richness comparison [5]. Many methods for estimating richness in classic ecology have been borrowed to tackle the problem of estimating microbial diversity, as will be discussed in the next section.

1.5 Problem statement

In almost all metagenomics projects, diversity analysis plays an important role in supplying knowledge about the richness of species, the abundance distribution of species in a sample, and the similarity and difference between different samples. The topic of microbial diversity measurement has been investigated for a long time, and many methods and software packages have been developed. However, there still remains plenty of room for more work.

OTUs (Operational Taxonomic Units) based on 16S rRNA genes, traditionally used for amplicon metagenomics data sets, are also used as the basic units for diversity analysis of shotgun metagenomic data. OTUs can be a good replacement for the concept of "species" in metagenomics. Basically, contigs are assembled from reads and are "binned" into OTUs using composition-based or similarity-based approaches; the diversity can then be estimated using the abundance information of the OTUs.

The mainstream methods to measure microbial diversity still focus on the use of 16S rRNA amplicon metagenomics data, and many of the popular microbial diversity analysis software packages generally accept 16S rRNA data as input. This is understandable because the concept of the OTU comes from the similarity of 16S rRNA sequences. Using 16S rRNA data to measure diversity is popular but not without problems. The 16S rRNA gene may not be a reliable OTU marker: its reliability is sensitive to potential horizontal gene transfer and to variation in gene copy number among bacteria. There have been suggestions that alternative marker genes should be used, such as single-copy housekeeping genes. Thus, measuring diversity beyond the use of 16S rRNA data is worth investigating.
Recently, many more projects have been generating whole genome shotgun metagenomics data sets. However, these are mainly used for assembly and annotation purposes; less attention has been paid to diversity measurement using whole genome metagenomics data. One possible reason is that whole genome metagenomics data sets often have low depth relative to the high diversity of metagenomics samples, compared to 16S rRNA amplicon metagenomics data sets. Assembly and annotation are always challenging with low depth and a lack of reference sequences, and the same is true for diversity measurement. On the other hand, despite the low depth, some whole genome metagenomics data sets are very large because of the high diversity; for instance, there may be 4 petabase pairs of DNA in a gram of soil [144]. Many existing methods for sequence binning or diversity estimation do not scale well and will not work for large metagenomics data sets. For instance, many composition-based binning approaches involve k-mer/signature frequency distribution calculations, which are rather computationally expensive. Even basic sequence alignment becomes impractical for large metagenomics data sets, and many of the statistical software packages that estimate diversity using various estimators are not prepared for the scale of whole genome metagenomics data. With the development of NGS technology, the cost of sequencing is dropping rapidly. Whole genome metagenomics sequencing is increasingly popular, and metagenomics data are being generated at a speed that cannot be matched even by the increase in computational capacity. Novel methods that scale well are urgently needed to deal with increasingly large metagenomics data sets.

1.6 Significance of research

We established a series of approaches to enable scalable and effective investigation of microbial diversity using whole-genome shotgun metagenomic data. First, a k-mer counting package, khmer, was developed to enable fast and memory efficient k-mer-based analysis of sequencing data sets [148, 28]. khmer relies on the Count-Min Sketch, a probabilistic data structure used to store the frequencies of distinct elements efficiently. Unlike other data structures used for k-mer counting, such as hash tables, suffix arrays, and tries, the Count-Min Sketch has significantly lower memory usage for sparse data sets, with the trade-off of a systematic counting error (overcounting) caused by false positives. We conducted extensive analysis of the performance of the counting algorithm and benchmarks comparing khmer to other k-mer counting packages. The initial motivation for developing khmer was to count the k-mers in metagenomes for diversity analysis; khmer is now widely used for many other purposes, from enabling large scale de novo metagenome assembly to sequencing error detection and correction.

Building on the efficient k-mer counting package khmer, especially its ability to do online counting and retrieval entirely in memory, we developed digital normalization [10], "a single-pass computational algorithm that systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors." Digital normalization can dramatically reduce the computational expense of downstream analysis such as assembly, because normalizing the sampling variance discards redundant reads along with the errors they contain.
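The core idea of digital normalization is simple: estimate each read's coverage as the median count of its k-mers seen so far, keep the read only if that estimate is below the target coverage C, and update the counts only for kept reads. The following toy sketch illustrates the idea with a plain Python dictionary in place of khmer's Count-Min Sketch; the parameter names and example reads are illustrative only, not the khmer implementation.

    # Minimal sketch of the digital normalization idea (illustrative only).
    from statistics import median

    K = 20          # k-mer size
    CUTOFF = 20     # desired coverage C

    kmer_counts = {}  # k-mer -> observed count so far

    def kmers(seq, k=K):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def keep_read(seq):
        """Keep a read only if its estimated coverage (median k-mer count)
        is still below the cutoff; otherwise it is redundant."""
        counts = [kmer_counts.get(km, 0) for km in kmers(seq)]
        if not counts or median(counts) < CUTOFF:
            for km in kmers(seq):
                kmer_counts[km] = kmer_counts.get(km, 0) + 1
            return True
        return False

    reads = ["ACGT" * 10, "ACGT" * 10, "TTGCA" * 8]  # toy reads
    kept = [r for r in reads if keep_read(r)]
    # Both toy reads patterns are kept here, since neither reaches C=20;
    # in a real data set, reads from high-coverage loci would be discarded.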
The digital normalization algorithm has been used by many research groups to facilitate their analyses and has been implemented in different tools, such as Trinity and Illumina's TruSeq pipeline. Based on the same approach of estimating sequencing depth without a reference assembly, a streaming approach to analyze and trim sequencing errors in short read datasets was also developed [147]. The approach offers a general framework for streaming sequence analysis and could be used for error correction and variant calling. Moreover, it can be applied generically to data sets with variable sequencing coverage, such as metagenomes.

Furthermore, by integrating efficient k-mer counting with a novel de Bruijn graph mapping method based on digital normalization, we developed a novel approach allowing scalable diversity analysis of large, complex metagenomes. A novel concept, the IGS (informative genomic segment), is proposed to represent the unique information in a metagenomics data set. IGSs can be used as a complement to OTUs and serve as the cornerstone for diversity analysis of whole shotgun metagenomics data sets. The abundance of IGSs in different samples can be retrieved by mapping the reads to de Bruijn graphs. Unlike many other microbial diversity analysis methods, this procedure requires neither assembly nor binning. The method was evaluated on multiple metagenomes from a variety of environments (e.g., human body sites, seawater, soil). Given the velocity of growth of sequencing data, this method is promising for analyzing highly diverse samples with relatively low computational requirements. Further, as the method does not depend on reference genomes, it also provides opportunities to tackle the large amounts of unknown "dark matter" we find in metagenomic datasets.

1.7 Outline of dissertation

In this dissertation I will discuss in detail a series of approaches enabling scalable and effective investigation of microbial diversity using whole-genome shotgun metagenomic data. In chapter 2, I give a brief review of the relevant literature on the challenges of enabling diversity analysis of metagenomic data. In chapter 3, I describe a novel approach to count k-mers efficiently and a scalable approach, based on efficient and online k-mer counting, to retrieve the coverage of a read in a data set. In chapter 4, I introduce two applications of this approach: digital normalization, which reduces the redundancy of metagenomic read datasets, and a streaming method to analyze sequencing error. Both are critically important to the improvement of other metagenomic data analysis approaches, such as assembly, error trimming, and contig/read binning. In chapter 5, I discuss how I developed the concept of the IGS based on efficient k-mer counting and digital normalization; the effort to increase the accuracy of the IGS-based method is discussed, and the performance of the IGS method on simulated and real data sets is demonstrated. In the last chapter, I summarize how the novel statistical framework based on the IGS makes a difference to diversity analysis in current microbial ecology research and discuss some directions for future work.

Chapter 2

Review of relevant literature

2.1 Challenges in counting k-mers accurately and efficiently

A k-mer is a substring of length k in a DNA sequence, and k-mer counting is the problem of determining the number of occurrences of each such k-mer in a DNA data set [78].
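As a concrete illustration of this definition (a toy sketch, not part of any particular tool; the sequence and value of k are arbitrary examples), the k-mers of a sequence and their counts can be enumerated with a sliding window:

    # Toy illustration of k-mers and k-mer counting with a sliding window.
    from collections import Counter

    def count_kmers(seq, k):
        """Return a mapping from each k-mer to its number of occurrences."""
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    counts = count_kmers("ACGTACGTAC", 4)
    # Counter({'ACGT': 2, 'CGTA': 2, 'GTAC': 2, 'TACG': 1})
    print(counts)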
Efficient k-mer counting plays an important role in solving many bioinformatics problems. One important problem is the de novo assembly of very large numbers of short reads. With the development of next generation sequencing (NGS) technology, many research groups can afford to sequence samples of specific species, or even metagenomic samples containing numerous different species [86]. Large amounts of NGS short reads are generated, and de novo assembly is required for these sequence data sets [87]. Currently, the de Bruijn graph method is popular for de novo assembly because of its advantages in assembling next generation sequencing short reads [99]. Several popular assemblers based on de Bruijn graphs have been published, including Velvet [146], ALLPATHS [12], ABySS [126] and SOAPdenovo [68]. All the k-mers in a sequence data set are represented as nodes in the de Bruijn graph, and two k-mers are connected by an edge if they overlap by k-1 bases. Since the k-mer is such a basic unit in de Bruijn graph de novo assembly, it is of great importance to determine the occurrences of k-mers. For example, sequencing errors can generate many erroneous unique k-mers, so we can filter out reads with too many unique k-mers before doing the assembly. Similarly, we can filter out reads containing k-mers that occur too many times, to smooth read data sets with highly uneven abundance such as those from MDA (multiple displacement amplification). Pre-filtering reads to reduce the size of the read data set to be assembled is important for reducing time and memory usage.

Another application of k-mer counting is the evaluation of microbial diversity in metagenomic samples. Counting the number of k-mers and obtaining the k-mer abundance distribution can give us some hints about the richness and evenness of a metagenomic sample, although the erroneous k-mers arising from sequencing error cause some problems. Also, one popular family of approaches for metagenomic contig binning is based on analyzing k-mer abundance profiles [95, 7, 110]. This was in fact one of our motivations for developing khmer, which will be discussed in the next chapter. One more application of k-mer counting is the de novo detection of repetitive elements: k-mer frequency provides important information for predicting regions containing repetitive elements, such as transposons, with important biological functions [64].

There are two specific characteristics of NGS short reads that make k-mer counting complicated. One is that the large size of NGS read data means there are a large number of k-mers to count. At the same time, the k-mers in an NGS read data set still account for only a small proportion of the total number of possible k-mers: for a typical value of k = 20, there are 4^20 possible k-mers, far more than the actual number of k-mers present in any genomic read data set. A careful choice of data structure is therefore needed to accommodate this sparseness of actual k-mers and enable efficient counting. Also, of the large number of k-mers in an NGS read data set, many, especially the unique ones, are erroneous because of sequencing errors. Generally, one sequencing error introduces up to k erroneous k-mers. With a relatively high error rate of 0.1-1% in Illumina read data [86], the more sequencing we do and the more reads we have, the more erroneous k-mers there will be, eventually outnumbering the true k-mers in the data set. Effective and efficient methods to count these large numbers of k-mers, most of them erroneous, are highly needed in NGS read data analysis [88].
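The claim that a single base error creates up to k new k-mers is easy to check directly. The toy sketch below (illustrative sequences and error position only) counts how many k-mers appear in a read after one substitution that were not present in the error-free read:

    # Toy demonstration that one substitution error can create up to k
    # erroneous k-mers; the read and error position are arbitrary examples.
    def kmer_set(seq, k):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    k = 5
    true_read = "ACGTACGTTGCAACGGTTAC"
    true_kmers = kmer_set(true_read, k)

    # Introduce a single substitution in the middle of the read.
    pos = 10
    error_read = true_read[:pos] + "A" + true_read[pos + 1:]

    erroneous = kmer_set(error_read, k) - true_kmers
    print(len(erroneous))  # up to k (= 5) k-mers overlap the error and are new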
Current methods for k-mer counting involve data structures such as hash tables, suffix arrays, binary trees, and tries. If the size of the sequence dataset to count is modest, a simple hash table will suffice, where the key is the k-mer and the value is the corresponding count. However, there is an obvious obstacle: as the sequence dataset grows, the efficiency of counting with a simple hash table drops dramatically. Another k-mer counting tool, Tallymer, instead uses a suffix array data structure [64]. Admittedly it is generally more efficient than a simple hash table, but its memory requirement is still linear in the number of unique k-mers, so the method is not very scalable. For example, the size of a soil metagenomic data set for a single sample can already exceed 400 GB, even at limited sequencing depth [52]; read datasets of this size are difficult for Tallymer to handle. Jellyfish [78], another popular k-mer counting tool, uses an improved hash table data structure that reduces the memory needed to store k-mers, and the "lock-free" feature of the hash table enables parallelism that makes Jellyfish more efficient. But it still has the same problem as Tallymer: memory usage increases linearly with the number of unique k-mers in the data set to count. Many more k-mer counting software packages based on different data structures and algorithms have been developed in recent years, including BFCounter, DSK, KMC, Turtle and KAnalyze [85, 106, 30, 111, 4]. These software packages differ from one another in algorithmic trade-offs and functionality, especially in how they handle the trade-off between disk and memory usage, whether they enable online counting and retrieval, whether counts are exact, and other aspects. Table 2.1 summarizes most current k-mer counting packages with the functions and limitations of each.

Table 2.1  Description of k-mer counting packages (columns: built-in k-mer occurrence histogram / built-in specific k-mer count retrieval / API / multithreaded / online counting).

    BFCounter       Bloom filter, filters out low-abundance k-mers          N / N / N / Y / Y
    DSK             fixed-memory and fixed-disk-space streaming algorithm   Y / N / N / N / N
    Jellyfish       lock-free hash table                                    Y / Y / N / Y / Y
    KAnalyze        split to disk and merge                                 N / N / Y / Y / N
    khmer           Count-Min Sketch                                        Y / Y / Y / Y / Y
    KMC             parallel disk-based, similar to DSK                     N / N / N / Y / N
    MSPKmerCounter  Minimum Substring Partitioning                          Y / Y / N / Y / N
    Tallymer        enhanced suffix arrays                                  Y / Y / N / N / N
    Turtle          pattern-blocked Bloom filter, filters out low-abundance k-mers   N / N / N / Y / Y

2.2 Tackling large and error-prone short-read shotgun data sets

With the dramatic improvements in next generation sequencing (NGS) technologies and dropping sequencing costs, more research groups can afford to sequence samples of specific species, or even metagenomic samples with numerous different species, at large scale [86]. This leads to explosive growth in the amount of sequencing data to analyze. Important biological insights will be drawn from mining these large genomic data sets, but serious challenges need to be overcome to enable efficient and effective analysis. As with the k-mer counting problem, the first obstacle in analyzing large NGS data is the formidable size of the data resulting from high sequencing depth.
Higher sequencing depth is required to assemble a genome successfully using NGS short reads than using the longer reads from traditional Sanger sequencing technology. For example, to get decent coverage of the human genome with NGS short reads and a satisfying assembly, 100x sequencing depth is generally required, which leads to a 300 GB NGS read data set [39]. For samples with variable abundance of genomic content, such as transcriptomes or metagenomes, getting enough sequencing depth for the rarer genomic content greatly increases the overall sequencing depth, which leads to dramatically larger read data sets while the abundance of genomic content remains uneven.

As with the k-mer counting problem, sequencing errors cause significant trouble for the effective and efficient analysis of large NGS data. With a relatively high error rate of 0.1-1% in Illumina read data [86], the more sequencing we perform, the more sequencing errors we will have in the data, regardless of whether the sequencing depth is high or low. So for deep sequencing, most of the apparent novelty will be dominated by sequencing errors [24]. As discussed previously, one sequencing error generally introduces up to k erroneous k-mers, so many sequencing errors will undermine the effectiveness of much downstream analysis of such short read data.

Given these two characteristics of NGS short read data - large size and error-proneness - combined with the dropping cost of sequencing, we face a third challenge: the growth of our capacity to analyze data cannot catch up with our capacity to generate data [132]. It is difficult to manipulate and analyze such large genomic data without significant improvements in computational approaches.

The approaches to overcome these problems are, in a way, straightforward: to deal with the large size of the data, we try to decrease its size; to deal with the errors in the data, we try to remove or correct them. Assembly can be seen as a solution for reducing the size of large data, at the sacrifice of losing the abundance information of the genomic content. There has been significant theoretical progress in storing and analyzing big sequencing data efficiently [23, 123]. Based on this progress, a number of new assemblers have been developed, such as ABySS, Velvet, SOAPdenovo, ALLPATHS, SGA, and Cortex [125, 145, 70, 39, 124, 59]. For sequencing data with uneven abundance of genomic content, like metagenomes or transcriptomes, novel assemblers or third-party add-ons have been developed specifically, such as Trinity, Oases, MetaVelvet, MetaVelvetSL, Meta-IDBA, Velvet-SC, DIME, MEGAHIT, and Omega [41, 118, 93, 1, 97, 19, 43, 67, 47]. Assembly can also be seen as a way to decrease the error rate: there may still be errors in assembled contigs or genomes, but a large number of the errors located in the reads are discarded during assembly. To remove or correct errors directly from reads, k-mer spectral analysis is a popular approach [99]: low-abundance k-mers are identified and treated as likely errors, and those likely erroneous k-mers can be removed or trimmed from reads, or corrected using statistical learning methods [63]. More error removal and correction software packages have been developed [83, 14, 63]. Such error removal or correction approaches can be applied to preprocess read data before assembly.
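To make the k-mer spectral idea concrete, the following toy sketch (illustrative only, not any particular published tool) counts all k-mers in a set of reads and then flags reads that contain k-mers seen only once, which are likely to carry sequencing errors:

    # Toy sketch of k-mer spectral error detection: k-mers seen only once in
    # a data set are treated as likely sequencing errors. Reads and cutoff
    # are illustrative examples.
    from collections import Counter

    K = 5
    ABUNDANCE_CUTOFF = 2  # k-mers seen fewer than this many times are suspect

    def kmers(seq, k=K):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    reads = ["ACGTACGTTG", "ACGTACGTTG", "ACGTACCTTG"]  # third read has an error
    counts = Counter(km for r in reads for km in kmers(r))

    for r in reads:
        suspect = [km for km in kmers(r) if counts[km] < ABUNDANCE_CUTOFF]
        if suspect:
            print("likely error in", r, "at k-mers", suspect)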
As necessary components of NGS data analysis pipelines, assembly and error removal/correction have proved effective in enabling more accurate interrogation of large, error-prone short-read data. However, both assembly and error removal/correction are very compute intensive. The memory usage of assembly does not scale well with the size of the data to be assembled, and error removal/correction based on k-mer spectral analysis normally involves two passes over the data, which is also memory intensive and time consuming. Streaming and semi-streaming algorithms, which examine the data only once or less than twice and whose memory and time requirements scale well with the size of the input data, are promising candidates to be integrated with pre-assembly preprocessing and error removal/correction to achieve higher overall efficiency of NGS data analysis. In this dissertation, we will discuss our efforts in this direction.

2.3 Challenges in measuring diversity of metagenomics

2.3.1 Diversity measurement in microbial ecology

Numerous mature methods and tools for measuring the diversity of macroorganisms have emerged over decades of development of classic ecology. One might think that we simply need to borrow those methods for use in the microbial field. Unfortunately, in reality this is not the case. Microbial communities are very different from macroorganism communities, such as plant or animal communities, with numbers of species many orders of magnitude larger [139]. This fact raises serious sampling problems: it is extremely difficult to cover a sufficient fraction of the microbial community even with the impressively large sample sizes enabled by modern metagenomic approaches [109]. In a word, diversity measurement is a major challenge for microbial communities, and novel, effective methods are in high demand [114].

2.3.1.1 OTU identification using sequence markers

In borrowing methods of diversity measurement from classic ecology to evaluate microbial diversity, the first problem is that in the microbial world there is no unambiguous way to define "species" [131]; it is impossible to identify a microbial individual as a specific species morphologically. In metagenomics, the concept of "species" has in fact been replaced by OTUs (Operational Taxonomic Units). An OTU is a group of microbial individuals within a certain evolutionary distance. In practice we mainly use 16S rRNA genes as the evolutionary marker genes, because 16S rRNA genes exist universally among different microbial species and their sequences change at a rate corresponding to the evolutionary distance. So we can describe microbial individuals with higher than a certain percentage (e.g. 97%) of 16S rRNA sequence similarity as one OTU, or as belonging to one species [114].

2.3.1.2 Binning of metagenomic reads into OTUs

In classic ecology, when dealing with samples from macroorganism communities, it is standard procedure to identify the species of each individual in a sample before any statistical method is used to measure diversity. The same holds for diversity measurement of microbial communities; the difference is that here we need to place the sequences (individuals) into their respective "bins", or OTUs (species). There are two strategies for such binning: the composition-based, or intrinsic, binning approach and the similarity-based, or extrinsic, binning approach.

2.3.1.2.1 Composition-based approach

A great deal of effort has been put into building a comprehensive catalog of reference microbial genome sequences [55, 141].
Currently there are a large number of finished or high-quality reference sequences for thousands of microbial species available in different databases, and this number is still increasing quickly [79, 38, 138]. The first intrinsic, composition-based approach is therefore to use those reference genomes to train a taxonomic classifier and then use that classifier to place the metagenomic reads into bins. Different statistical approaches, such as Support Vector Machines [95], interpolated Markov models [7], naive Bayesian classifiers, and Growing Self-Organizing Maps [110], have been used to train such classifiers. Without using any reference sequences for training, it is also possible to use signatures like k-mer or codon-usage frequencies to develop reference-independent approaches. The assumption is that the frequency distributions of these signatures are similar for sequences from the same species. TETRA is such a reference-independent tool, using Markov models based on k-mer frequencies [133], and there is another tool that integrates TETRA with codon usage profiles to classify reads [136].

2.3.1.2.2 Similarity-based approach

The similarity-based, extrinsic approach is to find similarity between the read sequences and reference sequences; a tree can then be built using the similarity distance information. MEGAN [57] is a typical tool using this method, which reads BLAST output. Other sequence alignment tools, such as Bowtie2 or BWA, can also be used here. Recently, an alternative strategy was developed that aligns reads only against the most informative reference sequences, such as 16S rRNA genes or other specific marker genes, rather than against all reference sequences. The benefit is obvious: it is more time-efficient, since there are fewer reference sequences to align to, and it can provide better resolution and binning accuracy, since the marker genes can be selected carefully for the best distinguishing power. AMPHORA2 [142] and MetaPhlAn [119] are two typical tools using this strategy.

2.3.1.3 Statistics for diversity estimation

After the binning of sequences into OTUs, we need statistical analysis to help us estimate the diversity. Many statistical methods have been developed and widely used in the classical ecology of macroorganisms. However, the first difference between diversity measurement of macroorganisms and of microbial communities is that the true diversity of a microbial community is generally much larger than the observed sample diversity, owing to the high diversity of microbial communities and the limits of metagenomic sampling and sequencing.

The first approach, considered a classic, is rarefaction. A rarefaction curve can be used to compare observed richness among different samples that have been sampled unequally; it is basically a plot of the relationship between the number of observed species and the number of sampled individuals. It is worth noting that a rarefaction curve shows the observed diversity, not the total diversity, and we should not disregard the unseen microbial species, which are quite common in microbial community sampling. To estimate the total diversity from the observed diversity, different estimators are required. The first is extrapolation from the accumulation curve. The asymptote of this curve is the total diversity, the point at which the number of species no longer increases as more individuals are sampled. To obtain the value of that asymptote from the observed accumulation curve, a function needs to be assumed to fit the curve.
Several proposals have been made to use this extrapolation method [22, 40]. The problem is that if the sampling effort covers only a small fraction of the community, so that the accumulation curve is just beginning to rise, it is difficult to find an optimal function to fit the curve. Different functions may fit the curve equally well yet yield dramatically different asymptote values, so this extrapolation method should be used cautiously.

Another option is a parametric estimator, which assumes that the relative abundances follow a particular distribution. The number of species unobserved in the community can then be estimated by fitting the observed sample data to that abundance distribution, from which the total number of species in the community is estimated. A lognormal abundance distribution is used in most projects, since most communities of macroorganisms have a lognormal abundance distribution and it is believed to be typical of some microbial communities as well [29, 115, 105]. Understandably, it is difficult to know which model fits a community best: ideally the abundance distribution should be inferred from the data, not assumed without verification, but the abundance distribution can only be inferred accurately when the sample size is large enough. There have been some recent attempts in this direction [35], and more robust methods are still needed.

If the species abundance distribution cannot be inferred, we can still use nonparametric estimators to estimate the total diversity without assuming an abundance distribution arbitrarily. These estimators are related to mark-release-recapture (MRR) statistics, which compare the number of species observed more than once with the number of species observed only once. If the current sampling covers only a small fraction of a diverse community, most species will be observed only once; if the current sampling is enough to cover most species in the community, the opposite will be the case. The series of estimators introduced by Chao are the representative estimators in this category, including Chao1 [15], Chao2 [16], ACE [17] and ICE [66]. For example, the Chao1 formula is:

    S_Chao1 = S_obs + n_1^2 / (2 n_2)

where S_obs is the number of species observed, n_1 the number of species observed once (singletons, with only one individual), and n_2 the number of species observed twice (doubletons, with exactly two individuals) in the sample. The ACE estimator uses data from all species rather than just singletons and doubletons. Its formula is:

    S_ACE = S_abund + S_rare / C_ACE + (F_1 / C_ACE) * gamma^2_ACE

where S_rare is the number of rare species (with fewer than 10 observed individuals), S_abund is the number of abundant species (with more than 10 observed individuals), F_1 is the number of singletons, C_ACE is the estimated sample coverage of the rare group, and gamma^2_ACE is the estimated coefficient of variation of the rare-species abundances.
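To make these estimators concrete, the following is a minimal illustrative sketch of Chao1 and ACE computed from a list of per-OTU abundances; it is not taken from any particular software package, follows the common convention of treating OTUs with ten or fewer individuals as "rare" for ACE, and omits edge-case handling (for example, a rare group consisting solely of singletons).

    from collections import Counter

    def chao1(otu_counts):
        # Chao1: S_obs + n1^2 / (2 * n2); falls back to the bias-corrected
        # form n1 * (n1 - 1) / 2 when there are no doubletons.
        counts = [c for c in otu_counts if c > 0]
        s_obs = len(counts)
        n1 = sum(1 for c in counts if c == 1)   # singletons
        n2 = sum(1 for c in counts if c == 2)   # doubletons
        if n2 == 0:
            return s_obs + n1 * (n1 - 1) / 2.0
        return s_obs + n1 * n1 / (2.0 * n2)

    def ace(otu_counts, rare_cutoff=10):
        # ACE: S_abund + S_rare/C_ace + (F1/C_ace) * gamma^2, where C_ace is
        # the sample-coverage estimate of the rare group and gamma^2 its
        # estimated coefficient of variation.
        rare = [c for c in otu_counts if 0 < c <= rare_cutoff]
        s_abund = sum(1 for c in otu_counts if c > rare_cutoff)
        s_rare = len(rare)
        n_rare = sum(rare)
        f = Counter(rare)                       # f[i] = number of OTUs seen i times
        c_ace = 1.0 - f[1] / float(n_rare)
        gamma2 = max((s_rare / c_ace)
                     * sum(i * (i - 1) * f[i] for i in range(1, rare_cutoff + 1))
                     / (n_rare * (n_rare - 1.0)) - 1.0, 0.0)
        return s_abund + s_rare / c_ace + (f[1] / c_ace) * gamma2

    # toy example: six observed OTUs, three of them singletons
    abundances = [5, 3, 1, 1, 1, 2]
    print(chao1(abundances))   # 10.5
    print(ace(abundances))     # ~9.4

Because singletons dominate this toy sample, both estimators predict that more OTUs remain unobserved than were seen, which is exactly the behavior described above for undersampled, diverse communities.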
In past years, several software packages have been developed for biodiversity analysis. Among them, EstimateS [21] is a general-purpose diversity analysis package that implements a rich set of diversity analysis algorithms; however, it is not designed specifically for microbial diversity analysis, so microbial diversity data must be preprocessed into general population data before being fed into EstimateS. Two other packages, MOTHUR [116] and QIIME [13], are designed for microbial diversity and are therefore more popular in microbial diversity analysis. CatchAll [11] is a relatively new package that can estimate diversity using both nonparametric and parametric estimators, including many variants, and reports the results of the different estimators together with their respective credibility.

Chapter 3

Efficient online k-mer counting using a probabilistic data structure

3.1 Introduction

Our motivation for exploring efficient k-mer counting comes from our work with metagenomic data, where we routinely encounter data sets that contain 300 × 10^9 bases of DNA and over 50 billion distinct k-mers [51]. K-mer counting plays a key role in our initial investigation of using distinct k-mers to measure microbial diversity: we needed to count the number of distinct k-mers in different metagenomic data sets and obtain their abundance distributions. In the beginning we used an existing k-mer counting tool, Tallymer [64]. However, as we moved to larger metagenomic data sets, it was not efficient enough, and some data sets it could not handle at all. Also, to efficiently filter, partition, and assemble these data, we need to store counts for each of these k-mers in main memory, and query and update them in real time — a set of functionality not readily offered by current packages. Moreover, we wish to enable the use of cloud and desktop computers, which may have poor I/O performance or limited memory. These needs have dictated our exploration of efficient in-memory k-mer counting techniques.

In this chapter we present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored.

We use the Amazon cloud to compare time, memory, and disk usage of our k-mer counting implementation with that of other k-mer counting software packages, for two problems. First, we generate a k-mer abundance distribution for large data sets; and second, we query many individual k-mer counts at random from a previously constructed k-mer count database. We show that khmer is competitive in speed, memory, and disk usage for these problems. We also analyze the effects of counting error on calculations of the k-mer count in sequencing data sets, and in particular on metagenomic data sets. This chapter contains published materials from [148].

3.2 Count-Min Sketch and its application in k-mer counting

Below, we describe an implementation of a simple probabilistic data structure for k-mer counting. This data structure is based on a Count-Min Sketch [26], a generalized probabilistic data structure for storing the frequency distributions of distinct elements. Probabilistic approaches can be particularly memory efficient for certain problems, with memory usage significantly lower than that of any exact data structure [96].
However, their use introduces set membership or counting false positives, which have effects that must be analyzed in the context of specific problems. Moreover, unlike existing techniques, the Count-Min Sketch stores only counts; k-mers must be retrieved from the original data set. In exchange, the low memory footprint enabled by this probabilistic approach permits online updating and retrieval of k-mer counts entirely in memory, which in turn supports streaming applications such as digital normalization, which will be discussed in the next chapter. Because of the probabilistic characteristics of this approach, we will also discuss the choice of optimal parameters to balance the efficiency and accuracy of k-mer counting.

3.2.1 Implementing a Count-Min Sketch for k-mers

The two basic operations supported by khmer are increment_count(kmer) and c = get_count(kmer). Both operate on the data structure in memory, such that neither incrementing a count nor retrieving a count involves disk access. The implementation details are similar to those of the Bloom filter in [96], but with the use of 8-bit counters instead of 1-bit counters. Briefly, Z hash tables are allocated, each with a different size of approximately H bytes (H_1, H_2, ..., H_Z); the sum of these hash table sizes must fit within available main memory. To increment the count for a particular k-mer, a single hash is computed for the k-mer, and the modulus of that hash with each hash table's size H gives the location within each hash table; the associated count in each hash table is then incremented by 1. We use different sizes for each hash table so as to vary the hash function: even if two k-mers have the same modulus in one hash table (a collision), they are unlikely to collide in the other hash tables. To retrieve the count for a k-mer, the same hash is computed and the minimum count across all hash tables is taken. While different in implementation detail from the standard Bloom filter, which uses a single hash table with many hash functions, the performance details are identical [96].

One particularly important feature of the Count-Min Sketch is that the counting error is one-sided [26]. Because counts are only incremented, collisions result in inflated miscounts; if there is no collision for a particular k-mer, the count is correct. An additional benefit of the Count-Min Sketch is that it is extremely easy to implement correctly, needing only about three dozen lines of C++ code for a simple threadsafe implementation. (We have described how khmer scales with multiple threads in [82].)

To determine the expected false positive rate — the average frequency with which a given k-mer count will be incorrect when retrieved — we can look at the hash table load. Suppose N distinct k-mers have been counted using Z hash tables, each with size H. The probability that no collisions happened in a specific entry in one hash table is (1 − 1/H)^N, or approximately e^(−N/H). The individual collision rate in one hash table is then ≈ 1 − e^(−N/H). The total collision rate, which is the probability that a collision occurred in the entry where a k-mer maps in every one of the Z hash tables, is ≈ (1 − e^(−N/H))^Z, which is also the expected false positive rate.

While the false positive rate can easily be calculated from the hash table load, the average miscount — the degree to which the measured count differs from the true count — depends on the k-mer frequency distribution, which must be determined empirically. We analyze the effects of this below.
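As an illustration of the data structure just described (emphatically a simplified sketch in Python, not khmer's C++ implementation), the following code allocates Z byte arrays of different sizes, hashes each k-mer once, increments the corresponding entry in every table, and reports the minimum across tables on retrieval; the 8-bit counters saturate at 255. The built-in hash() is used as a stand-in for a proper DNA hash function, and the table sizes are arbitrary choices.

    import math

    class CountMinKmerCounter(object):
        def __init__(self, table_sizes):
            # Z tables of different sizes; total memory is roughly sum(table_sizes) bytes.
            self.table_sizes = table_sizes
            self.tables = [bytearray(size) for size in table_sizes]

        def increment_count(self, kmer):
            h = hash(kmer)                     # stand-in for a real DNA hash
            for table, size in zip(self.tables, self.table_sizes):
                i = h % size
                if table[i] < 255:             # 8-bit counters saturate at 255
                    table[i] += 1

        def get_count(self, kmer):
            h = hash(kmer)
            return min(table[h % size]
                       for table, size in zip(self.tables, self.table_sizes))

    def expected_fp_rate(n_distinct, table_sizes):
        # f ~ (1 - e^(-N/H))^Z: the product of the per-table collision rates.
        f = 1.0
        for size in table_sizes:
            f *= 1.0 - math.exp(-float(n_distinct) / size)
        return f

    # toy usage with three tables of roughly one million entries and a 22-mer
    counter = CountMinKmerCounter([1000003, 1000033, 1000037])
    for _ in range(3):
        counter.increment_count("ACGTACGTACGTACGTACGTAC")
    print(counter.get_count("ACGTACGTACGTACGTACGTAC"))            # 3, or more if collisions occur
    print(expected_fp_rate(2 * 10**6, [1000003, 1000033, 1000037]))

The expected_fp_rate helper evaluates the load-based estimate derived above, so the memory/accuracy trade-off can be checked before counting begins.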
3.2.2 Choosing the number and size of hash tables used for k-mer counting

The false positive rate depends on the number of distinct k-mers N, the number of hash tables Z, and the size of the hash tables H: f ≈ (1 − e^(−N/H))^Z, with an associated memory usage of M = HZ. We face two common scenarios: one in which we have a fixed number of k-mers N and fixed memory M and want to calculate the optimal number of hash tables Z; and one in which we have a desired maximum false positive rate f and a fixed number of k-mers N, and want to calculate the minimum memory usage required to achieve f.

For fixed memory M and number of distinct k-mers N, the optimal number of hash tables can be found by minimizing f; taking the derivative df/dZ, with f ≈ exp(Z log(1 − e^(−ZN/M))), and solving for 0, we find that f is minimized when Z = log(2) × (M/N) (see [8] for details).

Given a desired false positive rate f and a fixed number of k-mers N, the optimal memory usage can be calculated as follows. First, the optimal number of hash tables is determined by the expected false positive rate alone: Z = log_0.5(f). Using this Z, the minimum average hash table size H necessary to achieve f can be calculated as H = (log_0.6185(f) × N)/Z (see [8] for details).

A remaining problem is that the number of distinct k-mers N is typically not known. However, memory- and time-efficient algorithms for estimating N do exist, and we have implemented one in khmer [34].

3.3 khmer can count k-mers efficiently

3.3.1 khmer is a generally useful k-mer counting approach

Data set          Size of file (GB)   Number of reads   Number of distinct k-mers   Total number of k-mers
subset 1          1.90                9,744,399         561,178,082                 630,207,985
subset 2          2.17                19,488,798        1,060,354,144               1,259,079,821
subset 3          3.14                29,233,197        1,445,923,389               1,771,614,378
subset 4          4.05                38,977,596        1,770,589,216               2,227,756,662
entire data set   5.00                48,721,995        2,121,474,237               2,743,130,683

Table 3.1 Benchmark soil metagenome data sets for k-mer counting performance, taken from [51].

Figure 3.1 Comparison of the time it takes for k-mer counting tools (khmer at a 1% false positive rate, Tallymer, Jellyfish, DSK, KMC, BFCounter, scTurtle, and KAnalyze) to calculate k-mer abundance histograms, with time (y axis, in seconds) against data set size (in number of reads, x axis). All programs executed in time approximately linear with the number of input reads.

Figure 3.2 Memory usage of k-mer counting tools when calculating k-mer abundance histograms, with maximum resident program size (y axis, in GB) plotted against the total number of distinct k-mers in the data set (x axis, billions of k-mers).

We measured the time and memory required to calculate k-mer abundance histograms in five soil metagenomic read data sets using khmer, Tallymer, Jellyfish, DSK, KMC, Turtle, and KAnalyze (Table 3.1; Figures 3.1 and 3.2). We chose to benchmark abundance histograms because this functionality is common to all the software packages and is a common analysis approach for determining assembly parameters [18]. We applied each package to increasingly large subsets of a 50m-read soil metagenome data set [51].
For the BFCounter, KMC, Turtle and KAnalyze packages, which do not generate a k-mer abundance distribution directly, we output the frequency of each k-mer to a file but do no further analysis.

khmer offers a range of useful performance tradeoffs between disk I/O, time, and memory. The performance comparison with other k-mer counting packages in calculating k-mer abundance distributions shows that khmer is comparable with existing packages. Figure 3.1 shows that the time usage of the khmer approach is comparable to DSK and BFCounter and, as expected, increases linearly with data set size. Tallymer is the slowest of the tools in this testing, while KMC, Turtle, and Jellyfish are the fastest. From Figure 3.2, we see that the memory usage of Jellyfish, Tallymer, BFCounter, and Turtle increases linearly with data set size. Tallymer generally uses more memory than Jellyfish, while BFCounter and Turtle have considerably lower memory usage. DSK, KMC, and KAnalyze use constant memory across the data sets, but at the cost of more limited functionality (discussed below).

We also measured disk usage during counting. Figure 3.3 shows that disk usage also increases linearly with the number of k-mers in the data set. For a high-diversity metagenomic data set of 5 GB, the disk usage of both Jellyfish and Tallymer is around 30 GB. khmer counts k-mers entirely in working memory and does not rely on any on-disk storage to store or retrieve k-mer counts, although for practicality the hash tables can be saved for later reuse; the uncompressed disk usage for khmer in Figure 3.3 is the same as its memory. At the expense of more time, khmer supports saving and loading gzip-compressed hash tables, which are competitive in size with DSK's on-disk database (Figure 3.3, dashed line).

Figure 3.3 Disk storage usage of different k-mer counting tools to calculate k-mer abundance histograms in GB (y axis), plotted against the number of distinct k-mers in the data set (x axis). Note that khmer does not use the disk during counting or retrieval, although its hash tables can be saved for reuse.

3.3.2 khmer memory usage is fixed and low

The memory usage of the basic Count-Min Sketch approach is fixed: khmer's memory usage does not increase as data is loaded. While this means that khmer will never crash due to memory limitations, and all operations can be performed in main memory without recourse to disk storage, the false positive rate may grow too high. Therefore the memory size must be chosen in light of the false positive rate and miscount acceptable for a given application. In practice, we recommend choosing the maximum available memory, because the false positive rate decreases with increasing memory and there are no negative effects to minimizing the false positive rate.

For any given data set, the size and number of hash tables determine the accuracy of k-mer counting with khmer; thus, the user can control the memory usage based on the desired level of accuracy (Figure 3.2). The time usage for the first step of k-mer counting, consuming the reads, depends on the total amount of data, since we must traverse every k-mer in every read.
The second step, k-mer retrieval, is algorithmically constant for fixed k; however, for practicality, the hash tables are usually saved to and loaded from disk, meaning that k-mer retrieval time depends directly on the size of the database being queried.

The memory usage of khmer is particularly low for sparse data sets, especially since only main memory is used and no disk space is necessary beyond that required for the read data sets. This is no surprise: the information-theoretic comparison in [96] shows that, for sparse sequencing data sets, Bloom filters require considerably less memory than any possible exact information storage for a wide range of false positive rates and data set sparseness.

In our implementation we use 1 byte to store the count of each k-mer in the data structure, so the maximum count for a k-mer is 255. In cases where tracking bigger counts is required, khmer also provides an option to use an STL map data structure to store counts above 255, at the cost of significantly higher memory usage. In the future, we may extend khmer to counters of arbitrary bit sizes.

The memory usage of khmer also increases linearly with data set size as long as we hold the false positive rate constant. However, the memory usage of khmer varies substantially with the desired false positive rate: we can decrease the memory usage by increasing the false positive rate, as shown in Figure 3.2. We also see that with a low false positive rate of 1%, the memory usage is competitive with Tallymer and Jellyfish; with a higher 5% false positive rate, the memory usage is lower than all but the disk-based DSK; and with a false positive rate as high as 20%, the memory usage is lower still, close to DSK, KAnalyze, and KMC.

3.3.3 khmer accesses k-mer counts efficiently

We measured the time it took to access 9.7m 22-mers across five different data sets after the initial databases had been built (Figure 3.4). Note that Tallymer, Jellyfish, and khmer all support random access to k-mer counts, while BFCounter, DSK, KMC, Turtle and KAnalyze do not. Here, khmer performed well, dramatically outperforming Jellyfish and Tallymer. In all three cases, system time dominated the overall time required to retrieve k-mers, suggesting that the primary reason for the increase in retrieval time was the increased size of the database on disk (data not shown). In particular, khmer's retrieval time is independent of the size of the database once the hash tables are loaded into memory.

Figure 3.4 Time for several k-mer counting tools to retrieve the counts of 9.7m randomly chosen k-mers (y axis), plotted against the number of distinct k-mers in the data set being queried (x axis). BFCounter, DSK, Turtle, KAnalyze, and KMC do not support this functionality.

This highly memory- and time-efficient online counting is particularly important for the streaming approaches to data analysis needed as data set sizes increase, such as digital normalization, which will be discussed in the next chapter, and the IGS-based method to analyze microbial diversity, which will be discussed in later chapters. Because querying and updating of k-mer counts can be done directly as data is being loaded, with no need for disk access or an indexing step, khmer can also perform well in situations with poor disk I/O performance.
(Note that BFCounter also supports online k-mer counting [85].)

3.4 False positive rates in k-mer counting are low and predictable

The Count-Min Sketch is a probabilistic data structure with a one-sided error that results in random overestimates of k-mer frequency, but does not generate underestimates. Next we will discuss the characteristics of this counting inaccuracy and its influence on real-world applications of khmer for k-mer counting.

3.4.1 The measured counting error is low on short-read data

Due to the use of the Count-Min Sketch and its lack of collision tracking, khmer will report some incorrect counts for k-mers; these counts are always higher than the true counts, up to the bound of 255 (a limit imposed by our use of 8-bit counters). In the Count-Min Sketch, the total memory usage is fixed; the memory usage, the hash functions, and the total number of distinct objects counted all influence the accuracy of the count. While the probability of an inaccurate count can easily be estimated based on the hash table load, the miscount size depends on details of the frequency distribution of k-mers [26]. More specifically, in the analysis of the Count-Min Sketch, the difference between the incorrect count and the actual count is related to the total number of k-mers in a data set and the size of each hash table [26]. Further study has shown that the behavior of the Count-Min Sketch depends on specific characteristics of the data set under consideration, such as the distribution of k-mer abundances [112, 27]. In general, the average miscount will be small if the data is left-skewed. As noted by Melsted and Pritchard, a large number of k-mers in short-read data are low-abundance, leading to precisely the skew that would yield low miscounts [85]. Here we use both real and simulated data sets (Table 3.2) to evaluate the counting performance in practice.

Data set                                              Size of data set file   Number of total k-mers   Number of distinct k-mers
Real metagenomics reads                               7.01M                   2,917,200                1,944,996
Totally random reads with randomly generated k-mers   3.53M                   2,250,006                1,973,059
Simulated reads from simulated genome with error      5.92M                   3,757,479                2,133,592
Simulated reads from simulated genome without error   9.07M                   5,714,973                1,989,644
Real E. coli reads                                    4.85M                   4,004,911                2,079,302

Table 3.2 Data sets used for analyzing miscounts.

Figure 3.5 shows the relationship between average miscount and counting false positive rate for five different test data sets with similar numbers of distinct k-mers: one metagenome data set; a simulated set of random k-mers; a simulated set of reads, chosen with 3x coverage and 1% error; a simulated set of reads (3x) with no error; and a set of E. coli reads (Table 3.2).

Figure 3.5 Relation between average miscount — the amount by which the count for k-mers is incorrect — on the y axis, plotted against false positive rate (x axis), for five data sets. The five data sets were chosen to have the same total number of distinct k-mers: one metagenome data set; a set of randomly generated k-mers; a set of reads, chosen with 3x coverage and 1% error, from a randomly generated genome; a simulated set of error-free reads (3x) chosen from a randomly generated genome; and a set of E. coli reads.
Even when the counting false positive rate is as high as 0.9 — where 90% of k-mers have an incorrect count — the average miscount is still below 4.

We separately analyzed the average percentage miscount between true and false k-mers; e.g., a miscount of 4 for a k-mer whose true count is 1 would be 400%. Figure 3.6 shows the relationship between average percentage miscount and counting false positive rate for the same five data sets as in Figure 3.5. For a false positive rate of 0.1 (10% of k-mer counts are incorrect), the average percentage miscount is less than 10% for all five data sets; this will of course generally be true, because the average miscount is bounded by the product of the false positive rate with the k-mer abundance.

Figure 3.6 Relation between percent miscount — the amount by which the count for a k-mer is incorrect relative to its true count — on the y axis, plotted against false positive rate (x axis), for five data sets. The five data sets are the same as in Figure 3.5.

We see here that for a fixed false positive rate, the simulated reads without error have the highest average miscount, and the randomly generated k-mers have the lowest average miscount. This is because these two abundance distributions have the least and most left-skew, respectively: the simulated reads without error have no abundance-1 k-mers, while the randomly generated k-mers are entirely low abundance. Thus, this counting approach is especially suitable for high-diversity data sets, such as metagenomic data, in which a larger proportion of k-mers are low abundance or unique due to sequencing errors.
3.4.2 Real-world applications of khmer

khmer has been widely used by many research groups to solve different bioinformatics problems, and it is the foundation of all the work that will be discussed later in this thesis; we will show real-world applications of khmer extensively in the following chapters.

For many applications, an approximate k-mer count is sufficient. For example, when eliminating reads with low-abundance k-mers, we can tolerate a certain number of low-frequency k-mers falsely remaining in the resulting data set. If RAM-limited, we can do the filtering iteratively, so that at each step we make more effective use of the available memory. In practice, we have found that a false positive rate of between 1% and 10% offers acceptable miscount performance for a wide range of tasks, including error profiling, digital normalization and low-abundance read trimming. Somewhat surprisingly, false positive rates of up to 80% can still be used for both read trimming and digital normalization in memory-limited circumstances, although multiple passes across the data may be needed. For many applications, the fact that khmer does not break an imposed memory bound is extremely useful, since for many data sets — especially metagenomic data sets — high memory demands constrain analysis [51, 73]. Moreover, because the false positive rate is straightforward to measure, the user can be warned that the results should be invalidated when too little memory is used. When combined with the graceful degradation of performance for both error trimming and digital normalization, khmer readily enables analysis of extremely large and diverse data sets [53]. In an experiment to assemble the reads of a soil metagenomic sample collected from Iowa prairie, the number of reads to assemble dropped from 3.3 million to 2.2 million and the size of the data set dropped accordingly from 245 GB to 145 GB after digital normalization, using 240 GB of memory in the process [51]. This also shows that khmer works well for analyzing large, real-world metagenomic data sets.

3.5 Conclusion

K-mer counting is widely used in bioinformatics, and as sequencing data set sizes increase, graceful degradation of data structures in the face of large amounts of data has become important. This is especially true when the theoretical and practical effects of the degradation can be predicted (see e.g. [85, 96, 111]). This is a key property of the Count-Min Sketch approach and of its implementation in khmer. The khmer software implementation offers good performance, a robust and well-tested Python API, and a number of useful and well-documented scripts. While Jellyfish, DSK, KMC, and Turtle also offer good performance, khmer is competitive and, because it provides a Python API for online counting, flexible. In memory-limited situations with poor I/O performance, khmer is particularly useful, because it will not break an imposed memory bound and does not require disk access to store or retrieve k-mer counts. However, in exchange for this memory guarantee, counting becomes increasingly incorrect as less memory is used or as the data set size grows; in many situations this may be an acceptable tradeoff.

3.6 Data

3.6.1 Code and data set availability

The khmer software [28] is implemented in C++ with a Python wrapper, enabling flexible use and reuse by users with a wide range of computational expertise.
The software package is freely available for academic and commercial use and redistribution under the BSD license at github.com/ged-lab/khmer/. khmer comes with substantial documentation and many tutorials, and contains extensive unit tests. Moreover, we have built several applications on top of khmer, including memory-efficient de Bruijn graph partitioning [96] and lossy compression of short-read data sets for assembly [10]. The version of khmer used to generate the results in this chapter is available at http://github.com/ged-lab/khmer.git, tag ‘2013-khmer-counting’. Scripts specific to this paper are available in the paper repository at https://github.com/ged-lab/2013-khmer-counting. The IPython [98] notebook file and data analysis to generate the figures are also available in that github repository. Complete instructions to reproduce all of the results in this paper are available in the khmer-counting repository; see README.rst.

3.6.2 Sequence data

One human gut metagenome read data set (MH0001) from the MetaHIT (Metagenomics of the Human Intestinal Tract) project [103] was used. It contains approximately 59 million reads, each 44 bp long; it was trimmed to remove low-quality sequences. Five soil metagenomic read data sets of different sizes were taken from the GPGC project for benchmarking purposes (see Table 3.1). These reads are from soil in the Iowa region and were filtered to ensure that each read contains fewer than 30% Ns and is longer than 30 bp. The exact data sets used for the paper are available on Amazon S3, and the instructions to acquire these data sets are available in the paper repository on github.com.

We also generated four short-read data sets to assess the false positive rate and miscount distribution. One is a subset of a real metagenomics data set, the MH0001 data set above. The second consists of randomly generated reads. The third and fourth contain reads simulated from a random, 1 Mbp long genome; the third has a substitution error rate of 3%, and the fourth contains no errors. The four data sets were chosen to contain identical numbers of distinct 22-mers. The scripts necessary to regenerate these data are available in the paper repository on github.com.

Chapter 4

A framework for streaming analysis of short DNA sequencing reads based on k-mer counting

4.1 Introduction

In the previous chapter, we introduced an efficient k-mer counting approach based on a probabilistic data structure. In this chapter, we discuss applications of this approach that enable streaming analysis of short DNA sequencing reads. First, we show a novel approach that uses the median k-mer count in a read to estimate sequencing depth without a reference assembly. Next, based on this approach, two streaming methods that are critically important in next-generation sequencing data analysis are discussed. One is a single-pass method to eliminate redundant reads in data sets to reduce the computational cost of downstream analysis such as assembly, termed "digital normalization". The other is a method to analyze and trim sequencing errors in short-read data sets in a semi-streaming or streaming fashion. The approach of using the median k-mer count to estimate the sequencing depth of a read is also the foundation of the IGS-based diversity analysis approach we will discuss in the next chapter.
The streaming methods for removing redundant reads and for analyzing and trimming sequencing errors are not directly related to the IGS-based diversity analysis method; nevertheless, their application to facilitate the assembly of metagenomic reads and to remove sequencing errors from read data sets benefits many bioinformatics approaches, including microbial diversity analysis. This chapter contains published materials from [148], [10] and [147].

4.2 Estimating sequencing depth without a reference assembly

Short-read assembly requires deep sequencing to systematically sample the source genome, because shotgun sequencing is subject to both random sampling variation and systematic sequencing biases. For example, 100x sampling of a human genome is required for recovery of 90% or more of the genome in contigs > 1 kb [39]. In principle, much of this high-coverage data is redundant and could be eliminated without consequence to the final assembly. However, determining which reads to eliminate requires a per-read estimate of coverage. Traditional approaches estimate coverage by mapping reads to an assembly. This presents a chicken-and-egg problem: to determine which regions are oversampled, we must already have an assembly!

We may calculate a reference-free estimate of genome coverage by looking at the k-mer abundance distribution within individual reads. First, observe that k-mers, DNA words of a fixed length k, tend to have similar abundances within a read: this is a well-known property of k-mers that stems from each read originating from a single source molecule of DNA. The more times a region is sequenced, the higher the abundance of k-mers from that region would be. In the absence of errors, average k-mer abundance could be used as an estimate of the depth of coverage for a particular read (Figure 4.1, “no errors” line). However, when reads contain random substitution or indel errors from sequencing, the k-mers overlapping these errors will be of lower abundance; this feature is often used in k-mer-based error correction approaches [63]. For example, a single substitution introduces k low-abundance k-mers within a read (Figure 4.1, “single substitution error” line). However, for small k and reads of length L where L > 3k − 1, a single substitution error will not skew the median k-mer abundance. Only when multiple substitution errors are found in a single read will the median k-mer abundance be affected (Figure 4.1, “multiple substitution errors”). The effect of multiple errors on the median k-mer abundance will be discussed in detail in the next chapter.

Figure 4.1 Representative rank-abundance distributions for 20-mers from 100-base reads with no errors, a read with a single substitution error, and a read with multiple substitution errors.

Using a fixed-memory Count-Min Sketch data structure to count k-mers (see [25]), we find that median k-mer abundance correlates well with mapping-based coverage for artificial and real genomic data sets.
There is a strong correlation between median k-mer abundance and mapping-based coverage, both for simulated 100-base reads generated with 1% error from a 400 kb artificial genome sequence (r^2 = 0.79; see Figure 4.2a) and for real short-read data from E. coli (r^2 = 0.80; see Figure 4.2b). This correlation also holds for simulated and real mRNAseq data: for simulated transcriptome data, r^2 = 0.93 (Figure 4.3a), while for real mouse transcriptome data, r^2 = 0.90 (Figure 4.3b). Thus the median k-mer abundance of a read correlates well with mapping-based estimates of read coverage.

Figure 4.2 Mapping and k-mer coverage measures correlate for simulated genome data and a real E. coli data set (5m reads). Simulated data r^2 = 0.79; E. coli r^2 = 0.80.

Figure 4.3 Mapping and k-mer coverage measures correlate for simulated transcriptome data as well as real mouse transcriptome data. Simulated data r^2 = 0.93; mouse transcriptome r^2 = 0.90.

The coverage at the read level estimated from the median k-mer count of a read is always smaller than the mapping-based estimate of read coverage, which is essentially the coverage at the nucleotide level. The read-level coverage can be converted into true sequencing depth (coverage at the nucleotide level): for a read to cover a k-mer, every nucleotide in that k-mer must be covered by the read, so if the coverage at the nucleotide level is C_N, the coverage of a k-mer in the genome will be approximately C_N × (L − k + 1)/L, where L is the read length [63]. Figures 4.2 and 4.3 show this relationship clearly. The difference between coverage at the read level and at the nucleotide level is important in the IGS-based diversity analysis method, which will be discussed in more detail in the next chapter. For the streaming methods for analyzing short-read data discussed in this chapter, "coverage" means the read-level coverage estimated from the median k-mer abundance in a read.
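A minimal sketch of this estimator follows. It assumes a k-mer counting object exposing a get_count(kmer) method (for example, a Count-Min Sketch like the one sketched in Chapter 3); the names are illustrative rather than khmer's API, and reads shorter than k are not handled.

    def median_kmer_coverage(read, k, counts):
        # Median count over all k-mers in the read; counts is any k-mer
        # count table exposing get_count(kmer).
        kmer_counts = sorted(counts.get_count(read[i:i + k])
                             for i in range(len(read) - k + 1))
        return kmer_counts[len(kmer_counts) // 2]

    def nucleotide_coverage(read_coverage, read_length, k):
        # Convert the read-level estimate to nucleotide-level depth using
        # C_read ~ C_N * (L - k + 1) / L from the discussion above.
        return read_coverage * read_length / float(read_length - k + 1)

    # e.g. for 100-base reads at k=20, a median k-mer count of 16.2
    # corresponds to a nucleotide-level sequencing depth of about 20x.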
4.3 A streaming algorithm to digitally normalize the coverage distribution of data sets

Below, we introduce “digital normalization”, a single-pass lossy compression algorithm for the elimination of redundant reads in data sets, based on saturating coverage of a de Bruijn graph. While several non-streaming implementations exist, including Trinity's in silico normalization [45, 9], digital normalization can be efficiently implemented as a streaming algorithm. Critically, no reference sequence is needed to apply digital normalization.

Digital normalization is inspired by experimental normalization techniques developed for cDNA library preparation, in which hybridization kinetics are exploited to reduce the copy number of abundant transcripts prior to sequencing [6, 127]. Digital normalization works after sequencing data has been generated, progressively removing high-coverage reads from shotgun data sets. This normalizes average coverage to a specified value, reducing sampling variation while removing reads, and also removing the many errors contained within those reads. This data and error reduction results in dramatically decreased computational requirements for de novo assembly, enabling low-memory preprocessing of high-coverage genomic data sets as well as mRNAseq or metagenomic data sets with high-coverage components [10, 51]. Moreover, unlike experimental normalization, where abundance information is removed prior to sequencing, in digital normalization this information can be recovered from the unnormalized reads.

We present here a fixed-memory implementation of digital normalization that operates in time linear with the size of the input data. We then demonstrate its effectiveness for reducing computational requirements for de novo assembly on several real data sets. These data sets include E. coli genomic data, data from two single-cell MD-amplified microbial genomes, and yeast and mouse mRNAseq.

4.3.1 Eliminating redundant reads reduces variation in sequencing depth

Deeply sequenced genomes contain many highly covered loci. For example, in a human genome sequenced to 100x average coverage, we would expect 50% or more of the reads to have a coverage greater than 100. In practice, we need many fewer of these reads to assemble the source locus. Using the median k-mer abundance estimator discussed above, we can examine each read in the data set progressively to determine whether it is high coverage. At the beginning of a shotgun data set, we would expect many reads to be entirely novel and have a low estimated coverage. As we proceed through the data set, however, average coverage will increase and many reads will come from loci that we have already sampled sufficiently.

Suppose we choose a coverage threshold C past which we no longer wish to collect reads. If we only keep reads whose estimated coverage is less than C, and discard the rest, we will reduce the average coverage of the data set to C. This procedure is algorithmically straightforward to execute: we examine each read's estimated coverage, and retain only those whose coverage is less than C. The following pseudocode provides one approach:

    for read in dataset:
        if estimated_coverage(read) < C:
            accept(read)
        else:
            discard(read)

where accepted reads contribute to the estimated coverage function. Note that for any data set with an average coverage > 2C, this has the effect of discarding the majority of reads. Critically, low-coverage reads, especially reads from undersampled regions, will always be retained.

The net effect of this procedure, which we call digital normalization, is to normalize the coverage distribution of data sets. In Figure 4.4a, we display the estimated coverage of an E. coli genomic data set, a S. aureus single-cell MD-amplified data set, and an MD-amplified data set from an uncultured Deltaproteobacteria, calculated by mapping reads to the known or assembled reference genomes (see [19] for the data source). The wide variation in coverage for the two MDA data sets is due to the amplification procedure [130]. After normalizing to a k-mer coverage of 20, the high-coverage loci are systematically shifted to an average mapping coverage of 26, while lower-coverage loci remain at their previous coverage. This smooths out the coverage of the overall data set.

Figure 4.4 Coverage distribution of three microbial genome samples, calculated from mapped reads (a) before and (b) after digital normalization (k=20, C=20).

At what rate are sequences retained? For the E. coli data set, Figure 4.5 shows the fraction of sequences retained by digital normalization as a function of the total number of reads examined when normalizing to C=20 at k=20. There is a clear saturation effect: as more reads are examined, a smaller fraction of reads is retained; by 5m reads, approximately 50-100x coverage of E. coli, under 30% of new reads are kept.

Figure 4.5 Fraction of reads kept when normalizing the E. coli data set to C=20 at k=20.
This demonstrates that, as expected, only a small amount of novelty (in the form of either new information or the systematic accumulation of errors) is observed with increasing sequencing depth.

Here we show that digital normalization provides a general strategy for applying online or streaming approaches to the analysis of de novo sequencing data. The basic algorithm presented here is explicitly a single-pass or streaming algorithm, in which the entire data set is never considered as a whole; rather, a partial “sketch” of the data set is retained and used for progressive filtering. Online algorithms and sketch data structures offer significant opportunities in situations where data sets are too large to be conveniently stored, transmitted, or analyzed [92]. This can enable increasingly efficient downstream analyses. Digital normalization can be applied in any situation where the abundance of particular sequence elements is either unimportant or can be recovered more efficiently after other processing, as in assembly, which we discuss next.

4.3.2 Digital normalization scales assembly of microbial genomes

We applied the digital normalization and error trimming protocol to three real data sets from Chitsaz et al. (2011) [19]. For all three samples, the number of reads remaining after digital normalization was reduced by at least 30-fold, while the memory and time requirements were reduced 10-100x. Despite this dramatic reduction in data set size and computational requirements for assembly, both the E. coli and S. aureus assemblies overlapped with the known reference sequence by more than 98%. This confirms that little or no information was lost during the process of digital normalization; moreover, it appears that digital normalization does not significantly affect the assembly results. (Note that we did not perform scaffolding, since the digital normalization algorithm does not take into account paired-end sequences and could mislead scaffolding approaches. Therefore, these results cannot be compared directly to those in Chitsaz et al. (2011) [19].)

The Deltaproteobacteria sequence also assembled well, with 98.8% sequence overlap with the results from Chitsaz et al. Interestingly, only 30 kb of the sequence assembled with Velvet-SC in Chitsaz et al. (2011) was missing, while an additional 360 kb of sequence was assembled only in the normalized samples. Of the 30 kb of missing sequence, only 10% matched via TBLASTX to a nearby Deltaproteobacteria assembly, while more than 40% of the additional 360 kb matched to the same Deltaproteobacteria sample. Therefore these additional contigs likely represent real sequence, suggesting that digital normalization is competitive with Velvet-SC in terms of sensitivity.

Data set          N reads pre/post   Assembly time pre/post      Assembly memory pre/post
Yeast (Oases)     100m / 9.3m        181 min / 12 min (15.1x)    45.2 GB / 8.9 GB (5.1x)
Yeast (Trinity)   100m / 9.3m        887 min / 145 min (6.1x)    31.8 GB / 10.4 GB (3.1x)
Mouse (Oases)     100m / 26.4m       761 min / 73 min (10.4x)    116.0 GB / 34.6 GB (3.4x)
Mouse (Trinity)   100m / 26.4m       2297 min / 634 min (3.6x)   42.1 GB / 36.4 GB (1.2x)

Table 4.1 Single-pass digital normalization to C=20 reduces computational requirements for transcriptome assembly.

4.3.3 Digital normalization scales assembly of transcriptomes

We next applied single-pass digital normalization to published yeast and mouse mRNAseq data sets, reducing them to 20x coverage at k=20 [41]. Digital normalization on these samples used 8 GB of memory and took about 1 min per million reads.
We then assembled both the original and normalized sequence reads with Oases and Trinity, two de novo transcriptome assemblers (Table 4.1) [118, 41]. For both assemblers the computational resources necessary to complete an assembly were reduced (Table 4.1). However, normalization had different effects on performance for the different samples. On the yeast data set, time and memory requirements were reduced significantly, as they were for Oases running on mouse. However, while Trinity's runtime decreased by a factor of three on the normalized mouse data set, the memory requirements did not decrease significantly. This may be because the mouse transcriptome is 5-6 times larger than the yeast transcriptome, so the mouse mRNAseq is lower coverage overall; in this case we would expect fewer errors to be removed by digital normalization.

Data set          Contigs > 300     Total bp > 300     Contigs > 1000    Total bp > 1000
Yeast (Oases)     12,654 / 9,547    33.2mb / 27.7mb    9,156 / 7,345     31.2mb / 26.4mb
Yeast (Trinity)   10,344 / 12,092   16.2mb / 16.5mb    5,765 / 6,053     13.6mb / 13.1mb
Mouse (Oases)     57,066 / 49,356   98.1mb / 84.9mb    31,858 / 27,318   83.7mb / 72.4mb
Mouse (Trinity)   50,801 / 61,242   79.6mb / 78.8mb    23,760 / 24,994   65.7mb / 59.4mb

Table 4.2 Digital normalization has assembler-specific effects on transcriptome assembly.

The resulting assemblies differed in summary statistics (Table 4.2). For both yeast and mouse, Oases lost 5-10% of total transcripts and total bases when assembling the normalized data. Trinity, however, gained transcripts when assembling the normalized yeast and mouse data, gaining about 1% of total bases on yeast and losing about 1% of total bases on mouse. Using a local-alignment-based overlap analysis, we found little difference in sequence content between the pre- and post-normalization assemblies: for example, the normalized Oases assembly had a 98.5% overlap with the unnormalized Oases assembly, while the normalized Trinity assembly had a 97% overlap with the unnormalized Trinity assembly.

To further investigate the differences between transcriptome assemblies caused by digital normalization, we looked at the sensitivity with which long transcripts were recovered post-normalization. When comparing the normalized assembly to the unnormalized assembly in yeast, Trinity lost only 3% of the sequence content in transcripts greater than 300 bases, but 10% of the sequence content in transcripts greater than 1000 bases. Oases, by contrast, lost less than 0.7% of sequence content at both 300 and 1000 bases. In mouse, we see the same pattern. This suggests that the change in summary statistics for Trinity is caused by fragmentation of long transcripts into shorter transcripts, while the difference for Oases is caused by loss of splice variants. Indeed, this loss of splice variants should be expected, as there are many low-prevalence splice variants present in deep sequencing data [100]. Interestingly, in yeast we recover more transcripts after digital normalization; these transcripts appear to be additional splice variants.

Data set            True 20-mers   20-mers in reads          20-mers at C=20          % reads kept
Simulated genome    399,981        8,162,813                 3,052,007 (-2)           19%
Simulated mRNAseq   48,100         2,466,638 (-88)           1,087,916 (-9)           4.1%
E. coli genome      4,542,150      175,627,381 (-152)        90,844,428 (-5)          11%
Yeast mRNAseq       10,631,882     224,847,659 (-683)        10,625,416 (-6,469)      9.3%
Mouse mRNAseq       43,830,642     709,662,624 (-23,196)     43,820,319 (-13,400)     26.4%

Table 4.3 Digital normalization to C=20 removes many erroneous k-mers from sequencing data sets. Numbers in parentheses indicate the number of true k-mers lost at each step, based on the reference.
Data set            True 20-mers   20-mers in reads          20-mers remaining          % reads kept
Simulated genome    399,981        8,162,813                 453,588 (-4)               5%
Simulated mRNAseq   48,100         2,466,638 (-88)           182,855 (-351)             1.2%
E. coli genome      4,542,150      175,627,381 (-152)        7,638,175 (-23)            2.1%
Yeast mRNAseq       10,631,882     224,847,659 (-683)        10,532,451 (-99,436)       2.1%
Mouse mRNAseq       43,830,642     709,662,624 (-23,196)     42,350,127 (-1,488,380)    7.1%

Table 4.4 Three-pass digital normalization removes most erroneous k-mers. Numbers in parentheses indicate the number of true k-mers lost at each step, based on the known reference.

The difference between the Oases and Trinity results shows that Trinity is more sensitive to digital normalization than Oases: digital normalization seems to cause Trinity to fragment long transcripts. One potential issue is that Trinity only permits k=26 for assembly, while normalization was performed at k=20; digital normalization may be removing 26-mers that are important for Trinity's path-finding algorithm. Alternatively, Trinity may be more sensitive than Oases to the change in coverage caused by digital normalization. Regardless, the strong performance of Oases on digitally normalized samples, as well as the high retention of k-mers (Table 4.3), suggests that the primary sequence content of the transcriptome remains present in the normalized reads, although it is recovered with different effectiveness by the two assemblers.

4.3.4 Lower bound on memory usage for effective digital normalization

In this section, we discuss the lower bound on memory usage for effective digital normalization, and in particular the effects of high false positive rates.

Memory            FP rate   Retained reads   Retained reads %   True k-mers missing   Total k-mers
before diginorm   -         5,000,000        100.0%             170                   41.6m
2400 MB           0.0%      1,656,518        33.0%              172                   28.1m
240 MB            2.8%      1,655,988        33.0%              172                   28.1m
120 MB            18.0%     1,652,273        33.0%              172                   28.1m
60 MB             59.1%     1,633,182        32.0%              172                   27.9m
40 MB             83.2%     1,602,437        32.0%              172                   27.6m
20 MB             98.8%     1,460,936        29.0%              172                   25.7m
10 MB             100.0%    1,076,958        21.0%              185                   20.9m

Table 4.5 Low-memory digital normalization. The results of digitally normalizing a 5m-read E. coli data set (1.4 GB) to C=20 with k=20 under several memory usage/false positive rates. The false positive rate (second column) is empirically determined. We measured reads remaining, the number of “true” k-mers missing from the data at each step, and the number of total k-mers remaining. Note: at high false positive rates, reads are erroneously removed due to inflation of k-mer counts.

Memory            FP rate   N contigs   Total length (bases)   % of true genome covered
before diginorm   -         106         4,546,051              97.84%
2400 MB           0.0%      617         4,549,235              98.05%
240 MB            2.8%      87          4,549,253              98.04%
120 MB            18.0%     86          4,549,335              98.04%
60 MB             59.1%     90          4,548,619              98.03%
40 MB             83.2%     89          4,550,599              98.11%
20 MB             98.8%     85          4,550,014              98.04%
10 MB             100.0%    97          4,545,871              97.97%

Table 4.6 E. coli genome assembly after low-memory digital normalization. A comparison of assembling reads digitally normalized with low memory/high false positive rates. The reads were digitally normalized to C=20 (see [10] for more information) and were assembled using Velvet. We measured the total length of the assembly, as well as the percent of the true MG1655 genome covered by the assembly, using QUAST.

We applied digital normalization to the E. coli data set used above, and chose seven different Count-Min Sketch sizes to yield seven different false positive rates (Table 4.5).
The data set was normalized to a k-mer coverage of 20 and the resulting data were evaluated for retention of true and erroneous k-mers, as in [10] (Table 4.5). The results show that digital normalization retains the same set of underlying “true” k-mers until the highest false positive rate of 100% (Table 4.5, column 5), while discarding only about 2% additional reads (Table 4.5, column 6).

To evaluate the effect of digital normalization with high false positive rates on actual genome assembly, we next performed normalization to a coverage of 20 with the same range of false positive rates as above. We then assembled these data with Velvet [146] and compared the resulting assemblies to the known E. coli MG1655 genome using QUAST [44] (Table 4.6). To our surprise, we found that even after executing digital normalization with a false positive rate of 83.2%, a nearly complete assembly was generated. No progressive increase in misassemblies (measured against the real genome with QUAST) was seen across the different false positive rates (data not shown). This suggests that below an 83.2% FP rate, the false positive rate of digital normalization has little to no effect on assembly quality with Velvet. (Note that the Velvet assembler itself used considerably more memory than digital normalization.) While these results are specific to Velvet and the coverage parameters used in digital normalization, they do suggest that no significant information loss occurs at false positive rates below 80%. Further evaluation of assembly quality in response to different normalization parameters and assemblers is beyond the scope of this dissertation.

4.3.5 Digital normalization dramatically scales de novo assembly

The results of applying digital normalization to read data sets prior to de novo assembly are extremely good: digital normalization reduces the computational requirements (time and memory) for assembly considerably, without substantially affecting the assembly results. It does this in two ways: first, by removing the majority of reads without significantly affecting the true k-mer content of the data set; and second, by eliminating these reads, digital normalization also eliminates the sequencing errors contained within them, which otherwise would add significantly to memory usage in assembly [24]. Digital normalization also lowers computational requirements by eliminating most repetitive sequence in the data set. Compression-based approaches to graph storage have demonstrated that compressing repetitive sequence also effectively reduces memory and compute requirements [101, 124]. Note, however, that eliminating many repeats may also have its negatives (discussed below). Digital normalization should be an effective preprocessing approach for most assemblers. In particular, the de Bruijn graph approach used in many modern assemblers relies on k-mer content, which is almost entirely preserved by digital normalization (see Tables 4.3 and 4.4) [87].

4.4 A streaming algorithm to analyze and trim errors in short reads

K-mer spectral analysis is a powerful approach to error detection and correction in shotgun sequencing data that uses k-mer abundances to find likely errors in the data [99, 69]. The essential idea is that low-abundance k-mers contained in a high-coverage data set typically represent random sequencing errors. A variety of read trimming and error correcting tools use k-mer counting to reduce the error content of the read data set, independently of quality scores or reference genomes [63].
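As an illustration of this idea (a hedged sketch, not khmer's trimming scripts), the core operation is simply to truncate a read at the first k-mer whose abundance in the full data set falls below a cutoff, given a count table filled in a prior pass over the reads. The exact cut position is a convention choice; here we keep the prefix up to, but not including, the last base of the first low-abundance k-mer.

    def trim_at_low_abundance(read, k, counts, cutoff=1):
        # counts: a k-mer count table built over the whole data set in a
        # first pass (e.g. a Count-Min Sketch); a k-mer with count <= cutoff
        # is presumed to contain a sequencing error.
        for i in range(len(read) - k + 1):
            if counts.get_count(read[i:i + k]) <= cutoff:
                return read[:i + k - 1]
        return read

    # two-pass usage sketch: (1) count every k-mer in every read,
    # (2) re-read the data and trim each read at unique k-mers:
    # trimmed = [trim_at_low_abundance(r, 20, counts) for r in reads]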
Approaches derived from spectral analysis can be very effective: spectral error correction achieves high accuracy, and later we will show that spectral k-mer trimming is considerably more effective at removing errors than quality score-based approaches. However, spectral analysis is also very computational intensive: most implementations count all the k-mers in sequencing data sets, which can be memory- or I/O-intensive for large data sets. In the section above, we discussed a streaming algorithm for downsampling read data sets to normalize read coverage spectra, termed “digital normalization”. This procedure estimates the k-mer coverage of each read in a stream using an online algorithm. Reads above a certain estimated coverage are set aside and their k-mers are not tracked. The diginorm algorithm only examines the data once, and counts only the k-mers in retained reads, leading to sublinear memory usage for high-coverage data sets. Here we introduce a semi-streaming algorithm for k-mer spectral analysis, based on digital normalization, that can detect and remove errors in sequencing reads. This algorithm operates in sublinear memory with respect to the input data, and examines the data at most twice. The approach offers a general framework for streaming sequence analysis and could be used for error correction and variant calling. Moreover, the approach can be applied generically to data sets with variable sequencing coverage such as transcriptomes, metagenomes, and amplified genomic DNA. We also provide a fully streaming approach for estimating perposition sequencing error rates in reads that operates in fixed memory and only examines part of the input data. 59 4.4.1 Two-pass non-streaming method to enable read error analysis Firstly, we implemented a two-pass non-streaming method to trim read based on the new efficient k-mer counting approach we introduced in last chapter. Basically in the first pass all the reads in a data set is loaded and the counts of each k-mer are stored in the Count-Min Sketch. Then during the second pass, for each read, the count of every k-mer in it will be examined, if a k-mer with a count as 1 in the whole data set is detected, an sequencing error is detected and the read will be truncated from this k-mer to the end. In this experiment, we especially evaluated the effect of false-positive induced miscounts on read trimming. Because the Count-Min Sketch never undercounts k-mers, reads will never be erroneously trimmed at truly high-abundance k-mers; however, reads may not be trimmed correctly when miscounts inflate the count of low-abundance k-mers. In cases where many errors remain, read trimming can potentially be applied multiple times, with each round reducing the total number of k-mers and hence resulting in lower false positive rates for the same memory usage. untrimmed khmer iteration 1 khmer iteration 2 khmer iteration 3 khmer iteration 4 khmer iteration 5 khmer iteration 6 filter by FASTX filter by seqtk(default) filter by seqtk(-q 0.01) filter by seqtk(-b 3 -e 5) FP rate 80.0% 40.2% 25.4% 23.2% 22.8% 22.7% - bases trimmed 13.5% 1.7% 0.3% 0.1% 0.0% 0.0% 9.1% 8.9% 15.4% 8.0% distinct k-mers 41.6m 13.3m 7.6m 6.8m 6.7m 6.6m 6.6m 26.6m 17.7m 9.9m 34.5m unique k-mers 34.1m 6.5m 909.9k 168.1k 35.8k 7.9k 1.9k 20.3m 12.1m 5.1m 27.7m unique k-mers at 3’ end 30.4% 29.8% 12.3% 3.1% 0.7% 0.2% 0.0% 26.3% 12.3% 5.2% 25.3% Table 4.7 Iterative low-memory k-mer trimming. The results of trimming reads at unique (erroneous) k-mers from a 5m read E. 
coli data set (1.4 GB) in under 30 MB of RAM. After each iteration, we measured the total number of distinct k-mers in the data set, the total number of unique (and likely erroneous) k-mers remaining, and the number of unique k-mers present at the 3’ end of reads. 60 We performed six iterations of unique k-mer trimming on 5 million Illumina reads from sequencing of E. coli, with memory usage less than 30 MB. For each iteration we measured empirical false positive rate compared with number of bases trimmed as well as the total number of k-mers (Table 4.7). In the first round, the estimated false positive rate was 80.0%, and 13.5% of the total bases were removed by trimming reads at low-abundance k-mers; the second iteration had a false positive rate of 37.7%, and removed only 1.5% additional data; and by the fourth iteration the false positive rate was down to 23.2% with 0.0% of the data removed. The elimination of so many unique k-mers (column 5) in the first pass was unexpected: the high false positive rate should have resulted in fewer k-mers being identified as unique, were the erroneous k-mers independent of each other. Upon examination, we realized that in Illumina data erroneous k-mers typically come from substitution errors that yield runs of up to k erroneous k-mers in a row [63]. When trimming reads with high false positive rates, these runs are typically trimmed after the first few unique k-mers, leaving unique k-mers at the 3’ end. Because of this we hypothesized that high-FP rate trimming would result in the retention of many unique k-mers at the 3’ end of the read, and this was confirmed upon measurement (Table 4.7, column 6, pass 1 vs pass 2). In comparison to quality-based trimming software such as seqtk and FASTX, trimming at unique k-mers performed very well: in this data set, all unique k-mers represent errors, and even with an initial false positive rate of 80%, khmer outperformed all but the most stringent seqtk run (Table 4.7). With a lower false positive rate or multiple passes, khmer eliminates more erroneous k-mers than seqtk or FASTX. The tradeoff here is in memory usage: for larger data sets, seqtk and FASTX will consume the same amount of memory as on smaller data sets, while khmer’s memory usage will need to grow with the data set size. 61 With Illumina sequencing, average and per-position error rates may vary between sequencing runs, but are typically systematic within a run [62]. Melsted and Halldorson (2014) introduced an efficient streaming approach to estimating per-run sequencing error, but this approach does not apply to error rates by position within reads [84]. Here, k-mer spectral error analysis can be used to calculate per-position relative sequencing error for entire data sets. 1.8 1e6 Number of abund=1 k-mers at that position 1.6 k=17 k=32 1.4 1.2 1.0 0.8 0.6 0.4 0.2 0.0 0 20 40 60 Starting position of k-mer in read 80 100 Figure 4.6 Number of unique k-mers (y axis) by starting position within read (x axis) in an untrimmed E. coli 100-bp Illumina shotgun data set, for k=17 and k=32. The increasing numbers of unique k-mers are a sign of the increasing sequencing error towards the 3’ end of reads. Note that there are only 69 starting positions for 32-mers in a 100 base read. In Figure 4.6, we use khmer to examine the sequencing error pattern of a 5m-read subset of an Illumina reads data set from single-colony sequencing of E. coli [19]. 
The high rate of occurrence of unique k-mers close to the 3’ end of reads is due to the increased sequencing error rate at the 3’ end of reads. The results above demonstrated that the newly developed k-mer counting approach can 62 be integrated successfully to do effective error analysis. This is an application where the counting error of the Count-Min Sketch approach used by khmer may be particularly tolerable: it will never falsely call a high-abundance k-mer as low-abundance because khmer never underestimates counts. 4.4.2 A semi-streaming algorithm can be used for error analysis As shown above, k-mer spectral error detection, trimming, and correction approaches are typically implemented as a two-pass offline algorithm, in which k-mer counts are collected in a first pass and then reads are analyzed in a second pass. While several algorithms that run in sublinear memory do exist (e.g., Lighter [129]), these are still offline algorithms that require two or more passes across the data. In high coverage data sets it is possible to implement a more algorithmically efficient approach, by detecting reads that are high coverage in the context of reads previously encountered in the same pass of the data. Shotgun sequencing oversamples most regions – for example, for a 100x coverage genomic data set, we would expect 50% or more of the genome to be represented by more than 100 reads. This is a consequence of the Poisson-random sampling that underlies shotgun sequencing [65]. This oversampling provides an opportunity, however: if we regard the read data set as a stream of incoming data randomly sampled from a pool of molecules, high-abundance species or subsequences within the pool will be more highly sampled in the stream than others, and will thus generally appear earlier in the stream. For example, in mRNAseq, highly expressed transcripts should almost always be sampled much more frequently than low-expressed transcripts, and so more reads from highly expressed transcripts will be seen in any given subset.With this in mind, we can develop an approach to do semi-streaming error analysis by detecting and analyzing high-coverage reads 63 during the first pass. We implemented this by integrating k-mer spectral error analysis directly into the digital normalization algorithm. As digital normalization, here we still use the median k-mer abundance of the k-mers in a read to estimate that read’s abundance [10]; crucially, this can be done at any point in a stream, by using the online k-mer counting functionality of khmer to determine the abundance of k-mers seen thus far in the stream [148]. transcript 1 X X second pass X X X X X X X first pass X X X Figure 4.7 Diagram of semi-streaming error detection. In a first pass over the read data, reads are loaded in until the graph locus to which they belong is saturated. From that point on, reads are examined for errors and not loaded into the graph. In a second pass, only the subset of reads loaded into the graph are examined for errors. The conceptual idea is presented in Figure 4.7. On the first pass, low-coverage reads would be incorporated into the k-mer database with all the k-mers in them loaded into memory, and set aside for later analysis, because we cannot reliably detect error, which is a low-abundance k-mer in a low-coverage read. Meanwhile, the high-coverage reads would be analyzed for errors but would not be incorporated into the k-mer database. This step is 64 similar to digital normalization where the high-coverage reads are discarded. 
Not loading the k-mers in those high-coverage reads decreases the counts of those high abundance k-mers in the k-mer database a bit but this does not affect the counts of those low abundance k-mers. So this process will not influence the detection of errors. Actually the special treatment to high coverage reads also dismisses many errors in those high coverage reads and this makes the detection of low abundance k-mers more accurate. On the second pass, the set aside reads which were considered as low-coverage reads would be checked for coverage again, and either ignored or analyzed for errors. Crucially, this second pass involves at most another full pass across the data, but only when the entire data set is below the coverage threshold; the larger the high coverage component of the data, the smaller the fraction of the data that is examined twice. In Figure 4.8, we show diginorm-generated coverage saturation curves for both real and error-free simulated reads from E. coli MG1655. In both cases, after the first 1m reads, the majority of reads have an estimated coverage of 20 or higher, and hence can be used for error analysis on the remainder of the data encountered in the first pass. The algorithm for the semi-streaming analysis of reads can be described as follows: for read in data: # first pass if estimated_coverage(read) < C: count_kmers(read, k-mer_database) # save(read) else: analyze(read) for read in saved_reads: # second pass 65 N of reads collected < C (m) 2.5 2.0 E. coli real data E. coli synthetic/no errors 1.5 1.0 0.5 0.0 0 1 2 3 N of reads examined (m) 4 5 Figure 4.8 Saturation curve of a real and a simulated E. coli read data set. Reads are collected when they have an estimated coverage of less than 20; in the early phase (< 1m reads), almost all reads are collected, but by 2m reads into the data set, the majority of reads come from loci with an estimated sequencing depth of > 20 and are rejected. if estimated_coverage(read) >= C: analyze(read) As with digital normalization, a basic semi-streaming approach is very simple to implement: with an online way to count k-mers, the algorithm is approximately 10 lines of Python code. The approach also requires very few parameter choices: the only two parameters are k-mer size and target coverage C. However, we do not yet know how these parameters interact with read length, error rate, or data set coverage; systematic evaluation of parameters and the development of underlying theory is left for future work. In practice, we expect that additional work will need to be done to adapt existing error correction approaches to use the semi-streaming approach. 66 4.4.3 Semi-streaming error trimming on synthetic and real data: We next adapted the error detection algorithm to do semi-streaming error trimming on synthetic or real genomic, metagenomic, and transcriptomic data. On the synthetic “simple genome” this trimming approach eliminates 149 reads entirely and truncates another 392 reads. Of the 100,000 bp in the simulated reads, 31,910 (31.9%) were removed by the trimming process. In exchange, trimming eliminated all of the errors, bringing the overall error rate from 0.63% to 0.00%. For the synthetic “simple metagenome” we only trimmed reads with estimated coverage of 20 or higher. Here, of 2347 reads containing 234,700 bp, 314 reads (13.4%) were removed and 851 reads (36.3%) were trimmed, discarding a total of 74,321 bases (31.7%). 
Of 1451 errors total, all but 61 were eliminated, bringing the overall per-base error rate from 0.62% to 0.04%. The simple mRNAseq data set showed similar improvement: 83 of 568 reads were removed, and 208 were trimmed, removing 19,507 of 56,800 bases (34.34%). The initial error rate was 0.65% and the final error rate was 0.07%. Applying the semi-streaming error trimming to the E. coli MG1655 data set, we trimmed 2.0m reads and removed 50,281 reads entirely. Of 8.0m errors, all but 203,345 were removed, bringing the error rate from 1.49% to 0.07%. Trimming discarded 53 Mbp of the original 486 Mbp (11.1%). On the mouse mRNAseq data set, semi-streaming error trimming removed 919,327 reads and trimmed 648,322 reads, removing 19.8% of the total bases, bringing the overall error rate from 1.59% to 1.21%. When we measured only the error rate in the high-coverage reads, trimming brought the error rate from 1.20% to 0.42%. On the mock metagenome data set, 27,554 reads were removed and 171,705 reads were trimmed, removing 0.36% of bases; this 67 Data set pre-trim error E. coli 1.49% mouse mRNAseq 1.59% (high coverage only) 1.20% Mock metagenome 0.31% (high coverage only) 0.16% % bp trim % reads trim post-trim error 11.05% 41.9% 0.07% 13.9% 19.8% 1.21% 20.4% 29.0% 0.42% 0.4% 1.1% 0.28% 1.4% 3.5% 0.07% Table 4.8 A summary of trimming statistics for semi-streaming error trimming. Error rates before and after trimming were estimated by mapping. “High coverage” numbers refer to the subset of reads with C ≥ 20 that were subject to analysis. Data set mouse mRNAseq Total reads 81.3m Total bp 6.18 Gbp High-coverage reads 74.6m Number of passes 1.18 % reads trim 25.0% % bp trim 13.74% Pre-trim error rate 1.89% Post-trim error rate 1.30% mock metagenome 103.2m 10.4 Gbp 91.9m 1.43 11.75% 4.03% 0.27% 0.15% Table 4.9 Results of streaming error trimming on complete data sets. Error rates before and after trimming were estimated by mapping. low percentage is because of the very low coverage of most of the reads in this data set. In practice, the space and time performance of both digital normalization and the generalized streaming approach presented here depend on specific details of the data set under analysis and the precise implementation of the coverage estimator. While our intention in this paper is to demonstrate the general streaming approach, we note that even our naive implementation for e.g. streaming trimming is useful and can be applied to very large data sets. For high coverage data, we can efficiently error-trim 10s of millions of reads in both sublinear memory and fewer than two passes across the data. In Table 4.9, we show the summary statistics for streaming error trimming of the full mouse mRNAseq and mock metagenome data; in contrast to the smaller subsets used previously (see Table 4.8), when we consider the full data sets the majority of reads are examined only once (see “Number of 68 passes”, Table 4.9). The implementation of semi-streaming error trimming used here is somewhat inefficient, and relies on redundantly storing all of the reads needed for the second pass on disk during the first pass. In the worst case, where all reads are low coverage, a complete copy of the data set may need to be stored on disk! This is an area for future improvement. However, when we look at full data sets, fewer than half the reads are examined twice (see Number of passes, Table 4.9). 
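As a sketch of how such an implementation might look, the following self-contained Python code performs semi-streaming unique-k-mer trimming, buffering low-coverage reads for the second pass in a temporary file on disk as described above. A plain dictionary again stands in for the Count-Min Sketch, and the function names and parameter values are ours, chosen for illustration:

    import tempfile
    from statistics import median

    K, C = 20, 20                     # illustrative k-mer size and coverage threshold
    counts = {}                       # k-mer -> count; stand-in for the Count-Min Sketch

    def kmers(seq):
        return [seq[i:i + K] for i in range(len(seq) - K + 1)]

    def estimated_coverage(seq):
        # median abundance of the read's k-mers, as seen so far in the stream
        kms = kmers(seq)
        return median(counts.get(km, 0) for km in kms) if kms else 0

    def consume(seq):
        for km in kmers(seq):
            counts[km] = counts.get(km, 0) + 1

    def trim(seq):
        # truncate the read at its first unique (likely erroneous) k-mer
        for i, km in enumerate(kmers(seq)):
            if counts.get(km, 0) <= 1:
                return seq[:i]
        return seq

    def semi_streaming_trim(reads):
        trimmed = []
        with tempfile.TemporaryFile("w+") as saved:
            for read in reads:                      # first pass
                if estimated_coverage(read) < C:
                    consume(read)
                    saved.write(read + "\n")        # buffer low-coverage reads on disk
                else:
                    trimmed.append(trim(read))
            saved.seek(0)
            for line in saved:                      # second pass (at most once more)
                read = line.strip()
                if estimated_coverage(read) >= C:
                    trimmed.append(trim(read))
        return trimmed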
4.4.4 Semi-streaming Illumina error rates and error profiles analysis Frequency of errors at that position 0.09 0.08 0.07 bowtie2 mismatches k-mer subsample (scaled) k-mer 2pass (scaled) 0.06 0.05 0.04 0.03 0.02 0.01 0.00 0 20 40 60 Position in read 80 100 Figure 4.9 Error spectrum of reads in the E. coli data set. The sublinear k-mer spectrum analysis is calculated based on saturation of a fraction of the data set, while the two-pass spectral analysis uses all of the data. bowtie2 mismatches are based on all mapped reads. The y values for the k-mer spectral analyses are scaled by a factor of four for ease of comparison. We can adapt the streaming approaches in the previous section to efficiently provide estimates for subsets of the data. The basic idea is to consume reads until some reads have saturated, and then to calculate error rates for new reads from the saturated loci in the 69 Frequency of errors at that position 0.05 0.04 bowtie2 mismatches k-mer subsample (scaled) k-mer 2pass (scaled) 0.03 0.02 0.01 0.00 0 10 20 30 40 50 Position in read 60 70 80 Figure 4.10 Error spectrum of reads in the mouse RNAseq data set. The sublinear k-mer spectrum analysis is calculated based on saturation of a fraction of the data set, while the two-pass spectral analysis uses all of the data, and bowtie2 mismatches are based on all mapped reads. The peak of errors at position 34 in the bowtie2 mapping reflects errors that in the first part of the data set are called as Ns, and hence are ignored by the sublinear error analysis; see text for details. Note, the bowtie2 mismatch rates are larger than the spectral rates, so for ease of comparison the y values for the k-mer spectral analyses are scaled by a factor of four. graph. This can be done in one pass for data sets with sufficiently high coverage data: as shown above (Figure 4.8), in some data sets, most of the reads will have sufficient coverage to call errors by the time 20% of the data set has been consumed. Using the same error detection code as above, we implemented a sublinear memory/sublinear time algorithm that collects reads until some regions have reached 20x coverage, or 200,000 reads have surpassed a coverage of 10x. In either case, all reads at or above a coverage of 10 are analyzed for errors, with a trusted k-mer cutoff of 3. In Figure 4.9 and Figure 4.10 we show the resulting error profiles for the E. coli and mouse RNAseq data sets, compared with the profile obtained by examining the locations of mismatches to the references. We also show the error profile obtained with the full two-pass approach (using digital normalization and then error detection as above) for comparison. 70 In the E. coli data set (Figure 4.9), we see the increase in error rate towards the 3’ end of the gene that is characteristic of Illumina sequencing [31]. All three error profiles agree in shape (Pearson’s correlation of 0.99 between each pair) although they are offset considerably in absolute magnitude. The k-mer error profile was calculated from the first 850,000 reads, but is consistent across five other subsets of the data chosen randomly with reservoir sampling (data not shown); all five subsets had Pearson’s correlation coefficients greater than 0.99 with the bowtie2 mapping profile and the two-pass spectral approach. The RNAseq error profile exhibits two large spikes, one at position 34 and one at position 69. Both spikes appear to be genuine and correlate with large numbers of Ns in those positions in the original data set. 
The spikes are present in the profiles derived from two-pass spectral analysis as well as the bowtie2 mismatch calculation. However, the sublinear approach does not detect them when using the first 675,000 reads. This is because of the choice of subsample: five other subsamples, chosen randomly from the entire data set with reservoir sampling, match the two-pass spectral analysis (data not shown). The error profiles calculated from all six subsamples with the sublinear algorithm have a Pearson's correlation coefficient greater than 0.96 with the error profiles from the full two-pass spectral approach and the bowtie2 mismatches.

The ability to analyze high-coverage reads without examining the entire data set offers some intriguing possibilities. One concrete application that we demonstrate here is the use of high-coverage reads to infer data-set-wide error characteristics for shotgun data, in a way that is robust to the sample type [62]. This approach could also be integrated directly into sequencers to assess whether the target coverage has been obtained, and perhaps stop sequencing. More generally, the approach of using saturating coverage to truncate computational analysis may have application to streaming sequencing technologies such as SMRT and Nanopore sequencing, where real-time feedback between sequencing and sequence analysis could be useful [32, 54].

4.5 Time and space usage of the streaming algorithm for analyzing short DNA sequencing reads

As shown above, the essential idea of error analysis is that low-abundance k-mers contained in a high-coverage data set typically represent random sequencing errors. We address this problem by making use of k-mer spectra, a common approach in which reads are treated as subpaths through a De Bruijn graph, and errors in the reads are identified by finding low-frequency subpaths [99]. We generalize this approach by building the graph with an online algorithm and detecting regions of the graph saturated by observations. These regions can then be used for per-read analysis without necessarily examining the entire data set.

Detecting graph saturation: We detect graph saturation with digital normalization. The digital normalization algorithm is, in Python pseudocode:

    for read in data:
        if coverage(read, table) < DESIRED:
            add_read_to_graph(read, graph)
            analyze(read)

This is a single-pass algorithm that can be implemented in fixed space using a Count-Min Sketch to store the De Bruijn graph necessary for coverage estimation [96, 148]. For any error-containing data set with coverage greater than DESIRED, the graph requires less space than the size of the input, typically space sublinear in the data size for any fixed-size source text (see Figure 4.8 and [148]).

The digital normalization algorithm was developed as a filter, in which the reads are passed on to another program (such as a de novo assembler) for further analysis; these later analyses are typically based on multi-pass, heavyweight algorithms. Here, digital normalization is performing lossy compression, reducing the number of error-containing sentences while attempting to retain the structure of the De Bruijn graph [10, 148, 72]. This reliance on a post-normalization heavyweight analysis step limits the applicability of digital normalization and presents challenges in the analysis of extremely large data sets, which motivated this work.
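A minimal runnable version of this filter, with a plain Python dictionary standing in for the Count-Min Sketch and an illustrative choice of k and coverage target, might look like the following sketch (the helper names are ours, not khmer's API):

    from statistics import median

    K, DESIRED = 20, 20
    graph = {}                     # k-mer -> count; stand-in for the Count-Min Sketch

    def kmers(seq):
        return [seq[i:i + K] for i in range(len(seq) - K + 1)]

    def coverage(read):
        # estimate read coverage as the median abundance of its k-mers seen so far
        kms = kmers(read)
        return median(graph.get(km, 0) for km in kms) if kms else 0

    def add_read_to_graph(read):
        for km in kmers(read):
            graph[km] = graph.get(km, 0) + 1

    def normalize(reads):
        # single-pass digital normalization: retain reads from under-sampled loci
        for read in reads:
            if coverage(read) < DESIRED:
                add_read_to_graph(read)
                yield read         # retained reads would be passed to an assembler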
Semi-streaming analysis: The algorithm for semi-streaming analysis of reads is as follows:

    for read in data:                    # first pass
        if coverage(read, graph) < DESIRED:
            add_read_to_graph(read, graph)
            save(read)
        else:
            analyze(read)

    for read in saved_reads:             # second pass
        if coverage(read, graph) >= DESIRED:
            analyze(read)

Here, the space used for the graph remains identical to the digital normalization algorithm and is typically sublinear for high-coverage data sets, but the algorithm is no longer single-pass, and requires re-examining some subset of the input data in a second pass. In the worst-case scenario, with an undersampled source text (or randomly generated sentences), this is a fully offline two-pass approach that requires re-examining all of the input data in the second pass. In practice, most real data sets will require fewer than two passes: graphically, any deviation from the identity line in a saturation analysis such as Figure 4.8 yields a few-pass algorithm.

4.6 Conclusion

Shotgun DNA sequencing generates data as a stream of items representing sentences ("reads") randomly sampled from a larger text, with replacement. There are several distinct features of this kind of stream-like data. The first is that important details of the source text, such as its size and statistical composition, may be completely unknown; that is, often the reads themselves are the most specific information we have about the source text. Second, the source text may be incompletely sampled by the reads, and whether or not it is completely sampled may not be known in advance. And third, read data sets are typically stored on disk, at least in current implementations; our goal is to identify more efficient approaches to examining these data sets without necessarily moving to a pure streaming model, which allows us to make use of the semi-streaming paradigm introduced by Feigenbaum et al. [33].

In this chapter, we discussed our solutions to two primary problems. One is to efficiently distill the non-redundant information from the stream, reducing the size of the data without losing much important information. The other is to efficiently identify the locations of errors in these reads by finding differences with respect to the (unknown) source text. Both solutions are based on a novel approach that uses the median k-mer count in a read to estimate sequencing depth without a reference assembly, which will also be the foundation of the IGS-based diversity analysis method discussed in the next chapter.

Streaming represents the future of big data. The kind of problem we are dealing with here is not only critically important to a better understanding of the rapidly growing volume of biological data, but also a gateway to a larger set of interesting domain problems dealing with big data, such as estimating the true abundance of the sentences in the larger text or detecting malicious traffic in an Internet data stream.

4.7 Data

4.7.1 Code availability

The algorithms for digital normalization and error analysis are implemented in the khmer software package, written in C++ and Python, available at github.com/ged-lab/khmer/. khmer also relies on the screed package for loading sequences, available at github.com/ged-lab/screed/. khmer and screed are Copyright (c) 2010 Michigan State University, and are free software available for distribution, modification, and redistribution under the BSD license.
The code and detailed instruction used to generate all of the results in this chapter is available at github.com/ged-lab/2012-paper-diginorm/ and http://github.com/gedlab/2014-streaming/. 75 4.7.2 Data sets used for digital normalization The E. coli, S. aureus, and Deltaproteobacteria data sets were taken from Chitsaz et al. [19], and downloaded from bix.ucsd.edu/projects/singlecell/. The mouse data set was published by Grabherr et al. [41] and downloaded from trinityrnaseq.sf.net/. All data sets were used without modification. The complete assemblies, both pre- and post-normalization, for the E. coli, S. aureus, the uncultured Deltaproteobacteria, mouse, and yeast data sets are available from ged.msu.edu/papers/2012-diginorm/. The simulated genome and transcriptome were generated from a uniform AT/CG distribution. The genome consisted of a single chromosome 400,000 bases in length, while the transcriptome consisted of 100 transcripts of length 500. 100-base reads were generated uniformly from the genome to an estimated coverage of 200x, with a random 1% per-base error. For the transcriptome, 1 million reads of length 100 were generated from the transcriptome at relative expression levels of 10, 100, and 1000, with transcripts assigned randomly with equal probability to each expression group; these reads also had a 1% per-base error. 4.7.3 Synthetic data sets used for error analysis We computationally constructed three small short-read DNA data sets for initial exploration of ideas. All synthetic sequences have equiprobable A/C/G/T. All synthetic reads are 100bp long and were sampled with 1% error. The “simple genome” data set consists of 1000 reads chosen uniformly from a 1 kb randomly constructed genome. The “simple transcriptome” data set consists of 568 reads chosen uniformly from synthetic transcripts containing different subsets of four 250-base exons, with expression levels varying by a factor of 30 from minimum to maximum. The “simple metagenome” data set consists of reads sampled from three 76 Name simple genome E. coli MG1655 simple transcriptome mouse mRNAseq simple metagenome mock metagenome Number of reads Description 1000 1kb genome; no repeats 4,863,836 Subset of ERA000206 ([20]) 568 300:1 high:low abundance; shared exons 7,915,339 Subset of GSE29209 ([42]) 2,347 316:1 high:low abundance species 18,805,251 Subset of SRR606249 ([120]) Table 4.10 Data sets used for evaluation. different 500 bp sequences, across 30 fold variation in abundance. In all three cases, the errors during read sampling were recorded for comparison with predictions. 4.7.4 Real data sets used for error analysis We used three shotgun Illumina data sets: a genomic data set from E. coli, a mRNAseq data set from Mus musculus, and a mock community metagenome. For E. coli, we took a 5m read subset of ERA000206 from [20]. For mRNAseq, we used a 10m read subset of GSE29209 from [42]. For the mock metagenome, we used a 20m read subset of SRR606249 from [120]. Prior to analysis, we eliminated any read with an ‘N’ in it and filtered the reads by mapping to the known references, yielding the read numbers in Table 4.10. 77 Chapter 5 A framework for diversity analysis of whole shotgun metagenomic reads data 5.1 Introduction Here we propose a novel concept, “informative genomic segment” or IGS, and use IGS as a replacement of OTUs as the basic unit for diversity analysis of whole shotgun metagenome data sets. 
IGSs represent the unique information in a metagenomics data set and the abundance of IGSs in different samples can be retrieved by analyzing read coverage with an efficient k-mer counting method (discussed in the previous two chapters). The samples-byIGS abundance data matrix is a promising replacement of samples-by-OTU data matrix used in 16S rRNA based analysis and many existing statistical methods can be applied to work on the samples-by-IGS data matrix to investigate diversity. We applied this method to several simulated data sets and several real metagenomic data sets, including human microbiome, sea water and soil data. The results of beta diversity analysis showed that the samples were clustered with comparable or better accuracy than existing alignment-based method. The results of alpha diversity analysis showed this was a promising new approach to estimate 78 metagenome sizes. Since this method is totally binning-free, assembly-free, annotation-free, and reference-free, it is promising for dealing with highly diverse environmental samples, where we are facing many unknowns. 5.2 The concept of IGS(informational genomic segment) In classic ecology dealing with macroorganisms, diversity measurement is based on the concept of species. For 16S rRNA amplicon metagenomics data set, it is based on the concept of OTUs. While the concept of OTUs can be used to analyze large shotgun metagenomics data set, normally assembly, binning and annotation are required before doing diversity analysis. However for many metagenomics projects these are difficult tasks, lacking necessary reference genome or requiring expensive computation. So we are interested in finding an approach to bypass difficult tasks like assembly, binning, annotation and use the raw reads to make the diversity analysis of large shotgun whole genome metagenomic data possible. We began such efforts by proposing that the concept of k-mer (a DNA segment with the length of k) could be used as the basic unit to measure the diversity. K-mers can be considered as the atom of information in DNA sequences. One of the composition-based approaches to binning is to use the k-mers as the source of signatures[3, 58]. Suppose the sizes of microbial genomes are similar and the difference between genomic content of microbial genomes is similar, the number of distinct k-mers in the sequence data set correlates to the number of species in a sample. However, because of sequencing error, which is unavoidable due to the limit of sequencing technology, this k-mer based analysis does not work well on unassembled data. One sequencing error on a read will generate up to k erroneous k-mers. In metagenomics data sets, especially with high coverage, most of the distinct observed k-mers 79 are from sequencing errors. Next we shifted the focus from k-mers onto a higher level - reads. In previous chapter, we have discussed a novel approach to use median k-mer count in a read to estimate sequencing depth without a reference assembly, based on which the framework for streaming analysis of short DNA sequencing reads was developed. It also offers a novel way to distill information from reads by reducing the bad effect of sequencing errors so that we can use those informative reads to measure the microbial diversity. We term those informative reads as IGSs (informative genomic segments), which can be considered segments of DNA on a microbial genome. Those IGSs should be different enough to represent the abstract information a genome contains. 
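The claim above, that a single sequencing error can generate up to k erroneous k-mers, is easy to check with a few lines of Python on a toy random genome (all values below are arbitrary illustrative choices):

    import random

    K = 20
    random.seed(1)
    genome = "".join(random.choice("ACGT") for _ in range(10000))
    true_kmers = {genome[i:i + K] for i in range(len(genome) - K + 1)}

    read = genome[100:200]                          # a 100 bp error-free read
    pos = 50                                        # introduce one substitution
    bad = read[:pos] + ("A" if read[pos] != "A" else "C") + read[pos + 1:]

    erroneous = [bad[i:i + K] for i in range(len(bad) - K + 1)
                 if bad[i:i + K] not in true_kmers]
    print(len(erroneous))                           # up to K novel k-mers from one error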
Suppose microbial genomes contain a similar number of these IGSs, just as they contain a similar number of distinct k-mers; then the number of IGSs will correlate with the species richness in a sample, and the abundance distribution of IGSs will be related to species evenness in a sample. Furthermore, we can get the abundance of the IGSs across different samples. Many classic OTU-based diversity estimation methods described in the literature review chapter can then be applied to estimate the diversity of IGSs and, subsequently, the diversity of actual species. For alpha diversity, we can generate a list of IGSs and their respective abundances in a sample. Existing estimators such as Chao's can then be applied to estimate the total number of IGSs in the sample, and rarefaction curves based on the number of IGSs can also be generated. For beta diversity, we can generate a samples-by-IGS data matrix from the abundance of IGSs across samples, as a replacement for the samples-by-OTU data matrix in OTU-based analysis and the samples-by-species data matrix in traditional ecology. From that samples-by-IGS data matrix, we can use existing methods to calculate similarity/dissimilarity/distance between samples and do further analysis such as clustering and ordination.

5.2.1 IGS (informative genomic segment) can represent the novel information of a genome

Median k-mer abundance can represent the sequencing depth of a read, as discussed in the last chapter (published as [10]). For a sequencing reads data set with multiple species, the sequencing depth of a read is related to the abundance of the species from which the read originates. The upper plot in Figure 5.1 shows the abundance distribution of reads from 4 simulated sequencing data sets with different sequencing depths: 3 data sets generated with different sequencing coverage (1x, 10x, 40x) from 3 simulated random genomes, respectively, and 1 combined data set containing all of the aforementioned data sets. No error was introduced in these simulated data sets. As expected, the reads from the three data sets can be separated by estimated sequencing depth. The combined data set can be considered a sequencing data set with three species of different abundance.

Each point on the curve shows that there are Y reads with a sequencing depth of X. In other words, for each of those Y reads, there are X − 1 other reads that cover the same DNA segment of the genome from which that read originates. So we can estimate that there are Y/X distinct DNA segments with read coverage X. We term these distinct DNA segments in a species' genome IGSs (informative genomic segments). We can transform the upper plot in Figure 5.1 to show the number of IGSs and their respective read coverage, as shown in the lower plot. Summing the numbers of IGSs with different read coverage for each data set gives the result shown in Table 5.1; these sums are essentially the areas below each curve in the figure. Even though the data sets have different sequencing depths, such as 10X and 40X, they have similar numbers of IGSs. The data set with 1X sequencing depth has fewer IGSs because that depth is not enough to cover all of the content of the genome (about 63.2% is expected to be covered at 1X).
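This reads-to-IGS transformation (Y reads observed at coverage X imply roughly Y/X IGSs) is simple enough to sketch directly; the example histogram below uses the small worked example that appears in the next subsection (Table 5.2), and the read length of 80 bp matches the simulated data sets above:

    def igs_abundance_distribution(read_coverage_hist):
        # convert {read coverage X: number of reads Y} into
        # {IGS abundance X: estimated number of IGSs}, using Y / X per bin
        return {cov: n_reads / cov
                for cov, n_reads in read_coverage_hist.items() if cov > 0}

    READ_LENGTH = 80
    hist = {3: 69, 4: 96, 5: 125, 6: 150}       # read-coverage histogram (cf. Table 5.2)
    igs = igs_abundance_distribution(hist)      # yields 23, 24, 25 and 25 IGSs (cf. Table 5.3)
    total_igs = sum(igs.values())
    estimated_genome_size = total_igs * READ_LENGTH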
The IGSs can be seen as genomic segments on a genome with the length of a read (Figure 5.2). Assuming the composition of a species' genome is totally random, which is the case in the simulated data sets, the number of IGSs (N) in a species' genome is related to the size of the genome (G), the read length (L) and the k-mer size (k), and can be written as

    N = G / L

which is the number of reads needed to cover the genome at 1X. For the simulated genome of size 1M bp and a read length of 80 bp, the expected number of IGSs is 1,000,000 / 80 = 12,500, which is close to the observed value (Table 5.1).

    Data set                 total number of IGSs
    1X depth                 6,419
    10X depth                12,022
    40X depth                12,371
    1X, 10X, 40X combined    30,748

Table 5.1 Total number of IGSs in different simulated reads data sets.

Figure 5.1 Transforming reads into IGSs.

Figure 5.2 IGS (informative genomic segment) can represent the novel information of a genome.

5.2.2 Using IGS to analyze alpha diversity

The abundance distribution of IGSs with different coverage in a sample data set can be acquired using the method shown above. Suppose that, for a reads data set, the coverage distribution of the reads is as shown in Table 5.2: there are 69 reads with coverage 3, 96 reads with coverage 4, and so on. We transform this coverage distribution of reads into an abundance distribution of IGSs, as shown in Table 5.3.

    coverage    number of reads
    3           69
    4           96
    5           125
    6           150
    ...         ...

Table 5.2 Reads coverage distribution.

    abundance   number of IGSs
    3           23
    4           24
    5           25
    6           25
    ...         ...

Table 5.3 IGS abundance distribution.

For example, there are 23 IGSs with abundance 3. This is calculated by dividing the total number of reads with coverage 3, which is 69, by the coverage, 3. Similarly, there are 24 IGSs with abundance 4. If we draw an analogy between IGSs and OTUs, this is like saying there are 23 different OTUs with 3 reads mapped to each OTU, 24 different OTUs with 4 reads mapped to each OTU, and so on. Next, all the different IGSs and their corresponding abundances can be listed, as shown in Table 5.4.

    IGS ID      abundance
    1           3
    2           3
    3           3
    ...         ...
    23          3
    24          4
    25          4
    ...         ...
    47          4
    48          5
    ...         ...

Table 5.4 Listing IGSs with the corresponding abundances.

This list is the counterpart of an OTU table in OTU-based diversity analysis. With such a table at hand, existing statistical methods and software packages can be used directly to investigate alpha diversity.

5.2.3 Using IGS to analyze beta diversity

As with alpha diversity analysis, the OTU table is also the foundation for beta diversity analysis. As long as we have a reliable OTU table, there are existing pipelines to perform the beta diversity analysis. A typical OTU table across different samples, also called a samples-by-OTU data matrix, is shown in Table 5.5. Like an OTU table, we hope to have an IGS table for the IGSs, as in Table 5.6. So the problem now is how to generate a samples-by-IGS matrix as the counterpart of the samples-by-OTU matrix, so that many of the existing tools and methods used for OTU-based diversity analysis can be borrowed for this kind of IGS-based analysis, just as shown above for alpha diversity analysis.

    OTU ID      Sample A    Sample B    Sample C
    OTU1        3           4           2
    OTU2        2           5           0
    OTU3        3           1           4
    ...         ...         ...         ...

Table 5.5 Samples-by-OTU matrix.

    IGS ID      Sample A    Sample B    Sample C
    IGS1        2           3           2
    IGS2        2           4           0
    IGS3        3           3           4
    ...         ...         ...         ...

Table 5.6 Samples-by-IGS matrix.

Figure 5.3 Getting the coverage of a read in different samples. A read from sample A has a coverage of 5 in sample A and a coverage of 8 in sample B.
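Before moving to the multi-sample case, the alpha diversity step of Section 5.2.2 can be sketched directly: a richness estimator is applied to the IGS abundance list exactly as it would be to an OTU table. Chao1 is shown here because its formula is compact; the evaluations later in this chapter use ACE:

    def chao1(igs_abundances):
        # Chao1 richness estimate from one abundance value per IGS
        s_obs = len(igs_abundances)
        f1 = sum(1 for a in igs_abundances if a == 1)    # singletons
        f2 = sum(1 for a in igs_abundances if a == 2)    # doubletons
        if f2 == 0:
            return s_obs + f1 * (f1 - 1) / 2.0           # bias-corrected form
        return s_obs + f1 * f1 / (2.0 * f2)

    # an IGS "table" analogous to Table 5.4: one abundance value per IGS
    abundances = [3] * 23 + [4] * 24 + [5] * 25 + [6] * 25
    estimated_total_igs = chao1(abundances)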
Firstly, using the same approach to get the coverage of a read in the sample data set where it is from (Figure 5.3-A), we can get the coverage of a read from sample A dataset in sample B dataset (Figure 5.3-B). We still use the median k-mer count to represent the coverage of a read. Figure 5.4 From read coverage profile to IGS. (A): Get the coverage profile of one read. (B): Get the coverage profiles of all the reads in 3 samples. (C): Group the reads with same coverage profiles into “super bin”. (D): Calculate the number of IGSs in each “super bin”. For a data set with several samples to analyze, firstly we can get the coverage of a read across different samples and get a read coverage profile like “4:6:2”, as shown in Figure 5.4(A). For all the reads in the samples we can get such read coverage profiles, as shown in Figure 87 5.4(B). We have already known that ”contigs with similar coverage profiles are likely to have originated from the same microbial population”[58]. Several new binning methods based on coverage profiles have been developed based on such assumption[2, 60, 3, 94, 58]. Thus, here we can assume that reads with similar coverage profiles are likely to have originated from the same genomic region. Actually it is safer to say reads with different coverage profiles are not likely to have originated from the same genomic region. The next step is to group those reads with same coverage profiles together into “super bin” in different samples, as shown in Figure 5.4(C). The reads in each “super bin” may not be from the same species, however, they should be from the species that have same abundance profile across samples. In the example shown in the figure, the 6 reads from sample A, the 4 reads from sample B and the 4 reads from sample C all have the same coverage profile as “3:2:2”. Actually, the numbers of reads from different samples with same coverage profile have similar ratio to the numbers in the profile, like “6:4:4” versus “3:2:2” in the example above. (Experiments using simulated data are not shown here.) Next we can use the approach similar to the one used for alpha diversity analysis to estimate the size of the genomic region each “super bin” covers, represented by the number of IGSs. Still for the example in Figure 5.4(C), 6 reads from sample A grouped into the first “super bin” have a coverage of 3 in sample A, where they originate from. The number of IGSs those 6 reads cover can be calculated as 6/3, which is 2. Thus 2 IGSs have an abundance profile as “3:2:2” across the samples. Similarly there are 2 IGSs with abundance profile as “2:0:3” and 1 IGS with abundance profile as “2:4:3”, as shown in Figure 5.4(D). List those IGSs and the corresponding abundance profiles across samples, we can have the samples-by-IGS matrix as shown in Table 5.7. With such samples-by-IGS matrix, similarity matrix between samples can be calculated 88 IGS Sample A IGS1 3 IGS2 3 IGS3 2 IGS4 2 IGS5 2 Sample B 2 2 0 0 4 Sample C 2 2 3 3 3 Table 5.7 Samples-by-IGS matrix from example. using different similarity indices, like Bray-Curtis. Next clustering and ordination methods can be applied to better interpret the relationship between samples. 5.3 5.3.1 Evaluating IGS method using simulated data sets Using a simple simulated data set to evaluate the IGS method For this experiment, firstly we create 6 synthetic samples (Sample 1-6) based on 9 synthetic 100K genomes (genome A-I), with different composition of species and diversity (Table 5.8). 
For sample1, there are two species - A and B, with abundance distribution as 3:1. The sequencing depth of all the synthetic data sets is 10X. As a simple experiment to demonstrate the effectiveness of the IGS based method, there is no sequencing error introduced in the synthetic reads data sets. sample ID sample1 sample2 sample3 sample4 sample5 sample6 species composition sequencing depth AAAB 10 AABC 10 ABCD 10 ABCE 10 AFGH 10 IFGH 10 abundance of species size of metagenome (bp) A:30 B:10 200K A:20 B:10 C:10 300K A:10 B:10 C:10 D:10 400K A:10 B:10 C:10 E:10 400K A:10 F:10 G:10 H:10 400K I:10 F:10 G:10 H:10 400K Table 5.8 Six synthetic simple metagenomes To evaluate the effectiveness of alpha diversity analysis using IGS based method, we can use a metric to estimate the total number of IGSs in a sample, which can be used to calculate 89 the estimated genome size of a sample using the formula below: size of genome = number of IGS ∗ reads length In this experiment, we use ACE metric since we find it is more accurate than Chao1, since it uses more abundance information. sample1 sample2 sample3 sample4 sample5 sample6 observed IGS 2002 3038 4076 4078 4069 4087 ACE 2002.0 3038.0 4076.0 4078.0 4069.0 4087.0 simpson estimated real evenness genome size (Kbp) genome size (Kbp) 0.76 200.2 200 0.83 303.8 300 0.91 407.6 400 0.91 407.8 400 0.91 406.9 400 0.91 408.7 400 Table 5.9 Alpha diversity analysis result of the simple simulated data using IGS method. Table 5.9 shows the alpha diversity analysis result of the simple simulated data using IGS method. The estimated genome sizes of the samples are close to real size. This is not surprising since for this simple experiment, there is no error introduced and the coverage is high (10x) to cover most of the genetic materials in the samples. Also the Simpson evenness shows the relative evenness of the samples correctly. Sample 1 is the least even with composed of two species with abundance ratio as 1:3. This shows that in this simple example, the IGS method cannot only analyze the richness of samples but also the evenness. To evaluate the effectiveness of beta diversity analysis using IGS based method, we compared the dissimilarity matrix generated by IGS based method with the true matrix, since we know exactly the species composition of the simulated data set. The true dissimilarity matrix of the 6 simulated samples using Bray-Curtis metric from species composition directly is shown in Table 5.10. For a simulated data set with 10x coverage and no error introduced (which again will tell us the optimal performance of IGS 90 sample sample sample sample sample sample 1 2 3 4 5 6 sample 1 0.00 0.25 0.50 0.50 0.75 1.00 sample2 sample 3 0.25 0.50 0.00 0.25 0.25 0.00 0.25 0.25 0.75 0.75 1.00 1.00 sample 4 0.50 0.25 0.25 0.00 0.75 1.00 sample 5 0.75 0.75 0.75 0.75 0.00 0.25 sample 6 1.00 1.00 1.00 1.00 0.25 0.00 Table 5.10 Dissimilarity matrix between synthetic samples using Bray-curtis from species composition directly. sample sample sample sample sample sample 1 2 3 4 5 6 sample 1 0.00 0.35 0.60 0.66 0.80 1.00 sample2 sample 3 0.35 0.60 0.00 0.42 0.42 0.00 0.51 0.56 0.84 0.89 1.00 1.00 sample 4 0.66 0.51 0.56 0.00 0.89 1.00 sample 5 0.80 0.84 0.89 0.89 0.00 0.25 sample 6 1.00 1.00 1.00 1.00 0.42 0.00 Table 5.11 Dissimilarity matrix between synthetic samples using Bray-Curtis from sequencing reads using IGS method. method), the dissimilarity matrix can be calculated by using the IGS method, as shown in Table 5.11. 
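As a sketch of how such an IGS-based dissimilarity matrix might be computed, the following code follows the super-bin procedure of Section 5.2.3 and then applies Bray-Curtis. The toy read coverage profiles mirror the example of Figure 5.4 and reproduce the matrix of Table 5.7; dividing the super bin's read count by the summed coverage in the profile is one reasonable reading of that procedure, not necessarily the exact implementation used here:

    from collections import Counter

    def samples_by_igs(read_profiles):
        # read_profiles: one coverage profile per read, pooled over all samples,
        # e.g. (3, 2, 2) = covered 3x in sample A, 2x in B, 2x in C.
        # Reads sharing a profile form a "super bin"; dividing the bin's read
        # count by the summed coverage estimates how many IGSs it represents.
        rows = []
        for profile, n_reads in Counter(read_profiles).items():
            total_cov = sum(profile)
            if total_cov > 0:
                n_igs = max(1, round(n_reads / total_cov))
                rows.extend([list(profile)] * n_igs)
        return rows

    def bray_curtis(u, v):
        # Bray-Curtis dissimilarity between two abundance vectors
        num = sum(abs(a - b) for a, b in zip(u, v))
        den = sum(a + b for a, b in zip(u, v))
        return num / den if den else 0.0

    # toy profiles: 14 reads with profile 3:2:2, 10 with 2:0:3, 9 with 2:4:3
    profiles = [(3, 2, 2)] * 14 + [(2, 0, 3)] * 10 + [(2, 4, 3)] * 9
    matrix = samples_by_igs(profiles)        # 2 + 2 + 1 = 5 IGS rows, as in Table 5.7
    cols = list(zip(*matrix))                # per-sample IGS abundance vectors
    d_ab = bray_curtis(cols[0], cols[1])     # dissimilarity between samples A and B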
We can see that the absolute values in this matrix are not very close to those in the real matrix. However, the relative values correspond to those in the real matrix well enough to show the relative distance between each pair of samples. To get an objective metric, we use the Mantel test [77] to calculate the correlation between the two matrices. The correlation is 0.9714, which indicates a very strong correlation between the two matrices. Thus the dissimilarity matrix from the IGS method reflects the true relationship between samples effectively. If the matrix reflects the real relationship between samples reliably, the clustering and ordination are only routine tasks.

Figure 5.5 and Figure 5.6 show that the IGS method can recover the similarity between samples correctly. Sample 5 and sample 6 are very close to each other in the figures, which matches their species composition.

Figure 5.5 Ordination of the 6 synthetic samples using the IGS method.

Figure 5.6 Clustering of the 6 synthetic samples using the IGS method.

The clustering and ordination are both derived from the dissimilarity matrix, and we think comparing matrices directly makes more sense than comparing clustering and ordination plots. So we will not show clustering and ordination figures in the other evaluations in this section; the Mantel correlation will be used to measure the accuracy of beta diversity analysis. These results show that the IGS method can work well in a simple scenario, with high sequencing depth (10X) and no sequencing error. Next we will check the influence of variable sequencing depth and sequencing error on the accuracy of the analysis, and introduce new ways to preprocess the data to decrease the influence of sequencing error.

5.3.2 Improving the accuracy of this method in real world analysis

Previously we have shown that the IGS method generally works on a simple simulated data set, with high sequencing depth and no sequencing error. In the real world, in many situations we have to deal with metagenomic data sets with relatively low sequencing depth, such as soil or sea water samples. It is also a fact that all sequencing technologies generate some errors. As discussed in the introduction chapter, one of the reasons we developed the IGS method is that we expect it to be less sensitive to sequencing error, because it is based on the abundance of reads rather than of k-mers. However, the effect of those factors on the accuracy should still be observable. In this section, we will analyze the effect of these factors on the accuracy of the IGS method and investigate ways to reduce that effect in order to increase the accuracy of the analysis.

As in the last section, six synthetic samples were generated with the same species composition and the same coverage of 10X, but with different sequencing error rates (0.5%, 1.0%, 1.5%, and 0%, i.e., no error at all). To show the influence of sequencing error on the accuracy of the analysis, we compared the richness estimation using reads with different sequencing error rates, as shown in Figure 5.7. For the data set without error (error rate = 0), the estimated size of the metagenome matches the real size perfectly. With increasing error rate, the size of the metagenome is increasingly over-estimated. This is due to several factors, which will be discussed below. We also checked the beta diversity analysis with different error rates and noticed that the beta diversity results are less sensitive to increasing sequencing error rate (Figure 5.8). We will therefore focus on alpha diversity in the discussion below.
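The Mantel correlations reported in this section can be computed with a simple permutation test; a sketch using numpy (an assumption of ours; any matrix library would do) is:

    import numpy as np

    def mantel(d1, d2, permutations=999, seed=0):
        # Pearson correlation between the upper triangles of two distance
        # matrices, with a permutation p-value from shuffling the second matrix
        d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
        idx = np.triu_indices_from(d1, k=1)
        r_obs = np.corrcoef(d1[idx], d2[idx])[0, 1]
        rng = np.random.default_rng(seed)
        n = d1.shape[0]
        hits = 0
        for _ in range(permutations):
            p = rng.permutation(n)
            if np.corrcoef(d1[idx], d2[p][:, p][idx])[0, 1] >= r_obs:
                hits += 1
        return r_obs, (hits + 1) / (permutations + 1)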
5.3.2.1 The effect of sequencing error on the accuracy of analysis

The first factor to take into account is sequencing error. One sequencing error will generate up to k erroneous k-mers. This is the reason why it is difficult to do diversity analysis with k-mer counting alone: a large proportion of the k-mers in a reads data set are erroneous, especially for low-coverage read data. As discussed in the section about digital normalization, using the median k-mer count to retrieve the coverage of a read is less sensitive to sequencing error, because errors will not always affect the median k-mer count.

Figure 5.7 Richness estimation using IGS method without adjustment.

Take the experiment we did previously as an example: for a read length of 100 bp and k of 19, one sequencing error will affect the counts of at most 19 k-mers, and two sequencing errors will affect the counts of at most 38 k-mers. The counts of these k-mers will generally be much lower than the true count. So out of the 82 k-mers in the 100 bp read, at most 38 k-mers will have incorrectly low counts. However, this will not affect the median k-mer count, which is the count of the 41st k-mer when ranked by count.

However, if there are three or more errors in the read, the situation is more complicated. With 3 errors in a read, between 3 and 57 k-mers will be affected by the errors and have an incorrect count of 1. The probability distribution of the number of affected k-mers can be obtained with a model similar to the Lander-Waterman model used in genome sequencing theory; here we obtained the distribution by simulation, as shown in Figure 5.9. From this probability distribution, we find that the probability that 3 errors affect more than 40 k-mers is 0.43; in this case, the 3 errors will affect the median k-mer count of the read. We can obtain such probabilities for 4 or more errors as well.

Figure 5.8 Beta diversity analysis using IGS method without adjustment.

Combining this with the probability that a certain number of errors occurs in a read at a specific sequencing error rate, which is easy to derive from the binomial distribution, we can get the probability that the coverage of a read is incorrectly assessed as 1. Still using the example here, this probability is the probability that 3 errors occur in a read multiplied by the probability that 3 errors affect the median k-mer count, plus the probability that 4 errors occur in a read multiplied by the probability that 4 errors affect the median k-mer count, and so on.

Generally, let P_error(n, e, L) be the probability that n errors occur in a read of length L with error rate e, and let P_effect(n, k, L) be the probability that n errors in a read of length L affect the median k-mer count. The probability that the coverage of a read is incorrectly assessed as 1 is

    sum over n >= 3 of P_error(n, e, L) * P_effect(n, k, L)

and, by the binomial distribution,

    P_error(n, e, L) = f(n; L, e) = Pr(X = n) = C(L, n) * e^n * (1 - e)^(L - n)

where C(L, n) is the binomial coefficient. Practically, when n > 5 and e < 0.015, P_error(n, e, L) is very small, so we only consider 3, 4 or 5 errors in a read.
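A sketch of this calculation in Python (math.comb gives the binomial coefficient; the P_effect values for 4 and 5 errors below are illustrative placeholders, while 0.43 for 3 errors is the simulated value quoted above for L = 100, k = 19):

    from math import comb

    def p_error(n, e, L):
        # binomial probability of exactly n errors in a read of length L
        return comb(L, n) * e**n * (1 - e)**(L - n)

    # P_effect(n): probability that n errors displace the median k-mer count to 1,
    # obtained by simulation; the n = 4 and 5 values are placeholders for illustration
    p_effect = {3: 0.43, 4: 0.80, 5: 0.95}

    def p_coverage_misassessed_as_one(e, L=100):
        # terms for n > 5 are negligible when e < 1.5%
        return sum(p_error(n, e, L) * p_effect[n] for n in (3, 4, 5))

    n_reads = 1_000_000                 # hypothetical data set size
    expected_bad_reads = n_reads * p_coverage_misassessed_as_one(0.01)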
From the discussion above, sequencing errors incorrectly reduce the estimated coverage of some reads to 1, and the probability that this happens to a read can be estimated. So, to reduce this effect of sequencing error, we can calculate the expected number of affected reads and remove them from the set of reads with coverage 1 before generating the list of IGSs from the read abundance distribution.

Also, we want to make sure that 2 errors in a read will not affect the median k-mer count, since having 2 errors in a read is relatively common in practice. This requires

    2 * k < (L - k + 1) / 2

from which we get, approximately, k < L/5. For L of 100, k would be 19, which is what we chose in the testing. However, k should not be too small, or the k-mers cannot capture the diversity of information in a large data set.

Figure 5.9 Richness estimation using IGS method without adjustment.

Taking sequencing error into account, we used the methods introduced above to adjust the estimated metagenome size of the 6 synthetic samples. The estimation after adjustment is closer to the real number, as shown in Figure 5.10.

5.3.2.2 The effect of Bloom filter size on the accuracy of analysis

As discussed in the chapter about k-mer counting, collisions in the Bloom filter we use for efficient k-mer counting result in counting errors. If the false positive rate for a specific Bloom filter used for k-mer counting is 0.1, then 10% of the k-mers will have incorrect counts. When we use the median k-mer count to get read coverage, such incorrect counts have two effects.

Figure 5.10 Richness estimation using IGS method adjusted by sequencing error rate.

On one hand, some k-mers in a read will have incorrectly high counts. However, if the false positive rate is low, this will not affect the median k-mer count. This shows that the method of using the median k-mer count to get read coverage is not only less sensitive to sequencing error, but also less sensitive to the inaccuracy of the underlying data structure. On the other hand, the inaccurate counting also affects the counts of the erroneous k-mers generated by sequencing error. For example, if 3 errors in a read affect the counts of 43 k-mers, the counts of those 43 k-mers should be 1; however, because of collisions in the Bloom filter and the resulting incorrect k-mer counting, with a false positive rate of 0.1 about 4 of the 43 k-mers will have an inflated count, mostly 2. So the combined effect of sequencing error and Bloom filter collisions is that some reads will be assigned an incorrect coverage of 1 and some an incorrect coverage of 2.

We can get the percentage of total reads that will have such incorrect coverage using a statistical model similar to the one discussed in the last section. Using the same example, when 3 errors occur in a read: if the 3 errors affect 41 to 45 k-mers (with a probability of 0.20), the median k-mer count will be 2, due to collisions in the Bloom filter, while if the 3 errors affect more than 45 k-mers (with a probability of 0.24), the median k-mer count will be 1, purely due to sequencing errors. We repeated the experiment, additionally adjusting the estimation according to the false positive rate of the Bloom filter, and obtained a better estimate, as shown in Figure 5.11.
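A sketch of this combined adjustment, with the per-read probabilities treated as precomputed inputs from the error and false-positive models above:

    def adjust_read_coverage_hist(hist, n_reads, p_cov1, p_cov2):
        # remove the expected number of error-inflated reads from the
        # coverage-1 and coverage-2 bins before converting the histogram
        # into an IGS abundance distribution
        adjusted = dict(hist)
        adjusted[1] = max(0, adjusted.get(1, 0) - n_reads * p_cov1)
        adjusted[2] = max(0, adjusted.get(2, 0) - n_reads * p_cov2)
        return adjusted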
With the adjustment taking sequencing error and Bloom filter collisions into account, as shown in Figure 5.11, the estimated genome size is closer to the real number. With an error rate of 1%, a false positive rate of 0.1, and 10X coverage data, the estimated genome size is about 20-25% larger than the real number. However, the estimate still increases with higher error rates, which means there are other factors influencing the accuracy of the estimation that we have not yet taken into account.

Figure 5.11 Richness estimation using IGS method adjusted by sequencing error rate and false positive rate of Bloom filter (estimated size of metagenome versus error rate (%) for the six synthetic samples).

5.3.3 The effect of sequencing depth on the accuracy of the IGS method

We have shown that the IGS method can generate good results from relatively high-coverage data (such as 10X). It is expected that the higher the coverage of the data, the more accurate the analysis. However, for many metagenomics projects, especially environmental samples, it is difficult to obtain high enough sequencing depth. We therefore investigated the effect of sequencing depth on the accuracy of the IGS method.

Figure 5.12 shows how well the matrix calculated from data sets with variable coverage reflects the real relationship between samples. As expected, higher-coverage data yields a more accurate distance matrix. Note that even with a coverage as low as 0.1X, the correlation is 0.89. This gives us a hint about how reliable the result will be if we only use a small proportion of the data from a large metagenomic data set. So beta diversity analysis using the IGS method is not only less prone to sequencing error, but also less prone to low sequencing depth.

Figure 5.12 Correlation between calculated distance matrix and true matrix from data sets with different sequencing depths (correlation versus coverage, for error rates of 0%, 0.5% and 1%).

Figures 5.13, 5.14 and 5.15 show the estimated genome size from data sets with variable coverage and different error rates. Interestingly, the estimated genome size is very high at extremely low coverage. This is probably due to the limits of the statistical model in estimating the total size of information from limited observed information; after all, only a small proportion of the genomes in the sample are covered by reads. We again see the pattern that a higher error rate reduces the accuracy of the genome size estimate, especially when the coverage is low. However, for error rates from 0% to 1%, as long as the coverage is higher than 1X, the estimate of genome size becomes stable. It is important to point out that even though the absolute value of the estimated genome size may be overestimated, the relationship between samples is reliable, as shown in the figures. Samples 3, 4, 5 and 6 all have 4 species, while sample 2 has 3 species and sample 1 has 2 species, and they can be separated effectively. The estimated genome size does not increase much with increasing coverage, even for the data set with an error rate of 1%, which indicates that the adjustment method discussed previously eliminates most of the detrimental effect of sequencing errors.
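The correlations in Figure 5.12 (and in Figure 5.16 below) compare a calculated sample-by-sample distance matrix against the gold-standard matrix. The exact statistic is not specified in the text; the sketch below assumes a Mantel-style Pearson correlation over the off-diagonal entries.

```python
import numpy as np
from scipy.stats import pearsonr

def matrix_correlation(calculated, gold_standard):
    """Pearson correlation between the upper-triangle (off-diagonal) entries of a
    calculated sample-by-sample distance matrix and the gold-standard matrix."""
    gold = np.asarray(gold_standard, dtype=float)
    iu = np.triu_indices_from(gold, k=1)
    r, _ = pearsonr(np.asarray(calculated, dtype=float)[iu], gold[iu])
    return r
```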
That being said, it is still beneficial to preprocess the data to reduce the error rate. If the error rate can be reduced from 1% to 0.5%, the estimated genome size will be more accurate. This again demonstrates the importance of the streaming error profile analysis discussed in Chapter 4 above.

Figure 5.13 Estimated genome size from data sets with variable coverage, without error.

Figure 5.14 Estimated genome size from data sets with variable coverage, with error rate as 0.5%.

Figure 5.15 Estimated genome size from data sets with variable coverage, with error rate as 1.0%.

5.3.4 Comparing the IGS method to COMMET in beta diversity analysis

Next we tested how well the matrices calculated by various methods reflect the real relationship between samples. COMMET [75], the successor of Compareads [76], is one of the few software packages for comparing metagenomes. It is based on counting reads shared between metagenomes: the higher the percentage of reads shared by two metagenomes, the more similar the two metagenomes are inferred to be. Basically, this is a straightforward abundance-based method to evaluate similarity.

Simulated data sets with sequencing depths of 0.1X and 10X, with 1% sequencing error and without sequencing error, were used in this experiment. These data sets have the same species composition as those used in the previous experiments. As shown in Figure 5.16, for all data sets the matrix from the IGS method has a higher correlation with the gold standard than that from COMMET. As expected, the matrices from data sets with sequencing error have a lower correlation than those from error-free data sets. COMMET is more sensitive to sequencing error than the IGS method, for both high-coverage and low-coverage data. Also, higher coverage yields a more accurate matrix, which is not surprising. In the experiments below with real metagenomic data sets, we will see more evidence that the IGS method performs better than some other metagenome comparison methods.

Figure 5.16 Correlation between calculated distance matrix and true distance matrix from different data sets and using different methods (COMMET versus IGS, at coverage 0.1X/10X and error rate 0%/1%).

5.3.5 The IGS method provides a whole framework for alpha and beta diversity, with good versatility

From the testing on simulated data sets shown here, we are confident that the IGS method works well and can give reliable results from data sets with sequencing error and low sequencing depth. The IGS method provides a whole framework for both alpha and beta diversity analysis.
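For example, once the IGS-by-samples abundance table is built, standard ecological dissimilarity metrics and clustering can be applied to it directly. A minimal sketch: Bray-Curtis is the metric used in the experiments below, while average linkage and the toy random table are illustrative assumptions, not necessarily what was used for the figures.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

def igs_dissimilarity(igs_table, metric="braycurtis"):
    """Sample-by-sample dissimilarity matrix from an IGS-by-samples abundance table.
    igs_table: array of shape (n_samples, n_igs); any pdist metric can be substituted."""
    condensed = pdist(np.asarray(igs_table, dtype=float), metric=metric)
    return squareform(condensed), condensed

# hierarchical clustering of the samples from the condensed distances (toy data)
matrix, condensed = igs_dissimilarity(np.random.poisson(5, size=(6, 1000)))
tree = linkage(condensed, method="average")
```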
Here we tested beta diversity using only the Bray-Curtis metric, and alpha diversity using richness only. In fact, any standard metric can be applied to the IGS-by-samples table. The other software package for metagenome comparison, Compareads/COMMET, which is based on read overlap between samples, can also produce a matrix reflecting the real relationship between samples. However, it is restricted to a single metric based on the percentage of overlapping reads between samples; this metric is similar to Bray-Curtis, but not exactly the same.

5.4 Applying the IGS method to real metagenome data sets

Having shown that the IGS method delivers good results about microbial diversity on simulated synthetic data sets, we now evaluate the method on several published metagenomic data sets, including samples from the ocean, the human microbiome and soil. For the ocean and human microbiome data sets, we compare the results from the IGS method with those from the original publications. For the soil samples, since no other diversity analysis has been conducted on these data sets, we show the results we obtained from the IGS method and try to interpret their ecological meaning.

5.4.1 GOS data sets: Sorcerer II Global Ocean Sampling Expedition

We tested the IGS method on a well known public data set from the Sorcerer II Global Ocean Sampling expedition. During the expedition, 44 water samples were collected from different locations across the Atlantic and Pacific Oceans and were sequenced using Sanger technology. The whole data set is composed of 7 million reads from the 44 samples. In the original research, a whole-metagenome comparison of the samples was done using a sequence alignment method. The IGS method took only several hours on the MSU HPC to generate the dissimilarity matrix of the samples, thanks to the scalability and distributability of the IGS based method.

After clustering, Figure 5.17 shows that, consistent with the original study, the samples cluster according to their geographical origin. The yellow group contains samples from Tropical - Galapagos. The light purple group contains samples from Tropical - Open Ocean. The dark purple group contains samples from Sargasso. The green group contains samples from Temperate. If we compare the clustering we obtained from the IGS method with the clustering in the original study, the IGS method yields a clustering more consonant with the sample origins than the method used in the original study. For example, in the original study, samples 14, 21 and 22 from Tropical - Galapagos are separated from the other Tropical - Galapagos samples, while in Figure 5.17 they are grouped together. Also, samples 00a, 00b, 00c and 00d, all from the same location, are grouped together in our result, while in the original research sample 00a is separated from the other three samples. Compared with the clustering generated using Compareads, our method is comparable, with some distinct differences. For example, sample 16 is clustered together with 15, 17, 18 and 19 in our result; however, in the Compareads result, sample 16 is clustered with 23 and 26, in contradiction to the geographical origin of the samples.

Next we used the IGS method to analyze alpha diversity. Figure 5.18 shows the rarefaction curve of IGSs of the samples. As expected, we do not see saturation, which means the sequencing data set is still far from deeply covering the samples.
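The richness estimates reported next use the Chao1 estimator applied to the IGS abundance distribution of each sample. A minimal sketch; the bias-corrected form of Chao1 is shown, and the dissertation does not state which variant was actually used.

```python
def chao1(igs_abundances):
    """Chao1 richness estimate from per-IGS abundances (how many times each IGS was
    observed in a sample). f1 and f2 are the numbers of IGSs seen once and twice."""
    s_obs = len(igs_abundances)
    f1 = sum(1 for a in igs_abundances if a == 1)
    f2 = sum(1 for a in igs_abundances if a == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))  # bias-corrected Chao1

print(chao1([1, 1, 1, 2, 2, 3, 5, 8]))  # toy example with 8 observed IGSs
```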
Because the data sets for the different samples have dramatically different sizes, we estimated the total number of IGSs using the Chao1 estimator on a limited number of reads from each sample (50,000), to make sure the smallest data set has enough reads for comparison, as shown in Figure 5.19. We see that the richness of the samples is related to their geographical origin: the samples from tropical areas have a higher richness than the samples from more northern areas. The relationship between samples is consistent with the clustering in the beta diversity analysis shown above. As discussed in the section above about alpha diversity analysis of synthetic data, the total number of IGSs may be over-estimated, but the relative relationship between samples in richness should be reliable. (This is not discussed in the original research on the GOS samples.)

Figure 5.17 Clustering of Global Ocean Sampling Expedition samples using IGS method.

Figure 5.18 Rarefaction curve of IGSs of Global Ocean Sampling Expedition samples.

Figure 5.19 Estimated number of IGSs of Global Ocean Sampling Expedition samples.

5.4.2 Human Microbiome Project (HMP) metagenomics data set

We tested the IGS method on 12 HMP (Human Microbiome Project) samples from different body parts, including skin, oral and vaginal sites. Principal coordinates analysis (Figure 5.20) shows that the samples are separated well by the body part from which they were collected. The rarefaction curve and estimated number of IGSs show that the richness of the samples is also related to the body part: the oral samples have higher richness than the skin or vaginal samples, which is consistent with other research [56].

Figure 5.20 Principal coordinates analysis of 12 Human Microbiome Project samples, red: anterior nares - skin, green: throat - oral, blue: buccal mucosa - oral, orange: posterior fornix - vaginal.

Figure 5.21 Alpha diversity of 12 Human Microbiome Project samples: estimation of metagenome size of HMP samples, red: anterior nares - skin, green: throat - oral, blue: buccal mucosa - oral, orange: posterior fornix - vaginal.

5.4.3 GPGC - Great Prairie Soil Metagenome Grand Challenge

Having tested the IGS method on two relatively smaller metagenomic data sets, we now use it to analyze a larger data set of 8 soil samples collected from fields with different treatments and different locations across the Great Prairie region in the US (Table 5.12).

sample                  # of reads       size of .gz file   # of bps           ave. length
iowa corn               1,514,290,825    46G                144,202,427,079    95.2
iowa prairie            2,597,093,273    74G                226,815,059,143    87.3
kansas corn             2,029,883,371    66G                206,933,829,048    101.9
kansas prairie          4,987,358,734    145G               499,387,223,498    100.3
wisconsin corn          1,616,440,116    51G                162,257,698,471    100.4
wisconsin prairie       1,653,557,590    53G                166,467,901,724    100.7
wisconsin restored      226,830,595      11G                34,241,520,930     151.0
wisconsin switchgrass   310,966,735      13G                40,259,619,921     129.5

Table 5.12 GPGC data sets

As discussed above using simulated data sets, read data sets with lower sequencing coverage reduce the accuracy of the analysis. However, as shown in Figure 5.12, at a sequencing depth of 0.1X the distance matrix calculated using the IGS method still has a reasonably high correlation with the gold-standard distance matrix. So we can use a subset of a large data set to acquire the diversity information, with the trade-off of lower accuracy. For the GPGC data sets, we made a subset of 2 million reads from each sample and conducted the diversity analysis using the IGS method.

Principal coordinates analysis (Figure 5.22) shows that the samples are separated well by the location where they were collected. This indicates that geographical origin plays a more important part in determining the similarity of the genomic composition of the samples than the different treatments do.
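The ordinations in this section (Figures 5.20, 5.22, 5.24 and 5.25) are principal coordinates analyses of the IGS based dissimilarity matrices. Below is a minimal, self-contained sketch of classical PCoA using the textbook double-centering construction; it is given as an illustration rather than as the exact implementation used for the figures.

```python
import numpy as np

def pcoa(distance_matrix, n_axes=2):
    """Classical principal coordinates analysis (metric MDS): double-center the
    squared sample-by-sample distances and take the leading eigenvectors."""
    d = np.asarray(distance_matrix, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (d ** 2) @ centering   # Gower's double-centered matrix
    evals, evecs = np.linalg.eigh(gower)
    order = np.argsort(evals)[::-1]                   # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    # scale eigenvectors by sqrt of (non-negative) eigenvalues to get coordinates
    return evecs[:, :n_axes] * np.sqrt(np.maximum(evals[:n_axes], 0.0))
```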
Figure 5.23 shows the rarefaction curve and estimated number of IGSs of the samples. Basically, the "corn" and "switchgrass" samples have higher richness than the "restored" and "prairie" samples. The observation that cultivation increases the richness of soil is consistent with the intermediate disturbance hypothesis [140]: the disturbance from treatments like cultivation opens more niches, while the stable communities in prairie eliminate some populations by the principle of competitive exclusion. It is harder to explain the ranking by state. The Kansas site experiences more drought stress and higher temperatures. The Iowa and Wisconsin sites experience more cold, especially freezing conditions that arrest their biology for 3-4 months. However, the freeze-thaw cycles also kill off some populations each cycle, which is similar to intermediate disturbance; the new growth each spring would be dominated by fast growers and be less diverse. Why Iowa is the least diverse is still difficult to explain for now. From the alpha diversity analysis, we also have a rough estimate of the total size of the metagenome in the Iowa soil, about 540G base pairs. This demonstrates the high complexity of soil samples; we still need considerably more sequencing effort to achieve reasonably high coverage.

Figure 5.22 Principal coordinates analysis of 8 Great Prairie Soil Metagenome Grand Challenge (GPGC) samples.

Figure 5.23 Alpha diversity analysis of 8 GPGC samples. Upper left, rarefaction curve of IGSs. Upper right, estimated number of IGSs in different samples. Lower left, estimated number of IGSs in samples grouped by location (Iowa, Kansas and Wisconsin). Lower right, estimated number of IGSs in samples grouped by treatment (corn, prairie, restored, switchgrass).

5.4.4 More soil metagenomic samples

Additionally, we tested the IGS method on two other unpublished data sets. One is a series of soil samples collected from KBS with different treatments; Figure 5.24 shows that the IGS method separates these samples well by treatment. The other data set is a series of soil samples from the Amazon rainforest. These samples are also separated well by treatment (Figure 5.25), and it is also apparent that the samples from forest have lower richness than those from prairie (Figure 5.26).

Figure 5.24 Principal coordinates analysis of soil samples with different treatments collected from Kellogg Biological Station (KBS). Red, corn. Blue, miscanthus. Brown, switchgrass.

Figure 5.25 Principal coordinates analysis of soil samples collected from Amazon rainforest. Red, forest samples. Blue, prairie samples.

5.5 Data

5.5.1 Code availability

The algorithms of the IGS based diversity analysis are implemented in the khmer software package, written in C++ and Python, available at github.com/ged-lab/khmer/. khmer also relies on the screed package for loading sequences, available at github.com/ged-lab/screed/. khmer and screed are Copyright (c) 2010 Michigan State University, and are free software available for distribution, modification, and redistribution under the BSD license.
The code and detailed instructions used to generate all the results in this chapter are available at http://github.com/ged-lab/2013-diversity/.

Figure 5.26 Estimated number of IGSs in metagenomic data from soil samples collected from Amazon rainforest, grouped by treatment. Red, forest samples. Blue, prairie samples.

Chapter 6 Conclusion

We have developed a novel statistical framework to enable microbial diversity analysis using whole genome shotgun metagenomic read data without the requirement of assembly, binning, reference sequences or annotation. This dissertation began with an overview of existing approaches to microbial diversity analysis of metagenomic samples, especially those based on the concept of OTU, including the steps in the procedure such as contig binning and the statistical analysis of OTU abundance information to estimate microbial diversity. Next, the statistical framework based on the novel concept of IGS was discussed. As the foundation of the framework, we described a novel method to count k-mers efficiently and a scalable approach to retrieve the coverage of a read in a data set based on efficient, online k-mer counting. We also introduced applications of this approach in reducing the redundancy of metagenomic read data sets and in analyzing sequencing error, which is beneficial to other tasks in metagenomic data analysis, like assembly or error trimming. Finally, we discussed how we developed the concept of IGS on top of the efficient k-mer counting and digital normalization methods described before. The application of IGS to analyzing the microbial diversity of metagenomic data sets was discussed, and the performance of the IGS method on simulated and real data sets was demonstrated in the final chapter. In this chapter, we summarize how the novel statistical framework based on IGS makes a difference to diversity analysis in current microbial ecology research. Finally, some directions for future work are discussed.

6.1 Summary

Diversity analysis is a key part of microbial ecology research, as it is of macroorganism ecology. However, due to the ambiguous definition of the term "species" in microbial ecology, we can virtually never measure the diversity of species directly; rather, we use other taxonomic concepts such as the operational taxonomic unit (OTU) to evaluate the diversity of a microbial community. 16S rRNA sequencing reads may be classified into different OTUs, and shotgun whole genome sequencing reads can also be classified into OTUs. But most, if not all, of the existing methods based on the concept of OTU rely heavily on preprocessing of the original read data in some way, like assembly, or on external information, like reference sequences for annotation. Neither prerequisite is satisfied for many metagenomic projects. For metagenomic data sets with low sequencing coverage, the assembly process skews the analysis by including primarily the most abundant organisms; sequences that are rare are not assembled into contigs and are therefore not included in the contig analysis. It is common that only a small proportion of reads can be used in assembly, especially for a complex environmental sample [51]. The reference sequence databases are far from complete, especially for microbes in environmental samples from soil or seawater, and this paucity of reference databases limits the ability to identify the functional capacity of microbes whose traits cannot be identified.
Thus, applying these methods can only capture an incomplete picture of the diversity of the microbial community in a metagenomic sequencing data set. The IGS based method discussed in this dissertation offers a novel framework that overcomes the limitations of assembly, binning, or annotation, without requiring a reference sequence database. It can take advantage of all the information in the metagenomic read data and gain a full picture of the diversity of the microbial community. Importantly, this is a new framework with the concept of IGS, instead of OTU, as the taxonomic unit for analyzing microbial diversity. Thus, this framework can be used to perform all the diversity analyses that an OTU-based framework can do. Moreover, it is a more thorough approach than many other methods developed to solve only specific problems in the field of diversity analysis. For example, there are several methods developed to estimate the species richness of a metagenomic sample [108]. Our IGS based framework can not only estimate the species richness or size of a metagenome, as shown in the sections above, but can also estimate the evenness or species abundance distribution of a metagenomic sample, which is another important aspect of alpha diversity analysis. For beta diversity, or compositional similarity analysis between metagenomic samples, there are several methods developed to compare samples based on read mapping or counting shared reads [107]. However, they only estimate abundance-based similarity, similar to the Bray-Curtis indices used in the experiments discussed above. It should be noted that the IGS-based framework can also be used to estimate incidence-based similarity, which cannot be estimated using other existing approaches.

Besides the potential for broad application of the IGS based framework, it is also efficient and highly scalable, and can handle extremely large metagenomic sequencing data sets. We have discussed the efficiency of the novel k-mer counting method and of the subsequent digital normalization method, with the ability to retrieve the coverage of a read accurately and efficiently. We also performed a thorough analysis of the effect of the size of the underlying data structure on accuracy. We can take advantage of the probabilistic characteristics of the data structure to make a trade-off between expected analysis accuracy and expected usage of computational resources. In this way, we make the analysis highly scalable, to keep pace with the increasing size of metagenomic sequencing data.
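As a rough illustration of this accuracy-versus-memory trade-off, the expected collision (false positive) rate of a Bloom-filter-style counting structure can be approximated from the number of distinct k-mers and the table sizes. The formula below is the standard Bloom filter approximation, given as an illustration rather than khmer's exact accounting; the example numbers are hypothetical.

```python
from math import exp

def expected_fp_rate(n_distinct_kmers, table_size, n_tables):
    """Standard Bloom-filter approximation of the collision (false positive) rate
    for n_tables independent tables of table_size entries each."""
    return (1.0 - exp(-n_distinct_kmers / table_size)) ** n_tables

# e.g. 1 billion distinct k-mers counted in 4 tables of 2 billion entries each
print(expected_fp_rate(1e9, 2e9, 4))  # ~0.02
```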
In addition, we examined the effect of sequencing depth on the accuracy of estimating microbial diversity. As expected, using more reads, that is, a data set with higher sequencing depth, increases the accuracy of diversity estimation. For similarity analysis between samples (beta diversity), a data set with relatively low sequencing depth can still produce decent results, as shown in the experiments with synthetic data and real soil data sets. However, for alpha diversity, such as richness estimation, using a data set with lower sequencing depth results in a diversity estimate more distant from the real number. Although the absolute value of the species richness of a sample is not accurate, the relative comparison of species richness between samples is less sensitive to smaller read data sets with lower sequencing depth. These results suggest that, for a specific purpose, a subset of a large metagenomic read data set can be enough to achieve a reasonably satisfying result. Under certain circumstances, this feature is quite helpful and can reduce the computational expense dramatically.

6.2 Future work

Though this dissertation demonstrated the performance of the new approach to analyzing microbial diversity using whole genome shotgun sequencing data without the requirement of assembly, binning, or annotation, there is still plenty of room for improvement.

The primary questions in the development of the IGS based method were how many species there are in a sample and how similar the samples are to each other, mostly focusing on the quantitative aspect. Admittedly these are important questions for microbial ecologists, but they are also curious about qualitative aspects, such as what drives the differences between samples and, eventually, their functional potential [143]. Thus, a natural expansion of the IGS based framework will focus on answering these questions. Now that we have an efficient and scalable approach to obtain the coverage of a read in a sample, it is straightforward to extract reads according to their coverage profile across samples, so we can get a subset of reads with specific properties, like the reads that are common to all samples. In this way we may collect these "common" reads across the samples and try to co-assemble them, since they should now have higher coverage. Or we can get a subset of reads that are common in one group of samples but absent from another group, like samples from patients and from healthy persons. These "signature" reads may offer important insights for understanding what happens to the microbial community while the environment changes. Admittedly, these kinds of extraction can be implemented using other methods, like read alignment; however, those may not be as efficient and scalable as the IGS based method, especially for extremely large metagenomic data.

One advantage of the IGS based method is that binning is not required in this procedure. Firstly, binning is traditionally used to classify contigs after read assembly; the similarity based binning methods rely on sequence alignment, which is inefficient, even infeasible, for large metagenomic data. Secondly, reference sequences are normally required for similarity based binning. The composition-based approach relies on the frequency profile of sequence signatures and a machine learning approach applied to that profile, which is computationally expensive. A third approach, based on the coverage profile across samples, was developed recently [2, 60, 3, 94, 58]; mostly, the coverage profile is used in combination with the composition frequency profile to classify contigs. The assumption on which the coverage profile based binning approaches rest, that contigs with similar coverage profiles across samples are more likely to be from the same microbial species, is similar to the assumption on which using IGS for beta diversity rests, namely that IGSs with similar coverage profiles across samples are likely to be from the same microbial organism. Thus, it is promising to classify the IGSs by their coverage profile across samples.
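A sketch of what such binning of IGSs by coverage profile could look like follows; this is not an implemented part of the framework, and k-means, the number of bins, and the normalization to relative profiles are all illustrative assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def bin_igs_by_coverage_profile(igs_by_sample, n_bins=50):
    """Cluster IGSs by the shape of their coverage profile across samples.
    igs_by_sample: array of shape (n_igs, n_samples) with the per-sample coverage of
    each IGS; profiles are normalized so that clustering groups IGSs whose coverage
    varies across samples in a similar way."""
    profiles = np.asarray(igs_by_sample, dtype=float)
    profiles = profiles / profiles.sum(axis=1, keepdims=True)
    _, labels = kmeans2(profiles, n_bins, minit='++')
    return labels  # tentative bin assignment for each IGS
```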
We have already overcome the challenge of retrieving the coverage profile efficiently, based on the probabilistic data structure, while in the coverage profile based binning approaches the coverage profile is normally retrieved by assembling contigs and mapping reads back to them, which both requires higher-coverage reads for the assembly and is computationally expensive. There are two obstacles to overcome in this coverage profile based IGS binning approach. First, with a relatively small number of samples, the resolution will be limited, since the total number of distinct coverage profiles will be limited; this is probably why most of the coverage profile based contig binning methods also have to integrate composition profile information. Second, on the other hand, if there are a large number of samples, there will be too many distinct coverage profiles. We can use a more sophisticated approach to classify the coverage profiles and reduce the number of bins, as in the coverage profile based contig binning methods. Another approach worthy of note is the method termed partitioning, developed in our group as a divide and conquer approach to scale metagenome assembly. It can be considered a binning approach as well, where the reads in the same partition are more likely to originate from the same microbial organism. We can try to integrate partitioning and the IGS coverage profile to improve the accuracy of the binning. In summary, this would be one of the first attempts to bin reads directly. After the IGS/read binning, we expect to do better assembly and annotation and gain more knowledge about function and phylogeny.

We have shown that, after adjustment according to the sequencing error rate and the collision rate of the Bloom filter, the estimated size of the metagenome is close to the real number for synthetic data sets. However, the difference between the estimate and the real number still increases with higher error rates, which means there are other factors affecting the accuracy of the estimation. This is worthy of further investigation. The size estimation of a metagenome is extremely important in metagenomic data analysis, and it is closely related to estimating how much more sequencing effort is required to gain enough sequencing depth. As shown in the results, we are confident that the relative relationship between the richness of different samples obtained from the IGS based alpha diversity analysis is reliable. How accurate the absolute value of the richness or metagenome size is for real data requires further investigation, and a new statistical model may be needed to adapt to the abundance distribution of IGSs. Furthermore, any information about the richness of a sample is beneficial to the optimal choice of parameters for digital normalization.

Our efforts to examine the effect of sequencing depth on the accuracy of beta diversity analysis reveal that using a relatively small subset of the whole data set may produce reasonably good results showing the separation of samples after clustering or ordination. However, how good the separation is seems to be related to the characteristics of the samples and cannot be determined easily before starting the analysis. Thus, a potential approach is to estimate beta diversity in an iterative way. We already know that using more data yields more accurate analysis, or better separation, for the purpose of comparing metagenomic samples.
In such an iterative procedure, the pattern of separation can be monitored as more reads are loaded into the analysis, and the procedure can be stopped as soon as the separation of samples is significant enough. This way, we may save a great deal of computational cost and still have enough information about the relationship between samples.

BIBLIOGRAPHY

[1] Afiahayati, K. Sato, and Y. Sakakibara. Metavelvet-sl: an extension of the velvet assembler to a de novo metagenomic assembler utilizing supervised learning. DNA Res, 22(1):69–77, Feb 2015. [2] M. Albertsen, P. Hugenholtz, A. Skarshewski, K. L. Nielsen, G. W. Tyson, and P. H. Nielsen. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat Biotechnol, 31(6):533–8, Jun 2013. [3] J. Alneberg, B. S. Bjarnason, I. de Bruijn, M. Schirmer, J. Quick, U. Z. Ijaz, L. Lahti, N. J. Loman, A. F. Andersson, and C. Quince. Binning metagenomic contigs by coverage and composition. Nat Methods, 11(11):1144–6, Nov 2014. [4] P. Audano and F. Vannberg. Kanalyze: A fast versatile pipelined k-mer toolkit. Bioinformatics: Advance Access published March 18, 2014, page doi: 10.1093/bioinformatics/btu152, Mar 2014. [5] B. J. M. Bohannan and J. Hughes. New approaches to analyzing microbial biodiversity data. Curr Opin Microbiol, 6(3):282–7, Jun 2003. [6] M. Bonaldo, G. Lennon, and M. Soares. Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res, 6(9):791–806, 1996. [7] A. Brady and S. Salzberg. Phymmbl expanded: confidence scores, custom databases, parallelization and more. Nat Methods, 8(5):367, May 2011. [8] A. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey. Internet mathematics, 1(4):485–509, 2004. [9] C. T. Brown. What does trinity's in silico normalization do? 2012. [10] C. T. Brown, A. Howe, Q. Zhang, A. B. Pyrkosz, and T. H. Brom. A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv preprint, 03 2012. [11] J. Bunge. Estimating the number of species with catchall. Pac Symp Biocomput, pages 121–30, 2011. [12] J. Butler, I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, and D. B. Jaffe. Allpaths: de novo assembly of whole-genome shotgun microreads. Genome Res, 18(5):810–20, May 2008. [13] J. G. Caporaso, J. Kuczynski, J. Stombaugh, K. Bittinger, F. D. Bushman, E. K. Costello, N. Fierer, A. G. Peña, J. K. Goodrich, J. I. Gordon, G. A. Huttley, S. T. Kelley, D. Knights, J. E. Koenig, R. E. Ley, C. A. Lozupone, D. McDonald, B. D. Muegge, M. Pirrung, J. Reeder, J. R. Sevinsky, P. J. Turnbaugh, W. A. Walters, J. Widmann, T. Yatsunenko, J. Zaneveld, and R. Knight. Qiime allows analysis of high-throughput community sequencing data. Nat Methods, 7(5):335–6, May 2010. [14] M. Chaisson, P. Pevzner, and H. Tang. Fragment assembly with short reads. Bioinformatics, 20(13):2067–74, 2004. [15] A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of statistics, pages 265–270, 1984. [16] A. Chao. Estimating the population size for capture-recapture data with unequal catchability. Biometrics, 43(4):783–91, Dec 1987. [17] A. Chao and M. C. Yang. Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika, 80(1):193–201, 1993. [18] R. Chikhi and P. Medvedev. Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1):31–7, Jan 2014. [19] H. Chitsaz, J.
Yee-Greenbaum, G. Tesler, M. Lombardo, C. Dupont, J. Badger, M. Novotny, D. Rusch, L. Fraser, N. Gormley, O. Schulz-Trieglaff, G. Smith, D. Evers, P. Pevzner, and R. Lasken. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol, 29(10):915–21, 2011. [20] H. Chitsaz, J. L. Yee-Greenbaum, G. Tesler, M. J. Lombardo, C. L. Dupont, J. H. Badger, M. Novotny, D. B. Rusch, L. J. Fraser, N. A. Gormley, O. Schulz-Trieglaff, G. P. Smith, D. J. Evers, P. A. Pevzner, and R. S. Lasken. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat. Biotechnol., 29(10):915– 921, Oct 2011. [21] R. Colwell. Estimates: statistical estimation of species richness and shared species from samples. 2004. Consultado en: http://viceroy. eeb. uconn. edu/estimates. [22] R. K. Colwell, C. X. Mao, and J. Chang. Interpolating, extrapolating, and comparing incidence-based species accumulation curves. Ecology, 85(10):2717–2727, 2004. 128 [23] P. Compeau, P. Pevzner, and G. Tesler. How to apply de bruijn graphs to genome assembly. Nat Biotechnol, 29(11):987–91, 2011. [24] T. Conway and A. Bromage. Succinct data structures for assembling large genomes. Bioinformatics, 27(4):479–86, 2011. [25] G. Cormode and S. Muthukrishnan. An improved data stream summary: the countmin sketch, and its applications. Journal of Algorithms, 2004. [26] G. Cormode and S. Muthukrishnan. An improved data stream summary: the countmin sketch and its applications. Journal of Algorithms, 55(1):58–75, Apr. 2005. [27] G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, pages 44–55, 2005. [28] M. Crusoe, G. Edvenson, J. Fish, A. Howe, E. McDonald, J. Nahum, K. Nanlohy, H. Ortiz-Zuazaga, J. Pell, J. Simpson, C. Scott, R. Srinivasan, Q. Zhang, and C. T. Brown. The khmer software package: enabling efficient sequence analysis, 2014. [29] T. P. Curtis, W. T. Sloan, and J. W. Scannell. Estimating prokaryotic diversity and its limits. Proc Natl Acad Sci U S A, 99(16):10494–9, Aug 2002. [30] S. Deorowicz, A. Debudaj-Grabysz, and S. Grabowski. Disk-based k-mer counting on a pc. BMC Bioinformatics, 14(1):160, May 2013. [31] J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res., 36(16):e105, Sep 2008. [32] J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank, P. Baybayan, B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark, R. Dalal, A. Dewinter, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. Heiner, K. Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. Lin, P. Lundquist, C. Ma, P. Marks, M. Maxham, D. Murphy, I. Park, T. Pham, M. Phillips, J. Roy, R. Sebra, G. Shen, J. Sorenson, A. Tomaney, K. Travers, M. Trulson, J. Vieceli, J. Wegener, D. Wu, A. Yang, D. Zaccarin, P. Zhao, F. Zhong, J. Korlach, and S. Turner. Real-time DNA sequencing from single polymerase molecules. Science, 323(5910):133– 138, Jan 2009. [33] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. Theor. Comput. Sci., 348(2):207–216, Dec. 2005. 129 ´ Fusy, O. Gandouet, and F. Meunier. Hyperloglog: the analysis of a [34] P. Flajolet, E. near-optimal cardinality estimation algorithm. DMTCS Proceedings, (1), 2008. [35] J. Gans, M. Wolinsky, and J. Dunbar. Computational improvements reveal great bacterial diversity and high metal toxicity in soil. 
Science, 309(5739):1387–90, Aug 2005. [36] J. A. Gilbert and C. L. Dupont. Microbial metagenomics: beyond the genome. Ann Rev Mar Sci, 3:347–71, 2011. [37] J. A. Gilbert, F. Meyer, D. Antonopoulos, P. Balaji, C. T. Brown, C. T. Brown, N. Desai, J. A. Eisen, D. Evers, D. Field, W. Feng, D. Huson, J. Jansson, R. Knight, J. Knight, E. Kolker, K. Konstantindis, J. Kostka, N. Kyrpides, R. Mackelprang, A. McHardy, C. Quince, J. Raes, A. Sczyrba, A. Shade, and R. Stevens. Meeting report: the terabase metagenomics workshop and the vision of an earth microbiome project. Stand Genomic Sci, 3(3):243–8, 2010. [38] E. M. Glass, J. Wilkening, A. Wilke, D. Antonopoulos, and F. Meyer. Using the metagenomics rast server (mg-rast) for analyzing shotgun metagenomes. Cold Spring Harb Protoc, 2010(1):pdb.prot5368, Jan 2010. [39] S. Gnerre, I. Maccallum, D. Przybylski, F. Ribeiro, J. Burton, B. Walker, T. Sharpe, G. Hall, T. Shea, S. Sykes, A. Berlin, D. Aird, M. Costello, R. Daza, L. Williams, R. Nicol, A. Gnirke, C. Nusbaum, E. Lander, and D. Jaffe. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A, 108(4):1513–8, 2011. [40] N. J. Gotelli and R. K. Colwell. Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecology letters, 4(4):379–391, 2001. [41] M. Grabherr, B. Haas, M. Yassour, J. Levin, D. Thompson, I. Amit, X. Adiconis, L. Fan, R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A. Gnirke, N. Rhind, F. di Palma, B. Birren, C. Nusbaum, K. Lindblad-Toh, N. Friedman, and A. Regev. Full-length transcriptome assembly from rna-seq data without a reference genome. Nat Biotechnol, 29(7):644–52, 2011. [42] M. G. Grabherr, B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X. Adiconis, L. Fan, R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A. Gnirke, N. Rhind, F. di Palma, B. W. Birren, C. Nusbaum, K. Lindblad-Toh, N. Friedman, and A. Regev. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol., 29(7):644–652, Jul 2011. 130 [43] X. Guo, N. Yu, X. Ding, J. Wang, and Y. Pan. Dime: a novel framework for de novo metagenomic sequence assembly. J Comput Biol, 22(2):159–77, Feb 2015. [44] A. Gurevich, V. Saveliev, N. Vyahhi, and G. Tesler. Quast: quality assessment tool for genome assemblies. Bioinformatics, 29(8):1072–5, Apr 2013. [45] B. J. Haas, A. Papanicolaou, M. Yassour, M. Grabherr, P. D. Blood, J. Bowden, M. B. Couger, D. Eccles, B. Li, M. Lieber, M. D. Macmanes, M. Ott, J. Orvis, N. Pochet, F. Strozzi, N. Weeks, R. Westerman, T. William, C. N. Dewey, R. Henschel, R. D. Leduc, N. Friedman, and A. Regev. De novo transcript sequence reconstruction from rna-seq using the trinity platform for reference generation and analysis. Nat Protoc, 8(8):1494–512, Aug 2013. [46] B. Haegeman, J. Hamelin, J. Moriarty, P. Neal, J. Dushoff, and J. S. Weitz. Robust estimation of microbial diversity in theory and in practice. ISME J, 7(6):1092–101, Jun 2013. [47] B. Haider, T.-H. Ahn, B. Bushnell, J. Chai, A. Copeland, and C. Pan. Omega: an overlap-graph de novo assembler for metagenomics. Bioinformatics, 30(19):2717–22, Oct 2014. [48] J. Handelsman, M. R. Rondon, S. F. Brady, J. Clardy, and R. M. Goodman. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol, 5(10):R245–9, Oct 1998. [49] M. Hess, A. Sczyrba, R. Egan, T.-W. Kim, H. Chokhawala, G. Schroth, S. Luo, D. S. Clark, F. 
Chen, T. Zhang, R. I. Mackie, L. A. Pennacchio, S. G. Tringe, A. Visel, T. Woyke, Z. Wang, and E. M. Rubin. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science, 331(6016):463–7, Jan 2011. [50] M. O. Hill. Diversity and evenness: a unifying notation and its consequences. Ecology, 54(2):427–432, 1973. [51] A. C. Howe, J. K. Jansson, S. A. Malfatti, S. G. Tringe, J. M. Tiedje, and C. T. Brown. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci U S A, 111(13):4904–9, Apr 2014. [52] A. C. Howe, J. K. Jansson, S. A. Malfatti, S. G. Tringe, J. M. Tiedje, and C. T. Brown. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci U S A, 111(13):4904–9, Apr 2014. 131 [53] A. C. Howe, J. Pell, R. Canino-Koning, R. Mackelprang, S. Tringe, J. Jansson, J. M. Tiedje, and C. T. Brown. Illumina sequencing artifacts revealed by connectivity analysis of metagenomic datasets. PLoS ONE, -. [54] S. Howorka, S. Cheley, and H. Bayley. Sequence-specific detection of individual DNA strands using engineered nanopores. Nat. Biotechnol., 19(7):636–639, Jul 2001. [55] Human Microbiome Jumpstart Reference Strains Consortium, K. E. Nelson, G. M. Weinstock, S. K. Highlander, K. C. Worley, H. H. Creasy, J. R. Wortman, D. B. Rusch, M. Mitreva, E. Sodergren, A. T. Chinwalla, M. Feldgarden, D. Gevers, B. J. Haas, R. Madupu, D. V. Ward, B. W. Birren, R. A. Gibbs, B. Methe, J. F. Petrosino, R. L. Strausberg, G. G. Sutton, O. R. White, R. K. Wilson, S. Durkin, M. G. Giglio, S. Gujja, C. Howarth, C. D. Kodira, N. Kyrpides, T. Mehta, D. M. Muzny, M. Pearson, K. Pepin, A. Pati, X. Qin, C. Yandava, Q. Zeng, L. Zhang, A. M. Berlin, L. Chen, T. A. Hepburn, J. Johnson, J. McCorrison, J. Miller, P. Minx, C. Nusbaum, C. Russ, S. M. Sykes, C. M. Tomlinson, S. Young, W. C. Warren, J. Badger, J. Crabtree, V. M. Markowitz, J. Orvis, A. Cree, S. Ferriera, L. L. Fulton, R. S. Fulton, M. Gillis, L. D. Hemphill, V. Joshi, C. Kovar, M. Torralba, K. A. Wetterstrand, A. Abouellleil, A. M. Wollam, C. J. Buhay, Y. Ding, S. Dugan, M. G. FitzGerald, M. Holder, J. Hostetler, S. W. Clifton, E. Allen-Vercoe, A. M. Earl, C. N. Farmer, K. Liolios, M. G. Surette, Q. Xu, C. Pohl, K. Wilczek-Boney, and D. Zhu. A catalog of reference genomes from the human microbiome. Science, 328(5981):994–9, May 2010. [56] Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature, 486(7402):207–14, Jun 2012. [57] D. H. Huson, A. F. Auch, J. Qi, and S. C. Schuster. Megan analysis of metagenomic data. Genome Res, 17(3):377–86, Mar 2007. [58] M. Imelfort, D. Parks, B. J. Woodcroft, P. Dennis, P. Hugenholtz, and G. W. Tyson. Groopm: an automated tool for the recovery of population genomes from related metagenomes. PeerJ, 2:e603, 2014. [59] Z. Iqbal, M. Caccamo, I. Turner, P. Flicek, and G. McVean. De novo assembly and genotyping of variants using colored de bruijn graphs. Nat Genet, 44(2):226–32, Feb 2012. [60] F. H. Karlsson, V. Tremaroli, I. Nookaew, G. Bergstr¨om, C. J. Behre, B. Fagerberg, J. Nielsen, and F. B¨ackhed. Gut metagenome in european women with normal, impaired and diabetic glucose control. Nature, 498(7452):99–103, Jun 2013. 132 [61] A. L. Kau, P. P. Ahern, N. W. Griffin, A. L. Goodman, and J. I. Gordon. Human nutrition, the gut microbiome and the immune system. Nature, 474(7351):327–36, Jun 2011. [62] K. P. Keegan, W. L. Trimble, J. Wilkening, A. Wilke, T. Harrison, M. D’Souza, and F. Meyer. 
A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE. PLoS Comput. Biol., 8(6):e1002541, 2012. [63] D. R. Kelley, M. C. Schatz, and S. L. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome Biol, 11(11):R116, 2010. [64] S. Kurtz, A. Narechania, J. C. Stein, and D. Ware. A new method to compute kmer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics, 9:517, 2008. [65] E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231–239, Apr 1988. [66] S.-M. Lee and A. Chao. Estimating population size via sample coverage for closed capture-recapture models. Biometrics, pages 88–97, 1994. [67] D. Li, C.-M. Liu, R. Luo, K. Sadakane, and T.-W. Lam. Megahit: an ultra-fast singlenode solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics, Jan 2015. [68] R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, S. Li, H. Yang, J. Wang, and J. Wang. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res, 20(2):265–72, Feb 2010. [69] X. Li and M. S. Waterman. Estimating the repeat structure and length of dna sequences using l-tuples. Genome Res, 13(8):1916–22, Aug 2003. [70] Y. Li, Y. Hu, L. Bolund, and J. Wang. State of the art de novo assembly of human genomes from massively parallel sequencing data. Hum Genomics, 4(4):271–7, 2010. [71] M. Loreau, S. Naeem, P. Inchausti, J. Bengtsson, J. P. Grime, A. Hector, D. U. Hooper, M. A. Huston, D. Raffaelli, B. Schmid, D. Tilman, and D. A. Wardle. Biodiversity and ecosystem functioning: current knowledge and future challenges. Science, 294(5543):804–8, Oct 2001. [72] E. K. Lowe, B. Swalla, and C. Brown. Evaluating a lightweight transcriptome assembly pipeline on two closely related ascidian species. PeerJ Preprints, 2, 2014. 133 [73] W. Luo, M. S. Friedman, K. Shedden, K. D. Hankenson, and P. J. Woolf. Gage: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics, 10:161, 2009. [74] A. E. Magurran and B. J. McGill. Biological diversity: frontiers in measurement and assessment, volume 12. Oxford University Press Oxford, 2011. [75] N. Maillet, G. Collet, T. Vannier, D. Lavenier, and P. Peterlongo. Commet: Comparing and combining multiple metagenomic datasets. In H. J. Zheng, W. Dubitzky, X. Hu, J. Hao, D. P. Berrar, K. Cho, Y. Wang, and D. R. Gilbert, editors, 2014 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2014, Belfast, United Kingdom, November 2-5, 2014, pages 94–98. IEEE, 2014. [76] N. Maillet, C. Lemaitre, R. Chikhi, D. Lavenier, and P. Peterlongo. Compareads: comparing huge metagenomic experiments. BMC Bioinformatics, 13 Suppl 19:S10, 2012. [77] N. Mantel. The detection of disease clustering and a generalized regression approach. Cancer Res, 27(2):209–20, Feb 1967. [78] G. Mar¸cais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770, 2011. [79] V. M. Markowitz, I.-M. A. Chen, K. Chu, E. Szeto, K. Palaniappan, Y. Grechkin, A. Ratner, B. Jacob, A. Pati, M. Huntemann, K. Liolios, I. Pagani, I. Anderson, K. Mavromatis, N. N. Ivanova, and N. C. Kyrpides. Img/m: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res, 40(Database issue):D123–9, Jan 2012. [80] O. U. Mason, T. C. Hazen, S. Borglin, P. S. G. Chain, E. 
A. Dubinsky, J. L. Fortney, J. Han, H.-Y. N. Holman, J. Hultman, R. Lamendella, R. Mackelprang, S. Malfatti, L. M. Tom, S. G. Tringe, T. Woyke, J. Zhou, E. M. Rubin, and J. K. Jansson. Metagenome, metatranscriptome and single-cell sequencing reveal microbial response to deepwater horizon oil spill. ISME J, 6(9):1715–27, Sep 2012. [81] R. M. May. How many species are there on earth? Science, 241(4872):1441–9, Sep 1988. [82] E. McDonald and C. T. Brown. Working with big data in bioinformatics. In T. Armstrong, editor, The Performance of Open Source Applications, chapter 12, page 151. lulu.com, 2013. 134 [83] P. Medvedev, E. Scott, B. Kakaradov, and P. Pevzner. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics, 27(13):i137–41, Jul 2011. [84] P. Melsted and B. V. Halldorsson. KmerStream: streaming algorithms for k-mer abundance estimation. Bioinformatics, 30(24):3541–3547, Dec 2014. [85] P. Melsted and J. K. Pritchard. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC bioinformatics, 12:333, Jan. 2011. [86] M. Metzker. Sequencing technologies - the next generation. Nat Rev Genet, 11(1):31– 46, 2010. [87] J. Miller, S. Koren, and G. Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315–27, 2010. [88] A. E. Minoche, J. C. Dohm, and H. Himmelbauer. Evaluation of genomic highthroughput sequencing data generated on illumina hiseq and genome analyzer systems. Genome Biol, 12(11):R112, 2011. [89] C. Mora, D. P. Tittensor, S. Adl, A. G. B. Simpson, and B. Worm. How many species are there on earth and in the ocean? PLoS Biol, 9(8):e1001127, Aug 2011. [90] X. C. Morgan, T. L. Tickle, H. Sokol, D. Gevers, K. L. Devaney, D. V. Ward, J. A. Reyes, S. A. Shah, N. LeLeiko, S. B. Snapper, A. Bousvaros, J. Korzenik, B. E. Sands, R. J. Xavier, and C. Huttenhower. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol, 13(9):R79, 2012. [91] E. K. Morris, T. Caruso, F. Buscot, M. Fischer, C. Hancock, T. S. Maier, T. Meiners, C. M¨ uller, E. Obermaier, D. Prati, S. A. Socher, I. Sonnemann, N. W¨aschke, T. Wubet, S. Wurst, and M. C. Rillig. Choosing and using diversity indices: insights for ecological applications from the german biodiversity exploratories. Ecol Evol, 4(18):3514–24, Sep 2014. [92] S. Muthukrishnan. Data streams: algorithms and applications. Foundations and trends in theoretical computer science. Now Publishers, 2005. [93] T. Namiki, T. Hachiya, H. Tanaka, and Y. Sakakibara. Metavelvet: an extension of velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res, 40(20):e155, Nov 2012. 135 [94] H. B. Nielsen, M. Almeida, A. S. Juncker, S. Rasmussen, J. Li, S. Sunagawa, D. R. Plichta, L. Gautier, A. G. Pedersen, E. Le Chatelier, E. Pelletier, I. Bonde, T. Nielsen, C. Manichanh, M. Arumugam, J.-M. Batto, M. B. Quintanilha Dos Santos, N. Blom, N. Borruel, K. S. Burgdorf, F. Boumezbeur, F. Casellas, J. Dor´e, P. Dworzynski, F. Guarner, T. Hansen, F. Hildebrand, R. S. Kaas, S. Kennedy, K. Kristiansen, J. R. Kultima, P. L´eonard, F. Levenez, O. Lund, B. Moumen, D. Le Paslier, N. Pons, O. Pedersen, E. Prifti, J. Qin, J. Raes, S. Sørensen, J. Tap, S. Tims, D. W. Ussery, T. Yamada, MetaHIT Consortium, P. Renault, T. Sicheritz-Ponten, P. Bork, J. Wang, S. Brunak, S. D. Ehrlich, and MetaHIT Consortium. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. 
Nat Biotechnol, 32(8):822–8, Aug 2014. [95] K. R. Patil, L. Roune, and A. C. McHardy. The phylopythias web server for taxonomic assignment of metagenome sequences. PLoS One, 7(6):e38581, 2012. [96] J. Pell, A. Hintze, R. Canino-Koning, A. Howe, J. M. Tiedje, and C. T. Brown. Scaling metagenome sequence assembly with probabilistic de bruijn graphs. Proc Natl Acad Sci U S A, 109(33):13272–7, Aug 2012. [97] Y. Peng, H. Leung, S. Yiu, and F. Chin. Meta-idba: a de novo assembler for metagenomic data. Bioinformatics, 27(13):i94–101, 2011. [98] F. P´erez and B. Granger. Ipython: A system for interactive scientific computing. Computing in Science Engineering, 9(3):21–29, 2007. [99] P. A. Pevzner, H. Tang, and M. S. Waterman. An eulerian path approach to dna fragment assembly. Proc Natl Acad Sci U S A, 98(17):9748–53, Aug 2001. [100] J. Pickrell, A. Pai, Y. Gilad, and J. Pritchard. Noisy splicing drives mrna isoform diversity in human cells. PLoS Genet, 6(12):e1001236, 2010. [101] A. Pinho, D. Pratas, and S. Garcia. Green: a tool for efficient compression of genome resequencing data. Nucleic Acids Res, 40(4):e27, 2012. [102] J. Qin, R. Li, J. Raes, M. Arumugam, K. S. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, T. Yamada, D. R. Mende, J. Li, J. Xu, S. Li, D. Li, J. Cao, B. Wang, H. Liang, H. Zheng, Y. Xie, J. Tap, P. Lepage, M. Bertalan, J.-M. Batto, T. Hansen, D. Le Paslier, A. Linneberg, H. B. Nielsen, E. Pelletier, P. Renault, T. Sicheritz-Ponten, K. Turner, H. Zhu, C. Yu, S. Li, M. Jian, Y. Zhou, Y. Li, X. Zhang, S. Li, N. Qin, H. Yang, J. Wang, S. Brunak, J. Dor´e, F. Guarner, K. Kristiansen, O. Pedersen, J. Parkhill, J. Weissenbach, MetaHIT Consortium, P. Bork, 136 S. D. Ehrlich, and J. Wang. A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464(7285):59–65, Mar 2010. [103] J. Qin, R. Li, J. Raes, M. Arumugam, K. S. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, T. Yamada, D. R. Mende, J. Li, J. Xu, S. S. S. Li, D. Li, J. Cao, B. Wang, H. Liang, H. Zheng, Y. Xie, J. Tap, P. Lepage, M. Bertalan, J.-M. Batto, T. Hansen, D. Le Paslier, A. Linneberg, H. B. r. Nielsen, E. Pelletier, P. Renault, T. Sicheritz-Ponten, K. Turner, H. Zhu, C. Yu, M. Jian, Y. Zhou, Y. Li, X. Zhang, N. Qin, H. Yang, J. J. Wang, S. r. Brunak, J. Dor´e, F. Guarner, K. Kristiansen, O. Pedersen, J. Parkhill, J. Weissenbach, P. Bork, and S. D. Ehrlich. A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464(7285):59–65, 2010. [104] J. Qin, Y. Li, Z. Cai, S. Li, J. Zhu, F. Zhang, S. Liang, W. Zhang, Y. Guan, D. Shen, Y. Peng, D. Zhang, Z. Jie, W. Wu, Y. Qin, W. Xue, J. Li, L. Han, D. Lu, P. Wu, Y. Dai, X. Sun, Z. Li, A. Tang, S. Zhong, X. Li, W. Chen, R. Xu, M. Wang, Q. Feng, M. Gong, J. Yu, Y. Zhang, M. Zhang, T. Hansen, G. Sanchez, J. Raes, G. Falony, S. Okuda, M. Almeida, E. LeChatelier, P. Renault, N. Pons, J.-M. Batto, Z. Zhang, H. Chen, R. Yang, W. Zheng, S. Li, H. Yang, J. Wang, S. D. Ehrlich, R. Nielsen, O. Pedersen, K. Kristiansen, and J. Wang. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature, 490(7418):55–60, Oct 2012. [105] C. Quince, T. P. Curtis, and W. T. Sloan. The rational exploration of microbial diversity. ISME J, 2(10):997–1006, Oct 2008. [106] G. Rizk, D. Lavenier, and R. Chikhi. Dsk: k-mer counting with very low memory usage. Bioinformatics, 29(5):652–3, Mar 2013. [107] L. M. Rodriguez-R and K. T. Konstantinidis. 
Nonpareil: A redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics, Oct 2013. [108] L. M. Rodriguez-R and K. T. Konstantinidis. Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics, 30(5):629–35, Mar 2014. [109] L. F. W. Roesch, R. R. Fulthorpe, A. Riva, G. Casella, A. K. M. Hadwin, A. D. Kent, S. H. Daroub, F. A. O. Camargo, W. G. Farmerie, and E. W. Triplett. Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J, 1(4):283–90, Aug 2007. [110] G. L. Rosen, E. R. Reichenberger, and A. M. Rosenfeld. Nbc: the naive bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics, 27(1):127–9, Jan 2011. 137 [111] R. S. Roy, D. Bhattacharya, and A. Schliep. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics: Advance Access published March 10, 2014, page doi: 10.1093/bioinformatics/btu132, Apr 2014. [112] F. Rusu and A. Dobra. Sketches for size of join estimation. ACM Transactions on Database Systems, 33(3):1–46, Aug. 2008. [113] D. C. Savage. Microbial ecology of the gastrointestinal tract. Annu Rev Microbiol, 31:107–33, 1977. [114] P. D. Schloss and J. Handelsman. Introducing dotur, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol, 71(3):1501–6, Mar 2005. [115] P. D. Schloss and J. Handelsman. Toward a census of bacteria in soil. PLoS Comput Biol, 2(7):e92, Jul 2006. [116] P. D. Schloss, S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann, E. B. Hollister, R. A. Lesniewski, B. B. Oakley, D. H. Parks, C. J. Robinson, J. W. Sahl, B. Stres, G. G. Thallinger, D. J. Van Horn, and C. F. Weber. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 75(23):7537–41, Dec 2009. [117] T. M. Schmidt, E. F. DeLong, and N. R. Pace. Analysis of a marine picoplankton community by 16s rrna gene cloning and sequencing. J Bacteriol, 173(14):4371–8, Jul 1991. [118] M. Schulz, D. Zerbino, M. Vingron, and E. Birney. Oases: robust de novo rna-seq assembly across the dynamic range of expression levels. Bioinformatics, 28(8):1086– 92, 2012. [119] N. Segata, L. Waldron, A. Ballarini, V. Narasimhan, O. Jousson, and C. Huttenhower. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods, 9(8):811–4, Aug 2012. [120] M. Shakya, C. Quince, J. H. Campbell, Z. K. Yang, C. W. Schadt, and M. Podar. Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities. Environ. Microbiol., 15(6):1882–1899, Jun 2013. [121] C. E. Shannon. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55, 2001. 138 [122] E. H. Simpson. Measurement of diversity. Nature, 1949. [123] J. Simpson and R. Durbin. Efficient construction of an assembly string graph using the fm-index. Bioinformatics, 26(12):i367–73, 2010. [124] J. Simpson and R. Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome Res, 22(3):549–56, 2012. [125] J. Simpson, K. Wong, S. Jackman, J. Schein, S. Jones, and I. Birol. Abyss: a parallel assembler for short read sequence data. Genome Res, 19(6):1117–23, 2009. [126] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones, and I. Birol. 
[127] M. Soares, M. Bonaldo, P. Jelene, L. Su, L. Lawton, and A. Efstratiadis. Construction and characterization of a normalized cDNA library. Proc Natl Acad Sci U S A, 91(20):9228–32, 1994.
[128] M. L. Sogin, H. G. Morrison, J. A. Huber, D. Mark Welch, S. M. Huse, P. R. Neal, J. M. Arrieta, and G. J. Herndl. Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci U S A, 103(32):12115–20, Aug 2006.
[129] L. Song, L. Florea, and B. Langmead. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol, 15(11):509, 2014.
[130] C. Spits, C. L. Caignec, M. D. Rycke, L. V. Haute, A. V. Steirteghem, I. Liebaers, and K. Sermon. Whole-genome multiple displacement amplification from single cells. Nat Protoc, 1(4):1965–70, 2006.
[131] E. Stackebrandt, W. Frederiksen, G. M. Garrity, P. A. D. Grimont, P. Kämpfer, M. C. J. Maiden, X. Nesme, R. Rosselló-Mora, J. Swings, H. G. Trüper, L. Vauterin, A. C. Ward, and W. B. Whitman. Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol, 52(Pt 3):1043–7, May 2002.
[132] L. Stein. The case for cloud computing in genome informatics. Genome Biol, 11(5):207, 2010.
[133] H. Teeling, J. Waldmann, T. Lombardot, M. Bauer, and F. O. Glöckner. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5:163, Oct 2004.
[134] P. J. Turnbaugh, M. Hamady, T. Yatsunenko, B. L. Cantarel, A. Duncan, R. E. Ley, M. L. Sogin, W. J. Jones, B. A. Roe, J. P. Affourtit, M. Egholm, B. Henrissat, A. C. Heath, R. Knight, and J. I. Gordon. A core gut microbiome in obese and lean twins. Nature, 457(7228):480–4, Jan 2009.
[135] G. W. Tyson, J. Chapman, P. Hugenholtz, E. E. Allen, R. J. Ram, P. M. Richardson, V. V. Solovyev, E. M. Rubin, D. S. Rokhsar, and J. F. Banfield. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428(6978):37–43, Mar 2004.
[136] S. Tzahor, D. Man-Aharonovich, B. C. Kirkup, T. Yogev, I. Berman-Frank, M. F. Polz, O. Béjà, and Y. Mandel-Gutfreund. A supervised learning approach for taxonomic classification of core-photosystem-II genes and transcripts in the marine environment. BMC Genomics, 10:229, 2009.
[137] J. C. Venter, K. Remington, J. F. Heidelberg, A. L. Halpern, D. Rusch, J. A. Eisen, D. Wu, I. Paulsen, K. E. Nelson, W. Nelson, D. E. Fouts, S. Levy, A. H. Knap, M. W. Lomas, K. Nealson, O. White, J. Peterson, J. Hoffman, R. Parsons, H. Baden-Tillson, C. Pfannkoch, Y.-H. Rogers, and H. O. Smith. Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304(5667):66–74, Apr 2004.
[138] Q. Wang, G. M. Garrity, J. M. Tiedje, and J. R. Cole. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol, 73(16):5261–7, Aug 2007.
[139] W. B. Whitman, D. C. Coleman, and W. J. Wiebe. Prokaryotes: the unseen majority. Proc Natl Acad Sci U S A, 95(12):6578–83, Jun 1998.
[140] D. M. Wilkinson. The disturbing history of intermediate disturbance. Oikos, pages 145–147, 1999.
[141] D. Wu, P. Hugenholtz, K. Mavromatis, R. Pukall, E. Dalin, N. N. Ivanova, V. Kunin, L. Goodwin, M. Wu, B. J. Tindall, S. D. Hooper, A. Pati, A. Lykidis, S. Spring, I. J. Anderson, P. D’haeseleer, A. Zemla, M. Singer, A. Lapidus, M. Nolan, A. Copeland, C. Han, F. Chen, J.-F. Cheng, S. Lucas, C. Kerfeld, E. Lang, S. Gronow, P. Chain, D. Bruce, E. M. Rubin, N. C. Kyrpides, H.-P. Klenk, and J. A. Eisen. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature, 462(7276):1056–60, Dec 2009.
[142] M. Wu and A. J. Scott. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics, 28(7):1033–4, Apr 2012.
[143] Z. Xu, D. Malmer, M. G. I. Langille, S. F. Way, and R. Knight. Which is more important for classifying microbial communities: who’s there or what they can do? ISME J, 8(12):2357–9, Dec 2014.
[144] I. Zarraonaindia, D. P. Smith, and J. A. Gilbert. Beyond the genome: community-level analysis of the microbial world. Biol Philos, 28(2):261–282, Mar 2013.
[145] D. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 18(5):821–9, 2008.
[146] D. R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 18(5):821–9, May 2008.
[147] Q. Zhang, S. Awad, and C. T. Brown. Crossing the streams: a framework for streaming analysis of short DNA sequencing reads. PeerJ PrePrints, https://peerj.com/preprints/890/, 2015.
[148] Q. Zhang, J. Pell, R. Canino-Koning, A. C. Howe, and C. T. Brown. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS ONE, 9(7):e101271, 2014.