Michigan State University

This is to certify that the dissertation entitled GENOMIC INSIGHTS INTO ECOLOGICALLY IMPORTANT QUESTIONS FOR SOIL BACTERIA, presented by Konstantinos T. Konstantinidis, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Crop and Soil Sciences.

GENOMIC INSIGHTS INTO ECOLOGICALLY IMPORTANT QUESTIONS FOR SOIL BACTERIA

By Konstantinos T. Konstantinidis

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
Department of Crop and Soil Sciences
2004

ABSTRACT

Diverse prokaryotic species are the principal catalysts of the biogeochemical cycles that sustain life on Earth; however, fundamental issues that define that diversity remain unresolved, such as the extent and patterns of that diversity at the genome level, the interplay between genome evolution and ecological niche, and what defines a prokaryotic species in a manner that is predictive of phenotype. I compared 175 prokaryotic whole-genome sequences to address these questions. Although prokaryotic genomes vary from 0.6 Mb to over 10 Mb and show enormous gene sequence diversity, several universal trends were noted that reflect the cellular and ecological strategies used by this simple but successful life form.
For instance, large prokaryotic genomes, contrary to their eukaryotic counterparts, do not accumulate non-coding DNA or hypothetical genes, and they are disproportionately enriched in regulation and secondary metabolism genes compared to medium- and small-sized genomes. These trends suggest that species with larger genomes may dominate in environments where resources are scarce but diverse and where there is little penalty for slow growth, such as soil. The genetic and functional diversity among highly related genomes was examined more closely to better understand the breadth and origin of biodiversity within a species and to use this insight to advance the current definition of species for Prokaryotes. Strains of the same species vary by up to 30% in gene content, and a large fraction of these gene differences, up to 65%, is frequently associated with bacteriophage and transposase elements, indicating a much more important role for these elements in bacterial speciation than previously thought. Additional analysis suggests that a more stringent definition of species, which should also consider the ecology of the strain, is both more appropriate and plausible. Extending the approach to the taxonomic ranks above species revealed many irregularities in the current classification scheme for the 175 genomes used in this study. Consequently, the predictive power of the higher taxonomic ranks in terms of genetic relatedness among the grouped organisms is currently rather poor. To provide for more extensive, robust, genome-based analysis of nature's vast microbial diversity, I explored a high-throughput approach that uses microarrays for comparative genome hybridizations. To this end, the performance of different microarray platforms was evaluated by comparing the expected (in silico) microarray results to results from comparisons of whole-genome sequences (control).
Oligo-arrays, i.e., one 50-base probe per gene, were found to perform comparably to whole-ORF arrays as long as the evaluated strains belong to the same or highly related species, whereas whole-ORF arrays should perform better for more distantly related strains. In all cases tested, the non-specific hybridization signal was found to be substantial and could lead to misleading results if not taken into account, but it can also be used to indicate potential gene duplications. My results have important implications for our understanding of the basis for and value of prokaryotic biodiversity and have broader impacts, such as for reliable diagnosis of plant and animal disease agents, intellectual property rights, quarantine, and (inter-)national regulations for transport and possession of microbes.

Copyright by KONSTANTINOS T. KONSTANTINIDIS 2004

This dissertation is dedicated to the memory of my mother Sophia, who passed away in the middle of my studies but never stopped inspiring my character, personality and, hence, my work.

ACKNOWLEDGMENTS

I would like to thank several individuals for helping me during the course of my graduate studies. Most importantly, I thank my advisor, Professor James Tiedje, for providing me with the creative freedom and intellectual guidance that fostered my growth as a scientist. In addition, I thank him for exposing me to the academic world through writing grants, participating in conferences, and improving my communication skills, and for giving me the chance to meet and learn from the experts in my field. Last, special thanks for always being supportive during my difficult personal moments. I wish to extend my thanks to the remaining members of my PhD guidance committee, Professors Terence Marsh, Syed Hashsham, and Michael Thomashow, as well as Professor Tomas Schmidt of MSU, for their long-term commitment to my academic progress and their contributions towards the development of my critical thinking. I also feel indebted to Dr.
John Heidelberg of The Institute for Genomic Research (TIGR) for stimulating my interest and training me in the area of comparative genomics, which is the subject of this dissertation. I began this journey alone five years ago, a long way from my home, Thessaloniki, Greece. It was only the love, encouragement, and patience of my sister Maria, brother Yiannis, and many friends that converted this arduous journey into a wonderful experience. I would like to thank all of them for this, and particularly Kiki, Demertis, Alban, Joel, Hector, and Elias. Last, I want to acknowledge the funding I received from the Bouyoukos Fellowship Program, an endowment of the late Professor George Bouyoukos. Without this support, it would not have been possible for me to begin and complete this journey.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1. AN INTRODUCTION INTO PROKARYOTIC DIVERSITY AND GENOMICS
  INTRODUCTION
  BACKGROUND
    Prokaryotic biochemical and genetic diversity
    How many prokaryotic species are there and what is a “species”?
    Whole-genome sequencing and diversity of prokaryotic genomes
    Genome structure and its relation to the ecological niche
    Biases in the collection of sequenced species
  THESIS OUTLINE
  REFERENCES
CHAPTER 2. TRENDS BETWEEN GENE CONTENT AND GENOME SIZE IN PROKARYOTIC SPECIES
  INTRODUCTION
  MATERIAL AND METHODS
    Functional annotation of all sequenced prokaryotic genomes
  RESULTS AND DISCUSSION
    Data normalization
    Major trends with genome size
    Minor trends with genome size
    Non-coding DNA and hypothetical CDS
    Factors other than genome size
    Results from KEGG, TIGR databases and JGI's high draft genomes
    Bacteria vs. Archaea
    What is gained with a large genome?
    A hypothesis for large genomes
  A CASE STUDY: THE BURKHOLDERIA CEPACIA COMPLEX
    Background on Burkholderia cepacia complex
    Genomic comparisons among the Bcc genomes
    Chromosomal biases in terms of genetic diversity
  ACKNOWLEDGMENTS
  REFERENCES
CHAPTER 3.
GENOMIC INSIGHTS THAT ADVANCE THE SPECIES CONCEPT FOR PROKARYOTES
  INTRODUCTION
  MATERIAL AND METHODS
    Determination of conserved genes and evolutionary relatedness
    Determination of DNA homology and 16S rRNA gene sequence identity
    CDS functional annotation and intergenic regions
  RESULTS AND DISCUSSION
    Conserved gene core and genetic diversity within a species
    The current species definition appears to be too liberal
    What is an ecotype?
    Functional biases in the strain-specific gene set
  OUTLOOK
  ACKNOWLEDGMENTS
  REFERENCES
CHAPTER 4. TOWARDS A GENOME-BASED TAXONOMY FOR PROKARYOTES
  INTRODUCTION
  MATERIAL AND METHODS
    Determination of conserved genes and genetic relatedness
    Taxonomic information
    Phylogenetic analysis and sequence divergence
  RESULTS AND DISCUSSION
    Average amino acid identity is a robust measurement of relatedness
    Evaluation of the taxonomic ranks in terms of genetic relatedness
    Evaluation of alternative markers to 16S rRNA for phylogenetic purposes
  PERSPECTIVE
  ACKNOWLEDGMENTS
  REFERENCES
CHAPTER 5. IN-SILICO MODELING OF DNA-MICROARRAY PERFORMANCE FOR GENOMOTYPING BACTERIAL STRAINS
  INTRODUCTION
  MATERIAL AND METHODS
    Microarray false negatives
    Pair-wise whole genome comparisons
    cDNA vs. oligo arrays
    Probe design
    Non-specific signal
  RESULTS
    Predicted microarray performance
    Importance of microarray false negatives
    Non-specific signal
    cDNA vs. oligo arrays
  DISCUSSION
  ACKNOWLEDGMENTS
  REFERENCES
CHAPTER 6. THESIS SUMMARY AND PERSPECTIVES FOR THE FUTURE
  REFERENCES
APPENDIX: TABLES OF GENOMES USED IN THIS STUDY AND THEIR GENOMIC INFORMATION

LIST OF FIGURES

Figure 1.1. Distribution of uncultivated vs. cultivated 16S rRNA gene sequences for each bacterial phylum. Data were collected from 65,872 sequences deposited in the Ribosomal Database Project (RDP) database, as of April 2003. The classification is based on an annotation from GenBank and was provided courtesy of Ryan Farris and James R. Cole of RDP.

Figure 1.2. Genome size distribution of the fully sequenced prokaryotic genomes (A) and the number of published prokaryotic genomes per year (B). Data were retrieved from NCBI and included all 115 prokaryotic genomes available at the end of 2003.

Figure 2.1. COG functional categories that showed universal correlation with total CDS in the genome. Y-axes are the percent of CDS in the genome attributable to a specific COG category (graph title) and X-axes are the total CDS in the genome for each of the 99 sequenced bacterial genomes. Solid squares represent genomes that had a reasonable number of genes with homologs in the COG database, whereas open squares represent genomes that had either too many or too few genes with homologs in the database (outliers). Trendlines and R² shown are for the solid squares.
Archaeal genomes were not included because Archaea had significantly different genomic fractions from Bacteria in many functional categories.

Figure 2.2. COG functional categories that showed no correlation with genome size. These categories showed no correlation with genome size (at a P value threshold of 0.01) for one or both of the sets of species tested (i.e., all solid squares and solid squares with > 2,000 CDS). Only datapoints representing bacterial genomes are shown, because Archaea had significantly different genomic fractions in many of the categories shown.

Figure 2.3. Correlation among total number of CDS in the genome, non-coding DNA, and genome size for prokaryotic genomes. (A) The total number of CDS in the genome vs. the genome size for 115 completed prokaryotic genomes. (B) The total amount of non-coding DNA in the genome vs. genome size.

Figure 2.4. ABC transporter genes proportionately increase with genome size. Y-axis is the number of genes attributable to ABC transporter functions, and x-axis is the total CDS in the genome for each of the 99 fully sequenced bacterial genomes. Genomes that have disproportionately increased or decreased their number of ABC transporter genes are denoted on the graph.

Figure 2.5. Evidence for functional biases with genome size from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and The Institute for Genomic Research (TIGR) annotation databases. Y-axes are the genome portions (CDS attributable to a functional category divided by total number of CDS in the genome) devoted to the specific functional category, and x-axes are the corresponding microbial genome sizes. Solid and open squares are used as previously for COG data (Figure 2.1).
Corresponding functional categories between the two databases are placed next to each other.

Figure 2.6. Differences between Archaea and Bacteria in the relative usage of the genome. Bars represent the average from 34 bacterial and 12 archaeal genomes, which have between 1,500 and 3,500 CDS (to avoid any genome size effect on the data). Only normalized genomes have been included (see text). Averages are statistically different by a two-tailed t test, assuming unequal variances, at the 0.05 significance level. Functional categories that had <2% of the genes in the genome are not shown.

Figure 2.7. Summary of the shifts in gene content with genome size in prokaryotic genomes. The bars represent the sum of the COG functional categories that showed strong correlation with genome size and are involved in the same major cellular processes. Only normalized genomes (represented by solid squares in Figure 2.1) have been included. Error bars represent the standard deviation from the mean, except for the last genome size class, where error bars represent the data range due to the small number of normalized genomes in this class (three genomes).

Figure 2.8. The Burkholderia cepacia complex and its relationship to other Burkholderia spp. 16S rRNA phylogenetic tree (based on the neighbour-joining method) showing the phylogenetic relationships of Bcc and other Burkholderia and Ralstonia species. Species in bold are sequenced or are currently being sequenced.

Figure 2.9. Venn diagram showing the gene complements of the currently available Bcc genomes. Conserved genes were defined by whole-genome pairwise sequence comparisons, using the BLAST algorithm (1) with a cut-off of 30% identity (a.a. level) over at least 70% of the length of the query CDS.
Parentheses denote the fraction of the strain-specific genes that has unknown function.

Figure 2.10. Functional annotation of the conserved gene core and the strain-specific genes for the three sequenced Bcc genomes. Bars represent the number of genes assignable to the four major classes (full description on x-axis) and the individual categories of the COG database (single-letter description on x-axis; for annotation of the letters see Table 2.1). (A) Solid bars represent the conserved gene core between the three available Bcc genomes, while open bars represent the average from all genomes available in GenBank that have a comparable number of protein-coding genes (4-5,000) to the conserved core of the sequenced Bcc genomes. Panels B, C, and D show the annotation of the strain-specific genes for the J2315, ATCC 17760, and G4 genomes, respectively. Designations for each functional category have been omitted from the x-axes for simplicity.

Figure 2.11. Biases in the amount of genetic diversity carried by each chromosome and the GC% composition of the genes that differ between Bcc genomes. (A) Striped bars represent the percent of genes in each chromosome of strain J2315 that are assignable to the COG database, while the remaining bars show what fraction of the genes in each chromosome is conserved in the other available Burkholderia genomes (graph legend). Centered open squares show the number of genes in each chromosome (right y-axis), while the leftmost bars show the same values as above for all genes in the genome (i.e., the average).
(B) Black bars represent the total number of J2315's CDS that, based on pair-wise whole-genome comparisons, do not have homologs (i.e., they are J2315-specific) in the other Bcc strain (x-axis), while gray and open striped bars represent the fraction of these J2315-specific CDS that has a GC content more than 5% lower or more than 5% higher than the average of J2315's genome, respectively. Gray and white bars represent the total number of CDS in J2315's genome that have a GC content more than 5% lower or more than 5% higher than the average of J2315's genome, respectively.

Figure 3.1. Relationships between average nucleotide identity (ANI), 16S rRNA sequence identity and DNA homology. Each dot represents the ANI of all conserved genes between two strains plotted against the 16S rRNA sequence identity (A) and the DNA homology (B) of the two strains. The shaded bar represents 93-94% ANI, which approximately corresponds to 70% DNA homology, i.e., the cut-off for prokaryotic species, according to the regression analysis in panel B. 16S rRNA identity and DNA homology values were computed as described in the methods section.

Figure 3.2. Conserved gene core vs. genetic diversity within E. coli species. (A) Starting with the 5,447 CDS in the genome of E. coli O157 strain Sakai, the next bar to the right represents how many unique CDS in total are found in strains EDL and Sakai together (empty bars) and how many of the 5,447 CDS are conserved in EDL (filled bars), and so on. Hence, the empty bars represent the total genetic diversity within the species, whereas the filled ones represent the conserved core of the species. (B) All CDS in a strain (graph label) were searched against a database of an increasing number of genomes. The number of strain-specific CDS, expressed as a percentage of the strain-specific CDS when only one genome was used as the database, is plotted against the number of genomes used as the database. The almost identical genomes of E. coli O157 and S.
flexneri 2a lineages were pooled together so that the seven genomes finally compared showed similar average nucleotide identity to each other. The genomes of S. sonnei, E. coli str. 042 and str. E2348 were not annotated at the time of this study. For these genomes, the genomic sequence was cut into consecutive 1,000-nt-long fragments and these fragments were used instead of CDS. Applying this strategy to annotated genomes gave results comparable to the ones obtained using annotated CDS. The logarithmic and power correlations shown are not statistically different from each other.

Figure 3.3. Conserved gene core vs. genetic diversity of species. The first column shows what fraction of the total, non-redundant list of genes found in all genomes of the species belongs to the species' conserved core and what fraction is variable (i.e., not in the core). The second column shows the same distribution for the “average” strain of the species. The functional annotation (see methods) of the genes in the average strain of the species is also shown, as exemplified for E. coli. E. coli shows the greatest and S. pyogenes the lowest genetic diversity; note, however, that E. coli genomes are generally more distantly related to each other than genomes of the other species, based on ANI measurements (ANI between E. coli genomes ~96-97% vs. >98% for the others).

Figure 3.4. Correlation between conserved genes and evolutionary distance for bacterial species. Each datapoint represents the percent of conserved genes between two strains plotted against their evolutionary distance, measured as average nucleotide identity (ANI) of all conserved genes between the strains. Solid squares represent all genes, while open squares represent the fraction of all genes that are well-characterized genes (see methods section).
Panel A includes only pairs of strains that should belong to the same species according to the current species definition standard (see Figure 3.1), whereas panel B includes pairs of more distantly related strains.

Figure 3.5. Genetic signatures among groups of strains that show higher than 94% average nucleotide identity (ANI). Starting with all CDS in the leftmost strain, the next bar to the right represents how many CDS are conserved in the next strain (x-axis) (similarly to Figure 3.2). The ANI to the leftmost strain is also shown on top of the bars for each strain. (A) A genetic signature between the pathovar Typhi strains and the remaining Salmonella strains is identifiable. (B) No genetic signature is evident for the B. anthracis-B. cereus ATCC 14579 group (dashed circle). The rightmost bar in panel B shows how many of the conserved CDS between the two B. anthracis strains are also conserved in strain ATCC 14579 alone. Strains from left are: (A) S. enterica ser. Typhi Ty2, S. enterica ser. Typhi Typhi, S. enterica PT2, S. enterica ser. Typhimurium DT104, S. enterica ser. Typhimurium LT2, S. enterica ser. Typhimurium SL1344, S. gallinarum, and a pool of all Salmonella but the Typhi strains. (B) B. anthracis Ames, B. anthracis A2012, B. cereus ATCC 10987, and B. cereus ATCC 14579.

Figure 3.6. Functional distribution of genome-specific CDS from 82 pair-wise, whole-genome comparisons. Results using only strains showing >94% ANI are shown in parentheses. (Inset) Mean functional distribution of annotated CDS for the 64 genomes deposited in GenBank as of October 2003. *Mobile denotes phage- or transposase-associated genes.

Figure 3.7. Degree of conservation of non-coding and hypothetical sequences vs. well-characterized genes.
Each datapoint represents the number of non-coding sequences (expressed as a percent of the total sequences to normalize for the genome size effect) from a reference genome conserved in a tester genome (y-axis) vs. the number of hypothetical genes (solid squares) or well-characterized genes (open squares) from the reference genome conserved in the tester genome (x-axis). The gray diagonal represents the 1:1 regression line.

Figure 4.1. Average Nucleotide Identity (ANI) and genetic distance. (A) The ANI for all genes in the genome, and for all genes in a COG category (designated by a single letter on the x-axis; see Table 2.1 for letter designations), between E. coli strain Sakai and another genome (graph legend) were determined, and the difference of the average identity of the genes in each category from the average identity of all genes in the genome is shown (y-axis). These results reveal that the nucleotide identity of most orthologs between any two genomes is within +/- 6-8% of the ANI between the genomes. A comparable picture was obtained for the Burkholderia, Mycobacteria and Streptococci groups (data not shown). (B) The average rate of non-synonymous substitutions (Ks) for all orthologs between two genomes strongly correlates with the ANI between the genomes, suggesting that ANI may be a useful descriptor of the evolutionary distance. Only genomes that show <3% 16S rRNA mispairing were included in the analysis to avoid saturation of nucleotide substitutions at non-synonymous sites. ANI correlates strongly with Average Amino acid Identity (AAI) (R² > 0.95); therefore, the previous conclusions are translatable to AAI as well. ANI was preferred because it gives higher resolution between very closely related genomes.

Figure 4.2.
Relationships between 16S rRNA, AAI, and taxonomic information for the 175 sequenced genomes. Panel A shows the 16S rRNA gene sequence identity (y-axis) plotted against the average amino acid identity (AAI) for each pair of the 175 genomes (30,625 pairs in total). The smallest taxonomic rank that the two genomes of each pair share has been overlaid in panels B, C, and D. The area corresponding to the current standard for species delineation as well as representative pairs of genomes (discussed in the text) have been annotated.

Figure 4.3. Phylogenetic relationships between the 175 fully sequenced genomes. Neighbor-joining trees derived from the full matrix of AAI (A) and percent of conserved genes (B) between the 175 genomes used in this study. The percent of conserved genes (instead of the absolute number of conserved genes) was used to accommodate genome size differences (up to 10-fold) among the 175 genomes. Groups that are deep branching on the AAI tree are denoted by colors. Phyla represented by a single genome are in bold. The Saccharomyces cerevisiae genome was used to root the trees (outgroup). Scale bars represent 10% difference. Note the difference in scale between A and B, i.e., the underlying differences are about 25% larger in the conserved gene tree for the same branch length. Abbreviations are as follows (top to bottom in panel A): T-D -- Thermus-Deinococcus phylum; Spiro. -- Spirochaetes phylum; Bact. -- Bacteroidetes phylum; α-, β-, γ-, δ-, ε-P. -- α-, β-, γ-, δ-, ε-Proteobacteria classes, respectively; Cyano. -- Cyanobacteria phylum; Streptococ. -- Streptococcaceae family; Staphyloc. -- Staphylococcaceae family; Eury. -- Euryarchaeota phylum; Crena. -- Crenarchaeota phylum.

Figure 4.4. Relationship between conserved gene content and genetic distance.
Dots represent the percent of conserved genes between a pair of genomes plotted against their genetic distance, measured as the average amino acid identity of the conserved genes. (A) All pairs of the 175 genomes (30,625 pairs in total) were included, whereas pairs that contain an endosymbiotic genome were removed (32 genomes, 5,600 pairs removed) from the analysis in (B).

Figure 4.5. Correlation between alternative markers to 16S rRNA and Average Amino acid Identity (AAI). Panels show the correlation between identity of a molecular marker (panel title) and AAI for all pairs of the 175 genomes (at least 20,000 pairs for each gene) used in this study. For the full name description of a marker see Table 4.1.

Figure 5.1. Correlation between microarray false negatives and evolutionary distance between reference and tester strain. Each point represents the false negatives, expressed as a percentage of the total number of ORFs predicted to cross-hybridize with the tester genome, between a reference and a tester strain plotted against the DNA-homology values (solid squares, upper X-axis) and the 16S rRNA sequence identity (open squares, bottom X-axis) between the reference and tester strain. (A): 30% amino acid identity cut-off. (B): 60% amino acid similarity cut-off.

Figure 5.2. Importance of microarray false negatives. Solid bars represent the total number of ORFs from the reference genome not conserved at the nucleotide level in the tester genome (X-axis). Gray and open bars represent the part of these ORFs that are also predicted to be microarray false negatives at the 30% amino acid identity and 60% amino acid cut-offs, respectively.

Figure 5.3.
Non-specific hybridization signal for whole ORF sequences. Each point represents the predicted signal for an ORF when all its matches in the tester genome were considered (Y-axis) vs. the predicted signal when only the best match was considered (X-axis). Thus, any points that deviate from the diagonal represent ORFs that are predicted to be affected by non-specific hybridization signal. (A): tester strain is S. enterica pathovar Typhimurium, (B): tester strain is E. coli K12. Reference strain is E. coli O157.

Figure 5.4. cDNA vs. Oligo-arrays: Evolutionary relatedness results. (A): Each spot represents the [transformed length X transformed identity] value for the best BLAST match of an O157 ORF (right) in a tester genome (top). (B): The results from the hierarchical clustering of the [transformed length X transformed identity] values using Pearson correlation for every set of query sequences, i.e., whole ORFs, cDNA and oligo probes. (C): Hierarchical clustering using Spearman rank correlation.

Figure 5.5. cDNA vs. Oligo-arrays: Gene identification. Bars represent the predicted false negatives (expressed as percentage of the total number of probes that are expected to hybridize) for the cDNA (open bars) and oligo (solid bars) probes. (A): The enterics. (B): The streptococci. Tester strains (from left to right) are: (A), E. coli K12, Shigella flexneri, S. Typhimurium, Klebsiella pneumoniae, Y. pestis; (B), S. pneumoniae R6, S. mitis, S. pyogenes, S. agalactiae; reference strains were E. coli O157 and S. pneumoniae TIGR4, respectively.

LIST OF TABLES

Table 2.1. COG functional categories and category correlation with total number of CDS.

Table 2.2.
Genomic information and ecological niche(s) of species with a genome size larger than 6 Mb.

Table 4.1. Relationships of different phylogenetic markers to Average Amino acid Identity (AAI).

Table 5.1. Pair-wise 16S rRNA gene sequence similarity (upper right) and DNA-DNA reassociation values (lower left) for the species used in this study.

CHAPTER 1

AN INTRODUCTION INTO PROKARYOTIC DIVERSITY AND GENOMICS

I have authored parts of this chapter in the book chapter: K. T. Konstantinidis, and J. M. Tiedje. Microbial diversity and genomics. In Microbial Functional Genomics. J. Zhou, D. K. Thompson, Y. Xu, and J. M. Tiedje (eds.) John Wiley & Sons. Hoboken, New Jersey, 2004, pg. 21-46. Copyright 2004 Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc. Reprinted with permission of John Wiley & Sons, Inc.

INTRODUCTION

The prokaryotic biomass has been estimated to (at least) equal that of terrestrial and marine plants (70) while the prokaryotic biodiversity (genetic or biochemical) is presumably the largest reservoir of biodiversity on Earth based on recent estimates of 6,300 distinct prokaryotic species in a single gram of soil (15). Yet, several aspects of this biodiversity remain unexplored. For instance, the full extent of the total genetic diversity of Prokaryotes or even the genetic diversity within a single prokaryotic species remains unknown, and there is inadequate understanding of the interactions between ecological niche and biodiversity, or how important biodiversity is for (specific) ecosystem function and stability.
Gaining insight into such issues is at the heart of understanding the basis for and value of biodiversity, and of understanding the diverse environmental microbes that catalyze many of the biogeochemical cycles that sustain life on Earth. Such information can then be used to successfully apply microbes or control their in-situ activities for specific purposes such as bioremediation, plant protection, or nitrogen fixation.

Genomics¹ offers a great opportunity to explore (at least) parts of the immense prokaryotic biodiversity and, in fact, has already succeeded in significantly broadening our knowledge of it. For example, genomic comparisons have shown that prokaryotic genomes are much more “fluid” than previously thought, e.g., mobile elements and lateral gene transfer events play a major role in the evolution and shaping of the genome space. This “fluidity” drives, by and large, the extensive genetic diversity revealed within the currently named species. Another major contribution of genomic

¹ Study that aims to reach a genome-level understanding of the molecular basis of the structure, functions, and evolution of biological systems using whole-genome sequence information and high-throughput technologies.

BACKGROUND

Prokaryotic biochemical and genetic diversity.

Prokaryotic life emerged about 3.7 billion years ago, or about 2 billion years before eukaryotic life arose (reviewed in (30)). Thus, prokaryotic organisms have had a long time to evolve, and this accounts for the high biochemical and genetic diversity that characterizes prokaryotes. The extent of prokaryotic metabolic and enzymatic diversity is such that it is believed that a handful of prokaryotic species can live on almost any carbon source or redox couple available on Earth. Prokaryotic organisms make up two of the three domains of life, namely the Bacteria and the Archaea (71).
What characterizes Bacteria compared to the other two domains of life is that species that are closely related by molecular criteria (e.g., ribosomal RNA gene identities) can display strikingly different carbon and energy metabolisms. For instance, in the relatively closely related γ-Proteobacteria subgroup one can find phenotypically very different organisms such as E. coli (organotroph), Chromatium vinosum (hydrogen sulfide-based phototroph), and the symbiont of the tubeworm R. pachyptila (hydrogen sulfide-based symbiont). The situation is even more pronounced when specific biochemical traits, e.g., functional proteins, are considered. For instance, the ability to denitrify (making use of N-oxides as terminal electron acceptors) occurs sporadically among the cultivated bacterial species of coherent 16S rRNA clusters (73). The 16S rRNA sequence information is commonly used for the construction of phylogenetic trees to infer the ancestry and relatedness of organisms. As is apparent from the denitrification example, even organisms that are identical or cluster tightly by the 16S rRNA criterion may not share the most essential physiological properties. Furthermore, the functional genes involved in the denitrification pathway (e.g., nitrite reductase, nitrous oxide reductase) exhibit substantial sequence diversity in the cultivated representatives from a single gram of marine sediment or forest soil (73). The lack of general correspondence between metabolism and evolutionary relatedness is attributed to lateral gene transfer, large-scale symbiotic fusions (e.g., between a bacterium and a bacteriophage) and the great ability of bacteria to evolve to exploit available ecological niches.

The other domain of Prokaryotes, the Archaea, shows considerably less metabolic and genetic diversity compared to Bacteria based on the study of the representatives of the two domains that have been isolated to date.
However, recent findings from culture-independent surveys for the presence of archaeal-specific signatures in the environment suggest that archaeal ecological significance and global distribution are much greater than represented in the currently cultured species (4, 8, 17, 18). For instance, several phylum- and order-level lineages and a new kingdom of Archaea, the Korarchaeota (4), have been proposed based on cloned 16S rRNA sequences from different environmental sources. The lack of cultivable species representative of these lineages, or of information about the physiology of these species, severely limits our ability to summarize archaeal metabolic diversity.

As far as eukaryotic organisms are concerned, they also appear to be much less metabolically versatile than Bacteria in terms of the range of growth substrates and terminal electron acceptors they can utilize. For example, organotrophy, in which reduced organic compounds are used for energy and carbon, appears to be the main mode of nutrition for most non-photosynthetic Eukaryotes, and even in the case of organotrophy, the number of different metabolic processes carried out by Bacteria far exceeds the number carried out by Eukaryotes. Further, photosynthesis was also a bacterial innovation, and is ecologically and physiologically more diverse in the Bacteria. Most bacterial photosynthesis is anaerobic and widely distributed among different bacterial phyla, in contrast to a single kind of photosynthesis in Eukaryotes, i.e., the oxygenic photosynthesis of plants, and in Archaea, i.e., the photosynthesis of the Halobacterium genus. There is now conclusive genomic evidence that the eukaryotic photosynthetic machines, i.e., the chloroplasts of plants, originated from a symbiotic event between a eukaryotic cell and a cyanobacterium (37).
Last, Eukaryotes that exploit other modes of nutrition such as lithotrophy, in which energy is derived from the oxidation of reduced inorganic compounds by a chemical oxidant, do so only in close association (symbiosis) with prokaryotic organisms. Although there is relatively little information about the metabolic breadth of a major lineage of Eukaryotes, the amitochondriate Eukaryotes, the indisputable conclusion from reviewing the current knowledge on the metabolic and biochemical repertoire of prokaryotic species is: the versatility of Bacteria makes the metabolic machineries of Archaea and Eukarya seem comparatively monotonous.

How many prokaryotic species are there and what is a “species”?

While the previous discussion points out the immense metabolic diversity of the prokaryotic organisms, what makes prokaryotic species, and consequently the metabolic processes they carry out, important to Earth is their huge number of cells and their ubiquity. The most recent estimates put the total number of prokaryotes on Earth at 4-6 x 10^30 cells and their cellular carbon at 350-550 Pg (70). Hence, prokaryotic carbon is 60-100% of the estimated carbon in terrestrial and marine plants, while prokaryotic biomass is presumably the largest pool of recyclable nitrogen (N) and phosphorus (P) since the (N+P)/C ratio is higher for the prokaryotic cell. Most of the Earth’s prokaryotes are found in the open ocean and in soil, where the total number of cells is on the order of 10^29-10^30 (70), and particularly in their subsurface, i.e., below 8 m for the terrestrial environment and below 10 cm for oceanic sediments (26, 70), although there has been limited sampling of these environments and hence uncertainty in the accuracy of these predictions.
The activity of prokaryotes is substantial in surficial marine and soil environments based on cell turnover times, which have been estimated at 6-25 days for the upper 200 m of ocean and 2.5 years for soil (19, 27, 70), whereas prokaryotic activity in the subsurface is orders of magnitude lower, e.g., turnover times of 1.2 x 10^3 years (16).

Although there is no doubt that prokaryotic cells are ubiquitous and far exceed any other type of life in numbers, the enumeration of prokaryotic species is far from being resolved. This is due to both fundamental problems regarding the definition of species and practical limitations in counting prokaryotic species. The current species concept for prokaryotes (52, 58, 68), despite being pragmatic, operational and applicable (52, 58), remains controversial (9, 12). The controversy stems from the fact that prokaryotic species lack diagnostic morphological characteristics and are asexual organisms that exchange genetic material in ways that are unique and unusual compared to eukaryotes. Therefore, none of the 22 species concepts described for Eukaryotes is applicable to Prokaryotes (52). In addition, it is not always feasible, due to technological limitations and/or poor understanding of the metabolic and physiological properties of prokaryotic cells, to define the unique phenotypic characteristics that are required for a species description (66). This has led most prokaryotic taxonomists to agree on a functional species definition for prokaryotes that is rooted in the degree of DNA-DNA reassociation. In this definition, two strains belong to the same species when their purified DNA molecules show at least 70% hybridization (59, 68). This definition does not translate well to Eukaryotes, however. Application of the same definition to Eukaryotes would lead to the inclusion of members of many taxonomic tribes in the same species (55). For example, all the primates (i.e.
humans, orangutans and gibbons) would then belong to the same species (56). Furthermore, gorillas and orangutans would not be considered threatened because they would be the same species as humans, which are numerous and cosmopolitan. Thus, a simple comparison of the number of eukaryotic and prokaryotic species greatly underestimates prokaryotic diversity. Indeed, the prokaryotic species concept is probably comparable to that of an animal family or perhaps even an animal order.

The other obstacle in enumerating prokaryotic species is the fact that only a small fraction of many microbial communities, typically about 1%, is cultivable. The problem of “cultivating the uncultivable” has been extensively discussed and reviewed elsewhere (1, 49, 60, 62) and won’t be further discussed here. To give an illustrative example of this issue, however, about half of the 65,872 16S rRNA sequences in the Ribosomal Database Project (RDP) database as of April 2003 (13) were obtained from environmental clone libraries as opposed to isolated organisms (Figure 1.1). Furthermore, the habitats where prokaryotic species live are sometimes difficult to sample (e.g., deep ocean, subsurface)

[Figure 1.1 plot: one bar per bacterial phylum; x-axis: Percentage]

Figure 1.1. Distribution of uncultivated vs.
cultivated 16S rRNA gene sequences for each bacterial phylum. Data were collected from 65,872 sequences deposited in the Ribosomal Database Project (RDP) database, as of April 2003. The classification is based on an annotation from GenBank and was provided courtesy of Ryan Farris and James R. Cole of RDP.

or too complex, such as soil or sediments, for an exhaustive count of prokaryotic species. This has led researchers to try to model the total number of prokaryotic species rather than exhaustively count them. In one such classical study, Torsvik and colleagues employed whole community DNA-DNA reassociation kinetics to estimate the total number of genome equivalents or species, considering the 70% DNA-DNA association cut-off as the definition of species (64). Based on this approach, 350-1500 and 3500-8800 different prokaryotic species were found in the Norwegian soils sampled (47, 48, 63). Using the same method, the prokaryotic diversity in aquatic environments was found to be orders of magnitude less than that in soil (47). Dykhuizen, using data from whole community DNA-DNA association between related communities, estimated that more than a billion (10^9) prokaryotic species exist in soil (20). Several reasons can explain the high soil microbial diversity, such as the high diversity of carbon resources; the rather stable, protective, even ancient environment; and what appears to be a high degree of spatial isolation that reduces competition, thereby maintaining less competitive members (61, 65, 72).

Others have used clone libraries of the 16S rRNA gene from environmental samples to estimate prokaryotic diversity. The distribution of unique (representing different species) 16S rRNA gene sequences relative to the sequences that were observed more than once in these libraries was used for extrapolation to the total number of species in the environment.
Assuming a lognormal distribution of species, that is, if species are assigned to log abundance classes, the distribution of species among these classes is normal, Curtis and colleagues estimated 6,300 species per gram in two grazed grassland soils (15). Extrapolating to a larger scale, they estimated the entire bacterial diversity in the oceans to be up to 2 x 10^6 species, while a ton of soil could contain 4 x 10^6 different species (15). Hughes and colleagues do not share the opinion that species follow a lognormal distribution and thus employed a different statistical approach, estimating about 500 species for the same dataset (31). There are several technical limitations in these approaches to estimating species richness, such as the limited sampling and the exhaustiveness of the clone libraries, and uncertainty about the natural species distribution in the environment, the analytical discussion of which is not relevant here. Furthermore, it is uncertain how different species are in different environments, for example different soils, or how much they vary in different geographic locations. In one study to address this question, Cho and Tiedje found that fluorescent Pseudomonas, a cosmopolitan heterotroph that is frequently recovered from soil, shows a high degree of endemicity at the genotype level (11). If microbial populations have a high degree of endemicity, it greatly expands the Earth's total microbial diversity. Second, the description of species based on 16S rRNA gene sequence is problematic, mostly because the sequence of this molecule is too conserved to resolve species (59). While the accuracy of estimates of global microbial diversity is in question, it is beyond question that the number of prokaryotic species is large, most probably much larger than that of the most diverse eukaryotic group, the insects (with greater than 10^6 species).
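The lognormal assumption mentioned above can be pictured with a small simulation. The sketch below is illustrative only and is not the estimation procedure of Curtis and colleagues: it draws hypothetical species abundances from a lognormal distribution, assigns each species to a log2 abundance class, and counts the species per class, which gives a roughly bell-shaped profile. The function names, the parameters (mu, sigma), and the random seed are assumptions chosen for the illustration; only the figure of 6,300 species comes from the text.

```python
import random
from collections import Counter


def simulate_abundances(n_species, mu=5.0, sigma=2.0, seed=0):
    """Draw one hypothetical abundance (cell count) per species from a lognormal.

    mu and sigma are arbitrary illustrative values, not fitted parameters.
    Abundances are rounded and clamped to at least 1 cell, which slightly
    inflates the lowest abundance class.
    """
    rng = random.Random(seed)
    return [max(1, round(rng.lognormvariate(mu, sigma))) for _ in range(n_species)]


def log2_abundance_classes(abundances):
    """Count species per log2 class; class k holds 2**k <= n < 2**(k + 1)."""
    return Counter(n.bit_length() - 1 for n in abundances)


if __name__ == "__main__":
    # 6,300 species is the per-gram estimate quoted in the text.
    classes = log2_abundance_classes(simulate_abundances(6300))
    for k in sorted(classes):
        print(f"class {k:2d}: {'#' * (classes[k] // 20)} {classes[k]}")
```

Printing the classes as a text histogram shows species counts rising to a peak at intermediate abundance and falling off on both sides, which is the pattern the lognormal assumption describes.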
Currently only 4,500 prokaryotic species are described (25), which appears to be less than the number of species in a few grams of soil.

Whole-genome sequencing and diversity of prokaryotic genomes.

Whole-genome sequencing² was initially employed to advance understanding of species physiology and metabolism, but it was soon realized that it could revolutionize the study of other major microbiology disciplines, including functional and genetic diversity. For these reasons, and following the major improvements in sequencing technology, capacity, and cost reduction, prokaryotic genome sequencing projects have grown rapidly (Figure 1.2B) such that over 115 genomes have been classified as of the end of 2003 and more than 300 other projects are underway. This set of genomic information is now large enough to reveal some major trends in and impressions about prokaryotic genomes and is consistent with the very high prokaryotic diversity discussed above.

² Deciphering the sequence of all nucleotide bases in the genome.

[Figure 1.2 plot: (A) number of genomes vs. genome size (in Mb); (B) number of genomes vs. year of publication]

Figure 1.2. Genome size distribution of the fully sequenced prokaryotic genomes (A) and the number of published prokaryotic genomes per year (B). Data were retrieved from NCBI and included all 115 prokaryotic genomes available at the end of 2003.

Whole-genome sequencing revealed much higher genetic diversity within species than originally anticipated. An example is the E. coli case, where whole-genome sequences of four strains are now available. Comparative genomic analysis of these sequences revealed that the pathogenic Sakai strain has a genome 1 Mb larger than that of the laboratory strain K12 and about 25% of its genes are not conserved in strain K12 (29, 50).
If one considers that prior to whole-genome sequencing, strains of the same species were believed to harbor minimal genotypic differences because they only rarely could be differentiated based on phenotypic characteristics, the genetic heterogeneity revealed between E. coli strains was surprisingly high for its time (3 years ago!). Furthermore, the annotation of the strain-specific gene set offered novel insight into the pathogenic lifestyle of strain Sakai relative to the innocuous K12, revealing that our knowledge of even the best-studied pathogen was impeded by the incompleteness of the available conventional methods. Most of the strain Sakai-specific genes are now believed to have been acquired through lateral transfer events based on atypical sequence characteristics (35) and the enrichment of mobile elements such as phage, prophages and insertion sequences in the Sakai genome (29, 50). These findings also suggested that an environmental and benign strain could evolve relatively easily into a devastating pathogen in only about 4.5 million years (rather short in evolutionary time) since the last common ancestor between the two strains (51). The availability of two additional genomic sequences of E. coli strains (strains CFT073 and EDL933) revealed further surprises with regard to the extent of genetic diversity within E. coli. Only about 3,000 genes are shared among all four available E. coli genomes (69), compared to about 4,000 genes shared between the Sakai and K12 strains (50). The 3,000 genes conserved in all E. coli strains show, however, a remarkable synteny (interrupted by strain-specific islands), suggesting a vertically transmitted backbone gene set for E. coli (69). In summary, the genomic sequencing of E. coli strains has revealed not only an enormous genetic diversity at the sub-species level but also the presence of very different selection forces that have led to the accumulation or deletion of genetic material.
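The drop from about 4,000 genes shared by two strains to about 3,000 shared by four can be pictured as a set intersection that shrinks as genomes are added. The following is a minimal sketch, assuming each genome is represented as a set of gene-family identifiers; the toy identifiers and the function name core_genome are hypothetical and do not come from the cited E. coli analyses, which decided gene conservation from sequence comparisons.

```python
def core_genome(genomes):
    """Return the gene families present in every genome (set intersection).

    `genomes` maps a strain name to a set of gene-family identifiers.
    """
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


if __name__ == "__main__":
    # Toy data for illustration only; strain names are reused from the text,
    # but the gene contents are invented.
    toy = {
        "K12": {"g1", "g2", "g3", "g4", "g5"},
        "Sakai": {"g1", "g2", "g3", "g4", "p1", "p2"},
        "EDL933": {"g1", "g2", "g3", "p1", "p2"},
        "CFT073": {"g1", "g2", "g3", "u1"},
    }
    pair_core = core_genome({name: toy[name] for name in ("K12", "Sakai")})
    all_core = core_genome(toy)
    print(len(pair_core), len(all_core))
```

Because an intersection can only lose members as more sets are included, the shared gene set of four strains is never larger than that of any two of them, which matches the direction of the trend reported above.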
This subspecies diversity appears not to be unusual, since preliminary evidence suggests that Streptococcus pneumoniae, based on comparative microarray hybridizations (28), and Burkholderia cepacia, based on genome size estimations (38), also seem to have high diversity.

On the other hand, species such as Mycobacterium tuberculosis do not appear to share the genetic diversity observed in E. coli. Based on both comparative analysis of the sequenced strains (22) and comparative microarray hybridization analysis of several strains (5), M. tuberculosis strains are unlikely to be more than 1-2% different in terms of gene content, although the current analysis might be biased due to the study of exclusively clinical isolates. These findings raise another fundamental issue as well. The current species definition based upon 70% DNA-DNA association values poorly correlates with gene content differences within species, as is apparent from the comparison among the E. coli and M. tuberculosis strains.

Genome structure and its relation to the ecological niche.

Genome sizes vary by more than an order of magnitude among the known prokaryotes (e.g., 0.5-10 Mb). However, the genome size distribution does not appear to be random; for example, it correlates with the ecological niche of the organisms. The smaller genomes are found in endocellular parasites or symbionts (0.5-1.2 Mb), because these organisms occupy a very narrow niche and hence have undergone reductive evolution. For instance, the endosymbiont of aphids, Buchnera sp., has a genome size of only 650 Kb compared to the 4 Mb of its ancestors, from which Buchnera diverged 150-250 million years ago (53). For free-living bacteria, genome size correlates with the species' metabolism and the width of its ecological niche. Pathogenic species with a narrow range of hosts (or, more generally, species with a narrow ecological niche) also have small genomes, for example, Helicobacter sp. and Streptococcus sp.
Anaerobic bacteria with a restricted metabolism, such as methanogens, typically have small genome sizes, ranging from 1.5 to 2.5 Mb. In contrast, aerobic organisms and opportunistic pathogens show higher diversity in genome sizes, with some species such as Pseudomonas having genomes as large as 6 Mb. The largest genomes are found in species that have complex life styles, including myxobacteria and actinobacteria (8-9 Mb). All these observations lead to the conclusion that the interaction between an organism and its particular habitat(s), for example, resource availability and diversity, or stable versus fluctuating environmental conditions, selects the genome size of species. Nonetheless, what controls the upper genome size in prokaryotes remains poorly understood. Several hypotheses exist, such as the decreased fidelity of replication in large genomes and the energy cost of successfully controlling an excessive metabolic repertoire, but none has been experimentally proven.

The variation of genome sizes within species is believed to be rather limited (54), which has been supported by recent genomic sequence data and by pulsed-field gel electrophoresis of genomes (41). Some of the better-documented exceptions to this are the E. coli and Burkholderia cepacia species mentioned above, where different strains can vary up to 25% or up to 50% in genome size, respectively (7, 38). On the other hand, genome size can vary up to 3-fold for different species of the same genus. At one end of the spectrum there are species like Borrelia sp., whose chromosomes vary by less than 15 Kb in size (10), whereas species like the spirochete Treponema sp. (40, 67) and Mycoplasma sp. (3) show a variation in genome sizes of up to 3- and 2.3-fold, respectively. Perhaps more typical are genera like Streptomyces and Rickettsia, which vary from 6.4 to 8.2 Mb (36) and 1.2 to 1.7 Mb (24), respectively.
It should be pointed out, however, that too few strains within species and within genera have been studied to give us a complete understanding of the natural variation in the size of prokaryotic genomes.

Although Bacteria are believed to have a single, circular chromosome, an increasing number of exceptions to this are being identified. For example, several species of the α- and β-Proteobacteria have multiple rather than single chromosomes (differentiated from large plasmids by harboring housekeeping genes like ribosomal or tRNA genes), and in at least two genera, Brucella and Burkholderia, the multiple chromosomes are a stable property of the genus. In the proteobacterial phylum, the multiple chromosomes correlate with a free-living, opportunistic lifestyle, whereas species that are obligatorily associated with animal hosts or vectors contain no plasmid and, with a few exceptions, single chromosomes (44). Based on these observations, multiple chromosomes are postulated to confer increased genome plasticity and potential for diversification, but this has not yet been proven experimentally. Several species with linear, instead of circular, chromosomes and/or plasmids have also recently been described, such as the Streptomyces, Rhodococcus, Borrelia, and Agrobacterium species. Linearity, at least in Streptomyces and Borrelia, is believed to enhance genomic plasticity, because linear chromosomes (or plasmids) are very unstable and undergo, at high frequency, amplifications and large deletions, often removing the telomeres. This was confirmed by the whole-genome sequencing of S. coelicolor, which showed that the secondary metabolite-related genes (Streptomyces is notorious for its secondary metabolites like antibiotics) are more frequently encountered in the arms of the chromosome than in its center; the center is biased toward housekeeping genes (6).
However, whether linearity offers a selective advantage and why it is phylogenetically constrained to a limited number of species remain unclear.

While it is clear from the above discussion that there is considerable functional and sequence diversity among and within prokaryotic species, whole-genome sequencing has also revealed some universal functional trends. For instance, the genomes of endosymbionts have preferentially lost genes involved in metabolism, biosynthesis and regulation while retaining most of the informational genes compared to their free-living relatives (2, 42, 43). Interestingly, although there is a strong deletion bias toward the former major functional categories in the symbiotic genomes, the specific pathways lost appear to be lineage specific, e.g., Buchnera sp., an obligate symbiont of aphids, contrary to other endosymbionts, retains the genes for the biosynthesis of all amino acids (53). Further, in almost every genome sequenced to date there is a constant percentage (about 20-30%) of the predicted protein-coding genes (CDS) that show no homology to any known protein (23). Although it has been suggested that the majority of these are non-coding DNA based on in silico analysis (32, 46, 57), more recent proteomic analyses suggest that at least a portion are translated into proteins, i.e., they presumably represent functional proteins (14, 33, 39). The significant number of “function unknown” genes in every genome also suggests that novel processes are still likely to function in every prokaryotic cell and await characterization. Alternatively, some of these genes might function in well-studied cell processes but their sequences have diverged too much to resemble any of the known annotated sequences.

Biases in the collection of sequenced species.
The current collection of sequenced species is rather limited (compared to the extent of species richness) and there are several issues that should be pursued in the future for a comprehensive understanding of prokaryotic genetic and functional diversity. A major limitation is that several major phylogenetic lineages remain under- or over-represented among sequenced representatives. For instance, 61 (39.1%) of the 156 completely sequenced strains as of December of 2003 belong to the phylum Proteobacteria, while some of the most dominant phyla in nature still have a limited number of sequenced representatives. For example, the Acidobacteria, which appear to be numerically dominant, forming up to 52% of 16S rRNA gene sequences in clone libraries from different soils (21, 34, 45), have no sequenced species. Archaea have sixteen completely sequenced species, but this collection is limited to methanogenic or thermophilic species and does not include mesophilic species that are widespread in the ocean and soil environments.

Another limitation of the current collection of sequenced species is that the collection is heavily biased towards organisms with smaller genomes, often from strains living in simpler, resource-rich environments such as endocellular parasites or pathogens (Figure 1.2A). About 70% of the bacterial strains fully sequenced are of clinical importance. A representative example is the Actinobacteria phylum, a dominant group in soil based on culture-independent methods, which has nine sequenced species but all of them are of clinical origin. This picture appears to be changing, however, based on the fact that about half of the ~400 prokaryotic genomic projects that are under way at the end of 2004 worldwide involve non-pathogenic strains.
In conclusion, our current knowledge of prokaryotic physiology and metabolism based on genomic approaches might still be limited, and novel findings are anticipated in the near future, particularly in environmental microbiology.

THESIS OUTLINE

The previous discussion has pointed out the wealth of information lying in the whole-genome sequences and the power of comparative genomic analysis in providing novel insights into (previously) tantalizing scientific questions. Although substantial progress has been made in several areas of microbiology based on analyses of whole-genome sequences, major questions remain unanswered. I have undertaken several different, and sometimes novel, comparative genomic approaches in order to address several such questions related to the ecology of prokaryotic organisms. Recognizing, in the first place, that the collection of currently available genomic sequences is rather limited compared to the extent of prokaryotic diversity, my approaches were designed mostly towards “methodology development”, to test larger datasets when these become available, rather than towards reaching definite conclusions at present. Nonetheless, several reliable trends in and impressions about the interrelationship between ecology and genomic diversity were revealed through my research.

In particular, Chapter 2 describes an effort to functionally characterize 115 genomes and to comprehensively evaluate how the relative usage of the genome for specific functions changes with genome size. Such analysis should be informative of what drives genome expansion, provide (further) insight into the interaction between an organism’s genome and its particular habitat(s) (discussed previously), and suggest what ecological benefits accrue for large genome-sized species.
The latter species are believed to be (more) ecologically successful in the soil environment, but there is currently limited understanding of why this is the case and of how ecology relates to the genome evolution of these species. This work was inspired by, and expands on, previous analogous studies of the small genomes of endosymbiotic parasites (summarized earlier), which have offered novel insights into the ecology and evolution of these species. Chapter 2 ends with the analysis of a group of closely related species, the Burkholderia cepacia complex (Bcc), to investigate differences between short (Bcc genomes that recently became available) and long evolutionary scales (the preceding comparison in the chapter).

The species concept remains a highly controversial and unsettled issue with broad impacts, for example on reliable diagnosis of infectious disease agents, (inter)national regulations for transport and possession of pathogens, intellectual property rights, and applications of microorganisms for bioremediation or agricultural purposes. Chapter 3 summarizes my attempts to assess the species-level genetic and functional differences among 81 closely related genomes representing several of the major phylogenetic lineages of Bacteria and thus to help refine the species concept for Prokaryotes. My approach employed whole-genome sequence comparisons to determine whether species-specific genetic signatures are identifiable (and thus whether it is meaningful to have a species concept), as well as the role of the organism’s ecology in its common gene content. This information, together with information from other approaches, e.g., population-based or gene-expression studies, should eventually converge on a more soundly based species definition for Prokaryotes.
Taxonomic ranks higher than species for Prokaryotes are based primarily on phylogenetic analysis of the small subunit ribosomal RNA gene (16S rRNA) and secondarily on older microscopic and/or biochemical observations about the relatedness of the organisms. Chapter 4 describes a genome-based approach, expanding on the one undertaken in Chapter 3, to better inform the higher ranks of taxonomy based on the genetic relatedness of the organisms. Further, the relatedness (between two organisms) estimated by my genome-based approach, which presumably represents a very reliable measurement since it is derived from thousands of independent data points (i.e., genes), was compared to the relatedness estimated by traditional genetic markers such as the 16S rRNA gene sequence to evaluate the robustness and accuracy of the latter.

It will become evident from the discussions in Chapters 2 and 3 that the number of available genomes is still too limited to allow for robust interpretations and conclusions. Therefore, better sampling, with genome-scale information from more species and particularly from closely related species, is needed. However, it is currently economically unrealistic to achieve this by genome sequencing, and thus alternative, high-throughput methods must be developed. Whole-genome DNA microarray technology appears to be a promising alternative because it can reveal exact, genome-level genetic differences between closely related strains based on Comparative Genome Hybridization (CGH) of the strains. However, the potential of microarray technology for CGH has not yet been fully explored. Chapter 5 describes an attempt to simulate microarray CGH experiments in silico, comparing the expected (in silico) microarray results to results from comparisons of whole-genome sequences (control), to evaluate microarray performance for genetic comparisons between strains.
Several technical aspects were evaluated, including the resolution level of microarrays, the extent of false positives or negatives, and the influence of non-specific signal on the microarray results.

REFERENCES

1. Amann, R. I., W. Ludwig, and K. H. Schleifer. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 59:143-69.
2. Andersson, S. G., and C. G. Kurland. 1998. Reductive evolution of resident genomes. Trends Microbiol 6:263-8.
3. Barlev, N. A., and S. N. Borchsenius. 1991. Continuous distribution of Mycoplasma genome sizes. Biomed Sci 2:641-5.
4. Barns, S. M., C. F. Delwiche, J. D. Palmer, and N. R. Pace. 1996. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc Natl Acad Sci U S A 93:9188-93.
5. Behr, M. A. 2002. BCG--different strains, different vaccines? Lancet Infect Dis 2:86-92.
6. Bentley, S. D., K. F. Chater, A. M. Cerdeno-Tarraga, G. L. Challis, N. R. Thomson, K. D. James, D. E. Harris, M. A. Quail, H. Kieser, D. Harper, A. Bateman, S. Brown, G. Chandra, C. W. Chen, M. Collins, A. Cronin, A. Fraser, A. Goble, J. Hidalgo, T. Hornsby, S. Howarth, C. H. Huang, T. Kieser, L. Larke, L. Murphy, K. Oliver, S. O'Neil, E. Rabbinowitsch, M. A. Rajandream, K. Rutherford, S. Rutter, K. Seeger, D. Saunders, S. Sharp, R. Squares, S. Squares, K. Taylor, T. Warren, A. Wietzorrek, J. Woodward, B. G. Barrell, J. Parkhill, and D. A. Hopwood. 2002. Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 417:141-7.
7. Bergthorsson, U., and H. Ochman. 1998. Distribution of chromosome length variation in natural isolates of Escherichia coli. Mol Biol Evol 15:6-16.
8. Bintrim, S. B., T. J. Donohue, J. Handelsman, G. P. Roberts, and R. M. Goodman. 1997. Molecular phylogeny of Archaea from soil. Proc Natl Acad Sci U S A 94:277-82.
9. Brenner, D., J. Staley, and N. Krieg. 2000. Bergey's manual of systematic bacteriology, 2nd ed, vol. 1.
Springer-Verlag, New York.
10. Casjens, S., M. Delange, H. L. Ley, 3rd, P. Rosa, and W. M. Huang. 1995. Linear chromosomes of Lyme disease agent spirochetes: genetic diversity and conservation of gene order. J Bacteriol 177:2769-80.
11. Cho, J. C., and J. M. Tiedje. 2001. Bacterial species determination from DNA-DNA hybridization by using genome fragments and DNA microarrays. Appl Environ Microbiol 67:3677-82.
12. Cohan, F. M. 2002. What are bacterial species? Annu Rev Microbiol 56:457-87.
13. Cole, J. R., B. Chai, T. L. Marsh, R. J. Farris, Q. Wang, S. A. Kulam, S. Chandra, D. M. McGarrell, T. M. Schmidt, G. M. Garrity, and J. M. Tiedje. 2003. The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res 31:442-3.
14. Corbin, R. W., O. Paliy, F. Yang, J. Shabanowitz, M. Platt, C. E. Lyons, Jr., K. Root, J. McAuliffe, M. I. Jordan, S. Kustu, E. Soupene, and D. F. Hunt. 2003. Toward a protein profile of Escherichia coli: comparison to its transcription profile. Proc Natl Acad Sci U S A 100:9232-7.
15. Curtis, T. P., W. T. Sloan, and J. W. Scannell. 2002. Estimating prokaryotic diversity and its limits. Proc Natl Acad Sci U S A 99:10494-9.
16. D'Hondt, S., S. Rutherford, and A. J. Spivack. 2002. Metabolic activity of subsurface life in deep-sea sediments. Science 295:2067-70.
17. DeLong, E. F., and N. R. Pace. 2001. Environmental diversity of bacteria and archaea. Syst Biol 50:470-8.
18. DeLong, E. F., L. T. Taylor, T. L. Marsh, and C. M. Preston. 1999. Visualization and enumeration of marine planktonic archaea and bacteria by using polyribonucleotide probes and fluorescent in situ hybridization. Appl Environ Microbiol 65:5554-63.
19. Ducklow, H., and C. Carlson. 1992. Oceanic bacterial production. Adv Microb Ecol 12:113-181.
20. Dykhuizen, D. E. 1998. Santa Rosalia revisited: why are there so many species of bacteria? Antonie Van Leeuwenhoek 73:25-33.
21. Felske, A., A.
Wolterink, R. Van Lis, W. M. De Vos, and A. D. Akkermans. 2000. Response of a soil bacterial community to grassland succession as monitored by 16S rRNA levels of the predominant ribotypes. Appl Environ Microbiol 66:3998-4003.
22. Fleischmann, R. D., D. Alland, J. A. Eisen, L. Carpenter, O. White, J. Peterson, R. DeBoy, R. Dodson, M. Gwinn, D. Haft, E. Hickey, J. F. Kolonay, W. C. Nelson, L. A. Umayam, M. Ermolaeva, S. L. Salzberg, A. Delcher, T. Utterback, J. Weidman, H. Khouri, J. Gill, A. Mikula, W. Bishai, W. R. Jacobs, Jr., J. C. Venter, and C. M. Fraser. 2002. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J Bacteriol 184:5479-90.
23. Fraser, C. M., J. Eisen, R. D. Fleischmann, K. A. Ketchum, and S. Peterson. 2000. Comparative genomics and understanding of microbial biology. Emerg Infect Dis 6:505-12.
24. Frutos, R., M. Pages, M. Bellis, G. Roizes, and M. Bergoin. 1989. Pulsed-field gel electrophoresis determination of the genome size of obligate intracellular bacteria belonging to the genera Chlamydia, Rickettsiella, and Porochlamydia. J Bacteriol 171:4511-3.
25. Garrity, G., J. Bell, and T. Lilburn. Bergey's manual of systematic bacteriology, 2nd ed, vol. Release 5.0. Springer-Verlag, New York.
26. Gold, T. 1992. The deep, hot biosphere. Proc Natl Acad Sci U S A 89:6045-9.
27. Gray, T., and S. Williams. 1971. Microbial productivity in soil. Symposia of the Society for General Microbiology 21:255-286.
28. Hakenbeck, R., N. Balmelle, B. Weber, C. Gardes, W. Keck, and A. de Saizieu. 2001. Mosaic genes and mosaic chromosomes: intra- and interspecies genomic variation of Streptococcus pneumoniae. Infect Immun 69:2477-86.
29. Hayashi, T., K. Makino, M. Ohnishi, K. Kurokawa, K. Ishii, K. Yokoyama, C. G. Han, E. Ohtsubo, K. Nakayama, T. Murata, M. Tanaka, T. Tobe, T. Iida, H. Takami, T. Honda, C. Sasakawa, N. Ogasawara, T. Yasunaga, S. Kuhara, T. Shiba, M. Hattori, and H. Shinagawa. 2001.
Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res 8:11-22.
30. Hedges, S. B. 2002. The origin and evolution of model organisms. Nat Rev Genet 3:838-49.
31. Hughes, J. B., J. J. Hellmann, T. H. Ricketts, and B. J. Bohannan. 2001. Counting the uncountable: statistical approaches to estimating microbial diversity. Appl Environ Microbiol 67:4399-406.
32. Jackson, J. H., S. H. Harrison, and P. A. Herring. 2002. A theoretical limit to coding space in chromosomes of bacteria. Omics 6:115-21.
33. Kolker, E., S. Purvine, M. Y. Galperin, S. Stolyar, D. R. Goodlett, A. I. Nesvizhskii, A. Keller, T. Xie, J. K. Eng, E. Yi, L. Hood, A. F. Picone, T. Cherny, B. C. Tjaden, A. F. Siegel, T. J. Reilly, K. S. Makarova, B. O. Palsson, and A. L. Smith. 2003. Initial proteome analysis of model microorganism Haemophilus influenzae strain Rd KW20. J Bacteriol 185:4593-602.
34. Kuske, C. R., S. M. Barns, and J. D. Busch. 1997. Diverse uncultivated bacterial groups from soils of the arid southwestern United States that are present in many geographic regions. Appl Environ Microbiol 63:3614-21.
35. Lawrence, J. G., and H. Ochman. 1998. Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A 95:9413-7.
36. Leblond, P., F. X. Francou, J. M. Simonet, and B. Decaris. 1990. Pulsed-field gel electrophoresis analysis of the genome of Streptomyces ambofaciens strains. FEMS Microbiol Lett 60:79-88.
37. Leister, D. 2003. Chloroplast research in the genomic age. Trends Genet 19:47-56.
38. Lessie, T. G., W. Hendrickson, B. D. Manning, and R. Devereux. 1996. Genomic complexity and plasticity of Burkholderia cepacia. FEMS Microbiol Lett 144:117-28.
39. Liu, Y., J. Zhou, M. V. Omelchenko, A. S. Beliaev, A. Venkateswaran, J. Stair, L. Wu, D. K. Thompson, D. Xu, I. B. Rogozin, E. K. Gaidamakova, M. Zhai, K. S. Makarova, E. V. Koonin, and M. J. Daly. 2003.
Transcriptome dynamics of Deinococcus radiodurans recovering from ionizing radiation. Proc Natl Acad Sci U S A 100:4191-6.
40. MacDougall, J., and I. Saint Girons. 1995. Physical map of the Treponema denticola circular chromosome. J Bacteriol 177:1805-11.
41. Maule, J. 1998. Pulsed-field gel electrophoresis. Mol Biotechnol 9:107-26.
42. Mira, A., H. Ochman, and N. A. Moran. 2001. Deletional bias and the evolution of bacterial genomes. Trends Genet 17:589-96.
43. Moran, N. A. 2002. Microbial minimalism: genome reduction in bacterial pathogens. Cell 108:583-6.
44. Moreno, E. 1998. Genome evolution within the alpha Proteobacteria: why do some bacteria not possess plasmids and others exhibit more than one different chromosome? FEMS Microbiol Rev 22:255-75.
45. Nogales, B., E. R. Moore, W. R. Abraham, and K. N. Timmis. 1999. Identification of the metabolically active members of a bacterial community in a polychlorinated biphenyl-polluted moorland soil. Environ Microbiol 1:199-212.
46. Ochman, H. 2002. Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends Genet 18:335-7.
47. Ovreas, L., F. Daae, M. Heldal, F. Rodriguez-Valera, and V. Torsvik. 2001. Presented at the 9th International Symposium on Microbial Ecology: Interaction in the Microbial World, Amsterdam, 26 to 31 August.
48. Ovreas, L., and V. V. Torsvik. 1998. Microbial Diversity and Community Structure in Two Different Agricultural Soil Communities. Microb Ecol 36:303-315.
49. Pace, N. R. 1997. A molecular view of microbial diversity and the biosphere. Science 276:734-40.
50. Perna, N. T., G. Plunkett, 3rd, V. Burland, B. Mau, J. D. Glasner, D. J. Rose, G. F. Mayhew, P. S. Evans, J. Gregor, H. A. Kirkpatrick, G. Posfai, J. Hackett, S. Klink, A. Boutin, Y. Shao, L. Miller, E. J. Grotbeck, N. W. Davis, A. Lim, E. T. Dimalanta, K. D. Potamousis, J. Apodaca, T. S. Anantharaman, J. Lin, G. Yen, D. C. Schwartz, R. A. Welch, and F. R. Blattner. 2001.
Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409:529-33.
51. Reid, S. D., C. J. Herbelin, A. C. Bumbaugh, R. K. Selander, and T. S. Whittam. 2000. Parallel evolution of virulence in pathogenic Escherichia coli. Nature 406:64-7.
52. Rossello-Mora, R., and R. Amann. 2001. The species concept for prokaryotes. 25:39.
53. Shigenobu, S., H. Watanabe, M. Hattori, Y. Sakaki, and H. Ishikawa. 2000. Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 407:81-6.
54. Shimkets, L. J. 1998. Bacterial genomes. Physical structure and analysis. Chapman and Hall, New York.
55. Sibley, C. G., and J. E. Ahlquist. 1987. DNA hybridization evidence of hominoid phylogeny: results from an expanded data set. J Mol Evol 26:99-121.
56. Sibley, C. G., J. A. Comstock, and J. E. Ahlquist. 1990. DNA hybridization evidence of hominoid phylogeny: a reanalysis of the data. J Mol Evol 30:202-36.
57. Skovgaard, M., L. J. Jensen, S. Brunak, D. Ussery, and A. Krogh. 2001. On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17:425-8.
58. Stackebrandt, E., W. Frederiksen, G. M. Garrity, P. A. D. Grimont, P. Kampfer, M. C. J. Maiden, X. Nesme, R. Rossello-Mora, J. Swings, H. G. Truper, L. Vauterin, A. C. Ward, and W. B. Whitman. 2002. Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52:1043-1047.
59. Stackebrandt, E., and B. M. Goebel. 1994. Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol 44:846-849.
60. Staley, J. T., and A. Konopka. 1985. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol 39:321-46.
61. Tiedje, J. M. 2000. Presented at the Proceedings of the 8th International symposium on microbial ecology, Halifax, Canada.
62. Tiedje, J.
M. 1995. Approaches to the comprehensive evaluation of prokaryotic diversity of a habitat. CAB International.
63. Torsvik, V., F. L. Daae, R. A. Sandaa, and L. Ovreas. 1998. Novel techniques for analysing microbial diversity in natural and perturbed environments. J Biotechnol 64:53-62.
64. Torsvik, V., J. Goksoyr, and F. L. Daae. 1990. High diversity in DNA of soil bacteria. Appl Environ Microbiol 56:782-7.
65. Treves, D. S., B. Xia, J. Zhou, and J. M. Tiedje. 2003. A two-species test of the hypothesis that spatial isolation influences microbial diversity in soil. Microb Ecol 45:20-8.
66. Vandamme, P., B. Pot, M. Gillis, P. de Vos, K. Kersters, and J. Swings. 1996. Polyphasic taxonomy, a consensus approach to bacterial systematics. Microbiol Rev 60:407-38.
67. Walker, E. M., J. K. Howell, Y. You, A. R. Hoffmaster, J. D. Heath, G. M. Weinstock, and S. J. Norris. 1995. Physical map of the genome of Treponema pallidum subsp. pallidum (Nichols). J Bacteriol 177:1797-804.
68. Wayne, L. G., D. J. Brenner, R. R. Colwell, P. A. D. Grimont, O. Kandler, M. I. Krichevsky, L. H. Moore, W. E. C. Moore, R. G. E. Murray, E. Stackebrandt, M. P. Starr, and H. G. Truper. 1987. Report of the Ad Hoc Committee on reconciliation of approaches to Bacterial Systematics. Int J Syst Bacteriol 37:463-464.
69. Welch, R. A., V. Burland, G. Plunkett, III, P. Redford, P. Roesch, D. Rasko, E. L. Buckles, S. R. Liou, A. Boutin, J. Hackett, D. Stroud, G. F. Mayhew, D. J. Rose, S. Zhou, D. C. Schwartz, N. T. Perna, H. L. T. Mobley, M. S. Donnenberg, and F. R. Blattner. 2002. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 99:17020-17024.
70. Whitman, W. B., D. C. Coleman, and W. J. Wiebe. 1998. Prokaryotes: The unseen majority. Proc Natl Acad Sci U S A 95:6578-6583.
71. Woese, C. R. 1987. Bacterial evolution. Microbiol Rev 51:221-71.
72. Zhou, J., B. Xia, D. S. Treves, L. Y. Wu, T. L. Marsh, R. V. O'Neill, A. V. Palumbo, and J. M. Tiedje. 2002.
Spatial and resource factors influencing high microbial diversity in soil. Appl Environ Microbiol 68:326-34.
73. Zumft, W. G. 1997. Cell biology and molecular basis of denitrification. Microbiol Mol Biol Rev 61:533-616.

CHAPTER 2

TRENDS BETWEEN GENE CONTENT AND GENOME SIZE IN PROKARYOTIC SPECIES

Parts of this chapter have been published in the article: K. T. Konstantinidis and J. M. Tiedje. Trends between gene content and genome size in prokaryotic species with larger genomes. Proc. Natl. Acad. Sci. U S A. 2004, 101(9):3160-5.

INTRODUCTION

The genome sequences of the smallest genome-sized prokaryotic species, the obligate endocellular parasites, have provided insight into the interrelationship between the ecology and genome evolution of these species (2, 12, 24). For instance, when compared to their free-living relatives, these reduced genomes have preferentially lost genes underlying the biosynthesis of compounds that can be easily taken up from the host, such as amino acids, nucleotides, and vitamins. Furthermore, regulatory elements, including σ factors, have commonly been eliminated from such symbiotic bacteria, presumably because the rather stable environment inside host cells renders extensive gene regulation useless (3, 11, 31). It is not yet clear whether there are also trends in gene allocation for the larger genome-sized free-living bacteria. If such trends do exist, they could reveal strategies of genome expansion, provide insight into the upper limit of genome size, reveal whether there is more centrally coordinated regulation, and, most importantly, suggest what ecological benefits accrue for such species. There is currently an increasing amount of evidence that favors the existence of universal trends between functional gene content and genome size.
For instance, Jordan et al.’s (16) analysis of 21 genomes showed that lineage-specific gene expansion is positively correlated with genome size and may account for up to 33% of the coding capacity of the genome. Furthermore, comparative genomic studies of Pseudomonas aeruginosa PAO1 and Streptomyces coelicolor A3(2), two larger-genome species, noted a disproportionate increase, relative to smaller genome-sized species, in regulatory and transport genes and in genes involved in secondary metabolism, respectively (5, 33). However, only a limited number of species were analyzed in both of these studies, and the analysis was restricted to specific functional processes. Furthermore, in the former study, no other species in the panel of strains evaluated had a genome size comparable to that of strain PAO1, a moderately large (6.3-Mb) genome-sized strain; thus, the significance of these findings for other large prokaryotic genomes is unknown. We sought to evaluate more comprehensively how the relative usage of the genome changes with genome size, using all sequenced genomes and evaluating all functional classes of genes.

MATERIAL AND METHODS

Functional annotation of all sequenced prokaryotic genomes. We undertook the functional characterization of 115 completed genomes deposited in the GenBank database as of May 2003 using the Clusters of Orthologous Groups (COG) database (34, 35). The list of genomes used in this study, together with the genome size, the GC% content, the total number of predicted protein-coding sequences (CDS), and the fraction of the CDS assignable to the COG database (see below) for each genome, is available in Table 2.1 of the Appendix. At the time of this study, the COG database comprised 144,320 protein sequences from 66 completed genomes, forming 4,873 groups of orthologous proteins (COG). Individual COG are clustered in 20 individual functional categories, which are further grouped in four major classes (see Table 2.1).
All possible CDS from the 115 genomes were assigned to a functional category according to the category in which their best COG homolog is classified. Homologs were identified by using the BLAST local alignment algorithm (1) and a cut-off of at least 30% identity at the amino acid level over 70% of the length of the query protein in pair-wise sequence comparisons. This cut-off is above the twilight zone of similarity searches, where inference of homology is error-prone due to low similarity between aligned sequences; thus, query proteins were presumably homologous to their COG match (28, 30). Homologous proteins can be either orthologs (homology through speciation) or paralogs (homology through lineage-specific gene duplication), and both paralogs and orthologs are assumed to retain the same biochemical function, whereas paralogs have usually diverged in specificity (9, 13). Therefore, CDS are expected to share at least the same general function with their COG matches. Perl scripts were used to edit CDS assignments where necessary, to format databases for BLAST searches, and to parse BLAST outputs automatically.

We further tested our findings from the COG database by using the publicly available data from the ortholog group table database at the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Comprehensive Microbial Resource database (CMR) supported by The Institute for Genomic Research (TIGR). The KEGG database classifies orthologous genes from all sequenced species into 24 functional categories (17). The same strategy as described above for COG was used to assign each CDS from 75 fully sequenced genomes (the same genomes used for the TIGR data below) to a KEGG functional category. TIGR performs an automated whole-genome annotation on any published microbial genome, which classifies genes in 19 redundant Role Categories (or functional categories), i.e., a single protein can be assigned to more than one category (27).
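As an illustration of the COG assignment rule described above (best COG homolog, with at least 30% amino acid identity over at least 70% of the query length), the following minimal Python sketch filters tabular BLAST hits and assigns each CDS to the functional category of its best-scoring hit. It is not the Perl code used in this study. The 12-column tabular format, the use of bit score to pick the best hit, and the names `parse_tabular`, `assign_categories`, `query_lengths`, and `subject_to_category` are assumptions made for the example.

```python
from collections import namedtuple

# One BLAST hit; only the fields needed for the cut-off are kept.
Hit = namedtuple("Hit", "query subject pident qstart qend bitscore")

MIN_IDENTITY = 30.0  # percent amino acid identity (cut-off from the text)
MIN_COVERAGE = 0.70  # aligned fraction of the query length (from the text)


def parse_tabular(lines):
    """Parse 12-column tabular BLAST output (assumed column order: query,
    subject, %identity, alignment length, mismatches, gap opens, q.start,
    q.end, s.start, s.end, e-value, bit score)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Hit(f[0], f[1], float(f[2]), int(f[6]), int(f[7]), float(f[11]))


def assign_categories(hits, query_lengths, subject_to_category):
    """Map each query CDS to the category of its best hit passing the cut-off.

    query_lengths: hypothetical dict, CDS id -> protein length.
    subject_to_category: hypothetical dict, COG protein id -> category.
    CDS with no passing hit are left out (unassigned)."""
    best = {}
    for h in hits:
        coverage = (h.qend - h.qstart + 1) / query_lengths[h.query]
        if h.pident < MIN_IDENTITY or coverage < MIN_COVERAGE:
            continue
        # "Best" is taken here as the highest bit score (an assumption).
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {q: subject_to_category[h.subject] for q, h in best.items()}
```

The fraction of a genome devoted to a category then follows by counting the assigned CDS per category and dividing by the total number of CDS in that genome.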
The number of proteins devoted to a Role Category for each of the 75 genomes incorporated in CMR as of July 2002 was obtained from the Multi Genome Query Tool at the CMR web site (www.spacetransportation.org_Detailed__44108.html). The amount of noncoding DNA in any genome was calculated by subtracting the sum of the lengths of the coding sequences annotated in the GenBank files from the estimated size of the genome.

RESULTS AND DISCUSSION

With the previously described strategy, we were able to assign, on average, 70.3% of the CDS in any genome to a COG functional category. Considering that a significant fraction of predicted genes (~15-20%) is species-specific in every genome sequenced so far (25), we have characterized the large majority of the gene repertoire of each cell.

Data Normalization. Our main objective was to study the relationship between the total CDS in the genome and the genomic fraction devoted to a functional category. To normalize for the different degrees of representation in the database, genomes with too many or too few genes homologous to the database were not included in inferring patterns with genome size. Genomes in which the percentage of genes homologous to the database fell within one standard deviation of the mean (mean 70.3%, SD 11.2) form the normalized set and are represented by solid squares (87 of the 115 genomes), whereas the rest are represented by open squares (Figure 2.1). Functional categories showed similar trends with total CDS in the genome both when the normalized set and when all genomes were considered (Table 2.1). However, trends with the normalized set should be more accurate because this set minimizes the bias in database representation. The power correlation gave among the highest R² values of the types of correlation tested for most functional categories. It should be mentioned, however, that there were typically very small differences between different models (e.g.
linear, power, logarithmic, etc.) in their ability to describe the trends with total CDS in the genome (data not shown). Thus, no assumptions can be made about the mechanisms underlying the relationship between functional gene content and total CDS in the genome. Last, the use of genome size instead of total CDS in the genome gave identical results due to the high correlation (R² = 0.98) between these two parameters of the genome (Figure 2.3A). Therefore, total CDS in the genome and genome size are used interchangeably in the following text.

[Table 2.1. Summary of the correlation between COG functional categories and the total number of CDS in the genome.]

Major Trends with Genome Size. To identify major universal trends, as opposed to ones that are attributable to the preferential gene loss in the reduced genomes, the analysis was repeated including only normalized genomes that had at least 2,000 CDS annotated in their genomic sequences. COG functional categories that showed correlation with genome size for both sets tested (i.e., all solid squares and solid squares with >2,000 CDS) were considered cases of major trends, and these categories are shown in Figure 2.1. Categories that showed correlation with genome size (at a P value threshold of 0.01) for only one of the two sets of genomes tested were considered cases of minor trends and are shown in Figure 2.2. All findings are summarized in Table 2.1.

Among the informational categories, translation, ribosomal structure and biogenesis, and DNA replication, recombination and repair showed a strong negative correlation with genome size, whereas transcription (transcription apparatus and transcription control genes) showed a strong positive correlation (Figure 2.1, left). Of the cellular processing categories, the percent of genes in the cell division and chromosome partitioning category showed a small decrease with genome size (~1-2%), whereas the percent of genes related to signal transduction mechanisms and to cell motility increased strongly and moderately with genome size, respectively (Figure 2.1, center). Among the individual metabolism categories, nucleotide transport and metabolism showed a strong negative correlation with genome size, whereas energy production and conversion, and secondary metabolite biosynthesis, transport, and catabolism showed a moderate and a strong positive correlation with genome size, respectively (Figure 2.1, right). Notably, genomes with <2,000 CDS have almost no secondary metabolism related genes (Figure 2.1, right).

[Figure 2.1. COG functional categories that showed major trends with genome size. Left: informational categories; center: cellular processing categories; right: metabolism categories. Solid squares: genomes in the normalized set; open squares: remaining genomes.]

Minor Trends with Genome Size. The categories of posttranslational modification and protein turnover, inorganic ion transport and metabolism, intracellular trafficking and secretion, amino acid transport and metabolism, and function unknown showed correlation only when all solid squares were considered, i.e., there was no correlation for solid squares with >2,000 CDS (Table 2.1 and Figure 2.2).
Therefore, these trends are attributable to the preferential gene loss in the reduced genomes. Furthermore, several categories that were universally correlated with total CDS in the genome showed stronger correlation with all solid squares compared to solid squares with >2,000 CDS. Thus, categories such as transcription, signal transduction, and secondary metabolite biosynthesis are also affected by preferential gene loss in the reduced genomes. These results are in good agreement with the current knowledge of which functional categories are more likely to have been reduced in the symbiotic genomes.

[Figure 2.2. COG functional categories that showed minor trends with genome size.]
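The normalization and curve fitting described in this section can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it uses the sample standard deviation, fits the power model y = a * x**b by least squares on log-transformed values, and reports R² in log space. The thesis does not specify these details, and the function names `normalized_set` and `power_fit` are hypothetical.

```python
import math
from statistics import mean, stdev


def normalized_set(assigned_pct):
    """Return the genomes whose percentage of database-assignable CDS lies
    within one standard deviation of the mean (sample SD is an assumption).

    assigned_pct: dict, genome name -> percent of CDS assigned to a category."""
    values = list(assigned_pct.values())
    m, sd = mean(values), stdev(values)
    return {g for g, p in assigned_pct.items() if abs(p - m) <= sd}


def power_fit(total_cds, category_pct):
    """Fit y = a * x**b by ordinary least squares on (ln x, ln y).

    Returns (a, b, r2), with r2 computed in log space. Pairs with a
    non-positive value are skipped because their logarithm is undefined."""
    pairs = [(math.log(x), math.log(y))
             for x, y in zip(total_cds, category_pct) if x > 0 and y > 0]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx                 # exponent of the power model
    a = math.exp(my - b * mx)     # multiplicative constant
    r2 = (sxy * sxy) / (sxx * syy)
    return a, b, r2
```

Such a fit could be run once on all genomes in the normalized set and once on those with more than 2,000 CDS, mirroring the two sets compared in the text. The P value test used to separate major from minor trends is not reproduced here.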
On the other hand, the categories of defense mechanisms and lipid metabolism showed correlation only when solid squares with >2,000 CDS were considered (Table 2.1 and Figure 2.2). These trends, however, are more likely a database artifact due to the under-representation of large genomes than a real preferential accumulation of such genes by the large genomes. The fact that there were several small genomes with high percentages of CDS devoted to these categories (which accounted for the lack of correlation when all solid squares were considered) supports the former interpretation. Last, it should be mentioned that most minor trends involved weak correlations and small changes (~1-2%) in the fraction of the genome devoted to the corresponding functional categories.

Non-coding DNA and Hypothetical CDS.
Interestingly, the genomic fraction assigned to hypothetical CDS (i.e., poorly characterized categories) remained constant for genomes with >2,000 CDS. Moreover, the fraction of non-coding DNA was also invariable (at ~12-14% of the genome) for all 115 genomes evaluated (Figure 2.3B), which confirmed previous results from a smaller set of species (22). Therefore, the large prokaryotic genomes overall are not explained by disproportionate accumulation of junk DNA, i.e., hypothetical genes or non-coding sequence. In contrast, genomes with <2,000 CDS have a smaller percentage of function unknown (or conserved hypothetical) CDS than larger genome-sized species. This suggests that some of these genes, if they indeed code for proteins, have dispensable functions in the larger genome-sized bacteria. If these genes follow the trends of the other functional categories, then these unknown genes may be involved in regulation or secondary metabolism rather than in informational processes. Nonetheless, a significant fraction (~3%) of the genes in the reduced genomes remains attributable to the function unknown category. Their retention suggests that at least some of the conserved hypothetical genes encode functional proteins.

Figure 2.3. Correlation among total number of CDS in the genome, non-coding DNA, and genome size for prokaryotic genomes. (A) The total number of CDS in the genome vs. the genome size for 115 completed prokaryotic genomes. (B) The total amount of non-coding DNA in the genome (in kb) vs. genome size.

Factors Other than Genome Size.
The correlation R2 values indicate that genome size can only partially explain some of the shifts in gene content. Strain-specific traits are assumed to be responsible for the dispersion of data points around the mean, which is pronounced for several functional categories. For example, by examining individual COG, we conclude that the number of the prevalent ABC transporter genes (and transport genes in general) increased proportionately with genome size (i.e., the genomic fraction devoted to them remained constant), and there was little dispersion around the mean, suggesting a universal relationship with genome size (Figure 2.4). However, specific bacterial groups such as the ecologically versatile α-Proteobacteria Agrobacterium and Mesorhizobium sp. had a disproportionately increased number of ABC transporters, whereas more habitat-specific bacteria such as the γ-Proteobacteria Xanthomonas sp. had fewer ABC transporters than average. As far as traits other than total CDS in the genome are concerned, we evaluated whether the ribosomal RNA operon (rrn) copy number could explain some of the shifts in functional gene content. The rrn copy number typically had a small effect on functional gene content compared to the total CDS in the genome.
However, in the case of carbohydrate transport and metabolism, the correlation was stronger for rrn copy number [...]

Figure 2.4. ABC transporter genes proportionately increase with genome size. Y-axis is the number of genes attributable to ABC [...] (Labeled points: A. tumefaciens, S. meliloti; R2 = 0.63.)

[...] 0.4 for all categories), whereas the regulation category was positively correlated with genome size (R2 > 0.5), similar to the COG data.

Figure 2.5. Evidence for functional biases with genome size from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and The Institute for Genomic Research (TIGR) annotation databases. Y-axes are the genome portions (CDS attributable to a functional category divided by total number of CDS in the genome) devoted to the specific functional category, and x-axes are the corresponding microbial genome sizes. Solid and open squares are used as previously for COG data (Figure 2.1). Corresponding functional categories between the two databases are placed next to each other. Panels: Translation (KEGG) and Protein synthesis (translation) (TIGR); Signal Transduction (KEGG) and Regulatory proteins (TIGR).

We also analyzed the 39 partially sequenced genomes in the JGI database in the same way. This is a collection of exclusively environmental strains, which includes seven strains with genome sizes >6 Mb (average genome size, 3.83 vs. 3.23 Mb in the closed set). Although trends between gene functional categories and total CDS in the genome for JGI genomes were very similar to those for the fully sequenced genomes (data not shown), only 59.8% (vs. 70.3% for the closed set) of the CDS in the JGI set were assignable to a COG category.
This may indicate that this genome set samples more of the uncharacterized genes in nature, although some of the difference is likely due to the lack of manual curation of the annotation.

Bacteria vs. Archaea.
Our analysis also revealed some notable but small differences between Bacteria and Archaea in the relative usage of the genome for the different cell functions (Figure 2.6). Archaea appeared to devote a higher genomic portion to energy production and conversion, coenzyme metabolism, and poorly characterized categories than their bacterial counterparts of the same genome size. On the other hand, Archaea had relatively fewer genes involved in carbohydrate transport and metabolism, cell envelope and membrane biogenesis, and inorganic ion transport and metabolism. Some of the differences, such as those concerning the energy production, cell envelope, and general prediction-only categories, were more strongly supported by the data (compare error bars in Figure 2.6).

Figure 2.6. Differences between Archaea and Bacteria in the relative usage of the genome. Bars represent the average from 34 bacterial and 12 archaeal genomes, which have between 1,500 and 3,500 CDS (to avoid any genome size effect on the data). Only normalized genomes have been included (see text). Averages are statistically different by two-tailed t test, assuming unequal variances, at the 0.05 significance level. Functional categories that had <2% of the genes in the genome are not shown. Y-axis: percent of genes. X-axis (functional category): carbohydrate metabolism, cell membrane, inorganic ion metabolism, coenzyme metabolism, energy production, unknown function, general prediction.

A set of archaeal-specific proteins, in addition to the standard proteins encountered in a typical prokaryotic cell, would explain the higher genomic fraction in the above categories for Archaea. In agreement with this hypothesis, Graham et al.
(14), in an attempt to define an archaeal genomic signature, concluded that genes with no detectable bacterial or eukaryotic homologs mostly involve energetic systems and cofactor biosynthesis, e.g., genes involved in methanogenesis. On the other hand, the fewer genes for cell-wall biogenesis are probably attributable to the fact that Archaea possess a different cell wall from Bacteria. Archaea lack peptidoglycan in their cell wall, and peptidoglycan biosynthesis requires a battery of enzymes in Bacteria (19). Furthermore, the archaeal cell wall components and metabolism have not been studied to the same extent as those of Bacteria and hence are missing from the database.

What Is Gained in a Large Genome?
Our analysis showed that larger genomes preferentially accumulate regulation, secondary metabolism, and, to a smaller degree, energy conversion-related genes as opposed to informational ones, judging from the inverse pattern for these classes with genome size (Figure 2.7). We performed the same analysis in May of 2002, using the 75 genomes available at that time and a database of 3,852 COG groups (vs. 4,873 COG currently). The results for this set and for the expanded set of 115 genomes presented herein were very consistent, and correlations were often more significant in the latter set. This consistency gives higher confidence in the trends reported.

Secondary metabolism and energy conversion, rather than general metabolism, are disproportionately expanded in larger genomes and thus should explain a large part of the broad metabolic diversity that characterizes large genome-sized species. The expansion involved both expansions of specific COG and de novo acquisitions of new COG (or pathways), with the latter case being roughly twice as frequent as the former (data not shown). On the other hand, the genes assignable to the remaining metabolism categories (except nucleotide metabolism) and to several cellular processes categories increase only proportionally with genome size (similar to the example of ABC transporter genes mentioned previously).

Regardless of a proportional or disproportional increase in metabolic or cellular pathways, large genome-sized species would need increased regulation to successfully control the extensive metabolic repertoire they apparently possess under different growth conditions. Thus, it is not surprising that regulatory genes, i.e., those for transcription control and signal transduction, dominate the genes that are disproportionately increased in larger genomes. In addition, many regulation systems are expected to cross talk, because their genes share high sequence similarity (paralogous genes of expanded gene families), which suggests increased complexity in regulation as well.

Figure 2.7. Summary of the shifts in gene content with genome size in prokaryotic genomes. The bars represent the sum of the COG functional categories, which showed strong correlation with genome size and are involved in the same major cellular processes. Only normalized genomes (represented by solid squares in Figure 2.1) have been included. Error bars represent the standard deviation from the mean except for the last genome size class, where error bars represent the data range due to the small number of normalized genomes in this class (three genomes). Series: translation, DNA replication and repair, and DNA metabolism; regulation (transcription control and signal transduction); metabolism and transport. X-axis: total CDS in the genome (<1,500; 1,500-3,000; 3,000-4,500; 4,500-6,000; >6,000).
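The difference between a proportional and a disproportionate increase with genome size can be made concrete with a short calculation. The sketch below is one possible way to express it and is not the method used in this study: it assumes that gene counts scale roughly as a power of total CDS, so that a log-log slope near 1 corresponds to a constant genomic fraction. The function names, the tolerance, and all counts are hypothetical.

```python
import math


def loglog_slope(total_cds, category_counts):
    """Least-squares slope of log(category count) against log(total CDS)."""
    xs = [math.log(c) for c in total_cds]
    ys = [math.log(k) for k in category_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def classify(slope, tolerance=0.1):
    """Label a slope; the tolerance is an arbitrary illustrative choice."""
    if slope > 1 + tolerance:
        return "disproportionate increase"  # genomic fraction grows with genome size
    if slope < 1 - tolerance:
        return "fraction decreases"         # count grows more slowly than the genome
    return "proportional increase"          # genomic fraction stays roughly constant


# Invented counts for four hypothetical genomes of increasing size.
total_cds = [1000, 2000, 4000, 8000]
transport = [50, 100, 200, 400]      # constant 5% of the genome
regulation = [10, 40, 160, 640]      # fraction doubles with each doubling of CDS
translation = [100, 110, 115, 120]   # nearly fixed number of genes

for name, counts in [("transport", transport),
                     ("regulation", regulation),
                     ("translation", translation)]:
    slope = loglog_slope(total_cds, counts)
    print(f"{name}: slope {slope:.2f}, {classify(slope)}")
```

Under this framing, the invented transport counts behave like the ABC transporter example, the regulation counts like the disproportionately expanded categories, and the translation counts like the informational categories whose genomic fraction falls as genomes grow.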
In agreement with these interpretations, all species with genome sizes >6 Mb in our set are free-living bacteria that can grow in very diverse environments, several using alternative electron acceptors and a great range of substrates for energy production (Table 2.2).

Table 2.2. Genomic information and ecological niche(s) of species with a genome size larger than 6 Mb.

Species                        Genome size (Mb)  % in COGs*  Ecological niche
Bacteroides thetaiotaomicron   6.26              33.5        Human gut, metabolically versatile
Bradyrhizobium japonicum       9.11              60.4        Soil, rhizosphere. N2-fixing symbiont of legumes
Mesorhizobium loti             7.59              69          Soil, rhizosphere. N2-fixing symbiont of legumes
Nostoc sp.                     7.2               58.2        Cyanobacteria, ubiquitous in nature. Photosynthetic
Pseudomonas sp. (aver. of 3)   6.2-6.4           69-80       Soil, water. Opportunistic pathogen of plants, humans
Sinorhizobium meliloti         6.7               63          Soil, rhizosphere. N2-fixing symbiont of legumes
Streptomyces avermitilis       9.03              48.8        Ubiquitous in soil. Very versatile metabolically
Streptomyces coelicolor        8.67              40          Ubiquitous in soil. Very versatile metabolically

*All environmental and non-proteobacteria strains (shown in bold in the original table) have ≤58.2% (vs. an average of 70.3%) of their genes homologous to COG proteins (3rd column). This indicates that the over-representation of specific lineages (e.g., proteobacteria) and clinical strains in the database has possibly biased our knowledge of microbial functional gene content.

The negative correlation of the informational and DNA metabolism categories with genome size is equally interesting (Figures 2.1 and 2.7). This trend suggests that a similar number of informational and DNA metabolism-related proteins is able to cope with an increased number of genes. For instance, there is a relatively small increase (~20%) in the absolute number of genes in the translation category between 2- and 8-Mb-sized genomes. This may be attributable to there being sufficient informational processes present and active at any time in the cell.
Thus, when there is an unusual demand for informational proteins because of a larger genome, their transcription or posttranslational modification can be regulated accordingly to yield sufficient active proteins.

A Hypothesis for Large Genomes.
Presumably the interactions between the organism and particular habitat(s) have selected for genome expansion. Large genomes do not appear to be uncommon in nature (Table 2.2 and JGI genomes), and hence they must have value. As noted above, all over-amplified gene families are associated directly or indirectly (regulation) with metabolism. However, the lack of knowledge of the population sizes and activities of such species in natural environments does not allow specific inferences about which environmental factors may have fostered genome expansion. In contrast, genome evolution in endosymbiotic bacteria is much better understood. The relief from selection for specific pathways and regulation systems, along with population bottlenecks that allow more rapid fixation of mutations, are proposed to determine their genome evolution (2, 10, 22). Also, the higher number of bacterial generations in these non-nutrient-limiting environments probably facilitates loss of DNA through spontaneous recombination events at repeated or mobile sequences (2, 10).

One hypothesis for large genomes consistent with the above data is that Bacteria with such genomes are more dominant, population-wise, in environments where resources are scarce but diverse and where there is little penalty for slow growth. These are characteristics of soil. In support of this, Mitsui et al. (23) and Klappenbach et al. (18) found slow-growing oligotrophic α-Proteobacteria to be more dominant in soil. In the former study, many of these isolates were nonsymbiotic members of the Rhizobiaceae and Bradyrhizobiaceae (23, 29), families that have genomes >6-8 Mb. Growth rates in soil are thought to be low, with a mean of about three generations per year (15).
Although this study shows some clear trends between gene content and genome size, the dispersion around the mean for many categories suggests that features other than genome size likely explain what is gained in larger genomes. These traits need to be explored for a fuller understanding of the interactions between ecology and genome evolution. This study also draws attention to the limited number of large genomes sequenced to date. The possibility that large genomes represent a significant fraction of the extant microbial world and that they may possess unique traits missed in the current annotation knowledge is a major challenge for microbiologists.

A CASE STUDY: THE BURKHOLDERIA CEPACIA COMPLEX.
The previously described work indicates what is gained in a large genome and suggests that the interactions between the organism and particular habitat(s) select the organism's genome size and gene content. In order to expand understanding of the latter, we have performed a similar genomic analysis on a model bacterial group, the Burkholderia cepacia complex (Bcc) (β-Proteobacteria). The Bcc was chosen because its members are phylogenetically very close, as opposed to previous work that included comparisons between distantly related organisms. This facilitates comparative analysis and could be informative about differences between short and long evolutionary scales. Furthermore, a substantial body of information on the ecological and physiological differences of its members is available.

Background on Burkholderia cepacia complex.
The Bcc consists of ten closely related species (Figure 2.8), which share a high degree of 16S rRNA and recA sequence similarity (98-100% and 94-95%, respectively) and moderate levels of DNA-DNA reassociation homology (30-50%) (8).
Members of the Bcc are successful in very different ecological niches, including rhizosphere colonization, biodegradation of pollutants, plant pathogenesis, and chronic infection of cystic fibrosis (CF) patients, which frequently results in necrotizing pneumonia (known as the "B. cepacia syndrome") (4, 7, 26). Moreover, Bcc species are among the most versatile bacterial species known, e.g., the type strain of Burkholderia cepacia (formerly Pseudomonas cepacia) has been shown to catabolize more than 200 organic sources of carbon (20). While Bcc species have among the largest prokaryotic genomes, the genome size distribution of the group is very wide, ranging from 6 to 9 Mb (21). Interestingly, the genome is typically organized in 3-4 replicons, which is thought to give Bcc strains genomic plasticity and ecological versatility.

To help understand how the group as a whole has adapted to these very different environments, three Bcc genomes have recently been sequenced. These genomes are: B. cenocepacia J2315, an enhanced virulent pathogen in CF; B. cepacia ATCC 17760, one of the classical Stanier collection of strains, isolated from Trinidad forest soil (32); and B. vietnamiensis strain G4 (ATCC 53617), a rhizosphere-colonizing strain that also oxidizes the groundwater pollutant trichloroethene. The J2315 genome is now fully sequenced by the Sanger Center and consists of three chromosomes, 3.9, 3.2, and 0.9 Mb in size, whereas G4 and ATCC 17760 are currently at high draft status, e.g., the available sequence covers >95% of the strain's genomic DNA (6). The estimated genome sizes of the sequenced strains are: J2315, 8 Mb; ATCC 17760, 8.7 Mb; and G4, 8.5 Mb.

Figure 2.8. The Burkholderia cepacia complex and its relationship to other Burkholderia spp. 16S rRNA phylogenetic tree (based on the neighbour-joining method) showing the phylogenetic relationships of Bcc and other Burkholderia and Ralstonia species. Arrows indicate species that are sequenced or are currently being sequenced. Scale: percent sequence dissimilarity. Outgroup: Alcaligenes faecalis. Legible arrow labels: B. mallei (TIGR), B. glumae (SNU-Macrogen, S. Korea), B. cepacia ATCC 17760 (JGI), B. cenocepacia J2315 (Sanger Center), B. vietnamiensis G4 (JGI), B. xenovorans LB400 (JGI), Ralstonia solanacearum (Genoscope), Ralstonia metallidurans (JGI), and Ralstonia eutropha (JGI).

Genomic comparisons among the Bcc genomes.
Comparative whole-genome analysis of the three available Bcc genomes reveals about 4,200 predicted protein-coding sequences (CDS) that are conserved in all three genomes (Figure 2.9). The distribution of gene functions in this Bcc conserved gene core follows closely the trends with genome size reported in the previous section of Chapter 2, e.g., regulation and metabolism functions are disproportionately increased relative to information functions according to the correlations described previously (analytical data not shown). In addition, when compared to an average of all closed genomes with a comparable number of CDS (i.e., 4,000-5,000 CDS; average from 12 genomes), the Bcc conserved gene core reveals an excess of metabolism genes, particularly genes involved in metabolism and transport of amino acids, carbohydrates, and ions, and of regulation genes (Figure 2.10). These results are in good agreement with the exceptional metabolic and ecological versatility that characterizes Bcc relative to other bacterial species and reveal universal trends in genome expansion for Bcc species.

Figure 2.9. Venn diagram showing the gene complements of the currently available Bcc genomes (J2315, G4, and ATCC 17760). Conserved genes were defined by whole-genome pairwise sequence comparisons with the BLAST algorithm (1), using a cut-off of 30% identity (amino acid level) over at least 70% of the length of the query CDS. Parentheses denote the fraction of the strain-specific genes that has unknown function.

The genomic comparisons also revealed that the pool of genes unique to each strain (strain-specific) is substantial, accounting for ~1,200 genes in the clinical strain J2315 and reaching 1,400 and 2,500 genes in the two environmental strains G4 and ATCC 17760, respectively (Figure 2.9). These results reveal a surprising level of genetic diversity within the Bcc, given that these species are so closely related that distinguishing them by conventional means is frequently difficult. The majority of these strain-specific genes have hypothetical or poorly characterized function (i.e., very low similarity to genes in public databases), which indicates that many functions in Bcc remain undiscovered (Figure 2.9).

Nonetheless, a substantial fraction of the strain-specific genes can be assigned to a well-characterized biological function, and we have further investigated this set of genes to gain insight into what drives genome expansion within each strain and to identify traits that are important in different ecological niches. Our results show that these strain-specific genes are closely associated with the known ecological properties of each strain.
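The conservation criterion stated in the Figure 2.9 caption (at least 30% amino acid identity over at least 70% of the query CDS length) and the resulting split into a conserved core and strain-specific genes can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the pipeline used here: the `Hit` record, the function names, the inclusive thresholds, and the use of aligned length on the query as the coverage measure are all hypothetical choices, and the hits are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise protein hit of a query CDS against another genome (hypothetical record)."""
    query: str
    subject_genome: str
    percent_identity: float  # amino acid identity of the alignment, in percent
    aligned_length: int      # query residues covered by the alignment (assumed definition)
    query_length: int        # full length of the query CDS, in residues


def is_conserved(hit, min_identity=30.0, min_coverage=0.70):
    """Apply the identity and query-coverage cut-offs; inclusive bounds are an assumption."""
    coverage = hit.aligned_length / hit.query_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage


def partition(query_cds, hits, other_genomes):
    """Split query CDS into those conserved in all other genomes and those found in none."""
    others = set(other_genomes)
    found = {cds: set() for cds in query_cds}
    for hit in hits:
        if hit.query in found and is_conserved(hit):
            found[hit.query].add(hit.subject_genome)
    conserved_core = {cds for cds, genomes in found.items() if genomes >= others}
    strain_specific = {cds for cds, genomes in found.items() if not genomes & others}
    return conserved_core, strain_specific


# Invented hits for three hypothetical CDS of one reference genome.
toy_hits = [
    Hit("cds1", "G4", 45.0, 90, 100),
    Hit("cds1", "ATCC 17760", 60.0, 80, 100),
    Hit("cds2", "G4", 25.0, 95, 100),          # fails the identity cut-off
    Hit("cds2", "ATCC 17760", 80.0, 50, 100),  # fails the coverage cut-off
    Hit("cds3", "G4", 35.0, 75, 100),          # conserved in only one other genome
]

core, specific = partition(["cds1", "cds2", "cds3"], toy_hits, ["G4", "ATCC 17760"])
print("core:", sorted(core), "strain-specific:", sorted(specific))
```

A CDS such as the invented `cds3`, which passes the cut-offs against only one of the two other genomes, falls into neither set; it corresponds to a pairwise-overlap region of the Venn diagram.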
For example, G4 is a successful root colonizer and degrader of pollutants, and the G4-specific set mostly involves metabolism genes such as oxidoreductases, oxygenases, cytochrome-flavoproteins, and transport genes, which are presumably related to the degradation of aromatic and polychlorinated compounds (Figure 2.10). The majority of the ATCC 17760-specific genes are also involved in metabolism, but the specific functions enriched are rather different from the ones identified for G4. For instance, ATCC 17760 has many unique genes for sugar and carbohydrate metabolism and transport, such as acetyltransferases, oxidoreductases, and lyases; several large gene clusters for the production of polyketides (antibiotics) such as phenazine; and excreted Fe(III)-binding proteins.

Figure 2.10. Functional annotation of the conserved gene core and the strain-specific genes for the three sequenced Bcc genomes. Bars represent the number of genes (y-axis, number of CDS) assignable to the four major classes (full description on x-axis) and the individual categories of the COG database (single-letter description on x-axis; for annotation of the letters see Table 2.1). (A) Solid bars represent the conserved gene core of the three available Bcc genomes, while open bars represent the average from all genomes available in GenBank that have a number of protein-coding genes (4,000-5,000) comparable to the conserved core of the sequenced Bcc genomes. Panels B, C, and D show the annotation of the strain-specific genes for the J2315, ATCC 17760, and G4 genomes, respectively. Designations for each functional category have been omitted from the x-axes for simplicity.
These genes may explain the success of ATCC 17760 as a soil colonizer. The G4-specific gene set also includes a plethora of mobile elements, e.g., transposase and prophage-like elements. Interestingly, the only other Burkholderia strain that includes a comparable number of mobile elements is B. xenovorans str. LB400, which is the best-known polychlorinated biphenyl (PCB) degrader. In fact, many of the G4 mobile elements are conserved in LB400 and not conserved in any of the more closely related Bcc strains. It follows that these mobile elements may be an important trait in biodegradation settings. In such settings, bacteria typically encounter a variety of different pollutant compounds (rather than a single substrate), and hence genomic plasticity and potential for diversification may be more important traits than cell stability and fitness, since some of these mobile elements consume resources and could be lethal for the cell when activated.

Chromosomal biases in terms of genetic diversity.
We further examined the set of strain-specific genes to gain a better understanding of how genetic diversity is created in Bcc species. Analysis based on the J2315 genome, which is closed and therefore easier to analyze, reveals that the number of genes of unknown function is biased towards the smaller chromosomes. For instance, 21.1%, 31.7%, and 34.4% of the genes in the largest, medium, and smallest chromosome, respectively, cannot be assigned to the COG database and hence have a hypothetical function (Figure 2.11). When we examined how conserved the genes of each chromosome are in the other two Bcc genomes, we noted a similar trend, i.e., the smaller chromosomes harbor more of the J2315-specific genes. For example, only 50% of the genes in the smallest chromosome have homologs in ATCC 17760 or G4, as opposed to >70% for the large chromosome (Figure 2.11).
Further, about half of the J2315-specific genes have a GC content that differs by >5% from the average of the J2315 genome, suggesting horizontal acquisition of a large fraction of the strain-specific CDS. Interestingly, the majority of the J2315 CDS with a GC content more than 5% below the genome average are also J2315-specific, whereas this is less pronounced for J2315 CDS with a GC content more than 5% above the average (compare gray with white bars in Figure 2.11), indicating that horizontal transfer is more frequent from low-GC than from high-GC donors. Comparable results were noted when ATCC 17760 or G4 was used as the reference genome instead of J2315 (data not shown).

These findings suggest that each chromosome in Bcc species has a different evolutionary history, and perhaps origin, and may indicate that the Bcc species have a mechanism to control where diversity is created in the genome. Further, these findings show that substantial genome evolution and gene turnover take place within very short evolutionary scales, presumably as a result of the interaction between the organism and particular habitat(s), similar to the results reported previously for all bacterial genomes and longer evolutionary scales.

Figure 2.11. Bias in the amount of genetic diversity carried by each chromosome and in the GC% composition of the genes that differ between Bcc genomes. (A) Striped bars represent the percent of genes in each chromosome of strain J2315 that are assignable to the COG database, while the remaining bars show what fraction of the genes in each chromosome is conserved in the other available Burkholderia genomes (graph legend: ATCC 17760, G4, B. pseudomallei). Centered open squares show the number of genes in each chromosome (right y-axis), while the leftmost bars show the same values as above for all genes in the genome (i.e., the average). (B) Black bars represent the total number of J2315 CDS that, based on pairwise whole-genome comparisons, do not have homologs (i.e., they are J2315-specific) in the other Bcc strain (x-axis: ATCC 17760, G4), while gray and open striped bars represent the fraction of these J2315-specific CDS that has a GC content more than 5% below and more than 5% above the average of the J2315 genome, respectively. Gray and white bars represent the total number of CDS in the J2315 genome that have a GC content more than 5% below or more than 5% above the average of the J2315 genome, respectively.

ACKNOWLEDGMENTS
We thank Tom Schmidt, Rebecca Grumet, Joel Klappenbach, Frank Larimer, and an anonymous reviewer for helpful discussions regarding the manuscript. This work was supported by the Bouyoukos Fellowship Program (K.T.K.), the U.S. Department of Energy's Microbial Genome Program, and the Center for Microbial Ecology.

REFERENCES
1. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-402.
2. Andersson, S. G., and C. G. Kurland. 1998. Reductive evolution of resident genomes. Trends Microbiol 6:263-8.
3. Andersson, S. G., A. Zomorodipour, J. O. Andersson, T. Sicheritz-Ponten, U. C. Alsmark, R. M. Podowski, A. K. Naslund, A. S. Eriksson, H. H. Winkler, and C. G. Kurland. 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133-40.
4. Balandreau, J., V. Viallard, B. Cournoyer, T. Coenye, S. Laevens, and P. Vandamme. 2001.
Burkholderia cepacia genomovar III is a common plant-associated bacterium. Appl Environ Microbiol 67:982-5.
5. Bentley, S. D., K. F. Chater, A. M. Cerdeno-Tarraga, G. L. Challis, N. R. Thomson, K. D. James, D. E. Harris, M. A. Quail, H. Kieser, D. Harper, A. Bateman, S. Brown, G. Chandra, C. W. Chen, M. Collins, A. Cronin, A. Fraser, A. Goble, J. Hidalgo, T. Hornsby, S. Howarth, C. H. Huang, T. Kieser, L. Larke, L. Murphy, K. Oliver, S. O'Neil, E. Rabbinowitsch, M. A. Rajandream, K. Rutherford, S. Rutter, K. Seeger, D. Saunders, S. Sharp, R. Squares, S. Squares, K. Taylor, T. Warren, A. Wietzorrek, J. Woodward, B. G. Barrell, J. Parkhill, and D. A. Hopwood. 2002. Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 417:141-7.
6. Branscomb, E., and P. Predki. 2002. On the high value of low standards. J Bacteriol 184:6406-9; discussion 6409.
7. Coenye, T., and P. Vandamme. 2003. Diversity and significance of Burkholderia species occupying diverse ecological niches. Environ Microbiol 5:719-29.
8. Coenye, T., P. Vandamme, J. R. Govan, and J. J. LiPuma. 2001. Taxonomy and identification of the Burkholderia cepacia complex. J Clin Microbiol 39:3427-36.
9. Eisen, J. A. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 8:163-7.
10. Frank, A. C., H. Amiri, and S. G. Andersson. 2002. Genome deterioration: loss of repeated sequences and accumulation of junk DNA. Genetica 115:1-12.
11. Fraser, C. M., J. D. Gocayne, O. White, M. D. Adams, R. A. Clayton, R. D. Fleischmann, C. J. Bult, A. R. Kerlavage, G. Sutton, J. M. Kelley, et al. 1995. The minimal gene complement of Mycoplasma genitalium. Science 270:397-403.
12. Galperin, M. Y., and E. V. Koonin. 1999. Functional genomics and enzyme evolution. Homologous and analogous enzymes encoded in microbial genomes. Genetica 106:159-70.
13. Gerlt, J. A., and P. C. Babbitt. 2001.
Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafarnilies. Annu Rev Biochem 70:209-46. Graham, D. E., R. Overbeek, G. J. Olsen, and C. R. Woese. 2000. An archaeal genomic signature. Proc Natl Acad Sci U S A 97 :3304-8. Grey, T., and S. Willimas. 1971. Microbial productivity in soil. Symposia of the Society for General Microbiology 21:255-286. Jordan, I. K., K. S. Makarova, J. L. Spouge, Y. I. Wolf, and E. V. Koonin. 2001. Lineage-specific gene expansions in bacterial and archaeal genomes. Genome Res 11:555-65. Kanehisa, M., and S. Goto. 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28:27-30. Klappenbach, J. A., J. M. Dunbar, and T. M. Schmidt. 2000. rRNA operon copy number reflects ecological strategies of bacteria. Appl Environ Microbiol 66:1328-33. Konig, H. 1988. Archaeobacterial cell envelope. Can. J. Microbiol. 34:395-406. Lessie, T., and T. Gaffney. 1986. The Bacteria: A Treatise on Structure and Function. Academic, New York. Lessie, T. G., W. Hendrickson, B. D. Manning, and R. Devereux. 1996. Genomic complexity and plasticity of Burkholderia cepacia. FEMS Microbiol Lett 144:117-28. Mira, A., H. Ochman, and N. A. Moran. 2001. Deletional bias and the evolution of bacterial genomes. Trends Genet 17:589-96. Mitsui, H., K. Gorlach, H. Lee, R. Hattori, and T. Hattori. 1997. Incubation time and media requirements of culturable bacteria from different phylogenetic groups. J. Microbiolog. Methods. 30: 1 03-1 10. -63- 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. Moran, N. A. 2002. Microbial minimalism: genome reduction in bacterial pathogens. Cell 108:583-6. Nelson, K. E., I. T. Paulsen, J. F. Heidelberg, and C. M. Fraser. 2000. Status of genome projects for nonpathogenic bacteria and archaea. Nat Biotechnol 18:1049-54. Parke, J. L., and D. Gurian-Sherman. 2001. Diversity of the Burkholderia cepacia complex and implications for risk assessment of biological control strains. 
Annu Rev Phytopathol 39:225-58. Peterson, J. D., L. A. Umayam, T. Dickinson, E. K. Hickey, and 0. White. 2001. The Comprehensive Microbial Resource. Nucleic Acids Res 29:123-5. Rost, B. 1999. Twilight zone of protein sequence alignments. Protein Eng 12:85- 94. Saito, A., H. Mitsui, R. Hattori, K. Minamisawa, and T. Hattori. 1998. Slow- growing and oligotrophic soil bacteria phylogenetically close to Bradyrhizobium japonicum. FEMS Microb. Ecol. 25:277-286. Sander, C., and R. Schneider. 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56-68. Shigenobu, S., H. Watanabe, M. Hattori, Y. Sakaki, and H. Ishikawa. 2000. Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 407:81-6. Stanier, R. Y., N. J. Palleroni, and M. Doudoroff. 1966. The aerobic pseudomonads: a taxonomic study. J Gen Microbiol 43:159-271. Stover, C. K., X. Q. Pham, A. L. Erwin, S. D. Mizoguchi, P. Warrener, M. J. Hickey, F. S. Brinkman, W. O. Hufnagle, D. J. Kowalik, M. Lagrou, R. L. Garber, L. Goltry, E. Tolentino, S. Westbrook-Wadman, Y. Yuan, L. L. Brody, S. N. Coulter, K. R. Folger, A. Kas, K. Larbig, R. Lim, K. Smith, D. Spencer, G. K. Wong, Z. Wu, 1. T. Paulsen, J. Reizer, M. H. Saier, R. E. Hancock, S. Lory, and M. V. Olson. 2000. Complete genome sequence of Pseudomonas aeruginosa PA01, an opportunistic pathogen. Nature 406:959-64. Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. 1. Wolf, J. J. Yin, and D. A. Natale. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41. -64- 35. Tatusov, R. L., E. V. Koonin, and D. J. Lipman. 1997. A genomic perspective on protein families. Science 278:631-7. 
CHAPTER 3

GENOMIC INSIGHTS THAT ADVANCE THE SPECIES CONCEPT FOR PROKARYOTES

INTRODUCTION

The species concept for Prokaryotes remains a highly controversial and unsettled issue, and as a result a number of different concepts exist at present. The most popular of these, by far, is the one proposed by Wayne et al. in 1987, which considers a bacterial species to be essentially “a collection of strains that are characterized by at least one diagnostic phenotypic trait and whose purified DNA molecules show at least 70% reassociation homology (DNA homology)” (28, 32, 39). This species definition, while pragmatic and universally applicable within the prokaryotic world, remains controversial because it is difficult to implement due to technological limitations in identifying diagnostic traits and in performing the pairwise DNA hybridizations, is based on a 30-year-old arbitrary standard, is not encompassed by any of the eukaryotic species concepts, and is too often not adequately predictive of phenotype (6, 7, 38). Indeed, applying this standard to eukaryotic species would lead to the inclusion of members of many taxonomic tribes in the same species, e.g., all the primates would then belong to the same species (29, 30). Accordingly, there are only about 4,500 prokaryotic species described to date (12), which contrasts with well over 1 million eukaryotic species, even though the prokaryotes have been exploring evolutionary adaptations at least 100 times longer. Furthermore, several theoretical (7) and ecological (38) approaches to define prokaryotic species favor a more natural definition as opposed to the current definition. Last, several strains that show higher than 70% DNA homology are classified into different species, even different genera, usually on the basis of pathogenicity or host range, such as strains of E. coli and Shigella spp. (5), making the application of the 70% DNA homology standard anecdotal.
To gain insight into these issues, we have performed pair-wise, whole-genome comparisons between all closely related (showing >94% 16S rRNA identity), sequenced bacterial strains (64 strains) to determine the conserved protein-coding genes (CDS) between each pair of strains as well as the strain-specific genes, and to study how these parameters correlate with the evolutionary distance between the strains and the strain assignment to species. This analysis is most informative with respect to the species definition because it concerns genes that largely determine the organism’s phenotype. Further, our strain set represents several major bacterial lineages, including α and β Proteobacteria, low GC gram-positive Bacilli, Streptococci, and Staphylococci, and high GC gram-positive Mycobacteria, which allows for robust interpretations (see Table 3.1 in Appendix). We found that strains of the same species can vary up to 30% in gene content, raising questions as to whether they should belong to the same species, while a more stringent definition for species, which should also consider the ecology of the strain, is both more appropriate and plausible.

MATERIAL AND METHODS

Sixty-four fully sequenced and closely related genomes were used in this study (Table 3.1 in Appendix). The genomic sequences and sequence annotation for 54 of the 64 closed genomes, which were published at the time of this study (May 2003), were obtained from NCBI’s ftp site at ftp://ftp.ncbi.nih.gov/. The remaining 10 genomes were closed at the time of this study; however, their annotation was not completed (denoted by NA in Table 3.1). These 10 strains were: S. bongori 12419, Y. enterocolitica, E. carotovora, N. meningitidis FAM, S. aureus MSSA476, and S. aureus MRSA252, produced by the Sanger Center and obtained through the Sanger ftp site at ftp://ftp.sanger.ac.uk/pub/; and M. avium, S. epidermidis RP62A, and C.
perfringens ATCC 13124, produced by The Institute for Genomic Research (TIGR) and obtained through their website at http://www.tigr.org. N. gonorrhoeae FA1090 was produced at the Advanced Center for Genome Technology at the University of Oklahoma (available at http://www.genome.ou.edu/gono.html).

Determination of conserved genes and evolutionary relatedness.

The conserved genes between a pair of genomes were determined by whole-genome sequence comparisons using the BLAST algorithm release 2.2.5 (2). For these pair-wise comparisons, all CDS sequences from one genome (hereafter “the reference” genome) were searched against the genomic sequence of a closely related genome (hereafter “the tester” genome). CDS from the reference genome were considered conserved when they had a BLAST match of at least 60% overall sequence identity (recalculated to an identity along the entire sequence) and an alignable region of more than 70% of their length (nucleotide level) in the tester genome, whereas CDS that had no match or a match below this cut-off were considered “unique” (or genome-specific) in the reference genome. A reciprocal best match approach was also employed to determine what fraction of the previously determined conserved genes is orthologous. BLAST was run with the following settings: X = 150 (drop-off value for gapped alignment), q = -1 (penalty for nucleotide mismatch), and F = F (filter for repeated sequences); the rest of the parameters were at default settings. These settings give better sensitivity with more distantly related genomes compared to default settings, because the default settings target more highly identical sequences. The genomes that were used as reference genomes, the genome sizes and total number of CDS for all genomes used in this study, as well as the raw data from the pair-wise comparisons are summarized in Table 3.1 of the Appendix. Searching for the gene function (i.e., amino acid level) predicted more conserved genes than the nt.
level search only when the evaluated strains show less than 97% 16S rRNA sequence identity. This, however, caused nothing more than a slight upward shift of the left part of the regression line in Figure 3.4A of the article. Further, the use of less stringent cut-offs for the determination of conserved sequences did not significantly change our final conclusions (data not shown). Last, the use of a cut-off for match length and identity without manual inspection of the alignments proved highly accurate for the prediction of conserved sequences. For instance, Parkhill and coworkers (26) identified 4,297 and 3,394 CDS of B. bronchiseptica RB50 to have orthologs in B. parapertussis and B. pertussis, respectively, whereas our approach predicted 4,261 and 3,382 CDS for the same comparisons.

The evolutionary distance between a pair of strains was measured by the average nucleotide identity (ANI) of all conserved genes between the strains as computed by the BLAST algorithm. Duplicated genes within a genome were defined as the genes that had a better match within their genome than in another genome during a pair-wise whole-genome comparison, using, in all cases, a minimum cut-off for a match of 60% identity over at least 70% of the length of the query gene. Despite the use of the rather stringent cut-off in these comparisons, cases of independent acquisition of very similar genes (instead of gene duplication) cannot be excluded.

Determination of DNA homology and 16S rRNA gene sequence identity.

DNA homologies between species were obtained from the literature (5, 16, 18, 34, 37, 41). When the sequenced strains were the same as the ones used in the DNA homology experiments, we directly compared the DNA homology values with the ANI of the sequenced genomes. When the strains were different (the majority of cases), we used the average DNA homology values (or ANI) for several strains of the same species for the comparisons.
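The conserved-gene cut-off and the ANI measure described in these methods can be made concrete with a short sketch. It is a minimal illustration in Python, not the scripts used in this study: the `BestHit` record, its field names, and the example values are hypothetical, the best BLAST match for each reference CDS is assumed to have been parsed already, and ANI is taken here as the unweighted mean identity of the conserved CDS (a length-weighted mean would be an equally plausible reading of the text).

```python
from dataclasses import dataclass
from typing import Dict, Optional

MIN_IDENTITY = 60.0   # percent identity cut-off stated in the text
MIN_COVERAGE = 0.70   # alignable fraction of the CDS length stated in the text


@dataclass
class BestHit:
    """Best BLAST match of one reference CDS in the tester genome (hypothetical record)."""
    cds_length: int       # length of the reference CDS (nt)
    aligned_length: int   # length of the alignable region (nt)
    identity: float       # percent identity recalculated along the entire CDS


def is_conserved(hit: Optional[BestHit]) -> bool:
    """Apply the 60% identity / 70% length cut-off; no hit means genome-specific."""
    if hit is None:
        return False
    coverage = hit.aligned_length / hit.cds_length
    # The text says both "more than 70%" and "at least 70%"; ">=" is used here.
    return hit.identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE


def compare(hits: Dict[str, Optional[BestHit]]) -> Dict[str, float]:
    """Summarize one reference-versus-tester comparison."""
    if not hits:
        raise ValueError("no reference CDS supplied")
    conserved = [h for h in hits.values() if is_conserved(h)]
    n_conserved = len(conserved)
    # Assumption: ANI is the unweighted mean identity of the conserved CDS.
    ani = sum(h.identity for h in conserved) / n_conserved if n_conserved else float("nan")
    return {
        "conserved": n_conserved,
        "unique": len(hits) - n_conserved,
        "percent_conserved": 100.0 * n_conserved / len(hits),
        "ani": ani,
    }


# Hypothetical example: five reference CDS searched against one tester genome.
example_hits = {
    "cds1": BestHit(1000, 950, 98.0),   # conserved
    "cds2": BestHit(900, 900, 92.0),    # conserved
    "cds3": BestHit(1200, 600, 99.0),   # alignable region too short
    "cds4": BestHit(500, 500, 55.0),    # identity below cut-off
    "cds5": None,                       # no match in the tester genome
}
summary = compare(example_hits)
```

Expressing the conserved genes as a percentage of the reference CDS, as done later in the Results, corresponds to the `percent_conserved` entry of the returned summary.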
The 16S rRNA sequence identity between strains was determined as the average identity between all copies of the 16S rRNA gene the strains possess. 16S rRNA sequence identity was determined using the Phylip package with the Kimura 2-parameter method, available online at the tools of the Ribosomal Database Project (http://rdp.cme.msu.edu/cgis/phylip.cgi) (8).

CDS functional annotation and intergenic regions.

We obtained a higher-level annotation (compared to the one found in the GenBank files) of the CDS in the reference genome using the twenty functional categories in the recently updated Clusters of Orthologous Groups (COG) database (33). Each COG functional category represents a major cellular process, like transcription, signal transduction, etc. However, because several of our reference genomes were not incorporated into the COG database, we performed our own CDS assignment to the COG database as described previously (20). For the genomes that were already incorporated in the COG database, our assignments were more than 99% consistent with those already in the COG database. CDS that were assignable to the COG database and were not associated with phage or transposase elements were denoted as well-characterized genes. Hypothetical genes were defined in this study as the genes that were not assignable to the COG database and were annotated as hypothetical or unknown function in the primary annotation (GenBank files), including hypothetical genes carried by phages. This category included the majority (>50%) of the genes not assignable to the COG database and constituted 10-20% of the total number of annotated genes in a genome. Genes that were annotated as hypothetical in the primary annotation and were assignable to the COG conserved hypothetical or another category were considered “conserved hypothetical” (and well-characterized) and denoted as such in the article.
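The gene categories defined in this paragraph amount to a small set of rules, which the following Python sketch spells out. The `CDS` record and its three boolean fields are hypothetical stand-ins for information that would come from the COG assignment and the GenBank annotation. Labels are returned as a set because the text does not state an order of precedence where categories could overlap (for example, hypothetical genes carried by phages).

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class CDS:
    """Annotation flags for one gene (hypothetical record)."""
    in_cog: bool                  # assignable to a COG functional category
    annotated_hypothetical: bool  # "hypothetical"/"unknown function" in GenBank
    mobile: bool                  # associated with phage or transposase elements


def categories(cds: CDS) -> Set[str]:
    """Return every category from the text that applies to this CDS."""
    labels: Set[str] = set()
    if cds.mobile:
        labels.add("mobile")
    if cds.in_cog and not cds.mobile:
        labels.add("well_characterized")
        # Assumption: "conserved hypothetical" is applied only to non-mobile genes.
        if cds.annotated_hypothetical:
            labels.add("conserved_hypothetical")
    if not cds.in_cog and cds.annotated_hypothetical:
        # Includes hypothetical genes carried by phages, as stated in the text.
        labels.add("hypothetical")
    return labels
```

A gene that is neither assignable to COG, nor annotated as hypothetical, nor mobile receives no label in this sketch; the text indicates that such genes exist but does not name the category.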
The non-coding sequences between the annotated protein-coding (CDS) and RNA genes of the reference genomes were extracted from the GenBank files, after removing 100 bases upstream of the start site of the downstream gene to avoid any selection on the promoter of the gene. These intergenic sequences, when longer than 100 nt, were searched against the whole genomic sequence of the tester genomes, as described previously for CDS, to determine whether they are conserved in the tester genomes. Removing a fragment longer than 100 bases upstream of the start site did not significantly affect our conclusions (data not shown).

PERL scripts were used to edit CDS assignments where necessary, extract sequences from GenBank files, format databases for BLAST searches, and automatically parse BLAST outputs.

RESULTS AND DISCUSSION

For our purposes there was need for precise measurement of the evolutionary distance between closely related strains and particularly between strains of the same species. We noticed that the average nucleotide identity (ANI) of all conserved genes (typically >1,500 genes) between two strains strongly correlated with the reported DNA-DNA reassociation homologies between the same strains (Figure 3.1B). Based on these results, the 70% DNA-DNA homology standard corresponds to about 93-94% ANI, which roughly agrees with previous experimental evidence (reviewed in (13)). Therefore, strains that show higher than 94% ANI should belong to the same species according to the DNA homology standard. This was also confirmed by the fact that all strains in our set that reside in the same species or in species that show higher than 70% DNA-DNA homology showed higher than 94% ANI. Furthermore, the ANI strongly correlated with the 16S rRNA sequence identity but gave higher resolution, since a 0-5% 16S rRNA sequence mispairing is spread between 0-30% average nucleotide mispairing (Figure 3.1A). In summary, the strong correlations observed as well as the large number of genes used in the calculations suggest that ANI represents a robust measure of evolutionary distance, which should not be affected by lateral transfer or varied recombination rates of single (or a few) genes and offers resolution at the subspecies level where the 16S rRNA gene or other single markers are not useful.

[Figure 3.1. Panel A: 16S rRNA sequence identity plotted against average nucleotide identity. Panel B: DNA-DNA reassociation homology plotted against average nucleotide identity, with regression line y = 3.32x - 238; the current species cut-off is marked.]

Figure 3.1. Relationships between average nucleotide identity (ANI), 16S rRNA sequence identity and DNA homology. Each dot represents the ANI of all conserved genes between two strains plotted against the 16S rRNA sequence identity (A) and the DNA homology (B) of the two strains. The shaded bar represents 93-94% ANI, which approximately corresponds to 70% DNA homology, i.e., the species cut-off for prokaryotic species, according to the regression analysis in panel B. 16S rRNA identity and DNA homology values were computed as described in the methods section.

Conserved gene core and genetic diversity within species.

Using the 94% ANI criterion for strain assignment to species, we first attempted to evaluate the extent of genetic diversity within a single bacterial species. Our results for E. coli, the best sampled species with genomic sequences, show that when a strain showed less than 98-99% nt. identity to all eight remaining strains, it had a sizeable number of sequences, ranging between 5-15% of the total CDS in the genome, that could not be identified in any of the remaining strains. At the same time and as expected, strains that showed at least 99% nt.
identity to any of the remaining eight strains had a small (<1-2%) number of unique sequences, such as the two strains of the E. coli O157 lineage or the two strains of the S. flexneri 2a lineage. Accordingly, the number of unique genes in all nine genomes together clearly exceeds 8,000, with the trendline suggesting that a continued increase is expected with the sequencing of new genomes of the species

[Figure 3.2: rotated page; figure and caption not recoverable from the scan.]
diversity as well (Figure 3.3). We also attempted to predict whether the genetic diversity within the E. coli-Shigella spp. species would be exhaustible with additional sequenced strains by searching all genes in a strain against a database of an increasing number of genomes. While the number of novel genes in a strain declines with greater coverage of the species with genomic sequences, the number of available genomic sequences is still too limited to predict how many strains would need to be sequenced to discover most of the gene diversity of the species (Figure 3.2B). Nonetheless, extrapolation from the current genomic sequences suggests that when about 12-14 strains of E. coli are sequenced, the number of new genes in the next sequenced strain would be less than 5% of the total CDS in the genome. This prediction may, however, be biased, since almost all evaluated strains are pathogens of animal or human hosts, i.e., they have similar ecological niches, whereas some E. coli are known to colonize water and soil (1).

Despite the extensive genetic diversity revealed between closely related bacteria, species-specific diagnostic genetic signatures appear to exist; thus, it appears meaningful to have a species concept for Prokaryotes. For example, by comparing the nine E. coli-Shigella spp. genomes against the seven genomes of Salmonella enterica (a close relative of E. coli; ANI between E. coli and Salmonella spp. genomes is ~80%), we identified ~300 genes, i.e., ~6% of the total genes, in any E. coli-Shigella spp. strain that are not conserved in any S. enterica genome, whereas the reverse comparison revealed ~12% of the genes to be S. enterica-specific. About half of the genes in these signatures are related to traits that are known to differentiate E. coli-Shigella spp. from S. enterica species; for instance, the E.
coli/Shigella contain about 80 genes involved in transport and metabolism of sugars, amino acids and oligopeptides, which is consistent with this species’ growth on sucrose and production of indole from tryptophan, whereas S. enterica can do neither (5). Likewise, the S. enterica signature included genes for growth on hydrogen sulfide, which is not used by E. coli/Shigella spp. (5). The other half of the genetic signatures involves genes not assignable to COGs or of general function prediction only, which may yield even more distinguishing phenotypic traits.

The current species definition appears to be too liberal.

We then studied how the number of conserved genes between two strains correlated with their evolutionary relatedness for all 64 strains compared in this study. Conserved genes were expressed as a percentage of the total CDS in the reference genome to normalize for the genome size effect. Our results suggest that there is a strong correlation between these two parameters over longer evolutionary distances, i.e., corresponding to 0-5% 16S rRNA mispairing, and this correlation appears to be consistent among several major bacterial lineages (Figure 3.4B). However, when the analysis was restricted to strains that show >94% ANI, i.e., strains that should belong to the same species, this correlation collapsed (Figure 3.4A). According to this dataset, strains of the same species frequently differ in up to 30% of their total genes, and of these up to 50% are well-characterized genes. Well-characterized denotes genes that are assignable to the Clusters of Orthologous Groups (COG) database and are not associated with phage or transposase elements, whose significance on the cell phenotype remains largely unexplored.
[Figure 3.4: rotated page; figure and caption not recoverable from the scan.]

When a reciprocal best match approach was employed to determine the orthologous fraction of the conserved genes, in an effort to obtain a more conservative estimate of functional similarity, the gene differences were even higher (but generally not considerably so), by an average of 1.12% (STDEV 1.15, MAX 6.78%). To extend the comparison to higher organisms, only about 25% of the human genes do not have homologs in the distantly related fish genome, Fugu rubripes (3), while the ANI between humans and chimpanzees is 98.7% (10), i.e., much higher than the current standard for prokaryotic species.
Therefore, the genetic differences we find among several strains of the same bacterial species are extensive when viewed from a eukaryotic perspective.

We also noticed that pairs of strains that presumably have an overlapping ecological niche, like Xylella fastidiosa and Helicobacter pylori strains that cause the same disease in closely related plant species and humans, respectively (11, 36), have more genes conserved relative to pairs of strains that show a comparable evolutionary relatedness but presumably have non-overlapping ecological niches, like E. coli strains that cause different diseases in humans, i.e., enterohemorrhagic vs. uropathogenic (40) (the dashed circles in Figure 3.4A illustrate this point). The former cases typically involved obligatory pathogens with small genome sizes, whereas the latter involve free-living or opportunistic pathogens with large genomes. Species with larger genomes are thought to be more ecologically versatile (20), which is consistent with the previous interpretations. Further, sexual isolation is more pronounced in the former species due to restrictions in their dispersion, as is documented by Helicobacter pylori biogeography (11), which may explain why strains of these species show substantial nucleotide divergence while sharing a nearly identical gene content.

In summary, our results (Figures 3.3 & 3.4) show that the current species definition results in too much genetic diversity within species, and hence a more stringent definition is needed if species are to be reasonably predictive of the phenotype and ecological potential of the organism. For example, a species definition that includes only strains that show at least ~99% ANI, or less than 99% ANI but share a common ecology, would be consistent with this goal because such strains should have minimum (i.e., <5%) gene differences (Figure 3.4A).
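The proposed rule can be contrasted with the current standard in a few lines of Python. This is only an illustrative sketch: the function names and the `shares_ecology` flag are hypothetical, judging whether two strains share a common ecology is left to the user, and keeping the current ~94% ANI value as a lower bound for the ecology clause is an assumption, since the text does not state one.

```python
CURRENT_CUTOFF = 94.0    # ANI roughly equivalent to 70% DNA-DNA homology (from the text)
STRINGENT_CUTOFF = 99.0  # approximate ANI threshold proposed in the text


def same_species_current(ani: float) -> bool:
    """Current standard, expressed as its approximate ANI equivalent."""
    return ani >= CURRENT_CUTOFF


def same_species_proposed(ani: float, shares_ecology: bool) -> bool:
    """Proposed rule: at least ~99% ANI, or lower ANI combined with a common ecology."""
    if ani >= STRINGENT_CUTOFF:
        return True
    # Assumption: the ecology clause applies only above the current ~94% cut-off.
    return shares_ecology and ani >= CURRENT_CUTOFF
```

Under this sketch, two strains at 97% ANI with non-overlapping niches pass the current standard but not the proposed one, which is the kind of case the more stringent definition is meant to separate.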
Several additional independent lines of evidence support the view that a species definition based on these principles may be more appropriate than the current one. First, genetic signatures, like the ones described previously between E. coli-Shigella spp. and S. enterica genomes, are identifiable among some groups of strains that show between 94% and 99% ANI. For example, the two pathogenic genomes of the S. enterica pathovar Typhi share ~325 genes that are not conserved in any of the three pathovar Typhimurium, str. PT2 and S. gallinarum str. 287/91 genomes (ANI between the Typhi genomes is >99%, between Typhi genomes and others 97-98.5%) (Figure 3.5A). Many of the Typhi-specific genes are potential pathogenicity factors, such as fimbrial and exported polysaccharide gene clusters, further supporting the ecological importance of this genetic signature. These extensive gene differences may also indicate that Typhi strains do not directly compete with the other S. enterica strains in situ (i.e., they exploit a different ecological niche); otherwise the genetic differences should be purged by natural selection. The lack of competition between two populations is considered strong evidence towards describing the populations as different species by several prokaryotic taxonomists (7, 38). A similar comparison revealed ~4% of the genes to be Typhimurium-specific, while comparable results were obtained for other groups with several sequenced representatives, such as the Listeria monocytogenes and Neisseria spp. Importantly, the E. coli-Shigella spp. and S. enterica genomes compared previously are much more distantly related (i.e., ~80% ANI) than the genomes compared here; nonetheless, the genetic signatures revealed are comparable in size. Second, in at least two cases in our dataset we could not identify species-specific genetic signatures when applying the current definition. For instance, there are two strains of Bacillus cereus fully sequenced, str.
ATCC 10987 and ATCC 14579, with the former showing ~94% ANI to the B. anthracis strains (thus, albeit marginally, str. 10987 should belong to the same species with B. anthracis according to the DNA homology standard) and the latter only ~91% (ANI between the two B. cereus genomes is 91.2%) (Figure 3.5B). Str. 14579, however, has more genes conserved with the B. anthracis genomes than str. 10987, and no genetic signature is identifiable for the B. anthracis-str. 10987 group. Such instances prove that the current standard is rather arbitrary and suggest that any species definition (like the DNA homology) that does not consider the ecology of the strains in addition to their genetic relatedness is problematic. This is also evident from the low correlation observed between conserved gene content and evolutionary distance over a short evolutionary scale (Figure 3.4A). Last, gene expression, which is another important determinant of an organism’s phenotype apart from gene presence (10, 25), is likely to be different between strains that show a substantial number of nucleotide substitutions, such as strains that show 94-97% ANI. Notably, about half of the nucleotide substitutions between such strains cause non-synonymous amino acid substitutions in our dataset.

[Figure 3.5. Bar charts, panels A and B; y-axis values range from 3000 to 4400; ANI labels above the bars include 99.8% and 97.8%-98.6%.]

Figure 3.5. Genetic signatures among groups of strains that show higher than 94% average nucleotide identity (ANI). Starting with all CDS in the leftmost strain, the next bar to the right represents how many CDS are conserved in the next strain (x-axis) (similarly to Figure 3.2). The ANI to the leftmost strain is also shown on the top of the bars for each strain. (A) A genetic signature between the pathovar Typhi strains and the rest of the Salmonella strains is identifiable. (B) No genetic signature is evident for the B. anthracis-B. cereus ATCC14579 group (dashed circle).
The rightmost bar in panel B shows how many of the conserved CDS between the two B. anthracis strains are also conserved in strain ATCC14579 alone. Strains from left are: (A) S. enterica ser. Typhi Ty2, S. enterica ser. Typhi Typhi, S. enterica PT2, S. enterica ser. Typhimurium DT104, S. enterica ser. Typhimurium LT2, S. enterica ser. Typhimurium SL1344, S. gallinarum, and a pool of all Salmonella but the Typhi strains. (B) B. anthracis Ames, B. anthracis A2012, B. cereus ATCC 10987, and B. cereus ATCC 14579.

What is an ecotype?

If one is to define species as a collection of very similar strains (at the nt. level and/or the number of genes they share) as proposed here, then the question that remains is what is an ecotype? In my view, an ecotype is a population that has acquired a small number of extra genetic elements, which enable the population to exploit a slightly different ecological niche while preserving the genetic signature and the full ecological potential that characterizes its species. Such ecotypes do exist among strains that show higher than 99% ANI. For example, several Bacillus anthracis or S. enterica pathovar Typhi strains that show higher than 99.6% ANI have significant gene differences, which primarily involve plasmids, and secondarily phage and transposase-related genes (Figure 3.4A). These plasmids have been connected to a strain’s ability to cause increased disease symptoms (see, for instance, (15)), i.e., they enable the strains to exploit a slightly different but highly overlapping ecological niche compared to their species. Such genetic differences borne as plasmids or mobile elements cannot be viewed as genetic signatures that justify a description as a new species because they are not stable properties of the genome. Moreover, otherwise identical populations that acquire a small number of beneficial mutations that enable the population to exploit a new substrate, like the parallel evolving E.
coli strains founded from the same ancestor (35), can also be viewed as ecotypes of the same species.

There are a few more complicated cases with respect to speciation in our dataset, which can be exemplified by the three pathogenic Bordetella spp. genomes. These organisms, which are colonizers of the respiratory tracts of mammals, show 97.8-98.7% ANI between each other's genomes, and it appears that B. pertussis and B. parapertussis have evolved by a (considerable) genome reduction from a B. bronchiseptica-like ancestor, presumably as a result of population bottlenecks or ecological specialization, since these genomes show increased host-specificity compared to B. bronchiseptica (26) (see Figure 3.4A). However, no clear and ecologically meaningful genetic signature is identifiable for B. pertussis or B. parapertussis to justify their description as separate species, since the genes specific to these two genomes are few or of hypothetical and/or transposase function. Viewing these genomes as ecotypes of B. bronchiseptica would deviate from the proposed rule that an ecotype should preserve the full potential of its species, since B. bronchiseptica has at least 600 additional genes compared to B. pertussis or B. parapertussis. One possibility is that the latter genomes represent snapshots of an active speciation process, which might not yet have reached the stage of a diagnosable species-specific genetic signature. Alternatively, such instances indicate that some species are likely to show a continuum/gradient of genetic diversity rather than defined boundaries diagnosable by species-specific genetic signatures, or that one should look for species-specific signatures at a different level, e.g., the gene expression level or the deletion (instead of acquisition) of specific pathways in order to achieve ecological specialization. Last, the Bordetella spp.
example indicates that species might be found even among strains that show higher than 99% ANI if the populations have undergone major ecological constraints.

Functional biases in the genome-specific genes.

The functional annotation of the genes that constitute the genome-specific genes in all the pair-wise comparisons between the 64 strains used in this study was also evaluated to provide insights into the factors that might foster speciation. We found that hypothetical, phage and transposase associated genes comprise 62.4% of the genome-specific genes, with the hypothetical genes comprising the majority, 40.4%; the former percentage becomes even larger, 66.1%, when the analysis is restricted to strains of the same species (Figure 3.6). Hypothetical denotes genes that are not assignable to the COG database and are annotated as hypothetical or of unknown function in the primary annotation, while phage genes include all genes (assignable or not to COG) carried by phage genomes (see methods section). These results contrast with an average of 31.1% of hypothetical, phage and transposase related genes in a typical genome (average from 64 genomes), indicating that hypothetical, phage and transposase related genes might play a more important role in the speciation process than expected based on the frequency at which these genes are encountered in the genome. These genes are, however, largely species- or genome-specific (see also Figure 3.3), which reveals a weak positive selection for these functions and reflects the enormous genetic diversity that characterizes bacteriophages (40-80% of the total genes in a phage genome are hypothetical in our dataset) (27) and insertion/transposase elements (23). Collectively, this information is congruent with phage and mobile elements being ephemeral intruders of the genome that have little, if any, value for the cell but occasionally might be important, e.g., when carrying ecologically important genes, and lead to speciation (for examples see (4)). The fraction of the genome-specific genes that is well characterized is, on average, 37.6%, which contrasts with an average of 69.9% of such genes in a typical genome (Figure 3.6). Restriction of the analysis to orthologous genes (i.e., reciprocal best match vs. one-way match approach) did not significantly affect these results. Last, gene duplication appears to play a significant but not major role in the genetic diversity within species. The occurrence of duplicated genes among the genome-specific genes during comparisons of strains of the same species ranged from <1% to 30%, and this variation appeared to be species-dependent.

[Figure 3.6 chart, "Annotation of the genome-specific CDS". Labels: In COG & not Mobile*, 37.5% (33.9%); No COG & Mobile*, 62.4% (66.1%); Hypothetical, 40.4% (44.4%). Inset: CDS annotation in an average genome.]

Figure 3.6. Functional distribution of genome-specific CDS from 82 pair-wise, whole-genome comparisons. Results using only strains showing >94% ANI are shown in parentheses. (Inset) Mean functional distribution of annotated CDSs for the 64 genomes deposited in GenBank as of October 2003. *Mobile denotes phage or transposase associated genes.

During the functional annotation of the genome-specific genes, we noted that hypothetical CDS are approximately as conserved as the intergenic sequences, i.e., the fraction of sequences that remain conserved with increasing evolutionary distance is very similar between the two classes of sequences. For comparison, the conserved genes that are well characterized (i.e., assignable to COGs, including the conserved hypothetical) are approximately 2.4 times more conserved than the intergenic sequences (Figure 3.7). Furthermore, we could detect very few (<5%) hypothetical or intergenic sequences conserved at the family level and we could not detect any such sequences conserved at the phylum level (data not shown).
In contrast, a considerable number of well-characterized genes remain conserved over the same evolutionary scales. This gene set includes both informational genes, which are highly conserved, as well as non-essential and less evolutionarily conserved genes, like secondary metabolism genes, which have presumably been subjected to lateral transfer.

[Figure 3.7 plot. x-axis: Percent of CDS conserved; y-axis: Percent of non-coding sequences conserved; legend: Hypothetical (solid squares), In COG (open squares); regression line shown: y = 1.11x - 5.60; an arrow marks increasing evolutionary distance.]

Figure 3.7. Degree of conservation of non-coding and hypothetical sequences vs. well-characterized genes. Each datapoint represents the number of non-coding sequences (expressed as a percent of the total sequences to normalize for genome size effects) from a reference genome conserved in a tester genome (y-axis) vs. the number of hypothetical genes (solid squares) or well-characterized genes (open squares) from the reference genome conserved in the tester genome (x-axis). The gray diagonal represents the 1:1 regression line.

There are many inconsistencies between different published genomes with regard to the annotation and nomenclature of hypothetical genes, which impedes robust interpretations. These inconsistencies also explain part of the high dispersion of datapoints around the mean observed in Figure 3.7. Although we have not extensively evaluated the effect of such inconsistencies, our results from comparisons of hypothetical genes to intergenic sequences clearly suggest that the function of the majority of hypothetical genes, if any, is different from that of the annotated genes (Figure 3.7). This agrees with conclusions reached by others using fundamentally different approaches, such as synonymous vs. non-synonymous substitutions (24), gene length distributions (31) and simulations on the coding capacity of the genome (17).
Although there are specific caveats in all these methods (21, 24), the emerging picture is consistent with the majority of the hypothetical CDS being indispensable but not protein-coding parts of the prokaryotic genome. This conclusion seems contradictory to recent proteomic data that show that a significant portion of what is annotated as hypothetical CDS is indeed translated to proteins (9, 19, 22). The discrepancy, however, is at least partially attributable to inconsistencies in nomenclature, e.g., we did not consider conserved hypothetical genes in our analysis as did Kolker et al. (19) and Corbin et al. (9), or to the study of phylogenetically diverse or not well-studied species where the fraction of CDS annotated as hypothetical genes is higher (22). Furthermore, recent evidence suggests that, in many genomes, a small (but not negligible) number of short protein-coding genes have escaped identification (14) and are consequently annotated as non-coding DNA. This may have caused an underestimation of the coding potential of hypothetical CDS in our comparisons. In summary, our results do not contradict that some hypothetical genes are protein-coding; rather, they suggest that such genes should constitute a small fraction of the total and that their effect on cell phenotype may be uncertain in several cases, such as for the phage-related hypothetical genes. Given, however, the high frequency of hypothetical CDS among the strain-specific sequences (Figure 3.6), the small number of coding hypothetical CDS may still contribute significantly, in quantitative terms, to the functional diversity of the species.

OUTLOOK

Our analysis shows that if species should be reasonably predictive of phenotype and ecological potential, then species should comprise a much more uniform suite of strains than provided by the current definition. In practical terms, it appears that such strains may be only the ones that show higher than 99% ANI, or are less identical at the nt. level but share at least 95% of their well-characterized genes as a result of having a highly overlapping ecological niche. This definition is closer to the eukaryotic standards as well. Such a stringent standard, however, would be impractical to implement, since it would instantaneously increase the number of existing species, probably by a factor of 10 (6), and cause considerable confusion in the diagnostic and legal fields. Hence, the existing classification system should be maintained but adopt more stringent standards where needed, as in the case of distinguishing important species for diagnosis, patents, quarantine, transportation and possession. Our analysis clearly shows that strains of the same species according to the current standards may be too different to be considered the same species.

Our analysis also reveals several issues that must be addressed before more robust interpretations are possible. Most importantly, although species-specific genetic signatures appear to exist, this conclusion is based on a limited number of available sequenced strains. Therefore, the alternative hypothesis, i.e., that there is a continuum of genetic diversity, which is not supportive of a species concept for Prokaryotes, cannot currently be rejected. It is also likely that a continuum of genetic diversity would be applicable only to specific species and/or ecological niches. Last, the importance of the species' ecology for the conserved genes needs to be more fully evaluated and quantified. Related to this, the full ecological potential of most (even the sequenced!) species remains largely unknown due to the lack of knowledge on their population sizes and activities in their natural environments. A better coverage with genomic sequences of several closely related species from characterized niches is needed to further advance these cornerstone issues for microbiology and systematics.
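The practical criterion proposed in this Outlook can be written down as a small decision rule. The following is only a sketch under stated assumptions: the function name, the argument names, and the example values are hypothetical; the ~94% ANI value stands in for the current DNA homology standard as it is used earlier in this chapter; and the ecological component of the proposed definition is not captured by the two numbers compared here.

```python
# Minimal sketch of the current vs. the proposed species standard.
# All names and the example values are hypothetical illustrations.

CURRENT_ANI_CUTOFF = 94.0    # ANI roughly matching the DNA homology standard
PROPOSED_ANI_CUTOFF = 99.0   # "higher than 99% ANI"
PROPOSED_GENE_CUTOFF = 0.95  # "at least 95% of their well-characterized genes"


def compare_standards(ani, shared_gene_fraction):
    """Return whether a pair of strains falls in the same species
    under the current and under the proposed, more stringent standard.

    ani: average nucleotide identity between the two genomes (percent).
    shared_gene_fraction: fraction (0-1) of well-characterized genes shared.
    """
    current = ani >= CURRENT_ANI_CUTOFF
    proposed = (ani > PROPOSED_ANI_CUTOFF
                or shared_gene_fraction >= PROPOSED_GENE_CUTOFF)
    return {"current": current, "proposed": proposed}


# A pair at ~94% ANI with a hypothetical 80% of shared genes is the same
# species (marginally) under the current standard but not the proposed one.
print(compare_standards(94.0, 0.80))
```

Keeping the two standards side by side makes it easy to count how many currently recognized species would be split if the stringent rule were applied to a set of sequenced strains.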
ACKNOWLEDGMENTS

We thank The Institute for Genomic Research (TIGR) and the Sanger Center for permission to use preliminary sequence data. This work was supported by the Bouyoukos Fellowship Program (KTK), the DOE's Microbial Genome Program and the Center for Microbial Ecology.

REFERENCES

1. Report of the Tropical Indicator Workshop, available at: http://www.wrrc.hawaii.edu/tropindworkshop.html.
2. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids. Res. 25:3389-3402.
3. Aparicio, S., J. Chapman, E. Stupka, N. Putnam, J.-m. Chia, P. Dehal, A. Christoffels, S. Rash, S. Hoon, A. Smit, M. D. S. Gelpke, J. Roach, T. Oh, I. Y. Ho, M. Wong, C. Detter, F. Verhoef, P. Predki, A. Tay, S. Lucas, P. Richardson, S. F. Smith, M. S. Clark, Y. J. K. Edwards, N. Doggett, A. Zharkikh, S. V. Tavtigian, D. Pruss, M. Barnstead, C. Evans, H. Baden, J. Powell, G. Glusman, L. Rowen, L. Hood, Y. H. Tan, G. Elgar, T. Hawkins, B. Venkatesh, D. Rokhsar, and S. Brenner. 2002. Whole-Genome Shotgun Assembly and Analysis of the Genome of Fugu rubripes. Science 297:1301-1310.
4. Boyd, E. F., and H. Brussow. 2002. Common themes among bacteriophage-encoded virulence factors and diversity among the bacteriophages involved. Trends Microbiol 10:521-9.
5. Brenner, D. 1984. Bergey's manual of systematic bacteriology, 1st ed, vol. 1. William and Wilkins, Baltimore.
6. Brenner, D., J. Staley, and N. Krieg. 2000. Bergey's manual of systematic bacteriology, 2nd ed, vol. 1. Springer-Verlag, New York.
7. Cohan, F. M. 2002. What are bacterial species? Annu Rev Microbiol 56:457-87.
8. Cole, J. R., B. Chai, T. L. Marsh, R. J. Farris, Q. Wang, S. A. Kulam, S. Chandra, D. M. McGarrell, T. M. Schmidt, G. M. Garrity, and J. M. Tiedje. 2003. The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res 31:442-3.
9. Corbin, R. W., O. Paliy, F. Yang, J. Shabanowitz, M. Platt, C. E. Lyons, Jr., K. Root, J. McAuliffe, M. I. Jordan, S. Kustu, E. Soupene, and D. F. Hunt. 2003. Toward a protein profile of Escherichia coli: comparison to its transcription profile. Proc Natl Acad Sci U S A 100:9232-7.
10. Enard, W., P. Khaitovich, J. Klose, S. Zollner, F. Heissig, P. Giavalisco, K. Nieselt-Struwe, E. Muchmore, A. Varki, R. Ravid, G. M. Doxiadis, R. E. Bontrop, and S. Paabo. 2002. Intra- and Interspecific Variation in Primate Gene Expression Patterns. Science 296:340-343.
11. Falush, D., T. Wirth, B. Linz, J. K. Pritchard, M. Stephens, M. Kidd, M. J. Blaser, D. Y. Graham, S. Vacher, G. I. Perez-Perez, Y. Yamaoka, F. Megraud, K. Otto, U. Reichard, E. Katzowitsch, X. Wang, M. Achtman, and S. Suerbaum. 2003. Traces of Human Migrations in Helicobacter pylori Populations. Science 299:1582-1585.
12. Garrity, G., J. Bell, and T. Lilburn. Bergey's manual of systematic bacteriology, 2nd ed, vol. Release 5.0. Springer-Verlag, New York.
13. Goodfellow, M., and A. O'Donnell. 1993. Handbook of New Bacterial Systematics. Academic Press Inc, San Diego.
14. Harrison, P. M., N. Carriero, Y. Liu, and M. Gerstein. 2003. A "polyORFomic" analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs. J Mol Biol 333:885-92.
15. Hoffmaster, A. R., J. Ravel, D. A. Rasko, G. D. Chapman, M. D. Chute, C. K. Marston, B. K. De, C. T. Sacchi, C. Fitzgerald, L. W. Mayer, M. C. Maiden, F. G. Priest, M. Barker, L. Jiang, R. Z. Cer, J. Rilstone, S. N. Peterson, R. S. Weyant, D. R. Galloway, T. D. Read, T. Popovic, and C. M. Fraser. 2004. Identification of anthrax toxin genes in a Bacillus cereus associated with an illness resembling inhalation anthrax. Proc Natl Acad Sci U S A 101:8449-54.
16. Imaeda, T. 1985. Deoxyribonucleic acid relatedness among selected strains of the Mycobacterium tuberculosis, Mycobacterium bovis, Mycobacterium bovis BCG, Mycobacterium microti, and Mycobacterium africanum. Int. J. Syst. Bacteriol. 35:147-150.
17. Jackson, J. H., S. H. Harrison, and P. A. Herring. 2002. A theoretical limit to coding space in chromosomes of bacteria. Omics 6:115-21.
18. Kawamura, Y., X. G. Hou, F. Sultana, H. Miura, and T. Ezaki. 1995. Determination of 16S rRNA sequences of Streptococcus mitis and Streptococcus gordonii and phylogenetic relationships among members of the genus Streptococcus. Int J Syst Bacteriol 45:406-8.
19. Kolker, E., S. Purvine, M. Y. Galperin, S. Stolyar, D. R. Goodlett, A. I. Nesvizhskii, A. Keller, T. Xie, J. K. Eng, E. Yi, L. Hood, A. F. Picone, T. Cherny, B. C. Tjaden, A. F. Siegel, T. J. Reilly, K. S. Makarova, B. O. Palsson, and A. L. Smith. 2003. Initial proteome analysis of model microorganism Haemophilus influenzae strain Rd KW20. J Bacteriol 185:4593-602.
20. Konstantinidis, K. T., and J. M. Tiedje. 2004. Trends between gene content and genome size in prokaryotic species with larger genomes. PNAS 101:3160-3165.
21. Lawrence, J. 2003. When ELFs are ORFs, but don't act like them. Trends Genet 19:131-2.
22. Liu, Y., J. Zhou, M. V. Omelchenko, A. S. Beliaev, A. Venkateswaran, J. Stair, L. Wu, D. K. Thompson, D. Xu, I. B. Rogozin, E. K. Gaidamakova, M. Zhai, K. S. Makarova, E. V. Koonin, and M. J. Daly. 2003. Transcriptome dynamics of Deinococcus radiodurans recovering from ionizing radiation. Proc Natl Acad Sci U S A 100:4191-6.
23. Mahillon, J., and M. Chandler. 1998. Insertion sequences. Microbiol Mol Biol Rev 62:725-74.
24. Ochman, H. 2002. Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends Genet 18:335-7.
25. Oleksiak, M. F., G. A. Churchill, and D. L. Crawford. 2002. Variation in gene expression within and among natural populations. Nat Genet 32:261-6.
26. Parkhill, J., M. Sebaihia, A. Preston, L. D. Murphy, N. Thomson, D. E. Harris, M. T. Holden, C. M. Churcher, S. D. Bentley, K. L. Mungall, A. M. Cerdeno-Tarraga, L. Temple, K. James, B. Harris, M. A. Quail, M. Achtman, R. Atkin, S. Baker, D. Basham, N. Bason, I. Cherevach, T. Chillingworth, M. Collins, A. Cronin, P. Davis, J. Doggett, T. Feltwell, A. Goble, N. Hamlin, H. Hauser, S. Holroyd, K. Jagels, S. Leather, S. Moule, H. Norberczak, S. O'Neil, D. Ormond, C. Price, E. Rabbinowitsch, S. Rutter, M. Sanders, D. Saunders, K. Seeger, S. Sharp, M. Simmonds, J. Skelton, R. Squares, S. Squares, K. Stevens, L. Unwin, S. Whitehead, B. G. Barrell, and D. J. Maskell. 2003. Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica. Nat Genet 35:32-40.
27. Pedulla, M. L., M. E. Ford, J. M. Houtz, T. Karthikeyan, C. Wadsworth, J. A. Lewis, D. Jacobs-Sore, J. Falbo, J. Gross, N. R. Pannunzio, W. Brucker, V. Kumar, J. Kandasamy, L. Keenan, S. Bardarov, J. Kriakov, J. G. Lawrence, W. R. Jacobs, Jr., R. W. Hendrix, and G. F. Hatfull. 2003. Origins of highly mosaic mycobacteriophage genomes. Cell 113:171-82.
28. Rossello-Mora, R., and R. Amann. 2001. The species concept for prokaryotes. 25:39.
29. Sibley, C. G., and J. E. Ahlquist. 1987. DNA hybridization evidence of hominoid phylogeny: results from an expanded data set. J Mol Evol 26:99-121.
30. Sibley, C. G., J. A. Comstock, and J. E. Ahlquist. 1990. DNA hybridization evidence of hominoid phylogeny: a reanalysis of the data. J Mol Evol 30:202-36.
31. Skovgaard, M., L. J. Jensen, S. Brunak, D. Ussery, and A. Krogh. 2001. On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17:425-8.
32. Stackebrandt, E., W. Frederiksen, G. M. Garrity, P. A. D. Grimont, P. Kampfer, M. C. J. Maiden, X. Nesme, R. Rossello-Mora, J. Swings, H. G. Truper, L. Vauterin, A. C. Ward, and W. B. Whitman. 2002. Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52:1043-1047.
33. Tatusov, R., N. Fedorova, J. Jackson, A. Jacobs, B. Kiryutin, E. Koonin, D. Krylov, R. Mazumder, S. Mekhedov, A. Nikolskaya, B. S. Rao, S. Smirnov, A. Sverdlov, S. Vasudevan, Y. Wolf, J. Yin, and D. Natale. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.
34. Tonjum, T., D. B. Welty, E. Jantzen, and P. L. Small. 1998. Differentiation of Mycobacterium ulcerans, M. marinum, and M. haemophilum: mapping of their relationships to M. tuberculosis by fatty acid profile analysis, DNA-DNA hybridization, and 16S rRNA gene sequence analysis. J Clin Microbiol 36:918-25.
35. Treves, D. S., S. Manning, and J. Adams. 1998. Repeated evolution of an acetate-crossfeeding polymorphism in long-term populations of Escherichia coli. Mol Biol Evol 15:789-97.
36. Van Sluys, M. A., M. C. de Oliveira, C. B. Monteiro-Vitorello, C. Y. Miyaki, L. R. Furlan, L. E. A. Camargo, A. C. R. da Silva, D. H. Moon, M. A. Takita, E. G. M. Lemos, M. A. Machado, M. I. T. Ferro, F. R. da Silva, M. H. S. Goldman, G. H. Goldman, M. V. F. Lemos, H. El-Dorry, S. M. Tsai, H. Carrer, D. M. Carraro, R. C. de Oliveira, L. R. Nunes, W. J. Siqueira, L. L. Coutinho, E. T. Kimura, E. S. Ferro, R. Harakava, E. E. Kuramae, C. L. Marino, E. Giglioti, I. L. Abreu, L. M. C. Alves, A. M. do Amaral, G. S. Baia, S. R. Blanco, M. S. Brito, F. S. Cannavan, A. V. Celestine, A. F. da Cunha, R. C. Fenille, J. A. Ferro, E. F. Formighieri, L. T. Kishi, S. G. Leoni, A. R. Oliveira, V. E. Rosa, Jr., F. T. Sassaki, J. A. D. Sena, A. A. de Souza, D. Truffi, F. Tsukumo, G. M. Yanai, L. G. Zaros, E. L. Civerolo, A. J. G. Simpson, N. F. Almeida, Jr., J. C. Setubal, and J. P. Kitajima. 2003. Comparative Analyses of the Complete Genome Sequences of Pierce's Disease and Citrus Variegated Chlorosis Strains of Xylella fastidiosa. J. Bacteriol. 185:1018-1026.
37. Vauterin, L., B. Hoste, K. Kersters, and J. Swings. 1995. Reclassification of Xanthomonas. Int. J. Syst. Bacteriol. 45:472-489.
38. Ward, D. M. 1998. A natural species concept for prokaryotes. Curr Opin Microbiol 1:271-7.
39. Wayne, L. G., D. J. Brenner, R. R. Colwell, P. A. D. Grimont, O. Kandler, M. I. Krichevsky, L. H. Moore, W. E. C. Moore, R. G. E. Murray, E. Stackebrandt, M. P. Starr, and H. G. Truper. 1987. Report of the Ad Hoc Committee on reconciliation of approaches to Bacterial Systematics. Int. J. Syst. Bacteriol. 37:463-464.
40. Welch, R. A., V. Burland, G. Plunkett, III, P. Redford, P. Roesch, D. Rasko, E. L. Buckles, S. R. Liou, A. Boutin, J. Hackett, D. Stroud, G. F. Mayhew, D. J. Rose, S. Zhou, D. C. Schwartz, N. T. Perna, H. L. T. Mobley, M. S. Donnenberg, and F. R. Blattner. 2002. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. PNAS 99:17020-17024.
41. Yabuuchi, E., Y. Kosako, H. Oyaizu, I. Yano, H. Hotta, Y. Hashimoto, T. Ezaki, and M. Arakawa. 1992. Proposal of Burkholderia gen. nov. and transfer of seven species of the genus Pseudomonas homology group II to the new genus, with the type species Burkholderia cepacia (Palleroni and Holmes 1981) comb. nov. Microbiol Immunol 36:1251-75.

CHAPTER 4

TOWARDS A GENOME-BASED TAXONOMY FOR PROKARYOTES

INTRODUCTION

Prokaryotic taxonomy consists of three separate components: classification (i.e., the arrangement of organisms into groups or taxa), nomenclature and identification. Although there is no official classification for Prokaryotes, the classification system represented by Bergey's Manual is widely accepted by the community of microbiologists and therefore is currently considered the best approximation to an official classification (3).
This classification system is primarily based on the phylogenetic analysis of the small subunit ribosomal RNA gene (16S rRNA) and secondarily on older microscopic and/or biochemical observations about the relatedness of the organisms (3, 16, 18). The current classification system has been valuable in describing and appreciating the breadth of prokaryotic diversity and in setting the framework for the study of relationships between taxa. Further, results from new approaches enabled by the availability of whole-genome sequences, such as phylogeny based on shared content of orthologous genes (10, 13, 15, 24), indels or signature sequences (8, 14), and concatenated alignments of many proteins (4, 11, 26), are generally congruent with the grouping of organisms based on the 16S rRNA gene, which adds further value to the current system.

It is important to realize, however, that the definitions or standards for the existing taxonomic ranks are far from being well delineated, particularly for the ranks higher than species. In fact, considerable subjectivity in designating genera, families, etc., has been allowed, which is (at least) partially attributable to the great biochemical and morphological diversity exhibited by Prokaryotes and prevents the employment of the same measuring rules for all groups of organisms (3). The only major prerequisite for designating taxonomic ranks is that clustering by 16S rRNA data should support such designations, but no standards exist about the absolute genetic distance (measured by 16S rRNA gene sequence or other markers) between the different taxonomic ranks (16). Accordingly, the current taxonomy has frequently caused a lot of confusion, e.g., Shigella spp. and E. coli strains represent different genera (2) although based on their genetic relatedness they should belong to the same species (25), and uncertainty about how comparable the taxonomic ranks of different lineages may be.
Most importantly, the relative predictive power of the different taxonomic ranks in terms of phenotype or relatedness of the grouped organisms remains unclear.

Genomic approaches hold great promise to provide insights into these issues because they can accurately reveal the genetic and functional relatedness between organisms at any resolution level. However, genomic studies to date have mostly focused on assessing the accuracy of phylogenetic reconstruction, particularly in the light of lateral gene transfer (LGT), rather than the differences in the ranks of taxonomy between lineages, and have failed to address these issues systematically for all prokaryotic taxa. Here we have assessed the consistency of the taxonomic ranks for 175 fully sequenced genomes in terms of genetic distance, using as a measure of the latter the average amino acid identity of all conserved genes between any two organisms. Based on this measure, we found that there are many irregularities in the current classification schema for these 175 genomes, while there is little, if any, value in the predictive power of the higher taxonomic ranks such as the order, class, or phylum, with the exception of the domain rank, as these ranks are currently used. Our approach also provided a means to evaluate the robustness of the 16S rRNA gene and alternative molecular markers for phylogenetic purposes.

MATERIAL AND METHODS

Determination of conserved genes and genetic relatedness.

The genomic sequences and sequence annotation of the 175 genomes used in this study were obtained from NCBI's ftp site at ftp://ftp.ncbi.nih.gov/. Conserved genes between a pair of genomes were determined by whole-genome, pair-wise sequence comparisons using the BLAST algorithm release 2.2.5 (1). For these comparisons, all CDS sequences from one genome were searched against the genomic sequence of the other genome (protein query vs. translated database, tBLASTn).
CDS were considered conserved when they had a BLAST match of at least 30% identity at the amino acid level (recalculated to an identity along the entire sequence) and an alignable region of more than 70% of the length of the query CDS. This cut-off is above the twilight zone of similarity searches, where inference of homology is error-prone due to low similarity between aligned sequences; thus, query CDSs were presumably homologous to their match (21, 22), while searching against genomic sequences (as opposed to CDS) circumvented the problem of inconsistencies in the annotation between different genomes. When a reciprocal best match approach was employed to determine the orthologous fraction of the conserved genes, in an effort to obtain a more conservative estimate of functional similarity, the number of genes conserved between two genomes was smaller (but generally not considerably smaller), by an average of ~1.2%. The genetic relatedness between a pair of genomes was measured by the average amino acid identity (AAI) of all conserved genes between the genomes as computed by the BLAST algorithm. 16S rRNA gene or other genetic marker identity was calculated in the same way as AAI, i.e., based on BLAST searches (nucleotide level, blastn, for 16S and 23S rRNA and amino acid level, blastp, for protein-coding genes), for consistency in comparing the results.

Taxonomic information.

The taxonomic information for each of the 175 genomes was extracted from the Hierarchy browser of the RDP database, release 9 (http://rdp.cme.msu.edu/index.jsp), which implements the newer version of Bergey's taxonomy (9). The taxonomic information included all the officially recognized taxonomic ranks, i.e., domain, phylum, class, order, family, genus, and species, with the exception of the subspecies rank. This information can be viewed in Table 4.1 of the Appendix, which also includes the genome size and total number of CDS for each genome.

Phylogenetic analysis and sequence divergence.
Phylogenetic analysis was performed using the Neighbor Joining program of the Phylip package, version 3.62 (12), and the Weighbor (weighted neighbor joining) program (5). Sequence divergence at synonymous (Ks) and nonsynonymous (Ka) sites was calculated with the DIVERGE software of the GCG package, which uses the method of Li (17).

RESULTS AND DISCUSSION

Average amino acid identity is a robust measurement of relatedness.

For our purposes there was a need for a precise measurement of the genetic relatedness between any two strains. The main limitations in performing this task universally for all prokaryotic taxa are the lack of genes that are widely distributed in all taxa (e.g., recent estimates suggest that there are fewer than a hundred such genes), the varied evolutionary histories (mutation rates and selection pressures) of different genes, and the as yet unclear effect of LGT on inferred phylogenies. For these reasons, and in order to maximize the robustness of our approach, we employed the average amino acid identity (AAI) of all conserved genes between two strains to measure their genetic relatedness. There are several strengths in using AAI for these purposes. First, AAI is a simple, useful, overall descriptor of genetic relatedness. Second, it is derived from lineage-specific genes in addition to the widely distributed ones (typically >500 genes in total), which increases the robustness of the phylogenetic signal extracted. Further, due to the large number of genes used in the calculations, AAI should be superior to a single gene, such as the 16S rRNA gene sequence, for measuring relatedness and should not be prone to varied evolutionary rates or LGT events of single or a few genes. Even if genes with different evolutionary histories represent a large fraction of the genome, their effect on AAI is minimized when some evolve faster and others slower than the average of the genome, and hence they should not be problematic for AAI (see also Figure 4.1A).
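The conserved-gene criteria stated in Material and Methods (at least 30% amino acid identity and an alignable region of more than 70% of the query CDS length) and the averaging step behind AAI can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the `Hit` record, its field names, and the sample values are hypothetical; hits are assumed to be already parsed from BLAST output; the identity is used as given rather than recalculated along the entire sequence; and keeping the highest-identity hit per query is a simplifying assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Hit:
    """One parsed BLAST hit for a query CDS (hypothetical record layout)."""
    query_id: str
    identity: float   # percent amino acid identity of the match
    align_len: int    # length of the aligned region, in amino acids
    query_len: int    # length of the query CDS, in amino acids


MIN_IDENTITY = 30.0   # at least 30% identity at the amino acid level
MIN_COVERAGE = 0.70   # alignable region > 70% of the query length


def is_conserved(hit):
    """Apply the two cut-offs used to call a CDS conserved."""
    coverage = hit.align_len / hit.query_len
    return hit.identity >= MIN_IDENTITY and coverage > MIN_COVERAGE


def average_amino_acid_identity(hits):
    """Average the identity of the best conserved hit of each query CDS.

    Returns None when no CDS passes the cut-offs.
    """
    best = {}
    for hit in hits:
        if not is_conserved(hit):
            continue
        current = best.get(hit.query_id)
        if current is None or hit.identity > current.identity:
            best[hit.query_id] = hit
    if not best:
        return None
    return mean(h.identity for h in best.values())


# Hypothetical hits: "b" fails the identity cut-off, "c" the coverage cut-off.
example_hits = [
    Hit("a", 80.0, 90, 100),
    Hit("a", 60.0, 95, 100),
    Hit("b", 25.0, 100, 100),
    Hit("c", 50.0, 60, 100),
    Hit("d", 40.0, 80, 100),
]
print(average_amino_acid_identity(example_hits))
```

Because every conserved CDS contributes one value to the mean, a few fast- or slow-evolving genes, or a few laterally transferred ones, shift the result only slightly, which is the robustness argument made above.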
AAI also offers higher resolution than the 16S rRNA gene sequence, since a 0-40% 16S rRNA sequence mispairing (40% is the maximum 16S rRNA distance observed, i.e., between domains) is spread over 0-70% average amino acid mispairing (since 30% identity was the cut-off for calling conserved genes) (Figure 4.2A), and AAI can resolve areas where the 16S rRNA gene is inadequate, such as the species level (see chapter 3 of this thesis). Last, AAI correlates strongly with the average rate of synonymous substitutions, i.e., with the rate of sequence divergence, which suggests that AAI may be a useful descriptor of the evolutionary, in addition to just the genetic, distance between two organisms (Figure 4.1B).

[Figure 4.1. Panel A: deviation from average (y-axis) against gene class (x-axis) for E. coli EDL933, E. coli K12, Shigella flexneri, Salmonella enterica, and Yersinia pestis. Panel B: ANI against average Ks for the pairs E. coli K12 vs. E. coli Sakai, N. meningitidis vs. N. gonorrhoeae, B. anthracis vs. B. cereus, S. enterica vs. S. bongori, E. coli vs. Salmonella sp., S. aureus vs. S. epidermidis, and S. pyogenes vs. S. agalactiae; fitted line y = -20.90x + 98.75, R2 = 0.98.]

Figure 4.1. Average Nucleotide Identity (ANI) and genetic distance. (A) The ANI for all genes in the genome and for all genes in each COG category (designated by a single letter on the x-axis; see Table 2.2 for letter designations) between E. coli strain Sakai and another genome (graph legend) were determined, and the difference between the average identity of the genes in each category and the average identity of all genes in the genome is shown (y-axis). These results reveal that the nucleotide identity of most orthologs between any two genomes is within +/- 6-8% of the ANI between the genomes. A comparable picture was obtained for the Burkholderia, Mycobacteria and Streptococci groups (data not shown). (B) The average rate of synonymous substitutions (Ks) for all orthologs between two genomes strongly correlates with the ANI between the genomes, suggesting that ANI may be a useful descriptor of evolutionary distance. Only genomes that show <3% 16S rRNA mispairing were included in the analysis to avoid saturation of nucleotide substitutions at synonymous sites. ANI correlates strongly with Average Amino acid Identity (AAI) (R2 > 0.95); therefore, the previous conclusions are translatable to AAI as well. ANI was preferred because it gives higher resolution between very closely related genomes.

Evaluation of the taxonomic ranks in terms of genetic relatedness. We first compared the AAI to the 16S rRNA identity for all pairs of the 175 genomes used in this study (175 x 175, 30,625 pairs in total) to gain insight into the interrelationship between these two parameters. Our results show that there is a strong correlation between 16S rRNA identity and AAI, and that the logarithmic model best describes this correlation (R2 = 0.84, P < 0.0001) (Figure 4.2A). When the analysis is restricted to pairs of genomes with higher than 87-90% 16S rRNA identity, however, there is no significant difference between the logarithmic (R2 = 0.834) and the linear model (R2 = 0.825). These results indicate that the influence of additional mutations (presumably in the 16S rRNA gene) is offset by recurrent mutations when 16S rRNA sequences are less than ~85-87% identical. In any case, the strong correlation observed further supports the robustness of the 16S rRNA-based phylogeny for Prokaryotes. 16S rRNA appears to have limited resolution between closely related genomes, e.g., those showing higher than 80% AAI, whereas it has higher resolution than AAI between (very) distantly related genomes, i.e., those showing 30-40% AAI, presumably because this area approaches the cut-off used.

[Figure 4.2. Panels A-D: 16S rRNA gene sequence identity (y-axis) against average amino acid identity (x-axis).]

Figure 4.2. Relationships between 16S rRNA, AAI, and taxonomic information for the 175 sequenced genomes. Panel A shows the 16S rRNA gene sequence identity (y-axis) plotted against the average amino acid identity (AAI) for each pair of the 175 genomes (30,625 pairs in total). The smallest taxonomic rank that the two genomes of each pair share has been overlaid in panels B, C, and D. The area corresponding to the current standard for species delineation as well as representative pairs of genomes (discussed in the text) have been annotated. Images in this thesis are presented in color.

We then determined for each pair of genomes their closest taxonomic relationship, i.e., the smallest taxonomic rank they share, and overlaid this information on the graph of Figure 4.2. It appears that there are many inconsistencies between the different taxonomic ranks, since all ranks higher than the species, with the exception of the different-domain category, show extensive overlap (compare, for example, genus vs. family in panel B or same domain vs. phylum between panels B and C). These results clearly show that the predictive power of current taxonomic ranks in terms of genetic distance between the grouped organisms is rather limited. In a few cases the overlap is limited to a few genomes, such as among the Prochlorococcus marinus or the Buchnera aphidicola genomes (Panel B) and between the Treponema and Leptospira genomes (Panel C), whose genetic distance does not justify their inclusion in the same species and order, respectively.
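The determination of the smallest taxonomic rank shared by a pair of genomes can be sketched as follows. This is only an illustration under stated assumptions: the rank list, the dictionary layout, and the function name are hypothetical, and the example lineages are abbreviated, illustrative classifications rather than the records used in this study.

```python
# Ranks ordered from the broadest to the narrowest (assumed set of ranks).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def smallest_shared_rank(lineage_a, lineage_b):
    """Return the narrowest rank at which two lineages still agree.

    Walks down the ranks and stops at the first disagreement (or missing
    assignment); returns None when even the domain differs.
    """
    shared = None
    for rank in RANKS:
        name_a = lineage_a.get(rank)
        name_b = lineage_b.get(rank)
        if name_a is None or name_b is None or name_a != name_b:
            break
        shared = rank
    return shared


# Illustrative lineages (hypothetical input records).
e_coli = {"domain": "Bacteria", "phylum": "Proteobacteria",
          "class": "Gammaproteobacteria", "order": "Enterobacteriales",
          "family": "Enterobacteriaceae", "genus": "Escherichia",
          "species": "Escherichia coli"}
s_enterica = {"domain": "Bacteria", "phylum": "Proteobacteria",
              "class": "Gammaproteobacteria", "order": "Enterobacteriales",
              "family": "Enterobacteriaceae", "genus": "Salmonella",
              "species": "Salmonella enterica"}
m_jannaschii = {"domain": "Archaea", "phylum": "Euryarchaeota"}

rank_enterics = smallest_shared_rank(e_coli, s_enterica)
rank_cross_domain = smallest_shared_rank(e_coli, m_jannaschii)
```

Labeling every point of the AAI vs. 16S rRNA plot with such a value is what allows the overlap between ranks to be read directly from the overlaid panels.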
Such cases are apparently artifacts that can be corrected; e.g., P. marinus strains were grouped in the same species based solely on their high 16S rRNA gene sequence similarity (6, 7), and Treponema and Leptospira were assigned to the same order due to their common spirochete-like morphology (20). Another remarkable trend revealed in our data is that the currently named bacterial phyla are approximately as distant from each other in terms of AAI as Bacteria are from Archaea. This becomes more obvious on a neighbor joining tree built from the full matrix of AAI between the 175 genomes. On this tree, all bacterial phyla and sometimes classes, such as the Mollicutes and Clostridia of the Firmicutes phylum and the α, δ, and ε classes of the Proteobacteria phylum, are as deeply branching as the Archaea (see colored groups in Figure 4.3A). At the same time, clustering at nodes of the tree that correspond to well-defined relationships between groups is as expected, e.g., enterics are clustered together, with Salmonella spp. being the closest relative of the E. coli-Shigella spp. group, which adds further support to the results. In addition, we found that there is a strong linear correlation between the AAI between two genomes and the number of genes that these genomes share (R2 = 0.70, P < 0.0001), and this correlation becomes even stronger when the 32 reduced genomes of endosymbiotic species are removed from the analysis (R2 = 0.82, P < 0.0001) (Figure 4.4). The stronger correlation in the second case is attributable to the reduced genomes being enriched in highly conserved, housekeeping genes relative to the core of free-living species (the majority in the current dataset); therefore, the fraction of conserved genes is overestimated in the former genomes relative to the latter. These results reveal that the genetic distance between the aforementioned phyla/classes corresponds to comparably large functional/biochemical (gene) differences as well.
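The comparison of the two correlations (all genome pairs vs. pairs without endosymbionts) can be illustrated with a small sketch. The record layout, the function names, and the toy numbers below are hypothetical and are not data from this study; the sketch only shows how R2 for a least-squares line could be computed before and after removing pairs that contain a reduced genome.

```python
def linear_r_squared(x, y):
    """Coefficient of determination (R2) of a least-squares line through (x, y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2 points)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("R2 is undefined when x or y is constant")
    # For a simple linear fit, R2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


def drop_pairs_with(pairs, excluded):
    """Remove every genome pair that involves one of the excluded genomes."""
    return [p for p in pairs if p[0] not in excluded and p[1] not in excluded]


# Toy records: (genome_a, genome_b, AAI, percent of conserved genes).
# "sym1" mimics a reduced genome that shares many genes despite a low AAI.
pairs = [
    ("g1", "g2", 95.0, 88.0),
    ("g1", "g3", 70.0, 61.0),
    ("g2", "g3", 68.0, 60.0),
    ("g1", "sym1", 55.0, 80.0),
    ("g2", "sym1", 54.0, 79.0),
    ("g3", "sym1", 50.0, 75.0),
]
free_living = drop_pairs_with(pairs, excluded={"sym1"})
r2_all = linear_r_squared([p[2] for p in pairs], [p[3] for p in pairs])
r2_free = linear_r_squared([p[2] for p in free_living], [p[3] for p in free_living])
```

In this toy case the pairs involving the reduced genome sit well above the trend of the free-living pairs, so removing them raises R2, which mirrors the direction of the effect reported above.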
The previous conclusion is further supported by the distance tree derived from the full matrix of the percent of conserved genes between the 175 genomes. On this tree, most of the bacterial groups (phyla or classes) that are deep branching in the AAI tree are similarly deep branching in the conserved gene tree, i.e., genomes of these groups share a number of genes with genomes of the remaining bacterial phyla that is comparable to the number of genes they share with archaeal genomes (Figure 4.3B). For instance, the Thermus-Deinococcus and Actinobacteria phyla and the Clostridia, δ and ε Proteobacteria classes are deep branching in both trees, whereas the few apparent exceptions, such as the Mollicutes and α Proteobacteria classes, which are not deep branching in the conserved gene tree (contrary to the AAI tree), are attributable to the bias associated with the reduced genomes (discussed previously) or to a shared ecology, e.g., the large genome-sized, free-living α Proteobacteria cluster together with the large genome-sized, free-living β and γ Proteobacteria in less deep nodes of the tree. In summary, these results suggest that there is a much greater genetic and functional diversity in the Prokaryotes than hitherto expected based on the 16S rRNA phylogeny, and that organisms of several bacterial phyla appear to be as different (genetically and/or biochemically) from each other as Bacteria are from Archaea.

[Figure 4.3. Panel A: AAI tree. Panel B: conserved gene tree. Neighbor joining trees of the 175 genomes, rooted with S. cerevisiae.]

Figure 4.3. Phylogenetic relationships between the 175 fully sequenced genomes. Neighbor joining trees derived from the full matrix of AAI (A) and percent of conserved genes (B) between the 175 genomes used in this study. The percent of conserved genes (instead of the absolute number of conserved genes) was used to accommodate genome size differences (up to 10-fold) among the 175 genomes. Groups that are deep branching on the AAI tree are denoted by colors. Phyla represented by a single genome are in bold. The Saccharomyces cerevisiae genome was used to root the trees (outgroup). The scale bar represents 10% difference. Note the difference in scale between A and B, i.e., the underlying differences are about 25% larger in the conserved gene tree for the same branch length. Abbreviations are as follows (top to bottom in panel A): T-D, Thermus-Deinococcus phylum; Spiro., Spirochaetes phylum; Bact., Bacteroidetes phylum; α-, β-, γ-, δ-, ε-P., α, β, γ, δ, ε Proteobacteria class, respectively; Cyano., Cyanobacteria phylum; Streptococ., Streptococcaceae family; Staphyloc., Staphylococcaceae family; Eury., Euryarchaeota phylum; Crena., Crenarchaeota phylum. Images in this thesis are presented in color.

[Figure 4.4. Panel A: all genomes. Panel B: all but symbionts. Percent of genes conserved (y-axis) against average amino acid identity (x-axis).]

Figure 4.4. Relationship between conserved gene content and genetic distance. Dots represent the percent of conserved genes between a pair of genomes plotted against their genetic distance, measured as the average amino acid identity of the conserved genes. (A) All pairs of the 175 genomes (30,625 pairs in total) were included, whereas pairs that contain an endosymbiotic genome were removed (32 genomes, 5,600 pairs removed) from the analysis in (B).

Evaluation of alternative markers to 16S rRNA for phylogenetic purposes. The robustness of alternative markers to the 16S rRNA gene for phylogenetic purposes was also evaluated, using the AAI as the control and an approach similar to that used for the 16S rRNA gene. The results show that several of these markers, such as RNA polymerase subunits, tRNA synthetases, gyrase, and the RecA protein, show considerable robustness, based on the high correlation (R2 > 0.68, P < 0.0001 for all markers tested) observed between the AAI and the identity of these proteins for all pairs of the 175 genomes (Table 4.1 and Figure 4.5). Among the protein-coding genes tested, RNA polymerase subunit B (RpoB) showed the highest correlation to AAI (R2 = 0.78) and the RecA protein the lowest (R2 = 0.68), while all protein-coding genes evaluated showed significantly lower correlation to AAI than 16S rRNA (R2 = 0.84). On the other hand, the large subunit rRNA gene (23S rRNA) showed comparable, if not better, correspondence to AAI, suggesting that it is a highly reliable marker (Figure 4.5). A similar approach may be used to evaluate the robustness of other markers as well, targeting the full breadth of prokaryotic diversity or shorter evolutionary scales, e.g., the species level, for specific applications.

Table 4.1.
Relationships of different phylogenetic markers to Average Amino acid Identity (AAI).

GENE                                                          R2*
16S rRNA (Small subunit ribosomal gene)                       0.84
23S rRNA (Large subunit ribosomal gene)                       0.84
RecA (DNA strand exchange and recombination protein)          0.68
RpoB (RNA polymerase, beta subunit)                           0.78
GyrB (DNA gyrase subunit B)                                   0.77
IleS (Isoleucine tRNA synthetase)                             0.72
FusA (GTP-binding protein chain elongation factor EF-G)       0.69

*R2 is for the logarithmic second order correlation. This correlation gave among the highest R2 values of the types of correlations tested for most genes. It should be mentioned, however, that there were typically very small differences between different models (e.g., linear, power, logarithmic, sigmoidal) in their ability to describe the relationship between individual genes and the average of the genomes. Thus, no assumptions can be made about the underlying mechanisms of this relationship.

[Figure 4.5. One scatter-plot panel per marker: identity of the marker against AAI.]

Figure 4.5. Correlation between alternative markers to 16S rRNA and Average Amino acid Identity (AAI). Panels show the correlation between the identity of a molecular marker (panel title) and AAI for all pairs of the 175 genomes (at least 20,000 pairs for each gene) used in this study. For the full name of each marker see Table 4.1.

PERSPECTIVE

The most important contribution of this work is the recognition that the ranks of prokaryotic taxonomy are frequently defined rather arbitrarily with respect to the genetic or biochemical relatedness of the grouped organisms. AAI and conserved gene content represent convenient means to quickly identify such cases and to assist in standardizing the definitions of the ranks when these appear problematic. Moreover, it is evident from our analysis that organisms of almost all prokaryotic phyla and sometimes classes (colored groups in Figure 4.3A) are very different from each other, similarly to how different Bacteria are from Archaea.
A number of the morphological or physiological traits that characterize these organisms represent fundamental differences from a prokaryotic perspective and are therefore consistent with the vast differences revealed by the genomic comparisons. For example, organisms of the Mollicutes class lack a cell wall, spirochetes have a unique cell morphology and mode of movement, and cyanobacteria are the only prokaryotes able to carry out water-based oxygenic photosynthesis. In addition, these differences are comparable to the morphological or physiological traits that are known to differentiate Archaea from Bacteria, namely, the existence of ether-linked branched hydrocarbons in the membrane of the former (vs. ester-linked fatty acids for Bacteria) and a few metabolic cofactors that are archaeal-specific, such as coenzyme M and tetrahydromethanopterin. Last, by comparing the highly branching pattern of the best-represented bacterial phyla, the γ Proteobacteria and Firmicutes (light blue in Figure 4.3), with the deep-rooting but not branching pattern of the remaining phyla or classes, it becomes obvious that the great majority of prokaryotic diversity is not yet represented by genomic sequences.

Although averaging across all genes in the genome may miss important, lineage-specific information, AAI (or Average Nucleotide Identity, ANI, for short evolutionary scales) represents a powerful first step towards a genome-based taxonomy because it is simple, robust and pragmatic for all prokaryotic taxa. Moreover, recent reports suggest that it may not be feasible to expand the current (16S rRNA-based) phylogeny by including more genetic markers, due either to the shortage of genes widespread in all prokaryotic taxa or to the difficulty in designing universal primers for widespread genes (23). Therefore, alternative methods such as the AAI-based method are needed.
It may also be feasible to devise a new method, or to optimize an existing one, to measure AAI indirectly, i.e., to circumvent the need for whole-genome sequencing. Multilocus Sequence Typing (MLST) (19) that employs genes (not necessarily the same genes for all taxa) that evolve comparably to the genome average may be one such approach, while the methodology described here (Figure 4.5) can assist in the identification of good candidate genes for such an MLST-based application. In addition, work in our lab (J. Goris et al., in preparation) as well as the 2nd chapter of this thesis show that there is a strong correlation between ANI and DNA-DNA reassociation homology values, the classical method for species delineation in Prokaryotes, over a range of relatedness that corresponds to 0 to 5% 16S rRNA mispairing.

The AAI tree shows the Thermus-Deinococcus, Aquificae, and Thermotogae as the deepest branching bacterial phyla and the closest relatives of Archaea, similar to previous reports (4, 18, 26) but in conflict with others (10, 14). However, the differences at the ancestral nodes of the tree are very small (Figure 4.3A); therefore, no definite conclusions can be reached based on these data about the sequence of evolution of the different bacterial phyla. The same picture was also obtained when a less stringent cut-off for calling conserved genes (i.e., 20% identity instead of 30%) was used, which can pick up homologs with weaker similarity at the expense of increasing the rate of false positive homolog recovery (data not shown). These results suggest that homology-based analysis may be inadequate to resolve the early evolutionary events of prokaryotic life.
The 16S rRNA gene might offer better resolution at the deep branches of the tree; however, the relationship between 16S rRNA and AAI (Figure 4.2), as well as the extensive genetic and biochemical distinctiveness of organisms related at this level, which presumably imposes varied functional constraints and selection pressures on the 16S rRNA gene, raises serious concerns as to how quantifiable 16S rRNA differences are at this level of relatedness.

ACKNOWLEDGMENTS

We thank Prof. George Garrity, James Cole and Dr. Joel Klappenbach for helpful discussions regarding the manuscript. This work was supported by the Bouyoukos Fellowship Program (KTK), the DOE's Microbial Genome Program and the Center for Microbial Ecology.

REFERENCES

1. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.
2. Brenner, D. 1984. Bergey's manual of systematic bacteriology, 1st ed, vol. 1. Williams and Wilkins, Baltimore.
3. Brenner, D., J. Staley, and N. Krieg. 2000. Bergey's manual of systematic bacteriology, 2nd ed, vol. 1. Springer-Verlag, New York.
4. Brown, J. R., C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope. 2001. Universal trees based on large combined protein sequence data sets. Nat Genet 28:281-5.
5. Bruno, W. J., N. D. Socci, and A. L. Halpern. 2000. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol Biol Evol 17:189-97.
6. Canale-Parola, E. 1984. Bergey's manual of systematic bacteriology, 1st ed, vol. 1. Williams and Wilkins, Baltimore.
7. Chisholm, S., S. Frankel, R. Goericke, R. Olson, B. Palenik, B. Waterbury, L. West-Johnrud, and E. Zettler. 1992. Prochlorococcus marinus nov. gen. sp.: an oxyphototrophic prokaryote containing divinyl chlorophyll a and b. Arch Microbiol 157:297-300.
8. Coenye, T., and P. Vandamme. 2004. Use of the genomic signature in bacterial classification and identification. Syst Appl Microbiol 27:175-85.
9. Cole, J. R., B. Chai, T. L. Marsh, R. J. Farris, Q. Wang, S. A. Kulam, S. Chandra, D. M. McGarrell, T. M. Schmidt, G. M. Garrity, and J. M. Tiedje. 2003. The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res 31:442-3.
10. Daubin, V., M. Gouy, and G. Perriere. 2002. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res 12:1080-1090.
11. Daubin, V., N. A. Moran, and H. Ochman. 2003. Phylogenetics and the cohesion of bacterial genomes. Science 301:829-32.
12. Felsenstein, J. 2004. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.
13. Fitz-Gibbon, S. T., and C. H. House. 1999. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res 27:4218-22.
14. Gupta, R. S., and E. Griffiths. 2002. Critical issues in bacterial phylogeny. Theor Popul Biol 61:423-34.
15. Hong, S. H., T. Y. Kim, and S. Y. Lee. 2004. Phylogenetic analysis based on genome-scale metabolic pathway reaction content. Appl Microbiol Biotechnol 65:203-10.
16. Krieg, N., and G. Garrity. 2000. Bergey's manual of systematic bacteriology, 2nd ed, vol. 1. Springer-Verlag, New York.
17. Li, W. H. 1993. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 36:96-9.
18. Ludwig, W., and H.-P. Klenk. 2000. Bergey's manual of systematic bacteriology, 2nd ed, vol. 1. Springer-Verlag, New York.
19. Maiden, M. C., J. A. Bygraves, E. Feil, G. Morelli, J. E. Russell, R. Urwin, Q. Zhang, J. Zhou, K. Zurth, D. A. Caugant, I. M. Feavers, M. Achtman, and B. G. Spratt. 1998. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95:3140-5.
20. Munson, M. A., P. Baumann, and M. Kinsey. 1991. Buchnera gen. nov. and Buchnera aphidicola sp. nov., a taxon consisting of the mycetocyte-associated, primary endosymbionts of aphids. Int J Syst Bacteriol 41:566-568.
21. Rost, B. 1999. Twilight zone of protein sequence alignments. Protein Eng 12:85-94.
22. Sander, C., and R. Schneider. 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56-68.
23. Santos, S. R., and H. Ochman. 2004. Identification and phylogenetic sorting of bacterial lineages with universally conserved genes and proteins. Environ Microbiol 6:754-9.
24. Snel, B., P. Bork, and M. A. Huynen. 1999. Genome phylogeny based on gene content. Nat Genet 21:108-10.
25. Stackebrandt, E., W. Frederiksen, G. M. Garrity, P. A. Grimont, P. Kampfer, M. C. Maiden, X. Nesme, R. Rossello-Mora, J. Swings, H. G. Truper, L. Vauterin, A. C. Ward, and W. B. Whitman. 2002. Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52:1043-7.
26. Wolf, Y. I., I. B. Rogozin, N. V. Grishin, R. L. Tatusov, and E. V. Koonin. 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol 1:8.

CHAPTER 5

IN-SILICO MODELING OF DNA-MICROARRAY PERFORMANCE FOR GENOMOTYPING BACTERIAL STRAINS

INTRODUCTION

The recent explosion in genomic sequencing has been accompanied by the development of high-throughput technologies for post-sequencing analysis. Microarray technology has been a cornerstone of this effort and is under continuing development. DNA microarrays were originally used to study gene expression levels between populations of mRNA expressed under different culture conditions or genotype backgrounds.
Genes that are differentially expressed are very likely to play an important role in the cell physiology under these conditions and are targeted for further analysis (cf. references 16, 25, 29). More recently, microarrays have been used for genetic (or DNA-DNA) comparisons between different strains. In this case, a microarray is typically built using the available genomic sequence from a particular strain (the reference strain) and is used to competitively hybridize genomic DNA from closely related strains (the tester strains) (2, 6, 12, 17). The objective in this case is to reveal the gene differences between the reference and tester strain(s) that could explain unique characteristics of the strains under study. Last, DNA-DNA studies with specially designed microarray platforms have been proposed as a promising approach for genomotyping and taxonomic characterization because they offer advantages at the species to genotype level over 16S rRNA gene sequence analysis and the cumbersome DNA-DNA reassociation (DNA-homology) experiments (4).

A key issue for the successful application of microarrays in DNA-DNA studies is the evolutionary distance, i.e., the degree of nucleotide divergence, between the reference and the tester strain(s). A microarray is expected to give a false negative signal when the evolutionary distance is such that the nucleotide sequence of genes has diverged but their amino acid sequence remains conserved (hence the proteins are conserved). However, the relationship between false negative signal and the evolutionary relatedness of the evaluated strains has not yet been investigated. This issue is also problematic when a whole-genome microarray based on a reference strain is to be used for expression studies with a strain other than the reference strain. Bioinformatic sequence analysis can potentially offer novel insight into this aspect of microarray technology.
For instance, the number of genes from the reference genome that are conserved (at the nucleotide level) in the tester genome should approach the number of genes expected to cross-hybridize when the tester genome is hybridized on a microarray built from the reference genome. The relationship between the sequence identity of the probe-target pair and hybridization kinetics has been extensively studied for different types of probes (14, 17), which allows for a fairly accurate estimation of the number of genes that can cross-hybridize based on their sequence identity.

For DNA-DNA studies, microarrays have been commonly used within species because it is assumed that the rate of false negatives will be minimal at the sub-species level. This assumption is based on the fact that strains that show 70% or greater DNA-homology values (the classical cut-off for species definition) are believed to have at least 95% DNA sequence identity in coding regions (11, 22). At this level of sequence identity (or evolutionary relatedness) no false negatives are expected. However, it is important to realize that DNA-homology values do not reflect the actual degree of sequence identity at the level of the primary structure. For instance, each of the three fully sequenced E. coli strains has about 25% of its DNA not shared with the other two sequenced strains (26). It is as yet unclear how such differences between strains can affect microarray performance in experiments within species.

Finally, oligonucleotide arrays, which typically comprise one short, 30-60 nucleotide long probe per predicted open reading frame (ORF) in the genome, have recently gained popularity for expression studies over those made from PCR products of ORFs (hereafter termed cDNA arrays) due to their higher specificity during hybridizations, flexibility in design and potential for further technological development (14, 19).
For DNA-DNA studies within or across species, cDNA arrays are presumably preferable for their higher sensitivity due to the longer probes employed, even though the longer probes are prone to more non-specific signal from cross-hybridization of paralogous genes or conserved domains. Whether oligo-arrays can be comparable to cDNA arrays for DNA-DNA studies has not yet been investigated.

Using the available genomic sequences, we attempted to simulate, in-silico, the microarray performance and to evaluate the previously mentioned issues. Three bacterial groups, namely the enterics, mycobacteria, and streptococci, were used as models in the simulations (Table 5.1) because they include several sequenced representatives (complete or high draft) and these representatives show a gradient of evolutionary relatedness.

MATERIAL AND METHODS

The genomic sequences of the completely sequenced strains of the three groups targeted were obtained from NCBI's ftp site at ftp://ftp.ncbi.nih.gov/. Preliminary sequence data for the M. bovis and M. marinum strains were produced by the Sanger Center and were obtained through the Sanger ftp site at ftp://ftp.sanger.ac.uk/pub/; those for M. avium, M. smegmatis, and S. mitis were produced by The Institute for Genomic Research (TIGR) and obtained through their website at http://www.tigr.org.

Microarray false negatives. False negatives for a microarray experiment were defined as the ORFs from the reference genome that were conserved at the amino acid level but were not conserved enough at the nucleotide level in the tester genome to allow cross-hybridization on a hypothetical microarray built based on the reference genome (see below for how whole-genome sequence comparisons were performed). An ORF was considered to cross-hybridize when it had a match of at least 60% nucleotide identity over more than 70% of its length in the tester genome.
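The definition of a false negative can be expressed as a small decision rule. The sketch below is only an illustration: the `Match` record, the function names, and the toy values are hypothetical, and each ORF is assumed to have already been reduced to its best nucleotide-level and amino acid-level match in the tester genome. The nucleotide criterion is the one stated above; the amino acid criterion shown is the identity-based one used in this study (at least 30% amino acid identity over more than 70% of the query ORF).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    """Best match of a reference ORF in the tester genome (hypothetical record)."""
    identity: float  # percent identity of the match
    coverage: float  # fraction of the query ORF length covered by the match


def passes(match: Optional[Match], min_identity: float, min_coverage: float = 0.70) -> bool:
    """True when a match exists, reaches the identity cut-off and spans more than the coverage cut-off."""
    return (match is not None
            and match.identity >= min_identity
            and match.coverage > min_coverage)


def classify_orf(nucleotide: Optional[Match], protein: Optional[Match]) -> str:
    """Classify one reference ORF on a hypothetical reference-based microarray."""
    hybridizes = passes(nucleotide, min_identity=60.0)  # nucleotide-level cut-off
    conserved = passes(protein, min_identity=30.0)      # amino acid-level cut-off
    if conserved and not hybridizes:
        return "false negative"
    if hybridizes:
        return "cross-hybridizes"
    return "absent or too divergent"


# Toy ORFs: (nucleotide-level match, amino acid-level match)
labels = [
    classify_orf(Match(85.0, 0.90), Match(90.0, 0.95)),
    classify_orf(Match(55.0, 0.90), Match(80.0, 0.95)),
    classify_orf(None, None),
]
```

The false negative rate for a reference-tester pair would then be the fraction of reference ORFs labeled "false negative", one value per tester genome.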
ORFs that have 60% or higher nucleotide identity have been shown to give significant cross-hybridization signal on cDNA microarrays in at least two independent studies (6, 17). Murray et al. have also proposed that this level of sequence identity is close to the detection limit on cDNA platforms (6, 17). To determine the number of genes conserved at the amino acid level, two cut-offs in pair-wise sequence comparisons were used: either at least 30% amino acid identity over more than 70% of the length of the query ORF or 60% amino acid similarity over more than 70% of the length of the query ORF. The former cut-off is above the twilight zone of homology searches; thus, the genes that pass this cut-off are expected to be homologous (either orthologs or paralogs) and share at least the same general biochemical function (7, 10, 20). The latter cut-off is comparable to the one used in the nucleotide comparisons (same match length, same degree of similarity) and offers a measure of the different rates of evolution between the nucleotide and the amino acid level. Similarity instead of identity was preferred in this case to make use of the available knowledge on similarities in function between different amino acids. Finally, the cut-off of 70% of the length of the query ORF was used in all cases to ensure that the same gene is involved (not just a conserved domain), but simulations using smaller cut-offs such as 60% of the length did not significantly affect our conclusions (data not shown). This in-silico experiment was performed within each of the three model groups in our study, namely the enterics (10 genomes), mycobacteria (7 genomes) and streptococci (6 genomes). E. coli strain O157, M. tuberculosis strain H37Rv and S. pneumoniae strain TIGR4 were used as the reference genomes in each group, respectively.

Pair-wise whole genome comparisons.
All ORFs annotated as (predicted) protein-coding sequences in the GenBank files at NCBI of the reference genome were searched against the whole genomic sequence of the tester genome using the appropriate versions of the BLAST algorithm release 2.2.4 (1). The blastn (nucleotide level) default settings tend to give shorter alignments compared to blastp or tblastn (amino acid level) with distantly related species (where nucleotide sequences are more diverged) because they are targeting highly identical matches. This caused an underestimation of the number of conserved genes at the nucleotide level compared to the amino acid level in distantly related species when default settings were applied. In an attempt to make BLAST alignments in the nucleotide search comparable to ones in the amino acid search, we gradually changed several of the blastn parameters until saturation in the total number of matches passing our nucleotide cut-off was reached. When differences were negligible, i.e. less than 1% difference in the total number of matches, default settings were used. The same approach was applied for several parameters in the amino acid searches as well. This led to the following parameters used in the study: a) for blastn: X = 150 (drop-off value for gapped alignment) and q = -1 (penalty for nucleotide mismatch), with the rest of the parameters at default settings; b) for blastp or tblastn: default settings; c) for the oligo probes (50 mers) blastn search: X = 50, q = -1 and W = 7 (word size), with the rest of the parameters at default settings. Searching against the whole genomic sequences (instead of the annotated ORFs) was preferred to avoid inconsistencies in annotation between two genomes.

cDNA vs. oligo arrays.

Evolutionary relatedness experiment. When microarrays are used to study the evolutionary relationships among strains the following procedure is typically used: The tester strain is labeled with a different dye (e.g. Cy3) from the reference genome (e.g.
Cy5), the two labeled genomes are then competitively hybridized on a whole genome microarray platform built based on the reference genome and the dye ratios are used to reveal the evolutionary relatedness (in terms of gene content) between the evaluated strains. We attempted to compare cDNA to oligo arrays in this respect by simulating evolutionary relatedness experiments in-silico as follows: We designed hypothetical oligo and cDNA probes (see probe design below) for the reference genomes and used their BLAST matches in the tester genomes to estimate the expected hybridization signal on a hypothetical microarray experiment. For this, the best blastn match of every query sequence (e.g. oligo probe, cDNA probe or whole ORF sequence) in the tester genome was saved when it had an expectation value of e < 0.001 (or e < 10 for oligo-probe sequences). The length of the match and its identity were transformed to a 0 to 1 scale and the transformed length and identity values were multiplied. In this way the most similar matches were given higher scores, e.g. the perfect matches equaled 1, which was analogous to a hybridization experiment where the genes with a higher degree of similarity are expected to give higher hybridization signals. Thus, the [transformed length × transformed identity] values for each query sequence offered a reliable, qualitative prediction of its expected hybridization signal against the tester genome relative to its expected signal against the reference genome (because the latter equaled 1 for all query sequences), similar to the Cy3/Cy5 ratio used in real microarray experiments. Phylogenetic trees of the evaluated genomes were subsequently built based on the hierarchical clustering of the predicted hybridization signals (the [transformed length × transformed identity] values) using the Cluster version 3.0 software (8). Final trees were visualized with the TreeView software, available at http://rana.lbl.gov/EisenSoftware.htm (8).
Both parametric (Pearson correlation, Euclidean distance) and non-parametric (Spearman correlation) methods were used to calculate distances in the trees. The whole ORF trees presumably represented the expected results and were used as reference for comparisons between the oligo and cDNA trees. Finally, the non-specific hybridization signal, which is presumably significant in real experiments, was not considered in this simulation since only the best BLAST match for each query sequence was included in the analysis.

Correct gene identification experiment. We also evaluated whether oligo arrays give comparable results to cDNA arrays with respect to correct gene identification. For this, we determined which oligo or cDNA probes are expected to cross-hybridize (with the tester genome) and checked them against the results of the corresponding whole ORF sequences (i.e. the ORF that the probe was designed for). False negatives in this case were defined as the probes that were not predicted to cross-hybridize although the corresponding ORFs were, whereas the reverse was considered a false positive. Oligo probes (50 mers) were expected to cross-hybridize when they had a blastn match better than 80% identity over more than 80% of the length of the oligo probe in the tester genome. Fifty-mer oligonucleotides that share this level of identity with a target sequence have been shown to cross-hybridize to it (14). The same cut-off (i.e. 60% nucleotide identity over more than 70% of the length) as previously used for whole ORF sequences was applied to determine the number of cDNA probes that cross-hybridize. cDNA probes were at least 200 nt long and had small differences compared to the corresponding whole ORF sequences (e.g. the average sequence length was 718 vs. 903 nucleotides, respectively), which justified the usage of the same cut-off for cDNA probe sequences.
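As a minimal sketch of the comparison just described, the following Python applies the stated oligo cut-off and labels each probe call against the call for its whole ORF. The function names and return labels are hypothetical, chosen for illustration only; the original analysis was performed on BLAST output and is not reproduced here.

```python
def oligo_cross_hybridizes(percent_identity: float, aligned_length: int,
                           probe_length: int = 50) -> bool:
    """Oligo rule: better than 80% identity over more than 80% of the probe length."""
    return percent_identity > 80.0 and aligned_length / probe_length > 0.80


def classify_probe(probe_hybridizes: bool, orf_hybridizes: bool) -> str:
    """Compare the probe prediction with the prediction for the whole ORF."""
    if orf_hybridizes and not probe_hybridizes:
        return "false negative"   # ORF predicted to hybridize, probe not
    if probe_hybridizes and not orf_hybridizes:
        return "false positive"   # probe predicted to hybridize, ORF not
    return "agreement"
```

For example, a 50-mer with a 45-nucleotide match at 90% identity passes the oligo rule, whereas a 40-nucleotide match (exactly 80% of the probe) does not, because the rule requires more than 80% of the length.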
False negatives involved instances where two ORFs had a short non-overlapping region and the probe was designed for this region, or where the region targeted by the probe had diverged below the probe cut-off while the overall sequence identity of the whole ORFs was still greater than the cut-off used for ORFs. False positives mostly involved cases where two ORFs have a short overlapping region (less than 70% of the length) and the oligo was designed for this region.

Probe design. Probes were designed for each reference strain within a group. In the following text, the reference strain for the enterics group, strain O157, is used as a representative example; the same analysis was performed for the remaining two reference strains. cDNA probes were designed as follows: the PRIMEGENS software (27) was used to design primers to amplify unique fragments for every possible ORF in the genome of E. coli strain O157. PRIMEGENS was run with default settings except that the amplified region was limited to between 200 and 1000 nucleotides. The sequence between a primer pair (the amplified region) was then extracted from the genomic sequence using PERL scripts and these sequences were used as cDNA probes. With this approach, we were able to design specific primers for 3,994 ORFs in the genome of strain O157. The 3,994 cDNA probe sequences were then searched against the remaining genomic sequences in the enterics group as previously described for whole ORF sequences. Oligo-probes (50 mers) specific for each ORF in the E. coli strain O157 genome were designed using the OligoArray software (21). The OligoArray settings were optimized to ensure probe specificity and to avoid secondary structure and poly-nucleotide repeats (> 5 mers, e.g. TTTTT) in the probe sequence. In total, 5,298 oligos were designed for the 5,361 ORFs in the strain O157 genome; 63 ORFs failed to give a specific oligo under the selection criteria of our design.
The oligo sequences were then searched against the remaining genomic sequences in the enterics group as described previously. The final comparison between cDNA and oligo probes was performed with the ORF set that had both a cDNA probe and an oligo probe designed (3,992 ORFs in the E. coli O157 case).

Non-specific signal. The influence of non-specific hybridization signal, i.e. signal that is attributable to multiple gene copies, paralogous genes and/or conserved domains rather than the targeted sequence, on microarray results remains a poorly investigated issue. We attempted to evaluate the importance of non-specific signal in DNA-DNA microarray studies by considering all BLAST matches of the whole ORF sequences in the tester genomes. In this case, the [transformed length × transformed identity] values for all matches of an ORF in the tester genome were summed and the result was divided by the sum of the [transformed length × transformed identity] values for all matches of the same ORF within the reference genome. The ratio of the sums was used as a qualitative prediction of the relative hybridization signal between tester and reference genomes, similar to the simulation described previously where only the best match was considered. This approach assumed that two matches of similar identity but of different length (e.g. 10% vs. 100% of the length) would contribute to the overall signal proportionally to their length (e.g. 1/11 vs. 10/11, respectively). Likewise, matches of different levels of identity would contribute proportionally to their identity. This experiment was not performed for probe sequences because the effect of the position and extent of the mispairing on hybridization signal is not easily quantifiable, particularly for short oligo sequences such as 50 mers (14).
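The scoring used in the simulations above can be illustrated with a short Python sketch. It assumes that "transformed to a 0 to 1 scale" means the match length divided by the query length and the percent identity divided by 100; the text does not spell out the transformation, so this reading is an assumption, and the function names and the example ORF are hypothetical.

```python
from typing import Iterable, Tuple

# A match is given as (aligned_length_in_nt, percent_identity).
MatchPair = Tuple[int, float]


def match_score(aligned_length: int, query_length: int, percent_identity: float) -> float:
    """[transformed length x transformed identity]; a perfect full-length match scores 1."""
    return (aligned_length / query_length) * (percent_identity / 100.0)


def relative_signal(tester_matches: Iterable[MatchPair],
                    reference_matches: Iterable[MatchPair],
                    query_length: int) -> float:
    """Sum of scores over all tester matches divided by the sum over all reference matches."""
    tester_total = sum(match_score(n, query_length, pid) for n, pid in tester_matches)
    reference_total = sum(match_score(n, query_length, pid) for n, pid in reference_matches)
    if reference_total == 0:
        raise ValueError("the ORF must have at least one match in its own (reference) genome")
    return tester_total / reference_total


# Hypothetical 1000-nt ORF: one perfect self-match in the reference genome, and two
# tester matches at 90% identity covering 100% and 10% of the ORF length.
signal = relative_signal([(1000, 90.0), (100, 90.0)], [(1000, 100.0)], 1000)
```

In this example the two tester matches contribute 0.9 and 0.09, i.e. in the 10/11 vs. 1/11 proportion mentioned in the text, and the predicted relative signal is about 0.99. A ratio above 1 would indicate more copies or paralogs in the tester than in the reference genome.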
[...] 70% (or >99% for 16S rRNA) for the highly related pairs to 50-30% (or >98% for 16S rRNA) for the moderately related ones and <30% (or <98% 16S rRNA) for the distantly related ones. In fact, some of the species we term highly related are considered ecotypes of the same species by many investigators.

Predicted microarray performance. To comprehensively evaluate microarray performance, we expressed the number of false negatives between any pair of strains (i.e. a reference and a tester strain) as a percent of the total number of ORFs expected to cross-hybridize and plotted it against the DNA-homology and 16S rRNA sequence identity values between the pair of strains (Figure 5.1). The expression of false negatives as a percentage allowed for a genome size independent estimation, since larger genomes (i.e. more ORFs) gave more false negatives (absolute number) compared to smaller genome-sized species that showed similar evolutionary relatedness to the reference strain. Additionally, the use of the number of ORFs that are expected to cross-hybridize instead of the total number of ORFs in the genome minimizes the effect of the varied levels of genomic diversity (e.g. loss or addition of genetic elements) that characterize different species (e.g. the sequenced E. coli strains harbor much greater genomic diversity than the M. tuberculosis ones). Our results suggested that false negatives increased with increased evolutionary distance between the reference and the tester strain (Figure 5.1). The two cut-offs used to determine the number of the conserved genes (amino acid level) gave significantly different estimations, with the 30% amino acid identity cut-off giving more false negatives than the 60% amino acid similarity cut-off.
For example, a microarray experiment would be expected to miss at least 5% of the conserved genes when the reference and tester strains reside in moderately related species according to the 30% amino acid identity cut-off (Figure 5.1A) whereas the same number of false negatives is expected when the reference and tester strains reside in moderately related species according to the 60% amino acid similarity cut-off (Figure 5.1B). Regardless of the cut-off used, however, DNA-DNA studies between strains that are less than 97.5-97.0% identical in terms of 16S rRNA sequence are expected to have an unacceptably high number of false negatives (i.e., more than 10%). With regard to the estimation of the evolutionary distance between reference and tester strain, 16S rRNA sequence identity offered a better measurement than DNA-homology values because the latter method gave poor resolution in distantly related species (see DNA-homology datapoints below 20% in Figure 5.1). In addition, the 16S rRNA sequence identity values gave a stronger correlation than the DNA-homology values. This is partially explained by the technical limitations in the DNA-homology experiments such as the imprecision of these measures, the varied protocols used and the fact that the strains of the species used in these experiments were different from the strains of the same species sequenced and used in our simulations. The correlation was slightly higher for the 30% identity than for the 60% similarity cut-off (R² = 0.94 vs. R² = 0.83 for the 16S rRNA data and R² = 0.84 vs. R² = 0.72 for the DNA-homology data; all regressions were significant at P < 0.001). The strong correlations obtained with the combined data set are indicative of the comparable results obtained within each of the three bacterial groups evaluated (analytical results for each group are not shown).

Figure 5.1. Correlation between microarray false negatives and evolutionary distance between reference and tester strain. Each point represents the false negatives, expressed as percentage of the total number of ORFs predicted to cross-hybridize with the tester genome, between a reference and a tester strain plotted against the DNA-homology values (solid squares, upper X-axis) and the 16S rRNA sequence identity (open squares, bottom X-axis) between the reference and tester strain. (A): 30% amino acid identity cut-off. (B): 60% amino acid similarity cut-off.

Importance of microarray false negatives. To evaluate their importance, microarray false negatives were checked against the total number of ORFs from the reference strain not conserved at the nucleotide level in the tester strain (i.e., the reference strain-specific ORFs). False negatives comprised, at maximum, one-third and one-fifth of the total number of ORFs not conserved based on the 30% amino acid identity and 60% amino acid similarity cut-off, respectively (Figure 5.2). Furthermore, false negatives became less important, i.e. comprised a smaller fraction of the non-conserved genes, with decreased evolutionary distance between the tester and reference strains.
For instance, and regardless of the cut-off used, false negatives did not comprise more than 15% of the ORFs not conserved in the tester strain for any tester strain highly or moderately related to the reference strain. These results suggested that although false negatives may occur at significantly high numbers (see 30% amino acid identity cut-off in Figure 5.1), they should represent a small fraction of the genes not shared between highly or moderately related strains (Figure 5.2).

Figure 5.2. Importance of microarray false negatives. Solid bars represent the total number of ORFs from the reference genome not conserved at the nucleotide level in the tester genome (X-axis). Gray and open bars represent the part of these ORFs that are also predicted to be microarray false negatives at the 30% amino acid identity and 60% amino acid similarity cut-offs, respectively.

Non-specific signal. Non-specific hybridization signal appeared to affect a sizeable number of ORFs in all pairs of reference-tester strains tested. For instance, for the 3,994 whole ORF sequences evaluated between E. coli O157 and S. enterica pathovar Typhimurium, 1,222 (30.6%) had a different predicted hybridization signal when all matches were considered compared to the best match prediction (Figure 5.3A) and 466 (11.7%) of them showed a larger difference than +/- 0.1 from their best match prediction. Of the 466 ORFs, 268 gave higher signal when all matches were considered and 123 of them were predicted to give higher signal with the Typhimurium genome than with strain O157 (datapoints that have values more than 1 on the y-axis).
The latter is attributable to the tester strain having more copies of the gene, paralogous genes, and/or conserved domains than the reference strain for these 123 ORFs. The opposite situation, i.e. ORFs showing less hybridization signal when all matches are considered, was true for 198 of the 466 ORFs. Thus, for a significant fraction of ORFs in any competitive hybridization experiment, misleading results, i.e. false positives or false negatives, should be expected as the result of non-specific signal. E. coli K12 had more highly related matches (datapoints in the 0.8-1 range between Panels A & B) and fewer ORFs (267) that showed more than +/- 0.1 difference from their best match prediction than Typhimurium, reflecting its closer relatedness to the reference strain (Figure 5.3B). Nonetheless, strain K12 had a comparable number of ORFs affected by non-specific signal (1,232) to Typhimurium. Similar trends were observed for the remaining pairs tested (data not shown).

Figure 5.3. Non-specific hybridization signal for whole ORF sequences. Each point represents the predicted signal for an ORF when all its matches in the tester genome were considered (Y-axis) vs. the predicted signal when only the best match was considered (X-axis). Thus, any points that deviate from the diagonal represent ORFs that are predicted to be affected by non-specific hybridization signal. (A): tester strain is S. enterica pathovar Typhimurium, (B): tester strain is E. coli K12. Reference strain is E. coli O157.

cDNA vs. Oligo arrays.
The predicted performance of oligo and cDNA arrays was evaluated in terms of: I) the expected results relative to the evolutionary distance among the evaluated strains and II) the correct gene identification.

I) Evolutionary relatedness. Trees based on the hierarchical clustering of the predicted hybridization signal were very similar, both in terms of topology and distances between nodes, between cDNA and whole ORF regardless of the method (parametric vs. non-parametric) used for the calculation of distances (Figure 5.4 B & C). The high congruence between cDNA and whole ORF trees was probably attributable to the small differences between the cDNA probe sequences and the whole ORF ones (see methods section). On the other hand, the oligo tree tended to overestimate distances in more distantly related strains (relative to the reference strain) compared to the cDNA one. For example, the oligo tree predicted a larger distance between the E. coli-Shigella cluster and the Salmonella or the Yersinia ones than the whole ORF tree. This property of the oligo tree also caused some branching differences in the ancestral nodes when the Spearman correlation was used, e.g. Yersinia groups with E. coli EDL instead of the Salmonella-Klebsiella cluster (Figure 5.4C). However, there was, overall, high similarity between the oligo and the cDNA trees, as evidenced by the identical clustering of strains at the terminal nodes between the two trees. In addition, principal component analysis confirmed the presence of three major clusters (i.e. the E. coli-Shigella, the Salmonella-Klebsiella and the Yersinia) for all three trees, i.e. oligo, cDNA and whole ORF (data not shown).

Figure 5.4. cDNA vs. Oligo-arrays: Evolutionary relatedness results. (A): Each spot represents the [transformed length × transformed identity] value for the best BLAST match of an O157 ORF (right) in a tester genome (top). (B): The results from the hierarchical clustering of the [transformed length × transformed identity] values using Pearson correlation for every set of query sequences, i.e. whole ORFs, cDNA and oligo probes. (C): Hierarchical clustering using Spearman rank correlation. Images in this thesis are presented in color.

II) Correct gene identification. When the reference and tester genome resided in the same or highly related species, oligos had sufficiently low incidences of false negatives (Figure 5.5). However, oligo-array false negatives dramatically increased with increased evolutionary distance.
It appeared that the increase correlated with the transition of the tester strain from highly related to moderately related species and leveled off when the tester strain was a distantly related species. For instance, all the highly related pairs of reference-tester strains in Figure 5.5 had about 1% predicted false negatives (see S. pneumoniae strain R6, E. coli K12, Shigella flexneri data points) and the moderately related S. mitis (46% DNA-DNA reassociation and 99% 16S rRNA sequence identity to the reference S. pneumoniae TIGR4) had about 5%. When the tester strain was a distantly related species (e.g. Salmonella or Yersinia for Panel A, or S. pyogenes and S. agalactiae for Panel B), false negatives were between 30 and 40%. On the other hand, false negatives for the cDNA array were consistently below 5% for all tester strains. Lastly, the predicted false positives for both cDNA and oligo-arrays were consistently below 2-3% regardless of the tester strain used (data not shown). For the oligo-array, this was not surprising inasmuch as the likelihood of getting a 50 nucleotide long exact match in the tester strain by chance alone is (1/4)^50.

Figure 5.5. cDNA vs. Oligo-arrays: Gene identification. Bars represent the predicted false negatives (expressed as percentage of the total number of probes that are expected to hybridize) for the cDNA (open bars) and oligo (solid bars) probes. (A): The enterics. (B): The streptococci. Tester strains (from left to right) are: (A), E. coli K12, Shigella flexneri, S. Typhimurium, Klebsiella pneumoniae, Y. pestis; (B), S. pneumoniae R6, S. mitis, S. pyogenes, S. agalactiae; reference strains were E. coli O157 and S. pneumoniae TIGR4, respectively.
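The chance-match argument made above for the oligo-array can be checked with a few lines of arithmetic. The sketch below assumes equal base frequencies and independent positions, and uses a hypothetical 5 Mb tester genome; none of these assumptions or names come from the original analysis.

```python
PROBE_LENGTH = 50            # oligo length in nucleotides (from the text)
GENOME_LENGTH = 5_000_000    # hypothetical tester genome size, in nucleotides

# Probability that one specific 50-nt window matches the probe exactly by chance,
# assuming each of the four bases is equally likely at every position.
p_exact = 0.25 ** PROBE_LENGTH            # (1/4)^50, roughly 8e-31

# Expected number of exact chance matches when every start position on both
# strands of the hypothetical genome is examined.
windows = 2 * (GENOME_LENGTH - PROBE_LENGTH + 1)
expected_chance_matches = windows * p_exact
```

Even with millions of candidate positions, the expected number of exact chance matches remains vanishingly small under these assumptions.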
DISCUSSION

DNA microarrays have been used for genetic comparisons among strains and have the potential to be used for expression studies with strains other than the sequenced (reference) strain, but such uses raise potential uncertainties in interpretations. We found that false negatives caused by the different rates of evolution between amino acid and nucleotide sequences comprised a rather small fraction of the total number of ORFs not conserved at the nucleotide level between strains of the same or highly related species (Figure 5.2). This suggests that the total genomic diversity is far more important than false negatives in DNA microarray studies within species. The practical implication of these findings is that, in DNA-DNA studies that attempt to cover a whole species, false negatives should be of secondary importance compared to flexibility in microarray design to accommodate genetic diversity (e.g. more unique sequences). An understanding of the genetic diversity within a species is also required for the successful coverage of the species in such experiments. For example, M. tuberculosis and S. pneumoniae do not share the genetic diversity of E. coli and Shigella sp., at least based on the available genomic sequences (Figure 5.2). Experiments with distantly related species are less common and probably involve specialized goals such as taxonomic comparisons. However, microarray false negatives are probably too high to be neglected in this case. The relationship described in this study (Figure 5.1) allows the approximate estimation of the missed genes for a given evolutionary distance between reference and tester strain and a given stringency in the amino acid comparisons. This relationship is probably applicable to bacterial groups besides the ones used in this study because all three groups evaluated gave consistent results, covered a range of typical bacterial genome sizes (2-5.5 Mb) and included both gram-positive and gram-negative members.
The 30% amino acid identity cut-off gave more false negatives than the 60% amino acid similarity cut-off due to the lower stringency in sequence comparisons, which selected for more paralogous genes. This was evident in the enterics group where the 30% amino acid identity cut-off predicted a significant number (up to 5%) of false negatives for several E. coli or Shigella sp. tester strains (Figure 5.2). Indeed, part of the extra DNA in the reference O157 strain compared to these E. coli or Shigella tester strains involves paralogous genes in expanded gene families and multiple phage copies (26). At the same time, and validating its usage, the 30% amino acid identity cut-off predicted almost no false negatives for strain EDL, which is the most closely related, of all strains evaluated, to the reference strain O157 (18); and it predicted very low numbers of false negatives (<1-2%) for the highly related strains of M. tuberculosis and S. pneumoniae, which is consistent with the decreased genetic diversity within these species compared to E. coli (2, 9, 12). On the other hand, the 60% amino acid similarity cut-off selected, more frequently than the 30% amino acid identity cut-off, the same genes (orthologs) between tester and reference strains and this accounted for the lower numbers of false negatives it typically predicted, particularly within species. Which of the two cut-offs is more suitable depends on the desired stringency in the experiment. It should be mentioned, however, that genes that share 60% or more amino acid similarity are also likely to have diverged in function specificity (although this is less likely than when two genes are related at 30% amino acid identity) since a few critical amino acid changes could be accompanied by a change in function specificity (10).

The use of the above cut-offs with no manual inspection of the pair-wise alignments proved highly accurate for the prediction of the conserved ORFs between closely related species.
For instance, Fleischmann et al. (9) identified, based on genomic sequence comparisons, 28 ORFs of M. tuberculosis H37Rv not conserved in M. tuberculosis strain CDC1551. Our 30% amino acid identity cut-off predicted 29 ORFs for the same comparison (31 for the 60% amino acid similarity cut-off). The low rate of error with closely related species was expected given the low level of sequence divergence between such species. Nonetheless, our approach performed equally satisfactorily with distantly related species where the nucleotide divergence is more likely to compromise automated annotations that are based on cut-offs in sequence similarity. For instance, the comparative genomic analysis of the fully sequenced Streptococcus species suggested that S. pneumoniae TIGR4 shares 1,108 and 1,229 genes with S. pyogenes and S. agalactiae, respectively (23). The 30% amino acid identity cut-off predicted 1,152 and 1,242 conserved ORFs for the same pairs of strains, respectively (1,028 and 1,114 ORFs for the 60% amino acid similarity cut-off, respectively).

When DNA microarrays are applied to reveal exact genetic differences, e.g. gene presence or absence, an oligo platform should perform satisfactorily with strains of the same or highly related species (Figure 5.5). In this case, the DNA-homology value is a better measure of the evolutionary relatedness between tester and reference strains than 16S rRNA identity because it offers better resolution between highly related strains. It should be pointed out, however, that there are too few pairs of moderately related strains (e.g. DNA-homology values between 40-60%) in the sequenced genome collection for a robust prediction in this critical range. Oligo-array performance substantially declined (i.e. high rates of false negatives) with distantly related strains, however, and this in-silico prediction is confirmed by the experimental data to date.
For example, oligo-array based genetic comparisons in the streptococci group (12) and in the Burkholderia group in our lab (K. Konstantinidis et al. unpublished) suggested that strains that are distantly related (e.g. 3-5% 16S rRNA gene mispairing) to the reference strain give little hybridization signal relative to the total number of genes conserved based on the genomic sequences. Thus, for experiments with distantly related strains, a cDNA platform should be preferable for its steady performance over this range of evolutionary distance (Figure 5.5).

Experimental data with distantly related strains also agree with our predictions for cDNA arrays. Dong et al. (6), using a whole-genome array that had the whole ORF sequences as probes, showed that 3,000 ORFs of E. coli K12 were conserved (i.e. cross-hybridized) with K. pneumoniae strain 342. Our approach predicted 2,890 ORFs of strain K12 to be conserved in K. pneumoniae strain MGH 78578 (the sequenced strain). The small difference between our prediction and the experimental results might be due to the different K. pneumoniae strain used, to ORFs missed in the high draft sequence for strain MGH 78578, or to non-specific signal in the microarray study.

In the case that DNA microarrays are employed to study evolutionary relatedness between species, an oligo array (one 50mer probe per ORF) will probably give comparable results to a cDNA array (Figure 5.4). This was not surprising inasmuch as a 50mer fragment of an ORF evolves similarly to a larger fragment (e.g. a cDNA probe) or the whole ORF. The oligo platform tended to overestimate distances between distantly related species, however. This is attributable to the difference in information content between a 50-nucleotide oligo-probe and a cDNA probe (718 nucleotides long on average), and to the lower tolerance of oligo-probes for sequence mispairing. Indeed, oligo-probes require, on average, higher sequence identity for cross-hybridization than cDNA probes (>75% vs.
60% identity) (14).

Although our predictions of non-specific signal cannot be absolute, because of the complications in quantifying total non-specific signal by adding predicted signal from individual matches, they offer some perspective on this critical issue of microarray technology. To the best of our knowledge, no systematic attempt has ever been made to calculate non-specific signal in whole-genome DNA-DNA studies. According to our simulation, a significant number of ORFs were affected by non-specific hybridization in every pair of strains evaluated (Figure 5.3). It is also anticipated that any platform, when used for genetic studies, is prone to (at least part of) the non-specific signal revealed for whole ORF sequences in this study, because even if the cDNA or oligo-probes are designed to be specific within the reference genome, this does not preclude non-specific hybridization when another genome, which may have different classes of paralogous genes, more copies of genes, etc., is used. Such non-specific signal was evident even among strains of the same species (see E. coli K12 vs. E. coli O157 in Figure 5.3B). These findings suggest that misleading conclusions might be reached when non-specific signal is not considered in DNA-DNA microarray studies. On the other hand, if hybridization signal is carefully considered, it has the potential to reveal genes and regions that have been duplicated in the tester genome compared to the reference one. Such duplicated regions are likely to play a major role in the unique phenotypic characteristics or ecological niche of the tester strain.

When undertaking microarray approaches, both technical and performance issues need to be considered. There are several reviews dealing with technical issues such as flexibility in design, chip technology, probe chemistry, and labeling method (cf. references 19, 28).
We evaluated the predicted performance of cDNA and oligo arrays, as well as false negatives and non-specific hybridization, based solely on sequence analysis. Despite certain limitations in the in-silico modeling, our results should be a good approximation of reality and can offer useful information in planning appropriate DNA microarray studies. Our results also provide guidance for some experimental tests, which would not only test the validity of our predictions but also enhance predictive ability, especially with moderately and distantly related species.

ACKNOWLEDGEMENTS

We thank TIGR for permission to use preliminary sequence data for M. avium, M. smegmatis, and S. mitis. Sequencing of M. avium, M. smegmatis, and S. mitis was accomplished with support from NIAID and NIH-NIDCR, respectively. We thank Joel Klappenbach and Hector Ayala-del-Rio for helpful discussions regarding the manuscript. This work was supported by the Bouyoukos Fellowship Program (KTK), the DOE’s Microbial Genome Program and the Center for Microbial Ecology.

REFERENCES

1. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-402.
2. Behr, M. A., M. A. Wilson, W. P. Gill, H. Salamon, G. K. Schoolnik, S. Rane, and P. M. Small. 1999. Comparative genomics of BCG vaccines by whole-genome DNA microarray. Science 284:1520-3.
3. Brenner, D. 1984. Bergey's manual of systematic bacteriology, 1st ed, vol. 1. Williams and Wilkins, Baltimore.
4. Cho, J. C., and J. M. Tiedje. 2001. Bacterial species determination from DNA-DNA hybridization by using genome fragments and DNA microarrays. Appl Environ Microbiol 67:3677-82.
5. Cole, J. R., B. Chai, T. L. Marsh, R. J. Farris, Q. Wang, S. A. Kulam, S. Chandra, D. M. McGarrell, T. M. Schmidt, G. M. Garrity, and J. M. Tiedje. 2003.
The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res 31:442-3.
6. Dong, Y., J. D. Glasner, F. R. Blattner, and E. W. Triplett. 2001. Genomic interspecies microarray hybridization: rapid discovery of three thousand genes in the maize endophyte, Klebsiella pneumoniae 342, by microarray hybridization with Escherichia coli K-12 open reading frames. Appl Environ Microbiol 67:1911-21.
7. Eisen, J. A. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 8:163-7.
8. Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95:14863-8.
9. Fleischmann, R. D., D. Alland, J. A. Eisen, L. Carpenter, O. White, J. Peterson, R. DeBoy, R. Dodson, M. Gwinn, D. Haft, E. Hickey, J. F. Kolonay, W. C. Nelson, L. A. Umayam, M. Ermolaeva, S. L. Salzberg, A. Delcher, T. Utterback, J. Weidman, H. Khouri, J. Gill, A. Mikula, W. Bishai, W. R. Jacobs, Jr., J. C. Venter, and C. M. Fraser. 2002. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J Bacteriol 184:5479-90.
10. Gerlt, J. A., and P. C. Babbitt. 2001. Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu Rev Biochem 70:209-46.
11. Goodfellow, M., and A. O'Donnell. 1993. Handbook of New Bacterial Systematics. Academic Press Inc, San Diego.
12. Hakenbeck, R., N. Balmelle, B. Weber, C. Gardes, W. Keck, and A. de Saizieu. 2001. Mosaic genes and mosaic chromosomes: intra- and interspecies genomic variation of Streptococcus pneumoniae. Infect Immun 69:2477-86.
13. Imaeda, T. 1985.
Deoxyribonucleic acid relatedness among selected strains of the Mycobacterium tuberculosis, Mycobacterium bovis, Mycobacterium bovis BCG, Mycobacterium microti, and Mycobacterium africanum. Int J Syst Bacteriol 35:147-150.
14. Kane, M. D., T. A. Jatkoe, C. R. Stumpf, J. Lu, J. D. Thomas, and S. J. Madore. 2000. Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res 28:4552-7.
15. Kawamura, Y., X. G. Hou, F. Sultana, H. Miura, and T. Ezaki. 1995. Determination of 16S rRNA sequences of Streptococcus mitis and Streptococcus gordonii and phylogenetic relationships among members of the genus Streptococcus. Int J Syst Bacteriol 45:406-8.
16. Lockhart, D. J., H. Dong, M. C. Byrne, M. T. Follettie, M. V. Gallo, M. S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E. L. Brown. 1996. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14:1675-80.
17. Murray, A. E., D. Lies, G. Li, K. Nealson, J. Zhou, and J. M. Tiedje. 2001. DNA/DNA hybridization to microarrays reveals gene-specific differences between closely related microbial genomes. Proc Natl Acad Sci U S A 98:9853-8.
18. Pupo, G. M., D. K. Karaolis, R. Lan, and P. R. Reeves. 1997. Evolutionary relationships among pathogenic and nonpathogenic Escherichia coli strains inferred from multilocus enzyme electrophoresis and mdh sequence studies. Infect Immun 65:2685-92.
19. Relogio, A., C. Schwager, A. Richter, W. Ansorge, and J. Valcarcel. 2002. Optimization of oligonucleotide-based DNA microarrays. Nucleic Acids Res 30:e51.
20. Rost, B. 1999. Twilight zone of protein sequence alignments. Protein Eng 12:85-94.
21. Rouillard, J. M., C. J. Herbert, and M. Zuker. 2002. OligoArray: genome-scale oligonucleotide design for microarrays. Bioinformatics 18:486-7.
22. Stackebrandt, E., and B. M. Goebel. 1994.
Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol 44:846-849.
23. Tettelin, H., V. Masignani, M. J. Cieslewicz, J. A. Eisen, S. Peterson, M. R. Wessels, I. T. Paulsen, K. E. Nelson, I. Margarit, T. D. Read, L. C. Madoff, A. M. Wolf, M. J. Beanan, L. M. Brinkac, S. C. Daugherty, R. T. DeBoy, A. S. Durkin, J. F. Kolonay, R. Madupu, M. R. Lewis, D. Radune, N. B. Fedorova, D. Scanlan, H. Khouri, S. Mulligan, H. A. Carty, R. T. Cline, S. E. Van Aken, J. Gill, M. Scarselli, M. Mora, E. T. Iacobini, C. Brettoni, G. Galli, M. Mariani, F. Vegni, D. Maione, D. Rinaudo, R. Rappuoli, J. L. Telford, D. L. Kasper, G. Grandi, and C. M. Fraser. 2002. Complete genome sequence and comparative genomic analysis of an emerging human pathogen, serotype V Streptococcus agalactiae. Proc Natl Acad Sci U S A 99:12391-6.
24. Tonjum, T., D. B. Welty, E. Jantzen, and P. L. Small. 1998. Differentiation of Mycobacterium ulcerans, M. marinum, and M. haemophilum: mapping of their relationships to M. tuberculosis by fatty acid profile analysis, DNA-DNA hybridization, and 16S rRNA gene sequence analysis. J Clin Microbiol 36:918-25.
25. Wei, Y., J. M. Lee, C. Richmond, F. R. Blattner, J. A. Rafalski, and R. A. LaRossa. 2001. High-density microarray-mediated gene expression profiling of Escherichia coli. J Bacteriol 183:545-56.
26. Welch, R. A., V. Burland, G. Plunkett, III, P. Redford, P. Roesch, D. Rasko, E. L. Buckles, S. R. Liou, A. Boutin, J. Hackett, D. Stroud, G. F. Mayhew, D. J. Rose, S. Zhou, D. C. Schwartz, N. T. Perna, H. L. T. Mobley, M. S. Donnenberg, and F. R. Blattner. 2002. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 99:17020-17024.
27. Xu, D., G. Li, L. Wu, J. Zhou, and Y. Xu. 2002. PRIMEGENS: robust and efficient design of gene-specific probes for microarray analysis. Bioinformatics 18:1432-7.
28. Yang, Y. H., and T. Speed. 2002.
Design issues for cDNA microarray experiments. Nat Rev Genet 3:579-88.
29. Ye, R. W., W. Tao, L. Bedzyk, T. Young, M. Chen, and L. Li. 2000. Global gene expression profiles of Bacillus subtilis grown under anaerobic conditions. J Bacteriol 182:4458-65.

SUMMARY AND PERSPECTIVES FOR THE FUTURE

Although the evolutionary processes and ecological benefits of symbiotic species with small genomes are well understood, these issues remain poorly elucidated for free-living species with large genomes. Hence, I compared the 115 completed (at the time) prokaryotic genomes to determine whether the proportion of the genome attributable to particular cellular processes changes with genome size, since this may reflect both cellular and ecological strategies associated with genome expansion. Large genomes were found to be disproportionately enriched in regulation and secondary metabolism genes and depleted in protein translation, DNA replication, cell division and nucleotide metabolism genes compared to medium and small-sized genomes. Further, large genomes do not accumulate non-coding DNA or hypothetical CDS, since the portion of the genome devoted to these remained constant with genome size. Traits other than genome size, or strain-specific processes, are reflected by the dispersion around the average, and the current analysis provides a means to identify such traits and processes and to quantify their importance for every gene functional category or bacterial group of interest.

These trends suggest that larger genome-sized species may dominate in environments where resources are scarce but diverse and where there is little penalty for slow growth, such as soil. Testing this hypothesis is not a trivial task, however. One approach may be to estimate genome sizes of many strains isolated from environmental sources showing different characteristics in terms of resource abundance and availability (for instance soil vs. marine water).
The genome size of the isolates can be inferred from the phylogenetic position of the isolate, e.g., when a closely related isolate with known genome size already exists, or determined by Pulsed Field Gel Electrophoresis. Potential pitfalls of this approach are that the evaluated strains must show significant population sizes and activities in their natural environments, i.e., be ecologically successful as opposed to simply surviving in a dormant or spore state, and must represent a phylogenetically unbiased collection. For soil, slow-growing isolates should be included in the analysis, when possible, because previous studies suggest that these isolates represent (more) dominant populations in this environment (4, 5). Further, there might be a correlation between the time of appearance of an isolate and genome size. One reason for this could be that a large genome-sized species must spend energy to express (at least part of) the increased number of regulatory proteins it possesses to successfully control its metabolic repertoire. This, all else being equal, might make it grow more slowly than smaller genome-sized species.

The species definition for Prokaryotes remains a highly controversial and unsettled issue (2, 8, 9). A comparative analysis, using gene content derived from genome information, to identify whether there are species boundaries and to determine the role of the organism’s ecology on its common gene content was undertaken to better inform the current species definition. It was found that strains of the same species may frequently show genetic and functional differences too large for them to be considered the same species, and that (different) ecology appears to play an important role in these differences. The existence of genetic signatures, i.e., a sizable number of genes of ecological importance, between groups of strains of the same species (current definition) further supports the previous interpretations.
The inter-group genetic similarity in several of these cases is as high as 98-99% average nucleotide identity (ANI), indicating that “species” might be found even among organisms that are nearly identical at the nucleotide level. Moreover, a large fraction, e.g., up to 65%, of the differences within species (current definition) is associated with bacteriophage and transposase elements, indicating a much more important role of these elements during bacterial speciation than previously expected. The effect of such “mobile” elements on the organism’s phenotype is currently mostly unclear. In conclusion, the results presented in this dissertation support a more stringent and natural definition for prokaryotic species compared to the current one, which should be flexible enough to accommodate the ecological differences among the organisms.

It is important to realize that the results presented here should be considered a first step in describing an emerging picture rather than conclusive findings, because the available genomes represent only a tiny fraction of the total prokaryotic diversity and are heavily biased towards pathogenic species. Further, in order to obtain a large enough dataset, comparisons between strains of different genera had to be pooled together, resulting in clear discontinuities in the results reported. Therefore, a better sampling of species with genomic sequences is still needed to reject, for instance, the hypothesis that there is a continuum of genetic diversity as opposed to species-specific genetic signatures, which would not support a species concept for Prokaryotes. Last, there is inadequate knowledge of the population sizes and activities in the natural environments of most (even the sequenced!) species, and hence the quantification of the effect of ecology on the conserved gene content is not currently feasible.
Studying natural populations at the genomic level and over time will also allow us to more fully evaluate the importance of mobile elements for the process of bacterial speciation.

The ranks of the prokaryotic taxonomy higher than species, i.e., the family, order, class, phylum and domain, are primarily based on phylogenetic analysis of the 16S rRNA gene sequence and secondarily on old observations about the morphological and/or biochemical relatedness of the grouped organisms (2, 6). Phylogenetic clustering based on the genetic distance between two organisms derived from their whole-genome comparison is generally congruent with the clustering based on the 16S rRNA gene, which adds further support to the current classification system. The genomic approach revealed, however, that there is little (if any) predictive power for the currently used higher taxonomic ranks, with the exception of the domain rank, in terms of conserved gene content and genetic distance (measured as the average amino acid identity (AAI) of the conserved genes) of the grouped organisms. Further, organisms of each prokaryotic phylum and several classes may be considered nearly as different from each other as Bacteria are different from Archaea at the genomic level. These findings reveal a much larger genetic and functional diversity for Prokaryotes than previously expected based on the analysis of the 16S rRNA gene.

AAI for longer and ANI for shorter evolutionary scales are simple and highly reliable means to measure relatedness between organisms and evaluate the robustness of genetic markers for phylogenetic purposes. The influence of lateral gene transfer (LGT) on these measures remains to be seen, but it is anticipated that it will not be more important than the influence of LGT on measures that are based on single or a few genes, such as the 16S rRNA or Multi Locus Sequence Typing (MLST) based approaches.
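Both measures reduce to an average over the genes conserved between two genomes. The following is a minimal sketch, assuming the per-gene percent identities (amino acid for AAI, nucleotide for ANI) have already been obtained from pairwise alignments; the function name and the example numbers are illustrative only and do not reproduce the computations of this study.

```python
def average_identity(per_gene_identities):
    """Mean percent identity over the genes conserved between two genomes.

    Pass amino acid identities to obtain AAI, or nucleotide identities to
    obtain ANI. Genes without a conserved counterpart are assumed to have
    been excluded beforehand.
    """
    identities = list(per_gene_identities)
    if not identities:
        raise ValueError("no conserved genes to average over")
    return sum(identities) / len(identities)


# Hypothetical identities (%) for three conserved genes.
example_ani = average_identity([100.0, 98.0, 96.0])
```

Because the average is taken over many genes, a single gene with an atypical history carries little weight in it, which is in line with the expectation stated above about the limited influence of LGT.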
The major limitation of AAI and ANI is that they require the availability of genomic sequences; however, technological advancements in genomic sequencing may render this less problematic in the near future. Last, AAI and ANI can be uniformly measured for all living organisms, including eukaryotic ones, and can therefore contribute towards a uniform taxonomy for all domains of life and provide higher resolution in cases where current methodology proves inadequate.

DNA microarray technology is currently envisioned as a promising alternative to whole-genome sequencing for genetic (DNA-DNA) comparisons between strains (1, 3); however, several issues regarding the applicability of microarrays for these purposes remain uninvestigated. Using the available genomic sequences (control results) and the existing knowledge of microarray hybridization kinetics (in-silico predicted results), the performance of different microarray platforms for genetic comparisons was first modeled and subsequently evaluated. The number of false negatives, i.e., cases of no hybridization signal when the amino acid sequence is conserved but the nucleotide sequence has diverged to a level that does not allow hybridization, was found to be unacceptably high (>10%) between distantly related strains (e.g. <97% 16S rRNA gene identity), but sufficiently low (<5%) between strains of the same or highly related species (e.g. >98% 16S rRNA gene identity) to not be problematic. Further, oligo-arrays, i.e., one 50mer probe per gene, should give comparable results to whole Open Reading Frame (ORF) arrays as long as the evaluated strains reside in the same or highly related species, whereas whole-ORF arrays should perform better with more distantly related strains. Last, a sizeable number of genes (up to 30-35% of the total) in all genomes evaluated appeared to suffer from non-specific hybridization signal from paralogous genes or conserved domains.
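The false-negative definition given in this paragraph can be illustrated with a short sketch. It assumes a hard identity threshold per platform, using the identity requirements quoted in the microarray chapter (>75% for oligo-probes vs. 60% for cDNA probes); real hybridization is not a step function, and the function name and input values below are hypothetical.

```python
# Minimum nucleotide identity (%) assumed to allow cross-hybridization.
# Treating these as hard thresholds is a simplification for illustration.
OLIGO_MIN_IDENTITY = 75.0
CDNA_MIN_IDENTITY = 60.0


def false_negative_rate(probe_identities, min_identity):
    """Percent of protein-conserved ORFs predicted to give no signal.

    probe_identities: nucleotide identity (%) of the probe region for each
    ORF that is conserved at the amino acid level in the tester strain.
    """
    identities = list(probe_identities)
    if not identities:
        raise ValueError("no conserved ORFs supplied")
    missed = sum(1 for identity in identities if identity < min_identity)
    return 100.0 * missed / len(identities)


# Hypothetical probe-region identities for five conserved ORFs.
identities = [95.0, 80.0, 70.0, 62.0, 55.0]
oligo_fn = false_negative_rate(identities, OLIGO_MIN_IDENTITY)
cdna_fn = false_negative_rate(identities, CDNA_MIN_IDENTITY)
```

With these made-up inputs the oligo threshold misses three of the five conserved ORFs and the cDNA threshold misses one, which is the qualitative pattern described for distantly related strains.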
Non-specific hybridization may lead to significant false positive as well as false negative signal, independent of the microarray platform used.

This theoretical analysis assumes that the experimental procedure is ideal, i.e., that there are no complications or errors introduced during the execution of the experiments. This is known not to be true, however, since several technical issues, such as probe design and chemistry, labeling method, and complications during hybridization, have not yet been fully resolved and thus add complexity to the microarray results. These issues have been extensively reviewed previously (7, 10) and are subjects of ongoing research and continuing improvement. Until the technical aspects of the microarray technology are fully worked out, the results presented here should be a relatively good approximation of reality and can provide guidance for some experimental tests, which would not only test the validity of the predictions described previously but also enhance predictive ability, especially with moderately and distantly related species. Experimental testing of DNA-DNA hybridization (competitive genome hybridization, CGH) among strains by arrays is timely and needed to efficiently advance our understanding of the patterns and order in the highly divergent prokaryotic world.

REFERENCES

1. Behr, M. A., M. A. Wilson, W. P. Gill, H. Salamon, G. K. Schoolnik, S. Rane, and P. M. Small. 1999. Comparative genomics of BCG vaccines by whole-genome DNA microarray. Science 284:1520-3.
2. Brenner, D., J. Staley, and N. Krieg. 2000. Bergey's manual of systematic bacteriology, 2nd ed, vol. 1. Springer-Verlag, New York.
3. Cho, J. C., and J. M. Tiedje. 2001. Bacterial species determination from DNA-DNA hybridization by using genome fragments and DNA microarrays. Appl Environ Microbiol 67:3677-82.
4. Hattori, T., H. Mitsui, H. Haga, N. Wakao, S. Shikano, K. Gorlach, Y. Kasahara, A. el-Beltagy, and R. Hattori. 1997.
Advances in soil microbial ecology and the biodiversity. Antonie Van Leeuwenhoek 72:21-8.
5. Klappenbach, J. A., J. M. Dunbar, and T. M. Schmidt. 2000. rRNA operon copy number reflects ecological strategies of bacteria. Appl Environ Microbiol 66:1328-33.
6. Ludwig, W., and H.-P. Klenk. 2000. Bergey's manual of systematic bacteriology, 2nd ed, vol. 1. Springer-Verlag, New York.
7. Relogio, A., C. Schwager, A. Richter, W. Ansorge, and J. Valcarcel. 2002. Optimization of oligonucleotide-based DNA microarrays. Nucleic Acids Res 30:e51.
8. Stackebrandt, E., W. Frederiksen, G. M. Garrity, P. A. D. Grimont, P. Kampfer, M. C. J. Maiden, X. Nesme, R. Rossello-Mora, J. Swings, H. G. Truper, L. Vauterin, A. C. Ward, and W. B. Whitman. 2002. Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52:1043-1047.
9. Wayne, L. G., D. J. Brenner, R. R. Colwell, P. A. D. Grimont, O. Kandler, M. I. Krichevsky, L. H. Moore, W. E. C. Moore, R. G. E. Murray, E. Stackebrandt, M. P. Starr, and H. G. Truper. 1987. Report of the Ad Hoc Committee on reconciliation of approaches to Bacterial Systematics. Int J Syst Bacteriol 37:463-464.
10. Yang, Y. H., and T. Speed. 2002. Design issues for cDNA microarray experiments. Nat Rev Genet 3:579-88.

APPENDIX

TABLES OF GENOMES USED IN THIS STUDY AND THEIR GENOMIC INFORMATION

Table 2.1 Prokaryotic species and their genomic information used in this study. The genomic sequences for the 115 microbial genomes used in this study were obtained from NCBI. The percent of CDS from each genome homologous to the COG database is shown in the 4th column. Species that are within one standard deviation from the average (average 70.3%, standard deviation 11.2) are designated solid whereas the rest are designated open (5th column). The solid and open squares are used in the article figures to represent the corresponding genomes.
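The solid/open designation described in the caption can be expressed as a simple rule. The sketch below assumes that the boundary is inclusive and that values are compared at the one-decimal precision reported in the table; the function and constant names are hypothetical.

```python
MEAN_PCT_IN_COGS = 70.3   # average reported in the caption
STDEV_PCT_IN_COGS = 11.2  # standard deviation reported in the caption


def designation(pct_in_cogs):
    """Return 'Solid' if the value lies within one standard deviation of the mean."""
    # Values are reported to one decimal place, so round the deviation to
    # avoid floating-point noise near the boundary.
    deviation = round(abs(pct_in_cogs - MEAN_PCT_IN_COGS), 1)
    return "Solid" if deviation <= STDEV_PCT_IN_COGS else "Open"


# Values taken from Table 2.1 (percent of ORFs in COGs).
agrobacterium = designation(80.7)  # Agrobacterium_tumefaciens
brucella = designation(82.1)       # Brucella_melitensis
bacteroides = designation(33.5)    # Bacteroides_thetaiotaomicron
```

For these three rows the rule reproduces the designations listed in the table (solid, open and open, respectively).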
Table 2.1

Species                                 Gen. size (Mb)  Total # ORFs  % ORFs in COGs  Solid/Open
Aeropyrum_pernix                        1.67            1840          66.5            Solid
Agrobacterium_tumefaciens               5.6             5299          80.7            Solid
Agrobacterium_tumefaciens_UWash         5.6             5402          78.9            Solid
Aquifex_aeolicus                        1.6             1560          84.2            Open
Archaeoglobus_fulgidus                  2.18            2420          78.3            Solid
Bacillus_anthracis                      5.1             5311          57.1            Open
Bacillus_cereus                         5.41            5255          59.1            Solid
Bacillus_halodurans                     4.2             4066          75.6            Solid
Bacillus_subtilis                       4.2             4112          73.8            Solid
Bacteroides_thetaiotaomicron            6.26            4778          33.5            Open
Bifidobacterium_longum                  2.26            1729          56.9            Open
Borrelia_burgdorferi                    0.9             1638          42.8            Open
Bradyrhizobium_japonicum                9.11            8317          60.4            Solid
Brucella_melitensis                     3.3             3198          82.1            Open
Brucella_suis                           3.28            3264          73.1            Solid
Buchnera_aphidicola                     0.62            504           95.4            Open
Buchnera_sp.                            0.71            574           96.5            Open
Campylobacter_jejuni                    1.64            1634          78.6            Solid
Caulobacter_crescentus                  4.01            3737          77.1            Solid
Chlamydia_muridarum                     1.07            916           67.6            Solid
Chlamydia_trachomatis                   1.05            895           69.6            Solid
Chlamydophila_caviae                    1.17            1005          63.5            Solid
Chlamydophila_pneumoniae_AR39           1.23            1112          57.8            Open
Chlamydophila_pneumoniae_CWL029         1.23            1054          61.4            Solid
Chlamydophila_pneumoniae_J138           1.22            1069          60.7            Solid
Chlorobium_tepidum                      2.16            2252          52.2            Open
Clostridium_acetobutylicum              4.1             3848          72.6            Solid
Clostridium_perfringens                 3.1             2723          64.2            Solid
Clostridium_tetani                      2.8             2373          67.4            Solid
Corynebacterium_efficiens               3.15            2950          64.3            Solid
Corynebacterium_glutamicum              3.3             3040          69.5            Solid
Coxiella_burnetii                       2               2009          51.6            Open
Deinococcus_radiodurans                 3.28            3182          70.2            Solid
Enterococcus_faecalis                   3.35            3113          59.9            Solid
Escherichia_coli_K12                    4.6             4279          82.7            Open
Escherichia_coli_O157:H7                5.5             5361          73.2            Solid
Escherichia_coli_O157:H7_EDL933         5.6             5324          74.2            Solid
Escherichia_coli_CFT                    5.23            5379          69.0            Solid
Fusobacterium_nucleatum                 2.17            2067          73.8            Solid
Haemophilus_influenzae                  1.83            1714          91.8            Open
Halobacterium_sp._NRC-1                 2.57            2622          67.3            Solid
Helicobacter_pylori_26695               1.66            1576          69.9            Solid
Helicobacter_pylori_J99                 1.64            1491          71.9            Solid
Lactobacillus_plantarum                 3.31            3009          63.5            Solid
Lactococcus_lactis                      2.36            2267          77.6            Solid
Leptospira_interrogans                  4.69            4727          27.7            Open
Listeria_innocua                        3.01            3043          77.6            Solid
Listeria_monocytogenes                  2.94            2846          79.7            Solid
Mesorhizobium_loti                      7.59            7275          75.7            Solid
Methanococcus_jannaschii                1.74            1785          79.2            Solid
Methanopyrus_kandleri_AV19              1.7             1687          72.6            Solid
Methanosarcina_acetivorans              5.75            4540          67.3            Solid
Methanosarcina_mazei                    4.1             3371          68.4            Solid
Methanothermobacter_thermautotrophicus  1.75            1873          77.7            Solid
Mycobacterium_leprae                    3.26            1605          70.5            Solid
Mycobacterium_tuberculosis_CDC1551      4.4             4187          63.7            Solid
Mycobacterium_tuberculosis_H37Rv        4.4             3927          70.1            Solid
Mycoplasma_genitalium                   0.58            484           77.9            Solid
Mycoplasma_penetrans                    1.36            1037          44.7            Open
Mycoplasma_pneumoniae                   0.81            689           60.5            Solid
Mycoplasma_pulmonis                     0.96            782           63.8            Solid
Neisseria_meningitidis_MC58             2.27            2079          73.7            Solid
Neisseria_meningitidis_Z2491            2.18            2065          75.0            Solid
Nitrosomonas_europaea                   2.81            2461          66.8            Solid
Nostoc_sp.                              7.2             6129          58.2            Open
Oceanobacillus_iheyensis                3.63            3496          69.5            Solid
Pasteurella_multocida                   2.4             2015          89.2            Open
Pseudomonas_aeruginosa                  6.3             5567          80.8            Solid
Pseudomonas_syringae                    6.4             5471          71.5            Solid
Pseudomonas_putida                      6.18            5350          68.7            Solid
Pyrobaculum_aerophilum                  2.2             2605          58.1            Open
Pyrococcus_abyssi                       1.76            1769          83.7            Open
Pyrococcus_furiosus                     1.9             2065          75.7            Solid
Pyrococcus_horikoshii                   1.8             1801          77.6            Solid
Ralstonia_solanacearum                  5.8             5116          75.0            Solid
Rickettsia_conorii                      1.27            1374          65.8            Solid
Rickettsia_prowazekii                   1.1             835           84.9            Open
S.enterica_ser._Typhi                   4.8             4767          71.8            Solid
Salmonella_typhimurium_LT2              4.95            4553          79.7            Solid
Shewanella_oneidensis                   5.03            4472          62.1            Solid
Shigella_flexneri                       4.61            4180          83.2            Open
Sinorhizobium_meliloti                  6.7             6205          81.7            Open
Staphylococcus_aureus_Mu50              2.9             2748          73.6            Solid
Staphylococcus_aureus_MW2               2.8             2632          73.7            Solid
Staphylococcus_aureus_N315              2.81            2625          77.1            Solid
Staphylococcus_epidermidis              2.5             2419          74.5            Solid
Streptococcus_agalactiae                2.2             2124          69.7            Solid
Streptococcus_mutans                    2.04            1960          71.9            Solid
Streptococcus_pneumoniae_R6             2.03            2043          78.8            Solid
Streptococcus_pneumoniae_TIGR4          2.2             2094          74.0            Solid
Streptococcus_pyogenes                  1.85            1697          77.7            Solid
Streptococcus_pyogenes_MGAS8232         1.9             1845          72.2            Solid
Streptomyces_avermitilis                9.03            7575          48.8            Open
Streptomyces_coelicolor                 8.67            7512          48.3            Open
Sulfolobus_solfataricus                 2.99            2977          73.4            Solid
Sulfolobus_tokodaii                     2.7             2826          60.4            Solid
Synechocystis_sp._PCC_6803              3.57            3167          70.2            Solid
Thermoanaerobacter_tengcongensis        2.7             2588          64.8            Solid
Thermoplasma_acidophilum                1.56            1482          83.5            Open
Thermoplasma_volcanium                  1.58            1499          83.7            Open
Thermosynechococcus_elongatus           2.6             2475          65.0            Solid
Thermotoga_maritima                     1.85            1858          82.0            Open
Treponema_pallidum                      1.14            1036          69.4            Solid
Tropheryma_whipplei                     0.93            783           63.5            Solid
Ureaplasma_urealyticum                  0.75            614           66.6            Solid
Vibrio_cholerae                         4               3835          73.0            Solid
Vibrio_parahaemolyticus                 5.18            4537          68.8            Solid
Vibrio_vulnificus                       5.13            4832          64.2            Solid
Wigglesworthia_brevipalpis              0.7             654           89.3            Open
Xanthomonas_campestris                  5.08            4181          65.9            Solid
Xanthomonas_pv._citri                   5.17            4312          63.9            Solid
Xylella_fastidiosa                      2.68            2832          60.0            Solid
Xylella_fastidiosa_pv._temecula         2.52            2036          74.5            Solid
Yersinia_pestis                         4.65            4083          79.8            Solid
Yersinia_pestis_pv._kim                 4.6             [illegible]   77.3            Solid
AVERAGE                                 3.24            2987          70.3 (STDEV: 11.2)  87 Solid

Table 3.1 Genomic information of 64 genomes used in this study and their relatedness to the reference genomes. The 64 strains used in this study (2nd column), their genome size (3rd column) and total CDSs in the genome (4th column) are shown. Strains in bold were used as reference genomes during the pair-wise comparisons between strains of the same bacterial group (1st column). The groups used are (from top to bottom): Enterics, Pseudomonas, Neisseria, Bordetella, Bacilli, Mycobacteria, Streptococcus, Staphylococcus, Others. NA = Not available, because the genome annotation has not been published.
*Average nucleotide identity of the conserved CDSs between the corresponding strain and the reference strain, which is denoted by the superscript number.
+Percent of reference strain CDSs that are conserved in the corresponding strain.

Table 3.1.

Group / Strain    Gen. size    Total CDSs    Average nt identity*    Percent conser. genes+

Enterics
1. E. coli O157:H7 Sakai    5.50    5361
2. E. coli O157:H7 EDL933    5.60    5324    99.7‘,97.43,97.35    935290233935
3. E. coli K12    4.60    4279    9721,9795    729233.15
4. E. coli CFT073    5.23    5379    95.9l,96.43,96.55    75.51,86.83,88.85
5. S. flexneri 2a 2457    4.60    4068    9651,9753    59.523253
6. S. flexneri 2a 301    4.61    4180    954137533935    69513233395
7. S. typhimurium LT2    4.95    4553    79,9‘,8o,75    5971,6865
8. S. enterica ser. Typhi Ty2    4.79    4323    i 30.2],    571, “CS
9. S. enterica ser. Typhi    4.80    4767    3021,9998    57,71,991;8
10. S. bongori 12419    4.46    NA    3958    73.33
11. Y. pestis KIM    4.60    4090    7151,7155    371,215.15
12. Y. pestis CO92    4.65    4083    7151,9951“    37,21,997“
13. Y. pestis Mediaevalis    4.6    4142    99,33“    93,55“
14. Y. enterocolitica    4.68    NA    82.1 1    69,31
15. E. carotovora    5.06    NA    72,11    33,41

Pseudomonas
1. P. putida KT2440    6.18    5350
2. P. aeruginosa PAO1    6.30    5567    75.1    56.2
3. P. syringae DC3000    6.40    5471    75.4    50.7

Neisseria
1. N. meningitidis MC58    2.27    2079    9573    91,2?
2. N. meningitidis FAM    2.20    NA    971,9113    9031,9023
3. N. meningitidis Z2491    2.18    2065    96.91    90.21
4. N. gonorrhoeae FA1090    2.15    NA    9431,9433    8141,8173

Bordetella
1. B. pertussis Tohama I    4.09    3447    95,52    72,12
2. B. bronchiseptica RB50    5.35    4994    93.41    _ 911
3. B. parapertussis    4.77    4185    9821,9832    87.9‘,87.22

Bacilli
1. B. cereus ATCC 14579    5.41    5255
2. B. cereus 10987    5.22    NA    91    83
3. B. anthracis A2012    5.10    5311    91.2    85.3

Mycobacteria
1. M. tuberculosis CDC1551    4.40    4187
2. M. tuberculosis H37Rv    4.40    3927    99.7    99.5
3. M. bovis    4.35    3920    99.4    98.3
4. M. avium    5.48    NA    79.1    62.6

Streptococcus
1. S. pyogenes SSI-1    1.89    1861    9792    92,72
2. S. pyogenes MGAS8232    1.90    1845    97,31    9251
3. S. pyogenes MGAS315    1.90    1865    99,91,9792    1001,51172
4. S. pyogenes M1 GAS    1.85    1697    98‘,97.92,71.37    86.8‘,89.72,38.57
5. S. agalactiae 2603    2.20    2124    7451,7422    55,2‘,56,52
6. S. agalactiae NEM    2.21    2094    9355    3755
7. S. pneumoniae R6    2.03    2043
8. S. pneumoniae TIGR4    2.20    2094    714271323347    41.3‘,42.12,95.97
9. S. mutans    2.03    1920    725    45,35

Staphylococcus
1. S. aureus Mu50    2.90    2748    9323    9353
2. S. aureus N315    2.81    2625    9981,9833    94.8‘,93.33
3. S. aureus MW2    2.80    2632    93,21    90,41
4. S. aureus MRSA252    2.80    NA    933299.73    39,71,9343
5. S. aureus MSSA476    2.90    NA    9711,973    9151,9393
6. S. epidermidis ATCC12228    2.50    2419
7. S. epidermidis RP62A    2.65    NA    759275333896    66.4‘,66.83,94.66

Others
1. X. campestris ATCC33913    5.08    4181
2. X. axonopodis pv. citri 301    5.17    4312    34,61    32,91
3. X. fastidiosa Temecula    2.52    2036
4. X. fastidiosa 9a5c    2.68    2832    95,73    95,33
5. Vibrio vulnificus CMCP6    5.13    4514
6. Vibrio vulnificus YJ016    5.21    5024    97,915    3955
7. H. pylori J99    1.64    1491
8. H. pylori 26695    1.66    1576    957    93.37
9. B. melitensis 16M    3.30    3198
10. B. suis 1330    3.28    3264    99,19    93,49
11. R. conorii    1.27    1374
12. R. prowazekii    1.10    835    87,71 1    59,4“
13. C. perfringens 13    3.1    2723
14. C. perfringens ATCC13124    3.26    NA    93,1‘3    90.513
15. T. whipplei Twist    0.93    808
16. T. whipplei TW08/27    0.93    783    99.2‘5    99.315

Table 4.1. Taxonomic information of the 175 genomes used in this study. The taxonomic information for each genome was extracted from the Hierarchy browser of the RDP database, release 9 (http://rdp.cme.msu.edu/index.jsp), which implements the newer version of Bergey’s taxonomy. The genome size (in Mb) and the total number of CDSs in the genome are also shown in the last two columns. Abbreviations of 4th column: D. = Domain, B = Bacteria, A = Archaea.

[Table 4.1 body not legible in the scanned source.]