DEEP SEQUENCING DRIVEN PROTEIN ENGINEERING: NEW METHODS AND APPLICATIONS IN STUDYING THE CONSTRAINTS OF FUNCTIONAL ENZYME EVOLUTION By Emily Elizabeth Wrenbeck A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Chemical Engineering – Doctor of Philosophy 2017   ABSTRACT DEEP SEQUENCING DRIVEN PROTEIN ENGINEERING: NEW METHODS AND APPLICATIONS IN STUDYING THE CONSTRAINTS OF FUNCTIONAL ENZYME EVOLUTION By Emily Elizabeth Wrenbeck Chemical engineers have long sought enzymes as alternatives to traditional chemocatalytic routes as they are highly selective and have evolved to function under mild conditions (physiological temperature, neutral pH, and atmospheric pressure). Enzymes, the workhorses of biological chemistry, represent a vast catalogue of chemical transformations. This feature lends their use in a variety of industrial applications including food processing, biofuels, engineered biosynthetic pathways, and as biocatalysts for preparing specialty chemicals (e.g. pharmaceutical building blocks). The totality of an enzymatic bioprocess is a function of its catalytic efficiency (specificity and turnover), product profile (i.e. regio- and enantio-selectivity), and thermodynamic and kinetic stability. For native enzymes, these parameters are seldom optimal. Importantly, they can be modified using protein engineering techniques, which generally involves introducing mutation(s) to a protein sequence and screening for beneficial effects. However, robust enzyme engineering and design based on first principles is extremely challenging, as mutations that improve one parameter often yield undesired tradeoffs with one or more other parameters. In this thesis, deep mutational scanning - the testing of all possible single-amino acid substitutions of a protein sequence using high-throughput screens/selections and DNA counting via deep sequencing - was used to address two fundamental constraints on functional enzyme evolution. First, how do enzymes encode substrate specificity? To address this question, deep mutational scanning of an amidase on multiple substrates was performed using growth-based selections. Comparison of the resulting datasets revealed that mutations benefiting function on a     given substrate were globally distributed in both protein sequence and structure. Additionally, our massive datasets permitted the most rigorous testing to date of theoretical models of adaptive molecular evolution. These results have implications for both design of biocatalysts and in understanding how natural enzymes function and evolve. Another fundamental constraint of enzyme engineering is that mutations improving stability (folding probability) of an enzyme are often inactivating for catalytic function, and vice versa. Towards overcoming this activity-stability constraint, I sought to improve the heterologous expression and maintain the catalytic function of a Type III polyketide synthase from Atropa belladonna. This was accomplished using deep mutational scanning and high-throughput GFPfusion stability screening, followed by novel filtering methods to only accept beneficial mutations with high probability for maintaining function. Lastly, deep mutational scanning relies on the construction of user-defined DNA libraries, however current available techniques are limited by accessibility or poor coverage. To address these limitations, I will present the development of Nicking Mutagenesis, a new method for the construction of comprehensive single-site saturation mutagenesis libraries that requires only double-stranded plasmid DNA as input substrate. This method has been validated on several gene targets and plasmids and is currently being used in academic, government, and industry laboratories worldwide.     I dedicate this thesis to the family and friends who have inspired, encouraged, and delighted in my pursuit of understanding our world through science. iv     ACKNOWLEDGMENTS I want to acknowledge the significant impact my graduate advisor, Tim Whitehead, has had on my life and scientific career. Thank you for training me how to think quantitatively, design clever experiments, and for initiating my journey in a lifelong obsession with proteins. Thank you for your assistance in acquiring various fellowships and national conference experiences. I express my deepest appreciation for your patience, support, and for challenging me to do my best. To my labmates – Caitlin Stein, Justin Klesmith, James Stapleton, Matthew Faber, and Carolyn Haarmeyer – thank you for your mentorship, support, and for sharing laughs. I want to thank the Plant Biotechnology for Health and Sustainability graduate training program for providing financial support and an enriching graduate experience. Lastly, I want to thank my parents for striving to provide me with opportunities to nourish my brain and for their never-ending love and support. My siblings, for shaping who I am. My friends, for thinking of me as cool for being a scientist. And finally, my husband Phil, for always believing me and for your patience, love, and support throughout my graduate studies. v     TABLE OF CONTENTS LIST OF TABLES ....................................................................................................................... ix LIST OF FIGURES ..................................................................................................................... xi KEY TO ABBREVIATIONS ................................................................................................... xiii CHAPTER ONE Introduction to deep sequencing driven protein engineering ......................1 ABSTRACT.........................................................................................................................2 INTRODUCTION ...............................................................................................................3 ENGINEERING PROTEIN MOLECULAR RECOGNITION ..........................................4 Deep sequencing for screening protein binder libraries ..........................................5 Paratope optimization for affinity and specificity ...................................................5 Epitope mapping ......................................................................................................8 MEMBRANE PROTEIN ENGINEERING ........................................................................8 ENZYME ENGINEERING .................................................................................................9 High-throughput screening and selection for enzyme function ...............................9 From fitness landscapes to enzyme engineering ....................................................10 METHODOLOGICAL ADVANCES AND CURRENT LIMITATIONS .......................12 Mutagenic library preparation................................................................................12 DNA read length restrictions .................................................................................13 Sequencing analysis ...............................................................................................14 CONCLUSION ..................................................................................................................15 REFERENCES ..................................................................................................................17 CHAPTER TWO Nicking Mutagenesis: a plasmid-based, one-pot saturation mutagenesis method...........................................................................................................................................23 ABSTRACT.......................................................................................................................24 INTRODUCTION .............................................................................................................25 RESULTS ..........................................................................................................................26 DISCUSSION ....................................................................................................................30 MATERIALS AND METHODS.......................................................................................31 Reagents .................................................................................................................31 Plasmid construction ..............................................................................................31 Comprehensive nicking mutagenesis optimization ...............................................32 Comprehensive nicking mutagenesis of amiE and bla ..........................................33 Single and multi-site nicking mutagenesis ............................................................35 DNA deep sequencing and analysis .......................................................................36 Statistics .................................................................................................................36 APPENDIX ........................................................................................................................38 REFERENCES ..................................................................................................................51 vi     CHAPTER THREE Exploring the sequence-determinants to specificity of an enzyme using deep mutational scanning ............................................................................................................54 ABSTRACT.......................................................................................................................55 INTRODUCTION .............................................................................................................56 RESULTS ..........................................................................................................................58 Local Fitness Landscapes of amiE on Multiple Substrates ...................................58 The Distribution of Fitness Effects (DFE) of amiE ...............................................62 Beneficial Mutations result from Protein, not mRNA, effects ..............................65 Comparison of DFE Between Selections ...............................................................66 Biophysical Characterization of Beneficial Mutations ..........................................68 DISCUSSION ....................................................................................................................72 MATERIALS AND METHODS.......................................................................................74 Reagents .................................................................................................................74 Plasmid construction ..............................................................................................74 Construction of mutagenesis libraries ....................................................................74 Growth selections...................................................................................................75 Sequencing .............................................................................................................75 Beneficial mutations and lower bounds for fitness metrics ...................................76 Distribution fitting of beneficial DFE ....................................................................77 Protein characterization .........................................................................................78 Isogenic growth and lysate flux assays ..................................................................79 Data Availability ....................................................................................................80 APPENDIX ........................................................................................................................81 REFERENCES ................................................................................................................103 CHAPTER FOUR Computational design of destabilized proteins for assessing theories on adaptive molecular evolution ....................................................................................................109 ABSTRACT.....................................................................................................................110 INTRODUCTION ...........................................................................................................111 RESULTS ........................................................................................................................114 DISCUSSION AND OUTLOOK ....................................................................................116 MATERIALS AND METHODS.....................................................................................118 Reagents ...............................................................................................................118 Plasmid construction ............................................................................................118 Protein purification and characterization .............................................................118 Computational point-mutant scan ........................................................................118 APPENDIX ......................................................................................................................120 REFERENCES ................................................................................................................127 CHAPTER FIVE Improving the expression of a polyketide synthase in a biosynthetic pathway using deep mutational scanning and GFP-fusion screening ..................................131 ABSTRACT.....................................................................................................................132 INTRODUCTION ...........................................................................................................133 RESULTS ........................................................................................................................136 DISCUSSION AND FUTURE WORK ..........................................................................143 MATERIALS AND METHODS.....................................................................................145 vii     Reagents ...............................................................................................................145 Plasmid construction ............................................................................................145 PKS comprehensive point-mutant library construction .......................................146 FACS screening of GFP-fusion libraries .............................................................147 DNA deep sequencing and analysis .....................................................................148 PKS PSSM generation .........................................................................................148 Combinatorial library generation and screening methods ...................................148 Characterization of combinatorial hits and point-mutations ................................150 APPENDIX ......................................................................................................................153 REFERENCES ................................................................................................................164 CHAPTER SIX Summary and future work ...........................................................................169 SUMMARY AND OUTLOOK .......................................................................................170 REFERENCES ................................................................................................................172 viii     LIST OF TABLES Table 1.1: NGS-assisted studies of large enzyme libraries............................................................10 Table 2.1: Nicking mutagenesis library coverage statistics ...........................................................28 Table A 2.1: Performance metrics of published comprehensive mutagenesis methods ................45 Table A 2.2: Estimated time required for comprehensive library construction using nicking mutagenesis ....................................................................................................................................46 Table A 2.3: Primer sequences ......................................................................................................47 Table A 2.4:  Cost analysis of nicking mutagenesis compared with PFunkel ................................48 Table 3.1: Model fitting results for distribution of beneficial mutations .......................................65 Table 3.2: Wild-type and variant amiE biophysical data ...............................................................71 Table B 3.1: Constructs used in growth selections ........................................................................97 Table B 3.2: Library coverage statistics for combined amiE libraries (replicate 1 and 2) used in the acetamide, propionamide, and isobutyramide selections .........................................................98 Table B 3.3: Isogenic growth and lysate flux data.........................................................................99 Table B 3.4: mRNA effects on fitness .........................................................................................100 Table B 3.5: Gene amplification primers for preparing samples for deep sequencing................101 Table C 4.1: Characterization of destabilized amiE variants .......................................................121 Table C 4.2: Primers for generating destabilized amiE variants .................................................122 Table 5.1: Correlation of stability metrics for beneficial mutations from replicate GFP-fusion experiments based on depth of sequencing coverage ..................................................................140 Table D 5.1: Literature examples of engineered biosynthetic pathways in microbes with limiting enzymes........................................................................................................................................157 Table D 5.2: Library statistics for PKS comprehensive single-mutation libraries ......................158 Table D 5.3: Filtered beneficial mutations from the GFP-fusion experiment .............................159 ix     Table D 5.4: Mutations included in the combinatorial PKS library ............................................160 Table D 5.5: Primer sequences used in this work ........................................................................161 Table D 5.6: Multiple reaction monitoring parameters utilized for LC-MS/MS analyses of AbPKS products ..........................................................................................................................162 Table D 5.7: UPLC Mobile Phase Gradients Utilized for LC-MS/MS analyses of PKS products using a Waters Acquity TQ-D mass spectrometer .......................................................................163 x     LIST OF FIGURES Figure 1.1: Overview of the steps involved in deep mutational scanning .......................................4 Figure 1.2: Engineering of affinity and specificity in protein-ligand interactions using deep mutational scanning .........................................................................................................................6 Figure 1.3: Strategies to overcome read length limitations of NGS ..............................................14 Figure 2.1: Comprehensive single-site Nicking Mutagenesis .......................................................27 Figure A 2.1: Gel snapshots along the optimized nicking mutagenesis method ...........................39 Figure A 2.2: Probability distribution of mutation counts in amiE comprehensive nicking mutagenesis libraries ......................................................................................................................40 Figure A 2.3: Comparison of the probability distributions of site-saturation mutagenesis libraries resulting from nicking mutagenesis or PFunkel mutagenesis........................................................41 Figure A 2.4: Off-target mutational analysis of amiE input plasmid and mutational libraries by shotgun sequencing ........................................................................................................................42 Figure A 2.5: bla library coverage distributions ............................................................................43 Figure A 2.6: Schematic overview of single- or multi-site nicking mutagenesis ..........................44 Figure 3.1: Experimental overview................................................................................................58 Figure 3.2: Establishing growth-based selection conditions..........................................................59 Figure 3.3: Validation of deep sequencing results .........................................................................61 Figure 3.4: Distribution of fitness effects (DFE) are exponentially distributed for beneficial mutations ........................................................................................................................................63 Figure 3.5: Correlative analysis of fitness effects ..........................................................................67 Figure 3.6: Substrate specificity is globally encoded ....................................................................69 Figure B 3.1: Frequency distribution of library member counts ...................................................82 Figure B 3.2: Fitness versus pre-selection read counts ..................................................................83 Figure B 3.3: Fitness landscape for acetamide selection ...............................................................84 Figure B 3.4: Fitness landscape for propionamide selection .........................................................87 xi     Figure B 3.5: Fitness landscape for isobutyramide selection  .......................................................  90   Figure B 3.6: Fitness metrics from biological replicate growth selection experiments.................93 Figure B 3.7: Variance of fitness metrics for synonymous codons of beneficial mutations (z>0.15)..........................................................................................................................................94 Figure B 3.8: Principle component analysis of renormalized fitness values .................................95 Figure B 3.9: amiE activity assay ..................................................................................................96 Figure 4.1: Process flow for computational design of destabilized proteins ...............................114 Figure 4.2: Destabilized amiE variants ........................................................................................115 Figure 5.1: Overview of the Tropane Alkaloids (TA) pathway enzymes....................................136 Figure 5.2: Relative fluorescence of EGFP-tagged Ab genes in yeast ........................................137 Figure 5.3: Combinatorial PKS hits .............................................................................................142 Figure D 5.1: Display of PKS on the surface of yeast proves unsuccessful ................................154 Figure D 5.2: Overview of GFP-fusion deep mutational scanning experiment ..........................155 Figure D 5.3: Relative fluorescence intensity of combinatorial PKS hits ...................................156 xii     KEY TO ABBREVIATIONS Ab, Atropa belladonna ACT, acetamide CSM, comprehensive single-site saturation mutagenesis DFE, distribution of fitness effects dsDNA, double-stranded DNA dU-ssDNA, uracil-containing single-stranded DNA FACS, fluorescence assisted cell sorting GFP, green fluorescent protein GPD, generalized Pareto distribution HTS, high-throughput screen or selection IB, isobutyramide kRBS, knockdown ribosome binding sequence MPO, methyl-putrescine amine oxidase NGS, next-generation sequencing NS, nonsynonymous PKS, polyketide synthase PMT, putrescine methyltransferase PR, propionamide PSSM, position-specific scoring matrix RBS, ribosome binding sequence TA, tropane alkaloids TRI, tropinone reductase TS, tropinone synthase xiii   CHAPTER ONE Introduction to deep sequencing driven protein engineering Portions of this chapter were adapted with permission from “Deep sequencing methods for protein engineering and design” in Current Opinion in Structural Biology 45 (2017) 36-44 by Emily E. Wrenbeck, Matthew S. Faber, and Timothy A. Whitehead. 1 ABSTRACT The advent of next-generation sequencing (NGS) has revolutionized protein science, and the development of complementary methods enabling NGS-driven protein engineering have followed. In general, these experiments address the functional consequences of thousands of protein variants in a massively parallel manner using genotype-phenotype linked high-throughput functional screens followed by DNA counting via deep sequencing. We highlight the use of information rich datasets to engineer protein molecular recognition. Examples include the creation of multiple dual-affinity Fabs targeting structurally dissimilar epitopes and engineering of a broad germline-targeted anti-HIV-1 immunogen. Additionally, we highlight the generation of enzyme fitness landscapes for conducting fundamental studies of protein behavior and evolution. We conclude with discussion of technological advances. 2 INTRODUCTION Researchers have been engineering proteins for almost 4 decades. Early endeavors involved generation of a handful of point mutations followed by low-throughput assays for function; the ‘search space’ a protein scientist could feasibly explore was miniscule. As demonstrated by the seminal works of Fowler et al.1 and Hietpas et al.2, the advent of next-generation sequencing (NGS) has presented protein engineers with the ability to economically observe entire populations of molecules before, during, and after a high-throughput screen or selection for function (HTS) (Figure 1.1). A typical NGS run provides sufficient sequencing data to permit the study of millions of protein variants. Thus, when coupled to HTS, NGS significantly expands the accessible mutational search space. In this way, a researcher can test all possible point mutations or combinations of mutations, for example, and remove the duty of having to design small focused libraries that may miss unpredictable beneficial mutations. As a testimonial to the accessibility of these methodologies, experiments can be performed in a beginning graduate-level course3. The intent of this review is to highlight examples where deep sequencing has been applied in different areas of protein engineering and design. As such, we will not provide a comprehensive review of directed evolution or of deep mutational scanning (excellent reviews can be found here4,5). We will discuss the use of NGS for engineering protein molecular recognition, membrane proteins, and enzymes, highlight recent technological advances, and offer a perspective on the shape of the field over the next several years. 3 Figure 1.1: Overview of the steps involved in deep mutational scanning. A library of protein variants is generated. Often this is a comprehensive single-site saturation mutagenesis library. The library is subjected to a high-throughput selection or screen for function. Examples of commonly used selections and screens include survival or competitive growth-based selections, protein binding screens like phage or yeast surface display, and fluorescence reporter-based screens. Variants are quantified in the pre- and post-selection populations with counting via deep sequencing. These pre- and post-selection counts are transformed to a normalized functional score and are used to generate fitness landscapes of the target protein. ENGINEERING PROTEIN MOLECULAR RECOGNITION Dozens of studies over the past five years have used deep sequencing to identify and engineer protein-ligand interactions. Rapid adoption of deep sequencing by this field is a direct result of mature display-based technologies that can be used to screen very large initial libraries. For example, in the study of protein-protein binding interactions a library of protein variants can 4 be displayed on the surface of yeast using yeast surface display (Figure 1.1). Using a fluorescently conjugated protein binding partner FACS can be used (a thorough review can be found here 6). Deep sequencing for screening protein binder libraries NGS is now frequently used in the evaluation of synthetic or natural libraries to identify antigen-specific binders. Advances in pairing VH and VL sequences from individual B cells7 allows one to identify antigen-specific antibodies directly from sequencing, including panels of antibodies targeting Ebola virus8 and ricin9. Methodological details and limitations associated with identification of rare clones and evaluation of library diversity are presented in a recent review10. As an emerging area, engineers now use NGS to refine protein binder libraries11,12. In a notable advance, Woldring et al. screened a hydrophilic fibronectin domain library to bind various protein targets12. The researchers exploited the site-specific amino acid preferences from an initial library to develop a more focused second library depleted in mutations at the periphery of the binder paratope. Compared to other libraries, this library design afforded far superior performance in isolation of high affinity, stable binders. Paratope optimization for affinity and specificity NGS can be used to rapidly improve the affinity and specificity of the binding paratope (Figure 1.2)13,14. A crucial advantage enabled by NGS is the ability to discriminate very small beneficial changes in binding - on the order of 0.1 kcal/mol or about a 20% improvement in dissociation constant. These small-scale beneficial mutations can be additive, allowing one to “leapfrog” over potential affinity maturation bottlenecks by combining mutations. 5 Figure 1.2: Engineering of affinity and specificity in protein-ligand interactions using deep mutational scanning. A.) Consider a protein binder that recognizes two separate targets A and B. Deep mutational scanning is performed against each target in parallel. Site-specific preferences for the protein against each target are visualized by a heatmap. Mutations can be combined to impart binders with greater affinity to both targets (top panel, red box) or restrict specificity to a single target (bottom panel, blue box). In practice, mutations at multiple positions are combined to make a focused library that is subsequently screened. B.) The structural basis for specificity- and affinity- altering mutations identified by deep mutational scanning using a dual action Fab (green cartoon) to Ang2 (purple surface) and VEGF (orange surface) as an example15. Heavy Chain (HC) L93K can increase affinity to both targets presumably by increasing electrostatic complementarity. Here Ang2 and VEGF are colored by electrostatic surface potential and HC-L93 (green) and HCK93 (pink) are shown as sticks. By contrast, HC F98I is strongly depleted for in the VEGF binding population most likely because of steric clashes. Structures were created using PyMol from the PDB IDs 4ZFG, 4ZFF. 6 Whitehead et al. provide the first example of paratope engineering for affinity and specificity using deep sequencing16. The researchers screened a comprehensive single-site saturation mutagenesis library of two de novo designed Influenza Hemagglutinin (HA) binders against H1 and H5 HA subtypes. Engineering specificity was demonstrated by comparing sitespecific preferences for H1 to the H5 subtype. A single point mutation was identified that gave over a 30-fold specificity switch from the parental designed protein. For affinity maturation, sitespecific preferences were encoded into a second library and sorted to improve affinity against both subtypes by approximately 25-fold. The affinity of one designed HA binder, HB36.6, was further improved against seven diverse HA subtypes. HB36.6 showed prophylactic and therapeutic efficacy against lethal challenge of pandemic Influenza in a BALB/c mouse model17. Deep mutational scanning approaches have been extended to affinity mature antibodies18,19. In an impressive demonstration, Genentech scientists engineered a dual action Fab for high affinity for two unrelated proteins simultaneously15. The group used phage display to profile a single and triple site saturation mutagenesis library of a Fab with low nanomolar binding to Ang2 and VEGF. NGS revealed significant site-specific amino acid preferences for each of the two binding paratopes. The researchers combined mutations shown to improve affinity on at least one target and not negatively impact binding on the other target, thus engineering five different sub-nanomolar dual-affinity Fabs. The apotheosis of deep mutational scanning to identify high affinity binders with defined specificity comes from Jardine et al.20, who engineered an HIV immunogen that can be recognized by B cell precursors to broadly neutralizing anti-HIV antibodies. Starting with a designed outer domain of the gp120 protein from HIV, they screened a 58-residue site saturation mutagenesis library against 18 germline-reverted and 11 VRC01-class broadly neutralizing antibodies. 7 Information obtained from the scan was used to encode a second library that was screened against the same antibody panel. One variant showed dramatically improved binding to all antibodies in the panel and could bind naïve B cells in full human repertoires. Binding surface optimization is not limited to protein-protein binders, provided that there is a suitable HTS. Tinberg et al. used yeast display coupled to NGS to affinity mature a computationally designed anti-steroid binder21. Raman and colleagues used an in vivo fluorescent reporter coupled to FACS (Figure 1.1) to engineer the E. coli allosteric transcription factor LacI to recognize four different non-metabolizable inducers, including sucralose22. Epitope mapping An important consideration for the antibody engineer is the identification of the binding epitope. Three recent publications used yeast surface display, site-saturation mutagenesis, FACS, and deep sequencing to identify conformational epitopes for diverse antigenic targets on the order of weeks23–25. Doolan and Colby determined epitope regions on prions recognized by conformational-specific antibodies23. Van Blarcom et al. performed epitope mapping for a panel of antibodies against the alpha toxin from methicillin-resistant Staphylococcus aureus24. Kowalsky et al. automated and improved the speed of epitope identification for three different antigens25. MEMBRANE PROTEIN ENGINEERING Plückthun and colleagues screened a near-comprehensive single point mutant library of G protein-coupled receptor (GPCR) rat neurotensin receptor 1 for enhanced heterologous expression, a proxy for protein stability. The library was expressed in the periplasm of E. coli and sorted by FACS using a fluorescently conjugated agonist as a probe26. NGS was used to quantify variants in 8 the input library and the enriched FACS selected libraries, and hits identified in the initial library were combined, resulting in variants that express at up to 50-fold higher levels in E. coli compared with the wild-type GPCR. Each stability-enhancing mutation contributed a small amount of the overall stability to the protein27. Notably, the structure of an engineered GPCR was solved28, suggesting a general directed evolution strategy of stabilizing membrane proteins for X-ray crystallography structure determination. In a separate effort, Fleishman and colleagues used deep mutational scanning to unravel the energetics associated with membrane protein insertion and homodimerization revealing insights that may facilitate membrane protein design29. ENZYME ENGINEERING In contrast to protein-ligand interactions, the complex and diverse nature of enzyme function has made it challenging to develop robust, sensitive, and generalizable functional screens. As such, far fewer examples of deep sequencing-assisted enzyme engineering exist in the literature (Table 1.1). High-throughput screening and selection for enzyme function The primary strategy for functional selection of enzymes is to tether enzymatic function to the growth and/or survival (fitness) of a host organism. One type of competitive growth selection is to provide a substrate that the enzyme must catabolize as the sole source of an essential element for growth (carbon, nitrogen) (Figure 1.1). Thus, variants enabling higher flux through and enzyme permit faster growth rates and become enriched in the population. Klesmith et al. performed deep mutational scanning of levoglucosan kinase, where levoglucosan was fed as the carbon source30. Similarly, Wrenbeck et al. performed deep mutational scanning on amiE, an 9 aliphatic amidase from Pseudomonas aeruginosa, by feeding amides as the nitrogen source31. Antibiotic resistance genes also provide straightforward targets for competitive growth selections. Indeed, these represent 4/9 published enzyme scans (Table 1.1)32–34. In summary, high-throughput screens or selections that are generalizable are desired, yet the incredible diversity of enzyme function makes their development a critical challenge for the field. Table 1.1: NGS-assisted studies of large enzyme libraries. Gene Application Selection employed Reference TEM-1 β-lactamase β-lactam antibiotic resistance Growth competition Deng et al.32 TEM-1 β-lactamase β-lactam antibiotic resistance Growth competition Firnberg et al.33 TEM-1 β-lactamase β-lactam antibiotic resistance Growth competition Stiffler et al.34 APH(3')II kinase aminoglycoside antibiotic resistance Growth competition Melnikov et al.35 Homing endonucleases Genome engineering Survival Thyme et al.36 Biomass conversion Metabolic growth Klesmith et al.30 Multiple industrial Metabolic growth Wrenbeck et al.31 Biomass conversion Micro-fluidic Romero et al.37 E3 ubiquitin ligase Phage display Starita et al.38 Levoglucosan kinase amiE aliphatic amidase Bgl3 β-glucosidase Ube4b E3 ubiquitin ligase From fitness landscapes to enzyme engineering Deep mutational scanning experiments afford a richness of knowledge of ‘hits’. However, efficiently utilizing ambiguous ‘fitness values’ to inform enzyme design is still a significant 10 challenge. To avert this challenge, van der Meer et al. performed over 4000 assays to generate ‘mutability landscapes’ of a tautomerase enzyme for its expression, Michael-type activities on multiple substrates, and characterization of its enantioselectivity, and used this information to design a novel enantioselective Michaelase39. How does one intelligently combine hits to achieve a given design goal? One approach is to biophysically characterize beneficial mutations. For example, Klesmith et al. performed deep mutational scanning of levoglucosan kinase to identify mutations that improved fitness through improved flux of levoglucosan conversion. They characterized a set of beneficial mutations for activity and thermodynamic stability and used this information to generate designs, one of which had greater than 24-fold improvement in activity and 7°C increase in apparent melting temperature30. An alternative approach is to generate multiple fitness landscapes under different conditions (concentration and identity of substrate, temperature, etc.) and use differential analysis to generate designs. To that end Melnikov et al. performed deep mutational scanning of APH(3’)II, an enzyme responsible for aminoglycoside antibiotic resistance, with several antibiotics at different concentrations and generated designs with orthogonal activities35. Datasets from deep mutational scanning can be used to probe the fundamental nature of enzyme behavior and can be used to ask questions related to evolutionary trajectories, rigorously testing theories gleaned from over two decades of directed evolution experiments. Steinberg and Ostermeier analyzed fitness effects for TEM-15 β-lactamase under varying environmental conditions and found that negative selections were able to bridge access to the highest fitness peaks40. Wrenbeck et al. performed deep mutational scanning of an aliphatic amidase on three substrates and found that specificity-determining mutations were distributed throughout the protein sequence and structure rather than located near the active site31. 11 METHODOLOGICAL ADVANCES AND CURRENT LIMITATIONS Mutagenic library preparation Consider a protein of a typical length of 300 residues. A library comprising every possible single or double point mutation would contain 6x103 or 3.6x107 sequences, respectively. Similarly, a library with simultaneous saturation mutagenesis at four defined positions contains 1.6x105 sequences. For a typical experimental workflow there are 106-107 quality-filtered DNA reads, and accurate estimation of variant frequencies occurs above a statistical background of ~100 sequence reads per variant41,42. Dividing the number of sequences from a NGS run by the minimum number needed to estimate frequencies we arrive at an effective maximum population size of 104-105 per experiment. Thus, even NGS permits only small dances around the local protein sequence-fitness space. Purchasing thousands to millions of synthetically generated DNA sequences is still not an economically viable option for the average academic lab. Furthermore, established facile protocols for random mutagenesis like error-prone PCR43 or chemical synthesis by doping1 provide access only to a minority of possible codon substitutions, and there is often a large variance in the number of mutations introduced. Thus, robust methods for constructing large, user-defined DNA libraries are needed. Generation of libraries with mutations at 1-4 defined positions have been demonstrated using homologous recombination and cassette mutagenesis. For applications such as lead candidate maturation the generation of comprehensive single-site saturation mutagenesis (CSM) libraries is desired. A CSM library contains all possible single amino acid substitutions at every position in the primary sequence. One could generate such libraries by performing separate saturation mutagenesis reactions for each position using QuikChange or similar methods. 12 However, there are now three methods that can generate CSM libraries for gene-length targets with a single reaction: PALS44, PFunkel45, and Nicking Mutagenesis46. In PFunkel mutagenesis, single mutants are generated by thermocycling mutagenic oligos with template DNA at a low primer:template ratio in a single test-tube. While PFunkel has been demonstrated on multiple systems with excellent performance30,33,42 the method requires a bacteriophage preparation of a Uracil-containing ssDNA template, which can be laborious. To overcome this, Wrenbeck et al. developed a similar method, Nicking Mutagenesis, which uses plasmid dsDNA as the reaction template46. DNA read length restrictions One major limitation of NGS is the inherent short read length (75 to 300 nucleotides for Illumina sequencing platform) (Figure 1.3a). As such, a mutation located outside of the read window would be invisible. Longer read lengths are possible using PacBio and Oxford Nanopore instruments but at the cost of reduced throughput and accuracy, respectively. Because of these limitations, many groups perform deep mutational scanning on small genes or on subsets of genes (tiling) (Figure 1.3b)25,27,30,34,42,47. An emerging strategy is to perform a selection on a full-length gene but ‘link’ or phase haplotypes from one portion of the gene to the remainder (Figure 1.3c)44,48–54. For example, Sarkisyan et al. introduced a random 20-nucleotide barcode at the C-terminal end of a library of green fluorescent protein variants whilst performing error-prone PCR54. Genotypes were barcode linked by sequencing both the N- and C- termini, with the N-terminus brought into proximity of the barcode with successive digestion and ligation reactions. 13 Figure 1.3: Strategies to overcome read length limitations of NGS. A.) Mutations falling outside of a length ’readable’ by current sequencing technologies would be invisible. B.) In a gene tiling approach, mutational libraries are prepared such that mutations are restricted to a stretch of DNA readable by NGS platforms. Parallel screens or selections for function are performed. C.) Molecular barcoding of library members provides a means to overcome NGS sequencing read length restrictions. Randomized DNA barcodes are assigned to library member (1). Variants and their corresponding barcodes are linked and cataloged (haplotyped) (2). After functional selection (3), variants in the pre- and post-selection populations are counted by sequencing barcodes (4). Sequencing analysis A crucial step in any NGS-utilizing experiment is to extract useful phenotypic data binding, kinetics, thermodynamic stability, host organismal fitness, etc. - from raw sequencing reads. Many groups report site-specific preferences as an enrichment ratio. To that end, Fowler et al. developed Enrich, a python-based software that transforms raw sequencing counts from pre- 14 and post-selection populations into per-allele enrichment ratios55. Similarly, Bloom developed a software that calculates enrichments using a likelihood-based treatment of mutation counts instead of simple ratios56. Woldring et al. developed ScaffoldSeq, a Python-based software for the analysis of partially diverse protein sequences for single site and pairwise amino acid frequencies across the population57. Normalization of these enrichment ratios to an unambiguous fitness metric like binding or catalytic efficiency is perhaps the least standardized portion of the deep mutational scanning pipeline and there is a need for a community-wide consensus on how to normalize. Kowalsky et al. describe a mathematical framework for normalizing enrichment ratios of variants assayed in deep mutational scanning experiments for FACS and growth-based selections42. Similar approaches are used for plate-based selections33. Finally, Abriata et al. developed a webserver, PsychoProt, for the analysis of functional data from saturation mutational libraries and protein sequence alignments for biophysical constraints using structural information58. CONCLUSION NGS has been a transformative technology for many fields in the biological sciences, with protein science and engineering being no exception. Generation and analysis of fitness landscapes can inform on mechanisms of natural evolution and fundamentals of enzyme behavior. Notable advances in our ability to engineer affinity and specificity in protein-ligand interactions has been enabled by NGS, while enzyme engineering has lagged behind largely because of the lack of generalized HTS strategies. For this same reason, the application of NGS to membrane protein engineering has even further lagged behind. The utility of NGS enabled enzyme and membrane protein engineering awaits screening technology breakthroughs. Accurate and facile sequencing 15 of non-contiguous mutations (haplotyping), either through the use of barcoding or the advent of longer-read technologies, will improve and expand the utility of NGS protein engineering. 16 REFERENCES 17 REFERENCES 1. Fowler, D. M. et al. High-resolution mapping of protein sequence-function relationships. Nat. Methods 7, 741–746 (2010). 2. Hietpas, R. T., Jensen, J. D. & Bolon, D. N. A. Experimental illumination of a fitness landscape. Proc. Natl. Acad. Sci. 108, 7896–7901 (2011). 3. Mavor, D. et al. Determination of ubiquitin fitness landscapes under different chemical stresses in a classroom setting. Elife 5, e15802 (2016). 4. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014). 5. Boucher, J. I., Bolon, D. N. A. & Tawfik, D. S. Quantifying and understanding the fitness effects of protein mutations: Laboratory versus nature. Protein Sci. (2016). doi:10.1002/pro.2928 6. Gai, S. A. & Wittrup, K. D. Yeast surface display for protein engineering and characterization. Curr. Opin. Struct. Biol. 17, 467–73 (2007). 7. Dekosky, B. J. et al. High-throughput sequencing of the paired human immunoglobulin heavy and light chain repertoire. Nat. Biotechnol. 31, 166–169 (2013). 8. Wang, B. et al. Facile discovery of a diverse panel of anti-Ebola virus antibodies by immune repertoire mining. Sci. Rep. 5, (2015). 9. Wang, B. et al. Discovery of high affinity anti-ricin antibodies by B cell receptor sequencing and by yeast display of combinatorial VH: VL libraries from immunized animals. MAbs (2016). doi:10.1080/19420862.2016.1190059 10. Glanville, J. et al. Deep sequencing in library selection projects: What insight does it bring? Curr. Opin. Struct. Biol. 33, 146–160 (2015). 11. Mahon, C. M. et al. Comprehensive interrogation of a minimalist synthetic CDR-H3 library and its ability to generate antibodies with therapeutic potential. J. Mol. Biol. 425, 1712– 1730 (2013). 12. Woldring, D. R., Holec, P. V, Zhou, H. & Hackel, B. J. High-Throughput Ligand Discovery Reveals a Sitewise Gradient of Diversity in Broadly Evolved Hydrophilic Fibronectin Domains. PLoS One 10, e0138956 (2015). 13. Strauch, E.-M., Fleishman, S. J. & Baker, D. Computational design of a pH-sensitive IgG 18 binding protein. Proc. Natl. Acad. Sci. 111, 675–680 (2014). 14. Procko, E. et al. A computationally designed inhibitor of an Epstein-Barr viral Bcl-2 protein induces apoptosis in infected cells. Cell 157, 1644–1656 (2014). 15. Koenig, P. et al. Deep Sequencing-guided Design of a High Affinity Dual Specificity Antibody to Target Two Angiogenic Factors in Neovascular Age-related Macular Degeneration. J. Biol. Chem. 290, 21773–21786 (2015). 16. Whitehead, T. A. et al. Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing. Nat. Biotechnol. 30, 543–8 (2012). 17. Koday, M. T. et al. A Computationally Designed Hemagglutinin Stem-Binding Protein Provides In Vivo Protection from Influenza Independent of a Host Immune Response. Plos Pathog. 12, e1005409 (2016). 18. Forsyth, C. M. et al. Deep mutational scanning of an antibody mammalian cell display and massively parallel Deep mutational scanning of an antibody against epidermal growth factor receptor using mammalian cell display and massively parallel pyrosequencing. MAbs 5, (2013). 19. Fujino, Y. et al. Robust in vitro affinity maturation strategy based on interface-focused highthroughput mutational scanning. Biochem. Biophys. Res. Commun. 428, 395–400 (2012). 20. Jardine, J. G. et al. HIV-1 broadly neutralizing antibody precursor B cells revealed by germline-targeting immunogen. Science (80-. ). 351, 1458–1463 (2016). 21. Tinberg, C. E. et al. Computational design of ligand-binding proteins with high affinity and selectivity. Nature 501, 212–216 (2013). 22. Taylor, N. D. et al. Engineering an allosteric transcription factor to respond to new ligands. Nat. Methods 13, 177–183 (2016). 23. Doolan, K. M. & Colby, D. W. Conformation-dependent epitopes recognized by prion protein antibodies probed using mutational scanning and deep sequencing. J. Mol. Biol. 427, 328–340 (2015). 24. Van Blarcom, T. et al. Precise and efficient antibody epitope determination through library design, yeast display and next-generation sequencing. J. Mol. Biol. 427, 1513–1534 (2015). 25. Kowalsky, C. A. et al. Rapid Fine Conformational Epitope Mapping Using Comprehensive Mutagenesis and Deep Sequencing. J. Biol. Chem. 290, 26457–26470 (2015). 26. Schlinkmann, K. M. et al. Critical features for biosynthesis, stability, and functionality of a G protein-coupled receptor uncovered by all-versus-all mutations. Proc. Natl. Acad. Sci. 19 109, 9810–9815 (2012). 27. Schlinkmann, K. M. et al. Maximizing detergent stability and functional expression of a GPCR by exhaustive recombination and evolution. J. Mol. Biol. 422, 414–428 (2012). 28. Egloff, P. et al. Structure of signaling-competent neurotensin receptor 1 obtained by directed evolution in Escherichia coli. Proc. Natl. Acad. Sci. 111, E655–E662 (2014). 29. Elazar, A. et al. Mutational scanning reveals the determinants of protein insertion and association energetics in the plasma membrane. Elife 5, e12125 (2016). 30. Klesmith, J. R., Bacik, J., Michalczyk, R. & Whitehead, T. A. Comprehensive SequenceFlux Mapping of a Levoglucosan Utilization Pathway in E. coli. ACS Synth. Biol. 4, 1235– 1243 (2015). 31. Wrenbeck, E. E., Azouz, L. R. & Whitehead, T. A. Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nat. Commun. 8, 15695 (2017). 32. Deng, Z. et al. Deep sequencing of systematic combinatorial libraries reveals β-lactamase sequence constraints at high resolution. J. Mol. Biol. 424, 150–67 (2012). 33. Firnberg, E., Labonte, J. W., Gray, J. J. & Ostermeier, M. A Comprehensive, HighResolution Map of a Gene’s Fitness Landscape. Mol. Biol. Evol. 31, 1581–1592 (2014). 34. Stiffler, M. A., Hekstra, D. R. & Ranganathan, R. Evolvability as a Function of Purifying Selection in TEM-1 β-Lactamase. Cell 160, 882–892 (2015). 35. Melnikov, A., Rogov, P., Wang, L., Gnirke, A. & Mikkelsen, T. S. Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic Acids Res. 42, gku511 (2014). 36. Thyme, S. B. et al. Massively parallel determination and modeling of endonuclease substrate specificity. 42, 13839–13852 (2014). 37. Romero, P. A., Tran, T. M. & Abate, A. R. Dissecting enzyme function with microfluidicbased deep mutational scanning. Proc. Natl. Acad. Sci. 112, 7159–7164 (2015). 38. Starita, L. M. et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proc. Natl. Acad. Sci. 110, E1263–E1272 (2013). 39. van der Meer, J. Y. et al. Using mutability landscapes of a promiscuous tautomerase to guide the engineering of enantioselective Michaelases. Nat. Commun. 7, 1–16 (2016). 40. Steinberg, B. & Ostermeier, M. Environmental changes bridge evolutionary valleys. Sci. 20 Adv. 2, e1500921 (2016). 41. Fowler, D. M., Stephany, J. J. & Fields, S. Measuring the activity of protein variants on a large scale using deep mutational scanning. Nat. Protoc. 9, 2267–2284 (2014). 42. Kowalsky, C. A. et al. High-Resolution Sequence-Function Mapping of Full-Length Proteins. PLoS One 10, e0118193 (2015). 43. Cirino, P. C., Mayer, K. M. & Umeno, D. in Directed Evolution Library Creation: Methods and Protocols 3–9 (Springer, 2003). 44. Kitzman, J. O., Starita, L. M., Lo, R. S., Fields, S. & Shendure, J. Massively parallel singleamino-acid mutagenesis. Nat. Methods 12, 203–206 (2015). 45. Firnberg, E. & Ostermeier, M. PFunkel: Efficient, Expansive, User-Defined Mutagenesis. PLoS One 7, e52031 (2012). 46. Wrenbeck, E. E. et al. Plasmid-based one-pot saturation mutagenesis. Nat. Methods (2016). doi:10.1038/nmeth.4029 47. Melamed, D., Young, D. L., Gamble, C. E., Miller, C. R. & Fields, S. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein. RNA 19, 1537–1551 (2013). 48. Borgstrom, E. et al. Phasing of single DNA molecules by massively parallel barcoding. Nat. Commun. 6, (2015). 49. Cho, N. et al. De novo assembly and next-generation sequencing to analyse full-length gene variants from codon-barcoded libraries. Nat. Commun. 6, (2015). 50. Hiatt, J. B., Patwardhan, R. P., Turner, E. H., Lee, C. & Shendure, J. Parallel, tag-directed assembly of locally derived short sequence reads. Nat. Methods 7, 119–122 (2010). 51. Hong, L. Z. et al. BAsE-Seq  : a method for obtaining long viral haplotypes from short sequence reads. Genome Biol. 15, (2014). 52. Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507 (2014). 53. Stapleton, J. A. et al. Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing. PLoS One 11, e0147229 (2016). 54. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016). 21 55. Fowler, D. M., Araya, C. L., Gerard, W. & Fields, S. Enrich: Software for analysis of protein function by enrichment and depletion of variants. Bioinformatics 27, 3430–3431 (2011). 56. Bloom, J. D. Software for the analysis and visualization of deep mutational scanning data. BMC Bioinformatics 16, 1–13 (2015). 57. Woldring, D. R., Holec, P. V & Hackel, B. J. ScaffoldSeq: Software for characterization of directed evolution populations. Proteins Struct. Funct. Bioinforma. 84, 869–874 (2016). 58. Abriata, L. A., Bovigny, C. & Peraro, M. D. Detection and sequence/structure mapping of biophysical constraints to protein variation in saturated mutational libraries and protein sequence alignments with a dedicated server. BMC Bioinformatics 17, 1–13 (2016). 22 CHAPTER TWO Nicking Mutagenesis: a plasmid-based, one-pot saturation mutagenesis method This chapter is adapted with permission from “Plasmid-based one-pot saturation mutagenesis” in Nature Methods 13:11 (2016) 928-930 by Emily E. Wrenbeck, Justin R. Klesmith, James A. Stapleton, Adebola Adeniran, Keith EJ Tyo, and Timothy A. Whitehead. 23 ABSTRACT Deep mutational scanning is a foundational tool for addressing the functional consequences of large numbers of mutants, yet a more efficient and accessible method for construction of userdefined mutagenesis libraries is needed. Here we present Nicking Mutagenesis, a robust singleday, single-pot saturation mutagenesis method that is performed on routinely prepped plasmid dsDNA. The method can be used to produce comprehensive, single-, or multi-site saturation mutagenesis libraries. 24 INTRODUCTION Mutational studies have been used for over six decades to probe protein sequence-function relationships. Deep mutational scanning has emerged as a method to assess the effect of thousands of mutations on function using massively parallel functional screens and DNA counting via deep sequencing1. Information rich sequence-function maps obtained from such methods allow a researcher to address a variety of aims, including the generation of biomolecular fitness landscapes2–6, therapeutic protein optimization7, and high-resolution conformational epitope mapping8. Although other technical challenges have been resolved9,10, a robust and accessible method for the construction of high quality, user-defined mutational libraries is lacking. Random mutagenesis methods such as error-prone PCR suffer from limited codon sampling and imprecise control over the number of mutations introduced11. Of the published comprehensive saturation mutagenesis methods2,4,11–14, PFunkel12 offers the best combination of library coverage, mutational efficiency, control over number of mutations introduced, and scalability (Table A 2.1). In particular, PFunkel can be used to prepare libraries covering all possible single point mutations, with most members of the library having exactly one mutation. However, PFunkel is limited by the required preparation of a uracil-containing ssDNA template by phage infection. dU-ssDNA yields are highly variable and the preparation adds at least two days to the mutagenesis procedure. By analogy to site-directed mutagenesis, PCR-based methods like QuikChange have mostly supplanted the highly efficient Kunkel mutagenesis that also requires dU-ssDNA15. 25 RESULTS Here we present Nicking Mutagenesis, a method that does not rely on dU-ssDNA (Figure 2.1). Nicking mutagenesis is flexible, as any plasmid dsDNA can be used provided that it contains a single 7-bp BbvCI restriction site. The key mechanism in nicking mutagenesis is the successive creation and degradation of a wild-type ssDNA template. This is accomplished via a pair of nicking endonucleases (Nt.BbvCI and Nb.BbvCI)16,17 that recognize the same site but nick one strand or the other, followed by exonuclease digestion. First, ssDNA template is created from dsDNA plasmid via a strand-specific nick introduced by Nt.BbvCI followed by selective digestion of the nicked strand with Exonuclease III (step 1; Figure 2.1). Mutant strands are then synthesized by thermal cycling template DNA with mutagenic oligos at a low primer-to-template ratio to promote annealing of effectively one primer to each template12 (step 2). The highly processive and high fidelity Phusion Polymerase extends the primer around the circular template. Taq DNA Ligase closes the new strand to form a dsDNA plasmid with a mismatch at the mutational site. The heteroduplex DNA is then column purified to avoid buffer incompatibility issues and prevent potential competition between Phusion and Exonuclease III. To resolve the heteroduplex, the opposite strand nicking endonuclease, Nb.BbvCI, creates a nick in the template strand, which is subsequently degraded by Exonuclease III (step 3). A secondary primer is then added and synthesis of the complementary mutant strand follows as above (step 4). To reduce wild-type background, the final reaction is treated with DpnI to digest methylated and hemi-methylated parental DNA. The complete protocol can be performed in a single day with minimal hands-on time (Table A 2.2). We first optimized nicking mutagenesis using a green/white fluorescent screen based on reversion of a non-fluorescent green fluorescent protein (GFP) mutant (Note A 2.1, 26 Figure 2.1: Comprehensive single-site Nicking Mutagenesis. Plasmid dsDNA containing a 7bp BbvCI recognition site is nicked by Nt.BbvCI. Exonuclease III degrades the nicked strand to generate an ssDNA template (step 1). Mutagenic oligos are then added at a 1:20 ratio with template, Phusion Polymerase synthesizes mutant strands, and Taq DNA Ligase seals nicks (step 2). The reaction is column purified, and then the wild-type template strand is nicked by Nb.BbvCI and digested by Exonuclease III digestion (step 3). A second primer is added and the complementary mutant strand is synthesized to yield mutagenized dsDNA (step 4). Figure A 2.1, and Table A 2.3). Next, we used nicking mutagenesis to prepare comprehensive single-site saturation mutagenesis libraries for two different 71 codon stretches of an aliphatic amidase encoded by the gene amiE from Pseudomonas aeruginosa (reaction 1 and 2 correspond to residues 100-170 and 171-241, respectively)18. A mixture of 71 degenerate NNN oligo sets, each with three consecutive randomized bases (NNN) corresponding to one of the 71 codons, was used at a 1:20 primer:template ratio. We deep sequenced the resulting libraries to an average depth of coverage of 2,200 reads per variant and processed the data using Enrich19. We observed 100% of possible single non-synonymous (NS) mutants (2840 total) and 100% of all possible programmed codon mutations (8946 total) with at least 10 reads (library coverage statistics are shown in Table 2.1). 64.4% and 63.5% of library members had exactly one 27 Table 2.1: Nicking mutagenesis library coverage statistics. amiE reaction 1 amiE reaction 2 bla Sequencing reads post quality filter19 (fold coverage) 4273346 (941x) 5378051 (1184x) 414417 (74x) Number of transformants Number of mutated codons Total plasmid length (nucleotides) Percent of reads with: No nonsynonymous mutations One nonsynonmymous mutation Multiple nonsynonymous mutations Frameshift mutation Percent of mutant codons with: 1-bp substitution 2-bp substitution 3-bp substitution 1.3x107 71 4,612 1.4x107 71 4,612 1.5x105 88 6,907 1.6 98.4 0 0 27.2 64.4 8.4 0.05 26.3 63.5 10.2 0.05 30.1 59.9 9.7 0.34 14.3 42.9 42.9 32.2 32.8 35.0 31.4 31.5 37.1 25.4 41.7 33.0 Percent of possible codon substitutions observed 1-bp substitution 2-bp substitution 3-bp substitution All substitutions 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 99.7 83.5 77.8 83.4 Coverage of possible single amino acid substitutions with ≥5 reads 100.0 100.0 91.5 Coverage of possible programmed mutant codons with ≥5 reads 100.0 100.0 75.4 Theoretical NS mutation for amiE reaction 1 and 2, respectively. The incidence of non-programmed indel mutations was 0.05% for both reactions 1 and 2. The frequency of individual mutations in each library followed a log-normal distribution, which is consistent with libraries prepared by PFunkel mutagenesis6,9 (Figure A 2.2). In deep mutational scanning experiments, the initial library is typically sequenced at approximately 200-fold depth of coverage of the expected diversity. 28 Normalizing the above sequencing results to a 200-fold depth of coverage reveal that 93.2% and 97.8% of possible NS mutations would be represented above the typical threshold of 10 sequencing reads for amiE reaction 1 and 2, respectively (Figure A 2.3a). This compares favorably with PFunkel mutagenesis (91.7% using the same threshold), although we note that the library distributions between the two methods are essentially identical (Figure A 2.3b). We next assessed the libraries for off-target mutations by shotgun sequencing the input plasmid, pEDA3_amiE (no intended mutations), and library dsDNA from amiE reactions 1 and 2 and found the corresponding mutant tiles had significantly higher percent mutant allele rates (p-value < 2.2x10-16 for both reaction 1 and 2, Figure A 2.4). To demonstrate performance on larger plasmids, we used nicking mutagenesis to prepare a comprehensive single-site saturation mutagenesis library for an 88-codon stretch of the gene bla encoding E. coli TEM-1 !-lactamase from a 6.9 kb plasmid and sequenced to 74-fold coverage of codon space. We observed nearly identical library composition with 91.5% coverage of possible amino acid substitutions (Table 2.1), which is consistent with expected coverage at this depth of sequencing (Figure A 2.5). Of note, we observed an order of magnitude fewer transformants when preparing this library compared to amiE, consistent with larger plasmids having lower transformation efficiency. One potential strategy to improve transformation efficiency is to use ultra-competent cells. Alternatively, the library can be constructed on a smaller plasmid and then transferred to a desired plasmid via subcloning. To further expand the utility of nicking mutagenesis, we developed a single- and multi-site protocol (Figure A 2.6). The protocol was modified by adding primer at a 5:1 molar ratio to template and altering the thermal cycling steps for mutant strand synthesis. We tested the method by performing three single- and one triple-mutation nicking mutagenesis reaction on bla (plasmid 29 pSALECT-wtTEM1/csTEM1). Sanger sequencing of two clones from each of the three single-site reactions revealed that 5/6 clones contained a single mutation. For the multi-site reaction, 5 out of 10 sequenced clones contained the desired three programmed mutations. Robust and effective molecular biology methods are characterized by the ease of their adoption by laboratories outside of where they were developed. To evaluate the accessibility of the method, the Tyo lab (Northwestern University) tested the method by performing single-site nicking mutagenesis on the pEDA5_GFPmut3_Y66H plasmid with the restore-to-function oligo GFP_H66Y. The resulting mutational efficiency, calculated by counting fluorescent (mutant) and non-fluorescent (wild-type) colonies, was 86.8 ± 6.1% (n=3 independent experiments). DISCUSSION We have demonstrated a single-pot single-day method for the preparation of comprehensive single- and multi-site saturation mutagenesis libraries from plasmid dsDNA (method cost detailed in Table A 2.4). The utility of nicking mutagenesis is not limited to saturation mutagenesis. Codon substitutions are user defined, making it possible to restrict diversity to specific residues such as hydrophobic or charged substitutions. An inherent limitation is that if multiple BbvCI nicking sites on a plasmid exist they must be in the same orientation. In the human genome, BbvCI has a mean distance between sites of 2058 base pairs, thus a considerable fraction of human genes will have nicking sites. Solutions include either cloning the gene of interest into a plasmid with a compatible nicking site orientation or using custom gene synthesis to remove extra BbvCI sites. To validate the performance of nicking mutagenesis we used “testers” from an external lab; we propose using such testers to enhance reproducibility and accessibility of new molecular 30 biology methods. To aid in method adoption, the GFP plasmid used for green/white screening (pEDA5_GFPmut3_Y66H) has been deposited to the AddGene repository (www.addgene.org, plasmid #80085) as a tool for practicing and troubleshooting the method. MATERIALS AND METHODS Reagents All chemicals were purchased from Sigma-Aldrich unless otherwise noted. All enzymes were purchased from New England Biolabs. All mutagenic oligos were designed using the QuikChange Primer Design Program (Agilent, Santa Clara, CA). Mutagenic oligos and sequencing primers were ordered from Integrated DNA Technologies (Coralville, IA). Plasmid construction All primer sequences used in this work are listed in Table A 2.3. Plasmid pEDA5_GFPmut3_Y66H was prepared by modification of pJK_proB_GFPmut3 as described in Bienick et al.18 by a single Kunkel15 reaction with two mutagenic primers: one encoding a BbvCI site (primer pED_BbvCI) and the second to introduce a Tyr66His point mutation (primer GFP_Y66H). pEDA3_amiE was constructed by altering pJK_proK17_amiE as described in Bienick et al.18 with a single Kunkel15 reaction with two primers: one encoding a BbvCI site (pED_BbvCI) and the second encoding a mutated ribosome binding sequence (pED_kRBS3). pEDA5_GFPmut3_Y66H has been deposited in the AddGene repository (www.addgene.org, plasmid #80085). Plasmid pSALECT-wtTEM1/csTEM1 was created as follows. Overhang PCR was used to add in an XhoI and BbvCI site after the existing NdeI site and before the original stop codon of 31 plasmid pSALECT-EcoBam (Plasmid #59705, acquired from AddGene). A Δ2-23 truncation of wild-type TEM-1 was cloned in-frame between the NdeI and XhoI sites. A codon swapped Δ2-23 truncation of wild-type TEM-1 with a C-terminal His6x tag and double stop codon was ordered as a gBlock (IDT) and was cloned in-frame between the XhoI and BbvCI site. This second TEM-1 is a C-terminal fusion to the wild-type TEM-1. Plasmid pETconNK-TEM1(S70A,D179G) was created as follows. Gibson assembly was used to remove the ampicillin gene from pETcon(-) (Addgene plasmid #41522) and insert a kanamycin gene with a 3’ BbvCI site on the coding strand. A Δ2-23 truncation of TEM-1 with point mutations S70A and D179G was cloned in-frame between the NdeI and XhoI sites. Comprehensive nicking mutagenesis optimization The final optimized comprehensive nicking mutagenesis protocol is supplied in Supplementary Protocol 1 and on the Protocols Exchange (DOI 10.1038/protex.2016.061). 1X CutSmart Buffer (NEB) was used as an enzyme diluent when necessary. Two reactions were set up as follows: 0.76 pmol pEDA5_GFPmut3_Y66H was incubated with 10 U each of Nt.BbvCI and Exonuclease III in 1X CutSmart Buffer (20 µL final volume) for 60 minutes at 37°C followed by 80°C for 20 minutes (heat kill). 40 U of DpnI was added and the reaction was incubated at 37°C for 60 minutes followed by 80°C for 20 minutes (heat kill). One reaction was then column purified by Zymo Clean & Concentrator (5:1 v/v ratio of binding buffer to sample), eluted in 6 µL Nuclease-Free H2O (NFH2O, Integrated DNA Technologies), transformed into XL1-Blue electrocompetent cells, and dilution plated. The following was added to the second reaction: 200 U of Taq DNA Ligase, 2 U Phusion High-Fidelity DNA Polymerase, 20 µL 5X Phusion HF Buffer, 20 µL 50 mM DTT, 1 µL 50 mM NAD+, 2 µL 10 mM dNTPs, 29 µL NFH2O (final reaction volume 32 of 100µL). The tube was placed into a preheated (98°C) thermal cycler set with the following program: 98°C for 2 minutes, 15 cycles of 98°C for 30 seconds (denature), 55°C for 45 seconds (anneal oligos), 72°C for 7 minutes (extension), followed by a final incubation at 45°C for 20 minutes to complete ligation. The reaction was column purified, transformed, and dilution plated as described above. The optimization experiment including addition of Exonuclease I was performed as described below with the following modifications. A single mutagenic primer, His66Tyr (restores wild-type chromophore sequence), was used at a 1:20 primer:template ratio. The reaction was column purified and transformed into XL1 Blue electrocompetent cells as above. Green fluorescent (mutated) and white (parental) colonies were counted to calculate transformational and mutational efficiencies. Comprehensive nicking mutagenesis of amiE and bla Three separate reactions targeting residues 100-170 and 171-241 of amiE and 201-289 of TEM-1 were performed. Mutagenic oligos programming degenerate codons (NNN) for each reaction were mixed in equimolar amounts to a final concentration of 10 µM. 20 µL of each primer mix was added to a phosphorylation reaction containing 2.4 µL of T4 Polynucleotide Kinase Buffer, 1 µL 10 mM ATP, 10 U T4 Polynucleotide Kinase, and incubated for 1 hour at 37°C. Secondary primer pED_2ND was phosphorylated in a reaction containing 18 µL NFH2O, 2 µL T4 Polynucleotide Kinase Buffer, 7 µL 100 µM secondary primer, 1 µL 10 mM ATP, and 10 U T4 Polynucleotide Kinase. The reaction was incubated for 1 hour at 37°C. Phosphorylated NNN and secondary primers were diluted 1:1000 and 1:20 in NFH2O, respectively. 33 ssDNA template was prepared in a reaction containing 0.76 pmol plasmid dsDNA, 2 µL NEB CutSmart Buffer, 10 U Nt.BbvCI, 10 U Exonuclease III, 20 U Exonuclease I, and NFH2O to 20 µL final reaction volume in a PCR tube. The following thermal cycle program was used: 37°C for 60 minutes, 80°C for 20 minutes (heat kill), hold at 4-10°C. Next, for mutant strand synthesis the following was added to each PCR tube on ice: 20 µL 5X Phusion HF Buffer, 20 µL 50 mM DTT, 1 µL 50 mM NAD+, 2 µL 10 mM dNTPs, 4.3 µL 1:1000 diluted phosphorylated NNN mutagenic oligos, and 26.7 µL NFH2O (final reaction volume of 100µL). The tube contents were mixed, spun down, and placed on ice. 200 U of Taq DNA Ligase and 2 U Phusion High-Fidelity DNA Polymerase were added to each reaction, mixed, spun down, and placed into a preheated (98°C) thermal cycler set with the following program: 98°C for 2 minutes, 15 cycles of 98°C for 30 seconds (denature), 55°C for 45 seconds (anneal oligos), 72°C for 7 minutes (extension), followed by a final incubation at 45°C for 20 minutes to complete ligation. Additional 4.3 µL of oligos were added at the beginning of cycles 6 and 11. Each reaction was then column purified using a Zymo Clean & Concentrator kit (5:1 DNA Binding Buffer to sample). Each reaction was eluted in 15 µL NFH2O, and 14 µL was transferred to a fresh PCR tube. Next, for the template degradation reaction the following was added to each tube: 2 µL 10X NEB CutSmart Buffer, 1 U Nb.BbvCI, 2 U Exonuclease III, and 20 U Exonuclease I (20µL final volume). The following thermocycler program was used: 37°C for 60 minutes, 80°C for 20 minutes (heat kill), hold at 4-10°C. To synthesize the second (complementary) mutant strand, the following was added to each reaction: 20 µL 5X Phusion HF Buffer, 20 µL 50 mM DTT, 1 µL 50 mM NAD+, 2 µL 10 mM dNTPs, 3.3 µL 1:20 diluted phosphorylated secondary primer (0.38 pmol), and 27.7 µL NFH2O (final reaction volume of 100 µL). The tube contents were mixed, spun down, and placed on ice. 200 U of Taq DNA Ligase and 2 U Phusion High-Fidelity DNA 34 Polymerase were added to each reaction, mixed, spun down, and placed into a preheated (98°C) thermal cycler set with the following program: 98°C for 30 seconds, 55°C for 45 seconds, 72°C for 10 minutes (can be extended for longer constructs), and 45°C for 20 minutes. To degrade methylated and hemi-methylated wild-type DNA, 40 U of DpnI was added to each reaction and incubated at 37°C for 1 hour. The final reaction was column purified using the Zymo Clean & Concentrator-5 kit as described above but eluted in 6 µL NFH2O. The entire 6 µL was transformed into 40 µL of XL1-Blue electroporation competent cells (Agilent) and plated on Corning square bioassay dishes (Sigma-Aldrich, 245mm x 245mm x 25mm). The following day, colonies were scraped with 15 mL of TB, vortexed, and 1 mL was removed and mini-prepped using a Qiagen Mini-prep Kit. Single and multi-site nicking mutagenesis Mutagenic primers were phosphorylated separately following the protocol described above for the secondary primer, then diluted 1:20 with NFH2O. For multi-site nicking mutagenesis, 2 µL of each primer was mixed in a single tube and diluted to a final volume of 40 µL. ssDNA template preparation was performed as described above. For mutant strand synthesis, oligos were annealed in the absence of polymerase as suggested by Firnberg et al.11. 3.3 µL of 1:20 phosphorylated oligos (single or mixed), 10 µL 5X Phusion HF Buffer, and 16.7 µL NFH2O were added to the appropriate tube. Oligos were annealed with the following thermocycler program: 98°C for 2 minutes, decrease to 55°C over 15 minutes, 55°C for 5 minutes, and hold at 55°C. While the reactions were held on the block, the following was added to each tube from a master mix: 20 µL 5X Phusion HF Buffer, 20 µL 50 mM DTT, 1 µL 50 mM NAD+, 2 µL 10 mM dNTPs, and 11 µL NFH2O (final reaction volume of 100µL). The tube contents were mixed by pipetting, then 200 U 35 of Taq DNA Ligase and 2 U Phusion High-Fidelity DNA Polymerase were added to each reaction, mixed, spun down, and returned to the thermocycler for the following program: 72°C for 10 minutes, 45°C for 20 minutes. The remainder of the protocol proceeded as described in the comprehensive protocol. DNA deep sequencing and analysis Plasmids obtained after transformation of the reaction mix and miniprep were used for deep sequencing analysis of library coverage. Samples were prepared for deep sequencing as described in Kowalsky et al.9 following Method B. Sequences of PCR primers are listed in Table A 2.3. Samples for shotgun sequencing were prepared at the Michigan State University sequencing core (approximate median insert size of 360bp). amiE libraries were sequenced on an Illumina MiSeq with 250bp PE reads at the University of Illinois Chicago sequencing core. All other samples were sequenced on an Illumina MiSeq with 300bp PE reads at Michigan State University. Read statistics are given in Table 2.1. Raw FASTQ files were analyzed with Enrich software19 with modifications as described in Kowalsky et al.9. Analysis of libraries for frameshift and off-target mutations was done using the Burrows Wheeler Aligner20 followed by processing with SAMtools21. Library statistics (Table 2.1) and read coverage plots (Figure A 2.2 and A 2.5a) were obtained using custom scripts freely available at Github (user JKlesmith). Sequencing data has been deposited to the NCBI Sequence Read Archive (accession numbers SRR4105481-SRR4105486). Statistics For analysis of the shotgun sequencing data, the mean of the background subtracted per-position percent mutant allele values for amiE reactions 1 and 2 at positions inside and outside the targeted 36 region for mutagenesis were computed. Welch two sample t-tests were performed using the R statistical software22 to calculate significance between averages from the inside regions and the outside regions for reaction 1 (p-value < 2.2*10-16, t = -14.846, df = 697.06) and reaction 2 (pvalue < 2.2*10-16, t = -19.259, df = 214). 37 APPENDIX 38 APPENDIX Figure A 2.1: Gel snapshots along the optimized nicking mutagenesis method. Plasmid dsDNA and ssDNA (prepared from bacteriophage) of pEDA5_GFPmut3 are included for size reference. NR = nicking reaction; 2 µg of pEDA5_GFPmut3_Y66H was placed in a 20 µL reaction with 10 U Nt.BbvCI in 1X CutSmart buffer. TP = template preparation; a reaction was ceased after the template preparation phase. MS = mutant strand; a reaction was ceased after the synthesis of the mutant strands, where regeneration of relaxed dsDNA can be seen. 1 kb Plus Ladder (Thermo Fischer Scientific, lane 1) included for size reference. Gel image has been cropped to size. 39 Figure A 2.2: Probability distribution of mutation counts in amiE comprehensive nicking mutagenesis libraries. Dashed vertical lines represent median (red) and mean (blue) library member read coverage. Panel a shows distribution for reaction 1 and panel b shows the distribution for reaction 2. 40 Figure A 2.3: Comparison of the probability distributions of site-saturation mutagenesis libraries resulting from nicking mutagenesis or PFunkel mutagenesis. Because the depth of sequencing coverage varied between the three methods, all samples were normalized to a 200-fold depth of coverage of possible single non-synonymous mutations. The expected library diversity is 820 for Kowalsky et al.1,2 and 1420 for amiE reaction 1 & reaction 2 (this work). a. Cumulative distribution function for the three libraries as a function of normalized sequencing counts. 91.7%, 93.2%, and 97.8% of the library is represented above a threshold of 10 sequencing counts for PFunkel library, amiE reaction 1, and the amiE reaction 2 libraries, respectively. b. Frequency is plotted as a function of sequencing counts for the same three libraries. The experimental data are plotted as symbols, with lines representing a best fit of the data using a log-normal distribution (PFunkel: µ=2, "=0.49, amiE reaction 1: µ=2, "=0.50. amiE reaction 2: µ=2, "=0.44). 41 Figure A 2.4: Off-target mutational analysis of amiE input plasmid and mutational libraries by shotgun sequencing. a-c. Percent mutant allele at each position in the plasmid sequence for the input plasmid (a) amiE reaction 1 library (b) and amiE reaction 2 library (c). Shotgun sequencing reads were aligned to the pEDA3_amiE plasmid using BWA aligner and the frequency of each base at each position was counted using bam-readcount (www.github.com). Percent mutant allele was calculated for each position by summing all non-wildtype allele counts and diving by total reads at that position. Overlain red curves indicate depth of sequencing coverage at each position. d-e. Background subtracted percent mutant allele for each position in plasmid sequence of amiE reaction 1 library (d) and amiE reaction 2 library (e). 3,4 42 Figure A 2.5: bla library coverage distributions. Probability distribution of mutation counts in bla comprehensive nicking mutagenesis libraries. Dashed vertical lines represent median (red) and mean (blue) library member read coverage. b. Cumulative distribution function for the three libraries as a function of normalized sequencing counts. 43 Figure A 2.6: Schematic overview of single- or multi-site nicking mutagenesis. After the preparation of an ssDNA template, an annealing reaction is set up with a single or mixed set of mutagenic oligos at a 5:1 primer:template ratio (for each oligo). Next, reagents and enzymes necessary to synthesize the mutant strands are added. The remainder of the protocol is identical to comprehensive nicking mutagenesis. 44 Table A 2.1: Performance metrics of published comprehensive mutagenesis methods2,11,12,14,23. Bolded text indicates metrics that are comparatively inefficient to nicking mutagenesis and PFunkel mutagenesis. NS = nonsynonymous. Percent of mutants with NS mutations Library type Library coverage Single Zero Multiple Scalability mutatable codons/ reaction Cassette Mutagenesis Hietpas et al.2 Hsp90 (9) userdefined 100% nd nd nd 20 Error-Prone PCR Doolan et al.23 mouse PrP (211) random nd 28.2% 60.6% 11.08% all Chemical Synthesis Fowler et al.14 hYAP65 WW domain (25) random 83.2% nd 20* nd 30 PALS Mutagenesis Kitzman et al.11 Gal4 DBD and p53 (457 total) userdefined 94.3% 35% 29.2% 33% all PFunkel Mutagenesis Kowalsky et al.24 Ct Cohesin (162) userdefined 97.1% 73.6% 20.5% 5.9% all 64% 26.8% 9.3% all Mutagenesis method Publication data gathered from Gene (# codons mutated) Nicking Mutagenesis userThis work 100.0% defined amiE (142) *estimated from Supplementary Figure 3 of original publication 45 Table A 2.2: Estimated time required for comprehensive library construction using nicking mutagenesis. Step number 1a* Hands-on time (min) 30 On-thermal cycler time (min) 60* 1b* ssDNA template strand preparation 5 80* 2a 2b 3 Comprehensive codon mutagenesis strand 1 Column purification I Degrade template strand 10 146 5 5 80 Synthesize complimentary mutagenic strand DpnI DNA cleanup 10 32 2 5 60 1.2 7.8 6.6 4a 4b 4c Phosphorylate oligos Column purification II Subtotal (hr): Total (hr): *steps can be performed simultaneously 46 Table A 2.3: Primer sequences. Plasmid construction primers pED_BbvCI gcggccccacgggtcctcagcgcgcatgat pED_kRBS3 gacgagctaatatcgccatgtctcatatgtataaaaacttcttaaagttaaacaaaattatttctagaaagttaaa GFP_Y66H gcaaagcattgaacaccatgaccgaaagtagtgacaagt Green/white screening mutagenic oligos GFP_H66Y gcaaagcattgaacaccataaccgaaagtagtgacaagt GFP_H66Y_RC acttgtcactactttcggttatggtgttcaatgctttgc Green/white screening secondary primer pED_2ND ggtgattcattctgctaa amiE and TEM-1 secondary primers pED_2ND (amiE) ggtgattcattctgctaa pSALECT/pETconNK_2ND (TEM-1) ggtttcccgactggaaag Gene amplification: inner primers amiE_NMT1_FWD gttcagagttctacagtccgacgatcgcaaatgtttggggtgtg amiE_T2_FWD gttcagagttctacagtccgacgatcctgcgatgacggtaat amiE_T1_REV ccttggcacccgagaattccactctccaaatttccggata amiE_NMT2_REV ccttggcacccgagaattccattcgccgcattcacccagagt TEM1_T3_FWD gttcagagttctacagtccgacgatcattaactggcgaactacttact pETconNK_REV ccttggcacccgagaattccaaagcttttgttcggatc blue = Illumina sequencing primer; black = gene overlap Gene amplification: outer primers Illumina_FWD aatgatacggcgaccaccgagatctacacgttcagagttctacagtccga RPI30 caagcagaagacggcatacgagatCCGGTGgtgactggagttccttggcacccgagaattcca RPI31 caagcagaagacggcatacgagatATCGTGgtgactggagttccttggcacccgagaattcca RPI21 caagcagaagacggcatacgagatCGAAACgtgactggagttccttggcacccgagaattcca red = Illumina adapter sequence; BOLD = barcode; blue = Illumina sequencing primer 47 Table A 2.4: Cost analysis of nicking mutagenesis compared with PFunkel12. Library preparation cost was calculated by totaling cost of enzymes (price information gathered from New England Biolabs) and reagents (price information gathered from Sigma-Aldrich, Qiagen, and Zymo Research) on a per reaction basis. Price of chemically synthesized degenerate NNN oligos based on IDT pricing for a 40bp primer6 at the 500 pmole scale: $0.10/base*40bp = $4/codon. Prices obtained February 2016. PFunkel Nicking Mutagenesis $53 $55 NNN oligo cost per codon (source) $4 (IDT) $4 (IDT) Total cost per 100 scanned codons $453 $455 Library preparation cost per reaction 48 Note A 2.1: Optimization of nicking mutagenesis using green/white screening A previously constructed GFPmut3 expression plasmid18 was modified by incorporating a BbvCI site and by changing the amino acid sequence of the GFPmut3 chromophore, Gly65-Tyr66Gly67, to Gly65-His66-Gly67, resulting in a non-fluorescent protein. We performed nicking mutagenesis on this construct (pEDA5_GFPmut3_Y66H) with a restore-to-function mutagenic oligo (primer GFP_H66Y, see Supplementary Table 3 for sequences). Figure A 2.1 shows gel snapshots at different stages along the optimized process. Initial experiments with the full nicking mutagenesis protocol showed a mutational efficiency of 23% with 3x105 transformants. To determine the sources of high wild-type background, we performed a series of control experiments containing no mutagenic primer. Thus, any resulting transformants could be unambiguously attributed to wild-type. The number of background transformants was 103 after the template preparation step and incubation with DpnI, but increased to 106 if the reaction was allowed to proceed through the thermal cycling steps. We hypothesized that short stretches of incompletely degraded DNA were priming and regenerating wild-type constructs. To remedy this, Exonuclease I, which specifically degrades ssDNA, was added to both the template preparation and degradation reactions. The addition of Exonuclease I improved mutational efficiency to 56% with >5x105 transformants. Incubation of the final reaction mixture with DpnI to remove methylated and hemi-methylated wild-type DNA increased the mutational efficiency to 68% with >3x105 transformants. In oligonucleotide-programmed mutagenesis, mutagenic oligos are designed to be complementary to the wild-type template sequence on either side of the programmed mutation such that they can anneal to the template. For Kunkel mutagenesis15, the ssDNA template strand 49 is made by replication and packaging within a phage host. The directionality of the ssDNA template strand (sense or anti-sense) is dependent upon the directionality of the F1-origin of replication. If the F1-origin is such that the template strand made is sense, then mutagenic oligos are designed anti-sense. For nicking mutagenesis, the directionality of the template strand is dependent upon the orientation of the BbvCI site. The set of enzymes, Nt.BbvCI (Nick-top BbvCI) and Nb.BbvCI (Nick-bottom BbvCI) will create nicks on the strands containing their respective recognition sequence. If the Nt.BbvCI nicking enzyme is used for template preparation and its recognition sequence is encoded on the anti-sense strand, the ssDNA template formed will be sense. Thus, mutagenic oligos should be designed anti-sense. The opposite is true if Nb.BbvCI was used to create the template strand. To confirm that the order of nicking enzymes could be switched, we performed nicking mutagenesis using green/white screening in two reactions: one with Nt.BbvCI then Nb.BbvCI using the GFP_H66Y mutagenic primer (priming one strand), and the second using Nb.BbvCI first with the GFP_H66Y_RC primer (priming the opposite strand at the same location as GFP_H66Y). We observed mutational efficiencies of 46% and 44% with >8x104 and >9x104 total transformants, respectively, confirming that the order of nicking enzymes can be switched. Another consideration is that a target gene of interest may contain a BbvCI nicking site. In such a case, confirm that the orientation of the BbvCI nicking site is the same on the gene as on the backbone. 50 REFERENCES 51 REFERENCES 1. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014). 2. Hietpas, R. T., Jensen, J. D. & Bolon, D. N. A. Experimental illumination of a fitness landscape. Proc. Natl. Acad. Sci. 108, 7896–7901 (2011). 3. Firnberg, E., Labonte, J. W., Gray, J. J. & Ostermeier, M. A Comprehensive, HighResolution Map of a Gene’s Fitness Landscape. Mol. Biol. Evol. 31, 1581–1592 (2014). 4. Melnikov, A., Rogov, P., Wang, L., Gnirke, A. & Mikkelsen, T. S. Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic Acids Res. 42, gku511 (2014). 5. Stiffler, M. A., Hekstra, D. R. & Ranganathan, R. Evolvability as a Function of Purifying Selection in TEM-1 β-Lactamase. Cell 160, 882–892 (2015). 6. Klesmith, J. R., Bacik, J., Michalczyk, R. & Whitehead, T. A. Comprehensive SequenceFlux Mapping of a Levoglucosan Utilization Pathway in E. coli. ACS Synth. Biol. 4, 1235– 1243 (2015). 7. Whitehead, T. A. et al. Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing. Nat. Biotechnol. 30, 543–8 (2012). 8. Kowalsky, C. A. et al. Rapid Fine Conformational Epitope Mapping Using Comprehensive Mutagenesis and Deep Sequencing. J. Biol. Chem. 290, 26457–26470 (2015). 9. Kowalsky, C. A. et al. High-Resolution Sequence-Function Mapping of Full-Length Proteins. PLoS One 10, e0118193 (2015). 10. Fowler, D. M., Stephany, J. J. & Fields, S. Measuring the activity of protein variants on a large scale using deep mutational scanning. Nat. Protoc. 9, 2267–2284 (2014). 11. Kitzman, J. O., Starita, L. M., Lo, R. S., Fields, S. & Shendure, J. Massively parallel singleamino-acid mutagenesis. Nat. Methods 12, 203–206 (2015). 12. Firnberg, E. & Ostermeier, M. PFunkel: Efficient, Expansive, User-Defined Mutagenesis. PLoS One 7, e52031 (2012). 52 13. Jain, P. C. & Varadarajan, R. A rapid, efficient, and economical inverse polymerase chain reaction-based method for generating a site saturation mutant library. Anal. Biochem. 449, 90–98 (2014). 14. Fowler, D. M. et al. High-resolution mapping of protein sequence-function relationships. Nat. Methods 7, 741–746 (2010). 15. Kunkel, T. A. Rapid and efficient site-specific mutagenesis without phenotypic selection. Proc. Natl. Acad. Sci. 82, 488–492 (1985). 16. Chan, S.-H., Stoddard, B. L. & Xu, S. Natural and engineered nicking endonucleases - from cleavage mechanism to engineering of strand-specificity. Nucleic Acids Res. 39, 1–18 (2011). 17. Heiter, D. F., Lunnen, K. D. & Wilson, G. G. Site-Specific DNA-nicking Mutants of the Heterodimeric Restriction Endonuclease R.BbvCI. J. Mol. Biol. 348, 631–640 (2005). 18. Bienick, M. S. et al. The Interrelationship between Promoter Strength, Gene Expression, and Growth Rate. PLoS One 9, e109105 (2014). 19. Fowler, D. M., Araya, C. L., Gerard, W. & Fields, S. Enrich: Software for analysis of protein function by enrichment and depletion of variants. Bioinformatics 27, 3430–3431 (2011). 20. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010). 21. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078– 2079 (2009). 22. R Core Team, R. R: A language and environment for statistical computing. (2015). at 23. Doolan, K. M. & Colby, D. W. Conformation-dependent epitopes recognized by prion protein antibodies probed using mutational scanning and deep sequencing. J. Mol. Biol. 427, 328–340 (2015). 24. Kowalsky, C. A. & Whitehead, T. A. Determination of binding affinity upon mutation for type I dockerin-cohesin complexes from Clostridium thermocellum and Clostridium cellulolyticum using deep sequencing. Proteins (2016). 53 CHAPTER THREE Exploring the sequence-determinants to specificity of an enzyme using deep mutational scanning This chapter is adapted with permission from “Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded” in Nature Communications 8 (2017) 15695 by Emily E. Wrenbeck, Laura R. Azouz, and Timothy A. Whitehead. 54 ABSTRACT Our lack of total understanding of the intricacies of how enzymes behave has constrained our ability to robustly engineer substrate specificity. Furthermore, the mechanisms of natural evolution leading to improved or novel substrate specificities are not wholly defined. Here we generate near-comprehensive single-mutation fitness landscapes comprising >96.3% of all possible single nonsynonymous mutations for hydrolysis activity of an amidase expressed in E. coli with three different substrates. For all three selections, we find that the distribution of beneficial mutations can be described as exponential, supporting a current hypothesis for adaptive molecular evolution. Beneficial mutations in one selection have essentially no correlation with fitness for other selections and are dispersed throughout the protein sequence and structure. Our results further demonstrate the dependence of local fitness landscapes on substrate identity and provide an example of globally distributed sequence-specificity determinants for an enzyme. 55 INTRODUCTION Understanding the sequence determinants to substrate specificity for enzymes is a significant challenge in protein science that impacts fields as diverse as evolutionary biology and biocatalysis1,2. The dynamic relationship between protein structure and function makes it difficult to predict perturbations to the primary sequence that will improve or alter activity for a given substrate2. More fundamental concerns relate the nature of protein fitness landscapes to a biophysical basis underlying molecular evolution and adaptation3,4. What is the distribution of fitness effects (DFE) for mutations, and do they correspond with existing theory of adaptation5–7? Are the DFE of mutations correlated between substrates8? Are specificity-modulating mutations correlated to bulk properties of enzymes (e.g. distance to active site)? Over the past 20 years directed evolution experiments have provided a number of insights to the above questions9–11. For engineering enzyme specificity, it has been shown that a rational mutagenesis approach – primarily focused on residues lining a substrate binding pocket - provides greater payoffs than random mutagenesis (i.e. error-prone PCR)1,12,13. However, it is no secret that distant (>10 Å) mutations can have significant effects on catalytic function13–20. For example, in a classic paper Oue et al. evolved the specificity of an aspartate aminotransferase to valine and found only one mutation in direct contact with the substrate out of seventeen accumulated in the final construct20. However, the spatial distribution of specificity-modulating substitutions is still unclear, as typical experiments assay the effects of less than 100 mutations. Large scale mutational studies, such as deep mutational scanning to generate local fitness landscapes21,22, provide a more comprehensive purview and can potentially be used to resolve the above open questions23. From the protein engineer’s perspective, the ability to predict fitness effects would greatly improve the discovery rate of beneficial mutations. In recent years, theoretical work on adaptive 56 molecular evolution has experienced a revolution with the availability of new experimental tools. Recognizing the rare nature of beneficial mutations, Gillespie24 borrowed extreme value theory mathematics to predict that the distribution of fitness effects (DFE) for beneficial mutations, drawn from the extreme tail of DFEs, would be of the Gumbel or ‘typical’ type (exponential, gamma, Weibull, etc.). Orr later proposed that beneficial mutations from a high fitness parent should be roughly exponentially distributed6. While generally providing support for these theories, discerning the parameterization of a mathematical model from experimental data has yielded mixed conclusions as summarized by Orr25. To explore the question of how enzymes encode specificity and scrutinize adaptive molecular evolution theory, we evaluate the sequence determinants to substrate specificity for an enzyme by generating comprehensive single-mutation fitness landscapes – the effects of all possible single point mutations - on multiple substrates. As a model system we use the aliphatic amide hydrolase encoded by amiE from Pseudomonas aeruginosa26 because the structure is solved27, amidases are an industrially-relevant class of enzymes28,29, and amiE has activity against multiple substrates. In particular, amiE maintains comparatively higher activity on acetamide and propionamide compared with the bulkier isobutyramide. Thus, our experimental system allows comparison of adaptation between similar and structurally dissimilar substrates. 57 RESULTS Local Fitness Landscapes of amiE on Multiple Substrates We first developed growth selections for three short-chain aliphatic amides: acetamide (ACT), propionamide (PR), and isobutyramide (IB) (Figure 3.1), such that only E. coli cells harboring a functional amiE gene product can grow when an amide is provided as the sole nitrogen source in selective minimal growth media30,31. Following passive diffusion into cells, amiE catalyzes hydrolysis of the amide to its corresponding carboxylic acid, liberating ammonium (a bioavailable nitrogen source). To allow variants supporting higher ammonium flux to become enriched in the population relative to wild-type, we tuned amiE expression levels by screening Figure 3.1: Experimental overview. Growth selections for acetamide (ACT), propionamide (PR), and isobutyramide (IB) were established. Amides passively diffuse into host cells harboring amiE variants that produce ammonia necessary for cell growth. Comprehensive site-saturation mutagenesis libraries of amiE were made and selected in media containing an amide as the sole nitrogen source. The pre- and post-selection populations from each selection were deep sequenced and each variant was assigned a fitness metric (!) value. synthetic, insulated constitutive promoters32,33 such that the specific growth rate in selection media relative to that in defined minimal media (µS,wt/µM9,wt) is 0.4-0.634. Promoter proK14 with the high translational efficiency RBS from gene 10 of T7 bacteriophage (t7RBS) had a suitable µS,wt/µM9,wt at 0.54 ± 0.11 for IB selection media (plasmid pEDA6_amiE, Figure 3.2 and Table B 3.1). However, the weakest promoter of the set, proK17, had a µS,wt/µM9,wt of 0.92 ± 0.05 for ACT. 58 Figure 3.2: Establishing growth-based selection conditions. amiE expression was tuned using promoter and RBS engineering. µS,wt/µM9.wt is the ratio of growth rate of wild-type amiE harboring cells in selection media (µS,wt) to M9 minimal media (µM9,wt). Error bars represent 1 s.d. of n = 4, 12, 10, and 12 independent measurements. Inset represents the final promoter/RBS combination used for each selection. To further decrease protein expression, plasmids containing an altered RBS were tested. One construct containing promoter proK17 and a knockdown RBS 3 (kRBS3) sequence had a µS,wt/µM9,wt of 0.56 ± 0.06 for ACT and 0.37 ± 0.08 for PR (plasmid pEDA2_amiE, Figure 3.2 and Table B 3.1). A significant concern with this growth selection is the potential for cells containing a nonfunctional enzyme variant to propagate in a population by acquiring ammonium that has leaked into the extracellular medium. We assessed the risk of such ‘cheating’ by competing wild-type amiE on the ampicillin-resistant expression construct described above (pEDA2_amiE) against a catalytic knockout, amiE_C166S, with kanamycin-resistance on an otherwise identical expression construct (pEDK2_amiE_C166S), in acetamide selection medium containing no antibiotics. E. coli cells harboring either pEDA2_amiE or pEDK2_amiE_C166S were mixed in equal proportion 59 and competed for 4.12 and 4.17 generations for replicates 1 and 2, respectively. Cells from the pre- and post-selection populations were dilution plated on ampicillin and kanamycin containing plates, and the resulting colonies counted to calculate the frequency of each member in the preand post-selection populations. Using the fitness equations laid out in Kowalsky et al.34, the average fitness metric for the C166S mutant was -2.40 ± 0.9 (n = 2), close to the fitness metric expected if no cheating occurred (-2.46). Thus, we conclude that non-functional variants minimally propagate under the conditions of the selection. Next, we used PFunkel mutagenesis35 to construct comprehensive single-site saturation mutagenesis amiE libraries and transformed them into E. coli MG1655 rph+. We carried out growth selections for each of the three substrates for approximately 8 generations, starting with an initial population size of >6x106 cells. Deep sequencing of the pre- and post-selection populations was used to determine a relative fitness metric (!i) for each amiE variant i, defined as34: !" = log ' (),+ (),,- ..................................................................(1) The pre-selection populations were comprised of >51.8% single nonsynonymous mutations and represented >96.3% of the 6820 possible single nonsynonymous mutations for all libraries (Table B 3.2, Figure B 3.1). Given that the read counts per variant in the pre-selection population were log-normally distributed (Figure B 3.1) and underrepresented variants could show biased fitness metrics, we calculated Pearson’s product moment correlation coefficients for pre-selection read counts and fitness and found them to be to 0.047, 0.033, and -0.064 for the ACT, PR and IB selections, respectively (Figure B 3.2). This confirmed that resulting fitness metrics were not biased by a wide distribution of pre-selection read counts. Furthermore, we determined a lower bound fitness metric for each selection that can be discriminated based on depth of sequencing coverage, such that while below this value fitness effects can be described 60 categorically as ‘deleterious’, the quantitative effect cannot be reliably predicted. The lower bounds were found to be -1.3, -0.8, and -0.6 for the ACT, PR, and IB selections, respectively. Heat map representations of the local fitness landscapes for each selection can be found in Figures B 3.3-3.5. We tested the validity of using deep sequencing to reconstruct fitness in multiple ways. First, we performed replicate growth selections using the same pre-selection library. The resulting two post-selection libraries were prepared for sequencing in parallel attaching unique Illumina barcodes to each, and normalized fitness metrics were calculated for each replicate. To assess whether the selection results were reproducible we calculated the Pearson product moment correlation coefficients of fitness metrics between replicates and found them to be 0.661, 0.842, and 0.889 for the ACT, PR, and IB selections, respectively (P < 2.2 x10-308, n = 6627, 6630, and 6569). When we excluded variants with fitness metrics below the lower bounds the correlation coefficients improved to 0.932, 0.949, and 0.943 for the ACT, PR, and IB selections, respectively (P < 2.2 x10-308, n = 3834, 2954, 4977, Figure 3.3a and Figure B 3.6). Figure 3.3: Validation of deep sequencing results. A.) Fitness metrics from replicate growth selections in the propionamide selection (n = 2954). Red lines indicate two standard deviations from theoretical error estimation34. The reported p-value for the Pearson’s product moment correlation coefficient was calculated using a two-tailed t-test. B.) Comparison of relative growth rates calculated from the selection experiments (grey bars) and isogenic growth rate assays (colored bars). Error bars represent 1 s.d. of at least four independent measurements. 61 Second, we compared relative isogenic growth rates (µS,i/µS,wt) to deep sequencingcalculated growth rates for a set of mutations (Figure 3.3b, Table B 3.3). Deep-sequencing derived fitness corresponded to increased growth rates for 16/17 beneficial mutations, near wildtype growth rates for 2/2 neutral mutations, and no growth for 1/1 deleterious mutation tested. To confirm that improved growth rates were a result of increased flux through amiE, we performed lysate activity assays for a subset of these variants and found that all samples save one improved flux relative to wild-type (Table B 3.3). The Distribution of Fitness Effects (DFE) of amiE The DFE, both at the organismal and protein level, demarcates evolution5,7,36. Specifically, the DFE for a protein is related to its evolvability: the number and type of available beneficial mutations for a new function compared with effects on existing functions is illustrative of how natural proteins evolve. While theoretical and experimental work has advanced our understanding of the available pathways for adaptive molecular evolution4,6,37–44, the exact form of the distribution, which determines these pathways, is still a subject of debate. Figure 3.4a shows the DFE for the three selections. For each, nonsense mutations had a median fitness metric below the detection limit of the deep sequencing method. Nonsense mutations with increased fitness metrics (!>0.15) cluster in the last 19 residues of the C-terminus, a relatively unstructured region likely to have no influence on catalytic activity, suggesting that translation of these residues plus the Cterminal His6-tag used for purification is deleterious to fitness. Missense mutations were on average deleterious for the ACT and PR selections, with 75.8 and 74.2% of variants yielding at least 20% reduction in growth rate relative to wild-type, respectively. By contrast, only 45.4% fell below this threshold for the IB selection. 62 Figure 3.4: Distribution of fitness effects (DFE) are exponentially distributed for beneficial mutations. A.) The DFE of missense and nonsense mutations for ACT (cranberry), PR (orange), and IB (cyan) growth selections. The dashed vertical line demarcates the wild-type fitness metric (!=0.0). B.) DFE for beneficial mutations identified in the ACT, PR, and IB selections, respectively. Overlain curves are best-fit exponential distributions estimated from the data. Remarkably, 21.5% (n = 1394) of missense mutations had above wild-type fitness metrics for the IB selection, with 483 (7.5%) variants having at least 10% increased growth rate (!>0.15). There were appreciably less enhanced variants found in the ACT and PR selections, with 4.7 and 5.1% (n = 306 and 328) having fitness metrics above wild-type, respectively. Modern theories of adaptive molecular evolution predict the DFE for beneficial mutations is scale-free and exponentially distributed6,24. However, the available experimental data is 63 conflicted25, and most studies have low statistical power due to the rare nature of beneficial mutations. While synthetically constructed, our competitive growth selection results yield fitness metrics for a large effective population size, and the hundreds of beneficial mutations observed provides high statistical power for model fitting. Predictions of beneficial DFE are derived from extreme value theory that describes many distributions falling under the umbrella of the generalized Pareto distribution (GPD)24. GPD includes three domains of attraction defined by their shape parameter ("): Gumbel ("=0), Fréchet (">0), and Weibull ("<0). We first performed bootstrap goodness of fit tests to a GPD and concluded a failure to reject the null hypothesis that the datasets belonged to a GPD (P < 0.066) and estimated " to be -0.292, -0.309, and -0.195 for the ACT, PR, and IB datasets, respectively. This finding indicates that the tail behavior for the observed beneficial DFE for amiE is slightly truncated, yet our results are consistent with the predictions of Orr that if departures from the Gumbel domain are observed they will be minimal (-1/2<"<1/2)6. We next conducted log-likelihood ratio tests for fitted exponential distributions (null hypothesis) against fitted gamma and Weibull distributions (alternative hypotheses) for the DFE of beneficial mutations (Figure 3.4b). These alternative models were chosen as previous empirical studies have observed tail behavior indicative of these such distributions40,45,46. We concluded a failure to reject the null hypothesis for the IB dataset, yet found that the ACT dataset best fit a Weibull distribution (P = 0.05) and that gamma and Weibull were both better fits for the PR dataset (P = 0.023 and 0.039, respectively, Table 3.1). Interestingly, one-sample Anderson-Darling tests for goodness-of-fit to each distribution indicated a failure to reject the null hypothesis that the data fit any of the distributions (Table 3.1). To assess the null hypothesis that the three data sets came from a single, statistically indistinguishable distribution, we performed a k-sample Anderson- 64 Darling test and concluded they were not from a single distribution (P = 0.0124). Thus, all datasets can be described as exponentially distributed, though the ACT and PR datasets best fit the higher parameter models. Table 3.1: Model fitting results for distribution of beneficial mutations Exponential Gamma Weibull rate A-D test P-value LL ACT 8.72 0.527 365.9 PR 6.67 0.338 303.2 IB 7.26 0.635 1382.6 shape rate A-D test P-value LL 1.14 9.90 0.834 367.4 1.17 7.81 0.459 305.8 1.03 7.46 0.723 1382.9 shape scale A-D test P-value LL 1.094 0.119 0.851 367.8 1.094 0.155 0.451 305.3 1.012 0.138 0.712 1382.8 3.11 0.078 3.8 0.050 5.14 0.023 4.3 0.039 0.62 0.43 0.31 0.58 Log-likelihood ratio tests H0 Exponential Exponential HA Gamma P-value Weibull P-value Beneficial Mutations result from Protein, not mRNA, effects We addressed whether effects at the mRNA level could explain beneficial mutations, as variants can achieve higher fitness by increasing total active amiE concentration through improvements to the rate of transcription, the degradation rate of mRNA, and the efficiency of translation. The fitness metrics of synonymous codons for beneficial mutations (!>0.15) showed 65 low variance in most cases except near the N-terminus (Figure B 3.7). A recent mRNA model47 could explain up to 5% of the variance in the first 15 residues but only 0.2% of the variance over the entire sequence length (Table B 3.4). We conclude that the observed fitness effects are the result of changes at the protein level, not at the mRNA level. Comparison of DFE Between Selections Promiscuous activity of enzymes is believed to be the driving force of evolution towards new activities3. Our fitness maps allow us to address the question of how mutations impact fitness in multiple substrate backgrounds. At the outset of this work, we anticipated that the majority of ‘hits’ or beneficial mutations would be shared across selections. This null hypothesis is grounded in the biophysical argument that most beneficial mutations would improve protein expression, not activity, and these would be beneficial regardless of the substrate selected on. Additionally, we anticipated that the pool of mutations available for improving activity for a single substrate would predominately localize to the vicinity of the active site, thus rendering few specificity-altering mutations. Consequences of this prediction are that there should be significant correlation of fitness between amides, with specificity-determining mutations localized near the active site. We first assessed whether there was a significant correlation of fitness between different amides (Figure 3.5a). Correlation for the ACT and PR selections (r = 0.827, P < 2.2x10-308) was notably higher than that for the IB and ACT (r = 0.317, P = 8.6 x 10-85) or IB and PR selections (r = 0.367, P = 6.7 x 10-95). PCA revealed that a single principal component could explain 96.8% and 87.8% of the variance of the ACT and PR datasets, respectively, while two principal components are sufficient to explain over 99% of the variance for the IB dataset (Figure 3.5b, Figure B 3.8). These results are inconsistent with our null hypothesis, pointing towards global alterations in the 66 protein structure to adapt to different substrates. Of note, the fitness data is non-normal and could be analyzed with other multivariate statistical analysis methods. Figure 3.5: Correlative analysis of fitness effects. A.) Correlation of variant fitness metrics between selections. Variants with fitness metrics above the lower bounds are compared between each selection. Plots represents n = 3054, 3600, and 2959 points for panels ACT vs PR, ACT vs. IB, and PR vs. IB, respectively. The reported p-value for the Pearson’s product moment correlation coefficient was calculated using a two-tailed t-test. B.) Principal component analysis of substratespecific fitness effects. Black dots show common neutral and deleterious mutations, while substrate-specific beneficial mutations (!>0.15) are colored according to 7 bins. C.) 3-way Venn diagram representing 7 specificity bins. Restricting our correlative analysis to only beneficial mutations (!>0.15) revealed that fitness-enhancing mutations for ACT were, on average, likely to be beneficial for PR (mean ! = 0.236). By contrast, beneficial mutations for IB were likely to be deleterious in both the ACT and 67 PR selections (mean ! = -0.480 and -0.319). This result is consistent with the findings of Stiffler et al. that beneficial mutations for a new or less evolved function are likely to be deleterious for existing functions when the selections pressures are high. Furthermore, IB-beneficial variants showed essentially no correlation for fitness in the ACT and PR selections (r = 0.0617 and 0.164, respectively). This finding indicates that, at least for amiE, predicting hits based on known fitness effects for a given substrate cannot be accomplished through correlative analysis. We next analyzed the relationship between beneficial (!>0.15) and specificity-determining (!>0.15 for one amide and !<0 for the other two substrates) mutations and their distance to the catalytic active site. Distance was measured by the minimum distance from the alpha-carbon of positions with beneficial mutations to any active site atom (six identical active sites in the functional homohexamer). The mutations were placed in 3 Å bins that were normalized to total available mutations in each distance shell. For beneficial mutations, we found that most were >15 Å from the active site for the ACT and PR variants, while the IB variants were mostly 9-21 Å away (Figure 3.6a). Strikingly, we found very few specificity-determining mutations for the ACT and PR selections (n = 6 and 14, respectively), with variants distanced by 6-15 Å for ACT and >14 Å for PR (Figure 3.6b). By contrast, we found 395 specificity-determining mutations for IB, which were distributed similarly to the set of all IB-beneficial mutations. Thus, beneficial and specificity-determinant positions are globally dispersed throughout the primary sequence and structure of amiE. Biophysical Characterization of Beneficial Mutations To understand the biophysical basis underlying beneficial mutations, we expressed, purified, and characterized a set of 11 variants chosen in part on their ability to predict larger sets of beneficial 68 variants (Table 3.2, Supplementary Fig. 9). For example, globally beneficial mutation S9A was chosen because it could potentially explain other N-terminal beneficial mutations. For all variants save one (see Methods), apparent melting temperatures (Tm,app) were within 7°C of the wild-type Tm,app of 67.7 ± 0.1°C, indicating that differences in thermal stability are unlikely to explain in vivo beneficial fitness effects. Figure 3.6: Substrate specificity is globally encoded. A-B.) Frequency of beneficial (A) and specificity-determining (B) mutations as a function of distance to active site. C.) Beneficial mutations for all selections and specificity-determinant mutations for ACT and PR selections mapped onto the structure of amiE. The inset illustrates the catalytic active site. D.) Specificitydeterminant mutations for IB selection colored by number found at given position. To evaluate commonalities between beneficial mutations, we sorted variants into seven possible bins for beneficial fitness metrics (!>0.15 for given selection(s) and !<0.15 in other 69 selection(s), Figure 3.5c). 21 of 26 beneficial mutations common to all three selections were found at extreme N- or C-terminal residues. Of the remaining five, we characterized R89E, a surface mutation located over 20 Å away from the active site that yielded an increase in relative kcat/Km of 1.96 ± 0.59 and 1.42 ± 0.42 for PR and IB substrates, respectively (Figure 3.6c). Alternatively, shared N-terminal mutation S9A had slightly reduced relative kcat/Km. Thus, even for a highly stable protein like amiE, we found few mutations like R89E that can generally increase kcat or KM and increase fitness. Beneficial mutations shared in two of the three selections were scarce. 18/29 mutations shared between ACT and PR cluster at the extreme N- or C- termini. The 17 PR+IB specific mutations cluster at Q273, a 2nd shell residue that buttresses W138 at the active site, and at M202 located 14 Å to the active site. Variant M202H showed over 2.5-fold increase in relative kcat/Km for IB and PR, but Q273A did not show increased catalytic efficiency in vitro. We speculate the conditions required by the enzyme assay for sensitive ammonia detection prohibited the recapitulation of in vivo kinetics. Four ACT-specific mutations encoded smaller substitutions (A/C/S/V) at position L119, a residue that supplies hydrophobic packing behind the catalytic nucleophile C166 10 Å from the active site (Figure 3.6c). L119A showed a 2.2 ± 0.1–fold increase in kcat relative to wild-type with a compensatory increase in KM. In stark contrast to ACT, there were 435 IB-specific and 395 specificity-determining mutations for IB distributed throughout the protein structure (Figure 3.6d). Substitution W138A/G decreases van der Waals area in the vicinity of the amide transition state, allowing accommodation of the bulky isobutyrl group. However, most specificity-altering mutations were located far from the active site. Hot spots of positions where 5 or more specificity-determining mutations confer 70 Table 3.2: Wild-type and variant amiE biophysical data. Variant ACT ζ PR ζ IB ζ Wildtype 0.00 0.00 0.00 S9A 0.33 0.41 0.36 A28R 0.27 0.10 0.11 R89E 0.30 0.34 L119A Km (mM) / Km,wt (mM) kcat (s-1) / kcat,wt (s-1) kcat/Km (M-1 s-1) / kcat,wt/Km,wt (M-1 s-1) 4.7 ± 0.5 59.0 ± 2.0 52.7 ± 8.3 144.7 ± 9.9 297.2 ± 54.8 13.3 ± 1.1 Tm,app (°C) 67.7 ± 0.1 2.3 ± 0.6 4.4 ± 1.1 0.5 ± 0.1 nd 0.6 ± 0.1 1.6 ± 0.2 0.4 ± 0.0 nd 0.28 ± 0.1 0.36 ± 0.13 0.76 ± 0.27 nd 0.15 nd 0.7 ± 0.1 0.8 ± 0.2 nd 1.4 ± 0.1 1.1 ± 0.1 nd 1.96 ± 0.59 1.42 ± 0.42 66.7 ± 0.1 0.30 -0.80 -0.60 2.8 ± 0.4 not active 2.5 ± 0.6 2.2 ± 0.1 not active 1.3 ± 0.2 0.8 ± 0.16 not active 0.52 ± 0.18 67.4 ± 0.2 I165C 0.25 -0.27 -0.32 2.6 ± 0.4 0.7 ± 0.2 0.7 ± 0.2 1.7 ± 0.1 0.6 ± 0.1 0.8 ± 0.1 0.66 ± 0.14 0.79 ± 0.24 1.22 ± 0.65 65.7 ± 0.1 V201M 0.37 0.12 0.22 10.2 ± 2.0 0.8 ± 0.4 4.6 ±1.3 1.1 ± 0.1 0.1 ± 0.0 0.8 ± 0.1 0.11 ± 0.03 0.13 ± 0.10 0.18 ± 0.08 61.2 ± 0.1 V201T 0.20 0.34 0.25 1.6 ± 0.3 nd 3.8 ± 0.8 1.0 ± 0.1 nd 1.4 ± 0.1 0.65 ± 0.16 nd 0.37 ± 0.11 64.9 ± 0.2 M202H -0.08 0.16 0.43 M203W -1.30 -0.80 0.43 0.9 ± 0.1 0.7 ± 0.1 0.4 ± 0.1 nd 1.3 ± 0.0 1.8 ± 0.1 1.4 ± 0.1 nd 1.4 ± 0.23 2.55 ± 0.72 3.08 ± 1.05 nd A234M 0.33 0.15 0.21 2.8 ± 0.4 0.5 ± 0.2 3.6 ± 0.9 1.0 ± 0.0 0.3 ± 0.0 0.8 ± 0.1 0.35 ± 0.07 0.58 ± 0.32 0.23 ± 0.08 64.4 ± 0.1 Q273A -0.71 0.31 0.23 4.9 ± 2.0 2.3 ± 0.6 0.6 ± 0.2 0.2 ± 0.0 0.5 ± 0.1 0.4 ± 0.0 0.05 ± 0.03 0.20 ± 0.08 0.71 ± 0.28 69.7 ± 1.1 71 63.1 ± 0.1 67.7 ± 0.1 63.0 ± 0.1 nd increased fitness occur at the N- and C-terminus (residues H3, S7, T323, R324, T327, V329, and C332-V334), as well as P50, C139, I174, A196, K197, V201, M202, W209, N212, F223, S228, G247-E249, G252, Q271, Q273-H275, and Y284. Interestingly, hot spot positions 197 through 212 are located on an alpha helix located at least 12 Å from the active site that contacts the dimeric interface. As these mutations do not benefit all substrates, we hypothesize that mutations at these positions cause rigid body motion of the helix to yield subtle geometric rearrangement, if not largescale disruption, of the active site that favors IB catalysis. We tested V201M/T for activity on IB and, contrary to expectations, found a decrease in relative kcat/Km. We speculate that the mismatch between expected and measured catalytic efficiency results from hexamer dissociation caused by the low enzyme concentration required by the activity assay, as the lysate assays showed an increase in velocity for the V201T mutant. DISCUSSION In this contribution, we generated single-mutation protein fitness landscapes for an amidase on three different substrates. In contrast to studying protein-protein interactions, the application of deep mutational scanning to enzymes has been limited by the difficulty in developing generalizable high-throughput functional assays, as the nature of enzyme function is highly diverse. Regardless, exhaustive mutational studies permit a glimpse into how natural enzymes evolve for new functions. Our results show that, at least for amiE, mutations which are beneficial for only one substrate are 1.) not confined to vicinity of the active site and 2.) cannot be predicted based on known fitness for another substrate. In terms of predicting fitness, we conclude that single-mutation fitness landscapes are highly substrate dependent, which is consistent with previous works8,38,48,49. However, this work 72 provides a unique perspective of comparing two structurally similar substrates, ACT and PR, to the dissimilar IB substrate. Not surprisingly, we found the IB single-mutation fitness landscapes to be the most divergent, signaling that at the biophysical level the requirement to accommodate the larger IB substrate significantly alters the mutational landscape. The percentage of beneficial mutations observed is consistent with previous deep mutational scanning experiments on enzymes8,37,38,50. These rates are significantly larger than that predicted for the natural adaptation of organisms51 because in deep mutational scanning experiments a strong selection is imposed upon a gene that is, by experimental design, intended to influence only a single phenotypic trait4. This intention to mitigate pleiotropy is especially true with bulk growth competition experiments, and it should be noted that randomly drifting populations contain genes that do not ascribe to such constraints. Our results strengthen the theoretical case that fitness for beneficial mutations is approximately exponentially distributed even though the percentage of beneficial mutations differs substantially between substrates. We note that this exponential distribution holds even for the IB selection which presumably causes large-scale rearrangements of the active site to allow better access to the branched chain IB. Other studies have explored the mechanics of multiple steps and epistasis39,42,43,52,53. In this work, we considered only single steps in the local fitness landscape. Thus, the generality of our observations for multiple steps remain to be seen. For the design and engineering of substrate promiscuous or specific biocatalysts, knowledge of the sequence and spatial distribution of ‘hits’ is imperative. Our findings indicate that, at least for amiE, most substrate-determining mutations for new functions, in this case IB, localize approximately 9-24 Å from the active site. Together, these results have strong implications for design and engineering of substrate promiscuous biocatalysts because it suggests current 73 strategies of iterative site saturation mutagenesis near the active site are sub-optimal1. Additionally, computational design algorithms focused solely on the modifying 1st and 2nd shell mutations around the active site need to be revisited. MATERIALS AND METHODS Reagents All chemicals were purchased from Sigma-Aldrich unless specified otherwise. All primers and mutagenic oligonucleotides were designed using the Agilent QuikChange Primer Design Program (www.agilent.com) and were ordered from Integrated DNA Technologies. Propionamide and isobutyramide solids were recrystallized from ethyl acetate and water, respectively. Plasmid construction pEDA6_amiE was renamed from pJK_proK14_amiE as described in Bienick et al.33. pEDA2_amiE was constructed by Kunkel mutagenesis54 of pJK_proK17_amiE from Bienick et al.33 to introduce a knockdown ribosome binding sequence (primer kRBS3). Protein expression constructs were made by subcloning the amiE gene from pEDA2_amiE into the pET-29b(+) (Novagen) backbone at the NdeI and XhoI restriction sites following standard protocols. amiE point mutants were created using Kunkel mutagenesis54. Primer sequences used in this work are listed in Table B 3.5. Construction of mutagenesis libraries Eight comprehensive single-site saturation mutagenesis libraries of amiE were constructed (residues 1-85, 86-170, 171-255, and 256-341 on plasmids pEDA2_amiE and pEDA6_amiE) 74 using PFunkel mutagenesis35 with modifications as noted34. Library cell stocks of the selection strain, E. coli MG1655 rph+ [F- #-] (Coli Genetic Stock Center, #7925), were made essentially as described in Klesmith et al.50. Growth selections Starter cultures for growth selection were prepared as in Klesmith et al.50, except 1X M9 minimal media lacking ammonium chloride (M9 N-) was used to wash cell pellets prior to inoculation of selection media. 3 mL of selection media (M9 N- + 10 mM acetamide, M9 N- + 15 mM propionamide, and M9 N- + 10 mM isobutyramide for ACT, PR, and IB selections, respectively) was inoculated to an initial OD600 of 0.02 at a volume of 3 mL (>6x106 cells). To ensure exponential growth during the entire selection experiment, after approximately four generations the cells were harvested, washed with M9 N-, and a fresh 3 mL culture with selection media was inoculated to the same initial OD600 of 0.02 (to maintain the same initial population size). Growth selections were carried out and samples preserved for sequencing as described in Klesmith et al.50. Replicates were performed using the same pre-selection population. Based on the high correlation between replicates and the fact that a major source of error in deep sequencing measurements are counting errors (Poisson noise)34, the fitness metrics used in subsequent analysis were computed by combining reads from the two replicates and repeating the analysis. Sequencing Libraries were amplified, barcoded, cleaned, and quantified following Method B as described in Kowalsky et al.34. Gene amplification primers are listed in Table B 3.5. Pre- and post-selection samples were pooled and sequenced with 300bp PE reads on an Illumina MiSeq available at the 75 Michigan State University sequencing core. Deep sequencing data was analyzed using Enrich software55 with modifications as noted in Kowalsky et al.34 and scripts freely available at Github (https://github.com/JKlesmith/Deep_Sequencing_Analysis). Normalized fitness metrics for each variant, !i, were determined according to the ‘Normalization for Growth Rate Selections’ section as outlined in Kowalsky et al.34. Briefly, deep sequencing was used to count each library member in the pre- and post-selection populations. For each single nonsynonymous mutation and wild-type an enrichment ratio was calculated by: 2" = log ' 34+ 35+ ....................................................................................(2) Where ffi and foi represent the frequency of member i in the final (post-selection) and initial (pre-selection) populations. Normalized fitness metrics were calculated using the following equation: !" = log ' 6+ 9: 78 6;< 9: 78 (3) Where $i is enrichment ratio for variant i, gp is the number of population doublings, and $WT is the enrichment ratio for wild-type. Beneficial mutations and lower bounds for fitness metrics A beneficial mutation was defined as having at least 10% increase in growth rate (!>0.15) relative to wild-type. Weighted means for synonymous codon fitness metrics, where the weights were read counts (depth of coverage) for each mutation, were calculated to be 0.03 ± 0.09, -0.02 ± 0.11, and -0.01 ± 0.07 for the ACT, PR, and IB datasets, respectively. A fitness metric of 0.15 was found to be in >90% percentile for all three datasets. 76 To determine lower-bound fitness metrics for each selection, we first determined the halfmedian of read counts of the pre-selection library for each selection (63, 49, and 31 for the ACT, PR, and IB selections, respectively). This number was normalized by the ratio of post- to preselection read counts (2.97, 1.86, and 2.04 for ACT, PR, and IB selections, respectively). Next, a lower-bound enrichment ratio ($LB) based on 10 read counts in the post-selection population was calculated: 2=> = log ' :? (3) 3@A Where fLB represents the normalized half-median pre-selection reads. The lower-bound fitness metrics, !LB, was then calculated using 8 population doublings (gp) and the wild-type enrichment ratio ($WT): !=> = log ' 6@A 9: 78 6;< 9: 78 (4) Distribution fitting of beneficial DFE Distribution fitting analysis was conducted using R statistical software56. Bootstrap goodness of fit and parameter estimation for the generalized Pareto distribution were done using the package gPdtest57. Model parameters were approximated and log likelihood values were determined using maximum likelihood estimation with package fitdistrplus58. Anderson-Darling tests were performed using the package kSamples59. Log-likelihood (LL) ratios were calculated as 2*[(LL HA) – (LL H0)], where H0 = null hypothesis and HA = alternative hypothesis. P-values were computed from a chi-squared distribution with one degree of freedom. 77 Protein characterization Wild-type and variant amiE protein was expressed using Studier auto-induction60 and purified according to Klesmith et al.50. The eluate was buffer exchanged into PBS buffer, pH 7.5 using GE disposable PD-10 desalting columns (GE Healthcare). Purified protein was stored in PBS at 4°C. Wild-type and variant amiE melting temperatures were measured using a SYPRO Orange thermalshift assay61,62 as described in Klesmith et al.50, but in PBS buffer, pH 7.5. Catalytic parameters (Km and kcat) were assayed at 37°C in PBS buffer, pH 7.5 using a phenol and hypochlorite ammonia detection assay63. PCR plates containing 100 µL of 7 different concentrations of amide (highest concentrations were 40, 150, and 800 mM for ACT, PR, and IB activity assays, respectively, with 1:2 serial dilutions for remaining substrate concentrations) in PBS were incubated on a thermocyler block (Eppendorf) with the lid open at 37°C for 5 minutes. To begin the assay, 20 µL of 0.02 µM (ACT and PR assays) or 0.2 µM (IB assays) enzyme was added. At discrete time points, 100 µL of the reaction was removed and quenched by depositing into a clear 96-well plate containing 50 µL phenol nitroprusside solution held on ice. At the end of the last time point, 50 µL alkaline hypochlorite solution was added to all wells and the plate was covered and incubated in a metal bead heat bath for 10 minutes at 35°C. The plate was then transferred to a Synergy H1 spectrophotometer (BioTek) held at 35°C and A625 was measured every minute for 15 minutes. Non-linear regression was performed using GraphPad Prism version 6 for Mac OS X, GraphPad Software, La Jolla California USA, www.graphpad.com. All measurements were performed at least in duplicate. The IB specific variant M203W shows increased fitness in the deep sequencing selection but decreased lysate activity compared with wild type. M203W immediately precipitated out when we tried to purify this enzyme. Thus, for this case, lysate activity would not be representative of in vivo conditions. For PR variants the coefficient of variation for wild-type was 78 prohibitively high to calculate statistically significant ratios; note the variance of the other wildtypes measurements. Isogenic growth and lysate flux assays Starter cultures were prepared by inoculating 2 mL of M9 minimal media + carbenicillin (50 µg/mL) with scrapings of MG1655 rph+ cell stocks harboring pEDA2_amiE or pEDA6_amiE variant plasmids and grown overnight at 37°C with 250 rpm shaking. In the morning, cells were pelleted, washed twice with M9 N-, and resuspended in 1 mL M9 N-. 3 mL of selection media + carbenicillin (50 µg/mL) in Hungate tubes was inoculated to a final OD600 of 0.02. Cultures were grown at 37°C with shaking at 250 rpm. For growth assays, OD600 was measured every 30-45 minutes until a final OD600 of approximately 0.5 was reached. All growth rate measurements represent at least 4 biological replicates collected on at least 2 separate dates. Lysate flux assays were adapted from Bienick et al.33. 2 mL of exponential phase culture (OD600 of approximately 0.15-0.3) was spun down at 15,000xg for 5 minutes. Cell pellets were washed twice and resuspended with 1 mL PBS, pH 7.5. Cells were lysed as described in Bienick et al.33. 0.5-0.9 mL of lysate was used in a 1 mL total volume assay containing 10 mM, 15 mM, or 10 mM acetamide, propionamide, or isobutyramide, respectively. The assay was conducted at 37°C. Every 5 minutes, 100 µL of the assay volume was removed and added to a 96-well plate containing 50 µL prechilled phenol nitroprusside. At the end of the last time point, 50 µL of alkaline hypochlorite was added to all wells. Absorbance at 625 nm was measured as in Bienick et al.33. 79 Data Availability Full datasets including normalized fitness metrics, pre- and post-selection read counts, and raw log base two enrichment scores for each variant can be obtained from Figshare (https://dx.doi.org/10.6084/m9.figshare.3505901.v2). Raw sequencing reads for this work have been deposited in the SRA (SAMN06237792-SAMN06237827). 80 APPENDIX 81 APPENDIX Figure B 3.1: Frequency distribution of library member counts. Panels are for reads in the preselection libraries for A.) acetamide B.) propionamide and C.) isobutyramide. Vertical lines indicate median (red) and mean (blue) read coverage. 82 Figure B 3.2: Fitness versus pre-selection read counts. Panels are for each variant in the A.) acetamide B.) propionamide and C.) isobutyramide libraries. Variants with insignificant read counts (n ≤ 5) and fitness metrics below the lower bounds were excluded from the analysis. Plots represent n = 4037, 3135, and 4969 variants. P-values for Pearson’s product moment correlation coefficients were calculated using a two-tailed t-test. 83 1 M hydrophobic hydrophilic * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R 2 R 3 H 4 G 5 D 6 I 7 S 8 S 9 S NS NS 10 N 11 D 12 T 13 V 14 G 15 V 16 A 17 V 18 V 19 N 20 Y 21 K 22 M 23 P 24 R 25 L 26 H 27 T 28 A 29 A 30 E 31 V 32 L 33 D 34 N 35 A 36 R 37 K 38 I 39 A 40 E 41 M 42 I 43 V 44 G 45 M 46 K 47 Q 48 G 49 L 50 P 92 R 93 K 94 A 95 N 96 V 97 W 98 G 99 V 100 F STOP NS NS NS NS hydrophobic NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 52 M 53 D 54 L 55 V 56 V 57 F 58 P 59 E 60 Y NS STOP hydrophilic NS NS NS NS 51 G * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS 61 S 62 L 63 Q 64 G 65 I 66 M 67 Y 68 D 69 P 70 A 71 E 72 M 73 M 74 E 75 T 76 A 77 V 78 A 79 I 80 P 81 G 82 E 83 E 84 T 85 E 86 I 87 F 88 S 89 R 90 A 91 C NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 S L T G E R H E E H P R K A P Y N T L V L I D N N G E I V Q K Y R K I I P W C P I E G W Y P G G Q T hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS Figure B 3.3: Fitness landscape for acetamide selection. 84 NS Figure B 3.3 (cont’d) 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 Y V S E G P K G M K I S L I I C D D G N Y P E I W R D C A M K G A E L I V R C Q G Y M Y P A K D Q Q hydrophobic hydrophilic * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS STOP NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 V M M A K A M A W A N N C Y V A V A N A A G F D G V Y S Y F G H S A I I G F D G R T L G E C G E E E STOP * F W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R hydrophilic hydrophobic aromatic NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 M G I Q Y A Q L S L S Q I R D A R A N D Q S Q N H L F K I L H R G Y S G L Q A S G D G D R G L A E C hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS 85 NS NS Figure B 3.3 (cont’d) 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 P F E F Y R T W V T D A E K A R E N V E R L T R S T T G V A Q C P V G R L P Y E G hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS ! NS ≥0.5 Beneficial NS 0.0 ≤1.3 NS NS Neutral Deleterious ≤5 preselection reads 86 1 M hydrophobic hydrophilic * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R 2 R 3 H 4 G 5 D 6 I 7 S 8 S 9 S NS NS 10 N 11 D 12 T 13 V 14 G 15 V 16 A 17 V 18 V 19 N 20 Y 21 K 22 M 23 P 24 R 25 L 26 H 27 T 28 A 29 A 30 E 31 V 32 L 33 D 34 N 35 A 36 R 37 K 38 I 39 A 40 E 41 M 42 I 43 V 44 G 45 M 46 K 47 Q 48 G 49 L 50 P 92 R 93 K 94 A 95 N 96 V 97 W 98 G 99 V 100 F STOP NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 51 G 52 M 53 D 54 L 55 V 56 V 57 F 58 P 59 E 60 Y 61 S NS NS 62 L 63 Q 64 G 65 I 66 M 67 Y 68 D 69 P 70 A 71 E 72 M 73 M 74 E 75 T 76 A 77 V 78 A 79 I 80 P 81 G 82 E 83 E 84 T 85 E 86 I 87 F 88 S 89 R 90 A 91 C hydrophobic STOP hydrophilic * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 S L T G E R H E E H P R K A P Y N T L V L I D N N G E I V Q K Y R K I I P W C P I E G W Y P G G Q T hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS Figure B 3.4: Fitness landscape for propionamide selection. 87 NS Figure B 3.4 (cont’d) 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 Y V S E G P K G M K I S L I I C D D G N Y P E I W R D C A M K G A E L I V R C Q G Y M Y P A K D Q Q hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 V M M A K A M A W A N N C Y V A V A N A A G F D G V Y S Y F G H S A I I G F D G R T L G E C G E E E hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 M G I Q Y A Q L S L S Q I R D A R A N D Q S Q N H L F K I L H R G Y S G L Q A S G D G D R G L A E C hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS 88 NS NS NS NS Figure B 3.4 (cont’d) 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 P F E F Y R T W V T D A E K A R E N V E R L T R S T T G V A Q C P V G R L P Y E G hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS ! NS NS ≥0.5 Beneficial NS 0.0 ≤0.8 NS NS NS NS NS 89 Neutral Deleterious ≤5 preselection reads 1 M hydrophobic hydrophilic * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R 2 R 3 H 4 G 5 D NS NS NS NS 6 I 7 S NS NS 8 S 9 S 10 N 11 D 12 T 13 V 14 G 15 V 16 A 17 V 18 V 19 N 20 Y 21 K 22 M 23 P 24 R 25 L 26 H 27 T 28 A 29 A 30 E 31 V 32 L 33 D 34 N 35 A 36 R 37 K 38 I 39 A 40 E 41 M 42 I 43 V 44 G 45 M 46 K 47 Q 48 G 49 L 50 P 97 W 98 G 99 V 100 F STOP NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 51 G 52 M 53 D NS NS NS NS NS 54 L 55 V NS NS NS NS 56 V 57 F 58 P 59 E NS 60 Y 61 S 62 L NS NS NS 63 Q 64 G 65 I 66 M 67 Y 68 D 69 P 70 A 71 E 72 M 73 M 74 E 75 T 76 A 77 V 78 A 79 I 80 P 81 G 82 E 83 E 84 T 85 E 86 I 87 F NS NS NS 88 S 89 R 90 A 91 C 92 R 93 K 94 A 95 N 96 V hydrophobic STOP hydrophilic * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 S L T G E R H E E H P R K A P Y N T L V L I D N N G E I V Q K Y R K I I P W C P I E G W Y P G G Q T hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS Figure B 3.5: Fitness landscape for isobutyramide selection. 90 NS NS Figure B 3.5 (cont’d) 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 Y V S E G P K G M K I S L I I C D D G N Y P E I W R D C A M K G A E L I V R C Q G Y M Y P A K D Q Q hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 V M M A K A M A W A N N C Y V A V A N A A G F D G V Y S Y F G H S A I I G F D G R T L G E C G E E E hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS NS 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 M G I Q Y A Q L S L S Q I R D A R A N D Q S Q N H L F K I L H R G Y S G L Q A S G D G D R G L A E C hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS NS NS NS NS 91 NS Figure B 3.5 (cont’d) 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 P F E F Y R T W V T D A E K A R E N V E R L T R S T T G V A Q C P V G R L P Y E G hydrophilic hydrophobic STOP * F aromatic W Y P START M I non-polar aliphatic L V A G small C S polar uncharged T N Q D negatively charged E H positively charged K R NS NS NS NS NS NS NS NS ζ NS ≥0.5 NS NS NS Beneficial NS 0.0 ≤0.6 NS NS NS NS 92 Neutral Deleterious ≤5 preselection reads Figure B 3.6: Fitness metrics from biological replicate growth selection experiments. Panels represent replicates in A.) acetamide and B.) isobutyramide media. Plots represent n = 3834 and 4977 variants for panels A and B, respectively. Red lines indicate two standard deviations from theoretical error estimation34. P-values for Pearson’s product moment correlation coefficients were calculated using a two-tailed t-test. 93 A 1.2 1.2 Variance 11 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 00 B 00 25 50 50 75 100 100 125 150 150 175 200 200 225 250 250 275 300 300 325 350 350 1.2 1.2 Variance 11 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 00 C 00 25 50 50 75 100 100 125 150 150 175 200 200 225 250 250 275 300 300 325 350 350 1.2 1.2 Variance 11 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 00 00 25 50 50 75 100 100 125 150 150 175 200 200 225 250 250 275 300 300 325 350 350 Position in primary sequence Figure B 3.7: Variance of fitness metrics for synonymous codons of beneficial mutations (!>0.15). Panel represent fitness metrics for the A.) acetamide B.) propionamide and C.) isobutyramide selections as a function of position in the primary sequence. 94 Figure B 3.8: Principle component analysis of renormalized fitness values. Panels are for A.) acetamide B.) propionamide and C.) isobutyramide selections. Fitness values were renormalized by subtracting the mean fitness (mean = -0.824, -0.575, -0.255 for acetamide, propionamide, and isobutyramide, respectively) from each variant. P-values for Pearson’s product moment correlation coefficients were calculated using a two-tailed t-test. 95 Figure B 3.9: amiE activity assay. A.) amiE activities were measured using a colorimetric Berthelot reaction33,63 for ammonia detection with phenol nitroprusside and alkaline hypochlorite. B.) Representative data for the amiE activity assay. Absorbance at 625 nm was measured at discrete time intervals for reactions containing one of seven concentrations of acetamide (ACT) and purified wild-type amiE. Reaction velocities were calculated by obtaining the slopes of each line. C.) Michaelis-Menten plot of wild-type amiE activity on acetamide substrate. Plot represents four independent measurements. Non-linear regression was performed using GraphPad Prism version 6 for Mac OS X, GraphPad Software, La Jolla California USA, www.graphpad.com. 96 Table B 3.1: Constructs used in growth selections. ACT, PR, and IB = acetamide, propionamide, and isobutyramide selection media (see Methods). Promoters obtained from Bienick et al.33. plasmid pJK_proK17_amiE selection promoter media ACT proK17 -35 hexamer -10 hexamer RBS name RBS sequence TTCCCG TAATAT t7RBS AGGAGA µS,wt (hr-1) µM9,wt (hr-1) µS,wt/µM9,wt 0.60 ± 0.02 0.65 ± 0.03 0.92 ± 0.05 0.44 ± 0.04 pEDA2_amiE ACT proK17 TTCCCG TAATAT kRBS3 AGTTTT 0.78 ± 0.03 0.56 ± 0.06 0.29 ± 0.06 pEDA2_amiE PR proK17 TTCCCG TAATAT kRBS3 AGTTTT 0.78 ± 0.03 0.37 ± 0.08 0.36 ± 0.07 pEDA6_amiE IB proK14 TGTACG TAATAT t7RBS AGGAGA 0.66 ± 0.04 0.54 ± 0.11 97 Table B 3.2: Library coverage statistics for combined amiE libraries (replicate 1 and 2) used in the acetamide, propionamide, and isobutyramide selections. Raw sequencing reads were quality filtered using Enrich55. Acetamide selection Propionamide Isobutyramide selection selection Pre-selection population DNA reads post quality filter 1,935,216 1,735,919 1,390,991 Post-selection population DNA reads post quality filter 7,738,122 3,236,744 2,831,599 Percent of possible codon substitutions observed: 1-base substitution 2-base substitution 3-base substitution All substitutions 100.0 97.6 95.7 97.1 100.0 97.6 95.6 97.1 100.0 98.5 97.2 98.1 40.0 52.0 39.6 51.8 38.3 52.6 8.0 8.7 9.1 97.2 97.2 96.3 Percent of reads in pre-selection library with: No nonsynonymous mutations One nonsynonymous mutation Multiple nonsynonymous mutations Coverage of possible single nonsynonymous mutations: ! ! ! 98 Table B 3.3: Isogenic growth and lysate flux data. Confidence intervals in error are given as 1 s.d. of at least 3 independent measurements. fitness lysate flux metric theoretical Ji/JWT (mmol NH3 variant (selection) µi (hr-1) µi/µWT µi/µWT gDCW-1 hr-1) wildtype 0.00 (ACT) 0.44 ± 0.04 1.00 1.00 0.15 ± 0.02 wildtype 0.00 (PR) 0.29 ± 0.06 1.00 1.00 * wildtype 0.00 (IB) 0.36 ± 0.07 1.00 1.00 0.11 ± 0.01 S9A 0.33 (ACT) 0.68 ± 0.04 1.40 ± 0.08 1.26 nd A28R 0.27 (ACT) 0.59 ± 0.03 1.44 ± 0.07 1.21 nd L119A 0.30 (ACT) 0.66 ± 0.01 1.35 ± 0.03 1.23 1.98 ± 0.66 I136A -0.07 (PR) 0.26 ± 0.00 0.78 ± 0.01 0.95 nd I136A 0.52 (IB) 0.58 ± 0.00 1.31 ± 0.01 1.43 nd I136H -0.10 (PR) 0.33 ± 0.01 0.96 ± 0.02 0.94 nd Q149A 0.19 (PR) 0.25 ± 0.00 0.88 ± 0.01 1.14 nd I165C 0.25 (ACT) 0.62 ± 0.03 1.51 ± 0.08 1.19 nd Y192V -1.3 (ACT) ng nd nd V201M 0.12 (PR) 0.37 ± 0.01 1.09 ± 0.03 1.09 nd V201M 0.22 (IB) 0.50 ± 0.00 1.29 ± 0.07 1.17 nd V201T 0.20 (ACT) 0.52 ± 0.01 1.07 ± 0.02 1.15 2.08 ± 0.66 V201T 0.34 (PR) 0.39 ± 0.00 1.17 ± 0.01 1.27 nd V201T 0.25 (IB) 0.56 ± 0.01 1.25 ± 0.02 1.19 1.9 ± 0.23 M203W 0.43 (IB) 0.61 ± 0.01 1.86 ± 0.07 1.34 0.69 ± 0.07 A234M 0.33 (ACT) 0.63 ± 0.01 1.29 ± 0.03 1.25 nd A234M 0.15 (PR) 0.42 ± 0.01 1.25 ± 0.02 1.11 nd A234M 0.21 (IB) 0.50 ± 0.01 1.14 ± 0.02 1.15 nd I236Y 0.26 (ACT) 0.60 ± 0.00 1.22 ± 0.02 1.20 nd Q273A 0.23 (IB) 0.59 ± 0.01 1.55 ± 0.09 1.17 nd *error in measurements was prohibitively high for calculating ratios nd = not determined ng = no growth 99 mRNA folding parameters mRNA folding parameters mRNA folding parameters Codon influence parameters IB Codon influence parameters PR Codon influence parameters ACT Table B 3.4: mRNA effects on fitness. Pearson correlation analysis of mRNA model parameters calculated as in47 with codon fitness metrics obtained in this work. Analysis was restricted to variants with ≥50 pre-selection read counts. Term # of variants ΔGUH aH gH u3H π(θwt) Σβcfc s7-16 s17-32 r # of variants ΔGUH aH gH u3H π(θwt) Σβcfc s7-16 s17-32 r # of variants ΔGUH aH gH u3H π(θwt) Σβcfc s7-16 s17-32 r All codons r p-value 6760 -0.022 0.073 0.008 0.492 -0.003 0.814 0.012 0.325 -0.022 0.069 -0.033 0.007 0.031 0.010 -0.020 0.092 0.023 0.061 6193 -0.0293 0.021 0.0036 0.777 -0.0131 0.303 -0.0061 0.630 -0.0131 0.303 -0.0087 0.496 0.0245 0.054 -0.0147 0.248 0.0131 0.304 2975 0.015 0.407 0.097 0.000 -0.066 0.000 0.088 0.000 0.044 0.017 -0.048 0.008 0.028 0.125 -0.016 0.392 -0.015 0.402 100 Codons 2-16 r p-value 196 -0.111 0.122 0.069 0.338 -0.003 0.966 0.163 0.022 0.219 0.002 0.301 0.000 0.247 0.000 0.084 0.240 161 -0.1421 0.071 0.0625 0.429 -0.0641 0.418 0.1002 0.205 0.2256 0.004 0.2795 0.000 0.3097 0.000 0.1038 0.189 43 -0.025 0.874 0.275 0.071 -0.161 0.297 0.162 0.295 0.064 0.678 -0.224 0.143 -0.303 0.045 -0.012 0.936 Table B 3.5: Gene amplification primers for preparing samples for deep sequencing. Gene amplification: inner primers gttcagagttctacagtccgacgatcttaactttaagaagtttttatacat Fwd_Tile1_amiE gttcagagttctacagtccgacgatcttaactttaagaaggagatatacat Fwd_Tile1_amiE-2 gttcagagttctacagtccgacgatcggcgaagaaacggaa Fwd_Tile2_amiE gttcagagttctacagtccgacgatcctgcgatgacggtaat Fwd_Tile3_amiE gttcagagttctacagtccgacgatcaagaaatgggcattcaatac Fwd_Tile4_amiE ccttggcacccgagaattccaaagcacggctaaagat Rev_Tile1_amiE ccttggcacccgagaattccactctccaaatttccggata Rev_Tile2_amiE ccttggcacccgagaattccacagagacaactgcgc Rev_Tile3_amiE ccttggcacccgagaattccatggtggtgctcgag Rev_Tile4_amiE blue = Illumina sequencing primer; black = gene overlap Gene amplification: outer primers Illumina_FWD aatgatacggcgaccaccgagatctacacgttcagagttctacagtccga Primer (selection, sample) RPI41 (ACT, T1-1) caagcagaagacggcatacgagatGTCGTCgtgactggagttccttggcacccgagaattcca RPI38 (ACT, T1-2) caagcagaagacggcatacgagatAGCTAGgtgactggagttccttggcacccgagaattcca RPI33 (ACT, T2-1) caagcagaagacggcatacgagatCGCCTGgtgactggagttccttggcacccgagaattcca RPI34 (ACT, T2-2) caagcagaagacggcatacgagatGCCATGgtgactggagttccttggcacccgagaattcca RPI43 (ACT, T3-1) caagcagaagacggcatacgagatGCTGTAgtgactggagttccttggcacccgagaattcca RPI40 (ACT, T3-2) caagcagaagacggcatacgagatTCTGAGgtgactggagttccttggcacccgagaattcca RPI44 (ACT, T4-1) caagcagaagacggcatacgagatATTATAgtgactggagttccttggcacccgagaattcca RPI41 (ACT, T4-2) caagcagaagacggcatacgagatGTCGTCgtgactggagttccttggcacccgagaattcca RPI37 (ACT, T1U) caagcagaagacggcatacgagatATTCCGgtgactggagttccttggcacccgagaattcca RPI22 (ACT, T2U) caagcagaagacggcatacgagatCGTACGgtgactggagttccttggcacccgagaattcca RPI39 (ACT, T3U) caagcagaagacggcatacgagatGTATAGgtgactggagttccttggcacccgagaattcca RPI40 (ACT, T4U) caagcagaagacggcatacgagatTCTGAGgtgactggagttccttggcacccgagaattcca RPI25 (PR, T1-1) caagcagaagacggcatacgagatATCAGTgtgactggagttccttggcacccgagaattcca RPI26 (PR, T1-2) caagcagaagacggcatacgagatGCTCATgtgactggagttccttggcacccgagaattcca RPI27 (PR, T2-1) caagcagaagacggcatacgagatAGGAATgtgactggagttccttggcacccgagaattcca RPI28 (PR, T2-2) caagcagaagacggcatacgagatCTTTTGgtgactggagttccttggcacccgagaattcca RPI29 (PR, T3-1) caagcagaagacggcatacgagatTAGTTGgtgactggagttccttggcacccgagaattcca RPI30 (PR, T3-2) caagcagaagacggcatacgagatCCGGTGgtgactggagttccttggcacccgagaattcca RPI31 (PR, T4-1) caagcagaagacggcatacgagatATCGTGgtgactggagttccttggcacccgagaattcca RPI32 (PR, T4-2) caagcagaagacggcatacgagatTGAGTGgtgactggagttccttggcacccgagaattcca RPI21 (PR, T1U) caagcagaagacggcatacgagatCGAAACgtgactggagttccttggcacccgagaattcca 101 Table B 3.5 (cont’d) RPI22 (PR, T2U) caagcagaagacggcatacgagatCGTACGgtgactggagttccttggcacccgagaattcca RPI23 (PR, T3U) caagcagaagacggcatacgagatCCACTCgtgactggagttccttggcacccgagaattcca RPI24 (PR, T4U) caagcagaagacggcatacgagatGCTACCgtgactggagttccttggcacccgagaattcca RPI13 (IB, T1-1) caagcagaagacggcatacgagatTTGACTgtgactggagttccttggcacccgagaattcca RPI14 (IB, T1-2) caagcagaagacggcatacgagatGGAACTgtgactggagttccttggcacccgagaattcca RPI15 (IB, T2-1) caagcagaagacggcatacgagatTGACATgtgactggagttccttggcacccgagaattcca RPI16 (IB, T2-2) caagcagaagacggcatacgagatGGACGGgtgactggagttccttggcacccgagaattcca RPI17 (IB, T3-1) caagcagaagacggcatacgagatCTCTACgtgactggagttccttggcacccgagaattcca RPI18 (IB, T3-2) caagcagaagacggcatacgagatGCGGACgtgactggagttccttggcacccgagaattcca RPI19 (IB, T4-1) caagcagaagacggcatacgagatTTTCACgtgactggagttccttggcacccgagaattcca RPI20 (IB, T4-2) caagcagaagacggcatacgagatGGCCACgtgactggagttccttggcacccgagaattcca RPI9 (IB, T1U) caagcagaagacggcatacgagatCTGATCgtgactggagttccttggcacccgagaattcca RPI10 (IB, T2U) caagcagaagacggcatacgagatAAGCTAgtgactggagttccttggcacccgagaattcca RPI11 (IB, T3U) caagcagaagacggcatacgagatGTAGCCgtgactggagttccttggcacccgagaattcca RPI12 (IB, T4U) caagcagaagacggcatacgagatTACAAGgtgactggagttccttggcacccgagaattcca red = Illumina adapter sequence; BOLD = barcode; blue = Illumina sequencing primer 102 REFERENCES 103 REFERENCES 1. Bornscheuer, U. T. et al. Engineering the third wave of biocatalysis. Nature 485, 185–194 (2012). 2. Liberles, D. A. et al. The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci. 21, 769–785 (2012). 3. Khersonsky, O. & Tawfik, D. S. Enzyme promiscuity: a mechanistic and evolutionary perspective. Annu. Rev. Biochem. 79, 471–505 (2010). 4. Bloom, J. D. & Arnold, F. H. In the light of directed evolution: pathways of adaptive protein evolution. Proc. Natl. Acad. Sci. U. S. A. 106 Suppl, 9995–10000 (2009). 5. Orr, H. A. The genetic theory of adaptation: a brief history. Nat. Rev. Genet. 6, 119–127 (2005). 6. Orr, H. A. The Population Genetics of Adaptation: The Distribution of Factors Fixed During Adaptive Evolution. Evolution (N. Y). 52, 935–949 (1998). 7. Eyre-Walker, A. & Keightley, P. D. The distribution of fitness effects of new mutations. Nat. Rev. 8, (2007). 8. Melnikov, A., Rogov, P., Wang, L., Gnirke, A. & Mikkelsen, T. S. Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic Acids Res. 42, gku511 (2014). 9. Arnold, F. H., Wintrode, P. L., Miyazaki, K. & Gershenson, A. How enzymes adapt: lessons from directed evolution. Trends Biochem. Sci. 26, 100–106 (2001). 10. Currin, A., Swainston, N., Day, P. J. & Kell, D. B. Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chem. Soc. Rev. 44, 1172–1239 (2015). 11. Bloom, J. D., Romero, P. A., Lu, Z. & Arnold, F. H. Neutral genetic drift can alter promiscuous protein functions, potentially aiding functional evolution. Biol. Direct 2, 17 (2007). 12. Dalby, P. A. Strategy and success for the directed evolution of enzymes. Curr. Opin. Struct. Biol. 21, 473–480 (2011). 13. Morley, K. L. & Kazlauskas, R. J. Improving enzyme properties: when are closer mutations better? Trends Biotechnol. 23, 231–237 (2005). 104 14. Schmidt, M. et al. Directed Evolution of an Esterase from Pseudomonas fluorescens Yields a Mutant with Excellent Enantioselectivity and Activity for the Kinetic Resolution of a Chiral Building Block. ChemBioChem 7, 805–809 (2006). 15. Aharoni, A. et al. The ‘evolvability’ of promiscuous protein functions. Nat. Genet. 37, 73– 76 (2005). 16. Lee, J. & Goodey, N. M. Catalytic Contributions from Remote Regions of Enzyme Structure. Chem. Rev. 7595–7624 (2011). 17. Omari, K. El, Liekens, S., Bird, L. E., Balzarini, J. & Stammers, D. K. Mutations Distal to the Substrate Site Can Affect Varicella Zoster Virus Thymidine Kinase Activity: Implications for Drug Design. Mol. Pharmacol. 69, 1891–1896 (2006). 18. Tomatis, P. E., Rasia, R. M., Segovia, L., Vila, A. J. & Gray, H. B. Mimicking Natural Evolution in Metallo-beta-Lactamases through Second-Shell Ligand Mutations. Proc. Natl. Acad. Sci. 102, 13761–13766 (2005). 19. Yang, G., Hong, N., Baier, F., Jackson, C. J. & Tokuriki, N. Conformational Tinkering Drives Evolution of a Promiscuous Activity through Indirect Mutational Effects. Biochemistry 55, 4583–4593 (2016). 20. Oue, S., Okamoto, A., Yano, T. & Kagamiyama, H. Redesigning the Substrate Specificity of an Enzyme by Cumulative Effects of the Mutations of Non-active Site Residues. J. Biol. Chem. 274, 2344–2349 (1999). 21. Mavor, D. et al. Determination of ubiquitin fitness landscapes under different chemical stresses in a classroom setting. Elife 5, e15802 (2016). 22. Roscoe, B. P., Thayer, K. M., Zeldovich, K. B., Fushman, D. & Bolon, D. N. A. Analyses of the Effects of All Ubiquitin Point Mutants on Yeast Growth Rate. J. Mol. Biol. 425, 1363–1377 (2013). 23. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014). 24. Gillespie, J. H. Molecular Evolution Over the Mutational Landscape. Evolution (N. Y). 38, 1116–1129 (1984). 25. Orr, H. A. The population genetics of beneficial mutations. Philos. Trans. R. Soc. London B Biol. Sci. 365, 1195–1201 (2010). 26. Kelly, M. & Clarke, P. H. An Inducible Amidase Produced by a Strain of Pseudomonas aeruginosa. Microbiology 27, 305–316 (1962). 105 27. Andrade, J., Karmali, A., Carrondo, M. A. & Frazao, C. Structure of Amidase from Pseudomonas aeruginosa Showing a Trapped Acyl Transfer Reaction Intermediate State. J. Biol. Chem. 282, 19598–19605 (2007). 28. Schmid, A. et al. Industrial biocatalysis today and tomorrow. Nature 409, 258–268 (2001). 29. Buchholz, K., Kasche, V. & Bornscheuer, U. T. Biocatalysts and Enzyme Technology. (John Wiley & Sons, 2012). 30. Kim, M. et al. Need-based activation of ammonium uptake in Escherichia coli. Mol. Syst. Biol. 8, 616 (2012). 31. Reitzer, L. Nitrogen assimilation and global regulation in Escherichia coli. Annu. Rev. Microbiol. 57, 155–176 (2003). 32. Davis, J. H., Rubin, A. J. & Sauer, R. T. Design, construction and characterization of a set of insulated bacterial promoters. Nucleic Acids Res. 39, 1131–1141 (2011). 33. Bienick, M. S. et al. The Interrelationship between Promoter Strength, Gene Expression, and Growth Rate. PLoS One 9, e109105 (2014). 34. Kowalsky, C. A. et al. High-Resolution Sequence-Function Mapping of Full-Length Proteins. PLoS One 10, e0118193 (2015). 35. Firnberg, E. & Ostermeier, M. PFunkel: Efficient, Expansive, User-Defined Mutagenesis. PLoS One 7, e52031 (2012). 36. Soskine, M. & Tawfik, D. S. Mutational effects and the evolution of new protein functions. Nat. Publ. Gr. 11, 572–582 (2010). 37. Firnberg, E., Labonte, J. W., Gray, J. J. & Ostermeier, M. A Comprehensive, HighResolution Map of a Gene’s Fitness Landscape. Mol. Biol. Evol. 31, 1581–1592 (2014). 38. Stiffler, M. A., Hekstra, D. R. & Ranganathan, R. Evolvability as a Function of Purifying Selection in TEM-1 β-Lactamase. Cell 160, 882–892 (2015). 39. Steinberg, B. & Ostermeier, M. Environmental changes bridge evolutionary valleys. Sci. Adv. 2, e1500921 (2016). 40. Kassen, R. & Bataillon, T. Distribution of fitness effects among beneficial mutations before selection in experimental populations of bacteria. Nat. Genet. 38, 484–488 (2006). 41. Serohijos, A. W. R. & Shakhnovich, E. I. Contribution of Selection for Protein Folding Stability in Shaping the Patterns of Polymorphisms in Coding Regions. Mol. Biol. Evol. 31, 165–176 (2014). 106 42. Bershtein, S., Segal, M., Bekerman, R., Tokuriki, N. & Tawfik, D. S. Robustness-epistasis link shapes the fitness landscape of a randomly drifting protein. Nature 444, 929–932 (2006). 43. Jacquier, H. et al. Capturing the mutational landscape of the beta-lactamase TEM-1. Proc. Natl. Acad. Sci. 110, 13067–13072 (2013). 44. Poon, A. & Chao, L. The rate of compensatory mutation in the DNA bacteriophage φX174. Genetics 170, 989–999 (2005). 45. Rokyta, D. R. et al. Beneficial fitness effects are not exponential for two viruses. J. Mol. Evol. 67, 368–376 (2008). 46. Sanjuán, R., Moya, A. & Elena, S. F. The distribution of fitness effects caused by singlenucleotide substitutions in an RNA virus. Proc. Natl. Acad. Sci. U. S. A. 101, 8396–401 (2004). 47. Boël, G. et al. Codon influence on protein expression in E . coli correlates with mRNA levels. Nature 529, 358–363 (2016). 48. van der Meer, J. Y. et al. Using mutability landscapes of a promiscuous tautomerase to guide the engineering of enantioselective Michaelases. Nat. Commun. 7, 1–16 (2016). 49. Chevereau, G. et al. Quantifying the Determinants of Evolutionary Dynamics Leading to Drug Resistance. PLos Biol. 13, e1002299 (2015). 50. Klesmith, J. R., Bacik, J., Michalczyk, R. & Whitehead, T. A. Comprehensive SequenceFlux Mapping of a Levoglucosan Utilization Pathway in E. coli. ACS Synth. Biol. 4, 1235– 1243 (2015). 51. Sniegowski, P. D. & Gerrish, P. J. Beneficial mutations and the dynamics of adaptation in asexual populations. Philos. Trans. R. Soc. London B Biol. Sci. 365, 1255–1263 (2010). 52. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016). 53. Gong, L. I. & Bloom, J. D. Epistatically Interacting Substitutions Are Enriched during Adaptive Protein Evolution. PLoS One 10, (2014). 54. Kunkel, T. A. Rapid and efficient site-specific mutagenesis without phenotypic selection. Proc. Natl. Acad. Sci. 82, 488–492 (1985). 55. Fowler, D. M., Araya, C. L., Gerard, W. & Fields, S. Enrich: Software for analysis of protein function by enrichment and depletion of variants. Bioinformatics 27, 3430–3431 (2011). 107 56. R Core Team, R. R: A language and environment for statistical computing. (2015). at 57. Villaseñor-Alva, J. A. & González-Estrada, E. A bootstrap goodness of fit test for the generalized Pareto distribution. Comput. Stat. Data Anal. 53, 3835–3841 (2009). 58. Delignette-Muller, M. L. & Dutang, C. fitdistrplus: An R Package for Fitting Distributions. J. Stat. Softw. 64, 1–34 (2015). 59. Scholz, F. & Zhu, A. K-Sample Rank Tests and their Combinations. (2016). at 60. Studier, F. W. Protein production by auto-induction in high-density shaking cultures. Protein Expr. Purif. 41, 207–234 (2005). 61. Ericsson, U. B., Hallberg, B. M., DeTitta, G. T., Dekker, N. & Nordlund, P. Thermofluorbased high-throughput stability optimization of proteins for structural studies. Anal. Biochem. 357, 289–298 (2006). 62. Lavinder, J. J., Hari, S. B., Sullivan, B. J. & Magliery, T. J. High-throughput thermal scanning: a general, rapid dye-binding thermal shift screen for protein engineering. J. Am. Chem. Soc. 131, 3794–3795 (2009). 63. Searle, P. L. The Berthelot or indophenol reaction and its use in the analytical chemistry of nitrogen. A review. Analyst 109, 549–568 (1984). 108 CHAPTER FOUR Computational design of destabilized proteins for assessing theories on adaptive molecular evolution 109 ABSTRACT Understanding and predicting the mechanisms of adaptive evolution is a key challenge for theoretical and experimental biologists. Of interest to protein engineers is knowledge of how beneficial mutations arise in a population. More specifically, how many beneficial mutations are available given a starting amino acid sequence, and what is the nature of their distribution of fitness effects (DFE)? Among other biophysical properties, does the stability of a given protein (thermodynamic, colloidal, etc.) have an impact on this distribution? While theoretical frameworks have been developed, generating empirical data to rigorously test these theories has been a challenge. Deep sequencing mutational studies provide data on thousands of mutations in a single experiment and have proven useful in testing adaptive molecular evolution theories. In this chapter, a computational methodology was developed and applied to design functional yet destabilized proteins for use in future work to test hypotheses on the DFE for beneficial mutations. 110 INTRODUCTION Understanding the mechanisms of molecular evolution is important to a broad range of scientists including molecular biologists, virologists, evolutionary biologists, and protein engineers. Researchers interested in evolving natural proteins or designing proteins de novo must wrestle with the implicit evolutionary limitations set forth by nature. The challenge, then, is to first define these mechanisms. Of interest to most protein engineers are beneficial mutations or ‘hits’ – variants that achieve some engineering goal. Given the amino acid sequence of an enzyme, how many ‘hits’ are available? What are the mechanisms and constraints that govern the evolvability of a protein for new functions and, more importantly, can we leverage knowledge of these mechanisms to improve the design process? Since Ronald Fisher’s seminal work in the 1930’s in developing a ‘geometric’ evolutionary model1,2, several theoretical developments have followed that attempt to mathematically describe the adaptive behavior of stochastic mutations. One phenomena noted early on was that beneficial mutations are rare; deleterious and neutral mutations are far more probable than ones providing a selective advantage. Understanding the rare nature of beneficial mutations, Gillespie reasoned that extreme value theorem mathematics could be utilized to model the extreme tails of these distributions3. Orr later matured these theories and predicted that the DFE for beneficial mutations should be approximately exponentially distributed4,5. Several experimental works have attempted to test these theories, yet the results have been conflicting5. A significant challenge is identifying enough beneficial mutations for a given protein so that rigorous distribution model fitting can be carried out. Deep mutational scanning experiments provide a wealth of mutational data that can be used address questions in molecular evolution6. In Chapter 3, deep mutational scanning was used to 111 study how an enzyme encodes substrate specificity by generating comprehensive single-mutation fitness landscapes on multiple substrates7. Using the set of beneficial mutations - variants with functional scores above wild-type – we were able to address the shape of the DFE for beneficial mutations with high statistical power. As was predicted by Orr, all three datasets could essentially be described as exponential. As these results are contingent upon a laboratory experiment and not ‘true’ adaptive evolution, one has to consider the details of the selection. Importantly, the thermal stability of the enzyme under study was favorable to the selection conditions; the melting temperature of wild-type amiE is >65°C and the selections were performed at 37°C. Since proteins are generally only marginally stable in their native environments, an interesting question one could ask is, then, how does the shape of the DFE for beneficial mutations behave when the protein under study is not stable in the selection conditions? In growth selections, fitness of an enzyme is determined by its catalytic activity and the amount of folded, functional protein. The probability that a protein will fold to a native ‘lowest energy’ functional confirmation is a complex computation governed by both the thermodynamic and kinetic stability constraints in the present environment8,9. However, in the context of a growth selection, this can be simplified and modeled as a two-state system: folded (functional) and unfolded (inactive). What is the relationship between folding probability and fitness outcomes? Previous works have aimed to address this question10–13. In a classic paper, Bloom et al. showed that a stabilized cytochrome P450 had a greater probability of accepting mutations conferring new functions than the wild-type enzyme14. The fitness effect of a mutation then, is a function of the protein’s mutational ‘robustness’ – random mutations to a less stable protein are more likely to be deleterious. Counterintuitive to this finding and a point of frustration for protein engineers, is that stabilizing mutations often come with trade-off to activity15–19. 112 The design of experiment for this project is simple: compare DFE of a destabilized protein against the known DFE of the stable wild-type protein. Using the experiment framework developed previously for the amiE enzyme as described in Chapter 37, the experimental objectives of this project are to 1.) design and generate destabilized variants of amiE while maintaining wildtype catalytic function, 2.) establish and perform growth selections on the acetamide substrate, and 3.) compare the beneficial DFE of the destabilized protein to the wild-type. The null hypothesis is that the shape of the DFE is independent of thermodynamic stability. In the following chapter I will describe a computational approach to address objective 1 using the Rosetta Design Software. We were able to identify several variants of amiE that impacted the relative purification yield (a proxy for enzyme stability). In the Discussion and Outlook section I will discuss the implications for use of computational design in altering protein stability and outline future work necessary to accomplish the remaining objectives of this project. 113 RESULTS We first sought to generate destabilized variants of amiE using computational methods. The objective was to identify mutations that would decrease the stability of the protein while maintaining wild-type catalytic function. In effort to computationally identify destabilizing mutations, we modified the Rosetta protocols used in the PROSS pipeline developed by Goldenzweig et al.20. Briefly, the objective of PROSS is to suggest designs given an input protein sequence and structure that will have improved stability and heterologous expression yields from bacterial hosts such as E coli. As our experimental objective is essentially the inverse problem, we modified the protocols used in PROSS and then selected mutations with worse energy scores relative to wild-type. The computational pipeline that was used is outlined in Figure 4.1. The input structure for amiE (PDB code 2UXY)21 was first refined by iterative rounds of sidechain and backbone repacking to obtain the lowest energy structure. Next, we reduced the number of positions to test by removing any position that was 1.) within 8 Å of the active site or 2.) on the surface of the protein. The former is to reduce the likelihood of choosing a mutation that impacts catalytic function, as proximity to active site correlates with activity15,17,22. The removal of positions on the surface of the protein from our search space was to avoid any potential effects to protein oligomerization (amiE is a homohexamer). The FilterScan protocol23 was then run on the remaining positions to test and score all possible single amino acid substitutions. Figure 4.1: Process flow for computational design of destabilized proteins. 114 Using the output scores from the FilterScan module along with the point-mutant fitness metric data previously generated7, six destabilized amiE designs harboring 1-3 mutations each were expressed and purified from E coli BL21* along with wild-type amiE (Table C 4.1). For 4/6 designs we observed essentially no protein yield from our purification, and the remaining two designs (ED2 and LA3) had a significant reduction in the amount of protein yielded relative to wild-type (Figure 4.2a and Table C 4.1). Figure 4.2: Destabilized amiE variants. a.) Wild-type and variant amiE proteins were expressed, purified, and quantified using the A280 method and the yield of each variant protein relative to wild-type was computed. b.) The ‘large-to-small’ mutations introduce voids into the core of the 115 (Figure 4.2 cont’d) protein (F100A and I235A) while other hydrophobic to hydrophobic mutations alter the local packing. To obtain an unambiguous view of the contributions of each mutation, we made and purified some of the point-mutations contained within the six designs and found that all had decreased expression yields relative to wild-type (Figure 4.2a and Table C 4.1). The majority of mutations introduce a void into the core of the structure. For example, F100A and I235A are both ‘large-to-small’ mutations, with the former also eliminating a stabilizing pi-pi stacking interaction with F87 (Figure 4.2b). Other mutations such as I122L are less aggressive in terms of void introduction, yet change the local packing dynamics in the vicinity of the mutation (Figure 4.2b). Interestingly, when we measured the apparent melting temperatures of the point-mutants we found that most were close to the wild-type Tm of 67.7 ± 0.1ºC (wild-type Tm previously reported in Wrenbeck et al.7). However, this could be explained by the fact that the homohexameric biological assembly of amiE complicates measuring the true melting temperature of the monomer with the irreversible thermal shift assay used. Following dissociation of the homohexameric complex and further (likely immediate) unfolding of each monomer unit, one cannot reverse the process to refold the monomers and thus obtain the true Tm of monomeric amiE. DISCUSSION AND OUTLOOK Altering the stability of a protein while maintaining functionality is a significant challenge in protein science. Generally, the objective is to improve stability: thermodynamic, colloidal, solvent tolerance, etc. Important examples include efforts to stabilize enzymes in biocatalytic or in vivo process15,22,24–27 and improved shelf-life and/or heat-tolerance of protein therapeutics28. Computational methods, though imperfect, are becoming increasingly better at predicting the 116 thermodynamic effect (∆∆G) that a mutation will have on the folding stability of a protein. However, it has long been understood that there is an inherent tradeoff between stability and activity, especially with enzymes15,17,24. These effects will be discussed at length in Chapter 5, but in the context of this project the objective was to utilize computational predictions of free energy change upon mutation to select less stable structures. The next objective of this project is to identify which of the destabilized variants maintain catalytic activity. A fellow graduate student in the Whitehead lab, Matthew Faber, has expressed, purified, and performed activity assays as described in Wrenbeck et al.7 on the variants described in this chapter (unpublished data). Two point-mutants, I38V and I122L, retained near wild-type activity. Future work involves establishing growth selection for one or both of these variants, performing a growth selection on a comprehensive single-mutational library using the variant as the parental enzyme, deep sequencing the pre- and post-selection populations, and then analyzing the data as in the previous study to observe the shape of the DFE for beneficial mutations. 117 MATERIALS AND METHODS Reagents All chemicals were purchased from Sigma-Aldrich unless otherwise noted. Mutagenic oligonucleotides were ordered from Integrated DNA Technologies and are listed in Table C 4.2. Plasmid construction pET29b_amiE was constructed as previously described7. Briefly, the amiE coding sequence was subcloned into the pET-29b(+) backbone (Novagen). All variants were generated using Kunkel mutagenesis29. Mutagenic primers are listed in Table C 4.2. Protein purification and characterization amiE proteins were expressed and purified following the exact protocols in Wrenbeck et al.7. Briefly, proteins were expressed using the auto-induction method30 and purified on Ni-NTA column according to Klesmith et al.25. Purified protein solutions were quantified by measuring absorbance at 280 nm on a BioTek Synergy H1 plate reader in 96-well UV-transparent plates using an extinction coefficient of 5.883x10-2 uM-1cm-1 for amiE. Apparent melting temperatures were measured using a SYPRO Orange thermal-shift assay31 as detailed in Klesmith et al.25 and Wrenbeck et al.7. Computational point-mutant scan Rosetta scripts and command lines used in this work are listed in Notes C 4.1 and 4.2. The crystal structure for amiE was obtained from the Protein Data Bank (PDB code 2UXY)21 and cleaned for use in Rosetta with the ‘clean_pdb_keep_ligand.py’ script as part of the Rosetta 3 118 release. The structure was refined using the refine.xml script (without alteration) included in Data S1 from Goldenzweig et al.20. Residues within 8 Å of the C3Y ligand (substrate transition state analogue crystalized with amiE structure) and those comprising the C-terminus were fixed during refinement. A list of fixed residues is included in Note C 4.1. Distance to the catalytic active site was calculated by finding the minimum distance of a position’s alpha carbon to any active site atom (six identical active sites in the homohexamer amiE). Residues with 8 Å or less distance to the active site were excluded from the FilterScan. Surface residues were identified with the Python script findSurfaceResidues.py (https://pymolwiki.org/index.php/FindSurfaceResidues) and were also excluded from the FilterScan. The filterscan.xml script from Goldenzweig et al.20 was modified to exclude PSSM input. The modified script can be found in Note C 4.2. 119 APPENDIX 120 APPENDIX Table C 4.1: Characterization of destabilized amiE variants. Tmapp (°C) 68.6 ± 1.05 Relative purification yield 0.24 Design ED2 Mutation(s) F87Y, I235L FilterScan score 4.927 ED4 V17T, I38V, I122L, L258A 8.130 nd 0.00 V56I, F100H, L258A I235A F100L, C178L F100A, S162V V17T I38V F87Y F100A F100L I122L S162V C178L I235L 9.449 5.801 8.777 11.188 2.354 2.023 1.645 7.469 4.583 0.395 3.719 4.195 3.281 nd nd 60.28 ± 0.10 nd 65.8 ± 0.20 67.3 ± 0.24 73.1 ± 4.33 67.4 ± 0.31 79.8 ± 4.30 nd nd 64.4 ± 0.15 79.2 ± 2.30 0.01 0.01 0.09 0.01 0.64 0.72 0.31 0.10 0.37 0.11 0.07 0.40 0.43 ED5 LA2 LA3 LA5 121 Table C 4.2: Primers for generating destabilized amiE variants. Primer AmiE_C178L AmiE_F100A AmiE_F100H AmiE_F100L AmiE_F87Y AmiE_I122L AmiE_I235A AmiE_I235L AmiE_I38V AmiE_L258A AmiE_S162V AmiE_V17T AmiE_V56I Sequence gcacccttcatggctaaatctctccaaatttccggataattac gttcgccggtcagggaggccacaccccaaacatttg gttcgccggtcagggagtgcacaccccaaacatttg cgccggtcagggataacacaccccaaacatt tacaagcacggctatagatttccgtttcttcgcctg gatttcaccgttgttatcgagcaagaccagagtgttgtatg cggccgtcaaaaccgatagcggcactatgaccgaagta gccgtcaaaaccgataagggcactatgaccgaagt atcatttccgcaactttgcgggcattatccaggac cggatttgtgacagagacgcctgcgcgtattgaatgcc ccgtcatcgcagataattaatacgatcttcatgcctttcggacc ggcatcttgtaattcaccgtggctacgcctacggtatc ctgtaaagaatattcaggaaatataaccagatccatgcccggca 122 Note C 4.1: Command lines and supporting files for Rosetta computational design Prepare coordinate constraints file for amiE make_csts.sh infile.pdb > outfile.cst Flags file used in refinement -ex1 -ex2 -use_input_sc -extrachi_cutoff 5 -ignore_unrecognized_res -use_occurrence_data #-corrections::correct #-corrections::score:no_his_his_pairE -chemical:exclude_patches LowerDNA UpperDNA Cterm_amidation SpecialRotamer VirtualBB ShoveBB VirtualDNAPhosphate VirtualNTerm CTermConnect sc_orbitals pro_hydroxylated_case1 pro_hydroxylated_case2 ser_phosphorylated thr_phosphorylated tyr_phosphorylated tyr_sulfated lys_dimethylated lys_monomethylated lys_trimethylated lys_acetylated glu_carboxylated cys_acetylated tyr_diiodinated N_acetylated C_methylamidated MethylatedProteinCterm #-output_virtual -linmem_ig 10 -ignore_zero_occupancy false #-out:path:pdb pdbs/ #-out:path:score scores/ Refinement command line ./path/to/rosetta/scripts -database ./path/to/rosetta/database/ -in:file:s infile.pdb -parser:protocol refine.xml -parser:script_vars res_to_fix= 22A,59A,60A,65A,103A,117A,119A,132A,134A,136A,137A,138A,139A,142A,144A,163A,16 4A,165A,166A,167A,168A,169A,170A,171A,174A,175A,187A,188A,189A,190A,191A,192A, 193A,200A,203A,217A,218A,219A,227A,229A,230A,260A,261A,262A,263A,264A,265A,266 A,267A,268A,269A,270A,271A,272A,273A,274A,275A,276A,277A,278A,279A,280A,281A,2 82A,283A,284A,285A,286A,287A,288A,289A,290A,291A,292A,293A,294A,295A,296A,297A ,298A,299A,300A,301A,302A,303A,304A,305A,306A,307A,308A,309A,310A,311A,312A,313 A,314A,315A,316A,317A,318A,319A,320A,321A,322A,323A,324A,325A,326A,327A,328A,3 29A,330A,331A,332A,333A,334A,335A,336A,337A,338A,339A,340A,341A parser:script_vars pdb_reference=infile.pdb -parser:script_vars cst_full_path=infile.cst parser:script_vars cst_value=0.4 @flags -overwrite 123 Command line for FilterScan for i in {245..341}; do ./path/to/rosetta/scripts -database ./path/to/rosetta/database/ -in:file:s refinedinfile.pdb -parser:protocol filterscan.xml -parser:script_vars res_to_fix=22A,59A,60A,65A,103A,117A,119A,132A,134A,136A,137A,138A,139A,142A,144 A,163A,164A,165A,166A,167A,168A,169A,170A,171A,174A,175A,187A,188A,189A,190A,1 91A,192A,193A,200A,203A,217A,218A,219A,227A,229A,230A,260A,261A,262A,263A,264A ,265A,266A,267A,268A,269A,270A,271A,272A,273A,274A,275A,276A,277A,278A,279A,280 A,281A,282A,283A,284A,285A,286A,287A,288A,289A,290A,291A,292A,293A,294A,295A,2 96A,297A,298A,299A,300A,301A,302A,303A,304A,305A,306A,307A,308A,309A,310A,311A ,312A,313A,314A,315A,316A,317A,318A,319A,320A,321A,322A,323A,324A,325A,326A,327 A,328A,329A,330A,331A,332A,333A,334A,335A,336A,337A,338A,339A,340A,341A parser:script_vars pdb_reference=refinedinfile.pdb -parser:script_vars res_to_restrict=22A,59A,60A,65A,103A,117A,119A,132A,134A,136A,137A,138A,139A,142A, 144A,163A,164A,165A,166A,167A,168A,169A,170A,171A,174A,175A,187A,188A,189A,190 A,191A,192A,193A,200A,203A,217A,218A,219A,227A,229A,230A,260A,261A,262A,263A,2 64A,265A,266A,267A,268A,269A,270A,271A,272A,273A,274A,275A,276A,277A,278A,279A ,280A,281A,282A,283A,284A,285A,286A,287A,288A,289A,290A,291A,292A,293A,294A,295 A,296A,297A,298A,299A,300A,301A,302A,303A,304A,305A,306A,307A,308A,309A,310A,3 11A,312A,313A,314A,315A,316A,317A,318A,319A,320A,321A,322A,323A,324A,325A,326A ,327A,328A,329A,330A,331A,332A,333A,334A,335A,336A,337A,338A,339A,340A,341A parser:script_vars cst_full_path=infile.cst -parser:script_vars cst_value=0.4 -parser:script_vars scores_path=scores/ -parser:script_vars resfiles_path=resfiles/ @flags_delay -parser:script_vars current_res=${i} -overwrite; done 124 Note C 4.2: Rosetta scripts used in this work Modified FilterScan script excluding PSSM input #upper and lower are booleans. Delta filters out all the mutations that are worse or better by less than -0.5R.E.U 125 126 REFERENCES 127 REFERENCES 1. Fisher, R. A. The Genetical Theory of Natural Selection. (Oxford University Press, 1930). 2. Orr, H. A. The genetic theory of adaptation: a brief history. Nat. Rev. Genet. 6, 119–127 (2005). 3. Gillespie, J. H. Molecular Evolution Over the Mutational Landscape. Evolution (N. Y). 38, 1116–1129 (1984). 4. Orr, H. A. The Population Genetics of Adaptation: The Distribution of Factors Fixed During Adaptive Evolution. Evolution (N. Y). 52, 935–949 (1998). 5. Orr, H. A. The population genetics of beneficial mutations. Philos. Trans. R. Soc. London B Biol. Sci. 365, 1195–1201 (2010). 6. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014). 7. Wrenbeck, E. E., Azouz, L. R. & Whitehead, T. A. Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nat. Commun. 8, 15695 (2017). 8. Baker, D. & Agard, D. A. Kinetics Thermodynamics in Protein Folding. Biochemistry 33, 7505–7509 (1994). 9. Shakhnovich, E. I. Theoretical studies of protein-folding thermodynamics and kinetics. Curr. Opin. Struct. Biol. 7, 29–40 (1997). 10. Serohijos, A. W. R. & Shakhnovich, E. I. Contribution of Selection for Protein Folding Stability in Shaping the Patterns of Polymorphisms in Coding Regions. Mol. Biol. Evol. 31, 165–176 (2014). 11. Tokuriki, N. & Tawfik, D. S. Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol. 19, 596–604 (2009). 12. Tokuriki, N. & Tawfik, D. S. Protein Dynamism and Evolvability. Science (80-. ). 324, 203–208 (2009). 13. Zeldovich, K. B., Chen, P. & Shakhnovich, E. I. Protein stability imposes limits on organism complexity and speed of molecular evolution. Proc. Natl. Acad. Sci. 104, 16152–16157 (2007). 128 14. Bloom, J. D., Labthavikul, S. T., Otey, C. R. & Arnold, F. H. Protein stability promotes evolvability. Proc. Natl. Acad. Sci. 103, 5869–5874 (2006). 15. Klesmith, J. R., Bacik, J., Wrenbeck, E. E., Michalczyk, R. & Whitehead, T. A. Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning. Proc. Natl. Acad. Sci. 114, 2265–2270 (2017). 16. Nagatani, R. A., Gonzalez, A., Shoichet, B. K., Brinen, L. S. & Babbitt, P. C. Stability for Function Trade-Offs in the Enolase Superfamily ‘Catalytic Module’. Biochemistry 46, 6688–6695 (2007). 17. Tokuriki, N. et al. Diminishing returns and tradeoffs constrain the laboratory optimization of an enzyme. Nat. Commun. 3, 1257 (2012). 18. Tokuriki, N. & Tawfik, D. S. Chaperonin overexpression promotes genetic variation and enzyme evolution. Nature 459, 668–675 (2009). 19. Tokuriki, N., Stricher, F., Serrano, L. & Tawfik, D. S. How Protein Stability and New Functions Trade Off. PLoS Comput. Biol. 4, e1000002 (2008). 20. Goldenzweig, A. et al. Automated Structure- and Sequence-Based Design of Proteins for High Bacterial Expression. Mol. Cell 63, 337–346 (2016). 21. Andrade, J., Karmali, A., Carrondo, M. A. & Frazao, C. Structure of Amidase from Pseudomonas aeruginosa Showing a Trapped Acyl Transfer Reaction Intermediate State *. J. Biol. Chem. 282, 19598–19605 (2007). 22. Arnold, F. H. Combinatorial and computational challenges for biocatalyst design. Nature 409, 253–257 (2001). 23. Whitehead, T. A. et al. Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing. Nat. Biotechnol. 30, 543–8 (2012). 24. Morley, K. L. & Kazlauskas, R. J. Improving enzyme properties: when are closer mutations better? Trends Biotechnol. 23, 231–237 (2005). 25. Klesmith, J. R., Bacik, J., Michalczyk, R. & Whitehead, T. A. Comprehensive SequenceFlux Mapping of a Levoglucosan Utilization Pathway in E. coli. ACS Synth. Biol. 4, 1235– 1243 (2015). 26. Polizzi, K. M., Bommarius, A. S., Broering, J. M. & Chaparro-Riggers, J. F. Stability of biocatalysts. Curr. Opin. Chem. Biol. 11, 220–225 (2007). 27. Bornscheuer, U. T. et al. Engineering the third wave of biocatalysis. Nature 485, 185–194 (2012). 129 28. Frokjaer, S. & Otzen, D. E. Protein Drug Stability: A Formulation Challenge. Nat. Rev. 4, 298–306 (2005). 29. Kunkel, T. A. Rapid and efficient site-specific mutagenesis without phenotypic selection. Proc. Natl. Acad. Sci. 82, 488–492 (1985). 30. Studier, F. W. Protein production by auto-induction in high-density shaking cultures. Protein Expr. Purif. 41, 207–234 (2005). 31. Ericsson, U. B., Hallberg, B. M., DeTitta, G. T., Dekker, N. & Nordlund, P. Thermofluorbased high-throughput stability optimization of proteins for structural studies. Anal. Biochem. 357, 289–298 (2006). 130 CHAPTER FIVE Improving the expression of a polyketide synthase in a biosynthetic pathway using deep mutational scanning and GFP-fusion screening 131 ABSTRACT Engineering organisms for the production of various chemicals, therapeutics, and fuels is a promising and sustainable alternative to other sources. To this end, enzymes, the entities responsible for in vivo conversion of feedstocks into desired products, are heterologously expressed in chassis organisms such yeast or bacteria. However, poor expression of heterologous proteins can create bottlenecks in pathways. These efforts are complicated by the low probability of proteins to fold and function in non-native environments, often at concentrations far past their solubility limit. Further, mutations that improve the stability and solubility of a protein often come with a tradeoff for activity. In this work, we sought to improve the stability and solubility of a Type III polyketide synthase (PKS) using GFP-fusion high-throughput screening coupled to deep mutational scanning. This PKS comes from the tropane alkaloids biosynthetic pathway from Atropa belladonna, of which we aim to reconstruct a portion of the pathway, including the PKS, in Saccharomyces cerevisiae. Hits from the screen were filtered for their probability to be catalytically neutral, and combinatorial libraries were prepared and screened for improved expression. Stabilized variants containing ≥12 total mutations were identified, with the best variant providing a 6.2-fold improvement in expression in E. coli. However, all initial designs had negligible activity. Analysis of individual PKS point mutants revealed that at least two of the included mutations significantly reduce activity. Future work includes generating backcrosses of these inactivating mutations on the stabilized designs and engineering the tropane alkaloids pathway in yeast. 132 INTRODUCTION Biomanufacturing – the production of molecules by engineered microorganisms – is a viable and sustainable alternative to traditional chemical synthetic routes1. Key factors influencing the rapid advancement of this field include the dramatic increase of available gene coding sequences, improved accuracy and reduced cost of synthetic DNA synthesis and assembly2, and improved tools for engineering biology3–7. However, the economization of biomanufacturing processes to compete with alternative sources (i.e. fluctuating crude oil prices) remains a grand challenge. For example, reported titers from the production of plant secondary metabolites in microbes are generally miniscule (microgram per liter quantities), necessitating significant optimization efforts to compete with plant-derived or traditional chemical synthetic routes. At the heart of these processes are the enzymes responsible for chemical transformation; a typical engineered organism will express at least five heterologous proteins. While other factors are certainly important (toxicity, metabolic flux balancing, localization, etc.), the efficiency of each enzyme step is a fundamental determinant for specific productivity and production titer. Flux through an enzyme, JE, can be modeled as a function of total active enzyme, [E]t, its catalytic efficiency (KM and kcat), and the thermodynamic reversibility of a reaction8: J" = E % × k ()% S K, 1+ S K, ∆2345 67 × 1−e                                                                                       (1) where ∆Grxn = -RTK’eq + RTln(p/s), with p = concentration of product and s = concentration of substrate. At equilibrium, ∆Grxn will go to zero and thus thermodynamic reversibility term will 133 become zero. Provided that an enzyme is not limited by catalytic efficiency, poor functional expression can significantly hinder productivity and efficiency. Indeed, a review of current literature reveals that several recently engineered pathways are limited by poor expression of at least one enzyme (Table D 5.1)9–20. Native proteins are marginally stable, and their native expression levels are often at their solubility limit. A common strategy to overcome a bottleneck enzyme is to overexpress the protein. However, overexpression of heterologous genes can lead to poor solubility and aggregation. Recently, Klesmith et al. demonstrated that by improving the apparent melting temperature of a Lipomyces starkeyi levoglucosan kinase by 5.1°C while maintaining near wildtype activity, growth rates of levoglucosan kinase expressing E. coli fed levoglucosan as a sole carbon source were improved by 15-fold from the wild-type enzyme21. Similarly, Xie et al. engineered improved solubility of Simvastatin synthase in E. coli and achieved approximately 50% increase in whole cell activity and solubility22. While these examples are demonstrative, this strategy has not been widely adopted amongst pathway engineers. Stability and solubility engineering of enzymes is complicated by the need to maintain functional enzymatic activity, and mutations with stabilizing effects often come with a tradeoff for protein function. To address this challenge, we recently published on identifying stabilizing ‘hits’ using high-throughput screens for stability and solubility in deep mutational scanning experiments23. Existing comprehensive single-mutation functional datasets for two enzymes were compared against datasets generated with the stability screens. In short, we found a 90% probability of choosing a catalytically neutral mutation by filtering out mutations that were near the active site, not evolutionarily conserved, or buried in the protein core. 134 As a rigorous test and application of this method, in this work we sought to improve the expression of an enzyme without an existing functional dataset or crystal structure. As a model system, we performed a GFP-fusion stability scan on a Type III polyketide synthase (PKS) from Atropa belladonna (Ab) that expresses very poorly in both bacterial and yeast systems. This enzyme is part of the Tropane Alkaloids (TA) pathway from Ab that was recently elucidated by the Barry Lab at Michigan State University (Figure 5.1, unpublished data). 135 RESULTS As a model system to test our hypothesis that improving protein expression boosts the productivity of engineered metabolic pathways, we sought to reconstruct a portion of the TA pathway in Saccharomyces cerevisiae. The engineered pathway begins with the pathway precursor putrescine followed by chemical transformations to the TA pharmacore tropine. This transformation is catalyzed by five enzymes: putrescine methyltransferase (PMT), methylputrescine amine oxidase (MPO), a Type III polyketide synthase (PKS), tropinone synthase (TS), and tropinone reductase (TRI) (Figure 5.1a). We first evaluated the localization and expression of native Ab genes in yeast by generating EGFP-tagged fusions of each gene under the control of a galactose inducible promoter. Fluorescence microscopy of S. cerevisiae strain BY471024 expressing EGFP-fusions revealed that all genes except for MPO were expressed in the cytosol (Figure 5.1b). The native MPO contains a canonical Ala-Lys-Leu C-terminal peroxisomal targeting sequence (PTS). Co-expression of an RFP-tagged peroxisomal protein25 and EGFPMPO confirmed that the MPO localizes to the yeast peroxisome (Figure 5.1b). Figure 5.1: Overview of the Tropane Alkaloids (TA) pathway enzymes. a.) The pathway precursor, putrescine, will be fed in the growth medium of yeast cells expressing five enzymes: 136 Figure 5.1 (cont’d) PMT, MPO, TRI, TS, and PKS. The final products, tropinone and tropine will be detected from the cultures using LC/MS methods developed by the Barry Lab. b.) Fluorescence microscopy images of EGFP-tagged Ab genes. PMT, PKS, and TRI indicate cytosolic expression, whereas MPO localizes to the yeast peroxisome as confirmed by colocalization with Pex11-mKate2 fusion protein. We quantified the mean fluorescence of the EGFP-tagged Ab gene products by flow cytometry and found that the PKS expressing cells were significantly less fluorescent compared to the other genes, indicating poor expression (Figure 5.2). Attempts by the Barry Lab to express and purify PKS for characterization yielded extremely low levels of active protein (~5 mg/L yield from auto-induction cultures, unpublished data). They found that essentially all of the protein was insoluble and that activity sharply declines at temperatures in excess of 25°C. Together, these data indicate that the PKS expresses poorly in both bacterial and yeast hosts and support engineering the enzyme to find variants with improved stability and solubility. Figure 5.2: Relative fluorescence of EGFP-tagged Ab genes in yeast. Fluorescence of S. cerevisiae strain BY471024 cells expressing EGFP-tagged Ab genes under galactose induction was quantified using flow cytometry. Error bars represent the standard deviation of at least two independent measurements. TRII is a homolog of TRI that produces pseudo-tropine. 137 In effort to improve the expression of the PKS, we sought to use deep mutational scanning coupled to a high-throughput screen for stability and solubility. Due to its experimental ease, we first explored the use of yeast surface display coupled to FACS26. The PKS coding sequence was cloned into the pETConNK backbone23,27 and expressed with galactose induction at 22°C in EBY100 cells. Initial tests for display proved futile (Figure D 5.1). We then tested several alternate induction temperatures (18-30°C) as well as mutating a potential glycosylation site at Asn339. We hypothesized that glycosylation of this asparagine could disrupt folding and thus ability to display, so we made and tested PKS_N339A using nicking mutagenesis27. However, despite these troubleshooting efforts, we were unable to successfully display the PKS on the yeast surface. We next attempted use of a GFP-fusion stability screen. The concept of GFP-fusion is to genetically encode a gene of interest linked to GFP generating a fusion protein (Figure D 5.2a), such that the folding probability of GFP is tied to the folding probability of the gene of interest. Expression of a protein library can then be screened by fluorescence intensity using FACS (Figure D 5.2b). Recently, the use of a GFP-fusion stability screen28 in deep mutational scanning experiments was validated in our lab using a levoglucosan kinase (LGK) from Lipomyces starkeyi as a model system (unpublished data). The screen was able to correctly identify 9/12 known stabilizing LGK mutations (P-value < 0.0001, Fisher’s exact test). Thus, GFP-fusion has been validated in a deep mutational scanning pipeline. The objective of this project was to perform a GFP-fusion deep mutational scan on PKS, filter the resulting hits for probability of maintaining catalytic activity using recently published classification methods23, and then generate a combinatorial library to identify active variants with improved expression (Figure 5.3a). We generated a comprehensive single-site saturation 138 mutagenesis library of PKS using nicking mutagenesis27. Plasmid DNA libraries were transformed into BL21*(DE3) and protein expression was induced with IPTG (see MATERIALS AND METHODS). Individual cells were sorted using FACS and two populations were collected: a reference population and the top 8-10% of cells based on GFP fluorescence intensity. The resulting samples were deep sequenced and the population counts and enrichment ratios of wild-type and each variant were calculated using Enrich29 (see Table D 5.2 for library statistics). A normalized stability metric, z, for each variant was calculated as outlined in Kowalsky et al.30, where a stability metric of zero corresponds to wild-type, above zero are beneficial mutations (more stable), and below zero are deleterious. Unfortunately, the reference population for the gene tile covering residues 157-234 did not grow after FACS, thus we chose to omit these positions from further analysis. Deep sequencing of the reference populations revealed 84.3% coverage of single nonsynonymous (NS) mutations (5107/6060). Nonsense mutations had a mean stability metric of -0.653 ± 0.48 (1 s.d.), which was significantly lower than the mean of -0.0561 ± 0.54 for missense mutations (P-value < 0.0001, two-tailed unpaired Student’s T-test). To evaluate the reproducibility of the method, we performed replicate sorting, deep sequencing, and analysis for one gene tile. The Pearson’s correlation coefficient between replicates was found to be 0.72, which is low compared to previous deep mutational scanning experiments (coefficients of 0.8523 and 0.9331 have been previously reported from our lab). As reproducibility generally improves with increasing depth of sequencing coverage, we calculated Pearson’s correlation coefficients for mutations with at least 100 read counts in the reference population and found the coefficient improves to 0.83. Thus, the relatively low depth of coverage in this experiment partially but not completely explains the relatively high variance between replicates. 139 Since we are interested in ‘improved’ variants, we next asked how correlation scales with coverage for variants with at least 20% improvement in stability metric (z>0.15). We found that variants with ≥50 average selected read counts had a Pearson’s of 0.84, which we deemed a reasonable threshold for reliability of the deep sequencing experiment to yield predictive results on a mutation’s stability effect (Table 5.1). Table 5.1: Correlation of stability metrics for beneficial mutations from replicate GFPfusion experiments based on depth of sequencing coverage. Average selected read counts between replicates were calculated and Pearson’s product moment correlation coefficients were determined above the indicated read count thresholds. N indicates the number of mutations at each threshold. Average selected read count threshold ≥10 ≥30 ≥50 ≥100 N 298 280 247 193 Pearson's 0.70 0.71 0.84 0.90 The GFP-fusion experiment identified an astounding 1,115 beneficial missense mutations (z>0.15) with ≥50 selected read counts (19.4% of total tested). As the objective was to generate a stabilized PKS variant with wild-type catalytic activity, hits were filtered using a multiple filter approach as validated in Klesmith et al.23 for their probability to be catalytically neutral23. In brief, hits at positions within 15 Å of the active site or with a PSSM score <3 were removed. Mutations with PSSM scores ≥3 represent mutations that have been evolutionarily conserved, and thus are less likely to impact catalytic function (see MATERIALS AND METHODS for details of PSSM generation). Proximity to the active site was calculated using a homology structure for PKS generated with I-TASSER with default options32. The resulting set of hits postfilter was comprised of 38 mutations at 35 unique positions (Table D 5.3). 140 Using the homology structure, we selected 23 mutations to include in a combinatorial library27. Included in this set were 5 mutations that did not pass the stringent filtering criteria, but were included due to having either relatively high stability metric scores and/or PSSM scores of ≥0 (Table D 5.4). Initially, BL21*(DE3) cells expressing the combinatorial library were cultured, induced with IPTG, plated and grown at 30°C, and visually screened for fluorescence intensity. 20 variants were picked and their fluorescence was quantified using flow cytometry (Figure D 5.3). The best variants from this initial screening effort only provided approximately 3-fold improvement over wild-type. Because the number of library members one can reasonably screen on plates is low (~102), we performed a FACS sort to enrich the library for the top 3% of cells based on GFP fluorescence intensity. Sorted cells were plated and visually screened as before, and 10 additional variants were picked for isogenic characterization. The best variant, PKS.21, provided a 6.2 ± 0.01-fold improvement in fluorescence over wild-type. Next, we selected the top four variants (PKS.4, PKS.20, PKS.21, PKS.23), sequenced and cloned them into a GST-tagged expression vector, and characterized them for relative catalytic activity. Each design had ≥12 mutations, with 8 mutations shared amongst all designs (V12I, S37A, P106A, M115R, A121G, A143V, T245A, and S284K, Figure 5.3a). Based on the homology structure, these mutations generally appear to alter surface charge characteristics, core packing, loop flexibility, or dimeric interface contacts (Figure. 5.3b). For example, N64E/D introduce a negative charge to a patch on the surface that is otherwise positive. V282I is a hydrophobic-hydrophobic mutation in the core of the protein that presumably improves hydrophobic packing. Lastly, A121G likely improves loop flexibility. Interestingly, when we screened these hits for activity with a lysate assay we found that all had no detectable activity except for PKS.20, which had approximately 0.1% of the activity 141 Figure 5.3: Combinatorial PKS hits. a.) Sequencing of the top four combinatorial hits revealed several shared mutations. Grey shading indicates the mutation is present. b.) Structural analysis of beneficial mutations indicates that many improve surface charge characteristics, hydrophobic core packing, and secondary structure like loops. The grey surface representation is the dimer subunit. C167 is the putative catalytic nucleophile. c.) Point-mutant analysis of the 8 shared mutations. Lysates of GFP-fused PKS mutants were tested for their activity and fluorescence intensity (expression yield). Error bars represent 1 s.d. of technical replicates. Axis are both logarithmic scale (base 10). compared with wild-type (data not shown). We hypothesized that one or more of the 8 shared mutations were responsible for destroying enzyme activity. To test this hypothesis, we generated the 8 point-mutations in the GFP-fusion background using nicking mutagenesis27 and performed lysate activity assays. We found two mutations, P106A and A143V, that reduced activity to 0.93% and 0.02% of wild-type activity (Figure 5.3c). Not surprisingly, these two mutations were 142 ones that did not pass the filtering criteria because they had low PSSM scores of -2 (P106A) and 0 (A143V), however were included in the library as they had stability scores of 0.641 and 0.257, respectively. This result underlines the importance of the filtering criteria: mutations that improve stability but are not evolutionarily conserved are significantly more probable of being deleterious for catalytic function. DISCUSSION AND FUTURE WORK In this work, we performed a high-throughput screen for stability and solubility to test thousands of mutations on a protein sequence. Deep sequencing driven protein science enables the generation of previously unthinkable amounts of mutational data33. Applications stretch as far as studying enzyme function21,34–38,31,39,40, probing mechanics of evolution31,41,42, and antibody engineering43,44.The ability of deep mutational scanning – counting DNA sequences from unselected and selected populations – to recapitulate various biological, chemical, and physical phenomena is wholly dependent on the quality of the screen or selection method used. Replication of experiment is an important metric for all methods, and the correlations observed in this work indicate that GFP-fusion, at least for PKS, is not as sensitive as other highthroughput screening technologies. Nevertheless, computational methods for filtering hits can aid in generating stabilized protein designs. However, if given the option a more robust stability screen such as yeast surface display23,26 is more desirable. There are two important takeaways from this project. First, this work clearly validates the filtering method previously developed by Klesmith et al.23. Results from the point-mutation analysis indicate that although certain mutations provide stabilizing effects, if they were not conserved in nature they are likely to be deleterious for function. Indeed, proline 106 is a 143 canonical example of this stability/function trade-off. P106 lies in the middle of a helix, where prolines are generally disfavored, and the solubility screen indicates that several other residues at this position improve overall stability of the protein. However, the PSSM indicates that proline is highly conserved and thus important to catalytic function. The P106A variant increased the solubility of the PKS-EGFP fusion but almost completely ablated activity. Since all characterized PKS hits (PKS.4, PKS.20, PKS.21, and PKS.23) contained P106A and A143V that did not pass the filter, immediate next steps include backcrossing these mutations and testing the resulting enzyme variants for activity. The second important takeaway is now that the filtering method has been validated on 3 different enzymes, library size from the outset can be significantly decreased; any mutation that does not pass the filtering criteria need not be included in the library. In the absence of stability metric data, for the PKS removing mutations with PSSM scores <3 reduces the library size from 7,448 (392 positions with 19 amino acid substitutions) down to 154, and removing positions within 15 Å of the active site reduces the library size down to 123. Testing approximately one hundred mutations versus thousands (comprehensive scan of a gene) is certainly more practical and economical. Notably, nicking mutagenesis developed in Chapter 2 enables such efficient library generation. A remaining objective for this project is to reconstruct the TA pathway in yeast. This will be accomplished using the hierarchal MoClo cloning strategy and a kit of characterized parts for yeast45 developed by John Dueber and company at University of California, Berkeley. Multigene cassettes will be integrated into the chromosome of yeast strain BY474224, as there are several available options for auxotrophic selection. Initial designs will feature genes placed under the control of medium strength constitutive promoters with the objective of detecting 144 tropinone and/or tropine in 72hr cultures. Once confirmed that this is the minimum set of genes to produce tropane alkaloids in yeast, optimization of expression elements (promoters, copy number, integration location, etc.) will be carried out using a modular approach. In brief, optimization of the PMT+MPO, then PMT+MPO+PKS, and lastly PMT+MPO+PKS+TS+TRI will be done. Additionally, three putative cytochrome P450 reductases (CPR) for TS will be screened for activity and their expression level will be optimized with respect to TS. Ultimately, the objective is to test stabilized active PKS designs versus the wild-type in the context of the engineered pathway to see the effects on pathway productivity. MATERIALS AND METHODS Reagents Chemicals were sourced from Sigma-Aldrich unless otherwise noted. Mutagenic oligos were designed using the QuikChange Primer Design Program (Agilent, Santa Clara, CA). All oligonucleotides were ordered from Integrated DNA Technologies (Coralville, IA). All minipreps were done using the Monarch Plasmid Miniprep Kit (New England Biolabs). Plasmid construction Yeast gateway expression constructs used for fluorescence microscopy were made using plasmids from the Yeast Gateway Kit from the Lindquist Lab (available from www.addgene.com). PMT, PKS, TRI, and TRII Atropa belladonna gene sequences were cloned from pENTR/D-TOPO entry vectors into the pAG424GAL_EGFP_ccdB plasmid using LR cloning kit (Thermo Fisher Scientific). MPO was cloned as above but into a gateway plasmid modified to harbor an RFP peroxisomal marker. The Pex11-mKate2 transcriptional unit from 145 pWCD2520 (gifted from the Dueber Lab) was PCR amplified attaching BsmBI sites on either side. The resulting amplicon was subcloned into the pAG424GAL-EGFP-ccdB plasmid between the two BsmBI sites, generating construct pAG424GAL-EGFP-MPO_Pex11/mKate2. The pET29NK_/mGFPmut3 vector was constructed by Justin Klesmith as follows. Overhang PCR was used attaching a 5’ XhoI site and a 3’ His6x, stop codon, BbvCI site to mGFPmut3 from a plasmid based from pJK_proB_GFP (45). Similarly, overhang PCR was used to add a BbvCI site to pET29b just after the stop codon in the plasmid. The mGFPmut3 construct was cloned between the XhoI and BbvCI sites using standard techniques to make the fusion construct -Leu-Glu-mGFPmut3-His6x. pET29NK-PKS-RD/mGFPmut3 was constructed by overhang PCR of the PKS coding sequence attaching NdeI and XhoI sites for ligation into the pET29NK-GOI/mGFPmut3 construct. pGEX expression vectors were made by subcloning PKS genes between the BamHI and SmaI sites of the pGEX-4T1 backbone. Wild-type and variant PKS sequences were amplified by overhang PCR attaching 5’ BamHI and 3’ SmaI restriction sites and then cloned into the pGEX backbone following standard restriction digest and ligation protocols. PKS comprehensive point-mutant library construction Nicking mutagenesis was used to generate a comprehensive single-site mutagenesis library on the pET29NK_PKS-RD/mGFPmut3 plasmid27. Degenerate NNK primers covering residues Lys8 to Arg388 were used in 5 separate mutagenesis reactions to generate 5 gene tiles: T1 (K8-E78), T2 (I79-S156), T3 (V157-G234), T4 (L235-I312), T5 (V313-R388) (see Table D 5.2). Mutagenesis reaction products were transformed into XL1-Blue Electrocompetent Cells 146 (Agilent) and plated on 245mm x 245 mm Large Bioassay Dishes (Sigma). The following day, cells were scraped and plasmids harvested with a miniprep. 30 ng of library plasmid DNA was transformed into electrocompetent E. coli BL21*(DE3) cells and plated as above. The following day, cells were scraped and used to inoculate a 100 mL LB culture at an initial OD600 of 0.05 and grown at 30°C, 250 rpm. Once the cultures reached an OD600=0.4-0.6, DMSO was added (7% v/v) and 1 mL aliquots were flash frozen in liquid nitrogen. FACS screening of GFP-fusion libraries Kanamycin was used at a final concentration of 50 µg/mL in all cultures. Library cell stocks were thawed on ice for 30-45 minutes and washed twice in TB media. For each library, 3 mL TB cultures were inoculated to an initial OD600=0.05 in Hungate tubes and grown at 30°C with 250 rpm shaking. Once the cultures reached an OD600=0.8-1.6, cultures were diluted into a fresh 3 mL TB culture in Hungate tubes to an initial OD600=0.0025. After approximately 4-5 generations (OD600=0.05-0.08) IPTG was added to a final concentration of 250 µM. Once the cultures reached an OD600=0.25-0.3, 1 mL was pelleted and washed with cold sterile PBS twice. Cells were sorted on a BD Influx cell sorter. 700,000 cells each from two populations were collected for each sample: a reference population (FSC vs. SSC gate), and a selected population (intersection of FSC vs. SSC gate and top 8-10% based on GFP fluorescence intensity with a 530/40 nm filter [488 nm]). The collected cells were added to 10 mL TB media and grown at 25°C with 250 rpm shaking until they reached an OD600=0.3-0.6. Cells were pelleted and stored at -20°C until the DNA was extracted with a miniprep. 147 DNA deep sequencing and analysis Library DNA was prepared for deep sequencing using Method B PCR amplification as described in Kowalsky et al.30 (PCR primers listed in Table D 5.5). Amplicons were cleaned using Agencourt AMPure XP Beads (Beckman Coulter) and quantified using Quant-iT PicoGreen reagent (Life Technologies). Deep sequencing was performed on an Illumina MiSeq with 250 bp paired-end reads. The resulting data was processed using Enrich29 and custom scripts freely available from GitHub (user JKlesmith). Stability scores for each mutant were calculated exactly as described in Klesmith et al.23. PKS PSSM generation A PSSM for the Ab PKS gene was generating following similar methodologies outlined in Goldenzweig et al.46 and Klesmith et al.23. A BLASTp search47 of the nonredundant protein database was done for the PKS sequence with an e-value cutoff of 10-4, excluding synthetic and engineered items from the search. The top 20,000 results were saved and sequences with less than 30% sequence identity and/or 60% coverage of the query sequence were removed. Hits were clustered using Cd-hit48 with a 98% threshold and the top 500 clusters were aligned using MUSCLE49. The alignment was split into 20 amino acid sections (to reduce gap penalty) and then PSI-BLAST50 was used to generate a PSSM. Combinatorial library generation and screening methods The single and multi-site nicking mutagenesis protocol27 was used to generate combinatorial PKS libraries. Two separate reactions were performed at a primer:template molar ratio of 3:1 and 10:1 in effort to obtain mutants with a range in number of total mutations. The 148 two reactions were transformed into XL1-Blue Electrocompetent Cells (Agilent) and plated on 245 mm x 245 mm Large Bioassay Dishes (Sigma). Cells were scraped and plasmids were harvested with a miniprep. 10 ng of library DNA was transformed into electrocompetent BL21*(DE3) and plated as above. The following day, cell stocks were prepared from cell scrapings following the same methods used for the comprehensive single-mutation PKS libraries. For the first round of plate screening, library cell stocks were thawed on ice for 30-45 minutes and then washed twice in TB. An overnight TB cultures was started at an initial OD600=0.05. In the morning, a fresh 3 mL TB culture was inoculated to an OD600=0.02 in Hungate tubes. Once the cultures reached an OD600=0.2-0.3, IPTG was added to a final concentration of 250 µM. Cells were grown for 2.5 hours and then plated on plain LB-Agar plates and grown at 30°C overnight. Individual colonies were visually screened for fluorescence intensity and ‘winners’ were picked to be grown in isogenic 2 mL TB cultures overnight. PKS.1PKS.10 and PKS.11-PKS.20 originated from the 3:1 and 10:1 primer:template ratio libraries, respectively. The next day, glycerol stocks were made of each variant for downstream characterization. In effort to enrich for the best variants, a second round of plate screening was performed following an initial FACS sort. Cell stocks were thawed and cultured and induced as above, and then washed twice in cold filtered PBS. Libraries were sorted on a BD Influx sorter and 20,000 cells from the intersection of the FSC vs. SSC gate and the top 3% based on GFP fluorescence intensity (530/40 nm filter [488 nm]) were collected. 2 mL of TB media was added to the collected cells and plated on 245 mm x 245 mm Large Bioassay Dishes (Sigma) with LB-Agar and Kanamycin at a final concentration of 50 µg/mL. Plates were visually screened as before, and ‘winner’ colonies picked for growth in isogenic cultures. 149 Characterization of combinatorial hits and point-mutations Quantification of the mean fluorescence intensity of GFP-fused PKS hits was performed as follows. Overnight cultures of wild-type and variant PKS were started from isogenic glycerol stocks in 2 mL TB. In the morning, 3 mL TB cultures were prepared from overnight cultures at an initial OD600=0.02 in Hungate tubes. Cells were grown until the OD600 reached 0.2-0.3 and then IPTG was added at a final concentration of 250 µM. After 2.5 hours, 1 mL of culture was removed and washed twice in filtered PBS. Cells were diluted with PBS to an OD600=0.1 and fluorescence intensity was measured on a BD Acuri C6 Flow Cytometer. The sort was run for 50,000 events in a polygon gate drawn on the FSC vs. SSC plot. Mean fluorescence was obtained on the plot of the intersection of the FSC vs. SSC and an FL1 vs. Count plot, where FL1 represents fluorescence intensity using a 510/15 nm filter. Auto-induction cultures of wild-type and variant PKS proteins for initial activity assays were prepared as follows. Isogenic overnight cultures in 2 mL TB were started from glycerol stocks of BL21*(DE3) cells harboring the pGEX-PKS plasmids. In the morning, 1 mL of overnight culture was removed, spun down at 8,000xg for 3 minutes, and resuspended in 1 mL of standard auto-induction media37. The OD600 was measured and 1 mL solutions of cells at an OD600=0.5 in auto-induction media were prepared. 500 µL of these solutions were added to 25 mL pre-warmed auto-induction media in 250 mL Erlenmeyer flasks and were grown at 30°C with 250 rpm shaking for 6 hours. Cultures were then switched to grow at 18°C with 250 rpm shaking for 20 hours. Cultures were spin down at 8000xg for 20 minutes at 4°C and the wet cell weight recorded. Cell pellets were washed once and then resuspended in PBS to a final volume of 10 mL at an OD600=10. Samples were pelleted and stored at -80°C for future analysis. To prepare lysates, cell pellets were thawed and resuspended to an OD=2.5 in resuspension buffer 150 (10% glycerol, 147mM NaCl, 4.5 KCl, 100 mM HEPES pH 8.0, 1x Sigma Protease Inhibitor Cocktail, 5 mM DTT, 1 mg/mL lysozyme). 1 mL of resuspension was lysed in a 1.5 mL microfuge tube on ice using a sonicator fitted with a 1/8” horn with the following settings: 3 sec on, 10 sec off, 60 sec total on time, 37% amplitude. Lysate activity assays were performed by Matt Bedewitz in the Barry Lab at Michigan State University. Standard activity assays for AbPKS were performed using 25 mM potassium phosphate buffer pH 8.0, 50 µM N-methyl-Δ1-pyrrolinium hydrochloride, and 100 µM malonylcoenzyme A lithium salt (Santa Cruz Biotechnology Cat. No. sc-215286) in 50 µL, using 5 µL of crude lysate. Reactions were stopped using 100 µL 2% formic acid, 200 mM ammonium formate, 5% methanol, and 2 µM Telmisartan as an internal standard. Reaction products were analyzed using a Waters Acquity TQ-D Mass Spectrometer coupled to a Waters Acquity UPLC system. Parameters for electrospray ionization in positive-ion mode were as follows: 2.99 kV capillary voltage, source temperature of 130°C, desolvation temperature of 350°C, and desolvation gas flow of 700 L/h, with MS/MS transitions as provided in Table D 5.6, using the gradient described in Table D 5.7 with a flow rate of 0.3 mL/min. Chromatography was performed using an Ascentis Express PFPP column (2.1 × 100 mm with 2.7-µm particle size) with an oven temperature of 50°C and an injection volume of 10 µL. Where reaction products were quantified, 4-(2-N-methylpyrrolidine)-3-oxobutanoic acid was quantified using a standard generated the same day via alkaline hydrolysis of 4-(2-Nmethylpyrrolidine)-3-oxobutanoic acid methyl ester. This hydrolysis was performed as follows: 12 µL 25 mM 4-(2-N-methylpyrrolidine)-3-oxobutanoic acid methyl ester in THF was added to 138 µL of THF in a glass vial. The hydrolysis was begun by addition of 150 µL of 0.335 M ammonium hydroxide. Hydrolyses were performed for 4 h at 37° C and quenched by addition of 151 300 µL 0.26 M ammonium formate and 1% formic acid. A standard curve of 4-(2-Nmethylpyrrolidine)-3-oxobutanoic acid was determined from this solution by subtraction of unreacted 4-(2-N-methylpyrrolidine)-3-oxobutanoic acid methyl ester and spontaneous decarboxylation product hygrine from the calculated concentration of standard. Cuscohygrine in reactions was quantified using a cuscohygroline standard. All other compounds were quantified using authentic standards. N-methyl-Δ1-pyrrolinium hydrochloride, 4-(2-N-methylpyrrolidine)-3oxobutanoic acid methyl ester, and cuscohygroline were kindly provided by John d’Auria, Texas Tech. Lysates for PKS point-mutant activity assays were prepared as follows. 3 mL TB cultures were inoculated to an initial OD600=0.05 and grown until OD600=0.15. IPTG was added to a final concentration of 250 µM, and cells were grown for 3 hours. Cultures were transferred to ice, spun down at 8,000xg for 5 minutes, washed twice in cold filtered PBS, and resuspended in 1 mL resuspension buffer. Cells were lysed by sonication as above. GFP fluorescence was quantified on a BioTek Hybrid plate reader in 96-well black round-bottom plates. 200 µL lysate was quantified with the following parameters: excitation=485 nm, emission=507 nm, gain=50, height=0.7. Lysate activity assays were performed as described above. 152 APPENDIX 153 APPENDIX Figure D 5.1: Display of PKS on the surface of yeast proves unsuccessful. a.) A positive control for display yields two distinct populations: a non-displaying (leftmost) and a displaying (rightmost). B.) Initial attempts to display the PKS at 22°C with galactose induction were unsuccessful. Several induction conditions were tested, including temperatures ranging from 18°C (c) up to 30°C (d). We also mutated a potential glycosylation site, Asn339, to alanine, however this also was unsuccessful (e). 154 Figure D 5.2: Overview of GFP-fusion deep mutational scanning experiment. a.) A protein of interest is genetically encoded as a fusion to GFP. Upon expression, folded proteins will permit the folding and subsequent chromophore formation of GFP, while unfolded proteins will be non-fluorescent. b.) In the GFP-fusion deep mutational scanning experiment, a comprehensive site-saturation library of PKS was generated using nicking mutagenesis27, expressed in BL21*(DE3) with IPTG induction, and sorted using FACS. The resulting libraries were then deep sequenced. 155 Figure D 5.3: Relative fluorescence intensity of combinatorial PKS hits. Mean fluorescence intensity of E. coli expressing GFP-fusions of the above hits obtained from plate screens was quantified using flow cytometry and normalized to the fluorescence of wild-type PKS expressing cells. 156 Table D 5.1: Literature examples of engineered biosynthetic pathways in microbes with limiting enzymes. Product Reticuline Thebaine Synthetic Biodiesel Glucaric Acid Ethyl Ester n-butanol, 1butanol MEP (DXP) pathway Isobutanol Etoposide Serotonin Flavonoids Limiting Enzyme(s) Publication(s) norcoclaurine synthase, tyrosine hydroxylase salutaridine synthase DeLoache et al. 20159 wax-ester synthase Steen et al. 201011 myo-inositol oxygenase alcohol-Oacetyltransferase butyryl-CoA dehydrogenase from Streptomyces collinus Shiue and Prather 201412 Several Zhou et al. 201216 alcohol dehydrogenase Several monooxidase cytochrome P450 reductase Atsumi et al. 201017 Lau and Sattely 201518 Ehrenworth et al. 201519 Galanie et al. 201510 Zhu et al. 201513 Steen et al. 200814, Atsumi et al. 200815 Kim et al. 200920 157 Table D 5.2: Library statistics for PKS comprehensive single-mutation libraries. NS = nonsynonymous. Residues 878 Residues 79156 Residues 235-312 Residues 313-388 Number of initial library transformants 7E+04 9E+04 8E+04 1.8E+05 Reference population DNA reads post quality filter 391,403 144,646 1,605,888 Post-selection population DNA reads post quality filter 230,172 932,646 814,066 323,027 (rep1) 711,320 (rep2) Percent of reads in reference population with: No NS mutations One NS mutation Multiple NS mutations 46.4 49.6 4.0 49.2 47.0 3.8 39.4 57.5 3.2 45.4 49.4 5.2 Coverage of possible single NS mutations: 80.4 71.9 91.5 93.2 158 486,129 Table D 5.3: Filtered beneficial mutations from the GFP-fusion experiment. Wildtype residue V F W S N D D N N K H N F V I V M M D A I L L F T T V V N N V S I I Q K G S D V Position 12 31 33 37 48 50 54 64 64 82 89 90 94 96 100 113 115 115 118 121 127 235 235 240 244 245 250 250 252 281 282 284 287 304 328 355 357 364 366 372 Mutation I P C A K E E D E E Y M L A M A K R K G V I V Y S A L I D H I K V V K R A T E C Stability Metric 0.206 1.254 0.532 0.967 0.461 0.589 0.164 0.297 0.320 0.240 0.157 0.323 0.220 0.246 0.169 0.497 1.603 0.871 0.390 0.756 0.228 0.504 1.043 0.216 0.666 0.529 0.268 0.240 0.649 0.771 0.430 0.681 0.510 0.289 0.321 0.221 0.744 0.251 0.387 0.660 Reference read counts 54 12 28 25 70 64 80 64 217 29 8 21 46 51 18 346 10 104 51 49 26 37 115 82 549 261 413 77 61 12 40 28 48 544 143 185 68 490 1099 305 159 Selected read counts 59 50 52 80 117 129 81 82 289 326 77 274 498 580 177 6028 555 3034 746 1238 286 126 791 170 2380 925 942 167 258 60 121 124 165 1290 81 87 75 244 699 299 Enrichment ratio 0.89 2.82 1.66 2.44 1.51 1.78 0.78 1.12 1.18 0.80 0.58 1.02 0.75 0.82 0.61 1.43 3.11 2.18 1.18 1.97 0.77 1.42 2.44 0.71 1.77 1.48 0.844 0.771 1.73 1.98 1.25 1.80 1.44 0.90 0.90 0.64 1.87 0.72 1.07 1.70 CA distance to active site (Å) 39.1 22.8 21.0 23.5 27.2 28.4 27.7 20.6 20.6 28.0 20.1 20.5 19.4 22.3 20.3 21.9 25.0 25.0 29.1 32.1 28.3 35.9 35.9 24.0 21.9 18.5 15.0 15.0 18.1 17.3 16.7 20.9 23.8 15.5 20.5 26.0 29.8 26.5 26.4 18.6 PSSM score 3 5 8 3 3 6 4 5 3 6 8 7 5 4 5 5 5 3 5 6 3 3 4 3 4 3 3 3 6 3 7 6 3 3 3 3 3 5 5 4 Table D 5.4: Mutations included in the combinatorial PKS library. Highlighted mutations are ones that did not pass the filtering criteria. Wildtype residue V S D N N N P M D A R A L T T V V V S I S G D Position 12 37 50 64 64 90 106 115 118 121 135 143 235 244 245 250 250 282 284 301 318 357 366 Mutation I A E E D M A R K G S V V S A L I I K V E A E Stability Metric 0.206 0.967 0.589 0.320 0.297 0.323 0.641 0.871 0.390 0.756 0.530 0.257 1.043 0.666 0.529 0.268 0.240 0.430 0.681 0.565 0.358 0.744 0.387 Reference read counts 54 25 64 217 64 21 66 104 51 49 40 97 115 549 261 413 77 40 28 154 86 68 1099 160 Selected read counts 59 80 129 289 82 274 1425 3034 746 1238 733 1125 791 2380 925 942 167 121 124 576 52 75 699 Enrichment ratio 0.894 2.444 1.777 1.179 1.123 1.017 1.744 2.178 1.182 1.970 1.507 0.847 2.437 1.771 1.480 0.844 0.771 1.251 1.801 1.558 0.998 1.865 1.071 CA distance to active site (Å) 39.1 23.5 28.4 20.6 20.6 20.5 19.4 25.0 29.1 32.1 8.4 19.0 35.9 21.9 18.5 15.0 15.0 16.7 20.9 24.4 22.0 29.8 26.4 PSSM score 3 3 6 3 5 7 -2 3 5 6 1 0 4 4 3 3 3 7 6 2 2 3 5 Table D 5.5: Primer sequences used in this work. Deep sequencing inner primers PKS_T1R8_FWD gttcagagttctacagtccgacgatcgttggaaaatggtcaa PKS_T2_FWD gttcagagttctacagtccgacgatctgtttttgacagaggaa PKS_T3_FWD gttcagagttctacagtccgacgatcgcctaagcccatca PKS_T4_FWD gttcagagttctacagtccgacgatcgaccctaagatgggc PKS_T5_FWD gttcagagttctacagtccgacgatccaggaggtaatgcaatt PKS_T1_REV ccttggcacccgagaattccagggatttttctgtaatat PKS_T2_REV ccttggcacccgagaattccacatcattacacgttgaac PKS_T3_REV ccttggcacccgagaattccagatgggcctctctag PKS_T4_REV ccttggcacccgagaattccactcgacttggtccac PKS_T5R388_REV ccttggcacccgagaattccagagaatgggcacact blue = Illumina sequencing primer; black = gene overlap Combinatorial library mutant primers PKS_V12I ggtcaaaaatttgggaggattcatgagagagctgaag PKS_S37A caacacctttccattgggttgatcaagcctcctatcctgatt PKS_D50E cagggttacaaatagtgagcatttggtggacctcaa PKS_N64E ggacctcaaagaaaaatttagacgtatctgtgagagaacaatgattagcaa PKS_N64D gacctcaaagaaaaatttagacgtatctgtgacagaacaatgattag PKS_N90M cccaatttgtgctctcacatggagccatcctttgatgtca PKS_P106A tcaggcaggacattttagtttcagaaatagccaaac PKS_M115R ttggaaaagaggctgtccttagggccattgatgaatg PKS_D118K ctgtccttatggccattaaggaatgggcccagcccaa PKS_A121G gccattgatgaatggggccagcccaaatccaaa PKS_M115R+A121G ggctgtccttagggccattgatgaatggggccagcccaaatcca PKS_D118K+A121G gaggctgtccttatggccattaaggaatggggccagcccaaat PKS_R135S tttagtcttttgcacaagcagtggtgttgacatgccc PKS_A143V ggtgttgacatgcccggtgtagattaccaattaattaagc PKS_L235V accctaagatgggcgtagagaggccc PKS_T244S atctttgagatagtctcaacggcccaaacattt PKS_T245A ggcccatctttgagatagtcacagcggcccaaacat PKS_V250L cacaacggcccaaacatttctccctaacgggg PKS_V250I cacaacggcccaaacatttatccctaacgggg PKS_V282I ggatgtaccaccaactattgcgaaaaatattgagagttgcttaa PKS_S284K tgtaccaccaactattgcgaaaaatgttgagaagtgcttaataaaggcttt PKS_V282I+S284K ccaaggatgtaccaccaactattgcgaaaaatattgagaagtgcttaataaaggcttttgaac PKS_I301V ggaatatcagattggaactcggtcttttggattcttcatccag PKS_S318E caattgtggaccaagtcgaggagacattgggcctagagcccaa PKS_G357A gagattagaaagaaatctgctagagaagggctgaagact PKS_D366E ggctgaagacttcaggagaggggctggact 161 Table D 5.6: Multiple reaction monitoring parameters utilized for LC-MS/MS analyses of AbPKS products. Compound Precursor ion > product ion (m/z) Cone Collision voltage (V) voltage (V) Retention time (min) N-methyl-Δ1-pyrrolinium 84 > 42 34 16 1.28 Tropinone 140.1 > 98 40 22 1.44 Hygrine 142.1 > 84 28 16 1.78 4-(2-N-methylpyrrolidine)-3186.1 > 84 28 16 1.42 oxobutanoic acid 4-(2-N-methylpyrrolidine)-3200.1 > 84 28 16 1.96 oxobutanoic acid methyl ester Cuscohygrine 225.2 > 84 28 16 1.50 Cuscohygroline 227.2 > 84 28 16 1.62 a Telmisartan 515.2 > 276.1 42 52 5.07 Data was analyzed in positive ion mode using a Waters Acquity TQ-D mass spectrometer. a 1 µM Telmisartan is included as an internal standard. 162 Table D 5.7: UPLC Mobile Phase Gradients Utilized for LC-MS/MS analyses of PKS products using a Waters Acquity TQ-D mass spectrometer. Time Mobile Mobile (min) phase A (%) phase B (%) 0.00 99 1 0.50 62.5 37.5 2.00 50 50 4.00 0 100 5.00 0 100 5.01 99 1 6.00 99 1 Mobile phase A = 100 mM ammonium formate + 1% formic acid in water. Mobile phase B = 100 mM ammonium formate + 1% formic acid in 80% methanol. 163 REFERENCES 164 REFERENCES 1. Clomburg, J. M., Crumbley, A. A. & Gonzalez, R. Industrial biomanufacturing: The future of chemical production. Science (80-. ). 355, eaag0804 (2017). 2. Hughes, R. A. & Ellington, A. D. Synthetic DNA Synthesis and Assembly: Putting the Synthetic in Synthetic Biology. Cold Spring Harb. Perspect. Biol. 9, a023812 (2017). 3. Billingsley, J. M., Denicola, A. B. & Tang, Y. Technology development for natural product biosynthesis in Saccharomyces cerevisiae. Curr. Opin. Biotechnol. 42, 74–83 (2016). 4. Alper, H., Fischer, C., Nevoigt, E. & Stephanopoulos, G. Tuning genetic control through promoter engineering. Proc. Natl. Acad. Sci. 102, 12678–12683 (2005). 5. Keasling, J. D. Synthetic biology and the development of tools for metabolic engineering. Metab. Eng. 14, 189–195 (2012). 6. Yadav, V. G., Mey, M. De, Giaw, C., Kumaran, P. & Stephanopoulos, G. The future of metabolic engineering and synthetic biology: Towards a systematic practice. Metab. Eng. 14, 233–241 (2012). 7. Mali, P., Esvelt, K. M. & Church, G. M. Cas9 as a versatile tool for engineering biology. Nat. Methods 10, 957–963 (2013). 8. Noor, E., Flamholz, A., Liebermeister, W., Bar-Even, A. & Milo, R. A note on the kinetics of enzyme action: A decomposition that highlights thermodynamic effects. FEBS Lett. 587, 2772–2777 (2013). 9. DeLoache, W. C. et al. An enzyme-coupled biosensor enables (S)-reticuline production in yeast from glucose. Nat. Chem. Biol. 11, 465–471 (2015). 10. Galanie, S., Thodey, K., Trenchard, I. J., Interrante, M. F. & Smolke, C. D. Complete biosynthesis of opiods in yeast. Science (80-. ). 349, 1095–1100 (2015). 11. Steen, E. J. et al. Microbial production of fatty-acid-derived fuels and chemicals from plant biomass. Nature 463, 559–562 (2010). 12. Shiue, E. & Prather, K. L. J. Improving d-glucaric acid production from myo-inositol in E. coli by increasing MIOX stability and myo-inositol transport. Metab. Eng. 22, 22–31 (2014). 13. Zhu, J., Lin, J.-L., Palomec, L. & Wheeldon, I. Microbial host selection affects 165 intracellular localization and activity of alcohol-O-acetyltransferase. Microb. Cell Fact. 14, 1–10 (2015). 14. Steen, E. J. et al. Metabolic engineering of Saccharomyces cerevisiae for the production of isobutanol and 3-methyl-1-butanol. Appl. Microbiol. Biotechnol. 98, 9139–9147 (2008). 15. Atsumi, S. et al. Metabolic engineering of Escherichia coli for 1-butanol production. Metab. Eng. 10, 305–311 (2008). 16. Zhou, K., Zou, R., Stephanopoulos, G. & Too, H. P. Enhancing solubility of deoxyxylulose phosphate pathway enzymes for microbial isoprenoid production. Microb. Cell Fact. 11, 1–8 (2012). 17. Atsumi, S. et al. Engineering the isobutanol biosynthetic pathway in Escherichia coli by comparison of three aldehyde reductase/alcohol dehydrogenase genes. Appl. Microbiol. Biotechnol. 85, 651–657 (2010). 18. Lau, W. & Sattely, E. S. Six enzymes from mayapple that complete the biosynthetic pathway to the etoposide aglycone. Science (80-. ). 349, 1224–1228 (2015). 19. Ehrenworth, A. M., Sarria, S. & Peralta-Yahya, P. Pterin-Dependent Mono-oxidation for the Microbial Synthesis of a Modified Monoterpene Indole Alkaloid. ACS Synth. Biol. 4, 1295–1307 (2015). 20. Kim, D. H., Kim, B. G., Jung, N. R. & Ahn, J. H. Production of genistein from naringenin using Escherichia coli containing isoflavone synthase-cytochrome P450 reductase fusion protein. J. Microbiol. Biotechnol. 19, 1612–1616 (2009). 21. Klesmith, J. R., Bacik, J., Michalczyk, R. & Whitehead, T. A. Comprehensive SequenceFlux Mapping of a Levoglucosan Utilization Pathway in E. coli. ACS Synth. Biol. 4, 1235–1243 (2015). 22. Xie, X. et al. Rational Improvement of Simvastatin Synthase Solubility in Escherichia coli Leads to Higher Whole-cell Biocatalytic Activity. Biotechnol. Bioeng. 102, 20–28 (2009). 23. Klesmith, J. R., Bacik, J., Wrenbeck, E. E., Michalczyk, R. & Whitehead, T. A. Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning. Proc. Natl. Acad. Sci. 114, 2265–2270 (2017). 24. Brachmann, C. B., Davies, A., Cost, G. J. & Caputo, E. Designer Deletion Strains derived from Saccharomyces cerevisiae S288C  : a Useful set of Strains and Plasmids for PCRmediated Gene Disruption and Other Applications. 132, 115–132 (1998). 25. Deloache, W. C., Russ, Z. N. & Dueber, J. E. Towards repurposing the yeast peroxisome 166 for compartmentalizing heterologous metabolic pathways. Nat. Commun. 7, 11152 (2016). 26. Gai, S. A. & Wittrup, K. D. Yeast surface display for protein engineering and characterization. Curr. Opin. Struct. Biol. 17, 467–73 (2007). 27. Wrenbeck, E. E. et al. Plasmid-based one-pot saturation mutagenesis. Nat. Methods (2016). doi:10.1038/nmeth.4029 28. Waldo, G. S., Standish, B. M., Berendzen, J. & Terwilliger, T. C. Rapid protein-folding assay using green fluorescent protein. Nat. Biotechnol. 17, 691–695 (1999). 29. Fowler, D. M., Araya, C. L., Gerard, W. & Fields, S. Enrich: Software for analysis of protein function by enrichment and depletion of variants. Bioinformatics 27, 3430–3431 (2011). 30. Kowalsky, C. A. et al. High-Resolution Sequence-Function Mapping of Full-Length Proteins. PLoS One 10, e0118193 (2015). 31. Wrenbeck, E. E., Azouz, L. R. & Whitehead, T. A. Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nat. Commun. 8, 15695 (2017). 32. Zhang, Y. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics 9, (2008). 33. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014). 34. Deng, Z. et al. Deep sequencing of systematic combinatorial libraries reveals β-lactamase sequence constraints at high resolution. J. Mol. Biol. 424, 150–67 (2012). 35. Firnberg, E., Labonte, J. W., Gray, J. J. & Ostermeier, M. A Comprehensive, HighResolution Map of a Gene’s Fitness Landscape. Mol. Biol. Evol. 31, 1581–1592 (2014). 36. Stiffler, M. A., Hekstra, D. R. & Ranganathan, R. Evolvability as a Function of Purifying Selection in TEM-1 β-Lactamase. Cell 160, 882–892 (2015). 37. Melnikov, A., Rogov, P., Wang, L., Gnirke, A. & Mikkelsen, T. S. Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic Acids Res. 42, gku511 (2014). 38. Thyme, S. B. et al. Massively parallel determination and modeling of endonuclease substrate specificity. 42, 13839–13852 (2014). 39. Romero, P. A., Tran, T. M. & Abate, A. R. Dissecting enzyme function with microfluidic- 167 based deep mutational scanning. Proc. Natl. Acad. Sci. 112, 7159–7164 (2015). 40. Starita, L. M. et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proc. Natl. Acad. Sci. 110, E1263–E1272 (2013). 41. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016). 42. Gong, L. I. & Bloom, J. D. Epistatically Interacting Substitutions Are Enriched during Adaptive Protein Evolution. PLoS One 10, (2014). 43. Chao, G. et al. Isolating and engineering human antibodies using yeast surface display. Nat. Protoc. 1, 755–68 (2006). 44. Kowalsky, C. A. et al. Rapid Fine Conformational Epitope Mapping Using Comprehensive Mutagenesis and Deep Sequencing. J. Biol. Chem. 290, 26457–26470 (2015). 45. Lee, M. E., Deloache, W. C., Cervantes, B. & Dueber, J. E. A Highly Characterized Yeast Toolkit for Modular, Multipart Assembly. (2015). doi:10.1021/sb500366v 46. Goldenzweig, A. et al. Automated Structure- and Sequence-Based Design of Proteins for High Bacterial Expression. Mol. Cell 63, 337–346 (2016). 47. Altschup, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic Local Alignment Search Tool. J. Mol. Biol. 215, 403–410 (1990). 48. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006). 49. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004). 50. Altschul, S. F., Gertz, E. M., Agarwala, R., Scha, A. A. & Yu, Y.-K. PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res. 37, 815– 824 (2009). 168 CHAPTER SIX Summary and future work 169 SUMMARY AND OUTLOOK In this thesis, deep sequencing technology was utilized in a standardized research pipeline developed by the Whitehead lab to study and engineer enzymes1. Deep mutational scanning, the testing of all possible single amino acid substitutions on the function of a protein, provides information rich datasets that address a variety of aims relevant to numerous fields2. The novelty of this dissertation is the application of this style of protein science to probe fundamental questions relating to the intricacies of enzyme function. In Chapter 2, a novel comprehensive saturation mutagenesis method, Nicking Mutagenesis, was developed3. Analogous to popular commercial kits available for site-directed mutagenesis (Agilent’s QuikChange or New England Biolabs Q5 Site-Directed Mutagenesis Kit), nicking mutagenesis conveniently requires routinely prepped dsDNA as input substrate. This solves the accessibility challenge presented with its best competing method, PFunkel, that requires a dU-ssDNA template that must be prepared from phage4,5. Until there is a significant decrease in the cost of DNA synthesis, methods such as nicking mutagenesis will be imperative for labs desiring to analyze comprehensive point-mutant libraries. In Chapter 3, the deep sequencing technology pipeline was applied to address the question of how enzymes encode specificity through their primary sequence of amino acids6. Using growthbased selections that I developed, I was able to assess the effect of >6,000 single-amino acid substitutions on the function of a protein with three different substrates. Comparison of datasets between selections of multiple substrates provided an unprecedented look at the differential effects of mutations between substrates. Mutations benefiting only one substrate were spread throughout the protein sequence and structure, and did not correlate with the other selections. 170 The datasets obtained from deep mutational scanning of amiE provided a fortuitous opportunity to test theories on adaptive molecular evolution. Specifically, distribution model fitting of the DFE for beneficial mutations could be performed with high statistical power, as hundreds of beneficial mutations were identified. The DFE for beneficial mutations was found to be approximately exponentially distributed as predicted, however the relationship between protein biophysics – namely stability – and beneficial DFE has yet to be explored. To address this, in Chapter 4 destabilized variants of wild-type amiE were designed using the Rosetta Design Software. I was able to successfully identify several variants that had decreased expression yields (a measure of stability) that have been shown to maintain wild-type catalytic function. Future work includes developing and performing growth-based selections and analyzing the resulting DFE in comparison to the existing datasets for wild-type amiE. Natural product synthesis in workhorse organisms such as bacteria or yeast is an attractive technology, however plug-and-play of non-native enzymes as part of designed biosynthetic pathways often leads to poor protein expression. In Chapter 5, the deep sequencing pipeline was applied to test a generalizable method for improving protein expression while maintaining catalytic activity7. As a model system, a poorly expressing Type III PKS from Atropa belladonna was scanned using a high-throughput GFP-fusion folding reporter assay and resulting hits were combined to generate stabilized variants. However, two included mutations caused inactivation. Future work includes generating backcrosses of these two mutations and testing for activity. Additional future work includes reconstitution of a portion of the Tropane Alkaloids pathway from Atropa belladonna (from which the PKS originates) in Saccharomyces cerevisiae, and testing the effect stabilized PKS variants have on pathway productivity. 171 REFERENCES 172 REFERENCES 1. Kowalsky, C. A. et al. High-Resolution Sequence-Function Mapping of Full-Length Proteins. PLoS One 10, e0118193 (2015). 2. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014). 3. Wrenbeck, E. E. et al. Plasmid-based one-pot saturation mutagenesis. Nat. Methods (2016). doi:10.1038/nmeth.4029 4. Kunkel, T. A. Rapid and efficient site-specific mutagenesis without phenotypic selection. Proc. Natl. Acad. Sci. 82, 488–492 (1985). 5. Firnberg, E. & Ostermeier, M. PFunkel: Efficient, Expansive, User-Defined Mutagenesis. PLoS One 7, e52031 (2012). 6. Wrenbeck, E. E., Azouz, L. R. & Whitehead, T. A. Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nat. Commun. 8, 15695 (2017). 7. Klesmith, J. R., Bacik, J., Wrenbeck, E. E., Michalczyk, R. & Whitehead, T. A. Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning. Proc. Natl. Acad. Sci. 114, 2265–2270 (2017). 173