This is to certify that the dissertation entitled

An Analysis of Protein Folding by Decoding the Hierarchy of Native-State Structural Interactions

presented by Brandon Michael Hespenheide has been accepted towards fulfillment of the requirements for the Ph.D. degree in Biochemistry & Molecular Biology and Physics & Astronomy.

AN ANALYSIS OF PROTEIN FOLDING BY DECODING THE HIERARCHY OF NATIVE-STATE STRUCTURAL INTERACTIONS

By Brandon Michael Hespenheide

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Biochemistry and Molecular Biology and Department of Physics and Astronomy

2002

ABSTRACT

Understanding the mechanism by which proteins fold is one of the most intensely studied problems in science. Here, an analysis of the native-state structures of proteins is presented as a means to study protein folding. The hypothesis is formed as follows. Early in folding, hydrophobic collapse results in a compact, fluid structure with few, if any, specific contacts. As folding proceeds, hydrogen bonds and salt bridges begin to form, stabilizing the structure. These noncovalent bonds continue to form until the native state is reached.
Assuming that these noncovalent bonds are maintained throughout the folding reaction, any stable substructure formed during folding should be visible as a subset of the interactions found in the native state of a protein.

An analysis of the observed packing geometry between helices and sheets in a set of nonhomologous proteins is presented in Chapter 2. The role of possible dipole interactions is evaluated by explicitly taking into account the N- to C-terminal orientations of the secondary structures. A reduced representation is used in which the structures are defined by 3-dimensional vectors fit to the Cα positions. Helix-sheet interactions are defined such that the geometry can be expressed by a single angle, Ω, which represents the dihedral angle formed by the helix, the strand in the sheet closest to the helix, and the line of closest approach between the helix and the strand. The results show that for helix-strand interactions in which no β-sheet dipole is present, no preferred Ω packing angle is observed. However, for β-sheets with a net dipole due to a partially or entirely parallel β-sheet topology, a strong preference for helices to pack at ~180° relative to the strand is observed. This is expected if dipoles play a key role in defining the packing geometry.

Chapters 3 and 4 present a novel means for measuring the flexibility in protein structures by using the program FIRST. Results of native-state flexibility analysis correlate well with experimentally observed native-state flexibility in several proteins, prompting the assumption that rigid regions represent folded structure and flexible regions represent unfolded structure. A means of simulating thermal denaturation is presented. We view the thermal unfolding of a protein as a process in which the hydrogen bonds and salt bridges break in an energy-dependent manner.
This process is mimicked by breaking hydrogen bonds and salt bridges one by one, from weakest to strongest, and observing how the flexibility of a protein structure increases after each step. As the protein unfolds in response to this increased flexibility, an increasing number of residues in the protein become flexible, while others remain rigid. This proceeds until the entire protein becomes flexible when all hydrogen bonds have been removed. The mean coordination, ⟨r⟩, computed as the average number of bonds per atom in the structure, is determined at each step in the simulated denaturation, and is shown to be a relevant structural variable for tracking the unfolding reaction. Specifically, the number of bond-rotational degrees of freedom in the system, a free-energy-like quantity, can be monitored as a function of ⟨r⟩, and used to identify the rigid-to-flexible phase transition during the unfolding simulation. Finally, the folding cores for ten proteins are predicted by identifying the last set of two or more secondary structures to remain mutually rigid, or stable, during simulated unfolding. The predicted folding cores are compared to those observed in hydrogen-deuterium exchange/NMR experiments, and the results for 8 out of the 10 proteins indicate a close correlation.

For my mother and father, and Judy and Beth

ACKNOWLEDGMENTS

I would like to start by thanking my advisors, Dr. Leslie A. Kuhn and Dr. M. F. Thorpe. I began my graduate studies in the lab of Dr. Kuhn, whose encouragement and endless enthusiasm provided the driving force behind my application for a dual degree program in the Department of Biochemistry and Molecular Biology and the Department of Physics and Astronomy. Over the years I have had a chance to work closely with Dr. Kuhn and Dr. Thorpe, both in the lab and in the classroom.
Observing and learning from the unique ways in which both of these professors approach problems has been the most rewarding experience of my graduate studies. I am extremely grateful for their mentoring, and I am looking forward to a future of exciting research using the skills they have taught me.

I extend an enthusiastic thank you to the members of my committee, Dr. Shelagh Ferguson-Miller, Dr. Robert Hausinger, Dr. Jack Watson and Dr. Phil Duxbury. They have all been key in shaping this interdisciplinary thesis. Their encouragement and critical assessment of this thesis have been greatly appreciated.

The research I have completed would not have been possible without the support and advice of the many graduate students, postdocs, and faculty that I have had the pleasure to work with over the years. Of particular note are Dr. Paul Sanchagrin, Dr. Michael Raymer and Dr. Volker Schnecke, all former members of Dr. Kuhn's lab. These three guys taught me how to write computer programs properly, which has saved me hours and sometimes days of frustration while performing experiments. I would also like to thank Maria Zavodszky, Rajesh Korde and Ming Lei for helping create an enjoyable working environment and providing useful feedback. Most importantly, I send a big thank you to A. J. Rader, a graduate student of Dr. Thorpe's, who is also doing a dual degree with Dr. Thorpe and Dr. Kuhn. Much of the research I've done over the years was completed after long discussions with A. J. on how to get the job done. We have both contributed to, and sometimes struggled with, the collaborative project between our respective labs, and I am thankful for his support, advice, and friendship over the years.

Finally, I would like to thank all of my friends and family who have made it all worthwhile. I would like to thank my whole family, from parents to fourth cousins, the best family in the world.
Mom and dad, Judy, grandma and grandpa, uncle Keno, Ryan (Chachi), Alex, uncle Mike, aunt Dee, aunt Margaret and uncle Bob, aunt Mary, and Mr. and Mrs. Strunk, thank you for all your support. To all my friends from back in the day, and those I've made while a graduate student, Teri, Dr. Brad Mballs, Chachi again, Dave, Volker, Ina, Annika, Tim, John Moehn, John Centner, Frans, Josh, Bryan and Kirsten: thanks for all the advice (beer) and good times (getting me in trouble). Lastly, I want to send all my love and thanks to Bethany Strunk. When things seemed impossible to do, she was my inspiration to keep going. Nakupenda, Beti.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

1 Protein Folding: A Transition from Flexible to Rigid
  1.1 Computers and Biology
  1.2 The Protein Folding Problem
  1.3 The "Old" and "New" Views of Protein Folding
  1.4 Overview of Protein Folding Models
    1.4.1 Nucleation-Condensation
    1.4.2 The Diffusion-Collision Model
    1.4.3 The Hydrophobic Zipper Model
  1.5 Computational Analyses of Native-State Structure as a Tool To Study Protein Folding
  1.6 H-D Exchange, Φ values, and the Protein Folding Core
  1.7 Protein Flexibility
  1.8 Work Presented in This Thesis

2 An Analysis of Helix-Sheet Packing Geometry in a Set of Nonhomologous Protein Structures
  2.1 Abstract
  2.2 Introduction
  2.3 Methods
    2.3.1 Protein Dataset
    2.3.2 Representing Secondary Structures as Vectors
    2.3.3 Identifying a Pair of Interacting Secondary Structures
    2.3.4 Assigning Local Strand Orientation
    2.3.5 Measuring the Packing Geometry of a Helix-Strand Interaction
    2.3.6 Measuring Local Sheet Twist
    2.3.7 Normalizing the Helix-Sheet Ω Angle
  2.4 Results
    2.4.1 Helix-Strand Ω Packing Angle as a Function of Strand Orientation
    2.4.2 Ω Packing Angle as a Function of Local Sheet Twist
  2.5 Conclusions

3 FIRST Flexibility Analysis and Hydrogen Bond Dilution as a Method to Simulate Thermal Denaturation
  3.1 Abstract
  3.2 Introduction
  3.3 Methods
    3.3.1 FIRST Flexibility Analysis
    3.3.2 Preprocessing Protein Structures for Analysis
    3.3.3 Identifying and Modeling Hydrogen Bonds
    3.3.4 Identifying and Modeling Hydrophobic Interactions
    3.3.5 Computing the Mean Coordination of a Protein Structure
    3.3.6 Computing the Fraction of Floppy Modes
    3.3.7 Simulating Denaturation
    3.3.8 Visualizing Results: The 3D Rigid Cluster Decomposition
    3.3.9 Visualizing Results: The 1D Rigid Cluster Decomposition
  3.4 Results
    3.4.1 Native State Flexibility Analysis: Open and Closed Structures of HIV Protease
    3.4.2 The Folding Transition State
  3.5 Conclusions

4 Identifying Protein Folding Cores from the Evolution of Flexible Regions During Unfolding
  4.1 Abstract
  4.2 Introduction
  4.3 Methods
    4.3.1 Selection of Proteins for Analysis
    4.3.2 FIRST Flexibility Analysis
    4.3.3 Simulating Denaturation
    4.3.4 Identifying the Folding Core
  4.4 Results
    4.4.1 Thermal Denaturation
    4.4.2 Evaluating Other Models of Denaturation
  4.5 Conclusions

5 Summary and Perspectives
  5.1 Secondary Structure Packing
    5.1.1 Summary
    5.1.2 Perspective
    5.1.3 Future Directions
  5.2 Protein Folding and Flexibility Analysis
    5.2.1 Summary
    5.2.2 Perspectives
    5.2.3 Future Directions

APPENDICES
  A Summary of Publications Outside of the Scope of the Work Presented in this Dissertation

BIBLIOGRAPHY

LIST OF TABLES

2.1 Assigning a unique orientation value for each strand in a sheet
2.2 Correlation coefficients between sheet twist, T, and Ω packing angle for all five possible strand orientations
3.1 Dataset of 26 structurally diverse proteins analyzed using FIRST
4.1 Dataset of 10 proteins used to identify folding cores

LIST OF FIGURES

1.1 Generic energy landscape or folding funnel for a protein
2.1 Graphical representation of a helix-sheet packing geometry
2.2 Distribution of helix-sheet Ω packing angles for each of the five strand orientations
2.3 Example of a helix-sheet packing interaction found in the protein IIB cellobiose from E. coli, in which the strand is in a type 2 orientation
2.4 Scatter plot of local sheet twist versus Ω packing angle for each of the five strand orientations
2.5 Hydrogen bonding pattern for parallel and anti-parallel β-strands
3.1 Determining the number of internal degrees of freedom in 3 small rings using constraint counting
3.2 A schematic representation of microscopic bond forces ordered from strongest to weakest
3.3 Example of bond-length and bond-angle distance constraints for the main-chain atoms of an amino acid
3.4 Geometric parameters used to identify hydrogen bonds and measure their energy
3.5 Histogram of hydrogen bond energies from three structures of HIV protease
3.6 Identifying and modeling a hydrophobic tether distance constraint
3.7 The fraction of floppy modes, f = F/3N, as a function of the mean coordination in a glass model and a set of 26 proteins
3.8 Rigid cluster decomposition results for CI2 when 67% of the weakest hydrogen bonds have been removed
3.9 Removing redundant information from the results of a complete hydrogen bond dilution for c-SRC SH3 domain
3.10 Rigid cluster decomposition for HIV protease in ligand-free and ligand-bound conformations
3.11 The first derivative of the fraction of floppy modes, f′, as a function of mean coordination, ⟨r⟩, for the set of 26 proteins listed in Table 3.1
3.12 The second derivative of the fraction of floppy modes, f″, as a function of mean coordination, ⟨r⟩, for the set of 26 proteins listed in Table 3.1
4.1 Example of bond-length and bond-angle distance constraints for the main-chain atoms of an amino acid
4.2 Results of simulated thermal denaturation for cytochrome c
4.3 Results of simulated thermal denaturation for barnase
4.4 Results of simulated thermal denaturation for interleukin-1β
4.5 Results of simulated thermal denaturation for bovine pancreatic trypsin inhibitor
4.6 Comparison of protein folding cores predicted by FIRST to those observed in H-D exchange experiments
4.7 Results of random hydrogen bond dilution over a window of 10 hydrogen bonds for cytochrome c
4.8 Four completely random dilutions of the hydrogen bonds in cytochrome c

LIST OF ABBREVIATIONS

⟨r⟩           mean coordination
1D            1-dimensional
3D            3-dimensional
AFU           autonomous folding unit
BPTI          bovine pancreatic trypsin inhibitor
Cα            alpha carbon
CD            circular dichroism
CI2           chymotrypsin inhibitor 2
DC            diffusion-collision
DOF           degrees of freedom
DSSP          Dictionary of Secondary Structures of Protein
FIRST         Floppy Inclusions and Rigid Substructure Topography
FLOPS         floating point operations per second
H-D exchange  hydrogen-deuterium exchange NMR
HIV           human immunodeficiency virus
HZ            hydrophobic zipper
MD            molecular dynamics
NC            nucleation-condensation
NMA           normal modes analysis
NMR           nuclear magnetic resonance
PDB           Protein Data Bank
PE            protein engineering
RCD           rigid cluster decomposition
RCSB          Rutgers Collaboratory for Structural Bioinformatics
TS            transition state
TSE           transition state ensemble

Chapter 1

Protein Folding: A Transition from Flexible to Rigid

1.1 Computers and Biology

Perhaps the best known of the first fully electronic computers built was ENIAC (Electronic Numerical Integrator and
Computer), which was constructed in 1946 (Adams et al., 1995). The machine nearly filled a 7 by 13 meter room and required 18,000 vacuum tubes to run. Despite the impressive size, ENIAC could only perform 340 floating point operations per second (FLOPS). Technological advancements in processor design brought computers to their current position of dominance in modern society. The average workstation today occupies a physical space no larger than a shoe box, and boasts gigaFLOPS processing power. Recent advances in networking and computer architecture have allowed many computers to be connected in parallel, acting as a single computational processor capable of solving large problems. In 2001, IBM announced that in collaboration with Lawrence Livermore National Laboratory, it would build a networked computer system, named Blue Gene/L, capable of ~200 teraFLOPS, specifically designed for applications in the life sciences, particularly in the area of protein folding. A machine of equivalent power built with the ENIAC technology of 1946 would require more surface area than is available on the planet.

Concurrent with advances in computer technology have come advances in all of the natural sciences. It is impossible to fathom the extent to which computers have added to our understanding of nature. A general case can demonstrate the point. Computers have allowed for quicker and more reliable analysis of data, allowing subsequent experiments to be performed more often and more accurately. As a specific example, all of structural biology has benefited from the speed at which protein crystal structure data are made available. Advances in computer technology allow for the design of better and better experimental equipment. Advances in processor speed allow for faster analysis and refinement of the diffraction data.
And perhaps most important of all, the availability of the World Wide Web has allowed for easy public access to protein structures via the Protein Data Bank (PDB), hosted by the Rutgers Collaboratory for Structural Bioinformatics (RCSB).

Of particular importance in regard to this thesis is the application of computers and computer science to problems in physics and biology that would otherwise be practically impossible to solve. The general field to which this thesis belongs could best be described as structural biology, and the following hierarchical category can be applied: natural science → biophysics → protein folding → computational → native-state structure analysis. This hierarchy can be extended one more step, where Chapter 2 would be → geometric analysis of secondary structure packing, and Chapters 3 and 4 would be → graph-theoretical analysis of native-state bond networks. All of the work presented here addresses one of the most challenging unsolved problems in structural biology, the protein folding problem.

1.2 The Protein Folding Problem

In the 1950s and early 1960s, Anfinsen published several experiments on the denaturation and renaturation of ribonuclease A (Anfinsen et al., 1954; Anfinsen and Haber, 1961). His main conclusion from this work of relevance to protein folding was that proteins can spontaneously refold to their native conformation after unfolding (Anfinsen, 1973). This experiment provided solid evidence for what has become a tenet of structural biology: all of the information required for a protein to fold is encoded in the primary structure, or sequence, of the protein. In Anfinsen's own words from his 1972 Nobel Prize speech, "The native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence, in a given environment."

The first clear indication of the complexity of protein folding emerged in 1968 with what has become known as "Levinthal's paradox" (Levinthal, 1968).
The paradox arises from a simple estimation of the number of possible conformations a protein can adopt. Given the crude estimate that each amino acid in a protein can adopt 10 unique conformations that produce nonoverlapping structures, the number of possible unique structures is $10^N$, where N is the number of amino acids in the protein. For a protein 104 amino acids in length such as horse heart cytochrome c, this number is $10^{104}$, or $\sim 10^{100}$. If a protein could convert between unique conformations on the order of molecular vibrations, say every femtosecond, it would still require on the order of $10^{80}$ years to sample every conformation, many times longer than the age of the universe. Given that most proteins fold in between microseconds and seconds, it is clear that proteins do not reach the native state by random sampling of conformations. Nonrandom searching implies the concept of a pathway, and leads to the question, by what mechanism do proteins fold?

1.3 The "Old" and "New" Views of Protein Folding

The protein folding pathway became an established concept as a direct consequence of Levinthal's paradox. Generally referred to as the "old view" of protein folding in recent literature, early protein folding mechanisms envisioned the formation of a series of discrete intermediates along the free energy surface from a denatured state to a native state (Kim and Baldwin, 1990; Englander and Mayne, 1992). This view of protein folding has since been replaced by the "new view" of protein folding, perhaps best represented by the energy landscape theory of protein folding, in which folding occurs by the diffusion of ensembles of structures, rather than discrete intermediates, across the multidimensional free energy surface of a protein structure. A qualitative description of the energy landscape view of protein folding is given here.
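As a brief aside, the Levinthal estimate from Section 1.2 can be checked with a few lines of arithmetic. The sketch below is only an illustration and not part of the original analysis: it assumes 10 conformations per residue and one conformational change per femtosecond, as in the text, while the function name and the 365.25-day year are choices made here.

```python
import math

# Length of a Julian year in seconds (about 3.16e7 s); an assumption of this sketch.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_log10_years(n_residues, conformations_per_residue=10,
                          seconds_per_step=1e-15):
    """Return log10 of the years needed to visit every conformation once.

    Logarithms are used so that numbers such as 10^104 never have to be
    stored as floating point values.
    """
    log10_states = n_residues * math.log10(conformations_per_residue)
    log10_seconds = log10_states + math.log10(seconds_per_step)
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


# Horse heart cytochrome c has 104 residues.
log10_years = levinthal_log10_years(104)
print(f"exhaustive search time: about 10^{log10_years:.1f} years")
```

For N = 104 this gives roughly $10^{81.5}$ years, and the rounded count of $10^{100}$ conformations gives about $10^{77.5}$ years. The two values bracket the order of $10^{80}$ years quoted in the text, against an age of the universe of roughly $10^{10}$ years.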
Detailed explanations can be found in a number of references (Onuchic et al., 2000; Nymeyer et al., 1998; Onuchic et al., 1997; Dill and Chan, 1997; Bryngelson et al., 1995; Bryngelson and Wolynes, 1987).

The energy landscape theory of protein folding employs a statistical mechanical description of the folding reaction in which the kinetics and thermodynamics are dictated by the self-organization of ensembles of structures. A funnel-shaped picture is often used to describe the energy landscape, as shown in Figure 1.1 (Leopold et al., 1992). Key for the interpretation of the folding funnel is the idea of a reaction coordinate or order parameter. The free energy of a protein will depend upon its structure, and therefore, the dimensionality of the free energy surface will depend on how many variables it takes to describe the structure of a protein. A common means of describing a protein structure is to use the x, y, and z positions of each atom in Cartesian space, which yields 3N variables to describe the structure of the protein, where N is the number of atoms. Visualizing a high-dimensional space is impossible, so the goal of the reaction coordinate is to provide a single variable that describes the structural features common to any conformation with a given free energy during folding. The most often cited reaction coordinate is the percentage of native contacts, Q, present at any point along the folding reaction, although other parameters such as surface area or radius of gyration have been used (Socci et al., 1996; Brooks III et al., 1998).

The rough funnel shape of the energy landscape arises because evolution has selected for protein sequences that exhibit minimal frustration, as compared to random sequences. Frustration can manifest itself energetically or topologically, and examples of both are given here.
Energetic frustration can arise when a main-chain amide group that is hydrogen bonded to solvent in the denatured state does not hydrogen bond in the native state. Topological frustration implies that certain native state conformations are easier to reach than others. For example, imagine a protein whose native state consists of a knot. It is very possible that a sequence could be mapped onto this conformation such that all bond lengths and angles are unstressed and every potential hydrogen bond is formed. This would be energetically unfrustrated. However, the energy barriers present in forming a knot would result in a folding funnel that resembles a golf course, with a tiny, extremely deep hole in the middle of the green. The roughness of the folding funnel shown in Figure 1.1 is indicative of energetic and topological frustration present in the protein during folding. The degree of roughness will depend on how well the protein was designed by nature. Fast-folding proteins will have smoother energy landscapes and less frustration relative to slow-folding proteins.

The two most important features of the folding funnel, which provide the clearest separation from the old view of protein folding, are the possibility for many folding paths from the denatured ensemble to the native state, and the idea of ensembles or macrostates, rather than discrete intermediates. Although a single pathway down the funnel representation may dominate, it is critical to understand that each point along that pathway represents an ensemble of structures. Identifying the key features of these ensembles, and hence the key features of the protein structure as it folds, has been the subject of many theoretical and experimental studies over the past decade.

1.4 Overview of Protein Folding Models

Concurrent with the development of the energy landscape picture of folding has been the refinement of several phenomenological models describing the mechanism of protein folding.
A common theme in all of these models is the formation of a small substructure(s) as the first step of folding, but they differ in their emphasis on sequence local versus nonlocal interactions. An overview of these models is given here. For clarity, the terms sequence local and sequence nonlocal refer to the number of amino acids intervening between two interacting residues, and not the spatial distance between two atoms or residues. A typical sequence local interaction would be the $i \leftrightarrow i+4$ hydrogen bond in an α-helix. The exact cutoff for defining a sequence nonlocal interaction can vary between experiments, but generally any interaction between residues i and j where $|i - j| \geq 8$ would be considered nonlocal.

[Figure 1.1 labels: Entropy; Denatured States; Energy; Reaction Coordinate, Q; Native State]

Figure 1.1: Generic energy landscape or folding funnel for a protein. The funnel shape represents the energetic bias towards the native state at the bottom of the funnel. The width of the funnel roughly corresponds to the conformational entropy present in a protein as it folds. The top of the funnel, representing the denatured state, is wide, indicating a large amount of conformational entropy. As the protein folds, it loses entropy and the width of the funnel shrinks. The many small local grooves in the funnel represent local energy minima where the protein can get trapped for various amounts of time depending on the depth of the minima. The reaction coordinate, or order parameter, Q, is the number of native contacts present in any structure along the folding funnel and is a measure of similarity to the native state.

A caveat of the following models is that they best describe the folding mechanism of single domain proteins that exhibit 2-state folding kinetics. A protein that folds by 2-state kinetics is believed to have only the denatured and native states populated at equilibrium; no intermediate states are observed.
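The distinction between sequence local and sequence nonlocal interactions can be made concrete with a short sketch. It takes the cutoff as a sequence separation of at least 8 residues, as suggested in the text. The function names and the contact list are hypothetical and serve only as an illustration.

```python
def is_sequence_nonlocal(i, j, cutoff=8):
    """True when residues i and j are separated by at least `cutoff` positions."""
    return abs(i - j) >= cutoff


def split_contacts(contacts, cutoff=8):
    """Partition (i, j) residue-index pairs into sequence local and nonlocal lists."""
    local, nonlocal_contacts = [], []
    for i, j in contacts:
        if is_sequence_nonlocal(i, j, cutoff):
            nonlocal_contacts.append((i, j))
        else:
            local.append((i, j))
    return local, nonlocal_contacts


# Hypothetical residue pairs; (10, 14) and (21, 25) mimic helical i, i + 4 contacts.
contacts = [(10, 14), (10, 18), (3, 57), (21, 25)]
local, nonlocal_contacts = split_contacts(contacts)
print("sequence local:   ", local)
print("sequence nonlocal:", nonlocal_contacts)
```

Because the cutoff varies between experiments, it is left as a parameter rather than fixed at 8.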
Extension of these models to larger proteins appears feasible, as most large proteins split into domains that are usually capable of proper folding in the absence of the rest of the structure. Subregions of a whole protein that can fold independently of the rest of the structure are sometimes termed autonomous folding units (AFUs) (Fischer and Marqusee, 2000; Peng and Yu, 2000). One proposed model of folding in larger proteins is that it occurs via independent folding of multiple AFUs, as has been proposed for T4 lysozyme (Llinas and Marqusee, 1988), although this hypothesis is still being debated.

1.4.1 Nucleation-Condensation

The nucleation-condensation (NC) model describes a mechanism in which folding is initiated by formation of a stable nucleus consisting of both sequence local and nonlocal interactions (Mirny and Shakhnovich, 2001; Fersht, 2000; Thirumalai and Klimov, 1998; Fersht, 1997; Abkevich et al., 1994). These interactions result in the formation of native-like structure which collapses the protein into a phase that is more condensed than the denatured state. The specific interactions need not be unique (Guo and Thirumalai, 1997), although both theory and experiment suggest that several key residues are usually involved (Shakhnovich, 1998). This condensed phase is believed to be quite structurally similar to the native state, except that all of the non-nucleus forming interactions are weakened. The formation of the sequence local and nonlocal interactions together with the condensing of the structure represents the rate-limiting step in the folding kinetics.

The NC model provides a general description for how single domain, 2-state proteins may fold. The model fits within the theoretical framework of the energy landscape, and is fairly well accepted. The key experimental evidence supporting the NC mechanism has come from mutational experiments known as the protein engineering (PE) approach.
Developed by Fersht and coworkers (Fersht et al., 1992), the PE approach attempts to identify whether a residue is important for nucleation by observing the effects of mutation on the height of the energy barrier for folding. The height of the barrier is expressed as the free energy of going from the denatured state to the transition state, $\Delta G_{D-\ddagger}$, and the difference in barrier height between the mutant and the wild-type protein is $\Delta G^{wt}_{D-\ddagger} - \Delta G^{mut}_{D-\ddagger} = \Delta\Delta G_{D-\ddagger}$. This number is normalized by dividing by the change in the total free energy of folding, $\Delta\Delta G_{N-D}$, between the mutant and the wild-type, and this ratio is called the Φ value for the residue. The structural interpretation of residues with Φ values near 1.0 is that they are as structured in the transition state as they are in the native state, and contribute to stabilizing the structure during folding. Likewise, if a mutation does not affect the free energy of the transition state ($\Delta G^{wt}_{D-\ddagger} \approx \Delta G^{mut}_{D-\ddagger}$), the residue will have a Φ value near 0.0. It is suggested that residues with Φ values near 0.0 are disordered in the transition state, as much as in the denatured state, and therefore are not important for nucleation.

An important assumption in the PE method is that the mutations do not significantly alter the folding mechanism or the structure of the native state. This assumption assures that any observed changes in the folding kinetics can be attributed to stabilization/destabilization of the wild-type folding mechanism. For this reason, residues are usually mutated to alanine.

Extensive Φ value analysis has been performed on several proteins to date, and the results do seem to support the NC model (Clarke and Itzhaki, 1998; Nölting et al., 1997; Itzhaki et al., 1995). Typically, most residues have fractional Φ values, which can be interpreted in several ways.
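The Φ value definition above can be sketched as a short calculation. This is only an illustration under stated assumptions: both free-energy differences are taken as wild-type minus mutant, the numerical values are hypothetical, and the tolerance used to call a value "near" 0 or 1 is an arbitrary cutoff introduced here, not one given in the text.

```python
def phi_value(ddg_barrier, ddg_folding):
    """Ratio of the change in barrier height to the change in folding free energy.

    ddg_barrier: change in the denatured-to-transition-state free energy.
    ddg_folding: change in the total free energy of folding.
    Both must use the same sign convention and the same units.
    """
    if ddg_folding == 0:
        raise ValueError("phi is undefined when the mutation does not change stability")
    return ddg_barrier / ddg_folding


def interpret_phi(phi, tol=0.2):
    """Coarse reading of a phi value; `tol` is an illustrative cutoff only."""
    if phi >= 1.0 - tol:
        return "as structured in the transition state as in the native state"
    if phi <= tol:
        return "as disordered in the transition state as in the denatured state"
    return "fractional: interpretation is ambiguous"


# Hypothetical free-energy changes in kcal/mol, chosen only for illustration.
examples = {"mutant A": (1.1, 1.2), "mutant B": (0.1, 1.5), "mutant C": (0.6, 1.2)}
for name, (ddg_b, ddg_f) in examples.items():
    phi = phi_value(ddg_b, ddg_f)
    print(f"{name}: phi = {phi:.2f} ({interpret_phi(phi)})")
```

The ratio becomes unreliable when the mutation barely changes the overall stability, which is why the sketch refuses a zero denominator.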
Because the transition state is represented by an ensemble of structures, a Φ value of 0.5 could mean that the given residue is structured in half of the structures and disordered in the other half. It could also mean that the given residue is in the core, but the mutation weakened the interactions it makes. A third explanation could be the existence of parallel pathways, with parallel transition states. For example, if two pathways existed, the nucleus of pathway 1 could involve the given residue in the transition state ensemble (TSE), but the TSE of pathway 2 could use a different nucleus. Distinguishing between these interpretations experimentally is not trivial, and the issue is still being debated (Myers and Oas, 2002; Ozkan et al., 2001).

In addition to the results of PE experiments, there have been many computational studies using lattice models (Abkevich et al., 1994), off-lattice models (Li and Shakhnovich, 2001) and MD simulations (Kazmirski et al., 2001; Daggett et al., 1996) that have supported the NC model and provided a theoretical framework to describe the experimental observations. A good review of these computational techniques can be found in Mirny and Shakhnovich (2001).

1.4.2 The Diffusion-Collision Model

The diffusion-collision (DC) model bears some resemblance to the NC model described above, the key difference being that interactions local in sequence are much more strongly emphasized in the DC model (Karplus and Weaver, 1994, 1979). These nearby contacts lead to the formation of microdomains, which generally correspond to packed secondary structures such as α-helices or β-hairpins. These microdomains, which are marginally stable, diffuse through the solvent and collide with each other. Collisions that result in a stable tertiary interaction will produce larger substructures, and can occur in a unique order (single pathway) or in near-random order (many parallel pathways).
The rate-limiting step is dictated by how stable the individual structural units are, the probability that collisions produce native-state conformations, and how quickly the units can diffuse through the medium.

The DC model draws heavily from statistical mechanics and helix-coil transition theory (McCammon et al., 1980; Flory, 1969), in which the local i ↔ i + 1 contacts of the α-helix provide the cooperativity required to cross the free energy barrier between random coil and structured helix. It is perhaps for this reason that the DC model has been successful in describing the folding of all-helical proteins such as apomyoglobin (Pappu and Weaver, 1998), as helix formation is so strongly driven by local interactions. Application of the DC model to proteins with significant β-sheet structure, or to proteins with little or no secondary structure, appears not to be a viable option, especially in light of experimental evidence for chymotrypsin inhibitor 2 (CI2), in which the folding nucleus specifically involves nonlocal interactions. In summary, it appears that the DC model is a special case of the NC model in which the protein folds via several nuclei that are localized to secondary structures.

1.4.3 The Hydrophobic Zipper Model

The hydrophobic-zipper (HZ) model (Dill et al., 1993) describes a mechanism that would initiate folding immediately after hydrophobic collapse through the interaction between nonpolar amino acids. Initially, a contact (HH) is made between a pair of hydrophobic residues i and j that are near each other spatially. Most likely this interaction will involve residues that are local in sequence, as this requires a smaller conformational search and results in less loss of entropy upon folding, but nonlocal interactions are not excluded.
Once an interaction is established, hydrophobic residues proximal to i and j will be near each other spatially, and will have a higher chance of interacting than if the interaction between i and j had not been established. This scenario can be repeated until all hydrophobic residues are in contact. Because each subsequent HH interaction becomes easier to form due to a smaller loss of conformational entropy, this model implicitly describes a cooperative folding event.

It is important to note that the hydrophobic zipper model is not simply saying that the hydrophobic effect drives the entropic collapse of protein structures, resulting in burial of nonpolar residues (Tanford, 1980). The hydrophobic effect is a well-accepted part of every modern theory of protein folding and is implicitly assumed a priori in most mechanisms. It is for this reason that any qualitative description of protein folding begins with a compact denatured state, not a fully extended polypeptide, simply because the fully extended structure is not observed in nature.

While the HZ model appears to be a generalization of the hydrophobic effect to protein folding, its key difference is the formation of specific contacts between hydrophobic residues buried within the protein. The buried core of any denatured state is believed to be quite fluid because the hydrophobic effect results from nonspecific interactions between nonpolar groups. The HZ model suggests that as folding begins, some pairs of hydrophobic groups do form specific contacts, providing a nucleation site for the propagation of further hydrophobic contacts in a cooperative manner. The nonpolar amino acids can be ordered according to their relative "hydrophobicity values" (Bull and Breese, 1974), which implies that specific interactions may be possible.
The predictions of the HZ model have been tested exhaustively using lattice simulations, and recent experimental evidence on the folding of proteins with coiled-coil structure lends support (Hicks et al., 2002). Also, the HZ model favors the presence of multiple folding pathways.

It seems clear, from theory and experiment, that any mechanism of folding is going to require the formation of substructure or microdomains involving several key sequence-local, and most likely sequence-nonlocal, interactions that provide the cooperativity needed to scale the free energy barrier separating the denatured and native states. Furthermore, since it is implied that folding continues to the native state after the initial substructure is formed, the TSE along any pathway will have these substructures in common. Taken one step further, let us assume that the folding nucleus or microdomain is maintained all the way to the native state. In this case, the native-state structure, as determined by experiment (X-ray crystallography, nuclear magnetic resonance (NMR), etc.), should contain within its network of bonds a subnetwork corresponding to the folding nucleus or microdomains. It is this line of reasoning that has led to the development of many experiments, including those presented in this thesis, designed to answer one question: does the native-state structure of a protein encode information about the folding mechanism?

1.5 Computational Analyses of Native-State Structure as a Tool To Study Protein Folding

The availability of high-resolution structural data for many proteins, with corresponding experimental data about the mechanism of folding, has facilitated the development of computational techniques to study protein folding based on protein structure. Experiments have contributed the important observation that the interactions forming the folding nucleus or microdomains are generally conserved in the native state.
The goal of the computational techniques discussed here is the de novo prediction of the folding initiation structure(s) and/or the TSE for a given protein by using only the native-state structure.

The general scheme for most native-state analysis techniques is as follows. The topology of the native-state network is derived from a high-resolution source of atomic coordinates, such as X-ray crystallography or NMR. In general, the topology of a system defines how things are connected; in this case, we want to know how the atoms in a protein are connected. Depending on the method, the topology can be a reduced representation, describing only connections between Cα atoms within a certain radius of each other, or it can be extremely descriptive and include connections for each covalent bond, salt bridge, hydrogen bond and hydrophobic interaction. Once the topology is defined, an algorithm dissects the native topology into subtopologies (or subgraphs) and measures a given quantity for each subgraph. The goal is to identify a subgraph that corresponds to a key structure along the folding reaction for a protein. How well the predicted structure compares to the corresponding structure observed experimentally dictates the viability of the method.

Galzitskaya and Finkelstein (1999) have developed an algorithm for computing a free energy-like quantity for subtopologies of a given native state. They use a highly reduced model in which two (or four for larger proteins) consecutive residues are assigned a single site on the native-state graph, and this site is referred to as a "link". For a protein with N links, any given subgraph is defined as having S links in native conformation and N − S links disordered. For each subgraph, they compute a free energy, where the enthalpy term is derived solely from the S native-state links, and the entropy is calculated based on the number and length of the N − S disordered links.
The free energy is then computed for every possible subgraph of S ordered and N − S disordered links, subject to certain restrictions (for example, they only allow a fixed number of disordered loops). Their hypothesis is that the ensemble of subgraphs of covalent and noncovalent bonds with the highest computed free energy corresponds to the TSE, as by definition the transition state is the most unstable species on any reaction path. In the case of proteins, structures in the TSE have exactly a 50% chance of proceeding to the native state and a 50% chance of unfolding. Their computed TSE generally consists of thousands of structures that can be used to compute how often, on average, link i is found in a native conformation. This average is the computational analog of a Φ value. For example, if link 1 (corresponding to residues 1 and 2 in the protein) is part of an ordered region in exactly half of the TSE subgraphs, the computed Φ value is 0.5. Φ value predictions are made for each link in the protein and compared to experimental values.

Comparison between predicted and experimental Φ values was performed for five proteins: CI2, barnase, CheY, SRC SH3 domain and α-spectrin SH3 domain. The average correlation coefficient between prediction and experiment for all five proteins was 0.46, with a highest value of 0.56 for CI2 and a lowest value of -0.02 (no correlation) for SRC SH3 domain. The poor correlation can be attributed to deficiencies in the free energy calculation, such as the absence of a term describing potentially stabilizing interactions that occur within disordered loops. (The authors made the assumption that disordered links could not adopt stable nonnative conformations.) Despite the unimpressive correlation coefficients, the predictions are better than random, and imply that the topology of a protein can encode information about folding.
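The averaging step described above, and the comparison to experiment, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published algorithm: the free energy calculation that selects the TSE is omitted entirely, the ensemble is a made-up toy, and the names `predicted_phi` and `pearson` are hypothetical. Each structure in the ensemble is represented as a tuple of flags, one per link, with 1 meaning the link is in native conformation.

```python
from math import sqrt


def predicted_phi(ensemble):
    """Fraction of ensemble structures in which each link is ordered.

    ensemble: list of equal-length tuples of 0/1 flags (1 = link ordered).
    """
    n_structures = len(ensemble)
    n_links = len(ensemble[0])
    return [sum(s[i] for s in ensemble) / n_structures for i in range(n_links)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Toy TSE of four structures over four links (illustrative only).
toy_tse = [(1, 1, 0, 0), (1, 0, 0, 1), (1, 1, 0, 0), (1, 0, 1, 0)]
phi_pred = predicted_phi(toy_tse)  # [1.0, 0.5, 0.25, 0.25]
```

Given a list of experimental values for the same links, `pearson(phi_pred, phi_exp)` would give a correlation coefficient of the kind quoted above.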
Extensions of this work have been performed in which an ensemble of dynamically generated structures, such as from an off-lattice Monte Carlo simulation, was analyzed instead of just the native-state topology. Predicted folding nuclei and features of the TSE in these experiments have shown much better correlation to experiment (Dokholyan et al., 2002; Vendruscolo et al., 2001).

Nussinov and coworkers have developed a different type of native-state analysis algorithm that is designed to dissect a protein into hydrophobic folding units or building blocks, and the folding of a protein is described as the hierarchic assembly of these building blocks (Tsai et al., 2000, 1998). The details of how the building blocks are identified can be found in Tsai and Nussinov (1997). Briefly, the protein structure is exhaustively dissected into fragments of contiguous sequence called building blocks, ranging from the entire structure down to a minimal block size of seven residues. For each building block, an empirical score is computed based on its solvent-accessible surface area, compactness, hydrophobicity and isolation. The score is designed to represent how stable each building block would be if it were isolated from the rest of the protein structure. Low-scoring building blocks are discarded. The remaining set of building blocks can be assembled in various ways to form the complete protein such that the sequences represented by the building blocks do not overlap by more than a few residues. Because many building blocks will generally be found for a protein (78 were found for actin, which has 373 residues), it is possible to build the whole protein from different assemblies of building blocks.

The pathways identified by the above algorithm can most readily be associated with the HZ model of folding initiation (formation of the building blocks), followed by a mechanism of hierarchic folding in which the building blocks are assembled, similar to the DC model.
Taken together, Nussinov refers to the predicted folding assemblies as the "building block" model of folding. The model is consistent with energy landscape theory in that it allows for single or parallel folding pathways, depending on the order in which the building blocks are assembled. Despite their making available the anatomy trees for every protein in the PDB (via the following web site: http://protein3d.ncifcrf.gov/tsai/anatomy.html), very few experimental correlations have been published. It appears that the method is quite good at dissecting a protein into domains, supersecondary structures, and subsequently individual secondary structures. However, the specific interactions forming the folding nucleus cannot be distinguished, and de novo prediction of which building blocks compose the TSE would be difficult. Furthermore, the anatomy tree for barnase does not include the N-terminal helix in the first level, which corresponds to an early-forming structure along the folding reaction. This result does not correlate with mutational experiments that suggest an early interaction between the N-terminal α-helix and several strands of the C-terminal β-sheet.

A third method, developed by Wallqvist et al. (1997), is described as "a computational method useful for identifying the existence of stable structural components of a protein and rank ordering their stability". The details of their algorithm are quite complicated and outside the scope of this introduction. Briefly, they compute an "unfolding penalty" for each residue in a protein based on an empirically derived free energy-like equation. The free energy equation was parameterized through the analysis of a large number of nonhomologous protein structures, not unlike the parameterization of force fields used in MD simulations.
The free energy unfolding penalty can be thought of as the degree to which a given residue will resist unfolding, and it depends on the geometry and the amino acid composition in the vicinity of the given residue.

The authors compare their unfolding penalties to protection factors determined by hydrogen-deuterium exchange NMR (H-D exchange) experiments (H-D exchange is described more thoroughly in the next section). Under native conditions and at equilibrium, H-D exchange experiments measure the rate at which the main-chain amide groups of specific residues exchange their protons with solvent. It is believed that the protein must unfold to a certain extent for exchange to occur, and that the distribution of observed exchange rates indicates the degree to which the structure around each amide must unfold. Residues that exchange quickly unfold easily, whereas residues that exchange slowly resist unfolding. From these exchange rates a protection factor can be computed, which is the experimental analog of the computed unfolding penalties.

Correlation coefficients greater than 0.5 between unfolding penalties (predicted) and protection factors (observed) were reported for several proteins: plastocyanin, staphylococcal nuclease and three different cytochromes c (Wallqvist et al., 1997). For horse heart cytochrome c, the correlation coefficient between the predicted and experimental data was 0.71, and the qualitative overlap was very good for all proteins studied. The authors defined the subset of structure with the highest unfolding penalties as the folding core of the protein. In relation to the folding mechanisms described above, this folding core represents a substructure that forms after nucleation (in the NC model) or after a favorable collision (in the DC model). The authors make no suggestion that the residues with the highest unfolding penalties should be involved in the nucleus or the microdomains.
Overall, these data, together with the two methods described above, provide encouraging evidence that the native-state structure does indeed encode information about how the protein folded.

It should be noted that the application of H-D exchange data to the study of protein folding pathways has come under some scrutiny lately, particularly in light of results from PE experiments. These issues are addressed in the next section, in which H-D exchange methodology is described. The concerns raised by PE experiments are addressed, and a possible alternative interpretation of Φ values is discussed.

1.6 H-D Exchange, Φ Values, and the Protein Folding Core

To verify any computational or theoretical prediction of protein folding, it is necessary to have reliable experimental data for comparison. NMR measurements of hydrogen-deuterium exchange of protein backbone amide groups provide a powerful tool for the study of structural fluctuations in proteins. An outline of the method is given here. A protein is expressed and isolated under environmental conditions favoring the native state. An NMR spectrum of the protein is recorded in ¹H₂O, and the observed chemical shifts are assigned to specific backbone amides in the protein. The protein is then transferred into a buffer composed of deuterated water, ²H₂O. Under native conditions, a protein will experience dynamic fluctuations that can range from localized unfolding events to complete denaturation via a global unfolding pathway. These fluctuations have the effect of exposing amides to solvent, allowing hydrogen-deuterium exchange to occur. Under conditions favoring the native state, local unfolding, or breathing, can arise as a result of protein function, or simply be due to the absence of sufficient bond forces in a given area.
(According to thermodynamics, even global unfolding to high-energy conformations is expected to occur in a small number of protein molecules at equilibrium, based on the Boltzmann distribution.) Several experiments are performed, and in each the protein is allowed to exchange in ²H₂O for a different time period. At the end of each time period, a new NMR spectrum of the protein is recorded. Because deuterium does not produce a signal in these experiments, exchange can be observed as a decay in the signal intensity for each amide proton. By observing exchange over many different time periods, an exponential can be fit to the signal intensity decay, and an exchange rate computed.

Hydrogen-deuterium exchange in proteins is believed to proceed through an unfolding reaction (local or global) according to equation 1.1, as initially proposed by Linderstrøm-Lang (1958). In this equation, $C$ represents a closed form of the amide group. Exchange cannot occur from this state. Likewise, $O$ represents an open, or exchange-competent, form of the amide group. Equilibrium between these two forms is defined by the rate constants for opening, $k_{op}$, and closing, $k_{cl}$. Once in an exchange-competent form, the amide can exchange its hydrogen with solvent. Because the apparent rate of exchange depends on both the rate of opening, $k_{op}$, and the intrinsic rate of exchange, $k_{int}$, it is nearly impossible to determine these rates individually in the context of whole-protein studies. Therefore, $k_{int}$ is typically determined from the rate of exchange observed, for each amino acid type, within the structure of small model peptides (Bai et al., 1993; Molday et al., 1972), for which no "opening" reaction is required.

\[
C_H \underset{k_{cl}}{\overset{k_{op}}{\rightleftharpoons}} O_H \overset{k_{int}}{\longrightarrow} O_D \underset{k_{op}}{\overset{k_{cl}}{\rightleftharpoons}} C_D \tag{1.1}
\]

Under conditions that favor folding, $k_{cl} \gg k_{op}$, and the observed rate of exchange, $k_{ex}$, can be expressed using equation 1.2.

\[
k_{ex} = \frac{k_{op}\, k_{int}}{k_{cl} + k_{int}} \tag{1.2}
\]

Based on equation 1.2, two limiting scenarios of exchange arise.
The first case occurs if $k_{int} \gg k_{cl}$, in which case the observed rate of exchange from equation 1.2 reduces to $k_{ex} = k_{op}$. This scenario is named the EX1 limit for exchange. The EX1 limit is rarely observed in proteins under native conditions. The fact that exchange occurs more quickly than reprotection of the amide suggests a significant structural instability for the protein. This interpretation is supported by experiments showing that most amides favor the EX1 mechanism at increasing concentrations of denaturant. The alternative scenario occurs when $k_{cl} \gg k_{int}$, referred to as the EX2 limit. In this case, equation 1.2 can be reduced to equation 1.3. Because the term $k_{op}/k_{cl} = K_{op}$ represents the equilibrium constant for opening and closing the amide, and this represents the rate-limiting unfolding required for exchange, an apparent free energy of exchange, $\Delta G^{app}_{ex}$, can be computed from the observed exchange rate, $k_{ex}$, and the intrinsic exchange rate, $k_{int}$, by using equation 1.4. EX2 exchange has been shown to be the dominant mechanism of exchange under native conditions, allowing the apparent free energies of exchange to be computed.

\[
k_{ex} = \frac{k_{op}}{k_{cl}} \cdot k_{int} = K_{op} \cdot k_{int} \tag{1.3}
\]

\[
\Delta G^{app}_{ex} = -RT \ln K_{op} \tag{1.4}
\]

The usefulness of H-D exchange as a means to study protein folding is based on the thermodynamic premise that a protein can sample all of its higher-energy conformations along the folding pathway according to a Boltzmann distribution. This means that even under native conditions, at any given time a small population of protein molecules will be in an unfolded state. The protein will rapidly refold, but during the time it is denatured, H-D exchange can occur, and the highly sensitive NMR technique can observe the exchange.

The exchange rates observed in proteins can vary by several orders of magnitude. Fast exchange rates are generally associated with local fluctuations or breathing.
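The exchange-rate expression and its two limits can be checked numerically with a short Python sketch. The function names, the default temperature, the choice of kcal/mol units, and the example rate constants are assumptions made for illustration only; they are not measured values. The sketch evaluates the full expression for the observed rate, shows that it approaches the opening rate in the EX1 limit and the product of the opening equilibrium constant and the intrinsic rate in the EX2 limit, and converts an EX2 rate into an apparent free energy of exchange.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def observed_rate(k_op, k_cl, k_int):
    """Observed exchange rate: k_ex = k_op * k_int / (k_cl + k_int)."""
    return k_op * k_int / (k_cl + k_int)


def apparent_dg_exchange(k_ex, k_int, temperature=298.15):
    """Apparent free energy of exchange in the EX2 limit (kcal/mol).

    Uses K_op = k_ex / k_int and dG = -R * T * ln(K_op).
    """
    if k_ex <= 0 or k_int <= 0:
        raise ValueError("rate constants must be positive")
    return -R_KCAL * temperature * math.log(k_ex / k_int)


# EX1 limit (k_int >> k_cl): the observed rate approaches k_op.
ex1 = observed_rate(k_op=1.0, k_cl=1e3, k_int=1e9)

# EX2 limit (k_cl >> k_int): the observed rate approaches (k_op / k_cl) * k_int.
ex2 = observed_rate(k_op=1.0, k_cl=1e6, k_int=10.0)

# An amide exchanging 1e5 times more slowly than its intrinsic rate
# has an apparent free energy of exchange of roughly 6.8 kcal/mol at 298 K.
dg = apparent_dg_exchange(k_ex=1e-4, k_int=10.0)
```

The ratio `k_int / k_ex` used inside `apparent_dg_exchange` is the reciprocal of the opening equilibrium constant, so a slower observed exchange translates directly into a larger apparent free energy.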
The slowest-exchanging residues are assumed to correspond to global unfolding events; that is, these residues only exchange if the protein completely unfolds. This assumption is verified by comparing the apparent free energy of exchange to the free energy of folding, $\Delta G_{D-N}$. If $\Delta G^{app}_{ex} \approx \Delta G_{D-N}$, then exchange requires global unfolding. Assigning a global unfolding mechanism to slow-exchanging residues can also be accomplished by comparing the change in $\Delta G^{app}_{ex}$ that occurs upon mutation. If $\Delta\Delta G^{app}_{ex}$ for a given residue is approximately equal to the change in stability between the wild-type and mutant proteins, $\Delta\Delta G_{D-N}$, then that residue exchanges by a global unfolding pathway.

Based on the results of H-D exchange experiments, Woodward and coworkers proposed the idea of a slow-exchange core, defined as the minimal collection of residues that includes the slowest-exchanging amides observed under native-state conditions. This slow-exchange core can be further expanded to define a "folding core", using the following definition. If a slow-exchange amide is found in a secondary structure, then that structure is part of the folding core. Occasionally, slow-exchange core residues are found within turns or loops, and these are excluded from the definition of the folding core. Therefore, the folding core is the set of secondary structures that encompass the slowest-exchanging amides. The folding core provides a low-resolution picture of the earliest stable substructure formed on a folding pathway.

Beginning in the early 1990s, around the same time the folding core concept was introduced, Fersht and coworkers began applying the protein engineering (PE) method to probe the structure of the TSE in barnase and other small proteins. Since that time, PE results, presented as Φ values, have become available for a number of proteins.
In particular, barnase, barstar, CI2 and SRC SH3 domain have Φ value data corresponding to mutations in over 50% of the amino acids in each protein. Determining Φ values is a considerable task, given that each requires that a site-specific mutant protein be made, verified, and thermodynamically characterized. Φ values provide an additional experimental means to identify residues important for folding. In the mid-1990s, Fersht and coworkers began to notice discrepancies between Φ value results and H-D exchange rates. In particular, Φ values indicate that the N-terminal helix of barnase is in native-state conformation in the transition state, even though almost all of the residues in this helix exchange by fast mechanisms.

In 1999, Li and Woodward presented a review article of H-D exchange results for many proteins, together with their identification of the folding core in each protein based on the published exchange rates. In this paper, they defend the concept of the folding core, and show that the correlation between structures with high Φ values and structures in the folding core is quite high. In the case of the N-terminal helix of barnase, the following explanation has been given. The only slow-exchanging residue in the helix is L14, in the middle of the secondary structure. This residue also has a high Φ value (~1.0), and so it is reasonable to assume that the main-chain hydrogen bond of L14 is formed early in the folding pathway. Once this bond is formed, local interactions will become favored (relative to their random association), according to any of the folding models described above (NC, DC, HZ). Especially since this interaction is in a helix, we would expect the formation of adjacent i ↔ i + 1 hydrogen bonds to contribute to the cooperativity required for folding. The fact that the amides adjacent to L14 exchange by local fluctuations does not exclude the possibility that they formed early, as Fersht asserts.
However, the slow exchange rate of L14 does indicate that it will only exchange when the entire helix is disrupted as a result of global unfolding. It is this line of reasoning that allows a single slow-exchange residue to impart an "early folding" label to a whole secondary structure. While it would seem more reasonable that the folding core definition should only be extended to the small section of a secondary structure local to the slowly exchanging residues, Woodward points out that the rate constants cannot be resolved well enough to allow for a clear delineation of which part of the structure is involved. It is for this reason that the folding core provides a low-resolution picture of the early folding structure.

Given that most of the controversy surrounding the folding core definition has come from PE studies, it is necessary to discuss the structural interpretation of Φ values. It is generally accepted that a Φ value near 0.0 indicates that the side chain of the given residue does not contribute to stabilizing the TSE. In fact, a Φ value of 0.0 can also occur if a given residue is significantly structured in the denatured state. The denatured state is most often envisioned as an ensemble of random coil states, random implying no specific interactions. However, some studies have indicated that native interactions may exist in a protein under conditions that are often considered denaturing. If this is true, mutating a residue that is structured in the denatured ensemble will affect the free energy of the denatured state as much as the free energy of the transition state, resulting in $\Delta\Delta G_{D-\ddagger} \approx 0.0$, and subsequently a Φ value near 0.0. The explicit definition of a Φ value near 0.0 is that the residue is as structured in the TSE as it is in the denatured state. But if the residue is structured (native-like) in the denatured state, and therefore in the TSE, it should have a Φ value of 1.0; hence the misinterpretation of the data.
Also, many Φ values are less than 0.0 or greater than 1.0, and the structural interpretation of these values is unclear, although several interpretations have been suggested (Ozkan et al., 2001; Myers and Oas, 2002).

Despite the controversy surrounding H-D exchange as a method to study folding pathways (Clarke et al., 1997), the assignment of a folding core based on slow-exchanging residues remains a low-resolution way of identifying structures that form early in the folding pathway. Thus, folding core data provide a useful dataset for validating computational techniques designed to probe early-forming substructures (Torshin and Harrison, 2001), as presented in this thesis. H-D exchange, particularly in conjunction with other techniques such as mutagenesis or mass spectrometry, continues to be a widely used experimental probe of protein folding (Perrett et al., 1995).

1.7 Protein Flexibility

Conformational flexibility is an intrinsic and necessary property of protein structures (Jacobs et al., 2001). The very concept of "folding" a protein implies that a structural deformation is required to change from a denatured state to a native conformation. The importance of native-state flexibility has been discussed for over 20 years (Huber, 1979; Brooks et al., 1988), especially in the context of enzymes (Gavish, 1986). Multiple experimental techniques, such as fluorescence quenching of tryptophan residues, circular dichroism (CD) spectroscopy, NMR, and hydrogen-deuterium exchange (H-D exchange), have been applied to probe native-state fluctuations. Catalytic mechanisms can require a broad range of flexibility, from individual side-chain rotations (Cobessi et al., 2000), to small loop movements as in the flaps of HIV protease (Venable et al., 1993), up to the concerted motion of multiple domains, as in ATP synthase (Sabbert et al., 1997).
Regulation of proteins via allosteric mechanisms has also been shown to require structural flexibility (Bustos-Jaimes et al., 2002).

Theoretical approaches to predicting flexibility in the native states of proteins arose as the number of high-resolution crystal structures increased. Molecular dynamics (MD) simulation is perhaps the most straightforward method to probe the structural flexibility observed in proteins, using techniques such as essential dynamics (Amadei et al., 1999, 1993). However, these methods are computationally expensive. Running a sufficiently long simulation, even for small proteins, is prohibitive. Alternative methods have been developed in which flexibility can be predicted empirically based on a combination of structural and chemical features, such as atomic density and the distribution of polar and nonpolar residue types (Ragone et al., 1989), or based on sequence alone (Bhaskaran and Ponnuswamy, 1988). These methods have met with limited success (Vihinen et al., 1994), suggesting that sequence and/or chemistry alone are not the sole determinants of protein flexibility. Perhaps the explicit interactions between residues need to be taken into account.

Methods that include structural information, specifically the chain topology, when identifying flexible regions in proteins fall into two broad categories. The first relies on comparison between structures of the same protein in different conformations. These conformations can arise from several sources, such as alternative crystal packings or ligand-free versus ligand-bound states. Comparison of the structures is accomplished using various geometric parameters, such as the difference between inter-Cα distances (Nichols et al., 1995) or differences in dihedral angles (Korn and Rose, 1994). These methods provide solid evidence for the location of flexible regions in proteins, but are severely limited in that multiple structures must be available.
Thus possible alternative conformations are not probed.

The second class of algorithms designed to predict flexibility in proteins using structural information is based on physical forces. MD falls into this category, as does normal-mode analysis (NMA). NMA was first applied to proteins in the 1980s (Go et al., 1983; Brooks and Karplus, 1983). It is believed that the lowest frequency vibrational modes, or "soft modes", represent the largest fluctuations in the structure, and therefore are associated with functionally relevant motion. Interestingly, much of the verification that normal-mode analysis gives reasonable results came through comparison of NMA results to those of the crystal structure comparisons mentioned above (Thomas et al., 1996; Ma and Karplus, 1997). As in MD, NMA is restricted by the size of the protein being analyzed, due to a computationally intensive step (diagonalization of the Hessian matrix), although clever alternatives have arisen to help overcome this limitation (Go et al., 1983; Brooks et al., 1995). Although faster than MD, NMA is subject to the same criticisms: lengthy computation time and reliance on empirical force fields.

Rigidity theory provides an alternative technique to measure flexibility. The mathematics of rigidity theory allows one to describe the deformability of any structure, given internal constraints on the structure. The key to applying rigidity theory to real-life problems lies in properly representing a physical structure in mathematical terms, such that conditions required by the theory hold true. For molecular structures, a proper representation lies in accurately representing the bond forces that hold the atoms together. For glass networks, which have been extensively studied by such techniques, the dominant bond forces are the covalent bonds. Flexibility analysis of these networks has been successful using rigidity theory, and led to accurate prediction of their material properties.
In particular, the mean coordination of the networks has been identified as the relevant structural reaction coordinate or order parameter. The variation in the flexibility of glass networks, as a function of the mean coordination, can accurately define the phase transition between rigidity and flexibility in these structures.

Recently, an approximate representation of proteins has been developed such that protein flexibility can be analyzed using rigidity theory (Jacobs et al., 1999, 2001; Rader et al., 2001). These advances have been embodied in the computer program FIRST (Floppy Inclusions and Rigid Substructure Topography), which identifies each bond in a protein as being rotatable (flexible) or nonrotatable (rigid). Furthermore, coupling between flexible and rigid bonds allows decomposition of a protein into rigid regions and flexible regions, and these flexible regions have been shown to correlate well with structurally significant motion in many proteins (Jacobs et al., 2001). The bulk of the work presented in this thesis (chapters 3 and 4) builds on FIRST analysis, and a description of the program is given in chapter 3.

1.8 Work Presented in This Thesis

The motivation for this thesis has been to address the hypothesis that native-state topology encodes information about protein folding. Chapter 2 presents an analysis of the geometry of secondary structure packing in a set of nonhomologous protein structures, specifically α-helices interacting with β-sheets. The results can be divided into two categories: those interactions in which a dipole is present in the sheet, and those interactions in which no dipole is present in the sheet. For the latter case, no preferred packing geometry is observed. However, for helix-sheet interactions in which a dipole is present in the sheet, a strong preference is observed for the helix to align its dipole in the opposite direction relative to the sheet dipole.
Chapter 3 presents an introduction to the flexibility analysis of proteins using the program FIRST. Comparison of native-state flexibility results to experimentally observed flexible regions shows good correlation, validating that the bond forces in a protein can be accurately modeled. A simple model of protein folding is assumed in which hydrophobic collapse leads to a compact structure, which is then stabilized by specific hydrogen bonds as the protein folds to the native state. The key to this simple view of folding is that hydrogen bonds form during folding, and therefore break during unfolding. Protein unfolding is simulated by breaking hydrogen bonds, in an energy-dependent manner, in a method called hydrogen bond dilution. The changes in protein flexibility that occur as hydrogen bonds are diluted from the structure are tracked and related to corresponding experimental observables of protein unfolding. The mean coordination of a protein at any given point during hydrogen bond dilution is shown to be a useful reaction coordinate for the unfolding of a protein. Chapter 4 presents a method for predicting protein folding cores from these hydrogen bond dilution results, and the correlation with experimentally observed folding cores from H-D exchange experiments is shown to be very good. A summary and perspective of the results is given in chapter 5, with a qualitative interpretation of the hydrogen bond dilution results. Also, potential future directions of FIRST analysis are discussed, including methods to predict Φ values for a protein.

Chapter 2

An Analysis of Helix-Sheet Packing Geometry in a Set of Nonhomologous Protein Structures

2.1 Abstract

Here I present an analysis of the packing geometry observed between α-helix and β-sheet secondary structures. The structures are represented as finite-size vectors fit to the Cα coordinates.
A packing interaction is defined by any helix-strand pair within 13.0 Å of each other whose line of closest approach intersects both finite vector representations of the secondary structures. These criteria ensure that the packing geometry can be described by a single dihedral angle, Ω. A strand that is interacting with a helix can be in one of five orientations, depending on a parallel or antiparallel hydrogen bonding pattern with respect to its neighbors, and whether it is the terminal strand in a sheet. α-helices packing against β-sheets were searched for in a set of 1316 non-homologous protein crystal structures determined at better than 2.2 Å resolution. From this set, helix-sheet packing interactions were found in 391 (29.7%) proteins. Bias in the distribution of Ω angles is accounted for by dividing the observed distribution by the expected uniform random distribution of packing angles, which exhibits a sin Ω dependence. For most helix-strand pairs no preferred Ω packing angle is observed. However, for helix-strand interactions in which the strand is parallel to both of its neighboring strands, we see a strong preference for the helix to align antiparallel to the strand, with a packing Ω angle near 180°.

2.2 Introduction

The mechanism by which a protein folds from a denatured state to a folded conformation is an intensely studied, unsolved problem in the natural sciences. Many models describing the reaction have been proposed and supported by experimental evidence, and one single model may not hold for all known proteins. In one particular folding model, known as the framework or diffusion-collision model (Karplus and Weaver, 1994), a subset of secondary elements form partial or complete structures early in the folding reaction. These substructures then interact, forming a super-secondary structure that is representative of the transition state ensemble, and folding then continues to the native state.
Both mutagenesis (Kippen et al., 1994) and hydrogen-deuterium out-exchange (H-D exchange) experiments (Perrett et al., 1995) have shown the framework model to be a valid scenario for the folding of barnase, in which the N-terminal α-helix packs against several strands of the C-terminal β-sheet to form the folding core.

Assuming the framework model is a valid scenario for protein folding, it is an interesting question to ask whether secondary structures prefer to adopt specific geometries when they coalesce. Research on observed packing geometry for secondary structures extends back 20 years. In one of the earliest studies, Chothia et al. (1981) analyzed 50 helix-helix packing interactions from 10 protein structures. The results led them to propose the "ridges into grooves" model for helix-helix interactions, in which helix pairs prefer to adopt specific geometries so as to avoid steric overlap between the side chains. Since that time, advances in computer technology have allowed for not only an invaluable increase in the number of protein crystal structures available, but also the development of algorithms to parse out proteins with homologous sequences whose structures may bias the data. More recent studies have expanded the analysis of helix-helix packing interactions to a dataset of 687 interactions from 220 protein structures with less than 35% sequence identity and better than 2.2 Å resolution (Walther et al., 1996).

In these studies, and similar experiments, the secondary structures are most often represented by best-fit lines through the Cα coordinates of the residues in each structure. The geometry of the interaction can then be uniquely expressed by a distance and two angles. The key angle, named Ω, is defined as the dihedral angle formed by two interacting structures and the line of closest approach between them (Figure 2.1).
Initially, observed distributions of Ω packing angles for helix-helix interactions exhibited distinct peaks (Walther et al., 1996). However, Bowie (1997), with further developments by Walther et al. (1998), demonstrated that the expected uniform random distribution of Ω is biased towards angles near 90°. As described in Walther et al. (1998), there are simply more ways to pack two helices at 90° than there are to pack them at 10°. When this bias was taken into account, the observed peaks in the helix-helix Ω angle distribution were significantly attenuated. However, parallel studies in which specific details of the helix-helix interface were measured as a function of Ω angle yielded new correlations. These analyses continue to provide useful information in the field of protein design.

Measuring the packing geometry for helix-sheet packing interactions has proven a more difficult task than for helix-helix interactions due to the non-symmetric structure of the β-sheet. Early work by Janin and Chothia (1980) stated that the Ω angle for a helix packing against a sheet should be near 0°, indicating that only small angles allowed for complementary packing of the helix side chains within the groove created by a twisting β-sheet. This observation of near-parallel helix-sheet packing was further supported by work published by Cohen et al. (1982) a few years later. A theoretical study by Chou et al. (1985), in which low-energy helix-sheet conformations were predicted, further agreed that a helix-strand packing Ω angle near 0° was a favorable interaction. An analysis of 163 helix-sheet packing interactions observed from proteins of known structure showed a predominant peak near 0°. In all of these studies the packing angles were measured by approximating inherently twisted β-sheets as a plane. Also, the Ω angle was measured on the range -90° ≤ Ω ≤ 90°; therefore, the N-terminal to C-terminal direction of the structures was not taken into account.
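To make the consequence of the restricted range concrete, the following minimal sketch folds a direction-aware angle on [-180°, 180°] into the direction-blind range used by the earlier studies. The function name and the modular-arithmetic formulation are illustrative assumptions, not taken from any of the cited works.

```python
def fold_to_undirected(omega_deg):
    """Map a direction-aware packing angle (degrees, -180..180) onto the
    range [-90, 90) that results when the N- to C-terminal direction of
    the structures is ignored, i.e. when angles are compared modulo 180.

    Hypothetical helper for illustration only.
    """
    return (omega_deg + 90.0) % 180.0 - 90.0


# Antiparallel (180) and parallel (0) packing become indistinguishable.
print(fold_to_undirected(180.0), fold_to_undirected(0.0))
# Two angles that differ by 180 degrees fold onto the same value.
print(fold_to_undirected(155.0), fold_to_undirected(-25.0))
```

Under this folding, a helix packed antiparallel to a strand is reported with the same angle as one packed parallel to it, which is why a range of -180° to 180° is needed to detect any directional preference.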
In this chapter the analysis of helix-strand packing interactions is extended. Five possible strand orientations within a sheet are defined depending on the strand direction relative to its neighbors (parallel versus antiparallel), and the observed distributions of Ω packing angles, with geometric bias taken into account, are presented for each of the five cases. The Ω packing angle is measured over the range -180° ≤ Ω ≤ 180° to observe any correlation between parallel/antiparallel packing and Ω angle. We use a coordinate transformation to measure the Ω packing angle that does not require fitting a plane to the β-sheet. The results indicate a strong preference for a helix to pack antiparallel to a sheet composed of parallel strands, indicating that the dipole-dipole interaction may be important for this type of supersecondary structure.

2.3 Methods

2.3.1 Protein Dataset

The culled Protein Data Bank (PDB) list (Hobohm et al., 1993) from March 8th, 2002 was used to create a dataset of protein crystal structures that had less than 20% sequence identity, better than 2.2 Å resolution, and R-factors below 0.2. Only proteins whose PDB files contained HELIX and SHEET records were included. The final dataset consisted of 1316 proteins.

2.3.2 Representing Secondary Structures as Vectors

The residues forming regular secondary structure in each protein structure were identified according to the HELIX and SHEET records in each PDB file. Helices were required to have at least seven residues, corresponding to two complete turns of a regular α-helix. Strands were required to have at least 3 residues for proper fitting of a vector to the Cα coordinates. Occasionally, the ordering of strands in a PDB file is not consistent with the order in which they are observed in the sheet.
For each sheet identified, the closest distance between neighboring strands was measured, and any sheet that had a closest interstrand distance greater than 5.0 Å was visually checked to see that the strands were in the proper order. Errors in strand order within a PDB file were fixed manually.

The α-carbon positions of each residue in a helix and strand were used to compute the best-fit line through a given structure using a parametric least squares algorithm (Christopher et al., 1996). Because an individual strand can severely deviate from linearity, the degree to which each strand bowed was also computed. Strand bow was calculated using the following equation:

$$\mathrm{Bow} = \frac{m}{d} \qquad (2.1)$$

where d is the distance between the first Cα and the last Cα in the strand and m is the distance from the Cα in the middle of the strand projected onto d. If the strand contained an even number of residues, the average position of the middle two Cα's was used to compute m.

2.3.3 Identifying a Pair of Interacting Secondary Structures

Each helix in a protein is represented in 3D by a vector h, and each strand is represented by a vector s. These vectors, and their corresponding secondary structures, are shown graphically in Figure 2.1. The distance between the midpoints of h and s is defined as MD. The point of closest approach between h and s was computed using equations described by Chothia et al. (1981). These equations compute two scalar quantities, cp1 and cp2. The point on the helix vector h that is closest to strand s is defined as CP1, and the point on the strand vector s that is closest to helix h is defined as CP2. Examples of CP1 and CP2 are shown graphically in Figure 2.1. The scalar quantities cp1 and cp2 can be less than 0.0 or greater than 1.0, in which case the line of closest approach does not intersect with one or both of the secondary structures. Likewise, if both cp1 and cp2 are between 0.0 and 1.0, then the line of closest approach intersects both secondary structures. The vector L is defined as the line of closest approach between h and s and is computed as L = CP1 - CP2. The length of the closest distance between h and s is defined as CD, and is computed as the magnitude of the vector L. As seen in Figure 2.1, L can be viewed as both the projection of CP1 onto s and the projection of CP2 onto h, and consequently L is orthogonal to both h and s.

Figure 2.1: Graphical representation of a helix-sheet packing geometry. The helix and the strand are shown as light gray ribbons. The vector representations of the helix, h, and the strand, s, are shown as black arrows. The line of closest approach is labeled D, and intersects the helix at a point labeled CP1, and the strand at the point CP2. Because D is perpendicular to s, their cross product, D × s, is perpendicular to both. The Ω packing angle is measured as the angle between s and the projection of h, shown as a light gray arrow, onto the plane defined by s and D × s.

Using the measured quantities MD, CD, CP1, and CP2, a helix is defined as interacting with a strand if the following criteria are met:

1. MD ≤ 20.0 Å.
2. CD ≤ 13.0 Å.
3. CD_j ≤ 13.0 Å; CD_k ≤ 13.0 Å, where j and k are the two closest strands to s.
4. 0.01 ≤ CP1, CP2 ≤ 0.99.
5. Bow ≤ 0.25. Also, the neighboring strands (or neighboring strand if s is the last strand in a sheet) must have Bow ≤ 0.25.

Criteria 1 and 2 are designed to limit the search of helix-strand pairs within a given structure to those that are near each other in 3D. Initially, larger cutoff values for MD were chosen; however, all of the additional helix-strand pairs identified failed to meet any of the subsequent criteria, and were discarded. Criterion 3 states that if a helix is interacting with strand i, then it should also be within 13.0 Å of strands j and k, which are the two nearest strands to strand i within the sheet.
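The closest-approach quantities and the pairwise criteria (1, 2 and 4) can be sketched as follows. This is a minimal illustration that uses the standard closed-form solution for the closest points on two lines, which is assumed here and is not necessarily the form of the equations in Chothia et al. (1981). The function names, the representation of each vector by its start and end points, and the example coordinates are all hypothetical. Criteria 3 and 5 need the neighboring strands and are omitted.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _point_at(start, direction, t):
    return tuple(p + t * d for p, d in zip(start, direction))


def closest_approach(h_start, h_end, s_start, s_end):
    """Return (cp1, cp2, CD, MD) for a helix vector and a strand vector.

    cp1 and cp2 are fractional positions along h and s (0.0 = start,
    1.0 = end) of the closest points on the two infinite lines.
    """
    u = _sub(h_end, h_start)      # helix vector h
    v = _sub(s_end, s_start)      # strand vector s
    w0 = _sub(h_start, s_start)
    a, b, c = _dot(u, u), _dot(u, v), _dot(v, v)
    d, e = _dot(u, w0), _dot(v, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("vectors are parallel; closest approach is not unique")
    cp1 = (b * e - c * d) / denom
    cp2 = (a * e - b * d) / denom
    big_cp1 = _point_at(h_start, u, cp1)   # point CP1 on the helix line
    big_cp2 = _point_at(s_start, v, cp2)   # point CP2 on the strand line
    line = _sub(big_cp1, big_cp2)          # L = CP1 - CP2
    cd = math.sqrt(_dot(line, line))
    mids = _sub(_point_at(h_start, u, 0.5), _point_at(s_start, v, 0.5))
    md = math.sqrt(_dot(mids, mids))
    return cp1, cp2, cd, md


def passes_pair_criteria(cp1, cp2, cd, md):
    """Criteria 1, 2 and 4 only; distances in angstroms."""
    return (md <= 20.0 and cd <= 13.0
            and 0.01 <= cp1 <= 0.99 and 0.01 <= cp2 <= 0.99)


# Made-up geometry: a strand along x and a helix crossing 8 A above it.
result = closest_approach((5, -5, 8), (5, 5, 8), (0, 0, 0), (10, 0, 0))
print(result, passes_pair_criteria(*result))
```

Because cp1 and cp2 are expressed as fractions of each vector's length, criterion 4 reduces to a simple range check on the two scalars.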
Criterion 3 was implemented to discard cases where the helix is interacting with the hydrogen bonding edge of one of the strands, and not the side-chain face of a sheet. Criterion 4 ensures that the surface of the helix is packing against the surface of the sheet to which the interacting strand belongs. If CP1 and CP2 are allowed to be less than 0.0 and/or greater than 1.0, it is possible that the helix and the strand are oriented perpendicular to each other (that is, the helix vector is normal to an approximate sheet plane). Criterion 4 also ensures that the line of closest approach, L, will be perpendicular to both h and s, and thus h and s will be coplanar when projected onto a common plane normal to L. Criterion 5 will discard interactions in which the interacting strand or its neighbor(s) are excessively bowed. Excessive bow can result from the occurrence of a β-bulge or the presence of a residue in the strand that has φ,ψ angles that are outside the β region of the Ramachandran plot (Salemme, 1983).

2.3.4 Assigning Local Strand Orientation

The orientation of a strand relative to its hydrogen bonded neighbor(s) can be assigned using the "sense" field assigned to columns 39-40 of the SHEET record in a PDB file. This field gives the N-terminal to C-terminal direction of a strand with respect to the previous strand in the sheet. The first strand in the sheet is assigned a sense of 0. If the second strand is parallel to the first strand, it is assigned a sense of 1; if it is antiparallel, it is assigned a sense of -1.

Table 2.1: Assigning a unique orientation value for each strand in a sheet. The left-hand column shows the strand we are computing an orientation value for, depicted as a double-lined arrow, and its neighbors. Orientations -1 and 1 correspond to strands at the end of a sheet.

Strand order                                        Orientation value
Antiparallel to both neighbors                      -2
End strand, antiparallel to its neighbor            -1
Parallel to one neighbor, antiparallel to the other  0
End strand, parallel to its neighbor                 1
Parallel to both neighbors                           2

The orientation value of strand i is computed as the sense of strand i plus the sense of strand i + 1.
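The rule just stated can be sketched in a few lines. The function name and the list-based input are hypothetical, and the handling of the last strand (treating the missing strand i + 1 as contributing 0) is an assumption made so that end strands receive values of -1 or 1.

```python
def orientation_values(senses):
    """Compute an orientation value for every strand in one sheet.

    `senses` lists the SHEET-record sense of each strand in sheet order:
    0 for the first strand, then 1 (parallel) or -1 (antiparallel)
    relative to the previous strand.

    Assumption: the last strand has no strand i + 1, so a sense of 0 is
    used in its place.
    """
    if not senses or senses[0] != 0:
        raise ValueError("the first strand in a sheet must have sense 0")
    padded = list(senses) + [0]
    return [padded[i] + padded[i + 1] for i in range(len(senses))]


# A four-stranded sheet: strands 2 and 3 parallel, strand 4 antiparallel.
print(orientation_values([0, 1, 1, -1]))
```

For the sheet in the usage line, the second strand receives the value 2 because it is parallel to both of its neighbors.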
For example, if the second strand in a sheet is parallel with respect to the first one, it will have a sense value of 1. If the third strand in the sheet is parallel with the second strand, it will also have a sense value of 1. The orientation value of the second strand is the sum of these two values, 1 + 1 = 2. Table 2.1 lists the five possible orientation values that can occur for a strand in a sheet. The left-hand column shows a cartoon representation of a portion of a sheet. The strand we are computing an orientation value for is shown as a double-lined arrow. The right-hand column lists the orientation value assigned to each case. Orientations -1 and 1 correspond to strands at the end of a sheet.

2.3.5 Measuring the Packing Geometry of a Helix-Strand Interaction

Because the line of closest approach, L, between the helix and the strand is perpendicular to both h and s, the packing geometry between the two structures can be defined by a single dihedral angle, Ω. The angle is computed by orienting the vector s along the positive x-axis, with the N-terminal end positioned at the origin. The system is then rotated about the x-axis such that L lies along the positive z-axis. The result of the final transformation is shown in Figure 2.1. In this orientation both h and s are coplanar with the L-s plane (which can also be viewed as the x-y plane), and Ω can be computed as the angle between the transformed coordinates of h and s using equation 2.2.

$$\Omega = \cos^{-1}\left(\frac{\mathbf{h}\cdot\mathbf{s}}{\|\mathbf{h}\|\,\|\mathbf{s}\|}\right) \qquad (2.2)$$

2.3.6 Measuring Local Sheet Twist

The local sheet twist was measured to observe the degree to which sheet twist affects the Ω packing angle for helix-strand interactions. For a given helix-strand interaction, the vectors s and L are orthogonal, and can be used as a basis for a 2-dimensional subspace, W = {s, L}. The two closest strands to s, a and b, are then projected onto W using equations 2.3 and 2.4.
$$\mathrm{proj}_W\,\mathbf{a} = \langle \mathbf{a}, \hat{\mathbf{s}} \rangle \hat{\mathbf{s}} + \langle \mathbf{a}, \hat{\mathbf{L}} \rangle \hat{\mathbf{L}} \qquad (2.3)$$

$$\mathrm{proj}_W\,\mathbf{b} = \langle \mathbf{b}, \hat{\mathbf{s}} \rangle \hat{\mathbf{s}} + \langle \mathbf{b}, \hat{\mathbf{L}} \rangle \hat{\mathbf{L}} \qquad (2.4)$$

where $\hat{\mathbf{L}}$ and $\hat{\mathbf{s}}$ are unit vectors in the direction of L and s, respectively. The twist angle, T, is then found by using equations 2.5, 2.6, and 2.7; it is the average of the angles that the projected vectors, a and b, make with strand s in the plane of W.

$$T_a = \cos^{-1}\left(\frac{\langle \mathrm{proj}_W\,\mathbf{a}, \hat{\mathbf{s}} \rangle}{\|\mathrm{proj}_W\,\mathbf{a}\|}\right) \qquad (2.5)$$

$$T_b = \cos^{-1}\left(\frac{\langle \mathrm{proj}_W\,\mathbf{b}, \hat{\mathbf{s}} \rangle}{\|\mathrm{proj}_W\,\mathbf{b}\|}\right) \qquad (2.6)$$

$$T = \frac{T_a + T_b}{2} \qquad (2.7)$$

2.3.7 Normalizing the Helix-Sheet Ω Angle

The Ω packing angle measured for packing secondary structures has an inherent bias towards angles near 90°. This bias was originally shown for helix-helix interactions by Bowie (1997), and further developed by Walther et al. (1998). The bias arises from a non-uniform probability distribution in the 10° bin sizes used to tabulate the Ω angle results. For example, two vectors of unit length forming a 90° angle generate an area A1 = sin(90°) = 1.0. Likewise, two unit vectors forming a 45° angle create an area A2 = sin(45°) = 0.7071. In the case of A1, if we keep one of the unit vectors fixed in space, the other vector can position its endpoint anywhere within the area A1 and keep Ω = 90°. It can be readily seen that A1 > A2, and therefore there are more ways in which two unit vectors can form a 90° angle than there are ways to form a 45° angle, and the observed bias is proportional to sin Ω. To eliminate this bias from our data, the number of observed occurrences for each 10° bin was divided by the number expected from the uniform random distribution. The number of occurrences expected for each 10° bin was computed using equation 2.8. In these unbiased data a value of 1.0 indicates that the given angle occurs just as often as we would expect it to if secondary structures packed together at random angles. Values less than 1.0 indicate unfavorable packing angles, and values greater than 1.0 indicate a preferred packing angle.
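The normalization described in this section can be sketched as below. This is a minimal illustration under stated assumptions: the random-packing density is taken to be proportional to sin|Ω| over -180° to 180°, so the expected fraction in a bin is the integral of the sine over that bin divided by 4 (the integral over the full range). The function names, the dictionary output, and the sample angles are hypothetical, and the exact normalizing constant used in the thesis is not recoverable from the text.

```python
import math


def expected_fraction(lo_deg, hi_deg):
    """Expected fraction of random packings falling in [lo_deg, hi_deg].

    Assumes a density proportional to sin|omega| on [-180, 180] and a
    bin that does not straddle 0 degrees.
    """
    lo, hi = sorted((abs(math.radians(lo_deg)), abs(math.radians(hi_deg))))
    # Integral of sin over [lo, hi] is cos(lo) - cos(hi); the integral of
    # sin|omega| over the whole range [-180, 180] is 4.
    return (math.cos(lo) - math.cos(hi)) / 4.0


def observed_over_expected(angles_deg, bin_width=10):
    """Return {(lo, hi): observed / expected} for every bin.

    Assumes every angle lies in [-180, 180] and bin_width divides 180.
    """
    if not angles_deg:
        raise ValueError("no angles supplied")
    if 180 % bin_width != 0:
        raise ValueError("bin_width must divide 180 so no bin straddles 0")
    n_bins = 360 // bin_width
    counts = [0] * n_bins
    for angle in angles_deg:
        # An angle of exactly 180 is placed in the last bin.
        index = min(int((angle + 180) // bin_width), n_bins - 1)
        counts[index] += 1
    total = len(angles_deg)
    ratios = {}
    for i, count in enumerate(counts):
        lo = -180 + i * bin_width
        hi = lo + bin_width
        ratios[(lo, hi)] = count / (total * expected_fraction(lo, hi))
    return ratios


# Made-up angles: one observation near 0 degrees outweighs two near 90.
ratios = observed_over_expected([85.0, 85.0, 5.0, -175.0])
print(ratios[(0, 10)], ratios[(80, 90)])
```

With this weighting, a single observation near 0° or 180° counts for much more than one near 90°, because random packing rarely produces such angles.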
$$\int_{\Omega_1}^{\Omega_2} \sin\Omega \, d\Omega \qquad (2.8)$$

2.4 Results

2.4.1 Helix-Strand Ω Packing Angle as a Function of Strand Orientation

For each helix-strand packing interaction the strand can be in one of five orientations depending on whether it is hydrogen bonded in parallel or antiparallel with respect to its neighbors, and whether or not it is the first or last strand in the sheet. The distribution of observed Ω packing angles, divided by the expected distribution, is presented in Figure 2.2 for each of the five possible strand orientations. A cartoon representation of the strand orientation within the sheet is shown in the upper left of each panel (see also Table 2.1). Orientations of 1 and -1 correspond to strands with only one neighbor, as they are at the end of a sheet. For each 10° bin in the plots, a value of 1 indicates that the number of observed Ω angles in that bin occurred just as often as would be expected randomly. Values less than 1.0 suggest the packing angle occurs less often than expected, and values greater than 1.0 indicate preferred packing angles. Ω angles in the range -90° ≤ Ω ≤ 90° represent an interaction in which the N-terminal to C-terminal direction of the helix is parallel to the direction of the strand. If Ω ≤ -90° or Ω ≥ 90°, the helix is packed antiparallel to the strand. The top three panels in Figure 2.2 show that for type 0, 1 and 2 strand orientations, there is an increasing preference for the helix to pack antiparallel to the strand. Type 2 strand orientations exhibit the strongest preference, with almost no parallel packing interactions observed in real proteins. Figure 2.3 shows an ideal type 2 helix-strand interaction present in the protein IIB cellobiose from E. coli (PDB code: 1iib) (von Montfort et al., 1997). The arrows on the strands (colored yellow) point in the N- to C-terminal direction. The strand determined to be interacting with the helix is the second one from the left, and it can be seen that this strand is parallel to both its neighbors. The N- to C-terminal direction of the helix is from the upper-right to the lower-left. The Ω packing angle for this interaction is 118.04°.

[Figure 2.2: five histograms, one per strand orientation (2, 1, 0, -1, -2); x-axis: Ω packing angle (degrees); y-axis: Observed/Expected]

Figure 2.2: Distribution of helix-sheet Ω packing angles for each of the five strand orientations. A cartoon representation of the strand orientation is shown in the upper left of each histogram. The number of observed occurrences for each 10° bin was divided by the number expected from a uniform random distribution. Here, a value of 1.0 indicates that the given Ω angle occurs just as often as would be expected at random. Values less than 1.0 are unfavored, and values greater than 1.0 indicate preferred packing angles. Helix-strand interactions in which the strand is in an orientation of 1 or 2 show a strong preference to pack antiparallel (Ω angles near -180° or 180°). Strands in orientations 0, -1 and -2 show no strong angle preference when packing with a helix.
Type -1 and -2 strand orientations show no preference for parallel versus antiparallel packing; however, there is a preference to pack at angles near -25° and 155°, which represent the same angle if the N- to C-terminal direction of the structures is disregarded.

Figure 2.3: Example of a helix-sheet packing interaction found in the protein IIB cellobiose from E. coli. The geometry of the interaction was measured relative to the second strand from the left, which has an orientation value of 2. The N- to C-terminal direction of the strands is indicated by the arrow heads. The N- to C-terminal direction of the helix is from upper-right to lower-left. The measured Ω angle is 118.04°.

2.4.2 Ω Packing Angle as a Function of Local Sheet Twist

For each helix-sheet interaction found, the geometry is measured relative to a single strand in the sheet that is closest to the helix. The local twist of the sheet is then measured by using the strand interacting with the helix, and its neighbors. A scatter plot of local sheet twist versus Ω packing angle is shown in Figure 2.4. The points are colored according to the orientation value of the interacting strand. Correlation coefficients between T and Ω were computed for each of the five strand orientations, and the results are shown in Table 2.2. No correlation between sheet twist, T, and helix-sheet Ω packing angle was observed for any of the five possible strand orientations.

[Figure 2.4: scatter plot, "Twist angle versus Ω angle for all Strand Orientations"; x-axis: Ω angle (degrees); y-axis: Twist angle (degrees); points keyed by orientation]

Figure 2.4: Scatter plot of local sheet twist versus Ω packing angle for each of the five strand orientations. The right-handed twist common to most β-sheets is indicated by the large number of negative twist angles observed. No clear correlation between twist angle and Ω angle was observed for any of the five strand orientations.

Table 2.2: Correlation coefficients between sheet twist, T, and Ω packing angle for all five possible strand orientations.

Strand Orientation    Correlation Coefficient
-2                     0.0761
-1                    -0.0289
 0                     0.2304
 1                    -0.2533
 2                    -0.2525

2.5 Conclusions

The coiling and right-handed twist associated with β-strands depend on the φ/ψ values of the individual residues (Chothia, 1983). These φ/ψ values in turn depend on the type of residue at any given position and the hydrogen bonding pattern between adjacent strands. Ideally, a flat (uncoiled) β-sheet, like the one proposed by Pauling and Corey (1951), would have optimal interchain hydrogen bonding geometry. However, this also requires the residues in the sheet to adopt a perfect 2-fold helix symmetry, which is energetically unfavorable. To minimize the energetic frustration, residues within a strand adopt φ/ψ angles that lead to a right-handed twist, resulting in poor hydrogen bonding complementarity between strands. To realign the hydrogen bond donors and acceptors of adjacent strands, successive residues in a strand adopt different φ/ψ values, producing twisted, coiled β-strands and subsequently giving rise to a twisted β-sheet. This compromise between maximizing the number of hydrogen bonds formed and minimizing the conformational energy of each strand has been predicted theoretically and observed in proteins of known structure (Salemme, 1983).

An individual sheet can consist of all parallel, all antiparallel, or mixed parallel and antiparallel strands. This diversity in hydrogen bonding pattern, along with varying amino acid composition, can lead to sheets in which the twist and coil vary depending on where in the sheet you are looking. Here I presented a novel geometric definition for the measurement of the local twist of a β-sheet.
The hypothesis was that as the twist of a sheet deviates farther from planarity, steric interactions would cause the helix to turn, creating a larger Ω packing angle to better fit in the groove formed by the strands of the sheet. A plot of local sheet twist versus Ω would then reveal a correlation between the degree to which a sheet twists and the angle at which the helix will pack against the sheet. Table 2.2 and Figure 2.4 clearly indicate that there is no correlation between our measure of sheet twist and Ω packing angle. This can arise for several reasons, most likely the side-chain conformations. Side chains can vary in size, and most exhibit conformational flexibility. By not taking into account the specific interactions occurring in the helix-sheet interface, we assume that there are specific side chains within the interface between all observed helix-sheet pairs. The hypothesis also assumes that the surface created by the side chains, the surface with which a helix is actually interacting, can be approximated by the backbone atoms of the strands. This appears not to be the case, and an extended analysis of these interfaces, similar to what has been reported for helix-helix packing interfaces, is warranted.

In the distributions of Ω angle for each strand orientation shown in Figure 2.2, only those orientations where the interacting strand is parallel to its neighbors show a strong Ω angle preference. In these cases, orientations 1 and 2, the helix prefers to pack antiparallel to the strand, near 180°. One possible explanation for seeing a preference in these orientations and not the others is the presence of a net dipole arising from the hydrogen bonding pattern in parallel strands. The hydrogen bonds between parallel strands make a 20° angle with respect to the N to C direction of the protein backbone (Figure 2.5A). This leads to a net dipole moment of about 1.15 Debyes (Hol et al., 1981).
If a helix-strand dipole interaction is occurring, we would expect the helix to orient its dipole in the direction opposite to that of the strands. This expected antiparallel packing interaction is indeed what is observed. For the remaining strands in orientations 0, -1, or -2, the interacting strand is antiparallel to one or both of its neighbors. The hydrogen bonds between antiparallel strands are nearly perpendicular to the protein backbone (Figure 2.5B), and a negligible net dipole moment is produced. In these helix-strand interactions, the dipole would not be expected to play a role, and we observe no strong preference for Ω angle.

Another possible explanation for the observed Ω angle packing preference in type 2 and 1 strand orientations is the structure of the sheet. Sheets composed entirely of parallel strands have been shown to be flatter and less flexible than purely antiparallel sheets (Salemme, 1983), increasing the net dipole moment relative to a highly twisted sheet. Also, purely parallel sheets are uncommon, and tend to be buried within proteins of α/β architecture (Chothia, 1983). In these cases, optimizing the packing interaction between a helix and a sheet would be beneficial to maintaining a compact protein structure.

Figure 2.5: Hydrogen bonding pattern for parallel and antiparallel β-strands. Carbons are depicted as light gray spheres, nitrogen as dark gray spheres, oxygen as open spheres, and hydrogen as small black spheres. Hydrogen bonds are shown as dashed lines between the main-chain oxygen and hydrogen atoms of adjacent strands. A.
The hydrogen bonds between parallel strands form a 20° angle with respect to the protein backbone, resulting in a net dipole in the C→N direction. B. The hydrogen bonds between antiparallel strands are nearly perpendicular to the protein backbone, and no net dipole is produced.

Chapter 3

FIRST Flexibility Analysis and Hydrogen Bond Dilution as a Method to Simulate Thermal Denaturation

Research presented in this chapter is based on work that has appeared in the following publications:

B. M. Hespenheide, A. J. Rader, M. F. Thorpe, and L. A. Kuhn. Identifying protein folding cores from the evolution of flexible regions during unfolding. J. Mol. Graph. Model., In press.

A. J. Rader, B. M. Hespenheide, L. A. Kuhn, and M. F. Thorpe. Protein unfolding: Rigidity lost. Proc. Natl. Acad. Sci., 99:3540-3545, 2002.

3.1 Abstract

Here I present the application of a novel computational technique, FIRST, for measuring flexibility in protein structures. The flexibility present in a molecular structure is a property that depends upon the bond forces present in the structure. FIRST treats bond forces, such as covalent and hydrogen bonds, as distance constraints that put restrictions on the conformational space available to the atoms in a protein. Once all the bond forces have been identified and modeled, FIRST computes the resulting flexibility and produces a rigid cluster decomposition (RCD) of the protein structure. The RCD reports for each bond in a protein whether it is free to rotate (flexible) or not free to rotate (rigid). The RCD for the native state of HIV protease, in both ligand-bound and unbound forms, correlates well with experimentally identified rigid and flexible regions in this protein. Also, we present a method for mimicking thermal denaturation in a protein based on dilution of the hydrogen bonds and salt bridges within a protein.
We show that the unfolding of a protein can be viewed as a rigid to flexible transition, and this transition can be tracked by observing how the flexibility of a protein changes at each step during hydrogen bond dilution (simulated thermal denaturation). A novel graphical representation is presented for displaying the data. Finally, the transition state is determined from the inflection point in the change in the number of independent bond-rotational degrees of freedom, or floppy modes, of the protein as its mean atomic coordination decreases. The first derivative of the fraction of floppy modes as a function of mean coordination is similar to the fraction-folded curve for a protein as a function of denaturant concentration or temperature. The second derivative, a specific heat-like quantity, shows a peak around a mean coordination of ⟨r⟩ = 2.41 for 26 diverse proteins. As a protein denatures, it loses rigidity at the transition state, proceeds to a state where just the initial folding core remains stable, then becomes entirely denatured or flexible. This universal behavior for proteins of diverse architecture, including monomers and oligomers, is analogous to the rigid to floppy phase transition in network glasses. This approach provides a unifying view of the phase transitions of proteins and glasses, and identifies the mean coordination as the relevant structural variable, or reaction coordinate, along the unfolding pathway.

3.2 Introduction

Much interest is currently focused on the rapid and faithful folding of proteins from a one-dimensional (1D) sequence of amino acids in a random coil to a three-dimensional (3D) biologically functional structure in the native state (Bryngelson et al., 1995; Honig, 1999; Baker, 2000).
A general view of protein folding is that it begins with hydrophobic collapse, in which the random coil changes to a compact state, with the hydrophobic groups in the interior region and polar groups at the surface interacting with the surrounding water. The packing is not yet optimal, with hydrophobic groups somewhat free to slide about in the interior of the globule, until residues are locked in place by the formation of specific hydrogen bonds. These hydrogen bonds can be regarded as a sort of velcro that locks the various structural elements in the folded protein together. Once these interactions are optimized, the native state is predominantly rigid with flexible hinges or loops at the surface, the number and distribution of which depend on the particular protein.

There have been many significant theoretical advances in understanding protein folding in recent years, including the concept of a funnel-shaped free energy landscape (Bryngelson et al., 1995; Onuchic et al., 1997; Chan and Dill, 1998; Brooks III et al., 2001), simplified lattice models that are more tractable for simulations of folding (Chan and Dill, 1998; Klimov and Thirumalai, 1999; Mirny and Shakhnovich, 2001), and more detailed but computationally intensive off-lattice models and molecular dynamics (MD) simulations (Daggett et al., 1996; Duan et al., 1998; Shea and Brooks III, 2001). These approaches have increased our understanding considerably, but the actual steps along the folding pathway remain elusive. Experimentally, chemical and thermal denaturation of proteins are standard techniques to determine protein folding and unfolding equilibria and kinetics (Jackson, 1998; Eaton et al., 2000). However, to probe the range of time scales involved in folding, from microseconds to seconds, a series of challenging experiments is required (Eaton et al., 2000; Gruebele, 1999), and detailed structural information is generally not available.
I have concentrated on a simpler problem: that of analyzing the unfolding mechanism by dilution of noncovalent contacts in the native structure. For proteins in which the unfolding process is reversible, this approach also provides information about the folding pathway. I postulate that information about the folding pathway is contained within the density, strength, and specific location of the hydrogen bonds in the native state. To simulate denaturation, the hydrogen bonds and salt bridges within the structure are ranked according to their relative energies and broken one by one, from weakest to strongest, similar to the way these bonds would break in response to slowly increasing temperature. The transition towards a flexible, denatured ensemble in the protein is observed as the hydrogen-bond and salt-bridge network is disrupted. In Chapter 4, these results are found to be robust against the introduction of some noise, or stochastic character, into the order in which the hydrogen bonds are broken.

In this chapter the program FIRST (Floppy Inclusions and Rigid Substructure Topography) is introduced as a computational tool to study protein folding. FIRST can decompose a protein structure into rigid clusters and flexible regions. When hydrogen bonds are removed from a structure, as during simulated unfolding, a protein will become increasingly flexible. The results of FIRST analysis on native-state structures are shown to agree with known flexible and rigid regions of folded proteins. This leads to the conclusion that rigid regions of a protein represent folded structure, and flexible regions represent unfolded or non-native structure. Using this definition of rigid = folded, flexible = unfolded, we can track the unfolding of a protein by observing the evolution of flexible regions during a simulated unfolding experiment.
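The ranking-and-breaking procedure described above can be illustrated with a minimal sketch. It assumes only that each hydrogen bond or salt bridge carries an energy in kcal/mol, with more negative values meaning stronger bonds; the bond labels and energies below are hypothetical, and the flexibility analysis that would be rerun after each removal is deliberately left out rather than guessed at.

```python
def dilution_steps(bonds):
    """Yield (broken_bond_id, remaining_bond_ids) for each dilution step.

    bonds: list of (bond_id, energy_kcal_per_mol) pairs, where a more
    negative energy means a stronger bond. Bonds are broken one at a
    time from weakest (least negative) to strongest.
    """
    # Sort strongest first so the weakest bond sits at the end of the list.
    remaining = sorted(bonds, key=lambda bond: bond[1])
    while remaining:
        broken_id, _energy = remaining.pop()  # remove the weakest bond left
        # A flexibility analysis of the remaining network would go here.
        yield broken_id, [bond_id for bond_id, _ in remaining]


# Hypothetical bond labels and energies, for illustration only.
example_bonds = [("hb_a", -0.5), ("hb_b", -4.2), ("hb_c", -1.3)]
for broken, left in dilution_steps(example_bonds):
    print(f"broke {broken}; remaining: {left}")
```

Sorting once and popping from the end keeps each step cheap. A stochastic variant of the ordering, as discussed for Chapter 4, would replace the deterministic sort.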
Also, the ability of FIRST to provide detailed information on the phase transition between native (rigid) and denatured (flexible) states of the protein is presented.

3.3 Methods

3.3.1 FIRST Flexibility Analysis

The program FIRST was developed as a computational tool to measure flexibility in protein structures. At the core of the program is a graph-theory algorithm named the 3D pebble game, which is a 3-dimensional extension and implementation of results in mathematical rigidity theory that have developed over the past few years (Jacobs and Hendrickson, 1997; Jacobs and Thorpe, 1995, 1998). The roots of this work go back to Lagrange's (1788) introduction of constraints on the motion of mechanical systems in the late eighteenth century, which Maxwell (1864) used in the mid-nineteenth century to determine whether structures were stable or deformable. The applications of this kind of work have traditionally been to solve problems in engineering, such as the structural stability of different truss configurations in bridges. A very significant advance occurred with Laman's theorem (Laman, 1970), which exactly determines the degrees of freedom (DOF) within 2-dimensional networks, and allows the rigid regions and flexible joints between them to be found. A rigorous extension of Laman's theorem to 3D structures has not yet been proven. However, the molecular framework conjecture proposed by Tay and Whiteley suggests that Laman's theorem will hold for a specific class of 3D networks called bond-bending networks, in which vertices (atoms) are connected by edges (bonds) and every angle between edges is defined (each bond angle is fixed) (Tay and Whiteley, 1984). For 3D bond-bending networks, the flexibility in the system derives from dihedral or torsional rotations of the bonds that are not locked in by the network. A brief introduction to rigidity theory as applied to macromolecules, such as proteins, is presented here.
More detailed accounts can be found in (Jacobs et al., 1999, 2001) and references therein.

[Figure 3.1 panels:
  5 atoms: 5*3 - 5 - 5 - 6 = -1 (Rigid)
  6 atoms: 6*3 - 6 - 6 - 6 = 0 (Isostatic)
  7 atoms: 7*3 - 7 - 7 - 6 = 1 (Floppy)]

Figure 3.1: Determining the number of internal degrees of freedom in 3 small rings using constraint counting. Examples are shown for five-, six-, and seven-fold rings. The internal degrees of freedom (DOF) are counted by determining the total DOF, 3 for each atom (shown in green), and subtracting the number of distance constraints that arise from central-force bonds (shown in black), bond-bending constraints (shown in red), and the macroscopic rigid-body DOF (indicated by the light blue -6 in each equation). A negative value for the number of internal DOF (as in the five-fold ring) indicates that the structure is rigid and overconstrained. It has more than enough constraints to be rigid. A value of 0 (as in the six-fold ring) indicates the structure is rigid and isostatic. This structure has just enough constraints to be rigid. A positive value for the number of internal DOF (as in the seven-fold ring) indicates that the structure is flexible or underconstrained.

The results of FIRST rely on accurately counting the DOF and distance constraints in a system. Each atom in the system is assigned 3 DOF associated with motion in any direction in 3 dimensions. When bonds form between atoms, the motion of the atoms becomes restricted. Bond forces impose distance constraints on the atoms; that is, a pair of bonded atoms can no longer move independently of each other. The Euclidean distance between bonded atoms is held constant, and the net effect is the loss of DOF in the system. An example of how the internal DOF of three small rings can be computed by counting the distance constraints is shown in Figure 3.1. For the five-fold ring, which could represent the side chain of a histidine residue, there are 5 atoms, so the system consists of 3 * 5 = 15 DOF.
There are 5 covalent bonds (thick black lines) and 5 bond-bending constraints (red dashed lines), resulting in 15 - 10 = 5 DOF in the system. However, it is necessary to subtract off 6 trivial DOF, referred to as the macroscopic or rigid-body DOF, in order to determine the internal DOF. Rigid-body DOF refer to the fact that all 5 atoms in the five-fold ring can be translated or rotated together without changing any of the properties of the system. Because we are in 3 dimensions, there are 3 rigid-body translational DOF and 3 rigid-body rotational DOF, for a total of 6 rigid-body DOF. If we subtract off these rigid-body DOF (indicated by the light blue -6 in the equations of Figure 3.1), then we see that the five-fold ring has -1 internal DOF. The physical interpretation of the negative value is that there are more constraints than are necessary to make the five-fold ring rigid. The common name for this type of structure is overconstrained. If another atom is added to the system, as in the six-fold ring, the final constraint count shows there are 0 DOF in the six-fold ring. This means that there are just enough constraints to make this structure rigid. Add a bond and it will become overconstrained. Remove a bond and it will become flexible. A structure with 0 DOF is rigid and is referred to as isostatic. For completeness, the constraint counting for a seven-fold ring is shown. Here, the final count yields 1 DOF. Positive values in the number of DOF indicate flexible or floppy structures.

For a protein, the total number of DOF will be the number of atoms observed in the crystal structure times three. Because the intricate bond network of a protein structure consists of many large and small rings, it is possible to have multiple overconstrained, isostatic, and flexible regions in a protein at the same time.
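The ring arithmetic above can be reproduced with a short sketch. It implements only the simple global count used for the isolated rings (3 DOF per atom, minus one per central-force and bond-bending constraint, minus 6 rigid-body DOF). It is not the 3D pebble game, and it assumes every constraint is counted once over the whole structure, so it cannot locate overconstrained and flexible regions that coexist in a larger network. The function names are illustrative.

```python
def internal_dof(n_atoms, n_central_force, n_bond_bending, rigid_body_dof=6):
    """Internal DOF from a simple global constraint count."""
    return 3 * n_atoms - n_central_force - n_bond_bending - rigid_body_dof


def ring_internal_dof(n):
    """An isolated n-fold ring has n bonds and n bond angles."""
    return internal_dof(n_atoms=n, n_central_force=n, n_bond_bending=n)


def classify(dof):
    """Interpret the sign of the internal DOF count."""
    if dof < 0:
        return "rigid, overconstrained"
    if dof == 0:
        return "rigid, isostatic"
    return "flexible"


for ring_size in (5, 6, 7):
    dof = ring_internal_dof(ring_size)
    print(f"{ring_size}-fold ring: {dof} internal DOF ({classify(dof)})")
```

For a single ring the count reduces to n - 6, which is why the six-fold ring sits exactly at the isostatic point.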
Determining the size and location of these regions, after all the DOF and distance constraints have been accounted for, is practically impossible to do by hand, and requires the program FIRST, specifically the 3D pebble game, to do the counting.

At this point it becomes necessary to identify all of the distance constraints that can arise due to bond forces. The bond forces in a molecular structure such as a protein will range from strong (e.g., covalent bonds) to weak (e.g., van der Waals interactions) (Figure 3.2). For the purposes of flexibility analysis, all bond forces that are as strong as or stronger than hydrogen bonds are included. By setting this cutoff, it is assumed that weaker bond forces, such as van der Waals interactions, are not strong enough to impose a distance constraint between a pair of atoms. The specific bond forces included in the model are covalent bonds, salt bridges, and hydrogen bonds. These bond forces are used to build the bond-bending network that FIRST requires for proper analysis of the flexibility in the structure. The connections between atoms generate the central-force distance constraints. The required angular constraints arise because each bond angle is treated as constant. To represent a constant angle in the bond-bending network, the distance between second-nearest neighbor atoms is fixed. An example of both bond-length and bond-angle distance constraints for the main-chain atoms of an amino acid is shown in Figure 3.3. The bond-length distance constraints are shown as thick black lines between the N-Cα and Cα-C atoms. The bond-angle constraint, which results from a constant angle α, is shown as a dashed, gray line between the N and C atoms.

Representing a bond force, such as a covalent bond, as a distance constraint assumes that the distance between the two atoms is constant. These constant distances are defined in a protein structure either explicitly as equilibrium bond lengths, or implicitly as equilibrium bond angles.
[Figure 3.2 schematic: Microscopic interactions ordered from strong to weak: Umol = UCF + UBB + USB + UHB + UD + Uother. Labels, from strongest to weakest: covalent bond stretching (UCF); covalent bond bending (UBB); salt bridges (USB); hydrogen bond range (UHB); dihedral/torsional rotations (UD); van der Waals, weak electrostatic, and non-bonded forces (Uother).]

Figure 3.2: A schematic representation of microscopic bond forces ordered from strongest to weakest. Umol represents the total potential energy of the bond forces in a protein. It is necessary to select which bond forces impose distance constraints by setting an appropriate energy cutoff. For the purposes of protein flexibility analysis, hydrogen bonds (with energies ≤ -0.1 kcal/mol), salt bridges, and covalent bonds are modeled as distance constraints. Weaker forces such as van der Waals interactions are not included.

Figure 3.3: Example of bond-length and bond-angle distance constraints for the main-chain atoms of an amino acid. The positions of the N, Cα, and C atoms are crystallographically defined, and the sp3 hybridization of the Cα atom defines the bond angle α. Because the angle α is constant, the distance between the N and C atoms, shown as a dashed, gray line, is also constant. The thick black lines between the N-Cα and Cα-C atoms represent bond-stretching distance constraints that arise from covalent bonds.

By fixing the distance we neglect high-frequency motion (bond-stretching, bond-bending) that would be expected due to thermal motion. This leads to the interpretation that FIRST results are meaningful only on time scales longer than those observed for bond-bending and bond-stretching vibrations, which generally occur in the range of 4000 to 200 cm^-1 (frequencies of roughly 120 to 6 THz, corresponding to periods of roughly 8 to 170 femtoseconds) (Fadini and Schnepel, 1989). The structural flexibility required for protein folding (Jackson, 1998) or domain motion (Epstein et al., 1995) occurs as a result of dihedral rotations, which are low-frequency modes that occur on much longer time scales (≥ microseconds).
Therefore FIRST results can give us information about flexibility in these processes.

The peptide bond of a protein represents a special case of a bond force due to its partial double-bond character, which arises from resonance with the main-chain carbonyl group. All double and partial double bonds are viewed as non-rotatable dihedral angles, and special care is taken within the FIRST program to lock these bonds.

In addition to modeling the strong bond forces mentioned above, hydrophobic interactions are also included as distance constraints. However, in contrast to covalent bonds and hydrogen bonds, hydrophobic interactions are modeled such that they restrict the motion between two hydrophobic atoms but do not hold the distance between them constant. This is accomplished by linking a pair of hydrophobic atoms via a series of artificial atoms and bonds. These pseudoatoms increase the number of DOF associated with a hydrophobic interaction, and the intervening pseudobonds create distance constraints that reduce the number of DOF. The net effect is the loss of 2 DOF per hydrophobic interaction. A further description of how hydrophobic tethers are modeled is given below in the Methods section: Identifying and Modeling Hydrophobic Interactions.

Once all of the bond forces and hydrophobic interactions have been identified, it is possible to create the bond-bending network of a protein. In this 3D network, each of the vertices represents the position of an atom from the protein structure. Each edge represents a distance constraint that arises from fixed bond lengths and angles. This generic bond-bending network is what FIRST analyzes using 3D constraint counting. The algorithm will identify which distance constraints in the network are adding stress to the network. These redundant bonds are associated with nonrotatable dihedral angles in the protein. A set of interconnected nonrotatable bonds forms a rigid cluster.
Also computed is the number of floppy modes, which is specifically the number of bond-rotational DOF that remain in the protein after the nonredundant distance constraints have been subtracted. Floppy modes are usually associated with a collective motion (a concerted motion of many bonds within a protein, such as a large domain motion). In general, because floppy modes are associated with a collective motion consisting of many bonds, the number of floppy modes will be less than the total number of flexible bonds in a protein.

It is worth mentioning that the algorithms encoded in FIRST are extremely efficient. Alternative methods for identifying rigid and flexible bonds in proteins will generally scale with a computational complexity of order O(N^7), where N is the number of atoms in the protein (Jacobs et al., 1999). Theoretically, FIRST scales as order O(N^2); however, in practice it is usually linear in the number of atoms (of order O(N)). The worst case that has been observed was of order O(N^1.2) (M. F. Thorpe, personal communication).

3.3.2 Preprocessing Protein Structures for Analysis

Given the absence of electron density for hydrogen atoms in most X-ray crystal structures, positions for polar hydrogen atoms (including those in bound water molecules) were assigned using the software WhatIf (Vriend, 1990). The WhatIf software uses a combination of heuristic criteria and hydrogen bond energy functions to optimize the placement of polar hydrogen atoms in a protein structure. Hydrogen positions determined by WhatIf have been shown to overlap well with those observed in neutron diffraction structures for five proteins (the worst case had 94.3% of the hydrogen positions in common between the computational and experimental results) (Jacobs et al., 2001). WhatIf was run on a protein in the presence of all crystallographic water molecules found in the structure. However, for all subsequent analyses only buried water molecules were included.
Buried waters were identified using the PRO_ACT software (Williams et al., 1994). The program WhatIf will not add hydrogens to atoms or molecules defined with the HETATM (heteroatom) field of a Protein Data Bank (PDB) file. HETATMs are typically small ligands such as metals, cofactors, inhibitors, and substrates or substrate analogs. To add hydrogen atoms to HETATM groups, the Biopolymer module of the InsightII molecular graphics package (Biosym, Molecular Simulations) was used.

In the choice of protein structures to analyze, the stereochemical quality of the structure can have a significant influence on the definition of its network of hydrogen bonds, due to their angular dependence (described in the next section). The result is that FIRST analysis on a structure with poor stereochemistry is likely to indicate the protein as being more flexible than it actually is, due to missing hydrogen bond distance constraints. It is advisable to assess the main-chain stereochemistry through a (φ, ψ) plot, as well as to focus on high-resolution, well-refined structures for FIRST analysis.

3.3.3 Identifying and Modeling Hydrogen Bonds

Hydrogen bonds were identified between donor and acceptor groups according to the following geometric criteria (Stickle et al., 1992; McDonald and Thornton, 1994), shown graphically in Figure 3.4:

1. Donor-Acceptor distance, d ≤ 3.6 Å.
2. Hydrogen-Acceptor distance, r ≤ 2.6 Å.
3. Donor-Hydrogen-Acceptor angle, 90° ≤ θ ≤ 180°.

The energy of each hydrogen bond was measured using a modified Mayo potential (Dahiyat et al., 1997). The function evaluates the favorability of the observed hydrogen-bond length relative to the optimal, equilibrium length for that pair of atoms based on their electron orbital hybridization, as well as the favorability of the angles between the donor and acceptor groups.
The modification avoids non-physical H-bonds with angles near 90° (e.g., between C=O(i) and NH(i+3), rather than the important C=O(i)↔NH(i+4) interactions in the middle of α-helices). Salt bridges were identified between the negatively charged groups of aspartate, glutamate, or the carboxy-terminus of the protein, and the positively charged groups of histidine, lysine, arginine, or the amino-terminus. The energies of hydrogen bonds, E_HB, and salt bridges, E_SB, were calculated using equations 3.1 and 3.2, respectively.

Figure 3.4: Geometric parameters used to identify hydrogen bonds and measure their energy. The hydrogen bond is depicted as a dashed line between the hydrogen and the acceptor oxygen. r is the hydrogen-acceptor distance, d is the donor-acceptor distance, θ is the donor-hydrogen-acceptor angle, and φ is the hydrogen-acceptor-base atom angle, where the carbon is the base atom in this example.

\[ E_{HB} = V_0 \left\{ 5\left(\frac{R_0}{R}\right)^{12} - 6\left(\frac{R_0}{R}\right)^{10} \right\} F(\theta, \phi, \varphi) \tag{3.1} \]

with V_0 = 8 kcal/mol, R_0 = 2.80 Å, and

sp3 donor - sp3 acceptor: \( F = \cos^2\theta \, e^{-(\pi-\theta)^6} \cos^2(\phi - 109.5^\circ) \)
sp3 donor - sp2 acceptor: \( F = \cos^2\theta \, e^{-(\pi-\theta)^6} \cos^2\phi \)
sp2 donor - sp3 acceptor: \( F = \cos^4\theta \, e^{-2(\pi-\theta)^6} \)
sp2 donor - sp2 acceptor: \( F = \cos^2\theta \, e^{-(\pi-\theta)^6} \cos^2(\max[\phi, \varphi]) \)

\[ E_{SB} = V_S \left\{ 5\left(\frac{R_S}{R + x}\right)^{12} - 6\left(\frac{R_S}{R + x}\right)^{10} \right\} \tag{3.2} \]

with V_S = 10 kcal/mol, R_S = 3.2 Å, and x = 0.375 Å.

In each equation, R is the distance between the donor and acceptor atoms. The θ angle is the donor-hydrogen-acceptor angle, and φ is the hydrogen-acceptor-base atom angle, where the base atom is the atom bonded to the acceptor (e.g., carbonyl carbon for a carbonyl oxygen acceptor atom). The angle ϕ is an out-of-plane angle that arises when both the donor and acceptor have sp2 hybridization. For the salt-bridge energy function, the values of V_S, R_S, and x were selected such that the computed energies matched those of experimental results on salt bridges (Xu et al., 1997).
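As a hedged illustration of how the geometric screen and the two energy functions fit together, the sketch below assumes the 12-10 radial form E = V{5(R0/R)^12 - 6(R0/R)^10} multiplied by a hybridization-dependent angular factor for hydrogen bonds, and the same radial form with R replaced by R + x for salt bridges, using the parameter values quoted in the text. The function names and the string labels for hybridization are illustrative choices, not part of FIRST. Angles passed to the energy functions are in radians.

```python
import math

V0, R0 = 8.0, 2.80            # hydrogen bonds: kcal/mol, Angstrom
VS, RS, X = 10.0, 3.2, 0.375  # salt bridges: kcal/mol, Angstrom, Angstrom


def passes_geometry(d, r, theta_deg):
    """Geometric screen: d <= 3.6 A, r <= 2.6 A, 90 <= theta <= 180 degrees."""
    return d <= 3.6 and r <= 2.6 and 90.0 <= theta_deg <= 180.0


def radial_term(ratio):
    """12-10 radial term; equals -1 at the equilibrium distance (ratio = 1)."""
    return 5.0 * ratio**12 - 6.0 * ratio**10


def angular_factor(theta, phi, varphi, donor, acceptor):
    """Angular factor F for the given donor/acceptor hybridization."""
    # The exponential damps contributions as theta moves away from 180 degrees.
    base = math.cos(theta) ** 2 * math.exp(-((math.pi - theta) ** 6))
    if (donor, acceptor) == ("sp3", "sp3"):
        return base * math.cos(phi - math.radians(109.5)) ** 2
    if (donor, acceptor) == ("sp3", "sp2"):
        return base * math.cos(phi) ** 2
    if (donor, acceptor) == ("sp2", "sp3"):
        return base ** 2  # cos^4(theta) * exp(-2 (pi - theta)^6)
    if (donor, acceptor) == ("sp2", "sp2"):
        return base * math.cos(max(phi, varphi)) ** 2
    raise ValueError("hybridization must be 'sp2' or 'sp3'")


def hbond_energy(R, theta, phi, varphi, donor, acceptor):
    """Hydrogen-bond energy in kcal/mol for donor-acceptor distance R."""
    return V0 * radial_term(R0 / R) * angular_factor(theta, phi, varphi, donor, acceptor)


def salt_bridge_energy(R):
    """Salt-bridge energy in kcal/mol for donor-acceptor distance R."""
    return VS * radial_term(RS / (R + X))


# A linear bond at the equilibrium distance reaches the full well depth.
print(hbond_energy(2.80, math.pi, 0.0, 0.0, "sp2", "sp3"))
print(salt_bridge_energy(RS - X))
```

Under these assumptions, a bond with θ near 90° receives an energy very close to zero, which is what allows a simple energy cutoff to discard it after the generous geometric screen.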
Because salt bridges are essentially a special case of hydrogen bonds in which the donor and acceptor are charged, for simplicity, hydrogen bonds and salt bridges will both be referred to as hydrogen bonds.

To determine a reasonable default energy cutoff for hydrogen bonds, the threshold that best conserves the hydrogen bonds within a family of protein structures was evaluated (Jacobs et al., 2001). Multiple structures within four different protein families were studied to find such a threshold. The PDB codes used for each family are as follows: trypsin (1tpo, 2ptn, 3ptn), trypsin inhibitor (4pti, 5pti, 6pti, 9pti), adenylate kinase (1zin, 1zio, 1zip), and HIV protease (1dif, 1hhp, 1htg). Figure 3.5 shows the hydrogen-bond energy distribution for one of these families, namely the three HIV protease structures. A large spike appears in the distribution between -0.1 and 0.0 kcal/mol. This spike is largely due to the fact that quite generous definitions of hydrogen bonds are allowed initially (donor-hydrogen-acceptor angle, θ ≥ 90°, and donor-acceptor distance, d ≤ 3.6 Å, as shown in Figure 3.4). The inset of Figure 3.5 expands the region near 0.0 kcal/mol, demonstrating how a large number of very weak hydrogen bonds, often with θ angles near 90°, can be removed by setting E_HB ≤ -0.1 kcal/mol. Thus, the generous hydrogen bond distance and angle screening criteria can be effectively filtered by applying this energy cutoff. When these geometric criteria and an energy threshold of -0.1 kcal/mol are applied to analyze the hydrogen bonds and salt bridges in five neutron diffraction structures, a Gaussian distribution is observed for the number of hydrogen bonds as a function of donor-acceptor distance, with virtually all hydrogen bonds and salt bridges having distances between 2.6 and 3.6 Å. The distribution in donor-hydrogen-acceptor angles is bimodal, with a strong, Gaussian peak between 130 and 180° and a weaker peak between 90 and 130°.
An energy cutoff of -0.1 kcal/mol is used in all subsequent FIRST analyses.

[Figure 3.5 plot: histogram with inset; x-axis: Energy (kcal/mol); y-axis: Number of Hydrogen Bonds.]

Figure 3.5: Histogram of hydrogen bond energies from three structures of HIV protease. Hydrogen atom positions in each of the three structures (PDB codes: 1dif, 1hhp, 1htg) were computed using the program WhatIf. The inset expands the low-energy region between -0.2 and 0 kcal/mol. An energy cutoff of -0.1 kcal/mol is used to eliminate the large number of very weak hydrogen bonds in the spike near 0 kcal/mol.

3.3.4 Identifying and Modeling Hydrophobic Interactions

The hydrophobic effect observed in protein folding describes the tendency for nonpolar residues to bury themselves within the interior of the protein structure. This process frees many solvent DOF, because the water would otherwise necessarily form a hydrogen-bonded, ice-like structure around an exposed hydrophobic group in an attempt to compensate for the loss of entropy by increasing the enthalpy. Buried within the protein, the hydrophobic groups interact weakly in what can be appropriately described as a slippery or "greasy" manner. It has been shown that these hydrophobic interactions contribute significantly to protein stability and are generally believed to be critical in driving the protein folding process (Dill, 1990).

As with covalent bonds and hydrogen bonds, hydrophobic interactions must be modeled as a connection between two atoms due to the graph-theory nature of the FIRST program. Hydrophobic interactions are identified as contacts between pairs of carbon atoms or between carbon and sulfur atoms. Van der Waals radii of 1.7 Å and 1.8 Å were assigned to carbon and sulfur atoms, respectively (Bondi, 1964). A pair of carbon and/or sulfur atoms was determined to be in hydrophobic contact if the distance between their atom centers was ≤ r_a + r_b + R, where r_a is the van der Waals radius of atom a, and r_b is the van der Waals radius of atom b (Figure 3.6A). R was set to 0.25 Å, as this value was empirically determined to yield the best result when predicting protein folding cores in a test set of ten proteins (described in Chapter 4) when sampling over many values of R.

The net effect of hydrophobic interactions on the flexibility of a protein structure is to restrict motion. That is, they impose distance constraints between hydrophobic groups and therefore remove DOF from the system. However, due to the nonspecific nature of hydrophobic interactions, they will have a less constraining effect on protein motion than hydrogen bonds. Therefore, hydrophobic interactions are modeled such that they introduce fewer constraints on a protein than hydrogen bonds. This is accomplished by connecting a pair of hydrophobic atoms via a series of three pseudoatoms, as shown in Figure 3.6. The sole purpose of the pseudoatoms is to attenuate the number of DOF consumed by a hydrophobic tether. For example, if we were to simply connect two hydrophobic atoms, the single bond would generate one central-force constraint (the actual bond) and four bond-bending constraints (due to four new bond angles). The net effect would be to remove 5 DOF from the system, a result similar to how covalent bonds are modeled. By introducing three pseudoatoms in between the hydrophobic atoms, we first add 9 DOF to the system (each pseudoatom adds 3 DOF). The intervening bonds generate 4 central-force constraints and 7 bond-bending constraints, for a net loss of 2 DOF (9 DOF - 11 constraints = -2) for each hydrophobic tether introduced into the protein. By comparison, each hydrogen bond removes 3 DOF from the system, and therefore, hydrophobic tethers are less constraining than hydrogen bonds.

[Figure 3.6 panels: B. Hydrophobic Contacts; C. Hydrophobic Tether with 3 pseudoatoms.]

Figure 3.6: Identifying and modeling a hydrophobic tether distance constraint.
A hydrophobic interaction is identified between a pair of carbon and/or sulfur atoms if the distance between their centers is ≤ r_a + r_b + R, where r_a is the van der Waals radius of atom a and r_b is the van der Waals radius of atom b. R was empirically defined to be 0.25 Å. Van der Waals radii of 1.7 Å and 1.8 Å were assigned to carbon and sulfur atoms, respectively. Hydrophobic tethers are modeled using three pseudoatoms, which results in a loss of 2 DOF per hydrophobic tether.

3.3.5 Computing the Mean Coordination of a Protein Structure

The mean coordination, ⟨r⟩, of a protein structure is computed as the average number of bonds each atom in the protein makes by using equation 3.3, where n_r is the number of r-coordinated atoms in the protein.

$$ \langle r \rangle = \frac{\sum_r r\, n_r}{\sum_r n_r} \qquad (3.3) $$

The mean coordination gives a partial description of the protein bond network, and is strongly dependent on how many bonds are present in a protein at any given time. For overconstrained systems in which bonds are being diluted, the mean coordination can be used to describe the state of the system when the rigid → flexible transition occurs. Below, a method for simulating the thermal denaturation of proteins is presented in which hydrogen bonds are repeatedly removed from the protein structure, beginning with an overconstrained native state through to a flexible denatured state. The mean coordination is shown to be a useful number with which to compare the rigid → flexible transition in different proteins that occurs during the simulated thermal denaturation. Additional detail can be found in the supplementary material of Rader et al., 2001.

3.3.6 Computing the Fraction of Floppy Modes

A key quantity computed by FIRST when analyzing the flexibility in a protein structure, or any 3D bond-bending network, is the number of floppy modes, F, also known as the number of independent bond-rotational DOF.
This number can be used to compute the fraction of floppy modes, f, by using equation 3.4, where the term in the denominator, 3N, represents the total number of DOF in the protein (N is the number of atoms in the protein).

$$ f = \frac{F}{3N} \qquad (3.4) $$

The fraction of floppy modes will necessarily increase as bonds are removed from the bond-bending network representation of a protein or a glass. An example of f plotted versus ⟨r⟩ for random dilution of a glass network is shown in Figure 3.7A. As bonds are randomly removed from the glass network the rigid → flexible phase transition occurs when the slope in the line changes sign. This point can be identified as the inflection point in a first derivative plot of f′ vs. ⟨r⟩, and as a peak in the second derivative plot, f″ vs. ⟨r⟩. As in glass networks, the rigid → flexible phase transition observed during simulated protein unfolding (described in the next section) can be tracked using f vs. ⟨r⟩.

3.3.7 Simulating Denaturation

As a protein is gradually thermally denatured, the covalent bonds remain intact, whereas hydrogen bonds will begin to break. The flexibility in the protein will increase as the number of hydrogen bonds in the protein decreases. Our hypothesis is that information about the protein unfolding/folding pathway is encoded in the network of hydrogen bonds present in the native state of a protein. This hypothesis was tested by removing hydrogen bonds from a protein structure to simulate thermal denaturation, then using FIRST to observe

[Figure 3.7 plot: panel A, Glass Networks; panel B, Proteins; x-axis, mean coordination ⟨r⟩; y-axis, fraction of floppy modes f.]

Figure 3.7: The fraction of floppy modes, f = F/3N, as a function of the mean coordination in two glass models and a set of 26 proteins. The mean-field Maxwell approximation to computing the number of floppy modes is shown as a black dashed line in each panel. A.
The results for random bond dilution of a glass network. The purple line shows results in which the rigid → flexible transition is second-order. The orange line represents a first-order transition that arises in glass networks that lack small rings. B. Results for a representative set of 26 structurally and functionally diverse proteins. The blue lines are monomers; red lines, dimers; green lines, tetramers. The gray shaded region indicates the range in which protein folding/unfolding occurs.

where the resultant change in flexibility occurs. The results will depend upon the order in which hydrogen bonds are removed. Because hydrophobic interactions actually become somewhat stronger with moderate temperature increases (Tanford, 1980), these interactions are maintained throughout the simulation.

During thermal denaturation, the hydrogen bonds are expected to break in an energy-dependent manner. This process is simulated by using the following procedure. Initially, the flexibility of the native protein structure is analyzed with all its covalent and noncovalent interactions included (hydrogen bonds and hydrophobic interactions). The weakest hydrogen bond in the structure is then broken by removing any distance constraints created by that bond. The effect of removing this bond is then observed by applying FIRST to identify the flexible regions in the protein. We continue this process of breaking the weakest hydrogen bond remaining in the structure and updating the identification of flexible regions until all the hydrogen bonds have been removed.

3.3.8 Visualizing Results: The 3D Rigid Cluster Decomposition

The results of FIRST indicate for each bond in the protein structure whether it is flexible (free to rotate) or rigid (not rotatable). Groups of atoms coupled to each other via rigid bonds form a rigid cluster. One or more independent rigid clusters with intervening flexible regions may be found in a protein structure.
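The grouping of atoms into rigid clusters described above can be illustrated with a minimal Python sketch. It assumes the rigid/flexible label of every bond is already known (in FIRST this label comes from the rigidity analysis itself, which is not reproduced here) and simply collects atoms joined by rigid bonds with a union-find structure. The function name `rigid_clusters` and the `(i, j, is_rigid)` input format are illustrative assumptions, not part of FIRST.

```python
from collections import defaultdict


def rigid_clusters(n_atoms, bonds):
    """Group atoms joined by rigid (non-rotatable) bonds.

    n_atoms -- number of atoms, indexed 0 .. n_atoms - 1
    bonds   -- iterable of (i, j, is_rigid) tuples (hypothetical format)
    Returns a list of clusters (sorted atom lists), largest first.
    Atoms that take part in no rigid bond are left out.
    """
    parent = list(range(n_atoms))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, is_rigid in bonds:
        if is_rigid:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i  # merge the two clusters

    groups = defaultdict(list)
    for atom in range(n_atoms):
        groups[find(atom)].append(atom)

    clusters = [sorted(g) for g in groups.values() if len(g) > 1]
    clusters.sort(key=lambda g: (-len(g), g[0]))
    return clusters


# Toy chain of six atoms: bond 2-3 is rotatable, atom 5 is unbonded.
toy_bonds = [(0, 1, True), (1, 2, True), (2, 3, False), (3, 4, True)]
print(rigid_clusters(6, toy_bonds))
```

Sorting the clusters by size mirrors the convention of coloring clusters from largest to smallest; this ordering is a choice made for the sketch only.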
The distribution of rigid clusters and flexible bonds identified by FIRST is called a rigid cluster decomposition (RCD) and can be viewed graphically by color-mapping the results onto the 3D structure of the protein. Figure 3.8A displays the results for CI2 when the 18 weakest hydrogen bonds have been diluted from the structure. Flexible bonds are shown as thin black tubes, while rigid clusters are depicted by thick, colored tubes, with each independent rigid cluster distinguished by a different color. Hydrogen bonds and hydrophobic interactions are shown as dark gray lines. It is generally easier to interpret the results by removing the side chains from the graphical depiction of the results. The results shown at the top of Figure 3.8B are identical to those in Figure 3.8A, except that the side chains have not been displayed. It is much easier to identify common secondary structural elements, such as the α-helix (colored blue), a β-strand (colored red), a β-turn (colored orange) and a loop region (colored yellow), when viewing only the main chain bonds.

3.3.9 Visualizing Results: The 1D Rigid Cluster Decomposition

The hydrogen bond dilution method to simulate denaturation produces an RCD each time a hydrogen bond is removed from a protein structure. Interpreting the 3D results requires flipping through a large number of protein structures, and keeping track of where flexibility occurs in the structure as a function of the hydrogen-bond dilution. To overcome this visualization problem, we employ the reduced 1-dimensional (1D) representation of the data depicted graphically in Figure 3.8B. In the 1D representation, the only results shown are for the backbone N-Cα and Cα-C bonds. As in the 3D figures, each backbone bond is represented as a thin black line if it is flexible (rotatable), or as a colored block if it is rigid.
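As a rough illustration of the 1D representation, the sketch below renders one line of a 1D RCD as text: a dash stands in for the thin black line of a flexible bond, and a letter stands in for the color of each rigid cluster. The input format (one cluster id or `None` per backbone bond, in N- to C-terminal order) and the function name `render_1d_rcd` are assumptions made for this example only.

```python
def render_1d_rcd(bond_clusters, symbols="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Render backbone-bond flexibility as a single text line.

    bond_clusters -- one entry per backbone bond (N-Calpha, Calpha-C),
                     ordered from N- to C-terminus; None marks a flexible
                     bond, any other value is a rigid cluster id
                     (hypothetical input format).
    """
    letter_for = {}  # cluster id -> letter, assigned in order of appearance
    line = []
    for cluster_id in bond_clusters:
        if cluster_id is None:
            line.append("-")  # flexible (rotatable) bond
        else:
            if cluster_id not in letter_for:
                letter_for[cluster_id] = symbols[len(letter_for) % len(symbols)]
            line.append(letter_for[cluster_id])  # rigid bond
    return "".join(line)


# Two rigid clusters (ids 7 and 3) separated by flexible bonds.
print(render_1d_rcd([7, 7, None, None, 3, 3, 3, None, 7]))
```

Because a cluster keeps its letter wherever it reappears along the chain, sequence-distant segments that belong to the same rigid cluster remain recognizable, which is the property the color coding provides in the figures.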
The 1D mapping of the flexibility data is a convenient means of reducing the amount of information generated in a hydrogen bond dilution experiment to a tractable level. A complete denaturation simulation can now be viewed as a series of horizontal lines, ordered

Figure 3.8: Rigid cluster decomposition results for CI2 when 67% of the weakest hydrogen bonds have been removed. A. This panel shows all non-hydrogen atoms present in the structure (excluding water molecules). There are four independent rigid clusters, as computed by FIRST. The rigid clusters are depicted by thick colored tubes (blue, red, orange and yellow, from largest to smallest). Each thin black tube represents a rotatable or flexible bond. The thin, dark gray lines show the location of hydrogen bonds and hydrophobic tethers. B. The same results of FIRST analysis for CI2, showing only the main-chain atoms. Because the main chain for a protein monomer is an unbranched linear polymer, the flexibility results for the main chain can be mapped onto a 1D line. From the N-terminus to the C-terminus, each backbone bond is represented as a thin black line if it is flexible or a thick colored block if it is rigid. Independently rigid clusters are assigned different colors.

Figure 3.9: Results of the complete hydrogen bond dilution for c-SRC SH3 domain. A. The top line in this figure shows the results for the native state of the SH3 domain. There are 44 hydrogen bonds present. Each successive line shows the 1D rigid cluster decomposition as hydrogen bonds are removed from the structure. The lines shaded gray indicate results with identical 1D RCDs. These lines can be identical because the 1D rigid cluster decomposition only shows changes in the flexibility of the backbone bonds of a protein. B. By removing the redundant lines from panel A, we are left with results that show only when a change in the flexibility of the main chain occurred.
from a native state (at the top) to a denatured state (at the bottom). Each line shows the regions of structural stability and flexibility for the backbone atoms after a specific hydrogen bond has been removed during the denaturation process. Figure 3.9A provides an example of a complete thermal denaturation simulation for the SH3 domain of human tyrosine kinase c-SRC (PDB code: 1fmk). The three columns on the left-hand side of Figure 3.9A describe: 1) the number of remaining hydrogen bonds in the protein at each step; 2) the energy of the just-broken bond (in kcal/mol), according to the modified Mayo potential (Dahiyat et al., 1997); and 3) the mean coordination, ⟨r⟩, of the atoms in the network at that step. Regular secondary structure content is shown at the top, as determined by DSSP (Kabsch and Sander, 1983). The right-hand columns, together with the solid triangles beneath each line, show the residue locations of the donor (blue) and acceptor (red) atoms of the hydrogen bond broken to generate this step. For example, “M 2” indicates the main chain of residue 2, “S 93” indicates the side chain of residue 93, and “W 120” indicates water molecule 120 in the PDB structure. “H” indicates other heteroatoms, belonging to non-protein functional groups such as bound heme. The residue numbers are shown at the top, with tick marks denoting the position of the numbered residue.
Frequently, several successive lines are identical because the flexibility of the backbone bonds has not been affected by the changes in the noncovalent bond network. In Figure 3.9A these redundant lines are highlighted in gray. Because each line within a gray highlighted region is identical, the information is redundant and can be omitted. Figure 3.9B shows only those steps in the hydrogen bond dilution of the SH3 domain that result in a change in backbone flexibility. Images in this thesis are presented in color.

3.4 Results

3.4.1 Native State Flexibility Analysis: Open and Closed Structures of HIV Protease

Given a protein's native-state structure, all of the covalent bonds, hydrophobic tethers, hydrogen bonds and salt bridges are used to define the distance constraint network for the protein. Given these constraints, FIRST identifies all the rigid and flexible regions within a protein, and these results have been shown to correlate well with experimental measures of flexibility for a range of proteins (Jacobs et al., 1999, 2001; Thorpe et al., 2000, 2001).

The FIRST results for HIV protease, in both unbound (Figure 3.10A) and inhibitor-bound forms (Figure 3.10B), have been compared with experimental measures of protein flexibility. The major peaks in main-chain thermal mobility (B-value), measured crystallographically, correlate directly with the α, β, and γ flexible regions predicted by FIRST (Figure 3.10) (Jacobs et al., 2001). It should be noted that for proteins with mobile domains or other moving rigid bodies, such as α-helices, the FIRST results will not necessarily compare well with crystallographic B-values. Crystallographically, such bodies appear as mobile regions, whereas in FIRST they appear as rigid regions flanked by flexible loops, which allow the rigid-body motion.

HIV protease has also been crystallized with various inhibitors bound, resulting in a closed conformation with the flaps lowered.
The main-chain dihedral angle changes (similar to the analysis of Korn and Rose (1994)) observed between crystal structures of the open (PDB code: 1hhp) and closed (PDB code: 1htg) forms have been computed. The FIRST-predicted flexible regions directly correspond with the regions of greatest dihedral angle change (Jacobs et al., 2001). In the three flexible regions (α, β, and γ), the flexibility is associated with a flip in at least one dihedral angle (defined as a change of more than 60 degrees) within a rigid β-turn in the center of each flexible region. The results are consistent with the motion observed by interpolation between different HIV protease crystal structures (Gerstein and Krebs, 1998) and with an earlier dihedral analysis for a different pair of HIV protease structures (Korn and Rose, 1994), which indicated that large dihedral angle changes at residues 40, 50, and 51 in the α and β regions result in a large, concerted movement of the flaps. Flexibility of the γ region has not been emphasized in other studies of HIV protease; however, it is known that drug-resistant mutants of the protease include two residues that pack against the γ region, 63 and 71, with residue 63 proposed to induce a conformational perturbation (Chen et al., 1995; Patrick et al., 1995). Thus, conformational coupling between the γ region and the flaps, through the γ-α loop interactions, may explain why mutations in the γ region, which are distal from the active site, cause resistance to drug binding.

Figure 3.10: Rigid cluster decomposition for HIV protease. A. 3D RCD of HIV protease in an unbound, open conformation (PDB code: 1hhp). The "flaps" at the top of the structure are determined to be flexible in the open conformation (indicated by the red and yellow bonds), providing ligand access to the active site. B. 3D RCD of HIV protease in a ligand-bound, closed conformation (PDB code: 1htg). Upon ligand binding, the flaps become part of the large rigid cluster, colored blue.
Ligand binding restricts the motion of the flaps through new hydrogen bonds linking the two flaps to each other and to the ligand. Some of these hydrogen bonds between the flaps and ligand are mediated by a conserved water molecule found in retroviral but not mammalian homologs of HIV protease (Wlodawer and Erickson, 1993), providing a useful basis for designing more HIV-specific drugs. To compare the influence of ligands on HIV protease flexibility, there were a number of ligand-bound structures of good stereochemistry from which to choose. For brevity, only the results from PDB entry 1htg are shown, with inhibitor GR137615 bound to represent the closed form of HIV protease. (Two other ligand-bound structures, 1hiv and 1dif, have also been analyzed by FIRST, and the ligands' influence on protein flexibility was found to be substantially similar.) Unlike the open form, the closed structures were resolved crystallographically as dimers, and thus independent structural information is available for the two subunits of the dimer. This means it is possible to assess the influence of different side-chain conformations in the two halves (due to thermal fluctuations and environmental differences) in terms of their effects on the hydrogen-bonding network and flexibility. The left and right sides of HIV protease in Figure 3.10B indicate that the only substantial difference in their flexibility is caused by the asymmetry of the bound ligand.

Comparison of the ligand-bound structure with the open HIV protease also demonstrates how a ligand can rigidify part of the protein through new hydrogen bonds even though the ligand itself is not rigid, while making other parts of the protein more flexible.
Particularly note the dimer interface, where inter-subunit rotation occurs upon ligand binding, breaking some of the interfacial stabilizing hydrogen bonds, and the loop to the right of the binding cavity, shown as a flexible region of the main-chain ribbon in Figure 3.10B. This loop flexibility is not reflected in the other HIV protease subunit, due to ligand asymmetry. Flexibility of the dimer interface in a ligand-bound structure is also a prominent feature found by NMR (Ishima et al., 1999) and MD analyses (Scott and Schiffer, 2000); MD also predicts flap flexibility in the ligand-free conformation.

Native-state flexibility analyses of dihydrofolate reductase and adenylate kinase have also been performed. The FIRST results for these proteins have been shown to be consistent with experimentally observed conformational flexibility in the native states of these two proteins (Jacobs et al., 2001).

3.4.2 The Folding Transition State

The results of simulating denaturation can be tracked quantitatively along the unfolding pathway in terms of the change in the fraction of floppy modes, f (bond-rotational DOF), as the mean coordination decreases. Plots of f as a function of ⟨r⟩ for 26 structurally diverse proteins (listed in Table 3.1) and for two limiting models of network glasses are shown in Figure 3.7. The overall similarity in the flexibility transition behavior of f for the diverse proteins and glasses is striking. To examine these results in more detail, in particular the phase transition region shown in gray in Figure 3.7, A. J. Rader, a graduate student of Dr. M. F. Thorpe in the Department of Physics and Astronomy at Michigan State University, has obtained the first and second derivatives of f versus ⟨r⟩ (Figures 3.11 and 3.12, respectively). The derivatives were calculated numerically by fitting a cubic equation over an interval corresponding to Δ⟨r⟩ = 0.75, which contained typically from 90 to 2,000 data points.
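A local cubic fit of this kind can be sketched as follows. The example fits a cubic by least squares to all data points inside a window centered on the point of interest and reads the first and second derivatives off the fitted coefficients. It is a pure-Python illustration under stated assumptions: the function names, the window handling, and the use of normal equations are choices made here, not a description of the code actually used to produce Figures 3.11 and 3.12.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def local_cubic_derivatives(xs, ys, x0, width):
    """Return (f', f'') at x0 from a least-squares cubic fit.

    Only points with |x - x0| <= width / 2 are used. The cubic is written
    in t = x - x0, so f'(x0) = c1 and f''(x0) = 2 * c2.
    """
    pts = [(x - x0, y) for x, y in zip(xs, ys) if abs(x - x0) <= width / 2]
    if len(pts) < 4:
        raise ValueError("need at least four points in the window")
    # Normal equations for y ~ c0 + c1 t + c2 t^2 + c3 t^3.
    a = [[sum(t ** (i + j) for t, _ in pts) for j in range(4)] for i in range(4)]
    b = [sum(y * t ** i for t, y in pts) for i in range(4)]
    c = solve_linear(a, b)
    return c[1], 2.0 * c[2]


# Check on an exact cubic, y = 2 - 3x + 0.5x^2 + x^3, evaluated at x0 = 1.
xs = [i / 10 for i in range(21)]
ys = [2 - 3 * x + 0.5 * x ** 2 + x ** 3 for x in xs]
d1, d2 = local_cubic_derivatives(xs, ys, x0=1.0, width=1.0)
print(round(d1, 6), round(d2, 6))
```

Sliding the window along the ⟨r⟩ axis and repeating the fit at each position gives smoothed f′ and f″ curves; a wider window suppresses more noise at the cost of broadening sharp features.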
In Figure 3.11, we see the sharp rise of the first derivative through the transition region, again marked in gray. One of the glass models (orange line) shows a first-order transition as indicated by the discontinuity at ⟨r⟩ = 2.389. The insert in Figure 3.11 is adapted from several folding experiments (Creighton, 1993), showing that as the temperature increases, the fraction of folded protein decreases. The fraction of floppy modes plays the role of a free energy as the transition is traversed (Duxbury et al., 1999), and as such the second derivative couples

Table 3.1: Set of 26 structurally diverse proteins analyzed using FIRST. The PDB code, protein name, and CATH (Orengo et al., 1997) structural class are listed in the first three columns. N_res is the number of residues in the protein; N_H2O is the number of buried water molecules in the protein. ⟨r⟩_T is the mean coordination of the protein in the transition state, identified as the inflection point in the plot of f′ vs ⟨r⟩. ⟨r⟩_FC is the mean coordination of the protein when the folding core has been identified (described in Chapter 4).
Code   Protein Name                          Class   N_res   N_H2O   ⟨r⟩_T   ⟨r⟩_FC
monomers
1a2p   barnase                               αβ       108      5     2.41    2.39
1a3k   galectin                              β        137      5     2.40    -
1a6m   myoglobin                             α        151      7     2.40    2.37
1ake   adenylate kinase                      αβ       214     14     2.40    -
1bpi   bovine pancreatic trypsin inhibitor   few       58      4     2.39    2.38
1bu4   ribonuclease T1                       αβ       104      0     2.40    2.39
1hml   α-lactalbumin                         α        123      4     2.40    2.38
1hrc   cytochrome c                          α        105      4     2.42    2.38
1nkr   killer cell inhibitor receptor        β        201      5     2.39    -
1ruv   ribonuclease A                        αβ       124      3     2.41    2.40
1rx1   DHFR                                  αβ       159      0     2.41    -
1ten   tenascin                              β         90      0     2.40    -
1ubi   ubiquitin                             αβ        76      1     2.39    2.40
2chf   CheY                                  αβ       128      7     2.39    -
2ci2   chymotrypsin inhibitor                αβ        83      0     2.40    2.41
2liv   LIV-binding protein                   αβ       344      7     2.40    -
3lzm   T4 lysozyme                           α        164      7     2.41    2.38
4i1b   interleukin 1-β                       β        153      9     2.40    2.39
dimers
1bif   PFKinase/FBPase                       αβ       864    242     2.40    -
1cku   electron transfer protein             few      170      4     2.40    -
1hhp   HIV-1 protease                        β        198      0     2.39    -
1vls   aspartate receptor                    β        292     32     2.39    -
tetramers
1ice   interleukin 1-β converting enzyme     αβ       514     19     2.41    -
1ids   Fe-SOD                                αβ       792     43     2.40    -
1szj   GAPDH                                 αβ      1332    105     2.40    -
2cts   citrate synthase                      α        874     60     2.40    -

to the fluctuations and reaches a maximum at the transition point as shown in Figure 3.12.

The second derivative, shown in Figure 3.12, is noisier, due to the numerical differentiations, but nevertheless shows similar behavior for the 26 proteins, with the peak that defines the transition state occurring at ⟨r⟩ = 2.405 ± 0.015. There is no obvious pattern in size, architecture, oligomeric state, or ligand content for the few proteins with irregular curves. Cytochrome c (PDB code: 1hrc) is the one protein with a bimodal curve that decreases near the transition region, and this behavior occurs both when the heme group is included or excluded from the calculation. Proteins with somewhat broad peaks and a shoulder at lower ⟨r⟩ values are α-lactalbumin (PDB code: 1hml), barnase (PDB code: 1a2p), and glyceraldehyde-3-phosphate dehydrogenase (PDB code: 1szj).
The behavior of all proteins becomes predictably noisier at low mean coordination values, as more and more hydrogen bonds are removed from the native structure. The insert in Figure 3.12 compares these results with the specific heat curve for a typical protein (Privalov, 1996; Angell, 1999). The shape of the second derivative in Figure 3.12 is suggestive of a relationship with the specific heat, as sketched in the insert. The two quantities are similar in that both are related to fluctuations, with specific heat reflecting fluctuations in the energy, and f″ representing fluctuations in conformational flexibility. It is unclear whether the width of the measured specific heat, as typically measured experimentally, is associated with a single protein, or whether it is broadened due to monitoring an ensemble of unfolding proteins. The specific heat of a single protein as it unfolds thus could be considerably narrower than the measured specific heat, which will be known once experiments can be done on single proteins.

[Figure 3.11 plot: x-axis, mean coordination ⟨r⟩; inset x-axis, temperature.]

Figure 3.11: Change in the fraction of floppy modes, f′, as a function of mean coordination, ⟨r⟩, for the set of 26 proteins listed in Table 3.1. The gray shaded region shows the location where the folding transition takes place. The curves for two kinds of glass networks from Figure 3.7 (thick orange and purple lines) are shown superimposed on the protein curves. The notation at the top indicates the Denatured state, Transition state, and the Native states of the proteins.
For comparison with results for a typical thermal denaturation experiment, the inset sketches the decrease in fraction of folded protein as temperature increases (adapted from Figure 7.11 in Creighton, 1993).

[Figure 3.12 plot: x-axis, mean coordination ⟨r⟩; inset x-axis, temperature.]

Figure 3.12: The second derivative of the fraction of floppy modes, f″, as a function of mean coordination, ⟨r⟩, for the set of 26 proteins listed in Table 3.1. The inset shows a sketch of the specific heat as a function of temperature for a protein, with the location of the Denatured state, Folding core, Transition state, and the Native state of the protein indicated. The x-axis of the inset has the temperature increasing to the left.

3.5 Conclusions

In this chapter, a novel distance constraint approach for characterizing the intrinsic flexibility of a protein structure has been presented. Hydrophobic interactions and the strong bond forces (covalent bonds, salt bridges and hydrogen bonds) impose constraints on the allowed motion in a protein structure. FIRST uses these constraints to decompose a structure into rigid clusters, consisting of nonrotatable dihedral angles, and flexible regions. There are several advantages of FIRST relative to previous methods for analyzing protein flexibility. FIRST calculations can be done virtually in real time (a few seconds of CPU time) once the network of distance constraints has been defined. Analysis of a native-state protein structure indicates regions likely to undergo conformational change as part of the protein's function. For a given set of distance constraints, the rigid regions and the flexible joints between them are determined exactly. The ability to very quickly determine coupled motions among the dihedral angles of a flexible region gives FIRST an advantage over other methods.
Collective motions, in which changing one flexible dihedral angle will influence the other flexible dihedral angles within the region, are identified within the protein. Analysis of the relative flexibility within HIV protease (presented in this chapter), dihydrofolate reductase, and adenylate kinase (Jacobs et al., 2001), even when performed on a single structure, captures much of the functionally important conformational flexibility observed experimentally between different ligand-bound states.

In addition to native-state flexibility analysis, a simple model of protein unfolding by thermal denaturation was presented. In this model, it is assumed that the rigid clusters defined by FIRST represent regions of the protein that are folded. Because flexible regions contain rotatable bonds they are able to sample conformational space, and therefore represent unfolded regions of the protein. Thermal denaturation was simulated by breaking hydrogen bonds in order of their energy, weakest first. The resulting protein folding transition can be viewed as a flexible to rigid phase transition, similar to that observed for network glasses. The mean coordination, ⟨r⟩, of atoms in the protein, including noncovalent interactions, can be regarded as the reaction coordinate controlling protein folding, and provides a unifying treatment of the many dynamic and structural processes involved. Proteins are self-organized networks, due to the special nature of the cross-linking of the polypeptide chain via hydrophobic contacts and hydrogen bonds. This transition is shared among diverse proteins ranging from all-α to all-β folds, and from monomers to tetramers, and occurs once the protein denatures to a mean coordination of ⟨r⟩ ≈ 2.41, which is very similar to the value found in network glasses (Thorpe et al., 1999).
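For reference, a transition value of ⟨r⟩ close to 2.4 is what Maxwell constraint counting gives for a bond-bending network. The following is a sketch of that standard mean-field count, assuming every atom has coordination r ≥ 2 and that all constraints are independent. It is presumably the approximation drawn as the dashed line in Figure 3.7, although the equation itself is not written out in this chapter.

```latex
\begin{align}
F &= 3N - \sum_r n_r \left[ \tfrac{r}{2} + (2r - 3) \right], \\
f &= \frac{F}{3N} = 2 - \frac{5}{6}\,\langle r \rangle, \qquad
f = 0 \;\Longleftrightarrow\; \langle r \rangle = 2.4 .
\end{align}
```

Here each r-coordinated atom contributes r/2 central-force constraints (every bond is shared by two atoms) and 2r - 3 bond-bending constraints, with N the total number of atoms and n_r as defined for equation 3.3. Under the same counting, one extra bond between two atoms adds one central-force and four bond-bending constraints.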
Chapter 4

Identifying Protein Folding Cores from the Evolution of Flexible Regions During Unfolding

Research presented in this chapter is being published as the following reference: B. M. Hespenheide, A. J. Rader, M. F. Thorpe, and L. A. Kuhn. Identifying protein folding cores from the evolution of flexible regions during unfolding. J. Mol. Graph. Model., In press.

4.1 Abstract

The unfolding of a protein can be described as a transition from a predominantly rigid, folded structure to a denatured state, or an ensemble of denatured states. During unfolding, the hydrogen bonds and salt bridges break, destabilizing the secondary and tertiary structure. Previous work (described in Chapter 3) shows that the network of covalent bonds, salt bridges, hydrogen bonds, and hydrophobic interactions forms constraints that define which regions of the native protein are flexible or rigid (structurally stable). Here, thermal denaturation of protein structures is simulated by diluting the network of salt bridges and hydrogen bonds, breaking them one by one, from weakest to strongest. The structurally stable and flexible regions are identified at each step, providing information about the evolution of flexible regions during denaturation. This approach is used to test the hypothesis that the folding core is the region of strongest tertiary interactions, and greatest structural stability. For ten diverse proteins, the folding core is identified as the region formed by two or more regular secondary structures that is most stable against thermal denaturation. For the ten proteins with different architectures the predicted folding cores from this flexibility/stability analysis are in good agreement with those identified by native-state hydrogen-deuterium exchange experiments.
4.2 Introduction

Understanding protein folding pathways has been the subject of many recent theoretical and experimental studies (Onuchic et al., 1997; Gruebele, 1999; Shea and Brooks III, 2001; Mirny and Shakhnovich, 2001; Jackson, 1998; Englander, 2000; Eaton et al., 2000; Vendruscolo et al., 2001). These studies often focus on processes that occur early in folding, and models such as nucleation-condensation (Fersht et al., 1992; Clarke and Itzhaki, 1998; Fersht, 2000) and diffusion-collision (Karplus and Weaver, 1994) have been used to describe the initial step(s). Whether folding is initiated by nucleation of tertiary interactions or diffusion-controlled coalescence of already folded secondary structures is being debated, and a single model may or may not hold for all proteins. However, a unifying theme is that the initial steps in the folding process involve the interaction of non-local regions in the protein sequence forming a substructure that is substantially preserved in the fully folded protein. Several experimental techniques have been designed to identify early folding substructures (Galzitskaya and Finkelstein, 1999; Torshin and Harrison, 2001; Hilser et al., 1998). These techniques are unique in that the analysis is performed solely on the native-state conformation, instead of following the folding reaction from a denatured state to the native state. The advantage of using the native state is that this conformation is largely ordered, whereas the denatured state is typically an ensemble of dissimilar, unfolded conformations.

An experimental technique that gives detailed structural information about unfolding is hydrogen-deuterium exchange NMR (H-D exchange). Under native conditions, rotation about main-chain φ/ψ dihedral angles leads to fluctuations in which a protein can locally explore conformational space.
H-D exchange occurs when the amide and carbonyl groups involved in a hydrogen bond separate enough for deuterated water to intervene, allowing the shared proton to be replaced by a deuteron (Englander et al., 1997). Because deuterium does not produce a signal in proton NMR experiments, it is possible to identify which amides undergo hydrogen exchange by comparing the NMR spectra before and after the exchange. By allowing the experiment to run for different time steps, individual exchange rate constants can be assigned to each of the main-chain amide protons identified in the NMR spectra. Woodward has proposed that amide protons that exchange only after long periods of exposure to deuterated water define the slow-exchange core of a protein (Woodward, 1993). Li and Woodward compiled the results from a number of studies on native-state H-D exchange for different proteins, tabulating the residues forming the slow-exchange core in each protein (Li and Woodward, 1999). They have proposed that the secondary structures to which these residues belong define the folding core for the protein. Additionally, they have shown for barnase and chymotrypsin inhibitor 2 (CI2) that the folding core identified by H-D exchange consists of residues with high Φ-values (Oliveberg and Fersht, 1996), indicating that slow-exchange core residues contribute to the stabilization of the folding transition state.

For H-D exchange to occur in main-chain amides involved in hydrogen bonds, flexibility in the protein structure is required to allow access to deuterated water. Given that residues in the folding core have small exchange rates, it is reasonable to assume that the folding core protons either are not accessible to solvent or are in regions that are sufficiently rigid that the hydrogen bond donor and acceptor cannot move apart enough to allow H-D exchange. This could be probed by observing how the flexibility of a protein structure changes as it is gradually denatured.
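One way to picture such a gradual denaturation is the weakest-first hydrogen bond dilution introduced in Chapter 3. The sketch below shows only the bookkeeping of that loop. The flexibility analysis itself is passed in as a caller-supplied function, `analyze`, which is a hypothetical stand-in for a FIRST run, and the `(bond_id, energy)` input format is likewise an assumption of this example.

```python
def dilute_weakest_first(hbonds, analyze):
    """Break hydrogen bonds one at a time, weakest first.

    hbonds  -- list of (bond_id, energy_kcal_per_mol); more negative
               energies are stronger bonds (hypothetical input format)
    analyze -- callable taking the list of remaining bond ids and returning
               a comparable summary of backbone flexibility (stand-in for
               the real rigidity analysis)
    Returns a list of (bonds_remaining, energy_of_bond_just_broken, summary),
    starting with the native state, for which no bond has been broken.
    """
    # Strongest first, so the weakest bond is always at the end of the list.
    remaining = sorted(hbonds, key=lambda hb: hb[1])
    steps = [(len(remaining), None, analyze([b for b, _ in remaining]))]
    while remaining:
        _, energy = remaining.pop()  # remove the weakest remaining bond
        steps.append((len(remaining), energy, analyze([b for b, _ in remaining])))
    return steps


def drop_redundant(steps):
    """Keep only steps whose flexibility summary differs from the previous one."""
    kept = []
    for step in steps:
        if not kept or kept[-1][2] != step[2]:
            kept.append(step)
    return kept


# Toy analysis: call the structure rigid while at least two bonds remain.
toy_analyze = lambda ids: "rigid" if len(ids) >= 2 else "flexible"
toy_steps = dilute_weakest_first([("a", -3.0), ("b", -0.5), ("c", -1.2)], toy_analyze)
print(drop_redundant(toy_steps))
```

`drop_redundant` corresponds to omitting the gray-shaded lines in a dilution plot, so that only steps at which the backbone flexibility changes are kept.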
The hypothesis is that the folding core is stabilized by a network of particularly dense and/or strong noncovalent interactions, which tend to resist unfolding or denaturation. Following this hypothesis, a novel computational method for predicting the folding core of a protein is presented. This approach employs the FIRST software, which accurately predicts flexible regions in proteins (Jacobs et al., 1999; Thorpe et al., 2000; Jacobs et al., 2001) by analyzing the constraints on flexibility formed by the covalent and noncovalent bond network. Covalent bonds, salt bridges, hydrogen bonds, and hydrophobic interactions are included in the protein model. Because thermal denaturation or unfolding involves the breaking of hydrogen bonds and salt bridges, we compare several methods for simulating thermal denaturation, and observe how the removal of these bonds affects the stability and flexibility of the protein. As hydrogen bonds are removed, the protein structure becomes increasingly flexible as the stable regions decrease in size. The folding core can then be predicted as the most stable region involving at least two secondary structures. The thermal denaturation model in which hydrogen bonds and salt bridges are removed from weakest to strongest predicts folding cores that correlate best with the experimentally observed folding cores. The ability to predict an early state in folding indicates that information about the folding pathway is encoded in the structure of the native state.

4.3 Methods

4.3.1 Selection of Proteins for Analysis

Crystallographic structures for ten monomeric proteins (Table 4.1) were selected from the PDB (Berman et al., 2000) for analysis. These proteins were chosen based on their diversity of structure and the availability of native-state H-D exchange data for comparison (Li and Woodward, 1999).
A 3D structure was not available for apo-myoglobin (which lacks heme), though qualitative data show that its fold is very similar to that of holo-myoglobin (with heme), except for dynamic fluctuations of the F helix (Fontana et al., 1997). As an approximation to the apo-myoglobin structure, we analyzed the holo structure upon removal of its heme group. For this structure, FIRST analysis also found the F helix to be one of the two most flexible helices in the protein (data not shown). The experimental results of H-D exchange used for comparison in this study are for apo-myoglobin. The proteins were preprocessed as described in Chapter 3 under Methods: Preprocessing Protein Structures for Analysis.

Table 4.1: Dataset of 10 proteins used to identify folding cores. The PDB code and number of residues are listed for each protein. The fourth column gives the CATH (Orengo et al., 1997) structure classification for each protein. The mean coordination of each protein at the point in the hydrogen bond dilution at which the folding core is found, ⟨r⟩core, is listed in column 5. Number of H2O lists the number of buried water molecules identified by PRO_ACT (Williams et al., 1994).

Protein Name      PDB Code  Size (Res.)  Struct. Class  ⟨r⟩core  Number of H2O  Number of S-S Bonds
BPTI              1bpi      58           few            2.38     4              3
Ubiquitin         1ubi      76           α-β            2.40     1              0
CI2               2ci2      83           α-β            2.41     0              0
Ribonuclease T1   1bu4      104          α-β            2.39     0              2
Cytochrome c      1hrc      104          α              2.39     4              0
Barnase           1a2p      110          α-β            2.39     5              0
α-Lactalbumin     1hml      123          α              2.38     4              0
Apo-myoglobin     1a6m      151          α              2.37     11             0
Interleukin-1β    1i1b      153          β              2.39     9              0
T4 Lysozyme       3lzm      164          α              2.38     7              0

4.3.2 FIRST Flexibility Analysis

The structural flexibility of a protein is a property that depends upon how the motion of each atom is restricted by bond forces. In the absence of noncovalent forces, the single covalent bonds in a protein could rotate about any dihedral angle that did not result in steric overlap.
The protein would be free to adopt a large number of conformations with comparable energies. Thus, it is the noncovalent forces that largely define the secondary, tertiary, and quaternary structure observed in proteins. The noncovalent interactions, such as hydrogen bonds and hydrophobic interactions, impose constraints on bond rotation that can be observed by identifying the stable and flexible regions in a protein structure. The software FIRST (Floppy Inclusions and Rigid Substructure Topography) is used to represent the covalent and noncovalent constraints present in a protein and to compute the resulting flexibility of the main chain and side chains (Thorpe et al., 2000; Jacobs et al., 2001). Because the flexibility of interest here is the macroscopically significant flexibility, rather than the high-frequency fluctuations associated with thermal motion, bond lengths and angles are assigned their equilibrium values as observed in the crystal structure. These fixed bond lengths and angles give rise to distance constraints between pairs of atoms in the protein, either explicitly from chemical bonds or implicitly from other local bond lengths and angles. For example, each of the covalent bonds between adjacent N, Cα, and C atoms in the backbone has a constant bond length and forms a constant bond angle (Figure 4.1). This fixes the distance, shown as a dashed gray line in Figure 4.1, between the second-nearest-neighbor N and C atoms. All such fixed bond angles can be represented by the associated distance constraints. In this manner, all the distance constraints that arise due to covalent bonds and angles are identified, and constraints for nonrotatable peptide and other double or partial double bonds, as well as those arising from salt bridges, hydrogen bonds, and hydrophobic interactions, are added, as described above (detailed in Rader et al., 2001).
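The way a fixed bond angle turns into a fixed second-neighbor distance follows from the law of cosines. The sketch below is only an illustration of that geometric step, not FIRST code; the function name and the bond lengths and angle used (typical textbook backbone values of about 1.46 Å, 1.52 Å, and 111°) are assumptions rather than values taken from this chapter.

```python
import math


def second_neighbor_distance(bond_ab, bond_bc, angle_deg):
    """Distance between atoms A and C that are both bonded to atom B.

    With both bond lengths and the A-B-C angle held fixed, the A...C
    separation is fixed as well (law of cosines), so the angle can be
    replaced by an equivalent distance constraint.
    """
    theta = math.radians(angle_deg)
    return math.sqrt(
        bond_ab ** 2 + bond_bc ** 2 - 2.0 * bond_ab * bond_bc * math.cos(theta)
    )


# Assumed, typical backbone geometry (illustrative only).
N_CA_LENGTH = 1.46   # N-Calpha bond length in angstroms
CA_C_LENGTH = 1.52   # Calpha-C bond length in angstroms
N_CA_C_ANGLE = 111.0  # N-Calpha-C bond angle in degrees

n_to_c = second_neighbor_distance(N_CA_LENGTH, CA_C_LENGTH, N_CA_C_ANGLE)
print(f"Implied N...C distance constraint: {n_to_c:.2f} angstroms")
```

With these assumed inputs the implied N...C separation is roughly 2.5 Å; any other fixed angle in the structure could be converted to a distance constraint in the same way.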
FIRST uses 3D constraint counting (Jacobs et al., 2001) on this network of distance constraints to identify the flexible and rigid (structurally stable) regions within a protein. The results of FIRST native-state flexibility analysis have been shown to compare well with experimental definitions of flexible regions in a series of proteins including lysine-arginine-ornithine binding protein (Jacobs et al., 1999), cytochrome c (Thorpe et al., 2000), HIV protease, adenylate kinase, and dihydrofolate reductase (Jacobs et al., 2001).

Figure 4.1: Example of bond-length and bond-angle distance constraints for the main-chain atoms of an amino acid. The positions of the N, Cα, and C atoms are crystallographically defined, and the sp3 hybridization of the Cα atom defines the bond angle θ. Because the angle θ is constant, the distance between the N and C atoms, shown as a dashed gray line, is also constant. The thick black lines between the N-Cα and Cα-C atoms represent bond-stretching distance constraints that arise from the backbone covalent bonds.

4.3.3 Simulating Denaturation

As a protein is gradually thermally denatured, the covalent bonds remain intact, whereas hydrogen bonds begin to break. The flexibility in the protein will increase as the number of hydrogen bonds in the protein decreases. Our hypothesis is that the folding core is the region that will remain structurally stable the longest under denaturing conditions. This hypothesis was tested by removing hydrogen bonds from a protein structure to simulate thermal denaturation, then using FIRST to observe where the resultant flexibility occurs. The results will depend upon the order in which hydrogen bonds are removed. Because hydrophobic interactions actually become somewhat stronger with moderate temperature increases (Tanford, 1980), these interactions are maintained throughout the simulation.
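Because the outcome depends on the removal order, it may help to see how different orderings could be generated from a list of hydrogen-bond energies. The following is a minimal sketch, not the FIRST implementation: the function name, the scheme labels, the `window` parameter, and the example energies are all hypothetical. It covers strict weakest-first removal, random choice among the few weakest remaining bonds, and fully random removal; after each removal a rigidity analysis (not shown) would be rerun.

```python
import random


def dilution_order(energies, scheme="thermal", window=10, seed=None):
    """Return hydrogen-bond indices in the order they would be removed.

    energies: bond energies in kcal/mol (more negative means stronger).
    scheme:   "thermal" removes strictly weakest-first,
              "window"  picks at random among the `window` weakest left,
              "random"  picks at random among all bonds left.
    """
    if scheme not in ("thermal", "window", "random"):
        raise ValueError(f"unknown scheme: {scheme}")

    rng = random.Random(seed)
    # Weakest bond first: highest (least negative) energy first.
    remaining = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    if scheme == "thermal":
        return remaining

    order = []
    while remaining:
        # `remaining` stays sorted weakest-first, so the first `pool`
        # entries are always the weakest bonds still present.
        pool = len(remaining) if scheme == "random" else min(window, len(remaining))
        order.append(remaining.pop(rng.randrange(pool)))
    return order


# Hypothetical energies for five hydrogen bonds (kcal/mol).
example_energies = [-0.5, -3.0, -1.2, -4.4, -0.8]
for name in ("thermal", "window", "random"):
    print(name, dilution_order(example_energies, scheme=name, window=2, seed=0))
```

Keeping the remaining bonds sorted once and popping from the front portion of the list avoids re-sorting at every step; a window of 1 reproduces the strict weakest-first order.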
Three methods for diluting the hydrogen bond network of a protein are presented, each designed to test the importance of the strength and/or density of the hydrogen bonds when selecting which bond to remove next.

1. Thermal Denaturation. As the temperature of a protein is gradually increased, the hydrogen bonds are expected to break in an energy-dependent manner. This process is simulated by using the following procedure. Initially, the flexibility of the native protein structure is analyzed with all its covalent and noncovalent interactions (hydrogen bonds and hydrophobic interactions) included. The weakest hydrogen bond in the structure is then broken by removing any distance constraints imposed by that bond. The effect of removing this bond is then observed by applying FIRST to identify the flexible regions in the protein. This process of breaking the weakest hydrogen bond remaining in the structure and updating the identification of flexible regions is continued until all the hydrogen bonds have been removed.

2. Random Removal of Noncovalent Bonds Over a Small Energy Window. The thermal denaturation method described in (1) removes hydrogen bonds strictly in order of energy. To introduce some noise into the simulation, the next hydrogen bond to be removed is randomly selected from the 10 weakest bonds remaining in the protein. This modification was designed to reflect the stochastic nature of thermal denaturation and to test the effect of inaccuracies in the hydrogen-bond energy function. The results of this simulation also indicate whether the small fluctuations expected to occur during thermal denaturation significantly affect the flexibility or folding core predictions.

3. Completely Random Removal of Noncovalent Bonds.
To check whether the relative energies of hydrogen bonds, and not just their density in the structure, are indeed important in thermal denaturation, completely random dilutions of the hydrogen bonds in a protein, without respect to their energies, were performed. In this case, the next hydrogen bond to be removed from the protein is selected randomly from all remaining hydrogen bonds.

4.3.4 Identifying the Folding Core

The native-state flexibility of a protein structure is computed using FIRST with all noncovalent interactions present. Generally, in the native state, most of the residues belonging to an α-helix or β-strand are rigid, and the secondary structures are mutually rigid. As the hydrogen bonds are removed from the protein, parts of the secondary structures may become flexible, such as the ends of a helix or strand. Also, the secondary structures tend to become independently rigid at intermediate steps in denaturation, due to loss of tertiary hydrogen bonds. The protein folding core is defined in this study as the set of secondary structures that remain mutually rigid the longest in the simulated denaturation. The secondary structures for the native states of each of the ten proteins were identified by using the program DSSP (Kabsch and Sander, 1983) and tracked during the unfolding simulation. Not all residues in the secondary structure are required to be rigid when identifying the folding core. An α-helix is considered to be rigid if at least 5 consecutive residues, corresponding to one complete turn of an α-helix, belong to the rigid cluster. If a helix is defined by DSSP to contain fewer than 5 residues, as can occur with 3₁₀ helices, all its residues must be mutually rigid to be considered a rigid secondary structure. The β-strands are required to have at least 3 consecutive residues rigid to be considered as part of the folding core.
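The rigidity criteria for helices and strands can be expressed as a short check on runs of consecutive rigid residues. This is a hedged sketch only: the function names, the dictionary of thresholds, and the example residue numbers are hypothetical, and it assumes each secondary structure is given as an ordered list of consecutive residue numbers and each rigid cluster as a set of residue numbers.

```python
# Minimum run of consecutive rigid residues per secondary-structure type,
# following the thresholds stated in the text (5 for helices, 3 for strands).
MIN_RIGID_RUN = {"helix": 5, "strand": 3}


def longest_rigid_run(residues, rigid_cluster):
    """Length of the longest stretch of consecutive residues in the cluster."""
    best = run = 0
    for res in residues:
        run = run + 1 if res in rigid_cluster else 0
        best = max(best, run)
    return best


def is_rigid_element(kind, residues, rigid_cluster):
    """Apply the run-length rule to one helix or strand.

    Elements shorter than the threshold must be rigid along their
    entire length, so the required run is capped at the element length.
    """
    if not residues:
        return False
    required = min(MIN_RIGID_RUN[kind], len(residues))
    return longest_rigid_run(residues, rigid_cluster) >= required


# Hypothetical example: a ten-residue helix and a two-residue strand.
helix = list(range(1, 11))
short_strand = [20, 21]
cluster = {3, 4, 5, 6, 7, 20, 21}
print(is_rigid_element("helix", helix, cluster))          # five in a row
print(is_rigid_element("strand", short_strand, cluster))  # fully rigid
```

Counting how many elements pass this check for a single rigid cluster at each dilution step would then give the last step at which two or more secondary structures are mutually rigid.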
This criterion of three consecutive rigid residues allows for at least 2 hydrogen bonds to an adjacent strand. If a strand is defined by DSSP as consisting of fewer than 3 residues, the entire strand is required to be rigid to be counted as part of the folding core.

4.4 Results

4.4.1 Thermal Denaturation

For cytochrome c, the native state is composed of a single, structurally stable region, represented by the top line in Figure 4.2 and the 3D structure shown at the right. When hydrogen bonds 114 through 65 (the weakest 50) were removed, the large rigid cluster (colored red) significantly decreased in size (at the fifth line in panel A), resulting in new flexibility in those residues between the N- and C-terminal helices. These helices formed the only significantly rigid region in the protein. The folding core was predicted as the last point in the denaturation when at least two secondary structures formed a single rigid region. This point in cytochrome c occurred in the fifth-to-last line, where the N- and C-terminal helices are mutually rigid. On the next line, no single rigid cluster contained more than one secondary structure. The predicted folding core is shown structurally at bottom right, and summarized in a 1D representation just below the denaturation results, along with the folding core determined by H-D exchange (Li and Woodward, 1999; Jeng et al., 1990), shown in orange. The predicted and observed folding cores correspond well, both indicating that the N- and C-terminal helices together form a stable folding core. Detailed unfolding pathway and folding core predictions upon thermal denaturation are shown for barnase in Figure 4.3.
There was a significant change in the flexibility of the protein after 34 hydrogen bonds had been removed (fourth line from the top), in this case resulting in several small rigid regions that could move independently of one another (as indicated by their different colors in the plot) and one large rigid region (shown in red).

Figure 4.2: Results of simulated thermal denaturation for cytochrome c. This figure shows how the structure fragments into smaller rigid regions, with intervening flexible bonds, as the hydrogen bond network denatures with increasing temperature. α-helices within the native structure are indicated as red zigzags at the top. Shown at right is the 3D RCD representation of the largest rigid cluster (colored red) in the protein for the native state (top), an intermediate, partially unfolded state (middle), and the folding core (bottom), defined here as the last point in denaturation at which the largest rigid region consists of more than one secondary structure. The summary of the folding core prediction, shown at the bottom, indicates that there is close correspondence between the prediction of the folding core as the most stable supersecondary region and the folding core as defined by protection from H-D exchange (Li and Woodward, 1999).
Our study of folding transition states has shown that the rigid core of proteins disintegrates into several independent rigid regions when the mean coordination decreases below ~2.415. This is seen for both barnase and cytochrome c (Figures 4.3 and 4.2, respectively), yielding a transition between rigid and flexible states that is also found for network glasses near this same mean coordination (Rader et al., 2001). An intermediate structural state in barnase is formed by the packing of an α-helix against the β-sheet (second structural panel at right in Figure 4.3). This super-secondary structure recedes to form the folding core itself, consisting of the α-helix packed against part of the β-sheet (fourth line from bottom in Figure 4.3, with structure shown in last panel at right). The H-D exchange folding core, shown at bottom (orange), matches the predicted folding core (red) well, with the exception of the short, C-terminal β-strand.

Figure 4.4 shows the unfolding pathway for interleukin-1β, a protein whose secondary structure content consists entirely of β-strands. The structure shows little breakup during the initial steps of the unfolding simulation. A significant event occurred when hydrogen bond 106 was broken, resulting in flexibility for a large portion of the structure. The β-strands formed by residues between 50 and 130 remain rigid, and eventually are identified as the folding core on the fourth line from the bottom. A comparison to the experimental folding core, shown at the bottom in orange, shows good overlap.

For completeness, the hydrogen bond dilution results for bovine pancreatic trypsin inhibitor (BPTI) are shown in Figure 4.5. BPTI is a member of the CATH class "few" due to its small size and few secondary structures. The unfolding path represented in Figure 4.5 shows a gradual breakup of the structure into small flexible regions. The N-terminal helix becomes flexible when hydrogen bond 29 is broken, followed by the C-terminal helix when hydrogen bond 15 is broken.

Figure 4.3: Results of simulated thermal denaturation for barnase. The secondary structure content of barnase is depicted at the top of the figure; α-helices, red zigzags; β-strands, yellow arrows. The 3D RCD of the largest rigid cluster (colored red) is shown on the right side for barnase in the native state (top), a transition state (middle), and the folding core (bottom). The predicted folding core, identified on the fourth line from the bottom, is compared to the experimentally defined folding core (colored orange) at the bottom. There is good overlap between the predicted and experimental results.

Figure 4.4: Results of simulated thermal denaturation for interleukin-1β. The secondary structure content of this protein is entirely β-strands. Their location is indicated by the yellow arrows at the top of the figure. The folding core for interleukin-1β, identified on the fourth line from the bottom, is compared to the experimentally determined folding core (depicted in orange) at the bottom of the figure. There is good overlap between the two.

Figure 4.5: Results of simulated thermal denaturation for bovine pancreatic trypsin inhibitor. This small protein with few secondary structures shows a gradual rigid-to-flexible transition as hydrogen bonds are diluted from the structure. The position of the secondary structures is indicated at the top of the figure; α-helices, red zigzags; β-strands, yellow arrows. The predicted folding core is identified on the second line from the bottom, and is compared to the experimental folding core (in orange) at the bottom of the figure. There is very good agreement between the two.
The remaining two secondary structures remain mutually rigid, along with residues 45 and 51, to form the predicted folding core of BPTI. The overlap between the predicted and the experimental folding cores, shown at the bottom, is good.

Thermal denaturation simulations were performed to predict the folding core for each protein in our dataset. Figure 4.6 summarizes the results from these simulations, comparing the predicted folding core to the observed folding core. For a majority of the proteins (8 out of 10), the folding core predictions agree well with the folding cores defined by regions of slow H-D exchange, and often involve tertiary interactions between sequence-distant secondary structures. For α-lactalbumin, half of the folding core region is in agreement, and for T4 lysozyme, the folding core identified by experiment is much larger than that identified by flexibility analysis. Given that different experimental conditions can also produce different results, we plan to consult a broader range of experimental probes of T4 lysozyme folding and to perform further structural analysis. However, given the diverse structures and folding mechanisms for these ten proteins, the overall good agreement between theory and experiment suggests that flexibility analysis is a useful tool for probing the stability of substructures, in particular the folding core, along the folding/unfolding pathway. This approach provides explicit 3D structural maps of the stable regions predicted in the protein at each step during denaturation, as well as a model for the interactions important in stabilizing folding cores: a dense network of hydrogen-bond interactions that augment the ubiquitous, but less specific, hydrophobic interactions.

Figure 4.6: Comparison of the folding core predicted by FIRST flexibility analysis (P) to the observed folding core of H-D exchange experiments (E) for barnase (Perrett et al., 1995), cytochrome c (Jeng et al., 1990), ubiquitin (Pan and Briggs, 1992), BPTI (Woodward and Hilton, 1980), ribonuclease T1 (Mullins et al., 1997), CI2 (Neira et al., 1997), interleukin-1β (Driscoll et al., 1990), T4 lysozyme (Anderson et al., 1993), α-lactalbumin (Schulman et al., 1995), and apo-myoglobin (Hughson et al., 1990).

Figure 4.7: Results of random hydrogen bond dilution over a window of 10 hydrogen bonds for cytochrome c. Denaturation is simulated by removing hydrogen bonds as in the thermal denaturation method; however, instead of always removing the weakest hydrogen bond next, a hydrogen bond is randomly selected from the 10 weakest hydrogen bonds in the protein. Beneath the figure the predicted folding core (red) is compared to the observed folding core (orange). The similarity in folding core predictions between this result and that of the thermal denaturation simulation (Figure 4.2) indicates that the results of simulated thermal denaturation are robust.
4.4.2 Evaluating Other Models of Denaturation

Figure 4.7 shows the result of simulating cytochrome c denaturation by removing a hydrogen bond randomly from the ten weakest bonds remaining in the protein at each step. It can be seen in the second column on the left that the energies of the bonds being removed generally become more negative (stronger); however, they are not removed strictly from weakest to strongest as in the thermal denaturation (Figure 4.2). This approach tests the robustness of the thermal denaturation scheme to thermal fluctuations or some inaccuracy in the calculation of hydrogen-bond energies. Comparing Figure 4.2 to Figure 4.7 shows that introducing some randomness into the thermal denaturation has little effect on accurate prediction of the folding core for cytochrome c, and mainly predicts a more rigid unfolding intermediate state between -1.4 and -2.2 kcal/mol. Twenty separate runs were performed with different random selection of the hydrogen bonds removed, and all runs predicted the same folding core (data not shown).

As an extreme example of a random dilution, we simulated denaturation in which the hydrogen bond energies were not taken into account. Each hydrogen bond was weighted equally, and the next bond to be removed was chosen randomly from all hydrogen bonds remaining in the protein. If the folding core of a protein could be identified solely by having the highest density of covalent bonds, hydrogen bonds, and hydrophobic interactions, regardless of their energy, the results for this approach would be accurate. Four separate, random denaturation simulations for cytochrome c are shown in Figure 4.8. Below each panel, a comparison between the folding core predicted from this simulation and the experimentally observed folding core is shown.
Panel C in Figure 4.8 shows that a completely random simulation can produce a correct folding core prediction and have similar intermediate features to thermal denaturation according to hydrogen-bond energy (compare with Figure 4.2). However, the other panels in Figure 4.8 indicate that a random hydrogen bond removal scheme most commonly mispredicts the folding core. These results show that the energy of hydrogen bonds is a significant factor in simulating the denaturation and unfolding of proteins, as validated by folding core prediction.

Figure 4.8: Four completely random dilutions of the hydrogen bonds in cytochrome c. Each panel represents a single unfolding simulation in which the hydrogen bonds were removed in random order. The secondary structures are shown at the top of each panel (the red zigzags represent α-helices). The predicted folding core from each panel is compared to the observed folding core (in orange) at the bottom of each panel. The panel at the lower left shows that an accurate folding core prediction can by chance be obtained from a completely random hydrogen bond removal scheme. However, the results in the other three panels are in poor agreement with the observed folding core. These data indicate that the density of hydrogen bonds is not the sole determinant in forming a folding core.

... Φ-values determined experimentally from the mutagenesis experiments. Preliminary data on barnase and the p53 tetramerization domain have indicated a very good correlation between predicted and experimental values; however, the agreement was poorer for the SRC SH3 domain and CI2.
Even more than the folding core predictions, the predicted structure of the transition state will depend upon the order in which hydrogen bonds are broken during an unfolding simulation. Future work on predicting Φ-values will be addressed concurrently with optimizing the new hydrogen bond dilution scheme.

In closing, it is clear that FIRST flexibility analysis provides a novel means for studying protein folding. It appears that native-state protein structures do indeed encode information about the folding mechanism for many proteins, and that FIRST can decode much of this information. Analysis of folding with FIRST is still in its infancy, but the ease with which the program can be developed and the wealth of experimental data available for comparison ensure that future experiments are forthcoming, and that FIRST results will continue to contribute to our understanding of one of the most important phenomena in nature, the folding of proteins.

Appendix A

Summary of Publications Outside of the Scope of the Work Presented in this Dissertation

• Q. Yuan, J. J. Pestka, B. M. Hespenheide, L. A. Kuhn, J. E. Linz and L. P. Hart. Identification of mimotope peptides which bind to the mycotoxin deoxynivalenol-specific monoclonal antibody. Appl. Environ. Microbiol. 65:3279-86, 1999.

Monoclonal antibody 6F5 (mAb 6F5), which recognizes the mycotoxin deoxynivalenol (DON) (vomitoxin), was used to select for peptides that mimic the mycotoxin by employing a library of filamentous phages that have random 7-mer peptides on their surfaces. Two phage clones selected from the random peptide phage-displayed library coded for the amino acid sequences SWGPFPF and SWGPLPF. These clones were designated DONPEP.2 and DONPEP.12, respectively. The results of a competitive enzyme-linked immunosorbent assay (ELISA) suggested that the two phage-displayed peptides bound to mAb 6F5 specifically at the DON binding site.
binding site. The amino acid sequence of DONPEP.2 plus a structurally flexible linker at the C terminus (SWGPFPFGGGSC) was synthesized and tested to determine its ability to bind to mAb 6F5. This synthetic peptide (designated peptide C430) and DON competed with each other for mAb 6F5 binding. When translationally fused with bacterial alkaline phosphatase, DONPEP.2 bound specifically to mAb 6F5, while the fusion protein retained alkaline phosphatase activity. The potential of using DONPEP.2 as an immunochemical reagent in a DON immunoassay was evaluated with a DON-spiked wheat extract. When peptide C430 was conjugated to bovine serum albumin, it elicited antibody specific to peptide C430 but not to DON in both mice and rabbits. In an in vitro translation system containing rabbit reticulocyte lysate, synthetic peptide C430 did not inhibit protein synthesis but did show antagonism toward DON-induced protein synthesis inhibition. These data suggest that the peptides selected in this study bind to mAb 6F5 and that peptide C430 binds to ribosomes at the same sites as DON.

• B. Essigmann, B. M. Hespenheide, L. A. Kuhn and C. Benning. Prediction of the active-site structure and NAD+ binding in SQD1, a protein essential for sulfolipid biosynthesis in Arabidopsis. Arch. Biochem. Biophys. 369:30-41, 1999.

Sulfolipids of photosynthetic bacteria and plants are characterized by their unique sulfoquinovose headgroup, a derivative of glucose in which the 6-hydroxyl group is replaced by a sulfonate group. These sulfolipids have been discussed as promising anti-tumor and anti-HIV therapeutics based on their inhibition of DNA polymerase and reverse transcriptase. To study sulfolipid biosynthesis, in particular the formation of UDP-sulfoquinovose, we have combined computational modeling with biochemical methods. A database search was performed employing the derived amino acid sequence from SQD1, a gene involved in sulfolipid biosynthesis of Arabidopsis thaliana.
This sequence shows high similarity to other sulfolipid biosynthetic proteins of different organisms and also to sugar nucleotide modifying enzymes, including UDP-glucose epimerase and dTDP-glucose dehydratase. Additional biochemical data on the purified SQD1 protein suggest that it is involved in the formation of UDP-sulfoquinovose, the first step of sulfolipid biosynthesis. To understand which aspects of epimerase catalysis may be shared by SQD1, we built a three-dimensional model of SQD1 using the 1.8 Å crystallographic structure of UDP-glucose 4-epimerase as a template. This model predicted an NAD+ binding site, and the binding of NAD+ was subsequently confirmed by enzymatic assay and mass spectrometry. The active-site interactions together with biochemical data provide the basis for proposing a reaction mechanism for UDP-sulfoquinovose formation.