MULTI-SCALE, MULTI-DIMENSION, DNA-BASED MATHEMATICAL MODELING OF GENE EXPRESSION By Jacqueline M. Dresch A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Mathematics 2012 ABSTRACT MULTI-SCALE, MULTI-DIMENSION, DNA-BASED MATHEMATICAL MODELING OF GENE EXPRESSION By Jacqueline M. Dresch The study of gene regulation has been an important topic in biology for decades. Activation and silencing of genes affect everything from everyday breathing to deadly diseases, and thus concern not only academic biologists, but also healthcare professionals and thus the general population. Researchers now have mapped out many species’ entire genomic sequences, but annotation is incomplete concerning stretches of DNA that regulate genes, as well as protein coding regions. Accompanying the expansion in sequence data, new technologies have provided copious amounts of expression data suitable for modeling studies. Experimental and computational biology now combine to answer many of the complicated questions behind gene regulation and gene function, although we lack a complete picture of the physical and chemical reactions of the complex gene regulatory machinery. To better understand the processes of gene regulation we need to develop mathematical models that describe how DNA sequences direct differentiated gene expression, predict the expression of unknown genes or variants of genes, and reveal the driving forces behind gene regulation. With new high throughput techniques leading to large data sets of increased quality, mathematicians need no longer make speculative hypotheses; they now can connect their models directly with varying levels of biological inputs and verify the model predictions with quantitative data sets. My studies develop new approaches for quantitative assessment and application of mathematical tools important for gene regulatory studies. ACKNOWLEDGEMENTS First and foremost, I would like to thank both of my advisors, Dr. Chichia Chiu and Dr. David N. Arnosti. Dr. Chiu has guided me both in my mathematical research as well as discovering not what others expect of me, but what I truly wish to achieve in my life. Dr. Arnosti has spent endless hours with me going over simulations, analyzing results, and above all, making sure that I have a thorough understanding of the biological data and hypotheses feeding into quantitative biological studies. Without both Dr. Chiu and Dr. Arnosti, this thesis would not have been possible. I would also like to thank all of the collaborators I have had the opportunity to work with and everyone who has contributed to this thesis. These individuals encompass a large array of mathematicians, computer scientists, and biologists with varying specialties, including Dr. Ahmet Ay, Evan Dayringer, Richard Shadrach, Dr. Ben Beckmann, Dr. Saurabh Sinha, Md. Abul Samee, Yerzhan Suleimenov, Ben Taylor, Rupinder Sayal, Dr. Walid Fakhouri, Xiaozhou Liu, and Irina Pushel. In addition, I would like to thank all of my committee members, especially Dr. Peter Bates for his thoughtful comments on this thesis. I would like to thank my family for all of their support throughout my time as a graduate student. As a woman in mathematics, my family’s support and constant encouragement has been truly priceless. Last, but not least, I would like to thank all of my friends at MSU. This thesis would not have been possible without a break every so often. A very special thanks to Tom Bellsky, Matt O’Toole, Dominique Lewis, Petra Hendrickson, Li Yang, Yan Wei, the Liouville Sluggers, and all of the wonderful friends I have made as a graduate student! iii TABLE OF CONTENTS List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi . . . . . . . . . . . . . . . . . . vii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 List of Figures . . . . . . . . . . . . . . 2 Thermodynamic-based Modeling . . . 2.1 Introduction and Description . . . . . 2.2 Mathematical Analysis of a Simple Thermodynamic-based Model . . . . 2.3 Incorporating More Information into Thermodynamic-based Models . . . . 2.3.1 Protein-protein Interactions . 2.3.2 Overlapping Binding Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 13 . . . . . . . . . . . . . . . . . . . . . . 17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 26 29 3 Dynamic Modeling: Incorporating Both cis-regulatory Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Introduction and Problem Description . . . . . . . . . . . 3.2 Development of a semi-implicit numerical method . . . . . 3.2.1 Construction . . . . . . . . . . . . . . . . . . . . . 3.2.2 Stability . . . . . . . . . . . . . . . . . . . . . . . . 3.2.3 Convergence . . . . . . . . . . . . . . . . . . . . . . 3.3 Numerical Simulations . . . . . . . . . . . . . . . . . . . . and . . . . . . . . . . . . . . . . . . . . . 4 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Local Sensitivity Analysis . . . . . . . . . . . . . . . . . 4.2.2 Global Sensitivity Analysis . . . . . . . . . . . . . . . . . 4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Local Sensitivity Analysis of the Fakhouri et al. Model . 4.3.2 Global Sensitivity Analysis of the Fakhouri et al. Model 4.3.3 Optimized Gene Construct Design and Synthetic Data . iv Temporal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 31 36 36 38 51 55 . . . . . . . . . 60 60 62 63 64 71 76 79 87 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 4.3.4 Global Sensitivity Analysis of the Zinzen et al. Model . . . . . . . . . 97 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.0.1 GEMSTAT Specifics and Applications . . . . . . . . . . . . 5.2 Methods and Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Construction of Synthetic Data Sets . . . . . . . . . . . . . . . . . . . 5.2.1.1 Construction of Continuous Synthetic Data Sets . . . . . . . 5.2.1.2 Construction of Perturbed, Binary, and Interpolated Synthetic Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 5.2.2 Description of QN/NMS and CMA-ES Methods . . . . . . . . . . . . 5.2.2.1 QN/NMS Method . . . . . . . . . . . . . . . . . . . . . . . 5.2.2.2 CMA-ES Method . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 105 111 113 113 113 6 Thermodynamic-based Modeling: Analysis of Biological Data Sets . . . 6.1 Understanding the Structural Rules of Enhancer Function . . . . . . . . . . 6.1.1 Reporter Gene Constructs and Confocal Image Dataset . . . . . . . . 6.1.2 Structure of Models Tested . . . . . . . . . . . . . . . . . . . . . . . . 6.1.3 Model Performance and Parameter Estimation . . . . . . . . . . . . . 6.1.4 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Understanding the Dynamic Effects of Redundant Enhancers . . . . . . . . . 6.2.1 The Addition of Redundant Enhancers to Dynamic Models of Transcriptional Regulation . . . . . . . . . . . . . . . . . . . . . . . . . . 133 133 134 134 142 143 146 114 115 116 116 121 130 147 7 Closing Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 A Data Processing . . . . . . . . . A.1 Introduction . . . . . . . . . . A.2 Fakhouri et al. 2010 data set . A.3 Sayal et al. 2012 data set . . . . . . . . . . . . . . . . . . . . . . . . . . . Bibliography . . . . . . . . . . . . . . . v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 153 155 156 . . . . . . . . . . . . . . . . . . . 162 LIST OF TABLES Table 4.1 Redesigned gene constructs . . . . . . . . . . . . . . . . . . . . . . . 87 Table 4.2 Gene constructs with combinatorial quenching. . . . . . . . . . . . . 92 Table 4.3 Parameter relationships resulting in high correlations. . . . . . . . . 101 Table 5.1 A pseudo-algorithm for the derivative score. . . . . . . . . . . . . . . 114 Table 5.2 A pseudo-algorithm for CMA-ES. . . . . . . . . . . . . . . . . . . . 117 vi LIST OF FIGURES Figure 1.1 Figure 1.2 Figure 2.1 Figure 2.2 Illustration of an Enhancer Region. The horizontal line represents a sequence of DNA, read from 5’ to 3’. The arrow corresponds to the transcription start site, the nucleotide at which the RNA coding sequence, or “gene”, begins. The blue region, just upstream of the transcription start site, depicts the region of DNA where regulatory proteins bind to control gene expression, the enhancer. Green circles and red squares depict TFs (activators and repressors respectively). For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation. 2 Dynamic Transcriptional Network. On the left of the figure, each node corresponds to a TF or gene of involved in the network. Dorsal acts as the input to the network at the end of nuclear cycle 13, activating (depicted by an arrow) all other genes in the network. As Twist is expressed, it in turn activates snail and rhomboid. Snail then represses (depicted by a blunt arrow) rhomboid. On the right, the corresponding final expression patterns, in late nuclear cycle 14, are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 1C from [26]. A Drosophila embryo with Giant repressor (red stripes) and Dorsal activator (green) staining is shown. Each embryo provides a diversity of potential inputs to the regulatory element: the white arrow points to a region where activator levels are high and repressor levels are low. The black arrow points to a region where both activator and repressor levels are low. The white triangle points to a region where activator and repressor levels are both high, and the black triangle points to a region where repressor levels are high and activator levels are low. . . . . . . . . . . . . . . . . . . . . . . 22 1g 2t2d: In regions where Giant protein is present at maximum concentrations, we predict a 44% reduction in expression. . . . . . . . . 24 vii Figure 2.3 Figure 2.4 Figure 3.1 Figure 3.2 Figure 3.3 Figure 4.1 2g 2t2d: In regions where Giant protein is present at maximum concentrations, we predict a 69% reduction in expression. . . . . . . . . 24 3g 2t2d: In regions where Giant protein is present at maximum concentrations, we predict an 82% reduction in expression. . . . . . . . 25 Simulations with decreasing step size. Parameter values were fixed in each of these four model simulations, allowing us to analyze the effect of changing step-size on the numerical solutions obtained by (3.2). In panel A) h = 1.0, B) h = 0.1, C) h = 0.01, and D) h = 0.001. . . . 56 Contour plots for simulations with decreasing step size. Parameter values were fixed in each of these four model simulations, allowing us to analyze the effect of changing step-size on the numerical solutions obtained by (3.2). In panel A) h = 1.0, B) h = 0.1, C) h = 0.01, and D) h = 0.001. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Data vs. Model Predictions. 8 time points are shown. Each pair of graphs depicts on the left, the data used to fit model parameters, and on the right, the model predictions after parameter fitting, for the given time point. Observe that the model is very successful in fitting the general dynamics and expression pattern trends of this system. The corresponding RMSE for this prediction is 0.0616. . . 58 General description of thermodynamic-based models and examples from [26] and [105]. A) Column 1: fractional occupancy of a TF binding to an enhancer. Column 2: all possible states of an enhancer containing one repressor and one activator binding site. Column 3: fractional occupancy of each event. Column 4: expression contribution for each state. Column 5: calculation of total expression. B) Column 1: two constructs used in [26]. Column 2: representative embryos. Column 3: quantitative data of normalized repressor [Gt] vs reporter gene [lacZ]. Column 4: total expression formulas. C) Line 1, Column 1: two conceptual constructs used in [105]. Line 2, Column 2: quantitative rho data used by Zinzen et al.. Line 2: total expression formulas. . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 viii Figure 4.2 Sensitivity analysis using local and global-local methods on the thermodynamicbased model of Fakhouri et al., scheme 2. A) Local sensitivity analysis highlights great differences among parameters. Model parameters are shown on the horizontal axis and local sensitivity coefficients (scaled from 0.0 to 1.0) are shown on the vertical axis. Larger values indicate that the model output changes greatly when this parameter is varied. B) Global-local analysis reveals differences in relative sensitivities compared to local analysis. The global-local sensitivity contribution (also scaled from 0.0 to 1.0) is plotted on the vertical axis. In the global-local analysis, quenching parameters (Q) are generally more sensitive than repressor scaling factors (R), which in turn are more sensitive than cooperativity parameters (C). . . . . . . . . . . . . . 78 Figure 4.3 Sensitivity analysis using the eFAST global sensitivity method on the thermodynamic-based model of Fakhouri et al., scheme 2. A) Shown are first- and total-order sensitivity indices, which represent the amount of variation in the model output with respect to each parameter individually (gray bars) and in conjunction with all other parameters (black bars), scaled from 0.0 to 1.0. If the gray bar is much smaller than the black bar, as in Q6, secondary (or higher) effects are predominant. Quenching parameters are generally most sensitive and cooperativity parameters are least sensitive. B) Effect of frequency with which a quenching parameter is represented in the twelve constructs on sensitivity. The number of constructs that the parameter is represented in is shown along the horizontal axis. Corresponding first- and total-order sensitivity indices are shown for each quenching parameter, as calculated by the eFAST algorithm. There are two quenching parameters (Q4 and Q5) represented in two constructs, and two quenching parameters (Q1 and Q3) represented in five constructs. At these values, there are four data points, two for each quenching parameter. The lines illustrate linear fits to each data set, first- and total-order sensitivity indices. In general, the more constructs a quenching parameter is represented in, the more sensitive that parameter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 80 Figure 4.4 Figure 4.5 Sensitivity analysis using the HDMR global sensitivity method on the thermodynamic-based model of Fakhouri et al., scheme 2. Shown are first-, second- and the sum of first- and second-order sensitivity indices, which represent the amount of variation in the model output with respect to each parameter individually (light gray bars), each pair of parameters (dark gray bars), and the sum of these two variations (black bars), scaled from 0.0 to 1.0. General sensitivity trends are similar to those observed in Figure 2, although secondorder sensitivities are largest for R and C1, implying that they affect the model output through pair-wise interactions with other parameters. Over 95% of the variation in model output was found using only first- and second- order sensitivities (data not shown); therefore the sum of first- and second-order sensitivities shown here approximates the total sensitivity of each parameter. . . . . . . . . . . . . . . . . 85 Sensitivity analysis using the HDMR global sensitivity method on 11 redesigned gene constructs. Synthetic data created using the Fakhouri et al. model on these 11 redesigned gene constructs shows a decrease in variation of sensitivities between quenching parameters compared to Figure 4.3. In A, the synthetic data was created using the parameter values that were found in the Fakhouri et. al study. In B, the synthetic data was created using mean parameter values (0.5 for quenching parameters, 50 for scaling factor and cooperativity parameters). A decrease in variation of sensitivities for quenching parameters is also observed from A to B (the sum of first and second order sensitivities has a range of 0.10 to 0.28, with a mean of 0.17 and a standard deviation of 0.07 in A and a range of 0.16 to 0.31, with a mean of 0.20 and a standard deviation of 0.06), indicating that the sensitivity analysis is dependent on the true parameter values represented in the biological data. . . . . . . . . . . . . . . . 88 x Figure 4.6 Figure 4.7 Values of biological data can affect sensitivity analysis. The eFAST global sensitivity method was applied to the thermodynamic-based model of Fakhouri et al., scheme 2, using synthetic data with mean values (0.5 for quenching parameters, 50 for scaling factor and cooperativity parameters). We observe a decrease in variation of sensitivities between quenching parameters, when compared to Figure 4.2. A) Sensitivity indices represent the amount of variation in the RMSE. B) Effect of frequency with which a quenching parameter is represented in the twelve constructs on sensitivity. The number of constructs that the parameter is represented in is shown along the horizontal axis. Corresponding first- and total-order sensitivity indices are shown for each quenching parameter, as calculated by the eFAST algorithm. Note that there are two quenching parameters (Q4 and Q5) represented in two constructs, and two quenching parameters (Q1 and Q3) represented in five constructs. Thus at these values there are four data points, two for each quenching parameter. The lines illustrate linear fits to each data set, first- and total-order sensitivity indices. In general, the more constructs a parameter is represented in, the more sensitive that parameter. When compared to Figure 4.2B, a more pronounced correlation between a parameter’s total-order sensitivity coefficient and the number of gene constructs that parameter is represented in is observed here, underscoring the need for sensitivity analysis in experimental design. . . . . . . . . . . . . . . . . . . . . 90 Effect of combinations of four quenching parameters on sensitivity analysis using the HDMR global sensitivity method on redesigned gene constructs (see Table 4.2). Compared to an even design, an increase in variation of sensitivities between quenching parameters is seen when the data used comes from an uneven construct design (compare A to B). The synthetic data was created using the nine constructs in which the quenching parameters were evenly distributed (A) and not evenly distributed (B). Mean parameter values (0.5 for quenching parameters, 50 for scaling factor and cooperativity parameters) were used in both cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 xi Figure 4.8 Sensitivity analysis using the HDMR global sensitivity method on the thermodynamic-based model of Zinzen et al. First-, second- and the sum of first- and second-order sensitivity indices (scaled from 0.0 to 1.0), are shown. Despite the focus in the Zinzen et al. study on Dorsal-Twist cooperativity, these parameters are relatively insensitive compared to protein scaling factors. Results from representative enhancer structures: A) set 1 and B) set 2. The model parameters are shown on the horizontal axis. D, T, and S correspond to scaling factors for Dorsal, Twist, and Snail binding sites, respectively. DTr, TTr, and SSr correspond to cooperativity parameters representing DorsalTwist, Twist-Twist, and Snail-Snail cooperativities respectively, for the rho-like enhancer. Similarly, DTv, TTv, and SSv correspond to cooperativity parameters representing Dorsal-Twist, Twist-Twist, and Snail-Snail cooperativities respectively, for the vnd -like enhancer. 98 Figure 5.1 Illustration of the expression patterns for synthetic data set 20 obtained by reverse engineering of a gene regulatory network by the Direct model. The synthetic target data used for the parameter estimation is created by simulating the model with the random parameters obtained as explained in the Methods and Calculations. We created the expression patterns using the parameters obtained by the CMAES and QN/NMS methods. The black tick marks are the synthetic target data and the blue (QN/NMS) and red (CMA-ES) lines correspond to the simulated patterns. The y-axis gives the relative gene expression level and the x-axis corresponds to the anterior-posterior (A-P) axis of the embryo. The final SSE is 0.001 and 13.0 for CMAES and QN/NMS respectively. . . . . . . . . . . . . . . . . . . . . 118 Figure 5.2 Performance comparison of the CMA-ES and QN/NMS methods on the Direct model. The minimum SSE values are plotted for data sets of varying quality: A)continuous, B) perturbed, C) binary, and D) interpolated. The analysis has been done on 30 synthetic data sets. For each data set, two SSE values are shown; CMA-ES (red circles) and QN/NMS (blue triangles). It has been observed that the CMAES method performs better than the QN/NMS method on continuous data. However, the choice of parameter estimation technique has no observable effect on lower quality data sets. The horizontal line in panel A) signals a change in scale of the vertical axis. . . . . . . . . 122 xii Figure 5.3 Comparison of the parameter values obtained by the CMA-ES and QN/NMS methods on the GEMSTAT Direct model. The minimum SSEs between the true parameter values and those obtained by the CMA-ES and QN/NMS methods are plotted for data sets of varying quality: A)continuous, B) perturbed, C) binary, and D) interpolated. The analysis has been done on 30 synthetic data sets. For each data set, two SSE values are shown; CMA-ES (red circles) and QN/NMS (blue triangles). It has been observed that the CMA-ES method more accurately predicts the true parameter values on continuous data. However, the choice of parameter estimation technique has no observable effect on lower quality data sets. The horizontal lines signal a change in scale of the vertical axis. . . . . . . . . . . . . . 124 Figure 5.4 Scatter plot of parameters for synthetic data set 20 using CMA-ES and QN/NMS methods on the Direct model. Each method was run five times. The black diamonds represents the target parameters, and the red circles and blue triangles represent parameters obtained by CMA-ES and QN/NMS methods respectively. The CMA-ES method more accurately predicted the true parameter values on continuous data. The horizontal lines signal different scales of the vertical axis. For the CMA-ES and QN/NMS methods respectively, the mean standard deviation in parameter values found is 64.11 and 80.29. . . . . 125 Figure 5.5 Comparison of the relative error (%) for the estimated parameters of synthetic data set 20 using CMA-ES (red circles) and QN/NMS (blue triangles) methods. The CMA-ES method predicts the true parameter values with lower relative error on continuous data. . . . 126 Figure 5.6 Sensitivity analysis for the Direct model. The blue, green, and red bars represent the parameter first-, sum of second-, and sum of firstand second-order sensitivities calculated using the HDMR algorithm respectively. The objective function used was the SSE between model predictions and synthetic data set 20, whose predictions are shown in Figure 5.1 and best fit parameters shown in Figures 5.4 and 5.5. Observe that the model is much more sensitive to perturbations in the fourth parameter, one of the parameters associated with Caudal, than any other parameter in the model. For other parameters, we see large second-order sensitivities, corresponding to a high level of parameter compensation. . . . . . . . . . . . . . . . . . . . . . . . 127 xiii Figure 5.7 Scatter plot of the parameters obtained for experimental data using CMA-ES and QN/NMS methods on the Direct model. The red circles and blue triangles represent parameters resulting in the minimum SSE when CMA-ES and QN/NMS methods are applied respectively. It has been observed that, although SSE values were comparable (CMAES SSE = 0.2760 and QN/NMS SSE = 0.2767), for the most sensitive parameters the two methods results in different values. The horizontal line signals a change in scale of the vertical axis. . . . . . . . . . 129 Figure 6.1 Structures of reporter genes assayed. Green, blue, and red circles denote annotated binding sites for Dorsal, Twist, and Snail respectively. X’s depict binding sites that were mutated. . . . . . . . . . . . . . . 135 Figure 6.2 Continuous Functions used for Quenching and Cooperativity. These functions are the functions used in different schemes of the model, to describe the effects of a repressor’s quenching ability on a bound activator or the cooperativity between two bound TFs as functions of the distance between the binding sites. A) f (d) = a + bd, B) 1 2 . When implemented, a = 1 f (d) = ae−d /b , and C)f (d) = a ed/b and b is a model parameter for quenching functions, and a and b are both model parameters for cooperativity functions. . . . . . . . . . . 138 Figure 6.3 Quenching Function Derived from Fakhouri et al. Analysis. This function was fit using the six quenching values, shown in red, found in [26] to a fifth degree polynomial, f (d) = −0.000000004249d5 + 0.0000010199d4 − 0.00010199d3 + 0.0050099d2 − 0.10731d + 1.0 for d ≤ 100 and assumed to be 0 for d > 100, shown in blue. When implemented, this function is used directly, with no parameter values corresponding to quenching. . . . . . . . . . . . . . . . . . . . . . . 139 Figure 6.4 Binned Functions used for Quenching and Cooperativity. These stepfunctions are representatives of those used in various model schemes to describe the effects of a repressor’s quenching ability on a bound activator and cooperativity between two bound TFs as step-functions of the distance between the binding sites. A) Quenching function with four bins, B) Cooperativity function with three bins. When implemented, no quenching is allowed beyond the distance of the last bin, thus the function value is assumed to be 0 beyond 100 bp in A). 140 xiv Figure 6.5 Model performance for 120 model schemes. Each box represents the RMSE for a single model scheme, with the row representing the cooperativity function and the column representing the quenching function used in this scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Figure 6.6 Different Measures of Model Performance for 18 model schemes. Each box represents the ranking for a single model scheme, with the row representing the model scheme and the column representing the fitness measure used for ranking. . . . . . . . . . . . . . . . . . . . . . 144 Figure 6.7 Construct Performance for 18 model schemes. Each box represents the RMSE for a single model scheme on a single construct, with the row representing the model scheme and the column representing the construct. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 Figure A.1 Six representative yellow-white images with quadratics fit to their intensity values. The blue lines represent ten lines taken uniformly from the region 40-60% Anterior-Posterior and plotted from Dorsal to Ventral. The red line represents the quadratic fit to the average of these ten lines from 10-40% Dorsal-Ventral. This region was chosen due to the fact that there should be no expression in this region regardless of whether the embryos contain the transgene, as in our core data set, or not, as in these yellow-white images. . . . . . . . . 158 Figure A.2 Variation in Signal Intensity on different imaging days. In panels A and B, each circle represents a single embryo’s mean intensity value. All embryos represented in panel A came from a single construct, which is a double activator knock-out, stained on the same day and then imaged on three different days. All embryos represented in panel B came from a different single construct, which is a triple repressor knock-out, again stained on the same day and then imaged on three different days. In both cases, there is a great deal of variation in signal intensity. In panels C and D, these mean values were averaged over all embryos stained and imaged on the same day. We observed a well-defined decreasing trend in mean intensity as the number of days between staining and imaging increased. . . . . . . . . . . . . . 159 xv Chapter 1 Introduction The process of gene regulation and understanding the intricate machinery involved in controlling it poses a very important question in modern biology and biochemistry: what physical properties drive DNA sequences to control complex protein interactions leading to the precise expression of genes? All living organisms have genes that turn on and off to regulate life processes, many of which depend on the expression of genes at precise levels, spatial locations and times. Due to this large array of applications, the study of gene regulation underlies the vast majority of biological fields, from bacterial growth to human disease pathways. The regions of DNA that control gene expression (see Figure 1.1) are referred to as enhancers and can be highly divergent in different organisms. Often the factors controlling gene regulation are divided into two groups, cis-acting elements and trans-acting factors. Cis-acting elements, also referred to as cis-regulatory modules (CRMs) or enhancers, are the regions of DNA containing binding sites for other factors involved in gene regulation and are typically located at a close proximity to the transcriptional start site. Trans-factors, on the other hand, are proteins that bind to these cis-elements to control expression. These factors in- 1 Figure 1.1: Illustration of an Enhancer Region. The horizontal line represents a sequence of DNA, read from 5’ to 3’. The arrow corresponds to the transcription start site, the nucleotide at which the RNA coding sequence, or “gene”, begins. The blue region, just upstream of the transcription start site, depicts the region of DNA where regulatory proteins bind to control gene expression, the enhancer. Green circles and red squares depict TFs (activators and repressors respectively). For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation. clude transcription factor (TF) proteins transcribed elsewhere in the genome, translated and actively transported through the cell, which bind to enhancers to contribute to the spatial and temporal dynamics of gene expression. In prokaryotes, enhancers typically have a very small number of binding sites and the protein-DNA and protein-protein interactions involved in gene expression are often fairly well understood. In eukaryotes, however, many of the rules behind gene regulation are still almost entirely unknown due to the complexity of eukaryotic gene regulatory systems, involving the concerted occupancy of many trans-factors; including multiple TFs, cofactors, and nucleosomes on enhancers. One would like to understand not only the basic properties of protein-binding effects involved in prokaryotic gene regulation (i.e., the quantitative change in mRNA levels as one protein blocks or enhances the activity of another), but also gain the ability to model some of the cis-acting elements and trans-acting factors which contribute to the complex nature of eukaryotic regulation. 2 A large number of studies have presented evidence of one aspect of eukaryotic regulation or another within a given system, but we lack a general unifying concept to explain how DNA sequence can give rise to the complicated spatio-temporal expression patterns of so many genes. In this thesis I describe my research which has focused on gene expression during the early blastoderm stage of development in the eukaryotic organism Drosophila melanogaster. Drosophila, more commonly known as the “fruit fly,” offers particular advantages for studying gene regulation. As metazoans, their transcriptional machinery closely resembles that of all other animal species, including humans. Flies have very short life spans and large birth rates, making experiments much more feasible in terms of preparation time as well as amount of data acquired for quantitive analyses. Powerful genetic tools make them a natural choice for studies involving transgenes or genetic crosses due to the large amount of high quality quantitative data that can be obtained. Using this system, biologists can learn a great deal of information about gene expression, providing a basis for hypotheses about disease, development, and evolution in many other eukaryotic species. Experimentalists use diverse techniques to gather different pieces of information regarding transcriptional regulation. Some studies identify the developmental patterns linked to a particular gene’s expression, some determine what two proteins are physically interacting to control a single gene or group of genes’ expression, and others classify the environmental conditions under which gene expression patterns are altered. Although each study provides us with only a subset of information on gene regulation, together they have greatly increased our knowledge of when, where, and how certain genes are expressed. A common methodology that has been employed for many years involves the discovery of direct physical and functional characteristics of cis-regulatory elements. Proteins binding to these DNA elements can 3 be assayed by DNAse1 footprinting. Functional studies involve reporter constructs linked to small fragments of genomic DNA (‘promoter bashing’), to determine minimal enhancer regions and the functional effects of TFs binding to specific sites [95, 67, 46]. More recently, the same idea of promoter bashing was used with more sophisticated biochemical techniques involving large genomic segments and environmental perturbations to learn more about the robust nature of a single gene with multiple enhancer regions [71]. These studies typically answer questions about specific genes using only experimentation in a lab. Other studies, addressing wider questions of gene circuitry begin in the lab and then further test hypotheses using statistical or computational tools. One general approach has involved synthetic reporter constructs to learn more about protein-protein interactions and short-range enhancer ‘grammar,’ referring to the specific number and arrangement of binding sites needed on an enhancer to drive proper expression [93, 4, 50, 26]. On a much larger scale, recent studies involve acquiring protein binding profiles by whole-genome surveys aimed at discovering the locations of CRMs and groups of genes regulated by the same set of TFs, or surveys on groups of genes regulated in the same time period or by the same set of TFs aimed at understanding the dynamics of binding events, building gene regulatory networks, and gaining insight on possible protein interactions [101, 80, 104, 37, 12, 65, 66, 56]. These studies typically use microarray, referred to as chip, chromatin immunoprecipitation (ChIP) - chip, or ChIP - sequencing (seq) experiments, although some have combined these binding profiles with expression measurements to learn more about spatio-temporal expression patterns [104]. When experiments answer qualitative questions, such as when and where certain genes are expressed, or give a picture of how one node fits into a large gene regulatory network, the door has been opened for further inquiry. This can be done in the form of modeling or statistics. 4 For genome-wide studies, one often goes through multiple levels of statistical analysis to tease out the important interactions between genes and proteins. For example, ChIP-chip and ChIP-seq studies have compared binding profiles using Pearson Correlation Coefficient to identify possible combinatorial relationships between TFs [37]. For studies aimed at explicitly describing the physical interactions taking place, modelers often take the system of interest and derive deterministic or stochastic models based on the new assumptions that the experiment has provided them. For example, studies involving a cascade of gene expression can begin with a set of genes that have been found to bind the same set of TFs or have correlated expression profiles and through differential equation modeling, explicitly incorporate parameters representing the level of positive and negative effects one gene has on another [43]. Each of these approaches aims at extracting as much information as possible from the experimental data and mathematically analyzing it to provide a direction for future experimentation. In formulating an approach to understand how a biological process works, scientists develop models, a term which carries different connotations to researchers in different fields. Biologists may use a model to provide a non-quantitative, simple cartoon of the processes taking place. Mathematicians, statisticians, and quantitative biologists use the term to refer to an equation or system of equations that precisely describes the physical aspects of the system. Clear communications between biologists and mathematicians is necessary for effective collaborations, however. The two fields use very different vocabulary, require different coursework and training, and typically have distinct research goals in mind. I believe it is essential to first address the question of why mathematical models are needed in biology. To answer 5 this, we must first understand the power behind such models. Experiments carried out in a wet lab often involve the development of a hypothesis and the development of specific experiments to test this hypothesis. Although simple to describe, the hypothesis development and execution of experiments is not always so obvious, and the experiments that need to be run are often quite time consuming. Mathematical modeling has been used to effectively target experimental biological studies. For example, in the area of evolutionary biology, in silico stochastic models can simulate evolution over thousands of generations in a single day, while studying evolution of most species in a wet lab, if possible at all, can be a career-long project [19, 99]. Another example of an area where mathematical modeling has been crucial is in enhancer discover. Enhancer discovery refers to identifying regions of DNA previously not known to contribute to the regulation of one or more genes. Statistical models applied by bioinformaticians can use known DNA-protein binding site sequences and search the entire genome for clusters of binding sites [11, 48]. These clusters are known as “predicted enhancer regions” that can be tested out in a lab. Without this type of model and computation, the only way for biologists to discover enhancer regions is to run multiple footprinting or ChIP experiments, which are expensive and time consuming. Quantitative predictive models can identify possible clusters of binding sites to test for transcriptional activity. In the area of transcriptional regulation, the same issues of guiding experimental design are important. System-wide insights are key to understanding the impacts of enhancer function on development, disease, population variation, and the evolution of new traits. Unfortunately, little is known about many of the proteins involved in regulation and their interactions, making the design of experiments very challenging. Mathematical models, how- 6 ever, are capable of using available enhancer information and quantitative expression data to fit parameter values, such as activator efficiency or protein cooperativity, and predict additional enhancers’ expression, giving researchers a better idea of what proteins and enhancers should be further investigated experimentally. This could then lead to the discovery of mechanisms involved in transcriptional events essential to proper development, the design of new drug treatments targeting specific diseases, and a better understanding of the selective pressure put on DNA-sequences during evolution. Gene regulation is a complex process involving control mechanisms at the levels of transcription, RNA processing, transport and translation. In the field of gene regulation, three different mathematical models have been applied: Boolean models, thermodynamic-based (also termed “Fractional Occupancy”) models, and differential equation models [8]. These models all have inherent strengths and weaknesses involving the amount of input data, the level of complexity, and the time-frame in which they can model gene expression. Boolean models are graph based probabilistic models used to find regulatory interactions and gene networks when looking at genome-wide data. They have the benefit of being able to use large amounts of available gene array data to identify specific gene interaction networks, but the major drawback of their inability to explain complex interactions between the individual components of the transcriptional machinery (ie: TF and polymerase interactions) [79, 100, 3]. Thermodynamic-based models, on the other hand, can explain a great deal of the intricate machinery at the DNA-protein binding level. They utilize information on DNA sequence, protein binding specificity, and TF concentrations using the laws of biophysical thermodynamics, and often include terms for specific TF interactions and distance-dependent DNA 7 binding events. Although many researchers believe that as more data becomes available these fractional occupancy models will someday have the ability to decipher the cis-regulatory grammar behind enhancer architecture, the amount of data necessary to test current hypotheses is still lacking [1, 15, 105, 45, 85, 30, 26, 38]. I will discuss my contributions to this field in Chapter 7. Both Boolean and thermodynamic-based models are static in nature. They may use information from slightly varying time frames, but their outputs are snapshots of a system. Whether they are adding insight into which gene products effect other genes or whether they are predicting expression, these conclusions are made for specific time points or over some arbitrary time scale. Differential equations models have allowed researchers to model transcriptional regulation over different time intervals to study the behavior of subtly evolving gene or protein network structure. These models can simulate the turning on and off of genes over time, as well as intercellular diffusion and decay of proteins and mRNAs. This dynamic movie of gene expression can be extremely useful in developing visual representations of small networks of genes. The primary drawback to these models is that, when applied to larger data sets or larger gene networks, the number of parameters rapidly increases and fitting these parameters can be computationally infeasible. Also, in past applications, the intricate details of transcriptional regulation, such as specific protein binding sites and affinities that thermodynamic-based models offer, were not included [33, 10, 62, 81, 97, 43, 32, 20, 69, 13]. In eukaryotic systems, models have taken into account TFs’ positive or negative contributions to gene expression, but none have included the DNA sequence information allowing a dynamic quantitative model of gene expression to incorporate terms for binding specificity 8 and TF interactions [43, 69, 13]. This gap in our understanding is addressed in studies I describe in Chapters 3 and 6. In addition to choosing a suitable model for their particular system, modelers are also faced with four other important considerations: how to process their data before using it for model fitting [92, 9], which parameter estimation technique should be used [6, 47, 91], how to analyze model performance regarding the limits of the model or sensitivity of the model to parameter perturbations [6, 28, 24], and how much information the model can provide biologists [26, 24, 91]. These four important issues, and their relationships to the modeling approaches used in our group are the main focus of this thesis. First, I focus on the most popular static model of transcriptional regulation for incorporating DNA level information, thermodynamic-based (or fractional occupancy) modeling. The general equation for calculating gene expression from the protein occupancy profile of the enhancer is given by: Φ (X) expression level ∝ 1 + Ψ (X) where X ∈ Rn represents TF concentrations and Φ, Ψ : Rn → R are polynomial functions representing the contributions from all active non-empty states of the enhancer and all possible non-empty states of the enhancer respectively. In Chapter 2, I first derive a simple thermodynamic-based model and discuss how it can be analyzed to produce biological insights on the limitations of DNA-sequence and TF contributions to gene expression. I then show examples of how more complicated biochemical mechanisms, such as protein-protein cooperativity, can be incorporated into the model to allow for a more realistic picture of the system. After a thorough introduction to these static models, I build the first dynamic model of transcriptional regulation in eukaryotes to take into account specific DNA level 9 information on the location and strength of TF binding sites in Chapter 3. The basis for this model will be the general reaction diffusion equation: dY dt = R (Y ) + D∆Y for Y ∈ Rm and D a diagonal matrix containing diffusion constants; with the reaction term, R (Y ), incorporating known enhancer information. Since this modeling approach involves a system of differential equations, I devise a numerical solver to use for model simulations and analyze the numerical solutions before showing model results on biological data. As illustrated in this chapter, the model is capable of capturing the intricate dynamics and key features of a dynamically changing system involved in early eukaryotic development (system depicted in Figure 1.2). In Chapters 4 and 5, I investigate the parameter values estimated in the case of static thermodynamic-based models, including sensitivity analysis of these parameters as well as choosing the proper parameter estimation technique to decrease the probability of the parameter set getting stuck in a local minimum of the objective function used during fitting. In Appendix A, I explain how data collected in the lab is processed for use in quantitative studies of transcriptional regulation. The main goal of my research has been the development of a mathematical model of gene regulation capable of capturing the physical properties of the system at the DNA level, as well as the dynamic properties of protein and mRNA concentrations in a developing organism. Transcriptional regulation involves many different factors, including evolving DNA sequences (i.e., binding site positioning and strength changing with the evolution of new species), regulatory proteins, chromatin states, nucleosome positions, and cofactors, work- 10 Figure 1.2: Dynamic Transcriptional Network. On the left of the figure, each node corresponds to a TF or gene of involved in the network. Dorsal acts as the input to the network at the end of nuclear cycle 13, activating (depicted by an arrow) all other genes in the network. As Twist is expressed, it in turn activates snail and rhomboid. Snail then represses (depicted by a blunt arrow) rhomboid. On the right, the corresponding final expression patterns, in late nuclear cycle 14, are shown. 11 ing together in perfect harmony to produce precise spatio-temporal expression patterns. A thorough understanding of a system so complex is not something a single experiment or modeling study can provide. Improvements to our understanding have been made with the help of mathematical modeling, such as the thermodynamic-based model in [26] which was very successful in predicting the expression of synthetic enhancers with fixed activator binding sites and providing important insights into short-range repression. My work has added tremendously to this understanding by transforming a static model that gave biologists a snapshot prediction of gene expression at a single point in time into a dynamic model incorporating the same amount of detail at the DNA level which predicts gene expression over an important time period in development. This DNA level information was introduced into a reaction-diffusion model by the inclusion of a thermodynamic-based term. This directed my work toward understanding model parameter sensitivity, parameter estimation strategies, and thermodynamic-based model formulations including more complex protein-protein interactions. These additional studies have shed light on some common misconceptions when analyzing model parameters and the cause of such misconceptions, as well as the important factors to consider when choosing an appropriate, efficient parameter estimation strategy. However, I believe the largest impact my work has had on the biological and mathematical communities has been in developing a novel approach to incorporate both cis-regulatory and temporal information into a gene regulatory model, which is computationally feasible as well as predictive. 12 Chapter 2 Thermodynamic-based Modeling 2.1 Introduction and Description Thermodynamic-based models, or fractional occupancy models, aim to predict gene expression output from the DNA sequence of the gene’s regulatory region and the concentration of proteins involved in regulation (ie: TFs). The term “thermodynamic” itself is often misinterpreted; it only refers to the fact that the model has been derived with thermodynamic equilibrium assumptions. Historically, these calculations have been done using techniques derived from statistical physics which involve computing all possible states of TFs binding to a promoter and relating these states to gene expression [87, 75, 15, 45, 105, 85, 30, 26, 38]. During implementation, the model is derived in two separate parts. The first part is the true “thermodynamic” piece of the model, which is derived using chemical stoichiometric and equilibrium equations. It is assumed that RNA polymerase (RNAP), the driving force in gene transcription, is recruited by bound TFs. This requires the occupancy distribution of TFs on the DNA sequence to be computed using equilibrium constants (binding affinities) 13 and concentration levels. The second part to thermodynamic-based modeling requires some scientific guesswork. There are two common ways to translate binding states into gene expression. The first approach models transcriptional output as proportional to the probability of RNAP binding, which is related to the active, or successful, states of the enhancer [105, 26]. The second approach represents it with a nonlinear sigmoidal function in which activators have a positive and repressors a negative influence [85]. For a thermodynamic-based modeling study in a eukaryote, the choice of sigmoidal function is arbitrary, and reflects the lower and upper bounds observed in the transcription process. Again, this is guesswork, not chemically or physically derived. For this reason, my work has used the first approach. Due to the lack of data relating RNAP binding to the state of an enhancer, we have assumed that transcriptional output of an enhancer is directly proportional to the fraction of successful states. The following examples should serve as a guide to this modeling approach: Example 1: Consider the case of a piece of DNA (also referred to as a binding site or motif), S, which a protein (TF), TF, can bind. The chemical equation for TF binding to S is: KB −− [S] + [T F ] −− [S · T F ] KU where KB is the constant associated with the rate of binding and KU is the constant associated with the rate of unbinding. Note that although S represents a sequence of DNA, the concentration of S, [S], can be interpreted by considering the fact that there may be multiple copies of this sequence within the genome. 14 The rate equations associated with this chemical equation are: d[S] = KU [S · T F ] − KB [S][T F ] dt d[T F ] = KU [S · T F ] − KB [S][T F ] dt d[S · T F ] = KB [S][T F ] − KU [S · T F ]. dt Under the assumption that the system is in thermodynamic equilibrium, d[S] = KU [S · T F ] − KB [S][T F ] = 0. dt Thus, it follows that KB [S][T F ] = KU [S · T F ] . Let KT F = KB [S · T F ] . = KU [S][T F ] Due to matter conservation (stoichiometry), [S]total = [S] + [S · T F ]. 15 Therefore, it follows that P = [S · T F ] [S] + [S · T F ] = [S·T F ][T F ] [S][T F ] [S][T F ] [S·T F ][T F ] + [S][T F ] [S][T F ] = KT F [T F ] 1 + KT F [T F ] where KT F is referred to as the binding affinity (equilibrium constant) of TF and P is the fractional occupancy, or the time-averaged occupancy of the TF on the enhancer. Example 2: Consider the case of an enhancer, S, which contains two binding sites, one for a transcriptional activator, A, and one for a transcriptional repressor, R. Assume that the expression level of the gene (measured as the concentration of mRNA) is proportional to the fraction of successful states, where ‘successful’ is defined as having only the activator bound and not the repressor. The expression level is then calculated as [mRN A] ∝ PA (1 − PR ) = = KA [A] 1 + KA [A] 1 1 + KR [R] KA [A] . 1 + KA [A] + KR [R] + KA KR [A][R] Looking back at Example 2, one can start to see the high level of computation that must go into such a model as enhancers become more complicated, containing multiple 16 TF binding sites, due to the combinatorial nature of multiple TFs binding to a regulatory region. There are many important biological questions that have been posed in the case of more complicated thermodynamic-based models and their parameters (i.e., the effects of protein-protein cooperativity). However, we will begin by analyzing some mathematical properties of a simple thermodynamic-based model similar to the one depicted in Example 2. 2.2 Mathematical Analysis of a Simple Thermodynamic-based Model We begin with the following question: What happens to the expression function as the number of binding sites increases? To answer this, we will again start with the assumption that a successful state is any state in which at least one activator is bound and no repressors are bound. Note that we are not considering any protein-protein interactions (i.e., cooperative or antagonistic binding interactions between TFs). We are only considering the interactions taking place between DNA and proteins (TFs). There are many different types of proteins (TFs) involved in transcriptional regulation. Consider the general case of an enhancer containing N types of proteins. Assume there are β different types of activators, A1 , . . . , Aβ and for each Ai , there are ni identical binding sites. Similarly, assume there are γ different types of repressors, R1 , . . . , Rγ and for each Rj , 17 there are mj identical binding sites. Note that β ≥ 0, γ ≥ 0, and β + γ = N . Then, β 1 + KAi [Ai ] [mRN A] ∝ ni −1 i=1 β 1 + KAi [Ai ] i=1 ni . γ 1 + KRj [Rj ] mj j=1 Given 1 ≤ i ≤ β such that [Ai ] > 0: 1 lim [mRN A] = γ ni →∞ . mj 1 + KRj [Rj ] j=1 Similarly, given 1 ≤ j ≤ γ such that [Rj ] > 0: lim [mRN A] = 0. mj →∞ Lastly, given 1 ≤ i ≤ β such that [Ai ] > 0 and 1 ≤ j ≤ γ such that [Rj ] > 0: lim [mRN A] = ni →∞ mj →∞ 1 lim mj →∞ γ 1 + KRj [Rj ] mj j=1 1 − lim ni →∞ β mj →∞ 1 + KAi [Ai ] i=1 ni γ 1 + KRj [Rj ] mj j=1 = 0. Therefore, the answer to this first question can be stated: as the number of activator sites is increased, the expression function asymptotically approaches a function of the number 18 of repressor sites present and as the number of repressor sites is increased, the expression function asymptotically approaches zero, regardless of the number of activator sites present. Next, we consider the question: Holding the number of activator sites constant, can we determine how many repressor sites are needed for the reduction in expression to equal p? First note that under the assumption that β ni 1 + KAi [Ai ] [mRN A] ∝ −1 i=1 β 1 + KAi [Ai ] γ ni i=1 1 + KRj [Rj ] mj j=1 we define the reduction in expression due to repression as 1 . γ 1 + KRj [Rj ] mj j=1 Assume that ni is fixed for all 1 ≤ i ≤ β and β is fixed. Fix 0 < p ≤ 1 and consider 1 = p. γ 1 + KRj [Rj ] mj j=1 Observe that γ 1 + KRj [Rj ] j=1 19 mj 1 = . p Thus, γ mj ln 1 + KRj [Rj ] = − ln (p) . j=1 Now, let max mj ln 1 + KRj [Rj ] = mM ln 1 + KR [RM ] . M 1≤j≤γ Then, since mj ln 1 + KRj [Rj ] ≥ 0 for all 1 ≤ j ≤ γ, it follows that 1 − ln (p) ≤ mM ln 1 + KR [RM ] ≤ − ln (p) . M γ Therefore, − ln (p) γ ln 1 + KR [RM ] ≤ mM ≤ M − ln (p) . ln 1 + KR [RM ] M Since this is a biological problem with a physical interpretation of KRj [Rj ], we can assume that there exist constants c1 and c2 such that for all 1 ≤ j ≤ γ, c1 ≤ KRj [Rj ] ≤ c2 . These bounds are independent of j, resulting in an upper and lower bound on the maximum number of repressor binding sites, mM : − ln (p) − ln (p) ≤ mM ≤ . γ ln (1 + c2 ) ln (1 + c1 ) For biological studies, such as [15, 105, 45, 26], the main focus is on pattern formation caused by the repression of activators by repressor proteins. In modeling studies involving experimental data, it is commonly assumed that the level of fluorescence observed in images from embryos stained for a protein such as R, is proportional to the actual concentration of R [76, 14, 15, 105, 45, 26]. Then, what is approximated and sometimes referred to as the scaling factor, KR , actually corresponds to the product of the binding affinity and the 20 value which relates the level of fluorescence to the concentration of protein. Thus, modeling studies often have strict bounds on what they refer to as [R] due to physical measurement constraints. Most standard procedures for data processing normalize values between 0 and 1, although various ranges have been used [15, 45, 105, 26]. Since different normalization procedures are used, the units for KR and [R] vary, but KR [R] is measured in µg/ml, a standard unit for protein concentration. In previous studies, the following K values have been found. • Bintu et al. [15]: 0.22 ≤ K ≤ 110 Concentrations were between 0.001 and 1000. Thus, 0.00022 ≤ K[R] ≤ 110000. • Janssens et al. [45]: 0.003 ≤ K ≤ 1.0 Concentrations were between 0 and 255. In particular for the repressor Giant, whenever [Gt] > 0, 1 ≤ [Gt] ≤ 150. Thus, 0.003 ≤ K[R] ≤ 150. • Zinzen et al. [105]: 108.25 ≤ K ≤ 108.75 Concentrations were between 0 and 1. Here we consider the bounds 0.1 ≤ [R] ≤ 1.0. Thus, 107.25 ≤ K[R] ≤ 108.75 . • Fakhouri et al. [26]: 12 ≤ K ≤ 30 Concentrations were between 0 and 1. Here we consider the bounds 0.1 ≤ [R] ≤ 1.0. Thus, 0.12 ≤ K[R] ≤ 30. Observe that there is a large amount of variation in these bounds. For many studies such as these, one must be careful when interpreting K values or when applying them to our previous estimates since values were often found using experimental data to fit models that simultaneously searches for parameters corresponding to protein-protein interactions such 21 Figure 2.1: Figure 1C from [26]. A Drosophila embryo with Giant repressor (red stripes) and Dorsal activator (green) staining is shown. Each embryo provides a diversity of potential inputs to the regulatory element: the white arrow points to a region where activator levels are high and repressor levels are low. The black arrow points to a region where both activator and repressor levels are low. The white triangle points to a region where activator and repressor levels are both high, and the black triangle points to a region where repressor levels are high and activator levels are low. as cooperativity and/or quenching terms. Focusing on one particular example, we consider the Fakhouri et al. data set [26]. To illustrate the spatial expression patterns analyzed, refer to Figure 2.1. Using a subset of the expression data from this study containing only the expression obtained from enhancers with Giant binding sites at a fixed distance from activator sites, we fit a value for KR corresponding to the Giant repressor protein. The value found using a nonlinear least squares approximation was 0.7869. The expression data used for this fitting was processed and normalized, resulting in 0.0 ≤ [R] ≤ 1.0 and was measured from embryos with enhancers containing only one type of repressor binding site, Giant; hence γ = 1. Since the enhancers only contained Giant binding sites for repression, a reduction in expression is only observed when [R] > 0. Since [R] must be strictly greater than 0, here we consider the case with 0.1 ≤ [R] ≤ 1.0. It follows from our estimate of KR that 0.0787 ≤ KR [R] ≤ 0.7869. Therefore, the actual bounds on the maximum number of binding sites, mM , assuming a reduction in expression 22 of p, are − ln (p) − ln (p) ≤ mM ≤ . ln (1.7869) ln (1.0787) In the context of this enhancer, in order to obtain an expression that is reduced by 50% (p = 0.5), the enhancer must contain between 2 and 9 repressor binding sites. To obtain an expression that is reduced by 10% (p = 0.9), the enhancer must contain between 1 and 2 repressor binding sites. To obtain an expression that is reduced by 90% (p = 0.1), the enhancer must contain between 4 and 30 repressor binding sites. Although these bounds are not as tight as an experimentalist would hope for, they are entirely dependent on the region of the embryo, since KR is constant and once p is chosen the bounds depend only on [R]. We reduce to considering only regions of the embryo where maximum Giant concentration exists, those in which expression is reduced with the highest efficiency. Thus, we can consider [R] ≈ 1.0, which results in a strict lower bound on the number of binding sites necessary to obtain a reduction in expression equal to p. In the context of this enhancer, in order to obtain an expression that is reduced by 50% (p = 0.5) in regions where maximum Giant concentration exists, mM = 1.1941, hence to get a reduction of at least 50%, the enhancer must contain two repressor binding sites. To obtain an expression that is reduced by 10% (p = 0.9), mM = 0.1815, hence to get a reduction of at least 10%, the enhancer must contain one repressor binding site. To obtain an expression that is reduced by 90% (p = 0.1), mM = 3.9667, hence to get a reduction of at least 90%, the enhancer must contain four repressor binding sites. For constructs that were tested (and used to fit the K value), with a fixed number of known repressor binding sites, we can also make predictions on the reduction in expression. The qualitative predictions given in the legends of Figures 2.2-2.4 were made using the 23 Figure 2.2: 1g 2t2d: In regions where Giant protein is present at maximum concentrations, we predict a 44% reduction in expression. Figure 2.3: 2g 2t2d: In regions where Giant protein is present at maximum concentrations, we predict a 69% reduction in expression. bounds: 0.0787 ≤ KR [R] ≤ 0.7869. To describe the structure of each enhancer tested, a string of letters and numbers is used, where each number represents the number of binding sites for a particular protein and each letter (‘g’, ‘t’, and ‘d’) corresponds to the type of protein (Giant, Twist, and Dorsal respectively). The images shown in Figures 2.2-2.4 are representative embryos from the actual data set of Fakhouri et al. [26]. To assess the reduction in expression observed from a more quantitative perspective, we calculated the ratio of average expression in “reduced regions” (highest repressor concentrations) to the average expression in a region of highest expression for those embryos contained 24 Figure 2.4: 3g 2t2d: In regions where Giant protein is present at maximum concentrations, we predict an 82% reduction in expression. in the high quality data set of Fakhouri et al.[26] and obtained the following results: 1gt 2t2d: 52% expression = reduction of 48% 2gt 2t2d: 37% expression = reduction of 63% 3gt 2t2d: 16% expression = reduction of 84% From these results, one can conclude that using even the simplest derivation of a thermodynamic-model can provide very powerful predictions in terms of the reduction in expression levels caused by multiple repressor binding sites. 25 2.3 Incorporating More Information into Thermodynamic-based Models 2.3.1 Protein-protein Interactions When using a thermodynamic-based model in practice, each state is not solely based on the binding affinities and concentrations of the proteins bound to the enhancer, but also carries with it a statistical weight which can hold biological meaning within the system. In addition, successful states can include those with repressors bound, simply causing a reduction in the weight of the state. A state’s weight can represent biological phenomena such as the activation efficiency of a particular protein, the cooperativity of multiple bound proteins, the quenching (or repressive) efficiency of a bound repressor on a bound activator, etc. Examples with cooperativity and quenching added to the weight of such states follow. Example 1: Consider the case of an enhancer, S, which contains two binding sites, each for a transcriptional activator, A1 and A2. Assume that the expression level of the gene (measured as the concentration of mRNA) is proportional to the fraction of successful states, where ‘successful’ is defined as having at least one activator bound and that these two TFs are cooperating. The expression level is then calculated as [mRN A] ∝ KA1 [A1] + KA2 [A2] + CKA1 KA2 [A1][A2] 1 + KA1 [A1] + KA2 [A2] + CKA1 KA2 [A1][A2] where C represents the cooperative interaction between the TFs when they bind simultaneously. 26 Example 2: Consider the case of an enhancer, S, which contains two binding sites, one for a transcriptional activator, A, and one for a transcriptional repressor, R. Assume that the expression level of the gene (measured as the total mRNA) is proportional to the fraction of successful states, where ‘successful’ is defined as having the activator bound with the ability to communicate with the transcriptional machinery and that when the repressor is bound, it has some ability to quench the activator, or decrease its ability to communicate with the transcriptional machinery. Note that one can depict the state when both TFs are bound as being divided into two substates, one successful when A has the ability to activate the gene and one unsuccessful when R is blocking that ability. The expression level is then calculated as [mRN A] ∝ KA [A] + (1 − Q) KA KR [A][R] 1 + KA [A] + KR [R] + KA KR [A][A][R] where Q represents the quenching ability of the bound repressor, R, on A when both proteins are bound (Q = 1 when R is quenching at 100% efficiency). It is easy to understand biological examples when such weights would be important. If a certain activator communicates strongly with the basal machinery of transcription, giving it a high activation efficiency, the gene expression should increase when this particular protein is bound. Thus a larger weight is given to this state. On the other hand, if a repressor is bound, a smaller weight is given to the state since the activator may be “quenched” by the repressor and with some probability lose the ability to communicate with the basal machinery. These weights, which hold such high biological relevance and have the ability to provide hypotheses on the mechanisms involved in transcriptional regulation, have been the 27 focus of my ongoing work (discussed in Chapter 6, section 6.1) with thermodynamic-based models of transcriptional regulation in the early Drosophila embryo. The two biological phenomena that have drawn the most attention in this field have been cooperativity between pairs of TFs and quenching or repressive activity of repressors on activators. Some investigation of these weights has been carried out previously, but only very simple functions have been considered up until now [75, 45, 105, 26]. In 2006, a model was tested on endogenous gene expression data in the early Drosophila embryo which incorporated terms for protein cooperativity, although one constant cooperativity value was given to each pair of proteins [105]. In 2010, a similar model was tested on synthetic gene expression data in the early Drosophila embryo which incorporated terms for protein cooperativity, where a discrete (very small) number of different cooperativity values were used to incorporate distance(in base pairs between binding sites)-dependent cooperativity [26]. It is still unknown what the distance-dependent cooperativity function is for any given pair of proteins, although hypotheses can be made based upon what is known about the actual size of proteins, the helical shape of DNA, and the nucleosome positioning on DNA. With regard to the quenching of short-range repressors on bound activator proteins, in 2003 a model was proposed, although not implemented, which incorporated a quenching term into thermodynamic-based models which was distance-dependent [75]. In this publication, the quenching function proposed was a piecewise-defined monotonically decreasing function of distance. The function was constructed by taking two constant functions (1 and 0) and connecting them with a linear function. This was based on the notion that repressors have a given range over which they can repress bound activators. They quench with full (1) 28 efficiency when activator sites are very close and do not quench at all (0) when they are out of range. In between these two distances, a linear interpolation was done, assuming that repressors can vary in their efficiency at intermediate distances [75]. Using this same idea, in 2006 two different studies were published using models with either a simple monotonic step-function or a linear function for distance-dependent quenching [45, 105]. In 2010, the Fakhouri et al.study also incorporated quenching into their model. Since their study was conducted on synthetic enhancers in which the activity of short-range repressors were thoroughly investigated by creating enhancers with varying distances between repressor binding sites and activator binding sites, they were able to handle distance-dependent quenching with a bit more sophistication. In the Fakhouri et al. study, a step-function representing distance-dependent quenching was fit. Different step-functions were investigated by altering the binning strategy used (the intervals on which each step was defined) and fitting each function to the experimental data. One of the most important points of this publication was the surprising finding that the best fit quenching function was non-monotonic, supporting the hypothesis that repressors have a preferred distance due to spatial constraints [26]. 2.3.2 Overlapping Binding Sites During implementation of thermodynamic-based models, another important consideration is how to calculate the occupancy distribution of TFs on an enhancer that contains overlapping binding sites, those that have the ability to bind two different proteins. This is a common enhancer characteristic due to similarities in protein binding preferences. For example, TFs involved in early Drosophila development, such as Bicoid and Kruppel as well as Twist and Snail, have been shown to bind the same region of DNA in multiple enhancers [88, 42]. 29 Thermodynamic-based models handle such binding sites by making the assumption that, due to spatial constraints, TF binding sites may be occupied by only one protein at a time. This affects the total number of possible states since any overlapping binding sites have three possible states compared to binding sites for a single protein which have only two possible states. An example of the occupancy distribution of TFs on an enhancer that contains an overlapping binding site follows. Example: Consider the case of an enhancer, S, which contains two binding sites, one for a transcriptional activator, A1, and one overlapping binding site which can bind a transcriptional activator, A2, or a repressor, R. Assume that the expression level of the gene (measured as the concentration of mRNA) is proportional to the fraction of successful states. We will assume that the activators, A1 and A2, can cooperate, with C representing the cooperative interaction between the TFs when they bind simultaneously, and that R can repress A1, with Q representing the quenching ability of R on A1 when both proteins are bound. The expression level is then calculated as [mRN A] ∝ KA1 [A1] + KA2 [A2] + CKA1 KA2 [A1][A2] + (1 − Q) KA1 KR [A1][R] . 1 + KA1 [A1] + KA2 [A2] + KR [R] + CKA1 KA2 [A1][A2] + KA1 KR [A1][R] Note that there are only six possible states of the enhancer (six terms in the denominator) since A2 and R cannot bind simultaneously. This treatment of overlapping binding sites is used in the thermodynamic-based modeling discussed in Chapter 6. 30 Chapter 3 Dynamic Modeling: Incorporating Both cis-regulatory and Temporal Information 3.1 Introduction and Problem Description As mentioned in Chapter 1, modeling of eukaryotic gene regulation has focused on either time-dependent interactions of gene networks [43, 69, 13] or equilibrium approaches that can explore the cis-regulatory grammar of transcriptional enhancers [85, 45, 105, 38, 26]. These models rely on systems of differential equations or thermodynamic-based descriptions, which can be used either to understand the dynamics of a system, or DNA level regulatory processes. For many real-world settings, both aspects need to be considered in order to simulate the dynamic changes in gene expression taking place early in development. To combine the strengths of each of these approaches, we have developed a novel model 31 that joins these methods to provide a dynamical description of gene regulatory systems, using detailed DNA-based information, as well as spatial TF concentration data at varying time points. Our ‘two-layer’ modeling approach uses a thermodynamic-based model as the synthesis term in our differential equation, representing the rate of mRNA production. The differential equation incorporates all other terms, representing decay and diffusion of mRNA; thus the model does not lose its effectiveness in predicting emerging spatial expression patterns over time. Simple dual layer models have been applied and proven successful in prokaryotic systems, such as that of the lac operon in E.coli [81]. The difficulty of applying such models to eukaryotic systems is the increased complexity of cis-regulatory modules in addition to the difficulty of collecting high quality data sets. With advances in high-throughput genome sequencing and transcriptome analysis, we have gained crucial insights and data sets that make such a model possible in a complicated eukaryotic system. Here, I describe the steps taken in building a two-layer model of eukaryotic transcriptional regulation and its numerical implementation, as well as point out some of the advantages to using this approach. We begin constructing a model by considering the general Reaction-Diffusion equation: dY dt = R (Y ) + D∆Y for Y ∈ Rm , where D is a diagonal matrix containing diffusion constants. Note that this is a PDE due to the dependence on both space and time. This equation is a good starting point for a realistic model of gene regulation during development. It can incorporate mRNA and protein concentrations, their spatial diffusion, decay over time, and the intricate relationships 32 that these concentrations can have on each other. All of these features are important to consider for any time in development, but are especially important when modeling the sharp dynamics of early development. Cell-to-cell diffusion is known to take place and decay rates are important when trying to accurately model temporal dynamics of any organic substance. mRNA production, the process of transcription, is heavily regulated by protein (TF) binding, and protein production, the process of translation, is dependent on mRNA production. So, starting with an equation that allows for diffusion, decay, and an easily modified synthesis term which can incorporate natural feedforward and feedback loops is essential. Since gene expression patterns in early Drosophila development are typically categorized as either Dorsal-Ventral patterns or Anterior-Posterior patterns, the data used in most modeling studies is obtained from a single strip of discrete nuclei. Thus, it is intuitive to begin by discretizing the diffusion term. We approximate diffusion using a finite difference method, allowing us to transform the above PDE into the following ODE: dya,i dt = Sa,i (Y ) + Da ya,i+1 − 2ya,i + ya,i−1 − λa ya,i (3.1) with the zero flux boundary conditions: ya,0 = ya,1 ya,n+1 = ya,n where a represents the specific mRNA or protein that the equation corresponds to, and Da and λa are the corresponding diffusion and decay rates. We will assume that we are 33 modeling a given set of mRNAs, M R, and a given set of proteins, P R. So, for all a ∈ M R, a represents a specific type of mRNA and for all a ∈ P R, a represents a specific type of protein. i represents the nucleus (spatial) location. The most important term in equation (3.1) is the synthesis term, Sa (Y ). This term holds all of the DNA level information about transcription when a ∈ M R and models translation when a ∈ P R. It is also the term that controls the overall system behavior, hence effecting the convergence and stability criteria for any numerical solver implemented. First consider when a ∈ M R. For these terms, Sa (Y ) represents the scaled output from a thermodynamic-based model of the enhancer for gene a at each nucleus, i. Recall from Chapter 2 that this function models the TF-DNA binding and TF-TF interactions regulating gene a’s expression level in a given cell. Thus, this term in the ODE is a rational function of the protein concentrations in nucleus i, other elements of the vector Y . For ease of notation, recall that this function always takes the form: Φa,i (Y ) Sa,i (Y ) = χa 1 + Ψa,i (Y ) where χa is the scaling constant associated with gene a, N Φa,i (Y ) = j=1 and N Ψa,i (Y ) = j=1   N   vk Kρ(k) ΓΦa (k, v) Yρ(k),i     v∈Vj k=1      v∈Vj   vk Kρ(k) ΓΨa (k, v) Yρ(k),i  N k=1 34 for N denoting the total number of binding sites on enhancer a, Vj =   N v ∈ NN | vk ∈ {0, 1} ∀ 1 ≤ k ≤ N and  vk = j k=1   ,  where each v represents a state of the enhancer with all proteins in positions with vk = 1 being bound and all others unbound, ρ (k) denotes the protein p ∈ P R corresponding to binding site k, ΓΦa (k, v) equals 0 when v represents a state with no activators bound and equals the cooperative or repressive parameter associated with the TF at binding site k and the next bound protein in the state v otherwise, and ΓΨa (k, v) equals the cooperative parameter associated with the TF at binding site k and the next bound protein in the state v. Note that if vr = 0 for all r > k then ΓΦa (k, v) = ΓΨa (k, v) = 1 and if there is no cooperative parameter associated with the TF at binding site k and the next bound protein in the state v, ΓΨa (k, v) = 1. For each gene a and nucleus i, Sa,i : Rm → R. The function Sa represents the vector of these functions for the gene a over all nuclei considered, so Sa : Rm → Rµ , for µ ≤ m. Note that these functions are predetermined based on the DNA-sequence and parameter values. These parameter values include the transcription rate scaling constant, χa , binding affinities, Kp , cooperativity values, Cpq , and quenching parameters, Qpq , for p, q ∈ P R. During implementation, these are all fixed values throughout the simulation. Regardless of the number of different simulations run, each parameter value is bounded above and below. Also note that 0 ≤ Qpq ≤ 1 for all p, q ∈ P R. Thus, Ψa,i (Y ) ≥ Φa,i (Y ) for all Y . Next consider when a ∈ P R. For these terms, Sa (Y ) represents translation. This process is believed to be much better understood. We operate under the simplified assumption that newly translated proteins are produced at a rate proportional to the concentration of 35 the corresponding mRNA in nucleus i. Thus, this term in the ODE is the product of the translation parameter, τ , and the mRNA concentration for the gene corresponding to protein a: Sa (Y ) = τ Yζ(a) where ζ (a) denotes the mRNA b ∈ M R for the gene corresponding to protein a. Again during implementation, the value of the translation parameter is fixed throughout the simulation and regardless of the number of different simulations run, this value is bounded above and below. When solving such an ODE system with initial conditions given, we must consider the choice of solver. 3.2 3.2.1 Development of a semi-implicit numerical method Construction We begin with equation (3.1) for a given mRNA or protein, a, at nucleus i. We will handle the diffusion and decay terms implicitly and the synthesis term explicitly. This treatment allows us to exploit the simple structure created by discretizing diffusion. Approximating the left-hand side with a semi-implicit finite difference gives us: n+1 n ya,i − ya,i h n+1 n+1 n+1 n+1 = Sa,i (Y n ) + Da ya,i+1 − 2ya,i + ya,i−1 − λa ya,i 36 where h is the size of each time step and superscripts, n and n + 1, denote the solution at that given time step. This can be rewritten as: n+1 n+1 n+1 n −hDya,i−1 + ya,i (1 + 2hDa + hλa ) − hDa ya,i+1 = ya,i + hSa,i (Y n ) . Now, considering all nuclei in the one-dimensional strip of interest and the zero flux boundary conditions, for a given mRNA or protein, a, we get the following equation: n+1 = Y n + hS (Y n ) (I + hMa ) Ya a a where I is the identity matrix and Ma is the following matrix:    2D + λ , if i = j  a  a     Ma (i, j) = −Da , if i = j + 1 or i = j − 1        0,  otherwise. Note that Ma is tridiagonal, diagonally dominant, symmetric and positive semidefinite. Thus, the existence of the inverse of I + hMa is guaranteed and the algorithm to compute it is computational very inexpensive. Hence, the solver is defined by the following iteration: n+1 = (I + hM )−1 (Y n + hS (Y n )) . Ya a a a 37 (3.2) 3.2.2 Stability Lemma 3.2.1. Given a set of fixed parameter values and a fixed a ∈ M R ∪ P R, there exists a constant, ca , such that for all n < ∞, ||Sa (Y n )||2 ≤ ca ||Y n ||2 . Proof. Observe that ||Sa (Y n )||2 ≤ √ µ ||Sa (Y n )||∞ since Sa (Y n ) ∈ Rµ and µ ≤ m < ∞ [72]. Note that the case when ||Y n ||∞ ≥ 1 is trivial since Sa,i (Y ) is bounded above by maxa∈M R (χa ) when a ∈ M R, and Sa,i (Y ) is the product of τ ≥ 0 and the Yk corresponding to the concentration of protein a when a ∈ P R. Thus, ||Sa (Y n )||2 ≤ √ µ max max χa , τ a∈M R ||Y n ||2 . Case 1: Assume||Y n ||∞ < 1 and a ∈ M R. First note that 1 + Ψa,i (Y ) ≥ 1, thus Sa,i (Y n ) ≤ χa Φa,i (Y n ) N ≤ χa j=1 N j j max Kp p∈P R j−1 max p,q∈P R 38 Cpq , 1 − Qpq n max Yk k j . Since ||Y n ||∞ < 1 and 1 ≤ j ≤ N , it follows that j n max Yk k = ||Y n ||j ∞ = ||Y n ||j−1 ||Y n ||∞ ∞ < ||Y n ||∞ . Now, recall that ||Sa (Y n )||∞ = max Sa,i (Y n ) i and max Sa,i (Y n ) = max Sa,i (Y n ) i i since Sa,i ≥ 0 for all a and i. Thus, N ||Sa (Y n )||∞ ≤ χa j=1 j N j max Kp p∈P R N ≤ χa ||Y n ||∞ j=1 N j j−1 max p,q∈P R Cpq , 1 − Qpq j max Kp p∈P R n max Yk k j j−1 max p,q∈P R Cpq , 1 − Qpq . Let √ N c2 = χ a µ j=1 N j j max Kp p∈P R j−1 max p,q∈P R Cpq , 1 − Qpq . c2 is a constant independent of n, since χa , µ, N , Kp , Cpq , and Qpq values are independent of n for all p, q ∈ P R. Thus, there exists a constant, c2 , such that for 39 all n < ∞, ||Sa (Y n )||2 ≤ c2 ||Y n ||∞ and ||Y n ||∞ ≤ ||Y n ||2 since Y n ∈ Rm and m < ∞ [72]. Therefore, ||Sa (Y n )||2 ≤ c2 ||Y n ||2 . Case 2: Assume||Y n ||∞ < 1 and a ∈ P R. Observe that ||Sa (Y n )||∞ ≤ τ n max Yk k since a ∈ P R and τ ≥ 0. Now, let c3 = √ = τ ||Y n ||∞ µτ . c3 is a constant independent of n, since µ and τ are independent of n. Thus, there exists a constant, c3 , such that for all n < ∞, ||Sa (Y n )||2 ≤ c3 ||Y n ||∞ and ||Y n ||∞ ≤ ||Y n ||2 since Y n ∈ Rm and m < ∞ [72]. Therefore, ||Sa (Y n )||2 ≤ c3 ||Y n ||2 . 40 Note that we are only interested in numerically solving our system in a finite time interval, [0, T ]. Our solution is not guaranteed to be bounded for all time, but it is bounded on the finite time interval in which we are interested, as shown in the following lemma. In addition, we prove stability on the finite time interval [0, T ]. Lemma 3.2.2. Given T > h > 0 and a set of fixed parameter values, there exists a constant, c, such that for all n ≤ T , ||Y n ||2 ≤ (1 + ch)n Y 0 2 . h Proof. First note that 1  ||Y n ||2 2 n ||Ya ||2  2 = . a∈M R∪P R n Recall the definition of Ya given by equation 3.2: n n−1 + hS Y n−1 Ya = (I + hMa )−1 Ya a . Observe that n ||Ya ||2 ≤ (I + hMa )−1 2 n−1 + hS Y n−1 Ya a 2 by the definition of an induced matrix norm [89] and n−1 + hS Y n−1 Ya a 2 n−1 ≤ Ya + h Sa Y n−1 2 41 2 by the triangle inequality. Thus, n ||Ya ||2 ≤ (I + hMa )−1 ||Y n ||2 + h Sa Y n−1 2 Sa Y n−1 By Lemma 3.2.1, there exists a constant ca such that 2 2 . ≤ ca Y n−1 2 . Now note that [72] ||I + hMa ||2 (I + hMa )−1 2 = max (eig (I + hMa )) . min (eig (I + hMa )) Since I + hMa is a symmetric positive semidefinite matrix, ||I + hMa ||2 = max eig (I + hMa )T (I + hMa ) = max (eig (I + hMa )) > 0. Thus, 1 −1 (I + hMa ) 2 = . min (eig (I + hMa )) Since I is the identity matrix, Ma is a positive semidefinite matrix, and h > 0, min (eig (I + hMa )) = 1 + h min (eig (Ma )) ≥ 1. Hence, (I + hMa )−1 2 ≤ 1. Therefore, n n−1 ||Ya ||2 ≤ (1 + ca h) Ya . 2 42 Let c= max a∈M R∪P R ca . c is a finite constant independent of n. It follows that n n−1 ||Ya ||2 ≤ (1 + ch) Ya . 2 Now, observe that 1  ||Y n ||2 2 n ||Ya ||2  2 = a∈M R∪P R 1  ≤ (1 + ch) 2 a∈M R∪P R n−1 2  Ya 2 2 1  2 = (1 + ch) a∈M R∪P R = (1 + ch) Y n−1 2 n−1 2  Ya 2 2 . It then follows directly that ||Y n ||2 ≤ (1 + ch)n Y 0 2 . Lemma 3.2.3. Suppose Y 0 ∈ Rm is a vector of initial normalized mRNA and protein ˜ concentrations, and Y 0 ∈ Rm is a perturbed vector of initial values. Given T > h > 0, a 43 set of fixed parameter values, and a fixed a ∈ M R ∪ P R there exists a constant, ca , such ˜ that ∀n ≤ T , if Y n and Y n are the corresponding numerical solutions at time step n, then h ˜ Sa (Y n ) − Sa Y n 2 ˜ ≤ ca Y n − Y n 2 . Proof. Observe that ˜ Sa (Y n ) − Sa Y n 2 ≤ √ ˜ µ Sa (Y n ) − Sa Y n ∞ ˜ since Sa (Y n ) − Sa Y n ∈ Rµ and µ ≤ m < ∞ [72]. Case 1: Assume a ∈ M R. Observe that ˜ Sa (Y n ) − Sa Y n ∞ ˜ = max Sa,i (Y n ) − Sa,i Y n i . Recall that Φa,i (Y ) Sa,i (Y ) = χa 1 + Ψa,i (Y ) and 1 + Ψa,i (Y ) ≥ 1. Since Sa,i (Y ) is differentiable for all Y ∈ Rm with Yj ≥ 0 for ˜ all 1 ≤ j ≤ m, by the Mean Value Theorem, there exists a ξ ∈ {ηY n +(1 − η) Y n |0 ≤ 44 η ≤ 1} such that ˜ Sa,i (Y n ) − Sa,i Y n = ˜ Sa,i (ξ) · Y n − Y n ≤ ˜ Sa,i (ξ) ∞ Y n − Y n ∞ by the Cauchy Schwartz Inequality. Now observe, ∂Sa,i (ξ) Sa,i (ξ) ∞ = max 1≤j≤m ∂Yj   ∂Ψa,i  ∂Φa,i   (ξ) 1 + Ψa,i (ξ) − (ξ) Φa,i (ξ)  ∂Y  ∂Yj j = max 1≤j≤m χa 1 + Ψa,i (ξ)  2  ∂Ψa,i  ∂Φa,i   (ξ) 1 + Ψa,i (ξ) − (ξ) Φa,i (ξ)  ∂Yj 1≤j≤m  ∂Yj ≤ |χa | max since 1 + Ψa,i (ξ) ≥ 1. Continuing, observe that     ∂Φa,i  Sa,i (ξ) ∞ ≤ |χa | max   (ξ) 1 + Ψa,i (ξ)  1≤j≤m   ∂Yj 45  ∂Ψa,i +  (ξ) Φa,i (ξ)   ∂Yj by the triangle inequality. It then follows that   ∂Ψa,i  ∂Φa,i  Sa,i (ξ) ∞ ≤ |χa | max  (ξ) + (ξ) Φa,i (ξ)   ∂Yj 1≤j≤m  ∂Yj since 1 + Ψa,i (ξ) ≥ 1. Note that N j=1 ∂Φa,i ∂Yj ∂Ψa,i ∂Yj N (ξ) ≤ j=1 j j=1 j−1 max Kp max p∈P R N j j N (ξ) ≤ j N j Φa,i (ξ) ≤ N j p,q∈P R Cpq , 1 − Qpq k j j−1 max Cpq , 1 − Qpq max max Kp p∈P R Cpq , 1 p,q∈P R j max Kp j−1 max |ξk | k j−1 p,q∈P R p∈P R j max |ξk | j−1 max |ξk | k . ˜ Since ξ ∈ {ηY n +(1 − η) Y n |0 ≤ η ≤ 1} and by Lemma 3.2.2, there exists a constant c such that ||Y n ||∞ ≤ (1 + ch)n Y 0 ∞ ≤ (1 + ch)n and ˜ Yn ∞ ˜ ≤ (1 + ch)n Y 0 ∞ ≤ (1 + ch)n , it follows that max |ξk | = ||ξ||∞ ≤ (1 + ch)n . k 46 So, let N N j c1 = j=1 j−1 j max Kp max p∈P R p,q∈P R Cpq , 1 − Qpq Tj (1 + ch) h and N c2 = j j=1 N j j−1 j max Kp p∈P R max p,q∈P R Cpq , 1 − Qpq (1 + ch) T (j−1) h . c1 and c2 are constants independent of n, since Kp , Cpq , and Qpq values are fixed ∀p, q ∈ P R and N , T , and h are known. Then, Sa,i (ξ) ∞ ≤ |χa | (c1 + c1 c2 ) ≤ |χa | (c1 + c1 c2 ) √ since n ≤ T . Therefore, there exists a constant, µ |χa | (c1 + c1 c2 ), independent of h n since µ is known and |χa | is a constant independent of n, such that for all n ≤ T , h ˜ Sa (Y n ) − Sa Y n 2 ≤ √ ˜ µ |χa | (c1 + c1 c2 ) Y n − Y n ∞ . ˜ Since Y n − Y n ∈ Rm and m < ∞ [72], ˜ Yn−Yn ∞ ˜ ≤ Yn−Yn 2 . Thus, ˜ Sa (Y n ) − Sa Y n 2 ≤ √ Case 2: Assume a ∈ P R. 47 ˜ µ |χa | (c1 + c1 c2 ) Y n − Y n 2 . Observe that ˜ Sa (Y n ) − Sa Y n Now, let c3 = √ ∞ n ˜n max Yk − Yk k =τ ˜ =τ Yn−Yn ∞ . µτ . Thus, there exists a constant, c3 , such that for all n < ∞, ˜ Sa (Y n ) − Sa Y n 2 ˜ ≤ c3 Y n − Y n . ∞ ˜ Since Y n − Y n ∈ Rm and m < ∞ [72], ˜ Yn−Yn ∞ ˜ ≤ Yn−Yn 2 . Thus, ˜ Sa (Y n ) − Sa Y n 2 ˜ ≤ c3 Y n − Y n 2 . Theorem 3.2.4. Given a set of fixed parameter values, the numerical solution given by (3.2) is stable (in the sense of Lyapunov) [23] in a finite time interval [0, T ]. Proof. Suppose Y 0 is a vector of initial normalized mRNA and protein concentrations and ˜ ˜ Y 0 is a perturbed initial value. Let T > h > 0 and n ≤ T . Suppose Y n and Y n are the h ˜ corresponding numerical solutions to Y 0 and Y 0 at time step n. First note that 1  Yn ˜ −Yn 2 n Ya = a∈M R∪P R 48 ˜n − Ya 2 2 2  . Observe that by equation 3.2: ˜ n−1 + hSa Y n−1 ˜ − (I + hMa )−1 Ya n n−1 + hS Y n−1 ˜n Ya − Ya = (I + hMa )−1 Ya a n−1 − Y n−1 + h S Y n−1 − S Y n−1 ˜a = (I + hMa )−1 Ya a a ˜ . Thus, n ˜n Ya − Ya 2 ≤ (I + hMa )−1 2 n−1 − Y n−1 + h S Y n−1 − S Y n−1 ˜a Ya a a ˜ by the definition of induced matrix norm [89] and n−1 − Y n−1 + h S Y n−1 − S Y n−1 ˜ Ya a a ˜ a 2 n−1 − Y n−1 ˜ ˜ ≤ Ya + h Sa Y n−1 − Sa Y n−1 a 2 by the triangle inequality. As shown in the proof of Lemma 3.2.2, (I + hMa )−1 ≤ 1 and by Lemma 3.2.3, there exists a constant, ca , such that ˜ Sa Y n−1 − Sa Y n−1 2 49 ˜ ≤ ca Y n−1 − Y n−1 2 . 2 2 Let c= max a∈M R∪P R ca . c is a finite constant independent of n. It follows that n ˜n Ya − Ya 2 n−1 − Y n−1 ˜ ≤ (1 + ch) Ya a 2 . Now, observe that 1  Yn ˜ −Yn 2 n Ya = a∈M R∪P R 2 ˜n − Ya  2 2 1  2 ≤ (1 + ch) a∈M R∪P R n−1 − Y n−1 2  ˜a Ya 2 1  = (1 + ch) 2 2 n−1 Ya a∈M R∪P R ˜ = (1 + ch) Y n−1 − Y n−1 50 2 . ˜ n−1 − Ya 2 2 2  It then follows that ˜ Yn−Yn 2 ˜ ≤ (1 + ch)n Y 0 − Y 0 ˜ ≤ echn Y 0 − Y 0 ˜ ≤ ecT Y 0 − Y 0 2 2 2 . Thus, by definition of stability (in the sense of Lyapunov), the numerical solution given by (3.2) is stable in the finite time interval [0, T ]. Knowing that our solver is stable gives us an idea of how robust the numerical solver is. We have found a constant, ecT , which measures the effect of small perturbations in the initial data on the numerical solution. Now, we would like to determine how well our numerical solver actually approximates the true solution of the system and whether we can improve our approximations by decreasing the step-size, h. For this, we need to determine whether our solver is convergent, and we will use the stability we have just shown to help in proving convergence. 3.2.3 Convergence A fundamental theorem of numerical solvers states that if a method is consistent and 0stable, then the method is convergent [5]. Since we have shown stability (in the sense of Lyapunov) in the previous section, a stronger condition than 0-stable, to prove convergence of the numerical solution given by (3.2), it suffices to show that the method is consistent. Definition 1. Let Y (tn ) denote the true solution to equation (3.1) at time step tn . Suppose 51 Y n is the numerical solution obtained after a single iteration, starting from the true solution at the previous time step, Y (tn−1 ). The local error, dn , at step n is defined by: dn = Y (tn ) − Y n Theorem 3.2.5. Given a set of fixed parameter values and a finite time interval [0, T ], the numerical solution given by (3.2) is consistent. Proof. Consider the local error, dn , at step n. Note that ||dn ||2 = ||Y (tn ) − Y n ||2 1  2 n ||Ya (tn ) − Ya ||2  2 = . a∈M R∪P R n n For ease of notation, let dn,a = Ya (tn )−Ya . Recall the definition of Ya given by equation 3.2: n n−1 + hS Y n−1 Ya = (I + hMa )−1 Ya a . Observe that dn,a = Ya (tn ) − (I + hMa )−1 (Ya (tn−1 ) + hSa (Ya (tn−1 ))) since Y n is the numerical solution obtained after a single iteration, starting from the true 52 solution to equation (3.1) at the previous time step, Y (tn−1 ). Thus, (I + hMa ) dn,a = (I + hMa ) Ya (tn ) − (Ya (tn−1 ) + hSa (Ya (tn−1 ))) = Ya (tn ) − Ya (tn−1 ) + h (Ma Ya (tn ) − Sa (Ya (tn−1 ))) . Since Y (tn−1 ) is the true solution to equation (3.1) at time tn−1 , dYa Sa (Ya (tn−1 )) = dt (tn−1 ) + Ma Ya (tn−1 ) . Thus,   dYa   Ma Ya (tn ) − (I + hMa ) dn,a = Ya (tn ) − Ya (tn−1 ) + h  (tn−1 ) − Ma Ya (tn−1 )  dt dYa = Ya (tn ) − Ya (tn−1 ) − h dt 53 (tn−1 ) + h (Ma Ya (tn ) − Ma Ya (tn−1 )) . Now, Taylor expanding Ya (t) about tn−1 , results in the following: dYa dYa (I + hMa ) dn,a = Ya (tn−1 ) + h (tn−1 ) + O h2 − Ya (tn−1 ) − h (tn−1 ) dt   dt  dYa     + h Ma Ya (tn−1 ) + h (tn−1 ) + O h2  − Ma Ya (tn−1 )     dt = O h2 . Hence, dn,a = (I + hMa )−1 O h2 . As shown in the proof of Lemma 3.2.2, (I + hMa )−1 ≤ 1. Thus, dn,a 2 ≤ O h2 . Therefore, 1  2 2 dn,a 2  ||dn ||2 =  a∈M R∪P R 1  ≤ O a∈M R∪P R 54 2 h2  2 = O h2 . So, by definition of consistent [23], the numerical solution given by (3.2) is consistent. Now that we have shown stability and consistency of our numerical solver, it follows that the solver is convergent. Thus, as we take smaller and smaller step-sizes, ie: as h → 0, the numerical solution given by (3.2) converges to the true solution of equation (3.1). 3.3 Numerical Simulations Using the numerical solver as defined by (3.2), along with the parameter estimation algorithm, CMAES, which will be described in detail in Chapter 5, we can fit expression data. For fitting, we used an 80 minute time window corresponding to the period of development from late nuclear cycle 13, when twi, sna, and rho begin to be expressed, to late nuclear cycle 14, when twi, sna, and rho expression patterns are mature. For fitting, we began with previously published quantitative data for mRNA concentrations at the time point t = 80 min. [105]. 20 nuclear positions were used at approximately 50% egglength (A-P) from the ventral-midline to approximately 50% eggheight (V-D). We combined this with a qualitative assessment of blue-white images at time points corresponding to t = 0 min and t = 30 min. We then linearly interpolated these three sets of data to create a set of data corresponding to 16 time points, in 5 minute intervals, to use for fitting model parameters. Initial mRNA concentrations used were those corresponding to the data at t = 0 min and initial protein concentrations were set to 0. The results are shown in Figures 3.1-3.3. Figures 3.1 and 3.2 demonstrate the convergence of the solver and Figure 3.3 illustrates the ability of the model to successfully replicate the dynamic expression of a simple Drosophila gene regulatory circuit. This model is the first in a eukaryotic system to include a transcriptional term that 55 A twi sna rho twi sna rho twi sna rho twi sna rho B C D Figure 3.1: Simulations with decreasing step size. Parameter values were fixed in each of these four model simulations, allowing us to analyze the effect of changing step-size on the numerical solutions obtained by (3.2). In panel A) h = 1.0, B) h = 0.1, C) h = 0.01, and D) h = 0.001. 56 A twi sna rho twi sna rho twi sna rho twi sna rho B C D Figure 3.2: Contour plots for simulations with decreasing step size. Parameter values were fixed in each of these four model simulations, allowing us to analyze the effect of changing step-size on the numerical solutions obtained by (3.2). In panel A) h = 1.0, B) h = 0.1, C) h = 0.01, and D) h = 0.001. 57 Figure 3.3: Data vs. Model Predictions. 8 time points are shown. Each pair of graphs depicts on the left, the data used to fit model parameters, and on the right, the model predictions after parameter fitting, for the given time point. Observe that the model is very successful in fitting the general dynamics and expression pattern trends of this system. The corresponding RMSE for this prediction is 0.0616. 58 takes into account DNA sequence information and a translational term to relate mRNA and protein concentrations. All simulation results shown were obtained using the numerical solver given by (3.2), which we have shown is both stable and convergent in the finite time interval in which simulations were run. Each simulation took under one minute to generate a numerical estimate for the 80 minute time period of interest and parameter estimation using an evolutionary strategy took less than 24 hours. This is a notable improvement in computational speed compared to previous methods that did not incorporate any DNA level information [43, 6, 7]. The results shown support our earlier claim that by incorporating data on DNA sequence we are able to successfully model context-specific features of enhancers, as well as replicate the dynamic expression of a simple Drosophila gene regulatory circuit that drives development in the dorsal-ventral axis of the blastoderm embryo, involving Dorsal, Twist, Snail, and rhomboid. Where protein and cis-regulatory information is available, this model provides a powerful method to recapitulate and predict dynamic aspects of eukaryotic transcriptional systems that will greatly improve our understanding of gene regulation at a global level. 59 Chapter 4 Sensitivity Analysis This Chapter was taken from the publication: “Thermodynamic-based modeling of transcription: sensitivity analysis differentiates biological mechanism from mathematical model-induced effects” Jacqueline M Dresch, Xiaozhou Liu, David N Arnosti, Ahmet Ay which was published in BMC Systems Biology, Volume 4, in 2010. 4.1 Introduction In this chapter, we focus on the parameters involved in thermodynamic-based models. For these models, parameters are usually not known and the derivation of these parameters from biological data, such as images from confocal microscopy, is complicated by the noisy nature of biological data. Two major challenges then face the modeler who seeks a biological interpretation of the parameter values. First, large parameter ranges are often found, leading to uncertainty about where realistic parameter values lie. In many cases, relevant experimental measurements have not been conducted - mathematical models are often in 60 fact the best method to determine such values. Second, parameter values can be strongly influenced by the form used to shape the problem, thus the problem then becomes whether these extracted values are realistic - are the values due to properties of the biological system or the mathematical model? To address these questions, one must examine the mathematical model itself to determine the uncertainty in parameter values and ascertain the impact that perturbations in parameter values will have on the model output. This question of parameter uncertainty and sensitivity analysis has been addressed in many different applications, including biological, chemical and risk assessment [34, 25, 58, 96, 103, 29, 16]. Parameter uncertainties have been explored extensively for ODE models, which typically have a very large number of parameters, resulting in a great deal of model variation and parameter uncertainty [34, 25, 58, 96]. The sensitivity of parameters derived from thermodynamic-based transcription models, however, has not yet been examined in such a context; studies have focused simply on extracting and interpreting parameter values or parameter ranges [26, 105, 85, 75, 45]. As discussed in the Chapter 2, thermodynamic-based models are based on formulations originating from statistical physics. For transcriptional analysis, these models consider all possible states of a DNA regulatory element, where a state refers to a specific configuration of regulatory proteins (TFs or TFs) bound to DNA [26, 105, 85, 75, 45]. Each state is awarded a weight, which depends on properties such as the binding affinity and concentration of proteins. In many thermodynamic-based models, including those examined in this chapter, the mathematical formula used to calculate the expression level is a rational function in which parameters to be estimated can be found in both the numerator and denominator [26, 105]. 61 Sensitivity analysis can be applied to any field that uses mathematical modeling as a tool, and has been used extensively on diverse models in economics, civil engineering, and medicine [68, 27, 41]. In drug design, sensitivity analysis has been used in determining which parameter or parameters of a differential equation model would have the largest effect on a given outcome of interest, such as the sugar level in the blood of a patient [41]. Its utility is shown here when applied to thermodynamic-based models of gene regulation (Figure 4.1). Such models have been applied to systems ranging from a single cis-regulatory element to diverse collections of highly divergent elements bound by a wide range of proteins [85, 45]. This chapter concentrates on two studies that took a very focused approach to identify specific features relating to transcriptional activation, repression and cooperativity by examining sets of similar enhancers that feature a limited degree of variation. 4.2 Methods Sensitivity analysis, whether local or global, has the primary goal of determining how a given model responds to variations in parameter values. Local sensitivity analysis (also known as differential analysis or nominal range sensitivity analysis) focuses on a particular point in parameter space, varying parameters one at a time to obtain a local response of the model to each parameter [96, 29, 78, 94]. Global sensitivity analysis, on the other hand, tries to capture the entire parameter space all at once, allowing multiple parameter values to be explored simultaneously. Parameters are individually analyzed by averaging the variation in model output over the entire space [78]. It is possible to explore multi-parameter responses using local sensitivity analysis, employing techniques such as second-order partial derivatives, but this analysis can be computationally costly. For this reason, in this chapter 62 a global sensitivity analysis is used to investigate multi-parameter effects. Our approach initially used local sensitivity analysis to calculate the models rate of change with respect to each variable at a given point in parameter space, quantifying at that point how small changes in each variable will affect the model output. Then, we turned to global analysis, employing three different methods. The first simply takes an average of results obtained from local sensitivity analyses on different points in parameter space. The other two global methods each rely on determining which parameters and parameter combinations contribute most to the variation in the model output that resulted from sampling a wide range of parameter values. The eFAST algorithm does this through Fourier analysis, while the HDMR algorithm does this through polynomial approximation. 4.2.1 Local Sensitivity Analysis Most local sensitivity analysis methods use partial derivatives evaluated at a point in parameter space to determine how the model output changes locally with respect to small variations of a particular parameter [25, 96, 94, 73]. To gain insight on local effects, when calculating the sensitivity coefficient, the partial derivative is normalized using the relative changes in quantities [78, 73]. To calculate the local sensitivity coefficient for an objective function, C, with respect to a parameter, µ, one uses the formula [73]:   µ ∂  sensitivity coefficient =  C  . C  ∂µ  A major advantage of local sensitivity analysis is its ease of implementation [94]. The formula is straightforward and the results are simple to interpret, but must be viewed with 63 caution, because the analysis is only valid in the small region around a particular point [25, 58, 103]. Also, this method does not account for parameter interactions, so in nonlinear models, this often results in an underestimation of true model sensitivities [29, 94]. In many mathematical models of biological events, including differential equation models of the gap gene network in Drosophila, parameter ranges are fairly well constrained due to previous studies which have estimated values for parameters such as diffusion and decay rates [6, 7, 28]. This allows a great deal of information to be extracted from a local analysis. With the large uncertainty of parameter values and ranges in models such as those we focus on in this study, this method is not robust, so global sensitivity techniques are preferred. 4.2.2 Global Sensitivity Analysis These approaches vary parameters over a larger parameter space and have the ability to quantify parameter interactions [58, 103, 29, 94]. Global sensitivity analysis covers a much wider range of methods than that of local sensitivity analysis. A common element is that global sensitivity methods explore the full parameter space, quantifying a model’s sensitivity to perturbations in each parameter. Due to computational limitations, this exploration of parameter space requires a sampling method that randomly samples parameter space, while evenly covering individual parameter ranges [96]. Here, we focus on three global sensitivity methods: a basic, easily-implemented method of average local sensitivities, and two more complex methods that capture first-order sensitivities, as well as higher order sensitivities, those sensitivities derived from varying more than one parameter. Higher order sensitivities can capture possible parameter interactions. An example of higher order interactions would be the relationship between the intrinsic activity of a TF and cooperative binding. Solutions 64 to a model would include those that ascribed high cooperativity between weak neighboring TFs, as well as low cooperativity between inherently powerful TFs. Thus, the higher order sensitivity would quantify the impact this relationship has on the model output. For a particular system, varying cooperativity or TF activity alone may have little effect on model output, thus these parameters would have small first-order sensitivity indices, because the two effects can “cancel each other out” or compensate. If cooperativity and activity are simultaneously varied, large effects might be observed, producing significant second-order sensitivities. (i) Method 1 To overcome the major disadvantage of the small region of sampling in local sensitivity analysis while retaining the ease of computational simplicity, some studies have used a global approach of averaging local sensitivities taken throughout the parameter space [96]. This method chooses sets of parameters randomly using realistic ranges of each parameter space, then local sensitivity analysis is applied to each set of parameters, and results are averaged over the entire space for each parameter to provide a global sensitivity coefficient. This method can be computationally expensive. (ii) Method 2 One of the earliest, yet most efficient global sensitivity analysis methods is the variancebased Fourier amplitude sensitivity test (FAST) [22]. FAST uses a unique searchcurve to explore parameter space and employs Fourier decomposition of the objective function to calculate main effects, or first-order sensitivity indices [29, 78]. The FAST method does not rely on any assumptions about the form of the model, thus it can be used on nonlinear, monotonic or nonmonotonic models [29, 78]. An extension of the 65 FAST method (eFAST) can compute not only the first-order, but also the total-order sensitivity coefficients for a given parameter, as defined by Sobol’ [58, 78]: n STi = Si + n Sijk + · · · + S12...n Sij + j=1 j=i 1≤j 1, Si is calculated using the mean of the variances. The calculation of the total order sensitivity index of a parameter is then easily computed by combining other parameters into a single group (with one fundamental frequency) and calculating the first order sensitivity of this complementary group of parameters, SCi , and removing this proportion of variance from the total proportion of variance. Thus, the total order sensitivity index of the parameter i is computed as: STi = 1 − SCi . In this chapter, the eFAST algorithm, as written by Marino et al., was implemented [58]. Modifications to this algorithm involved removing time-dependence, including our objective function, and setting baseline parameter values based on those published in [26]. eFAST parameters were determined from the formulation recommended in [78]. For scheme 2, the parameters were set at k = 10, Ns = 73, NR = 100, and M = 4. (iii) Method 3 Another global sensitivity analysis method, ANalysis Of Variance (ANOVA), decom- 68 poses a function into a summation of terms of increasing dimensionality [103]: n f (x) = f0 + n fij xi , xj + · · · + f12...n (x1 , x2 , . . . , xn ) fi (xi ) + i=1 (∗) 1≤i 0.981). We carried out this test a second time, again testing 100,000 different random parameter sets for each construct and obtained essentially identical results (data not shown). The relationships that were said to be of greatest importance in Zinzen et al. appear in a plurality of parameter sets in almost every scheme, but other relationships between the cooperativity parameters also result in a very high correlation. The relationships predicted in Zinzen et al. hold for less than 40% (between 13 and 59 of between 105 and 174) of the parameter sets that show satisfactory correlation (> 0.981). As modeled in that study, the relationships may not represent biological features that reflect a compelling best fit with the experimental data; rather, the low sensitivities of these particular parameters may allow one to accept a local best fit. The biological conclusions, that rho and vnd expression patterns differ based on Twist-Dorsal cooperativity, should therefore be interpreted with caution. Table 4.3 also indicates how sensitivity analysis can guide model selection. The insensitivity of cooperativity parameters led us to test a large set of different parameter values, which revealed that multiple parameter sets with different biological interpretations produced satisfactory correlations between the experimental data and the model output. This 102 modeling weakness may reflect the reliance on only one experimental data set to fit parameter values. In addition, information about positions of binding sites in enhancers was not considered, and somewhat arbitrary assumptions about which proteins are required for activity were made. Any of these factors may have led to insensitivity in the cooperativity parameters, and should be addressed before trying to improve upon the model or its biological interpretations. 4.4 Discussion Thermodynamic-based models represent the most powerful approach to quantitative understanding of gene transcription because of their use of DNA sequence information that can capture subtle differences in enhancer architecture. As with other models, this approach involves many unknown parameters and interactions, however, a rigorous sensitivity analysis has not been applied to recent studies in this area. We have shown here, through local and global sensitivity analyses, that there is a wide range in sensitivities for parameters in these models. In some cases, this is due to the biology of the system, or the experimental data used to calculate the error, while in other cases it may simply be an artifact of the mathematical formulation (Figure 4.1), and the underlying assumptions that have gone into this model. Our analysis also uncovered previously undescribed interactions between parameters that may reflect true biological interactions, and at the same time found very low sensitivities in parameters that were previously thought to be most informative in explaining the overall architecture of distinct enhancers. We used the insights drawn from sensitivity analysis to redesign experiments, and discover new parameter values that are well correlated with experimental data. Clearly, standard parameter estimation techniques used alone, in the 103 absence of sensitivity analysis, cannot fully show the importance of possible biologicallydriven or model-driven effects. To improve upon thermodynamic-based modeling attempts to unlock key aspects of cisregulatory grammar, we should recognize a need for sensitivity analysis and a thorough exploration of parameter space. With this analysis, new iterative parameter estimation and model selection techniques can be designed and implemented [46, 54]. Analysis of the effects parameters have on the model output, within their realistic ranges, can also help biologists to design experiments. As we demonstrated with our synthetic data sets, a tight coordination between experimental design and model formulation is required to maximize the effectiveness of these modeling studies. A conscious effort can be made to collect data evenly and models can be redesigned to make sure that model sensitivity is similar for various parameters. Thermodynamic-based models, as other models, benefit by such sensitivity analysis, providing important context in which to place biological interpretations suggested by the model. 104 Chapter 5 Parameter Estimation This Chapter was taken from the manuscript: “Global Parameter Estimation for Thermodynamic-based Models of Transcriptional Regulation” Yerzhan Suleimenov, Ahmet Ay, Md. Abul Samee, Saurabh Sinha, Jacqueline M Dresch∗ , and David N Arnosti∗ (∗ = corresponding authors) which is in the process of being submitted to Methods - Elsevier. 5.1 Introduction The goal of most studies in computational biology is to better understand overall system behavior and the contributions of specific system components to this behavior. To study behavior of biological systems, a variety of mathematical modeling approaches are employed [79, 100, 3, 1, 15, 105, 45, 85, 30, 26, 38, 33, 10, 62, 81, 97, 43, 32, 20, 69, 13]. The evaluation of such models rests on two central pillars, namely model validation and parameter estimation. 105 Whether the mathematical framework is straightforward, leading to a model with an explicit formula, or more complex, leading to a model with only an implicit formula or a high degree of stochasticity, there are always some aspects that remain unknown with parameters that need to be estimated. If the system were completely understood, with no unknown parameters, the model would not hold much value for most biological applications. In contrast, a model containing unknown parameter values that can be estimated using experimental data may hold answers to important biological questions. A challenge that faces these efforts is that no single parameter estimation strategy works well for all problems, due to the large variety of models and model constraints. In many biological modeling studies, in particular those that use thermodynamic-based models to examine gene regulation, parameter estimation techniques have not been thoroughly investigated [105, 85, 45, 38]. Thermodynamic-based models adapt tools from statistical physics that describe proteinDNA interactions, and they predict gene output based on the assumption that gene activity is proportional to the level of activators bound and inversely proportional to the level of repressors bound.of transcriptional regulation [38]. Essential information for this type of model includes the DNA sequences of cis-regulatory regions and the binding specificities of TFs (TFs) that interact with them to effect transcriptional activation or repression. Many underlying assumptions are incorporated into such models of transcription, and also many unknowns, such as the maximal binding affinity of proteins, protein-protein cooperativity, inhibition of activators by repressors, and the effects of TFs on the basal machinery. Thus, there is clearly a need for parameter estimation techniques when studying thermodynamicbased models of transcriptional regulation. Previous studies have implemented a variety of techniques, ranging from very simple local parameter estimation techniques to some of the 106 most complicated global algorithms for parameter estimation [105, 85, 45, 38, 26]. These studies have not generally provided a rationale for their choice of estimation strategy, although these methods can differ enormously in computational demands, and, potentially effectiveness in finding global minima. The general idea behind all parameter estimation techniques is quite simple: given a model with some unknown variables and spatial and/or temporal experimental data, parameter estimation algorithms work toward fitting numerical values to these variables so that the difference between the model output and the experimental data is as small as possible. Once parameter values are fit, these values can be used in interpreting aspects of the biological system and in predicting the behavior of similar systems [98, 57, 63]. Parameter estimation has a long history in mathematics and computer science. Since mathematicians first began solving inverse problems by searching for extrema of functions, they have developed different techniques. Although the extrema of a function can refer to either a minima or a maxima, here we concentrate on finding minima, i.e., minimizing the error between model predictions and experimental data. Parameter estimation techniques typically fall into one of two classes, local or global algorithms. Local algorithms have the advantage of computational simplicity, while global approaches are more likely to recover optimal values, but the extra effort may or may not be worth while. Here, we compare the performance of representative local and global approaches using a recently described thermodynamic-based model [38]. Local estimation techniques generally require either the calculation or approximation of the objective function’s derivative or a comparison of the objective function at multiple different parameter values [63]. Some common local parameter estimation techniques include 107 the Conjugate Gradient method, Newton’s method, and the Nelder-Mead algorithm. In two earlier thermodynamic-based modeling studies local parameter estimation techniques were used [85, 38]. Segal and colleagues employed the Conjugate Gradient Ascent and the NelderMead simplex methods in an alternating fashion. In a later study, Sinha and colleagues applied the quasi-Newton and the Nelder-Mead simplex methods in an alternating fashion on a slightly different model using the same data set. The major drawback of using such local parameter estimation techniques is that they often lead to the discovery of local, not global, minima of the objective function, producing misleading results. This problem can be minimized if the modeler has prior knowledge of where the global optimal parameter values lie, allowing for an initial guess close to the global minima and thus quick convergence by a local algorithm. In many real-world settings, including the modeling of transcription, almost no prior knowledge of parameter values is available however. Thus, the only way to overcome the problem of getting stuck at local minima is to run the algorithm multiple times, with different initial parameter values, a so called multi-start strategy. Multiple restarts with different initial parameter values were implemented in both of the studies mentioned above [85, 38]. However, this method is not very efficient and it has been shown that for highly nonlinear inverse problems local parameter estimation strategies cannot find the correct parameters even when run with an extremely large number (> 300) of starting points [60]. Global parameter estimation techniques offer another path to finding global minima of a model when no a priori information on parameters is available [98, 57]. However, with these techniques, obtaining the global minima for a nonlinear model is very difficult and computationally expensive [60, 59]. Both deterministic and stochastic global techniques have been developed. Deterministic methods such as the branch-and-bound and interval optimization 108 methods are more reliable, they are computationally very expensive and not practical for many nonlinear problems. In contrast, stochastic methods such as genetic algorithms, simulated annealing, and evolutionary strategies can more quickly find the vicinity of global minima. These methods move through parameter space with some stochasticity to avoid getting “stuck” at local minima. Unfortunately, in general, for these methods global optimality cannot be guaranteed for all problems due to their probabilistic nature, therefore multiple runs are advised. However, global convergence has been proven for evolutionary algorithms applied to problems with a unique global minimum and a nonzero probability of reaching the neighborhood of the global minimum from any initial starting population in a single evolutionary time step [77]. In the realm of stochastic techniques, evolutionary strategies (ES) have shown excellent performance in many studies of continuous models with large sets of estimated parameters [60, 47, 39, 61, 26]. These strategies are inspired by biological evolution, and include features such as cross-over, mutation and selection. Each specific technique varies in the number of offspring, the number of parents, whether mutation rates, recombination, or cross-over are considered, and the selection strategy applied. In each ES, the the “fittest individuals” (the parameter sets with lowest value for the objective function), have a higher probability of surviving to the next “generation.” Along with selection criteria, the use of mutation rates, recombination, and cross-over allow populations of parameter sets to leave a local minima and eventually reach a global minima. In this study, we used a version of an evolutionary strategy, the so called Covariance Matrix Adaptation-Evolutionary Strategy (CMA-ES). Our choice was motivated by the documented success of CMA-ES algorithms over other global parameter estimation techniques on benchmark problems [49]. 109 Nonlinear models, such as thermodynamic-based models of gene regulation, are known to be ill-posed, that is, either they do not have a solution, the solution is not unique or the solution does not depend continuously on the data. In these problems the parameter space is usually unknown and the complexity of the parameter estimation problem grows exponentially as the number of parameters increase. For this reason, it may be worth employing computationally expensive global parameter estimation techniques. However, in several recent thermodynamic-based modeling studies, local strategies rather than global counterparts were employed to achieve computational efficiency [85, 38]. It is not clear what the potential cost of effectiveness was in these cases: the fits obtained were judged adequate, but perhaps the use of a local strategy had a significant effect on fit with possible misleading biological interpretations. To better discuss the possible costs and trade-offs in using these diverse parameter estimation approaches, it is necessary to compare performance on the same dataset, using the same models. Here, we test the performance of the CMA-ES global and QN/NMS local methods, with respect to fitting parameter values in thermodynamicbased models of gene regulation. We compared these algorithms using both synthetic and experimental gene expression data [84]. Our analysis is based on GEMSTAT (Gene Expression Modeling based on Statistical Thermodynamics), a thermodynamic-based model that incorporates terms for TF-DNA interactions as well as TF-transcriptional apparatus interactions [38, 87, 17]. The model can also incorporate mechanistic features such as short range repression, cooperativity in TF-DNA binding and the synergistic effect of multiple activators. The authors have systematically evaluated the significance of each of these features in the context of available interpolated expression data. 110 5.1.0.1 GEMSTAT Specifics and Applications Six different models are available for implementation in GEMSTAT. In ‘DirectInt’, each TF interacts directly with the basal transcriptional machinery (BTM) and TFs are assumed to only interact with TFs bound to adjacent sites located within a certain distance. ‘DirectIntcoop’ extends DirectInt by also allowing for cooperative binding of any given pair of bound TFs regardless of distance. The ‘ChrMod-Unlimited’ and ‘ChrMod-Limited’ models assume that repressors do not interact directly with the BTM, but inhibit the binding of activators. This effects the enumeration of all possible states since it creates new possible configurations, where DNA in the ‘neighborhood’ of a bound repressor is inaccessible to binding by any other TF. This new configuration competes with the configurations where the chromatin is accessible to activators, effectively reducing the occupancy of activators. Both models allow cooperativity between TFs that are bound to adjacent sites, in the same way as the DirectInt model. These two models can also be used to test the effect of synergism between bound TFs, i.e., whether increasing the number of TFs interacting simultaneously with the BTM can better explain the observations. In ‘ChrMod-Unlimited’, there is no limit on the number of activators that can simultaneously interact with the BTM, as in the ‘DirectInt’ models, while ‘ChrMod-Limited’ limits the number of simultaneous interactions with the BTM. The ‘Quenching’ model is based on short-range repression, not assuming repression works through chromatin modification, but rather through a quenching parameter. The enumeration of all possible states is the same as in the ‘DirectInt’ models, but the weight of states is modified by incorporating this quenching parameter. The ‘Logistic’ model evaluates a logistic function at TF ‘occupancies’ to estimate the level of gene expression, where ‘occupancy’ of a given TF is the summation of the probabilities of that TF binding to its different putative binding 111 sites. As such, unlike the other models described above, the logistic model lacks the details of TF-TF and TF-BTM interaction. GEMSTAT is divided into two functional modules, one that executes the model specified and another that trains the model. The model training program takes as input a set of DNA sequences, the PWMs of the relevant TFs and the measured expression patterns. Given a set of inputs, GEMSTAT assesses fit by either maximizing the average correlation coefficient and/or minimizing the sum of squared errors (SSE) (the program can be run in multiple configurations). In [38], a local parameter optimization approach from the GNU Scientific Library was used that combines the Nelder-Mead simplex method and the quasi-Newton method referred to as the Broyden-Fletcher-Goldfarb-Shanno algorithm [90]. This study provided insight into the mechanistic details of gene regulation in segmentation-related transcriptional enhancers in Drosophila, namely, short-range repression, cooperativity in TF-DNA binding, and the synergistic effect of multiple activators [38]. Using tests for statistical significance, the authors showed that each of these three mechanisms appear to contribute to the output of enhancers. Cooperative DNA-binding of activators as well as repressors appeared to be critical for some, but not all TFs. Their models suggested that the synergistic effect of multiple activators with the BTM contributes significantly more than the TF-DNA binding cooperativity between activators to the accuracy of predicted expression patterns. Simple competition between repressors and activators for binding sites is an insufficient mechanism of repression, supporting the short-range repression mechanism for two of the TFs, in agreement with experimental evidence. 112 5.2 5.2.1 Methods and Calculations Construction of Synthetic Data Sets To compare the two parameter estimation strategies, CMA-ES and QN/NMS, we created synthetic gene expression data for each of the six model versions in GEMSTAT. The synthetic data sets were created so that the true parameter values are known a priori and can be compared to those found using the parameter estimation techniques. Each set consists of 36 expression patterns, corresponding to 36 experimentally characterized CRMs involved in Drosophila embryonic development (anteriorposterior axis specification) containing binding sites for 6 TFs. Since parameter values were chosen, not fit, the resulting synthetic patterns do not agree with the actual expression patterns driven by these CRMs that have been observed in embryonic development. We chose this particular set of CRMs and TFs due to the fact that they were used in [38]. Recent transcriptional modeling studies have used data sets of varying quality, therefore we compared the parameter estimation strategies not only on high quality continuous data sets, but also on low quality noisy or discrete data sets. In order to obtain a thorough analysis, we have created continuous, perturbed, binary, and interpolated synthetic data sets, as described below. 5.2.1.1 Construction of Continuous Synthetic Data Sets Creating continuous synthetic data sets is a two-step process. In the first step, values for each parameter are randomly selected from ranges specified in the model. In the second step, these parameter values are used to create the expression (on a scale from 0 to 1) for each input sequence. The expressions obtained comprise the continuous synthetic data, which represent exact results, devoid of measurement noise. For GEMSTAT models tested we find that a 113 large number of genes had ‘saturated’ expressions for most of the data points, and when expressions were not saturated, the patterns were quite similar for a large portion of genes. Both of these scenarios do not reflect the patterns of developmental genes that this study is analyzing; these patterns typically are highly heterogeneous. Rather than selecting certain parameter sets arbitrarily, we designed a scoring scheme, the ‘derivative score’, which assigns a numerical score to each set of random parameters (Table 5.1). This scheme selects outputs with diverse and non-monotonic patterns. In our experiments, 25000 random parameter sets were generated and the best 30 sets, based on the derivative scores, were used to generate our continuous synthetic data. Table 5.1: A pseudo-algorithm for the derivative score. Let P be a set of parameters, B be the set of bins, and S be the set of sequences. For all i sequences and j bins: Let di,j be the absolute value of the derivative of the expression obtained using P . 1 The derivative score is defined as: j∈B maxi∈S di,j B 5.2.1.2 Construction of Perturbed, Binary, and Interpolated Synthetic Data Sets To compare the performance of the two parameter estimation strategies on lower quality data, we created perturbed, binary, and interpolated data sets. Perturbed data was created by adding to each of the continuous synthetic data points, x, a perturbation drawn from a uniform distribution on the interval [−rx, +rx], where r represents the perturbation percentage. We used r = 0.025, 0.05 and 0.10 in this study. Results shown in Figures 5.2 -5.5 were obtained with r = 0.05. The overall SSE values obtained from both local and global 114 parameter estimation techniques with r = 0.025 were similar to those obtained with r = 0.05 and those obtained with r = 0.10 were similar to those obtained using a data set of slightly lower quality, such as interpolated data (described below). We did not allow our expression values to go outside of the range [0, 1], so if adding perturbation caused a value to be larger than 1, the value was set to 1. Binary data was created from the continuous synthetic data using a simple thresholding technique, whereby data points greater than or equal to 0.5 were set to 1 and data points below 0.5 were set to 0. The interpolated data, an intermediate between continuous and binary, were created from the binary data by replacing data points near a discontinuity (a jump from 0 to 1 or 1 to 0) with intermediate values such as 0.89 or 0.11, to smooth out these discontinuities. These data sets were created using the same process as in [85, 38]. 5.2.2 Description of QN/NMS and CMA-ES Methods To estimate the parameter values of a model that best fit experimental data, one uses an objective function to measure the quality of fit. Parameter estimation algorithms start with an initial guess and iteratively generate new estimates seeking to minimize the error. Different parameter estimation techniques employ different iteration processes, depending on the objective function and parameter constraints. Often the sum of square errors (SSE) between the models prediction and experimental data, as used in Chapter 4, is employed. It was previously shown that the two measures provide similar model sensitivity to parameters in thermodynamic-based models [24]. Here, we use GEMSTAT to perform the optimization 115 using SSE as the objective function. SSE is defined as: n yp (j) − ye (j) SSE = 2 j=1 where n is the number of data points and yp and ye are the model’s prediction and experimental data respectively. 5.2.2.1 QN/NMS Method We employed the quasi-Newton and Nelder-Mead simplex methods in an alternating fashion as a local search strategy, as described in [38]. The quasi-Newton method tries to find the parameter set that makes the gradient of the objective function zero. The method uses the first and second derivatives (gradient and Hessian) of the objective function to determine the search direction. For this reason, the objective function, its gradient and Hessian need to be evaluated. On the other hand, the Nelder-Mead simplex method uses simplexes (geometrical shapes that have N + 1 vertices and all edges between them in an N dimensional space) in the space of parameters. At each step the function is evaluated at the vertices of the simplex. The search continues iteratively by shrinking the simplex size (if possible) toward the lowest scoring vertex until one of the stopping criteria of the problem is satisfied. We used a tolerance for the change in function value of 10−8 , resulting in approximately 400 iterations. The QN/NMS algorithm is available as part of the GEMSTAT code [38]. 5.2.2.2 CMA-ES Method We selected the global parameter estimation algorithm, CMA-ES (Table 5.2), in this study because of its superiority over other algorithms on different test problems [49, 36]. In this 116 Table 5.2: A pseudo-algorithm for CMA-ES. Initialization Choose initial population of parameter sets, distribution mean, and step-size. Set the initial Covariance Matrix to the Identity and evolution paths to 0 Do Sample offspring parameter sets from a normal distribution Selection and Recombination calculates new weighted mean, recombines offspring, and selects µ best individuals Update step-size and Covariance Matrix While termination criteria has not been met method, at each iteration, new candidate parameter vectors are selected from a multivariate normal distribution centered around the weighted mean of the previously generated sample vectors. The CMA-ES introduces an intermediate recombination with derandomized covariance matrix adaptation, which increases the probability of sampling a parameter vector resulting in a lower objective function value. This approach also controls the stepsize to prevent premature convergence on local minima [49, 36]. Again the tolerance used for the change in function value was 10−8 , resulting in the method running for between 100 and 10000 generations. The C-source code for the CMA-ES algorithm is available at http://www.lri.fr/~hansen/cmaes_inmatlab.html#cpp. For the results shown in this study, we combined this source code with GEMSTAT during compilation (available upon request). 117 Figure 5.1: Illustration of the expression patterns for synthetic data set 20 obtained by reverse engineering of a gene regulatory network by the Direct model. The synthetic target data used for the parameter estimation is created by simulating the model with the random parameters obtained as explained in the Methods and Calculations. We created the expression patterns using the parameters obtained by the CMA-ES and QN/NMS methods. The black tick marks are the synthetic target data and the blue (QN/NMS) and red (CMA-ES) lines correspond to the simulated patterns. The y-axis gives the relative gene expression level and the x-axis corresponds to the anterior-posterior (A-P) axis of the embryo. The final SSE is 0.001 and 13.0 for CMA-ES and QN/NMS respectively. 118 Figure 5.1: (cont’d) 119 Figure 5.1: (cont’d) 120 5.3 Results To compare the effectiveness of two parameter estimation strategies, we used both synthetic and experimental spatial gene expression data for 36 transcriptional enhancers active in the Drosophila embryo for thermodynamic-based modeling of gene expression. The synthetic data does not reflect actual patterns of expression, but mimics the spatially varied expression of mRNA typically observed for developmental gene activity. The experimental data represents actual mRNA measurements obtained from blastoderm embryos using in situ hybridization [84]. We compared the CMA-ES (global) and the QN/NM simplex (local) method on 30 synthetic data sets of varying quality: continuous (highest quality), perturbed (noise introduced at increasing levels), binary, and interpolated for each version of the model. Each parameter estimation strategy was run five times; the run with the lowest SSE was taken as the best fit. We tested six thermodynamic-based models in the GEMSTAT family; the results shown in Figures 5.2-5.5 were obtained using the Direct model. The results from other models were qualitatively very similar(data not shown). An example of the continuous synthetic gene expression data, and the expression patterns predicted by the Direct model using the CMA-ES and QN/NMS methods with this data set are shown in Figure 5.1. The minimum SSE values for the Direct model obtained by CMA-ES and QN/NM on the 30 continuous synthetic data sets are shown in Figure 5.2A. The CMA-ES method gave smaller SSE values than the QN/NM simplex method on most continuous data sets, indicating that global search was consistently outperforming the local method. Experimental biological data sets are not typically continuous high quality data sets, but contain extrinsic or intrinsic noise reflecting experimental and biological variation. We therefore compared these two methods on perturbed, interpolated and binary synthetic gene expression data 121 Figure 5.2: Performance comparison of the CMA-ES and QN/NMS methods on the Direct model. The minimum SSE values are plotted for data sets of varying quality: A)continuous, B) perturbed, C) binary, and D) interpolated. The analysis has been done on 30 synthetic data sets. For each data set, two SSE values are shown; CMA-ES (red circles) and QN/NMS (blue triangles). It has been observed that the CMA-ES method performs better than the QN/NMS method on continuous data. However, the choice of parameter estimation technique has no observable effect on lower quality data sets. The horizontal line in panel A) signals a change in scale of the vertical axis. 122 (Figure 5.2B, C, and D). We observed that unlike the differences observed for continuous data, there is no discernible difference between the two methods when we evaluated lower quality data sets. This trend was also observed for the other GEMSTAT models (not shown). Low SSE values alone do not demonstrate the success of a parameter estimation strategy because the objective function used can be multimodal. For this reason, we defined a new comparison score, the parameter SSE, which quantifies the distance between the predicted parameters and true parameter values used to generate the synthetic data sets. As it did for overall SSE values, the CMA-ES method gave lower parameter SSE values than did QN/NMS in the case of continuous synthetic data (Figures 5.2A and 5.3A). We found that both estimation strategies did worse on noisy data (Figure 5.3B) and failed to reach the true parameter values in the case of low quality data sets (Figure 5.3C and D). Again, a similar trend was observed for all other GEMSTAT models (not shown). The overall parameter SSE provides a global view of how well known parameters can be rediscovered, but previous studies have noted that the parameters involved in many biological models have varying sensitivities [34, 6, 24]. For thermodynamic-based models, different sensitivities drastically effect the ability of estimation strategies to accurately fit parameter values [24]. Very sensitive parameters are more likely than less sensitive parameters to be fit well regardless of the parameter estimation method implemented. For this reason, parameter-wise comparison will also help to illustrate the robustness of each method. To check the fitness of each individual parameter, we therefore compared the CMA-ES and QN/NMS methods parameter-wise. Figure 5.4 illustrates the estimated parameter values for the Direct model on a representative continuous synthetic data set shown in Figure 5.1. We conclude that the CMA-ES method is more robust than the QN/NMS method due to 123 Figure 5.3: Comparison of the parameter values obtained by the CMA-ES and QN/NMS methods on the GEMSTAT Direct model. The minimum SSEs between the true parameter values and those obtained by the CMA-ES and QN/NMS methods are plotted for data sets of varying quality: A)continuous, B) perturbed, C) binary, and D) interpolated. The analysis has been done on 30 synthetic data sets. For each data set, two SSE values are shown; CMA-ES (red circles) and QN/NMS (blue triangles). It has been observed that the CMA-ES method more accurately predicts the true parameter values on continuous data. However, the choice of parameter estimation technique has no observable effect on lower quality data sets. The horizontal lines signal a change in scale of the vertical axis. 124 Figure 5.4: Scatter plot of parameters for synthetic data set 20 using CMA-ES and QN/NMS methods on the Direct model. Each method was run five times. The black diamonds represents the target parameters, and the red circles and blue triangles represent parameters obtained by CMA-ES and QN/NMS methods respectively. The CMA-ES method more accurately predicted the true parameter values on continuous data. The horizontal lines signal different scales of the vertical axis. For the CMA-ES and QN/NMS methods respectively, the mean standard deviation in parameter values found is 64.11 and 80.29. 125 Figure 5.5: Comparison of the relative error (%) for the estimated parameters of synthetic data set 20 using CMA-ES (red circles) and QN/NMS (blue triangles) methods. The CMAES method predicts the true parameter values with lower relative error on continuous data. the lower variance of parameter values found using CMA-ES on continuous synthetic data. Similar clustering was observed when the two methods were compared on other data types (results not shown). The magnitude of different estimated parameters vary widely, therefore we calculated the relative parameter errors to create a more comparable measure. Figure 5.5 shows these relative parameter errors for the Direct model on the representative continuous synthetic data set shown in Figure 5.1. The relative parameter errors obtained using the CMA-ES method are smaller than those obtained using the QN/NMS method for almost all parameters. Similar results were obtained for other synthetic data sets, and the other GEMSTAT models 126 Figure 5.6: Sensitivity analysis for the Direct model. The blue, green, and red bars represent the parameter first-, sum of second-, and sum of first- and second-order sensitivities calculated using the HDMR algorithm respectively. The objective function used was the SSE between model predictions and synthetic data set 20, whose predictions are shown in Figure 5.1 and best fit parameters shown in Figures 5.4 and 5.5. Observe that the model is much more sensitive to perturbations in the fourth parameter, one of the parameters associated with Caudal, than any other parameter in the model. For other parameters, we see large second-order sensitivities, corresponding to a high level of parameter compensation. 127 (data not shown). This reinforces the conclusion that the CMA-ES method is more accurate and robust than the QN/NMS method on high quality data. The success of the two methods at estimating ‘true’ parameters varied among parameters. To gain insight on those parameters that were fit more accurately using the CMA-ES method, we conducted a sensitivity analysis on the Direct Model using this same continuous synthetic data set and the HDMR algorithm used in Dresch et al. 2010 (Figure 5.6) [24]. We observed that the model showed highest sensitivity to perturbations in the fourth parameter, a parameter for weighting TF activity with the BTM associated with the Caudal TF. The high sensitivity may be due to the large number of Caudal binding sites on the enhancers used, making Caudal activation potentially more significant factor for expression. For other parameters, we see non-zero model sensitivity values emerging primarily from second-order terms, implying that there is a high level of parameter compensation. The large amount of parameter interdependence makes accurate parameter estimation very difficult. For the sixth parameter, a parameter for weighting TF activity with the BTM associated with the Giant TF, the CMA-ES method resulted in higher relative parameter errors, but here the model sensitivity is very low. This parameter does not impact the model very much, hence the inability of the global method to locate the proper parameter value may not be reflective of the estimation method as much as it is reflecting an inherent bias caused by the structure of the model. The CMA-ES method did result in lower relative parameter errors on parameters that were found to have higher model sensitivities. There was no observable difference in relative parameter error levels on the lower quality data sets (results not shown). Many gene regulatory modeling studies have used relatively low quality data sets to infer parameters that were then used to interpret biological processes. Our analysis suggests that 128 Figure 5.7: Scatter plot of the parameters obtained for experimental data using CMA-ES and QN/NMS methods on the Direct model. The red circles and blue triangles represent parameters resulting in the minimum SSE when CMA-ES and QN/NMS methods are applied respectively. It has been observed that, although SSE values were comparable (CMA-ES SSE = 0.2760 and QN/NMS SSE = 0.2767), for the most sensitive parameters the two methods results in different values. The horizontal line signals a change in scale of the vertical axis. 129 it may be difficult to obtain correct parameter values using models such as GEMSTAT with low quality data. We therefore analyzed the original data set used in [38] which was based on qualitative spatial expression data of patterning genes in embryos. We compared the parameter values and SSEs obtained by the CMA-ES and QN/NMS methods on this data set for each of the six GEMSTAT models. The resulting parameter values were similar for many of the parameters and the two methods had comparable overall SSE values, but the two approaches were considerably different for the sensitive Caudal parameter (Figures 5.6 and 5.7). This model is highly sensitive to perturbation in this particular parameter, therefore a choice between two apparently equivalent parameter estimation strategies may result in extremely different biological interpretations about key features of this system. 5.4 Discussion The demands on modeling will grow as we seek to analyze the huge genome-scale data sets created with current experimental techniques. Modeling studies will continue to estimate parameter values, making the choice of strategy a crucial step in modeling. Here we have shown that when applied to the GEMSTAT model a slow stochastic parameter estimation strategy, CMA-ES, was necessary to find the correct parameter values when using high quality data sets. On low quality data sets, it was alarming to observe the failure of both CMA-ES and QN/NMS algorithms. These results may not be generalizable to all mathematical models. Since many models used in biological applications are highly nonlinear with multimodal objective functions, once a reasonable model is created for a biological system with experimental data, one should create an objective function for the problem and test several different flavors of parameter estimation techniques. 130 The CMA-ES method used in this study provides an alternative to the QN/NMS method used in the original study [38]. In the above analysis we have shown that the CMA-ES method is more reliable than the QN/NMS method on continuous synthetic data. The SSE values were clearly lower for continuous data, and parameter values have been captured with high accuracy (Figures 5.2 and 5.3). We observed relative error levels below 1% for most parameters estimated using continuous data (Figure 5.5). In spite of its accuracy on high quality data, the computational effort required is on the order of hours (all computations were performed using a node on the MSU HPCC platform running Linux). However, we should mention that the convergence criteria used in this study was very tight and this method could be written in a parallel version. For the perturbed, interpolated and binary data sets the selection of the parameter estimation strategy was not critical, which suggests that the use of low quality data sets for modeling studies is prone to error. When we compared these two methods on the low quality experimental data set used in [38], we observed that the qualities of fit were similar (Figure 5.7). However, the parameter values found were different, especially the values corresponding to the parameter with the highest model sensitivity. This suggests that the interpretation of parameter values obtained in modeling studies using low quality data must be viewed with caution. Experimental validation of findings in modeling studies is absolutely essential. For a modeling study to capture the correct biological behavior of a system, the parameter estimation strategy selected should be accurate, robust and computationally efficient. To determine the best strategy one should create synthetic data of a quality similar to that in which they plan to use and compare the performance of several methods on their model. It has been noted in previous studies that there is no best parameter estimation technique 131 that will work for all model types [60, 59]. It is the responsibility of the modeler to choose one algorithm that will help them to get the proper insights from the experimental data. Unfortunately, the field of parameter estimation is not a very well-established area for gene regulation models. Many thermodynamic-based modeling studies have been carried out with data and methods that have not been well suited for such studies. This might have resulted in very poor estimates of the parameter values, which in turn have lead to incorrect biological interpretations. We hope that this study will help those working in biological modeling, in particular thermodynamic-based modeling, to see the importance of high quality data production and parameter estimation strategy selection. 132 Chapter 6 Thermodynamic-based Modeling: Analysis of Biological Data Sets 6.1 Understanding the Structural Rules of Enhancer Function To investigate what features dictate where, how and in what order TF binding sites should be positioned for optimum enhancer function, we chose to study the enhancer responsible for the expression of the rhomboid (rho) gene in early Drosophila development. Activation of rho in the presumptive neurogenic ectoderm by the cooperative action of two TFs, Dorsal and Twist [42]. It is repressed in the mesoderm by a zinc finger protein, Snail. Previous studies have identified footprinted binding sites for the above three TFs on an approximate 300-bp region 1.7 kb upstream of the transcriptional start site for rho. In addition to rho, Dorsal, Twist and Snail coordinate the expression of about 100 genes involved in dorsal-ventral patterning in early Drosophila development [74, 101]. So, insights obtained from analysis of 133 this well-characterized enhancer are likely to shed light on mechanisms of enhancer function for a variety of genes involved in this gene regulatory network (GRN). These characteristics make this enhancer a suitable platform for quantitative studies. 6.1.1 Reporter Gene Constructs and Confocal Image Dataset The 318-bp rho neuroectodermal enhancer [42, 83] was cloned and the footprinted binding sites for Dorsal, Twist and Snail, as well as two sites known to be bound by bHLH factors [42], were mutated one at a time as well as in combination with one another, resulting in a set of 38 reporter gene constructs (Figure 6.1). A total of 935 embryo images were taken, with a minimum of 10 images per construct. Late stage five embryos were used for analysis, using eve expression to determine embryo age [92]. Embryos were then selected based on their rotation, with the appropriate number of pixels in the dorsal region of the embryo for background estimation and the appropriate number of pixels in the ventral region of the embryo to obtain information on the repression domain. The image processing techniques used are those described in the Appendix, section A.3. 6.1.2 Structure of Models Tested A general thermodynamic-based model, as described in Chapter 2, was constructed in C++. At runtime, the user has the option to choose distance(in base pairs between binding sites)dependent definitions of TF cooperativity and repressor quenching ability. Cooperativity and quenching terms were incorporated into the model as described in section 2.3.1 and the treatment of overlapping binding sites was described in section 2.3.2. 134 Figure 6.1: Structures of reporter genes assayed. Green, blue, and red circles denote annotated binding sites for Dorsal, Twist, and Snail respectively. X’s depict binding sites that were mutated. 135 Figure 6.1: (cont’d) 136 Cooperativity among TFs is known to influence gene expression, and previous studies have uncovered distance-dependent cooperativity among TFs [35, 2]. Quenching is the effect that repressor proteins exert on DNA or chromatin to negate the effects of activators [4, 31]. It can be mediated by post-translational modifications of histones to maintain ‘repressive heterochromatin’ [52, 70]. Previous studies have implicated repressors acting by different mechanisms to block activators [4, 31, 52, 70, 26, 38]. Several different schemes involving TF cooperativity and repressor quenching were implemented in our modeling effort, where we tested different hypotheses regarding the biochemical mechanisms of enhancer function. 137 Figure 6.2: Continuous Functions used for Quenching and Cooperativity. These functions are the functions used in different schemes of the model, to describe the effects of a repressor’s quenching ability on a bound activator or the cooperativity 2 between two bound TFs as functions of the distance between the binding sites. A) f (d) = a + bd, B) f (d) = ae−d /b , and 1 C)f (d) = a . When implemented, a = 1 and b is a model parameter for quenching functions, and a and b are both d/b e model parameters for cooperativity functions. 138 Figure 6.3: Quenching Function Derived from Fakhouri et al. Analysis. This function was fit using the six quenching values, shown in red, found in [26] to a fifth degree polynomial, f (d) = −0.000000004249d5 + 0.0000010199d4 − 0.00010199d3 + 0.0050099d2 − 0.10731d + 1.0 for d ≤ 100 and assumed to be 0 for d > 100, shown in blue. When implemented, this function is used directly, with no parameter values corresponding to quenching. For the quenching term in our model, we used three continuous functions (Linear, Logistic and Exponential decay) to describe the effect of repressor quenching in relation to increasing distance to bound activator binding sites (Figure 6.2). Since a linear function was used previously in a thermodynamic-based modeling study, we began with this formulation [75]. We included schemes with functions displaying logistic and exponential decay to represent different biological assumptions as well. Functions illustrating high quenching activity over some distance may correspond to a situation in which cofactors allow flexibility in the spacing of binding sites, and quenching activity that is rapidly decreased as the distance between binding sites is increased may correspond to direct binding between the two proteins causing little flexibility in the spacing of binding sites. Note that the functions in Figure 6.2 were allowed to stretch horizontally, representing quenching over larger distances, but the quenching ability at a distance of 0 bp is assumed to be at a maximum of 1.0, representing a state that is not contributing to expression due to the maximal quenching ability of the bound repressor. We also used the non-monotonic quenching function derived from the pre139 Figure 6.4: Binned Functions used for Quenching and Cooperativity. These step-functions are representatives of those used in various model schemes to describe the effects of a repressor’s quenching ability on a bound activator and cooperativity between two bound TFs as step-functions of the distance between the binding sites. A) Quenching function with four bins, B) Cooperativity function with three bins. When implemented, no quenching is allowed beyond the distance of the last bin, thus the function value is assumed to be 0 beyond 100 bp in A). vious analysis of short-range repression in synthetic enhancer constructs (Figure 6.3) [26]. In addition, we fit different quenching parameters (Q) to construct step-functions, as was done in [26], grouping quenching parameters in different ‘bins’ of distances between repressor sites and activator sites. Each bin corresponds to a single quenching parameter, which is used as the value of the step-function for the distances assigned to that bin (Figure 6.4A). These schemes allow for a non-monotonic quenching function to arise, illustrating a preferred distance between binding sites. The binned quenching schemes are described as follows, with no quenching allowed beyond the distance of the last bin. The distances between binding sites were calculated from the center of each binding site. Scheme 1: Q1: 0-25 bp, Q2: 26-50 bp, Q3: 51-75 bp, Q4: 76-100 bp Scheme 2: Q1: 0-35 bp, Q2: 36-70 bp, Q3: 71-105, Q4: 106-140 bp Scheme 3: Q1: 0-45 bp, Q2: 46-90 bp, Q3: 91-135, Q4: 136-180 bp Scheme 4: Q1: 0-10 bp, Q2: 11-20 bp, Q3: 21-30, Q4: 31-40 bp, Q5: 41-50 bp, Q6: 51-60 140 bp, Q7: 61-70 bp, Q8: 71-80 bp, Q9: 81-90 bp, Q10: 91-100 bp In each case, two quenching functions were fit, one representing the quenching ability of Snail on Dorsal sites and one representing the quenching ability of Snail on Twist sites. As in the case of quenching, three different continuous functions (Linear, Logistic and Exponential decay) were used to represent cooperativity (Figure 6.2). Note that the functions in Figure 6.2 were allowed to stretch horizontally and vertically, representing cooperativity over larger distances as well as increased cooperative ability. In each case, two cooperativity functions were fit, one representing heterotypic cooperativity (interactions between different TFs, e.g., Dorsal and Twist) and one representing homotypic cooperativity (interactions between the same TFs, e.g. Dorsal binding to two different sites on an enhancer). We also fit different cooperativity parameters (C) to construct step-functions, grouping cooperativity parameters in different ‘bins’ of distances. This again was done in two different ways: distances were considered either between binding sites for different proteins, representing heterotypic cooperativity, or the same protein, representing homotypic cooperativity. Each bin corresponds to a single cooperativity parameter, which is used as the value of the step-function for the distances assigned to that bin (Figure 6.4B). In addition, we considered the possibility of protein-dependent functions, using ‘binned’ distances in four different ways: distances were considered either between two Dorsal sites, two Twist sites, two Snail sites, or one Dorsal and one Twist site. Scheme 1: C1: 0-25 bp, C2: > 25 bp Scheme 2: C1: 0-50 bp, C2: > 50 bp Scheme 3: C1: 0-75 bp, C2: > 75 bp Scheme 4: C1: 0-50 bp, C2: 51-100 bp, C3: 101-150 bp, C4: > 150 bp 141 Scheme 5: C1: 0-60 bp, C2: 61-120 bp, C3: 121-180 bp, C4: > 180 bp Scheme 6: C1: 0-70 bp, C2: 71-140 bp, C3: 141-210 bp, C4: > 210 bp 6.1.3 Model Performance and Parameter Estimation Model performance was measured using three different criteria: root mean square error (RMSE) of data and model predictions [26], CC (correlation coefficient) between data and model predictions [38], and AIC (Akaike Information criterion) which was used to compare performance of different models while compensating for the different number of parameters in models [18]. RMSE, CC and AIC are defined as: 1 n RMSE = n CC = n n j=1 n yp (j) − ye (j) j=1 n j=1 yp (j) ye (j) − yp (j) 2 − 2 2 n j=1 yp (j) n j=1 yp (j) n n j=1 ye (j) 2 n j=1 (ye (j)) − 2 n j=1 ye (j) AIC = 2nln (RMSE) + 2 (k + 1) where n is the number of data points, yp and ye are the model’s prediction and experimental data respectively, and k is the number of model parameters. The global parameter estimation strategy (CMAES), which was discussed in Chapter 5, was applied to estimate the parameters [36]. Root mean square error was used as a measure of model performance. All possible pair-wise combinations of cooperativity and quenching functions described above were considered, resulting in 120 total model schemes. Due to the stochastic nature of the starting point of simulations in CMAES, estimations were run five 142 Figure 6.5: Model performance for 120 model schemes. Each box represents the RMSE for a single model scheme, with the row representing the cooperativity function and the column representing the quenching function used in this scheme. times each, and the parameter set resulting in the lowest RMSE was used in analyzing each model’s performance. 6.1.4 Preliminary Results To systematically test various hypotheses, we fit parameters to all of our 120 model schemes and ranked schemes according to their performance on the whole dataset, RMSE. The performance of all 120 model schemes is shown in Figure 6.5. Note that most schemes resulted in an RMSE∈ [0.1, 0.11]. Although we observed little variation in RMSE, we still were able 143 Figure 6.6: Different Measures of Model Performance for 18 model schemes. Each box represents the ranking for a single model scheme, with the row representing the model scheme and the column representing the fitness measure used for ranking. to rank the model schemes according to this measure of fitness. To further investigate this, we reduced the number of model schemes to analyze down to 18: the one worst, eight average, and nine best performing model schemes with respect to RMSE. To test whether the small variation between model performance holds when other measures of model fitness are used, we calculated the CC and AIC for these 18 model schemes as well (Figure 6.6). We found that although the ranking varies slightly when we change the measure of model fitness, models remain in the same grouping of worst, average, and best performing models. We can thus conclude that these groups represent the best, average, and 144 Figure 6.7: Construct Performance for 18 model schemes. Each box represents the RMSE for a single model scheme on a single construct, with the row representing the model scheme and the column representing the construct. worst performing model schemes. Recalling that the RMSE is a measure of overall fit, over all data points obtained from 38 constructs, one may ask the question of how well each model scheme fits individual constructs. To address this question, we calculated individual RMSEs for each of the 38 constructs and compared these for our 18 model schemes (Figure 6.7). We found that there are approximately five constructs in which most model schemes are fitting poorly, while the remaining constructs seem to be fit relatively consistently compared to other construct fits for a single model scheme. This could be due to the definition of binding sites that we are 145 using, a high level of uncertainty in the experimental data corresponding to these poorly fit constructs, or this could reflect the limits of this modeling approach and the fact that certain important transcriptional features, such as chromatin modifications and nucleosome positioning, have been ignored. We are currently exploring the first of these possibilities by implementing the model with new definitions of binding sites, using different TF binding preferences (PWMs) derived from various different experimental data sets [12, 65, 66, 56, 105]. 6.2 Understanding the Dynamic Effects of Redundant Enhancers Redundant, also termed ‘shadow’, enhancers located at varying distances form the transcriptional start site have been found to contribute to the expression of multiple genes playing critical roles during development [101, 40]. These enhancers have been shown to have an important contribution to the robustness in gene expression in response to environmental or genetic perturbations [71]. It has been reported that many Dorsal-Ventral patterning genes have multiple binding clusters, suggesting that redundant enhancers may have an extremely important role in the expression dynamics of these genes during development [101, 40]. Two particular genes that have been tested and shown to contain these redundant enhancers are twi and sna [46, 71]. For this reason, we believe that including these enhancers in any dynamic model of gene regulation is an essential step in improving our understanding of the dynamics of GRN’s such as that containing twi, sna, and rho. 146 6.2.1 The Addition of Redundant Enhancers to Dynamic Models of Transcriptional Regulation Incorporating more than one enhancer into the dynamic model of transcriptional regulation described in Chapter 3 involves a modification to the transcriptional term, Sa , for all a ∈ M R such that a represents a gene with multiple known enhancers. Recall from Chapter 2 that Sa,i is the fraction of successful states of gene a in nucleus i. Thus, to consider multiple enhancers, E1 , . . . , Es , we assume that the rate of transcription is proportional to the fraction of combined successful states, where a ‘combined successful state’ is defined as any combined enhancer state with at least one enhancer in a successful state. Assuming that each enhancer acts independently, by the inclusion-exclusion principle, also to referred as the SchuetteNesbitt formula in probability theory, this implies that:  s Saα ,i − Sa,i = χa  α=1 Saα ,i Saβ ,i 1≤α≤β≤s  (6.1) Saα ,i Saβ ,i Saγ ,i − · · · + (−1)s−1 Sa1 ,i Sa2 ,i . . . Sas ,i  + 1≤α≤β≤γ≤s Another way to consider incorporating multiple distant enhancer regions is to treat them them all as one single enhancer, with a very large distance between binding sites. In this case, a reasonable biological assumption to make is that this large distance physically prevents interactions from occurring between bound proteins on these different enhancer regions. Note that this is equivalent to the assumption that each enhancer region acts independently. Thus, calculating the fraction of successful states of this very large enhancer will result in exactly the same formula as in equation 6.1. 147 Using the above formula, incorporating redundant enhancers into the model described in Chapter 3, is not only possible, but computationally feasible as well. The difficulty in the addition of redundant enhancers is only in the amount of knowledge now needed for each gene. One must know where these enhancers are located in the genome in order to determine what DNA sequence(s) and/or binding site(s) should be included when running the model. For the network including twi, sna, and rho, a great deal of work has been done by the Levine lab to determine whether such regions exist and where they are located within the genome [101, 40, 46, 71]. My current research on this project includes incorporating redundant enhancers when fitting the model to the experimental data and testing whether the derivation in equation 6.1 is sufficient to improve our predictions or whether the assumption that these enhancers act independently has oversimplified the complex nature of these genes. 148 Chapter 7 Closing Remarks Thermodynamic-based models provide an effective way to link the structure of cis-regulatory elements to expression levels. The current interest in this field suggests that the demand for thermodynamic-based models will grow as biologists create more genome-level data and strive to achieve a quantitative systems-level understanding of the molecular physiology. Unfortunately, there are still many challenges to this modeling approach that should be addressed. Challenge 1: There are many aspects of thermodynamic-based models that can be improved upon with the addition of distinct types of data that reflect processes important to transcriptional regulation. These include chromatin remodeling and nucleosome positioning in eukaryotes, and the presence of transcriptional cofactors in both eukaryotes and prokaryotes. For example, chromosome remodeling and nucleosome positioning can effect transcription in multiple ways, such as reducing the affinity of TF binding sites or competing with TFs for their binding sites [38, 86]. These nuances can easily be added to thermodynamic-based models 149 by adding new binding configurations or modifying existing states statistical weights [86]. Challenge 2: Another outstanding challenge is in the definition of functional binding sites. Current bioinformatic techniques produce many false positives and negatives. An increasing number of whole-genome ChIP-chip studies have been published recently [101]; more comprehensive genome-wide studies to determine the binding sites of TFs are needed to close this knowledge gap [64, 102, 44]. It is difficult to predict where regulatory sequences lie, therefore even with good knowledge of TF binding sites, knowing which regions of DNA to incorporate in models is also a large hurdle which is something that needs further exploration in the future. Challenge 3: Current model assumptions of thermodynamic-based equilibrium, the types of interactions between bound TFs, the correlation between RNAP binding to the promoter and gene expression, and the presumed similar activities of TFs on disparate enhancers, should all be validated experimentally or modified accordingly. Challenge 4: Lastly, adding temporal dynamics by introducing time-dependence can be extremely beneficial when studying genes involved in dynamic processes such as development. However, there is still a lack of quantitative data over specific time frames. New experimental techniques, such as fluorescent protein tagging combined with live imaging, are currently being developed to provide us with real-time images of gene expression in living organisms [55]. These new data sets will be crucial in accurately fitting differential equation models of GRNs 150 consisting of genes with highly dynamic expression. For a molecular understanding of discrete transcriptional systems, thermodynamic-based modeling offers the most promising approach to deciphering gene regulation at the DNA level. Incorporating such models into systems of differential equations provides the additional ability to model the dynamics of a developing organism. This has been my major contribution to the field of eukaryotic gene regulation. The model I have developed and implemented has made possible the in-depth exploration of gene regulatory networks and the cis-regulatory modules underlying the precise expression patterns that arise as a result of them. The analysis I have done on others’ modeling efforts has contributed to our understand of thermodynamic-based models and allowed us to reach this point. The topics listed above represent outstanding challenges for researchers in the area of quantitative gene regulation and future efforts should be able to guide us to even more sophisticated and powerful models. 151 APPENDICES 152 Appendix A Data Processing A.1 Introduction When fitting a mathematical model, regardless of the field in which the model is being applied to, the quality of the data used is of tremendous importance. For models of transcriptional regulation, both spatial and temporal protein (TF) and mRNA concentrations need to be considered. To obtain the most accurate quantitative measurements of these concentrations, one needs to have a thorough understanding of where the data is coming from and what procedures can be implemented to remove any background or noise from the data set. Using data from confocal images in modeling gene expression levels presents the researcher first with the challenge of translating the data from stacks of eight bit images to spatial information on actual levels of mRNA and protein. For thermodynamic-based modeling studies, such as Fakhouri et al. 2010 and Sayal et al. 2012, large sets of data obtained from a confocal microscope are used to determine a single set of model parameters. With each set of parameter values, the model itself outputs relative expression levels. For this 153 reason, noise and background are not the only aspects of the data to be considered during processing, but normalization of data sets is also a crucial consideration. The importance of the resulting data quality will be discussed in more detail in Chapter 4, where parameter estimation strategies are explored. Throughout the last twenty years, various studies have been performed on Drosophila using qualitative blue-white images, resulting in binary data sets [67, 95, 46, 80, 85, 38]. These data sets do not require much data processing since thresholding alone can give data values of 0 or 1. These binary data sets have been used for many years and are easily interpreted as a gene being (1) “on” or (0) “off”. The major drawback to using such data sets is that as enhancer regions become more complex, with more binding sites and more interactions taking place, a gene may not be “on” or “off”; genes have been shown to have varying levels of “on”. For mathematical models to capture the subtle interactions occurring at the DNA level that result in genes producing mRNA concentrations at varying levels, they need to be fit to varying levels of gene output. Clearly, qualitative data resulting in binary, or discrete, measurements would not suffice for model fitting. With new experimental techniques, quantitative confocal imaging has allowed researchers to obtain detailed measurements of relative mRNA and protein concentrations [105, 43, 45, 26]. Before using these new data sets to extract biological information about the system of interest, one needs to first process the data to adjust for experimental variation. Without processing, this variation could cause many problems in interpreting data, ie: if an image has a high amount of background signal, the high values might lead one to believe that the gene is more strongly expressed due to increased activation. The following gives a fairly comprehensive overview of the data processing techniques used on the data sets of [26] and 154 [82]. A.2 Fakhouri et al. 2010 data set Confocal images of Drosophila embryos are first saved as stacks, which can easily be opened and modified using the common biological software package, ImageJ. Each image within a stack is a cross-section of the embryo. In order to obtain a single image from a stack, the first step in data processing is to project the images in the stack, using the “Max Z Project” function in ImageJ. The projection is created using the maximum value over all images in the stack at each pixel location. The processor can then save the image, as an eight bit Tiff, which can be imported into a program such as Matlab, for further processing. These eight bit images then go through a series of processing steps, all of which have been automated in Matlab. (All Matlab codes described in this section were written by Ahmet Ay, Evan Dayringer, and Jacqueline M Dresch.) The first of these steps is an outlier removal process. Outliers in an image are those regions of the image with spots of high intensity where no actual gene expression is observed. These are typically simple artifacts of experimental error, and thus need to be removed from the image before processing steps such as background calculation, smoothing, and normalization. After any outliers are removed, images are rotated into the proper orientation. The orientation used is a standard orientation so that each embryo is viewed with the anterior to the left, posterior to the right, dorsal to the top, and ventral to the bottom. Once images are in the proper orientation and any non-specific signal has been removed, images are background subtracted. The purpose of background subtraction is to calculate the average intensity of the image in areas where no gene expression is observed, areas of 155 nonspecific fluorescence, and then reset this background signal to 0. In most scenarios, where background is assumed uniform throughout the embryo, this is done by subtracting this average background intensity from the entire image. In the case of certain proteins, such as the Giant protein which was used in the Fakhouri et al. 2010 study, it has been observed that the background signal is parabolic. The origin of this parabolic shape is not well known, although it is most likely due to light scattering during imaging and has been shown to correlate with the degree to which the embryo has been flattened prior to imaging [9]. For this case, background subtraction is performed by fitting a parabola to a DorsalVentral strip of the background region and then subtracting that parabola from the entire embryo, along the Anterior-Posterior axis. The last two steps of data processing include smoothing the image and normalizing intensity values. Smoothing is done to remove any discrete artifacts of the image itself and normalization is done so that all intensity values are mapped to a common interval, [0,1]. The normalization of a set of images gives us the ability to compare images without imaging bias. It is a known fact that embryos can be stained slightly lighter or darker regardless of the mRNA or protein concentrations due to technical variation. Normalization allows us to minimize this variation between images and compare images side by side with only the biological variation from embryo to embryo remaining. A.3 Sayal et al. 2012 data set For the data set of [82], slight modifications were made to the process used in [26]. Most importantly, the images were much cleaner, so no outlier removal step was needed. The background subtraction as well as normalization steps were also done in a more intricate manner, after some very insightful observations. Smoothing of the image was not done, 156 although average signals over ten samples were taken from each embryo. Another important addition to this processing pipeline was a step marking the rotation of each embryo, to ensure proper comparison of expression patterns from embryo to embryo. The information that we need to obtain from this data set is the gene expression pattern from a strip taken toward the Anterior-Posterior center of the embryo along the Ventral-Dorsal axis. The details of these modifications follow. The first steps in processing, other than outlier removal, follow exactly as those used in [26]. The images are each projected in ImageJ and rotated to the standard orientation. For the background subtraction step, we first decided to check whether the background appeared constant or parabolic for a single strip of pixels taken along the Ventral-Dorsal axis. This was done by analyzing a set of images lacking the transgene of interest. These images, referred to as yellow-white images, were stained and imaged using the exact same protocol as all other images, although no lacZ reporter gene was present, so the resulting images contained no gene expression; their intensity values corresponded to only background staining. The yellow-white images showed a clear parabolic-shaped background when ten samples were taken uniformly from the region 40-60% Anterior-Posterior and plotted from Dorsal to Ventral (Figure A.1). Due to this parabolic background shape, all images in the data set have their background curve fit in the same way, using data from the region 10-40% Dorsal-Ventral, where the gene is not expressed. This quadratic is then subtracted from the entire embryo, along the Anterior-Posterior axis. Since all constructs should decrease expression intensity or express wild-type levels in Ventral regions of the embryo, for normalization of embryo intensities, we decided to nor- 157 Figure A.1: Six representative yellow-white images with quadratics fit to their intensity values. The blue lines represent ten lines taken uniformly from the region 40-60% AnteriorPosterior and plotted from Dorsal to Ventral. The red line represents the quadratic fit to the average of these ten lines from 10-40% Dorsal-Ventral. This region was chosen due to the fact that there should be no expression in this region regardless of whether the embryos contain the transgene, as in our core data set, or not, as in these yellow-white images. 158 Figure A.2: Variation in Signal Intensity on different imaging days. In panels A and B, each circle represents a single embryo’s mean intensity value. All embryos represented in panel A came from a single construct, which is a double activator knock-out, stained on the same day and then imaged on three different days. All embryos represented in panel B came from a different single construct, which is a triple repressor knock-out, again stained on the same day and then imaged on three different days. In both cases, there is a great deal of variation in signal intensity. In panels C and D, these mean values were averaged over all embryos stained and imaged on the same day. We observed a well-defined decreasing trend in mean intensity as the number of days between staining and imaging increased. 159 malize each embryo with respect to the average peak wild-type signal. As we began to look at the variation in wild-type signals, we noted a trend in signal intensity from one imaging day to another. This trend is shown in Figure A.2. Due to this variation in intensities from one imaging day to another, on each day of imaging at least two wild-type images are stained and imaged in parallel with other constructs. Then, each image is normalized to the average peak ( > 95%) wild-type signal on that image’s particular imaging date. This normalization procedure allows for images to be compared for a single construct imaged on multiple days, as well as to compare one construct’s intensity to another’s. Once normalization is complete, the image manipulation steps are concluded. The final step in image processing is then data extraction. Since the data set of interests contains hundreds of embryos, all in slightly different orientation, we had to be very careful in aligning and comparing data extracted from different embryos. This alignment was done by marking the rotation of each embryo, using the snail protein signal. Snail protein is known for it’s very sharp and precise border, forming a boundary between the mesoderm and neuroectoderm [42]. 160 BIBLIOGRAPHY 161 BIBLIOGRAPHY [1] GK Ackers, AD Johnson, and MA Shea. Quantitative model for gene regulation by λ phage repressor. PNAS, 79:1129–1133, 1982. [2] CC Adams and JL Workman. Binding of disparate transcriptional activators to nucleosomal dna is inherently cooperative. Molecular and Cellular Biology, 15(3):1405–1421, 1995. [3] R Albert and HG Othmer. The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. Journal of Theoretical Biology, 223:1–18, 2003. [4] DN Arnosti, S Gray, S Barolo, J Zhou, and M Levine. The gap protein knirps mediates both quenching and direct repression in the Drosophila embryo. EMBO Journal, 15(14):3659–3666, 1996. [5] UM Ascher and LR Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1998. Theorem 3.1. [6] M Ashyraliyev, J Jaeger, and JG Blom. Parameter estimation and determinability analysis applied to Drosophila gap gene circuits. BMC Systems Biology, 2:83, 2008. [7] M Ashyraliyev, K Siggens, H Janssens, JG Blom, M Akam, and J Jaeger. Gene circuit analysis of the terminal gap gene huckebein. PLoS Computational Biology, 5(10):e1000696, 2009. [8] A Ay and DN Arnosti. Mathematical modeling of gene expression: a guide for the perplexed biologist. Critical Reviews in Biochemistry and Molecular Biology, 46(2):137– 151, 2011. 162 [9] A Ay, WD Fakhouri, C Chiu, and DN Arnosti. Image processing and analysis for quantifying gene expression from early Drosophila embryos. Tnumber Engineering, 14(9), 2008. [10] A Babloyantz and M Sanglier. Chemical instabilities of “all-or-none” type in betagalactosidase induction and active transport. FEBS Letters, 23:364–366, 1972. [11] TL Bailey and M Gribskov. Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14(1):48–54, 1998. [12] CM Bergman, JW Carlson, and SE Celniker. Drosophila dnase i footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila melanogaster. Bioinformatics, 21(8):1747–1749, 2005. [13] J Bieler, C Pozzorini, and F Naef. Whole-embryo modeling of early segmentation in Drosophila identifies robust and fragile expression domains. Biophysical Journal, 101(2):287–296, 2011. [14] MD Biggin. Animal transcription networks as highly connected, quantitative continua. Developmental Cell, pages 611–626, 2011. [15] L Bintu, NE Buchler, HG Garcia, U Gerland, T Hwa, J Kondev, T Kuhlman, and R Phillips. Transcriptional regulation by the numbers: applications. Current Opinion in Genetics and Development, 15:125–135, 2005. [16] DM Bortz and PW Nelson. Sensitivity analysis of a nonlinear lumped parameter model of hiv infection dynamics. Bulletin of Mathematical Biology, 66:1009–1026, 2004. [17] NE Buchler, U Gerland, and Hwa T. On schemes of combinatorial transcription logic. PNAS, 100(9):5136–5141, 2003. [18] KP Burnham and DR Anderson. Model selection and inference: information-theoretic approach. Springer-Verlag, 1998. a practical [19] J Clune, HJ Goldsby, C Ofria, and RT Pennock. Selective pressures for accurate altruism targeting: evidence from digital evolution for difficult-to-test aspects of inclusive fitness theory. Proceedings of the Royal Society B: Biological Sciences, 278(1706):666– 674, 2011. 163 [20] M Coppey, AM Berezhkovskii, Y Kim, AN Boettiger, and SY Shvartsman. Modeling the bicoid gradient: diffusion and reversible nuclear trapping of a stable protein. Developmental Biology, 312:623–630, 2007. [21] J Crocker, Y Tamori, and A Erives. Evolution acts on enhancer organization to finetune gradient threshold readouts. PLoS Biology, 6:e263, 2008. [22] RI Cukier, CM Fortuin, KE Shuler, AG Petschek, and JH Schaibly. Study of the sensitivity of coupled reaction systems to uncertainties in rate coefficients. Journal of Chemical Physics, 59:3873–3878, 1973. [23] P Deuflhard and F Bornemann. Scientific Computing with Ordinary Differential Equations. Springer, 2002. page 124 and 3.2. [24] JM Dresch, X Liu, DN Arnosti, and A Ay. Thermodynamic modeling of transcription: sensitivity analysis differentiates biological mechanism from mathematical modelinduced effects. BMC Systems Biology, 4:142, 2010. [25] RS Erb and GS Michaels. Sensitivity of biological models to errors in parameter estimates. Pacific Symposium on Biocomputing, 4:53–64, 1999. [26] WD Fakhouri, A Ay, R Sayal, J Dresch, E Dayringer, and DN Arnosti. Deciphering a transcriptional regulatory code: modeling short-range repression in the Drosophila embryo. Molecular Systems Biology, 6:341, 2010. [27] EG Fernando and DR Luhr. Sensitivity analysis of predicted pavement performance. Technology Transfer and Training, 1200:32–41, 1988. [28] Y Fomekong-Nanfack, M Postma, and JA Kaandorp. Inferring Drosophila gap gene regulatory network: a parameter sensitivity and perturbation analysis. BMC Systems Biology, 3:94, 2009. [29] HC Frey and SR Patil. Identification and review of sensitivity analysis methods. Risk Analysis, 22:553–578, 2002. [30] J Gertz, ED Siggia, and BA Cohen. Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature, 457:215–218, 2009. [31] S Gray and M Levine. Short-range transcriptional repressors mediate both quenching and direct repression within complex loci in Drosophila. Genes and Development, 10:700–710, 1996. 164 [32] T Gregor, W Bialek, RR de Ruyter van Steveninck, DW Tank, and EF Wieschaus. Diffusion and scaling during early embryonic pattern formation. PNAS, 102(51):18403– 18407, 2005. [33] JA Griffith. Mathematics of cellular control processes. Journal of Theoretical Biology, 20:202–216, 1968. [34] RN Gutenkunst, JJ Waterfall, FP Casey, KS Brown, CR Myers, and JP Sethna. Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology, 3:e189, 2007. [35] SD Hanes, G Riddihough, D Ish-Horowicz, and R Brent. Specific dna recognition and intersite spacing are critical for action of the bicoid morphogen. Molecular and Cellular Biology, 14(5):3364–3375, 1994. [36] N Hansen and A Ostermeier. Convergence properties of evolution strategies with the derandomized covariance matrix adaptation: The (µ/µI , λ)-ES. EUFIT’97, 5th Europen Congress on Intelligent Techniques and Soft Computing, Proceedings, pages 650–654, 1997. [37] Q He, AF Bardet, B Patton, J Purvis, J Johnston, A Paulson, M Gogol, A Stark, and J Zeitlinger. High conservation of transcription factor binding and evidence for combinatorial regulation across six Drosophila species. Nature Genetics, 21:414–421, 2011. [38] X He, MA Samee, C Blatti, and S Sinha. Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression. PLoS Computational Biology, 6(9):e1000935, 2010. [39] T Hohm and E Zitzler. Modeling the shoot apical meristem in A. thaliana: Parameter estimation for spatial pattern formation. In E Marchiori, JH Moore, and JC Rajapakse, editors, Lecture Notes in Computer Science, volume 4447, pages 102–113. Springer, 2007. [40] JW Hong, DA Hendrix, and MS Levine. Shadow enhancers as a source of evolutionary novelty. Science, 321:1341, 2008. [41] B Ingalls. Sensitivity analysis: from model parameters to system behaviour. Essays in Biochemistry, 45:177–193, 2008. 165 [42] YT Ip, RE Park, D Kosman, K Yazdanbakhsh, and M Levine. dorsal-twist interactions establish snail expression in the presumptive mesoderm of the Drosophila embryo. Genes and Development, 6:1518–1530, 1992. [43] J Jaeger, S Surkova, M Blagov, H Janssens, D Kosman, KN Kozlov, Manu, E Myasnikova, CE Vanario-Alonso, M Samsonova, D Sharp, and J Reinitz. Dynamic control of positional information in the early Drosophila embryo. Nature, 430:368–371, 2004. [44] SA Jaeger, ET Chan, MF Berger, R Stottmann, TR Hughes, and ML Bulyk. Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites. Genomics, 95(4):185–195, 2010. [45] H Janssens, S Hou, J Jaeger, A Kim, E Myasnikova, D Sharp, and J Reinitz. Quantitative and predictive model of transcriptional control of the Drosophila melanogaster even skipped gene. Nature Genetics, 38:1159–1165, 2006. [46] J Jiang and M Levine. Binding affinities and cooperative interactions with bhlh activators delimit threshold responses to the dorsal gradient morphogen. Cell, 72:741–752, 1993. [47] L Jostins and J Jaeger. Reverse engineering a gene network using an asynchronous parallel evolution strategy. BMC Systems Biology, 4:17, 2010. [48] M Kazemian, Q Zhu, MS Halfon, and S Sinha. Improved accuracy of supervised crm discovery with interpolated markov models and cross-species comparison. Nucleic Acids Research, 39(22):9463–9472, 2011. [49] S Kern, SD Muller, N Hansen, D Buche, J Ocenasek, and P Koumoutsakos. Learning probability distributions in continuous evolutionary algorithms. Natural Computing, 3(1):77–112, 2004. [50] MM Kulkarni and DN Arnosti. cis-regulatory logic of short-range transcriptional repression in Drosophila melanogaster. Molecular Cell Biology, 25(9):3411–3420, 2005. [51] G Li, C Rosenthal, and H Rabitz. High dimensional model representations. Journal of Physical Chemistry, 105:7765–7777, 2001. [52] LM Li and DN Arnosti. Long- and short-range transcriptional repressors induce distinct chromatin states on repressed genes. Current Biology, 21(5):1–7, 2011. 166 [53] X Liang and J Guo. Intercomparison of land-surface parameterization schemes: Sensitivity of surface evergy and water fluxes to model parameters. Journal of Hydrology, 279:182–209, 2003. [54] G Lillacci and M Khammash. Parameter estimation and model selection in computational biology. PLoS Computational Biology, 6:e1000696, 2010. [55] MZ Ludwig, Manu, R Kittler, KP White, and M Kreitman. Consequences of eukaryotic enhancer architecture for gene expression dynamics, development, and fitness. PLoS Genetics, 7(11):e1002364, 2011. [56] S MacArthur, XY Li, J Li, JB Brown, HC Chu, L Zeng, BP Grondona, A Hechmer, L Simirenko, SVE Kernen, DW Knowles, M Stapleton, P Bickel, MD Biggin, and MB Eisen. Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions. Genome Biology, 10:R80, 2009. [57] K Madsen, HB Nielsen, and O Tingleff. Methods for Non-Linear Least Squares Problems. Informatics and Mathematical Modeling, Technical University of Denmark, 2004. [58] S Marino, IB Hogue, CJ Ray, and DE Kirschner. A methodology for performing global uncertainty and sensitivity analysis in systems biology. Journal of Theoretical Biology, 254:178–196, 2008. [59] P Mendes and D Kell. Non-linear optimization of biochemical pathways: applications to metabolic engineering and parameter estimation. Bioinformatics, 14(10):869–883, 1998. [60] CG Moles, P Mendes, and JR Banga. Parameter estimation in biochemical pathways: A comparison of global optimization methods. Genome Research, 13:2467–2474, 2003. [61] SD Muller, J Marchetto, S Airaghi, and P Koumoutsakos. Optimization based on bacterial chemotaxis. IEEE Transactions on Evolutionary Computation, 6(1):16–29, 2002. [62] G Nicolis and I Prigogine. Self-Organization in Nonequilibrium Systems: From Dissipative Structures to Order through Fluctuations. Wiley, 1977. [63] J Nocedal and SJ Wright. Numerical Optimization. Springer, 1999. 167 [64] MB Noyes, RG Christensen, A Wakabayashi, GD Stormo, MH Brodsky, and SA Wolfe. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell, 133:1277–1289, 2008. [65] MB Noyes, X Meng, A Wakabayashi, S Sinha, MH Brodsky, and SA Wolfe. A systematic characterization of factors that regulate Drosophila segmentation via a bacterial one-hybrid system. Nucleic Acids Research, 36(8):2547–2560, 2008. [66] A Ozdemir, KI Fisher-Aylor, S Pepke, M Samanta, L Dunipace, K McCue, L Zeng, N Ogawa, BJ Wold, and A Stathopoulos. High resolution mapping of twist to dna in Drosophila embryos: Efficient functional analysis and evolutionary conservation. Genome Research, 21(4):566–577, 2011. [67] D Pan, JD Huang, and AJ Courey. Functional analysis of the Drosophila twist promoter reveals a dorsal -binding ventral activator region. Genes and Development, 5:1892–1901, 1991. [68] DJ Pannell. Sensitivity analysis of normative economic models: theoretical framework and practical strategies. Agricultural Economics, 16:139–152, 1997. [69] D Papatsenko and M Levine. The Drosophila gap gene network is composed of two parallel toggle switches. PLoS One, 6(7):e21145, 2011. [70] S Payankaulam, LM Li, and DN Arnosti. Transcriptional repression: Conserved and evolved features. Current Biology, 20(17):R764–R771, 2010. [71] MW Perry, AN Boettiger, JP Bothma, and M Levine. Shadow enhancers foster robustness of Drosophila gastrulation. Current Biology, 20:1562–1567, 2010. [72] KB Petersen and MS Pedersen. The Matrix Cookbook. Informatics and Mathematical Modeling, Technical University of Denmark, 2008. Section 10.4. [73] GT Reeves and SE Fraser. Biological systems from an engineer’s point of view. PLoS Biology, 7:32–35, 2009. [74] GT Reeves and A Stathopoulos. Graded dorsal and differential gene regulation in the Drosophila embryo. Cold Spring Harbor Perspectives in Biology, Generation and Interpretation of Morphogen Gradients, 2009. [75] J Reinitz, S Hou, and DH Sharp. Transcriptional control in Drosophila. ComPlexUs, 1:54–64, 2003. 168 [76] O Rubel, S Ahern, EW Bethel, MD Biggin, H Childs, E Cormier-Michel, A DePace, MB Eisen, CC Fowlkes, CGR Geddes, H Hagen, B Hamann, MY Huangj, SVE Keranen, DW Knowles, CLL Hendriks, J Malikn, J Meredith, P Messmer, Prabhata, D Ushizima, GH Weber, and K Wua. Coupling visualization and data analysis for knowledge discovery from multi-dimensional scientific data. Procedia Computer Science Procedia Computer Science, 1:17571764, 2010. [77] G Rudolph. Convergence of evolutionary algorithms in general search spaces. International Conference on Evolutionary Computation, pages 50–54, 1996. [78] A Saltelli, S Tarantola, and KPS Chan. A quantitative model-independent method for global sensitivity analysis of model output. Technometrics, 41:39–56, 1999. [79] L Sanchez and D Thieffry. A logical analysis of the Drosophila gap-gene system. Journal of Theoretical Biology, 211:115–141, 2001. [80] T Sandmann, C Girardot, M Brehme, W Tongprasit, V Stolc, and EEM Furlong. A core transcriptional network for early mesoderm development in Drosophila melanogaster. Genes and Development, 21:436–449, 2007. [81] M Santillan and MC Mackey. Influence of catabolite repression and inducer exclusion on the bistable behavior of the lac operon. Biophysical Journal, 86:1282–1292, 2004. [82] R Sayal, JM Dresch, I Pushel, BR Taylor, and DN Arnosti. A quantitative approach to understand the structural rules of enhancer function. Manuscript in Preparation, 2012. [83] R Sayal, SM Ryu, and DN Arnosti. Optimization of reporter gene architecture for quantitative measurements of gene expression in the Drosophila embryo. Fly, 43(5):47– 52, 2011. [84] MD Schroeder, M Pearce, J Fak, H Fan, U Unnerstall, E Emberly, N Rajewsky, ED Siggia, and U Gaul. Transcriptional control in the segmentation gene network of drosophila. PLoS Biology, 2(9):e271, 2004. [85] E Segal, T Raveh-Sadka, M Schroeder, U Unnerstall, and U Gaul. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature, 451:535– 540, 2008. [86] E Segal and J Widom. From dna sequence to transcriptional behaviour: a quantitative approach. Nature Reviews Genetics, 10:443–456, 2009. 169 [87] MA Shea and GK Ackers. The or control system of bacteriophage lambda. a physicalchemical model for gene regulation. Journal of Molecular Biology, 181:211–230, 1985. [88] D Stanojevic, S Small, and M Levine. Regulation of a segmentation stripe by overlapping activators and repressors in the Drosophila embryo. Science, 254(5036):1385–1387, 1991. [89] J Stoer and R Bulirsch. Introduction to Numerical Analysis. Springer, 3 edition, 2002. Section 4.4. [90] GD Stormo and DS Fields. Specificity, free energy and information content in proteindna interactions. Trends in Biochemical Sciences, 23(3):109–113, 1998. [91] Y Suleimenov, A Ay, MA Samee, S Sinha, JM Dresch, and DN Arnosti. Global parameter estimation for thermodynamic models of transcriptional regulation. Manuscript in Preparation, 2012. [92] S Surkova, E Myasnikova, H Janssens, KN Kozlov, AA Samsonova, J Reinitz, and M Samsonova. Pipeline for acquisition of quantitative data on segmentation gene expression from confocal images. Fly, 2(2):58–66, 2008. [93] P Szymanski and M Levine. Multiple modes of dorsal-bhlb transcriptional synergy in the drosophila embryo. EMBO Journal, 14:2229–2238, 1995. [94] Y Tang, P Reed, T Wagener, and K van Werkhoven. Comparing sensitivity analysis methods to advance lumped watershed model identification and evaluation. Hydrology and Earth System Sciences, 3:3333–3395, 2006. [95] C Thisse, F Perrin-Schmitt, C Stoetzel, and B Thisse. Sequence-specific transactivation of the Drosophila twist gene by the dorsal gene product. Cell, 65:1191–1201, 1991. [96] NAW van Riel. Dynamic modeling and analysis of biochemical networks: mechanismbased models and model-based experiments. Briefings in Bioinformatics, 7:364–374, 2006. [97] G von Dassow, E Meir, EM Munro, and GM Odell. The segment polarity network is a robust developmental module. Nature, 406:188–192, 2000. [98] T Weise. Global Optimization Algorithms - Theory and Application. self-published, http://www.it-weise.de/, 2 edition, 2009. 170 [99] S Wielgoss, JE Barrick, O Tenaillon, S Cruvellier, B Chane-Woon-Ming, C Mdigue, RE Lenski, and D Schneider. Mutation rate inferred from synonymous substitutions in a long-term evolution experiment with Escherichia coli. G3: Genes, Genomes, Genetics, 1:183–186, 2011. [100] CH Yuh, H Bolouri, and EH Davidson. Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development, 128:617–629, 2001. [101] J Zeitlinger, RP Zinzen, A Stark, M Kellis, H Zhang, RA Young, and M Levine. Wholegenome chip-chip analysis of dorsal, twist, and snail suggests integration of diverse patterning processes in the Drosophila embryo. Genes and Development, 21:385–390, 2007. [102] C Zhu, KJRP Byers, RP McCord, Z Shi, MF Berger, DE Newburger, K Saulrieta, S Smith, MV Shah, M Radhakrishnan, AA Philippakis, Y Hu, F DeMasi, M Pacek, A Rolfs, T Murthy, J LaBaer, and ML Bulyk. High-resolution dna-binding specificity analysis of yeast transcription factors. Genome Research, 19:556–566, 2009. [103] T Ziehn and AS Tomlin. A global sensitivity study of sulfur chemistry in a premixed methane flame model using hdmr. International Journal of Chemical Kinetics, 40:742– 753, 2008. [104] RP Zinzen, C Girardot, J Gagneur, M Braun, and EEM Furlong. Combinatorial binding predicts spatio-temporal cis-regulatory activity. Nature, 462:65–70, 2009. [105] RP Zinzen, K Senger, M Levine, and D Papatsenko. Computational models for neurogenic gene expression in the Drosophila embryo. Current Biology, 16:1358–1365, 2006. 171