MULTI-SCALE, MULTI-DIMENSION, DNA-BASED MATHEMATICAL MODELING
OF GENE EXPRESSION
By
Jacqueline M. Dresch

A DISSERTATION
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Mathematics
2012

ABSTRACT
MULTI-SCALE, MULTI-DIMENSION, DNA-BASED MATHEMATICAL
MODELING OF GENE EXPRESSION
By
Jacqueline M. Dresch
The study of gene regulation has been an important topic in biology for decades. Activation and silencing of genes aﬀect everything from everyday breathing to deadly diseases,
and thus concern not only academic biologists, but also healthcare professionals and thus
the general population. Researchers now have mapped out many species’ entire genomic
sequences, but annotation is incomplete concerning stretches of DNA that regulate genes,
as well as protein coding regions. Accompanying the expansion in sequence data, new technologies have provided copious amounts of expression data suitable for modeling studies.
Experimental and computational biology now combine to answer many of the complicated
questions behind gene regulation and gene function, although we lack a complete picture of
the physical and chemical reactions of the complex gene regulatory machinery. To better
understand the processes of gene regulation we need to develop mathematical models that
describe how DNA sequences direct diﬀerentiated gene expression, predict the expression
of unknown genes or variants of genes, and reveal the driving forces behind gene regulation. With new high throughput techniques leading to large data sets of increased quality,
mathematicians need no longer make speculative hypotheses; they now can connect their
models directly with varying levels of biological inputs and verify the model predictions with
quantitative data sets. My studies develop new approaches for quantitative assessment and
application of mathematical tools important for gene regulatory studies.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank both of my advisors, Dr. Chichia Chiu and Dr.
David N. Arnosti. Dr. Chiu has guided me both in my mathematical research as well as
discovering not what others expect of me, but what I truly wish to achieve in my life. Dr.
Arnosti has spent endless hours with me going over simulations, analyzing results, and above
all, making sure that I have a thorough understanding of the biological data and hypotheses
feeding into quantitative biological studies. Without both Dr. Chiu and Dr. Arnosti, this
thesis would not have been possible.
I would also like to thank all of the collaborators I have had the opportunity to work with
and everyone who has contributed to this thesis. These individuals encompass a large array
of mathematicians, computer scientists, and biologists with varying specialties, including
Dr. Ahmet Ay, Evan Dayringer, Richard Shadrach, Dr. Ben Beckmann, Dr. Saurabh Sinha,
Md. Abul Samee, Yerzhan Suleimenov, Ben Taylor, Rupinder Sayal, Dr. Walid Fakhouri,
Xiaozhou Liu, and Irina Pushel. In addition, I would like to thank all of my committee
members, especially Dr. Peter Bates for his thoughtful comments on this thesis.
I would like to thank my family for all of their support throughout my time as a graduate
student. As a woman in mathematics, my family’s support and constant encouragement has
been truly priceless.
Last, but not least, I would like to thank all of my friends at MSU. This thesis would
not have been possible without a break every so often. A very special thanks to Tom
Bellsky, Matt O’Toole, Dominique Lewis, Petra Hendrickson, Li Yang, Yan Wei, the Liouville
Sluggers, and all of the wonderful friends I have made as a graduate student!

iii

TABLE OF CONTENTS

List of Tables . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

vi

. . . . . . . . . . . . . . . . . .

vii

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

List of Figures

. . . . . . . . . . . . . .

2 Thermodynamic-based Modeling . . .
2.1 Introduction and Description . . . . .
2.2 Mathematical Analysis of a Simple
Thermodynamic-based Model . . . .
2.3 Incorporating More Information into
Thermodynamic-based Models . . . .
2.3.1 Protein-protein Interactions .
2.3.2 Overlapping Binding Sites . .

. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .

13
13

. . . . . . . . . . . . . . . . . . . . . .

17

. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .

26
26
29

3 Dynamic Modeling: Incorporating Both cis-regulatory
Information . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1 Introduction and Problem Description . . . . . . . . . . .
3.2 Development of a semi-implicit numerical method . . . . .
3.2.1 Construction . . . . . . . . . . . . . . . . . . . . .
3.2.2 Stability . . . . . . . . . . . . . . . . . . . . . . . .
3.2.3 Convergence . . . . . . . . . . . . . . . . . . . . . .
3.3 Numerical Simulations . . . . . . . . . . . . . . . . . . . .

and
. . .
. . .
. . .
. . .
. . .
. . .
. . .

4 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.1 Local Sensitivity Analysis . . . . . . . . . . . . . . . . .
4.2.2 Global Sensitivity Analysis . . . . . . . . . . . . . . . . .
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3.1 Local Sensitivity Analysis of the Fakhouri et al. Model .
4.3.2 Global Sensitivity Analysis of the Fakhouri et al. Model
4.3.3 Optimized Gene Construct Design and Synthetic Data .
iv

Temporal
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .

31
31
36
36
38
51
55

.
.
.
.
.
.
.
.
.

60
60
62
63
64
71
76
79
87

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

4.4

4.3.4 Global Sensitivity Analysis of the Zinzen et al. Model . . . . . . . . . 97
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1.0.1 GEMSTAT Speciﬁcs and Applications . . . . . . . . . . . .
5.2 Methods and Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2.1 Construction of Synthetic Data Sets . . . . . . . . . . . . . . . . . . .
5.2.1.1 Construction of Continuous Synthetic Data Sets . . . . . . .
5.2.1.2 Construction of Perturbed, Binary, and Interpolated Synthetic Data Sets . . . . . . . . . . . . . . . . . . . . . . . .
5.2.2 Description of QN/NMS and CMA-ES Methods . . . . . . . . . . . .
5.2.2.1 QN/NMS Method . . . . . . . . . . . . . . . . . . . . . . .
5.2.2.2 CMA-ES Method . . . . . . . . . . . . . . . . . . . . . . . .
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

105
105
111
113
113
113

6 Thermodynamic-based Modeling: Analysis of Biological Data Sets . . .
6.1 Understanding the Structural Rules of Enhancer Function . . . . . . . . . .
6.1.1 Reporter Gene Constructs and Confocal Image Dataset . . . . . . . .
6.1.2 Structure of Models Tested . . . . . . . . . . . . . . . . . . . . . . . .
6.1.3 Model Performance and Parameter Estimation . . . . . . . . . . . . .
6.1.4 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.2 Understanding the Dynamic Eﬀects of Redundant Enhancers . . . . . . . . .
6.2.1 The Addition of Redundant Enhancers to Dynamic Models of Transcriptional Regulation . . . . . . . . . . . . . . . . . . . . . . . . . .

133
133
134
134
142
143
146

114
115
116
116
121
130

147

7 Closing Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
A Data Processing . . . . . . . . .
A.1 Introduction . . . . . . . . . .
A.2 Fakhouri et al. 2010 data set .
A.3 Sayal et al. 2012 data set . . .

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

Bibliography . . . . . . . . . . . . . . .

v

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

153
153
155
156

. . . . . . . . . . . . . . . . . . . 162

LIST OF TABLES

Table 4.1

Redesigned gene constructs . . . . . . . . . . . . . . . . . . . . . . .

87

Table 4.2

Gene constructs with combinatorial quenching. . . . . . . . . . . . .

92

Table 4.3

Parameter relationships resulting in high correlations. . . . . . . . . 101

Table 5.1

A pseudo-algorithm for the derivative score. . . . . . . . . . . . . . . 114

Table 5.2

A pseudo-algorithm for CMA-ES. . . . . . . . . . . . . . . . . . . . 117

vi

LIST OF FIGURES

Figure 1.1

Figure 1.2

Figure 2.1

Figure 2.2

Illustration of an Enhancer Region. The horizontal line represents
a sequence of DNA, read from 5’ to 3’. The arrow corresponds to
the transcription start site, the nucleotide at which the RNA coding
sequence, or “gene”, begins. The blue region, just upstream of the
transcription start site, depicts the region of DNA where regulatory
proteins bind to control gene expression, the enhancer. Green circles
and red squares depict TFs (activators and repressors respectively).
For interpretation of the references to color in this and all other ﬁgures, the reader is referred to the electronic version of this dissertation.

2

Dynamic Transcriptional Network. On the left of the ﬁgure, each
node corresponds to a TF or gene of involved in the network. Dorsal acts as the input to the network at the end of nuclear cycle 13,
activating (depicted by an arrow) all other genes in the network. As
Twist is expressed, it in turn activates snail and rhomboid. Snail
then represses (depicted by a blunt arrow) rhomboid. On the right,
the corresponding ﬁnal expression patterns, in late nuclear cycle 14,
are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

Figure 1C from [26]. A Drosophila embryo with Giant repressor (red
stripes) and Dorsal activator (green) staining is shown. Each embryo
provides a diversity of potential inputs to the regulatory element:
the white arrow points to a region where activator levels are high and
repressor levels are low. The black arrow points to a region where
both activator and repressor levels are low. The white triangle points
to a region where activator and repressor levels are both high, and
the black triangle points to a region where repressor levels are high
and activator levels are low. . . . . . . . . . . . . . . . . . . . . . .

22

1g 2t2d: In regions where Giant protein is present at maximum concentrations, we predict a 44% reduction in expression. . . . . . . . .

24

vii

Figure 2.3

Figure 2.4

Figure 3.1

Figure 3.2

Figure 3.3

Figure 4.1

2g 2t2d: In regions where Giant protein is present at maximum concentrations, we predict a 69% reduction in expression. . . . . . . . .

24

3g 2t2d: In regions where Giant protein is present at maximum concentrations, we predict an 82% reduction in expression. . . . . . . .

25

Simulations with decreasing step size. Parameter values were ﬁxed in
each of these four model simulations, allowing us to analyze the eﬀect
of changing step-size on the numerical solutions obtained by (3.2). In
panel A) h = 1.0, B) h = 0.1, C) h = 0.01, and D) h = 0.001. . . .

56

Contour plots for simulations with decreasing step size. Parameter
values were ﬁxed in each of these four model simulations, allowing us
to analyze the eﬀect of changing step-size on the numerical solutions
obtained by (3.2). In panel A) h = 1.0, B) h = 0.1, C) h = 0.01, and
D) h = 0.001. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

57

Data vs. Model Predictions. 8 time points are shown. Each pair
of graphs depicts on the left, the data used to ﬁt model parameters,
and on the right, the model predictions after parameter ﬁtting, for
the given time point. Observe that the model is very successful in
ﬁtting the general dynamics and expression pattern trends of this
system. The corresponding RMSE for this prediction is 0.0616. . .

58

General description of thermodynamic-based models and examples
from [26] and [105]. A) Column 1: fractional occupancy of a TF
binding to an enhancer. Column 2: all possible states of an enhancer
containing one repressor and one activator binding site. Column 3:
fractional occupancy of each event. Column 4: expression contribution for each state. Column 5: calculation of total expression. B)
Column 1: two constructs used in [26]. Column 2: representative
embryos. Column 3: quantitative data of normalized repressor [Gt]
vs reporter gene [lacZ]. Column 4: total expression formulas. C) Line
1, Column 1: two conceptual constructs used in [105]. Line 2, Column 2: quantitative rho data used by Zinzen et al.. Line 2: total
expression formulas. . . . . . . . . . . . . . . . . . . . . . . . . . . .

72

viii

Figure 4.2

Sensitivity analysis using local and global-local methods on the thermodynamicbased model of Fakhouri et al., scheme 2. A) Local sensitivity analysis
highlights great diﬀerences among parameters. Model parameters are
shown on the horizontal axis and local sensitivity coeﬃcients (scaled
from 0.0 to 1.0) are shown on the vertical axis. Larger values indicate
that the model output changes greatly when this parameter is varied. B) Global-local analysis reveals diﬀerences in relative sensitivities
compared to local analysis. The global-local sensitivity contribution
(also scaled from 0.0 to 1.0) is plotted on the vertical axis. In the
global-local analysis, quenching parameters (Q) are generally more
sensitive than repressor scaling factors (R), which in turn are more
sensitive than cooperativity parameters (C). . . . . . . . . . . . . . 78

Figure 4.3

Sensitivity analysis using the eFAST global sensitivity method on
the thermodynamic-based model of Fakhouri et al., scheme 2. A)
Shown are ﬁrst- and total-order sensitivity indices, which represent
the amount of variation in the model output with respect to each
parameter individually (gray bars) and in conjunction with all other
parameters (black bars), scaled from 0.0 to 1.0. If the gray bar is
much smaller than the black bar, as in Q6, secondary (or higher)
eﬀects are predominant. Quenching parameters are generally most
sensitive and cooperativity parameters are least sensitive. B) Eﬀect
of frequency with which a quenching parameter is represented in the
twelve constructs on sensitivity. The number of constructs that the
parameter is represented in is shown along the horizontal axis. Corresponding ﬁrst- and total-order sensitivity indices are shown for each
quenching parameter, as calculated by the eFAST algorithm. There
are two quenching parameters (Q4 and Q5) represented in two constructs, and two quenching parameters (Q1 and Q3) represented in
ﬁve constructs. At these values, there are four data points, two for
each quenching parameter. The lines illustrate linear ﬁts to each data
set, ﬁrst- and total-order sensitivity indices. In general, the more constructs a quenching parameter is represented in, the more sensitive
that parameter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ix

80

Figure 4.4

Figure 4.5

Sensitivity analysis using the HDMR global sensitivity method on
the thermodynamic-based model of Fakhouri et al., scheme 2. Shown
are ﬁrst-, second- and the sum of ﬁrst- and second-order sensitivity
indices, which represent the amount of variation in the model output with respect to each parameter individually (light gray bars),
each pair of parameters (dark gray bars), and the sum of these two
variations (black bars), scaled from 0.0 to 1.0. General sensitivity
trends are similar to those observed in Figure 2, although secondorder sensitivities are largest for R and C1, implying that they aﬀect
the model output through pair-wise interactions with other parameters. Over 95% of the variation in model output was found using only
ﬁrst- and second- order sensitivities (data not shown); therefore the
sum of ﬁrst- and second-order sensitivities shown here approximates
the total sensitivity of each parameter. . . . . . . . . . . . . . . . .

85

Sensitivity analysis using the HDMR global sensitivity method on
11 redesigned gene constructs. Synthetic data created using the
Fakhouri et al. model on these 11 redesigned gene constructs shows
a decrease in variation of sensitivities between quenching parameters
compared to Figure 4.3. In A, the synthetic data was created using
the parameter values that were found in the Fakhouri et. al study.
In B, the synthetic data was created using mean parameter values
(0.5 for quenching parameters, 50 for scaling factor and cooperativity
parameters). A decrease in variation of sensitivities for quenching
parameters is also observed from A to B (the sum of ﬁrst and second order sensitivities has a range of 0.10 to 0.28, with a mean of
0.17 and a standard deviation of 0.07 in A and a range of 0.16 to
0.31, with a mean of 0.20 and a standard deviation of 0.06), indicating that the sensitivity analysis is dependent on the true parameter
values represented in the biological data. . . . . . . . . . . . . . . .

88

x

Figure 4.6

Figure 4.7

Values of biological data can aﬀect sensitivity analysis. The eFAST
global sensitivity method was applied to the thermodynamic-based
model of Fakhouri et al., scheme 2, using synthetic data with mean
values (0.5 for quenching parameters, 50 for scaling factor and cooperativity parameters). We observe a decrease in variation of sensitivities between quenching parameters, when compared to Figure 4.2. A)
Sensitivity indices represent the amount of variation in the RMSE. B)
Eﬀect of frequency with which a quenching parameter is represented
in the twelve constructs on sensitivity. The number of constructs that
the parameter is represented in is shown along the horizontal axis.
Corresponding ﬁrst- and total-order sensitivity indices are shown for
each quenching parameter, as calculated by the eFAST algorithm.
Note that there are two quenching parameters (Q4 and Q5) represented in two constructs, and two quenching parameters (Q1 and
Q3) represented in ﬁve constructs. Thus at these values there are
four data points, two for each quenching parameter. The lines illustrate linear ﬁts to each data set, ﬁrst- and total-order sensitivity
indices. In general, the more constructs a parameter is represented in,
the more sensitive that parameter. When compared to Figure 4.2B, a
more pronounced correlation between a parameter’s total-order sensitivity coeﬃcient and the number of gene constructs that parameter is
represented in is observed here, underscoring the need for sensitivity
analysis in experimental design. . . . . . . . . . . . . . . . . . . . .

90

Eﬀect of combinations of four quenching parameters on sensitivity
analysis using the HDMR global sensitivity method on redesigned
gene constructs (see Table 4.2). Compared to an even design, an
increase in variation of sensitivities between quenching parameters is
seen when the data used comes from an uneven construct design (compare A to B). The synthetic data was created using the nine constructs
in which the quenching parameters were evenly distributed (A) and
not evenly distributed (B). Mean parameter values (0.5 for quenching
parameters, 50 for scaling factor and cooperativity parameters) were
used in both cases. . . . . . . . . . . . . . . . . . . . . . . . . . . .

93

xi

Figure 4.8

Sensitivity analysis using the HDMR global sensitivity method on
the thermodynamic-based model of Zinzen et al. First-, second- and
the sum of ﬁrst- and second-order sensitivity indices (scaled from 0.0
to 1.0), are shown. Despite the focus in the Zinzen et al. study on
Dorsal-Twist cooperativity, these parameters are relatively insensitive compared to protein scaling factors. Results from representative
enhancer structures: A) set 1 and B) set 2. The model parameters are
shown on the horizontal axis. D, T, and S correspond to scaling factors for Dorsal, Twist, and Snail binding sites, respectively. DTr, TTr,
and SSr correspond to cooperativity parameters representing DorsalTwist, Twist-Twist, and Snail-Snail cooperativities respectively, for
the rho-like enhancer. Similarly, DTv, TTv, and SSv correspond
to cooperativity parameters representing Dorsal-Twist, Twist-Twist,
and Snail-Snail cooperativities respectively, for the vnd -like enhancer. 98

Figure 5.1

Illustration of the expression patterns for synthetic data set 20 obtained by reverse engineering of a gene regulatory network by the Direct model. The synthetic target data used for the parameter estimation is created by simulating the model with the random parameters
obtained as explained in the Methods and Calculations. We created
the expression patterns using the parameters obtained by the CMAES and QN/NMS methods. The black tick marks are the synthetic
target data and the blue (QN/NMS) and red (CMA-ES) lines correspond to the simulated patterns. The y-axis gives the relative gene
expression level and the x-axis corresponds to the anterior-posterior
(A-P) axis of the embryo. The ﬁnal SSE is 0.001 and 13.0 for CMAES and QN/NMS respectively. . . . . . . . . . . . . . . . . . . . . 118

Figure 5.2

Performance comparison of the CMA-ES and QN/NMS methods on
the Direct model. The minimum SSE values are plotted for data sets
of varying quality: A)continuous, B) perturbed, C) binary, and D)
interpolated. The analysis has been done on 30 synthetic data sets.
For each data set, two SSE values are shown; CMA-ES (red circles)
and QN/NMS (blue triangles). It has been observed that the CMAES method performs better than the QN/NMS method on continuous
data. However, the choice of parameter estimation technique has no
observable eﬀect on lower quality data sets. The horizontal line in
panel A) signals a change in scale of the vertical axis. . . . . . . . . 122

xii

Figure 5.3

Comparison of the parameter values obtained by the CMA-ES and
QN/NMS methods on the GEMSTAT Direct model. The minimum
SSEs between the true parameter values and those obtained by the
CMA-ES and QN/NMS methods are plotted for data sets of varying
quality: A)continuous, B) perturbed, C) binary, and D) interpolated.
The analysis has been done on 30 synthetic data sets. For each data
set, two SSE values are shown; CMA-ES (red circles) and QN/NMS
(blue triangles). It has been observed that the CMA-ES method
more accurately predicts the true parameter values on continuous
data. However, the choice of parameter estimation technique has
no observable eﬀect on lower quality data sets. The horizontal lines
signal a change in scale of the vertical axis. . . . . . . . . . . . . . 124

Figure 5.4

Scatter plot of parameters for synthetic data set 20 using CMA-ES
and QN/NMS methods on the Direct model. Each method was run
ﬁve times. The black diamonds represents the target parameters, and
the red circles and blue triangles represent parameters obtained by
CMA-ES and QN/NMS methods respectively. The CMA-ES method
more accurately predicted the true parameter values on continuous
data. The horizontal lines signal diﬀerent scales of the vertical axis.
For the CMA-ES and QN/NMS methods respectively, the mean standard deviation in parameter values found is 64.11 and 80.29. . . . . 125

Figure 5.5

Comparison of the relative error (%) for the estimated parameters
of synthetic data set 20 using CMA-ES (red circles) and QN/NMS
(blue triangles) methods. The CMA-ES method predicts the true
parameter values with lower relative error on continuous data. . . . 126

Figure 5.6

Sensitivity analysis for the Direct model. The blue, green, and red
bars represent the parameter ﬁrst-, sum of second-, and sum of ﬁrstand second-order sensitivities calculated using the HDMR algorithm
respectively. The objective function used was the SSE between model
predictions and synthetic data set 20, whose predictions are shown
in Figure 5.1 and best ﬁt parameters shown in Figures 5.4 and 5.5.
Observe that the model is much more sensitive to perturbations in
the fourth parameter, one of the parameters associated with Caudal,
than any other parameter in the model. For other parameters, we
see large second-order sensitivities, corresponding to a high level of
parameter compensation. . . . . . . . . . . . . . . . . . . . . . . . 127

xiii

Figure 5.7

Scatter plot of the parameters obtained for experimental data using
CMA-ES and QN/NMS methods on the Direct model. The red circles
and blue triangles represent parameters resulting in the minimum SSE
when CMA-ES and QN/NMS methods are applied respectively. It
has been observed that, although SSE values were comparable (CMAES SSE = 0.2760 and QN/NMS SSE = 0.2767), for the most sensitive
parameters the two methods results in diﬀerent values. The horizontal line signals a change in scale of the vertical axis. . . . . . . . . . 129

Figure 6.1

Structures of reporter genes assayed. Green, blue, and red circles denote annotated binding sites for Dorsal, Twist, and Snail respectively.
X’s depict binding sites that were mutated. . . . . . . . . . . . . . . 135

Figure 6.2

Continuous Functions used for Quenching and Cooperativity. These
functions are the functions used in diﬀerent schemes of the model,
to describe the eﬀects of a repressor’s quenching ability on a bound
activator or the cooperativity between two bound TFs as functions
of the distance between the binding sites. A) f (d) = a + bd, B)
1
2
. When implemented, a = 1
f (d) = ae−d /b , and C)f (d) = a
ed/b
and b is a model parameter for quenching functions, and a and b are
both model parameters for cooperativity functions. . . . . . . . . . . 138

Figure 6.3

Quenching Function Derived from Fakhouri et al. Analysis. This
function was ﬁt using the six quenching values, shown in red, found
in [26] to a ﬁfth degree polynomial, f (d) = −0.000000004249d5 +
0.0000010199d4 − 0.00010199d3 + 0.0050099d2 − 0.10731d + 1.0 for
d ≤ 100 and assumed to be 0 for d > 100, shown in blue. When
implemented, this function is used directly, with no parameter values
corresponding to quenching. . . . . . . . . . . . . . . . . . . . . . . 139

Figure 6.4

Binned Functions used for Quenching and Cooperativity. These stepfunctions are representatives of those used in various model schemes
to describe the eﬀects of a repressor’s quenching ability on a bound
activator and cooperativity between two bound TFs as step-functions
of the distance between the binding sites. A) Quenching function
with four bins, B) Cooperativity function with three bins. When
implemented, no quenching is allowed beyond the distance of the last
bin, thus the function value is assumed to be 0 beyond 100 bp in A). 140

xiv

Figure 6.5

Model performance for 120 model schemes. Each box represents the
RMSE for a single model scheme, with the row representing the cooperativity function and the column representing the quenching function
used in this scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

Figure 6.6

Diﬀerent Measures of Model Performance for 18 model schemes. Each
box represents the ranking for a single model scheme, with the row
representing the model scheme and the column representing the ﬁtness measure used for ranking. . . . . . . . . . . . . . . . . . . . . . 144

Figure 6.7

Construct Performance for 18 model schemes. Each box represents
the RMSE for a single model scheme on a single construct, with the
row representing the model scheme and the column representing the
construct. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

Figure A.1

Six representative yellow-white images with quadratics ﬁt to their
intensity values. The blue lines represent ten lines taken uniformly
from the region 40-60% Anterior-Posterior and plotted from Dorsal
to Ventral. The red line represents the quadratic ﬁt to the average of
these ten lines from 10-40% Dorsal-Ventral. This region was chosen
due to the fact that there should be no expression in this region
regardless of whether the embryos contain the transgene, as in our
core data set, or not, as in these yellow-white images. . . . . . . . . 158

Figure A.2

Variation in Signal Intensity on diﬀerent imaging days. In panels A
and B, each circle represents a single embryo’s mean intensity value.
All embryos represented in panel A came from a single construct,
which is a double activator knock-out, stained on the same day and
then imaged on three diﬀerent days. All embryos represented in panel
B came from a diﬀerent single construct, which is a triple repressor
knock-out, again stained on the same day and then imaged on three
diﬀerent days. In both cases, there is a great deal of variation in
signal intensity. In panels C and D, these mean values were averaged
over all embryos stained and imaged on the same day. We observed
a well-deﬁned decreasing trend in mean intensity as the number of
days between staining and imaging increased. . . . . . . . . . . . . . 159

xv

Chapter 1
Introduction
The process of gene regulation and understanding the intricate machinery involved in controlling it poses a very important question in modern biology and biochemistry: what physical
properties drive DNA sequences to control complex protein interactions leading to the precise expression of genes? All living organisms have genes that turn on and oﬀ to regulate life
processes, many of which depend on the expression of genes at precise levels, spatial locations
and times. Due to this large array of applications, the study of gene regulation underlies
the vast majority of biological ﬁelds, from bacterial growth to human disease pathways. The
regions of DNA that control gene expression (see Figure 1.1) are referred to as enhancers
and can be highly divergent in diﬀerent organisms. Often the factors controlling gene regulation are divided into two groups, cis-acting elements and trans-acting factors. Cis-acting
elements, also referred to as cis-regulatory modules (CRMs) or enhancers, are the regions of
DNA containing binding sites for other factors involved in gene regulation and are typically
located at a close proximity to the transcriptional start site. Trans-factors, on the other
hand, are proteins that bind to these cis-elements to control expression. These factors in-

1

Figure 1.1: Illustration of an Enhancer Region. The horizontal line represents a sequence of
DNA, read from 5’ to 3’. The arrow corresponds to the transcription start site, the nucleotide
at which the RNA coding sequence, or “gene”, begins. The blue region, just upstream of the
transcription start site, depicts the region of DNA where regulatory proteins bind to control
gene expression, the enhancer. Green circles and red squares depict TFs (activators and
repressors respectively). For interpretation of the references to color in this and all other
ﬁgures, the reader is referred to the electronic version of this dissertation.
clude transcription factor (TF) proteins transcribed elsewhere in the genome, translated and
actively transported through the cell, which bind to enhancers to contribute to the spatial
and temporal dynamics of gene expression. In prokaryotes, enhancers typically have a very
small number of binding sites and the protein-DNA and protein-protein interactions involved
in gene expression are often fairly well understood. In eukaryotes, however, many of the rules
behind gene regulation are still almost entirely unknown due to the complexity of eukaryotic
gene regulatory systems, involving the concerted occupancy of many trans-factors; including
multiple TFs, cofactors, and nucleosomes on enhancers. One would like to understand not
only the basic properties of protein-binding eﬀects involved in prokaryotic gene regulation
(i.e., the quantitative change in mRNA levels as one protein blocks or enhances the activity of
another), but also gain the ability to model some of the cis-acting elements and trans-acting
factors which contribute to the complex nature of eukaryotic regulation.

2

A large number of studies have presented evidence of one aspect of eukaryotic regulation
or another within a given system, but we lack a general unifying concept to explain how DNA
sequence can give rise to the complicated spatio-temporal expression patterns of so many
genes. In this thesis I describe my research which has focused on gene expression during the
early blastoderm stage of development in the eukaryotic organism Drosophila melanogaster.
Drosophila, more commonly known as the “fruit ﬂy,” oﬀers particular advantages for studying
gene regulation. As metazoans, their transcriptional machinery closely resembles that of all
other animal species, including humans. Flies have very short life spans and large birth rates,
making experiments much more feasible in terms of preparation time as well as amount of
data acquired for quantitive analyses. Powerful genetic tools make them a natural choice
for studies involving transgenes or genetic crosses due to the large amount of high quality
quantitative data that can be obtained. Using this system, biologists can learn a great
deal of information about gene expression, providing a basis for hypotheses about disease,
development, and evolution in many other eukaryotic species.
Experimentalists use diverse techniques to gather diﬀerent pieces of information regarding
transcriptional regulation. Some studies identify the developmental patterns linked to a
particular gene’s expression, some determine what two proteins are physically interacting to
control a single gene or group of genes’ expression, and others classify the environmental
conditions under which gene expression patterns are altered. Although each study provides
us with only a subset of information on gene regulation, together they have greatly increased
our knowledge of when, where, and how certain genes are expressed. A common methodology
that has been employed for many years involves the discovery of direct physical and functional
characteristics of cis-regulatory elements. Proteins binding to these DNA elements can

3

be assayed by DNAse1 footprinting. Functional studies involve reporter constructs linked
to small fragments of genomic DNA (‘promoter bashing’), to determine minimal enhancer
regions and the functional eﬀects of TFs binding to speciﬁc sites [95, 67, 46]. More recently,
the same idea of promoter bashing was used with more sophisticated biochemical techniques
involving large genomic segments and environmental perturbations to learn more about the
robust nature of a single gene with multiple enhancer regions [71]. These studies typically
answer questions about speciﬁc genes using only experimentation in a lab.
Other studies, addressing wider questions of gene circuitry begin in the lab and then
further test hypotheses using statistical or computational tools. One general approach has
involved synthetic reporter constructs to learn more about protein-protein interactions and
short-range enhancer ‘grammar,’ referring to the speciﬁc number and arrangement of binding
sites needed on an enhancer to drive proper expression [93, 4, 50, 26]. On a much larger scale,
recent studies involve acquiring protein binding proﬁles by whole-genome surveys aimed at
discovering the locations of CRMs and groups of genes regulated by the same set of TFs, or
surveys on groups of genes regulated in the same time period or by the same set of TFs aimed
at understanding the dynamics of binding events, building gene regulatory networks, and
gaining insight on possible protein interactions [101, 80, 104, 37, 12, 65, 66, 56]. These studies
typically use microarray, referred to as chip, chromatin immunoprecipitation (ChIP) - chip,
or ChIP - sequencing (seq) experiments, although some have combined these binding proﬁles
with expression measurements to learn more about spatio-temporal expression patterns [104].
When experiments answer qualitative questions, such as when and where certain genes are
expressed, or give a picture of how one node ﬁts into a large gene regulatory network, the door
has been opened for further inquiry. This can be done in the form of modeling or statistics.

4

For genome-wide studies, one often goes through multiple levels of statistical analysis to
tease out the important interactions between genes and proteins. For example, ChIP-chip
and ChIP-seq studies have compared binding proﬁles using Pearson Correlation Coeﬃcient
to identify possible combinatorial relationships between TFs [37]. For studies aimed at
explicitly describing the physical interactions taking place, modelers often take the system
of interest and derive deterministic or stochastic models based on the new assumptions
that the experiment has provided them. For example, studies involving a cascade of gene
expression can begin with a set of genes that have been found to bind the same set of TFs
or have correlated expression proﬁles and through diﬀerential equation modeling, explicitly
incorporate parameters representing the level of positive and negative eﬀects one gene has on
another [43]. Each of these approaches aims at extracting as much information as possible
from the experimental data and mathematically analyzing it to provide a direction for future
experimentation.
In formulating an approach to understand how a biological process works, scientists
develop models, a term which carries diﬀerent connotations to researchers in diﬀerent ﬁelds.
Biologists may use a model to provide a non-quantitative, simple cartoon of the processes
taking place. Mathematicians, statisticians, and quantitative biologists use the term to refer
to an equation or system of equations that precisely describes the physical aspects of the
system.
Clear communications between biologists and mathematicians is necessary for eﬀective
collaborations, however. The two ﬁelds use very diﬀerent vocabulary, require diﬀerent coursework and training, and typically have distinct research goals in mind. I believe it is essential
to ﬁrst address the question of why mathematical models are needed in biology. To answer

5

this, we must ﬁrst understand the power behind such models. Experiments carried out in a
wet lab often involve the development of a hypothesis and the development of speciﬁc experiments to test this hypothesis. Although simple to describe, the hypothesis development
and execution of experiments is not always so obvious, and the experiments that need to be
run are often quite time consuming.
Mathematical modeling has been used to eﬀectively target experimental biological studies. For example, in the area of evolutionary biology, in silico stochastic models can simulate
evolution over thousands of generations in a single day, while studying evolution of most
species in a wet lab, if possible at all, can be a career-long project [19, 99]. Another example of an area where mathematical modeling has been crucial is in enhancer discover.
Enhancer discovery refers to identifying regions of DNA previously not known to contribute
to the regulation of one or more genes. Statistical models applied by bioinformaticians can
use known DNA-protein binding site sequences and search the entire genome for clusters of
binding sites [11, 48]. These clusters are known as “predicted enhancer regions” that can be
tested out in a lab. Without this type of model and computation, the only way for biologists
to discover enhancer regions is to run multiple footprinting or ChIP experiments, which are
expensive and time consuming. Quantitative predictive models can identify possible clusters
of binding sites to test for transcriptional activity.
In the area of transcriptional regulation, the same issues of guiding experimental design are important. System-wide insights are key to understanding the impacts of enhancer
function on development, disease, population variation, and the evolution of new traits.
Unfortunately, little is known about many of the proteins involved in regulation and their
interactions, making the design of experiments very challenging. Mathematical models, how-

6

ever, are capable of using available enhancer information and quantitative expression data
to ﬁt parameter values, such as activator eﬃciency or protein cooperativity, and predict
additional enhancers’ expression, giving researchers a better idea of what proteins and enhancers should be further investigated experimentally. This could then lead to the discovery
of mechanisms involved in transcriptional events essential to proper development, the design of new drug treatments targeting speciﬁc diseases, and a better understanding of the
selective pressure put on DNA-sequences during evolution.
Gene regulation is a complex process involving control mechanisms at the levels of transcription, RNA processing, transport and translation. In the ﬁeld of gene regulation, three
diﬀerent mathematical models have been applied: Boolean models, thermodynamic-based
(also termed “Fractional Occupancy”) models, and diﬀerential equation models [8]. These
models all have inherent strengths and weaknesses involving the amount of input data, the
level of complexity, and the time-frame in which they can model gene expression.
Boolean models are graph based probabilistic models used to ﬁnd regulatory interactions
and gene networks when looking at genome-wide data. They have the beneﬁt of being able to
use large amounts of available gene array data to identify speciﬁc gene interaction networks,
but the major drawback of their inability to explain complex interactions between the individual components of the transcriptional machinery (ie: TF and polymerase interactions)
[79, 100, 3].
Thermodynamic-based models, on the other hand, can explain a great deal of the intricate
machinery at the DNA-protein binding level. They utilize information on DNA sequence,
protein binding speciﬁcity, and TF concentrations using the laws of biophysical thermodynamics, and often include terms for speciﬁc TF interactions and distance-dependent DNA

7

binding events. Although many researchers believe that as more data becomes available these
fractional occupancy models will someday have the ability to decipher the cis-regulatory
grammar behind enhancer architecture, the amount of data necessary to test current hypotheses is still lacking [1, 15, 105, 45, 85, 30, 26, 38]. I will discuss my contributions to this
ﬁeld in Chapter 7.
Both Boolean and thermodynamic-based models are static in nature. They may use
information from slightly varying time frames, but their outputs are snapshots of a system.
Whether they are adding insight into which gene products eﬀect other genes or whether they
are predicting expression, these conclusions are made for speciﬁc time points or over some
arbitrary time scale.
Diﬀerential equations models have allowed researchers to model transcriptional regulation over diﬀerent time intervals to study the behavior of subtly evolving gene or protein
network structure. These models can simulate the turning on and oﬀ of genes over time,
as well as intercellular diﬀusion and decay of proteins and mRNAs. This dynamic movie of
gene expression can be extremely useful in developing visual representations of small networks of genes. The primary drawback to these models is that, when applied to larger data
sets or larger gene networks, the number of parameters rapidly increases and ﬁtting these
parameters can be computationally infeasible. Also, in past applications, the intricate details of transcriptional regulation, such as speciﬁc protein binding sites and aﬃnities that
thermodynamic-based models oﬀer, were not included [33, 10, 62, 81, 97, 43, 32, 20, 69, 13].
In eukaryotic systems, models have taken into account TFs’ positive or negative contributions to gene expression, but none have included the DNA sequence information allowing a
dynamic quantitative model of gene expression to incorporate terms for binding speciﬁcity

8

and TF interactions [43, 69, 13]. This gap in our understanding is addressed in studies I
describe in Chapters 3 and 6.
In addition to choosing a suitable model for their particular system, modelers are also
faced with four other important considerations: how to process their data before using it for
model ﬁtting [92, 9], which parameter estimation technique should be used [6, 47, 91], how
to analyze model performance regarding the limits of the model or sensitivity of the model
to parameter perturbations [6, 28, 24], and how much information the model can provide
biologists [26, 24, 91]. These four important issues, and their relationships to the modeling
approaches used in our group are the main focus of this thesis.
First, I focus on the most popular static model of transcriptional regulation for incorporating DNA level information, thermodynamic-based (or fractional occupancy) modeling.
The general equation for calculating gene expression from the protein occupancy proﬁle of
the enhancer is given by:
Φ (X)
expression level ∝

1 + Ψ (X)

where X ∈ Rn represents TF concentrations and Φ, Ψ : Rn → R are polynomial functions
representing the contributions from all active non-empty states of the enhancer and all possible non-empty states of the enhancer respectively. In Chapter 2, I ﬁrst derive a simple
thermodynamic-based model and discuss how it can be analyzed to produce biological insights on the limitations of DNA-sequence and TF contributions to gene expression. I then
show examples of how more complicated biochemical mechanisms, such as protein-protein
cooperativity, can be incorporated into the model to allow for a more realistic picture of
the system. After a thorough introduction to these static models, I build the ﬁrst dynamic
model of transcriptional regulation in eukaryotes to take into account speciﬁc DNA level
9

information on the location and strength of TF binding sites in Chapter 3. The basis for
this model will be the general reaction diﬀusion equation:

dY
dt

= R (Y ) + D∆Y

for Y ∈ Rm and D a diagonal matrix containing diﬀusion constants; with the reaction term,
R (Y ), incorporating known enhancer information. Since this modeling approach involves
a system of diﬀerential equations, I devise a numerical solver to use for model simulations
and analyze the numerical solutions before showing model results on biological data. As
illustrated in this chapter, the model is capable of capturing the intricate dynamics and key
features of a dynamically changing system involved in early eukaryotic development (system
depicted in Figure 1.2). In Chapters 4 and 5, I investigate the parameter values estimated
in the case of static thermodynamic-based models, including sensitivity analysis of these
parameters as well as choosing the proper parameter estimation technique to decrease the
probability of the parameter set getting stuck in a local minimum of the objective function
used during ﬁtting. In Appendix A, I explain how data collected in the lab is processed for
use in quantitative studies of transcriptional regulation.
The main goal of my research has been the development of a mathematical model of
gene regulation capable of capturing the physical properties of the system at the DNA level,
as well as the dynamic properties of protein and mRNA concentrations in a developing organism. Transcriptional regulation involves many diﬀerent factors, including evolving DNA
sequences (i.e., binding site positioning and strength changing with the evolution of new
species), regulatory proteins, chromatin states, nucleosome positions, and cofactors, work-

10

Figure 1.2: Dynamic Transcriptional Network. On the left of the ﬁgure, each node corresponds to a TF or gene of involved in the network. Dorsal acts as the input to the network at
the end of nuclear cycle 13, activating (depicted by an arrow) all other genes in the network.
As Twist is expressed, it in turn activates snail and rhomboid. Snail then represses (depicted
by a blunt arrow) rhomboid. On the right, the corresponding ﬁnal expression patterns, in
late nuclear cycle 14, are shown.

11

ing together in perfect harmony to produce precise spatio-temporal expression patterns. A
thorough understanding of a system so complex is not something a single experiment or
modeling study can provide. Improvements to our understanding have been made with the
help of mathematical modeling, such as the thermodynamic-based model in [26] which was
very successful in predicting the expression of synthetic enhancers with ﬁxed activator binding sites and providing important insights into short-range repression. My work has added
tremendously to this understanding by transforming a static model that gave biologists a
snapshot prediction of gene expression at a single point in time into a dynamic model incorporating the same amount of detail at the DNA level which predicts gene expression over an
important time period in development. This DNA level information was introduced into a
reaction-diﬀusion model by the inclusion of a thermodynamic-based term. This directed my
work toward understanding model parameter sensitivity, parameter estimation strategies,
and thermodynamic-based model formulations including more complex protein-protein interactions. These additional studies have shed light on some common misconceptions when
analyzing model parameters and the cause of such misconceptions, as well as the important
factors to consider when choosing an appropriate, eﬃcient parameter estimation strategy.
However, I believe the largest impact my work has had on the biological and mathematical
communities has been in developing a novel approach to incorporate both cis-regulatory and
temporal information into a gene regulatory model, which is computationally feasible as well
as predictive.

12

Chapter 2
Thermodynamic-based Modeling

2.1

Introduction and Description

Thermodynamic-based models, or fractional occupancy models, aim to predict gene expression output from the DNA sequence of the gene’s regulatory region and the concentration
of proteins involved in regulation (ie: TFs). The term “thermodynamic” itself is often misinterpreted; it only refers to the fact that the model has been derived with thermodynamic
equilibrium assumptions. Historically, these calculations have been done using techniques
derived from statistical physics which involve computing all possible states of TFs binding
to a promoter and relating these states to gene expression [87, 75, 15, 45, 105, 85, 30, 26, 38].
During implementation, the model is derived in two separate parts. The ﬁrst part is the
true “thermodynamic” piece of the model, which is derived using chemical stoichiometric
and equilibrium equations. It is assumed that RNA polymerase (RNAP), the driving force
in gene transcription, is recruited by bound TFs. This requires the occupancy distribution
of TFs on the DNA sequence to be computed using equilibrium constants (binding aﬃnities)

13

and concentration levels.
The second part to thermodynamic-based modeling requires some scientiﬁc guesswork.
There are two common ways to translate binding states into gene expression. The ﬁrst
approach models transcriptional output as proportional to the probability of RNAP binding,
which is related to the active, or successful, states of the enhancer [105, 26]. The second
approach represents it with a nonlinear sigmoidal function in which activators have a positive
and repressors a negative inﬂuence [85]. For a thermodynamic-based modeling study in a
eukaryote, the choice of sigmoidal function is arbitrary, and reﬂects the lower and upper
bounds observed in the transcription process. Again, this is guesswork, not chemically or
physically derived. For this reason, my work has used the ﬁrst approach. Due to the
lack of data relating RNAP binding to the state of an enhancer, we have assumed that
transcriptional output of an enhancer is directly proportional to the fraction of successful
states.
The following examples should serve as a guide to this modeling approach:
Example 1: Consider the case of a piece of DNA (also referred to as a binding site or motif),
S, which a protein (TF), TF, can bind. The chemical equation for TF binding
to S is:
KB

−−
[S] + [T F ] −− [S · T F ]
KU

where KB is the constant associated with the rate of binding and KU is the
constant associated with the rate of unbinding. Note that although S represents
a sequence of DNA, the concentration of S, [S], can be interpreted by considering
the fact that there may be multiple copies of this sequence within the genome.

14

The rate equations associated with this chemical equation are:

d[S]
= KU [S · T F ] − KB [S][T F ]
dt
d[T F ]
= KU [S · T F ] − KB [S][T F ]
dt
d[S · T F ]
= KB [S][T F ] − KU [S · T F ].
dt

Under the assumption that the system is in thermodynamic equilibrium,

d[S]
= KU [S · T F ] − KB [S][T F ] = 0.
dt

Thus, it follows that
KB [S][T F ] = KU [S · T F ]
. Let
KT F =

KB
[S · T F ]
.
=
KU
[S][T F ]

Due to matter conservation (stoichiometry),

[S]total = [S] + [S · T F ].

15

Therefore, it follows that

P =

[S · T F ]
[S] + [S · T F ]

=

[S·T F ][T F ]
[S][T F ]
[S][T F ]
[S·T F ][T F ]
+ [S][T F ]
[S][T F ]

=

KT F [T F ]
1 + KT F [T F ]

where KT F is referred to as the binding aﬃnity (equilibrium constant) of TF
and P is the fractional occupancy, or the time-averaged occupancy of the TF
on the enhancer.
Example 2: Consider the case of an enhancer, S, which contains two binding sites, one for a
transcriptional activator, A, and one for a transcriptional repressor, R. Assume
that the expression level of the gene (measured as the concentration of mRNA)
is proportional to the fraction of successful states, where ‘successful’ is deﬁned
as having only the activator bound and not the repressor. The expression level
is then calculated as

[mRN A] ∝ PA (1 − PR ) =
=

KA [A]
1 + KA [A]

1
1 + KR [R]

KA [A]
.
1 + KA [A] + KR [R] + KA KR [A][R]

Looking back at Example 2, one can start to see the high level of computation that
must go into such a model as enhancers become more complicated, containing multiple

16

TF binding sites, due to the combinatorial nature of multiple TFs binding to a regulatory
region. There are many important biological questions that have been posed in the case
of more complicated thermodynamic-based models and their parameters (i.e., the eﬀects
of protein-protein cooperativity). However, we will begin by analyzing some mathematical
properties of a simple thermodynamic-based model similar to the one depicted in Example
2.

2.2

Mathematical Analysis of a Simple
Thermodynamic-based Model

We begin with the following question:

What happens to the expression function as the number of binding sites increases?
To answer this, we will again start with the assumption that a successful state is any state
in which at least one activator is bound and no repressors are bound. Note that we are
not considering any protein-protein interactions (i.e., cooperative or antagonistic binding
interactions between TFs). We are only considering the interactions taking place between
DNA and proteins (TFs).
There are many diﬀerent types of proteins (TFs) involved in transcriptional regulation.
Consider the general case of an enhancer containing N types of proteins. Assume there are
β diﬀerent types of activators, A1 , . . . , Aβ and for each Ai , there are ni identical binding
sites. Similarly, assume there are γ diﬀerent types of repressors, R1 , . . . , Rγ and for each Rj ,

17

there are mj identical binding sites. Note that β ≥ 0, γ ≥ 0, and β + γ = N . Then,
β

1 + KAi [Ai ]
[mRN A] ∝

ni

−1

i=1
β

1 + KAi [Ai ]
i=1

ni

.

γ

1 + KRj [Rj ]

mj

j=1

Given 1 ≤ i ≤ β such that [Ai ] > 0:
1

lim [mRN A] = γ

ni →∞

.
mj

1 + KRj [Rj ]
j=1

Similarly, given 1 ≤ j ≤ γ such that [Rj ] > 0:

lim [mRN A] = 0.

mj →∞

Lastly, given 1 ≤ i ≤ β such that [Ai ] > 0 and 1 ≤ j ≤ γ such that [Rj ] > 0:

lim [mRN A] =

ni →∞
mj →∞

1

lim

mj →∞ γ

1 + KRj [Rj ]

mj

j=1

1

− lim

ni →∞ β
mj →∞

1 + KAi [Ai ]

i=1

ni

γ

1 + KRj [Rj ]

mj

j=1

= 0.

Therefore, the answer to this ﬁrst question can be stated: as the number of activator
sites is increased, the expression function asymptotically approaches a function of the number
18

of repressor sites present and as the number of repressor sites is increased, the expression
function asymptotically approaches zero, regardless of the number of activator sites present.
Next, we consider the question:

Holding the number of activator sites constant, can we determine how many
repressor sites are needed for the reduction in expression to equal p?
First note that under the assumption that
β

ni

1 + KAi [Ai ]
[mRN A] ∝

−1

i=1
β

1 + KAi [Ai ]

γ

ni

i=1

1 + KRj [Rj ]

mj

j=1

we deﬁne the reduction in expression due to repression as

1

.

γ

1 + KRj [Rj ]

mj

j=1

Assume that ni is ﬁxed for all 1 ≤ i ≤ β and β is ﬁxed. Fix 0 < p ≤ 1 and consider
1

= p.

γ

1 + KRj [Rj ]

mj

j=1

Observe that
γ

1 + KRj [Rj ]
j=1

19

mj

1
= .
p

Thus,
γ

mj ln 1 + KRj [Rj ] = − ln (p) .
j=1

Now, let
max mj ln 1 + KRj [Rj ] = mM ln 1 + KR [RM ] .
M

1≤j≤γ

Then, since mj ln 1 + KRj [Rj ] ≥ 0 for all 1 ≤ j ≤ γ, it follows that
1
− ln (p) ≤ mM ln 1 + KR [RM ] ≤ − ln (p) .
M
γ

Therefore,
− ln (p)
γ ln 1 + KR [RM ]

≤ mM ≤

M

− ln (p)

.

ln 1 + KR [RM ]
M

Since this is a biological problem with a physical interpretation of KRj [Rj ], we can
assume that there exist constants c1 and c2 such that for all 1 ≤ j ≤ γ, c1 ≤ KRj [Rj ] ≤ c2 .
These bounds are independent of j, resulting in an upper and lower bound on the maximum
number of repressor binding sites, mM :
− ln (p)
− ln (p)
≤ mM ≤
.
γ ln (1 + c2 )
ln (1 + c1 )
For biological studies, such as [15, 105, 45, 26], the main focus is on pattern formation
caused by the repression of activators by repressor proteins. In modeling studies involving
experimental data, it is commonly assumed that the level of ﬂuorescence observed in images
from embryos stained for a protein such as R, is proportional to the actual concentration
of R [76, 14, 15, 105, 45, 26]. Then, what is approximated and sometimes referred to as
the scaling factor, KR , actually corresponds to the product of the binding aﬃnity and the
20

value which relates the level of ﬂuorescence to the concentration of protein. Thus, modeling
studies often have strict bounds on what they refer to as [R] due to physical measurement
constraints. Most standard procedures for data processing normalize values between 0 and
1, although various ranges have been used [15, 45, 105, 26]. Since diﬀerent normalization
procedures are used, the units for KR and [R] vary, but KR [R] is measured in µg/ml, a
standard unit for protein concentration.
In previous studies, the following K values have been found.
• Bintu et al. [15]: 0.22 ≤ K ≤ 110
Concentrations were between 0.001 and 1000. Thus, 0.00022 ≤ K[R] ≤ 110000.
• Janssens et al. [45]: 0.003 ≤ K ≤ 1.0
Concentrations were between 0 and 255. In particular for the repressor Giant, whenever
[Gt] > 0, 1 ≤ [Gt] ≤ 150. Thus, 0.003 ≤ K[R] ≤ 150.
• Zinzen et al. [105]: 108.25 ≤ K ≤ 108.75
Concentrations were between 0 and 1. Here we consider the bounds 0.1 ≤ [R] ≤ 1.0.
Thus, 107.25 ≤ K[R] ≤ 108.75 .
• Fakhouri et al. [26]: 12 ≤ K ≤ 30
Concentrations were between 0 and 1. Here we consider the bounds 0.1 ≤ [R] ≤ 1.0.
Thus, 0.12 ≤ K[R] ≤ 30.
Observe that there is a large amount of variation in these bounds. For many studies
such as these, one must be careful when interpreting K values or when applying them to our
previous estimates since values were often found using experimental data to ﬁt models that
simultaneously searches for parameters corresponding to protein-protein interactions such
21

Figure 2.1: Figure 1C from [26]. A Drosophila embryo with Giant repressor (red stripes) and
Dorsal activator (green) staining is shown. Each embryo provides a diversity of potential
inputs to the regulatory element: the white arrow points to a region where activator levels are
high and repressor levels are low. The black arrow points to a region where both activator
and repressor levels are low. The white triangle points to a region where activator and
repressor levels are both high, and the black triangle points to a region where repressor
levels are high and activator levels are low.
as cooperativity and/or quenching terms. Focusing on one particular example, we consider
the Fakhouri et al. data set [26]. To illustrate the spatial expression patterns analyzed,
refer to Figure 2.1. Using a subset of the expression data from this study containing only
the expression obtained from enhancers with Giant binding sites at a ﬁxed distance from
activator sites, we ﬁt a value for KR corresponding to the Giant repressor protein. The
value found using a nonlinear least squares approximation was 0.7869. The expression data
used for this ﬁtting was processed and normalized, resulting in 0.0 ≤ [R] ≤ 1.0 and was
measured from embryos with enhancers containing only one type of repressor binding site,
Giant; hence γ = 1. Since the enhancers only contained Giant binding sites for repression,
a reduction in expression is only observed when [R] > 0.
Since [R] must be strictly greater than 0, here we consider the case with 0.1 ≤ [R] ≤ 1.0.
It follows from our estimate of KR that 0.0787 ≤ KR [R] ≤ 0.7869. Therefore, the actual
bounds on the maximum number of binding sites, mM , assuming a reduction in expression

22

of p, are
− ln (p)
− ln (p)
≤ mM ≤
.
ln (1.7869)
ln (1.0787)
In the context of this enhancer, in order to obtain an expression that is reduced by 50%
(p = 0.5), the enhancer must contain between 2 and 9 repressor binding sites. To obtain
an expression that is reduced by 10% (p = 0.9), the enhancer must contain between 1 and
2 repressor binding sites. To obtain an expression that is reduced by 90% (p = 0.1), the
enhancer must contain between 4 and 30 repressor binding sites. Although these bounds are
not as tight as an experimentalist would hope for, they are entirely dependent on the region
of the embryo, since KR is constant and once p is chosen the bounds depend only on [R].
We reduce to considering only regions of the embryo where maximum Giant concentration
exists, those in which expression is reduced with the highest eﬃciency. Thus, we can consider
[R] ≈ 1.0, which results in a strict lower bound on the number of binding sites necessary
to obtain a reduction in expression equal to p. In the context of this enhancer, in order
to obtain an expression that is reduced by 50% (p = 0.5) in regions where maximum Giant
concentration exists, mM = 1.1941, hence to get a reduction of at least 50%, the enhancer
must contain two repressor binding sites. To obtain an expression that is reduced by 10%
(p = 0.9), mM = 0.1815, hence to get a reduction of at least 10%, the enhancer must
contain one repressor binding site. To obtain an expression that is reduced by 90% (p = 0.1),
mM = 3.9667, hence to get a reduction of at least 90%, the enhancer must contain four
repressor binding sites.
For constructs that were tested (and used to ﬁt the K value), with a ﬁxed number of
known repressor binding sites, we can also make predictions on the reduction in expression.
The qualitative predictions given in the legends of Figures 2.2-2.4 were made using the

23

Figure 2.2: 1g 2t2d: In regions where Giant protein is present at maximum concentrations,
we predict a 44% reduction in expression.

Figure 2.3: 2g 2t2d: In regions where Giant protein is present at maximum concentrations,
we predict a 69% reduction in expression.
bounds: 0.0787 ≤ KR [R] ≤ 0.7869. To describe the structure of each enhancer tested, a
string of letters and numbers is used, where each number represents the number of binding
sites for a particular protein and each letter (‘g’, ‘t’, and ‘d’) corresponds to the type of
protein (Giant, Twist, and Dorsal respectively). The images shown in Figures 2.2-2.4 are
representative embryos from the actual data set of Fakhouri et al. [26].
To assess the reduction in expression observed from a more quantitative perspective, we
calculated the ratio of average expression in “reduced regions” (highest repressor concentrations) to the average expression in a region of highest expression for those embryos contained

24

Figure 2.4: 3g 2t2d: In regions where Giant protein is present at maximum concentrations,
we predict an 82% reduction in expression.
in the high quality data set of Fakhouri et al.[26] and obtained the following results:

1gt 2t2d: 52% expression = reduction of 48%
2gt 2t2d: 37% expression = reduction of 63%
3gt 2t2d: 16% expression = reduction of 84%

From these results, one can conclude that using even the simplest derivation of a thermodynamic-model can provide very powerful predictions in terms of the reduction in expression
levels caused by multiple repressor binding sites.

25

2.3

Incorporating More Information into
Thermodynamic-based Models

2.3.1

Protein-protein Interactions

When using a thermodynamic-based model in practice, each state is not solely based on
the binding aﬃnities and concentrations of the proteins bound to the enhancer, but also
carries with it a statistical weight which can hold biological meaning within the system.
In addition, successful states can include those with repressors bound, simply causing a
reduction in the weight of the state. A state’s weight can represent biological phenomena
such as the activation eﬃciency of a particular protein, the cooperativity of multiple bound
proteins, the quenching (or repressive) eﬃciency of a bound repressor on a bound activator,
etc. Examples with cooperativity and quenching added to the weight of such states follow.
Example 1: Consider the case of an enhancer, S, which contains two binding sites, each for
a transcriptional activator, A1 and A2. Assume that the expression level of the
gene (measured as the concentration of mRNA) is proportional to the fraction
of successful states, where ‘successful’ is deﬁned as having at least one activator
bound and that these two TFs are cooperating. The expression level is then
calculated as

[mRN A] ∝

KA1 [A1] + KA2 [A2] + CKA1 KA2 [A1][A2]
1 + KA1 [A1] + KA2 [A2] + CKA1 KA2 [A1][A2]

where C represents the cooperative interaction between the TFs when they bind
simultaneously.
26

Example 2: Consider the case of an enhancer, S, which contains two binding sites, one for a
transcriptional activator, A, and one for a transcriptional repressor, R. Assume
that the expression level of the gene (measured as the total mRNA) is proportional to the fraction of successful states, where ‘successful’ is deﬁned as having
the activator bound with the ability to communicate with the transcriptional
machinery and that when the repressor is bound, it has some ability to quench
the activator, or decrease its ability to communicate with the transcriptional
machinery. Note that one can depict the state when both TFs are bound as being divided into two substates, one successful when A has the ability to activate
the gene and one unsuccessful when R is blocking that ability. The expression
level is then calculated as

[mRN A] ∝

KA [A] + (1 − Q) KA KR [A][R]
1 + KA [A] + KR [R] + KA KR [A][A][R]

where Q represents the quenching ability of the bound repressor, R, on A when
both proteins are bound (Q = 1 when R is quenching at 100% eﬃciency).
It is easy to understand biological examples when such weights would be important. If a
certain activator communicates strongly with the basal machinery of transcription, giving it
a high activation eﬃciency, the gene expression should increase when this particular protein
is bound. Thus a larger weight is given to this state. On the other hand, if a repressor
is bound, a smaller weight is given to the state since the activator may be “quenched” by
the repressor and with some probability lose the ability to communicate with the basal
machinery. These weights, which hold such high biological relevance and have the ability to
provide hypotheses on the mechanisms involved in transcriptional regulation, have been the
27

focus of my ongoing work (discussed in Chapter 6, section 6.1) with thermodynamic-based
models of transcriptional regulation in the early Drosophila embryo.
The two biological phenomena that have drawn the most attention in this ﬁeld have
been cooperativity between pairs of TFs and quenching or repressive activity of repressors
on activators. Some investigation of these weights has been carried out previously, but
only very simple functions have been considered up until now [75, 45, 105, 26]. In 2006,
a model was tested on endogenous gene expression data in the early Drosophila embryo
which incorporated terms for protein cooperativity, although one constant cooperativity
value was given to each pair of proteins [105]. In 2010, a similar model was tested on
synthetic gene expression data in the early Drosophila embryo which incorporated terms
for protein cooperativity, where a discrete (very small) number of diﬀerent cooperativity
values were used to incorporate distance(in base pairs between binding sites)-dependent
cooperativity [26]. It is still unknown what the distance-dependent cooperativity function is
for any given pair of proteins, although hypotheses can be made based upon what is known
about the actual size of proteins, the helical shape of DNA, and the nucleosome positioning
on DNA.
With regard to the quenching of short-range repressors on bound activator proteins, in
2003 a model was proposed, although not implemented, which incorporated a quenching
term into thermodynamic-based models which was distance-dependent [75]. In this publication, the quenching function proposed was a piecewise-deﬁned monotonically decreasing
function of distance. The function was constructed by taking two constant functions (1 and
0) and connecting them with a linear function. This was based on the notion that repressors
have a given range over which they can repress bound activators. They quench with full (1)

28

eﬃciency when activator sites are very close and do not quench at all (0) when they are out
of range. In between these two distances, a linear interpolation was done, assuming that
repressors can vary in their eﬃciency at intermediate distances [75]. Using this same idea,
in 2006 two diﬀerent studies were published using models with either a simple monotonic
step-function or a linear function for distance-dependent quenching [45, 105]. In 2010, the
Fakhouri et al.study also incorporated quenching into their model. Since their study was
conducted on synthetic enhancers in which the activity of short-range repressors were thoroughly investigated by creating enhancers with varying distances between repressor binding
sites and activator binding sites, they were able to handle distance-dependent quenching
with a bit more sophistication. In the Fakhouri et al. study, a step-function representing
distance-dependent quenching was ﬁt. Diﬀerent step-functions were investigated by altering
the binning strategy used (the intervals on which each step was deﬁned) and ﬁtting each
function to the experimental data. One of the most important points of this publication was
the surprising ﬁnding that the best ﬁt quenching function was non-monotonic, supporting
the hypothesis that repressors have a preferred distance due to spatial constraints [26].

2.3.2

Overlapping Binding Sites

During implementation of thermodynamic-based models, another important consideration is
how to calculate the occupancy distribution of TFs on an enhancer that contains overlapping
binding sites, those that have the ability to bind two diﬀerent proteins. This is a common
enhancer characteristic due to similarities in protein binding preferences. For example, TFs
involved in early Drosophila development, such as Bicoid and Kruppel as well as Twist and
Snail, have been shown to bind the same region of DNA in multiple enhancers [88, 42].

29

Thermodynamic-based models handle such binding sites by making the assumption that,
due to spatial constraints, TF binding sites may be occupied by only one protein at a time.
This aﬀects the total number of possible states since any overlapping binding sites have three
possible states compared to binding sites for a single protein which have only two possible
states. An example of the occupancy distribution of TFs on an enhancer that contains an
overlapping binding site follows.
Example: Consider the case of an enhancer, S, which contains two binding sites, one for
a transcriptional activator, A1, and one overlapping binding site which can bind
a transcriptional activator, A2, or a repressor, R. Assume that the expression
level of the gene (measured as the concentration of mRNA) is proportional to the
fraction of successful states. We will assume that the activators, A1 and A2, can
cooperate, with C representing the cooperative interaction between the TFs when
they bind simultaneously, and that R can repress A1, with Q representing the
quenching ability of R on A1 when both proteins are bound. The expression level
is then calculated as

[mRN A] ∝

KA1 [A1] + KA2 [A2] + CKA1 KA2 [A1][A2] + (1 − Q) KA1 KR [A1][R]
.
1 + KA1 [A1] + KA2 [A2] + KR [R] + CKA1 KA2 [A1][A2] + KA1 KR [A1][R]

Note that there are only six possible states of the enhancer (six terms in the
denominator) since A2 and R cannot bind simultaneously.
This treatment of overlapping binding sites is used in the thermodynamic-based modeling
discussed in Chapter 6.

30

Chapter 3
Dynamic Modeling: Incorporating
Both cis-regulatory and Temporal
Information

3.1

Introduction and Problem Description

As mentioned in Chapter 1, modeling of eukaryotic gene regulation has focused on either
time-dependent interactions of gene networks [43, 69, 13] or equilibrium approaches that
can explore the cis-regulatory grammar of transcriptional enhancers [85, 45, 105, 38, 26].
These models rely on systems of diﬀerential equations or thermodynamic-based descriptions,
which can be used either to understand the dynamics of a system, or DNA level regulatory
processes. For many real-world settings, both aspects need to be considered in order to
simulate the dynamic changes in gene expression taking place early in development.
To combine the strengths of each of these approaches, we have developed a novel model

31

that joins these methods to provide a dynamical description of gene regulatory systems,
using detailed DNA-based information, as well as spatial TF concentration data at varying
time points. Our ‘two-layer’ modeling approach uses a thermodynamic-based model as the
synthesis term in our diﬀerential equation, representing the rate of mRNA production. The
diﬀerential equation incorporates all other terms, representing decay and diﬀusion of mRNA;
thus the model does not lose its eﬀectiveness in predicting emerging spatial expression patterns over time.
Simple dual layer models have been applied and proven successful in prokaryotic systems,
such as that of the lac operon in E.coli [81]. The diﬃculty of applying such models to
eukaryotic systems is the increased complexity of cis-regulatory modules in addition to the
diﬃculty of collecting high quality data sets. With advances in high-throughput genome
sequencing and transcriptome analysis, we have gained crucial insights and data sets that
make such a model possible in a complicated eukaryotic system. Here, I describe the steps
taken in building a two-layer model of eukaryotic transcriptional regulation and its numerical
implementation, as well as point out some of the advantages to using this approach.
We begin constructing a model by considering the general Reaction-Diﬀusion equation:

dY
dt

= R (Y ) + D∆Y

for Y ∈ Rm , where D is a diagonal matrix containing diﬀusion constants. Note that this is a
PDE due to the dependence on both space and time. This equation is a good starting point
for a realistic model of gene regulation during development. It can incorporate mRNA and
protein concentrations, their spatial diﬀusion, decay over time, and the intricate relationships

32

that these concentrations can have on each other. All of these features are important to
consider for any time in development, but are especially important when modeling the sharp
dynamics of early development. Cell-to-cell diﬀusion is known to take place and decay rates
are important when trying to accurately model temporal dynamics of any organic substance.
mRNA production, the process of transcription, is heavily regulated by protein (TF) binding,
and protein production, the process of translation, is dependent on mRNA production. So,
starting with an equation that allows for diﬀusion, decay, and an easily modiﬁed synthesis
term which can incorporate natural feedforward and feedback loops is essential.
Since gene expression patterns in early Drosophila development are typically categorized
as either Dorsal-Ventral patterns or Anterior-Posterior patterns, the data used in most modeling studies is obtained from a single strip of discrete nuclei. Thus, it is intuitive to begin by
discretizing the diﬀusion term. We approximate diﬀusion using a ﬁnite diﬀerence method,
allowing us to transform the above PDE into the following ODE:

dya,i
dt

= Sa,i (Y ) + Da ya,i+1 − 2ya,i + ya,i−1 − λa ya,i

(3.1)

with the zero ﬂux boundary conditions:

ya,0 = ya,1
ya,n+1 = ya,n

where a represents the speciﬁc mRNA or protein that the equation corresponds to, and
Da and λa are the corresponding diﬀusion and decay rates. We will assume that we are
33

modeling a given set of mRNAs, M R, and a given set of proteins, P R. So, for all a ∈ M R,
a represents a speciﬁc type of mRNA and for all a ∈ P R, a represents a speciﬁc type of
protein. i represents the nucleus (spatial) location.
The most important term in equation (3.1) is the synthesis term, Sa (Y ). This term holds
all of the DNA level information about transcription when a ∈ M R and models translation
when a ∈ P R. It is also the term that controls the overall system behavior, hence eﬀecting
the convergence and stability criteria for any numerical solver implemented.
First consider when a ∈ M R. For these terms, Sa (Y ) represents the scaled output from
a thermodynamic-based model of the enhancer for gene a at each nucleus, i. Recall from
Chapter 2 that this function models the TF-DNA binding and TF-TF interactions regulating
gene a’s expression level in a given cell. Thus, this term in the ODE is a rational function of
the protein concentrations in nucleus i, other elements of the vector Y . For ease of notation,
recall that this function always takes the form:

Φa,i (Y )
Sa,i (Y ) = χa

1 + Ψa,i (Y )

where χa is the scaling constant associated with gene a,
N

Φa,i (Y ) =
j=1

and
N

Ψa,i (Y ) =
j=1




N


vk Kρ(k) ΓΦa (k, v) Yρ(k),i 




v∈Vj

k=1









v∈Vj



vk Kρ(k) ΓΨa (k, v) Yρ(k),i 

N

k=1

34

for N denoting the total number of binding sites on enhancer a,

Vj =




N

v ∈ NN | vk ∈ {0, 1} ∀ 1 ≤ k ≤ N and



vk = j
k=1




,



where each v represents a state of the enhancer with all proteins in positions with vk = 1
being bound and all others unbound, ρ (k) denotes the protein p ∈ P R corresponding to
binding site k, ΓΦa (k, v) equals 0 when v represents a state with no activators bound and
equals the cooperative or repressive parameter associated with the TF at binding site k
and the next bound protein in the state v otherwise, and ΓΨa (k, v) equals the cooperative
parameter associated with the TF at binding site k and the next bound protein in the state
v. Note that if vr = 0 for all r > k then ΓΦa (k, v) = ΓΨa (k, v) = 1 and if there is no
cooperative parameter associated with the TF at binding site k and the next bound protein
in the state v, ΓΨa (k, v) = 1.
For each gene a and nucleus i, Sa,i : Rm → R. The function Sa represents the vector
of these functions for the gene a over all nuclei considered, so Sa : Rm → Rµ , for µ ≤ m.
Note that these functions are predetermined based on the DNA-sequence and parameter
values. These parameter values include the transcription rate scaling constant, χa , binding
aﬃnities, Kp , cooperativity values, Cpq , and quenching parameters, Qpq , for p, q ∈ P R.
During implementation, these are all ﬁxed values throughout the simulation. Regardless of
the number of diﬀerent simulations run, each parameter value is bounded above and below.
Also note that 0 ≤ Qpq ≤ 1 for all p, q ∈ P R. Thus, Ψa,i (Y ) ≥ Φa,i (Y ) for all Y .
Next consider when a ∈ P R. For these terms, Sa (Y ) represents translation. This process
is believed to be much better understood. We operate under the simpliﬁed assumption
that newly translated proteins are produced at a rate proportional to the concentration of
35

the corresponding mRNA in nucleus i. Thus, this term in the ODE is the product of the
translation parameter, τ , and the mRNA concentration for the gene corresponding to protein
a:
Sa (Y ) = τ Yζ(a)
where ζ (a) denotes the mRNA b ∈ M R for the gene corresponding to protein a. Again
during implementation, the value of the translation parameter is ﬁxed throughout the simulation and regardless of the number of diﬀerent simulations run, this value is bounded above
and below.
When solving such an ODE system with initial conditions given, we must consider the
choice of solver.

3.2
3.2.1

Development of a semi-implicit numerical method
Construction

We begin with equation (3.1) for a given mRNA or protein, a, at nucleus i. We will handle
the diﬀusion and decay terms implicitly and the synthesis term explicitly. This treatment
allows us to exploit the simple structure created by discretizing diﬀusion. Approximating
the left-hand side with a semi-implicit ﬁnite diﬀerence gives us:

n+1
n
ya,i − ya,i

h

n+1
n+1
n+1
n+1
= Sa,i (Y n ) + Da ya,i+1 − 2ya,i + ya,i−1 − λa ya,i

36

where h is the size of each time step and superscripts, n and n + 1, denote the solution at
that given time step. This can be rewritten as:

n+1
n+1
n+1
n
−hDya,i−1 + ya,i (1 + 2hDa + hλa ) − hDa ya,i+1 = ya,i + hSa,i (Y n ) .

Now, considering all nuclei in the one-dimensional strip of interest and the zero ﬂux boundary
conditions, for a given mRNA or protein, a, we get the following equation:

n+1 = Y n + hS (Y n )
(I + hMa ) Ya
a
a

where I is the identity matrix and Ma is the following matrix:



2D + λ , if i = j
 a

a




Ma (i, j) = −Da ,
if i = j + 1 or i = j − 1







0,

otherwise.
Note that Ma is tridiagonal, diagonally dominant, symmetric and positive semideﬁnite.
Thus, the existence of the inverse of I + hMa is guaranteed and the algorithm to compute
it is computational very inexpensive. Hence, the solver is deﬁned by the following iteration:

n+1 = (I + hM )−1 (Y n + hS (Y n )) .
Ya
a
a
a

37

(3.2)

3.2.2

Stability

Lemma 3.2.1. Given a set of ﬁxed parameter values and a ﬁxed a ∈ M R ∪ P R, there exists
a constant, ca , such that for all n < ∞, ||Sa (Y n )||2 ≤ ca ||Y n ||2 .
Proof. Observe that

||Sa (Y n )||2 ≤

√
µ ||Sa (Y n )||∞

since Sa (Y n ) ∈ Rµ and µ ≤ m < ∞ [72]. Note that the case when ||Y n ||∞ ≥ 1 is trivial
since Sa,i (Y ) is bounded above by maxa∈M R (χa ) when a ∈ M R, and Sa,i (Y ) is the
product of τ ≥ 0 and the Yk corresponding to the concentration of protein a when a ∈ P R.
Thus,
||Sa (Y n )||2 ≤

√

µ max

max χa , τ

a∈M R

||Y n ||2 .

Case 1: Assume||Y n ||∞ < 1 and a ∈ M R.
First note that 1 + Ψa,i (Y ) ≥ 1, thus

Sa,i (Y n ) ≤ χa Φa,i (Y n )
N

≤ χa
j=1

N
j

j

max Kp

p∈P R

j−1

max

p,q∈P R

38

Cpq , 1 − Qpq

n
max Yk
k

j

.

Since ||Y n ||∞ < 1 and 1 ≤ j ≤ N , it follows that
j

n
max Yk
k

= ||Y n ||j
∞
= ||Y n ||j−1 ||Y n ||∞
∞
< ||Y n ||∞ .

Now, recall that
||Sa (Y n )||∞ = max Sa,i (Y n )
i

and
max Sa,i (Y n ) = max Sa,i (Y n )
i

i

since Sa,i ≥ 0 for all a and i. Thus,
N

||Sa (Y n )||∞

≤ χa
j=1

j

N
j

max Kp

p∈P R
N

≤ χa ||Y n ||∞
j=1

N
j

j−1

max

p,q∈P R

Cpq , 1 − Qpq

j

max Kp

p∈P R

n
max Yk
k

j

j−1

max

p,q∈P R

Cpq , 1 − Qpq

.

Let
√

N

c2 = χ a µ
j=1

N
j

j

max Kp

p∈P R

j−1

max

p,q∈P R

Cpq , 1 − Qpq

.

c2 is a constant independent of n, since χa , µ, N , Kp , Cpq , and Qpq values are
independent of n for all p, q ∈ P R. Thus, there exists a constant, c2 , such that for
39

all n < ∞,
||Sa (Y n )||2 ≤ c2 ||Y n ||∞
and
||Y n ||∞ ≤ ||Y n ||2
since Y n ∈ Rm and m < ∞ [72]. Therefore,

||Sa (Y n )||2 ≤ c2 ||Y n ||2 .

Case 2: Assume||Y n ||∞ < 1 and a ∈ P R.
Observe that
||Sa (Y n )||∞ ≤ τ

n
max Yk
k

since a ∈ P R and τ ≥ 0. Now, let c3 =

√

= τ ||Y n ||∞

µτ . c3 is a constant independent of n,

since µ and τ are independent of n. Thus, there exists a constant, c3 , such that for
all n < ∞,
||Sa (Y n )||2 ≤ c3 ||Y n ||∞
and
||Y n ||∞ ≤ ||Y n ||2
since Y n ∈ Rm and m < ∞ [72]. Therefore,

||Sa (Y n )||2 ≤ c3 ||Y n ||2 .

40

Note that we are only interested in numerically solving our system in a ﬁnite time interval,
[0, T ]. Our solution is not guaranteed to be bounded for all time, but it is bounded on the
ﬁnite time interval in which we are interested, as shown in the following lemma. In addition,
we prove stability on the ﬁnite time interval [0, T ].
Lemma 3.2.2. Given T > h > 0 and a set of ﬁxed parameter values, there exists a constant,
c, such that for all n ≤ T , ||Y n ||2 ≤ (1 + ch)n Y 0 2 .
h
Proof. First note that
1


||Y n ||2

2

n
||Ya ||2 
2

=

.

a∈M R∪P R

n
Recall the deﬁnition of Ya given by equation 3.2:

n
n−1 + hS Y n−1
Ya = (I + hMa )−1 Ya
a

.

Observe that
n
||Ya ||2 ≤ (I + hMa )−1

2

n−1 + hS Y n−1
Ya
a

2

by the deﬁnition of an induced matrix norm [89] and

n−1 + hS Y n−1
Ya
a

2

n−1
≤ Ya
+ h Sa Y n−1
2

41

2

by the triangle inequality. Thus,

n
||Ya ||2 ≤ (I + hMa )−1

||Y n ||2 + h Sa Y n−1

2

Sa Y n−1

By Lemma 3.2.1, there exists a constant ca such that

2

2

.

≤ ca Y n−1 2 . Now

note that [72]
||I + hMa ||2 (I + hMa )−1

2

=

max (eig (I + hMa ))
.
min (eig (I + hMa ))

Since I + hMa is a symmetric positive semideﬁnite matrix,

||I + hMa ||2 =

max eig (I + hMa )T (I + hMa )

= max (eig (I + hMa )) > 0.

Thus,
1

−1

(I + hMa )

2

=

.
min (eig (I + hMa ))

Since I is the identity matrix, Ma is a positive semideﬁnite matrix, and h > 0,

min (eig (I + hMa )) = 1 + h min (eig (Ma )) ≥ 1.

Hence, (I + hMa )−1

2

≤ 1. Therefore,

n
n−1
||Ya ||2 ≤ (1 + ca h) Ya
.
2

42

Let
c=

max

a∈M R∪P R

ca .

c is a ﬁnite constant independent of n. It follows that

n
n−1
||Ya ||2 ≤ (1 + ch) Ya
.
2

Now, observe that
1


||Y n ||2

2

n
||Ya ||2 
2

=
a∈M R∪P R

1


≤

(1 + ch)

2

a∈M R∪P R

n−1 2 
Ya
2

2

1


2

= (1 + ch)

a∈M R∪P R

= (1 + ch) Y n−1

2

n−1 2 
Ya
2

2

.

It then follows directly that

||Y n ||2 ≤ (1 + ch)n Y 0

2

.

Lemma 3.2.3. Suppose Y 0 ∈ Rm is a vector of initial normalized mRNA and protein
˜
concentrations, and Y 0 ∈ Rm is a perturbed vector of initial values. Given T > h > 0, a
43

set of ﬁxed parameter values, and a ﬁxed a ∈ M R ∪ P R there exists a constant, ca , such
˜
that ∀n ≤ T , if Y n and Y n are the corresponding numerical solutions at time step n, then
h
˜
Sa (Y n ) − Sa Y n

2

˜
≤ ca Y n − Y n

2

.

Proof. Observe that

˜
Sa (Y n ) − Sa Y n

2

≤

√
˜
µ Sa (Y n ) − Sa Y n

∞

˜
since Sa (Y n ) − Sa Y n ∈ Rµ and µ ≤ m < ∞ [72].
Case 1: Assume a ∈ M R.
Observe that

˜
Sa (Y n ) − Sa Y n

∞

˜
= max Sa,i (Y n ) − Sa,i Y n
i

.

Recall that
Φa,i (Y )
Sa,i (Y ) = χa

1 + Ψa,i (Y )

and 1 + Ψa,i (Y ) ≥ 1. Since Sa,i (Y ) is diﬀerentiable for all Y ∈ Rm with Yj ≥ 0 for
˜
all 1 ≤ j ≤ m, by the Mean Value Theorem, there exists a ξ ∈ {ηY n +(1 − η) Y n |0 ≤

44

η ≤ 1} such that

˜
Sa,i (Y n ) − Sa,i Y n

=

˜
Sa,i (ξ) · Y n − Y n

≤

˜
Sa,i (ξ) ∞ Y n − Y n
∞

by the Cauchy Schwartz Inequality. Now observe,

∂Sa,i
(ξ)
Sa,i (ξ) ∞ = max
1≤j≤m ∂Yj




∂Ψa,i
 ∂Φa,i 

(ξ) 1 + Ψa,i (ξ) −
(ξ) Φa,i (ξ)
 ∂Y

∂Yj
j
= max

1≤j≤m

χa

1 + Ψa,i (ξ)



2



∂Ψa,i
 ∂Φa,i 

(ξ) 1 + Ψa,i (ξ) −
(ξ) Φa,i (ξ)

∂Yj
1≤j≤m  ∂Yj

≤ |χa | max

since 1 + Ψa,i (ξ) ≥ 1. Continuing, observe that




  ∂Φa,i 
Sa,i (ξ) ∞ ≤ |χa | max  
(ξ) 1 + Ψa,i (ξ)

1≤j≤m   ∂Yj

45


∂Ψa,i
+


(ξ) Φa,i (ξ) 

∂Yj

by the triangle inequality. It then follows that




∂Ψa,i
 ∂Φa,i

Sa,i (ξ) ∞ ≤ |χa | max 
(ξ) +
(ξ) Φa,i (ξ) 

∂Yj
1≤j≤m  ∂Yj

since 1 + Ψa,i (ξ) ≥ 1. Note that
N
j=1

∂Φa,i
∂Yj
∂Ψa,i
∂Yj

N

(ξ) ≤
j=1

j
j=1

j−1

max Kp

max

p∈P R

N
j
j

N

(ξ) ≤

j

N
j

Φa,i (ξ) ≤

N
j

p,q∈P R

Cpq , 1 − Qpq

k

j

j−1

max

Cpq , 1 − Qpq

max

max Kp

p∈P R

Cpq , 1

p,q∈P R

j

max Kp

j−1

max |ξk |
k

j−1
p,q∈P R

p∈P R

j

max |ξk |

j−1

max |ξk |
k

.

˜
Since ξ ∈ {ηY n +(1 − η) Y n |0 ≤ η ≤ 1} and by Lemma 3.2.2, there exists a constant
c such that
||Y n ||∞ ≤ (1 + ch)n Y 0

∞

≤ (1 + ch)n

and
˜
Yn

∞

˜
≤ (1 + ch)n Y 0

∞

≤ (1 + ch)n ,

it follows that
max |ξk | = ||ξ||∞ ≤ (1 + ch)n .
k

46

So, let
N

N
j

c1 =
j=1

j−1

j

max Kp

max

p∈P R

p,q∈P R

Cpq , 1 − Qpq

Tj

(1 + ch) h

and
N

c2 =

j
j=1

N
j

j−1

j

max Kp

p∈P R

max

p,q∈P R

Cpq , 1 − Qpq

(1 + ch)

T (j−1)
h

.

c1 and c2 are constants independent of n, since Kp , Cpq , and Qpq values are ﬁxed
∀p, q ∈ P R and N , T , and h are known. Then,

Sa,i (ξ) ∞ ≤ |χa | (c1 + c1 c2 ) ≤ |χa | (c1 + c1 c2 )
√
since n ≤ T . Therefore, there exists a constant, µ |χa | (c1 + c1 c2 ), independent of
h
n since µ is known and |χa | is a constant independent of n, such that for all n ≤ T ,
h

˜
Sa (Y n ) − Sa Y n

2

≤

√

˜
µ |χa | (c1 + c1 c2 ) Y n − Y n

∞

.

˜
Since Y n − Y n ∈ Rm and m < ∞ [72],

˜
Yn−Yn

∞

˜
≤ Yn−Yn

2

.

Thus,
˜
Sa (Y n ) − Sa Y n

2

≤

√

Case 2: Assume a ∈ P R.
47

˜
µ |χa | (c1 + c1 c2 ) Y n − Y n

2

.

Observe that

˜
Sa (Y n ) − Sa Y n

Now, let c3 =

√

∞

n
˜n
max Yk − Yk
k

=τ

˜
=τ Yn−Yn

∞

.

µτ . Thus, there exists a constant, c3 , such that for all n < ∞,

˜
Sa (Y n ) − Sa Y n

2

˜
≤ c3 Y n − Y n

.

∞

˜
Since Y n − Y n ∈ Rm and m < ∞ [72],

˜
Yn−Yn

∞

˜
≤ Yn−Yn

2

.

Thus,
˜
Sa (Y n ) − Sa Y n

2

˜
≤ c3 Y n − Y n

2

.

Theorem 3.2.4. Given a set of ﬁxed parameter values, the numerical solution given by (3.2)
is stable (in the sense of Lyapunov) [23] in a ﬁnite time interval [0, T ].
Proof. Suppose Y 0 is a vector of initial normalized mRNA and protein concentrations and
˜
˜
Y 0 is a perturbed initial value. Let T > h > 0 and n ≤ T . Suppose Y n and Y n are the
h
˜
corresponding numerical solutions to Y 0 and Y 0 at time step n. First note that
1


Yn

˜
−Yn

2

n
Ya

=
a∈M R∪P R

48

˜n
− Ya

2
2

2

 .

Observe that by equation 3.2:

˜ n−1 + hSa Y n−1
˜
− (I + hMa )−1 Ya

n
n−1 + hS Y n−1
˜n
Ya − Ya = (I + hMa )−1 Ya
a

n−1 − Y n−1 + h S Y n−1 − S Y n−1
˜a
= (I + hMa )−1 Ya
a
a ˜

.

Thus,

n
˜n
Ya − Ya

2

≤ (I + hMa )−1

2

n−1 − Y n−1 + h S Y n−1 − S Y n−1
˜a
Ya
a
a ˜

by the deﬁnition of induced matrix norm [89] and

n−1 − Y n−1 + h S Y n−1 − S Y n−1
˜
Ya
a
a ˜
a

2

n−1 − Y n−1
˜
˜
≤ Ya
+ h Sa Y n−1 − Sa Y n−1
a
2

by the triangle inequality. As shown in the proof of Lemma 3.2.2,

(I + hMa )−1 ≤ 1

and by Lemma 3.2.3, there exists a constant, ca , such that

˜
Sa Y n−1 − Sa Y n−1

2

49

˜
≤ ca Y n−1 − Y n−1

2

.

2

2

Let
c=

max

a∈M R∪P R

ca .

c is a ﬁnite constant independent of n. It follows that

n
˜n
Ya − Ya

2

n−1 − Y n−1
˜
≤ (1 + ch) Ya
a

2

.

Now, observe that
1


Yn

˜
−Yn

2

n
Ya

=
a∈M R∪P R

2
˜n
− Ya 
2

2

1


2

≤

(1 + ch)

a∈M R∪P R

n−1 − Y n−1 2 
˜a
Ya
2

1


= (1 + ch)

2

2

n−1
Ya
a∈M R∪P R

˜
= (1 + ch) Y n−1 − Y n−1

50

2

.

˜ n−1
− Ya

2

2
2



It then follows that

˜
Yn−Yn

2

˜
≤ (1 + ch)n Y 0 − Y 0
˜
≤ echn Y 0 − Y 0
˜
≤ ecT Y 0 − Y 0

2

2
2

.

Thus, by deﬁnition of stability (in the sense of Lyapunov), the numerical solution given by
(3.2) is stable in the ﬁnite time interval [0, T ].
Knowing that our solver is stable gives us an idea of how robust the numerical solver is.
We have found a constant, ecT , which measures the eﬀect of small perturbations in the initial
data on the numerical solution. Now, we would like to determine how well our numerical
solver actually approximates the true solution of the system and whether we can improve
our approximations by decreasing the step-size, h. For this, we need to determine whether
our solver is convergent, and we will use the stability we have just shown to help in proving
convergence.

3.2.3

Convergence

A fundamental theorem of numerical solvers states that if a method is consistent and 0stable, then the method is convergent [5]. Since we have shown stability (in the sense of
Lyapunov) in the previous section, a stronger condition than 0-stable, to prove convergence
of the numerical solution given by (3.2), it suﬃces to show that the method is consistent.
Deﬁnition 1. Let Y (tn ) denote the true solution to equation (3.1) at time step tn . Suppose
51

Y n is the numerical solution obtained after a single iteration, starting from the true solution
at the previous time step, Y (tn−1 ). The local error, dn , at step n is deﬁned by:

dn = Y (tn ) − Y n

Theorem 3.2.5. Given a set of ﬁxed parameter values and a ﬁnite time interval [0, T ], the
numerical solution given by (3.2) is consistent.
Proof. Consider the local error, dn , at step n. Note that

||dn ||2 = ||Y (tn ) − Y n ||2
1



2

n
||Ya (tn ) − Ya ||2 
2

=

.

a∈M R∪P R

n
n
For ease of notation, let dn,a = Ya (tn )−Ya . Recall the deﬁnition of Ya given by equation 3.2:

n
n−1 + hS Y n−1
Ya = (I + hMa )−1 Ya
a

.

Observe that

dn,a = Ya (tn ) − (I + hMa )−1 (Ya (tn−1 ) + hSa (Ya (tn−1 )))

since Y n is the numerical solution obtained after a single iteration, starting from the true

52

solution to equation (3.1) at the previous time step, Y (tn−1 ). Thus,

(I + hMa ) dn,a = (I + hMa ) Ya (tn ) − (Ya (tn−1 ) + hSa (Ya (tn−1 )))
= Ya (tn ) − Ya (tn−1 ) + h (Ma Ya (tn ) − Sa (Ya (tn−1 ))) .

Since Y (tn−1 ) is the true solution to equation (3.1) at time tn−1 ,

dYa
Sa (Ya (tn−1 )) =

dt

(tn−1 ) + Ma Ya (tn−1 ) .

Thus,




dYa


Ma Ya (tn ) −
(I + hMa ) dn,a = Ya (tn ) − Ya (tn−1 ) + h 
(tn−1 ) − Ma Ya (tn−1 )

dt
dYa
= Ya (tn ) − Ya (tn−1 ) − h

dt

53

(tn−1 ) + h (Ma Ya (tn ) − Ma Ya (tn−1 )) .

Now, Taylor expanding Ya (t) about tn−1 , results in the following:

dYa
dYa
(I + hMa ) dn,a = Ya (tn−1 ) + h
(tn−1 ) + O h2 − Ya (tn−1 ) − h
(tn−1 )
dt


dt



dYa




+ h Ma Ya (tn−1 ) + h
(tn−1 ) + O h2  − Ma Ya (tn−1 )




dt
= O h2 .

Hence,
dn,a = (I + hMa )−1 O h2 .
As shown in the proof of Lemma 3.2.2,

(I + hMa )−1 ≤ 1.

Thus,
dn,a 2 ≤ O h2 .
Therefore,
1



2

2
dn,a 2 

||dn ||2 = 
a∈M R∪P R

1


≤

O
a∈M R∪P R

54

2
h2 

2

= O h2 .

So, by deﬁnition of consistent [23], the numerical solution given by (3.2) is consistent.
Now that we have shown stability and consistency of our numerical solver, it follows that
the solver is convergent. Thus, as we take smaller and smaller step-sizes, ie: as h → 0, the
numerical solution given by (3.2) converges to the true solution of equation (3.1).

3.3

Numerical Simulations

Using the numerical solver as deﬁned by (3.2), along with the parameter estimation algorithm, CMAES, which will be described in detail in Chapter 5, we can ﬁt expression data.
For ﬁtting, we used an 80 minute time window corresponding to the period of development
from late nuclear cycle 13, when twi, sna, and rho begin to be expressed, to late nuclear
cycle 14, when twi, sna, and rho expression patterns are mature. For ﬁtting, we began with
previously published quantitative data for mRNA concentrations at the time point t = 80
min. [105]. 20 nuclear positions were used at approximately 50% egglength (A-P) from the
ventral-midline to approximately 50% eggheight (V-D). We combined this with a qualitative
assessment of blue-white images at time points corresponding to t = 0 min and t = 30 min.
We then linearly interpolated these three sets of data to create a set of data corresponding
to 16 time points, in 5 minute intervals, to use for ﬁtting model parameters. Initial mRNA
concentrations used were those corresponding to the data at t = 0 min and initial protein
concentrations were set to 0. The results are shown in Figures 3.1-3.3. Figures 3.1 and
3.2 demonstrate the convergence of the solver and Figure 3.3 illustrates the ability of the
model to successfully replicate the dynamic expression of a simple Drosophila gene regulatory
circuit.
This model is the ﬁrst in a eukaryotic system to include a transcriptional term that
55

A
twi

sna

rho

twi

sna

rho

twi

sna

rho

twi

sna

rho

B

C

D

Figure 3.1: Simulations with decreasing step size. Parameter values were ﬁxed in each of
these four model simulations, allowing us to analyze the eﬀect of changing step-size on the
numerical solutions obtained by (3.2). In panel A) h = 1.0, B) h = 0.1, C) h = 0.01, and
D) h = 0.001.

56

A
twi

sna

rho

twi

sna

rho

twi

sna

rho

twi

sna

rho

B

C

D

Figure 3.2: Contour plots for simulations with decreasing step size. Parameter values were
ﬁxed in each of these four model simulations, allowing us to analyze the eﬀect of changing
step-size on the numerical solutions obtained by (3.2). In panel A) h = 1.0, B) h = 0.1, C)
h = 0.01, and D) h = 0.001.
57

Figure 3.3: Data vs. Model Predictions. 8 time points are shown. Each pair of graphs
depicts on the left, the data used to ﬁt model parameters, and on the right, the model
predictions after parameter ﬁtting, for the given time point. Observe that the model is very
successful in ﬁtting the general dynamics and expression pattern trends of this system. The
corresponding RMSE for this prediction is 0.0616.

58

takes into account DNA sequence information and a translational term to relate mRNA
and protein concentrations. All simulation results shown were obtained using the numerical
solver given by (3.2), which we have shown is both stable and convergent in the ﬁnite time
interval in which simulations were run. Each simulation took under one minute to generate
a numerical estimate for the 80 minute time period of interest and parameter estimation
using an evolutionary strategy took less than 24 hours. This is a notable improvement in
computational speed compared to previous methods that did not incorporate any DNA level
information [43, 6, 7]. The results shown support our earlier claim that by incorporating data
on DNA sequence we are able to successfully model context-speciﬁc features of enhancers, as
well as replicate the dynamic expression of a simple Drosophila gene regulatory circuit that
drives development in the dorsal-ventral axis of the blastoderm embryo, involving Dorsal,
Twist, Snail, and rhomboid. Where protein and cis-regulatory information is available, this
model provides a powerful method to recapitulate and predict dynamic aspects of eukaryotic
transcriptional systems that will greatly improve our understanding of gene regulation at a
global level.

59

Chapter 4
Sensitivity Analysis
This Chapter was taken from the publication:
“Thermodynamic-based modeling of transcription: sensitivity analysis diﬀerentiates biological mechanism from mathematical model-induced eﬀects”
Jacqueline M Dresch, Xiaozhou Liu, David N Arnosti, Ahmet Ay
which was published in BMC Systems Biology, Volume 4, in 2010.

4.1

Introduction

In this chapter, we focus on the parameters involved in thermodynamic-based models. For
these models, parameters are usually not known and the derivation of these parameters
from biological data, such as images from confocal microscopy, is complicated by the noisy
nature of biological data. Two major challenges then face the modeler who seeks a biological interpretation of the parameter values. First, large parameter ranges are often found,
leading to uncertainty about where realistic parameter values lie. In many cases, relevant
experimental measurements have not been conducted - mathematical models are often in
60

fact the best method to determine such values. Second, parameter values can be strongly
inﬂuenced by the form used to shape the problem, thus the problem then becomes whether
these extracted values are realistic - are the values due to properties of the biological system
or the mathematical model?
To address these questions, one must examine the mathematical model itself to determine
the uncertainty in parameter values and ascertain the impact that perturbations in parameter values will have on the model output. This question of parameter uncertainty and
sensitivity analysis has been addressed in many diﬀerent applications, including biological,
chemical and risk assessment [34, 25, 58, 96, 103, 29, 16]. Parameter uncertainties have been
explored extensively for ODE models, which typically have a very large number of parameters, resulting in a great deal of model variation and parameter uncertainty [34, 25, 58, 96].
The sensitivity of parameters derived from thermodynamic-based transcription models, however, has not yet been examined in such a context; studies have focused simply on extracting
and interpreting parameter values or parameter ranges [26, 105, 85, 75, 45].
As discussed in the Chapter 2, thermodynamic-based models are based on formulations
originating from statistical physics. For transcriptional analysis, these models consider all
possible states of a DNA regulatory element, where a state refers to a speciﬁc conﬁguration of
regulatory proteins (TFs or TFs) bound to DNA [26, 105, 85, 75, 45]. Each state is awarded
a weight, which depends on properties such as the binding aﬃnity and concentration of
proteins. In many thermodynamic-based models, including those examined in this chapter,
the mathematical formula used to calculate the expression level is a rational function in
which parameters to be estimated can be found in both the numerator and denominator
[26, 105].

61

Sensitivity analysis can be applied to any ﬁeld that uses mathematical modeling as a tool,
and has been used extensively on diverse models in economics, civil engineering, and medicine
[68, 27, 41]. In drug design, sensitivity analysis has been used in determining which parameter
or parameters of a diﬀerential equation model would have the largest eﬀect on a given
outcome of interest, such as the sugar level in the blood of a patient [41]. Its utility is shown
here when applied to thermodynamic-based models of gene regulation (Figure 4.1). Such
models have been applied to systems ranging from a single cis-regulatory element to diverse
collections of highly divergent elements bound by a wide range of proteins [85, 45]. This
chapter concentrates on two studies that took a very focused approach to identify speciﬁc
features relating to transcriptional activation, repression and cooperativity by examining
sets of similar enhancers that feature a limited degree of variation.

4.2

Methods

Sensitivity analysis, whether local or global, has the primary goal of determining how a
given model responds to variations in parameter values. Local sensitivity analysis (also
known as diﬀerential analysis or nominal range sensitivity analysis) focuses on a particular
point in parameter space, varying parameters one at a time to obtain a local response of
the model to each parameter [96, 29, 78, 94]. Global sensitivity analysis, on the other
hand, tries to capture the entire parameter space all at once, allowing multiple parameter
values to be explored simultaneously. Parameters are individually analyzed by averaging the
variation in model output over the entire space [78]. It is possible to explore multi-parameter
responses using local sensitivity analysis, employing techniques such as second-order partial
derivatives, but this analysis can be computationally costly. For this reason, in this chapter
62

a global sensitivity analysis is used to investigate multi-parameter eﬀects.
Our approach initially used local sensitivity analysis to calculate the models rate of change
with respect to each variable at a given point in parameter space, quantifying at that point
how small changes in each variable will aﬀect the model output. Then, we turned to global
analysis, employing three diﬀerent methods. The ﬁrst simply takes an average of results
obtained from local sensitivity analyses on diﬀerent points in parameter space. The other
two global methods each rely on determining which parameters and parameter combinations
contribute most to the variation in the model output that resulted from sampling a wide
range of parameter values. The eFAST algorithm does this through Fourier analysis, while
the HDMR algorithm does this through polynomial approximation.

4.2.1

Local Sensitivity Analysis

Most local sensitivity analysis methods use partial derivatives evaluated at a point in parameter space to determine how the model output changes locally with respect to small
variations of a particular parameter [25, 96, 94, 73]. To gain insight on local eﬀects, when
calculating the sensitivity coeﬃcient, the partial derivative is normalized using the relative
changes in quantities [78, 73]. To calculate the local sensitivity coeﬃcient for an objective
function, C, with respect to a parameter, µ, one uses the formula [73]:




µ ∂ 
sensitivity coeﬃcient =  C  .
C  ∂µ 

A major advantage of local sensitivity analysis is its ease of implementation [94]. The
formula is straightforward and the results are simple to interpret, but must be viewed with

63

caution, because the analysis is only valid in the small region around a particular point
[25, 58, 103]. Also, this method does not account for parameter interactions, so in nonlinear models, this often results in an underestimation of true model sensitivities [29, 94].
In many mathematical models of biological events, including diﬀerential equation models
of the gap gene network in Drosophila, parameter ranges are fairly well constrained due to
previous studies which have estimated values for parameters such as diﬀusion and decay
rates [6, 7, 28]. This allows a great deal of information to be extracted from a local analysis.
With the large uncertainty of parameter values and ranges in models such as those we focus
on in this study, this method is not robust, so global sensitivity techniques are preferred.

4.2.2

Global Sensitivity Analysis

These approaches vary parameters over a larger parameter space and have the ability to
quantify parameter interactions [58, 103, 29, 94]. Global sensitivity analysis covers a much
wider range of methods than that of local sensitivity analysis. A common element is that
global sensitivity methods explore the full parameter space, quantifying a model’s sensitivity
to perturbations in each parameter. Due to computational limitations, this exploration of
parameter space requires a sampling method that randomly samples parameter space, while
evenly covering individual parameter ranges [96]. Here, we focus on three global sensitivity
methods: a basic, easily-implemented method of average local sensitivities, and two more
complex methods that capture ﬁrst-order sensitivities, as well as higher order sensitivities,
those sensitivities derived from varying more than one parameter. Higher order sensitivities
can capture possible parameter interactions. An example of higher order interactions would
be the relationship between the intrinsic activity of a TF and cooperative binding. Solutions

64

to a model would include those that ascribed high cooperativity between weak neighboring
TFs, as well as low cooperativity between inherently powerful TFs. Thus, the higher order
sensitivity would quantify the impact this relationship has on the model output. For a
particular system, varying cooperativity or TF activity alone may have little eﬀect on model
output, thus these parameters would have small ﬁrst-order sensitivity indices, because the
two eﬀects can “cancel each other out” or compensate. If cooperativity and activity are
simultaneously varied, large eﬀects might be observed, producing signiﬁcant second-order
sensitivities.
(i) Method 1
To overcome the major disadvantage of the small region of sampling in local sensitivity
analysis while retaining the ease of computational simplicity, some studies have used a
global approach of averaging local sensitivities taken throughout the parameter space
[96]. This method chooses sets of parameters randomly using realistic ranges of each
parameter space, then local sensitivity analysis is applied to each set of parameters,
and results are averaged over the entire space for each parameter to provide a global
sensitivity coeﬃcient. This method can be computationally expensive.
(ii) Method 2
One of the earliest, yet most eﬃcient global sensitivity analysis methods is the variancebased Fourier amplitude sensitivity test (FAST) [22]. FAST uses a unique searchcurve to explore parameter space and employs Fourier decomposition of the objective
function to calculate main eﬀects, or ﬁrst-order sensitivity indices [29, 78]. The FAST
method does not rely on any assumptions about the form of the model, thus it can be
used on nonlinear, monotonic or nonmonotonic models [29, 78]. An extension of the
65

FAST method (eFAST) can compute not only the ﬁrst-order, but also the total-order
sensitivity coeﬃcients for a given parameter, as deﬁned by Sobol’ [58, 78]:
n

STi = Si +

n

Sijk + · · · + S12...n

Sij +
j=1
j=i

1≤j<k≤n
j=i,k=i

The search-curve used to sample the parameter space is deﬁned by a collection of sinusoidal functions, one for each parameter. The assigned frequencies of these functions
are then used to partition the variance amongst parameters and calculate the ﬁrst-order
sensitivity indices by Fourier analysis. For each parameter, the eFAST algorithm calculates the ﬁrst order sensitivity of that parameter as well as the ﬁrst order sensitivity
of the entire set of remaining parameters. The total-order sensitivity coeﬃcient for the
parameter of interest is then approximated by the sum of that parameters ﬁrst-order
sensitivity and the remaining variance after ﬁrst-order partitioning.
For the model given by the function f (x), where x is the vector of parameter values,
the ﬁrst order sensitivity index for each parameter, xi , where i = 1, . . . , n and n is the
number of parameters, is calculated in the following way:
(1) A sinusoidal function (search curve) xi = gi (j; s), is deﬁned with j = 1, . . . , Ns ,
where j values correspond to integer frequencies and Ns corresponds to the sample
size. One example of such a search curve is [78]:

1

1
gi (j; s) =

2

+

π

arcsin (sin (js))

(2) Fourier coeﬃcients are then calculated for the search curve at each frequency, j,
66

using the following formulas:
π

f (s) cos (js) ds

Aj =
−π
π

f (s) sin (js) ds

Bj =
−π
∞

where f (s) =

Aj cos (js) + Bj sin (js) = f (g1 (j; s) , . . . , gn (j; s))
j=−∞

(3) The variance at each frequency, j, and the total variance are then calculated using
these Fourier coeﬃcients in the following way:

2
2
σi,j = 2 A2 + Bj
j

Ns
2
σi,total

2
σi,j

=
j=1

(4) The ﬁrst order sensitivity index for parameter i is then calculated using the variance
at the fundamental frequency, ω, as a fraction of the total variance in the following
way:
2
σi,ω

Si = 2
σi,total
The fundamental frequencies are chosen so that each parameter with a strong
inﬂuence on the model output will produce oscillations of high amplitude at the
fundamental frequency [78]. Additionally, the fundamental frequencies, ω’s, are
bounded by the relation:
Ns = 2M max{ω} + 1
67

where M is the interference factor and is typically set to be 4 or higher [78].
There is an additional parameter, NR , which corresponds to the resampling size. Resamplings simply introduce phase shifts to the search curve, increasing the number of
total samples in parameter space. When NR > 1, Si is calculated using the mean of
the variances.
The calculation of the total order sensitivity index of a parameter is then easily computed by combining other parameters into a single group (with one fundamental frequency) and calculating the ﬁrst order sensitivity of this complementary group of parameters, SCi , and removing this proportion of variance from the total proportion of
variance. Thus, the total order sensitivity index of the parameter i is computed as:

STi = 1 − SCi .

In this chapter, the eFAST algorithm, as written by Marino et al., was implemented
[58]. Modiﬁcations to this algorithm involved removing time-dependence, including our
objective function, and setting baseline parameter values based on those published in
[26]. eFAST parameters were determined from the formulation recommended in [78].
For scheme 2, the parameters were set at k = 10, Ns = 73, NR = 100, and M = 4.
(iii) Method 3
Another global sensitivity analysis method, ANalysis Of Variance (ANOVA), decom-

68

poses a function into a summation of terms of increasing dimensionality [103]:
n

f (x) = f0 +

n

fij xi , xj + · · · + f12...n (x1 , x2 , . . . , xn )

fi (xi ) +
i=1

(∗)

1≤i<j≤n

where f0 is the main eﬀect (also designated the zeroth order term, or the overall mean)
[103, 94].
Note that this method is distinct from the ANOVA commonly used in biological studies
when comparing two or more groups of experimental data. We will refer to ANOVA in
this study as the high dimensional model representation (HDMR) expansion [103]. The
underlying assumption for an HDMR method is that the objective function is normally
distributed [103, 29, 94]. Due to random sampling, this is typically not violated, but if
a violation occurs, corrective measures can be taken [29]. Previous studies have shown
that, when implemented, ﬁrst and second-order eﬀect terms are typically suﬃcient for
a good approximation of the total sensitivity [51, 53]. In calculating the terms in the
expansion, the Monte Carlo integration method is used [103]. The HDMR algorithm
used in this study is computationally highly eﬃcient due to the use of orthonormal
polynomial approximations for ﬁrst and second-order terms [103]. These terms are then
normalized by the total variance to obtain ﬁrst and second-order sensitivity indices
corresponding to the main eﬀect of each parameter, and the eﬀects from pair-wise
parameter interactions [103].
The theoretical foundations of the HDMR algorithm are almost identical to that of
eFAST. Both algorithms originate from ideas of Sobol’ and Cukier of decomposing
the variance into its contribution from each input parameter. The HDMR algorithm

69

uses a diﬀerent method of searching parameter space than that of eFAST. Instead of
deﬁning search curves as 1-dimensional sinusoidal functions, the algorithm begins with
a quasi-random sampling of parameter space obtained using Sobol’ sequences. Before
constructing the quasi-random sampling, the ﬁrst step is to rescale the parameter ranges
to [0, 1]. This allows one to view the model function, f (x), on the domain deﬁned by
the unit hypercube, K n = {(x1 , x2 , . . . , xn ) |0 ≤ xi ≤ 1, i = 1, 2, . . . , n}.
Note that the decomposition given by (∗) always contains a ﬁnite number of terms
and is exact. When implementing the HDMR algorithm in this chapter, the following
approximation of the model function up to second order has been used:
n

f (x) ≈ f0 +

fi (xi ) +
i=1

fij xi , xj
1≤i<j≤n

When a quasi-random sampling of parameter space is taken, a set of N parameter sets
are obtained. By using these parameter sets as inputs to the model, N output values,
f xk , k = 1, . . . , N are obtained. The ﬁrst term in the decomposition of f , the main
eﬀect term, f0 , is simply calculated as the overall mean of these N outputs:

1 N
f0 =

N

f xk

k=1

The higher order terms in the expansion are then approximated by orthonormal polynomials, Φ1 , Φ2 , Φ3 , . . . , with deg (Φr ) = r. In this chapter, orthonormal polynomials
up to degree ﬁve have been used for ﬁrst order approximations and up to degree three

70

for second order approximations. Thus, the approximations were the following:
5
i
αr Φr (xi )

fi (xi ) ≈
r=1
2

2
ij

fij xi , xj ≈

βrl Φr (xi ) Φl xj
r=1 l=1
l+r≤3

ij
i
where αr and βrl are constant coeﬃcients determined by minimizing the square of the

diﬀerence between the function and the approximation:



1 N
f (x) ≈

N

n

f
k=1

xk

5
i
αr Φr (xi )

+
i=1 r=1

+
1≤i<j≤n

 2



2

r=1 l=1
l+r≤3


ij
βrl Φr (xi ) Φl xj 


using the output values obtained from the random sampling.
In this chapter, the HDMR algorithm, as written by Ziehn and Tomlin, was implemented [103]. This algorithm required a matrix of input values (9x1024) corresponding
to the n = 9 diﬀerent parameter values used for N = 1024 diﬀerent runs of the model
and a vector of output values (1024 RMSEs or 1024 Pearson correlation coeﬃcients
obtained). The input values were created pseudo-randomly, using Sobol’ sequences,
and the HDMR-GUI was run under its default settings.

4.3

Results

71

Figure 4.1: General description of thermodynamic-based models and examples from [26] and [105]. A) Column 1: fractional
occupancy of a TF binding to an enhancer. Column 2: all possible states of an enhancer containing one repressor and one
activator binding site. Column 3: fractional occupancy of each event. Column 4: expression contribution for each state.
Column 5: calculation of total expression. B) Column 1: two constructs used in [26]. Column 2: representative embryos.
Column 3: quantitative data of normalized repressor [Gt] vs reporter gene [lacZ]. Column 4: total expression formulas. C)
Line 1, Column 1: two conceptual constructs used in [105]. Line 2, Column 2: quantitative rho data used by Zinzen et al..
Line 2: total expression formulas.

72

Figure 4.1: (cont’d)

73

Figure 4.1: (cont’d)

74

To determine parameter sensitivities, we analyzed two diﬀerent thermodynamic-based
models (Figure 4.1A) recently used to study transcriptional regulation in the early Drosophila
embryo; neither publication included a thorough sensitivity analysis of the parameters. In
Figure 4.1A, note that square brackets denote protein concentrations, emphK values represent binding parameters, and the total expression is calculated as a sum of each states’
probability multiplied by its expression contribution. In the ﬁrst work, Arnosti and colleagues extracted their model parameters using protein and mRNA concentration data from
a set of simpliﬁed regulatory elements functionally tested in embryos (Figure 4.1B) [26]. This
model was implemented using nine diﬀerent variations, with the total number of parameters
ranging from eight to thirteen. The parameters represent the overall repressor activity (R),
distance-function for repression (Q), and homotypic repressor cooperativities (C). In the
second model, Levine and colleagues extracted their model parameters using protein and
mRNA concentration data representing the activity of two endogenous regulatory elements
(Figure 4.1C) [105]. Their model analyzes the expected activity of 128 diﬀerent hypothetical
regulatory sequences, with nine diﬀerent parameters for each element representing protein
binding aﬃnities, and homotypic and heterotypic cooperativities. In (Figure 4.1C), these
constructs are not depicted in the same way as in A) and B) because the Zinzen et al. model
does not consider order or spacing of binding sites, but only the number of binding sites.
Terms inside brackets indicate the presence of binding sites for that particular TF, while
subscripts indicate the number of such binding sites. In this study, we focused on the eight
forms of regulatory sequences that produced the best correlation with the experimental data,
published in Table S2 of Zinzen et al..

75

4.3.1

Local Sensitivity Analysis of the Fakhouri et al. Model

To initiate our study, we performed the local sensitivity analysis on the model of Fakhouri et
al.. In this study, Arnosti and colleagues pursued a unique bottom up approach to understand
the mechanistic processing of regulatory elements by the transcriptional machinery, using
a well deﬁned and characterized set of repressors and activators in Drosophila blastoderm
embryos. Their approach not only incorporated cooperativity between TFs and a quenching
activity of repressor proteins, as other models have done [105, 85, 75, 43, 45, 97, 79, 100],
but included distance dependence and relative enhancer positioning of these activities with
no a priori assumptions. In contrast to earlier enhancer based studies, which use many
bioinformatically determined binding sites, they created and systematically analyzed a welldeﬁned set of transcriptional regulatory elements in the Drosophila embryo with experimentally determined binding sites, focusing on repressor-activator spacing, stoichiometry and
arrangement. Due to this well deﬁned construct design, their data set also depends on a
fewer number of features than previous studies. They used experimental data from 12 of
these elements to extract parameter values representing repressor scaling factor, eﬃciencies,
and cooperativities. Fakhouri et al. used the root mean square error (RMSE) to determine
the quality of ﬁt, thus, for the sensitivity analysis of this model, RMSE was used as the
objective function [26]. Their study employed nine model schemes, which diﬀer slightly in
their treatment of distance eﬀects and cooperativity. Each scheme uses the same general
formula as the equations shown in Figure 4.1B, but the treatment of distance-dependent
repression and cooperativity diﬀers. In diﬀerent schemes, repressor-activator distances were
“binned” into diﬀerent intervals, (e.g. 0-6 bp, 28-41 bp, 50-56 bp etc. vs. 0-6 bp, 28-31 bp,
41-50 bp etc.) and cooperativity was represented as a single term or two diﬀerent terms for

76

genes containing three repressor binding sites, resulting a in diﬀerent number of parameters
for each scheme. For a full description of all nine schemes, refer to [26].

77

Figure 4.2: Sensitivity analysis using local and global-local methods on the thermodynamic-based model of Fakhouri et al.,
scheme 2. A) Local sensitivity analysis highlights great diﬀerences among parameters. Model parameters are shown on the
horizontal axis and local sensitivity coeﬃcients (scaled from 0.0 to 1.0) are shown on the vertical axis. Larger values indicate
that the model output changes greatly when this parameter is varied. B) Global-local analysis reveals diﬀerences in relative
sensitivities compared to local analysis. The global-local sensitivity contribution (also scaled from 0.0 to 1.0) is plotted on the
vertical axis. In the global-local analysis, quenching parameters (Q) are generally more sensitive than repressor scaling factors
(R), which in turn are more sensitive than cooperativity parameters (C).

78

We analyzed the local sensitivity of the Fakhouri et al. model at parameter values optimized in their study using scheme 2 (Figure 4.2A). To reduce the dependence of local analysis
on a single set of optimized values, we applied local sensitivity analysis in a global fashion.
10,000 sets of parameter values within biologically reasonable boundaries were generated,
and for each parameter, a weighted average sensitivity coeﬃcient was then calculated. We
found that this global/local approach resulted in sensitivity coeﬃcients which diﬀer significantly from the local sensitivities found using the optimized values (Figure 4.2B). This
diﬀerence is expected, because local sampling of parameter space may not reﬂect the overall
possible sensitivities of all parameter values. The sensitivities converge to those identiﬁed
by global methods, as noted below, and reveal signiﬁcant diﬀerences among the parameters
sensitivities.
The global/local approach avoids the problem of dependence on estimated parameter
values, and is global in the sense that the whole parameter space is sampled, but it fails to
capture higher-level interactions of parameters. Therefore, we applied global analyses to test
this model using two well-established methods.

4.3.2

Global Sensitivity Analysis of the Fakhouri et al. Model

79

Figure 4.3: Sensitivity analysis using the eFAST global sensitivity method on the thermodynamic-based model of Fakhouri
et al., scheme 2. A) Shown are ﬁrst- and total-order sensitivity indices, which represent the amount of variation in the
model output with respect to each parameter individually (gray bars) and in conjunction with all other parameters (black
bars), scaled from 0.0 to 1.0. If the gray bar is much smaller than the black bar, as in Q6, secondary (or higher) eﬀects are
predominant. Quenching parameters are generally most sensitive and cooperativity parameters are least sensitive. B) Eﬀect of
frequency with which a quenching parameter is represented in the twelve constructs on sensitivity. The number of constructs
that the parameter is represented in is shown along the horizontal axis. Corresponding ﬁrst- and total-order sensitivity indices
are shown for each quenching parameter, as calculated by the eFAST algorithm. There are two quenching parameters (Q4
and Q5) represented in two constructs, and two quenching parameters (Q1 and Q3) represented in ﬁve constructs. At these
values, there are four data points, two for each quenching parameter. The lines illustrate linear ﬁts to each data set, ﬁrstand total-order sensitivity indices. In general, the more constructs a quenching parameter is represented in, the more sensitive
that parameter.
80

Figure 4.3: (cont’d)

81

We applied the eFAST global sensitivity analysis method to the nine formulations of
the model of Fakhouri et al. and obtained comparable results with all of them (scheme 2
shown in Figure 4.3). Signiﬁcantly, we found that the quenching parameters, those that
represent distance-dependent repression, are generally the most sensitive and the repressorrepressor cooperativity parameters are the least sensitive. This general trend was observed
with all nine schemes (scheme 2 shown in Figure 4.3). The Fakhouri et al. data set was
constructed to speciﬁcally test spacing eﬀects, i.e., distance-dependent quenching of activators by short-range repressors, and this feature was thoroughly explored by a large number
of gene constructs. Thus, it is reasonable that with this particular experimental data used
to ﬁt the parameters, in which quenching terms are represented in many constructs, the
quenching parameters are more sensitive than other parameters because diﬀerences in ﬁt
quality will reﬂect contributions to the RMSE from many constructs. A small change will
be magniﬁed because it contributes to a larger number of the terms in the RMSE. We tested
whether the number of gene constructs in which each parameter is represented aﬀects that
parameter sensitivity for other features (Figure 4.3B). We noted that in general, the more
gene constructs in which a quenching parameter is represented, the more sensitive that parameter. The frequency with which a feature is found is not the sole inﬂuence on sensitivity,
however. Some quenching parameters, such as Q1 and Q3, are seen in the same number of
constructs but have diﬀerent sensitivity indices. We hypothesize that this diﬀerence is due
to parameter interactions, and test this idea as described below.
From a strictly mathematical viewpoint, the enhanced sensitivity of quenching parameters over cooperativity parameters is also not surprising. The form of this model itself
suggests that small changes in quenching parameters will have the largest eﬀects on the

82

model output, because quenching parameters appear only in the numerator of the expression function, while the repressor scaling factor and cooperativity parameters of the same
form appear in both the numerator and denominator (Figure 4.1). Thus, small changes in
the repressor scaling factor or cooperativity parameters will have much smaller eﬀects on the
overall change in expression. This representation of cooperativity parameters in the model
is the standard method of describing interactions in thermodynamic-based models, thus a
resulting low sensitivity may permit more non-physiological parameters to be accepted. Experimental evidence suggests that at least in some cases, however, there should be a high
degree of sensitivity to reﬂect the important protein-protein cooperative interactions on enhancers [46, 93, 21].
The mathematical representation of cooperativity in this model is based on the interactions at the TF-DNA binding level, thus cooperativities are represented in the same form
in both the numerator and denominator of the expression function (Figure 4.1). However,
cooperativity between two TFs can be represented as a post-binding event, by constructing
the model in such a way that all states of the enhancer are considered, then weighted by
a cooperativity parameter.In this second case, the model formula would be written exactly
the same as Figure 4.1B except that C1 would not be present in the denominator of the expression. Thus the cooperativity parameters would be present only in the numerator of the
expression function, increasing their sensitivity, possibly to levels comparable to quenching
parameters.
We noted that quenching parameters’ sensitivities varied, with Q6 showing the lowest sensitivity (Figure 4.3). In the construction of the model, Q6 represents the quenching directed
by a lone repressor binding to the 3’ end of the regulatory region. There is no biological

83

evidence that suggests a correlation between the particular position and short-range repression eﬀectiveness, but some other model parameter might compensate for changes in Q6.
For this reason, and the fact that some quenching parameters are seen in the same number
of constructs, but have diﬀerent sensitivity indices, we turned to another global sensitivity
analysis technique to test for possible eﬀects of interactions between parameters.

84

Figure 4.4: Sensitivity analysis using the HDMR global sensitivity method on the thermodynamic-based model of Fakhouri et
al., scheme 2. Shown are ﬁrst-, second- and the sum of ﬁrst- and second-order sensitivity indices, which represent the amount
of variation in the model output with respect to each parameter individually (light gray bars), each pair of parameters (dark
gray bars), and the sum of these two variations (black bars), scaled from 0.0 to 1.0. General sensitivity trends are similar to
those observed in Figure 2, although second-order sensitivities are largest for R and C1, implying that they aﬀect the model
output through pair-wise interactions with other parameters. Over 95% of the variation in model output was found using only
ﬁrst- and second- order sensitivities (data not shown); therefore the sum of ﬁrst- and second-order sensitivities shown here
approximates the total sensitivity of each parameter.

85

We tested the HDMR global sensitivity technique, which provides information about
inter-parameter sensitivities, on this thermodynamic-based model. The ﬁrst and secondorder sensitivity results for scheme 2 are shown in Figure 4.4. Relative overall sensitivities
were generally similar to those obtained with eFAST, however, this method also provided
insights on possible interactions between quenching, repressor scaling, and cooperativity
factors. The ﬁve largest second-order sensitivity indices found for one formulation of the
model, scheme 2, in descending order, were Q6-C1, Q3-Q5, R-C1, Q2-Q3, and R-C2. Out of
the nine schemes tested, C1 was always seen in these top ﬁve. It is interesting that although
C1 has a very small ﬁrst-order sensitivity index, the sum of its second-order sensitivity
indices is the largest of any parameter. Biologically, cooperativity between repressors would
only have an eﬀect on gene expression by increasing binding strength or quenching eﬃciency.
Hence, C1 only has an impact on the model output when we are also considering the impacts
of the repressor scaling factor or the quenching parameters. Mathematically, C1 is always
multiplied by a factor containing another parameter; therefore it can easily be compensated
for. An increase in C1 with a simultaneous decrease in another parameter would result in
no change in the predicted gene expression, hence no change in the RMSE.
HDMR second-order interactions also suggested interactions between quenching parameters. Most often, this was observed between quenching parameters that were present together
in more than one construct, such as Q3 and Q5. This eﬀect may be due to compensation in
the RMSE from one construct to another. RMSE is calculated by summing up the square errors over all constructs, therefore parameter values that increase the error in one construct,
but severely decrease the error in another construct, would leave the RMSE unchanged.
Hence, a sensitivity index interpreted as an interaction between two parameters may not

86

be as simple as the two values aﬀecting one another. A high sensitivity index between two
quenching parameters may be a result of those two quenching parameters being present together in most constructs. In this case, it would be diﬃcult to remove the dependence of one
parameter on another because the parameters appear together in most terms of the RMSE
calculation. This eﬀect should be taken into consideration when designing the experiments
and developing a model. The best way to avoid this ambiguity would be to design the experiments and the model in unison, to minimize the bias to a certain parameter or group of
parameters.
To address this question, we designed new sets of hypothetical genes, created synthetic
data sets, and performed global sensitivity analyses, as described in the following section.

4.3.3

Optimized Gene Construct Design and Synthetic Data
Table 4.1: Redesigned gene constructs
Gene Constructs

R Q1 Q2 Q3 Q4 Q5

1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.

x
x
x
x
x
x
x
x
x
x
x

1Gt 2Tw2Dl
2Tw2Dl 1Gt
1Gt 25 2Tw2Dl
1Gt 32 1Gt 2Tw2Dl
2Gt 50 2Tw2Dl
1Gt 32 1Gt 35 2Tw2Dl
2Gt 60 2Tw2Dl
2Gt 35 2Tw2Dl 1Gt
3Gt 50 2Tw2Dl
1Gt 2Tw2Dl 1Gt
3Gt 60 2Tw2Dl

Q6 C1 C2

x
x
x
x

x
x

x
x
x

x
x
x

x
x

x
x
x

x

x

x
x
x

x

x

x

x
x

87

Figure 4.5: Sensitivity analysis using the HDMR global sensitivity method on 11 redesigned gene constructs. Synthetic data
created using the Fakhouri et al. model on these 11 redesigned gene constructs shows a decrease in variation of sensitivities
between quenching parameters compared to Figure 4.3. In A, the synthetic data was created using the parameter values
that were found in the Fakhouri et. al study. In B, the synthetic data was created using mean parameter values (0.5
for quenching parameters, 50 for scaling factor and cooperativity parameters). A decrease in variation of sensitivities for
quenching parameters is also observed from A to B (the sum of ﬁrst and second order sensitivities has a range of 0.10 to
0.28, with a mean of 0.17 and a standard deviation of 0.07 in A and a range of 0.16 to 0.31, with a mean of 0.20 and a
standard deviation of 0.06), indicating that the sensitivity analysis is dependent on the true parameter values represented in
the biological data.

88

Figure 4.5: (cont’d)

89

Figure 4.6: Values of biological data can aﬀect sensitivity analysis. The eFAST global sensitivity method was applied to
the thermodynamic-based model of Fakhouri et al., scheme 2, using synthetic data with mean values (0.5 for quenching
parameters, 50 for scaling factor and cooperativity parameters). We observe a decrease in variation of sensitivities between
quenching parameters, when compared to Figure 4.2. A) Sensitivity indices represent the amount of variation in the RMSE.
B) Eﬀect of frequency with which a quenching parameter is represented in the twelve constructs on sensitivity. The number
of constructs that the parameter is represented in is shown along the horizontal axis. Corresponding ﬁrst- and total-order
sensitivity indices are shown for each quenching parameter, as calculated by the eFAST algorithm. Note that there are two
quenching parameters (Q4 and Q5) represented in two constructs, and two quenching parameters (Q1 and Q3) represented in
ﬁve constructs. Thus at these values there are four data points, two for each quenching parameter. The lines illustrate linear
ﬁts to each data set, ﬁrst- and total-order sensitivity indices. In general, the more constructs a parameter is represented in, the
more sensitive that parameter. When compared to Figure 4.2B, a more pronounced correlation between a parameter’s total90
order sensitivity coeﬃcient and the number of gene constructs that parameter is represented in is observed here, underscoring
the need for sensitivity analysis in experimental design.

Figure 4.6: (cont’d)

91

Table 4.2: Gene constructs with combinatorial quenching.
Gene Constructs

R Q1 Q2 Q3 Q4 C1 C2

1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.

x
x
x
x
x
x
x
x
x
x
x
x
x
x
x

1Gt 2Tw2Dl
1Gt 15 2Tw2Dl
1Gt 30 2Tw2Dl
2Tw2Dl 1Gt
2Gt 2Tw2Dl
1Gt 15 1Gt 2Tw2Dl
1Gt 2Tw2Dl 1Gt
2Gt 15 2Tw2Dl
1Gt 15 2Tw2Dl 1Gt
1Gt 30 2Tw2Dl 1Gt
3Gt 2Tw2Dl
2Gt 2Tw2Dl 1Gt
1Gt 15 1Gt 2Tw2Dl 1Gt
2Gt 15 2Tw2Dl 1Gt
3Gt 2Tw2Dl 1Gt

92

x
x
x
x
x
x
x

x

x

x
x

x
x
x
x
x

x
x

x
x
x
x

x
x
x
x
x
x

x
x
x
x
x
x
x

x
x

x
x

x
x

x

Figure 4.7: Eﬀect of combinations of four quenching parameters on sensitivity analysis using the HDMR global sensitivity
method on redesigned gene constructs (see Table 4.2). Compared to an even design, an increase in variation of sensitivities
between quenching parameters is seen when the data used comes from an uneven construct design (compare A to B). The
synthetic data was created using the nine constructs in which the quenching parameters were evenly distributed (A) and
not evenly distributed (B). Mean parameter values (0.5 for quenching parameters, 50 for scaling factor and cooperativity
parameters) were used in both cases.

93

Figure 4.7: (cont’d)

94

To test the eﬀect of experimental design on the sensitivity analysis, we redesigned eleven
gene constructs so that each quenching parameter is represented in exactly three constructs
(Table 4.1). We then created a synthetic data set of Giant protein and corresponding lacZ
mRNA concentrations, using the model formula and the parameter values estimated in the
Fakhouri et al. study. The eFAST and HDMR analyses gave comparable results, and showed
a decrease in the variation of the parameter sensitivities when compared to Figure 4.4, although not a perfectly even distribution (HDMR results in Figure 4.5A, eFAST results not
shown). We hypothesized that this was either due to the parameter values used to create
our synthetic data set, or to uneven parameter interactions in the redesigned constructs. To
test our ﬁrst hypothesis, we created a new synthetic data set for these eleven gene constructs
using the mean values in each parameter range (0.5 for quenching parameters, 50 for scaling
factor and cooperativity parameters). The results show only a slight decrease in the variation
of sensitivities (Figure 4.5B). This decrease can be seen by comparing parameters Q2 and
Q5 with Q6 (Figure 4.5A vs. 4.5B; Q2, Q5, and Q6 have been changed from 0.31, 0.20, and
0.62 respectively to 0.50). To further investigate this point, we created a new synthetic data
set using the mean value in each parameter range for the twelve original gene constructs used
by Fakhouri et al. We observed a slight decrease in variation of sensitivities, and a stronger
correlation between the number of constructs in which a parameter is represented and the
sensitivity of the parameters (Figure 4.6 vs. Figure 4.3B). In neither of the above cases
were the constructs perfectly combinatorial. Although in the new design each quenching
parameter is represented in the same number of constructs, they are not represented in the
same context. To test the “best” combinatorial enhancer design, we considered only four
quenching parameters (three representing quenching at various distances 5’ of the activator

95

binding sites and one representing quenching from the 3’ direction) and designed a perfect
combinatorial set of ﬁfteen enhancers, including all possible combinations of these four parameters (Table 4.1). From this set of ﬁfteen enhancers, we then chose two separate subsets
to test: the ﬁrst subset (Constructs 1,2,3,4,11,12,13,14,15) was perfectly even in the sense
that each quenching parameter was represented in exactly the same number of constructs
and with the same number of other quenching parameters, and the second subset (Constructs
1,2,5,6,7,8,11,12,13) was uneven, with Q1 represented seven times, Q2 ﬁve times, Q3 four
times, and Q4 three times. We then created synthetic data sets using the mean parameter
values. We observed that with a perfectly even experimental design and mean parameter
values, minimal variation is seen in the sensitivities of quenching parameters (Figure 4.7A).
This is not the case with an uneven experimental design even with mean parameter values
(Figure 4.7B). Thus, our results have shown that parameter sensitivities have a clear dependence on both the experimental design and true biological parameter values (Figures 4.5 4.7).
Optimizing the design of constructs to equalize the frequency of parameters does decrease
the range of parameter sensitivities, but not completely (Figure 4.5). Further equalization
is possible by ensuring that quenching parameters are represented uniformly in a combinatorial fashion (Figure 4.7). This analysis on synthetic data underscores the need for global
sensitivity analyses on modeling studies with the objective of reducing the error between
predictions and experimental data. The analysis should be done both before and after experimental data collection, ﬁrst to test for uniform sensitivity on synthetic data constructed
with mean values; then a second time, to provide a level of conﬁdence in extracted parameter
values and their biological implications.

96

4.3.4

Global Sensitivity Analysis of the Zinzen et al. Model

97

Figure 4.8: Sensitivity analysis using the HDMR global sensitivity method on the thermodynamic-based model of Zinzen et
al. First-, second- and the sum of ﬁrst- and second-order sensitivity indices (scaled from 0.0 to 1.0), are shown. Despite the
focus in the Zinzen et al. study on Dorsal-Twist cooperativity, these parameters are relatively insensitive compared to protein
scaling factors. Results from representative enhancer structures: A) set 1 and B) set 2. The model parameters are shown on
the horizontal axis. D, T, and S correspond to scaling factors for Dorsal, Twist, and Snail binding sites, respectively. DTr,
TTr, and SSr correspond to cooperativity parameters representing Dorsal-Twist, Twist-Twist, and Snail-Snail cooperativities
respectively, for the rho-like enhancer. Similarly, DTv, TTv, and SSv correspond to cooperativity parameters representing
Dorsal-Twist, Twist-Twist, and Snail-Snail cooperativities respectively, for the vnd -like enhancer.

98

Figure 4.8: (cont’d)

99

The second study that we considered diﬀers considerably from the previous work, in which
a limited set of deﬁned gene constructs was quantitatively tested. In Zinzen et al., many
conceptual constructs thought to resemble one of two endogenous genes were the subject
of analysis. The structures of the constructs simulate rho or vnd, two genes with slightly
diﬀering expression patterns in the Drosophila embryo. The derived parameters were used to
infer possible regulatory relationships explaining the design of enhancers controlling rho and
vnd. We applied the HDMR approach to their model to obtain information on sensitivities
revealed by this method. Zinzen et al. used the Pearson correlation coeﬃcient to determine
the quality of ﬁt, therefore, for the sensitivity analysis of this model, we used this correlation
coeﬃcient as the objective function. In their study, twelve parameter combinations and eight
unique enhancer structures were identiﬁed as most informative, after ﬁtting all possible forms
of their model to expression data [105]. We used these eight enhancer structures for our global
sensitivity analysis. Each enhancer set is associated with nine parameters: three scaling
factors reﬂecting TF activity, and three diﬀerent cooperativity parameters for each of the
rho-like and vnd-like enhancer structures. The ﬁrst and second-order parameter sensitivity
results for two representative pairs of enhancer structures are shown in Figure 4.8.
Strikingly, TF scaling factors are far more sensitive than the cooperativity parameters
for almost every enhancer structure tested (Figure 4.8). The overall low sensitivities in
cooperativity parameters do not necessarily contradict the conclusions made by the authors
that diﬀerent cooperativity parameters distinguish the expression patterns of rho and vnd,
but they do draw into question how robust these conclusions can be [105]. Low sensitivity
values do not mean that the parameters are of low importance, but rather indicate that
changes in these parameter values do not have large eﬀects on the output. Therefore, other

100

sets of parameter values, with diﬀerent relative importance of cooperativity parameters,
may also allow the model to distinguish the expression patterns of rho and vnd. The study
claimed that rho-like elements require 5-10 fold-higher Dorsal-Twist cooperativity, higher
Twist-Twist cooperativity and lower Sna-Sna cooperativity than do vnd -like elements [105].
We explored the correlation in other regions of parameter space to determine how strict
these relationships are and how much changes in these relationships would aﬀect the models
correlation with experimental data.
Table 4.3: Parameter relationships resulting in high correlations.
Set(s)

All
DT
TT
SS
DT & TT DT & SS
correct wrong wrong wrong
wrong
wrong

1

57
(39%)
49
(28%)
59
(38%)
48
(39%)
37
(35%)
13
(11%)
36
(24%)
36
(28%)

2
3, 9
4, 7
5, 6, 8
10
11
12

6
(4%)
32
(18%)
10
(6%)
1
(1%)
2
(2%)
4
(3%)
1
(1%)
3
(2%)

43
(29%)
45
(26%)
47
(31%)
44
(36%)
34
(32%)
17
(15%)
30
(20%)
16
(13%)

20
(14%)
16
(9%)
13
(8%)
17
(14%)
17
(16%)
29
(25%)
37
(25%)
38
(30%)

0
(0%)
8
(5%)
1
(1%)
0
(0%)
0
(0%)
8
(7%)
0
(0%)
0
(0%)

2
(1%)
9
(5%)
5
(3%)
0
(0%)
0
(0%)
12
(10%)
8
(5%)
3
(2%)

TT & SS
All
Total
wrong
wrong
18
(12%)
13
(7%)
19
(12%)
13
(11%)
15
(14%)
21
(18%)
36
(24%)
32
(25%)

0
(0%)
2
(1%)
0
(0%)
0
(0%)
0
(0%)
13
(11%)
0
(0%)
0
(0%)

146
174
154
123
105
117
148
128

By investigating the parameter values listed in their study, a reasonable range for the parameter values was deduced [105]. The scaling factors were taken in the range [108,1010] and
cooperativity values were taken in the range [1,100]. For each enhancer structure, we chose
100,000 sets of parameter values at random and calculated the Pearson correlation coeﬃ-

101

cients. We sorted the Pearson correlation coeﬃcient values and selected for further analysis
all parameter sets that resulted in values greater than 0.981, a threshold chosen based on the
lowest value that appeared in Table 2 of Zinzen et al. (log[correlation coeﬃcient] = 1.72)
[105]. We obtained between 105 to 174 acceptable parameter sets from each of the original
100,000 parameter sets for each enhancer structure. The results are shown in Table 4.3,
which illustrates how often the relationships between cooperativity parameters in elements
chosen to resemble rho and vnd were exactly as predicted by Zinzen et al., and how often
some or all of these relationships could be changed, yet still recover a satisfactory correlation
(> 0.981). We carried out this test a second time, again testing 100,000 diﬀerent random
parameter sets for each construct and obtained essentially identical results (data not shown).
The relationships that were said to be of greatest importance in Zinzen et al. appear in a
plurality of parameter sets in almost every scheme, but other relationships between the cooperativity parameters also result in a very high correlation. The relationships predicted in
Zinzen et al. hold for less than 40% (between 13 and 59 of between 105 and 174) of the
parameter sets that show satisfactory correlation (> 0.981). As modeled in that study, the
relationships may not represent biological features that reﬂect a compelling best ﬁt with the
experimental data; rather, the low sensitivities of these particular parameters may allow one
to accept a local best ﬁt. The biological conclusions, that rho and vnd expression patterns
diﬀer based on Twist-Dorsal cooperativity, should therefore be interpreted with caution.
Table 4.3 also indicates how sensitivity analysis can guide model selection. The insensitivity of cooperativity parameters led us to test a large set of diﬀerent parameter values,
which revealed that multiple parameter sets with diﬀerent biological interpretations produced satisfactory correlations between the experimental data and the model output. This

102

modeling weakness may reﬂect the reliance on only one experimental data set to ﬁt parameter values. In addition, information about positions of binding sites in enhancers was
not considered, and somewhat arbitrary assumptions about which proteins are required for
activity were made. Any of these factors may have led to insensitivity in the cooperativity parameters, and should be addressed before trying to improve upon the model or its
biological interpretations.

4.4

Discussion

Thermodynamic-based models represent the most powerful approach to quantitative understanding of gene transcription because of their use of DNA sequence information that can
capture subtle diﬀerences in enhancer architecture. As with other models, this approach involves many unknown parameters and interactions, however, a rigorous sensitivity analysis
has not been applied to recent studies in this area. We have shown here, through local and
global sensitivity analyses, that there is a wide range in sensitivities for parameters in these
models. In some cases, this is due to the biology of the system, or the experimental data used
to calculate the error, while in other cases it may simply be an artifact of the mathematical
formulation (Figure 4.1), and the underlying assumptions that have gone into this model.
Our analysis also uncovered previously undescribed interactions between parameters that
may reﬂect true biological interactions, and at the same time found very low sensitivities in
parameters that were previously thought to be most informative in explaining the overall
architecture of distinct enhancers. We used the insights drawn from sensitivity analysis
to redesign experiments, and discover new parameter values that are well correlated with
experimental data. Clearly, standard parameter estimation techniques used alone, in the
103

absence of sensitivity analysis, cannot fully show the importance of possible biologicallydriven or model-driven eﬀects.
To improve upon thermodynamic-based modeling attempts to unlock key aspects of cisregulatory grammar, we should recognize a need for sensitivity analysis and a thorough
exploration of parameter space. With this analysis, new iterative parameter estimation
and model selection techniques can be designed and implemented [46, 54]. Analysis of the
eﬀects parameters have on the model output, within their realistic ranges, can also help
biologists to design experiments. As we demonstrated with our synthetic data sets, a tight
coordination between experimental design and model formulation is required to maximize
the eﬀectiveness of these modeling studies. A conscious eﬀort can be made to collect data
evenly and models can be redesigned to make sure that model sensitivity is similar for various
parameters. Thermodynamic-based models, as other models, beneﬁt by such sensitivity
analysis, providing important context in which to place biological interpretations suggested
by the model.

104

Chapter 5
Parameter Estimation
This Chapter was taken from the manuscript:
“Global Parameter Estimation for Thermodynamic-based Models of Transcriptional Regulation”
Yerzhan Suleimenov, Ahmet Ay, Md. Abul Samee, Saurabh Sinha, Jacqueline M Dresch∗ ,
and David N Arnosti∗
(∗ = corresponding authors)
which is in the process of being submitted to Methods - Elsevier.

5.1

Introduction

The goal of most studies in computational biology is to better understand overall system
behavior and the contributions of speciﬁc system components to this behavior. To study
behavior of biological systems, a variety of mathematical modeling approaches are employed
[79, 100, 3, 1, 15, 105, 45, 85, 30, 26, 38, 33, 10, 62, 81, 97, 43, 32, 20, 69, 13]. The evaluation of
such models rests on two central pillars, namely model validation and parameter estimation.
105

Whether the mathematical framework is straightforward, leading to a model with an explicit
formula, or more complex, leading to a model with only an implicit formula or a high degree of
stochasticity, there are always some aspects that remain unknown with parameters that need
to be estimated. If the system were completely understood, with no unknown parameters,
the model would not hold much value for most biological applications. In contrast, a model
containing unknown parameter values that can be estimated using experimental data may
hold answers to important biological questions. A challenge that faces these eﬀorts is that
no single parameter estimation strategy works well for all problems, due to the large variety
of models and model constraints. In many biological modeling studies, in particular those
that use thermodynamic-based models to examine gene regulation, parameter estimation
techniques have not been thoroughly investigated [105, 85, 45, 38].
Thermodynamic-based models adapt tools from statistical physics that describe proteinDNA interactions, and they predict gene output based on the assumption that gene activity
is proportional to the level of activators bound and inversely proportional to the level of
repressors bound.of transcriptional regulation [38]. Essential information for this type of
model includes the DNA sequences of cis-regulatory regions and the binding speciﬁcities of
TFs (TFs) that interact with them to eﬀect transcriptional activation or repression.
Many underlying assumptions are incorporated into such models of transcription, and also
many unknowns, such as the maximal binding aﬃnity of proteins, protein-protein cooperativity, inhibition of activators by repressors, and the eﬀects of TFs on the basal machinery. Thus,
there is clearly a need for parameter estimation techniques when studying thermodynamicbased models of transcriptional regulation. Previous studies have implemented a variety of
techniques, ranging from very simple local parameter estimation techniques to some of the

106

most complicated global algorithms for parameter estimation [105, 85, 45, 38, 26]. These
studies have not generally provided a rationale for their choice of estimation strategy, although these methods can diﬀer enormously in computational demands, and, potentially
eﬀectiveness in ﬁnding global minima.
The general idea behind all parameter estimation techniques is quite simple: given a
model with some unknown variables and spatial and/or temporal experimental data, parameter estimation algorithms work toward ﬁtting numerical values to these variables so that the
diﬀerence between the model output and the experimental data is as small as possible. Once
parameter values are ﬁt, these values can be used in interpreting aspects of the biological
system and in predicting the behavior of similar systems [98, 57, 63].
Parameter estimation has a long history in mathematics and computer science. Since
mathematicians ﬁrst began solving inverse problems by searching for extrema of functions,
they have developed diﬀerent techniques. Although the extrema of a function can refer to
either a minima or a maxima, here we concentrate on ﬁnding minima, i.e., minimizing the
error between model predictions and experimental data. Parameter estimation techniques
typically fall into one of two classes, local or global algorithms. Local algorithms have the
advantage of computational simplicity, while global approaches are more likely to recover
optimal values, but the extra eﬀort may or may not be worth while. Here, we compare
the performance of representative local and global approaches using a recently described
thermodynamic-based model [38].
Local estimation techniques generally require either the calculation or approximation
of the objective function’s derivative or a comparison of the objective function at multiple
diﬀerent parameter values [63]. Some common local parameter estimation techniques include

107

the Conjugate Gradient method, Newton’s method, and the Nelder-Mead algorithm. In two
earlier thermodynamic-based modeling studies local parameter estimation techniques were
used [85, 38]. Segal and colleagues employed the Conjugate Gradient Ascent and the NelderMead simplex methods in an alternating fashion. In a later study, Sinha and colleagues
applied the quasi-Newton and the Nelder-Mead simplex methods in an alternating fashion
on a slightly diﬀerent model using the same data set. The major drawback of using such
local parameter estimation techniques is that they often lead to the discovery of local, not
global, minima of the objective function, producing misleading results. This problem can be
minimized if the modeler has prior knowledge of where the global optimal parameter values
lie, allowing for an initial guess close to the global minima and thus quick convergence by a
local algorithm. In many real-world settings, including the modeling of transcription, almost
no prior knowledge of parameter values is available however. Thus, the only way to overcome
the problem of getting stuck at local minima is to run the algorithm multiple times, with
diﬀerent initial parameter values, a so called multi-start strategy. Multiple restarts with
diﬀerent initial parameter values were implemented in both of the studies mentioned above
[85, 38]. However, this method is not very eﬃcient and it has been shown that for highly
nonlinear inverse problems local parameter estimation strategies cannot ﬁnd the correct
parameters even when run with an extremely large number (> 300) of starting points [60].
Global parameter estimation techniques oﬀer another path to ﬁnding global minima of a
model when no a priori information on parameters is available [98, 57]. However, with these
techniques, obtaining the global minima for a nonlinear model is very diﬃcult and computationally expensive [60, 59]. Both deterministic and stochastic global techniques have been
developed. Deterministic methods such as the branch-and-bound and interval optimization

108

methods are more reliable, they are computationally very expensive and not practical for
many nonlinear problems. In contrast, stochastic methods such as genetic algorithms, simulated annealing, and evolutionary strategies can more quickly ﬁnd the vicinity of global
minima. These methods move through parameter space with some stochasticity to avoid
getting “stuck” at local minima. Unfortunately, in general, for these methods global optimality cannot be guaranteed for all problems due to their probabilistic nature, therefore
multiple runs are advised. However, global convergence has been proven for evolutionary
algorithms applied to problems with a unique global minimum and a nonzero probability of
reaching the neighborhood of the global minimum from any initial starting population in a
single evolutionary time step [77].
In the realm of stochastic techniques, evolutionary strategies (ES) have shown excellent
performance in many studies of continuous models with large sets of estimated parameters
[60, 47, 39, 61, 26]. These strategies are inspired by biological evolution, and include features
such as cross-over, mutation and selection. Each speciﬁc technique varies in the number of
oﬀspring, the number of parents, whether mutation rates, recombination, or cross-over are
considered, and the selection strategy applied. In each ES, the the “ﬁttest individuals” (the
parameter sets with lowest value for the objective function), have a higher probability of
surviving to the next “generation.” Along with selection criteria, the use of mutation rates,
recombination, and cross-over allow populations of parameter sets to leave a local minima
and eventually reach a global minima. In this study, we used a version of an evolutionary
strategy, the so called Covariance Matrix Adaptation-Evolutionary Strategy (CMA-ES). Our
choice was motivated by the documented success of CMA-ES algorithms over other global
parameter estimation techniques on benchmark problems [49].

109

Nonlinear models, such as thermodynamic-based models of gene regulation, are known
to be ill-posed, that is, either they do not have a solution, the solution is not unique or
the solution does not depend continuously on the data. In these problems the parameter
space is usually unknown and the complexity of the parameter estimation problem grows
exponentially as the number of parameters increase. For this reason, it may be worth
employing computationally expensive global parameter estimation techniques. However,
in several recent thermodynamic-based modeling studies, local strategies rather than global
counterparts were employed to achieve computational eﬃciency [85, 38]. It is not clear what
the potential cost of eﬀectiveness was in these cases: the ﬁts obtained were judged adequate,
but perhaps the use of a local strategy had a signiﬁcant eﬀect on ﬁt with possible misleading
biological interpretations. To better discuss the possible costs and trade-oﬀs in using these
diverse parameter estimation approaches, it is necessary to compare performance on the
same dataset, using the same models. Here, we test the performance of the CMA-ES global
and QN/NMS local methods, with respect to ﬁtting parameter values in thermodynamicbased models of gene regulation. We compared these algorithms using both synthetic and
experimental gene expression data [84].
Our analysis is based on GEMSTAT (Gene Expression Modeling based on Statistical
Thermodynamics), a thermodynamic-based model that incorporates terms for TF-DNA interactions as well as TF-transcriptional apparatus interactions [38, 87, 17]. The model
can also incorporate mechanistic features such as short range repression, cooperativity in
TF-DNA binding and the synergistic eﬀect of multiple activators. The authors have systematically evaluated the signiﬁcance of each of these features in the context of available
interpolated expression data.

110

5.1.0.1

GEMSTAT Speciﬁcs and Applications

Six diﬀerent models are available for implementation in GEMSTAT. In ‘DirectInt’, each TF
interacts directly with the basal transcriptional machinery (BTM) and TFs are assumed to
only interact with TFs bound to adjacent sites located within a certain distance. ‘DirectIntcoop’ extends DirectInt by also allowing for cooperative binding of any given pair of bound
TFs regardless of distance. The ‘ChrMod-Unlimited’ and ‘ChrMod-Limited’ models assume
that repressors do not interact directly with the BTM, but inhibit the binding of activators.
This eﬀects the enumeration of all possible states since it creates new possible conﬁgurations,
where DNA in the ‘neighborhood’ of a bound repressor is inaccessible to binding by any
other TF. This new conﬁguration competes with the conﬁgurations where the chromatin is
accessible to activators, eﬀectively reducing the occupancy of activators. Both models allow
cooperativity between TFs that are bound to adjacent sites, in the same way as the DirectInt
model. These two models can also be used to test the eﬀect of synergism between bound
TFs, i.e., whether increasing the number of TFs interacting simultaneously with the BTM
can better explain the observations. In ‘ChrMod-Unlimited’, there is no limit on the number
of activators that can simultaneously interact with the BTM, as in the ‘DirectInt’ models,
while ‘ChrMod-Limited’ limits the number of simultaneous interactions with the BTM. The
‘Quenching’ model is based on short-range repression, not assuming repression works through
chromatin modiﬁcation, but rather through a quenching parameter. The enumeration of all
possible states is the same as in the ‘DirectInt’ models, but the weight of states is modiﬁed by
incorporating this quenching parameter. The ‘Logistic’ model evaluates a logistic function
at TF ‘occupancies’ to estimate the level of gene expression, where ‘occupancy’ of a given
TF is the summation of the probabilities of that TF binding to its diﬀerent putative binding

111

sites. As such, unlike the other models described above, the logistic model lacks the details
of TF-TF and TF-BTM interaction.
GEMSTAT is divided into two functional modules, one that executes the model speciﬁed
and another that trains the model. The model training program takes as input a set of DNA
sequences, the PWMs of the relevant TFs and the measured expression patterns. Given a
set of inputs, GEMSTAT assesses ﬁt by either maximizing the average correlation coeﬃcient
and/or minimizing the sum of squared errors (SSE) (the program can be run in multiple
conﬁgurations).
In [38], a local parameter optimization approach from the GNU Scientiﬁc Library was
used that combines the Nelder-Mead simplex method and the quasi-Newton method referred
to as the Broyden-Fletcher-Goldfarb-Shanno algorithm [90]. This study provided insight into
the mechanistic details of gene regulation in segmentation-related transcriptional enhancers
in Drosophila, namely, short-range repression, cooperativity in TF-DNA binding, and the
synergistic eﬀect of multiple activators [38]. Using tests for statistical signiﬁcance, the authors showed that each of these three mechanisms appear to contribute to the output of
enhancers. Cooperative DNA-binding of activators as well as repressors appeared to be
critical for some, but not all TFs. Their models suggested that the synergistic eﬀect of
multiple activators with the BTM contributes signiﬁcantly more than the TF-DNA binding
cooperativity between activators to the accuracy of predicted expression patterns. Simple
competition between repressors and activators for binding sites is an insuﬃcient mechanism of repression, supporting the short-range repression mechanism for two of the TFs, in
agreement with experimental evidence.

112

5.2
5.2.1

Methods and Calculations
Construction of Synthetic Data Sets

To compare the two parameter estimation strategies, CMA-ES and QN/NMS, we created
synthetic gene expression data for each of the six model versions in GEMSTAT. The synthetic
data sets were created so that the true parameter values are known a priori and can be
compared to those found using the parameter estimation techniques. Each set consists of
36 expression patterns, corresponding to 36 experimentally characterized CRMs involved in
Drosophila embryonic development (anteriorposterior axis speciﬁcation) containing binding
sites for 6 TFs. Since parameter values were chosen, not ﬁt, the resulting synthetic patterns
do not agree with the actual expression patterns driven by these CRMs that have been
observed in embryonic development. We chose this particular set of CRMs and TFs due to
the fact that they were used in [38]. Recent transcriptional modeling studies have used data
sets of varying quality, therefore we compared the parameter estimation strategies not only
on high quality continuous data sets, but also on low quality noisy or discrete data sets.
In order to obtain a thorough analysis, we have created continuous, perturbed, binary, and
interpolated synthetic data sets, as described below.

5.2.1.1

Construction of Continuous Synthetic Data Sets

Creating continuous synthetic data sets is a two-step process. In the ﬁrst step, values for
each parameter are randomly selected from ranges speciﬁed in the model. In the second step,
these parameter values are used to create the expression (on a scale from 0 to 1) for each input
sequence. The expressions obtained comprise the continuous synthetic data, which represent
exact results, devoid of measurement noise. For GEMSTAT models tested we ﬁnd that a
113

large number of genes had ‘saturated’ expressions for most of the data points, and when
expressions were not saturated, the patterns were quite similar for a large portion of genes.
Both of these scenarios do not reﬂect the patterns of developmental genes that this study is
analyzing; these patterns typically are highly heterogeneous. Rather than selecting certain
parameter sets arbitrarily, we designed a scoring scheme, the ‘derivative score’, which assigns
a numerical score to each set of random parameters (Table 5.1). This scheme selects outputs
with diverse and non-monotonic patterns. In our experiments, 25000 random parameter sets
were generated and the best 30 sets, based on the derivative scores, were used to generate
our continuous synthetic data.
Table 5.1: A pseudo-algorithm for the derivative score.

Let P be a set of parameters, B be the set of bins, and S be the set of sequences.
For all i sequences and j bins:
Let di,j be the absolute value of the derivative of the expression obtained using P .
1
The derivative score is deﬁned as:
j∈B maxi∈S di,j
B

5.2.1.2

Construction of Perturbed, Binary, and Interpolated Synthetic Data
Sets

To compare the performance of the two parameter estimation strategies on lower quality
data, we created perturbed, binary, and interpolated data sets. Perturbed data was created
by adding to each of the continuous synthetic data points, x, a perturbation drawn from
a uniform distribution on the interval [−rx, +rx], where r represents the perturbation percentage. We used r = 0.025, 0.05 and 0.10 in this study. Results shown in Figures 5.2 -5.5
were obtained with r = 0.05. The overall SSE values obtained from both local and global
114

parameter estimation techniques with r = 0.025 were similar to those obtained with r = 0.05
and those obtained with r = 0.10 were similar to those obtained using a data set of slightly
lower quality, such as interpolated data (described below). We did not allow our expression
values to go outside of the range [0, 1], so if adding perturbation caused a value to be larger
than 1, the value was set to 1. Binary data was created from the continuous synthetic data
using a simple thresholding technique, whereby data points greater than or equal to 0.5 were
set to 1 and data points below 0.5 were set to 0. The interpolated data, an intermediate
between continuous and binary, were created from the binary data by replacing data points
near a discontinuity (a jump from 0 to 1 or 1 to 0) with intermediate values such as 0.89
or 0.11, to smooth out these discontinuities. These data sets were created using the same
process as in [85, 38].

5.2.2

Description of QN/NMS and CMA-ES Methods

To estimate the parameter values of a model that best ﬁt experimental data, one uses an
objective function to measure the quality of ﬁt. Parameter estimation algorithms start
with an initial guess and iteratively generate new estimates seeking to minimize the error.
Diﬀerent parameter estimation techniques employ diﬀerent iteration processes, depending
on the objective function and parameter constraints. Often the sum of square errors (SSE)
between the models prediction and experimental data, as used in Chapter 4, is employed. It
was previously shown that the two measures provide similar model sensitivity to parameters
in thermodynamic-based models [24]. Here, we use GEMSTAT to perform the optimization

115

using SSE as the objective function. SSE is deﬁned as:
n

yp (j) − ye (j)

SSE =

2

j=1

where n is the number of data points and yp and ye are the model’s prediction and experimental data respectively.

5.2.2.1

QN/NMS Method

We employed the quasi-Newton and Nelder-Mead simplex methods in an alternating fashion
as a local search strategy, as described in [38]. The quasi-Newton method tries to ﬁnd the
parameter set that makes the gradient of the objective function zero. The method uses the
ﬁrst and second derivatives (gradient and Hessian) of the objective function to determine the
search direction. For this reason, the objective function, its gradient and Hessian need to be
evaluated. On the other hand, the Nelder-Mead simplex method uses simplexes (geometrical
shapes that have N + 1 vertices and all edges between them in an N dimensional space)
in the space of parameters. At each step the function is evaluated at the vertices of the
simplex. The search continues iteratively by shrinking the simplex size (if possible) toward
the lowest scoring vertex until one of the stopping criteria of the problem is satisﬁed. We
used a tolerance for the change in function value of 10−8 , resulting in approximately 400
iterations. The QN/NMS algorithm is available as part of the GEMSTAT code [38].

5.2.2.2

CMA-ES Method

We selected the global parameter estimation algorithm, CMA-ES (Table 5.2), in this study
because of its superiority over other algorithms on diﬀerent test problems [49, 36]. In this

116

Table 5.2: A pseudo-algorithm for CMA-ES.

Initialization
Choose initial population of parameter sets, distribution mean, and step-size.
Set the initial Covariance Matrix to the Identity and evolution paths to 0
Do
Sample oﬀspring parameter sets from a normal distribution
Selection and Recombination
calculates new weighted mean, recombines oﬀspring, and selects µ best individuals
Update step-size and Covariance Matrix
While termination criteria has not been met

method, at each iteration, new candidate parameter vectors are selected from a multivariate normal distribution centered around the weighted mean of the previously generated
sample vectors. The CMA-ES introduces an intermediate recombination with derandomized covariance matrix adaptation, which increases the probability of sampling a parameter
vector resulting in a lower objective function value. This approach also controls the stepsize to prevent premature convergence on local minima [49, 36]. Again the tolerance used
for the change in function value was 10−8 , resulting in the method running for between
100 and 10000 generations. The C-source code for the CMA-ES algorithm is available at
http://www.lri.fr/~hansen/cmaes_inmatlab.html#cpp. For the results shown in this
study, we combined this source code with GEMSTAT during compilation (available upon
request).

117

Figure 5.1: Illustration of the expression patterns for synthetic data set 20 obtained by reverse engineering of a gene regulatory
network by the Direct model. The synthetic target data used for the parameter estimation is created by simulating the model
with the random parameters obtained as explained in the Methods and Calculations. We created the expression patterns
using the parameters obtained by the CMA-ES and QN/NMS methods. The black tick marks are the synthetic target data
and the blue (QN/NMS) and red (CMA-ES) lines correspond to the simulated patterns. The y-axis gives the relative gene
expression level and the x-axis corresponds to the anterior-posterior (A-P) axis of the embryo. The ﬁnal SSE is 0.001 and 13.0
for CMA-ES and QN/NMS respectively.
118

Figure 5.1: (cont’d)

119

Figure 5.1: (cont’d)

120

5.3

Results

To compare the eﬀectiveness of two parameter estimation strategies, we used both synthetic
and experimental spatial gene expression data for 36 transcriptional enhancers active in the
Drosophila embryo for thermodynamic-based modeling of gene expression. The synthetic
data does not reﬂect actual patterns of expression, but mimics the spatially varied expression of mRNA typically observed for developmental gene activity. The experimental data
represents actual mRNA measurements obtained from blastoderm embryos using in situ
hybridization [84]. We compared the CMA-ES (global) and the QN/NM simplex (local)
method on 30 synthetic data sets of varying quality: continuous (highest quality), perturbed
(noise introduced at increasing levels), binary, and interpolated for each version of the model.
Each parameter estimation strategy was run ﬁve times; the run with the lowest SSE was
taken as the best ﬁt. We tested six thermodynamic-based models in the GEMSTAT family;
the results shown in Figures 5.2-5.5 were obtained using the Direct model. The results from
other models were qualitatively very similar(data not shown). An example of the continuous
synthetic gene expression data, and the expression patterns predicted by the Direct model
using the CMA-ES and QN/NMS methods with this data set are shown in Figure 5.1.
The minimum SSE values for the Direct model obtained by CMA-ES and QN/NM on
the 30 continuous synthetic data sets are shown in Figure 5.2A. The CMA-ES method
gave smaller SSE values than the QN/NM simplex method on most continuous data sets,
indicating that global search was consistently outperforming the local method. Experimental
biological data sets are not typically continuous high quality data sets, but contain extrinsic
or intrinsic noise reﬂecting experimental and biological variation. We therefore compared
these two methods on perturbed, interpolated and binary synthetic gene expression data

121

Figure 5.2: Performance comparison of the CMA-ES and QN/NMS methods on the Direct
model. The minimum SSE values are plotted for data sets of varying quality: A)continuous,
B) perturbed, C) binary, and D) interpolated. The analysis has been done on 30 synthetic
data sets. For each data set, two SSE values are shown; CMA-ES (red circles) and QN/NMS
(blue triangles). It has been observed that the CMA-ES method performs better than
the QN/NMS method on continuous data. However, the choice of parameter estimation
technique has no observable eﬀect on lower quality data sets. The horizontal line in panel
A) signals a change in scale of the vertical axis.

122

(Figure 5.2B, C, and D). We observed that unlike the diﬀerences observed for continuous
data, there is no discernible diﬀerence between the two methods when we evaluated lower
quality data sets. This trend was also observed for the other GEMSTAT models (not shown).
Low SSE values alone do not demonstrate the success of a parameter estimation strategy
because the objective function used can be multimodal. For this reason, we deﬁned a new
comparison score, the parameter SSE, which quantiﬁes the distance between the predicted
parameters and true parameter values used to generate the synthetic data sets. As it did
for overall SSE values, the CMA-ES method gave lower parameter SSE values than did
QN/NMS in the case of continuous synthetic data (Figures 5.2A and 5.3A). We found that
both estimation strategies did worse on noisy data (Figure 5.3B) and failed to reach the true
parameter values in the case of low quality data sets (Figure 5.3C and D). Again, a similar
trend was observed for all other GEMSTAT models (not shown).
The overall parameter SSE provides a global view of how well known parameters can be
rediscovered, but previous studies have noted that the parameters involved in many biological
models have varying sensitivities [34, 6, 24]. For thermodynamic-based models, diﬀerent
sensitivities drastically eﬀect the ability of estimation strategies to accurately ﬁt parameter
values [24]. Very sensitive parameters are more likely than less sensitive parameters to
be ﬁt well regardless of the parameter estimation method implemented. For this reason,
parameter-wise comparison will also help to illustrate the robustness of each method. To
check the ﬁtness of each individual parameter, we therefore compared the CMA-ES and
QN/NMS methods parameter-wise. Figure 5.4 illustrates the estimated parameter values
for the Direct model on a representative continuous synthetic data set shown in Figure 5.1.
We conclude that the CMA-ES method is more robust than the QN/NMS method due to

123

Figure 5.3: Comparison of the parameter values obtained by the CMA-ES and QN/NMS
methods on the GEMSTAT Direct model. The minimum SSEs between the true parameter
values and those obtained by the CMA-ES and QN/NMS methods are plotted for data
sets of varying quality: A)continuous, B) perturbed, C) binary, and D) interpolated. The
analysis has been done on 30 synthetic data sets. For each data set, two SSE values are
shown; CMA-ES (red circles) and QN/NMS (blue triangles). It has been observed that the
CMA-ES method more accurately predicts the true parameter values on continuous data.
However, the choice of parameter estimation technique has no observable eﬀect on lower
quality data sets. The horizontal lines signal a change in scale of the vertical axis.

124

Figure 5.4: Scatter plot of parameters for synthetic data set 20 using CMA-ES and QN/NMS
methods on the Direct model. Each method was run ﬁve times. The black diamonds represents the target parameters, and the red circles and blue triangles represent parameters
obtained by CMA-ES and QN/NMS methods respectively. The CMA-ES method more accurately predicted the true parameter values on continuous data. The horizontal lines signal
diﬀerent scales of the vertical axis. For the CMA-ES and QN/NMS methods respectively,
the mean standard deviation in parameter values found is 64.11 and 80.29.

125

Figure 5.5: Comparison of the relative error (%) for the estimated parameters of synthetic
data set 20 using CMA-ES (red circles) and QN/NMS (blue triangles) methods. The CMAES method predicts the true parameter values with lower relative error on continuous data.
the lower variance of parameter values found using CMA-ES on continuous synthetic data.
Similar clustering was observed when the two methods were compared on other data types
(results not shown).
The magnitude of diﬀerent estimated parameters vary widely, therefore we calculated the
relative parameter errors to create a more comparable measure. Figure 5.5 shows these relative parameter errors for the Direct model on the representative continuous synthetic data
set shown in Figure 5.1. The relative parameter errors obtained using the CMA-ES method
are smaller than those obtained using the QN/NMS method for almost all parameters. Similar results were obtained for other synthetic data sets, and the other GEMSTAT models

126

Figure 5.6: Sensitivity analysis for the Direct model. The blue, green, and red bars represent the parameter ﬁrst-, sum of second-, and sum of ﬁrst- and second-order sensitivities
calculated using the HDMR algorithm respectively. The objective function used was the
SSE between model predictions and synthetic data set 20, whose predictions are shown in
Figure 5.1 and best ﬁt parameters shown in Figures 5.4 and 5.5. Observe that the model is
much more sensitive to perturbations in the fourth parameter, one of the parameters associated with Caudal, than any other parameter in the model. For other parameters, we see
large second-order sensitivities, corresponding to a high level of parameter compensation.

127

(data not shown). This reinforces the conclusion that the CMA-ES method is more accurate
and robust than the QN/NMS method on high quality data. The success of the two methods
at estimating ‘true’ parameters varied among parameters. To gain insight on those parameters that were ﬁt more accurately using the CMA-ES method, we conducted a sensitivity
analysis on the Direct Model using this same continuous synthetic data set and the HDMR
algorithm used in Dresch et al. 2010 (Figure 5.6) [24]. We observed that the model showed
highest sensitivity to perturbations in the fourth parameter, a parameter for weighting TF
activity with the BTM associated with the Caudal TF. The high sensitivity may be due to
the large number of Caudal binding sites on the enhancers used, making Caudal activation
potentially more signiﬁcant factor for expression. For other parameters, we see non-zero
model sensitivity values emerging primarily from second-order terms, implying that there is
a high level of parameter compensation. The large amount of parameter interdependence
makes accurate parameter estimation very diﬃcult. For the sixth parameter, a parameter
for weighting TF activity with the BTM associated with the Giant TF, the CMA-ES method
resulted in higher relative parameter errors, but here the model sensitivity is very low. This
parameter does not impact the model very much, hence the inability of the global method to
locate the proper parameter value may not be reﬂective of the estimation method as much as
it is reﬂecting an inherent bias caused by the structure of the model. The CMA-ES method
did result in lower relative parameter errors on parameters that were found to have higher
model sensitivities. There was no observable diﬀerence in relative parameter error levels on
the lower quality data sets (results not shown).
Many gene regulatory modeling studies have used relatively low quality data sets to infer
parameters that were then used to interpret biological processes. Our analysis suggests that

128

Figure 5.7: Scatter plot of the parameters obtained for experimental data using CMA-ES
and QN/NMS methods on the Direct model. The red circles and blue triangles represent
parameters resulting in the minimum SSE when CMA-ES and QN/NMS methods are applied
respectively. It has been observed that, although SSE values were comparable (CMA-ES SSE
= 0.2760 and QN/NMS SSE = 0.2767), for the most sensitive parameters the two methods
results in diﬀerent values. The horizontal line signals a change in scale of the vertical axis.

129

it may be diﬃcult to obtain correct parameter values using models such as GEMSTAT with
low quality data. We therefore analyzed the original data set used in [38] which was based
on qualitative spatial expression data of patterning genes in embryos. We compared the
parameter values and SSEs obtained by the CMA-ES and QN/NMS methods on this data
set for each of the six GEMSTAT models. The resulting parameter values were similar for
many of the parameters and the two methods had comparable overall SSE values, but the two
approaches were considerably diﬀerent for the sensitive Caudal parameter (Figures 5.6 and
5.7). This model is highly sensitive to perturbation in this particular parameter, therefore
a choice between two apparently equivalent parameter estimation strategies may result in
extremely diﬀerent biological interpretations about key features of this system.

5.4

Discussion

The demands on modeling will grow as we seek to analyze the huge genome-scale data sets
created with current experimental techniques. Modeling studies will continue to estimate
parameter values, making the choice of strategy a crucial step in modeling. Here we have
shown that when applied to the GEMSTAT model a slow stochastic parameter estimation
strategy, CMA-ES, was necessary to ﬁnd the correct parameter values when using high
quality data sets. On low quality data sets, it was alarming to observe the failure of both
CMA-ES and QN/NMS algorithms. These results may not be generalizable to all mathematical models. Since many models used in biological applications are highly nonlinear with
multimodal objective functions, once a reasonable model is created for a biological system
with experimental data, one should create an objective function for the problem and test
several diﬀerent ﬂavors of parameter estimation techniques.
130

The CMA-ES method used in this study provides an alternative to the QN/NMS method
used in the original study [38]. In the above analysis we have shown that the CMA-ES
method is more reliable than the QN/NMS method on continuous synthetic data. The SSE
values were clearly lower for continuous data, and parameter values have been captured with
high accuracy (Figures 5.2 and 5.3). We observed relative error levels below 1% for most
parameters estimated using continuous data (Figure 5.5). In spite of its accuracy on high
quality data, the computational eﬀort required is on the order of hours (all computations were
performed using a node on the MSU HPCC platform running Linux). However, we should
mention that the convergence criteria used in this study was very tight and this method
could be written in a parallel version. For the perturbed, interpolated and binary data sets
the selection of the parameter estimation strategy was not critical, which suggests that the
use of low quality data sets for modeling studies is prone to error. When we compared these
two methods on the low quality experimental data set used in [38], we observed that the
qualities of ﬁt were similar (Figure 5.7). However, the parameter values found were diﬀerent,
especially the values corresponding to the parameter with the highest model sensitivity. This
suggests that the interpretation of parameter values obtained in modeling studies using low
quality data must be viewed with caution. Experimental validation of ﬁndings in modeling
studies is absolutely essential.
For a modeling study to capture the correct biological behavior of a system, the parameter
estimation strategy selected should be accurate, robust and computationally eﬃcient. To
determine the best strategy one should create synthetic data of a quality similar to that in
which they plan to use and compare the performance of several methods on their model.
It has been noted in previous studies that there is no best parameter estimation technique

131

that will work for all model types [60, 59]. It is the responsibility of the modeler to choose
one algorithm that will help them to get the proper insights from the experimental data.
Unfortunately, the ﬁeld of parameter estimation is not a very well-established area for gene
regulation models. Many thermodynamic-based modeling studies have been carried out with
data and methods that have not been well suited for such studies. This might have resulted in
very poor estimates of the parameter values, which in turn have lead to incorrect biological
interpretations. We hope that this study will help those working in biological modeling,
in particular thermodynamic-based modeling, to see the importance of high quality data
production and parameter estimation strategy selection.

132

Chapter 6
Thermodynamic-based Modeling:
Analysis of Biological Data Sets

6.1

Understanding the Structural Rules of Enhancer
Function

To investigate what features dictate where, how and in what order TF binding sites should
be positioned for optimum enhancer function, we chose to study the enhancer responsible for
the expression of the rhomboid (rho) gene in early Drosophila development. Activation of rho
in the presumptive neurogenic ectoderm by the cooperative action of two TFs, Dorsal and
Twist [42]. It is repressed in the mesoderm by a zinc ﬁnger protein, Snail. Previous studies
have identiﬁed footprinted binding sites for the above three TFs on an approximate 300-bp
region 1.7 kb upstream of the transcriptional start site for rho. In addition to rho, Dorsal,
Twist and Snail coordinate the expression of about 100 genes involved in dorsal-ventral
patterning in early Drosophila development [74, 101]. So, insights obtained from analysis of
133

this well-characterized enhancer are likely to shed light on mechanisms of enhancer function
for a variety of genes involved in this gene regulatory network (GRN). These characteristics
make this enhancer a suitable platform for quantitative studies.

6.1.1

Reporter Gene Constructs and Confocal Image Dataset

The 318-bp rho neuroectodermal enhancer [42, 83] was cloned and the footprinted binding
sites for Dorsal, Twist and Snail, as well as two sites known to be bound by bHLH factors
[42], were mutated one at a time as well as in combination with one another, resulting in a
set of 38 reporter gene constructs (Figure 6.1).
A total of 935 embryo images were taken, with a minimum of 10 images per construct.
Late stage ﬁve embryos were used for analysis, using eve expression to determine embryo
age [92]. Embryos were then selected based on their rotation, with the appropriate number
of pixels in the dorsal region of the embryo for background estimation and the appropriate
number of pixels in the ventral region of the embryo to obtain information on the repression
domain. The image processing techniques used are those described in the Appendix, section
A.3.

6.1.2

Structure of Models Tested

A general thermodynamic-based model, as described in Chapter 2, was constructed in C++.
At runtime, the user has the option to choose distance(in base pairs between binding sites)dependent deﬁnitions of TF cooperativity and repressor quenching ability. Cooperativity
and quenching terms were incorporated into the model as described in section 2.3.1 and the
treatment of overlapping binding sites was described in section 2.3.2.

134

Figure 6.1: Structures of reporter genes assayed. Green, blue, and red circles denote annotated binding sites for Dorsal, Twist, and Snail respectively. X’s depict binding sites that
were mutated.

135

Figure 6.1: (cont’d)

136

Cooperativity among TFs is known to inﬂuence gene expression, and previous studies
have uncovered distance-dependent cooperativity among TFs [35, 2]. Quenching is the eﬀect
that repressor proteins exert on DNA or chromatin to negate the eﬀects of activators [4, 31].
It can be mediated by post-translational modiﬁcations of histones to maintain ‘repressive
heterochromatin’ [52, 70]. Previous studies have implicated repressors acting by diﬀerent
mechanisms to block activators [4, 31, 52, 70, 26, 38]. Several diﬀerent schemes involving
TF cooperativity and repressor quenching were implemented in our modeling eﬀort, where
we tested diﬀerent hypotheses regarding the biochemical mechanisms of enhancer function.

137

Figure 6.2: Continuous Functions used for Quenching and Cooperativity. These functions are the functions used in diﬀerent
schemes of the model, to describe the eﬀects of a repressor’s quenching ability on a bound activator or the cooperativity
2
between two bound TFs as functions of the distance between the binding sites. A) f (d) = a + bd, B) f (d) = ae−d /b , and
1
C)f (d) = a
. When implemented, a = 1 and b is a model parameter for quenching functions, and a and b are both
d/b
e
model parameters for cooperativity functions.

138

Figure 6.3: Quenching Function Derived from Fakhouri et al. Analysis. This function was
ﬁt using the six quenching values, shown in red, found in [26] to a ﬁfth degree polynomial,
f (d) = −0.000000004249d5 + 0.0000010199d4 − 0.00010199d3 + 0.0050099d2 − 0.10731d + 1.0
for d ≤ 100 and assumed to be 0 for d > 100, shown in blue. When implemented, this
function is used directly, with no parameter values corresponding to quenching.
For the quenching term in our model, we used three continuous functions (Linear, Logistic and Exponential decay) to describe the eﬀect of repressor quenching in relation to
increasing distance to bound activator binding sites (Figure 6.2). Since a linear function was
used previously in a thermodynamic-based modeling study, we began with this formulation
[75]. We included schemes with functions displaying logistic and exponential decay to represent diﬀerent biological assumptions as well. Functions illustrating high quenching activity
over some distance may correspond to a situation in which cofactors allow ﬂexibility in the
spacing of binding sites, and quenching activity that is rapidly decreased as the distance between binding sites is increased may correspond to direct binding between the two proteins
causing little ﬂexibility in the spacing of binding sites. Note that the functions in Figure 6.2
were allowed to stretch horizontally, representing quenching over larger distances, but the
quenching ability at a distance of 0 bp is assumed to be at a maximum of 1.0, representing
a state that is not contributing to expression due to the maximal quenching ability of the
bound repressor. We also used the non-monotonic quenching function derived from the pre139

Figure 6.4: Binned Functions used for Quenching and Cooperativity. These step-functions
are representatives of those used in various model schemes to describe the eﬀects of a repressor’s quenching ability on a bound activator and cooperativity between two bound TFs as
step-functions of the distance between the binding sites. A) Quenching function with four
bins, B) Cooperativity function with three bins. When implemented, no quenching is allowed
beyond the distance of the last bin, thus the function value is assumed to be 0 beyond 100
bp in A).
vious analysis of short-range repression in synthetic enhancer constructs (Figure 6.3) [26].
In addition, we ﬁt diﬀerent quenching parameters (Q) to construct step-functions, as was
done in [26], grouping quenching parameters in diﬀerent ‘bins’ of distances between repressor
sites and activator sites. Each bin corresponds to a single quenching parameter, which is
used as the value of the step-function for the distances assigned to that bin (Figure 6.4A).
These schemes allow for a non-monotonic quenching function to arise, illustrating a preferred
distance between binding sites. The binned quenching schemes are described as follows, with
no quenching allowed beyond the distance of the last bin. The distances between binding
sites were calculated from the center of each binding site.
Scheme 1: Q1: 0-25 bp, Q2: 26-50 bp, Q3: 51-75 bp, Q4: 76-100 bp
Scheme 2: Q1: 0-35 bp, Q2: 36-70 bp, Q3: 71-105, Q4: 106-140 bp
Scheme 3: Q1: 0-45 bp, Q2: 46-90 bp, Q3: 91-135, Q4: 136-180 bp
Scheme 4: Q1: 0-10 bp, Q2: 11-20 bp, Q3: 21-30, Q4: 31-40 bp, Q5: 41-50 bp, Q6: 51-60
140

bp, Q7: 61-70 bp, Q8: 71-80 bp, Q9: 81-90 bp, Q10: 91-100 bp
In each case, two quenching functions were ﬁt, one representing the quenching ability of
Snail on Dorsal sites and one representing the quenching ability of Snail on Twist sites.
As in the case of quenching, three diﬀerent continuous functions (Linear, Logistic and
Exponential decay) were used to represent cooperativity (Figure 6.2). Note that the functions
in Figure 6.2 were allowed to stretch horizontally and vertically, representing cooperativity
over larger distances as well as increased cooperative ability. In each case, two cooperativity
functions were ﬁt, one representing heterotypic cooperativity (interactions between diﬀerent
TFs, e.g., Dorsal and Twist) and one representing homotypic cooperativity (interactions
between the same TFs, e.g. Dorsal binding to two diﬀerent sites on an enhancer).
We also ﬁt diﬀerent cooperativity parameters (C) to construct step-functions, grouping
cooperativity parameters in diﬀerent ‘bins’ of distances. This again was done in two diﬀerent
ways: distances were considered either between binding sites for diﬀerent proteins, representing heterotypic cooperativity, or the same protein, representing homotypic cooperativity.
Each bin corresponds to a single cooperativity parameter, which is used as the value of the
step-function for the distances assigned to that bin (Figure 6.4B). In addition, we considered
the possibility of protein-dependent functions, using ‘binned’ distances in four diﬀerent ways:
distances were considered either between two Dorsal sites, two Twist sites, two Snail sites,
or one Dorsal and one Twist site.
Scheme 1: C1: 0-25 bp, C2: > 25 bp
Scheme 2: C1: 0-50 bp, C2: > 50 bp
Scheme 3: C1: 0-75 bp, C2: > 75 bp
Scheme 4: C1: 0-50 bp, C2: 51-100 bp, C3: 101-150 bp, C4: > 150 bp

141

Scheme 5: C1: 0-60 bp, C2: 61-120 bp, C3: 121-180 bp, C4: > 180 bp
Scheme 6: C1: 0-70 bp, C2: 71-140 bp, C3: 141-210 bp, C4: > 210 bp

6.1.3

Model Performance and Parameter Estimation

Model performance was measured using three diﬀerent criteria: root mean square error
(RMSE) of data and model predictions [26], CC (correlation coeﬃcient) between data and
model predictions [38], and AIC (Akaike Information criterion) which was used to compare
performance of diﬀerent models while compensating for the diﬀerent number of parameters
in models [18]. RMSE, CC and AIC are deﬁned as:

1 n
RMSE =

n
CC =
n

n
j=1

n

yp (j) − ye (j)
j=1

n
j=1 yp (j) ye (j) −

yp (j)

2

−

2

2
n
j=1 yp (j)

n
j=1 yp (j)

n

n
j=1 ye (j)

2
n
j=1 (ye (j))

−

2
n
j=1 ye (j)

AIC = 2nln (RMSE) + 2 (k + 1)
where n is the number of data points, yp and ye are the model’s prediction and experimental
data respectively, and k is the number of model parameters.
The global parameter estimation strategy (CMAES), which was discussed in Chapter 5,
was applied to estimate the parameters [36]. Root mean square error was used as a measure
of model performance. All possible pair-wise combinations of cooperativity and quenching
functions described above were considered, resulting in 120 total model schemes. Due to the
stochastic nature of the starting point of simulations in CMAES, estimations were run ﬁve

142

Figure 6.5: Model performance for 120 model schemes. Each box represents the RMSE for
a single model scheme, with the row representing the cooperativity function and the column
representing the quenching function used in this scheme.
times each, and the parameter set resulting in the lowest RMSE was used in analyzing each
model’s performance.

6.1.4

Preliminary Results

To systematically test various hypotheses, we ﬁt parameters to all of our 120 model schemes
and ranked schemes according to their performance on the whole dataset, RMSE. The performance of all 120 model schemes is shown in Figure 6.5. Note that most schemes resulted
in an RMSE∈ [0.1, 0.11]. Although we observed little variation in RMSE, we still were able

143

Figure 6.6: Diﬀerent Measures of Model Performance for 18 model schemes. Each box
represents the ranking for a single model scheme, with the row representing the model
scheme and the column representing the ﬁtness measure used for ranking.
to rank the model schemes according to this measure of ﬁtness.
To further investigate this, we reduced the number of model schemes to analyze down to
18: the one worst, eight average, and nine best performing model schemes with respect to
RMSE. To test whether the small variation between model performance holds when other
measures of model ﬁtness are used, we calculated the CC and AIC for these 18 model schemes
as well (Figure 6.6). We found that although the ranking varies slightly when we change the
measure of model ﬁtness, models remain in the same grouping of worst, average, and best
performing models. We can thus conclude that these groups represent the best, average, and
144

Figure 6.7: Construct Performance for 18 model schemes. Each box represents the RMSE
for a single model scheme on a single construct, with the row representing the model scheme
and the column representing the construct.
worst performing model schemes.
Recalling that the RMSE is a measure of overall ﬁt, over all data points obtained from
38 constructs, one may ask the question of how well each model scheme ﬁts individual
constructs. To address this question, we calculated individual RMSEs for each of the 38
constructs and compared these for our 18 model schemes (Figure 6.7). We found that there
are approximately ﬁve constructs in which most model schemes are ﬁtting poorly, while the
remaining constructs seem to be ﬁt relatively consistently compared to other construct ﬁts
for a single model scheme. This could be due to the deﬁnition of binding sites that we are
145

using, a high level of uncertainty in the experimental data corresponding to these poorly
ﬁt constructs, or this could reﬂect the limits of this modeling approach and the fact that
certain important transcriptional features, such as chromatin modiﬁcations and nucleosome
positioning, have been ignored. We are currently exploring the ﬁrst of these possibilities by
implementing the model with new deﬁnitions of binding sites, using diﬀerent TF binding
preferences (PWMs) derived from various diﬀerent experimental data sets [12, 65, 66, 56,
105].

6.2

Understanding the Dynamic Eﬀects of Redundant
Enhancers

Redundant, also termed ‘shadow’, enhancers located at varying distances form the transcriptional start site have been found to contribute to the expression of multiple genes playing
critical roles during development [101, 40]. These enhancers have been shown to have an
important contribution to the robustness in gene expression in response to environmental or
genetic perturbations [71].
It has been reported that many Dorsal-Ventral patterning genes have multiple binding
clusters, suggesting that redundant enhancers may have an extremely important role in the
expression dynamics of these genes during development [101, 40]. Two particular genes that
have been tested and shown to contain these redundant enhancers are twi and sna [46, 71].
For this reason, we believe that including these enhancers in any dynamic model of gene
regulation is an essential step in improving our understanding of the dynamics of GRN’s
such as that containing twi, sna, and rho.

146

6.2.1

The Addition of Redundant Enhancers to Dynamic Models
of Transcriptional Regulation

Incorporating more than one enhancer into the dynamic model of transcriptional regulation
described in Chapter 3 involves a modiﬁcation to the transcriptional term, Sa , for all a ∈ M R
such that a represents a gene with multiple known enhancers. Recall from Chapter 2 that
Sa,i is the fraction of successful states of gene a in nucleus i. Thus, to consider multiple
enhancers, E1 , . . . , Es , we assume that the rate of transcription is proportional to the fraction
of combined successful states, where a ‘combined successful state’ is deﬁned as any combined
enhancer state with at least one enhancer in a successful state. Assuming that each enhancer
acts independently, by the inclusion-exclusion principle, also to referred as the SchuetteNesbitt formula in probability theory, this implies that:


s

Saα ,i −

Sa,i = χa 
α=1

Saα ,i Saβ ,i
1≤α≤β≤s



(6.1)

Saα ,i Saβ ,i Saγ ,i − · · · + (−1)s−1 Sa1 ,i Sa2 ,i . . . Sas ,i 

+
1≤α≤β≤γ≤s

Another way to consider incorporating multiple distant enhancer regions is to treat them
them all as one single enhancer, with a very large distance between binding sites. In this case,
a reasonable biological assumption to make is that this large distance physically prevents
interactions from occurring between bound proteins on these diﬀerent enhancer regions.
Note that this is equivalent to the assumption that each enhancer region acts independently.
Thus, calculating the fraction of successful states of this very large enhancer will result in
exactly the same formula as in equation 6.1.

147

Using the above formula, incorporating redundant enhancers into the model described
in Chapter 3, is not only possible, but computationally feasible as well. The diﬃculty in
the addition of redundant enhancers is only in the amount of knowledge now needed for
each gene. One must know where these enhancers are located in the genome in order to
determine what DNA sequence(s) and/or binding site(s) should be included when running
the model. For the network including twi, sna, and rho, a great deal of work has been done
by the Levine lab to determine whether such regions exist and where they are located within
the genome [101, 40, 46, 71]. My current research on this project includes incorporating
redundant enhancers when ﬁtting the model to the experimental data and testing whether the
derivation in equation 6.1 is suﬃcient to improve our predictions or whether the assumption
that these enhancers act independently has oversimpliﬁed the complex nature of these genes.

148

Chapter 7
Closing Remarks
Thermodynamic-based models provide an eﬀective way to link the structure of cis-regulatory
elements to expression levels. The current interest in this ﬁeld suggests that the demand
for thermodynamic-based models will grow as biologists create more genome-level data and
strive to achieve a quantitative systems-level understanding of the molecular physiology.
Unfortunately, there are still many challenges to this modeling approach that should be
addressed.
Challenge 1: There are many aspects of thermodynamic-based models that can be improved
upon with the addition of distinct types of data that reﬂect processes important
to transcriptional regulation. These include chromatin remodeling and nucleosome positioning in eukaryotes, and the presence of transcriptional cofactors in
both eukaryotes and prokaryotes. For example, chromosome remodeling and
nucleosome positioning can eﬀect transcription in multiple ways, such as reducing the aﬃnity of TF binding sites or competing with TFs for their binding sites
[38, 86]. These nuances can easily be added to thermodynamic-based models

149

by adding new binding conﬁgurations or modifying existing states statistical
weights [86].
Challenge 2: Another outstanding challenge is in the deﬁnition of functional binding sites.
Current bioinformatic techniques produce many false positives and negatives.
An increasing number of whole-genome ChIP-chip studies have been published
recently [101]; more comprehensive genome-wide studies to determine the binding sites of TFs are needed to close this knowledge gap [64, 102, 44]. It is
diﬃcult to predict where regulatory sequences lie, therefore even with good
knowledge of TF binding sites, knowing which regions of DNA to incorporate
in models is also a large hurdle which is something that needs further exploration in the future.
Challenge 3: Current model assumptions of thermodynamic-based equilibrium, the types of
interactions between bound TFs, the correlation between RNAP binding to
the promoter and gene expression, and the presumed similar activities of TFs
on disparate enhancers, should all be validated experimentally or modiﬁed
accordingly.
Challenge 4: Lastly, adding temporal dynamics by introducing time-dependence can be extremely beneﬁcial when studying genes involved in dynamic processes such as
development. However, there is still a lack of quantitative data over speciﬁc
time frames. New experimental techniques, such as ﬂuorescent protein tagging
combined with live imaging, are currently being developed to provide us with
real-time images of gene expression in living organisms [55]. These new data
sets will be crucial in accurately ﬁtting diﬀerential equation models of GRNs
150

consisting of genes with highly dynamic expression.
For a molecular understanding of discrete transcriptional systems, thermodynamic-based
modeling oﬀers the most promising approach to deciphering gene regulation at the DNA
level. Incorporating such models into systems of diﬀerential equations provides the additional ability to model the dynamics of a developing organism. This has been my major
contribution to the ﬁeld of eukaryotic gene regulation. The model I have developed and
implemented has made possible the in-depth exploration of gene regulatory networks and
the cis-regulatory modules underlying the precise expression patterns that arise as a result
of them. The analysis I have done on others’ modeling eﬀorts has contributed to our understand of thermodynamic-based models and allowed us to reach this point. The topics listed
above represent outstanding challenges for researchers in the area of quantitative gene regulation and future eﬀorts should be able to guide us to even more sophisticated and powerful
models.

151

APPENDICES

152

Appendix A
Data Processing

A.1

Introduction

When ﬁtting a mathematical model, regardless of the ﬁeld in which the model is being
applied to, the quality of the data used is of tremendous importance. For models of transcriptional regulation, both spatial and temporal protein (TF) and mRNA concentrations
need to be considered. To obtain the most accurate quantitative measurements of these
concentrations, one needs to have a thorough understanding of where the data is coming
from and what procedures can be implemented to remove any background or noise from the
data set. Using data from confocal images in modeling gene expression levels presents the
researcher ﬁrst with the challenge of translating the data from stacks of eight bit images to
spatial information on actual levels of mRNA and protein. For thermodynamic-based modeling studies, such as Fakhouri et al. 2010 and Sayal et al. 2012, large sets of data obtained
from a confocal microscope are used to determine a single set of model parameters. With
each set of parameter values, the model itself outputs relative expression levels. For this

153

reason, noise and background are not the only aspects of the data to be considered during
processing, but normalization of data sets is also a crucial consideration. The importance
of the resulting data quality will be discussed in more detail in Chapter 4, where parameter
estimation strategies are explored.
Throughout the last twenty years, various studies have been performed on Drosophila
using qualitative blue-white images, resulting in binary data sets [67, 95, 46, 80, 85, 38].
These data sets do not require much data processing since thresholding alone can give data
values of 0 or 1. These binary data sets have been used for many years and are easily
interpreted as a gene being (1) “on” or (0) “oﬀ”. The major drawback to using such data
sets is that as enhancer regions become more complex, with more binding sites and more
interactions taking place, a gene may not be “on” or “oﬀ”; genes have been shown to have
varying levels of “on”. For mathematical models to capture the subtle interactions occurring
at the DNA level that result in genes producing mRNA concentrations at varying levels,
they need to be ﬁt to varying levels of gene output. Clearly, qualitative data resulting in
binary, or discrete, measurements would not suﬃce for model ﬁtting.
With new experimental techniques, quantitative confocal imaging has allowed researchers
to obtain detailed measurements of relative mRNA and protein concentrations [105, 43, 45,
26]. Before using these new data sets to extract biological information about the system of
interest, one needs to ﬁrst process the data to adjust for experimental variation. Without
processing, this variation could cause many problems in interpreting data, ie: if an image
has a high amount of background signal, the high values might lead one to believe that the
gene is more strongly expressed due to increased activation. The following gives a fairly
comprehensive overview of the data processing techniques used on the data sets of [26] and

154

[82].

A.2

Fakhouri et al. 2010 data set

Confocal images of Drosophila embryos are ﬁrst saved as stacks, which can easily be opened
and modiﬁed using the common biological software package, ImageJ. Each image within a
stack is a cross-section of the embryo. In order to obtain a single image from a stack, the
ﬁrst step in data processing is to project the images in the stack, using the “Max Z Project”
function in ImageJ. The projection is created using the maximum value over all images in
the stack at each pixel location. The processor can then save the image, as an eight bit Tiﬀ,
which can be imported into a program such as Matlab, for further processing.
These eight bit images then go through a series of processing steps, all of which have
been automated in Matlab. (All Matlab codes described in this section were written by
Ahmet Ay, Evan Dayringer, and Jacqueline M Dresch.) The ﬁrst of these steps is an outlier
removal process. Outliers in an image are those regions of the image with spots of high
intensity where no actual gene expression is observed. These are typically simple artifacts of
experimental error, and thus need to be removed from the image before processing steps such
as background calculation, smoothing, and normalization. After any outliers are removed,
images are rotated into the proper orientation. The orientation used is a standard orientation
so that each embryo is viewed with the anterior to the left, posterior to the right, dorsal to
the top, and ventral to the bottom.
Once images are in the proper orientation and any non-speciﬁc signal has been removed,
images are background subtracted. The purpose of background subtraction is to calculate
the average intensity of the image in areas where no gene expression is observed, areas of
155

nonspeciﬁc ﬂuorescence, and then reset this background signal to 0. In most scenarios,
where background is assumed uniform throughout the embryo, this is done by subtracting
this average background intensity from the entire image. In the case of certain proteins,
such as the Giant protein which was used in the Fakhouri et al. 2010 study, it has been
observed that the background signal is parabolic. The origin of this parabolic shape is not
well known, although it is most likely due to light scattering during imaging and has been
shown to correlate with the degree to which the embryo has been ﬂattened prior to imaging
[9]. For this case, background subtraction is performed by ﬁtting a parabola to a DorsalVentral strip of the background region and then subtracting that parabola from the entire
embryo, along the Anterior-Posterior axis. The last two steps of data processing include
smoothing the image and normalizing intensity values. Smoothing is done to remove any
discrete artifacts of the image itself and normalization is done so that all intensity values are
mapped to a common interval, [0,1]. The normalization of a set of images gives us the ability
to compare images without imaging bias. It is a known fact that embryos can be stained
slightly lighter or darker regardless of the mRNA or protein concentrations due to technical
variation. Normalization allows us to minimize this variation between images and compare
images side by side with only the biological variation from embryo to embryo remaining.

A.3

Sayal et al. 2012 data set

For the data set of [82], slight modiﬁcations were made to the process used in [26]. Most
importantly, the images were much cleaner, so no outlier removal step was needed. The
background subtraction as well as normalization steps were also done in a more intricate
manner, after some very insightful observations. Smoothing of the image was not done,
156

although average signals over ten samples were taken from each embryo. Another important
addition to this processing pipeline was a step marking the rotation of each embryo, to ensure
proper comparison of expression patterns from embryo to embryo. The information that we
need to obtain from this data set is the gene expression pattern from a strip taken toward
the Anterior-Posterior center of the embryo along the Ventral-Dorsal axis. The details of
these modiﬁcations follow.
The ﬁrst steps in processing, other than outlier removal, follow exactly as those used in
[26]. The images are each projected in ImageJ and rotated to the standard orientation.
For the background subtraction step, we ﬁrst decided to check whether the background
appeared constant or parabolic for a single strip of pixels taken along the Ventral-Dorsal
axis. This was done by analyzing a set of images lacking the transgene of interest. These
images, referred to as yellow-white images, were stained and imaged using the exact same
protocol as all other images, although no lacZ reporter gene was present, so the resulting
images contained no gene expression; their intensity values corresponded to only background
staining. The yellow-white images showed a clear parabolic-shaped background when ten
samples were taken uniformly from the region 40-60% Anterior-Posterior and plotted from
Dorsal to Ventral (Figure A.1).
Due to this parabolic background shape, all images in the data set have their background
curve ﬁt in the same way, using data from the region 10-40% Dorsal-Ventral, where the
gene is not expressed. This quadratic is then subtracted from the entire embryo, along the
Anterior-Posterior axis.
Since all constructs should decrease expression intensity or express wild-type levels in
Ventral regions of the embryo, for normalization of embryo intensities, we decided to nor-

157

Figure A.1: Six representative yellow-white images with quadratics ﬁt to their intensity
values. The blue lines represent ten lines taken uniformly from the region 40-60% AnteriorPosterior and plotted from Dorsal to Ventral. The red line represents the quadratic ﬁt to
the average of these ten lines from 10-40% Dorsal-Ventral. This region was chosen due to
the fact that there should be no expression in this region regardless of whether the embryos
contain the transgene, as in our core data set, or not, as in these yellow-white images.
158

Figure A.2: Variation in Signal Intensity on diﬀerent imaging days. In panels A and B, each
circle represents a single embryo’s mean intensity value. All embryos represented in panel
A came from a single construct, which is a double activator knock-out, stained on the same
day and then imaged on three diﬀerent days. All embryos represented in panel B came from
a diﬀerent single construct, which is a triple repressor knock-out, again stained on the same
day and then imaged on three diﬀerent days. In both cases, there is a great deal of variation
in signal intensity. In panels C and D, these mean values were averaged over all embryos
stained and imaged on the same day. We observed a well-deﬁned decreasing trend in mean
intensity as the number of days between staining and imaging increased.

159

malize each embryo with respect to the average peak wild-type signal. As we began to look
at the variation in wild-type signals, we noted a trend in signal intensity from one imaging
day to another. This trend is shown in Figure A.2.
Due to this variation in intensities from one imaging day to another, on each day of
imaging at least two wild-type images are stained and imaged in parallel with other constructs. Then, each image is normalized to the average peak ( > 95%) wild-type signal
on that image’s particular imaging date. This normalization procedure allows for images
to be compared for a single construct imaged on multiple days, as well as to compare one
construct’s intensity to another’s.
Once normalization is complete, the image manipulation steps are concluded. The ﬁnal
step in image processing is then data extraction. Since the data set of interests contains
hundreds of embryos, all in slightly diﬀerent orientation, we had to be very careful in aligning
and comparing data extracted from diﬀerent embryos. This alignment was done by marking
the rotation of each embryo, using the snail protein signal. Snail protein is known for it’s very
sharp and precise border, forming a boundary between the mesoderm and neuroectoderm
[42].

160

BIBLIOGRAPHY

161

BIBLIOGRAPHY
[1] GK Ackers, AD Johnson, and MA Shea. Quantitative model for gene regulation by λ
phage repressor. PNAS, 79:1129–1133, 1982.
[2] CC Adams and JL Workman. Binding of disparate transcriptional activators to nucleosomal dna is inherently cooperative. Molecular and Cellular Biology, 15(3):1405–1421,
1995.
[3] R Albert and HG Othmer. The topology of the regulatory interactions predicts the
expression pattern of the segment polarity genes in Drosophila melanogaster. Journal
of Theoretical Biology, 223:1–18, 2003.
[4] DN Arnosti, S Gray, S Barolo, J Zhou, and M Levine. The gap protein knirps mediates both quenching and direct repression in the Drosophila embryo. EMBO Journal,
15(14):3659–3666, 1996.
[5] UM Ascher and LR Petzold. Computer Methods for Ordinary Diﬀerential Equations
and Diﬀerential-Algebraic Equations. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 1998. Theorem 3.1.
[6] M Ashyraliyev, J Jaeger, and JG Blom. Parameter estimation and determinability
analysis applied to Drosophila gap gene circuits. BMC Systems Biology, 2:83, 2008.
[7] M Ashyraliyev, K Siggens, H Janssens, JG Blom, M Akam, and J Jaeger. Gene
circuit analysis of the terminal gap gene huckebein. PLoS Computational Biology,
5(10):e1000696, 2009.
[8] A Ay and DN Arnosti. Mathematical modeling of gene expression: a guide for the perplexed biologist. Critical Reviews in Biochemistry and Molecular Biology, 46(2):137–
151, 2011.

162

[9] A Ay, WD Fakhouri, C Chiu, and DN Arnosti. Image processing and analysis for
quantifying gene expression from early Drosophila embryos. Tnumber Engineering,
14(9), 2008.
[10] A Babloyantz and M Sanglier. Chemical instabilities of “all-or-none” type in betagalactosidase induction and active transport. FEBS Letters, 23:364–366, 1972.
[11] TL Bailey and M Gribskov. Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14(1):48–54, 1998.
[12] CM Bergman, JW Carlson, and SE Celniker. Drosophila dnase i footprint database:
a systematic genome annotation of transcription factor binding sites in the fruitﬂy,
Drosophila melanogaster. Bioinformatics, 21(8):1747–1749, 2005.
[13] J Bieler, C Pozzorini, and F Naef. Whole-embryo modeling of early segmentation
in Drosophila identiﬁes robust and fragile expression domains. Biophysical Journal,
101(2):287–296, 2011.
[14] MD Biggin. Animal transcription networks as highly connected, quantitative continua.
Developmental Cell, pages 611–626, 2011.
[15] L Bintu, NE Buchler, HG Garcia, U Gerland, T Hwa, J Kondev, T Kuhlman, and
R Phillips. Transcriptional regulation by the numbers: applications. Current Opinion
in Genetics and Development, 15:125–135, 2005.
[16] DM Bortz and PW Nelson. Sensitivity analysis of a nonlinear lumped parameter model
of hiv infection dynamics. Bulletin of Mathematical Biology, 66:1009–1026, 2004.
[17] NE Buchler, U Gerland, and Hwa T. On schemes of combinatorial transcription logic.
PNAS, 100(9):5136–5141, 2003.
[18] KP Burnham and DR Anderson. Model selection and inference:
information-theoretic approach. Springer-Verlag, 1998.

a practical

[19] J Clune, HJ Goldsby, C Ofria, and RT Pennock. Selective pressures for accurate altruism targeting: evidence from digital evolution for diﬃcult-to-test aspects of inclusive
ﬁtness theory. Proceedings of the Royal Society B: Biological Sciences, 278(1706):666–
674, 2011.

163

[20] M Coppey, AM Berezhkovskii, Y Kim, AN Boettiger, and SY Shvartsman. Modeling the bicoid gradient: diﬀusion and reversible nuclear trapping of a stable protein.
Developmental Biology, 312:623–630, 2007.
[21] J Crocker, Y Tamori, and A Erives. Evolution acts on enhancer organization to ﬁnetune gradient threshold readouts. PLoS Biology, 6:e263, 2008.
[22] RI Cukier, CM Fortuin, KE Shuler, AG Petschek, and JH Schaibly. Study of the
sensitivity of coupled reaction systems to uncertainties in rate coeﬃcients. Journal of
Chemical Physics, 59:3873–3878, 1973.
[23] P Deuﬂhard and F Bornemann. Scientiﬁc Computing with Ordinary Diﬀerential Equations. Springer, 2002. page 124 and 3.2.
[24] JM Dresch, X Liu, DN Arnosti, and A Ay. Thermodynamic modeling of transcription: sensitivity analysis diﬀerentiates biological mechanism from mathematical modelinduced eﬀects. BMC Systems Biology, 4:142, 2010.
[25] RS Erb and GS Michaels. Sensitivity of biological models to errors in parameter
estimates. Paciﬁc Symposium on Biocomputing, 4:53–64, 1999.
[26] WD Fakhouri, A Ay, R Sayal, J Dresch, E Dayringer, and DN Arnosti. Deciphering
a transcriptional regulatory code: modeling short-range repression in the Drosophila
embryo. Molecular Systems Biology, 6:341, 2010.
[27] EG Fernando and DR Luhr. Sensitivity analysis of predicted pavement performance.
Technology Transfer and Training, 1200:32–41, 1988.
[28] Y Fomekong-Nanfack, M Postma, and JA Kaandorp. Inferring Drosophila gap gene
regulatory network: a parameter sensitivity and perturbation analysis. BMC Systems
Biology, 3:94, 2009.
[29] HC Frey and SR Patil. Identiﬁcation and review of sensitivity analysis methods. Risk
Analysis, 22:553–578, 2002.
[30] J Gertz, ED Siggia, and BA Cohen. Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature, 457:215–218, 2009.
[31] S Gray and M Levine. Short-range transcriptional repressors mediate both quenching
and direct repression within complex loci in Drosophila. Genes and Development,
10:700–710, 1996.
164

[32] T Gregor, W Bialek, RR de Ruyter van Steveninck, DW Tank, and EF Wieschaus.
Diﬀusion and scaling during early embryonic pattern formation. PNAS, 102(51):18403–
18407, 2005.
[33] JA Griﬃth. Mathematics of cellular control processes. Journal of Theoretical Biology,
20:202–216, 1968.
[34] RN Gutenkunst, JJ Waterfall, FP Casey, KS Brown, CR Myers, and JP Sethna. Universally sloppy parameter sensitivities in systems biology models. PLoS Computational
Biology, 3:e189, 2007.
[35] SD Hanes, G Riddihough, D Ish-Horowicz, and R Brent. Speciﬁc dna recognition and
intersite spacing are critical for action of the bicoid morphogen. Molecular and Cellular
Biology, 14(5):3364–3375, 1994.
[36] N Hansen and A Ostermeier. Convergence properties of evolution strategies with
the derandomized covariance matrix adaptation: The (µ/µI , λ)-ES. EUFIT’97, 5th
Europen Congress on Intelligent Techniques and Soft Computing, Proceedings, pages
650–654, 1997.
[37] Q He, AF Bardet, B Patton, J Purvis, J Johnston, A Paulson, M Gogol, A Stark,
and J Zeitlinger. High conservation of transcription factor binding and evidence for
combinatorial regulation across six Drosophila species. Nature Genetics, 21:414–421,
2011.
[38] X He, MA Samee, C Blatti, and S Sinha. Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding
and short-range repression. PLoS Computational Biology, 6(9):e1000935, 2010.
[39] T Hohm and E Zitzler. Modeling the shoot apical meristem in A. thaliana: Parameter
estimation for spatial pattern formation. In E Marchiori, JH Moore, and JC Rajapakse,
editors, Lecture Notes in Computer Science, volume 4447, pages 102–113. Springer,
2007.
[40] JW Hong, DA Hendrix, and MS Levine. Shadow enhancers as a source of evolutionary
novelty. Science, 321:1341, 2008.
[41] B Ingalls. Sensitivity analysis: from model parameters to system behaviour. Essays
in Biochemistry, 45:177–193, 2008.

165

[42] YT Ip, RE Park, D Kosman, K Yazdanbakhsh, and M Levine. dorsal-twist interactions
establish snail expression in the presumptive mesoderm of the Drosophila embryo.
Genes and Development, 6:1518–1530, 1992.
[43] J Jaeger, S Surkova, M Blagov, H Janssens, D Kosman, KN Kozlov, Manu, E Myasnikova, CE Vanario-Alonso, M Samsonova, D Sharp, and J Reinitz. Dynamic control
of positional information in the early Drosophila embryo. Nature, 430:368–371, 2004.
[44] SA Jaeger, ET Chan, MF Berger, R Stottmann, TR Hughes, and ML Bulyk. Conservation and regulatory associations of a wide aﬃnity range of mouse transcription
factor binding sites. Genomics, 95(4):185–195, 2010.
[45] H Janssens, S Hou, J Jaeger, A Kim, E Myasnikova, D Sharp, and J Reinitz. Quantitative and predictive model of transcriptional control of the Drosophila melanogaster
even skipped gene. Nature Genetics, 38:1159–1165, 2006.
[46] J Jiang and M Levine. Binding aﬃnities and cooperative interactions with bhlh activators delimit threshold responses to the dorsal gradient morphogen. Cell, 72:741–752,
1993.
[47] L Jostins and J Jaeger. Reverse engineering a gene network using an asynchronous
parallel evolution strategy. BMC Systems Biology, 4:17, 2010.
[48] M Kazemian, Q Zhu, MS Halfon, and S Sinha. Improved accuracy of supervised
crm discovery with interpolated markov models and cross-species comparison. Nucleic
Acids Research, 39(22):9463–9472, 2011.
[49] S Kern, SD Muller, N Hansen, D Buche, J Ocenasek, and P Koumoutsakos. Learning
probability distributions in continuous evolutionary algorithms. Natural Computing,
3(1):77–112, 2004.
[50] MM Kulkarni and DN Arnosti. cis-regulatory logic of short-range transcriptional
repression in Drosophila melanogaster. Molecular Cell Biology, 25(9):3411–3420, 2005.
[51] G Li, C Rosenthal, and H Rabitz. High dimensional model representations. Journal
of Physical Chemistry, 105:7765–7777, 2001.
[52] LM Li and DN Arnosti. Long- and short-range transcriptional repressors induce distinct chromatin states on repressed genes. Current Biology, 21(5):1–7, 2011.

166

[53] X Liang and J Guo. Intercomparison of land-surface parameterization schemes: Sensitivity of surface evergy and water ﬂuxes to model parameters. Journal of Hydrology,
279:182–209, 2003.
[54] G Lillacci and M Khammash. Parameter estimation and model selection in computational biology. PLoS Computational Biology, 6:e1000696, 2010.
[55] MZ Ludwig, Manu, R Kittler, KP White, and M Kreitman. Consequences of eukaryotic
enhancer architecture for gene expression dynamics, development, and ﬁtness. PLoS
Genetics, 7(11):e1002364, 2011.
[56] S MacArthur, XY Li, J Li, JB Brown, HC Chu, L Zeng, BP Grondona, A Hechmer,
L Simirenko, SVE Kernen, DW Knowles, M Stapleton, P Bickel, MD Biggin, and
MB Eisen. Developmental roles of 21 Drosophila transcription factors are determined
by quantitative diﬀerences in binding to an overlapping set of thousands of genomic
regions. Genome Biology, 10:R80, 2009.
[57] K Madsen, HB Nielsen, and O Tingleﬀ. Methods for Non-Linear Least Squares Problems. Informatics and Mathematical Modeling, Technical University of Denmark, 2004.
[58] S Marino, IB Hogue, CJ Ray, and DE Kirschner. A methodology for performing global
uncertainty and sensitivity analysis in systems biology. Journal of Theoretical Biology,
254:178–196, 2008.
[59] P Mendes and D Kell. Non-linear optimization of biochemical pathways: applications
to metabolic engineering and parameter estimation. Bioinformatics, 14(10):869–883,
1998.
[60] CG Moles, P Mendes, and JR Banga. Parameter estimation in biochemical pathways:
A comparison of global optimization methods. Genome Research, 13:2467–2474, 2003.
[61] SD Muller, J Marchetto, S Airaghi, and P Koumoutsakos. Optimization based on
bacterial chemotaxis. IEEE Transactions on Evolutionary Computation, 6(1):16–29,
2002.
[62] G Nicolis and I Prigogine. Self-Organization in Nonequilibrium Systems: From Dissipative Structures to Order through Fluctuations. Wiley, 1977.
[63] J Nocedal and SJ Wright. Numerical Optimization. Springer, 1999.

167

[64] MB Noyes, RG Christensen, A Wakabayashi, GD Stormo, MH Brodsky, and SA Wolfe.
Analysis of homeodomain speciﬁcities allows the family-wide prediction of preferred
recognition sites. Cell, 133:1277–1289, 2008.
[65] MB Noyes, X Meng, A Wakabayashi, S Sinha, MH Brodsky, and SA Wolfe. A systematic characterization of factors that regulate Drosophila segmentation via a bacterial
one-hybrid system. Nucleic Acids Research, 36(8):2547–2560, 2008.
[66] A Ozdemir, KI Fisher-Aylor, S Pepke, M Samanta, L Dunipace, K McCue, L Zeng,
N Ogawa, BJ Wold, and A Stathopoulos. High resolution mapping of twist to dna
in Drosophila embryos: Eﬃcient functional analysis and evolutionary conservation.
Genome Research, 21(4):566–577, 2011.
[67] D Pan, JD Huang, and AJ Courey. Functional analysis of the Drosophila twist promoter reveals a dorsal -binding ventral activator region. Genes and Development,
5:1892–1901, 1991.
[68] DJ Pannell. Sensitivity analysis of normative economic models: theoretical framework
and practical strategies. Agricultural Economics, 16:139–152, 1997.
[69] D Papatsenko and M Levine. The Drosophila gap gene network is composed of two
parallel toggle switches. PLoS One, 6(7):e21145, 2011.
[70] S Payankaulam, LM Li, and DN Arnosti. Transcriptional repression: Conserved and
evolved features. Current Biology, 20(17):R764–R771, 2010.
[71] MW Perry, AN Boettiger, JP Bothma, and M Levine. Shadow enhancers foster robustness of Drosophila gastrulation. Current Biology, 20:1562–1567, 2010.
[72] KB Petersen and MS Pedersen. The Matrix Cookbook. Informatics and Mathematical
Modeling, Technical University of Denmark, 2008. Section 10.4.
[73] GT Reeves and SE Fraser. Biological systems from an engineer’s point of view. PLoS
Biology, 7:32–35, 2009.
[74] GT Reeves and A Stathopoulos. Graded dorsal and diﬀerential gene regulation in
the Drosophila embryo. Cold Spring Harbor Perspectives in Biology, Generation and
Interpretation of Morphogen Gradients, 2009.
[75] J Reinitz, S Hou, and DH Sharp. Transcriptional control in Drosophila. ComPlexUs,
1:54–64, 2003.
168

[76] O Rubel, S Ahern, EW Bethel, MD Biggin, H Childs, E Cormier-Michel, A DePace,
MB Eisen, CC Fowlkes, CGR Geddes, H Hagen, B Hamann, MY Huangj, SVE Keranen, DW Knowles, CLL Hendriks, J Malikn, J Meredith, P Messmer, Prabhata,
D Ushizima, GH Weber, and K Wua. Coupling visualization and data analysis for
knowledge discovery from multi-dimensional scientiﬁc data. Procedia Computer Science Procedia Computer Science, 1:17571764, 2010.
[77] G Rudolph. Convergence of evolutionary algorithms in general search spaces. International Conference on Evolutionary Computation, pages 50–54, 1996.
[78] A Saltelli, S Tarantola, and KPS Chan. A quantitative model-independent method for
global sensitivity analysis of model output. Technometrics, 41:39–56, 1999.
[79] L Sanchez and D Thieﬀry. A logical analysis of the Drosophila gap-gene system.
Journal of Theoretical Biology, 211:115–141, 2001.
[80] T Sandmann, C Girardot, M Brehme, W Tongprasit, V Stolc, and EEM Furlong. A core transcriptional network for early mesoderm development in Drosophila
melanogaster. Genes and Development, 21:436–449, 2007.
[81] M Santillan and MC Mackey. Inﬂuence of catabolite repression and inducer exclusion
on the bistable behavior of the lac operon. Biophysical Journal, 86:1282–1292, 2004.
[82] R Sayal, JM Dresch, I Pushel, BR Taylor, and DN Arnosti. A quantitative approach
to understand the structural rules of enhancer function. Manuscript in Preparation,
2012.
[83] R Sayal, SM Ryu, and DN Arnosti. Optimization of reporter gene architecture for
quantitative measurements of gene expression in the Drosophila embryo. Fly, 43(5):47–
52, 2011.
[84] MD Schroeder, M Pearce, J Fak, H Fan, U Unnerstall, E Emberly, N Rajewsky,
ED Siggia, and U Gaul. Transcriptional control in the segmentation gene network
of drosophila. PLoS Biology, 2(9):e271, 2004.
[85] E Segal, T Raveh-Sadka, M Schroeder, U Unnerstall, and U Gaul. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature, 451:535–
540, 2008.
[86] E Segal and J Widom. From dna sequence to transcriptional behaviour: a quantitative
approach. Nature Reviews Genetics, 10:443–456, 2009.
169

[87] MA Shea and GK Ackers. The or control system of bacteriophage lambda. a physicalchemical model for gene regulation. Journal of Molecular Biology, 181:211–230, 1985.
[88] D Stanojevic, S Small, and M Levine. Regulation of a segmentation stripe by overlapping activators and repressors in the Drosophila embryo. Science, 254(5036):1385–1387,
1991.
[89] J Stoer and R Bulirsch. Introduction to Numerical Analysis. Springer, 3 edition, 2002.
Section 4.4.
[90] GD Stormo and DS Fields. Speciﬁcity, free energy and information content in proteindna interactions. Trends in Biochemical Sciences, 23(3):109–113, 1998.
[91] Y Suleimenov, A Ay, MA Samee, S Sinha, JM Dresch, and DN Arnosti. Global parameter estimation for thermodynamic models of transcriptional regulation. Manuscript
in Preparation, 2012.
[92] S Surkova, E Myasnikova, H Janssens, KN Kozlov, AA Samsonova, J Reinitz, and
M Samsonova. Pipeline for acquisition of quantitative data on segmentation gene
expression from confocal images. Fly, 2(2):58–66, 2008.
[93] P Szymanski and M Levine. Multiple modes of dorsal-bhlb transcriptional synergy in
the drosophila embryo. EMBO Journal, 14:2229–2238, 1995.
[94] Y Tang, P Reed, T Wagener, and K van Werkhoven. Comparing sensitivity analysis
methods to advance lumped watershed model identiﬁcation and evaluation. Hydrology
and Earth System Sciences, 3:3333–3395, 2006.
[95] C Thisse, F Perrin-Schmitt, C Stoetzel, and B Thisse. Sequence-speciﬁc transactivation
of the Drosophila twist gene by the dorsal gene product. Cell, 65:1191–1201, 1991.
[96] NAW van Riel. Dynamic modeling and analysis of biochemical networks: mechanismbased models and model-based experiments. Brieﬁngs in Bioinformatics, 7:364–374,
2006.
[97] G von Dassow, E Meir, EM Munro, and GM Odell. The segment polarity network is
a robust developmental module. Nature, 406:188–192, 2000.
[98] T Weise. Global Optimization Algorithms - Theory and Application. self-published,
http://www.it-weise.de/, 2 edition, 2009.

170

[99] S Wielgoss, JE Barrick, O Tenaillon, S Cruvellier, B Chane-Woon-Ming, C Mdigue,
RE Lenski, and D Schneider. Mutation rate inferred from synonymous substitutions
in a long-term evolution experiment with Escherichia coli. G3: Genes, Genomes,
Genetics, 1:183–186, 2011.
[100] CH Yuh, H Bolouri, and EH Davidson. Cis-regulatory logic in the endo16 gene: switching from a speciﬁcation to a diﬀerentiation mode of control. Development, 128:617–629,
2001.
[101] J Zeitlinger, RP Zinzen, A Stark, M Kellis, H Zhang, RA Young, and M Levine. Wholegenome chip-chip analysis of dorsal, twist, and snail suggests integration of diverse
patterning processes in the Drosophila embryo. Genes and Development, 21:385–390,
2007.
[102] C Zhu, KJRP Byers, RP McCord, Z Shi, MF Berger, DE Newburger, K Saulrieta,
S Smith, MV Shah, M Radhakrishnan, AA Philippakis, Y Hu, F DeMasi, M Pacek,
A Rolfs, T Murthy, J LaBaer, and ML Bulyk. High-resolution dna-binding speciﬁcity
analysis of yeast transcription factors. Genome Research, 19:556–566, 2009.
[103] T Ziehn and AS Tomlin. A global sensitivity study of sulfur chemistry in a premixed
methane ﬂame model using hdmr. International Journal of Chemical Kinetics, 40:742–
753, 2008.
[104] RP Zinzen, C Girardot, J Gagneur, M Braun, and EEM Furlong. Combinatorial
binding predicts spatio-temporal cis-regulatory activity. Nature, 462:65–70, 2009.
[105] RP Zinzen, K Senger, M Levine, and D Papatsenko. Computational models for neurogenic gene expression in the Drosophila embryo. Current Biology, 16:1358–1365,
2006.

171